Energy Aware Runtime (EAR) is a system-level tool used on SuperMUC-NG for optimisation of energy consumption.
It was created in the context of the Barcelona Supercomputing Center (BSC)/Lenovo cooperation project.
For details, see the EAR User Guide.
How it works
EAR regularly monitors the runtime behaviour of a job, taking instruction throughput, memory access behaviour, and power consumption into account. From this, it derives the best frequency setting according to a configured policy.
For MPI jobs (Intel MPI or Open MPI), EAR can hook into MPI functions to detect iterative computational phases of an application, allowing it to change frequency immediately when a phase with already known behaviour is entered. In this mode, EAR monitors its own overhead; if that is too high, it switches back to a mode that uses time-based behaviour monitoring. The latter is the default if MPI is not used.
Default EAR Configuration on SuperMUC-NG
By default, EAR's policy targets higher performance by using higher frequencies. The frequency drops to the base level of 2.3 GHz when higher frequencies do not result in an increase in performance, which is usually the case for memory-bound codes. In EAR terms, this policy is called "MIN_TIME_TO_SOLUTION".
Controlling EAR behaviour
EAR can render profiling or benchmark measurements difficult and unstable. In such cases, users can enforce a fixed base frequency of 2.3 GHz by switching EAR off in the job script.
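A minimal job-script sketch, assuming the EAR SLURM plugin's `--ear=off` flag is available on the system (the other directives are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=my_job   # hypothetical job name
#SBATCH --ear=off           # switch EAR off; CPUs run at the fixed 2.3 GHz base frequency

srun ./my_app               # hypothetical application launch
```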
Attention: switching EAR off for regular runs is not recommended, as it will likely slow down your jobs by preventing the use of higher CPU clock frequencies!
The above switches can also be passed as command-line arguments to salloc.
Getting EAR general debug information
If your application fails for no apparent reason, without your having made any changes to the code, consider enabling EAR's debugging information.
This information will be saved in the error file:
...
#SBATCH --error=<desired error file>
#SBATCH --ear-verbose=1
...
Crash right after application startup in Python-based codes
The cause of this problem is that the MPI symbols are not recognised. Therefore, please specify whether you are using an Intel MPI or an Open MPI version with one of these exports, respectively:
export SLURM_EAR_LOAD_MPI_VERSION="intel"
export SLURM_EAR_LOAD_MPI_VERSION="open mpi"
Note: when combining a Python MPI application and a regular MPI application (i.e., C/C++/Fortran, no Python) in the same batch script, please unset this variable for the regular MPI application while using EAR; otherwise your application may crash.
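A sketch of such a combined batch script, assuming Intel MPI; the application names (`pyapp.py`, `capp`) and resource request are hypothetical:

```shell
#!/bin/bash
#SBATCH --nodes=2    # hypothetical resource request

# Python MPI step: tell EAR which MPI flavour to load
export SLURM_EAR_LOAD_MPI_VERSION="intel"
srun python ./pyapp.py

# Regular (C/C++/Fortran) MPI step: unset the variable again,
# otherwise the application may crash under EAR
unset SLURM_EAR_LOAD_MPI_VERSION
srun ./capp
```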
Another option is to switch EAR off for the entire batch script, as described above.
Crash in scripts using Anaconda / Miniconda
Since the Intel conda channel provides MPICH, it is necessary to disable EAR completely when running jobs with an Anaconda or Miniconda setup. In this case, please switch EAR off as described above.
Trouble with shared libraries
With the current setup, there may be trouble when switching modules; an error message like
/usr/bin/tclsh: error while loading shared libraries: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory
appears. To work around this, there are two options:
- switch EAR off as described above.
- temporarily unset the LD_PRELOAD variable before making changes to the environment, and restore its original value just before running mpiexec.
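The second option can be sketched as follows in a job script; the module names and the application are placeholders:

```shell
# Save the preload setting installed by EAR, then clear it so that
# module commands no longer fail on EAR's preloaded libraries
SAVED_LD_PRELOAD="${LD_PRELOAD:-}"
unset LD_PRELOAD

module unload mpi.intel        # hypothetical module changes
module load mpi.intel/2019

# Restore EAR's preload just before launching the application
export LD_PRELOAD="$SAVED_LD_PRELOAD"
mpiexec ./my_app               # hypothetical application
```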
EAR is developed by Lenovo under an open-source licence. Please contact LRZ if you are interested in collaborating on the energy efficiency of HPC systems.