
Energy Aware Runtime (EAR) is a system-level tool used on SuperMUC-NG to optimise energy consumption.

It was created in the context of the Barcelona Supercomputing Center (BSC)/Lenovo cooperation project.

Details: User Guide

How it works

EAR regularly monitors the runtime behaviour of a job, taking instruction throughput, memory access behaviour, and power consumption into account. From these data, it derives the best frequency setting according to the configured policy.

For MPI jobs (Intel MPI or Open MPI), EAR can hook into MPI functions to detect iterative computational phases of an application, allowing it to change the frequency immediately when a phase with already known behaviour is entered. In this mode, EAR monitors its own overhead; if that overhead becomes too high, it switches back to a mode that uses time-based behaviour monitoring. The latter is also the default if MPI is not used.

Default EAR Configuration on SuperMUC-NG

By default, EAR's policy targets high performance by using higher CPU frequencies. The frequency drops to the base level of 2.3 GHz when higher frequencies do not yield a performance increase, which is typically the case for memory-bound codes. In EAR terms, this policy is called "MIN_TIME_TO_SOLUTION".
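If permitted by the site configuration, the EAR SLURM plugin also provides a flag for selecting a different policy. The flag name and the example value below are taken from the general EAR documentation and should be treated as assumptions for SuperMUC-NG; check with LRZ before relying on them:

```shell
# Hypothetical: request the energy-saving policy instead of the default
# (flag and value per upstream EAR docs; site availability not guaranteed)
#SBATCH --ear-policy=min_energy
```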

Controlling EAR behaviour

EAR can render profiling or benchmark measurements difficult and unstable. In such cases, users can enforce a fixed base frequency of 2.3 GHz by switching EAR off with the following line in the job script:

Code Block
#SBATCH --ear=off

(warning) Attention: switching EAR off for regular runs is not recommended, as it will likely slow down your jobs by preventing the use of higher CPU clock frequencies!


It is also possible to pass the above switches as command-line arguments to salloc.
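For example, an interactive allocation with EAR disabled could be requested like this (the partition name and node count are placeholders, not SuperMUC-NG specifics):

```shell
# Interactive allocation with EAR switched off;
# partition and node count are placeholders, adjust to your project
salloc --partition=test --nodes=1 --ear=off
```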

Troubleshooting

Getting general EAR debug information

If your application, without having made any changes to the code, fails for no apparent reason, consider enabling the EAR debugging information.

This information will be saved in the error file:

Code Block
...
#SBATCH --error=<desired error file> 
#SBATCH --ear-verbose=1
...

Crash right after application startup in Python-based codes

The cause of this problem is that the MPI symbols are not recognized. Therefore, please specify whether you are using Intel MPI or Open MPI with the respective export:

Code Block
export SLURM_EAR_LOAD_MPI_VERSION="intel"

export SLURM_EAR_LOAD_MPI_VERSION="open mpi"

(warning) Note: when combining a Python MPI application and a regular MPI application (i.e. C/C++/Fortran, no Python) in the same batch script, please unset this variable for the regular MPI application while using EAR; otherwise your application may crash:

Code Block
unset SLURM_EAR_LOAD_MPI_VERSION
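Putting both pieces together, a batch script that runs a Python MPI step followed by a native MPI step could be sketched as follows (application names and srun invocations are hypothetical):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ear-verbose=1

# Python-based MPI step: tell EAR which MPI implementation is loaded
export SLURM_EAR_LOAD_MPI_VERSION="intel"
srun python ./solver.py        # hypothetical Python MPI application

# Regular C/C++/Fortran MPI step: the variable must be unset again
unset SLURM_EAR_LOAD_MPI_VERSION
srun ./solver_native           # hypothetical native MPI application
```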

Another option is to switch EAR off for the entire batch script, as described above.

Crash on scripts using Anaconda/Miniconda

Since the Intel conda channel provides MPICH, it is necessary to disable EAR completely for jobs running in an Anaconda or Miniconda setup. In this case, please switch EAR off.

Trouble with shared libraries

With the current setup, there may be trouble with switching modules; an error message like

Code Block
/usr/bin/tclsh: error while loading shared libraries: libpsm_infinipath.so.1: cannot open shared object file: No such file or directory

appears. To work around this, there are two options:

  1. switch EAR off as described above.
  2. temporarily unset the LD_PRELOAD variable before making changes to the environment, and restore its original value just before running mpiexec.
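The second option could look like the following job-script fragment (the module and application names are placeholders):

```shell
# Remember the preload libraries injected by EAR
SAVED_LD_PRELOAD="$LD_PRELOAD"
unset LD_PRELOAD

# Environment changes now run without the EAR preload
module switch example_module example_module/2.0   # placeholder module

# Restore EAR's preload right before launching the application
export LD_PRELOAD="$SAVED_LD_PRELOAD"
mpiexec ./myapp                                   # placeholder binary
```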


Further Information

EAR is developed by Lenovo under an open-source licence. Please contact LRZ if you are interested in collaborating on the energy efficiency of HPC systems.