Intel's MPI implementation allows us to build an MPI application once and run it on a variety of interconnects. Good performance can also be achieved over proprietary interconnects, provided the vendor supplies a DAPL or Libfabric implementation that Intel MPI can make use of.
Setting up for the use of Intel MPI
Intel MPI is available on all HPC systems at LRZ that support parallel processing in their batch queuing setup. An environment module provides all tools needed to compile and execute MPI programs, as described in the main MPI document. Since Intel MPI may not be binary compatible with other MPI flavours, you should completely re-compile and re-link your application under the Intel MPI environment.
The intel-mpi (previously mpi.intel) environment module is provided as a default setting on these systems.
Compiling and linking
The following table lists a number of options that can be used with the compiler wrappers in addition to the usual switches for optimization etc. The compiler wrappers' names follow the usual mpicc, mpif90, mpiCC pattern.
|Option||Effect||Comments|
|-mt_mpi||link against thread-safe MPI||Thread safety up to MPI_THREAD_MULTIPLE is provided. Note that this option is implied if you build with the -openmp switch.|
|-check_mpi||link with Intel Trace Collector MPI checking library.||Prior to the invocation of the compiler and/or running the program, you need to load the special tracing module for this to work. Please see the page on ITAC for details.|
|-static_mpi||use static instead of dynamic MPI libraries||By default, the dynamic linkage is performed.|
|-t[=log]||compile with MPI tracing, using Intel Trace Collector||Prior to the invocation of the compiler and/or running the program, you need to load the special tracing module for this to work. Please see the 'Get Started with ITAC' page for details.|
|-ilp64||link against MPI interface with 8 byte integers||You may also need to specify -i8 when compiling Fortran code that uses only default integers.|
|-g||link against debugging version of the MPI library||This will also toggle debugging mode in the compiler.|
|-gtool ...||start selected MPI tasks under control of a tool||See the Intel documentation page on this for more details. This option should be used to perform various analysis types with MPI programs, e.g. using Inspector or VTune.|
The compiler used by the Intel MPI default module is the Intel Fortran/C/C++ suite; the compiler version depends on the currently loaded fortran/intel and ccomp/intel environment modules. However, it is also possible to use other compilers with Intel MPI. The following table illustrates the availability of such alternative compilers.
|Modules||Compiler||Supported Versions / Comments|
|gcc||GCC||The system GCC as well as at least a subset of LRZ-provided gcc modules are supported. Any supported gcc module must be loaded prior to the Intel MPI one.|
|intel-mpi/2019-intel||Intel||default module on CoolMUC-2|
Executing Intel MPI programs
The Hydra process management infrastructure, which is aware of the batch queuing system, is always used for starting up Intel MPI programs. This also applies if the mpiexec command is used.
Execution on the Linux Cluster or SuperMUC-NG (SLURM)
You can use either the SLURM srun command or the mpiexec command to start up your program inside a SLURM script or interactive salloc environment. For example,
mpiexec -n 32 ./myprog.exe
will start up 32 MPI tasks, using the same number of cores on the system. The same happens if you issue
srun -n 32 ./myprog.exe
Sometimes MPI tasks need more memory per task than is available per core. In that case, you need to reserve more resources in your job and leave some cores idle. For example,
srun --cpus-per-task=2 -n 32 ./myprog.exe
or (on the MPP cluster with 16 cores per node)
mpiexec --perhost=8 -n 32 ./myprog.exe
would require 64 cores and allow each task to use twice as much memory.
Executing hybrid-parallel programs
This section deals with programs that use both MPI and OpenMP for parallelization. In this case, the number of cores used by each MPI task is usually equal to the number of OpenMP threads to be used by that task, and the latter is set via the environment variable OMP_NUM_THREADS. For example, an export OMP_NUM_THREADS=4 executed prior to the startup of the MPI program would cause each MPI task to use 4 threads; the job setup should therefore usually provide 4 cores to each MPI task. In order to perform appropriate pinning of the OpenMP threads, please use the compiler-specific pinning mechanism; for Intel compilers, the KMP_AFFINITY environment variable serves this purpose. However, this will usually only work well on systems with Intel processors. Please consult the Intel MPI Reference Manual (see below) for information on how to perform pinning in more general setups.
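A minimal sketch of such a setup might look as follows; the thread count and the KMP_AFFINITY value are illustrative, not tuned recommendations:

```shell
# Each MPI task spawns 4 OpenMP threads
export OMP_NUM_THREADS=4
# Intel-compiler-specific thread pinning (example value from Intel's docs):
# bind threads to consecutive cores within each task's core set
export KMP_AFFINITY=granularity=fine,compact,1,0
```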
Hybrid program execution on the Linux Cluster or SuperMUC-NG (SLURM)
The command sequence
srun --cpus-per-task=4 -n 12 ./myprog.exe
will start 12 MPI tasks with 4 threads each. However, the placement of tasks and threads is not optimal. A better way is to say
mpiexec --perhost=4 -n 12 ./myprog.exe
Note that the perhost argument must equal the number of cores in a node divided by the number of cores per task.
Handling environment variables
The mpiexec command takes a number of options to control how environment variables are transmitted to the started MPI tasks. A typical command line might look like
mpiexec -genv MY_VAR_1 value1 -genv MY_VAR_2 value2 -n 12 ./myprog.exe
Please consult the documentation linked below for further details and options.
Environment variables controlling the execution
Please consult the Intel MPI documentation for the very large set of I_MPI_* variables that allow extensive configuration and optimization at both compile time and run time.
Settings for low memory footprint
Applications demanding large amounts of memory per node may benefit from reducing the MPI footprint. Appropriate environment variables make it possible to control Intel MPI's behaviour in this respect.
Using these settings, test programs on single SuperMUC-NG compute nodes reduced the footprint for MPI collective operations by about tenfold, from a few GB down to a few hundred MB, with even a slight performance improvement.
The exact memory and execution time will depend on the details of the collective operations.
The most relevant environment variables are listed below; they can, for example, be set at the beginning of the SLURM job script.
I_MPI_SHM_CELL_FWD_SIZE – size of forward cells
I_MPI_SHM_CELL_FWD_NUM – number of forward cells per rank
I_MPI_SHM_CELL_BWD_SIZE – size of backward cells
I_MPI_SHM_CELL_BWD_NUM – number of backward cells per rank
I_MPI_SHM_CELL_EXT_SIZE – size of extended cells
I_MPI_SHM_CELL_EXT_NUM_TOTAL – total number of extended cells per computational node
To reduce shared-memory consumption, use for example:
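A possible combination of the variables listed above is sketched here; the specific values are illustrative assumptions, not LRZ-validated recommendations, and suitable values depend on the application:

```shell
# Disable forward and extended cells entirely and keep only a modest
# number of small backward cells (values are illustrative examples)
export I_MPI_SHM_CELL_FWD_NUM=0
export I_MPI_SHM_CELL_EXT_NUM_TOTAL=0
export I_MPI_SHM_CELL_BWD_SIZE=65536
export I_MPI_SHM_CELL_BWD_NUM=64
```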
It may also help to experiment with enabling or disabling Intel MPI's custom memory allocators:
I_MPI_MALLOC – control Intel MPI custom allocator of private memory
I_MPI_SHM_HEAP – control Intel MPI custom allocator of shared memory
For example, to disable them, use:
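A sketch of disabling both allocators via the variables named above:

```shell
# Disable Intel MPI's custom private-memory allocator
export I_MPI_MALLOC=0
# Disable Intel MPI's custom shared-memory heap
export I_MPI_SHM_HEAP=0
```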
Finally, it is possible to control in detail the exact algorithm used by each collective operation:
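This is done via the I_MPI_ADJUST family of variables, one per collective operation; which algorithm number corresponds to which algorithm is listed in the Intel MPI reference. The values below are illustrative examples only:

```shell
# Force a specific algorithm for MPI_Allreduce and MPI_Bcast
# (algorithm numbers are examples; consult the Intel MPI reference)
export I_MPI_ADJUST_ALLREDUCE=1
export I_MPI_ADJUST_BCAST=3
```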
These will affect the global memory footprint, though to what extent is not documented. Forcing collective intra-node operations to be performed on a point-to-point basis, instead of the default shared-memory algorithms, can help further:
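For example:

```shell
# Use point-to-point transfers instead of shared memory for
# intra-node collective operations
export I_MPI_COLL_INTRANODE=pt2pt
```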
although this seems less influential than the variables described above.
Generating Core dumps for debugging
Note: on current operating system releases, core dumps cannot be generated. There is work underway to re-enable this feature.
Non-Blocking MPI Calls
MPI_Isend and MPI_Irecv are non-blocking calls, but this does not make the memory transfer asynchronous. The Intel MPI Library does not spawn a separate thread for communication, so the transfer has to happen in the main program thread. When using shared memory, the CPU needs cycles to transfer the data, and those cycles typically occur during MPI_Waitall. If you are using RDMA, the transfer can happen asynchronously, so there is a slight improvement. For more asynchronous behaviour, use threading: have one thread perform the MPI_Waitall call while other threads perform calculations.
MPI tag upper bound exceeded
Intel MPI commonly changes the value of MPI_TAG_UB between releases, which sometimes results in errors like this:
Fatal error in PMPI_Issend: Invalid tag, error stack:PMPI_Issend(156): MPI_Issend(buf=0x7fff88bd4518, count=1, MPI_INT, dest=0, tag=894323, MPI_COMM_WORLD, request=0x7fff88bd4438) failed
The reason for this error is that the program used larger tag values than Intel MPI allows by default; note that this default is already much higher than the 32k guaranteed by the MPI standard. In the past, Intel kept the maximum tag value quite high, but newer releases reduced it significantly: from 2G in Intel MPI 2018.4 to 1G in 2019.6 and finally to 0.5M in 2019.7. The real solution is to adapt your program to what the standard guarantees (32k); however, a workaround is available to change the default values.
It is possible to change (more precisely, redistribute) the number of bits used for MPI tags, at the cost of reducing the maximum number of MPI ranks: a total of 39 bits is shared between MPI tags and MPI ranks.
For example, one could get 2^31 (2G) MPI tags while reducing the maximum number of processes to 2^8 = 256 by exporting the following environment variables:
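A sketch of such a redistribution; the variable names assume the MPICH CH4/OFI-based implementation underlying Intel MPI 2019 and may differ in other releases:

```shell
# Redistribute the 39 available bits: 31 bits for tags (2^31 values),
# leaving 8 bits for ranks (at most 256 MPI processes)
export MPIR_CVAR_CH4_OFI_TAG_BITS=31
export MPIR_CVAR_CH4_OFI_RANK_BITS=8
```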
Current versions support the MPI-3 interface. In particular, the new mpi_f08 Fortran interface can be used in conjunction with the Intel Fortran compiler.
General Information on MPI
Please refer to the MPI page at LRZ for the API documentation and information about MPI in general.
Intel MPI documentation
For the most up-to-date release, the documentation can also be found on Intel's web site.