OpenMP - shared memory and device parallelism

This document provides a guide to usage of OpenMP and availability of OpenMP on the HPC systems at LRZ.

An abstract description of OpenMP

OpenMP is a parallelization method available for the programming languages Fortran, C and C++, which is targeted toward use on shared memory systems. Since the OpenMP standard was developed with support from many vendors, programs parallelized with OpenMP should be widely portable.

The current standard document is that for OpenMP 5.0,, for the examples annex only the 4.5 version exists as a (separately published) document; this in particular extends OpenMP to apply to accelerator devices. Actual compiler support is now reasonably complete, however support for specific accelerator devices is heavily compiler dependent, which rather limits portability of code that uses the target directive.

The OpenMP parallelization model

From the operating system point of view, OpenMP functionality is based on the use of threads, while the user's job simply consists in inserting suitable parallelization directives into her/his code. These directives should not influence the sequential functionality of the code; this is supported through their taking the form of Fortran comments and C/C++ preprocessor pragmas, respectively. However, an OpenMP aware compiler will be capable of transforming the code-blocks marked by OpenMP directives into threaded code; at run time the user can then decide (by setting suitable environment variables) what resources should be made available to the parallel parts of his executable, and how they are organized or scheduled. The following image illustrates this.

arch.png

However hardware and operation mode of the computing system put limits to the application of OpenMP parallel programs: Usually, it will not be sensible to share processors with other applications because scalability of the codes will be negatively impacted due to load imbalance and/or memory contention. For much the same reasons it is in many cases not useful to generate more threads than CPUs are available. Correspondingly, you need to be aware of your computing centers' policies regarding the usage of multiprocessing resources.

At run time the following situation presents itself: Certain regions of the application can be executed in parallel, the rest - which should be as small as possible - will be executed serially (i.e., by one CPU with one thread).

exec.png

Program execution always starts in serial mode; as soon as the first parallel region is reached, a team of threads is formed ("forked") based on the user's requirements (4 threads in the image above), and each thread executes code enclosed in the parallel region. Of course it is necessary to impose a suitable division of the work to be done among the threads ("work sharing"). At the end of the parallel region all threads are synchronized ("join"), and the following serial code is only worked on by the master thread, while the slave threads wait (shaded yellow squares) until a new parallel region begins. This alteration between serial and parallel regions can be repeated arbitrarily often; the threads are only terminated when the application finishes.

The OpenMP standard also allows nesting of parallelism: A thread in a team of threads may generate a sub-team; some OpenMP implementations however do not allow the use of more than one thread below the top nesting level.

A priori the number of threads used does not need to conform with the number of CPUs available for a job. However for achieving good performance it is necessary to determine and possibly enforce an optimal assignment of CPUs to threads. This may involve additional functionality in the operating system, and is not covered by the OpenMP standard.

Comparison with other parallelization methods

In contrast to using MPI (which usually requires a lot of work), one can very quickly obtain a functioning parallel program with OpenMP in many cases. However, in order to achieve good scalability and high performance it will be necessary to use suitable tools to perform further optimization. Even then, scalability of the resulting code will not always be on par with the corresponding code parallelized with MPI.


MPI
(shared and
distributed memory
systems)

OpenMP
(shared memory
systems)

proprietary
directives

PGAS

(coarrays, UPC)

portable

yes

yes

no

yes

scalable

yes

partially

partially

yes

supports data parallelism

no

yes

yes

partially

supports incremental parallelization

no

yes

yes

partially

serial functionality intact?

no

yes

yes

yes

correctness verifiable?

no

yes

?

in principle yes

On high performance computing systems with a combined shared and distributed memory architecture using MPI and OpenMP in a complementary manner is one possible strategy for parallelization ("hybrid parallel programs"). Schematically one obtains the following hierarchy of parallelism:

  • The job to be done is subdivided into large chunks, which are distributed to compute nodes using MPI.

  • Each chunk of work is then further subdivided by suitable OpenMP directives. Hence each compute node generates a team of threads, each thread working on part of a chunk.

  • On the lowest level, e.g. the loops within part of a chunk, the well-known optimization methods (either by compiler or by manual optimization) should be used to obtain good single CPU or single thread performance. The method used will depend on the hardware (e.g., RISC-like vs. vector-like).

Note that the hardware architecture may also have an influence on the OpenMP parallelization method itself. Furthermore, unlimited intermixing of MPI and OpenMP requires a thread-safe implementation of MPI; the level of thread-safeness can be obtained by calling the MPI_Init_thread subroutine with suitable arguments; depending on the result, appropriate care may be required to follow the limitations inherent in the various threading support levels defined by the MPI standard.

Overview of OpenMP functionality

Please consult the Wikipedia OpenMP article for such an overview.

Remarks on the usage of OpenMP

OpenMP compilers

OpenMP enabled compilers are available on all HPC systems at LRZ:

  • The current (19.0) Intel Compiler suite supports most features of OpenMP 4.5, particularly the SIMD directives. These compilers are available on all HPC systems at LRZ.

  • The PGI Compiler (2019) suite supports OpenMP 3.1 and some 4.5 features on x86_64 based systems. Accelerator support however does not use OpenMP, but is based on OpenACC.

  • The currently deployed default release (7) of GCC supports OpenMP.

  • The NAG compiler fully supports OpenMP 3.1 in its 6.2 release.

Compiler switches

For activation of OpenMP directives at least one additional compiler switch is required.

Vendor

Compiler calls

OpenMP option

Intel

ifort / icc / icpc

-qopenmp

GCC

gfortran / gcc / g++-fopenmp

PGI

pgf90 / pgcc / pgCC

-mp

Please specify this switch for all program units that contain either OpenMP directives, or calls to OpenMP run time functions. Also, the OpenMP switch must be specified at link time.

Controlling the Run Time environment

Before executing your OpenMP program, please set up the OpenMP environment using one or more of the variables indicated in the following table.

Variable namepossible valuesmeaning
OMP_NUM_THREADSa positive integer, or a comma-separated list of positive integersThe number of threads that will be spawned by the first encountered parallel regions, or by subsequent nested parallel regions.
OMP_SCHEDULEtype[,chunk], where type is one of static, dynamic, guided or auto, and chunk is an optional positive integerdetermines how work sharing of loops with a runtime clause is scheduled to threads
OMP_DISPLAY_ENVtrue, false or verboseinstructs run time to display the values of all OpenMP related environment settings (including internal control variables) at the beginning of program execution if set to true or verbose.
OMP_DYNAMICtrue or falsedetermine whether or not the OpenMP run time may decide how many threads should be used for a parallel region
OMP_PROC_BIND

false, true, master, close, spread


false: set no binding 
true: lock threads to a core
master: collocate threads with the master thread  
close: place threads close to the master in the places list  
spread: spread out threads as much as possible

OMP_PLACESsee the standard document for grammardefine a partitioning of available resources for thread- to-resource mapping.
OMP_NESTEDtrue or falsedetermine whether or not the OpenMP run time should support nesting of parallel regions
OMP_STACKSIZEsize or sizeB or sizeK or sizeM or sizeG, where size is a positive integerSize of the thread-individual stack in appropriate units (kByte, MByte etc.). If no unit specifier appears, the size is given in units of kBytes.
OMP_WAIT_POLICYactive or passiveprovide hint to OpenMP run time about desired behaviour of waiting threads
OMP_MAX_ACTIVE_LEVELSa positive integermaximum number of active levels of nested parallelism
OMP_THREAD_LIMITa positive integermaximum number of threads to use for the whole OpenMP program. There may be an implementation defined maximum value which cannot be exceeded.
OMP_CANCELLATIONtrue or falsedetermines whether or not cancellation is enabled.
OMP_MAX_TASK_PRIORITYa non-negative integersets the internal control variable that controls the use of task priorities

Simple example (using the bash or ksh shell):

export OMP_NUM_THREADS=8
OMP_PLACES=cores
OMP_PROC_BIND=spread
./my_openmp_program.exe

This will execute the OpenMP-compiled binary "my_openmp_program.exe" with up to 8 threads.

Stub library, module and include file for Fortran

In order to keep code compilable for the serial functionality, any OpenMP function calls or declarations should also be decorated with an active comment:

      implicit none
      ...
!$    integer OMP_GET_THREAD_NUM
!$    external OMP_GET_THREAD_NUM
      ...
      mythread = 0
!$    mythread = OMP_GET_THREAD_NUM()

Note that without the IMPLICIT NONE statement and missing declaration OMP_GET_THREAD_NUM has the wrong type!

If you do not wish to do this, it is also possible to use

  • either an include file omp_lib.h

  • or a Fortran 90 module omp_lib.f90

for compilation of the serial code. For linkage one also needs a stub library. All this is provided in the Intel compilers via the option -openmp-stubs, which will otherwise produce purely serial code. The above code can then be written as follows:

Fortran 77 style

Fortran 90 style (recommended)

      implicit none
      ...
      include 'omp_lib.h'
      ...
      mythread = OMP_GET_THREAD_NUM()
      use omp_lib
      implicit none
      ...
      ...
      mythread = OMP_GET_THREAD_NUM()

Additional environment variables supported by the Intel Compilers

Name

Explanation

Default value

KMP_AFFINITY and/or KMP_PLACE_THREADS

see the description on Intel's web site

schedule threads to cores or threads (logical CPUs) in a user controlled manner.

KMP_ALL_THREADS

maximum number of threads available to a parallel region

max(32, 4*OMP_NUM_THREADS, 4*(No. of processors)

KMP_BLOCKTIME

Interval after which inactive thread is put to sleep, in milliseconds.
Should be short in throughput mode, can be longer in turnaround mode; see KMP_LIBRARY below.

200 milliseconds

KMP_LIBRARY

Select execution mode for OpenMP runtime library. Possible values are:

  • throughput: optimized for sharing resources with other programs

  • turnaround: suited to dedicated use of resources, as in HPC.

  • serial: enforce serial execution

throughput

KMP_MONITOR_STACKSIZE

Set stacksize in bytes for monitor thread

max(32768, system minimum thread stack size)

KMP_STACKSIZE

stack size (in bytes) usable for each thread. Change if your application segfaults for no apparent reason. You may also need to increase your shell stack limit appropriately.

With OpenMP 3.0 and higher, the standardized variable OMP_STACKSIZE should be used.

2 MByte

Notes:

  • Setting suitable postfixes where appropriate allows you to specify units. I.e., KMP_STACKSIZE=6m sets a value of 6 MByte.

  • There are also some extension routine calls, i.e. kmp_set_stacksize_s(...) with an implementation dependent integer kind as argument, which can be used instead of the environment variables described above. However this will usually not be portable and usage is hence discouraged unless for specific needs.

Additional environment variables supported by the GNU Compilers

References 

  • OpenMP home pageThe central source of information about OpenMP
  • OpenMP Specifications for Fortran, C, and C++. Older versions of the spec can also be picked up here.

  • Course material for a two day OpenMP course (collaboration between LRZ and RRZE)