Hybrid Programming in HPC - MPI+X


Participants 2022 © VSC TU Wien

Overview

Most HPC systems are clusters of shared memory nodes. To use such systems efficiently, both memory consumption and communication time have to be optimized. Therefore, hybrid programming may combine the distributed memory parallelization on the node interconnect (e.g., with MPI) with the shared memory parallelization inside each node (e.g., with OpenMP or MPI-3.0 shared memory). This course analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Multi-socket multi-core systems in highly parallel environments are given special consideration. MPI-3.0 introduced a new shared memory programming interface, which can be combined with inter-node MPI communication. It can be used for direct neighbor accesses, similarly to OpenMP, or for direct halo copies, and it enables new hybrid programming models. These models are compared with various hybrid MPI+OpenMP approaches and pure MPI. Numerous case studies and micro-benchmarks demonstrate the performance-related aspects of hybrid programming.
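
In practice, this decomposition means a few MPI processes per node, each running several OpenMP threads that share their node's memory. The fragment below is only a rough sketch of such a setup on the 28-core CoolMUC-2 nodes used for the hands-on sessions; the 4 tasks × 7 threads split and the executable name hybrid_program are illustrative, and the cluster-specific options (partition, reservation, etc.) are shown in the full job file further down this page.

#!/bin/bash
# two shared-memory nodes, 4 MPI processes per node, 7 cores (= OpenMP threads) per process
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=7
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # OpenMP threads = cores per MPI process
mpiexec -n 8 ./hybrid_program                 # 2 nodes x 4 tasks = 8 MPI tasks in total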

Hands-on sessions are included on all three days. Tools for hybrid programming, such as thread/process placement support and performance analysis, are presented in a "how-to" section. This course provides scientific training in Computational Science and, in addition, fosters scientific exchange among the participants.

This online course is a PRACE training event. It is organised by LRZ in cooperation with HLRS, NHR@FAU and the VSC Research Center, TU Wien.

Preliminary Agenda

1st day

08:45   Join online
09:00      Welcome
09:05      Motivation
09:15      Introduction
09:30      Programming Models
09:35         - MPI + OpenMP
10:00            Practical (how to compile and start)
10:30   Break
10:45         - continue: MPI + OpenMP
11:30   Break
11:45         - continue: MPI + OpenMP
12:30            Practical (how to do pinning)
13:00   Lunch
14:00            Practical (hybrid through OpenMP parallelization)
15:30            Q & A, Discussion
16:00   End of first day

2nd day

08:45   Join online
09:00         - Overlapping Communication and Computation
09:30            Practical (taskloops)
10:30   Break
10:45         - MPI + OpenMP Conclusions
11:00         - MPI + Accelerators
11:30      Tools
11:45   Break
12:00      Programming Models (continued)
12:05         - MPI + MPI-3.0 Shared Memory
13:00   Lunch
14:00            Practical (replicated data)
15:30            Q & A, Discussion
16:00   End of second day

3rd day

08:45   Join online
09:00         - MPI Memory Models and Synchronization
09:40         - Pure MPI
10:00   Break
10:15         - Recap - MPI Virtual Topologies
10:45         - Topology Optimization
11:15   Break
11:30           Practical/Demo (application-aware Cartesian topology)
12:30         - Topology Optimization (Wrap up)
12:45       Conclusions
13:00   Lunch
14:00       Finish the hands-on labs, Discussion, Q & A, Feedback
16:00   End of third day (course)

Lecturers

Dr. Claudia Blaas-Schenner (VSC Research Center, TU Wien), Dr. habil. Georg Hager (NHR@FAU), Dr. Rolf Rabenseifner (HLRS, Uni. Stuttgart)

Slides

see http://tiny.cc/MPIX-LRZ 

Recommended Access Tools

Login under Windows:

  • Start Xming and after that PuTTY
  • Enter the host name lxlogin1.lrz.de into the PuTTY host field and click Open.
  • Accept and save the host key (only the first time)
  • Enter user name and password (provided by LRZ staff) into the opened console.

Login under Mac:

  • Open the Terminal app and proceed as under Linux below (XQuartz is required for X11 forwarding with ssh -Y)

Login under Linux:

  • Open xterm
  • ssh -Y lxlogin1.lrz.de -l username
  • Use user name and password (provided by LRZ staff)

How to use the CoolMUC-2 System

Login Nodes: lxlogin1.lrz.de (see the login instructions above)

Sample job file

#!/bin/bash
#SBATCH -o /dss/dsshome1/0D/hpckurs99/test.%j.%N.out
#SBATCH -D /dss/dsshome1/0D/hpckurs99
#SBATCH -J test
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --nodes=1
#SBATCH --qos=unlimitnodes
#SBATCH --cpus-per-task=28
#SBATCH --get-user-env
#SBATCH --reservation=hhyp1s22
#SBATCH --time=02:00:00
module load slurm_setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo hello, world    # placeholder: replace with the commands that run your program

The reservation is only valid during the workshop; for general usage on our Linux Cluster, remove the "--reservation=hhyp1s22" option.

  • Submit a job:
    sbatch --reservation=hhyp1s22 job.sh
  • List own jobs:
    squeue -M cm2
  • Cancel jobs:
    scancel -M cm2 jobid
  • Show reservations:
    sinfo -M cm2  --reservation
  • Interactive Access:
    salloc -M cm2 --time=00:30:00 --reservation=hhyp1s22 --partition=cm2_std

Details: https://doku.lrz.de/display/PUBLIC/Running+parallel+jobs+on+the+Linux-Cluster
Examples: https://doku.lrz.de/display/PUBLIC/Example+parallel+job+scripts+on+the+Linux-Cluster
Resource limits: https://doku.lrz.de/display/PUBLIC/Resource+limits+for+parallel+jobs+on+Linux+Cluster


Intel Software Stack

The Intel software stack is automatically loaded at login. The Intel compilers are called icc (for C), icpc (for C++) and ifort (for Fortran). They behave similarly to the GNU compiler suite (the option --help shows an option summary). For reasonable optimisation including SIMD vectorisation, use the options -O3 -xavx (you can use -O2 instead of -O3 and sometimes get better results, since the compiler may otherwise try to be overly smart and undo many of your hand-coded optimisations).

By default, OpenMP directives in your code are ignored. Use the -qopenmp option to activate OpenMP.

Use mpiexec -n #tasks to run MPI programs. The compiler wrappers' names follow the usual mpicc, mpiifort, mpiCC pattern.
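
For illustration only, compiling and starting a hybrid MPI+OpenMP code with the options mentioned above could look like this (the source file name hybrid.c is made up for this example, and on some Intel MPI installations the Intel C wrapper is called mpiicc rather than mpicc):

mpicc -qopenmp -O3 -xavx hybrid.c -o hybrid   # enable OpenMP and SIMD optimisation
mpiexec -n 4 ./hybrid                         # start 4 MPI tasks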

PRACE Survey

Please fill out the PRACE online survey under

https://events.prace-ri.eu/event/1334/surveys/980

This helps us and PRACE to increase the quality of the courses, to design the future training programme at LRZ and in Europe according to your needs and wishes, and to secure funding for future training events.