PRACE Course: HPC Code Optimisation Workshop 2022


Contents

With the ever-growing complexity of computer architectures, code optimisation has become the main route to keeping pace with hardware advancements and to making effective use of current and upcoming High Performance Computing systems.

Have you ever asked yourself:

  • Where are the performance bottlenecks of my application?
  • What is the maximum speed-up achievable on the architecture I am using?
  • Does my code scale well across multiple machines?
  • Does my implementation match my HPC objectives?

In this workshop, we will discuss these questions and provide a unique opportunity to learn techniques, methods and solutions for improving code, enabling new hardware features and visualising the potential benefits of an optimisation process.

We will describe the latest micro-processor architectures and how developers can efficiently use modern HPC hardware, including SIMD vector units and the memory hierarchy. We will also touch upon exploiting intra-node and inter-node parallelism.

Attendees will be guided along the optimisation process through the incremental improvement of an example application. Through hands-on exercises they will learn how to enable vectorisation using simple pragmas and more effective techniques like changing data layout and alignment.
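
To give a flavour of the hands-on part, the following sketch is a generic example, not taken from the course material: it shows a structure-of-arrays data layout with 64-byte aligned allocations and an OpenMP SIMD pragma, the kind of simple changes the exercises use to enable vectorisation. File and variable names are illustrative.

/* soa_simd.c -- illustrative sketch (not from the course material):
 * structure-of-arrays layout, 64-byte aligned allocation and an
 * OpenMP SIMD pragma to help the compiler vectorise a simple loop.
 * Compile e.g. with:  icc -std=c11 -O3 -xavx -qopenmp soa_simd.c -o soa_simd
 */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* Array-of-structures layout (shown only for contrast): reading just x
 * leads to strided, gather-like memory accesses. */
typedef struct { double x, y; } particle_aos;

int main(void)
{
    /* Structure-of-arrays layout: each component is contiguous and
     * 64-byte aligned, so the loop below streams unit-stride data. */
    double *x = aligned_alloc(64, N * sizeof(double));
    double *y = aligned_alloc(64, N * sizeof(double));
    if (!x || !y) return 1;

    for (long i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    /* Explicitly request vectorisation and tell the compiler about the
     * alignment (requires -qopenmp or -qopenmp-simd). */
    #pragma omp simd aligned(x, y : 64)
    for (long i = 0; i < N; ++i)
        x[i] += 0.5 * y[i];

    printf("x[0] = %f\n", x[0]);
    free(x);
    free(y);
    return 0;
}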

The work is guided by hints from compiler reports and by profiling tools such as Intel® Advisor, Intel® VTune™ Amplifier, Intel® Application Performance Snapshot and LIKWID, which are used to investigate and improve the performance of an HPC application.

You can ask the lecturers in the Q&A session about how to optimise your code. Please provide a description of your code in the registration form.

Learning Goals

Through a sequence of simple, guided examples of code modernisation, attendees will develop an awareness of the features of multi- and many-core architectures that are crucial for writing modern, portable and efficient applications.

A special focus will be dedicated to scalar and vector optimisations for the Intel® Xeon® Scalable processor, code-named Skylake, utilised in the SuperMUC-NG machine at LRZ.

The workshop interleaves lecture and practical sessions.

Preliminary Agenda


1st day morning (10:00-12:00)
  • Intro (Volker Weinberg)
  • Intro to LRZ HPC Systems and Software Stack (Gerald Mathias, Nisarg Patel)
  • Principles of Optimization (Jonathan Coles)

1st day afternoon (13:00-16:00)
  • HPC Architecture, Vectorization, Example Code, Data Structures (Jonathan Coles)

2nd day morning (10:00-12:00)
  • Profiling: Code Instrumentation, Roofline Model, Intel Advisor (Jonathan Coles)

2nd day afternoon (13:00-16:00)
  • Debuggers (Gerald Mathias)
  • Additional Tools: Valgrind and Cache Simulators (Josef Weidendorfer)
  • I/O Considerations (Patrick Böhl)

3rd day morning (10:00-12:00)
  • LIKWID (Carla Guillen / Thomas Gruber)
  • HPC Report (Carla Guillen)

3rd day afternoon (13:00-16:00)
  • Optimisation Highlights by LRZ (CXS Group LRZ)
  • Q&A

The workshop is a PRACE training event organised by LRZ in cooperation with NHR@FAU.

Lecturers

Dr. Patrick Böhl, Dr. Jonathan Coles, Dr. Gerald Mathias, Dr. Carla Guillen, Nisarg Patel, Dr. Josef Weidendorfer (LRZ)

Thomas Gruber (NHR@FAU)

Slides and Exercises

  • COW-Code.tar.gz
  • hdf5_examples.tar
  • COW-Code2.tar.gz

Recommended Access Tools

Login under Windows:

  • Start Xming and then PuTTY
  • Enter the host name lxlogin1.lrz.de into the PuTTY host field and click Open.
  • Accept and save the host key (only on the first login)
  • Enter the user name and password (provided by LRZ staff) into the console that opens.

Login under Mac:

  • Open Terminal (for X11 forwarding, XQuartz must be installed)
  • ssh -Y lxlogin1.lrz.de -l username
  • Use user name and password (provided by LRZ staff)

Login under Linux:

  • Open xterm
  • ssh -Y lxlogin1.lrz.de -l username
  • Use user name and password (provided by LRZ staff)

How to use the CoolMUC-2 System

Login Nodes:

  • lxlogin1.lrz.de (see the access instructions above)

The reservation is only valid during the workshop; for general usage of our Linux Cluster, remove "--reservation=hcow1s22" from the commands below.


  • Submit a job:
    sbatch --reservation=hcow1s22 job.sh
  • List own jobs:
    squeue -M cm2
  • Cancel jobs:
    scancel -M cm2 jobid
  • Show reservations:
    sinfo -M cm2  --reservation
  • Interactive Access:
    salloc -M cm2 --time=00:30:00 --reservation=hcow1s22 --partition=cm2_std


Details: https://doku.lrz.de/display/PUBLIC/Running+parallel+jobs+on+the+Linux-Cluster
Examples: https://doku.lrz.de/display/PUBLIC/Example+parallel+job+scripts+on+the+Linux-Cluster
Resource limits: https://doku.lrz.de/display/PUBLIC/Resource+limits+for+parallel+jobs+on+Linux+Cluster

Example OpenMP Batch File


#!/bin/bash
#SBATCH -o /dss/dsshome1/0D/hpckurs99/test.%j.%N.out
#SBATCH -D/dss/dsshome1/0D/hpckurs99
#SBATCH -J test
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --nodes=1
#SBATCH --qos=unlimitnodes
#SBATCH --cpus-per-task=28
#SBATCH --get-user-env
#SBATCH --reservation=hcow1s22
#SBATCH --time=02:00:00
module load slurm_setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo hello, world
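
The script above only prints a message; in a real run the last line would launch an OpenMP binary built with -qopenmp. A minimal OpenMP test program that such a script could run might look like the following sketch (file name and contents are illustrative, not part of the course material):

/* hello_omp.c -- illustrative sketch of a minimal OpenMP program.
 * Compile:  icc -qopenmp -O3 hello_omp.c -o hello_omp
 * In the job script above, replace the echo line with:  ./hello_omp
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* One message per thread; the thread count comes from
     * OMP_NUM_THREADS, which the job script sets from SLURM_CPUS_PER_TASK. */
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}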

Intel Software Stack

The Intel software stack is automatically loaded at login. The Intel compilers are called icc (for C), icpc (for C++) and ifort (for Fortran). They behave similarly to the GNU compiler suite (the option --help shows an option summary). For reasonable optimisation including SIMD vectorisation, use the options -O3 -xavx (you can use -O2 instead of -O3 and sometimes get better results, since the compiler occasionally tries to be overly smart and undoes your hand-coded optimisations).

By default, OpenMP directives in your code are ignored. Use the -qopenmp option to activate OpenMP.

Use mpiexec -n #tasks to run MPI programs. The compiler wrappers' names follow the usual mpicc, mpifort, mpiCC pattern.
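
As an illustration, here is a minimal MPI program that can be built with the wrappers and launched with mpiexec; this is a generic sketch (file name and contents are not from the course material):

/* hello_mpi.c -- illustrative sketch of a minimal MPI program.
 * Compile:  mpicc -O3 hello_mpi.c -o hello_mpi
 * Run:      mpiexec -n 4 ./hello_mpi
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}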

Intel OneAPI

The most recent version of the Intel software stack, "Intel OneAPI", can be loaded with:

uid@cm2login1:~> module load intel-oneapi
intel-oneapi-mpi: using intel wrappers for mpicc, mpif77, etc
 
Loading intel-oneapi/2021.4
  Unloading conflict: intel-mpi/2019-intel intel/19.0.5 intel-mkl/2019
  Loading requirement: intel-oneapi-compilers/2021.4.0 intel-oneapi-mkl/2021
                       intel-oneapi-mpi/2021-intel intel-oneapi-itac/2021.4.0
uid@cm2login1:~> module list
Currently Loaded Modulefiles:
 1) admin/1.0   2) tempdir/1.0   3) lrz/1.0   4) spack/21.1.1   5) intel-oneapi-compilers/2021.4.0  
 6) intel-oneapi-mkl/2021   7) intel-oneapi-mpi/2021-intel   8) intel-oneapi-itac/2021.4.0  
 9) intel-oneapi/2021.4 
uid@cm2login1:~> module av intel-oneapi
-------------- /lrz/sys/spack/.oneapi_rebuild/modules/x86_64/linux-sles15-x86_64 ---------------
intel-oneapi-advisor/2021.4.0    intel-oneapi-ipp/2021.4.0    intel-oneapi-mkl/2021.3.0   
intel-oneapi-ccl/2021.4.0        intel-oneapi-ippcp/2021.4.0  intel-oneapi-mkl/2021.4.0   
intel-oneapi-clck/2021.4.0       intel-oneapi-itac/2021.4.0   intel-oneapi-mpi/2021-gcc   
intel-oneapi-compilers/2021.4.0  intel-oneapi-mkl/2021        intel-oneapi-mpi/2021-intel 
intel-oneapi-dal/2021.4.0        intel-oneapi-mkl/2021-gcc8   intel-oneapi-tbb/2021.4.0   
intel-oneapi-dnn/2021.4.0        intel-oneapi-mkl/2021-seq    intel-oneapi-vpl/2021.6.0   
intel-oneapi-dpcpp-ct/2021.4.0   intel-oneapi-mkl/2021.1.1    intel-oneapi-vtune/2021.7.1 
intel-oneapi-inspector/2021.4.0  intel-oneapi-mkl/2021.2.0   

Upon loading the main intel-oneapi module, the default modules intel, intel-mpi, and intel-mkl are unloaded and replaced by the intel-oneapi-* variants. Further intel-oneapi-xxx modules are available via the module command.


PRACE Survey

Please fill out the PRACE online survey under

tbd.

This helps us and PRACE to increase the quality of the courses, to design the future training programme at LRZ and in Europe according to your needs and wishes, and to secure future funding for training events.