Pilot Operation SuperMUC-NG Phase 2


Invitation Only

This documentation covers the pilot operation of SuperMUC-NG Phase 2. Access and participation are by invitation only until regular operation starts. Information on this page changes as we bring up the system. Please refrain from requesting participation at this point.

Introduction

At the beginning of 2024, we start the pilot operation of SuperMUC-NG Phase 2. Participation is limited to selected projects that help us identify flaws and fix problems. During this period, expect benchmarking and tuning activities with resource reservations and frequent reboots of login and compute nodes. Please do not rely on a production-ready environment for the moment, but rather focus on compilation, testing and benchmarking.

More information on the hardware can be found here.

More information about the Intel GPU usage can be found on Intel's documentation page.

More information about Level Zero can be obtained here.

Access

Invited projects have been provided with the name of the login nodes of SuperMUC-NG Phase 2. 

The same security measures apply as for Phase 1: static IP addresses must be registered (whitelisted) per project, and the firewall does not allow outgoing internet access (please check our documentation for workarounds).

VNC on login nodes:

> module use /lrz/sys/graphics/vncserver/modules
> module load vncserver
> vncserver

The rest is as documented here (for GUI applications such as vtune-gui).

File Systems

HOME, DSS

Same as on SuperMUC-NG Phase 1.

WORK

The file system is mounted from SuperMUC-NG Phase 1. Expect lower I/O performance for the moment.

SCRATCH

The file system is mounted from SuperMUC-NG Phase 1. Expect lower I/O performance for the moment.

DAOS

TBA. <description how to operate DAOS will follow>

SLURM

We have set up the queues similarly to SuperMUC-NG Phase 1, except that there is no micro queue. Please check via sinfo.

#!/bin/bash
#SBATCH -J MolecularSuperscaling
#SBATCH --account=abc123def  #your project ID
#SBATCH --time=02:00:00
#SBATCH --export=NONE
#SBATCH --partition=general  #or: test
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8  #one MPI task per GPU tile

Interactive work via salloc is also possible with the same options.

Module System

With the new system, we are building up a new software stack that supports accelerated applications on the Intel Ponte Vecchio GPUs, so many of the CPU-only applications are hidden. Furthermore, we keep the number of modules loaded by default to a bare minimum. 

Module Initialization
> module list
Currently Loaded Modulefiles:
 1) admin/2.0   2) tempdir/1.0   3) lrz/1.0   4) spack/24.1.0   5) mpi_settings/1.0  

In particular, we do not load default modules for the Intel compilers and Intel MPI. The mpi_settings module only provides environment variables that are independent of the MPI compiler wrappers and mpiexec. Nevertheless, in many cases the Intel oneAPI software stack should be your starting point to build and run your applications.

OneAPI software stack
> module load intel-toolkit

Note: You are loading a complete Base and HPC Intel-OneAPI toolkit / environment.
      This will remove any conflicting modules
      already loaded in your environment

Loading intel-toolkit/2024.0.0
  Loading requirement: intel/2024.0.0 intel-mpi/2021.11.0 intel-mkl/2024.0.0 intel-dpcpp-ct/2024.0.0 intel-inspector/2024.0.0 intel-ccl/2021.11.0 intel-dnn/2024.0.0 intel-itac/2022.0.0
    intel-tbb/2021.11.0 intel-ipp/2021.10.0 intel-dal/2024.0.0 intel-ippcp/2021.9.0 intel-dpl/2022.3.0 intel-dpct/2024.0.0

You can add the load command to your .bashrc, but note that the toolkit is not yet available on SuperMUC-NG Phase 1 by default. So it would be best to encapsulate the command, e.g.


.bashrc
...
 if [[ $(module info-loaded spack) =~ "spack/24" ]] ; then
     module load intel-toolkit
 fi
...
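
The guard above can also be written so that it does nothing in shells where the module command is not defined (for example, some non-interactive shells). The following is only a sketch under that assumption; on_phase2_stack is a hypothetical helper name, and the check for spack/24 is the same one used above.

```shell
#!/bin/bash
# Sketch for ~/.bashrc: load intel-toolkit only where the Phase 2 stack is active.
# Assumptions: the Phase 2 stack is recognised by a loaded spack/24.x module
# (as in the snippet above); on_phase2_stack is a hypothetical helper name.
on_phase2_stack() {
    # $1: loaded spack module(s), e.g. the output of "module info-loaded spack"
    [[ "$1" == *spack/24* ]]
}

# Skip silently in shells where the module command is not defined
if type module >/dev/null 2>&1; then
    if on_phase2_stack "$(module info-loaded spack 2>/dev/null)"; then
        module load intel-toolkit
    fi
fi
```

Keeping the version test in a small function makes it easy to adapt once the module names change.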


Software

The following software provides dedicated/optimized versions for SuperMUC-NG Phase 2.

Compiler/Tools/MPI

Intel MPI

Intel MPI & Multi GPU Programming

Basic settings for offloading to the Ponte Vecchio GPUs with Intel MPI are defined in the module mpi_settings, which is loaded by default.

Documentation of Intel MPI offloading variables I_MPI_OFFLOAD* can be found here <FIXME link>.

Some setup for Intel MPI offloading
export I_MPI_OFFLOAD=1
export I_MPI_OFFLOAD_IPC=0          # switches Xe Link off
export I_MPI_OFFLOAD_CELL_LIST=0,1  # assuming two single-tile cards, or a single card with two tiles
export I_MPI_OFFLOAD_L0_D2D_ENGINE_TYPE=1  # this improves performance

For OpenMP offload with GPU pinning to full devices (not tiles) and Xe Link communication (MPI), you can use the following setup (VTune profiling included).

module load intel-toolkit
module load intel-vtune
export I_MPI_OFFLOAD=1                               # enable offload
export I_MPI_OFFLOAD_IPC=1                           # Xe Link on
export I_MPI_OFFLOAD_RDMA=1                          # full RDMA features
export I_MPI_OFFLOAD_L0_D2D_ENGINE_TYPE=1            # device to device communication
export I_MPI_DEBUG=3                                 # increase debug level to see MPI rank to GPU pinning
export I_MPI_OFFLOAD_CELL=device                     # use GPUs instead of Tiles (default: tile)
mpiexec -l -n 4 vtune -quiet -collect hpc-performance -trace-mpi -result-dir res4_xe_hpc ./mpi_omp_offload_prog 
mpiexec -l -n 4 vtune -quiet -collect gpu-hotspots -knob analyze-xelink-usage=true -result-dir res4_xe_gpu ./mpi_omp_offload_prog 

The program needs to be compiled with -fiopenmp -fopenmp-targets=spir64. With I_MPI_DEBUG=3, the output contains:

[0] [0] MPI startup(): ===== GPU topology on i20r01c03s04 =====
[0] [0] MPI startup(): NUMA nodes : 2
[0] [0] MPI startup(): GPUs       : 4
[0] [0] MPI startup(): Tiles      : 8
[0] [0] MPI startup(): NUMA Id	GPU Id         Tiles                          Ranks on this NUMA
[0] [0] MPI startup(): 0      	0,1            (0,1)(2,3)                     0,1
[0] [0] MPI startup(): 1      	2,3            (4,5)(6,7)                     2,3
[0] [0] MPI startup(): ===== GPU pinning on i20r01c03s04 =====
[0] [0] MPI startup(): Rank	Pin tile
[0] [0] MPI startup(): 0	{0,1}
[0] [0] MPI startup(): 1	{2,3}
[0] [0] MPI startup(): 2	{4,5}
[0] [0] MPI startup(): 3	{6,7}

This shows how socket (NUMA) IDs, GPU IDs, tile IDs and MPI rank IDs are associated.
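
To check the pinning of a larger job, the rank-to-tile table can be extracted from the job output. The following is a minimal sketch, assuming the log format shown above; print_gpu_pinning is a hypothetical helper name (not part of Intel MPI), and the sample log is a shortened copy of the output above.

```shell
#!/bin/bash
# Sketch: extract "rank -> pinned tiles" pairs from I_MPI_DEBUG=3 output.
# Assumption: the log has the format shown above; print_gpu_pinning is a
# hypothetical helper name, not part of Intel MPI.
print_gpu_pinning() {
    awk '
        /GPU pinning on/ { in_table = 1; next }   # the pinning table starts after this line
        in_table && /MPI startup\(\):[ \t]+[0-9]+[ \t]+\{/ {
            # the last two fields are the rank and its tile set
            print "rank " $(NF-1) " -> tiles " $NF
        }
    ' "$@"
}

# Usage with a shortened sample of the output above (reads stdin if no file is given)
sample_log='[0] [0] MPI startup(): GPUs       : 4
[0] [0] MPI startup(): ===== GPU pinning on i20r01c03s04 =====
[0] [0] MPI startup(): Rank Pin tile
[0] [0] MPI startup(): 0 {0,1}
[0] [0] MPI startup(): 1 {2,3}'
printf '%s\n' "$sample_log" | print_gpu_pinning
```

For a real job, pass the Slurm output file as argument instead of piping the sample text.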

Intel Tool Kit - Bundled Libraries

The following libraries are bundled with intel-toolkit: intel-vtune, intel-inspector, intel-ccl, intel-dnn, intel-itac, intel-tbb, intel-ipp, intel-dal, intel-ippcp, intel-dpl, intel-dpct. To use any of them, run:

module load intel-toolkit/version
module load intel-bundledLibrary

An alternative, better approach is:

module show intel-toolkit/version
module use intel-bundledLibrary



Intel Software Stack

Intel oneAPI compilers are loaded with the intel module (also loaded in turn by the intel-toolkit module): module load intel.
The LLVM compilers ifx, icx, and icpx (for Fortran, C and C++ respectively) should be used as default and must be adopted when targeting GPUs (with one of OpenMP offload, SYCL or OpenCL). 

Once the module is loaded, the environment variables $FC, $CC and $CXX point to the right instance of the respective compilers, and can be used for automating compiler selection at compilation time.

All Fortran, C and C++ workloads must migrate to the LLVM-based compilers. Among the Intel classic compilers, only ifort is present; it can be used when targeting CPUs only, and only as a fallback.

Please report any problem or inconsistency you experience when using ifx or the other new compilers.

Migration from the classic to the LLVM-based compilers should be straightforward, as the syntax is in large part backwards compatible.

More assistance can be found here about Fortran compilation, and here about C/C++. These references include a list of feature changes and a few examples (e.g. compiling for OpenMP offload).

In case you have SYCL/DPC++ code targeting GPU and/or CPU, you should use icpx with the -fsycl option, or refer to your application manual for more detailed compilation instructions.

Debugger

GDB, DDT, and TotalView are available. (docu)

The LLVM installation includes lldb.

LLVM

Until we provide LLVM via Spack, there is a temporary solution.

> module use /lrz/sys/share/modules/extfiles
> module av llvm
----------------------- /lrz/sys/share/modules/extfiles -----------------------
llvm/15.0.7  llvm/17.0.6

GCC

> module av gcc
-------------- /lrz/sys/spack/release/24.1.0/modules/compilers ---------------
gcc/8.5.0  gcc/9.5.0  gcc/10.5.0  gcc/11.4.0  gcc/12.3.0  gcc/13.2.0 

User-Spack

Julia (OneAPI)

Setup and Usage Example
> module load intel-toolkit
> module load julia/1.10.0
> julia

julia> import Pkg; Pkg.add("oneAPI")                   # internet access necessary
...
julia> using oneAPI
julia> oneAPI.versioninfo()
Binary dependencies:
- NEO: 23.17.26241+3
- libigc: 1.0.13822+0
- gmmlib: 22.3.0+0
- SPIRV_LLVM_Translator_unified: 0.3.0+0
- SPIRV_Tools: 2023.2.0+0

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

1 driver:
- 00000000-0000-0000-17ac-a75001036681 (v1.3.26241, API v1.3.0)

4 devices:
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550
- Intel(R) Data Center GPU Max 1550

For extended oneAPI julia package documentation, please consult the respective docu page.

Example: OneMKL example

Julia with PETSc

With internet connection active (see proxy above)

Installation
> module load petsc                                        # for PETSc see below (possibly add module use path before)
> module load julia                                        # better use latest version available
> export JULIA_PETSC_LIBRARY=$PETSC_BASE
> export I_MPI_HYDRA_BOOTSTRAP=fork I_MPI_FABRICS=shm      # on login nodes; PETSc build tries to test also MPI :(
> julia
>>> ]add PETSc
>>> ]build PETSc                                           # depending on possible error messages iterate
>>> using PETSc
>>> PETSc.libs
(("/lrz/sys/libraries/petsc/2024-03-11/linux-sles15-sapphirerapids/petsc/3.20.1-oneapi-2024.0.0-z7lhx4k", 0x00000046),)

See also the documentation here (no guarantees; it seems somewhat outdated).

AdaptiveCpp

AdaptiveCpp (formerly hipSYCL/openSYCL) is the independent, community-driven modern platform for C++-based heterogeneous programming models targeting CPUs and GPUs from all major vendors.

> module use /lrz/sys/share/modules/extfiles/AdaptiveCpp
> module load adaptiveCpp
> acpp-info -l
=================Backend information===================
Loaded backend 0: OpenMP
  Found device: hipSYCL OpenMP host device
Loaded backend 1: Level Zero
  Found device: Intel(R) Data Center GPU Max 1550
  Found device: Intel(R) Data Center GPU Max 1550
  Found device: Intel(R) Data Center GPU Max 1550
  Found device: Intel(R) Data Center GPU Max 1550

> pip install --user github-clone                 # *)
> ghclone https://github.com/AdaptiveCpp/AdaptiveCpp/tree/a1647a832ce81eed6d5b5972b7bbd76a45473150/examples
> cd examples/bruteforce_nbody
> acpp --acpp-targets=generic -o bruteforce_nbody bruteforce_nbody.cpp -O3
> ./bruteforce_nbody                              # **)
# or
# > cd examples
# > cmake -S . -B build
# > cmake --build build
# > ./build/bruteforce_nbody/bruteforce_nbody     # **)
...

*) for SuperMUC-NG internet access see here.
**) Check GPU involvement, e.g. via xpu-smi dump -d 0,1,2,3 -m 0,1,2,3, or use onetrace, oneprof, ... (see PTI-GPU below)

UPC++ (experimental)

UPC++ is a PGAS C++ parallel framework, based on GASNet. Documentation 

Usage:

> module use /lrz/sys/share/modules/extfiles
> module load upcxx
> upcxx-info

Other "conduits" can be selected via UPCXX_NETWORK. ucx is the default; it is declared "experimental" but works. The mpi and smp conduits also work, but mpi is not recommended and smp is shared-memory only.

OpenCL

Querying the available OpenCL platforms and devices and displaying their properties:

> clinfo            # Display properties of all available devices
> clinfo --list     # List platforms
> clinfo -d 0:1     # Show information about device 1 from platform 0

Performance Measurement

Intel Performance Suite (vtune)

> module load intel-toolkit      # sets, among other things, a new "module use" path
> module load intel-vtune

PTI-GPU

> module use /lrz/sys/share/modules/extfiles
> module load pti-gpu
> ls $PTI_GPU_BASE/bin | grep -v .so              # for a list of available tools
...
> onetrace --version
0.49.22
> onetrace --help

Unfortunately, gpuinfo does not work yet.

XPU-SMI

xpu-smi is installed in the system. It is a very basic GPU monitoring tool. Open a terminal (on the node where you want to see the GPU usage), and execute, for instance,

> xpu-smi dump -d 0,1,2,3 -m 0,1,2,3

OpenMP

For applications using OpenMP offloading, you can print a tabular kernel runtime profile by setting the environment variable LIBOMPTARGET_PLUGIN_PROFILE=T, e.g.

> ifx -fiopenmp -fopenmp-targets=spir64 -o app app.f90 
> LIBOMPTARGET_PLUGIN_PROFILE=T ./app

Applications

Molecular Dynamics / Chemistry

For running Gromacs the following jobscript can be used:

gromacs jobscript
#!/bin/bash
#SBATCH -J MolecularSuperscaling
#SBATCH --account=........
#SBATCH --time=02:00:00
#SBATCH --export=NONE
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8 

export OMP_NUM_THREADS=8

module load slurm_setup
module load gromacs/2023.3-intel-impi-openmp-r32

srun -n 8 gmx mdrun -s relax_ethanol_water_novsite_995334.tpr -nb gpu -pme gpu -update gpu -ntomp 8 -tunepme -npme 1

This script starts 8 MPI tasks (--ntasks-per-node=8), each on a separate GPU tile. Each task uses 8 CPUs (OMP_NUM_THREADS=8, -ntomp 8). Only one tile is used for the PME calculation (-npme 1); all particle calculations are carried out on the remaining GPU tiles. For efficient scaling to larger numbers of GPUs, the PME calculation would need to be split over several GPUs. Although this is possible, it does not yet yield satisfactory results and is not supported for now.

Amber with SYCL implementation of pmemd

To use Amber with Intel PVC GPUs, follow the general steps below. This Amber 20 release includes the SYCL implementation of pmemd, which enables simulations on Intel GPUs. Please note that this release does not include all features.

> module load amber/20_sycl
Run your parallel simulation, e.g. on a single node with 2 PVC tiles:
> mpirun -np 2 pmemd.sycl_SPFP.MPI -O -i relax.in -p 3981318_atoms.parm7 -c 3981318_atoms_eq_0.rst -r 3981318_atoms_bench_sycl.rst -o 3981318_atoms_bench_sycl0.out -x 3981318_atoms_bench_sycl0.crd


Using the Intel MPI offloading setup described above may improve performance. The command above starts Amber on two tiles. Although Amber can run on more than two tiles, performance deteriorates significantly if the simulation is split across more than one GPU. For the moment, we therefore do not recommend using more than two tiles per simulation with Amber.


cp2k

<fixme: point to appropriate module; example script>

PDE Solving Frameworks

Kokkos/SYCL

From sources (the user_spack kokkos package.py does not support sapphirerapids, and the OneDPL dependency is not correctly specified)

> module load intel-toolkit   # (cmake)
> git clone https://github.com/kokkos/kokkos.git
> cmake -S kokkos -B build -DBUILD_SHARED_LIBS=ON -DKokkos_ARCH_INTEL_PVC=ON -DKokkos_ARCH_SPR=ON -DKokkos_ENABLE_EXAMPLES=ON -DKokkos_ENABLE_ONEDPL=ON -DKokkos_ENABLE_SYCL=ON
> cmake --build build -j 30
> cmake --install build --prefix <desired-install-path> 

Finally, add bin and lib64 to PATH and LD_LIBRARY_PATH/CMAKE_PREFIX_PATH, respectively.
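
The last step can be sketched as follows. This is only an illustration of the sentence above; add_kokkos_to_env is a hypothetical helper name, and the path in the usage line is a placeholder for <desired-install-path>.

```shell
#!/bin/bash
# Sketch: make a Kokkos installation visible to the shell, the loader and CMake.
# Assumptions: add_kokkos_to_env is a hypothetical helper name; the prefix passed
# to it stands for <desired-install-path> from the cmake --install step above.
add_kokkos_to_env() {
    local prefix=$1
    # prepend, keeping existing entries; ${VAR:+...} avoids a trailing ":" if VAR is empty
    export PATH="${prefix}/bin${PATH:+:${PATH}}"
    export LD_LIBRARY_PATH="${prefix}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
    export CMAKE_PREFIX_PATH="${prefix}/lib64${CMAKE_PREFIX_PATH:+:${CMAKE_PREFIX_PATH}}"
}

# Usage (placeholder path)
add_kokkos_to_env "/path/to/kokkos-install"
echo "CMAKE_PREFIX_PATH=${CMAKE_PREFIX_PATH}"
```

Putting the call into a small setup script that is sourced before building keeps ~/.bashrc free of build-specific paths.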

Example Application

<to be fixed>

PETSc/Kokkos/SYCL

Installation via user_spack
> module load intel-toolkit user_spack
> export CMAKE_PREFIX_PATH=/dss/lrzsys/sys/spack/release/24.1.0/opt/x86_64/intel-toolkit/2024.0.0/intel-oneapi-dpl/2022.3.0-gcc-qpc3i3d/dpl/2022.3/lib/cmake/oneDPL
> spack install -j 30 --dirty petsc@3.20.1%oneapi +sycl+int64+kokkos ^kokkos@develop%oneapi+hwloc intel_gpu_arch=intel_pvc ^kokkos-kernels@develop%oneapi
> spack install -j 30 py-petsc4py ^petsc /<hash of petsc>

--dirty was used here because a OneDPL dependency is missing in kokkos's CMakeLists.txt in this Spack version. Furthermore, sapphirerapids is not supported there, although this does not seem to affect PETSc. We suggest using the "own repo approach of user_spack" instead: add "sapphirerapids": "SPR" to spack_micro_arch_map, and add depends_on("intel-oneapi-dpl", when="+sycl").

Please check whether this was already fixed!

Installed Version
> module use /lrz/sys/share/modules/extfiles
> module av petsc
-------------------------- /lrz/sys/share/modules/extfiles --------------------------------
petsc/3.20.1-intel24-impi-real

Also installed is py-petsc4py.

Example Application
> module load py-petsc4py
> module load python/3.10.12-extended
# > export I_MPI_HYDRA_BOOTSTRAP=fork I_MPI_FABRICS=shm            # on login nodes; otherwise salloc
> export ONEAPI_DEVICE_SELECTOR=level_zero:gpu                     # for offload to the GPU
> export SYCL_PI_TRACE=2                                           # to see L0 offload; onetrace can also be used
> python poisson2d.py                                              # produces a lot of L0 output
# or
>>> from petsc4py import PETSc

The poisson2d.py example was taken from here: https://petsc.org/release/petsc4py/demo/poisson2d/poisson2d.html

Machine Learning and AI

conda Environments

To run PyTorch and TensorFlow on Phase 2, we use conda environments.

First you need to set up conda on Phase 2, as described here.

Then, to create the conda environment, you need to establish an internet connection on the Phase 2 login node using a reverse SSH tunnel, as described here.

Please make sure you perform the following steps only on the login node, since this is where the reverse SSH tunnel provides internet access. They will not work on a compute node.

The following steps create a conda environment with PyTorch v2.1.30 (i.e. torch 2.1.0 with Intel Extension for PyTorch 2.1.30) from scratch. The versions of the required Intel packages are chosen for compatibility. Please refer to the linked documentation for more information on versions.

This particular conda environment has been found to work with single tile, multi-tile and multi-node setups.

Steps for the conda environment with PyTorch v2.1.30:

  1. Create the conda environment (in this case, you have to use these downgraded setuptools and numpy versions):
    > source ~/.conda_init
    > conda create -n your_conda_env_pytorch_v2.1 python=3.9 setuptools=69.5.1 numpy=1.26.4 -y
    > conda activate your_conda_env_pytorch_v2.1
  2. Install PyTorch, Intel Extension for PyTorch, TorchVision and the compatible Intel® oneCCL Bindings for PyTorch:
    >  python -m pip install torch==2.1.0.post2 torchvision==0.16.0.post2 torchaudio==2.1.0.post2 intel-extension-for-pytorch==2.1.30.post0 oneccl_bind_pt==2.1.300+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
  3. Install any additional code dependencies:

    >  pip install -r requirements.txt

Workflow

Make sure you:

  1. Load the intel-toolkit module whose oneAPI version is compatible with your PyTorch/TensorFlow version.
  2. Activate the conda environment with the needed PyTorch/TensorFlow and corresponding oneCCL version.
  3. Set up any environment variables that your use case may need.

Here is an example sbatch script for submitting jobs to the Phase 2 compute nodes. Please note that it uses intel-toolkit/2024.1.0, which is compatible with PyTorch 2.1.30+xpu.

Example of the sbatch script for the AI workload that can be submitted to the Phase 2 compute node
#!/bin/bash
#SBATCH --partition=general
#SBATCH --time=00:30:00
#SBATCH --nodes=1 #increase it in case of multi-node jobs
#SBATCH --ntasks-per-node=8 #8 maximum possible tasks per node (8 tiles)
#SBATCH --account=your_account
#SBATCH --export=NONE
#SBATCH --job-name=your_job_name
#SBATCH --output=your_job_name-%j.out

module load slurm_setup

# load oneapi base and hpc (called intel-toolkit on phase 2)
module load intel-toolkit/2024.1.0    ## use this one for pytorch v2.1.30

# activate conda
source ~/.conda_init

# activate conda env
conda activate your_conda_env_pytorch_v2.1   ## conda env for pytorch v2.1.30

# environment variables to run multi-tile
export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=0.0,0.1,1.0,1.1,2.0,2.1,3.0,3.1  ## for all 4 gpus / 8 tiles
#export ZE_AFFINITY_MASK=0.0,0.1  ## for 1 gpu / 2 tiles
 
export ONECCL_BINDINGS_FOR_PYTORCH_ENV_VERBOSE=1

# environment variables for slurm
export NP=${SLURM_NTASKS}
export NNODES=${SLURM_NNODES}
export PPN=${SLURM_NTASKS_PER_NODE:-$(( NP / NNODES ))}
echo "NP =" $NP " PPN =" $PPN

# set up the master_addr/url for running torch distributed multi-node
export URL=$(mpirun -n 1 -ppn 1 hostname -I | awk '{print $1}')
echo "URL =" $URL

# run the python script
mpirun -n $NP -ppn $PPN -l python -u your_python_file.py
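
The ZE_AFFINITY_MASK value used in the script above follows the pattern gpu.tile. As a minimal sketch, assuming two tiles per GPU as in that script, the list can be generated instead of typed by hand; build_affinity_mask is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: build a ZE_AFFINITY_MASK value "gpu.tile,..." for the first N GPUs.
# Assumptions: two tiles per GPU by default, as in the script above;
# build_affinity_mask is a hypothetical helper name.
build_affinity_mask() {
    local ngpus=$1 tiles_per_gpu=${2:-2}
    local mask="" gpu tile
    for (( gpu = 0; gpu < ngpus; gpu++ )); do
        for (( tile = 0; tile < tiles_per_gpu; tile++ )); do
            # add a comma only if the mask already has an entry
            mask+="${mask:+,}${gpu}.${tile}"
        done
    done
    printf '%s\n' "$mask"
}

export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE
export ZE_AFFINITY_MASK=$(build_affinity_mask 4)   # all 4 GPUs / 8 tiles
echo "ZE_AFFINITY_MASK=${ZE_AFFINITY_MASK}"
```

Calling build_affinity_mask 1 gives the mask for a single GPU with two tiles, matching the commented-out line in the script above.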