Pilot Operation SuperMUC-NG Phase 2
Invitation Only
This documentation is for the pilot operation of SuperMUC-NG Phase 2. Access and participation are by invitation only until we start regular operation. The information on this page changes as we bring up the system. Please refrain from requesting participation at this point.
Introduction
At the beginning of 2024, we start the pilot operation of SuperMUC-NG Phase 2. Participation is limited to selected projects that help us identify flaws and fix problems. During this period, we expect benchmarking and tuning activities, with reservations of resources and frequent reboots of login and compute nodes. Please do not rely on a production-ready environment for the moment; focus instead on compilation, testing, and benchmarking.
More information on the hardware can be found here.
More information about the Intel GPU usage can be found on Intel's documentation page.
More information about Level Zero can be obtained here.
Access
Invited projects have been provided with the name of the login nodes of SuperMUC-NG Phase 2.
The same security measures are implemented as for Phase 1: static IPs must be registered/white-listed per project; the firewall does not allow outgoing internet access (please check our documentation for workarounds).
VNC on login nodes:
> module use /lrz/sys/graphics/vncserver/modules
> module load vncserver
> vncserver
The rest is as documented here (for GUI applications like vtune-gui, etc.).
File Systems
HOME, DSS
Same as on SuperMUC-NG Phase 1.
WORK
The file system is mounted from SuperMUC-NG Phase 1. Expect lower I/O performance for the moment.
SCRATCH
The file system is mounted from SuperMUC-NG Phase 1. Expect lower I/O performance for the moment.
DAOS
TBA. A description of how to operate DAOS will follow.
SLURM
We have set up the queues similarly to SuperMUC-NG Phase 1, except that there is no micro queue. Please check via sinfo.
#!/bin/bash
#SBATCH -J MolecularSuperscaling
#SBATCH --account=abc123def        # your project ID
#SBATCH --time=02:00:00
#SBATCH --export=NONE
#SBATCH --partition={test,general}
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8        # one MPI job per GPU tile
Interactive work via salloc is also possible with the same options.
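As a rough sketch, an interactive allocation that mirrors the batch header above could be assembled as follows. The function name salloc_phase2 and the variables ACCOUNT, PARTITION, TIME_LIMIT and DRY_RUN are hypothetical helpers; abc123def is the placeholder project ID from the job script, and the partition and time limit are assumed values. By default the helper only prints the command.

```shell
# Sketch: build an salloc command from the same options as the batch script.
# ACCOUNT, PARTITION, TIME_LIMIT and DRY_RUN are illustrative variables,
# not site-defined settings; adjust them to your project.
salloc_phase2() {
  set -- salloc \
    --account="${ACCOUNT:-abc123def}" \
    --partition="${PARTITION:-test}" \
    --time="${TIME_LIMIT:-00:30:00}" \
    --nodes=1 \
    --ntasks-per-node=8
  if [ "${DRY_RUN:-1}" = 1 ]; then
    # Default: only show what would be executed.
    echo "$*"
  else
    "$@"
  fi
}

salloc_phase2
```

To actually request the allocation, call the helper with DRY_RUN=0 after checking the printed command.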
Module System
With the new system, we are building up a new software stack that supports accelerated applications on the Intel Ponte Vecchio GPUs, so many of the CPU-only applications are hidden. Furthermore, we keep the number of modules loaded by default to a bare minimum.
> module list
Currently Loaded Modulefiles:
 1) admin/2.0   2) tempdir/1.0   3) lrz/1.0   4) spack/24.1.0   5) mpi_settings/1.0
In particular, we do not load default modules for the Intel compilers and Intel MPI. mpi_settings only provides environment variables that are independent of the MPI compiler wrappers and mpiexec. Nevertheless, in many cases the Intel oneAPI software stack should be your starting point to build and run your applications.
> module load intel-toolkit
Note: You are loading a complete Base and HPC Intel-OneAPI toolkit / environment.
This will remove any conflicting modules already loaded in your environment
Loading intel-toolkit/2024.0.0
  Loading requirement: intel/2024.0.0 intel-mpi/2021.11.0 intel-mkl/2024.0.0 intel-dpcpp-ct/2024.0.0 intel-inspector/2024.0.0 intel-ccl/2021.11.0 intel-dnn/2024.0.0 intel-itac/2022.0.0 intel-tbb/2021.11.0 intel-ipp/2021.10.0 intel-dal/2024.0.0 intel-ippcp/2021.9.0 intel-dpl/2022.3.0 intel-dpct/2024.0.0
You can add the load command to your .bashrc, but note that the toolkit is not yet available on SuperMUC-NG Phase 1 by default. It is therefore best to guard the command, e.g.
...
if [[ $(module info-loaded spack) =~ "spack/24" ]] ; then
    module load intel-toolkit
fi
...
Software
The following software provides dedicated/optimized versions for SuperMUC-NG Phase 2.
Compiler/Tools/MPI
Intel MPI
Intel MPI & Multi GPU Programming
Basic settings for offloading to the Ponte Vecchio GPUs with Intel MPI are defined in the module mpi_settings, which is loaded by default.
Documentation of the Intel MPI offloading variables I_MPI_OFFLOAD* can be found here <FIXME link>.
export I_MPI_OFFLOAD=1
export I_MPI_OFFLOAD_IPC=0                 # switches Xe Link off
export I_MPI_OFFLOAD_CELL_LIST=0,1         # assuming two single-tile cards, or a single card with two tiles
export I_MPI_OFFLOAD_L0_D2D_ENGINE_TYPE=1  # this improves performance
For OpenMP offload with GPU pinning to full devices (not tiles) and Xe Link communication (MPI), you can use the following settings (VTune profiling included).
module load intel-toolkit
module load intel-vtune
export I_MPI_OFFLOAD=1                     # enable offload
export I_MPI_OFFLOAD_IPC=1                 # Xe Link on
export I_MPI_OFFLOAD_RDMA=1                # full RDMA features
export I_MPI_OFFLOAD_L0_D2D_ENGINE_TYPE=1  # device to device communication
export I_MPI_DEBUG=3                       # increase debug level to see MPI rank to GPU pinning
export I_MPI_OFFLOAD_CELL=device           # use GPUs instead of Tiles (default: tile)
mpiexec -l -n 4 vtune -quiet -collect hpc-performance -trace-mpi -result-dir res4_xe_hpc ./mpi_omp_offload_prog
mpiexec -l -n 4 vtune -quiet -collect gpu-hotspots -knob analyze-xelink-usage=true -result-dir res4_xe_gpu ./mpi_omp_offload_prog
(The program needs to be compiled with -fiopenmp -fopenmp-targets=spir64.) The output produced with I_MPI_DEBUG=3 contains
[0] [0] MPI startup(): ===== GPU topology on i20r01c03s04 =====
[0] [0] MPI startup(): NUMA nodes : 2
[0] [0] MPI startup(): GPUs       : 4
[0] [0] MPI startup(): Tiles      : 8
[0] [0] MPI startup(): NUMA Id   GPU Id   Tiles        Ranks on this NUMA
[0] [0] MPI startup(): 0         0,1      (0,1)(2,3)   0,1
[0] [0] MPI startup(): 1         2,3      (4,5)(6,7)   2,3
[0] [0] MPI startup(): ===== GPU pinning on i20r01c03s04 =====
[0] [0] MPI startup(): Rank   Pin tile
[0] [0] MPI startup(): 0      {0,1}
[0] [0] MPI startup(): 1      {2,3}
[0] [0] MPI startup(): 2      {4,5}
[0] [0] MPI startup(): 3      {6,7}
where Socket (NUMA) IDs, GPU IDs, Tile IDs and MPI Rank IDs are associated.
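In longer job logs, the pinning table can be hard to spot. The following is a minimal sketch of a filter that extracts the rank-to-tile pairs; it assumes the "GPU pinning" table layout shown above, and the function name show_gpu_pinning is a hypothetical helper, not part of Intel MPI.

```shell
# Sketch: print "rank -> tiles" pairs from I_MPI_DEBUG=3 output read on stdin.
# Assumes the "GPU pinning" table format shown above.
show_gpu_pinning() {
  awk '
    # Start of the pinning table
    /===== GPU pinning on/           { in_table = 1; next }
    # Skip the column header row
    in_table && /Rank +Pin +tile/    { next }
    # Data row: last field is the tile set, the field before it is the rank
    in_table && $NF ~ /^[{].*[}]$/   { print "rank " $(NF-1) " -> tiles " $NF; next }
    # Any other line ends the table
    in_table                         { in_table = 0 }
  '
}
```

A possible use is to pipe the program output through the filter, e.g. mpiexec -l -n 4 ./mpi_omp_offload_prog 2>&1 | show_gpu_pinning.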
Intel Tool Kit - Bundled Libraries
The following libraries are bundled with intel-toolkit: intel-vtune, intel-inspector, intel-ccl, intel-dnn, intel-itac, intel-tbb, intel-ipp, intel-dal, intel-ippcp, intel-dpl, intel-dpct. To use any of these libraries, you need to
module load intel-toolkit/version
module load intel-bundledLibrary
An alternative, better approach is
module show intel-toolkit/version
module use intel-bundledLibrary
Intel Software Stack
Intel oneAPI compilers are loaded with the intel module (also loaded in turn by the intel-toolkit module): module load intel.
The LLVM compilers ifx, icx, and icpx (for Fortran, C and C++ respectively) should be used as default and must be adopted when targeting GPUs (with one of OpenMP offload, SYCL or OpenCL).
All Fortran, C and C++ workloads must migrate to LLVM. Among the Intel classic compilers, only ifort is present; it can be used only when targeting CPUs, and only as a fallback.
Migration from the classic to the LLVM compilers should be straightforward, as the syntax is in large part backwards compatible.
More assistance can be found here about Fortran compilation, and here about C/C++. These references include a list of feature changes and a few examples (e.g. compiling for OpenMP offload).
In case you have SYCL/DPC++ code targeting GPU and/or CPU, you should use icpx with the -fsycl option, or refer to your application manual for more detailed compilation instructions.
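The mapping from source language to LLVM compiler driver described above can be captured in a small helper for build scripts. This is only an illustrative sketch: the function name pick_intel_compiler and the list of file extensions are assumptions, not something the software stack provides.

```shell
# Sketch: choose the Intel LLVM compiler driver by source file extension.
# The function name and the extension list are illustrative assumptions.
pick_intel_compiler() {
  case "$1" in
    *.f90|*.F90|*.f|*.F) echo ifx ;;    # Fortran
    *.c)                 echo icx ;;    # C
    *.cpp|*.cxx|*.cc)    echo icpx ;;   # C++ (add -fsycl for SYCL/DPC++ code)
    *) echo "unknown source type: $1" >&2; return 1 ;;
  esac
}
```

For example, "$(pick_intel_compiler app.f90)" -fiopenmp -fopenmp-targets=spir64 -o app app.f90 would invoke ifx with the OpenMP offload flags used elsewhere on this page.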
Debugger
GDB, DDT, and TotalView are available. (docu)
lldb is included with LLVM.
LLVM
Until we provide LLVM via Spack, there is a temporary solution.
> module use /lrz/sys/share/modules/extfiles
> module av llvm
----------------------- /lrz/sys/share/modules/extfiles -----------------------
llvm/15.0.7  llvm/17.0.6
GCC
> module av gcc
-------------- /lrz/sys/spack/release/24.1.0/modules/compilers ---------------
gcc/8.5.0  gcc/9.5.0  gcc/10.5.0  gcc/11.4.0  gcc/12.3.0  gcc/13.2.0
User-Spack
Julia (OneAPI)
For extended oneAPI julia package documentation, please consult the respective docu page.
Example: OneMKL example
Julia with PETSc
With internet connection active (see proxy above)
See also the documentation here (no guarantees; it seems somewhat outdated).
AdaptiveCpp
AdaptiveCpp (formerly hipSYCL/openSYCL) is the independent, community-driven modern platform for C++-based heterogeneous programming models targeting CPUs and GPUs from all major vendors.
> module use /lrz/sys/share/modules/extfiles/AdaptiveCpp
> module load adaptiveCpp
> acpp-info -l
=================Backend information===================
Loaded backend 0: OpenMP
  Found device: hipSYCL OpenMP host device
Loaded backend 1: Level Zero
  Found device: Intel(R) Data Center GPU Max 1550
  Found device: Intel(R) Data Center GPU Max 1550
  Found device: Intel(R) Data Center GPU Max 1550
  Found device: Intel(R) Data Center GPU Max 1550
> pip install --user github-clone   # *)
> ghclone https://github.com/AdaptiveCpp/AdaptiveCpp/tree/a1647a832ce81eed6d5b5972b7bbd76a45473150/examples
> cd examples/bruteforce_nbody
> acpp --acpp-targets=generic -o bruteforce_nbody bruteforce_nbody.cpp -O3
> ./bruteforce_nbody   # **)
# or
# > cd examples
# > cmake -S . -B build
# > cmake --build build
# > ./build/bruteforce_nbody/bruteforce_nbody   # **)
...
*) for SuperMUC-NG internet access see here.
**) check GPU involvement e.g. via xpu-smi dump -d 0,1,2,3 -m 0,1,2,3, or use onetrace, oneprof, ... (see pti-gpu below)
UPC++ (experimental)
UPC++ is a PGAS C++ parallel framework, based on GASNet. Documentation
Usage:
> module use /lrz/sys/share/modules/extfiles
> module load upcxx
> upcxx-info
Other "conduits" can be set via UPCXX_NETWORK. ucx is the default (although it is declared "experimental"). mpi and smp also work, but mpi is not recommended, and smp is shared-memory only.
OpenCL
Querying the available OpenCL platforms and devices and displaying their properties:
> clinfo          # Display properties of all available devices
> clinfo --list   # List platforms
> clinfo -d 0:1   # Show information about device 1 from platform 0
Performance Measurement
Intel Performance Suite (vtune)
> module load intel-toolkit   # sets, among others, a new "module use" path
> module load intel-vtune
PTI-GPU
> module use /lrz/sys/share/modules/extfiles
> module load pti-gpu
> ls $PTI_GPU_BASE/bin | grep -v .so   # for a list of available tools
...
> onetrace --version
0.49.22
> onetrace --help
gpuinfo, unfortunately, does not work yet.
XPU-SMI
xpu-smi is installed on the system. It is a very basic GPU monitoring tool. Open a terminal on the node where you want to see the GPU usage and execute, for instance,
> xpu-smi dump -d 0,1,2,3 -m 0,1,2,3
OpenMP
For applications using OpenMP offloading, you can print a tabular kernel runtime profile by setting the environment variable LIBOMPTARGET_PLUGIN_PROFILE=T, i.e.
> ifx -fiopenmp -fopenmp-targets=spir64 -o app app.f90
> LIBOMPTARGET_PLUGIN_PROFILE=T ./app
Applications
Molecular Dynamics / Chemistry
For running Gromacs the following jobscript can be used:
#!/bin/bash
#SBATCH -J MolecularSuperscaling
#SBATCH --account=........
#SBATCH --time=02:00:00
#SBATCH --export=NONE
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
export OMP_NUM_THREADS=8
module load slurm_setup
module load gromacs/2023.3-intel-impi-openmp-r32
srun -n 8 gmx mdrun -s relax_ethanol_water_novsite_995334.tpr -nb gpu -pme gpu -update gpu -ntomp 8 -tunepme -npme 1
This script starts 8 MPI tasks (--ntasks-per-node=8), each on a separate GPU tile. Each task uses 8 CPU cores (OMP_NUM_THREADS=8, -ntomp 8). Only one tile is used for the PME calculation (-npme 1); all particle calculations are carried out on the remaining GPU tiles. For efficient scaling to a larger number of GPUs, the PME calculation would need to be split over several GPUs. Although this is possible, it does not yield sufficient results at the moment and is not supported for now.
Amber with SYCL implementation of pmemd
To use Amber with Intel PVC GPUs, you can follow these general steps. Keep in mind that this Amber 20 release includes the SYCL implementation of pmemd to enable simulations on Intel GPUs. Please note that this release does not include all features.
> module load amber/20_sycl
> mpirun -np 2 pmemd.sycl_SPFP.MPI -O -i relax.in -p 3981318_atoms.parm7 -c 3981318_atoms_eq_0.rst -r 3981318_atoms_bench_sycl.rst -o 3981318_atoms_bench_sycl0.out -x 3981318_atoms_bench_sycl0.crd
Using the Intel MPI offloading setup described above may improve performance. This command starts Amber on two tiles. Please note that even though Amber can run on more than two tiles, performance deteriorates significantly if the simulation is split across more than one GPU. Therefore, for the moment, we do not recommend using more than two tiles per simulation with Amber.
cp2k
<fixme: point to appropriate module; example script>
PDE Solving Frameworks
Kokkos/SYCL
Build from source (the user_spack kokkos package.py does not support sapphirerapids, and the oneDPL dependency is not correctly specified):
> module load intel-toolkit   # (cmake)
> git clone https://github.com/kokkos/kokkos.git
> cmake -S kokkos -B build -DBUILD_SHARED_LIBS=ON -DKokkos_ARCH_INTEL_PVC=ON -DKokkos_ARCH_SPR=ON -DKokkos_ENABLE_EXAMPLES=ON -DKokkos_ENABLE_ONEDPL=ON -DKokkos_ENABLE_SYCL=ON
> cmake --build build -j 30
> cmake --install build --prefix <desired-install-path>
Finally, add bin and lib64 to PATH and LD_LIBRARY_PATH/CMAKE_PREFIX_PATH, respectively.
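A minimal sketch of these environment settings is given below. KOKKOS_PREFIX is a hypothetical variable standing for the path passed to cmake --install, and the default $HOME/kokkos-install is only an assumed example location.

```shell
# Sketch: expose a Kokkos installation to the shell and to CMake.
# KOKKOS_PREFIX is a hypothetical variable; set it to the path that was
# passed to "cmake --install build --prefix ...".
KOKKOS_PREFIX="${KOKKOS_PREFIX:-$HOME/kokkos-install}"

# Prepend the new directories and keep any existing entries.
export PATH="$KOKKOS_PREFIX/bin${PATH:+:$PATH}"
export LD_LIBRARY_PATH="$KOKKOS_PREFIX/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export CMAKE_PREFIX_PATH="$KOKKOS_PREFIX/lib64${CMAKE_PREFIX_PATH:+:$CMAKE_PREFIX_PATH}"
```

These lines can be placed in a job script or sourced from a small setup file before configuring an application that depends on this Kokkos build.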
Example Application
<to be fixed>
PETSc/Kokkos/SYCL
Installation via user_spack
> module load intel-toolkit user_spack
> export CMAKE_PREFIX_PATH=/dss/lrzsys/sys/spack/release/24.1.0/opt/x86_64/intel-toolkit/2024.0.0/intel-oneapi-dpl/2022.3.0-gcc-qpc3i3d/dpl/2022.3/lib/cmake/oneDPL
> spack install -j 30 --dirty petsc@3.20.1%oneapi +sycl+int64+kokkos ^kokkos@develop%oneapi+hwloc intel_gpu_arch=intel_pvc ^kokkos-kernels@develop%oneapi
> spack install -j 30 py-petsc4py ^petsc /<hash of petsc>
--dirty was used here because of a missing oneDPL dependency in Kokkos's CMakeLists.txt in this Spack version. Furthermore, sapphirerapids is not supported here, but this does not seem to affect the functioning of PETSc. We suggest using the "own repo approach of user_spack" instead, adding "sapphirerapids": "SPR" to the spack_micro_arch_map, and depends_on("intel-oneapi-dpl", when="+sycl").
Please check whether this has already been fixed!
Installed Version
> module use /lrz/sys/share/modules/extfiles
> module av petsc
-------------------------- /lrz/sys/share/modules/extfiles --------------------------------
petsc/3.20.1-intel24-impi-real
Also installed is py-petsc4py.
Example Application
> module load py-petsc4py
> module load python/3.10.12-extended
# > export I_MPI_HYDRA_BOOTSTRAP=fork I_MPI_FABRICS=shm   # on login nodes; otherwise salloc
> export ONEAPI_DEVICE_SELECTOR=level_zero:gpu   # for offload to the GPU
> export SYCL_PI_TRACE=2   # to see L0 offload; onetrace can also be used
> python poisson2d.py   # a lot of L0 output
# or
>>> from petsc4py import PETSc
The poisson2d.py example was taken from here: https://petsc.org/release/petsc4py/demo/poisson2d/poisson2d.html
Machine Learning and AI
conda Environments
To run PyTorch and TensorFlow on Phase 2, we use conda environments.
First, you need to set up conda on Phase 2, as described here.
Then, to create the conda environment, you need to establish an internet connection on the Phase 2 login node using a reverse SSH tunnel, as described here.
Please make sure you perform the following steps only on the login node, since this is where you have an internet connection through the reverse SSH tunnel. They will not work on a compute node.
The following steps create a conda environment with PyTorch v2.1 and Intel Extension for PyTorch v2.1.30 from scratch. The versions of the necessary Intel packages are chosen for compatibility. Please refer to the documentation in the links for more information on versions.
This particular conda environment has been found to work with single-tile, multi-tile and multi-node setups.
Steps for a conda environment with PyTorch v2.1 (Intel Extension for PyTorch v2.1.30):
- Create the conda environment (in this case, you have to use these downgraded setuptools and numpy versions):
> source ~/.conda_init
> conda create -n your_conda_env_pytorch_v2.1 python=3.9 setuptools=69.5.1 numpy=1.26.4 -y
> conda activate your_conda_env_pytorch_v2.1
- Install PyTorch, Intel Extension for PyTorch, TorchVision and the compatible Intel® oneCCL Bindings for PyTorch libraries:
> python -m pip install torch==2.1.0.post2 torchvision==0.16.0.post2 torchaudio==2.1.0.post2 intel-extension-for-pytorch==2.1.30.post0 oneccl_bind_pt==2.1.300+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
- Install any additional code dependencies:
> pip install -r requirements.txt
Workflow
Make sure you:
- Load the Intel toolkit module whose oneAPI version is compatible with your PyTorch/TensorFlow version.
- Activate the conda environment with the needed PyTorch/TensorFlow and corresponding oneCCL version.
- Set up any environment variables that your use case may need.
Here is an example of the sbatch script that can be used to submit jobs to be run on a Phase 2 compute node. Please note that this uses intel-toolkit/2024.1.0, which is compatible with Intel Extension for PyTorch 2.1.30+xpu.
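The following is a minimal sketch of such a job script, not an official template. The job name, the account placeholder abc123def, the partition, the time limit, the single-task layout and the training script train.py are assumptions for illustration; the module and conda environment names are taken from the steps above. The snippet writes the script to pytorch_job.sh and submits it only if sbatch is available.

```shell
# Sketch: write a job script for a PyTorch run and submit it if sbatch exists.
# Account, partition, time limit, task layout and train.py are placeholders.
cat > pytorch_job.sh <<'EOF'
#!/bin/bash
#SBATCH -J pytorch_xpu
#SBATCH --account=abc123def
#SBATCH --time=01:00:00
#SBATCH --export=NONE
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

module load slurm_setup
module load intel-toolkit/2024.1.0   # oneAPI version matching the PyTorch build

source ~/.conda_init
conda activate your_conda_env_pytorch_v2.1

# Report how many Intel GPU ("xpu") devices PyTorch can see, then run the job.
python -c "import torch, intel_extension_for_pytorch; print(torch.xpu.device_count())"
python train.py
EOF

if command -v sbatch >/dev/null 2>&1; then
  sbatch pytorch_job.sh
else
  echo "sbatch not found; wrote pytorch_job.sh only"
fi
```

Any environment variables your use case needs can be exported in the generated script before the python calls.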
Useful links
- Intel Extension for PyTorch documentation: https://intel.github.io/intel-extension-for-pytorch/xpu/latest/#
- Intel Extension for PyTorch installation links: https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
- Intel Extension for Tensorflow documentation: https://intel.github.io/intel-extension-for-tensorflow/latest/get_started.html