9. Multi-Node Jobs on the LRZ AI Systems

Some jobs require more GPUs than a single node can provide. In this case, the --nodes argument can be set accordingly. The --gres argument then specifies the number of GPUs requested on each node, so the job receives the number of nodes times the per-node GPU count in total.

Interactive Jobs

For example, if you need 4 GPUs on the lrz-hgx-h100-94x4 partition, spread across 2 nodes with 2 GPUs each, you can type the following command:

$ salloc -p lrz-hgx-h100-94x4 --nodes=2 --gres=gpu:2
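Once the allocation is granted, it can be useful to confirm that it really spans the expected number of nodes. The following is a minimal sketch, not an official LRZ recipe: it assumes it is run in the shell opened by salloc, relies on the standard Slurm variables SLURM_JOB_NUM_NODES and SLURM_JOB_ID, and hard-codes the per-node GPU count to match the --gres=gpu:2 request above.

```shell
# Sketch: sanity-check a multi-node allocation.
# Outside of Slurm the fallback value of 1 node is used.
nodes="${SLURM_JOB_NUM_NODES:-1}"
gpus_per_node=2   # assumption: matches --gres=gpu:2 in the salloc call

# --gres counts GPUs per node, so the job total is a simple product.
total_gpus=$(( nodes * gpus_per_node ))
echo "Allocation: ${nodes} node(s) x ${gpus_per_node} GPU(s) = ${total_gpus} GPU(s)"

# Print the hostname of every allocated node (one task per node).
# Skipped when no allocation is active.
if [ -n "${SLURM_JOB_ID:-}" ]; then
  srun --ntasks-per-node=1 hostname
fi
```

With the salloc command above, one would expect the summary line to report 2 nodes and 4 GPUs, followed by two distinct hostnames.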

Batch Jobs

The situation is similar for batch jobs. The --nodes and --gres arguments need to be added to the script preamble, each preceded by the #SBATCH label. Afterwards, you can use the --ntasks-per-node argument within the srun command to set how many tasks are started on each node, as in the example below:

#!/bin/bash
#SBATCH -p lrz-hgx-h100-94x4
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH -o enroot_test.out
#SBATCH -e enroot_test.err

srun -N2 --ntasks-per-node=2 --container-mounts=path/to/source:path/to/destination \
     --container-image='path/to/your/image.sqsh' \
     python script.py --epochs 55 --batch-size 512
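In a multi-node run, each task started by srun usually needs to know its position within the job, for example to pick a GPU or to set up distributed training. Slurm exposes this through environment variables such as SLURM_PROCID, SLURM_LOCALID, SLURM_NODEID and SLURM_NTASKS. The sketch below shows one way to read them; the function name describe_task is hypothetical, and the fallback values only apply when the script is run outside of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: report where this task sits in the job.
# Slurm sets these variables for every task started by srun.
describe_task() {
  local rank="${SLURM_PROCID:-0}"         # global task index, starting at 0
  local local_rank="${SLURM_LOCALID:-0}"  # task index within its own node
  local world_size="${SLURM_NTASKS:-1}"   # total number of tasks
  local node_id="${SLURM_NODEID:-0}"      # index of the node within the job
  echo "task ${rank} of ${world_size}: node ${node_id}, local task ${local_rank}"
}

describe_task
```

If this were saved as a script and launched with srun -N2 --ntasks-per-node=2 in place of the python command, one would expect four output lines, with task indices 0 to 3 and local task indices 0 and 1 on each node.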