9. Multi-Node Jobs on the LRZ AI Systems
Some jobs need more GPUs than a single node can provide. In this case, set the --nodes argument accordingly. The --gres argument then specifies the number of GPUs requested on each node, not the total across the whole job.
Interactive Jobs
For example, if you need 4 GPUs in total on the lrz-hgx-h100-94x4 partition, spread over 2 nodes with 2 GPUs each, you can type the following command:
$ salloc -p lrz-hgx-h100-94x4 --nodes=2 --gres=gpu:2
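Because --gres counts GPUs per node, the total for the job is the product of the two arguments. The following is a minimal sketch of that arithmetic, using the values from the command above; the variable names are illustrative only and are not part of Slurm.

```shell
#!/bin/bash
# Values taken from the salloc example; adjust them for your own request.
nodes=2            # value passed to --nodes
gpus_per_node=2    # value passed to --gres=gpu:<n>, counted per node

# Total number of GPUs the job receives across all nodes.
total_gpus=$((nodes * gpus_per_node))

# Print the matching salloc command together with the resulting total.
echo "salloc -p lrz-hgx-h100-94x4 --nodes=${nodes} --gres=gpu:${gpus_per_node}  # ${total_gpus} GPUs in total"
```

Running the sketch only prints the command; it does not request an allocation.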
Batch Jobs
The situation is similar for batch jobs. Add the --nodes and --gres arguments to the script preamble, each preceded by the #SBATCH label. You can then use the --ntasks-per-node argument within the srun command as indicated above. An example is as follows:
#!/bin/bash
#SBATCH -p lrz-hgx-h100-94x4
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH -o enroot_test.out
#SBATCH -e enroot_test.err

srun -N2 --ntasks-per-node=2 --container-mounts=path/to/source:path/to/destination \
     --container-image='path/to/your/image.sqsh' \
     python script.py --epochs 55 --batch-size 512
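With 2 nodes and --ntasks-per-node=2, srun starts 4 tasks in total, and each task can find out which one it is from the environment variables that Slurm sets for it. The following is a minimal sketch of a wrapper that reads them. The variable names on the left are illustrative, and the fallback values are an assumption added here so that the script also runs outside of srun.

```shell
#!/bin/bash
# Read the task identity that srun exports for every task.
# The defaults after ":-" apply only when the script runs outside of Slurm.
rank="${SLURM_PROCID:-0}"        # global task index, 0 .. SLURM_NTASKS-1
world_size="${SLURM_NTASKS:-1}"  # total number of tasks in the job step
local_rank="${SLURM_LOCALID:-0}" # task index within the current node
node_id="${SLURM_NODEID:-0}"     # index of the current node in the allocation

echo "task ${rank} of ${world_size}, node ${node_id}, local rank ${local_rank}"
```

If such a wrapper were launched through the srun line above in place of the python command, each of the 4 tasks would print its own line, which is a quick way to check that the tasks are distributed across both nodes as expected.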