5.4 Slurm Batch Jobs - Multi-Node

Some jobs require more GPUs than a single node provides. In such cases, use the --nodes argument to request multiple nodes. The --gres argument then specifies the number of GPUs per node, not the total for the entire job; the job's total GPU count is therefore --nodes multiplied by the per-node --gres value.

The same considerations described for single-node, multi-GPU workloads apply to the --ntasks-per-node option. As a best practice, specify it directly with srun rather than in the batch script header.
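
For instance, inside a 2-node allocation with 4 GPUs per node, passing the task count to srun starts one task per GPU. A minimal sketch (script.py stands in for your own program):

srun --nodes=2 --ntasks-per-node=4 python script.py    # 2 nodes x 4 tasks per node = 8 tasks, one per GPU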

Examples

Interactive Jobs

For example, if you need 8 GPUs on the lrz-hgx-h100-94x4 partition, request 2 nodes with 4 GPUs each:

$ salloc -p lrz-hgx-h100-94x4 --nodes=2 --gres=gpu:4
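
Once the allocation has been granted, use srun inside it to start the actual tasks. A minimal sketch, reusing the placeholder container image from the batch example below:

$ srun --ntasks-per-node=4 --container-image='path/to/your/image.sqsh' python script.py

This starts 4 tasks on each of the 2 allocated nodes, i.e. one task per GPU.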

Batch Jobs

The following Slurm script launches a distributed job across 2 nodes on the lrz-hgx-h100-94x4 partition, with 4 GPUs per node (8 GPUs in total):

#!/bin/bash
#SBATCH -p lrz-hgx-h100-94x4    # target partition (4 GPUs per node)
#SBATCH --nodes=2               # number of nodes
#SBATCH --gres=gpu:4            # GPUs per node, not per job (8 GPUs in total)
#SBATCH -o log_%j.out           # standard output; %j expands to the job ID
#SBATCH -e log_%j.err           # standard error

# One task per GPU on every node; --ntasks-per-node is passed to srun directly.
srun --nodes=2 --ntasks-per-node=4 --container-mounts=path/to/source:path/to/destination \
     --container-image='path/to/your/image.sqsh' \
     python script.py --epochs 55 --batch-size 512
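
Submit the script with sbatch; the filename below is only a placeholder:

$ sbatch multi_node_job.sh
$ squeue -u $USER             # check the job's state in the queue
$ tail -f log_<jobid>.out     # follow the standard output once the job is running

At runtime, srun starts 8 tasks in total (2 nodes x 4 tasks per node). Each task can identify itself through environment variables such as SLURM_PROCID (global task rank), SLURM_LOCALID (rank within its node), and SLURM_NTASKS (total number of tasks), which distributed training frameworks typically use to assign ranks and GPUs.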