5.4 Slurm Batch Jobs - Multi Node
Some jobs require more GPUs than are available on a single node. In such cases, the --nodes argument can be used to request multiple nodes. The --gres argument then specifies the number of GPUs per node, not the total number of GPUs for the entire job.
The same considerations described for single-node, multi-GPU workloads apply to the --ntasks-per-node option: as a best practice, specify it directly on the srun command line rather than in the #SBATCH header, as done in the batch example below.
Examples
Interactive Jobs
For example, if you need 8 GPUs on the lrz-hgx-h100-94x4 partition (2 nodes with 4 GPUs each), you can type the following command:
$ salloc -p lrz-hgx-h100-94x4 --nodes=2 --gres=gpu:4
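Once the allocation is granted, commands started with srun run on the allocated nodes. As a quick sanity check (a sketch, assuming nvidia-smi is available on the compute nodes), the following should list the 4 GPUs of each of the two nodes:
$ srun --ntasks-per-node=1 nvidia-smi -L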
Batch Jobs
This SLURM script example launches a distributed job across 2 nodes on the lrz-hgx-h100-94x4 partition, with 4 GPUs per node (8 GPUs total).
Training is launched using torchrun, which applies when the underlying deep learning framework is PyTorch.
#!/bin/bash
#SBATCH --output=%x-%j-%t.out
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --partition=lrz-hgx-h100-94x4
#SBATCH --time=00:15:00
export NCCL_DEBUG=WARN # or INFO, VERSION
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_GPUS_ON_NODE)) # The total number of GPUs allocated
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
export MASTER_PORT=29353 # choose a free port or generate dynamically
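# Illustrative only: one possible way to derive a job-specific port from the job ID
# export MASTER_PORT=$((29000 + SLURM_JOB_ID % 1000))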
echo "##########Inside jobfile.sh#################"
echo "World size: $WORLD_SIZE"
echo "Master address: $MASTER_ADDR"
echo "Master port: $MASTER_PORT"
echo "###########################"
# srun launches the container inside which torchrun is executed
# the ntasks-per-node=1 is required to launch one container and one torchrun process per node
# torchrun will then handle the parallelization of the training script
echo "##########Inside srun#################"
srun -N$SLURM_NNODES \
     --ntasks-per-node=1 \
     --container-mounts=/path/to/source:/path/to/destination \
     --container-image=/path/to/your/container/image.sqsh \
     torchrun \
         --nproc_per_node=$SLURM_GPUS_ON_NODE \
         --nnodes=$SLURM_NNODES \
         --rdzv_id=$SLURM_JOB_ID \
         --rdzv_backend=c10d \
         --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
         pytorch_script.py
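Assuming the script above is saved as jobfile.sh (the name used in its echo statements), it can be submitted and monitored as usual; the output file is named according to the --output pattern (job name, job ID, task ID):
$ sbatch jobfile.sh
$ squeue -u $USER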