5.4 Slurm Batch Jobs - Multi Node

Some jobs require more GPUs than are available on a single node. In such cases, the --nodes argument can be used to request multiple nodes. The --gres argument then specifies the number of GPUs per node, not the total number of GPUs for the entire job.

The considerations described for single-node, multi-GPU workloads apply to the --ntasks-per-node option here as well. As a best practice, specify it directly on the srun command line rather than in the #SBATCH header, as in the sketch below.
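
For example, rather than adding #SBATCH --ntasks-per-node=1 to the job script header, the option can be passed on the srun line itself (./train.sh is only a placeholder for the command being launched):

srun --ntasks-per-node=1 ./train.sh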

Examples

Interactive Jobs

For example, to obtain 8 GPUs on the lrz-hgx-h100-94x4 partition as 2 nodes with 4 GPUs each, run:

$ salloc -p lrz-hgx-h100-94x4 --nodes=2 --gres=gpu:4
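
The allocation only reserves the resources; commands still have to be started on the allocated nodes with srun. As a quick check (nvidia-smi is used here only as a placeholder workload), one task per node can list the GPUs it sees:

$ srun --ntasks-per-node=1 nvidia-smi -L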

Batch Jobs

This Slurm script example launches a distributed job across 2 nodes of the lrz-hgx-h100-94x4 partition, with 4 GPUs per node (8 GPUs in total).

The script is launched with torchrun, which applies when the underlying deep learning framework is PyTorch.

#!/bin/bash
#SBATCH --output=%x-%j-%t.out
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --partition=lrz-hgx-h100-94x4
#SBATCH --time=00:15:00

export NCCL_DEBUG=WARN # or INFO, VERSION

export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_GPUS_ON_NODE)) # The total number of GPUs allocated
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1) # first node in the allocation acts as rendezvous host
export MASTER_PORT=29353  # choose a free port or generate dynamically
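# Alternative (a sketch, not part of the original script): derive the port from
# the job ID so that jobs sharing a node do not collide on a fixed port.
# export MASTER_PORT=$((29000 + SLURM_JOB_ID % 1000))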

echo "##########Inside jobfile.sh#################"
echo "World size: $WORLD_SIZE"
echo "Master address: $MASTER_ADDR"
echo "Master port: $MASTER_PORT"
echo "###########################"

# srun launches the container inside which torchrun is executed
# --ntasks-per-node=1 is required so that exactly one container and one torchrun process are launched per node
# torchrun will then handle the parallelization of the training script

echo "##########Inside srun#################"
srun -N$SLURM_NNODES \
     --ntasks-per-node=1 \
     --container-mounts=/path/to/source:/path/to/destination \
     --container-image=/path/to/your/container/image.sqsh \
     torchrun \
        --nproc_per_node=$SLURM_GPUS_ON_NODE \
        --nnodes=$SLURM_NNODES \
        --node_rank=$SLURM_NODEID \
        --rdzv_id=$SLURM_JOB_ID \
        --rdzv_backend=c10d \
        --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
        pytorch_script.py
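
The training script itself must initialize torch.distributed using the environment variables that torchrun sets for every process it spawns. The following is a minimal sketch of what pytorch_script.py could look like (an illustration under that assumption, not the actual training script):

import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    torch.cuda.set_device(local_rank)           # bind this process to one GPU

    # The default "env://" initialization reads MASTER_ADDR, MASTER_PORT,
    # RANK and WORLD_SIZE from the environment prepared by torchrun.
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    print(f"Rank {rank}/{world_size} running on local GPU {local_rank}")

    # ... build the model, wrap it in torch.nn.parallel.DistributedDataParallel,
    # and run the training loop here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()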