5.3 Slurm Batch Jobs - Multi-GPU
Some jobs, such as parallel machine learning training, require multiple GPUs. To request several GPUs within an allocation, use the --gres argument (e.g., --gres=gpu:4).
Many frameworks, especially those relying on MPI for parallelization (such as Horovod), require one process per GPU on each node. To meet this requirement, specify both the number of GPUs with --gres and the number of processes per node with --ntasks-per-node, so that there is a one-to-one mapping between processes and GPUs.
This becomes particularly important when the job spans multiple nodes. In such cases, --ntasks-per-node ensures that each node runs the correct number of processes. Without this setting, the process-to-GPU mapping may be inconsistent across nodes, leading to resource underutilization or misconfiguration.
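For illustration, the preamble of a two-node job that runs one process per GPU (four GPUs per node) could look like the following sketch; the partition name and the training script are placeholders, and the exact launch command depends on the framework and MPI build in use:

#!/bin/bash
#SBATCH --job-name=multi-gpu-mpi
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --partition=<partition>
#SBATCH --time=00:15:00

# srun starts 8 processes in total: 4 per node, one per GPU
# (<partition> and train.py are placeholders)
srun python train.py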
Examples
Interactive Jobs
For example, to request 4 GPUs on the lrz-dgx-1-v100x8 partition in an interactive allocation, use the following command:
salloc -p lrz-dgx-1-v100x8 --ntasks-per-node=4 --gres=gpu:4
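Once the allocation is granted, you can check that the requested GPUs are visible, for example by running nvidia-smi once inside the allocation:

srun --ntasks=1 nvidia-smi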
Batch Jobs
The --gres argument should be included in the script preamble using the #SBATCH directive to request the desired number of GPUs.
The srun command launches the container inside which torchrun is executed. Using one task per node (--ntasks-per-node=1, or simply --ntasks=1 for a single-node job as in the example below) ensures that only one container and one torchrun process are started per node. The torchrun command then handles the parallelization of the PyTorch script across the GPUs of that node.
Here is an example SLURM script that launches a distributed job across 2 GPUs on a single node of the test-v100x2 partition.
#!/bin/bash
#SBATCH --job-name=multi-gpu-single-node
#SBATCH --output=log-%j.out
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --partition=test-v100x2
#SBATCH --qos=testing
#SBATCH --time=00:15:00
# One container per node; torchrun launches one worker per GPU on the node
srun --ntasks=1 \
     --container-image="$HOME/nvidia+pytorch+23.10-py3.sqsh" \
     --container-mounts="$HOME/ai-systems-examples/ddp_scaling_benchmark:/workspace" \
     torchrun --standalone --nproc_per_node="${SLURM_GPUS_ON_NODE}" \
         benchmark.py --epochs 3 --steps-per-epoch 20 --warmup-steps 5
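Submit the script with sbatch; the output of the run appears in log-<jobid>.out, as set by the --output directive. If the same job is scaled to two nodes, one container and one torchrun process must be started on each node, and torchrun needs a rendezvous endpoint instead of --standalone. A possible variant is sketched below (a sketch only, not validated on this system; the rendezvous port 29500 and the availability of two nodes in this partition are assumptions):

#!/bin/bash
#SBATCH --job-name=multi-gpu-multi-node
#SBATCH --output=log-%j.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2
#SBATCH --partition=test-v100x2
#SBATCH --qos=testing
#SBATCH --time=00:15:00

# Use the first node in the allocation as the rendezvous host (port 29500 is an assumption)
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR

# One container and one torchrun process per node; torchrun starts one worker per GPU
srun --ntasks-per-node=1 \
     --container-image="$HOME/nvidia+pytorch+23.10-py3.sqsh" \
     --container-mounts="$HOME/ai-systems-examples/ddp_scaling_benchmark:/workspace" \
     torchrun --nnodes="${SLURM_NNODES}" --nproc_per_node="${SLURM_GPUS_ON_NODE}" \
         --rdzv_backend=c10d --rdzv_endpoint="${MASTER_ADDR}:29500" --rdzv_id="${SLURM_JOB_ID}" \
         benchmark.py --epochs 3 --steps-per-epoch 20 --warmup-steps 5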