5.3 Slurm Batch Jobs - Multi-GPU

Some jobs, such as parallel machine learning training, require the use of multiple GPUs. To request several GPUs within an allocation, use the --gres argument (e.g., --gres=gpu:4).
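
For instance, a minimal sketch of requesting four GPUs in a batch-script preamble, with <partition> as a placeholder partition name:

#SBATCH -p <partition>
#SBATCH --gres=gpu:4

The same --gres argument is accepted by salloc for interactive allocations, as shown in the examples below.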

Many frameworks, especially those relying on MPI for parallelization (such as Horovod), require one process per GPU on each node. To meet this requirement, it is essential to specify both the number of GPUs with --gres and the number of processes per node using --ntasks-per-node, ensuring a one-to-one mapping between processes and GPUs.

This becomes particularly important when the job spans multiple nodes. In such cases, --ntasks-per-node ensures that each node runs the correct number of processes. Without this setting, the process-to-GPU mapping may be inconsistent across nodes, leading to resource underutilization or misconfiguration.
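
As a sketch, assuming a hypothetical partition whose nodes provide four GPUs each, a two-node allocation with a one-to-one process-to-GPU mapping could be requested as follows (note that --gres counts GPUs per node, so this yields eight processes and eight GPUs in total):

salloc -p <partition> --nodes=2 --ntasks-per-node=4 --gres=gpu:4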

Examples

Interactive Jobs

For example, to request all 8 GPUs of a node in the lrz-dgx-1-v100x8 partition for an interactive job, use the following command:

salloc -p lrz-dgx-1-v100x8 --ntasks-per-node=8 --gres=gpu:8
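
Once the allocation is granted, a quick sanity check is to list the GPUs visible inside it, for example (a sketch; nvidia-smi is assumed to be available on the compute node):

srun --ntasks=1 nvidia-smi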

Batch Jobs

The --gres argument should be included in the script preamble using the #SBATCH directive to request the desired number of GPUs. Additionally, the --ntasks-per-node argument can be specified with the srun command to ensure the correct number of parallel tasks per node. 

When running MPI-based jobs, it is recommended to specify --ntasks-per-node directly in the srun command. This ensures the process layout is explicitly defined for the MPI launch, which is particularly important for correct GPU binding and process-to-core mapping.

Additionally, to enable MPI support for the job step, include --mpi=pmi2 in the srun command; this launches the tasks using the PMI2 interface. If your job does not rely on MPI, this option is not required.

Here’s an example:

#!/bin/bash
#SBATCH -p lrz-dgx-1-v100x8
#SBATCH --gres=gpu:8
#SBATCH -o log_%j.out
#SBATCH -e log_%j.err

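# Launch one task per GPU (eight per node) with MPI support via PMI2,
# running inside the Horovod container; ./data-test is mounted at /mnt/data-test.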
srun --mpi=pmi2 --ntasks-per-node=8 --container-mounts=./data-test:/mnt/data-test \
     --container-image='horovod/horovod:0.16.4-tf1.12.0-torch1.1.0-mxnet1.4.1-py3.5' \
     python script.py --epochs 55 --batch-size 512
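
Assuming the script is saved as, for example, multi_gpu_job.sbatch (a hypothetical file name), it can be submitted and monitored with the usual Slurm commands; the output and error streams end up in log_<jobid>.out and log_<jobid>.err, as configured by the -o and -e directives above:

sbatch multi_gpu_job.sbatch
squeue -u $USER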