5.2 Slurm Batch Jobs - Single GPU

Batch jobs are non-interactive and the preferred way of using the LRZ AI Systems.
In a batch job, the allocation of resources and job submission are done in a single step.
If no resources are available, the job queues until the requested allocation is possible.

Slurm

SLURM is a resource manager designed for multi-user systems. If the requested resources are not immediately available, the job is placed in a queue until the allocation can be granted. You can check the queue using squeue, or view only your own jobs with squeue --me. SLURM schedules jobs using policies such as fair-share to balance access among users. Batch jobs are executed automatically once scheduled.

SLURM provides a set of command-line tools, often referred to as s-commands, such as sinfo, salloc, srun, sbatch, and squeue, which are used to submit, allocate, run, and monitor jobs on the system. For more details, see the official SLURM documentation.
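
For example, the following invocations are commonly used to inspect the partitions and your own jobs:

sinfo                                   # List partitions and their current state
squeue --me                             # Show only your own pending and running jobs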

Slurm Essentials

The sbatch command submits jobs that are described in a file with a special format, usually referred to as a batch script.
Once the script is created, it is submitted as follows:

sbatch example.sbatch 
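
If the submission succeeds, sbatch prints the ID assigned to the job, for example (the ID shown is only an illustration):

Submitted batch job 123456

This job ID identifies the job in the squeue output and can be passed to scancel (e.g. scancel 123456) to cancel the job.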

An example of a batch script is shown below.

#!/bin/bash
#SBATCH -p lrz-v100x2                   # Select partition (use sinfo)
#SBATCH --gres=gpu:1                    # Request 1 GPU
#SBATCH -o log_%j.out                   # File to store standard output
#SBATCH -e log_%j.err                   # File to store standard error

echo "Start on $(hostname) at $(date)"  # Run outside of srun
srun command                            # Run the actual GPU-enabled command with srun


The first part of a batch script is the preamble, which consists of the shebang line (#!) and the lines starting with #SBATCH. This section defines the resource allocation required to run the job, such as the partition, the number of GPUs, and runtime limits.

In addition, two important #SBATCH options specify where to redirect the job’s output and error messages. Since batch jobs are non-interactive, there is no terminal or shell to display output. Instead, the standard output and error streams must be written to files. In our example, we use log_%j.out and log_%j.err, where %j is automatically replaced by the Slurm job ID.
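
Besides %j, Slurm supports further replacement patterns in these file names; for instance, %x expands to the job name set with the -J option. A possible variation of the preamble lines (the job name is chosen here only for illustration):

#SBATCH -J my-training                  # Job name (example)
#SBATCH -o %x_%j.out                    # e.g. my-training_123456.out
#SBATCH -e %x_%j.err                    # e.g. my-training_123456.err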


Following the preamble, the actual job commands are listed.
In this example, the script runs two commands sequentially.

The first command (echo) is not run with srun. It is executed directly by the SLURM batch script on the first node of the allocation. Since it is outside of SLURM’s job step management, it does not benefit from features like resource binding or tracking. This is fine for simple shell operations such as logging or environment setup.

The second command is run with srun, which means it is executed as a managed SLURM job step. This launches the command in a parallel context, typically across all nodes of the allocation (unless specified otherwise). If the allocation includes only a single node, srun will still create a parallel job, but limited to that one node.

Running with srun sets job-step environment variables such as SLURM_PROCID, SLURM_LOCALID, and SLURM_NTASKS, which distributed frameworks (e.g., PyTorch, TensorFlow, MPI) typically map to RANK, LOCAL_RANK, and WORLD_SIZE. These variables help coordinate parallel computation by assigning each process a role and identity within the job.
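
As a sketch of how these variables can be consumed, a small wrapper script started with srun could translate them into the names many frameworks expect; the wrapper and train.py below are placeholders, not part of the LRZ setup:

#!/bin/bash
# wrapper.sh - minimal sketch, assuming a PyTorch-style script that reads
# RANK, LOCAL_RANK, and WORLD_SIZE from the environment (train.py is a placeholder)
export RANK=$SLURM_PROCID               # Global rank of this task
export LOCAL_RANK=$SLURM_LOCALID        # Rank of this task on its node
export WORLD_SIZE=$SLURM_NTASKS         # Total number of tasks in the job step
python train.py

In the batch script, the corresponding line would then be srun ./wrapper.sh.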

Batch Jobs with Enroot Containers

Non-Parallel Jobs

To run non-parallel containerized jobs with SLURM using Enroot, you typically work with a pre-existing container image.
This involves two separate steps in your batch script:

  1. Creating a container from the container image

  2. Running the desired command inside the created container

The following script illustrates this approach:

#!/bin/bash
#SBATCH -p lrz-v100x2                   # Select partition (use sinfo)
#SBATCH --gres=gpu:1                    # Request 1 GPU
#SBATCH -o log_%j.out                   # File to store standard output
#SBATCH -e log_%j.err                   # File to store standard error

enroot create --name CNAME <IMAGE>.sqsh
enroot start CNAME command

The option --name CNAME in enroot create assigns the name CNAME to the container and creates it on the first node of your allocation.
The line enroot start CNAME command likewise executes the command on the first node of the allocation, inside the container.
As of Ubuntu 22.04, it is not possible to start the job with the Enroot command-line interface without creating the container first.
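
As a concrete sketch, with the image file, container name, and command below being placeholders to substitute with your own:

enroot create --name my_container pytorch_23.07.sqsh    # Create a container named my_container from a local image file
enroot start my_container nvidia-smi                     # Run a command inside the container (here: check GPU visibility)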

Parallel Jobs

One Container

To use a single container image for all commands in your job, it is recommended to specify the --container-image option in the batch script preamble.
Although srun is not explicitly used before each command in the script, the commands are all executed as part of a parallel job.

Keep in mind that invoking srun within this context will fail, as it is not available inside the scope of an already parallel job.

#!/bin/bash
#SBATCH -p lrz-v100x2                   # Select partition (use sinfo)
#SBATCH --gres=gpu:1                    # Request 1 GPU
#SBATCH -o log_%j.out                   # File to store standard output
#SBATCH -e log_%j.err                   # File to store standard error
#SBATCH --container-image="docker://nvcr.io/nvidia/pytorch:23.07-py3"

command1
command2
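
For illustration, command1 and command2 could stand for something like the following (script name and arguments are placeholders):

nvidia-smi                              # Verify that the GPU is visible inside the container
python train.py --epochs 10             # Placeholder training command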

One Container per Job Step

For containerized parallel jobs, even when allocating just a single node, we recommend using the Pyxis plugin for SLURM to manage container execution.

The following example illustrates a typical workflow:

Note: each call to srun results in the creation of a fresh container instance, even if the same image is used.

#!/bin/bash
#SBATCH -p lrz-v100x2                   # Select partition (use sinfo)
#SBATCH --gres=gpu:1                    # Request 1 GPU
#SBATCH -o log_%j.out                   # File to store standard output
#SBATCH -e log_%j.err                   # File to store standard error

srun --container-image=nvcr.io#nvidia/pytorch:23.07-py3 command1 
srun --container-image=nvcr.io#nvidia/tensorflow:22.12-py3 command2 
srun --container-image=nvcr.io#nvidia/tensorflow:22.12-py3 bash -c "command3 ; command4"
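
If a job step needs data from the host file system, Pyxis also provides the --container-mounts option for srun. A minimal sketch, assuming the mount paths and the script name below are placeholders:

srun --container-image=nvcr.io#nvidia/pytorch:23.07-py3 \
     --container-mounts=$HOME/data:/workspace/data \
     python /workspace/data/train.py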