7. Running Applications as Batch Jobs on the LRZ AI Systems

Batch jobs are the preferred way of using the LRZ AI Systems, as they do not require a separate resource-allocation step. There are several ways of submitting batch jobs to Slurm; here, the sbatch command is described.

The sbatch command submits jobs described in a file with a special format. This file is usually referred to as an "sbatch script", or simply a "batch script".
An example batch script, saved here as enroot_test.sbatch, looks as follows:

#!/bin/bash
#SBATCH -p lrz-v100x2
#SBATCH --gres=gpu:1
#SBATCH -o enroot_test.out
#SBATCH -e enroot_test.err

srun --container-mounts=./data-test:/mnt/data-test \
     --container-image='nvcr.io#nvidia/pytorch:22.12-py3' \
     python script.py --epochs 55 --batch-size 512

The first part of the example batch script is the preamble, indicated by the lines starting with #SBATCH. It describes the resources needed to execute the job, using Slurm options comparable to those of the interactive jobs discussed above. Two additional arguments are typically given in sbatch scripts: where the output and the error messages of the job should be redirected. Because the job is not interactive, there is no terminal/shell for it to write to, so these options name the files where the output and errors are saved when the job executes (enroot_test.out and enroot_test.err, respectively, in this example).
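A preamble can carry further standard Slurm options beyond those shown above. The following sketch adds a job name and a wall-clock limit; the option names are standard Slurm, but the values are illustrative only:

```
#!/bin/bash
#SBATCH -p lrz-v100x2        # partition to run on
#SBATCH --gres=gpu:1         # request one GPU
#SBATCH -J enroot-test       # job name shown in the queue
#SBATCH -o enroot_test.out   # file receiving standard output
#SBATCH -e enroot_test.err   # file receiving standard error
#SBATCH --time=01:00:00      # wall-clock limit (hh:mm:ss)
```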

After the preamble, the job to be executed is described. The example runs a Python script within an Enroot container created from the image 'nvcr.io#nvidia/pytorch:22.12-py3'. Container support is enabled by the pyxis plugin for Slurm (see the pyxis documentation for the available options: https://github.com/NVIDIA/pyxis).
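The contents of script.py are not part of this section. A minimal sketch of what such a script might look like, assuming only that it accepts the --epochs and --batch-size flags passed on the srun line above, could be:

```python
import argparse


def parse_args(argv=None):
    # Accept the flags passed on the srun command line above.
    parser = argparse.ArgumentParser(description="Toy training entry point")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=128)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # A real script would build the model and run the training loop here;
    # this sketch only reports the parsed configuration.
    print(f"Training for {args.epochs} epochs with batch size {args.batch_size}")
```

Any output printed by the script ends up in the file given via #SBATCH -o, since the job has no attached terminal.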

Once the script has been created, it can be used to submit jobs via the sbatch command:

$ sbatch enroot_test.sbatch 

Slurm executes submitted jobs as soon as the requested resources become available.