Running parallel jobs on the Linux Cluster

Documentation links you should know
  • Job Processing on the Linux-Cluster: overview of Linux Cluster partitions, resource limits, job limits, job specifications, and common Slurm commands on the Linux Cluster
  • Job policies and general rules: general and cluster-/job-specific policies. Please also take note of and respect our Linux Cluster rules!
  • SLURM Workload Manager: Slurm commands and options for job submission, explanations, recommendations
  • Environment Modules and Spack Generated Modules: overview of the module system and the LRZ software stack

Overview

On all HPC systems at LRZ, the SLURM scheduler is used to execute batch jobs. For any kind of parallel job we recommend using the partitions cm4_tiny or cm4_std of the cluster segment cm4. Please also refer to Job Processing on the Linux-Cluster to check the resource and job limits on this cluster. On cm4 we only allow batch jobs. This execution method should be used for all production runs.

Usually, we distinguish two types of parallel jobs:

  • The term "shared memory" for parallel jobs assumes that a number of cores assigned by Slurm will be used by threads. Typically, setting the environment variable OMP_NUM_THREADS should achieve this (see examples below).
  • The term "distributed memory" for parallel jobs assumes that MPI is used to start one single-threaded MPI task per core assigned by Slurm. In principle, it is also possible to run hybrid MPI + threaded programs, in which case the number of cores assigned by the system will be equal to the product (# of MPI tasks) * (# of threads), rounded up if necessary.

This document briefly describes how to set up and start batch jobs via a step-by-step recipe for the simplest type of parallel job (illustrating the use of the SLURM commands for users of the bash shell). Full examples of typical use cases are given at the end of this document.

Batch job step by step

In order to start a batch job, you need to log in to the Linux Cluster. Then, you can submit a pre-prepared job script to the Slurm scheduler.

Step 1: Prepare a batch job script

In the following we show a job script example, which essentially looks like a bash script. However, it contains specially marked comment lines ("control sequences") that have a special meaning in the SLURM context, explained in the table below. The entries marked "(Placeholder)" must be modified to valid, user-specific values. Check Job Processing on the Linux-Cluster for details on the limits of the parallel cluster segment.

Batch job script line | Explanation | Comment
#!/bin/bash


#SBATCH -J <job_name>

(Placeholder) Set name of job (not more than 10 characters please).


#SBATCH -D  ./
Set working directory. This directory is used by the script as starting point. The directory must exist before the job starts. Here, the path is relative to the directory where the job script is submitted.
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
(Placeholder) Standard output and standard error are written to these files. The directory where the output files are placed must exist before the job starts, and the (full) path name must be specified (no environment variables!). "%x" encodes the job name into the output file name, "%j" the job ID, and "%N" the name of the master node of the job. Here, the specified path is relative to the directory given in the -D specification.
#SBATCH --get-user-env
Set user environment properly.
#SBATCH --export=NONE
Do not export the environment of the submitting shell into the job. While SLURM also allows ALL here, this is strongly discouraged, because the submission environment is very likely to be inconsistent with the environment required for execution of the job.
#SBATCH --clusters=cm4
#SBATCH --partition=<partition_name>

(Placeholder) Specify the names of the cluster segment (cm4) and of the partition within the cluster, e.g. cm4_tiny or cm4_std.


#SBATCH --qos=<partition_name>

(Placeholder) Slurm needs that setting for internal job management. It is mandatory on partitions cm4_tiny and cm4_std.


#SBATCH --ntasks=<number>
(Placeholder) The total number of MPI tasks of the job. Typically, the value used here should not be larger than the number of physical cores available to the job. It may be chosen smaller for various reasons (memory needed per task, hybrid programs, etc.). Set it to 1 in purely serial jobs. This is typically used on cm4_tiny, as jobs cannot request more than 1 node there.
#SBATCH --nodes=<number>
(Placeholder) Number of (shared-memory multi-core) nodes assigned to the job. This is typically needed on cm4_std in order to define the requested resources uniquely.
#SBATCH --ntasks-per-node=<number>
(Placeholder) The number of MPI tasks to start on each node. Typically, the value used here should not be larger than the number of physical cores in a node. It may be chosen smaller for various reasons (memory needed for a task, hybrid programs, etc).
#SBATCH --cpus-per-task=<number>
(Placeholder) Set the number of (OpenMP) threads per (MPI) task. This parameter is required in shared-memory and hybrid job setups.
#SBATCH --mem=<size>[unit]

(Placeholder) Specify the maximum memory the job can use, e.g. 10G = 10 GB (typical units: M, G). Very large values can cause assignment of additional cores to the job that remain unused, so this feature should be used with care. The default memory per job scales with the number of CPU cores used by the job.

This is typically needed on cm4_tiny to define the memory requirement explicitly (if more memory is required than is specified by default).
#SBATCH --time=<HH:MM:SS>
(Placeholder) Set the maximum run time of the job using the format "hours:minutes:seconds".
module load slurm_setup
First executed line: SLURM settings necessary for proper setup of batch environment.
module load <module_name>
(Placeholder) Load any required environment modules (usually needed if the program is linked against shared libraries, or if paths to applications are needed), e.g. MPI. Modules are not loaded automatically by default (e.g. for libraries or software packages)! Please load all required modules explicitly. Otherwise, the job starts with a clean environment!
mpiexec -n $SLURM_NTASKS ./my_mpi_program.exe

Start MPI executable. The MPI variant used depends on the loaded module set. Non-MPI programs may fail to start up. Please consult the job examples given below or the software-specific documentation for other startup mechanisms. The total number of MPI tasks is supplied by SLURM via the referenced variable $SLURM_NTASKS, which is computed from the number of nodes and number of tasks per node. Apart from "mpiexec", "srun" may also be used.
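As mentioned in the last row of the table, srun may be used instead of mpiexec to launch the MPI program. A minimal sketch of the corresponding startup line:

# Launch the MPI program via srun using the tasks defined in the job allocation
srun -n $SLURM_NTASKS ./my_mpi_program.exe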


Step 2: Job submission procedure

The job script "my_job.slurm" is submitted to the queue via Slurm's sbatch command. At submission time the control sequences are evaluated and stored in the queuing database, and the script is copied into an internal directory for later execution. If the command was executed successfully, the job ID will be returned. The sequence should look like:

userid@loginnode:~> sbatch my_job.slurm 
Submitted batch job 17130 on cluster cm4

It is a good idea to note down your job IDs, for example to provide them to the LRZ Linux Cluster Support if anything goes wrong. You may also invoke Slurm commands to inspect your jobs or check their status; see the section "Job management".

Furthermore, the sbatch command accepts the same options on the command line, which then override the corresponding settings in the script. We strongly advise against doing so; otherwise it might be difficult for you or the LRZ HPC support to understand and reproduce the job configuration from the job script file alone.

Step 3: Job management

Once submitted, the job will be queued for some time, depending on how many jobs are currently queued and how many resources are available. Eventually, typically after previously submitted jobs have completed, the job will be started on one or more nodes determined by its resource requirements. Slurm provides several commands to check the status of waiting or running jobs, to inspect or (to a limited extent) modify waiting/running jobs, to obtain information on finished jobs, and to delete waiting/running jobs.

Please consult Job Processing on the Linux-Cluster for a list of common Slurm commands on the Linux Cluster.
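For quick reference, a sketch of some commonly used commands (see the linked page for the full list and recommended options):

squeue --clusters=cm4 --user=$USER          # show your waiting/running jobs on cm4
scontrol --clusters=cm4 show job <job_id>   # detailed information on a waiting/running job
scancel --clusters=cm4 <job_id>             # cancel a waiting or running job
sacct --clusters=cm4 -j <job_id>            # accounting information on a finished job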

Batch job examples

The example job scripts are provided as templates which you can adapt to your own needs. Keep in mind that some entries are placeholders that you must replace with correct, user-specific settings. In particular, path specifications must be adapted; for example, you need to substitute "path_to_my_prog". Furthermore, the job in our examples will be executed in the directory where the job script is located, i.e. the working directory (see option "-D") is set to "./". All other path names used in the script are relative to this working directory. Please also keep in mind that the job output files are written to this directory; after running some jobs, it might be cluttered with lots of files. We recommend using a separate directory for the output.

Your job might produce a lot of (temporary) data. Please also configure the data paths accordingly. For recommendations on how to do large-scale I/O please refer to the description of the file systems available on the cluster.
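For instance, a job script might place large (temporary) output on the scratch file system rather than in $HOME. A sketch (the directory name "my_run" is hypothetical; adapt the path to your own layout):

# Write large or temporary output to the scratch file system, not to $HOME
OUTDIR=$SCRATCH/my_run_${SLURM_JOB_ID}
mkdir -p $OUTDIR
cd $OUTDIR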

Shared-memory jobs

This job type uses a single (shared-memory) compute node of the designated Slurm partition. Parallelization can be achieved either via directive-based OpenMP programming or via (POSIX) thread programming.

#!/bin/bash
#SBATCH -J job_name
#SBATCH -D ./
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --clusters=cm4
#SBATCH --partition=cm4_tiny
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=112
#SBATCH --time=08:00:00
 
module load slurm_setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
MYPROG=path_to_my_prog/my_shared-memory_program.exe
 
$MYPROG

The example will start an OpenMP parallel application using 112 threads on all available cores of a single cm4_tiny compute node.

The maximum job runtime is set to 8 hours.

Hyperthreading is not used. With hyperthreading, 224 threads is the maximum reasonable value on CoolMUC-4.
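Depending on the application, explicitly pinning the threads to cores may improve performance. A hedged sketch using the standard OpenMP environment variables (not LRZ-specific; add them after setting OMP_NUM_THREADS in the script above):

export OMP_PLACES=cores      # one place per physical core
export OMP_PROC_BIND=spread  # distribute threads over the places and prevent migration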

#!/bin/bash
#SBATCH -J job_name
#SBATCH -D ./
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --clusters=cm4
#SBATCH --partition=cm4_tiny
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28
#SBATCH --mem=80G
#SBATCH --time=08:00:00
 
module load slurm_setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
MYPROG=path_to_my_prog/my_shared-memory_program.exe
 
$MYPROG

The example will start an OpenMP parallel application using 28 threads on the partition cm4_tiny.

As cm4_tiny is a shared partition, the remaining CPU cores of the node may be used by other jobs/users. Thus, if necessary, it might be useful to specify the required memory of the job explicitly, e.g. 80 GiB in this example.

The maximum job runtime is set to 8 hours.
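To choose a reasonable value for --mem, it can help to inspect how much memory a comparable finished job actually used, e.g. with sacct (a sketch):

sacct --clusters=cm4 -j <job_id> --format=JobID,JobName,Elapsed,MaxRSS,State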

Distributed-memory jobs (MPI parallel and hybrid jobs)

MPI jobs may use MPI only for parallelization, or may combine MPI and OpenMP ("hybrid" jobs). It is mandatory to load an MPI module. Please consult the MPI documentation page.

#!/bin/bash
#SBATCH -J job_name
#SBATCH -D ./
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --clusters=cm4
#SBATCH --partition=cm4_std
#SBATCH --qos=cm4_std
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=112
#SBATCH --cpus-per-task=1
#SBATCH --time=08:00:00
 
module load slurm_setup
module load <MPI_MODULE>
MYPROG=path_to_my_prog/my_mpi_program.exe
 
mpiexec -n $SLURM_NTASKS $MYPROG

The example will start an MPI parallel application using 224 parallel processes (tasks) distributed across 2 compute nodes on the partition cm4_std.

The maximum job runtime is set to 8 hours.

Substitute the placeholder for the MPI module with a reasonable module name.
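A sketch of how to look for and load a suitable MPI module (the actual module names on the cluster may differ; please check the MPI documentation page):

module avail 2>&1 | grep -i mpi   # list available modules containing "mpi"
module load <MPI_MODULE>          # load the chosen MPI module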

#!/bin/bash
#SBATCH -J job_name
#SBATCH -D ./
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --clusters=cm4
#SBATCH --partition=cm4_std
#SBATCH --qos=cm4_std
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=112
#SBATCH --time=08:00:00
 
module load slurm_setup
module load <MPI_MODULE>
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
MYPROG=path_to_my_prog/my_hybrid_program.exe
 
mpiexec -n $SLURM_NTASKS $MYPROG

The example will start a hybrid application on 2 compute nodes of the partition cm4_std using 2 MPI processes (tasks) in total and 112 OpenMP threads per MPI task. The two MPI processes will be distributed across the two compute nodes, one per node.

The maximum job runtime is set to 8 hours.

Substitute the placeholder for the MPI module with a reasonable module name.
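Depending on the application, a finer decomposition of the 112 cores per node may perform better. A sketch of a variant with 4 MPI tasks per node and 28 threads per task (same total core count per node):

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4    # 8 MPI tasks in total
#SBATCH --cpus-per-task=28     # 4 tasks * 28 threads = 112 cores per node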

Special job configurations

Job farming: starting multiple serial "subjobs" within a single job on a single compute node

Please use this with care! Before you consider the use of job farming, you may consult General Considerations to Job-/Task-Farming or Job farming with SLURM for more complex setups.

Please note: The total number of job steps per Slurm job is limited. In other words: On the cm4 cluster, the total number of "srun" calls is limited to 40000!

The example job script on this page illustrates how to start up multiple serial job steps ("subjobs") within a parallel job script. It represents a very simple strategy which is most efficient if all job steps are well-balanced with respect to runtime. A single outlier may cause a considerable waste of CPU resources. At LRZ's discretion, unbalanced jobs may be removed forcibly.

#!/bin/bash
#SBATCH -J job_name
#SBATCH -D ./
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --clusters=cm4
#SBATCH --partition=cm4_tiny
#SBATCH --qos=cm4_tiny
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=112
#SBATCH --cpus-per-task=1
#SBATCH --time=08:00:00
 
module load slurm_setup
MYPROG=path_to_my_prog/my_serial_program.exe
 
basedir=$PWD
for ((i=1; i<=$SLURM_NTASKS; i++)); do
    # create a unique output directory for this job step
    outdir=$basedir/dataout_${SLURM_JOB_NAME}_${SLURM_JOB_ID}/$i
    mkdir -p $outdir
    cd $outdir
    # launch one serial job step (1 task, 1 CPU) in the background
    srun --exact -n 1 -c 1 $MYPROG &
    cd $basedir
done
wait

The example will start a serial application within a job farming approach using a single node on the partition cm4_tiny. Each job step will involve a single task and a single thread per task.

The number of job steps is defined by the job script specifications "--nodes" and "--ntasks-per-node".

In a loop over all job steps, Slurm's srun command allocates the resources for each step ("-n" tasks and "-c" CPUs per task) and launches the serial program, which runs in the background ("&").

The wait command waits for the completion of all background processes. Without it, the job might finish earlier than intended and job steps might be left incomplete.

In this example, the job also automatically creates a unique output directory for each job step. As many job steps will run, a lot of temporary data might be produced. We advise against using $HOME for this; use, for example, the $SCRATCH file system instead (File Systems and IO on Linux-Cluster).
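For example, the output directory line in the script above could be changed to point to scratch instead (a sketch; adapt the path to your own layout):

outdir=$SCRATCH/dataout_${SLURM_JOB_NAME}_${SLURM_JOB_ID}/$i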