Job Processing with SLURM on SuperMUC-NG

General

The batch system on SuperMUC-NG is the open-source workload manager SLURM (Simple Linux Utility for Resource Management). For details about the SLURM batch system, see Slurm Workload Manager.

Submit hosts are usually login nodes, which permit submitting and managing batch jobs.

Intel processors on SuperMUC-NG support hyperthreading, which might increase the performance of your application. With hyperthreading, you have to increase the number of MPI tasks per node from 48 to 96 in your job script. Please be aware that with 96 MPI tasks per node, each process gets only half of the memory by default. If you need more memory, you have to specify it in your job script and use the fat nodes (see the example batch scripts).

List of relevant commands

Command    Purpose
sbatch     submit a job script
scancel    delete or terminate a queued or running job
squeue     print a table of submitted jobs and their state
           (Note: non-privileged users can only see their own jobs)
salloc     create an interactive SLURM shell
srun       execute the argument command on the resources assigned to a job
           (Note: must be executed inside an active job, i.e. a script or interactive environment;
           mpiexec is an alternative and is preferred on LRZ systems)
sstat      display various status information of a running job/step
sinfo      provide an overview of the cluster status
scontrol   query and modify the SLURM state

sacct is not available for users.
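A typical submit / monitor / cancel cycle with these commands looks like this (the script name and job ID are only illustrative):

sbatch myjob.slurm        # prints: Submitted batch job 48417
squeue -u $USER           # show the state of your own jobs
scancel 48417             # remove the job from the queue or terminate it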

SLURM partitions (Queues) and their limits

  • Batch queues are called partitions in SLURM. 
  • The allocation granularity is multiples of one node (only complete nodes are allocated and accounted for).
  • Scheduling and prioritization are based on a multifactor scheme including wait time, job size, partition, and required quality of service.

The following partitions are available. Check with sinfo for more details:

partition   min-max nodes     max usable   cores      max run time   max running     max submitted          base job
            per job           memory       per node   (hours)        jobs per user   jobs per user (qos)    processing priority
test        1-16              90 GB        48         0.5            1               3                      (dedicated nodes)
micro       1-16              90 GB        48         48             20              40                     low (*)
general     17-768            90 GB        48         48             10              30                     medium
large       769-3168          90 GB        48         24             2               5                      high
            (approx. half
            of the system)
fat         1-128             740 GB       48         48             2               10

(*) Remark: "micro" jobs are frequently executed via SLURM's backfilling algorithm when a larger job from the "general" or "large" queue terminates earlier than expected and leaves an unoccupied time slot in SLURM's scheduling matrix. Such a slot is necessarily shorter than 48 hours (shorter than 24 hours if it comes from a "large" job). It can therefore be helpful to specify a maximum execution time for "micro" jobs well below the allowed maximum of 48 (24) hours, because slots in SLURM's processor-time scheduling matrix that are usable for backfilling usually last less than 48 hours. A "micro" job that requests the full 48-hour limit cannot qualify for backfilling, and its queue waiting time will consequently be longer (if not maximized).
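For example, a "micro" job that requests a short wall-clock limit is a much better backfilling candidate; a minimal sketch (the time value is only illustrative):

#SBATCH --partition=micro
#SBATCH --time=06:00:00    # well below the 48-hour maximum, easier to backfill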

srun and mpiexec

With the SLURM srun command, users can spawn any kind of application, process or task inside a job allocation, or directly start a parallel job (in which case srun implicitly asks SLURM to create the appropriate allocation). It can be a shell command; any single- or multi-threaded executable in binary or script format; an MPI application; or a hybrid application with MPI and OpenMP. When no allocation options are given to srun, the options from sbatch or salloc are inherited.

Note: mpiexec is the preferred and only supported way to start applications. srun might fail (particularly for hyperthreaded applications).

salloc / srun  for interactive processing

  • allocate nodes 
  • then execute one or more commands in this allocation

salloc is used to allocate nodes for interactive processing. The options for resource specification in salloc/srun/sbatch are the same.
Currently, at least --account, --time and --partition must be specified!

"srun" can be used instead of "mpiexec"; both commands execute on the nodes previously allocated by salloc.
In terms of wait time, there is no advantage to using "salloc" over "sbatch --partition=test".



# allocate 4 Nodes 
login node> salloc -t 00:30:00 -p test -A <project-id> -N 4
salloc: Pending job allocation 48417
salloc: job 48417 queued and waiting for resources
salloc: job 48417 has been allocated resources
salloc: Granted job allocation 48417
salloc: Waiting for resource configuration
salloc: Nodes i01r01c01s[03-06] are ready for job

i01r01c01s03> srun -n 4 hostname 
i01r01c01s05.sng.lrz.de
i01r01c01s04.sng.lrz.de
i01r01c01s06.sng.lrz.de
i01r01c01s03.sng.lrz.de

# ppn is needed because whole node is allocated
# otherwise 4 processes are started on first node
i01r01c01s03> mpiexec -n 4 -ppn 1 hostname
#(same output)

#Hybrid applications
OMP_NUM_THREADS=4 mpiexec -ppn 12 -n 48 ./a.out
OMP_NUM_THREADS=8 mpiexec -ppn 6  -n 48 ./a.out

i01r01c01s03> exit
exit
salloc: Relinquishing job allocation 48417
login node> 

# if you allocate 4 nodes and 48 tasks you can run mpiexec 
# like in a batch job
salloc -t 00:30:00 -p test -A <project-id> -N 4 -n 48
mpiexec hostname
OMP_NUM_THREADS=4 mpiexec -ppn 12 -n 48 ./a.out


sbatch Command / #SBATCH option

Batch job options and resources can be given as command line flags to sbatch (in which case they override script-provided values), or they can be embedded into a SLURM job script as a comment line of the form #SBATCH <option>.
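For example, a wall-clock limit given on the command line overrides the one embedded in the script (the script name and value are illustrative):

sbatch --time=00:10:00 myjob.slurm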

For a very simple job to test your setup see the following:

Simple Test Job
#!/bin/bash
#SBATCH --time=00:20:00
#SBATCH --account=<project name>
#SBATCH --partition=test
#SBATCH --ntasks=1
module load slurm_setup
date
pwd
env | grep SLURM
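Assuming the script above is saved as simple_test.slurm (the file name is illustrative), it can be submitted and monitored like this:

sbatch simple_test.slurm    # prints the assigned job ID
squeue -u $USER             # check whether the job is pending or running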


Batch Job Examples

General options applicable for all jobs

#!/bin/bash 
# Job Name and Files (also --job-name) 
#SBATCH -J jobname
#Output and error (also --output, --error): 
#SBATCH -o ./%x.%j.out 
#SBATCH -e ./%x.%j.err 
#Initial working directory (also --chdir): 
#SBATCH -D ./
#Notification and type
#SBATCH --mail-type=END
#SBATCH --mail-user=insert_your_email_here
# Wall clock limit: 
#SBATCH --time=24:00:00
#SBATCH --no-requeue
#Setup of execution environment
#SBATCH --export=NONE 
#SBATCH --get-user-env
#SBATCH --account=insert your_projectID_here
#SBATCH --partition=insert test, micro, general, large or fat 

#constraints are optional
#--constraint="scratch&work"

<insert the specific options for resources 
 and execution from below here>

 

Hints and Explanations:

Replacement patterns in filenames:
%J:  jobid.stepid of the running job. (e.g. "128.0")
%j:  jobid of the running job.
%s:  stepid of the running job.
%t:  task identifier (rank) relative to current job. 
        this will create a separate IO file per task.
%u:  User name.
%x:  Job name.
%a:  Job array ID

Notification types:
NONE, BEGIN, END, FAIL, REQUEUE

requeue/no-requeue:
Whether the job is eligible to be requeued or not. When
a job is requeued, the batch script is initiated from its
beginning. no-requeue specifies that the batch job should
never be requeued under any circumstances.

environment:
Do not export the variables of the submitting shell into the job (which would make debugging of errors nearly impossible for LRZ).

get-user-env
will set the environment variables as during login.

account:
Resources used by this job are subtracted from the budget of this project. The billing unit is core-hours. Make sure that you use the right project. Remark: this is usually not the user account but the project name.
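For example, a job running for 2 hours on 10 nodes (10 x 48 = 480 cores) is billed 960 core-hours.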

Partition:

Request a specific partition ("queue")  for the resource allocation.

constraint (optional):
Nodes can have features. Users can specify which of these features are required by their job using the constraint option. Only nodes having features matching the job constraints will be used to satisfy the request. Multiple constraints may be specified with AND (&), OR (|), matching OR, resource counts, etc. The availability of specific file systems can be specified as a constraint, giving LRZ the opportunity to start jobs which do not need all of the file systems. See:

List of SLURM Constraints and its Usage
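To check which features (constraints) the nodes of a partition offer, the feature column of sinfo can be displayed; a minimal sketch (partition and format are illustrative):

sinfo -p general -o "%20N %f"    # node names and their feature list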

Options for resources and execution


MPI job (total number of tasks specified)
#... (general part)
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=100 
#SBATCH --ntasks=4800 

#Important 
module load slurm_setup

#Run the program: 
mpiexec -n $SLURM_NTASKS ./myprog 

MPI job with OpenMPI (srun and EAR settings)
#... (general part)
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=100
#SBATCH --ntasks=4800
#SBATCH --ear-mpi-dist=openmpi

#Important 
module load slurm_setup

#Run the program: 
srun -n $SLURM_NTASKS ./myprog 
#If you want to use Openmpi's mpiexec
##SBATCH --ear=off
#mpiexec -n $SLURM_NTASKS ./myprog

MPI job (tasks per node specified)
#... (general part)
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node: 
#SBATCH --nodes=128 
#SBATCH --ntasks-per-node=48 

#Important 
module load slurm_setup

#Run the program: 
mpiexec -n $SLURM_NTASKS ./myprog 

MPI job with hyperthreading (96 tasks per node)
#... (general part)
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=128
## due to a bug, this will not work here
## if ntasks-per-node >48
## just allocate the nodes
## use explicit number of task in mpiexec
##SBATCH --ntasks-per-node=96 

#Important 
module load slurm_setup

mpiexec -n $((128*96)) ./myprog 

#Note: needs a specific MPI version (>= 2019)
#Default pinning: Task0->CPU0, Task1->CPU48, ...
#(CPU0 and CPU48 are on the same physical core)!
#Optional: use more explicit pinning (spreading):
#export I_MPI_PIN_PROCESSOR_LIST=0-95
#Task0->CPU0, Task1->CPU1, Task2->CPU2, ...

Hybrid MPI/OpenMP job
#... (general part)
#SBATCH --partition=general
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=128
#SBATCH --ntasks-per-node=8

#Important 
module load slurm_setup

#Run the program: 
export OMP_NUM_THREADS=6 
#Default Pinning:
#Thread0/Task0 ->CPU0 or CPU48
#Thread1/Task0 ->CPU1 or CPU49
#Thread2/Task0 ->CPU2 or CPU50
mpiexec -n $SLURM_NTASKS ./myprog

#export I_MPI_PIN_CELL=core
#export I_MPI_PIN_DOMAIN=omp:compact
#Thread0/Task0 ->CPU0, Thread1/Task0 ->CPU1
#Thread2/Task0 ->CPU2, Thread3/Task0 ->CPU3


#explicit pinning with masks or lists
#export I_MPI_PIN_CELL=unit
#export I_MPI_PIN_DOMAIN=[3F,FC0,3F000,FC0000,etc.] or
#export I_MPI_PIN_PROCESSOR_LIST=0-95 etc.

Hybrid MPI/OpenMP job with hyperthreading
#... (general part)
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node: 
#SBATCH --nodes=128  
#SBATCH --ntasks-per-node=8  

#Important 
module load slurm_setup

#Run the program:  
export OMP_NUM_THREADS=24
mpiexec -n $SLURM_NTASKS ./myprog

#Optional: Resulting in same pinning as default 
#export I_MPI_PIN_CELL=unit
#export I_MPI_PIN_DOMAIN=omp:compact
#Thread0/Task0 ->CPU0, Thread1/Task0 ->CPU1
#Thread5/Task0 ->CPU5 !!!
#Thread6/Task0 ->CPU48, Thread7/Task0 ->CPU49

#explicit pinning with masks
#export I_MPI_PIN_CELL=unit
#export I_MPI_PIN_DOMAIN="[FFFFFF,FFFFFF0000000,etc.]"

MPI job on fat nodes
#... (general part)
#SBATCH --partition=fat
#Number of nodes and MPI tasks per node: 
#SBATCH --nodes=64 
#SBATCH --ntasks-per-node=48

#Important 
module load slurm_setup

#Run the program: 
mpiexec -n $SLURM_NTASKS ./myprog

OpenMP job on a single node
#... (general part)
#SBATCH --partition=micro
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=1 
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48

#Important 
module load slurm_setup

#Run the program:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./myprog
#Default:KMP_AFFINITY=granularity=thread,compact,1,0
#Thread0 ->CPU0, #Thread1 ->CPU1


Very large job (limiting the number of islands)
#... (general part)
#SBATCH --partition=large
#Max number of islands and max waittime
#SBATCH --switches=2@24:00:00 
#SBATCH --nodes=1024
#SBATCH --ntasks-per-node=48 

#Important 
module load slurm_setup

mpiexec -n $SLURM_NTASKS ./myprog

Job array
#... (general part)
#SBATCH -o Array_job.%A_%a.out 
#SBATCH -e Array_job.%A_%a.err
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=10
#SBATCH --ntasks=480 
#SBATCH --cpus-per-task=1  
#SBATCH --array=1-10

#Important 
module load slurm_setup

mpiexec -n  $SLURM_NTASKS ./myprog  \
         <in.$SLURM_ARRAY_TASK_ID 

  

Benchmarking options (EAR off, single island)
#... (general part)
#fixed frequency, no dynamic adjustment
#SBATCH --ear=off
#optional: keep job within one island
#SBATCH --switches=1
... 


Select the appropriate example and merge it with the general options above.


Resource Specifications and Explanations:

module load slurm_setup:
performs specific settings which cannot be done in the job prolog. Without this line your job will fail!

nodes=<minnodes[-maxnodes]>
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count. The default behavior is to allocate enough nodes to satisfy the requirements of the ntasks and cpus-per-task options.

ntasks:
The default is one task per node, but note that the cpus-per-task option will change this default.

ntasks-per-node:
Request that ntasks be invoked on each node. If used with the ntasks option, the ntasks option will take precedence and the ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the nodes option.

ntasks-per-core:
Request that the maximum ntasks be invoked on each core.

cpus-per-task:
Without this option, the controller will just try to allocate one core per task.

switches=<number>[@waittime hh:mm:ss]
Maximum count of switches (this means islands on SuperMUC-NG) desired for the job allocation, and optionally the maximum time to wait for that number of switches. Use this option only for very large jobs or when you need reproducible results for profiling and performance measurements (e.g. switches=1 for jobs which should run only within one island).

array:
Submit a job array, multiple jobs to be executed with identical parameters.

mpiexec:
In most cases mpiexec can be used without specifying the number of tasks, because this is inherited from the sbatch command. Slurm output variables can also be used e.g.,
mpiexec -n $SLURM_NTASKS ./myprog

If SLURM can detect the number of tasks from its settings, it is sufficient to use mpiexec without further parameters, e.g.,
mpiexec ./myprog

ear (energy aware runtime): 
By default, the system may dynamically change the clock frequency of CPUs during the run time of a job to optimise for energy consumption (for more details, see Energy Aware Runtime). This makes profiling or benchmark measurements difficult and unstable.
Users can enforce a fixed default (but low) frequency by switching EAR off:
#SBATCH --ear=off 
Without switching EAR off, the system is also able to select higher frequencies.

Pinning:
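The Intel MPI pinning variables used in the example comments above control where tasks and threads are placed. A minimal hybrid sketch, assuming Intel MPI and an illustrative thread count:

export OMP_NUM_THREADS=6
export I_MPI_PIN_CELL=core            # pin to physical cores ("unit" would also use hyperthreads)
export I_MPI_PIN_DOMAIN=omp:compact   # one compact domain of OMP_NUM_THREADS cores per MPI task
mpiexec -n $SLURM_NTASKS ./myprog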


Submitting several jobs with dependencies


#!/bin/bash 
# Chain of batch jobs with dependencies 
NR_OF_JOBS=6 
JOB_SCRIPT=./my_batch_script 
echo "Submitting chain of $NR_OF_JOBS jobs for batch script $JOB_SCRIPT" 
#submit and get JOBID
JOBID=$(  sbatch ${JOB_SCRIPT} 2>&1 \
       | awk '{print $(NF)}'  ) 
I=1 
while [ $I -lt $NR_OF_JOBS ] 
do  
   JOBID=$( sbatch --dependency=afterok:$JOBID \
         $JOB_SCRIPT 2>&1 \
       | awk '{print $(NF)}'  )  
   I=$(( $I+1 ))
done 
dependency=<dependency_list>

Defer the start of this job until the specified dependencies have been satisfied. <dependency_list> is of the form <type:job_id[:job_id][,type:job_id[:job_id]]>

  • after:job_id[:jobid...] job can begin execution after the specified jobs have begun execution.
  • afterany:job_id[:jobid...] job can begin execution after the specified jobs have terminated.
  • afternotok:job_id[:jobid...] job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).
  • afterok:job_id[:jobid...] job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).

Chained jobs will be set on hold (Dependency). Please note that priorities start being set for a job in the chain only once the dependency has been released. Thus, there is no guarantee that the next job in the chain will start right after its predecessor in the chain finishes. The maximum length of the job chain is determined by the maximum number of submitted jobs per user for the queue (see table at the top of the page). Chaining can be used to execute workflows with a minimum of supervision in a long-lasting campaign.
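Whether a chained job is still on hold can be seen from the reason column of squeue (the format options are only illustrative):

squeue -u $USER -o "%.10i %.12j %.8T %.20r"   # jobs waiting for a predecessor show the reason "Dependency"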

Input Environment Variables

Upon startup, sbatch will read and handle the options set in the following environment variables. Note that environment variables will override any options set in a batch script, and command line options will override any environment variables. Some of these may be useful to set in your $HOME/.profile:

Variable            Option
SBATCH_ACCOUNT      --account
SBATCH_JOB_NAME     --job-name
SBATCH_REQUEUE      --requeue
SBATCH_NOREQUEUE    --no-requeue
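For example, to avoid repeating the project ID in every job script, the corresponding variable could be set in $HOME/.profile (the project ID is a placeholder):

export SBATCH_ACCOUNT=<project-id>    # acts as if --account=<project-id> were passed to sbatch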

Output Environment Variables

The SLURM controller sets the following variables in the environment of the batch script:

Variable                     Meaning / usage
SLURM_JOB_ID
SLURM_JOBID                  Both variants return the SLURM job ID.
SLURM_JOB_ACCOUNT            Account name associated with the job allocation.
SLURM_JOB_NUM_NODES          Number of nodes.
SLURM_JOB_NODELIST           List of allocated nodes. To convert the SLURM compressed format into a full list:
                             scontrol show hostname $SLURM_JOB_NODELIST
SLURM_NTASKS                 Number of tasks. Example of usage:
                             mpiexec -n $SLURM_NTASKS
SLURM_NTASKS
SLURM_NNODES
SLURM_NPROCS
SLURM_NTASKS_PER_NODE
SLURM_NTASKS_PER_CORE
SLURM_CPUS_PER_TASK
SLURM_JOB_NUM_NODES          These variables are only set if the corresponding sbatch option was given. Example of usage:
                             export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
SLURM_JOB_CPUS_PER_NODE      Count of processors available to the job; the returned value looks like "96(x128)".
SLURM_TASKS_PER_NODE         Number of tasks to be initiated on each node; the returned value looks like "8(x128)".
SLURM_PROCID                 The MPI rank (or relative process ID) of the current process. Can be used in wrapper scripts, e.g.:

if [ $SLURM_PROCID -eq 0 ]
then
    ./master
else
    ./slave
fi

File Patterns

sbatch allows for a filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter.

Example:   #SBATCH -o ./%x.%j.out

Pattern   Expansion
%j        jobid of the running job
%J        jobid.stepid of the running job
%a        job array ID (index)
%u        user name
%x        job name
%t        task identifier (rank) relative to the current job; this will create a separate IO file per task

Useful commands

Show the estimated start time of a job: squeue --start [-u <userID>]

Guidelines for resource selection

Processing Mode

  • Jobs that  only use one or at most a few hardware cores perform serial processing and are not supported on SuperMUC-NG. Use the SuperMUC-Cloud for such purposes.
  • Bunches of multiple independent tasks can be bundled into one job, using one or more nodes.

Run time limits 

  • Please note that all job classes impose a maximum run time limit. It can be adjusted downward for any individual job. Since the scheduler uses a backfill algorithm, the more realistic the runtime limit you specify, the better throughput your job may achieve.

Number of Islands/Switches

  • This defines the maximum count of switches (= islands of SuperMUC-NG) desired for the job allocation and optionally the maximum time to wait for that number of switches. If Slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired (lower) switch count or the time limit expires. If there is no switch count limit, there is no delay in starting the job. This trades off better performance against shorter wait time in the queue. Also use the minimum number of switches when you need good reproducibility for profiling or benchmarking.

Energy aware runtime

  • Switch the dynamic frequency adjustment off when you need good reproducibility for profiling.

Memory Requirements

  • The total memory available in user space for the set of nodes requested by the job must not be exceeded.
  • The memory used by all tasks running on an individual node must not exceed the memory available on that node.
  • Applications exist for which the memory usage is asymmetric. In this case it may become necessary to work with a variable number of tasks per node. One relevant scenario is a master-worker scheme where the master may need an order of magnitude more memory and therefore requires a node of its own, while the workers can share nodes. LRZ provides the "mixed" partition for using thin and fat nodes concurrently (see the sketch below).
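As a sketch only: recent SLURM versions can express such a setup as a heterogeneous job with two components. The partition names, node counts and the hetjob/het-group syntax below are assumptions that must be checked against the installed SLURM version and the configuration of the "mixed" partition (older SLURM releases use #SBATCH packjob instead of #SBATCH hetjob):

#!/bin/bash
#SBATCH --account=<project-id>
#SBATCH --time=02:00:00
# component 0: one memory-heavy node for the master
#SBATCH --partition=fat
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH hetjob
# component 1: worker nodes
#SBATCH --partition=general
#SBATCH --nodes=17
#SBATCH --ntasks-per-node=48

module load slurm_setup

# the first executable runs on het group 0, the second on het group 1
srun --het-group=0 ./master : --het-group=1 ./worker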

Disk and I/O Requirements

  • The disk and I/O requirements are not controlled by the batch scheduling system, but rely on parallel shared file systems, which provide system-global services with respect to bandwidth - this means that the total I/O bandwidth is shared between all users. The consequence is that all I/O may be significantly slowed down if heavily used by multiple users at the same time, or even - for large scale parallel jobs - by a single user. At present, LRZ can not make any Quality of Service assurance for I/O bandwidth.
  • The appropriate usage of the parallel file systems is essential.
  • Please consult File Systems of SuperMUC-NG for more detailed technical information.

Licences

  • Some jobs may make use of licensed software, either from the LRZ software application stack, or of software installed in the user's HOME directory. In many cases, the software needs to access a license server because there exist limits on how many instances of the software may run and who may access it at all.
  • There is no connection from SuperMUC-NG to the outside. Check with LRZ if you are in need of such licenses.
  • LRZ is currently not able to manage license contingents. The reason is that a significant additional effort is required, not only with suitable configuration of SLURM, but also with how the license servers are managed. The situation implies that a job will fail if the usage limit of a licensed software is exceeded when the job starts.

Conversion of scripts from LoadLeveler and other workload managers

  • see: List of the most common commands, environment variables, and job specification options used by the major workload management systems

Resource usage of jobs

For currently running jobs, queries can be done via the sstat command, for example

sstat --fields=MaxRSS%30,MaxRSSnode%30 --jobs=123456

would supply the maximum resident set size of all tasks in the job with ID 123456, as well as the node on which this value was reached. Note that this will only work if the executable is actually run under SLURM control, i.e. via the mpiexec or srun commands.

For jobs that are already completed, you need to contact the servicedesk to obtain such information. We currently cannot expose the sacct interface to regular users.

Specific Topics (jobfarming, constraints)

SLURM Documentation