

General

The batch system on SuperMUC-NG is the open-source workload manager SLURM (Simple Linux Utility for Resource Management). For details about the SLURM batch system, see Slurm Workload Manager.

Submit hosts are usually login nodes that permit users to submit and manage batch jobs.

Intel processors on SuperMUC-NG support hyperthreading, which might increase the performance of your application. With hyperthreading, you have to increase the number of MPI tasks per node from 48 to 96 in your job script. Please be aware that with 96 MPI tasks per node, each process gets only half of the memory by default. If you need more memory, you have to specify it in your job script and use the fat nodes (see the example batch scripts below).
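For illustration, a minimal sketch of how the per-node task request differs; the complete job scripts are shown in the example batch scripts further down:

#Without hyperthreading: one MPI task per physical core
#SBATCH --ntasks-per-node=48

#With hyperthreading: one MPI task per logical CPU
#(each task then gets only half of the default memory)
#SBATCH --ntasks-per-node=96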

List of relevant commands

Command     Functionality
sbatch      submit a job script
scancel     delete or terminate a queued or running job
squeue      print a table of submitted jobs and their state
            (note: non-privileged users can only see their own jobs)
salloc      create an interactive SLURM shell
srun        execute a command on the resources assigned to a job
            (note: must be executed inside an active job, i.e. a job script or an
            interactive environment; mpiexec is an alternative and is preferred on
            LRZ systems)
sinfo       provide an overview of the cluster status
scontrol    query and modify the SLURM state

sacct is currently not working.
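A typical workflow with these commands might look as follows (the job script name and the job ID are placeholders):

sbatch ./my_batch_script       # submit the job script
squeue -u $USER                # show your own jobs and their state
scontrol show job <jobid>      # show detailed information about one job
scancel <jobid>                # cancel a queued or running job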

Queues (SLURM partitions) and their limits

  • Batch queues are called partitions in SLURM. 
  • The allocation granularity is multiples of one node (only complete nodes are allocated and accounted for).
  • Scheduling and prioritization are based on a multifactor scheme including wait time, job size, partition, and required quality of service.

The following partitions are available. Check with sinfo for more details and special partitions:

partition     min-max          max usable        cores      max run time          max running      max submitted
              nodes per job    memory            per node   (hours)               jobs per user    jobs per user (qos)
test          1-16             90 GB             48         0.5                   1                3
  (also used for interactive access with salloc)
micro         1-16             90 GB             48         48                    20               30
general       17-768           90 GB             48         48                    5                20
large         769-3072         90 GB             48         24 (to be increased)  2                5
  (half of the system)
fat           1-128            740 GB            48         48                    2                10
mixed         64-3072          90 GB and 760 GB  48         12 (to be increased)  1
  (not yet available)

srun and mpiexec

With the SLURM srun command, users can spawn any kind of application, process, or task inside a job allocation, or directly start a parallel job (and thereby implicitly ask SLURM to create the appropriate allocation). It can be a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application with MPI and OpenMP. When no allocation options are given to srun, the options from sbatch or salloc are inherited.

Note: srun at LRZ is defined as the alias
srun='I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so /usr/bin/srun'.
Since aliases are not inherited, the alias is only available in the login shell or in the initial batch script; everywhere else it falls back to /usr/bin/srun. Use the full syntax in these cases.
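For example, inside a secondary script where the alias is not available, the full form of the command could be used (myprog is a placeholder for your executable):

I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so /usr/bin/srun -n $SLURM_NTASKS ./myprog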

Note: mpiexec is the preferred and supported way to start applications. srun might fail for hyperthreaded applications.

salloc / srun  for interactive processing

salloc is used to allocate nodes for interactive processing. The options for resource specification in salloc/srun/sbatch are the same.
Currently, at least --account, --time and --partition must be specified!
If there are difficulties starting up, it may be advantageous to also specify --ear=off. See the EAR documentation for more details.
"srun" can be used instead of "mpiexec"; both commands execute on the nodes previously allocated by salloc.
There is no advantage to using "salloc" over "sbatch --partition=test" in terms of wait time.

 Example
login node> salloc -t 00:30:00 -p test -A <my-desired-project-id> -N 4
salloc: Pending job allocation 48417
salloc: job 48417 queued and waiting for resources
salloc: job 48417 has been allocated resources
salloc: Granted job allocation 48417
salloc: Waiting for resource configuration
salloc: Nodes i01r01c01s[03-06] are ready for job

i01r01c01s03> srun  --nodes=2 --ntasks=2 --partition=test hostname
i01r01c01s03.sng.lrz.de
i01r01c01s04.sng.lrz.de

i01r01c01s03> mpiexec -n 5 hostname 
i01r01c01s03.sng.lrz.de
i01r01c01s03.sng.lrz.de
i01r01c01s03.sng.lrz.de
i01r01c01s03.sng.lrz.de
i01r01c01s03.sng.lrz.de

i01r01c01s03> srun -n 5 hostname 
i01r01c01s05.sng.lrz.de
i01r01c01s04.sng.lrz.de
i01r01c01s06.sng.lrz.de
i01r01c01s03.sng.lrz.de
i01r01c01s03.sng.lrz.de

i01r01c01s03> exit
exit
salloc: Relinquishing job allocation 48417
login node> 


sbatch Command / #SBATCH option

Batch job options and resources can be given as command line switches to sbatch (in which case they override script-provided values), or they can be embedded into a SLURM job script as a comment line of the form #SBATCH <option>.
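For example, a value embedded in the script, such as the wall clock limit, could be overridden for a single submission (the script name is a placeholder):

sbatch --time=02:00:00 ./my_batch_script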

Batch Job Examples

General options applicable for all jobs

#!/bin/bash 
# Job Name and Files (also --job-name) 
#SBATCH -J jobname
#Output and error (also --output, --error): 
#SBATCH -o ./%x.%j.out 
#SBATCH -e ./%x.%j.err 
#Initial working directory (also --chdir): 
#SBATCH -D ./
#Notification and type
#SBATCH --mail-type=END
#SBATCH --mail-user=insert_your_email_here
# Wall clock limit: 
#SBATCH --time=24:00:00
#SBATCH --no-requeue
#Setup of execution environment
#SBATCH --export=NONE 
#SBATCH --get-user-env
#SBATCH --account=insert_your_projectID_here
#constraints are optional
##SBATCH --constraint="scratch&work"

<insert the specific options for resources and execution from below here>

#Important
module load slurm_setup

 

Hints and Explanations:

Replacement patterns in filenames:
%J:  jobid.stepid of the running job. (e.g. "128.0")
%j:  jobid of the running job.
%s:  stepid of the running job.
%t:  task identifier (rank) relative to current job. 
        this will create a separate IO file per task.
%u:  User name.
%x:  Job name.
%a:  Job array ID

Notification types:
NONE, BEGIN, END, FAIL, REQUEUE

requeue/no-requeue:
Whether the job should be eligible to be requeued or not. When
a job is requeued, the batch script is initiated from its
beginning. no-requeue specifies that the batch job should
never be requeued under any circumstances.

environment:
Do not export the variables of the submitting shell into the job (which would make debugging of errors nearly impossible for LRZ).

get-user-env sets the environment variables as they would be during login.

account:
Resources used by this job are subtracted from the budget of this project. The billing unit is core-hours.
Make sure that you use the right project.

constraint (optional):
Nodes can have features. Users can specify which of these features are required by their job using the constraint option. Only nodes with features matching the job constraints will be used to satisfy the request. Multiple constraints may be combined with AND (&), OR (|), matching OR, resource counts, etc. The availability of specific file systems can be specified as a constraint, giving LRZ the opportunity to start jobs which do not need all of them.
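As a sketch, assuming file system features named scratch and work as in the general options above:

#SBATCH --constraint="scratch&work"    # job needs both file systems
#SBATCH --constraint="scratch|work"    # either file system is sufficient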

slurm_setup

Loads specific settings which cannot be set up in the job prolog. Without this line your job will fail.

Options for resources and execution


 MPI without hyperthreading using number of tasks
#... (general part)
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=100 
#SBATCH --ntasks=4800 
#Run the program: 
mpiexec -n $SLURM_NTASKS ./myprog 
 MPI without hyperthreading using ntasks per node
#... (general part) 
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node: 
#SBATCH --nodes=128 
#SBATCH --ntasks-per-node=48 
#Run the program: 
mpiexec -n $SLURM_NTASKS ./myprog 
 MPI with hyperthreading using tasks per node
#... (general part) 
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=128
#SBATCH --ntasks-per-node=96 
#Note: Needs specific MPI version > 2019
#Run the program: 
#Pinning
#Task0->CPU0, Task1->CPU48, ...
#(CPU0 and CPU48 are on the same physical core!)
#Task2->CPU1, Task3->CPU49, ...
mpiexec -n $SLURM_NTASKS ./myprog 

#Optional: use more explicit pinning (spreading)
#I_MPI_PIN_PROCESSOR_LIST=0-95
#Task0->CPU0, Task1->CPU1, Task2->CPU2, ...
 Hybrid MPI/OpenMP without hyperthreading
#... (general part)
#SBATCH --partition=general
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=128
#SBATCH --ntasks-per-node=8

#Run the program: 
export OMP_NUM_THREADS=6 
#Default Pinning:
#Thread0/Task0 ->CPU0 or CPU48
#Thread1/Task0 ->CPU1 or CPU49
#Thread2/Task0 ->CPU2 or CPU50
mpiexec -n $SLURM_NTASKS ./myprog

#export I_MPI_PIN_CELL=core
#export I_MPI_PIN_DOMAIN=omp:compact
#Thread0/Task0 ->CPU0, Thread1/Task0 ->CPU1
#Thread2/Task0 ->CPU2, Thread3/Task0 ->CPU3
#....
#Thread0/Task1 ->CPU6, Thread1/Task1 ->CPU7

#explicit pinning with masks
#export I_MPI_PIN_CELL=unit
#export I_MPI_PIN_DOMAIN=[3F,FC0,3F000,FC0000,etc.]
 Hybrid MPI/OpenMP with hyperthreading
#... (general part) 
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node: 
#SBATCH --nodes=128  
#SBATCH --ntasks-per-node=8  

#Run the program:  
export OMP_NUM_THREADS=24
mpiexec -n $SLURM_NTASKS ./myprog

#Optional: Resulting in same pinning as default 
#export I_MPI_PIN_CELL=unit
#export I_MPI_PIN_DOMAIN=omp:compact
#Thread0/Task0 ->CPU0, Thread1/Task0 ->CPU1
#Thread5/Task0 ->CPU5 !!!
#Thread6/Task0 ->CPU48, Thread7/Task0 ->CPU49

#explicit pinning with masks
#export I_MPI_PIN_CELL=unit
#export I_MPI_PIN_DOMAIN="[FFFFFF,FFFFFF0000000,etc.]"
 Huge (Fat) Memory Jobs (>90 GB/node)
#... (general part)
#SBATCH --partition=fat
#Number of nodes and MPI tasks per node: 
#SBATCH --nodes=64 
#SBATCH --ntasks-per-node=48
#Run the program: 
mpiexec -n $SLURM_NTASKS ./myprog
 OpenMP Job
#... (general part)
#SBATCH --partition=micro
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=1 
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48
#Run the program:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./myprog
#Default:KMP_AFFINITY=granularity=thread,compact,1,0
#Thread0 ->CPU0, #Thread1 ->CPU1


 Large MPI Job (>768 nodes)
#… (general part)  
#SBATCH --partition=large
#Max number of islands and max waittime
#SBATCH --switches=2@24:00:00 
#SBATCH --nodes=1024
#SBATCH --ntasks-per-node=48 
mpiexec -n $SLURM_NTASKS ./myprog
 Array Jobs
#... (general part)
#SBATCH -o Array_job.%A_%a.out 
#SBATCH -e Array_job.%A_%a.err
#SBATCH --partition=general 
#Number of nodes and MPI tasks per node:
#SBATCH --nodes=10
#SBATCH --ntasks=480 
#SBATCH --cpus-per-task=1  
#SBATCH --array=1-10
mpiexec -n  $SLURM_NTASKS ./myprog  \
         <in.$SLURM_ARRAY_TASK_ID 

  

 Fixed frequency (for profiling/benchmarking)
#... (general part)
#SBATCH --ear=off
... 

Resource Specifications:

nodes=<minnodes[-maxnodes]>
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count. The default behavior is to allocate enough nodes to satisfy the requirements of the ntasks and cpus-per-task options.

ntasks:
The default is one task per node, but note that the cpus-per-task option will change this default.

ntasks-per-node:
Request that ntasks be invoked on each node. If used with the ntasks option, the ntasks option will take precedence and the ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the nodes option.

ntasks-per-core:
Request that the maximum ntasks be invoked on each core.

cpus-per-task:
Without this option, the controller will just try to allocate one core per task.

switches=<number>[@waittime hh:mm:ss]
Maximum count of switches desired for the job allocation and optionally the maximum time to wait for that number of switches. Use this option only for very large jobs.

array:
Submit a job array, multiple jobs to be executed with identical parameters.

mpiexec:
In most cases mpiexec can be used without specifying the number of tasks, because this is inherited from the sbatch command. Slurm output variables can also be used e.g.,
mpiexec -n $SLURM_NTASKS ./myprog

If SLURM can detect the number of tasks from its settings, it is sufficient to use mpiexec without further parameters, e.g.,
mpiexec ./myprog


Execution Specification:

By default, the system may dynamically change the clock frequency of CPUs during the run time of a job to optimise for energy consumption (for more details, see Energy Aware Runtime). This makes profiling or benchmark measurements difficult and unstable. Users can enforce a fixed default frequency by switching EAR off:

#SBATCH --ear=off





Submitting several jobs with dependencies


 Script for submitting several jobs with dependencies
#!/bin/bash 
# Chain of batch jobs with dependencies 
NR_OF_JOBS=6 
JOB_SCRIPT=./my_batch_script 
echo "Submitting chain of $NR_OF_JOBS jobs for batch script $JOB_SCRIPT" 
#submit and get JOBID
JOBID=$(  sbatch ${JOB_SCRIPT} 2>&1 \
       | awk '{print $(NF)}'  ) 
I=1 
while [ $I -lt $NR_OF_JOBS ] 
do  
   JOBID=$( sbatch --dependency=afterok:$JOBID \
         $JOB_SCRIPT 2>&1 \
       | awk '{print $(NF)}'  )  
   I=$(( $I+1 ))
done 
 

dependency=<dependency_list>
Defer the start of this job until the specified dependencies have been satisfied. <dependency_list> is of the form <type:job_id[:job_id][,type:job_id[:job_id]]>

  • after:job_id[:jobid...] job can begin execution after the specified jobs have begun execution.
  • afterany:job_id[:jobid...] job can begin execution after the specified jobs have terminated.
  • afternotok:job_id[:jobid...] job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).
  • afterok:job_id[:jobid...] job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
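The following sketches show how these types might be used on the command line (job IDs and script names are placeholders):

sbatch --dependency=afterok:4711 ./postprocess_script     # run only if job 4711 completed successfully
sbatch --dependency=afternotok:4711 ./cleanup_script      # run only if job 4711 failed
sbatch --dependency=afterany:4711:4712 ./archive_script   # run after both jobs have terminated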

Input Environment Variables

Upon startup, sbatch will read and handle the options set in the following environment variables. Note that environment variables will override any options set in a batch script, and command line options will override any environment variables. Some of them may be set by you in $HOME/.profile:

Variable            Option
SBATCH_ACCOUNT      --account
SBATCH_JOB_NAME     --job-name
SBATCH_REQUEUE      --requeue
SBATCH_NOREQUEUE    --no-requeue
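For example, a default account could be set in $HOME/.profile so that it need not be repeated in every job script (the project ID is a placeholder):

export SBATCH_ACCOUNT=insert_your_projectID_here    # default value for --account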

Output Environment Variables

The Slurm controller sets the following variables in the environment of the batch script:

Variable                    Description
SLURM_JOB_ID                Both variants return the SLURM job ID.
SLURM_JOBID

SLURM_JOB_ACCOUNT           Account name associated with the job allocation.

SLURM_JOB_NUM_NODES         Number of nodes.

SLURM_JOB_NODELIST          List of nodes allocated to the job. To convert the Slurm
                            compressed format into a full list:
                            scontrol show hostname $SLURM_JOB_NODELIST

SLURM_NTASKS                Number of tasks. Example of usage:
                            mpiexec -n $SLURM_NTASKS ./myprog

SLURM_NTASKS                These variables are only set if the corresponding sbatch
SLURM_NNODES                option was given. Example of usage:
SLURM_NPROCS                export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
SLURM_NTASKS_PER_NODE
SLURM_NTASKS_PER_CORE
SLURM_CPUS_PER_TASK
SLURM_JOB_NUM_NODES

SLURM_JOB_CPUS_PER_NODE     Count of processors available to the job. Returned value
                            looks like "96(x128)".

SLURM_TASKS_PER_NODE        Number of tasks to be initiated on each node. Returned
                            value looks like "8(x128)".

SLURM_PROCID                The MPI rank (or relative process ID) of the current
                            process. Can be used in wrapper scripts:

if [ "$SLURM_PROCID" -eq 0 ]
then
    ./master     # rank 0 acts as master
else
    ./slave      # all other ranks act as slaves
fi

File Patterns

sbatch allows a filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter.

Example:   #SBATCH -o ./%x.%j.out

Pattern   Expansion
%j        jobid of the running job
%J        jobid.stepid of the running job
%a        job array ID (index)
%u        user name
%x        job name
%t        task identifier (rank) relative to the current job; this will create a
          separate IO file per task

Useful commands

Show the estimated start time of a job: squeue --start [-u <userID>]

Guidelines for resource selection

Processing Mode

  • Jobs that only use one or at most a few hardware cores perform serial processing and are not supported on SuperMUC-NG. Use the SuperMUC-Cloud for such purposes.
  • Bunches of multiple independent tasks can be bundled into one job, using one or more nodes (see the sketch below).
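A minimal sketch of such a bundle, assuming a serial executable ./mytask and numbered input/output files (all names are placeholders); the general options shown above still apply:

#!/bin/bash
#SBATCH -J taskbundle
#SBATCH -o ./%x.%j.out
#SBATCH -e ./%x.%j.err
#SBATCH -D ./
#SBATCH --time=01:00:00
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --account=insert_your_projectID_here
#SBATCH --partition=micro
#SBATCH --nodes=1
#SBATCH --ntasks=48
module load slurm_setup

#Start 48 independent tasks on the allocated node and wait for all of them
for i in $(seq 1 48); do
    ./mytask < in.$i > out.$i 2>&1 &
done
wait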

Run time limits 

  • Please note that all job classes impose a maximum run time limit. It can be adjusted downward for any individual job. Since the scheduler uses a backfill algorithm, the more realistic the run time limit you specify, the better the throughput your job may achieve.

Islands/Switches

  • When a tree topology is used, this defines the maximum count of switches desired for the job allocation and optionally the maximum time to wait for that number of switches. If SLURM finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired (lower) switch count or the time limit expires. If there is no switch count limit, there is no delay in starting the job. This trades off better performance against shorter wait time in the queue.

Memory Requirements

  • The total memory available in user space for the set of nodes requested by the job must not be exceeded.
  • The memory used on each individual node must not be exceeded by all tasks run on that node.
  • Applications exist for which the memory usage is asymmetric. In this case it may become necessary to work with a variable number of tasks per node. One relevant scenario is a master-worker scheme where the master may need an order of magnitude more memory and therefore requires a node of its own, while workers can share a node. LRZ provides the "mixed" partition for using thin and fat nodes concurrently.

Disk and I/O Requirements

  • The disk and I/O requirements are not controlled by the batch scheduling system, but rely on parallel shared file systems, which provide system-global services with respect to bandwidth. This means that the total I/O bandwidth is shared between all users. The consequence is that all I/O may be significantly slowed down if heavily used by multiple users at the same time, or even, for large scale parallel jobs, by a single user. At present, LRZ cannot make any quality of service assurance for I/O bandwidth.
  • The appropriate usage of the parallel file systems is essential.
  • Please consult File Systems of SuperMUC-NG for more detailed technical information.

Licences

  • Some jobs may make use of licensed software, either from the LRZ software application stack, or of software installed in the user's HOME directory. In many cases, the software needs to access a license server because there exist limits on how many instances of the software may run and who may access it at all.
  • There is no connection from SuperMUC-NG to the outside. Check with LRZ if you are in need of such licenses.
  • LRZ is currently not able to manage license contingents, because this would require significant additional effort, not only for a suitable configuration of SLURM, but also for how the license servers are managed. This implies that a job will fail if the usage limit of a licensed software product is exceeded when the job starts.

Conversion of scripts from LoadLeveler and other Workload Managers table

  • see: List of the most common commands, environment variables, and job specification options used by the major workload management systems

Specific Topics (jobfarming, constraints)

SLURM Documentation
