The batch system on SuperMUC-NG is the open-source workload manager SLURM (Simple Linux Utility for Resource Management). For details about the SLURM batch system, see Slurm Workload Manager.
Submit hosts are usually login nodes from which batch jobs can be submitted and managed.
The Intel processors on SuperMUC-NG support hyperthreading, which might increase the performance of your application. To use hyperthreading, increase the number of MPI tasks per node from 48 to 96 in your job script. Be aware that with 96 MPI tasks per node each process gets only half of the memory by default. If you need more memory, specify it in your job script and use the fat nodes (see example batch scripts).
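The effect of hyperthreading on memory can be estimated with simple arithmetic. The following sketch assumes 90 GB of usable memory per thin node (the value that appears in the partition table) and an even split between tasks; `mem_per_task_mb` is a hypothetical helper, not an LRZ tool.

```shell
#!/bin/sh
# Hypothetical helper: memory per MPI task in MB, assuming the node memory
# is divided evenly among all tasks on the node.
mem_per_task_mb() {
    node_mem_gb=$1
    tasks_per_node=$2
    echo $(( node_mem_gb * 1024 / tasks_per_node ))
}

# Assumption: 90 GB usable per thin node.
for tasks in 48 96; do
    echo "$tasks tasks per node: $(mem_per_task_mb 90 "$tasks") MB per task"
done
```

Doubling the task count halves the share per task, which is why memory-hungry hyperthreaded runs may need the fat nodes.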
List of relevant commands
|sbatch||submit a job script|
|scancel||delete or terminate a queued or running job|
|squeue||print a table of submitted jobs and their state. Note: non-privileged users can only see their own jobs.|
|salloc||create an interactive SLURM shell|
|srun||execute the argument command on the resources assigned to a job. Note: must be executed inside an active job (script or interactive environment). mpiexec is an alternative and is preferred on LRZ systems.|
|sinfo||provide an overview of the cluster status|
|scontrol||query and modify the SLURM state|
sacct is currently not working.
Queues (SLURM partitions) and their limits
- Batch queues are called partitions in SLURM.
- The allocation granularity is multiples of one node (only complete nodes are allocated and accounted for).
- Scheduling and prioritization are based on a multifactor scheme including wait time, job size, partition, and required quality of service.
The following partitions are available. Check with sinfo for more details and special partitions:
nodes per job
max run time (hours)
|max running jobs per user||max submitted|
jobs per user (qos)
(also used for interactive access with salloc)
(half of system)
|90 GB||48||24 |
to be increased
(not yet available)
|64-3072||90 GB and 760 GB||48|
srun and mpiexec
With the SLURM srun command, users can spawn any kind of application, process, or task inside a job allocation, or directly start a parallel job (and thereby indirectly ask SLURM to create the appropriate allocation). The command can be a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application with MPI and OpenMP. When no allocation options are given to srun, the options from sbatch or salloc are inherited.
Note: srun at LRZ is defined as the alias
Since aliases are not inherited, the alias is only available in the login shell or in the initial batch script; everywhere else, srun falls back to /usr/bin/srun. Use the full syntax in these cases.
Note: mpiexec is the preferred and supported way to start applications. srun might fail for hyperthreaded applications.
salloc / srun for interactive processing
salloc is used to allocate nodes for interactive processing. The options for resource specification in salloc/srun/sbatch are the same.
Currently, at least --account, --time and --partition must be specified!
If there are difficulties starting up, it may be advantageous to also specify --ear=off. See the EAR document for more details.
"srun" can be used instead of "mpiexec"; both commands execute on the nodes previously allocated by the salloc.
In terms of wait time, there is no advantage in using "salloc" over "sbatch --partition=test".
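A minimal sketch of how an interactive request could be assembled under the rule that account, time, and partition must be given. The function only prints the salloc command line so it can be inspected before use. `build_salloc` is a hypothetical helper, `myproject` and `my_app` are placeholders, and `test` is the partition named above.

```shell
#!/bin/sh
# Hypothetical helper: refuse to build an salloc command line unless the
# three mandatory options (account, time, partition) are all given.
build_salloc() {
    account=$1; time_limit=$2; partition=$3; nodes=${4:-1}
    if [ -z "$account" ] || [ -z "$time_limit" ] || [ -z "$partition" ]; then
        echo "usage: build_salloc ACCOUNT TIME PARTITION [NODES]" >&2
        return 1
    fi
    echo "salloc --account=$account --time=$time_limit --partition=$partition --nodes=$nodes"
}

# "myproject" is a placeholder for your own project ID.
build_salloc myproject 00:30:00 test 1
# Inside the resulting interactive shell, a program could then be started with:
#   mpiexec ./my_app
```

If startup problems occur, `--ear=off` can be appended to the printed command, as mentioned above.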
sbatch Command / #SBATCH option
Batch job options and resources can be given as command line switches to sbatch (in which case they override script-provided values), or they can be embedded into a SLURM job script as comment lines of the form #SBATCH <option>.
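As an illustration of embedded options, the sketch below writes a small job script whose resource requests are given as #SBATCH comment lines. All values (job name, `myproject`, node counts, `my_app`) are placeholders chosen for this example, not recommended settings, and the site-specific setup lines that LRZ requires are not shown.

```shell
#!/bin/sh
# Write a small job script whose resource requests are embedded as
# "#SBATCH" comment lines. All values are placeholders.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH -J myjob
#SBATCH -o ./%x.%j.out
#SBATCH --get-user-env
#SBATCH --account=myproject
#SBATCH --partition=test
#SBATCH --time=00:30:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48

# Site-specific setup lines required on the system are omitted here.
mpiexec -n "$SLURM_NTASKS" ./my_app
EOF

# Submit it; a command line switch overrides the value in the script:
#   sbatch --time=00:10:00 myjob.sh
echo "wrote myjob.sh with $(grep -c '^#SBATCH' myjob.sh) #SBATCH lines"
```

The quoted heredoc delimiter keeps `$SLURM_NTASKS` unexpanded, so it is evaluated only when the job runs.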
Batch Job Examples
General options applicable for all jobs
Hints and Explanations:
Replacement patterns in filenames:
get-user-env sets the environment variables as they are set during login.
specific settings which cannot be set up in the job prolog. Without this line your job will fail.
Options for resources and execution (select and click to expand)
MPI without hyperthreading using number of tasks
MPI without hyperthreading using ntasks per node
MPI with hyperthreading using tasks per node
Hybrid MPI/OpenMP without hyperthreading
Hybrid MPI/OpenMP with hyperthreading
Huge (Fat) Memory Jobs (>90 GB/node)
Large MPI Job (>792 nodes)
Fixed frequency (for profiling/benchmarking)
If SLURM can detect the number of tasks from its settings, it is sufficient to use mpiexec without further parameters.
By default, the system may dynamically change the clock frequency of the CPUs during the run time of a job to optimise energy consumption (for more details, see Energy Aware Runtime). This makes profiling or benchmark measurements difficult and unstable. Users can enforce a fixed default frequency by switching EAR off with --ear=off.
Submitting several jobs with dependencies
Script for submitting several jobs with dependencies
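A possible shape for such a script is sketched below. It relies on two standard sbatch options, `--parsable` (print only the job ID) and `--dependency=afterok:<jobid>` (start only after the named job has finished successfully). The function name `submit_chain`, the `SBATCH_CMD` override for dry runs, and the script names are assumptions made for this example.

```shell
#!/bin/sh
# Submit the given job scripts as a chain: each job starts only after the
# previous one has finished successfully (afterok).
# SBATCH_CMD is a hypothetical override, useful for dry runs; default: sbatch.
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            out=$("${SBATCH_CMD:-sbatch}" --parsable "$script") || return 1
        else
            out=$("${SBATCH_CMD:-sbatch}" --parsable \
                  --dependency="afterok:$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep the job ID only.
        prev=${out%%;*}
        echo "submitted $script as job $prev"
    done
}

# Usage (script names are placeholders):
#   submit_chain step1.sh step2.sh step3.sh
```

The chain stops at the first failed submission, so later steps are never queued against a job ID that does not exist.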
Input Environment Variables
Upon startup, sbatch will read and handle the options set in the following environment variables. Note that environment variables will override any options set in a batch script, and command line options will override any environment variables. Some of these may be useful to set in $HOME/.profile:
Output Environment Variables
The Slurm controller sets the following variables in the environment of the batch script:
|Both variants return the SLURM JobID|
|SLURM_JOB_ACCOUNT||Account name associated with the job allocation|
|SLURM_JOB_NUM_NODES||Number of nodes.|
To convert the Slurm compressed format into a full list:
|SLURM_NTASKS||Number of tasks. Example of usage:|
mpiexec -n $SLURM_NTASKS
These variables are only set if the corresponding sbatch option was given. Example of usage:
|Count of processors available to the Job: Returned value looks like "96(x128)".|
Number of tasks to be initiated on each node. Returned value looks like "8(x128)".
The MPI rank (or relative process ID) of the current process. Can be used in wrapper scripts
if [ $SLURM_PROCID ]
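The compressed values such as "96(x128)" can be expanded in a script, and SLURM_PROCID can restrict output to a single task. The sketch below is an assumption-laden illustration: `total_from_compressed` is a hypothetical helper, SLURM_JOB_CPUS_PER_NODE is the standard Slurm name for the processor count variable, and the fallback values are used only so the snippet also runs outside a job.

```shell
#!/bin/sh
# Sum a Slurm compressed count list such as "96(x128)" or "8(x2),4"
# ("value(xrepeat)" entries separated by commas).
total_from_compressed() {
    total=0
    old_ifs=$IFS; IFS=,
    for entry in $1; do
        case $entry in
            *"(x"*")")
                value=${entry%%"("*}
                repeat=${entry#*"(x"}; repeat=${repeat%")"}
                ;;
            *)
                value=$entry; repeat=1
                ;;
        esac
        total=$(( total + value * repeat ))
    done
    IFS=$old_ifs
    echo "$total"
}

# Fallback value for running this sketch outside of a job.
example="96(x128)"
cpus=${SLURM_JOB_CPUS_PER_NODE:-$example}

# Print only from the first task; outside a job step SLURM_PROCID is unset
# and the fallback 0 is used.
if [ "${SLURM_PROCID:-0}" -eq 0 ]; then
    echo "total CPUs in allocation: $(total_from_compressed "$cpus")"
fi
```

Note that SLURM_PROCID is set per task by srun; with other launchers the per-rank variable may differ.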
sbatch allows for a filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter
#SBATCH -o ./%x.%j.out
jobid of the running job,
task identifier (rank) relative to current job. This will create a separate IO file per task.
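To make the effect of the replacement symbols concrete, the sketch below mimics the substitution of %x (job name) and %j (job ID) for the pattern shown above. It is an illustration only: the real expansion is done by Slurm, `expand_pattern` is a hypothetical helper, and the job name and ID are made-up values.

```shell
#!/bin/sh
# Illustration only: mimic how sbatch expands %x (job name) and %j (job ID).
# The real substitution is done by Slurm itself.
# Arguments: pattern, job name, job ID (names containing "/" are not handled).
expand_pattern() {
    printf '%s\n' "$1" | sed -e "s/%x/$2/g" -e "s/%j/$3/g"
}

expand_pattern "./%x.%j.out" myjob 1234567
```

Including %j in the pattern keeps the output of repeated submissions of the same script in separate files.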
Show the estimated start time of a job:
squeue --start [-u <userID>]
Guidelines for resource selection
- Jobs that only use one or at most a few hardware cores perform serial processing and are not supported on SuperMUC-NG. Use the SuperMUC-Cloud for such purposes.
- Bunches of multiple independent tasks can be bundled into one job, using one or more nodes.
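One simple way to bundle independent tasks inside a single job is to start them as background processes and wait for them in batches. The sketch below covers only the single-node case and is not LRZ's job farming tooling; `run_bundle` is a hypothetical helper, and the echo commands stand in for real work.

```shell
#!/bin/sh
# Hypothetical helper: run independent commands in batches of at most
# $1 concurrent background processes; each remaining argument is one command.
run_bundle() {
    max=$1; shift
    running=0
    for cmd in "$@"; do
        sh -c "$cmd" &
        running=$(( running + 1 ))
        if [ "$running" -ge "$max" ]; then
            wait            # block until the current batch has finished
            running=0
        fi
    done
    wait                    # collect the last, possibly partial, batch
}

# Three placeholder tasks, at most two running at the same time.
run_bundle 2 "echo task 1" "echo task 2" "echo task 3"
```

Waiting per batch is the simplest scheme; if task run times differ strongly, a scheduler that refills free slots immediately would use the node better.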
Run time limits
- Please note that all job classes impose a maximum run time limit, which can be adjusted downward for any individual job. Since the scheduler uses a backfill algorithm, the more realistic the run time limit you specify, the better the throughput your job may achieve.
- When a tree topology is used, this defines the maximum count of switches desired for the job allocation and optionally the maximum time to wait for that number of switches. If Slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired (lower) switch count or the time limit expires. If there is no switch count limit, there is no delay in starting the job. This trades off better performance against a shorter wait time in the queue.
- The total memory available in user space for the set of nodes requested by the job must not be exceeded.
- The memory used on each individual node must not be exceeded by all tasks run on that node.
- Applications exist for which the memory usage is asymmetric. In this case it may become necessary to work with a variable number of tasks per node. One relevant scenario is a master-worker scheme where the master may need an order of magnitude more memory and therefore requires a node of its own, while the workers can share a node. LRZ provides the "mixed" partition for using thin and fat nodes concurrently.
Disk and I/O Requirements
- The disk and I/O requirements are not controlled by the batch scheduling system; they are served by parallel shared file systems, which provide system-global services with respect to bandwidth. This means that the total I/O bandwidth is shared between all users. As a consequence, all I/O may be significantly slowed down if the file systems are heavily used by multiple users at the same time, or even, for large-scale parallel jobs, by a single user. At present, LRZ cannot make any Quality of Service assurance for I/O bandwidth.
- The appropriate usage of the parallel file systems is essential.
- Please consult File Systems of SuperMUC-NG for more detailed technical information.
- Some jobs may make use of licensed software, either from the LRZ software application stack, or of software installed in the user's HOME directory. In many cases, the software needs to access a license server because there exist limits on how many instances of the software may run and who may access it at all.
- There is no connection from SuperMUC-NG to the outside. Check with LRZ if you are in need of such licenses.
- LRZ is currently not able to manage license contingents. The reason is that a significant additional effort is required, not only with suitable configuration of SLURM, but also with how the license servers are managed. The situation implies that a job will fail if the usage limit of a licensed software is exceeded when the job starts.
Conversion of scripts from LoadLeveler and other Workload Managers table
- See: List of the most common commands, environment variables, and job specification options used by the major workload management systems
Specific Topics (jobfarming, constraints)
- SLURM Workload Manager at LRZ
- Command/option Summary (two pages)
- Documentation for SLURM at SchedMD
- The manual pages slurm(1), sinfo(1), squeue(1), scontrol(1), scancel(1), sview(1)