The batch system on SuperMUC-NG is the open-source workload manager SLURM (Simple Linux Utility for Resource Management). For details about the SLURM batch system, see Slurm Workload Manager.
Submit hosts are usually the login nodes, from which batch jobs can be submitted and managed.
The Intel processors on SuperMUC-NG support hyperthreading, which might increase the performance of your application. To use hyperthreading, increase the number of MPI tasks per node from 48 to 96 in your job script. Be aware that with 96 MPI tasks per node each process gets only half of the memory by default. If you need more memory, you have to specify it in your job script and use the fat nodes (see example batch scripts).
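As a rough illustration of the arithmetic above, the following shell sketch computes the total task count and the per-task memory share with and without hyperthreading. The node count and the figure of 90 GB usable memory per node are assumptions made for this example; adjust both to your job.

```shell
#!/bin/bash
# Sketch: task count and per-task memory share with and without hyperthreading.
# NODES and MEM_PER_NODE_GB are illustrative assumptions, not LRZ defaults.
NODES=4
MEM_PER_NODE_GB=90

for TASKS_PER_NODE in 48 96; do
    TOTAL_TASKS=$((NODES * TASKS_PER_NODE))
    # Integer arithmetic in MB to avoid fractions
    MEM_PER_TASK_MB=$((MEM_PER_NODE_GB * 1024 / TASKS_PER_NODE))
    echo "tasks-per-node=$TASKS_PER_NODE total-tasks=$TOTAL_TASKS mem-per-task=${MEM_PER_TASK_MB}MB"
done
```

Doubling the tasks per node doubles the total task count and halves the memory share of each task, which is why memory-hungry applications may need the fat nodes.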
List of relevant commands
|sbatch||submit a job script|
|scancel||delete or terminate a queued or running job|
|squeue||print table of submitted jobs and their state. Note: non-privileged users can only see their own jobs.|
|salloc||create an interactive SLURM shell|
|srun||execute argument command on the resources assigned to a job. Note: must be executed inside an active job (script or interactive environment). mpiexec is an alternative and is preferred on LRZ systems.|
|sstat||display various status information of a running job/step|
|sinfo||provide overview of cluster status|
|scontrol||query and modify SLURM state|
sacct is not available for users.
SLURM partitions (Queues) and their limits
- Batch queues are called partitions in SLURM.
- The allocation granularity is multiples of one node (only complete nodes are allocated and accounted for).
- Scheduling and prioritization are based on a multifactor scheme including wait time, job size, partition, and required quality of service.
The following partitions are available. Check with sinfo for more details and special partitions:
nodes per job
max run time (hours)
|max running jobs per user||max submitted jobs per user (qos)|
(also used for interactive access with salloc)
(= 48*64 nodes, approx. half of system)
|90 GB||48||24 |
to be increased
(not yet available)
|64-3072||90 GB and 760 GB||48|
srun and mpiexec
With the SLURM srun command, users can spawn any kind of application, process, or task inside a job allocation, or directly start a parallel job (which indirectly asks SLURM to create the appropriate allocation). This can be a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application using MPI and OpenMP. When no allocation options are given to srun, the options from sbatch or salloc are inherited.
Note: mpiexec is the preferred and only supported way to start applications. srun might fail (particularly for hyperthreaded applications).
salloc / srun for interactive processing
salloc is used to allocate nodes for interactive processing. The options for resource specification in salloc/srun/sbatch are the same.
Currently, at least --account, --time and --partition must be specified!
If there are difficulties starting up, it may be advantageous to also specify --ear=off. See the EAR document for more details.
"srun" can be used instead of "mpiexec"; both commands execute on the nodes previously allocated by the salloc.
There is no advantage in using "salloc" over "sbatch --partition=test" in terms of wait time.
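The required options can be put together as in the following sketch. It only assembles and prints the command line. The account name is a hypothetical placeholder, and the walltime and node count are arbitrary example values.

```shell
#!/bin/bash
# Sketch: assemble an interactive allocation request.
# ACCOUNT is a hypothetical project ID; replace it with your own.
ACCOUNT="pr12ab"
PARTITION="test"
WALLTIME="00:30:00"
NODES=1

SALLOC_CMD="salloc --account=$ACCOUNT --partition=$PARTITION --time=$WALLTIME --nodes=$NODES"
echo "$SALLOC_CMD"

# On a login node, run the printed command. Inside the shell that salloc
# opens, start the program on the allocated nodes with e.g.:
#   mpiexec -n 48 ./myprog
```

If startup problems occur, --ear=off can be appended to the same command line.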
sbatch Command / #SBATCH option
Batch job options and resources can be given as command line switches to sbatch (in which case they override script-provided values), or they can be embedded into a SLURM job script as a comment line of the form #SBATCH <option>.
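A minimal sketch of the embedded form is shown below. It writes a small job script to a temporary file and lists the option lines that sbatch would read. The job name, the account "pr12ab", and the program ./myprog are hypothetical placeholders, and the resource values are arbitrary examples.

```shell
#!/bin/bash
# Sketch: write a minimal job script with embedded #SBATCH lines.
# Job name, account "pr12ab" and ./myprog are hypothetical placeholders.
JOB_SCRIPT="$(mktemp)"
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --account=pr12ab
#SBATCH --partition=test
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=48

module load slurm_setup
mpiexec -n $SLURM_NTASKS ./myprog
EOF

# Show the options that sbatch would read from the script
grep '^#SBATCH' "$JOB_SCRIPT"
N_DIRECTIVES=$(grep -c '^#SBATCH' "$JOB_SCRIPT")

# On a login node, a command line switch overrides the embedded value:
#   sbatch --time=00:10:00 "$JOB_SCRIPT"
```

Keeping the defaults in the script and overriding single values on the command line avoids editing the script for every variation of a run.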
Batch Job Examples
General options applicable for all jobs
Hints and Explanations:
Replacement patterns in filenames:
Request a specific partition ("queue") for the resource allocation.
Options for resources and execution
MPI without hyperthreading using number of tasks
module load slurm_setup
Loads specific settings that cannot be set up in the job prolog. Without this line your job will fail!
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count. The default behavior is to allocate enough nodes to satisfy the requirements of the ntasks and cpus-per-task options.
The default is one task per node, but note that the cpus-per-task option will change this default.
Request that ntasks be invoked on each node. If used with the ntasks option, the ntasks option will take precedence and the ntasks-per-node will be treated as a maximum count of tasks per node. Meant to be used with the nodes option.
Request that the maximum ntasks be invoked on each core.
Without this option, the controller will just try to allocate one core per task.
Maximum count of switches desired for the job allocation and optionally the maximum time to wait for that number of switches. Use this option only for very large jobs.
Submit a job array, multiple jobs to be executed with identical parameters.
In most cases mpiexec can be used without specifying the number of tasks, because this is inherited from the sbatch command. Slurm output variables can also be used, e.g.,
mpiexec -n $SLURM_NTASKS ./myprog
If SLURM can detect the number of tasks from its settings, it is sufficient to use mpiexec without further parameters, e.g.,
mpiexec ./myprog
By default, the system may dynamically change the clock frequency of CPUs during the run time of a job to optimise for energy consumption (for more details, see Energy Aware Runtime). This makes profiling or benchmark measurements difficult and unstable. Users can enforce a fixed default frequency by switching EAR off:
#SBATCH --ear=off
- Environment Variables for Process Pinning
- Interoperability with OpenMP
- Calculation of masks, example:
module load lrztools
Submitting several jobs with dependencies
Defer the start of this job until the specified dependencies have been satisfied. <dependency_list> is of the form <type:job_id[:job_id][,type:job_id[:job_id]]>
- after:job_id[:jobid...] job can begin execution after the specified jobs have begun execution.
- afterany:job_id[:jobid...] job can begin execution after the specified jobs have terminated.
- afternotok:job_id[:jobid...] job can begin execution after the specified jobs have terminated in some failed state (non-zero exit code, node failure, timed out, etc).
- afterok:job_id[:jobid...] job can begin execution after the specified jobs have successfully executed (ran to completion with an exit code of zero).
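The dependency types listed above can be combined with job IDs as in the following sketch. The helper build_dependency is hypothetical (it is not part of SLURM) and only concatenates the type and the job IDs. The script names in the usage comment are placeholders.

```shell
#!/bin/bash
# Sketch: build a --dependency argument from a type and a list of job IDs.
# build_dependency is a hypothetical helper, not part of SLURM.
build_dependency() {
    dep_type="$1"
    shift
    dep="$dep_type"
    for job_id in "$@"; do
        dep="$dep:$job_id"
    done
    echo "$dep"
}

build_dependency afterok 123456 123457   # prints afterok:123456:123457

# Possible use on a login node (step1.sh and step2.sh are placeholders).
# --parsable makes sbatch print only the job ID (plus the cluster name
# on multi-cluster setups):
#   first=$(sbatch --parsable step1.sh)
#   sbatch --dependency="$(build_dependency afterok "$first")" step2.sh
```

With afterok, the second job starts only if the first one finished with exit code zero. Use afterany if the follow-up job should run regardless of the outcome.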
Input Environment Variables
Upon startup, sbatch will read and handle the options set in the following environment variables. Note that environment variables will override any options set in a batch script, and command line options will override any environment variables. Some that you may want to set in $HOME/.profile:
Output Environment Variables
The Slurm controller sets the following variables in the environment of the batch script:
|Both variants return the SLURM JobID|
|SLURM_JOB_ACCOUNT||Account name associated with the job allocation|
|SLURM_JOB_NUM_NODES||Number of nodes.|
To convert the Slurm compressed format into a full list:
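One common way to do this is the scontrol show hostnames subcommand, sketched below. The wrapper function expand_nodelist and its fallback (echoing the input unchanged when scontrol is unavailable or fails) are assumptions made so that the snippet also runs outside of SLURM.

```shell
#!/bin/bash
# Sketch: expand a compressed node list into one host name per line.
# Falls back to echoing the input when scontrol is unavailable or fails,
# which is an assumption made so the snippet also runs outside SLURM.
expand_nodelist() {
    scontrol show hostnames "$1" 2>/dev/null || echo "$1"
}

# Inside a job, SLURM_JOB_NODELIST holds the compressed list
expand_nodelist "${SLURM_JOB_NODELIST:-localhost}"
```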
|SLURM_NTASKS||Number of tasks. Example of usage:|
mpiexec -n $SLURM_NTASKS ./myprog
These variables are only set if the corresponding sbatch option was given. Example of usage:
|Count of processors available to the Job: Returned value looks like "96(x128)".|
Number of tasks to be initiated on each node. Returned value looks like "8(x128)".
The MPI rank (or relative process ID) of the current process. Can be used in wrapper scripts
if [ $SLURM_PROCID ]
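A wrapper script along the following lines is one possible use, assuming that SLURM_PROCID is set for each process as described above and that only rank 0 should print messages. The function rank_role and the fallback rank 0 are hypothetical and exist only to keep the sketch runnable outside of a job.

```shell
#!/bin/bash
# Sketch of a wrapper: decide per process whether it should report.
# rank_role is a hypothetical helper; the default rank 0 is an assumption
# so that the snippet also runs outside a job.
rank_role() {
    if [ "$1" -eq 0 ]; then
        echo "reporting"
    else
        echo "silent"
    fi
}

rank="${SLURM_PROCID:-0}"
if [ "$(rank_role "$rank")" = "reporting" ]; then
    echo "rank $rank: starting run"
fi

# A real wrapper would end by replacing itself with the program, e.g.:
#   exec ./myprog "$@"
```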
sbatch allows for a filename pattern to contain one or more replacement symbols, which are a percent sign "%" followed by a letter, for example:
#SBATCH -o ./%x.%j.out
%j: jobid of the running job,
%t: task identifier (rank) relative to current job. This will create a separate IO file per task.
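To refer to the same file from inside the job, the name can be rebuilt from the output environment variables, as in this sketch. It assumes the pattern ./%x.%j.out from the example, where %x stands for the job name. The helper output_name and the fallback values are hypothetical.

```shell
#!/bin/bash
# Sketch: rebuild the name produced by "#SBATCH -o ./%x.%j.out" inside the job.
# output_name is a hypothetical helper; the fallbacks "myjob" and "0" are
# assumptions so that the snippet also runs outside SLURM.
output_name() {
    echo "./$1.$2.out"
}

out_file="$(output_name "${SLURM_JOB_NAME:-myjob}" "${SLURM_JOB_ID:-0}")"
echo "stdout of this job goes to $out_file"
```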
Show the estimated start time of a job:
squeue --start [-u <userID>]
Guidelines for resource selection
- Jobs that only use one or at most a few hardware cores perform serial processing and are not supported on SuperMUC-NG. Use the SuperMUC-Cloud for such purposes.
- Bunches of multiple independent tasks can be bundled into one job, using one or more nodes.
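A minimal sketch of such bundling on a single node is shown below: independent tasks are started in the background and the script waits for all of them. The function work_item is a hypothetical stand-in for a real serial program, and the task count is arbitrary. For multi-node bundles, the tasks would instead have to be started through mpiexec or srun, or with the job farming tools referenced under Specific Topics.

```shell
#!/bin/bash
# Sketch: run several independent tasks concurrently within one job on one node.
# work_item is a hypothetical stand-in for a real serial program.
work_item() {
    echo "task $1 done" > "$2/task_$1.out"
}

OUT_DIR="$(mktemp -d)"
N_TASKS=4

i=1
while [ "$i" -le "$N_TASKS" ]; do
    work_item "$i" "$OUT_DIR" &    # start each task in the background
    i=$((i + 1))
done
wait                               # block until all background tasks finish

# Count the result files; the arithmetic expansion strips any padding from wc
N_DONE=$(($(ls "$OUT_DIR" | wc -l)))
echo "$N_DONE of $N_TASKS tasks finished"
```

Since only complete nodes are allocated and accounted for, the number of concurrent tasks should be chosen so that the cores of the node are actually used.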
Run time limits
- Please note that all job classes impose a maximum run time limit, which can be adjusted downward for any individual job. Since the scheduler uses a backfill algorithm, a realistic run time limit improves the chance that your job starts early and thus its throughput.
- When a tree topology is used, this defines the maximum count of switches desired for the job allocation and optionally the maximum time to wait for that number of switches. If Slurm finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired (lower) switch count or the time limit expires. If there is no switch count limit, there is no delay in starting the job. This trades off better performance against a shorter wait time in the queue.
- The total memory available in user space for the set of nodes requested by the job must not be exceeded.
- The memory used on each individual node must not be exceeded by all tasks run on that node.
- Applications exist for which the memory usage is asymmetric. In this case it may become necessary to work with a variable number of tasks per node. One relevant scenario is a master-worker scheme where the master may need an order of magnitude more memory and therefore requires a node of its own, while the workers can share a node. LRZ provides the "mixed" partition for using thin and fat nodes concurrently.
Disk and I/O Requirements
- The disk and I/O requirements are not controlled by the batch scheduling system, but rely on parallel shared file systems, which provide system-global services with respect to bandwidth. This means that the total I/O bandwidth is shared between all users. As a consequence, all I/O may be significantly slowed down if the file systems are heavily used by multiple users at the same time, or even, for large-scale parallel jobs, by a single user. At present, LRZ cannot make any Quality of Service assurance for I/O bandwidth.
- The appropriate usage of the parallel file systems is essential.
- Please consult File Systems of SuperMUC-NG for more detailed technical information.
- Some jobs may make use of licensed software, either from the LRZ software application stack, or of software installed in the user's HOME directory. In many cases, the software needs to access a license server because there exist limits on how many instances of the software may run and who may access it at all.
- There is no connection from SuperMUC-NG to the outside. Check with LRZ if you are in need of such licenses.
- LRZ is currently not able to manage license contingents. The reason is that a significant additional effort is required, not only with suitable configuration of SLURM, but also with how the license servers are managed. The situation implies that a job will fail if the usage limit of a licensed software is exceeded when the job starts.
Conversion of scripts from LoadLeveler and other Workload Managers table
- see: List of the most common commands, environment variables, and job specification options used by the major workload management systems
Resource usage of jobs
For currently running jobs, queries can be done via the sstat command. For example,
sstat --format=MaxRSS,MaxRSSNode -j 123456
would supply the maximum resident set size of all tasks in the job with ID 123456, as well as the node on which this value was reached. Note that this only works if the executable is started under SLURM control, i.e. via the mpiexec or srun commands.
For jobs that are already completed, you need to contact the servicedesk to obtain such information. We currently cannot expose the sacct interface to regular users.
Specific Topics (jobfarming, constraints)
- SLURM Workload Manager at LRZ
- Command/option Summary (two pages)
- Documentation for SLURM at SchedMD
- The manual pages slurm(1), sinfo(1), squeue(1), scontrol(1), scancel(1), sstat(1), sview(1)