Page Content

What you should know at the beginning

All programs in the parallel or serial segments of the cluster must be started up using either

a SLURM batch script or
an interactive SLURM shell.

In order to access the SLURM infrastructure described here, please first log in to a login node of the cluster as described in Access and Login to the Linux-Cluster.

This document and its subdocuments provide information on:

Linux Cluster segments and their specifications,
resource limits imposed for various job classes, comprising runtime limits, limits on CPU resources, memory limits and job submission limits
how to setup and manage Slurm jobs:

Choose the compute resources of the Linux Cluster with great care! In particular, please be aware that misuse (see below) of the resources described here can result in the invalidation of the violating account!

Please respect our Linux Cluster policies and rules!

Do you need help or do you fear misusing the HPC resources?

Just get in touch with us for further consulting via Linux Cluster request at LRZ Servicedesk or via HPC Lounge.

Step 1: Get the resource which fits my needs aka "jobs that the Slurm scheduler likes"

In order to allocate the optimal resource for your job, you may consider different job types, e.g.:

standard distributed memory parallel job, e.g. using MPI
shared-memory parallel job on a single node, e.g. using OpenMP
single-core (serial) job
single-node parallel job requiring a fraction of both the available CPU cores and memory
single-node parallel job requiring a fraction of the available CPU cores but needing almost all memory of a regular compute node on CoolMUC-4
large-memory job

The following decision matrix provides assistance in making the right choice.

Decision matrix to select the appropriate partition for production jobs (interactive tests)
How many CPU resources does my job need?	How much memory does my job need?
How many CPU resources does my job need?	up to 488 GB per node	489 - 1000 GB per node	up to 6000 GB per node
more than 4 compute nodes	CoolMUC-4 cannot satisfy this requirement! Please check whether an application for a (test) project on SuperMUC-NG is a suitable alternative.
2 - 4 compute nodes	cm4_std (cm4_inter)
more than 8 CPU cores and max. 1 compute node	cm4_tiny (cm4_inter)	teramem_inter	teramem_inter
1 - 10 CPU cores	serial_std serial_long (cm4_inter)	serial_std serial_long (cm4_inter, teramem_inter)	teramem_inter (teramem_inter)

Step 2: Based on my choice, what job specifications do I have to set?

In the following, we list the appropriate cluster- and partition-specific Slurm settings which need to be specified in your job. For full job script examples, please consult Running parallel jobs on the Linux Cluster or Running serial jobs on the Linux Cluster.

Cluster "cm4"
Partition name	Slurm job settings
cm4_std	--clusters=cm4 --partition=cm4_std --qos=cm4_std
cm4_tiny	--clusters=cm4 --partition=cm4_tiny --qos=cm4_tiny

Cluster "serial"
Partition name	Slurm job settings
serial_std	--clusters=serial --partition=serial_std --mem=<memory_per_node_GB>G # e.g.: --mem=100G
serial_long	--clusters=serial --partition=serial_long --mem=<memory_per_node_GB>G # e.g.: --mem=100G

Cluster "inter"
Partition name	Slurm job settings
cm4_inter	--clusters=inter --partition=cm4_inter
teramem_inter	--clusters=inter --partition=teramem_inter --mem=<memory_per_node_GB>G # e.g.: --mem=100G

Step 3: Check further specifications and limits of clusters and partitions

Cluster specifications					Limits						Node usage
Slurm cluster segment	Slurm partition	Compute nodes in partition	CPU cores per node	GPUs per node	Node range per job min - max	Minimum CPU limit (physical cores)	Maximum CPU limit (physical cores)	Maximum job runtime (hours)	Maximum running (submitted) jobs per user	Memory limit	Node usage
Cluster system CoolMUC-4: Sapphire Rapids (Intel(R) Xeon(R) Platinum 8480+) nodes
cm4	cm4_std	100 (overlapping partitions)	112 (physical) 224 (logical)	--	2 - 4	112 per job	448 per job	24	2 (25)	488 GiB per node	exclusive
cm4	cm4_tiny	100 (overlapping partitions)		--	1 - 1	9 per job	112 per job	24	4 (25)	default: 2.1 GiB per logical CPU core overall limit: 488 GiB per node	shared
inter	cm4_inter	6		--	1 - 4	1 per job	112 per job	8	1 (2)		shared
Cluster system CoolMUC-4: Ice Lake (Intel(R) Xeon(R) Platinum 8380) nodes
serial	serial_std	10	80 (physical) 160 (logical)	--	1 - 1	1	20 per job 96 per user (in sum over all jobs, see remarks)	24	96 (200)	default: 6.2 GiB per logical CPU core overall limit: 1000 GiB per node	shared
serial	serial_long	2	80 (physical) 160 (logical)	--	1 - 1	1	20 per job 96 per user (in sum over all jobs, see remarks)	168	96 (200)		shared
Cluster system Teramem: single-node shared-memory system (Intel Xeon Platinum 8360HL), 6 TB memory
inter	teramem_inter	1	96 (physical) 192 (logical)	--	1 - 1	1 per job	96 per job	240	1 (1)	approx. 60 GiB per physical core available	shared

Remarks on CPU limit

CPU limit refers to the number of available hardware (physical) cores. When setting up your Slurm jobs, please consider the following characteristics and restrictions:

Jobs may incorporate all logical cores, i.e., the number of physical cores times number of threads that can simultaneously run on one core via hyperthreading. In case of CoolMUC-4, this is twice the number of physical cores. Shared-memory jobs (e.g. using OpenMP) or hybrid jobs (e.g. using MPI + OpenMP) may benefit from that. Please also refer to our Slurm job examples.
The total number of requested cores, i.e. the product of "tasks per node" (e.g. MPI processes) and "CPUs per tasks" (e.g. OpenMP threads) ...
- Must not exceed the maximum number of cores which is available or allowed to be used per compute node on a particular partition!
- Must not be smaller than the minimum number of cores allowed to be used on a compute node!
Important: Due to limited hardware resources in the serial cluster, the following restrictions apply:
- Maximum number of CPU cores per job is 10!
- The total number of requested CPU cores by a user on the entire serial cluster, which are in use at the same time in running jobs, must not exceed 96! For example, typically you may run a bunch of single-core jobs or 6 jobs with 16 cores each. Further jobs can be submitted but will have to wait.

The terms "tasks per node" and "CPUs per tasks" refer to the according Slurm specifications "--tasks-per-node" and "--cpus-per-task". Please refer to our Slurm documentation and CoolMUC-4 job script examples.

Remarks on Memory limit

Keep in mind that the default memory per job on shared partitions (serial cluster, cm4_tiny, cm4_inter) scales with the number of allocated CPU cores. If more memory is required, this has to be defined in the job via the "–mem" option.

General remarks

Please note:

Nodes on partitions cm4_tiny, cm4_inter, serial_std, serial_long and teramem_inter are used as shared resources, i.e., multiple jobs/users share those nodes. cm4_std only provides exclusive nodes for jobs.
Partitions cm4_std, cm4_tiny, serial_std and serial_long are only intended for batch jobs (via sbatch command, see below)!
The partition cm4_inter is intended for interactive jobs (via salloc command, see below). Due to the short job runtime, this partition is suitable for small test jobs.
Both batch jobs and interactive jobs can be run on the partition teramem_inter.

Common Slurm commands on the Linux Cluster for job submission and job management

Once submitted, a Slurm job will be queued for some time, depending on how many jobs are presently submitted and how many resources are available. Eventually, more or less after previously submitted jobs have completed, the job will be started on one or more of the systems determined by its resource requirements. Slurm provides several commands to check the status of waiting or running jobs, to inspect or even modify (modification only to limited extent) waiting/running jobs, to obtain information of finished jobs and to delete waiting/running jobs. In the following, we show some commonly used Slurm commands.

Submit jobs

Learn more on our documentation pages Running parallel jobs on the Linux Cluster and Running serial jobs on the Linux Cluster.

See also: https://slurm.schedmd.com/sbatch.html (the link refers to the documentation of the latest Slurm version, which might deviate from the installed version)

Learn more on our documentation page on Running interactive jobs on the Linux Cluster.

See also: https://slurm.schedmd.com/salloc.html (the link refers to the documentation of the latest Slurm version, which might deviate from the installed version)

Obtain job information

The status of a job can be queried with the squeue command. In order to get some basic job information, we show a simple example how to use it. The command squeue can provide a lot of information on jobs. Issue the command "squeue --help" to get an impression on that.

# "-M" can be used instead of "--clusters=", same for "-p" vs. "--partition="
# squeue -M <cluster_name> -p <partition_name> -u $USER

squeue -M cm4 -p cm4_tiny -u $USER

The latter command will produce the following output showing Running and PenDing jobs (also showing the reason why a job is waiting):

CLUSTER: cm4
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           7918678  cm4_tiny job-name ab01xyz2 PD       0:00      1 (QOSMaxJobsPerUserLimit)
           7918679  cm4_tiny job-name ab01xyz2 PD       0:00      1 (QOSMaxJobsPerUserLimit)
           7918676  cm4_tiny job-name ab01xyz2  R    6:07:52      1 name_of_allocated_node
           7918673  cm4_tiny job-name ab01xyz2  R    6:12:09      1 name_of_allocated_node
           7918674  cm4_tiny job-name ab01xyz2  R    6:12:09      1 name_of_allocated_node
           7918675  cm4_tiny job-name ab01xyz2  R    6:12:09      1 name_of_allocated_node

Get the start time and state of my jobs: However, this is just an estimation! The start time is regularly re-calculated by Slurm and may vary significantly!

squeue -M <cluster_name> -p <partition_name> -j <job_id> -O "jobid,state,priority,starttime,reason"

See also: https://slurm.schedmd.com/squeue.html (the link refers to the documentation of the latest Slurm version, which might deviate from the installed version)

# for sacct "-r" needs to be used instead of "--partition="
# compact output of jobs started after a certain date
sacct -M <cluster_name> -r <partition_name> -X -u $USER --starttime=2025-01-24T00:00:01

# some more details, e.g.: resources, start time, run time, max. memory consumption (KB), job state, reason, exit code, list of allocated nodes
sacct -M <cluster_name> -r <partition_name> –o jobid,nnodes,ntasks,start,elapsed,maxrss,state,reason,exitcode,nodelist

See also: https://slurm.schedmd.com/sacct.html (the link refers to the documentation of the latest Slurm version, which might deviate from the installed version)

First of all, you may check the start time of your job invoking Slurm's squeue command. See item "Check the status of my running/waiting jobs, for example on partition cm4_tiny [squeue]" from above.

There are multiple reasons why jobs have to wait, e.g.:

technical reasons on the system side → check cluster status via sinfo command (see below),
your job priority.

On the Linux Cluster, Slurm uses a priority system which is based on several factors:

fairshare policy: considering consumed compute time,
the age of waiting jobs,
the job size.

The job priority is dominated by (1) and (2). (3) plays a minor role. If your job is waiting very long, it seems that you have already consumed your shares on a particular cluster segment. Now the priority of your jobs depends only on the aging factor, which is 0 at job submission but continuously increases. Additionally, your fairshare value will also fully recover over a time-scale of a few weeks, resulting in a reduction of the penalty applied to your job.

Consequence: As long as there are users who have consumed less compute time than you, they will get a higher priority and their jobs will run before yours. But, also their next jobs will have reduced their priority. You may try another cluster segment (please refer to Job Processing on the Linux-Cluster). The shares are independent for each cluster so you shouldn't have any penalty on a cluster you have not run any job yet.

You may also check your fairshare value, e.g. on the cluster cm4:

sshare --clusters=cm4  -o User,FairShare

The value ranges from 0 to 1. In other words: "0" means, that on a particular cluster segment all jobs will have a very low priority at the binning (at submission time).

Learn more about Slurm's fairshare algorithm

https://slurm.schedmd.com/priority_multifactor.html#fairshare

https://slurm.schedmd.com/classic_fair_share.html

See also: https://slurm.schedmd.com/sshare.html (the link refers to the documentation of the latest Slurm version, which might deviate from the installed version)

Manipulate jobs

# delete a single job
scancel -M <cluster_name> <job_id>

# delete multiple jobs via a space-separated list of job IDs, e.g.:
scancel -M <cluster_name> <job_id1 job_is2 job_id3>

See also: https://slurm.schedmd.com/scancel.html (the link refers to the documentation of the latest Slurm version, which might deviate from the installed version)

Jobs can be inspected for their characteristics via the scontrol command. It provides various information.

scontrol -M <cluster_name> show jobid=<job_id>

As long as the job is waiting in the queue, scontrol can also be used to modify some of the characteristics, e.g. the runtime. Thus, it is not necessary to cancel and resubmit the job. Please note, that the job runtime can only be reduced. It is not allowed to increase it. For example, the run time of a job was set to 8 hours and needs to be reduced to 4 hours:

scontrol -M <cluster_name> update jobid=<job_id> TimeLimit=04:00:00

Decrease the number of tasks or threads, e.g.:

scontrol -M <cluster_name> update jobid=<job_id> NumTasks=8

scontrol -M <cluster_name> update jobid=<job_id> NumCPUs=8

See also: https://slurm.schedmd.com/scontrol.html (the link refers to the documentation of the latest Slurm version, which might deviate from the installed version)

Obtain cluster information

sinfo -M <cluster_name> -p <partition_name>

For example, on the cm4_tiny partition, the output may look like:

CLUSTER: cm4
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cm4_tiny     up 1-00:00:00      1   comp node_name
cm4_tiny     up 1-00:00:00      1  drain node_name
cm4_tiny     up 1-00:00:00      2   fail node_names
cm4_tiny     up 1-00:00:00     90  alloc node_names
cm4_tiny     up 1-00:00:00      6   idle node_names

Common node states are:

alloc: nodes allocated by user jobs,
comp: a job on that node is completing,
drain: after a running job has completed on that node, the node will become unavailable, e.g. for a reboot or maintenance,
fail: not available due to technical reasons,
idle: either available for immediate job starts or the nodes are held back by Slurm for a bigger job which is next in the queue,
maint: node is in maintenance.

See also: https://slurm.schedmd.com/sinfo.html (the link refers to the documentation of the latest Slurm version, which might deviate from the installed version)