Slurm Command Examples on the Linux Cluster
| Documentation links you should know | |
|---|---|
| Job Processing on the Linux-Cluster | Overview of Linux Cluster partitions, resource limits, job limits, job specifications, common Slurm commands on the Linux Cluster |
| SLURM Workload Manager | Slurm commands and options for job submission, explanations, recommendations |
This document lists common Slurm commands for job submission, job manipulation, and obtaining job and cluster information on CoolMUC-4.
Notes on usage of Slurm commands
For clarity, we use the Slurm command options "--clusters=" and "--partition=" to specify the cluster and partition name. For most commands you may also use the short options "-M" and "-p", e.g.:
sinfo -M cm4 -p cm4_tiny
Exception: sacct command → use "-r" instead of "-p", e.g.:
sacct -M cm4 -r cm4_tiny -X --starttime=2025-08-01T00:00:00
We strongly recommend specifying the cluster name and the partition name in all Slurm commands!
If you omit both, the Slurm command will be applied to the inter cluster!
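For illustration, the first example above written with the recommended long options (equivalent to sinfo -M cm4 -p cm4_tiny):

```
# Query partition cm4_tiny on cluster cm4 using the long option names
sinfo --clusters=cm4 --partition=cm4_tiny
```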
Submit jobs
Batch jobs (non-interactive)
Example:

Submit a batch job on the login node:

```
myuserid@cm4login2:~> sbatch my_job_script.slurm
Submitted batch job 17130 on cluster cm4
```
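The contents of my_job_script.slurm are not shown above. The following is only a minimal sketch of such a script, assuming the cm4_tiny partition and illustrative resource values; adjust job name, partition, task count and runtime to your own needs:

```
#!/bin/bash
#SBATCH --job-name=my_job           # job name shown by squeue and sacct
#SBATCH --clusters=cm4              # cluster name (recommended, see above)
#SBATCH --partition=cm4_tiny        # partition name (illustrative)
#SBATCH --nodes=1                   # number of nodes
#SBATCH --ntasks-per-node=8         # illustrative number of tasks per node
#SBATCH --time=00:30:00             # maximum runtime (hh:mm:ss)

srun ./MyApplication                # replace with your own program
```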
Learn more here: Running parallel jobs on the Linux Cluster and Running serial jobs on the Linux Cluster

Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/sbatch.html
Interactive jobs
Example:

Submit an interactive job on the login node via the salloc command, e.g. on the partition cm4_inter, for a maximum runtime of 30 minutes and using 20 CPU cores:

```
userid@cm4login2:~> salloc --clusters=inter --partition=cm4_inter -n 20 -t 00:30:00
salloc: Pending job allocation 303762
salloc: job 303762 queued and waiting for resources
salloc: job 303762 has been allocated resources
salloc: Granted job allocation 303762
userid@cm4r00c00s00:~> srun MyApplication
[...]
```
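When you are done, leave the shell that salloc opened in order to release the allocation. A minimal sketch (the exact salloc message may differ):

```
userid@cm4r00c00s00:~> exit
salloc: Relinquishing job allocation 303762
```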
Learn more here: Running interactive jobs on the Linux Cluster.

Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/salloc.html
Obtain job information
Check status of my waiting or running jobs
Example:

The status of a job can be queried with the squeue command. The following basic examples show how to use it.

Example 1: Check my jobs on cluster cm4 and partition cm4_tiny:

```
squeue --clusters=cm4 --partition=cm4_tiny -u $USER
```

This command produces the following output, showing running (R) and pending (PD) jobs together with the reason why a job is waiting:

```
CLUSTER: cm4
JOBID    PARTITION  NAME      USER      ST  TIME     NODES  NODELIST(REASON)
7918670  cm4_tiny   job-name  ab01xyz2  PD  0:00     1      (Priority)
7918678  cm4_tiny   job-name  ab01xyz2  PD  0:00     1      (QOSMaxJobsPerUserLimit)
7918679  cm4_tiny   job-name  ab01xyz2  PD  0:00     1      (QOSMaxJobsPerUserLimit)
7918676  cm4_tiny   job-name  ab01xyz2  R   6:07:52  1      name_of_allocated_node
7918673  cm4_tiny   job-name  ab01xyz2  R   6:12:09  1      name_of_allocated_node
7918674  cm4_tiny   job-name  ab01xyz2  R   6:12:09  1      name_of_allocated_node
7918675  cm4_tiny   job-name  ab01xyz2  R   6:12:09  1      name_of_allocated_node
```

Example 2: Get the estimated start time and state of a job. Note that the start time is only an estimate; it is regularly re-calculated by Slurm and may vary significantly!

```
squeue --clusters=cm4 --job=7918670 --start
```

Output:

```
CLUSTER: cm4
JOBID    PARTITION  NAME      USER      ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
7918670  cm4_tiny   job-name  ab01xyz2  PD  2025-08-15T10:13:00  1      node_name   (Priority)
```
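If you want to monitor your jobs periodically, the squeue command can be combined with the standard Linux watch utility. A minimal sketch, assuming watch is available on the login node:

```
# Refresh the job list every 30 seconds; press Ctrl-C to stop
watch -n 30 "squeue --clusters=cm4 --partition=cm4_tiny -u $USER"
```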
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/squeue.html
Get information on running, waiting and, in particular, finished jobs
Example:

List details of (finished) jobs.

Example 1: The following sacct command provides a compact overview of all my jobs started after August 1, 2025 on cluster cm4 and partition cm4_tiny:

```
sacct --clusters=cm4 --partition=cm4_tiny -X --starttime=2025-08-01T00:00:00
```

The output could look like:

```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
7919001        job-name   cm4_tiny   lxcusers        224    TIMEOUT      0:0
7919002        job-name   cm4_tiny   lxcusers        112    RUNNING      0:0
7919000        job-name   cm4_tiny   lxcusers        224  COMPLETED      0:0
7919004        job-name   cm4_tiny   lxcusers        112    PENDING      0:0
7919005        job-name   cm4_tiny   lxcusers        224     FAILED      1:0
7919003        job-name   cm4_tiny   lxcusers        224 OUT_OF_ME+    0:125
```

Example 2: Using sacct, you may show many more job details. Extending example 1, the following command displays the job ID, number of nodes and tasks, start time, elapsed time, maximum memory usage (MaxRSS), state, pending reason, exit code and the list of allocated nodes:

```
sacct --clusters=cm4 --partition=cm4_tiny --starttime=2025-08-01T00:00:00 -o jobid,nnodes,ntasks,start,elapsed,maxrss,state,reason,exitcode,nodelist
```
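To inspect a single (finished) job instead of a whole time range, the job ID can be passed directly via -j. A minimal sketch with an illustrative job ID and field list:

```
# Resource usage and final state of one job
sacct --clusters=cm4 -j 7919000 -o jobid,jobname,elapsed,maxrss,state,exitcode
```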
Common job states

| State | Meaning |
|---|---|
| PENDING (PD) | Job is waiting in the queue for resources |
| RUNNING (R) | Job has been allocated resources and is running |
| COMPLETED | Job finished successfully |
| FAILED | Job terminated with a non-zero exit code |
| TIMEOUT | Job reached its time limit and was terminated |
| OUT_OF_MEMORY | Job was terminated because it exceeded its memory allocation |
| CANCELLED | Job was cancelled by the user or an administrator |
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/sacct.html
Check job priority – if jobs wait very long in the queue
Example:

Jobs may wait a long time in the queue. On the Linux Cluster, Slurm uses a priority system to ensure fairness in job processing. Compute time consumption continuously reduces the priority of your subsequent jobs: as long as there are users who have consumed less compute time than you, their jobs get a higher priority and run before yours. Their next jobs, in turn, also run with reduced priority. On the other hand, the priority of a job increases the longer it waits in the queue.

Check the job priority (ranging from 0 to 1) via the squeue command. Specifying a job ID is optional.

```
squeue --clusters=cm4 --Format=jobid,priority,state,reason --job=7919004
```

The output could look like:

```
CLUSTER: cm4
JOBID     PRIORITY            STATE     REASON
7919004   0.01244102139316    PENDING   Priority
```
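To see where a waiting job stands relative to other pending jobs in the same partition, the pending queue can be listed sorted by priority. A minimal sketch; field and sort names follow the squeue documentation:

```
# List pending jobs in cm4_tiny, highest priority first (--sort=-p sorts by descending priority)
squeue --clusters=cm4 --partition=cm4_tiny --states=PENDING --Format=jobid,username,priority,reason --sort=-p
```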
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/squeue.html and https://slurm.schedmd.com/priority_multifactor.html#fairshare
Manipulate jobs
Cancel jobs
Example:

Using the scancel command you may cancel a single job or multiple jobs on particular partitions. This applies to both waiting and running jobs. If the job is cancelled successfully, scancel produces no output.

Cancel a single job with job ID 7919004:

```
scancel --clusters=cm4 7919004
```

Cancel multiple jobs by providing a space-separated or comma-separated list of job IDs:

```
scancel --clusters=cm4 7919004,7919002
```
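If you need to cancel all of your own jobs in a partition at once, scancel also accepts user and partition filters. A minimal sketch; use with care, as this cancels every matching job:

```
# Cancel all of my jobs (waiting and running) in partition cm4_tiny
scancel --clusters=cm4 --partition=cm4_tiny --user=$USER
```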
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/scancel.html
Further inspection and modification of jobs – waiting or running jobs only
Example:

Job characteristics can be inspected and modified via the scontrol command.

Example 1: Show various characteristics of a particular job (waiting or running) on cluster cm4:

```
scontrol --clusters=cm4 show jobid=7919004
```

The output looks like:

```
JobId=7919004 JobName=some-job-name
   UserId=ab01xyz2(0123456) GroupId=a0000(0123) MCS_label=N/A
   Priority=11201723 Nice=0 Account=lxcusers QOS=cm4_tiny
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2025-08-14T12:00:28 EligibleTime=2025-08-14T12:00:28
   AccrueTime=2025-08-14T12:00:28
   StartTime=2025-08-15T02:48:46 EndTime=2025-08-16T02:48:46 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-08-14T13:43:55
   Partition=cm4_tiny AllocNode:Sid=cm4login1:118538
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=cm4r00c00s00
   NumNodes=1-1 NumCPUs=112 NumTasks=112 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=112,mem=227920M,node=1,billing=112
   Socks/Node=* NtasksPerN:B:S:C=112:0:*:1 CoreSpec=*
   MinCPUsNode=112 MinMemoryCPU=2035M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/path/to/job-script/my_job_script.slurm
   WorkDir=/job/working/directory/
   StdErr=/path/to/standard_error_output/%x.%j.%N.err
   StdIn=/dev/null
   StdOut=/path/to/standard_error_output/%x.%j.%N.out
   Power=
   MailUser=my_mail_address MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
   NtasksPerTRES:0
```

Example 2: As long as the job is waiting in the queue, scontrol can also be used to modify some of its characteristics, e.g. the runtime, so it is not necessary to cancel and resubmit the job. Please note that the job runtime can only be reduced; it is not allowed to increase it. For example, the runtime of a job was set to 8 hours and needs to be reduced to 4 hours:

```
scontrol --clusters=cm4 update jobid=7919004 TimeLimit=04:00:00
```

Example 3: Similarly, the number of tasks (NumTasks) or CPUs (NumCPUs) of a waiting job can be decreased, e.g.:

```
scontrol --clusters=cm4 update jobid=7919004 NumTasks=8
scontrol --clusters=cm4 update jobid=7919004 NumCPUs=8
```
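A related use of scontrol, not shown above, is to temporarily prevent a waiting job from starting and to release it again later. A minimal sketch, assuming a pending job with the illustrative ID 7919004:

```
# Put the pending job on hold (it stays in the queue but will not start)
scontrol --clusters=cm4 hold 7919004
# Release the hold so the job becomes eligible for scheduling again
scontrol --clusters=cm4 release 7919004
```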
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/scontrol.html
Obtain cluster information
Get a brief overview of the cluster status
Example:

Use the sinfo command to get various cluster information, e.g. for partition cm4_tiny:

```
sinfo --clusters=cm4 --partition=cm4_tiny
```

The output looks like:

```
CLUSTER: cm4
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
cm4_tiny   up     1-00:00:00  1      comp   node_name
cm4_tiny   up     1-00:00:00  1      drain  node_name
cm4_tiny   up     1-00:00:00  2      fail   node_names
cm4_tiny   up     1-00:00:00  90     alloc  node_names
cm4_tiny   up     1-00:00:00  6      idle   node_names
```
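For a condensed view that aggregates nodes per partition instead of listing each node state separately, sinfo provides a summarize option. A minimal sketch:

```
# One summary line per partition with allocated/idle/other/total node counts
sinfo --clusters=cm4 --partition=cm4_tiny --summarize
```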
Common node states

| State | Meaning |
|---|---|
| alloc | Node is fully allocated to one or more jobs |
| idle | Node is free and available for new jobs |
| comp | Node is completing a job (cleanup phase) |
| drain | Node is not accepting new jobs (e.g. for maintenance) |
| fail | Node is not available due to a failure |
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/sinfo.html