Slurm Command Examples on the Linux Cluster
| Documentation links you should know | |
|---|---|
| Job Processing on the Linux-Cluster | Overview of Linux Cluster partitions, resource limits, job limits, job specifications, common Slurm commands on the Linux Cluster |
| SLURM Workload Manager | Slurm commands and options for job submission, explanations, recommendations |
This document lists common Slurm commands for job submission, job manipulation, and obtaining job and cluster information on CoolMUC-4.
Notes on usage of Slurm commands
For clarity, we use the Slurm command options "--clusters=" and "--partition=" to specify the cluster and partition name. For most commands you may also use the short options "-M" and "-p", e.g.:
sinfo -M cm4 -p cm4_tiny
Exception: sacct command → use "-r" instead of "-p", e.g.:
sacct -M cm4 -r cm4_tiny -X --starttime=2025-08-01T00:00:00
We strongly recommend specifying the cluster name and the partition name in all Slurm commands!
If you omit both, the Slurm command will be applied to the inter cluster!
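For illustration, the first example above written with the recommended long options (equivalent to sinfo -M cm4 -p cm4_tiny):

```
# Query partition cm4_tiny on cluster cm4 using the long option names
sinfo --clusters=cm4 --partition=cm4_tiny
```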
Submit jobs
Batch jobs (non-interactive)
Example:

Submit a batch job on the login node:

```
myuserid@cm4login2:~> sbatch my_job_script.slurm
Submitted batch job 17130 on cluster cm4
```
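The contents of my_job_script.slurm are not shown above. The following is only a minimal sketch of such a script, assuming the cm4_tiny partition and illustrative resource values; adjust job name, partition, task count and runtime to your own needs:

```
#!/bin/bash
#SBATCH --job-name=my_job           # job name shown by squeue and sacct
#SBATCH --clusters=cm4              # cluster name (recommended, see above)
#SBATCH --partition=cm4_tiny        # partition name (illustrative)
#SBATCH --nodes=1                   # number of nodes
#SBATCH --ntasks-per-node=8         # illustrative number of tasks per node
#SBATCH --time=00:30:00             # maximum runtime (hh:mm:ss)

srun ./MyApplication                # replace with your own program
```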
Learn more here: Running parallel jobs on the Linux Cluster and Running serial jobs on the Linux Cluster

Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/sbatch.html
Interactive jobs
Example:

Submit an interactive job on the login node via the salloc command, e.g. on the partition cm4_inter, for a maximum runtime of 30 minutes and using 20 CPU cores:

```
userid@cm4login2:~> salloc --clusters=inter --partition=cm4_inter -n 20 -t 00:30:00
salloc: Pending job allocation 303762
salloc: job 303762 queued and waiting for resources
salloc: job 303762 has been allocated resources
salloc: Granted job allocation 303762
userid@cm4r00c00s00:~> srun MyApplication
[...]
```
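When you are done, leave the shell that salloc opened in order to release the allocation. A minimal sketch (the exact salloc message may differ):

```
userid@cm4r00c00s00:~> exit
salloc: Relinquishing job allocation 303762
```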
Learn more here: Running interactive jobs on the Linux Cluster.

Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/salloc.html
Obtain job information
Check status of my waiting or running jobs
Example:

The status of a job can be queried with the squeue command. The following basic examples show how to use it.

Example 1: Check my jobs on cluster cm4 and partition cm4_tiny:

```
squeue --clusters=cm4 --partition=cm4_tiny -u $USER
```

This command produces the following output, showing running (R) and pending (PD) jobs together with the reason why a job is waiting:

```
CLUSTER: cm4
JOBID    PARTITION  NAME      USER      ST  TIME     NODES  NODELIST(REASON)
7918670  cm4_tiny   job-name  ab01xyz2  PD  0:00     1      (Priority)
7918678  cm4_tiny   job-name  ab01xyz2  PD  0:00     1      (QOSMaxJobsPerUserLimit)
7918679  cm4_tiny   job-name  ab01xyz2  PD  0:00     1      (QOSMaxJobsPerUserLimit)
7918676  cm4_tiny   job-name  ab01xyz2  R   6:07:52  1      name_of_allocated_node
7918673  cm4_tiny   job-name  ab01xyz2  R   6:12:09  1      name_of_allocated_node
7918674  cm4_tiny   job-name  ab01xyz2  R   6:12:09  1      name_of_allocated_node
7918675  cm4_tiny   job-name  ab01xyz2  R   6:12:09  1      name_of_allocated_node
```

Example 2: Get the estimated start time and state of a job. Note that the start time is only an estimate; it is regularly re-calculated by Slurm and may vary significantly!

```
squeue --clusters=cm4 --job=7918670 --start
```

Output:

```
CLUSTER: cm4
JOBID    PARTITION  NAME      USER      ST  START_TIME           NODES  SCHEDNODES  NODELIST(REASON)
7918670  cm4_tiny   job-name  ab01xyz2  PD  2025-08-15T10:13:00  1      node_name   (Priority)
```
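If you want to monitor your jobs periodically, the squeue command can be combined with the standard Linux watch utility. A minimal sketch, assuming watch is available on the login node:

```
# Refresh the job list every 30 seconds; press Ctrl-C to stop
watch -n 30 "squeue --clusters=cm4 --partition=cm4_tiny -u $USER"
```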
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/squeue.html
Get information on running, waiting and, in particular, finished jobs
Example:

List details of (finished) jobs.

Example 1: The following sacct command provides a compact overview of all my jobs started after August 1, 2025 on cluster cm4 and partition cm4_tiny:

```
sacct --clusters=cm4 --partition=cm4_tiny -X --starttime=2025-08-01T00:00:00
```

The output could look like:

```
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
7919001        job-name   cm4_tiny   lxcusers        224    TIMEOUT      0:0
7919002        job-name   cm4_tiny   lxcusers        112    RUNNING      0:0
7919000        job-name   cm4_tiny   lxcusers        224  COMPLETED      0:0
7919004        job-name   cm4_tiny   lxcusers        112    PENDING      0:0
7919005        job-name   cm4_tiny   lxcusers        224     FAILED      1:0
7919003        job-name   cm4_tiny   lxcusers        224 OUT_OF_ME+    0:125
```

Example 2: Using sacct, you may show many more job details. Extending example 1, the following command displays the job ID, number of nodes and tasks, start time, elapsed time, maximum memory usage (MaxRSS), state, pending reason, exit code and the list of allocated nodes:

```
sacct --clusters=cm4 --partition=cm4_tiny --starttime=2025-08-01T00:00:00 -o jobid,nnodes,ntasks,start,elapsed,maxrss,state,reason,exitcode,nodelist
```
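To inspect a single (finished) job instead of a whole time range, the job ID can be passed directly via -j. A minimal sketch with an illustrative job ID and field list:

```
# Resource usage and final state of one job
sacct --clusters=cm4 -j 7919000 -o jobid,jobname,elapsed,maxrss,state,exitcode
```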
Common job states

| State | Meaning |
|---|---|
| PENDING (PD) | Job is waiting in the queue for resources |
| RUNNING (R) | Job has been allocated resources and is running |
| COMPLETED | Job finished successfully |
| FAILED | Job terminated with a non-zero exit code |
| TIMEOUT | Job reached its time limit and was terminated |
| OUT_OF_MEMORY | Job was terminated because it exceeded its memory allocation |
| CANCELLED | Job was cancelled by the user or an administrator |
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/sacct.html
Check job priority – if jobs wait very long in the queue
Example:

Jobs may wait a long time in the queue. On the Linux Cluster, Slurm uses a priority system to ensure fairness in job processing. Compute time consumption continuously reduces the priority of your subsequent jobs: as long as there are users who have consumed less compute time than you, their jobs get a higher priority and run before yours. Their next jobs, in turn, also run with reduced priority. On the other hand, the priority of a job increases the longer it waits in the queue.

Check the job priority (ranging from 0 to 1) via the squeue command. Specifying a job ID is optional.

```
squeue --clusters=cm4 --Format=jobid,priority,state,reason --job=7919004
```

The output could look like:

```
CLUSTER: cm4
JOBID     PRIORITY            STATE     REASON
7919004   0.01244102139316    PENDING   Priority
```
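To see where a waiting job stands relative to other pending jobs in the same partition, the pending queue can be listed sorted by priority. A minimal sketch; field and sort names follow the squeue documentation:

```
# List pending jobs in cm4_tiny, highest priority first (--sort=-p sorts by descending priority)
squeue --clusters=cm4 --partition=cm4_tiny --states=PENDING --Format=jobid,username,priority,reason --sort=-p
```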
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/squeue.html and https://slurm.schedmd.com/priority_multifactor.html#fairshare
Manipulate jobs
Cancel jobs
Example:

Using the scancel command you may cancel a single job or multiple jobs on particular partitions. This applies to both waiting and running jobs. If the job is cancelled successfully, scancel produces no output.

Cancel a single job with job ID 7919004:

```
scancel --clusters=cm4 7919004
```

Cancel multiple jobs by providing a space-separated or comma-separated list of job IDs:

```
scancel --clusters=cm4 7919004,7919002
```
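If you need to cancel all of your own jobs in a partition at once, scancel also accepts user and partition filters. A minimal sketch; use with care, as this cancels every matching job:

```
# Cancel all of my jobs (waiting and running) in partition cm4_tiny
scancel --clusters=cm4 --partition=cm4_tiny --user=$USER
```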
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/scancel.html
Further inspection and modification of jobs – waiting or running jobs only
Example:

Job characteristics can be inspected and modified via the scontrol command.

Example 1: Show various characteristics of a particular job (waiting or running) on cluster cm4:

```
scontrol --clusters=cm4 show jobid=7919004
```

The output looks like:

```
JobId=7919004 JobName=some-job-name
   UserId=ab01xyz2(0123456) GroupId=a0000(0123) MCS_label=N/A
   Priority=11201723 Nice=0 Account=lxcusers QOS=cm4_tiny
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2025-08-14T12:00:28 EligibleTime=2025-08-14T12:00:28
   AccrueTime=2025-08-14T12:00:28
   StartTime=2025-08-15T02:48:46 EndTime=2025-08-16T02:48:46 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-08-14T13:43:55
   Partition=cm4_tiny AllocNode:Sid=cm4login1:118538
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=cm4r00c00s00
   NumNodes=1-1 NumCPUs=112 NumTasks=112 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=112,mem=227920M,node=1,billing=112
   Socks/Node=* NtasksPerN:B:S:C=112:0:*:1 CoreSpec=*
   MinCPUsNode=112 MinMemoryCPU=2035M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/path/to/job-script/my_job_script.slurm
   WorkDir=/job/working/directory/
   StdErr=/path/to/standard_error_output/%x.%j.%N.err
   StdIn=/dev/null
   StdOut=/path/to/standard_error_output/%x.%j.%N.out
   Power=
   MailUser=my_mail_address MailType=INVALID_DEPEND,BEGIN,END,FAIL,REQUEUE,STAGE_OUT
   NtasksPerTRES:0
```

Example 2: As long as the job is waiting in the queue, scontrol can also be used to modify some of its characteristics, e.g. the runtime, so it is not necessary to cancel and resubmit the job. Please note that the job runtime can only be reduced; it is not allowed to increase it. For example, the runtime of a job was set to 8 hours and needs to be reduced to 4 hours:

```
scontrol --clusters=cm4 update jobid=7919004 TimeLimit=04:00:00
```

Example 3: Similarly, the number of tasks (NumTasks) or CPUs (NumCPUs) of a waiting job can be decreased, e.g.:

```
scontrol --clusters=cm4 update jobid=7919004 NumTasks=8
scontrol --clusters=cm4 update jobid=7919004 NumCPUs=8
```
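A related use of scontrol, not shown above, is to temporarily prevent a waiting job from starting and to release it again later. A minimal sketch, assuming a pending job with the illustrative ID 7919004:

```
# Put the pending job on hold (it stays in the queue but will not start)
scontrol --clusters=cm4 hold 7919004
# Release the hold so the job becomes eligible for scheduling again
scontrol --clusters=cm4 release 7919004
```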
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/scontrol.html
Obtain cluster information
Get a brief overview of the cluster status
Example:

Use the sinfo command to get various cluster information, e.g. for partition cm4_tiny:

```
sinfo --clusters=cm4 --partition=cm4_tiny
```

The output looks like:

```
CLUSTER: cm4
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
cm4_tiny   up     1-00:00:00  1      comp   node_name
cm4_tiny   up     1-00:00:00  1      drain  node_name
cm4_tiny   up     1-00:00:00  2      fail   node_names
cm4_tiny   up     1-00:00:00  90     alloc  node_names
cm4_tiny   up     1-00:00:00  6      idle   node_names
```
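For a condensed view that aggregates nodes per partition instead of listing each node state separately, sinfo provides a summarize option. A minimal sketch:

```
# One summary line per partition with allocated/idle/other/total node counts
sinfo --clusters=cm4 --partition=cm4_tiny --summarize
```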
Common node states

| State | Meaning |
|---|---|
| alloc | Node is fully allocated to one or more jobs |
| idle | Node is free and available for new jobs |
| comp | Node is completing a job (cleanup phase) |
| drain | Node is not accepting new jobs (e.g. for maintenance) |
| fail | Node is not available due to a failure |
Slurm web documentation: https://slurm.schedmd.com/archive/slurm-20.11.9/sinfo.html