Overview of cluster specifications and limits

Cluster system: CoolMUC-2 (28-way Haswell-EP nodes with Infiniband FDR14 interconnect and 2 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
|---|---|---|---|---|---|---|
| cm2 | cm2_large | 404 (overlapping partitions) | 25 - 64 | 48 | 2 (30) | 56 per node |
| cm2 | cm2_std | 404 (overlapping partitions) | 3 - 24 | 72 | 4 (50) | 56 per node |
| cm2_tiny | cm2_tiny | 300 | 1 - 4 | 72 | 10 (50) | 56 per node |
| serial | serial_std | 96 (overlapping partitions) | 1 - 1 | 96 | dynamically adjusted depending on workload (250) | 56 per node |
| serial | serial_long | 96 (overlapping partitions) | 1 - 1 | > 72 (currently 480) | dynamically adjusted depending on workload (250) | 56 per node |
| inter | cm2_inter | 12 | 1 - 12 | 2 | 1 (2) | 56 per node |
| inter | cm2_inter_large_mem | 6 | 1 - 6 | 96 | 1 (2) | 120 per node |

Cluster system: Teramem (HP DL580 shared-memory system, 96 physical cores in total, each physical core has 2 hyperthreads)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
|---|---|---|---|---|---|---|
| inter | teramem_inter | 1 | 1 - 1 (up to 64 logical cores) | 240 | 1 (2) | approx. 60 per physical core |

Cluster system: CoolMUC-3 (64-way Knights Landing 7210F nodes with Intel Omnipath 100 interconnect and 4 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
|---|---|---|---|---|---|---|
| mpp3 | mpp3_batch | 145 | 1 - 32 | 48 | 50 (dynamically adjusted depending on workload) | approx. 90 DDR plus 16 HBM per node |
| inter | mpp3_inter | 3 | 1 - 3 | 2 | 1 (2) | approx. 90 DDR plus 16 HBM per node |
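For illustration, a minimal CoolMUC-2 job script that stays within the cm2_std limits listed above could look like the following sketch. Job name, node count, module setup and executable are placeholders, not prescriptions; adjust them to your own environment.

    #!/bin/bash
    #SBATCH --job-name=example_cm2_std       # placeholder job name
    #SBATCH --clusters=cm2                    # Slurm cluster (see table above)
    #SBATCH --partition=cm2_std               # partition allows 3 - 24 nodes per job
    #SBATCH --qos=cm2_std
    #SBATCH --nodes=4                         # within the allowed 3 - 24 node range
    #SBATCH --ntasks-per-node=28              # one task per physical core of a 28-way Haswell-EP node
    #SBATCH --time=24:00:00                   # must stay below the 72 hour limit

    module load slurm_setup                   # assumption: typical environment setup, adjust as needed
    mpiexec -n $SLURM_NTASKS ./my_program     # placeholder executable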

Overview of job processing

Job management is done via squeue (show waiting/running jobs), scancel (abort a job) and sacct (show details on waiting, running and finished jobs).

| Slurm partition | Cluster-/partition-specific Slurm job settings | Typical job type | Common/exemplary Slurm commands for job management |
|---|---|---|---|
| cm2_large | --clusters=cm2 --partition=cm2_large --qos=cm2_large | | squeue -M cm2 -u $USER; scancel -M cm2 <JOB-ID>; sacct -M cm2 -X -u $USER --starttime=2021-01-01T00:00:01 |
| cm2_std | --clusters=cm2 --partition=cm2_std --qos=cm2_std | | same as for cm2_large |
| cm2_tiny | --clusters=cm2_tiny | | squeue -M cm2_tiny -u $USER; scancel -M cm2_tiny <JOB-ID>; sacct -M cm2_tiny -X -u $USER --starttime=2021-01-01T00:00:01 |
| serial_std | --clusters=serial --partition=serial_std --mem=<memory_per_node>MB | Shared use of compute nodes among users! Default memory = (memory per node) / (cores per node) | squeue -M serial -u $USER; scancel -M serial <JOB-ID>; sacct -M serial -X -u $USER --starttime=2021-01-01T00:00:01 |
| serial_long | --clusters=serial --partition=serial_long --mem=<memory_per_node>MB | | same as for serial_std |
| cm2_inter | --clusters=inter --partition=cm2_inter | Do not run production jobs! | squeue -M inter -u $USER; scancel -M inter <JOB-ID>; sacct -M inter -X -u $USER --starttime=2021-01-01T00:00:01 |
| cm2_inter_large_mem | --clusters=inter --partition=cm2_inter_large_mem --mem=<memory_per_node>MB | | same as for cm2_inter |
| teramem_inter | --clusters=inter --partition=teramem_inter | | same as for cm2_inter |
| mpp3_inter | --clusters=inter --partition=mpp3_inter | Do not run production jobs! | same as for cm2_inter |
| mpp3_batch | --clusters=mpp3 --partition=mpp3_batch | | squeue -M mpp3 -u $USER; scancel -M mpp3 <JOB-ID>; sacct -M mpp3 -X -u $USER --starttime=2021-01-01T00:00:01 |
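As an illustration of how the settings and commands in this table go together, a short interactive test on the cm2_inter partition might proceed as follows. The salloc/srun invocation is a sketch under standard Slurm behaviour; only the --clusters/--partition settings and the management commands are taken from the table above.

    # request a short interactive allocation on the interactive cluster
    salloc --clusters=inter --partition=cm2_inter --nodes=1 --time=00:30:00

    # run a test command inside the allocation (placeholder executable)
    srun ./my_test_program

    # monitor, inspect and, if necessary, abort the job
    squeue -M inter -u $USER
    sacct -M inter -X -u $USER --starttime=2021-01-01T00:00:01
    scancel -M inter <JOB-ID>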

Submit hosts

Submit hosts are usually login nodes that allow you to submit and manage batch jobs.

| Cluster segment | Submit hosts | Remarks |
|---|---|---|
| CoolMUC-2 | lxlogin1, lxlogin2, lxlogin3, lxlogin4 | |
| CoolMUC-3 | lxlogin8, lxlogin9 | lxlogin9 is accessible from lxlogin8 via "ssh mpp3-login9". Because lxlogin9 has KNL architecture, it can be used to build software for CoolMUC-3. |
| Teramem | lxlogin8 | |

Note that cross-submission of jobs to other cluster segments is also possible. The only caveat is that the cluster segments support different instruction sets, so you need to make sure that your software build produces a binary that can execute on the targeted cluster segment (see the sketch below).
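As a hedged example, a CoolMUC-3 job could be prepared and submitted from a CoolMUC-2/Teramem submit host. The compiler and flags below are illustrative assumptions; the relevant points are targeting the KNL instruction set and selecting the remote cluster with -M/--clusters.

    # build for the Knights Landing instruction set used by CoolMUC-3
    # (example flags; lxlogin9 can be used for a native KNL build)
    gcc -O2 -march=knl -o my_program my_program.c

    # submit the job script to the mpp3 cluster from the current submit host
    sbatch --clusters=mpp3 --partition=mpp3_batch my_job_script.sh

    # manage the job across cluster boundaries via the -M option
    squeue -M mpp3 -u $USER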
