Overview of cluster specifications and limits

Cluster system: CoolMUC-2 (28-way Haswell-EP nodes with Infiniband FDR14 interconnect and 2 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
|---|---|---|---|---|---|---|
| cm2 | cm2_large | 404 (overlapping partitions) | 25 - 64 | 48 | 2 (30) | 56 per node |
| cm2 | cm2_std | 404 (overlapping partitions) | 3 - 24 | 72 | 4 (50) | 56 per node |
| cm2_tiny | cm2_tiny | 300 | 1 - 4 | 72 | 10 (50) | 56 per node |
| serial | serial_std | 96 (overlapping partitions) | 1 - 1 | 96 | dynamically adjusted depending on workload (250) | 56 per node |
| serial | serial_long | 96 (overlapping partitions) | 1 - 1 | > 72 (currently 480) | dynamically adjusted depending on workload (250) | 56 per node |
| inter | cm2_inter | 12 | 1 - 12 | 2 | 1 (2) | 56 per node |
| inter | cm2_inter_large_mem | 6 | 1 - 6 | 96 | 1 (2) | 120 per node |

Cluster system: Teramem (HP DL580 shared-memory system, 96 physical cores in total, each physical core has 2 hyperthreads)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
|---|---|---|---|---|---|---|
| inter | teramem_inter | 1 | 1 - 1 (up to 64 logical cores) | 240 | 1 (2) | approx. 60 per physical core |

Cluster system: CoolMUC-3 (64-way Knights Landing 7210F nodes with Intel Omnipath 100 interconnect and 4 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
|---|---|---|---|---|---|---|
| mpp3 | mpp3_batch | 145 | 1 - 32 | 48 | 50 (dynamically adjusted depending on workload) | approx. 90 DDR plus 16 HBM per node |
| inter | mpp3_inter | 3 | 1 - 3 | 2 | 1 (2) | approx. 90 DDR plus 16 HBM per node |
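For illustration, a minimal CoolMUC-2 job script that stays within the cm2_std limits listed above could look like the following sketch. Job name, node count, module setup and executable are placeholders, not prescriptions; adjust them to your own environment.

    #!/bin/bash
    #SBATCH --job-name=example_cm2_std       # placeholder job name
    #SBATCH --clusters=cm2                    # Slurm cluster (see table above)
    #SBATCH --partition=cm2_std               # partition allows 3 - 24 nodes per job
    #SBATCH --qos=cm2_std
    #SBATCH --nodes=4                         # within the allowed 3 - 24 node range
    #SBATCH --ntasks-per-node=28              # one task per physical core of a 28-way Haswell-EP node
    #SBATCH --time=24:00:00                   # must stay below the 72 hour limit

    module load slurm_setup                   # assumption: typical environment setup, adjust as needed
    mpiexec -n $SLURM_NTASKS ./my_program     # placeholder executable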

Overview of job processing

Job management is done via squeue (show waiting/running jobs), scancel (abort a job) and sacct (show details on waiting, running and finished jobs).

| Slurm partition | Cluster-/partition-specific Slurm job settings | Typical job type | Common/exemplary Slurm commands for job management |
|---|---|---|---|
| cm2_large | --clusters=cm2 --partition=cm2_large --qos=cm2_large | | squeue -M cm2 -u $USER; scancel -M cm2 <JOB-ID>; sacct -M cm2 -X -u $USER --starttime=2021-01-01T00:00:01 |
| cm2_std | --clusters=cm2 --partition=cm2_std --qos=cm2_std | | same as for cm2_large |
| cm2_tiny | --clusters=cm2_tiny | | squeue -M cm2_tiny -u $USER; scancel -M cm2_tiny <JOB-ID>; sacct -M cm2_tiny -X -u $USER --starttime=2021-01-01T00:00:01 |
| serial_std | --clusters=serial --partition=serial_std --mem=<memory_per_node>MB | Shared use of compute nodes among users! Default memory = (memory per node) / (cores per node) | squeue -M serial -u $USER; scancel -M serial <JOB-ID>; sacct -M serial -X -u $USER --starttime=2021-01-01T00:00:01 |
| serial_long | --clusters=serial --partition=serial_long --mem=<memory_per_node>MB | | same as for serial_std |
| cm2_inter | --clusters=inter --partition=cm2_inter | Do not run production jobs! | squeue -M inter -u $USER; scancel -M inter <JOB-ID>; sacct -M inter -X -u $USER --starttime=2021-01-01T00:00:01 |
| cm2_inter_large_mem | --clusters=inter --partition=cm2_inter_large_mem --mem=<memory_per_node>MB | | same as for cm2_inter |
| teramem_inter | --clusters=inter --partition=teramem_inter | | same as for cm2_inter |
| mpp3_inter | --clusters=inter --partition=mpp3_inter | Do not run production jobs! | same as for cm2_inter |
| mpp3_batch | --clusters=mpp3 --partition=mpp3_batch | | squeue -M mpp3 -u $USER; scancel -M mpp3 <JOB-ID>; sacct -M mpp3 -X -u $USER --starttime=2021-01-01T00:00:01 |
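As an illustration of how the settings and commands in this table go together, a short interactive test on the cm2_inter partition might proceed as follows. The salloc/srun invocation is a sketch under standard Slurm behaviour; only the --clusters/--partition settings and the management commands are taken from the table above.

    # request a short interactive allocation on the interactive cluster
    salloc --clusters=inter --partition=cm2_inter --nodes=1 --time=00:30:00

    # run a test command inside the allocation (placeholder executable)
    srun ./my_test_program

    # monitor, inspect and, if necessary, abort the job
    squeue -M inter -u $USER
    sacct -M inter -X -u $USER --starttime=2021-01-01T00:00:01
    scancel -M inter <JOB-ID>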

Submit hosts

Submit hosts are usually login nodes that allow you to submit and manage batch jobs.

| Cluster segment | Submit hosts | Remarks |
|---|---|---|
| CoolMUC-2 | lxlogin1, lxlogin2, lxlogin3, lxlogin4 | |
| CoolMUC-3 | lxlogin8, lxlogin9 | lxlogin9 is accessible from lxlogin8 via "ssh mpp3-login9". Because lxlogin9 has KNL architecture, it can be used to build software for CoolMUC-3. |
| Teramem | lxlogin8 | |

Note that cross-submission of jobs to other cluster segments is also possible. The only caveat is that the cluster segments support different instruction sets, so you need to make sure that your software build produces a binary that can execute on the targeted cluster segment (see the sketch below).
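As a hedged example, a CoolMUC-3 job could be prepared and submitted from a CoolMUC-2/Teramem submit host. The compiler and flags below are illustrative assumptions; the relevant points are targeting the KNL instruction set and selecting the remote cluster with -M/--clusters.

    # build for the Knights Landing instruction set used by CoolMUC-3
    # (example flags; lxlogin9 can be used for a native KNL build)
    gcc -O2 -march=knl -o my_program my_program.c

    # submit the job script to the mpp3 cluster from the current submit host
    sbatch --clusters=mpp3 --partition=mpp3_batch my_job_script.sh

    # manage the job across cluster boundaries via the -M option
    squeue -M mpp3 -u $USER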
