Decommissioned CoolMUC-2

Image of the CoolMUC-2 computer racks (black) and the six adsorption chillers powered by waste heat from the computer racks (light gray devices on the right side of the image).

A new cluster segment, “CooLMUC-2”, with the same processor and cooling technology as SuperMUC Phase 2, was installed and commissioned in two stages. The first part of the funding application (state-funded large-scale equipment pursuant to Art. 143c of the German constitution) had already been approved in 2014. Installation of the first subsystem began in December 2014, and user operation commenced in May 2015. Besides further compute nodes, the Phase 2 installation (large research equipment pursuant to Art. 91b GG) also comprised six adsorption chillers, installed in collaboration with SorTech AG, which use the waste heat of the computers to generate cooling capacity. Improved technology and control engineering enable reliable year-round cooling of the remaining air-cooled components of SuperMUC Phase 2. In its final expansion stage, CooLMUC-2 ranked 261st on the November 2015 edition of the TOP500 list. In addition, the provision of a GPFS-based high-performance file system as scratch storage eliminated a growing bottleneck in the processing of I/O-heavy compute jobs. CooLMUC-2 replaces the previous “CooLMUC-1” cluster, which is scheduled for decommissioning in 2016. The Nehalem-based SGI ICE cluster had already been taken out of user operation in the summer of 2015; the existing serial cluster followed in the fall, and its workload was moved to a subset of the CooLMUC-2 nodes.

The first steps toward integrating the future big data infrastructure with the HPC systems were taken by connecting the CooLMUC-2 login nodes to the Data Science Storage (DSS). In addition to production computing, the CoolMUC-2 system also served as a research object for innovative, energy-efficient cooling concepts. Alongside the warm-water cooling that had been established at the LRZ for years, it was equipped with six adsorption chillers from SorTech. These made it possible to generate cooling from the waste heat of the compute nodes with little electrical energy; the cooling was used for the storage systems of SuperMUC Phase 2. The technology proved to be very reliable and efficient: in 2016, an average of 120 kW of waste heat at 45 °C was used to generate approximately 50 kW of cooling at 21 °C. The coefficient of performance (COP) of the overall system was 12, i.e. only 1 kW of electrical energy had to be expended for every 12 kW of cooling capacity. This made the adsorption chillers about three times more efficient than traditional compressor-based chillers.
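As a back-of-the-envelope check of these figures (the electrical input is not stated explicitly above; it is derived here from the reported cooling power and the COP definition):

\[
\mathrm{COP} = \frac{\dot{Q}_{\mathrm{cooling}}}{P_{\mathrm{el}}} = 12
\quad\Rightarrow\quad
P_{\mathrm{el}} \approx \frac{50\ \mathrm{kW}}{12} \approx 4.2\ \mathrm{kW}
\]

So roughly 4 kW of electrical power sufficed to drive about 50 kW of cooling produced from the 120 kW of waste heat.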

After nine years of operation, the CoolMUC-2 system was shut down on Friday, 13 December 2024.

CoolMUC-2: System Overview

Hardware

  Number of nodes: 812
  Cores per node: 28
  Hyperthreads per core: 2
  Core nominal frequency: 2.6 GHz
  Memory (DDR4) per node: 64 GB (bandwidth 120 GB/s, STREAM)
  Bandwidth to interconnect per node: 13.64 GB/s (1 link)
  Bisection bandwidth of interconnect (per island): 3.5 TB/s
  Latency of interconnect: 2.3 µs
  Peak performance of system: 1400 TFlop/s

Infrastructure

  Electric power of fully loaded system: 290 kVA
  Percentage of waste heat to warm water: 97%
  Inlet temperature range for water cooling: 30 … 50 °C
  Temperature difference between outlet and inlet: 4 … 6 °C

Software (OS and development environment)

  Operating system: SLES15 SP1 Linux
  MPI: Intel MPI 2019, alternatively OpenMPI
  Compilers: Intel icc, icpc, ifort 2019
  Performance libraries: MKL, TBB, IPP
  Tools for performance and correctness analysis: Intel Cluster Tools
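As a sketch of how this development environment was typically used to build an MPI application (the module names and the example source files are assumptions, not taken from the original page):

  # load the Intel compilers and Intel MPI 2019 (module names are assumed)
  module load intel intel-mpi

  # compile with the Intel MPI wrapper around icc, targeting the
  # Haswell (AVX2) architecture of the CoolMUC-2 nodes
  mpiicc -O2 -xCORE-AVX2 hello_mpi.c -o hello_mpi

  # Fortran codes would use the corresponding wrapper around ifort;
  # -mkl links against the Intel Math Kernel Library
  mpiifort -O2 -xCORE-AVX2 solver.f90 -o solver -mkl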



Overview of cluster specifications and limits

Cluster system: CoolMUC-2 (28-way Haswell-EP nodes with InfiniBand FDR14 interconnect and 2 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
| cm2 | cm2_large | 404 (overlapping with cm2_std) | 25 - 64 | 48 | 2 (30) | 56 per node |
| cm2 | cm2_std | 404 (overlapping with cm2_large) | 3 - 24 | 72 | 4 (50) | 56 per node |
| cm2_tiny | cm2_tiny | 288 | 1 - 4 | 72 | 10 (50) | 56 per node |
| serial | serial_std | 96 (overlapping with serial_long) | 1 - 1 | 96 | dynamically adjusted depending on workload (250) | 56 per node |
| serial | serial_long | 96 (overlapping with serial_std) | 1 - 1 | > 72 (currently 480) | dynamically adjusted depending on workload (250) | 56 per node |
| inter | cm2_inter | 12 | 1 - 12 | 2 | 1 (2) | 56 per node |
| inter | cm2_inter_large_mem | 6 | 1 - 6 | 96 | 1 (2) | 120 per node |

Cluster system: HPDA LRZ Cluster (80-way Ice Lake nodes, 2 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
| inter | cm4_inter_large_mem | 9 | 1 - 1 | 96 | 1 (2) | 1000 per node |

Cluster system: Teramem (single-node shared-memory system, 4 x Intel Xeon Platinum 8360HL, in total 96 physical cores, 2 hyperthreads per physical core, 6 TB memory)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
| inter | teramem_inter | 1 | 1 - 1 (up to 64 logical cores) | 240 | 1 (2) | approx. 60 per physical core available |

Cluster system: CoolMUC-3 (64-way Knights Landing 7210F nodes with Intel Omni-Path 100 interconnect and 4 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
| mpp3 | mpp3_batch | 145 | 1 - 32 | 48 | 50 (dynamically adjusted depending on workload) | approx. 90 DDR plus 16 HBM per node |
| inter | mpp3_inter | 3 | 1 - 3 | 2 | 1 (2) | approx. 90 DDR plus 16 HBM per node |
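To make these limits concrete, the following is a minimal, hypothetical batch header for the cm2_std partition (job name, node count and walltime are freely chosen example values that respect the limits above; the --qos value is the one listed in the job-processing overview below):

  #!/bin/bash
  #SBATCH --job-name=mpi_example        # example name, not prescribed
  #SBATCH --clusters=cm2
  #SBATCH --partition=cm2_std
  #SBATCH --qos=cm2_std
  #SBATCH --nodes=8                     # must lie within the 3 - 24 node range of cm2_std
  #SBATCH --ntasks-per-node=28          # one MPI rank per physical core
  #SBATCH --time=24:00:00               # must not exceed the 72-hour limit

  mpiexec ./my_mpi_program              # launch with Intel MPI; binary name is a placeholder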

Overview of job processing

For each Slurm partition, the list below gives the cluster- and partition-specific Slurm job settings, the typical job type, the recommended submit host(s), and common/exemplary Slurm commands for job management: squeue (show waiting/running jobs), scancel (abort job), sacct (show details on waiting, running, finished jobs).

cm2_large
  Slurm job settings: --clusters=cm2 --partition=cm2_large --qos=cm2_large
  Typical job type: large distributed memory parallel (MPI) job
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands:
    squeue -M cm2 -u $USER
    scancel -M cm2 <JOB-ID>
    sacct -M cm2 -X -u $USER --starttime=2021-01-01T00:00:01

cm2_std
  Slurm job settings: --clusters=cm2 --partition=cm2_std --qos=cm2_std
  Typical job type: standard distributed memory parallel (MPI) job
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands: as for cm2_large (cluster cm2)

cm2_tiny
  Slurm job settings: --clusters=cm2_tiny
  Typical job type: small distributed memory parallel (MPI) job; single-node shared memory parallel job
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands:
    squeue -M cm2_tiny -u $USER
    scancel -M cm2_tiny <JOB-ID>
    sacct -M cm2_tiny -X -u $USER --starttime=2021-01-01T00:00:01

serial_std
  Slurm job settings: --clusters=serial --partition=serial_std --mem=<memory_per_node_MB>M
  Typical job type: single-core jobs
  Note: shared use of compute nodes among users! Default memory = memnode / Ncores_node
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands:
    squeue -M serial -u $USER
    scancel -M serial <JOB-ID>
    sacct -M serial -X -u $USER --starttime=2021-01-01T00:00:01

serial_long
  Slurm job settings: --clusters=serial --partition=serial_long --mem=<memory_per_node_MB>M
  Typical job type: single-core jobs (see the note for serial_std)
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands: as for serial_std (cluster serial)

cm2_inter
  Slurm job settings: --clusters=inter --partition=cm2_inter
  Typical job type: interactive test jobs (see the interactive example at the end of this overview). Do not run production jobs!
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands:
    squeue -M inter -u $USER
    scancel -M inter <JOB-ID>
    sacct -M inter -X -u $USER --starttime=2021-01-01T00:00:01

cm2_inter_large_mem
  Slurm job settings: --clusters=inter --partition=cm2_inter_large_mem --mem=<memory_per_node_MB>M
  Typical job type: single-node shared memory parallel job requiring more memory than available on default compute nodes
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands: as for cm2_inter (cluster inter)

cm4_inter_large_mem
  Slurm job settings: --clusters=inter --partition=cm4_inter_large_mem
  Typical job type: jobs which need much more memory than available on CoolMUC-2 compute nodes, but less than the memory available on Teramem
  Recommended submit host: lxlogin5
  Slurm commands: as for cm2_inter (cluster inter)

teramem_inter
  Slurm job settings: --clusters=inter --partition=teramem_inter --mem=<memory_per_node_MB>M
  Typical job type: large-memory job on Teramem
  Recommended submit hosts: lxlogin[1...4], lxlogin8
  Slurm commands: as for cm2_inter (cluster inter)

mpp3_inter
  Slurm job settings: --clusters=inter --partition=mpp3_inter
  Typical job type: interactive test jobs. Do not run production jobs!
  Recommended submit host: lxlogin8
  Slurm commands: as for cm2_inter (cluster inter)

mpp3_batch
  Slurm job settings: --clusters=mpp3 --partition=mpp3_batch
  Typical job type: shared memory thread-parallel job; distributed memory parallel job
  Recommended submit host: lxlogin8
  Slurm commands:
    squeue -M mpp3 -u $USER
    scancel -M mpp3 <JOB-ID>
    sacct -M mpp3 -X -u $USER --starttime=2021-01-01T00:00:01
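For illustration, an interactive test session on the cm2_inter partition could be requested roughly as follows (a sketch using standard Slurm commands; walltime, node count and the program name are example values, and the exact invocation recommended by LRZ may have differed):

  # on a CoolMUC-2 login node, e.g. lxlogin1
  salloc --clusters=inter --partition=cm2_inter --nodes=1 --time=00:30:00
  # once the allocation has been granted, run a short MPI test inside it
  srun --ntasks=28 ./hello_mpi
  # check the job from another shell if needed
  squeue -M inter -u $USER
  # leaving the shell releases the allocation
  exit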