Decommissioned CoolMUC-2

Image of the CoolMUC-2 computer racks (black) and the six adsorption chillers powered by waste heat from the computer racks (light gray devices on the right side of the image).

A new cluster segment, “CooLMUC-2”, with the same processor and cooling technology as SuperMUC Phase 2, was installed and commissioned in two stages. The first part of the funding application (state-funded large-scale equipment pursuant to Art. 143c of the German constitution) had already been approved in 2014. Installation of the first subsystem began in December 2014, and user operation commenced in May 2015. Besides further compute nodes, the Phase 2 installation (large research equipment pursuant to Art. 91b GG) also comprised six adsorption chillers, installed in collaboration with SorTech AG, which use the waste heat of the computers to generate cooling capacity. Improved technology and control engineering enable reliable year-round cooling of the remaining air-cooled components of SuperMUC Phase 2. In its final expansion stage, CooLMUC-2 ranked 261st on the November 2015 edition of the TOP500 list. In addition, the provision of a GPFS-based high-performance file system as scratch storage eliminated a growing bottleneck in the processing of I/O-heavy compute jobs. CooLMUC-2 replaces the previous “CooLMUC-1” cluster, which is scheduled for decommissioning in 2016. The Nehalem-based SGI ICE cluster had already been taken out of user operation in the summer of 2015; the existing serial cluster followed in the fall, and its workload was moved to a subset of the CooLMUC-2 nodes.

The first steps toward integrating the future big data infrastructure with the HPC systems were taken by connecting the CooLMUC-2 login nodes to the Data Science Storage (DSS). In addition to production computing, the CoolMUC-2 system also served as a research object for innovative, energy-efficient cooling concepts. Alongside the warm-water cooling that had been established at the LRZ for years, it was equipped with six adsorption chillers from SorTech. These made it possible to generate cooling from the waste heat of the compute nodes with little electrical energy; the cooling was used for the storage systems of SuperMUC Phase 2. The technology proved to be very reliable and efficient: in 2016, an average of 120 kW of waste heat at 45 °C was used to generate approximately 50 kW of cooling at 21 °C. The coefficient of performance (COP) of the overall system was 12, i.e. only 1 kW of electrical energy had to be expended for every 12 kW of cooling capacity. This made the adsorption chillers about three times more efficient than traditional compressor-based chillers.
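As a back-of-the-envelope check of these figures (the electrical input is not stated explicitly above; it is derived here from the reported cooling power and the COP definition):

\[
\mathrm{COP} = \frac{\dot{Q}_{\mathrm{cooling}}}{P_{\mathrm{el}}} = 12
\quad\Rightarrow\quad
P_{\mathrm{el}} \approx \frac{50\ \mathrm{kW}}{12} \approx 4.2\ \mathrm{kW}
\]

So roughly 4 kW of electrical power sufficed to drive about 50 kW of cooling produced from the 120 kW of waste heat.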

After nine years of operation, the CoolMUC-2 system was shut down on Friday, 13 December 2024.

CoolMUC-2: System Overview

Hardware

  Number of nodes: 812
  Cores per node: 28
  Hyperthreads per core: 2
  Core nominal frequency: 2.6 GHz
  Memory (DDR4) per node: 64 GB (bandwidth 120 GB/s, STREAM)
  Bandwidth to interconnect per node: 13.64 GB/s (1 link)
  Bisection bandwidth of interconnect (per island): 3.5 TB/s
  Latency of interconnect: 2.3 µs
  Peak performance of system: 1400 TFlop/s

Infrastructure

  Electric power of fully loaded system: 290 kVA
  Percentage of waste heat to warm water: 97%
  Inlet temperature range for water cooling: 30 … 50 °C
  Temperature difference between outlet and inlet: 4 … 6 °C

Software (OS and development environment)

  Operating system: SLES15 SP1 Linux
  MPI: Intel MPI 2019, alternatively OpenMPI
  Compilers: Intel icc, icpc, ifort 2019
  Performance libraries: MKL, TBB, IPP
  Tools for performance and correctness analysis: Intel Cluster Tools
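As a sketch of how this development environment was typically used to build an MPI application (the module names and the example source files are assumptions, not taken from the original page):

  # load the Intel compilers and Intel MPI 2019 (module names are assumed)
  module load intel intel-mpi

  # compile with the Intel MPI wrapper around icc, targeting the
  # Haswell (AVX2) architecture of the CoolMUC-2 nodes
  mpiicc -O2 -xCORE-AVX2 hello_mpi.c -o hello_mpi

  # Fortran codes would use the corresponding wrapper around ifort;
  # -mkl links against the Intel Math Kernel Library
  mpiifort -O2 -xCORE-AVX2 solver.f90 -o solver -mkl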



Overview of cluster specifications and limits

Cluster system: CoolMUC-2 (28-way Haswell-EP nodes with InfiniBand FDR14 interconnect and 2 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
| cm2 | cm2_large | 404 (overlapping with cm2_std) | 25 - 64 | 48 | 2 (30) | 56 per node |
| cm2 | cm2_std | 404 (overlapping with cm2_large) | 3 - 24 | 72 | 4 (50) | 56 per node |
| cm2_tiny | cm2_tiny | 288 | 1 - 4 | 72 | 10 (50) | 56 per node |
| serial | serial_std | 96 (overlapping with serial_long) | 1 - 1 | 96 | dynamically adjusted depending on workload (250) | 56 per node |
| serial | serial_long | 96 (overlapping with serial_std) | 1 - 1 | > 72 (currently 480) | dynamically adjusted depending on workload (250) | 56 per node |
| inter | cm2_inter | 12 | 1 - 12 | 2 | 1 (2) | 56 per node |
| inter | cm2_inter_large_mem | 6 | 1 - 6 | 96 | 1 (2) | 120 per node |

Cluster system: HPDA LRZ Cluster (80-way Ice Lake nodes, 2 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
| inter | cm4_inter_large_mem | 9 | 1 - 1 | 96 | 1 (2) | 1000 per node |

Cluster system: Teramem (single-node shared-memory system, 4 x Intel Xeon Platinum 8360HL, in total 96 physical cores, 2 hyperthreads per physical core, 6 TB memory)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
| inter | teramem_inter | 1 | 1 - 1 (up to 64 logical cores) | 240 | 1 (2) | approx. 60 per physical core available |

Cluster system: CoolMUC-3 (64-way Knights Landing 7210F nodes with Intel Omni-Path 100 interconnect and 4 hardware threads per physical core)

| Slurm cluster | Slurm partition | Nodes in partition | Node range per job (min - max) | Maximum runtime (hours) | Maximum running (submitted) jobs per user | Memory limit (GByte) |
| mpp3 | mpp3_batch | 145 | 1 - 32 | 48 | 50 (dynamically adjusted depending on workload) | approx. 90 DDR plus 16 HBM per node |
| inter | mpp3_inter | 3 | 1 - 3 | 2 | 1 (2) | approx. 90 DDR plus 16 HBM per node |
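To make these limits concrete, the following is a minimal, hypothetical batch header for the cm2_std partition (job name, node count and walltime are freely chosen example values that respect the limits above; the --qos value is the one listed in the job-processing overview below):

  #!/bin/bash
  #SBATCH --job-name=mpi_example        # example name, not prescribed
  #SBATCH --clusters=cm2
  #SBATCH --partition=cm2_std
  #SBATCH --qos=cm2_std
  #SBATCH --nodes=8                     # must lie within the 3 - 24 node range of cm2_std
  #SBATCH --ntasks-per-node=28          # one MPI rank per physical core
  #SBATCH --time=24:00:00               # must not exceed the 72-hour limit

  mpiexec ./my_mpi_program              # launch with Intel MPI; binary name is a placeholder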

Overview of job processing

For each Slurm partition, the list below gives the cluster- and partition-specific Slurm job settings, the typical job type, the recommended submit host(s), and common/exemplary Slurm commands for job management: squeue (show waiting/running jobs), scancel (abort job), sacct (show details on waiting, running, finished jobs).

cm2_large
  Slurm job settings: --clusters=cm2 --partition=cm2_large --qos=cm2_large
  Typical job type: large distributed memory parallel (MPI) job
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands:
    squeue -M cm2 -u $USER
    scancel -M cm2 <JOB-ID>
    sacct -M cm2 -X -u $USER --starttime=2021-01-01T00:00:01

cm2_std
  Slurm job settings: --clusters=cm2 --partition=cm2_std --qos=cm2_std
  Typical job type: standard distributed memory parallel (MPI) job
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands: as for cm2_large (cluster cm2)

cm2_tiny
  Slurm job settings: --clusters=cm2_tiny
  Typical job type: small distributed memory parallel (MPI) job; single-node shared memory parallel job
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands:
    squeue -M cm2_tiny -u $USER
    scancel -M cm2_tiny <JOB-ID>
    sacct -M cm2_tiny -X -u $USER --starttime=2021-01-01T00:00:01

serial_std
  Slurm job settings: --clusters=serial --partition=serial_std --mem=<memory_per_node_MB>M
  Typical job type: single-core jobs
  Note: shared use of compute nodes among users! Default memory = memnode / Ncores_node
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands:
    squeue -M serial -u $USER
    scancel -M serial <JOB-ID>
    sacct -M serial -X -u $USER --starttime=2021-01-01T00:00:01

serial_long
  Slurm job settings: --clusters=serial --partition=serial_long --mem=<memory_per_node_MB>M
  Typical job type: single-core jobs (see the note for serial_std)
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands: as for serial_std (cluster serial)

cm2_inter
  Slurm job settings: --clusters=inter --partition=cm2_inter
  Typical job type: interactive test jobs (see the interactive example at the end of this overview). Do not run production jobs!
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands:
    squeue -M inter -u $USER
    scancel -M inter <JOB-ID>
    sacct -M inter -X -u $USER --starttime=2021-01-01T00:00:01

cm2_inter_large_mem
  Slurm job settings: --clusters=inter --partition=cm2_inter_large_mem --mem=<memory_per_node_MB>M
  Typical job type: single-node shared memory parallel job requiring more memory than available on default compute nodes
  Recommended submit hosts: lxlogin1, lxlogin2, lxlogin3, lxlogin4
  Slurm commands: as for cm2_inter (cluster inter)

cm4_inter_large_mem
  Slurm job settings: --clusters=inter --partition=cm4_inter_large_mem
  Typical job type: jobs which need much more memory than available on CoolMUC-2 compute nodes, but less than the memory available on Teramem
  Recommended submit host: lxlogin5
  Slurm commands: as for cm2_inter (cluster inter)

teramem_inter
  Slurm job settings: --clusters=inter --partition=teramem_inter --mem=<memory_per_node_MB>M
  Typical job type: large-memory job on Teramem
  Recommended submit hosts: lxlogin[1...4], lxlogin8
  Slurm commands: as for cm2_inter (cluster inter)

mpp3_inter
  Slurm job settings: --clusters=inter --partition=mpp3_inter
  Typical job type: interactive test jobs. Do not run production jobs!
  Recommended submit host: lxlogin8
  Slurm commands: as for cm2_inter (cluster inter)

mpp3_batch
  Slurm job settings: --clusters=mpp3 --partition=mpp3_batch
  Typical job type: shared memory thread-parallel job; distributed memory parallel job
  Recommended submit host: lxlogin8
  Slurm commands:
    squeue -M mpp3 -u $USER
    scancel -M mpp3 <JOB-ID>
    sacct -M mpp3 -X -u $USER --starttime=2021-01-01T00:00:01
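For illustration, an interactive test session on the cm2_inter partition could be requested roughly as follows (a sketch using standard Slurm commands; walltime, node count and the program name are example values, and the exact invocation recommended by LRZ may have differed):

  # on a CoolMUC-2 login node, e.g. lxlogin1
  salloc --clusters=inter --partition=cm2_inter --nodes=1 --time=00:30:00
  # once the allocation has been granted, run a short MPI test inside it
  srun --ntasks=28 ./hello_mpi
  # check the job from another shell if needed
  squeue -M inter -u $USER
  # leaving the shell releases the allocation
  exit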