Operation

The documentation of the Linux Cluster systems has been extended to also cover operation of the CooLMUC-3 system. This document mainly describes the restrictions and deviations from the documented behaviour that apply during the introductory phase of operation, as well as some special properties of the system that differ from the other cluster systems at LRZ.

Login node and operating environment

The front end node lxlogin8.lrz.de must be used to do development work for the CooLMUC-3 system. That login node is not a many-core (KNL) node itself, so it may not be possible to execute binaries there that have been custom built for the KNL architecture. However, the Intel development software stack permits you to do cross-compilation there, and an interactive SLURM job can be used to execute test runs on KNL compute nodes.

Please also note that CooLMUC-3 deploys a new release of the operating environment (SLES12). Programs built on this platform are unlikely to execute properly on the other HPC systems at LRZ.

Batch processing

A separate SLURM instance serves CooLMUC-3, so all batch runs for CooLMUC-3 must be submitted from the CooLMUC-3 login node lxlogin8 (this will remain the case for quite some time).
Initially, at most 60 nodes can be used for a single job for at most 12 hours.
Initially, not all hardware facilities will be available:

There may be limitations on the efficient use of hyperthreading; optimal pinning of processes and threads under SLURM is still under investigation.
Switching cluster modes and HBM modes is now supported.
Setting core frequencies is supported (within limits); please use the --cpu-freq switch of the SLURM sbatch command (man sbatch supplies details).
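The initial job limits and the --cpu-freq switch mentioned above can be illustrated with a small helper that assembles an sbatch command line. This is only a sketch: the function name build_sbatch_args, the script name job.sh and the frequency value "high" are illustrative assumptions, and which --cpu-freq values are actually permitted on CooLMUC-3 is not stated here (consult man sbatch).

```python
# Sketch: assemble an sbatch command line within the initial CooLMUC-3 limits.
MAX_NODES = 60   # initial per-job node limit stated above
MAX_HOURS = 12   # initial per-job run time limit stated above


def build_sbatch_args(nodes, hours, script, cpu_freq=None):
    """Return an argv list for sbatch; raise ValueError if a limit is exceeded."""
    if not 1 <= nodes <= MAX_NODES:
        raise ValueError(f"nodes must be between 1 and {MAX_NODES}")
    if not 0 < hours <= MAX_HOURS:
        raise ValueError(f"run time must be at most {MAX_HOURS} hours")
    total_minutes = round(hours * 60)
    args = [
        "sbatch",
        f"--nodes={nodes}",
        f"--time={total_minutes // 60:02d}:{total_minutes % 60:02d}:00",
    ]
    if cpu_freq is not None:
        # Passed through unchanged; the accepted values are site dependent.
        args.append(f"--cpu-freq={cpu_freq}")
    args.append(script)
    return args


# "job.sh" and "high" are placeholder values for illustration only.
print(" ".join(build_sbatch_args(4, 1.5, "job.sh", cpu_freq="high")))
```

The helper only builds the argument list; it could be handed to subprocess.run on the login node lxlogin8, which is where CooLMUC-3 jobs must be submitted.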

Restrictions on file system access

The SCRATCH file system of CooLMUC-2 is now also available on CooLMUC-3, but it is still in a somewhat experimental phase. If the directory that the $SCRATCH environment variable points to does not exist, or if I/O operations fail, please use the "old" scratch area pointed to by $SCRATCH_LEGACY instead. In addition, the current I/O performance of the system is still below expectations; work is under way to improve the situation.
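The fallback rule for the scratch area can be sketched as follows. The function name pick_scratch is an illustrative assumption; the check only tests whether the directory exists and does not detect the failing I/O operations mentioned above.

```python
import os


def pick_scratch(env=None):
    """Return the first existing directory among $SCRATCH and $SCRATCH_LEGACY."""
    env = os.environ if env is None else env
    for name in ("SCRATCH", "SCRATCH_LEGACY"):
        path = env.get(name)
        # Accept the variable only if it is set and names an existing directory.
        if path and os.path.isdir(path):
            return path
    raise RuntimeError("neither $SCRATCH nor $SCRATCH_LEGACY points to a directory")


if __name__ == "__main__":
    try:
        print(pick_scratch())
    except RuntimeError as err:
        print(err)
```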

CooLMUC-3 Features

This section describes additional features that can be requested for a job via the -C (or --constraint=) option. Note that only the values listed below can be specified; using contradictory values may result in undesired behaviour.

Purpose of feature	Value to be specified	Effect
Select cluster mode	quad	"quadrant"; affinity between cache management and memory. Recommended for everyday use. For certain shared-memory workloads, where the application can use all the cores in a single process through a threading library such as OpenMP or TBB, this mode can also provide better performance than Sub-NUMA clustering mode.
Select cluster mode	snc4	"Sub-NUMA clustering"; affinity between tiles, cache management and memory. NUMA-optimized software can profit from this mode. This mode is suitable for distributed memory programming models using MPI or hybrid MPI-OpenMP. Proper pinning of tasks and threads is essential.
Select cluster mode	a2a	"alltoall"; no affinity between tiles, cache management and memory. Not recommended because performance is degraded.
Select memory mode	flat	High bandwidth memory is operated as regular memory mapped into the address space. Note: due to SLURM limitations, the maximum available memory per node for a job will still only be 96 GBytes. Use of "numactl -p" or the Memkind library is recommended.
Select memory mode	cache	High bandwidth memory is operated as cache to DDR memory.
Select memory mode	hybrid	High bandwidth memory is evenly split between regular memory (8 GB) and cache (8 GB).
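Since contradictory values may lead to undesired behaviour, a job submission helper can ensure that at most one cluster mode and one memory mode are requested. The sketch below is an assumption-laden illustration: the function name constraint_option is hypothetical, and joining the two features with "&" relies on SLURM's generic AND operator for --constraint rather than on anything stated in the table above.

```python
CLUSTER_MODES = {"quad", "snc4", "a2a"}
MEMORY_MODES = {"flat", "cache", "hybrid"}


def constraint_option(cluster_mode=None, memory_mode=None):
    """Return the sbatch arguments requesting the given modes (possibly none)."""
    if cluster_mode is not None and cluster_mode not in CLUSTER_MODES:
        raise ValueError(f"unknown cluster mode: {cluster_mode}")
    if memory_mode is not None and memory_mode not in MEMORY_MODES:
        raise ValueError(f"unknown memory mode: {memory_mode}")
    # Taking one value per category rules out contradictory requests.
    features = [f for f in (cluster_mode, memory_mode) if f is not None]
    if not features:
        return []
    # Assumption: "&" is used as the AND operator between features.
    return ["--constraint=" + "&".join(features)]


print(constraint_option("quad", "cache"))
```

When the resulting argument is typed into a shell rather than passed as an argument list, the "&" has to be quoted.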

System Overview

Infrastructure
Hardware
Number of nodes	148
Cores per node	64
Hyperthreads per core	4
Core nominal frequency	1.3 GHz
Memory (DDR4) per node	96 GB (bandwidth 80.8 GB/s)
High Bandwidth Memory per node	16 GB (bandwidth 460 GB/s)
Bandwidth to interconnect per node	25 GB/s (2 Links)
Number of Omnipath switches (100SWE48)	10 + 4 (48 ports each)
Bisection bandwidth of interconnect	1.6 TB/s
Latency of interconnect	2.3 µs
Peak performance of system	459 TFlop/s
Electric power of fully loaded system	62 kVA
Percentage of waste heat to warm water	97%
Inlet temperature range for water cooling	30 … 50 °C
Temperature difference between outlet and inlet	4 … 6 °C
Software (OS and development environment)
Operating system	SLES12 SP2 Linux
MPI	Intel MPI 2017, alternatively OpenMPI
Compilers	Intel icc, icpc, ifort 2017
Performance libraries	MKL, TBB, IPP
Tools for performance and correctness analysis	Intel Cluster Tools

The performance numbers in the above table are theoretical and cannot be reached by any real-world application. For the actually observable memory bandwidth of the high bandwidth memory, the STREAM benchmark will yield approximately 450 GB/s per node, and the commitment for the LINPACK performance of the complete system is 255 TFlop/s.
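To put these figures in relation to the theoretical values in the table, the following sketch computes the corresponding percentages. It uses only the numbers quoted above; the function name percent_of_peak is illustrative.

```python
def percent_of_peak(measured, theoretical):
    """Return measured performance as a percentage of the theoretical value."""
    return round(100.0 * measured / theoretical, 1)


# LINPACK commitment (255 TFlop/s) versus peak performance (459 TFlop/s)
linpack_percent = percent_of_peak(255, 459)
# STREAM bandwidth (about 450 GB/s) versus nominal HBM bandwidth (460 GB/s)
stream_percent = percent_of_peak(450, 460)

print(f"LINPACK: {linpack_percent} % of peak")
print(f"STREAM:  {stream_percent} % of nominal HBM bandwidth")
```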

Many-core Architecture, the “Knights Landing” Processor

The processor generation installed in this system is the first from Intel to be stand-alone; previous generations were only available as accelerators, which greatly added to the complexity and effort required to provision, program, and use such systems. The LRZ cluster, in contrast, can be installed with a standard operating system and can be used with the same programming models familiar to users of Xeon-based clusters. The high-speed interconnect between the compute nodes is realized by an on-chip interface and a mainboard-integrated dual-port adapter, which benefits latency-bound parallel applications in comparison to PCI-card interconnects.

A further feature of the architecture is the closely integrated MCDRAM, also known as “High-Bandwidth Memory” (HBM). The bandwidth of this memory area is an order of magnitude higher than that of conventional memory technology. It can be configured as cache memory, as directly addressable memory, or as a 50/50 hybrid of the two types. The vector units have also been expanded: each core contains two AVX-512 VPUs and can therefore, when multiplication and addition operations are combined, perform 32 double-precision floating-point operations per cycle. Each pair of cores is tightly coupled and shares a 1 MB L2 cache to form a “tile”, and 32 tiles share a 2-dimensional interconnect with a bisection bandwidth of over 700 GB/s, over which cache-coherence and data traffic flow in various user-configurable modes. To what extent such configuration options are capable of optimizing specific user applications is currently under evaluation, as in general a reboot of the affected nodes is required.
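The figure of 32 double-precision operations per cycle and the number of tiles follow from simple arithmetic, sketched below. The constant names are illustrative; the values are taken from the text and the hardware table.

```python
VPUS_PER_CORE = 2      # two AVX-512 vector units per core
VECTOR_BITS = 512      # AVX-512 register width
DOUBLE_BITS = 64       # size of one double-precision value
FLOPS_PER_FMA = 2      # a fused multiply-add counts as two operations
CORES_PER_NODE = 64    # from the hardware table
CORES_PER_TILE = 2     # each tile couples a pair of cores

# Number of double-precision values processed by one vector instruction
lanes = VECTOR_BITS // DOUBLE_BITS
flops_per_cycle = VPUS_PER_CORE * lanes * FLOPS_PER_FMA
tiles_per_node = CORES_PER_NODE // CORES_PER_TILE

print(f"{lanes} lanes, {flops_per_cycle} flops/cycle/core, {tiles_per_node} tiles")
```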

Because of the low core frequency as well as the small per-core memory of its nodes, the system is not suited for serial throughput workloads, even though the instruction set permits execution of legacy binaries. For best performance, it is likely that a significant optimization effort for existing parallel applications must be undertaken. To make efficient use of the memory and exploit all levels of parallelism in the architecture, a hybrid approach (e.g. using both MPI and OpenMP) is typically considered best practice. Restructuring of data layouts will often be required in order to achieve cache locality, a prerequisite for effectively using the wider vector units. For codes that require use of the distributed memory paradigm with small message sizes, the integration of the Omnipath network interface on the chip set of the computational node can bring a significant performance advantage over a PCI-attached network card.
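The trade-off behind the hybrid approach can be made concrete by listing how the 64 cores and 96 GB of DDR4 memory of a node divide among MPI tasks. The sketch below is illustrative only: the function name hybrid_layouts is hypothetical, it assumes one OpenMP thread per core (hyperthreads are ignored) and it considers only power-of-two task counts.

```python
CORES_PER_NODE = 64
DDR_GB_PER_NODE = 96


def hybrid_layouts(cores=CORES_PER_NODE, ddr_gb=DDR_GB_PER_NODE):
    """Return (tasks per node, threads per task, GB per task) tuples."""
    layouts = []
    tasks = 1
    while tasks <= cores:
        # Keep only task counts that divide the cores evenly.
        if cores % tasks == 0:
            layouts.append((tasks, cores // tasks, ddr_gb / tasks))
        tasks *= 2
    return layouts


for tasks, threads, gb in hybrid_layouts():
    print(f"{tasks:2d} tasks x {threads:2d} threads, {gb:5.1f} GB per task")
```

With one task per core, only 1.5 GB of DDR4 memory remains per task, whereas fewer tasks with more threads each leave correspondingly more memory per task.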

Over the past three years, LRZ has acquired know-how in optimizing for many-core systems through collaboration with Intel. This collaboration included tuning codes for optimal execution on the previous-generation “Knights Corner” accelerator cards used in the SuperMIC prototype system; guidance on how to do such optimization will be documented on the LRZ web server, and can be supplied on a case-by-case basis by the LRZ application support staff members. The Intel development environment (“Intel Parallel Studio XE”), which includes compilers, performance libraries, an MPI implementation and additional tuning, tracing and diagnostic tools, assists programmers in establishing good performance for applications. Courses on programming many-core systems as well as using the Intel toolset are regularly scheduled within the LRZ course program[1].

Omnipath Interconnect

With its acquisition of network technologies in the last decade, Intel has chosen a new strategy for networks, namely the integration of the network into the processor architecture. This cluster will be the first time the Omnipath interconnect, in its already mature first generation, is put into service at the LRZ. It is characterized by markedly lower application latencies, higher achievable message rates, and high aggregate bandwidths at a better price than competing hardware technologies. The LRZ will use this system to gather experience with the management, stability, and performance of the new technology.

Cooling Infrastructure

LRZ was already a pioneer in the introduction of warm-water cooling in Europe. CooLMUC-1, installed in mid-2011 by MEGWare, was the first such system at the LRZ. A little more than a year later, the IBM/Lenovo 3-PFlop “SuperMUC” went into service, the first system that allowed an inlet temperature of around 40 °C, in turn allowing year-round chillerless cooling. Furthermore, the high water temperatures allowed this waste energy to be captured in the form of additional cooling power for the remaining air-cooled and cool-water-cooled components (e.g. storage). In 2016, after a detailed pilot project in collaboration with the company Sortech, an adsorption cooling system was put into service for the first time, converting roughly half of the Lenovo cluster’s waste heat into cooling capacity.

CooLMUC-3 will now take the next step in improving energy efficiency. Via the introduction of water cooling for additional components like power supplies and network components, it will be possible to thermally insulate the racks and virtually eliminate the emission of heat (only 3% of the electrical energy) to the server room.