High Performance Computing


Forgot your password? click here
Add new user (only for SuperMUC-NG)? click here
Add new IP (only for SuperMUC-NG)? click here
How to write good LRZ Service Requests? click here
How to set up two-factor authentication (2FA) on HPC systems? click here

End of Life: CoolMUC-2 and CoolMUC-3 will be switched off on Friday, December 13th.

New: Virtual "HPC Lounge" to ask questions and get advice, every Wednesday, 2:00 pm - 3:00 pm.
For details and Zoom Link see: HPC Lounge

System Status (see also: Access and Overview of HPC Systems)

GREEN = fully operational
YELLOW = operational with restrictions (see messages below)
RED = not available (see messages below)



Supercomputer (SuperMUC-NG)

login nodes: skx.supermuc.lrz.de (LOGIN)
archive nodes: skx-arch.supermuc.lrz.de (ARCHIVE)
File systems: HOME, WORK, SCRATCH, DSS, DSA
Partitions/Queues: MICRO, GENERAL, LARGE, FAT, TEST

Detailed node status

Details:

Submit an Incident Ticket for the SuperMUC-NG

Add new user? click here

Add new IP? click here

Questions about 2FA on SuperMUC-NG? click here


Linux Cluster 

CoolMUC-2 (see messages below)

login nodes lxlogin(1,2,3,4).lrz.de: ISSUES
serial partition serial_std: MOSTLY UP
serial partition serial_long: UP
parallel partitions cm2_(std,large): DEAD FOR GOOD
cluster cm2_tiny: UP
interactive partition cm2_inter: DEAD FOR GOOD
c2pap: UP
C2PAP work filesystem /gpfs/work: READ-ONLY

CoolMUC-3

login nodes lxlogin(8,9).lrz.de: DOWN
parallel partition mpp3_batch: MOSTLY UP
interactive partition mpp3_inter: UP

CoolMUC-4

login node lxlogin5.lrz.de: DOWN
interactive partition cm4_inter_large_mem: MOSTLY UP

others

teramem_inter: UP
kcs: PARTIALLY UP
biohpc: MOSTLY UP
hpda: UP

File Systems

HOME: ISSUES
SCRATCH (legacy): UNAVAILABLE
SCRATCH_DSS: DOWN
DSS: UP
DSA: UP


 

Detailed node status
Detailed queue status



Details:

Submit an Incident Ticket for the Linux Cluster

 


DSS Storage systems

For the status overview of the Data Science Storage please go to

https://doku.lrz.de/display/PUBLIC/Data+Science+Storage+Statuspage


Messages

see also: Aktuelle LRZ-Informationen / News from LRZ

Messages for all HPC Systems

A new software stack (spack/23.1.0) is available on CoolMUC-2 and SuperMUC-NG. Release Notes of Spack/23.1.0 Software Stack

This software stack provides new versions of compilers, MPI libraries, and most other applications. There are also significant changes with respect to module suffixes (specifically the MPI and MKL modules) and module interactions: high-level packages now declare their compiler and MPI modules as prerequisites, so that the modules loaded in your terminal environment remain compatible. Please refer to the release notes for detailed changes.

This software stack is rolled out as non-default on both machines. You will have to explicitly swap/switch spack modules to access it. The recommended way is to purge all loaded modules and then load spack/23.1.0:

$> module purge ; module load spack/23.1.0

Please be aware:

  • Using the "module purge" command will unload all previously loaded modules from your terminal shell, including automatically loaded ones such as "intel," "intel-mpi," and "intel-mkl." This step is crucial to prevent potential errors that may arise due to lingering modules.

  • In the future, when version 23.1.0 or later versions of the Spack software stack become the default, we will no longer automatically load any modules (e.g., compilers, MPI, and MKL). This change will provide users with a clean environment to begin their work.
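As a minimal sketch of a batch script that picks up the new stack (job name, node count, partition, wall time, and the application binary are placeholder values; which modules to load after the stack depends on your application):

#!/bin/bash
#SBATCH -J spack-test            # placeholder job name
#SBATCH --nodes=1                # example value
#SBATCH --partition=micro        # placeholder partition; use one your project is entitled to
#SBATCH --time=00:30:00          # example wall time

# Start from a clean environment, then load the new software stack.
module purge
module load spack/23.1.0

# Load compiler/MPI/application modules from the new stack here;
# note that their names/suffixes differ from the old stack.

srun ./my_application            # placeholder executable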

Please reach out to us with any suggestions or questions. Use the "Spack Software Stack" keyword when you open a ticket at https://servicedesk.lrz.de/en/ql/create/26 .


Messages for SuperMUC-NG

Maintenance is finished. The system is back in operation.

Messages for Linux Clusters

Cluster maintenance from Nov 11th 2024 until Nov 15th 2024

Update  

The maintenance will be finished on Nov 20. Affected cluster segments will be back in operation.

Please note: The latest LRZ software stack, spack/23.1.0, is set as the default on the CoolMUC-4 partitions! The old software stack, spack/22.2.1 (commonly used on cm2 and cm4_inter_large_mem nodes in the past), is still available via the corresponding module.
Important: The naming conventions for Intel-related modules (compiler, MPI, MKL) differ between the two Spack software stacks, so users may need to update their SLURM scripts accordingly.
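A quick way to compare the names is to list the Intel-related modules under each stack before editing job scripts (a sketch; the exact module names shown will depend on the system):

$> module purge ; module load spack/22.2.1    # old stack
$> module avail intel                         # note the Intel compiler/MPI/MKL module names
$> module purge ; module load spack/23.1.0    # new stack (default on CoolMUC-4)
$> module avail intel                         # compare and adjust the "module load" lines in SLURM scripts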

--

Update

The maintenance needs to be prolonged until Nov 19th!

--

Original announcement:

Due to work on the power grid infrastructure and security-relevant system updates, all cluster segments listed below are in maintenance from Monday, Nov 11th 2024, 06:30 am, until Friday, Nov 15th 2024, approx. 6:00 pm:

CoolMUC-3 Cluster:

  • LRZ: lxlogin8, inter, mpp3_batch

  • Housing: httf, htus, kcs, kcs_nim,  htce, htfd, htrp, tum_aer, lcg

CoolMUC-4 Cluster:

  • LRZ: lxlogin5, inter, teramem2

  • Housing: httc, hlai, htso, htrp, htls, lcg, biohpc_gen, tum_aer

  • DLR:  dlr-login[1,2], all hpda2_* queues

This means that neither scripted batch jobs nor “salloc” style interactive jobs will execute. 

cm2/cm2_inter are gone forever

Due to a further hardware failure, the complete island 22 went out of operation. This also affects housing clusters attached to the same network. Customers have been informed by mail.

9:30 a.m.: Outage SCRATCH_DSS

The infrastructure maintenance affected the SCRATCH_DSS filesystem and led to an outage. We are working to resolve the problem.

CoolMUC-2/-3:

Due to degradation of the cluster communication network, the CM-2 queues are open for single-node jobs only; SLURM restrictions apply. On CM-3, multi-node jobs can be submitted again. Please refrain from submitting tickets requesting software modernization on either system; the systems are provided "as is" for their remaining lifetime (see below).
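For illustration, a job header conforming to the single-node restriction might look as follows (a sketch only; the cluster name and the tasks-per-node value are assumptions that must match the queue you actually use):

#SBATCH --clusters=cm2_tiny       # assumed cluster name, taken from the status list above
#SBATCH --nodes=1                 # only single-node jobs are currently supported on CM-2
#SBATCH --ntasks-per-node=28      # example value; match the core count of the node type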

Legacy SCRATCH File System of CoolMUC-2/3 Broken - Data recovery

Severe hardware failures occurred on the CoolMUC clusters (SCRATCH filesystem, switches). As a mitigation, until the end of life of CoolMUC-2/3, we have mapped the SCRATCH variable to SCRATCH_DSS (/dss/lxclscratch/.../$USER), which is now also accessible on CoolMUC-2.

Update: Our administrators managed to bring the filesystem back up in read-only mode:

  • /gpfs/scratch/ and /gpfs/work/ are mounted on lxlogin[1-4] and lxloginc2pap

Please do not use the $SCRATCH environment variable; use absolute paths instead, e.g., /gpfs/scratch/<project-id>/<user-id>.

We cannot guarantee data integrity or completeness. Please save all relevant files as soon as possible.
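As a sketch of how data could be rescued (the subdirectory and the destination under $HOME are assumptions; fill in your own <project-id> and <user-id>):

$> rsync -av /gpfs/scratch/<project-id>/<user-id>/important_data/  $HOME/rescue/important_data/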

The filesystem was unmounted on November 9.

End-of-Life Announcement for CoolMUC-2 

After nine years of operation, the hardware of CoolMUC-2 can no longer offer reliable service. The system is targeted to be switched off on Friday, December 13th, at the latest. Due to network degradation, we can only support single-node jobs on a best-effort basis until then. In case of further hardware problems, the shutdown date might be much earlier.

End-of-Life Announcement of CoolMUC-3

Hardware and software support for the Knights Landing nodes and the Omni-Path network on CoolMUC-3 (mpp3_batch) ended several years ago, and the system needs to be decommissioned. It is targeted to be switched off on Friday, December 13th, along with CoolMUC-2. In case of further hardware problems, the shutdown date might be earlier.
Housing segments attached to CoolMUC-3 will stay in operation. 

New Cluster Segment CoolMUC-4

Hardware for a new cluster system, CoolMUC-4, has been delivered and is currently being installed and tested. The cluster comprises roughly 12,000 cores based on Intel® Xeon® Platinum 8480+ (Sapphire Rapids) processors. We expect user operation to start at the beginning of December 2024.

Messages for Compute Cloud and other HPC Systems

The AI Systems will be affected by an infrastructure power cut scheduled in November 2024. The following system partitions will become unavailable for 3 days during the specified time frame. We apologise for the inconvenience.

Calendar Week 46, 2024-11-11 - 2024-11-13

  • lrz-v100x2
  • lrz-hpe-p100x4
  • lrz-dgx-1-p100x8
  • lrz-dgx-1-v100x8
  • lrz-cpu (partly)
  • test-v100x2
  • lrz-hgx-a100-80x4
  • mcml-hgx-a100-80x4
  • mcml-hgx-a100-80x4-mig

The AI Systems (including the MCML system segment) are under maintenance between September 30th and October 2nd, 2024. On these days, the system will not be available to users. Normal user operation is expected to resume during the course of Wednesday, October 2nd.

The previously announced scheduled downtime between 2024-09-16 and 2024-09-27 (Calendar Weeks 38 & 39) has been postponed until further notice. The system will remain in user operation up to the scheduled maintenance at the end of September.