High Performance Computing


Forgot your password? click here
Add new user (only for SuperMUC-NG)? click here
Add new IP (only for SuperMUC-NG)? click here
How to write good LRZ Service Requests? click here
How to set up two-factor authentication (2FA) on HPC systems? click here

End of Life: CoolMUC-2 and CoolMUC-3 will be switched off on Friday, December 13th

New: Virtual "HPC Lounge" to ask questions and get advice. Every Wednesday, 2:00 pm - 3:00 pm
For details and Zoom Link see: HPC Lounge

System Status (see also: Access and Overview of HPC Systems)

GREEN = fully operational, YELLOW = operational with restrictions (see messages below), RED = not available (see messages below)



Supercomputer (SuperMUC-NG)

login nodes: skx.supermuc.lrz.de LOGIN

archive nodes: skx-arch.supermuc.lrz.de ARCHIVE

File Systems: HOME, WORK, SCRATCH, DSS, DSA

Partitions/Queues: MICRO, GENERAL, LARGE, FAT, TEST

Detailed node status

Details:

Submit an Incident Ticket for the SuperMUC-NG

Add new user? click here

Add new IP? click here

Questions about 2FA on SuperMUC-NG? click here


Linux Cluster 

CoolMUC-2 (see messages below)

login nodes lxlogin(1,2,3,4).lrz.de: ISSUES
serial partition serial_std: DOWN
serial partition serial_long: DOWN
parallel partitions cm2_(std,large): DOWN
cluster cm2_tiny: DOWN
interactive partition cm2_inter: DOWN
c2pap: MOSTLY UP
C2PAP work filesystem /gpfs/work: DOWN

CoolMUC-3

login nodes lxlogin(8,9).lrz.de: 2FA ISSUES
parallel partition mpp3_batch: DOWN
interactive partition mpp3_inter: UP

CoolMUC-4

login nodes lxlogin5.lrz.de: UP
interactive partition cm4_inter_large_mem: UP

Others

teramem_inter: UP
kcs: MOSTLY UP
biohpc: UP
hpda: UP

File Systems

HOME: ISSUES
SCRATCH (legacy): DOWN
SCRATCH_DSS: UP
DSS: UP
DSA: UP

 

Detailed node status
Detailed queue status



Details:

Submit an Incident Ticket for the Linux Cluster

 


DSS Storage systems

For the status overview of the Data Science Storage please go to

https://doku.lrz.de/display/PUBLIC/Data+Science+Storage+Statuspage


Messages

see also: Aktuelle LRZ-Informationen / News from LRZ

Messages for all HPC Systems

A new software stack (spack/23.1.0) is available on CoolMUC-2 and SuperMUC-NG. Release Notes of the Spack/23.1.0 Software Stack

This software stack provides new versions of compilers, MPI libraries, and most other applications. There are also significant changes to module suffixes (specifically for the MPI and MKL modules) and module interactions (high-level packages now declare their compiler and MPI modules as prerequisites, so that the modules loaded in your terminal environment remain compatible). Please refer to the release notes for detailed changes.

This software stack is rolled out as non-default on both machines. You will have to explicitly swap/switch spack modules to access the new software stack. The best way is to purge all loaded modules and then load spack/23.1.0:

$> module purge ; module load spack/23.1.0

Please be aware:

  • Using the "module purge" command will unload all previously loaded modules from your terminal shell, including automatically loaded ones such as "intel," "intel-mpi," and "intel-mkl." This step is crucial to prevent potential errors that may arise due to lingering modules.

  • In the future, when version 23.1.0 or later versions of the Spack software stack become the default, we will no longer automatically load any modules (e.g., compilers, MPI, and MKL). This change will provide users with a clean environment to begin their work.
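For illustration, a complete shell session following the steps above might look as sketched below. Any module names other than spack/23.1.0 are assumptions for the example; check module avail for the compilers and libraries actually provided by the new stack.

$> module purge                  # remove all previously loaded modules, including auto-loaded ones such as intel, intel-mpi, intel-mkl
$> module load spack/23.1.0      # activate the new software stack
$> module avail                  # list the compilers, MPI libraries and applications provided by the new stack
$> module load gcc intel-mpi     # assumed example only: load a compiler and an MPI library from the new stack
$> module list                   # verify the resulting environment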

Please reach out to us with any suggestions or questions. Use the "Spack Software Stack" keyword when you open a ticket at https://servicedesk.lrz.de/en/ql/create/26 .


Messages for SuperMUC-NG

Maintenance of SuperMUC-NG (Tuesday, October 29, 9:00 a.m.)

A hardware failure in the enclosure of the WORK file system requires maintenance on Tuesday, October 29, starting at 9:00 a.m. Login nodes will be closed before the start of the maintenance. We have set up a reservation in the SLURM scheduler to suspend job processing. All running jobs will terminate regularly beforehand. The system should be back online in the late afternoon.

Maintenance finished. System is back in operation.

Messages for Linux Clusters

Legacy SCRATCH File System of CoolMUC-2/3 Broken

Severe hardware failures have occurred on the CoolMUC clusters (SCRATCH filesystem, switches).

The old SCRATCH file system (/gpfs/scratch/$PROJ/$USER) cannot be recovered. We are sorry to announce that the data is irretrievably lost. Kindly refrain from inquiries concerning data access. Unfortunately, this also applies to the C2PAP work filesystem, since /gpfs/work was part of the $SCRATCH filesystem.
Until the end of life of CoolMUC-2/3 (see below), we have mapped the SCRATCH variable to SCRATCH_DSS (/dss/lxclscratch/.../$USER), which is now also accessible on CoolMUC-2.
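Job scripts that refer to the environment variable should therefore keep working unchanged. A quick way to check where the variable now points (the exact path depends on your project and user; the form shown in the comment is illustrative only):

$> echo $SCRATCH                 # now resolves to the SCRATCH_DSS area, e.g. /dss/lxclscratch/.../$USER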


End-of-Life Announcement for CoolMUC-2 

After 9 years of operation, the hardware of CoolMUC-2 can no longer offer reliable service. The system is targeted to be turned off on Friday, December 13th, at the latest. Due to network degradation, we can only support single-node jobs on a best-effort basis until then. In case of further hardware problems, the shutdown date might be much earlier.

End-of-Life Announcement of CoolMUC-3

Hardware and software support for the Knights Landing nodes and the Omni-Path network on CoolMUC-3 (mpp3_batch) ended several years ago, and the system needs to be decommissioned. The system is targeted to be turned off on Friday, December 13th, along with CoolMUC-2. Housing segments attached to CoolMUC-3 will stay in operation.

New Cluster Segment CoolMUC-4

Hardware for a new cluster system, CoolMUC-4, has been delivered and is currently being installed and tested. The cluster comprises roughly 12,000 cores based on Intel® Xeon® Platinum 8480+ (Sapphire Rapids) processors. We expect user operation to start at the beginning of December 2024.

Messages for Compute Cloud and other HPC Systems

The AI Systems will be affected by an infrastructure power cut scheduled for November 2024. The following system partitions will be unavailable for 3 days during the specified time frame. We apologise for the inconvenience.

Calendar Week 46, 2024-11-11 - 2024-11-13

  • lrz-v100x2
  • lrz-hpe-p100x4
  • lrz-dgx-1-p100x8
  • lrz-dgx-1-v100x8
  • lrz-cpu (partly)
  • test-v100x2
  • lrz-hgx-a100-80x4
  • mcml-hgx-a100-80x4
  • mcml-hgx-a100-80x4-mig

The AI Systems (including the MCML system segment) are under maintenance between September 30th and October 2nd, 2024. On these days, the system will not be available to users. Normal user operation is expected to resume during the course of Wednesday, October 2nd.

The previously announced scheduled downtime between 2024-09-16 and 2024-09-27 (Calendar Weeks 38 & 39) has been postponed until further notice. The system will remain in user operation up to the scheduled maintenance at the end of September.