3.0 Specifics for MCML Members

Access Management for MCML PIs

MCML PIs who want members of their research group to access the MCML partition of the LRZ AI Systems should open a service request with the LRZ Servicedesk here
and choose/add

  • Type: Service Request
  • Description: "Access to MCML Segment @ LRZ AI Systems";
  • Details:
    • Please specify the name of the MCML PI
    • If applicable (i.e. if the working group already has one), please provide an LRZ Master User/Linux Cluster project ID

Further instructions will follow via the service request ticket.

Once a suitable LRZ Master User/Linux Cluster project has been created/set up as an "MCML project", any account within this project that is assigned Linux Cluster permissions (by the Master Users) will automatically also be granted access to the LRZ AI Systems, including the MCML partition of this system. Similarly, removing Linux Cluster permissions from individual accounts also revokes their LRZ AI Systems permissions, including access to the MCML partition.

Dedicated Compute Hardware

In addition to the generally available resources listed on General Description and Resources, the user IDs associated with the LRZ projects of MCML research groups are entitled to use the following hardware. Access will be granted automatically (for dedicated MCML LRZ projects) or upon request (for eligible accounts of pre-existing LRZ Master User projects). The allocation time limit for individual jobs is 4 days (4-00:00:00).


Slurm Partition                             | Number of nodes | CPU cores per node | Memory per node | GPUs per node | Memory per GPU
--------------------------------------------|-----------------|--------------------|-----------------|---------------|---------------
mcml-hgx-a100-80x4 (HGX A100 Architecture)  | 21              | 96                 | 1 TB            | 4 NVIDIA A100 | 80 GB
mcml-dgx-a100-40x8 (DGX A100 Architecture)  | 8               | 256                | 1 TB            | 8 NVIDIA A100 | 40 GB
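
Eligible users can inspect the current state of these partitions with standard Slurm commands. The following is a minimal sketch; the output format string is just one possible choice, not prescribed by LRZ:

$ sinfo -p mcml-hgx-a100-80x4,mcml-dgx-a100-40x8     # node states of the MCML partitions
$ sinfo -p mcml-dgx-a100-40x8 -o "%P %D %c %m %G"    # partition, node count, CPUs, memory and GRES per node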

In order to use resources of the mcml-dgx-a100-40x8 partition, eligible users need to specify the "mcml" quality of service (QoS) for their job allocation and/or submission, e.g.

$ salloc -p mcml-dgx-a100-40x8 -q mcml -n 8 --gres=gpu:8  # short form options, where available
$ salloc --partition=mcml-dgx-a100-40x8 --qos=mcml --ntasks=8 --gres=gpu:8  # long form options

In the same way, to use resources of the mcml-hgx-a100-80x4 partition, eligible users need to specify the "mcml" quality of service (QoS) for their job allocation and/or submission, e.g.

$ salloc -p mcml-hgx-a100-80x4 -q mcml -n 4 --gres=gpu:4  # short form options, where available
$ salloc --partition=mcml-hgx-a100-80x4 --qos=mcml --ntasks=4 --gres=gpu:4  # long form options
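
For batch jobs, the same partition and QoS options can be given as #SBATCH directives. Below is a minimal sketch of such a script (the job name, script file name and the python command are placeholders, not prescribed by LRZ); the requested --time must stay within the 4-day job limit mentioned above.

#!/bin/bash
#SBATCH --job-name=mcml-example          # placeholder job name
#SBATCH --partition=mcml-dgx-a100-40x8   # MCML DGX partition
#SBATCH --qos=mcml                       # required "mcml" quality of service
#SBATCH --ntasks=8
#SBATCH --gres=gpu:8                     # all 8 A100 GPUs of one DGX node
#SBATCH --time=4-00:00:00                # must not exceed the 4-day job limit

srun python train.py                     # placeholder for the actual workload

$ sbatch mcml-example.sbatch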

Smaller Scale Resources / Multi Instance GPU Mode

Additionally, via NVIDIA's Multi-Instance GPU (MIG) mode (see the NVIDIA Multi-Instance GPU User Guide), a number of smaller-scale resources ("virtual GPU instances") are available. Some A100 GPUs have been partitioned into slices which are offered as virtual GPU instances. MIG divides each card into 7 slices, which can be combined in different ways. The following table indicates how these slices are combined on different MCML nodes in the mcml-hgx-a100-80x4 partition.

Slurm Partition     | Number of nodes | GPUs per node | MIG mode per GPU
--------------------|-----------------|---------------|------------------
mcml-hgx-a100-80x4  | 5               | 4 NVIDIA A100 | 3 + 2 + 2
mcml-hgx-a100-80x4  | 2               | 4 NVIDIA A100 | 3 + 2 + 1 + 1

The table above is to be read as follows. The first row indicates that there are five nodes whose GPUs are partitioned into three virtual GPU instances each: one with three slices out of seven and two with two slices each. The second row indicates that there are two nodes whose GPUs are partitioned into four virtual GPU instances each: one with three slices, one with two slices, and two with one slice each.

In case you want to allocate one instance with three slices (i.e., three sevenths of the capacity of a full A100), the following code block shows an example.

$ salloc -p mcml-hgx-a100-80x4 -q mcml -n 8 --gres=gpu:3g  # short form options, where available
$ salloc --partition=mcml-hgx-a100-80x4 --qos=mcml --ntasks=8 --gres=gpu:3g  # long form options
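
To check which virtual GPU instance has actually been assigned to your allocation, you can, for instance, list the visible devices on the allocated node. This is a minimal sketch; the MIG profile names shown in the output depend on the node's slice configuration:

$ srun nvidia-smi -L    # lists the physical GPU and the MIG device(s) visible to the job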

Please be aware that only a single slice (virtual GPU instance) can be used by a single job; MIG mode does not support multi-'GPU' computing.

Dedicated Storage Options

As indicated on Storage on the LRZ AI Systems, there is a dedicated MCML DSS system (/dssmcmlfs01) available to eligible users. This is high-performance, SSD-based network storage. The system is intended for high-bandwidth, low-latency I/O operations, serving the demands of modern-day AI applications.

The Master Users of eligible LRZ projects (i.e. MCML research groups) are welcome to submit a service request to the LRZ Servicedesk asking for storage on the MCML DSS system. By using this link, they will open a ticket for the appropriate service. Select "AI topics", then "Master user only: Application for project storage space (DSS AI)", and confirm. Fill in the form and make sure to note that the request is for MCML DSS. Finally, submit the form.

Once granted, the Master User subsequently acts as DSS Data Curator and manages the assigned storage quota for their project. A quota of up to 10 TB, 20,000,000 files and a maximum of 5 DSS containers (typically a single container) will be assigned.
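
To check how much of the assigned quota is already in use, the dssusrinfo tool known from other LRZ DSS installations can be used on the login nodes (assuming it is available on the LRZ AI Systems; the container path in the second command is a placeholder):

$ dssusrinfo all                          # quota and usage of all DSS containers you have access to
$ du -sh /dssmcmlfs01/<container>/$USER   # placeholder path: usage of your directory within a container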