3.0 Specifics for MCML Members

Access Management for MCML PIs

MCML PIs who want members of their research group to access the MCML partition of the LRZ AI Systems should open a service request with the LRZ Servicedesk here
and choose/add

  • Type: Service Request
  • Description: "Access to MCML Segment @ LRZ AI Systems";
  • Details:
    • Please specify the name of the MCML PI
    • If applicable (i.e. if the working group already has one), please provide an LRZ Master User/Linux Cluster project ID

Further instructions will follow via the service request ticket.

Once a suitable LRZ Master User/Linux Cluster project has been created/set up as an "MCML project", any account within this project that is assigned Linux Cluster permissions (by the Master Users) will automatically also be granted access to the LRZ AI Systems, including the MCML partition of this system. Similarly, removing Linux Cluster permissions from individual accounts also revokes their LRZ AI Systems permissions, including access to the MCML partition.

Dedicated Compute Hardware

In addition to the generally available resources listed on General Description and Resources, the user IDs associated with the LRZ projects of MCML research groups are entitled to use the following hardware. Access will be granted automatically (for dedicated MCML LRZ projects) or upon request (for eligible accounts of pre-existing LRZ Master User projects). The allocation time limit for individual jobs is 4 days (4-00:00:00).


Slurm Partition                             | Number of nodes | CPU cores per node | Memory per node | GPUs per node | Memory per GPU
--------------------------------------------|-----------------|--------------------|-----------------|---------------|---------------
mcml-hgx-a100-80x4 (HGX A100 Architecture)  | 21              | 96                 | 1 TB            | 4 NVIDIA A100 | 80 GB
mcml-dgx-a100-40x8 (DGX A100 Architecture)  | 8               | 256                | 1 TB            | 8 NVIDIA A100 | 40 GB
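
Eligible users can inspect the current state of these partitions with standard Slurm commands. The following is a minimal sketch; the output format string is just one possible choice, not prescribed by LRZ:

$ sinfo -p mcml-hgx-a100-80x4,mcml-dgx-a100-40x8     # node states of the MCML partitions
$ sinfo -p mcml-dgx-a100-40x8 -o "%P %D %c %m %G"    # partition, node count, CPUs, memory and GRES per node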

In order to use resources of the mcml-dgx-a100-40x8 partition, eligible users need to specify the "mcml" quality of service (QoS) for their job allocation and/or submission, e.g.

$ salloc -p mcml-dgx-a100-40x8 -q mcml -n 8 --gres=gpu:8  # short form options, where available
$ salloc --partition=mcml-dgx-a100-40x8 --qos=mcml --ntasks=8 --gres=gpu:8  # long form options

In the same way, to use resources of the mcml-hgx-a100-80x4 partition, eligible users need to specify the "mcml" quality of service (QoS) for their job allocation and/or submission, e.g.

$ salloc -p mcml-hgx-a100-80x4 -q mcml -n 4 --gres=gpu:4  # short form options, where available
$ salloc --partition=mcml-hgx-a100-80x4 --qos=mcml --ntasks=4 --gres=gpu:4  # long form options
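
For batch jobs, the same partition and QoS options can be given as #SBATCH directives. Below is a minimal sketch of such a script (the job name, script file name and the python command are placeholders, not prescribed by LRZ); the requested --time must stay within the 4-day job limit mentioned above.

#!/bin/bash
#SBATCH --job-name=mcml-example          # placeholder job name
#SBATCH --partition=mcml-dgx-a100-40x8   # MCML DGX partition
#SBATCH --qos=mcml                       # required "mcml" quality of service
#SBATCH --ntasks=8
#SBATCH --gres=gpu:8                     # all 8 A100 GPUs of one DGX node
#SBATCH --time=4-00:00:00                # must not exceed the 4-day job limit

srun python train.py                     # placeholder for the actual workload

$ sbatch mcml-example.sbatch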

Smaller Scale Resources / Multi Instance GPU Mode

Additionally, via NVIDIA's Multi-Instance GPU (MIG) mode (see the NVIDIA Multi-Instance GPU User Guide), a number of smaller-scale resources ("virtual GPU instances") are available. Some A100 GPUs have been partitioned into slices which are offered as virtual GPU instances. MIG divides each card into 7 slices, which can be combined in different ways. The following table indicates how these slices are combined on different MCML nodes in the mcml-hgx-a100-80x4 partition.

Slurm Partition     | Number of nodes | GPUs per node | MIG mode per GPU
--------------------|-----------------|---------------|------------------
mcml-hgx-a100-80x4  | 5               | 4 NVIDIA A100 | 3 + 2 + 2
mcml-hgx-a100-80x4  | 2               | 4 NVIDIA A100 | 3 + 2 + 1 + 1

The table above is to be read as follows. The first row indicates that there are five nodes whose GPUs are partitioned into three virtual GPU instances each: one with three slices out of seven and two with two slices each. The second row indicates that there are two nodes whose GPUs are partitioned into four virtual GPU instances each: one with three slices, one with two slices, and two with one slice each.

In case you want to allocate one instance with three slices (i.e., three sevenths of the capacity of a full A100), the following code block shows an example.

$ salloc -p mcml-hgx-a100-80x4 -q mcml -n 8 --gres=gpu:3g  # short form options, where available
$ salloc --partition=mcml-hgx-a100-80x4 --qos=mcml --ntasks=8 --gres=gpu:3g  # long form options
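
To check which virtual GPU instance has actually been assigned to your allocation, you can, for instance, list the visible devices on the allocated node. This is a minimal sketch; the MIG profile names shown in the output depend on the node's slice configuration:

$ srun nvidia-smi -L    # lists the physical GPU and the MIG device(s) visible to the job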

Please be aware that only a single slice (virtual GPU instance) can be used by a single job; MIG mode does not support multi-'GPU' computing.

Dedicated Storage Options

As indicated on Storage on the LRZ AI Systems, there is a dedicated MCML DSS system (/dssmcmlfs01) available to eligible users. This is high-performance, SSD-based network storage. The system is intended for high-bandwidth, low-latency I/O operations, serving the demands of modern-day AI applications.

The Master Users of eligible LRZ projects (i.e. MCML research groups) are welcome to submit a service request to the LRZ Servicedesk asking for storage on the MCML DSS system. By using this link, they will open a ticket for the appropriate service. Select "AI topics", then "Master user only: Application for project storage space (DSS AI)", and confirm. Fill in the form and make sure to note that the request is for MCML DSS. Finally, submit the form.

Once granted, the Master User subsequently acts as DSS Data Curator and manages the assigned storage quota for their project. A quota of up to 10 TB, 20,000,000 files and a maximum of 5 DSS containers (typically a single container) will be assigned.
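
To check how much of the assigned quota is already in use, the dssusrinfo tool known from other LRZ DSS installations can be used on the login nodes (assuming it is available on the LRZ AI Systems; the container path in the second command is a placeholder):

$ dssusrinfo all                          # quota and usage of all DSS containers you have access to
$ du -sh /dssmcmlfs01/<container>/$USER   # placeholder path: usage of your directory within a container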