3. Storage


The LRZ AI Systems integrate the full offering of the LRZ Data Science Storage (DSS) system.

Storage Options

Home Directory
  Use Case:             Critical files such as code, scripts, and configuration files that are small and need regular backups.
  Top-level Directory:  /dss/dsshome1
  Size Limit:           100 GB
  Automated Backup:     Yes, backup to tape and file system snapshots
  Expiration:           Lifetime of the LRZ project

AI Systems DSS
  Use Case:             High-bandwidth, low-latency storage for I/O; use this for reading and writing model data and results.
  Top-level Directory:  /dss/dssfs04
  Size Limit:           Up to 4 TB; 4 TB or more at additional cost
  Automated Backup:     No for the free tier / yes for the paid option
  Expiration:           Until further notice

Linux Cluster DSS
  Use Case:             General-purpose, long-term storage.
  Top-level Directory:  /dss/dssfs02, /dss/dssfs03, /dss/dssfs05
  Size Limit:           Up to 10 TB; 20 TB or more at additional cost
  Automated Backup:     No for the free tier / yes for the paid option
  Expiration:           Lifetime of the data project

Private DSS
  Use Case:             System owner defined
  Top-level Directory:  /dss/dsslegfs01, /dss/dsslegfs02, /dss/dssmcmlfs01, /dss/mcmlscratch
  Size Limit:           System owner defined
  Automated Backup:     System owner defined
  Expiration:           System owner defined

Home Directory

Home directories can be accessed via SSH on the login nodes (login.ai.lrz.de) or through the web frontend at https://login.ai.lrz.de. When logging in via terminal, you are placed directly in your home directory. By typing `pwd` immediately after login, you can verify this and see the full path to your current (home) directory.
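
A minimal terminal session might look as follows; `xy12abc` is a placeholder for your own LRZ user name:

ssh xy12abc@login.ai.lrz.de   # connect to an AI Systems login node
pwd                           # prints the full path of your home directory (below /dss/dsshome1)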

The LRZ AI Systems share a unified home directory with the LRZ Linux Cluster (see File Systems and IO on Linux-Cluster). Home directories are hosted in a dedicated DSS container managed by LRZ. They are limited in both capacity and I/O performance (bandwidth and latency) and are therefore not suitable for high-intensity AI workloads or large-scale data operations.

The home directory should primarily be used to store code, configuration files, and other lightweight data. All home directories are regularly backed up to ensure data integrity and security.
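
To check how much of the 100 GB quota is currently in use, the total size of the home directory can be summarized from a login node; note that this walks every file and may take a moment:

du -sh $HOME   # total size of your home directory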

AI Systems DSS

A dedicated AI Systems DSS provides high-performance, SSD-based network storage designed for demanding AI workloads. It is optimized for high-bandwidth, low-latency I/O operations to support the data-intensive requirements of modern AI applications. In contrast to the home directory, the AI Systems DSS is the appropriate location for high-intensity AI workloads and large-scale data operations.

Access to this storage is granted upon request by the Master User of an LRZ project via the LRZ Servicedesk using the following form.
A quota of up to 4 TB, 8 million files, and a maximum of 3 DSS containers (typically a single container) can be allocated.
Once provisioned, the Master User assumes the role of DSS Data Curator and is responsible for managing the assigned storage quotas within their project.

Additional AI Systems DSS storage (4 TB or more) can be requested as part of the DSS on demand offering, which is subject to additional costs.
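
Once a container has been provisioned, the file system can be inspected from a login node. The exact container path below /dss/dssfs04 is project specific and is reported by the dssusrinfo command shown at the end of this page:

df -h /dss/dssfs04   # capacity and current usage of the AI Systems DSS file system
ls /dss/dssfs04      # list the DSS containers visible to your account (depending on permissions)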

Linux Cluster DSS

The Master User can request additional storage of up to 10 TB. To do so, they must first request the activation of the project for the Linux Cluster via the LRZ Servicedesk using the following form, and then submit a separate request for explicit Linux Cluster DSS storage via this form.

The Linux Cluster DSS is primarily intended for long-term data storage and general-purpose workloads. Neither the home directory nor the Linux Cluster DSS is designed for high-intensity AI workloads or large-scale data operations.

Private DSS

As part of a joint project offering, dedicated DSS systems can be purchased, deployed, and operated exclusively for a private group of users. For more information see here and here.

The GPFS Distributed File System

The AI Systems at LRZ use the IBM General Parallel File System (GPFS) as their main storage backend. GPFS is a high-performance distributed file system designed for large-scale HPC environments. Unlike local file systems, which operate on a single disk or node, GPFS spreads data and metadata across many servers and disks, enabling parallel access by thousands of compute nodes.

Latency and I/O Patterns

Tasks that are instant on local disks, like creating, moving, or deleting many small files, can be slow on GPFS because metadata and file data are managed across multiple servers. For best performance, use fewer but large files and sequential access instead of many small random reads or writes.
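
As a rough illustration, the two access patterns can be compared with standard shell tools; the file counts and sizes below are arbitrary and only serve to make the difference visible:

# Many small files: each create is a metadata operation handled by remote
# metadata servers, so this is much slower on GPFS than on a local disk.
time ( mkdir -p many_small && for i in $(seq 1 1000); do echo data > many_small/file_$i; done )

# One large file written sequentially in big blocks: the GPFS-friendly pattern.
time dd if=/dev/zero of=one_large.bin bs=1M count=1024

# Clean up the test data afterwards.
rm -rf many_small one_large.bin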

Metadata and Inodes

Each file or directory is represented by an inode that stores ownership, permissions, and disk location. In distributed systems, inode management adds overhead. Large numbers of files can quickly cause performance drops or reach inode limits.
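
Inode usage and limits of a mounted file system can be checked with df; per-container quotas are reported by the dssusrinfo command shown at the end of this page:

df -i /dss/dssfs04   # inode usage and limits on the AI Systems DSS file system
df -i $HOME          # the same for the file system hosting your home directory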

Machine Learning Datasets

Machine learning datasets often consist of millions of small files (e.g., individual images). This pattern leads to heavy metadata (inode) traffic and slow data access. To improve efficiency:

  • Pack data into archive formats (e.g., .tar, .zip, HDF5, or TFRecord), as shown in the example after this list.

  • Avoid frequent directory scans (ls, find, stat) on large folders.
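
For example, a directory of small image files can be packed into a single tar archive once and then read sequentially during training; the directory and archive names below are placeholders:

tar -cf train_images.tar train_images/   # pack the image directory into one archive
tar -tf train_images.tar | head          # check the archive contents without extracting it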

Additional Information

Additional information can be found at File Systems and IO on Linux-Cluster and Data Science Storage.

Use the following command on the login nodes to get a storage utilization overview of all individually accessible DSS containers:

dssusrinfo all