3. Storage
The LRZ AI Systems integrate the full offering of the LRZ Data Science Storage (DSS) system.
Storage Options
| Storage Pool | Use Case | Top-level Directory | Size Limit | Automated Backup | Expiration |
| --- | --- | --- | --- | --- | --- |
| Home Directory | Small, critical files such as code, scripts, and configs that need regular backups. | /dss/dsshome1/ | 100 GB | Yes, backup to tape and file system snapshots | Lifetime of LRZ project |
| AI Systems DSS | High-bandwidth, low-latency storage for I/O-intensive reading and writing. | /dss/dssfs04 | Up to 4 TB; more than 4 TB at additional cost | No for free tier / Yes for paid option | Until further notice |
| Linux Cluster DSS | General-purpose, long-term storage. | /dss/dssfs02 | Up to 10 TB | No for free tier / Yes for paid option | Lifetime of the data project |
| Private DSS | System Owner Defined | /dss/dsslegfs01 | System Owner Defined | System Owner Defined | System Owner Defined |
Home Directory
Home directories can be accessed via SSH on the login nodes (login.ai.lrz.de) or through the web frontend at https://login.ai.lrz.de. When logging in via terminal, you are placed directly in your home directory. By typing `pwd` immediately after login, you can verify this and see the full path to your current (home) directory.
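As a minimal example (replace `<lrz-userid>` with your own account name):

```bash
# Log in to a login node of the LRZ AI Systems
ssh <lrz-userid>@login.ai.lrz.de

# Print the current working directory; after login it should be your
# home directory under /dss/dsshome1/
pwd
```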
The LRZ AI Systems share a unified home directory with the LRZ Linux Cluster (see File Systems and IO on Linux-Cluster). Home directories are hosted in a dedicated DSS container managed by LRZ. They are limited in both capacity and I/O performance (bandwidth and latency) and are therefore not suitable for high-intensity AI workloads or large-scale data operations.
The home directory should primarily be used to store code, configuration files, and other lightweight data. All home directories are regularly backed up to ensure data integrity and security.
AI Systems DSS
A dedicated AI Systems DSS provides high-performance, SSD-based network storage designed for demanding AI workloads. It is optimized for high-bandwidth, low-latency I/O operations to support the data-intensive requirements of modern AI applications. In contrast to the home directory, the AI Systems DSS is the appropriate location for high-intensity AI workloads and large-scale data operations.
Access to this storage is granted upon request by the Master User of an LRZ project via the LRZ Servicedesk using the following form.
A quota of up to 4 TB, 8 million files, and a maximum of 3 DSS containers (typically a single container) can be allocated.
Once provisioned, the Master User assumes the role of DSS Data Curator and is responsible for managing the assigned storage quotas within their project.
Additional AI Systems DSS storage (4 TB or more) can be requested as part of the DSS on demand offering, which is subject to additional costs.
Linux Cluster DSS
The Master User can request additional storage of up to 10 TB. To do so, they must first request the activation of the project for the Linux Cluster via the LRZ Servicedesk using the following form, and then submit a separate request for explicit Linux Cluster DSS storage via this form.
The Linux Cluster DSS is primarily intended for long-term data storage and general-purpose workloads. Neither the home directory nor the Linux Cluster DSS is designed for high-intensity AI workloads or large-scale data operations.
Private DSS
As part of a joint project offering, dedicated DSS systems can be purchased, deployed, and operated exclusively for a private group of users. For more information see here and here.
The GPFS Distributed File System
The AI Systems at LRZ use the IBM General Parallel File System (GPFS) as their main storage backend. GPFS is a high-performance distributed file system designed for large-scale HPC environments. Unlike local file systems, which operate on a single disk or node, GPFS spreads data and metadata across many servers and disks, enabling parallel access by thousands of compute nodes.
Latency and I/O Patterns
Tasks that are instant on local disks, such as creating, moving, or deleting many small files, can be slow on GPFS because metadata and file data are managed across multiple servers. For best performance, prefer fewer, larger files and sequential access over many small random reads or writes.
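As a minimal sketch of the two access patterns (the file and directory names below are placeholders):

```bash
# Metadata-heavy pattern: one operation per small file (slow on GPFS)
# for f in dataset/images/*.png; do cp "$f" target_dir/; done

# Preferred pattern: a single large file, read and written sequentially
cp dataset/images.tar target_dir/
```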
Metadata and Inodes
Each file or directory is represented by an inode that stores ownership, permissions, and disk location. In distributed systems, inode management adds overhead. Large numbers of files can quickly cause performance drops or reach inode limits.
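If needed, an occasional one-off count can help spot directories that are approaching inode limits; the path below is a placeholder, and such scans should not be run frequently:

```bash
# One-off file (inode) count under a directory; find itself generates
# heavy metadata traffic on GPFS, so run it sparingly
find /path/to/your/data -type f | wc -l
```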
Machine Learning Datasets
Machine learning datasets often consist of millions of small files (e.g., individual images). This pattern leads to heavy metadata (inode) traffic and slow data access. To improve efficiency:
- Pack data into archive formats (e.g., .tar, .zip, HDF5, or TFRecord); see the sketch after this list.
- Avoid frequent directory scans (`ls`, `find`, `stat`) on large directories.
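As a sketch of the first point, many small files can be packed into a single tar archive and inspected without unpacking it; the file and directory names are placeholders:

```bash
# Pack a directory of many small image files into one archive
tar -cf train_images.tar train_images/

# List the archive contents without extracting it, so the many small
# files are not recreated on GPFS
tar -tf train_images.tar | head
```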
Additional Information
Additional information can be found at File Systems and IO on Linux-Cluster and Data Science Storage.
Use the following command on the login nodes to get a storage utilization overview of all individually accessible DSS containers:
`dssusrinfo all`