File Systems and IO on Linux-Cluster

This document gives an overview of background storage systems available on the LRZ Linux Cluster. Usage, special tools and policies are discussed.

Disk resources and file system layout

The following table gives an overview of the available file system resources on the Linux Clusters.

Recommendation: LRZ has defined an environment variable $SCRATCH which should be used as a base path for reading/writing large scratch files. Since the target of $SCRATCH may change over time, it is recommended to use this variable instead of hard-coded paths.
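
The following is a minimal sketch of how a job script might derive its working directory from $SCRATCH rather than from a hard-coded path. The helper name scratch_workdir and the jobs/<name> sub-directory layout are illustrative assumptions, not LRZ conventions.

```shell
# Hypothetical helper: create and print a per-job working directory below
# a scratch base path. The base defaults to $SCRATCH; nothing is hard-coded.
scratch_workdir() (
    job_name=$1
    base=${2:-${SCRATCH:-}}
    if [ -z "$base" ]; then
        echo "scratch_workdir: SCRATCH is not set" >&2
        exit 1
    fi
    dir="$base/jobs/$job_name"   # assumed layout, purely illustrative
    mkdir -p "$dir" && printf '%s\n' "$dir"
)
```

A job script could then call, for example, `workdir=$(scratch_workdir my_run) && cd "$workdir"`. If the target of $SCRATCH changes, the script keeps working without modification.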

Purpose	Segment of the Linux Cluster	File system type and full name	How the user should access the files	Space Available	Approx. aggregated bandwidth	Backup by LRZ	Lifetime and deletion strategy. Remarks
Globally accessible Home and Project Directories
User's Home Directories	all	GPFS /dss/dsshome1/lxc##/<user>	$HOME	100 GByte by default per user	up to a few GB/s	YES, backup to tape and file system snapshots	Expiration of LRZ project. DSS quotas apply.
Data Science Storage (DSS) long-term storage	all	GPFS (see the DSS long-term storage section below for details)	Interactively run `dssusrinfo all`, select an appropriate directory, and define your own variable (e.g. WORK) in your shell profile (e.g. ~/.profile or ~/.bashrc)	up to 10 TByte without additional cost	up to a few GB/s	NO	Not automatically available after Linux Cluster activation; storage application and storage modification are handled via service requests. Expiration of data project.
Temporary/scratch File Systems
(Legacy) Scratch file system	all, except CoolMUC-4 and Teramem	GPFS /gpfs/scratch/<group>/<user>	$SCRATCH	1,400 TByte	up to ~30 GB/s on CoolMUC-2 (aggregate); up to ~8 GB/s on CoolMUC-3 (aggregate)	NO	Sliding window file deletion. No guarantee for data integrity.
(New) Scratch file system	all, except CoolMUC-2 compute nodes	GPFS /dss/lxclscratch/##/<user>	$SCRATCH_DSS	3,100 TByte	up to ~60 GB/s (aggregate); up to ~8 GB/s on CoolMUC-3 (aggregate)	NO	Sliding window file deletion. No guarantee for data integrity.
Node-local File Systems (please do not use!)
Node-local temporary user data	all	local disks, if available: /tmp		8-200 GByte	approx. 30 MB/s on nodes with local disks	NO	Compute nodes: job duration only; files should be deleted by the user's job script at the end of the job. Login nodes: files are removed if necessary.

Backup and Archiving

User's responsibility for saving important data

With (parallel) file systems of several hundred terabytes (DSS, $SCRATCH), it is technically impossible (or too expensive) to back up these data automatically. Although the disks are protected by RAID mechanisms, other severe incidents might still destroy the data. In most cases, however, it is users themselves who accidentally delete or overwrite files. It is therefore the user's responsibility to transfer data to safer places (e.g. $HOME) and to archive them to tape. Because dumping and restoring data requires long off-line times, LRZ might not be able to recover data after any kind of outage or inconsistency of the scratch or DSS file systems. A specified file system lifetime that lasts until the end of your project must not be mistaken for an indication that the data stored there are safe!

LRZ had to discontinue the old Linux Cluster tape archive because of security concerns.

Snapshots

The data in your home directories is protected by nightly file system snapshots, which are kept for at most 7 days. To access these snapshots, look into the directory /dss/dsshome1/.snapshots/. In this directory you will find the individual snapshots as subdirectories; each directory name encodes the date and time at which the snapshot was taken as YYYY-MM-DD_HHMM. To restore files, simply copy them back to your $HOME directory.
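
As a rough sketch, the most recent snapshot can be picked by sorting on the YYYY-MM-DD_HHMM stamp, since that format sorts chronologically when compared as text. The function name latest_snapshot is hypothetical, and the code only assumes that each snapshot directory name contains such a stamp somewhere.

```shell
# Print the name of the most recent snapshot below a snapshot root.
# Relies on the YYYY-MM-DD_HHMM stamp contained in each directory name.
latest_snapshot() (
    root=${1:-/dss/dsshome1/.snapshots}
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        # Extract the time stamp; names without a stamp are skipped.
        stamp=$(printf '%s\n' "$name" |
            sed -n 's/.*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}_[0-9]\{4\}\).*/\1/p')
        [ -n "$stamp" ] && printf '%s %s\n' "$stamp" "$name"
    done | sort | tail -n 1 | cut -d ' ' -f 2-
)
```

The layout below each snapshot directory is not described here, so it is worth inspecting the directory printed by `latest_snapshot` with ls before copying anything back with cp.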

Details on the usage and on the configuration of the file systems

DSS long-term storage

LRZ uses Data Science Storage (DSS) based systems for the purpose of long-term data storage. In conjunction with this, LRZ has transferred management rights and obligations for these storage areas to the data curator, an additional role typically taken on by the master user of your project. For projects that use basic DSS storage services on the cluster, LRZ retains certain management rights to be able to provide these services.

In order to use DSS storage on the Linux-Cluster, the following steps need to be performed:

1. On any cluster login node, issue the command
   dssusrinfo all
   This will list paths to accessible containers, as well as quota information etc. If no such container exists, please continue with step 2; otherwise, go to step 5.
2. Please verify in the LRZ IDM-Portal section "Self Services | Person | view" that your user data contain a valid e-mail address, either for an LRZ mail service on a personal account, or as contact e-mail address. Otherwise, please ask your Master User to register a contact e-mail address for you in IDM-Portal.
3. Open a ticket with the LRZ Service Desk against the service "High Performance Computing → Linux Cluster" with a request to set up a DSS storage area for the project your cluster account belongs to, and the required capacity (at most 10 TBytes).
4. If your request is granted, and a new DSS area is created, you will receive an e-mail to the address specified above. Please reply appropriately to it to activate your DSS share.
5. Edit your shell profile and set the PROJECT and/or WORK variable to a suitable path value based on the above output, typically one of the DSS paths with your account name appended to it. These settings can subsequently be used in any login shell or batch script.
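
The last step above might look roughly like the following profile excerpt. The helper name set_work_dir is hypothetical, and the container path must be copied from your own dssusrinfo output; the sketch only assumes that a sub-directory named after your account exists inside the container.

```shell
# Excerpt for ~/.bashrc or ~/.profile (sketch).
# set_work_dir is a hypothetical helper: given a container path taken from
# the `dssusrinfo all` output, it exports WORK as <container>/<account>.
set_work_dir() {
    # $1: container path, $2: account name (defaults to the current user)
    if [ -d "$1/${2:-$(id -un)}" ]; then
        WORK="$1/${2:-$(id -un)}"
        export WORK
    else
        echo "set_work_dir: $1/${2:-$(id -un)} does not exist" >&2
        return 1
    fi
}

# Usage: replace the placeholder with a path from your dssusrinfo output.
# set_work_dir "<container-path-from-dssusrinfo>"
```

Checking that the directory exists before exporting the variable avoids batch scripts silently writing to a mistyped path.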

Notes:

The DSS long-term storage is not automatically available after Linux Cluster activation. The master user can apply for the storage via a predefined Service Request Template.
If the current capacity is smaller than 10 TByte, the data curator can ask for a quota increase up to the maximum value via a predefined Service Request Template.
For larger capacities and/or containers that are automatically backed up on a regular basis (which cannot be provided free of cost), you need to contact your master user to ask LRZ for a quote.
If you are involved in multiple projects, the dssusrinfo output may refer to more than one DSS container. It is your responsibility to store data where they belong and to perform the necessary bookkeeping.
Depending on the system used and the usage pattern, it may be appropriate to stage in/out data to/from the SCRATCH file system before/after performing large scale processing. It is permissible to perform the necessary copy or rsync operations on the cluster login nodes.
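
A staging step of this kind could be sketched as follows. The function name stage_dir is hypothetical; the fallback to cp -a when rsync is not installed is a convenience of this sketch, not an LRZ recommendation.

```shell
# Copy the contents of a directory tree from src to dst, preserving
# timestamps and permissions. Uses rsync when available, cp -a otherwise.
stage_dir() (
    src=$1
    dst=$2
    mkdir -p "$dst" || exit 1
    if command -v rsync >/dev/null 2>&1; then
        rsync -a "$src"/ "$dst"/
    else
        cp -a "$src"/. "$dst"/
    fi
)
```

For example, `stage_dir "$WORK/input" "$SCRATCH/input"` before a job and `stage_dir "$SCRATCH/output" "$WORK/output"` afterwards, assuming WORK has been set as described above. Both variants preserve modification times, so the remark on tar -x in the section on file deletion presumably applies to data staged into scratch as well.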

Metadata on SCRATCH and DSS directories

While for both scratch and project directories the metadata performance (i.e., performance for generating, accessing and deleting directories and files) is improved compared to previously used technologies, the capacity for metadata (e.g., number of file entries in a directory) is limited. Therefore, please do not generate extremely large numbers of very small files in these areas; instead, try to aggregate into larger files and write data into these e.g. via direct access. Violation of this rule can lead to LRZ blocking your access to the $SCRATCH or DSS area since otherwise user operation on the cluster may be obstructed. Please also note that there exists a per-directory limit for storing i-node metadata (directory entries and file names); this limits the number of files which can be put into a single directory.
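
One simple way to keep file counts low is to bundle finished small files into a single archive. The sketch below uses tar as a stand-in for this; writing directly into larger files from within the application, as suggested above, is application-specific and not shown. The function names are hypothetical.

```shell
# Bundle a whole directory into a single tar archive so that the file
# system stores one large file instead of many small ones.
pack_dir() (
    src=$1
    archive=$2
    tar -cf "$archive" -C "$(dirname "$src")" "$(basename "$src")"
)

# Count the regular files stored in an archive
# (directory entries in the listing end with "/").
count_archive_files() (
    tar -tf "$1" | grep -vc '/$'
)
```

After verifying the archive, for instance by comparing `count_archive_files` with the number of original files, the small files can be removed from the scratch or DSS area.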

File deletion strategies and data integrity issues

To prevent overflow of the large scale storage areas, LRZ has implemented various deletion strategies. Please note that

for a given file or directory, the exact time of deletion is unpredictable!
the normal tar -x command preserves the modification time of the original file instead of setting the time at which the archive is unpacked, so unpacked files may become some of the first candidates for deletion. Use tar -mx if required, or run touch on a single file, or
```
find mydir -exec touch {} \;
```
on a directory tree mydir.
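
As an illustration of the tar -m variant mentioned above, a small wrapper might look like this. The function name unpack_fresh is hypothetical.

```shell
# Unpack an archive so that extracted files get the current time as their
# modification time (-m) rather than the time stored in the archive.
unpack_fresh() (
    archive=$1
    dest=$2
    mkdir -p "$dest" && tar -xmf "$archive" -C "$dest"
)
```

Called as `unpack_fresh data.tar "$SCRATCH/data"`, this makes the unpacked files appear new to any age-based clean-up that looks at modification times.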

Due to the deletion strategies described in the subsections below, but also due to the fact that LRZ cannot guarantee the same level of data integrity for the high performance file system as compared to e.g., $HOME, LRZ urges you to copy, transfer or archive your files from temporary disks as well as from the DSS areas to safe storage/tape areas!

High watermark deletion: When the fill level of the file system exceeds a certain limit (typically between 80% and 90%), files are deleted, starting with the oldest and largest ones, until a fill level of between 60% and 75% is reached. The precise values may vary.
Sliding window file deletion: Any files and directories older than typically 30 days (the interval may be shortened if the fill-up rate becomes very high) are removed from the disk area. This deletion mechanism is invoked once a day.
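
To get a rough idea of which files may be affected by the sliding window, one can list files by age. The sketch below assumes that the modification time is the relevant time stamp, which the remark on tar -x suggests but which is not stated explicitly; the function name list_stale_files is hypothetical.

```shell
# List regular files below a directory whose modification time lies more
# than a given number of days in the past (default: 30).
# Note: find counts whole 24-hour periods, so "+30" matches files that
# are at least 31 days old.
list_stale_files() (
    dir=$1
    days=${2:-30}
    find "$dir" -type f -mtime +"$days"
)
```

Running `list_stale_files "$SCRATCH" 20` gives some lead time before files reach the typical 30-day limit. Since the interval may be shortened, this is only an approximation of what the clean-up will actually remove.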

World-Wide data access and transfer

For easy access and transfer of data to/from the LRZ Linux Cluster DSS-based file systems, HOME and the (new) scratch file system, you can use the Globus Research Data Management Portal. This allows you to transfer data worldwide, using a protocol that is optimised for high-speed transfer via wide area networks (WAN).

For details on how to use Globus Online, check out this documentation.

Please log in to Globus using your LRZ Linux Cluster user ID (search for "Leibniz Rechenzentrum" in the list of available institutions), and use the Globus collection "Leibniz Supercomputing Centre's DSS - CILogon" to access the data.