File Systems of SuperMUC-NG

Technology

SuperMUC-NG uses Lenovo DSS-G building blocks with IBM Spectrum Scale (formerly GPFS) for its storage. These building blocks serve both the long-term storage and the high-performance parallel file systems.

File System Characteristics

Home
  Purpose:             Storage for users' sources, input data, and small and important result files.
                       Globally accessible from login and compute nodes.
  Total capacity:      256 TiB
  Aggregate bandwidth: ~25 GiB/s (SSD tier), ~6 GiB/s (HDD tier)

Work
  Purpose:             Large datasets that need to be kept on disk medium or long term.
                       Globally accessible from login and compute nodes.
  Total capacity:      34 PiB
  Aggregate bandwidth: ~300 GiB/s

Scratch
  Purpose:             Temporary storage for large datasets (usually restart files, files to be
                       pre-/postprocessed). Globally accessible from login and compute nodes.
  Total capacity:      16 PiB
  Aggregate bandwidth: ~200 GiB/s

DSS
  Purpose:             Data Science Storage. Long-term near-line storage for project purposes and/or the
                       science community. World-wide access/transfer of this data via high-performance,
                       WAN-optimized transfer protocols, using a simple graphical user interface in the
                       web browser. Share data in the style of LRZ Sync+Share, Dropbox, or Google Drive.
  Total capacity:      20 PiB
  Aggregate bandwidth: ~70 GiB/s

DSA
  Purpose:             Data Science Archive. Long-term offline storage for project purposes and/or the
                       science community. World-wide access/transfer of this data via high-performance,
                       WAN-optimized transfer protocols, using a simple graphical user interface in the
                       web browser.
  Total capacity:      260 PiB
  Aggregate bandwidth: ~10 GiB/s

Node-local
  Purpose:             /tmp on login and compute nodes. Resides in memory on compute nodes. Locally
                       accessible only. Please do not use paths to this area explicitly (e.g. in scripts);
                       $TMPDIR (see below) can be used and is automatically set to an appropriate value.
  Total capacity:      Small. A completely filled /tmp makes the node unusable.
  Aggregate bandwidth: varies

File system access and policies

Upon login to the system and inside batch jobs, the environment module tempdir is loaded; it supplies the necessary variable settings for all file systems except HOME.
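
A quick way to inspect what this module provides in your current session (the exact output format depends on the installed modules tool):

module show tempdir          # display the variable settings supplied by the module
echo $SCRATCH $TMPDIR        # these variables should point to the areas listed below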

Home
  Environment variable:  $HOME
  Path pattern:          /dss/home/<hash>/<user>
  Quota:                 100 GB/user
  Lifetime of data:      Expiration of all projects the account is associated with
  Data safety/integrity: Nightly snapshots, kept for the last 7 days.
                         Replication to secondary storage plus daily backup to tape.

Work
  Environment variable:  $WORK_<project>
  Path pattern:          $WORK_<project>
  Quota:                 In accordance with the project grant (1). Project-level quota only.
  Lifetime of data:      End of the specified project
  Data safety/integrity: None. See the section below on archiving important data.

Scratch
  Environment variable:  $SCRATCH
  Path pattern:          /hppfs/scratch/<hash>/<user>
  Quota:                 1 PB/user (safety measure)
  Lifetime of data:      Usually 3-4 weeks. Execution of the deletion procedure depends on the
                         file system filling.
  Data safety/integrity: None. See the section below on archiving important data.

DSS
  Environment variable:  -
  Path pattern:          /dss/dssfs0[23]/<data-project>/<container>
  Quota:                 Per data-project and container (1)
  Lifetime of data:      End of the data project
  Data safety/integrity: Per-container policy regarding backup to the tape archive:
                         NONE, BACKUP_WEEKLY, BACKUP_DAILY (costs may arise for the user!)

DSA
  Environment variable:  -
  Path pattern:          /dss/dsafs01/<bucket>/<container>
  Quota:                 Limit on the number of files per data-project and container (1)
  Lifetime of data:      End of the data project
  Data safety/integrity: Data replicated on two tapes at two different sites. Metadata backed up daily.

temporary
  Environment variable:  $TMPDIR
  Path pattern:          Depends on the availability of file systems; usually a subfolder of SCRATCH.
                         /tmp is only used as a last resort.
  Quota:                 Depends on the target file system
  Lifetime of data:      Depends on the target file system
  Data safety/integrity: Depends on the target file system

(1) The supplied value can be increased upon request. Please contact the Service Desk.

File system usage

WORK and SCRATCH usage

With great power comes great responsibility! WORK and SCRATCH have a rather large block size of 16 MB, which is necessary for efficient I/O on file systems of petabyte scale. This in turn means that many small files represent an inefficient use of such file systems.
Please adapt your workflows accordingly (bundle your small files and directory hierarchies, e.g. using mpifileutils); a minimal sketch is shown below.
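
A minimal sketch of bundling, assuming a directory full of small result files (all paths are illustrative); standard tar is used here, mpifileutils provides parallel equivalents:

# pack many small files into one archive on WORK
tar -cf $WORK/RESULTS/small_files.tar -C $SCRATCH/myrun ./small_files
# later, unpack where needed (see also the note on tar -xm further below)
tar -xf $WORK/RESULTS/small_files.tar -C $SCRATCH/restore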

While WORK has a quota, SCRATCH currently has none: we explicitly do not want to limit users' ambitions. This in turn requires some understanding and discipline from the users.
If you temporarily need unprecedented resources, such as more than a petabyte of SCRATCH or more than 10 million inodes (please count both files and directories!), possibly deviating from the intentions stated in your project application, where we try to filter out inappropriate workflows at an early stage, please inform us via the Service Desk. Specifically, the number of available inodes on SCRATCH is necessarily limited for performance reasons; exceeding it will interrupt system operation. A simple way to check your own consumption is sketched below.
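
A minimal sketch for checking your own space and inode consumption (the directory name is a placeholder; both commands may take a while on large directory trees):

du -sh $SCRATCH/my_project_data          # total space used below this directory
find $SCRATCH/my_project_data | wc -l    # number of inodes (files plus directories)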

Although there is a sliding deletion policy, please clean up to a reasonable level as soon as possible, especially if you temporarily occupied a lot of resources. SCRATCH is there for all users of the system! Please respect this and try to keep your resource consumption reasonably small.

User's responsibility for saving important data

With (parallel) file systems of several tens of petabytes, it is technically impossible (or too expensive) to back up these data automatically. Although the disks are protected by RAID mechanisms, other severe incidents might destroy the data. In most cases, however, it is the users themselves who accidentally delete or overwrite files. It is therefore the user's responsibility to transfer data to safer/secondary places and/or to archive them to tape. Due to the long offline times for dumping and restoring data, LRZ may not be able to recover data after any kind of outage or inconsistency of the SCRATCH or WORK file systems. The alias name WORK and the intended storage period until the end of your project must not be mistaken as an indication of data safety!

There is no automatic backup for SCRATCH and WORK. Besides automatic deletion, severe technical problems might destroy your data. It is your obligation to copy, transfer, or archive the files you want to keep!

Data after the end of project

Data will be deleted one year after the end of the project. However, for data in DSS, DSA, and the legacy archive, the project manager can request that the project be converted into a data-only project in order to retain access to the archived data. The project manager is also warned by email after the project end that the data will be deleted.

Dos and don'ts, best practices, and notes on optimizations

The WORK and SCRATCH systems are tuned for high bandwidth, but they are not optimal for handling large quantities of small files located in a single directory with parallel accesses. In particular, generating more than about 1000 files per directory at approximately the same time, from either a parallel program or from simultaneously running jobs, will probably cause your application(s) to experience I/O errors (due to timeouts) and crashes. If you require this usage pattern, please generate a directory hierarchy with at most a few hundred files per subdirectory, for example as sketched below.
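
A minimal sketch of such a hierarchy inside a Slurm job, assuming one output file per task (the application name and target directory are placeholders):

RANK=${SLURM_PROCID:-0}                             # task index, set by srun for each task
SUBDIR=$WORK/RESULTS/run42/part_$(( RANK / 256 ))   # at most 256 files per subdirectory
mkdir -p "$SUBDIR"
./my_solver --output "$SUBDIR/out_rank${RANK}.dat"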

Temporary filesystem 

Please use the environment variable $SCRATCH to access the temporary file system; this variable points to the location where the underlying file system delivers optimal I/O performance. Do not use /tmp for storing temporary files! The in-memory file system where /tmp resides is very small, and files there are regularly deleted by automatic procedures or by the system administrators. A per-job pattern is sketched below.
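
A minimal sketch of a per-job temporary directory on SCRATCH (assuming a Slurm batch job, where $SLURM_JOB_ID is set automatically):

MYTMP=$SCRATCH/tmp_$SLURM_JOB_ID    # unique directory for this job
mkdir -p "$MYTMP"
# ... run your application with its temporary files placed in $MYTMP ...
rm -rf "$MYTMP"                     # clean up before the job ends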

Coping with high watermark deletion in $SCRATCH

The high watermark deletion mechanism may remove files which are only a few days old if the file system is used heavily. In order to cope with this situation, please note:

  • The normal tar -x command preserves the modification time stored in the archive, not the time when the archive was unpacked. Therefore, files unpacked from an older archive are among the first candidates to be deleted. To prevent this, use tar -xm to unpack your files, which gives them the current date (see the example after this list).
  • Please use the Backup and Archive system on SuperMUC-NG to archive/retrieve files from/to SCRATCH to/from the tape archive.
  • Please always use $WORK or $SCRATCH for files which are considerably larger than 1 GB.
  • Please remove any files which are not needed any more as soon as possible. The high watermark deletion procedure is then less likely to be triggered.
  • More information about the filling of the file systems and about the oldest files will be made available on a web site in the near future.
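
A minimal example of unpacking with fresh timestamps (archive and target paths are illustrative):

cd $SCRATCH/restart_data
tar -xmf $WORK/archives/run42.tar    # -m sets the modification time to the extraction time,
                                     # so the files do not look old to the deletion procedure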

Selecting the $WORK directory

Each project on SuperMUC-NG has a separate WORK directory with a shared quota for all users of that project. Users can select a specific WORK directory via the appropriate project ID, e.g.,

export WORK=$WORK_<project>     # in scripts, or set it in your .profile

A colon-separated list of all WORK directories a user has access to is stored in the environment variable $WORK_LIST:

echo $WORK_LIST
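
A minimal sketch for listing each entry on its own line (plain shell, no additional tools assumed):

echo "$WORK_LIST" | tr ':' '\n'    # one WORK directory per line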

Sharing files with other users

Backup and Archive

Transferring files from/to other systems

We provide several options to move data from/to SuperMUC-NG. All of them have in common that the IP address of the remote machine must first be enabled in the SuperMUC-NG firewall. A typical transfer is sketched below.
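
A minimal sketch of such a transfer with rsync, assuming the remote machine's IP address has already been enabled in the firewall (hostname and paths are placeholders):

# pull data from the remote machine into your WORK area
rsync -av user@remote.example.org:/data/run42/ $WORK/run42/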

Quotas and Access

Besides the quota for the volume of data, there are also quotas for the number of files. The parallel file systems are laid out for large output files. If you have a large number of small files and reach your limits, pack them into tar archives.

To display your quota, use the following commands, since the usual "quota" command does not work on the high-performance parallel file systems:

  • budget_and_quota  or   fullquota

For information about the accessible DSS file systems and containers, use the following command on a login node. Do not use it in a batch job, since it may block:

  • dssusrinfo all

Parallel copy and rsync

Sometimes it is necessary to copy or synchronize large amounts (terabytes) of data, for example from SCRATCH to WORK. Hint: use msrsync, prsync, or pexec to distribute the work over more than one process or over many cores.

Examples:

module load lrztools

# use 96 tasks on one node
msrsync -p 96 $SCRATCH/mydata $WORK/RESULTS/Experiment1

# use all processes within a parallel job:
# generate the commands, create the directory structure, then copy the data
prsync -f $SCRATCH/mydata -t $WORK/RESULTS/Experiment1
source $HOME/.lrz_parallel_rsync/MKDIR
mpiexec -n 256 pexec $HOME/.lrz_parallel_rsync/RSYNCS

# execute many copies in parallel; copylist contains one copy command per line
cat copylist
cp -r $SCRATCH/mydata/Exp1 $WORK/RESULTS
cp -r $SCRATCH/mydata/Exp2 $WORK/RESULTS
...
cp -r $SCRATCH/mydata/Exp2000 $WORK/RESULTS
mpiexec -n 256 pexec copylist

Conversion of a SuperMUC project into a Data-Only Project (after project end)

Data in the tape archive will be deleted one year after the project end if the project is not converted into a data-only project. However, the project manager can request that the project be converted into a data-only project in order to retain access to the archived data. The project manager is warned by email after the project end that the data will be deleted.

On request, it is possible to convert a SuperMUC project into a Data-Only project. Within such a Data-Only project, the project manager can retain and access the data archived on tape, thus using the tape archive as safe and reliable long-term storage for the data generated by a SuperMUC project.

Data can then be accessed via the gateway node "tsmgw.abs.lrz.de", using the SuperMUC username and password of the project manager. Access to the server is possible via SSH with no restrictions on the IP address. However, access to SuperMUC itself is not possible after the end of a project. Currently, the server is equipped with 37 TB of local disk storage (/tsmtrans) to buffer the data retrieved from tape. There is a directory /tsmtrans/<username> where you can store the data and transfer them via scp, for example as sketched below.
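
A minimal sketch of pulling buffered data from the gateway node to a local machine (username and paths are placeholders):

# run on your local machine; copies data staged in /tsmtrans on the gateway node
scp -r <username>@tsmgw.abs.lrz.de:/tsmtrans/<username>/mydata ./mydata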

The project manager can access all data of the project that are stored in the tape archive. Note that the password for accessing the tape archive (TSM node) is not stored on the gateway node; it must be set and remembered by the project manager.

  • When a SuperMUC project ends, the project manager will receive a reminder email explaining the steps necessary to convert the project.

Further information