SuperMUC-NG Archive Usage (TSM-based)

Working with the IBM Spectrum Protect based archive of SuperMUC

Archiving means saving the current version of a file to tape. Several versions of the same file can be kept in the tape archive; to restore a specific one, you must differentiate the versions by date and time or by an additional description.

In order to archive and retrieve data on SuperMUC/SuperMIG or the Linux Cluster, the IBM Spectrum Protect (ISP) based LRZ Backup and Archive Infrastructure is used. A system-wide ISP client configuration is available, so you do not need to install or configure an ISP client yourself.
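As a quick check that the preconfigured client can reach the archive server for your project, you can query the session information (a suggestion, not a required step; replace prXXXX by your project ID):

$> dsmc query session -se=prXXXX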

General Information

The SuperMUC Archive is based on offline tape storage. If archiving or retrieving does not start promptly, this is most probably NOT a problem: usually there are simply no free tape drives available at the moment. If you encounter such a situation, please be patient and wait. Only 15 tape drives are available for SuperMUC, and under heavy load from multiple users it can take several hours until your archive job gets a free tape drive. Avoid cancelling and resubmitting your archive command. If you feel that there really is a problem, please open a support ticket at the Servicedesk for SuperMUC-NG.

Archiving data with IBM Spectrum Protect

Let's assume you work for project prXXXX and have a file myFile stored on the temporary filesystem in a location (directory) denoted by some self-defined environment variable $MY_SCRDIR. Since myFile may be automatically removed from $MY_SCRDIR by high-watermark deletion after some days, you might want to have an archive copy at hand. To create one, go to $MY_SCRDIR and invoke one of the commands:

$> dsmc archive -se=prXXXX myFile

$> dsmc archive -se=prXXXX -description="V.1.2" myFile

We recommend keeping logs of all archive commands in a specific directory, e.g.

$> dsmc archive -se=prXXXX myFile >$HOME/mytapelogs/archived_on_YYYY_MM_DD_hh_mm

This might later help to avoid confusion with the file namespace (see below).

In case a file name contains spaces, you have to enclose it in double quotes, e.g., "my file with spaces" (see the example after the next two commands). If you want to archive several files myFile1, myFile2, ... you can use wildcards or specify the file names:

$> dsmc archive -se=prXXXX myFile*

$> dsmc archive -se=prXXXX myFile1 myFile2 myFile3
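As mentioned above, a file name containing spaces must be enclosed in double quotes, for example:

$> dsmc archive -se=prXXXX "my file with spaces"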

You can also archive complete directory trees. This can be achieved by using an additional command-line option:

$> dsmc archive -se=prXXXX -subdir=yes MyDirectory/

The trailing slash tells dsmc to interpret MyDirectory/ as a directory.

If you have lots of data to archive, you may want to have a look at the Optimal Usage of the Archive section below.

Retrieving data with IBM Spectrum Protect

You can search for archived files in a subdirectory $MY_SCRDIR of any file system by issuing the command

$> dsmc query archive -se=prXXXX -subdir=yes $MY_SCRDIR/

Again, the slash after $MY_SCRDIR is important to tell dsmc that it is a directory. A file can be retrieved with one of the commands

$> dsmc retrieve -se=prXXXX $MY_SCRDIR/myFile $MY_SCRDIR/myNewFileName

$> dsmc retrieve -se=prXXXX -description="V.1.2" myFile myNewFileName

If you omit the second file argument, the file will be restored under its original name. Of course, you can also retrieve complete directory trees.

$> dsmc retrieve -se=prXXXX MyDirectory/ RetrievedDirectory/ -subdir=yes

This will restore the data in MyDirectory/ to RetrievedDirectory/. Again, directory or file names containing spaces have to be enclosed in double quotes, and directory names must end with a slash (/).

Retrieving files with several versions

If you have several versions of the same file, you can use the options -fromdate, -fromtime, -todate, -totime, and -description to differentiate between them. You might need to specify the format of the date and time string. Interactively, you can use the -pick option.

$> dsmc retrieve -se=prXXXX -timeformat=4 -dateformat=3 -fromdate=2011-11-30 -fromtime=23:33:00 MyFile

$> dsmc retrieve -se=prXXXX -pick MyFile

$> dsmc retrieve -se=prXXXX -description="V.1.2" myFile

Looking for files

If you do not know the exact filespace or file name, use

$> dsmc query filespace -se=prXXXX

Then try to find your files by using the displayed information and wildcards:

$> dsmc query archive -se pr28fa '/gpfs/work/*'

Deletion of data from IBM Spectrum Protect

The default policy prohibits users from deleting data from the archives, in order to prevent accidental data loss. However, since many users request this feature, the permission can be granted on request via the Servicedesk.

Please bear in mind that deletion rights can only be granted at the granularity of a project: once granted, all users of the project are allowed to delete all data of the project. Please also bear in mind that deleted data cannot be restored, so be very careful when deleting data. If you feel unsure, feel free to contact us via the Servicedesk for guidance.

To delete archived data, you can use the command:

$> dsmc delete archive -se=prXXXX <OPTIONS>
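For example, to interactively select which archived copies of a file to delete (a sketch only; see the documentation linked below for the full set of options):

$> dsmc delete archive -se=prXXXX -pick $MY_SCRDIR/myFile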

For details on the usage of this command see: https://www.ibm.com/support/knowledgecenter/SSEQVQ_8.1.4/client/r_cmd_delarchive.html

Dealing with resource limits (very large archives)

On some of LRZ's HPC systems, resource limits are in place to prevent misuse. Please use the ulimit command to check the values of these limits; see the example below. In particular, a CPU time limit (-t switch of ulimit) may cause the archiving of very large files to abort. If you are affected by this, you need to split your data and archive disjoint subsets with multiple dsmc commands (possibly in parallel).
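For example, to inspect the limits in effect in your current shell:

$> ulimit -a    # show all resource limits
$> ulimit -t    # CPU time limit in seconds ("unlimited" if none is set)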

Optimal Usage of the Archive

To achieve better performance with TSM archive or retrieve jobs, you should consider the following guidelines.

Use large files

If you have many small files, put them into a tar archive and archive the tar file to tape:

$> tar cfv archive.tar small_files/

$> dsmc archive -se=prXXXX archive.tar
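To get the small files back later, retrieve the tar file and unpack it:

$> dsmc retrieve -se=prXXXX archive.tar
$> tar xfv archive.tar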

If you cannot avoid having many small files, use:

$> dsmc archive -se=prXXXX -subdir=yes small_files/

And avoid using:

$> dsmc archive -se=prXXXX small_files/*

The difference is that the first command groups up to 4096 files or 20 GB of data into a single transaction, so you are more likely to reach the wire speed of the tape drive. The second command creates a single transaction for each file and is therefore very slow (by a factor of 10 or more, depending on the size of the files).

Working with parallel streams

By default, a single TSM client call will use only a single tape drive. When you need to archive multiple big files, the throughput of a single tape drive may not be enough. In this case you can specify the number of parallel streams, and therefore the number of tape drives used in parallel, with the resourceutilization parameter. However, keep in mind that the resourceutilization parameter does not directly specify the number of sessions created by the client; it only influences the client's decision on how many resources it may use.

However, please bear in mind that SuperMUC has only 15 tape drives available and that other users may want to archive data at the same time as you. So please be considerate of other users and do not start too many parallel archiving jobs at once. In practice, the following values are relevant when archiving:

Resource Utilisation Value | Max. number of parallel streams
4                          | 2
6                          | 3
7                          | 4
9                          | 5
10                         | 6

In the example below we have 3 large files that we want to archive in parallel. Therefore we use the following command:

$> dsmc ar -se=prXXXX -subdir=yes -resourceutilization=6 test/
   IBM Tivoli Storage Manager
   Command Line Backup-Archive Client Interface
   Client Version 6, Release 2, Level 2.7  
   ..... 
   Archive function invoked.
   Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile2 [Sent]      
   Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile1 [Sent]      
   Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile3 [Sent]      
   Archive processing of '/home/hpc/prxxfa/userxyz/test/*' finished without failure.
   Total number of objects inspected:        3
   ...
   Total number of objects failed:           0
   Total number of bytes transferred:   30.01 GB
   LanFree data bytes:                   30.00 GB
   Data transfer time:                   69.38 sec
   Network data transfer rate:   453,533.57 KB/sec
   Aggregate data transfer rate: 413,967.37 KB/sec

Unfortunately, TSM currently lacks support for parallel retrieve sessions. Therefore you should make sure to start retrieving your files early enough so that they are ready when you need them. We have created a Request for Enhancement (RFE) at IBM to add this feature; you can help us prioritize the request by logging into the IBM Developer Network and voting for the particular RFE. In special cases where you have to store and retrieve intermediate result files from/to scratch, it may be possible to work around this with special procedures. If you currently need parallel retrieve/restore sessions, please contact us via the Servicedesk so that we can help to find an individual solution for you.

Retrieving Files into $SCRATCH

There is one minor problem when you retrieve files into $SCRATCH. The files are restored with the "last access date" (atime) of the original files. However, the automatic cleanup procedure in $SCRATCH deletes files older than a certain number of days. You therefore have to touch all restored files within the time span between retrieval and the cleanup deletion (which is typically done over night); see the example below. Alternatively, you can retrieve the files into $WORK, where no automatic cleanup is done.
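For example, assuming the files were retrieved into RetrievedDirectory/ (a placeholder name), their timestamps can be updated with:

$> find RetrievedDirectory/ -type f -exec touch {} +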

Special Cases of Archive usage

By default, Tivoli Storage Manager follows a symbolic link and archives the associated file or directory. To avoid this and archive only the symbolic link, not the associated file or directory, use the -archsymlinkasfile=no option.
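For example, to archive only the symbolic link itself (mySymbolicLink is a placeholder name):

$> dsmc archive -se=prXXXX -archsymlinkasfile=no mySymbolicLink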

Retaining archive data after the end of a project

It is possible to retain the archive data beyond the end of a SuperMUC project. For details, read the section on the conversion of a SuperMUC project into a Data-Only project.

Further Sources of Information