SuperMUC-NG Archive Usage (TSM-based)
Working with the IBM Spectrum Protect based archive of SuperMUC
Archiving means saving the current version of a file to tape. Several versions of the same file can be kept in the tape archive. To restore a particular one, you must differentiate the versions by date and time or by an additional description.
In order to archive and retrieve data at the SuperMUC/SuperMIG or the Linux Cluster, the IBM Spectrum Protect based LRZ Backup and Archive Infrastructure is used. A system-wide ISP client configuration is available, so you do not need to install or configure an ISP client yourself.
General Information
The SuperMUC Archive is based on offline tape storage. If archiving or retrieving does not start promptly, this is most probably NOT a problem: there are usually just no free tape drives available at the moment. If you encounter such a situation, please be patient and wait. Only 15 tape drives are available for SuperMUC, and if there is a lot of workload from multiple users on the system, it can take several hours until your archive job gets a free tape drive. Avoid cancelling and resubmitting your archive command. If you feel that there is a problem, please open a support ticket at the Servicedesk for SuperMUC-NG.
Archiving data with IBM Spectrum Protect
Let's assume you work for project prXXXX and have a file myFile stored on the temporary filesystem in a location (directory) denoted by some self-defined environment variable $MY_SCRDIR. Since myFile may be automatically removed from $MY_SCRDIR by high-watermark deletion after some days, you might want to have an archive copy at hand. Here is how to create one: go to $MY_SCRDIR and invoke one of the commands:
$> dsmc archive -se=prXXXX myFile
$> dsmc archive -se=prXXXX -description="V.1.2" myFile
We recommend keeping logs of all archive commands in a specific directory, e.g.
$> dsmc archive -se=prXXXX myFile >$HOME/mytapelogs/archived_on_YYYY_MM_DD_hh_mm
This might later help to avoid confusion with the file namespace (see below).
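A small wrapper along these lines can generate the date-stamped log name automatically. This is only a sketch: the log directory $HOME/mytapelogs and the naming scheme just mirror the example above, and the dsmc call is left commented out.

```shell
#!/bin/sh
# Sketch of a logging wrapper for archive runs; directory name and
# file naming scheme are illustrative, matching the example above.
LOGDIR="$HOME/mytapelogs"
mkdir -p "$LOGDIR"
STAMP=$(date +%Y_%m_%d_%H_%M)            # e.g. 2019_07_01_14_05
LOGFILE="$LOGDIR/archived_on_$STAMP"
echo "writing archive log to $LOGFILE"
# dsmc archive -se=prXXXX myFile > "$LOGFILE"
```
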
In case the file name contains spaces, you have to enclose it in double quotes, e.g., "my file with spaces". If you want to archive several files myFile1, myFile2, ... you can use wildcards or specify the filenames:
$> dsmc archive -se=prXXXX myFile*
$> dsmc archive -se=prXXXX myFile1 myFile2 myFile3
You can also archive complete directory trees. This can be achieved by using an additional command-line option:
$> dsmc archive -se=prXXXX -subdir=yes MyDirectory/
The trailing slash tells dsmc to interpret MyDirectory/ as a directory.
If you have lots of data to archive, you may want to have a look at the Optimal Usage of the Archive section below.
Retrieving data with IBM Spectrum Protect
You can search for archived files in a subdirectory $MY_SCRDIR of any file system by issuing the command
$> dsmc query archive -se=prXXXX -subdir=yes $MY_SCRDIR/
Again, the slash after $MY_SCRDIR is important to remind dsmc that it is a directory. A file can be retrieved with one of the commands:
$> dsmc retrieve -se=prXXXX $MY_SCRDIR/myFile $MY_SCRDIR/myNewFileName
$> dsmc retrieve -se=prXXXX -description="V.1.2" myFile myNewFileName
If you omit the second file argument, the file will be restored under its original name. Of course, you can also retrieve complete directory trees.
$> dsmc retrieve -se=prXXXX MyDirectory/ RetrievedDirectory/ -subdir=yes
This will restore the data from MyDirectory/ to RetrievedDirectory/. Again, directory or file names containing spaces have to be enclosed in double quotes, and directory names must end with a slash (/).
Retrieving files with several versions
If you have several versions of the same file, you can use the options -fromdate, -fromtime, -todate, -totime, and -description to differentiate between them. You might need to specify the format of the date and time strings. Interactively, you can use the -pick option.
$> dsmc retrieve -se=prXXXX -timeformat=4 -dateformat=3 -fromdate=2011-11-30 -fromtime=23:33:00 MyFile
$> dsmc retrieve -se=prXXXX -pick MyFile
$> dsmc retrieve -se=prXXXX -description="V.1.2" myFile
Looking for files
If you do not know the exact filespace/filename, use:
$> dsmc query filespace -se=prXXXX
Then try to find your files by using the displayed information and wildcards
$> dsmc query archive -se=prXXXX '/gpfs/work/*'
Deletion of data from IBM Spectrum Protect
The default policy prohibits users from deleting data from the archives, to prevent data from being accidentally deleted. However, since many users request this feature, the permission can be granted on request via the Servicedesk.
Please bear in mind that deletion rights can only be granted at the granularity of a project: once granted, all users of the project are allowed to delete all data of the project. Please also bear in mind that deleted data cannot be restored, so be very careful when deleting data. If you are unsure, feel free to contact us via the Servicedesk for guidance.
To delete archived data, you can use the command:
$> dsmc delete archive -se=prXXXX <OPTIONS>
For details on the usage of this command see: https://www.ibm.com/support/knowledgecenter/SSEQVQ_8.1.4/client/r_cmd_delarchive.html
Dealing with resource limits (very large archives)
On some of LRZ's HPC systems, resource limits are in place to prevent misuse. Please use the ulimit command to check the values of these limits. In particular, a CPU time limit (-t switch of ulimit) may cause the archiving of very large files to abort. If you are impacted by this, you need to split your data and archive disjoint subsets with multiple dsmc commands (possibly in parallel).
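One way to split the work is to build disjoint file lists and pass each to its own dsmc call via the -filelist option. The sketch below is only illustrative: the directory name, the batch size of 1000 files, and the list-file prefix are arbitrary choices, and the dsmc invocations are commented out.

```shell
#!/bin/sh
# Sketch: archive a large directory in disjoint batches so that no single
# dsmc call exceeds the CPU time limit. Paths and batch size are examples.
find data/ -type f | split -l 1000 - batch_   # 1000 files per list file
for list in batch_*; do
    echo "would run: dsmc archive -se=prXXXX -filelist=$list"
    # dsmc archive -se=prXXXX -filelist="$list" &
done
# wait    # if the dsmc calls were started in parallel
```
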
Optimal Usage of the Archive
To achieve better performance with TSM archive or retrieve jobs, you should consider the following guidelines.
Use large files
If you have many small files, put them into a tar archive and archive the tar file instead:
$> tar cfv archive.tar small_files/
$> dsmc archive -se=prXXXX archive.tar
If you cannot avoid having many small files, use:
$> dsmc archive -se=prXXXX -subdir=yes small_files/
And avoid using:
$> dsmc archive -se=prXXXX small_files/*
The difference is that the first command groups up to 4096 files or 20 GB of data into a single transaction, so you are more likely to reach the wire speed of the tape drive. The second command creates a single transaction for each file and is therefore very slow (up to a factor of 10 or more, depending on the size of the files).
Working with parallel streams
By default, a single TSM client call will use only a single tape drive. When you need to archive multiple big files, the throughput of a single tape drive may not be enough. In this case you can specify the number of parallel streams - and therefore the number of tape drives used in parallel - with the resourceutilization parameter. However, keep in mind that the resourceutilization parameter does not directly specify the number of sessions created by the client; it only influences the client's decision on how many resources it may use.
However, please bear in mind that SuperMUC has only 15 tape drives available and that other users may also want to archive data at the same time as you. So please be kind to other users and do not start too many parallel archiving jobs at once. The practically relevant values when archiving are:
| Resource Utilisation Value | Max. number of parallel streams |
|---|---|
| 4 | 2 |
| 6 | 3 |
| 7 | 4 |
| 9 | 5 |
| 10 | 6 |
In the example below we have 3 large files that we want to archive in parallel. Therefore we use the following command:
$> dsmc ar -se=prXXXX -subdir=yes -resourceutilization=6 test/
IBM Tivoli Storage Manager
Command Line Backup-Archive Client Interface
  Client Version 6, Release 2, Level 2.7
.....
Archive function invoked.
Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile2 [Sent]
Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile1 [Sent]
Normal File-->    10,737,418,240 /home/hpc/prxxfa/userxyz/test/testfile3 [Sent]
Archive processing of '/home/hpc/prxxfa/userxyz/test/*' finished without failure.
Total number of objects inspected:    3
...
Total number of objects failed:       0
Total number of bytes transferred:    30.01 GB
LanFree data bytes:                   30.00 GB
Data transfer time:                   69.38 sec
Network data transfer rate:           453,533.57 KB/sec
Aggregate data transfer rate:         413,967.37 KB/sec
Unfortunately, TSM currently lacks support for parallel retrieve sessions. Therefore you should make sure to start retrieving your files early enough so that they are ready when you need them. We have created a Request for Enhancement (RFE) at IBM to add this feature. You can help us prioritize the request by logging into the IBM Developer Network and voting for the particular RFE. In special cases where you have to store and retrieve intermediate result files from/to scratch, it may be possible to work around this with special procedures. However, if you currently need parallel retrieve/restore sessions, please contact us via the Servicedesk so that we can help find an individual solution for you.
Retrieving Files into $SCRATCH
There is one minor problem when you retrieve files into $SCRATCH: the files are restored with the "last access date" (atime) of the original files. However, the automatic cleanup procedure in $SCRATCH deletes files older than a certain number of days. You therefore have to touch all restored files in the time span between retrieval and the cleanup run (which is typically done overnight). Alternatively, you can retrieve the files into $WORK, where no automatic cleanup is done.
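Refreshing the access times of everything below a retrieved directory can be done with a single find call; the directory name below is just an example, matching the retrieve commands earlier in this section.

```shell
# Refresh only the access time (atime) of every restored file, so the
# nightly $SCRATCH cleanup does not consider them old.
find RetrievedDirectory/ -type f -exec touch -a {} +
```
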
Special Cases of Archive usage
Symbolic Links
Tivoli Storage Manager follows a symbolic link and archives the associated file or directory; this is the default. To archive only the symbolic link and not the associated file or directory, use the -archsymlinkasfile=no option.
Retaining archive data after the end of a project
It is possible to retain the archive data beyond the end of a SuperMUC project. For details, read the section on the conversion of a SuperMUC project into a Data-Only project.