DSA documentation additions for users

Overview

This document supplements the DSS documentation for users and outlines the specific differences between using a Data Science Storage (DSS) container and a Data Science Archive (DSA) container.

Please note that our Data Science Archive is still in an early production phase, so its capabilities may still change over the next few months. We therefore recommend revisiting this document from time to time, or at least reviewing our DSS Release Notes regularly.

DSA - The A is for Archive

At first sight, a Data Science Archive container does not look any different to you as a user from a Data Science Storage container. However, the purpose of the Data Science Archive is to safely store large amounts of cold research data so that you can comply with the rules of good scientific practice. You will therefore soon notice that files stored in a DSA container behave a little differently: first, their content is eventually moved from the DSA disk partition to tape (we like the analogy that the data freezes like water in a glacier), and second, they are protected against accidental data loss in a very particular way.

The life-cycle of a DSA file

All files you put into a DSA container go through the following life-cycle:

  • A few hours after you have copied a file to DSA, DSA creates a copy of it on tape in two different data centres. (At this point, the data is still also present on the disk partition.)
  • Approximately 24h* after the file has been created in DSA, it is made immutable and a deletion hold of 10 years is placed on it. This means you will never be able to modify, append to or rename the file again, and you will not be able to delete it for the next 10 years.
  • At some point, when the disk partition fills up to a certain watermark, DSA begins to purge the content of files that already have a copy on tape from the disk partition. The file metadata is kept, so you will still see your files in the DSA container; however, as the file is now frozen, you will get an error (Permission denied) when you try to access it.
  • When you want to access a frozen file again, you first have to thaw it. This can be done either implicitly, by initiating a Globus Online transfer, or explicitly, by sending a stage request to the DSA Recall Director service. We will cover both methods later. After files have been thawed, they usually remain accessible for at least 7 days, unless there is very high pressure on the disk partition. A file can be frozen and thawed an unlimited number of times.
  • After 10 years, the deletion hold is released and you can delete the file again if you want to. However, you will still not be able to modify, append to or rename it.

*For files smaller than 1GB, this time period is extended to 7 days, as we consider storing a large number of files smaller than 1GB an anti-pattern for DSA and want to give you a little more time to clean up files stored in DSA by mistake.

Getting data in and out

There are basically two ways to get data in and out of DSA: via Globus Online, or via the login nodes of SuperMUC-NG and the LRZ Linux Cluster. In the following, we describe both ways and outline their specific advantages and disadvantages.

Getting data in

As you will see in this section, getting data into DSA is relatively straightforward. However, there is one important factor to keep in mind: your data will eventually end up on tape, a storage medium that only allows sequential access. So in order to get your data back later in a timely manner, the files you put into the archive must be large enough to allow the tape drives to operate efficiently. As a general rule of thumb, files should not be smaller than 1GB, and files of 100GB or more are even better. As an upper bound, files should, if possible, not be larger than 6TB. If you have many small files, use tar or zip to combine them into a larger archive file.
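As a sketch, combining many small files into one archive before copying it into the container could look like this (all file and directory names are made up for illustration):

```shell
# Example layout: many small result files under ./results/ (names are made up)
mkdir -p results
for i in 1 2 3; do echo "sample $i" > "results/sample_$i.txt"; done

# Bundle them into a single compressed archive; copy only the archive to DSA
tar czf results_2021.tar.gz results/

# Double-check the archive content before deleting the originals
tar tzf results_2021.tar.gz
```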

Please also note that we enforce a very strict quota on the number of files each project can store, which cannot be increased (usually a low 5-digit number). So if you store too many small files, you will run out of quota very soon and may then need to rework your whole archive.

When working with tar or zip archives, consider putting a "META" file next to each archive that describes its content. Files smaller than 20MB usually never get purged from the disk partition and can therefore easily be searched without having to stage them first.
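For example, a minimal META file (the naming convention here is just a suggestion) could be generated from the archive's table of contents:

```shell
# Create a small example archive (paths are made up for illustration)
mkdir -p run42 && echo "raw data" > run42/out.dat
tar czf run42.tar.gz run42/

# Write a META file next to the archive; as it stays far below 20MB, it
# usually remains on disk and can be grepped without staging the archive
{
  echo "Archive: run42.tar.gz"
  echo "Created: $(date -u +%Y-%m-%d)"
  echo "Contents:"
  tar tzf run42.tar.gz
} > run42.tar.gz.META

cat run42.tar.gz.META
```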

Using Globus Online

The Data Science Archive is available as a Globus Online endpoint here. In order to access it, make sure you log in with the LRZ username that was invited to your DSA container. You can find the invited user ID in the DSA invitation mail we sent you.

For information on how to use Globus Online, please read their fine Getting Started Guide.

In order to transfer data from SuperMUC-NG or LRZ Linux Cluster, you can use the following endpoints to move data into DSA:

If you want to transfer data from a remote system that does not yet provide a Globus endpoint, you can use Globus Connect Personal to turn virtually any system into a Globus Online endpoint within minutes.

Using Globus Online to put data into DSA has the following advantages:

  • It will automatically calculate checksums of your data after transfer to make sure no silent data corruption occurred
  • It will automatically copy data in parallel so performance will be much better than with cp or rsync for example
  • It will take care of your data transfer and inform you by mail once it succeeded

Using HPC Login Nodes

As the frontend of the Data Science Archive looks like a normal file system, it is also mounted on the Login Nodes of your HPC systems. So you can also put data into DSA just by using whatever file copy or archiving utility is available on the login nodes. The path to your DSA container directory can be found in the invitation mail.

Beware of tools or options that try to preserve the owning group and/or access rights, such as cp -a, rsync -a or rsync -p -g. The concept of a DSA container is that everyone who has access to the container has access to all of its data, and we work hard to enforce this semantic. Tools that mess with ownership or access rights may break this and could eventually cause you to lose access to your data.

You can directly create tar archives from your data on WORK or SCRATCH into a DSA container.
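A minimal sketch of archiving a directory from WORK straight into the container; temporary directories stand in for $WORK and the container base directory, so the paths are purely illustrative:

```shell
# Stand-ins for $WORK and the DSA container base directory
WORK=$(mktemp -d)
DSA=$(mktemp -d)
mkdir -p "$WORK/my_project" && echo "results" > "$WORK/my_project/out.txt"

# -C changes into $WORK first, so the archive stores paths relative to it;
# the archive is written directly into the container, no intermediate copy
tar cf "$DSA/my_project.tar" -C "$WORK" my_project

tar tf "$DSA/my_project.tar"
```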

Getting data out

We have recently identified a lock contention issue when multiple staging jobs that compete for the same tape volumes are started in parallel, which can lead to incomplete stages. To avoid this, please start only one stage job at a time for a single container. We plan to have a fix for this available in Q1/2023. In any case, it is generally more performant to issue fewer but larger stage jobs rather than many small ones.

Using Globus Online

Getting data out with Globus Online is basically as easy as getting it in. Just start a transfer and Globus Online will take care of the rest.

The way Globus Online currently handles staging is not optimal. It works reasonably well with files larger than 10GB, or when you only want to transfer a few files. However, when you need to access many files smaller than 10GB, consider staging the files manually first (see next section) and transferring them after staging.

Using HPC Login Nodes

When you try to access a file in a DSA container that is currently offline, i.e. on tape only, you will get a Permission denied error. In order to access it again, it needs to be staged explicitly using the DSA CLI utility, which is installed on all HPC login nodes.

The dsacli tool has options to specify various output formats, such as csv, json or yaml, which may come in handy when using the tool in your own scripts and programs. Just use the base command together with the -h switch to get an overview of the output options (e.g. dsacli stage job list -h).


In order to start a new stage job, use the dsacli stage job create command. In its simplest form, it expects as arguments the name of the DSA container and a file or directory name, either relative to the container base directory or as an absolute path. Alternatively, it also accepts a file that lists all files to stage, again with either absolute or relative paths. Additionally, it provides the -w switch, which allows you to use glob wildcards in file names. Last but not least, you can add the -n switch to get an email notification when staging is done.

In the following examples we show the various ways the dsacli stage job create command can be used. For these examples, we use the DSA container pr74qo-dss-0007, which is available under the path /dss/dsafs01/0001/pr74qo-dss-0007.

#> pwd
/dss/dsafs01/0001/pr74qo-dss-0007
#> ls -lh
total 391G
-r--r----- 1 root pr74qo-dss-0007 98G Feb 12 11:02 100g2
-r--r----- 1 root pr74qo-dss-0007 98G Feb 12 11:14 100g3
-r--r----- 1 root pr74qo-dss-0007 98G Feb 12 11:19 100g4
drwxrws--- 8 root pr74qo-dss-0007 4.0K Feb 12 10:01 100G-in-large-files
drwxrws--- 8 root pr74qo-dss-0007 4.0K Feb 12 10:24 100G-in-large-files2
drwxrws--- 8 root pr74qo-dss-0007 4.0K Feb 12 10:35 100G-in-large-files3
drwxrws--- 8 root pr74qo-dss-0007 4.0K Feb 12 10:45 100G-in-large-files4
-r--r----- 1 root pr74qo-dss-0007 98G Feb 12 11:53 100gr1
-r--r----- 1 root pr74qo-dss-0007 98G Feb 12 12:55 100gr2

#> # First of all, we have to log in to the DSA Staging Service
#> dsacli login
Username: a9999bp
Password: 
Logged in.

#> # Now let's stage a single file, in this case 100gr1
#> dsacli stage job create --dsacontainer pr74qo-dss-0007 -s ./100gr1
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 5                                |
| Container                | pr74qo-dss-0007                  |
| Status                   | Preparing stage list             |
| Number of tapes required | None                             |
| Number of tapes done     | 0                                |
| created                  | 2021-02-16T09:21:52.250986+01:00 |
+--------------------------+----------------------------------+

#> # Now let's stage all files that match the file glob 100g*
#> dsacli stage job create --dsacontainer pr74qo-dss-0007 -s "./100g*" -w
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 6                                |
| Container                | pr74qo-dss-0007                  |
| Status                   | Preparing stage list             |
| Number of tapes required | None                             |
| Number of tapes done     | 0                                |
| created                  | 2021-02-16T09:23:42.022459+01:00 |
+--------------------------+----------------------------------+
#> # Note that we put the 100g* glob in quotes. Otherwise, the glob would be interpreted by the shell and not by dsacli. Also, the -w switch is needed to tell dsacli to interpret the argument as a glob.

#> # You can use paths either relative to the container base directory (in our case /dss/dsafs01/0001/pr74qo-dss-0007) or fully qualified
#> dsacli stage job create --dsacontainer pr74qo-dss-0007 -s /dss/dsafs01/0001/pr74qo-dss-0007/100gr1
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 7                                |
| Container                | pr74qo-dss-0007                  |
| Status                   | Preparing stage list             |
| Number of tapes required | None                             |
| Number of tapes done     | 0                                |
| created                  | 2021-02-16T09:27:44.929252+01:00 |
+--------------------------+----------------------------------+

#> # When you pass a directory instead of a file as the staging argument, all files in the directory tree are staged recursively
#> dsacli stage job create --dsacontainer pr74qo-dss-0007 -s /dss/dsafs01/0001/pr74qo-dss-0007/100G-in-large-files/
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 8                                |
| Container                | pr74qo-dss-0007                  |
| Status                   | Preparing stage list             |
| Number of tapes required | None                             |
| Number of tapes done     | 0                                |
| created                  | 2021-02-16T10:21:36.191010+01:00 |
+--------------------------+----------------------------------+

#> # Alternatively, you can create a file list of the files you want to stage, one line per file. Again, you can specify files either with their fully qualified path or their path relative to the container base directory. You can specify individual files or complete directories.
#> cat stagelist.txt 
./100gr2
/dss/dsafs01/0001/pr74qo-dss-0007/100g4
./100G-in-large-files/
#> dsacli stage job create --dsacontainer pr74qo-dss-0007 -l stagelist.txt 
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 10                               |
| Container                | pr74qo-dss-0007                  |
| Status                   | Preparing stage list             |
| Number of tapes required | None                             |
| Number of tapes done     | 0                                |
| created                  | 2021-02-16T10:26:20.173931+01:00 |
+--------------------------+----------------------------------+

#> # Also with file lists, you can use wildcards when using the -w switch
#> cat stagelist.txt 
./100g*
#> dsacli stage job create --dsacontainer pr74qo-dss-0007 -l stagelist.txt -w
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 12                               |
| Container                | pr74qo-dss-0007                  |
| Status                   | Preparing stage list             |
| Number of tapes required | None                             |
| Number of tapes done     | 0                                |
| created                  | 2021-02-16T10:27:59.507579+01:00 |
+--------------------------+----------------------------------+

In order to get an overview of your recent stage jobs, you can use the dsacli stage job list command. We keep information about your stage jobs for the last 30 days.

#> dsacli stage job list
+----+-----------------+-------------------+----------------------------------+
| ID | Container       | Status            | Created                          |
+----+-----------------+-------------------+----------------------------------+
|  5 | pr74qo-dss-0007 | Staging completed | 2021-02-16T09:21:52.250986+01:00 |
|  6 | pr74qo-dss-0007 | Staging completed | 2021-02-16T09:23:42.022459+01:00 |
|  7 | pr74qo-dss-0007 | Staging completed | 2021-02-16T09:27:44.929252+01:00 |
|  8 | pr74qo-dss-0007 | Staging completed | 2021-02-16T10:21:36.191010+01:00 |
|  9 | pr74qo-dss-0007 | Staging completed | 2021-02-16T10:23:41.742482+01:00 |
| 10 | pr74qo-dss-0007 | Staging completed | 2021-02-16T10:26:20.173931+01:00 |
| 11 | pr74qo-dss-0007 | Staging completed | 2021-02-16T10:27:28.330764+01:00 |
| 12 | pr74qo-dss-0007 | Staging completed | 2021-02-16T10:27:59.507579+01:00 |
+----+-----------------+-------------------+----------------------------------+

In order to get some details and the current status of a particular stage job, you can use the dsacli stage job show command. It takes the ID of the stage job you want to view as argument.

The command will show you the container it is operating on, the number of tapes that need to be touched in order to fulfil the request and the number of tapes that have already been finished as well as the creation and end time and the current job status.

A stage job goes through the following states:

  • New pending: The stage job has entered the system but has not yet been handed over to the worker nodes
  • Preparing stage list: The stage job is sorting out already staged files and computing the optimal stage order
  • Waiting for staging slots: The stage job is waiting for free tape drives
  • Staging in progress: The stage job has begun to move data from tape to disk
  • Staging completed: The stage job has completed
  • Staging aborted by user: The stage job has been aborted by a user request

#> dsacli stage job show 6 
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 6                                |
| Container                | pr74qo-dss-0007                  |
| Status                   | Staging completed                |
| Number of tapes required | 1                                |
| Number of tapes done     | 1                                |
| Created                  | 2021-02-16T09:23:42.022459+01:00 |
| Finished                 | 2021-02-16T09:32:58.230031+01:00 |
+--------------------------+----------------------------------+
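In a script, you might wait for a job to finish by polling dsacli stage job show. The helper below is only a sketch: it simply greps the command's table output for the status strings documented above, and the function name is made up.

```shell
# Sketch: block until the given stage job reaches a final state.
# Assumes dsacli is on the PATH and you are already logged in.
wait_for_stage() {
    job_id=$1
    while true; do
        status=$(dsacli stage job show "$job_id")
        case $status in
            *"Staging completed"*)       echo "job $job_id done";    return 0 ;;
            *"Staging aborted by user"*) echo "job $job_id aborted"; return 1 ;;
        esac
        sleep 60   # tape recalls take a while; no need to poll aggressively
    done
}

# Usage: wait_for_stage 6
```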

In order to get a list of the files and directories a job is going to stage, you can use the dsacli stage job show list command. It again takes the ID of the stage job you want to view as argument. Note that file and directory names are always shown relative to the container base directory.

#> dsacli stage job show list 10
+-------------------------+
| File/Directory to stage |
+-------------------------+
| ./100gr2                |
| /100g4                  |
| ./100G-in-large-files/  |
+-------------------------+

You can also retrieve a list of the already staged files of a job in real time using the dsacli stage job show staged command. It takes the ID of the stage job you want to view as argument. Additionally, you can limit the number of files displayed at once using the --number argument; with each call, the next N elements are displayed until the list reaches the end and starts from the beginning again. You may also want to use the --consume switch, with which files that have been displayed once are removed from the list.

As the staged file list is created in real time, this command can be very handy when you want to start processing files while the overall staging job is still running.
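A sketch of such a pipeline, relying only on the documented --consume behaviour (each call returns only files not delivered before); the grep pattern, the function name and the processing step are placeholders:

```shell
# Sketch: process newly staged files of a job as they arrive on disk.
# Assumes dsacli is on the PATH and you are already logged in.
process_staged() {
    job_id=$1
    # Strip the table decoration; keep only the absolute file paths
    dsacli stage job show staged "$job_id" --consume \
        | grep -o '/dss/[^ |]*' \
        | while read -r file; do
              echo "processing $file"   # placeholder for real work
          done
}

# Usage (e.g. called repeatedly while the stage job is running):
# process_staged 6
```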


#> dsacli stage job show staged 6
+------------------------------------------+
| Staged files                             |
+------------------------------------------+
| /dss/dsafs01/0001/pr74qo-dss-0007/100gr2 |
| /dss/dsafs01/0001/pr74qo-dss-0007/100g4  |
| /dss/dsafs01/0001/pr74qo-dss-0007/100g3  |
| /dss/dsafs01/0001/pr74qo-dss-0007/100gr1 |
| /dss/dsafs01/0001/pr74qo-dss-0007/100g2  |
+------------------------------------------+

#> # When using the --number argument, you can page through the file list as if it were a ring buffer
#> dsacli stage job show staged 6 --number 2
+------------------------------------------+
| Staged files                             |
+------------------------------------------+
| /dss/dsafs01/0001/pr74qo-dss-0007/100gr2 |
| /dss/dsafs01/0001/pr74qo-dss-0007/100g4  |
+------------------------------------------+
#> dsacli stage job show staged 6 --number 2
+------------------------------------------+
| Staged files                             |
+------------------------------------------+
| /dss/dsafs01/0001/pr74qo-dss-0007/100g3  |
| /dss/dsafs01/0001/pr74qo-dss-0007/100gr1 |
+------------------------------------------+
#> dsacli stage job show staged 6 --number 2
+------------------------------------------+
| Staged files                             |
+------------------------------------------+
| /dss/dsafs01/0001/pr74qo-dss-0007/100g2  |
| /dss/dsafs01/0001/pr74qo-dss-0007/100gr2 |
+------------------------------------------+

#> # When you use the --consume switch, every file that has been delivered once by a show staged command is purged from the list
#> dsacli stage job show staged 6 --consume
+------------------------------------------+
| Staged files                             |
+------------------------------------------+
| /dss/dsafs01/0001/pr74qo-dss-0007/100g4  |
| /dss/dsafs01/0001/pr74qo-dss-0007/100g3  |
| /dss/dsafs01/0001/pr74qo-dss-0007/100gr1 |
| /dss/dsafs01/0001/pr74qo-dss-0007/100g2  |
| /dss/dsafs01/0001/pr74qo-dss-0007/100gr2 |
+------------------------------------------+
#> dsacli stage job show staged 6 

#>

Last but not least, you can use the dsacli stage job abort command to abort running stage jobs. The command again takes the ID of the stage job you want to abort as argument. Note that it may take some time until the job is actually aborted, as this is done in a coordinated fashion and may need to wait for some tasks to reach a sane state.

#> dsacli stage job create --dsacontainer pr74qo-dss-0007 -s /100g2 
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 13                               |
| Container                | pr74qo-dss-0007                  |
| Status                   | Preparing stage list             |
| Number of tapes required | None                             |
| Number of tapes done     | 0                                |
| created                  | 2021-02-16T11:19:47.242866+01:00 |
+--------------------------+----------------------------------+
#> dsacli stage job show 13
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 13                               |
| Container                | pr74qo-dss-0007                  |
| Status                   | Staging in progress              |
| Number of tapes required | 1                                |
| Number of tapes done     | 0                                |
| Created                  | 2021-02-16T11:19:47.242866+01:00 |
| Finished                 | None                             |
+--------------------------+----------------------------------+
#> dsacli stage job abort 13
Initiated abortion of stage job 13
#> dsacli stage job show 13
+--------------------------+----------------------------------+
| Field                    | Value                            |
+--------------------------+----------------------------------+
| ID                       | 13                               |
| Container                | pr74qo-dss-0007                  |
| Status                   | Staging aborted by user          |
| Number of tapes required | 1                                |
| Number of tapes done     | 0                                |
| Created                  | 2021-02-16T11:19:47.242866+01:00 |
| Finished                 | None                             |
+--------------------------+----------------------------------+

Sometimes it can also be helpful to find out directly whether a given file is on the online partition of DSA or not. For this, you can use the command /usr/lpp/mmfs/bin/mmlsattr, which will also tell you some more information about the file.

#> /usr/lpp/mmfs/bin/mmlsattr -L 100g2
file name:            100g2
metadata replication: 1 max 2
data replication:     1 max 2
immutable:            yes							<= Indicates if the file has already been made immutable
appendOnly:           no
indefiniteRetention:  no
expiration Time:      Tue Feb 11 00:00:00 2031		<= This is the time at which the deletion hold will expire
flags:                
storage pool name:    data
fileset name:         pr74qo-dsa-0007				<= The DSA container the file is contained in
snapshot name:        
creation time:        Fri Feb 12 11:02:20 2021
Misc attributes:      ARCHIVE OFFLINE READONLY		<= OFFLINE indicates the file is not on disk currently
Encrypted:            no

#> /usr/lpp/mmfs/bin/mmlsattr -L 100g3
file name:            100g3
metadata replication: 1 max 2
data replication:     1 max 2
immutable:            yes
appendOnly:           no
indefiniteRetention:  no
expiration Time:      Tue Feb 11 00:00:00 2031
flags:                
storage pool name:    data
fileset name:         pr74qo-dsa-0007
snapshot name:        
creation time:        Fri Feb 12 11:14:13 2021
Misc attributes:      ARCHIVE READONLY				<= OFFLINE flag is missing so file is on disk currently
Encrypted:            no
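Based on the OFFLINE flag in the mmlsattr output, a small helper can tell whether a file is currently on disk. This is only a sketch; the MMLSATTR variable and the function name are our own inventions, introduced just to keep the path overridable.

```shell
# Path to mmlsattr; overridable, e.g. for testing or other installations
MMLSATTR=${MMLSATTR:-/usr/lpp/mmfs/bin/mmlsattr}

# Succeeds (exit 0) if the file is currently on the disk partition,
# i.e. the OFFLINE flag is absent from the Misc attributes line
is_online() {
    ! "$MMLSATTR" -L "$1" | grep -q 'OFFLINE'
}

# Usage:
# if is_online 100g2; then cat 100g2; else echo "needs staging first"; fi
```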