Data archive
Information and tips for building a data archive with TSM
The long-term, secure storage of data to meet compliance guidelines is an important topic not only in the business environment. In the research sector, too, the DFG, for example, stipulates in its rules for safeguarding good scientific practice that primary scientific data must be retained for at least 10 years. In more and more research areas, there is also a growing requirement to store data "forever". With its backup and archive system, the LRZ offers the technical infrastructure for long-term data storage. However, because these periods are very long from an IT point of view, you as a user have to consider a few things from the very beginning so that your archiving project can succeed.
On the terms "archiving" and "long-term archiving" at the LRZ
The LRZ uses the terms archiving and long-term archiving to distinguish between different retention periods for archived data. If you as a customer did not specify anything else when registering your TSM node, the node follows the "normal" archiving guidelines. These are based on the DFG guidelines for good scientific practice, so we keep your data for 10 years. After 10 years, the data is automatically deleted from the archive. If you want to keep data longer than 10 years, you can state in a comment when registering the node that it should be a long-term archive. In this case, your node is associated with a policy under which archive data is never automatically deleted, so it is kept "forever".
The terms "backup" and "archiving" in Tivoli Storage Manager (TSM)
The software suite "TSM" provided by the LRZ distinguishes between the two storage types backup and archiving.
The backup function works incrementally and version-based. This means that the backup function only saves a file if the same version of this file does not yet exist in the backup system. In addition, only a certain number of versions of a file (usually 3 versions at the LRZ) are saved. Also, symbolic links, for example, are saved and restored as symbolic links.
The archiving function, on the other hand, is purely time-oriented. This means that the archiving function saves a file even if the same version of the file already exists in the archive system, and it keeps an unlimited number of copies of this file for the retention period set in the policies. So, if you save the same unchanged file 10 times in a row, it is transferred and stored once by the backup function but 10 times by the archiving function. In addition, for symbolic links, the archiving function does not save the link itself but the file the link points to. Furthermore, the archiving function allows you to store a description for each file.
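The difference between the two functions can be illustrated with a small toy model. This is only a sketch of the behaviour described above, not TSM code: the class names are hypothetical, and comparing content hashes is a simplifying assumption (the real client decides whether a file has changed in its own way).

```python
from collections import defaultdict
import hashlib

MAX_BACKUP_VERSIONS = 3  # "usually 3 versions at the LRZ" according to the text


class ToyBackupStore:
    """Version-based: store a file only if it changed, keep at most N versions."""

    def __init__(self, max_versions=MAX_BACKUP_VERSIONS):
        self.max_versions = max_versions
        self.versions = defaultdict(list)  # path -> content digests, oldest first

    def backup(self, path, content):
        digest = hashlib.sha256(content).hexdigest()
        history = self.versions[path]
        if history and history[-1] == digest:
            return False  # same version already stored: nothing is transferred
        history.append(digest)
        del history[:-self.max_versions]  # drop versions beyond the limit
        return True


class ToyArchiveStore:
    """Time-oriented: every call stores another copy, together with a description."""

    def __init__(self):
        self.copies = defaultdict(list)  # path -> list of (digest, description)

    def archive(self, path, content, description=""):
        digest = hashlib.sha256(content).hexdigest()
        self.copies[path].append((digest, description))
        return True


backup_store = ToyBackupStore()
archive_store = ToyArchiveStore()
data = b"measurement series 42"

# Save the same unchanged file 10 times in a row with both functions.
for run in range(10):
    backup_store.backup("/archive/results.dat", data)
    archive_store.archive("/archive/results.dat", data, description=f"run {run}")

print(len(backup_store.versions["/archive/results.dat"]))  # 1
print(len(archive_store.copies["/archive/results.dat"]))   # 10
```

In this model the backup store ends up with a single version, while the archive store holds ten copies, each with its own description.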
On request, we also provide a special backup policy, suitable for long-term retention, under which a single version of a file is kept "forever". This can be useful, for example, if you want to automatically archive all "new" files in a file system without having to implement your own mechanism for finding the "new" files. If you need this, please contact us via the LRZ-Servicedesk before requesting the node.
Best Practice 1: Separate backup and archive
At first sight it seems practical to use one and the same TSM node both to protect your system against failures (backup) and to archive data on a long-term basis. However, we recommend using separate TSM nodes for these two scenarios. The reason is that a backup refers to the computer you are currently using, whereas an archive can exist over many computer generations. You should usually use a new backup node for each new computer, to avoid mixing the system data of the old computer with that of the new one in the backup. This is not possible, however, if the same node also has to hold archive data over several computer generations. We therefore recommend using different TSM nodes for archive and backup from the beginning.
Best Practice 2: Keep order
All too often, archives are operated as "digital attics": data that someone still wants to keep is carelessly put into the archive without structure or order. As a result, finding something specific again years later can be very time-consuming.
Therefore, you should plan a structure for your archive from the beginning that allows you, and also the successor of the successor of your successor, to still find your data several decades from now. Unfortunately, TSM is not a full document management system that stores extensive indexing information and other metadata. The only means of structuring the data are the directory structure and the additional description that you can give to each archived file and later search by. For large archive projects, it is worth considering whether a document management system with appropriate options for storing metadata, which references the data stored in TSM, would be advantageous. Several archiving projects hosted at the LRZ in the library and museum environment are already taking this path.
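One way to keep directory structure and descriptions consistent is to generate both from a fixed scheme instead of typing them by hand. The following is a minimal sketch under stated assumptions: the description scheme (project, year, content type), the helper names and the example path are hypothetical, and the script only assembles the arguments for the TSM command-line client (dsmc archive with its subdir and description options) without running anything.

```python
import shlex


def build_description(project, year, content_type):
    """Compose a searchable description from fixed fields (hypothetical scheme)."""
    return f"{project} {year} {content_type}"


def archive_command(directory, description):
    """Return the argument list for archiving a directory tree with a description.

    The list is only built here; pass it to subprocess.run() on a machine
    where the TSM client is installed if you actually want to execute it.
    """
    filespec = directory.rstrip("/") + "/*"
    return ["dsmc", "archive", "-subdir=yes", f"-description={description}", filespec]


cmd = archive_command(
    "/archive/projectx/2024/raw",
    build_description("projectx", 2024, "raw-data"),
)
print(shlex.join(cmd))  # shell-quoted form, for logging or review
```

Because every description follows the same pattern, a later search by description has a realistic chance of finding the data even if the person searching did not create the archive.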
To avoid confusion, you should keep the directory structure on your system the same across computer generations if possible. Please note that TSM "thinks" in so-called filespaces. A filespace usually corresponds to a file system. This can become a problem if, for example, computer generation 1 had a separate file system /archive, while on computer generation 2 /archive exists only as a "normal" directory in the file system /. You then suddenly have archive data in the filespace /archive and also in the filespace / under the directory /archive. As a result, you have to specify the filespace explicitly when searching, and therefore have to look for your data in two places. Ideally, you define your archive area as a separate filespace from the beginning by using the TSM option VIRTUALMOUNTPOINT.
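The filespace problem can be made concrete with a toy model. The sketch below assumes, as a simplification, that a file is assigned to the filespace of the longest mount point containing its path. The function name and the example paths are hypothetical, and the model only imitates the effect of declaring /archive as a virtual mount point in the client options.

```python
import posixpath


def filespace_for(path, mount_points):
    """Return the longest mount point containing `path` (toy filespace model)."""
    path = posixpath.normpath(path)
    candidates = [
        mp for mp in mount_points
        if mp == "/" or path == mp or path.startswith(mp.rstrip("/") + "/")
    ]
    return max(candidates, key=len, default="/")


archived_file = "/archive/projectx/data.csv"

# Computer generation 1: /archive is a separate file system.
gen1 = filespace_for(archived_file, ["/", "/archive"])

# Computer generation 2: /archive is only a directory in the file system /.
gen2 = filespace_for(archived_file, ["/"])

# Generation 2 with /archive declared via the VIRTUALMOUNTPOINT option,
# modelled here simply as an additional mount point.
gen2_virtual = filespace_for(archived_file, ["/", "/archive"])

print(gen1, gen2, gen2_virtual)  # /archive / /archive
```

With the virtual mount point in place, the data of both computer generations ends up in the same filespace /archive and only has to be searched in one place.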
Best Practice 3: Keep an eye on technology change
The rapid pace of change in the IT world is one of the biggest challenges when it comes to storing data for a long time. Not so long ago, storing data on floppy disks was common. Today, reading data from a floppy disk is a considerable challenge: first, the drive hardware needed to read the bits is usually missing, and second, it is not at all clear whether software still exists that can interpret those bits. And even if the software still exists, it is not clear whether it can run on modern systems at all.
Therefore, it is extremely important to keep an eye on technology change. If you use the LRZ-TSM archive system, this task is split into two parts. The LRZ takes care of keeping your bits readable over time by continuously migrating your data to current hardware. Your task is to ensure that the bits remain interpretable. You can do this either by keeping the necessary hardware and software systems up and running or, as soon as a relevant change in technology becomes apparent, by retrieving the data from the archive, converting it to a new data format and archiving the converted data again.
We are here for you
Are you planning to set up a digital archive or do you have questions about archiving with TSM? Please do not hesitate to contact us via the LRZ-Servicedesk. We will be happy to advise you.