Restoring Data

Using a data backup system such as the LRZ Archive and Backup System can be compared to a term life insurance policy. Running backups regularly and monitoring and maintaining the backup software are, so to speak, the insurance premiums you pay, and the day you have to restore lost data from the backup system corresponds to the payout in the event of a claim. In this article we give you information and advice on how to get your data back as quickly and smoothly as possible when such a claim occurs.

Important note

If you need to perform a critical restore, please do not hesitate to notify the LRZ service desk immediately. We will contact you as soon as possible, can assist you with advice and can, for example, ensure that resources are made available to you with priority.

How does a restore work with TSM, and what needs to be considered?

Assuming a running and correctly configured TSM client, a restore usually consists of the following steps:

You select the data to be recovered and then start the recovery.
TSM searches its database for the desired files and puts them in an order so that as few tape changes as possible are necessary during readout.
The TSM server inserts one backup tape after the other into a free tape drive, reads the desired files and sends them to the client, which writes them to the desired location on your computer.
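
The idea of ordering files so that as few tape changes as possible are needed can be illustrated with a small sketch. It is not TSM's actual algorithm; the record type StoredFile, the tape labels and the positions are made up for illustration only.

```python
from typing import NamedTuple


class StoredFile(NamedTuple):
    """Hypothetical record: where one backed-up file lives on tape."""
    path: str
    tape: str       # label of the tape volume (invented for this example)
    position: int   # offset of the file on that tape


def order_for_readout(files: list[StoredFile]) -> list[StoredFile]:
    """Sort by tape, then by position, so each tape is needed only once."""
    return sorted(files, key=lambda f: (f.tape, f.position))


def count_tape_changes(files: list[StoredFile]) -> int:
    """Count how often a file sits on a different tape than the one before."""
    changes = 0
    for previous, current in zip(files, files[1:]):
        if previous.tape != current.tape:
            changes += 1
    return changes


requested = [
    StoredFile("/home/a.txt", "TAPE02", 40),
    StoredFile("/home/b.txt", "TAPE01", 10),
    StoredFile("/home/c.txt", "TAPE02", 5),
    StoredFile("/home/d.txt", "TAPE01", 90),
]
print(count_tape_changes(requested))                     # in request order
print(count_tape_changes(order_for_readout(requested)))  # after sorting
```

In this toy example, reading in request order switches tapes after every file, while the sorted order needs a single change.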

Even when selecting the data to be restored, there are some things to consider. First of all, keep in mind that TSM, unless told otherwise, only shows you the files or file versions that were still on your system during the last backup run. In TSM jargon these are the "active" data. If you also want to see files that had already been deleted at the time of the last backup, or older versions of files, you have to tell TSM explicitly to show the "inactive" data as well. This can be done by selecting "Display Inactive Files" in the restore dialog of the GUI or by specifying the parameter -inactive in the query backup or restore command.
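
As a minimal sketch of the command-line variant, the following helper assembles a query backup call for the command-line client dsmc, optionally with -inactive. The function name and the path /home/user/* are placeholders chosen for illustration; the resulting list could be passed to subprocess.run on a machine where the client is installed.

```python
import shlex


def query_backup_args(filespec: str, include_inactive: bool = False) -> list[str]:
    """Build the argument list for a `dsmc query backup` call."""
    args = ["dsmc", "query", "backup", filespec, "-subdir=yes"]
    if include_inactive:
        # Also list deleted files and older file versions.
        args.append("-inactive")
    return args


# "/home/user/*" is a placeholder path, not taken from a real system.
print(shlex.join(query_backup_args("/home/user/*", include_inactive=True)))
```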

Using the so-called "Point In Time" restore, you can select a point in time in the past and display all files or file versions that were "active" at that point in time. Please be aware, however, that according to our usage policy we only keep "inactive" data for 180 days or up to a maximum of 3 versions. You will therefore only see data for the selected point in time if it is still present in the backup system under these rules.
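
Whether a desired point in time still falls within the 180-day window can be checked with a few lines of date arithmetic. This is only a rough sketch: the function name is hypothetical, and the limit of 3 versions is not modelled, so a positive result does not guarantee that the wanted version still exists.

```python
from datetime import date, timedelta

INACTIVE_RETENTION_DAYS = 180  # retention period for inactive data stated above


def inactive_data_may_exist(point_in_time: date, today: date) -> bool:
    """Return True if the requested date lies within the retention window."""
    oldest_covered = today - timedelta(days=INACTIVE_RETENTION_DAYS)
    return oldest_covered <= point_in_time <= today


today = date.today()
print(inactive_data_may_exist(today - timedelta(days=30), today))
print(inactive_data_may_exist(today - timedelta(days=365), today))
```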

There is another point to consider when selecting the data to be restored. TSM distinguishes between a so-called "Standard Restore" and a "No Query Restore". The difference between the two methods is explained below. For TSM to execute a "No Query Restore", all of the following conditions must be met:

Only one complete directory branch is selected
No further restrictions are placed on the files to be restored.
No inactive files are selected from this branch.

On the command line, the following command results in a "No Query Restore":

restore /<path>/<to>/<directory>/* <destination> -subdir=yes

Additionally specifying any of the options inactive, latest, pick, fromdate, todate, volinformation or pit causes a "Standard Restore" to be executed instead.

In the GUI, selecting a single directory on the left without first selecting the "Display Inactive Files" field and without specifying any search filters via the "Magnifying Glass" results in a "No Query Restore". Everything else leads to a "Standard Restore".
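
The command-line rule can be summarized in a small classifier. This is a sketch under stated assumptions: it presumes that exactly one complete directory branch is being restored, it treats the client options pitdate and pittime as the "pit" case named above, and it does not recognize abbreviated option names. The function name is hypothetical.

```python
# Options that, according to the text above, lead to a "Standard Restore".
STANDARD_RESTORE_OPTIONS = {
    "inactive", "latest", "pick", "fromdate", "todate", "volinformation", "pit",
}


def restore_type(options: list[str]) -> str:
    """Classify a restore of one complete directory branch by its options."""
    for option in options:
        # Reduce e.g. "-pick" or "-subdir=yes" to the bare option name.
        name = option.lstrip("-").split("=", 1)[0].lower()
        # Assumption: pitdate and pittime count as the "pit" case.
        if name in STANDARD_RESTORE_OPTIONS or name.startswith("pit"):
            return "Standard Restore"
    return "No Query Restore"


print(restore_type(["-subdir=yes"]))
print(restore_type(["-subdir=yes", "-inactive"]))
```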

As soon as you start the restore on the client, TSM begins the restore procedure. The procedure differs slightly depending on whether a "Standard Restore" or a "No Query Restore" is performed.

In the case of a "Standard Restore", the client asks the server for a list of all existing files of the file system to be restored. The client searches this list for the files that match the specified search criteria and puts them in an order that requires as few tape changes as possible. The client passes the list to the server, which then reads the files and directories from the tapes in the specified order and sends them to the client.

Unfortunately, many TSM client versions contain a bug: during a "Standard Restore" the files are retrieved in an optimal order, but the directories are not. As a result, a restore may take a long time because the server constantly has to switch between tapes. On the client side it looks as if the restore is "hanging" for hours, because in a "Standard Restore" TSM restores the directory structure first and only then the files. You can read here whether you are affected by the problem. The fix is expected to be available starting with TSM client versions 6.3.3, 6.4.3 and 7.1.1 respectively. As a workaround, you should trigger a "No Query Restore" whenever possible if you want to restore larger directory structures.

In the case of a "No Query Restore", the client informs the server for which file branch a "No Query Restore" is to be performed. The server then searches its database for the files and directories to be restored and puts them in an order so that as few tape changes as possible are required. The server reads the data from tape and sends it to the client.

If possible, a "No Query Restore" should be preferred for large restores. First, a "Standard Restore" requires significantly more RAM because the list of files is processed in the client's RAM. Second, with a "No Query Restore" you can read from several tape drives in parallel (see tuning tips below).

Please also note that, regardless of the restore method, reading and sorting the list of data from the database may take some time, depending on the number of objects saved and to be restored. For example, if there are 25 million files to be restored, this process may take several hours. During this time you may get the impression that the TSM client has hung. The same impression can arise when all tape drives are currently occupied and the server has to wait for the next free drive. Canceling the restore job in this phase is counterproductive, because the next attempt has to start from the beginning. If you are unsure whether your client has hung during the restore, please do not hesitate to consult the LRZ service desk or the LRZ hotline before aborting. We can then check the status of your restore on the server side and ensure that resources and tape drives are made available to you with priority.

To minimize the probability that a large restore aborts because of a known client problem, we strongly recommend using a TSM client version that is as up to date as possible.

If a restore fails, you should first check whether a so-called "Restartable Restore" is still available on the server. To do this, select "Restartable Restores" in the GUI under "Actions" or use the query restore and restart restore commands on the command line. If a restartable restore is available, the client can continue exactly where it left off.

What are some ways to speed up a large restore?

Within certain limits, the recovery of large amounts of data can be accelerated. "Within certain limits" means that we have no direct influence on the duration of the search and sorting phase for the data to be restored. The only way to shorten this phase is to split the backup across several nodes or filespaces. However, this is only worthwhile for file systems with several tens of millions of files. The main lever for performance tuning is optimizing or parallelizing the data transfer.

The following values in the dsm.sys and dsm.opt configuration files have often proven beneficial in the past:

TCPBUFFSIZE    512
DISKBUFFSIZE   1023
TCPNODELAY     YES

Up to and including TSM 6.1:

TXNBYTELIMIT   2097152

From TSM 6.2:

TXNBYTELIMIT   20G

If your operating system supports "TCP Window Autotuning":

TCPWINDOWSIZE 0
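
The version-dependent choice between the values above can be written down as a small helper that emits the matching option lines. This is only a sketch: the function name and the representation of the client version as a (major, minor) tuple are assumptions made for illustration, and the values are exactly those listed above.

```python
def tuning_options(client_version: tuple[int, int], tcp_autotuning: bool) -> list[str]:
    """Return the option lines suggested above for a given client version."""
    lines = [
        "TCPBUFFSIZE    512",
        "DISKBUFFSIZE   1023",
        "TCPNODELAY     YES",
    ]
    # Up to and including TSM 6.1 the plain number is used, from 6.2 on "20G".
    if client_version <= (6, 1):
        lines.append("TXNBYTELIMIT   2097152")
    else:
        lines.append("TXNBYTELIMIT   20G")
    # Only set when the operating system supports TCP Window Autotuning.
    if tcp_autotuning:
        lines.append("TCPWINDOWSIZE  0")
    return lines


print("\n".join(tuning_options((7, 1), tcp_autotuning=True)))
```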

Another option is to use several tape drives in parallel for a "No Query Restore". However, this only makes sense if your network connection and your storage system can sustain correspondingly high data rates. As a rule of thumb, this can be advantageous from about 60 MB/s upward. To allow TSM to parallelize a "No Query Restore", you have to specify the maximum number of parallel sessions in dsm.opt or dsm.sys as follows:

RESOURCEUTILIZATION <number of parallel sessions 1-10>

At the same time, you must contact us via the service desk or the hotline so that we can enable your node for parallel restores on the server side. It is best to mention that you want the MAXNUMMP parameter adjusted for your node. The value of MAXNUMMP must equal the RESOURCEUTILIZATION value. For operational safety, MAXNUMMP is set to 1 by default so that a misconfigured client cannot occupy excessive resources and possibly bring the entire system down. If only the RESOURCEUTILIZATION parameter is set higher than 1, errors will occur during restores and the system will only be partially restored. If you have a legitimate need, we can also set the MAXNUMMP parameter for individual nodes permanently to a value greater than 1, independently of any specific restore. If you consider this necessary, please contact the service desk.
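
Before starting a parallel restore, it can help to compare the client setting with the MAXNUMMP value confirmed by the service desk. The following is a minimal sketch with hypothetical function names. It assumes the option is written out in full in the configuration text and that the MAXNUMMP value is entered by hand, since it is set on the server side and cannot be read from the client configuration.

```python
from typing import Optional


def read_resourceutilization(config_text: str) -> Optional[int]:
    """Return the RESOURCEUTILIZATION value from option file text, if set."""
    for line in config_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "RESOURCEUTILIZATION":
            return int(parts[1])
    return None


def check_parallel_restore(config_text: str, maxnummp: int) -> str:
    """Compare the client setting with the MAXNUMMP value given by the LRZ."""
    sessions = read_resourceutilization(config_text)
    if sessions is None:
        return "RESOURCEUTILIZATION is not set"
    if sessions == maxnummp:
        return "ok"
    return f"mismatch: RESOURCEUTILIZATION={sessions}, MAXNUMMP={maxnummp}"


example_config = "TCPNODELAY     YES\nRESOURCEUTILIZATION 4\n"
print(check_parallel_restore(example_config, maxnummp=4))
print(check_parallel_restore(example_config, maxnummp=1))
```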

I have problems with the restore - what can I do?

A major system failure is often an exceptionally stressful situation for you as an administrator. In addition to repairing the hardware and operating system and reassuring the users, you are also faced with restoring the data. This is often a task that has never been thought through beforehand. The last thing you need in this situation is problems restoring your data.

The best, and indeed the only, way to rule out restore problems in advance as far as possible is to test restores regularly. This way, problems can be detected and fixed in a non-critical situation. As an added bonus, you are already familiar with the procedure when it matters and can approach the restore calmly.

We realize that, due to other priorities, you will unfortunately often lack the time or resources for such tests. However, every year we see restores that are considerably delayed by problems that were not recognized in advance, because the cause of the error has to be found before the restore can be completed successfully.

Please contact us immediately in such a case. Often, and as long as the appropriate resources are available, we will first perform the restore locally at the LRZ and provide you with the data via a network share or iSCSI volume so that your users can access their data again.