DSS Understanding Data Science Storage Container Backup Retention

This article describes the semantics of the backup retention policies for data science storage container backups.

Overview

To keep the backup workload to a minimum, LRZ's Backup and Archive System copies only the data that has changed on the source system since the last backup run. Every change to a file since the last backup run results in a new version of that file being stored in the backup system.

Unfortunately, we cannot keep all versions of a file forever. The backup system therefore applies rules that define when old versions of files expire from the backup. These rules operate on both age and version count. Note that the most recent version of a file that still existed on the source system at the time of the last backup run (the so-called active version) never expires. As soon as an active version is replaced by a new version, or the file is deleted on the source system, the backup system marks that version as inactive. It is then subject to the retention policy rules, which are defined by the following parameters:

  • VERExists – Maximum number of versions to keep of a file that is still present on the source system.
  • VERDeleted – Maximum number of versions to keep of a file that has already been deleted from the source system.
  • RETExtra – Number of days a backed-up version of a file is kept after it has been marked inactive.
  • RETOnly – Number of days the last backed-up version of a file is kept after it has been marked inactive.

As soon as a backed-up version of a file exceeds any of the above limits, it is purged from the backup.

Data Science Storage BACKUP policies use the following settings:

  • VERExists = 3
  • VERDeleted = 3
  • RETExtra = 180 days
  • RETOnly = 180 days

This means that we store at most 3 versions of a file, and keep inactive versions for at most 180 days.
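As a rough illustration of how the version-count and age limits interact, the rules can be modelled in a few lines of Python. This is a simplified sketch, not the backup system's actual expiration logic. The names `Version` and `surviving_versions` are hypothetical, and the model makes several assumptions: the version count includes the active version, RETOnly applies to the newest version of a deleted file, RETExtra applies to all other inactive versions, and a version expires once it has been inactive for more than the configured number of days. The defaults correspond to the BACKUP settings above; `None` stands for UNLIMITED.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Version:
    backed_up: date                 # day this version was stored in the backup
    inactive_since: Optional[date]  # None while this is the active version


def surviving_versions(
    versions: List[Version],
    today: date,
    file_exists: bool,
    ver_exists: Optional[int] = 3,   # VERExists (None = unlimited)
    ver_deleted: Optional[int] = 3,  # VERDeleted (None = unlimited)
    ret_extra: int = 180,            # RETExtra in days
    ret_only: int = 180,             # RETOnly in days
) -> List[Version]:
    """Return the versions that are still kept on `today` (simplified model)."""
    ordered = sorted(versions, key=lambda v: v.backed_up, reverse=True)
    max_versions = ver_exists if file_exists else ver_deleted
    kept = []
    for index, version in enumerate(ordered):
        if version.inactive_since is None:
            kept.append(version)  # the active version never expires
            continue
        if max_versions is not None and index >= max_versions:
            continue  # expired: too many versions (assumed to count the active one)
        # Assumption: RETOnly governs the newest version of a deleted file,
        # RETExtra governs every other inactive version.
        last_of_deleted = index == 0 and not file_exists
        limit_days = ret_only if last_of_deleted else ret_extra
        if (today - version.inactive_since).days > limit_days:
            continue  # expired: inactive for too long
        kept.append(version)
    return kept


# A file that changed three times and still exists on the source system.
history = [
    Version(date(2024, 6, 1), None),               # active version
    Version(date(2024, 5, 1), date(2024, 6, 1)),
    Version(date(2024, 4, 1), date(2024, 5, 1)),
    Version(date(2023, 11, 1), date(2024, 4, 1)),  # fourth version: over the limit
]
kept = surviving_versions(history, today=date(2024, 7, 1), file_exists=True)
print(len(kept))  # 3
```

In this model the oldest version is dropped because of the version limit, even though it has been inactive for less than 180 days. Running the same history with a later `today` shows the age limit removing the remaining inactive versions while the active version stays.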

Data Science Storage ARCHIVE policies use the following settings:

  • VERExists = UNLIMITED
  • VERDeleted = UNLIMITED
  • RETExtra = 10 years
  • RETOnly = 10 years

This means that we store an unlimited number of versions of a file, and keep inactive versions for at most 10 years.

Please note that ARCHIVE policies must only be used for static files that do not change, such as simulation results or instrument output like genomic sequences or microscope data. We expect that usually at most a single version of a file is stored in the archive, and we will monitor this. However, because we value your data and want to protect it against unexpected events, for example a crypto-locker or anything else that may silently corrupt your data on disk, we make sure that the initial "good" archive version does not expire because of versioning.

Using an ARCHIVE policy for live data that still changes generates massive amounts of data in the backup system, which we cannot handle and which causes high costs. We therefore have to treat this as misuse of our system. Please note that in such cases we reserve the right to claim financial compensation from the container owners, or even to withdraw your right to use the DSS archive function and delete the archived data.

