11. Public Datasets and Containers on the LRZ AI Systems

When developing new AI methods or evaluating existing ones, ML/AI researchers and scientists routinely use public datasets. Often the very same datasets are used by several research groups, each of which downloads its own copy to its own storage. For example, more than one research group might download the AlphaFold database needed for predicting 3D protein structures (see https://alphafold.ebi.ac.uk/, >2TB). This situation has previously led to data replication and wasted storage capacity for both users and LRZ.

To avoid the situation described above, the LRZ AI Systems offer a dedicated Data Science Storage (DSS) container aimed at storing public datasets as well as, potentially, Enroot container images of interest to more than one researcher.

Available datasets and Enroot images

11.0 Available Public Datasets

11.1 Available Enroot Container Images (currently none provided, see below for requests)

How to request the addition of public datasets

Users interested in a particular dataset need to:

  • make sure the dataset is licensed for public usage and requires no individual license or registration
  • open a ticket with the LRZ Servicedesk, providing the location of the dataset and a justification for public interest (including the expected target audience)
  • provide clear instructions for downloading it (ideally in the form of a shell script)
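The download instructions requested in the last point could take the form of a short script such as the following. This is only a minimal sketch: the URL, the default target directory, and the DRY_RUN switch are hypothetical placeholders and not LRZ requirements, and curl stands in for whatever download tool the dataset provider recommends.

```shell
#!/usr/bin/env bash
# Sketch of a download script to attach to a dataset request.
# DATASET_URL, the default DOWNLOAD_DIR and DRY_RUN are hypothetical placeholders.
set -euo pipefail

DATASET_URL="${DATASET_URL:-https://example.org/dataset.tar.gz}"  # placeholder URL
DOWNLOAD_DIR="${1:-./dataset}"                                    # target directory
DRY_RUN="${DRY_RUN:-1}"                                           # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create the target directory and fetch a single archive into it.
download_dataset() {
  local url="$1" dir="$2"
  run mkdir -p "$dir"
  run curl --fail --location --output "$dir/$(basename "$url")" "$url"
}

download_dataset "$DATASET_URL" "$DOWNLOAD_DIR"
```

By default the script only prints the commands it would run, so it can be reviewed before any data is transferred; setting DRY_RUN=0 performs the actual download.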

An example request is as follows:

Please specify your incident/request: AI topics
Please choose an AI category: Request new Dataset offer

Description: The AlphaFold dataset (https://alphafold.ebi.ac.uk/), which requires >2TB of storage, is becoming popular for protein structure prediction within the ML community. This dataset is used in the methods x and y.

The dataset is publicly available (https://github.com/deepmind/alphafold#genetic-databases). It can be downloaded with the scripts provided at https://github.com/deepmind/alphafold/tree/main/scripts/. The instructions for doing this are:

  • install the aria2c dependency
  • execute 
    $ bash scripts/download_all_data.sh <DOWNLOAD_DIR>

Acceptance and implementation are subject to feasibility and available resources.

How to request Enroot images on the AI systems

Users interested in a particular image need to:

  • make sure the image is licensed for public usage and requires no individual license or registration
  • make sure the image is not already available directly from NVIDIA NGC, Docker Hub, or another public repository
  • open a ticket with the LRZ Servicedesk, providing the location of the Dockerfile for building the image and a justification for public interest (including the expected target audience)
  • provide clear instructions for building the image (in case it deviates from the standard procedure)
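Build instructions for an image could be given as a short script such as the following. This is only a sketch under stated assumptions: it presumes a machine where both Docker and Enroot are installed, the image name and tag are hypothetical placeholders, and the procedure LRZ actually uses to build images may differ.

```shell
#!/usr/bin/env bash
# Sketch of build instructions to attach to an image request.
# IMAGE_NAME, IMAGE_TAG and DRY_RUN are hypothetical placeholders.
set -euo pipefail

IMAGE_NAME="${IMAGE_NAME:-myimage}"  # placeholder image name
IMAGE_TAG="${IMAGE_TAG:-latest}"     # placeholder tag
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build the Docker image from the Dockerfile in the given context directory,
# then convert it from the local Docker daemon into an Enroot image file.
build_enroot_image() {
  local name="$1" tag="$2" context="$3"
  run docker build --tag "$name:$tag" "$context"
  run enroot import --output "$name.sqsh" "dockerd://$name:$tag"
}

build_enroot_image "$IMAGE_NAME" "$IMAGE_TAG" "."
```

As written, the script only prints the two commands; setting DRY_RUN=0 runs them and produces a .sqsh file named after the image.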

Acceptance and implementation are subject to feasibility and available resources.