6. Running Applications as Interactive Jobs on the LRZ AI Systems

Interactive jobs can be executed within an existing allocation of resources. Use the sinfo command for an overview of the available resources (partitions and the current state of their nodes), as in the example below.
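
For instance, the following lists only the partition used in the remaining examples (running sinfo without arguments lists all partitions):

$ sinfo -p lrz-v100x2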

Resources are allocated with the salloc command. For example, to request an allocation within the lrz-v100x2 partition, type:

$ salloc -p lrz-v100x2 --gres=gpu:1

The --gres=gpu:1 argument above requests a single GPU for the allocation. It is a required argument for all the partitions described in 1. General Description and Resources except lrz-cpu.
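
For example, to request two GPUs on the same node (assuming the nodes of the partition provide at least two), or a CPU-only allocation on the lrz-cpu partition:

$ salloc -p lrz-v100x2 --gres=gpu:2
$ salloc -p lrz-cpu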

Interactive jobs are submitted to an existing allocation of resources using the srun command. The following example executes the command bash on the allocated node.

$ srun --pty bash
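
Any command can be run the same way. For instance, a quick check that the allocated GPU is visible from within the job (assuming the standard NVIDIA utility nvidia-smi is available on the node):

$ srun nvidia-smi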

Additionally, the command can be executed within an Enroot container. The SLURM installation on the LRZ AI Systems enables this via a plugin called pyxis (see https://github.com/NVIDIA/pyxis for documentation and the extra options it adds to srun). The recommended approach is to use a container image (from Docker Hub, NGC, or stored locally) that provides all the required libraries. If no such image exists, you can create your own by extending an existing one as described in our guide 9. Creating and Reusing a Custom Enroot Container Image. Once you have the image location URI on a container registry, or the path to a locally stored image, pass it to srun via the --container-image argument. SLURM takes care of transparently creating the container from that image and executing the command of your choice within it. The following example runs bash, this time within a container created from an image that provides PyTorch and comes from the nvcr.io registry.

$ srun --pty --container-mounts=./data-test:/mnt/data-test \
  --container-image='nvcr.io#nvidia/pytorch:22.12-py3' \
  bash

The --container-mounts option in the previous example mounts a directory from the host into the container. Here, the folder data-test in the current working directory is mounted at /mnt/data-test within the container.
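
If you have already imported an image into a local file (see guide 9), you can pass its path to --container-image instead of a registry URI. The path below is only a placeholder:

$ srun --pty --container-image=$SCRATCH/pytorch.sqsh bash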

Additionally, the --container-name argument allows tagging and reusing a container during the same job allocation (i.e., within the scope of a single salloc), as sketched below. The --container-name option is not intended to take effect across job allocations (see https://github.com/NVIDIA/pyxis/issues/30#issuecomment-717654607 for details).
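
A minimal sketch of this pattern: the first srun creates and names the container, and the second srun reuses it within the same allocation without creating it again:

$ srun --container-image='nvcr.io#nvidia/pytorch:22.12-py3' \
  --container-name=pytorch true
$ srun --pty --container-name=pytorch bash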