5.1 Slurm Interactive Jobs
Interactive jobs allow you to work directly within an allocated set of resources.
To view available resources, such as partitions and the current status of compute nodes, use the sinfo command.
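For example, to list the nodes and their state in a specific partition (here the lrz-v100x2 partition used in the examples below):
sinfo -p lrz-v100x2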
Allocate Resources
Allocate resources interactively with the salloc command. For example, to request one GPU in the lrz-v100x2 partition, run:
salloc -p lrz-v100x2 --gres=gpu:1
The --gres=gpu:1 option explicitly requests one GPU and is essential when working with GPU-enabled partitions.
Without this option, GPU resources will not be allocated.
You can adjust the number to request multiple GPUs, e.g., --gres=gpu:2 for two GPUs.
This option is required on all GPU partitions; the only partition that can be used without it is lrz-cpu, which does not provide GPUs.
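For example, to request two GPUs in the same partition:
salloc -p lrz-v100x2 --gres=gpu:2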
Launch Interactive Jobs
Interactive jobs can be launched within an existing allocation using the srun command.
For example, the following command opens an interactive bash shell directly on the allocated compute node:
srun --pty bash
This gives you interactive access to the compute node, including all allocated resources.
You can now run commands directly and also use tools like Enroot, which are available within the compute node environment.
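For example, on a GPU node you can check that the allocated GPU is visible (assuming the NVIDIA driver tools are available on the node):
nvidia-smi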
Note: When the allocation ends, you will be returned to your login shell.
If the interactive shell is still running, you may need to exit it manually with the exit command to return fully to your original environment.
At this point, running an Enroot container using enroot import, enroot create, and enroot start, as described in Section 4.1 Enroot - Introduction,
is a natural next step to define and control your software environment within the interactive session.
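As a minimal sketch, assuming the PyTorch image from NGC that is also used in the example below (the name of the generated .sqsh file may differ depending on the image):
enroot import docker://nvcr.io#nvidia/pytorch:22.12-py3
enroot create nvidia+pytorch+22.12-py3.sqsh
enroot start nvidia+pytorch+22.12-py3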
Launch Interactive Jobs with Enroot
You can also run interactive jobs inside an Enroot container.
The Slurm setup on the LRZ AI systems supports this via the Pyxis plugin.
To do this, first choose a suitable container image. It can come from NGC or Docker Hub, or it can be a local image.
Make sure the image includes all the required software and libraries.
If no such image exists, you can create your own by extending a base image.
For guidance, see our documentation: 4.3 Enroot - Custom Images.
Once you have the image URI or local path, pass it to srun using the --container-image option.
Slurm will automatically create and launch a container from that image and run the specified command inside it.
For example, the following command starts an interactive bash shell in a container from the NGC repository that includes PyTorch:
srun --pty --container-mounts=./data-test:/mnt/data-test --container-image='nvcr.io#nvidia/pytorch:22.12-py3' bash
The --container-mounts option binds a host directory into the container at the specified path, optionally with mount flags such as ro (read-only).
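For example, to make the same bind mount read-only inside the container:
srun --pty --container-mounts=./data-test:/mnt/data-test:ro --container-image='nvcr.io#nvidia/pytorch:22.12-py3' bash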
Additionally, the --container-name option assigns a name to the container so it can be reused within the same job allocation.
Unnamed containers are deleted when the job ends, while named containers persist and can be reused within the same job allocation; they are not designed to persist across separate job allocations.
For details on this limitation and the reasoning behind it, see this discussion.
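As an illustration (the container name pytorch-work is arbitrary), the first command below creates a named container and the second reuses it within the same job allocation without importing the image again:
srun --pty --container-image='nvcr.io#nvidia/pytorch:22.12-py3' --container-name=pytorch-work bash
srun --pty --container-name=pytorch-work bash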