5. Slurm

Slurm is an open-source resource manager and job scheduler for Linux clusters of any size.
It allocates computing resources, runs and monitors jobs, and manages job queues, with optional plugins for advanced scheduling and accounting.

S-Commands

The following Slurm commands cover most day-to-day work: checking resources, submitting jobs, monitoring progress, and managing running or completed jobs.

  • sinfo - Show available partitions, nodes, and their status.
  • squeue - Display currently queued and running jobs.
  • srun - Run a command on allocated resources in real time, either interactively or as a job step inside a batch script or allocation.
  • sbatch - Submit a batch job script to the scheduler.
  • scancel - Cancel a running or pending job.
  • salloc - Allocate resources for an interactive job session.
  • sacct - View job accounting and usage information.
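As a rough illustration of how the read-only commands above fit together, the sketch below checks cluster state, lists your own jobs, and summarizes your recent job history. The guard around the commands and the `slurm_checked` variable are illustrative additions, not part of Slurm; they only let the snippet exit cleanly on a machine where the Slurm client tools are not installed.

```shell
#!/bin/bash
# Read-only checks: nothing here submits, modifies, or cancels a job.
# The guard and the slurm_checked variable are illustrative, not part of Slurm.
if command -v sinfo >/dev/null 2>&1; then
    # Partitions, node counts, and node states.
    sinfo

    # Only your own pending and running jobs.
    squeue -u "$(id -un)"

    # Your recent jobs, one line per allocation (-X hides individual job steps).
    sacct -X --format=JobID,JobName,State,Elapsed

    slurm_checked=yes
else
    echo "Slurm client commands not found; run this on a cluster login node." >&2
    slurm_checked=no
fi
```

The job ID shown by squeue (and printed by sbatch at submission time) is the value that scancel and sacct -j expect when you want to act on one specific job.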

Key Points

The following key points describe important details specific to the Slurm setup on the AI Systems.
Make sure to follow these conventions when submitting or running jobs to ensure your workloads start and run correctly.

  • Always specify the number of GPUs when requesting resources with --gres=gpu:<number_of_GPUs>, for example --gres=gpu:1 for a single GPU.

  • For individual jobs and allocations, the default time limit is 1 hour, and the maximum time limit is 2 days.
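The two conventions above can be combined in a batch script along the following lines. This is a minimal sketch, not a site-provided template: the file name gpu_job.sh, the job name, the 4-hour time request, and the placeholder workload are all assumptions chosen for illustration. The --time option is the standard Slurm way to request a limit other than the default.

```shell
#!/bin/bash
# Write a minimal batch script. File name, job name, time value, and
# workload are illustrative assumptions, not site requirements.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=gpu-test
# Always request GPUs explicitly; here a single GPU.
#SBATCH --gres=gpu:1
# Ask for 4 hours instead of the 1-hour default (the maximum is 2 days).
#SBATCH --time=04:00:00
# Log file named after the job name (%x) and job ID (%j).
#SBATCH --output=%x-%j.out

# Placeholder workload: report which node the job landed on.
hostname
EOF

# Submit only where the Slurm client tools are available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch gpu_job.sh
fi
```

The same two options apply to interactive work, for example salloc --gres=gpu:1 --time=02:00:00 to obtain an allocation, or srun --gres=gpu:1 --pty bash to open a shell on a GPU node. A time request above the 2-day maximum should be expected to be rejected or left pending, depending on how the scheduler is configured.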

Overview