5. Slurm
Slurm is an open-source resource manager and job scheduler for Linux clusters of any size.
It allocates computing resources, runs and monitors jobs, and manages job queues, with optional plugins for advanced scheduling and accounting.
Key Points
Always specify the number of GPUs when requesting resources by using the --gres=gpu:<number_of_GPUs> option (for example, --gres=gpu:1).
For individual jobs (allocations) the default time limit is 1 hour, and the maximum time limit is 2 days.
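The two points above can be combined in a batch script. The following is a minimal sketch, not this cluster's official template: the job name and the 2-hour time request are illustrative assumptions, and the script assumes the cluster's GPU plugin exports CUDA_VISIBLE_DEVICES, as Slurm typically does for --gres=gpu allocations.

```shell
#!/bin/bash
#SBATCH --job-name=gpu-example   # hypothetical job name, chosen for illustration
#SBATCH --gres=gpu:1             # request one GPU, as the key points require
#SBATCH --time=02:00:00          # 2 hours: above the 1-hour default, below the 2-day maximum

# Assumption: Slurm sets CUDA_VISIBLE_DEVICES for GPU allocations.
# Fall back to "none" so the script also runs outside an allocation.
gpus="${CUDA_VISIBLE_DEVICES:-none}"

# SLURM_JOB_ID is set by Slurm inside a job; "unknown" is used otherwise.
echo "Job ${SLURM_JOB_ID:-unknown} sees GPUs: ${gpus}"
```

Saved under a hypothetical name such as gpu_job.sh, the script would be submitted with sbatch gpu_job.sh. Setting --time explicitly avoids the job being stopped at the 1-hour default.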
Overview