Job farming with SLURM

Decision Matrix

Workload \ Method | Job Arrays | Serial workers with srun --multi-prog | Serial workers with mpiexec | Parallel workers with srun | Parallel workers with mpiexec (one node) | pexec | redis/doRedis
------------------|------------|---------------------------------------|-----------------------------|----------------------------|------------------------------------------|-------|--------------
Long-running parallel jobs with a large number of nodes per worker | yes | no | no | no | no | no | no
Parallel workers | yes | no | no | yes | yes | no | yes, with redisexec
Serial workers | no | yes | yes | yes | yes | yes | yes (R or Python)
Number of workers = number of allocated cores | no | yes | yes | yes | yes | yes | yes
Number of workers > number of allocated cores | no | no | no | yes | no | yes | yes
Number of workers >> number of allocated cores | no | no | no | no | no | yes | yes
(Very) unbalanced workers | yes | no | no | yes | no | yes | yes

Task identifier

Depending on the method, the following environment variables can be used to distinguish between tasks (see the minimal example after this list):

  • SLURM_ARRAY_TASK_ID: for Array jobs
  • SLURM_PROCID: The MPI rank (or relative process ID) of the current process (with srun)
  • SLURM_LOCALID: Node-local task ID for the process within a job (with srun)
  • SLURM_STEPID: The step ID of the current job (with srun)
  • PMI_RANK: The MPI rank (or relative process ID) of the current process with Intel MPI (with mpiexec)
  • SUBJOB: the task identifier within pexec
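
For example, a minimal wrapper script started with srun could use SLURM_PROCID to pick its input file (./myprog and the input/output file naming are placeholders, not part of SLURM):

#!/bin/bash
# each task reads and writes files named after its rank; with the other methods,
# substitute SLURM_ARRAY_TASK_ID, PMI_RANK or SUBJOB accordingly
./myprog <input.$SLURM_PROCID >output.$SLURM_PROCID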

Methods

Job Arrays

Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with many tasks can be submitted in milliseconds. All jobs must have the same initial options (e.g. size, time limit, etc.). Jobs in an array have additional environment variables set:

  • SLURM_ARRAY_JOB_ID will be set to the first job ID of the array.
  • SLURM_ARRAY_TASK_ID will be set to the job array index value.
  • SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array.
#SBATCH --nodes=100
#SBATCH --ntasks=4800
#SBATCH --array=1-10
#same program but different input data
mpiexec -n  $SLURM_NTASKS ./myprog <in.$SLURM_ARRAY_TASK_ID


Combining job arrays with the other methods below is possible (e.g. with "Multiple parallel workers with srun" or "srun --multi-prog"); a sketch follows.
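
A minimal sketch of such a combination, assuming one multi-prog configuration file per array task (the file names example.conf.1, example.conf.2, ... are placeholders):

#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --array=1-10
# each array task runs its own multi-prog configuration
srun -n $SLURM_NTASKS --multi-prog example.conf.$SLURM_ARRAY_TASK_ID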

Multiple serial workers with srun --multi-prog

Run a job with different programs and different arguments for each task. In this case, the executable program specified is actually a configuration file specifying the executable and arguments for each task. The number of work tasks is limited by the number of SLURM tasks.

  • Task rank: One or more task ranks to use this configuration. Multiple values may be comma separated. Ranges may be indicated with two numbers separated with a '-' with the smaller number first (e.g. "0-4" and not "4-0"). To indicate all tasks not otherwise specified, specify a rank of '*' as the last line of the file.
  • Executable: The name of the program to execute
  • Arguments: The expression "%t" will be replaced with the task's number (SLURM_TASKID). The expression "%o" will be replaced with the task's offset within this range (e.g. a configured task rank value of "1-5" would have offset values of "0-4", cf. SLURM_LOCALID). Single quotes may be used to avoid having the enclosed values interpreted.
cat example.conf
################################################################### 
4-6 hostname 
1,7 echo task:%t 
2-3 echo offset:%o
0 bash myscript <Input.%t

srun -n 8 --multi-prog example.conf

Multiple serial workers with mpiexec

a) only a few commands

mpiexec -n 1 ./first-script : -n 2 ./second-script : -n 3 ./third-script 

b) many commands using PMI_RANK

cat workers
#!/bin/bash
if [ $PMI_RANK -lt 10 ]
then
    ./firstten <input.$PMI_RANK >output.$PMI_RANK
else
    ./manyothers  <input.$PMI_RANK >output.$PMI_RANK
fi
chmod +x workers
mpiexec ./workers

Multiple parallel workers with srun

srun can be used as a resource manager within a job allocation, and this also works with OpenMP threading. The following script runs multiple job steps in parallel within an allocated set of nodes. Currently, we recommend issuing a small sleep between the submissions of the tasks. The sum of the nodes involved in the job steps should not be larger than the number of allocated nodes of the job. Here we provide an example for a script where 128 work units have to be performed and up to 10 workers run in parallel, each submitting one sub-job of 2 nodes per unit at a time. Therefore we need to allocate 20 nodes for the whole job. In this example, each sub-job uses 8 tasks per node (thus 16 tasks per worker in total) and 6 cpus-per-task for OpenMP threading. For that we export, as usual, the value of OMP_NUM_THREADS and invoke srun with the -c $SLURM_CPUS_PER_TASK option. Remove both (or set cpus-per-task to 1) and adjust ntasks-per-node for running without OpenMP.

...
#SBATCH --nodes=20
#SBATCH --ntasks-per-node=8 
#SBATCH --cpus-per-task=6
...
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
...

# Set just the following two values
MY_NODES_PER_UNIT=2
MY_WORK_UNITS=128

# Just algebra, these need no changes
MY_WORKERS=$(( $SLURM_JOB_NUM_NODES / $MY_NODES_PER_UNIT ))
MY_TASKS_PER_WORKER=$(( $SLURM_NTASKS_PER_NODE * $MY_NODES_PER_UNIT ))

for (( I=1; I<=$MY_WORK_UNITS; I++ )) ; do # Count the work units that started
  while true ; do # Scan for a free worker
    SUBJOBS=`jobs -r | wc -l` # detect how many subjobs are already running
    if [ $SUBJOBS -lt $MY_WORKERS ] ; then  # submit only if at least one worker is free
      sleep 4 # wait before any submission
      # wrapper could also be an MPI program, `-c $SLURM_CPUS_PER_TASK` is only for OpenMP
      srun -N $MY_NODES_PER_UNIT -n $MY_TASKS_PER_WORKER -c $SLURM_CPUS_PER_TASK -J subjob.$I ./wrapper $I >OUT.$I &
      break # So "I" will get +1
    else
      sleep 4 # all workers busy, wait a moment before checking again
    fi
  done
done

wait # for the last pool of work units

exit

Here the important points to note are that we background each srun with the "&", and then tell the shell to wait for all child processes to finish before exiting.
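
The ./wrapper called above is not part of SLURM; a minimal sketch, assuming your real program is called ./mysolver and reads an input file named after the work unit index (both placeholders):

#!/bin/bash
# hypothetical wrapper: $1 is the work unit index passed by the farming loop;
# stdout is captured via the redirection on the srun line
./mysolver <input.$1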

If the sub-jobs are well balanced you can, of course, do the following:

srun -N 2 -n 16 -c 6 -J subjob.1 ./wrapper 1 >OUT.1 &
srun -N 2 -n 16 -c 6 -J subjob.2 ./wrapper 2 >OUT.2 &
srun -N 2 -n 16 -c 6 -J subjob.3 ./wrapper 3 >OUT.3 &
...
srun -N 2 -n 16 -c 6 -J subjob.10 ./wrapper 10 >OUT.10 &
wait
# now do the next batch of jobs
srun -N 2 -n 16 -c 6 -J subjob.11 ./wrapper 11 >OUT.11 &
...

Multiple parallel workers with mpiexec (within a node)

It is not possible to run more than one MPI program concurrently using the normal startup. However, within a node you can start an arbitrary number of mpiexec instances that communicate via shared memory. Typically, you have to specify the processor list for pinning in order to avoid overlap between the individual programs (although you can let them overlap if you need to).

#First create a hostfile with the entry localhost
for I in `seq 0 95`
do
  echo localhost
done >localhost

#prepare the MPI environment
unset I_MPI_HYDRA_BOOTSTRAP
unset I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
export I_MPI_FABRICS=shm
export I_MPI_HYDRA_HOST_FILE=localhost
#now you can start multiple mpiexecs within the node
#load a test program
module load lrztools
I_MPI_PIN_PROCESSOR_LIST=0-9    mpiexec -n 10 placementtest-mpi.intel >OUT.1 &
I_MPI_PIN_PROCESSOR_LIST=10-19  mpiexec -n 10 placementtest-mpi.intel >OUT.2 &
I_MPI_PIN_PROCESSOR_LIST=20-29  mpiexec -n 10 placementtest-mpi.intel >OUT.3 &
I_MPI_PIN_PROCESSOR_LIST=30-39  mpiexec -n 10 placementtest-mpi.intel >OUT.4 &
I_MPI_PIN_PROCESSOR_LIST=40-47  mpiexec -n 8  placementtest-mpi.intel >OUT.5 &

wait

Here the important points to note are that we background each mpiexec with the "&", and then tell the shell to wait for all child processes to finish before exiting.

If the tasks are not run in the background, they will run one after the other. Likewise, if the memory is not divided between them, the first call can take the entire allocation and prevent the others from starting, which again results in sequential execution of the mpiexec calls.

If you want several of these mpiexec groups running on different nodes, you can pack the second part of the script above into a shell script, make it executable, and execute it on each of your allocated nodes with srun (as described in the previous section); see the sketch below.
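
A minimal sketch of this, assuming the hostfile creation, the I_MPI settings, and the backgrounded mpiexec calls from above (including the final wait) have been packed into an executable script named node-farm.sh (a hypothetical name):

# start one instance of the node-local farming script on each allocated node
srun -N $SLURM_JOB_NUM_NODES --ntasks-per-node=1 ./node-farm.sh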

Serial commands with pexec from lrztools

pexec takes a configuration file with serial commands. The number of worker tasks may be much larger than the number of allocated cores. Within a script or wrapper, the environment variable $SUBJOB may be used to distinguish between tasks. The next free core will take the next task in the list.

module load lrztools

cat wrapper
#!/bin/bash
./serial <input.$SUBJOB >output.$SUBJOB

cat tasklist
./serial-script <input.1 >output.1
./serial-exe <input.2 >output.2
...
./serial-command <input.1000 >output.1000
./wrapper
./wrapper

mpiexec -n 150  pexec tasklist

Using R and redis

A simple worker queue for R functions, using redis as the database.

Redisexec can be used in single node mode (default) or in MPI mode (used if 'nodespertask'>1 or 'forcempi'=1). In MPI-mode, redisexec will automatically split up the SLURM host file to create MPI groups of size 'nodespertask' (one of redisexec's arguments). Currently, only Intel-MPI is supported.

For more documentation, please see the GitHub page.