Jobfarm - a job-farming framework using julia pmap

What is Jobfarm

Jobfarm is based on Julia's pmap function, wrapped in a bash convenience framework. A list of tasks in an ASCII file is scheduled and distributed onto independent worker pipelines, whose resources (CPUs, memory) are controlled via the Slurm job configuration.

It is assumed that the user can split the overall problem into smaller tasks (at most shared-memory parallel) that can be executed independently. Complex task dependencies are not supported, so other tools need to be used for such scenarios.

In contrast to simple srun job farming, jobfarm provides a rudimentary bookkeeping system that allows for conveniently resuming after a restart (e.g. after a Slurm job abort caused by a node failure, or after a timeout).

Getting started

The jobfarm module can be loaded via

> module load jobfarm
> jobfarm -h

 Usage: jobfarm [Options]
        jobfarm [CMD] [Options] [input-file]

        CMD    : start      : to start a jobfarming job
                 status     : check status of jobfarming job
                 analyse    : [TODO]
        Options : -h, --help : print help

 For command help: jobfarm [CMD] -h

jobfarm supports tab completion. The tool has three modes, of which start and status are the most useful.

jobfarm start <task-db> must be executed within a Slurm job (interactive or batch; SLURM_NTASKS must be set) and starts the task scheduling (internally via srun). jobfarm status <task-db> is used on a login node to inspect the farming job's progress.
An example (also found under $JOBFARM_BASE/test/cm2_std/1CPU.tasks.slurm) illustrates the usage of this tool.

myFarmJob_CM2.slurm
#!/bin/sh
#SBATCH -D .
#SBATCH -o log.%N.%x.%j.out
#SBATCH -J JobFarm_Test
#SBATCH --get-user-env
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --qos=cm2_std
#SBATCH --nodes=3                                                                       # set resource demands as required
#SBATCH --tasks-per-node=28
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=none
#SBATCH --export=NONE
#SBATCH --time=0:05:00

module load slurm_setup

module load jobfarm                                                                     # load jobfarm module

taskdb="task-list.txt"

if [ ! -f $taskdb ]; then                                                               # create the task database unless it already exists
  cmd="/lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only -d "
  for i in $(seq 1 100); do
    echo "$cmd $(( RANDOM % 20 + 30 ))" >> $taskdb     # 40 +/- 10 seconds
  done
fi

jobfarm start $taskdb                                                                   # start jobfarm execution

This job script can be submitted via sbatch, and jobfarm status task-list.txt is used to monitor the job's progress.

> sbatch myFarmJob_CM2.slurm
> jobfarm status task-list.txt

 STATUS Farming Job: task-list.txt:
       RUNNING (SLURM Job ID : 80687; anticipated endtime: 2021-07-21T15:36:20)

    Processed Tasks: 6 of 100 (6%)
       [ success  failed  total ] = [ 6  0  100 ]

    Task Statistics (minutes):
       [ Min  Mean  Median  Max ] = [ 0.62 0.73 0.74 0.80 ]

    List of failed Tasks: 

If a Slurm job is currently running, this is indicated together with the Slurm job ID. Currently, only cm2, cm2_tiny, mpp3 and inter are supported by the status monitor. If you intend to use this tool on other LRZ systems, please contact our Service Desk.
The Processed Tasks section gives information about the overall progress, as well as about the success/failure status of the tasks. If tasks have failed, their task IDs are listed below, so that you know which ones failed.
The Task Statistics section is meant to give a feeling for the distribution of task runtimes. This can be used to assess which parallel resources are required to complete the total farming job within a given period of time (assuming perfect scaling).
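As a rough illustration, the numbers from the status output above can be turned into a wall-time estimate. This is only a back-of-the-envelope sketch assuming perfect scaling and the resource layout of the example script (3 nodes with 28 one-CPU pipelines each):

```shell
# Back-of-the-envelope wall-time estimate under perfect scaling:
# 100 tasks, ~0.73 min mean runtime (from the statistics above),
# 3 nodes x 28 tasks-per-node = 84 one-CPU pipelines.
awk 'BEGIN {
  tasks = 100; mean_min = 0.73; pipelines = 84
  waves = int((tasks + pipelines - 1) / pipelines)   # ceil(tasks / pipelines)
  printf "estimated wall time: ~%.1f minutes\n", waves * mean_min
}'
```

Real runs will deviate from this, since task runtimes scatter (see the Min/Max statistics) and scheduling is never perfectly balanced.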

When the job stops (e.g. due to a timeout), the STATUS information shows that no job is currently running. You can simply resubmit the Slurm job via sbatch from the same location; the farming job resumes where it stopped.

Remark: jobfarm analyse <task-db> is supposed to create some graphics (by means of gnuplot): a task-runtime histogram and a task-runtime Gantt chart. However, this is still a little fragile. Therefore, a plain CSV file with each task's start and end time is generated as well, in case you wish to visualize the data by other means.

Some Background on Mode of Operation behind the Scenes

Task Database

The task database is a simple ASCII file with one task per line. If you need more complex workflows, wrap them in a task script (a bash, python, ... script), which is then executed as a single line in the task database.
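A minimal sketch of this wrapper pattern follows; all names here (my_task.sh, data_*.in) are hypothetical placeholders, not part of jobfarm itself:

```shell
#!/bin/bash
# Sketch: wrap a multi-step workflow in a small script, so that each
# task-database line stays short and readable.

cat > my_task.sh <<'EOF'
#!/bin/bash
set -e                                  # fail the task if any step fails
input=$1
workdir="run_${input%.in}"              # one working directory per task
mkdir -p "$workdir" && cd "$workdir"
echo "processing $input" > task.log     # placeholder for the real steps
EOF
chmod +x my_task.sh

taskdb="task-list.txt"
rm -f "$taskdb"
for input in data_01.in data_02.in data_03.in; do
  echo "./my_task.sh $input" >> "$taskdb"   # one short line per task
done
```

Each line of the resulting task database is then a single, self-contained command that jobfarm can schedule.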

The task database file is copied into the farming job's working directory (hidden; see below) and is checked for consistency. jobfarm will not accept changes to the task database, as it would otherwise lose the ability to do correct bookkeeping. If you make changes, remove the working directory before you submit the farming job.
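For example, after editing the task database you would reset the bookkeeping like this (the <taskdb>_res naming convention is explained in the Bookkeeping and Debugging section; note that all records of already processed tasks are lost):

```shell
#!/bin/bash
# Sketch: reset jobfarm's bookkeeping after the task database was changed.
# Warning: all records of already processed tasks are lost!
taskdb="task-list.txt"
mkdir -p "${taskdb}_res"       # stand-in for an existing working directory
rm -rf "${taskdb}_res"         # <taskdb>_res is jobfarm's working directory
```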

A more complex example may illustrate the usage further.

A more complex Example
#!/bin/sh

#SBATCH -D .
#SBATCH -o log.%N.%x.%j.out
#SBATCH -J JobFarm_Test
#SBATCH --get-user-env
#SBATCH --clusters=cm2_tiny
#SBATCH --partition=cm2_tiny
#SBATCH --nodes=1                                             # CooolMUC-2 node(s)
#SBATCH --ntasks-per-node=2                                   # number of "pipelines" per node
#SBATCH --cpus-per-task=14                                    # each pipeline 14 CPUs wide (2 x 14 = 28 == number of physical cores on CoolMUC-2)
#SBATCH --hint=nomultithread                                  # sigh ... Slurm's way to say: Use only physical cores (no hyper-threading)
#SBATCH --mail-type=none
#SBATCH --export=NONE
#SBATCH --time=0:01:00

module load slurm_setup

module use -p /lrz/sys/share/modules/extfiles/jobfarming/     # unnecessary soon
module load jobfarm

taskdb="task-list.txt"
if [ ! -f $taskdb ]; then
   for i in $(seq 1 100); do
      WORKDIR=../mytask_$i
      LOG=log.mytask_$i
      CMD="OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK /lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only -d 10"
      echo "mkdir $WORKDIR && cd $WORKDIR; pwd > $LOG; $CMD &>> $LOG; echo \"number of lines \$(wc -l $LOG)\" &>> $LOG" >> $taskdb
   done
fi

jobfarm start $taskdb

The example illustrates the case where each of your farming tasks has its own working directory. You can specify paths relative to jobfarm's working directory, or absolute paths.

Writing to log files in task-local working directories must also be done manually, as above. Otherwise, the output to stdout/stderr goes into the task's output file, named according to the task's number (its line number in the task database), inside jobfarm's working directory.

As you can see, more complex bash command lines are also possible, including piping, redirection, etc. However, we strongly recommend using separate bash scripts as wrappers, for better overview and comprehensibility.

If you need an explicit index for your tasks (a sort of array ID), just prepend export MY_TASK_ID=<number>; to each line. Of course, it is your responsibility to do this correctly (e.g. to avoid using an ID twice). For this reason, we recommend generating the task database automatically, as illustrated by the bash for-loop above.
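A sketch of such a generation loop (my_task.sh is a hypothetical wrapper script; generating the IDs in the loop guarantees they are unique):

```shell
#!/bin/bash
# Sketch: give each task a unique index via an environment variable,
# generated in a loop so that no ID is used twice.
taskdb="task-list.txt"
rm -f "$taskdb"
for i in $(seq 1 5); do
  echo "export MY_TASK_ID=$i; ./my_task.sh" >> "$taskdb"
done
```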

Bookkeeping and Debugging

At the start, jobfarm creates a folder named after the task database file with _res appended. This is the farming job's working directory. Julia's worker output (files called julia-*.out) is written here, as well as the output of the individual tasks (named according to their task numbers, which correspond to their line numbers in the task database), annotated with information used for the bookkeeping. Everything is human-readable, so it can be used for debugging in case something goes wrong.

All redirected task output can also be found here, unless it was written to some other location.

Successful and failed tasks both count as processed and are skipped at restart. If you want a task to be re-run, remove the corresponding task output file from the _res directory before you restart the job.
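A sketch of this step follows; the output-file name used here ("17" for task 17) is an assumption for illustration, so check the actual names in the _res directory of a real job with ls first:

```shell
#!/bin/bash
# Sketch: force a re-run of task 17 by deleting its output file from
# jobfarm's working directory. The file name "17" is an assumption;
# inspect the _res directory of a real job to see the actual names.
taskdb="task-list.txt"
resdir="${taskdb}_res"
mkdir -p "$resdir"        # stand-in for a real jobfarm working directory
touch "$resdir"/17        # stand-in for the output file of task 17

rm "$resdir"/17           # task 17 will be re-run at the next restart
```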

If you need to debug failed tasks or the whole job, there are three places to look: the Slurm output file, the Julia worker output files, and the task output files.

General Workflow Considerations

Usually, it is a good idea to run one or several tasks interactively on a login or compute node (acquired via salloc, for instance), possibly with \time -v prepended, to verify correct functioning and to get a feeling for the tasks' runtime and memory consumption. Based on these, you can set up the farming job's Slurm resource requirements.
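For example (sleep 1 stands in for a real task command; \time -v invokes GNU time, which also reports the peak memory as "Maximum resident set size"):

```shell
#!/bin/bash
# Sketch: time one representative task interactively before farming it.
if [ -x /usr/bin/time ]; then
  \time -v sleep 1 2> time.log     # GNU time: runtime and peak memory
else
  { time sleep 1 ; } 2> time.log   # fallback: bash keyword, runtime only
fi
```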

If single tasks fail, don't worry, and don't necessarily start debugging immediately! The goal of job farming is not perfection or precision, but getting the largest fraction of the work done fast. A small fraction of the tasks can also be processed manually later, with hardly any extra effort.

Job Farm Julia Implementation

If you are curious about how this is set up, please have a look here (Simple Task Farming Scheduler (Job Farm)).

Trouble-Shooting