Jobfarm - a Map-Reduce Job-Farming Approach

What is Jobfarm

Jobfarm is based on Julia's pmap function, wrapped into a bash convenience framework. A list of tasks in an ASCII file is scheduled and distributed onto independent worker pipelines, whose resources (CPUs, memory) are controlled via the Slurm job configuration.

It is assumed that the user can split the overall problem into smaller tasks that can be executed independently. Complex task dependencies are not foreseen; other tools need to be used for such scenarios.

In contrast to simple srun job farming, jobfarm provides a rudimentary bookkeeping system, so that job-farming runs can be conveniently restarted (e.g. after a Slurm job abort caused by a node failure, or after a timeout).

Getting started

The jobfarm module can be loaded via

> module use -p /lrz/sys/share/modules/extfiles/jobfarming           # might be unnecessary later
> module load jobfarm
> jobfarm -h

 Usage: jobfarm [Options]
        jobfarm [CMD] [Options] [input-file]

        CMD    : start      : to start a jobfarming job
                 status     : check status of jobfarming job
                 analyse    : [TODO]
        Options: -h, --help : print help

 For command help: jobfarm [CMD] -h

jobfarm supports tab completion. The tool has three modes, of which start and status are probably the most useful ones.

jobfarm start <task-db> must be executed inside a Slurm job (interactive or batch) and starts the task scheduling. jobfarm status <task-db> is to be executed on login nodes and is used to monitor the farming job's progress.
An example (also available under $JOBFARM_BASE/test/cm2_std/1CPU.tasks.slurm) illustrates the usage of this tool.

myFarmJob_CM2.slurm
#!/bin/sh
#SBATCH -D .
#SBATCH -o log.%N.%x.%j.out
#SBATCH -J JobFarm_Test
#SBATCH --get-user-env
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --qos=cm2_std
#SBATCH --nodes=3                                                                       # set resource demands as required
#SBATCH --tasks-per-node=28
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=none
#SBATCH --export=NONE
#SBATCH --time=0:05:00

module load slurm_setup

module use -p /lrz/sys/share/modules/extfiles/jobfarming
module load jobfarm                                                                     # load jobfarm module

taskdb="task-list.txt"

if [ ! -f $taskdb ]; then                                                               # create the task database once (if it does not exist yet)
  cmd="/lrz/sys/tools/placement_test_2021/bin/placement-test.intel_impi -d "
  for i in $(seq 1 100); do
    echo "$cmd $(( RANDOM % 20 + 30 ))" >> $taskdb     # 40 +/- 10 seconds
  done
fi

jobfarm start $taskdb                                                                   # start jobfarm execution

This job script can be submitted as usual. jobfarm status task-list.txt is then used to monitor the job's progress.

> sbatch myFarmJob_CM2.slurm
> jobfarm status task-list.txt

 STATUS Farming Job: task-list.txt:
       RUNNING (SLURM Job ID : 80687; anticipated endtime: 2021-07-21T15:36:20)

    Processed Tasks: 6 of 100 (6%)
       [ success  failed  total ] = [ 6  0  100 ]

    Task Statistics (minutes):
       [ Min  Mean  Median  Max ] = [ 0.62 0.73 0.74 0.80 ]

    List of failed Tasks: 

If a Slurm job is currently running, this is indicated together with the Slurm job ID. Currently, only cm2, cm2_tiny, and mpp3 are supported by the status monitor. If you intend to use this tool on other LRZ systems, please contact our Service Desk.
The Processed Tasks section shows the overall progress as well as the success/failure counts of the tasks. If tasks have failed, their task IDs are listed below so that you know which ones failed.
The Task Statistics section is meant to give a feeling for the tasks' runtime distribution. This can be used to assess which parallel resources are required to finish the whole farming job within a given period of time (assuming perfect scaling).
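
As a rough back-of-the-envelope sketch (assuming perfect scaling, and using the numbers from the example output above), the expected wall time is the number of tasks times the mean task runtime, divided by the number of worker pipelines:

ntasks=100       # total number of tasks
mean_min=0.73    # mean task runtime in minutes (from the Task Statistics)
nworkers=84      # worker pipelines, e.g. 3 nodes x 28 tasks per node
printf "estimated wall time: %.1f minutes\n" "$(echo "$ntasks * $mean_min / $nworkers" | bc -l)"

Here this gives roughly 0.9 minutes; in practice, a safety margin should of course be added to the Slurm --time limit.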

When the job stops (e.g. due to a timeout), the STATUS information shows that no job is currently running. You can simply resubmit the Slurm job via sbatch from the same location; the farming job continues where it stopped.
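
Continuing the example above, the restart is just:

> jobfarm status task-list.txt      # confirms that no Slurm job is currently running
> sbatch myFarmJob_CM2.slurm        # resubmit from the same directory; already processed tasks are skipped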

Remark: jobfarm analyse <task-db> is supposed to create some graphics (by means of gnuplot), namely a task-runtime histogram and a task-runtime Gantt chart. As this is rather fragile, a plain CSV file with each task's start and end times is also generated.

Some Background on the Mode of Operation behind the Scenes

Task Database

The task database is a simple ASCII file with one task per line. If you need a more complex workflow (e.g. with pipes in bash, Python, or any other language), wrap it into a task script (a bash, Python, ... script), which is then executed as a single line in the task database.
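
For illustration, such a wrapper might look like the following sketch (the script name my_task.sh and the programs preprocess and solver are only placeholders for your own workflow):

my_task.sh
#!/bin/bash
# illustrative wrapper for a more complex per-task workflow
set -e                                                      # fail the task if any step fails
input="$1"                                                  # task-specific parameter passed via the task database
./preprocess "$input" | ./solver > "result_${input}.dat"    # pipes, loops, etc. live inside the wrapper

The task database then simply contains one invocation per line:

./my_task.sh case_001
./my_task.sh case_002
./my_task.sh case_003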

The task database file is copied into the farming job's working directory (hidden; see below) and is checked for consistency. jobfarm will not accept changes to the task database, as it would otherwise lose the ability to do correct bookkeeping. If you make changes, remove the working directory before you resubmit the farming job.
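
For example (a sketch in the placeholder notation used above; the working directory is named after the task database with _res appended, see below):

> rm -r <task-db>_res      # discard the old bookkeeping before resubmitting a changed task database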

Bookkeeping and Debugging

At start, jobfarm creates a folder named after the task database file with _res appended. This is the farming job's working directory. The Julia workers' output (files called julia-*.out) and the output of the individual tasks (named according to their task numbers, which correspond to their line numbers in the task database) are written here, annotated with information used for the bookkeeping. Everything is human readable, so it can be used for debugging in case something goes wrong.

All redirected task output can also be found here, unless it is explicitly written to some other location.

Succeeded and failed tasks both count as processed tasks and are skipped at restart. If you want a task to be re-run, remove the corresponding task output file from the working directory before you restart the job.
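
For example (again only a sketch; the exact file name corresponds to the task number, as described above):

> rm <task-db>_res/<task-output-file>      # this task will be processed again at the next restart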

In case you need to debug failed tasks or the whole job, there are three places to look: the Slurm output file, the Julia worker output files, and the task output files.

General Workflow Considerations

Usually, it is a good idea to run one or several tasks interactively on a login node or a compute node (acquired via salloc, for instance), perhaps with \time -v prepended, to verify correct functionality and to get a feeling for a task's runtime and memory consumption. Based on these, you should set up the farming job's Slurm resource requirements.
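
A minimal interactive check might look like this, using the test command from the example above (replace it with your own task):

> \time -v /lrz/sys/tools/placement_test_2021/bin/placement-test.intel_impi -d 40

The "Elapsed (wall clock) time" and "Maximum resident set size" lines of the \time -v output give the runtime and memory figures on which to base the Slurm settings (--time, --tasks-per-node, --cpus-per-task, ...).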

If single tasks fail, don't worry, and do not necessarily start debugging right away! The goal of job farming is not perfection or precision, but to get the largest fraction of the work done quickly. A small fraction of the tasks can also be processed manually later with hardly any extra effort.

Trouble-Shooting