Flux Framework - Flux in Slurm

Getting started ...

What is it?

Flux Framework is a task-scheduling and resource-management framework, much like Slurm. However, it can be run completely in user space. We describe it here as an alternative to Slurm's srun task-farming capabilities.

Flux is rather versatile, but also quite complex and still under very active development. We therefore refer to the Flux documentation for all the details left out here.

Installation

The simplest installation is probably via conda.

> conda create -n my_flux -c conda-forge flux-core flux-sched
> conda activate my_flux
(my_flux) > flux version
commands:    		0.59.0
libflux-core:		0.59.0
build-options:		+hwloc==2.8.0+zmq==4.3.5

If you need a more up-to-date version of Flux, you probably cannot avoid building it from source (https://github.com/flux-framework/). Spack may help to simplify that process.
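
A minimal Spack-based build could, for instance, look like the following sketch (package names as listed in the Spack repository; versions and variants need to be adapted to your needs):

> spack install flux-sched          # builds flux-core as a dependency
> spack load flux-sched
> flux version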

Interactive Workflows

Truly interactive work with Flux is probably of limited practical use. But for testing purposes, and as a sort of starting point, let us take a short look at it. We start from a login node.

login > conda activate my_flux                                                         # activate flux environment
(my_flux) login > srun -N 2 -M inter -p cm2_inter --pty flux start                     # allocate resources (on cluster/partition you want)
i22r07c05s05 > flux uptime                                                             # basic info about the running flux instance
 14:11:57 run 7.9s,  owner ⼌⼌⼌⼌⼌⼌⼌,  depth 0,  size 2
i22r07c05s05 > flux resource info                                                      # basic info about the resources managed by the flux instance
2 Nodes, 56 Cores, 0 GPUs
i22r07c05s05 > flux run --label-io -N2 hostname                                        # run a task (here, on each node one)
0: i22r07c05s05
1: i22r07c05s08
i22r07c05s05 > flux bulksubmit --output=log.{{id}} -n 1 -c 7 /lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only -t 7 -d 20 ::: $(seq 0 100)
ƒCF6D7Bu                                                                               # flux job IDs
[...]
i22r07c05s05 > flux jobs -a
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
[...]
    ƒCL2LiaU ⼌⼌⼌⼌⼌⼌⼌  placement+  S      1      -        - 
    ƒCGVkRgt ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   8.580s i22r07c05s05
    ƒCGVkRgs ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   10.15s i22r07c05s11
    ƒCGUGSQa ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   12.45s i22r07c05s11
    ƒCGUGSQZ ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   12.45s i22r07c05s11
    ƒCGUGSQY ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   12.79s i22r07c05s05
    ƒCGUGSQX ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   13.35s i22r07c05s11
    ƒCGSnT8C ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   14.15s i22r07c05s05
    ƒCGSnT8B ⼌⼌⼌⼌⼌⼌⼌  placement+  R      1      1   17.15s i22r07c05s05
    ƒCG62dBP ⼌⼌⼌⼌⼌⼌⼌  placement+ CD      1      1   23.41s i22r07c05s05
    ƒCG62dBQ ⼌⼌⼌⼌⼌⼌⼌  placement+ CD      1      1   19.54s i22r07c05s11
    ƒCG62dBM ⼌⼌⼌⼌⼌⼌⼌  placement+ CD      1      1   20.68s i22r07c05s11
[...]
i22r07c05s05 > exit

flux has an elaborate built-in help system. Please use flux help and flux help <command> to get information or a quick reminder.

flux submit/bulksubmit, flux cancel <job ID> and flux jobs -a can be used much like sbatch, scancel and squeue under Slurm. flux cancelall -f may well be a highlight during the first tests.
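
For a quick orientation, the Slurm-to-Flux correspondence looks roughly as follows (my_task.sh and the job ID are illustrative placeholders):

> flux submit -n 1 -c 7 ./my_task.sh      # roughly sbatch
> flux jobs -a                            # roughly squeue
> flux cancel ƒCF6D7Bu                    # roughly scancel <job ID>
> flux cancelall -f                       # cancel all jobs without confirmation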

Non-Interactive Workflows

The far more common way to use Flux is probably to bundle a whole batch of tasks within a single Slurm job. This alone spans a huge range of possible workflows, which we cannot cover here in any completeness. But an example should illustrate the basic principle.

test.sh
#!/bin/bash
#SBATCH -o log.%x.%j.%N.out
#SBATCH -D . 
#SBATCH -J flux_test
#SBATCH --get-user-env 
#SBATCH -M inter
#SBATCH -p cm2_inter
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --mail-type=none 
#SBATCH --export=NONE 
#SBATCH --time=00:02:00 

module load slurm_setup
conda activate my_flux

cat > workflow.sh << EOT
#!/bin/bash
flux uptime
flux resource info
flux run --label-io -N2 hostname
# 101 tasks (one per input value 0..100), each with 7 CPUs
flux bulksubmit --wait --output=log.{{id}} -n 1 -c 7 /lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only -t 7 -d 20 ::: $(seq 0 100)
# one MPI job with 2 nodes, 8 ranks, and 7 threads (CPUs) per rank
flux run --output=log.mpi  -N 2 -n 8 -c 7 -o mpi=intel -o cpu-affinity=per-task /lrz/sys/tools/placement_test_2021/bin/placement-test.intel_impi -t 7
EOT
chmod u+x workflow.sh

srun --export=all --mpi=none flux start ./workflow.sh

With srun, the Flux instance is started (one broker process per node) and handed the script workflow.sh, which contains the actual Flux workflow description. We use dummy programs here that report the rank/thread-to-CPU placement. It is probably a good idea to check that this placement is correct.

This Slurm script is to be submitted as usual via sbatch.
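
Submission and a quick look at the resulting logs could look like this (file names as defined in the script above):

> sbatch test.sh
> cat log.flux_test.*.out        # Slurm log of the workflow
> cat log.mpi                    # output of the MPI placement test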

NB: -o mpi=intel was only a guess; the documentation does not say much about Intel MPI, but it seems to work. (It also works without that flag, so do not expect an error message if the setting is wrong.)

Waitable Jobs

In general, flux submit submits a job and returns to the shell immediately. For mass submissions within a Slurm job, this means the workflow script above would simply exit right after submitting the last Flux job. To handle this, flux submit offers the option --flags=waitable. Together with a subsequent flux job wait --all, this gives an idiom similar to srun ... & followed by wait for Slurm job farming. However, the Flux documentation claims that flux job wait is much more lightweight than bash's wait.
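
A minimal sketch of this pattern within a workflow script (my_task.sh and its inputs are illustrative placeholders):

flux submit --flags=waitable -n 1 ./my_task.sh input1
flux submit --flags=waitable -n 1 ./my_task.sh input2
flux job wait --all        # block until all waitable jobs have completed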

Dependency Trees

flux submit also supports job dependencies via the --dependency=... option, where ... can for instance be afterok:JOBID. This is semantically equivalent to Slurm's sbatch job dependencies.
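
A short sketch of a two-step dependency chain (preprocess.sh and postprocess.sh are placeholders for your own executables):

JOBID=$(flux submit -n 1 ./preprocess.sh)                          # flux submit prints the job ID
flux submit --dependency=afterok:${JOBID} -n 1 ./postprocess.sh    # starts only if the first job succeeded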

After the Slurm Job Stops

Flux does not seem to provide persistent job bookkeeping beyond the lifetime of its instance. However, flux queue and flux dump offer some capabilities to drain and archive the queue status before the Slurm job ends; see the snippet below and the Cheat Sheet.

# Stop the queue, wait for running jobs to finish, and dump an archive.
flux queue stop
flux queue idle
flux dump ./archive.tar.gz

In order to execute this reliably inside a Slurm job, a bash trap ... EXIT handler may be necessary (where ... is a cleanup bash function that runs the commands above).
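
A possible sketch, to be placed near the top of workflow.sh (the function name is illustrative):

archive_queue() {
    flux queue stop                 # stop accepting new jobs
    flux queue idle                 # wait until no jobs are running
    flux dump ./archive.tar.gz      # archive the instance's current state
}
trap archive_queue EXIT             # run the cleanup when the script exits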


The last three topics may be easier to address with a workflow manager such as Nextflow (assuming it supports Flux at all).

Further Reading

Flux Framework comes with a vast amount of documentation, user guides and tutorials. We recommend that beginners start with the Learning Guide.

To embed Flux into a Slurm setup, please consult the documentation on that topic.

The Cheat Sheet provides a good overview and is of tremendous help.