FAQ: batch queuing and batch jobs

Long queueing times

There are multiple reasons why jobs may have to wait a long time in the queue. The most obvious one is that the cluster is fully occupied and many jobs are pending in the queue.

Depending on the HPC system, there are further possible reasons. Please expand the corresponding section below.

Supercomputer

If your jobs take too long to start on SuperMUC-NG and you cannot generate enough throughput, please check whether you can use the following remedies:
  • Large jobs are prioritized over small jobs.
    Try to fit your jobs into the next larger queue (SuperMUC-NG SLURM partitions).
    • Try to run at the largest node count at which your application still shows reasonable parallel performance.
    • If you have multiple long-running simulations, try to start them within a single job script (Job farming with SLURM); see the job-farming sketch after this list.
    • If you have many smaller, identical jobs with different parameters, run them in SLURM job arrays (FAQ: Embarrassingly parallel jobs using Python and R); see the job-array sketch after this list.

  • Chained jobs often start right after the previous job terminates (Submitting several jobs with dependencies); see the job-chain sketch after this list.
    A job chain consists of several jobs, each depending on its predecessor. Typical examples are long MD trajectories or extended CFD simulations, where all jobs in the chain have the same size. A chained job accumulates priority while it waits for its predecessor to terminate. When the predecessor finishes, the next job in the chain has exactly the right size to fit the freed slot and often enough priority to start straight away.

  • Check your workflow. Jobs may benefit from Slurm's backfill scheduler.
    • What does that mean? Provided that the start time of high-priority jobs is not delayed, low-priority jobs may be executed earlier.
    • What to do? Estimate the maximum run time of your job and set a meaningful value for the sbatch option "--time" in your job script or on the sbatch command line (cf. SLURM Workload Manager and Job Processing with SLURM on SuperMUC-NG), as also done in the sketches below. To optimize job scheduling for all users, we strongly recommend setting "--time" instead of relying on the default run time.
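For illustration, a minimal job-farming sketch, assuming two independent simulations that each use half of a four-node allocation; the partition name, node and task counts, run time and program names are placeholders and need to be adapted to your setup:

#!/bin/bash
#SBATCH -J farming_example
#SBATCH --partition=general        # placeholder partition name
#SBATCH --nodes=4
#SBATCH --time=08:00:00            # realistic run-time estimate, also helps backfilling

# Two independent simulations run concurrently inside one allocation,
# each as a separate job step on half of the allocated nodes.
srun --nodes=2 --ntasks=96 ./simulation_A &
srun --nodes=2 --ntasks=96 ./simulation_B &
wait                               # keep the batch job alive until both steps have finished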
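A minimal job-array sketch for many identical runs that differ only in an input parameter; again, partition, array range, run time and program/input names are placeholders:

#!/bin/bash
#SBATCH -J array_example
#SBATCH --partition=general        # placeholder partition name
#SBATCH --nodes=1
#SBATCH --time=02:00:00
#SBATCH --array=1-20               # 20 array tasks with indices 1..20

# Each array task selects its own input file via SLURM_ARRAY_TASK_ID.
./my_program --input input_${SLURM_ARRAY_TASK_ID}.dat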
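A minimal job-chain sketch, submitting each part with a dependency so that it starts only after successful completion of its predecessor; the job script names are placeholders:

# Submit the first part and remember its job ID.
JOBID=$(sbatch --parsable part1.slurm)

# Each further part is held back until its predecessor has completed successfully.
JOBID=$(sbatch --parsable --dependency=afterok:${JOBID} part2.slurm)
JOBID=$(sbatch --parsable --dependency=afterok:${JOBID} part3.slurm)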

LRZ will generally not prioritize single jobs/users/projects, or provide reservations. If your job processing is delayed and none of the above measures seem to help, please seek advice early via the service desk.

Linux Cluster

First of all, you may check the estimated start time of your job via the following Slurm command. However, this is only an estimate! The start time is regularly re-calculated by Slurm and may vary significantly.

Example for a job on the cluster cm2:
squeue --clusters=cm2 --job=MYJOBID -O "jobid,partition,timelimit,state,priority,starttime,reason"


On the Linux Cluster, Slurm uses a priority system which is based on several factors:

  1. the fairshare policy, which takes the consumed compute time into account,
  2. the age of waiting jobs,
  3. the job size.

The job priority is dominated by (1) and (2). (3) plays a minor role.

If you have already consumed your shares on a particular cluster segment, the priority of your jobs depends only on the aging factor, which is 0 at job submission and then continuously increases. In addition, your fairshare value recovers over time, which reduces the penalty applied to your jobs. However, this happens on a time scale of several weeks.

Consequence: As long as there are users who have consumed less compute time than you, their jobs will get a higher priority and will run before yours. However, their subsequent jobs will in turn have reduced priority. You may try another cluster segment (please refer to Job Processing on the Linux-Cluster). The shares are independent for each cluster segment, so you do not carry any penalty on a segment on which you have not run any jobs yet.

You may check the job priority via the command sprio, e.g. on CoolMUC-2:

sprio --clusters=cm2 --job=MYJOBID

You may also check your fairshare value, e.g.:

sshare --clusters=cm2 --users=$USER

Its output consists of three rows, with the fairshare value in the last column. The value in the second row represents the maximum fairshare; your own fairshare value can be found in the third row. After consuming a lot of compute time, this value may drop to 0.


On the Linux Cluster, there is only a limited number of interactive compute nodes, and these nodes are not shared resources. It may happen that all of them are occupied; as a result, you might have to wait some time.

For details on the job processing on the cluster, please refer to:
https://doku.lrz.de/x/AgaVAg

You may check the status of the interactive cluster segment via the following Slurm command:

sinfo -M inter

Jobs using shared memory parallelism

Note that OpenMP-parallel programs may need a suitable setting for the environment variable OMP_NUM_THREADS; the default is usually 1 thread.
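For illustration, a minimal sketch of such a setting in a job script (the core count and program name are placeholders):

#SBATCH --nodes=1
#SBATCH --cpus-per-task=48         # placeholder, e.g. all cores of one node

# Use one OpenMP thread per allocated core instead of the default of 1 thread.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program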

Some programs (such as Gaussian) are multithreaded not via OpenMP but via shared memory between processes or explicit pthread programming. For these, you need to study the program documentation for hints on how to configure a parallel run; setting OMP_NUM_THREADS will usually have no effect.

Names of Job Scripts

Batch scripts must not have a digit as the first character of their name. E.g., a script named 01ismyjob will not be started correctly. Please use one of the characters a-z or A-Z as the first character of your job script name.

Batch Scripts in DOS/Unicode format / Unprintable Characters in Scripts

Scripts which have been edited under DOS/Windows may contain carriage-return/line-feed (CRLF) line endings; these will not work under SLURM. The same applies to scripts which have been written in Unicode (e.g., UTF format) by modern editors. Furthermore, seemingly harmless special whitespace characters, for example in the

#! /bin/sh

specification, can lead to problems. Scripts like these will fail to execute and may even block a queue altogether! Please remove such special whitespace characters.

Determination of file format and fixing of format problems

  1. Run the file command on the script:     file my_script

    The result should be something like my_script: Bourne-Again shell script text. If this is not the case, but instead a format like UTF-8 is reported, then please run the iconv command:

    iconv -c -f UTF-8 -t ASCII my_script > fixed_script

    (the result is written to standard output, which is redirected in this example - so fixed_script should now be ASCII, while my_script is unchanged).

  2. Edit the script with vi (= vim). In the status line you will see the string [dos] if the file happens to contain carriage returns/linefeeds.

    For conversion from DOS to UNIX format the tool recode may be used:

    recode ibmpc..lat1 my_script

    Alternatively, the dos2unix command can also be used:

    dos2unix my_script

    These commands perform the necessary changes in-place (i.e., the file is modified).

  3. If neither of the above two items helps, you can also perform an octal dump of your script. For a correct script, you should see something like the following:
    $ od -c my_script | less
    0000000 # ! / b i n / b a s h \n # S B A
     etc. etc. ...
    

    If any strange numbers or "\r \n" sequences occur, the format is incorrect and must be fixed (e.g., with the tools mentioned above or via multi-lingual editing in Emacs).

My SLURM job fails without any output

You might observe the following symptom: the application did not start; apparently, the job did nothing and failed immediately after startup.

The reason might be a misconfigured job script. Please check the correct specification of job output and error files, for example:

#SBATCH -o mypath/myfilename.out
#SBATCH -e mypath/myfilename.err

The path must exist! Otherwise, Slurm fails to create the output/error files, and consequently the job fails.
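A simple way to make sure the directory exists before submitting the job (directory and script names follow the example above and are otherwise placeholders):

mkdir -p mypath            # create the output directory if it does not exist yet
sbatch my_job_script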

Please also refer to SLURM Workload Manager.

My SLURM jobs fail with strange error messages involving I/O

The error messages typically are "Can't open output file" and/or "file too large".

The reasons for this may be

  • you have exceeded a file system quota
  • you have exceeded the per-directory limit for the number of files

Please consult the file systems document appropriate for the system you are using: Cluster or SuperMUC. The remedy usually consists of removing files and/or restructuring your directory hierarchy.
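As a first check, something along the following lines may help; this is only a sketch assuming a Lustre-based file system and example path/directory names (the authoritative quota commands for your system are given in the linked file systems documentation):

# Show your usage and quota on a Lustre-based file system (mount point is an example).
lfs quota -uh $USER $SCRATCH

# Count the entries in a directory to compare against a per-directory file limit.
ls -1 my_data_dir | wc -l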

I've deleted my job, but it is still listed in squeue

With SLURM, you might typically see entries like

278820 serial             multiflo   abcd12d   CANCELLED     17:42:32 4-08:00:00      1 lx64a295

when issuing squeue -M serial -u $USER some time after having deleted your job with scancel. The trouble is that the Slurm master cannot always distinguish a node crash from a temporary network outage, and therefore cannot immediately remove the job from its internal tables. It may or may not still need to perform some cleanup work on the compute node.
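To get an idea whether the node of the cancelled job is down or merely unreachable, you may query its state, e.g. with the node name from the example output above:

sinfo -M serial -n lx64a295 -o "%N %T %E"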

There is, of course, a catch here: if you resubmit the job (operating on the same data), there is a chance that processes from the deleted job, still running on the compute node, will overwrite newly generated data. The chance of this is not large, since in most cases we observe node crashes rather than long-term network outages, but it is not zero. The only sure way to avoid it is to run the new job on a separate data set.