FAQ: batch queuing and batch jobs

Long queueing times

Supercomputer

If your jobs take too long to start on SuperMUC-NG and you cannot achieve sufficient throughput, please check whether the following remedies help:
  • Large jobs are prioritized over small jobs.
    Try to fit your jobs into the next larger queue (SuperMUC-NG SLURM partitions).
    • Try to run at the upper node limit at which your application still delivers reasonable parallel performance.
    • If you have multiple long-running simulations, try to start them within a single job script (Job farming with SLURM).
    • If you have many smaller identical jobs with different parameters, run them in SLURM job arrays (FAQ: Embarrassingly parallel jobs using Python and R).

  • Chained jobs often start right after the previous job terminates (Submitting several jobs with dependencies)
    A job chain consists of many jobs, each depending on its predecessor. Typical examples are long MD trajectories or extended CFD simulations, where all jobs in the chain have the same size. A chained job accumulates priority while waiting for its predecessor to terminate. When the predecessor finishes, the next job in the chain has exactly the right size to fit the freed slot and often enough priority to start straight away. (A minimal submission sketch illustrating chains, job arrays, and the --time option follows this list.)

  • Check your workflow. Jobs may benefit from Slurm's backfill scheduler.
    • What does that mean? Provided that the start of higher-priority jobs is not delayed, lower-priority jobs may be executed earlier.
    • What to do? Estimate the maximum run time of your job and set a meaningful value for the sbatch option "--time" in your job script or sbatch command (cf. SLURM Workload Manager and Job Processing with SLURM on SuperMUC-NG). In order to optimize job scheduling for all users, we highly recommend setting "--time" instead of relying on the default run time.
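
A minimal sketch of these techniques (the script names step.sh and parameter_step.sh, the run times, and the array range are placeholders; see the linked pages for the options required on SuperMUC-NG):

# chain two jobs: the second step starts only after the first has finished successfully
jobid=$(sbatch --parsable --time=08:00:00 step.sh)
sbatch --time=08:00:00 --dependency=afterok:$jobid step.sh

# bundle many identical runs with different parameters into a job array;
# inside parameter_step.sh, $SLURM_ARRAY_TASK_ID selects the parameter set
sbatch --time=02:00:00 --array=0-99 parameter_step.sh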

LRZ will generally not prioritize single jobs/users/projects, or provide reservations. If your job processing is delayed and none of the above measures seem to help, please seek advice early via the service desk.

Linux Cluster

On the Linux Cluster, Slurm uses a priority system which is based on several factors:
  1. the fairshare policy, which considers consumed compute time,
  2. the age of waiting jobs,
  3. the job size.

The job priority is dominated by (1) and (2); (3) plays only a minor role.

Long waiting times usually mean that you have already consumed your shares on a particular cluster segment. The priority of your jobs then depends only on the aging factor, which is 0 at job submission and increases continuously. In addition, your fairshare value will fully recover over time, reducing the penalty applied to your jobs; however, this happens on a time scale of several weeks.

Consequence: As long as there are users who have consumed less compute time than you, they will get a higher priority and their jobs will run before yours. However, their subsequent jobs will in turn have reduced priority. You may try another cluster segment (please refer to Job Processing on the Linux-Cluster). The shares are independent for each cluster segment, so you should not incur any penalty on a cluster where you have not yet run any jobs.

You may check the job priority via the command sprio, e.g. on CoolMUC-2:

sprio --clusters=cm2 --job=MYJOBID

You may also check your fairshare value, e.g.:

sshare --clusters=cm2 --users=$USER

Its output consists of three rows; the fairshare value is shown in the last column. The second row contains the maximum fairshare, and your own fairshare value can be found in the third row. After consuming a lot of compute time, this value might drop to 0. Example:
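
(The account and user names as well as all numbers below are purely illustrative; the exact column layout may differ depending on the Slurm version.)

CLUSTER: cm2
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
root                                                     987654321                 1.000000
 root                   di12abc          1    0.000120   123456789      0.125000   0.000000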


On the Linux Cluster there is only a limited number of interactive compute nodes, and those nodes are not shared resources. It may happen that all of them are occupied; in that case you will have to wait until a node becomes free.

For details on the job processing on the cluster, please refer to:
https://doku.lrz.de/x/AgaVAg

You may check the status of the interactive cluster segment via the following Slurm command:

sinfo -M inter
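
If helpful, sinfo can also filter by node state, e.g. to list only nodes that are currently free (a generic Slurm option, shown here as a hint):

sinfo -M inter --states=idle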

Jobs using shared memory parallelism

Note that OpenMP-parallel programs may need a suitable setting for the environment variable OMP_NUM_THREADS. The default is usually 1 thread.
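
A minimal sketch of the relevant lines in a job script (the program name my_openmp_program and the core count are placeholders; cluster-specific options such as the partition are omitted):

#!/bin/bash
#SBATCH -J omp_test
#SBATCH -o ./%x.%j.out
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=28       # placeholder: use the core count of the target node type

# let the OpenMP runtime use all cores assigned to the task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program              # hypothetical application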

Some programs (e.g., Gaussian) are multithreaded, but not via OpenMP; they instead use shared memory between processes or explicit pthread programming. For such programs, please study the program documentation for hints on how to configure a parallel run; setting OMP_NUM_THREADS will usually have no effect.

Names of Job Scripts

Batch scripts must not have a digit as the first character of their name. For example, a script named 01ismyjob will not be started correctly. Please use one of the characters a-z, A-Z as the first character of your job script name.

Batch Scripts in DOS/Unicode format / Unprintable Characters in Scripts

Scripts which have been edited under DOS/Windows may contain carriage-return/line-feed (CRLF) line endings; these will not work under SLURM. The same applies to scripts written in a Unicode format (e.g., UTF-16, or UTF-8 with special characters) by modern editors. Furthermore, characters that merely look like whitespace, for example in the

#! /bin/sh

specification, can lead to problems. Scripts like these will fail to execute and may even block a queue altogether! Please remove such special whitespace characters.

Determination of file format and fixing of format problems

  1. Run the file command on the script:     file my_script

    The result should be something like my_script: Bourne-Again shell script text. If this is not the case, but instead a format like UTF-8 is reported, then please run the iconv command:

    iconv -c -f UTF-8 -t ASCII my_script > fixed_script

    (the result is written to standard output, which is redirected in this example - so fixed_script should now be ASCII, while my_script is unchanged).

  2. Edit the script with vi (= vim). In the status line you will see the string [dos] if the file contains carriage-return/line-feed line endings.

    For conversion from DOS to UNIX format, the tool recode may be used:

    recode ibmpc..lat1 my_script

    Alternatively, the dos2unix command can also be used:

    dos2unix my_script

    These commands perform the necessary changes in-place (i.e., the file is modified).

  3. If neither of the above two approaches helps, you can also perform an octal dump of your script. You should see something like the following:
    $ od -c my_script | less
    0000000 # ! / b i n / b a s h \n # S B A
     etc. etc. ...
    

    If any strange byte values or "\r \n" sequences occur, the format is incorrect and must be fixed (e.g., via multi-lingual emacs editing).
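
    Alternatively, a generic sed one-liner (not LRZ-specific) removes the carriage returns in place:

    sed -i 's/\r$//' my_script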

My SLURM job fails without any output

You might observe the following symptom: the application did not start; apparently, the job did nothing and failed immediately after startup.

The reason might be a misconfigured job script. Please check the correct specification of job output and error files, for example:

#SBATCH -o mypath/myfilename.out
#SBATCH -e mypath/myfilename.err

The path must exist! Otherwise, Slurm fails to create the output/error files. Consequently, the job fails.
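
A simple way to avoid this is to make sure the directory exists before submitting the job (my_job_script is a placeholder for your script name):

mkdir -p mypath        # create the output directory if it does not exist yet
sbatch my_job_script   # then submit the job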

Please also refer to SLURM Workload Manager.

My SLURM jobs fail with strange error messages involving I/O

The error messages typically are "Can't open output file" and/or "file too large".

The reasons for this may be

  • you have exceeded a file system quota
  • you have exceeded the per-directory limit for the number of files

Please consult the file systems documentation appropriate for the system you are using: Cluster or SuperMUC. The remedy usually requires removing files and/or restructuring your directory hierarchy.
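
To find out where the space or the files accumulate, generic commands like the following can help (my_data_dir is a placeholder for one of your directories):

du -sh $HOME/*/ | sort -h | tail   # the ten largest subdirectories of $HOME
ls -1A my_data_dir | wc -l         # number of entries directly inside one directory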

I've deleted my job, but it is still listed in squeue

With SLURM, you might typically see entries like

278820 serial             multiflo   abcd12d   CANCELLED     17:42:32 4-08:00:00      1 lx64a295

when issuing squeue -M serial -u $USER some time after having deleted your job using scancel. The trouble is that the Slurm master cannot always distinguish a node crash from a temporary network outage, and so cannot immediately remove the job from its internal tables. It may or may not still need to do some cleanup work on the compute node.

There is of course a catch here: if you resubmit the job (operating on the same data), there is a chance that processes from the deleted job that are still running on the compute node will overwrite newly generated data. The chance of this is not large, since in most cases we observe node crashes rather than long-term network outages, but it is not zero. There is no sure way to avoid this other than running the new job on a separate data set.