FAQ: batch queuing and batch jobs
Long queueing times
There are multiple reasons why jobs may have to wait a long time in the queue. The most obvious one is that the cluster is fully occupied and many jobs are pending in the queue.
Depending on the HPC system (Supercomputer or Linux Cluster), there are other possible reasons.
Jobs using shared memory parallelism
Note that OpenMP parallel programs may need a suitable setting for the environment variable OMP_NUM_THREADS; the default is usually 1 thread.
Some programs (e.g., Gaussian) are multithreaded not via OpenMP but via shared memory between processes or explicit pthread programming; for these, you need to study the program documentation for hints on how to configure a parallel run. Setting OMP_NUM_THREADS will usually have no effect on such programs.
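As a minimal sketch (the job name, core count, and executable are hypothetical placeholders), an OpenMP job script can derive the thread count from the SLURM allocation:
#!/bin/bash
#SBATCH -J omp_example            # hypothetical job name
#SBATCH --cpus-per-task=8         # cores reserved for the threads of one task
# Let the OpenMP runtime use all requested cores; without this,
# many installations default to a single thread.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program               # placeholder for your executable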
Names of Job Scripts
Batch scripts must not have a number as the first character of their name. For example, a script named 01ismyjob will not be started correctly. Please use one of the characters a-z, A-Z as the first character of your job script name.
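For example (hypothetical file names), renaming such a script resolves the problem:
mv 01ismyjob myjob01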
Batch Scripts in DOS/Unicode format / Unprintable Characters in Scripts
Scripts that have been edited under DOS/Windows may contain DOS-style line endings (carriage return + line feed); these will not work under SLURM. The same applies to scripts written in Unicode (e.g., UTF format) by modern editors. Furthermore, non-obvious whitespace characters, for example in the
#! /bin/sh
specification, can lead to problems. Scripts like these will fail to execute and may even block a queue altogether! Please remove such special whitespace characters.
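As a quick check (a sketch relying on GNU grep's -P option, which may not be available on all systems), you can list all lines containing characters outside the printable ASCII range, including carriage returns:
grep -nP '[^\x09\x20-\x7E]' my_script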
Determining the file format and fixing format problems
- Run the file command on the script:
file my_script
The result should be something like
my_script: Bourne-Again shell script text.
If this is not the case, but instead a format like UTF-8 is reported, then please run the iconv command:
iconv -c -f UTF-8 -t ASCII my_script > fixed_script
(the result is written to standard output, which is redirected in this example - so fixed_script should now be ASCII, while my_script remains unchanged).
- Edit the script with vi (= vim). In the status line you will see the string [dos] if the file happens to contain carriage returns/line feeds.
For conversion from DOS to UNIX format, the tool recode may be used:
recode ibmpc..lat1 my_script
Alternatively, the dos2unix command can also be used:
dos2unix my_script
These commands perform the necessary changes in-place (i.e., the file is modified).
- If neither of the above items helps, you can also perform an octal dump of your script. You should see something like the following:
$ od -c my_script | less
0000000   #   !   /   b   i   n   /   b   a   s   h  \n   #   $   -   o
etc. etc. ...
If any strange characters or "\r \n" sequences occur, the format is incorrect and must be fixed (e.g., via multi-lingual emacs editing).
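If none of the tools mentioned above are installed, a minimal fallback (a sketch assuming that carriage returns are the only problem; the cleaned copy is written to a new file) is:
tr -d '\r' < my_script > fixed_script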
My SLURM job fails without any output
You might observe the following symptom: the application did not start; apparently, the job did nothing and failed immediately after startup.
The reason might be a misconfigured job script. Please check the correct specification of the job output and error files, for example:
#SBATCH -o mypath/myfilename.out
#SBATCH -e mypath/myfilename.err
The path must exist! Otherwise, SLURM fails to create the output/error files, and consequently the job fails.
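A simple precaution (mypath is the hypothetical directory from the example above) is to create the directory before submitting the job:
mkdir -p mypath
sbatch my_script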
Please also refer to SLURM Workload Manager.
My SLURM jobs fail with strange error messages involving I/O
The error messages typically are "Can't open output file" and/or "file too large".
The reasons for this may be:
- you have exceeded a file system quota
- you have exceeded the per-directory limit for the number of files
Please consult the file systems document appropriate for the system you are using: Cluster or SuperMUC. The remedy usually requires removing files and/or restructuring your directory hierarchy.
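As a sketch for diagnosing both causes (assuming a Lustre-based file system; the mount point and directory name are placeholders), you can inspect your quota and count the entries in a suspect directory:
lfs quota -u $USER /lustre/fs    # block and inode usage versus limits
ls -1 mydir | wc -l              # number of entries in directory mydir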
I've deleted my job, but it is still listed in squeue
With SLURM, you might typically see entries like
278820 serial multiflo abcd12d CANCELLED 17:42:32 4-08:00:00 1 lx64a295
when issuing squeue -M serial -u $USER some time after having deleted your job using scancel. The trouble is that the master cannot always distinguish a node crash from a temporary network outage, and so cannot remove the job from its internal tables; it may still need to perform some cleanup work on the compute node.
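To check whether the node is actually down (a sketch; lx64a295 is the node name from the example output above), you can query its state:
sinfo -n lx64a295 -o '%N %T'     # prints the node name and its current state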
There is of course a catch here: if you resubmit the job (operating on the same data), there is a chance that processes from the deleted job, still running on the compute node, will overwrite newly generated data. The chance of this is not large, since in most cases we observe node crashes rather than long-term network outages, but it is not zero. The only sure way to avoid this is to run the new job on a separate data set.