General
The batch system on SuperMUC-NG is the open-source workload manager SLURM (Simple Linux Utility for Resource Management). For details about the SLURM batch system, see Slurm Workload Manager.
Submit hosts are usually login nodes that allow users to submit and manage batch jobs.
Intel processors on SuperMUC-NG support hyperthreading, which might increase the performance of your application. To use hyperthreading, increase the number of MPI tasks per node from 48 to 96 in your job script. Please be aware that with 96 MPI tasks per node, each process gets only half of the memory by default. If you need more memory, you have to specify it in your job script and use the fat nodes (see example batch scripts).
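As a minimal sketch, the relevant fragment of a job script for a hyperthreaded run on thin nodes could look like this (the node count and the executable ./myprog.exe are placeholders):

```bash
# Fragment of a job script for a hyperthreaded run on thin nodes
# (node count and ./myprog.exe are placeholders)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=96   # 96 instead of 48 MPI tasks per node with hyperthreading
#SBATCH --ntasks-per-core=2    # use both hardware threads of every physical core

module load slurm_setup        # mandatory on SuperMUC-NG (see below)

mpiexec ./myprog.exe           # SLURM settings determine the number of tasks
```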
List of relevant commands
Command | Functionality |
---|---|
sbatch | submit a job script |
scancel | delete or terminate a queued or running job |
squeue | print a table of submitted jobs and their state. Note: non-privileged users can only see their own jobs. |
salloc | create an interactive SLURM shell |
srun | execute the argument command on the resources assigned to a job. Note: must be executed inside an active job (script or interactive environment). mpiexec is an alternative and is preferred on LRZ systems. |
sinfo | provide an overview of the cluster status |
scontrol | query and modify the SLURM state |
sacct | is currently not working |
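A typical command sequence, sketched with a placeholder script name and job ID:

```bash
sbatch myjob.slurm            # submit the job script; SLURM prints the job ID
squeue -u $USER               # show your own queued and running jobs
scontrol show job <jobid>     # query details of a specific job
scancel <jobid>               # delete or terminate the job
```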
Queues (SLURM partitions) and their limits
- Batch queues are called partitions in SLURM.
- The allocation granularity is multiples of one node (only complete nodes are allocated and accounted for).
- Scheduling and prioritization are based on a multifactor scheme including wait time, job size, partition, and required quality of service.
The following partitions are available. Check with sinfo for more details and special partitions:
partition | min-max nodes per job | max usable memory per node | cores per node | max run time (hours) | max running jobs per user | max submitted jobs per user (QoS) |
---|---|---|---|---|---|---|
test (also used for interactive access with salloc) | 1-16 | 90 GB | 48 | 0.5 | 1 | 3 |
micro | 1-16 | 90 GB | 48 | 48 | 20 | 30 |
general | 17-768 | 90 GB | 48 | 48 | 5 | 20 |
large | 769-3072 (half of system) | 90 GB | 48 | 24 (to be increased) | 2 | 5 |
fat | 1-128 | 740 GB | 48 | 48 | 2 | 10 |
mixed (not yet available) | 64-3072 | 90 GB and 760 GB | 48 | 12 | 1 | |
srun and mpiexec
With the SLURM srun command, users can spawn any kind of application, process, or task inside a job allocation, or directly start a parallel job (implicitly asking SLURM to create the appropriate allocation). The command can be a shell command, any single- or multi-threaded executable in binary or script format, an MPI application, or a hybrid application with MPI and OpenMP. When no allocation options are given to srun, the options from sbatch or salloc are inherited.
Note: srun at LRZ is defined as the alias srun='I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so /usr/bin/srun'.
Since aliases are not inherited, the alias is only available in the login shell or in the initial batch script; everywhere else it falls back to /usr/bin/srun. Use the full syntax in these cases.
Note: mpiexec is the preferred and supported way to start applications. srun might fail for hyperthreaded applications.
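A hedged sketch of both launch variants inside a job allocation (./myprog.exe is a placeholder):

```bash
# Preferred on LRZ systems: start the MPI application with mpiexec
mpiexec -n $SLURM_NTASKS ./myprog.exe

# If srun has to be used where the alias is not available (e.g. in a sub-shell),
# spell out the full syntax from the alias above:
I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so /usr/bin/srun -n $SLURM_NTASKS ./myprog.exe
```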
salloc / srun for interactive processing
salloc is used to allocate nodes for interactive processing. The options for resource specification in salloc/srun/sbatch are the same.
Currently, at least --account, --time and --partition must be specified!
If there are difficulties starting up, it may be advantageous to also specify --ear=off. See the EAR document for more details.
"srun" can be used instead of "mpiexec"; both commands execute on the nodes previously allocated by the salloc.
There is no advantage to using "salloc" over "sbatch --partition=test" in terms of wait time.
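A hedged example of an interactive session (the project ID pr12ab and ./myprog.exe are placeholders):

```bash
# Allocate 2 nodes in the test partition for 30 minutes
salloc --account=pr12ab --partition=test --nodes=2 --time=00:30:00 --ear=off
# Commands now run on the allocated nodes, e.g. 2 x 48 MPI tasks:
mpiexec -n 96 ./myprog.exe
# Release the allocation when done
exit
```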
sbatch Command / #SBATCH option
Batch job options and resources can be given as command line switches to sbatch (in which case they override script-provided values), or they can be embedded into a SLURM job script as a comment line of the form #SBATCH <option>.
Batch Job Examples
General options applicable for all jobs:

#!/bin/bash
# Job Name and Files (also --job-name)
#SBATCH -J jobname
# Output and error (also --output, --error):
#SBATCH -o ./%x.%j.out
#SBATCH -e ./%x.%j.err
# Initial working directory (also --chdir):
#SBATCH -D ./
# Notification and type
#SBATCH --mail-type=END
#SBATCH --mail-user=insert_your_email_here
# Wall clock limit:
#SBATCH --time=24:00:00
#SBATCH --no-requeue
# Setup of execution environment
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --account=insert_your_projectID_here
# constraints are optional
#--constraint="scratch&work"

<insert the specific options for resources and execution from below here>

Hints and explanations: replacement patterns in filenames (see File Patterns below), notification types, requeue/no-requeue, environment, get-user-env, account, constraint (optional).
Options for resources and execution
Resource specifications:
- module load slurm_setup : performs specific settings which cannot be set up in the job prolog. Without this line your job will fail!
- --nodes=<minnodes[-maxnodes]>
- --ntasks
- --ntasks-per-node
- --ntasks-per-core
- --cpus-per-task
- --switches=<number>[@waittime hh:mm:ss]
- --array
- mpiexec: if SLURM can detect the number of tasks from its settings, it is sufficient to use mpiexec without further parameters.
Execution specifications:
- Energy Aware Runtime (EAR): by default, the system may dynamically change the clock frequency of CPUs during the run time of a job to optimise for energy consumption (for more details, see Energy Aware Runtime). This makes profiling or benchmark measurements difficult and unstable. Users can enforce a fixed default frequency by switching EAR off (--ear=off).
- Pinning
A combined example is sketched below.
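A hedged sketch of such a resource and execution block for an MPI job in the general partition (node count and ./myprog.exe are placeholders); the general options from the previous example go before it:

```bash
#SBATCH --partition=general
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=48
#SBATCH --ear=off              # optional: fixed CPU frequency for stable measurements

module load slurm_setup        # mandatory; without this line your job will fail

mpiexec ./myprog.exe           # SLURM settings determine the number of tasks
```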
Submitting several jobs with dependencies: use --dependency=<dependency_list> (see the sketch below).
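A hedged sketch of a job chain (the script names are placeholders); the --parsable option makes sbatch print only the job ID so it can be reused:

```bash
JOBID=$(sbatch --parsable step1.slurm)
JOBID=$(sbatch --parsable --dependency=afterok:$JOBID step2.slurm)
sbatch --dependency=afterok:$JOBID step3.slurm   # starts only after step2 finished successfully
```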
Input Environment Variables
Upon startup, sbatch will read and handle the options set in the following environment variables. Note that environment variables will override any options set in a batch script, and command line options will override any environment variables. Some that you may want to set in $HOME/.profile (see the example below the table):
Variable | Option |
---|---|
SBATCH_ACCOUNT | --account |
SBATCH_JOB_NAME | --job-name |
SBATCH_REQUEUE SBATCH_NOREQUEUE | --requeue --no-requeue |
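For example, a default account could be set in $HOME/.profile (the project ID is a placeholder); command line options still override it:

```bash
export SBATCH_ACCOUNT=pr12ab   # hypothetical project ID; used when --account is not given
```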
Output Environment Variables
The SLURM controller sets the following variables in the environment of the batch script:
Variable | Meaning |
---|---|
SLURM_JOB_ID, SLURM_JOBID | Both variants return the SLURM job ID |
SLURM_JOB_ACCOUNT | Account name associated with the job allocation |
SLURM_JOB_NUM_NODES | Number of nodes |
SLURM_JOB_NODELIST | List of nodes allocated to the job, in compressed format. To convert it into a full list: scontrol show hostnames $SLURM_JOB_NODELIST |
SLURM_NTASKS | Number of tasks; only set if the corresponding sbatch option was given. Example of usage: mpiexec -n $SLURM_NTASKS |
SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job; the returned value looks like "96(x128)" |
SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node; the returned value looks like "8(x128)" |
SLURM_PROCID | The MPI rank (or relative process ID) of the current process. Can be used in wrapper scripts, e.g. if [ "$SLURM_PROCID" -eq 0 ]; then ... fi (see the sketch below) |
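A hedged sketch of a per-task wrapper script using some of these variables (./myprog.exe is a placeholder; the wrapper is assumed to be started once per task, e.g. with srun):

```bash
#!/bin/bash
# Only the first rank prints the node list of the job
if [ "$SLURM_PROCID" -eq 0 ]; then
    echo "Job $SLURM_JOB_ID runs on $SLURM_JOB_NUM_NODES node(s):"
    scontrol show hostnames "$SLURM_JOB_NODELIST"
fi
exec ./myprog.exe
```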
File Patterns
sbatch allows a filename pattern to contain one or more replacement symbols: a percent sign "%" followed by a letter.
Example: #SBATCH -o ./%x.%j.out
Pattern | Expansion |
---|---|
%j | job ID of the running job |
%J | jobid.stepid of the running job |
%a | job array index (task ID) number |
%u | User name |
%x | Job Name |
%t | task identifier (rank) relative to current job. This will create a separate IO file per task. |
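As a hedged illustration of the %t pattern (which only makes sense when writing one file per task), a pattern can also be passed directly to srun; ./myprog.exe is a placeholder:

```bash
# One output file per task, e.g. jobname.<jobid>.0.out, jobname.<jobid>.1.out, ...
srun --output=./%x.%j.%t.out ./myprog.exe
```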
Useful commands
Show the estimated start time of a job: squeue --start [-u <userID>]
Guidelines for resource selection
Processing Mode
- Jobs that only use one or at most a few hardware cores perform serial processing and are not supported on SuperMUC-NG. Use the SuperMUC-Cloud for such purposes.
- Multiple independent tasks can be bundled into one job, using one or more nodes.
Run time limits
- Please note that all job classes impose a maximum run time limit. It can be adjusted downward for any individual job. Since the scheduler uses a backfill algorithm, the more realistic the run time limit you specify, the better throughput your job may achieve.
Islands/Switches
- When a tree topology is used, this option defines the maximum count of switches desired for the job allocation and, optionally, the maximum time to wait for that number of switches. If SLURM finds an allocation containing more switches than the count specified, the job remains pending until it either finds an allocation with the desired (lower) switch count or the time limit expires. If there is no switch count limit, there is no delay in starting the job. This trades off better performance against shorter wait time in the queue (see the example below).
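A hedged example of the switches option in a job script:

```bash
# Ask for an allocation that spans at most one switch (island),
# but wait no longer than 24 hours for it
#SBATCH --switches=1@24:00:00
```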
Memory Requirements
- The total memory available in user space for the set of nodes requested by the job must not be exceeded.
- The memory used on each individual node must not be exceeded by all tasks run on that node.
- Applications exist for which the memory usage is asymmetric. In this case it may become necessary to work with a variable number of tasks per node. One relevant scenario is a master-worker scheme where the master may need an order of magnitude more memory and therefore requires a node of its own, while workers can share a node. LRZ provides the "mixed" partition for using thin and fat nodes concurrently.
Disk and I/O Requirements
- Disk and I/O requirements are not controlled by the batch scheduling system. They rely on parallel shared file systems, which provide system-global services with respect to bandwidth; the total I/O bandwidth is shared between all users. As a consequence, all I/O may be significantly slowed down when the file systems are heavily used by multiple users at the same time, or even, for large-scale parallel jobs, by a single user. At present, LRZ cannot make any quality-of-service assurance for I/O bandwidth.
- The appropriate usage of the parallel file systems is essential.
- Please consult File Systems of SuperMUC-NG for more detailed technical information.
Licences
- Some jobs may make use of licensed software, either from the LRZ software application stack, or of software installed in the user's HOME directory. In many cases, the software needs to access a license server because there exist limits on how many instances of the software may run and who may access it at all.
- There is no connection from SuperMUC-NG to the outside. Check with LRZ if you are in need of such licenses.
- LRZ is currently not able to manage license contingents, because this would require significant additional effort, both in configuring SLURM suitably and in how the license servers are managed. This implies that a job will fail if the usage limit of a licensed software product is exceeded when the job starts.
Conversion of scripts from LoadLeveler and other Workload Managers
- see: List of the most common commands, environment variables, and job specification options used by the major workload management systems
Specific Topics (jobfarming, constraints)
SLURM Documentation
- SLURM Workload Manager at LRZ
- Command/option Summary (two pages)
- Documentation for SLURM at SchedMD
- The manual pages slurm(1), sinfo(1), squeue(1), scontrol(1), scancel(1), sview(1)