Running large-memory jobs on the Linux Cluster

Documentation links you should know

  • Job Processing on the Linux-Cluster: Overview of Linux Cluster partitions, resource limits, job limits, job specifications, common Slurm commands on the Linux Cluster
  • Job policies and general rules: General and cluster-/job-specific policies. Please also take note of and respect our Linux Cluster rules!
  • SLURM Workload Manager: Slurm commands and options for job submission, explanations, recommendations
  • Environment Modules, Spack Generated Modules: Overview of the module system and the LRZ software stack

Overview

To process large-memory jobs, we recommend using the Teramem system of the Linux Cluster. Jobs may be run both as interactive and batch jobs. Please consider the limits on the partition teramem_inter and the fact that it is a shared system, i.e., both interactive jobs and batch jobs of different users share the Teramem! As this system consists of a single compute node only, we recommend making careful use of it.

The Teramem – a single node with approx. 6 TB of memory – is the only system at LRZ which can satisfy memory requirements beyond 1 TB on a single node. Please consider:

  • Typical Teramem jobs may use a relatively small amount of CPU resources in shared-memory applications.
  • Typical Teramem jobs may run applications without any distributed-memory parallelization.
  • Running MPI applications on the Teramem is discouraged! If distributed-memory parallelization is required, run the job on the partition cm4_std if possible. If this is insufficient, please consult the decision matrix.

This page briefly describes how to use the Teramem.

Interactive jobs on the Teramem
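Interactive jobs can be started via Slurm's salloc command. The following is a minimal sketch assuming standard salloc/srun usage; the resource values are illustrative placeholders only, so please adapt them to your needs and to the limits of the partition teramem_inter:

# Request an interactive allocation on the Teramem (values are examples only)
salloc --clusters=inter --partition=teramem_inter --ntasks=1 --cpus-per-task=8 --mem=500G --time=02:00:00
# Once the allocation is granted, run the application on the allocated node
srun ./my_shared-memory_program.exe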

Batch jobs on the Teramem step by step

Step 1: Prepare a batch job script

In order to start a batch job, you need to log in to the Linux Cluster. Then you can submit a prepared job script to the Slurm scheduler.

Each line of the batch job script is explained below. Entries marked "(Placeholder)" must be adapted to your own settings.

#!/bin/bash

#SBATCH -J <job_name>
(Placeholder) Set the name of the job (please use no more than 10 characters).

#SBATCH -D ./
Set the working directory. This directory is used by the script as the starting point and must exist before the job starts. Here, the path is relative to the directory where the job script is submitted.

#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
(Placeholder) Standard output/error is written there. The directory where the output file is placed must exist before the job starts, and the (full) path name must be specified (no environment variable!). "%x" encodes the job name into the output file name, "%j" the job ID, and "%N" the name of the master node of the job. Here, the specified path is relative to the directory set via the -D specification.

#SBATCH --get-user-env
Set the user environment properly.

#SBATCH --export=NONE
Do not export the environment of the submitting shell into the job. While Slurm also allows ALL here, this is strongly discouraged, because the submission environment is very likely to be inconsistent with the environment required for execution of the job.

#SBATCH --clusters=inter
#SBATCH --partition=teramem_inter
Specify the names of the cluster segment and partition.

#SBATCH --ntasks=<number>
(Placeholder) The number of MPI tasks to start on each node. Typically, the value used here should not be larger than the number of physical cores in a node. It may be chosen smaller for various reasons (memory needed per task, hybrid programs, etc.). Set it to 1 in purely serial jobs. Note: if CPU resources are not specified, Slurm will assign only a single core to the job by default!

#SBATCH --cpus-per-task=<number>
(Placeholder) Set the number of (OpenMP) threads per (MPI) task. This parameter is required in shared-memory and hybrid job setups. Set it to 1 in purely serial jobs.

#SBATCH --mem=<size>[unit]
(Placeholder) Specify the maximum memory the job can use, e.g. 2T = 2 TB (typical units: M, G, T). Very large values can cause assignment of additional cores to the job that remain unused, so this feature should be used with care. This parameter is optional; the default memory per job scales with the number of CPU cores used by the job.

#SBATCH --time=<HH:MM:SS>
(Placeholder) Set the maximum run time of the job using the format "hours:minutes:seconds".

module load slurm_setup
First executed line: Slurm settings necessary for the proper setup of the batch environment.

module load <module_name>
(Placeholder) Load any required environment modules (usually needed if the program is linked against shared libraries, or if paths to applications are needed), e.g. MPI. Note that modules are not auto-loaded; the job starts with a clean environment, so please load all required modules explicitly.

./my_serial_program.exe
(Purely serial application) Start the application.

./my_shared-memory_program.exe
(Shared-memory application) Start the application.

Step 2: Job submission procedure

The job script "my_job.slurm" is submitted to the queue via Slurm's sbatch command. At submission time the control sequences are evaluated and stored in the queuing database, and the script is copied into an internal directory for later execution. If the command was executed successfully, the job ID will be returned. The sequence should look like: 

userid@loginnode:~> sbatch my_job.slurm 
Submitted batch job 305429 on cluster inter

It is a good idea to note down your job IDs, for example to provide them to the LRZ Linux Cluster Support if anything goes wrong. You may also invoke Slurm commands to inspect your jobs or to check their status. See section "Job management".

Furthermore, the sbatch command line can also contain control sequences, which override the settings in the script. We strongly advise against doing so! Otherwise, it might be difficult for you or the LRZ HPC support to understand and reproduce the job configuration based on the job script file.
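For illustration, such an override would look as follows; the command-line option --time takes precedence over the #SBATCH --time setting in the script (shown only to clarify what we advise against):

userid@loginnode:~> sbatch --time=02:00:00 my_job.slurm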

Step 3: Job management

Once submitted, the job will be queued for some time, depending on how many jobs are currently queued and how many resources are available. Eventually, typically after previously submitted jobs have completed, the job will be started on one or more of the systems determined by its resource requirements. Slurm provides several commands to check the status of waiting or running jobs, to inspect or (to a limited extent) modify waiting/running jobs, to obtain information about finished jobs, and to delete waiting/running jobs.

Please consult Job Processing on the Linux-Cluster for a list of common Slurm commands on the Linux Cluster.
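For illustration, a few commonly used commands are sketched below; <job_id> is a placeholder, and the cluster segment is selected via the --clusters (or -M) option, since the Teramem belongs to the cluster segment inter:

squeue --clusters=inter --user=$USER          # list your waiting and running jobs
scontrol --clusters=inter show job <job_id>   # inspect the configuration of a job
scancel --clusters=inter <job_id>             # cancel a waiting or running job
sacct --clusters=inter -j <job_id>            # obtain information on a finished job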

Batch job example

The example job script below is provided as a template which you can adapt to your own settings. In particular, some entries are placeholders which you must replace with correct, user-specific settings. For example, path specifications must be adapted: you need to substitute "path_to_my_prog". Furthermore, the job in our example will be executed in the path where the job script is located. That is, the working directory (see option "-D") is set to "./". All other path names used in the script are relative to this working directory. Please also keep in mind that the job output files are written to this directory; after running some jobs, it might be cluttered with lots of files. We recommend using a separate directory for that.

Your job might produce a lot of (temporary) data. Please configure the data paths accordingly. For recommendations on how to perform large-scale I/O, please refer to the description of the file systems available on the cluster.
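As a hypothetical sketch (the variable $SCRATCH and the directory layout are assumptions, not guaranteed settings; please check the file systems documentation for the actual paths on the cluster), temporary data could be redirected within the job script like this:

# Hypothetical: assumes $SCRATCH points to a scratch file system
export MY_TMPDIR=$SCRATCH/tmp_$SLURM_JOB_ID
mkdir -p $MY_TMPDIR
# Point the application's temporary/output paths to $MY_TMPDIR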

Shared-memory job

#!/bin/bash
#SBATCH -J job_name
#SBATCH -D ./
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --clusters=inter
#SBATCH --partition=teramem_inter
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=2T
#SBATCH --time=08:00:00
 
module load slurm_setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
MYPROG=path_to_my_prog/my_shared-memory_program.exe
 
$MYPROG

The example will start a shared-memory application using 20 threads and requesting 2 TB of memory on the partition teramem_inter. The maximum job runtime is set to 8 hours.
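
Serial job

For a purely serial job, the same template applies with the CPU settings reduced to a single core (see the explanations in Step 1). A minimal variant, with placeholders as before, might look like this:

#!/bin/bash
#SBATCH -J job_name
#SBATCH -D ./
#SBATCH -o ./%x.%j.%N.out
#SBATCH -e ./%x.%j.%N.err
#SBATCH --get-user-env
#SBATCH --export=NONE
#SBATCH --clusters=inter
#SBATCH --partition=teramem_inter
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2T
#SBATCH --time=08:00:00

module load slurm_setup
MYPROG=path_to_my_prog/my_serial_program.exe

$MYPROG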