Policies on the Linux Cluster
Usage Policies on the Login Nodes
General Job Processing Rules
Job Scheduling
- For parallel jobs, we recommend explicitly specifying the runtime limit with the "--time" option. This may shorten the waiting time, because the job can then be run in Slurm's backfill mode: a short job may get resources that would otherwise sit idle while Slurm tries to fit another large job into the system. Slurm can only arrange this if it knows how long your job will run, which is exactly what the "--time" specification tells it.
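As an illustration of choosing a value for "--time", the following sketch derives a limit from a measured runtime plus a safety margin. The function name slurm_time_limit and the 20 % margin are assumptions made for this example, not LRZ recommendations; only the "--time" option and its HH:MM:SS format come from Slurm itself.

```shell
#!/bin/bash
# Sketch: turn a measured runtime (in seconds) plus a safety margin (in percent)
# into the HH:MM:SS format accepted by Slurm's --time option.
# slurm_time_limit is a hypothetical helper, not part of Slurm.
slurm_time_limit() {
    local seconds=$1 margin=$2 total
    # Add the margin and round up to the next full second.
    total=$(( (seconds * (100 + margin) + 99) / 100 ))
    printf '%02d:%02d:%02d\n' $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A test run took 1500 s; request 20 % more. Prints 00:30:00.
slurm_time_limit 1500 20
```

The result can be written into the batch script as "#SBATCH --time=00:30:00" or passed on the command line, for example sbatch --time="$(slurm_time_limit 1500 20)" job.sh, where job.sh is a placeholder for your own script. Keep in mind that Slurm terminates a job that reaches its time limit, so the margin should not be too tight.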
Job Submission
- Submitting large numbers of jobs (> 100, including array jobs) with very short run times (< 2 minutes) is considered a misuse of resources. It wastes computational resources and, if mail notifications are used, disrupts the notification system. Users who submit such jobs will be banned from further use of the batch system. Instead, bundle the individual tasks into a single, much larger job.
- To prevent a single user from monopolizing the clusters, the number of jobs each user can submit is limited. Please check the job limits.
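The bundling recommended above can be sketched as a single batch script that loops over its inputs instead of submitting one job per input. This is a minimal sketch under assumptions: process_one stands in for your real per-input command, the input file names are placeholders, and the "--time" and "--ntasks" values are example values only.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00   # assumed limit that covers all bundled tasks together

# process_one is a hypothetical stand-in for the real per-input command.
process_one() {
    echo "processed $1"
}

# Run every task one after another inside this single job and
# report how many of them failed.
run_bundle() {
    local failed=0 input
    for input in "$@"; do
        process_one "$input" || failed=$(( failed + 1 ))
    done
    echo "failed tasks: $failed"
    [ "$failed" -eq 0 ]
}

run_bundle input_001.dat input_002.dat input_003.dat
```

With this layout, several hundred tasks of a minute each become one job of a few hours, and at most one notification mail is sent for the whole bundle.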
Use of Mail Notifications in Batch Jobs
All cluster systems allow you to include an e-mail address in Slurm batch scripts in order to be notified about certain job states (typically job start and/or job end). Please note that every notification request must include a valid e-mail address!
If you request e-mail to an invalid or non-existent address, the Linux Cluster administration will revoke your job submission rights as a defensive measure against a denial-of-service attack on the LRZ mail hub.
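One possible safeguard is to check the address before submitting. The sketch below is an assumption-laden example: looks_like_email and submit_with_mail are hypothetical helper names, and the check is purely syntactic, so it catches typos but cannot tell whether the mailbox actually exists. The options "--mail-type" and "--mail-user" are standard sbatch options.

```shell
#!/bin/bash
# Sketch: refuse to submit when the notification address is obviously malformed.
# looks_like_email and submit_with_mail are hypothetical helpers.
looks_like_email() {
    case "$1" in
        *[[:space:]]*|*@*@*) return 1 ;;   # whitespace or more than one "@"
        ?*@?*.?*)            return 0 ;;   # something@domain.tld
        *)                   return 1 ;;
    esac
}

submit_with_mail() {
    local address=$1 script=$2
    if ! looks_like_email "$address"; then
        echo "refusing to submit: '$address' is not a plausible e-mail address" >&2
        return 1
    fi
    # Request a mail when the job ends and submit the batch script.
    sbatch --mail-type=END --mail-user="$address" "$script"
}
```

A call would look like submit_with_mail jane.doe@example.org job.sh, where both the address and job.sh are placeholders.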
Job Farming
- Jobs using the "srun & wait" idiom (cf. General Considerations to Job-/Task-Farming) may start many job steps. Please note that the total number of job steps per Slurm job, that is, the total number of "srun" calls, is limited. A job that exceeds this limit fails with errors like this:
srun: error: Unable to create step for job ...: Step limit reached for this job
The limits differ between the cluster segments:
- cm4: 40000
- serial: 1000
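One way to stay below these limits is to let each "srun" call work through a range of tasks rather than a single task. The following is a minimal sketch, not the LRZ farming recipe: plan_steps and farm_tasks are hypothetical helper names, and ./worker.sh is an assumed script that processes the task IDs from its first to its second argument one after another.

```shell
#!/bin/bash
# Sketch: group tasks into ranges so that the number of srun calls
# never exceeds the step limit of the cluster segment.

# Print one "first-last" range per job step for ntasks tasks.
plan_steps() {
    local ntasks=$1 limit=$2 chunk first last
    # Smallest number of tasks per step that keeps the step count within the limit.
    chunk=$(( (ntasks + limit - 1) / limit ))
    first=1
    while [ "$first" -le "$ntasks" ]; do
        last=$(( first + chunk - 1 ))
        if [ "$last" -gt "$ntasks" ]; then
            last=$ntasks
        fi
        echo "$first-$last"
        first=$(( last + 1 ))
    done
}

# Start one job step per range with the "srun & wait" idiom.
# ./worker.sh is a hypothetical script; adapt the srun options to your job.
farm_tasks() {
    local ntasks=$1 limit=$2 range
    for range in $(plan_steps "$ntasks" "$limit"); do
        srun --ntasks=1 ./worker.sh "${range%-*}" "${range#*-}" &
    done
    wait
}
```

For example, farm_tasks 2500 1000 on the serial segment would issue 834 "srun" calls of up to three tasks each instead of 2500 calls. How many of these steps actually run at the same time depends on the allocation and on the srun options described in the job farming documentation.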
Memory Use
- Jobs that exceed the physical memory available on the selected node(s) will be removed, either by Slurm itself, by the OOM ("out of memory") killer in the operating system kernel, or at LRZ's discretion, since such usage typically has a negative impact on system stability.
Software Licenses
- Many commercial software packages have been licensed for use on the cluster. Most of these require so-called floating licenses, only a limited number of which are typically available. Since it is not possible to check whether a license is available before a batch job starts, LRZ cannot guarantee that a batch job requesting such a license will run successfully.