FAQ: Job crashed
1. Are you producing too many files in one directory?
Parallel access to thousands of small files in one directory may cause a crash. Check our best practices for filesystem usage:
Best Practices, Hints and Optimizations for IO
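One common way to avoid a single huge directory is to spread output files over numbered subdirectories. The following is a minimal Python sketch of that idea, not an official recommendation: the function names, the `dir_`/`out_` naming scheme, and the limit of 1000 files per directory are all assumptions chosen for illustration.

```python
from pathlib import Path


def sharded_path(root, index, files_per_dir=1000, prefix="out", suffix=".dat"):
    """Return a path that places file number `index` in a numbered subdirectory.

    files_per_dir is an illustrative value (assumption), not a site limit.
    """
    bucket = index // files_per_dir
    return Path(root) / f"dir_{bucket:05d}" / f"{prefix}_{index:08d}{suffix}"


def write_output(root, index, data, files_per_dir=1000):
    """Write `data` (bytes) to the sharded location, creating the subdirectory."""
    path = sharded_path(root, index, files_per_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

With these defaults, file number 2500 ends up in `dir_00002`, so no directory holds more than 1000 files.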
2. Node Failure?
A node failure may interrupt your job. After several error messages, the job output may end with a line similar to this:
slurmstepd: error: *** STEP 123456.0 ON i01r01c01s09 CANCELLED AT 2022-10-27T08:07:38 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
If this is the case, please resubmit your job. If the failure recurs, please notify us via the incident system.
Billing on SuperMUC-NG: core-hours lost due to a node failure will not be charged.
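If you need to check many job output files for this message, a small script can do the search. The sketch below is only an illustration: the regular expression is modelled on the single example line above, the function name is hypothetical, and accepting `JOB` as well as `STEP` is an assumption about how the message may vary.

```python
import re

# Pattern modelled on the example message above (assumption: other Slurm
# versions or configurations may word the message differently).
NODE_FAILURE = re.compile(
    r"(?:STEP|JOB) (?P<id>\S+) ON (?P<node>\S+) "
    r"CANCELLED AT (?P<time>\S+) DUE TO NODE FAILURE"
)


def find_node_failure(log_text):
    """Return id, node and time of the first node-failure message, or None."""
    match = NODE_FAILURE.search(log_text)
    return match.groupdict() if match else None
```

The returned job or step ID and node name can be included when you report a recurring failure.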
3. Your Job crashed with a Segmentation Fault?
A segmentation fault can have many causes: dereferencing NULL, uninitialized, or invalid pointers (possibly because your application ran out of memory), buffer overflows, incorrect memory deallocation, stack overflow (no, not the website, but exhausting stack memory through deep recursion), out-of-bounds array access, corrupted memory, multithreading issues, floating-point exceptions, incorrect use of an API, running out of file descriptors (which may only manifest later as a segfault; see also point 1 in this section), and so on.
This list is very long and still not quite complete. Here are some tools and tips for debugging your parallel application:
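One quick check is to print the resource limits in effect inside the job, because two of the causes above (stack overflow and running out of file descriptors) depend on them, and a core file size limit of 0 prevents a core dump from being written. The sketch below uses Python's standard `resource` module (Unix only). It is an illustration, not an official tool, and the function names are hypothetical.

```python
import resource


def report_limits():
    """Return the (soft, hard) limits relevant to the crash causes above."""
    limits = {
        "open files": resource.RLIMIT_NOFILE,
        "stack size (bytes)": resource.RLIMIT_STACK,
        "core file size (bytes)": resource.RLIMIT_CORE,
    }
    return {name: resource.getrlimit(res) for name, res in limits.items()}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for name, (soft, hard) in report_limits().items():
        print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```

Run it from within the job script rather than on a login node, since the limits may differ between the two environments.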
4. The Problem might not be in your Application
The software stack your application relies on may be very large, and it may contain a bug that only appears rarely. First, please try to rule out problems in your own application (by going through point 3). If the problem persists, use your favourite search engine to look up the error in forums and communities; switching to another version of a library in your stack may solve the problem. If you find no solution and we support the software, please open an incident. We may need a reproducible test case that is as small as possible, including script, input files, and binary. We are aware that providing software may not be possible due to proprietary information, licensing, and similar issues. In those cases we may not be able to help much beyond giving some guidance on what to do.
Some problems arise at a system-wide level and make applications crash. A typical example is a filesystem problem that results in stale file handles. Billing on SuperMUC-NG: core-hours lost this way can be refunded. In this case, please open an incident listing the job IDs that need a refund.