Lightweight and Streaming Debugging

General Considerations

Debugging is hard and often frustrating work, usually done under time pressure and instead of the things one would actually like to do. A short debugging guide and some easy-to-use tools are therefore desirable, we believe.

Errors can occur at different levels of the workflow: compile-time errors, link-time errors, and run-time errors, each with its own variety of causes, impacts and signatures. We cover here only run-time errors, and specifically the HPC cases, as those are the most time consuming for us. Of course, we assume that you compiled your software carefully in a clean and consistent environment. Debugging is difficult enough for well-written and well-built software; introducing additional side effects may render any attempt to debug errors hopeless.

Debugging is about correctness! Optimization might be counter-productive for correctness! Sort your priorities!

First Law of Debugging: Complexity is the enemy of debugging!

Keep your cases (specifically for testing) simple!
Keep your environment as clean and simple as possible (try to live with system defaults)!
Keep errors reproducible! Specifically, it should always be the same error you are trying to debug; hitting moving targets is much more difficult!
Keep your test cases as small and short as possible! The higher the test frequency, the higher your chance of success (and of getting help).

Second Law of Debugging: Approach the problem systematically!

Just because MPI tells you that an error occurred does not mean that MPI is the cause!
Increase the level of verbosity! Switch on debugging output where available! Compile with the -g and -traceback options (see the sketch after this list). More information is helpful!
Use tools to narrow down, temporally and spatially, where and when the error occurs, if possible!
Document your test environment! This goes hand in hand with the requirement of reproducibility.
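
A minimal sketch of what this can look like, assuming Intel compilers and Intel MPI as in the examples below (program and source file names are placeholders):

mpiifort -g -traceback -O0 -o myprog myprog.f90   # debug symbols and run-time tracebacks
export I_MPI_DEBUG=5                              # more verbose Intel MPI runtime output
mpiexec -l ./myprog                               # -l prefixes each output line with the rank ID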

Third Law of Debugging: Use the right diagnostic tools for the right error!

Hunting memory problems with MPI communication analysis tools is bound to be fruitless.
Know your tools and their abilities! Look for simple tools that are easy to use, because complexity is ....

Fourth Law of Debugging: Segmentation faults are your friend!

Some errors, like race conditions, appear as hanging and idling processes. Jobs quietly sleeping away with no further output may go unnoticed, wasting resources. Use timeouts for communications and operations!
Try to program, compile and run in a way such that clear points of failure appear, the earlier the better!
Don't be afraid to use export MALLOC_CHECK_=3! Or use efence (see the sketch after this list)!
The closer to the root cause a program aborts, the easier it is to find that cause.
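
A minimal sketch of both options; the program name is a placeholder, and we assume the efence module (see the module section below) puts the library on the linker search path:

export MALLOC_CHECK_=3                            # glibc heap checking: report and abort on heap corruption
mpiexec ./myprog

module use /lrz/sys/share/modules/extfiles/
module load debugging/efence
mpiifort -g -traceback -o myprog myprog.f90 -lefence   # link with Electric Fence: invalid accesses segfault immediately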


Of course, some errors appear only at larger scale. But debugging at the scale of hundreds or even thousands of nodes, or over long run times, is very expensive. Try to avoid such scenarios!

User-Built Job/Process Monitor

Often, jobs do not fail with a hard crash, but starve somehow or end up in some sort of "ill state". These are probably the most difficult scenarios to analyze on black-box systems without interactive access to the compute nodes.

Still, users can instrument their codes to include a health checker, for instance self-surveillance of memory consumption or of other information from the /proc or /sys file systems. There are already some tools accomplishing this task; see the next section!
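
A minimal sketch of such a check from the shell side; the PID and the interval are placeholders, and a code-internal health checker would read the same /proc fields directly:

PID=<pid-of-the-process-to-watch>                 # placeholder
while kill -0 "$PID" 2>/dev/null; do              # loop as long as the process exists
   echo "$(date '+%Y-%m-%d %H:%M:%S') $(grep -E 'VmRSS|VmHWM' /proc/$PID/status | tr -s '[:space:]' ' ')"
   sleep 10                                       # current and peak resident memory every 10 seconds
done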

If you need an adaptable monitor, a simple MPI wrapper script can do this (here with Intel MPI, where PMI_RANK is defined):

Example Script with rank-wise Monitor
#!/bin/bash
[...]                                                   # Slurm header
module load slurm_setup 

cat > mon_wrap.sh << EOT
#!/bin/bash
[ "\$PMI_RANK" == "0" ] && echo "[\$(date '+%Y-%m-%d %H:%M:%S')] Start"
[ "\$PMI_RANK" == "0" ] && echo "running \$*"
if [ "\$(echo \$PMI_RANK%\$SLURM_NTASKS_PER_NODE | bc)" == "0" ]; then
   env > env.\$(hostname).\$PMI_RANK
   top -b -d 5 -n 40 -u \$USER > mon.\$(hostname) &
fi
eval \$* 2>&1 | while IFS= read -r line; do printf '[%s] %s\n' "\$(date '+%Y-%m-%d %H:%M:%S')" "\$line"; done
[ "\$PMI_RANK" == "0" ] && echo "\$(date '+%Y-%m-%d %H:%M:%S') Finish"
EOT
chmod u+x mon_wrap.sh

mpiexec -l ./mon_wrap.sh <user program> <prog parameters>

But take care not to produce so much information that the monitor influences or even dominates the job's workflow! Useful shell commands might be top, ps, free, ...
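
For instance, a lighter ps-based sampling loop could look as follows (interval and output file name chosen arbitrarily; escape the $ signs as above if you embed it in one of the generated wrapper scripts):

while true; do
   date '+%Y-%m-%d %H:%M:%S'
   ps -u "$USER" -o pid,pcpu,pmem,rss,stat,comm --sort=-rss | head -n 10   # your ten largest processes by memory
   sleep 30
done > ps_mon.$(hostname) &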

If you do not need a rank-wise monitor but just a node-wise one, you can simplify it somewhat.

Example Script with node-wise Monitor
#!/bin/bash
[...]
module load slurm_setup

cat > monitor.sh << EOT
#!/bin/bash
echo "[\$(date '+%Y-%m-%d %H:%M:%S')] Start"
env > env.\$(hostname)
top -b -d 5 -n 40 -u \$USER > mon.\$(hostname) &
EOT
chmod u+x monitor.sh

srun --ntasks-per-node=1 --export=ALL --mpi=none ./monitor.sh &   # one monitor instance per node
MON_PID=$!

mpiexec <user program> <prog parameters>

kill -9 $MON_PID    # stop the monitor once the actual run has finished

For instance, a node-wise memory monitor could look like

free -s 5 | while IFS= read -r line; do printf '%s  %s\n' "\$(date '+%Y-%m-%d %H:%M:%S')" "\$line"; done &> mon.\$(hostname)

which produces an output file for each node, with node memory information every five seconds. The output may look like

free output
2024-03-24 09:04:20                total        used        free      shared  buff/cache   available
2024-03-24 09:04:20  Mem:      131166260    36430936    88130324      233072     6605000    93044136
2024-03-24 09:04:20  Swap:       8388604     8071664      316940

and can be analysed and plotted with any tool you deem suitable for this task. The above can be brought into a form digestible by Gnuplot or python/matplotlib, e.g. via

awk 'BEGIN {print "time,mem[GB],mem[%]"}  $3 == "Mem:" {printf("%s %s,%0.2f,%0.2f\n",$1,$2,$5/1024/1024,$5*100/$4)}' mon.file > monitor.csv
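
Applied to the sample free output above, the resulting monitor.csv starts like

monitor.csv
time,mem[GB],mem[%]
2024-03-24 09:04:20,34.74,27.77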

In Gnuplot, it may be as simple as

monitor.pl
set xdata time
set timefmt '%Y-%m-%d %H:%M:%S'
set datafile separator ','
set terminal pdf
set output 'monitor.pdf'
set xlabel "time"
set ylabel "memory consumption"
plot "monitor.csv" u 1:3 w l t 'mem [%]'

LRZ Debugging Tool Modules

These modules are currently hidden as they are not often needed, thank goodness.

> module use /lrz/sys/share/modules/extfiles/
> module av debugging
----------------------- /lrz/sys/share/modules/extfiles ------------------------
debugging/efence/2.2        debugging/heaptrack/1.2.0  debugging/strace/5.9  
debugging/gperftools/2.9.1  debugging/ltrace/0.7.3
> module help debugging/strace
-------------------------------------------------------------------
Module Specific Help for /lrz/sys/share/modules/extfiles/debugging/strace/5.9:

modulefile "strace/5.9"
  Provides debugging tool ltrace
  Provides also MPI wrapper for MPI parallel tracing using strace,
  e.g. inside Slurm (backslashes are necessary):
    mpiexec -l strace_mpi_wrapper.sh -ff -t -o strace.\$\(hostname\).\$PMI_RANK <your-prog> <your-options>
-------------------------------------------------------------------

In the future, there might remain just a single module, like yatb, containing all the tools.

strace and ltrace

strace traces system calls, ltrace traces library calls. Both are easy to use because they produce plain ASCII traces that can be scrutinized with any editor.

For MPI programs, we created a wrapper, which can be used as follows:

mpiexec -l strace_mpi_wrapper.sh -ff -t -o strace.\$\(hostname\).\$PMI_RANK <your-prog> <your-options>

The reason for the wrapper is that hostname and PMI_RANK should be evaluated at task start on the respective node, which simplifies the assignment of output files to rank IDs afterwards. The strace options -ff and -t are for tracing threads (each into a separate file) and for inserting time stamps into the output, respectively. Other approaches are certainly also possible.

Similarly for ltrace. Consult module help for strace and ltrace!
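
As a quick first pass over the per-rank trace files, it is often enough to look for failed system calls (return value -1); the file names below are placeholders following the -o pattern above (with -ff, strace appends the PID of each traced thread/process):

grep -- '= -1 ' strace.nodename.0.* | head       # first failing system calls of rank 0
grep -c -- '= -1 ' strace.*                      # count failures per trace file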

heaptrack

heaptrack traces memory allocations and consumption. It is a lightweight surrogate for Valgrind's massif and can be used to find memory leaks, out-of-memory events, and more, also in cases where the program aborts or is killed by the OOM killer.
Simply use it like this:

mpiexec heaptrack <your-prog> <your-options>

It produces compressed profile files with the name scheme heaptrack.<your-prog>.<HOSTNAME>.<PID>.zst (or some other compression suffix), which can be analyzed using heaptrack_print or heaptrack_gui (if available).
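
A minimal sketch of the subsequent analysis (the file name is a placeholder following the scheme above):

module use /lrz/sys/share/modules/extfiles/
module load debugging/heaptrack
heaptrack_print heaptrack.myprog.node01.12345.zst > heaptrack.myprog.txt   # textual summary: peak consumption, leaks, top allocators
less heaptrack.myprog.txt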

time

time can be used to obtain a simple overview of resource consumption, such as run time and memory usage.

As bash provides its own time keyword with slightly less functionality, you must invoke the external (GNU) time via

\time -v <prog-name>

or

env time -v <prog-name>
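
For example (the program name is a placeholder; GNU time supports -o to write its report to a file):

env time -v -o time.log ./myprog
grep -i -E 'elapsed|maximum resident' time.log   # wall-clock time and peak memory (in kbytes)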