Debugging

Parallel applications are difficult to debug. Depending on the kind of problem and the type of parallelism, the tools described here can greatly help the debugging process.

Summary

Supported programming model and functionality
Name  MPI  OpenMP  CUDA  GUI  Step by step  Memory debugging
Arm-forge DDT
GDB          
PDB          
Intel Inspector      
Totalview
Valgrind        

To display a list of all available debuggers use the search option of the module command:

$ module search debugger

Compiler flags

Common flags

To debug codes, you need to enable debug symbols. You get these symbols by compiling with the appropriate options:

  • -g to generate debug symbols usable by most debugging and profiling tools.
  • or -g3 to generate even more debugging information (available for GNU and Intel, C, C++ or Fortran compilers).
  • and optionally -O0 to avoid code optimization (this is strongly recommended for first debug sessions).

Flags for Fortran

  • -traceback with ifort or -fbacktrace with gfortran: produces a backtrace if the program crashes, showing which functions or subroutines were being called when the error occurred.

For example, a segmentation fault in a Fortran code compiled without these flags may produce the following message, which is not very helpful:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image    PC               Routine Line    Source
run_exe  000000010005EAC7 Unknown Unknown Unknown
run_exe  000000010005DDA9 Unknown Unknown Unknown
run_exe  00000001000009BC Unknown Unknown Unknown
run_exe  0000000100000954 Unknown Unknown Unknown

Code compiled with -traceback or -fbacktrace gives more useful output:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image    PC               Routine   Line    Source
run_exe  000000010005EAC7 test_m_   265     mod_test.f90
run_exe  000000010005DDA9 io_       52      io.f90
run_exe  00000001000009BC setup_    65      test_Setup.f90
run_exe  0000000100000954 main_     110     launch.f90

  • -check bounds with ifort or -fbounds-check with gfortran (a deprecated alias of -fcheck=bounds): checks that the index is within the bounds of the array each time an array element is accessed. This option substantially slows down program execution but is a convenient way to track down array-related bugs. Without this flag, an illegal array access either produces a subtle error that may not become apparent until much later in the program, or causes an immediate segmentation fault with little information on the origin of the error.

Note

Be careful: most of these compiler options slow down your code.

GDB

GDB is the GNU Debugger. It is a simple, lightweight serial debugger available on most systems.

To start a program under GDB, first make sure it is compiled with -g. Start a GDB session for your code:

$ gdb ./gdb_test
GNU gdb (GDB) Red Hat Enterprise Linux
Copyright (C) 2010 Free Software Foundation, Inc.
(gdb)

Once the GDB session is started, launch the code with:

(gdb) run

If an error occurs, you will be able to get information with backtrace:

Program received signal SIGSEGV, Segmentation fault.
(gdb) backtrace
#0  0x00000000004005e0 in func1 (rank=1) at test.c:14
#1  0x0000000000400667 in main (argc=1, argv=0x7fffffffacc8) at test.c:30

GDB lets you set breakpoints, run the code step by step, and more. See man gdb for more information and options.

With a parallel program, a GDB session handles one process at a time. To attach GDB to a running process, proceed as follows:

  • Compile the program with debug options.
  • Start the program.
  • Find which nodes the program is running on using the ccc_mpp -u $USER command
  • Connect to a compute node used by the program, using the ssh <compute node> command
  • Find the process ID of your application using the ps -fu command
  • Connect to a running process using the gdb -p <process id> command

You can debug several processes at the same time by running one GDB session per process.

DDT

DDT is a highly scalable debugger specifically adapted to supercomputers.

Basics

You can use DDT after loading the appropriate module:

$ module load arm-forge

Then use the command ddt. For parallel codes, edit your submission script and replace the line

$ ccc_mprun -n 16 ./a.out

with:

$ ddt -n 16 ./a.out

You may want to add the -noqueue option to make sure DDT does not submit a new job to the scheduler. You also need to select the right MPI implementation: choose Run, then Change, and select SLURM (generic), as shown in the figures below.

DDT opening window: choose 'Run'

Choose 'change'

Choose 'SLURM (generic)'

Example of submission script:

$ cat ddt.job
#!/bin/bash
#MSUB -r MyJob_Para       # Job name
#MSUB -q <partition>      # Partition name
#MSUB -A <project>        # Project ID
#MSUB -n 32               # Number of tasks to use
#MSUB -T 1800             # Elapsed time limit in seconds
#MSUB -o example_%I.o     # Standard output. %I is the job id
#MSUB -e example_%I.e     # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
ddt -n 32 ./ddt_test

$ ccc_msub -X ddt.job

Note

The -X option for ccc_msub enables X11 forwarding.

Example of DDT window

DDT with NiceDCV

If debugging with DDT requires more performance than X11 forwarding can provide, you may use NiceDCV. First, start DDT in a NiceDCV session:

$ module load arm-forge
$ ddt

Then select Manual Launch and indicate the number of processes:

Launching DDT
Selecting the number of processes

Then submit your code using the ddt-client command:

$ cat submit_visu.sh
#!/bin/bash
#MSUB -r TP4_debugging
#MSUB -n 16
#MSUB -T 1800
#MSUB -q rome
#MSUB -A <project>
#MSUB -m work,scratch
#MSUB -e TP4_debugging_%J.err
#MSUB -o TP4_debugging_%J.out

ml purge
ml mpi

ml arm-forge
ccc_mprun ddt-client  ./cstartmpi

DDT should then catch the launch, and you can use it as usual.

Note

Arm-forge DDT is a licensed product.

Check the output of module show ddt or module help ddt for more information on the number of licenses available.

Full documentation is available in the installation path on the cluster. To open it:

$ evince ${ARMFORGE_ROOT}/doc/userguide-forge.pdf

Advanced: debug MPMD scripts

Before starting DDT, create a submission script that launches your executables in MPMD mode through ddt-client:

$ cat ddt.job
#!/bin/bash
#MSUB -r MyJob_Para       # Job name
#MSUB -q <partition>      # Partition name
#MSUB -A <project>        # Project ID
#MSUB -n 4                # Number of tasks to use
#MSUB -T 1800             # Elapsed time limit in seconds
#MSUB -X
#MSUB -o example_%I.o     # Standard output. %I is the job id
#MSUB -e example_%I.e     # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}

module load arm-forge
cat << END > exe.conf
1   env ddt-client ./algo1
3   env ddt-client ./algo2
END

ccc_mprun -f exe.conf

Then, as before, load the appropriate module:

$ module load arm-forge

Then start ddt:

$ ddt&

Once the DDT interface is visible, select Manual Launch:

../../_images/ddt_started.png

Select the same number of processes as in the #MSUB -n directive of your script (4 here) and press Listen:

../../_images/ddt_manual2.png

Then, launch your script:

$ ccc_msub -X ddt.job

Wait for the job to start; it attaches to DDT automatically. You then have an interface with algo1 and algo2 running at the same time:

../../_images/ddt_run.png

TotalView

TotalView may be used by loading a module and by submitting an appropriate job:

$ module load totalview

Then launch your job with a submission script like:

#!/bin/bash
#MSUB -r MyJob             # Job name
#MSUB -q <partition>       # Partition name
#MSUB -A <project>         # Project ID
#MSUB -n 8                 # Number of tasks to use
#MSUB -T 600               # Time limit
#MSUB -o totalview_%I.o    # Standard output. %I is the job id
#MSUB -e totalview_%I.e    # Error output. %I is the job id
set -x
cd ${BRIDGE_MSUB_PWD}
ccc_mprun -d tv ./totalview_test

It needs to be submitted with:

$ ccc_msub -X totalview.job

TotalView should open on the Startup Parameters window. There is nothing to change here, just hit OK. Once in the main window, you can either go back to the parameter window with “<ctrl-a>” or launch the code with “g”.

Example of Totalview window

Note

TotalView is a licensed product.

Check the output of module show totalview or module help totalview for more information on the number of licenses available.

Full documentation is available in the installation path on the cluster. To open it:

$ evince ${TOTALVIEW_ROOT}/doc/pdf/TotalView_User_Guide.pdf

Pdb Python debugger

pdb is a built-in Python debugger that aids in inspecting your code, setting breakpoints, and understanding program flow.

First, load the python3 module:

$ module load python3

To run your script under pdb, pass the -m pdb option to the python3 command:

$ python3 -m pdb monte_carlo_pi.py
> monte_carlo_pi.py(1)<module>()
-> import random
(Pdb) ...

A (Pdb) prompt opens. Use the help command to list the available commands:

(Pdb) help

Documented commands (type help <topic>):
========================================
EOF    c          d        h         list      q        rv       undisplay
a      cl         debug    help      ll        quit     s        unt
alias  clear      disable  ignore    longlist  r        source   until
args   commands   display  interact  n         restart  step     up
b      condition  down     j         next      return   tbreak   w
break  cont       enable   jump      p         retval   u        whatis
bt     continue   exit     l         pp        run      unalias  where

Use help <topic> for more information about any command:

(Pdb) help a
a(rgs)
     Print the argument list of the current function.

Use a breakpoint (break <file>:<line>) to stop the code when a file line is reached:

(Pdb) break monte_carlo_pi.py:10
Breakpoint 1 at monte_carlo_pi.py:10
(Pdb) continue
> monte_carlo_pi.py(10)estimate_pi()
-> distance = x**2 + y**2

Use the list . command to display the source code around the current line:

(Pdb) list .
  3         def estimate_pi(num_points):
  4             points_in_circle = 0
  5
  6             for _ in range(num_points):
  7                 x = random.uniform(0, 1)
  8                 y = random.uniform(0, 1)
  9
 10 B->             distance = x**2 + y**2
 11                 if distance <= 1:
 12                     points_in_circle += 1
 13
 14             return 4 * points_in_circle / num_points

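Note that B marks a breakpoint and -> marks the current line.

pdb prints values with the p or pp commands. The debugger stops before executing the current line, so at this point distance has not been assigned yet: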

(Pdb) p distance
*** NameError: name 'distance' is not defined

To see all the local variables at once, call the locals() function:

(Pdb) locals()
{'num_points': 10000000, 'points_in_circle': 0, '_': 0, 'x': 0.2584912119080409, 'y': 0.6628071583040221}

Note that globals() is the equivalent for global variables, but its output can be very long.

Use the next command to execute the code line by line:

(Pdb) next
> /ccc/work/cont000/asplus/cotte/cProfile/monte_carlo_pi.py(11)estimate_pi()
-> if distance <= 1:
(Pdb) p distance
0.5061310357327408
(Pdb) ll
  3         def estimate_pi(num_points):
  4             points_in_circle = 0
  5
  6             for _ in range(num_points):
  7                 x = random.uniform(0, 1)
  8                 y = random.uniform(0, 1)
  9
 10 B               distance = x**2 + y**2
 11  ->             if distance <= 1:
 12                     points_in_circle += 1
 13
 14             return 4 * points_in_circle / num_points

Note that -> has moved to line 11 and that distance is now defined.
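The session above only shows lines 1 and 3 to 14 of monte_carlo_pi.py. To reproduce it, a minimal sketch of a complete script consistent with those listings might look like the following. The driver code under `__main__` and the number of points are assumptions, since they are not shown on this page.

```python
import random

def estimate_pi(num_points):
    points_in_circle = 0

    for _ in range(num_points):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)

        distance = x**2 + y**2
        if distance <= 1:
            points_in_circle += 1

    return 4 * points_in_circle / num_points


if __name__ == "__main__":
    # Assumption: the original driver code is not shown in the listings.
    # The session above ran with num_points = 10000000; a smaller value
    # keeps a first debugging run short.
    print(estimate_pi(100_000))
```

The line numbers match the listings, so break monte_carlo_pi.py:10 stops on the same statement as in the session above.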

Here are the most useful commands:

Command      Description
b(reak)      Set a breakpoint at the specified line number or function, with an optional condition.
c(ont(inue)) Continue execution and stop only when a breakpoint is encountered.
l(ist)       Display 11 lines around the current line, or continue the previous listing. Use l . to list around the current line again.
ll           List the whole source code of the current function or frame.
p / pp       Evaluate a Python expression and print its value. pp pretty-prints it, which helps with nested structures.
locals()     Python built-in: return a dictionary of the current local namespace.
globals()    Python built-in: return a dictionary of the current global namespace.
s(tep)       Execute the current line and stop at the first possible occasion (in a called function or on the next line).
n(ext)       Continue execution until the next line in the current function is reached or the function returns.
r(eturn)     Continue execution until the current function returns.
q(uit)       Quit the debugger.

For more information, please refer to the official Python documentation.
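The difference between p and pp can be previewed outside the debugger: p prints the repr() of the value, while pp formats it with the standard pprint module. The sketch below uses a hypothetical dictionary named state, shaped like the output of locals() in the session above; its contents are illustrative only.

```python
import pprint

# Hypothetical debugger state, similar in shape to what locals() returns.
state = {
    "num_points": 10_000_000,
    "points_in_circle": 0,
    "samples": [(i / 10, 1 - i / 10) for i in range(8)],
}

one_line = repr(state)           # what "p state" would print: a single long line
wrapped = pprint.pformat(state)  # what "pp state" would print: wrapped at 80 columns

print(one_line)
print(wrapped)
```

For short values both commands print the same text; pp only changes the layout once the value no longer fits on one line.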

Other tools

Valgrind Memcheck

Valgrind is an instrumentation framework for dynamic analysis tools. It comes with a set of tools for profiling and debugging.

Memcheck is a memory error detector. It is Valgrind's default tool, so calling valgrind without a tool option is equivalent to calling:

$ valgrind --tool=memcheck

To check your code with Valgrind, just call valgrind before the program:

$ module load valgrind
$ valgrind ./test

To run MPI programs under Valgrind, preload the libmpiwrap wrapper library to filter false positives on MPI functions. Its path is provided by the VALGRIND_PRELOAD environment variable. You can also choose the output file and make Valgrind write one file per process with --log-file:

#!/bin/bash
#MSUB -n 32
#MSUB -T 1800
#MSUB -q <partition>
#MSUB -A <project>

module load valgrind

export LD_PRELOAD=${VALGRIND_PRELOAD}

ccc_mprun valgrind --log-file=valgrind_%q{SLURM_JOBID}_%q{SLURM_PROCID} ./test

Here is the kind of output Valgrind returns:

==22860== Invalid write of size 4
==22860==    at 0x4005DD: func1 (test1.c:12)
==22860==    by 0x40061E: main (test1.c:20)
==22860==  Address 0x4c11068 is 0 bytes after a block of size 40 alloc'd
==22860==    at 0x4A05FDE: malloc (vg_replace_malloc.c:236)
==22860==    by 0x4005B0: func1 (test1.c:9)
==22860==    by 0x40061E: main (test1.c:20)
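With one log file per MPI process, it can be tedious to find which ranks reported errors. The following sketch scans the logs for the ERROR SUMMARY line that Memcheck writes at the end of each run. The helper names count_errors and summarize are illustrative, and the valgrind_* file pattern assumes the --log-file value used in the job script above.

```python
import re
from pathlib import Path

# Matches Memcheck's closing line, for example:
# ==22860== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
# Commas are accepted in the counts in case they contain thousands separators.
SUMMARY = re.compile(r"^==\d+== ERROR SUMMARY: ([\d,]+) errors from ([\d,]+) contexts")


def count_errors(log_text):
    """Return the error count of a Memcheck log, or None if it has no summary line."""
    for line in log_text.splitlines():
        match = SUMMARY.match(line)
        if match:
            return int(match.group(1).replace(",", ""))
    return None


def summarize(directory=".", pattern="valgrind_*"):
    """Map each log file name in the directory to its error count."""
    return {
        path.name: count_errors(path.read_text(errors="replace"))
        for path in sorted(Path(directory).glob(pattern))
        if path.is_file()
    }


if __name__ == "__main__":
    for name, errors in summarize().items():
        if errors:  # skip clean logs and logs without a summary line
            print(f"{name}: {errors} errors")
```

Run it from the directory that contains the logs once the job has finished. A log without a summary line usually means that the process was killed before Valgrind could finish.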