Debugging Methods

There are essentially three ways to debug programs (sequential or parallel):

  1. Inspection of the program;
  2. Inserting output statements to demonstrate flow and values;
  3. Use of a debugger.

For MPI programs, the following staged approach is suggested. First, if possible, write the program as a serial program. This lets you debug most syntax, logic, and indexing errors without any parallelism involved. It also gives you a reference result against which to check the parallel version.
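
One way to use the serial result as a reference is to compare it element by element with the values collected from the parallel run. The sketch below assumes both results are available as arrays of double on a single process; the name first_mismatch and the tolerance parameter are illustrative and not part of MPI. A tolerance is used because a parallel run may add floating-point values in a different order than the serial program does.

```c
#include <stddef.h>

/* Hypothetical helper: return the index of the first element where the
   serial and parallel results differ by more than tol, or -1 if they
   agree everywhere. */
static long first_mismatch(const double *serial, const double *parallel,
                           size_t n, double tol)
{
    for (size_t i = 0; i < n; i++) {
        double diff = serial[i] - parallel[i];
        if (diff < 0.0)
            diff = -diff;
        /* Written as !(diff <= tol) so that a NaN also counts as a mismatch. */
        if (!(diff <= tol))
            return (long)i;
    }
    return -1;
}
```

Reporting the first mismatching index, rather than a plain yes or no, points you at the part of the data decomposition to inspect first.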

Then modify the program and run it with 2 to 4 processes on the same machine. This step lets you catch syntax and logic errors in the intertask communication. A common error found at this point is the use of non-unique message tags. The final step is to run the same processes on different machines, which checks for synchronization errors caused by message delays on the network.
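
To make tag clashes less likely, it may help to define every message tag once, in a single place, instead of scattering integer literals through the code. The names below (TAG_WORK, TAG_RESULT, TAG_STOP, tag_name) are hypothetical; the values would be passed as the tag argument of MPI_Send and MPI_Recv.

```c
/* Hypothetical tags: one distinct value per kind of message. */
enum message_tag {
    TAG_WORK   = 1,  /* master -> worker: a unit of work     */
    TAG_RESULT = 2,  /* worker -> master: a computed result  */
    TAG_STOP   = 3   /* master -> worker: no more work, exit */
};

/* Return a printable name for a tag, for use in debugging output. */
static const char *tag_name(int tag)
{
    switch (tag) {
    case TAG_WORK:   return "WORK";
    case TAG_RESULT: return "RESULT";
    case TAG_STOP:   return "STOP";
    default:         return "UNKNOWN";
    }
}
```

A side benefit of the switch is that the compiler rejects duplicate case values, so two tags accidentally given the same number are caught at build time.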

You should first try to find the bug by using a few printf statements, writing either to the screen or to a file; you could even use a separate file for each process.
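
A minimal sketch of the one-file-per-process idea follows. It assumes the caller already knows its process number (for example from MPI_Comm_rank); the function names and the file name pattern debug.<rank>.log are hypothetical. Each message is flushed immediately so that it reaches the file even if the process later hangs or is killed.

```c
#include <stdarg.h>
#include <stdio.h>

/* Open a log file named after the process number, e.g. debug.3.log.
   Returns NULL if the file cannot be created. */
static FILE *open_rank_log(int rank)
{
    char name[64];
    snprintf(name, sizeof name, "debug.%d.log", rank);
    return fopen(name, "w");
}

/* Write one printf-style message, prefixed with the process number,
   and flush it straight away. */
static void rank_log(FILE *fp, int rank, const char *fmt, ...)
{
    va_list ap;
    fprintf(fp, "[rank %d] ", rank);
    va_start(ap, fmt);
    vfprintf(fp, fmt, ap);
    va_end(ap);
    fputc('\n', fp);
    fflush(fp);
}
```

A call such as rank_log(fp, rank, "before send, n=%d", n) placed on either side of a communication call shows which process stopped where.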

Alternatively, you can pass the -mpitrace option to mpicc to build an executable that logs each MPI function call to the screen, together with the number of the MPI process making the call. This may give you some idea of the flow of control.

If this does not work, you may want to run the program under a debugger such as gdb; where possible, this starts the first process under the debugger. See man gdb for information on the GNU debugger and man mpirun for how to request a debugger on the mpirun command line.

Note that these debuggers are not parallel versions: you can only debug the single process you are attached to. If you hit a deadlock and want to find out where your processes have got to (and possibly look at their data), you will need to log in to each remote machine separately and start separate gdb processes that attach to the parallel processes.

To debug the remote processes, follow this brief HOWTO:

  1. Rebuild the executable with the -g compiler/linker option. Make sure the process gets blocked at some point so that you have time to get the debugger running, attached, and at the area of code you want to debug. (You may need to add some code to achieve this.)
  2. Log onto the remote node containing the process you want to debug.
  3. Determine which (of the several) processes associated with the parallel execution to debug.
  4. Start the gdb debugger for the parallel executable (e.g., a.out or cpi but _NOT_ mpirun) and attach to the already executing process.
  5. Set your breakpoints.
  6. Continue execution under the debugger.

Caveats:
* I've noticed that sometimes (it seems to be application dependent), if the non-debugged processes carry on running because they are not being debugged, your debugged process may be unable to perform some MPI operations and fails with a "broken pipe" type of error message. What can you do? Debug up to that point. If everything seems okay up to there, debug the code after that point in another debugging session.
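
Step 1 of the HOWTO asks for code that blocks the process until the debugger is attached. A common way to do this is a polling loop on a flag that you change from inside gdb. The sketch below is one possible version; the names debugger_attached and wait_for_debugger are hypothetical. It prints the host name and process ID, which also helps with steps 2 and 3.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <unistd.h>

/* Set this to a nonzero value from gdb to release the process. */
static volatile int debugger_attached = 0;

/* Report where this process is running, then poll until released.
   Returns the number of polls that were needed. */
static int wait_for_debugger(unsigned poll_seconds)
{
    char host[256];
    int polls = 0;

    if (gethostname(host, sizeof host) != 0)
        snprintf(host, sizeof host, "unknown");
    host[sizeof host - 1] = '\0';

    fprintf(stderr, "PID %ld on %s is waiting for a debugger\n",
            (long)getpid(), host);
    fflush(stderr);

    while (!debugger_attached) {
        sleep(poll_seconds);
        polls++;
    }
    return polls;
}
```

Call wait_for_debugger early in the process you want to inspect. On the node it reports, start gdb on the executable with the printed process ID (for example gdb a.out 12345, where 12345 stands for that ID), move up to the wait_for_debugger frame if necessary, set your breakpoints, type set var debugger_attached = 1, and continue.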