Debugging parallel programs is a particularly arduous task. If you know how to debug a serial program, you already know many of the common pitfalls and mistakes that programmers can make. But parallel programming introduces additional challenges. Sometimes bugs are nearly irreproducible, because they show up only when some sort of asynchronicity occurs. For example, if you have a memory leak, you might never know it until one processor gets bogged down for some reason and a crucial piece of memory is overwritten at an unexpected point in the execution of another process. I've made errors such as these and it's a royal pain trying to track them down! But, there are ways to find these errors, and the debugging of parallel programs is the topic of this blog entry.
When debugging serial programs, there are a couple of different things you can do. First of all, you can insert print statements into your program, and then examine the output to see where your program went wrong. This is a very simple way to debug, but there are several drawbacks. First, it is time-consuming to put all those print statements into your code. Second, you will get a lot of extraneous data from all the places where the program is working just fine. Finally, the print statements will show you where your program failed, but unless you are very clever, they will often fail to show you why your program failed.
Another possibility is to use a debugger. The debugger gdb is free and not too difficult to use. You have to compile your program with the -g flag, which tells the compiler to insert the debugging information into the object files it creates. There are other debuggers too, some open source (such as gdb and ddd, a graphical version of gdb), and some commercial. The main drawback to debuggers is that you actually have to learn how to use them. But, the advantages are numerous. First of all, debuggers allow you to step through your program one line at a time, if you so desire, or to run it to completion (or failure). A debugger will tell you at which exact line the program broke down, and why (e.g. seg fault, bus error). You can also examine the contents of every variable at that point, which will help you figure out what went wrong. However, there are limitations, in the sense that a debugger will make it clear where the program broke down, but you will still need to trace back to the source if it is not a simple error.
Another possibility is to use special debugging libraries. In particular, when you suspect that you have a memory leak, you can link with a library such as valgrind or electric fence, which will let you know when you have overstepped the bounds of an array or failed to initialize a block of memory that you then attempt to write. These libraries are very useful for finding memory leaks, but if your error is not a memory leak, then they may not help at all.
Likewise, in a parallel program, you can always debug using print statements. You can have each processor print to a log file with a suffix of the processor number, e.g. log.0012. But the amount of extraneous data you have to sift through grows with the number of processors you use, so this may not be the best way of debugging.
I don't know whether electric fence and valgrind work with MPI, because I've never tried them. Anyone out there have any experience with that?
There are plenty of ways to use a debugger with an MPI program: you can use gdb or any of your favorite open source debuggers that are compatible with MPI. (Check the man page for mpiexec or mpirun to find out which ones are compatible with your MPI implementation.)
Also, there are some good commercial debuggers, the best known one being TotalView. TotalView is usually available on large, production systems, but it's probably not on your homemade cluster.
Next topic: Benchmarking and Performance