Sometimes* when you write a computer program, there are bugs in it. Some of those bugs can be easy to catch (e.g., I forgot to put a semicolon at the end of line 67 and my code won't compile), while others are not. There are different types of bugs, too -- memory errors that cause a segmentation fault, typos that might cause one variable to be updated instead of another, and errors in your algorithm, to name a few. Each of them is uniquely challenging to find and fix.
But when we add an additional layer of complexity to a code by making it run in parallel, the difficulty of finding and fixing bugs goes up by several orders of magnitude. The most insidious bugs appear only at high process counts, or only intermittently. How, then, can we find out where our code is going wrong?
A classic method of finding bugs is to insert print statements in the code. By running the code and reading the output, we can follow a sort of bisection algorithm to determine where things go bad. Typically we insert a few print statements on the first pass, and then home in on the point where the error occurs over several subsequent runs. But this is very time-consuming, and it produces a lot of excess data. I for one would hate to insert all those print statements in a complicated code, not to mention sift through the print-statement output of 200,000 processes. It can take weeks to find a bug this way, especially if you have to wait for a batch system to run your jobs.
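To make the print-statement approach concrete, here is a minimal sketch in Python (chosen only for brevity; the same idea applies in C or Fortran). The names `checkpoint` and `run_steps` are hypothetical, and `rank` is passed in as a plain integer here, whereas in a real MPI code it would come from the MPI library. Each marker is tagged with the rank and flushed immediately, so the last marker printed before a crash brackets the failing region.

```python
import sys


def checkpoint(rank, label, **values):
    """Print a rank-tagged marker to stderr and flush it right away.

    Flushing matters: buffered output can be lost when a process crashes,
    which would make the last marker misleading.
    """
    details = " ".join(f"{name}={value!r}" for name, value in values.items())
    line = f"[rank {rank}] reached {label} {details}".rstrip()
    print(line, file=sys.stderr, flush=True)
    return line


def run_steps(rank, data):
    """A toy computation with markers placed between its stages."""
    checkpoint(rank, "start", n=len(data))
    total = sum(data)
    checkpoint(rank, "after sum", total=total)
    mean = total / len(data)  # raises ZeroDivisionError when data is empty
    checkpoint(rank, "after mean", mean=mean)
    return mean
```

If a rank receives an empty list, its output stops at "after sum", so the next pass would add markers only between that point and "after mean". Multiply this by several runs and thousands of ranks, and the cost of the approach becomes clear.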
The best solution is to use a debugger. Using a debugger, you can pinpoint the exact line at which the bug occurs in a single run (for bugs such as segmentation faults). You can also insert breakpoints around areas of the code you suspect are faulty and examine the contents of the variables there, or step through the code slowly and figure out how x came to equal 27 instead of 32 (for example).
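As a small illustration of the "x is 27 instead of 32" situation, here is a sketch using Python's built-in debugger, pdb (Python 3.7 or later for `breakpoint()`). The functions `buggy_total` and `fixed_total` and the `debug` flag are made up for this example; the bug is a deliberately planted off-by-one.

```python
def buggy_total(values, debug=False):
    """Sum the values, but with a planted off-by-one bug."""
    x = 0
    if debug:
        breakpoint()  # drops into pdb here when debug=True
    for i in range(1, len(values)):  # bug: starts at 1, so values[0] is skipped
        x += values[i]
    return x


def fixed_total(values):
    """The corrected loop, visiting every index."""
    x = 0
    for i in range(len(values)):
        x += values[i]
    return x


if __name__ == "__main__":
    data = [5, 7, 9, 11]
    print(buggy_total(data))  # the planted bug drops the leading 5
    print(fixed_total(data))
```

Calling `buggy_total(data, debug=True)` stops at the breakpoint. From the pdb prompt, `n` executes the next line, `p i` and `p x` print the variables, and `c` continues. Stepping into the loop shows `i` starting at 1 rather than 0, which explains the missing 5. Debuggers for compiled codes, such as gdb, offer close analogues of these commands (`break`, `next`, `print`, `continue`).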
Parallel debuggers exist, and do scale up to hundreds of thousands of processes. Of course, these are commercial products, but I'm guessing you don't have a 200K-core supercomputer in your basement, and most supercomputing centers have a license for a commercial debugger such as Allinea DDT (now sold as Linaro DDT) or TotalView. Both of these are great products that will help you find your bugs quickly and relatively painlessly. And if you stubbornly insist on not using a commercial product, most mpirun or mpiexec commands allow you to attach your favorite free debugger to your parallel execution, typically by launching each process under the debugger in its own terminal window.
Do you think it is too hard to learn to use a debugger, that by the time you do learn it you could have already found that bug and moved on to something else? Invest in your future and learn to use a debugger anyhow! Let me tell you the sad, sad story of a graduate student I knew quite well.
This graduate student felt that by the time he/she learned to use a debugger, that final bug sitting between the student and graduation would have been found and fixed. Because, of course, that was the final bug. He/she said this, bug after bug after bug. Looking back, this person realized that by investing a day learning to use a debugger, he/she could have graduated (and started earning for-real money instead of measly graduate student stipends) about six months earlier.
Don't be like that graduate student. Learn to use a debugger and stop wasting your time!
* Actually, pretty much every time you write anything more complicated than "Hello, world!"