## Monday, December 15, 2008

### Supercomputing Course: OpenMP Syntax and Directives, Part II

In order to be parallelizable by a parallel for construct, a loop must satisfy certain conditions. First, the number of iterations of the loop must be fixed and knowable upon entry into the loop. Second, each iteration must be independent of the others, so that the iterations can be computed in any order. And finally, there must be no data dependence between iterations: no iteration may read a value that another iteration writes.

So, here are some things we can't parallelize with a parallel for construct:
• Conditional loops (e.g., many while loops; if your parallelizable while loop has a knowable trip count, it's best to recast it as a for loop so you can use the parallel for construct)
• Loops over iterators (e.g., C++ STL iterators over a std::list)
• Loops in which any iteration depends on the result of a previous iteration
• Loops in which there is any other data dependence between iterations
So, how do we determine whether a loop is parallelizable? Let's take a look at a real-life example!

In many scientific problems, we want to solve a system of linear equations. One of the ways to do this is known as Gaussian elimination. Suppose we want to parallelize the following Gaussian elimination code:

```c
/* Gaussian Elimination (no pivoting): x = A\b */
for (int i = 0; i < N-1; i++) {
    for (int j = i+1; j < N; j++) {
        double ratio = A[j][i]/A[i][i];
        for (int k = i; k < N; k++) {
            A[j][k] -= ratio*A[i][k];
        }
        b[j] -= ratio*b[i];   /* update b once per row, not once per column */
    }
}
```

We have three possible candidate loops to parallelize. Which one should we select?

First, we need to really understand how Gaussian elimination works. For this, I refer to the picture below.
You can see the progression of Gaussian elimination in the four frames of this picture. The i loop is represented by the yellow row and column. Basically, we take the entries in the yellow row and column and use them to update the green submatrix before going on to row/column i+1. This tells us all we need to know about the parallelizability of the i loop. The values of the entries in the (i+1)st yellow area depend on what operations were performed on them at previous values of i. Therefore we can't use OpenMP to parallelize this loop, because of data dependence.

What about the j loop? The number of iterations in that loop varies with i, but we do know the number of iterations every time we are about to enter the loop. Do any of the later iterations depend on earlier ones? No, they do not. Can the j iterations be computed in any order? Yes, they can. So it looks like the j loop is parallelizable.

What about the k loop? Like the j loop, its number of iterations varies but is calculable for every i. None of the later iterations depend on earlier ones, and they can all be computed in any order. Therefore the k loop is also parallelizable.

So, we can't parallelize the outermost loop, but we can parallelize the inner two. Which one should we select? It's best to select the outer of the two, the j loop: the team of threads is then forked and joined once per i iteration instead of once per (i, j) pair, so each thread gets a larger chunk of uninterrupted work and we pay much less parallel overhead.

So, our Gaussian elimination code now looks like this (the only change is the added `#pragma` line):

```c
/* Gaussian Elimination (no pivoting): x = A\b */
for (int i = 0; i < N-1; i++) {
    #pragma omp parallel for
    for (int j = i+1; j < N; j++) {
        double ratio = A[j][i]/A[i][i];
        for (int k = i; k < N; k++) {
            A[j][k] -= ratio*A[i][k];
        }
        b[j] -= ratio*b[i];
    }
}
```

(Observe that we can combine the `parallel` and `for` constructs into a single directive.)

Isn't that just amazing? We can parallelize this piece of code for the modest cost of one additional line! And our algorithm is still easy to read: It's not obscured by calls to OpenMP functions. Compare this to our old friend MPI. I never cease to be amazed by the compactness of OpenMP.

Next time, we'll talk about some other OpenMP directives, including sections, reduction, and some synchronization directives.