Thursday, August 30, 2007

Supercomputing Course: MPI

MPI (Message Passing Interface) is an industry standard for inter-processor communication in supercomputers. In late 1992, a bunch of very smart people from industry, academia, and the national labs got together and began developing MPI based on the programming needs of users. The first MPI standard was completed in 1994. In 1997, the MPI-2 standard further refined and extended it, adding (among other things) C++ bindings. (For more history, see this page.)

MPI provides only a definition of the functionality that needs to be implemented. There are many implementations of the standard, both open source and proprietary. The most famous and widely-used open source implementation is MPICH, a joint effort between Argonne National Laboratory and Mississippi State University. Others include LAM/MPI and Open MPI. Vendors have also created their own versions: IBM and Cray have implementations tuned to their own machines, and third-party vendors sell implementations such as ChaMPIon, which was installed on one of the machines I used for my dissertation.

The MPI standard is a large document that defines over 128 functions. But there are only six functions that you need in order to do (almost) anything you want to do. I will introduce them in pairs.

Initialization and Termination

MPI_Init(int *argc, char ***argv)
MPI_Finalize(void)

These two functions bracket all of your MPI code. Place MPI_Init(…) in the main body of your code, after you've declared your variables and before you make any other MPI calls. MPI_Finalize() shuts down MPI, and therefore should be placed near the end of your code, after your last MPI call.
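Here's a bare-bones skeleton (not a useful program, just the placement of the two calls):

#include <mpi.h>

int main(int argc, char **argv) {
     /* declarations go here; no MPI calls before MPI_Init */
     MPI_Init(&argc, &argv);

     /* ... all other MPI calls go in here ... */

     MPI_Finalize();
     /* no MPI calls after MPI_Finalize */
     return 0;
}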

Environmental Inquiry

MPI_Comm_size(MPI_Comm comm, int *size)
MPI_Comm_rank(MPI_Comm comm, int *rank)

These two functions let each process figure out how many processes are running the program (size) and which of those processes it is (rank). Note that rank begins counting at zero, so 0 ≤ rank ≤ size-1.
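For example, here's a minimal "hello" program of my own in which every process just announces itself. (MPI_COMM_WORLD is the predefined communicator that contains all of the processes in the job.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
     int rank, size;
     MPI_Init(&argc, &argv);
     MPI_Comm_size(MPI_COMM_WORLD, &size);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     printf("Hello from process %d of %d\n", rank, size);
     MPI_Finalize();
     return 0;
}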

Communication

MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

These two functions are the bread and butter of parallel computing. They are the way that we send messages between processes. The declarations may look a little ugly, but they are actually easy to use. Basically, I use MPI_Send(…) to send the data in the buffer buf, which contains count entries of type datatype, to process dest in the communicator comm, labeled with the message-identifying tag tag. Similarly, I use MPI_Recv(…) to receive into the buffer buf count entries of type datatype from process source in the communicator comm with the matching tag tag; information about the completed operation (such as the actual source and tag of the message) is stored in status.
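As a quick sketch (the variable names and the tag value 99 are made up for illustration, and I assume rank has been obtained as above on at least two processes), rank 0 could send a single integer to rank 1 like this:

     int x = 42, y;
     MPI_Status status;
     if (rank == 0)
          MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);          /* send one int to rank 1, tag 99 */
     else if (rank == 1)
          MPI_Recv(&y, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status); /* receive one int from rank 0, tag 99 */

Note that the receive names the sender's rank as its source and uses the same tag: a receive only matches a send whose destination, tag, and communicator line up with it.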

Let's say that I wanted to write a program in which each process sends its rank to its neighbor rank+1 and receives from its neighbor rank-1 (with the process of rank size-1 sending to rank 0, and rank 0 receiving from rank size-1).

#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
     int me, np, q, sendto, recvfrom;
     MPI_Status status;
     MPI_Init(&argc, &argv);
     MPI_Comm_size(MPI_COMM_WORLD, &np);
     MPI_Comm_rank(MPI_COMM_WORLD, &me);
     if (np%2 == 1) {
          printf("Use an even number of processors!\n");
          MPI_Finalize();
          return 0;
     }
     sendto = (me+1)%np;       // neighbor I send to (% = modulo op)
     recvfrom = (me+np-1)%np;  // neighbor I receive from
     MPI_Recv(&q, 1, MPI_INT, recvfrom, recvfrom, MPI_COMM_WORLD, &status);
     MPI_Send(&me, 1, MPI_INT, sendto, me, MPI_COMM_WORLD);
     printf("Sent %d to process %d, received %d from process %d\n", me, sendto, q, recvfrom);
     MPI_Finalize();
     return 0;
}

Try running this program on your machine. You'll see that it actually doesn't work! The program hangs, and if you insert print statements you'll see that it never gets beyond the line with MPI_Recv(…). The reason is that the basic receive operation blocks, meaning it waits around and does nothing until a matching message arrives. If every processor begins with a receive, then nobody ever gets around to sending, and no processor unblocks and continues with the program.

What if we switch the order of the send and receive lines? This will (probably) work for this program, but in general, we could run into trouble with deadlock caused by resource contention if we are sending large messages and there is not enough room to buffer them.
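In other words (a sketch of the swapped version, using the same variables as above):

     MPI_Send(&me, 1, MPI_INT, sendto, me, MPI_COMM_WORLD);
     MPI_Recv(&q, 1, MPI_INT, recvfrom, recvfrom, MPI_COMM_WORLD, &status);

The catch is that MPI_Send is allowed to wait for the matching receive when it can't buffer the message, which is where the deadlock would come from.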

How, then, can we fix this program? There are several choices: we could use more advanced, non-blocking sends and receives (beyond the scope of this tutorial); we could have one process send before receiving and set off a sort of "chain reaction" through all the others; or we could have all the even-numbered processors send first and then receive, while the odd-numbered processors receive and then send. Taking the last approach, we would replace the send and receive lines above with the following:

     if (me%2 == 0) {
          MPI_Send(&me, 1, MPI_INT, sendto, me, MPI_COMM_WORLD);
          MPI_Recv(&q, 1, MPI_INT, recvfrom, recvfrom, MPI_COMM_WORLD, &status);
     } else {
          MPI_Recv(&q, 1, MPI_INT, recvfrom, recvfrom, MPI_COMM_WORLD, &status);
          MPI_Send(&me, 1, MPI_INT, sendto, me, MPI_COMM_WORLD);
     }

Exercise for the reader: What if we allowed np to be odd? Would there be any risk of resource contention in the solution coded above? Explain why or why not.

Next topic: Debugging parallel programs
