When you perform a task, there are two types of subtasks: those that must be performed in a certain order, and those that can be performed in any order. We call the subtasks that must be performed in order serial subtasks, and those that can be performed in any order parallel subtasks.
This is true of more than just computer programs. For example, if you're preparing dinner, there are serial and parallel subtasks within the task of preparing dinner. Let's suppose that we're making a nice dinner consisting of lasagna, salad, and garlic bread. There are sequential tasks within this dinner. For example, we have to prepare the lasagna in a certain order: first we make the sauce, prepare the cheeses, and cook the lasagna noodles, then assemble the lasagna, and finally, we bake it. We can't assemble the lasagna until we have made the sauce. We can't bake the lasagna before assembling it. On the other hand, we could prepare the cheeses before we make the sauce, because these two tasks do not depend on each other. However, it would probably make sense to hold off on preparing the cheeses until the sauce is simmering.
Similarly, we have to wash the vegetables and cut them up before assembling the salad, and in that order. But the preparation of the salad is independent of the preparation of the lasagna, so we could make the salad before, after, or while we prepare the lasagna. And the garlic bread is independent of the lasagna and salad, and likewise, we can prepare it at any time (although ideally, we want it to come out of the oven right when we're sitting down at the table).
If there's one person preparing the dinner, it's fairly simple to figure out how best to plan the cooking. But if there are two people preparing the dinner, there are a couple of ways to plan the cooking. Maybe one person could be in charge of the lasagna, and the other could be in charge of the salad and garlic bread. Or maybe one person is too young to use the oven, so we give that person all the tasks that don't involve the stove or oven, and give the adult all the stove/oven tasks. Or maybe one person is really fast with a knife, so we give that person the job of cutting the vegetables and cutting the bread, and any other tasks to fill out the time.
If both cooks are experienced, there will be little communication necessary to perform all the tasks. If you've never made garlic bread and are assigned that task, however, I might have to tell you what to do, which means that our total wall time to make dinner could go up. The first time doing something is usually slower to begin with; and supervising your work will slow me down as I do whatever tasks I've been assigned. It could be more economical (timewise) for me to just make the garlic bread myself.
The parallel and serial task considerations for a program are identical to the concerns we've encountered in scheduling the cooking of dinner. There are sequential and parallel tasks. We have to watch out for which resources can be used (the oven, a knife), and communication overhead (telling you how to make garlic bread).
There are two primary parallel programming paradigms: SPMD and MPMD. SPMD stands for "single program, multiple data." There's a single set of instructions which are performed upon multiple sets of data by different processors. This would be like having several chefs, each preparing their own lasagnas. MPMD stands for "multiple programs, multiple data," meaning that there are several different sets of instructions, performed on multiple sets of data by different processors. You could think of this as a cadre of chefs preparing a 3-course dinner, each preparing a different dish. Most parallel programs are a hybrid of SPMD and MPMD: some identical tasks but also some different tasks.
Here's another problem to think about: suppose that you have a 5000 piece puzzle that you want to put together. How could you decrease the walltime to completion?
Suppose you had a friend helping you. How would that impact the walltime? Would it take half the time to complete the puzzle (assuming that you both work at equal speeds)? The time that it takes to finish the puzzle would probably go down, but it would not be halved, because of communication overhead and resource contention. If your friend has a piece that you need, you have to ask her for the piece, whereas if you were doing the puzzle alone, you could just grab it. Also, you might have a piece that you've put into your part of the puzzle that your friend needs to complete her part of the puzzle, so she can't complete that part until you're done using that piece.
Suppose you had N friends helping you. How would that impact the walltime? What would be the impact of communication overhead and resource contention? At some point, you might have too many people to fit around the table comfortably.
So what if you had N friends at N tables, and each is given 5000/N pieces of the puzzle at random. How would that impact the walltime? What kind of communication overhead and resource contention would you have? In this case, communication would be much more expensive, and getting a puzzle piece to add to your part of the puzzle would be a big deal, because you'd have to ask around until you found that piece, and then either get up and go over to get the piece, or have it passed to you.
What if instead of randomly distributing the pieces of the puzzle, you gave one person all the mountain pieces, and another person all the sky pieces, and another person all the stream pieces, etc.? How would this impact walltime? In this case, we might have a load imbalance, because maybe the mountain is really big and the stream is really small. Furthermore, there will be some pieces that are hard to classify. Some pieces will have both sky and mountain on them. Who do you give them to? And when the time comes to assemble the whole puzzle together, how do you do that?
The puzzle questions may sound kind of silly, but in fact the issues we've discussed here are exactly what the designers of parallel programs face every day. We always have to consider issues of communication overhead, resource contention, and load imbalance. The different table setups represent different types of computers. You can think of the people as processors, the table as memory, and the puzzle pieces as data.
There are some machines that have a single piece of memory for multiple processors. You can see that there are some issues that come up when we have a lot of processors grabbing for data. On the other hand, it's fairly easy to share data between processors with this model. But then there are other machines that have distributed memory, with each processor having its own memory bank. Each processor has more memory to spread out, but a big problem is that when you need some data from another processor, it's a Big Deal to fetch it and it takes up time that you could be using to compute.
Today, most machines are a combination of both memory paradigms. It's like having N people, and N/x tables, each with an N/xth of the puzzle pieces (where x can range from 2 to 64). So sometimes you can micro-manage and save on communication costs if you put data that is in close proximity in the model (e.g. adjacent puzzle pieces) on processors that share memory.
I know that you're all wondering how this works on an actual machine! Well, I'm afraid you're going to have to wait for the next entry. This one is already too long.
Next Topic: MPI