Friday, August 31, 2007

Supercomputing Course: Benchmarking and Performance

The purpose of running a program on a parallel computer is to cut down on the walltime to completion of the program. Anything that can be done in parallel can of course be done in serial, but it might take a long time. So if we can divide up our work into pieces that can be performed in parallel, then we can reduce the amount of time it takes to do the computation.

Ideally, if we had a perfectly parallelizeable algorithm, then if it took a single processor time T to solve the problem, then it would take two processors T/2 time, four processors T/4 time, … and N processors T/N time. But, like we discussed earlier regarding putting a puzzle together, there are factors such as communication and resource contention that tack additional time onto our wall time to completion.

There are two primary metrics by which we measure the efficacy of a parallel program: efficiency and scalability. (Some programs have their own, interesting metrics, such as a fusion code I read about recently that measures its scalability in atoms computed upon per second! But if we wanted to compare that code's use of the machine with another, unrelated code's, we'd have to convert the metric to efficiency or scalability.)

Efficiency is a measure of how well we're using the processors of the machine, and is computed EN = T1/(N TN) (where EN is the efficiency for N processors, and T with a subscript represents the runtime for that many processors). Ideally, efficiency would be one, but usually, a program's efficiency is less than one. Some ways to increase efficiency include making sure that the computational work is evenly distributed, minimizing idle time, and minimizing the overhead due to parallel execution (e.g. communication that is not necessary in the serial case).

Scalability is the measure of how well a program takes advantage of additional computing resources. We compute the speedup SN = T1/TN. Ideally, the speedup would be N, but usually it is less than N. Rarely, the speedup is greater than N, but this is usually attributable to caching effects (i.e., as the work per processor shrinks, it fits better in the fast memory cache). This can be confirmed by performing other types of performance evaluations.

We can determine the speedup of a program by doing benchmarking runs on different numbers of processors, all solving the exact same problem. Usually we use increasing powers of two, (i.e., 16, 32, 64, 128, 256, …). Notice that I didn't start with 1, because usually a problem of any interesting size is too big for a single processor. But ultimately, the number of processors you start with is a function of the resources allocated to you. When we're benchmarking a massively parallel code on one of the really big machines with tens of thousands of processors where we have millions of hours of allocation, we usually start at a minimum of 256 processors. On your homebrew Linux cluster, you may have to start with two and go up to 32, because that's all you have available.

Anyhow, we run the program at least three times for each number of processors, preferably five times, discarding any runs that are obvious outliers (sometimes you get assigned a processor that is just limping along for some reason, which adversely impacts your walltime), and then take the average for each number of processors. Then we plot number of processors vs. walltime on a log-log graph (one that uses a logarithmic scale on each axis) and the slope of the resulting line shows the scalability.

There's a related metric called weak scalability, in which we grow the problem size proportionally to the number of processors, and see how long it takes to run the program. Ideally, if you double the problem size and you throw twice the number of processors at it, then the walltime to completion should remain the same. If you discover after doing the earlier benchmarking that you have superlinear speedup, this is a good way to confirm that fact, because the amount of work per processor remains constant and the caching effects should completely disappear.

The weak scalability metric is a favorite for many because it is less arduous than speedup. What I mean by that is that as long as the amount of computational work per processor remains large, the effects of communication will remain insignificant. But when processors don't have as much to do, the time spent in communication becomes more important and speedup drops off. Even though the numbers can be depressing, I think it's important to examine speedup. Ultimately, performance evaluation is not a contest; it's an assessment of how much computation can be done. At the end of the day, it doesn't matter how many peak flops you can sustain; it's about getting the science done. A careful examination of the performance of your code can show you areas of your algorithm that have room for improvement, and making those improvements can lead to more science being done.

Next topic: Doing a more detailed evaluation of your program's performance


brocktice said...

Hi there! I was referred to your blog recently from elsewhere (can't remember where), and have been following your supercomputing course with interest.

Our lab uses supercomputers to simulate the electrical and mechanical activity that takes place during heart attacks.

Anyway, I just wanted to add another consideration to your excellent post.

One of the main reasons that we need to use parallel code is that our newest, highest-detail geometrical models don't fit in memory on a single node. Of course, even if they could fit in memory on a single-processor machine, the time to complete a job would be prohibitive.

Thanks for writing this stuff!

Doctor Pion said...

Another excellent article. I think I owe you a picture of a program deck. Found some of those boxes the other day.

One reason the fusion code uses a different metric is because there is an objective goal. It is entirely possible that you are better off giving up some parallelism to use an algorithm that gives you a correct answer more efficiently (in less wall clock time on the same number of processors).

Good point in the comments, also, where you use a distributed memory architecture to increase the storage available for a large problem that only has to share a small amount of data at the boundary conditions between nodes.