Monday, March 08, 2010

HPC and the Future of Algorithms

As parallel computers get bigger and more powerful, the way we use these machines has to change. The reason is that the machines are not just growing in clock speed; they're changing in architecture as well.

Over the past decade, leadership high-performance computing resources have evolved from systems with thousands of CPUs to systems with hundreds of thousands of multicore CPUs. And in this time, the algorithms in the best science application programs have adapted to this change. Today, the best programs exploit threading (such as OpenMP) within a node, and MPI between nodes. They use clever algorithms that keep communication between nodes -- especially broadcasts and other all-to-all operations -- to a minimum.

But a radical change is about to happen. At this point, adding more cabinets of multicore CPUs to create an even bigger supercomputer is unsustainable in both space and power consumption. The most powerful supercomputer in the world takes up more than 4000 square feet of floor space -- the same footprint as a large house. It idles at 3 megawatts -- when it's just turned on but nobody's using the thing, it consumes enough electricity to power a large neighborhood. And during its full-machine runs of the Linpack benchmark, it was drawing 7 MW. Quadrupling (or more) the machine size to reach 20 petaflops* is not feasible, even here in the land of cheap land and plentiful power. So a new architecture is needed.

The harbinger of this new architecture was the first machine to cross the petaflop barrier, Roadrunner. Los Alamos National Laboratory, together with IBM, built this machine, which came online about two years ago. What was novel about the architecture was that it was no longer homogeneous -- not all the processors were the same. Roadrunner has three different types of processors in it.

Sadly, Roadrunner is going the way of the dodo, because it was too complicated to program. But it was a good experiment that led the way to the architecture that will be in the next 20 PF machines. In this next class of machines, additional floating-point operations will be provided by accelerators, kind of like the graphics cards in the machine you're probably using to read this post. This new architecture requires new algorithms to be able to exploit the accelerators. Accelerators are massively threaded -- on the order of a million different threads can be run at once on a single accelerator. So we're having to rethink our algorithms, and redefine them in a way that can exploit that kind of parallelism.
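To make "rethinking an algorithm for massive threading" concrete, here is a minimal sketch in plain Python (standing in for real accelerator code -- the function and data are made up for illustration). The key move is to recast the computation as an order-free map over independent elements, so that a runtime could in principle hand each element to its own hardware thread:

```python
from concurrent.futures import ThreadPoolExecutor

def per_element(x):
    # one independent unit of work -- no shared state, no ordering
    return x * x + 1.0

data = list(range(8))

# sequential formulation: an ordered loop
sequential = [per_element(x) for x in data]

# data-parallel formulation: the same computation as an order-free
# map, which a massively threaded device could fan out all at once
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(per_element, data))

assert sequential == parallel
```

The sketch uses a thread pool only as a stand-in; the point is the shape of the algorithm, not the hardware.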

Looking to the future, the exascale** will be here before we know it. Early reports suggest that exascale machines will have millions of heterogeneous processors. At this point, we will have to completely rethink our algorithms.

Too many codes still rely on the manager-worker paradigm. There's one process, P0, who is in charge, who tells everybody else what to do, and collects and compiles results from them. This is a great paradigm if you don't have too many processes, but it rapidly becomes inefficient once you have more than a thousand. There are things you can do to improve the efficiency at higher processor counts, but in the end, this is not a scalable paradigm. Something will need to change radically before these codes will be able to run on millions of processors.
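A toy cost model makes the bottleneck visible (my own sketch -- the numbers are invented for illustration): every task assignment passes through the manager, so the manager's service time is serialized even though the workers run in parallel.

```python
def managed_runtime(n_workers, tasks_per_worker, work_t, service_t):
    """Rough time for all workers to finish when every task
    assignment passes through a single manager process."""
    total_tasks = n_workers * tasks_per_worker
    manager_time = total_tasks * service_t   # serialized at P0
    worker_time = tasks_per_worker * work_t  # perfectly parallel
    return max(manager_time, worker_time)

# With 100 workers the manager is mostly idle; with 100,000 workers
# the manager dominates and the workers sit waiting.
small = managed_runtime(100, 10, work_t=1.0, service_t=0.001)
large = managed_runtime(100_000, 10, work_t=1.0, service_t=0.001)
print(small, large)  # 10.0 1000.0
```

Adding workers past the crossover point buys nothing: all the extra time goes into the manager's queue.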

I like to think of the algorithm problem in analogy with music. Let's say that computations are like music, and computer processors are like musicians. Today's algorithms are like compositions for a marching band. When the algorithms are running, the band is in lock-step, each member playing a predefined part in the score. There's a leader, who keeps everybody in time. There's a predefined structure and order of operations. You can use different arrangements of the score to accommodate different sizes of marching bands.

But a marching band is not scalable. One conductor can be seen by only so many musicians. How would you lead a million musicians?

Maybe you could broadcast a picture of the conductor, or something like that, but it's not the same because the conductor can't keep track of all the musicians. Ultimately, you really can't lead a million musicians in lock-step. So you have to rethink your algorithm for creating music.

The musical equivalent of the algorithms that we must develop for the exascale is unfamiliar to the Western ear. If the ultimate goal is to create music, who says we have to do it in a scripted way? What if we provided certain parameters, such as the key and the starting note of the scale, and then let everybody improvise from there, perhaps with some synchronization between neighbors, kind of like a distributed, million-musician Raga?

In fact, the algorithms we have developed function very much in this way. The work is spread amongst all processes in a statistically even way. Due to variations in the machine, the algorithms may not run in precisely the same way every time, but this is controlled for and the answers we compute are still the same. The cost of communicating with millions of other processes is hidden by overlapping communication and computation. If you haven't heard from a process you need to complete the current task, move on to another task and come back to it later. The computations travel to the location where the data is, instead of the data being transported to the computation.
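The "move on to another task and come back to it later" idea can be sketched in a few lines. This is a conceptual sketch only: `remote_ready` is a made-up stand-in for an asynchronous completion test (in MPI terms, something like MPI_Test on a pending receive), and `compute` stands in for the local work.

```python
from collections import deque

def run_tasks(tasks, remote_ready, compute):
    """Process tasks without ever blocking on a slow remote process:
    if a task's data hasn't arrived, defer it and try another."""
    pending = deque(tasks)
    results = {}
    while pending:
        task = pending.popleft()
        if remote_ready(task):
            results[task] = compute(task)  # data arrived: do the work
        else:
            pending.append(task)           # not yet: revisit later
    return results

# toy stand-in: each task's data "arrives" on the second inquiry
seen = set()
def remote_ready(t):
    if t in seen:
        return True
    seen.add(t)
    return False

results = run_tasks([1, 2, 3], remote_ready, compute=lambda t: t * 10)
print(results)  # {1: 10, 2: 20, 3: 30}
```

No process ever sits idle waiting on a neighbor -- which is exactly how communication gets hidden behind computation.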

I've heard it said like this: let's say your goal is to reach the moon. There is a series of progressively taller trees that you can climb and get progressively closer to the moon. But there are not any trees tall enough to reach the moon by climbing them. So you have to think of another solution to reach your goal.

It will be interesting to see what application developers do to exploit exascale computing resources. How many of them will keep climbing trees, and how many others will abandon their tree-climbing programs in favor of something else?

* A flop is a floating point operation -- any basic arithmetic operation involving a decimal point, e.g., 1.1+1.1. A petaflop machine is capable of performing one quadrillion floating point operations per second -- a feat that would take everyone in the whole world, working together, doing one flop per second, nearly two days to complete.
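The footnote's arithmetic can be checked directly (a quick sketch; 6.8 billion is my rough figure for the 2010 world population):

```python
petaflop = 1e15      # floating point operations per second
population = 6.8e9   # rough world population, circa 2010

seconds = petaflop / population  # everyone doing one flop per second
days = seconds / 86400
print(round(days, 1))  # 1.7
```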

** The exascale is the level above the petascale -- on the order of one quintillion flops. Exascale machines should come online in 2017 or 2018.


MathGirl said...

Very nice, well-thought-through article. I am working with databases right now, and it feels very strange to still be working with SQL, which was developed in the 70s and should really be outdated by now. Modern concepts that treat newer demands and possibilities reasonably are becoming really hard to understand because they don't follow those easy paradigms from the 70s. (Even though some of them try to use FO logic or Description Logic, which would then be some sort of a back-to-basics step.)

Rebecca said...

Thanks, MathGirl! Yeah, it's bad when your work is constrained by outdated ways of thinking. It's hard to give up the legacy code and start all over, though.

Aaron Dubrow said...

A great description of parallel programming and the challenges of developing algorithms for millions of processors. I write feature articles for the Texas Advanced Computing Center and am always looking to capture the issues of HPC in plain language. You do a great job of it here.

Rebecca said...

Thanks, Aaron! I'm glad my ramblings make sense :)