Monday, March 29, 2010

Architectures and Algorithms

I used to not care one way or the other about supercomputer architecture, back when I was more on the math side of things. After all, math is math, right? It should always work no matter what the machine looks like.

But as it turns out, a supercomputer's architecture influences the feasibility of algorithms. An algorithm that requires a lot of communication will not perform well on a supercomputer with a slow interconnect, for example. There are many different architectures out there, and different algorithms work better on different machines.

There are seven common patterns in the scientific computations we encounter in the high-performance computing world: sparse linear algebra, dense linear algebra, structured grids, unstructured grids, N-body problems, fast Fourier transforms, and Monte Carlo algorithms. There are other patterns, but these are the top seven in HPC. Each of these types of algorithms thrives under different conditions.

When you are designing and building a supercomputer, you have constraints on budget, power consumption, and size. What elements of the machine do you beef up, and what elements do you cut corners with?

At this point, you have to think of your intended users. If all your users are computational biologists studying protein folding, then there's a machine already out there for them: the IBM BlueGene. This machine is great for the N-body algorithms employed in the study of protein folding and molecular dynamics in general.

But, if your users are nuclear physicists, then the BlueGene is about the worst way to go. BlueGenes are low-processor-power, low-memory machines -- the antithesis of an ideal nuclear-physics machine. The two things nuclear physicists need the most for their sparse and dense linear algebra calculations are memory and floating-point operations. Nuclear physicists would do best on machines with huge shared memory banks, but unfortunately, there aren't really any machines like that these days.*

The hard part is when you have a mixed group of target users, who use more than one or two of those seven fundamental patterns. Since each pattern stresses different elements of the machine, you would like your all-purpose machine to have top-of-the-line processor speed, memory, and interconnects. But, since you are constrained by cost, power consumption, and existing technologies, something's gotta give.

Today, floating-point operations (flops) cost practically nothing, because fast processors are cheap. So today's supercomputers are capable of performing more flops than application scientists are capable of consuming. The real bottlenecks lie in memory and communication.

Memory is constrained by two factors: cost and power consumption. The cost of memory on a single node does not scale linearly -- 4 GB per core costs more than twice as much as 2 GB per core. Memory also requires a lot of power, so the more memory you have, the less power you have remaining to be used by the processors.

As for the interconnect, faster communication definitely costs more. Not only does faster fiber cost more per foot, but the total cost also depends on how many miles of cable your machine needs. (And yes, for the big machines, it is on the order of ten miles of interconnect.) How do you want your processors to communicate -- with a direct connection from every processor to every other (completely infeasible on large machines due to space constraints, to say nothing of expense), or by hopping along a network? What kind of network will work? Most machines today use a 3-D torus, as a balance between cost and efficiency.
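If you're curious what "hopping along a network" adds up to, here is a minimal sketch of how you would count hops in a 3-D torus, where each dimension wraps around on itself. The dimensions in the code are made up for illustration; they don't describe any particular machine.

    /* Minimal sketch: counting hops between two nodes in a 3-D torus.
       The dimensions below are made up for illustration, not taken from
       any particular machine. */
    #include <stdio.h>
    #include <stdlib.h>

    /* shortest distance along one ring of length `size`, with wraparound */
    static int ring_hops(int a, int b, int size)
    {
        int d = abs(a - b);
        return d < size - d ? d : size - d;
    }

    int main(void)
    {
        const int X = 16, Y = 16, Z = 24;   /* assumed torus dimensions */
        int from[3] = {0, 0, 0};            /* coordinates of two nodes */
        int to[3]   = {9, 15, 20};

        int hops = ring_hops(from[0], to[0], X)
                 + ring_hops(from[1], to[1], Y)
                 + ring_hops(from[2], to[2], Z);

        printf("a message travels %d hops between these nodes\n", hops);
        return 0;
    }

The point of the torus is that the worst-case number of hops grows only with the cube root of the machine size, which is a big part of why it is such a popular compromise between cost and efficiency.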

At my workplace, we are a general-purpose center, so we try to have a balanced machine. This pleases everyone and no one. The resource we're shortest on is probably memory, but that is true everywhere. If you look at the ratio of flop capability to memory capacity, it has skyrocketed in the past decade, and that trend will continue. So what ends up happening is that codes that require a lot of memory (such as the nuclear physics applications) end up using a lot of nodes inefficiently in terms of floating-point operations, while filling up every last bit of the memory. Part of my job is to help these folks reduce their memory footprint, so that they can run bigger problems on the same number of processors.
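To give a flavor of the bookkeeping involved in footprint reduction, here is a back-of-the-envelope sketch comparing dense storage of a matrix with a compressed sparse row (CSR) layout. The matrix dimension and fill fraction are invented for illustration; real applications have their own numbers.

    /* Back-of-the-envelope sketch: dense versus compressed sparse row (CSR)
       storage for a double-precision matrix. The dimension and fill fraction
       are invented for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double n    = 1.0e6;   /* assumed matrix dimension */
        double fill = 1.0e-4;  /* assumed fraction of entries that are nonzero */

        double dense_gb  = n * n * 8.0 / 1.0e9;                /* every entry stored */
        double sparse_gb = n * n * fill * (8.0 + 4.0) / 1.0e9  /* values + column indices */
                         + (n + 1.0) * 4.0 / 1.0e9;            /* row pointers */

        printf("dense: %.0f GB   CSR: %.0f GB\n", dense_gb, sparse_gb);
        return 0;
    }

Choosing a storage scheme that matches the structure of the problem is often the single biggest lever for fitting a bigger problem onto the same number of nodes.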

Now that I am more on the practical implementation side of things, I see how much of a difference the architecture of a machine can make in application performance. I wish I had realized this earlier, because I would have made an effort back then to learn more about all the very interesting ideas in computer architecture!


* There were machines that were considered to have huge amounts of shared memory when they were new, but compared to the aggregate memory of today's top supercomputers, they had only tiny amounts.

Sunday, March 28, 2010

The Ebb and Flow of Blogging

Sorry, vast blogging audience, not to have posted anything in more than a week, or anything substantial in nearly two weeks. I've been busy, but also, I find that my blogging inspiration comes and goes.

It seems like the number of topics I can think of to write about is directly proportional to my posting frequency. In other words, the more I write, the more I can think of to write about. This month's big deadline left me with no time to keep up the fairly good pace I had last month. My momentum has been interrupted, and now it is an uphill battle to regain my inspiration.

I have a few ideas to expand upon regarding the future of HPC, so I will keep working on those, and with luck that will spark my inspiration for other topics as well. Stay tuned.

Saturday, March 20, 2010

Freedom!!!

I have spent the past two weeks tucked away in my office working on a big document. Finally, on Friday, it was turned in to the gubmint, and I am off the hook (at least for now). This weekend, I am celebrating by doing as little as possible!

Tuesday, March 16, 2010

The Diversity Tax

There is a tax that I pay for being a woman in computational science. Compared to my (majority) male colleagues, I am disproportionately called upon to do outreach activities and participate in photo shoots. So I can relate to what Female Science Professor is talking about when she seeks "diversity help." Although I've never been explicitly asked to do something for the sake of diversity, it has always been implicit in the invitation.

Many of my fellow "diverse" colleagues (and this group includes not just women, but African-American, Hispanic, and Native American men as well) resent this tax upon their time. I can see where they're coming from. After all, while I'm out talking to middle schoolers, all the dudez are actually getting some work done. It's true, we have to do these types of outreach activities on a volunteer basis -- we get no credit at work for helping to mold the minds of the next generation. Management loves that we do it, as long as it doesn't interfere with the day job.

I think this is a shame. And when I am in management, I will work to change that attitude and actually provide some concrete means of crediting those who go out and make a difference in the community.

In the meantime, though, I see the need for those of us who are not stereotypical scientists to do outreach. I don't let it take over my life, but I do enjoy speaking at middle schools and giving machine room tours to young people, especially girls. It is every citizen's duty to give something back to society, and I'm glad that I can contribute by being a role model for the next generation of scientists and by helping to make science that much more inviting to those who don't fit the stereotypes.

Thursday, March 11, 2010

Oh So Busy

Have I mentioned how busy I am lately?

I mean, dang!

You would think one little old document wouldn't take that long to create. You would be wrong.

Monday, March 08, 2010

HPC and the Future of Algorithms

As parallel computers get bigger and more powerful, the way we use these machines has to change. The reason is that the machines are not just growing in clock speed; they're changing in architecture as well.

Over the past decade, leadership high-performance computing resources have evolved from systems with thousands of CPUs to systems with hundreds of thousands of cores spread across multicore CPUs. And in this time, the algorithms in the best science application programs have adapted to this change. Today, the best programs exploit threading (such as OpenMP) within a node and MPI between nodes. They use clever algorithms that keep communication between nodes to a minimum, especially broadcasts and other all-to-all operations.
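For the curious, here is a minimal sketch of what that hybrid structure looks like: MPI ranks across nodes, OpenMP threads within a node, and a single collective at the end instead of constant chatter. The loop body is just a placeholder for real work.

    /* Minimal sketch of the hybrid model: MPI between nodes, OpenMP threads
       within a node. The loop body is a stand-in for real work. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* each MPI rank owns a chunk of the problem; threads share that chunk */
        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0e-6;                 /* placeholder computation */

        /* one collective at the end, rather than chattering throughout */
        double global = 0.0;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global result: %f (from %d ranks)\n", global, nranks);

        MPI_Finalize();
        return 0;
    }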

But a radical change is about to happen. At this point, adding more cabinets of multicore CPUs to create an even bigger supercomputer is unsustainable in both space and power consumption. The most powerful supercomputer in the world takes up more than 4000 square feet of floor space -- the same footprint as a large house. It idles at 3 megawatts -- when it's just turned on but nobody's using the thing, it consumes enough electricity to power a large neighborhood. And during its full-machine runs of the Linpack benchmark, it was drawing 7 MW. Quadrupling (or more) the machine size to reach 20 petaflops* is not feasible, even here in the land of cheap land and plentiful power. So a new architecture is needed.

The harbinger of this new architecture was the first machine to cross the petaflop barrier, Roadrunner. Los Alamos National Laboratory, together with IBM, built this machine, which came online about two years ago. The novel part of the architecture was that it was no longer homogeneous -- not all the processors were the same. Roadrunner has three different types of processors in it.

Sadly, Roadrunner is going the way of the dodo, because it was too complicated to program. But it was a good experiment that led the way to the architecture that will be in the next 20 PF machines. In this next class of machines, additional floating-point operations will be provided by accelerators, kind of like the graphics cards in the machine you're probably using to read this post. This new architecture requires new algorithms to be able to exploit the accelerators. Accelerators are massively threaded -- on the order of a million different threads can be run at once on a single accelerator. So we're having to rethink our algorithms, and redefine them in a way that can exploit that kind of parallelism.
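The practical upshot is that the computation has to be expressed as an enormous number of independent, fine-grained pieces of work. Here is a sketch of a loop in that shape, written in plain C purely for illustration; on an accelerator, each iteration would become its own thread.

    /* Sketch of a data-parallel kernel: every iteration is independent, which
       is the shape an accelerator wants. On a GPU-style accelerator, each
       iteration of this loop would become its own thread; here it is written
       in plain C purely for illustration. */
    #include <stdio.h>

    #define N 1000000

    /* y = a*x + y, with no dependence between iterations */
    static void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)   /* one accelerator thread per i */
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy(N, 0.5f, x, y);
        printf("y[0] = %f\n", y[0]);   /* prints 2.5 */
        return 0;
    }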

Looking to the future, the exascale** will be here before we know it. Early reports suggest that exascale machines will have millions of heterogeneous processors. At this point, we will have to completely rethink our algorithms.

Too many codes still rely on the manager-worker paradigm. There's one process, P0, that is in charge: it tells everybody else what to do, and collects and compiles results from them. This is a great paradigm if you don't have too many processes, but it rapidly becomes inefficient once you reach more than a thousand. There are things you can do to improve the efficiency at higher processor counts, but in the end, this is not a scalable paradigm. Something will need to change radically before these codes will be able to run on millions of processors.
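To make the bottleneck concrete, here is a minimal sketch of the manager-worker pattern in MPI: rank 0 hands out task numbers one at a time and collects the results. (The task itself is a placeholder.)

    /* Minimal sketch of the manager-worker pattern: rank 0 hands out task
       numbers one at a time and collects results. Fine at small scale;
       rank 0 becomes the bottleneck at large scale. */
    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   100
    #define TAG_WORK 1
    #define TAG_STOP 2

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        if (rank == 0) {                   /* the manager, P0 */
            int next = 0, active = 0, result;
            MPI_Status st;
            /* prime every worker with one task */
            for (int w = 1; w < nranks && next < NTASKS; w++, next++, active++)
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            /* collect a result, hand out the next task */
            while (active > 0) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                active--;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    next++; active++;
                }
            }
            /* tell everyone to stop */
            for (int w = 1; w < nranks; w++)
                MPI_Send(&next, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
        } else {                           /* a worker */
            int task, result;
            MPI_Status st;
            while (1) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                result = task * task;      /* stand-in for real work */
                MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }

Even in this toy version you can see the problem: every task and every result funnels through a single process.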

I like to think of the algorithm problem in analogy with music. Let's say that computations are like music, and computer processors are like musicians. Today's algorithms are like compositions for a marching band. When the algorithms are running, the band is in lock-step, each member with a predefined part in the score. There's a leader, who keeps everybody in time. There's a predefined structure and order of operations. You can use different arrangements of the score to accommodate different sizes of marching bands.

But a marching band is not scalable. One conductor can be seen by only so many musicians. How would you lead a million musicians?

Maybe you could broadcast a picture of the conductor, or something like that, but it's not the same because the conductor can't keep track of all the musicians. Ultimately, you really can't lead a million musicians in lock-step. So you have to rethink your algorithm for creating music.

The musical equivalent of the algorithms that we must develop for the exascale is unfamiliar to the Western ear. If the ultimate goal is to create music, who says we have to do it in a scripted way? What if we provided certain parameters, such as the key and the starting note of the scale, and then let everybody improvise from there, perhaps with some synchronization between neighbors, kind of like a distributed, million-musician raga?

In fact, the algorithms we have developed function very much in this way. The work is spread amongst all processes in a statistically even way. Due to variations in the machine, the algorithms may not run in precisely the same way every time, but this is controlled for, and the answers we compute are still the same. The cost of communicating with millions of other processes is hidden by overlapping communication and computation: if you haven't heard from a process whose data you need to complete the current task, move on to another task and come back to it later. The computations travel to where the data is, instead of the data being transported to the computation.
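Here is a small sketch of the "move on and come back later" idea using nonblocking MPI calls: post a receive for a neighbor's contribution, do work that doesn't depend on it, and check back periodically. The neighbors and message contents are placeholders, not from any particular application.

    /* Sketch of overlapping communication and computation with nonblocking MPI:
       post a receive for a neighbor's contribution, do work that doesn't depend
       on it, and check back periodically. The neighbors and message contents
       are placeholders. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        int left  = (rank + nranks - 1) % nranks;   /* placeholder neighbors */
        int right = (rank + 1) % nranks;

        double incoming = 0.0, outgoing = (double)rank;
        MPI_Request rreq, sreq;
        MPI_Irecv(&incoming, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &rreq);
        MPI_Isend(&outgoing, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &sreq);

        /* keep busy with work that does not need the neighbor's data */
        double local = 0.0;
        int done = 0;
        while (!done) {
            local += 1.0e-9;                        /* stand-in for useful work */
            MPI_Test(&rreq, &done, MPI_STATUS_IGNORE);
        }
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);

        /* only now do the task that needed the neighbor's contribution */
        printf("rank %d received %f from rank %d after %.0f units of other work\n",
               rank, incoming, left, local / 1.0e-9);

        MPI_Finalize();
        return 0;
    }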

I've heard it put like this: let's say your goal is to reach the moon. There is a series of progressively taller trees that you can climb to get progressively closer to the moon. But no tree is tall enough to let you reach the moon by climbing it. So you have to think of another way to reach your goal.

It will be interesting to see what application developers do to exploit exascale computing resources. How many of them will keep climbing trees, and how many others will abandon their tree-climbing programs in favor of something else?


* A flop is a floating-point operation -- any basic arithmetic operation involving a decimal point, e.g., 1.1+1.1. A petaflop machine is capable of performing one quadrillion floating-point operations per second -- a feat that would take everyone in the whole world, working together, doing one flop per second, roughly two days to complete.

** The exascale is the level above the petascale -- on the order of one quintillion flops. Exascale machines should come online in 2017 or 2018.

Friday, March 05, 2010

Weekend Fun

In a rare moment of convergence, both of my sisters and their sons will be at my dad and bonus mom's house on Sunday. This means that we'll be heading up there tomorrow and spending all day Sunday with them. There will be a cousin party. My sisters and I will have special sister time. I am bringing a cake baked by me and my favorite baking assistant, to celebrate my bonus mom's birthday (which was last month, but who's counting?!?!).

I am looking forward to this little break from work. For my latest responsibility, there is a hugely important report due to the gubmint after the Ides of March. I've been working on it nonstop for the past two weeks, so it will be nice to have some fun times with my family.

Wednesday, March 03, 2010

Today in "Things You Learn As a Parent"

Did you know that there are people who put videos of box fans and ceiling fans, complete with commentary about the quality of said fans, on YouTube? Neither did I.

Monday, March 01, 2010

Running Like Mad

So, I learned from blogfriend PhizzleDizzle that until 1960, there were no women's summer Olympics events in which women ran 800 meters or more. And the women's marathon was not an event until 1984! (Here's her post -- warning for those sensitive to words of the four-letter variety: it is full of very descriptive and highly justified cursing!)

Anyhow, this outraged me too after I learned about it. It's pretty sad that women couldn't participate in these sporting events because of erroneous beliefs about women.

Sometimes I think my struggles against gender inequity in science are bad, and then when I see the struggles in other fields... well, on the one hand I feel better, because it's not just my field; on the other hand, I feel dejection, because it's not just my field.

But, it made me realize that athletic talent in women is just as subversive as mathematical and scientific talent. I'm no athlete, but I can at least participate in athletic endeavors and appreciate all the strides that have been made.

So, today I ran for an extra long stretch, out of solidarity with my Olympian systers. Because if some overweight, out-of-shape computational scientist can run non-stop for a mile or more, that just tells you how ridiculous these proscriptive rules of femininity are. I'm never going to win a race, but I can help run over these ridiculous stereotypes.