Monday, March 29, 2010

Architectures and Algorithms

I used to not care one way or the other about supercomputer architecture, back when I was more on the math side of things. After all, math is math, right? It should always work no matter what the machine looks like.

But as it turns out, a supercomputer's architecture influences the feasibility of algorithms. An algorithm that requires a lot of communication will not perform well on a supercomputer with a slow interconnect, for example. There are many different architectures out there, and different algorithms work better on different machines.

There are seven common patterns in the scientific computations we encounter in the high-performance computing world: sparse linear algebra, dense linear algebra, structured grids, unstructured grids, N-body problems, fast Fourier transforms, and Monte Carlo algorithms. There are other patterns, but these are the top seven in HPC. Each of these types of algorithms thrive under different conditions.

When you are designing and building a supercomputer, you have constraints on budget, power consumption, and size. What elements of the machine do you beef up, and what elements do you cut corners with?

At this point, you have to think of your intended users. If all your users are computational biologists studying protein folding, then there's a machine already out there for them: the IBM BlueGene. This machine is great for the N-body algorithms employed in the study of protein folding and molecular dynamics in general.

But, if your users are nuclear physicists, then the BlueGene is about the worst way to go. BlueGenes are low processor power, low memory machines -- the antithesis of an ideal nuclear-physics machine. The two things nuclear physicists need the most for their sparse and dense linear algebra calculations are memory and floating-point operations. Nuclear physicists would do best on machines that had huge shared memory banks, but unfortunately, there aren't really any machines like that these days.*

The hard part is when you have a mixed group of target users, who use more than one or two of those seven fundamental patterns. Since each pattern stresses different elements of the machine, you would like your all-purpose machine to have top-of-the-line processor speed, memory, and interconnects. But, since you are constrained by cost, power consumption, and existing technologies, something's gotta give.

Today, floating-point operations (Flops) cost practically nothing, because fast processors are cheap. So today's supercomputers are capable of performing more Flops than application scientists are capable of consuming. The real bottlenecks lie in memory and communication.

Memory is constrained by two factors: cost and power consumption. The cost of memory on a single node does not scale linearly -- 4 GB per core costs more than twice as much as 2 GB per core. Memory also requires a lot of power, so the more memory you have, the less power you have remaining to be used by the processors.

As for the interconnect, it definitely costs more for faster communication. Not only does the fiber cost more, but it also depends on how many miles of cables you must use on your machine. (And yes, for the big machines, it is on the order of ten miles of interconnect.) How do you want your processors to be able to communicate -- do you want a direct connection from every processor to another (completely infeasible on large machines due to space constraints, to say nothing of expense), or by hopping along a network? What kind of network will work? Most machines today use a 3-D torus, as a balance between cost and efficiency.

At my workplace, we are a general-purpose center so we try to have a balanced machine. This pleases everyone and no one. The thing we're the most lacking is probably memory, but that is true everywhere. If you look at the ratio of flop capability to memory capability, it has skyrocketed in the past decade, and that trend will continue. So what ends up happening is that codes that require a lot of memory (such as the nuclear physics applications) end up using a lot of nodes inefficiently in terms of floating point operations, but filling up every last bit of the memory. Part of my job is to help these folks reduce their memory footprint, thereby being able to run bigger problems on the same number of processors.

Now that I am more on the practical implementation side of things, I see how much of a difference the architecture of a machine can make in terms of application performance. I wish I had realized this earlier, because I would have made an effort to learn more then about all the very interesting ideas in computer architecture!


* There were machines that were considered to have huge amounts of shared memory, at the time when they were new, but compared to the agregate memory of today's top supercomputer, they had only tiny amounts of memory.

5 comments:

acdalal said...

Wow, could you please come and talk to my computer architecture class? :) I have such a hard time convincing my computer science students that computer architecture is not just a box you check off on the graduation requirements list, that it has a real impact on how well your programs run. I might just use some of these examples the next time I teach the course. Thanks for a great post!

sweet_alien88 said...

Hi,

I must congratulate you for explaining so well the compute-communicate divide in HPC. I have been following your blog for a while now. Keep up the good work!

Reshmi

Rebecca said...

Amy, I'd be happy to talk to your class, any time. ;)

Although, if they're anything like my students were back when I was a TA, they won't care because they don't care about numerical methods either!

Reshmi, thanks!

ScienceGirl said...

I hope you are saving these posts up for a book or something, or at least are planning to bring them to SC for getting the BE students started! You really have a knack for putting the complicated issues in simple terms, focusing on what is most important.

Doctor Pion said...

The real bottlenecks lie in memory and communication.

This has always been true. I have heard two different supercomputer designers make that specific point, but by "always" I mean going back to Cray's first supercomputer (CDC6000 series) which ran 1 Mflop on 1 Mbyte and pushed external communication off onto peripheral machines.

The cynic in me would answer acdalal by saying it is probably impossible to convince them until they write a real computer program. I think it was Hans Bethe who said a "real" program is one that runs 100 hours on the fastest computer available. When it takes a week to get the next result, your time spent thinking about algorithms and architecture starts to pay off.

All computers look alike to "toy" problems.