I used to not care one way or the other about supercomputer architecture, back when I was more on the math side of things. After all, math is math, right? It should always work no matter what the machine looks like.
But as it turns out, a supercomputer's architecture influences the feasibility of algorithms. An algorithm that requires a lot of communication will not perform well on a supercomputer with a slow interconnect, for example. There are many different architectures out there, and different algorithms work better on different machines.
There are seven common patterns in the scientific computations we encounter in the high-performance computing world: sparse linear algebra, dense linear algebra, structured grids, unstructured grids, N-body problems, fast Fourier transforms, and Monte Carlo algorithms. There are other patterns, but these are the top seven in HPC. Each of these types of algorithms thrive under different conditions.
When you are designing and building a supercomputer, you have constraints on budget, power consumption, and size. What elements of the machine do you beef up, and what elements do you cut corners with?
At this point, you have to think of your intended users. If all your users are computational biologists studying protein folding, then there's a machine already out there for them: the IBM BlueGene. This machine is great for the N-body algorithms employed in the study of protein folding and molecular dynamics in general.
But, if your users are nuclear physicists, then the BlueGene is about the worst way to go. BlueGenes are low processor power, low memory machines -- the antithesis of an ideal nuclear-physics machine. The two things nuclear physicists need the most for their sparse and dense linear algebra calculations are memory and floating-point operations. Nuclear physicists would do best on machines that had huge shared memory banks, but unfortunately, there aren't really any machines like that these days.*
The hard part is when you have a mixed group of target users, who use more than one or two of those seven fundamental patterns. Since each pattern stresses different elements of the machine, you would like your all-purpose machine to have top-of-the-line processor speed, memory, and interconnects. But, since you are constrained by cost, power consumption, and existing technologies, something's gotta give.
Today, floating-point operations (Flops) cost practically nothing, because fast processors are cheap. So today's supercomputers are capable of performing more Flops than application scientists are capable of consuming. The real bottlenecks lie in memory and communication.
Memory is constrained by two factors: cost and power consumption. The cost of memory on a single node does not scale linearly -- 4 GB per core costs more than twice as much as 2 GB per core. Memory also requires a lot of power, so the more memory you have, the less power you have remaining to be used by the processors.
As for the interconnect, it definitely costs more for faster communication. Not only does the fiber cost more, but it also depends on how many miles of cables you must use on your machine. (And yes, for the big machines, it is on the order of ten miles of interconnect.) How do you want your processors to be able to communicate -- do you want a direct connection from every processor to another (completely infeasible on large machines due to space constraints, to say nothing of expense), or by hopping along a network? What kind of network will work? Most machines today use a 3-D torus, as a balance between cost and efficiency.
At my workplace, we are a general-purpose center so we try to have a balanced machine. This pleases everyone and no one. The thing we're the most lacking is probably memory, but that is true everywhere. If you look at the ratio of flop capability to memory capability, it has skyrocketed in the past decade, and that trend will continue. So what ends up happening is that codes that require a lot of memory (such as the nuclear physics applications) end up using a lot of nodes inefficiently in terms of floating point operations, but filling up every last bit of the memory. Part of my job is to help these folks reduce their memory footprint, thereby being able to run bigger problems on the same number of processors.
Now that I am more on the practical implementation side of things, I see how much of a difference the architecture of a machine can make in terms of application performance. I wish I had realized this earlier, because I would have made an effort to learn more then about all the very interesting ideas in computer architecture!
* There were machines that were considered to have huge amounts of shared memory, at the time when they were new, but compared to the agregate memory of today's top supercomputer, they had only tiny amounts of memory.