Monday, December 08, 2008

Supercomputing Course: Scalability, Memory, and Multicore Architecture

Remember my supercomputing course? Good! Here's another installment.

We discussed scalability before. To review, there are two primary measures of a program's scalability: strong scaling, in which we throw varying numbers of processors at a fixed-size problem and hope that the wall time to completion is inversely proportional to the number of processors; and weak scaling, in which we keep the amount of work per processor constant as we vary the number of processors, and hope that the wall time to completion stays constant.
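
To make that concrete, here is a tiny C program that reduces a scaling study to the usual numbers: speedup, strong-scaling efficiency, and weak-scaling efficiency. The wall times in it are made up for illustration, not measurements from any real machine.

    #include <stdio.h>

    /* A minimal sketch of reducing a scaling study to speedup and
       efficiency.  The wall times below are invented for illustration. */
    int main(void) {
        int    procs[]  = {1, 2, 4, 8};
        double strong[] = {100.0, 52.0, 27.0, 15.0};    /* fixed total problem size */
        double weak[]   = {100.0, 103.0, 108.0, 116.0}; /* fixed work per processor */

        for (int i = 0; i < 4; i++) {
            double speedup    = strong[0] / strong[i];  /* T(1) / T(p)        */
            double strong_eff = speedup / procs[i];     /* ideal value is 1.0 */
            double weak_eff   = weak[0] / weak[i];      /* ideal value is 1.0 */
            printf("p=%d  speedup=%.2f  strong eff=%.2f  weak eff=%.2f\n",
                   procs[i], speedup, strong_eff, weak_eff);
        }
        return 0;
    }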

Scalability is a very handy concept, because it alerts us to bottlenecks in our algorithms. We may not know exactly what's wrong, but if we see that the numbers are not what we would like them to be, we can go in and check it out.

Sometimes, however, we get false readings from our scaling studies. For example, I once saw superlinear speedup, meaning that in my strong scaling study it took less than half the wall time on 2N processors as it did on N processors. This was a caching effect: one 2N-th of the problem fit entirely in cache memory, while one N-th did not. Since the processors never had to reach out to the slower main memory, the computation ran even faster than it otherwise would have.
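
To see why this can happen, here is a back-of-the-envelope calculation in C. The cache size and working set are numbers I made up for illustration; the point is just that doubling the processor count can push the per-process data under the cache size.

    #include <stdio.h>

    /* Hypothetical illustration of the cache effect behind superlinear
       speedup: assume an 8 MB cache per processor and a total working
       set of 768 MB (both numbers are invented). */
    int main(void) {
        const double total_mb = 768.0;  /* total working set, MB (assumed)   */
        const double cache_mb = 8.0;    /* cache per processor, MB (assumed) */

        for (int procs = 64; procs <= 128; procs *= 2) {
            double per_proc_mb = total_mb / procs;
            printf("%4d processors: %5.1f MB per process -> %s cache\n",
                   procs, per_proc_mb,
                   per_proc_mb <= cache_mb ? "fits in" : "spills out of");
        }
        return 0;
    }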

Speaking of memory, to solve some problems we have to duplicate a data set across every process. For example, a nuclear physicist might need to store her basis functions on every process, because the basis functions are used in computing the potentials for every particle in the nucleus. This is highly redundant, and on today's supercomputers, which mix distributed memory between nodes with shared memory within them, it is also a waste of memory.

On a multicore supercomputer, instead of each node consisting of a single CPU plus its own memory, we have multiple cores that physically share one large pool of memory. If we use MPI, we treat each core as if it were a standalone processor whose memory is inaccessible to every other process, even the processes running on the same node and sharing the same memory pool. In this scenario, the nuclear physicist has to store the same data twice on a dual-core node, four times on a quad-core node, and eight times on an eight-core node. For example, on a quad-core node with 2 GB of memory per core, if the basis functions take up 1 GB, then the redundant copies occupy 4 of the 8 GB available. Wouldn't it be cool if we could store just one copy? We could use the extra 3 GB to solve a bigger problem.
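
In MPI terms, the waste looks something like the sketch below: every rank on the node allocates its own full copy of the basis functions, even though all the ranks share the node's physical memory. The 1 GB figure is just the number from the example above; the rest is illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch of redundant storage under pure MPI: each rank holds its
       own copy of the basis functions.  Sizes are illustrative only. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        size_t basis_bytes = (size_t)1 << 30;    /* 1 GB of basis functions */
        double *basis = malloc(basis_bytes);     /* one full copy PER RANK  */
        if (basis == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);

        /* ... fill the basis, compute potentials for this rank's particles ... */

        printf("rank %d holds its own %zu-byte copy of the basis\n",
               rank, basis_bytes);

        free(basis);
        MPI_Finalize();
        return 0;
    }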

One workaround is to use only one of the cores on the node, letting the rest sit idle, and give all of the node's memory to that single process. Unfortunately, every computation then runs on just one core, slowing it down by roughly a factor of n on an n-core node.

What we need instead is a way to take advantage of all the memory while simultaneously using all the processors when we do computations. In other words, we want to use a distributed memory programming model between multicore nodes, and a shared memory programming model within the nodes. Is there a shared memory programming model that we could use for this purpose?

Of course there is; it's called OpenMP, and it's the topic of the next installment of this course!
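
As a little preview (the details will wait for that installment), here is a rough sketch of the hybrid idea: one MPI rank per node holds a single copy of the data, and OpenMP threads on the node's cores all work on it. The calls are standard MPI and OpenMP, but the structure is just illustrative.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Hybrid sketch: MPI between nodes, OpenMP threads within a node.
       Run with one MPI rank per node so the data is stored only once. */
    int main(int argc, char **argv) {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* one copy of the basis per rank, i.e., per node in this layout */
        /* double *basis = ...; */

        #pragma omp parallel
        {
            printf("rank %d, thread %d of %d working on the shared data\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }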
