Monday, March 16, 2009

Adventures in Pseudorandom Number Generation

When I made my Pi Day cake, a circle inscribed within a square, I decorated it with sprinkles to represent the Monte Carlo quadrature algorithm that could be used to compute π.

You can see that I tried to distribute the sprinkles evenly but somewhat irregularly over the cake. I did not want them to be in a grid, but at the same time I didn't want them to be clumped together.

In the case of computing a two-dimensional area, a grid and a pseudorandom distribution will yield the same order of accuracy, but for three dimensions and higher, the pseudorandom distribution always comes out ahead. This is because in finding the area or (hyper)volume of a D-dimensional object, the accuracy of the number you get from N points is proportional to one over the Dth root of N for a grid, but always one over the square root of N for a pseudorandom distribution. In other words, for two digits of accuracy in computing the volume of a three-dimensional blob, I need a million points on a 3-D grid, but only 10,000 points in a pseudorandom distribution.

So, these Monte Carlo algorithms can really save us a lot of effort. So how can we create pseudorandom distributions? Furthermore, what does pseudorandom even mean?

I'll start with the last question first. Pseudorandom means that something seems random but actually is not. A sequence of numbers that we generate using a mathematical formula cannot be random by definition. But if it shares certain desirable properties with random number sequences, then it is pseudorandom.

Basically, if a sequence of numbers that we generate looks random, meaning that there are no discernible patterns in it, then it is a pseudorandom sequence. A good pseudorandom sequence has the following qualities:
  1. It has a very long period (meaning there are many millions or billions of numbers generated before the sequence starts repeating itself) -- generally they have a period of 2n, where n is the number of bits in the computer's representation of numbers, although longer periods are better.
  2. It is uniformly distributed (meaning that the sequence lands on every possible value with equal frequency).
  3. It is reproducible (meaning that you can regenerate the same sequence over and over again just by initializing it with the same seed value). This is the primary advantage of pseudorandom number generation: it is useful to use the same "random" numbers every time when you are debugging a program, for example.
  4. There should be no correlations in higher dimensions, meaning that if I generate n-tuples from the sequence (for example, pairs (x, y) derived from elements a2k and a2k+1 in the sequence), there should be no discernible pattern.
  5. If the sequence generation is quick and cheap, that would be helpful too.

How, then, can we generate a pseudorandom sequence? Pseudorandom number generation is hard! I had a tough time generating the points on my cake. At first, I had a lot of clumps of sprinkles, and I actually had to go back in and carefully fill in some of the bare spots. So "Rebecca sprinkling sprinkles over a cake" is not a very good generator.

Likewise, most pseudorandom number generators (PRNGs) aren't actually very good. For example, the RANDU generator, commonly used in the 1960's and 70's, was a very poor PRNG. Check out the graph in the Wikipedia article, and you will see how bad it was. If you used RANDU to generate triplets and then graphed them as (x, y, z) coordinates, all the points would fall into one of fifteen two-dimensional planes. So if you were using this for your 3-D Monte Carlo integration, it would almost be as if the points were pseudorandom in two dimensions, and gridded in the third, leading to inaccurate results.

The best PRNG for scientific applications is the Mersenne Twister, developed in 1997 by by Makoto Matsumoto and Takuji Nishimura. Its period is more than 106000, it is equidistributed up to 627 dimensions (so you won't have the problem described above), and it is cheap and easy.

Most computer programming languages have a built-in PRNG. In C, for example, you can call the rand() function, which will output a number between 0 and RAND_MAX (a constant that is machine dependent). But these generators are usually pretty lousy, because they are linear congruential generators, sharing the correlation characteristics of RANDU. They are acceptable for program development and debugging, but when the time comes for production runs, you should replace the built-in PRNG with something better.

For serial pseudorandom number generation, the GNU Scientific Library (GSL) provides a large suite of PRNGs, including the Mersenne Twister. The GSL is open source and freely available online.

Generating pseudorandom sequences in a parallel application is a more difficult task. You want all the processes to generate different sequences; otherwise, you've gained no information because you duplicated the same Monte Carlo integration across all processes. The standard parallel PRNG is called SPRNG. The advantage of SPRNG over just generating sequences with different seed values on different processors is that SPRNG can generate multiple independent sequences, more than one per processor, with only a minimum of communication as each new stream of numbers is initialized.

Pseudorandom number generation is an important topic of study not only for those who use Monte Carlo simulations: it's also an important component of cryptographic applications. But that is a topic for its own post...


Madeleine said...

This is off topic, but I can't find a blog-email address to send it to you separately, and I thought you'd be interested to hear that applied math made the news in my town.

How can a slot machine make a $43M mistake?
Computer language blamed for glitch that displayed a $43 million win

It turns out the key is that it wasn't a 43 million dollar mistake. It was a 42.9 million dollar mistake. Does that ring a bell for you?

Rebecca said...

That's awesome, Madeleine! And not really off topic either. Slot machines use random number generators, which are in fact regulated for their randomness. I wonder how much of the inner workings of a slot machine they'll have to reveal in court!

Madeleine said...

Oh, true! Even cooler then.

I doubt they will end up in court. I suspect they will settle for the maximum payout that machine is designed to give, which is about $9,000, and maybe a regulation that all machines must be labeled with their maximum payout amount for consumer education.

For 3 minutes the guy thought he'd won millions, and he spent the money in his head. Initially he says all they offered him was four tickets to the buffet! So he is angry, but I'll be really surprised if it goes to court. It's an interesting little piece of human psychology.