The Warp Generator: A Uniform Random Number Generator for GPUs

As of 2011 these RNGs are unpublished, so they haven't been through peer review and so on. I've satisfied myself that they are of good quality, and they are sufficiently interesting to people that it makes sense to make them publicly available rather than wait until they are published. There is always a chance that some feature of the generators will interact poorly with the highly parallel nature of GPUs, so if you find weird effects, the very first thing you should do is try the same simulation with a completely different type of generator like MWC64X, or another known-good generator like KISS or the Mersenne Twister. See also Note 4.

For those who were at the Thalesians GPU workshop, my presentation is here.

Motivation

Lots of people have got excited about using GPUs for Monte-Carlo simulations, and are now busily converting existing C applications to CUDA/OpenCL. Many people have got used to the idea of using the Mersenne Twister (MT) on CPUs: it's pretty fast, has excellent quality, and a huge period (the more recent SFMT made it even faster, but it isn't clear how many people are actually using that yet). Obviously they now want to use the MT on the GPU...

However, a GPU is not a CPU, and the underlying assumptions used to design the MT (and similar methods such as SFMT and WELL) are no longer true. In particular, high-speed array storage (in the form of shared memory) is now a very precious commodity, but the designers of the MT assumed that array storage was not a concern. In order to achieve the huge period of 2^19937-1, each MT RNG instance needs to store 624 words of state. However, in a current-generation NVidia GPU there are only 4096 words of shared memory per block of threads, so we could only fit 6 MT instances in shared memory.

We can try to put the RNG state in global memory: there is plenty of global memory available, and we can easily allocate 624 bytes of memory per thread, so each thread has a separate RNG instance (this is the approach taken in the SDK sample). However, we now hit a bandwidth wall - each RNG step needs to do two 32-bit reads and one write to its global state, so 12 bytes per RNG step, and the accesses are trivial to coalesce. The total memory bandwidth on my Tesla C1060 is a respectable 18GB/s, with approximately 2:1 read to write capacity (which is what the MT needs), but if you do the figures you realise that it simply isn't enough. Unless the RNG is driving something very computationally intensive (which makes no accesses to global memory), the MT's accesses to global memory will be the limiting factor.

If we can't use a big state, another approach is to use an RNG with a small state that fits in registers. An example of this kind of thing is MWC64X, which is designed to be small, fast and high quality, but has a far smaller period (however, it also has stream-splitting built in to get round that).


Approach

My approach is to look at how a GPU works, and try to design for that architecture and set of constraints, rather than converting existing CPU generators. For RNG purposes, the three things that current GPUs give us are:

  1. Fast shared memory accessible by threads within a warp.
  2. Fast "gather" access to shared memory if there are no address collisions (modulo the half warp size).
  3. Atomic parallel shared memory reads and writes for instructions within the same warp.
This leads one to a generator that assigns an RNG to each warp, rather than to each thread. So we can assign a relatively large state to each warp's RNG, putting it in shared memory, then use all the threads in the warp to update the RNG in parallel.

I'm going to skip a lot of theory in this web page, but the set of operations I chose are binary linear operations (exclusive-or and bit shifts), which are the same mechanism underlying the MT. However, the parallel nature of the RNG step means that the recurrence looks very different to that of the MT. In particular, the recurrence matrix is very dense, allowing multiple words to be used after each step.
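To give a flavour of the general shape, the sketch below shows a per-warp binary linear update (this is purely illustrative - the offsets, shifts and state size are made up, and it is not the recurrence used by any of the generators described below):

    // Purely illustrative: a per-warp binary linear update of the same general
    // shape as the warp generators, NOT their actual recurrence.
    // Assumes the warp's 32 state words live at shmem[base..base+31].
    __device__ unsigned SketchStep(unsigned *shmem, unsigned base, unsigned lane)
    {
        // Each lane reads two other lanes' words; the addresses are distinct
        // modulo the warp size, so there are no shared memory bank conflicts.
        unsigned a=shmem[base+((lane+11)&31)];
        unsigned b=shmem[base+((lane+19)&31)];
        // Combine with shifts and exclusive-ors: a binary linear map, as in
        // the MT, but applied by all 32 lanes in parallel.
        unsigned next=a^(a<<9)^b^(b>>13);
        // Lock-step execution within the warp means every lane has read its
        // inputs before any lane overwrites its own word.
        shmem[base+lane]=next;
        return next;
    }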

The result of all this is an RNG shared amongst all 32 threads in a warp, with a period of either 2^1024-1, or 2^2048-1, where each call to the generator generates one new 32-bit random number for each thread. Other parametrisations are possible (different warp sizes, larger periods), but for now I'm sticking to the GPUs I can actually play with (at the time of writing Larrabee is still a paper architecture, and AMD GPUs don't quite have a robust OpenCL implementation). I've chosen three warp RNGs to give a range of characteristics:

  1. WarpStandard: This is a generator with a period of 2^1024-1 that gives a nice balance of speed and quality. It passes all tests in TestU01 (the Crush and BigCrush batteries), so the statistical quality is very good.
  2. WarpBuffered: This gives a larger period of 2^2048-1, with slightly better underlying quality than WarpStandard but slightly reduced speed.
  3. WarpCorrelated: This is the fastest generator, and has a period of 2^1024-1, but has known problems with statistical quality. If each thread's stream of numbers is considered individually then the quality is OK (but not excellent); however, when the outputs of all the warp's threads are considered as a whole then there are significant defects. I would strongly advise against using this generator - go for WarpStandard, which is almost as fast, but all the quality and correlation problems disappear. (DANGER!)

If you want to know more about how the generators work, I have a partially finished working paper, but it is a complete mess at the moment.


Performance and Quality

To give an idea of how fast the generators are, we compared against a number of other generators. Note that they tend towards simple low-quality ones, as I haven't had a chance to adapt the more complicated generators from SPRNG and such-like. The generators are:
  1. Mersenne Twister: This is a moderately optimised version of MT19937, using two reads and one write to global memory per generated sample. Note that it is completely limited by main memory bandwidth - it maxes out our C1060's global memory throughput, and most of the time the threads are idle (although see Note 6: my version may not be as optimal as I might like).
  2. Park Miller: This is a prime modulus LCG, with period 2^31-2, which has previously been suggested for RNGs on GPUs (although in our opinion it is pretty useless as the quality is so bad). We use Schrage's algorithm, as it was the fastest of the various methods we tried.
  3. Hybrid XorShift: This combines an XorShift generator with period 2^96 and the Quick and Dirty LCG, giving a generator with period 2^128 and pretty good quality.
  4. Quick and Dirty: This is the modulus 2^32 LCG suggested in Numerical Recipes, which has fairly poor quality.
  5. Adder: This is included as a baseline - it simply increments a counter by a prime step, and is completely useless as an RNG (a sketch of this and the Quick and Dirty step is shown just after this list).
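For reference, the per-thread step for the two simplest baselines looks something like this (an illustrative sketch, not taken verbatim from the test suite; the LCG constants are the usual Numerical Recipes ones, and the adder's increment is just an example value):

    // Illustrative per-thread baselines (not verbatim from the test suite).
    // Quick and Dirty: a modulus 2^32 LCG with the Numerical Recipes constants.
    __device__ unsigned QuickAndDirty_Generate(unsigned &state)
    {
        state=1664525u*state+1013904223u;
        return state;
    }
    // Adder: one add per sample; together with the add used to accumulate the
    // outputs, this is the two instructions per sample mentioned below.
    __device__ unsigned Adder_Generate(unsigned &state)
    {
        state+=2654435761u;   // example step; the real baseline uses some prime
        return state;
    }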

Performance was measured by having each RNG generate numbers and summing them together so that nothing got optimised out. To mitigate the fairly high overhead of such a tight loop, the generator was manually unrolled 16 times. The test platform was a Tesla C1060. The number of generated samples was repeatedly doubled until each run took more than 5 seconds of wall-clock time.
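The benchmark kernel looked roughly like the sketch below (this is a reconstruction of the idea rather than the exact harness shipped in the test suite); each thread accumulates its outputs into a register so that the compiler cannot optimise the generator calls away:

    // Rough shape of the benchmark kernel; the real harness is in the test suite.
    __global__ void BenchmarkRng(unsigned *state, unsigned *sums, unsigned N)
    {
        extern __shared__ unsigned rngShmem[];
        unsigned rngRegs[RNGNAME_REG_COUNT];
        RNGNAME_LoadState(state, rngRegs, rngShmem);
        
        unsigned acc=0;
        for(unsigned i=0;i<N;i+=16){
            // Unrolled 16 times so the loop overhead is amortised
            // (the original harness unrolled this by hand).
            #pragma unroll
            for(unsigned j=0;j<16;j++){
                acc+=RNGNAME_Generate(rngRegs, rngShmem);
            }
        }
        
        // Write the sum out so that nothing is dead code.
        sums[blockIdx.x*blockDim.x+threadIdx.x]=acc;
        RNGNAME_SaveState(rngRegs, rngShmem, state);
    }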

This is the performance measured in GSamples/Sec, i.e. how many 32-bit random numbers the entire GPU can generate per second:

This is an estimate of the number of instructions executed per generated random number (including the instruction used to sum the outputs):

It is a little noisy, both because the GPU clock does not exactly match the frequency it reports, and because there is still some loop overhead included despite the unrolling. However, we see that the speed for the adder is very close to the two instructions we would expect, and the instruction counts for the warp RNGs match very closely to the number of instructions reported by decuda. Note that the instruction count for the Mersenne Twister reflects the number of stall cycles due to limited memory bandwidth - most of the time the processors are idle.

We applied the empirical tests from TestU01 to the generators to check their statistical quality. TestU01 contains three sets of test batteries: SmallCrush, Crush, and BigCrush, which apply increasingly stringent statistical tests to increasingly large sequences of random numbers. There are other batteries, such as Diehard and the NIST tests, but these are fairly old and don't provide the same level of testing as TestU01.

For the warp RNGs we did the tests in two modes: first by taking the output of each thread as a single stream, then by interleaving the outputs of all threads in the warp. This should check both that the individual streams are good quality, and that there are no nasty correlations between the threads. For the other generators we didn't bother with the interleaved version, though that would probably be quite a good idea anyway.
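To be concrete about what "interleaving" means here: the interleaved stream is just a round-robin merge of the K per-thread streams, along the lines of this hypothetical helper (not part of the released code):

    // Hypothetical helper, not part of the released code: round-robin merge of
    // K per-thread streams into one stream to feed to TestU01.
    // perThread[t*samplesPerThread+i] holds sample i of thread t.
    void InterleaveWarpOutputs(const unsigned *perThread, unsigned *interleaved,
        unsigned K, unsigned samplesPerThread)
    {
        for(unsigned i=0;i<samplesPerThread;i++){
            for(unsigned t=0;t<K;t++){
                interleaved[i*K+t]=perThread[t*samplesPerThread+i];
            }
        }
    }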

Batteries were interpreted as Pass or Fail by inspection of the p-values, reporting the battery as failed if any individual test failed. When you look at so many p-values the approach of thresholding at [0.01,0.99] doesn't work, but in my experience it is really obvious when a test is being failed (they tend to do so catastrophically) - you'll just have to rely on my judgement (or even better, repeat the tests yourself). I've distinguished between two types of failure: if the only failures were due to binary rank or linear complexity then the result is coloured orange; but if any other kind of test failed it is coloured red. The reason for this is that in practice most applications are not very sensitive to linear complexity (for example, the Mersenne Twister has been successfully used for many years despite failing these tests).

Name             Period    GSamples/Sec   Per-thread: SmallCrush/Crush/BigCrush   Interleaved: SmallCrush/Crush/BigCrush
Adder            2^32      141.28         Fail/Fail/Fail                          -
WarpCorrelated   2^1024    48.45          Pass/Fail/Fail                          Pass/Fail/Fail
QuickAndDirty    2^32      43.84          Fail/Fail/Fail                          -
WarpStandard     2^1024    37.58          Pass/Pass/Pass                          Pass/Pass/Pass
WarpBuffered     2^2048    31.80          Pass/Pass/Pass                          Pass/Pass/Pass
HybridXorshift   2^128     20.33          Pass/Fail/Fail                          -
ParkMiller       2^32      10.67          Fail/Fail/Fail                          -
MersenneTwister  2^19937   0.22           Pass/Fail/Fail                          -

(The non-warp generators were only tested as a single stream, as described above.)


Source Code

The RNG implementations and a basic test suite are available under a BSD license, so you can do whatever you want with them, including putting them in libraries. I would only request that you attribute the algorithms to me in the documentation, but that's up to you.

RNG implementation files:

RNGs plus test suite:

The test suite assumes you have some kind of GNU "make" and "sed" available. Do "make all.csv", and it will build, test, and benchmark the three generators. There is also a dumb Pi calculation example: do "make calc_pi_WarpBuffered.gpu", and it will build an executable called "calc_pi_WarpBuffered.gpu" that you can run.

Presentation from Thalesians GPU Workshop


Usage

The current interface is a horrible C-like thing, because my version of CUDA doesn't support classes yet as far as I can tell (it should look a lot better once it does). Where I talk about "RNGNAME" below, it is actually one of "WarpStandard", "WarpBuffered", or "WarpCorrelated" in the actual code. Note that the functions only look at the x dimension of the blocks and grids. I am lazy, and for the types of simulations I write it doesn't make sense to have more dimensions. However, it is easy to extend if you want.

Constants

const unsigned RNGNAME_K;
The number of threads cooperating on a single RNG. In the current version this is fixed at 32 (i.e. the warp size), but see Note 3.
const unsigned RNGNAME_REG_COUNT;
The number of registers needed per thread for the RNG. These registers are private to the RNG, and are used for storing various constants and sometimes state. Between calls to Generate there should be exactly REG_COUNT registers associated with the RNG, so it gives you an idea of how much extra register pressure the RNG is introducing on your threads. In the current versions it is 4 for WarpStandard and WarpCorrelated, and 6 for WarpBuffered.
const unsigned RNGNAME_STATE_WORDS;
This is the number of 32-bit words needed to describe the state of each RNG (not the number of state words per thread). So each group of K threads works on one RNG, and together the threads use the STATE_WORDS words to initialise their shared RNG.
__device__ const unsigned RNGNAME_TEST_DATA[RNGNAME_STATE_WORDS];
This is an array of test data, so that you can check the generator is working. If you start from a state of all ones (i.e. initialise with a state containing RNGNAME_STATE_WORDS words of 0xFFFFFFFF), then step the generator by K steps, the resulting state should match the state contained in RNGNAME_TEST_DATA.
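As a sketch of how you might use the test vector (my reading of "step the generator by K steps" is a single call to Generate per thread, since the whole warp advances together; the interface functions used here are described in the next section, and the bundled test suite is the definitive check):

    // Sketch of a self-test; assumes one Generate call per thread corresponds
    // to the K steps described above - the bundled test suite is definitive.
    __global__ void SelfTest(unsigned *state)   // state pre-filled with 0xFFFFFFFF
    {
        extern __shared__ unsigned rngShmem[];
        unsigned rngRegs[RNGNAME_REG_COUNT];
        RNGNAME_LoadState(state, rngRegs, rngShmem);
        RNGNAME_Generate(rngRegs, rngShmem);
        RNGNAME_SaveState(rngRegs, rngShmem, state);
        // The host then compares the words written back to "state" against
        // RNGNAME_TEST_DATA.
    }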

Functions

All the interface functions take two parameters: "regs", an array of registers private to the RNG of size RNGNAME_REG_COUNT; and "shmem", a pointer to an array of shared memory containing blockDim.x words.
__device__ void RNGNAME_LoadState(
    const unsigned *state,
	unsigned *regs,	unsigned *shmem
);
The load state function extracts an RNG state from global memory, and loads it into the combination of registers and shared memory. The parameter "state" should point to an array of (blockDim.x/RNGNAME_K)*gridDim.x*RNGNAME_STATE_WORDS words. Each RNG's state is stored contiguously.
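In other words (an illustrative host-side calculation of the layout; these helper variables are not part of the interface):

    // Layout implied by the description above.
    unsigned rngsPerBlock=blockDim.x/RNGNAME_K;               // one RNG per warp
    unsigned totalRngs=rngsPerBlock*gridDim.x;                // RNGs in the whole launch
    unsigned totalStateWords=totalRngs*RNGNAME_STATE_WORDS;   // words to allocate for "state"
    // RNG number r owns the contiguous slice
    //   state[r*RNGNAME_STATE_WORDS] .. state[(r+1)*RNGNAME_STATE_WORDS-1]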
__device__ void RNGNAME_SaveState(
	const unsigned *regs, const unsigned *shmem,
	unsigned *state
);
The save state function does the opposite of load state: it takes whatever the current state of the RNG is, and dumps it into the array "state". This is not a destructive function, so the state of the RNG remains the same and can continue to be used to generate random numbers.
__device__ unsigned RNGNAME_Generate(
	unsigned *regs, unsigned *shmem
);
This generates a single 32-bit uniform random number each time it is called. Note that if one thread in the warp calls generate, then all threads must call it. If there is thread divergence around the generate function then the RNG state will be corrupted and things will go very wrong, so make sure you generate your numbers before any branches that might cause divergence within the warp. Divergence between warps is fine. See also note 1.

Example usage

This is just a sketch of how you use the generators. Look at the example code for actual executing code. The first thing the kernel needs to do is initialise the RNGs from a memory buffer, using LoadState. The simulation or whatever can then call Generate as many times as it wants. As a final step you may wish to retrieve the state of the RNGs using SaveState, so that the same RNG states can be used for any future calls to the kernel.
extern __shared__ unsigned sharedMemory[];

// state should point to gridDim.x*(blockDim.x/K)*RNGNAME_STATE_WORDS words
__global__ void MyMonteCarloSim(unsigned *state, unsigned N, ...)
{
    // This should point to blockDim.x words of shared memory used
    // to hold the RNG state.
    unsigned *rngShmem=sharedMemory;
    
    // These are private words for the RNG
    unsigned rngRegs[RNGNAME_REG_COUNT];
    
    // Move the state into the actual RNG
    RNGNAME_LoadState(state, rngRegs, rngShmem);
    
    // Sort out whatever this application is doing
    double x=0;    // example accumulator for this application's results
    MonteCarloInitialise();    
    
    // Main simulation loop
    for(unsigned i=0;i<N;i++){
        unsigned u=RNGNAME_Generate(rngRegs, rngShmem);
        
        // Use the random number for the application. Make
        // sure that any intra-warp divergence happens after
        // the call to generate.
        if(u & 1){
            // Do something with the number
            x += MonteCarloStepA(u, ...);        
        }else{
            // Do something else with the number
            x += MonteCarloStepB(u,...);
        }
    }
    
    // Record whatever results you discovered
    MonteCarloFinalise();
    
    // Save the state back to memory if you so wish. It can be
    // used for the next kernel launch, and will carry on from
    // where it left off.
    RNGNAME_SaveState(rngRegs, rngShmem, state);
}
The host-side code needs to make sure that it allocates and initialises the RNG states for use by the kernel, then transfers them over to the device. Once the states are moved over, they can be left there for as long as need be, assuming the kernel uses the LoadState/SaveState approach shown above. You can also share the states between different kernels, as long as they are using the same RNG and you make sure that they do not execute in an interleaved fashion, i.e. one kernel must definitely finish before the next one using the same state memory starts. If each launch has different block and grid sizes, just create an array of states big enough for the largest one, and smaller launches will only use a subset of the array.

The easiest way to do initialisation is by using a good quality (preferably cryptographic quality) generator - please don't use rand() to fill up the seeds, and even lrand48() may cause problems. It's your job to make sure the RNG seeds are sufficiently random and to make sure they are all distinct. If you want to use an auxiliary generator to seed the warp RNGs, use something like KISS at the very least, and be very careful about how you seed it. In general I would suggest using /dev/urandom to initialise the RNG array, unless that becomes a computational bottleneck for your application. If you are worried about the safety of collisions see Note 5.

Another option for seeding is using skip-ahead, or leap-ahead. This is entirely possible for the warp RNGs, and can be done on a per-warp basis on the GPU quite efficiently. However, being an academic I didn't realise how important this was to many people in the real world, so I didn't bother to actually write the code. Once I've got round to writing and testing the code, I'll post it up here as well, hopefully fairly soon (he says, writing in Sep 2009).

int main(...)
{
    dim3 gridDim=...;    // Something sensible
    dim3 blockDim=...;    // Must be a multiple of K

    unsigned totalStateBytes=4*gridDim.x*(blockDim.x/RNGNAME_K)*RNGNAME_STATE_WORDS;
    unsigned *localState=(unsigned*)malloc(totalStateBytes);
    unsigned *devState=0;
    cudaMalloc((void**)&devState, totalStateBytes);
    
    // Create state initialisers. I would recommend /dev/urandom, then
    // reuse the state. You can also seed from another RNG, but
    // you have to be somewhat careful then. Please don't use rand().
    int fr=open("/dev/urandom", ...);
    read(fr, localState, totalStateBytes);
    close(fr);
    
    cudaMemcpy(devState, localState,
        totalStateBytes, cudaMemcpyHostToDevice);
    
    // We now have the correct number of state words placed on the
    // device, and can simulate to our heart's content.
    
    while(!MonteCarloFinished()){
        // We can call multiple times without having to provide
        // a new state, as the RNG will simply carry on
        // from where it stopped last time.
        MyMonteCarloSim<<<gridDim,blockDim,blockDim.x*4>>>(devState, N, ...);
    }
    
    // Can just abandon the state on the device. Could
    // also copy it back if we want.
    cudaFree(devState);
    free(localState);
}

Note 1 : intra-warp divergence

I think in principle you can use PTX to make sure that if any thread in a warp performs an instruction, then every thread does so. This would mean that a PTX version of Generate could be used to force correct behaviour of the RNG even with intra-warp divergence. However, I haven't read the PTX spec in detail, and it's something that would be better done by NVidia - from what I understand of the architecture it would be bizarre if you couldn't do it, but who knows?

Note 2 : WarpCorrelated

I'm serious - if you don't know what you are doing, stay well away from WarpCorrelated. I only included it because in very specific situations a very fast high-period poor-quality generator might be useful to a small number of people (if you can't imagine what those situations are, you are not one of those people). If you have problems when using WarpCorrelated, I don't really care, and probably will not respond to any questions about it.

Note 3 : Different parameterisations

It's possible to generate this type of generator for lots of different operation costs, warp sizes, etc. For example, we can use a K that is a factor of the actual warp size, which would make skip-ahead quicker. Or we can use a K that is a multiple of the actual warp size, which increases the period and generator quality, but requires the introduction of synchronisation primitives, as more than one warp works on each generator.

I have a program that can spit out generators for different architectures, so if your platform doesn't look like an NVidia one (the instruction costs are different, or the warp size is smaller, etc.), then give me a shout and I'll see what I can do. Also, if the cost of 64-bit operations goes down, or if new instructions are introduced, then I could choose new generators. For example, a built-in bit rotate or bit reverse instruction would be interesting, or byte shuffles would be even better (although they would mainly make the RNGs higher-quality rather than faster).

Note 4 : Should you trust these RNGs?

Up to you. I have a reasonable amount of experience designing FPGA random number generators, and know a bit about the software RNG approaches. I've also done some parallel Monte-Carlo simulations, mainly in the finance world, so I have an idea of the problems that can occur. Both WarpStandard and WarpBuffered pass every test in Crush and BigCrush from TestU01, which is an independent test suite - you can run the tests yourself if you want. I ran these tests both on the streams that each thread generates, as well as by interleaving the output of each thread into one master stream from each warp.

However, passing BigCrush doesn't mean that they will pass all tests that can be conceived - you can always find some flaw in a generator if you look hard enough. That said, the tests in TestU01 are partially derived from statistical flaws that previously caused problems in real-world applications, so you should be fine for most purposes. I would point out that the Mersenne Twister (and any raw binary linear generator) fails tests for linear complexity and matrix rank, but that doesn't stop it being a very useful and high-quality generator for the vast majority of RNG users.

Note 5 : Safety of random initialisation

People get worried about random initialisation, and think there is a chance that RNG sequences might overlap. We can work out the probability of this using a version of the Birthday Paradox. We have a generator with period 2^n, and will assume that each time an RNG is used it will consume around 2^128 samples (this is clearly a ridiculous overestimate, as a THz processor would take something like 2^60 years to execute that many instructions). So we will assume there are 2^(n-128) possible subsequences that we could choose.

Using an approximation to the Birthday Paradox, if we randomly generate g RNG states, then the probability of a collision is roughly 1-exp(-g*g/(2*(2^(n-128)))). If we put that in practical terms, to raise the probability of a collision up to 10^-14 you have to choose g=2^425 for WarpStandard and WarpCorrelated, and g=2^935 for WarpBuffered. This is only an approximation, but the upshot is that for all practical purposes a collision is impossible. The only reason you might see a collision is if you use a poor source of randomness for the initialisers, which is why I would recommend using /dev/urandom or something else of high quality and long period.
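If you want to reproduce those figures, the arithmetic is easiest in the log domain: for small p the collision probability is roughly g^2/2^(n-127), so log2(g) is about (n-127+log2(p))/2. A small host-side sketch (illustrative):

    /* Illustrative check of the collision bound: solves p ~= g^2/2^(n-127)
       for log2(g). */
    #include <math.h>
    #include <stdio.h>
    
    static double log2StatesForCollisionProb(int n, double p)
    {
        return (n-127.0+log2(p))/2.0;
    }
    
    int main(void)
    {
        // p=10^-14; n=1024 for WarpStandard/WarpCorrelated, n=2048 for WarpBuffered
        printf("n=1024: g ~= 2^%.0f\n", log2StatesForCollisionProb(1024, 1e-14));
        printf("n=2048: g ~= 2^%.0f\n", log2StatesForCollisionProb(2048, 1e-14));
        return 0;
    }

This gives roughly 2^425 and 2^937, which is in line with the figures above; either way the conclusion is the same - for any realistic number of randomly seeded RNGs, a collision is effectively impossible.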

Note 6 : Mersenne Twister performance

According to cudaprof my implementation of the Mersenne Twister was maxing out the memory bus with coalesced reads and writes, so I couldn't see how to make it faster. However, the MT in OPLib is reported at ~1.5GSamples/sec, so I may have missed something.

