The MWC64X RNG is a small, fast, high-quality, skippable uniform random number generator designed for use with GPUs via OpenCL. The main features of MWC64X are:

- **Small State**: Each RNG requires just two 32-bit words of register storage, reducing register pressure on the rest of the program.
- **Speed**: Each random number takes five or six instructions (assuming a post-2010 GPU with 32-bit integer multiply, like Fermi or Cayman).
- **High Quality**: MWC64X passes every statistical test I've applied to it, unlike most other small and/or fast generators. Oh, and I'm talking about *serious* tests here, not just dicking about with Diehard and the NIST tests.
- **Stream-Splitting**: The RNG is skippable (which is not that uncommon), using very little working storage (which is less common), and I've also actually provided the code to do it (which is somewhat rare).
- **Portability**: The instructions used by MWC64X are directly supported on all high-performance GPUs (i.e. AMD + NVidia), and are also supported in both the scalar and SSE2 instruction sets for x86.

```c
uint32_t MWC64X(uint64_t *state)
{
    uint32_t c=(*state)>>32, x=(*state)&0xFFFFFFFF;
    *state = x*((uint64_t)4294883355U) + c;
    return x^c;
}
```

Hopefully the compiler will take advantage of the 32x32->64 bit instructions in modern processors. In OpenCL we can take advantage of the builtin mul_hi instruction and do everything in 32-bit code:

```c
uint MWC64X(uint2 *state)
{
    enum { A=4294883355U };
    uint x=(*state).x, c=(*state).y;  // Unpack the state
    uint res=x^c;                     // Calculate the result
    uint hi=mul_hi(x,A);              // Step the RNG
    x=x*A+c;
    c=hi+(x<c);
    *state=(uint2)(x,c);              // Pack the state back up
    return res;                       // Return the next result
}
```

In both cases "state" points to the 64-bit state of the RNG, and the returned value is a 32-bit random integer.
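As a sanity check, the two forms can be run in lockstep on the host. The sketch below is my own illustration in plain C, not part of the package: mul_hi is emulated with 64-bit arithmetic, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit-state form of the MWC64X step, as in the C version above. */
static uint32_t mwc64x_u64(uint64_t *state)
{
    uint32_t c = (uint32_t)(*state >> 32), x = (uint32_t)(*state & 0xFFFFFFFFu);
    *state = x * (uint64_t)4294883355U + c;
    return x ^ c;
}

/* Two-word form, mirroring the OpenCL version; mul_hi(x,A) is emulated
   by taking the top half of a 64-bit product. */
static uint32_t mwc64x_split(uint32_t *px, uint32_t *pc)
{
    const uint32_t A = 4294883355U;
    uint32_t x = *px, c = *pc;
    uint32_t res = x ^ c;
    uint32_t hi = (uint32_t)(((uint64_t)x * A) >> 32); /* mul_hi(x, A) */
    x = x * A + c;
    c = hi + (x < c);
    *px = x; *pc = c;
    return res;
}

/* Check that both forms produce identical outputs and states. */
int mwc64x_forms_agree(uint64_t seed, int steps)
{
    uint64_t s = seed;
    uint32_t x = (uint32_t)(seed & 0xFFFFFFFFu), c = (uint32_t)(seed >> 32);
    for (int i = 0; i < steps; i++) {
        if (mwc64x_u64(&s) != mwc64x_split(&x, &c))
            return 0;
        if (s != (((uint64_t)c << 32) | x))
            return 0;
    }
    return 1;
}
```

The carry term hi+(x<c) in the split form reproduces exactly the top 32 bits of the 64-bit sum x*A+c, which is why the two walks never diverge.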

Note that to achieve a small fast generator there are certain trade-offs. These won't matter to most people, but if you are doing hard-core multi-GPU Monte Carlo then bear them in mind:

- **Period**: The period is "only" 2^63. This is reasonable, but in some applications you could exhaust the period by having huge numbers of GPUs in parallel, or by running simulations for days. It also means you can't use random seeding, although because I've provided skipping code that shouldn't matter.
- **Lack of Theory**: The generator is based on a type of generator with known theoretical quality and period (the Multiply-With-Carry) but then modified to drastically improve empirical quality while retaining the period. I can't say anything about theoretical structure, but *can* demonstrate that it is empirically much more random than generators with supposedly good theoretical properties.

I've put together a package containing the OpenCL implementations, and some OpenCL/C++ test benches which can make sure that your particular OpenCL vendor hasn't got a broken compiler or something. The package provides four variants of the generator, which produce either one, two, four, or eight streams from a single generator. The main one-stream version is most useful on NVidia platforms, while the four-vector version is useful on AMD platforms, as it lets the compiler fill the VLIW slots.

The code should be platform agnostic, and works for me in Linux and MinGW and on AMD and CUDA OpenCL distributions. (Yes, I know there is no Visual Studio project, and I don't intend to provide one. If you can't figure out how to deal with the makefile then you probably shouldn't be using OpenCL.)

Each random number generator is represented by an opaque struct which contains the state of the RNG:

```c
typedef ... mwc64x_state_t;
typedef ... mwc64xvec2_state_t;
typedef ... mwc64xvec4_state_t;
typedef ... mwc64xvec8_state_t;
```

Samples from an RNG are then retrieved using an RNG-specific NextUint function:

```c
uint  MWC64X_NextUint(mwc64x_state_t *state);
uint2 MWC64XVEC2_NextUint2(mwc64xvec2_state_t *state);
uint4 MWC64XVEC4_NextUint4(mwc64xvec4_state_t *state);
uint8 MWC64XVEC8_NextUint8(mwc64xvec8_state_t *state);
```

Note that each variant returns a different sized vector, according to the number of streams supported.

The package provides a stream splitting mechanism for assigning sub-sequences to a particular stream. To use it you need to choose the maximum possible number of samples you will take from a single stream - for most cases 2^40 is a reasonable choice. The function MWC64X_SeedStreams will then initialise each stream to be offset by that amount from the previous stream:

```c
void MWC64X_SeedStreams(mwc64x_state_t *s, ulong baseOffset, ulong perStreamOffset);
void MWC64XVEC2_SeedStreams(mwc64xvec2_state_t *s, ulong baseOffset, ulong perStreamOffset);
void MWC64XVEC4_SeedStreams(mwc64xvec4_state_t *s, ulong baseOffset, ulong perStreamOffset);
void MWC64XVEC8_SeedStreams(mwc64xvec8_state_t *s, ulong baseOffset, ulong perStreamOffset);
```

The seed functions automatically take into account how many streams a particular RNG is producing, and calculate the offset of each stream within the entire MWC64X sequence as (baseOffset+perStreamOffset*(get_global_id(0)*vectorSize+vectorOffset)).
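To make the formula concrete, here is a host-side sketch of the offset arithmetic. The function name is mine (it does not exist in the package), globalId stands in for get_global_id(0), and lane stands in for vectorOffset:

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate the per-stream offset formula from the text:
   baseOffset + perStreamOffset*(globalId*vectorSize + lane).
   vectorSize is the number of streams per generator (1, 2, 4 or 8),
   and lane is the stream's position within the vector. */
uint64_t stream_offset(uint64_t baseOffset, uint64_t perStreamOffset,
                       uint64_t globalId, uint64_t vectorSize, uint64_t lane)
{
    return baseOffset + perStreamOffset * (globalId * vectorSize + lane);
}
```

So with, say, a four-vector generator and a per-stream offset of 100, lane 3 of work-item 0 starts at offset 300 and lane 0 of work-item 1 starts at offset 400: the streams tile the global sequence contiguously with no overlap.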

The nice thing about this is that you can create simulations which give exactly the same result, no matter how many work items you launch, or whether you use the single-stream or multiple-stream RNG. For example, this function estimates pi using n 2D points, and gives the same result no matter how many work items you use (assuming n is divisible by get_global_size(0)):

```c
__kernel void EstimatePi(uint n, ulong baseOffset, __global uint *acc)
{
    mwc64x_state_t rng;
    ulong samplesPerStream=n/get_global_size(0);
    MWC64X_SeedStreams(&rng, baseOffset, 2*samplesPerStream);
    uint count=0;
    for(uint i=0;i<samplesPerStream;i++){
        ulong x=MWC64X_NextUint(&rng);
        ulong y=MWC64X_NextUint(&rng);
        ulong x2=x*x;
        ulong y2=y*y;
        if(x2+y2 >= x2)   // True iff x2+y2 didn't wrap, i.e. x^2+y^2 < 2^64
            count++;
    }
    atomic_add(acc, count);
}
```

Each time it is called it will consume 2*n contiguous samples from the stream (two per point), starting from baseOffset.

We could do exactly the same thing with the four-stream version, and will get exactly the same answer:

```c
__kernel void EstimatePi(uint n, ulong baseOffset, __global uint *acc)
{
    mwc64xvec4_state_t rng;
    ulong samplesPerStream=n/(4*get_global_size(0));
    MWC64XVEC4_SeedStreams(&rng, baseOffset, 2*samplesPerStream);
    uint count=0;
    for(uint i=0;i<samplesPerStream;i++){
        ulong4 x=convert_ulong4(MWC64XVEC4_NextUint4(&rng));
        ulong4 y=convert_ulong4(MWC64XVEC4_NextUint4(&rng));
        ulong4 x2=x*x;
        ulong4 y2=y*y;
        long4 hit=-(x2+y2 >= x2);  // -1 per hit lane, negated to +1
        count+=(uint)(hit.x+hit.y+hit.z+hit.w);
    }
    atomic_add(acc, count);
}
```

The source code includes the pi calculation example for the various stream sizes, and for a given number of random samples, you should get exactly the same result from all combinations of RNGs and work-items. This is quite nice for debugging: you can develop a simulation using the single-stream version with one work-item, then check you still get the same result when using hundreds of work-items (assuming you take floating-point evaluation into account).

MWC64X is based on a Multiply-With-Carry (MWC) generator, which is a type of generator invented by George Marsaglia. For our purposes we can describe the MWC generator using some simple pseudo-code and 64-bit integers:

```c
enum{ A = 4294883355UL };      // A "special" constant
ulong x;                       // Holds the current RNG state
x=(x&0xFFFFFFFF)*A+(x>>32);    // Advance RNG to the next state
```

If the constant A is chosen carefully, then the state will advance through a sequence with period P=(A*2^32-2)/2, so the generator above has a period of P=2^62.999972. As it turns out, the MWC is closely related to another type of generator, called a Linear Congruential Generator (LCG), and in particular it is related to a prime-modulus LCG. Prime-modulus LCGs generally have much better quality than non-prime ones, but are usually quite slow to implement, so the MWC provides a special case where prime-modulus evaluation is very quick. In particular, if we treat the MWC as two 32-bit components c and x, we can describe the generator as:

```c
uint x, c;            // 64-bit state split into 32-bits
uint hi=mul_hi(x,A);
x=x*A+c;
c=hi+(x<c);           // The final term deals with overflow of x
```

If we split this up into one statement per GPU instruction then we end up with:

```c
uint hi=mul_hi(x,A);
uint lo=x*A;
x=lo+c;
c=x<c;
c=c+hi;
```

So each MWC step requires five 32-bit instructions: two multiplications that can (potentially) be performed in parallel, followed by a chain of three dependent instructions. Note that the register updates can occur in-place, so there are no extraneous moves needed to get everything back into the correct register.

So an MWC allows us to implement a 64-bit RNG that is really fast on a GPU, but there is one problem: the MWC is equivalent to a prime-modulus LCG, but unfortunately you don't have much control over the LCG multiplier. Usually one chooses the LCG multiplier to optimise the spectral distribution in the first ten or so dimensions, but here we just have to deal with what the MWC maths allows, and sadly it only allows multipliers that provide quite a poor distribution of consecutive triples (in other dimensions it is pretty decent). In practice this means that the plain MWC fails tests looking at triples, and is not appropriate for use in its raw form.
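The MWC/LCG equivalence is easy to check numerically: packing the state as s = c*2^32 + x, one MWC step is exactly s -> A*s mod M, where M = A*2^32 - 1 (because A*2^32 = 1 mod M). Here is a sketch in plain C, assuming the GCC/Clang __uint128_t extension; the helper names are mine, not from the package:

```c
#include <assert.h>
#include <stdint.h>

/* MWC step on the split (x, c) state, as in the text. */
static void mwc_step(uint32_t *x, uint32_t *c)
{
    const uint32_t a = 4294883355U;
    uint32_t hi = (uint32_t)(((uint64_t)*x * a) >> 32);  /* mul_hi(x, A) */
    uint32_t nx = *x * a + *c;
    *c = hi + (nx < *c);
    *x = nx;
}

/* The equivalent prime-modulus LCG step: s -> A*s mod M, acting on the
   packed state s = c*2^32 + x, with modulus M = A*2^32 - 1. */
static uint64_t lcg_step(uint64_t s)
{
    const uint64_t a = 4294883355ULL;
    const uint64_t m = a * 4294967296ULL - 1ULL;  /* A*2^32 - 1, fits in 64 bits */
    return (uint64_t)(((__uint128_t)a * s) % m);
}

/* The two walks should stay in lockstep from any ordinary seed
   (c < A, state nonzero). */
int mwc_matches_lcg(uint32_t x, uint32_t c, int steps)
{
    uint64_t s = ((uint64_t)c << 32) | x;
    for (int i = 0; i < steps; i++) {
        mwc_step(&x, &c);
        s = lcg_step(s);
        if (s != (((uint64_t)c << 32) | x))
            return 0;
    }
    return 1;
}
```

The key step in the argument: A*s = A*c*2^32 + A*x = c + A*x (mod M), which is precisely the 64-bit value the MWC step computes.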

However, one of the nice things about the MWC in the GPU-optimised form is that it gives you two components (x and c) to work with. In addition, both components completely change at each update step. So what we can do is combine these two components to effectively destroy the spectral structure of the MWC - we lose the theoretical guarantees that the spectral test gives us, but in exchange we gain extremely good empirical quality.

There are lots of ways we could combine x and c, but the most obvious choice is also extremely effective - we can simply xor them together. This has the nice intuitive argument that we are combining the components using a different "type" of maths (i.e. GF(2)) than was used to generate them, and also behaves very well statistically. So if we hold the state of the RNG in a uint2 (i.e. a two element OpenCL vector), the full MWC64X update step is:

```c
uint MWC64X(uint2 *state)
{
    enum{ A=4294883355U };
    uint x=(*state).x, c=(*state).y;  // Unpack the state
    uint res=x^c;                     // Calculate the result
    uint hi=mul_hi(x,A);              // Step the RNG
    x=x*A+c;
    c=hi+(x<c);
    *state=(uint2)(x,c);              // Pack the state back up
    return res;                       // Return the next result
}
```

This will generate a new 32-bit uniform integer every time it is called. Note that we perform the exclusive-or on the previous state, as on some GPUs (e.g. AMD) this lets the VLIW scheduler get to work on the output of the RNG in parallel with the actual update.

To be clear: I have no theoretical quality measures for this RNG. All I can say is that it has a period of 2^63, and it passes every statistical test that I am aware of, so there are no known empirical flaws. It easily passes the Diehard and NIST tests, but they are pretty out of date and irrelevant in the face of modern simulations consuming trillions of samples.

For a realistic test we turn to TestU01, which provides a large set of statistical tests, and also defines two extremely stringent test batteries, Crush and BigCrush, which consume huge numbers of samples. MWC64X passes both of these tests with no suspicious p-values.

One of the problems with many RNGs is that the low-order bits are less random than the high-order bits, so to test for that I also ran BigCrush on the bit-reversed output of MWC64X. It still passed all the tests, which means one can get away with doing things like:

```c
uint index=MWC64X(pState) & 0xF;
```

to get random integers in [0..16) without worrying about correlations.

Finally, one thing that worried me a little was the straight uniformity of the output (as the range of the c component is slightly deficient): if we do a histogram of the output, is it flat? To test this I bucketed samples according to their top 20 bits, i.e. created a histogram with one million buckets. For each power-of-two number of samples up to 2^40 the histogram was filled with consecutive samples, and for each sample size chi-squared tests were done at 2^20 buckets and also by aggregating down to as few as 2^4 buckets. In total this required 2^41 consecutive samples, and none of the p-values was suspicious, so the uniformity seems to be pretty damn good and I was probably worrying about nothing.
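A scaled-down version of that check is easy to reproduce. The sketch below is my own illustration in plain C (not part of the package), using the 64-bit-state form of MWC64X from earlier; it uses only 256 buckets and 2^20 samples rather than 2^20 buckets and 2^41 samples:

```c
#include <assert.h>
#include <stdint.h>

/* MWC64X on the 64-bit packed state, as given earlier. */
static uint32_t mwc64x(uint64_t *state)
{
    uint32_t c = (uint32_t)(*state >> 32), x = (uint32_t)*state;
    *state = x * (uint64_t)4294883355U + c;
    return x ^ c;
}

#define BUCKETS 256

/* Bucket samples by their top 8 bits and return the chi-squared
   statistic against a flat histogram; *total reports the number of
   samples binned, as a structural check. */
double top_bits_chi2(uint64_t seed, uint32_t nsamples, uint32_t *total)
{
    uint32_t hist[BUCKETS] = {0};
    uint64_t s = seed;
    for (uint32_t i = 0; i < nsamples; i++)
        hist[mwc64x(&s) >> 24]++;       /* top 8 bits pick the bucket */
    double expected = (double)nsamples / BUCKETS, chi2 = 0.0;
    *total = 0;
    for (int b = 0; b < BUCKETS; b++) {
        double d = hist[b] - expected;
        chi2 += d * d / expected;
        *total += hist[b];
    }
    return chi2;
}
```

For a flat histogram the statistic should land near its 255 degrees of freedom; a value wildly above that would indicate a uniformity problem.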

Stream skipping and/or stream splitting are ways of carving the output of a random number generator into multiple distinct streams. An RNG can be described in terms of two parts: a state transition function s_{i}=f(s_{i-1}) and an output function x_{i}=g(s_{i}). Given an initial state s_{0}, we can define the sequence of states as s_{1}=f(s_{0}), s_{2}=f(s_{1}), and the sequence of outputs as x_{0}=g(s_{0}), x_{1}=g(s_{1}). So we end up with a sequence of outputs x_{0}, x_{1}, x_{2}, ... Eventually this sequence must repeat, so for example with MWC64X we find that at p~=2^{63} we have x_{p}=x_{0}, x_{p+1}=x_{1}, etc.

The idea of stream splitting is to take n samples from the large sequence, and to give each of k processors a contiguous sub-sequence from the n samples. For example, assume that we want to run a simulation using n=2^{40} random samples, and wish to process it using k=2^{10} processors (e.g. work items); then we need to give each processor m=n/k=2^{30} consecutive samples.

For processor 0, that is easy: we can assign it the sub-stream x_{0}...x_{m-1}, so the processor just starts in the first state and sequentially generates m random numbers. For processor 1, we want it to process samples x_{m}..x_{2m-1}, i.e. the samples immediately following those consumed by processor 0. However, to get to sample x_{m} knowing only the starting state s_{0} would require processor 1 to perform 2^{30} RNG steps before it could start work. The worst situation would be for processor k-1, which must generate and throw away almost 2^{40} samples before it could start any real work. Such an approach would be pointless, and extremely slow.

Fortunately, we can use the relationship between the MWC and LCG generators to implement a skip function s_{i+d}=skip(s_{i},d), which is able to jump d steps forward in far fewer than d steps (approximately log2(d) operations, although each operation is more complicated than the RNG step function). This allows each processor j to jump directly to s_{j*m}=skip(s_{0},j*m), without having to generate all the intermediate states.
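To sketch the idea (this is an illustration only, not the package's actual 32-bit OpenCL skip code): since advancing the packed state s = c*2^32 + x by one step is multiplication by A modulo M = A*2^32 - 1, skipping d steps is multiplication by A^d mod M, computed by square-and-multiply in about log2(d) modular multiplications. A plain-C version using the GCC/Clang __uint128_t extension:

```c
#include <assert.h>
#include <stdint.h>

/* One MWC64X state-transition step on the packed 64-bit state. */
uint64_t mwc_next(uint64_t s)
{
    uint32_t c = (uint32_t)(s >> 32), x = (uint32_t)s;
    return x * (uint64_t)4294883355U + c;
}

/* Skip d steps at once: s_{i+d} = A^d * s_i mod M.  Square-and-multiply
   takes roughly log2(d) modular multiplications, each done here with
   128-bit intermediates. */
uint64_t mwc_skip(uint64_t s, uint64_t d)
{
    const uint64_t A = 4294883355ULL;
    const uint64_t M = A * 4294967296ULL - 1ULL;  /* A*2^32 - 1 */
    uint64_t acc = s % M, sq = A;
    while (d) {
        if (d & 1)
            acc = (uint64_t)(((__uint128_t)acc * sq) % M);  /* multiply */
        sq = (uint64_t)(((__uint128_t)sq * sq) % M);        /* square   */
        d >>= 1;
    }
    return acc;
}
```

Skipping forward 1000 steps this way lands on exactly the same state as calling the step function 1000 times, but costs only around ten modular multiplications.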

A particularly nice feature of the MWC64X skipping method for GPUs is that it requires very little storage - around 10 or so 64-bit registers, and no local or global RAM. This makes it much easier to manage skipping/seeding, as all work-items can skip at the same time, and there is no need to allocate or manage additional memory. The tradeoff is that the MWC64X skip function is a bit more computationally intensive than some other methods (for example those for xor-based generators), but is still fast (particularly for small distances), and skipping is rarely a bottleneck.

In the source code, skipping is provided by a function specialised for each generator:

```c
void MWC64X_Skip(mwc64x_state_t *s, ulong distance);
void MWC64XVEC2_Skip(mwc64xvec2_state_t *s, ulong distance);
void MWC64XVEC4_Skip(mwc64xvec4_state_t *s, ulong distance);
void MWC64XVEC8_Skip(mwc64xvec8_state_t *s, ulong distance);
```

For any given state this will skip the generator forwards by the requested distance. For generators providing multiple streams, each stream is skipped by the same distance.

More useful for most people is MWC64X_SeedStreams and its variants, which will automatically take into account the number of work items and streams per generator. See the source code overview, or look at the estimate_pi example in the source code for more information.
