Uniform generators are usually seen as generating a random word per function call, treated either as an integer or a real, as this reflects both the way that software generation algorithms are designed, and the way the random numbers are usually consumed. However, in an FPGA it makes more sense to think in terms of generating a vector of n random bits per cycle, which can then be grouped into one or more independent words as needed. The number of bits needed depends on the application, and may be very large, particularly when multiple parallel simulation instances are packed into one design: in one credit-risk modelling application 6048 independent bits were consumed in every cycle at 233MHz, so a total generation rate of 1.4 Tb/s is required.
The naive approach to this problem is to instantiate 189 individual 32-bit generators, such as Taus-113, a reasonable quality software generator with period 2113 that is easy to describe using current HDLs. However, each Taus-113 instance requires a total of 208 LEs, so the total resource usage would be 39312. So we would spend 6.5 LEs per generated bit, and even though the whole set of generators contains 26624 active state bits (a third of the LE FFs are wasted and do not store state), the period of the aggregate generator is still only 2113, not even close to the potential period of 226624.
Because different applications need different RNG characteristics, we have developed three types of RNG, each of which provides a different tradeoff between statistical quality and resource usage:
Our approach is to target the logic elements directly, and optimise for the LUTs and FFs they contain. The first example of this is the LUT-Optimised generator (catchy name, I know). The idea here is to use logic elements and to get as close to the minimum resource bounds as possible: if we need n bits, then we have to use at least n LUTs (otherwise at least one bit would be a copy of another bit). This means that we have to use n logic elements, which means that there are also n associated FFs. The maximum possible period with n bits of state is 2n, so lets try to get as close to that as possible.
To achieve these goals we took the underlying binary linear recurrence theory, used to design word-based generators such as Taus-113, the Mersenne Twister, and WELL, and adapted it to find recurrences which work with bit-wise logic elements. This leads to generators with the following properties:
LUT-Optimised generators work very well for n up to about 512, providing clock rates in excess of 500MHz for Virtex-4. We found the clock rate decreases roughly logarithmically with n, so if more bits per cycle are needed then it is usually better to instantiate multiple generators. The maximum possible size is somewhere around 1300, as it becomes computationally infeasible to search for maximum period generators for larger sizes.
The empirical quality of generators is pretty good: for n above 128 the only tests in Crush that are failed are for matrix rank and linear complexity, which are impossible for binary linear recurrences to pass (even the Mersenne Twister and WELL fail them). We have also looked at the equidistribution of the generators, but it is not really clear what word-based equidistribution means in the context of a bit generator that will be regrouped into multiple smaller bits. In general it is easy to find a generator that has a worst case dimension gap (Δ1) of 1 (treating the output as a single n bit word), but it is more difficult to find maximally equidistributed generators: we have only found them for 64-bit generators using 5- and 6-input LUTs, although no particularly targeted search has yet been performed.
The LUT-Optimised generators are ideally suited to the job of providing input bits for non-uniform random number generators, as the minor statistical defects are likely to be masked by the transformation process. However, some applications may use the bits directly, where the linear nature and relatively low period (<2^512) of the generator may combine to produce biases in the application results. Another problem is that some applications may only need a small number of bits per cycle, for example, one 36 bit word. A generator with a period of 236 is highly likely to wrap around in an FPGA context, and will also have relatively poor statistical quality. It is possible to use a larger generator, but only to take a subset of the bits each cycle, but this seems like a waste of logic elements.
To extend the period of a generator we need to add state, and as well as the FFs in logic elements, FPGAs also typically contain different types of storage elements, such as block RAMs, distributed RAMs, and shift-registers. All of these can be configured to act as fixed-length FIFOs, which delay data for some fixed number of cycles. By combining these FIFOs with logic elements we can create very generators with very large states (and hence periods), without using huge amounts of logic.
The resulting LUT-FIFO generators use the same binary linear recurrence theory as the LUT-Optimised generators, but the matrix is constructed in a different way to incorporate both active bits (LUTs) and delay bits (FIFOs). By using block RAM based FIFOs we are able to get much higher periods, such as 211213 using a Virtex-(2,4) block RAM, or 2^23209 with a Virtex-5 block RAM. The key advantages of this method are:
The LUT-FIFO generator has the advantage of extremely good statistical quality, but requires a bock RAM, which is a fairly expensive resource in many applications. The LUT-SR generator provides a middle ground between the LUT-Opt and LUT-FIFO generator, by using cheap bit-wise shift-registers (e.g. SRL32s in Xilinx) to provide long periods and good quality without requiring expensive resources.
LUT-SR generators are appropriate for most uses of random number generators, and also have the advantage that they can be compactly specified and easily implemented. Full C++ and VHDL source code is available here.
Prev: Motivation, Up: RNGs for FPGAs, Next: Exponential Generators