Uniform generators are usually seen as generating a random word per function call, treated either as
an integer or a real, as this reflects both the way that software generation algorithms
are designed, and the way the random numbers are usually consumed. However, in
an FPGA it makes more sense to think in terms of generating a vector of *n* random
bits per cycle, which can then be grouped into one or more independent words as needed.
The number of bits needed depends on the application, and may be very large, particularly
when multiple parallel simulation instances are packed into one design: in one credit-risk
modelling application 6048 independent bits were consumed in every cycle at 233MHz, so
a total generation rate of 1.4 Tb/s is required.

The naive approach to this problem is to instantiate 189 individual 32-bit generators, such
as Taus-113, a reasonable quality software generator with period 2^{113} that is easy to describe
using current HDLs. However, each Taus-113 instance requires a total of 208 LEs, so the total
resource usage would be 39312. So we would spend 6.5 LEs per generated bit, and even
though the whole set of generators contains 26624 active state bits (a third of the LE FFs are
wasted and do not store state), the period of the aggregate generator is still only 2^{113}, not
even close to the potential period of 2^{26624}.

Because different applications need different RNG characteristics, we have developed three types of RNG, each of which provides a different tradeoff between statistical quality and resource usage:

**LUT-Optimised Generators**: Very low resource usage (1 LUT per generated bit), but with limited period and statistical quality. Useful as a building block for non-uniform RNGs.**LUT-FIFO Generators**: Each LUT-FIFO generator requires a block RAM plus 1 LUT per bit, but in exchange for the increased resources provides huge periods and extremely high quality.**LUT-SR Generators**: These provide a good balance between resource usage and statistical quality, as the period is at least 2^^{1024}-1, while the resource usage is 2 LUTs per generated bit.

Our approach is to target the logic elements directly, and optimise for the LUTs and FFs
they contain. The first example of this is the
LUT-Optimised generator (catchy name, I know).
The idea here is to use logic elements and to get as close to the minimum resource bounds as
possible: if we need *n* bits, then we have to use at least *n* LUTs (otherwise
at least one bit would be a copy of another bit). This means that we have to use *n*
logic elements, which means that there are also *n* associated FFs. The maximum
possible period with *n* bits of state is 2^{n}, so lets try to get as close to that as possible.

To achieve these goals we took the underlying binary linear recurrence theory, used to design word-based generators such as Taus-113, the Mersenne Twister, and WELL, and adapted it to find recurrences which work with bit-wise logic elements. This leads to generators with the following properties:

- Generators can be customised for the application, to produce as many bits as are needed per cycle.
- A generator producing
*n*bits per cycle takes exactly*n*logic elements. - The period of the generator is 2
^{n}-1 - Both the LUT and FF of each logic element are fully utilised.
- The longest path within the generator is FF-LUT-FF, allowing extremely high clock rates.

LUT-Optimised generators work very well for *n* up to about 512, providing clock
rates in excess of 500MHz for Virtex-4. We found the clock rate decreases roughly
logarithmically with *n*, so if more bits per cycle are needed then it is usually
better to instantiate multiple generators. The maximum possible size is somewhere around
1300, as it becomes computationally infeasible to search for maximum period generators
for larger sizes.

The empirical quality of generators is pretty good: for
*n* above 128 the only tests in Crush that are failed are for matrix rank and
linear complexity, which are impossible for binary linear recurrences to pass (even
the Mersenne Twister and WELL fail them). We have also looked at the equidistribution
of the generators, but it is not really clear what word-based equidistribution means
in the context of a bit generator that will be regrouped into multiple smaller bits.
In general it is easy to find a generator that has a worst case dimension gap
(Δ_{1}) of 1 (treating the output as a single *n* bit word), but it
is more difficult to find maximally equidistributed generators: we have only found them
for 64-bit generators using 5- and 6-input LUTs, although no particularly targeted
search has yet been performed.

The LUT-Optimised generators are ideally suited to the job of providing input
bits for non-uniform random number generators, as the minor statistical defects
are likely to be masked by the transformation process. However, some applications
may use the bits directly, where the linear nature and relatively low period (<2^^{512})
of the generator may combine to produce biases in the application results. Another problem is that
some applications may only need a small number of bits per cycle, for example, one
36 bit word. A generator with a period of 2^{36} is highly likely
to wrap around in an FPGA context, and will also have relatively poor statistical quality.
It is possible to use a larger generator, but only to take a subset of the bits each cycle,
but this seems like a waste of logic elements.

To extend the period of a generator we need to add state, and as well as the FFs in logic elements, FPGAs also typically contain different types of storage elements, such as block RAMs, distributed RAMs, and shift-registers. All of these can be configured to act as fixed-length FIFOs, which delay data for some fixed number of cycles. By combining these FIFOs with logic elements we can create very generators with very large states (and hence periods), without using huge amounts of logic.

The resulting LUT-FIFO generators
use the same binary linear recurrence theory as the LUT-Optimised generators, but
the matrix is constructed in a different way to incorporate both active bits (LUTs) and
delay bits (FIFOs). By using block RAM based FIFOs we are able to get much higher
periods, such as 2^{11213} using a Virtex-(2,4) block RAM, or
2^^{23209} with a Virtex-5 block RAM. The key advantages of this method are:

- Very large periods are possible (limited only by block RAM size).
- The number of generated bits per cycle can still be customised for the application, allowing huge numbers of bits to be generated per cycle.
- The critical path is the longer of FF-LUT-FF and FF-RAM-FF, so performance is usually limited by the RAM's maximum clock rate (e.g. 500MHz in Virtex-4).
- The resource efficiency is very good, and increases with the number of bits generated, from
1.31 LEs/bit at
*n*=89, to 1.05 LEs/bit at*n*=521 (in Virtex-4).

The LUT-FIFO generator has the advantage of extremely good statistical quality, but requires a bock RAM, which is a fairly expensive resource in many applications. The LUT-SR generator provides a middle ground between the LUT-Opt and LUT-FIFO generator, by using cheap bit-wise shift-registers (e.g. SRL32s in Xilinx) to provide long periods and good quality without requiring expensive resources.

LUT-SR generators are appropriate for most uses of random number generators, and also have the advantage that they can be compactly specified and easily implemented. Full C++ and VHDL source code is available here.

Prev: Motivation, Up: RNGs for FPGAs, Next: Exponential Generators