The exponential distribution is extremely important for simulations which model discrete events, as it describes the waiting time until the next event in a Poisson process. Many real-world processes can be modelled as Poisson processes, such as chemical interactions, radioactive particle decay, and loan defaults. Many of these models cannot be solved analytically, so it is necessary to use Monte Carlo simulation to explore the average behaviour of the models, a computationally intensive process that is an ideal candidate for FPGA acceleration.

To map these simulations into FPGAs we need some kind of exponential random number generator. There are a number of ways of sampling from the exponential distribution in software, most notably the superb Ziggurat method. However, if we map these algorithms into hardware we find they become less efficient. For example, the Ziggurat method is designed to be incredibly efficient for 99% of the random numbers generated, and to be reasonably efficient in the remaining 1% of cases. This works very well in software, as the 1% case has no impact on the execution speed of the 99% case.

However, hardware performance is measured as a combination of both execution speed (cycle time) and resource usage. To support the Ziggurat method both the 99% and 1% pipelines must exist at all times, so in terms of resource usage per generated number there is little gain. A second problem is that the 1% case actually uses a rejection method, so a hardware version could not guarantee to generate one new random number every cycle.

The best existing methods for exponential random numbers in FPGAs use standard function approximation methods: custom range-reduction and tables of polynomials. However, it appears to be relatively expensive to get high-resolution variates using these methods (the existing work only looks at 16 fractional bits), which is an important feature both to ensure that tie-breaking between event times is fair, and to ensure that there is sufficient resolution left after taking the minimum of large numbers of event times.

Personally, my biggest complaint about existing methods is that they are just plain boring, and don't fit in with my overall goals.

One of the interesting things about the exponential distribution is the memoryless
property. What this means is that if the waiting time until some event follows
the exponential distribution, then the fact that you have already waited some time
without the event occurring has no effect on the time you still have to wait. For
example, assume that you are waiting for a radioactive particle to decay,
and you know that the time until decay follows the exponential distribution with
a mean of *t* seconds. If you start waiting, and the particle
hasn't decayed after *s* seconds, the time until decay still follows the
same exponential distribution. No matter how long you have already spent
waiting for the decay, the amount of time you still have to wait will follow the
same exponential distribution.
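The property is easy to check numerically. The sketch below (ordinary Python, nothing to do with the FPGA implementation) estimates the conditional probability of waiting beyond some further time given that you have already waited, and compares it with the unconditional probability; memorylessness says the two are equal:

```python
import math
import random

random.seed(0)
N = 200_000
already, extra = 0.7, 1.0   # time already waited, further time of interest

# Unconditional probability Pr[X > extra] for a unit-mean exponential
uncond = sum(random.expovariate(1.0) > extra for _ in range(N)) / N

# Conditional probability Pr[X > already + extra | X > already],
# estimated by rejection: keep only samples that "survived" past `already`
survivors = [x for x in (random.expovariate(1.0) for _ in range(3 * N))
             if x > already]
cond = sum(x > already + extra for x in survivors) / len(survivors)

# Memorylessness says the two probabilities are equal (both exp(-extra))
print(round(uncond, 3), round(cond, 3))
```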

The memoryless property started me thinking about how to break the exponential distribution into independent components, as the distribution is clearly separable in some way. The key is to consider the variate as a fixed-point number:

*x* = Σ_{i} *b*_{i} 2^{i}

where each *b*_{i} is a single bit of the *w*-bit fixed-point representation (some bits with integer weights, some with fractional weights).

Clearly the most significant bit *b*_{s} is a Bernoulli variate (random bit), as this is true of any bit of a fixed-point random variable. However, if we consider the next bit *b*_{s-1}: by the memoryless property, knowing the value of *b*_{s} tells us nothing about the distribution of the remaining, lower-significance part of the waiting time, so *b*_{s-1} is also a Bernoulli variate, and is independent of *b*_{s}.

This feature continues recursively through the bits, with the surprising result that
*all the bits of a fixed-point exponential variate are completely independent*.
The probability of each fixed-point bit being 1 is:

p_{i} = Pr[ *b*_{i} = 1 ] = 1/(1+exp(2^{i}))
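A quick way to convince yourself of this result is to check the bit probabilities empirically. The sketch below (plain Python, my own illustration rather than anything from the paper) draws unit-rate exponential variates, extracts a few bits of their fixed-point expansion, and compares each bit's observed frequency of being 1 against the predicted 1/(1+exp(2^{i})):

```python
import math
import random

random.seed(1)
N = 200_000
weights = (-2, -1, 0, 1)          # bit weights 2^i to examine
counts = {i: 0 for i in weights}

for _ in range(N):
    x = random.expovariate(1.0)   # unit-rate exponential variate
    for i in weights:
        # bit of weight 2^i in the fixed-point expansion of x
        if int(x / 2.0 ** i) & 1:
            counts[i] += 1

for i in weights:
    observed = counts[i] / N
    predicted = 1.0 / (1.0 + math.exp(2.0 ** i))
    print(i, round(observed, 4), round(predicted, 4))
```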

There are two ways of generating variates using this method. The first is
the direct method: generate each Bernoulli variate *b*_{i}
by generating an *m*-bit uniform random number, and comparing it with
an *m*-bit constant threshold representing p_{i}.

It may seem that this method is extremely wasteful: we have linear resource growth in both
*w* (the width of the fixed-point output) and *m* (the width of the
Bernoulli comparisons), so overall resource usage grows roughly quadratically
with quality. However, the scaling constant on this quadratic
is very low, as random uniform bits can be generated at a cost of 1 LE per bit, and
an *m*-bit constant comparison can be executed in approximately
ceil(*m*/*k*) LEs, where *k* is the number of LUT inputs (see the paper for more details).
This makes the resource cost much lower than might be expected, particularly
as no DSPs or RAMs are used.
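As a software model of the direct method (a sketch only: the function name and parameters are mine, and a real generator computes all the comparisons in parallel hardware against m-bit integer thresholds rather than floats), each output bit is a single comparison between a uniform number and its threshold p_{i}:

```python
import math
import random

def exp_variate_bitwise(int_bits=8, frac_bits=24, rng=random):
    """Assemble an Exp(1) variate bit by bit: each bit of weight 2^i is
    an independent Bernoulli draw with p_i = 1/(1 + exp(2^i))."""
    x = 0.0
    for i in range(-frac_bits, int_bits):
        p_i = 1.0 / (1.0 + math.exp(2.0 ** i))
        # In hardware: an m-bit uniform value compared against an m-bit
        # constant; here a float comparison stands in for both.
        if rng.random() < p_i:
            x += 2.0 ** i
    return x

random.seed(2)
mean = sum(exp_variate_bitwise() for _ in range(100_000)) / 100_000
print(round(mean, 3))   # should be close to 1, the mean of Exp(1)
```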

The key features of the direct (LUT) method are then:

- Total resource (LE) cost of ~ *w*(*m* + ceil(*m*/*k*)).
- Maximum path length of FF-LUT-FF.
- Intrinsic period of at least *w m* (or even more if a LUT-FIFO generator is used).

For example, a high-quality instance with *w*=*m*=36 achieves:

- Total resource usage of 1422 LEs (881 slices)[1].
- Maximum clock rate of 607MHz.
- Empirical statistical quality that has passed all tests applied so far^{[2]}.

The second generation method is to generate groups of bits, rather than individual
bits. If we consider a span of *r* bits from the fixed-point exponential
variate, then the bits together form a 2^{r}-valued discrete
distribution. Luckily, generating discrete distributions in hardware is fast and efficient, thanks
to Walker's alias method (of which I am much enamoured: it crops up in a number of
my papers). This lets us generate *r* bits at precision *m* using
an *m*-bit RAM with 2^{r} entries and a little bit of logic.
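A software model of the table-based method might look as follows (a sketch: the function names are mine, and a hardware version would store the cutoffs as m-bit integers in block RAM rather than floats). The 2^{r} group probabilities come straight from the per-bit independence result; the alias table then gives one sample per lookup-plus-comparison:

```python
import math
import random

def group_probs(r, lo):
    """Probability of each r-bit pattern for bits of weight
    2^lo .. 2^(lo+r-1); independence lets us just multiply the
    per-bit probabilities p_i = 1/(1 + exp(2^i))."""
    p = [1.0 / (1.0 + math.exp(2.0 ** (lo + j))) for j in range(r)]
    probs = []
    for v in range(2 ** r):
        q = 1.0
        for j in range(r):
            q *= p[j] if (v >> j) & 1 else 1.0 - p[j]
        probs.append(q)
    return probs

def build_alias(probs):
    """Walker's alias method: O(n) setup, then O(1) per sample."""
    n = len(probs)
    scaled = [q * n for q in probs]
    alias, cutoff = [0] * n, [1.0] * n
    small = [i for i, q in enumerate(scaled) if q < 1.0]
    large = [i for i, q in enumerate(scaled) if q >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        cutoff[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return alias, cutoff

def sample_group(alias, cutoff, rng=random):
    i = rng.randrange(len(alias))       # RAM address
    return i if rng.random() < cutoff[i] else alias[i]  # one comparison

random.seed(3)
alias, cutoff = build_alias(group_probs(4, -4))   # 16-entry table
draws = [sample_group(alias, cutoff) for _ in range(100_000)]
freq0 = sum(v & 1 for v in draws) / len(draws)    # bit of weight 2^-4
print(round(freq0, 3))  # should approach 1/(1 + exp(2^-4))
```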

The overall effect of the second method is to trade off the use of a number of block RAMs for use of LEs. The performance is also reduced, as block-RAMs are fundamentally slower than LEs (and in my current implementation I'm too lazy to pipeline the comparators much). Which method is preferable depends mainly on which resources are most precious, which is dictated both by the application and the resource balance in the target device.

To give an idea of the resource tradeoff, the resources for the previously mentioned
generator with *m*=*w*=36 change as follows:

|          | FFs  | LUTs | Slices | RAMs | MHz |
|----------|------|------|--------|------|-----|
| Bit-wise | 1422 | 1350 | 881    | 0    | 607 |
| Table    | 413  | 370  | 397    | 4    | 375 |

[2] Empirical quality: We haven't published these results
yet, but we have run a number of Chi2 tests, using equal-probability buckets (including adjustments
for the effects of fixed-point boundaries on the buckets' expected probabilities). We used bucket counts of
2^{8}, 2^{10} and 2^{12}, and for each one tested sample
counts from 2^{10} up to 2^{40}, at 60 (logarithmically spaced) points.
The high-quality version passed everything, with nothing looking remotely suspicious on the
p-value front. And yes, testing up to 2^{40} (~10^{12}) samples takes one
hell of a long time.
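For readers who want to reproduce this kind of test, here is a simplified sketch of an equal-probability-bucket Chi2 test (ignoring the fixed-point boundary adjustments mentioned above, and applied here to Python's own exponential generator rather than the hardware one):

```python
import bisect
import math
import random

random.seed(4)
K = 256        # number of equal-probability buckets
N = 100_000    # sample count

# Equal-probability bucket edges under Exp(1): the k-th edge is the
# inverse CDF evaluated at k/K, i.e. -ln(1 - k/K)
edges = [-math.log(1.0 - k / K) for k in range(1, K)]

counts = [0] * K
for _ in range(N):
    counts[bisect.bisect_left(edges, random.expovariate(1.0))] += 1

expected = N / K
chi2 = sum((c - expected) ** 2 / expected for c in counts)
print(round(chi2, 1))  # should be near K - 1 = 255 for a good generator
```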

[3] Synthesis methods: It's worth noting that the table-based method has not been optimised nearly as much as the bit-wise one. The rubbish slice-packing for the table generator is because it used Handel-C for synthesis (which does no packing), while the bit-wise one used XST. This also affects the clock rate and so on. The reason we used Handel-C for synthesis is that it used fewer FFs and LUTs than XST, and we are trying to demonstrate the area tradeoff more than anything (375MHz is plenty fast enough for most applications).
