## Exponential generators for FPGAs

The exponential distribution is extremely important for simulations which model discrete events, as it describes the expected waiting time until the next event for a Poisson process. Many real-world processes can be modelled as Poisson processes, such as chemical interactions, radioactive particle decay, and loan defaults. Many of these models cannot be solved analytically, so it is necessary to use Monte-Carlo simulation to explore the average behaviour of the models, a computationally intensive process that is an ideal candidate for FPGA acceleration.

### Existing methods

To map these simulations into FPGAs we need some kind of exponential random number generator. There are a number of ways of sampling from the exponential distribution in software, most notably the superb Zigurrat method. However, if we map these algorithms into software we find they become less efficient. For example, the Zigurrat method is designed to be incredibly efficient for 99% of the random numbers generated, and to be reasonably efficient in the remaining 1% of cases. This works very well in software, as the 1% case has no impact on the execution speed of the 99% case.

However, hardware performance is measured as a combination of both execution speed (cycle time) and resource usage. To support the Ziggurat method both the 99% and 1% pipelines must exist at all times, so in terms of resource usage per generated number there is little gain. A second problem is that the 1% case actually uses a rejection method, so a hardware version could not guarantee to generate one new random number every cycle.

The best existing methods for exponential random numbers in FPGAs use standard function approximation methods: custom range-reduction and tables of polynomials. However, it appears to be relatively expensive to get high-resolution variates using these methods (the existing work only looks at 16 fractional bits), which is an important feature both to ensure that tie-breaking between event times is fair, and to ensure that there is sufficient resolution left after taking the minimum of large numbers of event times.

Personally, my biggest complaint about existing methods is that they are just plain boring, and don't fit in with my overall goals.

### Decomposing the exponential distribution into independent Bernoulli variates

One of the interesting things about the exponential distribution is the memoryless property. What this means is that if the waiting time till some event follows the exponential distribution, the fact that you have waited some time and the event does not occur has no affect on the time you still have to wait. For example, assume that you are waiting for a radioactive particle to decay, and you know that the time till decay follows the exponential distribution with a mean of t seconds. However, if you start waiting, and the particle hasn't decayed after s seconds, the time till decay still follows the same exponential distribution. No matter how long you have already spent waiting for the decay, the amount of time you still have to wait will follow the same exponential distribution.

The memoryless property started me thinking about how to break the exponential distribution into independent components, as the distribution is clearly seperable in some way. The key is to consider the distribution as a fixed-point number:

x=bsbs-1...b1b0.b-1b-2...

where x is a fixed-point exponentially distributed random number, and bi represents the binary digit 2i.

Clearly the most significant bit bs is an independent Bernoulli variate (random bit), as this is true of any fixed-point random variable. However, if we consider the next bit bs-1, we find that the probability of it being one is the same whether bs is 0 or 1. In other words, the second bit is also an independent Bernoulli variate.

This feature continues recursively through the bits, with the surprising result that all the bits of a fixed-point exponential variate are completely independent. The probability of each fixed-point bit being 1 is:

pi = Pr[ bi = 1 ] = 1/(1+exp(2i))

### Generating exponential numbers in FPGAs

There are two ways of generating variates using this method. The first is the direct method: generate each Bernoulli variate bi by generating a uniform random number, and comparing it with pi. Note that we have to choose some fixed accuracy (bit-width) m with which to perform this comparison, as in general pi cannot be represented exactly in binary. The value of m should be chosen based on the needs of the application, with the simplest interpretation being that m bits should be sufficient for an application using at least 2m exponential numbers (and probably a lot more).

It may seem that this method is extremely wasteful: we have linear resource growth in both w (the width of the fixed-point output) and m (the width of the Bernoulli comparisons), so the overall growth in resources can be seen as growing roughly quadratically with quality. However, the scaling constant before this quadratic is very low, as random uniform bits can be generated at a cost of 1 LE per bit, and an m-bit constant comparison can be executed in approximately ceil(m/k) LEs (see the paper for more details). This makes the resource cost much lower than might be expected, particularly as no DSPs or RAMs are used.

The key features of the direct (LUT) method are then:

• Total resource (LE) cost of ~w (m + ceil(m/k)).
• Maximum path length of FF-LUT-FF.
• Intrinsic period of at least w m (or even more if a LUT-FIFO generator is used).
Using an extremely high-quality parameterisation, of m=w=36, these features lead to the following characteristics (in Virtex-4):
• Total resource usage of 1422 LEs (881 slices).
• Maximum clock rate of 607MHz
• Empirical statistical quality that has passed all tests applied so far.

The second generation method is to generate groups of bits, rather than individual bits. If we consider a span of r bits from the fixed-point exponential variate, then the bits together form a 2r valued discrete distribution. Luckily, generating discrete distributions in hardware is fast and efficient, thanks to Walker's alias method (of which I am much enamoured: it crops up in a number of my papers). This lets us generate r bits at precision m using an m-bit RAM with 2^r entries and a little bit of logic.

The overall effect of the second method is to trade off the use of a number of block RAMs for use of LEs. The performance is also reduced, as block-RAMs are fundamentally slower than LEs (and in my current implementation I'm too lazy to pipeline the comparators much). Which method is preferable depends mainly on which resources are most precious, which is dictated both by the application and the resource balance in the target device.

To give an idea of the resource tradeoff, the resources for the previously mentioned generator with m=w=36 changes as follows:

 FFs LUTs Slices RAMs MHz Bit-wise 1422 1350 881 0 607 Table 413 370 397 4 375

### Notes

 The LE count is lower than given than the formula because we can optimise out certain bits when the LSBs of the truncated probability are zero.

 Empirical quality: We haven't published these results yet, but we have done a number Chi2 tests, using equal probability buckets (including adjustments for effects of fixed-point boundaries on bucket expected probabilities). We used bucket counts of 28, 210 and 212, and for each one tested for sample counts from 210 up to 240, at 60 (logarithmically spaced) points. The high-quality version passed everything, with nothing look remotely suspicious on the p-value front. And yes, testing up to 240 (1012), takes one hell of a long time.

 Synthesis methods : It's worth noting that the table-based method has not been optimised nearly as much as the bit-wise one. The rubbish slice-packing for the table generator is just because that used Handel-C for synthesis (which does no packing), while the bit-wise one used XST. This also affects the MHz etc. The reason we used Handel-C for synthesis is that it used less FFs and LUTs than using xst, and we are trying to demonstrate the area tradeoff more than anything (375MHz is plenty fast enough for most applications).

Prev: Uniform generators, Up: RNGs for FPGAs, Prev: Gaussian generators,