Credit Risk Modelling using Hardware Accelerated Monte-Carlo Simulation

2008, In Proceedings of FCCM
Abstract, PDF, Bibtex

I was getting kind of bored of discrete-time simulations, so this work looks at using discrete-event simulation to model events within a loan portfolio. Happily this coincided with the world's banks making collective idiots of themselves, so it was all rather topical. Anyway, it models a portfolio of loans, where each loan is assigned one of a number of risk classes; the risk class is supposed to capture the loan's probability of default. During the simulation loans may randomly default, or move between risk classes, with each type of event occurring at some rate. These rates also depend on the current economic conditions, which are modelled as another random process.

That was a terrible explanation; the paper does a bit better. It's just a big Monte-Carlo simulation of a continuous-time Markov chain, using exponentially distributed random numbers to determine the time of the next event. We mapped the simulation into hardware, developing three different simulation architectures, each designed to work well when one type of event is most common. This takes advantage of FPGA reconfigurability, as we can swap between the designs at run-time based on the characteristics of each set of input data. The overall speedup of a 233MHz xc4vsx55 over four parallel threads on a quad-core 2.4GHz Core2 varies between 60 and 100 times.
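
If that still doesn't make sense, here is roughly the shape of one replication in software: a minimal Python sketch, where the classes, rates, and two-state economy are all made-up numbers for illustration, not anything calibrated in the paper.

```python
import random

# Illustrative parameters only; not the rates calibrated in the paper.
N_CLASSES = 3                      # risk classes 0 (safest) .. 2 (riskiest)
DEFAULT_RATE = [0.01, 0.05, 0.20]  # per-year default intensity per class
MIGRATE_RATE = 0.5                 # per-year rate of moving between classes
ECON_SWITCH_RATE = 0.25            # rate at which the economy flips state
ECON_MULT = {"good": 0.5, "bad": 2.0}  # economy scales every loan's rates

def simulate_portfolio(n_loans, horizon, seed=42):
    """One replication: the time to the next event is an exponential
    sample whose rate is the sum of all currently active event rates."""
    rng = random.Random(seed)
    classes = [0] * n_loans        # every loan starts in the safest class
    alive = set(range(n_loans))
    economy, t, defaults = "good", 0.0, 0
    while alive and t < horizon:
        mult = ECON_MULT[economy]
        # Each live loan carries a default rate plus a migration rate.
        loan_rates = [(i, mult * (DEFAULT_RATE[classes[i]] + MIGRATE_RATE))
                      for i in alive]
        total = ECON_SWITCH_RATE + sum(r for _, r in loan_rates)
        t += rng.expovariate(total)    # exponential time to next event
        if t >= horizon:
            break
        # Pick which event fired, with probability proportional to its rate.
        u = rng.uniform(0.0, total)
        if u < ECON_SWITCH_RATE:
            economy = "bad" if economy == "good" else "good"
            continue
        u -= ECON_SWITCH_RATE
        for i, r in loan_rates:
            if u < r:
                if rng.uniform(0.0, r) < mult * DEFAULT_RATE[classes[i]]:
                    alive.remove(i)    # the loan defaults...
                    defaults += 1
                else:                  # ...or migrates to a neighbouring class
                    c = classes[i] + rng.choice((-1, 1))
                    classes[i] = min(max(c, 0), N_CLASSES - 1)
                break
            u -= r
    return defaults

print(simulate_portfolio(n_loans=100, horizon=5.0))
```

The three hardware architectures correspond, roughly, to specialising that event loop for whichever type of event fires most often.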

A Domain Specific Language for Reconfigurable Path-based Monte Carlo Simulations

2007, In Proceedings of FPT, Pages 97-104
Abstract, PDF, Bibtex

This paper is about Contessa, a domain-specific language for describing certain types of Monte-Carlo simulations, particularly those that consist of one or more time-series (e.g. equities, interest rates, exchange rates, and so on). Some of the more interesting features of the language are that it is pure-functional, and that it allows pretty much unbounded branching, iteration, and recursion through the use of continuations. However, the main point is that there is a direct automated route from the high-level description down to a pipelined hardware implementation.
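
Contessa's actual syntax isn't reproduced here, but the flavour of simulation it describes looks something like the following pure-functional Python sketch, where the knock-out product and all its parameters are invented for illustration, and the recursive call stands in for what Contessa expresses with continuations.

```python
import math, random

def gbm_step(s, dt, mu, sigma, rng):
    """Advance a geometric Brownian motion by one time step (exact scheme)."""
    z = rng.gauss(0.0, 1.0)
    return s * math.exp((mu - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z)

def path_payoff(s, t, rng, maturity=1.0, dt=1.0 / 252,
                barrier=70.0, strike=100.0):
    """Walk one price path as a pure function: at every step the simulation
    either terminates (knock-out, or maturity reached) or calls itself with
    the new state; the recursive call plays the role of the continuation."""
    if s <= barrier:                   # branch: barrier breached, knocked out
        return 0.0
    if t >= maturity:                  # branch: maturity reached, pay off
        return max(s - strike, 0.0)
    return path_payoff(gbm_step(s, dt, 0.05, 0.2, rng), t + dt, rng,
                       maturity, dt, barrier, strike)

rng = random.Random(1)
paths = 10_000
estimate = sum(path_payoff(100.0, 0.0, rng) for _ in range(paths)) / paths
print(f"down-and-out call estimate: {estimate:.3f}")
```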

Just to labour the point: you can literally take the high-level source code, with no information about parallelism, timing, resource mappings, or platform bindings, push a button, and have a 300MHz pipelined design pop out the other end. The huge amount of parallelism available in Monte-Carlo applications means that pipeline latencies and so on are hidden, so utilisation of the floating-point units is near 100%. At the moment there is no public release of the Contessa toolchain (as it is a huuuge mess), but I'm currently working on a newer, more general-purpose version (with stuff like dynamic memory, maybe) that I may make public in some form.

Sampling from the Multivariate Gaussian Distribution using Reconfigurable Hardware

2007, In Proceedings of FCCM, Pages 3-12
Abstract, PDF, Bibtex

Here we were trying to look beyond the usual univariate Monte-Carlo simulations, and to work out what FPGAs could actually be good at. It turns out that you can use every DSP48 (hard multiply-accumulate block) in the device at 50% utilisation... It also turns out that, if you try really hard, you can make the entire thing run at the maximum frequency of the DSPs in Virtex-4.

The central idea is to turn the entire FPGA into a giant dense matrix-vector multiplier, which takes a vector of independent Gaussian samples and imposes a correlation structure on them. By using all N DSPs in an FPGA we can produce correlated vectors of length N, generating one vector every N cycles. For the largest device we tested (the xc4vsx55), that means we could generate vectors of up to N=512 at 500MHz, i.e. roughly a million vectors per second.
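
In software terms the whole machine boils down to a one-off Cholesky factorisation plus a dense matrix-vector multiply per sample, which is where the DSPs spend their cycles. Here is a numpy sketch, with a made-up 4x4 correlation matrix standing in for the much larger ones the hardware handles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 4x4 correlation matrix (the hardware scales this to 512x512).
C = np.array([[1.0, 0.5, 0.3, 0.1],
              [0.5, 1.0, 0.4, 0.2],
              [0.3, 0.4, 1.0, 0.3],
              [0.1, 0.2, 0.3, 1.0]])

# Factor once, offline: C = A @ A.T with A lower-triangular.
A = np.linalg.cholesky(C)

# The per-vector work the DSPs do: a dense matrix-vector multiply that
# imposes the correlation structure on i.i.d. Gaussian samples.
z = rng.standard_normal(4)   # independent N(0,1) samples
x = A @ z                    # correlated sample: Cov[x] = C

# Sanity check: over many samples the empirical covariance approaches C.
Z = rng.standard_normal((100_000, 4))
print(np.round(np.cov(Z @ A.T, rowvar=False), 2))
```

Since the factorisation only needs doing once per correlation matrix, all the per-sample work really is just the multiply.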

The comparison with software is already a bit out of date (though there are bigger and faster FPGAs now too), but for a (fairly dumb) Value-at-Risk application we were getting a 33x speed-up over all four cores of a quad-Opteron machine. That's against AMD's fancy optimised BLAS, with single-precision SIMD, cache optimisations and so on, and we were not IO-bound on the FPGA card: basically, this is as close to real-world deployed performance as we can get. (Sorry, that's a pre-emptive strike at the inevitable moaners who complain we didn't spend three months optimising the software. No, BECAUSE AMD ALREADY DID THAT!)
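
For concreteness, a "fairly dumb" Monte-Carlo Value-at-Risk calculation of this kind has roughly the following shape (the portfolio weights and covariance are hypothetical, continuing the sketch above).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical portfolio: four assets, 2% vol each, made-up correlations.
weights = np.array([0.4, 0.3, 0.2, 0.1])
A = 0.02 * np.linalg.cholesky(np.array([[1.0, 0.5, 0.3, 0.1],
                                        [0.5, 1.0, 0.4, 0.2],
                                        [0.3, 0.4, 1.0, 0.3],
                                        [0.1, 0.2, 0.3, 1.0]]))

# Monte-Carlo VaR: simulate correlated returns, take a quantile of the loss.
Z = rng.standard_normal((1_000_000, 4))
losses = -(Z @ A.T) @ weights          # portfolio loss in each scenario
print(f"99% VaR: {np.quantile(losses, 0.99):.4%}")
```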