fpgaConvNet

A framework for mapping Convolutional Neural Networks on FPGAs

Convolutional Neural Networks (ConvNets/CNNs) are a powerful Deep Learning model that has demonstrated state-of-the-art accuracy in numerous AI tasks, from ConvNet-based object detection to neural image captioning. In this context, FPGAs constitute a promising platform for the deployment of ConvNets, able to satisfy the demanding performance needs and power constraints posed by emerging ConvNet applications. Nevertheless, the effective mapping of ConvNets on FPGAs requires Deep Learning practitioners to have expertise in hardware design and familiarity with FPGA development toolchains, and therefore poses a significant barrier. fpgaConvNet is a framework that automates the mapping of ConvNets onto reconfigurable FPGA-based platforms. Starting from a high-level description of a ConvNet model, fpgaConvNet considers both the input ConvNet workload and the application-level performance needs, including throughput, latency and multiobjective criteria, in order to generate optimised streaming accelerators tailored to the target FPGA. fpgaConvNet is being developed by the Intelligent Digital Systems Lab (iDSL) at Imperial College London.


CNN-to-FPGA Toolflow

End-to-End Framework

The Deep Learning developer provides as inputs a trained ConvNet model together with the resources of the target FPGA. The ConvNet workload is captured internally in a Synchronous Dataflow-based intermediate representation. Design space exploration is formulated as mathematical optimisation and performed by means of a proprietary optimiser. 
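As a rough illustration of the idea, a Synchronous Dataflow representation can be sketched as a list of streaming nodes with token rates, where pipeline throughput is bounded by the slowest stage. The names, rates and structure below are hypothetical and do not reflect fpgaConvNet's actual internal API.

```python
# Hypothetical sketch of an SDF-style intermediate representation for a
# streamed ConvNet pipeline; names and numbers are illustrative only.
from dataclasses import dataclass

@dataclass
class SDFNode:
    name: str
    rate_in: int    # tokens (e.g. pixels) consumed per firing
    rate_out: int   # tokens produced per firing
    cycles: int     # cycles per firing for the chosen hardware folding

def pipeline_initiation_interval(nodes):
    """In a streaming pipeline, throughput is set by the slowest stage."""
    return max(n.cycles / n.rate_out for n in nodes)

graph = [
    SDFNode("conv1", rate_in=1, rate_out=1, cycles=4),
    SDFNode("pool1", rate_in=4, rate_out=1, cycles=2),
    SDFNode("conv2", rate_in=1, rate_out=1, cycles=8),
]
print(pipeline_initiation_interval(graph))  # slowest stage dominates: 8.0
```

Casting design space exploration over such a model then amounts to searching over the per-node parameters (e.g. the folding that sets `cycles`) subject to the target FPGA's resource budget.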

Automated Synthesizable Code Generation

The hardware description of the selected design is automatically generated by means of code generation. The currently supported devices include FPGAs from Xilinx.

Efficient multiobjective design space exploration: By casting the design space exploration task as a mathematical optimisation problem, fpgaConvNet's optimiser searches the design space while optimising the application-level performance objective of interest. The video below shows the latency-driven design space exploration, with a batch size of 1, for VGG16 targeting the ZC706 board. Each grey point represents an explored design point corresponding to a hardware mapping with different parameters; grey points turn blue as the optimiser visits them. The red points form the Pareto front of the explored design space and reflect the trade-off between performance and resource consumption. After the design space has been traversed, the optimiser selects the hardware design that optimises latency and fpgaConvNet generates the corresponding hardware architecture.
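The Pareto front described above can be extracted from a set of explored design points with a simple dominance check. The sketch below is illustrative, not fpgaConvNet's optimiser; each point is a hypothetical (latency, resource-usage) pair where lower is better on both axes.

```python
# Illustrative Pareto-front extraction over explored design points,
# mirroring the red/grey points described above. Each point is
# (latency, resource_usage); lower is better for both objectives.
def pareto_front(points):
    front = []
    for p in points:
        # p is Pareto-optimal if no other point is at least as good
        # on both objectives.
        if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points):
            front.append(p)
    return sorted(front)

explored = [(10.0, 0.9), (12.0, 0.5), (15.0, 0.3), (11.0, 0.8), (16.0, 0.6)]
print(pareto_front(explored))
# → [(10.0, 0.9), (11.0, 0.8), (12.0, 0.5), (15.0, 0.3)]
```

The point (16.0, 0.6) is dropped because (12.0, 0.5) is both faster and cheaper; the remaining points trace the performance-resource trade-off curve.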

  1. A Synchronous Dataflow Model for ConvNets  - fpgaConvNet employs a Synchronous Dataflow (SDF) model in order to represent both ConvNet workloads and hardware architectures. Under this scheme, the design space is explored by means of a set of algebraic transformations that modify the performance-resource characteristics of the hardware design.
  2. Throughput-Driven Design - Dedicated design flow for generating high-throughput hardware designs that meet the performance requirements of high-throughput applications, such as large-scale visual search and image captioning. fpgaConvNet exploits batch processing and the reconfigurability of FPGAs to generate an optimised high-throughput architecture for the ConvNet on the target platform.
  3. Latency-Driven Design - Dedicated design flow tailored for latency-sensitive applications. A flexible architecture is generated that meets the low-latency requirements of latency-critical applications, such as autonomous cars and UAVs.
  4. Support for Irregular Networks - fpgaConvNet offers support for a wide range of networks, including both conventional ConvNets with regular layer connectivity as well as compound modules, such as Inception modules, residual blocks and dense blocks.
  5. Support for Large Networks - fpgaConvNet makes no assumptions about the size of ConvNets and supports the mapping of deep and wide networks independently of the target FPGA resources. This is achieved by supporting (i) bitstream-level reconfiguration, which allows the mapping of ConvNets of large depth, and (ii) per-layer weights reloading, which allows ConvNets to have wide convolutional layers without being constrained by the available on-chip memory. Both the reconfiguration and the weights reloading employed by the generated hardware architecture are parametrised and optimised by fpgaConvNet for the target ConvNet-FPGA pair.
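To make point 5 concrete, splitting a deep network into partitions that each fit the on-chip memory (with reconfiguration or weights reloading between partitions) can be sketched as a greedy bin-packing pass. The function, the memory figures and the 2 MB budget below are all hypothetical, not fpgaConvNet's actual partitioner.

```python
# Hypothetical sketch: split a deep ConvNet into partitions whose weight
# footprints each fit the on-chip memory budget; the FPGA is reconfigured
# (or weights are reloaded) between consecutive partitions.
def partition_layers(layer_mem_kb, on_chip_kb):
    partitions, current, used = [], [], 0
    for i, mem in enumerate(layer_mem_kb):
        if used + mem > on_chip_kb and current:
            partitions.append(current)   # close the full partition
            current, used = [], 0
        current.append(i)
        used += mem
    if current:
        partitions.append(current)
    return partitions

# Illustrative per-layer weight footprints (KB) against a 2 MB budget
print(partition_layers([512, 900, 700, 1500, 400, 300], 2048))
# → [[0, 1], [2], [3, 4], [5]]
```

In practice the choice of partition boundaries is itself part of the optimisation, since each extra partition adds reconfiguration or reloading overhead to the total execution time.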

Benchmarks

| Model Name | Task | Workload (GOps) | Target Platform | Throughput (favourable batch) | Latency (batch size of 1) |
|---|---|---|---|---|---|
| CFF | Face Recognition | 0.00037 | Zynq ZC706 SoC @ 125 MHz | 159.22 GOp/s, 430.32×10^3 fps | 159.22 GOp/s, 430.32×10^3 fps |
| LeNet-5 | Hand-Written Digit Recognition | 0.0038 | Zynq ZC706 SoC @ 125 MHz | 185.81 GOp/s, 48.89×10^3 fps | 185.81 GOp/s, 48.89×10^3 fps |
| MPCNN | Hand Gesture Recognition | 0.0029 | Zynq ZC706 SoC @ 125 MHz | 100.23 GOp/s, 34.56×10^3 fps | 100.23 GOp/s, 34.56×10^3 fps |
| CIFAR-10 | Object Recognition | 0.0248 | Zynq ZC706 SoC @ 125 MHz | 166.16 GOp/s, 6.70×10^3 fps | 150.91 GOp/s, 6.08×10^3 fps |
| Sign Recognition CNN | Traffic Sign Recognition | 4.0284 | Zynq ZC706 SoC @ 125 MHz | 116.24 GOp/s, 28.85 fps | 116.24 GOp/s, 28.85 fps |
| Scene Labelling CNN | Scene Labelling | 7.6559 | Zynq ZC706 SoC @ 125 MHz | 203.53 GOp/s, 19.09 fps | 203.53 GOp/s, 26.58 fps |
| AlexNet | Object Recognition | 1.3315 | Zynq ZC706 SoC @ 125 MHz | 197.40 GOp/s, 148.25 fps | 161.82 GOp/s, 121.53 fps |
| VGG16 | Object Recognition | 30.72 | Zynq ZC706 SoC @ 125 MHz | 155.82 GOp/s, 5.07 fps | 123.12 GOp/s, 4.01 fps |
| GoogLeNet | Object Recognition | 3.14 | Zynq ZC706 SoC @ 125 MHz | 180.97 GOp/s, 57.63 fps | 141.70 GOp/s, 45.12 fps |
| ResNet-152 | Object Recognition | 23.02 | Zynq ZC706 SoC @ 125 MHz | 188.18 GOp/s, 8.17 fps | 149.79 GOp/s, 6.50 fps |
| DenseNet-161 | Object Recognition | 13.76 | Zynq ZC706 SoC @ 125 MHz | 155.57 GOp/s, 11.30 fps | 149.80 GOp/s, 10.88 fps |

Experimental Methodology

The measured execution time includes the ConvNet execution by the hardware together with the I/O communication with the off-chip memory. The I/O comprises the streaming of the batch of input images into the hardware design generated by fpgaConvNet as well as the write-back of the output feature maps, and is performed using the Video DMA IP by Xilinx. The reported performance uses a batch size of 254 inputs, which is the burst limit of the Video DMA IP. A larger batch size would allow a higher throughput by utilising the fpgaConvNet hardware more fully; the batch size of 254 is therefore a limitation of the current platform. We measure fpgaConvNet's performance by means of both software timers and hardware counters. Throughput is calculated as the GOp/input of the benchmark ConvNet multiplied by the batch size, divided by the execution time.
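The throughput formula above can be checked against the benchmark table. Using the VGG16 row (30.72 GOp/input at 5.07 fps) as a worked example, and reconstructing the execution time from the frame rate, the formula reproduces the reported throughput up to rounding of the fps figure:

```python
# Worked example of the reported throughput calculation:
# throughput [GOp/s] = GOp/input * batch_size / execution_time [s]
def throughput_gops(gop_per_input, batch_size, exec_time_s):
    return gop_per_input * batch_size / exec_time_s

batch = 254                      # burst limit of the Video DMA IP
fps = 5.07                       # VGG16 at favourable batch (table above)
exec_time = batch / fps          # execution time implied by the frame rate
print(round(throughput_gops(30.72, batch, exec_time), 2))  # ≈ 155.75 GOp/s
```

The small gap to the tabulated 155.82 GOp/s comes from the fps value being rounded to two decimals in the table.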

Publications

Conference Proceedings

Stylianos I. Venieris and Christos Savvas Bouganis

fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs, FCCM 2016
(link, paper, slides, bibtex)

This paper presents the Synchronous Dataflow (SDF) modelling core of fpgaConvNet and the automated design methodology for the generation of high-throughput hardware mappings. High-throughput applications allow the use of batch processing and offer opportunities for optimisations. To this end, fpgaConvNet uses a set of transformations over the SDF model, including FPGA reconfiguration, coarse-grained folding and fine-grained folding, in order to traverse the large architectural design space and yield an optimised, high-throughput design for the target FPGA platform.
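The folding transformations mentioned above trade resources for cycles. As a rough sketch (not fpgaConvNet's actual model; the numbers and the one-DSP-per-unit assumption are illustrative), coarse-grained folding by a factor f time-multiplexes a layer's filters over fewer physical units:

```python
# Illustrative coarse-grained folding trade-off: folding a convolutional
# layer by a factor f serialises its num_filters filters over
# num_filters // f physical units, cutting resource usage while
# multiplying the cycles per input by f.
def fold(num_filters, macs_per_filter, factor, dsps_per_unit=1):
    units = num_filters // factor          # physical compute units kept
    cycles = macs_per_filter * factor      # firings are time-multiplexed
    return {"dsps": units * dsps_per_unit, "cycles": cycles}

# Hypothetical 64-filter layer with 3x3 kernels (9 MACs per filter per pixel)
for f in (1, 2, 4):
    print(f, fold(num_filters=64, macs_per_filter=9, factor=f))
```

Sweeping such factors per layer, together with reconfiguration points, is what generates the large design space the optimiser traverses.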

Stylianos I. Venieris and Christos Savvas Bouganis

fpgaConvNet: Automated Mapping of Convolutional Neural Networks on FPGAs, FPGA 2017
(link, poster, bibtex)

This work presents a graphical illustration of the fpgaConvNet flow. fpgaConvNet has been extended to target both high-throughput and low-latency designs, with two different modes of operation. Moreover, performance results on larger CNNs are presented including AlexNet and VGG16.

Stylianos I. Venieris and Christos Savvas Bouganis

Latency-Driven Design for FPGA-based Convolutional Neural Networks, FPL 2017
(link, paper, bibtex)

Emerging AI systems such as self-driving cars and UAVs require the very low-latency execution of several CNN-based tasks without the processing of inputs in batches. To meet these requirements, we place latency at the centre of optimisation and generate latency-optimised hardware designs for the target CNN-FPGA pairs. This is achieved by introducing a novel SDF transformation, named weights reloading, which enables the high-performance execution of CNNs without the need for batch processing. With this approach, fpgaConvNet expands the architectural design space and is able to meet the performance needs of both latency-sensitive and high-throughput applications.

Stylianos I. Venieris and Christos Savvas Bouganis

fpgaConvNet: A Toolflow for Mapping Diverse Convolutional Neural Networks on Embedded FPGAs, NIPS 2017 Workshop on Machine Learning on the Phone and other Consumer Devices

With the emergence of novel AI applications and IoT devices, on-device machine learning is becoming a necessity. Embedded and mobile systems place low-latency processing, together with low power and privacy requirements, at the forefront. In this context, high-performance, low-power FPGAs constitute a promising alternative to the conventionally used platforms. This paper presents fpgaConvNet's performance in settings with stringent absolute power consumption and performance-per-Watt constraints. Comparisons with highly optimised embedded GPU implementations on mainstream CNNs demonstrate the performance efficiency gains of the proposed framework.

Stylianos I. Venieris and Christos Savvas Bouganis

fpgaConvNet: Mapping Regular and Irregular Convolutional Neural Networks on FPGAs, IEEE Transactions on Neural Networks and Learning Systems 2018

Since the renaissance of neural networks, Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art performance in several emerging AI tasks. The deployment of CNNs in real-life applications requires power-efficient hardware designs that meet the application-level performance needs. Two major challenges in the mapping of CNNs on FPGAs are: 1) the variability with respect to performance requirements across CNN applications, spanning from high-throughput to low-latency settings; and 2) the mapping of state-of-the-art CNNs with irregular dataflow due to the introduction of novel building blocks, such as Inception modules (Inception networks), residual blocks (ResNets) and dense blocks (DenseNets). This work expands fpgaConvNet to address these challenges by introducing 1) a multiobjective optimisation framework for tailoring the generated accelerator to both the throughput and latency requirements of the target application and 2) novel customisable hardware building blocks for mapping state-of-the-art Inception, residual and dense networks. To the best of our knowledge, this is the first work to map DenseNet to custom hardware, while achieving 6.65x higher performance than optimised GPU designs and up to 2.94x higher performance density than state-of-the-art FPGA-based CNN architectures when evaluated across real-life CNNs.

Address

Circuits and Systems Group - Level 9
EEE Department
Imperial College London
Exhibition Road
London SW7 2BT, UK

Contact

Email: stylianos.venieris10@imperial.ac.uk
Links:
http://cas.ee.ic.ac.uk/people/sv1310
http://cas.ee.ic.ac.uk/people/ccb98