MAERI: An Open Source Framework for Generating Modular DNN Accelerators supporting Flexible Dataflow 

Tutorial at ISCA 2018.

Date: June 3, 2018, 8:30 AM to 12:00 noon


Tushar Krishna
Assistant Professor
Georgia Tech

Michael Pellauer
Sr. Research Scientist
NVIDIA Research

Hyoukjun Kwon
PhD student
Georgia Tech

More details about the MAERI Project can be found here.

The right microarchitecture of a DNN ASIC accelerator is an area of active research. There are a few key challenges that computer architects face:

  • DNN topologies are evolving at a rapid rate, and it is common to have convolution, recurrent, pooling, and fully-connected layers with varying input and filter sizes in the most recent topologies.
  • DNNs may be dense or sparse, with a variety of encoding schemes.
  • Owing to the fixed number of PEs on-chip, DNNs can be partitioned in myriad ways (within and across layers) to exploit data reuse (weights and intermediate outputs) and then mapped over the PEs. Different partitioning and mapping strategies lead to different energy trade-offs, depending on how much data reuse they achieve at each level of the memory hierarchy.

Together, these factors lead to myriad dataflows within the accelerator substrate, making dataflow optimization a first-class component of the microarchitecture. Unfortunately, most DNN accelerators today support only fixed dataflow patterns internally, because they rely on a careful co-design of the PEs and the network-on-chip (NoC). The majority of them are optimized only for traffic within a convolutional layer. This makes it challenging to map arbitrary dataflows onto the fabric efficiently and can lead to underutilization of the available compute resources. As a result, each new dataflow optimization has resulted in a new accelerator proposal tailored to that one optimization.
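To make the underutilization point concrete, here is a minimal sketch of how the choice of partitioning affects PE utilization on a fixed-size array. It is a deliberately simplified model and not part of MAERI or MAESTRO: the function name `pe_utilization`, the lock-step execution assumption, and the layer dimensions (96 filters, 55 x 55 output pixels, 64 PEs) are all hypothetical choices made for illustration.

```python
import math


def pe_utilization(parallel_work: int, num_pes: int) -> float:
    """Fraction of PE slots doing useful work (hypothetical, simplified model).

    Assumes `parallel_work` independent work items are spread over `num_pes`
    PEs in lock-step rounds, so the last round may leave some PEs idle.
    """
    if parallel_work <= 0 or num_pes <= 0:
        raise ValueError("parallel_work and num_pes must be positive")
    rounds = math.ceil(parallel_work / num_pes)
    return parallel_work / (rounds * num_pes)


# Hypothetical layer and array sizes, chosen only for illustration.
NUM_PES = 64
NUM_FILTERS = 96
OUTPUT_PIXELS = 55 * 55

# Partitioning 1: one filter per PE (parallelize across filters).
by_filter = pe_utilization(NUM_FILTERS, NUM_PES)
# Partitioning 2: one output pixel per PE (parallelize across outputs).
by_pixel = pe_utilization(OUTPUT_PIXELS, NUM_PES)

print(f"partition by filter: {by_filter:.3f}")
print(f"partition by pixel:  {by_pixel:.3f}")
```

Under these assumptions, the same layer on the same array yields different utilization depending only on which dimension is partitioned across the PEs. A fabric that supports just one of these patterns cannot pick the better mapping for each layer.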

The research community today lacks a simulation infrastructure to evaluate DNN dataflows and architectures systematically and reason about performance, power, and area implications of various design choices.

We recently proposed MAERI (Multiply-Accumulate Engine with Reconfigurable Interconnect), a modular design methodology for building DNN accelerators. MAERI makes a case for assembling accelerators from a suite of plug-and-play building blocks rather than as a monolithic, tightly coupled entity. These building blocks can be tuned at runtime using MAERI’s novel tree-based configurable interconnection fabrics to enable efficient mapping of myriad dataflows.

In this tutorial, we will present the MAERI platform. MAERI serves three key roles:

  • Rapid Design-Space Exploration: MAERI can simulate different dataflows by allowing users to vary loop ordering, loop unrolling, spatial tiling, and temporal tiling, and to study the effects on overall runtime and energy on a spatial DNN accelerator with a user-specified number of PEs and buffer sizes.
  • DNN Accelerator RTL Generation: MAERI can generate DNN accelerators with configurable interconnects, fine-grained computing building blocks, and tiny distributed buffers, enabling efficient mapping of myriad dataflows. It outputs the RTL for the accelerator, which can then be sent through an ASIC or FPGA flow for latency/power/area estimates.
  • End-to-End DNN Simulation: MAERI supports mapping of convolutional, LSTM, pooling, and fully-connected layers, allowing an end-to-end run of modern DNNs.
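The dataflow knobs named in the first item (loop ordering, spatial tiling, temporal tiling) can be illustrated with a small loop nest. The sketch below is plain Python for a 1D convolution. It does not use MAERI or MAESTRO syntax, and the function names and the `num_pes` parameter are hypothetical. The inner `pe` loop stands in for work that would run in parallel on hardware.

```python
def conv1d_reference(inputs, weights):
    """Direct 1D convolution (no padding, stride 1), used as ground truth."""
    out_len = len(inputs) - len(weights) + 1
    return [
        sum(inputs[x + r] * weights[r] for r in range(len(weights)))
        for x in range(out_len)
    ]


def conv1d_tiled(inputs, weights, num_pes):
    """Same convolution with the output loop split into tiles.

    Temporal tiling: the outer loop processes one tile of outputs per step.
    Spatial tiling:  within a tile, each (emulated) PE owns one output.
    Loop order:      the weight loop sits outside the PE loop, so one weight
                     value is shared by all PEs before moving to the next.
    """
    out_len = len(inputs) - len(weights) + 1
    outputs = [0] * out_len
    for tile_start in range(0, out_len, num_pes):  # temporal tiles
        for r in range(len(weights)):              # weight reused across PEs
            w = weights[r]
            for pe in range(num_pes):              # spatial tile (parallel in HW)
                x = tile_start + pe
                if x < out_len:                    # PE idles in a partial tile
                    outputs[x] += inputs[x + r] * w
    return outputs


# Small example with hypothetical data: every output equals in[x] - in[x + 2].
sample_inputs = [1, 2, 3, 4, 5, 6, 7]
sample_weights = [1, 0, -1]
tiled_result = conv1d_tiled(sample_inputs, sample_weights, num_pes=2)
print(tiled_result)
```

Changing `num_pes` or swapping the order of the `r` and `pe` loops leaves the result unchanged but alters which operand is reused across consecutive operations. That is the kind of trade-off a design-space exploration tool evaluates for runtime and energy.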

Tutorial Schedule

8:30 – 9:10 AM Introduction and background on DNN accelerators
9:15 – 9:40 AM MAESTRO: A performance and cost model for DNN dataflows [paper]
– Dataflow taxonomy based on tile, temporal/spatial map, and merge/unroll pragmas
– DSL for describing dataflows
9:40 – 10:30 AM MAESTRO hands-on exercises
– Impact of Dataflow
    – Describe and evaluate state-of-the-art accelerator dataflows (Eyeriss, NVDLA, ShiDianNao, and more)
– Impact of DNN Topology
    – Evaluate different layers of VGGNet
– Impact of Microarchitecture
    – Vary number of PEs, buffer sizes, and interconnect bandwidth
10:30 – 11:15 AM MAERI: An Open Source RTL for Flexible DNN Accelerators [paper]
– MAERI Building Blocks
    – Multiplier Switches, Adder Switches, Simple Switches, Distribution Network, Collection Network, Activation Units, Prefetch Buffer
– Full Microarchitecture
– RTL Modules and Code Organization
11:15 – 11:45 AM MAERI hands-on exercises
– Configure MAERI and generate RTL
– Mapping a DNN over MAERI
– Running performance evaluations
– ASIC and FPGA synthesis flow for area and power
Wrap-up and future extensions

Target Audience:

The tutorial targets students, faculty, and researchers who want to

  • architect novel DNN accelerators, or
  • study performance implications of dataflow mapping strategies, or
  • plug a DNN accelerator RTL into their system

Prerequisite Knowledge: A basic understanding of DNNs and of RTL.
