MAERI Tutorial @ HPCA 2019

Enabling Rapid Design Space Exploration and Prototyping of DNN Accelerators


Previous Tutorial: ISCA 2018 

Date: February 16, 2019



Tushar Krishna is an Assistant Professor in the School of Electrical and Computer Engineering at Georgia Tech. He received a PhD in EECS from MIT in 2014 and worked at Intel from 2014 to 2015. Tushar is a leading expert in Networks-on-Chip (NoC): he has taped out multiple NoC chips and is a co-author of “On-Chip Networks, 2nd Edition,” part of the Synthesis Lectures on Computer Architecture series from Morgan & Claypool. He co-developed the Eyeriss Deep Learning ASIC that was presented at ISSCC 2016. His research group maintains the Garnet NoC model (part of the gem5 simulator) and the OpenSMART NoC RTL generator.
Michael Pellauer is a Sr. Research Scientist at NVIDIA. He received a PhD in EECS from MIT in 2010. Prior to joining NVIDIA, he was a Research Engineer at Intel (VSSAD group) from 2010 to 2015. Dr. Pellauer’s research focuses on spatial accelerators and dataflows for Deep Learning.


Hyoukjun Kwon is a PhD student in the College of Computing at the Georgia Institute of Technology, advised by Prof. Tushar Krishna. His research interests are in communication-aware Deep Learning hardware-software co-design. He interned in the computer architecture group at NVIDIA Research in 2017 and 2018, architecting the NoC for Deep Learning accelerators.


Prasanth Chatarasi is a senior PhD student advised by Prof. Vivek Sarkar and Dr. Jun Shirako in the School of Computer Science at the Georgia Institute of Technology, Atlanta, US. His research focuses on 1) domain-specific compiler optimizations for data analytics (graph processing, image processing, and deep neural nets) on modern and emerging architectures, 2) debugging and optimization of explicitly-parallel programs using polyhedral compilation techniques, and 3) systematic integration of loop optimizations with storage optimizations in compiler frameworks. He has interned with the Xilinx Versal Compiler and Programming Models team in San Jose, and at INRIA Paris with Albert Cohen.


Zhongyuan Zhao is a PhD student at Shanghai Jiao Tong University and was a visiting student at Georgia Tech from Aug 2017 to Aug 2018, hosted by Prof. Tushar Krishna. His research interests are in compiler and architecture design for coarse-grained reconfigurable computing platforms targeting compute- and data-intensive applications such as machine learning.

Deep learning techniques have pervaded vision and speech applications due to the high degree of accuracy they provide. To address the growing performance and energy-efficiency demands of deep neural networks (DNNs), both industry and academia are actively developing specialized hardware accelerator ASICs. Examples include Google’s TPU, Arm’s Project Trillium, Apple’s Neural Engine, MIT’s Eyeriss, and so on. Most of these designs are built using a spatial array of processing elements (PEs).

The right microarchitecture of a spatial DNN accelerator is an area of active research. There are two key challenges that computer architects face:

  • Owing to the fixed number of PEs on-chip, DNNs can be partitioned in myriad ways (within and across layers) to exploit data reuse (weights and intermediate outputs) and mapped over the PEs; each such partitioning-and-mapping strategy is known as a dataflow. Moreover, different dataflows also arise from different layer types (convolution, recurrent, pooling, fully-connected) and input/filter dimensions. Each dataflow leads to different performance and energy trade-offs due to the amount of data reuse at various levels of the memory hierarchy. Thus, dataflow optimization is a first-order requirement for DNN accelerator design. Unfortunately, the design space of possible dataflows in modern DNNs is too large to be explored by hand or via cycle-accurate simulations.
  • Most DNN accelerators today support fixed dataflow patterns internally, as they perform a careful co-design of the PEs and the network-on-chip (NoC). In fact, the majority of them are optimized only for traffic within a convolutional layer. This makes it challenging to map arbitrary dataflows onto the fabric efficiently, and can lead to underutilization of the available compute resources. Moreover, estimating the hardware cost of supporting a dataflow requires the ability to quickly generate RTL and perform area and power estimates.
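To make the effect of dataflow choice concrete, the toy sketch below counts how often each weight of a small 1-D convolution must be fetched from an outer buffer under two loop orderings. All dimensions, names, and the reuse metric here are illustrative inventions, not part of MAESTRO or MAERI:

```python
# Toy illustration: the same convolution, two loop orders, different reuse.
# A conv layer is a loop nest; reordering (and tiling) the loops changes how
# many times each operand must be re-fetched from the outer buffer.

K, C, X, R = 4, 3, 8, 3        # output channels, input channels, input width, filter width
Xo = X - R + 1                 # output width (valid convolution)

def weight_fetches(order):
    """Count outer-buffer fetch events for weights W[k][c][r], assuming a PE
    holds one weight at a time: loops nested *inside* all weight indices
    (k, c, r) reuse the resident weight; anything else forces a re-fetch."""
    inner = order[max(order.index(d) for d in "kcr") + 1:]
    reuse = 1
    for d in inner:
        reuse *= {"k": K, "c": C, "r": R, "x": Xo}[d]
    total_macs = K * C * R * Xo
    return total_macs // reuse

# "Weight-stationary" flavor: output loop x innermost, so each weight is
# fetched once and reused across all Xo output positions.
print(weight_fetches("kcrx"))  # → 36  (= K*C*R)
# Weight indices innermost: every weight is re-fetched per output position.
print(weight_fetches("xkcr"))  # → 216 (= K*C*R*Xo)
```

The two orderings perform identical arithmetic yet differ by a factor of Xo in weight traffic, which is exactly the kind of trade-off that makes dataflow optimization a first-order design concern.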

The research community today lacks a simulation infrastructure to systematically evaluate DNN dataflows and architectures and to reason about the performance, power, and area implications of various design choices.

In this tutorial, we will present two tools that enable rapid design-space exploration of DNN accelerators and address the challenges listed above.

  • MAESTRO [arXiv paper] [website] is an analytical cost model for modeling and analyzing different convolutional dataflows. Using a simple DSL, it enables users to simulate different dataflows by varying loop ordering, loop unrolling, spatial tiling, and temporal tiling, and to study the effects on overall runtime and energy on a spatial DNN accelerator with a user-specified number of PEs and buffer sizes. MAESTRO can be used at design time, to provide quick first-order metrics when hardware resources (buffers and interconnects) are being allocated on-chip, and at compile time, when different layers need to be optimally mapped for high utilization and energy-efficiency.
  • MAERI [ASPLOS 2018 paper, IEEE Micro paper] [website] is a parameterizable DNN accelerator generator that builds accelerators from a suite of plug-and-play building blocks rather than as a monolithic, tightly-coupled entity. It outputs the RTL for the accelerator, which can then be sent through an ASIC or FPGA flow for latency/power/area estimates. MAERI supports mapping of convolutional, LSTM, pooling, and fully-connected layers, allowing an end-to-end run of modern DNNs. MAERI uses configurable interconnects internally, enabling it to efficiently map any dataflow generated by MAESTRO and obtain actual area, power, and performance numbers.
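To give a flavor of what an analytical cost model produces, here is a minimal first-order runtime estimate for a convolutional layer mapped onto a spatial PE array. This is a hedged sketch in the spirit of such models; the function name, formula, and parameters are illustrative assumptions, not MAESTRO’s actual equations or API:

```python
# First-order compute-bound runtime estimate for a conv layer on a spatial
# PE array (illustrative sketch only -- not MAESTRO's real cost model).

def estimate_cycles(K, C, Y, X, R, S, num_pes, macs_per_pe_per_cycle=1):
    """Lower-bound cycles: total MACs divided by peak array MAC throughput.
    Assumes a valid (no-padding, stride-1) convolution and 100% PE utilization;
    a real model also accounts for NoC delays, buffer refills, and mapping
    inefficiencies, which only increase this number."""
    out_y, out_x = Y - R + 1, X - S + 1          # output feature-map size
    total_macs = K * C * out_y * out_x * R * S   # one MAC per (k,c,y,x,r,s)
    peak = num_pes * macs_per_pe_per_cycle
    return -(-total_macs // peak)                # ceiling division

# Example: a VGG-like 64x64-channel 3x3 layer on a 56x56 input, 256 PEs.
print(estimate_cycles(K=64, C=64, Y=56, X=56, R=3, S=3, num_pes=256))
```

Even such a crude bound is useful at design time for sizing the PE array before committing to RTL; the tools presented in this tutorial refine it with reuse- and interconnect-aware analysis.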

Tutorial Schedule


  • We will be distributing the MAESTRO and MAERI code bases in a VM.
  • Please install VirtualBox on your laptop before coming to the tutorial.
  • We will bring some pen-drives with the VM image to the tutorial.
Time | Agenda | Presenter | Resources
8:30 – 9:00 | Introduction to DNN Accelerators | Tushar | [Slides] [Video]
9:00 – 10:00 | A Primer on DNN Dataflows: Taxonomy; Performance/Power Trade-offs | Michael | [Slides] [Video]
10:00 – 10:30 | MAESTRO Data Directives: How to Formally Describe Dataflows in MAESTRO | Prasanth | [Slides] [Video]
10:30 – 10:50 | Coffee Break | |
10:50 – 11:10 | MAESTRO Data Directives (contd.): Examples | Prasanth | [Slides] [Video]
11:10 – 11:45 | MAESTRO Analytical Model: Parsing Dataflows; Estimating Data Reuse; Estimating Performance and Energy | Hyoukjun | [Slides] [Video]
11:45 – 12:30 | MAESTRO Hands-on Exercises: Dataflow Design-space Exploration; Hardware Design-space Exploration; HW-SW Co-Design | Prasanth + Hyoukjun | [Slides] [Video]
12:30 – 2:00 | Lunch | |
2:00 – 2:20 | MAERI Overview: How to Support Flexible Dataflows; MAERI Building Blocks | Tushar | [Slides] [Video]
2:20 – 3:00 | MAERI Mapper: TensorFlow -> MAERI Configuration; Efficient Mapping Search; Compiler Overview | Zhongyuan | [Slides] [Video]
3:00 – 3:20 | MAERI RTL: Detailed Microarchitecture; RTL Modules and Code Organization | Hyoukjun | [Slides] [Video]
3:20 – 3:40 | MAERI Demo: RTL Simulation; ASIC Synthesis and Place-and-Route Flow; FPGA Synthesis Flow | Hyoukjun | [Slides] [Video]
3:40 – 4:00 | Coffee Break | |
4:00 – 4:30 | Hands-on Exercises: Assembling a DNN Accelerator Using MAERI Building Blocks; Mapping a DNN over MAERI; Running Performance Evaluations; Running Power and Area Synthesis | Hyoukjun | VM distributed via pen-drives
4:30 – 5:00 | Extensions: Configurable Memory Hierarchy | Michael | [Slides] [Video]
5:00 – 5:10 | Wrap-Up | Tushar | [Slides] [Video]

Target Audience:

The tutorial targets students, faculty, and researchers who want to

  • understand how to design DNN accelerators, or
  • study the performance implications of dataflow mapping strategies, or
  • prototype a DNN accelerator (on an ASIC or FPGA)

Prerequisite Knowledge: Basic familiarity with DNNs and RTL.



The whole is greater than the sum of its parts