**FastTrack**: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu



#### Claim

FPGA overlay NoCs designed to exploit interconnect properties of the FPGA fabric can surpass existing state-of-the-art NoCs by:

- 2.5–2.8× throughput ↑
- 2.2× energy ↓
- at 2.5× LUT cost ↑

Xilinx Virtex-7 485T FPGA,  $8 \times 8$  system size, synthetic+real-world traffic.

#### Context



- FPGAs finding comfortable home in datacenters
  - Offloading compute intensive workloads to the FPGA
  - Energy-efficiency, fast coupling to networking
- Common Infrastructure: NoCs for apps + system IO






- ASIC clones transplanted onto FPGAs fare poorly! → expensive buffers, virtual channels, multi-ported switches
- Even contemporary FPGA routers are expensive and slow
- **FastTrack**: Deflection-routing + Bufferless + Torus




# Qualitative Comparison of FPGA NoC Routers

| Router      | Cost     |         |     |
|-------------|----------|---------|-----|
|             | Xbar+Arb | Buffers | VCs |
| OpenSMART   | ×        | ×       | ×   |
| BLESS       | ×        | ✓       | ✓   |
| CONNECT     | ×        | ×       | ×   |
| Split-Merge | ×        | ×       | ✓   |
| Hoplite     | ✓        | ✓       | ✓   |

#### Quick Tutorial on Hoplite



- Hoplite: A Deflection-Routed Directional Torus NoC for FPGAs, TRETS 2017
- Hoplite: Building Austere Overlay NoCs for FPGAs, FPL 2015

## Quick Tutorial on HopliteRT



HopliteRT: An Efficient FPGA NoC for Real-Time Applications, FPT 2017

# Qualitative Comparison of FPGA NoC Routers

| Router      | Cost     |         |     | Perf |         |
|-------------|----------|---------|-----|------|---------|
|             | Xbar+Arb | Buffers | VCs | Tput | Latency |
| OpenSMART   | ×        | ×       | ×   | ✓    | ✓       |
| BLESS       | ×        | ✓       | ✓   | ✓    | ×       |
| CONNECT     | ×        | ×       | ×   | ✓    | ✓       |
| Split-Merge | ×        | ×       | ✓   | ✓    | ✓       |
| Hoplite     | ✓        | ✓       | ✓   | ×    | ×       |

## Challenge

- Deflection routing → inefficient use of wiring resources
  - Deflected packets stay in the network longer → latency ↑
  - They steal bandwidth from other traffic → throughput ↓
- Can we improve NoC performance under deflection routing?
- Are there unique opportunities provided by the FPGA fabric?
  - Hoplite is cheap in LUT cost...
  - FastTrack → inspect the FPGA interconnect



#### Outline

Introduction and Motivation

FastTrack NoC Organization

FastTrack Router Operation

Evaluation




## **FPGA Wire Speeds**

[Figure: FPGA tile grid, X0Y0 through X5Y14, illustrating wire speeds; distances not to scale]

#### FastTrack NoC Organization



## Depopulated Topology Generation



### Parametric Topology generation

FPGA NoC parameterized by three terms:

- N: system size
- D: distance of express link
- R: depopulation parameter → controls how many routers are FastTrack vs. vanilla Hoplite
- Fully populated 4×4 NoC → FT(16,2,1)
- Half-populated 4×4 NoC → FT(16,2,2)
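
The FT(N, D, R) naming can be illustrated with a short Python sketch that enumerates the links of such a NoC. This is only a sketch under assumptions that the slide does not state: the NoC is a square unidirectional torus (as in Hoplite), a router drives an X express link when its column index is a multiple of R and a Y express link when its row index is a multiple of R, and R divides both D and the torus side. The function name `fasttrack_links` is hypothetical.

```python
from math import isqrt


def fasttrack_links(n, d, r):
    """Enumerate (src, dst, kind) links for an FT(n, d, r) sketch.

    Assumed placement rule (not taken from the slides): the router at
    (x, y) drives an X express link if x % r == 0 and a Y express link
    if y % r == 0. Every router keeps its short Hoplite links.
    """
    side = isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square, e.g. 16 for a 4x4 NoC")
    if not (1 <= d < side) or r < 1 or d % r != 0 or side % r != 0:
        raise ValueError("sketch assumes 1 <= d < side, with r dividing d and side")

    links = []
    for y in range(side):
        for x in range(side):
            # Short links: unidirectional torus, one hop east and one hop south.
            links.append(((x, y), ((x + 1) % side, y), "short"))
            links.append(((x, y), (x, (y + 1) % side), "short"))
            # Express links skip d routers, only from populated positions.
            if x % r == 0:
                links.append(((x, y), ((x + d) % side, y), "express"))
            if y % r == 0:
                links.append(((x, y), (x, (y + d) % side), "express"))
    return links


if __name__ == "__main__":
    for params in [(16, 2, 1), (16, 2, 2)]:
        express = [l for l in fasttrack_links(*params) if l[2] == "express"]
        print(f"FT{params}: {len(express)} express links")
```

Under this assumed rule, depopulation trades express wiring for cost: the half-populated configuration keeps all short links but drives express links from fewer routers.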

#### FastTrack Switch Organization





- Packets can start on either short or express links
- DOR routing function: travel in X first, then Y
- Packets upgrade to fast (express) links whenever they can
- Packets can downgrade to slow (short) links only on a turn!
- Livelock avoidance: W→S takes priority over N→S
- Express links have higher priority, so deflected packets acquire higher priority → progress
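
To make the upgrade and downgrade rules concrete, the sketch below counts link traversals for one packet with no contention. It rests on assumptions that go beyond the slide: the NoC is a fully populated (R = 1) unidirectional torus, and a packet rides short links until its remaining offset in the current dimension is a multiple of D, then upgrades and stays on express links until its turn. The helper names `dimension_hops` and `zero_load_hops` are hypothetical, and deflections are not modeled.

```python
def dimension_hops(offset, d):
    """Split a ring offset into (short_hops, express_hops), contention-free.

    Assumption: since a packet may only downgrade on a turn, it upgrades
    once the remaining offset is a multiple of d, so it lands exactly on
    the router where it turns or exits.
    """
    if d < 1 or offset < 0:
        raise ValueError("need d >= 1 and offset >= 0")
    return offset % d, offset // d


def zero_load_hops(src, dst, side, d):
    """Compare link traversals for DOR (X first, then Y) on a side x side torus."""
    # Unidirectional torus: packets only travel "forward", so wrap the offsets.
    dx = (dst[0] - src[0]) % side
    dy = (dst[1] - src[1]) % side
    short_x, express_x = dimension_hops(dx, d)
    short_y, express_y = dimension_hops(dy, d)
    return {
        "hoplite": dx + dy,
        "fasttrack": short_x + express_x + short_y + express_y,
    }


if __name__ == "__main__":
    print(zero_load_hops(src=(0, 0), dst=(5, 6), side=8, d=2))
```

For an 8×8 torus with D = 2, a packet from (0, 0) to (5, 6) needs 11 short-link traversals in this model, versus 6 traversals when it can upgrade (1 short + 2 express in X, then 3 express in Y).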




#### Experimental Setup

- RTL implementation of routers → parameterized
  - D, R parameters control cost
- Cycle-accurate simulations → Verilator
- FPGA synthesis + out-of-context place-and-route + XDC floorplanning constraints → Vivado
- Benchmarking:
  - Synthetic traffic patterns at various injection rates
  - Traces from real workloads: SpMV, graph analytics, multi-processing
- Measured: sustained throughput, average latency, and power (via a power model)
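
As an illustration of what a synthetic traffic pattern at a given injection rate looks like, here is a minimal Python sketch of a uniform RANDOM pattern. It is not the authors' benchmark harness: the function name `random_traffic`, the per-cycle Bernoulli injection model, and the trace format are all assumptions made for this example.

```python
import random


def random_traffic(side, injection_rate, cycles, seed=0):
    """Generate (cycle, src, dst) tuples for uniform RANDOM traffic.

    Assumed model: each cycle, every PE in a side x side NoC offers one
    packet with probability `injection_rate`, addressed to a uniformly
    chosen PE other than itself.
    """
    if not 0.0 <= injection_rate <= 1.0:
        raise ValueError("injection_rate must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so that a sweep is repeatable
    nodes = [(x, y) for y in range(side) for x in range(side)]
    trace = []
    for cycle in range(cycles):
        for src in nodes:
            if rng.random() < injection_rate:
                dst = rng.choice(nodes)
                while dst == src:  # re-draw self-addressed packets
                    dst = rng.choice(nodes)
                trace.append((cycle, src, dst))
    return trace


if __name__ == "__main__":
    # Sweep a few offered loads for an 8x8 system.
    for rate in (0.05, 0.1, 0.2, 0.5):
        offered = len(random_traffic(side=8, injection_rate=rate, cycles=1000))
        print(f"rate={rate}: {offered} packets offered")
```

A simulator would replay such a trace against the router model and record the accepted throughput and the average packet latency at each rate.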

# Avg. Latency RANDOM traffic 8×8 NoC



- FastTrack saturates at a 4–5× higher injection rate than Hoplite
- Against replicated Hoplite, FastTrack is still better, but by a smaller margin
- Replicated Hoplite introduces a new kind of livelock possibility (at delivery)

## Results – LUT vs Throughput $8 \times 8$ NoC




## Results – Wiring vs. Throughput $8 \times 8$ NoC





## Results – Cost vs. Throughput $8 \times 8$ NoC



- FastTrack makes better use of FPGA resources (LUTs and wires)
- Packets are allowed to leave the NoC faster, freeing up resources
- Must pick proper combination of FT design parameters

# Qualitative Comparison of FPGA NoC Routers

| Router      | Cost     |         |     | Perf |         |
|-------------|----------|---------|-----|------|---------|
|             | Xbar+Arb | Buffers | VCs | Tput | Latency |
| OpenSMART   | ×        | ×       | ×   | ✓    | ✓       |
| BLESS       | ×        | ✓       | ✓   | ✓    | ×       |
| CONNECT     | ×        | ×       | ×   | ✓    | ✓       |
| Split-Merge | ×        | ×       | ✓   | ✓    | ✓       |
| Hoplite     | ✓        | ✓       | ✓   | ×    | ×       |
| FastTrack   | ✓        | ✓       | ✓   | ✓    | ✓       |

## FPGA Mapping Frequency 8×8 NoC



- Calibration studies showed that signals on express links travel quickly across the chip
- Fmax for 2-hop FastTrack keeps up with the original Hoplite
- A 4-hop express link distance is too large and causes a noticeable slowdown

## Conclusions

- FastTrack outperforms the state-of-the-art Hoplite FPGA NoC:
  - 2.5× throughput for synthetic traffic, 2.8× for real-world traces
  - 2.2× better energy efficiency
  - at the cost of 2.5× more LUTs
- FastTrack does better at larger system sizes
- Ideal express-link hop distance is 2–4 (4–256 PEs)
- Fmax gap between FastTrack and Hoplite is small