#### ISCA 2018 Session 8B: Interconnection Networks Synchronized Progress in Interconnection Networks (SPIN): A new theory for deadlock freedom Aniruddh Ramrakhyani Georgia Tech (aniruddh@gatech.edu) Tushar Krishna Georgia Tech (tushar@ece.gatech.edu) Paul V. Gratz Texas A&M University (pgratz@tamu.edu) A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible. A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible. - A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible. - Deadlocks are a fundamental problem in both off-chip and on-chip interconnection networks. - A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible. - Deadlocks are a fundamental problem in both off-chip and on-chip interconnection networks. - Cause system breakdown and kill chips. - A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible. - Deadlocks are a fundamental problem in both off-chip and on-chip interconnection networks. - Cause system breakdown and kill chips. - Deadlocks are hard to detect. - A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible. - Deadlocks are a fundamental problem in both off-chip and on-chip interconnection networks. - Cause system breakdown and kill chips. - Deadlocks are hard to detect. - Manifest after a long use time. - A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible. - Deadlocks are a fundamental problem in both off-chip and on-chip interconnection networks. - Cause system breakdown and kill chips. - Deadlocks are hard to detect. - Manifest after a long use time. - Show up due to system wear out faults and powergating of network elements which are hard to simulate. - A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible. - Deadlocks are a fundamental problem in both off-chip and on-chip interconnection networks. - Cause system breakdown and kill chips. - Deadlocks are hard to detect. - Manifest after a long use time. - Show up due to system wear out faults and powergating of network elements which are hard to simulate. - Need a solution for functional correctness!! - Defines a strict order in acquisition of links and/or buffers which ensures a cyclic dependency is never created. - Implementations: Turn model [5], XY routing, Up-Down routing [20]. - Defines a strict order in acquisition of links and/or buffers which ensures a cyclic dependency is never created. - Implementations: Turn model [5], XY routing, Up-Down routing [20]. - Limitations: - Routing Restrictions: Increased Latency, Throughput loss, Energy overhead - Require large no. of VCs for fully adaptive routing. Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks. - Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks. - Implementation: turn restrictions in escape-VC. - Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks. - Implementation: turn restrictions in escape-VC. - Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks. - Implementation: turn restrictions in escape-VC. - Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks. - Implementation: turn restrictions in escape-VC. - Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks. - Implementation: turn restrictions in escape-VC. - Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks. - Implementation: turn restrictions in escape-VC. - Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks. - Implementation: turn restrictions in escape-VC. - Limitations: - Energy and Area overhead of escape VCs. - Additional routing tables/logic for routing within escape-VC. #### **Other Solutions** Solution III: Flow Control - Solution III: Flow Control - Restrict injection when no. of empty buffers fall below a threshold - Solution III: Flow Control - Restrict injection when no. of empty buffers fall below a threshold - Implementation: Bubble Flow Control [9] - Solution III: Flow Control - Restrict injection when no. of empty buffers fall below a threshold - Implementation: Bubble Flow Control [9] - Limitation: Implementation Complexity. - Solution III: Flow Control - Restrict injection when no. of empty buffers fall below a threshold - Implementation: Bubble Flow Control [9] - Limitation: Implementation Complexity. - Solution IV: Deflection Routing - Solution III: Flow Control - Restrict injection when no. of empty buffers fall below a threshold - Implementation: Bubble Flow Control [9] - Limitation: Implementation Complexity. - Solution IV: Deflection Routing - Assign every flit to some output port even if they get misrouted. - Solution III: Flow Control - Restrict injection when no. of empty buffers fall below a threshold - Implementation: Bubble Flow Control [9] - Limitation: Implementation Complexity. - Solution IV: Deflection Routing - Assign every flit to some output port even if they get misrouted. - Implementation: BLESS [10], CHIPPER [35] - Solution III: Flow Control - Restrict injection when no. of empty buffers fall below a threshold - Implementation: Bubble Flow Control [9] - Limitation: Implementation Complexity. - Solution IV: Deflection Routing - Assign every flit to some output port even if they get misrouted. - Implementation: BLESS [10], CHIPPER [35] - Limitation: Livelocks Theonic Acyclic CDG not Required No Packet Injection Restrictions Livelock Free VC cost for Mesh Routing Minimal Adaptive Topology Independent | Memic | Acyclic<br>CDG not<br>Required | No Packet<br>Injection<br>Restrictions | Livelock<br>Free | Ro | for Mesh<br>uting<br>Adaptive | Topology<br>Indepen-<br>dent | |-------|--------------------------------|----------------------------------------|------------------|----|-------------------------------|------------------------------| | Dally | ** | <b>/</b> | <b>/</b> | 1 | 6 | ** | | Memic | Acyclic<br>CDG not<br>Required | No Packet<br>Injection<br>Restrictions | Livelock<br>Free | Ro | for Mesh<br>uting<br>Adaptive | Topology<br>Indepen-<br>dent | |-------|--------------------------------|----------------------------------------|------------------|----|-------------------------------|------------------------------| | Dally | ** | <b>/</b> | <b>/</b> | 1 | 6 | ** | | Duato | | | | 1 | 2 | * | | Theony | Acyclic<br>CDG not<br>Required | No Packet<br>Injection<br>Restrictions | Livelock<br>Free | Ro | for Mesh<br>uting<br>Adaptive | Topology<br>Indepen-<br>dent | |-----------------|--------------------------------|----------------------------------------|------------------|----|-------------------------------|------------------------------| | Dally | ** | <b>/</b> | <b>✓</b> | 1 | 6 | ** | | Duato | | | | 1 | 2 | ** | | Flow<br>Control | | ** | | 2 | 2 | <b>/</b> | | Theory | Acyclic<br>CDG not<br>Required | No Packet<br>Injection<br>Restrictions | Livelock<br>Free | Ro | for Mesh<br>uting<br>Adaptive | Topology<br>Indepen-<br>dent | |-----------------------|--------------------------------|----------------------------------------|------------------|----|-------------------------------|------------------------------| | Dally | ** | | <b>/</b> | 1 | 6 | ** | | Duato | > | | | 1 | 2 | ** | | Flow<br>Control | | ** | | 2 | 2 | | | Deflection<br>Routing | | ** | * | * | 1 | | | Theonic | Acyclic<br>CDG not<br>Required | No Packet<br>Injection<br>Restrictions | Livelock<br>Free | Ro | for Mesh<br>uting<br>Adaptive | Topology<br>Indepen-<br>dent | |-----------------------|--------------------------------|----------------------------------------|------------------|----|-------------------------------|------------------------------| | Dally | ** | <b>/</b> | <b>/</b> | 1 | 6 | ** | | Duato | | <b>/</b> | <b>/</b> | 1 | 2 | ** | | Flow<br>Control | Can we do better ?? | | | | | | | Deflection<br>Routing | <b>/</b> | ** | ** | ** | 1 | <b>✓</b> | | Theory | Acyclic<br>CDG not<br>Required | No Packet<br>Injection<br>Restrictions | Livelock<br>Free | Ro | for Mesh<br>uting<br>Adaptive | Topology<br>Indepen-<br>dent | |-----------------------|--------------------------------|----------------------------------------|------------------|----|-------------------------------|------------------------------| | Dally | ** | | <b>/</b> | 1 | 6 | ** | | Duato | > | | | 1 | 2 | ** | | Flow<br>Control | | ** | | 2 | 2 | | | Deflection<br>Routing | | ** | * | * | 1 | | | Metric | Acyclic<br>CDG not<br>Required | No Packet<br>Injection<br>Restrictions | Livelock<br>Free | Rou | for Mesh<br>uting<br>Adaptive | Topology<br>Indepen-<br>dent | |-----------------------|--------------------------------|----------------------------------------|------------------|-----|-------------------------------|------------------------------| | Dally | * | | | 1 | 6 | ** | | Duato | <b>/</b> | | <b>/</b> | 1 | 2 | ** | | Flow<br>Control | | * | | 2 | 2 | | | Deflection<br>Routing | | ** | ** | ** | 1 | | | SPIN | | <b>/</b> | <b>/</b> | 1 | 1 | | #### **Outline** - Routing Deadlocks - State of the Art - Dally's Theory - Duato's Theory - Flow Control Routing - Deflection Routing - SPIN: Synchronized Progress in Interconnection Networks - Evaluations - Conclusion #### What if: We coordinate the movement of every packet to the next hop at a given time ?? - Simultaneous Synchronized Movement of all deadlocked packets in the loop is called a spin. - Each spin leads to one hop forward movement of all deadlock packets. - Simultaneous Synchronized Movement of all deadlocked packets in the loop is called a spin. - Each spin leads to one hop forward movement of all deadlock packets. - One spin may not resolve the deadlock. If so, spin can be repeated - Simultaneous Synchronized Movement of all deadlocked packets in the loop is called a spin. - Each spin leads to one hop forward movement of all deadlock packets. - One spin may not resolve the deadlock. If so, spin can be repeated - Deadlock is guaranteed to be resolved in a finite number of spins [proof in paper, Sec. III] #### **Outline** - Routing Deadlocks - State of the Art - SPIN: Synchronized Progress in Interconnection Networks - Key Idea - Implementation Example - Micro-architecture - FAVORS - Evaluations - Conclusion SPIN is a generic deadlock freedom theory that can have multiple implementations. - SPIN is a generic deadlock freedom theory that can have multiple implementations. - SPIN Implementation: - Detect the Deadlock. - Coordinate a time for spin. - Execute the spin. - SPIN is a generic deadlock freedom theory that can have multiple implementations. - SPIN Implementation: - Detect the Deadlock. - Coordinate a time for spin. - Execute the spin. - We choose a recovery approach as deadlocks are rare scenarios (See Sec. II-F). Use counters. - Use counters. - Placed at every node at design time. - Use counters. - Placed at every node at design time. - Optimize by exploiting topology symmetry (See Static Bubble [6]). - Use counters. - Placed at every node at design time. - Optimize by exploiting topology symmetry (See Static Bubble [6]). - If packet does not leave in threshold time (configurable), it indicates a potential deadlock. - Use counters. - Placed at every node at design time. - Optimize by exploiting topology symmetry (See Static Bubble [6]). - If packet does not leave in threshold time (configurable), it indicates a potential deadlock. Probe is a special message that tracks the buffer dependency. - Probe is a special message that tracks the buffer dependency. - Probe returns to sender: - Probe is a special message that tracks the buffer dependency. - Probe returns to sender: - Cyclic buffer dependence, hence deadlock. - Probe is a special message that tracks the buffer dependency. - Probe returns to sender: - Cyclic buffer dependence, hence deadlock. - Next, send a move msg. to convey the spin time - Probe is a special message that tracks the buffer dependency. - Probe returns to sender: - Cyclic buffer dependence, hence deadlock. - Next, send a move msg. to convey the spin time - Upon receiving move msg., router sets its counter to count to spin cyle. ### Implementation Example: spin ### Implementation Example: spin ### Implementation Example: spin Resolving a deadlock may require multiple spins - Resolving a deadlock may require multiple spins - After spin, router can resume normal operation. - Resolving a deadlock may require multiple spins - After spin, router can resume normal operation. - Counter expires again, process repeated. - Resolving a deadlock may require multiple spins - After spin, router can resume normal operation. - Counter expires again, process repeated. - Optimization: send probe\_move after spin is complete. - Resolving a deadlock may require multiple spins - After spin, router can resume normal operation. - Counter expires again, process repeated. - Optimization: send probe\_move after spin is complete. - probe\_move checks if deadlock still exists and if so, sets the time for the next spin. - Resolving a deadlock may require multiple spins - After spin, router can resume normal operation. - Counter expires again, process repeated. - Optimization: send probe\_move after spin is complete. - probe\_move checks if deadlock still exists and if so, sets the time for the next spin. - Details in paper (Sec. IV-B). #### **Outline** - Routing Deadlocks - State of the Art - SPIN: Synchronized Progress in Interconnection Networks - Key Idea - Implementation Example - Micro-architecture - FAVORS - Evaluations - Conclusion No additional links: Spl. Msgs. use the same links as regular flits. - No additional links: Spl. Msgs. use the same links as regular flits. - Spl. Msgs. have higher priority in link usage over regular flits. - No additional links: Spl. Msgs. use the same links as regular flits. - Spl. Msgs. have higher priority in link usage over regular flits. - Links are anyways idle during deadlocks. - No additional links: Spl. Msgs. use the same links as regular flits. - Spl. Msgs. have higher priority in link usage over regular flits. - Links are anyways idle during deadlocks. - Bufferless Forwarding: Spl. Msgs. are not buffered anywhere (either forwarded or dropped). - No additional links: Spl. Msgs. use the same links as regular flits. - Spl. Msgs. have higher priority in link usage over regular flits. - Links are anyways idle during deadlocks. - Bufferless Forwarding: Spl. Msgs. are not buffered anywhere (either forwarded or dropped). - Distributed Design: any router can initiate the recovery. - No additional links: Spl. Msgs. use the same links as regular flits. - Spl. Msgs. have higher priority in link usage over regular flits. - Links are anyways idle during deadlocks. - Bufferless Forwarding: Spl. Msgs. are not buffered anywhere (either forwarded or dropped). - Distributed Design: any router can initiate the recovery. - 4% area overhead compared to traditional mesh router in 15nm Nangate [42]. #### **Outline** - Routing Deadlocks - State of the Art - SPIN: Synchronized Progress in Interconnection Networks - Key Idea - Walkthrough Example - Micro-architecture - FAVORS - Evaluations - Conclusion ■ SPIN is the first scheme that enables true one-VC fully adaptive deadlock-free routing for any topology. - SPIN is the first scheme that enables true one-VC fully adaptive deadlock-free routing for any topology. - **FAVORS**: Fully Adaptive One-vc Routing with SPIN. - **SPIN** is the *first scheme* that enables *true one-VC* fully adaptive deadlock-free routing for *any topology*. - FAVORS: Fully Adaptive One-vc Routing with SPIN. - Algorithm has two flavors: - **SPIN** is the *first scheme* that enables *true one-VC* fully adaptive deadlock-free routing for *any topology*. - **FAVORS**: Fully Adaptive One-vc Routing with SPIN. - Algorithm has two flavors: - Minimal Adaptive - SPIN is the first scheme that enables true one-VC fully adaptive deadlock-free routing for any topology. - FAVORS: Fully Adaptive One-vc Routing with SPIN. - Algorithm has two flavors: - Minimal Adaptive - Non-minimal Adaptive. - **SPIN** is the *first scheme* that enables *true one-VC* fully adaptive deadlock-free routing for *any topology*. - **FAVORS**: Fully Adaptive One-vc Routing with SPIN. - Algorithm has two flavors: - Minimal Adaptive - Non-minimal Adaptive. - Route Selection Metrics: - **SPIN** is the *first scheme* that enables *true one-VC* fully adaptive deadlock-free routing for *any topology*. - FAVORS: Fully Adaptive One-vc Routing with SPIN. - Algorithm has two flavors: - Minimal Adaptive - Non-minimal Adaptive. - Route Selection Metrics: - Credit turn-around time - **SPIN** is the *first scheme* that enables *true one-VC* fully adaptive deadlock-free routing for *any topology*. - FAVORS: Fully Adaptive One-vc Routing with SPIN. - Algorithm has two flavors: - Minimal Adaptive - Non-minimal Adaptive. - Route Selection Metrics: - Credit turn-around time - Hop Count ## **FAvORS** Routing Algorithm - **SPIN** is the *first scheme* that enables *true one-VC* fully adaptive deadlock-free routing for *any topology*. - FAVORS: Fully Adaptive One-vc Routing with SPIN. - Algorithm has two flavors: - Minimal Adaptive - Non-minimal Adaptive. - Route Selection Metrics: - Credit turn-around time - Hop Count - More details in paper (Sec. V). #### **Outline** - Routing Deadlocks - State of the Art - SPIN : Synchronized Progress in Interconnection Networks - Evaluations - Conclusion #### **Evaluations** Network Configuration: **Simulator** **Topologies** Link Latency **Traffic** #### **Evaluations** Network Configuration: | Simulator | gem5 + Garnet 2.0 Network simulator | | |-----------------|----------------------------------------|--| | Topologies | 8x8 Mesh | | | Link<br>Latency | 1-cycle | | | Traffic | Synthetic +<br>Multi-threaded (PARSEC) | | #### **Evaluations** Network Configuration: | Simulator | gem5 + Garnet 2.0 Network simulator | | | |-----------------|----------------------------------------|----------------------------------------------|--| | Topologies | 8x8 Mesh | 1024 node Off-chip<br>Dragon-fly | | | Link<br>Latency | 1-cycle | Inter-group: 1-cycle<br>Intra-group: 3-cycle | | | Traffic | Synthetic +<br>Multi-threaded (PARSEC) | Synthetic | | #### **Evaluations: Baselines** #### **Evaluations: Baselines** #### ■ <u>8x8 Mesh:</u> | Design | Routing<br>Adaptivity | Minimal | Theory | Deadlock<br>Freedom Type | |--------------------|-----------------------|---------|--------------|--------------------------| | West-first Routing | Partial | Yes | Dally | Avoidance | | Escape-VC | Full | Yes | Duato | Avoidance | | Static-Bubble [6] | Full | Yes | Flow-Control | Recovery | #### **Evaluations: Baselines** #### ■ <u>8x8 Mesh:</u> | Design | Routing<br>Adaptivity | Minimal | Theory | Deadlock<br>Freedom Type | |--------------------|-----------------------|---------|--------------|--------------------------| | West-first Routing | Partial | Yes | Dally | Avoidance | | Escape-VC | Full | Yes | Duato | Avoidance | | Static-Bubble [6] | Full | Yes | Flow-Control | Recovery | | Design | Routing<br>Adaptivity | Minimal | Theory | Deadlock<br>Freedom Type | |-----------|-----------------------|---------|--------|--------------------------| | UGAL [37] | Full | No | Dally | Avoidance | ■ 1024-node Off-chip Dragon-fly: throughput compared to **Minimal Routing 1-VC** compared to UGAL\_Dally compared to UGAL\_Dally Deadlocks are a fundamental problem in Interconnection Networks. - Deadlocks are a fundamental problem in Interconnection Networks. - SPIN is a generic deadlock freedom theory - Scalable: Distributed Deadlock Resolution - Plug-n-Play: topology agnostic - Enables true one-VC fully adaptive routing for any topology - Deadlocks are a fundamental problem in Interconnection Networks. - SPIN is a generic deadlock freedom theory - Scalable: Distributed Deadlock Resolution - Plug-n-Play: topology agnostic - Enables true one-VC fully adaptive routing for any topology - Performance (Saturation Throughput): - On-chip mesh: upto 68% higher - Off-chip Dragon-fly: upto 62% higher Practical Applications: Practical Applications: | On-Chip | Mesh<br>(Intel SCC, Tilera Tile64) | |------------------------------|--------------------------------------------------------------| | Super-computers | Dragon-fly<br>(Cray XC Networks) | | Datacenters | JellyFish (HP),<br>Fat Tree (Google) | | Irregular Topologies | Faults (Static Bubble [6]) Power-gating (Router Parking[29]) | | NoC Generators | FlexNoc (ARTERIS),<br>Sonics GN | | Domain specific Accelerators | Corelink Interconnect (ARM) | Practical Applications: | On-Chip | Mesh<br>(Intel SCC, Tilera Tile64) | |------------------------------|--------------------------------------------------------------| | Super-computers | Dragon-fly<br>(Cray XC Networks) | | Datacenters | JellyFish (HP),<br>Fat Tree (Google) | | Irregular Topologies | Faults (Static Bubble [6]) Power-gating (Router Parking[29]) | | NoC Generators | FlexNoc (ARTERIS),<br>Sonics GN | | Domain specific Accelerators | Corelink Interconnect (ARM) | ■ Thank you!! # Back-up # **SPIN: Applications** ### **SPIN: Applications** - SPIN is a generic deadlock freedom theory - Scalable: distributed deadlock resolution - Plug-n-Play: doesn't require knowledge of topology ### **SPIN: Applications** - SPIN is a generic deadlock freedom theory - Scalable: distributed deadlock resolution - Plug-n-Play: doesn't require knowledge of topology - SPIN can thus be used in : - On-chip networks: Mesh (Intel SCC, Tilera Tile64) - Supercomputers: Dragon-fly (Cray XC Networks) - Datacenters: Jellyfish (HP), Fat Tree (Google) - Static & Dynamically Changing Irregular topologies due to faults (Static Bubble [6]) & power-gating (Router Parking [29]) - NoC Generators (Opensmart [13]) & Domain specific accelerator (Eyeriss[15])