
Exploring the Potential of Heterogeneous Von Neumann Dataflow Execution Models
"Discover the advantages of combining Von Neumann and Dataflow architectures for efficient program execution in this research study by Tony Nowatzki, Vinay Gangadhar, and Karthikeyan Sankaralingam from the University of Wisconsin-Madison."
Presentation Transcript
Exploring the Potential of Heterogeneous Von Neumann / Dataflow Execution Models. Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam. University of Wisconsin - Madison. 1
Program execution over time alternates between low-ILP and high-ILP phases. Reason for high ILP: either due to speculation, or very high/non-local parallelism. Helpful arch. features: for speculation-driven ILP, branch prediction, speculative scheduling, and fast recovery; for very high/non-local parallelism, very-high issue width and a very-large instruction window. Best suited architecture: Von Neumann for the former, Explicit-Dataflow for the latter. Our proposal: Von Neumann (highly speculative, modest ILP) combined with Explicit-Dataflow (high ILP, little speculation). 2
Related Work Our Proposal Von Von Von Explicit Dataflow Neumann Neumann Neumann Dual Issue OOO + SEED - Dual Issue OOO + 4 Others: (Beret, CCores, bigLITTLE, In-place Loop) 30% Speedup, 70% Energy Efficiency 5% Speedup, 20% Energy Efficiency 3
Outline. Describing why Von Neumann can complement Dataflow architectures. Leveraging program properties for efficient heterogeneous design. Designing SEED: Specialization Engine for Explicit-Dataflow (general purpose, offload engine). Performing a design-space exploration across (Small, Medium, Big) VN cores. 4
Von Neumann (Out-of-Order): instruction-by-instruction execution of the control flow graph. Advantage: + speculation. Explicit-Dataflow: dependence-graph execution; control dependences become data dependences. Advantages: + local/non-local ILP; + lower overheads (no fetch, decode, or rename; no dynamic dependence-graph construction). [Figure labels: Instruction Stream, Instruction Window, Dependence Graph.] 5
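The dependence-graph execution described on this slide can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not in the talk: every operation takes one cycle, there are unlimited functional units, and the graph and node names (`ld_a`, `add_a`, `cmp`, `br`) are hypothetical. A node fires as soon as all of its producers have fired, and the branch is simply one more consumer of data.

```python
from collections import defaultdict


def dataflow_schedule(deps):
    """Return {node: cycle}, firing each node once all its producers have fired.

    deps maps every node to the list of nodes it reads from.
    Assumes unit latency and unlimited functional units (illustrative only).
    """
    consumers = defaultdict(list)
    pending = {}
    for node, producers in deps.items():
        pending[node] = len(producers)
        for producer in producers:
            consumers[producer].append(node)

    ready = [node for node, count in pending.items() if count == 0]
    cycle_of = {}
    cycle = 0
    while ready:
        next_ready = []
        for node in ready:
            cycle_of[node] = cycle
            for consumer in consumers[node]:
                pending[consumer] -= 1
                if pending[consumer] == 0:
                    next_ready.append(consumer)
        ready = next_ready
        cycle += 1
    return cycle_of


# Hypothetical loop body: two independent load/add chains joined by a compare.
graph = {
    "ld_a": [], "ld_b": [],
    "add_a": ["ld_a"], "add_b": ["ld_b"],
    "cmp": ["add_a", "add_b"],
    "br": ["cmp"],  # the control decision is just another data consumer
}
schedule = dataflow_schedule(graph)
```

Both loads fire in the same cycle without any fetch, decode, or rename step, which is the "lower overheads" point of the slide; what the sketch cannot do is guess the branch outcome early, which is where speculation on the Von Neumann side helps.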
Loop 1: data-dependent control; Von Neumann wins (speculation). Loop 2: high ILP, non-critical control; Explicit-Dataflow wins (higher ILP). [Figure: dependence graphs of the two loops, built from ld, +, >, -, and br nodes.] 6
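To make the contrast between the two loops concrete, here is a pair of hypothetical loops with the same flavor. They are not the loops drawn on the slide; the function names and data are assumptions chosen only to show where the branch sits relative to the work.

```python
def loop1_data_dependent_control(values, threshold):
    """Control on the critical path: each branch depends on the value just
    loaded, and its outcome decides which element is loaded next."""
    i = 0
    steps = 0
    while i < len(values):
        v = values[i]        # ld
        if v > threshold:    # > followed by br: outcome depends on the load
            i += 2           # the next address depends on the branch
        else:
            i += 1
        steps += 1
    return steps


def loop2_high_ilp(a, b, c, d):
    """Non-critical control: the only branch is the loop exit, and every
    iteration has independent loads and adds that can run in parallel."""
    out = []
    for w, x, y, z in zip(a, b, c, d):
        out.append((w + x) + (y + z))  # four loads, three adds, no data-dependent branch
    return out


steps = loop1_data_dependent_control([1, 5, 2, 7, 3], threshold=4)
sums = loop2_high_ilp([1, 2], [3, 4], [5, 6], [7, 8])
```

In the first loop a core that predicts the branch can start the next load early, while in the second loop the limiting factor is how many independent operations the hardware can keep in flight.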
[Figure: workload space plotted against Memory Regularity and Control Regularity, with a Higher ILP direction. Regions: Data-Parallel (SIMD/GPU); General Code (Out-of-Order); and three candidate dataflow regions: Unpredictable (Dataflow?), High ILP (Dataflow?), Memory Latency Bound (Dataflow?).] 7
Outline. Describing why Von Neumann can complement Dataflow architectures. Leveraging program properties for efficient heterogeneous design (24 irregular workloads from SpecINT/MediaBench). Designing SEED: Specialization Engine for Explicit-Dataflow. Performing a design-space exploration across (Small, Medium, Big) VN cores. 8
Property 1: Affinity Phase Behavior Architecture affinity over Time App. 1 Hundreds to Millions of Instructions App. 2 App. 3 Cache Hierarchy VonNeumann Data-Parallel Dataflow Out-of-Order Explicit-Dataflow Workload Architecture Affinity SIMD Von Neumann OOO Only fast-switching 13% Ideal Dataflow Only 25% 63% Heterogeneous Execution 9
Property 2: Benefits of Restricted Scope Arbitrary Code Inner Loops Scope: Traces (call) (must support arbitrary procedure calls, recursion, instruction misses) Area/Power: Low Low High Coverage: (any duration) 61% 41% 100% Coverage: (long duration) 46% 27% 100% Fine Print: Static Region Size 1024 Instructions; Long Duration (lasting Longer than 1000 Cycles) 10
Property 2: Benefits of Restricted Scope Nested Loops Arbitrary Code Inner Loops Scope: Traces (call) Low Area/Power: High Low Low Coverage: (any duration) 74% 61% 41% 100% Coverage: (long duration) 46% 27% 67% 100% Fine Print: Static Region Size 1024 Instructions; Long Duration (lasting Longer than 1000 Instructions) 11
Outline. Describing why Von Neumann can complement Dataflow architectures. Leveraging program properties for efficient heterogeneous design: Fine-Grain Switching => Phased Dataflow Affinity; Simplify Dataflow Arch. => Nested-Loop Scope. Designing SEED: Specialization Engine for Explicit-Dataflow: 1. System Overview; 2. Architecture Inspiration; 3. Architecture Overview. Performing a design-space exploration across (Small, Medium, Big) VN cores. 12
System Overview. The compiler finds and inlines profitable nested loop regions, and generates the dataflow representation for SEED. Invoking SEED: SEED_CONFIG begins streaming the configuration (20-250 cycles); SEED_BEGIN transfers execution and powers down non-stateful host core components. Resuming OOO: the core is powered on and live values are transferred to the OOO core registers. [Figure labels: system architecture (Compiler, Program Binary, OOO Core, SEED, L1 Cache; the nested loop runs on SEED) and a program-execution timeline (SEED_CONFIG, SEED_BEGIN, Resume OOO).] 13
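The invocation sequence on this slide (SEED_CONFIG, then SEED_BEGIN, then resuming the OOO core) can be read as a small state machine. The sketch below is a software model written only to illustrate that ordering. The class, method, and field names are hypothetical, the nested-loop region is modeled as an ordinary Python function, and configuration latency is not modeled.

```python
from enum import Enum, auto


class Mode(Enum):
    OOO = auto()         # host out-of-order core is executing
    CONFIGURED = auto()  # SEED_CONFIG issued, configuration streamed
    SEED = auto()        # SEED is executing the nested-loop region


class Host:
    """Toy model of a host core with an attached SEED unit (assumed names)."""

    def __init__(self):
        self.mode = Mode.OOO
        self.core_powered = True
        self.registers = {}
        self.trace = []
        self._region = None

    def seed_config(self, region):
        if self.mode is not Mode.OOO:
            raise RuntimeError("SEED_CONFIG is only valid while the OOO core runs")
        self._region = region
        self.mode = Mode.CONFIGURED
        self.trace.append("SEED_CONFIG")

    def seed_begin(self, live_ins):
        if self.mode is not Mode.CONFIGURED:
            raise RuntimeError("SEED_BEGIN requires a prior SEED_CONFIG")
        self.core_powered = False          # non-stateful host components power down
        self.mode = Mode.SEED
        self.trace.append("SEED_BEGIN")
        live_outs = self._region(live_ins)  # the nested loop runs on SEED
        self._resume_ooo(live_outs)

    def _resume_ooo(self, live_outs):
        self.core_powered = True            # core is powered on again
        self.registers.update(live_outs)    # live values go to OOO registers
        self.mode = Mode.OOO
        self.trace.append("RESUME_OOO")


def nested_loop_region(live_ins):
    """Hypothetical nested loop that the compiler would map to SEED."""
    total = 0
    for i in range(live_ins["n"]):
        for j in range(live_ins["m"]):
            total += i * j
    return {"total": total}


host = Host()
host.seed_config(nested_loop_region)
host.seed_begin({"n": 3, "m": 3})
```

After the call sequence, `host.trace` records the three events in order and the live-out value sits in `host.registers`, mirroring the timeline at the bottom of the slide.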
Leveraging Decades of Dataflow Research. Design decisions, listed for TRIPS, WaveScalar, DySER, and BERET respectively. Scope: whole program; whole program; inner-loop compute; hot-loop trace. Integration to host: behind L1 cache. Control flow: VN/predication; switch instructions; predication; trace-only. Control speculation: block-based; none; block-based; trace-only. Dataflow firing: position-based; tag-based; static; not applicable. Execution units: homogeneous FU; heterogeneous FU; heterogeneous FU; compound FU. Criteria: 1. Low area/power. 2. High performance. 3. Complement capabilities of Von Neumann. 14
SEED Architecture. IMU (Instruction Mgmt. Unit): keeps instruction definitions and unrolled operand storage (32 comp. insts); issues one instruction per cycle to the CFU. CFU: SEED units are organized around a set of compound functional units (CFUs), 2-5 instructions each. ODU + Bus: multi-bus distribution network for communicating between SEED units. Store Buffer: interface to the memory system. CPU XFER: entering and exiting SEED regions. The compile-time scheduler assigns instructions to compound FUs and SEED units using an integer linear program [see Paper]. [Figure labels: L1 Cache (ICache, DCache), OOO CPU, SEED Units 1 to 8 (CFU1 to CFU8), IMU, ODU, Bus Arbiter, Config & Init., Store Buffer, CPU XFER.] 15
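The slide says the compile-time scheduler uses an integer linear program, whose formulation is in the paper and is not reproduced here. As a much simpler stand-in, the sketch below places compound instructions on the least-loaded unit. It assumes 8 SEED units and 32 compound-instruction slots per unit (one reading of the numbers on the slide) and ignores communication cost between units, which a real scheduler would have to weigh. All names are hypothetical.

```python
NUM_UNITS = 8          # SEED Units 1 to 8, as drawn on the slide
SLOTS_PER_UNIT = 32    # assumed reading of "32 comp. insts" per IMU


def assign_least_loaded(compound_insts, num_units=NUM_UNITS, slots=SLOTS_PER_UNIT):
    """Greedy placement: put each compound instruction on the emptiest unit.

    This is NOT the integer linear program mentioned on the slide; it is a
    stand-in that only respects per-unit capacity.
    """
    units = [[] for _ in range(num_units)]
    for inst in compound_insts:
        target = min(units, key=len)  # first unit with the fewest instructions
        if len(target) >= slots:
            raise ValueError("region does not fit in the available SEED units")
        target.append(inst)
    return units


# Hypothetical region with 20 compound instructions.
placement = assign_least_loaded([f"cfu_op_{k}" for k in range(20)])
loads = [len(unit) for unit in placement]
```

Because each IMU issues one instruction per cycle, spreading instructions evenly raises the number that can issue in the same cycle; the trade-off the greedy version ignores is that dependent instructions on different units must communicate over the bus.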
Capability Comparison (parentheses: theoretical max). Max. effective IPC: Quad-Issue OOO 3-3.5 (4); SEED 6-7 (16). Max. instruction window: Quad-Issue OOO 48 (48); SEED 200-300 (1024). Speculative control: Quad-Issue OOO yes; SEED no. Speculative scheduling: Quad-Issue OOO yes; SEED no. 16
Outline. Describing why Von Neumann can complement Dataflow architectures. Leveraging program properties for efficient heterogeneous design. Designing SEED: Specialization Engine for Explicit-Dataflow. Performing a design-space exploration across (Small, Medium, Big) VN cores. 17
Experimental Setup. Methodology: Simulator: modified Gem5 + McPAT/Cacti. Power: core + L1 + L2 + static (at 2 GHz and 22 nm). Comparison to state-of-the-art Von Neumann-based heterogeneous execution models: accelerators (BERET, Conservation-Cores) and micro-architectures (bigLITTLE, In-place Loop Execution). Oracle scheduler: chooses the best architecture per region (others have demonstrated solutions [Padmanabha et al., MICRO 2013]). Design-space exploration: IO2 (dual-issue in-order); OOO2 (dual-issue, 32-entry IW); OOO4 (quad-issue, 48-entry IW). 18
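The oracle scheduler mentioned on this slide picks the best architecture for each region. A minimal sketch of that selection follows. The per-region cycle counts are made-up illustration values, not measurements from the study, and switching cost between architectures is ignored.

```python
def oracle_schedule(region_costs):
    """Pick the cheapest architecture for every region.

    region_costs is a list of {architecture_name: cycles} dictionaries,
    one per program region. Returns (choices, total_cycles).
    Ties go to the architecture listed first in the dictionary.
    """
    choices = [min(costs, key=costs.get) for costs in region_costs]
    total = sum(costs[arch] for costs, arch in zip(region_costs, choices))
    return choices, total


# Hypothetical cycle counts for three regions on two architectures.
regions = [
    {"OOO2": 100, "SEED": 60},   # high-ILP nested loop
    {"OOO2": 80, "SEED": 120},   # control on the critical path
    {"OOO2": 50, "SEED": 50},    # similar on both
]
choices, hetero_cycles = oracle_schedule(regions)
ooo_only_cycles = sum(costs["OOO2"] for costs in regions)
speedup = ooo_only_cycles / hetero_cycles
```

With these example numbers the heterogeneous schedule uses SEED for the first region only; the same function can minimize energy instead by passing per-region energy values in place of cycle counts.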
Architecture Comparison. State-of-the-art: 1.14 Perf., 1.54 En. This work: 1.3 Perf., 1.7 En. [Figure annotation: This Way Better.] 19
Dataflow Heterogeneity Analysis. Slowdown regions: vectorizable on OOO core; control on critical path; short regions. Similar regions: no speculative scheduling; + fewer branch mispredicts; + less stack spilling. Speedup regions: + high instruction parallelism; + high memory parallelism. [Figure: SEED speedup over Quad-Issue OOO, y-axis from 0 to 3.] 20
Heterogeneity Analysis Summary. Slowdown regions: vectorizable on OOO core; control on critical path; short regions. Similar regions: no speculative scheduling; + fewer branch mispredicts; + less stack spilling. Speedup regions: + high instruction parallelism; + high memory parallelism. [Figure: SEED speedup over Quad-Issue OOO, y-axis from 0 to 3.] 27
Conclusions. Heterogeneous Von Neumann + Dataflow has high potential, especially for modest-sized OOO cores. Delay the need for application-specific accelerators? Looking forward: Augment the dataflow architecture with data-parallel capabilities? Alternative heterogeneous models (what other program properties to leverage?). Can micro-architecture modifications achieve the same benefits? Thank you! 28