Rethinking Hardware for Disciplined Parallelism
The DeNovo project rethinks hardware design for disciplined parallelism, targeting power-, complexity-, and performance-scalable systems. Instead of supporting today's unrestricted ("wild") shared memory, it builds on disciplined shared memory: language-level guarantees such as data-race-freedom and determinism-by-default eliminate the races, non-determinism, and inefficiencies of current systems, and in turn allow much simpler, software-aware coherence, consistency, and communication hardware.
Presentation Transcript
DeNovo: Rethinking Hardware for Disciplined Parallelism
Byn Choi, Rakesh Komuravelli, Hyojin Sung, Rob Bocchino, Sarita Adve, Vikram Adve
Other collaborators:
- Languages: Adam Welc, Tatiana Shpeisman, Yang Ni (Intel)
- Applications: John Hart, Victor Lu
De Novo = from the beginning, anew
Motivation

Goal: power-, complexity-, performance-scalable hardware

Today: shared memory
- Directory-based coherence: complex, unscalable
- Address, communication, and coherence granularity is the cache line
- Software-oblivious: inefficient in power, bandwidth, latency, and area, especially for object-oriented codes

Difficult programming model
- Data races, non-determinism; no safety, composability, or modularity
- Can't specify what value a read can return (a.k.a. the memory model)
- Data races defy acceptable semantics; hardware and software are mismatched

Fundamentally broken for hardware & software. Banish wild shared memory! Need disciplined shared memory!
What is Shared-Memory?

Wild Shared-Memory = global address space + implicit, anywhere communication and synchronization

Disciplined Shared-Memory = global address space + explicit, structured side-effects
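To make "explicit, structured side-effects" concrete, here is a minimal C++ sketch under stated assumptions: the Region names and the Effects type are illustrative inventions, not the DPJ or DeNovo interface. Each task in a parallel phase declares up front which regions it may read or write, and tasks may run concurrently only if their declared effects do not interfere.

```cpp
#include <bitset>

// Hypothetical region identifiers; in DPJ these would be declared
// region names, summarized here as a small fixed set.
enum Region { kTreeNodes, kParticles, kForces, kNumRegions };

// A task's declared side-effects: which regions it may read or write.
struct Effects {
    std::bitset<kNumRegions> reads;
    std::bitset<kNumRegions> writes;
};

// Two tasks may run concurrently in the same phase only if their
// declared effects do not interfere: no write-write or read-write
// overlap on any region.
bool noninterfering(const Effects& a, const Effects& b) {
    return (a.writes & b.writes).none() &&
           (a.writes & b.reads).none()  &&
           (a.reads  & b.writes).none();
}
```

In DPJ this non-interference condition is checked statically by the type-and-effect system; the sketch merely shows the same condition as a runtime predicate.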
Disciplined Shared-Memory

Top-down view: programming model, language, ... (earlier today)
- Use explicit effects for semantic guarantees: data-race-freedom, determinism-by-default, controlled non-determinism
- Reward: simple semantics, safety, composability, ...

Bottom-up view: hardware, runtime, ...
- Use explicit effects + the above guarantees for simple coherence and consistency, and for software-aware address/communication/coherence granularity and data layout
- Reward: power-, complexity-, performance-scalable hardware

Top-down and bottom-up views are synergistic!
DeNovo Research Strategy

1. Start with deterministic (data-race-free) codes: the common and best case, and the basis for extension to other codes
2. Disciplined non-deterministic codes
3. Wild non-deterministic, legacy codes

Work with languages for the best h/w-s/w interface: the current driver is DPJ; the end goal is a language-oblivious interface.
Work with realistic applications: current work uses kd-trees.
Progress Summary

1. Deterministic codes
- Language-level model with DPJ, translation to the h/w interface
- Design for simple coherence, s/w-aware communication and layout
- Baseline coherence implemented, rest underway

2. Disciplined non-deterministic codes
- Language-level model with DPJ (and Intel)

3. Wild non-deterministic, legacy codes
- Just begun; will be influenced by the above

Collaborations with the languages & applications groups: DPJ; first scalable SAH-quality kd-tree construction
Coherence and Consistency: Key Insights

Guaranteed determinism: a read should return the value of the last write in sequential order
- From the same task in this parallel phase, if such a write exists
- Otherwise from the previous parallel phase (there are no concurrent conflicting writes in this phase)

Explicit effects
- The compiler knows all regions written in this parallel phase
- The cache can self-invalidate before the next parallel phase, invalidating data in writeable regions not written by this core
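The self-invalidation step can be sketched as a minimal C++ model, under assumptions not in the slides: word-granularity cache state, an integer region tag per word, and a compiler-supplied bitset of regions with write effects. The names (CachedWord, selfInvalidate) are hypothetical.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// L1 states from the baseline protocol: words this core wrote in the
// phase are Registered, so they survive self-invalidation.
enum class State : uint8_t { Invalid, Valid, Registered };

// One cached word, tagged with the software region it belongs to.
struct CachedWord {
    State state = State::Invalid;
    int   region = 0;
};

// Before the next parallel phase, the compiler supplies the set of
// regions that had write effects in the phase just ended. The cache
// drops any Valid word in those regions; Registered words (written by
// this core) are kept. No invalidation messages or sharer lists needed.
template <std::size_t kNumRegions>
void selfInvalidate(std::vector<CachedWord>& cache,
                    const std::bitset<kNumRegions>& writeEffectRegions) {
    for (CachedWord& w : cache) {
        if (w.state == State::Valid &&
            writeEffectRegions.test(static_cast<std::size_t>(w.region))) {
            w.state = State::Invalid;
        }
    }
}
```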
Today's Coherence Protocols

Snooping: broadcast, ordered networks. Directory: avoids broadcast through indirection, but has its own problems, which DeNovo's guarantees address:

- Complexity: races in the protocol.
  DeNovo: race-free software yields an (almost) race-free coherence protocol, with no transient states and a much simpler design.
- Overhead: sharer lists.
  DeNovo: explicit effects enable self-invalidations, so there is no need for sharer lists.
- Performance: all cache misses go through the directory.
  DeNovo: the directory tracks only the one up-to-date copy, not sharers or serialization; data copies can move from cache to cache without telling the directory.
Baseline DeNovo Coherence

Assume (for now): private L1s, a shared L2, and single-word lines
- The directory tracks the one current copy of a line, not sharers
- L2 data arrays double as the directory: each entry holds either valid data or a registered core ID, so there is no extra space overhead
- Software inserts self-invalidates for regions with write effects
- L1 states: Invalid, Valid, Registered
- No transient states: the protocol matches the simple 3-state textbook picture

Formal specification and verification with Intel
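A hedged sketch of the three stable L1 states and the core-side transitions as we read the slides; the function names and simplified transitions are ours, not the formally verified specification.

```cpp
// The three stable L1 states of the baseline protocol. There are no
// transient states in this simplified core-side view.
enum class L1State { Invalid, Valid, Registered };

// A read hit (Valid or Registered) uses local data; a read miss
// fetches valid data, via the L2 directory, from wherever the one
// up-to-date copy lives.
L1State onCoreRead(L1State s) {
    if (s == L1State::Invalid) {
        // Miss: obtain the data and cache it as Valid.
        return L1State::Valid;
    }
    return s;  // hit
}

// Any write registers the word: the L2 directory records this core as
// holding the one up-to-date copy. Because the software is race-free,
// no invalidations are sent to other caches; stale copies disappear
// via self-invalidation at the next phase boundary.
L1State onCoreWrite(L1State s) {
    if (s == L1State::Registered) {
        return s;  // already the registered owner
    }
    return L1State::Registered;
}
```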
DeNovo Region Granularity

DPJ regions are too fine-grained for hardware
- They ensure that accesses to individual objects/fields don't interfere
- That means far too many regions for hardware to track

DeNovo only needs the aggregate data written in a phase
- E.g., a field of an entire data structure can be summarized as one region

Open question: can we aggregate to few enough regions without excessive invalidations? A hedged illustration of such aggregation follows.
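A minimal sketch of the aggregation idea, with invented names: the many per-object regions a DPJ program might use for a tree collapse onto one hardware region per field of the whole structure.

```cpp
// Hypothetical coarse regions: one per field of the entire kd-tree,
// rather than one per (node, field) pair as in fine-grained DPJ.
enum class DeNovoRegion { NodeChildren, NodeBounds, NodeSplitPlane };

// All instances of a given field, across every node, summarize to the
// same hardware region; the object identity is deliberately ignored.
DeNovoRegion summarize(int /*nodeId*/, int fieldId) {
    switch (fieldId) {
        case 0:  return DeNovoRegion::NodeChildren;
        case 1:  return DeNovoRegion::NodeBounds;
        default: return DeNovoRegion::NodeSplitPlane;
    }
}
```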
Evaluation Methodology

Simulator: modified Wisconsin GEMS + (Intel!) Simics
Apps: LU, FFT, Barnes, kd-tree; DPJ regions were converted to DeNovo regions by hand

App        # Regions
Barnes     8
FFT        3
LU         2
Kd-tree    4

Compared DeNovo vs. MESI with single-word lines
Goal: does simplicity impact performance? E.g., are self-invalidations too conservative? (Efficiency enhancements are the next step.)
Results for Baseline Coherence

[Chart: execution time normalized to MESI (%), MESI vs. DeNovo, on LU, FFT, Barnes-Hut, and kd-tree (nested)]

DeNovo is comparable to MESI; the simple protocol is the foundation for efficiency enhancements.
Improving Performance & Power

Insight: valid data can always be copied to another cache without a demand access and without going through the directory
- If a later demand read sees it, it must be correct
- No false-sharing effects (no loss of "ownership")

Simple line-based protocol (with word-based valid bits)
- Can get a line from anywhere, not just via the directory at L2
- Point-to-point, sender-initiated transfers
- Can transfer more than a line at a time: point-to-point bulk transfer
- Transfer granularity can be region-driven, so the AoS-vs-SoA optimization is natural

A sketch of the receive side of such a transfer follows.
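A minimal sketch of the receive side, under assumptions not in the slides (line size, structure names): a peer pushes a line it holds, and the receiver merges the words it is missing without any directory transaction. This is safe because anything that could be stale was already self-invalidated at the phase boundary.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

constexpr int kWordsPerLine = 8;  // illustrative line size

// A cached line with per-word valid bits, as in the slide.
struct L1Line {
    uint64_t addr = 0;
    std::array<uint32_t, kWordsPerLine> data{};
    std::bitset<kWordsPerLine> valid;
};

// Sender-initiated, point-to-point transfer: merge any valid words the
// receiver does not already hold, without consulting the directory. A
// later demand read that hits this data is guaranteed correct.
void acceptPushedLine(L1Line& mine, const L1Line& pushed) {
    for (int w = 0; w < kWordsPerLine; ++w) {
        if (pushed.valid.test(w) && !mine.valid.test(w)) {
            mine.data[w] = pushed.data[w];
            mine.valid.set(w);
        }
    }
}
```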
Towards Ideal Efficiency

Current systems: address, transfer, and coherence granularity = a fixed cache line
DeNovo so far: transfer is flexible, but addressing is still line-based and coherence is still word-based

Next step: region-centered caches
- Use regions for memory layout: region-based pool allocation, with fields of the same region at fixed strides
- Cache banks devoted to regions
- Regions accessed together give the address and transfer granularity; regions with the same sharing behavior give the coherence granularity
- Applicable to main memory and pin bandwidth
- Interactions with the runtime scheduler

A sketch of region-based pool allocation appears below.
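A minimal sketch of region-based pool allocation (the Body fields and pool names are invented for illustration): fields belonging to the same region are stored contiguously, structure-of-arrays style, so a whole region can be laid out, transferred, or invalidated as one unit.

```cpp
#include <cstddef>
#include <vector>

// Structure-of-arrays pools, one per region: instead of interleaving a
// body's mass/position/force in one object (AoS), each region's field
// occupies its own contiguous pool at a fixed stride.
struct BodyPools {
    std::vector<float> mass;   // region written only at initialization
    std::vector<float> posX;   // region written in the position phase
    std::vector<float> forceX; // region written in the force phase
};

// Because a region is contiguous, "all forces for bodies starting at
// begin" is a single range: a natural address, transfer, and coherence
// granularity, rather than a fixed cache line.
inline float* forceSlice(BodyPools& p, std::size_t begin) {
    return p.forceX.data() + begin;
}
```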
Summary

Current shared-memory models are fundamentally broken: semantics, programmability, and hardware
Disciplined programming models solve these problems

DeNovo = hardware for disciplined programming
- Software-driven memory hierarchy: coherence, consistency, communication, data layout, ...
- Simpler, faster, cooler, cheaper, ...

Sponsor interactions:
- Nick Carter; Mani Azimi / Akhilesh Kumar / Ching-Tsun Chou
- Non-deterministic model: Adam Welc / Tatiana Shpeisman / Yang Ni
- SCC prototype exploration: Jim Held
Next Steps

Phase 1:
- Full implementation, verification, and results for deterministic codes
- Design and results for disciplined non-determinism
- Design for wild non-deterministic, legacy codes
- Continued work with language and application groups
- Explore prototyping on SCC

Phase 2:
- Design and simulation results for the complete DeNovo system running large applications
- Language-oblivious hardware-software interface
- Prototype and tech transfer