Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

This presentation explores shared memory in the context of safe and efficient parallel computing. It tackles challenges such as complex directory coherence, data races, and the mismatched interface between hardware and software, and advocates a disciplined shared-memory model to address these fundamental issues.

  • Multicore
  • Parallelism
  • Memory Hierarchy
  • Shared Memory
  • Disciplined Computing

Presentation Transcript


  1. DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou University of Illinois at Urbana-Champaign and Intel denovo@cs.illinois.edu

  2. Motivation Goal: Safe and efficient parallel computing Easy, safe programming model Complexity-, power-, performance-scalable hardware Today: shared-memory Complex, power- and performance-inefficient hardware * Complex directory coherence, unnecessary traffic, ... Difficult programming model * Data races, non-determinism, composability/modularity?, testing? Mismatched interface between HW and SW, a.k.a. memory model * Can't specify what value a read can return * Data races defy acceptable semantics Fundamentally broken for hardware & software

  3. Motivation Goal: Safe and efficient parallel computing Easy, safe programming model Complexity-, power-, performance-scalable hardware Banish shared memory? Today: shared-memory Complex, power- and performance-inefficient hardware * Complex directory coherence, unnecessary traffic, ... Difficult programming model * Data races, non-determinism, composability/modularity?, testing? Mismatched interface between HW and SW, a.k.a. memory model * Can't specify what value a read can return * Data races defy acceptable semantics Fundamentally broken for hardware & software

  4. Motivation Goal: Safe and efficient parallel computing Easy, safe programming model Complexity-, power-, performance-scalable hardware Banish wild shared memory! Today: shared-memory Complex, power- and performance-inefficient hardware * Complex directory coherence, unnecessary traffic, ... Difficult programming model * Data races, non-determinism, composability/modularity?, testing? Mismatched interface between HW and SW, a.k.a. memory model * Can't specify what value a read can return * Data races defy acceptable semantics Need disciplined shared memory! Fundamentally broken for hardware & software

  5. What is Shared-Memory? Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

  6. What is Shared-Memory? Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

  7. What is Shared-Memory? Wild Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

  8. What is Shared-Memory? Wild Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

  9. What is Shared-Memory? Disciplined Shared-Memory = Global address space + explicit, structured side-effects (replacing implicit, anywhere communication and synchronization)

  10. Benefits of Explicit Effects Strong safety properties: determinism-by-default, explicit & safe non-determinism; simple semantics, composability, testability (Deterministic Parallel Java, DPJ: explicit effects + structured parallel control). Efficiency in complexity, performance, and power: simplify coherence and consistency, optimize communication and storage layout (DeNovo). Disciplined shared memory gives a simple programming model AND complexity-, power-, performance-scalable hardware

  11. DeNovo Hardware Project Exploit discipline for efficient hardware Current driver is DPJ (Deterministic Parallel Java) End goal is language-oblivious interface Software research strategy Deterministic codes Safe non-determinism OS, legacy, Hardware research strategy On-chip: coherence, consistency, communication, data layout Off-chip: similar ideas apply, next step

  12. Current Hardware Limitations Complexity Subtle races and numerous transient states in the protocol Hard to extend for optimizations Storage overhead Directory overhead for sharer lists Performance and power inefficiencies Invalidation and ack messages False sharing Indirection through the directory Fixed cache-line communication granularity

  13. Contributions: addressing each current hardware limitation. Complexity (subtle races and numerous transient states in the protocol; hard to extend for optimizations): no transient states, simple to extend for optimizations. Storage overhead (directory overhead for sharer lists): no storage overhead for directory information. Performance and power inefficiencies (invalidation and ack messages, false sharing, indirection through the directory, traffic from fixed cache-line communication): no invalidation and ack messages, no false sharing, no indirection through the directory, flexible (not cache-line) communication. Result: up to 81% reduction in memory stall time

  14. Outline Motivation Background: DPJ Base DeNovo Protocol DeNovo Optimizations Evaluation Complexity Performance Conclusion and Future Work

  15. Background: DPJ [OOPSLA 09] Extension to Java; fully Java-compatible Structured parallel control: nested fork-join style foreach, cobegin A novel region-based type and effect system Speedups close to hand-written Java programs Expressive enough for irregular, dynamic parallelism

  16. Regions and Effects Region: a name for a set of memory locations Programmer assigns a region to each field and array cell Regions partition the heap Effect: a read or write on a region Programmer summarizes effects of method bodies Compiler checks that Region types are consistent, effect summaries are correct Parallel tasks are non-interfering (no conflicts) Simple, modular type checking (no inter-procedural analysis) Programs that type-check are guaranteed to be deterministic

  17. Example: A Pair Class. Declaring and using region names; region names have static scope (one per class).

      class Pair {
        region Blue, Red;
        int X in Blue;
        int Y in Red;
        void setX(int x) writes Blue { this.X = x; }
        void setY(int y) writes Red { this.Y = y; }
        void setXY(int x, int y) writes Blue; writes Red {
          cobegin {
            setX(x); // writes Blue
            setY(y); // writes Red
          }
        }
      }

      (Figure: a Pair object with field X = 3 in region Pair.Blue and field Y = 42 in region Pair.Red.)

  18. Example: A Pair Class. Writing method effect summaries: each method declares the regions it writes (same Pair class as in slide 17).

      void setX(int x) writes Blue { this.X = x; }
      void setY(int y) writes Red { this.Y = y; }
      void setXY(int x, int y) writes Blue; writes Red { ... }

  19. Example: A Pair Class. Expressing parallelism; the effects of each cobegin branch are inferred and checked for non-interference (same Pair class as in slide 17).

      void setXY(int x, int y) writes Blue; writes Red {
        cobegin {
          setX(x); // writes Blue (inferred)
          setY(y); // writes Red (inferred)
        }
      }

  20. Outline Motivation Background: DPJ Base DeNovo Protocol DeNovo Optimizations Evaluation Complexity Performance Conclusion and Future Work

  21. Memory Consistency Model. Guaranteed determinism: a read returns the value of the last write in sequential order, which is either (1) from the same task in this parallel phase, or (2) from before this parallel phase. (Figure: stores to 0xa before and within a parallel phase, and a later LD 0xa that must return the correct one.)

  22. Memory Consistency Model. Guaranteed determinism: a read returns the value of the last write in sequential order, which is either (1) from the same task in this parallel phase, or (2) from before this parallel phase. (Figure: the same example, highlighting the coherence mechanism required for LD 0xa to see that write.)
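
  The two-case read rule above has a direct operational reading. Below is a minimal sketch in plain Java (class and method names are hypothetical, not DPJ or DeNovo code): during a phase a load returns the loading task's own last store to that address if one exists, otherwise the value the address held before the phase; at the phase boundary the per-task stores merge, and data-race freedom makes that merge deterministic.

      import java.util.HashMap;
      import java.util.Map;

      // Hypothetical model of the DeNovo read rule; illustration only.
      class PhaseMemory {
          private final Map<Long, Integer> prePhase = new HashMap<>();                 // values before this parallel phase
          private final Map<Integer, Map<Long, Integer>> taskWrites = new HashMap<>(); // per-task writes in this phase

          void store(int taskId, long addr, int value) {
              taskWrites.computeIfAbsent(taskId, t -> new HashMap<>()).put(addr, value);
          }

          int load(int taskId, long addr) {
              Map<Long, Integer> mine = taskWrites.getOrDefault(taskId, Map.of());
              if (mine.containsKey(addr)) return mine.get(addr);  // case 1: last write by the same task, this phase
              return prePhase.getOrDefault(addr, 0);              // case 2: value from before the phase
          }

          void endPhase() {
              // Data-race freedom: at most one task wrote each address, so this merge is deterministic.
              taskWrites.values().forEach(prePhase::putAll);
              taskWrites.clear();
          }
      }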

  23. Cache Coherence Coherence Enforcement 1. Invalidate stale copies in caches 2. Track up-to-date copy Explicit effects Compiler knows all regions written in this parallel phase Cache can self-invalidate before next parallel phase * Invalidates data in writeable regions not accessed by itself Registration Directory keeps track of one up-to-date copy Writer updates before next parallel phase
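
  A rough sketch of the self-invalidation step just described, assuming a word-granularity L1 with a per-word "touched" bit; all names here are illustrative, not the actual DeNovo hardware. At the end of a parallel phase the cache drops Valid words that belong to regions written in the phase and that this core neither touched nor registered itself.

      import java.util.Set;

      // Hypothetical word-granularity L1 model; Word fields and region ids are illustration-only.
      class L1Cache {
          enum State { INVALID, VALID, REGISTERED }

          static class Word {
              State state = State.INVALID;
              int region;        // DPJ region this word belongs to
              boolean touched;   // read or written by this core in the current phase
              int value;
          }

          Word[] words = new Word[0];

          // Called at a phase boundary with the regions the compiler reports as written
          // in that phase (the effect behind a call like self_invalidate(Blue)).
          void selfInvalidate(Set<Integer> writtenRegions) {
              for (Word w : words) {
                  boolean mine = w.touched || w.state == State.REGISTERED;
                  if (w.state == State.VALID && writtenRegions.contains(w.region) && !mine) {
                      w.state = State.INVALID;  // possibly stale: another core may have written it
                  }
                  w.touched = false;            // reset per-phase bookkeeping
              }
          }
      }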

  24. Basic DeNovo Coherence. Assume (for now): private L1, shared L2, single-word lines; data-race freedom at word granularity. No space overhead: each entry keeps either valid data or the registered core id, so the L2 data arrays double as the directory ("registry"). No transient states. (Figure: three-state diagram; a read takes Invalid to Valid, a write takes Invalid or Valid to Registered.)
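
  The three states above can be written down as a tiny per-word state machine. This is a hedged sketch in plain Java (the Directory interface and all method names are assumptions, not the real design): a read of an Invalid word fetches the up-to-date copy and becomes Valid; a write registers the word with the L2 registry and becomes Registered; Valid and Registered reads hit locally, and there are no transient states.

      // Hypothetical per-word state machine for basic DeNovo.
      enum WordState { INVALID, VALID, REGISTERED }

      // Stand-in for the shared L2 that doubles as the directory ("registry").
      interface Directory {
          int fetchUpToDate(long addr);          // value from the registered core or from L2 data
          void register(long addr, int coreId);  // L2 records the core id in place of data
      }

      class CoherentWord {
          WordState state = WordState.INVALID;
          int value;

          int read(Directory dir, long addr) {
              if (state == WordState.INVALID) {     // Invalid --read--> Valid
                  value = dir.fetchUpToDate(addr);
                  state = WordState.VALID;
              }
              return value;                         // Valid and Registered reads hit locally
          }

          void write(Directory dir, long addr, int coreId, int newValue) {
              value = newValue;
              if (state != WordState.REGISTERED) {  // Invalid/Valid --write--> Registered
                  dir.register(addr, coreId);
                  state = WordState.REGISTERED;
              }
          }
      }

  In the Example Run slide that follows, each core's setX calls move its X words to Registered in exactly this way, while the L2 registry records which core holds the up-to-date copy.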

  25. Example Run.

      class Pair {
        X in DeNovo-region Blue;
        Y in DeNovo-region Red;
        void setX(int x) writes Blue { this.X = x; }
      }
      Pair PArray[size];
      ...
      Phase1 writes Blue { // DeNovo effect
        foreach i in 0, size {
          PArray[i].setX( );
        }
        self_invalidate(Blue);
      }

      (Figure: the per-word states, Registered / Valid / Invalid, in Core 1's L1, Core 2's L1, and the shared L2 as the two cores write their halves of the X fields, register those words with the L2 registry (which records the core id and acks), and self-invalidate region Blue at the phase boundary; the Y words stay Valid throughout.)

  26. Practical DeNovo Coherence. The basic protocol is impractical: high tag storage overhead (a tag per word). Fix: make the address/transfer granularity larger than the coherence granularity. The DeNovo line-based protocol keeps traditional, software-oblivious spatial locality, but coherence granularity stays at the word, so there is no word-level false sharing. (Figure: cache line merging; one tag per line, with per-word states such as V V R.)
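
  One way to picture "cache line merging" is a line with a single address tag but a coherence state per word, so an incoming reply only fills words this core does not already own. A minimal sketch under that assumption (plain Java, reusing the WordState enum from the sketch above; not RTL):

      // Hypothetical line structure: one tag per line, coherence state per word.
      class DeNovoLine {
          static final int WORDS_PER_LINE = 8;

          final long tag;                                           // address/transfer granularity: the line
          final WordState[] state = new WordState[WORDS_PER_LINE];  // coherence granularity: the word
          final int[] data = new int[WORDS_PER_LINE];

          DeNovoLine(long tag) {
              this.tag = tag;
              java.util.Arrays.fill(state, WordState.INVALID);
          }

          // Merge an incoming line-sized reply: never overwrite words this core has Registered,
          // so two cores writing different words of the same line cannot false-share.
          void merge(int[] reply, boolean[] replyValid) {
              for (int i = 0; i < WORDS_PER_LINE; i++) {
                  if (state[i] != WordState.REGISTERED && replyValid[i]) {
                      data[i] = reply[i];
                      state[i] = WordState.VALID;
                  }
              }
          }
      }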

  27. Current Hardware Limitations Complexity Subtle races and numerous transient states in the protocol Hard to extend for optimizations Storage overhead Directory overhead for sharer lists Performance and power inefficiencies Invalidation and ack messages False sharing Indirection through the directory Fixed cache-line communication granularity

  28. Protocol Optimizations: Insights. 1. A traditional directory must be updated at every transfer; DeNovo can copy valid data around freely. 2. Traditional systems send a cache line at a time; DeNovo uses regions to transfer only the relevant data.
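
  A hedged sketch of the second insight, region-driven transfer: on a miss, the responder packs only words from the requested word's region rather than a fixed cache line. The RegionMap interface and every name below are assumptions for illustration, not the actual DeNovo message format.

      import java.util.ArrayList;
      import java.util.List;

      // Hypothetical responder-side packing for flexible, region-driven transfer.
      class FlexibleTransfer {
          interface RegionMap { int regionOf(long addr); }  // assumed software/compiler-provided metadata

          static class ReplyWord {
              final long addr; final int value;
              ReplyWord(long addr, int value) { this.addr = addr; this.value = value; }
          }

          // On a demand miss for missAddr, also pack nearby valid words of the same region,
          // instead of blindly sending the fixed cache line around missAddr.
          static List<ReplyWord> buildReply(long missAddr, long[] candidateAddrs, int[] values,
                                            RegionMap regions) {
              int region = regions.regionOf(missAddr);
              List<ReplyWord> reply = new ArrayList<>();
              for (int i = 0; i < candidateAddrs.length; i++) {
                  if (regions.regionOf(candidateAddrs[i]) == region) {
                      reply.add(new ReplyWord(candidateAddrs[i], values[i]));
                  }
              }
              return reply;
          }
      }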

  29. Protocol Optimizations: direct cache-to-cache transfer; flexible communication. (Figure: Core 1 issues LD X3 and the data comes directly from Core 2's L1, where X3 is registered; the reply shown carries X3, Y3, Z3. Word states in both L1s and the shared L2 are shown as Registered / Valid / Invalid.)

  30. Protocol Optimizations: direct cache-to-cache transfer; flexible communication (both applied). (Figure: the same LD X3, now served directly from Core 2's L1 with a reply carrying only the relevant region data X3, X4, X5; word states shown as Registered / Valid / Invalid.)

  31. Current Hardware Limitations Complexity Subtle races and numerous transient states in the protocol Hard to extend for optimizations Storage overhead Directory overhead for sharer lists Performance and power inefficiencies Invalidation and ack messages False sharing Indirection through the directory Fixed cache-line communication granularity

  32. Outline Motivation Background: DPJ Base DeNovo Protocol DeNovo Optimizations Evaluation Complexity Performance Conclusion and Future Work

  33. Protocol Verification DeNovo vs. MESI (word granularity) with Murphi model checking. Correctness: six bugs found in the MESI protocol (difficult to find and fix); three bugs in the DeNovo protocol (simple to fix). Complexity: 15x fewer reachable states for DeNovo; 20x difference in verification runtime.

  34. Performance Evaluation Methodology Simulator: Simics + GEMS + Garnet System Parameters 64 cores Simple in-order core model Workloads FFT, LU, Barnes-Hut, and radix from SPLASH-2 bodytrack and fluidanimate from PARSEC 2.1 kd-Tree (two versions) [HPG 09]

  35. MESI Word (MW) vs. DeNovo Word (DW): Memory Stall Time. (Figure: memory stall time, broken into Mem Hit, R L1 Hit, L2 Hit, and L1 stall, across FFT, LU, kdFalse, bodytrack, fluidanimate, radix, kdPadded, and Barnes.) DW's performance is competitive with MW.

  36. MESI Line (ML) vs. DeNovo Line (DL): Memory Stall Time. (Figure: the same memory stall time breakdown and benchmarks as the previous slide.) DL shows about the same or better memory stall time than ML, and outperforms ML significantly on apps with false sharing.

  37. Optimizations on DeNovo Line. (Figure: normalized memory stall time, broken into Mem Hit, R L1 Hit, L2 Hit, and L1 stall, for ML, DL, and DDF on each benchmark.) The combined optimizations perform best, except for LU and bodytrack: apps with low spatial locality suffer from line-granularity allocation.

  38. Network Traffic. (Figure: network traffic split into Invalidation, WB, Write, and Read messages for ML, DL, and DDF on each benchmark.) DeNovo has less traffic than MESI in most cases; DeNovo incurs more write traffic due to word-granularity registration, which can be mitigated with a write-combining optimization.

  39. Conclusion and Future Work DeNovo rethinks hardware for disciplined models Complexity No transient states: 20X faster to verify than MESI Extensible: optimizations without new states Storage overhead No directory overhead Performance and power inefficiencies No invalidations, acks, false sharing, indirection Flexible, not cache-line, communication Up to 81% reduction in memory stall time Future work: region-driven layout, off-chip memory, safe non-determinism, sync, OS, legacy

  40. Thank You!
