Vashon High Expectations & Policies Overview

Slide Note

"Explore the classroom expectations and policies at Vashon High School, including guidelines on behavior, consequences, technology usage, door expectations, and passes. Understand the importance of following the rules to maintain a positive learning environment."

Uploaded by cgra on Mar 07, 2025

Presentation Transcript

www.bsc.es

Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures

Lluc Álvarez, Lluís Vilanova, Miquel Moretó, Marc Casas, Marc González, Xavier Martorell, Nacho Navarro, Eduard Ayguadé, Mateo Valero

Discussion led by: Vijay Thiruvengadam

Introduction

Traditional hardware caching is a problem in future manycores. The slide compares three options on power efficiency, scalability and programmability:
- Caches
- Scratchpad memories (SPMs)
- Hybrid memory hierarchies in HPC (like GPUs)

Introduction: our goal

- Introduce SPMs alongside the L1 cache
  - Advantages in power and scalability
- A combination of compiler, runtime and hardware techniques manages the SPMs and the coherence between SPMs and regular memory
  - Minimal programmer involvement

[Diagram: cores (C), each with an L1 cache and an SPM, attached to a cluster interconnect]

Overview of Solution

- The compiler transforms code based on programmer annotations and compiler analysis
- Memory accesses are split into regular (GM) accesses, scratchpad (SPM) accesses and potentially aliased (guarded GM) accesses
- The hardware intercepts guarded accesses and redirects them to the SPM when they alias with SPM data

[Diagram: cores (C) with L1 caches and SPMs, backed by the L2 and DRAM]

Hybrid Memory Hierarchy: code transformation

- The programmer annotates code suitable for transformation
- The compiler applies tiling transformations
  - Strided accesses are mapped to the SPM
  - Random accesses are mapped to the cache hierarchy

Original loop (strided: a, b; random: c, ptr):

    for (i=0; i<N; i++) {
      a[i] = b[i];
      c[b[i]] = 0;
      ptr[a[i]]++;
    }

Transformed loop, with its control, synch and work phases:

    i=0;
    while (i<N) {
      /* Control */
      MAP (&a[i], _a, iters, tags);
      MAP (&b[i], _b, iters, tags);
      n = (i+iters>N) ? N : i+iters;
      /* Synch */
      SYNCH (tags)
      /* Work */
      for (_i=0; _i<n; _i++, i++) {
        _a[_i] = _b[_i];
        c[_b[_i]] = 0;
        ptr[_a[_i]]++;
      }
    }

[Diagram: core (C) with an L1 cache serving c and ptr, and an SPM holding a and b]

Coherence Problem

- SPMs are not coherent with the cache hierarchy
  - Results are invalid if strided and random accesses alias
- This is a very challenging problem for the compiler
  - It requires alias analysis
- Existing solutions are restrictive or inefficient
  - The programmer identifies private data and applies the code transformations
  - Prior work by the authors proposes a similar solution but restricts each core to accessing only its own SPM
- Our solution solves this issue
  - The compiler can generate code and mark the accesses that might be aliased
  - The hardware intercepts these potentially aliased accesses and resolves them
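The hazard on this slide can be reproduced in plain C. The sketch below is illustrative only: `spm_buf`, `TILE` and `aliased_update` are hypothetical names, and the SPM is emulated with an ordinary private buffer and `memcpy`, not with real scratchpad hardware.

```c
#include <assert.h>
#include <string.h>

#define TILE 4   /* assumed tile size, for illustration only */

/* Hypothetical stand-in for an SPM: a private buffer with no coherence. */
static int spm_buf[TILE];

/* Returns the final value of a[0] after an update through an alias.
 * When guarded is non-zero, the aliased access is redirected to the
 * SPM copy, which is the behaviour the slide attributes to the hardware. */
int aliased_update(int guarded)
{
    int a[TILE] = {0, 0, 0, 0};    /* array in regular (global) memory */
    int *ptr = a;                  /* alias the compiler cannot rule out */

    memcpy(spm_buf, a, sizeof a);  /* map the tile into the "SPM" */
    spm_buf[0] = 10;               /* strided access, served by the SPM */

    if (guarded)
        spm_buf[ptr - a] += 1;     /* diverted to the valid copy */
    else
        ptr[0] += 1;               /* hits the stale copy in global memory */

    memcpy(a, spm_buf, sizeof a);  /* write the tile back */
    return a[0];
}
```

Without the guard the increment is overwritten by the write-back and the function returns 10; with the guard it returns 11, which is what the untransformed sequential code would produce.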

Outline

- Introduction
  - Hybrid memory hierarchy
  - Coherence problem
- Hardware-software coherence protocol
  - Compiler support
  - Hardware design
- Evaluation
  - Experimental framework
  - Comparison with cache hierarchies
- Conclusions

Hardware-Software Coherence Protocol

- Basic idea
  - Avoid maintaining two coherent copies of the data
  - Ensure the valid copy is always accessed
- The compiler detects potentially incoherent accesses
  - It emits guarded memory instructions
- The hardware diverts them to the valid copy of the data
  - A distributed hardware directory tracks the contents of the SPMs
  - A hierarchy of filters tracks data not mapped to any SPM on any core

Compiler Support

Step 1: Classification of memory references
- Strided accesses
- Random accesses
- Potentially incoherent accesses

    for (i=0; i<N; i++) {
      a[i] = b[i];
      c[b[i]] = 0;
      ptr[a[i]]++;
    }

Strided: a, b
Random: c
Potentially incoherent: ptr

[Diagram: core (C) with an L1 cache serving c, a directory (Dir) handling ptr, and an SPM holding a and b]

Compiler Support

Step 2: Code transformation
- Only for strided accesses
- Apply tiling
- Change memory references

Original loop:

    for (i=0; i<N; i++) {
      a[i] = b[i];
      c[b[i]] = 0;
      ptr[a[i]]++;
    }

Transformed loop:

    i=0;
    while (i<N) {
      /* Control */
      MAP (&a[i], _a, iters, tags);
      MAP (&b[i], _b, iters, tags);
      n = (i+iters>N) ? N : i+iters;
      /* Synch */
      SYNCH (tags)
      /* Work */
      for (_i=0; _i<n; _i++, i++) {
        _a[_i] = _b[_i];
        c[_b[_i]] = 0;
        ptr[_a[_i]]++;
      }
    }
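A runnable approximation of the tiled loop may help when reading this step. It is a sketch under stated assumptions, not the authors' runtime: `tiled_kernel` and `ITERS` are hypothetical names, the `MAP`/`SYNCH` pair is replaced by a synchronous `memcpy` into stack buffers, and `n` is computed here as the length of the current tile so that the inner loop can count from zero.

```c
#include <assert.h>
#include <string.h>

#define ITERS 4   /* tile size in elements; an assumed value */

/* Tiled version of:
 *   for (i=0; i<N; i++) { a[i] = b[i]; c[b[i]] = 0; ptr[a[i]]++; }
 * Assumes c and ptr do not overlap a or b, so no guarded accesses are
 * needed in this emulation. */
void tiled_kernel(int *a, const int *b, int *c, int *ptr, int N)
{
    int _a[ITERS], _b[ITERS];   /* stand-ins for the SPM buffers */
    int i = 0;

    while (i < N) {
        /* Control: length of this tile, clipped at the end of the arrays. */
        int n = (N - i < ITERS) ? N - i : ITERS;

        /* Stand-in for MAP + SYNCH: copy the input tile into the "SPM". */
        memcpy(_b, &b[i], (size_t)n * sizeof(int));

        /* Work: strided accesses use the buffers, c and ptr stay in memory. */
        for (int _i = 0; _i < n; _i++) {
            _a[_i] = _b[_i];
            c[_b[_i]] = 0;
            ptr[_a[_i]]++;
        }

        /* Stand-in for the write-back of the output tile. */
        memcpy(&a[i], _a, (size_t)n * sizeof(int));
        i += n;
    }
}
```

Calling `tiled_kernel(a, b, c, ptr, 6)` with `ITERS` set to 4 processes one full tile and one partial tile of two elements.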

Compiler Support

Step 3: Code generation
- Guarded memory instructions for potentially incoherent accesses

    i=0;
    while (i<N) {
      /* Control */
      MAP (&a[i], _a, iters, tags);
      MAP (&b[i], _b, iters, tags);
      n = (i+iters>N) ? N : i+iters;
      /* Synch */
      SYNCH (tags)
      /* Work */
      for (_i=0; _i<n; _i++, i++) {
        // _a[_i] = _b[_i];
        ld _b(,_i,4),r1
        st r1,_a(,_i,4)
        // c[_b[_i]] = 0;
        mv #0,r3
        st r3,c(,r1,4)
        // ptr[_a[_i]]++;
        gld ptr(,r1,4),r2
        inc r2,r2
        gst r2,ptr(,r1,4)
      }
    }

Hardware Design

- Distributed hardware directory (SPMDir)
  - One directory CAM per core
  - Each core tracks the contents of its SPM
  - Maps a regular address to an SPM address
- Filter
  - One filter CAM per core
  - Tracks data not mapped to any SPM
- Directory of filters (FilterDir)
  - Tracks the contents of all filters
  - Located at the L2 cache shared by all cores
  - Given a regular address, returns the list of cores that have it in their local filters

[Diagram: a core containing the CPU, TLB, L1D, SPM, SPMDir and Filter; the FilterDir sits next to the cache directory and stores an address (@) and its sharers, while the cache directory stores an address, sharers and status]
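The address translation performed by the SPMDir can be pictured with a small software model. This is a sketch under assumptions: the slide only states that the SPMDir maps a regular address to an SPM address, so the entry fields (`gm_base`, `size`, `spm_base`) and the function names are hypothetical, and the loop stands in for the parallel match a CAM would perform. The 32-entry size is the one listed in the experimental framework.

```c
#include <assert.h>
#include <stdint.h>

#define SPMDIR_ENTRIES 32   /* SPMDir size used in the evaluation */

/* Hypothetical entry layout; the slides do not specify the fields. */
typedef struct {
    int       valid;
    uintptr_t gm_base;   /* start of the mapped range in regular memory */
    uint32_t  size;      /* length of the mapped range in bytes */
    uint32_t  spm_base;  /* offset of the range inside the SPM */
} spmdir_entry;

typedef struct { spmdir_entry e[SPMDIR_ENTRIES]; } spmdir;

/* Records a mapping; returns the slot used, or -1 if the directory is full. */
int spmdir_map(spmdir *d, uintptr_t gm_base, uint32_t size, uint32_t spm_base)
{
    for (int i = 0; i < SPMDIR_ENTRIES; i++) {
        if (!d->e[i].valid) {
            d->e[i].valid = 1;
            d->e[i].gm_base = gm_base;
            d->e[i].size = size;
            d->e[i].spm_base = spm_base;
            return i;
        }
    }
    return -1;
}

/* Returns 1 and writes the SPM address on a hit, 0 on a miss. */
int spmdir_lookup(const spmdir *d, uintptr_t addr, uint32_t *spm_addr)
{
    for (int i = 0; i < SPMDIR_ENTRIES; i++) {
        const spmdir_entry *x = &d->e[i];
        if (x->valid && addr >= x->gm_base && addr - x->gm_base < x->size) {
            *spm_addr = x->spm_base + (uint32_t)(addr - x->gm_base);
            return 1;
        }
    }
    return 0;
}
```

A range-based entry is only one possible design; a real CAM might instead match on a fixed-size tag derived from the address.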

Hardware Design: strided accesses

[Diagram: local core and remote core, each with TLB, L1D, SPM, SPMDir and Filter, plus the shared FilterDir]

Hardware Design: random accesses

[Diagram: local core and remote core, each with TLB, L1D, SPM, SPMDir and Filter, plus the shared FilterDir]

Hardware Design: potentially incoherent accesses

Four cases:
- No mapping in SPMs
- Mapping in local SPM
- Mapping in remote SPM
- No mapping with filter miss

When data is mapped to some SPM:
- Update SPMDir
- Invalidate filters

[Diagram: local core and remote core, each with TLB, L1D, SPM, SPMDir and Filter, plus the shared FilterDir, annotated with the HIT/MISS outcome of each lookup and with filter updates (UP)]
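The four cases can be traced with a toy model. This sketch is one reading of the slide, not the hardware specification: the names, the `CORES` and `BLOCKS` sizes and the block granularity are assumptions, the FilterDir is not modelled as a separate structure (the loops over all cores stand in for its sharer list), and the lookups that real hardware would perform in parallel are done in sequence.

```c
#include <assert.h>
#include <string.h>

#define CORES  4    /* assumed small system, for illustration */
#define BLOCKS 16   /* assumed number of tracked address blocks */

typedef enum { FROM_LOCAL_SPM, FROM_REMOTE_SPM, FROM_CACHE } source;

typedef struct {
    unsigned char spmdir[CORES][BLOCKS]; /* 1: block is mapped to that core's SPM */
    unsigned char filter[CORES][BLOCKS]; /* 1: core knows the block is in no SPM */
} machine;

void machine_init(machine *m) { memset(m, 0, sizeof *m); }

/* Resolves a guarded access issued by `core` to `block`. */
source guarded_access(machine *m, int core, int block)
{
    if (m->spmdir[core][block])          /* mapping in local SPM */
        return FROM_LOCAL_SPM;
    if (m->filter[core][block])          /* filter hit: no mapping in SPMs */
        return FROM_CACHE;
    for (int c = 0; c < CORES; c++)      /* filter miss: ask the other cores */
        if (c != core && m->spmdir[c][block])
            return FROM_REMOTE_SPM;      /* mapping in remote SPM */
    m->filter[core][block] = 1;          /* no mapping: update the filter */
    return FROM_CACHE;
}

/* Maps `block` to the SPM of `core`: update SPMDir, invalidate filters. */
void map_block(machine *m, int core, int block)
{
    m->spmdir[core][block] = 1;
    for (int c = 0; c < CORES; c++)
        m->filter[c][block] = 0;
}
```

In this model a first guarded access to an unmapped block fills the local filter, so later accesses to the same block resolve locally until some core maps the block to its SPM.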

Experimental Framework

- Gem5
  - x86, 64 OoO cores
  - L1 32KB, SPM 32KB, L2 256KB
- McPAT
  - 22nm
  - Clock gating
- NAS benchmarks
  - CG, EP, FT, IS, MG, SP
  - Tiling transformations by hand
- Potentially incoherent accesses
  - GCC alias analysis report
  - Unused x86 instruction prefix

Parameter            Description
Cores                64 cores, OoO, 6 instruction wide, 2GHz
Pipeline front end   13 cycles. 4-way BTB 4K entries, RAS 32 entries
Branch predictor     4K selector, 4K G-share, 4K Bimodal
Execution            ROB 160 entries. IQ 96 entries. LQ/SQ 48/32 entries. 3 INT ALU, 3 FP ALU, 3 LD/ST units. 256/256 INT/FP register file. Full bypass
L1 I-cache           2 cycles, 32KB, 4-way, pseudoLRU
L1 D-cache           2 cycles, 32KB, 4-way, pseudoLRU, stride prefetcher
L2 cache             Shared unified NUCA 16MB, sliced 256KB/core, 15 cycles, 16-way, pseudoLRU
Cache coherence      MOESI. Distributed 4-way cache directory 64K entries
NoC                  Mesh. Link 1 cycle, router 1 cycle
SPM                  2 cycles, 32KB
DMAC                 Command queue 32 entries in-order. Bus request queue 512 entries in-order
SPMDir               32 entries
Filter               48 entries, fully associative, pseudoLRU
FilterDir            Distributed 4K entries, fully associative, pseudoLRU

Comparison with Cache Hierarchies: performance

[Bar chart: speedup per benchmark (CG, EP, FT, IS, MG, SP, AVG) for the hybrid memory hierarchy (32KB L1 + 32KB SPM) and the cache hierarchy (64KB L1). Data labels shown on the chart: 22%, 14%, 12%, 3%]

Comparison with Cache Hierarchies: NoC traffic

[Bar chart: normalized packets per benchmark (CG, EP, FT, IS, MG, SP, AVG) for the hybrid memory hierarchy (32KB L1 + 32KB SPM) and the cache hierarchy (64KB L1). Data labels shown on the chart: 2%, 20%, 23%, 34%]

Comparison with Cache Hierarchies: energy consumption

[Bar chart: normalized energy per benchmark (CG, EP, FT, IS, MG, SP, AVG) for the hybrid memory hierarchy (32KB L1 + 32KB SPM) and the cache hierarchy (64KB L1). Data labels shown on the chart: -3%, 13%, 15%, 24%]

Conclusions

- Hybrid memory hierarchy
  - Attractive solution for future manycores
  - Coherence problem
- Hardware-software coherence protocol
  - Straightforward compiler support
  - Simple hardware design with low overheads
- The hybrid memory hierarchy can be programmed with shared memory programming models
- The hybrid memory hierarchy outperforms cache hierarchies
  - Average speedup of 14%
  - Average NoC traffic reduction of 23%
  - Average energy consumption reduction of 15%

Thanks for your attention! Questions?
