
Modern Hierarchical Multi-GPU Cache Coherence Protocols
Explore the challenges of cache coherence in modern hierarchical multi-GPU systems and the proposed solutions to mitigate NUMA impact, achieve high performance, and overcome scalability limitations of existing protocols.
Presentation Transcript
HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems
Xiaowei Ren (1,2), Daniel Lustig (2), Evgeny Bolotin (2), Aamer Jaleel (2), Oreste Villa (2), David Nellans (2)
(1) The University of British Columbia, (2) NVIDIA
Coming Up
- NUMA behavior bottlenecks performance in multi-GPU systems
- Mitigating the NUMA impact requires caching and cache coherence, but existing cache coherence protocols do not scale
- Our work:
  - Insight: a weaker memory model, less data sharing, and a latency-tolerant architecture
  - Achieves 97% of the performance of an idealized caching system
HMG: Hierarchical Multi-GPUs
[Figure: a single GPU is either monolithic or an MCM-GPU built from GPMs (GPU Chip Modules) connected on-package at ~2 TB/s; multiple GPUs are connected through NV-Switch at ~200 GB/s.]
- NUMA behavior bottlenecks performance scaling
- Mitigating it requires caching and cache coherence protocols (a small topology sketch follows)
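To make the hierarchy concrete, here is a minimal sketch (host-side CUDA/C++) of the two-level topology described on this slide; the struct and field names are ours, and the bandwidth values are simply the rough figures quoted above.

```cuda
// Minimal sketch of the two-level multi-GPU topology from this slide.
// Names and the exact constants are illustrative, not from the HMG paper.
struct MultiGpuTopology {
    int gpus;                   // GPUs connected through NV-Switch
    int gpms_per_gpu;           // GPU Chip Modules (GPMs) per GPU package
    double intra_gpu_bw_gbps;   // on-package inter-GPM bandwidth (~2 TB/s)
    double inter_gpu_bw_gbps;   // NV-Switch inter-GPU bandwidth (~200 GB/s)
};

// The 4-GPU, 4-GPM-per-GPU configuration evaluated later in the talk.
static const MultiGpuTopology kEvaluatedSystem = {4, 4, 2000.0, 200.0};
```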
Existing Coherence Protocols Don't Scale: SW Cache Coherence
[Figure: normalized speedup of No-Caching, SW, and Ideal caching on a 4-GPU system with 4 GPMs (GPU Chip Modules) per GPU; a 29% difference is annotated.]
- Entire cache invalidation at synchronization points (sketched below)
- Much worse than idealized caching
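As a rough illustration of the cost, the hypothetical sketch below shows what software coherence must do at each synchronization point; the types and function names are ours, not the paper's implementation.

```cuda
// Hypothetical sketch of software coherence: every acquire-side
// synchronization conservatively drops ALL cached lines, because software
// cannot know which lines a remote writer actually modified.
struct CacheLine { unsigned long long tag; bool valid; };

struct SwCoherentCache {
    static const int kLines = 1 << 15;
    CacheLine lines[kLines];

    // Invoked at every synchronization point (e.g., an acquire or a
    // kernel boundary): invalidate the entire cache.
    void invalidate_all() {
        for (int i = 0; i < kLines; ++i)
            lines[i].valid = false;   // even untouched lines are discarded
    }
};
```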
Existing Coherence Protocols Don't Scale: HW-VI Cache Coherence
[Figure: normalized speedup of No-Caching, SW, HW, and Ideal caching on the same 4-GPU system; a 21% difference is annotated.]
- Fine-grained cache-line invalidations; slightly better than software coherence
- No consideration of the NUMA effect, even though inter-GPU link bandwidth is the critical bottleneck
- Assumes a stronger memory model (see the sketch below)
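For contrast, here is a hedged sketch of how a conventional HW-VI directory behaves under a strong memory model; the helper names are ours. The point is the transient ack-counting state and the resulting store stall, which HMG later eliminates.

```cuda
#include <cstdint>
#include <vector>

// Hypothetical HW-VI directory entry under a strong memory model:
// a store to a shared line sends per-line invalidations and must then
// wait for every sharer's acknowledgement before it can complete.
static void send_invalidate(int gpm, uint64_t tag) { /* network stub */ }

struct DirEntry {
    uint64_t tag = 0;
    std::vector<int> sharers;   // GPM IDs currently caching this line
    int pending_acks = 0;       // transient state: store blocked until 0
};

void handle_store(DirEntry& e, int requester) {
    for (int gpm : e.sharers) {
        if (gpm == requester) continue;
        send_invalidate(gpm, e.tag);   // fine-grained, per-line invalidation
        ++e.pending_acks;              // store stalls until all acks return
    }
}
```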
Leveraging the Scoped Memory Model
- Synchronization is scoped: coherence only needs to be enforced in the subset of caches covered by the scope in question
- Store results can become visible to some threads earlier than others
- Store requests do not need to stall until all other sharers are invalidated (illustrated below)
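At the programming level this scoping is visible through constructs such as libcu++ scoped atomics. The kernel pair below is a minimal illustration (kernel and variable names are ours, assuming a recent CUDA toolkit that provides cuda::atomic_ref); it is not code from the paper.

```cuda
#include <cuda/atomic>

// A device-scoped release/acquire pair: the release must become visible
// GPU-wide (through the L2 caches), whereas a block-scoped release would
// only need to reach the CTA's L1 -- the scope-to-cache mapping HMG uses.
__global__ void producer(int* payload, int* flag) {
    *payload = 42;
    cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
    // Lowers to a st.release.gpu: visible to all threads on this GPU, yet
    // the store need not wait for other sharers to be invalidated.
    f.store(1, cuda::std::memory_order_release);
}

__global__ void consumer(int* payload, int* out, int* flag) {
    cuda::atomic_ref<int, cuda::thread_scope_device> f(*flag);
    // Lowers to an ld.acquire.gpu: spin until the release is visible.
    while (f.load(cuda::std::memory_order_acquire) == 0) { }
    *out = *payload;   // guaranteed to observe 42 after the acquire
}
```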
HMG Overview
[Figure: each GPU contains GPMs (GPU Modules), each with SMs + L1 caches and an L2 cache slice holding a coherence directory with State / Tag / Sharers fields; GPUs are connected through NV-Switch.]
- Directory-based cache coherence that keeps track of all sharers
- Synchronization scopes map to caches: .cta → L1 cache, .gpu/.sys → L2 cache
- L1 coherence is software-maintained; HMG mainly focuses on the L2 caches (a directory sketch follows)
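One possible encoding of the directory state and the scope-to-cache mapping from this slide is sketched below; the field widths and names are ours.

```cuda
#include <cstdint>

// Per-line directory state implied by the State / Tag / Sharers fields.
enum class LineState : uint8_t { Invalid, Valid };   // no transient states

struct DirectoryEntry {
    LineState state;     // V or I
    uint64_t  tag;       // address tag of the tracked cache line
    uint16_t  sharers;   // bitmask: one bit per GPM holding a copy
};

// Scope-to-cache mapping: .cta synchronization is handled in the
// software-coherent L1, .gpu/.sys synchronization in the L2 that the
// hardware directory protects.
enum class Scope { cta, gpu, sys };

int coherence_level(Scope s) {
    return (s == Scope::cta) ? 1 : 2;   // 1 = L1 cache, 2 = L2 cache
}
```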
Extending to the Scoped GPU Memory Model
[Figure: GPM1 is the home of address A; a store to A from GPM0 writes through to the home L2, whose directory (V:A:[sharers]) sends invalidations to the other sharing GPMs; loads of A can hit in any cache.]
- Assign a home cache to each address; stores write through to it
- Non-atomic loads can hit in all caches
- Since some GPMs can see the latest value earlier than others: no invalidation acks for store requests, and no transient states, which reduces stalls (sketched below)
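The resulting store path might look like the hedged sketch below (helper names are ours): the home L2 directory pushes invalidations to the other sharers but never waits for acknowledgements, so no transient directory state is needed.

```cuda
#include <cstdint>
#include <vector>

// Sketch of a scoped, write-through store at the home L2 directory.
static void send_invalidate(int gpm, uint64_t tag) { /* network stub */ }

struct HomeEntry {
    uint64_t tag = 0;
    std::vector<int> sharers;   // GPMs currently holding the line
};

void handle_write_through_store(HomeEntry& e, int requester) {
    for (int gpm : e.sharers)
        if (gpm != requester)
            send_invalidate(gpm, e.tag);   // no ack is expected
    e.sharers.assign(1, requester);        // requester is now the sole sharer
    // The store retires immediately: scoped consistency tolerates some GPMs
    // observing the new value earlier than others.
}
```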
Extending to the Scoped GPU Memory Model (cont.)
[Figure: GPM0 issues st.release.gpu A; the store writes through to the home (GPM1), the release is forwarded to all GPMs, and each GPM acks once its in-flight invalidations are cleared.]
- An ld.acquire with a scope greater than .cta invalidates the L1 cache, but not the L2
- A st.release is forwarded to all GPMs to clear all in-flight invalidations
- The st.release retires only after it has been acked (see the sketch below)
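A sketch of this release handling, with hypothetical helper names: unlike plain stores, the release counts acknowledgements from every other GPM and retires only once the count drains to zero.

```cuda
#include <cstdint>

// Hypothetical release tracking at the issuing GPM / home directory.
static void forward_release(int gpm, uint64_t addr) { /* network stub */ }

struct ReleaseTracker {
    int pending_acks = 0;
};

void issue_release(ReleaseTracker& t, uint64_t addr, int num_gpms, int self) {
    for (int gpm = 0; gpm < num_gpms; ++gpm) {
        if (gpm == self) continue;
        forward_release(gpm, addr);   // drains in-flight invalidations there
        ++t.pending_acks;
    }
}

void on_release_ack(ReleaseTracker& t) {
    if (--t.pending_acks == 0) {
        // Every GPM has cleared its invalidations: the st.release can now
        // retire and be acknowledged to the requesting SM.
    }
}
```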
Problem of Extending to Multi-GPUs
[Figure: GPM1 of GPU1 loads address A, whose home is GPM0 of GPU0; the reply crosses the ~200 GB/s inter-GPU link instead of using the ~2 TB/s inter-GPM network.]
- The critical bottleneck is the NUMA effect caused by the bandwidth gap between the inter-GPM network (~2 TB/s) and the inter-GPU network (~200 GB/s)
- 67% of inter-GPU loads are redundant
- Recording data sharing hierarchically can avoid these redundant inter-GPU loads
Hierarchical Multi-GPU Cache Coherence
[Figure: GPM0 of GPU0 is the system home of A and GPM0 of GPU1 is its GPU home; the system-home directory records sharer GPUs (V:A:[GPU1]) and the GPU-home directory records sharer GPMs (V:A:[GPM1]); a load from GPM1 of GPU1 is served through its GPU home.]
- Unlike CPUs, no extra structures or coherence states are needed to reduce latency
- Each address is assigned both a system home cache and a GPU home cache (a mapping sketch follows)
- Loads and store invalidations are propagated hierarchically
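One way to realize the two-level home assignment is a simple address-interleaving function, sketched below; the hash and the 128 B line size are illustrative assumptions, not details taken from the paper.

```cuda
#include <cstdint>

// Two-level home assignment: every address has one system home GPU and
// one GPU home GPM inside each GPU (the interleaving function is ours).
struct HomeIds {
    int sys_home_gpu;   // GPU acting as the system-level home
    int gpu_home_gpm;   // GPM acting as the GPU-level home
};

HomeIds home_of(uint64_t addr, int num_gpus, int gpms_per_gpu) {
    uint64_t line = addr >> 7;   // assume 128 B cache lines
    HomeIds h;
    h.sys_home_gpu = static_cast<int>(line % num_gpus);
    h.gpu_home_gpm = static_cast<int>((line / num_gpus) % gpms_per_gpu);
    return h;
}
// A remote load first asks its GPU home; only on a miss there does one
// request cross the ~10x slower inter-GPU link to the system home, so
// later loads from sibling GPMs are served on-package.
```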
Hierarchical Multi-GPU Cache Coherence (cont.)
[Figure: a store to A from GPM0 of GPU0 sends an invalidation from the system home to the sharing GPU (GPU1), whose GPU home forwards it to its sharing GPMs.]
- ld.acquire and st.release behave as in the single-GPU scenario
- A st.release retires only after it has been acked (fan-out sketched below)
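The invalidation fan-out can then be sketched as follows (names are ours): the system home sends one message per sharing GPU, and each GPU home expands it to its local sharer GPMs, keeping inter-GPU traffic minimal.

```cuda
#include <cstdint>
#include <vector>

// Hierarchical invalidation: the system home tracks sharer GPUs, each GPU
// home tracks its own sharer GPMs, so one inter-GPU message can invalidate
// many GPM copies.
static void send_inter_gpu_inv(int gpu, uint64_t tag) { /* NV-Switch stub */ }
static void send_intra_gpu_inv(int gpm, uint64_t tag) { /* on-package stub */ }

struct SysHomeEntry { uint64_t tag; std::vector<int> sharer_gpus; };
struct GpuHomeEntry { uint64_t tag; std::vector<int> sharer_gpms; };

// At the system home: one invalidation per sharing GPU (except the writer's).
void invalidate_at_sys_home(const SysHomeEntry& e, int writer_gpu) {
    for (int gpu : e.sharer_gpus)
        if (gpu != writer_gpu)
            send_inter_gpu_inv(gpu, e.tag);
}

// At each receiving GPU home: fan out to the local sharer GPMs.
void invalidate_at_gpu_home(const GpuHomeEntry& e) {
    for (int gpm : e.sharer_gpms)
        send_intra_gpu_inv(gpm, e.tag);
}
```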
Overall Performance
[Figure: normalized speedup of No-Caching, SW, HW, HMG, and Ideal caching.]
- HMG is only 3% slower than idealized caching
- Why? Each store request invalidates only 1.5 valid cache lines, each coherence directory eviction invalidates only 1 valid cache line, and the bandwidth cost of invalidation messages is just 3.58 GB/s
Hardware Cost and Scalability
[Figure: normalized speedup of No-Caching, HMG, HMG-50%, and Ideal caching.]
- The coherence directory costs only 2.7% of each GPM's L2 cache data capacity
- HMG-50%: cutting the coherence directory size by 50% makes performance only slightly worse
- HMG is scalable to future, larger multi-GPU systems
Summary
- Hierarchical cache coherence is necessary to mitigate the NUMA effect in multi-GPUs
- Unlike CPUs, extending coherence to multi-GPUs does not need extra hardware structures or transient coherence states
- Leveraging the latest scoped memory model can significantly simplify the design of a GPU coherence protocol
Thank you for listening!