Managing GPU Concurrency in Heterogeneous Architectures

When sharing the memory hierarchy, CPU and GPU applications interfere with each other, impacting performance. This study proposes warp scheduling strategies to adjust GPU thread-level parallelism for improved overall system performance across heterogeneous architectures.

  • GPU Concurrency
  • Heterogeneous Architectures
  • Warp Scheduling
  • System Performance
  • CPU-GPU Interaction


Presentation Transcript


  1. Managing GPU Concurrency in Heterogeneous Architectures. Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das

  2. Era of Heterogeneous Architectures: NVIDIA Denver, NVIDIA Echelon, AMD Fusion, Intel Haswell.

  3. Executive Summary. When sharing the memory hierarchy, CPU and GPU applications interfere with each other; GPU applications significantly affect CPU applications due to multi-threading. Existing GPU thread-level parallelism (TLP) management techniques (MICRO 2012, PACT 2013) are unaware of CPUs and not effective in heterogeneous systems. Our proposal: warp scheduling strategies that adjust GPU TLP to improve CPU and/or GPU performance.

  4.–9. Executive Summary (progressive build). CPU-centric strategy (CM-CPU): memory congestion degrades CPU performance; IF memory congestion is high → reduce GPU TLP. Results summary: +24% CPU, −11% GPU. CPU-GPU balanced strategy (CM-BAL): additionally track GPU latency tolerance; IF GPU latency tolerance is low → increase GPU TLP. Results summary: +7% for both CPU and GPU.

  10. Outline: Summary, Background, Motivation, Analysis of TLP, Our Proposal, Evaluation, Conclusions.

  11. Many-core Architecture. [Diagram: throughput-optimized SIMT (GPU) cores, each with a warp scheduler, CTAs, ALUs, and L1 caches, and latency-optimized CPU cores, each with a scheduler, ROB, ALUs, and L1 caches, connected through an interconnect to the shared LLC (L2) cache and DRAM.]

  12. Outline: Summary, Background, Motivation, Analysis of TLP, Our Proposal, Evaluation, Conclusions.

  13. Application Interference. [Charts: normalized GPU IPC for KM, MM, PVR with and without co-running CPU applications (noCPU), and normalized CPU IPC for mcf, omnetpp, perlbench with and without a co-running GPU application (noGPU).] GPU applications are affected moderately by CPU interference (up to 20%); CPU applications are affected significantly by GPU interference (up to 85%).

  14. Latency Tolerance in CPUs vs. GPUs. High GPU TLP → memory system congestion; high GPU TLP → low CPU performance. GPU cores can tolerate latencies due to multi-threading. [Chart: normalized GPU and CPU IPC vs. GPU concurrency; DYNCTA (PACT 2013) marked; higher performance potential at low TLP.] Problem: TLP management strategies for GPUs are not aware of the latency-tolerance disparity between CPU and GPU applications.

  15. Outline: Summary, Background, Motivation, Analysis of TLP, Our Proposal, Evaluation, Conclusions.

  16. Effect of GPU Concurrency on GPU Performance: reduction in GPU TLP → GPU performance.

  17. Effect of GPU Concurrency on CPU Performance: reduction in GPU TLP → CPU performance.

  18. Effect of GPU Concurrency on CPU Performance: how can the change in CPU performance be predicted? Two metrics: memory congestion and network congestion. CPU performance degrades as congestion grows.
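
For concreteness, here is a minimal C++ sketch of a per-window monitor for these two metrics, in the style of a simulator model. The counter names, the event hooks, and the idea of a fixed sampling window are illustrative assumptions, not details given on the slide.

    #include <cstdint>

    // Tracks the two congestion signals over one sampling window.
    // Memory congestion is approximated by stalled memory requests,
    // network congestion by replies stalled in the interconnect.
    struct CongestionMonitor {
        uint32_t stalledMemRequests = 0;  // memory congestion proxy
        uint32_t stalledNetReplies  = 0;  // network congestion proxy

        void OnMemRequestStalled() { ++stalledMemRequests; }
        void OnNetReplyStalled()   { ++stalledNetReplies;  }

        // Called at the end of each sampling window, after the counts
        // have been consumed by the TLP-management logic.
        void ResetWindow() { stalledMemRequests = stalledNetReplies = 0; }
    };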

  19. Outline: Summary, Background, Motivation, Analysis of TLP, Our Proposal, Evaluation, Conclusions.

  20. Our Approach. [Diagram: improved CPU performance vs. improved GPU performance; existing works, the CPU-centric strategy, and the CPU-GPU balanced strategy, which additionally controls the trade-off between the two.]

  21. CM-CPU: CPU-centric Strategy. Categorize memory congestion and network congestion each as low (L), medium (M), or high (H). If either metric is high, decrease the number of active warps; if both are low, increase the number of warps; otherwise, make no change. Caveat: this GPU-unaware TLP management can leave GPU cores with insufficient latency tolerance.
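
A compact C++ sketch of this decision rule follows. The low/medium/high classification and the increase/decrease/no-change actions come from the slide; the concrete thresholds and the warp-count step size are illustrative assumptions.

    #include <cstdint>

    enum class Level { Low, Medium, High };

    // Map a raw per-window stall count to a congestion level.
    // Thresholds are hypothetical placeholders, not values from the paper.
    Level Classify(uint32_t stalls) {
        constexpr uint32_t kLow = 16, kHigh = 64;
        if (stalls < kLow)  return Level::Low;
        if (stalls < kHigh) return Level::Medium;
        return Level::High;
    }

    // CM-CPU: GPU-unaware, CPU-centric adjustment of the active warp count.
    // Either metric high -> throttle GPU TLP; both low -> raise it;
    // otherwise hold steady.
    int CmCpuDecision(Level memCongestion, Level netCongestion) {
        constexpr int kWarpStep = 2;  // hypothetical adjustment granularity
        if (memCongestion == Level::High || netCongestion == Level::High)
            return -kWarpStep;
        if (memCongestion == Level::Low && netCongestion == Level::Low)
            return +kWarpStep;
        return 0;
    }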

  22.–25. CM-BAL: CPU-GPU Balanced Strategy (progressive build). Latency tolerance of GPU cores is measured by stallGPU, the number of cycles the warp scheduler stalls at a GPU core. Low latency tolerance or high memory congestion drives stallGPU up. When stallGPU is high, CM-BAL overrides CM-CPU and increases GPU TLP; the override can only increase TLP, and otherwise CM-BAL applies the same strategy as CM-CPU. Controlling when this condition triggers controls the trade-off between CPU and GPU benefits.
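
Building on the CM-CPU sketch above, the following C++ fragment illustrates the override; the stallGPU threshold and the step size are again illustrative assumptions.

    #include <cstdint>

    // CM-BAL: take CM-CPU's proposed warp-count change and let GPU
    // latency tolerance override it, only ever in the upward direction.
    int CmBalDecision(int cmCpuDecision, uint32_t stallGpuCycles) {
        constexpr uint32_t kStallThreshold = 1024;  // hypothetical, per window
        constexpr int kWarpStep = 2;                // hypothetical granularity
        // Many scheduler stall cycles mean no warp was ready to issue,
        // i.e., the GPU cores are running out of latency tolerance.
        if (stallGpuCycles > kStallThreshold)
            return +kWarpStep;     // override: grow GPU TLP
        return cmCpuDecision;      // otherwise behave exactly like CM-CPU
    }

In this sketch, kStallThreshold is the knob the last build of the slide alludes to: lowering it triggers the override more often and favors GPU throughput, while raising it favors CPU performance.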

  26. Outline: Summary, Background, Motivation, Analysis of TLP, Our Proposal, Evaluation, Conclusions.

  27. Evaluated Architecture. [Diagram: tile-based design laying out GPU tiles, CPU tiles, and LLC/memory-controller (LLC/MC) tiles on the chip.]

  28. Evaluation Methodology. Evaluated on an integrated platform combining an in-house x86 CPU simulator with GPGPU-Sim. Baseline architecture: 28 GPU cores, 14 CPU cores, 8 memory controllers, 2D mesh interconnect. GPU: 1400 MHz, SIMT width 16×2, max. 1536 threads/core, GTO scheduler. CPU: 2000 MHz, out-of-order, 128-entry instruction window, max. 3 instructions/cycle. Shared LLC: 8 MB, 128 B lines, 16-way, 700 MHz. GDDR5 DRAM at 800 MHz. Workloads: 13 GPU applications; 34 CPU applications grouped into 6 CPU application mixes; 36 diverse workloads, each pairing 1 GPU application with 1 CPU mix.

  29. GPU Performance Results. [Chart: normalized GPU IPC over all 36 workloads; results range from −11% for the CPU-centric configurations to +2% and +7% for the balanced configurations.]

  30. CPU Performance Results. [Chart: normalized CPU weighted speedup for the six CPU mixes and over all 36 workloads; gains of +24%, +19%, +7%, and +2% for the evaluated configurations.]

  31. System Performance. Overall system speedup (ISCA 2012): OSS = (1 − α) · WS_CPU + α · SU_GPU, where α is between 0 and 1 and a higher α gives the GPU higher importance. [Chart: normalized OSS as α sweeps from 0 to 1 for CM-CPU (Objective 1), CM-BAL (Objective 2, balanced), the 48-warp baseline, and DYNCTA.]
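
As a worked example, this small C++ program evaluates OSS exactly as defined above; the WS_CPU and SU_GPU inputs are made-up numbers for illustration, not results from the paper.

    #include <cstdio>

    // Overall system speedup: OSS = (1 - alpha) * WS_CPU + alpha * SU_GPU,
    // with alpha in [0, 1]; a larger alpha weights GPU performance more.
    double OverallSystemSpeedup(double wsCpu, double suGpu, double alpha) {
        return (1.0 - alpha) * wsCpu + alpha * suGpu;
    }

    int main() {
        const double wsCpu = 1.24;  // illustrative CPU weighted speedup
        const double suGpu = 0.89;  // illustrative GPU speedup
        const double alphas[] = {0.0, 0.5, 1.0};
        for (double alpha : alphas)
            std::printf("alpha=%.1f  OSS=%.3f\n",
                        alpha, OverallSystemSpeedup(wsCpu, suGpu, alpha));
        return 0;
    }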

  32. More in the Paper. Motivation: analysis of the metrics used by our algorithm. Scheme: detailed hardware walkthrough of our scheme. Results: analysis over time; change in GPU TLP; change in the metrics used by our algorithm; comparison against static approaches; lower number of LLC accesses.

  33. Outline: Summary, Background, Motivation, Analysis of TLP, Our Proposal, Evaluation, Conclusions.

  34. Conclusions. Sharing the memory hierarchy causes CPU and GPU applications to interfere with each other. Existing GPU TLP management techniques are not well-suited for heterogeneous architectures. We propose two GPU TLP management techniques for heterogeneous architectures: CM-CPU reduces GPU TLP to improve CPU performance; CM-BAL is similar to CM-CPU, but increases GPU TLP when it detects low latency tolerance in GPU cores. TLP can be tuned based on the user's preference for higher CPU or GPU performance.

  35. THANKS!

  36. Managing GPU Concurrency in Heterogeneous Architectures. Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das
