
Speeding up GPU Performance Through Memory Optimization Techniques
"Explore how Mascar is enhancing GPU performance by reducing memory pitstops and tackling memory saturation issues. Learn about the impact of memory saturation on performance and strategies to optimize memory resources for better GPU efficiency."
Presentation Transcript
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops
Ankit Sethia*, D. Anoushe Jamshidi, Scott Mahlke (University of Michigan)
GPU usage is expanding: data analytics, graphics, machine learning, linear algebra, simulation, and computer vision. All kinds of applications, both compute and memory intensive, are targeting GPUs.
Performance variation of kernels
[Chart: % of peak IPC achieved by compute-intensive vs. memory-intensive kernels]
Memory-intensive kernels saturate bandwidth and achieve lower performance.
Impact of memory saturation - I
[Diagram: multiple SMs, each with FPUs, an LSU, and an L1 cache, sharing the memory system]
Memory-intensive kernels serialize memory requests, so it is critical to prioritize the order of requests from SMs.
Impact of memory saturation
[Chart: fraction of peak IPC vs. fraction of cycles the LSU is stalled, for compute-intensive and memory-intensive kernels]
Significant stalls in the LSU correspond to low performance in memory-intensive kernels.
Impact of memory saturation - II
[Diagram: warp W1's data sits in an L1 cache block, but requests are backed up at the LSU]
Data is present in the cache, but the LSU cannot access it, so the SM is unable to feed enough data for processing.
Increasing memory resources
[Chart: speedup of memory-intensive kernels with large MSHRs + queues, full associativity, +20% frequency, and all three combined]
A large number of MSHRs + full associativity + a 20% bandwidth boost helps, but such a configuration is UNBUILDABLE.
During memory saturation:
- Serialization of memory requests causes less overlap between memory accesses and compute: Memory-Aware Scheduling (MAS)
- Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (CAR)
Mascar = MAS + CAR
Memory Aware Scheduling
[Diagram: under memory saturation, warps 0-2 each hold outstanding memory requests]
Serving one request and switching to another warp (RR): no warp is ready to make forward progress.
GTO issues instructions from another warp whenever there is no instruction in the i-buffer for that warp, or there is a dependency between instructions. This makes it similar to RR, as multiple warps may still issue memory requests.
Serving all requests from one warp and then switching to another (MAS): one warp is ready to begin computation early.
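The contrast between the two issue orders can be sketched in a few lines. This is an illustrative toy, not the talk's implementation; the request labels are made up.

```python
# Toy sketch: order in which memory requests reach a saturated memory
# system under round-robin (RR) versus memory-aware scheduling (MAS).

def rr_order(warps):
    """RR: serve one request per warp, then switch to the next warp."""
    order = []
    queues = [list(reqs) for reqs in warps]
    while any(queues):
        for q in queues:
            if q:
                order.append(q.pop(0))
    return order

def mas_order(warps):
    """MAS: drain all of one warp's requests before switching warps."""
    order = []
    for reqs in warps:
        order.extend(reqs)
    return order

# Three warps with two outstanding loads each (labels are illustrative).
warps = [["w0.a", "w0.b"], ["w1.a", "w1.b"], ["w2.a", "w2.b"]]
print(rr_order(warps))   # requests interleaved: no warp finishes early
print(mas_order(warps))  # warp 0 completes first and can start computing
```

Under RR every warp's last request arrives late, so no warp can start computing; under MAS warp 0's requests all complete first, overlapping its computation with the other warps' memory accesses.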
MAS operation
[Flowchart:]
1. Check if the kernel is memory intensive (MSHRs or miss queue almost full). If not, schedule in Equal Priority (EP) mode.
2. If so, assign a new owner warp (only the owner's requests can go beyond the L1) and schedule in Memory Priority (MP) mode: execute memory instructions only from the owner, while other warps can execute compute instructions.
3. If the next instruction of the owner depends on an already issued load, assign a new owner; otherwise remain in MP mode.
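The mode decision above can be sketched as follows. This is a simplified model, not the hardware design: the 90% saturation threshold and the class/field names are assumptions for illustration.

```python
# Sketch of the MAS mode decision: switch to Memory Priority (MP) mode
# when the MSHRs are nearly full; in MP mode only the owner warp may
# issue memory instructions, while other warps may still issue compute
# instructions. Threshold and names are illustrative assumptions.

class MasScheduler:
    def __init__(self, mshr_size, threshold=0.9):
        self.mshr_size = mshr_size
        self.threshold = threshold
        self.owner = None  # warp currently allowed past the L1

    def saturated(self, mshrs_in_use):
        return mshrs_in_use >= self.threshold * self.mshr_size

    def can_issue(self, warp_id, is_mem_inst, mshrs_in_use):
        if not self.saturated(mshrs_in_use):
            return True                   # Equal Priority mode
        if not is_mem_inst:
            return True                   # compute instructions always allowed
        if self.owner is None:
            self.owner = warp_id          # assign a new owner warp
        return warp_id == self.owner      # MP mode: only the owner's requests

sched = MasScheduler(mshr_size=64)
sched.can_issue(0, True, mshrs_in_use=10)       # EP mode: any warp issues
sched.can_issue(1, True, mshrs_in_use=60)       # saturated: warp 1 becomes owner
sched.can_issue(2, True, mshrs_in_use=60)       # warp 2's loads are held back
sched.can_issue(2, False, mshrs_in_use=60)      # but its compute proceeds
```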
Implementation of MAS
[Diagram: decode stage feeding an I-buffer (warp id, op type); scheduler with a Warp Status Table (WST) producing ordered warps via compute-queue and memory-queue heads; Warp Readiness Checker (WRC) with a stall bit, fed by the scoreboard, register file, and a memory saturation flag]
- Ordered warps: warps are divided into memory warps and compute warps.
- Warp Readiness Checker (WRC): tests whether a warp should be allowed to issue memory instructions.
- Warp Status Table (WST): decides whether the scheduler should schedule from a warp.
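The warp-ordering step can be sketched as a simple partition. This is a hypothetical simplification of the WST's role; the function and label names are assumptions.

```python
# Sketch of ordered warps: warps whose next instruction is a memory
# operation are grouped behind compute-ready warps, so the scheduler can
# fill saturation periods with computation. Names are illustrative.

def order_warps(warps):
    """warps: list of (warp_id, next_op) pairs, next_op is 'comp' or 'mem'.
    Returns warp ids with compute-ready warps first (Comp_Q before Mem_Q)."""
    comp_q = [wid for wid, op in warps if op == "comp"]
    mem_q = [wid for wid, op in warps if op == "mem"]
    return comp_q + mem_q

print(order_warps([(0, "mem"), (1, "comp"), (2, "mem"), (3, "comp")]))
# compute warps 1 and 3 are scheduled ahead of memory warps 0 and 2
```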
Recap: MAS addresses the serialization of memory requests. The second problem remains: data present in the cache cannot be reused because the data cache cannot accept any request, which Cache Access Re-execution (CAR) addresses.
Cache access re-execution
[Diagram: the load/store unit gains a re-execution queue in front of the L1 cache; warp W1's access hits in a cache block while requests from W0 and W2 wait]
Better than adding more MSHRs: more MSHRs cause faster saturation of the memory system and faster thrashing of the data cache.
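The idea can be sketched as a small queue model. This is an illustrative toy under assumed names, not the hardware design: an access that cannot get an MSHR is parked in a re-execution queue instead of stalling the LSU, so loads that would hit in the cache can still be served.

```python
# Sketch of cache access re-execution (CAR): when all MSHRs are busy,
# park the access in a re-execution queue rather than blocking the LSU,
# and retry it later. Hits are served even under saturation.
from collections import deque

class L1WithReexecution:
    def __init__(self, mshr_slots):
        self.cache = {}                  # address -> data (toy L1)
        self.free_mshrs = mshr_slots
        self.reexec_q = deque()          # parked accesses awaiting retry

    def access(self, addr):
        if addr in self.cache:
            return self.cache[addr]      # hit-under-miss: served immediately
        if self.free_mshrs > 0:
            self.free_mshrs -= 1         # miss: allocate an MSHR
            return None                  # data arrives later from memory
        self.reexec_q.append(addr)       # no MSHR free: park, don't stall
        return None

    def retry(self):
        """Re-execute parked accesses; hits drain without needing an MSHR."""
        for _ in range(len(self.reexec_q)):
            self.access(self.reexec_q.popleft())

l1 = L1WithReexecution(mshr_slots=1)
l1.cache[0x100] = "W1 data"
l1.access(0x200)                         # miss consumes the only MSHR
print(l1.access(0x100))                  # W1's hit is still served
l1.access(0x300)                         # second miss is parked, LSU free
```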
Experimental methodology
- Simulator: GPGPU-Sim 3.2.2, GTX 480 architecture
- SMs: 15, 32 PEs/SM
- Schedulers: LRR, GTO, OWL, and CCWS
- L1 cache: 32 kB, 64 sets, 4-way, 64 MSHRs
- L2 cache: 768 kB, 8-way, 6 partitions, 200 core cycles
- DRAM: 32 requests/partition, 440 core cycles
Performance of compute intensive kernels
[Chart: speedup w.r.t. RR for GTO, OWL, CCWS, and Mascar]
Performance of compute-intensive kernels is insensitive to scheduling policies.
Performance of memory intensive kernels
[Chart: speedup w.r.t. RR for GTO, OWL, CCWS, MAS, and CAR across bandwidth-intensive, cache-sensitive, and all kernels; some cache-sensitive bars reach 3.0x-4.8x]
Mean speedup w.r.t. RR:
  Bandwidth intensive:  GTO 4%,  OWL 4%, CCWS 4%,  Mascar 17%
  Cache sensitive:      GTO 24%, OWL 4%, CCWS 55%, Mascar 56%
  Overall:              GTO 13%, OWL 4%, CCWS 24%, Mascar 34%
Conclusion
During memory saturation:
- Serialization of memory requests causes less overlap between memory accesses and compute: Memory-Aware Scheduling (MAS) allows one warp to issue all its requests and begin computation early.
- Data present in the cache cannot be reused because the data cache cannot accept any request: Cache Access Re-execution (CAR) exploits more hit-under-miss opportunities through a re-execution queue.
Result: 34% speedup and 12% energy savings.
Mascar: Speeding up GPU Warps by Reducing Memory Pitstops. Questions?