
Instruction Temporal Locality in Deep Multithreaded GPUs
This presentation examines inter-warp instruction temporal locality in deep-multithreaded GPUs and shows how it can be exploited for improved energy efficiency. It covers the locality concept itself, hardware organizations that exploit it, and experimental results demonstrating reduced front-end energy consumption.
Presentation Transcript
Inter-Warp Instruction Temporal Locality in Deep-Multithreaded GPUs
Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari
ECE, University of Tehran; ECE, University of Victoria
This Work
Accelerators:
o Designed to maximize throughput
o ITL: the same instruction is fetched repeatedly for different warps
o This redundant fetch/decode energy is wasted
Our solution:
o Keep fetched instructions in a small buffer, saving energy
Key result: 19% front-end energy reduction
Outline
Background
Instruction Locality
Exploiting Instruction Locality
o Decoded-Instruction Buffer
o Row Buffer
o Filter Cache
Case Study: Filter Cache
o Organization
o Experimental Setup
o Experimental Results
Heterogeneous Systems
Heterogeneous systems achieve optimal performance/watt by combining:
o A superscalar speculative out-of-order processor for latency-sensitive serial workloads
o An accelerator (multithreaded in-order SIMD processor) for high-throughput parallel workloads
6 of the top 10 Top500.org supercomputers today employ accelerators:
o IBM Power BQC 16C 1.60 GHz (1st, 3rd, 8th, and 9th)
o NVIDIA Tesla (6th and 7th)
GPUs as Accelerators
GPUs are the most widely available accelerators:
o A class of general-purpose processors known as SIMT
o Integrated on the same die as the CPU (Sandy Bridge, etc.)
High energy efficiency:
o A GPU achieves 200 pJ/instruction
o A CPU achieves 2 nJ/instruction [Dally 2010]
SIMT Accelerator
SIMT (Single-Instruction Multiple-Thread):
o Goal is throughput
o Deep-multithreaded, designed for latency hiding
o 8- to 32-lane SIMD
Streaming Multiprocessor (SM), CTA & Warps
Threads of the same thread-block (CTA):
o Communicate through fast shared memory
o Synchronize through a fast synchronizer
A CTA is assigned to one SM
SMs execute at warp (group of 8-32 threads) granularity
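As a concrete illustration of the CTA features above, here is a minimal CUDA sketch (a hypothetical kernel, not one of the paper's workloads): threads of one CTA exchange data through shared memory and synchronize with a barrier.

    // Reverse each 256-element block of `in` into `out`.
    // Launch with 256-thread blocks, e.g. blockReverse<<<n/256, 256>>>(in, out);
    __global__ void blockReverse(const int *in, int *out)
    {
        __shared__ int tile[256];           // fast per-CTA shared memory
        int t = threadIdx.x;
        int g = blockIdx.x * blockDim.x + t;

        tile[t] = in[g];                    // each thread stages one element
        __syncthreads();                    // fast intra-CTA synchronization
        out[g] = tile[blockDim.x - 1 - t];  // read a value another thread wrote
    }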
Warping Benefits
Thousands of threads are scheduled with zero overhead:
o The contexts of all threads are kept on-core
Concurrent threads are grouped into warps:
o Share control-flow tracking overhead
o Reduce scheduling overhead
o Improve utilization of execution units (SIMD efficiency)
Energy Reduction Potential in GPUs
Huge amount of context:
o Caches
o Shared memory
o Register file
o Execution units
Too many inactive threads:
o Synchronization
o Branch/memory divergence
High locality:
o Similar behavior by different threads
Baseline Pipeline Front-end
Modeled according to NVIDIA patents
3-stage front-end:
o Instruction Fetch (IF)
o Instruction Buffer (IB)
o Instruction Dispatch (ID)
Energy breakdown (the I-Cache is the second most energy-consuming structure):
o I-Cache tag
o I-Cache data
o Instruction buffer
o Scoreboard
o Operand collector and buffering
SM Pipeline Front-end Example
[Figure: a cycle-by-cycle walk-through of the front-end for two warps (W1, W2) running the code sequence "1: add r2 <- r0, r1; 2: ld r3 <- [r2]", showing the PC, I-Cache, instruction buffer, warp scheduler, scoreboard, operand buffering, the per-lane register file (copies of r0-r3 for threads t0-t3), and the SIMD back-end.]
Inter-Thread Instruction Locality (ITL)
Warps are likely to fetch and decode the same instruction
[Chart: redundancy rate per benchmark (CP, HSPT, LPS, MP, MTM, NN, RAY, SCN) -- the percentage of instructions already fetched by another currently active warp recently, bucketed by fetch distance: <= 16 fetches, 17-32 fetches, 33-64 fetches, > 64 fetches.]
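The redundancy metric above can be made concrete with a short sketch (plain C-style host code; the function and trace are illustrative, not the paper's tooling): a fetch is redundant within distance N if the same PC was fetched by any warp among the previous N fetches.

    #include <stdio.h>

    #define WINDOW 16   /* the "<= 16 fetches" distance bucket */

    /* Fraction of fetches whose PC already appeared in the last WINDOW fetches. */
    double redundancy_rate(const unsigned *pc_trace, int n)
    {
        int redundant = 0;
        for (int i = 0; i < n; i++)
            for (int j = i - 1; j >= 0 && j >= i - WINDOW; j--)
                if (pc_trace[j] == pc_trace[i]) { redundant++; break; }
        return n ? (double)redundant / n : 0.0;
    }

    int main(void)
    {
        /* Two warps interleaving over the same 4-instruction code block. */
        unsigned trace[] = {0, 0, 4, 4, 8, 8, 12, 12};
        printf("%.0f%%\n", 100.0 * redundancy_rate(trace, 8)); /* prints 50% */
        return 0;
    }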
Exploiting ITL
Toward performance improvement:
o Minor improvement by reducing the latency of the arithmetic pipeline
Toward energy saving:
o Fetch/decode bypassing, similar to loop buffering
o Reducing accesses to the I-Cache, via a row buffer or a filter cache (our case study)
Fetch/Decode Bypassing
[Figure: the front-end extended with a decoded-instruction buffer, a small PC-tagged table holding already-decoded instructions alongside the I-Cache tag and data arrays.]
No need to access the I-Cache and decode logic when the buffer hits
The buffer can bypass 42% of instruction fetches
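A hedged sketch of the lookup this buffer performs (C-style; the structure layout, direct-mapped organization, and all names are illustrative assumptions, not the paper's design): on a hit, the already-decoded instruction is reused and both the I-Cache access and the decode are skipped.

    typedef struct { int opcode, src1, src2, dest; } DecodedInsn;
    typedef struct { unsigned pc; int valid; DecodedInsn insn; } DIBEntry;

    #define DIB_ENTRIES 32

    int dib_lookup(DIBEntry dib[], unsigned pc, DecodedInsn *out)
    {
        DIBEntry *e = &dib[(pc >> 2) % DIB_ENTRIES]; /* index by word-aligned PC */
        if (e->valid && e->pc == pc) {               /* hit: bypass fetch+decode */
            *out = e->insn;
            return 1;
        }
        return 0;   /* miss: access the I-Cache, decode, then fill this entry */
    }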
Row Buffer
[Figure: the front-end with a row buffer in front of the I-Cache data array; a mux selects between the row buffer and the I-Cache output.]
Buffers the last accessed I-Cache line
Filter Cache (Our Case Study)
[Figure: the front-end with a filter cache probed by the PC alongside the I-Cache tag array; a mux selects between the filter cache and the I-Cache data output.]
Buffers recently fetched instructions in a set-associative table
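The probe can be sketched as follows (C-style; the 32-entry size follows the talk, while the two-way organization, 8-byte instruction word, and field names are assumptions): a hit returns the instruction without touching the I-Cache.

    #define FC_SETS 16
    #define FC_WAYS 2    /* 16 sets x 2 ways = 32 entries; at 8 B/instruction
                            this matches the 256-byte capacity quoted below */

    typedef struct { unsigned tag; int valid; unsigned long long insn; } FCEntry;

    int fc_fetch(FCEntry fc[FC_SETS][FC_WAYS], unsigned pc, unsigned long long *insn)
    {
        unsigned set = (pc >> 3) % FC_SETS;  /* 8-byte instruction granularity */
        unsigned tag = pc >> 7;              /* remaining PC bits */
        for (int w = 0; w < FC_WAYS; w++)
            if (fc[set][w].valid && fc[set][w].tag == tag) {
                *insn = fc[set][w].insn;     /* hit: the I-Cache is not accessed */
                return 1;
            }
        return 0;   /* miss: fetch from the I-Cache, then install the word here */
    }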
Filter Cache-Enhanced Front-end
Bypasses I-Cache accesses to save dynamic power
32-entry (256-byte) FC:
o FC hit rate: up to ~100%
o Front-end energy saving: up to 19%
o Front-end area overhead: 4.7%
o Front-end leakage overhead: 0.7%
Methodology
Cycle-accurate simulation of CUDA workloads with GPGPU-sim:
o Configured to model the NVIDIA Tesla architecture
o 16 8-wide SMs
o 1024 threads/SM
o 48 KB D-L1$/SM
o 4 KB I-L1$/SM (256-byte lines)
21 workloads from:
o RODINIA (Backprop, ...)
o CUDA SDK (Matrix Multiply, ...)
o GPGPU-sim (RAY, ...)
o Parboil (CP)
o Third-party sequence alignment (MUMmerGPU++)
Methodology (2)
Energy evaluations under 32-nm technology using CACTI:

Structure            Area (um^2)   Leakage (mW)   Energy per R/W (pJ)   Delay (ps)
I-Cache tag                229         0.03              0.13              115.94
I-Cache data             18204         1.78              4.30              221.20
Instruction Buf.          2600         0.16              1.00              137.59
Scoreboard                6921         0.24              1.57              162.17
Operand Buf.             24173         0.53              4.16              174.05
FC tag (32-entry)          266         0.03              0.14              117.28
FC data (32-entry)        2229         0.11              0.81              161.76
FC tag (16-entry)          155         0.02              0.10              105.47
FC data (16-entry)        1337         0.05              0.57              143.38

The scoreboard is modeled by a wide tag array; the instruction and operand buffers are modeled by data arrays.
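To connect these numbers to the savings reported later, here is a back-of-the-envelope sketch (C-style; the fetch count and hit rate are illustrative, and leakage/delay terms are omitted, so this is not the paper's full energy model): every fetch pays the FC probe, and only misses pay the full I-Cache access.

    #include <stdio.h>

    int main(void)
    {
        const double icache = 0.13 + 4.30; /* pJ per I-Cache access (tag + data) */
        const double fc     = 0.14 + 0.81; /* pJ per 32-entry FC access          */
        double fetches = 1e6, hit_rate = 0.9;   /* illustrative numbers */

        double base    = fetches * icache;
        double with_fc = fetches * fc                         /* every fetch probes FC */
                       + fetches * (1.0 - hit_rate) * icache; /* misses still pay full */
        printf("I-Cache energy saved: %.0f%%\n", 100.0 * (1.0 - with_fc / base));
        return 0;
    }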
Experimental Results
FC hit rate and energy saving under the baseline configuration:
o 32-entry FC
o 1024 threads per SM
o Round-robin warp scheduler
Sensitivity analysis over:
o FC size
o Threads per SM
o Warp scheduler
FC Hit Rate and Energy Saving

Benchmark   Baseline I-Cache energy (nJ)   I-Cache + FC energy (nJ)   FC hit rate   Front-end energy saving
CP                   16532.20                      6616.15               100%                7%
HSPT                 12076.70                      5676.64                89%                8%
LPS                  14955.05                      7625.97                83%                8%
MP                     161.40                       139.78                30%                2%
MTM                    347.37                       153.66                95%                8%
NN                  132820.86                     53780.92                99%               19%
RAY                  10520.69                      5945.77                76%                6%
SCN                    562.47                       240.70                97%                9%

MP (few concurrent warps, divergent branches) has the lowest hit rate; CP (many concurrent warps, coherent branches) has the highest.
Sensitivity Analysis
Filter cache size:
o A larger FC provides a higher hit rate but costs higher static/dynamic energy
Threads per SM:
o The more threads per SM, the higher the chance of an instruction re-fetch
Warp scheduling:
o Advanced warp schedulers (latency-hiding or data-cache-locality boosters) may keep warps at different paces
[Figure: issue timelines of warps W0 and W1 under round-robin vs. two-level scheduling, contrasting memory-pending and compute periods.]
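The two policies compared in the sensitivity study can be sketched as follows (C-style; the active-set management of the two-level scheduler is simplified away, and all names and sizes are illustrative assumptions): round-robin rotates over all ready warps, keeping them at similar paces, while a two-level scheduler issues only from a small active subset, which can let warps drift apart in the code.

    #define NUM_WARPS 32

    /* Round-robin: issue the next ready warp after the last one issued. */
    int pick_round_robin(const int ready[NUM_WARPS], int last)
    {
        for (int i = 1; i <= NUM_WARPS; i++) {
            int w = (last + i) % NUM_WARPS;
            if (ready[w]) return w;
        }
        return -1;    /* no warp ready this cycle */
    }

    /* Two-level: round-robin restricted to a small active set; warps that
       stall on memory are demoted and pending warps promoted (not shown). */
    int pick_two_level(const int ready[NUM_WARPS], const int active[NUM_WARPS], int last)
    {
        for (int i = 1; i <= NUM_WARPS; i++) {
            int w = (last + i) % NUM_WARPS;
            if (active[w] && ready[w]) return w;
        }
        return -1;
    }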
Sensitivity to Multithreading Depth
Threads per SM: 1024 vs. 512
[Charts: FC hit rate and front-end energy saving per benchmark (CP, HSPT, LPS, MP, MTM, NN, RAY, SCN, avg). Halving the threads per SM reduces the hit rate by ~1% and the energy saving by ~1%.]
Sensitivity to Warp Scheduling
Warp scheduler: round-robin vs. two-level
[Charts: FC hit rate and front-end energy saving per benchmark (CP, HSPT, LPS, MP, MTM, NN, RAY, SCN, avg). The two-level scheduler reduces the hit rate by ~1% and the energy saving by ~1%.]
Sensitivity to Filter Cache Size
Number of FC entries: 32 vs. 16
[Charts: FC hit rate and front-end energy saving per benchmark (CP, HSPT, LPS, MP, MTM, NN, RAY, SCN, avg). The 16-entry FC loses ~5% hit rate, costing up to ~1% of the savings on some benchmarks, but overall increases savings by ~2% thanks to the smaller, cheaper structure.]
Conclusion & Future Work
We evaluated instruction locality among concurrent warps in a deep-multithreaded GPU
This locality can be exploited for performance or energy saving
Case study: a filter cache provides 1%-19% energy saving for the pipeline
Future work:
o Evaluating fetch/decode bypassing
o Evaluating concurrent-kernel GPUs
Thank you! Questions?
Backup Slides
References
[Dally 2010] W. J. Dally, "GPU Computing: To ExaScale and Beyond," SC 2010.
Workloads

Abbr.   Name and Suite            Grid Size              Block Size          #Insn   CTA/SM
BFS     BFS Graph [2]             16x(8)                 16x(512)            1.4M    1
BKP     Back Propagation [2]      2x(1,64)               2x(16,16)           2.9M    4
CP      Coulomb Poten. [19]       (8,32)                 (16,8)              113M    8
DYN     Dyn_Proc [2]              13x(35)                13x(256)            64M     4
FWAL    Fast Wal. Trans. [18]     6x(32) 3x(16) (128)    7x(256) 3x(512)     11M     2, 4
GAS     Gaussian Elimin. [2]      48x(3,3)               48x(16,16)          9M      1
HSPT    Hotspot [2]               (43,43)                (16,16)             76M     2
LPS     Laplace 3D [1]            (4,25)                 (32,4)              81M     6
MP2     MUMmer-GPU++ [8] big      (196)                  (256)               139M    2
MP      MUMmer-GPU++ [8] small    (1)                    (256)               0.3M    1
MTM     Matrix Multiply [18]      (5,8)                  (16,16)             2.4M    4
MU2     MUMmer-GPU [2] big        (196)                  (256)               75M     4
Workloads (2)

Abbr.   Name and Suite              Grid Size                          Block Size             #Insn   CTA/SM
MU      MUMmer-GPU [2] small        (1)                                (100)                  0.2M    1
NNC     Nearest Neighbor [2]        4x(938)                            4x(16)                 5.9M    8
NN      Neural Network [1]          (6,28) (25,28) (100,28) (10,28)    (13,13) (5,5) 2x(1)    68M     5, 8
NQU     N-Queen [1]                 (256)                              (96)                   1.2M    1
NW      Needleman-Wun. [2]          2x(1) 2x(31) (32)                  63x(16)                12M     2
RAY     Ray Tracing [1]             (16,32)                            (16,8)                 64M     3
SCN     Scan [18]                   (64)                               (256)                  3.6M    4
SR1     Speckle Reducing [2] big    4x(8,8)                            4x(16,16)              9.5M    2, 3
SR2     Speckle Reducing [2] small  4x(4,4)                            4x(16,16)              2.4M    1