
Chimera: Collaborative Preemption for Multitasking on a Shared GPU
"Learn about how Chimera enables collaborative preemption for multitasking on shared GPUs, addressing challenges in latency and throughput overhead. Explore the objective and innovative approach of this work from the University of Michigan."
Presentation Transcript
Chimera: Collaborative Preemption for Multitasking on a Shared GPU
Jason Jong Kyu Park (University of Michigan, Ann Arbor), Yongjun Park (Hongik University), and Scott Mahlke (University of Michigan, Ann Arbor)
GPUs in Modern Computer Systems
- The GPU is now a default component in modern computer systems: servers, desktops, laptops, and mobile devices
- It offloads data-parallel kernels written in CUDA, OpenCL, etc.
GPU Execution Model
[Figure: a kernel is launched as a grid of thread blocks, each composed of many threads]
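As a concrete illustration (my example, not from the slides), a CUDA kernel expresses this model directly: the host launches a grid of thread blocks, and each thread computes one element from its block and thread indices.

    // Each thread handles one element of the vectors.
    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < n)                                      // guard the tail block
            c[idx] = a[idx] + b[idx];
    }

    // Host side: a grid of (n + 255) / 256 thread blocks, 256 threads each.
    // vector_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);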
Multitasking Needs in GPUs
- Example: an augmented reality application mixes graphics (3D rendering) with data-parallel algorithms (e.g., Bodytrack), so multiple kernels must share one GPU
Traditional Context Switching
- GTX 780 (Kepler): 256 kB of registers + 48 kB of shared memory per SM, with 288.4 GB/s of memory bandwidth shared by 12 SMs
- A context switch takes ~88 us per SM, versus ~1 us for a CPU context switch
[Timeline: when K2 launches, K1's context is saved and K2's context is loaded before K2 can run]
Challenge 1: Preemption Latency
- The ~88 us per-SM context switch is too long for latency-critical kernels, which must wait for the entire context save and load before running
Challenge 2: Throughput Overhead
- While the context is being saved and loaded, no useful work is done on the SM, so every preemption costs throughput
Objective of This Work
[Plot: preemption cost vs. thread block progress (0-100%); prior work (context switching) pays the same cost regardless of how far a thread block has progressed]
SM Draining [Tanasic '14]
- When K2 launches, the SM stops issuing new thread blocks for K1 and lets the thread blocks already running drain to completion before K2 starts
[Timeline: K2 launches; K1's in-flight thread blocks finish; then K2 runs]
Chimera
[Plot: preemption cost vs. thread block progress; switching has a flat cost while draining is cheap only near the end of a thread block, and the gap between the two curves is the opportunity Chimera exploits by choosing between Switch and Drain]
SM Flushing
- Instant preemption: throw away what was running on the SM and re-execute it from the beginning later
- Requires an idempotent kernel: execution is idempotent if the state it reads is never modified
- Unlike a CPU, the only observable state on a GPU is global memory, which makes idempotence easier to establish
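A minimal sketch (my example, not from the paper) of a kernel that satisfies this condition: it reads only `in`, and its writes to `out` do not depend on `out`, so a flushed thread block can simply be re-executed from the start.

    // Idempotent: the read state (in[]) is never modified, so throwing the
    // thread block away and re-running it yields the same global memory state.
    __global__ void scale_idempotent(const float *in, float *out, float k, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            out[idx] = k * in[idx];  // write depends only on read-only input
    }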
Finding Relaxed Idempotence
- Idempotent regions are detected by the compiler in the CUDA source. A reconstruction of the slide's example (the elided expressions are filled in with illustrative values):

    __global__ void kernel_cuda(const float *in, float *out, float *inout) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float v = inout[idx];        // idempotent region: only reads so far
        atomicAdd(&out[idx], v);     // atomic operation: ends the idempotent region
        out[idx] = v + in[idx];      // global overwrite
        inout[idx] = v * 2.0f;       // global overwrite of state that was read
    }

- The code before the atomic operation forms an idempotent region that is safe to flush and re-execute
Chimera
[Plot: preemption cost vs. thread block progress; the optimal policy flushes near the beginning, context switches in the middle, and drains near the end of a thread block's execution]
Independent Thread Block Execution
- There is no shared state between SMs or between thread blocks
- Therefore, each SM (and each thread block) can be preempted with a different preemption technique
Chimera: Collaborative Preemption
[Figure: depending on each thread block's progress, one SM is flushed, another drained, and another context switched, all within a single preemption request]
Architecture
- Two-level scheduler: a kernel scheduler plus a thread block scheduler
- The kernel scheduler's SM scheduling policy decides how many SMs each kernel will have
- Chimera sits between the two schedulers and decides which SM will be preempted and which preemption technique to use
- The thread block scheduler keeps a thread block queue per kernel, picks the next thread block to schedule (TB-to-kernel mapping), and carries out the preemption decision, re-queuing preempted thread blocks
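A small host-side sketch (names and structure are mine, purely illustrative) of how the two levels and the per-kernel queues from the slide could fit together:

    #include <deque>
    #include <vector>

    struct ThreadBlock { int kernel_id; int tb_id; };

    // Kernel scheduler: the SM scheduling policy decides how many SMs
    // each kernel receives.
    struct KernelScheduler {
        std::vector<int> sms_per_kernel;  // indexed by kernel_id
    };

    // Thread block scheduler: one queue per kernel; preempted thread
    // blocks are re-queued so they run again later.
    struct ThreadBlockScheduler {
        std::vector<std::deque<ThreadBlock>> queues;  // indexed by kernel_id

        void requeue_preempted(const ThreadBlock &tb) {
            queues[tb.kernel_id].push_front(tb);  // run preempted TBs first
        }
        ThreadBlock next(int kernel_id) {
            ThreadBlock tb = queues[kernel_id].front();
            queues[kernel_id].pop_front();
            return tb;
        }
    };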
Cost Estimation: Preemption Latency
- Switch: context size / (memory bandwidth / # of SMs)
- Drain: progress is measured in instructions at warp granularity; the estimated remaining instructions (average thread block instructions minus instructions executed so far) x CPI give the estimated preemption latency
- Flush: zero preemption latency
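In equation form (the notation is mine): with context size S_ctx, memory bandwidth B, SM count N_SM, average per-thread-block instruction count I_avg, and instructions executed so far I_done,

    \begin{aligned}
    L_{switch} &= \frac{S_{ctx}}{B / N_{SM}} \\
    L_{drain}  &= (I_{avg} - I_{done}) \cdot CPI \\
    L_{flush}  &= 0
    \end{aligned}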
Cost Estimation: Throughput Overhead
- Switch: IPC x preemption latency x 2, doubled because both context saving and loading stall the SM
- Drain: estimated from progress in instructions at warp granularity, measured against the most progressed thread block in the same SM
- Flush: the instructions already executed at warp granularity, which are thrown away
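Using the same notation, the switch and flush overheads follow directly from the slide; the slide gives less detail for drain, so the last line below is only my reading of "measured against the most progressed thread block (b_max) in the same SM":

    \begin{aligned}
    O_{switch} &= 2 \cdot IPC \cdot L_{switch} \\
    O_{flush}  &= I_{done} \\
    O_{drain}  &\approx \sum_{b} \left( I_{done}^{\,b_{max}} - I_{done}^{\,b} \right)
    \end{aligned}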
Chimera Algorithm
- Choose as the preemption victims the thread blocks with the least throughput overhead among those that meet the preemption latency constraint
- For every candidate thread block, Chimera estimates the latency and overhead of Flush, Switch, and Drain, and may pick a different technique per thread block
[Figure: the thread blocks on each SM are annotated with per-technique latency and overhead estimates; Chimera selects the combination that satisfies the latency constraint at minimum overhead]
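A compact host-side sketch (my reconstruction, not the authors' code) of this selection for a single thread block: build the per-technique estimates, discard those that violate the latency constraint, and keep the cheapest survivor.

    #include <limits>
    #include <vector>

    enum class Technique { Flush, Switch, Drain };

    struct Estimate { Technique tech; double latency_us; double overhead; };

    struct TBState {
        bool   idempotent;       // flush is only legal for idempotent kernels
        double executed_insts;   // progress at warp granularity
        double avg_insts;        // average thread block instruction count
        double cpi, ipc;
        double ctx_switch_us;    // context size / (bandwidth / #SMs)
    };

    // Returns true and fills `best` if some technique meets the constraint.
    bool pick_technique(const TBState &tb, double constraint_us, Estimate &best) {
        std::vector<Estimate> options;
        if (tb.idempotent)       // flush: zero latency, executed work is lost
            options.push_back({Technique::Flush, 0.0, tb.executed_insts});
        options.push_back({Technique::Switch, tb.ctx_switch_us,
                           2.0 * tb.ipc * tb.ctx_switch_us});  // save + load
        double remaining = tb.avg_insts - tb.executed_insts;
        if (remaining > 0.0)     // drain: wait for the thread block to finish
            options.push_back({Technique::Drain, remaining * tb.cpi,
                               remaining /* simplified overhead model */});
        double min_overhead = std::numeric_limits<double>::max();
        bool found = false;
        for (const auto &o : options)
            if (o.latency_us <= constraint_us && o.overhead < min_overhead) {
                best = o; min_overhead = o.overhead; found = true;
            }
        return found;
    }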
Experimental Setup
- GPGPU-Sim v3.2.2 modeling a Fermi-architecture GPU: up to 32,768 registers (128 kB) and up to 48 kB of shared memory per SM
- Workloads: 14 benchmarks from the Nvidia SDK, Parboil, and Rodinia
- GPGPU benchmark + synthetic benchmark: the synthetic kernel mimics a periodic real-time task (e.g., a graphics kernel) with a 1 ms period and 200 us execution time
- GPGPU benchmark + GPGPU benchmark pairs
- Baseline: non-preemptive first-come first-served
Preemption Latency Violations
- GPGPU benchmark + synthetic benchmark, under a 15 us preemption latency constraint (the real-time task)
- Chimera violates the constraint in only 0.2% of preemptions
- The remaining violations occur for non-idempotent kernels with short thread block execution times, where the estimated preemption latency was shorter than the actual one
[Bar chart: violation rates for Switch, Drain, Flush, and Chimera]
System Throughput
- Case study: LUD + other GPGPU benchmarks; LUD has many kernel launches with a varying number of thread blocks
- Note: Drain achieves a lower (i.e., better) average normalized turnaround time than Chimera in this case study (5.17x for Drain vs. 5.50x for Chimera)
[Bar chart: STP improvement (%) of Switch, Drain, Flush, and Chimera over the non-preemptive baseline]
Preemption Technique Distribution
- GPGPU benchmark + synthetic benchmark
[Stacked bar chart: share of preemptions handled by Switch, Drain, and Flush under 5 us, 10 us, 15 us, and 20 us preemption latency constraints]
Summary
- Context switching can have high overhead on GPUs: both preemption latency and throughput overhead
- Chimera adds SM flushing (instant preemption) and combines Flush, Switch, and Drain into collaborative preemption
- It almost always meets the preemption latency constraint, with only 0.2% violations (caused by underestimated preemption latencies)
- It achieves a 5.5x ANTT improvement and a 12.2% STP improvement for GPGPU + GPGPU benchmark combinations
Questions?