Optimizing GPU Usage Through OS Abstractions

This presentation explores the challenges and benefits of GPU programming, the importance of OS abstractions for GPUs, and the need for improved GPU support in operating systems, based on findings presented at SOSP 2011.

  • GPUs
  • Programming
  • Operating Systems
  • Abstractions
  • SOSP




Presentation Transcript


  1. Chris Rossbach, Jon Currey (Microsoft Research); Mark Silberstein (Technion); Baishakhi Ray, Emmett Witchel (UT Austin). SOSP, October 25, 2011.

  2. There are lots of GPUs
     • 3 of the top 5 supercomputers use GPUs
     • In all new PCs, smart phones, tablets
     • Great for gaming and HPC/batch, yet unusable in other application domains
     • GPU programming challenges: GPU and main memory are disjoint; the GPU is treated as an I/O device by the OS

  3. There are lots of GPUs (continued)
     • Same landscape as above: great for gaming and HPC/batch, unusable in other application domains
     • Same programming challenges: disjoint GPU and main memory, GPU treated as an I/O device by the OS
     • These two things are related: being unusable outside gaming/HPC and being treated as an I/O device by the OS. We need OS abstractions.

  4. Outline
     • The case for OS support
     • PTask: Dataflow for GPUs
     • Evaluation
     • Related Work
     • Conclusion

  5. The traditional software stack: programmer-visible interface, OS-level abstractions, hardware interface. There is a 1:1 correspondence between OS-level and user-level abstractions.

  6. The GPU stack: the programmer-visible interface (language integration, GPGPU APIs, shaders/kernels) sits on the DirectX/CUDA/OpenCL runtime, backed by only 1 OS-level abstraction! Consequences:
     • No kernel-facing API
     • No OS resource management
     • Poor composability

  7. GPU benchmark throughput under no CPU load vs. high CPU load (higher is better). Image convolution in CUDA; Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, NVIDIA GeForce GT230. Takeaway: the CPU scheduler and GPU scheduler are not integrated!

  8. The OS cannot prioritize cursor updates: WDDM + DWM + CUDA == dysfunction (flatter lines are better). Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, NVIDIA GeForce GT230.

  9. Gestural interface pipeline: camera images → capture → raw images → xform (geometric transformation) → noisy point cloud → filter (noise filtering) → detect → hand events/gestures.
     • High data rates
     • Data-parallel algorithms: a good fit for the GPU
     • NOT Kinect: this is a harder problem!

  10. #> capture | xform | filter | detect &
     • Modular design: flexibility, reuse
     • Utilize heterogeneous hardware: data-parallel components → GPU, sequential components → CPU
     • Using OS-provided tools: processes, pipes

  11. GPUs cannot run an OS: different ISA, disjoint memory space, no coherence. The host CPU must manage GPU execution: program inputs are explicitly transferred and bound at runtime, and device buffers are pre-allocated. User-mode apps must implement: copy inputs (main memory → GPU memory), send commands, copy outputs (GPU memory → main memory).
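
A minimal CUDA sketch of this host-managed pattern (the scale kernel, buffer size, and launch geometry are placeholders for illustration, not from the talk): the application pre-allocates a device buffer, copies inputs across, launches the kernel, and copies outputs back, all explicitly.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel standing in for any GPU computation.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void run_on_gpu(std::vector<float>& host_buf) {
    int n = static_cast<int>(host_buf.size());
    float* dev_buf = nullptr;

    cudaMalloc(&dev_buf, n * sizeof(float));                  // device buffer pre-allocated
    cudaMemcpy(dev_buf, host_buf.data(), n * sizeof(float),   // copy inputs: main memory -> GPU memory
               cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev_buf, n, 2.0f);        // send commands: bind arguments, launch
    cudaMemcpy(host_buf.data(), dev_buf, n * sizeof(float),   // copy outputs: GPU memory -> main memory
               cudaMemcpyDeviceToHost);
    cudaFree(dev_buf);
}
```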

  12. #> capture | xform | filter | detect &  (what actually happens): each stage is a process connected by pipes, so every frame crosses the user/kernel boundary through write()/read() calls and the OS executive, passes through the camera driver (camdrv), the GPU driver, and HIDdrv, and is repeatedly copied to and from the GPU over PCI transfers before the GPU finally runs each kernel.
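
To make that cost concrete, here is a sketch of one GPU stage (say, xform) when composed with pipes; the kernel body and frame size are assumptions for illustration, and partial pipe reads are ignored for brevity. Every frame pays a read() and a write() at the process boundary plus a PCI transfer in each direction, even when the next stage also runs on the GPU.

```cuda
#include <cuda_runtime.h>
#include <unistd.h>

// Stand-in for the real geometric transformation.
__global__ void xform_kernel(float* frame, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) frame[i] = frame[i] * 0.5f + 1.0f;
}

int main() {
    const int N = 640 * 480;                // assumed frame size
    static float frame[N];
    float* dev = nullptr;
    cudaMalloc(&dev, sizeof(frame));

    // stdin -> GPU -> stdout, once per frame (assumes whole frames per read()).
    while (read(0, frame, sizeof(frame)) == (ssize_t)sizeof(frame)) {
        cudaMemcpy(dev, frame, sizeof(frame), cudaMemcpyHostToDevice);   // PCI transfer in
        xform_kernel<<<(N + 255) / 256, 256>>>(dev, N);
        cudaMemcpy(frame, dev, sizeof(frame), cudaMemcpyDeviceToHost);   // PCI transfer out
        write(1, frame, sizeof(frame));     // next stage copies it right back to the GPU
    }
    cudaFree(dev);
    return 0;
}
```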

  13. What we need: GPU analogues for the process API, the IPC API, and scheduler hints; abstractions that enable fairness/isolation, OS use of the GPU, and composition with optimized data movement.

  14. Outline, revisited (next section: PTask: Dataflow for GPUs).

  15. PTask abstractions (all are OS objects, making OS resource management possible; programs specify where data goes, not how it gets there):
     • ptask (parallel task): analogous to a process for GPU execution; has a priority for fairness and a list of input/output resources (like stdin, stdout)
     • ports: can be mapped to ptask inputs/outputs; a data source or sink
     • channels: similar to pipes; connect arbitrary ports; specialize to eliminate double-buffering
     • graph: a DAG of connected ptasks, ports, and channels
     • datablocks: memory-space-transparent buffers

  16. #> capture | xform | filter | detect & as a ptask graph: the capture process (CPU) feeds rawimg ports; xform, filter, and detect run as ptasks (GPU) connected by channels carrying datablocks (rawimg → cloud → f-in/f-out), with buffers placed in GPU memory or mapped memory as needed. Data movement is optimized, and data arrival triggers computation.
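
A sketch of how such a graph could be assembled against a PTask-style interface. The call and file names below (pt_open_graph, pt_create_ptask, xform.hlsl, and so on) are illustrative placeholders rather than the actual PTask system calls; the point is that the program declares ptasks, ports, and channels once, and the OS then owns dispatch and data movement.

```cuda
// Illustrative handles and declarations only; bodies would call into the
// PTask runtime, and the real API names differ.
typedef int graph_t, ptask_t, port_t;

graph_t pt_open_graph();
ptask_t pt_create_ptask(graph_t g, const char* kernel_file);
port_t  pt_port(ptask_t t, const char* name);           // an input or output resource
void    pt_connect(graph_t g, port_t src, port_t dst);  // a channel, like a pipe
void    pt_run_graph(graph_t g);

void build_gesture_graph() {
    graph_t g = pt_open_graph();

    // GPU stages of:  capture | xform | filter | detect
    ptask_t xform  = pt_create_ptask(g, "xform.hlsl");
    ptask_t filter = pt_create_ptask(g, "filter.hlsl");
    ptask_t detect = pt_create_ptask(g, "detect.hlsl");

    // Channels between ptask ports; intermediate data can stay in GPU memory.
    pt_connect(g, pt_port(xform,  "cloud"), pt_port(filter, "f-in"));
    pt_connect(g, pt_port(filter, "f-out"), pt_port(detect, "f-in"));

    // The capture process (CPU) pushes datablocks into xform's "rawimg" port;
    // data arrival triggers computation.
    pt_run_graph(g);
}
```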

  17. Scheduling: graphs are scheduled dynamically; ptasks queue for dispatch when their inputs are ready. The queue is kept in dynamic priority order: ptask priority is user-settable and normalized to OS priority. Multiple GPUs are supported transparently, and ptasks are scheduled for input locality.
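
A minimal sketch of such a dispatch policy, assuming made-up bookkeeping structures (this is not the actual PTask scheduler): among ready ptasks, higher priority wins, and ties go to the ptask whose inputs already live on the GPU that just became free.

```cuda
#include <vector>

// Assumed bookkeeping, for illustration only.
struct ReadyPTask {
    int prio;           // user-settable ptask priority, normalized to OS priority
    int preferred_gpu;  // GPU already holding most of this ptask's inputs (-1 = none)
};

// Choose the next ptask to dispatch on a GPU that just became idle.
int pick_next(const std::vector<ReadyPTask>& queue, int idle_gpu) {
    int best = -1;
    for (int i = 0; i < (int)queue.size(); ++i) {
        bool better =
            best < 0 ||
            queue[i].prio > queue[best].prio ||
            (queue[i].prio == queue[best].prio &&          // tie: prefer input locality
             queue[i].preferred_gpu == idle_gpu &&
             queue[best].preferred_gpu != idle_gpu);
        if (better) best = i;
    }
    return best;  // index into queue, or -1 if nothing is ready
}
```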

  18. Datablock: a logical buffer backed by multiple physical buffers, one per memory space (main memory, GPU0 memory, GPU1 memory). Buffers are created and updated lazily; memory-mapping is used to share across process boundaries. Buffer validity is tracked per memory space, and writes invalidate the other views. Flags control access (R/W) and data placement.
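
The bookkeeping can be pictured with a small sketch (field and method names are illustrative, not the real Datablock layout): one logical buffer, a lazily created view per memory space, and valid bits that a write in one space clears everywhere else.

```cuda
#include <cstddef>

// Spaces: 0 = main memory, 1 = GPU0 memory, 2 = GPU1 memory.
constexpr int kSpaces = 3;

struct Datablock {
    void*  view[kSpaces]  = {};   // physical buffers, created lazily
    bool   valid[kSpaces] = {};   // which views hold up-to-date data
    size_t size           = 0;

    // Elided helpers: allocate a view in a space; copy between spaces (DMA/memcpy).
    void* materialize(int space);
    void  transfer(int dst_space, int src_space);

    // A writer materializes its view and invalidates every other view.
    void* get_write_view(int space) {
        if (!view[space]) view[space] = materialize(space);
        for (int s = 0; s < kSpaces; ++s) valid[s] = (s == space);
        return view[space];
    }

    // A reader pulls data in only if its view is stale.
    void* get_read_view(int space, int valid_src) {
        if (!valid[space]) {
            if (!view[space]) view[space] = materialize(space);
            transfer(space, valid_src);
            valid[space] = true;
        }
        return view[space];
    }
};
```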

  19. Example: #> capture | xform | filter. As a datablock flows from the capture process through the xform and filter ptasks, its views in main memory and GPU memory are materialized and the per-space valid and read/write flags are updated.

  20. With ports and datablocks there is a 1:1 correspondence between programmer-visible and OS abstractions, and GPU APIs can be built on top of the new OS abstractions.

  21. Outline, revisited (next section: Evaluation).

  22. Implementation. Windows 7: full PTask API implementation as a stacked UMDF/KMDF driver; the kernel component handles memory mapping and signaling; the user component wraps DirectX, CUDA, and OpenCL; syscalls are DeviceIoControl() calls. Linux 2.6.33.2: changed OS scheduling to manage the GPU; GPU accounting added to task_struct.
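
On the Windows side, "syscalls are DeviceIoControl() calls" means the user-mode component opens the driver's device object and issues IOCTLs into the kernel component. A minimal sketch, with a hypothetical device path and control code (the real driver's interface is not shown in the talk):

```cuda
#include <windows.h>
#include <winioctl.h>

// Hypothetical control code and request layout, for illustration only.
#define IOCTL_PT_RUN_GRAPH \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
struct RunGraphRequest { int graph_id; };

bool pt_run_graph_ioctl(int graph_id) {
    // Open the (hypothetical) device object exposed by the stacked UMDF/KMDF driver.
    HANDLE dev = CreateFileA("\\\\.\\PTask", GENERIC_READ | GENERIC_WRITE,
                             0, nullptr, OPEN_EXISTING, 0, nullptr);
    if (dev == INVALID_HANDLE_VALUE) return false;

    RunGraphRequest req = { graph_id };
    DWORD bytes = 0;
    // The "system call": the kernel component handles mem-mapping and signaling.
    BOOL ok = DeviceIoControl(dev, IOCTL_PT_RUN_GRAPH,
                              &req, sizeof(req), nullptr, 0, &bytes, nullptr);
    CloseHandle(dev);
    return ok != FALSE;
}
```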

  23. Evaluation setup: Windows 7, Core 2 Quad, GTX580 (EVGA). Implementations:
     • pipes: capture | xform | filter | detect as separate processes
     • modular: capture+xform+filter+detect in one process
     • handcode: one process, data movement hand-optimized
     • ptask: the ptask graph
     Configurations: real-time (driven by cameras) and unconstrained (driven by in-memory playback).

  24. Gesture-pipeline cost relative to handcode (runtime, user, sys; lower is better) for handcode, modular, pipes, and ptask. Compared to handcode, ptask delivers 11.6% higher throughput; compared to pipes, ptask delivers 16x higher throughput with ~2.7x less CPU usage and ~45% less memory usage. The lower CPU utilization comes from eliminating the driver program. Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, GTX580 (EVGA).

  25. Scheduling with priority: PTask invocations/second (higher is better) for FIFO vs. ptask scheduling at PTask priorities 2, 4, 6, 8. FIFO queues invocations in arrival order; ptask uses an aged priority queue with OS priority. Workload: 6x6 matrix-multiply graphs, with the same priority for every PTask node in a graph. PTask provides throughput proportional to priority. Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, GTX580 (EVGA).

  26. Multi-GPU scheduling: speedup over 1 GPU for priority vs. data-aware scheduling on synthetic graphs of varying depth (higher is better). Data-aware scheduling (priority plus locality) provides the best throughput while preserving priority; graph depth > 1 is required for any benefit. Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, 2x GTX580 (EVGA).

  27. EncFS case study: the OS using the GPU. Stack: user programs (a sequential read/write benchmark on EncFS, plus competing cuda-1 and cuda-2 jobs), user libraries (FUSE, libc, PTask), OS (Linux 2.6.33 with PTask), hardware (SSD1, SSD2, GPU). AES with XTS chaining; sequential read/write of 200 MB on SATA SSDs in RAID; EncFS runs at nice -20, cuda-* at nice +19. Simple GPU usage accounting restores performance. Speedup of GPU-accelerated EncFS over the CPU version:

                            Read      Write
      GPU/CPU (no load)     1.17x     1.28x
      cuda-1, Linux        -10.3x     -4.6x
      cuda-2, Linux        -30.8x    -10.3x
      cuda-1, PTask         1.16x     1.21x
      cuda-2, PTask         1.16x     1.20x

  28. Outline, revisited (next sections: Related Work, Conclusion).

  29. Related work. OS support for heterogeneous platforms: Helios [Nightingale 09], BarrelFish [Baumann 09], Offcodes [Weinsberg 08]. GPU scheduling: TimeGraph [Kato 11], Pegasus [Gupta 11]. Graph-based programming models: Synthesis [Massalin 89], Monsoon/Id [Arvind], Dryad [Isard 07], StreamIt [Thies 02], DirectShow, TCP offload [Currid 04]. Tasking: Tessellation, Apple GCD.

  30. Conclusion: OS abstractions for GPUs are critical. They enable fairness and priority, and they let the OS itself use the GPU. Dataflow is a good fit as an abstraction: the system manages data movement, and the performance benefits are significant. Thank you. Questions?
