Holistic Approach to GPU Resource Virtualization


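An overview of Zorua, a holistic approach to resource virtualization in GPUs. The talk shows how CUDA programmers must statically allocate three major on-chip resources (registers, scratchpad memory, and thread slots), how an imperfect allocation costs performance, programming effort, and performance portability, and how decoupling the programmer-specified resource usage from the allocation in hardware addresses these issues.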

Uploaded by audryana on Jul 15, 2025


Presentation Transcript

Zorua: A Holistic Approach to Resource Virtualization in GPUs
Session 2A, Monday, 5:20 PM
Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, Onur Mutlu


High Performance GPUs

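__global__ void CUDAkernel2DCT(float *dst, float *src, int I){
    // B, X and bl_ptr are not declared on the slide
    int OffsThreadInRow = threadIdx.y * B + threadIdx.x;
    // copy the input block into bl_ptr
    for(unsigned int i = 0; i < B; i++)
        bl_ptr[i * X] = src[i * I];
    __syncthreads();
    CUDAsubroutineInplaceDCTvector( );  // arguments elided on the slide
    __syncthreads();
    CUDAsubroutineInplaceDCTvector( );  // arguments elided on the slide
    // copy the result back to the output
    for(unsigned int i = 0; i < B; i++)
        dst[i * I] = bl_ptr[i * X];
}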

Low Performance!


The programmer has to statically allocate 3 major resources: Registers (R), Scratchpad Memory (S), and Thread Slots (T). Imperfect allocation leads to low performance.
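To make the link between static allocation and performance concrete, here is a minimal host-side sketch of how the three resources bound the number of thread blocks that can be resident on one streaming multiprocessor (SM). It is not from the talk: the names `SmLimits`, `BlockDemand`, and `resident_blocks` are hypothetical, the capacities in the usage note are illustrative, and the model assumes a simple "scarcest resource wins" rule that ignores allocation granularity and any per-SM cap on block count.

```cpp
#include <algorithm>

// Hypothetical per-SM capacities (illustrative, not taken from the talk).
struct SmLimits {
    int registers;         // R: registers available on one SM
    int scratchpad_bytes;  // S: scratchpad (shared) memory in bytes
    int thread_slots;      // T: maximum resident threads
};

// Static, programmer-specified demand of one thread block.
struct BlockDemand {
    int threads;               // threads per block
    int registers_per_thread;  // registers used by each thread
    int scratchpad_bytes;      // scratchpad used by the whole block
};

// Number of blocks that fit on one SM under the simplified model:
// each resource allows some number of blocks, and the smallest wins.
int resident_blocks(const SmLimits& sm, const BlockDemand& b) {
    int by_r = sm.registers / (b.threads * b.registers_per_thread);
    int by_t = sm.thread_slots / b.threads;
    // A block that uses no scratchpad is not limited by it.
    int by_s = b.scratchpad_bytes > 0
                   ? sm.scratchpad_bytes / b.scratchpad_bytes
                   : by_t;
    return std::min({by_r, by_s, by_t});
}
```

Under these assumptions, an SM with 65536 registers, 49152 bytes of scratchpad, and 2048 thread slots holds 6 blocks of a kernel that uses 256 threads, 32 registers per thread, and 8192 bytes of scratchpad per block: scratchpad is the bottleneck while registers and thread slots would each allow 8. Raising the register count to 72 per thread drops the result to 3 blocks, which illustrates how a small change in one resource can sharply reduce parallelism.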

Fix: tune the code's usage of Registers (R), Scratchpad (S), and Thread Slots (T).

The tuned code achieves high performance. Problem: Programming Effort

[Figure: the R, S, and T resources of GPU 1 compared with those of GPU 2]

Low Performance! Problem: Performance Portability
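The portability problem can be sketched with a small host-side calculation: one fixed launch configuration, evaluated against two GPUs with different per-SM capacities. Everything here is an assumption for illustration: the `Gpu` struct, the `thread_slot_utilization` function, and the capacities in the usage note are hypothetical and do not come from the talk, and the model ignores allocation granularity and per-SM block caps.

```cpp
#include <algorithm>

// Hypothetical per-SM capacities of one GPU (illustrative only).
struct Gpu {
    int registers;         // R
    int scratchpad_bytes;  // S
    int thread_slots;      // T
};

// Percentage of thread slots kept busy by a fixed launch configuration.
// Assumes scratchpad_per_block > 0 and that at least the division inputs
// are positive; the scarcest of the three resources sets the block count.
int thread_slot_utilization(const Gpu& g, int threads_per_block,
                            int regs_per_thread, int scratchpad_per_block) {
    int blocks = std::min({g.registers / (threads_per_block * regs_per_thread),
                           g.scratchpad_bytes / scratchpad_per_block,
                           g.thread_slots / threads_per_block});
    return 100 * blocks * threads_per_block / g.thread_slots;
}
```

With a configuration of 256 threads per block, 32 registers per thread, and 12288 bytes of scratchpad per block, a GPU with 65536 registers, 98304 bytes of scratchpad, and 2048 thread slots is fully utilized (100), because all three resources allow 8 blocks. The same configuration on a GPU with 32768 registers, 49152 bytes of scratchpad, and 1536 thread slots fits only 4 blocks and leaves a third of the thread slots idle (66), so a configuration tuned for the first GPU would need retuning for the second.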

Programmer-specified resource allocation leads to 3 key issues with: programming ease, performance portability, and performance for optimized code.

Our Approach: decouple the programmer-specified resource usage from the allocation in the hardware.

Zorua: A Framework to Virtualize On-chip Resources in GPUs
