
Holistic Approach to GPU Resource Virtualization
An overview of Zorua, a holistic approach to resource virtualization in GPUs. The talk explains why requiring the programmer to statically allocate registers, scratchpad memory, and thread slots for CUDA kernels hurts programming ease, performance portability, and performance, and proposes decoupling programmer-specified resource usage from the allocation in hardware.
Presentation Transcript
Zorua: A Holistic Approach to Resource Virtualization in GPUs
Session 2A, Monday, 5:20 PM
Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, Onur Mutlu
High Performance GPUs
__global__ void CUDAkernel2DCT(float *dst, float *src, int I) {
    int OffsThreadInRow = threadIdx.y * B + threadIdx.x;
    for (unsigned int i = 0; i < B; i++)
        bl_ptr[i * X] = src[i * I];
    __syncthreads();
    CUDAsubroutineInplaceDCTvector( );
    __syncthreads();
    CUDAsubroutineInplaceDCTvector( );
    for (unsigned int i = 0; i < B; i++)
        dst[i * I] = bl_ptr[i * X];
}
Low Performance!
The programmer has to statically allocate 3 major resources:
Registers (R)
Scratchpad Memory (S)
Thread Slots (T)
Imperfect allocation leads to low performance.
Fix: tune the code's usage of registers (R), scratchpad memory (S), and thread slots (T).
Result: high performance. Problem: programming effort.
GPU 1 and GPU 2 provide different amounts of R, S, and T, so code tuned for one GPU can run with low performance on the other.
Problem: performance portability.
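The effect of a static allocation on parallelism can be sketched with simple arithmetic. The host-side C++ below is only an illustrative model, not Zorua's mechanism and not any vendor's occupancy formula: it assumes that the number of thread blocks resident on one streaming multiprocessor (SM) is bounded by whichever of the three resources runs out first. The names `SmBudget`, `KernelSpec`, `residentBlocks`, and `occupancy` are hypothetical, and real hardware adds further limits (allocation granularity, a cap on blocks per SM) that are ignored here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-SM resource budget (illustrative model, not a real GPU).
struct SmBudget {
    int registers;        // R: registers available per SM
    int scratchpadBytes;  // S: scratchpad (shared memory) bytes per SM
    int threadSlots;      // T: threads that can be resident per SM
};

// Static specification chosen by the programmer for one kernel.
struct KernelSpec {
    int threadsPerBlock;
    int registersPerThread;
    int scratchpadBytesPerBlock;
};

// Thread blocks that fit on one SM: the tightest of the three limits.
// Integer division models the fact that a partial block cannot be scheduled.
int residentBlocks(const SmBudget& sm, const KernelSpec& k) {
    assert(k.threadsPerBlock > 0 && k.registersPerThread > 0 &&
           k.scratchpadBytesPerBlock > 0);
    int byRegisters  = sm.registers / (k.registersPerThread * k.threadsPerBlock);
    int byScratchpad = sm.scratchpadBytes / k.scratchpadBytesPerBlock;
    int byThreads    = sm.threadSlots / k.threadsPerBlock;
    return std::min({byRegisters, byScratchpad, byThreads});
}

// Fraction of the SM's thread slots that end up occupied.
double occupancy(const SmBudget& sm, const KernelSpec& k) {
    assert(sm.threadSlots > 0);
    return static_cast<double>(residentBlocks(sm, k) * k.threadsPerBlock) /
           sm.threadSlots;
}
```

With made-up budgets, a kernel using 256 threads per block, 32 registers per thread, and 6144 scratchpad bytes per block fits 8 blocks on an SM with 65536 registers, 49152 scratchpad bytes, and 2048 thread slots, filling every thread slot. Raising the register count to 33 per thread drops that to 7 blocks, and moving the original specification unchanged to an SM with 32768 registers, 16384 scratchpad bytes, and 1536 thread slots leaves only 2 blocks resident, because scratchpad becomes the binding limit.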
Programmer-specified resource allocation leads to 3 key issues with:
Programming ease
Performance portability
Performance for optimized code
Our Approach: decouple the programmer-specified resource usage from the allocation in the hardware.
Zorua: A Framework to Virtualize On-chip Resources in GPUs