Holistic Approach to GPU Resource Virtualization


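An overview of Zorua, a holistic approach to virtualizing on-chip resources in GPUs. The talk uses a CUDA kernel to show that programmers must statically allocate three major resources (registers, scratchpad memory, and thread slots) and that this causes problems with programming ease, performance portability, and performance. It then introduces Zorua's approach: decouple the programmer-specified resource usage from the allocation in hardware.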

  • GPU Virtualization
  • High Performance Computing
  • CUDA Kernels
  • Resource Allocation
  • Holistic Approach


Presentation Transcript


  1. Zorua: A Holistic Approach to Resource Virtualization in GPUs. Session 2A, Monday, 5:20 PM. Nandita Vijaykumar, Kevin Hsieh, Gennady Pekhimenko, Samira Khan, Ashish Shrestha, Saugata Ghose, Adwait Jog, Phillip B. Gibbons, Onur Mutlu

  2. High Performance

  3. High Performance GPUs

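  4. Example CUDA kernel, a 2D DCT (slide excerpt: B, X, bl_ptr, and the subroutine arguments are not defined on the slide):

         __global__ void CUDAkernel2DCT(float *dst, float *src, int I)
         {
             int OffsThreadInRow = threadIdx.y * B + threadIdx.x;

             // Each thread copies B values, one per image row (stride I),
             // into the tile at bl_ptr (stride X).
             for (unsigned int i = 0; i < B; i++)
                 bl_ptr[i * X] = src[i * I];
             __syncthreads();

             CUDAsubroutineInplaceDCTvector( );   // in-place 1D DCT; arguments elided on the slide
             __syncthreads();
             CUDAsubroutineInplaceDCTvector( );   // arguments elided on the slide

             // Copy the transformed values back to the output image.
             for (unsigned int i = 0; i < B; i++)
                 dst[i * I] = bl_ptr[i * X];
         }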

  5. The same kernel, now annotated: Low Performance!

  6. The programmer has to statically allocate 3 major resources:

  7. The programmer has to statically allocate 3 major resources: Registers (R)

  8. The programmer has to statically allocate 3 major resources: Registers (R), Scratchpad Memory (S)

  9. The programmer has to statically allocate 3 major resources: Registers (R), Scratchpad Memory (S), Thread Slots (T)

  10. The programmer has to statically allocate 3 major resources: Registers (R), Scratchpad Memory (S), Thread Slots (T). Imperfect allocation leads to low performance.

  11. Tune the code. Fix: adjust the usage of registers (R), scratchpad (S), and thread slots (T).

  12. With R, T, and S tuned: high performance. Problem: programming effort.

  13. GPU 1 vs. GPU 2 (diagram of each GPU's registers R, scratchpad S, and thread slots T)

  14. Low Performance! Problem: Performance Portability

  15. Programmer-specified resource allocation leads to 3 key issues, with: (1) programming ease, (2) performance portability, and (3) performance for optimized code.

  16. Our approach: decouple the programmer-specified resource usage from the allocation in the hardware.

  17. Zorua: A Framework to Virtualize On-chip Resources in GPUs

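Slides 6 to 10 list the three resources a CUDA programmer fixes statically. As a minimal sketch of where those choices usually appear in source code (the kernel scaleKernel and the constants kThreadsPerBlock, kMinBlocksPerSM, and kTile are hypothetical, not from the talk): register usage (R) can be bounded with __launch_bounds__ or the -maxrregcount compiler flag, scratchpad (S) is reserved by __shared__ declarations, and thread slots (T) are claimed by the block size passed at launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical, programmer-chosen values, fixed at compile time.
constexpr int kThreadsPerBlock = 256;               // T: thread slots per block
constexpr int kMinBlocksPerSM  = 2;                 // lets the compiler cap registers (R)
constexpr int kTile            = kThreadsPerBlock;  // S: floats of scratchpad per block

// R: __launch_bounds__ gives the compiler the maximum block size and the
//    desired minimum number of resident blocks per SM, which can make it
//    limit the registers used per thread.
__global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)
scaleKernel(float *dst, const float *src, int n)
{
    // S: statically sized scratchpad (shared memory), reserved per block.
    __shared__ float tile[kTile];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? src[i] : 0.0f;
    __syncthreads();
    if (i < n) dst[i] = 2.0f * tile[threadIdx.x];
}

int main()
{
    const int n = 1 << 20;
    float *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    cudaMemset(src, 0, n * sizeof(float));

    // T: the number of threads per block is fixed at launch time.
    dim3 block(kThreadsPerBlock);
    dim3 grid((n + kThreadsPerBlock - 1) / kThreadsPerBlock);
    scaleKernel<<<grid, block>>>(dst, src, n);

    cudaError_t err = cudaDeviceSynchronize();
    std::printf("kernel finished: %s\n", cudaGetErrorString(err));
    cudaFree(src);
    cudaFree(dst);
    return err == cudaSuccess ? 0 : 1;
}
```

All three values are decided before the kernel runs, and the hardware holds a block's registers, scratchpad, and thread slots until that block completes, whether or not every part is in use at a given moment.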

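Slides 10 to 12 argue that an imperfect choice of R, S, and T lowers performance and that tuning them costs programming effort. One common way to see why is to count how many thread blocks fit on a streaming multiprocessor (SM) at once: whichever resource runs out first sets the limit. The sketch below is host-only code using a deliberately simplified model (it ignores allocation granularities and other hardware details). The structs, the function residentBlocks, and all capacity numbers are hypothetical, not figures from the talk.

```cuda
#include <algorithm>
#include <cstdio>

// Simplified per-SM capacities (hypothetical numbers).
struct SMCapacity {
    int registers;    // R available per SM
    int scratchpadB;  // S (shared memory) bytes per SM
    int threadSlots;  // T available per SM
    int maxBlocks;    // hardware limit on resident blocks per SM
};

// Static, programmer-chosen usage of one thread block.
struct BlockUsage {
    int regsPerThread;  // R per thread
    int scratchpadB;    // S bytes per block
    int threads;        // T per block
};

// Resident blocks per SM = the tightest of the per-resource limits.
int residentBlocks(const SMCapacity &sm, const BlockUsage &b)
{
    int byRegs    = sm.registers / (b.regsPerThread * b.threads);
    int byScratch = b.scratchpadB > 0 ? sm.scratchpadB / b.scratchpadB : sm.maxBlocks;
    int byThreads = sm.threadSlots / b.threads;
    return std::min({byRegs, byScratch, byThreads, sm.maxBlocks});
}

int main()
{
    SMCapacity sm{65536, 48 * 1024, 2048, 32};
    BlockUsage before{40, 16 * 1024, 256};  // limited by scratchpad
    BlockUsage after{40, 8 * 1024, 256};    // tuned: half the scratchpad per block
    std::printf("before tuning: %d blocks/SM\n", residentBlocks(sm, before));  // 3
    std::printf("after tuning:  %d blocks/SM\n", residentBlocks(sm, after));   // 6
    return 0;
}
```

With these numbers the untuned block is scratchpad-limited to 3 resident blocks; halving its scratchpad raises that to 6, where scratchpad and registers both allow exactly 6. Each such adjustment has to be found and re-checked by hand.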

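Slides 13 and 14 describe the performance-portability problem: code tuned for one GPU can perform poorly on another. A minimal way to observe this, assuming one or more CUDA devices are present, is to compile a kernel with fixed resource usage once and ask the CUDA runtime, for each device, how many of its blocks can be resident per SM. The kernel fixedKernel and its sizes are hypothetical; cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor are standard CUDA runtime calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with fixed usage: 256 threads per block (T) and an
// 8 KiB shared-memory array (S); register usage (R) is set by the compiler.
constexpr int kThreads = 256;
constexpr int kScratchFloats = 2048;  // 2048 * 4 B = 8 KiB, reserved per block

__global__ void fixedKernel(float *out)
{
    __shared__ float scratch[kScratchFloats];
    scratch[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch[kThreads - 1 - threadIdx.x];
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("no CUDA device found\n");
        return 0;
    }
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        // Same compiled kernel, same R/S/T: residency depends on the device.
        int blocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, fixedKernel, kThreads, 0);
        std::printf("%s: %d regs, %zu B shared, %d threads per SM -> %d blocks/SM\n",
                    prop.name, prop.regsPerMultiprocessor,
                    prop.sharedMemPerMultiprocessor,
                    prop.maxThreadsPerMultiProcessor, blocks);
    }
    return 0;
}
```

On two different GPUs the printed per-SM capacities, and therefore the resident-block counts, can differ even though neither the source nor the programmer's allocation changed.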

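Slides 16 and 17 introduce Zorua's idea of decoupling the resource usage the programmer specifies from the allocation made in hardware. This excerpt of the talk does not describe the mechanism, so the following is only a generic illustration of virtualization through indirection, written as host code around a hypothetical RegisterVirtualizer class: a thread block names virtual registers, and a mapping table binds them to physical registers on first use and frees them when the block finishes. It is not Zorua's actual design.

```cuda
#include <cstdio>
#include <unordered_map>
#include <vector>

// Hypothetical illustration: virtual registers are bound to physical ones
// lazily, so a block's declared usage does not dictate what it occupies.
class RegisterVirtualizer {
public:
    explicit RegisterVirtualizer(int physicalRegs) {
        for (int p = physicalRegs - 1; p >= 0; --p) freeList_.push_back(p);
    }

    // Returns the physical register backing (block, vreg), allocating it on
    // first touch; returns -1 if the physical pool is exhausted.
    int lookup(int block, int vreg) {
        auto &table = map_[block];
        auto it = table.find(vreg);
        if (it != table.end()) return it->second;
        if (freeList_.empty()) return -1;
        int p = freeList_.back();
        freeList_.pop_back();
        table[vreg] = p;
        return p;
    }

    // Returns every physical register held by a finished block to the pool.
    void release(int block) {
        for (auto &entry : map_[block]) freeList_.push_back(entry.second);
        map_.erase(block);
    }

    int freeCount() const { return static_cast<int>(freeList_.size()); }

private:
    std::vector<int> freeList_;                                   // unbound physical registers
    std::unordered_map<int, std::unordered_map<int, int>> map_;   // block -> (vreg -> preg)
};

int main() {
    RegisterVirtualizer rv(4);  // tiny physical pool for illustration
    rv.lookup(0, 0);            // block 0 touches only two of its virtual registers
    rv.lookup(0, 7);
    rv.lookup(1, 3);            // block 1 still fits in the remaining pool
    std::printf("free physical registers: %d\n", rv.freeCount());  // 1
    rv.release(0);
    std::printf("after block 0 finishes: %d\n", rv.freeCount());   // 3
    return 0;
}
```

Because only the registers a block actually touches consume the physical pool, the amount the programmer declares no longer fixes how much hardware the block holds. What to do when the pool is exhausted (the -1 case) is deliberately left out of this sketch.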