Lecture 6: Shared-memory Computing with GPU


This lecture introduces shared-memory computing with GPUs: how CUDA organizes threads and blocks, how data moves between host and device (global) memory, and how on-chip shared memory can accelerate operations such as matrix addition and matrix transposition.

  • GPU Computing
  • Parallel Processing
  • Shared Memory
  • Optimizations




Presentation Transcript


  1. Lecture 6: Shared-memory Computing with GPU

  2. START: Download the free NVIDIA CUDA Toolkit from https://developer.nvidia.com/cuda-downloads and set up CUDA programming in Visual Studio 2010.
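The slide gives no verification step; the following is a minimal sketch for checking that the toolkit and a CUDA-capable device are working, using only standard CUDA runtime calls (the file name check_cuda.cu is an assumption, not from the lecture):

     // check_cuda.cu - illustrative install check, not from the slides
     #include <stdio.h>
     #include "cuda_runtime.h"

     int main()
     {
         int count = 0;
         cudaError_t err = cudaGetDeviceCount(&count);  // ask the runtime how many GPUs exist
         if (err != cudaSuccess || count == 0) {
             printf("No CUDA device found: %s\n", cudaGetErrorString(err));
             return 1;
         }
         cudaDeviceProp prop;
         cudaGetDeviceProperties(&prop, 0);             // properties of device 0
         printf("Device 0: %s, %d KB shared memory per block\n",
                prop.name, (int)(prop.sharedMemPerBlock / 1024));
         return 0;
     }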

  3. START: Matrix Addition

     #include <stdio.h>
     #include <stdlib.h>
     #include "cuda_runtime.h"
     #include "device_launch_parameters.h"

     const int N = 1024;
     const int blocksize = 16;

     // Each thread adds one element; inputs and output live in global memory.
     __global__ void add_matrix(float *a, float *b, float *c, int N)
     {
         int i = blockIdx.x * blockDim.x + threadIdx.x;
         int j = blockIdx.y * blockDim.y + threadIdx.y;
         int index = i + j * N;
         if (i < N && j < N)
             c[index] = a[index] + b[index];
     }

     int main()
     {
         float *a = new float[N*N];
         float *b = new float[N*N];
         float *c = new float[N*N];

         for (int i = 0; i < N*N; ++i) {
             a[i] = 1.0f;
             b[i] = 3.5f;
         }

         float *ad, *bd, *cd;
         const int size = N*N*sizeof(float);
         cudaMalloc((void**)&ad, size);
         cudaMalloc((void**)&bd, size);
         cudaMalloc((void**)&cd, size);

         // copy inputs to device global memory
         cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
         cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

         dim3 dimBlock(blocksize, blocksize);
         dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
         add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);

         cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);

         for (int i = 0; i < N; i++) {
             for (int j = 0; j < N; j++)
                 printf("%f ", c[i*N + j]);  // the slide's c[i,j] is not valid 2-D indexing in C
             printf("\n");
         }

         cudaFree(ad); cudaFree(bd); cudaFree(cd);
         delete[] a;
         delete[] b;                         // the slide's "delete b" leaks; arrays need delete[]
         delete[] c;
         return EXIT_SUCCESS;
     }

     [Figure: thread/block layout over the N x N matrix - threadIdx.x runs along the width (dimBlock.x), threadIdx.y along the height (dimBlock.y); element (i, j) is handled by one thread.]
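The slides do not show a build command; assuming the standard CUDA toolchain and the hypothetical file name matrix_add.cu, the program can be compiled and run from a command line with:

     nvcc matrix_add.cu -o matrix_add
     ./matrix_add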

  4. Memory Allocation Example

  5. Memory Allocation Example [Figure: mapping of a thread to matrix element (xIdx, yIdx) - threadIdx.x runs along the width (dimBlock.x), threadIdx.y along the height (dimBlock.y).]

  6. Memory Allocation Example

  7. Memory Allocation Example

  8. Memory Allocation Example [Figure: matrix transpose via shared memory - an element addressed as (threadIdx.x, threadIdx.y) in global memory is stored as (threadIdx.y, threadIdx.x) in shared memory, with block coordinates (xBlock, yBlock) over the width x height matrix.] (1) Read from global memory & write to block shared memory. (2) Transpose the address. (3) Read from the shared memory & write to global memory.

  9. Memory Allocation Example [Figure: the same transpose, step by step - (1) a tile is read from global memory into shared memory, (2) its addresses are transposed inside shared memory, (3) the tile is written back to global memory at the transposed position (y, x).]
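Slides 8-9 describe the transpose but the transcript preserves no code; the following is a minimal sketch of the three steps under the scheme the figures show. The tile size BLOCK and the kernel name transpose_shared are assumptions, and the sketch assumes width and height are multiples of BLOCK (as with N = 1024 and blocksize = 16 above):

     const int BLOCK = 16;   // assumed tile size, matching blocksize above

     __global__ void transpose_shared(float *out, const float *in, int width, int height)
     {
         __shared__ float tile[BLOCK][BLOCK];

         // (1) read from global memory & write to block shared memory
         int x = blockIdx.x * BLOCK + threadIdx.x;
         int y = blockIdx.y * BLOCK + threadIdx.y;
         if (x < width && y < height)
             tile[threadIdx.y][threadIdx.x] = in[y * width + x];

         __syncthreads();   // wait until the whole tile is loaded

         // (2) transposed address: swap the block coordinates
         x = blockIdx.y * BLOCK + threadIdx.x;
         y = blockIdx.x * BLOCK + threadIdx.y;

         // (3) read from shared memory & write to global memory
         if (x < height && y < width)
             out[y * height + x] = tile[threadIdx.x][threadIdx.y];
     }

Padding the tile to tile[BLOCK][BLOCK+1] is a common refinement that avoids shared-memory bank conflicts on the transposed read.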

  10. Exercise (1) Compile and execute the Matrix Addition program. (2) Write a complete version of the program for the Memory Allocation example. (3) Write a program to calculate pi, where the number of intervals = 2^16.
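Exercise (3) does not prescribe a method; one common approach (an assumption here, not from the slide) is the midpoint rule on the integral of 4/(1+x^2) over [0,1], which equals pi, using 2^16 intervals. A minimal single-block sketch:

     #include <stdio.h>
     #include "cuda_runtime.h"

     const int INTERVALS = 1 << 16;   // 2^16 intervals, as in the exercise
     const int THREADS   = 256;       // assumed block size

     // Midpoint rule: sum of 4/(1+x^2) at interval midpoints, scaled by h.
     __global__ void pi_kernel(double *result)
     {
         __shared__ double partial[THREADS];
         double h = 1.0 / INTERVALS;
         double sum = 0.0;

         // each thread handles a strided subset of the intervals
         for (int i = threadIdx.x; i < INTERVALS; i += THREADS) {
             double x = (i + 0.5) * h;        // midpoint of interval i
             sum += 4.0 / (1.0 + x * x);
         }
         partial[threadIdx.x] = sum;
         __syncthreads();

         // tree reduction in shared memory
         for (int s = THREADS / 2; s > 0; s >>= 1) {
             if (threadIdx.x < s)
                 partial[threadIdx.x] += partial[threadIdx.x + s];
             __syncthreads();
         }
         if (threadIdx.x == 0)
             *result = partial[0] * h;
     }

     int main()
     {
         double *d_result, pi = 0.0;
         cudaMalloc((void**)&d_result, sizeof(double));
         pi_kernel<<<1, THREADS>>>(d_result);
         cudaMemcpy(&pi, d_result, sizeof(double), cudaMemcpyDeviceToHost);
         cudaFree(d_result);
         printf("pi ~= %.10f\n", pi);
         return 0;
     }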
