Lecture 6: Shared-memory Computing with GPU
Delve into shared-memory computing with GPUs in this lecture to unlock the potential of parallel processing, accelerating your tasks and applications. Explore advanced techniques and optimizations for harnessing the power of GPUs efficiently.
Download NVIDIA CUDA
Free download of the NVIDIA CUDA toolkit: https://developer.nvidia.com/cuda-downloads
CUDA programming on Visual Studio 2010
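After installing the toolkit, a quick way to confirm that the CUDA runtime and driver are working is to query the device count. This small check program is not from the slides; it is a suggested verification step using standard CUDA runtime calls.

```cuda
// Minimal installation check (compile with: nvcc check.cu -o check).
#include <cstdio>
#include "cuda_runtime.h"

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);  // queries available GPUs
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", count);
    return 0;
}
```

If this prints at least one device, kernel launches like the Matrix Addition example should work.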
Matrix Addition

```cuda
#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

const int N = 1024;
const int blocksize = 16;

// One thread per matrix element: (i, j) is the element's 2-D position,
// index is its offset in the row-major 1-D layout.
__global__ void add_matrix(float *a, float *b, float *c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

int main() {
    float *a = new float[N * N];
    float *b = new float[N * N];
    float *c = new float[N * N];

    for (int i = 0; i < N * N; ++i) {
        a[i] = 1.0f;
        b[i] = 3.5f;
    }

    // Allocate global (device) memory
    float *ad, *bd, *cd;
    const int size = N * N * sizeof(float);
    cudaMalloc((void**)&ad, size);
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);

    // Copy the input matrices to global memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    // Launch an N/16 x N/16 grid of 16x16 blocks
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);

    // Copy the result back and print it
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%f ", c[i * N + j]);  // was c[i,j], which is not 2-D indexing in C
        printf("\n");
    }

    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    delete[] a;
    delete[] b;  // was "delete b": new[] must be paired with delete[]
    delete[] c;
    return EXIT_SUCCESS;
}
```
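The program above does not check the return values of the CUDA runtime calls, so allocation or launch failures pass silently. A common remedy, not part of the original slide, is a hypothetical checking macro like the one sketched here; `cudaGetErrorString` and the runtime calls it wraps are standard CUDA API.

```cuda
#include <cstdio>
#include "cuda_runtime.h"

// Hypothetical helper (an assumption, not from the slides): wraps a CUDA
// runtime call and reports a readable message on failure.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            return 1;                                                  \
        }                                                              \
    } while (0)

int main() {
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d, 1024 * sizeof(float)));  // checked alloc
    CUDA_CHECK(cudaFree(d));                                   // checked free
    printf("All CUDA calls succeeded\n");
    return 0;
}
```

For kernel launches, which return no status directly, `cudaGetLastError()` immediately after the launch catches configuration errors, and `cudaDeviceSynchronize()` surfaces errors that occur while the kernel runs.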
Memory Allocation Example
[Figure: a 2-D grid of threads laid over a width x height matrix; the thread at (xIdx, yIdx) is located by threadIdx.x and threadIdx.y inside a dimBlock.x x dimBlock.y block.]
Memory Allocation Example: shared memory
[Figure: a width x height matrix in global memory, tiled into xBlock x yBlock blocks. Each tile is staged through shared memory, addressed by (threadIdx.x, threadIdx.y) on the way in and (threadIdx.y, threadIdx.x) on the way out, so the element at (x, y) lands at (y, x) in global memory.]
(1) Read from global memory and write to the block's shared memory
(2) Use the transposed address (swap the x and y indices)
(3) Read from shared memory and write back to global memory
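The three steps above can be sketched as a CUDA transpose kernel. This is a minimal sketch, assuming a 16x16 tile and a matrix size N that is a multiple of the tile width; the kernel and variable names are illustrative, not from the slides.

```cuda
#include "cuda_runtime.h"

const int BLOCK = 16;  // tile width, matching the slides' 16x16 blocks

// Matrix transpose staged through shared memory (assumes N % BLOCK == 0):
// (1) each thread copies one element of its tile into shared memory;
// (2) the tile is re-addressed with threadIdx.x and threadIdx.y swapped;
// (3) the tile is written back to the transposed block position.
__global__ void transpose(const float *in, float *out, int N) {
    __shared__ float tile[BLOCK][BLOCK];

    int x = blockIdx.x * BLOCK + threadIdx.x;  // column in the input
    int y = blockIdx.y * BLOCK + threadIdx.y;  // row in the input
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];        // step (1)

    __syncthreads();  // wait until the whole tile has been loaded

    x = blockIdx.y * BLOCK + threadIdx.x;      // transposed block origin
    y = blockIdx.x * BLOCK + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];       // steps (2)+(3)
}
```

The `__syncthreads()` barrier between the write and the read is essential: without it, some threads would read tile entries that other threads in the block have not yet written.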
Exercise
(1) Compile and execute the Matrix Addition program.
(2) Write a complete version of the Memory Allocation (shared-memory transpose) program.
(3) Write a program to calculate pi, where the number of intervals = 2^16.
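Exercise (3) presumably refers to the classic numerical integration of 4/(1 + x^2) over [0, 1], whose value is pi. A hedged starting-point sketch, assuming the midpoint rule with 2^16 intervals and a simple host-side reduction (names and structure are illustrative, not a prescribed solution):

```cuda
#include <cstdio>
#include "cuda_runtime.h"

// Each thread evaluates one strip of the midpoint rule for
// pi = integral of 4/(1+x^2) dx over [0,1]; the host sums the strips.
__global__ void pi_kernel(float *partial, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = (i + 0.5f) / n;                // midpoint of interval i
        partial[i] = 4.0f / (1.0f + x * x) / n;  // strip area
    }
}

int main() {
    const int n = 1 << 16;                       // 2^16 intervals
    const int size = n * sizeof(float);
    float *h = new float[n], *d;
    cudaMalloc((void**)&d, size);

    pi_kernel<<<(n + 255) / 256, 256>>>(d, n);   // 256 threads per block
    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);

    double pi = 0.0;                             // reduce on the host
    for (int i = 0; i < n; ++i) pi += h[i];
    printf("pi ~= %f\n", pi);

    cudaFree(d);
    delete[] h;
    return 0;
}
```

A natural follow-up, in the spirit of this lecture, is to replace the host-side loop with a shared-memory reduction inside each block.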