Lecture 6: Shared-memory Computing with GPU
Delve into shared-memory computing with GPUs in this lecture to unlock the potential of parallel processing, accelerating your tasks and applications. Explore advanced techniques and optimizations for harnessing the power of GPUs efficiently.
Download NVIDIA CUDA
Free download of the NVIDIA CUDA toolkit: https://developer.nvidia.com/cuda-downloads
CUDA programming on Visual Studio 2010
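After installing the toolkit, a quick way to confirm that the CUDA runtime and driver are working is to query the device count. This small check program is not from the slides; it is a suggested verification step using standard CUDA runtime calls.

```cuda
// Minimal installation check (compile with: nvcc check.cu -o check).
#include <cstdio>
#include "cuda_runtime.h"

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);  // queries available GPUs
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", count);
    return 0;
}
```

If this prints at least one device, kernel launches like the Matrix Addition example should work.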
Matrix Addition

```cuda
#include <stdio.h>
#include <stdlib.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"

const int N = 1024;
const int blocksize = 16;

// One thread per matrix element: (i, j) is the element's 2-D position,
// index is its offset in the row-major 1-D layout.
__global__ void add_matrix(float *a, float *b, float *c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

int main() {
    float *a = new float[N * N];
    float *b = new float[N * N];
    float *c = new float[N * N];

    for (int i = 0; i < N * N; ++i) {
        a[i] = 1.0f;
        b[i] = 3.5f;
    }

    // Allocate global (device) memory
    float *ad, *bd, *cd;
    const int size = N * N * sizeof(float);
    cudaMalloc((void**)&ad, size);
    cudaMalloc((void**)&bd, size);
    cudaMalloc((void**)&cd, size);

    // Copy the input matrices to global memory
    cudaMemcpy(ad, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, size, cudaMemcpyHostToDevice);

    // Launch an N/16 x N/16 grid of 16x16 blocks
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(N / dimBlock.x, N / dimBlock.y);
    add_matrix<<<dimGrid, dimBlock>>>(ad, bd, cd, N);

    // Copy the result back and print it
    cudaMemcpy(c, cd, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%f ", c[i * N + j]);  // was c[i,j], which is not 2-D indexing in C
        printf("\n");
    }

    cudaFree(ad); cudaFree(bd); cudaFree(cd);
    delete[] a;
    delete[] b;  // was "delete b": new[] must be paired with delete[]
    delete[] c;
    return EXIT_SUCCESS;
}
```
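The program above does not check the return values of the CUDA runtime calls, so allocation or launch failures pass silently. A common remedy, not part of the original slide, is a hypothetical checking macro like the one sketched here; `cudaGetErrorString` and the runtime calls it wraps are standard CUDA API.

```cuda
#include <cstdio>
#include "cuda_runtime.h"

// Hypothetical helper (an assumption, not from the slides): wraps a CUDA
// runtime call and reports a readable message on failure.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            return 1;                                                  \
        }                                                              \
    } while (0)

int main() {
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc((void**)&d, 1024 * sizeof(float)));  // checked alloc
    CUDA_CHECK(cudaFree(d));                                   // checked free
    printf("All CUDA calls succeeded\n");
    return 0;
}
```

For kernel launches, which return no status directly, `cudaGetLastError()` immediately after the launch catches configuration errors, and `cudaDeviceSynchronize()` surfaces errors that occur while the kernel runs.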
Memory Allocation Example
[Figure: a 2-D grid of threads laid over a width x height matrix; the thread at (xIdx, yIdx) is located by threadIdx.x and threadIdx.y inside a dimBlock.x x dimBlock.y block.]
Memory Allocation Example: shared memory
[Figure: a width x height matrix in global memory, tiled into xBlock x yBlock blocks. Each tile is staged through shared memory, addressed by (threadIdx.x, threadIdx.y) on the way in and (threadIdx.y, threadIdx.x) on the way out, so the element at (x, y) lands at (y, x) in global memory.]
(1) Read from global memory and write to the block's shared memory
(2) Use the transposed address (swap the x and y indices)
(3) Read from shared memory and write back to global memory
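The three steps above can be sketched as a CUDA transpose kernel. This is a minimal sketch, assuming a 16x16 tile and a matrix size N that is a multiple of the tile width; the kernel and variable names are illustrative, not from the slides.

```cuda
#include "cuda_runtime.h"

const int BLOCK = 16;  // tile width, matching the slides' 16x16 blocks

// Matrix transpose staged through shared memory (assumes N % BLOCK == 0):
// (1) each thread copies one element of its tile into shared memory;
// (2) the tile is re-addressed with threadIdx.x and threadIdx.y swapped;
// (3) the tile is written back to the transposed block position.
__global__ void transpose(const float *in, float *out, int N) {
    __shared__ float tile[BLOCK][BLOCK];

    int x = blockIdx.x * BLOCK + threadIdx.x;  // column in the input
    int y = blockIdx.y * BLOCK + threadIdx.y;  // row in the input
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];        // step (1)

    __syncthreads();  // wait until the whole tile has been loaded

    x = blockIdx.y * BLOCK + threadIdx.x;      // transposed block origin
    y = blockIdx.x * BLOCK + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];       // steps (2)+(3)
}
```

The `__syncthreads()` barrier between the write and the read is essential: without it, some threads would read tile entries that other threads in the block have not yet written.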
Exercise
(1) Compile and execute the Matrix Addition program.
(2) Write a complete version of the Memory Allocation (shared-memory transpose) program.
(3) Write a program to calculate pi, where the number of intervals = 2^16.
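Exercise (3) presumably refers to the classic numerical integration of 4/(1 + x^2) over [0, 1], whose value is pi. A hedged starting-point sketch, assuming the midpoint rule with 2^16 intervals and a simple host-side reduction (names and structure are illustrative, not a prescribed solution):

```cuda
#include <cstdio>
#include "cuda_runtime.h"

// Each thread evaluates one strip of the midpoint rule for
// pi = integral of 4/(1+x^2) dx over [0,1]; the host sums the strips.
__global__ void pi_kernel(float *partial, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = (i + 0.5f) / n;                // midpoint of interval i
        partial[i] = 4.0f / (1.0f + x * x) / n;  // strip area
    }
}

int main() {
    const int n = 1 << 16;                       // 2^16 intervals
    const int size = n * sizeof(float);
    float *h = new float[n], *d;
    cudaMalloc((void**)&d, size);

    pi_kernel<<<(n + 255) / 256, 256>>>(d, n);   // 256 threads per block
    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);

    double pi = 0.0;                             // reduce on the host
    for (int i = 0; i < n; ++i) pi += h[i];
    printf("pi ~= %f\n", pi);

    cudaFree(d);
    delete[] h;
    return 0;
}
```

A natural follow-up, in the spirit of this lecture, is to replace the host-side loop with a shared-memory reduction inside each block.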