Unified Memory in CUDA: Key Concepts and Implementation Details

"Explore the concept of Unified Memory in CUDA, enabling seamless data access between CPU and GPU without explicit copying. Learn about page migration, system-wide operations, and GPU architecture advancements. Dive into examples and best practices for efficient memory management."

Uploaded by zbelg on Jun 17, 2025

Presentation Transcript

Unified CUDA Memory
RUI (RAY) WU
RAYWU1990@NEVADA.UNR.EDU

Outline
- Profile
- Unified Memory
- Ideas about Unified Vector Dot Product
- How to add vectors with more elements than the maximum number of threads?
- PA2

Profile
- What is nvprof?
- Profile a program: nvprof ./PA0 <argv>
- nvprof does not require cudaEvent_t timing code in the program and reports more detailed information.
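
For comparison, this is roughly what manual timing with cudaEvent_t looks like, i.e. the code that nvprof makes unnecessary. It is a minimal sketch, not the course's PA0: the kernel busyKernel and the helper timeKernelMs are hypothetical names introduced only for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; stands in for whatever the program launches.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Times one launch of busyKernel with CUDA events.
// Returns the elapsed GPU time in milliseconds, or -1.0f if allocation fails.
float timeKernelMs(int n) {
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                       // mark the start on the default stream
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);                        // mark the end
    cudaEventSynchronize(stop);                   // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}
```

Calling timeKernelMs(1 << 20) from main returns one timing for one kernel. Running the unmodified program under nvprof reports per-kernel and per-API-call times without any of this instrumentation.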

Unified Memory

Unified Memory
- Key idea: allocate and access data that can be used by code running on any processor in the system, CPU or GPU.
- No need for explicit cudaMemcpy calls with cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost.
- Works with multiple GPUs and multiple CPUs.
- Read more details: https://devblogs.nvidia.com/unified-memory-cuda-beginners/

Unified Memory: Vector Addition
- Example: https://devblogs.nvidia.com/unified-memory-cuda-beginners/
- cudaDeviceSynchronize: synchronize before accessing the data on the host!
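
A minimal sketch of vector addition with Unified Memory, in the spirit of the linked example but not copied from it. The helper name vectorAddManaged and the fill values 1.0f and 2.0f are assumptions chosen so the result is easy to check.

```cuda
#include <cmath>
#include <cuda_runtime.h>

// One thread per element: y[i] = x[i] + y[i].
__global__ void add(int n, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + y[i];
}

// Adds two managed vectors on the GPU and returns the largest absolute
// error against the expected value 3.0f, or -1.0f if allocation fails.
float vectorAddManaged(int n) {
    float *x = nullptr, *y = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x);
        return -1.0f;
    }

    // The host writes directly into managed memory; no cudaMemcpy is needed.
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(n, x, y);

    // Kernel launches are asynchronous: wait before the host reads y.
    cudaDeviceSynchronize();

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i) maxErr = fmaxf(maxErr, fabsf(y[i] - 3.0f));

    cudaFree(x);
    cudaFree(y);
    return maxErr;
}
```

Without the cudaDeviceSynchronize call, the host loop could read y before the kernel has finished, and on pre-Pascal GPUs the access itself would be invalid while the kernel is running.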

Unified Memory: How does it work?
- Data is stored in pages: Unified Memory automatically migrates data at the level of individual pages between host and device memory.
- Pages move between CPU memory and GPU memory.
- Explicit cudaMemcpy calls are replaced by allocating with cudaMallocManaged.
- A page behaves similarly to a cache: performance is better when the loaded data is used multiple times.
- Read: three methods to avoid page faults.
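
One way to reduce page migration is to initialize the data in a kernel, so the pages are first touched on the GPU rather than on the CPU. The sketch below illustrates that idea only; the kernel init and the helper addWithGpuInit are hypothetical names, and whether migration is actually avoided depends on the GPU and driver.

```cuda
#include <cuda_runtime.h>

// Fill both vectors on the GPU so their pages are first touched by the device.
__global__ void init(int n, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { x[i] = 1.0f; y[i] = 2.0f; }
}

__global__ void add(int n, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + y[i];
}

// Returns the number of elements of y that differ from 3.0f,
// or -1 if allocation fails.
int addWithGpuInit(int n) {
    float *x = nullptr, *y = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x);
        return -1;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    init<<<blocks, threads>>>(n, x, y);  // no host writes before the kernels run
    add<<<blocks, threads>>>(n, x, y);   // same stream, so it runs after init
    cudaDeviceSynchronize();             // required before the host reads y

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (y[i] != 3.0f) ++mismatches;
    }

    cudaFree(x);
    cudaFree(y);
    return mismatches;
}
```

The host still faults pages back when it verifies y at the end. That cost sits outside the kernels, which is where it matters when profiling kernel time.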

Unified Memory
- When the GPU accesses a page that is not resident in its memory, it stalls execution of the accessing threads, and the Page Migration Engine migrates the page to the device before resuming the threads.
- Pre-Pascal GPUs lack hardware page faulting, so coherence can't be guaranteed. On those GPUs, an access from the CPU while a kernel is running will cause a segmentation fault!
- Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs.
- What are Pascal and Volta? See https://en.wikipedia.org/wiki/CUDA
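
A small sketch of a system-wide atomic on a managed counter. It uses atomicAdd_system, which is available on devices of compute capability 6.0 (Pascal) and newer, so the file is assumed to be compiled with -arch=sm_60 or higher; on older targets the guard falls back to the ordinary device-wide atomicAdd. The helper name systemAtomicCount is hypothetical, and the sketch uses a single GPU, so it shows the call rather than real multi-GPU contention.

```cuda
#include <cuda_runtime.h>

// Every thread increments the same counter in managed memory.
__global__ void countAll(int *counter) {
#if __CUDA_ARCH__ >= 600
    atomicAdd_system(counter, 1);  // atomic with respect to the CPU and other GPUs
#else
    atomicAdd(counter, 1);         // fallback: atomic only within this GPU
#endif
}

// Launches blocks * threads threads and returns the final counter value,
// or -1 if allocation fails.
int systemAtomicCount(int blocks, int threads) {
    int *counter = nullptr;
    if (cudaMallocManaged(&counter, sizeof(int)) != cudaSuccess) return -1;
    *counter = 0;

    countAll<<<blocks, threads>>>(counter);
    cudaDeviceSynchronize();  // wait before the host reads the counter

    int result = *counter;
    cudaFree(counter);
    return result;
}
```

With 64 blocks of 256 threads the returned count is 64 * 256, because no increment is lost.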

Unified Memory
- 49-bit virtual addressing and on-demand page migration.
- 49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system.
- How many GB can 49 bits address? Discuss in next class.
- More reading materials: https://devblogs.nvidia.com/unified-memory-in-cuda-6/
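
As a starting point for that discussion, the arithmetic can be worked out as follows, assuming byte-addressable memory and binary gigabytes (1 GiB = 2^30 bytes):

```latex
\[
2^{49}\ \text{bytes}
  = 2^{19} \times 2^{30}\ \text{bytes}
  = 524{,}288\ \text{GiB}
  = 2^{9} \times 2^{40}\ \text{bytes}
  = 512\ \text{TiB}
\]
```

In decimal units the same address space is about 562,950 GB, since 2^49 = 562,949,953,421,312.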

Ideas about Unified Vector Dot Product
- Step 1: calculate the product of each pair of elements within one block (useful for PA2).
- Step 2: call __syncthreads() to synchronize the threads in this block.
- Step 3: sum reduction.

Ideas about Unified Vector Dot Product: Sum Reduction
- Call __syncthreads() to synchronize the threads in this block.
- Page 80 of the book shows how to do this using shared memory.
- Shared memory: old version
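
The three steps can be sketched as below, combining managed memory with a shared-memory tree reduction inside each block. This is one common way to write the reduction, not necessarily the book's version or the PA2 solution. The names dotPartial, dotManaged, and THREADS are hypothetical, and the fill values 1.0f and 2.0f are chosen so the expected result is 2 * n.

```cuda
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // threads per block; must be a power of two

// Each block writes the sum of its pairwise products to partial[blockIdx.x].
__global__ void dotPartial(int n, const float *a, const float *b, float *partial) {
    __shared__ float cache[THREADS];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Step 1: each thread computes one product (0 if past the end).
    cache[tid] = (i < n) ? a[i] * b[i] : 0.0f;

    // Step 2: make sure every product is written before any thread reads it.
    __syncthreads();

    // Step 3: tree reduction, halving the number of active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0) partial[blockIdx.x] = cache[0];
}

// Computes the dot product of a = (1, 1, ...) and b = (2, 2, ...) of length n.
// Returns -1.0f if allocation fails.
float dotManaged(int n) {
    int blocks = (n + THREADS - 1) / THREADS;
    float *a = nullptr, *b = nullptr, *partial = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&b, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&partial, blocks * sizeof(float)) != cudaSuccess) {
        cudaFree(a); cudaFree(b); cudaFree(partial);  // cudaFree(nullptr) is a no-op
        return -1.0f;
    }

    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    dotPartial<<<blocks, THREADS>>>(n, a, b, partial);
    cudaDeviceSynchronize();  // wait before the host reads the partial sums

    // Finish on the host: add up one partial sum per block.
    float sum = 0.0f;
    for (int k = 0; k < blocks; ++k) sum += partial[k];

    cudaFree(a); cudaFree(b); cudaFree(partial);
    return sum;
}
```

The __syncthreads() inside the loop sits outside the if, so every thread in the block reaches it on every round; placing it inside the conditional would leave some threads waiting on a barrier the others never reach.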