
Optimizing Gaussian Blur Memory Access
Explore memory optimization techniques for efficient Gaussian blur computation, including separable filter implementation, constant memory usage, and shared memory optimization.
Presentation Transcript
ECE 4822: Final Project
Sadia Afrin Purba
System Architecture Overview
1. Python Frontend
   - Image loading and windowing
   - Batch management
2. C++/CUDA Backend
   - Pinned memory management
   - CUDA kernel execution: gaussian_filter.cu, histogram_kernel.cu
   - Mean and standard deviation computation
System Architecture Overview (continued)
3. pybind11
   - Bridge between C++ and Python
   - Allows C++ functions and classes to be called directly from Python
   - Handles automatic conversion between Python and C++ data types
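For illustration, a pybind11 binding for the backend might look like the sketch below. This is a minimal sketch only: the module name gpu_backend and the process_batch signature are assumptions, not the project's actual API.

    // Hypothetical binding sketch; gpu_backend and process_batch are
    // illustrative names, not the project's real interface.
    #include <pybind11/pybind11.h>
    #include <pybind11/numpy.h>

    namespace py = pybind11;

    // Assumed entry point implemented in the C++/CUDA backend.
    void process_batch(py::array_t<float> images);

    PYBIND11_MODULE(gpu_backend, m) {
        // pybind11 generates the Python-to-C++ argument conversion
        // (e.g. NumPy array -> py::array_t) automatically.
        m.def("process_batch", &process_batch,
              "Run the blur/histogram pipeline on a batch of images");
    }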
Memory Optimization Techniques
- Pinned Memory Allocation
  - Page-locked host memory
  - Faster host-to-device transfers
  - Direct GPU access
- CUDA Streams
  - A stream is passed to cudaMemcpyAsync to perform asynchronous memory transfers
  - The CPU can continue working while the GPU transfer is in progress (see the sketch below)
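Below is a minimal sketch of this pinned-memory plus stream pattern, assuming a single float buffer; the function name and surrounding structure are illustrative, not the project's code.

    #include <cuda_runtime.h>

    void transfer_batch(size_t n_bytes) {
        float *h_pinned, *d_buf;
        cudaStream_t stream;

        cudaMallocHost(&h_pinned, n_bytes);   // page-locked (pinned) host memory
        cudaMalloc(&d_buf, n_bytes);
        cudaStreamCreate(&stream);

        // Asynchronous copy: the call returns immediately, so the CPU
        // keeps working while the DMA transfer runs on the stream.
        cudaMemcpyAsync(d_buf, h_pinned, n_bytes,
                        cudaMemcpyHostToDevice, stream);

        // ... launch kernels on the same stream here ...

        cudaStreamSynchronize(stream);        // wait before reusing the buffers
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_pinned);
    }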
Gaussian Blur Optimization
- Separable Filter Implementation
  - Split the 2D filter into two 1D passes: horizontal, then vertical
  - Reduces per-pixel complexity from O(n²) to O(n) in the kernel width n
- Constant Memory Usage (sketched below)
  - Kernel coefficients stored in constant memory
  - Fast access through the constant cache
  - Shared by all threads
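The constant-memory half of this might look like the following sketch; MAX_RADIUS and the symbol name c_kernel are assumptions for illustration.

    #include <cuda_runtime.h>

    #define MAX_RADIUS 15                 // assumed upper bound on the radius

    // Read-only in kernels, served through the constant cache, and
    // shared by all threads.
    __constant__ float c_kernel[2 * MAX_RADIUS + 1];

    void upload_coefficients(const float* h_kernel, int radius) {
        // Copy the 1D Gaussian coefficients into the constant symbol.
        cudaMemcpyToSymbol(c_kernel, h_kernel,
                           (2 * radius + 1) * sizeof(float));
    }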
Gaussian Blur Memory Access Pattern
Horizontal Pass: each thread block
- Uses a sequential (coalesced) memory access pattern
- Shares data among threads using shared memory
- Handles edge cases with boundary checks
(a horizontal-pass sketch follows)
Vertical Pass
- Similar optimization strategy
- Efficient shared memory usage for column operations
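One plausible shape for the horizontal pass is sketched below, assuming clamp-to-edge boundaries, a fixed RADIUS, a block width of TILE_W (one block per TILE_W-wide row segment), and the c_kernel symbol from the previous sketch; none of these names are taken from the project source.

    #define TILE_W 256
    #define RADIUS 7                      // assumed filter radius

    __constant__ float c_kernel[2 * RADIUS + 1];

    __global__ void blur_horizontal(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    int width, int height) {
        // One row segment per block, plus a halo of RADIUS on each side.
        __shared__ float tile[TILE_W + 2 * RADIUS];

        int row = blockIdx.y;
        int col = blockIdx.x * TILE_W + threadIdx.x;

        // Sequential (coalesced) load with clamp-to-edge boundary checks.
        int src = min(max(col - RADIUS, 0), width - 1);
        tile[threadIdx.x] = in[row * width + src];
        if (threadIdx.x < 2 * RADIUS) {
            int halo = min(max(col - RADIUS + TILE_W, 0), width - 1);
            tile[threadIdx.x + TILE_W] = in[row * width + halo];
        }
        __syncthreads();

        if (col < width) {
            float sum = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; ++k)
                sum += c_kernel[k + RADIUS] * tile[threadIdx.x + RADIUS + k];
            out[row * width + col] = sum;
        }
    }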
Gaussian Blur Memory Access Pattern: Vertical Pass

Memory layout in row-major order:
  [R0C0][R0C1][R0C2][R0C3]   Row 0
  [R1C0][R1C1][R1C2][R1C3]   Row 1
  [R2C0][R2C1][R2C2][R2C3]   Row 2
  [R3C0][R3C1][R3C2][R3C3]   Row 3

The vertical pass needs columns:
  Thread 0 needs: [R0C0]-[R1C0]-[R2C0]   // Column 0
  Thread 1 needs: [R0C1]-[R1C1]-[R2C1]   // Column 1
  Thread 2 needs: [R0C2]-[R1C2]-[R2C2]   // Column 2
Gaussian Blur Memory Access Pattern
If we use global memory directly, Thread 0 gets its column as:
  Load 1: R0C0 at address X
  Load 2: R1C0 at address X + width      // far jump in memory
  Load 3: R2C0 at address X + 2*width    // another far jump

Issues: strided memory access
- Each thread jumps 'width' elements between loads
- Causes multiple memory transactions
- Poor cache utilization

Solution: a shared memory transformation, so that adjacent threads read adjacent memory.
Gaussian Blur Memory Access Pattern
Step 1: Sequential load
  Thread 0 reads: R0C0
  Thread 1 reads: R0C1
  Thread 2 reads: R0C2
  ...
  (All threads read consecutive memory: coalesced and efficient!)

Step 2: Vertical filter access
  Thread 0 now reads from shared memory:
    R0C0: direct access
    R1C0: +blockDim.x
    R2C0: +2*blockDim.x
  Fast shared memory access with no global memory latency (see the sketch below).
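Combining the two steps, a vertical-pass kernel along these lines stages a 2D tile (with a top/bottom halo) in shared memory; the tile sizes and names are assumptions, not the project's values.

    #define TILE_X 32
    #define TILE_Y 16
    #define RADIUS 7

    __constant__ float c_kernel[2 * RADIUS + 1];

    __global__ void blur_vertical(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  int width, int height) {
        // TILE_Y output rows plus RADIUS halo rows above and below.
        __shared__ float tile[TILE_Y + 2 * RADIUS][TILE_X];

        int col  = blockIdx.x * TILE_X + threadIdx.x;
        int row0 = blockIdx.y * TILE_Y;

        // Step 1: load the tile one row at a time; adjacent threads read
        // adjacent columns, so every row load is coalesced.
        for (int i = threadIdx.y; i < TILE_Y + 2 * RADIUS; i += blockDim.y) {
            int r = min(max(row0 + i - RADIUS, 0), height - 1);
            if (col < width)
                tile[i][threadIdx.x] = in[r * width + col];
        }
        __syncthreads();

        // Step 2: the vertical filter walks the column in shared memory
        // instead of making strided trips to global memory.
        int row = row0 + threadIdx.y;
        if (col < width && row < height) {
            float sum = 0.0f;
            for (int k = -RADIUS; k <= RADIUS; ++k)
                sum += c_kernel[k + RADIUS]
                     * tile[threadIdx.y + RADIUS + k][threadIdx.x];
            out[row * width + col] = sum;
        }
    }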
Gaussian Blur Performance Features
- Memory Bandwidth Optimization
  - Restricted pointer qualifier (__restrict__) on kernel arguments
  - Promises the compiler that the pointers do not overlap (no aliasing)
  - Enables better compiler optimization (toy example below)
- Synchronization
  - Minimal use of __syncthreads()
  - Only where necessary for shared memory coherence
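The aliasing point is easiest to see on a toy kernel (not from the project): without __restrict__, the compiler must assume in and out may overlap and reload values defensively; with it, loaded values can stay cached in registers.

    __global__ void scale(const float* __restrict__ in,
                          float* __restrict__ out,
                          float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = s * in[i];    // no alias re-checks needed
    }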
Histogram Computation Optimization
Max Value Finding
- Parallel reduction in shared memory
- Efficient tree-based reduction pattern

Sequential reduction:
  Steps needed: N - 1
  Total comparisons: N - 1
Parallel (tree) reduction:
  Steps needed: log2(N)
  Total comparisons: N - 1, but the comparisons happen in parallel!
Histogram Computation Optimization
Tree Reduction to Get the Max Value
1. Each thread compares elements that are 's' positions apart
2. The stride 's' starts at blockDim.x/2 and is halved each iteration
3. A thread only operates if its ID (tid) is less than the current stride
4. The pattern ensures all elements are compared without missing any

Example with 16 elements:
  Iteration 1: [0-7] compare with [8-15]
  Iteration 2: [0-3] compare with [4-7]
  Iteration 3: [0-1] compare with [2-3]
  Iteration 4: [0] compares with [1]
  Final: [0] contains the max
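A shared-memory max reduction following exactly this stride pattern might look like the sketch below, assuming a power-of-two block size; the names are illustrative.

    #include <cfloat>                     // FLT_MAX

    __global__ void block_max(const float* __restrict__ in,
                              float* __restrict__ block_out, int n) {
        extern __shared__ float sdata[];
        unsigned tid = threadIdx.x;
        unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;

        // One element per thread; pad the tail with -FLT_MAX.
        sdata[tid] = (i < n) ? in[i] : -FLT_MAX;
        __syncthreads();

        // Stride 's' starts at blockDim.x/2 and halves each iteration;
        // only threads with tid < s compare, so nothing is missed.
        for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
            __syncthreads();
        }

        // Thread 0 holds this block's maximum.
        if (tid == 0) block_out[blockIdx.x] = sdata[0];
    }

It would be launched as block_max<<<blocks, threads, threads * sizeof(float)>>>(...), with a second pass (or a small host-side loop) reducing the per-block maxima to a single value.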
Histogram Computation Optimization
Thread block memory pattern (one bin per thread):
  Thread 0:  [Bin0]  += count
  Thread 1:  [Bin1]  += count
  ...
  Thread 15: [Bin15] += count
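One common way to realize this pattern is a block-private histogram in shared memory that is flushed bin by bin, as in the sketch below; the 256-bin count and all names are assumptions rather than the project's actual kernel.

    #define NUM_BINS 256                  // assumed bin count

    __global__ void histogram256(const unsigned char* __restrict__ img,
                                 unsigned int* __restrict__ hist, int n) {
        __shared__ unsigned int local[NUM_BINS];

        // Zero the block-private histogram, one bin per thread (strided).
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            local[b] = 0;
        __syncthreads();

        // Accumulate into fast shared memory with a grid-stride loop.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += gridDim.x * blockDim.x)
            atomicAdd(&local[img[i]], 1u);
        __syncthreads();

        // Flush: thread b adds bin b's count to the global histogram,
        // mirroring the "Thread i: [Bin i] += count" pattern above.
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&hist[b], local[b]);
    }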
Results (Naïve Approach)

  Number of Images   Time                                               Node
  1                  real 37m29.506s, user 42m11.486s, sys 0m53.665s    nedc_012
Results

  Number of Images   Time                                               Node
  1                  real 1m58.964s,  user 6m3.611s,   sys 1m18.409s    nedc_012
  4                  real 23m6.097s,  user 22m28.160s, sys 5m21.400s    nedc_012
  10                 real 69m0.512s,  user 58m57.503s, sys 14m52.981s   nedc_012