Optimizing Gaussian Blur Memory Access

ECE 4822 Final Project

Explore memory optimization techniques for efficient Gaussian blur computation, including separable filter implementation, constant memory usage, and shared memory optimization.

  • Optimization
  • Gaussian Blur
  • Memory Access
  • CUDA
  • Shared Memory


Presentation Transcript


  1. ECE 4822: Final Project
     Sadia Afrin Purba

  2. System Architecture Overview
     1. Python Frontend
        - Image loading and windowing
        - Batch management
     2. C++/CUDA Backend
        - Pinned memory management
        - CUDA kernel execution: gaussian_filter.cu, histogram_kernel.cu
        - Calculating mean and standard deviation

  3. System Architecture Overview
     3. pybind11
        - Bridge between C++ and Python
        - Allows C++ functions and classes to be called directly from Python
        - Handles automatic conversion between Python and C++ data types
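A minimal pybind11 binding sketch of this bridge. The module name `backend`, the function name, and its signature are illustrative assumptions, not the project's actual API:

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>

namespace py = pybind11;

// Hypothetical entry point that would hand the image to the CUDA backend.
void gaussian_blur(py::array_t<float> image, float sigma) {
    // ... copy into pinned buffers and launch the gaussian_filter.cu kernels ...
    (void)image; (void)sigma;
}

PYBIND11_MODULE(backend, m) {
    m.doc() = "C++/CUDA backend exposed to the Python frontend";
    // pybind11 converts NumPy arrays and Python floats automatically.
    m.def("gaussian_blur", &gaussian_blur,
          py::arg("image"), py::arg("sigma"));
}
```

From Python this would be called as `import backend; backend.gaussian_blur(img, sigma=1.5)`, with no manual type marshalling.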

  4. Memory Optimization Techniques
     Pinned Memory Allocation
        - Faster host-device transfer
        - Page-locked memory
        - Direct GPU access
     CUDA Streams
        - A stream is used with cudaMemcpyAsync to perform asynchronous memory transfers
        - This allows the CPU to continue working while the GPU transfer is in progress (see the sketch below)
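A minimal sketch of this transfer path, assuming illustrative buffer names and sizes rather than the project's actual code:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t nBytes = 1024 * 1024 * sizeof(float);
    float *hPinned, *dImage;

    // Page-locked (pinned) host allocation: enables direct DMA access
    // by the GPU and faster host-device transfers.
    cudaMallocHost(&hPinned, nBytes);
    cudaMalloc(&dImage, nBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Asynchronous copy: returns immediately, so the CPU can keep working
    // (e.g., loading the next image batch) while the transfer runs.
    cudaMemcpyAsync(dImage, hPinned, nBytes, cudaMemcpyHostToDevice, stream);

    // ... CPU work overlaps with the transfer here ...

    cudaStreamSynchronize(stream);  // wait before using dImage

    cudaStreamDestroy(stream);
    cudaFree(dImage);
    cudaFreeHost(hPinned);
    return 0;
}
```

Note that cudaMemcpyAsync only overlaps with CPU work when the host buffer is pinned; with pageable memory the copy silently degrades to a synchronous transfer.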

  5. Gaussian Blur Optimization
     Separable Filter Implementation
        - Split the 2D filter into two 1D passes: horizontal, then vertical
        - Reduces per-pixel complexity from O(n²) to O(n), where n is the kernel width
     Constant Memory Usage
        - Kernel coefficients stored in constant memory (see the sketch below)
        - Fast access through the constant cache
        - Shared by all threads
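A sketch of the constant-memory setup; the maximum radius and all names are assumptions, not taken from gaussian_filter.cu:

```cuda
#include <cuda_runtime.h>

#define MAX_RADIUS 15

// Coefficients live in the constant address space: cached, read-only,
// and broadcast to all threads reading the same element.
__constant__ float d_gaussKernel[2 * MAX_RADIUS + 1];

// Host side: copy the precomputed 1D coefficients once. Every thread in
// the convolution loop then reads them through the constant cache.
void uploadKernel(const float* hKernel, int radius) {
    cudaMemcpyToSymbol(d_gaussKernel, hKernel,
                       (2 * radius + 1) * sizeof(float));
}
```

The constant cache broadcast works well here precisely because, at each step of the convolution loop, every thread in a warp reads the same coefficient.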

  6. Gaussian Blur Memory Access Pattern
     Horizontal Pass: each thread block
        - Uses a sequential memory access pattern
        - Shares data among threads using shared memory
        - Handles edge cases with boundary checks
     Vertical Pass
        - Similar optimization strategy
        - Efficient shared memory usage for column operations
     (A hedged horizontal-pass sketch follows.)
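A minimal horizontal-pass kernel along these lines; block width, radius, and names are illustrative assumptions:

```cuda
#include <cuda_runtime.h>

#define BLOCK_W 256
#define RADIUS  7

__constant__ float d_coeff[2 * RADIUS + 1];  // 1D Gaussian weights

__global__ void blurHorizontal(const float* __restrict__ in,
                               float* __restrict__ out,
                               int width, int height) {
    // One row segment plus left/right halo staged in shared memory.
    __shared__ float tile[BLOCK_W + 2 * RADIUS];

    int y = blockIdx.y;
    if (y >= height) return;                 // whole block exits together

    int x   = blockIdx.x * BLOCK_W + threadIdx.x;
    int tid = threadIdx.x;

    // Sequential (coalesced) load of the block's pixels, clamped at edges.
    tile[tid + RADIUS] = in[y * width + min(max(x, 0), width - 1)];

    // The first RADIUS threads also fetch the halo pixels, with boundary checks.
    if (tid < RADIUS) {
        tile[tid] = in[y * width + min(max(x - RADIUS, 0), width - 1)];
        tile[tid + BLOCK_W + RADIUS] =
            in[y * width + min(x + BLOCK_W, width - 1)];
    }
    __syncthreads();

    if (x < width) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += d_coeff[k + RADIUS] * tile[tid + RADIUS + k];
        out[y * width + x] = sum;
    }
}
```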

  7. Gaussian Blur Memory Access Pattern: Vertical Pass
     Memory layout in row-major order:
        [R0C0][R0C1][R0C2][R0C3]   Row 0
        [R1C0][R1C1][R1C2][R1C3]   Row 1
        [R2C0][R2C1][R2C2][R2C3]   Row 2
        [R3C0][R3C1][R3C2][R3C3]   Row 3
     Vertical pass needs:
        Thread 0 needs: [R0C0]-[R1C0]-[R2C0]   // Column 0
        Thread 1 needs: [R0C1]-[R1C1]-[R2C1]   // Column 1
        Thread 2 needs: [R0C2]-[R1C2]-[R2C2]   // Column 2

  8. Gaussian Blur Memory Access Pattern
     With global memory, Thread 0 gathers its column as:
        Load 1: R0C0 at address X
        Load 2: R1C0 at address X + width      // far jump in memory
        Load 3: R2C0 at address X + 2*width    // another far jump
     Issues: strided memory access
        - Each thread jumps 'width' elements apart
        - Causes multiple memory transactions
        - Poor cache utilization
     Solution: a shared memory transformation, so that adjacent threads read adjacent memory

  9. Gaussian Blur Memory Access Pattern
     Step 1: Sequential Load
        Thread 0 reads: R0C0
        Thread 1 reads: R0C1
        Thread 2 reads: R0C2
        ... (all threads read consecutive memory: efficient!)
     Step 2: Vertical Filter Access
        Thread 0 now reads from shared memory:
        R0C0: direct access
        R1C0: +blockDim.x
        R2C0: +2*blockDim.x
     Fast shared memory access, with no global memory latency (see the kernel sketch below)
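A simplified vertical-pass kernel showing this two-step pattern; one output row per block, and all names and sizes are illustrative assumptions:

```cuda
#include <cuda_runtime.h>

#define BLOCK_W 256
#define RADIUS  7

__constant__ float d_coeff[2 * RADIUS + 1];

__global__ void blurVertical(const float* __restrict__ in,
                             float* __restrict__ out,
                             int width, int height) {
    // Row r of the tile starts at r * BLOCK_W, so the "+blockDim.x per row"
    // offsets from the slide index straight into it.
    __shared__ float tile[(2 * RADIUS + 1) * BLOCK_W];

    int y  = blockIdx.y;                  // one output row per block (simplified)
    int x  = blockIdx.x * BLOCK_W + threadIdx.x;
    int cx = min(x, width - 1);           // clamp so every thread loads and syncs

    // Step 1: sequential load. For each row the filter needs, adjacent
    // threads read adjacent global addresses (coalesced).
    for (int r = 0; r < 2 * RADIUS + 1; ++r) {
        int srcY = min(max(y - RADIUS + r, 0), height - 1);
        tile[r * BLOCK_W + threadIdx.x] = in[srcY * width + cx];
    }
    __syncthreads();

    // Step 2: the column walk now hits fast shared memory instead of
    // strided global memory.
    if (x < width && y < height) {
        float sum = 0.0f;
        for (int r = 0; r < 2 * RADIUS + 1; ++r)
            sum += d_coeff[r] * tile[r * BLOCK_W + threadIdx.x];
        out[y * width + x] = sum;
    }
}
```

Note the deliberate clamp instead of an early return: every thread must reach __syncthreads(), so out-of-range threads still participate in the load.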

  10. Gaussian Blur Performance Features
      Memory Bandwidth Optimization
         - Restrict qualifier (__restrict__) on pointer parameters
         - Tells the compiler the pointers never alias (overlap)
         - Enables better compiler optimization
      Synchronization
         - Minimal use of __syncthreads()
         - Only where necessary for shared memory coherence
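Both blur sketches above already carry these qualifiers; in isolation the idea is just (hypothetical signature):

```cuda
// __restrict__ promises in and out never alias, so the compiler may keep
// loads in registers and reorder them instead of re-reading memory.
__global__ void gaussianPass(const float* __restrict__ in,
                             float* __restrict__ out,
                             int width, int height);
```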

  11. Histogram Computation Optimization
      Max Value Finding
         - Parallel reduction in shared memory
         - Efficient tree-based reduction pattern
      Sequential reduction: N-1 steps, N-1 total comparisons
      Parallel reduction: log2(N) steps, still N-1 total comparisons, but the comparisons within each step happen in parallel!

  12. Histogram Computation Optimization: Tree Reduction to Get Max Value
      1. Each thread compares elements that are 's' positions apart
      2. The stride 's' starts at blockDim.x/2 and is halved each iteration
      3. A thread only operates if its ID (tid) is less than the current stride
      4. This pattern ensures all elements are compared without missing any
      Example with 16 elements:
         Iteration 1: [0-7] compare with [8-15]
         Iteration 2: [0-3] compare with [4-7]
         Iteration 3: [0-1] compare with [2-3]
         Iteration 4: [0] compares with [1]
         Final: [0] contains the max
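A minimal sketch of this tree reduction as a CUDA kernel; the kernel name and per-block output array are assumptions, not the project's histogram_kernel.cu code:

```cuda
#include <cfloat>   // FLT_MAX

// Each block reduces blockDim.x elements (a power of two) to one maximum.
__global__ void blockMax(const float* __restrict__ data,
                         float* __restrict__ blockResults, int n) {
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;

    // One element per thread; out-of-range threads load the identity value.
    sdata[tid] = (i < n) ? data[i] : -FLT_MAX;
    __syncthreads();

    // Stride s halves each iteration; only threads with tid < s work,
    // so all comparisons within an iteration run in parallel.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0) blockResults[blockIdx.x] = sdata[0];  // this block's max
}
```

Launched as, e.g., `blockMax<<<blocks, threads, threads * sizeof(float)>>>(d_data, d_partial, n)`; the per-block maxima are then reduced again (or on the host) to get the global max.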

  13. Histogram Computation Optimization
      Thread Block Memory Pattern (one thread per bin):
         Thread 0:  [Bin0]  += count
         Thread 1:  [Bin1]  += count
         ...
         Thread 15: [Bin15] += count
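A hedged sketch matching this thread-per-bin pattern, assuming a block-private shared-memory histogram whose bins each thread then merges into the global histogram; NUM_BINS, the binning scheme, and all names are assumptions:

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 16

__global__ void histogram(const unsigned char* __restrict__ pixels, int n,
                          unsigned int* __restrict__ globalHist) {
    __shared__ unsigned int localHist[NUM_BINS];
    if (threadIdx.x < NUM_BINS) localHist[threadIdx.x] = 0;
    __syncthreads();

    // Grid-stride loop: the frequent atomics land in fast shared memory,
    // not in global memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[pixels[i] / (256 / NUM_BINS)], 1u);
    __syncthreads();

    // One thread per bin merges the block's counts: Thread 0 -> Bin0, etc.
    if (threadIdx.x < NUM_BINS)
        atomicAdd(&globalHist[threadIdx.x], localHist[threadIdx.x]);
}
```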

  14. Results (Naïve Approach)

      | Number of Images | Time                                            | Node     |
      |------------------|-------------------------------------------------|----------|
      | 1                | real 37m29.506s, user 42m11.486s, sys 0m53.665s | nedc_012 |

  15. Results

      | Number of Images | Time                                             | Node     |
      |------------------|--------------------------------------------------|----------|
      | 1                | real 1m58.964s, user 6m3.611s, sys 1m18.409s     | nedc_012 |
      | 4                | real 23m6.097s, user 22m28.160s, sys 5m21.400s   | nedc_012 |
      | 10               | real 69m0.512s, user 58m57.503s, sys 14m52.981s  | nedc_012 |

  16. Thank You!
