Power-Efficient Medical Image Processing Using PUMA

PUMA is a tiled architecture of programmable loop accelerators for power-efficient medical image processing. The presentation contrasts it with GPGPUs, which are widely used for medical image reconstruction but consume a lot of power, and reports that PUMA reaches 63-201% of GPU performance at 22-54X the performance/power efficiency on MRI and CT kernels.

  • Medical imaging
  • Image processing
  • Power-efficient
  • PUMA technology
  • GPGPUs

Uploaded on Feb 22, 2025


Presentation Transcript


  1. Power-Efficient Medical Image Processing using PUMA. Ganesh Dasika, Kevin Fan (1), Scott Mahlke. University of Michigan, Advanced Computer Architecture Laboratory; (1) Parakinetics, Inc. University of Michigan Electrical Engineering and Computer Science.

  2. The Advent of the GPGPU: an increasingly popular substrate for HPC, with applications in astrophysics, weather prediction, EDA, financial instrument pricing, and medical imaging.

  3. Advantages of GPGPUs: high degree of parallelism (data-level and thread-level); high bandwidth; commodity products; increasingly programmable.

  4. Disadvantages of GPGPUs: gap between computation and bandwidth (933 GFLOPS against 142 GB/s of bandwidth, i.e. 0.15 B of data per FLOP, a ~26:1 compute:mem ratio); very high power consumption; graphics-specific hardware; multiple thread contexts; large register files and memories; fully general datapath; inefficiencies found in all general-purpose architectures.
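
The two ratios quoted on slide 4 can be reproduced with a few lines of arithmetic. The sketch below assumes that the ~26:1 figure counts memory traffic in 32-bit words (one float per access); the slide does not state this, but it is the reading that matches the numbers.

```python
# Figures quoted on slide 4.
PEAK_GFLOPS = 933.0   # peak throughput, GFLOP/s
MEM_BW_GBPS = 142.0   # off-chip memory bandwidth, GB/s
WORD_BYTES = 4        # assumption: one operand is one 32-bit float


def bytes_per_flop(gflops, gbps):
    """Bytes of memory bandwidth available per floating-point operation."""
    return gbps / gflops


def compute_to_mem_ratio(gflops, gbps, word_bytes):
    """Floating-point operations that can issue per word fetched from memory."""
    words_per_second = gbps / word_bytes  # in Gwords/s
    return gflops / words_per_second


print(round(bytes_per_flop(PEAK_GFLOPS, MEM_BW_GBPS), 2))
print(round(compute_to_mem_ratio(PEAK_GFLOPS, MEM_BW_GBPS, WORD_BYTES), 1))
```

A kernel whose own compute:mem ratio is well below this machine ratio is bandwidth-bound on such a GPU, which is the gap the slide points at.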

  5. Programmability vs Efficiency? [Figure: flexibility vs. efficiency chart placing general purpose processors, DSPs, FPGAs, domain-specific accelerators, GPGPUs, loop accelerators, and ASICs, with an open "???" slot and the callout "Highly efficient, some programmability".]

  6. Medical Image Reconstruction: compute-intensive loops; 32-bit floating-point code; high data/bandwidth requirements; increased demand for portability and low power. Much current research focuses on using GPGPUs for this domain.

  7. CT Image Reconstruction: X-ray emitters and receptors sit on opposite sides of the patient; received X-ray intensity corresponds to tissue density; multiple scans ("slices") taken around the patient are put together to reconstruct one 2D image.

  8. Projection & Sinogram. Projection: all ray-sums in one direction. Sinogram: all projections. [Figure: X-rays passing through an object f(x,y) in the x-y plane to form a projection P(t) along axis t, next to the resulting sinogram.]

  9. Example: Backprojection. [Figure: a sinogram and the resulting backprojected image.]

  10. Example: Filtered Backprojection. [Figure: a filtered sinogram and the resulting reconstructed image.]
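
Slides 8 to 10 can be illustrated with a toy example. The sketch below is not the presenters' code: it only handles two axis-aligned directions (0 and 90 degrees) so that no interpolation is needed, whereas a real scanner uses hundreds of angles, and it omits the filtering step of slide 10. The function names are made up for illustration.

```python
def project(image, angle_deg):
    """Ray-sums of a square image along one axis-aligned direction."""
    if angle_deg == 0:    # vertical rays: one sum per column
        return [sum(row[c] for row in image) for c in range(len(image[0]))]
    if angle_deg == 90:   # horizontal rays: one sum per row
        return [sum(row) for row in image]
    raise ValueError("this sketch only handles 0 and 90 degrees")


def sinogram(image, angles):
    """Collect the projections for every angle."""
    return {angle: project(image, angle) for angle in angles}


def backproject(sino, size):
    """Smear each ray-sum back along its ray and average over the angles."""
    out = [[0.0] * size for _ in range(size)]
    for angle, projection in sino.items():
        for r in range(size):
            for c in range(size):
                t = c if angle == 0 else r  # which ray passes through (r, c)
                out[r][c] += projection[t] / len(sino)
    return out


# A 4x4 image with a single dense pixel at row 1, column 2.
image = [[0] * 4 for _ in range(4)]
image[1][2] = 1
sino = sinogram(image, [0, 90])
for row in backproject(sino, 4):
    print(row)
```

The backprojected image peaks at the original pixel but also leaves streaks along its row and column. That blur is what the filter in filtered backprojection is there to suppress.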

  11. Reconstruction: solve for the densities. [Figure: an X-ray emitter shining through a 4x4 grid of unknown densities (cells indexed 11 to 44) representing the human body, with measured detector values around the grid.]

  12. Real Reconstruction Problem: 100s of diagonals at 100s of angles; intensity is measured for rays transmitted through multiple pixels; find the individual pixel values from the transmission data. [Figure: a grid of unknown pixel values ("?"), 512 values on a side, surrounded by measured intensities such as 712, 199, 255, 534, 417, 364, 555, 501, and 355.]
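
To make "find individual pixel values from transmission data" concrete, here is a minimal sketch of one classical iterative approach, the algebraic reconstruction technique (Kaczmarz's method). The slides do not say which algorithm the presenters use; the 2x2 image, the ray layout, and the measurements below are invented for illustration.

```python
def art_reconstruct(rays, measurements, n_pixels, sweeps=200):
    """Recover pixel values from ray-sums.

    rays[i] lists the pixel indices crossed by ray i (unit weight each);
    measurements[i] is the measured sum along that ray.
    """
    x = [0.0] * n_pixels
    for _ in range(sweeps):
        for pixels, measured in zip(rays, measurements):
            predicted = sum(x[p] for p in pixels)  # forward projection
            correction = (measured - predicted) / len(pixels)
            for p in pixels:                       # back-project the residual
                x[p] += correction
    return x


# Hypothetical 2x2 image, pixels indexed 0 1 / 2 3, true values 1 2 / 3 4.
rays = [[0, 1], [2, 3],   # two horizontal rays (row sums)
        [0, 2], [1, 3],   # two vertical rays (column sums)
        [0, 3]]           # one diagonal ray
measurements = [3, 7, 4, 6, 5]

print([round(v, 3) for v in art_reconstruct(rays, measurements, 4)])
```

Each sweep alternates a forward projection with a backprojection of the error. With 512 values on a side and hundreds of angles, that repeated work is why iterative reconstruction is so compute-intensive.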

  13. Medical Imaging Applications: image reconstruction for MRI/CT/PET scans; large amounts of vector- and thread-level parallelism; FP-intensive kernels, often requiring math library functions; data-intensive (~5:1 compute:mem ratio). Benchmarks, each listed as inner-loop scalar/vector mix, outer-loop TLP, compute:mem ratio: Segmentation (fully vectorizable, do-all, 4:1); Laplacian Filtering (fully vectorizable, do-all, 3:1); Gaussian Convolution (fully vectorizable with predicates, do-all, 6:1); MRI FH Vector (fully vectorizable, do-all, 6:1); MRI Q Vector (fully vectorizable, do-all, 5.5:1).
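
As an illustration of what a do-all outer loop with a vectorizable inner loop looks like, the sketch below applies a 5-point Laplacian stencil to an image. This is one common discrete Laplacian, chosen here as an assumption; the benchmark's actual kernel is not shown in the slides.

```python
def laplacian(image):
    """5-point Laplacian of the interior pixels; border pixels are left at 0."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(1, height - 1):       # do-all: rows are independent
        for c in range(1, width - 1):    # vectorizable: no loop-carried dependence
            out[r][c] = (image[r - 1][c] + image[r + 1][c]
                         + image[r][c - 1] + image[r][c + 1]
                         - 4 * image[r][c])
    return out


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
spike = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(laplacian(flat)[1][1], laplacian(spike)[1][1])
```

Every output pixel depends only on input pixels, so iterations can be distributed across threads or tiles in any order.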

  14. Current Concerns: Portability/Power. Currently, most scans require moving the patient to an imaging room, which consumes time and puts stress on the patient. Studies show benefits of portable, bedside scanners: an 86% increase in patients suitable for post-stroke thrombolytic therapy [Weinreb et al, RSNA] and an 80-100% drop in scan-related complications [Gunnarsson et al, J. of Neurosurgery]. New X-ray emitters push for mAs of current use.

  15. Current Concerns: Performance. High-accuracy CT algorithms take too long: iterative forward/backward projection takes ~hours on modern CT scanners instead of minutes. Interventional radiology: scans currently take minutes, but should take seconds. CT-Fluoroscopy: several scans done in succession.

  16. Flexibility: software algorithms change over time; NRE; time-to-market.

  17. PUMA: tiled architecture; bandwidth-matched for improved efficiency; each tile is a Programmable Loop Accelerator. [Figure: system diagram with an external interface, disk, memory, and CPU.]

  18. Programmable Loop Accelerator: generalize the accelerator without losing efficiency. [Figure: flexibility vs. efficiency/performance chart placing general purpose processors, DSPs, FPGAs, domain-specific accelerators, GPGPUs, loop accelerators, and ASICs, with Programmable Loop Accelerators marked at the "???" position.]

  19. Designing Loop Accelerators. [Figure: C code for a loop mapped to hardware: function units (+, *, <<, &, BR, MEM) joined by point-to-point connections, with a CRF and local memory.]

  20. Loop Accelerator Architecture: hardware realization of a modulo-scheduled loop. Parameterized hardware: FUs, shift register files, static control, point-to-point interconnect. [Figure: datapath with CRF, FSM, local memory, function units (BR, +, &, MEM), point-to-point connections, and control signals.]

  21. Programmable Loop-Accelerator Architecture. LA vs. PLA: functionality: custom FU set vs. generalized FUs + MOVs; storage: limited size, no addr. vs. rotating reg. files; connectivity: point-to-point vs. ring + port-swapping; control: hardwired control vs. lit. reg. file + control mem. [Figure: datapath with CRF, literals, ring, control memory, FSM, local memory, function units (+/-, &/|, BR, MEM), RR and SRF blocks, point-to-point connections, and control signals.]
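
Slide 21 lists rotating register files as the PLA's storage. The sketch below models a conventional rotating register file as used with modulo-scheduled loops on VLIW machines. It is an assumption that PUMA's files behave this way in detail; the class and method names are invented for illustration.

```python
class RotatingRegisterFile:
    """Register file whose logical names shift by one at every rotation."""

    def __init__(self, size):
        self.regs = [0] * size
        self.base = 0  # rotating base, updated once per loop iteration

    def _physical(self, logical):
        return (logical + self.base) % len(self.regs)

    def write(self, logical, value):
        self.regs[self._physical(logical)] = value

    def read(self, logical):
        return self.regs[self._physical(logical)]

    def rotate(self):
        """Start a new iteration: the old r0 becomes r1, r1 becomes r2, ..."""
        self.base = (self.base - 1) % len(self.regs)


rrf = RotatingRegisterFile(4)
rrf.write(0, 7)   # iteration 0 produces a value in r0
rrf.rotate()
rrf.write(0, 9)   # iteration 1 reuses the name r0 without clobbering the 7
rrf.rotate()
print(rrf.read(1), rrf.read(2))
```

Because each iteration writes to the same logical name but a different physical register, overlapped iterations of a software-pipelined loop can keep their values live without extra copy instructions.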

  22. MRI.FH PLA: ~0.6 mm² per tile; 38 FUs; 128 32-bit registers; inter-FU BW of 1 TB/sec. FU counts by type: FP-ADDSUB 6, FP-MPY 9, I-ADDSUB 8, MEM 9, I-MPY 1, Other 5.

  23. Performance on MRI.FH PLA. [Figure: bar chart of normalized performance (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss, comparing non-generalized and generalized designs; bars are marked II preserved, II doubled, or unschedulable.]
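
The "II" in the chart labels is the initiation interval of the modulo-scheduled loop from slide 20: the number of cycles between the starts of consecutive iterations. A rough sketch of why doubling it matters follows, using the standard cycle-count formula for a software-pipelined loop. The iteration count and schedule length are made-up numbers, and the schedule length is assumed to stay the same when II doubles.

```python
def modulo_schedule_cycles(iterations, ii, schedule_length):
    """Cycles to run a modulo-scheduled loop: a new iteration starts every
    `ii` cycles and the last one needs `schedule_length` cycles to drain."""
    if iterations <= 0:
        return 0
    return (iterations - 1) * ii + schedule_length


def relative_performance(iterations, ii_base, ii_new, schedule_length):
    """Performance at ii_new, normalized to performance at ii_base."""
    base = modulo_schedule_cycles(iterations, ii_base, schedule_length)
    new = modulo_schedule_cycles(iterations, ii_new, schedule_length)
    return base / new


# Hypothetical loop: 100 iterations, 10-cycle schedule, II of 2 doubled to 4.
print(modulo_schedule_cycles(100, 2, 10))
print(round(relative_performance(100, 2, 4, 10), 3))
```

For long-running loops the cycle count is dominated by the II term, so a loop that only schedules at twice its original II runs at roughly half the throughput.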

  24. Efficiency on MRI.FH PLA. [Figure: bar chart of normalized perf/power efficiency (0.0 to 1.0) for MRI.FH, MRI.Q, CT.segment, CT.laplace, CT.gauss, and their mean, comparing non-generalized and generalized designs.]

  25. PUMA System Design: 5 systems designed around 5 benchmarks, each composed of identical tiles; assume the same B/W as the GTX280 (142 GB/s); # of tiles based on the B/W requirements of the benchmark. [Figure: system diagram with an external interface, disk, memory, and CPU.]
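
One way to read "# of tiles based on the B/W requirements of the benchmark" is that tiles are added until their combined memory traffic saturates the 142 GB/s budget. The sketch below shows that sizing rule under loudly stated assumptions: the per-tile memory operations per cycle and the clock rate are hypothetical placeholders, not figures from the presentation.

```python
import math

SYSTEM_BANDWIDTH_GBPS = 142.0  # GTX280 bandwidth assumed on slide 25


def tile_bandwidth_gbps(mem_ops_per_cycle, clock_ghz, word_bytes=4):
    """Memory traffic one tile generates, assuming 32-bit accesses."""
    return mem_ops_per_cycle * word_bytes * clock_ghz


def tiles_for_bandwidth(system_gbps, per_tile_gbps):
    """Largest number of identical tiles the memory system can keep fed."""
    return math.floor(system_gbps / per_tile_gbps)


# Hypothetical tile: 2 memory operations per cycle at 0.5 GHz.
demand = tile_bandwidth_gbps(mem_ops_per_cycle=2, clock_ghz=0.5)
print(demand, tiles_for_bandwidth(SYSTEM_BANDWIDTH_GBPS, demand))
```

A benchmark with a higher compute:mem ratio draws less bandwidth per tile, so the same memory system can feed more tiles.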

  26. System Performance. [Figure: bar chart of theoretical vs. realized GOPs/sec (0 to 160) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss; power labels on the chart read 4W, 3W, 2.8W, 2.3W, and 2.7W.]

  27. Performance vs. GPGPU: 2X the performance of the GTS 250; 63% of the performance of the GTX 295. [Figure: bar chart of theoretical vs. realized TOPs/sec (0.0 to 2.0) for PUMA, GTS 250, GTX 260, GTX 280, GTX 285, and GTX 295.]

  28. Efficiency vs. GPGPU. [Figure: bar chart of PUMA perf/power efficiency over GPU (0 to 60) for MRI.FH, MRI.Q, CT.segment, CT.laplace, and CT.gauss against the GTS 250, GTX 260, GTX 280, GTX 285, and GTX 295; callouts mark 54X and 22X.]

  29. Conclusions: power-efficient accelerator for medical imaging; ASIC-like efficiency with programmability; 63-201% of GPU performance; 22-54X GPU performance/power efficiency.

  30. Thank you!! Questions?
