Understanding a Modern Processing in-Memory Architecture
Data movement between memory/storage units and compute units significantly impacts execution time and energy consumption. Processing-in-Memory (PIM) offers a promising solution to address this bottleneck. This paper discusses the challenges and advancements in PIM architecture, focusing on UPMEM's innovative design featuring DDR4 chips with in-order multithreaded DRAM Processing Units (DPUs). The work includes an introduction to the UPMEM programming model, microbenchmark-based characterization of the DPU, and a comprehensive evaluation of the PIM architecture through benchmarking and workload suitability studies. Key contributions include detailed analysis of the commercially available PIM architecture, the development of PrIM benchmarks for memory-bound workloads, and comparisons with traditional CPU and GPU systems. The study provides insights into workload characteristics, programming recommendations, and suggestions for future PIM system designers, along with programming samples and evaluations for current and prospective PIM systems.
Uploaded on Feb 20, 2025 | 2 Views
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Understanding a Modern Understanding a Modern Processing Processing- -in in- -Memory Architecture: Memory Architecture: Benchmarking and Experimental Characterization Juan G mez Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu https://arxiv.org/pdf/2105.03814.pdf https://arxiv.org/pdf/2105.03814.pdf https://github.com/CMU https://github.com/CMU- -SAFARI/prim SAFARI/prim- -benchmarks benchmarks
Executive Summary Data movement between memory/storage units and compute units is a major contributor to execution time and energy consumption Processing-in-Memory (PIM) is a paradigm that can tackle the data movement bottleneck - Though explored for +50 years, technology challenges prevented the successful materialization UPMEM has designed and fabricated the first publicly-available real-world PIM architecture - DDR4 chips embedding in-order multithreaded DRAM Processing Units (DPUs) Our work: - Introduction to UPMEM programming model and PIM architecture - Microbenchmark-based characterization of the DPU - Benchmarking and workload suitability study Main contributions: - Comprehensive characterization and analysis of the first commercially-available PIM architecture - PrIM (Processing-In-Memory) benchmarks: 16 workloads that are memory-bound in conventional processor-centric systems Strong and weak scaling characteristics - Comparison to state-of-the-art CPU and GPU Takeaways: - Workload characteristics for PIM suitability - Programming recommendations - Suggestions and hints for hardware and architecture designers of future PIM systems - PrIM: (a) programming samples, (b) evaluation and comparison of current and future PIM systems 2
System Organization A UPMEM DIMM contains 8 or 16 chips - Thus, 1 or 2 ranks of 8 chips each Inside each PIM chip there are: - 8 64MB banks per chip: Main RAM (MRAM) banks - 8 DRAM Processing Units (DPUs) in each chip, 64 DPUs per rank Main Memory PIM Chip Control/Status Interface DDR4 Interface DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip DRAM Chip xM Host CPU DISPATCH FETCH1 FETCH2 FETCH3 READOP1 READOP2 READOP3 FORMAT ALU1 ALU2 ALU3 ALU4 MERGE1 MERGE2 24-KB IRAM DMA Engine Register File Pipeline 64-MB DRAM Bank (MRAM) PIM Chip PIM Chip PIM Chip PIM Chip PIM Chip PIM Chip PIM Chip PIM Chip 64 bits PIM Chip PIM Chip PIM Chip PIM Chip PIM Chip PIM Chip PIM Chip PIM Chip xN 64-KB WRAM PIM-enabled Memory x8 3
Key Takeaway 1 64.00 11 12 13 14 15 16 11 12 13 14 15 16 11 12 13 14 15 16 11 12 13 14 15 16 11 12 13 14 15 16 11 12 13 14 15 16 3456789 10 3456789 10 3456789 10 3456789 10 3456789 10 3456789 10 Arithmetic Throughput (MOPS, log scale) (a) INT32, ADD (1 DPU) 11 12 13 14 15 16 345678 9 10 32.00 The throughput saturation point is as low as OP/B, i.e., 1 integer addition per every 32-bit element fetched 345 6 7 8 9 10 11 12 13 14 15 16 16.00 2 2 2 2 2 2 23 4 5 6 7 8 9 10 11 12 13 14 15 16 2 8.00 2 1 1 1 1 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 4.00 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 2.00 1 12 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1.00 12 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0.50 Memory-bound region Compute-bound region 12 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0.25 12 3 4 5 6 7 8 9 10 11 12 13 14 15 16 0.13 0.06 0.03 Operational Intensity (OP/B) KEY TAKEAWAY 1 The UPMEM PIM architecture is fundamentally compute bound. As a result, the most suitable workloads are memory-bound. 4
Key Takeaway 2 CPU GPU 640 DPUs 2556 DPUs 1024.000 Speedup over CPU (log scale) 256.000 64.000 16.000 4.000 1.000 0.250 0.063 0.016 0.004 0.001 SpMV SCAN-SSA SEL MLP GMEAN (1) GMEAN (2) GMEAN TS VA HST-S HST-L UNI BFS NW RED GEMV TRNS BS SCAN-RSS GMEAN (2) GMEAN (1) GMEAN More PIM-suitable workloads (1) More PIM-suitable workloads (1) Less PIM-suitable workloads (2) Less PIM-suitable workloads (2) KEY TAKEAWAY 2 The most well-suited workloads for the UPMEM PIM architecture use no arithmetic operations or use only simple operations (e.g., bitwise operations and integer addition/subtraction). 5
Key Takeaway 3 CPU GPU 640 DPUs 2556 DPUs 1024.000 Speedup over CPU (log scale) 256.000 64.000 16.000 4.000 1.000 0.250 0.063 0.016 0.004 0.001 SpMV SCAN-SSA SEL MLP GMEAN (1) GMEAN (2) GMEAN TS VA HST-S HST-L UNI BFS NW RED GEMV TRNS BS SCAN-RSS GMEAN (2) GMEAN (1) GMEAN More PIM-suitable workloads (1) More PIM-suitable workloads (1) Less PIM-suitable workloads (2) Less PIM-suitable workloads (2) KEY TAKEAWAY 3 The most well-suited workloads for the UPMEM PIM architecture require little or no communication across DPUs (inter-DPU communication). 6
Key Takeaway 4 KEY TAKEAWAY 4 UPMEM-based PIM systems outperform state-of-the-art CPUs in terms of performance and energy efficiency on most of PrIM benchmarks. UPMEM-based PIM systems outperform state-of-the-art GPUs on a majority of PrIM benchmarks, and the outlook is even more positive for future PIM systems. UPMEM-based PIM systems are more energy-efficient than state- of-the-art CPUs and GPUs on workloads that they provide performance improvements over the CPUs and the GPUs. 7
Understanding a Modern PIM Architecture https://arxiv.org/pdf/2105.03814.pdf https://arxiv.org/pdf/2105.03814.pdf https://github.com/CMU https://github.com/CMU- -SAFARI/prim SAFARI/prim- -benchmarks benchmarks 8
PrIM Repository All microbenchmarks, benchmarks, and scripts https://github.com/CMU-SAFARI/prim-benchmarks 9
Understanding a Modern Understanding a Modern Processing Processing- -in in- -Memory Architecture: Memory Architecture: Benchmarking and Experimental Characterization Juan G mez Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu el1goluj@gmail.com el1goluj@gmail.com https://arxiv.org/pdf/2105.03814.pdf https://arxiv.org/pdf/2105.03814.pdf https://github.com/CMU https://github.com/CMU- -SAFARI/prim SAFARI/prim- -benchmarks benchmarks