A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

This page presents the slides for a research paper on a scalable processing-in-memory accelerator for efficient parallel graph processing. Authored by Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi, of Seoul National University, Oracle Labs, and Carnegie Mellon University, the work describes the architecture of the proposed accelerator, Tesseract, and shows how it improves processing efficiency on the memory-bandwidth-bound computations at the heart of graph analytics.

  • Graph Processing
  • Scalable Accelerator
  • Processing-in-Memory
  • Parallel Computing
  • Computational Efficiency


Presentation Transcript


  1. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
     Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
     Seoul National University, *Oracle Labs, +Carnegie Mellon University

  2. Graphs
     Abstract representation of object relationships
     • Vertex: object (e.g., person, article, ...)
     • Edge: relationship (e.g., friendships, hyperlinks, ...)
     Recent trend: explosive increase in graph size
     • 36 Million Wikipedia Pages
     • 1.4 Billion Facebook Users
     • 300 Million Twitter Users
     • 30 Billion Instagram Photos

  3. Large-Scale Graph Processing
     Example: Google's PageRank
     $R[w] = \alpha + \sum_{v\,:\,w \in \mathrm{Succ}(v)} \mathrm{weight} \times R[v]$
     for (v: graph.vertices) {
       for (w: v.successors) {
         w.next_rank += weight * v.rank;
       }
     }
     for (v: graph.vertices) {
       v.rank = v.next_rank;
       v.next_rank = alpha;
     }

  4. Large-Scale Graph Processing
     Example: Google's PageRank
     $R[w] = \alpha + \sum_{v\,:\,w \in \mathrm{Succ}(v)} \mathrm{weight} \times R[v]$
     The update is independent for each vertex: a vertex-parallel abstraction
     for (v: graph.vertices) {
       for (w: v.successors) {
         w.next_rank += weight * v.rank;
       }
     }
     for (v: graph.vertices) {
       v.rank = v.next_rank;
       v.next_rank = alpha;
     }
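
For concreteness, here is a minimal self-contained C++ sketch of the two-phase, push-style iteration above. The Graph layout and the 1/out-degree weight are illustrative assumptions, not the talk's API:

     #include <cstddef>
     #include <vector>

     // Illustrative adjacency-list graph; these names are assumptions.
     struct Graph {
         std::vector<std::vector<int>> successors;  // successors[v] = out-neighbors of v
         std::vector<double> rank, next_rank;
     };

     // One push-style PageRank iteration, as in the slide's pseudocode:
     // each vertex v scatters weight * rank[v] to every successor w.
     void pagerank_iteration(Graph& g, double alpha) {
         for (std::size_t v = 0; v < g.successors.size(); ++v) {
             if (g.successors[v].empty()) continue;
             // The slide keeps "weight" abstract; 1/out-degree (scaled by the
             // damping factor in full PageRank) is one common choice.
             double weight = 1.0 / g.successors[v].size();
             for (int w : g.successors[v])
                 g.next_rank[w] += weight * g.rank[v];
         }
         // Commit phase: publish next_rank and reset it to the constant alpha.
         for (std::size_t v = 0; v < g.rank.size(); ++v) {
             g.rank[v] = g.next_rank[v];
             g.next_rank[v] = alpha;
         }
     }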

  5. PageRank Performance
     [Chart: speedup over 32 Cores DDR3; 128 Cores DDR3 gains only +42%]

  6. Bottleneck of Graph Processing
     for (v: graph.vertices) {
       for (w: v.successors) {
         w.next_rank += weight * v.rank;
       }
     }
     1. Frequent random memory accesses
     2. Small amount of computation
     [Diagram: following &w from v's edge list leads to w.rank, w.next_rank, and w.edges scattered across memory]

  7. Bottleneck of Graph Processing
     for (v: graph.vertices) {
       for (w: v.successors) {
         w.next_rank += weight * v.rank;
       }
     }
     1. Frequent random memory accesses
     2. Small amount of computation
     Together these create a high memory bandwidth demand
     [Diagram: same access pattern as the previous slide]
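
The randomness becomes concrete in a compressed sparse row (CSR) layout, a common graph representation assumed here for illustration (the talk does not specify one): each vertex's edge list is contiguous, but the successor IDs stored in it can point anywhere in the rank arrays.

     #include <vector>

     // Minimal CSR sketch; the field names are illustrative assumptions.
     struct CSRGraph {
         std::vector<int> row_begin;  // row_begin[v]..row_begin[v+1] bounds v's edges
         std::vector<int> col_idx;    // successor IDs, stored contiguously per vertex
         std::vector<double> rank, next_rank;
     };

     void scatter(CSRGraph& g, int v, double weight) {
         // Sequential: walking v's edge slice streams through col_idx.
         for (int e = g.row_begin[v]; e < g.row_begin[v + 1]; ++e) {
             int w = g.col_idx[e];
             // Random: w may be any vertex, so next_rank[w] touches an
             // unpredictable cache line; with so little compute per edge,
             // the loop is bound by memory bandwidth, not by the ALU.
             g.next_rank[w] += weight * g.rank[v];
         }
     }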

  8. PageRank Performance
     [Chart: speedup over 32 Cores DDR3 (102.4GB/s); 128 Cores DDR3 (102.4GB/s) gains +42%]

  9. PageRank Performance
     [Chart: over 32 Cores DDR3 (102.4GB/s): 128 Cores DDR3 (102.4GB/s) +42%; 128 Cores HMC (640GB/s) +89%]

  11. PageRank Performance
     Lessons learned:
     1. High memory bandwidth is the key to the scalability of graph processing
     2. Conventional systems do not fully utilize high memory bandwidth
     [Chart: over 32 Cores DDR3 (102.4GB/s): 128 Cores DDR3 (102.4GB/s) +42%; 128 Cores HMC (640GB/s) +89%; 128 Cores with HMC internal bandwidth (8TB/s) 5.3x]

  12. Challenges in Scalable Graph Processing
     Challenge 1: How to provide high memory bandwidth to computation units in a practical way?
     • Processing-in-memory based on 3D-stacked DRAM
     Challenge 2: How to design computation units that efficiently exploit large memory bandwidth?
     • Specialized in-order cores called Tesseract cores
     • Latency-tolerant programming model
     • Graph-processing-specific prefetching schemes

  13. Tesseract System
     [Diagram: a host processor connects to Tesseract through a memory-mapped accelerator interface (noncacheable, physically addressed); each vault pairs a DRAM controller with an in-order core, a list prefetcher (LP), a prefetch (PF) buffer, a message-triggered prefetcher (MTP), a network interface (NI), and a message queue, all linked by a crossbar network]

  15. Tesseract System
     Communications via remote function calls
     [Diagram: the same Tesseract system, highlighting the network interface and message queue]

  16. Communications in Tesseract
     for (v: graph.vertices) {
       for (w: v.successors) {
         w.next_rank += weight * v.rank;
       }
     }
     [Diagram: vertex v holds a pointer &w to its successor w]

  17. Communications in Tesseract
     for (v: graph.vertices) {
       for (w: v.successors) {
         w.next_rank += weight * v.rank;
       }
     }
     [Diagram: v and &w reside in Vault #1, while w resides in Vault #2]

  18. Communications in Tesseract
     for (v: graph.vertices) {
       for (w: v.successors) {
         put(w.id, function() { w.next_rank += weight * v.rank; });
       }
     }
     barrier();
     Non-blocking remote function call: execution can be delayed until the nearest barrier
     [Diagram: Vault #1 issues put messages to Vault #2, where w resides]
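
As a rough software analogy (not the hardware mechanism itself), the put/barrier idiom can be emulated with per-vault queues of deferred closures. NUM_VAULTS, vault_of, and the queue layout below are illustrative assumptions:

     #include <functional>
     #include <utility>
     #include <vector>

     constexpr int NUM_VAULTS = 32;                      // illustrative vault count
     std::vector<std::function<void()>> mq[NUM_VAULTS];  // per-vault message queues

     int vault_of(int id) { return id % NUM_VAULTS; }    // illustrative id-to-vault map

     // Non-blocking: enqueue the closure at the owning vault and return at once.
     void put(int id, std::function<void()> fn) {
         mq[vault_of(id)].push_back(std::move(fn));
     }

     // Barrier: drain every queue. Each queue runs serially on its own core,
     // which is why the deferred increments need no locks (see slide 20).
     void barrier() {
         for (auto& q : mq) {
             for (auto& fn : q) fn();
             q.clear();
         }
     }

In this analogy the slide's inner loop would call put(w_id, [=, &g] { g.next_rank[w_id] += weight * v_rank; }); capturing by value keeps the deferred call valid after the loop has moved on.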

  19. Non-blocking Remote Function Call
     1. Send the function address & arguments to the remote core
     2. Store the incoming message in the message queue
     3. Flush the message queue when it is full or a synchronization barrier is reached
     [Diagram: the local core's NI sends (&func, &w, value) to the remote core's NI and message queue (MQ), carrying put(w.id, function() { w.next_rank += value; })]
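
In hardware, the queued entry is not a closure but a small record: the function's address plus its arguments, with the value already computed at the sender. A plausible shape, with field names that are assumptions:

     // Illustrative message record for one non-blocking remote function call.
     struct RemoteCall {
         void (*func)(void* obj, double value);  // address of the remote function body
         void*  obj;                             // &w, valid in the remote vault's memory
         double value;                           // weight * v.rank, computed at the sender
     };

     // At flush time the remote core simply replays the queued records:
     //   for (const RemoteCall& m : message_queue) m.func(m.obj, m.value);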

  20. Benefits of Non-blocking Remote Function Call
     • Latency hiding through fire-and-forget: local cores are not blocked by remote function calls
     • Localized memory traffic: no off-chip traffic during remote function call execution
     • No need for mutexes: non-blocking remote function calls are atomic
     • Prefetching: covered shortly

  22. Tesseract System
     Prefetching
     [Diagram: the same Tesseract system, highlighting the list prefetcher (LP), prefetch buffer, and message-triggered prefetcher (MTP)]

  23. Memory Access Patterns in Graph Processing
     for (v: graph.vertices) {
       for (w: v.successors) {
         put(w.id, function() { w.next_rank += weight * v.rank; });
       }
     }
     Sequential (local): walking v and its edge list within Vault #1
     Random (remote): updating w through &w in Vault #2

  24. Message-Triggered Prefetching
     Prefetching random memory accesses is difficult
     Opportunities in Tesseract:
     • Domain-specific knowledge of the target data address
     • Time slack of non-blocking remote function calls, whose execution can be delayed until the nearest barrier
     for (v: graph.vertices) {
       for (w: v.successors) {
         put(w.id, function() { w.next_rank += weight * v.rank; });
       }
     }
     barrier();

  25. Message-Triggered Prefetching
     [Diagram: a message carrying put(w.id, function() { w.next_rank += value; }) arrives at the message queue through the NI]

  26. Message-Triggered Prefetching
     [Diagram: the put call now carries a prefetch hint as its last argument: put(w.id, function() { w.next_rank += value; }, &w.next_rank)]

  27. Message-Triggered Prefetching
     1. Message M1 received
     2. Request prefetch
     3. Mark M1 as ready when the prefetch is serviced
     [Diagram: the MTP asks the DRAM controller to bring w.next_rank into the PF buffer while M1 waits in the message queue]

  28. Message-Triggered Prefetching
     1. Message M1 received
     2. Request prefetch
     3. Mark M1 as ready when the prefetch is serviced
     4. Process multiple ready messages at once
     [Diagram: messages M1-M4 in the message queue; prefetched data d1-d3 in the PF buffer let M1-M3 be processed together]
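
Putting slides 25 through 28 together, the message-triggered prefetcher behaves roughly like the sketch below. All names are illustrative, and issue_prefetch is a stub standing in for the hardware path that fills the prefetch buffer:

     #include <deque>

     // Stand-in for the hardware hook that brings addr into the PF buffer.
     void issue_prefetch(void* addr) { (void)addr; }

     struct Message {
         void (*func)(void* obj, double value);
         void*  obj;
         double value;
         void*  prefetch_hint;  // e.g., &w.next_rank, supplied by put()
         bool   ready = false;  // set once the prefetch is serviced
     };

     std::deque<Message> message_queue;

     // Steps 1-2: on arrival, enqueue the message and request its prefetch.
     void on_message(const Message& m) {
         message_queue.push_back(m);
         issue_prefetch(m.prefetch_hint);
     }

     // Steps 3-4: ready messages are processed in batches, hitting in the
     // prefetch buffer instead of stalling on DRAM.
     void process_ready() {
         for (auto& m : message_queue)
             if (m.ready) m.func(m.obj, m.value);
         // executed entries would then be removed; omitted for brevity
     }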

  29. Other Features of Tesseract
     • Blocking remote function calls
     • List prefetching
     • Prefetch buffer
     • Programming APIs
     • Application mapping
     Please see the paper for details

  30. Evaluated Systems
     [Diagram: four configurations with their aggregate memory bandwidth]
     • DDR3-OoO (with FDP): 4 x 8 OoO cores @ 4GHz, 102.4GB/s
     • HMC-OoO (with FDP): 4 x 8 OoO cores @ 4GHz, 640GB/s
     • HMC-MC: 4 x 128 in-order cores @ 2GHz, 640GB/s
     • Tesseract (32-entry MQ, 4KB PF buffer): 32 Tesseract cores, 8TB/s

  31. Workloads
     Five graph processing algorithms:
     • Average teenage follower
     • Conductance
     • PageRank
     • Single-source shortest path
     • Vertex cover
     Three real-world large graphs:
     • ljournal-2008 (social network)
     • enwiki-2013 (Wikipedia)
     • indochina-2004 (web graph)
     4-7M vertices, 79-194M edges

  32. Performance
     [Chart: speedup over DDR3-OoO: HMC-OoO +25%; HMC-MC +56%; Tesseract 9.0x; Tesseract-LP 11.6x; Tesseract-LP-MTP 13.8x]

  33. Performance
     [Charts: the speedups above, next to memory bandwidth consumption: DDR3-OoO 80GB/s; HMC-OoO 190GB/s; HMC-MC 243GB/s; Tesseract 1.3TB/s; Tesseract-LP 2.2TB/s; Tesseract-LP-MTP 2.9TB/s]

  34. Iso-Bandwidth Comparison
     [Chart: speedup over HMC-MC (640GB/s): HMC-MC + PIM BW (8TB/s) 2.3x; Tesseract + conventional BW (640GB/s) 3.0x; Tesseract with no prefetching 6.5x. The bandwidth arrow marks the gain from bandwidth alone; the programming-model arrow marks the gain from the Tesseract programming model alone]

  35. Execution Time Breakdown
     [Chart: per-workload breakdown for AT.LJ, CT.LJ, PR.LJ, SP.LJ, and VC.LJ into normal mode, interrupt mode, interrupt switching, network, and barrier time]

  36. Execution Time Breakdown
     [Same chart, annotated: network backpressure due to message passing (future work: message combiner, ...)]

  37. Execution Time Breakdown
     [Same chart, annotated: workload imbalance (future work: load balancing)]

  38. Prefetch Efficiency
     [Charts: coverage: 88% of accesses hit in the prefetch buffer, 12% are demand L1 misses; timeliness: an ideal prefetch taking zero cycles would be only 1.04x faster than Tesseract with prefetching]

  39. Scalability
     Memory-capacity-proportional performance
     [Charts: DDR3-OoO: going from 32 to 128 cores gains +42%; Tesseract with prefetching: 32 cores (8GB) to 128 cores (32GB) 4x, 512 cores (128GB) 12x, 512 cores + partitioning 13x]

  40. Memory Energy Consumption
     [Chart: energy split into memory layers, logic layers, and cores; Tesseract with prefetching consumes 87% less memory energy than HMC-OoO]

  41. Memory Energy Consumption
     [Same chart, with callouts labeling the memory-layer, logic-layer, and core components]

  42. Conclusion
     Revisiting the PIM concept in a new context:
     • Cost-effective 3D integration of logic and memory
     • Graph processing workloads demanding high memory bandwidth
     Tesseract: scalable PIM for graph processing
     • Many in-order cores in a memory chip
     • New message passing mechanism for latency hiding
     • New hardware prefetchers for graph processing
     • Programming interface that exploits our hardware design
     Evaluations demonstrate the benefits of Tesseract:
     • 14x performance improvement & 87% energy reduction
     • Scalable: memory-capacity-proportional performance

  43. A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
     Junwhan Ahn, Sungpack Hong*, Sungjoo Yoo, Onur Mutlu+, Kiyoung Choi
     Seoul National University, *Oracle Labs, +Carnegie Mellon University
