DistTC: High Performance Distributed Triangle Counting Solution


This study presents DistTC, a high-performance distributed triangle counting solution that addresses the memory limitations of single-machine approaches. It introduces a novel graph partitioning policy for efficient distributed processing and achieves results up to 1.6x faster than previous Graph Challenge champions. The multi-GPU implementation eliminates almost all inter-host communication, significantly improving speed and scalability.

  • Distributed Processing
  • Graph Partitioning
  • High Performance
  • Multi-GPU
  • Efficiency


Presentation Transcript


  1. DistTC: High Performance Distributed Triangle Counting. Loc Hoang, Vishwesh Jatala, Xuhao Chen, Udit Agarwal, Roshan Dathathri, Gurbinder Gill, and Keshav Pingali. The University of Texas at Austin. IEEE HPEC Graph Challenge 2019.

  2. Goal: Implement Efficient Triangle Counting (TC)
     Challenges:
       • Not enough memory to fit the graph on a single machine (clueweb12 has 37 B edges)
       • Most existing implementations target shared memory or a single GPU
       • Distributed processing requires efficient partitioning and synchronization
       • TC is not a vertex program
     Solution: DistTC
       • A novel graph partitioning policy
       • Eliminates almost all communication for distributed TC
       • Platform-agnostic
       • Up to 1.6x faster than previous Graph Challenge champions
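
     The goal above can be made concrete with a minimal sketch of the standard intersection-based counting kernel that distributed implementations parallelize. This is a generic sequential C++ illustration, not the authors' IrGL/CUDA code; the CSR arrays row_start and col_idx are assumed inputs:

     #include <cstdint>
     #include <vector>

     // Count triangles in an undirected graph stored in CSR form.
     // Assumes edges are oriented (e.g., from lower to higher vertex id)
     // and each adjacency list is sorted, so every triangle is counted
     // exactly once via a sorted-list intersection.
     uint64_t count_triangles(const std::vector<int64_t>& row_start,
                              const std::vector<int32_t>& col_idx) {
         const int32_t n = static_cast<int32_t>(row_start.size()) - 1;
         uint64_t triangles = 0;
         for (int32_t u = 0; u < n; ++u) {
             for (int64_t e = row_start[u]; e < row_start[u + 1]; ++e) {
                 int32_t v = col_idx[e];
                 // Intersect the sorted adjacency lists of u and v.
                 int64_t i = row_start[u], j = row_start[v];
                 while (i < row_start[u + 1] && j < row_start[v + 1]) {
                     if (col_idx[i] < col_idx[j])      ++i;
                     else if (col_idx[i] > col_idx[j]) ++j;
                     else { ++triangles; ++i; ++j; }  // common neighbor closes a triangle
                 }
             }
         }
         return triangles;
     }

     Each directed edge triggers one sorted-list intersection; because the work for edge (u, v) needs v's full neighbor list, TC requires neighbor-of-neighbor data, which is why the slide notes that TC is not a vertex program.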

  3. Partitioning Scheme for Distributed Triangle Counting
     [Figure: an example graph with vertices A-F split across Host 1 and Host 2, showing master and mirror copies of boundary vertices. Partitioned graph based on OEC (outgoing edge cut) [CuSP, IPDPS 2019].]
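
     To make the master/mirror scheme in the figure concrete, below is a simplified sketch of an outgoing-edge-cut (OEC) style partitioner of the kind CuSP provides: each vertex is mastered by one host, every edge is placed on its source's master host, and remote endpoints get read-only mirror proxies. The blocked master assignment and data layout here are illustrative assumptions, and DistTC's actual policy builds on, rather than equals, this base scheme:

     #include <cstdint>
     #include <set>
     #include <utility>
     #include <vector>

     struct HostPartition {
         std::vector<std::pair<int32_t, int32_t>> edges;  // edges local to this host
         std::set<int32_t> mirrors;                       // proxies for remote vertices
     };

     // Hypothetical OEC-style partitioner: all out-edges of a vertex go to
     // the host that masters it; foreign endpoints are mirrored locally.
     std::vector<HostPartition> oec_partition(
             const std::vector<std::pair<int32_t, int32_t>>& edges,
             int32_t num_vertices, int num_hosts) {
         const int64_t block = (num_vertices + num_hosts - 1) / num_hosts;
         auto master_of = [&](int32_t v) { return static_cast<int>(v / block); };
         std::vector<HostPartition> parts(num_hosts);
         for (const auto& [src, dst] : edges) {
             int h = master_of(src);            // edge lives with src's master host
             parts[h].edges.emplace_back(src, dst);
             if (master_of(dst) != h)           // dst mastered elsewhere: mirror it
                 parts[h].mirrors.insert(dst);
         }
         return parts;
     }

     On top of a partitioning like this, DistTC's policy lets counting proceed with almost no inter-host communication; little beyond combining the per-host counts remains.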

  4. Experimental Setup
     Software: CuSP [Hoang et al., IPDPS 2019]
     Hardware: Bridges supercomputer, 32 nodes, each with 2 NVIDIA P100 GPUs
     Inputs:
       Input       |V|      |E|
       scale19     0.33 M   7.7 M
       scale20     0.64 M   15.6 M
       scale21     1.24 M   31.7 M
       scale22     2.39 M   64.1 M
       twitter40   41.6 M   1.2 B
       friendster  66 M     1.8 B
       gsh-2015    988 M    25 B
       clueweb12   978 M    37 B

  5. Results
     Comparison on a single GPU (Graph 500 inputs; time in seconds):
       Input     IrGL   DistTC
       scale18   0.06   0.06
       scale19   0.18   0.15
       scale20   0.56   0.35
       scale21   1.58   0.80
       scale22   4.41   1.91
     DistTC is up to 2.31x faster than IrGL.
     Comparison on distributed GPUs (time in seconds):
       Input       GPUs   TriCore   DistTC
       twitter40   8      6.5       6.7
       friendster  8      2.1       4.0
       gsh-2015    32     253.4     159.9
       clueweb12   64     -         172.4
     DistTC is up to 1.58x faster than TriCore.
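
     (The headline numbers follow directly from the tables: on scale22, 4.41 s / 1.91 s ≈ 2.31x over IrGL, and on gsh-2015, 253.4 s / 159.9 s ≈ 1.58x over TriCore.)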

  6. Summary
     • Proposed a new graph partitioning policy
     • Eliminates most of the synchronization
     • Implemented distributed multi-GPU triangle counting
     • Achieves up to 1.6x speedup over the 2018 champion

  7. Questions?

  8. Results: Distributed Multi-GPUs (all times in seconds)
       Input       GPUs   Pre-Processing    Exec. Time   Total Time
       rmat26      16     30.16 (29.15)     5.95         36.11
       rmat26      32     26.92 (25.99)     3.63         30.55
       rmat26      64     23.74 (22.89)     2.62         26.36
       twitter40   16     24.90 (24.20)     3.92         28.82
       twitter40   32     20.81 (20.20)     2.83         23.64
       twitter40   64     18.73 (18.19)     2.35         21.08
       friendster  16     52.13 (51.32)     2.49         54.62
       friendster  32     41.80 (41.19)     1.64         43.44
       friendster  64     36.13 (35.64)     1.50         37.63
       uk2007      16     12.16 (11.63)     8.64         20.80
       uk2007      32     12.06 (11.66)     6.52         18.58
       uk2007      64     11.41 (11.05)     5.47         16.88
       gsh-2015    32     143.43 (142.30)   16.44        159.87
       gsh-2015    64     143.72 (142.97)   15.25        158.97
       clueweb12   64     162.92 (162.61)   9.49         172.41
