Distributed Graph Pattern Matching

Real-life graphs are vast, necessitating distributed processing techniques. Explore applications, challenges, and solutions in the realm of distributed graph pattern matching.

  • Graph Matching
  • Distributed Processing
  • Pattern Recognition


Presentation Transcript


  1. Distributed Graph Pattern Matching Shuai Ma, Yang Cao, Jinpeng Huai, Tianyu Wo

  2. Graphs are everywhere, and quite a few are huge graphs: social networks, file systems, databases, the World Wide Web. Graph searching is a key to social search engines!

  3. Graph Pattern Matching. Given two graphs G1 (pattern graph) and G2 (data graph), decide whether G1 matches G2 (Boolean queries), or identify the subgraphs of G2 that match G1. Applications: Web mirror detection / Web site classification, complex object identification, software plagiarism detection, social network and biology analyses. Matching semantics: traditionally, subgraph isomorphism; for emerging applications, graph simulation and its extensions. A variety of emerging real-life applications!

  4. Distributed Graph Pattern Matching. Real-life graphs are typically way too large: the Yahoo! web graph has 14 billion nodes, and Facebook has over 0.8 billion users. It is NOT practical to handle large graphs on single machines. Real-life graphs are also naturally distributed: Google, Yahoo and Facebook run large-scale data centers. Distributed graph processing is inevitable, so it is natural to study distributed graph pattern matching!

  5. Distributed Graph Pattern Matching. Given a pattern graph Q(Vq, Eq) and a fragmented data graph F = (F1, ..., Fk) of G(V, E) distributed over k sites, the distributed graph pattern matching problem is to find the maximum match in G for Q, via graph simulation. There exists a unique maximum match for graph simulation!

  6. Graph Simulation. Given a pattern graph Q(Vq, Eq) and a data graph G(V, E), a binary relation R ⊆ Vq × V is said to be a match if (1) for each (u, v) ∈ R, u and v have the same label; and (2) for each edge (u, u') ∈ Eq, there exists an edge (v, v') ∈ E such that (u', v') ∈ R. Graph G matches pattern Q via graph simulation if there exists a total match relation M, i.e., for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M. Intuitively, simulation preserves the labels and the child relationship of a graph pattern in its match. Simulation was initially proposed for the analysis of programs; simulation and its extensions were recently introduced for social networks. Subgraph isomorphism (NP-complete) vs. graph simulation (O(n^2))!
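
The definition translates almost directly into a fixed-point computation: start from all label-compatible pairs and repeatedly discard pairs that violate the child condition. The sketch below is a naive version of that idea, not the optimized O(n^2) algorithm from the literature; the dict-based graph encoding, node -> (label, successor list), and all names are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of graph simulation as a fixed point, assuming graphs are
# encoded as dicts: node -> (label, list of successor nodes). The returned
# relation is the maximum match; G matches Q iff every sim[u] is non-empty.
def max_simulation(Q, G):
    # Condition (1): start with all label-compatible pairs.
    sim = {u: {v for v in G if G[v][0] == Q[u][0]} for u in Q}
    changed = True
    while changed:
        changed = False
        for u, (_, succs_u) in Q.items():
            for u2 in succs_u:                        # pattern edge (u, u2)
                for v in list(sim[u]):
                    # Condition (2): v needs a successor that can match u2.
                    if not any(w in sim[u2] for w in G[v][1]):
                        sim[u].discard(v)             # prune; may cascade
                        changed = True
    return sim
```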

  7. Graph Simulation. Example: set up a team to develop a new software product. Graph simulation returns F3, F4 and F5; subgraph isomorphism returns the empty set! Subgraph isomorphism is too strict for emerging applications!

  8. Properties of Graph Simulation. Impacts of connected components (CCs): (1) Let pattern Q = {Q1, ..., Qh} (h CCs). For any data graph G, if Mi is the maximum match in G for Qi, then M1 ∪ ... ∪ Mh is the maximum match in G for Q. (2) Let data graph G = {G1, ..., Gh} (h CCs). For any pattern graph Q, if Mi is the maximum match in Gi for Q, then M1 ∪ ... ∪ Mh is the maximum match in G for Q. (3) Even if data graph G is connected, R(G) might be highly disconnected after removing useless nodes and edges from G: for any binary relation R ⊆ Vq × V on pattern graph Q(Vq, Eq) and data graph G(V, E) that contains the maximum match M in G for Q, let R(G) consist of h CCs R(G)1, ..., R(G)h; if Mi is the maximum match in R(G)i for Q, then M1 ∪ ... ∪ Mh is exactly the maximum match in G for Q.
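
These union properties lend themselves to a divide-and-conquer sketch, shown below under the same dict encoding as before: split the pattern (or, symmetrically, the data graph) into connected components with a plain undirected traversal, compute each maximum match independently, and union the results. The helper names are assumptions, not the paper's code.

```python
# ccs() splits any graph into its connected components via undirected
# reachability; reuses the node -> (label, successors) encoding.
def ccs(graph):
    undirected = {v: set(graph[v][1]) for v in graph}
    for v in graph:
        for w in graph[v][1]:
            undirected[w].add(v)                  # add reverse edges
    seen, comps = set(), []
    for v in graph:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:                              # iterative DFS
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(undirected[x] - comp)
        seen |= comp
        comps.append({x: graph[x] for x in comp})
    return comps

# Property (1) in code: the maximum match for Q is the union of the
# maximum matches of Q's connected components.
def max_simulation_by_cc(Q, G):
    sim = {}
    for Qi in ccs(Q):
        sim.update(max_simulation(Qi, G))         # M = M1 ∪ ... ∪ Mh
    return sim
```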

  9. Properties of Graph Simulation. What can be computed locally? The matched subgraph of Q1 and G1 is Gs = F3 ∪ F4 ∪ F5; removing any node or edge from Gs makes Q1 NOT match Gs. Graph simulation has poor data locality.

  10. Properties of Graph Simulation. We turn to the data locality of single nodes: checking whether data node v in G matches pattern node u in Q can be determined locally iff the subgraph desc(Q, u) is a DAG. For instance, desc(Q1, SA) is the subgraph of Q1 with nodes SA, SD and ST. What have we learned from the static analysis? Treat each connected component in Q and G separately, and use the data locality to check whether a node in G can be determined locally.
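
The locality test hinges on whether desc(Q, u), the subgraph of Q reachable from u, contains a cycle. A hedged sketch of that check, under the same encoding as above, is a DFS restricted to the reachable part of Q:

```python
# Returns True iff desc(Q, u) -- the subgraph of Q reachable from u -- is a
# DAG. Iterative DFS with three colors: absent = unseen, 1 = on the current
# DFS path, 2 = fully explored.
def desc_is_dag(Q, u):
    color = {u: 1}
    stack = [(u, iter(Q[u][1]))]
    while stack:
        node, it = stack[-1]
        advanced = False
        for w in it:
            c = color.get(w, 0)
            if c == 1:
                return False                      # back edge => cycle
            if c == 0:
                color[w] = 1
                stack.append((w, iter(Q[w][1])))
                advanced = True
                break
        if not advanced:                          # all successors handled
            color[node] = 2
            stack.pop()
    return True
```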

  11. Complexity Analysis of Distributed Algorithms. Model of computation: a cluster of identical machines (one acting as coordinator); each machine can directly send an arbitrary number of messages to any other; all machines cooperate via local computations and message passing. Complexity measures: 1. Visit times: the maximum number of times a machine is visited (interactions). 2. Makespan: the evaluation completion time (efficiency). 3. Data shipment: the total size of the messages shipped among distinct machines (network bandwidth consumption).

  12. Complexity Analysis of Distributed Algorithms. Specifications for the distributed algorithms. For each machine Si (1 ≤ i ≤ k), local information: (1) the pattern graph Q; (2) a subgraph Gs,i of G; and (3) a marked binary relation Ri ⊆ Vq × V, where each match (u, v) ∈ Ri is marked as true, false or unknown, and Ri can be updated by either messages or local computations. Messages: only local information is allowed to be exchanged. Local computations: update Ri by utilizing the semantics of graph simulation; local algorithms execute only local computations, without message passing during the computation, and run in time polynomial in |Q| and |Gs,i|. Complexity bounds: 1. The optimal data shipment is |G| - 1, and it is tight. 2. The optimal visit times are 1, and it is tight. 3. The minimum makespan problem is NP-complete. Remarks: 1. Data shipment, visit times and makespan are in conflict with one another. 2. We seek a well-balanced strategy between makespan and the other two measures.

  13. Distributed Evaluation of Graph Simulation. Stage 1: coordinator SQ broadcasts Q to all k sites. Stage 2: all sites, in parallel, partially evaluate Q on their local fragments (partial match). Stage 3: ship those CCs spanning different machines to single machines, while minimizing data shipment and makespan. Stage 4: compute, in parallel, the maximum matches in those CCs that originally spanned multiple machines. Stage 5: collect and assemble the partial matches at the coordinator. Performance guarantees: 1. The total computational complexity is the same as the best-known centralized algorithm, while it invokes only 4 rounds of message passing and local evaluation. 2. Total data shipment is bounded by |G| + 4|B| + |Q||G| + (k - 1)|Q|. 3. Each machine except the coordinator SQ is visited g + 2 times (g is the maximum number of machines at which a CC resides in Stage 2), and SQ is visited 2(k - 1) times. Sacrifice data shipment and visit times for makespan!
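
Below is a toy, single-process rendering of the five stages, reusing max_simulation and ccs from the earlier sketches. It keeps only the control flow: real message passing, boundary-node bookkeeping and the dSchedule step of Stage 3 are elided, and a cross-site CC is "shipped" simply by materializing its node dict in one place.

```python
# Fragments are dicts node -> (label, successors); successor lists may point
# at nodes stored in other fragments. Correctness rests on CC property (2)
# above: the global maximum match is the union of the per-CC maximum matches.
def five_stage_simulation(Q, fragments):
    # Stage 1: broadcast Q to all sites (trivial in-process).
    owner, G = {}, {}
    for i, F in enumerate(fragments):
        for v in F:
            owner[v] = i                          # which site stores v
        G.update(F)
    sim = {u: set() for u in Q}
    for cc in ccs(G):
        # Stages 2-4: a CC held by one site ({owner[v] for v in cc} is a
        # singleton) is evaluated locally; a CC spanning several sites is
        # shipped whole to one machine and evaluated there. In this toy,
        # both cases end in the same local computation.
        local = max_simulation(Q, cc)
        # Stage 5: assemble the partial matches at the coordinator.
        for u, vs in local.items():
            sim[u] |= vs
    return sim
```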

  14. Scheduling Data Shipment - Stage 3. The scheduling problem: given h connected components C1, ..., Ch and an integer k, find an assignment of the connected components to k identical machines so that both the makespan and the total data shipment are minimized. Approximation hardness (data shipment, makespan): the scheduling problem is not approximable within (ε, max(k - 1, 2)) for any ε > 1. Performance guarantees of algorithm dSchedule: algorithm dSchedule produces an assignment for the scheduling problem such that the makespan is within a factor (2 - 1/k) of the optimal one. Remarks: 1. A heuristic is used to minimize the data shipment. 2. A greedy approach is adopted to guarantee the performance of the makespan. 3. The algorithm runs in O(kh) and is very efficient; hence its evaluation does not become a bottleneck.
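
The makespan side of such a greedy approach is classic list scheduling, and Graham's analysis gives exactly the (2 - 1/k) factor quoted above. A minimal sketch of that greedy step follows; the data-shipment heuristic the slide mentions is not modeled, and the names are illustrative:

```python
import heapq

# Assign each connected component (given by its size/processing cost) to the
# currently least-loaded of k machines. Greedy list scheduling is within a
# factor (2 - 1/k) of the optimal makespan.
def greedy_schedule(cc_sizes, k):
    loads = [(0, m) for m in range(k)]            # (current load, machine id)
    heapq.heapify(loads)
    assignment = {}
    for cc, size in enumerate(cc_sizes):
        load, m = heapq.heappop(loads)            # least-loaded machine
        assignment[cc] = m
        heapq.heappush(loads, (load + size, m))
    return assignment
```

For example, greedy_schedule([5, 4, 3, 3, 1], 2) splits the components into loads 8 and 8, which happens to be optimal here; in the worst case the greedy makespan is at most (2 - 1/k) times the optimum.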

  15. Optimization Techniques. Using data locality to determine whether (u, v) belongs to the maximum match M in G for Q: Case 1: when there are no boundary nodes in desc(G, v) of fragment Gj (e.g., (SA, SA2)); Case 2: when there are boundary nodes in fragment Gj, but the subgraph desc(Q, u) of Q is a DAG (e.g., (BA, BA2)). Minimization: minimizing pattern graphs (Qm). Given pattern graph Q, we compute a minimized equivalent pattern graph Qm such that for any data graph G, G matches Q iff G matches Qm, via graph simulation.
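
One known route to such a minimization is to quotient Q by simulation equivalence: two pattern nodes that simulate each other can be merged without changing which data graphs match. The sketch below shows only that merging step (a full minimization also removes redundant nodes and edges); it reuses max_simulation with Q in both roles and assumes sortable node ids.

```python
# Merge mutually-similar pattern nodes. sim[u] holds the nodes of Q that can
# simulate u, so "w in sim[u] and u in sim[w]" means u and w are equivalent.
def minimize_pattern(Q):
    sim = max_simulation(Q, Q)                    # simulation of Q over itself
    # Pick a canonical representative per equivalence class; u in sim[u]
    # always holds, so the generator is never empty.
    rep = {u: min(w for w in sim[u] if u in sim[w]) for u in Q}
    Qm = {}
    for u, (label, succs) in Q.items():
        r = rep[u]
        Qm.setdefault(r, (label, set()))
        Qm[r][1].update(rep[w] for w in succs)    # redirect edges to reps
    return {r: (label, sorted(s)) for r, (label, s) in Qm.items()}
```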

  16. Experimental Study. Real-life datasets: Google Web graph (875,713 nodes and 5,105,039 edges); Amazon product co-purchasing network (548,552 nodes and 1,788,725 edges). Synthetic graph generator (up to 10^8 nodes and 3,981,071,706 edges) with three parameters: 1. the number n of nodes; 2. the number of edges; and 3. the number l of node labels. Algorithms: algorithm disHHK and its optimized version disHHK+; optimal algorithms naiveMatchds (data shipment) and naiveMatchvt (visit times). Machines: the experiments were run on a cluster of 16 machines, each with 2 Intel Xeon E5620 CPUs and 64GB memory.

  17. Experimental Study. 1. All algorithms scale well except naiveMatchds and naiveMatchvt. 2. disHHK+ consistently reduces the running time of disHHK by about 1/5 to 1/4.

  18. Experimental Study. 1. All algorithms ship only about 1/10000 of the data graphs. 2. disHHK+ and disHHK even ship less data than naiveMatchds when data graphs are large and sparse.

  19. Experimental Study. disHHK+ and disHHK incur 30% to 53% more visit times than naiveMatchds, as expected.

  20. Conclusion. We have formulated and investigated the distributed graph pattern matching problem via graph simulation. We have given a static analysis of graph simulation: the utility of connected components and a study of data locality. We have studied the complexity of a large class of distributed algorithms for graph simulation: a message-passing computation model, and the measures of makespan, data shipment and visit times (which are in tension with each other). We have proposed a distributed algorithm for graph simulation, covering the scheduling problem, optimization techniques, and experimental verification. A first step towards the big picture of distributed graph pattern matching.
