
Graph Processing Systems and Failure Recovery
"Explore challenges in distributed graph processing systems, including fast failure recovery, analytics for large graphs, and overcoming failures of compute nodes to ensure reliable operations. Learn about existing solutions and experimental results in this domain."
Presentation Transcript
Fast Failure Recovery in Distributed Graph Processing Systems Yanyan Shen, Gang Chen, H.V. Jagadish, Wei Lu, Beng Chin Ooi, Bogdan Marius Tudor 1
Graph analytics Emergence of large graphs: the web, social networks, spatial networks. Increasing demand for querying large graphs: PageRank and reverse web-link analysis over the web graph; influence analysis in social networks; traffic analysis and route recommendation over spatial graphs. 2
Distributed graph processing MapReduce-like systems Pregel-like systems GraphLab-related systems Others 3
Failures of compute nodes Increase in the number of failed nodes: increasing graph size, more compute nodes. Failure rate = # of failures per unit of time = 1/200 (hours) when the average failure time of a compute node is ~200 hours. [Figure: exponential failure probability (0 to 0.8) versus number of compute nodes (1 to 256)] 4
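The curve on this slide follows from a simple exponential failure model. Below is a minimal Python sketch of that model; the function name and the one-hour job duration are illustrative assumptions, not values from the slide.

```python
import math

def cluster_failure_probability(num_nodes, job_hours, mtbf_hours=200.0):
    """Probability that at least one of num_nodes machines fails during a job,
    assuming independent, exponentially distributed failures with per-node
    rate 1 / mtbf_hours (the slide's ~200-hour average failure time)."""
    aggregate_rate = num_nodes / mtbf_hours      # failures per hour for the whole cluster
    return 1.0 - math.exp(-aggregate_rate * job_hours)

# Reproduce the shape of the slide's curve; the 1-hour job length is an
# illustrative assumption, not a number from the slide.
for n in (1, 4, 16, 64, 256):
    print(n, round(cluster_failure_probability(n, job_hours=1.0), 2))
```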
Outline Motivation & background Failure recovery problem Challenging issues Existing solutions Solution Reassignment generation In-parallel recomputation Workload rebalance Experimental results Conclusions 5
Pregel-like distributed graph processing systems Graph model: G = (V, E); the graph is divided into partitions (P1-P5), each holding a subgraph of vertices (A-J) that is assigned to a compute node. Computation model: a set of supersteps; the compute function is invoked for each active vertex; each vertex can receive and process messages, send messages to other vertices, and modify its value, its state (active/inactive), and its outgoing edges. [Figure: example graph with vertices A-J split into partitions P1-P5 across compute nodes] 6
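A minimal, single-process Python sketch of the vertex-centric model just described (supersteps, a per-vertex compute function, message passing). The class and function names are illustrative and do not mirror Giraph's or Pregel's actual APIs.

```python
from collections import defaultdict

class Vertex:
    """Toy vertex for a Pregel-like engine."""
    def __init__(self, vid, out_edges, value=0.0):
        self.id = vid
        self.out_edges = list(out_edges)   # compute() may also modify these
        self.value = value
        self.active = True

    def compute(self, superstep, messages, send):
        # Placeholder vertex program: aggregate incoming messages and
        # forward the new value along every outgoing edge.
        if superstep > 0:
            self.value = sum(messages)
        for dst in self.out_edges:
            send(dst, self.value)

def run_supersteps(vertices, num_supersteps):
    """Synchronous superstep loop: messages sent in superstep i are
    delivered at the start of superstep i + 1."""
    inbox = defaultdict(list)
    for superstep in range(num_supersteps):
        outbox = defaultdict(list)

        def send(dst, msg):
            outbox[dst].append(msg)

        for v in vertices.values():
            if v.active or inbox[v.id]:
                v.compute(superstep, inbox[v.id], send)
        inbox = outbox   # barrier between supersteps
    return vertices
```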
Failure recovery problem Running example: all the vertices compute and send messages to all their neighbors in every superstep; N1 fails while the job is executing superstep 12. Two states: record which superstep each vertex has completed when the failure occurs (Sf) and when the failure is recovered (Sf*); here Sf is 10 for A-F and 12 for G-J, while Sf* is 12 for A-J. Problem statement: for a failure F(Nf, sf), recover vertex states from Sf to Sf*. [Figure: example graph with vertices A-F on N1 and G-J on N2] 7
Challenging issues Cascading failures: new failures may occur during the recovery phase; how to handle all the cascading failures, if any? Existing solution: treat each cascading failure as an individual failure and restart from the latest checkpoint. Recovery latency: re-execute lost computations to achieve state Sf*, forward messages during recomputation, and recover cascading failures; how to perform recovery with minimized latency? 8
Existing recovery mechanisms Checkpoint-based recovery During normal execution: all the compute nodes flush their own graph-related information to reliable storage at the beginning of every checkpointing superstep (e.g., C+1, 2C+1, ..., nC+1). Can handle cascading failures! Simple to implement! During recovery: let c+1 be the latest checkpointing superstep; use healthy nodes to replace failed ones; all the compute nodes roll back to the latest checkpoint and re-execute the lost computations since then (i.e., from superstep c+1 to sf). Replays lost computations over the whole graph! Ignores partially recovered workload! 9
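A rough sketch of this rollback behavior, assuming checkpoints are taken at supersteps 1, C+1, 2C+1, ...; reload_checkpoint and run_superstep are hypothetical stand-ins for the system's checkpoint-loading and superstep-execution routines.

```python
def latest_checkpoint(failed_superstep, interval):
    """Checkpointing supersteps are 1, C+1, 2C+1, ... (interval = C);
    return the most recent one at or before the failed superstep."""
    return ((failed_superstep - 1) // interval) * interval + 1

def checkpoint_based_recovery(failed_superstep, interval,
                              reload_checkpoint, run_superstep):
    """Every node (healthy or replacement) rolls back to the latest
    checkpoint and replays all supersteps up to the failed one, i.e.
    the whole graph is recomputed, not just the lost partitions."""
    c_plus_1 = latest_checkpoint(failed_superstep, interval)
    reload_checkpoint(c_plus_1)
    for step in range(c_plus_1, failed_superstep + 1):
        run_superstep(step)
```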
Existing recovery mechanisms Checkpoint + log During normal execution: besides checkpointing, every compute node logs its outgoing messages at the end of each superstep. During recovery: use healthy nodes (replacements) to replace the failed ones. Replacements: redo the lost computation and forward messages among each other; forward messages to all the nodes in superstep sf. Healthy nodes: hold their original partitions and help redo the lost computation by forwarding locally logged messages to the failed vertices. 10
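A sketch of the recovery loop described above. The node methods (recompute_and_send, resend) and the message_log structure are hypothetical stand-ins, not APIs from any real system.

```python
def checkpoint_log_recovery(c_plus_1, failed_superstep, failed_vertices,
                            replacements, healthy_nodes, message_log):
    """Replacements redo the lost computation from the latest checkpoint;
    healthy nodes keep their partitions and only replay the messages they
    logged locally that were addressed to failed vertices."""
    for step in range(c_plus_1, failed_superstep + 1):
        for node in replacements:
            node.recompute_and_send(step)            # real recomputation of lost partitions
        for node in healthy_nodes:
            for dst, msgs in message_log[(node.node_id, step)].items():
                if dst in failed_vertices:           # forward only what the failed side needs
                    node.resend(dst, msgs)
```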
Existing recovery mechanisms Checkpoint + log Suppose the latest checkpoint is made at the beginning of superstep 11 and N1 (A-F) fails at superstep 12. During recovery, superstep 11: A-F perform computation and send messages to each other; G-J send messages to A-F. Superstep 12: A-F perform computation and send messages along their outgoing edges; G-J send messages to A-F. Less computation and communication cost! Overhead of local logging is negligible! Limited parallelism: the replacements handle all the lost computation! [Figure: example graph with A-F on the replacement node and G-J on N2] 11
Outline Motivation & background Problem statement Challenging issues Existing solutions Solution Reassignment generation In-parallel recomputation Workload rebalance Experimental results Conclusions 12
Our solution Partition-based failure recovery Step 1: generate a reassignment for the failed partitions; every node is informed of the reassignment. Step 2: recompute the failed partitions; every node loads its newly assigned failed partitions from the latest checkpoint and redoes the lost computations. Step 3: exchange partitions to re-balance the workload after recovery. 13
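A high-level sketch of these three steps, with the three callables standing in for the components detailed on the next slides; all names are illustrative.

```python
def partition_based_recovery(failure, cluster,
                             generate_reassignment, recompute, exchange_partitions):
    """High-level flow of partition-based failure recovery."""
    # Step 1: decide which node takes over each failed partition and
    # inform every node of the plan.
    reassignment = generate_reassignment(failure, cluster)
    cluster.broadcast(reassignment)

    # Step 2: each node loads its newly assigned failed partitions from
    # the latest checkpoint and redoes the lost computations in parallel.
    recompute(reassignment, failure)

    # Step 3: exchange partitions to re-balance the workload after recovery.
    exchange_partitions(cluster, reassignment)
```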
Recompute failed partitions In superstep i in [c+1, sf], every compute node iterates through its active vertices. For each vertex u: perform computation for u only if its state after the failure satisfies Sf(u) < i; forward a message from u to v only if Sf(v) < i, or Sf(u) < i and Sf(v) = i. Intuition: v will need this message to perform computation in superstep i+1. 14
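The two conditions above transcribed as code, assuming state_at_failure maps each vertex to the last superstep it completed (the state Sf from the problem statement).

```python
def should_recompute(state_at_failure, u, i):
    """Vertex u redoes superstep i only if, according to the per-vertex state
    recorded when the failure occurred, u has not completed superstep i."""
    return state_at_failure[u] < i

def should_forward(state_at_failure, u, v, i):
    """A message u -> v produced in superstep i is actually sent only if the
    receiver still needs it: either v itself is being recomputed, or u is
    being recomputed and v stopped exactly at superstep i, so v needs u's
    messages to execute superstep i + 1."""
    s = state_at_failure
    return s[v] < i or (s[u] < i and s[v] == i)
```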
Example N1 fails in superstep 12; redo supersteps 11 and 12. Less computation and communication cost! [Figure: (1) reassignment of the failed partitions to the healthy nodes, (2) recomputation of the failed partitions] 15
Handling cascading failures N1 fails in superstep 12; N2 fails in superstep 11 during recovery. No need to recover A and B, since they have already been recovered! The same recovery algorithm can be used to recover any failure! [Figure: (1) reassignment, (2) recomputation for the cascading failure] 16
Reassignment generation When a failure occurs, how do we compute a good reassignment for the failed partitions? Goal: minimize the recovery time. Calculating the recovery time is complicated because it depends on: the reassignment for the failure, the cascading failures, and the reassignment for each cascading failure. But we have no knowledge about cascading failures! 17
Our insight When a failure occurs (it can be a cascading failure), we prefer a reassignment that benefits the remaining recovery process, considering all the cascading failures that have occurred so far. We collect the state S after the failure and measure the minimum time Tlow to achieve Sf*. Tlow provides a lower bound on the remaining recovery time. 18
Estimation of Tlow Tlow = Σ_{i=c+1}^{sf} (Tcomp[i] + Tcomm[i]) (downtime is ignored, as it is similar over different recovery methods). To estimate the computation and communication time, we need to know which vertices will perform computation and which messages will be forwarded (across different nodes); we maintain the relevant statistics in the checkpoint. 19
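The same estimate as code, assuming t_comp[i] and t_comm[i] are per-superstep time estimates derived from the statistics maintained in the checkpoint.

```python
def estimate_t_low(t_comp, t_comm, c, s_f):
    """Lower bound on the remaining recovery time: per-superstep computation
    and communication times summed over every superstep that must be redone
    (c+1 .. sf). Downtime is ignored because it is similar across methods."""
    return sum(t_comp[i] + t_comm[i] for i in range(c + 1, s_f + 1))
```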
Reassignment generation problem Given a failure, find a reassignment that minimizes Tlow. Problem complexity: NP-hard. Different from the graph partitioning problem: it assigns existing partitions rather than partitioning vertices; it does not operate on a static graph, but depends on runtime vertex states and messages; and there is no balance requirement. Greedy algorithm: start with a random reassignment for the failed partitions and reach a better one (with smaller Tlow) by moving the failed partitions. 20
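A sketch of the greedy strategy, assuming t_low(assignment) is an estimator like the one on the previous slide; the single-partition move neighborhood is an illustrative reading of "moving the failed partitions".

```python
import random

def greedy_reassignment(failed_partitions, healthy_nodes, t_low):
    """Start from a random assignment of failed partitions to healthy nodes,
    then keep moving one partition at a time while the move lowers the
    estimated T_low; stop when no single move improves it."""
    assignment = {p: random.choice(healthy_nodes) for p in failed_partitions}
    best = t_low(assignment)
    improved = True
    while improved:
        improved = False
        for p in failed_partitions:
            for n in healthy_nodes:
                if n == assignment[p]:
                    continue
                candidate = {**assignment, p: n}
                cost = t_low(candidate)
                if cost < best:            # greedy: accept any improving move
                    assignment, best, improved = candidate, cost, True
    return assignment
```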
Outline Motivation & background Problem statement Challenging issues Existing solutions Solution Reassignment generation In-parallel recomputation Workload rebalance Experimental results Conclusions 21
Experimental evaluation Experiment settings: an in-house cluster with 72 nodes, each with one Intel X3430 2.4GHz processor, 8GB of memory, and two 500GB SATA hard disks, running Hadoop 0.20.203.0 and Giraph 1.0.0. Comparisons: PBR (our proposed solution) vs. CBR (checkpoint-based recovery). Benchmark tasks: K-means, semi-clustering, PageRank. Datasets: Forest, LiveJournal, Friendster. 22
PageRank results [Figures: logging overhead; single-node failure] 23
PageRank results [Figures: multiple-node failure; cascading failure] 24
PageRank results (communication cost) [Figures: multiple-node failure; cascading failure] 25
Conclusions We develop a novel partition-based recovery method that parallelizes the failure recovery workload for distributed graph processing. It addresses the key challenges in failure recovery: handling cascading failures and reducing recovery latency. We formulate the reassignment generation problem and solve it with a greedy strategy. 26
Thank You! Q & A 27