IS.651: Distributed Systems Consensus Challenges

Distributed systems present challenges such as consistency, concurrency, machine failures, network failures, and replication. Replication models such as primary-backup and Viewstamped Replication play crucial roles in ensuring system correctness and fault tolerance. The concept of consensus is discussed in depth, highlighting the need for coordinated action among nodes despite potential failures. Announcements about upcoming assignments and the role of data replication and consistency in distributed storage are also covered.

  • Distributed Systems
  • Consensus
  • Replication Models
  • Fault Tolerance

Presentation Transcript


  1. IS 651: Distributed Systems: Consensus. Sisi Duan, Assistant Professor, Information Systems, sduan@umbc.edu

  2. Recall
     • Consistency challenges
     • Concurrency
     • Machine failures
     • Network failures
     • Replication: state machine replication, primary-backup replication

  3. Recall: Primary-Backup Replication
     • Deal with backup failures
     • Deal with primary failures

  4. Recall: Viewstamped Replication
     Keys to correctness:
     • View number: determines a deterministic node as the primary
     • No committed requests will be lost
     • Quorum: a majority of nodes, if we consider benign failures

  5. Recall: Viewstamped Replication, normal operation and view change
     [Diagram: the client sends a request; the primary p0 sends m = <PREPARE, v0, seq=1, request> to backups p1 and p2, which reply ok; each replica's log holds entries (A, B, C, ...) at sequence numbers 0, 1, 2, ...; on a view change, a node sends m = <VIEW-CHANGE, v1> and the others reply ok]
     • A request is committed once a majority of nodes have said yes
     • The primary replies to the client after it gets a majority of OKs
     • Backups monitor the primary; the view moves to v1 if a majority replied OK, and the new primary decides on the new log
     • On receiving a PREPARE message m: reject if its current view id > m.v; reject if its status != NORMAL; otherwise append the request with its seq to its log
     • On receiving a VIEW-CHANGE message m: reject if its current view id > m.v; otherwise set vid = m.v and status = VIEW-CHANGE, and send its log (operation history) and the latest normal view
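
A minimal Python sketch (an illustration, not the course's reference implementation) of how a replica might apply the PREPARE and VIEW-CHANGE rules above; the Replica class and its fields are assumptions.

    # Sketch of the PREPARE / VIEW-CHANGE handling rules above (illustrative).
    NORMAL, VIEW_CHANGE = "NORMAL", "VIEW-CHANGE"

    class Replica:
        def __init__(self):
            self.view = 0                # current view id
            self.status = NORMAL
            self.log = []                # list of (seq, request)
            self.last_normal_view = 0

        def on_prepare(self, msg_view, seq, request):
            # Reject if our current view id is newer than the message's view.
            if self.view > msg_view:
                return False
            # Reject unless we are in normal operation.
            if self.status != NORMAL:
                return False
            # Append the request with its sequence number and reply OK.
            self.log.append((seq, request))
            return True

        def on_view_change(self, msg_view):
            # Reject if our current view id is newer than the proposed view.
            if self.view > msg_view:
                return None
            # Move to the new view and stop normal processing.
            self.view = msg_view
            self.status = VIEW_CHANGE
            # Send the operation history and the latest normal view,
            # so the new primary can decide on the new log.
            return (self.log, self.last_normal_view)

    r = Replica()
    print(r.on_prepare(0, 1, "request A"), r.log)   # True [(1, 'request A')]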

  6. Announcement HW4 due next week (Nov 6)

  7. Today
     • The crown problem of distributed systems, a.k.a. consensus
     • We have covered a lot about why we need consensus: data replication and consistency, distributed storage

  8. Consensus
     Despite having different views of the world, all nodes in a distributed system must act in concert, e.g.:
     • All replicas that store the same object O must apply all updates to O in the same order (consistency)
     • All nodes involved in a transaction must either commit or abort their portion of the transaction (atomicity)
     All that, despite FAILURES:
     • Nodes can restart, die, or be slow
     • Networks can be slow; for simplicity, we assume the network is reliable, i.e., messages are eventually received

  9. The Consensus/Agreement Problem
     • Some nodes propose something (a value v) to other nodes
     • All nodes must decide whether to accept or reject those values
     Examples of "something":
     • A value
     • Whether or not to commit a transaction to a database
     • Whether to execute a client request
     • Who holds the lock in a distributed lock service, among multiple clients that request it almost simultaneously

  10. Timing Assumptions: Reminder
     • Synchronous system: known bound on processing delays and transmission delays
     • Asynchronous system: no assumption on the delays
     • Partially synchronous system: such a bound exists, but its value is unknown to the nodes

  11. Consensus Correctness in General
     • Safety (consistency): all nodes agree on the same value, and the agreed value X has been proposed by some node
     • Liveness (timing, performance): all nodes eventually agree on the value; if fewer than some fraction of nodes crash, the rest should still reach an agreement
     Replicated systems then just behave like a centralized one!

  12. Many forms of consensus/agreement
     Today: transactions (e.g., distributed databases)
     • Applications perform multi-operation, multi-object data access, e.g.:
       - Transfer money from one account to another
       - Insert Alice into Bob's friend list and insert Bob into Alice's friend list
     • What if failures occur in the middle of writing objects, or concurrent operations race with each other?

  13. ACID transactions
     • A (Atomicity): all-or-nothing w.r.t. failures
     • C (Consistency): transactions maintain any internal storage state invariants
     • I (Isolation): concurrently executing transactions do not interfere
     • D (Durability): effects of transactions survive failures

  14. ACID Example
     T1: Transfer $100 from A to B
       X = read(A); Y = read(B); write(A, X-100); write(B, Y+100)
     • A: T1 fully completes or leaves nothing
     • D: once T1 commits, T1's writes are not lost
     • I: no races, as if T1 happens either before or after T2
     • C: preserves invariants, e.g., account balance > 0

  15. Solution: WAL logging
     • Write-Ahead Logging (WAL): all state modifications must be written to the log before they are applied
     • Simplest WAL: REDO logging, which stores only REDO information in log entries
     • Transactions buffer writes during execution; this requirement is easy to satisfy now, but not in the 80s or 90s, when memory capacity was very low
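
A minimal sketch of REDO-only write-ahead logging as described above, assuming an in-memory Database class whose redo_log list stands in for a durable log; all names are illustrative.

    # Sketch: writes are buffered during execution, logged at commit,
    # and only then applied to the database state (illustrative).
    class Database:
        def __init__(self):
            self.data = {"A": 100, "B": 100}
            self.redo_log = []                      # stand-in for a durable log

        def run_transfer(self, src, dst, amount):
            buffered = {}                           # writes buffered during execution
            x = self.data[src]
            y = self.data[dst]
            buffered[src] = x - amount
            buffered[dst] = y + amount
            # WAL rule: log all modifications BEFORE applying them.
            self.redo_log.append(("COMMIT", dict(buffered)))
            for key, value in buffered.items():
                self.data[key] = value

        def recover(self):
            # After a crash, replay committed REDO records to restore state.
            for tag, writes in self.redo_log:
                if tag == "COMMIT":
                    self.data.update(writes)

    db = Database()
    db.run_transfer("A", "B", 100)
    print(db.data)   # {'A': 0, 'B': 200}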

  16. The Concurrency Control Challenge
     T1: Transfer $100 from A to B
     T2: Transfer $100 from A to C
     • A: T1 fully completes or leaves nothing
     • D: once T1 commits, T1's writes are not lost
     • I: no races, as if T1 happens either before or after T2
     • C: preserves invariants, e.g., account balance > 0
     Remember the differences between linearizability and sequential consistency?

  17. Problem of interleaving
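
For instance, an illustrative interleaving (constructed from the transfer and read-only transactions used on the surrounding slides, not taken from this slide): the reader observes A after the debit but B before the credit, so the printed total is wrong even though each transaction is correct in isolation.

    # Illustrative interleaving: T2's reads land in the middle of T1's writes,
    # so T2 observes $100 "in flight".
    data = {"A": 100, "B": 100}

    # T1 starts: transfer 100 from A to B
    x = data["A"]
    y = data["B"]
    data["A"] = x - 100           # T1 writes A

    # T2 runs in the middle: read A, read B, print the total
    print(data["A"] + data["B"])  # prints 100, but the true total is 200

    data["B"] = y + 100           # T1 writes B (too late for T2)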

  18. Ideal isolation semantics: serializability
     • Execution of a set of transactions is equivalent to some serial order
     • Two executions are equivalent if they have the same effect on the database and produce the same output

  19. Conflict serializability
     An execution schedule is the ordering of read/write/commit/abort operations.
     T1: X = read(A); Y = read(B); write(A, X+100); write(B, Y-100); commit
     T2: X = read(A); Y = read(B); print(X+Y); commit
     A (serial) schedule (all of T1, then all of T2): R(A), R(B), W(A), W(B), C, R(A), R(B), C

  20. Conflict serializability
     • Two schedules are equivalent if they contain the same operations and order conflicting operations the same way
     • Two operations conflict if they access the same data item and at least one of them is a write
     • A schedule is serializable if it is equivalent to some serial schedule
     • Strict (order-preserving) serializability: if T finishes before T' starts, T must be ordered before T' in the equivalent serial schedule

  21. Serializability Example
     T1: X = read(A); Y = read(B); write(A, X+100); write(B, Y-100); commit
     T2: X = read(A); Y = read(B); print(X+Y); commit
     Serializable?
       R1(A), R1(B), R2(A), R2(B), C2, W1(A), W1(B), C1
     Yes: it is equivalent to the serial schedule (T2 then T1)
       R2(A), R2(B), C2, R1(A), R1(B), W1(A), W1(B), C1

  22. Serializability Example
     T1: X = read(A); Y = read(B); write(A, X+100); write(B, Y-100); commit
     T2: X = read(A); Y = read(B); print(X+Y); commit
     Serializable?
       R1(A), R1(B), W1(A), R2(A), R2(B), C2, W1(B), C1
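
One standard way to test conflict serializability (not shown on the slides) is to build a precedence graph with an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj, and check it for cycles; a cycle means no equivalent serial schedule exists. A minimal Python sketch, with the schedule above encoded as (transaction, operation, item) triples:

    # Sketch: conflict-serializability test via precedence-graph cycle detection.
    # A schedule is a list of (txn, op, item) with op in {"R", "W"}.
    def precedence_edges(schedule):
        edges = set()
        for i, (ti, op_i, item_i) in enumerate(schedule):
            for tj, op_j, item_j in schedule[i + 1:]:
                conflicting = item_i == item_j and "W" in (op_i, op_j)
                if conflicting and ti != tj:
                    edges.add((ti, tj))       # Ti must precede Tj
        return edges

    def has_cycle(edges):
        graph = {}
        for a, b in edges:
            graph.setdefault(a, set()).add(b)
        def reachable(start, target, seen=None):
            seen = seen or set()
            for nxt in graph.get(start, ()):
                if nxt == target or (nxt not in seen and reachable(nxt, target, seen | {nxt})):
                    return True
            return False
        return any(reachable(b, a) for a, b in edges)

    # The schedule above: T2 reads A after T1's write to A,
    # but reads B before T1's write to B (commits omitted).
    s = [("T1", "R", "A"), ("T1", "R", "B"), ("T1", "W", "A"),
         ("T2", "R", "A"), ("T2", "R", "B"),
         ("T1", "W", "B")]
    print("conflict serializable:", not has_cycle(precedence_edges(s)))   # False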

  23. Realizing a serializable schedule
     Locking-based approach (remember mutual exclusion?)
     • Solution 1: grab a global lock before the transaction starts; release the global lock after the transaction commits
     • Solution 2: grab short-term, fine-grained locks on an item before accessing it:
       Lock(A); Read(A); Unlock(A); Lock(B); Write(B); Unlock(B)
     Discussion: which one is better? What are the potential issues?

  24. Solution 2 Problem
     T1: X = read(A); Y = read(B); write(A, X+100); write(B, Y-100); commit
     T2: X = read(A); Y = read(B); print(X+Y); commit
     Serializable?
       R1(A), R1(B), W1(A), R2(A), R2(B), C2, W1(B), C1
     Locks on writes should be held until the end of the transaction; otherwise we may read an uncommitted value.

  25. A better solution
     Solution 3: fine-grained locks
     • Long-term locks for writes: grab the lock before the write, release it after the transaction commits/aborts
     • Short-term locks for reads

  26. Solution 3 Problem
     T1: X = read(A); Y = read(B); write(A, X+100); write(B, Y-100); commit
     T2: X = read(A); Y = read(B); print(X+Y); commit
     Long-term locks for writes, short-term locks for reads?
       R1(A), R1(B), W1(A), R2(A), R2(B), C2, W1(B), C1
     Read locks must be held until commit time; otherwise we get non-repeatable (inconsistent) reads:
       R1(A), R2(A), R1(B), W1(A), W1(B), C1, R2(B), C2

  27. Realizing a serializable schedule: two-phase locking (2PL)
     • A growing phase, in which the transaction acquires locks
     • A shrinking phase, in which locks are released
     In practice: the growing phase spans the entire transaction, and the shrinking phase happens at commit time
     Optimization: use read/write locks instead of exclusive locks

  28. 2PL: An Example
     T1:
       Rlock(A); X = read(A)
       Rlock(B); Y = read(B)
       Wlock(A); buffer A = X-100
       Wlock(B); buffer B = Y+100
       log(A=0, B=200)
       write(A, 0); Unlock(A)
       write(B, 200); Unlock(B)
     T2:
       Rlock(A); X = read(A)
       Rlock(B); Y = read(B)
       print(X+Y)
       Unlock(A); Unlock(B)
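
A minimal Python sketch of the two-phase discipline under simplifying assumptions (exclusive per-item locks only, no read/write lock distinction, no logging); the LockManager class and the helper functions are illustrative, not the course's code.

    # Sketch: strict two-phase locking with exclusive per-item locks.
    import threading

    class LockManager:
        def __init__(self, items):
            self.locks = {item: threading.Lock() for item in items}

        def acquire(self, held, item):
            # Growing phase: take locks as items are first touched.
            if item not in held:
                self.locks[item].acquire()
                held.add(item)

        def release_all(self, held):
            # Shrinking phase: release everything only at commit/abort time.
            for item in held:
                self.locks[item].release()
            held.clear()

    data = {"A": 100, "B": 100}
    lm = LockManager(data)

    def transfer(amount):
        held = set()
        lm.acquire(held, "A")
        x = data["A"]
        lm.acquire(held, "B")
        y = data["B"]
        data["A"] = x - amount
        data["B"] = y + amount
        lm.release_all(held)          # commit point

    def read_total():
        held = set()
        lm.acquire(held, "A")
        lm.acquire(held, "B")
        total = data["A"] + data["B"]
        lm.release_all(held)
        return total

    transfer(100)
    print(read_total())   # 200; with locks held to commit, a reader never sees a half-applied transfer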

  29. How does this relate to distributed databases/storage?
     • Storage is sharded across multiple machines
     • Different machines store different subsets of the data

  30. 2PC
     T1: X = read(A); Y = read(B); write(A, X-100); write(B, Y+100); commit
     1. Client sends a request to the coordinator
     [Diagram: client sends the request to the coordinator, which talks to machines A and B]

  31. 2PC
     T1: X = read(A); Y = read(B); write(A, X-100); write(B, Y+100); commit
     1. Client sends a request to the coordinator
     2. Coordinator locks the items, and the machines apply the operations (A = X-100, B = Y+100)
     Discussion: what could go wrong?

  32. 2PC
     T1: X = read(A); Y = read(B); write(A, X-100); write(B, Y+100); commit
     1. Client sends a request to the coordinator
     2. Coordinator locks the items, and the machines apply the operations (A = X-100, B = Y+100)
     Discussion: what could go wrong?
     • A does not have enough money
     • B crashes
     • The coordinator crashes

  33. What we need
     TC (the coordinator), A, and B need to agree on whether they should execute the operations
     • Safety: if one commits, no one aborts; if one aborts, no one commits
     • Liveness: if there are no failures and A and B can commit, the action commits; if there are failures, reach a conclusion as soon as possible

  34. 2PC
     T1: X = read(A); Y = read(B); write(A, X-100); write(B, Y+100); commit
     1. Client sends a request to the coordinator

  35. 2PC
     T1: X = read(A); Y = read(B); write(A, X-100); write(B, Y+100); commit
     1. Client sends a request to the coordinator
     2. Coordinator sends a PREPARE message to A and B

  36. 2PC
     T1: X = read(A); Y = read(B); write(A, X-100); write(B, Y+100); commit
     1. Client sends a request to the coordinator
     2. Coordinator sends a PREPARE message to A and B
     3. A and B reply YES or NO (e.g., if A does not have enough balance, it replies NO)

  37. 2PC
     T1: X = read(A); Y = read(B); write(A, X-100); write(B, Y+100); commit
     1. Client sends a request to the coordinator
     2. Coordinator sends a PREPARE message to A and B
     3. A and B reply YES or NO
     4. Coordinator sends a COMMIT or ABORT message: COMMIT if both say YES, ABORT if either says NO

  38. 2PC
     T1: X = read(A); Y = read(B); write(A, X-100); write(B, Y+100); commit
     1. Client sends a request to the coordinator
     2. Coordinator sends a PREPARE message to A and B
     3. A and B reply YES or NO
     4. Coordinator sends a COMMIT or ABORT message: COMMIT if both say YES, ABORT if either says NO
     5. Coordinator replies ok to the client; A and B commit on receipt of the COMMIT message
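
Below is a minimal Python sketch of this message flow (not the course's code): local function calls stand in for network messages, and the Coordinator and Participant classes, with their prepare/commit/abort methods, are illustrative assumptions. Note how the decision rule matches step 4: COMMIT only if every vote is YES.

    # Sketch: two-phase commit with local calls standing in for messages.
    class Participant:
        def __init__(self, name, balance):
            self.name, self.balance, self.pending = name, balance, None

        def prepare(self, delta):
            # Vote YES only if the write can be applied (e.g. enough balance).
            if self.balance + delta < 0:
                return "NO"
            self.pending = delta          # hold the lock / buffer the write
            return "YES"

        def commit(self):
            self.balance += self.pending
            self.pending = None

        def abort(self):
            self.pending = None

    class Coordinator:
        def run(self, participants_and_deltas):
            # Phase 1: PREPARE, collect votes.
            votes = [p.prepare(d) for p, d in participants_and_deltas]
            # Phase 2: COMMIT if everyone voted YES, otherwise ABORT.
            decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
            for p, _ in participants_and_deltas:
                p.commit() if decision == "COMMIT" else p.abort()
            return decision

    a, b = Participant("A", 100), Participant("B", 100)
    print(Coordinator().run([(a, -100), (b, +100)]))   # COMMIT
    print(a.balance, b.balance)                        # 0 200
    print(Coordinator().run([(a, -100), (b, +100)]))   # ABORT (A lacks funds)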

  39. 2PC

  40. Why 2PC is correct
     • Neither can commit unless both agree to commit
     Performance?
     • Timeout: if an expected message is not received
     • Reboot: a node crashed, is rebooting, and must clean up

  41. 2PC
     T1: X = read(A); Y = read(B); write(A, X+100); write(B, Y-100); commit
     The coordinator waits for YES or NO:
     • If it hasn't heard back from A or B in time, it can abort
     • It might just be a network problem, but there are no safety issues
     A and B wait for COMMIT or ABORT from the coordinator:
     • If a participant sent NO, it can abort directly
     • Discussion: if it sent YES, can it safely commit?

  42. 2PC
     T1: X = read(A); Y = read(B); write(A, X+100); write(B, Y-100); commit
     A and B wait for COMMIT or ABORT from the coordinator:
     • If a participant sent NO, it can abort directly
     • Discussion: if it sent YES, can it safely commit?
     • What if the coordinator fails after it sends only one COMMIT message? A and B may wait forever
     • Discussion: how can we solve the problem?

  43. 2PC
     T1: X = read(A); Y = read(B); write(A, X+100); write(B, Y-100); commit
     Suppose A and B can talk to each other. B asks A for its status, and A replies. Four cases:
     1) No reply from A: no decision yet
     2) A received COMMIT or ABORT from the coordinator: agree with the coordinator's decision
     3) A hasn't voted yet, or voted NO: both abort
     4) A voted YES: both must wait for the coordinator
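
A minimal sketch of that cooperative termination rule from B's point of view; the state names and the terminate() helper are illustrative assumptions, not part of the slide.

    # Sketch: what B decides after asking A for its status while the
    # coordinator is silent (state names are illustrative).
    def terminate(my_vote, peer_state):
        """peer_state is one of: None (no reply), 'COMMIT', 'ABORT',
        'NOT_VOTED', 'VOTED_NO', 'VOTED_YES'."""
        if peer_state is None:
            return "WAIT"            # case 1: no reply, no decision yet
        if peer_state in ("COMMIT", "ABORT"):
            return peer_state        # case 2: adopt the coordinator's decision
        if peer_state in ("NOT_VOTED", "VOTED_NO"):
            return "ABORT"           # case 3: nothing can have committed
        if peer_state == "VOTED_YES" and my_vote == "YES":
            return "WAIT"            # case 4: both said YES, must wait for coordinator
        return "WAIT"

    print(terminate("YES", "VOTED_NO"))    # ABORT
    print(terminate("YES", "VOTED_YES"))   # WAIT (this is why 2PC can block)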

  44. 3PC
     Three-Phase Commit (3PC): never block on node failures the way 2PC does
     • Split the COMMIT/ABORT phase into two phases
     • First communicate the outcome to everyone
     • Let them commit only after everyone knows the outcome

  45. 3PC

  46. 3PC
     • If one of them has received PRE-COMMIT, they can all commit: we know for sure that both voted YES
     • If none of them has received PRE-COMMIT, they can all abort
     • Safe for node crashes (both coordinator and participants)
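
A minimal sketch of the resulting coordinator-side phases, assuming hypothetical send() and collect() helpers in place of real network I/O, and ignoring timeouts and recovery.

    # Sketch: the three phases of 3PC from the coordinator's point of view.
    def three_phase_commit(send, collect, participants):
        # Phase 1: CAN-COMMIT? (the same vote collection as 2PC's prepare phase)
        send(participants, "CAN-COMMIT?")
        if not all(vote == "YES" for vote in collect(participants)):
            send(participants, "ABORT")
            return "ABORT"
        # Phase 2: PRE-COMMIT: communicate the outcome to everyone first.
        send(participants, "PRE-COMMIT")
        collect(participants)          # wait for acknowledgements
        # Phase 3: DO-COMMIT: participants actually commit only now,
        # after everyone already knows the outcome.
        send(participants, "DO-COMMIT")
        return "COMMIT"

    # Toy stand-ins: every participant votes YES and acknowledges immediately.
    def send(ps, msg): print("coordinator ->", msg)
    def collect(ps): return ["YES"] * len(ps)

    print(three_phase_commit(send, collect, ["A", "B"]))   # COMMIT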

  47. 3PC: Safety and liveness?
     • Liveness (availability)? Yes: it doesn't block; it always makes progress by timing out
     • Safety (correctness)? No
     Can you think of scenarios in which 3PC would result in inconsistent states between the nodes? Two examples:
     • A hasn't crashed, it's just offline
     • The coordinator hasn't crashed, it's just offline

  48. 3PC with Network Partitions
     • The coordinator crashes after it sends PRE-COMMIT to A
     • A is partitioned later (or crashes and recovers later)
     • None of B, C, D has received PRE-COMMIT, so they will abort
     • A comes back and decides to commit

  49. Solution? Next week!

  50. Reading List
     Optional: Michael J. Franklin. Concurrency Control and Recovery. SIGMOD 1992.
