IS.651: Distributed Systems Consensus Challenges
Distributed systems present challenges such as consistency, concurrency, machine failures, network failures, and replication. Various replication models like Primary-Backup and Viewstamp Replication play crucial roles in ensuring system correctness and fault tolerance. The concept of consensus is discussed in depth, highlighting the need for coordinated actions among nodes despite potential failures. Announcements regarding upcoming assignments and the significance of data replication and consistency in distributed storage are also covered.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
IS 651: Distributed Systems Consensus Sisi Duan Assistant Professor Information Systems sduan@umbc.edu
Recall Consistency challenges Concurrency Machine failures Network failures Replication State machine replication Primary back up replication
Recall Primary Backup Replication Deal with backup failures Deal with primary failures
Recall Viewstamp replication Key to correctness View number a deterministic node as the primary No committed requests will be lost Quorum Majority nodes if we consider benign failures
0 1 2 Recall Seq A B C p0's log Committed if majority nodes said yes Backups monitor the primary A B p1's log time -> Client D E p2's log A Primary replies to the client After it gets majority OKs Request Reply p0 (primary) m=<PREPARE,v0,seq=1, request> p1 ok time -> Client p2 View moves to v1 if majority replied OK New primary decides on the new log On receiving a PREPARE message m: Reject if its current view id > m.v Reject if its status != NORMAL Append request with the seq to its log p0 (primary) m=<VIEW-CHANGE,v1> p1 ok p2 On receiving a VIEW-CHANGE message m: Reject if its current view id > m.v Set vid = m.v, status=VIEW-CHANGE Send its log (operation history) and the latest normal view
Announcement HW4 due next week (Nov 6)
Today The crown problem of distributed systems A.k.a. consensus We have covered a lot about why we need consensus Data replication consistency Distributed storage
Consensus Despite having different views of the world, all nodes in a distributed system must act in concert, e.g.: All replicas that store the same object O must apply all updates to O in the same order (consistency) All nodes involved in a transaction must either commit or abort their portion of the transaction (atomicity) All that, despite FAILURES Nodes can restart, die, be slow Networks can be slow but reliable For simplicity, we assume the network is reliable where messages are eventually received
The Consensus/Agreement Problem Some nodes propose something to other nodes All nodes must decide whether to accept or reject those values v v v v v Examples of something A value Whether or not to commit a transaction to a database Whether to execute a client request Who has the lock in a distributed lock service among multiple clients that request it almost simultaneously
Timing Assumptions: Reminder Synchronous system Known bound on processing delays and transmission delays Asynchronous system No assumption on the delays Partially synchronous system There is such as bound but the bound value is unknown to the nodes
Consensus Correctness in General Safety (consistency) All nodes agree on the same value The agreed value X has been proposed by some node Liveness (timing, performance) All nodes eventually agree on the value If less than some fraction of nodes crash, the rest should still reach an agreement Replicated systems just behave like a centralized one!
Many forms of consensus/agreement Today Transactions (e.g., distributed databases) Application perform multi-operation, multi-object data access Transfer money from one account to another Insert Alice to Bob s friendlist and insert Bob to Alice s friendlist What if Failures occur in the middle of writing objects Concurrent operations race with each other?
ACID transactions A (Atomicity) All-or-nothing w.r.t. failures C (Consistency) Transactions maintain any internal storage state invariants I (Isolation) Concurrently executing transactions do not interfere D (Durability) Effect of transactions survive failures
X = read(A) Y = read (B) Write (A, X-100) Write (B, y+100) ACID Example T1: Transfer $100 from A to B A D I C T1 fully completes or leaves nothing once T1 commits, T1 s writes are not lost no races, as if T1 happens either before or after T2 preserves invariants, e.g., account balance > 0
Solution: WAL logging Write-Ahead-Logging (WAL) All state modification must be written to log before they are applied Simplest WAL: REDO logging Only stores REDO information in log entries Transactions buffer writes during execution This requirement is easy to satisfy now, but not in 80s or 90s when memory capacity is very low
The Concurrency Control Challenge T1: Transfer $100 from A to B T2: Transfer $100 from A to C A D I C T1 fully completes or leaves nothing once T1 commits, T1 s writes are not lost no races, as if T1 happens either before or after T2 preserves invariants, e.g., account balance > 0 Remember the differences between Linearizability and sequential consistency?
Ideal isolation semantics: serializability Execution of a set of transactions is equivalent to some serial order Two executions are equivalent if they have the same effect on database and produce same output
Conflict serializability An execution schedule is the ordering of read/write/commit/abort operations X = read(A) Y = Read(B) Write (A, x+100) Write (B, y-100) commit X = read(A) Y = Read(B) Print (x+y) commit A (serial) schedule: R(A),R(B),W(A),W(B),C,R(A),R(B),C
Conflict serializability Two schedules are equivalent if they Contain the same operations Order conflicting operations the same way Two ops are conflicting if they access the same data and one is a write A schedule is serializable if it s equivalent to some serial schedule Strict serializability/Order-preserving serializability If T finishes before T starts, T must be ordered before T in equivalent serial schedule
Serializability Example X = read(A) Y = Read(B) Write (A, x+100) Write (B, y-100) commit X = read(A) Y = Read(B) Print (x+y) commit Serializable? R(A),R(B), R(A),R(B),C, W(A),W(B),C Equivalent R(A),R(B),C, R(A),R(B), W(A),W(B),C
Serializability Example X = read(A) Y = Read(B) Write (A, x+100) Write (B, y-100) commit X = read(A) Y = Read(B) Print (x+y) commit Serializable? R(A),R(B), W(A), R(A),R(B),C, W(B),C
Realize a serializable schedule Locking-based approach (remember mutual exclusion?) Solution 1: Grab global lock before transaction starts Release global lock after transaction commits Solution 2: Grab short-term fine-grained locks on an item before access Lock(A) Read(A) Unlock(A) Lock(B) Write(B) Unlock(B) Discussion: which one is better? What are the potential issues?
Solution 2 Problem X = read(A) Y = Read(B) Write (A, x+100) Write (B, y-100) commit X = read(A) Y = Read(B) Print (x+y) commit Serializable? R(A),R(B), W(A), R(A),R(B),C, W(B),C Locks on writes should be held till the end of transaction Otherwise we may read an uncommitted value
A better solution Solution 3 Fine-grained locks Long-term locks for writes Grab lock before write, release lock after tx commits/aborts Short-term locks for reads
Solution 3 problem X = read(A) Y = Read(B) Write (A, x+100) Write (B, y-100) commit X = read(A) Y = Read(B) Print (x+y) commit Long-term locks for writes, short-term locks for reads? R(A),R(B), W(A), R(A),R(B),C, W(B),C Read locks must be held till commit time Non-repeatable reads R(A), R(A),R(B), W(A), W(B),C, R(B),C
Realize a serializable schedule 2 phase locking (2PL) A growing phase in which the transaction is acquiring locks A shrinking phase in which locks are released In practice The growing phase is the entire transaction The shrinking phase is at the commit time Optimization Use read/write locks instead of exclusive locks
2PL: An Example Rlock(A) X = read(A) Rlock(B) Y = Read(B) Wlock(A) Buffer A = x-100 Wlock(B) Buffer B = y+100 log(A=0, B=200) Write (A, 0) Unlock(A) Write (B, 200) Unlock(B) Rlock(A) X = read(A) Rlock(B) Y = Read(B) Print (x+y) Unlock(A) Unlock(B)
How to relate this to distributed databases/storage? Storage is sharded across multiple machines Different machines store different subset of data
X = read(A) Y = Read(B) Write (A, x-100) Write (B, y+100) commit 2PC 1. Client sends a request to the coordinator client request Coordinator A B
X = read(A) Y = Read(B) Write (A, x-100) Write (B, y+100) commit 2PC 1. Client sends a request to the coordinator client 2. Coordinator locks the items and the machines apply the operations request Discussion: What could go wrong? Coordinator A=x-100 B=y+100 A B
X = read(A) Y = Read(B) Write (A, x-100) Write (B, y+100) commit 2PC 1. Client sends a request to the coordinator client 2. Coordinator locks the items and the machines apply the operations request Discussion: What could go wrong? Coordinator A does not have enough money B crashes A=x-100 B=y+100 Coordinator crashes A B
What we need TC (coordinator), A, and B need to agree on whether they should execute the operations Safety If one commits, no one aborts If one aborts, no one commits Liveness If no failures and A and B can commit, action commits If failures, reach a conclusion ASAP
X = read(A) Y = Read(B) Write (A, x-100) Write (B, y+100) commit 2PC 1. Client sends a request to the coordinator client request Coordinator A B
X = read(A) Y = Read(B) Write (A, x-100) Write (B, y+100) commit 2PC 1. Client sends a request to the coordinator client 2. Coordinator sends a PREPARE message request Coordinator PREPARE PREPARE A B
X = read(A) Y = Read(B) Write (A, x-100) Write (B, y+100) commit 2PC 1. Client sends a request to the coordinator client 2. Coordinator sends a PREPARE message 3. A, B replies YES or NO 1. If A does not have enough balance, reply no request Coordinator YES YES A B
X = read(A) Y = Read(B) Write (A, x-100) Write (B, y+100) commit 2PC 1. Client sends a request to the coordinator client 2. Coordinator sends a PREPARE message 3. A, B replies YES or NO request 4. Coordinator sends a COMMIT or ABORT message 1. COMMIT if both say yes 2. ABORT if either says no Coordinator COMMIT COMMIT A B
X = read(A) Y = Read(B) Write (A, x-100) Write (B, y+100) commit 2PC 1. Client sends a request to the coordinator client 2. Coordinator sends a PREPARE message 3. A, B replies YES or NO ok 4. Coordinator sends a COMMIT or ABORT message 1. COMMIT if both say yes 2. ABORT if either says no Coordinator 5. Coordinator replies to the client A,B commit on the receipt of commit message A B
Why 2PC is correct Neither can commit unless both agree to commit Performance? Timeout: If the expected message is not received Reboot: Node crashed, is rebooting, must clean up
X = read(A) Y = Read(B) Write (A, x+100) Write (B, y-100) commit 2PC Coordinator waits for YES or NO client - If it hasn t heard back from A or B in time, can abort request - It might be a network problem But no safety issues Coordinator A and B wait for COMMIT or ABORT from coordinator COMMIT COMMIT - If it sends a NO, can abort directly A B - Discussion: If it sends a YES, can we safely commit?
X = read(A) Y = Read(B) Write (A, x+100) Write (B, y-100) commit 2PC A and B wait for COMMIT or ABORT from coordinator client - If it sends a NO, can abort directly request - Discussion: If it sends a YES, can we safely commit? - What if coordinator fails after it sends only one COMMIT message? Coordinator COMMIT COMMIT - A and B may wait forever - Discussion: How to solve the problem? A B
X = read(A) Y = Read(B) Write (A, x+100) Write (B, y-100) commit 2PC Suppose A and B can talk client B->A: status? A can reply to B. Four cases request 1) No reply from A: no decision.. 2) A received commit or abort from coordinator: agree with the coordinator s decision Coordinator COMMIT COMMIT 3) A hasn t voted yet or voted no: both abort 4) A voted yes: both must wait for coordinator A B
3PC Three Phase Commit (3PC) Never block on node failures as 2PC did Split COMMIT/ABORT phase into two phases First communicate the outcome to everyone Let them commit only after everyone knows the outcome
3PC If one of them has received PRE-COMMIT, they can all commit We know for sure that both voted for YES If none of them has received PRE-COMMIT, they can all abort Safe for node crashes (both coordinator and participants)
3PC Safety and livenss? Liveness? (availability) yes Doesn t block, it always makes progress by timing out Safety? (correctness) nope Can you think of scenarios in which 3PC would result in inconsistent states between the nodes? Two examples A hasn t crashed, it s just offline Coordinator hasn t crashed, it s just offline
3PC with Network Partitions PRE-COMMIT Coordinator crashes after it sends PRE- COMMIT to A A is partitioned later (or crashes and recover later) None of B,C,D have got PRE-COMMIT, they will abort A comes back and decides to commit A B C D
Solution? Next week!
Reading List Optional Michael J. Franklin. Concurrency Control and Recovery. Sigmod 1992.