
CAP Theorem and Its Implications in Data Systems
Explore the CAP Theorem, Eric Brewer's Conjecture, and the desirable properties of data systems such as ACID transactions. Learn about the tradeoffs between Consistency, Availability, and Partition Tolerance in distributed environments.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
CAP + Clocks Time keeps on slipping, slipping
Logistics Last week s slides online Sign up on Piazza now No really, do it now Papers are loaded in HotCRP Sign up for account at http://cs7780.ccs.neu.edu I will make you a PC member so you can enter reviews Do review preferences
Logistics (2) Some changes to the schedule NENS field trip Moved other stuff down Don t forget about project proposals Details on content on the website
Logistics (3) Audit policy You must present at least one paper You must read all of them and contribute to the discussion (no dead weight) No project or project presentation requirement
Desirable properties of data systems ACID transactions Atomic: Either all of the transaction happens or none Consistent: After a transaction, all uniqueness properties are maintained Isolation: One transaction does not affect another Durable: Once complete, the transaction remains complete
Move to distributed environment CAP Consistency: updates to data are applied to all or none Availability: must be able to access all data Partitions: failures can partition network into subtrees Why should we want all three? Is it possible?
Eric Brewers CAP theorem The Brewer Conjecture No system can simultaneously achieve C and A and P Implication: must perform tradeoffs to obtain 2 at the expense of the 3rd Never published, but widely recognized
CAP Examples (key, 1) A+P Availability Client can always read Impact of partitions Not consistent (key, 1) Replicate (key, 1) (key, 2) What about C+A? Doesn t really exist Partitions are always possible Tradeoffs must be made to cope with them C+P Consistency (key, 1) Error: Service Unavailable Reads always return accurate results Replicate Impact of partitions (key, 1) (key, 2) No availability 9
Proof of Theorem Strong proof when no clocks No way to reliably stay consistent What changes with clocks?
Weak consistency Use clocks, impose partial ordering Allows a form of eventual consistency Key result that drives many of today s systems return most of the data most of the time usually correct results Where is this not ok?
12 years later: Have the rules changed? What if partitions are rare? Then most of the time C/A are maintained CAP don t need to be binary properties Choosing between C or A is not a fixed decision, can change over time
How long is the partition? We can t choose to never have P But we can put constraints on how the system operates within some threshold time Primary partitions enable partial availability But require knowing invariants to reconcile consistency after partition is done
So what do you do when there is a partition? Detection Enter partition mode Why? Recover after partition is done What are the challenges?
Recovery We ve all done this (CVS, SVN, git, hg) How does it work? Commutative operations Commutative replicated data types Special data types will come up again and again
Exercise Let s match popular applications to consistency/availability models
Time It s pretty important, eh? It marches on Only goes forward
Time It defines a distributed system message transmission time is not negligible compared to the time between events in a single process
Partial time ordering No way for all clocks to have exactly the same time all the time Why? Instead, focus on a partial ordering happens before relationship use it to build a (arbitrary) total ordering this does not use physical time What is the difference?
Scalar Clocks 21 General technique described by Leslie Lamport Explicitly maps out time as a sequence of version numbers at each participant (from 1978!!) Key idea Each process maintains a counter ( clock ) When it sends a message, it includes the counter When receiving message, each process updates its clock to be greater than timestamp in message
Nice applications of Logical Clocks Fully distributed Mutual exclusion Fairness Liveness
Physical clocks What do they give us?
Vector clocks The idea A vector clock is a list of (node, counter) pairs Every version of every object has one vector clock Detecting causality If all of A s counters are less-than-or-equal to all of B s counters, then A is ancestor of B, and can be forgotten Intuition: A was applied to every node before B was applied to any node. Therefore, A precedes B Use vector clocks to perform syntactic reconciliation
Simple Vector Clock Example 25 Key features Writes always succeed Reconcile on read Write by Sx D1 ([Sx, 1]) Write by Sx D2 ([Sx, 2]) Possible issues Large vector sizes Need to be trimmed Write by Sy Write by Sz D3 ([Sx, 2], [Sy, 1]) D4 ([Sx, 2], [Sz, 1]) Solution Add timestamps Trim oldest nodes Can introduce error Read reconcile D5 ([Sx, 2], [Sy, 1], [Sz, 1])
Vector clock properties Isomorphic: Timestamps can be used to get causal ordering (and vice versa) Strong consistency Event counting Useful for distributed debugging, causal ordering, etc
Matrix time Add another dimension Everyone has a consistent view of everyone else s clock Can be used to prune stale information
Take-aways Consistency, Availability, Partitions Can we do them all together? Depends on goals, time Time is relative Order matters Logical ordering and causality can give us total ordering consistency