Challenges in Consistent Distributed Systems Design
Indranil Gupta delves into the hard choices faced in the design of extensible distributed systems, highlighting how the timeliness and correctness guarantees users want are at odds with the unpredictability of network delays and failures. The discussion spans the CAP theorem, probabilistic choices, and key-value/NoSQL storage systems, emphasizing the need to balance consistency and latency to meet user demands.
Presentation Transcript
Probabilistically Consistent
Indranil Gupta (Indy)
Department of Computer Science, UIUC
indy@illinois.edu
FuDiCo 2015
DPRG: http://dprg.cs.uiuc.edu
Joint Work With
- Muntasir Rahman (graduating PhD student)
- Luke Leslie, Lewis Tseng
- Mayank Pundir (MS, now at Facebook)
Work funded by Air Force Research Labs/AFOSR, National Science Foundation, Google, Yahoo!, and Microsoft.
Hard Choices in Extensible Distributed Systems
Users of extensible distributed systems want timeliness and correctness guarantees, but these are at odds with unpredictability: network delays and failures. The research community and industry often translate this tension into hard choices in systems design. Examples:
1. CAP theorem: choose between consistency and availability (or latency). Either relational databases or eventually consistent NoSQL stores (maybe a convergence now?).
2. Always getting 100% answers in computation engines (batch or stream), by using checkpointing.
Hard Choices Can in Fact Be Probabilistic Choices!
Many of these hard choices are really probabilistic choices. One of the earliest examples: pbcast/Bimodal Multicast. Examples:
1. CAP theorem: we derive a probabilistic CAP theorem that defines an achievable boundary between consistency and latency in any database system, and use it to incorporate probabilistic consistency and latency SLAs into Cassandra and Riak.
2. Always getting 100% answers in computation engines (batch or stream): in many systems, checkpointing results in 8-31x higher execution time. We show that in systems such as distributed graph processing systems we can avoid checkpointing altogether and instead take a reactive approach: upon failure, reactively scrounge state (which is naturally replicated), achieving very high accuracy (95-99%).
Key-value/NoSQL Storage Systems
- Key-value/NoSQL stores: a $3.4B sector by 2018.
- Distributed storage in the cloud: Netflix stores video positions (Cassandra), Amazon stores shopping carts (DynamoDB), and many others.
- Necessary API operations: get(key) and put(key, value), plus some extended operations, e.g., CQL in the Cassandra key-value store.
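To make the core API concrete, here is a minimal sketch of the get/put interface described above. It is an illustrative in-memory stand-in, not Cassandra or Riak code; the class and key names are hypothetical.

```python
# Minimal sketch of the key-value API: put(key, value) and get(key).
# In-memory stand-in for illustration only; not a real NoSQL store.

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store (or overwrite) the value under key."""
        self._data[key] = value

    def get(self, key):
        """Return the latest locally known value for key, or None."""
        return self._data.get(key)

store = KVStore()
store.put("user:42:video_position", 1370)   # e.g., a Netflix-style video position
print(store.get("user:42:video_position"))  # -> 1370
```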
Key-value/NoSQL Storage: Fast and Fresh
Cloud clients expect both:
- Latency: low latency for all operations (reads and writes). A 500 ms latency increase at Google.com costs a 20% drop in revenue; each extra millisecond costs about $4M in revenue; long latency causes user cognitive drift.
- Consistency: a read returns the value of one of the latest writes. Freshness of data means accurate tracking and higher user satisfaction.
Most KV stores only offer weak consistency (eventual consistency). Eventual consistency: if writes stop, all replicas converge, eventually.
Hard vs. Soft Partitions
- The CAP theorem considers hard partitions, e.g., between Data-center 1 (America) and Data-center 2 (Europe).
- However, soft partitions may happen inside a data-center: periods of elevated message delays or elevated loss rates, e.g., congestion at core or top-of-rack (ToR) switches.
- Soft partitions are more frequent than hard partitions.
Our Work: From Impossibility to Possibility
- C -> Probabilistic C (Consistency)
- A -> Probabilistic A (Latency)
- P -> Probabilistic P (Partition Model)
- A probabilistic CAP theorem.
- A system that validates how close we are to the achievable envelope.
(The goal is not another consistency model, or NoSQL vs. New/Yes SQL.)
Probabilistic CAP
(Timeline figure: writes W(1), W(2) and read R(1).)
- A read is tc-fresh if it returns the value of a write that starts at most tc time before the read.
- α is the likelihood that a random path (client -> server -> client) has message delay exceeding tp time units.
- pic is the likelihood that a read is NOT tc-fresh.
- pua is the likelihood that a read does NOT return an answer within ta time units.
- Probabilistic Consistency: (pic, tc). Probabilistic Latency: (pua, ta). Probabilistic Partition: (α, tp).
PCAP Theorem: it is impossible to achieve both Probabilistic Consistency and Probabilistic Latency under Probabilistic Partitions if tc + ta < tp and pua + pic < α.
- A bad network means high (α, tp).
- To get better consistency, lower (pic, tc); to get better latency, lower (pua, ta).
- Special case: the original CAP theorem corresponds to α = 1 with unbounded tp.
Full proof in our arXiv paper: http://arxiv.org/abs/1509.02464
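As a small worked example, the sketch below checks whether a given combination of parameters falls inside the impossible region stated by the theorem. The function name and the numeric values are illustrative only; the inequality is the one on the slide.

```python
# Sketch: evaluate the PCAP impossibility condition from the slide.
# Parameter names follow the slide's notation; the example numbers are made up.

def pcap_impossible(p_ic, t_c, p_ua, t_a, alpha, t_p):
    """True if (p_ic, t_c)-consistency and (p_ua, t_a)-latency cannot both be
    achieved under an (alpha, t_p) partition model, per the PCAP theorem:
    impossible when t_c + t_a < t_p and p_ua + p_ic < alpha."""
    return (t_c + t_a < t_p) and (p_ua + p_ic < alpha)

# Bad network (high alpha, large t_p) with tight SLAs -> impossible region.
print(pcap_impossible(p_ic=0.05, t_c=10, p_ua=0.05, t_a=20, alpha=0.3, t_p=50))   # True

# Classical CAP as the special case: alpha = 1, unbounded t_p, and
# deterministic guarantees (p_ic = p_ua = 0) -> always impossible.
print(pcap_impossible(p_ic=0.0, t_c=10, p_ua=0.0, t_a=20, alpha=1.0, t_p=float("inf")))  # True
```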
Towards Probabilistic SLAs
Latency SLA (similar to latency SLAs already existing in industry): meet a desired probability that the client receives the operation's result within the timeout, while maximizing the freshness probability within a given freshness interval.
Example: the Amazon shopping cart.
- Doesn't want to lose customers due to high latency: only 10% of operations may take longer than 300 ms. SLA: (pua, ta) = (0.1, 300 ms).
- Minimize staleness (don't want customers to lose items): minimize pic (given tc).
Towards Probabilistic SLAs (2)
Consistency SLA: meet a desired freshness probability (given a freshness interval), while maximizing the probability that the client receives the operation's result within the timeout.
Example: Google search / Twitter search.
- Wants users to receive recent data as search results: only 10% of results may be more than 5 minutes stale. SLA: (pic, tc) = (0.1, 5 min).
- Minimize response time (fast response to the query): minimize pua (given ta).
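One way to picture the two SLA flavors is as simple records holding the fixed target and the quantity left for the system to optimize. The field names below are hypothetical; the values come from the two examples above.

```python
# Sketch of representing the two probabilistic SLA flavors.
# Field names are hypothetical; values match the slide's examples.
from dataclasses import dataclass

@dataclass
class LatencySLA:
    p_ua: float    # max fraction of operations allowed to miss the deadline
    t_a_ms: float  # deadline (timeout) in milliseconds
    # Given this, the system minimizes p_ic for a fixed freshness interval t_c.

@dataclass
class ConsistencySLA:
    p_ic: float    # max fraction of reads allowed to be stale
    t_c_ms: float  # freshness interval in milliseconds
    # Given this, the system minimizes p_ua for a fixed timeout t_a.

shopping_cart_sla = LatencySLA(p_ua=0.1, t_a_ms=300)           # Amazon shopping cart example
search_sla = ConsistencySLA(p_ic=0.1, t_c_ms=5 * 60 * 1000)    # Google/Twitter search example
```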
Meeting These SLAs: PCAP Systems
The PCAP system sits on top of a KV store (Cassandra, Riak) and continuously adapts control knobs (adaptive control) so that the store always satisfies the PCAP SLA.
System assumptions:
- A client sends its query to a coordinator server, which forwards it to replicas (answers take the reverse path).
- Background mechanisms exist to bring stale replicas up to date.
Control knobs and their effect when increased:

  Increased knob      | Latency    | Consistency
  Read Delay          | Degrades   | Improves
  Read Repair Rate    | Unaffected | Improves
  Consistency Level   | Degrades   | Improves
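The sketch below illustrates the adaptive-control idea for a consistency SLA: measure the achieved staleness, compare it to the target, and nudge the read-delay knob (per the table, more read delay improves consistency at the cost of latency). The function names, step policy, and period are assumptions, not the actual PCAP-Cassandra controller.

```python
# Illustrative control loop for meeting a consistency SLA by tuning the
# read-delay knob. Measurement and actuation callbacks are hypothetical.
import time

def consistency_control_loop(target_p_ic, measure_p_ic, set_read_delay_ms,
                             step_ms=1.0, period_s=5.0):
    """Nudge the read-delay knob until measured staleness meets the SLA."""
    read_delay_ms = 0.0
    while True:
        observed_p_ic = measure_p_ic()       # fraction of stale reads in the last window
        if observed_p_ic > target_p_ic:
            read_delay_ms += step_ms         # trade latency for consistency
        elif read_delay_ms >= step_ms:
            read_delay_ms -= step_ms         # reclaim latency when there is slack
        set_read_delay_ms(read_delay_ms)
        time.sleep(period_s)
```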
Meeting Consistency SLA for PCAP Cassandra (pic = 0.135)
(Plot: observed consistency over time, for mean network latencies of 3 ms, 4 ms, and 5 ms.)
Setup:
- 9-server Emulab cluster; each server has 4 Xeon cores and 12 GB RAM; 100 Mbps Ethernet.
- YCSB workload (144 client threads).
- Network delay: log-normal distribution [Benson 2010].
Result: consistency always stays below the target SLA.
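To connect the experimental delay model to the PCAP partition parameter, the sketch below estimates α, the fraction of client-to-server-and-back paths whose delay exceeds tp, under a log-normal per-hop delay as in the setup above. The distribution parameters and sample count are illustrative, not the values used in the evaluation.

```python
# Sketch: estimate alpha (fraction of round trips exceeding t_p) under a
# log-normal per-hop delay model. Parameters are illustrative only.
import random

def estimate_alpha(t_p_ms, mu=1.0, sigma=0.5, samples=100_000):
    exceed = 0
    for _ in range(samples):
        # Round trip = client->server delay plus server->client delay.
        rtt = random.lognormvariate(mu, sigma) + random.lognormvariate(mu, sigma)
        if rtt > t_p_ms:
            exceed += 1
    return exceed / samples

print(estimate_alpha(t_p_ms=10.0))  # empirical alpha for this t_p
```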
Meeting Consistency SLA for PCAP Cassandra (pic = 0.135)
(Plot: the system's consistency-latency operating points against the optimal envelopes under different network conditions, derived from the PCAP theorems.)
The PCAP system satisfies the SLA and stays close to the optimal envelope.
Geo-Distributed PCAP
(Plot: PCAP controller vs. a multiplicative controller; the WAN delay distribution jumps from N(20, sqrt(2)) to N(22, sqrt(2.2)).)
- The latency SLA is met both before and after the delay jump.
- Fast convergence initially and after the delay jump.
- Reduced oscillation compared to the multiplicative controller.
- Consistency degrades after the delay jump.
Related Work
- Pileus/Tuba [Doug Terry et al.]: utility-based SLAs, focused on the wide area; could be used underneath our PCAP system (instead of our SLAs).
- Consistency metrics: PBS [Peter Bailis et al.] considers the write end time (we consider the write start time) and may not be able to define consistency for some read-write pairs (PCAP accommodates all combinations); it can be used in the PCAP system.
- Approximate answers: Hadoop [ApproxHadoop], querying [BlinkDB], Bimodal Multicast.
PCAP Summary
- The CAP theorem motivated the NoSQL revolution.
- But apps need freshness plus fast responses, even under soft partitions.
- We proposed probabilistic models for C, A, and P.
- The probabilistic CAP theorem generalizes classical CAP.
- The PCAP system satisfies latency/consistency SLAs and is integrated into the Apache Cassandra and Riak KV stores.
- Riak has expressed interest in incorporating these into their mainline code.
Distributed Graph Processing and Checkpointing
Checkpointing: proactively save state to persistent storage; if there is a failure, recover from the checkpoint (100% of the saved state), at a cost.
Used by:
- PowerGraph [Gonzalez et al., OSDI 2012]
- Giraph [Apache Giraph]
- Distributed GraphLab [Low et al., VLDB 2012]
- Hama [Seo et al., CloudCom 2010]
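For contrast with the reactive approach introduced later, here is a minimal sketch of the proactive checkpointing pattern described above: persist the full state every k iterations and roll back on failure. The function names, pickle-based storage, and the use of an exception as the failure signal are assumptions for illustration.

```python
# Minimal sketch of proactive checkpointing in an iterative graph computation.
# Names are hypothetical; this is not any particular system's implementation.
import pickle

def run_with_checkpoints(state, step, iterations, k=10, path="checkpoint.pkl"):
    with open(path, "wb") as f:
        pickle.dump((0, state), f)               # initial checkpoint
    i = 0
    while i < iterations:
        try:
            state = step(state, i)               # one iteration of the computation
            i += 1
            if i % k == 0:
                with open(path, "wb") as f:
                    pickle.dump((i, state), f)   # proactive save (the per-iteration overhead)
        except RuntimeError:                     # stand-in for a server failure
            with open(path, "rb") as f:
                i, state = pickle.load(f)        # roll back: recovers 100% of saved work
    return state
```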
Checkpointing Is Bad: 8-31x Increased Per-Iteration Execution Time
(Chart: the per-iteration execution time overhead of checkpointing ranges from 8x to 31x across these datasets.)

  Graph Dataset | Vertex Count | Edge Count
  CA-Road       | 1.96 M       | 2.77 M
  Twitter       | 41.65 M      | 1.47 B
  UK Web        | 105.9 M      | 3.74 B
Users Already Don't (Use or Like) Checkpointing
- "While we could turn on checkpointing to handle some of these failures, in practice we choose to disable checkpointing." [Ching et al. (Giraph @ Facebook), VLDB 2015]
- "Existing graph systems only support checkpoint-based fault tolerance, which most users leave disabled due to performance overhead." [Gonzalez et al. (GraphX), OSDI 2014]
- "The choice of interval must balance the cost of constructing the checkpoint with the computation lost since the last checkpoint in the event of a failure." [Low et al. (GraphLab), VLDB 2012]
- "Better performance can be obtained by balancing fault tolerance costs against that of a job restart." [Low et al. (GraphLab), VLDB 2012]
Our Approach: Zorro
- No checkpointing, so the common case is fast.
- When a failure occurs, opportunistically scrounge state from the surviving servers and continue the computation.
- This exploits the natural replication in distributed graph processing systems: a vertex's data is present at its neighbor vertices, and since each vertex is assigned to one server, its neighbors are likely on other servers.
- We get very high accuracy (95%+).
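The sketch below illustrates the reactive-recovery idea: after a failure, rebuild each lost vertex's value from the copies its neighbors hold on surviving servers, and accept an approximate result for anything that cannot be found. The data structures and names are hypothetical, not Zorro's actual implementation.

```python
# Illustrative sketch of reactive state scrounging after a failure.
# surviving_caches: one dict per surviving server, mapping vertex_id to the
# last value of that (remote) vertex seen via its local neighbors.

def scrounge_state(lost_vertices, surviving_caches):
    recovered = {}
    for v in lost_vertices:
        for cache in surviving_caches:
            if v in cache:
                recovered[v] = cache[v]   # state recovered from a neighbor's copy
                break
    # Vertices with no surviving neighbor copy stay unrecovered,
    # which is what makes the final result approximate.
    return recovered

# Example: vertices "a" and "b" were lost; only "a" had a cached neighbor copy.
print(scrounge_state({"a", "b"}, [{"a": 0.83}, {"c": 0.12}]))  # -> {'a': 0.83}
```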
Natural Replication => Can Retrieve a Lot of State
87-95% of the graph state is recoverable even after half the servers fail (LFGraph: 92-95%, PowerGraph: 87-91%).
Natural Replication => Low Inaccuracy
(Charts: inaccuracy stays low, around 2-3%, for both PowerGraph and LFGraph.)
Natural Replication => Low Inaccuracy

  Algorithm                     | PowerGraph | LFGraph
  PageRank                      | 2%         | 3%
  Single-Source Shortest Paths  | 0.0025%    | 0.06%
  Connected Components          | 1.6%       | 2.15%
  K-Core                        | 0.0054%    | 1.4%
  Graph Coloring*               | 5.02%      | NA
  Group-Source Shortest Paths*  | 0.84%      | NA
  Triangle Count*               | 0%         | NA
  Approximate Diameter*         | 0%         | NA
Takeaways
- Impossibility theorems and 100% correct answers are great, but they entail inflexibility in design (NoSQL or SQL) and high overhead (checkpointing).
- It is important to explore probabilistic trade-offs and achievable envelopes; this leads to more flexibility in design.
- Other applicable areas: stream processing, machine learning.
DPRG: http://dprg.cs.uiuc.edu
Plug: MOOC on Cloud Computing Concepts
- Free course on Coursera; ran Feb-Apr 2015 with 120K+ students. Next run: Spring 2016.
- Covers distributed systems and algorithms used in cloud computing.
- Free and open to everyone: https://www.coursera.org/course/cloudcomputing (or search Google for "Cloud Computing Course" and click the first link).
PCAP Consistency Metric Is More Generic Than PBS
(Timeline figures: writes W(1), W(2) and read R(1) under each definition.)
- PCAP: a read is tc-fresh if it returns the value of a write that starts at most tc time before the read starts; W(1) and R(1) can overlap.
- PBS: a read is tc-fresh if it returns the value of a write that starts at most tc time before the read ends; W(1) and R(1) cannot overlap.
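The sketch below is a literal encoding of the two conditions as worded on this slide, with the freshness window measured back from the read's start (PCAP) versus the read's end (PBS-style). The interval handling is simplified and the example times are made up.

```python
# Sketch: the two t_c-freshness conditions, transcribed from the slide's wording.

def fresh_pcap(write_start, read_start, t_c):
    # PCAP: the write starts at most t_c before the read STARTS,
    # so a write overlapping the read can still be counted.
    return read_start - write_start <= t_c

def fresh_pbs(write_start, read_end, t_c):
    # PBS-style: the write starts at most t_c before the read ENDS.
    return read_end - write_start <= t_c

# A read spanning [10.0, 10.4] returns a write started at 9.0, with t_c = 1.0:
print(fresh_pcap(write_start=9.0, read_start=10.0, t_c=1.0))  # True
print(fresh_pbs(write_start=9.0, read_end=10.4, t_c=1.0))     # False
```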
GeoPCAP: 2 Key Techniques
(Diagram: a client with a consistency or latency SLA issues reads to multiple local data centers, each with its own probabilistic consistency/latency model (C1/L1, C2/L2, C3/L3), connected through a probabilistic WAN model.)
1. Probabilistic composition rules: compose the per-data-center models and the WAN model into a single composed model (CC, LC), then compare it against the client's SLA. Given a client C or L SLA, QUICKEST requires at least one DC to satisfy the SLA; ALL requires each DC to satisfy the SLA.
2. Tune the geo-delay knob using PID control.
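As a rough illustration of composition, the sketch below combines per-data-center SLA-miss probabilities under the two rules named above, assuming independence across data centers. The formulas and numbers are my assumptions for illustration, not necessarily GeoPCAP's exact composition rules.

```python
# Sketch: composing per-DC SLA-miss probabilities under QUICKEST and ALL,
# assuming independence across data centers. Values are hypothetical.
from math import prod

def compose_quickest(miss_probs):
    # QUICKEST: satisfied if at least one DC meets the SLA,
    # so the composed miss probability is that every DC misses.
    return prod(miss_probs)

def compose_all(miss_probs):
    # ALL: every DC must meet the SLA, so the composed miss probability
    # is one minus the probability that all DCs hit.
    return 1 - prod(1 - p for p in miss_probs)

per_dc = [0.05, 0.10, 0.02]        # hypothetical per-DC miss probabilities
print(compose_quickest(per_dc))    # 0.0001
print(compose_all(per_dc))         # ~0.162
```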
CAP Theorem => NoSQL Revolution
- Conjectured [Brewer 00], proved [Gilbert Lynch 02]; kicked off the NoSQL revolution.
- Abadi's PACELC: if P (partition), choose A or C; else, choose L (latency) or C.
(CAP triangle: Consistency + Partition-tolerance: HBase, HyperTable, BigTable, Spanner; Consistency + Availability: RDBMSs (non-replicated); Availability + Partition-tolerance: Cassandra, Riak, Dynamo, Voldemort.)