A Brief History of Key-Value Stores and Data Independence


Explore the evolution from web portals to Web 2.0, the challenges in building highly available web services and peer-to-peer networks, and the importance of data independence in file systems and databases.

  • History
  • Web Services
  • Data Independence
  • Key-Value Stores
  • Peer-to-Peer




Presentation Transcript


  1. A brief history of key-value stores. Landon Cox, April 9, 2018

  2. In the year 2000. Portals were thought to be a good idea: Yahoo!, Lycos, AltaVista, etc., with original content up front plus a searchable directory. The dot-com bubble was about to burst: it started to break around 1999, and lots of companies were washed out by 2001. Google was really taking off: founded in 1998, its PageRank was best-in-class for search, and it proved that great search is enough (and portals are dumb). Off in the distance: Web 2.0, Facebook, AWS, the cloud.

  3. Questions of the day. 1. How do we build highly-available web services? We want to support millions of users at high throughput. 2. How do we build highly-available peer-to-peer services? Napster had just about been shut down (it was centralized), BitTorrent was around the corner, and we want to scale to thousands of nodes with no centralized trusted administration or authority. The problem: everything can fall apart (and does). Some of the solutions to #2 can help with #1.

  4. Storage interfaces. What is the interface to a file system? A file hierarchy layered over physical storage, which processes access through open, read, write, mkdir, and create. What is the interface to a DBMS? A logical schema of attributes and values layered over physical storage, which processes access through SQL queries.
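
A minimal sketch of the two interfaces in Python, with sqlite3 standing in for the DBMS; the file names, table, and columns are invented for illustration.

    import os, sqlite3

    # File-system interface: a hierarchy of named byte arrays.
    os.makedirs("data", exist_ok=True)            # mkdir
    with open("data/users.txt", "w") as f:        # create/open
        f.write("1,alice\n2,bob\n")               # write raw bytes
    with open("data/users.txt") as f:             # read them back
        print(f.read())

    # DBMS interface: a logical schema queried declaratively.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    print(con.execute("SELECT name FROM users WHERE id = 1").fetchall())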

  5. Data independence. Data independence is the idea that storage issues should be hidden from programs: programs should operate on data independently of the underlying details. In what way do FSes and DBs both provide data independence? Both hide the physical layout of data, and the layout can change without altering how programs operate on the data. In what way do DBs provide stronger data independence? File systems leave the format of data within files up to programs, so one program can alter or corrupt a file's format, whereas database clients cannot corrupt the schema definition.
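
A small illustration of the stronger guarantee, again with sqlite3 as a stand-in: any program can scribble garbage into a shared file, but a malformed insert is rejected before it can violate the schema. The file, table, and constraint are invented.

    import sqlite3

    # Any process can corrupt a file's format with one stray write.
    with open("users.txt", "a") as f:
        f.write("not,a,valid,record!!\n")   # other programs now parse garbage

    # A DB client cannot sneak malformed data past the schema.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INTEGER NOT NULL, name TEXT NOT NULL)")
    try:
        con.execute("INSERT INTO users (id, name) VALUES (?, ?)", (3, None))
    except sqlite3.IntegrityError as e:
        print("rejected:", e)               # NOT NULL constraint failed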

  6. ACID properties. Databases also ensure ACID. What is meant by Atomicity? Sequences of operations are submitted via transactions, and all operations in a transaction succeed or fail together; there is no partial success (or failure). What is meant by Consistency? After a transaction commits, the DB is in a consistent state, where consistency is defined by data invariants; i.e., after a transaction completes, all invariants are true. What is the downside of ensuring Consistency? It is in tension with concurrency and scalability, particularly in distributed settings.
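
Atomicity in a sketch: sqlite3's connection context manager commits a transaction on success and rolls it back on an exception, so a transfer either fully happens or leaves no trace. The accounts and the simulated crash are invented.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    con.executemany("INSERT INTO accounts VALUES (?, ?)",
                    [("alice", 100), ("bob", 0)])
    con.commit()

    def transfer(src, dst, amount, crash=False):
        with con:  # transaction: commits on success, rolls back on exception
            con.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                        (amount, src))
            if crash:
                raise RuntimeError("failure between the two updates")
            con.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                        (amount, dst))

    try:
        transfer("alice", "bob", 50, crash=True)
    except RuntimeError:
        pass  # the debit was rolled back, not partially applied

    # Invariant holds: the total balance is unchanged.
    print(con.execute("SELECT SUM(balance) FROM accounts").fetchone())  # (100,)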

  7. ACID properties. What is meant by Isolation? Other processes cannot view the modifications of in-flight transactions. It is similar to atomicity in that the effects of a transaction cannot be partially viewed. What is meant by Durability? After a transaction commits, the data will not be lost: committed transactions survive hardware and software failures.

  8. ACID properties. Do file systems ensure ACID properties? No. Atomicity: operations can be buffered, re-ordered, and flushed asynchronously. Consistency: there are many different consistency models. Isolation: hard to ensure without a notion of transactions. Durability: the need to cache undermines guarantees (though programs can use sync). What do file systems offer instead of ACID? Faster performance, greater flexibility for programs, and a byte-array abstraction rather than a table abstraction.
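
When a program does need durability from a file system, it must ask for it explicitly, as the sync escape hatch above suggests. A minimal sketch (the log file and record are invented):

    import os

    with open("journal.log", "a") as f:
        f.write("committed: txn 42\n")
        f.flush()                 # drain Python's userspace buffer to the OS
        os.fsync(f.fileno())      # force the OS cache to stable storage
    # Only after fsync returns can the record be claimed durable.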

  9. Needs of cluster-based storage. We want three things: scalability (incremental addition of machines), availability (tolerating the failure or loss of machines), and consistency (sensible answers to requests). Traditional DBs fail to provide these features: their focus on strong consistency can hinder scalability and availability and requires a lot of coordination and complexity. For file systems, it depends: some offer strong consistency guarantees (with poor scalability), and some offer good scalability (with poor consistency).

  10. Distributed data structures (DDS). A paper from OSDI '00 by Steve Gribble, Eric Brewer, Joseph Hellerstein, and David Culler. It pointed out the inadequacies of traditional storage for large-scale services and proposed a new storage interface: more structured than file systems (structure is provided by the DDS), not as fussy as databases (no SQL), with a few operations on data-structure elements.

  11. Distributed data structures (DDS). The DDS presents a new storage interface: more structured than file systems (structure is provided by the DDS), not as fussy as databases (no SQL), with a few operations on data-structure elements. In the figure, processes issue get and put operations on key-value pairs through the DDS, which stores the pairs on storage bricks.
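
A toy sketch of that picture in Python: clients see only get and put, and an admin-assigned partition map (a point the DHT slides return to) decides which brick holds which keys. All class and variable names are invented.

    class StorageBrick:
        """One brick: a node holding a slice of the key space in memory."""
        def __init__(self):
            self.store = {}

    class DDS:
        """Routes each get/put to the brick the admin assigned its key to."""
        def __init__(self, bricks, partition_map):
            self.bricks = bricks
            self.partition_map = partition_map  # first letter of key -> brick index

        def _brick_for(self, key):
            return self.bricks[self.partition_map[key[0]]]

        def put(self, key, value):
            self._brick_for(key).store[key] = value

        def get(self, key):
            return self._brick_for(key).store.get(key)

    dds = DDS([StorageBrick(), StorageBrick()], partition_map={"a": 0, "b": 1})
    dds.put("apple", 1)
    print(dds.get("apple"))  # 1, served by brick 0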

  12. Distributed Hash Tables (DHTs). A DHT is the same idea as a DDS, but decentralized, with the same interface as a traditional hash table: put(key, value) stores value under key, and get(key) returns all the values stored under key. It is built over a distributed overlay network: partition the key space over the available nodes, and route each put/get request to the appropriate node. (The DHT slides that follow are adapted from Sean C. Rhea's talks "OpenDHT: A Public DHT Service" and "Fixing the Embarrassing Slowness of OpenDHT on PlanetLab"; slides marked LPC are the instructor's additions.)

  13. How DHTs work. How do we ensure that the put and the get find the same machine? How does this work in DNS? In the figure, put(k1, v1) from one client and get(k1) from another are both routed to the same node, which stores the pair (k1, v1).

  14. Nodes form a logical ring (in the figure, nodes 000, 010, 100, and 110). First question: how do new nodes figure out where they should go on the ring?

  15. Step 1: Partition the key space. Each node in the DHT will store some (k, v) pairs. Given a key space K, e.g. [0, 2^160), choose an identifier for each node, id_i in K, uniformly at random. A pair (k, v) is stored at the node whose identifier is closest to k. The key technique is cryptographic hashing: node id = SHA-1(MAC address), and P(SHA-1 collision) <<< P(hardware failure), so nodes can compute their ids independently. Contrast this with DDS, in which an admin manually assigned nodes to partitions.
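
A sketch of this partitioning in Python, with SHA-1 ids on a ring of size 2^160; measuring closeness as distance around the ring is one common convention, and the node names are invented.

    import hashlib

    RING = 2 ** 160

    def node_id(addr):
        # Nodes derive their own ids, e.g. from a MAC address or hostname.
        return int.from_bytes(hashlib.sha1(addr.encode()).digest(), "big")

    def ring_distance(a, b):
        d = abs(a - b)
        return min(d, RING - d)      # distance around the ring

    def home_node(key, node_ids):
        k = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
        return min(node_ids, key=lambda n: ring_distance(n, k))

    nodes = [node_id(f"node-{i}") for i in range(8)]
    print(hex(home_node("some-key", nodes)))  # puts and gets both land here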

  16. Step 2: Build the overlay network. Each node has two sets of neighbors. Its immediate neighbors in the key space are important for correctness. Its long-hop neighbors allow puts/gets to complete in O(log n) hops.
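
One concrete way to choose long-hop neighbors is Chord-style fingers, where finger i points at the first node at distance 2^i around the ring. This is a sketch of that scheme, not necessarily the one OpenDHT uses (its routing is prefix-based, as later slides show).

    import bisect

    RING = 2 ** 160

    def successor(ids_sorted, point):
        """First node at or clockwise of `point` on the ring."""
        i = bisect.bisect_left(ids_sorted, point % RING)
        return ids_sorted[i % len(ids_sorted)]

    def neighbors(me, ids_sorted):
        idx = ids_sorted.index(me)
        leaf_set = [ids_sorted[idx - 1],                      # predecessor
                    ids_sorted[(idx + 1) % len(ids_sorted)]]  # successor
        fingers = [successor(ids_sorted, me + 2 ** i) for i in range(160)]
        return leaf_set, fingers   # correctness + O(log n) routing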

  17. Step 3: Route puts/gets through the overlay. Route greedily, always making progress: in the figure, a get(k) hops from node to node, each hop landing closer to k in the key space.

  18. How does lookup work? Assign IDs to nodes, and map each hash value to the node with the closest ID. The leaf set is a node's successors and predecessors, which matters for correctness. The routing table matches successively longer prefixes of the lookup ID, which matters for efficiency. Explain the red arrows. Explain the green arrows.
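
Prefix routing in a sketch, with ids as fixed-length hex strings: each hop forwards to a neighbor sharing a longer prefix with the lookup key, falling back to the numerically closest id (the leaf-set rule) when no such neighbor exists. The function names and the fallback detail are simplifications.

    def common_prefix_len(a, b):
        n = 0
        while n < len(a) and a[n] == b[n]:
            n += 1
        return n

    def next_hop(my_id, key, neighbors):
        mine = common_prefix_len(my_id, key)
        better = [n for n in neighbors if common_prefix_len(n, key) > mine]
        if better:  # routing-table hop: strictly longer prefix match
            return max(better, key=lambda n: common_prefix_len(n, key))
        # Leaf-set hop: no prefix progress, take the numerically closest id.
        return min(neighbors, key=lambda n: abs(int(n, 16) - int(key, 16)))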

  19. Iterative vs. recursive. The previous example was a recursive lookup; we could also perform the lookup iteratively. Which one is faster?

  20. Iterative vs. recursive. Why might I want to do this iteratively?

  21. Iterative vs. recursive. What does DNS do, and why?
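
A toy contrast of the two styles, simulating the overlay in one process with a tiny ring and the greedy rule from slide 17; all nodes, keys, and values are invented. Recursively, servers forward among themselves; iteratively, the client drives every hop itself.

    NODES = [0, 16, 32, 64, 128, 200]
    NEIGHBORS = {n: [NODES[i - 1], NODES[(i + 1) % len(NODES)]]
                 for i, n in enumerate(NODES)}
    VALUES = {64: {70: "v1"}}    # node 64 stores key 70

    def route_step(node, key):
        """Greedy rule: step to whichever known node is closest to the key."""
        return min(NEIGHBORS[node] + [node], key=lambda n: abs(n - key))

    def recursive_lookup(node, key):
        nxt = route_step(node, key)
        if nxt == node:
            return VALUES.get(node, {}).get(key)
        return recursive_lookup(nxt, key)   # server-to-server forwarding

    def iterative_lookup(entry, key):
        node = entry
        while True:
            nxt = route_step(node, key)     # one round trip back to the client
            if nxt == node:
                return VALUES.get(node, {}).get(key)
            node = nxt

    print(recursive_lookup(0, 70), iterative_lookup(0, 70))  # v1 v1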

  22. (LPC: from the Pastry paper) Example routing state. (Figure: a Pastry node's routing state.)

  23. OpenDHT partitioning. Assign each node an identifier from the key space. Store each key-value pair (k, v) on the several nodes whose IDs are closest to k, and call them the replicas for (k, v). In the figure, the node with id 0xC9A1 is responsible for the keys closest to its id.
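
Replica placement in a sketch: take the r closest nodes on the ring, where r is a system parameter (the default below is invented).

    RING = 2 ** 160

    def ring_distance(a, b):
        d = abs(a - b)
        return min(d, RING - d)

    def replicas_for(key_id, node_ids, r=4):
        """The r nodes whose ids are closest to the key hold its replicas."""
        return sorted(node_ids, key=lambda n: ring_distance(n, key_id))[:r]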

  24. OpenDHT graph structure. Overlay neighbors match prefixes of the local identifier (in the figure, node 0xED with neighbors 0xC0, 0x41, and 0x84). Among nodes with the same matching prefix length, choose by network latency.
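
That latency rule, proximity neighbor selection, in a sketch: any candidate with the required prefix fills the routing slot correctly, so pick the one with the lowest measured round-trip time. The RTT table is invented.

    def pick_neighbor(candidates, rtt_ms):
        """All candidates fill the same routing slot; prefer the lowest latency."""
        return min(candidates, key=lambda n: rtt_ms.get(n, float("inf")))

    print(pick_neighbor(["0xC0", "0x84"], {"0xC0": 12.0, "0x84": 85.0}))  # 0xC0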

  25. Performing gets in OpenDHT. The client sends a get request to a gateway node (in the figure, get(0x6b)). The gateway routes it along its neighbor links to the first replica encountered (here via 0x41 to node 0x6c). The replica sends the response back to the client directly over IP.

  26. DHTs: the hype. High availability: each key-value pair is replicated on multiple nodes. Incremental scalability: need more storage or throughput? Just add more nodes. Low latency: recursive routing, proximity neighbor selection, server selection, etc.

  27. Robustness against failure. If a neighbor dies, a node routes through its next-best one. If a replica dies, the remaining replicas create a new one to replace it.

  28. Routing around failures. Under churn, neighbors may have failed. How do we detect failures? Acknowledge each hop: every node on the route ACKs the message to the node that sent it.

  29. Routing around failures. What if we don't receive an ACK? On timeout, resend the message through a different neighbor.
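
Per-hop retry in a sketch: try the best neighbor first and fall back down the preference list on timeout. send_with_ack is a hypothetical transport call, assumed to raise TimeoutError when no ACK arrives in time.

    def forward(msg, neighbors_by_preference, timeout_s):
        """Try each neighbor in order until one ACKs the hop."""
        for neighbor in neighbors_by_preference:
            try:
                # Hypothetical transport: sends and waits for the per-hop ACK.
                send_with_ack(neighbor, msg, timeout=timeout_s)
                return neighbor
            except TimeoutError:
                continue   # presumed dead (or just slow); try the next-best
        raise RuntimeError("no live neighbors for this hop")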

  30. Computing good timeouts. What if the timeout is too long? Put/get latency increases. What if the timeout is too short? You get message explosion.

  31. (LPC) Computing good timeouts. There are three basic approaches to timeouts: keep them safe and static (~5 s), rely on a history of observed RTTs (TCP style), or rely on a model of RTT based on location.
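
The TCP-style option usually means a Jacobson/Karels-style estimator: track a smoothed RTT and a smoothed deviation, and set the timeout a few deviations above the mean. A sketch with TCP's customary gains:

    class RttEstimator:
        """Adaptive timeout from observed RTTs, in the style of TCP."""
        def __init__(self, alpha=0.125, beta=0.25, k=4.0):
            self.alpha, self.beta, self.k = alpha, beta, k
            self.srtt = None     # smoothed RTT
            self.rttvar = 0.0    # smoothed deviation

        def observe(self, rtt):
            if self.srtt is None:
                self.srtt, self.rttvar = rtt, rtt / 2
            else:
                self.rttvar = ((1 - self.beta) * self.rttvar
                               + self.beta * abs(self.srtt - rtt))
                self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt

        def timeout(self):
            if self.srtt is None:
                return 5.0       # no samples yet: fall back to safe and static
            return self.srtt + self.k * self.rttvar  # a few deviations above mean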

  32. Computing good timeouts. Chord errs on the side of caution: it is very stable, but gives long lookup latencies.

  33. (LPC) Timeout results. (Results figure.)

  34. Recovering from failures. We can't route around failures forever; we will eventually run out of neighbors. We must also find new nodes as they join, which is especially important if they are our immediate predecessors or successors: in the figure, part of a node's old responsibility becomes the new node's responsibility.

  35. Recovering from failures. The obvious algorithm is reactive recovery: when a node stops sending acknowledgements, notify its other neighbors of potential replacements. Similar techniques handle the arrival of new nodes.

  36. Recovering from failures. In the figure, when B stops acknowledging, its neighbors tell each other "B failed, use A" and "B failed, use D".

  37. The problem with reactive recovery. What if B is alive, but the network is congested? C still perceives a failure due to dropped ACKs, so C starts recovery, further congesting the network, and more ACKs are likely to be dropped. This creates a positive feedback cycle (= BAD).

  38. The problem with reactive recovery. What if B is alive, but the network is congested? This was the problem with Pastry: combined with poor congestion control, it caused the network to partition under heavy churn.

  39. Periodic recovery. Every period, each node sends its neighbor list to each of its neighbors ("my neighbors are A, B, D, and E").


  41. Periodic recovery. Every period, each node sends its neighbor list to each of its neighbors. How does this break the feedback loop? The volume of recovery messages is independent of the number of failures.

  42. Periodic recovery. Do we need to send the entire list? No, a node can send the delta from its last message.

  43. Periodic recovery. What if we contact only a random neighbor each period (instead of all neighbors)? The lists still converge in log(k) rounds (k = number of neighbors).
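
One round of the random-neighbor variant as a toy simulation: each node pushes its list to a single random neighbor, and the receiver merges what it hears, keeping its closest candidates. The data structures and the `keep` bound are invented.

    import random

    def periodic_round(neighbor_lists, keep=8):
        """Each node gossips its list to one random neighbor, who merges it."""
        for node, neighbors in list(neighbor_lists.items()):
            if not neighbors:
                continue
            target = random.choice(sorted(neighbors))
            merged = (neighbor_lists[target] | neighbors | {node}) - {target}
            # The receiver retains only its `keep` closest candidate ids.
            neighbor_lists[target] = set(
                sorted(merged, key=lambda n: abs(n - target))[:keep])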

  44. (LPC) Recovery results. (Results figure.)

  45. More key-value stores. There are two settings in which you can use DHTs: a DDS in a cluster, and Bamboo on the open Internet. How is the cloud (e.g., EC2) different or similar? The cloud is a combination of fast and slow networks, it is under a single administrative domain, and cloud machines should fail less frequently. HyperDex targets this more forgiving environment.
