
High Throughput Writes for Cloud Infrastructure - Lecture Insights
Explore the concepts of high-throughput updates in cloud infrastructure, emphasizing efficient batching methods, memory management, and disk storage strategies. Learn about collecting and batching updates, spilling data from memory to disk, and merging in-memory and on-disk data structures for optimal performance.
Presentation Transcript
14-848: CLOUD INFRASTRUCTURE LSMS AND BLOOM FILTERS BIG TABLE LECTURE 12 * FALL 2019 * KESDEN. Big Table portion adapted from: Chang et al. (Google, Inc.), "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006.
HIGH THROUGHPUT WRITES: LSMS AND BLOOM FILTERS
SCENARIO A system is needed to support high-throughput updates. The total data volume is larger than the main memory budget. Writes to secondary storage occur more quickly and efficiently when batched than when written individually; for example, writing a whole block of data at a time amortizes disk seek and rotational delay. Sorting and indexing data in main memory can be done relatively efficiently, causing relatively little delay.
COLLECT AND BATCH UPDATES IN MEMORY Collect updates in memory and sort them somehow: a sorted string list, a tree, etc. Updates to in-memory values are still possible, but once a value is written to disk, it stays written. Queries will need to find all records and merge them. Deletes are recorded as tombstones.
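A minimal sketch of the in-memory side of this idea, assuming a dict-based memtable and a sentinel object for tombstones (the class and names are illustrative, not from the lecture):

# Minimal in-memory memtable sketch: collects updates, marks deletes with
# tombstones, and can emit its contents in sorted-key order.
TOMBSTONE = object()  # sentinel marking a lazily deleted key

class Memtable:
    def __init__(self):
        self.entries = {}          # key -> value (or TOMBSTONE)

    def put(self, key, value):
        self.entries[key] = value  # in-memory values can be overwritten freely

    def delete(self, key):
        self.entries[key] = TOMBSTONE  # record the delete; don't physically remove

    def get(self, key):
        value = self.entries.get(key)
        return None if value is TOMBSTONE else value

    def sorted_items(self):
        # Sorting on demand stands in for a sorted string list or balanced tree.
        return sorted(self.entries.items())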
SPILL FROM MEMORY TO DISK As the memory budget approaches full, spill updates to disk: write out an entire sorted string table, or write out a subtree and then remove (prune) it from memory. Each dump from memory to disk forms a run of some kind, and runs are time-ordered.
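A sketch of the spill step under the same assumptions; here a run is one sorted file of JSON lines, None stands in for a tombstone, and the file naming gives the time ordering (the format is invented for illustration):

import json, os, time

def spill_to_disk(entries, run_dir):
    """Write the memtable contents as one sorted, time-ordered run file."""
    os.makedirs(run_dir, exist_ok=True)
    run_path = os.path.join(run_dir, f"run-{time.time_ns()}.jsonl")  # newer runs sort later
    with open(run_path, "w") as f:
        for key, value in sorted(entries.items()):   # whole sorted table in one sequential write
            f.write(json.dumps({"key": key, "value": value}) + "\n")
    return run_path  # caller can now clear the in-memory structure

# Example: spill_to_disk({"a": 1, "b": None}, "runs/")   # None marks a tombstone here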
MERGE, IDEA #1 Possibility #1: Merge portions of the in-memory data structure into the on-disk data structure as they are spilled. This is common when pruning in-memory trees and merging into on-disk trees. Drawback: memory can't be reclaimed any faster than it can be drained to disk piece-wise.
MERGE IDEA #2 Possibility #2: Dump from memory into a new "run", i.e., a new data structure in secondary storage, and merge the on-disk data structures in the background: by similar tree pruning, if trees; by merging files into new files, if tables. But what to do about searching a forest of trees on disk? We'll have a solution for this -- stand by.
MERGE IDEA #3 Compaction occurs as part of the merges: it deletes tombstoned records, merges multiple updates into one, and recovers storage from merged updates and deleted values.
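A sketch of compaction under the assumptions of the earlier run format (sorted JSON-line files, newest run last, None marking a tombstone). A real compactor would stream a k-way merge rather than load everything into memory; this version keeps the logic visible:

import json

def compact(run_paths, output_path):
    """Merge several time-ordered runs; the newest value wins and tombstones are dropped."""
    merged = {}
    for path in run_paths:                              # iterate oldest to newest
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                merged[record["key"]] = record["value"] # later runs overwrite earlier ones
    with open(output_path, "w") as out:
        for key in sorted(merged):
            if merged[key] is not None:                 # drop tombstoned records, reclaiming space
                out.write(json.dumps({"key": key, "value": merged[key]}) + "\n")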
LOG-STRUCTURED MERGE TREES (LSMS) When we spill subtrees or branches from an in-memory tree into a tree in secondary storage, this strategy is known as a Log-Structured Merge (LSM) Tree. The in-memory tree is often known as C0 and the tree in secondary storage as C1. If there are more levels of trees (separate trees, not levels within a tree), they are known as C1, C2, etc.
MOVING TOWARD BLOOM FILTERS: ADDRESSING THE CHALLENGE LEFT UNSOLVED We have multiple (possibly a large number of) independent key-value stores, each in secondary storage and relatively slow to access and search. A value may be contained within zero or more of the stores: never present; written once; written and over-written; written and lazily deleted by tombstoning; or written as multiple partial updates that need to be unioned to form the up-to-date record. We'll soon see why this is an important scenario.
MOVING TOWARD BLOOM FILTERS: GOAL Minimize the time and energy wasted searching data stores that do not have the requested key, while adding minimal overhead w.r.t. main memory footprint, processor load, etc.
MOVING TOWARD BLOOM FILTERS: IDEA AND CHALLENGES Idea: keep an in-memory index. Challenge: the on-disk data structures are most likely indexes themselves, or keep relatively small metadata, and they are still too large to fit in memory; hence the need to keep them on disk. (Large data values are normally kept in separate data stores, and the indexes keep references to their locations, etc.) Revised idea: we need an extremely dense index, ideally one or a few bits per item.
MOVING TOWARD BLOOM FILTERS: IDEA #1 Idea: keep an in-memory hash table of <key, store-ID>. Good: near-constant-time ability to discover which stores to search. Challenges: collisions need to be managed, collision lists get expensive in processor time and memory, and the approach could be limiting.
MOVING TOWARD BLOOM FILTERS: IDEA #2 Idea #2: keep one hash table per external store and check each, with one bit per address: present or not. Search only the data structures where the key is marked present. Good: smaller, and fast to check even given multiple data structures. Challenges: managing collisions means complexity, and reducing collisions means a much bigger table, which means more memory wasted.
MOVING TOWARD BLOOM FILTERS: IDEA #3 Idea: ignore collisions. Good: we can have small tables with little complexity. Bad: we could think something is present in a secondary store when it is really a colliding record, and waste time looking just to find nothing. Thought: if collisions aren't super common, this is probably okay.
BLOOM FILTERS: IDEA #4 Idea: use multiple (different) hash functions upon the same key, using light-weight, fast hash functions rather than the complex ones used for crypto, etc. Mark each associated bit. Consider the set of bits to be a signature; presence is possible only if all bits associated with a key are set. Good: reduces collisions. Bad: fills the table faster. Thought: it may be an okay trade. First conceived by Burton Howard Bloom in 1970.
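A minimal Bloom filter sketch. It simulates k lightweight hash functions by salting Python's built-in hash with a per-function seed (a real system would use stable, purpose-built hashes):

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # bit array, one bit per address

    def _positions(self, key):
        # k different hash functions, simulated by salting one hash with a seed.
        for seed in range(self.num_hashes):
            yield hash((seed, key)) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # All bits set -> possibly present (could be a false positive);
        # any bit clear -> definitely absent.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))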
PARAMETERS How many keys, max? What memory budget (one bit per address)? How many hash functions (one bit per hash function is consumed per key, but more hash functions mean less overlap between keys)? What false positive rate is tolerable? Based upon the probabilities, we could reduce this to an equation, but we'll take a pass; it is well studied. See Wikipedia for a discussion: https://en.wikipedia.org/wiki/Bloom_filter
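The standard sizing formulas the slide alludes to (discussed in the Wikipedia article) can simply be evaluated; a sketch for n expected keys and a target false positive rate p:

import math

def bloom_parameters(n_keys, target_fp_rate):
    """Approximately optimal bit count m and hash count k for n keys and false-positive rate p."""
    m = math.ceil(-n_keys * math.log(target_fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_keys) * math.log(2)))
    return m, k

# e.g. bloom_parameters(1_000_000, 0.01) -> roughly 9.6 million bits (about 1.2 MB) and 7 hashes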
PUTTING IT TOGETHER: HIGH THROUGHPUT WRITES: MEMTABLE, SSINDEX, SSTABLE This is a common idiom in practice. A memtable in main memory contains sorted values and, likely, a sorted <key, offset> index. When spilled to disk, it is divided into SSTables and SSIndexes, written separately. Indexes or Bloom filters are kept in memory. Merging happens in the background when a threshold is met in terms of the number of tables, etc., and merges perform compaction. Write-ahead logs are used to aid recovery. This idiom is used in some form by Cassandra, HBase, LevelDB, Bigtable, etc.
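A sketch of the read path this idiom implies: check the memtable first, then each on-disk run from newest to oldest, skipping runs whose Bloom filter says the key is absent. Plain dicts stand in for the memtable and the SSIndex/SSTable lookups, so the sketch stays self-contained:

def read_key(key, memtable, runs):
    """memtable: dict of key -> value (None = tombstone).
    runs: list of (bloom_filter, sstable) pairs, newest first; bloom_filter
    offers might_contain(key), sstable is a dict standing in for a slow disk lookup."""
    if key in memtable:                       # newest data wins
        return memtable[key]                  # may be None if tombstoned
    for bloom, sstable in runs:               # newest run first
        if not bloom.might_contain(key):      # skip runs that certainly lack the key
            continue
        if key in sstable:                    # only now pay for the (slow) disk lookup
            return sstable[key]
    return None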
SUMMARY Overall strategy: fill memory, spill to disk, search disk runs until they can be merged, and use Bloom filters to minimize unproductive searches. Update in memory, but merge independent changes once they are on disk. The overall strategy is sound even if it does not involve trees, for example by using sorted string tables, and even if it leaves a forest of data structures to be searched after consulting a Bloom filter.
BIG TABLE: A DISTRIBUTED STORAGE SYSTEM FOR STRUCTURED DATA
BIG TABLE: OVERVIEW Google paper, OSDI 2006: Chang et al. (Google, Inc.), "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006. Bigtable stores petabytes (and more) of structured data and supports applications ranging from batch-processing to real-time. It offers a richer data layout than a key-value store, with client-controlled structure, but without the impossible constraints of relational databases.
DATA MODEL A table is a sparse, distributed, persistent, multi-dimensional sorted map, indexed by a row key, column key, and timestamp: (row:string, column:string, time:int64) → string. Values are uninterpreted arrays of bytes.
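A sketch of this data model as a plain Python mapping; the nested key layout mirrors the (row, column, timestamp) index, and the example rows and values are illustrative only:

from typing import Dict, Tuple

# (row, column, timestamp) -> uninterpreted bytes
Cell = Tuple[str, str, int]
table: Dict[Cell, bytes] = {}

# Illustrative entries in the style of the lecture's web-table example.
table[("com.cnn.www", "contents:", 3)] = b"<html>..."
table[("com.cnn.www", "anchor:cnnsi.com", 9)] = b"link text"

def read_cell(row: str, column: str, timestamp: int) -> bytes:
    return table[(row, column, timestamp)]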
TABLE (WELL, A SLICE THEREOF)
ROWS Row keys are arbitrary strings. Reads and writes of a row are atomic (even across columns), which makes it easier to reason about results. Data is lexicographically sorted by row key.
TABLETS The row range is partitioned into tablets, which are used for distribution and load balancing. Reads of short row ranges are efficient; if data has good locality w.r.t. tablets, there is an efficiency boost. In the earlier example, grouping Web pages by reversed domain name made it efficient to find them by domain.
COLUMN FAMILIES Column families are sets of column keys and are the unit of access control and of disk and memory accounting. Data within a column family is usually of the same type and is compressed together. Column families are relatively stable, and there are far fewer of them than rows. Example column families: contents, anchor. Example column keys: anchor:cnnsi.com, anchor:my.look.com.
TIMESTAMPS Cells are versioned. Timestamps are times in microseconds, or alternately the user can assign them, e.g., as version numbers, etc. User-assigned timestamps need to be unique if collisions are to be avoided. Automated garbage collection keeps either the most recent n versions or the versions written within the last m amount of time.
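A sketch of the two garbage-collection policies named above, applied to one cell's version list (keep the most recent n, or keep versions written within the last m microseconds); the representation is illustrative:

def gc_versions(versions, keep_n=None, keep_within_us=None, now_us=None):
    """versions: list of (timestamp_us, value) pairs; returns the surviving versions."""
    survivors = sorted(versions, key=lambda v: v[0], reverse=True)  # newest first
    if keep_n is not None:
        survivors = survivors[:keep_n]                              # most recent n versions
    if keep_within_us is not None and now_us is not None:
        survivors = [v for v in survivors if now_us - v[0] <= keep_within_us]
    return survivors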
API Create and destroy tables and column families. Change metadata, e.g., access control, etc. Write or delete values. Iterate across rows. Iterate over a subset of the data. Single-row transactions (read-update-write). Cells can be used as counters. Client-provided server-side scripts can be used for transformation, filtering, summarization, etc.
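The slides list the operations without showing the client interface itself; the sketch below is a hypothetical Python-flavored paraphrase of those operations. Every class and method name here is invented for illustration and is not the actual Bigtable client API:

# Hypothetical client wrapper illustrating the operations listed above.
# None of these names come from the real Bigtable client library.
class IllustrativeClient:
    def create_table(self, name, column_families): ...
    def delete_table(self, name): ...
    def set_access_control(self, table, family, acl): ...       # metadata change
    def write(self, table, row, column, value, timestamp=None): ...
    def delete(self, table, row, column): ...
    def scan_rows(self, table, start_row, end_row, columns=None): ...
    def read_modify_write(self, table, row, fn): ...             # single-row transaction
    def increment(self, table, row, column, delta=1): ...        # cell used as a counter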
INFRASTRUCTURE: HOST SERVERS Bigtable is able to run on a general-purpose cluster and shares the cluster of servers with other applications. Resources are managed by the cluster manager: scheduling, resource management, handling failure, monitoring, etc.
INFRASTRUCTURE: FILE SYSTEMS Bigtable uses GFS or another distributed file system designed for big data, with a large block size and replication for robustness and throughput. Random writes are not required, due to timestamping; random reads are required to support queries.
BUILDING BLOCKS: SSTABLES The SSTable (Sorted String Table) is used to store tables. Each SSTable contains a sequence of data blocks and a sorted index of keys. The index is loaded into memory when the table is opened and can then be searched in memory, so a lookup requires only one disk access. Small SSTables can be cached entirely in memory.
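A sketch of the one-disk-access lookup: the sorted block index lives in memory, a binary search picks the candidate block, and a hit reads exactly one block from disk. The index layout and block encoding here are invented for illustration:

import bisect

class SSTableReader:
    def __init__(self, block_index, data_file):
        # block_index: sorted list of (first_key_in_block, file_offset, block_length),
        # loaded into memory when the table is opened.
        self.block_index = block_index
        self.data_file = data_file                   # open file handle to the SSTable

    def lookup(self, key):
        first_keys = [entry[0] for entry in self.block_index]
        i = bisect.bisect_right(first_keys, key) - 1 # last block whose range may contain key
        if i < 0:
            return None
        _, offset, length = self.block_index[i]
        self.data_file.seek(offset)                  # the single disk access
        block = self.data_file.read(length)
        return parse_block(block).get(key)

def parse_block(raw):
    # Stand-in decoder: assume each block is newline-separated "key<TAB>value" text.
    return dict(line.split("\t", 1) for line in raw.decode().splitlines() if line)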
IMMUTABILITY Only the memtable allows both reads and writes; everything else is immutable and versioned. This allows asynchronous deletes and mitigates the need for locking.
INFRASTRUCTURE: CHUBBY COORDINATION SERVICE Chubby provides locks, keeps access control lists, ensures there is not more than one active master (and, mostly, exactly one), keeps track of the location of tablet servers, keeps schema information, etc.
BUILDING BLOCKS: CHUBBY Chubby is a distributed lock service with 5 active replicas, one of which is the master; only the master serves requests, and a majority must be up for the service to work (it is Paxos-based). The namespace consists of directories and tiny files, and directories and files can be used as locks. Locks are leased, and callbacks can be requested.
IMPLEMENTATION Three parts: a client library; tablet servers, which store, provide access to, and manage tablets, can be added and removed, and split tablets as they grow too large (initially each table is one tablet); and a master server, which assigns tablets to tablet servers, load-balances tablet servers, performs garbage collection, and handles schema changes.
TABLET LOCATION The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but it is treated specially: it is never split, to ensure that the tablet location hierarchy has no more than three levels. The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row. Each METADATA row stores approximately 1 KB of data in memory.
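A sketch of the three-level location lookup described above, with the Chubby file, root tablet, and METADATA tablets modeled as in-memory stand-ins; the row-key encoding and structure names are illustrative, not the actual METADATA encoding:

def locate_tablet(table_id, row_key, chubby, metadata):
    """Walk the hierarchy: Chubby file -> root tablet -> METADATA tablet -> user tablet.
    chubby: dict holding the root tablet's location under a well-known name.
    metadata: maps a tablet location to its sorted list of (end_row_key, child_location)."""
    root_location = chubby["root_tablet_location"]        # level 1: file in Chubby

    def find_in(tablet_location, key):
        # Pick the first entry whose end key is >= the key we want (illustrative encoding
        # of "table identifier + end row" as the METADATA row key).
        for end_key, child_location in metadata[tablet_location]:
            if key <= end_key:
                return child_location
        return None

    meta_key = f"{table_id}:{row_key}"                    # illustrative key encoding
    metadata_tablet = find_in(root_location, meta_key)    # level 2: root tablet
    return find_in(metadata_tablet, meta_key)             # level 3: METADATA tablet -> user tablet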
TABLET ASSIGNMENT Each tablet is owned by one server at a time, but the data is redundant in the file system. The master keeps track of live tablet servers and assignments; Chubby is used to keep track of tablet servers, and the master monitors the Chubby directory. Tablet servers can't serve a tablet if they lose their exclusive lock, and tablets are reassigned when their server is not reachable. Servers notify the master if they have lost their lock, and the master heartbeats (asks a tablet server for the status of its lock).
SSTABLES, MEMTABLES, AND THE LOG Recall memtables and SSTables: a run is collected into a memtable and then dumped into an on-disk SSTable. A log file keeps transactions not yet committed to disk; it is append-only, which is good for writes. Recovery reads metadata from the METADATA table and reconstructs memtables from the logs.
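A sketch of the append-only log and of memtable reconstruction, assuming one JSON record per committed mutation (the record format is invented for illustration):

import json

def log_mutation(log_file, key, value):
    """Append a mutation to the write-ahead log before applying it to the memtable."""
    log_file.write(json.dumps({"key": key, "value": value}) + "\n")
    log_file.flush()                       # make the record durable before acknowledging the write

def recover_memtable(log_path):
    """Replay the log in order to reconstruct the memtable after a crash."""
    memtable = {}
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            memtable[record["key"]] = record["value"]   # later entries overwrite earlier ones
    return memtable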
LOCALITY GROUPS A locality group is a grouping of multiple column families, with a separate SSTable generated for each locality group. This makes it faster to access data that is accessed together. Other attributes can be assigned per locality group, such as keeping it in memory.
COMPRESSION Compression can be applied per block or across blocks; per-block compression enables small portions to be read without decompressing a larger block. Sometimes two-pass schemes are used. The emphasis is on speed over compression ratio.
2-LEVEL READ CACHING Scan cache: key-value pairs read from SSTables (temporal locality). Block cache: SSTable blocks read from GFS (spatial locality).
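A sketch of the two cache levels, each modeled as a small LRU cache; the capacities and the choice of LRU eviction are illustrative assumptions:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)             # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)       # evict the least recently used entry

scan_cache = LRUCache(10_000)   # key-value pairs returned from SSTables (temporal locality)
block_cache = LRUCache(1_000)   # raw SSTable blocks read from the file system (spatial locality)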
BLOOM FILTERS Reads need to consult all of the SSTables that make up a tablet. Bloom filters reduce the number that are accessed by skipping SSTables that don't have a matching row/column pair; ditto for row/column pairs that don't exist at all.
COMMIT LOG Bigtable uses one commit log per tablet server, not one per tablet. This reduces the number of files written, improves seek locality, reduces overhead, etc.; different files would mean different locations on disk. It complicates recovery, since few log entries relate to any one tablet; the fix is a parallel sort by key first, so that the entries for one tablet end up together.
TABLET MIGRATION Process: compact, freeze, compact again, then migrate. The log is clean for the move, with only a short freeze time.