The Bw-Tree: A B-tree for New Hardware Platforms


"Explore the innovative BW-Tree, a latch-free, log-structured B-tree designed for multi-core machines with large main memories and flash storage. Learn about its efficient key-ordered access, self-balancing capabilities, and performance highlights."

  • B-tree
  • Hardware Platforms
  • Multi-core
  • Flash Storage
  • Log-structured


Presentation Transcript


  1. The Bw-Tree: A B-tree for New Hardware Platforms. Justin Levandoski, David Lomet, Sudipta Sengupta

  2. An Alternate Title: The Bw-Tree: A Latch-free, Log-structured B-tree for Multi-core Machines with Large Main Memories and Flash Storage (BW = Buzz Word)

  3. The Buzz Words (1)
     • B-tree: key-ordered access to records; efficient point and range lookups; self-balancing
     • B-link variant (side pointers at each level)
     • [Figure: a B-tree with pages in memory and data on disk.]

  4. The Buzz Words (2)
     • Multi-core + large main memories
       • Latch (lock) free: worker threads do not set latches for any reason; threads never block; no data partitioning
       • Delta updates: no updates in place; reduces cache invalidation
     • Flash storage
       • Good at random reads and sequential reads/writes; bad at random writes
       • Use flash as an append log; implement a novel log-structured storage layer over flash

  5. Outline: Overview; Deployment scenarios (latch-free main-memory index in Hekaton, Deuteronomy); Bw-tree architecture; In-memory latch-free pages; Latch-free structure modifications; Cache management; Performance highlights; Conclusion

  6. Multiple Deployment Scenarios
     • Standalone (highly concurrent) atomic record store
     • Fast in-memory, latch-free B-tree
     • Data component (DC) in a decoupled Deuteronomy-style transactional system

  7. Microsoft SQL Server Hekaton
     • Main-memory optimized OLTP engine
     • The engine is completely latch-free
     • Multi-versioned, optimistic concurrency control (VLDB 2012)
     • The Bw-tree is the ordered index in Hekaton
     • http://research.microsoft.com/main-memory_dbs/

  8. Deuteronomy
     • Transaction Component (TC): (1) guarantees ACID properties; (2) has no knowledge of physical data storage; (3) logical locking and logging
     • Data Component (DC): (1) physical data storage; (2) atomic record modifications; (3) data could be anywhere (cloud/local)
     • Interaction contract between TC and DC:
       1. Reliable messaging: at-least-once execution (multiple sends)
       2. Idempotence: at-most-once execution (LSNs)
       3. Causality: if the DC remembers a message, the TC must also (write-ahead log (WAL) protocol)
       4. Contract termination: a mechanism to release the contract (checkpointing)
     • [Figure: client requests flow to the TC, which sends record and control operations to the DC; the DC accesses storage.]
     • http://research.microsoft.com/deuteronomy/

  9. Outline: Overview; Deployment scenarios; Bw-tree architecture; In-memory latch-free pages; Latch-free structure modifications; Cache management; Performance highlights; Conclusion

  10. Bw-Tree Architecture (focus of this talk)
      • B-tree Layer: API; B-tree search/update logic; operates on in-memory pages only
      • Cache Layer: logical page abstraction for the B-tree layer; brings pages from flash to RAM as necessary
      • Flash Layer: sequential writes to log-structured storage; flash garbage collection

  11. Outline: Overview; Deployment scenarios; Bw-tree architecture; In-memory latch-free pages; Latch-free structure modifications; Cache management; Performance highlights; Conclusion

  12. Mapping Table and Logical Pages
      • Pages are logical, identified by their index (PID) in the mapping table; pages may live in memory (index and data pages) or on flash
      • The mapping table translates a logical page ID to a physical address: a 1-bit flash/memory flag plus a 63-bit address
      • Important for latch-free behavior and log-structuring: it isolates an update to a single page
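To make the packed-word idea concrete, here is a minimal sketch with invented names (MappingTable, kFlashBit, install) that are not from the paper: each slot holds the 1-bit flash/memory flag and a 63-bit address in one word, so the cache layer can swap a page's entire state with a single compare-and-swap.

```cpp
// Illustrative sketch only, not the authors' code: a mapping table whose slots
// pack the 1-bit flash/memory flag and the 63-bit address into one word so a
// page's state can be replaced with a single compare-and-swap.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct MappingTable {
    static constexpr uint64_t kFlashBit = 1ull << 63;   // bit 63: 1 = on flash, 0 = in memory

    explicit MappingTable(std::size_t capacity) : slots(capacity) {}

    static uint64_t makeMemoryAddress(const void* ptr) {
        return reinterpret_cast<uint64_t>(ptr) & ~kFlashBit;   // low 63 bits: pointer
    }
    static uint64_t makeFlashAddress(uint64_t lssOffset) {
        return lssOffset | kFlashBit;                          // low 63 bits: LSS offset
    }
    static bool isOnFlash(uint64_t word) { return (word & kFlashBit) != 0; }

    uint64_t read(std::size_t pid) const { return slots[pid].load(); }

    // Atomically replace the page's current state word with a new one;
    // fails if another thread changed the slot first.
    bool install(std::size_t pid, uint64_t expected, uint64_t desired) {
        return slots[pid].compare_exchange_strong(expected, desired);
    }

    std::vector<std::atomic<uint64_t>> slots;   // indexed by logical page ID (PID)
};
```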

  13. Compare and Swap (CAS)
      • Atomic instruction that compares the contents of a memory location M to a given compare value
      • If they are equal, CAS installs a new given value in M; otherwise the operation fails
      • Example (M initially 20): CompareAndSwap(&M, 20, 30) succeeds and sets M to 30; CompareAndSwap(&M, 20, 40) then fails because M is no longer 20
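The slide's example maps directly onto std::atomic in C++; this small illustration (variable names other than M are mine) reproduces the success and failure cases described above.

```cpp
// The slide's CompareAndSwap(&M, compareValue, newValue) corresponds to
// compare_exchange_strong on a std::atomic.
#include <atomic>
#include <cassert>

int main() {
    std::atomic<int> M{20};

    int expected = 20;
    bool ok = M.compare_exchange_strong(expected, 30);   // M == 20, so 30 is installed
    assert(ok && M.load() == 30);

    expected = 20;
    ok = M.compare_exchange_strong(expected, 40);        // M is now 30, so this fails
    assert(!ok && M.load() == 30);
    assert(expected == 30);   // on failure, 'expected' is updated to M's current value
    return 0;
}
```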

  14. Delta Updates
      • Each update to a page (e.g., "insert record 50", "delete record 48") produces a new delta
      • The delta physically points to the existing root of the page's state
      • The delta's address is installed in the page's physical-address slot of the mapping table using compare and swap
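A hedged sketch of the install step, using invented types (Delta, tryInstallDelta) rather than the paper's code: the new delta is pointed at the page's current state and then published by a CAS on the mapping table slot.

```cpp
// Sketch of installing a delta update: the delta points at the existing page
// state, and a CAS on the mapping table slot makes it the new page root.
#include <atomic>
#include <cstdint>
#include <string>

struct Delta {
    enum class Kind { Insert, Delete, Update } kind;
    uint64_t key;
    std::string payload;
    const void* next = nullptr;   // prior page state (another delta or the base page)
};

std::atomic<const void*> pageSlot;   // this page's entry in the mapping table

// One attempt: succeeds only if the page state has not changed since we read it.
bool tryInstallDelta(Delta* d) {
    const void* current = pageSlot.load();
    d->next = current;
    return pageSlot.compare_exchange_strong(current, d);
}

int main() {
    static const int basePage = 0;        // stand-in for a real base page
    pageSlot.store(&basePage);
    Delta insert50{Delta::Kind::Insert, 50, "payload"};
    return tryInstallDelta(&insert50) ? 0 : 1;
}
```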

  15. Update Contention
      • Worker threads may try to install updates (e.g., "update record 35", "insert record 60") to the same state of the page
      • The winner succeeds; any losers must retry
      • The retry protocol is operation-specific (details in the paper)
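A hedged sketch of the loser-retries pattern (names invented; as the slide notes, the revalidation a loser must perform depends on the operation): on a failed CAS the thread observes the newly installed state and tries again.

```cpp
// Winner/loser pattern: a losing CAS re-reads the latest page state and retries.
#include <atomic>

struct Delta { const void* next = nullptr; /* update description elided */ };

std::atomic<const void*> pageSlot;

void installWithRetry(Delta* d) {
    const void* observed = pageSlot.load();
    for (;;) {
        d->next = observed;   // re-point the delta at the latest observed state
        if (pageSlot.compare_exchange_weak(observed, d))
            return;           // this thread won the race
        // Lost: 'observed' now holds the state another thread installed; a real
        // implementation would re-check operation-specific preconditions here
        // (e.g., the record may have been inserted or deleted meanwhile).
    }
}

int main() {
    static const int basePage = 0;
    pageSlot.store(&basePage);
    static Delta d;
    installWithRetry(&d);
    return 0;
}
```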

  16. Delta Types
      • The delta method is used to describe all updates to a page
      • Record update deltas: insert/delete/update of a record on a page
      • Structure modification deltas: split/merge information
      • Flush deltas: describe what part of the page is on log-structured storage on flash
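For reference, the three categories named on this slide could be modeled as a simple tag (names invented here, not taken from the paper):

```cpp
// Tag distinguishing the delta categories listed on the slide.
enum class DeltaKind {
    RecordUpdate,            // insert/delete/update of a record on a page
    StructureModification,   // split/merge information
    Flush                    // notes which part of the page already resides on LSS flash
};
```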

  17. In-Memory Page Consolidation
      • A long delta chain eventually degrades search performance
      • We eventually consolidate updates by creating and installing a new search-optimized page with the deltas applied
      • Consolidation is piggybacked onto regular operations
      • The old page state becomes garbage
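A hedged consolidation sketch with invented types (Node, tryConsolidate); it follows the steps above: walk the delta chain, build a new search-optimized base page with the deltas applied, and install it with one CAS, abandoning the attempt if the CAS loses.

```cpp
// Consolidation sketch: fold the delta chain into a fresh base page and swing
// the mapping table slot to it with a single CAS. Failure just means another
// thread changed the page first; consolidation is opportunistic.
#include <atomic>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Node {
    bool isBase = false;
    // Delta fields (used when !isBase):
    bool isInsert = false;
    uint64_t key = 0;
    std::string payload;
    const Node* next = nullptr;
    // Base-page contents (used when isBase): a search-optimized representation.
    std::map<uint64_t, std::string> records;
};

std::atomic<const Node*> pageSlot;   // the page's mapping table slot

bool tryConsolidate() {
    const Node* observed = pageSlot.load();

    // Collect the deltas (newest first) until the base page is reached.
    std::vector<const Node*> deltas;
    const Node* cur = observed;
    while (!cur->isBase) {
        deltas.push_back(cur);
        cur = cur->next;
    }

    // Apply the deltas oldest-first to a copy of the base page.
    Node* fresh = new Node();
    fresh->isBase = true;
    fresh->records = cur->records;
    for (auto it = deltas.rbegin(); it != deltas.rend(); ++it) {
        if ((*it)->isInsert) fresh->records[(*it)->key] = (*it)->payload;
        else                 fresh->records.erase((*it)->key);
    }

    // One CAS installs the consolidated page.
    if (pageSlot.compare_exchange_strong(observed, fresh)) {
        // The old chain ('observed' and the nodes it links to) is now garbage;
        // it must be reclaimed through the epoch mechanism, not freed here.
        return true;
    }
    delete fresh;   // lost the race; another thread updated the page meanwhile
    return false;
}

int main() {
    static Node base;
    base.isBase = true;
    base.records[48] = "old";
    static Node deleteRecord48{false, false, 48, "", &base};
    pageSlot.store(&deleteRecord48);
    return tryConsolidate() ? 0 : 1;
}
```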

  18. Garbage Collection Using Epochs
      • A thread joins an epoch prior to each operation (e.g., insert)
      • It always posts garbage to the list for the current epoch (not necessarily the one it joined)
      • Garbage for an epoch is reclaimed only when all threads have exited the epoch (i.e., the epoch drains)
      • [Figure: epochs 1 and 2, each with its member threads and its garbage collection list.]
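A minimal epoch sketch, assuming an invented structure (EpochManager with enter/exit/retire/advance) and using a mutex for brevity where the Bw-tree's real mechanism is latch-free; it encodes the three rules on the slide.

```cpp
// Epoch-based reclamation sketch: threads join the current epoch before an
// operation, garbage is posted to the current epoch, and an epoch's garbage is
// freed only after that epoch (and every older one) has drained.
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <vector>

class EpochManager {
public:
    std::size_t enter() {                         // join the current epoch
        std::lock_guard<std::mutex> g(m_);
        epochs_[current_].active++;
        return current_;
    }
    void exit(std::size_t e) {                    // leave the epoch that was joined
        std::lock_guard<std::mutex> g(m_);
        epochs_[e].active--;
    }
    void retire(std::function<void()> reclaim) {  // post garbage to the CURRENT epoch
        std::lock_guard<std::mutex> g(m_);
        epochs_[current_].garbage.push_back(std::move(reclaim));
    }
    void advance() {                              // open a new epoch, drain old ones in order
        std::lock_guard<std::mutex> g(m_);
        ++current_;
        for (auto it = epochs_.begin(); it != epochs_.end();) {
            if (it->first >= current_ || it->second.active > 0)
                break;                            // stop at the first epoch that has not drained
            for (auto& f : it->second.garbage) f();   // safe: no thread can still see this garbage
            it = epochs_.erase(it);
        }
    }

private:
    struct Epoch { int active = 0; std::vector<std::function<void()>> garbage; };
    std::mutex m_;                                // for brevity; the real design is latch-free
    std::size_t current_ = 0;
    std::map<std::size_t, Epoch> epochs_;         // epoch number -> members + garbage list
};
```

In this sketch a worker would wrap each tree operation in enter()/exit() and hand an unlinked page state to retire() instead of deleting it directly.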

  19. Outline: Overview; Deployment scenarios; Bw-tree architecture; In-memory latch-free pages; Latch-free structure modifications; Cache management; Performance highlights; Conclusion

  20. Latch-Free Node Splits
      • Page sizes are elastic: there is no hard physical threshold for splitting, so we can split when convenient
      • The B-link structure allows us to half-split without latching:
        1. Install the split at the child level by creating a new page
        2. Install the new separator key and pointer at the parent level
      • [Figure: "split" and "index entry" deltas installed through the mapping table, with logical pointers (PIDs) and physical pointers.]
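A hedged sketch of the two half-split installs, again with invented types (SplitDelta, IndexEntryDelta) and with the real work elided (choosing the separator key, building the new sibling page, handling CAS failures):

```cpp
// The two half-split steps are just two independent delta installs, each a CAS
// on one mapping table slot: first on the child being split, then on its parent.
#include <atomic>
#include <cstdint>
#include <vector>

using PID = uint32_t;

struct SplitDelta {            // posted on the child being split
    uint64_t separatorKey;     // keys >= separator now belong to the new sibling
    PID newSibling;            // side pointer (logical) to the new right sibling
    const void* next;          // prior state of the child page
};

struct IndexEntryDelta {       // posted on the parent
    uint64_t separatorKey;
    PID newChild;              // routes searches for keys >= separator to the sibling
    const void* next;          // prior state of the parent page
};

std::vector<std::atomic<const void*>> mappingTable(1024);   // PID -> page state

// Step 1: make the split visible at the child level.
bool installSplit(PID child, SplitDelta* d) {
    const void* cur = mappingTable[child].load();
    d->next = cur;
    return mappingTable[child].compare_exchange_strong(cur, d);
}

// Step 2: make the new sibling directly reachable from the parent.
bool installIndexEntry(PID parent, IndexEntryDelta* d) {
    const void* cur = mappingTable[parent].load();
    d->next = cur;
    return mappingTable[parent].compare_exchange_strong(cur, d);
}
```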

  21. Outline: Overview; Deployment scenarios; Bw-tree architecture; In-memory latch-free pages; Latch-free structure modifications; Cache management; Performance highlights; Conclusion

  22. Cache Management
      • Writes go sequentially to the log-structured store using large write buffers
      • Page marshalling: transforms in-memory page state into the representation written to the flush buffer; ensures correct ordering
      • Incremental flushing: usually only the deltas since the last flush are flushed, which increases writing efficiency
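A small sketch of a sequential flush buffer (an invented class; buffer swapping, the real serialization format, and error handling are omitted): marshalled page bytes are appended to a large buffer that is written to log-structured storage in one sequential I/O, and the append order fixes the write order in the log.

```cpp
// Flush buffer sketch: appends marshalled page bytes in order and writes the
// whole buffer sequentially to the log-structured store (LSS).
#include <cstddef>
#include <cstdint>
#include <vector>

class FlushBuffer {
public:
    explicit FlushBuffer(std::size_t capacity) : capacity_(capacity) { buf_.reserve(capacity); }

    // Marshal one page's unflushed bytes (e.g., only the deltas since the last
    // flush); returns the LSS offset this state will occupy once written.
    uint64_t append(const void* data, std::size_t len) {
        uint64_t offset = baseOffset_ + buf_.size();
        const auto* p = static_cast<const uint8_t*>(data);
        buf_.insert(buf_.end(), p, p + len);
        return offset;
    }

    bool full() const { return buf_.size() >= capacity_; }

    // One large sequential write to flash (stubbed out here).
    void writeToLSS() {
        // e.g., write(fd, buf_.data(), buf_.size()) in a real system
        baseOffset_ += buf_.size();
        buf_.clear();
    }

private:
    std::size_t capacity_;
    std::vector<uint8_t> buf_;
    uint64_t baseOffset_ = 0;   // where this buffer will land in the append-only log
};
```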

  23. Representation on LSS
      • [Figure: the mapping table (in RAM) points into a sequential log on flash memory containing base pages and Δ-records; write ordering in the log reflects the order in which page state was flushed.]

  24. Outline: Overview; Deployment scenarios; Bw-tree architecture; In-memory latch-free pages; Latch-free structure modifications; Cache management; Performance highlights; Conclusion

  25. Performance Highlights - Setup
      • Compared against: the BerkeleyDB standalone B-tree (no transactions) and a latch-free skiplist implementation
      • Workloads:
        • Xbox: 27M get/set operations from Xbox Live Primetime; 94-byte keys, 1200-byte payloads, read-write ratio of 7:1
        • Storage deduplication: 27M deduplication chunks from a real enterprise trace; 20-byte keys (SHA-1 hash), 44-byte payloads, read-write ratio of 2.2:1
        • Synthetic: 42M operations with randomly generated keys; 8-byte keys, 8-byte payloads, read-write ratio of 5:1

  26. Vs. BerkeleyDB (throughput in millions of operations/sec)
      Workload        Bw-tree   BerkeleyDB
      Xbox            10.40     0.56
      Synthetic        3.83     0.33
      Deduplication    2.84     0.66

  27. Vs. Skiplists
      • Synthetic workload: Bw-tree 3.83M ops/sec vs. latch-free skiplist 1.02M ops/sec
      • [Figure: breakdown of memory accesses into L1, L2, and L3 cache hits and RAM accesses for the Bw-tree vs. the skiplist on the synthetic workload.]

  28. Outline: Overview; Deployment scenarios; Bw-tree architecture; In-memory latch-free pages; Latch-free structure modifications; Cache management; Performance highlights; Conclusion

  29. Conclusion
      • Introduced a high-performance B-tree
      • Latch-free: delta updates installed with CAS (no updates in place); the mapping table isolates an address change to a single page
      • Log-structured (details in another paper): uses flash as an append log; updates are batched and written sequentially; a flush queue maintains ordering
      • Very good performance

  30. Questions?
