Computational Storage: Offloading Computation to Boost Performance

This lecture explores computational storage, in which computation is offloaded to the storage device for better performance: the typical architecture, its benefits, available devices, and open challenges, with real-world examples such as YourSQL, which filters data inside storage to improve efficiency.


Presentation Transcript


  1. CS 295: Modern Systems, Computational Storage (Spring 2019), Sang-Woo Jun

  2. Computational Storage
     Offloading computation to an engine on the storage device. Why?
     o Modern SSDs have a significant amount of embedded compute capacity (often 4 or more ARM cores), but it is not always busy
     o Some problems are latency-bound, and moving the data all the way to the CPU harms performance
     o The host-storage link becomes a bandwidth bottleneck with enough storage devices (4x 4-lane PCIe SSDs saturate a 16-lane PCIe root complex); moreover, the peak internal bandwidth of a storage device is typically higher than its link bandwidth
     o Moving data to the CPU consumes a lot of power
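The bandwidth-bottleneck point above is easy to check with back-of-envelope arithmetic. The per-lane figure below is an assumed PCIe Gen 3 value (~0.985 GB/s usable per lane); the slide only states the 4x 4-lane vs. 16-lane relationship, not this number.

```python
# Assumed PCIe Gen 3 usable bandwidth per lane (~0.985 GB/s) -- an
# illustrative figure, not taken from the slides.
GBPS_PER_LANE = 0.985

num_ssds, lanes_per_ssd = 4, 4
root_complex_lanes = 16

aggregate_ssd_bw = num_ssds * lanes_per_ssd * GBPS_PER_LANE
host_link_bw = root_complex_lanes * GBPS_PER_LANE

# Four 4-lane SSDs together present exactly as many lanes as the host's
# 16-lane root complex, so the host link is fully saturated:
print(aggregate_ssd_bw >= host_link_bw)  # True
```

Any computation done inside the devices subtracts directly from the traffic that must cross this saturated link.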

  3. Typical Computational Storage Architecture
     o The computation engine typically functions both as a PCIe endpoint (toward the host) and as a root complex (toward the storage devices)
     o The FTL may live on each storage device (off-the-shelf) or on the computation engine (open-channel SSDs or raw flash chips)
     o The computation may run on ARM cores, FPGAs, or something else
     [Diagram: Host <-PCIe-> Computation Engine <-PCIe-> Storage Devices]

  4. Some Available Devices
     Many come with near-data FPGA acceleration, e.g., the BittWare 250S+ and the EMC Dragonfire board.

  5. Some Points
     No standard interface or programming model exists yet
     o All existing implementations have custom interfaces with varying levels of access abstraction: block devices (transparent FTL), raw chip access, etc.
     o The Storage Networking Industry Association (SNIA) Computational Storage working group was only just created (2018)
     The accelerator cannot take advantage of the page cache
     o The page cache lives on the host, which the accelerator cannot access
     o Some database implementations even saw performance degradation because of this

  6. Example: YourSQL
     Filters data early, inside the storage device, to reduce the amount of data sent to the host
     o Offloads computation and saves link bandwidth
     o The query optimizer is modified to move queries with a low filtering ratio to an early position
     o The filtering-ratio metric is storage-aware, choosing queries that lower the read page count rather than the simple row count
     Hardware: Samsung PM1725
     Jo et al., "YourSQL: A High-Performance Database System Leveraging In-Storage Computing," VLDB 2016
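The storage-aware ordering idea can be sketched as follows: rank predicates by how many *pages* they eliminate rather than how many rows, and evaluate the most page-selective one first in storage. All names and counts below are illustrative assumptions, not values from the paper.

```python
# Hypothetical predicate statistics: how many pages survive each filter,
# out of a total table size in pages. Purely illustrative numbers.
total_pages = 5000
predicates = [
    {"name": "p_date",   "pages_after": 800},
    {"name": "p_region", "pages_after": 120},
    {"name": "p_flag",   "pages_after": 4000},
]

def page_filtering_ratio(p):
    # Fraction of pages this predicate eliminates (storage-aware metric,
    # in the spirit of YourSQL's page-count-based filtering ratio).
    return 1.0 - p["pages_after"] / total_pages

# Run the most page-selective predicate first so the in-storage engine
# reads, and ships to the host, the fewest pages.
plan = sorted(predicates, key=page_filtering_ratio, reverse=True)
print([p["name"] for p in plan])  # ['p_region', 'p_date', 'p_flag']
```

A row-count metric could pick a different order if many matching rows happen to cluster on few pages, which is exactly why the page-based metric matters for storage.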

  7. Example: YourSQL Evaluation
     Evaluated on a 16-core Xeon with 64 GB of memory, running MySQL
     o The near-storage compute is a dual-core ARM Cortex-R7
     o The query planner and storage engine were significantly rewritten
     Improves TPC-H benchmark performance by 3.6x over the baseline; the most-improved query sped up by 15x
     o Query type 1 (selection) improved 7x: storage bandwidth is used inefficiently in baseline MySQL
     o Query type 2 (join) improved 40x: the sizes of the joined tables are reduced by early filtering (baseline not fitting in memory?)
     Jo et al., "YourSQL: A High-Performance Database System Leveraging In-Storage Computing," VLDB 2016

  8. Example: BlueDBM
     A research prototype at MIT (2015) for distributed computational storage
     o 20-node cluster with 20 Virtex 7 FPGAs and 20 TB of flash in total
     o Each Virtex 7 FPGA is networked directly to the others via low-latency serial links (8x 10 Gbps per link)

  9. Latency Profile of Analytics on Distributed Flash Storage
     Distributed processing involves many system components:
     o Flash device access: ~75 µs
     o Storage software (OS, FTL, etc.): 100~1000 µs
     o Network interface (10GbE, Infiniband, etc.): 20~1000 µs
     o Actual processing: 50~100 µs

  10. Latency Profile of Analytics on Distributed Flash Storage
      Architectural modifications can remove unnecessary overhead:
      o Near-storage processing
      o Cross-layer optimization of the flash management software
      o A dedicated storage area network
      o A computation accelerator
      [Figure: after these changes, flash access (~75 µs) and processing (50~100 µs) dominate, with the remaining overheads below 5 µs]
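Summing the per-request latency components shows where the savings come from. The baseline numbers are the slide's figures (ranges taken at their lower bounds); the optimized software and network terms are assumed single-digit-microsecond values in line with the "< 5" figure on the slide.

```python
# Per-request latency components in microseconds (lower bounds of the
# slide's ranges). The optimized figures are assumptions.
baseline  = {"flash": 75, "storage_sw": 100, "network": 20, "processing": 50}
optimized = {"flash": 75, "storage_sw": 5,   "network": 5,  "processing": 50}

# After cross-layer FTL optimization, a dedicated SAN, and near-storage
# acceleration, only the flash access itself and the useful processing
# remain significant:
print(sum(baseline.values()), sum(optimized.values()))  # 245 135
```

Note that the flash access term is irreducible, which is why the next slides turn to hiding it (faster SAN) or amortizing it (accelerators).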

  11. Latency-Emphasized Example: Graph Traversal
      A latency-bound problem, because the next node to be visited cannot be predicted
      o Completely bound by storage access latency in the worst case
      Latency is improved by:
      1. A faster storage area network
      2. A near-storage accelerator
      [Diagram: three hosts (Host 1-3), each with flash (Flash 1-3) and an in-store processor]
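Why traversal is latency-bound can be sketched with a toy model: each frontier node requires one dependent flash read that cannot be overlapped or prefetched, so total time is roughly reads x access latency. This is an illustrative model, not BlueDBM code.

```python
from collections import deque

FLASH_READ_US = 75  # per-access flash latency from the earlier slide

def bfs_latency_us(adj, start):
    """Count dependent storage reads in a BFS; each visited node costs one
    flash read because the next node cannot be predicted (no prefetch)."""
    seen, queue, reads = {start}, deque([start]), 0
    while queue:
        v = queue.popleft()
        reads += 1  # fetch v's adjacency list from flash
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return reads * FLASH_READ_US

# Tiny diamond graph: 4 visits -> 4 serialized reads.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_latency_us(adj, 0))  # 300
```

In this model, halving the per-access latency (faster SAN, in-store processing) halves the traversal time, no matter how much bandwidth is available.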

  12. Latency-Emphasized Example: Graph Traversal
      An optimized flash system can achieve performance comparable to DRAM with a smaller cluster
      [Chart: nodes traversed per second (0~20,000) for Software + DRAM, Software + Separate Network, Software + Controller Network, and Accelerator + Controller Network; software performance measured using a fast SAN]

  13. Acceleration-Emphasized Example: High-Dimensional Search
      Curse of dimensionality: it is difficult to create an effective index structure for high-dimensional data
      o Typically, an index structure reduces the problem space, followed by direct comparison against the remaining data
      o Low locality between queries makes caching ineffective: everything comes from storage anyway
      Storage is therefore a good place for an accelerator, and the computation naturally scales as more storage is added
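Because no index helps, the search degenerates to a linear scan over every stored vector, which is exactly the streaming access pattern a near-storage accelerator handles well. A minimal brute-force sketch (illustrative; not the actual accelerator logic):

```python
import math

def nearest(query, vectors):
    """Brute-force nearest neighbor: one full pass over all stored vectors,
    comparing squared Euclidean distance. Every vector is read from storage,
    so the scan's cost scales with storage size -- as does the compute an
    in-storage accelerator could apply to it."""
    best_i, best_d = -1, math.inf
    for i, v in enumerate(vectors):
        d = sum((a - b) ** 2 for a, b in zip(query, v))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

vecs = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1)]
print(nearest((1.0, 1.0), vecs))  # 1
```

Since each comparison is independent, the scan parallelizes trivially across storage devices: add a device, and you add both capacity and compute.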

  14. Acceleration-Emphasized Example: High-Dimensional Search
      Image similarity search example
      o An effective way to overcome the CPU performance bottleneck
      o Much lower power consumption thanks to the FPGA
      [Chart: throughput comparison illustrating the CPU bottleneck]

  15. A More Complex Example: Key-Value Cache
      An in-memory cache (e.g., memcached) is used to cache high-latency queries
      The overall system (including the cache) is benchmarked with the BG social network benchmark
      [Charts: KVS cache miss rate (0~20%) and application throughput (0~500 requests/second) versus the number of active users (2~5.5 million)]

  16. A More Complex Example: Key-Value Cache
      BlueCache: a flash-based KVS architecture
      o Storage plus computation engine plugs directly into the network
      o All key-value pairs are stored in flash
      o A log-structured KV data store is managed by a near-data FPGA
      o Hardware-accelerated, dedicated storage network engines
      Flash performance is much lower than memory's, but BlueCache attempts to reclaim some of it using accelerators
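A log-structured KV store like the one described above can be sketched in a few lines: writes append to a log (flash-friendly, since flash cannot overwrite in place) and an index maps each key to its latest log offset. This is an illustrative host-side model, not BlueCache's FPGA implementation; all names are made up.

```python
class LogKV:
    """Minimal log-structured key-value store sketch."""

    def __init__(self):
        self.log = bytearray()   # stands in for the flash-resident log
        self.index = {}          # key -> (offset, length) of latest value

    def put(self, key, value: bytes):
        offset = len(self.log)
        self.log += value        # append-only: no in-place overwrite
        self.index[key] = (offset, len(value))

    def get(self, key):
        offset, length = self.index[key]
        return bytes(self.log[offset:offset + length])

kv = LogKV()
kv.put("a", b"hello")
kv.put("a", b"world")            # update appends; index points to new copy
print(kv.get("a"))               # b'world'
```

The stale `b"hello"` bytes remain in the log until garbage collection, which is the usual price of the append-only layout.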

  17. A More Complex Example: Key-Value Cache
      A flash-based KV cache has much larger capacity at comparable cost
      o Comparable performance on a large database
      o Much lower power consumption (40 W vs. 200 W)
      [Charts: application throughput and KVS cache miss rate versus the number of active users (2~5.5 million) for memcached, FatCache, and BlueCache; BlueCache sustains 130K requests/second at 4.0M users, a 4.18x improvement]

  18. A More Complex Example: Graph Analytics
      Recall GraFBoost: graph analytics using sort-reduce
      [Diagram: host software sends a vertex value update log to an FPGA sort-reduce accelerator with a wire-speed on-chip sorter, a multirate 16-to-1 merge-sorter, a multirate aggregator, 1 GB of DRAM, and accelerator-aware flash management; flash holds edge data, vertex data, active vertices, and partially sort-reduced files]
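The sort-reduce idea GraFBoost builds on can be sketched as: sort the (vertex, value) update log by vertex id, then reduce adjacent entries for the same vertex with an associative update function, shrinking the log before it is applied to flash-resident vertex data. An illustrative software model, not the FPGA pipeline:

```python
import functools
from itertools import groupby

def sort_reduce(updates, reduce_fn):
    """Sort (vertex_id, value) updates by vertex, then collapse runs of
    equal vertex ids with an associative reduce function. The output is
    both sorted (sequential-write friendly) and smaller than the input."""
    ordered = sorted(updates, key=lambda kv: kv[0])
    reduced = []
    for vid, group in groupby(ordered, key=lambda kv: kv[0]):
        values = [val for _, val in group]
        reduced.append((vid, functools.reduce(reduce_fn, values)))
    return reduced

# Example update log with duplicate vertices, reduced by addition
# (e.g., accumulating PageRank contributions):
log = [(3, 1.0), (1, 2.0), (3, 0.5), (1, 1.0)]
print(sort_reduce(log, lambda a, b: a + b))  # [(1, 3.0), (3, 1.5)]
```

Because reduction happens on sorted runs, it can be interleaved with the merge passes of an external sort, which is what lets the accelerator shrink data at wire speed before it ever hits flash.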

  19. A More Complex Example: Graph Analytics
      GraFBoost combines external analytics with hardware acceleration
      [Charts: host memory (GB), thread count, and power (watts) for conventional analytics systems versus GraFBoost (external analytics + hardware acceleration)]
