Efficient Storage Strategies: Memory-Mapped I/O Insights

This research examines memory-mapped I/O for high-speed storage, covering trends in storage devices, storage caching techniques, key-value stores, and the advantages of memory-mapped I/O. It explores redesigning the I/O path to optimize performance for small and large I/O operations, emphasizing that cache hits can be handled entirely in hardware. The study examines how user-space and kernel-space cache systems affect storage access, offering insights into system call reduction and storage access optimization. The findings suggest that memory-mapped I/O can improve storage efficiency, especially for read-intensive workloads and write-optimized data structures.

  • Storage Strategies
  • Memory-Mapped I/O
  • Storage Caching
  • Key-Value Stores
  • Storage Efficiency


Presentation Transcript


  1. Memory-mapped I/O for Fast Storage
  Anastasios Papagiannis, Computer Science Department, University of Crete
  Advisor: Prof. Angelos Bilas
  Public PhD Defense, 22 Oct 2020

  2. New storage device trends, new problems
  • Rotational disks (HDDs): require large I/Os for peak device throughput; CPU overhead is negligible compared to device accesses
  • Fast storage (Flash, NVMe): millions of IOPS, < 10 µs access latency; small I/Os still reach peak device throughput given a large queue depth; more IOPS result in higher CPU overhead
  • Today the bottleneck is the host I/O path, not the device
  • There is a need and an opportunity to redesign the I/O path, with different design tradeoffs for small and large I/Os

  3. Storage caching and system calls
  • A storage cache plus read/write system calls reduces accesses to the device
  • Kernel-space cache: requires system calls even for hits; holds raw (serialized) blocks
  • User-space cache: lookups for hits plus system calls for misses; holds application-specific (deserialized) data
  • A user-space cache removes system calls for hits, but the lookups introduce software overheads [SIGMOD 08]; a minimal sketch of this pattern follows below
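A minimal sketch of the user-space caching pattern described above, assuming a simple direct-mapped block cache: a hit is served by a software lookup only, while a miss falls through to a single pread() system call. The cache layout and function names are illustrative assumptions, not part of the thesis.

```c
/* Illustrative user-space block cache: a hit is a software lookup only,
 * a miss falls through to a pread() system call. Layout and names are
 * assumptions made for this sketch. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE  4096
#define CACHE_SLOTS 1024

struct cache_slot {
    uint64_t block_no;              /* which device block this slot holds */
    int      valid;
    char     data[BLOCK_SIZE];
};

static struct cache_slot cache[CACHE_SLOTS];

/* Read one block: no system call on a hit, exactly one pread() on a miss. */
int cached_read(int fd, uint64_t block_no, char *out)
{
    struct cache_slot *slot = &cache[block_no % CACHE_SLOTS];

    if (slot->valid && slot->block_no == block_no) {   /* hit: lookup only */
        memcpy(out, slot->data, BLOCK_SIZE);
        return 0;
    }
    /* miss: cross into the kernel once to fetch the block */
    if (pread(fd, slot->data, BLOCK_SIZE, (off_t)(block_no * BLOCK_SIZE)) != BLOCK_SIZE)
        return -1;
    slot->block_no = block_no;
    slot->valid = 1;
    memcpy(out, slot->data, BLOCK_SIZE);
    return 0;
}
```

A kernel-space cache would instead pay a system call on every access, hit or miss; this scheme trades that for the software lookup cost and the DRAM the user-space cache occupies.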

  4. Key-value stores
  • An important building block: a dictionary for arbitrary key-value pairs
  • Used extensively: web indexing, social networks, data analytics, IoT
  • Support inserts, deletes, point queries (lookups), and range queries (scans)
  • Workloads are read intensive, with large bursts of writes, hence write-optimized data structures
  • Key-value stores mainly use a user-space storage cache, with read/write system calls for misses

  5. Memory-mapped I/O (MMIO)
  • In memory-mapped I/O, a file is mapped into the virtual address space
  • Data is accessed with load/store processor instructions
  • The kernel fetches/evicts pages on demand
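A minimal sketch of this access model: the file is mapped with mmap(), bytes are read and written with plain loads and stores, and the kernel pages data in and out on demand. The file name and sizes are assumptions chosen only for illustration.

```c
/* Minimal memory-mapped I/O example: access file data with loads/stores;
 * the kernel fetches and evicts pages on demand. Assumes a non-empty file. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);           /* illustrative file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file; touching an unmapped page triggers a page fault
     * and the kernel reads the backing block in. */
    char *map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    char first = map[0];          /* load: read the first byte of the file */
    map[0] = first;               /* store: dirties the page; written back later */

    msync(map, st.st_size, MS_SYNC);   /* force writeback of dirty pages */
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```

Here a hit is just the load or store: the MMU and TLB resolve the translation with no software lookup and no system call, which is the advantage the next slide quantifies.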

  6. Memory-mapped I/O: Advantages
  • Hits are handled entirely in hardware: virtual-to-physical address translation by the MMU and TLB
  • Lower overhead compared to software storage cache lookups: a kernel- or user-space storage cache incurs a cost even for hits
  • Memory-mapped I/O trivially removes the cost of software lookups

  7. Memory-mapped I/O: Today
  • Main use is to map executables and shared libraries during process initialization
  • These are small, read-mostly mappings: no evictions/writebacks in the common path
  • High degree of sharing in mappings: many processes map the same library/executable
  • Memory-mapped I/O and data-intensive applications: MonetDB, MongoDB, and specific applications (e.g. graph processing)

  8. Memory-mapped I/O: Disadvantages
  • Requires a page fault instead of a system call for cache misses; page faults are more expensive than system calls
  • The 4KB page granularity increases CPU overhead in the I/O path: many small and random I/Os
  • With fast storage devices, the host I/O path overhead dominates
  • 2MB and 1GB pages are not commonly used for storage due to I/O amplification (an illustrative calculation follows below)
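To make the amplification argument concrete, a rough back-of-the-envelope calculation; the 64-byte update size is an assumption chosen only for illustration:

\[
\text{write amplification} = \frac{\text{page size}}{\text{bytes modified}}, \qquad
\frac{4\,\text{KB}}{64\,\text{B}} = 64\times, \qquad
\frac{2\,\text{MB}}{64\,\text{B}} = 32{,}768\times, \qquad
\frac{1\,\text{GB}}{64\,\text{B}} = 2^{24} \approx 1.7\times 10^{7}\times
\]

With huge pages, a tiny dirty update can force megabytes or gigabytes of writeback, which is why 4KB remains the practical granularity despite its per-page CPU cost.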

  9. Contributions
  • MMIO Potential: We design Tucana, a persistent key-value store that uses memory-mapped I/O, and show the pros and cons of this approach [USENIX ATC 16]
  • MMIO Customization: We enable applications to define custom eviction policies over memory-mapped I/O and show their applicability in Kreon, a memory-mapped key-value store [ACM SoCC 18, ACM TOS 20]
  • MMIO Scalability: We show that the Linux memory-mapped I/O path fails to scale with an increasing number of user threads, and we design FastMap, a scalable memory-mapped I/O path [USENIX ATC 20]
  • MMIO Overhead: We show that page faults have increased overheads and propose Aquila, which places the application in a privileged but protected domain, removing the need for protection domain switches [Under review]

  10. Outline
  • Introduction
  • Contributions
  • MMIO: Potential & customization (~7 min)
  • MMIO: Scaling with #threads (~15 min)
  • MMIO: Reducing single-thread overhead (~15 min)
  • Conclusions

  11. Tucana: A memory-mapped key-value store
  • Today, key-value stores are inefficient: they consume a lot of CPU cycles and are mostly optimized for HDDs (the right decision until today)
  • Tucana [USENIX ATC 16] uses a write-optimized data structure that absorbs bursty writes
  • Uses memory-mapped I/O for device access, instead of read/write system calls and a user-space cache
  • Applies memory-mapped I/O specific optimizations: no need for serialization/deserialization
  • Improves both CPU efficiency and throughput

  12. MMIO performance degrades with evictions
  • No control over evictions, unlike an application-specific storage cache: the kernel lazily cleans up memory to avoid starving other processes, and bursty writes lead to unpredictable tail latency
  • The fixed eviction policy does not fit key-value indexes (multi-level data structures) well
  • The 4KB page size leads to small I/Os: the OS fails to combine buffers, resulting in a bad I/O pattern
  • Synchronizing memory with storage is blocking, which increases tail latency
  • Solution: co-design the key-value store and the memory-mapped I/O path, yielding the Kreon write-optimized key-value store with kmmap

  13. Write-optimized key-value stores
  • Most popular: the LSM-Tree, a multi-level structure
  • Buffers writes and uses a single I/O to the device; data is kept in large containers for large/sequential I/O
  • I/O amortization across multiple levels, with L0 kept in memory; data moves between levels in large batches; each level is N times the size of the previous one
  • Great for HDDs! However, it requires compactions (sorting and merging) to move data to lower levels, which costs high CPU processing and I/O amplification
  [Figure: LSM-Tree levels L0 (memory) through Ln (disk), connected by compactions]
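A toy sketch of the leveled layout described above: each level has a capacity N times the previous one, and when a level fills up its contents are compacted (merged) into the next level. The constants, counters, and function names are illustrative assumptions; the actual sort/merge work is elided.

```c
/* Illustrative LSM-style leveling: level i holds GROWTH^i times the L0
 * capacity; a full level is compacted (merged) into the next one. */
#include <stddef.h>
#include <stdio.h>

#define NUM_LEVELS  5
#define GROWTH      10              /* each level is N=10x the previous */
#define L0_CAPACITY 1000            /* entries that fit in the in-memory L0 */

static size_t level_size[NUM_LEVELS];       /* current entries per level */

static size_t level_capacity(int level)
{
    size_t cap = L0_CAPACITY;
    for (int i = 0; i < level; i++)
        cap *= GROWTH;
    return cap;
}

/* Compaction: merge a full level into the next one (sort/merge elided). */
static void compact(int level)
{
    if (level + 1 >= NUM_LEVELS)
        return;                             /* last level: nowhere to go */
    printf("compacting L%d (%zu entries) into L%d\n",
           level, level_size[level], level + 1);
    level_size[level + 1] += level_size[level];   /* data rewritten below */
    level_size[level] = 0;
    if (level_size[level + 1] >= level_capacity(level + 1))
        compact(level + 1);                 /* cascading compactions */
}

/* An insert goes to L0; when L0 fills, compaction cascades downwards. */
static void insert(void)
{
    level_size[0]++;
    if (level_size[0] >= level_capacity(0))
        compact(0);
}

int main(void)
{
    for (int i = 0; i < 250000; i++)
        insert();
    for (int l = 0; l < NUM_LEVELS; l++)
        printf("L%d: %zu entries\n", l, level_size[l]);
    return 0;
}
```

The cascading compactions are exactly where the CPU cost and I/O amplification mentioned on the slide come from: the same data is rewritten once per level it passes through.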

  14. Kreon key-value store
  • Multiple levels for I/O amortization
  • Keeps a separate B-Tree index per level; spills replace sorting and merging, which increases I/O randomness but reduces CPU cycles, and the randomness is not an issue with fast storage devices
  • Reduces I/O amplification: key-value pairs are kept in a single append-only log, and only metadata moves between levels with spills (see the sketch below)
  • Uses copy-on-write for persistence
  • Uses kmmap to overcome the MMIO issues above
  [Figure: Kreon levels L0 (memory) through Ln (disk) connected by spills, with key-value pairs kept in a separate log]
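A toy sketch of the append-only value log this slide describes: key-value pairs are written once to a log and identified by their offset, so a spill only needs to copy (key, offset) metadata to the next level's index. The record layout and names are assumptions, not Kreon's on-device format.

```c
/* Illustrative append-only value log: key-value pairs are appended once and
 * the index keeps only (key, log offset) metadata, so spills move metadata
 * rather than data. Names and layout are assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LOG_SIZE (1u << 20)

static char     log_area[LOG_SIZE];   /* stands in for the mapped log file */
static uint64_t log_tail;             /* next free offset in the log */

/* Append one key-value pair; return its offset for the per-level index. */
uint64_t log_append(const void *key, uint32_t key_len,
                    const void *val, uint32_t val_len)
{
    uint64_t off = log_tail;
    uint64_t need = sizeof key_len + sizeof val_len + key_len + val_len;

    if (off + need > LOG_SIZE)
        return UINT64_MAX;                        /* log full (no GC here) */

    memcpy(log_area + off, &key_len, sizeof key_len);  off += sizeof key_len;
    memcpy(log_area + off, &val_len, sizeof val_len);  off += sizeof val_len;
    memcpy(log_area + off, key, key_len);              off += key_len;
    memcpy(log_area + off, val, val_len);              off += val_len;

    uint64_t record_off = log_tail;
    log_tail = off;
    return record_off;
}

int main(void)
{
    /* The per-level index (not shown) stores this offset; a spill copies
     * the offset to the next level's index while the value stays in the log. */
    uint64_t off = log_append("key1", 4, "value1", 6);
    printf("appended at offset %llu, log tail at %llu\n",
           (unsigned long long)off, (unsigned long long)log_tail);
    return 0;
}
```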

  15. kmmap: Resolving Linux MMIO issues
  • Per-page priorities let users define the hot/cold parts of the dataset; pages with lower priority are evicted first
  • In Kreon: L0 gets the highest priority, then the index nodes of the other levels, and finally the log
  • Eliminates redundant reads for newly allocated pages
  • Provides a non-blocking msync: copy-on-write means there are no in-place updates, and removing blocking reduces tail latency
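The transcript does not show kmmap's interface, so the following is only a hypothetical sketch of how a key-value store might tag memory regions with eviction priorities over a custom memory-mapped I/O path; every identifier here (mmio_set_priority, the priority enum) is an assumption. The ordering follows the slide: L0 hottest, then index nodes, then the log.

```c
/* Hypothetical priority tagging over a custom mmap path; kmmap's real
 * interface is not shown in the transcript, so all names are assumptions. */
#include <stddef.h>

enum mmio_priority {            /* lower value = hotter = evicted last */
    PRIO_L0_BUFFER  = 0,        /* highest priority: in-memory level L0 */
    PRIO_INDEX_NODE = 1,        /* index nodes of the on-device levels */
    PRIO_VALUE_LOG  = 2         /* lowest priority: append-only log */
};

/* Assumed control call exposed by a custom memory-mapped I/O path. */
int mmio_set_priority(void *addr, size_t len, enum mmio_priority prio);

void tag_regions(void *l0, size_t l0_len,
                 void *index, size_t index_len,
                 void *log, size_t log_len)
{
    /* Evict the log first, then index nodes, and keep L0 resident. */
    mmio_set_priority(l0,    l0_len,    PRIO_L0_BUFFER);
    mmio_set_priority(index, index_len, PRIO_INDEX_NODE);
    mmio_set_priority(log,   log_len,   PRIO_VALUE_LOG);
}
```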

  16. Kreon improvement over RocksDB
  • (a) Efficiency (cycles/op): 2.7x (small) and 3.4x (large) on average; up to 95% fewer cycles for the cache and up to 98% fewer cycles for spills
  • (b) Throughput (ops/sec): 2.8x (small) and 4.7x (large) on average

  17. Kreon tail latency
  [Figure: tail latency comparison, showing up to a 99x difference]

  18. Summary
  • Kreon is a write-optimized key-value store with a custom MMIO path
  • Kreon reduces CPU cycles due to MMIO and spills
  • Kreon reduces I/O amplification due to metadata-only movement: up to 4.9x less I/O
  • Scalability issues remain when increasing the number of threads

  19. Outline
  • Introduction
  • Contributions
  • MMIO: Potential & customization (~7 min)
  • MMIO: Scaling with #threads (~15 min)
  • MMIO: Reducing single-thread overhead (~15 min)
  • Conclusions

  20. Linux memory-mapped I/O path scalability
  • Microbenchmark: null_blk device, 4TB dataset, 192GB DRAM cache, queue depth 27
  [Figure: million page-faults/s (IOPS) vs. #threads (1 to 32) for Linux-Read and Linux-Write; both curves plateau early, at roughly 2M and 1.3M IOPS]

  21. Scaling the memory-mapped I/O path
  • FastMap [USENIX ATC 20]: a custom memory-mapped I/O path inside the Linux kernel, from the user to the device
  • Achieves higher scalability and I/O concurrency
  • Avoids all centralized contention points
  • Reduces CPU processing in the common path
  • Uses dedicated data structures to minimize interference among processes

  22. Memory-mapped I/O path scalability
  • Same microbenchmark: null_blk device, 4TB dataset, 192GB DRAM cache
  [Figure: million page-faults/s (IOPS) vs. #threads (1 to 32) for Linux and FastMap, reads and writes; FastMap improves scalability by 3x for reads and 6x for writes]

  23. Outline
  • Introduction
  • Contributions
  • MMIO: Potential & customization (~7 min)
  • MMIO: Scaling with #threads (~15 min): Design, Experimental Analysis
  • MMIO: Reducing single-thread overhead (~15 min)
  • Conclusions

  24. FastMap design
  FastMap is based on 3 main design decisions:
  • Separate the data structures that keep clean and dirty pages
  • Optimize reverse mappings to reduce CPU processing
  • Use a scalable DRAM cache to reduce latency variability

  25. Linux design
  [Figure: a VMA points to an address_space whose page_tree (radix_tree) holds the cached pages and is protected by a single tree_lock]
  • 126x contended lock acquisitions and 155x more wait time
  • tree_lock is acquired for 2 main reasons: inserting/removing elements from the page_tree, and modifying tags for a specific entry (used to mark a page dirty)

  26. FastMap design
  [Figure: a VMA points to a PFD, which holds per-CPU page_trees (radix trees, cpu0..cpuN) and per-CPU dirty_trees (rb-trees, cpu0..cpuN)]
  • Keep dirty pages in a separate data structure, so marking a page dirty/clean does not serialize insert/remove operations
  • Choose one of the multiple data structures based on page_offset % num_cpus
  • Page trees keep ALL cached pages; dirty trees keep ONLY dirty pages, sorted by device offset in a red-black tree
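The partitioning idea can be illustrated with a small sketch: cached pages go into one of several per-CPU trees selected by page_offset % num_cpus, and dirty pages are tracked in separate per-CPU trees, so marking a page dirty does not serialize with inserts and removals of clean pages. The structure names follow the slide (page_tree, dirty_tree, PFD), but the types, locking, and helpers are illustrative assumptions, not the kernel code.

```c
/* Sketch of FastMap's partitioning: cached pages are spread over per-CPU
 * trees chosen by page_offset % num_cpus, and dirty pages live in separate
 * per-CPU trees. Types, locks, and helpers are assumptions. */
#include <pthread.h>
#include <stdint.h>

#define NUM_CPUS 32

struct page_tree  { pthread_mutex_t lock; /* radix tree of all cached pages */ };
struct dirty_tree { pthread_mutex_t lock; /* rb-tree of dirty pages, sorted by device offset */ };

struct pfd {                              /* per-file structure (PFD on the slide) */
    struct page_tree  pages[NUM_CPUS];
    struct dirty_tree dirty[NUM_CPUS];
};

void pfd_init(struct pfd *f)
{
    for (int i = 0; i < NUM_CPUS; i++) {
        pthread_mutex_init(&f->pages[i].lock, NULL);
        pthread_mutex_init(&f->dirty[i].lock, NULL);
    }
}

/* Pick the partition that owns a given page offset. */
static inline unsigned partition(uint64_t page_offset)
{
    return (unsigned)(page_offset % NUM_CPUS);
}

/* Insert a newly cached page: touches only one per-CPU page tree. */
void cache_insert(struct pfd *f, uint64_t page_offset, void *page)
{
    struct page_tree *t = &f->pages[partition(page_offset)];
    pthread_mutex_lock(&t->lock);
    /* radix-tree insert of (page_offset -> page) elided */
    (void)page;
    pthread_mutex_unlock(&t->lock);
}

/* Mark a page dirty: uses the separate dirty tree of the same partition,
 * so it does not contend with inserts/removals of clean pages. */
void cache_mark_dirty(struct pfd *f, uint64_t page_offset, void *page)
{
    struct dirty_tree *d = &f->dirty[partition(page_offset)];
    pthread_mutex_lock(&d->lock);
    /* rb-tree insert of (page_offset -> page), kept sorted by offset, elided */
    (void)page;
    pthread_mutex_unlock(&d->lock);
}
```

Keeping the dirty set sorted by device offset also lets writeback issue large, mostly sequential I/Os, in contrast to the single tree_lock design on the previous slide.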

  27. FastMap design
  FastMap is based on 3 main design decisions:
  • Separates the data structures that keep clean and dirty pages
  • Optimizes reverse mappings to reduce CPU processing
  • Uses a scalable DRAM cache to reduce latency variability

  28. Reverse mappings
  • Provide a way to find out which PTEs map a specific page
  • Needed for page eviction due to memory pressure or explicit writeback, and for destroying mappings (munmap)
  • Linux uses object-based reverse mappings

  29. Linux object-based reverse mappings
  [Figure: each page keeps a _mapcount; the address_space's i_mmap, protected by a read/write semaphore, links the VMAs and their page tables (PGDs) that may map the page]
  • Executables and libraries (e.g. libc) introduce a large amount of sharing; object-based reverse mappings reduce DRAM consumption and housekeeping costs
  • However, _mapcount can still result in useless page table traversals
  • The rw-semaphore is acquired as a reader on all operations: cross NUMA-node traffic and many CPU cycles spent

  30. FastMap uses full reverse mappings
  • Storage over MMIO involves minimal sharing, so full reverse mappings are affordable
  • Full reverse mappings: minimal CPU overhead and an efficient munmap
  • No ordering required, and updates are scalable
  [Figure: each page holds its own (VMA, vaddr) entries; a single list per physical core]
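A sketch of the full reverse-mapping idea: each cached page keeps an explicit list of (VMA, virtual address) entries, so eviction, writeback, and munmap can walk exactly the mappings that exist instead of traversing page tables. The structures below are illustrative assumptions; FastMap's single list per physical core is not modeled here.

```c
/* Illustrative full reverse mapping: each cached page keeps an explicit
 * list of (VMA, virtual address) entries, so eviction/writeback can find
 * and clear every mapping of the page without scanning page tables.
 * All types and helpers are assumptions, not the kernel's structures. */
#include <stdint.h>
#include <stdlib.h>

struct vma;                          /* stand-in for a memory mapping */

struct rmap_entry {                  /* one mapping of the page */
    struct vma        *vma;
    uintptr_t          vaddr;
    struct rmap_entry *next;
};

struct cached_page {
    struct rmap_entry *rmap;         /* full reverse-mapping list */
    /* ... page data, flags ... */
};

/* Record a new mapping when a page fault installs a PTE. */
int rmap_add(struct cached_page *pg, struct vma *vma, uintptr_t vaddr)
{
    struct rmap_entry *e = malloc(sizeof *e);
    if (!e)
        return -1;
    e->vma = vma;
    e->vaddr = vaddr;
    e->next = pg->rmap;
    pg->rmap = e;
    return 0;
}

/* Eviction/writeback/munmap: walk exactly the mappings that exist. */
void rmap_unmap_all(struct cached_page *pg,
                    void (*clear_pte)(struct vma *vma, uintptr_t vaddr))
{
    struct rmap_entry *e = pg->rmap;
    while (e) {
        struct rmap_entry *next = e->next;
        clear_pte(e->vma, e->vaddr);     /* invalidate the translation */
        free(e);
        e = next;
    }
    pg->rmap = NULL;
}
```

The per-page list costs a little memory per mapping, which is acceptable precisely because storage mappings over MMIO have minimal sharing, unlike executables and shared libraries.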

  31. FastMap design
  FastMap is based on 3 main design decisions:
  • Separates the data structures that keep clean and dirty pages
  • Optimizes reverse mappings to reduce CPU processing
  • Uses a scalable DRAM cache to reduce latency variability
  Plus optimizations for eviction/writeback operations and implementation details

  32. Outline
  • Introduction
  • Contributions
  • MMIO: Potential & customization (~7 min)
  • MMIO: Scaling with #threads (~15 min): Design, Experimental Analysis
  • MMIO: Reducing single-thread overhead (~15 min)
  • Conclusions

  33. Testbed
  • Intel Xeon processors: 2x 32 hyper-threads and 4x 80 hyper-threads; plus RCU in VMA management (Bonsai [ASPLOS 12])
  • Different devices: Intel Optane SSD DC P4800X (375GB) for workloads, null_blk for microbenchmarks
  • Kreon persistent key-value store, which accesses storage devices with memory-mapped I/O
  • 256 GB of DDR4 DRAM
  • CentOS v7.3 with Linux 4.14.72

  34. Linux scalability
  • Microbenchmark: null_blk device, 4TB dataset, 192GB DRAM cache
  [Figure: million page-faults/s (IOPS) vs. #threads (1, 10, 20, 40, 80) for Linux-Read and Linux-Write]

  35. FastMap scalability
  • Same microbenchmark: null_blk device, 4TB dataset, 192GB DRAM cache
  [Figure: million page-faults/s (IOPS) vs. #threads (1, 10, 20, 40, 80) for Linux and FastMap, reads and writes; up to 11.89x improvement]

  36. FastMap execution time breakdown
  • Microbenchmark with 32 threads
  [Figure: execution time (ksamples) broken down into mark_dirty, address-space, page-fault, and other, for mmap-Read, FastMap-Read, mmap-Write, and FastMap-Write]

  37. Kreon + YCSB, 100% inserts
  [Figure: Kreon throughput with 100% inserts; 3.2x improvement]

  38. Kreon + YCSB, 100% lookups
  [Figure: Kreon throughput with 100% lookups; 1.56x improvement]

  39. Summary
  • Increased scalability for page faults and I/O concurrency, which is important for reaching peak device throughput on fast storage
  • Almost no improvement for single-thread performance: a page fault has higher overhead than a system call
  • The hardware page fault mechanism is expensive: an exception plus a trap from user to kernel space for protection purposes

  40. Outline
  • Introduction
  • Contributions
  • MMIO: Potential & customization (~7 min)
  • MMIO: Scaling with #threads (~15 min)
  • MMIO: Reducing single-thread overhead (~15 min)
  • Conclusions

  41. Cache management with system calls

  42. Cache management with MMIO
  The operations are:
  • Op1 (a) [HIT]: data access (load/store)
  • Op1 (b) [MISS]: virtual address space manipulation (page faults)
  • Op2: DRAM cache management (lookups/evictions/writebacks)
  • Op3: data transfers (device I/O)
  • Op4: file/device mappings (mmap/munmap)
  • Op5: physical memory management (dynamic resizing)
  Today, all of these operations occur in the OS

  43. The miss path over MMIO is expensive
  • Op1: a page fault requires 18.5% more cycles than a system call for a 4KB I/O of the same size
  • Op2: the handler (plus DRAM cache) accounts for 18%; even on a DRAM cache hit (no I/O) the cost is 2724 cycles (1.13 µs), comparable to the access latency of fast storage devices
  • Op3: device I/O accounts for 49% (PMEM emulated with DRAM)
  • Exception & trap: 24%
  [Figure: cycle breakdown of a Linux page fault into DRAM access, TLB miss, exception + trap, I/O, and page fault handler]

  44. Linux MMIO in x86
  [Figure: x86 protection rings; applications that use MMIO run in ring 3, the OS in ring 0 (most privileged), rings 1 and 2 are unused; a page fault causes an exception + trap from ring 3 into ring 0]
  • Page faults are a specific type of hardware exception, occurring for invalid translations in the page table
  • They trap because applications run in ring 3
  • Page fault in ring 3 (exception + trap): 1287 cycles (536ns)
  • Page fault in ring 0 (exception only): 552 cycles (230ns)
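The two measurements above already bound the possible gain from removing the trap; the ratio matches the 2.33x figure quoted on the later Aquila slide:

\[
\frac{1287\ \text{cycles}}{552\ \text{cycles}} \approx 2.33, \qquad
1287 - 552 = 735\ \text{cycles} \approx 306\ \text{ns saved per page fault.}
\]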

  45. Aquila library OS
  • Today: page faults are handled in ring 0 while the application runs in ring 3 for protection; as a result, all operations (1b) through (5) are expensive
  • Aquila uses ring 0 for performance and non-root mode for protection
  • Page faults in ring 0 incur a lower cost, and non-root ring 0 still provides strong protection
  • To achieve this we use hardware virtualization extensions (Intel VT-x)

  46. Intel VT-x: Hardware Virtualization
  [Figure: in non-root mode, a guest application in ring 3 traps to the guest OS in ring 0 (most privileged); a vmexit switches to root mode, where a host application in ring 3 traps to the host OS in ring 0]

  47. Running applications in non-root ring 0
  [Figure: the host application runs in non-root ring 0, where common path operations are handled; uncommon path operations cross into root mode and reach the host OS in ring 0]

  48. Our observation: common vs. uncommon operations
  Common path operations happen for every miss, in ring 0:
  1. Virtual address space manipulation (page faults)
  2. DRAM cache management (lookups/evictions/writebacks)
  3. Data transfers (device I/O)
  Uncommon path operations happen at a lower frequency, with a vmexit:
  4. File/device mappings (mmap/munmap)
  5. Physical memory management (dynamic resizing)

  49. Op1: Trap-less virtual memory manipulation
  [Figure: with Aquila, the user application and the Aquila library OS both run in non-root ring 0 above the Linux kernel (root ring 0) and the devices; in stock Linux, the user application runs in root ring 3 above the kernel]

  50. Op1: Trap-less VM manipulation
  [Figure: a page fault costs 552 cycles (230ns) for a user application in non-root ring 0 with Aquila, versus 1287 cycles (536ns) for a user application in root ring 3 over the Linux kernel: a 2.33x difference]
