
Optimizing I/O Caching Efficiency with Memory-Mapped I/O Techniques
Explore how memory-mapped I/O improves storage caching efficiency by reducing software overhead. Learn about Aquila, a novel mmio path that allows customizing DRAM I/O cache policies and device access to significantly reduce I/O overhead.
Presentation Transcript
Memory-Mapped I/O on Steroids
Anastasios Papagiannis¹·², Manolis Marazakis¹, and Angelos Bilas¹·²
¹Foundation for Research and Technology Hellas (FORTH), ²University of Crete
EuroSys 2021
The necessity of storage caching
- The majority of storage devices are block-addressable, and due to capacity/cost this will not change
- DRAM I/O caching is used for performance and granularity reasons
- Emerging fast storage devices aggravate caching issues: the hit path requires expensive software cache lookups, and many additional CPU cycles are spent on cache management
- One size does not fit all: applications use specific techniques to reduce small random writes, and separate read and write paths to limit cache pollution
- Memory-mapped I/O has the potential to solve these issues
Memory-mapped I/O
- In memory-mapped I/O (mmio), a file is mapped into the virtual address space; data is accessed with load/store processor instructions, and the kernel fetches/evicts pages on demand
- With mmio, hits are handled in hardware (MMU + TLB), with less overhead than software cache lookups
- Misses require a page fault instead of a system call
- The 4KB page size results in small, random I/Os
- Linux mmio does not allow customization; consequently, users prefer explicit I/O (system calls) over mmio
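The mechanism on this slide can be illustrated with a short sketch (Python's mmap module here, as a stand-in for the raw load/store instructions a C program would issue against the mapping): accesses to the mapped region are plain memory reads and writes, and the kernel pages data in on demand.

```python
import mmap
import os
import tempfile

# A small temporary file stands in for a storage-backed file.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * 4096)  # one 4KB page

m = mmap.mmap(fd, 4096, access=mmap.ACCESS_WRITE)
# A "store": writing through the mapping is a plain memory access; the
# first touch of the page triggers a page fault and the kernel maps it in.
m[0:5] = b"hello"
# A "load": an ordinary memory read (an MMU/TLB hit after the first
# fault), with no per-access system call.
data = bytes(m[0:5])
m.close()
os.close(fd)
os.unlink(path)
```

After the initial fault, every subsequent access to the page is resolved by the MMU and TLB; this is the hardware hit path the slide contrasts with software cache lookups.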
I/O caching with system calls
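For contrast, the explicit-I/O path this slide refers to can be sketched as a user-space page cache over read system calls. The names below (BlockCache and so on) are our illustration, not code from the paper; note that every access, hit or miss, pays for a software lookup.

```python
import os
import tempfile
from collections import OrderedDict

PAGE = 4096

class BlockCache:
    """Minimal user-space I/O cache over read system calls (a sketch:
    a real cache also tracks dirty pages, does writeback, etc.)."""
    def __init__(self, fd, capacity_pages):
        self.fd = fd
        self.capacity = capacity_pages
        self.pages = OrderedDict()  # page number -> bytes, in LRU order
        self.hits = self.misses = 0

    def read(self, offset, length):
        # The sketch assumes the request does not cross a page boundary.
        page_no = offset // PAGE
        if page_no in self.pages:            # software lookup on EVERY access
            self.hits += 1
            self.pages.move_to_end(page_no)  # maintain LRU order
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict least recently used
            # Miss: one system call (pread) per 4KB page
            self.pages[page_no] = os.pread(self.fd, PAGE, page_no * PAGE)
        start = offset % PAGE
        return self.pages[page_no][start:start + length]

# Demo against a temporary file (stand-in for a storage device).
tmp_fd, tmp_path = tempfile.mkstemp()
os.write(tmp_fd, bytes(range(256)) * 32)   # 8KB = two pages
cache = BlockCache(tmp_fd, capacity_pages=8)
first = cache.read(0, 4)    # miss: issues a pread system call
again = cache.read(0, 4)    # hit: served from DRAM, but still a lookup
os.close(tmp_fd)
os.unlink(tmp_path)
```

Even the hit path here runs cache-lookup code on the CPU; with mmio, the equivalent hit is resolved entirely by the MMU and TLB.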
Our goal
(Figure: comparison of a kernel-space cache, a user-space cache, Linux mmap, and Aquila in terms of hit cost, miss cost, and customization)
Aquila: a novel mmio path
- Collocates the application and mmio in a high-privilege domain (non-root ring 0)
- Requires minimal application modifications and provides strong protection
- Allows customizing the DRAM I/O cache, its policies, and device access
- Significantly reduces I/O overhead
Outline
- Motivation
- Aquila design: operations in mmio caches, optimizing common-path operations, support for uncommon-path operations
- Experimental analysis
- Conclusions
Main operations in mmio caches
- Op1(a) [HIT]: data access (load/store)
- Op1(b) [MISS]: virtual address space manipulation (page faults)
- Op2: DRAM cache management (lookups/evictions/writebacks)
- Op3: data transfers (device I/O)
- Op4: file/device mappings (mmap/munmap)
- Op5: physical memory management (dynamic resizing)
Today, all of these operations occur in the OS.
Miss path over mmio is expensive
- Op1: a page fault requires 18.5% more cycles than a system call for the same 4KB I/O
- Op2: the page fault handler plus DRAM cache management accounts for 18% of cycles; even on a DRAM cache hit (no I/O), the miss path costs about 1.13 μs, comparable to the access latency of fast storage devices
- Op3: device I/O accounts for 49% of cycles (emulated PMEM backed by DRAM); exception and trap handling accounts for another 24%
(Figure: cycle breakdown of a 4KB miss under Linux mmio: DRAM access, TLB miss, exception + trap, I/O, and page fault handler)
Linux mmio in x86
- Applications with mmio run in ring 3 (least privileged) while the OS runs in ring 0 (most privileged); rings 1 and 2 are unused
- Page faults are a specific type of hardware exception, occurring on invalid translations in the page table
- Because applications run in ring 3, a page fault also traps into ring 0
- Page fault in ring 3 (exception + trap): 1287 cycles (536ns)
- Page fault in ring 0 (exception only): 552 cycles (230ns)
Aquila library OS
- Today: page faults are handled in ring 0, while applications run in ring 3 for protection; as a result, all operations 1(b) through 5 are expensive
- Aquila uses ring 0 for performance and non-root mode for protection
- Page faults in ring 0 incur lower cost, and non-root ring 0 still provides strong protection
- To achieve this, Aquila uses hardware virtualization extensions (Intel VT-x)
Intel VT-x: hardware virtualization
(Figure: with Intel VT-x, the guest OS runs in non-root ring 0 with guest applications in non-root ring 3, while the host OS runs in root ring 0 with host applications in root ring 3; traps move between rings 3 and 0 within a mode, and vmexits cross from non-root to root mode)
Common vs. uncommon operations
- Common-path operations happen on every miss and are handled in ring 0: (1) virtual address space manipulation (page faults); (2) DRAM cache management (lookups/evictions/writebacks); (3) data transfers (device I/O)
- Uncommon-path operations happen at lower frequency and are handled with a vmexit: (4) file/device mappings (mmap/munmap); (5) physical memory management (dynamic resizing)
Running applications in non-root ring 0
(Figure: in Aquila, the application runs in non-root ring 0, where common-path operations are handled directly; uncommon-path operations exit to the host OS in root ring 0)
Op1: trap-less virtual memory manipulation
(Figure: in Aquila, the user application and the Aquila library OS both run in non-root ring 0, so a page fault is handled without a trap, in 552 cycles (230ns); in Linux, the user application runs in root ring 3 and a page fault traps into the kernel in root ring 0, costing 1287 cycles (536ns), a 2.33x difference)
Op2: DRAM cache management
- Builds on ideas from kmmap [ACM SoCC '18] and FastMap [USENIX ATC '20]
- Separate structures for clean and dirty pages, with scalable page insert/remove and marking as clean/dirty
- Approximation of LRU for evictions/writebacks
- Scalable NUMA-aware page allocator that tries to allocate each page on the local NUMA node
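The cache-management ideas above can be sketched with a second-chance (clock) scan, a common approximation of LRU, combined with separate clean and dirty sets. The structure below is our illustration under those assumptions, not Aquila's actual code.

```python
from collections import deque

class MmioCache:
    """Sketch of Op2: clean/dirty separation plus a second-chance (clock)
    approximation of LRU. Illustrative only."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.clean = {}       # page -> referenced bit
        self.dirty = {}       # page -> referenced bit
        self.clock = deque()  # pages in insertion order for the clock hand
        self.writebacks = []  # dirty victims (Op3 would do the actual I/O)

    def access(self, page, write=False):
        if page in self.clean or page in self.dirty:
            # Hit: set the referenced bit (hardware would set it in the PTE);
            # a first store moves the page from the clean to the dirty set.
            if write and page in self.clean:
                del self.clean[page]
                self.dirty[page] = True
            else:
                (self.dirty if page in self.dirty else self.clean)[page] = True
            return
        if len(self.clean) + len(self.dirty) >= self.capacity:
            self._evict_one()
        (self.dirty if write else self.clean)[page] = False
        self.clock.append(page)

    def _evict_one(self):
        # Second-chance sweep: clear referenced bits until an unreferenced
        # victim is found; dirty victims are written back before eviction.
        while True:
            page = self.clock.popleft()
            table = self.dirty if page in self.dirty else self.clean
            if table[page]:            # referenced: give it a second chance
                table[page] = False
                self.clock.append(page)
            else:
                if page in self.dirty:
                    self.writebacks.append(page)
                del table[page]
                return

cache = MmioCache(capacity=2)
cache.access(1)               # clean miss
cache.access(2, write=True)   # dirty miss
cache.access(1)               # hit: sets the referenced bit on page 1
cache.access(3)               # full: page 1 gets a second chance, page 2
                              # is the victim and is written back first
```

Keeping clean and dirty pages apart lets eviction prefer clean victims (no I/O) and lets writeback batch dirty ones, which is the point of the separation on this slide.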
Op3: device I/O
- Device access from non-root ring 0 normally requires the host OS, at increased cost (a vmexit instead of a trap)
- Aquila provides direct device access from non-root ring 0; this requires dedicated devices, which is common for DBMSs and key-value stores
- Block-addressable (PCIe-attached) devices: use SPDK to map device configuration registers into the library OS
- Byte-addressable (DIMM-attached) devices: leverage DAX to map them directly into the physical address space of the library OS
- Both paths bypass the host operating system
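For the byte-addressable path, a hedged sketch of what DAX-style access looks like from user code, with an ordinary temporary file standing in for a real device node such as /dev/dax0.0: once the region is mapped, data moves with loads and stores, and only the mapping setup involves the OS.

```python
import mmap
import os
import tempfile

# Stand-in for a DAX device or DAX-mounted file; with real DAX hardware
# the same mmap gives direct load/store access to persistent memory.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 2 * 4096)

pmem = mmap.mmap(fd, 2 * 4096, access=mmap.ACCESS_WRITE)
pmem[4096:4096 + 3] = b"key"  # a store into the second "device" page
pmem.flush()                  # persistence point (msync/cache-flush analogue)
readback = bytes(pmem[4096:4096 + 3])
pmem.close()
os.close(fd)
os.unlink(path)
```

The key property this illustrates is that after setup no system call sits on the data path; on real persistent memory, explicit flushes (as above) mark the points where stores become durable.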
Uncommon path & implementation
- Uncommon-path operations: Op4 (file/device mappings) and Op5 (physical memory management)
- Implementation builds on process virtualization [Dune, OSDI '12], with specific optimizations (e.g., for running applications in ring 0)
- More details in the paper
Outline
- Motivation
- Aquila design
- Experimental analysis
- Conclusions
Testbed
- 2x Intel Xeon E5-2630 v3 CPUs (2.4GHz), 32 hyper-threads
- Devices used in the workloads: Intel Optane SSD DC P4800X (375GB) and an emulated pmem device backed by DRAM
- 256 GB of DDR4 DRAM, used for memory and for pmem emulation
- CentOS 7.3 with Linux 4.14.72
Workloads
- Microbenchmarks
- Storage applications: RocksDB with (a) a user-space cache plus read/write system calls (direct I/O) and (b) memory-mapped I/O under Linux and Aquila; driven by YCSB with a 32GB dataset and an 8GB DRAM cache
- Extending available DRAM over fast storage devices: Ligra [PPoPP '13] with malloc/free over an mmap-ed device, running the BFS algorithm on a 64GB dataset with an 8GB DRAM cache
Aquila vs. explicit read/write I/O calls
(Figure: results comparing Aquila with explicit read/write I/O across workloads; reported improvements of 1.6x, 10.4x, 1.4x, 7.2x, 1.2x, and 7.5x)
Reducing I/O cache overhead
(Figure: Aquila spends up to 2.58x fewer CPU cycles on caching; reported reductions of 69% and 43%)
Reduced overhead of Aquila vs. Linux mmap
- Microbenchmark: loads at random offsets, page faults with I/O, emulated PMEM backed by DRAM
- Aquila reduces overhead by 2.33x
Extending the application heap
- Ligra + BFS
- We reduce system and idle time by up to 8.31x
(Figure: reported improvements of 1.56x, 4.14x, and 2.54x)
Conclusions
- Aquila is a novel mmio path that reduces overheads and allows for cache customization
- It achieves this by collocating the application and mmio in non-root ring 0: a fast miss path, custom cache management, and custom device access
- We evaluate Aquila for storage cache management and for extending the application heap
- RocksDB: up to 40% higher throughput and up to 2.58x fewer CPU cycles for caching
- Ligra: up to 4x lower execution time compared to Linux mmap
Memory-Mapped I/O on Steroids
Anastasios Papagiannis
Foundation for Research and Technology Hellas (FORTH) & University of Crete
email: apapag@ics.forth.gr