
Intelligent Data Orchestration for Network-Driven Servers
Explore the evolution of the cache hierarchy and the growing pressure on the server memory subsystem, which require a re-examination of Direct Cache Access (DCA) technology and motivate Intelligent Data Direct I/O (IDIO). Discover how network-driven MLC prefetching and selective direct DRAM access optimize inbound data placement.
Presentation Transcript
IDIO: Network-Driven, Inbound Network Data Orchestration on Server Processors
Mohammad Alian, Siddharth Agarwal, Jongmin Shin, Neel Patel, Yifan Yuan, Daehoon Kim, Ren Wang, Nam Sung Kim
Connectivity in Future Datacenter Servers
Trends: NICs with Tbps connectivity, plus the evolution of the cache hierarchy.
Implication: high pressure on the memory subsystem, requiring a re-examination of Direct Cache Access (DCA) technology for network data.
[Image: Ayar Labs' TeraPHY Optical I/O Multi-Chip Package]
Direct Cache Access (DCA)*
With DCA, the NIC DMA-writes inbound network data over PCIe directly into a reserved subset of LLC ways (the DDIO ways) instead of into DRAM.
[Diagram: NIC connected over PCIe to the CPU; packet data lands in the LLC's DDIO ways, with DDR DRAM below]
* Or DDIO, for Intel's implementation.
Cache Hierarchy Evolution
Broadwell microarchitecture: 256KB private MLC per core; inclusive shared LLC, 2.5MB per core.
[Diagram: NIC attached over PCIe; cores with private MLCs above an inclusive LLC; DDR DRAM below]
Cache Hierarchy Evolution
Skylake-X/SP microarchitecture: 1MB private MLC per core; non-inclusive shared LLC, 1.375MB per core, with dedicated DDIO ways. DCA is still the same!
[Diagram: NIC attached over PCIe; cores with 1MB private MLCs above a non-inclusive LLC with DDIO ways; DDR DRAM below]
Our Work: Intelligent Placement of RX Network Data in the Memory Hierarchy
Three synergistic mechanisms:
Self-invalidating I/O buffer
Network-driven MLC prefetching
Selective Direct DRAM Access (DDA)
[Diagram: IDIO logic steering inbound PCIe writes among the private MLCs, the LLC's IDIO ways, and DRAM (DDA) in the non-inclusive hierarchy]
Outline
Background & Motivation
  Demystifying network applications
  Movement of RX network data in a non-inclusive cache hierarchy
  Observations
    o Useless MLC writebacks
    o DMA bloating phenomenon
    o Sensitivity to network rate
Intelligent Data Direct IO (IDIO)
  Self-invalidating I/O buffers
  Network-driven MLC prefetching
  Selective direct DRAM access
Experimental results
Conclusion
Inclusive vs. Non-Inclusive Caches
In an inclusive cache, every line held in a private MLC is also present in the shared LLC; in a non-inclusive cache (simplified), a line may live in the MLC or the LLC without being duplicated in both.
[Diagram: side-by-side inclusive and non-inclusive hierarchies, each with private MLCs over a shared LLC]
Data Movement in a Non-Inclusive Cache Hierarchy, Running a Run-to-Completion SW Stack
[Diagram: the NIC PCIe-writes RX data, which is write-allocated into the LLC's DDIO ways; the core, polling the RX descriptor ring, demands the data for processing and pulls it into its private MLC; a later MLC conflict writes the line back into the non-DDIO LLC ways, and an LLC conflict writes it back to DRAM (a "DMA leak")]
A DPDK-style sketch of this loop follows.
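To make the flow concrete, here is a minimal run-to-completion loop in the spirit of the L2Fwd application named later in the talk. Port and queue setup are omitted, and the comments map loop phases to the cache events in the diagram; this is an illustrative sketch, not the authors' code.

```c
/* Sketch of a DPDK run-to-completion loop; setup omitted. */
#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_mbuf.h>

#define BURST 32

static void l2fwd_loop(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        /* By the time descriptors are marked done, the NIC has already
         * PCIe-written these packets into the LLC's DDIO ways. */
        uint16_t n = rte_eth_rx_burst(rx_port, 0, pkts, BURST);

        for (uint16_t i = 0; i < n; i++) {
            /* The core's demand read pulls each packet out of the
             * DDIO ways into its private MLC for processing. */
            struct rte_ether_hdr *eth =
                rte_pktmbuf_mtod(pkts[i], struct rte_ether_hdr *);
            (void)eth; /* header rewrite elided */
        }

        /* After the buffers are consumed, conflict evictions push the
         * now-useless lines from the MLC into non-DDIO LLC ways and
         * eventually to DRAM: the writebacks IDIO targets. */
        uint16_t sent = rte_eth_tx_burst(tx_port, 0, pkts, n);
        for (uint16_t i = sent; i < n; i++)
            rte_pktmbuf_free(pkts[i]);
    }
}
```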
Data Movement in a Non-Inclusive Cache Hierarchy, Running a Run-to-Completion SW Stack
[Diagram: after the core consumes a DMA buffer, its lines sit in the private MLC and non-DDIO LLC ways; the next PCIe write reusing that buffer in the RX descriptor ring updates those copies (write-update), forcing MLC writebacks]
Observation: the non-inclusive cache hierarchy breaks the isolation between I/O and application data.
Observation: unnecessary MLC writebacks of consumed DMA buffers.
Summary of Observations & Proposed Solution
The current DCA implementation has three shortcomings in a non-inclusive hierarchy:
  Inefficient use of the large MLC
  High rates of (useless) writebacks from MLC to LLC
  Broken isolation between application and network data
Proposed solution: Intelligent Data Direct I/O (IDIO), three synergistic mechanisms:
  Network-driven MLC prefetching (when to prefetch?)
  Self-invalidating I/O buffer
  Selective direct DRAM access
MLC and LLC Writebacks: Sensitivity to Network Rate
[Plot: MLC writebacks, LLC writebacks, and network rate over time; left y-axis: million writebacks per second, right y-axis: network BW (Gbps), x-axis: time (ms). The RX ring buffer size is 1024; receiving a burst of 1024 packets spikes both writeback rates. During a burst of network data it is beneficial to prefetch to the MLC.]
A hypothetical burst-detection sketch follows.
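The paper answers "when to prefetch?" with network-driven logic in hardware; as a rough software illustration of a rate-based trigger, the sketch below counts inbound DMA writes per fixed window and flags a burst when the count crosses a threshold. The window length, threshold, and all names here are assumptions, not the paper's exact mechanism.

```c
/* Hypothetical rate-based burst detector for "when to prefetch". */
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_NS        1000  /* 1us observation window (assumed) */
#define BURST_THRESHOLD  64    /* DMA writes per window (assumed)  */

struct burst_detector {
    uint64_t window_start_ns;  /* start of the current window */
    uint32_t dma_writes;       /* DMA writes seen in the window */
    bool     burst;            /* burst flag driving MLC prefetch */
};

/* Conceptually invoked on every inbound DMA write. */
static void on_dma_write(struct burst_detector *d, uint64_t now_ns)
{
    if (now_ns - d->window_start_ns >= WINDOW_NS) {
        /* Close the window: flag a burst if the rate was high. */
        d->burst = (d->dma_writes >= BURST_THRESHOLD);
        d->window_start_ns = now_ns;
        d->dma_writes = 0;
    }
    d->dma_writes++;
}
```

When `burst` is set, the hardware analogue would start pushing newly written RX lines from the DDIO ways into the consuming core's MLC, matching the "beneficial to prefetch" region of the plot.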
Self-Invalidating I/O Buffers + MLC Prefetch
[Diagram: a burst is detected; the NIC's PCIe writes are write-allocated into the LLC's DDIO ways and then prefetched into the private MLC ahead of the core's demand; once the core consumes a DMA buffer, the buffer self-invalidates instead of lingering as a dirty copy]
MLC writebacks are eliminated, and the DDIO space is effectively extended into the private MLC.
Requirements (details in the paper)
Timely invalidation to ensure correctness:
  Software knows when a DMA buffer is consumed
  It executes cache invalidation instructions to invalidate the DMA buffer (a sketch follows)
  Examples:
    o Data Cache Invalidate by Modified Virtual Address (DCIMVAC) instruction in ARMv7
    o Data Cache Block Invalidate (DCBI) in PowerPC
Moderate MLC prefetching:
  MLC pressure is calculated over a 1µs interval
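As a minimal sketch of the software side, the routine below invalidates every line of a consumed DMA buffer using the ARMv7 DCIMVAC instruction named above. It assumes a 64-byte cache line and privileged (kernel-mode) execution, since DCIMVAC is not available from user space; the function name and its call site in a driver are hypothetical.

```c
/* Self-invalidating a consumed DMA buffer on ARMv7 (sketch). */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size */

/* Invalidate every cache line backing [buf, buf+len) once software
 * has finished consuming the DMA buffer, so no dirty copy remains
 * to be written back when the NIC reuses the buffer. */
static void dma_buf_self_invalidate(void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)buf + len;

    for (; p < end; p += CACHE_LINE) {
        /* DCIMVAC: Data Cache line Invalidate by MVA to PoC. */
        asm volatile("mcr p15, 0, %0, c7, c6, 1" :: "r"(p) : "memory");
    }
    asm volatile("dsb" ::: "memory");  /* wait for invalidations */
}
```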
Methodology
Applications: DPDK micro network applications, run to completion (Touch-Drop and L2Fwd; a Touch-Drop sketch follows)
Motivational results from real HW:
  2 nodes, each with an Intel Xeon Gold 6242 CPU (1MB MLC, 36MB LLC, 2 DDIO ways) and 96GiB DDR4-3200 DRAM
  Nvidia/Mellanox ConnectX-5 dual 100Gb Ethernet NIC
Implementation of IDIO in gem5:
  2 full-system nodes running Linux (kernel 5.4.0)
  Modeled after a high-performance O3 CPU with a non-inclusive cache hierarchy
  Userspace networking enabled by running DPDK!
  100Gbps Ethernet connectivity
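For context on the workload, a Touch-Drop style microbenchmark reads the start of each received packet and frees it without forwarding, which maximizes RX-side cache pressure. The sketch below uses standard DPDK calls, but its structure is an assumption about what such a benchmark typically looks like, not the authors' exact code.

```c
/* Touch-Drop style DPDK microbenchmark (sketch; setup omitted). */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

static void touch_drop_loop(uint16_t port)
{
    struct rte_mbuf *pkts[BURST];

    for (;;) {
        uint16_t n = rte_eth_rx_burst(port, 0, pkts, BURST);
        for (uint16_t i = 0; i < n; i++) {
            /* "Touch": read the first byte of the payload, forcing a
             * demand access to the DMA-written cache line. */
            volatile uint8_t *data =
                rte_pktmbuf_mtod(pkts[i], volatile uint8_t *);
            (void)data[0];
            /* "Drop": free the mbuf without forwarding it. */
            rte_pktmbuf_free(pkts[i]);
        }
    }
}
```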
DDIO vs. Inv vs. MLC Prefetch vs. IDIO
[Chart: million transactions per second (MTPS) for the Touch-Drop application, comparing the baseline (DDIO), self-invalidate only, MLC prefetch only, and full IDIO]
Steady Load
[Chart: MTPS for the Touch-Drop application under steady load, baseline (DDIO) vs. IDIO]
Conclusion
IDIO: Network-Driven, Inbound Network Data Orchestration on Server Processors
Mohammad Alian, Siddharth Agarwal, Jongmin Shin, Neel Patel, Yifan Yuan, Daehoon Kim, Ren Wang, Nam Sung Kim
Three key shortcomings of the current DCA implementation in a non-inclusive cache hierarchy:
  Inefficient use of the large MLC
  High rates of writebacks from MLC to LLC
  Broken isolation between application and network data
Proposed solution: Intelligent Data Direct I/O (IDIO), three synergistic mechanisms:
  Self-invalidating I/O buffer
  Network-driven MLC prefetching
  Selective direct DRAM access
Evaluation using gem5, capable of running DPDK at 100Gbps+ bandwidth:
  Data movement reduction: 84% fewer MLC and LLC writebacks
  LLC isolation: 22% performance improvement in a co-running scenario
  Tail latency: 38% reduction in 99th-percentile latency