Optimizing Network Performance: TCP Acceleration and Kernel Bypass Strategies


This presentation examines why RPCs over Linux TCP are expensive in datacenter applications such as key-value stores, reviews kernel-bypass and RDMA alternatives and their trade-offs, and introduces TAS, a TCP acceleration service that keeps the sockets API and operator policy enforcement while delivering much higher performance.

  • Networking
  • Performance
  • Datacenter
  • Optimization
  • Kernel Bypass


Presentation Transcript


  1. TAS: TCP Acceleration as an OS Service. Antoine Kaufmann1, Tim Stamler2, Simon Peter2, Naveen Kr. Sharma3, Arvind Krishnamurthy3, Thomas Anderson3. 1MPI-SWS, 2The University of Texas at Austin, 3University of Washington.

  2. RPCs are Essential in the Datacenter. Remote procedure calls (RPCs) are a common building block for datacenter applications. Scenario: an efficient key-value store in a datacenter. 1. Low tail latency is crucial. 2. Thousands of connections per machine. 3. Both the application writer and the datacenter operator want the full feature set of TCP: a) developers want the convenience of sockets and in-order delivery; b) operators want policy enforcement and strong flexibility.

  3. You might want to simply go with Linux. Linux provides the features we want (sockets, in-order delivery, policy enforcement, flexibility), but at what cost? For a 256B RPC request/response over Linux TCP with 250 application cycles per RPC, a simple KVS model spends 8,300 total CPU cycles per RPC: app processing 3%, kernel processing 97%. We're only doing a small amount of useful computation!
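
To make the measured path concrete, here is a minimal sketch (my illustration, not code from the talk) of the kind of blocking sockets RPC loop such a measurement assumes: every 256-byte request is read with recv() and answered with send(), so each RPC crosses the kernel at least twice via system calls, which is where the 97% kernel share of the cycle budget goes.

    /* Minimal sketch of a blocking sockets RPC echo loop (illustration only,
     * not TAS or benchmark code). Each RPC costs at least one recv() and one
     * send() system call on the standard Linux TCP stack. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define RPC_SIZE 256   /* request/response size used in the measurement */

    void serve_connection(int fd)
    {
        char buf[RPC_SIZE];

        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), MSG_WAITALL);   /* syscall 1 */
            if (n <= 0)
                break;                         /* peer closed or error */

            /* the ~250 cycles of application work (e.g. a key-value lookup)
             * would happen here */

            if (send(fd, buf, n, 0) != n)                           /* syscall 2 */
                break;
        }
        close(fd);
    }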

  4. Why is Linux slow? Application and kernel co-location leads to system call and cache pollution overheads. Executing the entire TCP state machine makes for a complicated data path. Connection state spread over multiple cache lines gives poor cache efficiency and does not scale.

  5. Why not kernel bypass? The NIC interface is optimized; the bottlenecks are in the OS. Arrakis (OSDI '14), mTCP (NSDI '14), StackMap (ATC '16): do network processing in userspace and expose the NIC interface to the application through hardware I/O virtualization, avoiding OS overheads and allowing a specialized stack. But operators have to trust application code and have little flexibility to change or update the network stack.

  6. Why not RDMA? Remote Direct Memory Access: the interface provides one-sided and two-sided operations in NIC hardware; RPCs and sockets are implemented on top of these basic RDMA primitives, minimizing or bypassing CPU overhead. But we lose software protocol flexibility, it is a bad fit for many-to-many RPCs, and RDMA congestion control (DCQCN) doesn't work well at scale.

  7. TAS: TCP Acceleration as an OS Service. An open source, drop-in, highly efficient RPC acceleration service. No additional NIC hardware required. Compatible with all applications that already use sockets. Operates as a userspace OS service using dedicated cores for packet processing. Leverages the benefits and flexibility of kernel bypass with better protection. TAS accelerates TCP processing for RPCs while providing all the desired features: sockets, in-order delivery, flexibility, policy enforcement.

  8. Why is Linux slow? Application and kernel co-location leads to system call and cache pollution overheads. Executing the entire TCP state machine makes for a complicated data path. Connection state spread over multiple cache lines gives poor cache efficiency and does not scale.

  9. How does TAS fix it? System call and cache pollution overheads: dedicate cores to the network stack. Complicated data path: separate a simple fast path from a slow path. Poor cache efficiency and scalability: minimize and localize connection state.

  10. TAS Overview. [Diagram: applications run on their own CPU cores; TAS dedicates separate cores to a slow path and a simple fast path, which sit between the applications and the NIC.]

  11. Dividing Functionality. The Linux kernel TCP stack handles everything itself: the socket API and locking, opening/closing connections, IP routing and ARP, firewalling and traffic shaping, congestion control, re-transmission timeouts, and the per-packet work of generating data segments, processing and sending ACKs, and flow control. TAS splits this functionality: the application library keeps the socket API and locking; the slow path handles per-connection work (open/close connections, IP routing, ARP, firewalling, traffic shaping, computing rates, re-transmission timeouts); the fast path handles per-packet work (generate data segments, process and send ACKs, flow control, apply rate limits).
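
As an illustration of what a per-packet fast path of this shape might look like, here is a hypothetical C sketch. All type, field, and helper names are invented for the example; the real TAS fast path differs in detail, but the pattern is the same: handle only the common in-order case, keep the touched state small, and hand anything unusual to the slow path.

    /* Hypothetical sketch of a TAS-style per-packet fast path (illustration
     * only, not the TAS sources). The fast path handles the common case and
     * punts exceptions (out-of-order data, timeouts, setup) to the slow path. */
    #include <stdint.h>

    struct tcp_seg { uint32_t seq; uint32_t len; const void *payload; };

    struct flowstate {              /* compact per-connection fast-path state */
        uint32_t rx_next_seq;       /* next expected sequence number */
        uint32_t tx_avail;          /* bytes queued by the application */
        uint32_t tx_window;         /* flow-control window */
        uint32_t rate_budget;       /* bytes allowed by the slow-path rate limit */
    };

    /* Helpers assumed to exist elsewhere in the fast path. */
    void punt_to_slowpath(struct flowstate *fs, const struct tcp_seg *seg);
    void copy_to_rx_buffer(struct flowstate *fs, const struct tcp_seg *seg);
    void send_ack(struct flowstate *fs);
    uint32_t emit_segment(struct flowstate *fs, uint32_t max_bytes);

    void fastpath_rx(struct flowstate *fs, const struct tcp_seg *seg)
    {
        if (seg->seq != fs->rx_next_seq) {   /* not the expected in-order segment */
            punt_to_slowpath(fs, seg);       /* slow path handles the exception */
            return;
        }
        copy_to_rx_buffer(fs, seg);          /* payload into per-connection buffer */
        fs->rx_next_seq += seg->len;
        send_ack(fs);                        /* process & send ACKs */
    }

    void fastpath_tx(struct flowstate *fs)
    {
        /* Generate data segments, limited by what is queued, by flow control,
         * and by the rate budget the slow path computed for this connection. */
        for (;;) {
            uint32_t cap = fs->tx_avail;
            if (fs->tx_window < cap)   cap = fs->tx_window;
            if (fs->rate_budget < cap) cap = fs->rate_budget;
            if (cap == 0)
                break;
            uint32_t sent = emit_segment(fs, cap);   /* sends at most cap bytes */
            if (sent == 0)
                break;
            fs->tx_avail -= sent;
            fs->tx_window -= sent;
            fs->rate_budget -= sent;
        }
    }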

  12. [Diagram: the application exchanges connection setup/teardown with the slow path and data packet payloads with the fast path; the fast path exchanges congestion statistics, retransmissions, and control packets with the slow path. Component responsibilities are as on the previous slide; the fast path keeps minimal connection state.]

  13. Minimal Connection State. For each connection the fast path keeps only: payload buffers, seq/ack numbers, remote IP/port, send rate or window, and congestion statistics. Only 2 cache lines per connection. [Diagram otherwise as on the previous slide.]
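
To make the two-cache-line figure concrete, here is a hypothetical C layout of such per-connection fast-path state. The field names, sizes, and layout are invented for illustration and do not match the actual TAS structures; the point is only that buffers, sequence/ACK numbers, addressing, rate/window, and congestion counters can plausibly be packed into two 64-byte lines.

    /* Hypothetical sketch of fast-path per-connection state packed into two
     * 64-byte cache lines (illustration only; not the real TAS layout). */
    #include <stdint.h>

    struct fastpath_flow {
        /* --- cache line 1: receive side and addressing --- */
        uint64_t rx_buf;         /* base address of RX payload buffer */
        uint32_t rx_buf_size;
        uint32_t rx_buf_pos;
        uint32_t rx_next_seq;    /* next expected sequence number */
        uint32_t rx_avail;       /* receive window to advertise (flow control) */
        uint32_t remote_ip;
        uint16_t remote_port;
        uint16_t local_port;
        uint8_t  _pad1[32];      /* reserves the rest of line 1 */

        /* --- cache line 2: transmit side and congestion statistics --- */
        uint64_t tx_buf;         /* base address of TX payload buffer */
        uint32_t tx_buf_size;
        uint32_t tx_buf_pos;
        uint32_t tx_next_seq;
        uint32_t tx_sent_unacked;
        uint32_t tx_rate;        /* send rate or window set by the slow path */
        uint32_t cc_acked_bytes; /* counters the slow path reads periodically */
        uint32_t cc_ecn_marks;
        uint32_t cc_rexmits;
        uint8_t  _pad2[24];      /* reserves the rest of line 2 */
    } __attribute__((aligned(64)));

    /* Build-time check that the layout really is two cache lines. */
    _Static_assert(sizeof(struct fastpath_flow) == 128, "fits in 2 cache lines");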

  14. [Diagram repeated: the fast path serves data packets to and from the application through the per-connection payload buffers, and forwards congestion statistics, retransmissions, and control packets to the slow path.]

  15. Congestion Control. Inspired by CCP (SIGCOMM '18): the congestion-control algorithm runs on the slow path and, per connection, periodically checks and updates the fast path's minimal connection state. Many CC algorithms can be implemented (described in the paper). [Diagram: the slow path on one core updates the connection state used by the fast path on another.]
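
As one plausible shape of that periodic loop, here is a hypothetical sketch using a simplified DCTCP-style rate update (real DCTCP keeps a moving average of the marked fraction, and TAS supports several algorithms). The accessor functions, the increase step, and the counters they read are assumptions for the example, roughly matching the congestion counters in the state sketch above.

    /* Hypothetical slow-path congestion-control interval (illustration only).
     * Reads per-connection counters maintained by the fast path and writes
     * back a new rate; a simplified DCTCP-style update, not TAS code. */
    #include <stdint.h>

    struct fastpath_flow;                                      /* see sketch above */
    uint32_t flow_read_reset_acked(struct fastpath_flow *f);   /* assumed accessors */
    uint32_t flow_read_reset_ecn(struct fastpath_flow *f);
    uint32_t flow_get_rate(struct fastpath_flow *f);
    void     flow_set_rate(struct fastpath_flow *f, uint32_t rate);

    void cc_interval(struct fastpath_flow **flows, int nflows)
    {
        for (int i = 0; i < nflows; i++) {
            struct fastpath_flow *f = flows[i];
            uint32_t acked = flow_read_reset_acked(f);   /* bytes ACKed this interval */
            uint32_t ecn   = flow_read_reset_ecn(f);     /* of those, ECN-marked */
            uint32_t rate  = flow_get_rate(f);

            if (acked == 0)
                continue;                                /* connection was idle */

            if (ecn == 0) {
                rate += 10000;                           /* additive increase (bytes/s) */
            } else {
                /* Multiplicative decrease proportional to the marked fraction:
                 * rate *= (1 - alpha / 2), with alpha = ecn / acked in Q16. */
                uint64_t alpha = ((uint64_t)ecn << 16) / acked;
                rate = (uint32_t)(((uint64_t)rate * ((1u << 16) - alpha / 2)) >> 16);
            }
            flow_set_rate(f, rate);                      /* fast path applies the new rate */
        }
    }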

  16. Workload Proportionality. [Diagram: under lighter load, fewer application and TAS cores are in use; the slow path and fast path cores sit between the applications and the NIC, and the fast path keeps the minimal connection state.]

  17. Workload Proportionality. TAS monitors CPU usage and adds or removes dedicated cores as load changes. [Diagram: under heavier load, more application cores and more fast-path cores are active.]
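
A hypothetical sketch of what such a monitor could look like; the utilization thresholds, function names, and adjustment policy are all invented for illustration, not taken from TAS.

    /* Hypothetical sketch of workload-proportional core scaling (illustration
     * only). A monitoring loop samples fast-path core utilization and grows
     * or shrinks the set of dedicated fast-path cores. */
    double fastpath_core_utilization(int core);   /* assumed: returns 0.0 .. 1.0 */
    void   fastpath_core_start(int core);         /* assumed control hooks */
    void   fastpath_core_stop(int core);

    void scale_fastpath_cores(int *active, int max_cores)
    {
        double total = 0.0;
        for (int c = 0; c < *active; c++)
            total += fastpath_core_utilization(c);
        double avg = total / *active;

        if (avg > 0.9 && *active < max_cores) {
            fastpath_core_start(*active);         /* add a core under high load */
            (*active)++;
        } else if (avg < 0.4 && *active > 1) {
            (*active)--;
            fastpath_core_stop(*active);          /* release a core when mostly idle */
        }
    }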

  18. Evaluation

  19. Evaluation Questions What is our throughput, latency, and scalability for RPCs? Do real applications scale with # of cores and have low tail latency? Do we distribute throughput fairly under network congestion? (See paper for more in-depth analysis)

  20. Systems for Comparison. We evaluate TAS against 3 other systems: 1. Linux: a) full kernel, trusted congestion control; b) sockets interface. 2. mTCP (not in this talk, see paper): a) pure kernel-bypass approach, untrusted congestion control. 3. IX: a) replaces Linux with an optimized data path, runs in privileged mode; b) uses batching to reduce overhead; c) no sockets interface; d) requires kernel modifications; untrusted congestion control.

  21. Experimental Setup. Intel Xeon Platinum 8160 CPU, 24 cores @ 2.10GHz, 196GB of RAM, Intel XL710 40Gb Ethernet Adapter. Benchmarks: single-direction RPC benchmark, RPC echo server, a scalable key-value store, connection throughput fairness under congestion.

  22. Linux vs TAS on RPCs (1 App Core). Single-direction RPC benchmark: 250-cycle application workload, 64 bytes as a realistic small RPC size, 32 RPCs per connection in flight. [Plots: RX and TX pipelined RPC throughput (Gbps) vs. message size (32 B to 2048 B) for Linux and TAS; annotated speedups of TAS over Linux range from about 4x to 12.4x, including 4.66x and 6x points.]

  23. Connection Scalability. 20-core RPC echo server, 64B requests/responses, single RPC per connection. [Plot: throughput (mOps) vs. number of connections (1 thousand to 96 thousand) for TAS, IX, and Linux; annotations: 20% and 2.2x.] Key factor: minimized connection state.

  24. Key-value Store. KVS throughput with increasing server cores and matching load (~2000 connections per core). [Plot: throughput (mOps) vs. server cores (2 to 16) for IX, Linux, TAS, and TAS LL.] IX and TAS provide ~6x speedup over Linux across all core counts. TAS has a 15-20% performance improvement over IX, even though IX provides no sockets interface. [Core allocations noted on the slide: TAS: 9 app cores, 7 TAS cores; TAS: 8 app cores, 8 TAS cores; IX, Linux: 16 app/stack cores.]

  25. Key-value Store Latency. KVS latency measured with a single application core at 15% server load. [Plot: CDF of latency (roughly 0 to 190 us) for four configurations: TAS server + TAS client, Linux server + TAS client, TAS server + Linux client, IX server + TAS client. Annotations: client median 2.88x, server median 5.78x.]

  26. Tail Latency. Why the long IX tail? Batching. IX has 50% higher latency at the 90th percentile, and latency is 20us (27%) higher at the 99.99th percentile. In addition, IX has a 2.3x higher maximum latency: TAS max 122us vs. IX max 280us. [Plots: latency CDFs for TAS server + TAS client and IX server + TAS client, with a zoomed view of the tail above the 99.9th percentile.]

  27. Fairness Under Incast. We want to see how TAS distributes throughput under congestion: an incast scenario with four 10G machines all sending to one 40G server. TAS on average maintains fair throughput, while Linux is unstable. [Plots: median and 99th-percentile per-connection throughput (MB/100ms, log scale) vs. number of flows (50 to 2000) for TAS and Linux, with the fair-share line; the Linux 99th-percentile throughput is all zeros.]

  28. Conclusion. TAS has the convenience and features of Linux, with better performance and stability. This is achieved by: 1. separating TCP packet processing into a fast path and a slow path; 2. minimizing connection state; 3. dedicating cores to the network stack. TAS is a purely software solution that is easy to deploy and operate. Try it yourself! https://github.com/tcp-acceleration-service
