Timeout-less Transport in Commodity Datacenter Networks

Datacenter networks suffer from packet drops caused by factors such as shallow buffers, microbursts, and reactive congestion control. This presentation examines the impact of TCP timeouts and retransmissions, the limits of aggressive timeout strategies, and existing solutions for mitigating the side effects of packet loss.

  • Datacenter networks
  • Packet drops
  • Timeouts
  • TCP
  • Aggressive timeouts

Uploaded on Feb 18, 2025



Presentation Transcript


  1. Toward Timeout-less Transport in Commodity Datacenter Networks Hwijoon Lim (KAIST), Wei Bai (Microsoft Research), Yibo Zhu (ByteDance), Youngmok Jung (KAIST), Dongsu Han (KAIST)

  2. Packet Drops in Datacenter Networks. In datacenter networks, we still suffer from packet drops: buffers are shallow (buffer per port per Gbps has shrunk to ~5.12 KB as link speeds grew from 1G to 100G*), microbursts are common (90% last less than 200 µs**), and reactive congestion controls need at least 1 RTT (tens of µs) to react. Some packet drops cannot be detected in a timely manner. (* Bai et al., 2017; ** Zhang et al., 2017)

  3. Impact of Timeouts. RTO (Retransmission Timeout) >> RTT (Round Trip Time): the RTT is tens of µs (< 100 µs), while the RTO is a few ms. (Diagram: the sender transmits #1-#3; #3 is dropped, and with no further ACKs the sender waits out a > 4 ms timeout before retransmitting #3.)

  4. Can we use aggressive timeouts? In TCP, the RTO is calculated as max(RTOmin, SRTT + 4 · RTTVAR). Can we just reduce RTOmin to a lower value, e.g. 200 µs*? (Charts: CDFs of RTT vs. RTO for foreground incast flows and for background flows; a large fraction of RTOs still far exceed the RTT.) Reducing RTOmin may not be enough because the RTT variation is very large. (* Vijay Vasudevan et al., 2009)
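The RTO formula on this slide is the standard RFC 6298 estimator. A minimal sketch of it below illustrates why the RTO stays far above the RTT when RTT variance is high; the function and parameter names are illustrative, not from the TLT paper.

```python
# Sketch of TCP's RTO computation (RFC 6298): RTO = max(RTOmin, SRTT + 4*RTTVAR).
# alpha/beta are the standard smoothing gains; rto_min defaults to 200 ms.

def make_rto_estimator(rto_min=0.2, alpha=1 / 8, beta=1 / 4):
    """Return a function that folds in one RTT sample (seconds) and returns the RTO."""
    state = {"srtt": None, "rttvar": None}

    def update(rtt_sample):
        if state["srtt"] is None:
            # First measurement (RFC 6298, section 2.2).
            state["srtt"] = rtt_sample
            state["rttvar"] = rtt_sample / 2
        else:
            state["rttvar"] = (1 - beta) * state["rttvar"] + beta * abs(state["srtt"] - rtt_sample)
            state["srtt"] = (1 - alpha) * state["srtt"] + alpha * rtt_sample
        return max(rto_min, state["srtt"] + 4 * state["rttvar"])

    return update
```

Even with RTOmin lowered to 200 µs as the slide suggests, a single RTT sample that jumps from 50 µs to 1 ms inflates RTTVAR enough to push the RTO past 1 ms, which is the slide's point: large RTT variation keeps the RTO well above typical RTTs.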

  5. Can we use aggressive timeouts? Can we use an aggressive static timeout (fixed RTO)? (Charts: number of timeouts, background average FCT, and foreground 99%-ile FCT, showing up to 2.1× and 51× degradation.) An aggressive RTO degrades throughput due to spurious retransmissions, and may cause undesirable interactions with network diagnosis (Arzani et al., 2016).

  6. Existing Solutions to Mitigate Side Effects of Loss. Novel switch support (NDP [SIGCOMM '17], CP (Cut Payload) [NSDI '14], FastLane [SoCC '15]) requires switch ASIC modification. Improved loss recovery (TLP (Tail Loss Probe) [SIGCOMM '13], IRN (Improved RoCE NIC) [SIGCOMM '18]) is protocol-specific and ineffective. A lossless network (PFC, Priority-based Flow Control) has many side effects.

  7. Existing Solutions to Mitigate Side Effects of Loss (cont.). Given these drawbacks: can we minimize the impact of packet loss in a general and deployment-friendly way?

  8. Existing Solutions to Mitigate Side Effects of Loss (cont.). TLT (Timeout-less Transport): a building block for various transport protocols, readily deployable using commodity switches.

  9. TLT: Key Observation. Not all packet drops have the same impact! (Diagrams: if packet #3, the last packet in flight, is dropped, no further ACKs arrive and the sender hits a timeout; if packet #2 is dropped but #3 still arrives, duplicate ACKs let the sender retransmit #2 and the flow recovers without a timeout.)

  10. TLT Design

  11. TLT Design Overview (1): Important Packet Selection at Hosts. We mark some packets as important and keep them lossless inside the network. (Diagram: the sender marks each packet important or unimportant before it enters the switch.)

  12. TLT Design Overview (2): Selective Dropping at Switches. When the queue exceeds a threshold (5 in the diagram), the switch drops unimportant packets first; important packets remain lossless inside the network.

  13. TLT: Key Challenges. 1. Which packets should be marked as important? Which packet losses may trigger a timeout? The number of important packets should be minimal. 2. How can the switch selectively drop unimportant packets? Queue separation would cause a severe out-of-order problem, so unimportant packets must be differentiated inside the same queue.

  14. TLT Design: Important Packet Selection at Hosts. Window-based transports (e.g. TCP, DCTCP, HPCC, IRN) keep the number of in-flight packets under the window limit; an ACK triggers a new packet transmission. Rate-based transports (e.g. RDMA with DCQCN) control how fast packets are injected into the network; a timer triggers new packet transmissions.

  15. TLT Design: Important Packet Selection at Hosts. Common rule: mark all control packets (e.g. ACKs) important. (Diagram: losing an ACK can stall the sender just as losing data can, so control packets must not be dropped.)

  16. TLT Design: Important Packet Selection at Hosts (recap of slide 14, now focusing on window-based transports).

  17. Challenge of Window-based Transport. Maintaining self-clocking is critical. (Diagram: #1 and #2 are ACKed, but #3 is dropped; no ACK returns to clock out new data, and the sender stalls until a timeout fires and #3 is retransmitted.)

  18. Key Idea of Important Packet Selection (Window-based). Always keep one in-flight important data packet. The important data packet and its ACK (the "important echo") are kept lossless; other data packets remain unimportant.

  19. Important Data Packet Preserves Self-clocking: guaranteed fast loss detection. (Diagram: after #1 and #2 are ACKed, the sender cannot yet tell whether the unimportant data packet #3 was lost.)

  20. Important Data Packet Preserves Self-clocking: guaranteed fast loss detection. (Diagram: a following important data packet #4 is never dropped, so its ACK always returns and reveals the loss of #3 in a timely manner.)

  21. Important Data Packet Preserves Self-clocking: guaranteed fast loss detection. Because the important packet and its echo are lossless, the sender can always tell whether a loss occurred: the important echo is a reliable indicator for loss detection.

  22. Preserving One In-flight Important Data Packet. What if the base transport does not allow further transmission when the sender receives an important echo (not enough window, or no new data)? The sender then fails to preserve one in-flight important data packet. To keep the important ACK-clocking, it retransmits part of a packet, marked as important.

  23. Preserving One In-flight Important Data Packet. Proactively retransmit a packet that is unACKed and was previously sent as unimportant (e.g. #2), marked as important. The size of the retransmitted packet is adapted to minimize the impact.
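The window-based marking rule of slides 18-23 can be sketched as a small sender-side state machine: keep exactly one important data packet in flight, and when an important echo arrives with no room for new data, proactively retransmit an unACKed previously-unimportant packet as important. Class and method names are illustrative, not from the paper's implementation.

```python
# Minimal sketch of TLT's window-based important-packet selection.
# Invariant: at most one important data packet is in flight at a time.

class TltSender:
    def __init__(self):
        self.important_in_flight = False
        self.unacked_unimportant = []   # sequence numbers sent as unimportant

    def on_send(self, seq):
        """Decide the mark for a fresh data packet."""
        if not self.important_in_flight:
            self.important_in_flight = True
            return "important"
        self.unacked_unimportant.append(seq)
        return "unimportant"

    def on_important_echo(self, can_send_new):
        """The important packet's echo (ACK) arrived; restore the invariant."""
        self.important_in_flight = False
        if can_send_new:
            return None                  # the next fresh packet will be marked important
        if self.unacked_unimportant:
            # No window or no new data: proactively retransmit an unACKed packet
            # (previously sent unimportant) as important to keep the ACK-clock.
            seq = self.unacked_unimportant.pop(0)
            self.important_in_flight = True
            return ("retransmit_important", seq)
        return None
```

For example, after sending #1 (important) and #2 (unimportant), an important echo arriving while the window is full causes #2 to be proactively retransmitted as important, preserving the self-clock.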

  24. TLT Design: Important Packet Selection at Hosts (recap of slide 14, now focusing on rate-based transports).

  25. Challenge of Rate-based Transport. There is no ACK-clocking in rate-based transports, so timeouts only happen in two cases: (1) the first retransmitted sequence gets lost (diagram: #10 is NACKed and retransmitted, but the retransmission of #10 is lost again), or (2) loss occurs at the end of the message (diagram: the last packet #99 is lost and no later packet can trigger a NACK).

  26. Key Idea of Important Packet Selection (Rate-based). 1. Mark the last packet of the flow (e.g. #99) as important, so loss at the end of the message is always detected. (Optional: mark every N-th packet as important for timely loss detection.)

  27. Important Packet Selection at Hosts (Rate-based). 2. Mark the first packet of a retransmission (e.g. the retransmitted #10) as important, so the retransmission itself cannot be lost again.
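The two rate-based rules of slides 26-27, plus the optional every-N marking, reduce to a simple predicate; a sketch below, with illustrative names (the paper's actual implementation lives in the NIC/transport layer).

```python
# Sketch of TLT's rate-based marking rules:
#   1. mark the last packet of the message important,
#   2. mark the first packet of a retransmission burst important,
#   3. optionally mark every N-th packet important for timely loss detection.

def mark_rate_based(seq, last_seq, is_first_retx, every_n=None):
    """Return True if this packet should be marked important."""
    if seq == last_seq:                  # rule 1: end of message
        return True
    if is_first_retx:                    # rule 2: first retransmitted packet
        return True
    if every_n and seq % every_n == 0:   # rule 3: optional periodic marking
        return True
    return False
```

With these rules, the two timeout cases of slide 25 (lost last packet, lost first retransmission) both involve an important, and therefore lossless, packet.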

  28. TLT Design: Selective Dropping at Switches. Limit the queue buildup of unimportant packets by color-aware dropping: hosts mark important packets DSCP green and unimportant packets DSCP red, and an ACL rule at the switch proactively drops red packets at ingress once the egress queue exceeds the red threshold (3 in the diagram).
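The switch's drop decision described on this slide can be sketched as a single admission check; the queue model here is deliberately simplified (one FIFO, lengths in packets) and is not the actual ACL/ASIC configuration.

```python
# Sketch of color-aware selective dropping: "red" (unimportant) packets are
# proactively dropped once the egress queue exceeds the red threshold, while
# "green" (important) packets are admitted up to the full buffer.

def admit(queue_len, color, red_threshold, buffer_size):
    """Return True to enqueue the arriving packet, False to drop it."""
    if queue_len >= buffer_size:
        return False                       # buffer full: drop regardless of color
    if color == "red" and queue_len >= red_threshold:
        return False                       # color-aware proactive drop
    return True
```

With a red threshold of 3, a red packet arriving to a queue of length 3 is dropped while a green packet is still admitted, which is exactly how important packets stay lossless without a separate queue (avoiding the out-of-order problem from slide 13).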

  29. Choosing the Color-aware Dropping Threshold. The threshold should be at least the maximum steady-state queue length of the base transport: for DCTCP, max(K_ECN, BDP); for DCQCN, the BDP. (Charts: DCTCP's queue oscillates around the ECN marking threshold K_ECN; DCQCN's queue stays below the BDP.)
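Reading the threshold rule off this slide, a short worked computation shows the magnitudes involved; the 40 Gbps / 10 µs figures are illustrative (taken from the evaluation setup's link speed), not prescribed values.

```python
# Sketch of the threshold choice from slide 29: the red threshold should
# exceed the base transport's steady-state queue, i.e. max(K_ECN, BDP) for
# DCTCP and the BDP for DCQCN. Units: bytes.

def bdp_bytes(link_gbps, rtt_us):
    """Bandwidth-delay product in bytes for a given link speed and RTT."""
    return int(link_gbps * 1e9 * rtt_us / (8 * 1e6))

def red_threshold(transport, link_gbps, rtt_us, k_ecn_bytes=0):
    bdp = bdp_bytes(link_gbps, rtt_us)
    if transport == "dctcp":
        return max(k_ecn_bytes, bdp)   # queue oscillates around K_ECN
    return bdp                         # e.g. DCQCN: steady-state queue under BDP
```

For a 40 Gbps link and a 10 µs RTT, the BDP is 50 KB, so with an ECN marking threshold above that, DCTCP's red threshold is simply K_ECN.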

  30. Evaluation

  31. Evaluation Setup. Testbed: a prototype of TLT implemented on the Mellanox Messaging Accelerator (VMA); 9 hosts with 40 Gbps NICs connected to a single switch (Broadcom Tomahawk); microbenchmarks plus application performance (TCP, DCTCP). Simulation: TCP, DCTCP, DCQCN, IRN, and HPCC simulated on ns-3; a 96-host leaf-spine topology with 40 Gbps links; realistic workloads (web search).

  32. Testbed Experiments: Incast. Compared Linux kernel TCP, VMA, VMA + 200 µs RTOmin, and VMA + TLT. (Charts: CDF of FCT for DCTCP flows, and tail 99%-ile FCT of DCTCP vs. the number of flows.) TLT reduces tail FCTs by up to 97.2% by effectively eliminating timeouts.

  33. Testbed Experiments: Application Performance. (Topology: an HTTP client querying NGINX web servers backed by a Redis node; chart: 99%-ile response time for DCTCP vs. DCTCP + TLT as the number of flows grows.) TLT keeps the variance in response time low by mitigating timeouts.

  34. Large-Scale Simulation: FCT (DCTCP). Compared Baseline, Baseline + 200 µs RTOmin, Baseline + TLP, Baseline + PFC, Baseline + TLT, and Baseline + PFC + TLT. (Chart: average background FCT vs. 99.9%-ile foreground FCT; lower left is better.) TLT reduces the 99.9%-ile FCT of foreground incast flows by up to 81%.

  35. Large-Scale Simulation: FCT (RDMA). Compared Baseline, Baseline + PFC, Baseline + TLT, and Baseline + PFC + TLT for HPCC, DCQCN + SACK, and DCQCN + IRN. (Charts: average background FCT vs. 99.9%-ile foreground FCT; lower left is better.) TLT reduces the 99.9%-ile FCT of foreground incast flows by up to 69%.

  36. Large-Scale Simulation: Timeouts / PFC. (Charts: number of retransmission timeouts and number of PFC PAUSE frames for DCTCP, DCQCN + SACK, DCQCN + IRN, and HPCC + SACK, across the same baselines.) TLT virtually eliminates timeouts and greatly reduces the number of PFC PAUSE frames.

  37. Conclusion. TLT is a building block for existing datacenter transports to eliminate congestion timeouts. TLT achieves timeoutlessness by making important packets lossless in the network. Code: https://github.com/kaist-ina/ns3-tlt-tcp-public
