Analysis and Recommendations for Telemetry Retrieval Inaccuracy
This study examines inaccuracies in telemetry retrieval on programmable switches, covering counter arrays, sketches, and counter retrieval by the control plane. It shows that read and reset delays cause counting errors and recommends three ways to reduce them.
Presentation Transcript
Telemetry Retrieval Inaccuracy in Programmable Switches: Analysis and Recommendations. Hun Namkung, Daehyeok Kim, Zaoxing Liu, Vyas Sekar, Peter Steenkiste. Carnegie Mellon University, Boston University, Microsoft Research.
Sketches on programmable switches are promising. [Diagram: a packet stream passes through a programmable switch (Tbps); sketches held in counter arrays produce telemetry results for the network operator, such as heavy hitter detection, entropy estimation, and the number of unique flows.]
Sketches on programmable switches generate inaccurate results
Average error rate (%) on count-min sketch, by counter array size:
Counter Array Size   16K     32K     64K
Expected             4.1%    1.3%    0.3%
Observed             7.1%    17.0%   34.8%
94x more error!
This inaccuracy problem also impacts other sketches!
Counter retrieval causes the inaccuracy problem. [Diagram: in each epoch (epoch1, epoch2, epoch3) the packet stream updates the counter arrays in the data plane, and the control plane retrieves the counters for that epoch.] This problem also impacts other telemetry tasks!
Read/Reset delays can cause counting errors. [Diagram: for each epoch the control plane reads and then resets the counter arrays while the packet stream keeps updating them in the data plane; the delays of these operations lead to both overcounting and undercounting.]
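The effect of read and reset delays can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not taken from the slides: a single counter instead of a counter array, one read followed by one reset per epoch, and hypothetical names (`retrieve`, `read_delay`, `reset_delay`). In this model, packets that arrive after an epoch ends but before the read are added to the wrong epoch, and packets that arrive between the read and the reset are wiped out.

```python
def retrieve(packet_times, epoch_len, n_epochs, read_delay, reset_delay):
    """Return the per-epoch counts a read-then-reset control plane would report.

    Toy model (assumption): one shared counter, integer timestamps, and
    epoch e covering the interval [(e - 1) * epoch_len, e * epoch_len).
    """
    events = []
    for e in range(1, n_epochs + 1):
        read_t = e * epoch_len + read_delay
        # The second tuple field breaks ties: read, then reset, then packets.
        events.append((read_t, 0, "read"))
        events.append((read_t + reset_delay, 1, "reset"))
    events += [(t, 2, "pkt") for t in packet_times]
    events.sort()

    counter = 0
    reported = []
    for _, _, kind in events:
        if kind == "pkt":
            counter += 1              # data plane update
        elif kind == "read":
            reported.append(counter)  # control plane snapshot
        else:
            counter = 0               # control plane reset
    return reported


# One packet per time unit, three epochs of 10 time units each.
ideal = retrieve(range(30), 10, 3, read_delay=0, reset_delay=0)
delayed = retrieve(range(30), 10, 3, read_delay=2, reset_delay=3)
print(ideal, delayed)
```

With zero delays every epoch reports its true count of 10. With the delays above, the first epoch is overcounted and the later ones are undercounted, even though the traffic is identical.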
Analysis reveals two major bottlenecks: the delay contributing to undercounting (99.9%) and the delay contributing to overcounting (1%). [Figure: delay breakdown over the counter arrays showing 36%, 62%, and 100%; the segment labels are garbled in the transcript.]
Sol 1. Use two sets of counter arrays. [Diagram: the data plane holds Set 1 and Set 2; across epoch1, epoch2, and epoch3 the packet stream updates one set while the control plane reads and resets the other.]
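The idea of two counter sets can be sketched as a double buffer. This is an illustration under assumptions, not the authors' switch implementation: the class and method names (`DoubleBufferedCounters`, `end_epoch`, `read_and_reset`) are hypothetical, and the epoch flip is modelled as an instantaneous index swap.

```python
class DoubleBufferedCounters:
    """Two counter arrays: one receives updates, the other is read and reset."""

    def __init__(self, size):
        self.sets = [[0] * size, [0] * size]
        self.active = 0  # index of the set the data plane currently updates

    def update(self, index, amount=1):
        # Data plane path: only the active set is ever written.
        self.sets[self.active][index] += amount

    def end_epoch(self):
        """Flip the active set and return the index of the now-idle set."""
        idle = self.active
        self.active = 1 - self.active
        return idle

    def read_and_reset(self, idle):
        # Control plane path: this may be slow, but it only touches the idle
        # set, so concurrent updates are neither lost nor double counted.
        snapshot = list(self.sets[idle])
        self.sets[idle] = [0] * len(snapshot)
        return snapshot


counters = DoubleBufferedCounters(4)
counters.update(1)
counters.update(1)
idle = counters.end_epoch()
counters.update(2)  # arrives while the control plane is still reading
epoch1 = counters.read_and_reset(idle)
epoch2 = counters.read_and_reset(counters.end_epoch())
print(epoch1, epoch2)
```

The cost of this design is memory: every counter array is allocated twice.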
Sol 2. No reset operation. [Diagram: the counter arrays are never cleared, so the values read by the control plane after epoch1, epoch2, and epoch3 are $C_1$, $C_1 + C_2$, and $C_1 + C_2 + C_3$.] The linear property enables $C_2 = (C_1 + C_2) - C_1$, where $C_i$ denotes the counts added during epoch $i$.
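Recovering per-epoch counters from cumulative reads amounts to an element-wise subtraction of consecutive snapshots. The following is a minimal sketch assuming counters that only accumulate sums and never overflow; the function name `per_epoch_counts` is hypothetical.

```python
def per_epoch_counts(cumulative_snapshots):
    """Turn cumulative counter snapshots into per-epoch counter arrays.

    Each snapshot is the full counter array as read at the end of an epoch,
    with no reset in between (assumption: counters never wrap around).
    """
    previous = [0] * len(cumulative_snapshots[0])
    per_epoch = []
    for snapshot in cumulative_snapshots:
        # Linearity: this epoch's contribution is the difference of two reads.
        per_epoch.append([now - before for now, before in zip(snapshot, previous)])
        previous = snapshot
    return per_epoch


reads = [[1, 0], [3, 2], [6, 2]]  # values read after epoch1, epoch2, epoch3
print(per_epoch_counts(reads))
```

This only works for sketches whose counters combine by addition; a structure that stores maxima or bit patterns cannot be split apart by subtraction.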
Trade-offs among independent solutions and guideline
[Figure: the three solutions compared on Generality, Delay Reduction, and Memory Efficiency (higher is better). Percentages shown: Sol 1. Use two sets of counter arrays, 100%; Sol 2. No reset operation, 99%; Sol 3. Defer buffer read and perform bulk reset, 95%.]
Guideline:
Are resources sufficient and/or is high accuracy required? If yes, use Sol 1.
If no: does the sketch satisfy the linear property? If yes, use Sol 2; if no, use Sol 3.
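The guideline can be written as a small decision function. This is one reading of the flattened decision tree, and an assumption to that extent: the yes/no branch assignment for Sol 2 and Sol 3 is inferred from the fact that Sol 2 relies on the linear property. The function and parameter names are hypothetical.

```python
def choose_solution(resources_or_accuracy_priority, sketch_is_linear):
    """Pick a solution following the guideline, as interpreted here.

    resources_or_accuracy_priority: True if resources are sufficient and/or
        high accuracy is required.
    sketch_is_linear: True if the sketch satisfies the linear property.
    """
    if resources_or_accuracy_priority:
        return "Sol 1"  # use two sets of counter arrays
    if sketch_is_linear:
        return "Sol 2"  # no reset operation
    return "Sol 3"      # defer buffer read and perform bulk reset


print(choose_solution(False, True))
```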
Evaluation Setup. A Tofino programmable switch whose control plane periodically reads and resets the data plane counters; a server replays CAIDA traces to the switch with tcpreplay. Sketches: Multi-Resolution Bitmap, HyperLogLog, Count-min Sketch, Count-Sketch, UnivMon.
Accuracy Improvement. [Figure: bar chart of Error Rate (%) (lower is better) for five sketches (Multi-resolution Bitmap, HyperLogLog, CountSketch, CountMin, UnivMon), comparing Expected, Unoptimized, Sol 1, Sol 2, and Sol 3. Labelled bar values range from 0% to 64.7%; the per-bar assignment is not recoverable from the transcript.]
Conclusion. Counter retrieval by the control plane generates inaccurate results for sketches on programmable switches. Our work: we analyze and quantify the inaccuracy, and we propose three solutions that eliminate almost all of it.