Network Failures in Data Centers: Insights from SIGCOMM 2011


An overview of network failures in data centers, based on the measurement study presented at SIGCOMM 2011 in Toronto: which components fail most often, what causes failures, how much impact they have on traffic, and how effective redundancy is at masking them, with the goal of improving network reliability.

  • Network failures
  • Data centers
  • SIGCOMM
  • Insights
  • Reliability


Presentation Transcript


  1. SIGCOMM 2011, Toronto, ON, Aug. 18, 2011. Understanding Network Failures in Data Centers: Measurement, Analysis and Implications. Phillipa Gill (University of Toronto), Navendu Jain and Nachiappan Nagappan (Microsoft Research).

  2. Motivation

  3. Motivation: downtime costs an estimated $5,600 per minute. We need to understand failures to prevent and mitigate them!

  4. Overview. Our goal: improve reliability by understanding network failures. 1. Failure characterization: which components are most failure prone, and what are the root causes? 2. What is the impact of failure? 3. Is redundancy effective? Our contribution: the first large-scale empirical study of network failures across multiple DCs; a methodology to extract failures from noisy data sources; correlating events with network traffic to estimate impact; and an analysis of the implications for future data center networks.

  5. Road Map. Motivation; Background & Methodology; Results: 1. Characterizing failures, 2. Do current network redundancy strategies help?; Conclusions.

  6. Data center networks overview. From the Internet down to the servers: access routers / network core fabric, load balancers, aggregation (Agg) switches, top-of-rack (ToR) switches, and servers.

  7. Data center networks overview. The questions we ask of this topology: Which components are most failure prone? What causes failures? What is the impact of failure? How effective is redundancy?

  8. Failure event information flow. A failure is logged in numerous data sources: network event logs (Syslog, SNMP traps and polling, e.g. "LINK DOWN!"), troubleshooting tickets (ticket ID, diary entries, root cause), and network traffic logs (five-minute traffic averages on links).
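
For concreteness, here is a minimal sketch of how these three data sources could be represented before being correlated. The field names are illustrative assumptions, not the actual schema used in the study.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical record shapes for the three data sources on slide 8.
# Field names are illustrative only.

@dataclass
class NetworkEvent:        # Syslog / SNMP traps and polling
    time: datetime
    device: str
    link: str
    message: str           # e.g., "LINK DOWN"

@dataclass
class TroubleTicket:       # troubleshooting system
    ticket_id: int
    opened: datetime
    diary: str             # operator diary entries
    root_cause: str

@dataclass
class TrafficSample:       # five-minute traffic averages per link
    time: datetime
    link: str
    bytes_avg: float
```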

  9. Data summary. One year of event logs, from Oct. 2009 to Sept. 2010: network event logs and troubleshooting tickets. The network event logs combine Syslog, SNMP traps, and polling. Caveat: they may miss some events, e.g. lost UDP notifications and correlated faults. The logs are filtered by operators down to actionable events, but still contain many warnings from various software daemons. Key challenge: how do we extract the failures of interest?

  10. Extracting failures from event logs. Defining failures: a device failure means the device is no longer forwarding traffic; a link failure means the connection between two interfaces is down, detected by monitoring interface state. Dealing with inconsistent data: for devices, correlate with link failures; for links, reconstruct state from the logged messages; then correlate with network traffic to determine impact.

  11. Reconstructing device state. Devices may send spurious DOWN messages, so we verify that at least one link on the device fails within five minutes (e.g., a DEVICE DOWN on an aggregation switch should be accompanied by a LINK DOWN on one of its links to a top-of-rack switch or a peer aggregation switch). This is conservative, to account for message loss during correlated failures. This sanity check reduces device failures by 10x.
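
A minimal sketch of this sanity check, assuming each event is a simple record with a device name and a timestamp (these field names are assumptions, not the paper's actual pipeline):

```python
from datetime import timedelta

CONFIRM_WINDOW = timedelta(minutes=5)

def confirmed_device_failures(device_down_events, link_down_events):
    """Keep a DEVICE DOWN event only if at least one link on that device
    also goes down within five minutes (the sanity check on slide 11).
    Each event is assumed to be a dict with "device" and "time" keys."""
    confirmed = []
    for dev_evt in device_down_events:
        has_link_evidence = any(
            link_evt["device"] == dev_evt["device"]
            and abs(link_evt["time"] - dev_evt["time"]) <= CONFIRM_WINDOW
            for link_evt in link_down_events
        )
        if has_link_evidence:
            confirmed.append(dev_evt)
    return confirmed
```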

  12. Reconstructing link state. Link failure events can be inconsistent. Note: our logs bind each link-down to the time it is resolved. What we expect: a LINK DOWN followed by the matching LINK UP, with the link state going from UP to DOWN and back over time.

  13. Reconstructing link state. What we sometimes see instead: multiple interleaved LINK DOWN and LINK UP messages for the same link (e.g., LINK DOWN 1, LINK DOWN 2, LINK UP 1, LINK UP 2), leaving the link state ambiguous. How do we deal with the discrepancies? 1. Take the earliest of the down times. 2. Take the earliest of the up times.
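
A sketch of that resolution rule for a single link, assuming its failure events have already been paired up as (down_time, up_time) tuples (an assumption about the log format, not the paper's exact procedure):

```python
def reconcile_link_failures(pairs):
    """Collapse overlapping (down_time, up_time) pairs for a single link.
    When two pairs overlap, keep the earliest down time and the earliest
    up time, as described on slide 13."""
    merged = []
    for down, up in sorted(pairs):
        if merged and down <= merged[-1][1]:     # overlaps the previous failure
            prev_down, prev_up = merged[-1]
            merged[-1] = (min(prev_down, down), min(prev_up, up))
        else:
            merged.append((down, up))
    return merged
```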

  14. Identifying failures with impact. Correlate link failures with network traffic from the traffic logs, and only consider events where traffic decreases: median traffic during the failure < median traffic before the failure. Summary of impact: 28.6% of failures impact network traffic; 41.2% of failures were on links carrying no traffic (e.g., scheduled maintenance activities). Caveat: impact here is on network traffic, not necessarily on applications! Redundancy in the network, compute, and storage layers can mask outages.
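
A sketch of the impact test, using the five-minute traffic averages from the traffic logs; the sample format and the length of the "before" window are assumptions for illustration:

```python
from datetime import timedelta
from statistics import median

def failure_has_impact(traffic_samples, down_time, up_time,
                       window=timedelta(minutes=30)):
    """traffic_samples: list of (timestamp, bytes) five-minute averages for the
    failed link. A failure "has impact" if the median traffic during the
    failure is lower than the median traffic just before it (slide 14)."""
    before = [b for t, b in traffic_samples if down_time - window <= t < down_time]
    during = [b for t, b in traffic_samples if down_time <= t <= up_time]
    if not before or not during:
        return False   # no traffic data, so we cannot claim impact
    return median(during) < median(before)
```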

  15. Road Map. Motivation; Background & Methodology; Results: 1. Characterizing failures (distribution of failures over the measurement period, which components fail most, and how long failures take to mitigate), 2. Do current network redundancy strategies help?; Conclusions.

  16. Visualization of the failure panorama, Sep. 2009 to Sep. 2010. [Scatter plot: links sorted by data center (y-axis) vs. time binned by day (x-axis); a point at (X, Y) means link Y had a failure on day X.] All failures: 46K. Visible patterns include widespread failures and long-lived failures.

  17. Visualization of the failure panorama, Sep. 2009 to Sep. 2010. [Same scatter plot, now distinguishing failures with impact.] All failures: 46K; failures with impact: 28%. Annotated examples: a component failure appearing as link failures on multiple ports, and a load balancer update affecting multiple data centers.

  18. Which devices cause most failures?

  19. Which devices cause most failures? [Bar chart: percentage of failures vs. percentage of downtime by device type: load balancers (LB-1, LB-2, LB-3), top-of-rack switches (ToR-1, ToR-2), and aggregation switches (AggS-1).] Top-of-rack switches have few failures (annual probability of failure <5%) but a lot of downtime. Load balancer 1 shows the opposite: very little downtime relative to its large number of failures.

  20. How long do failures take to resolve?

  21. How long do failures take to resolve? [CDF of time to repair by device type: Load Balancer 1-3, Top of Rack 1-2, Aggregation Switch, and Overall.] Load balancer 1 sees short-lived transient faults: median time to repair of 4 minutes. Correlated failures on ToRs connected to the same aggregation switches take longer: median time to repair of 3.6 hours for ToR-1 and 22 minutes for ToR-2. Overall, the median time to repair is 5 minutes, with a mean of 2.7 hours.

  22. Summary. Data center networks are highly reliable: the majority of components have four 9s of reliability. Low-cost top-of-rack switches have the highest reliability (<5% probability of failure) but the most downtime, because they are lower-priority components to repair. Load balancers experience many short-lived faults; the root causes are software bugs, configuration errors, and hardware faults. Software and hardware faults dominate failure counts, but hardware faults contribute the most downtime.
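
For context (not stated on the slide): "four 9s" corresponds to 99.99% availability, i.e. at most (1 - 0.9999) x 525,600 minutes ≈ 52.6 minutes of downtime per component per year.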

  23. Road Map. Motivation; Background & Methodology; Results: 1. Characterizing failures, 2. Do current network redundancy strategies help?; Conclusions.

  24. Is redundancy effective in reducing impact? Redundant devices and links are deployed to mask failures, but this is expensive (management overhead + $$$). Goal: reroute traffic along the available paths when a device or link fails. How effective is this in practice?

  25. Measuring the effectiveness of redundancy. Idea: compare traffic before and during the failure. 1. Measure traffic on links before the failure. 2. Measure traffic during the failure. 3. Compute the normalized traffic ratio: median traffic during the failure divided by median traffic before it (a ratio near 1 means the failure was masked). [Diagram: a failed link between a primary aggregation switch and a primary access router, with a backup aggregation switch and backup access router providing the redundant path.] Compare the normalized traffic over the redundancy group to the normalized traffic on the link that failed.
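
A sketch of the normalized traffic ratio and the per-link vs. per-redundancy-group comparison; the data layout and helper names are assumptions for illustration, not the study's actual code:

```python
from datetime import timedelta
from statistics import median

def normalized_traffic(samples, down_time, up_time, window=timedelta(minutes=30)):
    """Median traffic during a failure divided by median traffic before it.
    samples: list of (timestamp, bytes) five-minute averages. A ratio near 1
    means the failure was masked; near 0 means traffic dropped to nothing."""
    before = [b for t, b in samples if down_time - window <= t < down_time]
    during = [b for t, b in samples if down_time <= t <= up_time]
    if not before or median(before) == 0:
        return None
    return median(during) / median(before)

def redundancy_effectiveness(failed_link_samples, group_samples,
                             down_time, up_time):
    """Compare the ratio on the failed link alone with the ratio over its whole
    redundancy group (group_samples: the group's traffic summed per interval
    before being passed in)."""
    return (normalized_traffic(failed_link_samples, down_time, up_time),
            normalized_traffic(group_samples, down_time, up_time))
```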

  26. Is redundancy effective in reducing impact? [Bar chart: median normalized traffic during failure, per link and per redundancy group, for all failures and by layer: top-of-rack to aggregation switch, aggregation switch to access router, and core.] Core link failures have the most impact, but redundancy masks it. There is less impact lower in the topology. Redundancy is least effective for AggS and AccR. Overall, there is an increase of about 40% in traffic carried during failures due to redundancy.

  27. Road Map. Motivation; Background & Methodology; Results: 1. Characterizing failures, 2. Do current network redundancy strategies help?; Conclusions.

  28. Conclusions. Goal: understand failures in data center networks, via an empirical study of data center failures. Key observations: data center networks have high reliability; low-cost switches exhibit high reliability; load balancers are subject to transient faults; failures may lead to the loss of small packets. Future directions: study application-level failures and their causes, and further study the effectiveness of redundancy.

  29. Thanks! Contact: phillipa@cs.toronto.edu. Project page: http://research.microsoft.com/~navendu/netwiser
