Increasing DNS Resolver Resiliency: Challenges and Learnings

This presentation covers implementing a dual-stack DNS resolver using Windows DNS Server and Unbound, along with the learnings and issues encountered. It discusses the burden on DNS to be extremely resilient and the architecture of a high-volume DNS resolver, then covers the development process, the validation method, and an analysis of the results.

  • DNS Resolver
  • Dual-Stack
  • Resiliency
  • Windows DNS Server
  • Unbound

Uploaded on Feb 23, 2025 | 0 Views


Presentation Transcript


  1. Increasing DNS Resolver Resiliency: Challenges and Learnings Arunkumar Singaram arsingar@microsoft.com

  2. Agenda
     • Implementing a dual-stack DNS resolver using Windows DNS Server (WinDNS) & Unbound
     • Learnings/issues encountered during dual-stack implementation
     • Enabling serve-stale data from the cache (RFC 8767) in Unbound

  3. DNS has a great burden to be extremely resilient!
     Protect against:
     • zero-day vulnerabilities
     • dormant data plane bugs

  4. Architecture
     • High volume: the DNS resolver processes requests (~8M QPS) from Microsoft online services
     • Global availability: presence in all Azure regions; proximity ensures quicker response times and helps workloads that depend on geolocation, like Azure Traffic Manager
     • Internal & external authoritative DNS services: contains forwarder & stub zones that are internal only

  5. Development process
     Phases: Evaluation, Development, Validation
     Activities:
     • Feature analysis of different resolvers
     • Performance testing
     • Containerization
     • Query logs with Dnstap
     • Performance testing
     • Monitoring/alerts
     • Compatibility with existing system

  6. Validation
     Stages: Gather DNS queries, Process responses, Manual analysis, Finesse Unbound, Submit, Ship
     Steps:
     • Reuse query logs from production
     • Send to both stacks in parallel
     • Auto-categorize differences in responses
     • Identify acceptable differences
     • Tweak code/config to get the desired response
     • Repeat previous steps until satisfactory
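The "auto-categorize" step above could look roughly like the following Python sketch. It is not the presenters' tooling: the `Response` type, the order of the checks, and the `RCodeMismatch` and `AnswerMismatch` labels are assumptions made for illustration, while the remaining labels are borrowed from the Results slide.

```python
from collections import Counter
from dataclasses import dataclass


# Hypothetical, simplified view of a DNS response. A real pipeline would
# parse wire-format messages (for example, from Dnstap query logs).
@dataclass(frozen=True)
class Response:
    rcode: str                           # e.g. "NOERROR", "NXDOMAIN", "SERVFAIL"
    flags: frozenset = frozenset()       # e.g. {"RD", "RA", "AA"}
    answer: frozenset = frozenset()      # set of (name, type, rdata) tuples
    authority: frozenset = frozenset()
    additional: frozenset = frozenset()
    wire_length: int = 0                 # size of the encoded message in bytes


def has_soa(section: frozenset) -> bool:
    """True if any record in the section is of type SOA."""
    return any(rtype == "SOA" for _name, rtype, _rdata in section)


def categorize(a: Response, b: Response) -> str:
    """Return one label describing how two responses differ (first match wins)."""
    if a.rcode != b.rcode:
        return "RCodeMismatch"            # assumed label
    if a.flags != b.flags:
        return "HeaderFlagMismatch"
    if a.answer != b.answer:
        return "AnswerMismatch"           # assumed label
    if has_soa(a.authority) != has_soa(b.authority):
        return "MissingSOARecord"
    if a.authority != b.authority:
        return "TotalAuthSectionMismatch"
    if a.additional != b.additional:
        return "MissingGlueRecords"
    if a.wire_length != b.wire_length:
        # Same records but a different encoded size: assume name compression differs.
        return "CompressionMismatch"
    return "TotalPayloadMatch"


def tally(pairs) -> Counter:
    """Count categories over an iterable of (response_a, response_b) pairs."""
    return Counter(categorize(a, b) for a, b in pairs)
```

Feeding `tally` the response pairs collected from both stacks gives per-category counts, which can then be reviewed manually to decide which differences are acceptable.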

  7. Results
     Majority match: around 86% of responses were identical.
     Ignorable differences:
     • SERVFAIL vs NXDOMAIN = not a deal breaker
     • Compression difference = not a deal breaker
     • Missing glue records = you get the gist!
     Breakdown of response comparisons (chart data):
     • TotalPayloadMatch: 86%
     • HeaderRDAAFlagMismatch: 3%
     • MissingSOARecord: 3%
     • MissingGlueRecords: 3%
     • NxDomainRCodeMatch: 2%
     • CompressionMismatch: 1%
     • ServFailRCodeMatch: 1%
     • TotalAuthSectionMismatch: 1%
     • HeaderFlagMismatch: 0%
     • AuthSectionCompressionMismatch: 0%
     • NxDomainRCodeMismatch: 0%
     • Other: 0%

  8. Learnings in Dual-Stack Ramp-up
     Differing behavior between services. Two examples:
     Query for a type with a large record set (>512 bytes)
     drive.foo.com -> foo.contoso.com -> 40.113.200.201
     [Diagram, first flow: query "drive.foo.com CNAME?"; query to auth; foo.contoso.com; truncated response; TC bit, NODATA; "drive.foo.com A Record?"; foo.contoso.com]
     [Diagram, second flow, "But with query minimization turned off": query; query to auth; "drive.foo.com CNAME?"; TC bit, fill UDP; truncated response; foo.contoso.com; "drive.foo.com A Record?"; 40.113.200.201]
     Key: UDP-only DNS client (non-standard)

  9. Learnings (continued...)
     • Always follow Safe Deployment Practice
       - Deploy to the smallest region as the first stage
       - Have sufficient bake time between stages
       - Run a perf test before every release
     • Investment is > 2X the single stack
       - Initial investment to bootstrap the dual stack is high
       - Feature development is just one part, but testing/monitoring will be shared
       - Ongoing investments are needed to maintain compatibility with spec tests
     • Tweak things outside the DNS software
       - Reloading the config after zone changes is very expensive because it flushes the cache; use zone reload via unbound-control instead
       - Optimize OS network parameters to avoid socket buffer overflow
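As a small illustration of the socket-buffer point, the Python sketch below asks the operating system for a larger UDP receive buffer and reports what was actually granted. The helper name and the requested size are assumptions; the slides do not say which parameters were tuned or to what values.

```python
import socket


def request_udp_receive_buffer(requested_bytes: int) -> int:
    """Ask the OS for a UDP receive buffer of the given size and
    return the size the kernel reports after the request."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        # The kernel may cap or adjust the request, so read the value back.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

Calling `request_udp_receive_buffer(4 * 1024 * 1024)` and comparing the result with the requested size shows whether an OS limit is in the way. On Linux, for example, the request is capped by `net.core.rmem_max`, and the reported value is double the amount that was set because the kernel accounts for bookkeeping overhead.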

  10. Additional enhancements to increase resiliency with RFC 8767 (serving stale data)
     • Reduces the impact of authoritative DNS performance degradation
     • Minimized the impact of two separate instances of individual zone outages
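The idea behind serving stale data can be sketched in a few lines of Python. This is a toy cache and not Unbound's implementation: the class and parameter names, the `ResolutionError` type, and the default timers are all assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass


class ResolutionError(Exception):
    """Raised by the (hypothetical) upstream resolve callback on failure."""


@dataclass
class Entry:
    value: str
    expires_at: float  # clock time at which the record's TTL runs out


class StaleCache:
    """Toy cache that falls back to expired data when resolution fails."""

    def __init__(self, max_stale=86400.0, stale_ttl=30.0, clock=time.monotonic):
        self.max_stale = max_stale    # how long past expiry data may still be served
        self.stale_ttl = stale_ttl    # TTL handed out with a stale answer
        self.clock = clock
        self._entries = {}

    def lookup(self, name, resolve):
        """Return (value, ttl, is_stale); resolve(name) must return (value, ttl)."""
        now = self.clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry.expires_at:
            return entry.value, entry.expires_at - now, False  # fresh cache hit
        try:
            value, ttl = resolve(name)
        except ResolutionError:
            # Upstream failed: serve the expired entry if it is not too old.
            if entry is not None and now < entry.expires_at + self.max_stale:
                return entry.value, self.stale_ttl, True
            raise
        self._entries[name] = Entry(value, now + ttl)
        return value, ttl, False
```

With this shape, an outage of an authoritative zone only becomes visible to clients once the cached records are older than `max_stale`, which is how a resolver can ride out short authoritative failures.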

  11. Thank you
