Tapdance delivers better insight into DNS queries

Our new open-source application for near-real-time DNS statistics

The original blog post is in Dutch. This is the English translation.

SIDN operates a global DNS infrastructure that ensures .nl domain names are always reachable quickly from anywhere in the world. In an earlier blog post, we described that infrastructure in detail, along with our data-driven approach to managing and improving it. Insight into DNS query data is essential to that approach. In this blog, we introduce Tapdance: our new open-source application that enriches and visualises DNS data in near real time.

To measure is to know

Our DNS team relies on DNS query statistics to monitor and improve the service. For instance, the statistics help us to quickly spot when DNS servers are under pressure, so we can scale up capacity. Understanding where in the world DNS queries originate and what the latency is between resolver and name server enables us to make better decisions about where to deploy our DNS servers. Such information forms the basis for our Autocast optimisation algorithm. Our existing tools provide a lot of the DNS data we’re interested in, but they don’t always meet our needs, and some are getting old.

Shortcomings of the established measurement methods

Until now, we’ve used 2 systems to gain insight into .nl DNS queries and our infrastructure: DSC (DNS Statistics Collector) and ENTRADA.

DSC is the de facto standard for collecting DNS query statistics on authoritative name servers, but it has a few shortcomings for us. It provides a broad set of counters covering DNS characteristics such as query type, response code and DNSSEC queries. However, by default, the statistics aren’t enriched with geographical information or the latency between resolver and name server – information that’s vital for our data-driven infrastructure. DSC aggregates statistics locally on each anycast server and writes them to disc as XML files. Those files then have to be fetched by a central component that collates statistics from all the locations. That’s a suboptimal model for our dynamic infrastructure. For one thing, there’s a delay of around 15 minutes before the data can be viewed. It also means we have to maintain an extra central component in our otherwise decentralised system. And, in the event of a fault, XML files can pile up on the anycast servers or get lost.
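The Python sketch below mimics this collect-and-fetch model. Each anycast server counts queries per (query type, response code) pair for one interval and writes the counters to an XML file. A central collator then parses the fetched files and sums them. The XML layout, location names and record fields are hypothetical simplifications, not DSC's actual file format.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def aggregate_locally(queries):
    """Count (qtype, rcode) pairs seen on one anycast server during one interval."""
    return Counter((q["qtype"], q["rcode"]) for q in queries)


def to_xml(server, counts):
    """Serialise one interval's counters (simplified, hypothetical layout, not DSC's schema)."""
    root = ET.Element("dataset", server=server)
    for (qtype, rcode), n in sorted(counts.items()):
        ET.SubElement(root, "row", qtype=qtype, rcode=rcode, count=str(n))
    return ET.tostring(root, encoding="unicode")


def collate(xml_documents):
    """Central component: parse every fetched file and sum counters across locations."""
    totals = Counter()
    for doc in xml_documents:
        for row in ET.fromstring(doc).iter("row"):
            totals[(row.get("qtype"), row.get("rcode"))] += int(row.get("count"))
    return totals


# Two hypothetical anycast locations, each writing its own file for the interval.
ams = aggregate_locally([
    {"qtype": "A", "rcode": "NOERROR"},
    {"qtype": "AAAA", "rcode": "NXDOMAIN"},
])
sin = aggregate_locally([{"qtype": "A", "rcode": "NOERROR"}])
totals = collate([to_xml("ams", ams), to_xml("sin", sin)])
print(totals[("A", "NOERROR")])  # 2
```

The combined totals exist only after `collate` has processed a file from every location. As a result, the view always lags by at least one collection interval, and any outage of the collator leaves files waiting on the servers.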

The second tool we use is ENTRADA, a DNS data platform developed by SIDN Labs. All .nl queries (about 8 billion a day) are stored on ENTRADA for up to 18 months, so that they can be used for research. ENTRADA data is enriched with resolver geolocation data and a certain amount of data on resolver-name server latency. However, ENTRADA calculates latency by measuring the interval between 2 successive steps in the TCP handshake. While that method generally works well, it covers only TCP-based DNS queries, which make up about 1 per cent of all DNS traffic. As a result, it gives an incomplete and skewed picture of resolver latency. On top of that, ENTRADA isn't designed for fast visualisation. Because the data has to go through complex processing and aggregation, it takes about 25 minutes before it's available to view.
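The sketch below shows how handshake-based latency measurement can work. It assumes the 2 steps are the name server's SYN-ACK and the resolver's ACK that completes the handshake, because the interval between them approximates one round trip. ENTRADA's actual implementation may use different steps. The `Packet` record, the function name and the addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ts: float          # capture timestamp in seconds
    src: str           # "ip:port" of the sender
    dst: str           # "ip:port" of the receiver
    flags: frozenset   # TCP flags on the segment, e.g. frozenset({"SYN", "ACK"})


def handshake_rtts(packets, server):
    """Return {client: rtt_ms}, timing the server's SYN-ACK against the client's
    ACK that completes the handshake (assumed pair of steps)."""
    synack_sent = {}   # client -> time the server sent its SYN-ACK
    rtts = {}
    for pkt in sorted(packets, key=lambda p: p.ts):
        if pkt.src == server and pkt.flags == {"SYN", "ACK"}:
            synack_sent[pkt.dst] = pkt.ts
        elif pkt.dst == server and pkt.flags == {"ACK"} and pkt.src in synack_sent:
            # pop() so later ACKs on the same connection are not counted again
            rtts[pkt.src] = (pkt.ts - synack_sent.pop(pkt.src)) * 1000.0
    return rtts


srv = "192.0.2.53:53"
client = "198.51.100.7:40001"
pkts = [
    Packet(0.000, client, srv, frozenset({"SYN"})),
    Packet(0.001, srv, client, frozenset({"SYN", "ACK"})),
    Packet(0.031, client, srv, frozenset({"ACK"})),
]
rtts = handshake_rtts(pkts, srv)
print(round(rtts[client], 1))  # 30.0
```

UDP queries never produce a handshake, so resolvers that only use UDP never appear in the result. That is why this method only covers a small share of traffic.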

Tapdance modernises anycast visibility

To overcome the shortcomings of our existing tooling, we’ve developed a simple application called Tapdance that does exactly what we need. Tapdance is based on the dnstap logging standard, which is supported by nearly all name server software.
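dnstap messages are protobuf-encoded and are normally carried over the Frame Streams protocol. The sketch below assumes the unidirectional framing described in the Frame Streams specification. Each data frame is a 32-bit big-endian length followed by the payload, and a zero length escapes a control frame such as START or STOP. The sketch splits such a stream into data frames and leaves the protobuf payloads undecoded. The function names are hypothetical and not taken from Tapdance.

```python
import struct

CONTROL_START = 0x02  # Frame Streams control frame type: START
CONTROL_STOP = 0x03   # Frame Streams control frame type: STOP


def read_data_frames(buf):
    """Yield the payload of each data frame in a unidirectional Frame Streams
    byte string, skipping control frames. Payloads stay protobuf-encoded."""
    pos = 0
    while pos + 4 <= len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        if length == 0:
            # Escape sequence: control frame length, then type + fields.
            (ctrl_len,) = struct.unpack_from(">I", buf, pos)
            (ctrl_type,) = struct.unpack_from(">I", buf, pos + 4)
            pos += 4 + ctrl_len
            if ctrl_type == CONTROL_STOP:
                return
            continue
        yield buf[pos:pos + length]
        pos += length


def control_frame(ctype, fields=b""):
    """Build a control frame (test helper)."""
    body = struct.pack(">I", ctype) + fields
    return struct.pack(">II", 0, len(body)) + body


def data_frame(payload):
    """Build a data frame (test helper)."""
    return struct.pack(">I", len(payload)) + payload


# START with a content-type field (type 0x01), two data frames, then STOP.
content_type = b"protobuf:dnstap.Dnstap"
stream = (
    control_frame(CONTROL_START, struct.pack(">II", 0x01, len(content_type)) + content_type)
    + data_frame(b"msg-1")
    + data_frame(b"msg-2")
    + control_frame(CONTROL_STOP)
)
frames = list(read_data_frames(stream))
print(frames)  # [b'msg-1', b'msg-2']
```

In practice, each payload would then be decoded with the dnstap protobuf schema to obtain the query, resolver address and timestamps.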

Tapdance is a standalone application that will run on any DNS anycast server, without needing a central component to collect statistics or manage the system. It therefore scales automatically with our DNS infrastructure, without any need for monitoring. By parsing dnstap logs from name server software in real time, we generate query statistics that appear on our dashboards within a minute. We also enrich the data immediately with resolver geolocation data and data on resolver-name server latency. What’s more, the latency is measured actively by sending ICMP pings to resolvers that regularly send DNS queries. So we get a picture of latency for all significant resolvers, instead of only for the ones that use TCP for DNS queries.
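The sketch below illustrates the kind of incremental aggregation described above. Each decoded dnstap query updates per-minute counters keyed by resolver country and query type. Resolvers that reach a query threshold become candidates for latency pings. The prefix-to-country table stands in for a real geolocation database. `MinuteStats`, `ping_threshold` and the threshold value are hypothetical, not Tapdance's actual code. Sending the ICMP pings is left out because it typically requires raw-socket privileges.

```python
import ipaddress
from collections import Counter, defaultdict

# Hypothetical stand-in for a geolocation database lookup.
GEO_PREFIXES = {
    ipaddress.ip_network("198.51.100.0/24"): "NL",
    ipaddress.ip_network("203.0.113.0/24"): "JP",
}


def country_of(ip):
    """Map a resolver address to a country code via the toy prefix table."""
    addr = ipaddress.ip_address(ip)
    for net, country in GEO_PREFIXES.items():
        if addr in net:
            return country
    return "unknown"


class MinuteStats:
    """Per-minute counters updated as decoded dnstap query messages arrive."""

    def __init__(self, ping_threshold):
        self.ping_threshold = ping_threshold
        self.by_minute = defaultdict(Counter)  # minute start -> Counter[(country, qtype)]
        self.per_resolver = Counter()          # resolver IP -> queries seen so far

    def observe(self, ts, resolver_ip, qtype):
        minute = int(ts // 60) * 60
        self.by_minute[minute][(country_of(resolver_ip), qtype)] += 1
        self.per_resolver[resolver_ip] += 1

    def ping_targets(self):
        """Resolvers that query often enough to be worth pinging for latency."""
        return sorted(ip for ip, n in self.per_resolver.items() if n >= self.ping_threshold)


stats = MinuteStats(ping_threshold=2)
stats.observe(120.5, "198.51.100.7", "A")
stats.observe(130.0, "198.51.100.7", "AAAA")
stats.observe(185.0, "203.0.113.9", "A")
print(dict(stats.by_minute[180]))  # {('JP', 'A'): 1}
print(stats.ping_targets())        # ['198.51.100.7']
```

Counters held in memory and exported once per minute need no intermediate files or central collector. This fits the within-a-minute delay described above.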

In short, Tapdance improves and modernises our insight into DNS traffic by providing:

  • A decentralised application architecture

  • A much shorter delay between query and visualisation

  • Richer statistics, including geographical details and latency information about all resolvers

  • Fewer processing steps, making the system less sensitive to errors

As a result, we’re able to further optimise our DNS anycast infrastructure and respond more quickly to cybersecurity threats such as DDoS attacks.

Figure 1: Example of a Tapdance dashboard showing DNS anycast statistics (incomplete data for illustration).

Open source

We’ve published Tapdance as open-source software on Codeberg, so other DNS anycast operators can use it too. If you’ve got any questions about Tapdance or ideas about how we might improve it, feel free to drop us a line at sidnlabs@sidn.nl.