RootViz: New root DNS reachability dashboard

Visualise real-time measurement results


We have recently built an open dashboard called RootViz, which visualises measurement data produced by all RIPE Atlas probes in real time.

It allows users to visualise real-time measurement results (reachability and latency) between all RIPE Atlas probes and each root server, for both IPv4 and IPv6.

It complements DNSMON in two ways:

  • it uses the industry’s de facto standard for time series visualisation (Grafana)

  • it draws on a different dataset from DNSMON: data from all Atlas probes, not only the more robust anchors. This is the same dataset that we used in a study on anycast vs DDoS on the root DNS system.

Datasets

RIPE Atlas measures every root server every 4 minutes, asking all of its 14,000+ probes to send DNS TXT CHAOS queries to each root server letter. Below are the links to each Atlas measurement:
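To illustrate what a DNS TXT CHAOS query looks like on the wire, here is a minimal sketch that builds one with the Python standard library only. The query name `hostname.bind` is an assumption (the text above does not say which name the probes ask for), and the function names are made up for this example.

```python
import struct

CLASS_CH = 3   # CHAOS class (RFC 1035)
TYPE_TXT = 16  # TXT record type (RFC 1035)


def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_chaos_txt_query(qname, query_id):
    """Build a DNS query message asking for <qname> in class CH, type TXT."""
    # Header: ID, flags (all zero: standard query, recursion not requested),
    # then QDCOUNT=1 and ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    question = encode_name(qname) + struct.pack("!HH", TYPE_TXT, CLASS_CH)
    return header + question


# Assumption: "hostname.bind" is used here purely as an illustrative name.
packet = build_chaos_txt_query("hostname.bind", query_id=0x1234)
print(len(packet), packet.hex())
```

Sending such a packet over UDP to port 53 of a root server address would typically return a TXT record identifying the answering instance; that step is left out so the example runs offline.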

How RootViz works

Every 30 minutes, the tool downloads the measurement datasets for each root server letter in the table above, covering the interval [t-60, t-30] minutes. That involves a very large volume of data. To keep things scalable, RootViz computes and stores only the aggregated metrics for the interval, such as the number of timed-out probes and the median RTT.
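The windowing and aggregation described above can be sketched as follows. This is not the RootViz code: the record shape (`prb_id`, `rtt_ms`, with `None` marking a timeout) is a simplified, hypothetical one, and counting a probe as timed out when at least one of its queries in the window timed out is an assumption of this example.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def download_window(now):
    """Return the [t-60, t-30] minute window for a run at time `now`."""
    return now - timedelta(minutes=60), now - timedelta(minutes=30)


def aggregate(results):
    """Reduce raw per-query results to a few summary metrics.

    Each result is a hypothetical simplified record:
    {"prb_id": int, "rtt_ms": float or None}, where None marks a timeout.
    """
    probes = {r["prb_id"] for r in results}
    # Assumption: one timed-out query is enough to count the probe as timed out.
    timed_out = {r["prb_id"] for r in results if r["rtt_ms"] is None}
    rtts = [r["rtt_ms"] for r in results if r["rtt_ms"] is not None]
    return {
        "probes": len(probes),
        "timed_out_probes": len(timed_out),
        "median_rtt_ms": median(rtts) if rtts else None,
    }


start, end = download_window(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))
sample = [
    {"prb_id": 1, "rtt_ms": 12.0},
    {"prb_id": 1, "rtt_ms": 14.0},
    {"prb_id": 2, "rtt_ms": None},
    {"prb_id": 3, "rtt_ms": 30.0},
]
print(start, end, aggregate(sample))
```

Storing only the small dictionary per interval, rather than every raw result, is what keeps the stored volume modest.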

We also use it to monitor one of our .nl authoritative servers, and we are looking into making the code open source. However, all the “heavy lifting” is done by the RIPE Atlas probes, which perform the measurements; RootViz only aggregates and visualises them.

Dashboards

On the landing page of the dashboards, we show the percentage of RIPE Atlas probes that time out while trying to reach each Root Server Identifier (RSI, also known as a root server letter).

For instance, below we show one week of timeouts for each RSI, for IPv4. We see oscillations in this period that we need to explore further later. For now, we are just visualising the results.

https://images.ctfassets.net/yj8364fopk6s/1qQ5Ohj437qUtUhLUQfrRB/7f68c81ded3c1c60d899bd4becd862b7/Root_server_timeout_probes.png

In this process, we disregard by default the Atlas probes that do not work for the address family in question (IPv4 or IPv6). Atlas tags them accordingly.
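A minimal sketch of this filtering step and of the timeout percentage shown on the landing page, under stated assumptions: the probe record shape is hypothetical, and the tag names `system-ipv4-works` and `system-ipv6-works` are assumed to be the Atlas tags in question (the text above does not name them).

```python
def timeout_percentage(probes, timed_out_ids, required_tag):
    """Percentage of working probes that timed out in an interval.

    probes:        list of {"id": int, "tags": set of str} (hypothetical shape)
    timed_out_ids: set of probe ids that timed out in the interval
    required_tag:  tag a probe must carry to be counted at all
    """
    working = [p["id"] for p in probes if required_tag in p["tags"]]
    if not working:
        return None  # nothing to measure against
    failed = sum(1 for pid in working if pid in timed_out_ids)
    return 100.0 * failed / len(working)


# Assumed tag names; probe 5 has no working IPv4 and is therefore ignored.
sample_probes = [
    {"id": 1, "tags": {"system-ipv4-works", "system-ipv6-works"}},
    {"id": 2, "tags": {"system-ipv4-works"}},
    {"id": 3, "tags": {"system-ipv4-works"}},
    {"id": 4, "tags": {"system-ipv4-works", "system-ipv6-works"}},
    {"id": 5, "tags": {"system-ipv6-works"}},
]
ipv4_pct = timeout_percentage(sample_probes, {2, 5}, "system-ipv4-works")
print(ipv4_pct)
```

Excluding probe 5 before dividing matters: a probe with broken IPv4 would otherwise always count as a timeout and inflate the percentage for every root server.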

Why timeouts? For a root server operator, the most important metric is reachability: being able to serve clients. Timeouts may suggest reachability issues between client and server, in the network, or at the server itself. (RTT is secondary, given that root DNS responses are expected to be cached for 2 days.)

The basic idea is crowdsourcing: having a few probes (or a few hundred) timing out constantly indicates a persistent error, while having spikes suggests something else. We still need to explore the implications of these spikes.
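The distinction between constant offenders and spikes can be illustrated with a small sketch. This is not something RootViz does today; the thresholds (`min_fraction`, `factor`) and the input shape (one set of timed-out probe ids per interval) are arbitrary assumptions chosen only to make the idea concrete.

```python
from collections import Counter
from statistics import median


def persistent_probes(windows, min_fraction=0.9):
    """Probe ids that timed out in at least `min_fraction` of the windows."""
    if not windows:
        return set()
    counts = Counter(pid for window in windows for pid in window)
    return {pid for pid, n in counts.items() if n / len(windows) >= min_fraction}


def spike_windows(windows, factor=3.0):
    """Indices of windows whose timeout count exceeds `factor` x the median."""
    if not windows:
        return []
    sizes = [len(window) for window in windows]
    baseline = median(sizes)
    return [i for i, size in enumerate(sizes) if size > factor * baseline]


# Four intervals: probes 1 and 2 always fail, the third interval has a burst.
intervals = [{1, 2}, {1, 2}, {1, 2, 3, 4, 5, 6, 7, 8}, {1, 2}]
print(persistent_probes(intervals), spike_windows(intervals))
```

Here probes 1 and 2 form the persistent baseline, while the third interval stands out because many otherwise healthy probes fail at once.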

In addition to these combined dashboards, we also include other metrics below:

Dashboard per RSI

We have also generated one dashboard per Root Server Identifier, in which we show 9 graphs for that root server. These dashboards can be used by the operators or by other interested parties.

For instance, for L-ROOT, we show below the latency for IPv6, including the median, 75th percentile (p75) and 99th percentile (p99):

https://images.ctfassets.net/yj8364fopk6s/2VZD7XoOL1aBC91fK34m6S/008697e3b7c516c9fc59d3a330ba49d5/L-ROOT_IPv6_latency.png
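For readers unfamiliar with these latency metrics, the sketch below computes a median, p75 and p99 from a list of RTTs. It uses the nearest-rank definition for the percentiles, which is one common choice and an assumption here; Grafana or RootViz may use a different (for example interpolating) method. The sample values are invented.

```python
from statistics import median


def nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding issues.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_summary(rtts_ms):
    """Summarise RTTs (in milliseconds) as median, p75 and p99."""
    return {
        "median": median(rtts_ms),
        "p75": nearest_rank(rtts_ms, 75),
        "p99": nearest_rank(rtts_ms, 99),
    }


# Invented sample: most probes are fast, a few are very slow.
sample_rtts = [10, 12, 15, 20, 22, 30, 45, 80, 120, 400]
print(latency_summary(sample_rtts))
```

The gap between the median and p99 in such a summary shows why both are plotted: a handful of distant or poorly connected probes can dominate the tail without moving the median.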

What’s next

RootViz is currently mainly useful for visually inspecting events — a way to spot things in real time that might otherwise go unnoticed.

Our goal is to add automatic anomaly detection and event analysis. We also plan to make the datasets and metrics publicly available to the community, as RootViz can also be used by other DNS operators to monitor their own services.

We will offer the implementation of these extensions as a project to students on the TU Delft Computer Science Bachelor programme, where teams build software as part of their coursework. As a previous example, students developed NTPinfo, an NTP measurement tool that we now provide as a public service (see this announcement).

If you have ideas, feedback or suggestions, we would love to hear from you.