The time-to-live (TTL) value of a DNS record "is primarily used by resolvers when they cache RRs. The TTL describes how long a RR can be cached before it should be discarded" (RFC 1034). In other words, it is the maximum length of time (in seconds) that a DNS resolver should keep a domain in its cache.
A TTL violation occurs when a recursive resolver overrides the time-to-live value of a DNS record as provided by the authoritative server. For example, it has been documented that Amazon EC2 local resolvers override the TTL of .nl, changing it from 172,800 to 60 seconds. Other researchers have previously reported on use of the practice on wired and mobile networks [3,4].
When a resolver violates the TTL specified by a DNS zone, one of two things can happen: if the violation involves reducing the specified value (e.g. from 2 days to 60 seconds), the resolver will query the authoritative server more often. If the TTL value is increased, the domain remains reachable at an address that could, in principle, be wrong or illegitimate (for example, if a phishing website is taken down, it may still be available in the caches of a resolver that has increased its TTL).
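To make the first effect concrete, a back-of-envelope calculation using the EC2 example above (TTL overridden from 172,800 s to 60 s) shows how sharply the query load on the authoritative server can grow. This is only an upper bound, assuming the resolver is busy enough to re-fetch the record on every cache expiry:

```python
# Back-of-envelope: upper bound on authoritative queries per day for one
# record, from a single resolver that re-fetches on every cache expiry.
SECONDS_PER_DAY = 86_400

def max_queries_per_day(ttl_seconds: int) -> float:
    """One cache miss per TTL window, assuming the resolver is always busy."""
    return SECONDS_PER_DAY / ttl_seconds

original = max_queries_per_day(172_800)  # .nl's 2-day TTL
overridden = max_queries_per_day(60)     # EC2's 60-second override

print(f"{original:.1f} vs {overridden:.0f} queries/day "
      f"({overridden / original:.0f}x increase)")
```

With these numbers, a 2-day TTL means at most one query every other day, while a 60-second override allows up to 1,440 queries per day from the same resolver.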
The violation of TTLs divides opinion: some see it as a legitimate practice, while others are against it [2,5,6]. There is also an Internet Draft that presents a method (currently used by many cloud providers) to serve stale DNS data when authoritative servers are unreachable.
In this post, we do not debate whether resolvers should violate TTL values provided by authoritative servers (please refer to [2,3,4,5,6] for that). The question we consider is: are TTL violations happening in the wild? TTL violations have been reported in other studies (e.g. involving wireless providers), but the number of providers involved in the reported cases has been small.
To answer the question set out above, we have analysed data from the RIPE Atlas probes.
We measured TTL violations in the wild using the following procedure:
- Register an unused domain name (cachetest.nl).
- Set up two authoritative name servers for cachetest.nl:
- Set up the zone files for each NS, using RIPE Atlas probe IDs as subdomains (so we can use macros to send unique queries from each probe to avoid caching -- i.e., $p.cachetest.nl, where $p is the probe ID):
23559 333 IN TXT "this is ns1 responding to probe 23559"
23560 333 IN TXT "this is ns1 responding to probe 23560"
23561 333 IN TXT "this is ns1 responding to probe 23561"
23562 333 IN TXT "this is ns1 responding to probe 23562"
- Run Atlas measurements with 10,000 Atlas probes:
- Parse and analyse the results.
As will be apparent from step 3, we made sure each probe query involved a unique domain name, so that a resolver cache-miss situation is guaranteed, even if the same resolver was queried. In other words, every query was designed to make the resolver query one of our authoritative servers.
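The per-probe zone entries shown in step 3 can be generated mechanically from a list of probe IDs. A minimal sketch (the record format mirrors the examples above; the function name is ours):

```python
# Generate one TXT record per Atlas probe ID, so that every probe queries
# a unique name ($p.cachetest.nl) and is guaranteed a resolver cache miss.
def zone_records(probe_ids, ns_name="ns1", ttl=333):
    lines = []
    for p in probe_ids:
        lines.append(f'{p} {ttl} IN TXT "this is {ns_name} responding to probe {p}"')
    return "\n".join(lines)

print(zone_records([23559, 23560]))
```

Because every probe's query name is unique, a cached answer from another probe can never satisfy it, so each query must reach one of our authoritative servers at least once.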
After running the measurement for 1 hour, querying every 600 s (almost twice the 333 s TTL of the records in our zone), we generated the final dataset shown in Table 1. As can be seen, 9,119 probes were involved in this measurement, querying 6,587 resolvers.
Since each probe can contact multiple resolvers, we see that, in the end, there are 15,923 vantage points, i.e. unique probe-resolver combinations.
Our 54,115 queries produced 94,805 answers, which we used in our analysis described below.
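The probe, resolver, and vantage-point counts above are straightforward set cardinalities over the parsed results. A sketch, assuming each parsed result carries the probe ID and the resolver address under the `prb_id` and `dst_addr` keys (adapt the field names to however your parser labels them):

```python
# Count probes, resolvers, and unique probe-resolver pairs ("vantage
# points") from a list of parsed measurement results.
def summarise(results):
    probes = {r["prb_id"] for r in results}
    resolvers = {r["dst_addr"] for r in results}
    pairs = {(r["prb_id"], r["dst_addr"]) for r in results}
    return len(probes), len(resolvers), len(pairs)

demo = [
    {"prb_id": 1, "dst_addr": "10.0.0.1"},
    {"prb_id": 1, "dst_addr": "10.0.0.2"},  # same probe, second resolver
    {"prb_id": 2, "dst_addr": "10.0.0.1"},
]
print(summarise(demo))  # -> (2, 2, 3)
```

This is how 9,119 probes and 6,587 resolvers can yield 15,923 vantage points: a probe contacting several resolvers contributes one pair per resolver.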
| Metric | Value |
|---|---|
| Unique probes | 9,119 |
| Unique resolvers | 6,587 |
| Unique probe-resolver pairs | 15,923 |
| Queries | 54,115 |
| Answers | 94,805 |
| Frequency | every 10 min |

Table 1: Complete dataset
Given that we set the TTL for every record in our demo zone to 333, the question is: how many resolvers changed that TTL value? And what is the typical change, if any?
The expected TTL value in the answers is 333. However, since multiple probes can use the same resolver, we may expect some TTLs to be slightly less than 333 (reflecting time spent in the cache). No answer, though, should have a TTL of more than 333.
We divided the dataset from Table 1 into three parts:
- Normal TTL: answers with 320 ≤ TTL ≤ 333
- Decreased TTL: answers with TTL < 320
- Increased TTL: answers with TTL > 333
Table 2 shows the results. As can be seen, the great majority of probes/queries/resolvers fall into the normal category, meaning their TTL deviates from the original 333 by no more than 13 (since multiple probes can use the same resolver). In the following paragraphs, we consider first the decreased TTL answers and then the increased TTL answers, in an effort to understand how much the values are being changed by the resolvers in question, and why.
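The three-way split above amounts to a small classifier over the observed TTL of each answer; a sketch using the thresholds from the definitions above:

```python
# Classify each answer's observed TTL against the zone's configured 333 s.
CONFIGURED_TTL = 333

def classify(ttl: int) -> str:
    if 320 <= ttl <= CONFIGURED_TTL:
        return "normal"      # at most 13 s of ordinary cache decay
    if ttl < 320:
        return "decreased"   # resolver capped/overrode the TTL downwards
    return "increased"       # TTL above the configured value: a violation

print([classify(t) for t in (333, 325, 60, 86400)])
# -> ['normal', 'normal', 'decreased', 'increased']
```

The 13-second allowance exists because a second probe hitting an already-warm cache sees the record part-way through its TTL window, not because the resolver changed anything.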
| | Total | Normal TTL | Decreased TTL | Increased TTL |
|---|---|---|---|---|
| Unique probes | 9,119 | 8,894 | 190 (2.08%) | 274 (3.00%) |
| Unique resolvers | 6,587 | 6,480 | 130 (1.97%) | 275 (4.17%) |
| Unique probe-resolver pairs | 15,923 | 15,418 | 257 (1.61%) | 464 (2.91%) |
| Queries | 54,115 | 52,701 | 540 (1.00%) | 1,464 (2.71%) |
| Answers | 94,805 | 91,610 | 732 (0.77%) | 2,463 (2.60%) |
Table 2: Breakdown of results
Decreased TTL answers
As shown in Table 2, 0.77% of all the valid answers in this measurement were based on decreased TTLs. Figure 1 shows the results. As can be seen, two types of resolver dominate: those that cap the TTL at around 50s, and those that cap it at around 250-300s.
Out of the 130 resolvers that reduced TTLs, 71 reduced them to less than 50 seconds. Many of those, however, were local resolvers using private address space. Of the 71, 24 were not local resolvers but belonged to networks run by mobile operators and research institutes.
Of the 130 resolvers, 16 (with non-private addresses) reduced the TTL from 333 to between 250 and 320. No particular pattern was found here: several operators from various countries were doing the same thing. We also found cases where Google Public DNS (8.8.8.8) resolvers reduced the value, but given the large number of instances in their infrastructure, these are outliers.
Figure 1: Histogram of TTL values for the decreased TTL group (Table 2)
Increased TTL answers
As shown in Table 2, 4.17% of the resolvers actually increased the TTL values of our RRs.
Figure 2 shows an ECDF of TTL values for answers with TTLs of more than 333 (the configured value). The practice of increasing TTLs is particularly worrying, since it means that the resolvers in question will send their clients RRs that may already have expired in the corresponding zones.
Figure 2: ECDF of TTL values that were increased
Looking at the IP addresses of the resolvers that increased our TTLs, we see they are associated with certain ISPs and cloud providers. Earlier discussion and the associated feedback revealed various reasons why a cloud provider might reduce the TTL of an RR; increasing the TTL, however, only reduces traffic to the authoritative servers, and may put users at risk of being served expired answers.
The issue of DNS TTL violations is controversial, generating passionate arguments on both sides [2,5,6]. It is publicly known that some cloud providers and CDNs violate TTLs within their networks, overriding the original values provided by authoritative servers.
In this article, we report on the use of RIPE Atlas to ascertain whether TTL violation is happening in the wild. Although a small number of resolvers are known to violate TTLs, it is unclear how many users are affected.
A case can be made for reducing TTL values in certain circumstances, although the practice ultimately leads to the resolver querying the authoritative servers more often. Importantly, users are still served correct, up-to-date RRs when the TTL is reduced.
Increasing TTLs, on the other hand, may be dangerous to users, since they may be served with records that have expired. Consider, for example, the case of a domain that has been removed from a zone due to a phishing or malware attack: by extending the domain's TTL, a resolver will keep the domain alive for any client that looks up the domain during the extension.
Appendix: probes with increased TTL values