Recursive resolver classification
Wrapping up my master's thesis
Recursive resolvers act as middlemen between clients and DNS name servers. Operators of authoritative name servers want a better understanding of the recursive resolvers that query them, for example so that they can optimise their own services. Building a classifier for recursive resolvers was therefore the goal of the research I did for my master's thesis at SIDN Labs.
Resolvers can serve a variety of clients, ranging from end users who want to visit their favourite video streaming websites to scripts that crawl the internet for marketing or research purposes. A thorough understanding of which resolvers are most important allows operators of authoritative DNS services (such as SIDN) to understand how they should set up their server infrastructures to optimise interaction with those resolvers so as to provide the best possible service to the clients using them. Also, knowing the origins of the resolvers allows researchers to measure the adoption of new technologies in the DNS and could even enable us to estimate the number of users impacted by major changes to the DNS, such as the Root KSK rollover. Like my colleagues at .nz, I have been working on a project that involves the classification of recursive resolvers to increase our understanding of the aforementioned issues. The main difference between my project and the .nz project is that I sought not only to differentiate “real” recursive resolvers from resolvers used for monitoring purposes, but also to identify various additional kinds of resolver, such as cloud providers' resolvers, ISP resolvers and so on.
Dataset creation and feature selection
I classified recursive resolvers based on query data collected on the .nl name servers. In principle, however, data collected on any large authoritative name server should be adequate. Recursive resolvers follow various patterns when querying .nl domain names. For example, while 82 per cent of the queries are sent by 20 per cent of resolvers for A or AAAA records, some resolvers query almost exclusively for NS records. I collected data on twenty-seven distinctive features of nearly 1.4 million unique resolvers over the course of a single day. I also mapped IP addresses known to belong to specific companies to the sectors those companies serve, which gave me seven sector types: ISPs, hosting companies, cloud providers, IT firms, research foundations, telecommunications companies and open resolvers. That labelled dataset served as my ground truth.
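To make the idea of per-resolver features concrete, here is a minimal sketch of one such feature: the share of each query type in a resolver's traffic. The input format (pairs of resolver IP address and query type), the function name `query_type_shares` and the sample data are assumptions made for illustration; they are not the format or code used in the thesis.

```python
from collections import Counter, defaultdict


def query_type_shares(queries):
    """Return {resolver_ip: {qtype: share of that resolver's queries}}.

    `queries` is an iterable of (resolver_ip, qtype) pairs. This record
    layout is hypothetical and only serves the example.
    """
    counts = defaultdict(Counter)
    for resolver_ip, qtype in queries:
        counts[resolver_ip][qtype] += 1

    shares = {}
    for resolver_ip, per_type in counts.items():
        total = sum(per_type.values())
        shares[resolver_ip] = {qtype: n / total for qtype, n in per_type.items()}
    return shares


# Made-up sample log using documentation-only IP addresses.
sample = [
    ("192.0.2.1", "A"), ("192.0.2.1", "AAAA"),
    ("192.0.2.1", "A"), ("192.0.2.1", "A"),
    ("198.51.100.7", "NS"), ("198.51.100.7", "NS"),
]
shares = query_type_shares(sample)
```

In this toy data, the first resolver mostly asks for A records, while the second asks only for NS records, which is the kind of contrast described above.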
Figure 1 - Companies and their traffic percentages on .nl NSs in March 2019
The pie chart in Figure 1 shows each company's share of the traffic on the .nl name servers in March 2019. I categorised the resolvers manually, depending on the type of autonomous system they belong to. This manual analysis shows that ISPs, large open DNS services, cloud firms and IT-related companies together account for half of the traffic handled by the .nl name servers. Next, I used the labelled data, consisting of twenty-seven feature columns and 39,361 unique IP addresses, to analyse the relevance of each feature. In view of the results, I decided to use the fifteen best features for the classification, in order to reduce the dimensionality of the dataset and prevent overfitting. The most significant features are the operating system used (identified from the TTL field of the IP packet), whether the resolver requests DNSSEC information, and whether the resolver requests certain record types.
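As an illustration of how an operating system can be guessed from the TTL field, the sketch below uses a widely known heuristic: round the observed TTL up to the nearest common initial value (64 is a typical default on Linux and other Unix-like systems, 128 on Windows, 255 on some network equipment). The function names and the label strings are assumptions for this example, and the heuristic is not necessarily the exact method used in the thesis.

```python
# Common initial TTL values and the OS family usually associated with them.
# These labels are a rough heuristic, not a definitive fingerprint.
OS_BY_INITIAL_TTL = {
    32: "unknown or legacy stack",
    64: "Linux/Unix-like",
    128: "Windows",
    255: "network device or other",
}


def guess_initial_ttl(observed_ttl):
    """Round an observed IP TTL up to the nearest common initial value."""
    if not 0 <= observed_ttl <= 255:
        raise ValueError("TTL must be between 0 and 255")
    for initial in (32, 64, 128, 255):
        if observed_ttl <= initial:
            return initial


def guess_os_family(observed_ttl):
    """Map an observed TTL to a coarse OS family label."""
    return OS_BY_INITIAL_TTL[guess_initial_ttl(observed_ttl)]


# A packet that started at 64 and crossed 7 hops arrives with TTL 57.
linux_guess = guess_os_family(57)
windows_guess = guess_os_family(116)
```

The guess can be wrong when a host uses a non-default TTL or when the path is unusually long, so a feature like this is best treated as a hint rather than a certainty.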
To finalise the research, I evaluated the performance of different classifiers. Table 1 shows the F1 scores of all the algorithms used. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. Of the various algorithms that are popular for internet packet classification, the random forest algorithm achieved the best F1 score for all class types. It was therefore used as the main algorithm for the analysis of unlabelled data.
Table 1 - F1 score of each classifier for each class type
For some classes, I had fewer training examples than for others, which might have had a negative impact on the classification. The quality of the labels also differed: the ground truth for open resolvers consisted of precise IP addresses obtained from open resolver companies, whereas I mapped the IP addresses of research, telecommunications and hosting companies to their sectors manually. This ultimately resulted in 98 per cent accuracy for the open resolver class and rather lower accuracies for the other classes. Nonetheless, creating this ground truth allowed me to measure the accuracy of the classification algorithms that were used in the research.
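Per-class scores such as those in Table 1 can be computed from true and predicted labels. The sketch below derives precision, recall and F1 (the harmonic mean of the two) for each class. The function name `f1_per_class` and the tiny label lists are made up for illustration; they are not data or code from the thesis.

```python
def f1_per_class(y_true, y_pred):
    """Return {class_label: F1 score} for every label seen in either list."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall; 0.0 when both are zero.
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


# Made-up labels: one ISP resolver is misclassified as a cloud resolver.
y_true = ["isp", "isp", "cloud", "open"]
y_pred = ["isp", "cloud", "cloud", "open"]
scores = f1_per_class(y_true, y_pred)
```

Because each class gets its own score, a class with few or noisy training examples shows up as a low F1 value instead of being hidden in an overall average.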
Figure 2 - Number of IP addresses in each class on 20 March 2019 and 22 May 2019
I ran the classifier on two separate days; Figure 2 shows the key results. Resolvers classified as belonging to ISPs were most common on both days, followed by resolvers run in cloud environments and public resolving services. In the future, we might see a shift towards public resolving services if DNS over HTTPS becomes more widely deployed in applications.
Future work and conclusion
To conclude, the research achieved its goals, but it became clear that a classification that is 100 per cent accurate is rarely possible. My hope is that my research will suggest new angles to other researchers, draw attention to the subject and thus lead to the improvement of online DNS services. These results need to be treated with a degree of caution, however. My ground truth was both biased and ambiguous. For example, the same autonomous system may host both a small enterprise's recursive resolver and an open resolver. An important focus for future work is therefore to collect enough labelled IP addresses for each class, so that every class is better represented in the classifier's training data. For a detailed account of the research, take a look at my thesis. You can also e-mail your questions and/or opinions to email@example.com or firstname.lastname@example.org.