In this blog, we present our results so far and briefly look ahead to what’s next on our roadmap.
Yet another crawler?
Although there are various open-source and commercially available crawler systems, we decided to develop our own.
The main reason is that it’ll give us full control over the measurement methodologies that the crawler system uses, allowing us to better interpret its measurements and to easily add new measurement modules. In our experience, this is more difficult with commercial crawler services.
Another important reason for building our own crawler system is that none of the existing systems fully met our requirements, for instance in terms of extensibility and integration with our data analysis systems, such as ENTRADA and nDEWS.
We’re designing our crawler system as an extensible platform. This allows us to easily add new features, such as network protocol support and classifiers that calculate high-level domain name attributes based on various measurements. While .nl is our use case, we’re designing the system so that it will also work for other top-level domains (TLDs).
Figure 1 shows that our system architecture consists of four types of component: a crawler manager, crawlers, classifiers, and an importer component. The dashboard and Hadoop components in Figure 1 together form one application, in this case for visualizing the crawler results. The arrows indicate a 'uses' relationship between components.
We decided to use queues to decouple the individual components. This makes it easier to distribute the components across multiple computing resources, which enhances redundancy and scalability. The components can also be configured individually. For instance, we can crawl the internet using 100 threads for the crawler instances (processes) but use only 10 threads for the classifiers. We use Redis as our queuing implementation.
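As a minimal sketch of this pattern, the snippet below uses Python's standard `queue.Queue` as a stand-in for Redis (with the redis-py client, `lpush` and `brpop` on a Redis list would play the same role). Crawler workers consume from a shared domain queue and publish to a result queue; the worker logic here is a stub, not our actual crawler code:

```python
import queue
import threading

# Stand-in for the Redis-backed queues; with redis-py the equivalent
# calls would be r.lpush("domains", name) and r.brpop("domains").
domain_queue = queue.Queue()
result_queue = queue.Queue()

def crawler_worker():
    """Read domain names from the domain queue and emit raw results."""
    while True:
        domain = domain_queue.get()
        if domain is None:          # sentinel: no more work
            break
        # A real crawler would fetch HTTP/TLS/DNS data here.
        result_queue.put((domain, f"<html>stub page for {domain}</html>"))

def run(domains, n_threads=4):
    """Start a configurable number of worker threads and drain the queue."""
    threads = [threading.Thread(target=crawler_worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for d in domains:
        domain_queue.put(d)
    for _ in threads:
        domain_queue.put(None)      # one sentinel per worker
    for t in threads:
        t.join()
    return [result_queue.get() for _ in range(result_queue.qsize())]

results = run(["example.nl", "sidn.nl"])
```

Because the workers only touch the queues, the same code runs unchanged whether the crawlers and classifiers live in one process or on separate machines, which is exactly what the decoupling buys us.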
Figure 1. Architecture of our crawler system.
The crawler manager provides a RESTful API, which is used to control the crawler lifecycle and retrieve status information.
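We do not document the API itself here, so the endpoints below are purely hypothetical; this small in-process model only illustrates the kind of lifecycle control and status reporting such a RESTful interface provides:

```python
# Minimal model of the manager's lifecycle; the paths and payloads are
# our own illustrative assumptions, not the actual API specification.
class CrawlerManager:
    def __init__(self):
        self.state = "idle"

    def handle(self, method, path):
        """Dispatch a (method, path) pair as a RESTful API would."""
        if (method, path) == ("POST", "/crawl/start") and self.state == "idle":
            self.state = "running"
            return 200, {"state": self.state}
        if (method, path) == ("POST", "/crawl/stop") and self.state == "running":
            self.state = "idle"
            return 200, {"state": self.state}
        if (method, path) == ("GET", "/crawl/status"):
            return 200, {"state": self.state}
        return 409, {"error": f"cannot {method} {path} in state {self.state}"}

mgr = CrawlerManager()
status, body = mgr.handle("POST", "/crawl/start")
```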
The importer will, at the start of a new crawling run, read a set of to-be-crawled domain names from the TLD operator’s domain name registration database. These domain names are then passed to the domain queue, which is shared with the crawler component.
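A sketch of this importer step, using an in-memory SQLite table as a stand-in for the registration database and a plain list in place of the Redis queue (the schema is an assumption for illustration only):

```python
import sqlite3

# Stand-in for the TLD operator's registration database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registrations (domain TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO registrations VALUES (?)",
                 [("example.nl",), ("sidn.nl",)])

# Push every registered domain onto the shared domain queue; with Redis
# this would be r.lpush("domains", domain) instead of a list append.
domain_queue = []
for (domain,) in conn.execute("SELECT domain FROM registrations"):
    domain_queue.append(domain)
```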
A crawler is a component that actively queries a set of domain names using a specific network protocol, currently HTTP, TLS, and DNS. Its output consists of raw data, such as HTML code or a TLS (SSL) certificate.
The crawler instances (threads) continuously read domain names from the domain queue, which contains all the domain names that have to be crawled. A crawler sends its output to the result queue for classifiers to consume. If a crawler is unable to crawl a domain name successfully, it re-inserts the name into the domain queue and tries again later.
Each crawler is highly multithreaded, which means it is able to quickly crawl large numbers of domain names simultaneously. To avoid overloading the services we crawl, our crawlers randomly select the next domain to crawl from the domain queue. In addition, they fetch only a minimal amount of data from each domain name. For example, for HTTP a crawler will typically fetch only 1 to 3 pages.
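Putting these behaviours together, one crawler iteration looks roughly like the sketch below: pick a random domain from the queue, fetch at most a few pages, and re-insert the domain on failure so it is retried later. `fetch_pages()` and the simulated failure are illustrative stand-ins, not our actual crawling code:

```python
import random

MAX_PAGES = 3                     # crawl only a handful of pages per domain

_failed_once = set()

def fetch_pages(domain, limit=MAX_PAGES):
    """Pretend fetcher: 'flaky.' domains fail on their first attempt."""
    if domain.startswith("flaky.") and domain not in _failed_once:
        _failed_once.add(domain)
        raise ConnectionError(domain)
    return [f"page {i} of {domain}" for i in range(1, limit + 1)]

def crawl_step(domain_queue, result_queue):
    """Crawl one randomly chosen domain; re-queue it if crawling fails."""
    domain = domain_queue.pop(random.randrange(len(domain_queue)))
    try:
        result_queue.append((domain, fetch_pages(domain)))
    except ConnectionError:
        domain_queue.append(domain)   # try again at a later time

domains = ["example.nl", "flaky.example.nl"]
results = []
while domains:
    crawl_step(domains, results)
```

Random selection spreads the load across unrelated hosting providers instead of hammering one provider's servers with consecutive requests, and the re-insertion gives transiently unreachable domains a second chance within the same crawling run.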
A classifier computes a higher-level attribute for a crawled domain name. For example, the content-type classifier analyses HTML code to determine the content type of a web page. Classifiers may also combine data attributes from multiple sources (HTML, TLS, DNS) or even use the output of other classifiers.
We have developed 24 different classifiers, each answering a different question such as:
- Is the TLS certificate valid?
- Is the web page a parking page?
- How many redirects have been detected?
- What is the website type? (personal, business, e-commerce, etc.)
- What software is used? (WordPress, Joomla, etc.)
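To give a flavour of how a classifier works, the sketch below derives one such attribute, the CMS in use, from raw HTML. The detection rule (reading the `generator` meta tag) is a simplified assumption for illustration, not our actual heuristics:

```python
import re

def classify_software(html):
    """Guess the CMS from the <meta name="generator"> tag, if present."""
    m = re.search(r'<meta\s+name="generator"\s+content="([^"]+)"', html, re.I)
    if not m:
        return "unknown"
    generator = m.group(1).lower()
    for cms in ("wordpress", "joomla", "drupal"):
        if cms in generator:
            return cms
    return "other"

html = '<html><head><meta name="generator" content="WordPress 6.4"></head></html>'
```

A real classifier would combine several signals (paths, cookies, script names) rather than rely on a single tag, since many sites strip the generator tag for security reasons.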
Like crawlers, classifiers are also highly multithreaded, which means they can analyse multiple crawler results simultaneously. A classifier stores its output in a PostgreSQL database. After all the domain names have been processed, we use a script to export all the results from the PostgreSQL database to an Apache Hadoop based database. This increases query performance for applications and enables interactive analytical dashboards.
The data in the Hadoop cluster is stored using the distributed file system (HDFS) combined with the Apache Parquet file format. We can analyse the data using the Apache Impala query engine. Impala is a distributed SQL compatible query engine, allowing us to quickly analyse the data and build interactive applications.
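The kind of analytical query this setup enables looks like the example below. We use SQLite here purely as a runnable stand-in for Impala, and the table name and columns are assumptions, not our actual schema; a similar aggregate query against the Parquet-backed table is what feeds the dashboards:

```python
import sqlite3

# Stand-in for the classifier results table in Impala.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE classifier_results (domain TEXT, content_type TEXT)")
conn.executemany(
    "INSERT INTO classifier_results VALUES (?, ?)",
    [("a.nl", "business"), ("b.nl", "business"), ("c.nl", "parking")],
)

# Share of each content type across the crawled domains.
rows = conn.execute(
    """SELECT content_type, COUNT(*) AS n
       FROM classifier_results
       GROUP BY content_type
       ORDER BY n DESC"""
).fetchall()
```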
To provide more insight into how the .nl TLD is being used, we have developed several dashboards. Figure 2 shows the 'content' dashboard, which presents information about the content type of web pages and the popularity of CMS and shopping cart software. This information is useful for multiple purposes. For example, we can use it to evaluate the impact of security incidents affecting certain types of CMS and shopping cart software. Domain name registrars can also use this data to gain a better insight into the usage of their domain name portfolios.
Figure 2. Content dashboard.
We implemented our dashboards using Metabase, a very nice tool that makes creating dashboards easy. The only problem we found is that it does not support Impala as a data source :-(. Luckily, Metabase is open source, and creating an Impala database driver for it was not that difficult. We have contributed the driver code for Impala support to the Metabase project, but we do not yet know whether it will be added to the official version of Metabase.
Note that the results in Figure 2 are an example based on the beta version of our crawler system and may be inaccurate.
Outlook: mapping the DNS
The crawler system presented above is a deliverable of a larger project called DNS-EMAP, which is short for DNS Ecosystem MAPper. The goal of DNS-EMAP is to map all objects related to the .nl ccTLD, such as domain names, DNS servers and websites, together with the relationships between them, producing a longitudinal map of the TLD's topology. This information will enable us to build new applications and services that further increase the security and stability of the .nl zone and the internet as a whole. The input for DNS-EMAP will come from multiple distinct data sources, such as our ENTRADA DNS database, RIPE Atlas probes and our crawler system.