ENTRADA 2.0 is here!
Over the last few years, we've made extensive use of ENTRADA, our DNS big-data platform, and various other European ccTLD registries use it as well. However, ENTRADA has changed very little since it was launched: we've added a few minor features and resolved a few bugs, but that's all. On the other hand, we've learnt a lot about what works well and what could be improved. And technology has moved on, opening up possibilities that we'd like to take advantage of. So we've recently been working on a major update: ENTRADA 2.0. Full details are given below.
At SIDN Labs, we do a lot of research aimed at bolstering the security and stability of the internet. We look at ways of detecting and countering phishing, fake webshops and botnets, for example. Much of that research makes use of DNS data: the queries and replies processed by our authoritative name servers. Our name servers save DNS data in pcap, a standard format for the storage of network data. The pcap files can then be imported into an analysis tool, enabling us to study the data. That set-up works fine, as long as the volume of data involved isn't too great. However, a lot of the research we want to do depends on very large amounts of pcap data, and that implies having a large-scale analysis solution.

In 2014, therefore, we developed ENTRADA, whose name stands for ENhanced Top-level domain Resilience through Advanced Data Analysis. ENTRADA is a tool that converts pcap data to Parquet, a column-based format whose files can be read by various SQL engines. To enable the analysis of large volumes of data, we chose Hadoop in combination with Impala. Those two systems allow for the efficient processing of collated DNS data. However, considerable Hadoop know-how is required in order to build and operate a cluster. The ENTRADA database currently has more than a trillion rows, and our Hadoop cluster consists of nine data nodes with a total storage capacity of 240TB. The hardware is getting old, though, so we feel it's time to think about an upgrade, or maybe an alternative solution.
One alternative to having our own Hadoop cluster is to use third-party cloud services. That saves us buying our own hardware and relieves us of the burden of managing the set-up, so we can focus on our core activity: research. Athena is a serverless query service provided by Amazon Web Services (AWS). With Athena, large volumes of Parquet data can be analysed without the user needing their own physical or virtual servers. The data is saved on S3, an AWS storage service. Athena can read Parquet files stored on S3, meaning that we can continue using the data collected so far without converting it to another format.

Here at SIDN Labs, we're still using ENTRADA in combination with Hadoop. Nevertheless, support for a serverless SQL service such as AWS Athena has many advantages. The two main pluses are that, with a serverless ENTRADA deployment, you don't need any Hadoop know-how and don't have to invest in hardware. So getting started with ENTRADA is much easier for new users.
The main new features of ENTRADA 2.0 are:

1. A new architecture based on Spring Boot
2. Simple installation using Docker
3. Support for AWS S3 and Athena
4. A network monitor with round-trip time (RTT) analysis

Previous versions of ENTRADA were largely Java-based and included a number of shell scripts for data processing and workflow management. ENTRADA 2.0 is entirely Java-based, using the Spring Boot framework. Installing previous versions of ENTRADA could be quite a challenge for new users; ENTRADA 2.0 therefore makes use of Docker, hugely simplifying initial installation.

The most significant new feature, however, is support for AWS S3 and Athena. ENTRADA can import pcap data locally, or via Hadoop HDFS or AWS S3. It's then possible to store the resulting Parquet files on Hadoop HDFS or AWS S3. That flexibility means existing users can switch easily between their own Hadoop clusters and the AWS Cloud.

The last neat new feature is a network monitor with round-trip time analysis. Users will be able to keep an eye on the quality of the connection between DNS resolvers and the authoritative name servers. The monitor will track the time it takes to set up a TCP connection, the so-called TCP handshake (see figure 1). We'll measure the interval between the time that the server sends the SYN-ACK and the time that the response ACK is received from the client. That interval gives insight into the quality of the connection between the resolver and the name server. For an example of the RTT monitor's output, see figure 2. The plot shows the median number of milliseconds required to set up a TCP connection. DNS operators can use RTT monitors as part of their general service monitoring.
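The RTT measurement described above (the interval between the server's SYN-ACK and the client's ACK, with the monitor plotting the median) can be sketched as follows. This is only an illustrative calculation, not ENTRADA's actual code, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch of the RTT calculation: the round-trip time for one
// TCP handshake is the interval between the moment the server sends its
// SYN-ACK and the moment the client's ACK arrives. These helpers are
// hypothetical and not part of ENTRADA's API.
public class RttSketch {

    // RTT in milliseconds for a single handshake.
    public static long handshakeRttMillis(long synAckSentMillis, long ackReceivedMillis) {
        return ackReceivedMillis - synAckSentMillis;
    }

    // Median of a series of RTT samples, as plotted by the monitor.
    public static double medianMillis(long[] rtts) {
        long[] sorted = rtts.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }
}
```

The median is used rather than the mean because a handful of slow resolvers would otherwise distort the picture of typical connection quality.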
Good to know
AWS Athena uses a pricing model in which users are charged on the basis of the amount of data scanned to answer their SQL queries. Costs can therefore escalate if you are making active use of a large dataset. So we advise estimating how many SQL queries you will be sending, and how much data will have to be scanned in order to answer them. You can then decide whether you are better off using AWS or setting up your own Hadoop cluster.
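Because Athena bills per amount of data scanned, the estimate we recommend is simple arithmetic. The sketch below assumes a flat price per terabyte scanned; the figure used is only a placeholder, so check the current AWS price list before relying on it.

```java
// Back-of-the-envelope monthly cost estimate for Athena, which bills per
// byte scanned. PRICE_PER_TB is an assumed placeholder value, not an
// official price; consult the current AWS price list.
public class AthenaCostSketch {
    static final double PRICE_PER_TB = 5.0; // assumed USD per TB scanned

    // queriesPerMonth: expected number of SQL queries per month
    // tbScannedPerQuery: average terabytes each query scans
    public static double monthlyCostUsd(int queriesPerMonth, double tbScannedPerQuery) {
        return queriesPerMonth * tbScannedPerQuery * PRICE_PER_TB;
    }
}
```

For example, 500 queries a month that each scan 0.2TB would come to roughly 500 × 0.2 × 5 = 500 USD at the assumed rate. Note that partitioning the Parquet data (as ENTRADA does) reduces the amount scanned per query, and so the cost.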
Getting started with ENTRADA
Getting started with ENTRADA 2.0 is easy. All the information you need is available from the ENTRADA GitHub page. If you need any additional help or clarification, or if you'd like to suggest new functionalities, feel free to drop us a line.