New research infrastructure to boost the security and resilience of .nl and the wider internet

A solid basis for SIDN Labs’ applied research

The original blog post is in Dutch; this is the English translation.

Over the months ahead, we will reorganise SIDN Labs' research infrastructure and relocate it from Arnhem to the data centre operated by the Nikhef research institute in Amsterdam. The infrastructure is our primary vehicle for contributing through research to further improvement of the security and resilience of .nl and the internet in the Netherlands and beyond. This blog describes the new set-up for the benefit of other research teams.

The starting point: our old research infrastructure

Since SIDN Labs was established, we have done our research on a separate on-premises infrastructure. The 3 main components of that infrastructure are:

  • Data platform: 14 servers that we use to store and analyse roughly 500 terabytes of internet measurements, including DNS data from the .nl name servers and a limited subset of the data from the Domain Registration System (DRS) for .nl. The platform has been instrumental in enabling us to develop guidance on providing authoritative DNS services for .nl and other DNS operators, for example. The platform is based on Hadoop.

  • Network testbed: a network for experimenting with new technologies that increase the security and stability of the internet’s core systems or address those fields in a fundamentally different way. The testbed is made up of BGPsec routers and routers for a SCION-based internet, amongst others.

  • VM platform: a standard system for managing virtual machines (VMs). We use the platform to develop and evaluate our prototypes, for example to visualise data or to control equipment in the network testbed.

For security reasons, our research infrastructure is separate from the rest of the SIDN network. We manage the data platform and network testbed ourselves, while other SIDN colleagues manage the VM platform and connections between the 3 components.

Why change our research infrastructure?

The main reason for rethinking our research infrastructure is that it is now quite outdated. For example, our data platform servers date from 2020 and some of the other equipment dates from as far back as 2017. This is a result of us often reusing equipment from SIDN’s production network for sustainability reasons. While that keeps down the capital cost of our infrastructure, we have to replace failing components ourselves, because they are no longer supported. Also, new technologies such as Trino and S3 are becoming de facto standards, displacing Hadoop. We therefore want to modernise our data analysis tool stack as well.

The second driver of change is that we want to isolate our research infrastructure more fully from the SIDN network, with the aim of further reducing the possibility of any unwanted impact on the integrity, confidentiality or availability of SIDN’s services.

Finally, our existing infrastructure isn’t well suited for our BGP security research agenda. For example, it isn’t directly linked to a major international internet exchange, meaning that we find it more difficult to gain experience with BGP and we don’t have access to a real-time BGP traffic flow of our own.

What requirements does the new research infrastructure have to meet?

Table 1 lists our 7 main requirements for the new research infrastructure, in no particular order. They differ from those that apply to SIDN’s production systems, such as the .nl domain registration system. For example, our research infrastructure doesn’t need to have the high availability of our production systems, but does need to handle much more data.

Table 1. Top-level requirements for the new research infrastructure.

Requirement | Category | The research infrastructure must…
R1 | Security | protect the integrity and confidentiality of the data platform, because we use it to process .nl data from SIDN’s production systems for research purposes. However, some downtime is acceptable, because SIDN Labs doesn’t operate any .nl production services.
R2 | Expertise | help us anchor new technical expertise throughout SIDN that enhances the security of .nl and the internet in the Netherlands and beyond, for instance in fields such as DNS management and data analysis.
R3 | Principles | contribute to a decentralised internet and to reinforcement of the strategic digital autonomy of the Netherlands and Europe, however small that contribution may be given the limited scale of our research infrastructure.
R4 | Performance | enable interactive and complex analyses of all our datasets, so that we’re able to carry out research projects efficiently and run experiments conveniently.
R5 | Adaptability | be readily adaptable to suit particular research projects and facilitate the sharing of configurations and tools within SIDN, with our research partners and with the wider internet community.
R6 | Management | minimise the infrastructure management burden on SIDN Labs researchers, such as updating services and systems and analysing system errors.
R7 | Cost | have a predictable cost model, so that researchers can experiment freely without feeling constrained by the possibility of incurring uncertain expenses associated with the traffic, storage and computation capacity required for new data analysis methods, for example.

Chosen approach: our own equipment at Nikhef in Amsterdam

The core of our approach is that we manage our new research infrastructure ourselves, as we do now, but relocate it to the data centre of research institute Nikhef in Amsterdam and equip it as much as possible with European equipment and open-source software. That approach will enable us to meet the requirements listed in Table 1 as follows.

First, using our own equipment and open-source software will maximise our adaptability (R5). We will also increase our expertise in technologies such as Kubernetes and the operational management of the infrastructure (R2), while also making a small contribution to a decentralised internet and to the digital autonomy of the Netherlands and Europe (R3).

In addition, we enable rapid, interactive data analyses (R4) through the co-location of computation and data storage in a single data centre. An alternative approach would be to buy remote data storage as a service utilising a high-speed optical interface. However, that would increase the complexity of the infrastructure and our management workload. Furthermore, at Nikhef we have direct access to a real-time flow of BGP data via direct links to internet exchanges such as AMS-IX and NLix.

We could alternatively have the entire data platform (computation and storage) hosted by a public cloud service provider. The disadvantage of that is that European service providers do not yet offer sufficiently mature managed versions of tools such as Apache Spark and Trino, which we require for complex data analyses (R4). US hyperscalers do offer them, but the use of their services would not be consistent with our wish to contribute to a decentralised internet and digital autonomy (R3). While the use of a public cloud service would reduce our infrastructure management workload, it would necessitate a ‘reserved instance’ for cost control reasons (R7), which we cannot obtain from a provider consistent with R3 and R4.

A drawback of our approach is that it provides limited redundancy (R1). For example, our infrastructure could be rendered unavailable if the Amsterdam area experienced a prolonged power outage. However, we regard that as an acceptable risk, because we’re not operating .nl production systems. Furthermore, Nikhef has standard data centre continuity provisions including an emergency power supply provided by a diesel generator. We will ensure integrity and confidentiality by taking standard measures and utilising SIDN’s ISO 27001 certification framework.

Continuing with an infrastructure of our own has the further disadvantage that we won’t be able to reduce our infrastructure management workload (R6). Therefore, 3 colleagues from SIDN's operational teams have recently joined the Labs team for a combined 1 FTE to help set up and manage the research infrastructure and promote SIDN-wide knowledge sharing (R2).

Finally, our chosen approach involves extra expense due to the extra FTE and the write-off of significant hardware investments. However, the associated annual costs are predictable (R7) and, when calculated over a 5-year period, similar to or slightly lower than the cost of migrating to a public cloud service provider (if there were one that met our requirements).

Our new technical design

Figure 1 outlines the design of our new infrastructure, which we divided across 2 racks (A and B) at Nikhef. We discuss it briefly below.

[Figure 1 shows two racks. Rack A, connected to upstream A, holds network A and the data platform: the compute cluster, storage cluster and app cluster. Rack B, connected to upstream B, holds network B, the network testbed (SCION router, P4 switch and TimeNL server) and the VM cluster of the VM platform. Networks A and B, which together form the backbone, are connected to each other.]

Figure 1: Design of our new research infrastructure.

Kubernetes-based data platform

Our new data platform will use Kubernetes, an open-source product like all the other data platform tools. Open-source Kubernetes will also be used for the new .nl registration system, meaning that we can share that knowledge throughout SIDN.

We will use Kubernetes-based tools including Apache Spark and Trino for complex data analyses, such as the combination of DNS queries and DMAP data. Another application is ENTRADA, our system for storing and analysing large volumes of DNS queries from the .nl DNS servers.
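To give a feel for what "combining DNS queries and DMAP data" can mean in practice, here is a minimal sketch of such a join. It is not our actual schema or tooling: the table names (dns_queries, dmap_crawl), the column names and the rows are all hypothetical, and Python's built-in sqlite3 module stands in for Trino so that the example runs anywhere. The SQL itself is plain standard SQL of the kind an engine such as Trino accepts, but it would need adapting to a real catalogue and schema.

```python
import sqlite3

# In-memory database as a stand-in for the query engine (assumption: sqlite3, not Trino).
conn = sqlite3.connect(":memory:")

# Hypothetical tables: one row per observed DNS query, one row per crawled domain.
conn.executescript("""
    CREATE TABLE dns_queries (domain TEXT, qtype TEXT);
    CREATE TABLE dmap_crawl (domain TEXT, has_website INTEGER);
""")

# Invented sample rows, purely for illustration.
conn.executemany("INSERT INTO dns_queries VALUES (?, ?)", [
    ("example.nl", "A"),
    ("example.nl", "AAAA"),
    ("example.nl", "MX"),
    ("voorbeeld.nl", "A"),
])
conn.executemany("INSERT INTO dmap_crawl VALUES (?, ?)", [
    ("example.nl", 1),
    ("voorbeeld.nl", 0),
])

# Combine both datasets: query volume per domain next to a crawl attribute.
QUERY = """
    SELECT q.domain, COUNT(*) AS query_count, c.has_website
    FROM dns_queries AS q
    JOIN dmap_crawl AS c ON c.domain = q.domain
    GROUP BY q.domain, c.has_website
    ORDER BY query_count DESC
"""
rows = conn.execute(QUERY).fetchall()

for domain, query_count, has_website in rows:
    print(domain, query_count, bool(has_website))
```

On a real platform the two tables would be far larger and stored as files in object storage, which is where a distributed engine becomes necessary; the shape of the query stays the same.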

The data platform will run on a cluster of 12 servers, where we’ll store our research data including DNS queries, DMAP data and a limited subset of DRS data. We will use MinIO S3 for that purpose, a popular storage technology for data analysis within the research community and elsewhere.

Proxmox-based VM platform

Our new VM platform will consist of 4 servers running Proxmox, an open-source virtualisation platform. We’re opting for a voluntary support contract to support Proxmox’s further development.

New time server added to network testbed

We’re adding a new stratum-1 NTP server to our network testbed. We’ll use it for our NTP research and it will form part of our public time service, TimeNL. The new server will receive a time signal from VSL, which provides the official Dutch time utilising caesium atomic clocks.
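For readers less familiar with NTP, the sketch below shows where the stratum value lives in an NTP packet. It builds a minimal client request and parses the 48-byte header of a reply, following the header layout of NTPv4. The function names are our own for this illustration, and the reply used in the example is synthetic (built by the hypothetical helper fake_server_reply), not a capture from a real server.

```python
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2_208_988_800


def build_request(version: int = 4) -> bytes:
    """Return a 48-byte client request: LI=0, the given version, mode=3 (client)."""
    first_byte = (0 << 6) | (version << 3) | 3
    return bytes([first_byte]) + bytes(47)


def parse_response(packet: bytes) -> dict:
    """Extract version, mode, stratum and transmit time from an NTP header."""
    if len(packet) < 48:
        raise ValueError("an NTP header is 48 bytes long")
    first_byte, stratum = packet[0], packet[1]
    # The transmit timestamp occupies bytes 40-47: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return {
        "version": (first_byte >> 3) & 0x7,
        "mode": first_byte & 0x7,
        "stratum": stratum,
        "transmit_unix": seconds - NTP_EPOCH_OFFSET + fraction / 2**32,
    }


def fake_server_reply(unix_time: float, stratum: int = 1) -> bytes:
    """Hypothetical helper: build a synthetic server reply (mode=4) for testing."""
    first_byte = (4 << 3) | 4
    seconds = int(unix_time)
    fraction = int((unix_time - seconds) * 2**32)
    header = bytes([first_byte, stratum]) + bytes(38)
    return header + struct.pack("!II", seconds + NTP_EPOCH_OFFSET, fraction)


reply = parse_response(fake_server_reply(1_700_000_000.5))
print(reply["stratum"], reply["transmit_unix"])
```

A real client would send the request over UDP port 123 and also use the other timestamps in the header to estimate offset and delay; that part is left out here to keep the sketch short.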

We will relocate the rest of our network testbed largely ‘as-is’.

Backbone of European network equipment

The backbone of our infrastructure consists of routers, firewalls and switches connecting the data platform, VM platform and network testbed to each other and to the internet. The equipment involved is European-made, and we’ll use SURF and NLix as upstreams to the internet.

Security measures

We’ve opted for a duplicated backbone design (networks A and B in Figure 1), so that we still have access to the whole infrastructure if 1 of our 2 upstreams goes down. For backups, we’ll use a Dutch service provider’s off-site S3 service.
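The effect of duplicating the upstreams can be illustrated with a small calculation. The sketch below assumes that the upstreams fail independently of each other, which is a simplification, and the availability figure used in the example is a made-up number, not a measurement of SURF or NLix.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability of a set of redundant paths, assuming independent failures.

    The infrastructure is unreachable only if every path is down at once,
    so the combined availability is 1 minus the product of the downtimes.
    """
    all_down = 1.0
    for availability in availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        all_down *= 1.0 - availability
    return 1.0 - all_down


# Hypothetical figure: two upstreams that are each reachable 99% of the time.
print(round(combined_availability(0.99, 0.99), 6))
```

Under those assumptions, two paths that are each down 1% of the time are down simultaneously only about 0.01% of the time. Correlated failures, such as a power outage affecting the whole data centre, are not covered by this kind of calculation.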

We'll assure the integrity and confidentiality of our infrastructure through standard measures such as 2FA and role-based access control for SIDN Labs researchers. We will audit these measures as part of SIDN’s ISO 27001 certification arrangements.

We’d love to hear your feedback!

Redesigning our research infrastructure has been a major undertaking. If you’d like to know more, don’t hesitate to reach out, because there’s lots more we could tell you.