Using machine learning to make the internet more secure
4 challenges we'll be tackling in the period ahead
The aim of the research done at SIDN Labs is to improve the security and resilience of the internet. And machine learning can make an important contribution to that aim. In this blog post, we outline what machine learning is and describe 4 challenges that we'll be tackling with the help of machine learning in the period ahead. Our preference is to do that in partnership with universities, companies and other top-level domains (TLDs). So, if you have any feedback on the ideas presented here, or you'd like to talk about working together in pursuit of shared ambitions, we'd love to hear from you.
What is machine learning?
Machine learning involves the automated extraction of rules and patterns from large volumes of data. One possible application is in a botnet detector: a system that finds botnets by analysing the network traffic of Internet of Things (IoT) devices. If the detector spots that a device is sending abnormal DDoS traffic, suggesting that it's been recruited to a botnet, the device can be automatically blocked. To do its job, a botnet detector needs rules for distinguishing normal network traffic from DDoS traffic.
There are two ways of formulating those rules (see figure 1). The traditional way involves knowledge-driven programming: a cybersecurity expert uses their knowledge of botnet-related network traffic to write detection rules. However, data-driven rule definition is an option as well. That involves an algorithm that automatically extracts the characteristics of botnet traffic by analysing historical traffic labelled as normal or suspicious. The latter approach to rule extraction is used in machine learning. In that context, the labelled data points are referred to as 'examples', and the resulting rules form a 'model'. Having an adequate number of examples is a precondition for developing a good model.
Figure 1a: Knowledge-driven programming*
Figure 1b: Data-driven programming*
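The contrast between the two approaches can be sketched in a few lines of Python. The traffic values and the 1,000 packets/s threshold below are invented purely for illustration; a real botnet detector would use far richer features than a single packet rate.

```python
# Knowledge-driven: an expert hand-writes the detection rule.
def expert_rule(packets_per_second):
    # Threshold chosen from the expert's knowledge of botnet traffic
    return packets_per_second > 1000

# Data-driven: extract the rule from labelled examples instead.
normal_examples = [3, 12, 45, 80, 150]       # packets/s in traffic labelled 'normal'
ddos_examples = [4000, 5200, 8000, 12000]    # packets/s in traffic labelled 'suspicious'

# The 'model' here is just a learned threshold: the midpoint between the
# highest normal rate and the lowest DDoS rate seen in the examples.
learned_threshold = (max(normal_examples) + min(ddos_examples)) / 2

def learned_rule(packets_per_second):
    return packets_per_second > learned_threshold
```

Note that the data-driven rule improves automatically as more labelled examples become available, without anyone rewriting the code.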
Great strides have been made with machine learning in recent years. We've seen the emergence of DeepMind's AlphaGo, for example, which can play Go better than the human champion. And smart home assistants, such as Google Assistant and Siri, have really taken off. Such advances are the result of more research being done in pursuit of better algorithms. That has been facilitated by the increasing availability of data to which machine learning algorithms can be applied. For detailed explanations of machine learning, refer to the second chapter of the Dutch-language whitepaper Tijd voor implementatie van verantwoorde datadiensten by TNO and the book Learning From Data.
Machine learning at SIDN Labs
At SIDN Labs, we're using machine learning to help increase the internet's security and resilience. In that context, machine learning is particularly useful for tasks where examples are available, but manual rule definition is difficult.
We've identified 4 such challenges, which we intend to tackle in the period ahead. The first 3 all involve strategies for reducing domain name abuse, while the 4th involves helping to protect the internet against large-scale incidents, such as DDoS attacks.
Challenge 1: Detection of fake webshops
Fake webshop detection has been on SIDN Labs' agenda for several years. We recently presented a poster about our work at ICT.OPEN2019. Fake webshops cause problems by harvesting credit card details and/or taking payments for goods that they never supply or that turn out to be counterfeit. Fortunately, we are getting better at detecting them, but we believe that further improvement is possible.
We observe that scammers sometimes change their tactics. We therefore want to keep abreast of the scammers' innovations by developing an adaptive detection model. That implies, for example, setting up a system that automatically evaluates and retrains detection models. We also expect that new fraud patterns can be picked up more quickly if analysts assess selected webshops. The challenge there is to establish which webshops to put forward for assessment in order to get the best results (active learning).
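One common active-learning strategy, known as uncertainty sampling, simply asks analysts to assess the webshops that the current model is least sure about, since their labels are the most informative for retraining. A minimal sketch (the domain names and scores below are invented):

```python
# Model scores between 0 and 1: close to 1 means 'probably fake'.
scores = {
    "cheap-sneakers.example.nl": 0.97,
    "flower-shop.example.nl": 0.04,
    "brand-outlet.example.nl": 0.55,
    "gadget-deals.example.nl": 0.48,
    "book-store.example.nl": 0.12,
}

def select_for_review(scores, k=2):
    """Uncertainty sampling: pick the k webshops whose score is closest
    to the 0.5 decision boundary, i.e. where the model is least certain."""
    ranked = sorted(scores, key=lambda domain: abs(scores[domain] - 0.5))
    return ranked[:k]
```

The webshops the model already scores confidently (0.97 or 0.04) are skipped; analyst time goes where it teaches the model the most.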
Another aim is to increase the accuracy of our detection models. By collaborating more with other actors in this field, we can broaden our perspective on the problem and eliminate blind spots. For example, we've already been involved in a pilot where credit card provider ICS Cards forwarded details of shops reported to it, and gave feedback on detections. In the future, we'd like to link up with other TLD registries and with registrars, so that we can all learn from one another to boost the effectiveness of our activities in this field. Finally, it would be useful to study the effectiveness of fake webshop countermeasures and to quantify the economic and other damage prevented, perhaps by collaborating with the Netherlands Authority for Consumers and Markets (ACM).
Challenge 2: Understanding the .nl zone better
At SIDN, we know a lot about how the .nl zone is used (see, for example, our market research and the dashboard on stats.sidnlabs.nl). However, we have relatively little insight into inter-sector differences in domain name security. That's because many businesses don't give their Trade Register numbers when registering domain names. As a result, we can't make direct links between domain names and the sectors that their registrants are active in (as indicated by 'SBI codes').
At SIDN Labs, we're working on a solution to that problem, which involves sector assignment on the basis of text classification. That will make it possible to assess the impact of DDoS attacks more accurately, for example. The first version of our classification model proved able to identify the site owner's primary sector correctly (out of a list of twelve options) in 65 per cent of cases. We regard that as a promising start, since the primary sector definition is somewhat arbitrary and often counter-intuitive. For example, there's a lot of overlap between printers' websites and publishers' websites. However, a printer belongs in the primary sector 'Industry', whereas a publisher comes under 'Information and communication'.
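The basic idea of sector assignment by text classification can be illustrated with a deliberately naive word-overlap classifier. The training snippets below are invented, and our actual model is of course far more sophisticated, but the principle is the same: learn which words are characteristic of which sector from labelled examples.

```python
from collections import Counter

# Tiny invented training set: website text snippets labelled with a sector.
training = [
    ("order books journals publishing editorial", "Information and communication"),
    ("news magazine publisher subscription", "Information and communication"),
    ("offset printing press industrial print", "Industry"),
    ("printing machinery production plant", "Industry"),
]

# Build a word-frequency profile per sector from the labelled examples.
profiles = {}
for text, sector in training:
    profiles.setdefault(sector, Counter()).update(text.split())

def classify(text):
    """Assign the sector whose training vocabulary best matches the text."""
    words = text.split()
    matches = {s: sum(c[w] for w in words) for s, c in profiles.items()}
    return max(matches, key=matches.get)
```

The printer/publisher overlap mentioned above shows why this is hard: many words appear in both sectors' profiles, so a real model needs richer features than raw word counts.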
In the future, we want to look into using information at various levels to increase accuracy and to segment domain names in more detail. We believe that the rapid development of text analysis methods and the availability of pre-trained neural networks, such as ULMFiT and ELMo, should make that possible.
Challenge 3: Detection of hacked websites
Websites are sometimes hacked and then used for malicious purposes. Hacked websites pose a threat to unsuspecting visitors, whose machines are liable to be infected by malware. At the moment, the detection of domain names linked to hacked websites is a reactive process: our operations teams act in response to incoming reports.
However, we want to look into the possibility of using machine learning for proactively identifying domains with hacked websites. The first step towards that goal will be searching for patterns associated with domain names that have been compromised in the past. Obvious places to start are the ENTRADA, OpenINTEL and DMAP databases. The detection of hacked websites ties in with a study that is looking for a way of distinguishing compromised legitimate domains from those registered for malicious purposes.
Hacked domain detection is complicated by various factors. For example, we anticipate that the time aspect will prove significant. Another problem is that the signals are hard to interpret. A peak in DNS lookups by access providers' resolvers may be caused by a spam campaign, or just as easily by a popular tweet. To explain the signals, we'll need to use sophisticated signal and pattern recognition techniques. It'll also be important to collaborate with others, because enhanced signal recognition may well require access to supplementary data sources held by social media companies, registrars and internet access providers.
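As a first, deliberately simple illustration of the kind of signal analysis involved: flag days on which a domain's DNS lookup count deviates strongly from its recent baseline. The lookup counts are invented, and, as noted above, a flagged peak on its own says nothing about the cause -- distinguishing a spam campaign from a popular tweet needs much more context.

```python
import statistics

def lookup_peaks(counts, window=7, z_threshold=3.0):
    """Return indices of days whose DNS lookup count lies more than
    z_threshold standard deviations above the mean of the preceding
    `window` days."""
    peaks = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline) or 1.0  # avoid division by zero
        if (counts[i] - mean) / stdev > z_threshold:
            peaks.append(i)
    return peaks

# A week of stable lookups, then a sudden spike on day 8.
daily_lookups = [100, 110, 95, 105, 100, 98, 102, 101, 1000, 104]
```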
Challenge 4: Detection of suspect IoT traffic
Our SPIN project is intended to make home networks more secure. An important element of that is temporarily blocking suspect network traffic associated with Internet of Things (IoT) devices, preventing their participation in DDoS attacks. Distinguishing suspect traffic from normal traffic is an active research field, where machine learning may prove useful.
Over the last year, several articles have been published describing the use of 'autoencoders' for the recognition of abnormal IoT network traffic. Autoencoders are special neural networks trained to compress and then reconstruct their input; because they learn the structure of typical data, they reconstruct abnormal data points poorly. A high reconstruction error therefore flags a data point as abnormal -- and thus suspect. We therefore plan to explore the scope for using this type of anomaly detection within SPIN. That will be a challenge, since relatively little realistic training data is available, and any detection method employed has to be efficient, because SPIN equipment often has limited computation power.
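A toy sketch of the reconstruction-error idea, using a linear autoencoder trained with plain gradient descent on invented two-dimensional 'traffic features'. A real detector would use deeper networks and many more features, but the principle is identical: normal traffic reconstructs well, a DDoS-like burst does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented normal IoT traffic: bytes/s roughly tracks packets/s, so the
# normal samples lie close to a 1-D structure in 2-D feature space.
t = rng.uniform(0.0, 1.0, size=(200, 1))
normal = np.hstack([t, 2.0 * t]) + rng.normal(0.0, 0.02, size=(200, 2))

# Linear autoencoder: encode 2-D -> 1-D -> decode back to 2-D.
W_enc = rng.normal(0.0, 0.1, size=(2, 1))
W_dec = rng.normal(0.0, 0.1, size=(1, 2))
lr = 0.05
for _ in range(2000):
    z = normal @ W_enc              # encode (compress)
    recon = z @ W_dec               # decode (reconstruct)
    err = recon - normal
    # Gradient-descent updates for the mean squared reconstruction error
    W_dec -= lr * (z.T @ err) / len(normal)
    W_enc -= lr * (normal.T @ (err @ W_dec.T)) / len(normal)

def reconstruction_error(x):
    x = np.atleast_2d(np.asarray(x, dtype=float))
    recon = (x @ W_enc) @ W_dec
    return float(np.mean((recon - x) ** 2))

typical = [0.5, 1.0]   # fits the learned pattern (bytes ~ 2 * packets)
burst = [0.5, 5.0]     # DDoS-like burst that breaks the pattern
```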
Furthermore, privacy is a central feature of SPIN: all data is processed locally, and nothing is forwarded to outside recipients. However, data sharing could significantly enhance an anomaly detection model. Collectively learning a model while assuring participants that sensitive training data remains confidential is therefore an attractive (but challenging) route to explore. This challenge ties in with other research into detection based on federated learning and with our vision of collective security.
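The flavour of federated learning can be conveyed in a few lines: each participant trains on its own data locally and shares only model parameters, which are then averaged centrally. The linear model and client data below are invented purely for illustration; real federated systems add safeguards such as secure aggregation on top of this basic loop.

```python
import numpy as np

rng = np.random.default_rng(42)
w_true = np.array([2.0, -1.0])  # the underlying pattern all clients' data share

# Three clients, each holding private local data that never leaves the device.
clients = []
for _ in range(3):
    X = rng.normal(size=(30, 2))
    clients.append((X, X @ w_true))

def local_update(w_global, X, y, lr=0.1, steps=100):
    """One client's local gradient-descent training on its own data."""
    w = w_global.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w -= lr * grad
    return w

# Federated averaging: only the weights are shared, never the raw data.
w = np.zeros(2)
for _round in range(3):
    updates = [local_update(w, X, y) for X, y in clients]
    w = np.mean(updates, axis=0)
```

After a few rounds, the averaged model recovers the shared pattern even though no client ever exposed its training data.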
Joint machine learning research
It is of course possible to think of other applications for machine learning that would help to increase internet security and resilience, in addition to those described above. And we would love to hear any ideas you may have.
We're also open to the idea of collaboration -- with other disciplines, say. Can you suggest an additional data source, alternative approach or machine learning expertise we might use, for example? If so, please drop me a line. My address is firstname.lastname@example.org.
*Source of figures: Deep Learning with Python (François Chollet)