Final study project
As part of my Computer Science course at Delft University of Technology, I therefore decided to look into the possibility of detecting fraudulent webshops before they start trading. The research formed my final study project, and was carried out at SIDN Labs. If a way could be found to predict whether a newly registered domain name was going to be used for the sale of fake goods, that would open the way for preventive intervention to protect consumers.
I came up with the idea of building a data model using a combination of registration data and infrastructure metrics, which could then be used to make predictions about new domain name registrations.
Data-based fraud prediction
Whereas existing anti-fraud strategies focus on identifying scam webshops by analysing site content, the goal of my research was a way of predicting what a site's content is going to be before it goes live. That implied relying on the data that's available immediately following registration of the domain name.
Reuse of recently cancelled domain names
First, we looked at the information provided by the registrant. The registrant's name, address, phone number and e-mail address, plus the time of registration, were used to build a profile, which was analysed against historical data to calculate the probability of fraud. The history of the registered domain name was also considered. That's relevant because it's common for scam webshops to reuse recently cancelled domain names. Scammers probably hope to take advantage of the fact that domain names that have been in use for a while do better in terms of positioning in search engine results. Many scam webshops consequently have domain names that bear no relation to what they are selling. Expensive shoes might be sold by a site whose domain name relates to a former art gallery, for example. Or a webshop offering designer jeans might have a domain name that once belonged to a housing association.
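To make the idea of a registration profile concrete, here is a minimal sketch of how the fields mentioned above might be turned into features. The `Registration` class, the feature names and the 90-day reuse window are illustrative assumptions of mine, not the definitions used in the actual research.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Registration:
    """Hypothetical container for the data available right after registration."""
    domain: str
    registrant_name: str
    email: str
    phone: str
    registered_at: datetime                      # timezone-aware timestamp
    previously_cancelled_at: Optional[datetime] = None  # None if never registered before


def registration_features(reg: Registration, reuse_window_days: int = 90) -> dict:
    """Derive simple profile features from one registration.

    The 90-day default for 'recently cancelled' is an assumed value.
    """
    registered_utc = reg.registered_at.astimezone(timezone.utc)
    email_domain = reg.email.rsplit("@", 1)[-1].lower()

    recently_cancelled = False
    if reg.previously_cancelled_at is not None:
        gap = reg.registered_at - reg.previously_cancelled_at
        recently_cancelled = timedelta(0) <= gap <= timedelta(days=reuse_window_days)

    return {
        "email_domain": email_domain,
        "phone_prefix": reg.phone[:3],
        "hour_utc": registered_utc.hour,
        "weekday": registered_utc.weekday(),     # Monday is 0
        "recently_cancelled": recently_cancelled,
    }
```

Each feature dictionary can then be compared against historical registrations that are known to be fraudulent or legitimate.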
Webhosting data from OpenINTEL
In addition to registration data, information about the infrastructure used by a site can provide valuable pointers. SIDN's records include details of the registrar used to register the domain name, as well as the associated name servers. However, SIDN doesn't hold the web hosting details. In order to include that data in the prediction process, I turned to OpenINTEL. OpenINTEL is an active DNS measurement platform: large parts of the Domain Name System, including the .nl zone, are scanned on a daily basis and the results are archived. By drawing on OpenINTEL records, the address of the web server could be included in the data model. The registrar's name and the addresses of the name servers and web server together form an infrastructure profile that I expected to provide further pointers as to the probability of fraud.
My supposition was that certain service providers would be attractive to scammers due to a combination of low prices, high convenience and a slack attitude to abuse reports.
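The infrastructure profile and the supposition about attractive service providers can be illustrated with a short sketch. Everything here is an assumption for illustration: the function names, the choice to reduce the web server's IPv4 address to its /24 network, and the simple per-provider fraud rate are mine and are not taken from the research itself.

```python
import ipaddress
from collections import defaultdict


def infrastructure_profile(registrar: str, name_servers: list, web_ip: str) -> dict:
    """Combine registrar, name servers and web server address into one profile.

    Assumes an IPv4 web server address; grouping by /24 is an illustrative choice.
    """
    network = ipaddress.ip_network(f"{web_ip}/24", strict=False)
    return {
        "registrar": registrar,
        "name_servers": tuple(sorted(ns.lower().rstrip(".") for ns in name_servers)),
        "web_network": str(network),
    }


def fraud_rate_by(key: str, labelled: list) -> dict:
    """Share of fraudulent domains per value of one profile field.

    `labelled` is a list of (profile, is_fraud) pairs from historical data.
    """
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for profile, is_fraud in labelled:
        totals[profile[key]] += 1
        frauds[profile[key]] += int(is_fraud)
    return {value: frauds[value] / totals[value] for value in totals}
```

A provider with a fraud rate well above the overall average would be one of the "attractive" providers the supposition refers to.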
Training the model
The various items of data referred to above were collected for two groups of domain names: one group that had recently been used for scam webshops and one group that had been used for legitimate purposes. The two datasets were then used to 'train' a prediction model. That involved analysing the data with a machine-learning algorithm. The outcome was a model capable of making predictions about how recently registered domain names are likely to be used. Over time, both the registration profiles and the infrastructure profiles associated with fraud are likely to change. The model therefore requires regular retraining on the basis of the latest data. To that end, details of correctly identified fraudulent domain names can be used to build a dataset for subsequent model retraining. A continuous cycle is thus established, enabling the model to adapt to new trends and developments. Training is based exclusively on domain names registered in the last two months, so that the model is not pointlessly complicated by the inclusion of registration profiles and infrastructure profiles that are no longer indicative of fraud.
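The retraining cycle can be sketched as follows. Note the assumptions: the text does not name the machine-learning algorithm, so a small categorical naive Bayes classifier stands in for it here, the two-month window is approximated as 60 days, and the record layout (`registered`, `fraud`, `features`) is hypothetical.

```python
import math
from collections import Counter, defaultdict
from datetime import date, timedelta


def recent_examples(examples: list, today: date, window_days: int = 60) -> list:
    """Keep only labelled examples registered within the training window."""
    cutoff = today - timedelta(days=window_days)
    return [e for e in examples if cutoff <= e["registered"] <= today]


class NaiveBayes:
    """Stand-in classifier over categorical features (not the original model)."""

    def fit(self, examples: list) -> "NaiveBayes":
        self.class_counts = Counter(e["fraud"] for e in examples)
        self.value_counts = defaultdict(Counter)   # (label, feature) -> value counts
        self.vocab = defaultdict(set)              # feature -> values seen in training
        for e in examples:
            for feat, val in e["features"].items():
                self.value_counts[(e["fraud"], feat)][val] += 1
                self.vocab[feat].add(val)
        return self

    def fraud_probability(self, features: dict) -> float:
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for feat, val in features.items():
                seen = self.value_counts[(label, feat)][val]
                # Laplace smoothing, with one extra slot for unseen values.
                score += math.log((seen + 1) / (n + len(self.vocab[feat]) + 1))
            log_scores[label] = score
        peak = max(log_scores.values())
        weights = {label: math.exp(s - peak) for label, s in log_scores.items()}
        return weights.get(True, 0.0) / sum(weights.values())


def retrain(examples: list, today: date) -> NaiveBayes:
    """Fit a fresh model on the most recent window of labelled registrations."""
    return NaiveBayes().fit(recent_examples(examples, today))
```

Calling `retrain` on a schedule, with newly confirmed fraudulent domain names appended to `examples`, gives the continuous cycle described above: older profiles drop out of the window as new ones enter.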
In order to assess the accuracy of the prediction model, we input all registrations made in the first six months of 2018. For the experiment, the model was retrained daily using the latest information. Of the domain names flagged up by the model as suspicious, 85 per cent did indeed appear to be fake product outlets. In 12 per cent of the suspicious cases, it was not possible to establish what the domain name was actually used for. Predictions of abuse turned out to be demonstrably false in just 3 per cent of cases.
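Because 12 per cent of the flagged domain names could not be verified either way, the figures above bound the model's precision rather than pin it down exactly. The short sketch below works that out from the percentages reported here; the function name is my own.

```python
def precision_bounds(confirmed: float, unknown: float, false_alarms: float) -> tuple:
    """Lower and upper bound on precision among flagged domain names.

    Lower bound: every unverifiable case counts as a false alarm.
    Upper bound: every unverifiable case counts as a correct prediction.
    """
    flagged = confirmed + unknown + false_alarms
    return confirmed / flagged, (confirmed + unknown) / flagged


# Percentages of flagged domain names reported in the text above.
low, high = precision_bounds(confirmed=85, unknown=12, false_alarms=3)
```

With the reported figures, the precision lies somewhere between 85 and 97 per cent, depending on what the unverifiable domain names were actually used for.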
All registrations made between April 2016 and August 2018 were analysed using the procedure described. That gave us a dataset of more than 30,000 scam webshops. Analysis of the dataset yielded a number of interesting additional findings. First, two thirds of all fraudulent webshops studied were the product of just twelve campaigns (where we define a campaign as a group of registrations that use the same procedure to generate the details needed to register a domain name).
The majority of fraudulent registrations appear to be part of coordinated operations. For example, we observed repeatedly that campaigns would abruptly switch registrars or hosting service providers. Some registrations also appeared to be automated, with large numbers of domain names all registered at the same time. On the other hand, spelling and typing errors were common in the registration data, suggesting manual entry.
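One way to picture the campaign definition given above is to group registrations by a fingerprint of how their details were generated. The sketch below is purely illustrative: using the "shape" of the e-mail address as the fingerprint is my own assumption, and the research may well have identified campaigns differently.

```python
import re
from collections import defaultdict


def fingerprint(email: str) -> tuple:
    """Reduce an e-mail address to its shape plus provider.

    Runs of letters become 'a' and runs of digits become '9', so
    'john.smith4821@mail.example' yields ('a.a9', 'mail.example').
    """
    local, _, provider = email.lower().rpartition("@")
    shape = re.sub(r"[a-z]+", "a", local)
    shape = re.sub(r"[0-9]+", "9", shape)
    return shape, provider


def group_by_fingerprint(registrations: list) -> dict:
    """Group registrations that share the same generation pattern."""
    groups = defaultdict(list)
    for reg in registrations:
        groups[fingerprint(reg["email"])].append(reg)
    return groups


def top_campaign_share(groups: dict, k: int) -> float:
    """Fraction of all registrations covered by the k largest groups."""
    sizes = sorted((len(members) for members in groups.values()), reverse=True)
    return sum(sizes[:k]) / sum(sizes)
```

Applied to a set of confirmed fraudulent registrations, `top_campaign_share(groups, k=12)` would express the kind of concentration reported above, under whatever fingerprint is chosen.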
Looking at when domain names are registered for fraudulent webshops, it's apparent that most malicious registrations are made on weekdays, just like legitimate registrations. However, there's a noticeable difference where the time of day is concerned: malicious registrations are typically made between midnight and 10am (UTC). That suggests that the registrations are initiated outside Europe, in a time zone where that window coincides with the working day. The China Standard Time zone (UTC+8) is a likely candidate. The malicious registration time window that we observed corresponds to the period 8am to 6pm in that time zone. What's more, analysis of weekly registration numbers reveals that there is one week in each year when almost no domain names are registered for scam webshops: the week of the Chinese New Year festival.
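The time zone reasoning is easy to check with a few lines of code. This is a small sketch with function names of my own choosing; it simply shifts UTC timestamps to UTC+8 and counts how many fall inside the midnight-to-10am UTC window.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# China Standard Time as a fixed UTC+8 offset (no daylight saving time).
CST = timezone(timedelta(hours=8))


def to_cst(timestamp: datetime) -> datetime:
    """Convert a timezone-aware timestamp to UTC+8."""
    return timestamp.astimezone(CST)


def share_in_window(timestamps: list, start_hour: int = 0, end_hour: int = 10) -> float:
    """Fraction of registrations made in the given UTC window (end exclusive)."""
    hits = sum(
        1 for t in timestamps
        if start_hour <= t.astimezone(timezone.utc).hour < end_hour
    )
    return hits / len(timestamps)


def local_hour_histogram(timestamps: list) -> Counter:
    """Number of registrations per hour of the day in UTC+8."""
    return Counter(to_cst(t).hour for t in timestamps)
```

Midnight UTC maps to 8am and 10am UTC maps to 6pm in UTC+8, which is the correspondence described above.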
Using the prediction method described, it's possible to identify scam webshops before they start trading. At present, 85 per cent of the domain names flagged by the model do indeed appear to be scam webshops, and there is scope for improving on that. Studying the detected fraud campaigns can provide inspiration for the development of new features. It would also be interesting to investigate whether the approach is transferable to the detection of other forms of domain name abuse.