EP4136592A1 - Malicious domain hosting type classification systems and methods

Info

Publication number: EP4136592A1
Application number: EP21787700.0A
Authority: EP
Inventors: Mohamed Nabeel; Issa Khalil; Ting Yu
Original assignee: Qatar Foundation for Education Science and Community Development
Current assignee: Qatar Foundation for Education Science and Community Development
Priority date: 2020-04-13
Filing date: 2021-04-13
Publication date: 2023-02-22
Also published as: CN115812200A; WO2021210998A1; JP2023525653A; AU2021257379A1; EP4136592A4

Abstract

The present application provides a software-based classifier built on a machine learning model that distinguishes between two kinds of malicious URL hosting apex domains: public and private. This classification helps security professionals specify which domain levels to block, the whole apex domain in the case of private apexes or specific subdomains in the case of public ones. The classifier is also built on a machine learning model that differentiates attacker-owned hosting domains from compromised hosting domains. This distinction is crucial to help security operators take the appropriate mitigation actions. For example, attacker-owned domains could be blocked permanently whereas compromised ones temporarily.

Description

TITLE

MALICIOUS DOMAIN HOSTING TYPE CLASSIFICATION SYSTEMS AND METHODS

PRIORITY CLAIM

[0001] The present application claims priority to and the benefit of U.S. Provisional Application 63/009,151, filed April 13, 2020, the entirety of which is herein incorporated by reference.

TECHNICAL FIELD

[0002] The present application relates generally to domain classification. More specifically, the present application provides a software-based classifier built on a machine learning model that distinguishes between public and private malicious URL hosting apex domains.

BACKGROUND

[0003] Every week millions of users are tricked into accessing malicious websites from which miscreants launch various attacks, including phishing, spam, and malware. Even with recent advances in techniques and tools to detect malicious websites, many malicious websites go undetected or are detected long after the damage is done. One key reason for this negative trend is that, instead of registering their own domains, attackers are increasingly hosting their websites on infrastructures they do not own, thus evading detection by current reputation systems. While detection of malicious websites, especially phishing and malware websites registered by attackers, has been extensively studied, very little has been done to analyze how these malicious websites are hosted. Early knowledge of what hosting type a malicious URL is coming from helps security operators take appropriate actions.

[0004] Appropriate mitigation actions against a malicious website may differ greatly depending on how that site is hosted. If it is hosted under a private apex domain, where all its subdomains and pages are under the direct control of the apex domain owner, the malicious website could be blocked at the apex domain level. If it is hosted under a public apex domain (e.g., a web hosting service provider), it would be more appropriate to block at the subdomain level. Further, for the former case, the private apex domain may be legitimate but compromised, or may be attacker-generated, which, again, would warrant different mitigation actions. Attacker-owned apex domains could be blocked permanently, while compromised domains may be blocked only temporarily.
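The blocking rules described above can be sketched as a small lookup. This is a minimal Python illustration under stated assumptions, not the claimed implementation: the function name `mitigation_for`, the label strings, and the returned tuple layout are hypothetical. The text does not state a blocking duration for public apex domains, so the sketch leaves it unspecified (`None`).

```python
from typing import Optional, Tuple


def mitigation_for(hosting_type: str, fqdn: str, apex: str) -> Tuple[str, Optional[str]]:
    """Return (name to block, blocking duration) for a malicious site.

    hosting_type is one of the hypothetical labels "public",
    "compromised", or "attacker-owned".
    """
    if hosting_type == "public":
        # Public apex: block only the offending subdomain, not the provider.
        return fqdn, None
    if hosting_type == "attacker-owned":
        # Private apex registered by the attacker: block the whole apex.
        return apex, "permanent"
    if hosting_type == "compromised":
        # Private apex of a victim: block the apex, but only temporarily.
        return apex, "temporary"
    raise ValueError(f"unknown hosting type: {hosting_type!r}")
```

For example, a malicious page under `alice.000webhostapp.com` would lead to blocking that subdomain only, while a site on an attacker-owned apex would lead to blocking the apex itself.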

[0005] Hosting types of malicious URLs are conventionally detected manually by domain reputation systems and blacklists. For example, the Anti-Phishing Working Group may identify them. Although lists of public apex domains exist from multiple sources, they are not complete, even when combined. Further, these lists are often not up to date due to the highly dynamic nature of the public web-hosting and cloud business. Therefore, given a malicious URL, one cannot simply look up such lists to decide whether it is hosted in a public apex domain.

[0006] Additionally, conventional approaches to categorize malicious websites as hosted on compromised or attacker-owned apex domains are not as effective as desired. One conventional approach is to take domain popularity, such as Alexa ranking, into consideration. It is generally understood that compromised domains have some residual reputation and are long-lived, whereas attacker-owned domains have low reputation and are short-lived. However, the inventors’ analysis of the malicious websites in VirusTotal (VT) shows that such observations do not always hold. While there are compromised domains that have a high Alexa ranking and a long lifetime (e.g. linode.com, cleverreach.com), the inventors have observed that there exist many other likely abandoned or little-maintained domains with low or no Alexa ranking (e.g. gemtown88.com, vanemery.com) that are compromised by attackers to launch their attacks. Further, newly created benign domains possess neither of the above properties, making it likely to mislabel them as attacker-owned when they are indeed compromised. On the other hand, though it is certainly true that many domains created by attackers are short-lived with very low Alexa rankings, sophisticated attackers nowadays increasingly utilize long-lived domains, for example, by creating and parking those domains for a while (e.g. crackarea.com, estilo.com.ec), to evade detection. Additionally, attackers are able to artificially inflate the popularity of their domains, at least in the short term, without requiring many resources. Therefore, relying on popularity and/or lifetime alone does not result in accurate labeling of these malicious domains.

[0007] Accordingly, a need exists to more quickly and efficiently detect hosting types of malicious URLs to improve security.

SUMMARY

[0008] The present application provides a software-based classifier built on a machine learning model that distinguishes between two kinds of malicious URL hosting apex domains: public and private. This classification helps security professionals specify which domain levels to block, the whole apex domain in the case of private apexes or specific subdomains in the case of public ones. In at least some aspects, the classifier is built on a machine learning model that differentiates attacker-owned hosting domains from compromised hosting domains. This distinction is crucial to help security operators take the appropriate mitigation actions. For example, attacker-owned domains could be blocked permanently whereas compromised ones temporarily.

[0009] In light of the technical features set forth herein, and without limitation, in a first aspect of the disclosure in the present application, which may be combined with any other aspect unless specified otherwise, a system includes a display and a memory in communication with a processor. The processor may be configured to identify a malicious domain from a set of received domains; determine, using a model, whether the identified malicious domain is a public domain or a private domain; determine, if the identified malicious domain is a private domain, using a model, whether the private domain is a compromised domain or an attacker-owned domain; and display the determined malicious domain hosting type on the display, the determined malicious hosting type being a public domain, a compromised private domain, or an attacker-owned private domain.

[0010] In a second aspect of the disclosure in the present application, which may be combined with any other aspect unless specified otherwise, a method includes identifying a malicious domain from a set of received domains. It may be determined, using a model, whether the identified malicious domain is a public domain or a private domain. If the identified malicious domain is a private domain, it may be determined, using a model, whether the private domain is a compromised domain or an attacker-owned domain. The determined malicious domain hosting type may be displayed. In this aspect, the determined malicious hosting type is a public domain, a compromised private domain, or an attacker-owned private domain.
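The two-stage determination in the aspects above can be illustrated with a short Python sketch. It is only an outline under stated assumptions: the two models are passed in as plain callables (`is_public`, `is_compromised`), and those names, the function `classify_hosting_type`, and the label strings are hypothetical rather than taken from the disclosure.

```python
from typing import Callable

# Hypothetical label strings for the three hosting types named in the text.
PUBLIC = "public"
COMPROMISED_PRIVATE = "compromised private"
ATTACKER_OWNED_PRIVATE = "attacker-owned private"


def classify_hosting_type(
    domain: str,
    is_public: Callable[[str], bool],
    is_compromised: Callable[[str], bool],
) -> str:
    """Apply the public/private model first, then the private-domain model."""
    if is_public(domain):
        return PUBLIC
    # Only private domains reach the second model.
    if is_compromised(domain):
        return COMPROMISED_PRIVATE
    return ATTACKER_OWNED_PRIVATE
```

Passing the models as callables keeps the sketch independent of any particular machine learning library; the second model is never consulted for a domain the first model labels as public.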

[0011] Additional features and advantages of the disclosed method and apparatus are described in, and will be apparent from, the following Detailed Description and the Figures. The features and advantages described herein are not all-inclusive and, in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the figures and description. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and not to limit the scope of the inventive subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

[0001] FIG. 1 illustrates a block diagram of an example system for malicious domain hosting type classification, according to an aspect of the present disclosure.

[0002] FIG. 2 illustrates a flow chart of an example method for malicious domain hosting type classification, according to an aspect of the present disclosure.

[0003] FIG. 3 illustrates a graph of a comparison of VT URL intelligence against SA and GSB.

[0004] FIG. 4 illustrates a graph showing the AUC of the ROC curves is 96% for GT1.

[0005] FIG. 5 illustrates a graph showing the AUC of the ROC curve is 99% for GT2.

[0006] FIG. 6 illustrates a table showing various features of the five features groups that the private domain classifier may take into account, according to an aspect of the present disclosure.

[0007] FIG. 7 illustrates a correlation matrix for class labels, domain duration, quantity of scanner count, and Alexa rank, according to an aspect of the present disclosure.

[0008] FIG. 8 illustrates a graph showing the ROC curves and feature importance in an example in which the private domain classifier is a Random Forest classifier, according to an aspect of the present disclosure.

[0009] FIG. 9 illustrates a graph showing the ROC curve for an example in which the private domain classifier is a Random Forest classifier, according to an aspect of the present disclosure.

[0010] FIG. 10 illustrates a graph showing the CDF of the number of FQDNs per apex during a period for likely benign domains and malicious domains.

[0011] FIG. 11A illustrates a graph showing the number of FQDNs per apex for the two categories of apex domains, public and private.

[0012] FIG. 11B illustrates a graph showing the average Alexa ranking distribution for public and private apex domains.

[0013] FIG. 11C illustrates a graph showing the domain lifetime distribution for public and private apex domains.

[0014] FIG. 12A illustrates a graph showing #FQDNs per apex for compromised and attacker-owned domains.

[0015] FIG. 12B illustrates a graph showing the average Alexa rank distribution for compromised and attacker-owned apex domains.

[0016] FIG. 12C illustrates a graph showing the domain lifetime distribution for compromised and attacker-owned apex domains.

[0017] FIG. 13 illustrates a feature correlation matrix for the features used in the public domain classifier, according to an aspect of the present disclosure.

[0018] FIGS. 14A and 14B illustrate graphs showing the feature importance for a Random Forest-based public domain classifier for the two datasets GT1 and GT2.

[0019] FIGS. 15A and 15B illustrate graphs showing the t-SNE for a Random Forest-based public domain classifier for the two datasets GT1 and GT2.

[0020] FIGS. 16A and 16B illustrate graphs showing the precision-recall for a Random Forest-based public domain classifier for the two datasets GT1 and GT2.

[0021] FIGS. 17A and 17B illustrate graphs showing the feature importance for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.

[0022] FIGS. 18A and 18B illustrate graphs showing the t-SNE for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.

[0023] FIGS. 19A and 19B illustrate graphs showing the precision-recall for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.

DETAILED DESCRIPTION

[0024] The present application provides a new and innovative malicious domain hosting type classification system and method. Early knowledge of what hosting type a malicious URL is coming from helps security operators take appropriate actions. The distinction between public and private apex domains has a profound impact on the inference and prediction of malicious domains, especially when it relies on the association of subdomains belonging to the same apex domain. Further, once malicious websites are detected, the actions against the hosting apex domains would be different depending on whether they are public or private. The provided classification system identifies public and private apex domains based on a key observation that subdomains of private apex domains have more consistent behavior and properties compared to those of public apex domains.

[0025] In at least some aspects, the classification system may determine whether a hosting domain marked as malicious is compromised or attacker-owned. For example, once the provided system identifies a malicious website as hosted in a private apex domain, the provided system may further classify the apex domain based on its owner. A malicious website is either created by attackers on their own registered domains (e.g. getbinance.org) or on compromised benign domains (e.g. questionpro.com). In the latter case, legitimate domains exploited for malicious activities are victim domains. Takedown strategies and who should be contacted differ depending on the type of the apex domain. Early detection of compromised domains helps owners identify root causes, take corrective measures, and control reputation damage, while Security Operation Center (SOC) teams may temporarily block such victim domains to protect their users. On the other hand, attacker-owned domains would require completely different actions. They are usually first blacklisted to contain the immediate damage. They could be further shut down through third-party takedown services, domain registration deletion, or ownership transfer if they are involved in cybersquatting.

[0026] The inventors have found that the provided classifier achieves a 97.2% accuracy with 97.7% precision and 95.6% recall with respect to identifying public and private apex domains. In addition, the inventors have found that the provided classifier achieves a 96.4% accuracy with 99.1% precision and 92.6% recall with respect to determining whether a malicious hosting domain is compromised or attacker-owned.
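For readers less familiar with the three reported measures, the following Python sketch shows how accuracy, precision, and recall are conventionally computed from confusion-matrix counts. The function name `classification_metrics` is hypothetical, and the sketch uses the standard textbook definitions; it does not reproduce the inventors' evaluation or their data.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, and recall from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives for the class treated as positive.
    """
    total = tp + fp + tn + fn
    return {
        # Share of all decisions that were correct.
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of predicted positives that were truly positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of true positives that were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

A classifier with high precision and somewhat lower recall, as reported above, rarely mislabels the negative class as positive but misses some true positives.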

[0027] As used herein, an apex domain is defined as a public apex domain if its subdomains (e.g., alice.000webhostapp.com) or pages (e.g., sites.google.com/alice) are not created and not under the control of the owner of the apex domain (e.g., 000webhostapp.com). As used herein, an apex domain is defined as a private apex domain if its subdomains (e.g., careers.nsa.gov) are created and managed by the owner of the apex domain (e.g., nsa.gov).

[0028] FIG. 1 illustrates a block diagram of an example system 100. In other examples, the components of the system 100 may be combined, rearranged, removed, or provided on a separate device or server. The example system 100 may include an example classification system 110 that classifies the hosting type of a malicious domain. For instance, the example classification system 110 may automatically label malicious websites (i.e. URLs) as attacker-owned public domains (e.g. 000webhostapp.com), compromised (private) domains (questionpro.com) or attacker-owned (private) domains (getbinanace.org). In various aspects, the classification system 110 may be in communication with at least one reputation system 160 over a network 150. The network 150 can include, for example, the Internet or some other data network, including, but not limited to, any suitable wide area network or local area network.

[0029] The reputation system 160 may be any suitable blacklist or reputation system that provides a reputation of websites or URLs (e.g., whether they are malicious). In some aspects, the reputation system 160 is the VirusTotal (VT) system. VirusTotal (VT) is a known reputation service that provides aggregated intelligence on any URL by consulting third-party anti-virus tools and URL/domain reputation services. Each of these tools is referred to herein as a scanner. VT aggregates the query results every second and makes them available for subscribed users as a feed. In other examples, the reputation system 160 may be generated/maintained by Google Safe Browsing (GSB), Phishtank, Anti-Phishing Working Group (APWG), McAfee Site Advisor (SA), or other suitable blacklists or reputation systems. In some aspects, the classification system 110 may be in communication with more than one blacklist or reputation system.

[0030] In various aspects, the classification system 110 may include a processor in communication with a memory 114. The processor may be a CPU 112, an ASIC, or any other similar device. In some examples, the classification system 110 may include a display 116. The display 116 may be any suitable display for displaying information. In various aspects, the classification system 110 may include a malicious domain identifier 120. The malicious domain identifier 120 may identify a malicious domain based on information received from the reputation system 160. In various aspects, the classification system 110 may include a public domain classifier 130. The public domain classifier 130 may determine whether a malicious domain is a public apex domain or a private apex domain. In various aspects, the classification system 110 may include a private domain classifier 140. The private domain classifier 140 may determine whether a private apex domain is compromised or attacker-owned. Each of the malicious domain identifier 120, the public domain classifier 130, and the private domain classifier 140 may be implemented by software executed by the CPU 112. In other examples, the components of the classification system 110 may be combined, rearranged, removed, or provided on a separate device or server.

[0031] In some examples, the public domain classifier 130 may be a Random Forest classifier. In other examples, the public domain classifier 130 may be a Support Vector Classification (SV), Extra Tree (ET), Logistic Regression (LR), Decision Tree (DT), Gradient Boosting (GB), Ada Boosting (AB), or K-Neighbors (KN) classifier. In some examples, the private domain classifier 140 may be a Random Forest classifier or an Extra Tree (ET) classifier. In other examples, the private domain classifier 140 may be a Support Vector Classification (SV), Logistic Regression (LR), Decision Tree (DT), Gradient Boosting (GB), Ada Boosting (AB), or K-Neighbors (KN) classifier.

[0032] FIG. 2 shows a flow chart of an example method 200 for classifying the hosting type of a malicious domain. Although the example method 200 is described with reference to the flowchart illustrated in FIG. 2, it will be appreciated that many other methods of performing the acts associated with the method 200 may be used. For example, the order of some of the blocks may be changed, certain blocks may be combined with other blocks, and some of the blocks described are optional. The method 200 may be performed by processing logic that may comprise hardware (circuitry, dedicated logic, etc.), software, or a combination of both.

[0033] In some aspects, the example method 200 may begin with identifying a malicious domain (block 202). For instance, the malicious domain identifier 120 may identify a malicious domain. The malicious domain identifier 120 may identify the malicious domain from a set of received URLs (e.g., from the reputation system 160). Out of all the URLs marked by at least one scanner of the reputation system 160, a domain may be identified that is highly likely to be malicious. In some aspects, a domain may be identified as highly likely to be malicious if a threshold number of scanners marks the domain. For example, the provided classification system may utilize historical VirusTotal (VT) URL feed information when identifying a malicious domain. A basic measure of maliciousness from VT results is the number of scanners that mark a URL as malicious. The higher this value is for a given URL, the more likely the URL is malicious. In one example, a URL marked by five or more scanners may be identified as malicious. In other examples, a different threshold quantity of scanners marking a URL may be utilized to identify a URL as malicious.
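The scanner-count threshold described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each feed entry has already been reduced to a `(url, positives)` pair, where `positives` is the number of scanners that marked the URL; that pair layout and the function names are hypothetical. The default of five follows the example threshold given in the text and may be changed.

```python
from typing import Iterable, List, Tuple


def is_likely_malicious(positives: int, threshold: int = 5) -> bool:
    """True when at least `threshold` scanners marked the URL as malicious."""
    return positives >= threshold


def select_malicious(reports: Iterable[Tuple[str, int]], threshold: int = 5) -> List[str]:
    """Keep the URLs whose scanner count meets the threshold, in input order."""
    return [url for url, positives in reports if is_likely_malicious(positives, threshold)]
```

Raising the threshold trades coverage for confidence: fewer URLs are selected, but each is marked by more scanners.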

[0034] FIG. 3 illustrates a graph of a comparison of VT URL intelligence against SA and GSB, where "ext mal" corresponds to the percentage of URLs from each class of VT that are marked as malicious by SA or GSB. When the quantity of scanners is below five, the majority of VT-marked malicious URLs are not identified as malicious by either SA or GSB. Meanwhile, for those URLs that are marked as malicious by no fewer than 5 scanners in VT (i.e., #scanner ≥ 5), the majority (over 70%) are in agreement with the external intelligence from SA and GSB.

[0035] Returning to FIG. 2, in at least one example, the malicious domain identifier 120 continuously profiles domains observed in the VT URL feed. In such an example, the malicious domain identifier 120 incrementally builds an aggregated record for each Fully Qualified Domain Name (FQDN). The profile record for a given FQDN may include the first time seen, the time last seen, number of times scanned, number of times marked malicious, and/or corresponding URLs and VT scan summaries.
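The incremental per-FQDN record described above might be maintained as in the following Python sketch. The class `FqdnProfile`, the function `update_profile`, and the field names are hypothetical. The sketch assumes that each feed entry carries a URL, a numeric timestamp, and a scanner count, and it treats a scan as "marked malicious" when at least one scanner flags it; the text does not fix that rule, so it is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Set
from urllib.parse import urlsplit


@dataclass
class FqdnProfile:
    first_seen: int
    last_seen: int
    scans: int = 0
    malicious_scans: int = 0
    urls: Set[str] = field(default_factory=set)


def update_profile(profiles: Dict[str, FqdnProfile], url: str,
                   timestamp: int, positives: int) -> None:
    """Fold one scan report into the aggregated record of its FQDN."""
    fqdn = urlsplit(url).hostname
    if fqdn is None:
        return  # URL without a parsable host (e.g. missing scheme); skip it
    profile = profiles.get(fqdn)
    if profile is None:
        profile = profiles[fqdn] = FqdnProfile(first_seen=timestamp, last_seen=timestamp)
    profile.first_seen = min(profile.first_seen, timestamp)
    profile.last_seen = max(profile.last_seen, timestamp)
    profile.scans += 1
    if positives > 0:  # assumption: any positive scanner counts as "marked malicious"
        profile.malicious_scans += 1
    profile.urls.add(url)
```

Because each report only updates counters and time bounds, the profile can be built in a single pass over the feed without storing every report.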

[0036] It will be appreciated that the malicious domain identifier 120 may identify a malicious domain from URLs received from the reputation system 160 when the reputation system 160 is a blacklist or reputation system other than VT. In some aspects, the malicious domain identifier 120 may identify a malicious domain from URLs received from more than one reputation system 160. For instance, the malicious domain identifier 120 may crosscheck results from one reputation system 160 with results from another reputation system 160.

[0037] It may then be determined whether an identified malicious domain is a public domain or a private domain (block 204). For instance, the public domain classifier 130 may determine whether the identified malicious domain is a public domain or a private domain. Publicly available lists such as browser public suffix list, CDN lists, dynamic DNS lists, popular webhosting domains or proxy services can be useful, but they can also be prohibitively restrictive as they are slow to keep up-to-date, and thus tend to mistakenly include many non-existent domains and meanwhile miss newly appeared public domains.

[0038] A ground truth data set for the public domain classifier 130 may be collected as follows. Publicly available lists, including the public suffix list, popular web hosting providers and CDN lists, and dynamic DNS lists, may be aggregated and the intersection with apex domains in datasets DS1 and DS2 may be taken. Potential public domains may be identified by searching over our datasets for the keywords likely to be used by public apex domains such as hosting, free, web, share, upload, drop, cdn, file, photo, and proxy. Random samples of five hundred apex domains may be taken from DS1 and DS2 respectively.

[0039] A tentative private domain ground truth data set may be collected by randomly selecting 1000 apex domains from each dataset (DS1 and DS2) that are mutually exclusive from the tentative public dataset. From these tentative ground truth sets, manual verification may be done to create the final ground truth sets. For each apex domain, a confidence score can be assigned between 50 and 100 to indicate a confidence in the label, with 100 being the most confident and 50 being undecided. To improve the quality of labeling, two domain experts performed the labeling for all the domains and the domains with conflicting labels were excluded.
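The exclusion of conflicting labels can be pictured with a short Python sketch. It assumes each expert's labels are held in a dictionary from apex domain to label; the function name `agreed_labels` and the label strings are hypothetical, and the sketch leaves out the confidence scores because the text does not state how they are combined.

```python
from typing import Dict


def agreed_labels(expert_a: Dict[str, str], expert_b: Dict[str, str]) -> Dict[str, str]:
    """Keep only the apex domains that both experts labeled identically."""
    return {
        domain: label
        for domain, label in expert_a.items()
        if expert_b.get(domain) == label
    }
```

Domains labeled by only one expert are dropped along with the conflicting ones, which keeps the final set limited to labels that were independently confirmed.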

[0040] The public domain classifier 130 may take into account at least some of the features detailed in the Table 1 below to determine whether a malicious domain is a public domain or a private domain.

Table 1.

[0041] Compared to private apex domains, public domains tend to host more subdomains and, further, they are scanned more frequently in VT. The features #subdomains and #scans capture these observations. Since subdomains are not under the control of the public apex domain owner, in practice, some of the subdomains are malicious and others are benign, whereas subdomains under private apexes tend to be mostly either benign or malicious. #Mal_Scans and Mal_Scan_Ratio capture the volume and this difference. Most public apexes, especially CDNs and proxy services, utilize FQDNs of the domains they serve (e.g. www.superwhys.com.akamai.com), whereas private apexes mostly use descriptive popular keywords in the subdomain part, such as www, mail, ns and m. By profiling all domains seen in PDNS during the study period, the inventors identified the top 100 subdomains as the popular keywords. These differences are captured using the #Pop_Keywords, Ratio_Pop_Keywords and #Avg_Depth features. The inventors observed that there is more variation between subdomain names under public apex domains than under private apex domains. Avg_Sub_Entropy measures the average entropy across all subdomains to capture this observation.
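A few of the subdomain-based features discussed above can be approximated as in the Python sketch below. It is a hedged illustration, not the disclosed feature extractor: the function names, the dictionary keys, and the tiny `POPULAR` keyword set are assumptions (the text refers to a top-100 list that is not reproduced here), and the sketch assumes character-level Shannon entropy and a depth measured in dot-separated labels, neither of which the text specifies.

```python
import math
from collections import Counter
from typing import Dict, Iterable

# Hypothetical stand-in for the top-100 popular subdomain keywords.
POPULAR = frozenset({"www", "mail", "ns", "m"})


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def subdomain_part(fqdn: str, apex: str) -> str:
    """Return the labels left of the apex, or '' if there are none."""
    suffix = "." + apex
    return fqdn[: -len(suffix)] if fqdn.endswith(suffix) else ""


def subdomain_features(apex: str, fqdns: Iterable[str]) -> Dict[str, float]:
    subs = [s for s in (subdomain_part(f, apex) for f in fqdns) if s]
    n = len(subs)
    if n == 0:
        return {"num_subdomains": 0, "num_pop_keywords": 0,
                "ratio_pop_keywords": 0.0, "avg_depth": 0.0, "avg_sub_entropy": 0.0}
    pop = sum(1 for s in subs if s in POPULAR)
    return {
        "num_subdomains": n,
        "num_pop_keywords": pop,
        "ratio_pop_keywords": pop / n,
        # Depth = number of dot-separated labels in the subdomain part.
        "avg_depth": sum(s.count(".") + 1 for s in subs) / n,
        "avg_sub_entropy": sum(shannon_entropy(s) for s in subs) / n,
    }
```

Under these assumptions, an apex whose subdomains are mostly `www` or `mail` yields a high keyword ratio and low depth, while an apex serving embedded customer FQDNs yields deeper, higher-entropy subdomain parts.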

[0042] FIGS. 4 and 5 illustrate graphs showing the AUCs of the two ROC curves are 96% and 99% for GT1 and GT2 respectively, demonstrating high degrees of separability of the two classes.

[0043] The FQDNs associated with such public domains are created by attackers and the number of such FQDNs could be utilized to assess the reputation of public domains.

[0044] In some aspects, a public domain may be categorized into one of seven groups: Dynamic DNS, Web Proxy Services, CDN, Web hosting, Blogging and content hosting, content-sharing services, and shorteners and forms.

[0045] Returning to FIG. 2, if the public domain classifier 130 determined the identified malicious domain to be a private domain, it may then be determined whether the private domain is a compromised domain or an attacker-owned domain (block 206). For instance, the private domain classifier 140 may determine whether the private domain is a compromised domain or an attacker-owned domain. In order to identify compromised domains, deviations of the visual and auxiliary information in the apex domain and the domain under consideration are relied upon. The inventors observed that compromised domains have very different contents compared to the main website and that the auxiliary information, such as hosting IPs, is different for the main website (reputed hosting provider) and the domain under consideration (bulletproof hosting). On the other hand, attacker-owned domains have relatively new registration information, are likely to utilize fast flux networks, are short-lived (likely to be NX domain), and are blacklisted.

[0046] Two ground truth sets of compromised and attacker-owned apex domains AC-GT1 (AC stands for attacker-owned/ compromised) and AC-GT2 may be manually created from the private domains identified from DS1 and DS2 respectively using our public/private classifier. A random sample of 2500 domains from each of DS1 and DS2 may be selected. Similar to the public/private ground truth collection, manual inspection may be performed of each sample and a confidence score may be provided to indicate how confident the domain experts are about the label. The following information and sources are manually inspected to decide if a malicious apex is compromised or attacker-owned. In addition to checking the website, auxiliary information was checked such as registration information including historical WHOIS records, hosting information, and PDNS information. The detailed report from two threat intelligence platforms, riskiq.com and otx.alienvault.com, was also checked. Further, detailed reports were inspected from two reputation services, GSB and SA.

[0047] In order to identify compromised domains, deviations of the visual and auxiliary information in the apex domain and the domain under consideration were relied upon. The inventors observed that compromised domains have very different contents compared to the main website and that the auxiliary information, such as hosting IPs, is different for the main website (reputed hosting provider) and the domain under consideration (bulletproof hosting). On the other hand, attacker-owned domains have relatively new registration information, are likely to utilize fast flux networks, are short-lived (likely to be NX domain), and are blacklisted. After manual verification, the high-confidence labels were selected.

[0048] In at least one example, the private domain classifier 140 takes into account at least five groups of features: lexical, VT report, VT profile, PDNS, and Alexa features. Lexical features capture the properties of the URL under consideration. VT report features include those attributes that are directly available from VT reports, VT profile features are extracted from the VT NOD system, and PDNS features are extracted from the Farsight Passive DNS DB. Most of the lexical, Alexa and PDNS features are known from previous research on detecting malicious domains or URLs. The table illustrated in FIG. 6 shows various features of the five feature groups that the private domain classifier 140 may take into account. Compared to conventional approaches, the novel features that the private domain classifier takes into account include VT_Duration, Positive_Count, Domain_Malicious, #Total_Scans, #Benign_Scans, Sibling_Malicious, SOA_Domains_Nos, and SOA_Domain.

[0049] VT Report Features are directly extracted from the VT reports. The inventors observed that the VT_Duration feature for compromised domains tends to be higher than that for attacker-owned domains. One reason is that compromised domains are in general harder to detect by existing systems as attackers are exploiting the reputation of legitimate domains. For the same reason, the inventors observed that the number of scanners that mark a compromised site as malicious is lower than that for attacker-owned sites. Positive_Count captures this observation. Compared to attacker-owned domains, it was observed that attackers more often use compromised domains as a redirection site in order to evade detection.

[0050] VT profile features capture the intuition that almost all subdomains and scans of attacker-owned domains are malicious whereas only some of the subdomains and scans of compromised domains are malicious.

[0051] From the PDNS features, the number of authoritative name servers and the number of SOA domains capture the observation that attacker-owned domains change their hosting providers more often than benign domains in order to evade detection or takedown. Additionally, attackers drop-catch domains to exploit the residual trust in them, which also results in a domain being associated with multiple name servers. Comparison of apex domains with name server domains and SOA features captures the observation that benign domains are more likely to be hosted on their own servers compared to attacker-owned ones.

[0052] The present disclosure improves upon several lexical features presented in previous works. Specifically, the inventors observed that attacker-owned domains more often use these squatting methods to impersonate brands compared to compromised domains. The present disclosure profiles Alexa Top 1M domains over 1 year to identify the Alexa top 1000 brands in order to detect combosquatting, levelsquatting and target embedding domains, which are shown to be more than a hundred times more prevalent than more traditional squatting types. Features Brand, Similar, and Pop Keywords capture new squatting tactics used by attackers.

[0053] In addition to the VT features shown in the table of FIG. 6, the private domain classifier 140 considers three new classes of features, PDNS, Alexa and lexical features, to improve classification performance. Doing so indeed improves the performance metrics, and as shown in FIG. 7, several classifiers including GB, ET and RF perform quite well, resulting in an accuracy slightly above 90% with 10-fold cross validation for AC-GT1. FIG. 8 illustrates a graph showing the ROC curves and feature importance in an example in which the private domain classifier 140 is a Random Forest classifier. The private domain classifier 140 achieves 90.6% accuracy with 94.7% precision and 86.1% recall. An important consideration in building robust machine learning models is that the model should generalize to different ground truth datasets. To this end, a new model is trained using AC-GT2. With the RF classifier, the inventors achieved 96.8% accuracy with 99.1% precision and 93.4% recall. FIG. 9 illustrates a graph showing the ROC curve for an example in which the private domain classifier 140 is a Random Forest classifier.
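For reference, the accuracy, precision and recall figures reported above follow the standard definitions for a binary classifier. The sketch below shows one way such metrics may be computed from true and predicted labels. The function names and the choice of which class is treated as positive are assumptions made for illustration, and the disclosure does not state the underlying confusion counts.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, tn, fn) for the class treated as positive."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def binary_metrics(tp, fp, tn, fn):
    """Standard accuracy, precision and recall, guarding against empty sets."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

High precision combined with lower recall, as in the AC-GT1 results, indicates that domains labeled positive are rarely mislabeled while some positive domains are missed.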

[0054] The inventors have drawn various insights from the VT URL Feed dataset that help determine the features used in the public domain classifier 130 and the private domain classifier 140. The VT URL Feed dataset contains 814,678,956 unique URLs in the period from Aug. 1, 2019 to Nov. 18, 2019. Note that the same URL may be scanned multiple times in a given day. Each new scan is considered a different one. However, if VT is simply queried multiple times to retrieve an existing report instead of triggering new scans, the scan ID does not change. Hence, such multiple reports with the same scan ID are considered as one record. It was observed that the daily average of observed likely benign scans (i.e. #scanners = 0) is 89.3% of the total number of scans, which is around 4.8M. The inventors observed that, on average, malicious URLs are scanned 6 times whereas benign URLs are scanned only twice. This follows general user behavior where the more suspicious the URLs are, the more they are checked. Another observation was that the daily average scan count is roughly twice the average URL count.
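The record-counting rule described above, in which repeated reports carrying the same scan ID count as a single record, may be sketched as follows. The input format, a sequence of (url, scan_id) pairs, is an assumption made for illustration and is not the actual VT URL Feed schema.

```python
from collections import defaultdict


def scans_per_url(reports):
    """Count distinct scans per URL from (url, scan_id) pairs.

    Reports that share a scan ID are repeated retrievals of the same scan
    and are therefore counted once.
    """
    scan_ids = defaultdict(set)
    for url, scan_id in reports:
        scan_ids[url].add(scan_id)
    return {url: len(ids) for url, ids in scan_ids.items()}
```

Averaging the resulting counts separately over malicious and benign URLs would yield per-class scan frequencies of the kind reported above.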

[0055] The inventors also compared the coverage of malicious websites in the inventors’ dataset with that of typical blacklists and reputation services. While there are many VT reports with 1 or 2 #scanners, on average 45.7% of the malicious scans have 5 or more #scanners (i.e. the top two areas in the Figure). The inventors focused on categorizing scans with 5 or more #scanners, which corresponds to 1659K weekly malicious reports on average. This corresponds to 276K malicious websites weekly on average. In comparison, Google Transparency Report and Phishtank report around 50K and 4K per week respectively. This shows that the classification system 110 is trained on a much larger set of malicious URLs compared to popular blacklists and thus has a higher impact.

[0056] VT scanners assign each malicious URL one of the following class labels: malicious, malware, phishing, mining and suspicious site. Since VT scanners frequently assign conflicting class labels, a simple majority voting heuristic may be used to derive the final class label for a malicious website. To validate this heuristic, the inventors took random samples of 100 websites of each class type and manually cross-checked them against several publicly available blacklists or APIs including phishtank, GSB and SA. The inventors’ manual inspection showed that more than 98% of the labels using majority voting are in agreement with external intelligence, validating the heuristic. While malware and phishing sites dominate the reported malicious websites, there are only a few malicious mining sites and suspicious sites in the dataset.
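One possible realization of the majority voting heuristic is sketched below. The disclosure does not specify how ties are handled, so the alphabetical tie-break used here is an assumption, as is the function name. The sketch selects the most frequently assigned label, which coincides with a strict majority whenever one exists.

```python
from collections import Counter


def majority_label(scanner_labels):
    """Return the label assigned by the most scanners, or None if none voted.

    Ties are broken alphabetically (an assumption; not specified in the source).
    """
    if not scanner_labels:
        return None
    counts = Counter(scanner_labels)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)
```

For example, a URL labeled phishing by three scanners, malware by one and malicious by one would be assigned the class label phishing.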

[0057] FIG. 10 illustrates a graph showing the CDF of the number of FQDNs per apex during a period for likely benign domains (i.e. #scanners = 0) and malicious domains (i.e. #scanners ≥ 5). Frequencies less than 5 and the long tail of frequencies greater than 500 have been excluded. It can be seen that 90.2% of the apexes in the benign category have only one FQDN whereas only 12.3% of the apexes in the malicious category have only one FQDN. Further, around 40% of malicious apex domains have more than 40 FQDNs whereas only 5% of benign apex domains have more than 40 FQDNs. These observations show that attackers create many subdomains to launch their attacks in a fashion similar to fast-flux networks.

[0058] Another observation is that there is a long tail of apex domains having more than 500 FQDNs, with some having millions of them. For example, blogspot.com (blogging), coop.it (URL shortener), mcafee.com (McAfee end-point hosts) and opendns.com (Cisco OpenDNS) all have over one million FQDNs. The number of FQDNs observed is used as a feature in the public domain classifier 130 because the higher this number is, the more likely the domain is public.
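The FQDN-count feature and the CDF values discussed above may be computed along the following lines. This is an illustrative sketch only: the helper names are hypothetical, and naive_apex keeps just the last two labels of a name, which is a simplification that a practical system would likely replace with a Public Suffix List lookup.

```python
from collections import defaultdict


def naive_apex(fqdn: str) -> str:
    """Return the last two labels of a name (a simplifying assumption)."""
    labels = fqdn.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def fqdns_per_apex(fqdns):
    """Count distinct FQDNs observed under each apex domain."""
    seen = defaultdict(set)
    for fqdn in fqdns:
        name = fqdn.rstrip(".").lower()
        seen[naive_apex(name)].add(name)
    return {apex: len(names) for apex, names in seen.items()}


def fraction_at_most(counts, threshold):
    """Empirical CDF value: share of apexes with at most `threshold` FQDNs."""
    values = list(counts.values())
    if not values:
        return 0.0
    return sum(1 for v in values if v <= threshold) / len(values)
```

Evaluating fraction_at_most at a threshold of 1 for the benign and malicious sets separately would correspond to the share of apexes having only one FQDN in each category.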

[0059] Returning to FIG. 2, the determined malicious domain hosting type may then be displayed (block 208). For instance, the classification system 110 may display the determined malicious domain hosting type on the display 116. The displayed malicious domain hosting type may be displayed with the malicious domain URL. The determined malicious domain hosting type may be a public domain (e.g., an attacker-owned public domain), a compromised private domain, or an attacker-owned private domain. Security operators may view the determined malicious domain hosting type and the malicious domain's URL on the display 116 to determine and take the appropriate actions.
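The overall two-stage determination, public versus private followed by compromised versus attacker-owned for private domains, may be sketched as follows. The two predicates stand in for the trained public domain classifier 130 and private domain classifier 140. The toy predicates, their thresholds, and the feature names are invented purely for illustration and are not the trained models of the disclosure.

```python
from typing import Callable, Mapping

Features = Mapping[str, float]
Predicate = Callable[[Features], bool]


def classify_hosting_type(
    features: Features, is_public: Predicate, is_compromised: Predicate
) -> str:
    """Apply the public/private decision first, then the private sub-decision."""
    if is_public(features):
        return "public"
    if is_compromised(features):
        return "compromised private"
    return "attacker-owned private"


# Toy stand-ins with made-up thresholds; real predicates would wrap trained models.
def toy_is_public(features: Features) -> bool:
    return features.get("num_fqdns", 0) > 500


def toy_is_compromised(features: Features) -> bool:
    return features.get("malicious_subdomain_ratio", 1.0) < 0.5
```

Passing the classifiers in as callables keeps the two decisions independent, so either model may be retrained or replaced without changing the control flow.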

Experimental Validation

[0060] The inventors’ analysis identified 6,675 malicious public apex domains and 725,325 malicious private apex domains in both datasets. In other words, only 0.91% of apex domains in the VT URL feed are public. However, the inventors observed a high proportion of URLs and scans belonging to these public apex domains. Out of all reports, 46.5% of URLs are hosted on public apex domains. This observation is consistent with the fact that public apex domains host many subdomains whereas private apex domains in general host only a few.

[0061] FIG. 11A illustrates a graph showing the number of FQDNs per apex for the two categories of apex domains, public and private. More than 80% of public apex domains have more than 20 FQDNs whereas 95% of private apex domains have fewer than 10 FQDNs. While many of the public domains have a large number of subdomains, there is a long tail of public domains with a huge number of subdomains (over 200K). These observations suggest that attackers prefer creating many subdomains under public apex domains because they are available at zero cost and because attackers can ride on the reputation of public apex domains, such as their TLS certificates, hosting and registration information, so that they may not be easily detected by traditional blacklists and reputation systems.

[0062] FIG. 11B illustrates a graph showing the average Alexa ranking distribution for public and private apex domains. For unranked domains, the insignificant rank of 1 million was assigned for better visualization. It is not surprising that public domains have higher average Alexa rankings compared to private domains as public apex domains are accessed more frequently by users. An interesting result is that half of public domains are not popular (unranked), showing that attackers also create subdomains on less popular public domains to launch attacks. Since public apex domains host many benign domains, current registration and domain reputation based systems and inference based systems may inadvertently blacklist public apex domains, disrupting benign sites.

[0063] The domain lifetime can be estimated by taking the lifetime of the PDNS footprint for each apex domain. FIG. 11C illustrates a graph showing the domain lifetime distribution for public and private apex domains. The inventors observed that public domains are longer lived than private domains, even though a large majority of them are unranked sites, providing a free platform for attackers to launch their attacks for prolonged time periods. Further, around 10% of private domains are very short-lived, indicating they are likely to be attacker-owned domains.
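One way to estimate the lifetime of a PDNS footprint is to take the span between the earliest first-seen date and the latest last-seen date across all passive-DNS observations for an apex. The sketch below assumes each observation is available as a (first_seen, last_seen) pair of dates. This input format and the function name are assumptions for illustration, and the disclosure does not specify the exact calculation.

```python
from datetime import date


def pdns_lifetime_days(observations):
    """Days between the earliest first-seen and latest last-seen dates.

    `observations` is an iterable of (first_seen, last_seen) date pairs
    for a single apex domain (an assumed input format).
    """
    obs = list(observations)
    if not obs:
        return 0
    first = min(first_seen for first_seen, _ in obs)
    last = max(last_seen for _, last_seen in obs)
    return (last - first).days
```

A very small result for a private apex domain would be consistent with the short-lived, likely attacker-owned profile noted above.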

[0064] The private domain classifier 140 detects that 65.6% of apex domains in the VT URL feed are compromised, indicating there are more compromised websites than attacker-owned ones. This observation is consistent with prior work done on phishing websites and public threat intelligence reports.

[0065] FIG. 12A illustrates a graph showing #FQDNs per apex for compromised and attacker-owned domains. An interesting observation is that most of the compromised domains host slightly more malicious subdomains than attacker-owned ones. When mitigation actions are taken, it is important for domain owners to first identify and clean up all of the malicious subdomains, which can be more than 500.

[0066] FIG. 12B illustrates a graph showing the average Alexa rank distribution for compromised and attacker-owned apex domains. As expected, most of the attacker-owned domains have either a low Alexa rank or no rank. However, it is interesting to note that there are some attacker-owned domains with Alexa ranking below 100K. Another interesting observation is that there are about 10% unranked compromised domains, indicating that attackers launch attacks from less popular benign websites as well, which could be utilized to launch attacks such as DDoS that do not require reputed sites.

[0067] FIG. 12C illustrates a graph showing the domain lifetime distribution for compromised and attacker-owned apex domains. It is not surprising that compromised domains in general live longer than attacker-owned ones. However, about 40% of attacker-owned domains are active for more than 200 days, indicating there is a need to develop better techniques to detect these malicious domains early and take appropriate actions. One reason for their long duration is that attackers register domains and park them for a while as an evasive technique.

[0068] FIG. 13 illustrates a feature correlation matrix for the features used in the public domain classifier 130. FIGS. 14A and 14B illustrate graphs showing the feature importance for a Random Forest-based public domain classifier 130 for the two datasets GT1 and GT2. The feature importance graphs indicate which features are important in constructing the model. FIGS. 15A and 15B illustrate graphs showing the t-SNE for a Random Forest-based public domain classifier 130 for the two datasets GT1 and GT2. The t-SNE graphs utilize a nonlinear dimensionality reduction technique to embed the feature vectors into a two-dimensional space for visualization. They show how the two classes are clustered based on the features collected. One reason for the better performance on the second ground truth set is that, as shown in FIGS. 15A and 15B, the two classes in the ground truth data have a better separation in the second set, resulting in better decision boundaries. Further, the feature importance graphs show that almost all features play a part in deciding the label, making the model less biased and, importantly, less susceptible to adversarial manipulation. FIGS. 16A and 16B illustrate graphs showing the precision-recall for a Random Forest-based public domain classifier 130 for the two datasets GT1 and GT2.

[0069] FIGS. 17A and 17B illustrate graphs showing the feature importance for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2. FIGS. 18A and 18B illustrate graphs showing the t-SNE for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2. FIGS. 19A and 19B illustrate graphs showing the precision-recall for a Random Forest-based private domain classifier 140 for the two datasets GT1 and GT2.

[0070] Without further elaboration, it is believed that one skilled in the art can use the preceding description to utilize the claimed inventions to their fullest extent. The examples and aspects disclosed herein are to be construed as merely illustrative and not a limitation of the scope of the present disclosure in any way. It will be apparent to those having skill in the art that changes may be made to the details of the above-described examples without departing from the underlying principles discussed. In other words, various modifications and improvements of the examples specifically disclosed in the description above are within the scope of the appended claims. For instance, any suitable combination of features of the various examples described is contemplated.

Claims

CLAIMS The invention is claimed as follows:

1. A system for classifying malicious domain hosting types, the system comprising: a display; a memory; and a processor in communication with the memory, the processor configured to: identify a malicious domain from a set of received domains; determine, using a model, whether the identified malicious domain is a public domain or a private domain; determine, if the identified malicious domain is a private domain, using a model, whether the private domain is a compromised domain or an attacker-owned domain; and display the determined malicious domain hosting type on the display, the determined malicious hosting type being a public domain, a compromised private domain, or an attacker-owned private domain.

2. A method for classifying malicious domain hosting types comprising: identifying a malicious domain from a set of received domains; determining, using a model, whether the identified malicious domain is a public domain or a private domain; determining, using a model, if the identified malicious domain is a private domain, whether the private domain is a compromised domain or an attacker-owned domain; and displaying the determined malicious domain hosting type, the determined malicious hosting type being a public domain, a compromised private domain, or an attacker-owned private domain.