WO2016201938A1

WO2016201938A1 - Multi-stage phishing website detection method and system

Info

Publication number: WO2016201938A1
Application number: PCT/CN2015/098463
Authority: WO
Inventors: 耿光刚; 李晓东
Original assignee: 中国互联网络信息中心
Priority date: 2015-06-17
Filing date: 2015-12-23
Publication date: 2016-12-22
Also published as: CN104899508A; CN104899508B

Abstract

The present invention discloses a multi-stage phishing website detection method and system, and combines means of both fast filtering and accurate filtering. Multiple stages of fast filtering are used to control the number of potential phishing websites to be in a relatively small range; furthermore, an accurate determination model is trained by analyzing statistical features of positive and negative samples in a small range. The method comprises the following steps: selecting a to-be-detected range of websites to perform fast filtering and excluding obvious non-phishing websites therefrom; and performing accurate determination on the remaining range of the websites after the fast filtering to determine whether said websites are phishing websites. The system comprises: a fast filtering module, configured to select a range of to-be-detected websites to perform fast filtering and exclude obvious non-phishing websites therefrom; and an accurate determination module, configured to perform accurate determination on the remaining range of the to-be-detected websites after the fast filtering.

Description

Multi-stage phishing website detection method and system

Technical field

The present invention relates to the field of information technology, in particular to network security technologies, and more specifically to a multi-stage phishing website detection method and system.

Background technique

Today, the Internet has become an important part of people's social life. However, as the Internet grows more popular and its applications deepen, Internet phishing scams have gradually become one of the most important means of attack used by cybercriminals, alongside traditional information security threats such as Trojans, viruses and botnets.

Phishing is a coined word: the first two letters of "phreak" (a person who breaks into telephone lines) replace the "f" of "fishing". It denotes a form of cybercrime that combines social engineering (i.e., deception) with network communication technology. The purpose of Internet phishing is to defraud victims of account passwords (online banking, online games, Alipay, etc.), credit card information and personal data, in order to carry out online transfers, steal online game equipment, steal e-mail information or misuse credit cards. Internet phishing is mainly carried out through phishing websites. For example, a phishing website can pretend to be a bank's online banking page to steal a user's bank card number and password and then transfer the deposits out of the user's account; it can pose as the official website of an online game to steal the user's game account and then the user's virtual currency or equipment in the game; it can pose as a website giving away Q coins to steal the user's QQ number and password; or it can pose as a prize-winning website to steal the user's personal information and then use that information to commit crimes. The same means can also be used to obtain a user's e-mail account and password and read the user's e-mail, in order to spy on the privacy of others or even steal trade secrets.

To prevent and combat Internet phishing and to protect the interests and privacy of Internet users, the most effective and direct technical means is to use detection methods to uncover the phishing websites hidden on the Internet.

With the continuous development of information technology, more and more phishing websites exist on the Internet, and new kinds keep emerging, covering Internet pages of various types in various fields. At present, Internet phishing detection uses pattern mining technology based on statistical machine learning. Owing to the successful application of artificial intelligence and machine learning theory in many fields in recent years, the detection of phishing websites based on statistical machine learning has gradually become a popular detection method.

The machine learning methods currently used for statistical-learning-based detection of phishing websites mainly include decision trees, Bagging, support vector machines, etc. These general-purpose algorithms are widely used in pattern recognition fields such as text classification and face recognition, and can be applied directly to phishing website detection. For a model based on these algorithms to achieve good results on the real Internet, a necessary condition is that the training samples cover the various kinds of Internet pages. However, most existing anti-phishing research verifies the effectiveness of its algorithms on relatively small sample sets, some containing only dozens of samples, so its generalization ability is questionable. In addition, even if the sample set is large enough to cover all kinds of samples, and the proportions of the sample types match those of the real Internet, phishing detection is an extreme class imbalance problem (that is, among billions of websites worldwide there are only on the order of hundreds of thousands of phishing websites per year), so it is difficult to obtain good detection results by directly applying existing pattern classification algorithms.

Summary of the invention

In view of the above problems, the present invention provides a multi-stage phishing website detection method and system whose core idea is to combine fast filtering with accurate filtering. Multi-stage fast filtering keeps the set of suspected phishing websites within a relatively small range; an accurate determination model is then trained by analyzing the statistical characteristics of the positive and negative samples in that small range.

One of the objectives of the present invention is to provide a multi-stage phishing website detection method comprising the following steps:

1) Select a website within the scope of detection to perform rapid filtering to exclude obvious non-phishing websites;

2) extracting multi-dimensional features used in performing the fast filtering;

3) Using the above multi-dimensional features on the training set, the website within the remaining range after the fast filtering is accurately determined to determine whether it is a phishing website.

Further, in step 1), the fast filtering of the websites to be detected in the Internet includes:

1-1) use the brand host and/or domain name whitelist for the first layer of filtering;

1-2) using the login box, sensitive words and copyright information for the second layer of filtering;

1-3) Perform third-level filtering using website-related features.

Further, in step 1-1), the first layer of filtering is used to quickly exclude normal brand websites and ensure quick access of key websites.

Further, in step 1-2), the sensitive words include a bank, a credit card, a payment, a winning, a login, and a password.

Further, in step 1-2), the second layer filtering adopts a Bayesian filtering method.

Further, in step 1-3), the website related features include PageRank, domain name registration time, and favicon.

Further, the accurate determination in step 3) includes training an accurate decision model by analyzing statistical characteristics of positive and negative samples in the remaining range.

Further, the statistical characteristics of the positive and negative samples include existing statistical phishing detection features, DNS registration and resolution features, and brand element features.

Further, the accurate determination model in step 3) is trained on the confusing data set, that is, on samples that are hard to tell apart.

Another object of the present invention is to provide a multi-stage phishing website detection system, including:

a fast filtering module for selecting a range of websites to be detected and performing fast filtering to exclude obvious non-phishing websites;

an accurate decision module for accurately determining the websites to be detected in the range remaining after the fast filtering.

Further, the fast filtering module includes:

a first filtering module for performing first layer filtering by using a brand name domain library and/or a domain name white list;

a second filtering module for performing second layer filtering by using sensitive words;

A third filtering module is configured to perform third layer filtering by using relevant features of the website.

The method and system of the present invention determine in multiple stages whether a website within the range to be detected is a phishing website. The multi-layer fast filtering of the earlier stages quickly filters out a large number of non-phishing websites and keeps the set of suspected phishing websites relatively small; the accurate determination stage then uses multi-dimensional features to train a classification model and accurately judges the suspected phishing websites. The invention thus both improves the efficiency of phishing website detection and determines phishing websites accurately. It not only effectively overcomes the difficulty that phishing website detection is an extremely imbalanced detection problem, but also greatly speeds up detection, making it suitable for online applications.

DRAWINGS

FIG. 1 is a schematic diagram showing the imbalance of the phishing fraud detection problem according to the present invention.

FIG. 2 is a flow chart of the multi-stage phishing website detection method according to the present invention.

FIG. 3 is a schematic diagram of the module composition of the system of the present invention.

Detailed description

A necessary premise for good results of the phishing detection method based on pattern classification is that the training samples are rich enough to cover various web pages. However, the problem of phishing website detection in the actual Internet environment is an extreme class imbalance problem. As shown in Figure 1, the black spot in the center of the figure represents a phishing website, and the gray circle represents a non-phishing website.
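To make the effect of the imbalance concrete, the following sketch computes the precision of a classifier at a given base rate. The site counts and the error rates are illustrative assumptions chosen only to match the orders of magnitude mentioned in the background section; they are not figures from the present invention.

```python
def precision_at_base_rate(total_sites, phishing_sites, tpr, fpr):
    """Precision of a classifier applied directly to the whole population."""
    benign_sites = total_sites - phishing_sites
    true_positives = tpr * phishing_sites
    false_positives = fpr * benign_sites
    return true_positives / (true_positives + false_positives)


# Assumed, illustrative numbers: one billion sites, 500,000 of them phishing,
# and a classifier with a 99% detection rate and a 1% false-positive rate.
whole_internet = precision_at_base_rate(1_000_000_000, 500_000, tpr=0.99, fpr=0.01)

# The same classifier applied after fast filtering has (hypothetically)
# reduced the candidate set to one million sites without losing phishing sites.
after_filtering = precision_at_base_rate(1_000_000, 500_000, tpr=0.99, fpr=0.01)

print(f"precision on the whole Internet: {whole_internet:.3f}")
print(f"precision after fast filtering:  {after_filtering:.3f}")
```

Under these assumed numbers, fewer than one in twenty of the sites flagged on the whole Internet would actually be phishing, whereas the same error rates on the reduced candidate set give a far higher precision.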

The existing statistical-learning-based phishing detection methods and strategies do not take this fact into account, and do not adequately explain the coverage and rationality of the test data sets they construct. In view of this situation, the present invention designs a layered detection strategy, that is, multi-stage phishing detection. The core of this strategy is to design the filtering rules of each layer rationally so as to improve both detection efficiency and accuracy. To this end, the first few stages of the detection strategy focus on detection efficiency: they quickly eliminate the obvious non-phishing pages on the Internet, that is, they remove the websites outside the black circle shown in FIG. 1 and narrow the suspected phishing websites down to the black circle. The subsequent stage then judges the suspected phishing websites inside the black circle, ensuring high accuracy and a low false detection rate.

The multi-stage phishing website detection method of the present invention is described in detail below with reference to the accompanying drawings. The method can be applied to a collection of websites as the range to be detected. The present invention does not limit the size of the collection, which may even be the set of all websites on the Internet. As shown in FIG. 2:

In an embodiment of the present invention, the multi-stage phishing website detection method includes two major stages, fast filtering and accurate determination, wherein the fast filtering is implemented by a fast filtering module and the accurate determination is implemented by an accurate decision module. As for the operating environment, the software environment is not limited to Windows or Unix systems, and the method can be implemented in any common development language, such as C++, Java or Perl. The hardware environment is not limited either, and can be an ordinary personal computer or a common server.

Compared with the traditional single-stage phishing detection method, the three-stage fast filtering performed by the present invention effectively eliminates the majority of non-phishing websites on the Internet, so that only a small number of websites requiring accurate judgment enter the final accurate classification stage. High efficiency is critical for phishing detection.
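The staged flow described above can be sketched as a simple cascade. This is a minimal sketch under stated assumptions: the `detect` function, the dictionary keys and the toy filters are hypothetical names introduced for illustration and are not defined in the present invention.

```python
from typing import Callable, Iterable, List

# A fast filter returns True when the site is obviously NOT phishing
# and can be dropped from further processing.
FastFilter = Callable[[dict], bool]


def detect(sites: Iterable[dict],
           fast_filters: List[FastFilter],
           accurate_classifier: Callable[[dict], bool]) -> List[dict]:
    """Run the cheap filters in order, then classify only the survivors."""
    phishing = []
    for site in sites:
        if any(is_benign(site) for is_benign in fast_filters):
            continue  # excluded early; the expensive model never sees it
        if accurate_classifier(site):
            phishing.append(site)
    return phishing


# Toy data and toy filters, for illustration only.
sites = [
    {"host": "www.example-bank.com", "whitelisted": True, "has_login": True},
    {"host": "news.example.org", "whitelisted": False, "has_login": False},
    {"host": "examp1e-bank.test", "whitelisted": False, "has_login": True},
]
filters = [lambda s: s["whitelisted"], lambda s: not s["has_login"]]
flagged = detect(sites, filters, accurate_classifier=lambda s: True)
print([s["host"] for s in flagged])
```

Because `any` stops at the first filter that excludes a site, cheaper filters placed earlier in the list shield the later, more expensive ones.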

The fast filtering includes three specific stages in this embodiment. The first stage uses the brand host list and the domain name whitelist to perform the first layer of filtering, which quickly excludes normal brand websites. Because these websites receive very heavy daily traffic, this layer of filtering also guarantees fast access to key websites.
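A first-layer check of this kind might look like the sketch below. The whitelist entries and the function name `passes_first_layer` are hypothetical; a real deployment would load maintained brand host and domain lists.

```python
# Hypothetical whitelist entries, for illustration only.
BRAND_HOSTS = {"www.example-bank.com", "pay.example.com"}
DOMAIN_WHITELIST = {"example.org"}


def passes_first_layer(host: str) -> bool:
    """Return True if the host is whitelisted and can skip further checks."""
    host = host.lower().rstrip(".")
    if host in BRAND_HOSTS:
        return True
    # Match the host itself or any parent domain against the domain whitelist.
    labels = host.split(".")
    return any(".".join(labels[i:]) in DOMAIN_WHITELIST
               for i in range(len(labels)))
```

Matching on whole trailing labels rather than on substrings means that a host such as `example.org.evil.test` is not mistaken for a subdomain of `example.org`.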

The second stage, that is, the second layer of filtering, filters on the login box, sensitive words and copyright information. The sensitive words include, but are not limited to, "bank", "credit card", "payment", "winning", "login" and "password", and the list can be updated as new types of phishing websites appear. The second layer uses Bayesian filtering, also known as Bayesian classification, whose working principles are generally known to those skilled in the art and are not described here. The second layer filters out most common web pages, which greatly improves the overall detection efficiency.
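For readers unfamiliar with Bayesian filtering, the sketch below shows one common form, a Bernoulli naive Bayes classifier with Laplace smoothing over binary page features. The class labels "suspect" and "ordinary", the feature names and the tiny training set are assumptions made for illustration; the present invention does not specify them.

```python
import math
from collections import defaultdict

FEATURES = ["login_box", "copyright", "bank", "credit card", "payment",
            "winning", "login", "password"]


def train(samples):
    """samples: list of (set of present features, label) pairs."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for present, label in samples:
        class_counts[label] += 1
        for f in present:
            feature_counts[label][f] += 1
    return class_counts, feature_counts


def log_posterior(present, label, class_counts, feature_counts):
    """Unnormalized log posterior of `label` given the present features."""
    total = sum(class_counts.values())
    score = math.log(class_counts[label] / total)
    for f in FEATURES:
        # Laplace-smoothed Bernoulli likelihood of each feature.
        p = (feature_counts[label][f] + 1) / (class_counts[label] + 2)
        score += math.log(p if f in present else 1.0 - p)
    return score


def is_suspect(present, model):
    """True if the page should be passed on to the next layer."""
    return (log_posterior(present, "suspect", *model)
            > log_posterior(present, "ordinary", *model))


# Tiny hypothetical training set.
training = [
    ({"login_box", "bank", "password", "login"}, "suspect"),
    ({"login_box", "payment", "credit card"}, "suspect"),
    ({"winning", "login_box", "password"}, "suspect"),
    ({"copyright"}, "ordinary"),
    (set(), "ordinary"),
    ({"copyright", "winning"}, "ordinary"),
]
model = train(training)
print(is_suspect({"login_box", "password"}, model))
print(is_suspect({"copyright"}, model))
```

Only pages judged "suspect" would continue to the third layer; everything else is dropped as an ordinary page.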

The third stage further examines the pages that contain relevant sensitive words, based on three features: PageRank, domain name registration time and favicon.

PageRank [The PageRank Citation Ranking: Bringing Order to the Web; Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry; Technical Report, Stanford InfoLab; 1999] is a web page ranking algorithm proposed by Larry Page. Its basic idea is that a popular website, compared with an unpopular one, is linked to by more popular websites. This has two aspects: the more websites link to a website, the more popular it is; and the more popular the websites that link to it, the more popular it is. That is, the popularity of a website is proportional to the number of websites linking to it and to the popularity of those websites.

The favicon (short for "favorites icon") is the small icon shown at the left of the browser's address bar, also known as the website avatar. Its display differs between browsers: in most major browsers, such as Firefox and Internet Explorer (5.5 and above), the favicon is displayed not only in the favorites but also in the address bar, from which the user can drag it to the desktop to create a shortcut to the website.

The third stage is based on the following principle: normal brand websites usually have a high PageRank, a domain name that has been registered for more than K years (for example, more than 3 years), and a genuine (anti-counterfeiting) favicon, while phishing websites are just the opposite. That is, after the first and second layers of filtering, a remaining website that does not have a high PageRank, has a short domain name registration time, and/or lacks such a favicon is judged to be a suspected phishing website.
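The third-stage principle can be sketched as a simple rule. The text leaves open how the three indicators are combined ("and/or"), so this sketch assumes the strictest reading: a site stays out of the suspect set only if all three indicators look normal. The PageRank threshold is an assumed value, and K = 3 years is taken from the example in the text.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SiteSignals:
    pagerank: float               # PageRank-style popularity score
    registered_on: date           # domain registration date
    favicon_matches_brand: bool   # favicon matches the brand's genuine icon


def is_suspected(signals: SiteSignals, today: date,
                 min_pagerank: float = 5.0,
                 min_age_years: float = 3.0) -> bool:
    """Return True if the site should go on to accurate determination."""
    age_years = (today - signals.registered_on).days / 365.25
    looks_established = (signals.pagerank >= min_pagerank
                         and age_years > min_age_years
                         and signals.favicon_matches_brand)
    return not looks_established
```

For example, a site registered ten years ago with a high score and a matching favicon is excluded, while a site registered last month is kept as a suspect regardless of its other signals.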

The second and third stages above are both implemented by training a simple classifier with a small number of features, and can quickly exclude a large number of non-phishing websites from further detection. This not only improves detection efficiency but also saves hardware and software resources.

The next step is the accurate detection and determination stage. Rich features from existing statistical phishing detection (URL characters, titles, DOM trees, search engine rankings, login boxes, etc.) are combined with a series of DNS registration and resolution features, brand element features, etc., and an accurate determination model is trained on the confusing data set. For example, the confusing data set may be a data set composed of the samples (websites) inside the black circle in FIG. 1, so as to make the final determination of whether a website is phishing. Model training belongs to the fields of pattern recognition and machine learning, in particular supervised learning, that is, learning or building a model from training data; see http://en.wikipedia.org/wiki/%E7%9B%91%E7%9D%A3%E5%AD%A6%E4%B9%A0. It is not described further here.

As shown in FIG. 3, it is a schematic diagram of a module structure of a multi-stage phishing website detection system according to an embodiment of the present invention, where the system includes:

The fast filtering module is used to select a range of websites to be detected for rapid filtering, and to exclude obvious non-phishing websites;

The first filtering module is configured to perform first layer filtering by using a brand name domain library and/or a domain name white list, including: a brand name domain filtering module and a host white list filtering module;

The second filtering module is configured to perform second layer filtering by using sensitive words, including: a login box detecting module and a sensitive word filtering module;

The third filtering module is configured to perform third layer filtering by using related features of the website, including: a PageRank obtaining module, a domain name registration information acquiring module, and a favicon obtaining and matching module.

An accurate decision module for accurately determining the website to be tested in the remaining range after rapid filtering, including:

The multi-dimensional feature extraction module is configured to extract multi-dimensional features, including but not limited to the following features used by the above three filtering modules:

Domain registration feature: The registration duration of the domain name used by the website;

Logo feature: Whether the suspected phishing website contains a brand logo;

Favicon feature: whether the suspected phishing website contains the brand Favicon;

PageRank feature: The PageRank value of the domain name used by the website;

Login box feature: Whether the website contains a login box;

Sensitive word feature: Whether the website contains keywords such as “bank”, “payment”, “password”, “winning”;

Copyright statement feature: Whether the website contains a copyright statement of a brand;

Https feature: Whether the website uses the Https protocol.
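The eight features listed above can be collected into a fixed-order numeric vector for a classifier. The class and field names below are hypothetical, and encoding the yes/no features as 1.0/0.0 is an assumed convention.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass
class SiteFeatures:
    domain_age_years: float      # domain registration feature
    has_brand_logo: bool         # logo feature
    has_brand_favicon: bool      # favicon feature
    pagerank: float              # PageRank feature
    has_login_box: bool          # login box feature
    has_sensitive_words: bool    # sensitive word feature
    has_brand_copyright: bool    # copyright statement feature
    uses_https: bool             # HTTPS feature

    def to_vector(self) -> List[float]:
        """Encode the features in declaration order; booleans become 1.0/0.0."""
        return [float(v) for v in astuple(self)]


example = SiteFeatures(0.2, True, True, 0.0, True, True, True, False)
print(example.to_vector())
```

Keeping the order fixed matters because the trained model associates each position in the vector with one feature.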

The accurate decision module uses the above multi-dimensional features on the training set to train a classifier, such as a support vector machine [https://en.wikipedia.org/wiki/Support_vector_machine] or a decision tree [https://en.wikipedia.org/wiki/Decision_tree], and thereby obtains a classification model for determining suspected websites. Specific model training and classification decisions can be found at https://en.wikipedia.org/wiki/Statistical_classification.
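As a minimal illustration of training a classifier on such feature vectors, the sketch below fits a depth-one decision tree (a decision stump), the simplest member of the decision tree family. It is a deliberately simplified stand-in: a practical system would more likely use a library implementation of a support vector machine or a full decision tree. The three-feature layout and the training data are hypothetical.

```python
def train_stump(X, y):
    """Find the single feature/threshold split with the fewest training errors.

    X: list of feature vectors; y: list of labels (1 = phishing, 0 = benign).
    Returns (feature_index, threshold, label_if_above).
    """
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for label_if_above in (0, 1):
                errors = sum(
                    (label_if_above if row[j] > threshold
                     else 1 - label_if_above) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, threshold, label_if_above)
    return best[1:]


def predict(stump, row):
    """Classify one feature vector with a trained stump."""
    j, threshold, label_if_above = stump
    return label_if_above if row[j] > threshold else 1 - label_if_above


# Hypothetical training set drawn only from suspected sites.
# Columns: domain age in years, has brand favicon, uses HTTPS.
X = [[10.0, 1.0, 1.0], [6.5, 1.0, 1.0], [0.1, 0.0, 0.0],
     [0.3, 1.0, 0.0], [4.0, 0.0, 1.0], [0.2, 0.0, 1.0]]
y = [0, 0, 1, 1, 0, 1]

stump = train_stump(X, y)
print(predict(stump, [0.05, 1.0, 1.0]), predict(stump, [8.0, 0.0, 0.0]))
```

On this toy data the stump splits on domain age, which is the only column that separates the two classes without error.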

From the above, the multi-stage phishing detection method and system of the present invention is expected to improve the performance of phishing website detection based on statistical machine learning in two respects, detection efficiency and robustness. Through multi-stage filtering, the fast filtering step efficiently filters out most non-phishing websites, which largely overcomes the defect that existing phishing detection methods need to extract a large number of features and judge them comprehensively for every website. The invention balances detection efficiency and accuracy, and is suitable for large-scale server-side processing as well as client applications such as browser plug-ins. The differences in phishing website detection performance compared with the prior art are described below in tabular form:

Comparison table of detection performance between the present invention and prior art phishing websites

In the above table, the phishing website detection method of prior art 1 is a heuristic phishing detection method, which uses a series of heuristic rules to identify phishing. This method requires manual setting of the heuristic parameters, and phishers can easily evade the rules. As a result, the heuristic rule method is often unsuitable for the rapidly changing Internet environment; in particular, it is completely unsuitable for discovering emerging phishing patterns, and its limitations are obvious.

In the above table, the phishing website detection method of prior art 2 is a single-stage phishing detection method based on statistical machine learning. This kind of method avoids the defect that the parameter settings of the heuristic rule method are easily evaded by phishers, and can readily adapt to identifying many kinds of phishing. However, building a high-accuracy model requires extracting a large number of features, and the feature extraction phase takes a long time, so the method is not suitable for online detection with strict time requirements.

It should be noted that, although the multi-stage phishing website detection is described in this embodiment as the above four stages, in practice those skilled in the art may adjust and test the stages according to the validity and extraction complexity of the relevant website features, until they obtain the phishing detection strategy that best suits the current network environment. That is to say, the method of the present invention is not limited to the above four stages, and the number of stages may be increased or reduced according to actual conditions. For example, the second and third stages may be combined into one stage; or a URL similarity filtering stage may be added between the first and second stages (the URL of a phishing website often contains the brand name of the phishing target). Such adjustments are in accordance with the technical idea of the present invention, and the scope of the present invention should be defined by the claims.
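The optional URL similarity stage mentioned above could, for instance, flag URLs that contain a brand name without being hosted on that brand's own domain. The brand table and the function name in this sketch are hypothetical, and plain substring matching is an assumed simplification of "URL similarity".

```python
from urllib.parse import urlsplit

# Hypothetical mapping from brand keyword to that brand's legitimate domains.
BRAND_DOMAINS = {"examplebank": {"examplebank.com"}}


def mentions_brand_off_domain(url: str) -> bool:
    """True if the URL names a brand but is not on one of its domains."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    text = host + parts.path.lower()
    for brand, domains in BRAND_DOMAINS.items():
        on_brand_domain = any(host == d or host.endswith("." + d)
                              for d in domains)
        if brand in text and not on_brand_domain:
            return True
    return False
```

A URL such as `http://examplebank.com.secure-login.test/index` is flagged because the brand name appears in the host while the host does not end in the brand's own domain.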

Claims

1. A multi-stage phishing website detection method, comprising the following steps:

1) selecting websites within a range to be detected and performing fast filtering to exclude obvious non-phishing websites;

2) extracting multi-dimensional features used in performing the fast filtering;

3) using the above multi-dimensional features on a training set, accurately determining whether each website in the range remaining after the fast filtering is a phishing website.

2. The multi-stage phishing website detection method according to claim 1, wherein the step 1) of performing fast filtering on the websites to be detected in the Internet comprises:

1-1) using the brand host and/or domain name whitelist for the first layer of filtering;

1-2) using the login box, sensitive words and copyright information for the second layer of filtering;

1-3) using website-related features for the third layer of filtering.

3. The multi-stage phishing website detection method according to claim 2, wherein in step 1-1), the first layer of filtering is used to exclude normal brand websites.

4. The multi-stage phishing website detection method according to claim 2, wherein in step 1-2), the sensitive words include bank, credit card, payment, winning, login, and password.

5. The multi-stage phishing website detection method according to claim 2, wherein in step 1-2), the second layer of filtering adopts a Bayesian filtering method.

6. The multi-stage phishing website detection method according to claim 2, wherein in step 1-3), the website-related features comprise PageRank, domain name registration time, and favicon.

7. The multi-stage phishing website detection method according to claim 1, wherein the accurate determination in step 3) comprises: training an accurate decision model by analyzing statistical characteristics of positive and negative samples in the remaining range.

8. The multi-stage phishing website detection method according to claim 7, wherein the statistical characteristics of the positive and negative samples include existing statistical phishing detection features, DNS registration and resolution features, and brand element features.

9. A multi-stage phishing website detection system, comprising:

a fast filtering module, configured to select websites within a range to be detected and perform fast filtering to exclude obvious non-phishing websites;

a multi-dimensional feature extraction module, configured to extract multi-dimensional features used by the fast filtering module for fast filtering;

an accurate decision module, configured to use the above multi-dimensional features on a training set to accurately determine the websites to be detected in the range remaining after the fast filtering.

10. The multi-stage phishing website detection system according to claim 9, wherein the fast filtering module comprises:

a first filtering module for performing first layer filtering by using a brand name domain library and/or a domain name white list;

a second filtering module for performing second layer filtering by using sensitive words;

a third filtering module for performing third layer filtering by using relevant features of the website.