CN108337255B

CN108337255B - Phishing website detection method based on web automatic test and width learning

Info

Publication number: CN108337255B
Application number: CN201810088364.8A
Authority: CN
Inventors: 袁巍; 聂依凡; 李浩鹏; 贾昂; 蔡明辉; 姜源
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2018-01-30
Filing date: 2018-01-30
Publication date: 2020-08-04
Anticipated expiration: 2038-01-30
Also published as: CN108337255A

Abstract

The invention discloses a phishing website detection method based on web automatic test and width learning, and belongs to the technical field of computer network security. Traditional features are extracted from the url and the html page, interactive features are extracted using web automatic test technology, and a width learning model is then trained on the preprocessed training samples obtained after feature extraction, so that phishing websites are identified and detected accurately and quickly, and people's network information and property are protected.

Description

Phishing website detection method based on web automatic test and width learning

Technical Field

The invention belongs to the technical field of computer network security, and particularly relates to a phishing website detection method based on web automatic test and breadth learning.

Background

Phishing is an attack that steals sensitive information, such as users' personal identity data and financial accounts, by sending large volumes of fraudulent spam, fake web advertisements and the like that purport to come from banks or other well-known institutions. The most typical phishing attack lures the user to a carefully designed phishing website that closely resembles the target organization's website, then captures the sensitive personal information the user enters there or tricks the user into remitting money. Because victims are rarely alerted during the attack, phishing websites have become one of the most serious forms of internet crime, and phishing website detection has become one of the most active research directions in network security.

In 2016, the Global Chinese Phishing Website Status Statistical Analysis Report (2016) (hereinafter referred to as "the report") was jointly issued by the National Engineering Laboratory for Internet Domain Name Management Technology (emc), whose establishment was led by CNNIC, the international Anti-Phishing Working Group (APWG), and the Anti-Phishing Alliance of China (APAC). The data show that the number of phishing websites in China increased by 150.96% year on year in 2016, that the main counterfeited targets were Taobao and Mizhong movement, and that the domain names used by various large banks mainly comprised COM, CC, PW and NET.

In the third quarter of 2017, 360 Mobile Guard intercepted phishing websites 790 million times for mobile phone users nationwide, an increase of 102.6% over the third quarter of 2016. Classifying the phishing websites intercepted on mobile phones, gambling and lottery phishing websites account for 80.2% of the total, followed in decreasing order by false shopping, false recruitment, financial securities, false drugs, phishing advertisements and other types.

Although the number of interceptions is large, most intercepted websites have existed for a long time, and the newest phishing websites are difficult to capture and block. The life cycle of a phishing website is only 4.684 days on average, and the reported average life cycle is 13.327 days, so phishing websites must be identified and intercepted within a very short time; otherwise, people's property is put at risk.

At present, the identification and interception technology for phishing websites is executed by antivirus software and a browser, and the technology is divided into the following categories:

① blacklist filtering: phishing websites that have been detected manually or reported by the public are added to a blacklist; when an accessed url (Uniform Resource Locator) is in the blacklist, the access is intercepted and a warning is raised.

② url feature extraction: features such as the domain name are extracted from the visited url. This judgment method is unreliable because the url alone does not carry the decisive features of a phishing website, so its misjudgment rate is high.

③ page element features: phishing websites are detected by combining various page elements of the website as features. Accuracy improves over the second method, but acquiring the page features takes a certain time, so execution speed and efficiency are low.

Disclosure of Invention

Aiming at the defects or improvement requirements of the prior art, the invention provides a phishing website detection method based on web automatic test and width learning, which aims to perform traditional feature extraction based on url and html pages, perform interactive feature extraction by using a web automatic test technology, and perform width learning training by using a preprocessing training sample after feature extraction, so that the phishing website is accurately and quickly identified and detected, and the network information and property safety of people are protected.

To achieve the above object, according to one aspect of the present invention, there is provided a phishing website detection method based on web automation test and breadth learning, comprising the steps of:

(1) performing static feature extraction, dynamic feature extraction and interactive feature extraction on a large number of phishing websites and normal websites in a data set at a Personal Computer (PC) end to form a feature vector set;

the data set is from phishing websites and normal websites collected on the network or is directly obtained from a network security company;

(2) dividing the characteristic vector set in the step (1) into a training set and a verification set by using a k-fold cross verification method;

(3) training width learning by using the training set, testing and comparing by using the verification set, constructing a basic model and optimizing the performance of the classifier;

the classifier is a model trained by a width learning algorithm, and when the classifier is used, a website is input into the classifier, and whether the website is a phishing website is output; the performance of the classifier refers to the accuracy of the classifier for identifying the phishing website; optimizing performance means improving recognition accuracy;

(4) collecting the misjudged websites and the newly included websites as a new feature vector set, performing incremental learning with added input on the model, and thereby optimizing the model.

Preferably, step (1) is specifically:

(1.1) carrying out static feature extraction on the url;

(1.2) simulating an interface-free browser by using a web automation testing technology to access the url of the data set;

(1.3) carrying out dynamic feature extraction on the page visited by the url;

and (1.4) simulating a browser to carry out interactive click browsing on the page, and returning interactive features.

Preferably, the static features in step (1.1) include:

① url contains ip address;

② url is a pure number from the beginning to the first point;

③ url contains sensitive characters, such as @;

④ url port is 80 port;

⑤ url is less than 23 characters in length;

⑥ url contains keywords related to shopping or property accounts, such as account, banking, taobao;

the above six static features are denoted as < F1, F2, F3, F4, F5, F6 >.

Preferably, the dynamic characteristics in step (1.3) include:

① whether the html title contains sensitive words such as 'lottery', 'gambling abroad', 'winning';

② whether there is a form;

③ whether the resource of the picture is the same as the original url;

④ whether the href of a link has the same domain name as the url; href is the abbreviation of Hypertext Reference and specifies the url of the hyperlink target;

the four dynamic characteristics are marked as < F7, F8, F9 and F10 >.

Preferably, the interactive features in step (1.4) include:

① whether the form is strict;

② whether a clicked link is empty;

③ whether url redirection occurs when a link is clicked;

the above three interactive features are denoted as < F11, F12, F13 >.

Preferably, the step (2) is specifically:

(2.1) setting a k value;

and (2.2) dividing the data set in the step 1 into a training set and a verification set by using a k-fold cross-validation method.

Preferably, step (3) is specifically:

(3.1) training the width learning model by using the feature vector set of the webpage samples in the training set in the step (2) and testing the performance of the classifier; the webpage samples refer to the phishing websites and the normal websites in the data set;

and (3.2) continuously adjusting the network architecture to train and test by adding feature nodes and enhanced nodes until the classifier reaches the expected performance, acquiring weight information of each layer and storing the model.

Preferably, step (3.1) is specifically:

(3.11) initializing the number N2 of the characteristic windows, the number N1 of the characteristic nodes in the windows and the value N3 of the enhanced nodes; randomly initializing a classifier model feature node weight matrix, and processing feature node weights by using sparse self-coding;

(3.12) carrying out matrix multiplication on the feature vector set of the webpage sample and the weight matrix obtained in the step (3.11) to obtain a feature node matrix;

(3.13) randomly initializing an enhanced node weight matrix;

(3.14) multiplying the characteristic node matrix obtained in the step (3.12) with the weight matrix obtained in the step (3.13) to obtain an enhanced node matrix;

(3.15) transversely splicing the characteristic node matrix obtained in the step (3.12) and the enhanced node matrix obtained in the step (3.14) according to columns to obtain an input matrix;

(3.16) solving the plus sign generalized inverse of the input matrix obtained in the step (3.15) and carrying out matrix multiplication with < Y > to obtain a weight matrix; the < Y > is a matrix composed of labels of the web page samples; the label of a webpage sample represents whether the webpage sample is a phishing website, for example, label 1 represents a phishing website and label 0 represents a normal website;

(3.17) since step (2) is k-fold cross validation, repeating step (3.1) k times, averaging the accuracy of k times;

(3.18) gradually increasing the values of N1, N2 and N3, observing whether the precision of the width model is improved or not, and finding the optimal parameters.

Preferably, step (3.2) is specifically:

(3.21) adjusting and testing the model obtained in the step (3.1) by using an incremental learning method for increasing the number of characteristic nodes and the number of enhanced nodes;

and (3.22) repeating the step (3.21) a set number of times, recording the test precision obtained each time, comparing the results to determine the optimal number of characteristic nodes and the optimal number of enhanced nodes, and storing the optimal model.

Preferably, step (4) is specifically: collecting misjudged websites and newly included websites as a new feature vector set, performing incremental learning of increasing input on the model, and obtaining an adjusted weight matrix, thereby realizing optimization of the model;

Preferably, in the step (1), the static features of the url are extracted using regular expressions, and PhantomJS is used to simulate an interface-free browser for interface-free UI automatic testing; PhantomJS is an interface-free (headless) browser based on WebKit that is scripted through a JavaScript API.

The invention combines url static feature extraction, html dynamic feature extraction, and interactive feature extraction based on web automatic test technology. Static features are first extracted from the url. An interface-free UI automatic test then simulates a browser accessing the url, retrieves the page source code, and extracts dynamic features from the html. Link clicks and form operations such as account input and login are simulated without page rendering, so interactive features are extracted quickly. For new phishing websites that are not in the blacklist, simulated link clicks test whether links are empty, a simulated form login tests whether the form behaves normally, and simulated link clicks test whether url redirection occurs. With these new interactive features, phishing websites are detected quickly and accurately.

The invention uses a width learning model to improve the ability to identify new phishing websites. Width learning (also rendered as breadth learning in this text) is a new machine learning method and idea. Unlike deep learning, a width learning architecture has few layers and low requirements on computing resources. In addition, a deep learning model must be retrained as a whole when new samples are received, which takes a large amount of time, whereas the width learning algorithm does not need to retrain the original model: features are extracted only from the newly added phishing website samples and the existing model is adjusted and supplemented, so the detection precision keeps improving as the model updates itself.

In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:

1. according to the method provided by the invention, static feature extraction is carried out on the basis of url, dynamic feature extraction is carried out on the basis of html page, interactive feature extraction is carried out on the page by utilizing a web automatic test technology simulation browser, the features are changed from static to dynamic and then to interactive feedback, feature mining is changed from shallow to deep, and the quantity and quality of the features are ensured; finally, the accurate, quick and self-adaptive phishing website identification technology is realized by utilizing the characteristics of only a small amount of resources, quick training and incremental learning of the width learning model;

2. the interactive feature extraction combined with the url static feature and the html dynamic feature extraction can greatly improve the identification precision of the phishing website, is suitable for the latest phishing website, and can accurately and quickly detect and identify the phishing website;

3. feature extraction is carried out on the newly added phishing website sample through breadth learning, the existing model is adjusted and supplemented, the time required for updating the model is greatly reduced, and the requirement on computing resources is low; meanwhile, the detection precision of the width learning can be continuously improved by self in self-updating.

Drawings

FIG. 1 is a general flowchart of a phishing website detection method based on web automation testing and breadth learning in a preferred embodiment of the invention;

FIG. 2 is a diagram illustrating the feature extraction of a phishing website in accordance with a preferred embodiment of the present invention;

FIG. 3 is a schematic diagram of the preparation of a k-fold cross validation dataset for width learning according to a preferred embodiment of the present invention;

FIG. 4 is a diagram illustrating the training and optimization of breadth learning in accordance with a preferred embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The invention provides an intelligent phishing website detection method based on web automatic testing and width learning. FIG. 1 is the main flow chart of the invention and shows the overall flow and the relationship among the steps. The implementation of the flow is described in detail below:

(1) Step 1: performing static feature extraction, dynamic feature extraction and interactive click access on a large number of phishing websites and normal websites in the data set at the PC side.

As shown in fig. 2, step 1 is specifically as follows:

step 1.1, performing static feature extraction on url itself, wherein the static feature extraction comprises the following six features:

① the url contains an ip address, which can be used to evade domain name registration and user checks;

② the domain name of the url is purely numeric from its beginning to the first dot; legitimate sites rarely use a purely numeric domain name, for example, the legitimate site Baidu is https://www.baidu.com/, whereas a phishing site is http://www.030033.com/;

③ the url contains sensitive characters such as @; the part before the @ character is an account and the part after it is the real address, which is a common trick of phishing websites;

④ the url port is port 80; legitimate urls are accessed through port 80, so a port other than 80 is suspicious;

⑤ the url is shorter than 23 characters; according to statistics, the url of a legitimate website generally does not exceed 23 characters;

⑥ the url contains keywords such as account, banking and taobao, which relate to shopping and banking and therefore warrant a warning;

The url is matched against six regular expressions, one per feature, to extract the six features above, for example, re. Other features are extracted in the same way. If a feature occurs, 1 is returned, otherwise 0 is returned, forming a set of six features < F1, F2, F3, F4, F5, F6>.
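The six url checks can be sketched as follows. This is a minimal illustration, not the patent's own code: the function name `static_features`, the handling of a leading "www.", and the rule that a url without an explicit port uses its scheme's default port are all assumptions made here, and the flags simply report whether each listed condition holds.

```python
import re
from urllib.parse import urlsplit

KEYWORDS = ("account", "banking", "taobao")   # keywords named in feature 6
DEFAULT_PORTS = {"http": 80, "https": 443}    # assumption: no explicit port means the scheme default


def static_features(url):
    """Return [F1, ..., F6] as 0/1 flags for one url (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: skip a leading "www." so the first label is the site name.
    name = host[4:] if host.startswith("www.") else host
    first_label = name.split(".")[0]
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)

    f1 = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None   # host is an ip address
    f2 = re.fullmatch(r"\d+", first_label) is not None              # purely numeric name
    f3 = "@" in url                                                  # sensitive character
    f4 = port == 80                                                  # port 80
    f5 = len(url) < 23                                               # fewer than 23 characters
    f6 = re.search("|".join(KEYWORDS), url, re.IGNORECASE) is not None
    return [int(f) for f in (f1, f2, f3, f4, f5, f6)]
```

For the phishing example quoted above, `static_features("http://www.030033.com/")` sets F2 (numeric name), F4 (port 80) and F5 (short url).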

Step 1.2, the url is accessed with an interface-free browser through the following code, which skips the browser's page rendering process and saves time for the subsequent feature extraction. The specific code is as follows:

from selenium import webdriver

driver = webdriver.PhantomJS()  # interface-free browser; needs a Selenium release that still supports PhantomJS
driver.get('http://www.douyu.com/directory/all')

step 1.3, performing dynamic feature extraction on the page accessed by the url through its HTML (HyperText Markup Language) source code, wherein the dynamic feature extraction comprises the following four features:

① whether the html title contains sensitive words such as 'lottery', 'gambling' and 'winning'; some illegal phishing websites exploit netizens' interest in these topics;

② whether there is a form, which is also a sensitive feature because the ultimate purpose of a phishing website is to steal account passwords;

③ whether the resource of a picture is the same as the original url, because phishing websites often steal pictures from genuine websites;

④ whether the href of a link has the same domain name as the url, because a phishing website does not write its own content and mostly links elsewhere; href is the abbreviation of Hypertext Reference and specifies the URL of the hyperlink target.

The four features are parsed from the source code by using the driver object created in step 1.2, for example, driver.find_element_by_tag_name("title").text extracts the content of the title. Other features are extracted in the same way. If a feature occurs, 1 is returned, otherwise 0 is returned, forming a set of four features < F7, F8, F9, F10>.
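As an illustration of the four page checks, the sketch below works on an html string with Python's standard library instead of the driver object, so that it runs without a browser. The names `dynamic_features` and `_PageScanner` are hypothetical, the English sensitive words stand in for the original word list, and treating "same" as "same host name" for pictures and links is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

SENSITIVE_WORDS = ("lottery", "gambling", "winning")  # stand-ins for the original word list


class _PageScanner(HTMLParser):
    """Collects the title text, form presence, image sources and link targets."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.has_form = False
        self.img_srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def _same_host(page_url, targets):
    # Relative targets are resolved against the page url before comparing hosts.
    page_host = urlsplit(page_url).hostname
    return all(urlsplit(urljoin(page_url, t)).hostname == page_host for t in targets)


def dynamic_features(page_url, html):
    """Return [F7, F8, F9, F10] as 0/1 flags (hypothetical helper)."""
    scanner = _PageScanner()
    scanner.feed(html)
    title = scanner.title.lower()
    f7 = any(word in title for word in SENSITIVE_WORDS)   # sensitive title
    f8 = scanner.has_form                                  # page has a form
    f9 = _same_host(page_url, scanner.img_srcs)            # pictures come from the page's host
    f10 = _same_host(page_url, scanner.hrefs)              # links stay on the page's host
    return [int(f) for f in (f7, f8, f9, f10)]
```

With a real browser session, the same function could be fed the page source returned by the driver.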

Step 1.4, interactive click access is carried out on the page, and the following three characteristics are provided:

① whether the form is strict, that is, whether a random account and password can log in; because a phishing website usually has no copy of the original database, any account and password can generally log in;

② whether clicked links are empty; if most links are empty, the website is more likely to be a phishing website;

③ whether url redirection occurs when a link is clicked; phishing websites have short life cycles and often jump to unchecked ip addresses through url redirection;

For these three features, the driver is used to simulate a browser clicking on the page in real time, and the features are extracted from the returned values. For example, an account is filled into the form, and likewise a password; if clicking login succeeds, the website is likely to be a phishing website. Other features are extracted in the same way. If a feature occurs, 1 is returned, otherwise 0 is returned, forming a set of three features < F11, F12, F13>.
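One way to turn the observations from the simulated session into the three flags is sketched below. It assumes the driver has already recorded whether a random login was accepted and, for each clicked link, the link target and the url the browser ended on. The `ClickResult` record, the function name, and the choices that "empty" means an href of "" or "#" and that 1 marks the suspicious outcome are all assumptions for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class ClickResult:
    href: str        # link target as written in the page ("" or "#" if empty)
    final_url: str   # url the simulated browser ended on after the click


def _is_empty(href):
    return href.strip() in ("", "#")


def interactive_features(random_login_accepted, clicks):
    """Return [F11, F12, F13] as 0/1 flags (hypothetical helper)."""
    # F11: the form is not strict, i.e. a random account and password logged in.
    f11 = bool(random_login_accepted)
    # F12: most of the clicked links are empty.
    empty = sum(1 for c in clicks if _is_empty(c.href))
    f12 = bool(clicks) and empty > len(clicks) / 2
    # F13: some click ended on a different host than the link named.
    f13 = any(
        not _is_empty(c.href)
        and urlsplit(c.href).hostname != urlsplit(c.final_url).hostname
        for c in clicks
    )
    return [int(f) for f in (f11, f12, f13)]
```

Keeping the browser automation separate from the flag computation makes the scoring logic testable without a live page.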

(2) Step 2: dividing the training set and the validation set by k-fold cross-validation, as shown in fig. 3; step 2 is specifically as follows:

step 2.1, setting a k value: the parameter k represents the times of repeated training and testing and is in direct proportion to the consumed resources;

step 2.2, dividing a training set and a verification set by k-fold cross-validation to train the model: the original data set is divided into k parts, k-1 parts are taken as the training set and 1 part as the verification set, and this is repeated k times to improve the generalization capability of the model.
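The split in step 2.2 can be sketched with NumPy as follows. The function name `k_fold_splits` and the fixed random seed are illustrative choices, and the sketch assumes k is at least 2.

```python
import numpy as np


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index arrays for each of the k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)      # shuffle the sample order once
    folds = np.array_split(indices, k)        # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]                    # 1 part for verification
        train_idx = np.concatenate(           # the other k-1 parts for training
            [folds[j] for j in range(k) if j != i]
        )
        yield train_idx, val_idx
```

Each sample appears in exactly one verification fold, so averaging the k verification scores uses every sample once.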

Fig. 4 is a schematic diagram of training and optimizing width learning, which is described in detail in steps 3 and 4 below:

(3) Step 3: training and optimizing the width model; step 3 is as follows:

step 3.1, training a width learning model by using the phishing website sample feature vector set obtained in the step 1 and testing the performance of a classifier; the method specifically comprises the following steps:

step 3.1.1, initializing the number of feature windows N2, the number of feature nodes per window N1, and the number of enhanced nodes N3; based on repeated experiments, N1 × N2 is initialized to samples/600 and N3 to samples/10, where samples is the number of samples; randomly initializing the classifier model feature node weight matrix We, and processing the feature node weights by using sparse self-coding;

step 3.1.2, performing matrix multiplication on the phishing website training sample feature set X and the obtained weight matrix We in the step 3.1.1 to obtain a feature node matrix Z;

step 3.1.3, randomly initializing an enhanced node weight matrix Wh;

step 3.1.4, multiplying the characteristic node matrix Z obtained in the step 3.1.2 by the weight matrix Wh obtained in the step 3.1.3 to obtain an enhanced node matrix H;

step 3.1.5, horizontally concatenating, column by column, the characteristic node matrix Z obtained in the step 3.1.2 and the enhanced node matrix H obtained in the step 3.1.4 to obtain an input matrix A;

step 3.1.6, solving the plus sign generalized inverse of the input matrix A obtained in step 3.1.5 and multiplying it by the training sample label set to obtain a weight matrix W; the specific code is as follows:

import numpy as np

# lamda is a small regularization coefficient added to the diagonal before inversion
W = np.linalg.inv(A.T.dot(A) + lamda * np.eye(A.T.shape[0])).dot(A.T.dot(train_y))

step 3.1.7, since the step 2 is k-fold cross validation, the step 3.1 is repeated k times, the precision of k times is averaged, and the generalization capability of the classifier is enhanced;

step 3.1.8, tuning the parameters N1, N2 and N3 according to experimental experience until the precision peaks, and storing the values of the three parameters at the highest precision.
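Steps 3.1.1 to 3.1.6 can be put together in a short NumPy sketch. Several choices here are assumptions rather than the patent's stated method: the sparse self-coding of the feature weights is omitted, a tanh activation is applied to the enhanced nodes, labels are thresholded at 0.5, and the function names and default sizes are hypothetical.

```python
import numpy as np


def train_width_model(X, y, n1=5, n2=3, n3=20, lam=1e-3, seed=0):
    """Train on feature matrix X (samples x features) and 0/1 labels y."""
    rng = np.random.default_rng(seed)
    # Step 3.1.1: random feature-node weights We for n2 windows of n1 nodes each
    # (the sparse self-coding refinement is left out of this sketch).
    We = rng.standard_normal((X.shape[1], n1 * n2))
    Z = X @ We                                   # step 3.1.2: feature node matrix
    Wh = rng.standard_normal((Z.shape[1], n3))   # step 3.1.3: enhanced-node weights
    H = np.tanh(Z @ Wh)                          # step 3.1.4 (tanh is an assumption)
    A = np.hstack([Z, H])                        # step 3.1.5: input matrix A = [Z | H]
    # Step 3.1.6: regularized least-squares solution for the output weights W.
    W = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)
    return We, Wh, W


def predict(model, X):
    """Return 0/1 predictions by thresholding the model output at 0.5."""
    We, Wh, W = model
    Z = X @ We
    A = np.hstack([Z, np.tanh(Z @ Wh)])
    return (A @ W >= 0.5).astype(int)
```

`np.linalg.solve` is used in place of an explicit matrix inverse; it solves the same linear system as the code in step 3.1.6 and is usually the more numerically stable choice.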

Step 3.2, training and testing are carried out by increasing the node number of the matrix to adjust the network architecture until the classifier reaches the expected performance or is adjusted for a certain number of times, and weight information of each layer under the optimal condition is obtained and stored;

step 3.2 specifically comprises the following processing:

step 3.2.1, adjusting and testing the model obtained in the step 3.1 by an incremental learning method that increases the number of characteristic nodes and the number of enhanced nodes;

step 3.2.2, repeating step 3.2.1 a set number of times, recording the test precision obtained each time, comparing the results to determine the optimal number of characteristic nodes and the optimal number of enhanced nodes, and storing the optimal model.

(4) Step 4, collecting misjudged websites and newly included websites as a new feature vector set, and performing incremental learning with added input on the model when appropriate;

step 4 specifically comprises the following steps:

4.1, extracting and saving the features of the samples misclassified in step 3;

4.2, extracting the features of new websites and storing them in a local file;

4.3, when the number of stored samples reaches a preset value, performing incremental learning with all of them as added input, and adjusting the weight matrix W to update the model.
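The idea of updating W from new samples without retraining on the old ones can be illustrated as follows. This is not necessarily the update rule used by the invention: the sketch swaps in a simple accumulation of the normal-equation terms, which for the regularized solution of step 3.1.6 gives the same W as refitting the output layer on all rows. The class name is hypothetical, and the new rows must be built with the same We and Wh as the original input matrix A.

```python
import numpy as np


class OutputWeightUpdater:
    """Keeps running sums so W can be refreshed when new rows of A arrive."""

    def __init__(self, n_columns, lam=1e-3):
        self.G = lam * np.eye(n_columns)   # running A^T A + lam * I
        self.b = np.zeros(n_columns)       # running A^T y

    def add_samples(self, A_new, y_new):
        """Fold in new rows and return the updated weight vector W."""
        self.G += A_new.T @ A_new
        self.b += A_new.T @ y_new
        return np.linalg.solve(self.G, self.b)
```

Only the stored sums and the new rows are touched on each update, so the cost does not grow with the number of samples already seen.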

In conclusion, the intelligent phishing website detection method based on web automatic test and width learning provided by the invention combines traditional features with interactive features and trains the model by width learning. It consumes few resources, updates itself quickly and adaptively while preserving the accuracy of the model, and can accurately intercept and counter phishing websites within their extremely short life cycle.

In the k-fold cross-validation method, the samples are shuffled and divided evenly into k parts; in turn, k-1 parts are used for training and the remaining part for verification, and the sum of squared prediction errors is calculated; finally, the k sums of squared prediction errors are averaged and used as the basis for selecting the optimal model structure. In the special case where k equals the number of samples N, the method is leave-one-out cross-validation.

The matrix A is an m × n matrix, and the plus sign generalized inverse of the matrix A in the invention refers to $(A'A)^{-1}A'$, wherein A' represents the transposed matrix of A.
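For reference, the plus sign generalized inverse (commonly written $A^{+}$, the Moore-Penrose pseudoinverse) and the regularized expression in the code of step 3.1.6 are related as shown below. This is standard linear algebra added for orientation; the symbols $\lambda$ (the code's lamda) and $Y$ (the label matrix) follow the text, and the full-column-rank condition is an assumption needed for the first identity.

```latex
A^{+} = (A^{\top}A)^{-1}A^{\top}
  \quad \text{(when } A \text{ has full column rank)}, \qquad
A^{+} = \lim_{\lambda \to 0^{+}} \left(A^{\top}A + \lambda I\right)^{-1} A^{\top}, \qquad
W = \left(A^{\top}A + \lambda I\right)^{-1} A^{\top} Y .
```

A small positive $\lambda$ keeps the matrix invertible even when $A^{\top}A$ is singular, at the cost of a slightly biased solution.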

The weight in the invention refers to a width learning model parameter.

Sparse self-encoding (sparse autoencoder) refers to a technique that automatically learns features from unlabeled data and provides a better feature description than the original data.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A phishing website detection method based on width learning is characterized by comprising the following steps:

(1) performing static feature extraction, dynamic feature extraction and interactive feature extraction on a large number of phishing websites and normal websites in a website data set at a PC (personal computer) end to form a feature vector set;

(3) training width learning by using the training set, testing and comparing by using the verification set, constructing a basic model and optimizing the performance of the classifier; the performance of the classifier refers to the accuracy of the classifier for identifying the phishing website;

(4) collecting misjudgment websites and newly included websites as a new feature vector set, and performing incremental learning for increasing input on the model to optimize the model;

the step (1) is specifically as follows:

(1.1) carrying out static feature extraction on the url; the static features include: whether the domain name of the url is a pure number from the beginning to the first point and whether the length of the url is less than 23 characters;

(1.3) carrying out dynamic feature extraction on the page visited by the url; the dynamic features include: whether the resource of the picture is the same as the original url or not;

2. A phishing website detection method based on breadth learning as claimed in claim 1 wherein said static feature in step (1.1) further comprises:

① url contains ip address;

③ url contains sensitive characters, said sensitive characters including @;

④ url port is 80 port;

⑥ url contains keywords related to shopping or property accounts, including account, banking, taobao.

3. A phishing website detection method based on breadth learning as claimed in claim 1 wherein said dynamic feature in step (1.3) further comprises:

① html contains sensitive characters including 'lottery', 'gambling' and 'winning';

② whether there is a form;

④ if the href of the link is the same domain name as the url, and the href is the abbreviation of Hypertext Reference, which is the url specifying the target of the hyperlink.

4. A phishing website detection method based on breadth learning as claimed in claim 1 wherein said interactive feature in step (1.4) comprises:

① whether the form is strict;

② whether a clicked link is empty;

③ whether url redirection occurs when a link is clicked.

5. The phishing website detection method based on breadth learning as claimed in claim 1, wherein the step (2) is specifically as follows:

(2.1) setting a k value;

and (2.2) dividing the data set in the step (1) into a training set and a verification set by using a k-fold cross verification method.

6. The phishing website detection method based on breadth learning as claimed in claim 1, wherein the step (3) is specifically as follows:

(3.1) training the width learning model by using the feature vector set of the webpage samples in the training set in the step (2) and testing the performance of the classifier;

7. A phishing website detection method based on breadth learning as claimed in claim 6, wherein the step (3.1) is specifically:

(3.13) randomly initializing an enhanced node weight matrix;

(3.16) solving the plus sign generalized inverse of the input matrix obtained in the step (3.15) and carrying out matrix multiplication with < Y > to obtain a weight matrix; the < Y > is a matrix composed of labels of the web page samples; the label of a webpage sample represents whether the webpage sample is a phishing website;

8. A phishing website detection method based on breadth learning as claimed in claim 6, wherein the step (3.2) is specifically:

9. The phishing website detection method based on breadth learning as claimed in claim 1, wherein the step (4) is specifically as follows: and collecting misjudged websites and newly included websites as a new feature vector set, performing incremental learning of increasing input on the model, and obtaining an adjusted weight matrix, thereby realizing optimization of the model.