CN108337255B - Phishing website detection method based on web automatic test and width learning - Google Patents

Publication number
CN108337255B
Authority
CN
China
Prior art keywords
url
learning
model
phishing website
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810088364.8A
Other languages
Chinese (zh)
Other versions
CN108337255A (en
Inventor
袁巍
聂依凡
李浩鹏
贾昂
蔡明辉
姜源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201810088364.8A
Publication of CN108337255A
Application granted
Publication of CN108337255B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/30 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H04L63/306 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information intercepting packet switched data communications, e.g. Web, Internet or IMS communications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Technology Law (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a phishing website detection method based on web automatic test and width learning, and belongs to the technical field of computer network security. Traditional feature extraction is carried out on the url and html page, interactive feature extraction is carried out using web automatic testing technology, and finally width learning training is carried out on the preprocessed training samples obtained after feature extraction, so that phishing websites are identified and detected accurately and quickly, protecting people's network information and property.

Description

Phishing website detection method based on web automatic test and width learning
Technical Field
The invention belongs to the technical field of computer network security, and particularly relates to a phishing website detection method based on web automatic test and width learning.
Background
Phishing is an attack that steals sensitive information, such as users' personal identity data and financial accounts, by sending large volumes of fraudulent spam, false web advertisements and similar messages that purport to come from banks or other well-known institutions. The most typical phishing attack lures the user to a carefully designed phishing website that closely resembles the target organization's website, and then captures the sensitive personal information the user enters there or tricks the user into remitting money. Because victims are rarely alerted during the attack, phishing websites have become one of the most serious forms of internet crime, and phishing website detection has become one of the most active research directions in network security.
In 2016, the Global Chinese Phishing Website Status Statistical Analysis Report (2016) (hereinafter "the report") was jointly issued by the National Engineering Laboratory for Internet Domain Name Management Technology (emc), whose establishment was led by CNNIC, the international Anti-Phishing Working Group (APWG), and the Anti-Phishing Alliance of China (APAC). The data show that the number of phishing websites in China increased by 150.96% year on year in 2016; the main counterfeited targets were Taobao, China Mobile and the major banks, and the domain names used were mainly .COM, .CC, .PW and .NET.
In the third quarter of 2017, 360 Mobile Guard intercepted phishing websites 790 million times for mobile phone users nationwide, an increase of 102.6% over the third quarter of 2016. Among the phishing websites intercepted on mobile phones, gambling and lottery sites accounted for 80.2% of the total, followed in decreasing order by fake shopping, fake recruitment, financial securities, fake drugs, phishing advertisements and other types.
Although the number of interceptions is large, most intercepted websites have existed for a long time, and the newest phishing websites are difficult to capture and block. The life cycle of a phishing website is only 4.684 days on average, and the reported average life cycle is 13.327 days, so phishing websites must be identified and intercepted within a very short time; otherwise the safety of people's property is threatened.
At present, identification and interception of phishing websites is performed by antivirus software and browsers, and the techniques fall into the following categories:
① Blacklist filtering: phishing websites that have been manually detected or publicly reported are added to a blacklist, and when an accessed url (Uniform Resource Locator) is found in the blacklist, access is intercepted and a warning is raised.
② url feature extraction: features such as the domain name are extracted from the visited url. This method is unreliable, because a url alone does not carry the decisive features of a phishing website, and its misjudgment rate is high.
③ Page feature extraction: phishing websites are detected by combining various page elements of the website as features. Accuracy is higher than with the second method, but because acquiring page features takes time, execution speed and efficiency are low.
Disclosure of Invention
In view of the above defects or improvement requirements of the prior art, the invention provides a phishing website detection method based on web automatic test and width learning. It performs traditional feature extraction based on the url and html page, performs interactive feature extraction using web automatic testing technology, and finally performs width learning training on the preprocessed training samples obtained after feature extraction, so that phishing websites are identified and detected accurately and quickly, and people's network information and property are protected.
To achieve the above object, according to one aspect of the present invention, there is provided a phishing website detection method based on web automation test and width learning, comprising the following steps:
(1) performing static feature extraction, dynamic feature extraction and interactive feature extraction on a large number of phishing websites and normal websites in a data set at a Personal Computer (PC) end to form a feature vector set;
the data set is from phishing websites and normal websites collected on the network or is directly obtained from a network security company;
(2) dividing the characteristic vector set in the step (1) into a training set and a verification set by using a k-fold cross verification method;
(3) training the width learning model with the training set, testing and comparing with the verification set, constructing a basic model and optimizing the performance of the classifier;
the classifier is a model trained by the width learning algorithm; in use, a website is input to the classifier, which outputs whether the website is a phishing website; the performance of the classifier refers to its accuracy in identifying phishing websites; optimizing performance means improving recognition accuracy;
(4) collecting misjudged websites and newly included websites as a new feature vector set, performing input-increment incremental learning on the model, and optimizing the model.
Preferably, step (1) is specifically:
(1.1) carrying out static feature extraction on the url;
(1.2) using web automation testing technology to drive a headless browser that accesses the urls of the data set;
(1.3) carrying out dynamic feature extraction on the page visited by the url;
(1.4) simulating a browser to carry out interactive click browsing on the page, and returning interactive features.
Preferably, the static features in step (1.1) include:
① the url contains an ip address;
② the domain name of the url is purely numeric from its beginning to the first dot;
③ the url contains sensitive characters, such as @;
④ the url port is port 80;
⑤ the url is less than 23 characters in length;
⑥ the url contains keywords related to shopping or property accounts, such as account, banking, taobao;
the above six static features are denoted as <F1, F2, F3, F4, F5, F6>.
Preferably, the dynamic characteristics in step (1.3) include:
① whether the html title contains sensitive words such as 'lottery', 'gambling', 'winning';
② whether there is a form;
③ whether the resource of the picture is the same as the original url;
④ whether the href of a link has the same domain name as the url, where href is the abbreviation of Hypertext Reference and specifies the url of the hyperlink target;
the above four dynamic features are denoted as <F7, F8, F9, F10>.
Preferably, the interactive features in step (1.4) include:
① whether form validation is strict;
② whether a clicked link is empty;
③ whether url redirection occurs when a link is clicked;
the above three interactive features are denoted as <F11, F12, F13>.
Preferably, the step (2) is specifically:
(2.1) setting a k value;
and (2.2) dividing the data set in the step 1 into a training set and a verification set by using a k-fold cross-validation method.
Preferably, step (3) is specifically:
(3.1) training the width learning model by using the feature vector set of the webpage samples in the training set of step (2) and testing the performance of the classifier; the webpage samples are the phishing websites and normal websites in the data set;
(3.2) continuously adjusting the network architecture by adding feature nodes and enhanced nodes, training and testing until the classifier reaches the expected performance, then acquiring the weight information of each layer and storing the model.
Preferably, step (3.1) is specifically:
(3.11) initializing the number of feature windows N2, the number of feature nodes per window N1 and the number of enhanced nodes N3; randomly initializing the feature node weight matrix of the classifier model, and processing the feature node weights with a sparse autoencoder;
(3.12) multiplying the feature vector set of the webpage samples by the weight matrix obtained in step (3.11) to obtain the feature node matrix;
(3.13) randomly initializing the enhanced node weight matrix;
(3.14) multiplying the feature node matrix obtained in step (3.12) by the weight matrix obtained in step (3.13) to obtain the enhanced node matrix;
(3.15) horizontally concatenating, column-wise, the feature node matrix obtained in step (3.12) and the enhanced node matrix obtained in step (3.14) to obtain the input matrix;
(3.16) computing the Moore-Penrose generalized inverse (pseudoinverse) of the input matrix obtained in step (3.15) and multiplying it by <Y> to obtain the weight matrix; <Y> is the matrix composed of the labels of the webpage samples; the label of a webpage sample indicates whether it is a phishing website, for example, label 1 represents a phishing website and label 0 represents a normal website;
(3.17) since step (2) is k-fold cross validation, repeating the training of step (3.1) k times and averaging the accuracy over the k runs;
(3.18) gradually increasing the values of N1, N2 and N3, observing whether the accuracy of the width model improves, and finding the optimal parameters.
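Steps (3.11) to (3.16) can be summarized in matrix form. The following is a sketch under assumptions: the symbols $X$, $W_e$, $W_h$, $Z$, $H$, $A$ and $W$ follow the names used in the detailed description, while $m$ (the number of samples) and $d$ (the number of features, 13 for F1 to F13) are introduced here only for illustration. The text specifies plain matrix products, so no activation function is shown.

```latex
\begin{aligned}
Z &= X W_e, & X &\in \mathbb{R}^{m \times d},\; W_e \in \mathbb{R}^{d \times N_1 N_2} \\
H &= Z W_h, & W_h &\in \mathbb{R}^{N_1 N_2 \times N_3} \\
A &= [\, Z \mid H \,], & A &\in \mathbb{R}^{m \times (N_1 N_2 + N_3)} \\
W &= A^{+} Y, & Y &\in \mathbb{R}^{m \times 1}
\end{aligned}
```

Under the same assumptions, a new sample with feature row $x$ would be scored as $[\, x W_e \mid x W_e W_h \,]\, W$.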
Preferably, step (3.2) is specifically:
(3.21) adjusting and testing the model obtained in step (3.1) by an incremental learning method that increases the number of feature nodes and the number of enhanced nodes;
(3.22) repeating step (3.21) a set number of times, recording the test accuracy obtained each time, comparing the results to determine the optimal numbers of feature nodes and enhanced nodes, and storing the optimal model.
Preferably, step (4) is specifically: collecting misjudged websites and newly included websites as a new feature vector set, performing incremental learning of increasing input on the model, and obtaining an adjusted weight matrix, thereby realizing optimization of the model;
preferably, in step (1), the static features of the url are extracted using regular expressions, and PhantomJS is used as a headless browser to perform UI automation testing without a graphical interface; PhantomJS is a WebKit-based headless browser that is scriptable through a JavaScript API.
The invention combines url static feature extraction, html dynamic feature extraction and interactive feature extraction based on web automatic testing technology. Static features are first extracted from the url. A headless UI automation test then simulates a browser accessing the url, extracts the page source code, and extracts dynamic features from the html. Link clicks and form operations such as account input and login are simulated while skipping page rendering, so interactive features are extracted quickly. For new phishing websites that are not in a blacklist, simulated link clicks test whether links are empty, a simulated form login tests whether the form behaves normally, and simulated link clicks test whether url redirection occurs. With these new interactive features, phishing websites are detected quickly and accurately.
The invention uses the width learning model to improve the ability to identify new phishing websites. Width learning is a new machine learning method; unlike deep learning, its architecture has few layers and low computing resource requirements. In addition, deep learning must retrain the whole model when new samples arrive, which takes a large amount of time, whereas the width learning algorithm does not need to retrain the original model: features are extracted only from the newly added phishing website samples, the existing model is adjusted and supplemented, and detection accuracy keeps improving through self-updating.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
1. in the method provided by the invention, static features are extracted from the url, dynamic features are extracted from the html page, and interactive features are extracted from the page by simulating a browser with web automatic testing technology; the features progress from static to dynamic to interactive feedback, and feature mining goes from shallow to deep, which ensures the quantity and quality of the features; finally, an accurate, fast and self-adaptive phishing website identification technique is realized by exploiting the low resource usage, fast training and incremental learning of the width learning model;
2. combining interactive feature extraction with url static feature and html dynamic feature extraction greatly improves the identification accuracy for phishing websites, applies to the newest phishing websites, and detects and identifies them accurately and quickly;
3. with width learning, features are extracted only from the newly added phishing website samples and the existing model is adjusted and supplemented, which greatly reduces the time required to update the model and keeps computing resource requirements low; meanwhile, the detection accuracy of width learning keeps improving through self-updating.
Drawings
FIG. 1 is a general flowchart of a phishing website detection method based on web automation testing and breadth learning in a preferred embodiment of the invention;
FIG. 2 is a diagram illustrating the feature extraction of a phishing website in accordance with a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of the preparation of a k-fold cross validation dataset for width learning according to a preferred embodiment of the present invention;
FIG. 4 is a diagram illustrating the training and optimization of breadth learning in accordance with a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The invention provides an intelligent phishing website detection method based on web automatic testing and width learning. Fig. 1 is the main flow chart of the invention and shows the overall flow and the relations among the steps. The implementation of the procedure is described below:
(1) Step 1: performing static feature extraction, dynamic feature extraction and interactive click access on a large number of phishing websites and normal websites in the data set at the PC side.
As shown in fig. 2, step 1 is specifically as follows:
step 1.1, performing static feature extraction on the url itself, comprising the following six features:
① the url contains an ip address, which can be used to avoid domain name registration and user checks;
② the domain name of the url is purely numeric from its beginning to the first dot; legitimate sites rarely use a purely numeric domain name: compare the legitimate site Baidu, https://www.baidu.com/, with the phishing site http://www.030033.com/;
③ the url contains the @ character, where the part before @ is an account and the part after it is the real address; this is a common trick of phishing websites;
④ the url port is port 80; legitimate urls are accessed through port 80, and a port other than 80 raises suspicion of a phishing website;
⑤ the url is shorter than 23 characters; according to statistics, the url of a legitimate website is generally no longer than 23 characters;
⑥ the url contains keywords such as account, banking, taobao, which relate to shopping and banking and warrant caution;
the above six features are matched and extracted by applying six regular expressions to the url, for example, re. Other features are extracted in the same way. If a feature occurs, 1 is returned, otherwise 0 is returned, forming a feature set of six features <F1, F2, F3, F4, F5, F6>.
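The six static checks of step 1.1 can be sketched in Python as follows. This is a minimal illustration rather than the patent's own code: the function name `static_features`, the `KEYWORDS` tuple and the regular expressions are hypothetical, and two interpretations are assumed, namely that a leading "www." is ignored when testing for a numeric domain name, and that a url without an explicit port counts as port 80.

```python
import re
from urllib.parse import urlsplit

# Example keywords taken from the text; a real list would be longer (assumption).
KEYWORDS = ("account", "banking", "taobao")


def static_features(url):
    """Return the 0/1 flags [F1, F2, F3, F4, F5, F6] for one url."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Assumption: a leading "www." is ignored when testing for a numeric name.
    name = host[4:] if host.startswith("www.") else host
    first_label = name.split(".")[0]

    # F1: the host is a dotted-quad ip address.
    f1 = int(re.fullmatch(r"[0-9]{1,3}(\.[0-9]{1,3}){3}", host) is not None)
    # F2: the domain name is purely numeric up to the first dot.
    f2 = int(re.fullmatch(r"[0-9]+", first_label) is not None)
    # F3: the url contains the sensitive character @.
    f3 = int("@" in url)
    # F4: port 80. Assumption: no explicit port counts as port 80.
    f4 = int(parts.port is None or parts.port == 80)
    # F5: the url is shorter than 23 characters.
    f5 = int(len(url) < 23)
    # F6: the url contains a shopping or account keyword.
    f6 = int(any(word in url.lower() for word in KEYWORDS))
    return [f1, f2, f3, f4, f5, f6]
```

With these assumptions, `static_features("http://www.030033.com/")` returns `[0, 1, 0, 1, 1, 0]`: only the numeric-domain, port and length flags are set.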
Step 1.2, the url is accessed with a headless browser through the following code; the browser page rendering process is skipped, which saves time for the subsequent feature extraction. The specific code is as follows:
from selenium import webdriver
driver = webdriver.PhantomJS()  # headless browser; webdriver.PhantomJS exists only in Selenium 3.x and earlier
driver.get('http://www.douyu.com/directory/all')  # load the page to be analyzed
step 1.3, performing dynamic feature extraction on the page accessed by the url through its HTML (HyperText Markup Language) source code, comprising the following four features:
① whether the html title contains sensitive words such as 'lottery', 'gambling' and 'winning'; illegal phishing websites like to exploit the interests of netizens with such words;
② whether there is a form, which is also a sensitive feature because the ultimate purpose of a phishing website is to steal account passwords;
③ whether the resource of the picture is the same as the original url, because phishing websites often steal pictures from genuine websites;
④ whether the href of a link has the same domain name as the url, because phishing websites do not write their own content and mostly link to content hosted elsewhere; href is the abbreviation of Hypertext Reference and specifies the URL of the hyperlink target.
The four features are parsed from the source code using the driver object created in step 1.2; for example, driver.find_element_by_tag_name("title").text extracts the content of the title. Other features are extracted in the same way. If a feature occurs, 1 is returned, otherwise 0 is returned, forming a feature set of four features <F7, F8, F9, F10>.
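The four dynamic checks can also be sketched without a browser by parsing the page source that the driver returns. The sketch below uses only the Python standard library; the names `dynamic_features`, `_PageScanner` and `SENSITIVE_TITLE_WORDS` are hypothetical. It assumes that F9 and F10 are set to 1 when at least one image or link points to a different host than the page, and it compares full host names, which is stricter than comparing registered domain names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Example words from the text (assumption: matched case-insensitively).
SENSITIVE_TITLE_WORDS = ("lottery", "gambling", "winning")


class _PageScanner(HTMLParser):
    """Collects the title text, form presence, image sources and link targets."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.has_form = False
        self.img_srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "form":
            self.has_form = True
        elif tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def _has_foreign_host(links, page_url):
    """True if any http(s) link resolves to a host other than the page's host."""
    page_host = urlsplit(page_url).hostname
    for link in links:
        target = urlsplit(urljoin(page_url, link))
        if target.scheme in ("http", "https") and target.hostname != page_host:
            return True
    return False


def dynamic_features(page_url, html):
    """Return the 0/1 flags [F7, F8, F9, F10] for one page."""
    scanner = _PageScanner()
    scanner.feed(html)
    title = scanner.title.lower()
    f7 = int(any(word in title for word in SENSITIVE_TITLE_WORDS))
    f8 = int(scanner.has_form)
    f9 = int(_has_foreign_host(scanner.img_srcs, page_url))
    f10 = int(_has_foreign_host(scanner.hrefs, page_url))
    return [f7, f8, f9, f10]
```

In a pipeline like the one described, the `html` argument could come from `driver.page_source` after `driver.get(url)`.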
Step 1.4, interactive click access is carried out on the page, giving the following three features:
① whether form validation is strict, that is, whether a random account and password can log in; because phishing websites usually lack the original database, any account and password can generally log in;
② whether clicked links are empty; if most links are empty, the website is more likely to be a phishing website;
③ whether url redirection occurs when a link is clicked; phishing websites have a short life cycle and often use url redirection to jump to an ip address that has not yet been checked;
the three features are extracted by using the driver to simulate a browser clicking on the page in real time and reading the returned values; for example, an account and password can be filled into a form at random, and if clicking login succeeds, the website is likely to be a phishing website. Other features are extracted in the same way. If a feature occurs, 1 is returned, otherwise 0 is returned, forming a feature set of three features <F11, F12, F13>;
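Once the automation layer has recorded what happened during the simulated clicks and the simulated login, turning those observations into F11 to F13 is simple. The sketch below shows only that last step and leaves the browser driving aside. The types `ClickResult` and `Observation`, the function `interactive_features` and the `empty_ratio` threshold of 0.5 (one reading of "most links are empty") are all hypothetical; link targets are assumed to have been resolved to absolute urls beforehand.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlsplit


@dataclass
class ClickResult:
    href: str                  # link target before the click
    final_url: Optional[str]   # url after the click, or None if nothing loaded


@dataclass
class Observation:
    random_login_accepted: bool  # True if random credentials logged in
    clicks: List[ClickResult] = field(default_factory=list)


def _is_empty(click):
    """A link counts as empty if it has no real target or loads nothing."""
    href = click.href.strip().lower()
    return href in ("", "#") or href.startswith("javascript:") or click.final_url is None


def _is_redirect(click):
    """A click counts as redirected if it ends on a different host than its target."""
    if _is_empty(click):
        return False
    return urlsplit(click.href).hostname != urlsplit(click.final_url).hostname


def interactive_features(obs, empty_ratio=0.5):
    """Return the 0/1 flags [F11, F12, F13] for one observed page."""
    # F11: the form is not strict, random credentials were accepted.
    f11 = int(obs.random_login_accepted)
    # F12: more than empty_ratio of the clicked links are empty.
    empty = sum(_is_empty(c) for c in obs.clicks)
    f12 = int(bool(obs.clicks) and empty / len(obs.clicks) > empty_ratio)
    # F13: at least one click was redirected to another host.
    f13 = int(any(_is_redirect(c) for c in obs.clicks))
    return [f11, f12, f13]
```

Separating observation from scoring keeps the scoring testable without a browser; the observations themselves would still have to be gathered with the driver as the text describes.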
(2) Step 2: dividing the training set and the verification set by k-fold cross validation, as shown in fig. 3; step 2 is specifically as follows:
step 2.1, setting the k value: the parameter k is the number of times training and testing are repeated, and the resources consumed are proportional to it;
step 2.2, dividing the data into a training set and a verification set by k-fold cross validation to train the model: the original data set is divided into k parts, k-1 parts are used as the training set and 1 part as the verification set, and this is repeated k times to improve the generalization ability of the model.
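The split of step 2.2 can be sketched with the standard library alone. The generator name `k_fold_splits` and the `seed` parameter are hypothetical; the sketch shuffles the sample indices once, deals them into k folds, and yields each fold in turn as the verification set with the other k-1 folds as the training set.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k nearly equal parts
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val
```

Each sample index appears in exactly one verification set, so the k accuracy values averaged in step 3.1.7 together cover the whole data set.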
Fig. 4 is a schematic diagram of the training and optimization of width learning, described in steps 3 and 4 below:
(3) Step 3: training and optimizing the width model; step 3 is as follows:
step 3.1, training the width learning model by using the phishing website sample feature vector set obtained in step 1 and testing the performance of the classifier, specifically:
step 3.1.1, initializing the number of feature windows N2, the number of feature nodes per window N1 and the number of enhanced nodes N3; based on repeated experiments, N1 × N2 is initialized to samples/600 and N3 to samples/10, where samples is the number of samples; randomly initializing the feature node weight matrix We of the classifier model, and processing the feature node weights with a sparse autoencoder;
step 3.1.2, multiplying the phishing website training sample feature set X by the weight matrix We obtained in step 3.1.1 to obtain the feature node matrix Z;
step 3.1.3, randomly initializing the enhanced node weight matrix Wh;
step 3.1.4, multiplying the feature node matrix Z obtained in step 3.1.2 by the weight matrix Wh obtained in step 3.1.3 to obtain the enhanced node matrix H;
step 3.1.5, horizontally concatenating, column-wise, the feature node matrix Z obtained in step 3.1.2 and the enhanced node matrix H obtained in step 3.1.4 to obtain the input matrix A;
step 3.1.6, computing the Moore-Penrose generalized inverse (pseudoinverse) of the input matrix A obtained in step 3.1.5 and multiplying it by the training sample label set train_y to obtain the weight matrix W; the specific code is as follows:
import numpy as np
# regularized pseudoinverse solution: W = (A^T A + lamda * I)^(-1) A^T train_y, where lamda is the regularization coefficient
W = np.linalg.inv(A.T.dot(A) + lamda * np.eye(A.shape[1])).dot(A.T.dot(train_y))
step 3.1.7, since step 2 is k-fold cross validation, step 3.1 is repeated k times and the accuracy of the k runs is averaged, to strengthen the generalization ability of the classifier;
step 3.1.8, tuning the parameters N1, N2 and N3 according to experimental experience, taking the values at which accuracy peaks, and storing the values of the three parameters.
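Steps 3.1.1 to 3.1.6 can be put together in a short NumPy sketch. It is an illustration under stated assumptions, not the patent's implementation: the function names `train_width_model` and `predict`, the default node counts and the 0.5 decision threshold are hypothetical, and the sparse autoencoder refinement of We is omitted. The text describes the enhanced nodes as a plain product of Z and Wh; because a purely linear enhancement layer would add no modeling capacity beyond Z, the sketch applies tanh to that product, which is an assumption not stated in the text.

```python
import numpy as np


def train_width_model(X, Y, n1=10, n2=2, n3=20, lamda=1e-3, seed=0):
    """Train a width learning classifier; returns (We, Wh, W)."""
    rng = np.random.default_rng(seed)
    # Step 3.1.1: random feature node weights (sparse autoencoder step omitted).
    We = rng.standard_normal((X.shape[1], n1 * n2))
    # Step 3.1.2: feature node matrix.
    Z = X @ We
    # Step 3.1.3: random enhanced node weights.
    Wh = rng.standard_normal((Z.shape[1], n3))
    # Step 3.1.4: enhanced node matrix (tanh is an assumption, see lead-in).
    H = np.tanh(Z @ Wh)
    # Step 3.1.5: input matrix, Z and H side by side.
    A = np.hstack([Z, H])
    # Step 3.1.6: regularized pseudoinverse solution, solved without forming the inverse.
    W = np.linalg.solve(A.T @ A + lamda * np.eye(A.shape[1]), A.T @ Y)
    return We, Wh, W


def predict(X, We, Wh, W, threshold=0.5):
    """Return 1 for phishing and 0 for normal, one row per sample."""
    Z = X @ We
    A = np.hstack([Z, np.tanh(Z @ Wh)])
    return (A @ W > threshold).astype(int)
```

`np.linalg.solve` computes the same W as the `np.linalg.inv(...).dot(...)` expression in step 3.1.6 but avoids forming the explicit inverse, which is the usual numerical preference.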
Step 3.2, adjusting the network architecture by increasing the number of nodes of the matrix, then training and testing, until the classifier reaches the expected performance or has been adjusted a set number of times; the weight information of each layer under the optimal condition is obtained and stored;
step 3.2 specifically comprises the following processing:
step 3.2.1, adjusting and testing the model obtained in step 3.1 by an incremental learning method that increases the number of feature nodes and the number of enhanced nodes;
step 3.2.2, repeating step 3.2.1 a set number of times, recording the test accuracy obtained each time, comparing the results to determine the optimal numbers of feature nodes and enhanced nodes, and storing the optimal model.
(4) Step 4: collecting misjudged websites and newly included websites as a new feature vector set, and performing input-increment incremental learning on the model at appropriate times;
step 4 specifically comprises the following steps:
4.1, extracting and saving the features of the examples misclassified in step 3;
4.2, extracting the features of new websites and storing them in a local file;
4.3, when the number of stored samples reaches a preset value, performing input-increment incremental learning on them in one batch and adjusting the weight matrix W so as to update the model;
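The text does not spell out the update rule used in step 4.3. One simple way to update W from new samples without revisiting the old ones, given the regularized solution of step 3.1.6, is to keep the running sums A^T A + lamda·I and A^T Y and re-solve after each batch. The sketch below shows that approach; it is a stand-in chosen for clarity, not necessarily the patent's update, and the class name `IncrementalReadout` is hypothetical. `A_new` is the input matrix built from the new samples with the same We and Wh as the stored model.

```python
import numpy as np


class IncrementalReadout:
    """Keeps running sums so W can be updated from new samples only."""

    def __init__(self, n_nodes, n_outputs=1, lamda=1e-3):
        self.G = lamda * np.eye(n_nodes)          # accumulates A^T A + lamda * I
        self.b = np.zeros((n_nodes, n_outputs))   # accumulates A^T Y
        self.W = np.zeros((n_nodes, n_outputs))

    def update(self, A_new, Y_new):
        """Fold one batch of new samples into the model and return the new W."""
        self.G += A_new.T @ A_new
        self.b += A_new.T @ Y_new
        self.W = np.linalg.solve(self.G, self.b)
        return self.W
```

Because the sums are additive over samples, the W obtained after several batches equals the W that a single regularized solve over all samples would give, so earlier samples never need to be reloaded.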
In conclusion, the intelligent phishing website detection method based on web automatic test and width learning provided by the invention combines traditional features with interactive features and trains the model with width learning; it consumes few resources, achieves fast self-adaptive updating while maintaining the accuracy of the model, and can accurately intercept and combat phishing websites within their extremely short life cycle.
The k-fold cross validation method shuffles the samples and divides them evenly into k parts; in turn, k-1 parts are selected for training and the remaining part for validation, the sum of squared prediction errors is calculated, and finally the k sums of squared prediction errors are averaged and used as the basis for selecting the optimal model structure. In the special case where k equals the number of samples N, the method is leave-one-out cross validation.
The matrix A is an m × n matrix, and the Moore-Penrose generalized inverse (pseudoinverse) of A in the invention refers to $A^{+} = (A'A)^{-1}A'$, where A' denotes the transpose of A.
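The code in step 3.1.6 adds a term lamda·I before inverting, which the definition above does not show. The short derivation below, given as a sketch with $\lambda$ standing for the code's lamda and $I$ for the identity matrix, shows how the two forms relate: the regularized least-squares problem has a closed-form solution, and when A has full column rank that solution tends to the pseudoinverse solution as $\lambda \to 0$.

```latex
\begin{aligned}
W_\lambda &= \arg\min_{W}\; \lVert A W - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2 \\
0 &= 2A'(A W_\lambda - Y) + 2\lambda W_\lambda \\
W_\lambda &= (A'A + \lambda I)^{-1} A' Y \\
\lim_{\lambda \to 0^{+}} W_\lambda &= (A'A)^{-1} A' Y = A^{+} Y
\end{aligned}
```

A small positive $\lambda$ also keeps $A'A + \lambda I$ invertible when the columns of A are linearly dependent, a case in which $(A'A)^{-1}$ does not exist.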
The weight in the invention refers to a width learning model parameter.
Sparse autoencoding (sparse autoencoder) refers to a technique that automatically learns features from unlabeled data and provides a better feature description than the raw data.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A phishing website detection method based on width learning, characterized by comprising the following steps:
(1) performing static feature extraction, dynamic feature extraction and interactive feature extraction, at a PC (personal computer) end, on a large number of phishing websites and normal websites in a website data set to form a feature vector set;
(2) dividing the feature vector set of step (1) into a training set and a validation set by k-fold cross-validation;
(3) training the width learning model with the training set, testing and comparing with the validation set, constructing a basic model and optimizing the performance of the classifier, where the performance of the classifier refers to its accuracy in identifying phishing websites;
(4) collecting misjudged websites and newly included websites as a new feature vector set, and performing incremental learning with added input on the model to optimize the model;
wherein step (1) is specifically as follows:
(1.1) performing static feature extraction on the url, the static features including: whether the part of the url's domain name from its beginning to the first dot is purely numeric, and whether the length of the url is less than 23 characters;
(1.2) using a web automation testing technology to drive a headless browser that accesses the urls of the data set;
(1.3) performing dynamic feature extraction on the page reached through the url, the dynamic features including: whether the picture resources have the same source as the original url;
and (1.4) simulating browser interaction by clicking through the page, and returning the interactive features.
2. The phishing website detection method based on width learning as claimed in claim 1, wherein the static features in step (1.1) further comprise:
① whether the url contains an ip address;
③ whether the url contains sensitive characters, the sensitive characters including @;
④ whether the url port is port 80;
⑥ whether the url contains keywords related to shopping or property accounts, the keywords including account, banking and taobao.
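The static URL features named in claims 1 and 2 can be sketched with the Python standard library. This is an illustrative reading rather than the patented extractor: the function name `static_features`, the dictionary keys, and the rule that a URL without an explicit port defaults to 80 for http and 443 for https are all assumptions made here.

```python
import ipaddress
from urllib.parse import urlsplit

KEYWORDS = ("account", "banking", "taobao")  # keywords named in claim 2

def static_features(url):
    """Return the static URL features as a dict of booleans."""
    # Assumption: a URL given without a scheme is treated as http.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    first_label = host.split(".", 1)[0]      # domain text up to the first dot
    try:
        ipaddress.ip_address(host)
        has_ip = True
    except ValueError:
        has_ip = False
    try:
        port = parts.port
    except ValueError:                        # malformed port in the URL
        port = None
    if port is None:
        # Assumption: fall back to the scheme's default port.
        port = 443 if parts.scheme == "https" else 80
    lowered = url.lower()
    return {
        "first_label_numeric": first_label.isdigit(),
        "short_url": len(url) < 23,
        "has_ip": has_ip,
        "has_at": "@" in url,
        "port_is_80": port == 80,
        "has_keyword": any(k in lowered for k in KEYWORDS),
    }
```

For example, `static_features("http://user@evil.example:8080/")` flags the `@` character and reports that the port is not 80.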
3. The phishing website detection method based on width learning as claimed in claim 1, wherein the dynamic features in step (1.3) further comprise:
① whether the html contains sensitive words, the sensitive words including 'lottery', 'gambling' and 'winning';
② whether a form is present;
④ whether the href of each link has the same domain name as the url, where href is the abbreviation of Hypertext Reference and specifies the url of the hyperlink target.
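A rough sketch of the dynamic page features, using only the Python standard library on HTML that has already been fetched. Several points are assumptions made for illustration: the names `dynamic_features` and `_PageScanner`, the exact-hostname comparison standing in for "same domain name", the treatment of pictures as `img` tags, and the English sensitive words (the original terms are presumably Chinese).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

SENSITIVE_WORDS = ("lottery", "gambling", "winning")  # terms listed in claim 3

class _PageScanner(HTMLParser):
    """Collects forms, link targets and picture sources from a page."""

    def __init__(self):
        super().__init__()
        self.has_form = False
        self.hrefs = []
        self.img_srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.has_form = True
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])

def _same_host(page_url, link):
    # Resolve relative links against the page, then compare hostnames exactly.
    return urlsplit(urljoin(page_url, link)).hostname == urlsplit(page_url).hostname

def dynamic_features(page_url, html):
    """Return the dynamic page features as a dict of booleans."""
    scanner = _PageScanner()
    scanner.feed(html)
    lowered = html.lower()
    return {
        "has_sensitive_word": any(w in lowered for w in SENSITIVE_WORDS),
        "has_form": scanner.has_form,
        "links_same_domain": all(_same_host(page_url, h) for h in scanner.hrefs),
        "images_same_domain": all(_same_host(page_url, s) for s in scanner.img_srcs),
    }
```

In the method itself the HTML would come from the headless browser session rather than from a string, so that script-generated content is included.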
4. The phishing website detection method based on width learning as claimed in claim 1, wherein the interactive features in step (1.4) comprise:
① whether the form is strict;
② whether clicking a link leads to an empty (null) target;
③ whether clicking a link triggers a url redirection.
5. The phishing website detection method based on width learning as claimed in claim 1, wherein step (2) is specifically as follows:
(2.1) setting the value of k;
and (2.2) dividing the data set of step (1) into a training set and a validation set by k-fold cross-validation.
6. The phishing website detection method based on width learning as claimed in claim 1, wherein step (3) is specifically as follows:
(3.1) training the width learning model with the feature vector set of the webpage samples in the training set of step (2) and testing the performance of the classifier;
and (3.2) repeatedly adjusting the network architecture by adding feature nodes and enhanced nodes, training and testing after each adjustment, until the classifier reaches the expected performance, then acquiring the weight information of each layer and storing the model.
7. The phishing website detection method based on width learning as claimed in claim 6, wherein step (3.1) is specifically:
(3.11) initializing the number N2 of feature windows, the number N1 of feature nodes per window and the number N3 of enhanced nodes; randomly initializing the feature node weight matrix of the classifier model, and processing the feature node weights with sparse autoencoding;
(3.12) multiplying the feature vector set of the webpage samples by the weight matrix obtained in step (3.11) to obtain the feature node matrix;
(3.13) randomly initializing the enhanced node weight matrix;
(3.14) multiplying the feature node matrix obtained in step (3.12) by the weight matrix obtained in step (3.13) to obtain the enhanced node matrix;
(3.15) concatenating the feature node matrix of step (3.12) and the enhanced node matrix of step (3.14) horizontally, column by column, to obtain the input matrix;
(3.16) computing the plus-sign generalized inverse of the input matrix obtained in step (3.15) and multiplying it by < Y > to obtain the weight matrix, where < Y > is the matrix composed of the labels of the webpage samples and each label indicates whether the webpage sample is a phishing website;
(3.17) since step (2) uses k-fold cross-validation, repeating step (3.1) k times and averaging the k accuracies;
(3.18) gradually increasing the values of N1, N2 and N3, observing whether the accuracy of the width model improves, and finding the optimal parameters.
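Steps (3.11) to (3.16) can be illustrated with a small pure-Python sketch. It simplifies the claim in several labeled ways, all of which are assumptions of this example: a single feature window is used instead of N2 windows, the sparse autoencoding of the feature weights is omitted, a tanh activation is applied to the enhanced nodes, and the plus-sign generalized inverse is approximated by ridge-regularized normal equations. The names `train_bls`, `predict_bls`, `expand`, `matmul` and `solve` are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def solve(m, rhs):
    """Solve m @ w = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * u for v, u in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def expand(x, w_f, w_e):
    """Steps (3.12), (3.14), (3.15): build the input matrix [Z | H]."""
    z = matmul(x, w_f)                                   # feature node matrix
    h = [[math.tanh(v) for v in row] for row in matmul(z, w_e)]  # tanh: assumption
    return [zr + hr for zr, hr in zip(z, h)]             # horizontal concatenation

def train_bls(x, y, n_feature, n_enhance, ridge=1e-6, seed=0):
    """Return (feature weights, enhanced weights, output weights)."""
    rng = random.Random(seed)
    # (3.11) random feature weights; sparse autoencoding is omitted here.
    w_f = [[rng.uniform(-1, 1) for _ in range(n_feature)] for _ in x[0]]
    # (3.13) random enhanced node weights.
    w_e = [[rng.uniform(-1, 1) for _ in range(n_enhance)] for _ in range(n_feature)]
    a = expand(x, w_f, w_e)
    a_t = [list(col) for col in zip(*a)]
    gram = matmul(a_t, a)
    for i in range(len(gram)):
        gram[i][i] += ridge          # ridge term standing in for the pseudoinverse
    w_out = solve(gram, matmul(a_t, y))                  # (3.16)
    return w_f, w_e, w_out

def predict_bls(x, model):
    """Label a sample 1 (phishing) when the model output reaches 0.5."""
    w_f, w_e, w_out = model
    return [1 if row[0] >= 0.5 else 0 for row in matmul(expand(x, w_f, w_e), w_out)]
```

Steps (3.17) and (3.18) would wrap `train_bls` in a k-fold loop and a search over the node counts. The 0.5 decision threshold is also an assumption of this sketch.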
8. The phishing website detection method based on width learning as claimed in claim 6, wherein step (3.2) is specifically:
(3.21) adjusting and testing the model obtained in step (3.1) by incremental learning that increases the number of feature nodes and the number of enhanced nodes;
and (3.22) repeating step (3.21) a set number of times, recording the test accuracy obtained each time, comparing the results to determine the optimal number of feature nodes and the optimal number of enhanced nodes, and storing the optimal model.
9. The phishing website detection method based on width learning as claimed in claim 1, wherein step (4) is specifically as follows: collecting misjudged websites and newly included websites as a new feature vector set, performing incremental learning with added input on the model, and obtaining an adjusted weight matrix, thereby optimizing the model.
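One way to picture incremental learning with added input is to keep the running sums A'A and A'Y, so that new rows of the input matrix adjust the output weights without revisiting earlier samples. This is a swapped-in, mathematically equivalent formulation chosen for brevity, not necessarily the update rule the invention uses. The class name `IncrementalOutputLayer`, the ridge term, and the assumption that each row is already an expanded feature and enhanced node vector are all introduced here.

```python
class IncrementalOutputLayer:
    """Keeps A'A and A'Y so new samples can be absorbed without the old ones."""

    def __init__(self, n_nodes, ridge=1e-6):
        # Start from ridge * I so the system stays solvable (assumption).
        self.gram = [[ridge if i == j else 0.0 for j in range(n_nodes)]
                     for i in range(n_nodes)]
        self.moment = [0.0] * n_nodes

    def update(self, rows, labels):
        """Absorb new input-matrix rows and their 0/1 labels."""
        for row, label in zip(rows, labels):
            for i, a_i in enumerate(row):
                self.moment[i] += a_i * label
                for j, a_j in enumerate(row):
                    self.gram[i][j] += a_i * a_j

    def weights(self):
        """Solve (A'A + ridge I) w = A'Y by Gauss-Jordan elimination."""
        n = len(self.moment)
        aug = [self.gram[i][:] + [self.moment[i]] for i in range(n)]
        for c in range(n):
            p = max(range(c, n), key=lambda r: abs(aug[r][c]))
            aug[c], aug[p] = aug[p], aug[c]
            pivot = aug[c][c]
            aug[c] = [v / pivot for v in aug[c]]
            for r in range(n):
                if r != c:
                    f = aug[r][c]
                    aug[r] = [v - f * u for v, u in zip(aug[r], aug[c])]
        return [row[n] for row in aug]
```

Feeding the samples in two batches gives the same weights as feeding them all at once, which is the property that lets misjudged and newly included websites be folded into an existing model.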
CN201810088364.8A 2018-01-30 2018-01-30 Phishing website detection method based on web automatic test and width learning Active CN108337255B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810088364.8A CN108337255B (en) 2018-01-30 2018-01-30 Phishing website detection method based on web automatic test and width learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810088364.8A CN108337255B (en) 2018-01-30 2018-01-30 Phishing website detection method based on web automatic test and width learning

Publications (2)

Publication Number Publication Date
CN108337255A CN108337255A (en) 2018-07-27
CN108337255B true CN108337255B (en) 2020-08-04

Family

ID=62926122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810088364.8A Active CN108337255B (en) 2018-01-30 2018-01-30 Phishing website detection method based on web automatic test and width learning

Country Status (1)

Country Link
CN (1) CN108337255B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522838A (en) * 2018-11-09 2019-03-26 大连海事大学 A kind of safety cap image recognition algorithm based on width study
CN110213741B (en) * 2019-05-23 2022-02-08 青岛智能产业技术研究院 Method for detecting authenticity of vehicle sending information in real time based on width learning
CN110287124B (en) * 2019-07-03 2023-04-25 大连海事大学 Method for automatically marking software error report and carrying out severity identification
CN110365691B (en) * 2019-07-22 2021-12-28 云南财经大学 Phishing website distinguishing method and device based on deep learning
CN110749793A (en) * 2019-10-31 2020-02-04 杭州中恒云能源互联网技术有限公司 Dry-type transformer health management method and system based on width learning and storage medium
CN111854732B (en) * 2020-07-27 2024-02-13 天津大学 Indoor fingerprint positioning method based on data fusion and width learning
CN113098887A (en) * 2021-04-14 2021-07-09 西安工业大学 Phishing website detection method based on website joint characteristics
CN113591653B (en) * 2021-07-22 2024-08-02 中南大学 Incremental zinc flotation working condition judging method based on width learning system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102523202A (en) * 2011-12-01 2012-06-27 华北电力大学 Deep learning intelligent detection method for fishing webpages
CN106789888A (en) * 2016-11-18 2017-05-31 重庆邮电大学 A kind of fishing webpage detection method of multiple features fusion
CN107392025A (en) * 2017-08-28 2017-11-24 刘龙 Malice Android application program detection method based on deep learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605794B (en) * 2013-12-05 2017-02-15 国家计算机网络与信息安全管理中心 Website classifying method
US9596265B2 (en) * 2015-05-13 2017-03-14 Google Inc. Identifying phishing communications using templates
CN105323248B (en) * 2015-10-23 2018-09-25 绵阳师范学院 A kind of rule-based interactive Chinese Spam Filtering method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Broad Learning System: An Effective and Efficient Incremental Learning System Without the Need for Deep Architecture"; C. L. Philip Chen et al.; IEEE Transactions on Neural Networks and Learning Systems; 2017-07-21; pp. 1-15 *
"Comparative Analysis of Features Based Machine Learning Approaches for Phishing Detection"; Ankit Kumar Jain and B. B. Gupta; 2016 3rd International Conference on Computing for Sustainable Global Development; 2016-10-31; pp. 2125-2130 *
"防网络钓鱼的安全域名服务器研究" (Research on Secure Domain Name Servers for Anti-Phishing); 何高辉 (He Gaohui); China Master's Theses Full-text Database, Information Science and Technology; 2012-01-15; main text pp. 41-57 *

Also Published As

Publication number Publication date
CN108337255A (en) 2018-07-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant