CN110365691B

CN110365691B - Phishing website distinguishing method and device based on deep learning

Info

Publication number: CN110365691B
Application number: CN201910664425.5A
Authority: CN
Inventors: 冯涛; 刘蕊
Original assignee: Yunnan University of Finance and Economics
Current assignee: Yunnan University of Finance and Economics
Priority date: 2019-07-22
Filing date: 2019-07-22
Publication date: 2021-12-28
Anticipated expiration: 2039-07-22
Also published as: CN110365691A

Abstract

The application discloses a phishing website distinguishing method and device based on deep learning. The method comprises: inputting the URLs of target phishing websites, as a training website data set, into a first preset model; training the first preset model according to the training website data set to extract a first feature set; merging the first feature set with a preset second feature set to obtain a merged feature set; and inputting the training website data set into a second preset model for training according to the merged feature set, and outputting a phishing website distinguishing model. The method and the device solve the technical problem that phishing website identification methods in the related art are inefficient. They achieve the technical effects of a short phishing website detection time, high accuracy of the detection result, and stronger generalization of the discrimination method in practical application.

Description

Phishing website distinguishing method and device based on deep learning

Technical Field

The application relates to the technical field of deep learning, in particular to a phishing website distinguishing method and device based on deep learning.

Background

The concept of deep learning originates from research on artificial neural networks; a multi-layer perceptron with multiple hidden layers is one deep learning structure. Deep learning combines low-level features to form more abstract high-level representations (attribute classes or features), thereby discovering distributed feature representations of data.

Phishing is a form of online fraud in which attackers use various means to imitate the URL and page content of a genuine website, or exploit vulnerabilities in the server programs of a genuine website to insert dangerous HTML code into some of its pages, in order to steal private data such as the account numbers and passwords of users' bank accounts or credit cards.

The technology for preventing phishing website attacks in the related art has at least the following problems: during identification, the webpage content corresponding to the URL needs to be downloaded, so the detection time is long and the user's browsing experience suffers; in addition, malicious code contained in the phishing website may already have been triggered while the suspicious webpage was being downloaded. At present, therefore, there is no fully satisfactory method for countering phishing website attacks, and there is considerable room for improvement.

No effective solution has yet been proposed for the technical problem that phishing website identification methods in the related art are inefficient.

Disclosure of Invention

The application mainly aims to provide a phishing website distinguishing method and device based on deep learning so as to solve the technical problem that a phishing website identification method in the related technology is low in efficiency.

In order to achieve the above object, according to one aspect of the present application, there is provided a phishing website discrimination method based on deep learning.

The phishing website distinguishing method based on deep learning comprises the following steps: inputting the URLs of target phishing websites, as a training website data set, into a first preset model; training the first preset model according to the training website data set to extract a first feature set; merging the first feature set with a preset second feature set to obtain a merged feature set; and inputting the training website data set into a second preset model for training according to the merged feature set, and outputting a phishing website distinguishing model.

Further, before inputting the URLs of the target phishing websites as a training website data set into the first preset model, the method further comprises the following steps: acquiring phishing website data and non-phishing website data; constructing a website data set according to the phishing website data and the non-phishing website data; constructing a label data set according to the website data set, wherein the label data set is used for marking a numerical label corresponding to each URL in the website data set; and determining the training website data set and a test website data set according to the label data set, wherein the test website data set is used for testing the trained first preset model.

Further, training the first preset model according to the training website data set to extract a first feature set includes: inputting the test website data set into the trained first preset model; judging whether the accuracy of the output result of the first preset model reaches a preset threshold; if the accuracy reaches the preset threshold, ending the training; otherwise, returning to the step of training the first preset model according to the training website data set to extract the first feature set.

Further, training the first preset model according to the training website data set to extract a first feature set further comprises: calculating a saliency score of each URL character in the training website data set; calculating the influence score of each URL character according to its saliency score; sorting the URL characters from high to low according to their influence scores; defining the URL characters ranked in the top N as high-influence-score characters; and extracting the first feature set according to the high-influence-score characters.

Further, after inputting the training website data set into a second preset model for training according to the merged feature set and outputting a phishing website distinguishing model, the method further comprises the following steps: compiling a program according to the phishing website distinguishing model to obtain a phishing website distinguishing program; and embedding the phishing website distinguishing program into terminal equipment.

Further, after the phishing website distinguishing program is embedded into the terminal device, the method further comprises the following steps: receiving a website access request sent by the terminal device; starting the phishing website distinguishing model in the phishing website distinguishing program; inputting the website information in the website access request into the phishing website distinguishing model; outputting a phishing website judgment result; if the judgment result is that the website information in the website access request is a phishing website, prompting the terminal device user that a phishing website is being opened; otherwise, allowing the terminal device to perform the access operation.

In order to achieve the above object, according to another aspect of the present application, there is provided a phishing website discriminating apparatus based on deep learning.

The phishing website judging device based on deep learning comprises: a first input unit, used for inputting the URLs of target phishing websites, as a training website data set, into a first preset model; an extraction unit, used for training the first preset model according to the training website data set to extract a first feature set; a merging unit, used for merging the first feature set with a preset second feature set to obtain a merged feature set; and a first output unit, used for inputting the training website data set into a second preset model for training according to the merged feature set and outputting a phishing website distinguishing model.

Further, the device also includes: an acquisition unit, used for acquiring phishing website data and non-phishing website data; a first construction unit, used for constructing a website data set according to the phishing website data and the non-phishing website data; a second construction unit, used for constructing a label data set according to the website data set, wherein the label data set is used for marking a numerical label corresponding to each URL in the website data set; and a determining unit, used for determining the training website data set and the test website data set according to the label data set, wherein the test website data set is used for testing the trained first preset model.

Further, the extraction unit includes: an input module, used for inputting the test website data set into the trained first preset model; a judging module, used for judging whether the accuracy of the output result of the first preset model reaches a preset threshold; a stopping module, used for ending the training if the accuracy of the output result of the first preset model reaches the preset threshold; and a returning module, used for returning to the step of training the first preset model according to the training website data set to extract the first feature set if the accuracy of the output result of the first preset model does not reach the preset threshold.

Further, the extraction unit further includes: a first calculation module, used for calculating the saliency score of each URL character in the training website data set; a second calculation module, used for calculating the influence score of each URL character according to its saliency score; a sorting module, used for sorting the URL characters from high to low according to their influence scores; a definition module, used for defining the URL characters ranked in the top N as high-influence-score characters; and an extracting module, used for extracting the first feature set according to the high-influence-score characters.

In the phishing website distinguishing method and device based on deep learning in the embodiments of the application, the URLs of target phishing websites are input into a first preset model as a training website data set; the first preset model is trained according to the training website data set to extract a first feature set; the first feature set is merged with a preset second feature set to obtain a merged feature set; and the training website data set is input into a second preset model for training according to the merged feature set, so that the purpose of outputting a phishing website distinguishing model is achieved.

In addition, the following technical effects are achieved: 1. the phishing website distinguishing method based on deep learning does not depend on a blacklist, and is still effective for the latest phishing webpage. 2. The content of the web page is not downloaded during the detection process, thereby avoiding the possibility of triggering malicious codes in the web page. 3. The detection time is short, and the internet surfing experience of the user cannot be influenced. 4. Training is carried out through a training set storing mass website data, and then testing is carried out through a testing set, so that the finally obtained discrimination model has stronger generalization in practical application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:

FIG. 1 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a first embodiment of the present application;

FIG. 2 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a second embodiment of the present application;

FIG. 3 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a third embodiment of the present application;

FIG. 4 is a schematic flowchart illustrating a method for determining phishing websites based on deep learning according to a third embodiment of the present application;

FIG. 5 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a fourth embodiment of the present application;

FIG. 6 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a fifth embodiment of the present application;

FIG. 7 is a flowchart illustrating a phishing website discrimination method based on deep learning according to a sixth embodiment of the present application;

FIG. 8 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a sixth embodiment of the present application;

FIG. 9 is a schematic diagram illustrating a composition structure of a phishing website determining apparatus based on deep learning according to a first embodiment of the present application;

FIG. 10 is a schematic diagram illustrating a composition of a phishing website determining apparatus based on deep learning according to a second embodiment of the present application;

FIG. 11 is a block diagram illustrating a phishing website determining apparatus based on deep learning according to a third embodiment of the present application; and

FIG. 12 is a schematic diagram illustrating a composition structure of a phishing website determining apparatus based on deep learning according to a fourth embodiment of the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

According to an embodiment of the present application, there is provided a phishing website determination method based on deep learning, as shown in fig. 1, the method includes steps S101 to S104 as follows:

Step S101, inputting the URLs of target phishing websites, as a training website data set, into a first preset model.

In specific implementation, the training website data set uses existing manually labeled website data as its data source. The first preset model is preferably an RNN (Recurrent Neural Network) model, with different structures and different unit types, and may specifically include LSTM (Long Short-Term Memory network), GRU (Gated Recurrent Unit), BiLSTM (Bi-directional Long Short-Term Memory network), BiGRU (Bi-directional Gated Recurrent Unit), and other deep learning models.

Step S102, training the first preset model according to the training website data set to extract a first feature set.

In specific implementation, the obtained training website data set is used to train each of the four RNN deep learning models, LSTM, GRU, BiLSTM and BiGRU, until every model achieves a good recognition result, namely an accuracy above 99.8%. Internal state analysis is then performed on the four fully trained RNN deep learning models, and the characters with high influence scores (impact scores) are extracted as features to obtain the feature set F = {f_lstm, f_gru, f_bilstm, f_bigru}; that is, the characters that the four high-accuracy RNN deep learning models consider most suspicious are found and added to the first feature set.

Step S103, merging the first feature set and a preset second feature set to obtain a merged feature set.

The feature set F = {f_lstm, f_gru, f_bilstm, f_bigru} obtained in step S102 is merged with the existing manually classified features to obtain the merged feature set. The merged feature set covers the character features of phishing websites obtained through manual classification in the prior art, and also integrates the character features extracted by the four deep learning models that have been trained to a good result. This greatly enriches the character features used for identifying phishing websites in the related art and lays a foundation for subsequently improving the efficiency of phishing website identification.
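As a minimal sketch of the merge in step S103, assume each feature set is turned into numeric columns for one URL: the deep learning feature set contributes one count per high-influence-score character, and the preset second feature set contributes hand-crafted values. The characters listed in `F` and the three manual features (`url_length`, `dot_count`, `digit_count`) are illustrative assumptions, not features named in this application.

```python
from typing import Dict, List

# Hypothetical high-influence-score characters per model (illustrative only).
F: Dict[str, List[str]] = {
    "f_lstm": ["-", "."],
    "f_gru": ["-", "@"],
    "f_bilstm": [".", "/"],
    "f_bigru": ["-", "="],
}


def first_feature_set(model_features: Dict[str, List[str]]) -> List[str]:
    """Union of the characters found by the four models, in a stable order."""
    return sorted({ch for chars in model_features.values() for ch in chars})


def manual_features(url: str) -> Dict[str, float]:
    """Stand-in for the preset second feature set (assumed features)."""
    return {
        "url_length": float(len(url)),
        "dot_count": float(url.count(".")),
        "digit_count": float(sum(c.isdigit() for c in url)),
    }


def merged_feature_vector(url: str, model_features: Dict[str, List[str]]) -> Dict[str, float]:
    """Merge per-character counts with the manual features for one URL."""
    merged = {
        f"count[{ch}]": float(url.count(ch))
        for ch in first_feature_set(model_features)
    }
    merged.update(manual_features(url))
    return merged


if __name__ == "__main__":
    print(merged_feature_vector("http://a-b.example.com/x?id=1", F))
```

Only the URL string is read, which matches the stated goal of not downloading page content; the resulting vector is what a second model such as a Random Forest would be trained on.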

Step S104, inputting the training website data set into a second preset model for training according to the merged feature set, and outputting a phishing website distinguishing model.

In specific implementation, the second preset model may be a Random Forest model, which is technically mature and performs well. It should be noted that other types of classifier models can also implement the technical solution of the present application, and no specific limitation is imposed here. Based on the merged feature set obtained, the training website data set is input into the Random Forest model for training, finally yielding a phishing website distinguishing model driven by deep learning models.

According to the phishing website distinguishing method and device based on deep learning, the URLs of target phishing websites are input into a first preset model as a training website data set; the first preset model is trained according to the training website data set to extract a first feature set; the first feature set is merged with a preset second feature set to obtain a merged feature set; and the training website data set is input into a second preset model for training according to the merged feature set. The purpose of outputting a phishing website distinguishing model is thereby achieved, together with the technical effect of improving the efficiency of phishing website discrimination.

Preferably, as shown in fig. 2, before inputting the URLs of the target phishing websites as the training website data set into the first preset model, the following steps S201 to S204 are further included:

Step S201, acquiring phishing website data and non-phishing website data.

The amount of training data has a significant influence on the phishing website discrimination model finally obtained. Therefore, before the URLs of the target phishing websites are input into the first preset model as the training website data set, massive (millions of entries) website data need to be collected, specifically comprising phishing website data and normal (harmless, non-phishing) website data. Preferably, phishing websites and normal websites are collected in a 1:1 ratio.

Step S202, constructing a website data set according to the phishing website data and the non-phishing website data.

A website data set is then constructed from the millions of collected phishing and non-phishing website data entries, to serve as the basic database for subsequent model training.

Step S203, constructing a label data set according to the website data set, wherein the label data set is used for marking a numerical label corresponding to each URL in the website data set.

The massive (millions of entries) phishing website data and non-phishing website data stored in the constructed website data set are existing data that have been judged or classified manually. In specific implementation, each URL_i in the website data set is given a corresponding label. One specific labeling method is binary labeling: URL_i is labeled 1 if it is a phishing website and 0 if it is a normal (non-phishing) website, so that a label data set is formed from the labeled website data set.

Step S204, determining the training website data set and a testing website data set according to the label data set, wherein the testing website data set is used for testing the trained first preset model.

To further ensure the accuracy and reliability of the model training result, in specific implementation the obtained label data set is further divided into a training website data set and a test website data set. For example, the training website data set may account for 80% of the label data set, and the remaining 20% may serve as the test website data set, which is used to test the four trained deep learning models.
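Steps S203 and S204 can be sketched as follows, assuming the URLs are already available as two in-memory lists. The function names `build_label_dataset` and `split_dataset`, the shuffle before splitting, and the fixed seed are assumptions added for illustration; the text only specifies the 1/0 labels and the example 80%/20% ratio.

```python
import random
from typing import List, Tuple

LabelledUrl = Tuple[str, int]


def build_label_dataset(phishing: List[str], benign: List[str]) -> List[LabelledUrl]:
    """Label each phishing URL 1 and each non-phishing URL 0."""
    return [(url, 1) for url in phishing] + [(url, 0) for url in benign]


def split_dataset(
    labelled: List[LabelledUrl],
    train_fraction: float = 0.8,
    seed: int = 0,
) -> Tuple[List[LabelledUrl], List[LabelledUrl]]:
    """Shuffle a copy of the data, then cut it into training and test sets."""
    rows = list(labelled)
    random.Random(seed).shuffle(rows)  # assumed: shuffle so both classes mix
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


if __name__ == "__main__":
    data = build_label_dataset(
        [f"http://phish{i}.example" for i in range(5)],
        [f"http://ok{i}.example" for i in range(5)],
    )
    train, test = split_dataset(data)
    print(len(train), len(test))
```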

Preferably, as shown in fig. 3 and 4, the training of the first preset model according to the training website data set to extract a first feature set specifically includes the following steps S301 to S304:

Step S301, inputting the test website data set into the trained first preset model.

In specific implementation, after the four deep learning models are trained through the training website data set, the training results of the four deep learning models after training need to be further evaluated by using the test website data set.

Step S302, determining whether the accuracy of the output result of the first preset model reaches a preset threshold.

In specific implementation, whether the accuracy of the output of each of the four deep learning models reaches the preset threshold is judged according to a preset rule. The rule may be that a satisfactory training result is considered to be obtained only when the accuracies of all four deep learning models reach the preset threshold. The preset threshold can be configured flexibly according to actual needs; for example, it can be set to 99.8% or more. An accuracy of 99.8% specifically means that, after the test website data set is input into each of the four trained deep learning models, at least 99.8% of the results output by each model are correct.

Step S303, if the accuracy of the output result of the first preset model reaches a preset threshold, ending the training.

If the accuracy of the output results of the four deep learning models reaches 99.8%, it is indicated that the four models can obtain good recognition results for the data in the training website data set, and the deep learning model training can be finished to perform the operation of the subsequent steps.

Step S304, if the accuracy of the output result of the first preset model does not reach a preset threshold value, returning to continue executing the step of training the first preset model according to the training website data set to extract a first feature set.

In specific implementation, if the accuracy of the output of any one of the four deep learning models does not reach 99.8%, the training of that model is considered unqualified, and the process returns to continue training that model with the training website data set.
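The control flow of steps S301 to S304 can be sketched as a loop that keeps training only the models that are still below the threshold. This is a sketch under assumptions: the callables in `train_step` and `evaluate` are hypothetical hooks standing in for real RNN training and test-set evaluation, and the `max_rounds` guard is an added safeguard that the text does not mention.

```python
from typing import Callable, Dict


def train_until_threshold(
    train_step: Dict[str, Callable[[], None]],
    evaluate: Dict[str, Callable[[], float]],
    threshold: float = 0.998,
    max_rounds: int = 100,
) -> Dict[str, float]:
    """Train each model until its test accuracy reaches the threshold.

    train_step[name]() runs one more round of training on the training set;
    evaluate[name]() returns the current accuracy on the test set.
    """
    accuracy = {name: 0.0 for name in evaluate}
    for _ in range(max_rounds):
        pending = [name for name, acc in accuracy.items() if acc < threshold]
        if not pending:
            break  # S303: every model reached the threshold, end training
        for name in pending:
            train_step[name]()                 # S304: continue training
            accuracy[name] = evaluate[name]()  # S301/S302: test again
    return accuracy
```

Models that already meet the threshold are left untouched, which mirrors the statement that only the model with the unqualified output is trained further.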

Preferably, as shown in fig. 5, the training the first preset model according to the training website data set to extract a first feature set further includes the following steps S401 to S405:

Step S401, calculating the saliency score of each URL character in the training website data set.

In the RNN deep learning model of the present application, each character c_i in a URL is embedded into a vector e_i using the word2vec method (put simply, c_i is represented by the vector e_i). A URL with n characters thus yields E = {e_0, e_1, …, e_n}. The trained RNN deep learning model can be regarded as a high-dimensional nonlinear function S_p(E): each E is input to S_p to obtain a value y, which is used to decide whether the URL represented by E is a phishing website. We therefore need to know which e_i in E has the greatest influence on S_p, that is, which character features in the URL the RNN deep learning model relies on to identify a phishing website. The following method is adopted:

S_p(e) is approximated by a linear function of e by computing its first-order Taylor expansion:

$$S_p(e) \approx \omega^{T} e + b$$

where ω is the derivative of S_p with respect to e, and b is a constant.

The absolute value of the derivative represents the sensitivity of the final decision to a change in a particular dimension; it tells us how much a particular dimension of the URL character embedding contributes to the final decision. On this basis, the saliency score of a URL character is represented by the sum of the absolute values of the contributions of its embedding vector. The saliency score of character j in URL_i is then:
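The saliency formula itself is not reproduced in this text. Read literally, the description amounts to saliency(c_j) = Σ_d |∂S_p/∂e_{j,d}|, the sum over embedding dimensions d of the absolute derivative for character j. The sketch below illustrates that reading under stated assumptions: `toy_model` is a hypothetical linear stand-in for the trained RNN S_p, and the derivative is estimated by central finite differences rather than backpropagation, purely so the example needs no deep learning library.

```python
from typing import Callable, List

Embedding = List[List[float]]  # one vector e_j per URL character


def saliency_scores(
    score_fn: Callable[[Embedding], float],
    embedded_url: Embedding,
    eps: float = 1e-6,
) -> List[float]:
    """Per-character saliency: sum over dimensions d of |dS_p / de_{j,d}|."""
    work = [list(vec) for vec in embedded_url]  # copy; input is not mutated
    scores = []
    for vec in work:
        total = 0.0
        for d in range(len(vec)):
            original = vec[d]
            vec[d] = original + eps
            up = score_fn(work)
            vec[d] = original - eps
            down = score_fn(work)
            vec[d] = original  # restore before moving to the next dimension
            total += abs((up - down) / (2 * eps))
        scores.append(total)
    return scores


def toy_model(e: Embedding) -> float:
    """Hypothetical stand-in for S_p: a fixed weighted sum of all dimensions."""
    weights = [[1.0, -2.0], [0.0, 0.5], [3.0, 3.0]]
    return sum(w * x for wv, xv in zip(weights, e) for w, x in zip(wv, xv))


if __name__ == "__main__":
    print(saliency_scores(toy_model, [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]))
```

For the linear toy model the scores are simply the sums of absolute weights per character, so the third character is the most salient.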

Step S402, calculating the influence score of each URL character according to the saliency score of each URL character.

It should be noted that, in implementation, the saliency scores of a character across the sample set cannot simply be summed or averaged to give its influence score (impact score), because the importance of a character depends on its context: the same character can differ greatly in importance between URLs, and even between positions within the same URL. The embodiment of the present application therefore provides an algorithm for calculating the impact score of a character C:

where N is the total number of URLs, ω_i is the set of characters in a URL, and β is a constant. The ISI function takes the indices of the k largest elements of a set S and is defined as follows:

In practical implementation, it is assumed that the previously collected website data set in the embodiment of the present application contains about 0.75M (million) phishing URLs, the parameter β is 0.1, and the average length of the phishing URLs in the data set is 89.8; that is, on average about 9 characters with the highest saliency scores are selected from each URL to take part in the calculation of C_impact-score.
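The impact-score formula and the ISI definition are not reproduced in this text, so the following is only a sketch of one plausible reading: for each URL, ISI picks the positions of the k = β·len(URL) most salient characters, and a character's impact score is the fraction of the N URLs in which it appears among those positions. That aggregation rule, the rounding of k, and the function names `isi` and `impact_scores` are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def isi(values: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest elements of `values`, largest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def impact_scores(
    urls: List[str],
    saliency: List[List[float]],
    beta: float = 0.1,
) -> Dict[str, float]:
    """Assumed reading: share of URLs where a character is among the top-k."""
    n = len(urls)
    hits: Counter = Counter()
    for url, scores in zip(urls, saliency):
        k = max(1, round(beta * len(url)))  # e.g. about 9 for length 89.8
        top_chars = {url[i] for i in isi(scores, k)}
        hits.update(top_chars)  # count each character once per URL
    return {ch: count / n for ch, count in hits.items()}


if __name__ == "__main__":
    demo_urls = ["a-b", "cxd"]
    demo_saliency = [[0.1, 0.9, 0.2], [0.3, 0.8, 0.1]]
    print(impact_scores(demo_urls, demo_saliency, beta=0.34))
```

Selecting only the top-k positions per URL reflects the point made above that a character's importance depends on where it occurs, so low-saliency occurrences do not dilute its score.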

Step S403, sorting each URL character from high to low according to the influence of the URL character.

Based on the influence score C_impact-score of each URL character obtained from the above formula, the characters are sorted by C_impact-score from high to low or from low to high.

Step S404, defining the URL characters ranked in the top N as high-influence-score characters.

Based on the ranking obtained by sorting the character influence scores, the URL characters ranked in the top N are defined as high-influence-score characters. The value of N can be configured flexibly according to actual operation or requirements and is not specifically limited here. The high-influence-score characters corresponding to each of the four trained RNN deep learning models are obtained in this way.

Another way to obtain the high-influence-score characters is to use the ISI function directly to take, from the set X, the elements whose indices correspond to the λ largest influence scores C_impact-score, that is,

$$F_{\alpha} = \{\, X_i \mid i \in \mathrm{ISI}(\{ C_{\alpha}(x) \mid x \in X \}, \lambda) \,\}, \quad \alpha \in \{\mathrm{lstm}, \mathrm{gru}, \mathrm{bilstm}, \mathrm{bigru}\},$$

where X is the set of characters contained in the URLs of the data set, and λ is a constant parameter.

Step S405, extracting the first feature set according to the high-influence character.

Based on the two ways of obtaining high-influence-score characters in step S404, the top N characters are extracted as high-influence-score characters according to actual requirements, and the high-influence-score characters obtained from the four deep learning models are finally added to the deep learning feature set to obtain the first feature set, namely

$$F = \{ f_{\mathrm{lstm}}, f_{\mathrm{gru}}, f_{\mathrm{bilstm}}, f_{\mathrm{bigru}} \}.$$
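Steps S403 to S405 can be sketched as selecting, for each model, the characters with the largest scores and collecting the four results into F. This is a minimal sketch: the input dictionaries of per-character scores are hypothetical values, and breaking ties by character order is an assumption made only to keep the output deterministic.

```python
from typing import Dict, List


def top_chars(impact: Dict[str, float], lam: int) -> List[str]:
    """Characters with the `lam` largest scores, highest first."""
    ranked = sorted(impact.items(), key=lambda item: (-item[1], item[0]))
    return [ch for ch, _ in ranked[:lam]]


def build_first_feature_set(
    per_model_impact: Dict[str, Dict[str, float]],
    lam: int,
) -> Dict[str, List[str]]:
    """Assemble F = {f_lstm, f_gru, f_bilstm, f_bigru} from per-model scores."""
    return {
        f"f_{name}": top_chars(scores, lam)
        for name, scores in per_model_impact.items()
    }


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    scores = {
        "lstm": {"-": 0.9, ".": 0.4, "@": 0.7},
        "gru": {"-": 0.6, "=": 0.8, "/": 0.1},
        "bilstm": {".": 0.5, "-": 0.5, "@": 0.2},
        "bigru": {"/": 0.3, "=": 0.9, "-": 0.95},
    }
    print(build_first_feature_set(scores, lam=2))
```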

Preferably, as shown in fig. 6, after inputting the training website data set into a second preset model for training according to the merged feature set and outputting a phishing website discrimination model, the following steps S501 to S502 are further included:

Step S501, programming according to the phishing website distinguishing model to obtain a phishing website distinguishing program.

In specific implementation, various programming software in the prior art can be used for programming the obtained phishing website distinguishing model, which is not described herein again.

Step S502, the phishing website distinguishing program is embedded into the terminal equipment.

In specific implementation, the obtained phishing website distinguishing program can be embedded into a browser, a mobile phone APP or any open platform (such as WeChat, microblog, QQ and the like) for identifying phishing websites.

Preferably, as shown in fig. 7 and 8, after the phishing site discrimination program is embedded in the terminal device, the following steps S601 to S606 are further included:

step S601, receiving a website access request sent by the terminal device.

During specific implementation, the phishing website distinguishing program embedded in the terminal equipment can monitor the website access behavior of the user of the terminal equipment in real time, and receive a website access request initiated by the user of the terminal equipment, such as a request of the user accessing a certain website through URL _ i.

Step S602, the fishing site discrimination model in the fishing site discrimination program is started.

And when the phishing website judging program receives a request of a terminal equipment user for accessing a certain website through URL _ i, starting a phishing website judging model in the program immediately to judge the website.

Step S603, inputting the website information in the website access request into the phishing website discrimination model.

In specific implementation, the URL_i in the website access request is input into the phishing website discrimination model, and the character feature information and the like in URL_i are analyzed.

Step S604, outputting the phishing website judgment result.

After the judgment is finished, the phishing website judgment result is output. The specific output form of the result can be configured in advance. For example, when the output of the model is 1, the user is considered to be about to visit a phishing website; when the output of the model is 0, the user is considered to be about to visit a non-phishing (normal) website.

Step S605, if the phishing website determination result is that the website information in the website access request is a phishing website, prompting that the terminal device user is opening the phishing website.

When the output result of the model is 1, the website to be accessed by the user is a phishing website, and the user can be reminded of careful operation on a display interface of the terminal equipment in a pop-up window mode and the like.

Step S606, if the phishing website judging result is that the website information in the website access request is not a phishing website, allowing the terminal device to perform access operation.

When the output result of the model is 0, the website to be accessed by the user is a normal website, and the user can be allowed to continue accessing.
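The runtime flow of steps S601 to S606 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `handle_access_request` and `toy_model`, and the returned strings "warn" and "allow", are hypothetical and do not appear in the application; the toy classifier merely stands in for the trained discrimination model.

```python
from typing import Callable

PHISHING = 1  # label convention from the description: 1 = phishing, 0 = normal


def handle_access_request(url: str, model: Callable[[str], int]) -> str:
    """Classify the requested URL and decide how the terminal should react.

    Only the URL string is passed to the model; no page content is fetched.
    """
    result = model(url)      # S602 to S604: start the model and obtain its output
    if result == PHISHING:   # S605: prompt the user (e.g. with a pop-up window)
        return "warn"
    return "allow"           # S606: let the access operation continue


def toy_model(url: str) -> int:
    """Hypothetical stand-in for the trained model: flags URLs containing '@'."""
    return 1 if "@" in url else 0


decision_normal = handle_access_request("http://example.com/", toy_model)
decision_phish = handle_access_request("http://user@evil.example/", toy_model)
```

In an actual deployment the `model` argument would be the discrimination model obtained from training, wrapped so that it accepts a URL string and returns 0 or 1.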

From the above description, it can be seen that the phishing website discrimination model is output by inputting the websites of target phishing websites as the training website data set into the first preset model, training the first preset model according to the training website data set to extract the first feature set, merging the first feature set with the preset second feature set to obtain the merged feature set, and inputting the training website data set into the second preset model for training according to the merged feature set. The following technical effects are thereby achieved: 1. The phishing website discrimination method based on deep learning does not depend on a blacklist and remains effective for the newest phishing web pages. 2. The content of the web page is not downloaded during detection, which avoids the possibility of triggering malicious code in the web page. 3. The detection time is extremely short, so the user's browsing experience is not affected. 4. The detection accuracy (Accuracy) exceeds 99.9%, and the F1 score exceeds 99.8%.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.

According to an embodiment of the present invention, there is also provided a phishing website determining apparatus for implementing the above phishing website determining method based on deep learning, as shown in fig. 9, the apparatus including:

the first input unit 1 is used for inputting the website of the target phishing website as a training website data set into a first preset model.

In specific implementation, the training website data set uses existing manually labeled website data as its data source. The first preset model is preferably a set of RNN (Recurrent Neural Network) models with different structures and different unit types, which may specifically include LSTM (Long Short-Term Memory network), GRU (Gated Recurrent Unit), BiLSTM (Bi-directional Long Short-Term Memory network), BiGRU (Bi-directional Gated Recurrent Unit), and other deep learning models. The websites of the target phishing websites are input into the first preset model as the training website data set through the first input unit 1.

And the extraction unit 2 is used for training the first preset model according to the training website data set to extract a first feature set.

In specific implementation, the obtained training website data set is used to train the four RNN deep learning models LSTM, GRU, BiLSTM and BiGRU respectively, until each model achieves a good recognition result, i.e. an accuracy of over 99.8%. The extraction unit 2 then performs internal state analysis on the four fully trained RNN deep learning models and extracts the characters with high influence scores (impact scores) as features, obtaining the feature set F = {f_lstm, f_gru, f_bilstm, f_bigru}. In this way, the characters that the four high-accuracy RNN deep learning models consider most suspicious are found and added to the first feature set.

And the merging unit 3 is configured to merge the first feature set with a preset second feature set to obtain a merged feature set.

Based on the feature set F = {f_lstm, f_gru, f_bilstm, f_bigru} obtained above, the merging unit 3 combines this feature set with the existing manually classified features to obtain the merged feature set. The merged feature set covers the character features of phishing websites obtained through manual classification in the prior art and also integrates the character features extracted by the four trained deep learning models that achieved good training results. This greatly enriches the character features used for recognizing phishing websites in the related art and lays a foundation for subsequently improving the recognition efficiency of phishing websites.
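One possible reading of the merging step is sketched below. It assumes, for illustration only, that both the deep-learning features and the manually classified features are lists of characters, and that a URL is turned into a vector by counting each feature character; the application does not specify the encoding, and the names `merge_feature_sets` and `url_to_vector` as well as the sample characters are hypothetical.

```python
from typing import Dict, List


def merge_feature_sets(deep_sets: Dict[str, List[str]], manual: List[str]) -> List[str]:
    """Order-preserving union of the four per-model character sets and the manual set."""
    merged: List[str] = []
    for chars in list(deep_sets.values()) + [manual]:
        for c in chars:
            if c not in merged:  # keep each feature character only once
                merged.append(c)
    return merged


def url_to_vector(url: str, features: List[str]) -> List[int]:
    """Assumed encoding: one count per feature character."""
    return [url.count(c) for c in features]


# Hypothetical high-influence-score characters per model (f_lstm, f_gru, ...)
deep = {"lstm": ["-", "@"], "gru": ["@", "."], "bilstm": ["-"], "bigru": ["="]}
manual = [".", "/", "?"]  # hypothetical manually classified character features

features = merge_feature_sets(deep, manual)
vector = url_to_vector("http://a-b.example/login?x=1", features)
```

Vectors of this kind, paired with the 0/1 labels, are what a second-stage classifier such as a random forest would be trained on.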

The first output unit 4 is configured to input the training website data set into a second preset model for training according to the merged feature set and to output a phishing website discrimination model.

In specific implementation, the second preset model may be a Random Forest model, which is a mature technique with good results. It should be noted that other types of classifier models can also implement the technical solution of the present application, which is not specifically limited here. Based on the obtained merged feature set, the training website data set is input into the Random Forest model for training through the first output unit 4, finally yielding a phishing website discrimination model driven by the deep learning models.

Preferably, as shown in fig. 10, the apparatus further comprises:

and the acquisition unit 5 is used for acquiring the website data of the phishing website and the website data of the non-phishing website.

The amount of the training data has a significant influence on the finally obtained phishing website discrimination model, and therefore, before inputting the website of the target phishing website as the training website data set into the first preset model, a large amount (millions of pieces) of website data, specifically including the phishing website and the normal (or harmless or non-phishing) website data, needs to be collected by the obtaining unit 5. Preferably, the number of phishing websites and the number of normal (or harmless or non-phishing) websites are collected in a 1:1 ratio.

And the first construction unit 6 is used for constructing a website data set according to the phishing website data and the non-phishing website data.

Based on the collected mass (millions of) phishing website data and non-phishing website data, a website data set is further constructed through the first construction unit 6 and is used as a basic database for subsequent model training.

A second constructing unit 7, configured to construct a tag data set according to the website data set, where the tag data set is a data set used for marking a numerical tag corresponding to each URL in the website data set.

The large volume (millions of entries) of phishing website data and non-phishing website data stored in the constructed website data set is existing data that has already been judged or classified manually. In specific implementation, each URL_i in the website data set is labeled with a corresponding label. One specific labeling method is binary labeling: URL_i is labeled 1 if it is a phishing website and 0 if it is a normal (non-phishing) website. The second construction unit 7 thus forms the label data set from the labeled website data set.

A determining unit 8, configured to determine the training website data set and a testing website data set according to the tag data set, where the testing website data set is a data set used for testing the trained first preset model.

To further ensure the accuracy and reliability of the model training result, in the specific implementation of the present application the determining unit 8 divides the obtained label data set into a training website data set and a testing website data set. For example, the training website data set may account for 80% of the label data set, with the remaining 20% of the websites used as the testing website data set. The four trained deep learning models are then tested with the testing website data set.
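The labeling and 80/20 split described above can be sketched as follows. The shuffling step and the fixed random seed are assumptions added for illustration (the application does not say how the split is drawn), and the function names and example URLs are hypothetical.

```python
import random
from typing import List, Tuple

LabeledUrl = Tuple[str, int]


def build_labeled_dataset(phishing_urls: List[str], normal_urls: List[str]) -> List[LabeledUrl]:
    """Binary labeling: 1 for a phishing URL, 0 for a normal (non-phishing) URL."""
    return [(u, 1) for u in phishing_urls] + [(u, 0) for u in normal_urls]


def split_dataset(labeled: List[LabeledUrl], train_fraction: float = 0.8,
                  seed: int = 0) -> Tuple[List[LabeledUrl], List[LabeledUrl]]:
    """Shuffle (an assumption) and cut into training and testing parts."""
    shuffled = labeled[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data collected in a 1:1 ratio
phishing = [f"http://phish{i}.example/login" for i in range(5)]
normal = [f"http://site{i}.example/" for i in range(5)]

labeled = build_labeled_dataset(phishing, normal)
train_set, test_set = split_dataset(labeled)
```

With ten labeled URLs this yields eight training entries and two testing entries; the same functions apply unchanged to a data set with millions of entries.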

Preferably, as shown in fig. 11, the extraction unit 2 includes:

an input module 21, configured to input the test website data set into the trained first preset model.

In specific implementation, after the four deep learning models are trained through the above-mentioned training website data set, the input module 21 needs to further evaluate the training results of the trained four deep learning models by using the testing website data set.

And the judging module 22 is configured to judge whether the accuracy of the output result of the first preset model reaches a preset threshold.

In specific implementation, the determining module 22 determines, according to a preset rule, whether the accuracy of the output results of the four deep learning models reaches a preset threshold. The preset rule may be that a satisfactory training result is considered to be obtained only when the accuracy of the output results of all four deep learning models reaches the preset threshold. The preset threshold can be configured flexibly according to actual needs; for example, it may require an accuracy of 99.8% or more. An accuracy of 99.8% specifically means that, after the testing website data set is input into each of the four trained deep learning models, at least 99.8% of the results output by each model are correct.

And a termination module 23, configured to terminate the training if the accuracy of the output result of the first preset model reaches a preset threshold.

If the accuracy of the output results of the four deep learning models reaches 99.8%, which indicates that the four models can obtain good recognition results for the data in the training website data set, the termination module 23 can end the deep learning model training and perform the operation of the subsequent steps.

And the returning module 24 is configured to return to continue to execute the step of training the first preset model according to the training website data set to extract the first feature set if the accuracy of the output result of the first preset model does not reach the preset threshold.

In specific implementation, if the accuracy of the output result of any one of the four deep learning models does not reach 99.8%, the training is considered to be unqualified, and the returning module 24 returns to continue to use the training website data set to train the model with the unqualified output result.
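The cooperation of the judging, termination and returning modules amounts to a loop that keeps training only the models that are still below the threshold. The sketch below is an illustration under stated assumptions: `train_one_round` and `evaluate` are hypothetical callbacks supplied by the caller, the `max_rounds` safety cap is not part of the application, and the toy models simply store an accuracy value.

```python
from typing import Callable, Dict, List


def accuracy(predictions: List[int], labels: List[int]) -> float:
    """Proportion of outputs that match the labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def train_until_threshold(models: Dict[str, dict],
                          train_one_round: Callable[[dict], None],
                          evaluate: Callable[[dict], float],
                          threshold: float = 0.998,
                          max_rounds: int = 100) -> int:
    """Retrain every model whose test accuracy is below the threshold.

    Returns the number of extra training rounds that were needed.
    """
    rounds = 0
    while True:
        failing = [name for name, m in models.items() if evaluate(m) < threshold]
        if not failing:
            return rounds                      # all models qualified: end training
        if rounds >= max_rounds:               # safety cap (assumption)
            raise RuntimeError(f"models still below threshold: {failing}")
        for name in failing:
            train_one_round(models[name])      # return and continue training
        rounds += 1


# Toy stand-ins: each "model" only records its current test accuracy
toy_models = {"lstm": {"acc": 0.99}, "gru": {"acc": 0.999}}


def toy_train(m: dict) -> None:
    m["acc"] = min(1.0, m["acc"] + 0.005)


def toy_eval(m: dict) -> float:
    return m["acc"]


rounds_needed = train_until_threshold(toy_models, toy_train, toy_eval)
```

Here the toy GRU model already qualifies and is never retrained, while the toy LSTM model needs two further rounds.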

Preferably, as shown in fig. 12, the extraction unit 2 further includes:

a first calculating module 25, configured to calculate a saliency score of each URL character in the training website data set.

In specific implementation, the first calculation module 25 obtains the following result by adopting the following method:

where ω is the derivative of S_p with respect to e, and b is a constant.

The absolute value of the derivative represents the sensitivity of the final decision to a change in a particular dimension; it indicates how much a particular dimension of the URL character embedding contributes to the final decision. On this basis, the embodiment of the application uses the sum of the absolute values of the contributions over the embedding vector to represent the saliency score of a URL character. For character j in URL_i, the saliency score is therefore the sum of the absolute values of the derivative over all dimensions of that character's embedding vector.
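The sum-of-absolute-values step can be illustrated with a short Python sketch. It assumes the derivatives of the class score with respect to each embedding dimension have already been obtained from a trained model (for example by backpropagation); the function name `saliency_scores` and the gradient values are hypothetical.

```python
from typing import List


def saliency_scores(gradients: List[List[float]]) -> List[float]:
    """gradients[j][d] is the derivative of the class score with respect to
    dimension d of the embedding of character j. The saliency score of
    character j is the sum of the absolute values over all dimensions."""
    return [sum(abs(g) for g in row) for row in gradients]


# Hypothetical gradients for a three-character URL fragment with 2-dimensional embeddings
grads = [
    [0.5, -0.25],   # character 0
    [0.0, 0.125],   # character 1
    [-1.0, 1.0],    # character 2
]
scores = saliency_scores(grads)
```

In this example character 2 receives the largest saliency score, meaning the model's decision is most sensitive to that character.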

and a second calculating module 26, configured to calculate an influence score of each URL character according to the saliency score of each URL character.

In specific implementation, the second calculation module 26 calculates the impact score of a character C as follows:

where N is the total number of website addresses, ω_i is the set of characters in the website address, and β is a constant. The ISI function returns the indices of the k largest elements of a set S and is defined as follows:

and the sorting module 27 is configured to sort each URL character from high to low according to the influence score of the URL character.

A defining module 28, configured to define the URL character ordered in the top N bits as a high impact score character.

Based on the sequence obtained after the character influence score sorting, the URL characters sorted in the first N bits are defined as high-influence score characters by the definition module 28, where the numerical value of N may be flexibly configured according to actual operation or requirements, and is not specifically limited here, so that the high-influence score characters corresponding to the trained four RNN deep learning models can be obtained.

F_α = {X_i | i ∈ ISI({C_α(x) | x ∈ X}, λ)}, α ∈ {lstm, gru, bilstm, bigru}.

And the extracting module 29 is configured to extract the first feature set according to the high-impact character.

F = {f_lstm, f_gru, f_bilstm, f_bigru}.
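The ranking and top-N selection performed by the sorting module 27, the defining module 28 and the extracting module 29 can be sketched as follows. The sketch assumes the influence score of each character has already been computed; `isi` is written from its verbal description (indices of the k largest elements), and the helper name `high_impact_chars` and the sample scores are hypothetical.

```python
from typing import Dict, List


def isi(scores: List[float], k: int) -> List[int]:
    """Indices of the k largest elements, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def high_impact_chars(chars: List[str], impact: Dict[str, float], n: int) -> List[str]:
    """Sort characters by influence score and keep the top n."""
    scores = [impact[c] for c in chars]
    return [chars[i] for i in isi(scores, n)]


# Hypothetical influence scores produced by one trained model
chars = ["a", "-", "@", "."]
impact_lstm = {"a": 0.1, "-": 0.7, "@": 0.9, ".": 0.4}

f_lstm = high_impact_chars(chars, impact_lstm, n=2)
```

Repeating the selection for each of the four models gives f_lstm, f_gru, f_bilstm and f_bigru, which together form the first feature set F.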

Preferably, the apparatus further comprises:

a compiling unit, configured to write a program according to the phishing website discrimination model to obtain a phishing website discrimination program.

In specific implementation, the obtained phishing website discrimination model can be programmed by the compiling unit using any programming software available in the prior art, which is not described again here.

An embedding unit, configured to embed the phishing website discrimination program into the terminal device.

In specific implementation, the obtained phishing website judging program can be embedded into a browser, a mobile phone APP or any open platform (such as WeChat, microblog, QQ and the like) through an embedding unit to identify the phishing website.

Preferably, the apparatus further comprises:

and the receiving unit is used for receiving the website access request sent by the terminal equipment.

In specific implementation, the phishing website discrimination program embedded in the terminal device can monitor the website access behavior of the terminal device user in real time, and the receiving unit receives a website access request initiated by that user, such as a request to access a certain website through URL_i.

A starting unit, configured to start the phishing website discrimination model in the phishing website discrimination program.

When the phishing website discrimination program receives a request from the terminal device user to access a certain website through URL_i, the starting unit starts the phishing website discrimination model in the program to judge the website.

A second input unit, configured to input the website information in the website access request into the phishing website discrimination model.

In specific implementation, the URL_i in the website access request is input into the phishing website discrimination model through the second input unit, and the character feature information and the like in URL_i are analyzed.

A second output unit, configured to output the phishing website judgment result.

After the judgment is finished, the phishing website judgment result is output through the second output unit. The specific output form of the result can be configured in advance. For example, when the output of the model is 1, the user is considered to be about to visit a phishing website; when the output of the model is 0, the user is considered to be about to visit a non-phishing (normal) website.

And the prompting unit is used for prompting the terminal equipment user that the phishing website is opened if the phishing website judging result is that the website information in the website access request is the phishing website.

When the output result of the model is 1, the website to be visited by the user is a phishing website, and the user can be reminded of careful operation in a pop-up window mode and the like on a display interface of the terminal equipment through a prompting unit.

And the access unit is used for allowing the terminal equipment to perform access operation if the phishing website judging result shows that the website information in the website access request is not the phishing website.

When the output result of the model is 0, the website to be accessed by the user is a normal website, and the user can be allowed to continue accessing through the access unit.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A phishing website distinguishing method based on deep learning is characterized by comprising the following steps:

inputting the website of a target phishing website as a training website data set into a first preset model, wherein the first preset model is a recurrent neural network model with different structures and different unit types;

training the first preset model according to the training website data set to extract a first feature set, which specifically comprises:

calculating a saliency score of each URL character in the training website data set;

calculating the influence score of each URL character according to the significance score of each URL character;

sorting each URL character from high to low according to the influence score of the URL character;

defining the URL characters in the top N positions as high-influence score characters;

extracting the first feature set according to the high-influence character;

the first feature set comprises suspicious characters determined by the first preset model;

merging the first feature set and a preset second feature set to obtain a merged feature set;

and inputting the training website data set into a second preset model for training according to the merged feature set, and outputting a phishing website discrimination model.

2. A phishing website discrimination method based on deep learning as claimed in claim 1, wherein before inputting the website of the target phishing website as the training website data set into the first preset model, further comprising:

acquiring phishing website data and non-phishing website data;

constructing a website data set according to the phishing website data and the non-phishing website data;

constructing a label data set according to the website data set, wherein the label data set is used for marking a numerical label corresponding to each URL in the website data set;

and determining the training website data set and a testing website data set according to the label data set, wherein the testing website data set is used for testing the trained first preset model.

3. A phishing website discrimination method based on deep learning as claimed in claim 2, wherein training the first preset model according to the training website data set to extract a first feature set comprises:

inputting the test website data set into the trained first preset model;

judging whether the accuracy of the output result of the first preset model reaches a preset threshold value or not;

if the accuracy of the output result of the first preset model reaches a preset threshold value, ending the training; if not, then,

and returning to continue executing the step of training the first preset model according to the training website data set to extract the first feature set.

4. A phishing website discrimination method based on deep learning as claimed in claim 1, wherein after inputting the training website data set into a second preset model for training according to the merged feature set and outputting a phishing website discrimination model, the method further comprises:

writing a program according to the phishing website discrimination model to obtain a phishing website discrimination program;

and embedding the phishing website distinguishing program into terminal equipment.

5. A phishing website discrimination method based on deep learning as claimed in claim 4, further comprising after embedding the phishing website discrimination program into a terminal device:

receiving a website access request sent by the terminal equipment;

starting the phishing website discrimination model in the phishing website discrimination program;

inputting website information in the website access request into the phishing website discrimination model;

outputting a phishing website judgment result;

if the phishing website judging result is that the website information in the website access request is a phishing website, prompting that a terminal device user opens the phishing website; if not, then,

and allowing the terminal equipment to perform access operation.

6. A phishing website discriminating device based on deep learning, characterized by comprising:

the first input unit is used for inputting the website of the target phishing website as a training website data set into a first preset model, and the first preset model is a recurrent neural network model with different structures and different unit types;

the extracting unit is configured to train the first preset model according to the training website data set to extract a first feature set, and specifically includes:

the first calculation module is used for calculating the significance score of each URL character in the training website data set;

the second calculation module is used for calculating the influence score of each URL character according to the saliency score of each URL character;

the sorting module is used for sorting each URL character from high to low according to the influence score of the URL character;

the definition module is used for defining the URL characters which are sequenced at the top N positions as high-influence score characters;

the extracting module is used for extracting the first feature set according to the high-influence character;

the merging unit is used for merging the first feature set and a preset second feature set to obtain a merged feature set;

and the first output unit is used for inputting the training website data set into a second preset model for training according to the merged feature set and outputting a phishing website discrimination model.

7. A phishing website discriminating device according to claim 6, further comprising:

an acquisition unit, configured to acquire phishing website data and non-phishing website data;

the first construction unit is used for constructing a website data set according to the phishing website data and the non-phishing website data;

a second constructing unit, configured to construct a tag data set according to the website data set, where the tag data set is a data set used for marking a numerical tag corresponding to each URL in the website data set;

and the determining unit is used for determining the training website data set and the testing website data set according to the label data set, wherein the testing website data set is used for testing the trained first preset model.

8. A phishing website discriminating device according to claim 7, wherein the extracting unit comprises:

the input module is used for inputting the test website data set into the trained first preset model;

the judging module is used for judging whether the accuracy of the output result of the first preset model reaches a preset threshold value or not;

a termination module for terminating the training if the accuracy of the output result of the first preset model reaches a preset threshold,

and the returning module is used for returning to continue executing the step of training the first preset model according to the training website data set to extract the first characteristic set if the accuracy of the output result of the first preset model does not reach a preset threshold value.