CN110365691B - Phishing website distinguishing method and device based on deep learning - Google Patents

Phishing website distinguishing method and device based on deep learning

Info

Publication number
CN110365691B
Authority
CN
China
Prior art keywords
website
data set
training
phishing
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910664425.5A
Other languages
Chinese (zh)
Other versions
CN110365691A (en)
Inventor
冯涛
刘蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University of Finance and Economics
Original Assignee
Yunnan University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University of Finance and Economics filed Critical Yunnan University of Finance and Economics
Priority to CN201910664425.5A priority Critical patent/CN110365691B/en
Publication of CN110365691A publication Critical patent/CN110365691A/en
Application granted granted Critical
Publication of CN110365691B publication Critical patent/CN110365691B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic
    • H04L63/1483Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a phishing website distinguishing method and device based on deep learning. The method comprises: inputting the URLs of target phishing websites into a first preset model as a training website data set; training the first preset model on the training website data set to extract a first feature set; merging the first feature set with a preset second feature set to obtain a merged feature set; and, according to the merged feature set, inputting the training website data set into a second preset model for training and outputting a phishing website distinguishing model. The method and device address the technical problem that phishing website identification methods in the related art are inefficient. Their technical effects are a short detection time, high accuracy of the detection result, and stronger generalization of the discrimination method in practical application.

Description

Phishing website distinguishing method and device based on deep learning
Technical Field
The application relates to the technical field of deep learning, in particular to a phishing website distinguishing method and device based on deep learning.
Background
The concept of deep learning is derived from the research of an artificial neural network, and a multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms a more abstract class or feature of high-level representation properties by combining low-level features to discover a distributed feature representation of the data.
Phishing is a form of online fraud in which criminals use various means to imitate the URLs and page content of genuine websites, or exploit vulnerabilities in the server programs of genuine websites to insert dangerous HTML code into some of their pages, in order to steal private data such as users' bank or credit card account numbers and passwords.
Techniques in the related art for defending against phishing attacks have at least the following problems. First, during identification the web page content corresponding to the URL must be downloaded, so detection takes a long time and degrades the user's browsing experience. Second, malicious code contained in a phishing site may already have been triggered while the suspicious page is being downloaded. At present, therefore, there is no complete method for countering phishing attacks, and considerable room for improvement remains.
Aiming at the technical problem that the phishing website identification method in the related technology is low in efficiency, an effective solution is not provided at present.
Disclosure of Invention
The application mainly aims to provide a phishing website distinguishing method and device based on deep learning so as to solve the technical problem that a phishing website identification method in the related technology is low in efficiency.
In order to achieve the above object, according to one aspect of the present application, there is provided a phishing website discrimination method based on deep learning.
The phishing website distinguishing method based on deep learning comprises the following steps: inputting the URLs of target phishing websites as a training website data set into a first preset model; training the first preset model according to the training website data set to extract a first feature set; merging the first feature set and a preset second feature set to obtain a merged feature set; and inputting the training website data set into a second preset model for training according to the merged feature set, and outputting a phishing website distinguishing model.
Further, before inputting the website of the target phishing website as a training website data set into the first preset model, the method further comprises the following steps: acquiring phishing website data and non-phishing website data; constructing a website data set according to the phishing website data and the non-phishing website data; constructing a label data set according to the website data set, wherein the label data set is used for marking a numerical label corresponding to each URL in the website data set; and determining the training website data set and a testing website data set according to the label data set, wherein the testing website data set is used for testing the trained first preset model.
Further, training the first preset model according to the training website data set to extract a first feature set includes: inputting the test website data set into the trained first preset model; judging whether the accuracy of the output result of the first preset model reaches a preset threshold value or not; if the accuracy of the output result of the first preset model reaches a preset threshold value, ending the training; otherwise, returning to the step of continuing to execute the training of the first preset model according to the training website data set so as to extract the first feature set.
Further, training the first preset model according to the training website data set to extract a first feature set further comprises: calculating a saliency score of each URL character in the training website data set; calculating the influence score of each URL character according to the significance score of each URL character; sorting each URL character from high to low according to the influence score of the URL character; defining the URL characters in the top N positions as high-influence score characters; and extracting the first characteristic set according to the high-impact character.
Further, after the training website data set is input into the second preset model for training according to the merged feature set and the phishing website distinguishing model is output, the method further comprises the following steps: compiling a program from the phishing website distinguishing model to obtain a phishing website distinguishing program; and embedding the phishing website distinguishing program into a terminal device.
Further, after the phishing website distinguishing program is embedded into the terminal device, the method further comprises the following steps: receiving a website access request sent by the terminal device; starting the phishing website distinguishing model in the phishing website distinguishing program; inputting the website information in the website access request into the phishing website distinguishing model; outputting a phishing website judgment result; if the judgment result indicates that the website information in the website access request belongs to a phishing website, warning the terminal device user that a phishing website is being opened; otherwise, allowing the terminal device to perform the access operation.
In order to achieve the above object, according to another aspect of the present application, there is provided a phishing website discriminating apparatus based on deep learning.
The phishing website distinguishing device based on deep learning comprises: a first input unit for inputting the URLs of target phishing websites as a training website data set into a first preset model; an extraction unit for training the first preset model according to the training website data set so as to extract a first feature set; a merging unit for merging the first feature set and a preset second feature set to obtain a merged feature set; and a first output unit for inputting the training website data set into a second preset model for training according to the merged feature set and outputting a phishing website distinguishing model.
Further, still include: the device comprises an acquisition unit, a storage unit and a display unit, wherein the acquisition unit is used for acquiring phishing website address data and non-phishing website address data; the first construction unit is used for constructing a website data set according to the phishing website data and the non-phishing website data; a second constructing unit, configured to construct a tag data set according to the website data set, where the tag data set is a data set used for marking a numerical tag corresponding to each URL in the website data set; and the determining unit is used for determining the training website data set and the testing website data set according to the label data set, wherein the testing website data set is used for testing the trained first preset model.
Further, the extraction unit includes: the input module is used for inputting the test website data set into the trained first preset model; the judging module is used for judging whether the accuracy of the output result of the first preset model reaches a preset threshold value or not; and the stopping module is used for stopping training if the accuracy of the output result of the first preset model reaches a preset threshold value, and the returning module is used for returning to continue executing the step of training the first preset model according to the training website data set to extract the first characteristic set if the accuracy of the output result of the first preset model does not reach the preset threshold value.
Further, the extraction unit further includes: the first calculation module is used for calculating the significance score of each URL character in the training website data set; the second calculation module is used for calculating the influence score of each URL character according to the saliency score of each URL character; the sorting module is used for sorting each URL character from high to low according to the influence of the URL character; the definition module is used for defining the URL characters which are sequenced at the top N positions as high-influence score characters; and the extracting module is used for extracting the first feature set according to the high-influence character.
In the phishing website distinguishing method and device based on deep learning of the embodiments of the application, the URLs of target phishing websites are input into a first preset model as a training website data set, the first preset model is trained according to the training website data set to extract a first feature set, the first feature set and a preset second feature set are merged to obtain a merged feature set, and the training website data set is input into a second preset model for training according to the merged feature set, thereby achieving the purpose of outputting a phishing website distinguishing model.
In addition, the following technical effects are achieved: 1. the phishing website distinguishing method based on deep learning does not depend on a blacklist, and is still effective for the latest phishing webpage. 2. The content of the web page is not downloaded during the detection process, thereby avoiding the possibility of triggering malicious codes in the web page. 3. The detection time is short, and the internet surfing experience of the user cannot be influenced. 4. Training is carried out through a training set storing mass website data, and then testing is carried out through a testing set, so that the finally obtained discrimination model has stronger generalization in practical application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:
FIG. 1 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a first embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a second embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a third embodiment of the present application;
FIG. 4 is a schematic flowchart illustrating a method for determining phishing websites based on deep learning according to a third embodiment of the present application;
FIG. 5 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a fourth embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a fifth embodiment of the present application;
FIG. 7 is a flowchart illustrating a phishing website discrimination method based on deep learning according to a sixth embodiment of the present application;
FIG. 8 is a flowchart illustrating a method for determining phishing websites based on deep learning according to a sixth embodiment of the present application;
FIG. 9 is a schematic diagram illustrating a composition structure of a phishing website determining apparatus based on deep learning according to a first embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a composition of a phishing website determining apparatus based on deep learning according to a second embodiment of the present application;
FIG. 11 is a block diagram illustrating a phishing website determining apparatus based on deep learning according to a third embodiment of the present application; and
FIG. 12 is a schematic diagram illustrating a composition structure of a phishing website determining apparatus based on deep learning according to a fourth embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
According to an embodiment of the present application, there is provided a phishing website determination method based on deep learning, as shown in fig. 1, the method includes steps S101 to S104 as follows:
Step S101, inputting the URLs of target phishing websites as a training website data set into a first preset model.
In a specific implementation, the training website data set uses existing manually labeled URL data as its data source. The first preset model is preferably a set of RNN (Recurrent Neural Network) models with different structures and unit types, which may specifically include LSTM (Long Short-Term Memory network), GRU (Gated Recurrent Unit), BiLSTM (Bi-directional Long Short-Term Memory network), BiGRU (Bi-directional Gated Recurrent Unit), and other deep learning models.
Step S102, training the first preset model according to the training website data set to extract a first feature set.
In a specific implementation, the training website data set is used to train each of the four RNN deep learning models (LSTM, GRU, BiLSTM and BiGRU) until every model achieves a good recognition result, namely an accuracy above 99.8%. Internal state analysis is then performed on the four fully trained RNN deep learning models, and the characters with high influence scores (impact scores) are extracted as features to obtain the feature set F = {f_lstm, f_gru, f_bilstm, f_bigru}. In other words, the characters that the four high-accuracy RNN deep learning models consider most suspicious are identified and added to the first feature set.
Step S103, merging the first feature set and a preset second feature set to obtain a merged feature set.
The feature set F = {f_lstm, f_gru, f_bilstm, f_bigru} obtained in step S102 is merged with the existing manually classified features to obtain the merged feature set. The merged feature set covers the character features of phishing URLs obtained by manual classification in the prior art and also integrates the character features extracted by the four deep learning models that have been trained to good results. This greatly enriches the character features used for identifying phishing websites in the related art and lays a foundation for subsequently improving the efficiency of phishing website identification.
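The merging in step S103 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the manual features in `MANUAL_FEATURES`, the per-model character lists, and the helper names `merge_feature_sets` and `url_to_vector` are all hypothetical, and counting character occurrences is only one assumed way of turning a high-impact character into a numeric feature.

```python
# Hypothetical hand-crafted features; the patent does not list the manual features.
MANUAL_FEATURES = {
    "length": len,
    "num_dots": lambda u: u.count("."),
    "has_at": lambda u: int("@" in u),
}

def merge_feature_sets(first_feature_set, manual_features):
    """Union the characters chosen by each RNN model, then append the manual features.

    Returns a list of (feature_name, function) pairs.
    """
    chars = sorted(set().union(*first_feature_set.values()))
    char_features = [("count:" + c, lambda u, c=c: u.count(c)) for c in chars]
    return char_features + list(manual_features.items())

def url_to_vector(url, merged):
    """Evaluate every merged feature on one URL."""
    return [fn(url) for _, fn in merged]

# Made-up high-impact characters per model, for illustration only.
F = {"lstm": ["-", "@"], "gru": ["@", "="], "bilstm": ["-"], "bigru": ["="]}
merged = merge_feature_sets(F, MANUAL_FEATURES)
url = "http://a-b.example.com/login?u=x@y"
vector = url_to_vector(url, merged)
print([name for name, _ in merged])
print(vector)
```

Vectors of this kind would then be the input to the second preset model in step S104.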
Step S104, inputting the training website data set into a second preset model for training according to the merged feature set, and outputting a phishing website distinguishing model.
In a specific implementation, the second preset model may be a Random Forest model, which is technically mature and performs well. It should be noted that other types of classifier models can also implement the technical solution of the present application, and no specific limitation is imposed here. Based on the merged feature set, the training website data set is then input into the Random Forest model for training, finally yielding a phishing website distinguishing model driven by deep learning models.
According to the phishing website distinguishing method and device based on deep learning, the URLs of target phishing websites are input into a first preset model as a training website data set, the first preset model is trained according to the training website data set to extract a first feature set, the first feature set and a preset second feature set are merged to obtain a merged feature set, and the training website data set is input into a second preset model for training according to the merged feature set. This achieves the purpose of outputting a phishing website distinguishing model and the technical effect of improving the efficiency of phishing website identification.
Preferably, as shown in fig. 2, before inputting the website of the target phishing website as the training website data set into the first preset model, the following steps S201 to S204 are further included:
Step S201, acquiring phishing website data and non-phishing website data.
The amount of the training data has a significant influence on the finally obtained phishing website discrimination model, and therefore, before the website of the target phishing website is input into the first preset model as the training website data set, massive (millions of pieces) website data need to be collected, specifically comprising the phishing website and normal (or harmless or non-phishing) website data. Preferably, the number of phishing websites and the number of normal (or harmless or non-phishing) websites are collected in a 1:1 ratio.
Step S202, constructing a website data set according to the phishing website data and the non-phishing website data.
A website data set is then constructed from the collected million-scale phishing and non-phishing website data, to serve as the basic database for subsequent model training.
Step S203, constructing a label data set according to the website data set, wherein the label data set is used for marking a numerical label corresponding to each URL in the website data set.
The massive (million-scale) phishing and non-phishing URL data stored in the constructed website data set are existing data that have already been judged or classified manually. In a specific implementation, each URL_i in the website data set is given a corresponding label. One possible labeling method is binary labeling: URL_i is labeled 1 if it is a phishing URL and 0 if it is a normal (non-phishing) URL, so that a label data set is formed from the labeled website data set.
Step S204, determining the training website data set and a testing website data set according to the label data set, wherein the testing website data set is used for testing the trained first preset model.
In order to further ensure the accuracy and reliability of the model training result, in the specific implementation of the method, the obtained label data set is further divided into a training website data set and a testing website data set, the ratio of the number of websites in the two data sets can be that the training website data set accounts for 80% of the label data set, the rest 20% of websites can be used as the testing website data set, and the four deep learning models after training are tested by using the testing website data set.
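Steps S203 and S204 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the example URLs are invented, the helper names `build_label_dataset` and `split_dataset` are hypothetical, and shuffling before the split is an assumption (the text only gives the 80%/20% proportion).

```python
import random

def build_label_dataset(phishing_urls, benign_urls):
    """Pair every URL with a binary label: 1 = phishing, 0 = non-phishing."""
    return [(u, 1) for u in phishing_urls] + [(u, 0) for u in benign_urls]

def split_dataset(labeled, train_fraction=0.8, seed=0):
    """Shuffle the labeled data and split it into training and test sets."""
    rows = list(labeled)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Invented example URLs, collected at the 1:1 ratio the text recommends.
phishing = [f"http://login-verify{i}.example.net/acct" for i in range(5)]
benign = [f"https://www.example{i}.org/" for i in range(5)]
labeled = build_label_dataset(phishing, benign)
train_set, test_set = split_dataset(labeled)
print(len(train_set), len(test_set))  # 8 2
```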
Preferably, as shown in fig. 3 and 4, the training of the first preset model according to the training website data set to extract a first feature set specifically includes the following steps S301 to S304:
Step S301, inputting the test website data set into the trained first preset model.
In specific implementation, after the four deep learning models are trained through the training website data set, the training results of the four deep learning models after training need to be further evaluated by using the test website data set.
Step S302, determining whether the accuracy of the output result of the first preset model reaches a preset threshold.
In a specific implementation, whether the accuracy of the output of each of the four deep learning models reaches a preset threshold is judged according to a preset rule. The preset rule may be that a satisfactory training result is considered to have been obtained only when the accuracy of all four deep learning models reaches the preset threshold. The preset threshold can be configured flexibly according to actual needs; for example, it can be set to 99.8% or more. An accuracy of 99.8% means that, after the test website data set is input into each of the four trained deep learning models, at least 99.8% of the results output by each model are correct.
Step S303, if the accuracy of the output result of the first preset model reaches a preset threshold, ending the training.
If the accuracy of the output results of the four deep learning models reaches 99.8%, it is indicated that the four models can obtain good recognition results for the data in the training website data set, and the deep learning model training can be finished to perform the operation of the subsequent steps.
Step S304, if the accuracy of the output result of the first preset model does not reach a preset threshold value, returning to continue executing the step of training the first preset model according to the training website data set to extract a first feature set.
In specific implementation, if the accuracy of the output result of any one of the four deep learning models does not reach 99.8%, the training is considered to be unqualified, and the training is returned to continue to use the training website data set to train the model with the unqualified output result.
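The loop formed by steps S301 to S304 can be sketched as below. It is a minimal illustration, not the patent's training code: `train_one_round` and `evaluate` are hypothetical callbacks standing in for real RNN training and test-set evaluation, the toy "models" are plain dictionaries, and the `max_rounds` cap is an added assumption so the loop always terminates.

```python
THRESHOLD = 0.998  # the 99.8% accuracy threshold given as an example in the text

def train_until_threshold(models, train_one_round, evaluate,
                          threshold=THRESHOLD, max_rounds=100):
    """Keep retraining only the models whose test accuracy is below the threshold.

    Returns a dict mapping each model name to its final test accuracy.
    """
    accuracy = {name: evaluate(model) for name, model in models.items()}
    for _ in range(max_rounds):
        pending = [name for name, acc in accuracy.items() if acc < threshold]
        if not pending:
            break  # every model passed, so training ends (step S303)
        for name in pending:  # step S304: retrain the models that failed
            train_one_round(models[name])
            accuracy[name] = evaluate(models[name])
    return accuracy

# Toy stand-ins: a "model" reaches 0.999 accuracy after `needed` training rounds.
models = {
    "lstm": {"rounds": 0, "needed": 2},
    "gru": {"rounds": 0, "needed": 0},
    "bilstm": {"rounds": 0, "needed": 3},
    "bigru": {"rounds": 0, "needed": 1},
}

def train_one_round(model):
    model["rounds"] += 1

def evaluate(model):
    return 0.999 if model["rounds"] >= model["needed"] else 0.9

final = train_until_threshold(models, train_one_round, evaluate)
print(final)
```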
Preferably, as shown in fig. 5, the training the first preset model according to the training website data set to extract a first feature set further includes the following steps S401 to S405:
Step S401, calculating the saliency score of each URL character in the training website data set.
In the RNN deep learning models of the present application, each character c_i of a URL is embedded into a vector e_i using the word2vec method (put simply, the character c_i is represented by the vector e_i). A URL with n characters therefore yields E = {e_0, e_1, …, e_n}. A trained RNN deep learning model can be regarded as a high-dimensional nonlinear function S_p(E): each E is fed to S_p to obtain a value y, which is used to decide whether the URL represented by E is a phishing website. We therefore need to know which e_i in E has the greatest influence on S_p, that is, which character features of the URL the RNN deep learning model relies on when identifying a phishing website. The following method is used:
S_p(e) is approximated by a linear function of e by computing its first-order Taylor expansion:
$$S_p(e) \approx \omega(e)^{T} e + b$$
where ω(e) is the derivative of S_p with respect to e, and b is a constant.
The absolute value of the derivative represents the sensitivity of the final decision to a change in a particular dimension; it tells us how much a particular dimension of a character embedding contributes to the final decision. On this basis, we represent the saliency score of a URL character by the sum of the absolute values of the contributions of its embedding vector. The saliency score of character j in URL i is then:
$$\mathrm{saliency}_{i,j} = \sum_{d} \left| \omega_{i,j,d} \right|$$
where d ranges over the dimensions of the embedding vector and ω_{i,j,d} is the corresponding component of the derivative.
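The saliency computation described here (sum of absolute derivative values over the embedding dimensions of each character) can be illustrated with a toy scoring function. This is only a sketch: the sigmoid-of-weighted-sum `score` is a hypothetical stand-in for a trained RNN, chosen because its derivative can be written by hand, and the embeddings `E` and weights `V` are invented numbers.

```python
import math

def score(E, V):
    """Toy stand-in for S_p: sigmoid of a weighted sum of all embedding entries."""
    z = sum(v * e for row_v, row_e in zip(V, E) for v, e in zip(row_v, row_e))
    return 1.0 / (1.0 + math.exp(-z))

def gradient(E, V):
    """Analytic derivative of the toy score with respect to every entry of E."""
    s = score(E, V)
    return [[s * (1.0 - s) * v for v in row_v] for row_v in V]

def saliency_scores(grad):
    """Saliency of character j = sum over embedding dimensions of |dS_p/de_jd|."""
    return [sum(abs(g) for g in row) for row in grad]

# Three characters, each embedded in two dimensions (invented values).
E = [[0.1, -0.2], [0.4, 0.3], [0.0, 0.5]]
V = [[0.5, -0.5], [2.0, -1.0], [0.0, 0.1]]
sal = saliency_scores(gradient(E, V))
most_salient = max(range(len(sal)), key=sal.__getitem__)
print(most_salient)  # 1
```

With a real RNN the gradient would come from automatic differentiation rather than a hand-written formula; the per-character aggregation step would be the same.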
Step S402, calculating the influence score of each URL character according to the saliency score of each URL character.
It should be noted that, in implementation, the saliency scores of a character across the sample set cannot simply be summed or averaged to give its influence score (impact score), because the importance of a character depends on its context: the same character can differ greatly in importance between different URLs, and even between different positions in the same URL. The embodiment of the present application therefore provides an algorithm for calculating the impact score of a character C:
Figure BDA0002139250770000103
where N is the total number of URLs, ω_i is the set of characters in a URL, and β is a constant. The ISI function returns the indices of the k largest elements of a set S, and is defined as follows:
$$\mathrm{ISI}(S, k) = \{\, i \mid s_i \text{ is among the } k \text{ largest elements of } S \,\}$$
In a practical implementation, suppose the previously collected website data set of this embodiment contains about 0.75M (million) phishing URLs, the parameter β is 0.1, and the average length of the phishing URLs in the data set is 89.8 characters. Then on average about 9 characters (0.1 × 89.8 ≈ 9) with the highest saliency scores are selected from each URL to take part in the calculation of C_impact-score.
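The ISI selection and a possible aggregation into impact scores can be sketched as follows. The exact impact-score formula is given only as an image in the source, so `impact_scores` below is one assumed reading, not the patent's definition: for each URL it keeps the round(β × length) most salient characters, sums their saliency per character, and divides by the number of URLs. The URLs and saliency values are invented.

```python
from collections import defaultdict

def isi(values, k):
    """Indices of the k largest elements of `values` (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    return order[:k]

def impact_scores(urls, saliencies, beta=0.1):
    """Assumed aggregation of per-URL saliency into per-character impact scores."""
    totals = defaultdict(float)
    for url, sal in zip(urls, saliencies):
        k = max(1, round(beta * len(url)))  # e.g. 0.1 * 89.8 gives about 9
        for j in isi(sal, k):
            totals[url[j]] += sal[j]
    n = len(urls)
    return {ch: total / n for ch, total in totals.items()}

urls = ["a-b.com", "x@y.net"]
saliencies = [
    [0.1, 0.9, 0.1, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.1, 0.5, 0.1, 0.1, 0.1],
]
scores = impact_scores(urls, saliencies, beta=0.3)  # keeps 2 characters per URL
print(sorted(scores.items()))
```

Restricting each URL to its top few characters reflects the point made above that a character's importance depends on its context, so low-saliency occurrences do not dilute its score.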
Step S403, sorting the URL characters from high to low according to their influence scores.
After the influence score C_impact-score of each URL character is obtained from the above formula, the characters are sorted by C_impact-score, either from high to low or from low to high.
Step S404, defining the URL characters ranked in the top N positions as high-influence-score characters.
Based on the ranking obtained from the influence-score sorting, the URL characters in the top N positions are defined as high-influence-score characters. The value of N can be configured flexibly according to actual operation or requirements and is not specifically limited here. In this way the high-influence-score characters corresponding to each of the four trained RNN deep learning models are obtained.
Another way to obtain the high-influence-score characters is to use the ISI function directly to take, from the set X, the indices of the $\lambda$ elements with the largest influence score $C_{\text{impact-score}}$, that is,
$F_{\alpha} = \{X_i \mid i \in ISI(\{C_{\alpha}(x) \mid x \in X\}, \lambda)\}, \quad \alpha \in \{lstm, gru, bilstm, bigru\}$,
where X is the set of characters contained in the URLs of the data set, and $\lambda$ is a constant parameter.
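As a minimal sketch of steps S403 and S404, the ranking and top-N selection can be written as a single sort. The scores below are made-up values for illustration only, and `top_n_characters` is a hypothetical helper name.

```python
def top_n_characters(impact, n):
    """Sort characters by influence score, highest first, and keep the first n.

    `impact` maps a character to its influence score for one model.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(impact.items(), key=lambda item: (-item[1], item[0]))
    return [ch for ch, _ in ranked[:n]]


# Made-up influence scores for a single model (illustrative only).
impact_lstm = {"-": 0.42, ".": 0.31, "@": 0.57, "a": 0.02}
high_impact = top_n_characters(impact_lstm, 2)
```

Running the same selection once per model (LSTM, GRU, BiLSTM, BiGRU) would yield the four per-model character sets.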
Step S405, extracting the first feature set according to the high-influence-score characters.
Using either of the two ways of obtaining high-influence-score characters described in step S404, the top N characters are extracted as high-influence-score characters according to actual requirements. Finally, the high-influence-score characters obtained from the four deep learning models are added to the deep learning feature set to obtain the first feature set, namely
$F = \{f_{lstm}, f_{gru}, f_{bilstm}, f_{bigru}\}$.
Preferably, as shown in fig. 5, after the training website data set is input into a second preset model for training according to the merged feature set and a phishing website discrimination model is output, the method further includes the following steps S501 to S502:
Step S501, writing a program according to the phishing website discrimination model to obtain a phishing website discrimination program.
In specific implementation, the obtained phishing website discrimination model can be turned into a program using any of the programming software available in the prior art, which is not described in detail here.
Step S502, embedding the phishing website discrimination program into the terminal device.
In specific implementation, the obtained phishing website discrimination program can be embedded into a browser, a mobile phone APP, or any open platform (such as WeChat, Weibo, QQ and the like) to identify phishing websites.
Preferably, as shown in fig. 7 and 8, after the phishing website discrimination program is embedded in the terminal device, the method further includes the following steps S601 to S606:
Step S601, receiving a website access request sent by the terminal device.
In specific implementation, the phishing website discrimination program embedded in the terminal device can monitor the website access behavior of the terminal device user in real time and receive a website access request initiated by the user, for example a request to access a certain website through URL_i.
Step S602, starting the phishing website discrimination model in the phishing website discrimination program.
When the phishing website discrimination program receives the user's request to access a website through URL_i, it immediately starts the phishing website discrimination model in the program to judge the website.
Step S603, inputting the website information in the website access request into the phishing website discrimination model.
In specific implementation, URL_i in the website access request is input into the phishing website discrimination model, which analyzes the character feature information and the like in URL_i.
Step S604, outputting the phishing website determination result.
After the determination is finished, the phishing website determination result is output. The specific output form can be configured in advance; for example, when the model outputs 1, the user is considered to be about to visit a phishing website, and when the model outputs 0, the user is considered to be about to visit a non-phishing (normal) website.
Step S605, if the phishing website determination result is that the website information in the website access request is a phishing website, prompting the terminal device user that a phishing website is being opened.
When the model outputs 1, the website the user is about to access is a phishing website, and the user can be reminded to operate with caution, for example through a pop-up window on the display interface of the terminal device.
Step S606, if the phishing website determination result is that the website information in the website access request is not a phishing website, allowing the terminal device to perform the access operation.
When the model outputs 0, the website the user is about to access is a normal website, and the user can be allowed to continue accessing it.
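Steps S601 to S606 amount to a small decision routine around the model output. The following is a sketch under stated assumptions: `handle_access_request` and `toy_model` are hypothetical names, and `toy_model` is a placeholder rule used only to make the example runnable, not the trained discrimination model.

```python
def handle_access_request(url, discriminate):
    """Classify the requested URL and decide how the terminal should react.

    `discriminate` stands in for the trained discrimination model and must
    return 1 for a phishing URL and 0 for a normal one.
    """
    result = discriminate(url)  # S603-S604: feed URL_i to the model, read the output
    if result == 1:  # S605: warn the user, e.g. through a pop-up window
        return {"url": url, "action": "warn",
                "message": "Caution: this website is suspected of phishing."}
    return {"url": url, "action": "allow", "message": ""}  # S606


def toy_model(url):
    # Placeholder rule for demonstration only, not the patented model.
    return 1 if "@" in url else 0


decision = handle_access_request("http://example.com/login", toy_model)
```

Only the URL string is passed to the model, which matches the stated property that page content is never downloaded during detection.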
From the above description, it can be seen that the website addresses (URLs) of target phishing websites are input into the first preset model as a training website data set, the first preset model is trained on the training website data set to extract the first feature set, the first feature set is merged with the preset second feature set to obtain the merged feature set, and the training website data set is input into the second preset model for training according to the merged feature set, so that the phishing website discrimination model is output. The following technical effects are thereby achieved: 1. The phishing website discrimination method based on deep learning does not depend on a blacklist and remains effective against the newest phishing web pages. 2. The content of the web page is not downloaded during detection, which avoids the possibility of triggering malicious code in the page. 3. The detection time is extremely short and does not affect the user's browsing experience. 4. The detection accuracy reaches more than 99.9%, and the F1 score reaches more than 99.8%.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
According to an embodiment of the present invention, there is also provided a phishing website determining apparatus for implementing the above phishing website determining method based on deep learning, as shown in fig. 9, the apparatus including:
The first input unit 1 is configured to input the website addresses of target phishing websites into a first preset model as a training website data set.
In specific implementation, the training website data set uses existing manually labeled website data as its data source. The first preset model is preferably a set of RNN (Recurrent Neural Network) models with different structures and different unit types, which may specifically include LSTM (Long Short-Term Memory network), GRU (Gated Recurrent Unit), BiLSTM (Bi-directional Long Short-Term Memory network), BiGRU (Bi-directional Gated Recurrent Unit) and other deep learning models. The website addresses of the target phishing websites are input into the first preset model as the training website data set through the first input unit 1.
The extraction unit 2 is configured to train the first preset model according to the training website data set to extract a first feature set.
In specific implementation, the four RNN deep learning models LSTM, GRU, BiLSTM and BiGRU are each trained on the obtained training website data until every model achieves a good recognition result, that is, an accuracy above 99.8%. The extraction unit 2 then performs internal state analysis on the four fully trained RNN deep learning models, extracts the characters with high influence scores (impact scores) as features, and obtains the feature set $F = \{f_{lstm}, f_{gru}, f_{bilstm}, f_{bigru}\}$; that is, the characters that the four high-accuracy RNN deep learning models consider most suspicious are identified and added to the first feature set.
The merging unit 3 is configured to merge the first feature set with a preset second feature set to obtain a merged feature set.
The feature set $F = \{f_{lstm}, f_{gru}, f_{bilstm}, f_{bigru}\}$ obtained above is merged with the existing manually classified features through the merging unit 3 to obtain the merged feature set. The merged feature set covers the character features of phishing websites obtained through manual classification in the prior art, and also integrates the character features extracted by the four deep learning models that have been trained to good results. This greatly enriches the character features used for recognizing phishing websites in the related art and lays a foundation for subsequently improving the recognition efficiency of phishing websites.
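The merge can be illustrated with a short Python sketch. The per-model characters and the two hand-crafted features (`url_length`, `digit_count`) are invented for illustration, and encoding each learned character as an occurrence count is an assumption; the source does not state how characters are turned into classifier inputs.

```python
# Hypothetical high-influence-score characters per model (illustrative only).
per_model = {
    "lstm": ["@", "-"],
    "gru": ["-", "."],
    "bilstm": ["@", "."],
    "bigru": ["-", "@"],
}

# Hypothetical hand-crafted features standing in for the second feature set.
manual_features = {
    "url_length": len,
    "digit_count": lambda url: sum(ch.isdigit() for ch in url),
}


def build_first_feature_set(per_model_chars):
    """Collect the characters of all four models, dropping duplicates in stable order."""
    chars = []
    for model_chars in per_model_chars.values():
        for ch in model_chars:
            if ch not in chars:
                chars.append(ch)
    return chars


def vectorize(url, chars, manual):
    """Merged feature vector: learned character counts followed by hand-crafted values."""
    learned = [url.count(ch) for ch in chars]       # assumption: count encoding
    crafted = [fn(url) for fn in manual.values()]
    return learned + crafted


first_feature_set = build_first_feature_set(per_model)
vector = vectorize("http://a-b.example.com/1", first_feature_set, manual_features)
```

A vector of this kind, one per URL, is the sort of input a tree-based classifier could be trained on.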
The first output unit 4 is configured to input the training website data set into a second preset model for training according to the merged feature set and to output a phishing website discrimination model.
In specific implementation, the second preset model may be a Random Forest model, which is technically mature and performs well. It should be noted that other types of classifier models can also implement the technical solution of the present application, and no specific limitation is made here. Based on the obtained merged feature set, the training website data set is input into the Random Forest model for training through the first output unit 4, finally yielding a phishing website discrimination model driven by deep learning models.
Preferably, as shown in fig. 10, the apparatus further comprises:
The acquisition unit 5 is configured to acquire phishing website data and non-phishing website data.
The amount of training data has a significant influence on the finally obtained phishing website discrimination model. Therefore, before the website addresses of the target phishing websites are input into the first preset model as the training website data set, a large amount (millions of entries) of website data, including both phishing and normal (harmless, non-phishing) website data, needs to be collected by the acquisition unit 5. Preferably, phishing websites and normal websites are collected in a 1:1 ratio.
The first construction unit 6 is configured to construct a website data set according to the phishing website data and the non-phishing website data.
Based on the collected massive (millions of entries) phishing website data and non-phishing website data, a website data set is constructed through the first construction unit 6 and used as the basic database for subsequent model training.
A second constructing unit 7, configured to construct a tag data set according to the website data set, where the tag data set is a data set used for marking a numerical tag corresponding to each URL in the website data set.
The massive (millions of entries) phishing and non-phishing website data stored in the constructed website data set are existing data that have already been judged or classified manually. In specific implementation, each URL_i in the website data set is given a corresponding label. One specific labeling method is binary labeling: URL_i is marked 1 if it is a phishing website and 0 if it is a normal (non-phishing) website. The second construction unit 7 then forms the tag data set from the labeled website data set.
A determining unit 8, configured to determine the training website data set and a testing website data set according to the tag data set, where the testing website data set is a data set used for testing the trained first preset model.
To further ensure the accuracy and reliability of the model training result, in the specific implementation of the present application the determining unit 8 divides the obtained tag data set into a training website data set and a testing website data set. For example, the training website data set may account for 80% of the tag data set, with the remaining 20% used as the testing website data set. The four trained deep learning models are then tested on the testing website data set.
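The labeling and the 80/20 split described above can be sketched as follows. Shuffling before the split and the fixed seed are assumptions added for reproducibility; the source only specifies the binary labels and the split ratio. The function names and example URLs are hypothetical.

```python
import random


def build_labelled_dataset(phishing_urls, normal_urls):
    """Attach the binary label: 1 for phishing URLs, 0 for normal URLs."""
    return [(url, 1) for url in phishing_urls] + [(url, 0) for url in normal_urls]


def split_dataset(labelled, train_fraction=0.8, seed=0):
    """Shuffle (an assumption, not stated in the source) and split into train/test."""
    shuffled = labelled[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Tiny illustrative data set collected in a 1:1 ratio.
phishing = [f"http://login-{i}.example.test/verify" for i in range(5)]
normal = [f"http://site-{i}.example.org/" for i in range(5)]
train_set, test_set = split_dataset(build_labelled_dataset(phishing, normal))
```

With millions of entries the same two functions apply unchanged; only the input lists grow.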
Preferably, as shown in fig. 11, the extraction unit 2 includes:
an input module 21, configured to input the test website data set into the trained first preset model.
In specific implementation, after the four deep learning models are trained through the above-mentioned training website data set, the input module 21 needs to further evaluate the training results of the trained four deep learning models by using the testing website data set.
The judging module 22 is configured to judge whether the accuracy of the output result of the first preset model reaches a preset threshold.
In specific implementation, the judging module 22 judges, according to a preset rule, whether the accuracy of the output results of the four deep learning models reaches a preset threshold. The preset rule may be that a satisfactory training result is considered to be obtained only when the accuracy of all four deep learning models reaches the preset threshold. The preset threshold can be configured flexibly according to actual needs; for example, it may be set to an accuracy of 99.8% or more. An accuracy of 99.8% means that, after the test website data set is input into each of the four trained deep learning models, at least 99.8% of the results output by each model are correct.
The termination module 23 is configured to terminate the training if the accuracy of the output result of the first preset model reaches the preset threshold.
If the accuracy of the output results of all four deep learning models reaches 99.8%, the four models can obtain good recognition results for the data in the training website data set, and the termination module 23 ends the deep learning model training so that the subsequent steps can be performed.
The returning module 24 is configured to return to the step of training the first preset model according to the training website data set to extract the first feature set if the accuracy of the output result of the first preset model does not reach the preset threshold.
In specific implementation, if the accuracy of the output result of any one of the four deep learning models does not reach 99.8%, the training is considered unqualified, and the returning module 24 returns to continue training the unqualified model with the training website data set.
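The judging, termination and returning logic reduces to an accuracy check per model. A minimal sketch follows; the accuracy values are invented for illustration, and `models_below_threshold` is a hypothetical helper name.

```python
def accuracy(predictions, labels):
    """Proportion of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def models_below_threshold(accuracies, threshold=0.998):
    """Names of the models that still need training (accuracy under the threshold)."""
    return [name for name, acc in accuracies.items() if acc < threshold]


# Invented test-set accuracies for the four models (illustrative only).
accuracies = {"lstm": 0.9991, "gru": 0.9979, "bilstm": 0.9985, "bigru": 0.9990}
retrain = models_below_threshold(accuracies)
training_finished = not retrain
```

An empty list corresponds to the termination module ending the training; a non-empty list corresponds to the returning module sending the listed models back for further training.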
Preferably, as shown in fig. 12, the extraction unit 2 further includes:
a first calculating module 25, configured to calculate a saliency score of each URL character in the training website data set.
In specific implementation, the first calculation module 25 computes the saliency score as follows:
By computing the first-order Taylor expansion, $S_p(e)$ is approximated by a linear function of $e$:
Figure BDA0002139250770000171
where $\omega$ is the derivative of $S_p$ with respect to $e$, and $b$ is a constant.
The absolute value of the derivative represents the sensitivity of the final decision to a change in a particular dimension, indicating how much a particular dimension of the URL character embedding contributes to the final decision. On this basis, the embodiment of the present application uses the sum of the absolute values of the contributions of the embedding vector to represent the saliency score of a URL character. The saliency score of character j in URL_i is then:
Figure BDA0002139250770000172
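The description above (sum of absolute derivative values over the embedding dimensions of one character) can be sketched in a few lines of Python. The exact formula is given in the figure, so this follows only the textual description. The gradient values are made up; in practice they would come from automatic differentiation of the trained RNN, which is outside the scope of this sketch.

```python
def saliency_scores(gradients):
    """Saliency per character as the sum of absolute gradient values.

    gradients[j][d] is the derivative of the class score S_p with respect
    to dimension d of the embedding of character j.
    """
    return [sum(abs(g) for g in grad_j) for grad_j in gradients]


# Made-up gradients for the three characters of "a@b" with 2-dimensional embeddings.
url = "a@b"
gradients = [[0.1, -0.2], [0.5, -0.7], [0.0, 0.05]]
scores = saliency_scores(gradients)
most_salient = url[max(range(len(url)), key=lambda j: scores[j])]
```

Taking absolute values means that a dimension pushing the decision strongly in either direction counts as a large contribution.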
and a second calculating module 26, configured to calculate an influence score of each URL character according to the saliency score of each URL character.
In specific implementation, the second calculation module 26 calculates the impact score of a character C using the following algorithm:
Figure BDA0002139250770000181
where N is the total number of URLs, $\omega_i$ is the set of characters in the i-th URL, and $\beta$ is a constant. The ISI function returns the indices of the k largest elements of the set S, and is defined as follows:
Figure BDA0002139250770000182
The sorting module 27 is configured to sort the URL characters from high to low according to their influence scores.
Once the influence score $C_{\text{impact-score}}$ of each URL character has been obtained from the above formula, the characters are sorted by $C_{\text{impact-score}}$, either from high to low or from low to high.
The defining module 28 is configured to define the URL characters ranked in the top N positions as high-influence-score characters.
Based on the ranking obtained from the influence score sorting, the defining module 28 defines the URL characters ranked in the top N positions as high-influence-score characters. The value of N can be configured flexibly according to actual operation or requirements and is not specifically limited here. In this way, the high-influence-score characters corresponding to each of the four trained RNN deep learning models are obtained.
Another way to obtain the high-influence-score characters is to use the ISI function directly to take, from the set X, the indices of the $\lambda$ elements with the largest influence score $C_{\text{impact-score}}$, that is,
$F_{\alpha} = \{X_i \mid i \in ISI(\{C_{\alpha}(x) \mid x \in X\}, \lambda)\}, \quad \alpha \in \{lstm, gru, bilstm, bigru\}$,
where X is the set of characters contained in the URLs of the data set, and $\lambda$ is a constant parameter.
The extracting module 29 is configured to extract the first feature set according to the high-influence-score characters.
Using either of the two ways of obtaining high-influence-score characters described in step S404, the top N characters are extracted as high-influence-score characters according to actual requirements. Finally, the high-influence-score characters obtained from the four deep learning models are added to the deep learning feature set to obtain the first feature set, namely
$F = \{f_{lstm}, f_{gru}, f_{bilstm}, f_{bigru}\}$.
Preferably, the apparatus further comprises:
A programming unit, configured to write a program according to the phishing website discrimination model to obtain a phishing website discrimination program.
In specific implementation, the programming unit can turn the obtained phishing website discrimination model into a program using any of the programming software available in the prior art, which is not described in detail here.
An embedding unit, configured to embed the phishing website discrimination program into the terminal device.
In specific implementation, the obtained phishing website discrimination program can be embedded through the embedding unit into a browser, a mobile phone APP, or any open platform (such as WeChat, Weibo, QQ and the like) to identify phishing websites.
Preferably, the apparatus further comprises:
A receiving unit, configured to receive a website access request sent by the terminal device.
In specific implementation, the phishing website discrimination program embedded in the terminal device can monitor the website access behavior of the terminal device user in real time, and the receiving unit receives a website access request initiated by the user, for example a request to access a certain website through URL_i.
A starting unit, configured to start the phishing website discrimination model in the phishing website discrimination program.
When the phishing website discrimination program receives the user's request to access a website through URL_i, the starting unit starts the phishing website discrimination model in the program to judge the website.
A second input unit, configured to input the website information in the website access request into the phishing website discrimination model.
In specific implementation, URL_i in the website access request is input into the phishing website discrimination model through the second input unit, and the character feature information and the like in URL_i are analyzed.
A second output unit, configured to output the phishing website determination result.
After the determination is finished, the phishing website determination result is output through the second output unit. The specific output form can be configured in advance; for example, when the model outputs 1, the user is considered to be about to visit a phishing website, and when the model outputs 0, the user is considered to be about to visit a non-phishing (normal) website.
A prompting unit, configured to prompt the terminal device user that a phishing website is being opened if the phishing website determination result is that the website information in the website access request is a phishing website.
When the model outputs 1, the website the user is about to access is a phishing website, and the prompting unit can remind the user to operate with caution, for example through a pop-up window on the display interface of the terminal device.
An access unit, configured to allow the terminal device to perform the access operation if the phishing website determination result is that the website information in the website access request is not a phishing website.
When the model outputs 0, the website the user is about to access is a normal website, and the user can be allowed to continue accessing it through the access unit.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A phishing website discrimination method based on deep learning, characterized by comprising the following steps:
inputting the website of a target phishing website as a training website data set into a first preset model, wherein the first preset model is a recurrent neural network model with different structures and different unit types;
training the first preset model according to the training website data set to extract a first feature set, which specifically comprises:
calculating a saliency score of each URL character in the training website data set;
calculating the influence score of each URL character according to the saliency score of each URL character;
sorting the URL characters from high to low according to their influence scores;
defining the URL characters ranked in the top N positions as high-influence-score characters;
extracting the first feature set according to the high-influence-score characters;
the first feature set comprises suspicious characters determined by the first preset model;
merging the first feature set and a preset second feature set to obtain a merged feature set;
and inputting the training website data set into a second preset model for training according to the merged feature set, and outputting a phishing website discrimination model.
2. A phishing website discrimination method based on deep learning as claimed in claim 1, wherein before inputting the website of the target phishing website as the training website data set into the first preset model, further comprising:
acquiring phishing website data and non-phishing website data;
constructing a website data set according to the phishing website data and the non-phishing website data;
constructing a label data set according to the website data set, wherein the label data set is used for marking a numerical label corresponding to each URL in the website data set;
and determining the training website data set and a testing website data set according to the label data set, wherein the testing website data set is used for testing the trained first preset model.
3. A phishing website discrimination method based on deep learning as claimed in claim 2, wherein training the first preset model according to the training website data set to extract a first feature set comprises:
inputting the test website data set into the trained first preset model;
judging whether the accuracy of the output result of the first preset model reaches a preset threshold value or not;
if the accuracy of the output result of the first preset model reaches the preset threshold, ending the training; otherwise,
returning to continue executing the step of training the first preset model according to the training website data set to extract the first feature set.
4. A phishing website discrimination method based on deep learning as claimed in claim 1, wherein after inputting the training website data set into a second preset model for training according to the merged feature set and outputting a phishing website discrimination model, the method further comprises:
writing a program according to the phishing website discrimination model to obtain a phishing website discrimination program;
and embedding the phishing website discrimination program into terminal equipment.
5. A phishing website discrimination method based on deep learning as claimed in claim 4, further comprising, after embedding the phishing website discrimination program into the terminal equipment:
receiving a website access request sent by the terminal equipment;
starting the phishing website discrimination model in the phishing website discrimination program;
inputting website information in the website access request into the phishing website discrimination model;
outputting a phishing website determination result;
if the phishing website determination result is that the website information in the website access request is a phishing website, prompting the terminal equipment user that a phishing website is being opened; otherwise,
allowing the terminal equipment to perform the access operation.
6. A phishing website discrimination device based on deep learning, characterized by comprising:
a first input unit, configured to input the website of a target phishing website as a training website data set into a first preset model, wherein the first preset model is a recurrent neural network model with different structures and different unit types;
an extraction unit, configured to train the first preset model according to the training website data set to extract a first feature set, which specifically comprises:
a first calculation module, configured to calculate the saliency score of each URL character in the training website data set;
a second calculation module, configured to calculate the influence score of each URL character according to the saliency score of each URL character;
a sorting module, configured to sort the URL characters from high to low according to their influence scores;
a definition module, configured to define the URL characters ranked in the top N positions as high-influence-score characters;
an extracting module, configured to extract the first feature set according to the high-influence-score characters;
the first feature set comprises suspicious characters determined by the first preset model;
a merging unit, configured to merge the first feature set and a preset second feature set to obtain a merged feature set;
and a first output unit, configured to input the training website data set into a second preset model for training according to the merged feature set and to output a phishing website discrimination model.
7. A phishing website discrimination device according to claim 6, further comprising:
an acquisition unit, configured to acquire phishing website data and non-phishing website data;
the first construction unit is used for constructing a website data set according to the phishing website data and the non-phishing website data;
a second constructing unit, configured to construct a tag data set according to the website data set, where the tag data set is a data set used for marking a numerical tag corresponding to each URL in the website data set;
and the determining unit is used for determining the training website data set and the testing website data set according to the label data set, wherein the testing website data set is used for testing the trained first preset model.
8. A phishing website discrimination device according to claim 7, wherein the extraction unit comprises:
an input module, configured to input the test website data set into the trained first preset model;
a judging module, configured to judge whether the accuracy of the output result of the first preset model reaches a preset threshold;
a termination module, configured to terminate the training if the accuracy of the output result of the first preset model reaches the preset threshold;
and a returning module, configured to return to continue executing the step of training the first preset model according to the training website data set to extract the first feature set if the accuracy of the output result of the first preset model does not reach the preset threshold.
CN201910664425.5A 2019-07-22 2019-07-22 Phishing website distinguishing method and device based on deep learning Expired - Fee Related CN110365691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910664425.5A CN110365691B (en) 2019-07-22 2019-07-22 Phishing website distinguishing method and device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910664425.5A CN110365691B (en) 2019-07-22 2019-07-22 Phishing website distinguishing method and device based on deep learning

Publications (2)

Publication Number Publication Date
CN110365691A CN110365691A (en) 2019-10-22
CN110365691B true CN110365691B (en) 2021-12-28

Family

ID=68220658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910664425.5A Expired - Fee Related CN110365691B (en) 2019-07-22 2019-07-22 Phishing website distinguishing method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN110365691B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111198995B (en) * 2020-01-07 2023-03-24 电子科技大学 Malicious webpage identification method
CN111556065A (en) * 2020-05-08 2020-08-18 鹏城实验室 Phishing website detection method and device and computer readable storage medium
CN112491820B (en) * 2020-11-12 2022-07-29 新华三技术有限公司 Abnormity detection method, device and equipment
CN113378168B (en) * 2021-07-04 2022-05-31 昆明理工大学 Method for realizing DDoS attack detection in SDN environment based on Renyi entropy and BiGRU algorithm
CN113657453B (en) * 2021-07-22 2023-08-01 珠海高凌信息科技股份有限公司 Detection method based on harmful website generating countermeasure network and deep learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105956472A (en) * 2016-05-12 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for identifying whether webpage includes malicious content or not
CN106373057A (en) * 2016-09-29 2017-02-01 西安交通大学 Network education-orientated poor learner identification method
CN106789888A (en) * 2016-11-18 2017-05-31 重庆邮电大学 A phishing webpage detection method based on multi-feature fusion
CN107104803A (en) * 2017-03-31 2017-08-29 清华大学 It is a kind of to combine the user ID authentication method confirmed with vocal print based on numerical password
CN107679041A (en) * 2017-10-20 2018-02-09 苏州大学 English event synchronous anomalies method and system based on convolutional neural networks
CN108337255A (en) * 2018-01-30 2018-07-27 华中科技大学 A phishing website detection method based on web automated testing and broad learning
CN109657721A (en) * 2018-12-20 2019-04-19 长沙理工大学 A kind of multi-class decision-making technique of combination fuzzy set and random forest tree

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10148691B2 (en) * 2016-12-31 2018-12-04 Fortinet, Inc. Detection of unwanted electronic devices to provide, among other things, internet of things (IoT) security

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A hierarchical phishing website detection method based on deep learning; Hu Jun et al.; Communications Technology; 20170510; full text *

Also Published As

Publication number Publication date
CN110365691A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110365691B (en) Phishing website distinguishing method and device based on deep learning
CN105516113B (en) System and method for automatic evolution of phishing detection rules
CN105897714B (en) Botnet detection method based on DNS traffic characteristics
CN106453061B (en) A kind of method and system identifying network fraudulent act
CN101820366B (en) Pre-fetching-based phishing web page detection method
CN102279875B (en) Method and device for identifying phishing website
CN110266647A (en) It is a kind of to order and control communication check method and system
CN107659570A (en) Webshell detection methods and system based on machine learning and static and dynamic analysis
CN104077396A (en) Method and device for detecting phishing website
CN106776946A (en) A kind of detection method of fraudulent website
CN107341399A (en) Assess the method and device of code file security
CN110502897A (en) A kind of identification of webpage malicious JavaScript code and antialiasing method based on hybrid analysis
CN104156490A (en) Method and device for detecting suspicious phishing webpage based on character recognition
CN107888554A (en) The detection method and device of server attack
CN110135157A (en) Malware homology analysis method, system, electronic equipment and storage medium
CN107437038A (en) A kind of detection method and device of webpage tamper
CN104615760A (en) Phishing website recognizing method and phishing website recognizing system
CN103685308A (en) Detection method and system of phishing web pages, client and server
CN109104421B (en) Website content tampering detection method, device, equipment and readable storage medium
CN108337255A (en) A phishing website detection method based on web automated testing and broad learning
CN113901376A (en) Malicious website detection method and device, electronic equipment and computer storage medium
CN108229170A (en) Utilize big data and the software analysis method and device of neural network
CN112532624B (en) Black chain detection method and device, electronic equipment and readable storage medium
CN107609389A (en) A kind of verification method and system of image content-based correlation
CN114650176A (en) Phishing website detection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20211228