CN109617864B - Website identification method and website identification system

Website identification method and website identification system

Info

Publication number
CN109617864B
CN109617864B (application CN201811427628.4A)
Authority
CN
China
Prior art keywords
website
vector
classification
word
websites
Prior art date
Legal status
Active
Application number
CN201811427628.4A
Other languages
Chinese (zh)
Other versions
CN109617864A (en)
Inventor
王海洋
王艳华
刘大伟
廖华明
李雪梅
Current Assignee
Yantai Branch Institute Of Computing Technology Chinese Academy Of Science
Institute of Computing Technology of CAS
Original Assignee
Yantai Branch Institute Of Computing Technology Chinese Academy Of Science
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Yantai Branch Institute Of Computing Technology Chinese Academy Of Science, Institute of Computing Technology of CAS filed Critical Yantai Branch Institute Of Computing Technology Chinese Academy Of Science
Priority to CN201811427628.4A priority Critical patent/CN109617864B/en
Publication of CN109617864A publication Critical patent/CN109617864A/en
Application granted granted Critical
Publication of CN109617864B publication Critical patent/CN109617864B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to a website identification method and a website identification system. The method comprises the following steps: respectively acquiring a plurality of first feature matrices in one-to-one correspondence with a plurality of first websites; performing deep learning training on all the first feature matrices based on an original network classification model to obtain a deep learning website classification model; acquiring a second feature matrix; performing classification probability calculation on the second feature matrix based on the deep learning website classification model to obtain a classification probability vector; and identifying the second website as a counterfeit website or a normal website according to the classification probability value in the classification probability vector. The website identification method and the website identification system reduce manual intervention, reduce the probability that a normal website is mistakenly judged as a counterfeit website, and improve the accuracy of counterfeit website identification.

Description

Website identification method and website identification system
Technical Field
The invention relates to the technical field of network security, in particular to a website identification method and a website identification system.
Background
With the rapid development of the internet, websites have become an important way for various industries to present information to the public, and also a key target of attackers. A prominent attack mode is the counterfeit website (such as a phishing website): a large number of counterfeit websites not only prevent users from visiting websites safely, but also cause economic losses to users.
In order to improve the security of users accessing websites and reduce their economic losses, a common method for identifying counterfeit websites is as follows: acquire website data by visiting a website, extract website features from the website data, match the extracted website features against preset normal-website features to obtain a matching result, and judge from the matching result whether the website is a counterfeit website. The website data may be an image obtained by taking a screenshot of the website at a fixed resolution, or website images and text information crawled using crawler technology.
However, website features often depend on manual extraction to guarantee a certain accuracy. When a large number of websites must be handled, the workload of manually extracting website features is large and the efficiency is low; moreover, when counterfeit websites are judged from the match between extracted website features and normal-website features, normal websites are easily misjudged as counterfeit websites.
Disclosure of Invention
The invention aims to solve the technical problems that in the prior art, website features depend on manual extraction, counterfeit websites are judged according to matching results of the extracted website features and normal website features, the workload of manual website feature extraction is large, the efficiency is low, and the normal websites are easily mistakenly judged as the counterfeit websites, and provides a website identification method and a website identification system.
The technical scheme for solving the technical problems is as follows:
according to a first aspect of the present invention, there is provided a website identification method, including the steps of:
step 100, respectively acquiring a plurality of first feature matrices corresponding to a plurality of first websites one to one;
step 200, performing deep learning training on all the first feature matrices based on an original network classification model to obtain a deep learning website classification model;
step 300, acquiring a second feature matrix corresponding to a second website;
step 400, performing classification probability calculation on the second feature matrix based on the deep learning website classification model to obtain a classification probability vector;
step 500, identifying the second website as a counterfeit website or a normal website according to the classification probability value in the classification probability vector.
According to a second aspect of the present invention, there is provided a website identification system, comprising: a deep learning module and a website identification module;
the deep learning module is used for acquiring a first feature matrix corresponding to each of a plurality of first websites and a second feature matrix corresponding to a second website, performing deep learning training on all the first feature matrices based on an original network classification model to obtain a deep learning website classification model, and performing classification probability calculation on the second feature matrix based on the deep learning website classification model to obtain a classification probability vector;
and the website identification module is used for identifying the second website as a counterfeit website or a normal website according to the classification probability value in the classification probability vector.
The website identification method and the website identification system have the advantages that: deep learning training is carried out on the plurality of first feature matrixes based on the original website classification model to obtain a deep learning website classification model, so that the original website classification model is gradually and automatically corrected into the deep learning website classification model, and the accuracy of the deep learning website classification model is improved; the second feature matrix is classified and operated based on the deep learning website classification model to obtain a classification probability vector, so that manual intervention is reduced, and website features in the second feature matrix are intelligently operated to obtain the classification probability vector; the website is identified to be a counterfeit website or a normal website through the classification probability value in the classification probability vector, the classification probability value is closer to the authenticity of the category of the website, the probability that the normal website is judged to be the counterfeit website by mistake is reduced, and the accuracy of identifying the counterfeit website is improved.
Drawings
Fig. 1 is a schematic flowchart of a website identification method according to an embodiment of the present invention;
FIG. 2 is a diagram of a first feature matrix according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a calculation formula of a first classification prediction vector according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an original website classification model according to an embodiment of the present invention;
fig. 5 is a schematic diagram illustrating the classification accuracy of the deep learning website classification model when computing classification probability values according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a website identification system according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Example one
As shown in fig. 1, a schematic flow chart of a website identification method according to an embodiment of the present invention includes the following steps:
step 100, respectively acquiring a plurality of first feature matrices corresponding to a plurality of first websites one to one;
step 200, performing deep learning training on all the first feature matrices based on an original network classification model to obtain a deep learning website classification model;
step 300, acquiring a second feature matrix corresponding to a second website;
step 400, performing classification operation on the second feature matrix based on the deep learning website classification model to obtain a classification probability vector;
step 500, identifying the second website as a counterfeit website or a normal website according to the classification probability value in the classification probability vector.
Deep learning training is carried out on the plurality of first feature matrixes based on the original website classification model to obtain a deep learning website classification model, so that the original website classification model is gradually and automatically corrected into the deep learning website classification model, and the accuracy of the deep learning website classification model is improved; the second feature matrix is classified and operated based on the deep learning website classification model to obtain a classification probability vector, so that manual intervention is reduced, and website features in the second feature matrix are intelligently operated to obtain the classification probability vector; the website is identified to be a counterfeit website or a normal website through the classification probability value in the classification probability vector, the classification probability value is closer to the authenticity of the category of the website, the probability that the normal website is judged to be the counterfeit website by mistake is reduced, and the accuracy of identifying the counterfeit website is improved.
Preferably, step 100 specifically comprises:
step 110, acquiring a website data set, wherein the website data set comprises website text information of all first websites and website text information of all second websites;
step 120, determining an index tag set according to all website text information, wherein the index tag set comprises a plurality of index tag vectors corresponding to each website text information one by one;
step 130, grouping all index label vectors to obtain training sets corresponding to all first websites and test sets corresponding to second websites;
step 140, respectively performing word vector training on each index tag vector in a training set to obtain a plurality of first word vector sets corresponding to each index tag vector one by one, wherein each first word vector set comprises a plurality of first word vectors;
step 150, respectively combining all the first word vectors in each first word vector set according to a preset word list matrix to obtain the first feature matrices corresponding to each first website one by one.
For example, crawler software is adopted to automatically crawl website text information from 40 websites, wherein the 40 websites are divided into two categories, namely a counterfeit website and a normal website, the counterfeit website is represented by 1, and the normal website is represented by 2; each website corresponds to one website text message, and 40 website text messages form a website data set; each website text message corresponds to a feature matrix, for example: a 7 x 50 first feature matrix contains 7 first word vectors, each having a column dimension of 50, as shown in fig. 2.
Grouping all index label vectors in the index label set into a training set and a test set according to a preset grouping proportion, for example: with a preset proportion of 8:2, the training set has 32 index tag vectors and the test set has 8 index tag vectors; with a preset proportion of 9:1, the training set has 36 index tag vectors and the test set has 4 index tag vectors.
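A minimal Python sketch of this grouping step is given below; it assumes the index tag vectors are held in a plain list and that the split is done by the preset proportion alone, with an optional shuffle that the embodiment does not specify.

```python
import random

def split_index_tag_vectors(index_tag_vectors, train_ratio=0.8, seed=0):
    """Group index tag vectors into a training set and a test set by a preset ratio."""
    vectors = list(index_tag_vectors)
    random.Random(seed).shuffle(vectors)      # optional: randomise before grouping
    cut = int(len(vectors) * train_ratio)     # 40 vectors at 8:2 -> 32 train / 8 test
    return vectors[:cut], vectors[cut:]

# 40 websites -> 40 index tag vectors (placeholders here), split 8:2 into 32 + 8.
train_set, test_set = split_index_tag_vectors(range(40), train_ratio=0.8)
print(len(train_set), len(test_set))          # 32 8
```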
The mapping relation between the website text information and the index tag vectors is established, word vector training is carried out on each index tag vector to obtain a word vector, website features in the first feature matrix are represented by the word vectors, differentiation of the website features is improved, automatic extraction of the website features is achieved, and compared with a mode of manually extracting the website features, the work load of website feature extraction is reduced.
Preferably, the step 120 specifically includes:
step 121, filtering each website text message through a regular matching formula to obtain a plurality of text messages to be participled corresponding to each website text message one by one;
step 122, performing word segmentation on each text message to be segmented respectively to obtain a plurality of keywords corresponding to each text message to be segmented one by one;
step 123, respectively determining index tag vectors corresponding to the keywords one by one;
and step 124, combining all index tag vectors to obtain an index tag set.
Specifically, each website text message comprises an escape symbol, a website link and text information to be participled, and the escape symbol and the website link in each website text message are filtered through a regular matching formula to obtain the text information to be participled.
For example, one website text message is 'click button free getting million good gifts'. Word segmentation yields 6 keywords: 'click', 'button', 'free', 'getting', 'million' and 'good gifts'. An index tag that uniquely identifies the keyword is assigned to each keyword, and an index tag 7 that uniquely identifies unknown words is added. The index tag vector [1,2,3,4,5,6,7]^T is used with a word training tool (e.g., the word2vec tool) to train a neural network and obtain 7 first word vectors, and a word mapping table corresponding to the website is established, as shown in Table 1. In Table 1, unknown words are padded with 0; each index tag corresponds to one first word vector, and each first word vector is a 50-dimensional row vector.
TABLE 1

Index tag   Keyword            First word vector
1           click              [0.12, 0.31, …, 0.58]
2           button             [0.15, 0.91, …, 0.11]
3           free               [0.11, 0.41, …, 0.18]
4           getting            [0.32, 0.51, …, 0.23]
5           million            [0.22, 0.11, …, 0.56]
6           good gifts         [0.16, 0.61, …, 0.58]
7           0 (unknown word)   [0.11, 0.15, …, 0.36]
Information such as escape symbols and website links is filtered out of each website text message through the regular matching formula to obtain the text information to be segmented, which improves the accuracy of the text information to be segmented; each index label vector is then determined from the corresponding text information to be segmented, so that the keywords of the text are arranged in the first feature matrix in order from front to back via the index label vector, improving the accuracy of the first feature matrix.
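The regular matching formula and the word segmenter are not spelled out in the embodiment; the following Python sketch uses a few illustrative regular expressions and plain whitespace splitting as a stand-in (a Chinese word segmenter would be used in practice), and the small vocabulary is hypothetical.

```python
import re

def filter_website_text(raw_text):
    """Strip escape symbols and website links so only text to be segmented remains."""
    text = re.sub(r"https?://\S+", " ", raw_text)      # drop website links
    text = re.sub(r"&[a-zA-Z]+;|\\[ntr]", " ", text)    # drop escape symbols / entities
    return re.sub(r"\s+", " ", text).strip()

def build_index_tag_vector(text, vocabulary, unknown_tag):
    """Map each keyword to its index tag; unseen keywords map to the unknown tag."""
    keywords = text.split()                             # stand-in for a real word segmenter
    return [vocabulary.get(word, unknown_tag) for word in keywords]

vocabulary = {"click": 1, "button": 2, "free": 3, "getting": 4, "million": 5, "gifts": 6}
raw = "click <a href='http://example.com'>button</a> free getting million gifts"
clean = filter_website_text(re.sub(r"<[^>]+>", " ", raw))   # also strip leftover HTML tags
print(build_index_tag_vector(clean, vocabulary, unknown_tag=7))   # [1, 2, 3, 4, 5, 6]
```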
Preferably, in step 300, one or more second websites may be provided.
When there is one second website, the number of index tag vectors included in the test set is one, and step 300 specifically includes:
step 310a, performing word vector training on the index label vectors in the test set to obtain a second word vector set, wherein the second word vector set comprises a plurality of second word vectors;
step 320a, combining all second word vectors in the second word vector set according to the word list matrix to obtain a second feature matrix corresponding to the second website.
When there are a plurality of second websites, the number of the index tag vectors included in the test set is multiple, and step 300 specifically includes:
step 310b, respectively performing word vector training on each index tag vector in the test set to obtain a plurality of second word vector sets corresponding to each index tag vector one by one, wherein each second word vector set comprises a plurality of second word vectors;
step 320b, respectively combining all the second word vectors in each second word vector set according to the word list matrix to obtain a plurality of second feature matrices in one-to-one correspondence with each second website.
For example: the test set comprises 8 index label vectors, each a 64-dimensional column vector. Word vector training on each column vector yields 64 second word vectors, each a 256-dimensional row vector; the 64 second word vectors are combined according to a 64 x 256 word list matrix to obtain the corresponding second feature matrix, and the dimensions of each second feature matrix equal the dimensions of each first feature matrix.
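The combination described in this example can be read as an embedding lookup: each index tag selects one row of the word list matrix. A numpy sketch under these assumed dimensions (64 index tags, 256-dimensional word vectors, a randomly initialised word list matrix standing in for the trained one):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim, seq_len = 64, 256, 64
# Word list matrix produced by word vector training: one 256-dim row vector per index tag.
word_list_matrix = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

# A 64-dimensional index tag vector for one second website (values are illustrative).
index_tag_vector = rng.integers(0, vocab_size, size=seq_len)

# Combining: row i of the feature matrix is the word vector selected by the i-th index tag.
second_feature_matrix = word_list_matrix[index_tag_vector]
print(second_feature_matrix.shape)   # (64, 256), same dimensions as each first feature matrix
```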
Preferably, the original network classification model includes a convolutional neural network, a classification probability normalization calculation function, and a cross entropy function, and step 200 specifically includes:
step 210, performing deep learning training on any first feature matrix through a convolutional neural network to obtain a corresponding first classification prediction vector;
step 220, carrying out normalization calculation on the first classification prediction vector through a classification probability normalization calculation function to obtain a corresponding second classification prediction vector;
step 230, calculating a cross entropy value between the second classification prediction vector and the real label vector through a cross entropy function;
step 240, after correcting the weight parameters in the convolutional neural network according to the cross entropy value, returning to step 210 and repeating until training on all the first feature matrices is finished, so as to obtain the deep learning website classification model.
Specifically, the weight parameters in the convolutional neural network form a weight matrix W^T; the convolutional neural network extracts a website feature vector X from each first feature matrix, and W^T X yields the first classification prediction vector L. Fig. 3 shows the calculation formula of the first classification prediction vector L; in Fig. 3, m represents the number of website features corresponding to a first feature matrix and n represents the total number of website classifications. For example: counterfeit websites are divided into 8 levels, each level being one counterfeit website classification; normal websites are divided into 2 levels, each level being one normal website classification; the total number of website classifications is then 10.
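Read this way, the first classification prediction vector is the linear projection L = W^T X of Fig. 3; the numpy sketch below illustrates it with assumed sizes m = 10 website features and n = 10 website classifications, and random values standing in for the learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 10, 10               # m website features, n website classifications (8 counterfeit + 2 normal)
X = rng.normal(size=(m, 1)) # website feature vector extracted by the convolutional neural network
W = rng.normal(size=(m, n)) # weight matrix formed by the weight parameters

L = W.T @ X                 # first classification prediction vector, shape (n, 1)
print(L.shape)              # (10, 1)
```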
Preferably, the convolutional neural network includes a first convolution layer, a second convolution layer, and a full connection layer, where the first convolution layer includes a first convolution sublayer and a second convolution sublayer having different numbers of convolution kernels, the second convolution layer includes one or more third convolution sublayers, and the number of convolution kernels of the first convolution sublayer is equal to the number of convolution kernels of the third convolution sublayer. Step 210 specifically includes:
step 211, performing convolution operation on any first feature matrix through the first convolution sublayer in the first convolution layer to obtain a corresponding first output matrix, and performing convolution operation on the first feature matrix through the second convolution sublayer in the first convolution layer to obtain a corresponding second output matrix;
step 212, performing convolution operation on the second output matrix through one or more third convolution sublayers in the second convolution layer to obtain a corresponding third output matrix;
step 213, performing classification learning training on the first output matrix and the third output matrix through the full connection layer to obtain a corresponding first classification prediction vector.
As shown in fig. 4, the original website classification model of this embodiment includes an input layer, an embedding layer, a first convolution layer, a second convolution layer, a full connection layer, a classification probability calculation layer, and a weight parameter correction layer. The input layer feeds each index tag vector to the embedding layer; the embedding layer stores a word list matrix and, after performing word vector training on each index tag vector according to the word list matrix, outputs each first feature matrix or each second feature matrix.
The first convolution layer comprises a first convolution sublayer A, a first convolution sublayer B, a second convolution sublayer A and a second convolution sublayer B, and the second convolution layer comprises a third convolution layer A and a third convolution layer B. The numbers of convolution kernels of the first convolution sublayer A, the first convolution sublayer B, the third convolution layer A and the third convolution layer B are all equal; the number of convolution kernels of the second convolution sublayer A is equal to the number of convolution kernels of the second convolution sublayer B, and the number of convolution kernels of the second convolution sublayer B is greater than the number of convolution kernels of the third convolution layer B, for example: the second convolution sublayer B has 256 convolution kernels and the third convolution layer B has 128 convolution kernels. The convolution kernels in the first convolution sublayer A, the first convolution sublayer B, the second convolution sublayer A and the second convolution sublayer B have the same step size, for example: a step size of 1.
For any one first feature matrix, the first convolution sublayer A performs neural network convolution processing on the first feature matrix to obtain a first output matrix A, and the first convolution sublayer B performs neural network convolution processing on the first feature matrix to obtain a first output matrix B; the second convolution sublayer A performs neural network convolution processing on the first characteristic matrix to obtain a second output matrix A, the second output matrix A is processed through an excitation function and then input into a third convolution layer A, and the third convolution layer A performs neural network convolution processing on the second output matrix A to obtain a third output matrix A; the second convolution sublayer B performs neural network convolution processing on the first characteristic matrix to obtain a second output matrix B, the second output matrix B is processed through an excitation function and then input into a third convolution layer B, and the third convolution layer B performs neural network convolution processing on the second output matrix B to obtain a third output matrix B; wherein the first output matrix a, the first output matrix B, the third output matrix a and the third output matrix B have the same dimensions, for example: the third output matrix B is a 1 × 64 × 128 matrix.
The full connection layer performs neural network training on the first output matrix A, the first output matrix B, the third output matrix A and the third output matrix B to obtain each website classification prediction vector, and each neuron is dropped out according to a dropout probability so as to overcome overfitting of the full connection layer during neural network training, for example: a dropout probability of 0.5 gives a good overfitting-prevention effect.
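The branch structure just described can be sketched in PyTorch as follows. Only the kernel counts (128 / 256 / 128), the stride of 1, the dropout probability of 0.5, the output shapes and the 10 classes are taken from the examples above; the kernel sizes, the padding and the use of a single flattening linear layer are assumptions, so this is an illustrative reconstruction rather than the patented model.

```python
import torch
import torch.nn as nn

class WebsiteCNN(nn.Module):
    """Two-branch CNN sketch: first conv sublayers A/B feed the full connection layer
    directly, while second conv sublayers A/B pass through an excitation function and
    a third conv layer before the full connection layer (dropout 0.5, 10 classes)."""

    def __init__(self, embed_dim=256, seq_len=64, n_classes=10):
        super().__init__()
        # Kernel sizes 3 and 5 for the A and B branches are assumptions; the embodiment
        # only fixes the kernel counts (128 / 256 / 128) and a stride of 1.
        self.first_a = nn.Conv1d(embed_dim, 128, kernel_size=3, stride=1, padding=1)
        self.first_b = nn.Conv1d(embed_dim, 128, kernel_size=5, stride=1, padding=2)
        self.second_a = nn.Conv1d(embed_dim, 256, kernel_size=3, stride=1, padding=1)
        self.second_b = nn.Conv1d(embed_dim, 256, kernel_size=5, stride=1, padding=2)
        self.third_a = nn.Conv1d(256, 128, kernel_size=3, stride=1, padding=1)
        self.third_b = nn.Conv1d(256, 128, kernel_size=5, stride=1, padding=2)
        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(4 * 128 * seq_len, n_classes)

    def forward(self, x):                      # x: (batch, seq_len, embed_dim) feature matrix
        x = x.transpose(1, 2)                  # Conv1d expects (batch, channels, length)
        out1a, out1b = self.first_a(x), self.first_b(x)           # first output matrices
        out3a = self.third_a(torch.relu(self.second_a(x)))        # second -> excitation -> third
        out3b = self.third_b(torch.relu(self.second_b(x)))
        merged = torch.cat([out1a, out1b, out3a, out3b], dim=1)   # all share shape (batch, 128, 64)
        return self.fc(self.dropout(merged.flatten(1)))           # website classification prediction vector

logits = WebsiteCNN()(torch.randn(1, 64, 256))
print(logits.shape)                            # torch.Size([1, 10])
```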
The classification probability calculation layer stores a classification probability normalization calculation function, and the weight parameter correction layer stores a cross entropy function.
Neural network learning is performed on each first feature matrix through convolution layers of several sizes to obtain a plurality of output matrices, realizing layer-by-layer training of network features; website classification prediction vectors are extracted from the plurality of output matrices, strengthening the automatic learning and abstraction of website classification prediction vectors.
Preferably, the classification probability normalization calculation function is represented by a first formula:

[first formula image in the original publication]

or, alternatively,

[second formula image in the original publication]

where p(y^(i) = j | L^(i), W_j^T) represents the i-th second classification prediction vector, L^(i) represents the i-th first classification prediction vector, y^(i) represents the j-th website classification corresponding to the i-th first classification prediction vector, W_j^T represents the j-th weight coefficient, p_j represents the j-th classification label value in the i-th second classification prediction vector, 1 ≤ i ≤ n, and n represents the total number of websites.

[formula image in the original publication]

W_j^T may be a weight coefficient vector or a weight coefficient matrix.

p(y^(i) = j | L^(i), W_j^T) and p_j are two expressions of the first formula: p(y^(i) = j | L^(i), W_j^T) expresses the first formula in vector form, while p_j expresses the first formula as a numerical value within that vector.

Normalizing each classification label value in each second classification prediction vector through the two forms of the first formula guarantees that each classification label value lies between 0 and 1, while each weight coefficient W_j^T and each first classification prediction vector L^(i) are not limited to a fixed range of values, i.e. they may range from infinitely large to infinitesimally small.
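The behaviour described here, mapping unbounded prediction values to classification label values between 0 and 1, is that of a softmax-style normalization; since the first formula itself is only reproduced as an image above, the following numpy sketch is an assumed reading rather than the exact formula.

```python
import numpy as np

def normalize_classification_probabilities(L):
    """Softmax-style normalization: maps a first classification prediction vector L
    (unbounded values) to a second classification prediction vector whose entries
    lie between 0 and 1 and sum to 1."""
    shifted = L - np.max(L)          # numerical stability; does not change the result
    exp = np.exp(shifted)
    return exp / exp.sum()

L = np.array([2.0, -1.0, 0.5, 7.3])  # illustrative first classification prediction vector
p = normalize_classification_probabilities(L)
print(p.round(4), p.sum())           # entries in (0, 1), summing to 1 up to floating point
```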
Preferably, the cross entropy function is expressed by a second formula:
[formula image of the second formula in the original publication]

where loss represents the cross entropy value, g_k represents the k-th true label value in the j-dimensional true label vector, p_k represents the k-th classification probability predicted value in the i-th second classification prediction vector, and k ≤ j.

For example: the weight parameter correction layer calculates an update for each weight parameter from the cross entropy value and writes it back to that weight parameter, reducing the cross entropy function by taking partial derivatives; each cross entropy value updates the weight parameters in time order so that they gradually approach the true label values. This ensures the accuracy of the deep learning website classification model, reduces the difference between the classification probability values and the true label values, and improves the accuracy of the classification probability values. As shown in fig. 5, the classification accuracy computed by the deep learning website classification model is only 17% before deep learning training and rises to 96.5% after 50 rounds of deep learning training.
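Under the usual reading of the second formula as loss = -Σ_k g_k·log(p_k) (an assumption, since the formula is only reproduced as an image), the correction of the weight parameters can be illustrated with a plain gradient-style update in numpy; the learning rate, dimensions and number of steps are arbitrary.

```python
import numpy as np

def cross_entropy(true_label_vector, predicted_probabilities, eps=1e-12):
    """loss = -sum_k g_k * log(p_k): cross entropy between the real label vector g
    and the second classification prediction vector p (an assumed reading of the
    second formula, whose image is not reproduced here)."""
    p = np.clip(predicted_probabilities, eps, 1.0)
    return -np.sum(true_label_vector * np.log(p))

def softmax(L):
    e = np.exp(L - np.max(L))
    return e / e.sum()

# One illustrative correction loop: nudge the weights down the cross-entropy gradient
# so the predicted probabilities move toward the real label vector.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 1))              # website feature vector
W = rng.normal(size=(10, 10))             # weight matrix
g = np.eye(10)[3]                         # real label vector (one-hot, class 3)

for step in range(50):
    p = softmax((W.T @ X).ravel())
    grad = np.outer(X.ravel(), p - g)     # d(loss)/dW for softmax + cross entropy
    W -= 0.5 * grad                       # weight parameter correction
    if step in (0, 49):
        print(step, round(cross_entropy(g, p), 4))   # loss shrinks as training proceeds
```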
Preferably, step 500 specifically includes:
and determining the maximum classification probability value from the classification probability vectors, and identifying whether the second website is a counterfeit website or a normal website according to the maximum classification probability value.
For example, if the website corresponding to the maximum classification probability value is classified as a counterfeit website, the second website is identified as a counterfeit website, and if the website corresponding to the maximum classification probability value is classified as a normal website, the second website is identified as a normal website.
The counterfeit websites or the normal websites are identified through the maximum classification probability value, the identification mode can be simplified, and the identification efficiency of the counterfeit websites is improved in the face of a large number of counterfeit websites.
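A short sketch of this identification rule, assuming the 10-class example from above with the first 8 classes being counterfeit website classifications; the ordering of the classes is an assumption.

```python
import numpy as np

COUNTERFEIT_CLASSES = set(range(8))      # assumed ordering: classes 0-7 counterfeit, 8-9 normal

def identify(classification_probability_vector):
    """Pick the maximum classification probability value and map its class to a verdict."""
    best_class = int(np.argmax(classification_probability_vector))
    return "counterfeit website" if best_class in COUNTERFEIT_CLASSES else "normal website"

probs = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.01, 0.02, 0.03, 0.05, 0.80])
print(identify(probs))                   # normal website (class 9 has the largest probability)
```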
Example two
In this embodiment, as shown in fig. 6, a website identification system includes a deep learning module and a website identification module; the deep learning module is used for respectively acquiring a plurality of first feature matrixes which are in one-to-one correspondence with a plurality of first websites, performing deep learning training on all the first feature matrixes based on an original network classification model to obtain a deep learning website classification model, and performing classification probability calculation on a second feature matrix based on the deep learning website classification model to obtain a classification probability vector;
and the website identification module is used for identifying the second website as a counterfeit website or a normal website according to the classification probability value in the classification probability vector.
Preferably, the deep learning module comprises an input layer, an embedding layer, a convolutional neural network, a classification probability calculation layer and a weight parameter correction layer.
The input layer is used for acquiring a website data set, and the website data set comprises website text information of all first websites and website text information of all second websites; determining an index tag set according to all website text information, wherein the index tag set comprises a plurality of index tag vectors which are in one-to-one correspondence with each website text information; grouping all the index label vectors to obtain a training set corresponding to all the first websites and a test set corresponding to the second websites;
and the embedding layer is used for respectively carrying out word vector training on each index label vector in the training set to obtain a plurality of first word vector sets corresponding to each index label vector one to one, wherein each first word vector set comprises a plurality of first word vectors, and all the first word vectors in each first word vector set are respectively combined according to a preset word list matrix to obtain a first feature matrix corresponding to each first website one to one.
The embedded layer is further used for performing word vector training on one index tag vector contained in the test set when one second website is provided, so as to obtain a second word vector set, wherein the second word vector set comprises a plurality of second word vectors, and all the second word vectors in the second word vector set are combined according to the word list matrix, so as to obtain a second feature matrix corresponding to the second website; when a plurality of second websites are provided, respectively carrying out word vector training on each index tag vector in the test set to obtain a plurality of second word vector sets corresponding to each index tag vector one by one, wherein the number of the index tag vectors contained in the test set is multiple, each second word vector set comprises a plurality of second word vectors, and respectively combining all the second word vectors in each second word vector set according to the word list matrix to obtain a plurality of second feature matrices corresponding to each second website one by one.
The convolutional neural network is used for carrying out deep learning training on any one first characteristic matrix to obtain each corresponding first classification prediction vector;
and the classification probability calculation layer is used for carrying out normalization calculation on the first classification prediction vector through a classification probability normalization calculation function to obtain a corresponding second classification prediction vector.
And the weight parameter correction layer is used for calculating a cross entropy value between the second classification prediction vector and the real label vector through a cross entropy function, correcting the weight parameters in the convolutional neural network according to the cross entropy value, and obtaining the deep learning website classification model once training on all the first feature matrices is finished.
Preferably, the input layer is specifically configured to filter each website text information through a regular matching formula, so as to obtain a plurality of text information to be participled corresponding to each website text information one to one; segmenting each text message to be segmented respectively to obtain a plurality of keywords which are in one-to-one correspondence with each text message to be segmented; respectively determining index tag vectors corresponding to the keywords one by one; and combining all index label vectors to obtain an index label set.
Preferably, the convolutional neural network comprises a first convolutional layer, a second convolutional layer and a full-link layer, wherein the first convolutional layer comprises a first convolutional sublayer and a second convolutional sublayer which have different numbers of convolutional kernels, the second convolutional layer comprises one or more third convolutional sublayers, and the number of convolutional kernels of the first convolutional sublayer is equal to the number of convolutional kernels of the third convolutional sublayer.
And the first convolution layer is used for performing convolution operation on any first characteristic matrix through the first convolution sublayer to obtain a corresponding first output matrix, and performing convolution operation on the first characteristic matrix through the second convolution sublayer to obtain a corresponding second output matrix.
And the second convolution layer is used for performing convolution operation on the second output matrix through the one or more third convolution sublayers to obtain a corresponding third output matrix.
And the full connection layer is used for carrying out classification learning training on the first output matrix and the third output matrix to obtain a corresponding first classification prediction vector.
Preferably, the classification probability normalization calculation function is represented by a first formula:

[first formula image in the original publication]

or, alternatively,

[second formula image in the original publication]

where p(y^(i) = j | L^(i), W_j^T) represents the i-th second classification prediction vector, L^(i) represents the i-th first classification prediction vector, y^(i) represents the j-th website classification corresponding to the i-th first classification prediction vector, W_j^T represents the j-th weight coefficient, p_j represents the j-th classification label value in the i-th second classification prediction vector, 1 ≤ i ≤ n, and n represents the total number of websites.

[formula image in the original publication]

W_j^T may be a vector or a matrix.
Preferably, the cross entropy function is expressed by a second formula:
[formula image of the second formula in the original publication]

where loss represents the cross entropy value, g_k represents the k-th true label value in the j-dimensional true label vector, p_k represents the k-th classification probability predicted value in the i-th second classification prediction vector, and k ≤ j.
Preferably, the website identification module is specifically configured to determine a maximum classification probability value from the classification probability vectors, and identify the second website as a counterfeit website or a normal website according to the maximum classification probability value.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A website identification method is characterized by comprising the following steps:
step 100, respectively acquiring a plurality of first feature matrices corresponding to a plurality of first websites one to one;
step 200, performing deep learning training on all the first feature matrices based on an original network classification model to obtain a deep learning website classification model;
step 300, acquiring a second feature matrix corresponding to a second website;
step 400, performing classification operation on the second feature matrix based on the deep learning website classification model to obtain a classification probability vector;
step 500, identifying the second website as a counterfeit website or a normal website according to the classification probability value in the classification probability vector;
the step 100 specifically includes:
step 110, acquiring a website data set, wherein the website data set comprises website text information of all the first websites and website text information of the second websites;
step 120, determining an index tag set according to all the website text information, wherein the index tag set comprises a plurality of index tag vectors corresponding to each website text information one by one;
step 130, grouping all the index tag vectors to obtain training sets corresponding to all the first websites and test sets corresponding to the second websites;
step 140, performing word vector training on each index tag vector in the training set to obtain a plurality of first word vector sets corresponding to each index tag vector one to one, where each first word vector set includes a plurality of first word vectors;
step 150, respectively combining all the first word vectors in each first word vector set according to a preset word list matrix to obtain a plurality of first feature matrices corresponding to each first website one to one.
2. The method according to claim 1, wherein the step 120 specifically comprises:
step 121, filtering each website text message through a regular matching formula to obtain a plurality of text messages to be participled corresponding to each website text message one by one;
step 122, performing word segmentation on each text message to be word segmented respectively to obtain a plurality of keywords corresponding to each text message to be word segmented one by one;
step 123, respectively determining index tag vectors corresponding to the keywords one by one;
and step 124, combining all the index tag vectors to obtain the index tag set.
3. The website identification method according to claim 1, wherein in the step 300, one or more second websites are provided,
when there is one second website, and the number of index tag vectors included in the test set is one, the step 300 specifically includes:
step 310a, performing word vector training on the index label vectors in the test set to obtain a second word vector set, wherein the second word vector set comprises a plurality of second word vectors;
step 320a, combining all the second word vectors in the second word vector set according to the vocabulary matrix to obtain the second feature matrix corresponding to the second website;
when there are a plurality of second websites, and the number of the index tag vectors included in the test set is multiple, the step 300 specifically includes:
step 310b, respectively performing word vector training on each index tag vector in the test set to obtain a plurality of second word vector sets corresponding to each index tag vector one by one, wherein each second word vector set comprises a plurality of second word vectors;
and 320b, respectively combining all the second word vectors in each second word vector set according to the word list matrix to obtain a plurality of second feature matrices corresponding to each second website one to one.
4. The method according to claim 1, wherein the original network classification model includes a convolutional neural network, a classification probability normalization calculation function, and a cross entropy function, and the step 200 specifically includes:
step 210, performing deep learning training on any one first feature matrix through the convolutional neural network to obtain a corresponding first classification prediction vector;
step 220, carrying out normalization calculation on the first classification prediction vector through the classification probability normalization calculation function to obtain a corresponding second classification prediction vector;
step 230, calculating a cross entropy value between the second classification prediction vector and a real label vector through the cross entropy function;
step 240, after the weight parameters in the convolutional neural network are corrected according to the cross entropy value, returning to the step 210 and repeating until training on all the first feature matrices is finished, so as to obtain the deep learning website classification model.
5. The website identification method according to claim 4, wherein the convolutional neural network comprises a first convolutional layer, a second convolutional layer and a full link layer, the first convolutional layer comprises a first convolutional sublayer and a second convolutional sublayer with different numbers of convolutional kernels, the second convolutional layer comprises one or more third convolutional sublayers, the number of convolutional kernels of the first convolutional sublayer is equal to the number of convolutional kernels of the third convolutional sublayer, and the step 210 specifically comprises:
step 211, performing convolution operation on any one of the first feature matrices through the first convolution sublayer in the first convolution layer to obtain a corresponding first output matrix, and performing convolution operation on the first feature matrix through the second convolution sublayer in the first convolution layer to obtain a corresponding second output matrix;
step 212, performing convolution operation on the second output matrix through one or more third convolution sublayers in the second convolution layer to obtain a corresponding third output matrix;
step 213, performing classification learning training on the first output matrix and the third output matrix through the full connection layer to obtain a corresponding first classification prediction vector.
6. The method of claim 4, wherein the normalized computation function of classification probability is expressed by a first formula:
[first formula image in the original publication]

wherein p(y^(i) = j | L^(i), W_j^T) represents the i-th said second classification prediction vector, L^(i) represents the i-th said first classification prediction vector, y^(i) represents the j-th website classification corresponding to the i-th first classification prediction vector, W_j^T represents the j-th weight coefficient, p_j represents the j-th classification probability predicted value in the i-th said second classification prediction vector, 1 ≤ i ≤ n, and n represents the total number of websites.
7. The website identification method according to claim 6, wherein the cross entropy function is expressed by a second formula:
[second formula image in the original publication]

wherein loss represents the cross entropy value, g_k represents the k-th real tag value in the j-dimensional said real tag vector, p_k represents the k-th classification probability predicted value in the i-th said second classification prediction vector, and k ≤ j.
8. The website identification method according to any one of claims 1 to 7, wherein the step 500 specifically comprises:
determining the maximum classification probability value from the classification probability vector;
and identifying the second website as the counterfeit website or the normal website according to the maximum classification probability value.
9. A website identification system, comprising: a deep learning module, a website identification module and an input layer;
the deep learning module is used for acquiring first feature matrixes corresponding to a plurality of first websites and second feature matrixes corresponding to second websites, performing deep learning training on all the first feature matrixes based on an original network classification model to obtain a deep learning website classification model, and performing classification probability calculation on the second feature matrixes based on the deep learning website classification model to obtain a classification probability vector;
the website identification module is used for identifying the second website as a counterfeit website or a normal website according to the classification probability value in the classification probability vector;
the input layer is used for acquiring a website data set, and the website data set comprises website text information of all first websites and website text information of all second websites; determining an index tag set according to all website text information, wherein the index tag set comprises a plurality of index tag vectors which are in one-to-one correspondence with each website text information; and grouping all the index label vectors to obtain a training set corresponding to all the first websites and a test set corresponding to the second websites.
CN201811427628.4A 2018-11-27 2018-11-27 Website identification method and website identification system Active CN109617864B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811427628.4A CN109617864B (en) 2018-11-27 2018-11-27 Website identification method and website identification system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811427628.4A CN109617864B (en) 2018-11-27 2018-11-27 Website identification method and website identification system

Publications (2)

Publication Number Publication Date
CN109617864A (en) 2019-04-12
CN109617864B (en) 2021-04-16

Family

ID=66005321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811427628.4A Active CN109617864B (en) 2018-11-27 2018-11-27 Website identification method and website identification system

Country Status (1)

Country Link
CN (1) CN109617864B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442823A (en) * 2019-08-06 2019-11-12 北京智游网安科技有限公司 Website classification method, Type of website judgment method, storage medium and intelligent terminal
CN110807197A (en) * 2019-10-31 2020-02-18 支付宝(杭州)信息技术有限公司 Training method and device for recognition model and risk website recognition method and device
CN111078869A (en) * 2019-11-07 2020-04-28 国家计算机网络与信息安全管理中心 Method and device for classifying financial websites based on neural network

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8218859B2 (en) * 2008-12-05 2012-07-10 Microsoft Corporation Transductive multi-label learning for video concept detection
CN103544436A (en) * 2013-10-12 2014-01-29 深圳先进技术研究院 System and method for distinguishing phishing websites
CN104239485A (en) * 2014-09-05 2014-12-24 中国科学院计算机网络信息中心 Statistical machine learning-based internet hidden link detection method
CN105338001A (en) * 2015-12-04 2016-02-17 北京奇虎科技有限公司 Method and device for recognizing phishing website

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014018630A1 (en) * 2012-07-24 2014-01-30 Webroot Inc. System and method to provide automatic classification of phishing sites

Also Published As

Publication number Publication date
CN109617864A (en) 2019-04-12

Similar Documents

Publication Publication Date Title
CN108737406B (en) Method and system for detecting abnormal flow data
CN107835496B (en) Spam short message identification method and device and server
CN109299741B (en) Network attack type identification method based on multi-layer detection
CN110602113B (en) Hierarchical phishing website detection method based on deep learning
CN109617864B (en) Website identification method and website identification system
WO2019179403A1 (en) Fraud transaction detection method based on sequence width depth learning
CN109635010B (en) User characteristic and characteristic factor extraction and query method and system
CN109190698B (en) Classification and identification system and method for network digital virtual assets
CN112258223B (en) Marketing advertisement click prediction method based on decision tree
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN110647916A (en) Pornographic picture identification method and device based on convolutional neural network
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN111178364A (en) Image identification method and device
CN112085112A (en) Image category detection method, system, electronic equipment and storage medium
CN111833310A (en) Surface defect classification method based on neural network architecture search
CN113571133A (en) Lactic acid bacteria antibacterial peptide prediction method based on graph neural network
CN109101984B (en) Image identification method and device based on convolutional neural network
CN111079930B (en) Data set quality parameter determining method and device and electronic equipment
CN114119191A (en) Wind control method, overdue prediction method, model training method and related equipment
CN116595486A (en) Risk identification method, risk identification model training method and corresponding device
CN116722992A (en) Fraud website identification method and device based on multi-mode fusion
CN112529637B (en) Service demand dynamic prediction method and system based on context awareness
CN114842425A (en) Abnormal behavior identification method for petrochemical process and electronic equipment
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment
CN114926702A (en) Small sample image classification method based on depth attention measurement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant