CN111198995B - Malicious webpage identification method

Info

Publication number: CN111198995B
Application number: CN202010012212.7A
Authority: CN
Inventors: 廖永建; 王勇; 王栋; 吴宇; 梁艺宽
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2023-03-24
Anticipated expiration: 2040-01-07
Also published as: CN111198995A

Abstract

The invention discloses a malicious webpage identification method comprising the following steps: step 1, acquiring a malicious webpage data set and obtaining a training set and a test set of malicious webpages through data preprocessing; step 2, obtaining the character-level embedding of the training set and the test set by using a Char-CNN model; step 3, constructing a BiLSTM-Attention neural network model; step 4, training the BiLSTM-Attention neural network model constructed in step 3 by using the training set, its character-level embedding and static word embedding; step 5, verifying the BiLSTM-Attention neural network model trained in step 4 by using the test set, its character-level embedding and static word embedding; and step 6, after the verification in step 5 is passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by a user. The method uses an attention-based bidirectional long short-term memory (BiLSTM) recurrent neural network together with a combination of character-level embedding and static word embedding, thereby identifying malicious webpages.

Description

Malicious webpage identification method

Technical Field

The invention relates to the technical field of internet security, in particular to a malicious webpage identification method.

Background

With the development of the internet industry in recent years, networks have become an indispensable part of people's lives. At the same time, however, malicious criminal activity using the internet is also increasing. Using malicious webpages to carry out phishing attacks, spread spam advertisements, and lure users into downloading malicious software are the main forms of internet crime. According to the "Global Chinese Phishing Website Status Statistical Analysis Report (2016)" and recent reports of the Anti-Phishing Alliance, China is the country most heavily affected by malicious webpages, and the number of malicious webpages is increasing rapidly year by year.

The traditional method for identifying malicious webpages is generally based on blacklist technology, which is also the method most widely used in industry today. The blacklist technique maintains a list of malicious domain names: if an accessed domain name is in the list, it is regarded as malicious; if it is not in the list, the browser regards it as a normal domain name. The advantages of this method are that it is simple to implement and that confirmed malicious webpages are identified accurately. Its disadvantages are that it cannot identify malicious domain names that have not appeared before and that it requires technicians to maintain the list of malicious domain names continuously.
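The blacklist lookup described above can be sketched in a few lines of Python. This is only an illustration: the domain names in `MALICIOUS_DOMAINS` and the function name `is_blacklisted` are hypothetical and do not come from the patent.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist; a real deployment would load a maintained list.
MALICIOUS_DOMAINS = {"bad.example", "phish.example"}

def is_blacklisted(url, blacklist=MALICIOUS_DOMAINS):
    """Return True when the host name of the URL appears in the blacklist."""
    # urlsplit only fills in the host name when a scheme or "//" is present.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return host.lower() in blacklist
```

A domain that has never been reported, such as `new-bad.example`, passes this check, which is exactly the weakness noted above.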

With the development of machine learning technology in recent years, more and more people apply machine learning to malicious webpage detection. In these approaches, features such as the url length, whether the url is an https link, and the domain name length are extracted manually from the url link; honeypot technology is used to inspect the content of the webpage, for example to detect whether a malicious script exists or whether pictures on the website are illegal; and the samples are then classified with machine learning algorithms such as SVM and random forest. However, this approach depends heavily on network security experts: someone who is very familiar with malicious webpages must extract features from the malicious webpage data set by hand, and the quality of the final classification result is strongly influenced by these manually extracted features.
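As an illustration of the hand-crafted features mentioned above, a minimal sketch might compute the three url-level features named in the text. The function name and dictionary keys are hypothetical, and a real system of this kind would use many more features.

```python
from urllib.parse import urlsplit

def manual_url_features(url):
    """Compute three hand-picked features of a URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),                # total number of characters
        "is_https": parts.scheme == "https",   # whether the link uses https
        "domain_length": len(host),            # length of the host name
    }
```

Feature dictionaries like this would then be fed to a classifier such as an SVM or a random forest; the invention avoids this manual step.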

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the problems described above, a malicious webpage identification method is provided that classifies URL links directly by using character-level embedding and a bidirectional long short-term memory recurrent neural network (BiLSTM), thereby identifying malicious webpages.

The technical scheme adopted by the invention is as follows:

A malicious webpage identification method comprises the following steps:

step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing;

step 2, obtaining the character-level embedding of the training set and the test set by using a Char-CNN model;

step 3, constructing a BiLSTM-Attention neural network model;

step 4, training the BiLSTM-Attention neural network model constructed in step 3 by using the training set, its character-level embedding and static word embedding;

step 5, verifying the BiLSTM-Attention neural network model trained in step 4 by using the test set, its character-level embedding and static word embedding;

and step 6, after the verification in step 5 is passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by the user.

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

The invention uses an attention-based bidirectional long short-term memory recurrent neural network together with a combination of character-level embedding and static word embedding to identify malicious webpages. Compared with traditional malicious webpage identification methods, the method of the invention has the following advantages:

1. no personnel are required to maintain a domain name blacklist;

2. no special network security personnel are required to design features;

3. the identification rate of newly appeared malicious web pages is high;

4. the method is suitable for identifying the malicious web pages appearing at the mobile terminal.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a flowchart of a malicious web page identification method according to the present invention.

FIG. 2 is a schematic structural diagram of the BiLSTM-Attention neural network model constructed by the present invention.

FIG. 3 is a schematic diagram of using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by a user.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, a method for identifying a malicious web page includes the following steps:

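Step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing. Specifically: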

step 1.1, removing samples of the malicious webpage data set whose url link or label is missing, and then performing word segmentation. English text is normally segmented on spaces, but a url link is a special kind of English text that contains no spaces; this embodiment therefore uses the Python wordninja module to segment the url links in the malicious webpage data set, and all symbols in the url links are retained;

step 1.2, url links contain many abbreviated words, so preprocessing operations such as stemming and lemmatization are required. This embodiment uses the PorterStemmer module and the WordNetLemmatizer module of the Python NLTK package to stem and lemmatize the url links in the malicious webpage data set;

step 1.3, to avoid mixing uppercase and lowercase letters, this embodiment converts all letters of the url links in the malicious webpage data set to a single case, lowercase or uppercase (preferably lowercase, using the Python lower() method), which completes the normalization;

and step 1.4, dividing the malicious webpage data set processed in steps 1.1 to 1.3 into a training set and a test set in a ratio of 7:3 or 8:2 (preferably 8:2).
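Steps 1.1 to 1.4 can be sketched with the standard library alone. This is a simplified stand-in, not the embodiment itself: a regular expression replaces the wordninja segmentation (so a run such as `darkmoon` is not split into words), the NLTK stemming and lemmatization of step 1.2 are omitted, and the names `tokenize`, `preprocess`, `train_ratio` and `seed` are assumptions made for illustration.

```python
import random
import re

def tokenize(url):
    """Lowercase a URL and split it into alphanumeric runs and single symbols."""
    # Every symbol is kept as its own token, as required by step 1.1.
    return re.findall(r"[a-z0-9]+|[^a-z0-9\s]", url.lower())

def preprocess(samples, train_ratio=0.8, seed=0):
    """samples: iterable of (url, label) pairs; a missing value is None."""
    # Step 1.1: drop samples whose url or label is missing, then tokenize.
    cleaned = [(tokenize(u), y) for u, y in samples if u and y is not None]
    # Step 1.4: shuffle and split, 8:2 by default.
    random.Random(seed).shuffle(cleaned)
    cut = int(len(cleaned) * train_ratio)
    return cleaned[:cut], cleaned[cut:]
```

Passing `train_ratio=0.7` gives the alternative 7:3 split.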

Step 2, obtaining the character-level embedding of the training set and the test set by using a Char-CNN model. Specifically:

step 2.1, constructing a character table:

The 69 characters of the character table, namely the 26 lowercase letters abcdefghijklmnopqrstuvwxyz, the 10 digits 0123456789, and punctuation and symbol characters such as - ; | ! : ' " / \ _ # $ % & + = [ ] { }, are encoded using one-hot, and an all-zero vector is then added for handling characters that are not in the character table, forming a character table of 70 characters. Each character is thus represented as a one-hot vector, for example a = [1, 0, 0, 0, …], having a dimension of 50.

Step 2.2, the training set or the test set is represented by the one-hot vectors of the character table and is then input into the Char-CNN model for training to obtain the corresponding character-level embedding, for example: a = [0.2324124, 0.2124244, 0.5252411, …].

The Char-CNN model is a neural network model with 6 convolutional layers.
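The one-hot encoding of step 2.1 can be sketched as follows. The alphabet used here is an assumption: it contains the letters, the digits and only those symbols that are legible in the text, so it is shorter than the 69-character table of the embodiment. The Char-CNN that turns these vectors into dense embeddings is not shown.

```python
import string

# Assumed alphabet: letters, digits and the symbols legible in step 2.1.
ALPHABET = string.ascii_lowercase + string.digits + "-;|!:'\"/\\_#$%&+=[]{}"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot(ch):
    """Return the one-hot vector of a character, or all zeros if it is unknown."""
    vec = [0] * len(ALPHABET)
    idx = CHAR_INDEX.get(ch.lower())
    if idx is not None:
        vec[idx] = 1
    return vec

def encode_url(url):
    """Encode a URL as a list of one-hot vectors, one per character."""
    return [one_hot(ch) for ch in url]
```

Characters outside the table, for example `~` with this alphabet, map to the all-zero vector, which plays the role of the extra entry described in step 2.1.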

Step 3, constructing a BiLSTM-Attention neural network model. The BiLSTM model is a bidirectional LSTM (bidirectional long short-term memory recurrent neural network) that processes the data of the input layer in both the forward and the backward direction, so that it captures the context information in a sentence better than a unidirectional LSTM; an attention layer is then added on top of the BiLSTM model to form the BiLSTM-Attention neural network model.

As shown in fig. 2, specifically:

step 3.1, constructing an input layer, wherein the input layer receives the malicious webpage data set after the data preprocessing of step 1, for example [www.dark moon.com];

step 3.2, constructing an embedding layer, wherein the embedding layer replaces the words in the malicious webpage data set with the character-level embedding of the data set and the static word embedding to obtain an embedded representation of each url link in the malicious webpage data set;

step 3.3, constructing an LSTM layer, wherein the LSTM layer comprises two layers, one being a forward propagation layer and the other a backward propagation layer; each of the two layers includes a forget gate, an input gate, an output gate and a cell state, wherein,

(1) Updating the forget gate output:

$$f_t = \sigma(w_f h_{t-1} + U_f x_t + b_f)$$

where $h_{t-1}$ denotes the history information, $x_t$ denotes the new information flowing into the cell, and $b_f$ is a bias term;

(2) Updating the two parts of the input gate output:

$$i_t = \sigma(w_i h_{t-1} + U_i x_t + b_i)$$

$$a_t = \tanh(w_a h_{t-1} + U_a x_t + b_a)$$

(3) Updating the cell state, where $\odot$ denotes element-wise multiplication:

$$C_t = C_{t-1} \odot f_t + i_t \odot a_t$$

(4) Updating the two parts of the output gate output:

$$o_t = \sigma(w_o h_{t-1} + U_o x_t + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$

(5) The prediction output at the current sequence index:

$$y_t = \sigma(V h_t + c)$$

wherein $w_f, U_f, b_f, w_i, U_i, w_a, U_a, w_o, U_o$ are parameters of the BiLSTM-Attention neural network model to be obtained by training, and $\sigma$ is the sigmoid function;

step 3.4, constructing an attention layer, wherein the attention layer computes a weight for every time step and then outputs the weighted sum over all time steps as a feature vector;

step 3.5, constructing an output layer, wherein the output layer is a fully connected layer that takes the output of the attention layer as its input and processes it with a softmax classifier to obtain the classification result.
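The gate updates (1) to (4) of step 3.3 can be traced with a small pure-Python sketch of a single LSTM time step. The layout of the parameter dictionary `p` (one `(w, U, b)` triple per gate, stored as nested lists) and the helper names are assumptions made for illustration; a practical implementation would use a deep learning framework, and the prediction output (5) is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(params, h_prev, x_t, activation):
    """Apply activation(w @ h_prev + U @ x_t + b), one output per matrix row."""
    w, u, b = params
    return [
        activation(
            sum(wj * hj for wj, hj in zip(w_row, h_prev))
            + sum(uk * xk for uk, xk in zip(u_row, x_t))
            + b_i
        )
        for w_row, u_row, b_i in zip(w, u, b)
    ]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    f_t = gate(p["f"], h_prev, x_t, sigmoid)      # (1) forget gate
    i_t = gate(p["i"], h_prev, x_t, sigmoid)      # (2) input gate
    a_t = gate(p["a"], h_prev, x_t, math.tanh)    # (2) candidate values
    o_t = gate(p["o"], h_prev, x_t, sigmoid)      # (4) output gate
    # (3) cell state: keep part of the old state, add part of the candidate.
    c_t = [c * f + i * a for c, f, i, a in zip(c_prev, f_t, i_t, a_t)]
    # (4) hidden state: gated, squashed cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all parameters set to zero every sigmoid gate equals 0.5 and the candidate is 0, so the cell state is simply halved, which is a convenient sanity check.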

Step 4, training the BiLSTM-Attention neural network model constructed in step 3 by using the training set, its character-level embedding and static word embedding; the static word embedding may be the GloVe static word vectors trained by Stanford University, with a dimension of 50.

Specifically, the method comprises the following steps:

step 4.1, constructing a text dictionary, represented by vectors, from all words in the url link texts of the training set;

step 4.2, comparing the constructed text dictionary with the static word embedding entry by entry: if the static word embedding contains a vector for a word in the text dictionary, the word is replaced by that static word vector; if it does not, the word is replaced by character-level embedding. This yields the vector representation of each url link in the training set. That is, each token $w_i$ (a word or a character) of a url link $S = (w_1, w_2, \dots, w_n)$ in the training set is mapped to a vector $W_i$, where $S$ denotes a url link, $w_i$ denotes a word in the url link, and $W_i$ is its embedding vector of dimension 50. The entire training set is replaced in this way, so that each url link is represented as a two-dimensional matrix in which each column is a word vector or a character vector.
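One possible reading of the replacement rule in step 4.2 is sketched below. The lookup tables `static_emb` and `char_emb` are hypothetical dictionaries mapping a token or a character to its vector. How an out-of-vocabulary word is turned into character-level vectors is not spelled out in the text; this sketch assumes one vector per character, and it stores the vectors as rows rather than columns.

```python
def embed_url(tokens, static_emb, char_emb, dim=50):
    """Return one vector per word or character of a tokenised url."""
    zero = [0.0] * dim
    vectors = []
    for tok in tokens:
        if tok in static_emb:
            # The word has a static word vector: use it directly.
            vectors.append(static_emb[tok])
        else:
            # Out-of-vocabulary word: fall back to one vector per character.
            vectors.extend(char_emb.get(ch, zero) for ch in tok)
    return vectors
```

For example, with `static_emb = {"www": [...]}` the token `www` contributes a single vector, while an unknown token such as `xq` contributes two character vectors.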

step 4.3, inputting the vector representation of each url link in the training set into the forward propagation layer and the backward propagation layer of the LSTM layer; the two layers together extract the language information carried by the input url link vectors; the outputs of the forward propagation layer and the backward propagation layer at the same time step are added to obtain the semantic feature vector of each url link, which is then passed to the attention layer;

step 4.4, the attention layer receives the semantic feature vector of each url link, first computes the weights of all time steps, and then outputs the weighted sum over all time steps as the feature vector, using the following formulas:

$$U_t = V \tanh(w_1 h + b_w)$$

$$a_t = \mathrm{softmax}(U_t)$$

$$c^{t} = \sum a_t h$$

where $h$ is the semantic feature vector of each url link, $w_1$ is a parameter vector, and $b_w$ is a bias term; $U_t$ is the hidden-layer representation of the neural network; $a_t$ is the weight matrix obtained by normalizing $U_t$ with the softmax function; the weight matrix $a_t$ is then used to form a weighted sum with the semantic feature vector $h$, giving a text vector $c^{t}$ that contains the important information in the url link; finally, the text vector $c^{t}$ is passed to the output layer;

step 4.5, the output layer processes the text vector $c^{t}$ with a softmax function, according to the formula:

$$y = \mathrm{softmax}(w_j c^{t} + b_j)$$

wherein $y$ is the output of the model, 0 representing a normal url link and 1 representing a malicious url link; $w_j$ is the weight coefficient matrix, to be trained, from the attention layer to the output layer; and $b_j$ is the corresponding bias term to be trained.
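Steps 4.3 to 4.5 can be sketched together in plain Python. Several simplifications are assumed here: `step_fwd` and `step_bwd` are hypothetical callables standing in for the two LSTM layers and carry only a hidden state, `w1` is treated as a matrix and `v` as a vector so that each time step receives one scalar score, and the weighted sum runs over the per-time-step hidden vectors.

```python
import math

def softmax(zs):
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(m, v, b):
    """Return m @ v + b for list-based matrices and vectors."""
    return [sum(mij * vj for mij, vj in zip(row, v)) + bi for row, bi in zip(m, b)]

def bidirectional_sum(inputs, step_fwd, step_bwd, h0):
    """Step 4.3: run both directions and add the outputs at each time step."""
    fwd, h = [], h0
    for x in inputs:
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(inputs):
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()  # realign so that bwd[t] belongs to inputs[t]
    return [[a + b for a, b in zip(hf, hb)] for hf, hb in zip(fwd, bwd)]

def attention_pool(hs, w1, bw, v):
    """Step 4.4: score each time step, normalise, and take the weighted sum."""
    scores = [sum(vi * math.tanh(z) for vi, z in zip(v, matvec(w1, h, bw))) for h in hs]
    weights = softmax(scores)
    context = [sum(a * h[j] for a, h in zip(weights, hs)) for j in range(len(hs[0]))]
    return weights, context

def output_layer(c, wj, bj):
    """Step 4.5: softmax over two classes; index 1 means malicious."""
    probs = softmax(matvec(wj, c, bj))
    return probs, probs.index(max(probs))
```

The attention weights always sum to 1, so the context vector is a convex combination of the hidden vectors of the url link.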

Because malicious webpage identification is a binary classification problem, the loss function adopted by the output layer is the binary cross-entropy loss function, which serves as the indicator of whether the model has converged. When the loss becomes stable, the model has converged and model training is complete. The formula is as follows:

$$\mathrm{Loss}(y_t, y_p) = -\bigl(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\bigr)$$

where $y$ is the label of a sample $x$ in the training set and takes values in {0, 1} for this classification problem, $y_t$ is the true label of a sample, and $y_p$ is the predicted probability that $y_t = 1$ for that sample. The loss curve is then plotted with the Python matplotlib package, and the loss is judged to be stable when the curve levels off.
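The loss formula can be checked numerically with a short sketch. The clipping constant `eps` and the numeric stability test `loss_is_stable` (with its `window` and `tol` parameters) are assumptions added for illustration; the embodiment itself judges stability by inspecting a plotted loss curve.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for one sample; y_pred is the predicted probability of label 1."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # avoid log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))

def mean_loss(labels, probs):
    """Average loss over a batch of samples."""
    return sum(binary_cross_entropy(t, p) for t, p in zip(labels, probs)) / len(labels)

def loss_is_stable(history, window=5, tol=1e-3):
    """Hypothetical check: the last `window` losses vary by less than `tol`."""
    recent = history[-window:]
    return len(recent) == window and max(recent) - min(recent) < tol
```

A confident correct prediction (true label 1, probability 0.9) gives a smaller loss than a confident wrong one (true label 1, probability 0.1), which is the behaviour the training procedure relies on.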

Step 5, verifying the BiLSTM-Attention neural network model trained in step 4 by using the test set, its character-level embedding and static word embedding. Specifically:

step 5.1, inputting the test set, its character-level embedding and the static word embedding into the trained BiLSTM-Attention neural network model to obtain a classification result for each url link, wherein 0 represents a normal url link and 1 represents a malicious url link;

and step 5.2, comparing the classification result of each url link with its annotated label (namely, the label 0 or 1 assigned to each url link in the data set); if the classification result is consistent with the annotated label, a counter pred is increased by 1; finally, acc = pred / (number of urls in the test set) is calculated, wherein acc is the accuracy of the trained BiLSTM-Attention neural network model in identifying malicious webpages; when the accuracy meets the requirement, the verification is passed.
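The accuracy computation of step 5.2 amounts to the following sketch; the function and argument names are chosen for illustration only.

```python
def accuracy(predictions, labels):
    """Fraction of url links whose predicted class equals the annotated label."""
    pred = sum(1 for p, y in zip(predictions, labels) if p == y)  # matching results
    return pred / len(labels)
```

For instance, three matching results out of four test urls give acc = 0.75.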

Step 6, after the verification in step 5 is passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by the user. As shown in fig. 3, specifically: the webpage data accessed by the user is processed as in step 1 and step 2 and then input into the input layer of the BiLSTM-Attention neural network model; after the embedding layer has replaced the words by combining character-level embedding and static word embedding, the data passes through the LSTM layer, the attention layer and the output layer in sequence, and the classification result is output; if the classification result is a normal url link, access is allowed, and if it is a malicious url link, access is denied.
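The allow or deny decision of step 6 can be sketched as a thin wrapper around the classifier. Here `classify` is a hypothetical callable that stands in for the whole trained pipeline (preprocessing, embedding and the BiLSTM-Attention model) and returns 0 or 1.

```python
def handle_request(url, classify):
    """Allow access for class 0 (normal), deny it for class 1 (malicious)."""
    return "deny" if classify(url) == 1 else "allow"
```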

The above description is intended to be illustrative of the preferred embodiment of the present invention and should not be taken as limiting the invention, but rather, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims

1. A malicious webpage identification method is characterized by comprising the following steps:

step 3, constructing a BiLSTM-Attention neural network model;

step 6, after the verification in step 5 is passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by the user;

the method of step 3 comprises the following steps:

step 3.1, constructing an input layer, wherein the input layer receives the malicious webpage data set after the data preprocessing of step 1;

(2) Updating the two parts of the input gate output:

$$i_t = \sigma(w_i h_{t-1} + U_i x_t + b_i)$$

$$a_t = \tanh(w_a h_{t-1} + U_a x_t + b_a)$$

(3) Updating the cell state, where $\odot$ denotes element-wise multiplication:

$$C_t = C_{t-1} \odot f_t + i_t \odot a_t$$

(4) Updating the two parts of the output gate output:

$$o_t = \sigma(w_o h_{t-1} + U_o x_t + b_o)$$

$$h_t = o_t \odot \tanh(C_t)$$

(5) The prediction output at the current sequence index:

$$y_t = \sigma(V h_t + c)$$

wherein $w_f, U_f, b_f, w_i, U_i, w_a, U_a, w_o, U_o$ are parameters of the BiLSTM-Attention neural network model to be obtained by training, and $\sigma$ is the sigmoid function;

and 3.5, constructing an output layer, wherein the output layer is a fully-connected layer, the output of the attention layer is used as the input of the output layer, and the output of the attention layer is processed by using a softmax classifier to obtain a classification result.

2. The malicious webpage identification method according to claim 1, wherein the method of step 1 is:

step 1.1, removing samples of the malicious webpage data set whose url link or label is missing, performing word segmentation on the url links in the malicious webpage data set by using the Python wordninja module, and retaining all symbols in the url links;

step 1.2, stemming and lemmatizing the url links in the malicious webpage data set by using the PorterStemmer module and the WordNetLemmatizer module of the Python NLTK package;

step 1.3, converting all letters of the url links in the malicious webpage data set to a single case, lowercase or uppercase, using the Python lower() method for lowercase, to complete the normalization;

step 1.4, dividing the malicious webpage data set processed in steps 1.1 to 1.3 into a training set and a test set in a ratio of 7:3 or 8:2.

3. The malicious webpage identification method according to claim 1, wherein the method of step 2 is:

step 2.1, constructing a character table:

the 69 characters of the character table, namely the 26 lowercase letters abcdefghijklmnopqrstuvwxyz, the 10 digits 0123456789, and punctuation and symbol characters such as - ; | ! : ' " / \ _ # $ % & + = [ ] { }, are encoded using one-hot, and an all-zero vector is then used for handling characters that are not in the character table, so that a character table comprising 70 characters is formed, each character being represented as a one-hot vector;

and 2.2, representing the training set or the test set by using one-hot vectors of the character table, and inputting a Char-CNN model for training to obtain corresponding character level embedding.

4. The malicious webpage identification method according to claim 1 or 3, wherein the Char-CNN model is a neural network model of 6 convolutional layers.

5. The malicious webpage identification method according to claim 4, wherein the method of step 4 is:

step 4.2, comparing the constructed text dictionary with the static word embedding entry by entry: if the static word embedding contains a vector for a word in the text dictionary, the word is replaced by that static word vector; if it does not, the word is replaced by character-level embedding, thereby obtaining the vector representation of each url link in the training set;

step 4.3, inputting the vector representation of each url link in the training set into the forward propagation layer and the backward propagation layer of the LSTM layer; the two layers together extract the language information carried by the input url link vectors; the outputs of the forward propagation layer and the backward propagation layer at the same time step are added to obtain the semantic feature vector of each url link, which is then passed to the attention layer;

step 4.4, the attention layer receives the semantic feature vector of each url link and performs the calculation using the following formulas:

$$U_t = V \tanh(w_1 h + b_w)$$

$$a_t = \mathrm{softmax}(U_t)$$

$$c^{t} = \sum a_t h$$

where $h$ is the semantic feature vector of each url link, $w_1$ is a parameter vector, and $b_w$ is a bias term; $U_t$ is the hidden-layer representation of the neural network; $a_t$ is the weight matrix obtained by normalizing $U_t$ with the softmax function; the weight matrix $a_t$ is then used to form a weighted sum with the semantic feature vector $h$, giving a text vector $c^{t}$ that contains the important information in the url link; finally, the text vector $c^{t}$ is passed to the output layer;

$$y = \mathrm{softmax}(w_j c^{t} + b_j)$$

6. The malicious webpage identification method according to claim 5, wherein the loss function adopted by the output layer is a binary cross entropy loss function, and the formula is as follows:

$$\mathrm{Loss}(y_t, y_p) = -\bigl(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\bigr)$$

where $y$ is the label of a sample $x$ in the training set and takes values in {0, 1} for this classification problem, $y_t$ is the true label of a sample, and $y_p$ is the predicted probability that $y_t = 1$ for that sample; the loss curve is then plotted with the Python matplotlib package, and the loss is judged to be stable when the curve levels off.

7. The malicious webpage identification method according to claim 5, wherein the method of step 5 is:

step 5.2, comparing the classification result of each url link with its annotated label; if the classification result is consistent with the annotated label, a counter pred is increased by 1; finally, acc = pred / (number of urls in the test set) is calculated, wherein acc is the accuracy of the trained BiLSTM-Attention neural network model in identifying malicious webpages, and the verification is passed when the accuracy meets the requirement.