CN111198995A

CN111198995A - Malicious webpage identification method

Info

Publication number: CN111198995A
Application number: CN202010012212.7A
Authority: CN
Inventors: 廖永建; 王勇; 王栋; 吴宇; 梁艺宽
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-01-07
Filing date: 2020-01-07
Publication date: 2020-05-26
Anticipated expiration: 2040-01-07
Also published as: CN111198995B

Abstract

The invention discloses a malicious webpage identification method comprising the following steps: step 1, acquiring a malicious webpage data set and obtaining a training set and a test set through data preprocessing; step 2, obtaining character-level embeddings of the training set and the test set using a Char-CNN model; step 3, constructing a BiLSTM-Attention neural network model; step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings; step 5, verifying the model trained in step 4 using the test set, its character-level embeddings, and static word embeddings; and step 6, once verification in step 5 has passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in the webpage data accessed by a user. The method uses an attention-based bidirectional long short-term memory (BiLSTM) recurrent neural network together with a combination of character-level embedding and static word embedding to identify malicious webpages.

Description

Malicious webpage identification method

Technical Field

The invention relates to the technical field of internet security, in particular to a malicious webpage identification method.

Background

With the development of the internet industry in recent years, networks have become an indispensable part of people's lives. At the same time, however, malicious criminal activity on the internet is also increasing. Using malicious webpages to carry out phishing attacks, spread spam advertisements, and lure users into downloading malware are the main forms of internet crime. According to the "Global Chinese Phishing Website Status Statistical Analysis Report (2016)" and recent reports of the Anti-Phishing Alliance of China, China is the country suffering the greatest proportion of malicious webpages, and the number of malicious webpages is increasing rapidly year by year.

The traditional way to identify malicious webpages is blacklist-based identification, which is also the method most widely used in industry today. The blacklist technique maintains a list of malicious domain names: if an accessed domain name is in the list, it is treated as malicious; otherwise the browser treats it as a normal domain name. The advantages of this method are that it is simple to implement and accurately identifies confirmed malicious webpages. Its disadvantages are that it cannot identify malicious domain names that have not been seen before, and that technicians must maintain the list of malicious domain names continuously.
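The blacklist lookup described above can be illustrated with a minimal sketch. The domain names and the function name `is_blacklisted` are hypothetical and are not part of the invention; the sketch only shows why a domain that is not yet in the list passes unnoticed.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist entries, for illustration only.
MALICIOUS_DOMAINS = {"bad.example", "phish.example"}


def is_blacklisted(url: str) -> bool:
    """Return True if the host of `url` appears in the blacklist."""
    # urlsplit only fills in the host when "//" is present, so add a
    # default scheme for bare links such as "bad.example/login".
    if "//" not in url:
        url = "http://" + url
    host = (urlsplit(url).hostname or "").lower()
    return host in MALICIOUS_DOMAINS
```

A newly registered malicious domain is treated as normal until someone adds it to `MALICIOUS_DOMAINS`, which is the maintenance burden noted above.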

With the development of machine learning in recent years, more and more people are applying it to malicious webpage detection. Typically, features such as the url length, whether the url is an https link, and the domain name length are extracted manually from the url link; honeypot technology is used to inspect webpage content, for example to detect whether a malicious script is present or whether pictures on the website are illegal; and the samples are then classified with machine learning algorithms such as SVM and random forest. However, this approach depends heavily on network security experts: someone very familiar with malicious webpages must perform manual feature extraction on the data set, and the quality of the final classification result is strongly influenced by the manually extracted features.
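As a rough illustration of the manual feature extraction described above, the following sketch computes the three url features named in the text. The function name `handcrafted_features` and the dictionary keys are assumptions made for this example; a real feature set would be designed by a security expert.

```python
from urllib.parse import urlsplit


def handcrafted_features(url: str) -> dict:
    """Extract the three example features named in the text."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),                # total length of the link
        "is_https": parts.scheme == "https",   # whether it is an https link
        "domain_length": len(host),            # length of the domain name
    }
```

Such a feature dictionary would then be fed to a classifier such as an SVM or a random forest, which is the step the invention seeks to avoid.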

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the above problems, a malicious webpage identification method is provided that classifies url links directly using character-level embedding and a bidirectional long short-term memory recurrent neural network (BiLSTM), thereby identifying malicious webpages.

The technical scheme adopted by the invention is as follows:

a malicious webpage identification method comprises the following steps:

step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing;

step 2, acquiring character-level embedding of a training set and a test set by using a Char-CNN model;

step 3, constructing a BiLSTM-Attention neural network model;

step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings;

step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings;

and step 6, once verification in step 5 has passed, using the trained BiLSTM-Attention neural network model to identify malicious webpages in the webpage data accessed by the user.

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

The invention uses an attention-based bidirectional long short-term memory recurrent neural network together with a combination of character-level embedding and static word embedding to identify malicious webpages. Compared with traditional malicious webpage identification methods, the method of the invention has the following advantages:

1. no personnel are required to maintain a domain name blacklist;

2. no special network security personnel are required to design features;

3. the identification rate of newly appeared malicious web pages is high;

4. the method is suitable for identifying the malicious web pages appearing at the mobile terminal.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.

FIG. 1 is a flowchart of the malicious webpage identification method according to the present invention.

FIG. 2 is a schematic structural diagram of the BiLSTM-Attention neural network model constructed by the present invention.

FIG. 3 is a schematic diagram of using the trained BiLSTM-Attention neural network model to identify malicious webpages in webpage data accessed by a user.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

As shown in FIG. 1, a malicious webpage identification method includes the following steps:

Step 1, acquiring a malicious webpage data set, and obtaining a training set and a test set of malicious webpages through data preprocessing. Specifically:

step 1.1, removing samples whose url link or label is missing from the malicious webpage data set, and then performing word segmentation. English text is normally segmented on spaces, but a url link is a special kind of English text that contains no spaces; this embodiment therefore uses the Python wordninja module to segment the url links in the malicious webpage data set, retaining all symbols in the url links;

step 1.2, because url links contain many abbreviated words, preprocessing operations such as stemming and lemmatization are also needed. This embodiment uses the PorterStemmer module and the WordNetLemmatizer module in the Python NLTK package to stem and lemmatize the url links in the malicious webpage data set;

step 1.3, to avoid mixed upper-case and lower-case letters, this embodiment uses the Python lower() method to convert all letters of the url links in the malicious webpage data set to lower case or upper case (preferably lower case), which completes the normalization operation;

and step 1.4, dividing the malicious webpage data set processed in steps 1.1 to 1.3 into a training set and a test set in a ratio of 7:3 or 8:2 (preferably 8:2).
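Steps 1.1 to 1.4 can be sketched as follows, under stated assumptions. To keep the example dependency-free, a regular expression stands in for the wordninja segmentation (it keeps every symbol but, unlike wordninja, does not split concatenated words such as "darkmoon"), and the NLTK stemming and lemmatization of step 1.2 is omitted. The names `tokenize` and `preprocess`, the `(url, label)` sample layout, and the fixed shuffle seed are all hypothetical.

```python
import random
import re


def tokenize(url: str) -> list:
    """Lowercase a url and split it into letter runs, digit runs and single symbols."""
    # Step 1.3 (lower case) is applied before the regex-based split.
    return re.findall(r"[a-z]+|[0-9]+|[^a-z0-9\s]", url.lower())


def preprocess(samples, train_ratio=0.8, seed=0):
    """Drop incomplete samples, tokenize, and split into training and test sets."""
    cleaned = [
        (tokenize(url), label)
        for url, label in samples
        if url and label is not None  # step 1.1: drop missing url or label
    ]
    random.Random(seed).shuffle(cleaned)       # shuffle before splitting
    cut = int(len(cleaned) * train_ratio)      # step 1.4: 8:2 split by default
    return cleaned[:cut], cleaned[cut:]
```

Passing `train_ratio=0.7` gives the alternative 7:3 split mentioned in step 1.4.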

Step 2, obtaining character-level embeddings of the training set and the test set using a Char-CNN model. Specifically:

step 2.1, constructing a character table:

abcdefghijklmnopqrstuvwxyz0123456789 together with punctuation and symbol characters (among them - ; | ! ? : ' / \ _ @ # $ % & + < > ( ) [ ] { }). These 69 characters are encoded using one-hot, and an all-zero vector is added to handle characters not in the character table, forming a character table including 70 characters. The character table is represented as one-hot vectors, e.g., a = [1, 0, 0, …, 0], having a dimension of 50.

Step 2.2, representing the training set or the test set with the one-hot vectors of the character table, and then inputting them into the Char-CNN model for training to obtain the corresponding character-level embeddings, for example: a = [0.2324124, 0.2124244, 0.5252411, …].

Wherein the Char-CNN model is a neural network model with 6 convolutional layers.
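The one-hot representation of step 2.1 can be sketched as below. The `ALPHABET` string is an assumed, shortened character table used only for illustration and is not the exact 69-character table of the invention; characters outside it map to the all-zero vector, as the text describes. The Char-CNN itself is not shown.

```python
# Assumed alphabet for illustration; not the invention's exact character table.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-;!?:'/\\|_@#$%&+<>()[]{}"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch: str) -> list:
    """Return the one-hot vector of `ch`, or all zeros if it is not in the table."""
    vec = [0] * len(ALPHABET)
    idx = CHAR_INDEX.get(ch)
    if idx is not None:
        vec[idx] = 1
    return vec


def encode(url: str) -> list:
    """Encode a url as a list of one-hot vectors, one per character."""
    return [one_hot(ch) for ch in url.lower()]
```

The resulting list of vectors is the kind of input a character-level convolutional model would consume to produce dense character embeddings.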

Step 3, constructing a BiLSTM-Attention neural network model. The BiLSTM model is a bidirectional LSTM (bidirectional long short-term memory recurrent neural network). It processes the input-layer data in both the forward and the backward direction, so BiLSTM captures context information in a sentence better than a unidirectional LSTM. An attention layer is then added on top of the BiLSTM model to form the BiLSTM-Attention neural network model.

As shown in fig. 2, specifically:

step 3.1, constructing an input layer, which receives the malicious webpage data set preprocessed in step 1, for example [www.dark moon.com];

step 3.2, constructing an embedding layer, which replaces the words in the malicious webpage data set with their character-level embeddings and static word embeddings to obtain an embedded representation of each url link in the malicious webpage data set;

step 3.3, constructing an LSTM layer consisting of two layers, a forward propagation layer and a backward propagation layer; each of them includes a forget gate, an input gate, an output gate, and a cell state, wherein,

(1) updating the forget gate output:

$$f_t = \sigma(w_f h_{t-1} + U_f x_t + b_f)$$

where $h_{t-1}$ denotes the history information, $x_t$ denotes the new information flowing into the cell, and $b_f$ is a bias term;

(2) updating the two parts of the input gate output:

$$i_t = \sigma(w_i h_{t-1} + U_i x_t + b_i)$$

$$a_t = \tanh(w_a h_{t-1} + U_a x_t + b_a)$$

(3) updating the cell state:

$$C_t = C_{t-1} f_t + i_t a_t$$

(4) updating the two parts of the output gate output:

$$o_t = \sigma(w_o h_{t-1} + U_o x_t + b_o)$$

$$h_t = o_t \tanh(C_t)$$

(5) the prediction output at the current sequence index:

$$y_t = \sigma(V h_t + c)$$

wherein $w_f$, $U_f$, $b_f$, $w_i$, $U_i$, $w_a$, $U_a$, $w_o$, $U_o$ are parameters of the BiLSTM-Attention neural network model that need to be trained, and $\sigma$ is the sigmoid function;

step 3.4, constructing an attention layer, which computes a weight for every time step and outputs the weighted combination of all time steps as a feature vector;

step 3.5, constructing an output layer, which is a fully connected layer; the output of the attention layer is used as the input of the output layer, and a softmax classifier processes the output of the attention layer to obtain the classification result.
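The gate updates of step 3.3 can be traced with a small, dependency-free sketch of a single LSTM time step. It is an illustration only: vectors are plain Python lists, the products between gates are taken element by element, and the `params` layout (a dictionary mapping the gate names "f", "i", "a", "o" to a `(w, U, b)` triple) is an assumption of this example, not the invention's implementation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(w, u, b, h_prev, x):
    """Compute w * h_prev + U * x + b, one output row at a time."""
    return [
        sum(wr[j] * h_prev[j] for j in range(len(h_prev)))
        + sum(ur[k] * x[k] for k in range(len(x)))
        + br
        for wr, ur, br in zip(w, u, b)
    ]


def lstm_step(params, h_prev, c_prev, x):
    """One LSTM time step; returns the new hidden state h_t and cell state C_t."""
    f = [sigmoid(z) for z in affine(*params["f"], h_prev, x)]    # forget gate f_t
    i = [sigmoid(z) for z in affine(*params["i"], h_prev, x)]    # input gate i_t
    a = [math.tanh(z) for z in affine(*params["a"], h_prev, x)]  # candidate a_t
    # Cell state: C_t = C_{t-1} * f_t + i_t * a_t (element-wise)
    c = [cp * ft + it * at for cp, ft, it, at in zip(c_prev, f, i, a)]
    o = [sigmoid(z) for z in affine(*params["o"], h_prev, x)]    # output gate o_t
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]             # h_t = o_t * tanh(C_t)
    return h, c
```

A bidirectional layer would run one such cell over the sequence from left to right and a second one from right to left.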

Step 4, training the BiLSTM-Attention neural network model constructed in step 3 using the training set, its character-level embeddings, and static word embeddings; the static word embeddings may be the GloVe static word vectors trained by Stanford University, with a dimension of 50.

Specifically, the method comprises the following steps:

step 4.1, building a text dictionary, represented by vectors, from all words in the url link texts of the training set;

step 4.2, comparing the constructed text dictionary with the static word embeddings entry by entry: if the static word embeddings contain a word from the text dictionary, the word is replaced by its static word vector; if they do not, the word is replaced by its character-level embedding. This yields the vector representation of each url link in the training set. That is, each token Wi (a word or a character) in a training-set url link S = (W1, W2, …, Wn) is mapped to a vector wi, where S denotes a url link, Wi denotes a token in the url link, and wi is its embedding vector with a dimension of 50. The entire training set is replaced in this way. The vector representation of each url link is thus a two-dimensional matrix in which each column is a word vector or a character vector.

step 4.3, inputting the vector representation of each url link in the training set into the forward propagation layer and the backward propagation layer of the LSTM layer; the forward and backward propagation layers jointly extract the language information carried by the input url link vectors; the outputs of the forward and backward propagation layers at the same time step are added to obtain the semantic feature vector of each url link, which is then passed to the attention layer;

step 4.4, the attention layer receives the semantic feature vector of each url link, computes a weight for every time step, and outputs the weighted sum over all time steps as a feature vector, using the following formulas:

$$U_t = V \tanh(w_1 h + b_w)$$

$$a_t = \mathrm{softmax}(U_t)$$

$$c^t = \sum a_t h$$

where $h$ is the semantic feature vector of each url link, $w_1$ is a parameter vector, and $b_w$ is a bias term; $U_t$ is the hidden-layer representation of the neural network; $a_t$ is the weight matrix obtained by normalizing $U_t$ with the softmax function; the weighted sum of the weight matrix $a_t$ and the semantic feature vector $h$ gives the text vector $c^t$ containing the important information in the url link; finally, the text vector $c^t$ is passed to the output layer;

step 4.5, the output layer processes the text vector $c^t$ with a softmax function, according to the formula:

$$y = \mathrm{softmax}(w_j c^t + b_j)$$

wherein $y$ is the output of the model, 0 represents a normal url link, and 1 represents a malicious url link; $w_j$ is the weight coefficient matrix from the attention layer to the output layer, to be trained; $b_j$ is the corresponding bias term, to be trained.
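The attention pooling of step 4.4 and the softmax output of step 4.5 can be sketched together as follows. This is a minimal illustration under assumptions: the hidden states are given as a list of per-time-step vectors `hs`, `w1` is treated as a small matrix (a list of rows) with bias list `bw`, `v` is a vector, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hs, w1, bw, v):
    """Return (weights a, context vector c) for a list of hidden vectors hs."""
    scores = []
    for h in hs:
        # tanh(w1 * h + bw), computed row by row
        hidden = [
            math.tanh(sum(wij * hj for wij, hj in zip(row, h)) + b)
            for row, b in zip(w1, bw)
        ]
        scores.append(sum(vk * uk for vk, uk in zip(v, hidden)))  # score U_t
    a = softmax(scores)                                           # weights a_t
    dim = len(hs[0])
    # Context vector: weighted sum of the hidden vectors over time
    c = [sum(a[t] * hs[t][d] for t in range(len(hs))) for d in range(dim)]
    return a, c


def classify(c, wj, bj):
    """Output layer: class probabilities softmax(wj * c + bj)."""
    logits = [sum(w * x for w, x in zip(row, c)) + b for row, b in zip(wj, bj)]
    return softmax(logits)
```

With two output rows in `wj`, index 0 of the returned list is the probability of a normal url link and index 1 that of a malicious one.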

Because malicious webpage identification is a binary classification problem, the loss function used by the output layer is the binary cross-entropy loss, which serves as the indicator of whether the model has converged. When the loss stabilizes, the model has converged and training is complete. The formula is:

$$\mathrm{Loss}(y_t \mid y_p) = -\big(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\big)$$

wherein y is the label corresponding to sample x in the training set, taking values in {0, 1} for the binary problem; $y_t$ is the true label of a sample and $y_p$ is the predicted probability that the sample's label is 1. A loss curve is then drawn with the Python matplotlib package, and whether the loss has stabilized is judged by whether the curve has leveled off.
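The binary cross-entropy above can be written out as a short sketch. The clamping constant `eps` and the `has_plateaued` helper are assumptions added for this example: the text judges convergence by inspecting a plotted loss curve, and the helper is only a simple numeric stand-in for that visual check.

```python
import math


def binary_cross_entropy(yt: int, yp: float, eps: float = 1e-12) -> float:
    """Loss for one sample with true label yt and predicted probability yp."""
    yp = min(max(yp, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(yt * math.log(yp) + (1 - yt) * math.log(1.0 - yp))


def mean_loss(labels, probs) -> float:
    """Average loss over a batch of samples."""
    return sum(binary_cross_entropy(t, p) for t, p in zip(labels, probs)) / len(labels)


def has_plateaued(history, window=3, tol=1e-3) -> bool:
    """Heuristic: the last window + 1 loss values differ by less than tol."""
    if len(history) < window + 1:
        return False
    recent = history[-(window + 1):]
    return max(recent) - min(recent) < tol
```

Recording `mean_loss` once per epoch gives the sequence that would be plotted as the loss curve.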

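Step 5, verifying the BiLSTM-Attention neural network model trained in step 4 using the test set, its character-level embeddings, and static word embeddings. Specifically: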

step 5.1, inputting the test set, together with its character-level embeddings and the static word embeddings, into the trained BiLSTM-Attention neural network model to obtain a classification result for each url link, wherein 0 represents a normal url link and 1 represents a malicious url link;

and step 5.2, comparing the classification result of each url link with its annotated label (that is, the 0 or 1 label assigned to each url link in the data set); if they agree, the counter pred is incremented by 1. Finally, acc = pred / (number of url links in the test set) is computed, where acc is the accuracy of the trained BiLSTM-Attention neural network model in identifying malicious webpages; verification passes when the accuracy meets the requirement.
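The accuracy computation of step 5.2 amounts to the following small sketch; the function name `accuracy` and the error handling are assumptions of this example.

```python
def accuracy(predictions, labels) -> float:
    """Fraction of url links whose predicted class equals the annotated label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("the test set must not be empty")
    # pred counts the url links classified consistently with their label
    pred = sum(1 for p, y in zip(predictions, labels) if p == y)
    return pred / len(labels)
```

For example, three correct predictions out of four test links give an accuracy of 0.75.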

Step 6, once verification in step 5 has passed, the trained BiLSTM-Attention neural network model is used to identify malicious webpages in the webpage data accessed by a user. As shown in FIG. 3, specifically: the webpage data accessed by the user is processed as in step 1 and step 2 and fed to the input layer of the BiLSTM-Attention neural network model; after the embedding layer performs the replacement using character-level embedding combined with static word embedding, the data passes through the LSTM layer, the attention layer, and the output layer in sequence, and the classification result is output. If the result is a normal url link, access is allowed; if it is a malicious url link, access is denied.
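The embedding replacement and the allow/deny decision of step 6 can be sketched as below. Several things here are assumptions made for illustration: the lookup tables `word_vectors` and `char_vectors` are plain dictionaries, a word missing from the static embeddings is replaced by one character-level vector per character, unknown characters get a zero vector, and `classify` is a hypothetical callable standing in for the trained model (returning 0 or 1).

```python
def embed_tokens(tokens, word_vectors, char_vectors, dim=50):
    """Replace each token by its static word vector, falling back to character vectors."""
    columns = []
    zero = [0.0] * dim
    for tok in tokens:
        if tok in word_vectors:
            columns.append(word_vectors[tok])
        else:
            # Fallback: one character-level vector per character of the token.
            columns.extend(char_vectors.get(ch, zero) for ch in tok)
    return columns


def gate(url_tokens, classify, word_vectors, char_vectors, dim=50):
    """Return "allow" for class 0 (normal) and "deny" for class 1 (malicious)."""
    label = classify(embed_tokens(url_tokens, word_vectors, char_vectors, dim))
    return "deny" if label == 1 else "allow"
```

In a deployment the tokens would come from the same preprocessing as in step 1, so that the model sees inputs in the form it was trained on.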

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A malicious webpage identification method is characterized by comprising the following steps:

step 3, constructing a BiLSTM-Attention neural network model;

2. The malicious webpage identification method according to claim 1, wherein the method of step 1 is:

step 1.1, removing samples whose url link or label is missing from the malicious webpage data set, performing word segmentation on the url links in the malicious webpage data set using the Python wordninja module, and retaining all symbols in the url links;

step 1.2, performing stemming and lemmatization on the url links in the malicious webpage data set using the PorterStemmer module and the WordNetLemmatizer module in the Python NLTK package;

step 1.3, using the Python lower() method to convert all letters of the url links in the malicious webpage data set to lower case or upper case, completing the normalization operation;

and step 1.4, dividing the malicious webpage data set processed in steps 1.1 to 1.3 into a training set and a test set in a ratio of 7:3 or 8:2.

3. The malicious webpage identification method according to claim 1, wherein the method of step 2 is:

step 2.1, constructing a character table:

abcdefghijklmnopqrstuvwxyz0123456789 together with punctuation and symbol characters (among them - ; | ! ? : ' / \ _ @ # $ % & + < > ( ) [ ] { }); these 69 characters are encoded using one-hot, an all-zero vector is then used to handle characters that are not in the character table, forming a character table including 70 characters, and the character table is represented as one-hot vectors;

and step 2.2, representing the training set or the test set with the one-hot vectors of the character table, and inputting them into the Char-CNN model for training to obtain the corresponding character-level embeddings.

4. The malicious webpage identification method according to claim 1 or 3, wherein the Char-CNN model is a neural network model of 6 convolutional layers.

5. The malicious webpage identification method according to claim 1, wherein the method of step 3 is:

step 3.1, constructing an input layer, wherein the input layer is used for inputting the malicious webpage data set preprocessed by the data in the step 1;

(2) updating the two parts of the input gate output:

$$i_t = \sigma(w_i h_{t-1} + U_i x_t + b_i)$$

$$a_t = \tanh(w_a h_{t-1} + U_a x_t + b_a)$$

(3) updating the cell state:

$$C_t = C_{t-1} f_t + i_t a_t$$

(4) updating the two parts of the output gate output:

$$o_t = \sigma(w_o h_{t-1} + U_o x_t + b_o)$$

$$h_t = o_t \tanh(C_t)$$

(5) the prediction output at the current sequence index:

$$y_t = \sigma(V h_t + c)$$

and step 3.5, constructing an output layer, wherein the output layer is a fully connected layer, the output of the attention layer is used as the input of the output layer, and the output of the attention layer is processed by a softmax classifier to obtain a classification result.

6. The malicious webpage identification method according to claim 5, wherein the method of step 4 is:

step 4.2, comparing the constructed text dictionary with the static word embeddings entry by entry: if the static word embeddings contain a word from the text dictionary, the word is replaced by its static word vector, and if they do not, the word is replaced by its character-level embedding, thereby obtaining the vector representation of each url link in the training set;

step 4.4, the attention layer receives the semantic feature vector of each url link, and the following formulas are used for calculation:

$$U_t = V \tanh(w_1 h + b_w)$$

$$a_t = \mathrm{softmax}(U_t)$$

$$c^t = \sum a_t h$$

$$y = \mathrm{softmax}(w_j c^t + b_j)$$

wherein $y$ is the output of the model, 0 represents a normal url link, and 1 represents a malicious url link; $w_j$ is the weight coefficient matrix from the attention layer to the output layer, to be trained; $b_j$ is the corresponding bias term, to be trained.

7. The malicious webpage identification method according to claim 6, wherein the loss function adopted by the output layer is a binary cross entropy loss function, and the formula is as follows:

$$\mathrm{Loss}(y_t \mid y_p) = -\big(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\big)$$

wherein y is the label corresponding to sample x in the training set, taking values in {0, 1} for the binary problem; $y_t$ is the true label of a sample and $y_p$ is the predicted probability that the sample's label is 1; a loss curve is then drawn with the Python matplotlib package, and whether the loss has stabilized is judged by whether the curve has leveled off.

8. The malicious webpage identification method according to claim 6, wherein the method of step 5 is:

and step 5.2, comparing the classification result of each url link with its annotated label; if they agree, pred is incremented by 1; finally, acc = pred / (number of url links in the test set) is computed, wherein acc is the accuracy of the trained BiLSTM-Attention neural network model in identifying malicious webpages, and verification passes when the accuracy meets the requirement.