CN109308494B - LSTM model and network attack identification method and system based on LSTM model


Info

Publication number
CN109308494B
Authority
CN
China
Prior art keywords
data
network request
request data
network
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811127061.9A
Other languages
Chinese (zh)
Other versions
CN109308494A (en)
Inventor
姚鸿富
陈奋
陈荣有
程长高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Fuyun Information Technology Co ltd
Original Assignee
Xiamen Fuyun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Fuyun Information Technology Co ltd filed Critical Xiamen Fuyun Information Technology Co ltd
Priority to CN201811127061.9A priority Critical patent/CN109308494B/en
Publication of CN109308494A publication Critical patent/CN109308494A/en
Application granted granted Critical
Publication of CN109308494B publication Critical patent/CN109308494B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441Countermeasures against malicious traffic
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/30Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H04L63/306Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information intercepting packet switched data communications, e.g. Web, Internet or IMS communications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Technology Law (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to an LSTM recurrent neural network model and a network attack identification method based on it. The model is established through the following steps: S100: collecting a plurality of network request data in text format as a training data set, and setting a label category for each network request according to its content; S200: preprocessing the network request data in the training data set and converting them into digital sequence type data of a preset length; S300: training on the digital sequence type data in the training data set to construct the LSTM recurrent neural network model. By converting network request data into digital sequence data and training an LSTM recurrent neural network model on the resulting training data set, the category of network request data can be predicted.

Description

LSTM model and network attack identification method and system based on LSTM model
Technical Field
The invention relates to the technical field of network security, in particular to an LSTM recurrent neural network model and a network attack identification method based on the model.
Background
With the continuing advance of informatization strategies and the rapid development of the Internet, cloud computing and related technologies, more and more enterprise businesses have been digitized and moved online. However, because of the openness and uncontrollability of network applications themselves and the limitations of their developers, such applications are very likely to contain network vulnerabilities that can be exploited. Hackers can use these vulnerabilities to launch network attacks such as SQL injection and XSS attacks, exposing websites to risks such as service paralysis, information leakage, webpage tampering and trojan implantation, and causing huge losses to the website operators and their users.
At present, the mainstream solution on the market for protecting network applications is the website application-level intrusion prevention system based on regular-expression rules. Rule-based protection often produces false alarms or missed alarms when faced with flexible, constantly changing network attacks, and the rules must be written and maintained by dedicated security professionals. Even so, the rules can hardly cover every kind of attack, are difficult to apply effectively against unknown attacks and 0-day attacks, and may even conflict with one another. In addition, rule writing must balance false positives against missed detections: rules that are too strict easily block normal traffic and cause false positives, while rules that are too loose are easily bypassed and cause missed detections.
Disclosure of Invention
In view of these problems, the invention aims to provide an LSTM recurrent neural network model and a network attack recognition method based on the model.
The specific scheme is as follows:
An LSTM recurrent neural network model for network attack category identification, the establishing process of which comprises the following steps:
S100: collecting a plurality of network request data in text format as a training data set, and setting a label category for each network request according to its content;
S200: preprocessing the network request data in the training data set and converting them into digital sequence type data of a preset length;
the preprocessing comprises decoding, Chinese character replacement, case conversion, dictionary extraction, sequence encoding and sequence length processing;
the decoding is: restoring encoded sequences contained in the network request data to the characters they represent before encoding;
the Chinese character replacement is: uniformly replacing characters that are not letters, digits or symbols with the lowercase letter "z";
the case conversion is: converting uppercase characters into lowercase characters;
the dictionary extraction is: counting, with a single character as the unit, how many times each character appears in the data set, and assigning dictionary numbers in descending order of that count to form a dictionary; the padding character used for sequence completion is set to "+", with dictionary number 0, and characters that do not appear in the training data set are encoded as "z", with dictionary number 1;
the sequence encoding is: replacing each character in the network request data with the dictionary number assigned to it in the dictionary; characters not recorded in the dictionary are replaced with the character "z" and given dictionary number 1, forming digital sequence type data;
the sequence length processing is: processing the length of the digital sequence type data according to the preset length so that all digital sequence type data have the preset length;
S300: training on the digital sequence type data in the training data set to construct the LSTM recurrent neural network model.
Further, the label categories include three categories: category 1, normal network request data; category 2, network request data containing an SQL injection attack; and category 3, network request data containing an XSS attack.
Further, the length processing includes one of a head operation and a tail operation. The head operation is: for digital sequence type data shorter than the preset length, padding the front with dictionary number 0, and for sequences longer than the preset length, truncating the front part that exceeds the preset length. The tail operation is: for digital sequence type data shorter than the preset length, padding the rear with dictionary number 0, and for sequences longer than the preset length, truncating the rear part that exceeds the preset length.
Further, step S300 includes the steps of:
S301: setting an embedding layer;
S302: using a Dropout layer to randomly disconnect a certain proportion of network connections to avoid overfitting;
S303: connecting the LSTM layer and specifying a certain proportion of Dropout;
S304: connecting a fully connected layer and a softmax layer, whose output is the class probability corresponding to each label category;
S305: setting the optimizer to Adam stochastic gradient descent, the loss function to the multi-class cross-entropy function, and the metric to accuracy, and training for several rounds until the model converges.
A network attack recognition method based on the above LSTM recurrent neural network model for network attack category recognition comprises the following steps: preprocessing network request data to obtain digital sequence type data of a preset length; using the LSTM recurrent neural network model to predict the label category of the digital sequence type data, obtaining a class probability for each label category; and taking the label category corresponding to the dimension with the largest probability value as the label category of the network request data.
A network attack recognition system based on the network attack recognition method of the embodiment of the invention comprises a data input unit, a data conversion unit, an LSTM model unit and a decision unit. The data input unit sends received network request data to the data conversion unit; the data conversion unit preprocesses and converts the network request data into digital sequence type data of a preset length and sends them to the LSTM model unit; the LSTM model unit predicts the label category of the digital sequence type data and outputs the class probability of each label category to the decision unit; the decision unit judges whether the probability that the data belongs to an attack category is greater than or equal to a preset probability threshold, and intercepts the request if so, otherwise it does not.
According to this technical scheme, the model is trained on a large amount of labeled data in a supervised manner, and its final accuracy can reach 99.7%. The model is based on character-level encoding of the preprocessed text data, so the feature encoding is compact, the model has few parameters and a small weight file, and it is practical to deploy. Because LSTM is used as the deep learning method, complicated feature engineering is avoided: simple data preprocessing is enough to normalize raw data into the model's input format, so the method is simple to implement.
Drawings
Fig. 1 is a schematic flow chart according to a first embodiment of the present invention.
Fig. 2 is a schematic flow chart of the preprocessing in step S200 according to the embodiment of the invention.
FIG. 3 is a schematic diagram of an LSTM recurrent neural network model according to an embodiment of the present invention.
Fig. 4 is a diagram illustrating the accuracy and the variation of the loss function during the model training process according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a network attack recognition system in the third embodiment of the present invention.
Detailed Description
To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.
The invention will now be further described with reference to the accompanying drawings and detailed description.
Embodiment one:
referring to fig. 1 to 3, the invention provides an LSTM recurrent neural network model for network attack category identification, and the establishing process of the LSTM recurrent neural network model comprises the following steps:
s100: the method comprises the steps of collecting a plurality of text-format network request data as a training data set, and setting label types for each network request data according to the content of the network request data.
Here, a plurality of data means a large or massive amount of data in this field; in general, the more data, the more accurate the final model.
The network request data may be divided into multiple categories according to the type of network attack, the main division being between normal network request data and attack-category network request data. This embodiment mainly distinguishes SQL injection attacks and XSS attacks, so the label categories include three categories: category 1, normal network request data; category 2, network request data containing an SQL injection attack; and category 3, network request data containing an XSS attack. That is, network request data of categories 2 and 3 both belong to the attack categories.
S200: and preprocessing the network request data in the training data set, and converting the network request data into digital sequence type data with preset length.
The preset length is the length of an input sample of the LSTM recurrent neural network model set according to requirements, and a person skilled in the art can set the size of the preset length by himself.
The preprocessing comprises the following steps: decoding, Chinese character replacement, case conversion, dictionary extraction, sequence encoding, sequence length processing and the like.
The decoding is: restoring encoded sequences contained in the network request data to the characters they represent before encoding. The encoding may be a common encoding such as URL encoding or Unicode encoding.
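As an illustration, the decoding step can be implemented with standard library calls; the following is a minimal sketch assuming URL-encoded and Unicode-escaped payloads (the function name decode_request is illustrative, not taken from the patent):

from urllib.parse import unquote_plus


def decode_request(raw):
    """Restore URL-encoded and Unicode-escaped sequences to plain characters."""
    decoded = raw
    # Repeatedly URL-decode until the string stops changing (handles double encoding).
    for _ in range(3):
        step = unquote_plus(decoded)
        if step == decoded:
            break
        decoded = step
    # Resolve Unicode escapes such as \u003c if any are present.
    try:
        decoded = decoded.encode("latin-1", errors="ignore").decode("unicode_escape")
    except UnicodeDecodeError:
        pass
    return decoded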
The Chinese character replacement is: uniformly replacing characters that are not letters, digits or symbols, such as Chinese characters, with the lowercase letter "z".
The case conversion is: converting uppercase characters into lowercase characters.
The dictionary extraction is: counting, with a single character as the unit, how many times each character appears in the data set, and assigning dictionary numbers in descending order of that count to form a dictionary; the padding character used for sequence completion is set to "+", with dictionary number 0, and characters that do not appear in the training data set are encoded as "z", with dictionary number 1. The dictionary may be persisted to a file for long-term use.
The sequence encoding is: replacing each character in the network request data with the dictionary number assigned to it in the dictionary; characters not recorded in the dictionary are replaced with the character "z" and given dictionary number 1, forming digital sequence type data.
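A minimal sketch of the Chinese character replacement, case conversion, dictionary extraction and sequence encoding steps might look as follows; all function names are illustrative, and the exact set of allowed symbols is an assumption (the patent only says letters, digits and symbols):

import string
from collections import Counter

ALLOWED = set(string.ascii_lowercase + string.digits + string.punctuation + " ")


def normalize(text):
    """Lowercase the text and replace any character that is not a letter,
    digit or symbol (e.g. a Chinese character) with the letter 'z'."""
    return "".join(c if c in ALLOWED else "z" for c in text.lower())


def build_dictionary(samples):
    """Assign dictionary numbers by descending character frequency.
    Number 0 is reserved for the padding symbol '+', number 1 for 'z',
    which also stands in for characters unseen in the training set."""
    counts = Counter()
    for s in samples:
        counts.update(normalize(s))
    dictionary = {"+": 0, "z": 1}
    next_id = 2
    for ch, _ in counts.most_common():
        if ch not in dictionary:
            dictionary[ch] = next_id
            next_id += 1
    return dictionary


def encode(text, dictionary):
    """Replace each character with its dictionary number; characters not in
    the dictionary are treated as 'z' (dictionary number 1)."""
    return [dictionary.get(c, 1) for c in normalize(text)]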
The sequence length processing is: processing the length of the digital sequence type data obtained from the network request data according to the required length, so that all digital sequence type data have the preset length.
The length processing may be a head operation or a tail operation.
The head operation is: character sequences shorter than the preset length are padded at the front with the completion symbol "+", i.e. the corresponding digital sequence type data are padded at the front with dictionary number 0; for sequences longer than the preset length, the front part that exceeds the preset length is cut off.
The tail operation is: character sequences shorter than the preset length are padded at the rear with the completion symbol "+", i.e. the corresponding digital sequence type data are padded at the rear with dictionary number 0; for sequences longer than the preset length, the rear part that exceeds the preset length is cut off.
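The head and tail operations can be sketched as follows (Keras offers the equivalent pad_sequences utility with padding and truncating options, but a plain function makes the behaviour explicit; fix_length is an illustrative name):

def fix_length(seq, preset_len, mode="head"):
    """Pad with dictionary number 0 or truncate so that len(seq) == preset_len.
    mode='head' pads and truncates at the front, mode='tail' at the back."""
    if len(seq) < preset_len:
        pad = [0] * (preset_len - len(seq))
        return pad + seq if mode == "head" else seq + pad
    # Sequence too long: cut the excess from the front (head) or the back (tail).
    return seq[-preset_len:] if mode == "head" else seq[:preset_len]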
Because network request data in text format cannot be used directly as input to the LSTM recurrent neural network model, it must be converted into digital sequence type data of a preset length. In this embodiment, the digital sequence type data are obtained through character-level encoding of the text data. After the raw network request data have been preprocessed by decoding, Chinese character replacement, case conversion and so on, the number of effective characters is greatly reduced, so the dictionary file is small and the feature encoding of a network request is simple; this in turn reduces the number of model parameters, keeps the model from becoming too complex, and makes the method more practical.
S300: training on the digital sequence type data in the training data set to construct the LSTM recurrent neural network model, which specifically comprises the following steps:
S301: setting an embedding layer (Embedding). In this embodiment, the embedding layer parameter EMBEDDING_SIZE is preferably set to 128. Let N be the number of input samples; the output shape of the embedding layer is then (N, preset length, 128): each input sample has the preset length, i.e. it contains that many characters, each character is represented as a 128-dimensional vector, so a single input sample has shape (preset length, 128) and N input samples have shape (N, preset length, 128). The value 128 was chosen experimentally by weighing model complexity against accuracy: a larger value makes the model more complex and more expensive to compute, while a smaller value makes the model simpler but not accurate enough. The embedding layer extracts character-level word vectors, representing each single character as a 128-dimensional vector, and these vectors are fed into the LSTM layer.
S302: using a Dropout layer to randomly disconnect a certain proportion of the network connections, so as to avoid overfitting.
Dropout means that, during training of the deep learning network, neural network units are temporarily dropped from the network with a certain probability.
This proportion can be set by a person skilled in the art on the basis of experience and experimental results.
S303: connecting the LSTM layer and specifying a certain proportion of Dropout.
In this embodiment, the output of the LSTM is a 64-dimensional vector, i.e., a single network request is extracted through the LSTM layer into a 64-dimensional text vector.
This proportion can be set by a person skilled in the art on the basis of experience and experimental results.
S304: connecting a fully connected layer (Dense) and a softmax layer; the output is the class probability for each label category.
In this embodiment, a fully connected layer is used as the classifier, and a softmax layer maps the output of the model to class probabilities.
S305: setting the optimizer to Adam stochastic gradient descent, the loss function to multi-class cross entropy (categorical_crossentropy), and the metric to accuracy, then saving the model after convergence.
The model is trained with the Adam gradient descent optimizer, and the training data set is split into a development training set and a development test set at a ratio of 8:2. The Adam optimizer updates the parameters of the whole neural network by optimizing a loss function, defined here as the multi-class cross entropy:
L = -\sum_{i}\sum_{j} t_{ij} \log(p_{ij})
where p is the output of the softmax layer, i.e. the class probabilities for each label, t is the true label of each network request datum, i indexes the data samples and j indexes the label categories.
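For example, for a single request whose true label is the second category (t = (0, 1, 0) in one-hot form) and whose softmax output is p = (0.1, 0.8, 0.1), the contribution to the loss is -log 0.8 ≈ 0.22, while a more confident correct prediction such as p = (0.01, 0.98, 0.01) contributes only -log 0.98 ≈ 0.02.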
In this embodiment, the model converges after about 30 training iterations (epochs), and the trained model is saved locally.
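The architecture and training setup described in steps S301 to S305 could be sketched as follows with a Keras-style API; this is only an illustrative sketch, and the vocabulary size, preset length, Dropout rates, batch size and variable names are assumptions rather than values given in the patent (only EMBEDDING_SIZE = 128, the 64-dimensional LSTM output, the three classes and the Adam/categorical-crossentropy/accuracy settings come from the description):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dropout, LSTM, Dense

VOCAB_SIZE = 100        # number of entries in the character dictionary (assumed)
PRESET_LENGTH = 1000    # preset sequence length (assumed)
EMBEDDING_SIZE = 128    # per step S301
NUM_CLASSES = 3         # normal, SQL injection, XSS

model = Sequential([
    # S301: character-level embedding, output shape (N, PRESET_LENGTH, 128)
    Embedding(input_dim=VOCAB_SIZE, output_dim=EMBEDDING_SIZE,
              input_length=PRESET_LENGTH),
    # S302: randomly drop a proportion of connections to reduce overfitting
    Dropout(0.5),
    # S303: LSTM layer producing a 64-dimensional text vector, with dropout
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    # S304: fully connected layer + softmax mapping to class probabilities
    Dense(NUM_CLASSES, activation="softmax"),
])

# S305: Adam optimizer, multi-class cross entropy, accuracy metric
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# x_train: (N, PRESET_LENGTH) integer sequences; y_train: (N, 3) one-hot labels
# model.fit(x_train, y_train, validation_split=0.2, epochs=30, batch_size=128)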
Testing the accuracy of the model:
the method of steps S100 and S200 is adopted to establish a test data set, the network request data in the test data set is used for testing the LSTM recurrent neural network model in step S300, the label category corresponding to the dimension with the maximum probability value is used as the label category of the network request data according to the category probability output by the LSTM recurrent neural network model, and the label category is compared with the previously identified category to calculate the accuracy.
The experimental results are as follows:
as shown in fig. 4, the accuracy change and the loss function change during the model training process are shown, and it can be seen that the model finally converges and has an accuracy of 99.5% or more.
In the experiment, 130,000 request records were used for training and 14,000 samples were used for testing. The confusion matrix of the test results is shown in Table 1 below; the accuracy reaches 99.8%.
TABLE 1
                    True label 0   True label 1   True label 2
Predicted label 0       6049            11              0
Predicted label 1          7          7170              1
Predicted label 2          2             0            759
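As a consistency check, the diagonal of Table 1 gives 6049 + 7170 + 759 = 13978 correct predictions out of the 13999 samples tabulated, i.e. about 99.85%, in line with the reported 99.8% accuracy.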
This embodiment avoids the trouble of writing regular-expression rules and, compared with other traditional machine learning methods, also avoids the tedious process of feature engineering: the features of the data are extracted automatically inside the model by the embedding layer, which reflects the advantage of deep learning.
Embodiment two:
A network attack recognition method based on the LSTM recurrent neural network model for network attack category recognition of the first embodiment, the method comprising: preprocessing network request data to obtain digital sequence type data of a preset length; using the LSTM recurrent neural network model to predict the label category of the digital sequence type data, obtaining a class probability for each label category; taking the label category corresponding to the dimension with the largest probability value as the label category of the network request data; and choosing whether to intercept the request according to the prediction result.
Specifically, a probability threshold may be set for the interception decision: the request is intercepted when the probability of category 2 or category 3 is greater than or equal to the probability threshold, and is not intercepted otherwise.
In this embodiment, the probability threshold is set to 0.6: interception is performed only if the probability of category 2 or category 3 is greater than or equal to 0.6, and is not performed if it is below 0.6.
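A sketch of the recognition and interception decision of this embodiment, reusing the hypothetical decode_request, encode, fix_length, dictionary, PRESET_LENGTH and model objects from the sketches in embodiment one (the function and variable names are illustrative):

import numpy as np

CLASS_NAMES = ["normal", "sql_injection", "xss"]   # label categories 1, 2 and 3
PROB_THRESHOLD = 0.6                                # per this embodiment


def classify_request(raw_request):
    """Return the predicted label category and whether to intercept the request."""
    seq = fix_length(encode(decode_request(raw_request), dictionary),
                     PRESET_LENGTH, mode="head")
    probs = model.predict(np.array([seq]))[0]        # (3,) class probabilities
    label = CLASS_NAMES[int(np.argmax(probs))]
    # Intercept only when an attack category (2 or 3) reaches the threshold.
    intercept = max(probs[1], probs[2]) >= PROB_THRESHOLD
    return label, intercept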
Embodiment three:
As shown in fig. 5, a network attack recognition system based on the network attack recognition method of the second embodiment comprises a data input unit, a data conversion unit, an LSTM model unit and a decision unit. The data input unit sends received network request data to the data conversion unit; the data conversion unit preprocesses and converts the network request data into digital sequence type data of a preset length and then sends them to the LSTM model unit; the LSTM model unit predicts the label category of the digital sequence type data and outputs the class probability of each label category to the decision unit; the decision unit judges whether the probability that the data belongs to an attack category is greater than or equal to a preset probability threshold, intercepting the request if so and not intercepting it otherwise.
The attack categories are the categories other than normal network request data; in this embodiment, they are category 2 and category 3.
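The four units could be organized as in the following illustrative sketch; the class names and wiring are assumptions for clarity, not a prescribed implementation, and the preprocessing helpers and model are the hypothetical ones sketched in embodiment one:

import numpy as np


class DataConversionUnit:
    """Preprocess raw request text into a fixed-length digital sequence."""
    def __init__(self, dictionary, preset_length):
        self.dictionary = dictionary
        self.preset_length = preset_length

    def convert(self, raw_request):
        seq = encode(decode_request(raw_request), self.dictionary)
        return fix_length(seq, self.preset_length, mode="head")


class DecisionUnit:
    """Intercept when an attack-category probability reaches the threshold."""
    def __init__(self, threshold=0.6):
        self.threshold = threshold

    def decide(self, probs):
        return max(probs[1], probs[2]) >= self.threshold


class AttackRecognitionSystem:
    """Data input -> conversion -> LSTM model -> decision, as in fig. 5."""
    def __init__(self, model, converter, decision):
        self.model = model
        self.converter = converter
        self.decision = decision

    def handle(self, raw_request):
        seq = self.converter.convert(raw_request)
        probs = self.model.predict(np.array([seq]))[0]
        return "intercept" if self.decision.decide(probs) else "pass"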
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. An LSTM recurrent neural network model for network attack category identification, wherein the establishing process of the LSTM recurrent neural network model comprises the following steps:
S100: collecting a plurality of network request data in text format as a training data set, and setting a label category for each network request according to its content, wherein the label categories comprise normal network request data and attack-category network request data;
S200: preprocessing the network request data in the training data set and converting them into digital sequence type data of a preset length;
the preprocessing comprises decoding, Chinese character replacement, case conversion, dictionary extraction, sequence encoding and sequence length processing;
the decoding is: restoring encoded sequences contained in the network request data to the characters they represent before encoding;
the Chinese character replacement is: uniformly replacing characters that are not letters, digits or symbols with the lowercase letter "z";
the case conversion is: converting uppercase characters into lowercase characters;
the dictionary extraction is: counting, with a single character as the unit, how many times each character appears in the data set, and assigning dictionary numbers in descending order of that count to form a dictionary; the padding character used for sequence completion is set to "+", with dictionary number 0, and characters that do not appear in the training data set are encoded as "z", with dictionary number 1;
the sequence encoding is: replacing each character in the network request data with the dictionary number assigned to it in the dictionary; characters not recorded in the dictionary are replaced with the character "z" and given dictionary number 1, forming digital sequence type data;
the sequence length processing is: processing the length of the digital sequence type data according to the preset length so that all digital sequence type data have the preset length;
S300: training on the digital sequence type data in the training data set to construct the LSTM recurrent neural network model.
2. The LSTM recurrent neural network model for network attack category identification as claimed in claim 1, wherein the label categories include three categories: category 1, normal network request data; category 2, network request data containing an SQL injection attack; and category 3, network request data containing an XSS attack.
3. The LSTM recurrent neural network model for network attack category identification as claimed in claim 1, wherein the length processing includes one of a head operation and a tail operation; the head operation is: for digital sequence type data shorter than the preset length, padding the front with dictionary number 0, and for sequences longer than the preset length, truncating the front part that exceeds the preset length; the tail operation is: for digital sequence type data shorter than the preset length, padding the rear with dictionary number 0, and for sequences longer than the preset length, truncating the rear part that exceeds the preset length.
4. The LSTM recurrent neural network model for network attack category identification as claimed in claim 1, wherein step S300 includes the steps of:
S301: setting an embedding layer;
S302: using a Dropout layer to randomly disconnect a certain proportion of network connections to avoid overfitting;
S303: connecting the LSTM layer and specifying a certain proportion of Dropout;
S304: connecting a fully connected layer and a softmax layer, whose output is the class probability corresponding to each label category;
S305: setting the optimizer to Adam stochastic gradient descent, the loss function to the multi-class cross-entropy function, and the metric to accuracy, and iterating multiple times until the model converges.
5. A network attack identification method based on the LSTM recurrent neural network model for network attack category identification as claimed in any one of claims 1-4, comprising: preprocessing network request data to obtain digital sequence type data of a preset length; using the LSTM recurrent neural network model to predict the label category of the digital sequence type data, obtaining a class probability for each label category; and taking the label category corresponding to the dimension with the largest probability value as the label category of the network request data.
6. A network attack recognition system based on the network attack identification method of claim 5, comprising: a data input unit, a data conversion unit, an LSTM model unit and a decision unit, wherein the data input unit sends received network request data to the data conversion unit; the data conversion unit preprocesses and converts the network request data into digital sequence type data of a preset length and sends them to the LSTM model unit; the LSTM model unit predicts the label category of the digital sequence type data and outputs the class probability of each label category to the decision unit; and the decision unit judges whether the probability that the data belongs to an attack category is greater than or equal to a preset probability threshold, intercepting the request if so and not intercepting it otherwise.
CN201811127061.9A 2018-09-27 2018-09-27 LSTM model and network attack identification method and system based on LSTM model Active CN109308494B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811127061.9A CN109308494B (en) 2018-09-27 2018-09-27 LSTM model and network attack identification method and system based on LSTM model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811127061.9A CN109308494B (en) 2018-09-27 2018-09-27 LSTM model and network attack identification method and system based on LSTM model

Publications (2)

Publication Number Publication Date
CN109308494A CN109308494A (en) 2019-02-05
CN109308494B true CN109308494B (en) 2021-06-22

Family

ID=65225092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811127061.9A Active CN109308494B (en) 2018-09-27 2018-09-27 LSTM model and network attack identification method and system based on LSTM model

Country Status (1)

Country Link
CN (1) CN109308494B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135566A (en) * 2019-05-21 2019-08-16 四川长虹电器股份有限公司 Registration user name detection method based on bis- Classification Neural model of LSTM
CN110138793A (en) * 2019-05-21 2019-08-16 哈尔滨英赛克信息技术有限公司 A kind of network penetration recognition methods based on interbehavior analysis
CN112149818B (en) * 2019-06-27 2024-04-09 北京数安鑫云信息技术有限公司 Threat identification result evaluation method and device
CN110581864B (en) * 2019-11-11 2020-02-21 北京安博通科技股份有限公司 Method and device for detecting SQL injection attack
CN111131237B (en) * 2019-12-23 2020-12-29 深圳供电局有限公司 Microgrid attack identification method based on BP neural network and grid-connected interface device
CN111586071B (en) * 2020-05-19 2022-05-20 上海飞旗网络技术股份有限公司 Encryption attack detection method and device based on recurrent neural network model
CN111753926B (en) * 2020-07-07 2021-03-16 中国生态城市研究院有限公司 Data sharing method and system for smart city
CN112887304B (en) * 2021-01-25 2022-12-30 山东省计算中心(国家超级计算济南中心) WEB application intrusion detection method and system based on character-level neural network
CN112822206B (en) * 2021-01-29 2021-12-07 清华大学 Network cooperative attack behavior prediction method and device and electronic equipment
CN113343642B (en) * 2021-08-09 2021-11-02 浙江浙能技术研究院有限公司 Automatic group-level KKS code mapping method based on supervised sequence generation network
CN114169432B (en) * 2021-12-06 2024-06-18 南京墨云科技有限公司 Cross-site scripting attack recognition method based on deep learning
CN114352947B (en) * 2021-12-08 2024-03-12 天翼物联科技有限公司 Gas pipeline leakage detection method, system, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8225402B1 (en) * 2008-04-09 2012-07-17 Amir Averbuch Anomaly-based detection of SQL injection attacks
CN106295338A (en) * 2016-07-26 2017-01-04 北京工业大学 A kind of SQL leak detection method based on artificial neural network
CN108259494A (en) * 2018-01-17 2018-07-06 北京邮电大学 A kind of network attack detecting method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8225402B1 (en) * 2008-04-09 2012-07-17 Amir Averbuch Anomaly-based detection of SQL injection attacks
CN106295338A (en) * 2016-07-26 2017-01-04 北京工业大学 A kind of SQL leak detection method based on artificial neural network
CN108259494A (en) * 2018-01-17 2018-07-06 北京邮电大学 A kind of network attack detecting method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection; J. Kim, et al.; International Conference on Platform Technology & Service; 2016-04-30; pp. 1-5 *
Gated memory network method for network attack detection (网络攻击检测的门控记忆网络方法); 王家宝, et al.; Application Research of Computers (计算机应用研究); 2018-05-24; pp. 2454-2468 *

Also Published As

Publication number Publication date
CN109308494A (en) 2019-02-05

Similar Documents

Publication Publication Date Title
CN109308494B (en) LSTM model and network attack identification method and system based on LSTM model
CN111709241B (en) Named entity identification method oriented to network security field
CN112905421B (en) Container abnormal behavior detection method of LSTM network based on attention mechanism
CN109450845B (en) Detection method for generating malicious domain name based on deep neural network algorithm
CN110351301B (en) HTTP request double-layer progressive anomaly detection method
CN110933105B (en) Web attack detection method, system, medium and equipment
CN108718306B (en) Abnormal flow behavior discrimination method and device
CN111538929B (en) Network link identification method and device, storage medium and electronic equipment
CN109831422A (en) A kind of encryption traffic classification method based on end-to-end sequence network
CN111758098B (en) Named entity identification and extraction using genetic programming
CN112492059A (en) DGA domain name detection model training method, DGA domain name detection device and storage medium
CN111049819A (en) Threat information discovery method based on threat modeling and computer equipment
CN111382783A (en) Malicious software identification method and device and storage medium
CN115270996A (en) DGA domain name detection method, detection device and computer storage medium
CN113591077A (en) Network attack behavior prediction method and device, electronic equipment and storage medium
CN115495744A (en) Threat information classification method, device, electronic equipment and storage medium
Rasheed et al. Adversarial attacks on featureless deep learning malicious urls detection
Harbola et al. Improved intrusion detection in DDoS applying feature selection using rank & score of attributes in KDD-99 data set
CN110674370A (en) Domain name identification method and device, storage medium and electronic equipment
Prusty et al. SMS Fraud detection using machine learning
CN113783852A (en) Intelligent contract Pompe fraudster detection algorithm based on neural network
Qiao et al. Malware classification method based on word vector of bytes and multilayer perception
CN115344563B (en) Data deduplication method and device, storage medium and electronic equipment
CN115878927A (en) Method and device for identifying fraud websites, storage medium and electronic equipment
CN113378156B (en) API-based malicious file detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant