CN112182575A

CN112182575A - Attack data set malicious segment marking method and system based on LSTM

Info

Publication number: CN112182575A
Application number: CN202011033505.XA
Authority: CN
Inventors: 安韬; 王智民
Original assignee: Beijing 6Cloud Information Technology Co Ltd
Current assignee: Beijing 6Cloud Information Technology Co Ltd
Priority date: 2020-09-27
Filing date: 2020-09-27
Publication date: 2021-01-05

Abstract

The invention provides an LSTM-based method and system for labeling malicious segments of an attack data set. The method comprises: extracting the key values of each group of parameters in a malicious URL; converting the key values into feature representations; inputting the feature representations into a trained LSTM model for prediction and obtaining the prediction result value with the maximum numerical value; and obtaining the malicious segment corresponding to that maximum prediction result value, thereby obtaining the position of the malicious segment in the malicious URL and labeling it. The system comprises: a data processing unit for extracting the key values of each group of parameters in the malicious URL and converting them into feature representations; a data prediction unit for inputting the feature representations into the trained LSTM model for prediction and obtaining the prediction result value with the maximum numerical value; and a data labeling unit for obtaining the malicious segment corresponding to that maximum prediction result value, thereby obtaining the position of the malicious segment in the malicious URL and labeling it.

Description

Attack data set malicious segment marking method and system based on LSTM

Technical Field

The invention relates to the technical field of network and information security, in particular to an LSTM-based attack data set malicious segment marking method and an LSTM-based attack data set malicious segment marking system.

Background

The web firewall is the first line of defense for information security. As network technologies evolve rapidly, new attack techniques keep emerging, which challenges traditional rule-based firewalls. Traditional web intrusion detection intercepts intrusions by maintaining a set of rules. However, hard-coded rules are easily bypassed by adaptable attackers, and a rule set built on past knowledge struggles to cope with zero-day attacks. Attacks such as SQL injection and command injection pose a significant threat to data security. To detect web attacks on a website, the website traffic must be extracted, analyzed, and checked.

In recent years, with the rapid development of machine learning, experts and scholars at home and abroad have carried out a great deal of research on it and applied it to cyberspace security. Machine learning is becoming the mainstream trend in this field. Its advantage is that it can detect attack behavior with previously unknown characteristics, making up for the shortcomings of traditional methods.

However, a detection model based on machine learning is a black box that provides prediction results based only on input samples, so its detection results currently lack explanation. Furthermore, training the model relies on large labeled data sets that must be labeled manually by experienced experts.

When a web intrusion detection system identifies the malicious category of an attack URL, providing the malicious segment as evidence at the same time improves the interpretability of the prediction and reduces the burden on the staff involved. Training such detection models requires large data sets in which the locations of malicious segments are labeled. Labeling samples usually requires manual work by an experienced expert and takes a great deal of effort. A good automatic data labeling method is currently lacking.

Disclosure of Invention

The method adopts an LSTM model for prediction, can output the abnormal probability of a sample, and simultaneously positions the malicious segment according to the parameter value with the maximum prediction result in the abnormal probability of the sample, thereby providing a method for effectively constructing a labeled data set for model training and greatly reducing the manual labeling cost. The system realizes the labeling of the malicious segments of the data to be predicted based on the method.

In order to achieve the above object, a first aspect of the present invention provides an LSTM-based method for labeling malicious segments in a web attack data set, the method including:

extracting key values of all groups of parameters in the malicious URL;

converting the key value into a feature representation;

inputting the feature representation into a trained LSTM model for prediction to obtain a prediction result value with the maximum numerical value;

and acquiring a corresponding malicious segment according to the prediction result value with the maximum numerical value, thereby acquiring the position of the malicious segment in the malicious URL and labeling the malicious segment.

Optionally, the extracting key values of each group of parameters in the malicious URL includes:

parsing and unescaping the malicious URL, converting each URL-encoded value in the malicious URL into its corresponding character;

and extracting the key values of each group of parameters in the converted malicious URL. Key value extraction isolates the key data in the malicious URL, which improves the accuracy of the prediction result.

Optionally, the converting the key value into the feature representation includes:

filtering and extracting ascii characters in the key values;

converting the numeric characters in the ascii characters into '0', converting capital letters into lowercase letters, and keeping the special symbols unchanged to obtain first characters;

converting the first character into a second character with a first set length;

performing feature conversion on each character in the second characters to obtain feature representation of each character;

and splicing the feature representations of the characters in the order in which the characters appear in the second characters to obtain the feature representation of the key value. Because the second characters are converted into feature representations as the last step, the resulting feature representation has a uniform length and format, which makes it convenient to input into the LSTM model for prediction.

Further, the converting the first character into a second character with a first set length includes:

if the length of the first character is smaller than the first set length, padding 0 at the end of the first character to obtain the second character;

if the length of the first character is larger than the first set length, truncating the first character to the first set length to obtain the second character. Padding with 0 and truncating yield second characters of uniform length that meet the length requirement.
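The character adjustment described above (ASCII filtering, digit and case normalization, then padding or truncation) can be sketched as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the function name `normalize_value` and the default `length=64` are hypothetical (the text does not state the first set length), and padding with the character '0' is one reading of "0 is supplemented at the end".

```python
def normalize_value(value: str, length: int = 64) -> str:
    """Turn one key value into a fixed-length 'second character' string.

    Assumptions (not stated in the source): the first set length defaults
    to 64, and padding uses the character '0'.
    """
    # Keep ASCII characters only.
    ascii_only = "".join(ch for ch in value if ord(ch) < 128)
    # Digits become '0', uppercase letters become lowercase,
    # special symbols are left unchanged ("first character").
    first = "".join("0" if ch.isdigit() else ch.lower() for ch in ascii_only)
    # Truncate if too long, pad with '0' at the end if too short.
    return first[:length].ljust(length, "0")


if __name__ == "__main__":
    print(normalize_value("-1 UNION Select 1,concat(0x67371)", length=40))
```

Mapping every digit to '0' and folding case shrinks the character vocabulary, so that values differing only in numeric literals or letter case share the same representation.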

Optionally, the performing feature conversion on each character in the second character to obtain a feature representation of each character includes:

performing the feature conversion by word embedding, embedding each character to obtain a feature representation of a second length.

Further, embedding each character to obtain a feature representation of the second length includes: embedding each character to obtain a feature representation of length 64.

Further, the trained LSTM model is obtained by training with a training set including normal URLs and malicious URLs. The LSTM model is trained by adopting a training set comprising normal URLs and malicious URLs, so that the trained LSTM model can output suspicious segments as assistance, and the interpretability of the model can be increased.

The invention also provides an LSTM-based system for labeling malicious segments in a web attack data set, the system comprising:

the data processing unit is used for extracting key values of all groups of parameters in the malicious URL and converting the key values into feature representations;

the data prediction unit is used for inputting the feature representation into a trained LSTM model for prediction to obtain a prediction result value with the maximum numerical value;

and the data labeling unit is used for acquiring the corresponding malicious segment according to the prediction result value with the maximum numerical value, thereby acquiring the position of the malicious segment in the malicious URL and labeling the malicious segment.

Further, the data processing unit includes:

the parsing and escaping module is used for parsing and escaping the malicious URL and converting a URL code value in the malicious URL into a corresponding character;

the key value extraction module is used for extracting key values of all groups of parameters in the converted malicious URL;

the filtering module is used for filtering and extracting ascii characters in the key values;

the character adjusting module is used for converting numeric characters in the ascii characters into '0', converting capital letters into lowercase letters, keeping special symbols unchanged to obtain first characters, and converting the first characters into second characters with a first set length;

the characteristic conversion module is used for performing characteristic conversion on each character in the second characters to obtain characteristic representation of each character;

and the characteristic splicing module is used for splicing the characteristic representations of the characters according to the sequence of the characters in the second characters to obtain the characteristic representation of the key value.

The key value extraction module obtains the key data in the malicious URL through key value extraction, which improves the accuracy of the prediction result. The character adjusting module converts the key values of the malicious URL into a uniform format and into second characters of uniform length, which are finally converted into feature representations, so that the resulting feature representations have a uniform length and format and are convenient to input into the LSTM model for prediction.

In another aspect, the present invention provides a machine-readable storage medium having stored thereon instructions for causing a machine to perform the LSTM-based web attack data set malicious segment tagging method described herein.

According to the technical scheme, the LSTM model is adopted for prediction, the probability of sample abnormity can be output, the position of the malicious segment is located according to the parameter value with the largest prediction result in the sample abnormity probability, a method for effectively constructing the labeled data set is provided for model training, and the manual labeling cost is greatly reduced. The system realizes the labeling of the malicious segments of the data to be predicted based on the method.

Additional features and advantages of embodiments of the invention will be set forth in the detailed description which follows.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the embodiments of the invention without limiting the embodiments of the invention. In the drawings:

FIG. 1 is a flowchart of a method for labeling malicious segments in a web attack data set based on LSTM according to an embodiment of the present invention;

FIG. 2 is a block diagram of a system for tagging malicious segments in a web attack data set based on LSTM according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.

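FIG. 1 is a flowchart of a method for labeling malicious segments in a web attack data set based on LSTM according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

extracting key values of all groups of parameters in the malicious URL;

converting the key values into a feature representation;

inputting the feature representation into the trained LSTM model for prediction to obtain the prediction result value with the maximum numerical value;

and acquiring the corresponding malicious segment according to the prediction result value with the maximum numerical value, thereby acquiring the position of the malicious segment in the malicious URL and labeling the malicious segment.

Extracting the key values includes parsing and unescaping the malicious URL, converting each URL-encoded value into its corresponding character, and extracting key values of all groups of parameters in the converted malicious URL.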

A URL consists of a protocol, a domain name, a path, parameters, a fragment, and the like. Each URL may have multiple groups of parameters, each group being a key-value pair. A URL may contain special characters in URL-encoded form; parsing and unescaping converts each URL-encoded value into its corresponding character (for example, using urllib). The key values of each group of parameters in the malicious URL are then extracted (for example, using urllib). The correspondence between URL-encoded values and characters is shown in Table 1.

Character	URL encoded value	Character	URL encoded value	Character	URL encoded value
Space	%20	+	%2B	>	%3E
"	%22	,	%2C	?	%3F
#	%23	/	%2F	@	%40
%	%25	:	%3A	\	%5C
&	%26	;	%3B	|	%7C
(	%28	<	%3C		
)	%29	=	%3D		

TABLE 1 URL code value and character correspondence table

Key data in the malicious URL are obtained through key value extraction, and the accuracy of a prediction result is improved.
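The parsing and key value extraction described above can be sketched with the Python standard library. This is a minimal sketch under assumptions: the source's reference to urllib is truncated, so the use of `urllib.parse.unquote_plus` and the helper name `extract_key_values` are illustrative choices rather than the applicant's actual calls. The whole URL is decoded first because, in attack samples, the `=` and `&` separators may themselves be percent-encoded.

```python
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_key_values(url: str) -> Tuple[str, List[str]]:
    """Decode a URL and return (decoded_url, values of each parameter group)."""
    # Convert %XX codes to characters and '+' to a space.
    decoded = unquote_plus(url)
    # Everything after the first '?' is treated as the parameter string.
    _, _, query = decoded.partition("?")
    values = []
    for pair in query.split("&"):
        if not pair:
            continue
        # Each group is key=value; only the value is kept.
        _key, _sep, value = pair.partition("=")
        values.append(value)
    return decoded, values


if __name__ == "__main__":
    sample = ("/secured/index.php/component/civicrm/?task%3dcivicrm%2fajax"
              "%2fjqstate%26_value%3d-1+union+select+1%2cconcat(0x67371)")
    print(extract_key_values(sample))
```

The pairs are split by hand rather than with a query-string parser so that the already-decoded text is not decoded a second time.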

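filtering and extracting ascii characters in the key values;

converting the numeric characters in the ascii characters into '0', converting capital letters into lowercase letters, and keeping the special symbols unchanged to obtain first characters;

converting the first character into a second character with a first set length, wherein the first set length is denoted length;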

and splicing the feature representations of the characters in the order of the characters in the second characters to obtain the feature representation of the key value, wherein the feature representation x has shape (batch_size, length, feature_size), batch_size is the size of the data batch, length is the length of the second characters, and feature_size is the feature size output by the embedding layer of the LSTM model.

Because the second characters are converted into feature representations as the last step, the resulting feature representation has a uniform length and format, which makes it convenient to input into the LSTM model for prediction.

After obtaining the feature representation x, the URL features are encoded by an LSTM with a feature size of 64 to obtain the hidden state h = LSTM(x), where the activation function is relu. A fully connected layer with a sigmoid activation function follows the LSTM layer and outputs the probability y that the URL is malicious.
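The forward computation described above can be written compactly as follows. The symbols W and b for the fully connected layer, and the reading of h as the final hidden state of size 64, are assumptions introduced for illustration; the source names only x, h, and y.

```latex
\begin{aligned}
x &\in \mathbb{R}^{\text{batch\_size} \times \text{length} \times \text{feature\_size}}, \\
h &= \mathrm{LSTM}(x), \qquad h \in \mathbb{R}^{\text{batch\_size} \times 64}, \\
y &= \sigma(h W + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad W \in \mathbb{R}^{64 \times 1},\; b \in \mathbb{R}.
\end{aligned}
```

Each entry of y lies in (0, 1) and is read as the probability that the corresponding input is malicious.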

if the length of the first character is smaller than the first set length, padding 0 at the end of the first character to obtain the second character;

if the length of the first character is larger than the first set length, truncating the first character to the first set length to obtain the second character. Padding with 0 and truncating yield second characters of uniform length, ensuring that every sample has the same length and meeting the length requirement.

and performing the feature conversion by word embedding, embedding each character to obtain a feature representation of a second length, wherein the second length is 64; that is, each character is embedded into a feature representation of length 64.
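To make the per-character embedding and splicing concrete, the sketch below looks up a length-64 vector for each character and concatenates the vectors in character order, giving one sample of shape (length, 64). It is an illustration only: the names `build_embedding_table` and `embed_characters` are hypothetical, the 128-entry ASCII vocabulary is an assumption, and the randomly initialized table stands in for the embedding layer whose weights are learned during training.

```python
import random
from typing import Dict, List

EMBED_DIM = 64  # second length stated in the text


def build_embedding_table(seed: int = 0) -> Dict[str, List[float]]:
    """Placeholder table: one random length-64 vector per ASCII character.

    In a trained model these vectors are learned parameters.
    """
    rng = random.Random(seed)
    return {chr(code): [rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
            for code in range(128)}


def embed_characters(second_chars: str,
                     table: Dict[str, List[float]]) -> List[List[float]]:
    """Look up each character and keep the vectors in character order."""
    return [table[ch] for ch in second_chars]


if __name__ == "__main__":
    table = build_embedding_table()
    features = embed_characters("-0 union select 0", table)
    print(len(features), len(features[0]))
```

Stacking such samples for a batch gives the (batch_size, length, feature_size) input expected by the LSTM layer.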

The data set includes normal URLs and abnormal URLs: a normal URL is a negative sample with label 0, and an abnormal URL is a positive sample with label 1. The training set is obtained by randomly sampling this data set and contains both positive and negative samples. Each sample includes a URL string and its label value. Data are drawn from the training set in sequence and fed into the initial LSTM model in batches for training. Cross entropy between the prediction result and the label is used as the loss function, and the model is trained with the Adam algorithm. The trainable parameters include all parameters of the embedding layer, the LSTM layer, and the fully connected layer. The trained LSTM model is obtained when training is complete.
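For a single sigmoid output with labels 0 and 1, the cross entropy loss takes its binary form. The following sketch computes the mean loss over a batch; the function name `binary_cross_entropy` and the clipping constant `eps` are illustrative assumptions, and a deep learning framework would normally supply this loss together with the optimizer.

```python
import math
from typing import Sequence


def binary_cross_entropy(labels: Sequence[int],
                         probs: Sequence[float],
                         eps: float = 1e-7) -> float:
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        # Clip to avoid log(0) for saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


if __name__ == "__main__":
    # One normal URL (label 0) and one abnormal URL (label 1).
    print(binary_cross_entropy([0, 1], [0.1, 0.9]))
```

The loss is small when abnormal URLs receive probabilities near 1 and normal URLs near 0, and grows as predictions move toward the wrong label.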

It should be noted that the invention uses the Keras framework to build the LSTM model. Keras allows the model to be implemented quickly because it provides a variety of network layers for the user to choose from, as well as tools that help the user define custom layers. In other embodiments, the LSTM model can be implemented with another deep learning framework such as TensorFlow or PyTorch.

It should also be noted that the LSTM of the present invention is one kind of neural network structure; the LSTM layer can likewise be replaced by a similar neural network layer, such as an RNN or a GRU.

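Further, the data processing unit includes the parsing and escaping module, the key value extraction module, the filtering module, the character adjusting module, the feature conversion module, and the feature splicing module described above.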

The labeling process is described below in one embodiment.

The malicious URL is:

“/secured/index.php/component/civicrm/?task%3dcivicrm%2fajax%2fjqstate%26_value%3d-1+union+select+1%2cconcat(0x67371)”

After parsing and unescaping, this becomes:

“/secured/index.php/component/civicrm/?task=civicrm/ajax/jqstate&_value=-1 union select 1,concat(0x67371)”

Extracting all key values gives: ['civicrm/ajax/jqstate', '-1 union select 1,concat(0x67371)'].

The two values ['civicrm/ajax/jqstate', '-1 union select 1,concat(0x67371)'] are converted into feature representations and each is scored by the LSTM model, giving the prediction result values [0.12733723, 0.6848065].

The largest prediction result value is 0.6848065, which corresponds to the value '-1 union select 1,concat(0x67371)', so this segment is taken as the malicious segment of the malicious sample. The annotated sample consists of the parsed sample and the malicious segment position, resulting in ("/secured/index.php/component/civicrm/.
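The selection and positioning step of this embodiment can be sketched as below. The source's annotated output is cut off, so the returned format, a pair of the parsed URL and a (start, end) character span, is an assumption, as is the helper name `label_malicious_segment`. The scores are those quoted in the embodiment.

```python
from typing import List, Tuple


def label_malicious_segment(parsed_url: str,
                            values: List[str],
                            scores: List[float]) -> Tuple[str, Tuple[int, int]]:
    """Pick the value with the highest score and locate it in the parsed URL.

    The (start, end) span format is an assumed annotation layout;
    end is exclusive.
    """
    # Index of the maximum prediction result value.
    best = max(range(len(scores)), key=lambda i: scores[i])
    segment = values[best]
    # First occurrence of the segment in the parsed URL.
    start = parsed_url.find(segment)
    return parsed_url, (start, start + len(segment))


if __name__ == "__main__":
    url = ("/secured/index.php/component/civicrm/?task=civicrm/ajax/jqstate"
           "&_value=-1 union select 1,concat(0x67371)")
    vals = ["civicrm/ajax/jqstate", "-1 union select 1,concat(0x67371)"]
    print(label_malicious_segment(url, vals, [0.12733723, 0.6848065]))
```

Taking the first occurrence is a simplification: if the same text appears earlier in the URL, tracking each value's offset during extraction would give an unambiguous position.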

According to the method, the LSTM model is adopted for prediction, the probability of sample abnormity can be output, meanwhile, the position of a malicious segment is positioned according to the parameter value with the largest prediction result in the sample abnormity probability, a method for effectively constructing a labeling data set is provided for model training, and the manual labeling cost is greatly reduced.

Those skilled in the art will appreciate that all or part of the steps of the method in the above embodiments may be implemented by a program that is stored in a storage medium and includes several instructions enabling a single-chip microcomputer, a chip, or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

While the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the embodiments of the present invention are not limited to the details of the above embodiments, and various simple modifications can be made to the technical solution of the embodiments of the present invention within the technical idea of the embodiments of the present invention, and the simple modifications are within the scope of the embodiments of the present invention. It should be noted that the various features described in the above embodiments may be combined in any suitable manner without departing from the scope of the invention. In order to avoid unnecessary repetition, the embodiments of the present invention will not be described separately for the various possible combinations.

In addition, any combination of the various embodiments of the present invention is also possible, and the same should be considered as disclosed in the embodiments of the present invention as long as it does not depart from the spirit of the embodiments of the present invention.

Claims

1. A method for labeling malicious segments of a web attack data set based on LSTM is characterized by comprising the following steps:

extracting key values of all groups of parameters in the malicious URL;

converting the key value into a feature representation;

2. The LSTM-based method for labeling malicious segments in a web attack dataset as claimed in claim 1, wherein the extracting key values of each set of parameters in a malicious URL comprises:

3. The LSTM-based web attack dataset malicious segment tagging method of claim 1, wherein the converting the key value into a feature representation comprises:

filtering and extracting ascii characters in the key values;

converting the first character into a second character with a first set length;

and splicing the characteristic representations of the characters according to the sequence of the characters in the second character to obtain the characteristic representation of the key value.

4. The LSTM-based web attack dataset malicious segment tagging method according to claim 3, wherein the converting the first character into a second character with a first set length comprises:

if the length of the first character is larger than a first set length, intercepting the character which accords with the first set length from the first character to obtain a second character.

5. The LSTM-based web attack data set malicious segment tagging method of claim 3, wherein the performing feature transformation on each character in the second characters to obtain a feature representation of each character comprises:

6. The LSTM-based web attack data set malicious segment tagging method according to claim 5, wherein embedding each character to obtain a feature representation of a second length comprises: embedding each character to obtain a feature representation of length 64.

7. The LSTM-based web attack data set malicious segment tagging method of claim 1, wherein the trained LSTM model is trained using a training set comprising normal URLs and malicious URLs.

8. An LSTM-based web attack dataset malicious segment tagging system, the system comprising:

9. The LSTM-based web attack dataset malicious segment tagging system of claim 8, wherein the data processing unit comprises:

10. A machine-readable storage medium having stored thereon instructions for causing a machine to perform the LSTM-based web attack data set malicious segment tagging method of any one of claims 1 to 7.