CN112437084A

CN112437084A - Attack feature extraction method

Info

Publication number: CN112437084A
Application number: CN202011319212.8A
Authority: CN
Inventors: 王高翃; 贾宝林; 朱连凯; 连栋; 王英; 陆炜; 张家鹏; 陈政熙
Original assignee: Shanghai Institute of Process Automation Instrumentation
Current assignee: Shanghai Institute of Process Automation Instrumentation
Priority date: 2020-11-23
Filing date: 2020-11-23
Publication date: 2021-03-02
Anticipated expiration: 2040-11-23
Also published as: CN112437084B

Abstract

The invention discloses an attack feature extraction method comprising the following steps: obtaining attack fields through protocol analysis and converting them into a digital matrix; preliminarily separating the field patterns at special characters; performing statistical analysis on the separated data and updating the digital matrix; counting the field combinations that recur in the updated digital matrix; and establishing an attack feature classification model that extracts the key attack-related features from known and unknown attack fields and predicts the attack types of known and unknown attack information. Based on analysis of network communication protocols and an understanding of attack features, the invention establishes a general attack feature extraction method and classifies attack types according to the relevant features. By analyzing different attack samples, the information carrying attack features is extracted from the attack fields with statistical methods, and the attack types are then classified and modeled on that basis, achieving efficient and objective automatic extraction of attack features.

Description

Attack feature extraction method

Technical Field

The invention relates to a method for extracting attack features, in particular to a method for extracting network attack features, and belongs to the field of data extraction.

Background

As networks grow in scale, the number of network attacks grows with them. Ensuring the normal and stable operation of network systems has become the central problem of network security, and detection based on attack features has become the most common detection approach. An attack feature is a condensed description of an attack behavior; it is generally a characteristic unique to the traffic data generated by the attack. With such features, an attack behavior can be readily identified and confirmed, so that it does not have a major impact on daily production and life. For an unknown attack behavior, its features must first be analyzed and extracted so that early warning and defense can be provided.

Existing automatic attack feature extraction techniques are divided into network-based and host-based techniques. Network-based techniques use attack information captured on the network and extract the attack features from it algorithmically. Host-based techniques modify the system environment to some extent, obtain the relevant attack information from the attacked host, and analyze it to obtain the features. The two approaches each have advantages and disadvantages in accuracy, extraction speed, feature usability, and methodology.

Extracting attack features is a very complicated process. Extraction by penetration-testing experts is slow and highly subjective, and the effectiveness of the extracted features cannot be determined. An efficient and objective automatic attack feature extraction technique is therefore needed.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: the existing attack feature extraction mode is slow in speed and high in subjectivity, and the effectiveness of the extracted features cannot be determined.

In order to solve the above problems, the technical solution of the present invention is to provide a method for extracting attack features, which is characterized by comprising the following steps:

step one: acquiring an attack field through protocol analysis, converting the data of the attack field, expressed as a binary stream, into a digital matrix byte by byte, and processing fields of different lengths by using masks;

step two: determining the characteristic characters that serve as separators, and preliminarily separating the field patterns at these special characters;

step three: carrying out statistical analysis on the separated data, carrying out statistics on occurrence frequency information of character strings among all corresponding separators, extracting common key fields, setting corresponding threshold values, representing some character strings which occur frequently by uniform marking sequence numbers, and updating a sequence number-character/character string correspondence table and a number sequence number matrix to obtain preliminarily extracted key attack fields;

step four: counting field combinations which repeatedly appear in the updated numerical matrix, selecting the length n of a character and character string combination, counting the occurrence frequency of the character and character string combination with the length n, setting a corresponding threshold value, extracting the combination of the character and character string with more occurrence frequency, and combining adjacent combinations to obtain the final extracted feature information;

step five: establishing an attack feature classification model, extracting key features related to attacks in known and unknown attack fields through training and application of a Recurrent Neural Network (RNN) model on the basis, and predicting the attack types of known and unknown attack information.

Preferably, in step one, when not all 256 characters appear in the attack fields, the characters that do appear are sorted and recorded in a sequence number-character correspondence table, and the sequence numbers of the corresponding characters are stored in the digital matrix. Each row of the digital matrix represents one attack field, the original attack field can be recovered by looking up the sequence number-character correspondence table, and the sequence numbers in the digital matrix correspond one-to-one with the characters of the original attack field.

Preferably, after the number matrix is obtained, a threshold value may be set based on statistical information of the frequency of occurrence of characters, some characters with less occurrence are represented by uniform labeled sequence numbers, and the sequence number-character correspondence table and the number sequence number matrix are updated, where the sequence number in the updated number sequence number matrix may correspond to one or more characters in the original attack field.

Preferably, after the digital matrix is updated, for the case that the lengths of the attack fields are different, the fields with different lengths are recorded and processed in the form of a mask matrix.

Preferably, the characteristic characters include paired separators, parallel separators, and assignment symbols.

Preferably, the sequence number in the updated numeric sequence number matrix in step three may correspond to a single character, multiple characters, or a specific common character string, where the common character string is a key attack field extracted preliminarily.

Preferably, step four specifically comprises: counting the pairs of sequence numbers at the beginning and end of each combination of characters and character strings of length n; on the basis of their occurrence frequencies, counting the combinations with fixed beginning and end and recording the positions at which each combination occurs; and, using the recorded position information, merging adjacent combinations whose occurrence frequency exceeds a preset threshold by comparing their positions. The merged combinations of characters and character strings are the finally extracted attack features.

Preferably, the establishing of the attack feature classification model includes performing label work on each character string combination in the field and the attack type of each piece of attack information, and training and applying through a Recurrent Neural Network (RNN) model on the basis.

Compared with the prior art, the invention has the beneficial effects that:

Based on analysis of network communication protocols and an understanding of attack features, the invention establishes a general attack feature extraction method and classifies attack types according to the relevant features. By analyzing different attack samples, the method extracts the information carrying attack features from the attack fields with statistical methods and then classifies and models the attack types on that basis, achieving efficient and objective automatic extraction of attack features.

Drawings

FIG. 1 is a flow chart of a method for attack feature extraction according to the present invention;

FIG. 2 is a diagram illustrating a conversion of an attack field into a number matrix according to an embodiment of the present invention;

FIG. 3 is a diagram of a mask matrix according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating an embodiment of a data field pattern after initial segmentation;

FIG. 5 is a schematic diagram of a merged common character string according to an embodiment of the present invention.

Detailed Description

In order to make the invention more comprehensible, preferred embodiments are described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, to address the problems encountered in attack feature extraction described in the Background, the invention provides an attack feature extraction method. Based on analysis of network communication protocols and an understanding of attack features, it establishes a general attack feature extraction method and classifies attack types according to the relevant features. By analyzing different attack samples, the information carrying attack features is extracted from the attack fields with statistical methods, and the attack types are then classified and modeled on that basis. Specifically, the method comprises the following steps:

Step one: obtain the attack fields through protocol analysis and convert them into a digital matrix. The attack fields, expressed as binary streams, are converted into the digital matrix byte by byte, and fields of different lengths are handled by using masks.

A single byte can represent 256 distinct characters, but an attack field typically does not contain all 256. When not all 256 characters appear in the attack fields, the characters that do appear are sorted and recorded in a sequence number-character correspondence table, and the sequence number of each character is stored in the digital matrix. Each row of the digital matrix represents one attack field, and the original attack field can be recovered by looking up the correspondence table; the sequence numbers in the matrix correspond one-to-one with the characters of the original attack field. Once the digital matrix has been obtained, a threshold may be set based on character frequency statistics: characters that occur rarely are represented by a uniform marker sequence number, and the correspondence table and the digital matrix are updated accordingly. A sequence number in the updated matrix may therefore correspond to one or more characters of the original attack field. After the digital matrix has been updated, attack fields of different lengths are recorded and handled by means of a mask matrix.
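The encoding described in step one can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the embodiment's implementation: the names `encode_fields`, `decode_row`, `min_count` and `RARE` are hypothetical, the table is sorted by byte value, and the one-to-one encoding and the folding of rare characters are combined into one function (with `min_count=1` nothing is folded and decoding is lossless).

```python
from collections import Counter

RARE = None  # hypothetical table entry standing for all low-frequency bytes


def encode_fields(fields, min_count=1):
    """Turn byte strings into rows of sequence numbers.

    Bytes seen fewer than `min_count` times share one sequence number.
    Returns (table, matrix): table[i] is the byte value behind sequence
    number i (or RARE), and matrix[r] encodes the r-th attack field.
    """
    counts = Counter(b for field in fields for b in field)
    common = sorted(b for b, c in counts.items() if c >= min_count)
    number_of = {b: i for i, b in enumerate(common)}
    rare_number = len(common)          # shared number for rare bytes
    table = common + [RARE]
    matrix = [[number_of.get(b, rare_number) for b in f] for f in fields]
    return table, matrix


def decode_row(row, table, placeholder=b"?"):
    """Look sequence numbers up in the table; rare bytes become `placeholder`."""
    return b"".join(
        placeholder if table[n] is RARE else bytes([table[n]]) for n in row
    )


fields = [b"a=1&b=2", b"b=2&a=x"]
table, matrix = encode_fields(fields, min_count=2)
print(matrix[0])                     # sequence numbers of the first field
print(decode_row(matrix[0], table))  # the rare byte "1" is no longer recoverable
```

The rows of `matrix` still have different lengths at this point; aligning them is the job of the mask matrix.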

For example, the attack information includes 3000 pieces of information as follows:

mail[#post_render][]=passthru&mail[#type]=markup&mail[#markup]=/usr/b in/who&form_id=user_register_form

form_id=user_register_form&mail[#post_render][]=exec&mail[#type]=mark up&mail[#markup]=/bin/hostname

…

After conversion, the digital matrix is as shown in FIG. 2.

The corresponding mask lengths are 103 and 100, and the mask matrix is shown in FIG. 3.
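One common way to realize such a mask matrix is to pad every row to the length of the longest field and record, in a parallel 0/1 matrix, which positions hold real data. The sketch below assumes that layout; the function name `pad_with_mask` and the padding value are hypothetical and are not taken from FIG. 3.

```python
def pad_with_mask(rows, pad=0):
    """Pad rows to equal length and return (padded, mask).

    mask[r][c] is 1 where padded[r][c] holds a real sequence number and
    0 where it holds padding. The mask is needed because the padding
    value may coincide with a valid sequence number.
    """
    width = max(len(row) for row in rows)
    padded = [row + [pad] * (width - len(row)) for row in rows]
    mask = [[1] * len(row) + [0] * (width - len(row)) for row in rows]
    return padded, mask


rows = [[3, 2, 5, 0, 4], [4, 2, 1]]
padded, mask = pad_with_mask(rows)
print(padded)
print(mask)
```

Under this layout, the two example fields above would yield mask rows containing 103 and 100 ones, respectively.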

Step two: determine the characteristic characters that serve as separators, and preliminarily split the field patterns at these special characters. The separators include paired symbols such as [ ] and " ", parallel separators such as & and /, and the assignment symbol =.
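A possible realization of this preliminary separation uses a regular expression. The sketch below makes one assumption that the text does not state explicitly: a pair of delimiters together with the text it encloses is kept as a single token (this matches the frequent strings such as "[#type]" reported for step three). The names `TOKEN` and `split_field` are hypothetical.

```python
import re

# Paired delimiters ([...] and "...") are matched together with their
# content; &, / and = are matched as single-character separators.
TOKEN = re.compile(r'(\[[^\]]*\]|"[^"]*"|[&/=])')


def split_field(field):
    """Split an attack field at the separators, keeping the separators."""
    # re.split with a capturing group also returns the matched separators;
    # empty strings between two neighbouring separators are dropped.
    return [tok for tok in TOKEN.split(field) if tok]


print(split_field("mail[#type]=markup&form_id=user_register_form"))
print(split_field("mail[#post_render][]=exec"))
```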

Step three: perform statistical analysis on the separated data. The occurrence frequencies of the character strings between corresponding separators are counted, a threshold is set, frequently occurring character strings are represented by uniform marker sequence numbers, and the sequence number-character (character string) correspondence table and the digital matrix are updated. A sequence number in the updated matrix may correspond to a single character, several characters, or a specific common character string. These common character strings are the preliminarily extracted key attack fields.

After separation, each piece of information is divided according to the occurrence-frequency statistics of the character strings, as shown in FIG. 4.

Character strings that occur infrequently between separators are replaced with ".*"; the frequently occurring character strings include, in order of appearance, "[#post_render]", "[]", "[#type]", and so on.
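The frequency statistics of step three can be sketched as follows. This is an illustration under assumptions rather than the embodiment's code: the names `extract_key_strings`, `min_count` and `WILDCARD` are hypothetical, the spelling of the placeholder for rare strings is assumed, separators are counted like any other token, and sequence numbers are assigned in order of first appearance.

```python
from collections import Counter

WILDCARD = ".*"  # assumed placeholder for strings below the threshold


def extract_key_strings(token_rows, min_count):
    """Assign one sequence number to every frequent string.

    Returns (vocab, id_rows): vocab maps each frequent string to its
    sequence number (0 is reserved for WILDCARD), and id_rows holds the
    re-encoded rows.
    """
    counts = Counter(tok for row in token_rows for tok in row)
    vocab = {WILDCARD: 0}
    for tok, count in counts.items():   # Counter keeps first-seen order
        if count >= min_count:
            vocab[tok] = len(vocab)
    id_rows = [[vocab.get(tok, 0) for tok in row] for row in token_rows]
    return vocab, id_rows


token_rows = [
    ["mail", "[#type]", "=", "markup", "&", "form_id", "=", "a1"],
    ["mail", "[#type]", "=", "markup", "&", "form_id", "=", "b2"],
]
vocab, id_rows = extract_key_strings(token_rows, min_count=2)
print(vocab)
print(id_rows)
```

The keys of `vocab` other than the placeholder play the role of the preliminarily extracted key attack fields.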

Step four: combinations of fields that occur repeatedly are counted in the reorganized number matrix. Selecting the length n of a character and character string combination, counting the occurrence frequency of the character and character string combination with the length n, setting a corresponding threshold value, extracting the combination of the character and character string with more occurrence frequency, and combining adjacent combinations to finally extract the characteristic information.

Specifically, the pairs of sequence numbers at the beginning and end of each combination of length n are counted first. On the basis of their occurrence frequencies, the combinations with fixed beginning and end are then counted, and the positions at which each combination occurs are recorded. Using this position information, adjacent combinations whose occurrence frequency exceeds a preset threshold are merged by comparing positions. The merged combinations of characters and character strings are the finally extracted attack features.

Part of the common character strings obtained after merging the character string combinations by occurrence frequency is shown in FIG. 5.
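The counting and merging of step four can be sketched as below. The sketch simplifies the described procedure: it counts complete combinations of length n directly and omits the preliminary count of beginning/end sequence-number pairs, and it treats both overlapping and directly abutting frequent combinations as "adjacent". The names `frequent_ngrams` and `merge_adjacent` are hypothetical.

```python
from collections import defaultdict


def frequent_ngrams(rows, n, min_count):
    """Map each length-n combination to its (row, start) positions,
    keeping only combinations seen at least `min_count` times."""
    positions = defaultdict(list)
    for r, row in enumerate(rows):
        for i in range(len(row) - n + 1):
            positions[tuple(row[i:i + n])].append((r, i))
    return {g: p for g, p in positions.items() if len(p) >= min_count}


def merge_adjacent(rows, n, min_count):
    """Merge adjacent frequent combinations into longer features."""
    covered = [[False] * len(row) for row in rows]
    for positions in frequent_ngrams(rows, n, min_count).values():
        for r, i in positions:
            for k in range(i, i + n):
                covered[r][k] = True
    features = set()
    for row, cov in zip(rows, covered):
        start = None
        for k, flag in enumerate(cov + [False]):  # sentinel ends the last run
            if flag and start is None:
                start = k
            elif not flag and start is not None:
                features.add(tuple(row[start:k]))
                start = None
    return features


rows = [[1, 2, 3, 4, 9], [7, 1, 2, 3, 4], [1, 2, 8, 3, 4]]
print(merge_adjacent(rows, n=2, min_count=2))
```

In the example, the frequent pairs (1, 2), (2, 3) and (3, 4) chain into the feature (1, 2, 3, 4) in the first two rows, while the third row, interrupted by the rare number 8, contributes only (1, 2) and (3, 4).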

Step five: establishing an attack characteristic classification model, including performing label work on each character string combination in the field and performing label work on the attack type of each attack information, and on the basis, extracting key characteristics related to the attack in the known and unknown attack fields through training and application of models such as a Recurrent Neural Network (RNN) and the like, and predicting the attack types of the known and unknown attack information.

Claims

1. A method for extracting attack features is characterized by comprising the following steps:

2. A method of attack feature extraction as claimed in claim 1, wherein: in the first step, for the condition that 256 characters do not completely appear in the attack field, all the characters appearing are sorted correspondingly, a sequence number-character corresponding table is recorded, finally, the sequence numbers of the corresponding characters are stored in a digital matrix to obtain a digital signal matrix, each line of the digital matrix represents one attack field, the original attack field can be obtained by searching the sequence number-character corresponding table, and the sequence numbers in the obtained digital sequence number matrix correspond to the characters in the original attack field one by one.

3. A method of attack feature extraction as claimed in claim 2, wherein: after the number matrix is obtained, a threshold value can be set based on the frequency statistical information of the occurrence of characters, some characters with less occurrence are represented by uniform marked sequence numbers, the sequence number-character correspondence table and the number sequence number matrix are updated, and the sequence number in the updated number sequence number matrix can correspond to one or more characters in the original attack field.

4. A method of attack feature extraction as claimed in claim 3, wherein: after the digital matrix is updated, for the condition that the lengths of all attack fields are different, the fields with different lengths are recorded and processed in the form of a mask matrix.

5. A method of attack feature extraction as claimed in claim 1, wherein: the characteristic characters comprise paired separators, parallel separators, and assignment symbols.

6. A method of attack feature extraction as claimed in claim 1, wherein: the sequence number in the updated numerical sequence number matrix in step three may correspond to a single character, multiple characters, or a specific common character string, where the common character string is a key attack field extracted preliminarily.

7. A method of attack feature extraction as claimed in claim 1, wherein: the fourth step is specifically that the serial number pairs of the beginning and the end of the character and character string combination with the length of n are counted, the character and character string combination with the fixed beginning and the end are counted on the basis of the occurrence frequency of the character and character string combination, the occurrence position of the combination is recorded, the adjacent combinations with the occurrence frequency higher than the preset threshold value are combined through the comparison of the position information according to the obtained position information of the character string combination, and the combined character and character string combination is the attack feature which is finally extracted.

8. A method of attack feature extraction as claimed in claim 1, wherein: the establishment of the attack characteristic classification model comprises the steps of carrying out label work on each character string combination in the field and the attack type of each piece of attack information, and training and applying through a Recurrent Neural Network (RNN) model on the basis.