CN115334179A

CN115334179A - Unknown protocol reverse analysis method based on named entity recognition

Info

Publication number: CN115334179A
Application number: CN202210848786.7A
Authority: CN
Inventors: 王俊峰; 李晓慧; 丛培鑫; 张永光
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2022-07-19
Filing date: 2022-07-19
Publication date: 2022-11-11
Anticipated expiration: 2042-07-19
Also published as: CN115334179B

Abstract

The invention discloses an unknown protocol reverse analysis method based on named entity recognition. Field named entities are obtained by inducing and summarizing the field composition specifications of a large number of known protocols, and the protocol sequence is labeled character by character with the BIO labeling method in combination with the named entities. A BiLSTM model extracts features from the context semantics of the protocol sequence and outputs the recognized label probabilities to a CRF layer, which constrains the named entity recognition result in combination with the conditional transition probabilities. The trained agent is then used to predict the specification of the unknown protocol. The method improves the ability of protocol reverse analysis to recover field structure and semantic information, automatically extracts protocol semantic information without manual intervention, enhances portability, and can reversely analyze protocol sequences of different types; reverse parsing of unknown protocols in binary character form is achieved, enabling good inference of the composition specifications of known and unknown protocols.

Description

Unknown protocol reverse analysis method based on named entity recognition

Technical Field

The invention relates to the field of protocol reverse engineering, and in particular to an unknown protocol reverse analysis method based on named entity recognition, which performs reverse inference on unknown network protocols in computer networks.

Background

In recent years, the Internet has gradually become an important carrier of scientific, technological and social development, and an increasing number of production and living activities depend on it. Because the Internet is readily available and easy to access, the number of people connecting to the Internet and to Web applications through different devices is increasing rapidly. According to the statistical report recently issued by the China Internet Network Information Center (CNNIC), the number of Internet users in China reached 1.032 billion by the end of 2021, and both the number of mobile phone users and the Internet penetration rate have kept growing rapidly for many years. Over the same period, the use of personal Web applications increased steadily, and the user base of online applications such as online food delivery grew markedly, with a growth rate above 10 percent. Instant messaging, short video and online payment steadily rank as the top three by number of users, the largest accounting for 97.3 percent of all Internet users. As the number of Internet users and the variety and number of connected devices expand, the volume of data carried on the Internet keeps increasing. Even in industrial Internet networks, there are over 59 million industry-related applications and over 7000 million connected devices (sets). People's production and daily life increasingly depend on the Internet.

This holds both for public networks, whose main users are the general public, and for private networks, whose main users are enterprises and institutions handling confidential information. To ensure that devices and applications operate stably, efficiently and securely in a network, the security and reliability of data transmission must be guaranteed. Driven by science and technology, the variety and number of intelligent products such as wearable devices and Web applications are increasing rapidly, and high-tech products are gradually enriching people's production and daily life. However, the proliferation of high-tech devices and applications has led to a surge in network traffic and data volume, and the variety and number of network protocols required for data exchange on the Internet have grown along with it. Protocols in network traffic fall into two broad categories, known protocols and unknown protocols, according to whether their specifications are public. In the bit streams of data transmission, unknown protocols, which are difficult to identify and analyze, account for over 40% of traffic data. The rapid increase in the number and variety of unknown protocols makes it much harder for network owners to control their networks.

Because the ways of obtaining protocol data are limited, the data used for protocol reverse engineering currently falls into two categories: offline network traffic and instruction execution traces. The latter costs more time and effort in both data acquisition and manual analysis, since it requires monitoring the instruction sequence and the corresponding context. Methods based on offline network traffic work directly on the original traffic, need no complicated preprocessing, and are more widely applicable. However, problems remain: semantic information is inferred incompletely, manual analysis is still relied upon, and existing methods depend on the characteristic characters of text-type protocols. In addition, network protocols themselves can be divided into text-type protocols and binary protocols. The former have more explicit data expression, while the latter have a single data form.

For parsing known protocols, current tools are relatively mature: common parsing tools such as tshark and tcpdump can output the format fields of more than 2000 known protocols, but they cannot parse unknown protocols whose formats are unpublished. Unknown protocols have unique, undisclosed, self-contained structures, often created by their designers for security reasons. Early reverse engineering of unknown protocols relied entirely on manual analysis, and the time required was considerable. For example, the open source project Samba took 12 years to produce a specification of the unknown protocol SMB through manual analysis. Automatic protocol reverse engineering has therefore become a key problem; by analyzing the two kinds of data objects, execution traces and network traffic, different outputs such as a protocol format, a protocol state machine, or a complete protocol specification can be obtained.

Disclosure of Invention

The technical problem to be solved by the invention is to provide an unknown protocol reverse analysis method based on named entity recognition (abbreviated as NEREUP for convenience of description), so as to reversely infer the composition, design specification and semantic information of unknown protocols in network traffic.

The technical scheme adopted by the invention is as follows: an unknown protocol reverse analysis method based on named entity recognition comprises the following steps:

Step 1: collecting data containing different application layer protocols, inducing and summarizing the field components of known application layer protocol headers, discovering the component field types of existing known application layer protocols, and defining field type named entities;

Step 2: capturing network traffic in a real network environment, classifying and selecting different application layer protocols, extracting their binary protocol sequences, and preprocessing the original binary protocol sequences;

Step 3: building a character vector conversion dictionary for the preprocessed data and mapping each converted protocol sequence character to a character vector; taking the generated One-hot vectors of the original characters as the input of a bidirectional long short-term memory network, where the input layer feeds the values into the forward and backward long short-term memory units respectively; combining the features and information of the previous and next moments, splicing the results to obtain the output layer vector, and outputting it as the hidden state vector of the character;

Step 4: mapping the hidden state vector of each character to predicted probability values over all label types; inputting the output layer vector into a conditional random field (CRF) and performing a final probability correction in combination with the transition probabilities of characters in real-time data, to obtain the label probability corresponding to the predicted named entity class, thereby completing the training of the agent;

Step 5: converting the unknown protocol sequence to be reversely analyzed as in step 2 and step 3, inputting it into the agent, and obtaining the named entity recognition result of the unknown protocol through prediction;

Step 6: improving the extended Backus-Naur form to formally describe the network protocol format, field structure and semantic information, normalizing the named entity recognition result of the unknown protocol in this form, and finally obtaining the protocol specification and conventions of the unknown protocol.

Further, the field type named entity definition specifically includes:

1) Index field: the most direct identification field for distinguishing different protocol sequences within the same application layer protocol; it represents the sequence number of the protocol sequence or the sequence number of the object on which the protocol acts;

2) Type field: indicates basic properties of the current protocol and expresses semantics in binary form, so that a single binary character can constitute a field; when information is transmitted, a meaning is expressed with only one binary bit ('0' or '1') or two binary bits ('0' to '3'); it contains the state information of the protocol and semantic information that can be conveyed as 'yes' or 'no'; consecutive type fields are merged into a 'flag' field at parsing time and expressed in bytes;

3) Function field: carries information that is relatively intuitive in the data and can act directly; in one case, the information to be transmitted is expressed with a limited set of numbers, each number alone representing a particular type of function; in the other case, the decimal value of the field is itself the important data transmitted, and the data is carried using more bytes than the number of bits required.

Further, the step 2 specifically includes:

Step 2.1: selecting PCAP packets of traffic collected at the ports and routing nodes of a secure network; performing statistics on all PCAP packets and counting the application layer protocols from which a data set can be built; classifying each protocol subset by application layer protocol type, where g denotes the number of application layer protocol types and m denotes the number of protocols in each Packet; all derived Packet sets containing a single protocol are expressed in terms of the d-th protocol of the g-th application layer protocol;

Step 2.2: extracting the original binary protocol sequence of each application layer protocol, converting it to hexadecimal, and applying BIO (Begin, Inside, Outside) labeling to the hexadecimal sequence character by character, using both protocol header and protocol payload labeling and protocol header field named entity labeling; the protocol header, which follows a protocol specification, is first separated from the protocol payload, which does not, and the composition specification of the protocol header is then reversely analyzed.

Further, the step 3 specifically includes:

Step 3.1: traversing the data in the data set to generate a character vector conversion dictionary, defining the vector conversion for hexadecimal characters, and predefining other values that may occur;

the character vector conversion dictionary Dic gives a key-value pair representation of hexadecimal characters, where p is the key; <UNK> represents other characters outside the dictionary, and <PAD> represents a padding character;

Step 3.2: after a single character is mapped to a character vector, the generated One-hot vector of the original character is used as the input value x and is input into the forward LSTM network and the backward LSTM network respectively; for a single LSTM network, the output value $O_t$ at time step $t$ is:

$$O_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{3}$$

where $\sigma$ denotes processing by the Sigmoid function in the gate structure; $h_{t-1}$ is the output value at the previous moment, $x_t$ is the input value at the current moment, and $W_o$ and $b_o$ are the weight and bias coefficient, which are determined experimentally;

Step 3.3: taking the spliced output value vector as the hidden state vector of the character, where the splicing process is as follows:

the final outputs of the two directions of the bidirectional long short-term memory network are combined and output; each direction has e LSTM structure models, and the final output value h is expressed as:

$$h = [h_{fe}, h_{be}] \tag{4}$$

where $h_{fe}$ and $h_{be}$ are the output value of the forward LSTM network structure and the output value of the backward LSTM network structure, respectively;

after processing by the bidirectional long short-term memory network, the probability matrix P of each label for the current character is obtained; the original input sequence is denoted X, and Y denotes the label sequence corresponding to the prediction result output by the bidirectional long short-term memory network; both are expressed as:

$$X = \{x_1, x_2, \ldots, x_n\} \tag{5}$$

$$Y = \{y_1, y_2, \ldots, y_n\} \tag{6}$$

where n is the number of characters in the sequence;

the number of labels in the label set l is denoted by k, so the output matrix P has dimension n × k, and the entry in the i-th row and j-th column is defined as:

$$P_{n \times k} = \{P_{ij} : x_i \rightarrow l_j \mid 1 \le i \le n,\ 1 \le j \le k\} \tag{7}$$

where $P_{ij}$ represents the probability value that character i is predicted as label j, and $l_j$ represents the j-th label in the label set;

the mapped probabilities are input to the next layer as unnormalized probabilities.

Further, the step 4 specifically includes:

Step 4.1: the prediction result of the i-th character is denoted $\delta_i$ and the probability of transition to label j is denoted $p_{ij}$; the prediction result is expressed as:

$$\delta_i = \max(p_{ij}), \quad 1 \le i \le n,\ 1 \le j \le k \tag{8}$$

Step 4.2: adding Start and End identifiers before and after the prediction sequence as columns, adding the new label types Start and End as rows, and establishing the constrained conditional probability transition matrix $A_{(k+2) \times (k+2)}$;

Step 4.3: the probability s(X, y) of a possible label sequence y for the input sequence X is calculated jointly from $P_{n \times k}$ and $A_{(k+2) \times (k+2)}$, combining the transition probability from label i to label i+1 between adjacent characters under real conditions with the actual probability that a single character is predicted as label i;

the probability value is normalized to give p(y | X), where $Y_{allPossible}$ denotes the set of all possible predicted label sequences;

the loss function Loss is defined as the negative log-likelihood, and the output predicted label sequence is obtained by a dynamic programming algorithm;

thus, the output of the bidirectional long short-term memory network is input into a linear-chain conditional random field (CRF) layer to obtain the final prediction sequence y.
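For orientation, the quantities named in step 4.3 are commonly written as below in a linear-chain BiLSTM-CRF. This is a sketch that assumes the standard formulation; it is not taken from the formulas of the invention. The symbols $\tilde{y}$ and $y^{*}$ and the boundary convention $y_0 = \text{Start}$, $y_{n+1} = \text{End}$ are introduced here for illustration only.

```latex
% Assumed standard linear-chain CRF formulation (illustrative, not the source's own formulas)
s(X, y) = \sum_{i=0}^{n} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}

p(y \mid X) = \frac{\exp\bigl(s(X, y)\bigr)}
                   {\sum_{\tilde{y} \in Y_{allPossible}} \exp\bigl(s(X, \tilde{y})\bigr)}

Loss = -\log p(y \mid X)
     = \log \sum_{\tilde{y} \in Y_{allPossible}} \exp\bigl(s(X, \tilde{y})\bigr) - s(X, y)

y^{*} = \arg\max_{\tilde{y} \in Y_{allPossible}} s(X, \tilde{y})
```

Under this assumption the first sum collects the label-to-label transition terms from $A$, the second sum collects the per-character terms from $P$, and the last line is the maximization that a dynamic programming algorithm such as Viterbi decoding solves.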

Further, the step 6 specifically includes:

Step 6.1: expressing the semantic composition of the protocol using the field named entity definitions given in step 1;

Step 6.2: describing, in units of bytes, the specific composition of the different field types according to the field composition characteristics of each protocol;

Step 6.3: giving the value ranges of the byte characters; for all network protocols a common formal representation is given, in which the protocol name is replaced by Protocol, and Index, Type and Function denote the type definitions of the index field, type field and function field respectively; the paradigm is as follows:

$$\text{Protocol} = \{F \mid O\} \tag{12}$$

the braces in the paradigm denote a set of strings composed of [0, +∞] characters;

a separate marker indicates a field type that is not mandatory;

therefore, the composition and semantic information of the unknown protocol are formally described.

Compared with the prior art, the invention has the beneficial effects that:

1) To raise the degree of automation in the reverse analysis of unknown protocols, the invention introduces the named entity recognition method from NLP and uses a deep learning method in which a BiLSTM learns the context semantic information existing between fields in the protocol sequence, thereby extracting field structure and semantic information;

2) The invention outputs the prediction probabilities to a conditional random field (CRF), which avoids unreasonable predictions within a field and improves accuracy;

3) The invention serializes the protocol, which removes the distinction between text protocols and binary protocols; the analysis is carried out without distinguishing protocol types, which widens the scope of application.

Drawings

Fig. 1 is a system architecture diagram of the method NEREUP of the present invention.

Fig. 2 is an example of a protocol sequence labeled with the BIO labeling method.

Fig. 3 is a structural diagram of the bidirectional LSTM network.

Fig. 4 is a schematic diagram of the operation between any two adjacent LSTM units in the BiLSTM model.

Fig. 5 is a comparison between the trained agent and other methods in segmenting application layer protocol header and protocol payload data.

Fig. 6 is a schematic diagram of model convergence over training rounds for different hidden layer dimensions, used to select training hyper-parameters when analyzing the application layer protocol header format.

Fig. 7 is a schematic diagram of model convergence over training rounds for different numbers of batches, used to select training hyper-parameters when analyzing the application layer protocol header format.

Fig. 8 is a schematic diagram of model convergence over training rounds for different learning rates, used to select training hyper-parameters when analyzing the application layer protocol header format.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

The system structure of the method is shown in Fig. 1. The method consists of three parts: data preprocessing, agent training based on BiLSTM and CRF, and reverse format inference of unknown protocols.

Step 1: a large amount of data containing different application layer protocols is collected, and the protocol header fields are extracted to gather the header components of known application layer protocols. Manual analysis shows that the field composition logic in a protocol header usually follows a certain ordering. Observing the protocol header sequence from front to back, the fields tend to be arranged from global to local. The leading fields usually describe the application layer protocol sequence as a whole, such as its sequence number and length, and represent basic information about the sequence. The following fields usually represent, in order, the control information carried by the data transmission, including content indicating protocol functions and basic control information. The INDEX, TYPE and FUNCTION fields are defined to represent the protocol specification. The network protocol field types are defined as follows:

1) The index field (INDEX) is the most direct identification field for distinguishing different protocol sequences within the same application layer protocol. It usually represents the sequence number of the protocol sequence or the sequence number of the object on which the protocol acts. By distinguishing the index fields of the same protocol, the uniqueness of a protocol sequence can be marked, or protocol sequences that process the object with the same sequence number can be told apart.

2) The type field (TYPE) indicates the basic properties of the current protocol. Such fields typically express semantics in binary form, and a single binary character can constitute a field. When transmitting information, a meaning is often expressed with only one binary bit ("0" or "1") or two binary bits ("0" to "3"). The type field therefore contains the state information of the protocol and semantic information that can be conveyed as "yes" or "no". Consecutive type fields are often merged into a "flag" field at parsing time so that they can be expressed in byte form.

3) The function field (FUNCTION) mostly appears in the latter half of the protocol and generally carries information that is relatively intuitive in the data and can act directly. This type of field covers two cases. The first is similar to the TYPE field: a limited set of numbers is used to express the information to be transmitted, and each number alone represents a particular type of function. Such fields are mostly expressed in units of bytes; when more bits are allocated than needed, the front part of the field holds a large number of "0"s and only the last few bits carry the value. This leaves room for protocol updates that need to express more data types. The second case is more straightforward: the decimal value of the field is itself the important data that the field transmits. This representation likewise uses far more bytes than the number of bits required, which leaves a margin for rapid growth of the transmitted data and avoids values that the protocol cannot carry.

Step 2: capturing network flow under a real network environment, classifying and selecting different application layer protocols, extracting binary protocol sequences of the protocols, and performing data preprocessing on original binary protocol sequences.

PCAP packets of traffic collected at the ports and routing nodes of the secure network are selected. Statistics are gathered over all PCAP packets to count the application layer protocols from which a data set can be built. Each protocol subset is classified by application layer protocol type; g denotes the number of application layer protocol types and m denotes the number of protocols in each Packet, and the finally derived Packet sets containing a single protocol are expressed in terms of the d-th protocol of the g-th application layer protocol.

The original binary protocol sequence of each application layer protocol is extracted and converted to hexadecimal, which shortens the sequence to 1/4 of its original length and reduces the size of the data set while preserving the semantic and field structure information. The hexadecimal protocol sequence is then labeled with the BIO (B-begin, I-inside, O-outside) label set, using both protocol header and protocol payload labeling and protocol header field named entity labeling: the protocol header, which follows a protocol specification, is first separated from the protocol payload, which does not, and the composition specification of the protocol header is then reversely analyzed.

The label sets are shown in Tables 1 and 2. Table 1 gives the label set for segmenting protocol header and protocol payload, which separates out the protocol header data that needs to be reversely parsed. Table 2 gives the label set for the protocol header specification fields, which labels the protocol fields according to the composition specification. An example of the labeling is shown in Fig. 2.

Table 1. Protocol header and protocol payload segmentation labels

Table 2. Protocol header specification field labels
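The hexadecimal conversion and character-level BIO labeling described in step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `to_hex_chars` and `bio_label`, the example header bytes and the field spans are hypothetical, and the label names simply combine the BIO prefixes with the INDEX, TYPE and FUNCTION entity names from step 1.

```python
def to_hex_chars(raw: bytes) -> list[str]:
    """Convert a raw protocol sequence into a list of hexadecimal characters."""
    return list(raw.hex())


def bio_label(num_chars: int, fields: list[tuple[int, int, str]]) -> list[str]:
    """Assign one BIO label per hexadecimal character.

    `fields` holds (start, end, entity) spans in character units, end exclusive.
    Characters not covered by any span keep the label "O".
    """
    labels = ["O"] * num_chars
    for start, end, entity in fields:
        labels[start] = f"B-{entity}"
        for pos in range(start + 1, end):
            labels[pos] = f"I-{entity}"
    return labels


# Hypothetical 4-byte header: 2-byte index, 1-byte type, 1-byte function.
chars = to_hex_chars(bytes([0x00, 0x2A, 0x01, 0x10]))
labels = bio_label(
    len(chars),
    [(0, 4, "INDEX"), (4, 6, "TYPE"), (6, 8, "FUNCTION")],
)
for ch, lab in zip(chars, labels):
    print(ch, lab)
```

Each hexadecimal character stands for four bits, which is why the hexadecimal sequence is one quarter the length of the binary sequence.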

Step 3: a character vector conversion dictionary is built for the preprocessed data, and each converted protocol sequence character is mapped to a specific value; the generated One-hot vectors of the original characters are used as the input of a bidirectional long short-term memory network, the input layer feeds the values into the forward and backward long short-term memory units respectively, and the features and information of the previous and next moments are combined and spliced to obtain the output layer vector, which is output as the hidden state vector of the character.

A bidirectional long short-term memory network (BiLSTM network) is composed of a forward LSTM group and a backward LSTM group. Its structure is shown schematically in Fig. 3. The circles in the figure represent One-hot vectors of the original hexadecimal characters; before conversion, the data in the data set must be traversed to generate the character vector conversion dictionary.

The character vector conversion dictionary Dic gives a key-value pair representation of hexadecimal characters, where p is the key; <UNK> represents other characters outside the dictionary, and <PAD> represents a padding character.
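A character vector conversion dictionary of this kind, with `<UNK>` and `<PAD>` entries, might look like the sketch below. The concrete index assignment, the fixed hexadecimal alphabet (the text builds the dictionary by traversing the data set instead) and the helper names `build_dictionary`, `one_hot` and `encode` are assumptions made for illustration.

```python
HEX_CHARS = "0123456789abcdef"


def build_dictionary() -> dict[str, int]:
    """Map <PAD>, <UNK>, and each hexadecimal character to an integer index."""
    dic = {"<PAD>": 0, "<UNK>": 1}
    for ch in HEX_CHARS:
        dic[ch] = len(dic)
    return dic


def one_hot(ch: str, dic: dict[str, int]) -> list[int]:
    """Return the one-hot vector of a character; unseen characters map to <UNK>."""
    vec = [0] * len(dic)
    vec[dic.get(ch, dic["<UNK>"])] = 1
    return vec


def encode(chars: str, dic: dict[str, int], length: int) -> list[list[int]]:
    """One-hot encode a sequence, truncating or padding with <PAD> to `length`."""
    padded = list(chars[:length]) + ["<PAD>"] * max(0, length - len(chars))
    return [one_hot(ch, dic) for ch in padded]


dic = build_dictionary()
vectors = encode("2a0g", dic, length=6)
# Position of the 1 in each one-hot vector: "g" is not hexadecimal, so it maps to <UNK>.
print([vec.index(1) for vec in vectors])
```

Predefining `<UNK>` keeps the encoder usable on data containing characters that were not seen when the dictionary was built, which matches the portability goal stated for the vector conversion.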

The hyper-parameters of the BiLSTM network are set as follows:

1) The number of samples in a single network input (mini-batch) is 64;

2) The dimension of the hidden layer is set to 300;

3) The update interval of the BiLSTM network is 40 updates;

4) The learning rate is set to 0.001;

5) Gradient clipping is set to 5.

To keep the model portable across different data, the vector conversion is defined for hexadecimal characters, and other values that may occur must be predefined. After a single character is mapped to a character vector, the value x is input into the forward LSTM network and the backward LSTM network respectively; for a single LSTM network, the output value $O_t$ at time step $t$ is:

$$O_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{3}$$

In equation (3), $\sigma$ denotes processing by the Sigmoid function in the gate structure, $h_{t-1}$ is the output value at the previous moment, $x_t$ is the input value at the current moment, and $W_o$ and $b_o$ are the weight and bias coefficient, which need to be determined experimentally. After a character has been processed by the forward and backward LSTM networks, the spliced output value vector is used as the hidden state vector of the character and output to the CRF layer. The overall splicing flow is shown in Fig. 4. The final outputs of the LSTMs in the two directions are combined and output; each direction has e LSTM structure models, and the final output value h is expressed as:

$$h = [h_{fe}, h_{be}] \tag{4}$$

where $h_{fe}$ and $h_{be}$ are the output value of the forward LSTM network structure and the output value of the backward LSTM network structure, respectively.
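Equations (3) and (4) can be illustrated numerically with the short sketch below. It covers only the output gate and the final concatenation; a complete LSTM cell also has input and forget gates and a cell state, which are omitted here. The function names and the toy weights are hypothetical and chosen only so the result is easy to check by hand.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def output_gate(W_o, b_o, h_prev, x_t):
    """Equation (3): O_t = sigmoid(W_o . [h_{t-1}, x_t] + b_o)."""
    concat = h_prev + x_t  # list concatenation forms [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_o, b_o)
    ]


def bidirectional_state(h_fe, h_be):
    """Equation (4): h = [h_fe, h_be], forward output followed by backward output."""
    return h_fe + h_be


# Toy values for illustration: hidden size 2, input (one-hot) size 3.
h_prev = [0.0, 0.0]
x_t = [1, 0, 0]
W_o = [[0.0] * 5, [0.0] * 5]  # one row per hidden unit, one column per concat entry
b_o = [0.0, 0.0]

gate = output_gate(W_o, b_o, h_prev, x_t)
hidden = bidirectional_state([0.1, 0.2], [0.3, 0.4])
print(gate)    # all-zero weights and biases give sigmoid(0) = 0.5 per unit
print(hidden)
```

With a hidden layer dimension of 300 per direction, as in the hyper-parameter list, the concatenated hidden state vector of each character would have 600 entries.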

Specifically, after processing by the bidirectional long short-term memory model, the probability matrix P of each label for the current character is obtained. The original input sequence is X, and Y denotes the label sequence corresponding to the prediction result output by the bidirectional LSTM model. Both can be expressed as:

$$X = \{x_1, x_2, \ldots, x_n\} \tag{5}$$

$$Y = \{y_1, y_2, \ldots, y_n\} \tag{6}$$

where n is the number of characters in the sequence. The number of labels in the label set l is denoted by k, so the output matrix P has dimension n × k, and the entry in the i-th row and j-th column is defined as:

$$P_{n \times k} = \{P_{ij} : x_i \rightarrow l_j \mid 1 \le i \le n,\ 1 \le j \le k\} \tag{7}$$

where $P_{ij}$ represents the probability value that character i is predicted as label j, and $l_j$ represents the j-th label in the label set.

The mapped probabilities are passed to the next layer as unnormalized probabilities.

Step 4: the hidden state vector of each character is mapped to predicted probability values over all label types; the output layer vector is input into a conditional random field (CRF), and a final probability correction is performed in combination with the transition probabilities of characters in real-time data to obtain the predicted label probability corresponding to the named entity class, completing the training of the agent.

Generally speaking, once the probability matrix P is obtained, selecting for each character the label type with the maximum transition probability as its prediction result already achieves a good effect. Here the prediction result of the i-th character is denoted $\delta_i$ and the probability of transition to label j is denoted $p_{ij}$; the prediction result can be expressed as:

$$\delta_i = \max(p_{ij}), \quad 1 \le i \le n,\ 1 \le j \le k \tag{8}$$
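The per-character selection in equation (8) amounts to a row-wise maximum over the probability matrix. A minimal sketch, with a hypothetical three-label set and made-up probability values:

```python
def greedy_decode(P, labels):
    """For each character i, pick the label j with the largest P[i][j]."""
    decoded = []
    for row in P:
        best_j = max(range(len(row)), key=row.__getitem__)
        decoded.append(labels[best_j])
    return decoded


# Hypothetical label set and probability matrix (n = 3 characters, k = 3 labels).
labels = ["B-INDEX", "I-INDEX", "O"]
P = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
print(greedy_decode(P, labels))
```

In this made-up example the third character is labeled I-INDEX directly after an O, which is not a valid BIO continuation. Choosing each label independently cannot prevent such sequences.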

Considering the continuity of fields of the same type that need to be labeled, a conditional random field is added to correct $P_{n \times k}$. First, Start and End identifiers are added before and after the prediction sequence as columns, the new label types Start and End are added as rows, and the constrained conditional probability transition matrix $A_{(k+2) \times (k+2)}$ is established. Then the probability s(X, y) of a possible label sequence y for the sequence X is calculated jointly from $P_{n \times k}$ and $A_{(k+2) \times (k+2)}$, in which the contribution of $P_{n \times k}$ represents the actual probability that a single character is predicted as label i.

The probability value is then normalized to give p(y | X), where $Y_{allPossible}$ denotes the set of all possible predicted label sequences.

The loss function Loss is defined as the negative log-likelihood, and the output predicted label sequence is obtained by a dynamic programming algorithm.

The output of the bidirectional long short-term memory network is thus input into a linear-chain conditional random field (CRF) layer, which adds constraints to the output label sequence so that, on the basis of the output sequence Y, the final result also satisfies the internal transition relations between labels; unreasonable predicted sequence combinations are discarded, and the final prediction sequence y is obtained.
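The dynamic programming step can be sketched with Viterbi decoding over the emission matrix P and a (k+2) × (k+2) transition matrix A. This assumes the usual additive scoring of linear-chain CRFs and is not the implementation of the invention. The label set, the hand-set transition penalties and the function name `viterbi` are hypothetical; in a trained model, A would be learned.

```python
def viterbi(P, A):
    """Return the label index sequence with the highest emission plus transition score.

    P is an n x k matrix of per-character label scores.
    A is a (k+2) x (k+2) transition matrix; index k is Start and k+1 is End.
    """
    n, k = len(P), len(P[0])
    start, end = k, k + 1
    # score[j] = best score of any path ending in label j at the current character
    score = [A[start][j] + P[0][j] for j in range(k)]
    back = []
    for i in range(1, n):
        prev, score, pointers = score, [], []
        for j in range(k):
            best = max(range(k), key=lambda b: prev[b] + A[b][j])
            score.append(prev[best] + A[best][j] + P[i][j])
            pointers.append(best)
        back.append(pointers)
    last = max(range(k), key=lambda j: score[j] + A[j][end])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["B-INDEX", "I-INDEX", "O"]
k = len(labels)
NEG = -1e4  # large penalty standing in for a forbidden transition
A = [[0.0] * (k + 2) for _ in range(k + 2)]
A[2][1] = NEG  # O -> I-INDEX is not a valid BIO transition
A[k][1] = NEG  # Start -> I-INDEX is not valid either

P = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
decoded = [labels[j] for j in viterbi(P, A)]
print(decoded)
```

With these made-up values, per-character maximization would choose B-INDEX, O, I-INDEX. The transition penalty rules out the O to I-INDEX step, so the decoded sequence becomes B-INDEX, O, O.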

Step 5: inputting the unknown protocol sequence to be reversely analyzed into the intelligent agent after the conversions of step 2 and step 3, and obtaining the named entity recognition result of the unknown protocol through prediction.
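The conversion applied to an unknown sequence before prediction can be sketched as follows: the raw bytes are turned into hexadecimal characters, and each character is looked up in the character vector conversion dictionary. The concrete index values, the fixed length `max_len`, and the function name `encode_sequence` are hypothetical; the text only states that the dictionary contains the hexadecimal characters together with an unknown-character entry and a padding entry.

```python
HEX_CHARS = "0123456789abcdef"

# Hypothetical index assignment for the conversion dictionary.
DIC = {"<PAD>": 0, "<UNK>": 1}
DIC.update({c: i + 2 for i, c in enumerate(HEX_CHARS)})


def encode_sequence(raw: bytes, max_len: int) -> list:
    """Convert raw bytes to hexadecimal characters, then to dictionary indices.

    Sequences longer than max_len are truncated; shorter ones are padded.
    """
    chars = raw.hex()  # two lowercase hexadecimal characters per byte
    ids = [DIC.get(c, DIC["<UNK>"]) for c in chars[:max_len]]
    ids += [DIC["<PAD>"]] * (max_len - len(ids))
    return ids


print(encode_sequence(b"\x47\x45", 6))  # [6, 9, 6, 7, 0, 0]
```

The resulting index list is what an embedding layer would consume; the predicted labels for padded positions would be ignored.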

The protocol header and the payload information it carries need to be separated so that the protocol header data can be extracted for the subsequent protocol format inference. After the protocol header data is obtained, reverse format inference is performed to achieve the final parsing purpose. A data set in which application layer protocol headers and payload information are separated is selected, each hexadecimal character of 2101395 pieces of labeled protocol data is labeled, and the established model is then used for training and prediction. The results are shown in fig. 5. In the figure, the horizontal axis represents four evaluation indexes, the vertical axis represents the calculated percentage, and each horizontal-axis value contains the experimental data of several methods. It can be seen that the evaluation indexes of the method on this data set all exceed 95%, ahead of the other classical models. To show the effect of applying named entity recognition to protocol sequence data, an accuracy index is additionally introduced to evaluate the experimental results. Accuracy is defined as follows:

Unlike the other evaluation indexes, accuracy is computed on the predicted label type of each character rather than on the region identified as a named entity; that is, the predicted label sequence is compared with the correctly labeled label sequence to obtain the final ratio. As can be seen from the figure, the accuracy values in the results of the invention are all higher than the values of the other evaluation indexes. This indicates that the finally obtained named entity label sequence is very similar to the correct label sequence and accurately separates the protocol header data from the protocol payload information.
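The character-level comparison described above can be sketched as a simple ratio. This assumes accuracy is the number of characters whose predicted label equals the correct label divided by the total number of characters; the label names in the example are hypothetical.

```python
def char_accuracy(pred, gold):
    """Fraction of characters whose predicted label matches the correct label."""
    if len(pred) != len(gold):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(pred, gold) if p == g)
    return correct / len(gold)


# Hypothetical header/payload labels for four hexadecimal characters.
pred = ["B-Header", "I-Header", "I-Header", "O"]
gold = ["B-Header", "I-Header", "O", "O"]

print(char_accuracy(pred, gold))  # 0.75
```

A single mislabeled character lowers this ratio only slightly, whereas it makes the whole entity region count as wrong under region-based indexes, which is consistent with accuracy being the highest of the reported values.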

Then, after the field sequence of the protocol header is separated from the payload information carried in the protocol, the format of the fields contained in the protocol header must be reversely inferred. 2119796 pieces of protocol header data produced during data preprocessing are selected, and field-type named entity recognition training and prediction are carried out on the protocol header sequences. To obtain the optimal hyper-parameters for training, several groups of experiments are designed, each varying a single variable, and the experimental results are shown in figs. 6, 7 and 8. After the optimal hyper-parameters are selected, model training and prediction are performed on the protocol header data, and the final evaluation index values are shown in table 3. It can be seen that the present invention performs better on unknown protocol parsing than the various models it is compared with. The high accuracy value shows that field structures are recognized and divided well across the named entities of the various field types, so the format and field semantic information of a complete application layer protocol header message sequence can be reversely inferred through the division of field types.

Table 3 Protocol header field format inference results
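The division of a header into typed fields from a predicted BIO label sequence can be sketched as follows. The label spellings (`B-Index`, `I-Type` and so on) and the sample header characters are hypothetical; only the three field types Index, Type and Function come from step 1.

```python
def bio_to_fields(chars, tags):
    """Group characters into (field_type, hex_string) pairs from BIO tags.

    An I- tag that does not continue a field of the same type is treated
    as outside any field and closes the current field.
    """
    fields = []
    current_type, current_chars = None, []
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current_type is not None:
                fields.append((current_type, "".join(current_chars)))
            current_type, current_chars = tag[2:], [ch]
        elif tag.startswith("I-") and tag[2:] == current_type:
            current_chars.append(ch)
        else:
            if current_type is not None:
                fields.append((current_type, "".join(current_chars)))
            current_type, current_chars = None, []
    if current_type is not None:
        fields.append((current_type, "".join(current_chars)))
    return fields


chars = "00a1010c"  # hypothetical header, one tag per hexadecimal character
tags = ["B-Index", "I-Index", "I-Index", "I-Index",
        "B-Type", "I-Type",
        "B-Function", "I-Function"]

print(bio_to_fields(chars, tags))
# [('Index', '00a1'), ('Type', '01'), ('Function', '0c')]
```

Each recovered pair gives the field type and its length in hexadecimal characters, which is the information needed to write down the header format.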

Natural language is not sufficiently standardized to describe strict protocol specifications. The extended Backus-Naur form, a standardized notation for describing programming languages, is therefore modified to express protocol specifications and is defined here. First, the general composition areas of the protocol are expressed semantically using the three field types given in step 1. Next, the specific composition of the different types of fields is described in units of bytes according to the field composition characteristics of each protocol. Finally, the value conditions of the byte characters are given. To ensure that the improved form can fully describe the field and semantic information, a generic formal representation is given for all network protocols. The protocol name is replaced by "Protocol", and Index, Type and Function represent the field type definitions proposed in step 1. The form is as follows:

Protocol＝{F|O} (13)

Unlike in mathematical formulas, the braces in this form denote a set of strings composed of [0, +∞] characters. In particular,

indicates an optional field type, i.e., a field area that can be added or removed as the protocol type changes within the same kind of protocol. In this way, the composition and semantic information of the unknown protocol are formally described.

Claims

1. An unknown protocol reverse analysis method based on named entity recognition, characterized by comprising the following steps:

step 1: collecting data information containing different application layer protocols, inducing and summarizing the field components of the headers of known application layer protocols, identifying the types of the component fields of existing known application layer protocols, and defining field named entities;

step 2: capturing network flow under a real network environment, classifying and selecting different application layer protocols, extracting a binary protocol sequence of the protocols, and performing data preprocessing on the original binary protocol sequence;

step 4: mapping the hidden state vector of each character to predicted probability values over all label types; inputting the vector of the output layer into a conditional random field (CRF), performing final probability correction in combination with the transition probabilities of the characters in the actual data to obtain the predicted label probability corresponding to each named entity class, and completing the training of the intelligent agent;

step 5: inputting the unknown protocol sequence to be reversely analyzed into the intelligent agent after the conversions of step 2 and step 3, and obtaining the named entity recognition result of the unknown protocol through prediction;

2. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 1, wherein said field named entity definition specifically comprises:

1) An index field: the most direct identification field for distinguishing different protocol sequences within the same application layer protocol; it represents the serial number of the protocol sequence or the serial number of the object on which the protocol acts;

2) A type field: indicating basic properties of the current protocol and expressing semantics in binary form, a field being formed by a single binary character; when information is transmitted, meaning is expressed only by one binary bit, '0' or '1', or by two binary bits, '0' to '3'; the field contains state information of the protocol and semantic information that can be conveyed as yes or no; during parsing, consecutive type fields are merged into a 'flag' field and expressed in byte form;

3) A function field: carrying information in the data that is relatively intuitive and takes effect directly; in one form, the information to be transmitted is expressed in a limited number of digits, each digit on its own representing a particular type of function; in the other form, the decimal value of the field itself is the important data transmitted by the field, and more bytes than the required number of digits are used to carry the data.

3. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 1, wherein the step 2 is specifically:

wherein ,

represents the d-th protocol in the g-th application layer protocol;

step 2.2: extracting the original binary protocol sequence of each application layer protocol, converting it into hexadecimal, and carrying out BIO (Begin, Inside, Outside) labeling on the converted hexadecimal sequence character by character, using both protocol header and protocol payload labeling and protocol header field named entity labeling; the protocol header, which has a protocol specification, and the protocol payload, which has none, are first separated, and the composition specification of the protocol header is then reversely parsed.

4. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 1, wherein said step 3 is specifically:

the character vector conversion dictionary Dic is defined as follows:

where p is a key in the key-value pair representation of the hexadecimal characters; <UNK> represents other characters outside the dictionary, and <PAD> represents the padding character;

$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (3)

where $\sigma$ denotes the Sigmoid function applied when passing through a gate structure; $h_{t-1}$ is the output value at the previous moment, $x_t$ is the input value at the current moment, and $W_o$ and $b_o$ are the weight and bias coefficients, which are determined through experiments;

$h = [h_{fe}, h_{be}]$ (4)

$X = \{x_1, x_2, \dots, x_n\}$ (5)

$Y = \{y_1, y_2, \dots, y_n\}$ (6)

in the formula, n is the number of characters in the sequence;

$P_{n \times k} = \{P_{ij} : x_i \to l_j \mid 1 \le i \le n,\ 1 \le j \le k\}$ (7)

in the formula, $P_{ij}$ represents the probability value that character i is predicted as label j; $l_j$ represents the j-th label in the label set; the mapped probability is input to the next layer as a non-normalized probability.

5. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 4, wherein said step 4 is specifically:

$\delta_i = \max(p_{ij}), \quad 1 \le i \le n,\ 1 \le j \le k$ (8)

wherein ,

normalizing the probability value of each label, and outputting p (y | X) by using Softmax:

where $Y_{allPossible}$ represents the set of all possible predicted label sequences;

representing a predictable label;

the loss function Loss is defined as the negative log-likelihood, and the output predicted label sequence is obtained through a dynamic programming algorithm

6. The unknown protocol reverse parsing method based on named entity recognition as claimed in claim 1, wherein said step 6 is specifically:

step 6.1: expressing the protocol in a composition area of semantic expression by using the field named entity definition given in the step 1;

step 6.2: explaining the specific forming form of different types of fields by taking bytes as units according to the field forming characteristics of each protocol;

Protocol＝{F|O} (12)

the braces in the form denote a set of strings composed of [0, +∞] characters;

indicating an optional field type;