CN112039196A

CN112039196A - Power monitoring system private protocol analysis method based on protocol reverse engineering

Info

Publication number: CN112039196A
Application number: CN202010321746.8A
Authority: CN
Inventors: 汪杰; 钟志明; 刘沛林; 郑惠芳; 卫敬宜
Original assignee: Guangdong Power Grid Co Ltd; Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd; Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2020-04-22
Filing date: 2020-04-22
Publication date: 2020-12-04

Abstract

The invention discloses a private protocol analysis method for power monitoring systems based on protocol reverse engineering. The method first directs the traffic of a private protocol into a test system for analysis through bypass traffic mirroring. A protocol reverse module then uses a local sequence alignment algorithm and the unweighted pair group method with arithmetic mean (UPGMA) to perform initial clustering of the private protocol's messages. A multiple sequence alignment algorithm separates the variable fields and invariant fields of the protocol messages to obtain their structural information, and a search algorithm locates semantic fields in the messages to obtain protocol format information. Using the structural and semantic information of the messages, the method obtains the state transition sequence of each session, constructs a state prefix tree, merges redundant states, and then optimizes the state prefix tree to obtain the final minimal deterministic protocol state machine. Applied to security monitoring of private protocol network traffic, the method is of significance for ensuring the safe and stable operation of the power grid.

Description

Power monitoring system private protocol analysis method based on protocol reverse engineering

Technical Field

The invention relates to the technical field of power monitoring, in particular to a power monitoring system private protocol analysis method based on protocol reverse engineering.

Background

With the rapid advance of informatization and intelligence in the power grid, power monitoring systems have become a key target of network attacks. In recent years the power grids of Ukraine, Venezuela and Argentina have suffered network attacks, and the resulting large-scale national blackouts are a serious warning. To ensure that the power industry operates safely, reliably, accurately and economically, security incidents like the Ukrainian attack must be defended against and isolated in advance.

We have analyzed the problems of existing anti-virus gateways as follows. Virus detection decides whether something is malware or a virus on the basis of a segment of virus code, so any data stream propagated on the network must be reassembled into a file before the anti-virus scanning engine can examine it. For a gateway anti-virus product to filter viruses in network traffic, the data in the traffic must therefore be reassembled into files, i.e. restored at OSI layer 7. The problem is that the application protocols and applications of power monitoring systems are among the industrial applications that no anti-virus gateway in the industry can currently identify and inspect, which is why network viruses run unchecked and many dormant virus remnants currently propagate in the network. Moreover, industrial viruses do not take a single form. The Ukrainian incident, for example, shows that virus intrusion into the application systems of a power monitoring system is one of the current means of network attack. Such malicious agents exhibit no unauthorized behavior in a normal environment and remain dormant, so the related security systems cannot detect their malicious behavior. The malicious code is awakened only when triggered by a specific application environment, and an anti-virus gateway that can scan only HTTP traffic and applications provides no protection against it. A countermeasure is therefore needed, namely comprehensive threat analysis of power monitoring system applications, in order to achieve a genuinely effective security protection solution for the power monitoring system.

In security monitoring and analysis of the network traffic of a power monitoring system, industrial control protocol messages must be analyzed in depth. Besides well-known protocols such as IEC 60870-5-101/102/103/104, IEC 61850, Modbus and DNP3, a large number of private protocols are widely used in power grids, for example the State Grid variant of IEC 104, the IEC 103 variants extended by individual equipment manufacturers, and the configuration protocols of Schneider switches and PLCs supplied by foreign manufacturers for primary power systems. To analyze these private protocols, parsing code can be written specifically for each one if its protocol specification is available, but adapting them one by one consumes a great deal of time, and the code must be modified whenever the other party changes or upgrades the protocol specification. If the protocol specification is not available, the analysis cannot be carried out at all.

Disclosure of Invention

Therefore, the invention provides a power monitoring system private protocol analysis method based on protocol reverse engineering, which aims to solve the problems in the prior art.

In order to achieve the above object, an embodiment of the present invention provides the following:

in an aspect of the embodiments of the present invention, a private protocol analysis method for a power monitoring system based on protocol reverse engineering is provided, which includes the following steps:

step 100, importing the traffic of the private protocol into a test system for analysis through bypass traffic mirroring;

step 200, a protocol reverse module of the test system uses a local sequence alignment algorithm and the unweighted pair group method with arithmetic mean (UPGMA) to perform initial clustering on the protocol messages of the private protocol;

step 300, separating the variable fields and invariant fields of the protocol messages through a multiple sequence alignment algorithm, and further analyzing to obtain the structural information of the protocol messages;

step 400, searching for semantic fields in the protocol messages through a custom search algorithm to obtain protocol format information;

step 500, obtaining the state transition sequence in each session by using the structural information and semantic information of the protocol messages, constructing a state prefix tree, merging redundant states, and then optimizing the state prefix tree to obtain the final minimal deterministic protocol state machine.

As a preferred scheme of the present invention, the protocol reverse method of the protocol reverse module is used to obtain the structure information, semantic information, state information and context information of a protocol message.

As a preferred scheme of the invention, the multiple sequence alignment algorithm comprises a globally optimal pairwise sequence alignment algorithm, namely the Needleman-Wunsch algorithm, which comprises the following steps:

step 301, similarity scoring: for two sequences of lengths $m$ and $n$ respectively, the algorithm first constructs a similarity matrix $S$ of size $(n+1)\times(m+1)$, with subscripts denoted $i$ and $j$ respectively;

step 302, scoring and summing: iteratively summing the similarity matrix $S$ according to the summation formula to obtain a new matrix $M$, wherein the gap penalty $w$ is set to 0;

step 303, optimal backtracking: tracing back from the matrix element with the highest summation score to the initial position; at each step the elements to the left of, diagonally upper-left of, and above the current element are considered, and the trace moves to the adjacent element with the highest summation score; when the three are equal, the trace preferentially moves to the upper-left diagonal element; a gap is inserted in the vertical sequence if the trace moves left; a gap is inserted in the horizontal sequence if the trace moves up; otherwise, no operation is performed.
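For reference, the summation of step 302 can be written in the usual Needleman-Wunsch form. This is the textbook recurrence expressed with the symbols $S$, $M$ and $w$ used above; the boundary values and the exact layout of the formula in the original filing are assumptions.

```latex
M_{i,j} = \max
\begin{cases}
M_{i-1,j-1} + S_{i,j} \\
M_{i-1,j} + w \\
M_{i,j-1} + w
\end{cases}
\qquad 1 \le i \le n,\; 1 \le j \le m,
\qquad M_{i,0} = i\,w,\quad M_{0,j} = j\,w,\quad w = 0
```

With $w = 0$ and a 0/1 similarity score, $M_{n,m}$ equals the length of the longest common subsequence of the two sequences.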

As a preferred embodiment of the present invention, the search algorithm in step 400 realizes protocol format extraction by progressive multiple sequence alignment, and specifically includes the following steps:

step 401, calculating a distance matrix: the Smith-Waterman algorithm is used to find the locally optimal alignment between every pair of sequence messages, the similarity between the sequence messages is calculated from that alignment, and a distance matrix $D$ of the sequence message set is constructed, wherein $D_{pq}$ represents the distance between sample sequence $p$ and sample sequence $q$;

step 402, constructing and partitioning a guide tree: the distance between subclasses is calculated by the unweighted pair group method with arithmetic mean (UPGMA), and the subclasses with the minimum distance are merged step by step; the distance $d_{ij}$ between subclasses $C_i$ and $C_j$ is calculated as the arithmetic mean of the pairwise distances between their members;

step 403, performing progressive multiple sequence alignment: the guide tree is traversed in post-order, pairwise dynamic programming alignment is performed with the Needleman-Wunsch algorithm, and unaligned bytes are padded; when several guide trees have been constructed, progressive multiple sequence alignment yields several sequence message subsets, and each subset is analyzed and processed to obtain the protocol structure information.
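The subclass distance of step 402 follows from the definition of UPGMA as the arithmetic mean of all pairwise sample distances. Written with the symbols $C_i$, $C_j$ and $D_{pq}$ used above (the notation $d_{ij}$ and $|C_i|$ for the number of samples in a subclass is an assumption about how the original formula was laid out):

```latex
d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} D_{pq}
```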

As a preferred aspect of the present invention, the process of the Smith-Waterman algorithm is the same as that of the Needleman-Wunsch algorithm, except that the mismatch score and the gap penalty of the Smith-Waterman algorithm are negative, the alignment restarts once an element of the summation matrix falls below 0, and the alignment may end at whichever matrix element has the highest score, not necessarily at the bottom right corner.

As a preferred aspect of the present invention, the step 402 further includes:

setting a distance threshold, stopping the merging when the subclass distance $d_{ij}$ exceeds the threshold, and finally partitioning the result into a plurality of guide trees, wherein leaf nodes in a guide tree represent original sample sequences, and intermediate nodes represent the aligned sequences obtained by pairwise alignment of their child nodes.

As a preferred scheme of the present invention, the search algorithm further includes a heuristic semantic extraction algorithm based on semantic inference from network traffic, which infers the semantics of a byte in the protocol from the values that byte takes across the sequence messages and from how those values change, the semantics of all bytes together constituting the protocol format of the sample set; binary segments and text segments can first be identified from the message sequence, and semantic inference is then performed segment by segment, i.e. the semantic inference of binary segments and of text segments is performed separately.

As a preferred aspect of the present invention, the inference method of the protocol state machine in step 500 includes the following steps:

step 501, segmenting each session into a message sequence, extracting and clustering the message features of the samples by using the obtained protocol format and semantic information to obtain a Message Type (MT) set $M$, and then representing each session sample in the sample set $S$ as a Message Type Sequence (MTS):

$s_i = (a_1, \ldots, a_h), \quad a_1, \ldots, a_h \in M$;

constructing a prefix tree that correctly accepts all $s_i \in S$, where nodes represent states and edges are the inputs $a_i \in M$ that cause state transitions; then generalizing, for each type $m_i \in M$, its predecessor type sequence $P_i$, which is represented by a regular expression:

$P_i = {.}^{*}\, r\, (a_1 \mid \cdots \mid a_j)^{*}, \quad r, a_1, \ldots, a_j \in M$;

the initial state $q_0$ is labeled with the types that have no predecessor type, and each non-initial state $q_i$ ($i > 0$) is defined as:

$q_i = \{\, m_i \mid P_i \rightarrow a_{0i} \,\}, \quad i > 0$;

wherein $a_{0i}$ denotes the input sequence that causes the transition from $q_0$ to $q_i$, i.e. the state $q_i$ is labeled with the set of all types whose $P_i$ can match $a_{0i}$; finally, the Exbar algorithm is applied to fuse the states having the same label, and the minimal deterministic finite state machine is extracted.
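The prefix tree construction in step 501 can be illustrated with a short sketch. It covers only the first part of the step, building a tree that accepts every observed message type sequence; the state labeling and the Exbar merging are not shown. The function names, the integer state numbering and the example message types are hypothetical.

```python
def build_prefix_tree(sessions):
    """Build a prefix tree that accepts every message type sequence in sessions.

    Returns (transitions, accepting). transitions maps (state, message_type)
    to the next state; state 0 plays the role of the initial state q0.
    """
    transitions = {}
    accepting = set()
    next_state = 1
    for sequence in sessions:
        state = 0
        for message_type in sequence:
            key = (state, message_type)
            if key not in transitions:
                # First time this prefix is seen: allocate a new state.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)  # the state reached at the end of a session
    return transitions, accepting


def accepts(transitions, accepting, sequence):
    """Return True if the prefix tree accepts the given message type sequence."""
    state = 0
    for message_type in sequence:
        key = (state, message_type)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in accepting
```

Sessions that share a prefix share the corresponding states, which is what makes the later merging of equivalent states worthwhile.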

As a preferred solution of the present invention, the minimal deterministic finite state machine includes protocol state information and context information: each state in the state machine represents a protocol state, and the transition relationships between the states represent the context.

The invention has the following advantages:

the method performs automatic deep message analysis of private protocols when the protocol specification is unknown, extracting protocol fields, state machines and the like; applied to security monitoring of private protocol network traffic, it is of significance for ensuring the safe and stable operation of the power grid.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

FIG. 1 is a flow chart of a method provided by an embodiment of the present invention;

FIG. 2 is a diagram of a protocol inverse model according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an example of DNA sequence alignment provided by an embodiment of the present invention;

FIG. 4 is a flow chart of semantic inference provided by an embodiment of the invention.

Detailed Description

The present invention is described in terms of particular embodiments, other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure, and it is to be understood that the described embodiments are merely exemplary of the invention and that it is not intended to limit the invention to the particular embodiments disclosed. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in FIG. 1, the present invention provides a private protocol analysis method for a power monitoring system based on protocol reverse engineering, which comprises steps 100 to 500 set out above.

Protocol reverse engineering refers to extracting the protocol grammar, semantics and synchronization information by monitoring and analyzing the network input and output, system behavior and instruction execution flow of a protocol entity, without relying on a protocol description. The protocol grammar is embodied in the structure and format of the control information and control messages within data messages: it defines the keywords, data type, length and so on of each field in the protocol, and the type, length, position and order of the fields make up the lexical structure, which is part of the grammar. The protocol semantics define the actual meaning that these fields carry during protocol parsing; for example, the "Request-URI" field in an HTTP request message identifies the requested content.

The inference of the protocol grammar and semantic information is included in the format inference of the protocol reverse analysis. Protocol synchronization refers to the timing relationship of various messages during communication, for example, a "connection allowed" or "connection denied" message must be sent as a response to a "connection request" message, a "connection closed" message must also be sent under certain conditions after a connection is established, and the like. The timing relationship of the control message transmission also determines the restriction relationship of the communication states (transmission state, reception state, waiting state, etc.) of the two communication parties, so the network protocol is often described by using finite state machines of the two communication parties.

The ideal protocol inverse model should include three stages of preprocessing, protocol format extraction, and state machine extraction, as shown in fig. 2.

In the input preprocessing stage, the original data stream is divided, through session delimitation and message delimitation, into subsequences that each correspond to one message within an independent session. The protocol format extraction stage takes these subsequences as input and performs three steps: field division, structure identification and semantic attribute inference. In the state machine extraction stage, a state prefix tree is constructed through state labeling, redundant states are merged, and the corresponding minimal deterministic state machine is obtained through a state machine optimization step. Whether session delimitation can be achieved determines whether state transition sequences can be obtained, and whether message delimitation can be achieved determines whether states can be labeled, so the input preprocessing stage is the foundation of protocol reverse engineering.

Although the protocol reverse targets incoming messages, the format of outgoing messages needs to be extracted because of the possible domain constraint relationship between incoming and outgoing messages. The format extraction algorithm of the output message is the same as that of the input message, but only the part related to the input message needs to be reserved when the protocol description is finally output.

The realization of the protocol automatic reverse technology can obviously reduce the workload of manual analysis, improve the analysis efficiency of the private protocol and enable the quick automatic response to the network security event to be possible. In the field of protocol reverse engineering, more intensive research has been carried out at home and abroad. Existing protocol inversion techniques are roughly classified into two types according to the difference of analysis objects: the analysis method based on network flow and the analysis method based on instruction execution sequence.

The analysis method based on network traffic takes the captured sequence of network messages as its analysis object. Its input is relatively easy to obtain, it is flexible to implement and it is highly general; its drawbacks are that encrypted and complex protocols are difficult to reverse and that its accuracy depends on the richness of the input sample set. The analysis method based on the instruction execution sequence takes the instruction execution trace during protocol parsing as its analysis object: it treats the protocol input data as a taint source and derives the protocol specification from how the tainted data is used and from its context information. It can obtain more accurate and comprehensive protocol format and semantic information, but it requires access to the executable code, is complex to implement and is difficult to automate.

Because the operating environment of industrial control equipment is relatively closed, the executable code of core equipment such as PLCs is difficult to extract, so the analysis method based on the instruction execution sequence is difficult to apply. By comparison, the analysis method based on message sequences is independent of the platform on which the protocol entity is implemented and is more general. This embodiment therefore adopts the network-traffic-based analysis method as the basic technical route for protocol reverse analysis, so as to meet the requirement for automatic reverse analysis of industrial control protocols.

The protocol reverse method of the protocol reverse module is used for acquiring the structural information, semantic information, state information and context information of the protocol message, and the protocol reverse analysis is realized by adopting the following method:

a) acquiring message structure information by adopting a multi-sequence comparison algorithm;

b) setting a heuristic algorithm for matching aiming at common semantic fields in an industrial control protocol, and extracting semantic information;

c) protocol state machine (including context) inference based on state labeling.

Sequence alignment is the foundation of bioinformatics and is important for discovering functional, structural and evolutionary information in biological sequences. Briefly, sequence alignment arranges two or more sequences together so as to find the most similar match between them while exposing their differences. Taking the two DNA sequences shown in FIG. 3 as an example, after alignment the matched symbols (A, T (or U), C and G for nucleic acids, or the single-letter codes of amino acid residues for proteins) are placed in the same column, and missing symbols are replaced by the placeholder "-".

The theoretical basis for sequence alignment methods is evolutionary theory, which is commonly used to study sequences with homology, particularly biological sequences such as protein or DNA sequences. The protocol format is regarded as language, and the protocol messages have evolutionary similarity, so that the sequence comparison can also be used for analyzing and researching the protocol format. For very short and very similar sequences, alignment can be performed manually. In practical applications, target sequences (such as biological sequences, text sequences, etc.) are very long, and sequence alignment can only be completed by a computer by constructing a sequence alignment algorithm. Sequence alignments can be divided into double-Sequence alignments (Pair-wise Sequence Alignment) and Multiple-Sequence alignments (Multiple Sequence Alignment), depending on the number of alignments.

The task of double sequence alignment is to find out the similarity between the sequence to be tested and the target sequence, and typical algorithms include a dot matrix method, a dynamic programming algorithm, a FASTA algorithm, a BLAST algorithm and the like. Among them, the dynamic programming algorithm has higher sensitivity in identifying similarity, and is the most commonly used sequence alignment algorithm. The most typical Needleman-Wunsch algorithm and Smith-Waterman algorithm of the dynamic programming algorithms are first described herein.

The Needleman-Wunsch algorithm is a globally optimal pairwise sequence alignment algorithm and mainly comprises three steps: similarity scoring, scoring and summation, and optimal backtracking.

To illustrate the algorithm flow, take the alignment of the HTTP requests "GET /index.html HTTP/1.0" and "GET / HTTP/1.0" as an example. The summation matrix of the Needleman-Wunsch algorithm is shown in Table 3-1, where the shaded part is the traceback path. The final sequence alignment result is "GET /?????????? HTTP/1.0".

TABLE 3-1 Needleman-Wunsch Algorithm Process example
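The three steps can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the embodiment: it assumes a similarity score of 1 for identical characters and 0 otherwise, the gap penalty w = 0 stated earlier, and a traceback that prefers the diagonal move. The function names `needleman_wunsch` and `consensus` are hypothetical.

```python
def needleman_wunsch(a, b, w=0):
    """Globally align strings a and b; '-' marks a gap in the returned strings."""
    n, m = len(a), len(b)
    # Summation matrix M of size (n+1) x (m+1); row 0 and column 0 hold gap-only prefixes.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * w
    for j in range(1, m + 1):
        M[0][j] = j * w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else 0  # similarity score S[i][j]
            M[i][j] = max(M[i - 1][j - 1] + s, M[i - 1][j] + w, M[i][j - 1] + w)

    # Backtrack from the bottom right corner, trying the diagonal move first.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = 1 if a[i - 1] == b[j - 1] else 0
            if M[i][j] == M[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and M[i][j] == M[i - 1][j] + w:
            out_a.append(a[i - 1])   # character of a aligned against a gap in b
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")        # character of b aligned against a gap in a
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def consensus(x, y, wildcard="?"):
    """Keep columns where both aligned sequences agree; mark the rest as variable."""
    return "".join(c if c == d else wildcard for c, d in zip(x, y))
```

Calling `needleman_wunsch("GET /index.html HTTP/1.0", "GET / HTTP/1.0")` and passing the two aligned strings to `consensus` gives `GET /?????????? HTTP/1.0`, the alignment result quoted in the example above: the invariant bytes are kept and the ten variable bytes are replaced by placeholders.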

The idea of the Smith-Waterman algorithm is similar to that of the Needleman-Wunsch algorithm, with two differences: (1) the mismatch score and the gap penalty are negative, and once an element of the summation matrix falls below 0 the alignment is restarted; (2) the alignment may end at whichever matrix element has the highest score, not necessarily at the bottom right corner. The Smith-Waterman algorithm is more sensitive when aligning two sequences that share local similarity, but it does not yield a global alignment. Both algorithms compute every matrix element, so the time complexity of each is $O(mn)$.
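The two differences can be seen in a short Python sketch of local alignment. The scoring values (match 1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration. The text does not state how a similarity is turned into a distance, so `sw_distance` uses one plausible normalization, also an assumption: one minus the best local score divided by the shorter sequence length.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment of two strings."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Difference (1): scores are clamped at 0, which restarts the alignment.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Difference (2): trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def sw_distance(a, b):
    """Hypothetical distance in [0, 1] derived from the best local alignment score."""
    if not a or not b:
        return 1.0
    score, _, _ = smith_waterman(a, b)  # default match score of 1 keeps score <= min length
    return 1.0 - score / min(len(a), len(b))
```

For the strings `"xxxHELLOyyy"` and `"zzHELLOww"` the best local alignment is the shared substring `HELLO` with score 5, regardless of the differing prefixes and suffixes.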

Multiple sequence alignment is essentially a generalization of pairwise sequence alignment. Common multiple sequence alignment algorithms fall into three types: exact alignment, iterative alignment and progressive alignment. The exact alignment algorithm extends the dynamic programming idea to multiple sequences and obtains the optimal alignment result, but its complexity grows exponentially with the number of sequences, so the amount of computation is prohibitive in practical applications.

The iterative alignment algorithm starts from a preliminary alignment result and improves the multiple sequence alignment through a series of iterations until the result no longer improves. Although the iterative algorithm is robust and insensitive to the number of sequences, the iterations still consume considerable computing resources. The progressive alignment algorithm adopts a greedy idea: it repeatedly performs pairwise dynamic programming alignment, starting from the alignment of two sequences and adding new sequences step by step until all sequences have been added. Compared with the former two, the progressive alignment algorithm has a clear advantage in efficiency. Although it cannot guarantee the optimal alignment, progressive alignment still gives satisfactory results when the sequence samples are highly similar. Most existing network-traffic-based protocol format extraction methods therefore adopt a progressive sequence alignment algorithm.

The search algorithm in step 400 implements protocol format extraction by progressive multiple sequence alignment, specifically comprising the steps of:

step 401, calculating a distance matrix: the Smith-Waterman algorithm is used to find the locally optimal alignment between every pair of sequence messages, the similarity between the sequence messages is calculated from that alignment, and a distance matrix $D$ of the sequence message set is constructed, wherein $D_{pq}$ represents the distance between sample sequence $p$ and sample sequence $q$;

Since a protocol may have several format types, forcing all samples into a single progressive multiple sequence alignment may add a large number of invalid padding bits to the samples. To improve the accuracy of sequence alignment, a distance threshold is set, merging stops when the distance $d_{ij}$ between subclasses exceeds the threshold, and the result is finally partitioned into several guide trees. In a guide tree, leaf nodes represent original sample sequences, and intermediate nodes represent the aligned sequences obtained by pairwise alignment of their child nodes.
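The threshold-limited merging can be sketched as average-linkage (UPGMA) clustering over a precomputed distance matrix. This is an illustrative sketch, not the embodiment's code: the function name `upgma_clusters`, the list-of-lists matrix layout and the example threshold are assumptions, and the sketch returns only the groups of sample indices that would form the leaves of each guide tree, not the tree structure itself.

```python
from itertools import combinations


def upgma_clusters(D, threshold):
    """Merge the closest subclasses until the smallest average distance exceeds threshold.

    D is a symmetric matrix (list of lists) of pairwise sample distances.
    Returns a sorted list of clusters, each a sorted list of sample indices.
    """
    clusters = [[p] for p in range(len(D))]

    def average_distance(ci, cj):
        # Arithmetic mean of all pairwise distances between the two subclasses.
        return sum(D[p][q] for p in ci for q in cj) / (len(ci) * len(cj))

    while len(clusters) > 1:
        (i, j), d = min(
            (((i, j), average_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break  # remaining subclasses are too far apart: keep them as separate trees
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(sorted(c) for c in clusters)
```

With four samples in which samples 0 and 1 are close, samples 2 and 3 are close, and the two pairs are far apart, a threshold between the within-pair and cross-pair distances yields two groups, each of which would then be aligned on its own.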

The search algorithm also includes a heuristic semantic extraction algorithm based on semantic inference from network traffic. Specifically, the semantics of a byte in the protocol are inferred from the values that byte takes across the sequence messages and from how those values change, and the semantics of all bytes together constitute the protocol format of the sample set. Binary segments and text segments can first be identified from the message sequence, and semantic inference is then carried out segment by segment, i.e. the semantic inference of binary segments and of text segments is performed separately.
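The text does not say how binary and text segments are told apart. One simple heuristic, given here purely as an assumption, is to treat sufficiently long runs of printable ASCII bytes as text and everything else as binary. The function name `split_text_binary` and the minimum run length of 4 are hypothetical.

```python
from itertools import groupby


def split_text_binary(msg, min_text=4):
    """Split a bytes message into consecutive ('text' | 'binary', bytes) segments."""

    def is_text(byte):
        # Printable ASCII plus tab, line feed and carriage return.
        return 0x20 <= byte <= 0x7E or byte in (0x09, 0x0A, 0x0D)

    segments = []
    for printable, group in groupby(msg, key=is_text):
        data = bytes(group)
        # Short printable runs are more likely coincidental binary values.
        label = "text" if printable and len(data) >= min_text else "binary"
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1] + data)  # merge with previous segment
        else:
            segments.append((label, data))
    return segments
```

Each returned segment can then be handed to the inference strategy that suits its kind.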

For convenience of description, and to distinguish them from sample segments (i.e., binary segments or text segments, hereinafter referred to as segments), the present scheme refers to a structure having specific semantics within a sample as a keyword field, hereinafter referred to as a field. From the static characteristics of the samples, an interval (spacer) field, a sequence number (serial) field, a data field, a length field and a format identification (FD) field can be identified. Structures whose semantics are not recognized are marked as unknown fixed fields or unknown variable fields. A set of semantic inference strategies is formulated for identifying each kind of field. The length field and the format identification field are the two most basic fields, and their identification strategies are the focus here.

a. Length (length) field:

The length field is characterized by its own length, typically 1 to 4 bytes, and by a value equal to the length of one segment or of several consecutive segments of the sample; its scope never precedes the field, although it may include the field itself. The identification strategy is to test whether the value of a field equals the length of a subsequent segment or of several consecutive subsequent segments, and if so to judge the field to be a length field. Because semantic inference is performed within a segment, the scope of a length field may also include the current segment. To avoid increasing the time complexity of the algorithm too much, a simplification is made that does not affect its validity or correctness: the segments following the current segment are treated as a whole, i.e. the scope includes either all of the following segments or none of them. Similarly, because of the role of the spacer, by the time length fields are identified in a text segment the samples have already been merged according to the spacer, i.e. the segmentation of the samples is complete, which lowers the time complexity and reduces misidentification. The division of a binary segment, by contrast, depends to some extent on the identification of its length fields, so the division of the binary segment and the identification of the length fields are carried out together, constraining and verifying each other. The length field identification strategies for text segments and binary segments therefore differ.

The length identification process of the text field is as algorithm 1:

Algorithm 1. An algorithm to identify the length key of an ASCII section

Input: Cluster

Output: Cluster

ASCIILengthInfer(Cluster)

The input of the algorithm is a set of message samples of the same class and the same segment, and the output is the input samples annotated with the identified length fields. Each field of the segment is examined in turn: the algorithm first confirms that the field is an unidentified numeric field (line 2 of the algorithm), then matches the value of the field against the length of a segment or of several consecutive segments (line 3) to find the scope of the field. If the scope is empty, the field is judged not to be a length field; if there is more than one scope, the scope closest to the field is chosen (lines 4-7). Finally, the identification result is written back into the input samples (lines 8-10).
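The strategy just described can be approximated by the following sketch. It is not the pseudocode of Algorithm 1, which is not reproduced here. It assumes that each sample has already been split on the spacer into a list of text pieces, that all samples in the cluster have their pieces in the same positions, and that a scope is a run of consecutive pieces after the candidate field. The function name and the return format are hypothetical.

```python
def ascii_length_infer(samples):
    """Find length fields in a non-empty cluster of spacer-split text samples.

    samples is a list of messages, each a list of text pieces.
    Returns a list of (field_index, scope_start, scope_end) triples, where the
    decimal value of the field equals the total length of pieces
    scope_start..scope_end (inclusive) in every sample.
    """
    n_pieces = min(len(s) for s in samples)
    found = []
    for k in range(n_pieces):
        # Only pieces that are decimal numbers in every sample can be length fields.
        if not all(s[k].isdecimal() for s in samples):
            continue
        scope = None
        # Try candidate scopes after the field, nearest first.
        for start in range(k + 1, n_pieces):
            for end in range(start, n_pieces):
                if all(int(s[k]) == sum(len(p) for p in s[start:end + 1]) for s in samples):
                    scope = (start, end)
                    break
            if scope:
                break
        if scope:
            found.append((k, scope[0], scope[1]))
    return found
```

Requiring the equality to hold in every sample of the cluster is what keeps a numeric piece that matches a length by coincidence in one message from being accepted.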

The recognition algorithm of binary fields is as algorithm 2:

Algorithm 2. An algorithm to identify the length key of a binary section

Input: Cluster

Output: Cluster

BinaryLengthInfer(Cluster)

Unlike the text segment identification algorithm, the preliminarily identified length fields are stored in keywdlist (line 8 of the algorithm), and the length fields in keywdlist are then filtered in a loop until no length field is rejected (lines 11-13 of the algorithm). The screening process is given as Algorithm 3:

Algorithm 3. An algorithm to check the Binary section and filter out the correct length keys

Input: keywdlist, Cluster

Output: isFinished, keywdlist

FieldCheck(keywdlist，Cluster)

The algorithm first merges consecutive unidentified invariant fields (UBC) and consecutive unidentified variable fields (UBV) that have the same change rate (lines 1-2 of the algorithm), and then examines the scope of each length field in keywdlist one by one (a length field may have more than one candidate scope). If the boundary of a scope does not fall on the boundary of any field, that scope is culled (line 5 of the algorithm). If all scopes of a length field are culled, the length field is negated and removed from keywdlist (lines 6-9 of the algorithm).
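The boundary check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the pseudocode of Algorithm 3: the function name `field_check`, the representation of keywdlist as a mapping from the byte offset of a length field to its candidate scopes, and the set `boundaries` of field boundary offsets are all hypothetical, and the merging of UBC and UBV fields (lines 1-2) is omitted.

```python
from typing import Dict, List, Set, Tuple

Scope = Tuple[int, int]  # (start, end) byte offsets of a candidate scope


def field_check(
    keywdlist: Dict[int, List[Scope]], boundaries: Set[int]
) -> Tuple[bool, Dict[int, List[Scope]]]:
    """Cull scopes that do not align with field boundaries.

    Returns (is_finished, filtered keywdlist); is_finished is False when
    at least one length field lost all of its scopes and was negated.
    """
    is_finished = True
    surviving: Dict[int, List[Scope]] = {}
    for offset, scopes in keywdlist.items():
        kept = [(s, e) for (s, e) in scopes if s in boundaries and e in boundaries]
        if kept:
            surviving[offset] = kept
        else:
            is_finished = False  # every scope culled: negate this length field
    return is_finished, surviving


# Screening loop: repeat until a pass negates no length field.
keywdlist = {2: [(4, 10), (4, 9)], 12: [(13, 17)]}
boundaries = {0, 2, 4, 10, 12, 14, 20}
finished = False
while not finished:
    finished, keywdlist = field_check(keywdlist, boundaries)
```

In the method described above, field boundaries would presumably be recomputed between passes, since segmentation and length identification constrain each other; the sketch keeps `boundaries` fixed for brevity.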

b. Format identification (FD) field

A format identification field has a low rate of value change, and its value is closely tied to the format sequence (classification) that follows it: each value corresponds to one subclass. According to the scope of the FD field and the order of identification, FD field identification is divided into two stages. An FD field identified in the first stage has its scope after the current segment, so it takes different values within the current segment. For each unknown variable field the decision is made as follows: if every value of the field corresponds to a unique format sequence after the current segment (one format sequence may correspond to several values of the FD field), the field is judged to be an FD field and its scope is taken to be all segments after the current segment. An FD field identified in the second stage has its scope inside the current segment, so it has a single value within the current segment of one class, while the same segment (the current segment) of samples from different classes carries different values. Two problems therefore need to be solved for second-stage FD field identification: (1) how to locate the FD field; (2) when to identify the FD field.

The identification strategy of Discoverer scans from left to right: it first makes a preliminary semantic inference for some fields, determines the first FD field, classifies the samples by the value of that FD field, performs semantic inference and identification of the next FD field on each resulting subclass, and finally fuses the formats. The advantage of this strategy is that it locates the FD field fairly accurately. However, an FD field determines the format sequence that follows it, so identifying an FD field presupposes that the fields after it have already been identified; otherwise the FD field cannot be identified accurately. Discoverer identifies the earlier FD field first, classifies by its value, and only then performs the subsequent semantic inference and the identification of the next FD field on each subclass. Its drawback is therefore that the identification order is reversed, so the identified FD field is inaccurate or cannot be identified at all. Simply reversing the identification order to address this drawback, i.e. scanning from right to left, does not work either, because the FD field can then no longer be located accurately.

Therefore, the FD field identification strategy of Discoverer is improved using the principle that one value of an FD field can correspond to only one format sequence. The improved strategy is given as Algorithm 4:

Algorithm 4. An algorithm to identify the FD key of the current section

Input: InitialCluster

Output: FDCluster

FDInfer(InitialCluster)

Here the input InitialCluster is a sample set on which all semantic inference other than FD field identification has been completed, and the output of the algorithm is the sample set after FD field identification. Once the non-FD semantic inference of the current segment is finished, the algorithm recursively locates, from left to right, the positions where an FD field may occur. Because the other semantic inferences are based on the sequence alignment of same-class messages in the current segment, the format sequences of same-class messages in the current segment are identical apart from the current FD field. An FD field can therefore only be an unidentified fixed field that is aligned, counting from the left, across the message classes; such a field is marked as a pending field (line 1 of the algorithm).

The samples are classified by the value of the pending field (line 2 of the algorithm), and the algorithm is then called recursively on each subclass to continue locating FD fields further along the segment (line 4 of the algorithm). After locating is finished, the pending fields are confirmed from right to left. The confirmation rule is as follows: scan onward from the pending field until the next FD field, or until the last field of the current segment if no FD field is encountered, and record the scanned format sequence (line 5 of the algorithm); then compare the format sequences recorded for samples whose pending field has the same value. If each value corresponds to the same format sequence, the field is inferred to be an FD field (lines 6-10 of the algorithm). Here, "the same format sequence" has two meanings:

(1) the format sequences are identical and contain no FD field;

(2) the format sequences are identical but contain an FD field; in this case it must be checked whether the values of that FD field in the respective samples intersect: if they do, the sequences are judged to be the same format sequence; otherwise they are different format sequences.

Experiments show that this "forward positioning, reverse identification" strategy locates the FD field better and identifies it accurately.
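The principle that one value of an FD field can correspond to only one format sequence can be sketched in Python as follows. The function name `is_fd_field` and the representation of a format sequence as a tuple of field-type labels are assumptions made for illustration. The sketch covers only meaning (1) above; the intersection test for nested FD fields in meaning (2) is not implemented.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Sequence, Set, Tuple


def is_fd_field(samples: Iterable[Tuple[Hashable, Sequence[str]]]) -> bool:
    """Check the FD principle for one pending field.

    Each sample is (value of the pending field, format sequence that follows).
    The field qualifies when every value maps to exactly one format sequence;
    several values may share the same format sequence.
    """
    formats_by_value: Dict[Hashable, Set[Tuple[str, ...]]] = defaultdict(set)
    for value, format_sequence in samples:
        formats_by_value[value].add(tuple(format_sequence))
    return bool(formats_by_value) and all(
        len(formats) == 1 for formats in formats_by_value.values()
    )


# Values 2 and 3 share one format sequence, which the principle allows.
consistent = [(1, ["LEN", "DATA"]), (1, ["LEN", "DATA"]), (2, ["SEQ"]), (3, ["SEQ"])]
# Value 1 is followed by two different format sequences, so it is rejected.
conflicting = [(1, ["LEN", "DATA"]), (1, ["SEQ"])]
```

Here `is_fd_field(consistent)` returns `True` and `is_fd_field(conflicting)` returns `False`.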

(3) Semantic inference flow: the semantics of the fields may constrain one another; for example, a length field reflects the length of its scope, and an FD field determines the format sequence (substructure) of its scope. To avoid unnecessary repeated checks and to improve the efficiency and accuracy of semantic inference, a semantic inference flow is defined according to the characteristics of each semantic type, as shown in fig. 3. After multiple sequence alignment is performed on the current segment of the message samples, fixed bytes and variable bytes are distinguished according to the change rate of each byte. For text fields, each fixed field is checked to see whether it is a separator field; if not, it is marked as an unknown fixed field (this check is not performed for binary fields). Each variable field is then checked to see whether it is a data field; if not, it is provisionally marked as an unknown variable field. The remaining unknown variable fields are then checked to see whether they are sequence number fields. After all fields of the current segment have been identified, for a text field segment the fields are first merged according to the separators and the length field is then identified; for binary fields, field merging and length field identification are performed together. Finally, the FD field is identified according to the identified format sequence.
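The first step of this flow, separating fixed bytes from variable bytes after alignment, can be sketched in Python as follows. The function name `split_fixed_variable` is hypothetical. The sketch assumes that the aligned samples have equal length and treats a byte position as fixed only when it takes a single value across all samples; the description does not state the change-rate threshold actually used.

```python
from itertools import groupby
from typing import List, Tuple


def split_fixed_variable(aligned: List[bytes]) -> List[Tuple[str, int, int]]:
    """Split aligned samples into (label, start, end) runs of byte positions.

    A position is "fixed" if all samples share one value there, otherwise
    "variable"; adjacent positions with the same label form one field.
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(message) != width for message in aligned):
        raise ValueError("samples must be aligned to the same length")
    labels = [
        "fixed" if len({message[i] for message in aligned}) == 1 else "variable"
        for i in range(width)
    ]
    fields = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        fields.append((label, position, position + length))
        position += length
    return fields
```

For the two aligned samples `b"\x01\x02AAxy"` and `b"\x01\x02BAzy"`, the first two bytes form a fixed field, byte 2 a variable field, byte 3 a fixed field, byte 4 a variable field, and byte 5 a fixed field.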

The inference method of the protocol state machine in step 500 includes the following steps:

s_i = (a₁, ..., a_h), a₁, ..., a_h ∈ M;

construct a prefix tree that correctly accepts all s_i ∈ S, where nodes represent states and edges are the inputs a_i ∈ M that cause state transitions; then, for each type m_i ∈ M, summarize its predecessor type sequence P_i, expressed as a regular expression;

P_i = .*r(a₁|...|a_j)*, (r, a₁, ..., a_j ∈ M);

q_i = {m_i | P_i → a_0i}, i > 0;

The minimal deterministic finite state machine contains the protocol state information and the context relations: each state in the state machine represents a protocol state, and the transition relations between states represent the context relations.
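The prefix-tree construction described above can be sketched in Python as follows. The function names `build_prefix_tree` and `accepts`, the integer state numbering, and the message type labels are assumptions made for illustration; the merging of redundant states and the minimization that yield the final state machine are not shown.

```python
from typing import Dict, Sequence, Set, Tuple

Transitions = Dict[Tuple[int, str], int]  # (state, message type) -> next state


def build_prefix_tree(sessions: Sequence[Sequence[str]]) -> Tuple[Transitions, Set[int]]:
    """Build a prefix tree that accepts every session type sequence s_i.

    State 0 is the root; each new (state, message type) pair adds one state.
    """
    transitions: Transitions = {}
    accepting: Set[int] = set()
    next_state = 1
    for session in sessions:
        state = 0
        for message_type in session:
            key = (state, message_type)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)  # the state reached at the end of a session
    return transitions, accepting


def accepts(transitions: Transitions, accepting: Set[int], session: Sequence[str]) -> bool:
    """Return True if the prefix tree accepts the given type sequence."""
    state = 0
    for message_type in session:
        key = (state, message_type)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in accepting


# Hypothetical sessions, each a sequence of message types drawn from M.
sessions = [["HELLO", "AUTH", "DATA"], ["HELLO", "AUTH", "BYE"], ["HELLO", "BYE"]]
tree, final_states = build_prefix_tree(sessions)
```

The three sessions share the prefix `HELLO`, so the tree has five transitions and six states including the root; every observed session is accepted, while the incomplete sequence `["HELLO", "AUTH"]` is not.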

Based on the above scheme, the method was implemented in code. Fig. 4 shows the deep packet parsing result obtained without knowledge of the protocol specification, taking an ordinary HTTP packet as an example; it can be seen that useful protocol fields are indeed extracted.

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A power monitoring system private protocol analysis method based on protocol reverse engineering is characterized by comprising the following steps:

2. The method according to claim 1, wherein the protocol inversion method of the protocol inversion module is used for acquiring structural information, semantic information, state information, and context information of a protocol packet.

3. The method for resolving the proprietary protocol of the power monitoring system based on the reverse protocol engineering as claimed in claim 1 or 2, wherein the multiple sequence alignment algorithm comprises a globally optimal double sequence alignment algorithm, denoted as Needleman-Wunsch algorithm, and comprises the following steps:

4. The method according to claim 1, wherein the search algorithm in step 400 implements protocol format extraction by progressive sequence alignment, and specifically comprises the following steps:

5. The method for resolving the proprietary protocol of the power monitoring system based on the protocol reverse engineering as claimed in claim 1 or 4, wherein the process of the Smith-Waterman algorithm is the same as that of the Needleman-Wunsch algorithm, except that the mismatch and gap penalties of the Smith-Waterman algorithm are negative; when an accumulated score in the matrix is less than 0, the comparison is restarted; and the Smith-Waterman alignment ends at the element of the matrix with the highest score.

6. The method according to claim 5, wherein the step 402 further comprises:

7. The method for resolving the proprietary protocol of the power monitoring system based on the reverse protocol engineering of claim 1 or 4, wherein the search algorithm further comprises a heuristic semantic extraction algorithm;

the heuristic semantic extraction algorithm is based on semantic inference of network flow, and infers the semantics of a certain byte in the protocol according to the value and change characteristics of the byte in the sequence message, wherein the semantics of all bytes form the protocol format of the sample set; binary fields and text fields are identified from the message sequence, the semantic inference is performed within the fields, and the semantic inference of the binary fields and the text fields is separate.

8. The method for resolving the proprietary protocol of the power monitoring system based on the reverse protocol engineering of claim 1, wherein the inference method of the protocol state machine in the step 500 comprises the following steps:

s_i = (a₁, ..., a_h), a₁, ..., a_h ∈ M;

P_i = .*r(a₁|...|a_j)*, (r, a₁, ..., a_j ∈ M);

q_i = {m_i | P_i → a_0i}, i > 0;

9. The method according to claim 8, wherein the minimal deterministic finite state machine contains protocol state information and context relations, each state in the state machine represents a protocol state, and the transition relations between the states represent the context relations.