CN115883398B

CN115883398B - Reverse analysis method and device for private network protocol format and state

Info

Publication number: CN115883398B
Application number: CN202211496290.4A
Authority: CN
Inventors: 牛伟纳; 王崇宇; 朱宇坤; 张小松; 陈瑞东; 周玉祥
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-11-25
Filing date: 2022-11-25
Publication date: 2024-03-22
Anticipated expiration: 2042-11-25
Also published as: CN115883398A

Abstract

The invention belongs to the technical field of information security and provides a reverse analysis method and device for the format and state of private network protocols. The aim is a method that requires no code analysis, so that protocol reverse engineering can be carried out from traffic alone in restricted environments. The scheme preprocesses traffic through a private protocol extraction module and a character comparison module. The format analysis and judgment process converts binary traffic into hexadecimal and performs boundary segmentation and optimization. The semantic judgment process applies constraint reasoning to associated fields and heuristic reasoning to non-associated fields, and outputs grammar information. The state machine construction process uses a normalization module to normalize the functional segments, a hierarchical clustering module to label the states, a dissimilarity calculation module to calculate distances, a transition relation construction module to build the transition graph, and a probability reduction module to reduce the states.

Description

Reverse analysis method and device for private network protocol format and state

Technical Field

The invention belongs to the technical field of information security, and provides a reverse analysis method and device for a private network protocol format and state.

Background

With the rapid development in recent years of fields such as the Internet of Vehicles, industrial control networks, and botnets, current network traffic contains a large number of private protocols developed for specific purposes. These protocols have no format documentation and their source code is not disclosed, which limits existing analysis approaches in fields such as protocol vulnerability mining, network behavior analysis, network monitoring, and botnet detection. Moreover, many private protocol designs contain security defects, and keeping the source closed cannot guarantee protocol security. Inferring a protocol's format and state transition relations from traffic has therefore become a key problem in network security.

From the perspective of the object analyzed, current protocol reverse engineering can be divided into approaches based on network traces and approaches based on program instructions. The former has broad application prospects, but its accuracy is somewhat lacking and it cannot process encrypted traffic. The latter is feasible for encrypted traffic analysis, but it is subject to too many constraints. The invention belongs to protocol reverse engineering based on network traces. Existing methods fall into three major categories. The first is alignment-based format analysis, such as PI, Discoverer, Netzob, and NetPlier. The second is statistics-based field partitioning, such as frequent item mining, information entropy, n-grams, and HSMM-inferred keywords (KW). The third combines learning models, such as reinforcement learning and neural networks (LSTM), with NLP techniques.

The earliest tool for traffic-based protocol reverse engineering was PI, proposed in 2004, and the topic has become a research hotspot in recent years as private traffic has emerged in large volumes. Protocol reverse engineering based on sequence alignment is an important branch that continues to produce results. These methods compare messages according to how their bytes vary, constructing a scoring matrix and a guide tree under the guidance of a penalty function. The core idea is similar to finding the longest common subsequence. PI and the other tools in this line of research share the same idea; their main innovations are optimizing the reward and penalty functions, optimizing the misjudgment mechanism, and combining forward and reverse analysis to judge the semantics of specific fields such as the FD field, separators, and lengths.

The NetPlier method, published at NDSS in 2021, combined sequence alignment with probability statistics for the first time and turned state labeling into a probabilistic problem. Methods of this type have the following disadvantages: (1) some of them introduce excessive prior knowledge of the protocol; (2) apart from keyword judgment, they frequently mis-segment the format boundaries of functional segments; (3) the alignment output is only a protocol boundary format, and the subsequent state machine construction does not make full use of the format analysis results.

Disclosure of Invention

In view of the problems in the prior art, the invention aims to provide a reverse analysis method for private network protocol formats and states that judges format field boundaries at fine granularity, optimizes protocol boundary judgment based on progressive multiple sequence alignment, information entropy, and probability statistics, and labels states based on hierarchical clustering. No code analysis is needed during the analysis, so protocol analysis can be carried out from traffic in restricted environments.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A reverse analysis method for private network protocol format and state, characterized in that: suspicious boundary judgment is added in the preprocessing stage; in format reverse analysis, information entropy is used to separate out the functional segments and frequent combination patterns are used to merge functional segment fields; hierarchical clustering is used for the labeling part of state machine construction; and the normalized Canberra dissimilarity is used for distance calculation. The method specifically comprises the following steps:

S1, acquisition stage: the traffic to be analyzed is collected with the tcpdump and nmap tools, and traffic packet sequences are captured to obtain the input traffic set, where the traffic packet sequences comprise request packet sequences and response packet sequences;

S2, preprocessing stage: the input traffic packet sequences are preprocessed according to the relation between protocol interaction behavior and the packet sequences, and the full-segment information of the private protocol is extracted and preliminarily divided;

S3, format reverse analysis stage: for the full-segment private protocol information obtained in step 2, format boundaries are judged at fine granularity by combining progressive multiple sequence alignment, information entropy, and frequent items, yielding the boundary segmentation result and the field types, and thereby the protocol boundary part of the private network protocol format;

S4, semantic judgment stage: based on the boundary segmentation result, the semantics of each divided field are inferred; fields with associated attributes are judged using existing methods and non-associated fields are judged heuristically; the result is a 'field start-stop position - semantics' correspondence, which gives the semantic part of the private network protocol format; the protocol boundary part and the semantic part together form the private network protocol format;

S5, state machine construction stage: a passive state machine is constructed based on the format boundary inference result.

In the above technical solution, step 2 includes the following steps:

2.1 The known lower-layer protocols are parsed and the IP and port information is extracted. The sending and receiving direction of the traffic is judged from the IP and port information, and the collected application traffic packets are divided into two classes, server and client. The matrices Server and Client are generated and stored in txt format, with the following layout: {sequence number 1, private protocol traffic packet 1}; {sequence number 2, private protocol traffic packet 2}; ...; {sequence number N, private protocol traffic packet N};

2.2 On the basis of the lower-layer protocol parsing in 2.1, the full private protocol segment, from the first byte to the last byte of the application layer, is extracted from each traffic packet. Each private protocol traffic packet is stored as one line, and new files server_all_oneline.txt and client_all_oneline.txt containing the full private protocol segments are formed in packet order, with each packet expressed in hexadecimal;

2.3 Suspicious boundary judgment: the hexadecimal characters in server_all_oneline.txt and client_all_oneline.txt obtained in steps 2.1 and 2.2 are converted to ASCII and compared with a public feature library. Characters in a traffic packet that match text-protocol features (for example "#" and "?") have their positions marked, and a vector <packet sequence number, text feature 1, position 1, ..., text feature N, position N> is constructed and stored. Text feature N in the vector serves as the basis for judging a boundary point, and position N is the corresponding suspicious boundary point. The files server_Suspicus_boundary.txt and client_Suspicus_boundary.txt are output;
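The suspicious boundary judgment of step 2.3 can be illustrated with a minimal Python sketch. The contents of the public feature library are not given in the text, so the set FEATURE_LIBRARY below is a hypothetical stand-in, and the function name and return layout are likewise assumptions made for illustration.

```python
# Hypothetical feature library: delimiter characters common in text protocols.
# The actual library used by the method is not specified in the source.
FEATURE_LIBRARY = {"#", "?", "&", "=", ":", ";", ","}


def suspicious_boundaries(seq_no, hex_payload, features=FEATURE_LIBRARY):
    """Return [seq_no, (feature, byte_position), ...] for one packet.

    hex_payload is one line of the *_all_oneline file: the application-layer
    bytes written as a hexadecimal string.
    """
    payload = bytes.fromhex(hex_payload)
    vector = [seq_no]
    for position, byte in enumerate(payload):
        char = chr(byte)  # hexadecimal byte -> ASCII character
        if char in features:
            vector.append((char, position))
    return vector


# "a#b?c" encoded in hexadecimal contains two delimiter characters.
print(suspicious_boundaries("0001", "6123623f63"))
# A binary packet without text delimiters yields no suspicious boundary.
print(suspicious_boundaries("0002", "059400000006010300080004"))
```

Positions here are zero-based byte offsets; whether the method counts bytes or hexadecimal digits, and from which origin, is not stated in the text.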

In the above technical solution, the specific steps of step 3 are as follows:

3.1 Progressive multiple sequence alignment is performed to obtain a preliminary boundary division. The number of private protocol traffic packets obtained in step 2 is counted; files containing more than 1000 packets are aligned with the G-INS-I algorithm and files containing fewer than 1000 packets are aligned with the L-INS-I algorithm. During alignment, gaps in a sequence are filled with "-" and field boundaries are marked with spaces. The alignment result of each traffic packet is a row vector, and the results together form a two-dimensional matrix in which the rows record the sequence number and payload of each packet and the columns represent character positions in the sequence, expressed as: <0001: P1 P2 P3 P4 ... - P63 P64 ... P256>, <0002: P1 P2 P3 P4 - P63 - P128>, where Pn (n = 1..256) represents one byte of data, that is, 2 hexadecimal digits, in the actual private protocol traffic packet. The two-dimensional matrix is stored in the output files server_msa.txt and client_msa.txt.

3.2 Extraction of the private protocol functional segment: on the basis of step 3.1, the first column vector of the two-dimensional matrix is extracted as input, with spaces as the boundaries for column vector extraction, and the information entropy of the column vector is calculated. The "-" characters filled in during the alignment of step 3.1 are counted as zero when the entropy is calculated. The entropy calculation is repeated from the first column to the last column of the private protocol, and the payload is cut off at the position where the entropy changes sharply between two adjacent columns. The characters from the initial character up to the position where the entropy increases sharply are extracted and judged to be the private protocol functional segment, and the functional segment files server_functional_oneline.txt and client_functional_oneline.txt are output.

3.3 Over-long sequence judgment: the length of each line of the two files server_all_oneline and client_all_oneline, that is, the length of each input traffic packet, is calculated. With the mean length plus the standard deviation as the threshold, individual traffic packets exceeding the threshold length are marked and extracted, and output as server_oversize_oneline and client_oversize_oneline.

3.4 Merging of over-segmented functional segments: the suspicious boundaries from step 2.3 are read. If a suspicious boundary already has space marks before and after it, no processing is done; if the field marked by a suspicious boundary has no space marks before and after it, a space column is inserted after the position of the suspicious boundary. The combination frequency between functional segment fields of length 1 byte and 2 bytes is then calculated, closed frequent item mining is applied, and high-probability combinations within the functional segment are merged. The upper limit of the merged length is 3 bytes (that is, 6 hexadecimal digits). For bytes judged to need merging, the space between the two byte columns is deleted, and the processed boundary judgment result files server_result.txt and client_result.txt are output.

3.5 Field type judgment: the types are divided into five classes, namely fixed-length fixed value, variable-length variable value, fixed-length enumeration, fixed-length increment, and fixed-length unordered. The output files of step 3.4 are read. A field whose value is fixed in each column is marked as a fixed-length fixed-value field; a position whose value is "-" in a column is marked as a variable-length variable-value field; a field with fewer than 8 distinct values is marked as a fixed-length enumeration field; a field whose values follow an incrementing rule is marked as a fixed-length increment field; all other fixed-length fields are judged to be fixed-length unordered fields. The field type judgment results for the S end and the C end are output, for example for the S end: <1-2, fixed-length fixed value>, <2-4, fixed-length enumeration>, ..., <14-256, variable-length variable value> (the numbers give the start and stop positions of the field in hexadecimal digits).
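The five-way field typing of step 3.5 can be sketched as follows. This is only an illustration under assumptions: each field is passed in as the list of its values across all packets, values are hexadecimal strings, and the order in which the rules are tested (gap check first, then fixed value, enumeration, increment) is a choice made here, since the text does not say which rule takes precedence when several apply.

```python
def classify_field(values):
    """Classify one aligned field from its per-packet values.

    values: list of hexadecimal strings, one per packet; "-" marks an
    alignment gap inserted during multiple sequence alignment.
    """
    if "-" in values:
        return "variable-length variable value"
    distinct = set(values)
    if len(distinct) == 1:
        return "fixed-length fixed value"
    if len(distinct) < 8:  # "fewer than 8 distinct values" rule from step 3.5
        return "fixed-length enumeration"
    numbers = [int(v, 16) for v in values]
    if all(b > a for a, b in zip(numbers, numbers[1:])):
        return "fixed-length increment"
    return "fixed-length unordered"


print(classify_field(["0000"] * 5))
print(classify_field(["03", "04", "03", "10"]))
print(classify_field([f"{i:04x}" for i in range(1, 11)]))
```

With few captured packets, a counter field may have fewer than 8 distinct values and would be typed as an enumeration under this ordering, which is one reason the rule precedence matters in practice.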

In the above technical solution, step 4 specifically includes:

4.1 The over-long sequences judged in 3.3 are read and the length of the data segment is counted. This length value is compared with the values of the fixed-length unordered fields; if they are consistent, the fixed-length unordered field is judged to be the length field;

4.2 Non-associated fields such as flag bits are judged heuristically according to the pattern of their values, and a list of candidate grammars is constructed. The heuristic rules are as follows: a fixed-length fixed value is judged to be a flag field; a fixed-length enumeration value is judged to be a control field and a protocol message type field; a fixed-length increment value is judged to be a sequence number field; a fixed-length high-entropy field in the functional segment is judged to be a check code field. The semantic judgment results for the S end and the C end are output separately as a 'field start-stop position - semantics' correspondence, in the format: <field position, field type, suspicious attribute 1, ..., suspicious attribute N>.
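The heuristic rules of step 4.2 amount to a lookup from field type to candidate semantics. A minimal sketch follows; the type label "fixed-length high entropy" is an assumption, since the text names the high-entropy rule but does not say how such fields are tagged, and the fallback value "undetermined" is likewise hypothetical.

```python
# Rule table taken from step 4.2; the "high entropy" label is an assumed tag.
HEURISTIC_RULES = {
    "fixed-length fixed value": ["flag field"],
    "fixed-length enumeration": ["control field", "protocol message type field"],
    "fixed-length increment": ["sequence number field"],
    "fixed-length high entropy": ["check code field"],
}


def infer_semantics(fields):
    """Map (start, stop, field_type) triples to candidate-semantics tuples.

    Each result follows the layout
    <field position, field type, suspicious attribute 1, ..., N>.
    """
    results = []
    for start, stop, field_type in fields:
        candidates = HEURISTIC_RULES.get(field_type, ["undetermined"])
        results.append((f"{start}-{stop}", field_type, *candidates))
    return results


for entry in infer_semantics([
    (1, 4, "fixed-length enumeration"),
    (5, 8, "fixed-length fixed value"),
]):
    print(entry)
```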

In the technical scheme, the specific steps of the step 5 are as follows:

5.1 The values of the functional segments extracted in step 3.2 are normalized: the two files server_functional_oneline.txt and client_functional_oneline.txt are read, the column vectors in each file are extracted with the space marks as boundaries, and the functional segments are mapped to a common scale using zero-mean normalization, generating server_result_normalization.txt and client_result_normalization.txt;

5.2 The distances between the functional segments normalized in 5.1 are calculated, and states are labeled using the hierarchical clustering algorithm AGNES, which clusters the different message states. For the distance calculation, each row vector of server_result_normalization.txt and client_result_normalization.txt from 5.1 is read, and the Canberra dissimilarity between the functional segments of the traffic packets is used as the basis for division. In the clustering process, each traffic packet starts as its own cluster, the distance between every pair of clusters is calculated, and the two closest clusters are merged into a new cluster. Merging of the closest clusters is repeated, and the hierarchical clustering algorithm terminates when the distance between the farthest two clusters exceeds a threshold or the number of clusters reaches a specified value;

5.3 Based on the state labeling of step 5.2, the state machine is drawn: transition edges are drawn according to the temporal order of the transitions between classes, and the states are simplified using two existing techniques, state pruning based on anomalous probability and state merging based on equivalent states.
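Steps 5.1 to 5.3 can be sketched end to end in plain Python. Several details below are assumptions rather than statements from the text: the Canberra dissimilarity is normalized by dividing by the vector dimension, cluster distance uses average linkage, and merging stops once the closest pair of clusters is farther apart than a threshold (a simplification of the termination rule in 5.2). The state simplification of step 5.3 is not covered.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_columns(rows):
    """Zero-mean normalization applied column by column."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [[(v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats)]
            for row in rows]


def canberra(x, y):
    """Canberra dissimilarity divided by the dimension (assumed normalization)."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom:  # a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total / len(x)


def agnes(vectors, threshold):
    """Bottom-up clustering; returns one state label per input vector."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):  # average linkage (an assumption)
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(canberra(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        dist, ia, ib = min(
            (linkage(a, b), ia, ib)
            for ia, a in enumerate(clusters)
            for ib, b in enumerate(clusters) if ia < ib
        )
        if dist > threshold:
            break
        clusters[ia] = clusters[ia] + clusters[ib]
        del clusters[ib]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def transition_counts(labels):
    """Count state transitions between consecutive packets (time order)."""
    return Counter(zip(labels, labels[1:]))


# Functional segments of four packets, already converted to integers.
segments = [[1, 3], [2, 16], [1, 3], [2, 16]]
labels = agnes(zscore_columns(segments), threshold=0.5)
print(labels)
print(transition_counts(labels))
```

The transition counts give the weighted edges of the state transition graph; dividing each count by the total outgoing count of its source state would give the transition probabilities that a probability-based pruning step could use.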

In the technical scheme, the format boundary achieves byte granularity division and performs merging processing aiming at excessive segmentation of the functional segments. And the state machine extraction part optimizes state labeling by combining the segmentation information of the format.

The invention also provides a reverse analysis device for the format and the state of the private network protocol, which comprises the following modules:

Information acquisition and input module: collects the traffic to be analyzed with the tcpdump and nmap tools and captures traffic packet sequences to obtain the input traffic set, where the traffic packet sequences comprise request packet sequences and response packet sequences;

Preprocessing stage module: preprocesses the input traffic packet sequences according to the relation between protocol interaction behavior and the packet sequences, extracting and preliminarily dividing the full-segment information of the private protocol;

Protocol boundary extraction module: in the format reverse analysis stage of the full-segment private protocol information, judges format boundaries at fine granularity by combining progressive multiple sequence alignment, information entropy, and frequent items, yielding the boundary segmentation result and the field types, and thereby the protocol boundary part of the private network protocol format;

Semantic judgment module: based on the boundary segmentation result, infers the semantics of each divided field; fields with associated attributes are judged using existing methods and non-associated fields are judged heuristically; the result is a 'field start-stop position - semantics' correspondence, which gives the semantic part of the private network protocol format; the protocol boundary part and the semantic part together form the private network protocol format;

State machine construction module: constructs a passive state machine based on the format boundary inference result.

In the above scheme, the preprocessing stage module implementation includes the following steps:

2.3 Suspicious boundary judgment: the hexadecimal characters in server_all_oneline.txt and client_all_oneline.txt obtained in steps 2.1 and 2.2 are converted to ASCII and compared with a public feature library. Characters in a traffic packet that match text-protocol features (for example "#" and "?") have their positions marked, and a vector <packet sequence number, text feature 1, position 1, ..., text feature N, position N> is constructed and stored. Text feature N in the vector serves as the basis for judging a boundary point, and position N is the corresponding suspicious boundary point. The files server_Suspicus_boundary.txt and client_Suspicus_boundary.txt are output.

In the scheme, the protocol boundary extraction module comprises the following specific steps of:

3.1 Progressive multiple sequence alignment is performed to obtain a preliminary boundary division. The number of private protocol traffic packets obtained in step 2 is counted; files containing more than 1000 packets are aligned with the G-INS-I algorithm and files containing fewer than 1000 packets are aligned with the L-INS-I algorithm. During alignment, gaps in a sequence are filled with "-" and field boundaries are marked with spaces. The alignment result of each traffic packet is a row vector, and the results together form a two-dimensional matrix in which the rows record the sequence number and payload of each packet and the columns represent character positions in the sequence, expressed as: <0001: P1 P2 P3 P4 ... - P63 P64 ... P256>, <0002: P1 P2 P3 P4 - P63 - P128>, where Pn (n = 1..256) represents one byte of data, that is, 2 hexadecimal digits, in the actual private protocol traffic packet. The two-dimensional matrix is stored in the output files server_msa.txt and client_msa.txt;

3.2 Extraction of the private protocol functional segment: on the basis of step 3.1, the first column vector of the two-dimensional matrix is extracted as input, with spaces as the boundaries for column vector extraction, and the information entropy of the column vector is calculated. The "-" characters filled in during the alignment of step 3.1 are counted as zero when the entropy is calculated. The entropy calculation is repeated from the first column to the last column of the private protocol, and the payload is cut off at the position where the entropy changes abruptly between two adjacent columns. The characters from the initial character up to the position of the abrupt entropy change are extracted and judged to be the private protocol functional segment, and the functional segment files server_functional_oneline.txt and client_functional_oneline.txt are output;

3.3 Over-long sequence judgment: the length of each line of the two files server_all_oneline and client_all_oneline, that is, the length of each input traffic packet, is calculated. With the mean length plus the standard deviation as the threshold, individual traffic packets exceeding the threshold length are marked and extracted, and output as server_oversize_oneline and client_oversize_oneline;

3.4 Merging of over-segmented functional segments: the suspicious boundaries from step 2.3 are read. If a suspicious boundary already has space marks before and after it, no processing is done; if the field marked by a suspicious boundary has no space marks before and after it, a space column is inserted after the position of the suspicious boundary. The combination frequency between functional segment fields of length 1 byte and 2 bytes is then calculated, closed frequent item mining is applied, and high-probability combinations within the functional segment are merged. The upper limit of the merged length is 3 bytes. For bytes judged to need merging, the space between the two byte columns is deleted, and the processed boundary judgment result files server_result.txt and client_result.txt are output;

3.5 Field type judgment: the types are divided into five classes, namely fixed-length fixed value, variable-length variable value, fixed-length enumeration, fixed-length increment, and fixed-length unordered. The output files of step 3.4 are read. A field whose value is fixed in each column is marked as a fixed-length fixed-value field; a position whose value is "-" in a column is marked as a variable-length variable-value field; a field with fewer than 8 distinct values is marked as a fixed-length enumeration field; a field whose values follow an incrementing rule is marked as a fixed-length increment field; all other fixed-length fields are judged to be fixed-length unordered fields. The field type judgment results for the S end and the C end are output, for example for the S end: <1-2, fixed-length fixed value>, <2-4, fixed-length enumeration>, ..., <14-256, variable-length variable value> (the numbers give the start and stop positions of the field in hexadecimal digits);

In the above scheme, the implementation of the semantic judgment module specifically includes:

4.1 The over-long sequences judged in 3.3 are read and the length of the data segment is counted. This length value is compared with the values of the fixed-length unordered fields; if they are consistent, the fixed-length unordered field is judged to be the length field, and otherwise the field is left undetermined;

4.2 Non-associated fields such as flag bits are judged heuristically according to the pattern of their values, and a list of candidate grammars is constructed. The heuristic rules are as follows: a fixed-length fixed value is judged to be a flag field; a fixed-length enumeration value is judged to be a control field and a protocol message type field; a fixed-length increment value is judged to be a sequence number field; a fixed-length high-entropy field in the functional segment is judged to be a check code field. The semantic judgment results for the S end and the C end are output separately as a 'field start-stop position - semantics' correspondence, in the format: <field position, field type, suspicious attribute 1, ..., suspicious attribute N>;

In the above scheme, the implementation of the state machine construction module specifically comprises the following steps:

5.1 The values of the functional segments extracted in step 3.2 are normalized: the two files server_functional_oneline.txt and client_functional_oneline.txt are read, the column vectors in each file are extracted with the space marks as boundaries, and the functional segments are mapped to a common scale using zero-mean normalization, generating server_result_normalization.txt and client_result_normalization.txt;

5.2 The distances between the functional segments normalized in 5.1 are calculated, and states are labeled using the hierarchical clustering algorithm AGNES, which clusters the different message states. For the distance calculation, each row vector of server_result_normalization.txt and client_result_normalization.txt from 5.1 is read, and the Canberra dissimilarity between the functional segments of the traffic packets is used as the basis for division. In the clustering process, each traffic packet starts as its own cluster, the distance between every pair of clusters is calculated, and the two closest clusters are merged into a new cluster. Merging of the closest clusters is repeated, and the hierarchical clustering algorithm terminates when the distance between the farthest two clusters exceeds a threshold or the number of clusters reaches a specified value;

Compared with the prior art, the invention has the beneficial effects that:

1. Character encoding is judged in the preprocessing stage. Because private protocol designs partly resemble the grammar of public protocols, a large number of current public protocols were analyzed and the distinctive marker fields of text protocols were extracted to assist format boundary segmentation. This refines the granularity of boundary judgment for mixed-type traffic.

2. The format boundary segmentation part analyzes traffic of different volumes by combining byte variation characteristics with combination frequency, and merges fields whose combination frequency exceeds a threshold. This reduces the negative effect that similar values in different functional fields have on boundary segmentation when alignment alone is used. In addition, the information entropy value is used only to separate the functional segment from the payload segment and is not used for fine-grained division, which optimizes functional segment extraction. This lays the foundation for the subsequent state labeling while avoiding the situation in byte-granularity analysis where boundaries cannot be segmented because the entropy differences are too small.

3. The state labeling part uses hierarchical clustering and classifies categories using the normalized Canberra distance. The threshold can be adjusted dynamically to obtain a specified number of categories at any level, which avoids the excessive human intervention that would otherwise be needed because the number of categories is unknown.

4. The invention uses only the available traffic information to perform boundary segmentation and state labeling as needed, so protocol reverse engineering can be carried out in an uncontrolled environment, that is, without program source code or protocol specification documents.

Drawings

Fig. 1 is a general architecture diagram of the present invention.

Detailed Description

The invention will be further described with reference to the drawings and detailed description.

Examples

The example uses the Modbus protocol over Ethernet TCP/IP (Modbus/TCP), proposed by Schneider Electric. No prior knowledge of Modbus is introduced in the scheme, so the Modbus/TCP protocol can be treated as a private protocol for the present method and device.

In the acquisition stage, a packet capture tool is used to collect 900 request-response traffic packet sequences on a core switch. The traffic is first parsed from the bottom up according to the OSI model, the IP and port information is recorded, and the sending and receiving directions are judged from the IPs and ports. The full private protocol segment, from the first byte to the last byte of the application layer, is then extracted from each traffic packet. Finally, suspicious boundary judgment is performed: the hexadecimal characters are converted to ASCII and compared with the public feature library. No special characters exist in the current traffic, so nothing is marked. The files server_all_oneline.txt and client_all_oneline.txt are output. The result of processing a single packet in the file is as follows: <0001: 059400000006010300080004>
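The direction split described above can be illustrated with a short Python sketch. The packet record layout (a dict with src_ip, src_port, dst_ip, dst_port and payload), the function name, and the example addresses are all hypothetical; the sketch also assumes the server endpoint is already known, whereas the text only says the direction is judged from IP and port information.

```python
def split_by_direction(packets, server_ip, server_port):
    """Split captured packets into client and server rows by endpoint.

    packets: iterable of dicts with src_ip, src_port, dst_ip, dst_port and
    payload (application-layer bytes). Each output row is
    (sequence number, hexadecimal payload), mirroring the one-line files.
    """
    server_endpoint = (server_ip, server_port)
    client_rows, server_rows = [], []
    for pkt in packets:
        payload_hex = pkt["payload"].hex()
        if (pkt["dst_ip"], pkt["dst_port"]) == server_endpoint:
            # Sent towards the server: a client request.
            client_rows.append((f"{len(client_rows) + 1:04d}", payload_hex))
        elif (pkt["src_ip"], pkt["src_port"]) == server_endpoint:
            # Sent by the server: a response.
            server_rows.append((f"{len(server_rows) + 1:04d}", payload_hex))
    return client_rows, server_rows


capture = [{
    "src_ip": "10.0.0.5", "src_port": 40000,
    "dst_ip": "10.0.0.9", "dst_port": 502,
    "payload": bytes.fromhex("059400000006010300080004"),
}]
client_rows, server_rows = split_by_direction(capture, "10.0.0.9", 502)
print(client_rows)
```

Port 502 is the registered Modbus/TCP port and is used here only to make the example concrete; in the method itself no Modbus knowledge is assumed.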

In the format reverse analysis stage, the L-INS-I algorithm is first executed to perform progressive multiple sequence alignment. The alignment process uses "-" to fill gaps and spaces to mark field boundaries. The alignment results form a two-dimensional matrix in which the rows record the sequence number and payload of each traffic packet and the columns represent field positions. The preliminary field boundary judgment results for the S end and the C end are output; a single-packet result is as follows: < 0001:0594000000060103000800 > -04.

The first column vector of the two-dimensional matrix is extracted as input, with spaces as the boundaries for column vector extraction. The information entropy of the column vector is calculated, with fields marked "-" counted as zero. The payload is cut off at the point where the variation of the entropy values across the sequence is greatest; the characters from the initial character up to the position where the entropy increases abruptly are extracted and judged to be the private protocol functional segment.
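The entropy-based cut between functional segment and payload can be sketched as follows. Two interpretations are assumed here: gap symbols "-" are simply left out of the entropy count, and the cut is placed at the largest increase in Shannon entropy between two adjacent columns. The function names are hypothetical, and the input needs at least two columns.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one aligned column, ignoring "-" gaps."""
    symbols = [s for s in column if s != "-"]
    if not symbols:
        return 0.0
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(symbols).values())


def split_functional_segment(rows):
    """Return (cut, functional_rows) for an aligned packet matrix.

    rows: list of packets, each a list of field values aligned by column.
    The cut is the index of the first payload column.
    """
    entropies = [column_entropy(col) for col in zip(*rows)]
    jumps = [entropies[i + 1] - entropies[i] for i in range(len(entropies) - 1)]
    cut = max(range(len(jumps)), key=jumps.__getitem__) + 1
    return cut, [row[:cut] for row in rows]


aligned = [
    ["01", "03", "a1"],
    ["01", "03", "5f"],
    ["01", "04", "c2"],
    ["01", "03", "77"],
]
cut, functional = split_functional_segment(aligned)
print(cut, functional[0])
```

In this toy input the third column takes a different value in every packet, so its entropy jumps and the first two columns are kept as the functional segment.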

The length distribution of the input traffic is calculated, and individual traffic packets exceeding the threshold length are marked and extracted. The suspicious boundaries from step 2.3 of S2 are read and used as a basis for format segmentation of the functional segment, supplementing the boundary division result within the functional segment.
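The over-long packet threshold (mean length plus one standard deviation, as stated in step 3.3) can be sketched in a few lines. Whether the population or the sample standard deviation is intended is not stated; the population form is assumed here, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def oversize_packets(hex_lines):
    """Return the packets whose byte length exceeds mean + standard deviation.

    hex_lines: one hexadecimal string per packet, as in the *_all_oneline files.
    """
    lengths = [len(line) // 2 for line in hex_lines]  # 2 hex digits per byte
    threshold = mean(lengths) + pstdev(lengths)
    return [line for line, n in zip(hex_lines, lengths) if n > threshold]


# Nine 12-byte packets and one 100-byte packet: only the long one is flagged.
packets = ["00" * 12] * 9 + ["00" * 100]
print(len(oversize_packets(packets)))
```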

Over-cut functional segment combination: for the traffic packets of non-ultra-long sequences in the input, the frequency with which adjacent single bytes occur together is calculated, the backward combination probability is calculated, and high-probability functional fields are combined, with the upper limit of the combined length set to 3 bytes (that is, 6 hexadecimal digits). The "private protocol format boundary" result files server_result.txt and client_result.txt are output; the single-packet result is as follows: < 0001:0594000000060103000800 > -04.
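One plausible reading of the backward combination probability is a conditional probability between adjacent single-byte columns. The sketch below assumes that reading; the scoring function, the 0.9 threshold and the greedy merge are illustrative choices and are not specified in the source.

```python
from collections import Counter, defaultdict


def backward_probability(left, right):
    """Share of packets whose right byte is the most common successor of their left byte."""
    successors = defaultdict(Counter)
    for a, b in zip(left, right):
        successors[a][b] += 1
    predictable = sum(c.most_common(1)[0][1] for c in successors.values())
    return predictable / len(left)


def merge_fields(columns, threshold=0.9, max_bytes=3):
    """Greedily merge adjacent single-byte columns; returns groups of column indices."""
    groups = [[0]]
    for i in range(1, len(columns)):
        current = groups[-1]
        p = backward_probability(columns[i - 1], columns[i])
        if len(current) < max_bytes and p >= threshold:
            current.append(i)      # high probability: extend the current field
        else:
            groups.append([i])     # otherwise start a new field
    return groups


# Hypothetical single-byte columns taken from four aligned packets.
columns = [
    ["05", "05", "06", "06"],
    ["94", "94", "95", "95"],
    ["00", "00", "00", "00"],
    ["a3", "1c", "9b", "44"],
]
groups = merge_fields(columns)
```

A score of this kind is trivially high when every value in the left column is distinct, so a practical version would also need a minimum support count before merging.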

Field type judgment: the server_MSA.txt and client_MSA.txt files are read, and positions where the value "-" appears in a column are marked as variable-length variable-value fields. Positions whose value is fixed in a column are marked as fixed-length fixed-value fields, and other positions are marked as fixed-length unordered fields. The "private protocol format field type" result is output. The analysis results are as follows: <1-4, fixed length enumeration>, <5-8, fixed length fixed value>, <9-12, fixed length unordered>, <12-13, fixed length fixed value>, <14-15, fixed length enumeration>, <16-496, variable length variable value>.
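The field type rules can be sketched as a small classifier over the values observed in one field across all packets. The rule for enumeration (fewer than 8 distinct values) and the incrementing type follow the claims; the order in which the rules are tested and the function name are assumptions made for this illustration.

```python
def classify_field(values):
    """Classify one field from its hex values across packets ('-' marks a gap)."""
    if any("-" in v for v in values):
        return "variable length variable value"
    distinct = set(values)
    if len(distinct) == 1:
        return "fixed length fixed value"
    numbers = [int(v, 16) for v in values]
    # Assumption: the increment rule is tested before the enumeration rule.
    if all(b > a for a, b in zip(numbers, numbers[1:])):
        return "fixed length increment"
    if len(distinct) < 8:
        return "fixed length enumeration"
    return "fixed length unordered"


# Hypothetical field values in capture order.
flag_type = classify_field(["05", "05", "05"])
counter_type = classify_field(["0001", "0002", "0003"])
opcode_type = classify_field(["01", "02", "01", "03"])
```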

Fields with front-to-back association characteristics, such as length fields, are judged using a constraint reasoning analysis method: grammar inference is executed and the length fields are identified. Non-associated fields, such as flag bits, are judged heuristically according to their character value patterns, and a suspicious grammar list is constructed: a fixed-length fixed-value pattern is judged to be a flag field, a fixed-length enumeration pattern is judged to be a control field and a protocol message type field, a fixed-length incrementing pattern is judged to be a sequence number field, and a fixed-length high-entropy field within the functional segment is judged to be a check code field. The "field semantic comparison table" result is output, and the inferred results are as follows: <1-4, fixed length enumeration, control field, protocol message type field>, <5-8, fixed length fixed value, flag field>, <9-12, fixed length unordered, length>, <12-13, fixed length fixed value, flag field>, <14-15, fixed length enumeration, control field, protocol message type field>, <16-496, variable length variable value, data segment>
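The two kinds of semantic judgment can be illustrated as follows. The length check is a sketch of one simple constraint (the field value equals the number of bytes that follow it), assuming big-endian encoding; the rule table restates the heuristics above, with the mapping of unordered fields to check codes being an assumption. The second packet is hypothetical.

```python
# Heuristic rules from the description; the 'unordered' entry is an assumption
# standing in for "fixed-length high-entropy field".
HEURISTIC_RULES = {
    "fixed length fixed value": ["flag field"],
    "fixed length enumeration": ["control field", "protocol message type field"],
    "fixed length increment": ["sequence number field"],
    "fixed length unordered": ["check code field"],
    "variable length variable value": ["data segment"],
}


def is_length_field(packets, start, end):
    """True if bytes [start, end) always equal the count of bytes after the field."""
    return all(
        int.from_bytes(p[start:end], "big") == len(p) - end for p in packets
    )


def infer_semantics(field_type, length_like=False):
    """Return candidate semantics; an associated length field takes priority."""
    if length_like:
        return ["length"]
    return HEURISTIC_RULES.get(field_type, [])


packets = [
    bytes.fromhex("059400000006010300080004"),  # example packet from the text
    bytes.fromhex("059500000005010302000a"),    # hypothetical second packet
]
length_found = is_length_field(packets, 4, 6)   # hex positions 9-12
semantics = infer_semantics("fixed length unordered", length_like=length_found)
```

For the example packet, the two bytes at hex positions 9-12 hold the value 6, and exactly 6 bytes follow, which is consistent with the "length" label in the inferred results.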

In the format state machine construction stage, functional segment value normalization is performed first: Server_result.txt and client_result.txt are read, the column vectors in each file are extracted using the space fields determined in step S3 as boundaries, and all processed functional segments are compressed into a space of the same dimension using zero-mean normalization. Server_result_normalization.txt and client_result_normalization.txt are generated.
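Zero-mean normalization of the functional segment columns can be sketched with the standard library. The sketch assumes each field has already been converted from hexadecimal to an integer, and it maps constant columns to 0.0 to avoid division by zero; both choices and the function name are assumptions.

```python
import statistics


def zscore_columns(rows):
    """Normalize each column to mean 0 and variance 1 (population statistics)."""
    normalized = []
    for col in zip(*rows):
        mean = statistics.fmean(col)
        std = statistics.pstdev(col)
        # Constant columns carry no information; map them to 0.0.
        normalized.append([(x - mean) / std if std else 0.0 for x in col])
    return [list(r) for r in zip(*normalized)]


# Hypothetical functional segments, one row per packet, fields parsed from hex.
rows = [
    [int("01", 16), int("05", 16), int("00", 16)],
    [int("03", 16), int("05", 16), int("04", 16)],
]
scaled = zscore_columns(rows)
```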

Hierarchical clustering is used for state labeling, and the different message states are clustered with AGNES. Each input is initially treated as its own cluster. The distance between every pair of clusters is calculated; in the distance calculation, each transverse (row) vector from step 1 is read and the Canberra dissimilarity between the functional segments of the traffic packets is used as the basis for division. These steps are repeated until the distance between the two furthest clusters exceeds a threshold or the number of clusters reaches a specified value, at which point the algorithm terminates.
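A compact sketch of agglomerative clustering with the Canberra distance is given below. The linkage rule is not fixed by the text; complete linkage (the distance between the furthest members of two clusters) is assumed here, and merging stops when the closest pair of clusters exceeds the threshold or the cluster count reaches a target. Function names, the threshold and the sample vectors are hypothetical.

```python
def canberra(x, y):
    """Canberra distance; terms whose denominator is zero contribute nothing."""
    total = 0.0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom:
            total += abs(a - b) / denom
    return total


def agnes(vectors, threshold, min_clusters=1):
    """Bottom-up clustering; returns one cluster label per input vector."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Assumed complete linkage: distance of the furthest member pair.
        return max(canberra(vectors[i], vectors[j]) for i in c1 for j in c2)

    while len(clusters) > min_clusters:
        dist, a, b = min(
            ((linkage(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if dist > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical functional-segment vectors for four packets.
vectors = [[1.0, 3.0], [1.0, 3.1], [1.0, 16.0], [1.0, 15.5]]
labels = agnes(vectors, threshold=0.2)
```

Packets that receive the same label are treated as messages of the same protocol state.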

State machine drawing: transition edges are drawn based on the temporal order of transitions between classes, and the "complete protocol state machine" result is output. The states are then reduced using existing abnormal-probability-based state pruning and equivalent-state-based state merging, and the "reduced protocol state machine" result is output.
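Building the transition graph from the time-ordered state labels, and pruning improbable edges, can be sketched as follows. The probability cutoff and function names are assumptions; equivalent-state merging is not shown, and the kept probabilities are not renormalized after pruning.

```python
from collections import Counter, defaultdict


def build_state_machine(labels):
    """Map each state to its successors with empirical transition probabilities."""
    counts = defaultdict(Counter)
    for src, dst in zip(labels, labels[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(c.values()) for dst, n in c.items()}
        for src, c in counts.items()
    }


def prune(machine, min_probability):
    """Drop transitions whose probability is below the cutoff."""
    pruned = {}
    for src, edges in machine.items():
        kept = {dst: p for dst, p in edges.items() if p >= min_probability}
        if kept:
            pruned[src] = kept
    return pruned


# Hypothetical sequence of cluster labels in capture order.
labels = ["A", "B", "A", "B", "A", "C", "A", "B"]
complete = build_state_machine(labels)
reduced = prune(complete, min_probability=0.3)
```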

Information acquisition and input module: collecting network traffic;

Private protocol extraction module: parsing the underlying known protocols, analyzing the IP and port information, and extracting the private protocol fields to be analyzed;

Character comparison module: comparing against the marker fields of public text protocols and judging suspicious boundary points;

Progressive multiple sequence alignment module: performing multiple sequence alignment and protocol character alignment;

Field type judging module: marking fixed-length fixed-value, variable-length variable-value, fixed-length enumeration, fixed-length increment and fixed-length unordered fields;

Payload segmentation module: calculating the longitudinal information entropy and separating the payload segment from the functional segment;

Ultra-long sequence judging module: calculating the length distribution of the input traffic, and marking and extracting single traffic packets exceeding a threshold length;

Functional segment combination module: calculating the combination frequency of the functional segments, calculating the backward combination probability for over-segmented single bytes, combining high-probability functional fields with the upper limit of combination set to 6 hexadecimal digits, executing the normal field calculation on the bytes from the first byte of an ultra-long sequence to the entropy mutation point, and comparing the difference in results;

Semantic judgment module: invoking an existing association analysis method to analyze the fields, and heuristically judging the semantic type based on the field variation characteristics;

Normalization module: longitudinally applying zero-mean normalization to the consecutive bytes judged to be functional segments after alignment, normalizing the original data set into a data set with a mean of 0 and a variance of 1;

Dissimilarity calculation module: calculating the dissimilarity of sequence functional segments according to the Canberra distance;

Hierarchical clustering algorithm module: classifying the input traffic based on distance;

Conversion relation construction module: generating a protocol state machine based on the temporal order of transitions;

Probability reduction module: calculating the transition probabilities between classes of the input and merging states.

The above are merely representative examples of the numerous specific applications of the present invention and should not be construed as limiting the scope of the invention in any way. All technical schemes formed by transformation or equivalent substitution fall within the protection scope of the invention.

Claims

1. A reverse analysis method for private network protocol format and state is characterized in that: the method comprises the following steps:

S4, semantic judgment: based on the boundary segmentation result, the semantics of each divided field are inferred; fields with associated attributes are judged using existing methods and non-associated fields are judged heuristically; the judgment result is a correspondence between field start and end positions and semantics, which yields the semantic part of the private network protocol format; the protocol boundary part and the semantic part together form the private network protocol format;

S5, format state machine construction stage: a passive state machine is constructed based on the format boundary inference result;

the specific steps of step 3 are as follows:

3.1 Performing progressive multiple sequence alignment and preliminary boundary division: the number of private protocol traffic packets obtained in step 2 is counted; files with more than 1000 traffic packets in total are aligned using the G-INS-I algorithm, and files with fewer than 1000 traffic packets in total are aligned using the L-INS-I algorithm; during alignment, gaps in the sequences are filled with "-" and field boundaries are marked with spaces; each traffic packet alignment result serves as a row vector, and the alignment results form a two-dimensional matrix in which the rows record the sequence numbers and payloads of the traffic packets and the columns represent character positions within the sequence, expressed as follows: <<0001: P1 P2 P3 P4 - … - P63 P64 … P256>, <0002: P1 P2 P3 P4 - P63 - … - P128>, …>, where Pn (n = 1 … 256) represents one byte of data, that is, 2 hexadecimal digits, in an actual private protocol traffic packet; the two-dimensional matrix is stored in the output files server_msa.txt and client_msa.txt;

3.4 Over-cut functional segment combination: the suspicious boundaries from step 2.3 are read; if a suspicious boundary already has space marks before and after it, it is not processed; if a position is marked as a suspicious boundary but has no space mark before and after it, a space column is inserted after the suspicious boundary; then, for functional segments of 1 byte and 2 bytes in length, the combination frequency between bytes is calculated, closed frequent itemset mining is applied, and high-probability combinations within the functional segment are merged, with the upper limit of the combined length set to 3 bytes; for bytes judged to be merged, the space between the two byte columns is deleted, and the processed boundary judgment result files Server_result.txt and client_result.txt are output;

3.5 Judging the field types: the types are divided into five categories, namely fixed-length fixed-value, variable-length variable-value, fixed-length enumeration, fixed-length increment and fixed-length unordered; the output file of step 3.4 is read; a field whose value is fixed in each column is marked as a fixed-length fixed-value field; a position where the value "-" appears in a column is marked as a variable-length variable-value field; a field with fewer than 8 distinct values is marked as a fixed-length enumeration field; a field whose values follow an incrementing rule is marked as a fixed-length increment field; the remaining fixed-length fields are judged to be fixed-length unordered fields; the field type judgment results of the S end and the C end are output, for example for the S end: <1-2, fixed length fixed value>, <2-4, fixed length enumeration>, …, <14-256, variable length variable value>;

the specific steps of step 5 are as follows:

5.1 Normalizing the functional segment values: the functional segments extracted in step 3.2 are normalized; the two files Server_functional_oneline.txt and client_functional_oneline.txt are read, the column vectors in the files are extracted using the space marks in the files as boundaries, and the functional segments are compressed into a space of the same dimension using zero-mean normalization, generating Server_result_normalization.txt and client_result_normalization.txt;

2. The reverse analysis method for proprietary network protocol format and status according to claim 1, wherein step 2 comprises the steps of:

2.1 Parsing the underlying known protocols and extracting IP and port information; judging the sending and receiving direction of the traffic from the IP and port information; dividing the collected traffic packets of the application into two categories, server and client; generating the matrices Server and client respectively and storing them in txt format, the matrix format being: {sequence number 1, private protocol traffic packet 1}, {sequence number 2, private protocol traffic packet 2}, … {sequence number N, private protocol traffic packet N};

2.3 Suspicious boundary judgment: the hexadecimal characters in Server_all_oneline.txt and client_all_oneline.txt obtained in step 2.1 and step 2.2 are converted to ASCII codes and compared with a public feature library; character positions in the traffic packets that match text features expressible as, for example, "#?" are marked; a vector of sequence number, text feature 1, position 1, …, text feature N, position N is constructed and stored, in which text feature N serves as the basis for boundary point judgment and position N corresponds to the judged suspicious boundary point; the files server_Suspicus_boundary.txt and client_Suspicus_boundary.txt are output.

3. The reverse analysis method for proprietary network protocol format and status according to claim 1, wherein step 4 specifically comprises:

4.2 Performing heuristic judgment on non-associated fields such as flag bits according to their character value patterns, and constructing a suspicious grammar list and heuristic judgment rules; a fixed-length fixed-value pattern is judged to be a flag field, a fixed-length enumeration pattern is judged to be a control field and a protocol message type field, a fixed-length incrementing pattern is judged to be a sequence number field, and a fixed-length high-entropy field within the functional segment is judged to be a check code field; the semantic judgment results of the S end and the C end are output separately, the judgment result being a "field start and end position to semantics" correspondence in the format: <field location, field type, suspicious attribute 1, …, suspicious attribute N>.

4. A reverse analysis device for proprietary network protocol format and status, comprising the following modules:

the semantic judgment module: based on the boundary segmentation result, the semantics of each divided field are inferred; fields with associated attributes are judged using existing methods and non-associated fields are judged heuristically; the judgment result is a correspondence between field start and end positions and semantics, which yields the semantic part of the private network protocol format; the protocol boundary part and the semantic part together form the private network protocol format;

the state machine construction module: constructing a passive state machine based on the format boundary inference result;

the protocol boundary extraction module comprises the following specific steps:

the semantic judgment module comprises the following specific steps:

5. The reverse analysis device for proprietary network protocol formats and states according to claim 4, wherein the preprocessing stage module implementation comprises the steps of:

6. The reverse analysis device for proprietary network protocol format and state according to claim 4, wherein the semantic judgment module implementation specifically comprises: