CN109040081B - Protocol field reverse analysis system and method based on BWT - Google Patents


Publication number
CN109040081B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810908816.2A
Other languages
Chinese (zh)
Other versions
CN109040081A (en)
Inventor
黄晓雪
孙云霄
黄俊恒
刘扬
王佰玲
王巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Weihai
Original Assignee
Harbin Institute of Technology Weihai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Weihai
Priority to CN201810908816.2A
Publication of CN109040081A
Application granted
Publication of CN109040081B
Legal status: Active


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/18 Protocol analysers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention provides a BWT-based protocol field reverse analysis system and method. A dedicated suffix index is constructed so that each comparison needs to match a given substring only once; the suffix-index-based comparison algorithm is flexible in design, consumes little space in the indexing stage, supports variable subsequence lengths, and identifies protocol fields quickly. After identifying the fixed fields, the invention counts the high-frequency fields by a random multi-stream multi-segment matching method, constructs a syntax tree according to the field positions and field counts, and extracts the field structure, thereby reverse-engineering the field format. The method classifies the recovered fields and uses them as input to a fuzzing tool, sends a large number of malformed test cases to the target communication entity, and at the same time monitors the target communication entity for anomalies with a debugger and a sniffer; any anomaly found is analyzed so that the security of the target communication entity can subsequently be improved.

Description

Protocol field reverse analysis system and method based on BWT
Technical Field
The present invention relates to the field of network protocols, and in particular, to a BWT-based system and method for reverse analysis of protocol fields.
Background
With the popularization of internet technology, research on network protocols is increasing. Much of this research depends on a detailed description of the protocol, namely the protocol specification, so protocol reverse engineering is becoming more and more important. A protocol is the set of rules and conventions that communication entities must follow when exchanging information in computer networks and distributed systems; these rules specify the data formats exchanged by the two communicating parties and the related synchronization issues, thereby ensuring that information is exchanged in an orderly and reliable manner. Besides the large number of standardized communication protocols, networks also carry a large number of proprietary, i.e. unknown, protocols. Some unknown protocols arise because a software vendor or individual does not disclose the protocol details for reasons of economic benefit, security, privacy and the like; others arise because malware uses proprietary protocols to evade tracking and analysis.
Protocol reverse analysis obtains a protocol specification description model by analyzing network protocol messages or the execution of protocol software when the protocol features (such as protocol format features, ports and traffic features) are unknown. The protocol specification includes a protocol format and a protocol state machine.
At present, protocol format extraction mainly uses progressive sequence alignment. Progressive alignment does not consider the format type of the message samples when performing the alignment, and forcibly aligning message samples of different formats reduces the accuracy of protocol analysis. Satisfactory results are obtained when the sequence samples are highly similar, but the alignment quality degrades severely when the sequences differ widely.
Disclosure of Invention
In order to overcome the above-mentioned deficiencies in the prior art, the present invention provides a BWT-based protocol field reverse analysis system, comprising: an input preprocessing module, a protocol format extraction module and a field fuzz testing module;
the input preprocessing module is used for segmenting input messages at session and message granularity; redundancy and interference in the original data must be removed before segmentation;
the protocol format extraction module is used for identifying the format corresponding to each message; it extracts the protocol format from network traffic, taking the captured network data streams as the analysis object, infers the protocol format from the frequency with which protocol field values change and from the characteristics of the fields, and constructs a field structure tree after identifying some of the protocol fields;
the field fuzz testing module is used for fuzzing the specific fields extracted by the protocol format extraction module with the fuzzing tool SPIKE.
A BWT-based protocol field reverse analysis method, comprising:
step one, segmenting input messages at session and message granularity, wherein redundancy and interference in the original data must be removed before segmentation; data acquisition, session delimitation and abstract clustering are the three main stages of input preprocessing; each TCP session is identified by its TCP quadruple; the extracted TCP streams are taken as input for abstract clustering, so as to find sets of TCP streams with similar characteristics;
step two, identifying the format corresponding to each message; the abstract stream sets are labeled, each abstract stream set is compared using the BWT algorithm, and the fixed fields in it are extracted; meanwhile, a random multi-stream multi-segment statistical method counts the occurrences of the different fields, and a field structure tree is constructed from the quantitative relationships, thereby extracting the protocol format;
and step three, classifying the fields extracted by the protocol format extraction module by protocol field type, fuzzing the variable-value fields that follow the fixed fields, constructing malformed test cases, sending them to the target communication entity, and meanwhile monitoring the target communication entity for anomalies with a debugger and a sniffer.
In the present invention, step one is carried out by a data acquisition unit, a session delimitation unit and an abstract clustering unit;
the data acquisition unit writes a traffic-capturing script based on a scheduled-task command, and the script accesses the target communication entity a preset number of times; a filtering rule for the target communication is defined so that only traffic related to the target communication entity is retained; the traffic of each access is stored in a corresponding pcap file, yielding the offline packet capture files;
after the packet files have been collected, the session delimitation unit strips the packets layer by layer according to the protocol family and uses a stream extraction tool to extract streams from the pcap files;
the abstract clustering unit clusters the extracted streams, so that the streams are abstracted into different types;
clustering yields a number of abstract stream classes; the streams within each abstract stream class are similar to one another, and the cluster number is used as the abstract stream class number; the messages are then relabeled as a sequence of abstract stream class numbers.
In the present invention, step two is carried out in part by a BWT-based sequence alignment unit;
the BWT-based sequence alignment unit constructs a BWT string for a preset mother sequence: a $ symbol is appended to the end of the mother sequence, and the resulting sequence is cyclically shifted to the right one character at a time, each shift yielding a new sequence; this finally gives n sequences of length n, which form the BWT matrix, where n is the length of the mother sequence including the $ symbol;
the n sequences are sorted in lexicographic order to form the BWT array, with the $ character ordered before every other character;
the last character of each sorted sequence is taken, and these characters are concatenated from top to bottom to form the BWT string;
to search for a substring, its last character x is taken and located in the first column (the leftmost column) of the BWT array, and the last character k of that row is read; if this k is the i-th occurrence of k in the last column of the BWT array, the row holding the i-th occurrence of k in the first column is located and the last character of that row becomes the new k; this is repeated; if the sequence formed by the successive values of k fails to match the reverse of the substring to be searched, the loop exits; if it matches throughout, the loop stops after m operations, where m is the length of the substring to be searched, and the substring has been found.
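The construction and search described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `bwt_string` and `count_occurrences` are hypothetical, the mother sequence is assumed not to contain `$`, the rotations are built naively (quadratic space), and the search uses the standard interval form of backward search rather than the row-by-row walk in the text. Both rely on the same correspondence between the i-th occurrence of a character in the last column and in the first column.

```python
from collections import Counter

END = "$"  # end marker appended to the mother sequence (assumed absent from it)


def _rank(ch):
    # '$' must sort before every other character, whatever its ASCII code.
    return -1 if ch == END else ord(ch)


def bwt_string(mother):
    """Return the BWT string: the last column of the sorted rotation matrix."""
    s = mother + END
    # The set of all rotations is the same whether we shift left or right.
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort(key=lambda r: [_rank(c) for c in r])
    return "".join(r[-1] for r in rotations)


def count_occurrences(bwt, pattern):
    """Count occurrences of pattern in the mother sequence by backward search."""
    counts = Counter(bwt)
    first, total = {}, 0  # first[c] = first row whose first column holds c
    for c in sorted(counts, key=_rank):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)  # current interval of matching rows, [lo, hi)
    for c in reversed(pattern):  # consume the pattern from its last character
        if c not in first:
            return 0
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0  # mismatch: exit early, as in the text
    return hi - lo
```

For example, `count_occurrences(bwt_string("banana"), "ana")` counts the two overlapping occurrences of `ana`. A practical index would replace the repeated `bwt[:i].count(c)` scans with precomputed occurrence tables.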
In the present invention, step two is also carried out in part by a BWT-based protocol format extraction unit;
the BWT-based protocol format extraction unit randomly selects two streams x and y from the same abstract stream class; based on the AB segment labels of the two streams, it selects the A segments at the same position, constructs a suffix tree using the A segment of stream y as the template, and uniformly divides the A segment of stream x into pieces of a specified unit length; in the AB segmentation, an A segment is a text segment and a B segment is a binary segment;
the uniform division mode is as follows: the specified unit length is set to 2 bytes, and the first A segment of stream x is divided into uniform strings of length 2; the position of each string is recorded; the string at each position is then used as a pattern and matched against the first A segment of stream y; if it matches, the length-2 string occurs in the first A segment of stream y and is retained; if it does not match, the string does not occur there and is discarded; the matching result is an array of strings;
the length expansion mode is as follows: each string in the retained array is extended by one character to the right; if the added character lies inside the adjacent retained string, the two adjacent retained strings are merged, the string length is updated to the sum of the lengths of the two strings, and the starting position in the A segment of the first character of the new string is kept; if the added character does not lie inside an adjacent retained string, the string length is updated to the original length plus one, and the starting position of the new string in the A segment is kept;
the recursive matching mode is as follows: the merged retained strings are matched one by one in the suffix tree of the A segment of stream y, and the length expansion mode is repeated on the strings that match, until no matched string can be extended further; the final result is the string array bwt_xy_A1{A1_STR_1, A1_STR_2, ..., A1_STR_n}, which represents the set of all locally longest common subsequences of the first A segment of the random dual streams x and y; for each distinct string in the set, the number of occurrences and the positions of occurrence are counted, and the result for the A segment is output;
the segment-wise summary mode is as follows: B segments are not considered, and the format of the abstract stream class is preset; the remaining A segments of the random dual streams x and y are compared in turn by repeating the uniform division, length expansion and recursive matching modes, giving a set of m string arrays; this set represents the locally longest common strings of all A segments of streams x and y, and the result is output after the random dual-stream multi-segment statistics;
the random multi-stream multi-segment statistical method is as follows: in the above steps, two streams p and q belonging to the same abstract stream class are randomly extracted, and the occurrences of identical subsequences in the two streams are counted; the random dual-stream multi-segment statistics are performed many times on the abstract stream class, i.e. the uniform division, length expansion, recursive matching and segment-wise summary modes are repeated a preset number of times, the preset number being N², where N is the number of streams in the abstract stream class; after N² executions, N² sets of string arrays of the form (bwt_pq_A1, bwt_pq_A2, …, bwt_pq_An) are obtained; strings that newly appear in an A segment are added to the statistics table and their occurrence counts are recorded; if a common subsequence already appears in the statistics table, its occurrence count is updated;
the field structure tree is constructed as follows: the structural relationship between fields is inferred from the number of occurrences and the order of occurrence of the strings or byte strings; the string with the most occurrences can be inferred to be a first-level field, and its number of occurrences should be close to the number of streams in the abstract stream class; if the occurrence counts of the different strings at a given position sum to approximately the count of the first-level field, that position is marked as a composite field whose several subsequences are second-level fields; by analogy, the field structure tree can be inferred recursively from top to bottom from the relationship between order of occurrence and number of occurrences.
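The tree-building rule can be illustrated with a small Python sketch. It assumes the statistics table is a dictionary mapping `(position, string)` to an occurrence count; the function name `build_field_tree`, the `tolerance` parameter and the output layout are hypothetical, and only the first two levels are inferred rather than the full recursion.

```python
def build_field_tree(stats, tolerance=0.1):
    """Infer first-level and composite fields from {(position, string): count}."""
    top = max(stats.values())  # count of the most frequent string
    by_position = {}
    for (pos, s), n in stats.items():
        by_position.setdefault(pos, []).append((s, n))
    tree = []
    for pos in sorted(by_position):  # keep the order of occurrence
        entries = by_position[pos]
        total = sum(n for _, n in entries)
        if abs(total - top) > tolerance * top:
            continue  # too rare at this position to be treated as a field
        if len(entries) == 1:
            tree.append({"position": pos, "field": entries[0][0]})
        else:
            # Several strings share the position and together reach the
            # first-level count: a composite field with second-level values.
            tree.append({"position": pos,
                         "composite": sorted(s for s, _ in entries)})
    return tree
```

With illustrative counts such as `{(0, "GET "): 60, (0, "POST "): 40, (20, " HTTP/1.1"): 100}`, position 0 is reported as a composite field with two second-level values and position 20 as a plain first-level field.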
In the present invention, step three is carried out by the field fuzz testing module;
the field fuzz testing module further takes the value types and the order of the fixed fields as input to the SPIKE fuzzing tool, constructs malformed test cases, sends them to the target communication entity and checks how the target communication entity behaves;
the fixed fields obtained by BWT comparison are classified, separator fields are identified and string fields are delimited; the fields are added to SPIKE in the order in which they appear in the message sample; for a fixed field, the s_string() function adds a fixed string or byte string to SPIKE; for variable-value content following a fixed field, functions such as s_string_variable() and s_binary_block_size_halfword_bigendian_variable() add variable data to SPIKE, which completes the fuzzing script;
a debugger is attached, and the finished fuzzing script is run through SPIKE; the state of the debugger is observed, and when a crash occurs the register values are abnormal; when the debugger detects the anomaly, the target communication entity has crashed, so SPIKE can no longer connect to the target, which causes SPIKE to crash as well; the last output before the SPIKE crash is checked to determine which malformed test case crashed the target communication entity, the traffic captured by the sniffer is used to help locate that test case, and the cause of the crash is analyzed further in order to find and repair the vulnerability in the software, thereby improving its security.
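To make the idea of a malformed test case concrete, the following Python sketch keeps a fixed field unchanged and varies only the value that follows it. It is not SPIKE and does not reproduce SPIKE's mutation strategy; the function name `malformed_cases`, the payload list and the `\r\n` terminator are assumptions chosen purely for illustration.

```python
def malformed_cases(fixed_field: bytes, terminator: bytes = b"\r\n"):
    """Build test messages: fixed field + abnormal variable value + terminator."""
    # Hypothetical abnormal values: overlong input, format specifiers,
    # an empty value, embedded NUL bytes and a path traversal pattern.
    payloads = [
        b"A" * 5000,
        b"%s%n%x" * 50,
        b"",
        b"\x00" * 64,
        b"../" * 100,
    ]
    return [fixed_field + value + terminator for value in payloads]
```

For a text protocol whose fixed field is `b"USER "`, each generated case starts with `USER ` and differs only in the variable part, which is the part the text says should be fuzzed.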
According to the above technical scheme, the invention has the following advantages:
the protocol field extraction provided by the invention differs from progressive multiple sequence alignment. The BWT-based comparison method converts the inexact matching problem into an exact matching problem and constructs a dedicated suffix index, so that each comparison needs to match a given substring only once. The suffix-index-based comparison algorithm is flexible in design and consumes little space in the indexing stage, the subsequence length is variable, and exactly matching strings can be found quickly. The BWT algorithm of the invention builds an index structure with the properties of a suffix tree and achieves fast search for short sequences; at the same time, the suffix tree method effectively reduces inexact matching and avoids redundant comparisons, making the comparison fast and memory-efficient.
Drawings
In order to illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a BWT-based protocol field reverse analysis system;
FIG. 2 is a field structure tree diagram;
FIG. 3 is a flowchart of a BWT-based protocol field reverse analysis method.
Detailed Description
In order to make the objects, features and advantages of the present invention clearer and easier to understand, the technical solutions of the present invention are described clearly and completely below with reference to specific embodiments and the drawings. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the scope of protection of this patent.
The invention relates to a BWT-based protocol field reverse analysis system and analysis method. As shown in FIG. 1, the system comprises an input preprocessing module 1, a protocol format extraction module 2 and a field fuzz testing module 3. The input preprocessing module 1 segments input messages at session and message granularity; the protocol format extraction module 2 identifies the format corresponding to each message and, after identifying some of the protocol fields, constructs a field structure tree; the field fuzz testing module 3 fuzzes the specific fields extracted by the protocol format extraction module with the fuzzing tool SPIKE.
The present invention also provides a BWT-based protocol field reverse analysis method. As shown in FIG. 3, the method segments the input messages at session and message granularity; identifies the format corresponding to each message; and classifies the fields extracted by the protocol format extraction module by protocol field type and fuzzes the variable-value fields that follow the fixed fields.
In the embodiment provided by the invention, the input preprocessing module 1 comprises a data acquisition unit, a session delimitation unit and an abstract clustering unit. For the data acquisition unit, a target communication entity is selected in an experimental environment. In a real situation the protocol of the communication entity is unknown, but in the experimental environment protocols with known specifications, such as FTP and IRC, can be used so that the effectiveness of the method can be judged. A traffic-capturing script is written using a scheduled-task command, and the script accesses the same target communication entity multiple times. A filtering rule is defined so that only traffic related to the target communication entity is retained, and the traffic of each access is stored in a corresponding pcap file, yielding a large number of offline packet capture files.
In the embodiment provided by the invention, the session delimitation unit supports a network-traffic-based protocol format reverse engineering method, which extracts the protocol message format from the byte change frequency and value characteristics of a large number of messages of the same type. The method rests on the following observation: a single message sample is an instance of a message format that the protocol entity accepts in its current state, and multiple message samples corresponding to the same message format are similar to one another.
After the packets have been captured, they must be stripped layer by layer according to the protocol family, such as TCP/IP. For example, under Kali Linux the data extraction tool tcpflow can be used; it quickly reassembles TCP packets and correctly handles retransmission and out-of-order delivery. TCP stream extraction may also be performed with the TCP session reassembly function of libnids or the stream-following function of Wireshark.
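Session delimitation by TCP quadruple can be sketched as below. The sketch assumes packets have already been parsed into dictionaries with the hypothetical keys `src_ip`, `src_port`, `dst_ip`, `dst_port` and `payload`; it only groups payloads by session in arrival order and performs none of the retransmission or reordering handling that tools such as tcpflow provide.

```python
from collections import defaultdict


def session_key(src_ip, src_port, dst_ip, dst_port):
    """Direction-independent key built from the TCP quadruple."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    # Order the two endpoints so both directions map to the same session.
    return (a, b) if a <= b else (b, a)


def group_sessions(packets):
    """Group packet payloads by TCP session, keeping arrival order."""
    sessions = defaultdict(list)
    for pkt in packets:
        key = session_key(pkt["src_ip"], pkt["src_port"],
                          pkt["dst_ip"], pkt["dst_port"])
        sessions[key].append(pkt["payload"])
    return dict(sessions)
```

Sorting the two endpoints is a design choice that lets a request and its reply land in the same session without tracking which side opened the connection.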
In the embodiment provided by the invention, the abstract clustering unit extracts messages of the same type by clustering the large number of extracted streams. After the original pcap is obtained, each byte of each message sequence is assigned a type: a printable character is represented by 'A', and a binary byte by 'B'. For convenience of labeling, consecutive A bytes are merged into an 'A' sample segment (to simplify segmentation, an A sample segment must be at least 3 bytes long; otherwise it is treated as part of a B sample segment), and consecutive B bytes are merged into a 'B' sample segment. The message then becomes a sequence of 'A' and 'B' sample segments. The following is an example of the preliminary segmentation of the HTTP protocol.
[Figure BDA0001761246480000091: example of preliminary A/B segmentation of an HTTP message; image not reproduced]
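The A/B labeling can be sketched in Python as follows. The sketch assumes that "printable" means the ASCII range 0x20 to 0x7E, so that CR and LF count as binary bytes; the patent text does not state the exact range, and the function name `ab_segments` is hypothetical.

```python
def ab_segments(message: bytes, min_text_len: int = 3):
    """Split a message into ('A', text) and ('B', binary) sample segments."""
    def is_printable(byte):
        return 0x20 <= byte <= 0x7E  # assumed definition of a printable byte

    # First pass: maximal runs of bytes of the same type.
    runs = []
    for byte in message:
        kind = "A" if is_printable(byte) else "B"
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(byte)
        else:
            runs.append([kind, bytearray([byte])])

    # Second pass: A runs shorter than min_text_len become B, then
    # neighbouring runs of the same type are merged.
    merged = []
    for kind, data in runs:
        if kind == "A" and len(data) < min_text_len:
            kind = "B"
        if merged and merged[-1][0] == kind:
            merged[-1][1].extend(data)
        else:
            merged.append([kind, bytearray(data)])
    return [(kind, bytes(data)) for kind, data in merged]
```

Because short text runs are folded into the surrounding binary data, the resulting segment types always alternate between A and B.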
Because the segment types alternate, if two streams contain the same number of sample segments and their first sample segments agree, the two streams are judged to belong to the same abstract class; using this judgment as a precondition improves clustering efficiency. Because of the particular nature of message sequences, it is difficult to define a quantity that truly measures distance, so the notion of clustering is only borrowed here. For the displayable sample segment 'A', the length of the longest common subsequence is taken as the measure of message similarity. The distance formula between stream x and stream y is given below. During clustering, the two clusters with the smallest distance are merged into a new cluster; the distance between two clusters is the minimum distance between their elements. For the non-displayable binary segment 'B', TCP streams that do not meet the requirements are removed by finding the most frequent byte sequence, which identifies the messages of a common type.
[Formula BDA0001761246480000092: distance between stream x and stream y; image not reproduced]
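Since the distance formula itself is not available in this text, the Python sketch below substitutes an assumed normalized distance, 1 - LCS(x, y) / max(|x|, |y|), purely to show how the longest common subsequence length and minimum-distance (single-linkage) merging fit together. The functions `lcs_length`, `distance` and `single_linkage` and the `threshold` stopping rule are illustrative and are not taken from the patent.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def distance(a, b):
    # Assumed stand-in for the patent's distance formula.
    return 1.0 - lcs_length(a, b) / max(len(a), len(b), 1)


def single_linkage(items, threshold):
    """Repeatedly merge the two closest clusters until they are too far apart."""
    clusters = [[i] for i in range(len(items))]

    def gap(c1, c2):  # cluster distance = minimum element distance
        return min(distance(items[i], items[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, i, j = min((gap(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters
```

The returned clusters hold indices into `items`, so a cluster's position in the result can serve directly as its abstract stream class number.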
Clustering yields a number of abstract stream classes. The streams within each abstract stream class are similar and belong to the same class, and by default they are taken to be messages with similar formats in the network protocol. The cluster number is used as the abstract stream class number. The messages are relabeled as a sequence of abstract stream class numbers in preparation for inferring the protocol state machine. The following is an example of message abstract stream labels:
1 2 5 3 2 2 6 5 5 8 1
In the embodiment provided by the present invention, the protocol format extraction module 2 includes a BWT-based sequence alignment unit and a BWT-based protocol format extraction unit. Working on the streams that have been preliminarily segmented and clustered, the BWT-based sequence alignment unit extracts the protocol format from sample segments of the same type in streams of the same class. Either the existing progressive sequence alignment method or the BWT-based sequence alignment algorithm proposed in this patent can be used here.
BWT converts the original text into a permuted text in which identical characters tend to occupy consecutive or adjacent positions. The BWT algorithm can match a short sequence a against a long sequence A to find the positions of a in A.
In the embodiment provided by the invention, the BWT-based protocol format extraction unit works as follows. Two streams x and y are first selected at random from the same abstract stream class. Based on the AB segment labels of the two streams, the A segments at the same position are selected; a suffix tree is constructed using the A segment of stream y as the template, and the A segment of stream x is uniformly divided into pieces of the specified unit length.
(1) Uniform division mode
Initially, the specified unit length is set to 2 bytes, and the first A segment of stream x is divided into uniform strings of length 2. The position of each string is recorded. The string at each position is then used as a pattern and matched against the first A segment of stream y. If it matches, the length-2 string occurs in the first A segment of stream y and is retained; if it does not match, the string does not occur there and is discarded. The result is an array of strings.
(2) Length expansion mode
Each string in the retained array is extended by one character to the right. If the added character lies inside the adjacent retained string, the two adjacent retained strings are merged, the string length is updated (to the sum of the lengths of the two strings), and the starting position in the A segment of the first character of the new string is kept. If the added character does not lie inside an adjacent retained string, the string length is updated (to the original length plus 1), and the starting position in the A segment of the first character of the new string is kept.
(3) Recursive matching mode
The merged retained strings are matched one by one in the suffix tree of the A segment of stream y, and step (2) is repeated on the strings that match, until no matched string can be extended further. The final result is the string array bwt_xy_A1{A1_STR_1, A1_STR_2, ..., A1_STR_n}, which represents the set of all locally longest common subsequences of the first A segment of the random dual streams x and y. Counting the number of occurrences and the positions of occurrence of each distinct string in the set gives the final result for the A segment, which can be expressed as:
Table 1. Statistics for the first A segment of the dual stream
[Table image BDA0001761246480000111 not reproduced]
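Steps (1) to (3) can be sketched for a single pair of A segments as follows. This is a simplified illustration under stated assumptions: Python's substring test `in` stands in for the BWT or suffix-tree lookup, the function name `common_substrings` is hypothetical, and seeds that end up inside an already extended string are simply skipped instead of being merged explicitly.

```python
def common_substrings(x: str, y: str, unit: int = 2):
    """Return (start position in x, string) for locally longest matches in y."""
    # (1) Uniform division: keep the fixed-length pieces of x that occur in y.
    seeds = [pos for pos in range(0, len(x) - unit + 1, unit)
             if x[pos:pos + unit] in y]

    results = []
    covered_until = 0  # end of the most recently extended string
    for pos in seeds:
        if pos + unit <= covered_until:
            continue  # this seed was absorbed by the previous extension
        # (2) + (3) Extend one character to the right while y still matches.
        end = pos + unit
        while end < len(x) and x[pos:end + 1] in y:
            end += 1
        results.append((pos, x[pos:end]))
        covered_until = end
    return results
```

For the two request lines `GET /index.html HTTP/1.1` and `GET /about.html HTTP/1.1`, the sketch yields `GET /` at position 0 and `.html HTTP/1.1` at position 10. Because the seeds start at fixed offsets and only grow to the right, a common string that begins between two seed positions loses its first character, which is a property of this simplified sketch.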
(4) Segment-wise summary mode
B segments are not considered. The remaining A segments of the random dual streams x and y are compared in turn by repeating steps (1), (2) and (3), finally giving a set of m string arrays, as shown in Table 2. This set represents the locally longest common strings of all A segments of streams x and y; the final result after the random dual-stream multi-segment statistics is shown in Table 3.
Table 2. A-segment form of the abstract stream class
A1 A2 ... Am
Table 3. Statistics for all A segments of the dual stream
[Table image BDA0001761246480000121 not reproduced]
(5) Random multi-stream multi-segment statistics
In the above steps, two streams p and q are randomly extracted (p and q belong to the same abstract stream class), and the occurrence times of the same subsequence in the two streams are counted. In order to obtain enough subsequences for statistical quantitative relationship, multiple random dual stream multi-segment statistics are performed on the abstract stream class, i.e. the steps (1) (2) (3) (4) are repeated enough times (default N)2And N is the number of all streams in the abstract class stream). After executing N2Then, N will be obtained2A character string array set in the form of (bwt _ pq _ A1, bwt _ pq _ A2, … and bwt _ pq _ An), adding the character strings newly appeared in the A section into a statistical table, and recording the appearance times; and if the new public subsequence appears in the statistical table, updating the appearance times of the corresponding subsequence. The following results were obtained:
TABLE 4 statistical results of all A segments of random multithreading
Figure BDA0001761246480000131
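The repeated sampling of steps (4) and (5) can be sketched as follows. This is a minimal sketch under stated assumptions: each stream is represented as a list of its A segments, `difflib.SequenceMatcher` is used as a stand-in for the BWT/suffix-tree matcher, and the names `common_strings` and `multi_stream_stats` as well as the fixed random seed are hypothetical.

```python
import random
from collections import Counter
from difflib import SequenceMatcher


def common_strings(a: bytes, b: bytes, min_len: int = 2):
    """Stand-in matcher: common blocks of a and b of at least min_len bytes."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return [a[m.a:m.a + m.size] for m in blocks if m.size >= min_len]


def multi_stream_stats(streams, rounds=None, seed=0):
    """streams: list of streams, each a list of A segments (bytes).
    Returns a Counter keyed by (segment index, common string)."""
    rng = random.Random(seed)
    n = len(streams)
    rounds = n * n if rounds is None else rounds  # default: N^2 repetitions
    table = Counter()
    for _ in range(rounds):
        p, q = rng.sample(streams, 2)  # random dual stream
        # Step (4): compare the A segments at the same position in turn.
        for seg_idx, (seg_p, seg_q) in enumerate(zip(p, q)):
            for s in common_strings(seg_p, seg_q):
                table[(seg_idx, s)] += 1  # add new string or update count
    return table


streams = [[b"USER ", b"PASS "], [b"USER ", b"PASS "], [b"USER ", b"PASS "]]
stats = multi_stream_stats(streams)
print(stats[(0, b"USER ")], stats[(1, b"PASS ")])  # 9 9
```

With three identical streams, every one of the N² = 9 rounds finds the same string in each segment, so each entry in the statistics table reaches a count of 9.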
(6) Building a field structure tree
The field structure can be inferred from the occurrence counts and the order of occurrence of the strings, as shown in Fig. 2.
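A two-level version of this inference can be sketched as follows. The sketch assumes a statistics table keyed by (position, string), and it interprets "similar" counts through a hypothetical relative `tolerance` parameter; the function name `build_field_tree` and the sample counts are illustrative only, and deeper recursion is omitted.

```python
from collections import defaultdict


def build_field_tree(stats, tolerance=0.1):
    """stats maps (position, string) -> occurrence count.
    Returns a list of first-level entries ordered by position."""
    top = max(stats.values())  # count of the most frequent string

    def close(value):
        # Assumption: "similar" means within `tolerance` of the top count.
        return abs(value - top) <= tolerance * top

    by_pos = defaultdict(dict)
    for (pos, s), count in stats.items():
        by_pos[pos][s] = count

    tree = []
    for pos in sorted(by_pos):
        strings = by_pos[pos]
        fixed = [s for s, c in strings.items() if close(c)]
        if fixed:
            # A string seen about as often as the top string: first-level field.
            tree.append({"position": pos, "kind": "fixed", "values": fixed})
        elif len(strings) > 1 and close(sum(strings.values())):
            # Several alternatives whose counts add up to the top count:
            # a composite field whose alternatives are second-level fields.
            tree.append({"position": pos, "kind": "composite",
                         "values": sorted(strings)})
    return tree


stats = {(0, b"HTTP/1.1 "): 100, (9, b"200 OK"): 60,
         (9, b"404 Not Found"): 38, (30, b"xyz"): 5}
for node in build_field_tree(stats):
    print(node)
```

In this example the string at position 0 becomes a fixed first-level field, the two alternatives at position 9 (60 + 38 occurrences) form a composite field, and the rare string at position 30 is ignored.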
Fuzz testing can then be performed on the basis of the resulting BWT sequences. The value types and field order of the fixed fields of the BWT sequences are taken as input to the SPIKE fuzzing tool; malformed test cases are constructed and sent to the target communication entity, and the running state of the target communication entity is checked.
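One possible way of turning the extracted field order into SPIKE input is sketched below. The sketch assumes SPIKE's script commands `s_string` (fixed data) and `s_string_variable` (a string to be fuzzed); the helper name `to_spike_script`, the (kind, text) field representation and the sample FTP-style fields are hypothetical and do not come from the patent.

```python
def to_spike_script(fields):
    """fields: list of (kind, text) in message order, where kind is
    'fixed' for an extracted fixed field and 'variable' for the
    variable-valued field that follows it. Returns SPIKE script text."""
    lines = []
    for kind, text in fields:
        # Escape characters that would break a quoted SPIKE string.
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        escaped = escaped.replace("\r", "\\r").replace("\n", "\\n")
        command = "s_string" if kind == "fixed" else "s_string_variable"
        lines.append(f'{command}("{escaped}");')
    return "\n".join(lines)


fields = [("fixed", "USER "), ("variable", "anonymous"), ("fixed", "\r\n")]
print(to_spike_script(fields))
```

The generated text could be saved as a `.spk` file and run with one of SPIKE's generic senders (for example `generic_send_tcp`) while a debugger is attached to the target.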
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. A BWT-based protocol field reverse analysis method, the method comprising:
step one, dividing the input at session and message granularity, wherein redundancy and interference in the raw data are removed before the division; data acquisition, session delimitation and abstract clustering are the three stages of input preprocessing; each TCP session is identified by its TCP four-tuple; the extracted TCP streams are taken as input and abstract clustering is performed, thereby finding sets of TCP streams with similar characteristics;
step two, identifying the format corresponding to each message; labelling the abstract stream sets, comparing the streams within each abstract stream set using the BWT algorithm, and extracting the fixed fields in the abstract stream set; meanwhile, counting the occurrences of the different fields using the random multi-stream multi-segment statistical method, and constructing a field structure tree from the quantitative relationships, thereby extracting the protocol format;
BWT sequence alignment constructs a BWT sequence for a preset mother sequence: a $ symbol is appended to the end of the mother sequence, which is then cyclically shifted right one position at a time, each shift giving a new sequence; n sequences of length n are finally obtained, which together with the mother sequence form the BWT matrix, where n is the length of the mother sequence; the n sequences are sorted in lexicographic order to form the BWT array, the $ character being lexicographically smallest; the last character of each sequence is taken, and these characters are concatenated from top to bottom to form the BWT string;
to search for a substring, the last character x of the substring is taken and looked up in the first (leftmost) column of the BWT array, and the last character k of that row is obtained; if this k is the n-th k in the last column of the BWT array, the n-th k in the first column of the BWT array is located and the last character of that row is taken as the new k; this is repeated; the loop exits if the sequence formed by the successive k does not match the reverse of the substring to be searched; if it matches throughout, the loop stops after m operations and the substring has been found, where m is the length of the substring to be searched;
BWT protocol format extraction randomly selects two streams x and y from the same abstract stream class, selects the A segments at the same position based on the AB segment labels of the two streams, builds a suffix tree using the A segment of the y stream as the template, and uniformly splits the A segment of the x stream into units of a specified length; in the AB segmentation, an A segment is a text segment and a B segment is a binary segment;
the uniform segmentation mode is as follows: the specified unit length is set to 2 bytes, and the first A segment of the x stream is split into uniform strings of length 2; the position of each string is recorded; the string at each position is then matched against the first A segment of the y stream; if it matches, the length-2 string occurs in the first A segment of the y stream and is retained; if it does not match, the length-2 string does not occur in the A segment of the y stream and is discarded; the matching result is a string array;
the length expansion mode is as follows: each string in the retained string array is extended one character to the right; if the added character lies within the adjacent retained string, the two adjacent retained strings are merged, the string length is updated to the sum of the lengths of the adjacent strings, and the starting position in the A segment of the first character of the new string is retained; if the added character does not lie within an adjacent retained string, the string length is updated to the original length plus one, and the starting position of the new string in the A segment is retained;
the recursive matching mode is as follows: the merged retained strings are matched one by one against the suffix tree of the A segment of the y stream, and the length expansion mode is repeated for the strings that match, until the matched strings can no longer be extended; the final result is a string array bwt_xy_A1{A1_STR_1, A1_STR_2, ..., A1_STR_n}, which represents the set of all locally longest common subsequences of the first A segment of the random dual streams x and y; the number of occurrences and the positions of each distinct string in the set are counted, and the A-segment result is output;
the segment summary mode is as follows: ignoring the B segments, the format of the abstract stream class is preset; the remaining A segments of the random dual streams x and y are compared in sequence, repeating the uniform segmentation mode, the length expansion mode and the recursive matching mode, to obtain a set of m string arrays; the set of m string arrays represents the locally longest common strings of all A segments in the x and y streams, and the result is output after random dual-stream multi-segment statistics;
the random multi-stream multi-segment statistical method is as follows: in the above modes, two streams p and q belonging to the same abstract stream class are randomly extracted, and the number of occurrences of the same subsequences in the two streams is counted; random dual-stream multi-segment statistics are performed on the abstract stream class multiple times, that is, the uniform segmentation mode, the length expansion mode, the recursive matching mode and the segment summary mode are repeated a preset number of times, the preset number being N², where N is the number of streams in the abstract stream class; after N² executions, N² string-array sets of the form (bwt_pq_A1, bwt_pq_A2, …, bwt_pq_An) are obtained; strings newly appearing in an A segment are added to the statistics table and their occurrence counts are recorded; if a common subsequence already appears in the statistics table, the occurrence count of the corresponding subsequence is updated;
the field structure tree is constructed as follows: the field structure is inferred from the occurrence counts and the order of occurrence of the strings; the string with the most occurrences is inferred to be a first-level field, and its occurrence count should be close to the number of streams in the abstract stream class; if the sum of the occurrence counts of the different strings at a given position is close to that of the first-level field, the position is marked as a field, indicating that it is a composite field whose several subsequences are second-level fields; by analogy, the field structure tree is inferred recursively from top to bottom according to the order of occurrence and the occurrence counts;
and step three, classifying the fields extracted by the protocol format extraction module into protocol field types, fuzzing the variable-valued fields that follow the fixed fields, constructing malformed test cases, sending them to the target communication entity, and at the same time monitoring the target communication entity for abnormalities using a debugger and a sniffer.
2. The BWT-based protocol field reverse analysis method of claim 1, wherein
step one employs a data acquisition unit, a session delimitation unit and an abstract clustering unit;
the data acquisition unit writes a script that captures data traffic based on a scheduled-task command, the script accessing the target communication entity a preset number of times; a filtering rule for the target communication is defined so that only traffic related to the target communication entity is retained; the traffic of each access is stored in a corresponding pcap file, and the acquired offline packet files are saved;
after the packet files are collected, the session delimitation unit strips the packets layer by layer according to the protocol family and performs stream extraction on the pcap files using a stream extraction tool;
the abstract clustering unit clusters the extracted streams so that the streams are abstracted into different classes; clustering yields a plurality of abstract stream classes, the streams within each abstract stream class being similar, with the cluster number used as the abstract stream class number; the messages are re-labelled in the form of abstract stream class numbers.
3. The BWT-based protocol field reverse analysis method of claim 1, wherein
step three further employs a field fuzz testing module;
the field fuzz testing module takes the value types and field order of the fixed fields as input to the SPIKE fuzzing tool, constructs malformed test cases, sends them to the target communication entity, and checks the running state of the target communication entity;
the fixed fields obtained by BWT comparison are classified, delimiter fields are identified and string fields are separated; the fields are added to SPIKE in the order in which they appear in the message samples; the fuzz test script is written;
a debugger is attached and the written fuzz test script is run through SPIKE; the malformed test case that crashed the target communication entity is determined by inspecting the last output of SPIKE before the crash, while the traffic captured by a sniffer is used to help locate the test case that caused the crash; the cause of the crash is then analyzed to find and repair the software vulnerability, thereby improving software security.
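By way of illustration, the BWT construction and substring search recited in claim 1 can be sketched as follows. This is a minimal sketch, assuming the standard rotation-sorting construction and the usual first-column/last-column (LF) mapping; the function names are hypothetical, the input text must not itself contain `$`, and the naive rotation sort is meant for short examples only.

```python
def _rank(c):
    # '$' is defined to be lexicographically smallest.
    return -1 if c == "$" else ord(c)


def bwt_matrix(text: str):
    """Sorted cyclic rotations of text + '$'."""
    s = text + "$"
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    return sorted(rotations, key=lambda r: [_rank(c) for c in r])


def bwt_string(text: str) -> str:
    """Last column of the sorted rotation matrix, read top to bottom."""
    return "".join(row[-1] for row in bwt_matrix(text))


def backward_search(text: str, pattern: str) -> int:
    """Number of occurrences of pattern in text, found by processing the
    pattern from its last character to its first."""
    last = bwt_string(text)
    first = sorted(last, key=_rank)  # first column of the matrix
    start = {}                       # first row beginning with each char
    for i, c in enumerate(first):
        start.setdefault(c, i)
    lo, hi = 0, len(last)            # current range of matching rows
    for c in reversed(pattern):
        if c not in start:
            return 0
        # The n-th c in the last column corresponds to the n-th c
        # in the first column.
        lo = start[c] + last[:lo].count(c)
        hi = start[c] + last[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


print(bwt_string("banana"))              # annb$aa
print(backward_search("banana", "ana"))  # 2
```

The search performs one step per pattern character, so a pattern of length m is resolved after m steps, or earlier if the row range becomes empty.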
CN201810908816.2A 2018-08-10 2018-08-10 Protocol field reverse analysis system and method based on BWT Active CN109040081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810908816.2A CN109040081B (en) 2018-08-10 2018-08-10 Protocol field reverse analysis system and method based on BWT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810908816.2A CN109040081B (en) 2018-08-10 2018-08-10 Protocol field reverse analysis system and method based on BWT

Publications (2)

Publication Number Publication Date
CN109040081A CN109040081A (en) 2018-12-18
CN109040081B true CN109040081B (en) 2020-08-04

Family

ID=64632703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810908816.2A Active CN109040081B (en) 2018-08-10 2018-08-10 Protocol field reverse analysis system and method based on BWT

Country Status (1)

Country Link
CN (1) CN109040081B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110061869B (en) * 2019-04-09 2022-04-15 中南民族大学 Network track classification method and device based on keywords
CN110008383B (en) * 2019-04-11 2021-07-27 北京安护环宇科技有限公司 Black and white list retrieval method and device based on multiple indexes
CN110134590B (en) * 2019-04-18 2023-04-18 上海大学 Tenebrio chinensis whisker fuzzy test case generation method aiming at Modbus/TCP
CN110113332A (en) * 2019-04-30 2019-08-09 北京奇安信科技有限公司 A kind of detection industry control agreement whether there is the method and device of exception
CN110457704B (en) * 2019-08-12 2022-11-15 北京明略软件系统有限公司 Target field determination method and device, storage medium and electronic device
CN110602073B (en) * 2019-09-02 2021-05-18 西安电子科技大学 Unmanned aerial vehicle flight control protocol field division method based on information theory
CN111162959B (en) * 2019-11-28 2021-07-06 中国航空工业集团公司西安航空计算技术研究所 Parameter-based avionics interface data communication protocol fuzzy test method
CN111767695B (en) * 2020-06-28 2023-10-13 国网吉林省电力有限公司 Method for optimizing field boundary reasoning in protocol reverse engineering
CN112055062B (en) * 2020-08-21 2024-04-09 深圳市信锐网科技术有限公司 Data communication method, device, equipment and readable storage medium
CN114090840A (en) * 2020-08-24 2022-02-25 华为技术有限公司 Sequence searching method, device, equipment and medium
CN112055003B (en) * 2020-08-26 2022-12-23 上海电力大学 Method for generating private protocol fuzzy test case based on byte length classification
CN112699658A (en) * 2020-12-31 2021-04-23 科大讯飞华南人工智能研究院(广州)有限公司 Text comparison method and related device
WO2022162655A1 (en) * 2021-01-26 2022-08-04 Elbit Systems C4I and Cyber Ltd. A system and method for producing specifications for fields with variable number of elements

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2002353634A1 (en) * 2001-11-24 2003-06-10 3G Licensing S.A. Selectively transmitting full or compressed header packets
CN103297427A (en) * 2013-05-21 2013-09-11 中国科学院信息工程研究所 Unknown network protocol identification method and system
CN103414708A (en) * 2013-08-01 2013-11-27 清华大学 Method and device for protocol automatic reverse analysis of embedded equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103036848B (en) * 2011-09-29 2015-11-25 西门子公司 The reverse engineering approach of agreement and system
CN103117748B (en) * 2013-01-29 2016-03-16 中国科学院计算技术研究所 The method and system in a kind of BWT implementation method, suffix sorted
CN104168288A (en) * 2014-08-27 2014-11-26 中国科学院软件研究所 Automatic vulnerability discovery system and method based on protocol reverse parsing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2002353634A1 (en) * 2001-11-24 2003-06-10 3G Licensing S.A. Selectively transmitting full or compressed header packets
CN103297427A (en) * 2013-05-21 2013-09-11 中国科学院信息工程研究所 Unknown network protocol identification method and system
CN103414708A (en) * 2013-08-01 2013-11-27 清华大学 Method and device for protocol automatic reverse analysis of embedded equipment

Also Published As

Publication number Publication date
CN109040081A (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN109040081B (en) Protocol field reverse analysis system and method based on BWT
WO2021088385A1 (en) Online log analysis method, system, and electronic terminal device thereof
CN107665191B (en) Private protocol message format inference method based on extended prefix tree
CN102891852B (en) Message analysis-based protocol format automatic inferring method
US20210081437A1 (en) Systems and methods for trie-based automated discovery of patterns in computer logs
Zhang et al. Proword: An unsupervised approach to protocol feature word extraction
CN114553983B (en) Deep learning-based high-efficiency industrial control protocol analysis method
WO2023284132A1 (en) Method and system for analyzing cloud platform logs, device, and medium
US11528285B2 (en) Label guided unsupervised learning based network-level application signature generation
CN113656254A (en) Abnormity detection method and system based on log information and computer equipment
US20140280929A1 (en) Multi-tier message correlation
CN103036848B (en) The reverse engineering approach of agreement and system
Egri et al. Cross-correlation based clustering and dimension reduction of multivariate time series
CN110430133B (en) Inter-domain path identifier prefix obtaining method based on confidence interval
WO2023179014A1 (en) Traffic identification method and apparatus, electronic device, and storage medium
CN115102848A (en) Log data extraction method, system, device and medium
CN112968865B (en) Network protocol grammatical feature rapid extraction method based on association rule mining
CN110609901B (en) User network behavior prediction method based on vectorization characteristics
CN111556014B (en) Network attack intrusion detection method adopting full-text index
CN110336817B (en) Unknown protocol frame positioning method based on TextRank
Rao et al. A Hierarchical Tree-Based Syslog Clustering Scheme for Network Diagnosis
CN107689846B (en) Method and system for detecting data errors
JP2003228571A (en) Method of counting appearance frequency of character string, and device for using the method
CN112152873B (en) User identification method and device, computer equipment and storage medium
US20230012041A1 (en) Identity Graphing for Network Genomes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Wang Bailing, Sun Yunxiao, Huang Junheng, Liu Yang, Huang Xiaoxue, Wang Wei
Inventor before: Huang Xiaoxue, Sun Yunxiao, Huang Junheng, Liu Yang, Wang Bailing, Wang Wei
