CN109547414B

CN109547414B - Fixed-length message format reversing method based on lighting effect

Info

Publication number: CN109547414B
Application number: CN201811268875.4A
Authority: CN
Inventors: 刘琰; 高李政; 罗军勇; 朱玛; 左青松
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2018-10-29
Filing date: 2018-10-29
Publication date: 2021-04-20
Anticipated expiration: 2038-10-29
Also published as: CN109547414A

Abstract

The invention provides a fixed-length message format reversing method based on a lighting effect. The method comprises the following steps: step 1, judging the domain types of all fixed-length domains in a fixed-length message m, wherein the domain types comprise a synchronous domain and an asynchronous domain; step 2, if all the fixed-length domains in the fixed-length message m are synchronous domains, determining a domain boundary sequence of the fixed-length domains of the fixed-length message according to a first domain boundary identification rule; and step 3, if the fixed-length domains in the fixed-length message m comprise an asynchronous domain, determining a domain boundary sequence of the fixed-length domains of the fixed-length message according to a second domain boundary identification rule. The invention is inspired by the lighting effect of a building: it likens the fixed-length message to a building, each byte in the message to a window of the building, and each continuous interval over which two messages take equal values to a lighting area. By taking into account how the domain type of the fixed-length message influences the lighting effect, it establishes different domain boundary identification rules for different types of fixed-length messages, thereby effectively solving the difficult problem of fixed-length domain boundary identification.

Description

Fixed-length message format reversing method based on lighting effect

Technical Field

The invention relates to the technical field of network security, in particular to a fixed-length message format reversing method based on a lighting effect.

Background

The fixed-length message format is a common application protocol message format, and many application protocols contain fixed-length messages or fixed-length segments. Because the fixed-length format is information-dense, feature extraction and feature analysis are difficult, and format reversal is therefore difficult as well.

In 2004, the PI (Protocol Informatics) project reversed message formats using the idea of gene sequence alignment from bioinformatics. The method is applicable to messages with rich feature fields, such as text messages in the traditional protocol classification. Fixed-length messages are information-dense and their feature fields are not obvious, so PI has limited effect on them. In 2006, Cui et al. of Microsoft proposed the Discoverer algorithm for message format reversal. When reversing the format of a fixed-length message, the algorithm simply treats each byte in the message as a separate field, which is a one-sided simplification. ASAP and ProDecoder use the N-gram method from natural-language topic extraction to reverse the format of fixed-length messages. However, N-gram-based methods are effective only for messages whose length is not less than N and whose feature fields are obvious, and their results on fixed-length messages are not ideal. ProGraph performs protocol reversal using methods from graph theory and information theory; its analysis granularity is too fine, and it rests on the assumption that the values of different fields in a message are generally correlated, which is also one-sided. In short, the feature fields of fixed-length messages are not obvious because the messages are information-dense, so the above format reversal techniques based on feature field analysis have limited effect on fixed-length messages.

In addition, some researchers reverse the format of fixed-length messages through binary analysis. The idea can be summarized as follows: the way the application program processes a message is observed through binary analysis techniques, and the size, position, semantics and other information of each field in the message are determined from it. Protocol reversal based on binary analysis is highly accurate and yields richer results. However, it requires access to an application program that implements the protocol; such a program can be difficult to obtain and may be guarded by software protection. Protocol reversal based on binary analysis is also difficult to automate.

Disclosure of Invention

To address the defects of the conventional fixed-length message format reversing technology, the invention provides a fixed-length message format reversing method based on a lighting effect. By establishing a mapping between a building and a fixed-length message, the problem of inferring the boundary of each domain in the fixed-length message is converted into a probability statistics problem, and the accuracy of fixed-length message format reversal is improved.

The invention provides a fixed-length message format reverse method based on a lighting effect, which comprises the following steps:

step 1, judging the domain types of all fixed-length domains in a fixed-length message m, wherein the domain types comprise a synchronous domain and an asynchronous domain;

step 2, if all the fixed-length domains in the fixed-length message m are synchronous domains, determining a domain boundary sequence of each fixed-length domain of the fixed-length message according to a first domain boundary identification rule;

step 3, if the fixed-length domains in the fixed-length message m comprise an asynchronous domain, determining a domain boundary sequence of each fixed-length domain of the fixed-length message according to a second domain boundary identification rule.

Further, the first domain boundary identification rule specifically includes:

step 21, obtaining any two message segments m_i and m_j of the fixed-length message m;

step 22, letting m_ik and m_jk denote the values of the bytes at offset k in the message segments m_i and m_j, and repeatedly comparing m_ik and m_jk until the comparison result becomes stable, where k does not exceed the length of the fixed-length message m;

step 23, determining a boundary sequence of the lighting areas in the fixed-length message m according to the comparison result, and taking the boundary sequence of the lighting areas as the boundary sequence of all the fixed-length domains in the fixed-length message m, wherein a lighting area is a message interval formed by bytes with continuously equal values in one comparison.

Further, the second domain boundary identification rule specifically includes:

step 31, determining a storage mode of the fixed-length domains in the fixed-length message m, wherein the storage mode is either big-endian storage or little-endian storage;

step 32, counting the value types of each byte in the fixed-length message m, and determining the normally-bright areas and the normally-dark areas in the fixed-length message m according to the counting result, wherein a normally-bright area is a message interval formed by bytes that take a single, unique value in the fixed-length message m, and a normally-dark area is a message interval formed by bytes that have 256 value types in the fixed-length message m;

step 33, dividing the fixed-length message m into N message blocks according to the normally bright area and the normally dark area, wherein N is a positive integer greater than 1;

step 34, for a message block n, if the length l of the message block n is 1, taking the message block n as a fixed-length domain;

step 35, if the length l of the message block n is greater than 1, letting the message block n comprise a message sequence n_1, ..., n_l, and, for any two sub-messages n_i and n_j in the message sequence, comparing the values n_ix and n_jx of their bytes at offset x;

if the fixed-length domains in the message block n use big-endian storage, counting the head-light frequency f_x^s at offset x in the message block n, and correcting the head-light frequency f_x^s at offset x according to a first preset correction rule;

if the fixed-length domains in the message block n use little-endian storage, counting the tail-light frequency f_x^e at offset x in the message block n, and correcting the tail-light frequency f_x^e at offset x according to a second preset correction rule;

wherein the head-light frequency is the frequency with which the byte at offset x is the start boundary of a light-up area, the tail-light frequency is the frequency with which the byte at offset x is the end boundary of a light-up area, a light-up area is a message interval formed by bytes with continuously equal values in one comparison, n = 1, 2, ..., N, and 0 ≤ x ≤ l-1;

step 36, determining the maximum head-light frequency among all the head-light frequencies, or the maximum tail-light frequency among all the tail-light frequencies;

step 37, taking the maximum head-light frequency or the maximum tail-light frequency as a reference, obtaining, according to a preset screening condition, each offset x that can serve as a start boundary of a domain or an end boundary of a domain.

Further, the first preset correction rule specifically includes: when the first correction condition is satisfied, adding 1 to the head-light frequency f_x^s at offset x;

the first correction condition includes:

the byte offset x is the start boundary of the message block and n_ix = n_jx; or

the byte offset x is not the start boundary of the message block, and n_i(x-1) ≠ n_j(x-1) and n_ix = n_jx hold simultaneously;

the second preset correction rule specifically includes: when the second correction condition is satisfied, adding 1 to the tail-light frequency f_x^e at offset x;

the second correction condition includes:

the byte offset x is the end boundary of the message block and n_ix = n_jx; or

the byte offset x is not the end boundary of the message block, and n_ix = n_jx and n_i(x+1) ≠ n_j(x+1) hold simultaneously.
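The two correction rules above can be sketched in Python as follows. This is only an illustrative sketch: the function name `head_tail_frequencies` and the four sample blocks are hypothetical, and the sketch computes both f_x^s and f_x^e in one pass, whereas the method counts only the one that matches the storage mode.

```python
from itertools import combinations


def head_tail_frequencies(block_samples):
    """Count head-light (f_s) and tail-light (f_e) frequencies per offset.

    block_samples: equal-length bytes objects, each holding the content of the
    same message block in a different message (hypothetical input format).
    """
    l = len(block_samples[0])
    f_s = [0] * l
    f_e = [0] * l
    for n_i, n_j in combinations(block_samples, 2):
        for x in range(l):
            if n_i[x] != n_j[x]:
                continue  # byte x is unlit in this comparison
            # First correction condition: x starts the block, or x-1 is unlit.
            if x == 0 or n_i[x - 1] != n_j[x - 1]:
                f_s[x] += 1
            # Second correction condition: x ends the block, or x+1 is unlit.
            if x == l - 1 or n_i[x + 1] != n_j[x + 1]:
                f_e[x] += 1
    return f_s, f_e


# Hypothetical 4-byte block sampled from four messages.
samples = [
    bytes([0x00, 0x01, 0x00, 0x05]),
    bytes([0x00, 0x02, 0x00, 0x05]),
    bytes([0x00, 0x03, 0x01, 0x05]),
    bytes([0x01, 0x03, 0x00, 0x06]),
]
f_s, f_e = head_tail_frequencies(samples)
print(f_s)  # [3, 1, 3, 2]
print(f_e)  # [3, 1, 2, 3]
```

In this made-up sample the head-light frequency peaks at offsets 0 and 2, which is the kind of evidence the screening condition then evaluates.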

Further, the preset screening conditions specifically include:

when the head-light frequency at offset x satisfies inequality (1), taking the byte at offset x as the start boundary of a fixed-length field;

when the tail-light frequency at offset x satisfies inequality (2), taking the byte at offset x as the end boundary of a fixed-length field;

when the head-light frequency and the tail-light frequency at offset x both satisfy the above conditions, comparing the magnitudes of the two frequencies and, according to the result of the comparison, taking the byte at offset x either as the start boundary or as the end boundary of a fixed-length field in the fixed-length message m, where β is a preset threshold.

[Inequalities (1) and (2) and the comparison expressions are formula images in the original publication and are not reproduced in this text.]

The invention has the beneficial effects that:

The fixed-length message format reversing method based on the lighting effect can effectively solve the problem of fixed-length domain boundary identification. Because fixed-length messages are information-dense and feature extraction is difficult, existing methods generally perform poorly at identifying fixed-length domain boundaries. The method likens the fixed-length message to a building, each byte in the message to a window of the building, and each continuous interval over which two messages take equal values to a lighting area. Taking into account how the domain type of the fixed-length message influences the lighting effect, it sets different domain boundary identification rules for different types of fixed-length messages (ideal messages and non-ideal messages), and obtains the candidate boundary sequences of the fixed-length domains under the big-endian and little-endian storage modes by counting the head-light probability and the tail-light probability, respectively. The real domain boundaries are then screened out from the candidate boundaries according to a preset rule. Experimental results show that the method performs markedly better than other methods and can accurately identify more domain boundaries. The invention also provides a new and feasible approach to fixed-length message format reversal and is beneficial to its further development.

Drawings

FIG. 1 is a schematic diagram of the lighting effect of a building according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of mapping the lighting effect of a building to a fixed-length message according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a lighting area of a fixed-length message according to an embodiment of the present invention;

FIG. 4 is a schematic flowchart of a fixed-length message format reversing method based on a lighting effect according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the lighting probability in an ideal message according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a fixed-length domain stored in the big-endian storage mode and in the little-endian storage mode according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a normally-bright area of an SMB message header according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a normally-dark area of a DHCP protocol header according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of the relationship between the value type, the value uniformity and the lighting probability according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of the relationship between the lighting probability and the head-light probability according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of the lighting probability and the tail-light probability when a fixed-length domain uses little-endian storage according to an embodiment of the present invention;

FIG. 12 is a schematic diagram of the identification field of an SMB message header according to an embodiment of the present invention;

FIG. 13 is a schematic diagram of the identification field of a BitTorrent message header according to an embodiment of the present invention;

FIG. 14 is a schematic diagram of a normally-dark area in a DNS message according to an embodiment of the present invention;

FIG. 15 is a schematic diagram of dividing a message into blocks according to the normally-bright areas and normally-dark areas according to an embodiment of the present invention;

FIG. 16 is a schematic diagram of the distribution of value types in an SMB message header according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to better understand the technical solutions provided by the present invention, the following first specifically introduces terms related to the embodiments provided by the present invention.

A fixed-length field refers to a field of fixed length. The value of the fixed-length field may be a numerical value or a byte array, and can be uniformly regarded as a numerical type. The location and order of the fixed-length field in the fixed-length message (or fixed-length segment) is fixed, and its semantics are determined by its location.

A fixed-length message refers to a message composed of fixed-length fields. A fixed-length message is not simply a message whose total length is fixed, but a message type (primary type). Although the two notions are not identical, the length of a fixed-length message is fixed in most cases; the message length can change only when an extensible field exists in the fixed-length message, and the invention does not consider that case.

A fixed-length segment is a segment consisting of fixed-length fields. To facilitate message parsing, fixed-length segments are typically located in the message header. The format reverse method of the fixed-length segment is the same as that of the fixed-length message.

The nature of fixed-length message format reversal: the key to reversing the fixed-length message format is to infer the boundaries of each fixed-length field in the fixed-length message. The boundaries of a fixed-length field are its starting position and ending position in the fixed-length message, called the start boundary and the end boundary respectively. Since the length and order of the fixed-length fields are fixed, the offsets of the start boundary and the end boundary of each fixed-length field from the fixed-length message header are also fixed. Let the start boundary of a fixed-length field be b_s(i) and its end boundary be b_e(i), where s and e denote start and end respectively, and i denotes the sequence number of the fixed-length field in the fixed-length message, i ≥ 1. The values of b_s(i) and b_e(i) are, respectively, the offset of the first byte of the fixed-length field from the fixed-length message header and the offset of its last byte from the fixed-length message header. The start boundary of the first fixed-length field in the fixed-length message is b_s(1), and b_s(1) = 0. Since the fixed-length fields in the fixed-length message are laid end to end, the equation b_e(i) + 1 = b_s(i+1) holds, so the end boundaries of the fixed-length fields can be inferred from their start boundaries, and vice versa. Thus, the core of fixed-length message format reversal can be summarized as: inferring the start boundary sequence or the end boundary sequence of the fixed-length fields.
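The relation b_e(i) + 1 = b_s(i+1) can be illustrated with a short Python sketch. The function names and the 14-byte, four-field layout are hypothetical; the sketch also assumes the total message length is known, which is what fixes the end boundary of the last field.

```python
def ends_from_starts(starts, message_length):
    """Derive the end boundary sequence from the start boundary sequence."""
    if not starts or starts[0] != 0:
        raise ValueError("b_s(1) must be 0")
    # b_e(i) = b_s(i+1) - 1; the last field ends at the last byte of the message.
    return [s - 1 for s in starts[1:]] + [message_length - 1]


def starts_from_ends(ends):
    """Derive the start boundary sequence from the end boundary sequence."""
    # b_s(1) = 0 and b_s(i+1) = b_e(i) + 1.
    return [0] + [e + 1 for e in ends[:-1]]


starts = [0, 4, 6, 10]            # hypothetical layout of a 14-byte message
ends = ends_from_starts(starts, 14)
print(ends)                       # [3, 5, 9, 13]
print(starts_from_ends(ends))     # [0, 4, 6, 10]
```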

The lighting effect: how to guess the starting boundary (or ending boundary) of each room without entering a building? It is assumed that each room has several windows, and the position and number of the windows can represent the position and size of the room. As shown in fig. 1: in daytime, there is no obvious difference between windows; at night, some rooms are lighted, and some rooms are not lighted, so that the brightness of windows is different. The area formed by a plurality of continuous lightening windows on the same layer is called a lightening area. The lighting area may include one or more rooms, and since the brightness of all windows in each room is consistent, the starting boundary of the lighting area is necessarily the starting boundary of a certain room, and the ending boundary thereof is also necessarily the ending boundary of a certain room. Since the lighting areas may be different every day, the start boundary (end boundary) of the lighting areas is counted every day, and when the number of recording days is large enough, the boundary sequence will be stable. At this point, each boundary in the sequence corresponds to the starting boundary (ending boundary) of each room in the building. The invention refers to the principle as the lighting effect of the building, and the lighting effect can be used for detecting the boundary of each room in the building without entering the building.

As shown in FIG. 2 and FIG. 3: a building is composed of rooms, and each room has several windows; a message is composed of fields, and each field contains several bytes. The invention first performs the following conceptual mapping: the fixed-length message is mapped to a single-story building, each fixed-length domain is mapped to a room, and each byte is mapped to a window.

In the "lighting effect", a window has only two states, lit or dark, whereas a byte can take 256 values, so the byte value cannot be mapped directly to a window state.

Light-up area of a fixed-length message: let m_i and m_j be any two fixed-length messages, and let m_ik and m_jk be the values of the bytes at offset k in the two messages. Whether m_ik and m_jk are equal is mapped to the window state: m_ik = m_jk is mapped to lit, and m_ik ≠ m_jk is mapped to unlit. A message interval composed of bytes with continuously equal values in one comparison is called a light-up area of the fixed-length message.
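As an illustration of this definition, the following Python sketch lists the light-up areas produced by one comparison of two messages. The function name `lit_areas` and the two sample messages are hypothetical.

```python
def lit_areas(m_i: bytes, m_j: bytes):
    """Return the (start, end) offsets, inclusive, of each light-up area."""
    if len(m_i) != len(m_j):
        raise ValueError("messages must have equal length")
    areas = []
    start = None  # start offset of the area currently being scanned, if any
    for k, (a, b) in enumerate(zip(m_i, m_j)):
        if a == b:                      # byte k is lit
            if start is None:
                start = k
        elif start is not None:         # byte k is unlit and closes an area
            areas.append((start, k - 1))
            start = None
    if start is not None:               # an area runs to the end of the message
        areas.append((start, len(m_i) - 1))
    return areas


m_i = bytes([0x01, 0x02, 0x03, 0x04, 0x05, 0x06])
m_j = bytes([0x01, 0x02, 0xAA, 0x04, 0x05, 0xBB])
print(lit_areas(m_i, m_j))  # [(0, 1), (3, 4)]
```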

The lighting probability: in any two fixed-length messages of the same protocol, the probability that the byte with the offset x takes equal value is called the lighting probability of x.

The lighting probability characterizes the likelihood that bytes of the same offset will take equal values in different messages of fixed length. The magnitude of the probability of lighting is affected by two factors: the value type and the value uniformity. The value type refers to the number of value types of the bytes with the same offset in all messages; the value uniformity refers to the degree of uniformity of distribution of the byte values over each value type.

To illustrate the two influencing factors more intuitively, a specific example is shown in Table 1. Rows j, j+1 and j+2 in Table 1 represent the values of the bytes at the same offset in different fixed-length messages, columns i, i+1, ..., i+5 each represent a message fragment taken from one fixed-length message, and the numbers in the table are the byte values. Table 1 lists 6 message fragments, each of length 3.

Table 1 Example of values of different bytes in the messages

A byte has at most 256 value types and at least 1. As can be seen from Table 1, the byte at offset j takes three values: 2, 3 and 4. The byte at offset j+1 takes the same three values, 2, 3 and 4; however, the values of the byte at offset j are concentrated on 2, whereas the values of the byte at offset j+1 are distributed relatively uniformly.

The 6 message fragments in Table 1 are compared pairwise, for a total of 15 comparisons. As shown in Table 2, the numbers of times that the bytes at offsets j, j+1 and j+2 take the same value are 6, 3 and 0, respectively. The result is analyzed as follows: first, the values of bytes j+1 and j+2 are both distributed relatively uniformly over their value types, but byte j+1 takes equal values more often than byte j+2 because it has fewer value types; second, bytes j and j+1 have the same number of value types, but the values of byte j+1 are distributed more uniformly, so it takes equal values less often.

Table 2 Number of times different bytes in the messages take the same value

Byte offset	Number of times of same value
j	6
j+1	3
j+2	0

As can be seen from Table 1 and Table 2, the value type and the value uniformity influence the lighting probability as follows: first, the more value types there are, the smaller the lighting probability, and vice versa; second, the more uniform the value distribution, the smaller the lighting probability, and vice versa.
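The pairwise counting behind Table 2 can be sketched in Python. The body of Table 1 is not available in this text, so the three value lists below are hypothetical: they were chosen only to match the description (offsets j and j+1 take the values 2, 3 and 4, with j concentrated on 2) and to reproduce the counts 6, 3 and 0.

```python
from itertools import combinations


def equal_pair_count(values):
    """Number of pairwise comparisons in which the two byte values are equal."""
    return sum(1 for a, b in combinations(values, 2) if a == b)


# Hypothetical byte values at each offset across 6 message fragments.
columns = {
    "j":   [2, 2, 2, 2, 3, 4],   # three value types, concentrated on 2
    "j+1": [2, 2, 3, 3, 4, 4],   # three value types, uniformly distributed
    "j+2": [1, 2, 3, 4, 5, 6],   # six value types, uniformly distributed
}

total_pairs = len(list(combinations(range(6), 2)))  # 15 comparisons
counts = {offset: equal_pair_count(vals) for offset, vals in columns.items()}
for offset, count in counts.items():
    # count / total_pairs is an empirical estimate of the lighting probability
    print(offset, count, count / total_pairs)
```

Dividing each count by the 15 comparisons gives an empirical lighting probability for that offset, which falls as the number of value types grows or as the distribution becomes more uniform.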

Fig. 4 is a flowchart illustrating a fixed-length message format reversing method based on a lighting effect according to an embodiment of the present invention. As shown in fig. 4, the method comprises the steps of:

S101, judging the domain types of all fixed-length domains in the fixed-length message m, wherein the domain types comprise a synchronous domain and an asynchronous domain;

S102, if all the fixed-length domains in the fixed-length message m are synchronous domains, determining a domain boundary sequence of each fixed-length domain of the fixed-length message according to a first domain boundary identification rule;

S103, if the fixed-length domains in the fixed-length message m comprise an asynchronous domain, determining a domain boundary sequence of each fixed-length domain of the fixed-length message according to a second domain boundary identification rule.

Specifically, the present invention defines a synchronous domain as a fixed-length domain satisfying the following conditions: first, if the value of any byte in the domain stays fixed, the values of the other bytes also stay fixed; second, if the value of any byte in the domain changes, the values of the other bytes also change. It follows from this definition that the values of all bytes in a synchronous domain change synchronously. The synchronous domain is a special kind of fixed-length domain; a fixed-length domain of length 1, or one that takes only a single value, can be regarded as a synchronous domain. When the fixed-length domains composing a fixed-length message are all synchronous domains, the fixed-length message is called an ideal message.

By contrast with the synchronous domain, an asynchronous domain is a fixed-length domain that is not a synchronous domain. By contrast with the ideal message, a fixed-length message that contains an asynchronous domain is a non-ideal message.

In the synchronous domain, the lighting probability of each byte is equal, and the lighting time is consistent. When the fixed-length fields constituting the message are all synchronous fields, that is, when the fixed-length message is an ideal message, the boundary of the fields can be presumed by using the principle of lighting effect.

In an asynchronous domain, the lighting probabilities of the bytes differ, and the bytes light up at different times. Therefore, when the fixed-length message is a non-ideal message, the boundary of a "light-up area" may fall at a position that is not a boundary of any fixed-length field, and simply taking the boundaries of the "light-up areas" as the boundaries of the fixed-length fields inevitably introduces errors.

According to the above content, the fixed-length message format reverse method based on the lighting effect provided by the invention firstly judges the domain type of each fixed-length domain in the fixed-length message, and determines whether the fixed-length message is an ideal message or a non-ideal message; then, if the fixed-length message is an ideal message, reversing the message format according to the first domain boundary identification rule; if the fixed-length message is a non-ideal message, the message format is reversed according to the second domain boundary identification rule.

On the basis of the foregoing embodiment, the first domain boundary identification rule specifically includes:

S1021, obtaining any two message segments m_i and m_j of the fixed-length message m;

S1022, letting m_ik and m_jk denote the values of the bytes at offset k in the message segments m_i and m_j, and repeatedly comparing m_ik and m_jk until the comparison result becomes stable, where k does not exceed the length of the fixed-length message m;

S1023, determining a boundary sequence of the lighting areas in the fixed-length message m according to the comparison result, and taking the boundary sequence of the lighting areas as the boundary sequence of all the fixed-length domains in the fixed-length message m, wherein a lighting area is a message interval formed by bytes with continuously equal values in one comparison.

Specifically, as explained in the above embodiments, it follows from the definition of the synchronous domain that the bytes in a synchronous domain have equal lighting probabilities and light up at the same time, so the boundaries of the domains can be inferred using the lighting effect principle. FIG. 5 shows the lighting probability of an ideal message of length 14 consisting of 4 fields. When the fields constituting the message are all synchronous domains, steps S1021 to S1023 can be described simply as: compare the messages pairwise and count the boundaries of the lighting areas. When enough comparisons have been counted, the boundary sequence of the lighting areas becomes stable, and that boundary sequence is the boundary sequence of all the fields in the message.
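For an ideal message, steps S1021 to S1023 can be sketched in Python as below. This is a minimal sketch under stated assumptions: the function name, the 4-2-4-4 field layout and the sample messages are hypothetical, every available pair is compared instead of testing for stability, and each end boundary of a lighting area is converted into the start boundary of the next field through b_e(i) + 1 = b_s(i+1).

```python
from itertools import combinations


def ideal_start_boundaries(messages):
    """Collect the start boundaries implied by lighting-area boundaries."""
    boundaries = {0}                      # b_s(1) = 0 by definition
    for m_i, m_j in combinations(messages, 2):
        prev_lit = None
        for k, (a, b) in enumerate(zip(m_i, m_j)):
            lit = a == b
            # A lit/unlit transition at k means a lighting area starts at k
            # or ends at k-1; either way k is the start of a field.
            if prev_lit is not None and lit != prev_lit:
                boundaries.add(k)
            prev_lit = lit
    return sorted(boundaries)


LAYOUT = [4, 2, 4, 4]                     # hypothetical field lengths, 14 bytes


def make_message(field_values):
    """Build a synthetic ideal message: all bytes of a field change together."""
    return b"".join(bytes([v]) * n for v, n in zip(field_values, LAYOUT))


messages = [
    make_message((1, 1, 1, 1)),
    make_message((1, 2, 1, 2)),
    make_message((2, 2, 3, 3)),
    make_message((3, 1, 3, 4)),
]
print(ideal_start_boundaries(messages))   # [0, 4, 6, 10]
```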

On the basis of the foregoing embodiments, the second domain boundary identification rule specifically includes:

S1031, determining a storage mode of the fixed-length domains in the fixed-length message m, wherein the storage mode is either big-endian storage or little-endian storage;

Specifically, the storage mode of a field refers to the storage order of its high-order and low-order bytes. As with the storage of data in memory, the storage mode of a field is one of two types: the big-endian storage mode and the little-endian storage mode. In the big-endian storage mode, the high-order byte of the field is closer to the message header and the low-order byte is closer to the message tail. In the little-endian storage mode, the high-order byte of the field is closer to the message tail and the low-order byte is closer to the message header. FIG. 6 shows how the value 0x12345678 is stored in the big-endian and little-endian storage modes.
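The two byte orders for the value 0x12345678 can be reproduced with Python's built-in `int.to_bytes`; the variable names are illustrative only.

```python
value = 0x12345678

# Big-endian: the high-order byte comes first (closest to the message header).
big = value.to_bytes(4, "big")
# Little-endian: the low-order byte comes first.
little = value.to_bytes(4, "little")

print(big.hex(" "))     # 12 34 56 78
print(little.hex(" "))  # 78 56 34 12
```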

Since the fixed-length domains constituting a fixed-length message are usually asynchronous domains, the storage mode of the domains also affects message format reversal, and the big-endian and little-endian storage modes must be analyzed separately.

S1032, counting the value types of each byte in the fixed-length message m, and determining the normally-bright areas and the normally-dark areas in the fixed-length message m according to the counting result, wherein a normally-bright area is a message interval formed by bytes that take a single, unique value in the fixed-length message m, and a normally-dark area is a message interval formed by bytes that have 256 value types in the fixed-length message m;

specifically, the normally-bright area refers to an area with a unique value in the long-length message, which is represented as normally-bright in the "lighting effect", and is referred to as a normally-bright area in the present invention. The normally bright regions can be divided into two categories: the first is that the theoretical value is unique: the protocol stipulates that the value of a certain field in the message is unique, and the value of the field is always unique under the condition of not considering message errors. For example, the first four bytes of the SMB message always take on the value of "0 xFF534D 42", thus forming the normally bright area as shown in fig. 7. The second one is that the actual value is unique: some fields can take a plurality of values, however, in actual experimental data, the values can be unique; in addition, because the number of high-order values in the numerical value domain is small, when experimental data is not abundant, the high-order values may be unique.

Normally bright areas are relatively common in messages. The normally bright areas have two effects on the format reversal. On one hand, the probability of the last light of the predecessor byte of the normally-bright area is 0, and the probability of the first light of the successor byte is 0, so that whether the two positions are the boundary of the domain or not cannot be judged. On the other hand, the boundary of the "normally bright region" is also generally the boundary of the domain, and the normally bright region is relatively easy to recognize.

A normally-dark area is an area in which every byte has 256 value types; the lighting probability of each byte is then small, and the whole area appears permanently dark in the lighting effect, which is why the invention calls it a normally-dark area. As shown in fig. 8, the area with start offset 4 and length 4 is a normally-dark area of the DHCP message header; it is the Transaction ID field of DHCP.

Normally-dark areas are also relatively common in messages, and they affect format reversal in two ways. On one hand, the lighting probability inside a normally-dark area is small, so its head light probability and end light probability are small as well; on the other hand, the boundary of a normally-dark area is also a domain boundary, and its characteristics are distinctive and easy to identify.
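As a non-limiting illustration, the counting in step S1032 can be sketched in Python as follows. The function names value_kinds and runs, and the four-byte toy messages, are hypothetical and are not part of the original disclosure; the sketch only assumes that all messages have the same fixed length.

```python
from typing import List, Tuple

def value_kinds(messages: List[bytes]) -> List[int]:
    """Number of distinct values observed at each byte offset."""
    length = len(messages[0])
    if any(len(m) != length for m in messages):
        raise ValueError("all messages must have the same fixed length")
    return [len({m[x] for m in messages}) for x in range(length)]

def runs(kinds: List[int], target: int) -> List[Tuple[int, int]]:
    """Maximal runs (start, length) of offsets whose value-type count equals target."""
    out, start = [], None
    for x, k in enumerate(kinds + [None]):  # the sentinel closes a trailing run
        if k == target and start is None:
            start = x
        elif k != target and start is not None:
            out.append((start, x - start))
            start = None
    return out

# Toy data (made up): two constant bytes, one byte taking all 256 values,
# and one byte taking three values.
msgs = [bytes([0xFF, 0x53, v, v % 3]) for v in range(256)]
kinds = value_kinds(msgs)   # [1, 1, 256, 3]
bright = runs(kinds, 1)     # normally-bright areas: [(0, 2)]
dark = runs(kinds, 256)     # normally-dark areas:   [(2, 1)]
```

A run with target 1 corresponds to a normally-bright area and a run with target 256 to a normally-dark area; every other offset is left for the later steps.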

S1033, dividing the fixed-length message m into N message blocks according to the normally bright area and the normally dark area, wherein N is a positive integer greater than 1;

specifically, after the value types of each byte in the message have been counted in step S1032 and the normally-bright and normally-dark areas have been identified, the message is divided into several message blocks by the normally-bright and normally-dark areas, as shown in fig. 14. If a message block has a length of 1, it is treated as one field, as in step S1034. If the length of the message block is greater than 1, processing continues with steps S1035 to S1037.
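The division into message blocks can be illustrated with a short Python sketch. The function name split_blocks is hypothetical; the input is the per-offset value-type count, and the example reuses the DNS header counts of Table 5, for which the sketch yields the blocks listed in Table 7.

```python
from typing import List, Tuple

def split_blocks(kinds: List[int]) -> List[Tuple[int, int]]:
    """Maximal runs (start, length) of bytes that are neither
    normally bright (1 value type) nor normally dark (256 value types)."""
    blocks, start = [], None
    for x in range(len(kinds) + 1):
        ordinary = x < len(kinds) and kinds[x] not in (1, 256)
        if ordinary and start is None:
            start = x
        elif not ordinary and start is not None:
            blocks.append((start, x - start))
            start = None
    return blocks

dns_kinds = [256, 256, 8, 8, 1, 2, 1, 3, 1, 2, 1, 3]   # Table 5
blocks = split_blocks(dns_kinds)
# [(2, 2), (5, 1), (7, 1), (9, 1), (11, 1)], as in Table 7

single_byte_fields = [b for b in blocks if b[1] == 1]   # handled as in step S1034
to_analyse = [b for b in blocks if b[1] > 1]            # handled by steps S1035 to S1037
```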

S1034, aiming at a message block n, if the length l of the message block n is 1, taking the message block n as a fixed-length domain;

specifically, in addition to regarding a message block having a length of 1 as a fixed-length field, in consideration of the specificity of a normally bright area and a normally dark area, the following is specified:

if a normally-bright area contains no fewer than three consecutive printable ASCII characters, it is likely to be the identification field of a protocol, and the invention treats it as a separate field. If no byte in the normally-bright area has the value 0 and the area contains no consecutive text characters, each byte in the area is taken as a field. If, given an abundant amount of data, every byte in a normally-bright area is 0, the area is likely to be a reserved field in the message, and the invention treats it as one complete field. For example, the first field of the SMB message header takes the value "0xFF534D42", in which "0x534D42" corresponds to the ASCII text "SMB", as shown in fig. 12; the value of the second field in the BitTorrent message corresponds to the ASCII text "BitTorrent protocol", as shown in fig. 13.
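The three rules for a normally-bright area can be sketched as follows. This is only an illustration under stated assumptions: the names has_text_run and split_bright_region are hypothetical, "printable ASCII" is taken to mean the byte range 0x20 to 0x7E, and the fallback for an area that mixes zero and non-zero bytes is not specified by the text and is chosen here arbitrarily.

```python
from typing import List, Tuple

def has_text_run(region: bytes, min_run: int = 3) -> bool:
    """True if the region holds at least min_run consecutive printable ASCII bytes."""
    run = 0
    for b in region:
        run = run + 1 if 0x20 <= b <= 0x7E else 0   # assumed printable range
        if run >= min_run:
            return True
    return False

def split_bright_region(start: int, region: bytes) -> List[Tuple[int, int]]:
    """Fields (start, length) inferred for one normally-bright area."""
    whole = [(start, len(region))]
    per_byte = [(start + i, 1) for i in range(len(region))]
    if has_text_run(region):
        return whole        # likely a protocol identification field
    if all(b == 0 for b in region):
        return whole        # likely a reserved field (given abundant data)
    if all(b != 0 for b in region):
        return per_byte     # non-zero, no text: one field per byte
    return per_byte         # mixed case: ASSUMPTION, not covered by the text

smb_magic = split_bright_region(0, bytes.fromhex("FF534D42"))   # [(0, 4)]
reserved = split_bright_region(4, b"\x00\x00")                   # [(4, 2)]
constants = split_bright_region(8, b"\x01\x02")                  # [(8, 1), (9, 1)]
```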

A normally-dark area is typically a numeric field or part of one. When the amount of data is abundant, the normally-dark area can always cover the entire field. Therefore, the invention treats a normally-dark area as one complete field. As shown in fig. 14, for example, the first field in the DNS message is the "Transaction ID" field, whose length is 2. 260,000 DNS messages were filtered from the DARPA data set. With 130,000 messages, the first byte of "Transaction ID" has 172 value types and the second byte has 256, i.e., the normally-dark area contains only the low-order byte; when the number of messages increases to 260,000, the number of value types of the first byte also increases to 256.

S1035, if the length l of the message block n is greater than 1, defining that the message block n comprises a message sequence $n_1, \ldots, n_l$; for any two sub-messages $n_i$ and $n_j$ in the message sequence, comparing the values $n_{ix}$ and $n_{jx}$ of the bytes at offset x; if the fixed-length domains in the message block n are stored big-endian, counting the head light frequency $f_x^s$ at offset x in the message block n, and correcting the head light frequency $f_x^s$ at offset x according to a first preset correction rule; if the fixed-length domains in the message block n are stored little-endian, counting the end light frequency $f_x^e$ at offset x in the message block n, and correcting the end light frequency $f_x^e$ at offset x according to a second preset correction rule; wherein the head light frequency refers to the frequency with which the byte at offset x is the starting boundary of a lighting area, the end light frequency refers to the frequency with which the byte at offset x is the ending boundary of a lighting area, the lighting area refers to a message interval formed by bytes with continuously equal values in one comparison, n = 1, 2, ..., N, and 0 ≤ x ≤ l − 1;

specifically, since domain boundary identification for the normally-bright and normally-dark areas has already been specified in step S1034, the comparison in this step excludes those areas. An asynchronous domain is a numeric domain, and a numeric domain has high-order and low-order bytes. The more significant a byte, the fewer value types it has and the more concentrated its values are, so its lighting probability is higher. When the domains constituting the message are asynchronous domains and the message is assumed to be big-endian, the relationship between the value types and value uniformity of each byte and its lighting probability is as shown in fig. 9. Within a domain, the lighting probability decreases stepwise from the high-order byte to the low-order byte. However, because domains of length 1 exist in the message, and the lighting probability at the ending boundary of a domain may exceed that at the starting boundary of the next domain, this feature alone is not a sufficient basis for identifying domain boundaries.
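The dependence of the lighting probability on byte significance can be checked numerically. The sketch below estimates the lighting probability at one offset as the fraction of unordered message pairs whose bytes at that offset are equal; the function name lighting_probability and the synthetic 16-bit counter data are assumptions made purely for illustration.

```python
from collections import Counter
from typing import List

def lighting_probability(messages: List[bytes], x: int) -> float:
    """Fraction of unordered message pairs with equal bytes at offset x."""
    n = len(messages)
    if n < 2:
        raise ValueError("at least two messages are required")
    counts = Counter(m[x] for m in messages)
    equal_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return equal_pairs / (n * (n - 1) // 2)

# Synthetic big-endian 16-bit field taking the values 0..999 (made-up data):
# the high-order byte takes only 4 values, the low-order byte all 256.
msgs = [v.to_bytes(2, "big") for v in range(1000)]
p_high = lighting_probability(msgs, 0)
p_low = lighting_probability(msgs, 1)
# p_high is much larger than p_low, matching the stepwise decrease described above.
```

Grouping by value with a Counter avoids enumerating every pair, which matters when the number of messages is in the hundreds of thousands.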

On the basis of this feature, another important concept of the invention is presented: head light probability. The following definitions are first made:

a. x represents the byte at offset x in the message, and x ≥ 1.

b. L(x) represents the event that x lights up, and $\overline{L(x)}$ represents the event that x is not lit; p(L(x)) is the lighting probability of x, $p(\overline{L(x)})$ represents the probability that x is not lit, and

$p(\overline{L(x)}) = 1 - p(L(x))$

c. SL(x) is defined as the event "x is the starting boundary of a certain lighting area".

On the basis of the above definitions, the head light probability refers to the probability that, for any two messages of the same protocol, the byte at offset x is the starting boundary of a certain lighting area.

The head light probability of x is represented by p(SL(x)), and

$p(SL(x)) = p(\overline{L(x-1)}\,L(x)) = p(L(x))\,[1 - p(L(x-1) \mid L(x))]$

That is, the head light probability of x equals the probability that x-1 is not lit and x is lit.

From this formula it can be seen that the magnitude of the head light probability depends on two factors. First, it depends on p(L(x)): for a given p(L(x-1) | L(x)), the larger p(L(x)) is, the larger p(SL(x)) is. Second, it depends on p(L(x-1) | L(x)): for a given p(L(x)), the smaller p(L(x-1) | L(x)) is, the larger p(SL(x)) is. The analysis of p(L(x-1) | L(x)) is divided into two cases:

In the first case: x and x-1 belong to the same domain. Since a less significant byte has a smaller lighting probability, if the less significant byte x is lit, the more significant byte x-1 is very likely lit as well, and therefore p(L(x-1) | L(x)) is generally large.

In the second case: x is the starting boundary of a domain and x-1 is the ending boundary of the previous domain. Since x-1 and x belong to different domains, their correlation is weak, and the lighting probability of x-1 is small, so p(L(x-1) | L(x)) is generally small.

The invention estimates the distribution of the head light probability from fig. 9 and the above analysis, as shown in fig. 10. The head light probability rises sharply at the starting boundary of each domain, which the invention calls the boundary sharpening feature of the head light probability. Using this feature, the starting boundary of a domain can be inferred.
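The boundary sharpening feature follows from writing the head light probability as the probability that x is lit while x-1 is not. The derivation below uses only the events L(x) and SL(x) defined above; the numerical values in the comments are invented solely to contrast the two cases and are not taken from the experiments.

```latex
\begin{aligned}
p(SL(x)) &= p\bigl(\overline{L(x-1)} \cap L(x)\bigr) \\
         &= p(L(x)) - p\bigl(L(x-1) \cap L(x)\bigr) \\
         &= p(L(x))\,\bigl[\,1 - p(L(x-1) \mid L(x))\,\bigr].
\end{aligned}
% Illustrative (made-up) numbers with p(L(x)) = 0.5 in both cases:
%   x and x-1 in the same domain:  p(L(x-1) | L(x)) = 0.9  =>  p(SL(x)) = 0.5 * 0.1 = 0.05
%   x at a starting boundary:      p(L(x-1) | L(x)) = 0.2  =>  p(SL(x)) = 0.5 * 0.8 = 0.40
```

With the lighting probability held equal, the head light probability at a starting boundary is several times larger than inside a domain, which is the sharpening that the identification relies on.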

The case in which the message is stored big-endian has been analyzed above. When the message is stored little-endian, the end light probability is defined as follows, by analogy with the definition of the head light probability.

End light probability: the probability that, for any two messages of the same protocol, the byte at offset x is the ending boundary of a certain lighting area.

Define EL(x) as the event "x is the ending boundary of a certain lighting area"; then p(EL(x)) is the end light probability of x, and

$p(EL(x)) = p(L(x)\,\overline{L(x+1)}) = p(L(x))\,[1 - p(L(x+1) \mid L(x))]$

i.e. the end light probability is the probability that x is lit and x+1 is not. Following the same analysis as for the head light probability, the relationship between the lighting probability and the end light probability is presumed to be as shown in fig. 11. Thus, when the message is stored little-endian, the end light probability can be used as an identifying feature for the ending boundary of a domain.

S1036, determining the maximum head light frequency among all the head light frequencies, or the maximum end light frequency among all the end light frequencies;

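S1037, determining the domain boundary sequence of the message block n according to the maximum head light frequency or the maximum end light frequency and the preset screening conditions.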

On the basis of the foregoing embodiments, the first preset correction rule specifically includes: when the first correction condition is satisfied, adding 1 to the head light frequency $f_x^s$ at offset x; wherein the first correction condition includes: the byte at offset x is the starting boundary of the message block and $n_{ix} = n_{jx}$; or the byte at offset x is not the starting boundary of the message block, and $n_{i(x-1)} \neq n_{j(x-1)}$ and $n_{ix} = n_{jx}$ hold simultaneously;

the second preset correction rule specifically includes: when the second correction condition is satisfied, adding 1 to the end light frequency $f_x^e$ at offset x; wherein the second correction condition includes: the byte at offset x is the ending boundary of the message block and $n_{ix} = n_{jx}$; or the byte at offset x is not the ending boundary of the message block, and $n_{ix} = n_{jx}$ and $n_{i(x+1)} \neq n_{j(x+1)}$ hold simultaneously;
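The two correction rules can be sketched together in Python. This is an illustration rather than the patented implementation: the function name light_frequencies is hypothetical, each element of block_msgs is assumed to be the slice of one message that covers the message block, and both frequencies are computed at once, whereas the method uses the head light frequency for big-endian storage and the end light frequency for little-endian storage.

```python
from itertools import combinations
from typing import List, Tuple

def light_frequencies(block_msgs: List[bytes]) -> Tuple[List[int], List[int]]:
    """Head light counts f_s[x] and end light counts f_e[x] over all message pairs."""
    length = len(block_msgs[0])
    f_s = [0] * length
    f_e = [0] * length
    for a, b in combinations(block_msgs, 2):
        lit = [a[x] == b[x] for x in range(length)]
        for x in range(length):
            if not lit[x]:
                continue
            # first rule: x opens a lighting area
            if x == 0 or not lit[x - 1]:
                f_s[x] += 1
            # second rule: x closes a lighting area
            if x == length - 1 or not lit[x + 1]:
                f_e[x] += 1
    return f_s, f_e

# Toy block of length 2 (made-up data).
f_s, f_e = light_frequencies([bytes([0, 1]), bytes([0, 2]), bytes([1, 2])])
# f_s == [1, 1] and f_e == [1, 1]
```

The pairwise loop is quadratic in the number of messages, so the sketch is suited to small samples only.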

the preset screening conditions specifically include: when the head light frequency …; when the end light frequency …; when … and … both satisfy the above conditions, comparing the sizes of … and …, and if …

The effectiveness of the fixed-length message format reverse method based on the lighting effect provided by the invention is verified through a specific experiment.

1. Experimental data and parameters

The experimental data in this chapter are mainly obtained by filtering and de-duplicating the DARPA data set, campus network data, telecom operator data, and data captured with Wireshark on a single machine.

The experimental data in this section are DNS messages and SMB messages. Both are hybrid messages whose header is a fixed-length segment, and the invention extracts this fixed-length segment for the experiments. The experimental data are described in table 3. The experimental parameter is the frequency selection parameter beta, which the invention sets to 15 according to an empirical value.

TABLE 3 Description of the experimental data for fixed-length message format reversal

No.	Protocol name	Length of fixed-length segment	Data volume	Number of messages
1	DNS	12	40.9MB	266337
2	SMB	32	13.7MB	72739

2. Analysis of Experimental results

2.1 DNS messages

The header of the DNS message is a fixed-length segment of length 12 that comprises 6 fixed-length domains. The length, position and other information of each field are shown in table 4. The value-type statistics of each byte in the fixed-length segment are shown in table 5. As can be seen from table 5, the fixed-length segment contains one normally-dark area and four normally-bright areas. The normally-dark area has starting boundary 0 and length 2; it is judged to be one field, and this judgment is correct. The starting boundaries and lengths of the normally-bright areas are shown in table 6; according to the judgment criteria, each of the four normally-bright areas is regarded as a separate field, but this judgment is wrong. The reason for the error is that the four normally-bright areas correspond to the high-order bytes of the fields numbered 3 to 6, whose values determine the number of parameter segments. For example, when the value of the Questions field is 2, there are 2 corresponding Questions parameters. Assuming that each parameter has length 10, if the high-order byte of such a field were not 0, the DNS message would be at least 256 × 10 + 12 = 2572 bytes long. An actual DNS message is at most a few hundred bytes long, so the high-order byte is always 0. The message is divided into blocks using the normally-bright and normally-dark areas, and each message block of length 1 is taken as one domain; the resulting DNS message blocks are shown in table 7. The head light frequency and the end light frequency of each byte in the message block of length greater than 1 are counted, and the results are shown in table 8.

TABLE 4 Fixed-length domain information in DNS messages

No.	Name	Start offset	Length
1	Transaction ID	0	2
2	Flags	2	2
3	Questions	4	2
4	Answer RRs	6	2
5	Authority RRs	8	2
6	Additional RRs	10	2

TABLE 5 Value type distribution of the DNS message header

Message offset	0	1	2	3	4	5	6	7	8	9	10	11
Value types	256	256	8	8	1	2	1	3	1	2	1	3

TABLE 6 Normally-bright areas in DNS messages

Starting boundary	4	6	8	10
Length	1	1	1	1

TABLE 7 Starting boundary and length of DNS message blocks

No.	1	2	3	4	5
Starting boundary	2	5	7	9	11
Length	2	1	1	1	1

TABLE 8 Head light and end light frequencies of each byte in the DNS message block

Message offset	2	3
Head light frequency	735756888	108146724
End light frequency	3070741	816345523

From Table 8, it can be seen that the head light frequency is largest at offset 2 (735756888 against 108146724) and the end light frequency is largest at offset 3 (816345523 against 3070741). Offset 2 is therefore chosen as the starting boundary and offset 3 as the ending boundary, i.e., the message block contains only one field, and the determination is correct. The head light frequency at offset 3, however, is less than an order of magnitude below that at offset 2. The reason is that the Flags field is composed of several bit-level flags with no high-order or low-order relationship between them, so the boundary sharpening phenomenon is not obvious.
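The reading of Table 8 can be reproduced with a few lines of Python. The dictionaries simply restate the table; the ratio variables are an illustrative measure of how pronounced the sharpening is and are not a quantity defined by the invention.

```python
head = {2: 735756888, 3: 108146724}   # head light frequencies, Table 8
end = {2: 3070741, 3: 816345523}      # end light frequencies, Table 8

start_boundary = max(head, key=head.get)   # offset with the largest head light frequency
end_boundary = max(end, key=end.get)       # offset with the largest end light frequency

# Contrast between the largest and smallest frequency (illustrative only).
head_ratio = max(head.values()) / min(head.values())
end_ratio = max(end.values()) / min(end.values())
```

Here start_boundary is 2 and end_boundary is 3, and the head light contrast (a ratio of about 6.8) is far weaker than the end light contrast (a ratio of about 266).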

2.2 SMB messages

The SMB message header is a fixed length segment, 32 bytes in length, and contains 14 fields, where the information of each field is shown in table 9. The normally bright and dark regions divide the message into two message blocks as shown in table 10.

The value types of each byte in the fixed-length segment are counted, and the result is shown in fig. 16. As can be seen from fig. 16, the fixed-length segment of the SMB message contains one normally-bright area and two normally-dark areas. The value of the normally-bright area contains the consecutive ASCII text characters "SMB", so the normally-bright area is treated as one complete field; the two normally-dark areas correspond to the Signature field and the Multiplex ID field of the message respectively. The judgments for the normally-bright and normally-dark areas are correct.

The head light frequency and the end light frequency of each byte in the two message blocks are counted, and two further quantities are calculated; the results are shown in Table 11. The items in bold in table 11 are the maximum head light frequency and the maximum end light frequency. The starting and ending boundary sequences of the domains obtained according to the screening conditions are shown in Table 12. After the ending boundaries are converted into starting boundaries, the inferred and actual starting boundary sequences are shown in table 13.

TABLE 9 information for fixed-length fields in SMB messages

TABLE 10 Starting boundary and length of SMB message blocks

No.	1	2
Starting boundary	4	22
Length	10	8

TABLE 11 SMB message Block Experimental results description

TABLE 12 Boundary sequences obtained according to the screening conditions

Starting boundary sequence	4,5,6,10,11,12,22,28
Ending boundary sequence	7,8,9,13,23,25,29

TABLE 13 Actual and experimental starting boundary sequences

Actual sequence	4,5,6,7,9,10,12,22,24,26,28
Experimental sequence	4,5,6,8,9,10,11,12,22,24,26,28

As shown in Table 13, the experimental sequence differs from the actual sequence in two places. First, the fourth starting boundary is identified incorrectly (8 instead of 7). Second, the starting boundary 11 is added in the experimental result. The field corresponding to the latter is the Flags2 field, which is composed of several bit-level flags; since there is no high-order or low-order relationship between the flags, the boundary sharpening is not obvious. Moreover, although the Flags2 field is split into two fields of length 1, subsequent message analysis is not actually affected.
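The conversion from Table 12 to Table 13 can be retraced as follows. The sketch assumes that an ending boundary e implies a starting boundary at e + 1 and that a converted boundary is kept only if it still lies inside one of the message blocks of Table 10; this filtering rule is an assumption that happens to reproduce the experimental sequence, and the function names are hypothetical.

```python
blocks = [(4, 10), (22, 8)]                         # Table 10: (start, length)
starts = [4, 5, 6, 10, 11, 12, 22, 28]              # Table 12
ends = [7, 8, 9, 13, 23, 25, 29]                    # Table 12
actual = [4, 5, 6, 7, 9, 10, 12, 22, 24, 26, 28]    # Table 13

def in_some_block(x):
    """True if offset x lies inside one of the message blocks."""
    return any(s <= x < s + n for s, n in blocks)

def to_start_sequence(starts, ends):
    """Merge starting boundaries with ending boundaries shifted by one byte."""
    converted = {e + 1 for e in ends if in_some_block(e + 1)}
    return sorted(set(starts) | converted)

experimental = to_start_sequence(starts, ends)
# [4, 5, 6, 8, 9, 10, 11, 12, 22, 24, 26, 28], as in Table 13

missed = sorted(set(actual) - set(experimental))      # [7]
spurious = sorted(set(experimental) - set(actual))    # [8, 11]
```

The missed boundary 7 and the spurious boundary 8 together form the misidentified fourth boundary, and 11 is the extra boundary inside the Flags2 field.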

3. Summary of the experiments

The invention carries out format reverse experiments on DNS messages and SMB messages, analyzes the results, and verifies the correctness of the fixed-length message format reverse method based on the lighting effect.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A fixed-length message format reverse method based on a lighting effect is characterized by comprising the following steps:

step 1, judging the domain types of all fixed-length domains in a fixed-length message m, wherein the domain types comprise a synchronous domain and an asynchronous domain; the synchronous domain is a fixed-length domain that satisfies a first condition and a second condition: the first condition is that if the value of any byte in the domain is fixed, the values of the other bytes are also fixed; the second condition is that if the value of any byte in the domain changes, the values of the other bytes also change; the asynchronous domain refers to a fixed-length domain that is not a synchronous domain;

step 3, if the fixed-length domain in the fixed-length message m comprises an asynchronous domain, determining a domain boundary sequence of each fixed-length domain of the fixed-length message according to a second domain boundary identification rule;

the first domain boundary identification rule specifically includes:

step 23, according to the comparison result, determining a boundary sequence of a lighting area in the fixed-length message m, and taking the boundary sequence of the lighting area as a boundary sequence of all fixed-length areas in the fixed-length message m, wherein the lighting area refers to a message interval composed of bytes with continuously equal values in one comparison;

the second domain boundary identification rule specifically includes:

step 31, determining a storage mode of the fixed-length domains in the fixed-length message m, wherein the storage mode comprises big-endian storage and little-endian storage; in the big-endian storage mode, the high-order byte of a domain is close to the message head and the low-order byte is close to the message tail; in the little-endian storage mode, the high-order byte of a domain is close to the message tail and the low-order byte is close to the message head;

step 32, counting the value type of each byte in the fixed-length message m, and determining a normally-bright area and a normally-dark area in the fixed-length message m according to the counting result, wherein the normally-bright area refers to a message interval formed by the bytes with the only value in the fixed-length message m, the normally-dark area refers to a message interval formed by the bytes with the value type of 256 in the fixed-length message m, and the lighting probability refers to the probability that the values of the bytes with the same offset in any two message segments in the fixed-length message m are equal;

Or maximum end lamp frequency

Step 37, maximum head lamp frequency

Or maximum end lamp frequency

2. The method according to claim 1, wherein the first preset correction rule specifically comprises: when the first correction condition is satisfied, adding 1 to the head light frequency $f_x^s$ at offset x;

the first correction condition includes:

the second preset correction rule specifically includes: when the second correction condition is satisfied, adding 1 to the end light frequency $f_x^e$ at offset x;

the second correction condition includes:

the byte at offset x is not the ending boundary of the message block, and $n_{ix} = n_{jx}$ and $n_{i(x+1)} \neq n_{j(x+1)}$ hold simultaneously.

3. The method according to claim 1, wherein the preset screening conditions specifically include:

when head lamp frequency