CN115632973A

CN115632973A - Protocol packet structure analysis method, device, equipment and storage medium

Info

Publication number: CN115632973A
Application number: CN202211292799.7A
Authority: CN
Inventors: 宋伟聪
Original assignee: Agricultural Bank of China
Current assignee: Agricultural Bank of China
Priority date: 2022-10-21
Filing date: 2022-10-21
Publication date: 2023-01-20

Abstract

The invention discloses a protocol packet structure analysis method, device, equipment and storage medium. Data frames sent from the same network port are acquired to form a data frame set. The segmentation step length of each data frame in the set is then determined based on the frequency of occurrence of each field in the data frame set and the ranking of that frequency, and each data frame is segmented by that step length to obtain a characteristic field set of the data frame set. The characteristic field set is randomly divided into two characteristic field subsets, and the similarity of the two subsets is calculated. A preset number of pairs of characteristic field subsets whose similarity is greater than a preset threshold value are clustered based on a preset clustering algorithm to obtain characteristic fields of the same type, and the characteristic fields of different types are processed based on a multi-sequence comparison (multiple sequence alignment) algorithm to obtain the protocol format of the data frames sent by the network port. The method shortens the time needed to determine the protocol format and makes the result more accurate.

Description

Protocol packet structure analysis method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of data transmission, in particular to a protocol packet structure analysis method, a device, equipment and a storage medium.

Background

With the increasing complexity of network environments, more and more network protocol developers choose to build proprietary protocols for sending and receiving data in order to meet their own specific requirements. However, many malicious applications also use private protocols to transmit network data, which makes the network data harder to decode and the illegal activity more covert, posing a great threat to network security. Analyzing and identifying such unknown protocols in time is therefore important to network security. Most traditional analysis of unknown protocol packets reverse-engineers binary code at the software level; this approach is complex to implement, has low portability, and cannot effectively identify unknown protocol packets.

Disclosure of Invention

In order to solve the problems of complex realization and low accuracy rate in the prior art, the invention provides a protocol packet structure analysis method, a device, equipment and a storage medium, which have the characteristics of high identification efficiency, higher accuracy rate and the like.

According to a specific embodiment of the present invention, a method for analyzing a protocol packet structure is provided, which includes:

acquiring data frames sent from the same network port to form a data frame set;

determining the segmentation step length of each data frame in the data frame set based on the frequency of each field in the data frame set and the ranking of the frequency;

segmenting each data frame based on the segmentation step length to obtain a characteristic field set of the data frame set;

randomly dividing the characteristic field set into two characteristic field subsets, and calculating the similarity of the two characteristic field subsets;

clustering, based on a preset clustering algorithm, a preset number of pairs of feature field subsets whose similarity is greater than a preset threshold value, to obtain feature fields with the same type;

and processing the different types of characteristic fields based on a multi-sequence comparison algorithm to obtain the protocol format of the data frame sent by the network port.

Further, the protocol packet structure analysis method further includes:

sending a network data packet generated based on the protocol format to a server connected with the network port;

and verifying the validity of the protocol format based on the reply message of the server to the network data packet.

Further, after segmenting each data frame based on the segmentation step size to obtain a feature field set of the data frame set, the method further includes:

and rejecting the characteristic fields with the occurrence frequency less than the preset frequency in the characteristic field set.

Further, the determining the slicing step size of each data frame in the data frame set based on the frequency of occurrence of each field in the data frame set and the ranking of the frequency includes:

when the frequency of each field and the ranking of the frequency meet a preset segmentation formula, determining the quantity of each data frame to be segmented, wherein the preset segmentation formula is as follows:

ln(f)+ln(r)＝ln(c)

wherein r is the frequency of occurrence of each of the fields, f is the ranking of the frequency, and c is a constant;

and determining the segmentation step length of each data frame based on the number of the data frames to be segmented.

Further, the protocol packet structure analysis method further includes:

splicing the characteristic fields which are associated in the characteristic field set to obtain spliced characteristic fields;

and if the occurrence frequency of the spliced characteristic fields is not less than the preset frequency, adding the spliced characteristic fields into the characteristic field set.
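As a minimal illustration of the splicing step, the following Python sketch joins neighbouring feature fields and keeps a spliced field only if it occurs often enough. It assumes that "associated" means adjacent within the same data frame and splices pairs only; the function name `splice_adjacent` and the parameter `min_count` are hypothetical and do not come from the specification.

```python
from collections import Counter
from typing import Sequence, Set


def splice_adjacent(
    frames: Sequence[Sequence[bytes]],
    features: Set[bytes],
    min_count: int,
) -> Set[bytes]:
    """Join neighbouring feature fields and keep joins seen at least `min_count` times.

    Assumption: two fields are "associated" when they are adjacent in a frame.
    """
    joined: Counter = Counter()
    for fields in frames:
        # Walk over every pair of consecutive fields in the segmented frame.
        for left, right in zip(fields, fields[1:]):
            if left in features and right in features:
                joined[left + right] += 1
    frequent_joins = {field for field, n in joined.items() if n >= min_count}
    return set(features) | frequent_joins


# Frames already segmented with a step length of 2 bytes (illustrative data).
frames = [[b"GE", b"T ", b"/a"], [b"GE", b"T ", b"/b"], [b"PO", b"ST"]]
features = {b"GE", b"T ", b"PO", b"ST"}
spliced = splice_adjacent(frames, features, min_count=2)
```

Here `b"GET "` is added because it occurs twice, while `b"POST"` occurs only once and is left out.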

Further, the randomly dividing the feature field set into two feature field subsets and calculating the similarity of the two feature field subsets includes:

calculating the Jaccard coefficient value of the two feature field subsets, and taking the obtained Jaccard coefficient value as the similarity of the two feature field subsets.

Further, the processing different types of feature fields based on the multiple sequence comparison algorithm to obtain the protocol format of the data frame sent by the network port includes:

performing data alignment on the different types of feature fields based on the multi-sequence comparison algorithm to obtain aligned feature data;

and acquiring field information of a protocol and position information of each field based on the aligned characteristic data, and acquiring a protocol format of a data frame sent by the network port.
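A sketch of how field information and field positions might be read off aligned data is given below. It assumes the aligned messages are equal-length rows of byte values with `None` marking a gap, and it classifies each column as fixed (the same value in every row) or variable; the names `Field`, `GAP` and `infer_fields` are hypothetical.

```python
from typing import List, NamedTuple, Optional, Sequence

GAP = None  # hypothetical gap marker used in the aligned rows


class Field(NamedTuple):
    offset: int  # first alignment column of the field
    length: int  # number of alignment columns it spans
    kind: str    # "fixed" or "variable"


def infer_fields(aligned: Sequence[Sequence[Optional[int]]]) -> List[Field]:
    """Merge runs of alignment columns of the same class into fields."""
    fields: List[Field] = []
    for col in range(len(aligned[0])):
        values = {row[col] for row in aligned}
        # A column is fixed when every message carries the same byte there.
        kind = "fixed" if len(values) == 1 and GAP not in values else "variable"
        if fields and fields[-1].kind == kind:
            last = fields[-1]
            fields[-1] = Field(last.offset, last.length + 1, kind)
        else:
            fields.append(Field(col, 1, kind))
    return fields


aligned = [
    [0xAA, 0x01, 0x00, 0x05, 0x48],
    [0xAA, 0x02, 0x00, 0x02, GAP],
    [0xAA, 0x01, 0x00, 0x07, 0x41],
]
layout = infer_fields(aligned)
```

For this input the layout is a fixed byte at column 0, a variable byte at column 1, a fixed byte at column 2 and a two-column variable region starting at column 3.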

According to a specific embodiment of the present invention, an apparatus for analyzing a protocol packet structure includes:

the data acquisition module is used for acquiring data frames sent from the same network port to form a data frame set;

the step length determining module is used for determining the segmentation step length of each data frame in the data frame set based on the frequency of each field in the data frame set and the ranking of the frequency;

the data segmentation module is used for segmenting each data frame based on the segmentation step length to obtain a characteristic field set of the data frame set;

the data comparison module is used for randomly dividing the characteristic field set into two characteristic field subsets and calculating the similarity of the two characteristic field subsets;

the data classification module is used for clustering, based on a preset clustering algorithm, a preset number of pairs of feature field subsets whose similarity is greater than a preset threshold value, to obtain feature fields with the same type; and

and the format determining module is used for processing different types of characteristic fields based on a multi-sequence comparison algorithm so as to obtain the protocol format of the data frame sent by the network port.

According to a specific embodiment of the present invention, there is provided an apparatus including: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the steps of the protocol packet structure analysis method.

According to an embodiment of the present invention, a storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the protocol packet structure analysis method as described above.

The protocol packet structure analysis method provided by the invention acquires data frames sent from the same network port to form a data frame set. The segmentation step length of each data frame in the set is then determined based on the frequency of occurrence of each field in the data frame set and the ranking of that frequency, and each data frame is segmented by that step length to obtain a characteristic field set of the data frame set. The characteristic field set is randomly divided into two characteristic field subsets, and the similarity of the two subsets is calculated. A preset number of pairs of characteristic field subsets whose similarity is greater than a preset threshold value are clustered based on a preset clustering algorithm to obtain characteristic fields of the same type, and the characteristic fields of different types are processed based on a multi-sequence comparison algorithm to obtain the protocol format of the data frames sent by the network port. Because the preprocessed characteristic fields are clustered and the characteristic fields of each type obtained after clustering are compared using a multi-sequence comparison algorithm, the time needed to determine the protocol format is shorter and the result is more accurate.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a flow diagram of a protocol packet structure analysis method provided in accordance with an example embodiment;

FIG. 2 is a flow diagram of protocol packet structure validation provided in accordance with an example embodiment;

FIG. 3 is a block diagram of a protocol packet structure analysis apparatus provided in accordance with an example embodiment;

FIG. 4 is a block diagram of an apparatus provided in accordance with an exemplary embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, an embodiment of the present invention provides a method for analyzing a protocol packet structure, where the method may include the following steps:

101. Acquiring data frames sent from the same network port to form a data frame set.

The protocol data acquired from each device can be classified by port number. The protocol data corresponding to each network port number is generally acquired in the form of data frames, and the acquired data frames form the data frame set of that network port number. Classifying by network port number makes it easier to distinguish and identify unknown protocol packets, avoids interference from protocol packets in other formats, and effectively improves the identification efficiency of the protocol packets.
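The port-based grouping can be sketched in a few lines of Python. The input shape, a list of (port number, payload) pairs already extracted from a capture, is an assumption made for illustration, as is the function name `group_frames_by_port`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical input record: (network port number, raw frame payload).
Frame = Tuple[int, bytes]


def group_frames_by_port(frames: Iterable[Frame]) -> Dict[int, List[bytes]]:
    """Collect payloads into one data frame set per network port number."""
    frame_sets: Dict[int, List[bytes]] = defaultdict(list)
    for port, payload in frames:
        if payload:  # empty payloads carry no protocol content
            frame_sets[port].append(payload)
    return dict(frame_sets)


captured = [
    (5000, bytes.fromhex("aa01000548656c6c6f")),
    (6000, bytes.fromhex("ff00")),
    (5000, bytes.fromhex("aa0200024869")),
]
frame_sets = group_frames_by_port(captured)
```

Each value of `frame_sets` is then analysed on its own, so frames of other protocols on other ports do not interfere.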

102. Determining the segmentation step length of each data frame in the data frame set based on the frequency of occurrence of each field in the data frame set and the ranking of the frequency.

Before identification, all data frames in the data frame set need to be segmented uniformly according to the segmentation step length, so as to obtain the characteristic fields. The segmentation step length can be determined according to Zipf's law: if the frequency of occurrence of a certain item in the unknown protocol data set is r and the ranking of that frequency is f, then under ideal conditions the product fr = c is a constant. Taking logarithms of both sides gives the preset segmentation formula adopted in the invention:

ln(f)+ln(r)＝ln(c)

where r is the frequency of occurrence of each field, f is the ranking of the frequency, and c is a constant. The segmentation step length is determined after the number of required characteristic fields has been determined, and each data frame is then segmented according to the segmentation step length.
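The text does not spell out how the step length is derived from the relation ln(f)+ln(r)=ln(c). One possible reading, offered only as a sketch, is to try several candidate step lengths and keep the one whose rank-frequency data comes closest to a constant value of ln(rank) + ln(frequency). The candidate range, the use of the standard deviation as the fit measure, and all function names below are assumptions.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


def segment(frame: bytes, step: int) -> List[bytes]:
    """Cut a frame into consecutive, non-overlapping fields of `step` bytes."""
    return [frame[i:i + step] for i in range(0, len(frame), step)]


def zipf_spread(frames: Sequence[bytes], step: int) -> float:
    """Standard deviation of ln(rank) + ln(frequency) over all fields.

    Under an ideal Zipf distribution the sum equals ln(c) for every field,
    so a smaller spread means a closer fit to the preset formula.
    """
    counts = Counter(field for frame in frames for field in segment(frame, step))
    freqs = sorted(counts.values(), reverse=True)
    sums = [math.log(rank) + math.log(freq) for rank, freq in enumerate(freqs, start=1)]
    if not sums:
        return math.inf  # no fields at all: treat as the worst possible fit
    mean = sum(sums) / len(sums)
    return math.sqrt(sum((s - mean) ** 2 for s in sums) / len(sums))


def choose_step(frames: Sequence[bytes], candidates: Iterable[int] = (1, 2, 3, 4)) -> int:
    """Pick the candidate step length with the smallest Zipf spread."""
    return min(candidates, key=lambda step: zipf_spread(frames, step))


frames = [b"\xaa\x01hello", b"\xaa\x02world", b"\xaa\x01hello"]
step = choose_step(frames)
```

Note that the code uses the descriptive names `rank` and `freq` rather than the symbols f and r of the formula.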

103. Segmenting each data frame based on the segmentation step length to obtain a characteristic field set of the data frame set.

After the characteristic field set of the data frame set is obtained, the characteristic fields whose frequency of occurrence is less than a preset frequency can be removed from it. Very infrequent characteristic fields usually carry little distinguishing information, so eliminating them keeps the dimension of the characteristic field set as small as possible and makes the identification processing easier. Each characteristic field can be weighted using the TF/IDF technique, where TF is the term frequency and IDF is the inverse document frequency. The term frequency is the number of times a certain characteristic field appears in a data frame, normalized as:

TF = the number of times a certain feature field appears in the data frame/total number of feature fields in the data frame.

When analyzing the inverse document frequency, the practical application scenario often needs to be considered and a corpus is used to model it; the calculation method is as follows:

IDF = log (total number of data frames/(number of data frames containing a certain feature field + 1)).

The TF/IDF weight is then calculated as follows:

TF/IDF = TF × IDF, that is, the term frequency multiplied by the inverse document frequency.

A feature field with a low frequency of occurrence can be excluded by setting an appropriate threshold value.
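The TF and IDF definitions above translate directly into code. The sketch below follows those two formulas and, as an assumption, combines them as a product, which is the conventional TF-IDF weighting; the pruning helper and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def tf(field: bytes, frame_fields: Sequence[bytes]) -> float:
    """Occurrences of `field` in one frame divided by the frame's field count."""
    if not frame_fields:
        return 0.0
    return list(frame_fields).count(field) / len(frame_fields)


def idf(field: bytes, all_frames: Sequence[Sequence[bytes]]) -> float:
    """log(total frames / (frames containing the field + 1)), as in the text."""
    containing = sum(1 for fields in all_frames if field in fields)
    return math.log(len(all_frames) / (containing + 1))


def tf_idf(field: bytes, frame_fields: Sequence[bytes], all_frames: Sequence[Sequence[bytes]]) -> float:
    """Assumption: the weight is the product TF * IDF (the usual convention)."""
    return tf(field, frame_fields) * idf(field, all_frames)


def prune_rare_fields(all_frames: Sequence[Sequence[bytes]], min_count: int) -> List[List[bytes]]:
    """Drop fields whose total occurrence count is below `min_count`."""
    totals = Counter(field for fields in all_frames for field in fields)
    return [[f for f in fields if totals[f] >= min_count] for fields in all_frames]


frames = [[b"\xaa", b"\x01", b"\xaa"], [b"\xaa", b"\x02"], [b"\x03", b"\x02"]]
pruned = prune_rare_fields(frames, min_count=2)
```

With the "+ 1" in the denominator, a field that appears in all but one frame gets an IDF of zero, and a field present in every frame gets a negative one, so any threshold on the weight has to be chosen with that in mind.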

104. Randomly dividing the characteristic field set into two characteristic field subsets, and calculating the similarity of the two characteristic field subsets.

After the segmentation step length of the protocol data frames is calculated using Zipf's law, the characteristic field data set can be screened against a threshold to reduce the amount of calculation. The characteristic field set is randomly divided into two characteristic field subsets, and each subset is arranged in descending order of the number of occurrences of each feature, represented as

A = { T11: N11, T12: N12, ..., T1N: N1N }, B = { T21: N21, T22: N22, ..., T2N: N2N }.

The similarity of the two data sets A and B is then calculated using the Jaccard coefficient, which is commonly used to measure the similarity between data sets. The value of the Jaccard coefficient equals the number of samples common to the two sample sets divided by the number of all distinct samples in the two sample sets:

J(A, B) = |A ∩ B| / |A ∪ B|

The characteristic field set is divided by random equal partition. The higher the Jaccard coefficient, the more similar the two randomly divided data sets are, i.e., the better the screening effect. The threshold at which the Jaccard coefficient reaches its peak is taken as the screening threshold and used to screen the characteristic data set. During screening, some characteristic fields may be longer than the segmentation step length, so characteristic fields also need to be spliced: two or more consecutive characteristic items are spliced together and the number of occurrences of the spliced item is counted; if it still meets the threshold, the spliced item is added to the original characteristic field set. Finally, n features are selected on the principle that each selected feature should be maximally correlated with the features not yet selected while overlapping minimally with the features already selected, and the final characteristic field set is obtained.
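The threshold screening described in this step can be sketched as follows: split the segmented frames into two random halves, keep in each half the fields that occur at least a given number of times, and choose the occurrence threshold at which the Jaccard coefficient of the two kept sets peaks. The candidate thresholds, the fixed random seed, the tie-breaking rule and the convention that two empty sets have similarity 0 are all assumptions of this sketch.

```python
import random
from collections import Counter
from typing import Iterable, List, Sequence, Set, Tuple


def random_halves(frames: Sequence[Sequence[bytes]], seed: int = 0) -> Tuple[List, List]:
    """Shuffle the frames and cut them into two halves of (almost) equal size."""
    shuffled = list(frames)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def frequent_fields(frames: Iterable[Sequence[bytes]], threshold: int) -> Set[bytes]:
    """Fields that occur at least `threshold` times across the given frames."""
    counts = Counter(field for fields in frames for field in fields)
    return {field for field, n in counts.items() if n >= threshold}


def jaccard(a: Set, b: Set) -> float:
    """|A intersect B| / |A union B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_threshold(frames: Sequence[Sequence[bytes]], thresholds: Iterable[int], seed: int = 0) -> Tuple[int, float]:
    """Return (threshold, score) for the threshold with the highest Jaccard score."""
    half_a, half_b = random_halves(frames, seed)
    scored = [
        (jaccard(frequent_fields(half_a, t), frequent_fields(half_b, t)), t)
        for t in thresholds
    ]
    # Ties are broken in favour of the smaller threshold, which keeps more fields.
    score, threshold = max(scored, key=lambda st: (st[0], -st[1]))
    return threshold, score


frames = [[b"\xaa", b"\x01"], [b"\xaa", b"\x02"]] * 4
threshold, score = best_threshold(frames, thresholds=[1, 2, 3, 4])
```

Treating two empty sets as dissimilar prevents an overly strict threshold, which discards every field in both halves, from being selected.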

105. Clustering, based on a preset clustering algorithm, a preset number of pairs of feature field subsets whose similarity is greater than a preset threshold value, to obtain feature fields with the same type.

The MT-BIRCH algorithm can be adopted as the preset clustering algorithm. While keeping the advantages of the original BIRCH algorithm, it introduces the concept of a TCF into the CF (clustering feature), thereby adding the concept of multiple thresholds. When clusters differ in size, this greatly reduces the number of times the TCF tree is rebuilt and therefore greatly reduces the time complexity of the algorithm. It also improves the accuracy of the algorithm, making up for the clustering errors that the original algorithm produces when different clusters differ greatly in volume. The success of clustering is, however, very sensitive to the selection of the clustering parameters.
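To make the clustering feature (CF) idea concrete, the sketch below implements a deliberately simplified, flat variant: each cluster is summarised by a point count, a linear sum and a sum of squares, and a point is absorbed by the nearest cluster only if the cluster radius stays within a single threshold. It is not the multi-threshold algorithm named above and builds no tree; the numeric feature vectors and all names are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Summary of a cluster: point count, linear sum and sum of squared norms."""
    n: int
    linear_sum: List[float]
    square_sum: float

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(point), sum(x * x for x in point))

    def merged_with(self, point: Sequence[float]) -> "ClusteringFeature":
        # CFs are additive, so absorbing a point only updates three sums.
        return ClusteringFeature(
            self.n + 1,
            [a + b for a, b in zip(self.linear_sum, point)],
            self.square_sum + sum(x * x for x in point),
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # Root mean squared distance of the members from the centroid.
        centroid_norm = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.square_sum / self.n - centroid_norm, 0.0))


def cf_cluster(points: Sequence[Sequence[float]], threshold: float) -> List[ClusteringFeature]:
    """Single-pass, single-threshold CF clustering (simplified, no CF tree)."""
    clusters: List[ClusteringFeature] = []
    for point in points:
        if clusters:
            nearest = min(
                range(len(clusters)),
                key=lambda i: math.dist(clusters[i].centroid(), point),
            )
            candidate = clusters[nearest].merged_with(point)
            if candidate.radius() <= threshold:
                clusters[nearest] = candidate
                continue
        clusters.append(ClusteringFeature.from_point(point))
    return clusters


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
clusters = cf_cluster(points, threshold=0.5)
```

The result depends directly on `threshold`, which illustrates why the choice of clustering parameters matters so much.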

106. Processing the different types of characteristic fields based on a multi-sequence comparison algorithm to obtain the protocol format of the data frame sent by the network port.

Protocol messages of the same type often have similar protocol formats. Protocol data with the same or sufficiently similar format can therefore be aligned using a multi-sequence comparison (multiple sequence alignment) algorithm from the field of DNA analysis, so that the fixed fields and the variable fields of the protocol are obtained. From this information, details such as the position of each protocol field within the whole protocol data and the number of bytes it occupies can be further deduced, and the format information of the protocol is finally summarized. Based on a comparison of the advantages and disadvantages of the available algorithms, the Clustal Omega algorithm and the MAFFT algorithm can be selected for the multi-sequence comparison.

Clustal Omega uses seeded guide trees and an HMM engine that works on pairs of profiles to generate alignments. Clustal Omega is based on consistency and is widely viewed as one of the fastest online implementations among multi-sequence comparison tools, while its accuracy remains high among both consistency-based and matrix-based algorithms. The MAFFT algorithm converts the original sequences into a series of vector sequences and then treats the vector information as a signal for Fourier analysis, so that homologous (same-ancestor) regions are identified quickly and effectively; its core idea is to reduce CPU time by means of the fast Fourier transform (FFT). MAFFT is also applicable to sequences of similar length that contain a large number of deletion and modification operations. MAFFT is comparable to T-COFFEE in accuracy and can use both a progressive method and an iterative refinement method; without losing much accuracy, FFT-NS-2 is much faster than CLUSTALW, and FFT-NS-i is more than 100 times faster than T-COFFEE.

Those skilled in the art can select the specific algorithm according to the actual application, and the invention is not limited in this respect.
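The alignment idea can be shown on two messages with a textbook pairwise global alignment (Needleman-Wunsch). This is a stand-in chosen for brevity, not Clustal Omega or MAFFT, which align many sequences at once; the scoring values and the `None` gap marker are assumptions.

```python
from typing import List, Optional, Tuple

GAP = None  # hypothetical marker for an inserted gap


def needleman_wunsch(
    a: bytes, b: bytes, match: int = 1, mismatch: int = -1, gap: int = -1
) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """Globally align two byte strings and return them padded with gaps."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a: List[Optional[int]] = []
    out_b: List[Optional[int]] = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]


msg_a = b"\xaa\x01hello"
msg_b = b"\xaa\x01hi"
aligned_a, aligned_b = needleman_wunsch(msg_a, msg_b)
```

Columns where both rows hold the same byte are candidates for fixed fields, while columns containing gaps or differing bytes point to variable or variable-length fields.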

Therefore, adopting the MT-BIRCH algorithm improves the accuracy of clustering, and the innovative use of multi-sequence comparison algorithms originally applied to DNA, RNA and other nucleic acid sequences makes the sequence comparison faster and its result more accurate, particularly when the protocol message contains a variable-length field.

Referring to fig. 2, in another embodiment of the present invention, the method for analyzing a protocol packet structure may further include the following steps, and other steps of the method for analyzing a protocol packet structure in this embodiment may refer to the details in the above embodiment, which are not described herein again.

201. Sending the network data packet generated based on the protocol format to a server connected with the network port.

202. Verifying the validity of the protocol format based on the reply message of the server to the network data packet.

Specifically, a test machine synthesizes the network data packet according to the obtained protocol format and sends it to a target machine that runs an existing server implementation of the protocol. The target machine responds to the network data packet sent by the test machine. By analyzing the response data packet, it can be determined whether the server regards the network data packet sent by the test machine as valid, and the protocol format is thereby verified. In a specific implementation, a TLSG2005 switch with a port mirroring function can be used to set up port mirroring, and Wireshark software is used to monitor all communication messages between the test machine and the target machine to determine whether the messages synthesized from the inferred protocol structure are syntactically valid. This verification method is simple and effective, and experiments can be carried out conveniently.
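On the test machine side, the verification could be sketched as below. The inferred format used here (a 1-byte magic value, a 1-byte message type, a 2-byte big-endian length and the payload) is purely hypothetical, as are the function names and the use of TCP; what counts as a valid reply depends on the protocol under test.

```python
import socket
import struct
from typing import Optional

# Hypothetical inferred header: magic (1 byte), type (1 byte), length (2 bytes, big-endian).
HEADER = struct.Struct("!BBH")


def build_packet(magic: int, msg_type: int, payload: bytes) -> bytes:
    """Synthesize a packet that follows the inferred protocol format."""
    return HEADER.pack(magic, msg_type, len(payload)) + payload


def probe(host: str, port: int, packet: bytes, timeout: float = 2.0) -> Optional[bytes]:
    """Send one synthesized packet over TCP and return the reply, or None if there is none."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(packet)
            return conn.recv(4096) or None
    except OSError:  # covers refused connections and timeouts
        return None


packet = build_packet(0xAA, 0x01, b"Hello")
```

A call such as `probe("192.0.2.10", 5000, packet)` (address and port are placeholders) would then return the server's reply for inspection, while the mirrored traffic can be checked independently in Wireshark.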

Referring to fig. 3, based on the same design concept, an embodiment of the present invention further provides a protocol packet structure analysis apparatus, which can implement the steps of the protocol packet structure analysis method when the apparatus is operated, where the apparatus includes:

The data obtaining module 301 is configured to obtain data frames sent from the same network port to form a data frame set.

The step size determining module 302 is configured to determine a segmentation step size of each data frame in the data frame set based on the frequency of occurrence of each field in the data frame set and the ranking of the frequency.

The data segmentation module 303 is configured to segment each data frame based on the segmentation step size to obtain a feature field set of the data frame set.

The data comparison module 304 is configured to randomly divide the feature field set into two feature field subsets and calculate the similarity between the two feature field subsets.

The data classification module 305 is configured to cluster, based on a preset clustering algorithm, a preset number of pairs of feature field subsets whose similarity is greater than a preset threshold, to obtain feature fields of the same type.

The format determining module 306 is configured to process the different types of feature fields based on a multi-sequence comparison algorithm to obtain the protocol format of the data frame sent by the network port.

The protocol packet structure analyzing device has the same beneficial effects as the protocol packet structure analyzing method, and the specific implementation manner of the protocol packet structure analyzing device can refer to the embodiment of the protocol packet structure analyzing method, which is not described in detail herein.

Referring to fig. 4, an embodiment of the present invention also provides an apparatus, which may include: a memory 401 and a processor 402.

A memory 401 for storing programs.

The processor 402 is configured to execute the program to implement the steps of the protocol packet structure analysis method according to the above embodiment.

Embodiments of the present invention also provide a storage medium having a computer program stored thereon, where the computer program is executed by a processor to implement the steps of the protocol packet structure analysis method according to the above embodiments.

While, for purposes of simplicity of explanation, the methodologies are shown and described as a series of acts, it will be appreciated by those skilled in the art that the claimed subject matter is not limited by the order of acts, as some steps may, in accordance with the claimed subject matter, occur in other orders and/or concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and reference may be made to the partial description of the method embodiment for relevant points.

The steps in the method of each embodiment of the present invention may be sequentially adjusted, combined, and deleted according to actual needs, and the technical features described in each embodiment may be replaced or combined.

The modules and sub-modules in the device and the terminal of the embodiments of the invention can be combined, divided and deleted according to actual needs.

In the embodiments provided in the present invention, it should be understood that the disclosed terminal, apparatus and method may be implemented in other ways. For example, the above-described terminal embodiments are merely illustrative, and for example, the division of a module or a sub-module is only one logical function division, and other division manners may be available in actual implementation, for example, a plurality of sub-modules or modules may be combined or integrated into another module, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules or sub-modules described as separate parts may or may not be physically separate, and parts that are modules or sub-modules may or may not be physical modules or sub-modules, may be located in one place, or may be distributed over a plurality of network modules or sub-modules. Some or all of the modules or sub-modules can be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, each functional module or sub-module in each embodiment of the present invention may be integrated into one processing module, or each module or sub-module may exist alone physically, or two or more modules or sub-modules may be integrated into one module. The integrated modules or sub-modules may be implemented in the form of hardware, or may be implemented in the form of software functional modules or sub-modules.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software unit executed by a processor, or in a combination of the two. The software unit may be located in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for analyzing a protocol packet structure, comprising:

acquiring data frames sent from the same network port to form a data frame set;

and processing different types of characteristic fields based on a multi-sequence comparison algorithm to obtain the protocol format of the data frame sent by the network port.

2. The method of claim 1, further comprising:

3. The method of claim 1, wherein after segmenting each of the data frames based on the segmentation step size to obtain a set of feature fields of the set of data frames, further comprising:

4. The method of claim 1, wherein determining the slicing step size for each of the data frames in the set of data frames based on the frequency of occurrence of each field in the set of data frames and the ranking of the frequency comprises:

ln(f)+ln(r)＝ln(c)

5. The method of claim 3, further comprising:

6. The method of claim 1, wherein randomly dividing the set of feature fields into two feature field subsets and calculating the similarity between the two feature field subsets comprises:

7. The method according to claim 1, wherein the processing different types of feature fields based on the multiple sequence comparison algorithm to obtain the protocol format of the data frame sent out by the network port comprises:

8. An apparatus for analyzing a protocol packet structure, comprising:

9. An apparatus, comprising: a memory and a processor;

the memory is used for storing programs;

the processor, configured to execute the program, and implement the steps of the protocol packet structure analysis method according to any one of claims 1 to 7.

10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the individual steps of the protocol packet structure analysis method according to any one of claims 1 to 7.