CN110457465B

CN110457465B - Classification method for unknown bit stream protocol

Info

Publication number: CN110457465B
Application number: CN201910541785.6A
Authority: CN
Inventors: 吴静; 王思源; 郭凡; 周建国; 周沫; 潘玥
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2019-06-21
Filing date: 2019-06-21
Publication date: 2022-04-26
Anticipated expiration: 2039-06-21
Also published as: CN110457465A

Abstract

The invention discloses a classification method for unknown bit stream protocols. First, bit streams that frequently appear at specific positions are taken as the features of a protocol, the information that frequently appears at each position of the data frames is found, and a frequent item set is constructed. Next, the association relations within the constructed frequent item set are detected, and frequent items that do not meet preset conditions are filtered out. Then, the data frames are converted into Boolean vectors according to the constructed frequent item set and the detected association relations. Finally, hierarchical clustering is performed on the converted data frames. Because position information is taken into account, the classification result is more accurate, and the hierarchical clustering method can restore the layered structure of the protocol.

Description

Classification method for unknown bit stream protocol

Technical Field

The invention relates to the field of communication, in particular to a classification method for unknown bit stream protocols.

Background

Network protocol analysis refers to analyzing and deducing the type, format and content of a network protocol from acquired network data, so as to understand the state of the target network. In military applications, analyzing the protocol of a target network can reveal the protocol format designed by the adversary, so that the content exchanged by the adversary can be parsed, additional information obtained, and a further response made to the adversary's actions; the topology of the adversary's network can also be inferred from the protocol, and with it the position and number of communication entities in that network. In civil applications, analyzing the protocol of a network yields the information transmitted in the network, and a network manager can use the protocol to manage and maintain the network and to analyze and monitor the network environment.

Traditional analysis and detection of network protocols often relies on methods based on fixed features. Conventional network protocols often use fixed formats; for example, application layer protocols typically use fixed port numbers. An analyst can therefore determine the protocol type as soon as the corresponding fixed feature is found, and can then interpret the information the protocol carries by matching it against known knowledge of that feature.

The inventor of the present application finds that the method of the prior art has at least the following technical problems in the process of implementing the present invention:

However, as network scale grows, information security requirements rise, and the types of services multiply, communicating parties increasingly use proprietary protocols whose formats are often unknown. In this case, it is difficult for existing methods to analyze and interpret the traffic based on prior knowledge.

Therefore, the methods in the prior art either cannot classify unknown bit stream protocols or classify them poorly.

Disclosure of Invention

In view of the above, the present invention provides a classification method for an unknown bit stream protocol, so as to solve or at least partially solve the problem that the method in the prior art cannot classify the unknown bit stream protocol or has a poor classification effect.

The invention provides a classification method for an unknown bit stream protocol, which comprises the following steps:

step S1: taking bit streams that frequently appear at specific positions as the features of a protocol, finding the information that frequently appears at each position of the data frames, and constructing a frequent item set;

step S2: detecting the association relations within the constructed frequent item set, and filtering out frequent items that do not meet preset conditions;

step S3: converting the data frames into Boolean vectors according to the constructed frequent item set and the detected association relations, wherein one Boolean vector corresponds to one data frame;

step S4: performing hierarchical clustering on the converted data frames.

In one embodiment, step S1 specifically includes:

step S1.1: left-aligning, in sequence, all data frames in a set I formed by the data frames captured from a network environment;

step S1.2: counting, over all frames, the values of the i-th byte from the frame header, and screening out the frequently occurring byte values;

step S1.3: judging whether the frequency of a value k of the i-th byte exceeds a threshold; if so, the value k of the i-th byte forms a frequent item, and the obtained frequent item is denoted A_ij, where j indicates the j-th frequent item whose i-th byte reaches the threshold t_f, and the frequent items comprise first-level frequent items and n-level frequent items;

step S1.4: recording all frequent items as a set F, wherein the set F comprises the information that appears frequently at all positions, including first-level and n-level frequent items, n = 0, 1, 2, …, p, and when a higher-level frequent item is generated, the lower-level frequent items it contains are removed.

In one embodiment, step S2 specifically includes:

step S2.1: detecting the association relations within the constructed frequent item set, and constructing a coarse relation table;

step S2.2: filtering the contents of the coarse relation table.

In one embodiment, step S2.1 specifically includes:

finding any two frequent items A_im and A_jn that meet the minimum confidence and are located earlier and later in the frame, respectively, so that i < j;

judging the confidence between the two frequent items and the reverse confidence; when the confidence is greater than a threshold t_rl and the reverse confidence is less than t_e, recording the relation A_im->A_jn, wherein the frequent item on the left side of the symbol -> is called the antecedent and the frequent item on the right side is called the consequent;

after all frequent items have been processed, all the recorded relations form a coarse relation table R.

In one embodiment, step S2.2 specifically includes:

sequentially checking all relations in the coarse relation table; when one consequent has a plurality of antecedents, keeping only the relation with the maximum reverse confidence and rejecting the others;

sorting the remaining relations by the byte position i of their frequent items A_im, and finally constructing a relation tree T containing s nodes.

In one embodiment, the method comprises:

and when the number of the finally constructed relationship trees is more than 1, placing the obtained relationship trees under the same root node.

In one embodiment, step S3 specifically includes:

step S3.1: encoding the nodes of the relation tree T as n₁, n₂, …, n_s;

step S3.2: representing each frame in the frame set I as an s-dimensional Boolean vector of 0s and 1s according to whether each node is present, wherein the value of the k-th coordinate indicates whether node n_k is present in the frame, finally obtaining from the frame set I a vector set I_v.

In one embodiment, step S4 specifically includes:

step S4.1: measuring the distance between any two vectors v_i and v_j in the vector set I_v by the Jaccard distance, letting each sample initially form an independent cluster, and carrying out hierarchical clustering to obtain a dendrogram structure;

step S4.2: dividing the dendrogram into categories according to the set number of clusters, wherein each resulting category corresponds to an actual category of the data frames.

In one embodiment, the method further comprises:

and analyzing the independence of each cluster frame to restore the hierarchical architecture of the protocol.

One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:

The invention provides a classification method for unknown bit stream protocols. First, bit streams that frequently appear at specific positions are taken as the features of a protocol, the information that frequently appears at each position of the data frames is found, and a frequent item set is constructed. Next, the association relations within the constructed frequent item set are detected, and frequent items that do not meet the preset conditions are filtered out. The data frames are then converted into Boolean vectors according to the constructed frequent item set and the detected association relations, wherein one Boolean vector corresponds to one data frame. Finally, hierarchical clustering is performed on the converted data frames.

Based on frequent feature extraction and association rule mining, and taking into account that the frequent strings of different protocol types differ, the invention provides a frame classification system for bit stream protocols of unknown format and content, so that the frames in each resulting cluster are of a relatively uniform type. The method fully considers the position of the inherent fields of different protocols within the frame, finds the key fields that distinguish protocol types by detecting the association relations within the constructed frequent item set and filtering out frequent items that do not meet the preset conditions, and then classifies frames of different protocol types using a hierarchical clustering algorithm based on the Jaccard distance. Taking position information into account makes the classification result more accurate.

Furthermore, the invention adopts a hierarchical clustering method, which can restore the hierarchical structure of the protocol, so that the individual clusters can be analyzed further.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart illustrating a classification method for unknown bit stream protocols according to the present invention;

FIG. 2 is a flow chart of first-level frequent extraction according to an embodiment of the present invention;

FIG. 3 is a flow chart of multi-level frequent extraction according to an embodiment of the present invention;

FIG. 4 is a flowchart of relational tree mining according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating a confidence score according to an embodiment of the present invention;

FIG. 6 is a flowchart illustrating hierarchical clustering according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an independence analysis in accordance with an embodiment of the present invention.

Detailed Description

The invention aims to provide a classification method for unknown bit stream protocols, aiming at the problems that the unknown bit stream protocols cannot be classified or the classification effect is poor in the method in the prior art, so that the classification accuracy is improved and the classification effect is improved.

In order to achieve the above purpose, the main concept of the invention is as follows:

Based on frequent feature extraction and association rule mining, and taking into account that the frequent strings of different protocol types differ, a frame classification method is provided for bit stream protocols of unknown format and content, so that the frames in each resulting cluster are of a relatively uniform type. The method fully considers the position of the inherent fields of different protocols within the frame, introduces the concept of a relation tree, and finds the key fields that distinguish protocol types by constructing the relation tree. It then classifies frames of different protocol types using a hierarchical clustering algorithm based on the Jaccard distance. Taking position information into account makes the classification result more accurate, and the use of hierarchical clustering makes it possible to restore the hierarchical structure of the protocol.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The present embodiment provides a classification method for unknown bit stream protocols, please refer to fig. 1, which includes:

Specifically, the inventors of the present application found, through extensive practice and research, that common protocol reverse analysis methods usually target a single-format protocol, such as a specific application layer protocol, for which the captured frames share similar formats; such methods divide and identify the format and semantics of the protocol by analyzing the statistical characteristics of different bytes. However, when the formats of the various protocols in an entire network system are unknown, the captured frames often differ considerably, and a method that only analyzes a protocol of one specific format may be ineffective.

Network protocols contain many types of frames; commonly, frames are divided into data frames for transferring data and control frames for controlling the network. Data frames tend to be long and their format is not very fixed, while control frames are usually shorter and more uniform in format. In the Internet protocol suite, for example, the frames in one network environment belong to protocols at several different layers, and a single layer also contains different types of protocols; the transport layer, for instance, contains both TCP and UDP. Therefore, the first prerequisite for inferring unknown bit stream frames is to divide the captured frames by category to obtain frame sets of purer type, and only then to infer their content.

To classify unknown frames, the features of the different frame types must first be found, and the frames are then classified according to those features. For unknown frames, frequently occurring bytes or strings are a simple and easily processed feature, and the frequently occurring bytes or strings differ between the frames of different protocol types. These frequent bytes are called frequent items, and they need to be extracted first. Frequent item extraction generally uses a traditional association rule mining algorithm, but such an algorithm only considers the content of the frequent items and is poorly suited to the characteristics of bit stream protocols.

Therefore, the method provided by the invention considers the position information of the inherent fields of different protocol types in the data frame while extracting the characteristics, thereby enabling the clustering result to be more accurate.

The frequently occurring information is the extracted features for subsequent clustering.

Step S2: detecting the association relations within the constructed frequent item set, and filtering out the frequent items that do not meet the preset conditions.

Specifically, step S2 mines association rules on the frequent item set constructed in S1, for example by finding the frequent items that satisfy the minimum confidence threshold and then mining further according to the degree of association between two frequent items.

Step S3: converting the data frames into Boolean vectors according to the constructed frequent item set and the detected association relations, wherein one Boolean vector corresponds to one data frame.

Step S3 converts the format of the data frames to facilitate the subsequent clustering.

Step S4: performing hierarchical clustering on the converted data frames.

Specifically, clustering can be performed according to the distance between any two vectors v_i and v_j in the vector set I_v.

In one embodiment, step S1 specifically includes:

Specifically, the data frame captured in step S1.1 is a data frame that has already been subjected to framing processing. Statistics are performed from the ith byte of the frame header, i.e. according to the position information.

First-level frequent extraction is a special case. As shown in fig. 2, it uses three thresholds: t_f1, t_r and t_l. t_f1 is the frequent threshold; a value k at a byte position is considered frequent when its frequency exceeds this threshold. t_r is the remaining-frame-count stop condition: when processing reaches the i-th byte, the current loop stops if the number of remaining frames is less than this value. t_l is the frame-length stop condition: when i > t_l, the current loop stops.

Each first-level value k has only 256 possible values, namely 0 to 255. The number of occurrences of each of the 256 values is counted separately; let s_k be the number of occurrences of the value k. When

s_k > u · t_f1

is satisfied, where u is the total number of frames, the value k is considered to be frequent.

The specific implementation of the examples is as follows:

The preset frame number is u = 3030, with p = 2, t_f1 = 0.04, t_l = 40 and t_r = 50. First-level frequent extraction is performed on the frame set I, and the process has reached i = 4. At byte 4, the number of frames that still have a value at byte 4 is c = 3030. First, c > t_r is satisfied, and 4 < t_l is also satisfied. Once these conditions are met, the number of occurrences of each of the 256 values of the 4th byte is counted. With t_f1 = 0.04, the minimum frame count derived from the frequent threshold is u · t_f1 ≈ 121 frames. Let s_k be the number of frames whose 4th byte has the value k; for example, s_k = 2424 when k = 17 and s_k = 1818 when k = 56. Since 2424 and 1818 both exceed 121, the values 17 and 56 both satisfy the frequent threshold requirement. Since k ranges over the 256 values of one byte, k is increased from 0 to 255 and each value is checked in turn. Finally, if only the two values k = 17 and k = 56 among the 256 values satisfy

s_k > u · t_f1

then the two first-level frequent items currently mined, 4-7 and 4-56, are added to the frequent item set F in sequence; if a value does not satisfy the above condition, it is not considered a frequent item of that position. The value of i is then increased step by step until the loop over i stops.
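The first-level extraction loop described above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `first_level_frequent`, the 0-based byte indexing, the representation of a frequent item as a `(position, value)` pair, and the toy frames are all assumptions made for the example.

```python
from collections import Counter


def first_level_frequent(frames, t_f1, t_r, t_l):
    """Return first-level frequent items as (position, value) pairs.

    frames: list of bytes objects, implicitly left-aligned at byte 0.
    t_f1:   frequent threshold, as a fraction of the total frame count u.
    t_r:    stop once fewer than t_r frames still have a byte at position i.
    t_l:    stop once the position i exceeds t_l.
    """
    u = len(frames)
    frequent = []
    i = 0
    while i <= t_l:  # frame-length stop condition (stop when i > t_l)
        # Only frames long enough to have a byte at position i are counted.
        values = [f[i] for f in frames if len(f) > i]
        if len(values) < t_r:  # remaining-frame stop condition
            break
        counts = Counter(values)
        for k in sorted(counts):
            if counts[k] > u * t_f1:  # s_k exceeds the minimum frame count
                frequent.append((i, k))
        i += 1
    return frequent


# Hypothetical toy capture: 10 short frames.
frames = [bytes([1, 17])] * 6 + [bytes([1, 56])] * 3 + [bytes([2])]
items = first_level_frequent(frames, t_f1=0.25, t_r=5, t_l=40)
print(items)  # [(0, 1), (1, 17), (1, 56)]
```

With 10 frames and t_f1 = 0.25 the minimum count is 2.5, so the value 2 at position 0 (one occurrence) is dropped, and the loop stops at position 2 because no frame is long enough.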

When p-level frequent items are generated, the (p-1)-level frequent items are used as the candidate set, and higher-level frequent items are generated and checked according to the apriori principle. Compared with first-level frequent extraction, p-level frequent extraction additionally requires the thresholds t_c and t_fp. First, a group of adjacent frequent items is obtained, where x is taken from A_im, with m ranging from 0 to the number of frequent items of the i-th byte, and y is taken from A_(i+1)n, with n ranging from 0 to the number of frequent items of the (i+1)-th byte. t_c bounds the difference between two adjacent frequent items x and y: when |P(x) - P(y)| < t_c is satisfied, x and y are considered to form a candidate p-level frequent item A_im,n. The frequency P(A_im,n) is calculated and compared with the p-level frequent threshold t_fp; a p-level frequent item satisfying the threshold is recorded in F, and the two (p-1)-level frequent items that form it are removed from F. Calculating P(A_im,n) requires the AC (Aho-Corasick) algorithm for multi-pattern matching. After all possible combinations A_im,n at the current value of i have been processed, the value of i is increased by 1 and the next byte position is processed.

The specific implementation of the examples is as follows:

The preset frame number is u = 3030, with p = 2, i = 4, t_c = 0.1 and t_f2 = 1.5×10^-4, and second-level frequent items need to be generated, as shown in fig. 3. All frequent items, such as 4-7 and 4-56, are taken from the first-level frequent items where i = 4, and all frequent items, such as the three items 5-128, 5-250 and 5-18, are taken from the first-level frequent items where i = 5. The two adjacent groups of frequent items are then fully combined, giving 6 combinations: 4-7,128; 4-7,250; 4-7,18; 4-56,128; 4-56,250; 4-56,18, where the last two numbers are the frequent values of the two consecutive bytes. The frequencies of the two items in each combination are compared, and the combinations satisfying |P(A_4p) - P(A_5q)| < 0.1 are the 4 combinations 4-7,128; 4-7,18; 4-56,250; 4-56,18. For these combinations, the frequencies with which the value pairs 7,128; 7,18; 56,250; 56,18 occur in bytes 4 and 5 are measured and each is compared with 1.5×10^-4. The combination 4-56,18, which satisfies this condition, is finally recorded into the frequent item set F. Since a higher-level frequent item has now been generated, the two lower-level frequent items 4-56 and 5-18 are removed from F.
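The second-level generation step can be illustrated with the sketch below. It is written under stated assumptions: frequent items are `(position, value)` pairs with 0-based positions, the pair frequency is obtained by a direct positional comparison instead of the AC multi-pattern matching named in the text, and the function name, frames and thresholds are hypothetical.

```python
def second_level_frequent(frames, level1, t_c, t_f2):
    """Combine adjacent first-level items into second-level items.

    level1: dict mapping (position, value) -> frequency P in [0, 1].
    Returns (level2, remaining): the second-level items with their
    frequencies, and the first-level items left after removing those
    absorbed into a second-level item.
    """
    u = len(frames)
    remaining = dict(level1)
    level2 = {}
    for (i, x), p_x in level1.items():
        for (j, y), p_y in level1.items():
            # Candidates must sit at adjacent positions and have close frequencies.
            if j != i + 1 or abs(p_x - p_y) >= t_c:
                continue
            hits = sum(1 for f in frames if len(f) > j and f[i] == x and f[j] == y)
            p_xy = hits / u
            if p_xy > t_f2:
                level2[(i, x, y)] = p_xy
                # A higher-level item replaces the lower-level items it contains.
                remaining.pop((i, x), None)
                remaining.pop((j, y), None)
    return level2, remaining


# Hypothetical toy capture of 10 two-byte frames.
frames = [bytes([56, 18])] * 5 + [bytes([7, 128])] * 3 + [bytes([7, 250])] * 2
level1 = {(0, 56): 0.5, (0, 7): 0.5, (1, 18): 0.5, (1, 128): 0.3, (1, 250): 0.2}
level2, remaining = second_level_frequent(frames, level1, t_c=0.1, t_f2=0.2)
print(level2)             # {(0, 56, 18): 0.5}
print(sorted(remaining))  # [(0, 7), (1, 128), (1, 250)]
```

Here the pair (7, 18) passes the t_c check but never occurs together, so only (56, 18) becomes a second-level item.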

In one embodiment, step S2 specifically includes:

step S2.2: and filtering the contents in the coarse relation table.

In one embodiment, step S2.1 specifically includes:

judging the confidence between the two frequent items and the reverse confidence; when the confidence is greater than a threshold t_rl and the reverse confidence is less than t_e, recording the relation A_im->A_jn, wherein the frequent item on the left side of the symbol -> is called the antecedent and the frequent item on the right side is called the consequent;

Specifically, the frequent item set F obtained in step S1 is a csv table containing p-level frequent items, where p = 0, 1, 2, …, n. Each row of the table represents one frequent item, and all frequent items are recorded in the frequent item set F. Each row records the byte position i of the frequent item and its content A_ij. The set T formed by all nodes of the relation tree satisfies

First, find any two frequent items A_im and A_jn that meet the minimum confidence and are located earlier and later in the frame, respectively; by the definition in step S1, i < j is satisfied. The confidence between the two and the reverse confidence are checked at the same time. When the confidence is greater than a threshold t_rl and the reverse confidence is less than t_e, the relation A_im->A_jn is recorded, where the frequent item to the left of the symbol -> is called the antecedent and the one to the right the consequent. After all frequent items have been processed, all the recorded relations form a coarse relation table R.

The two frequent items A_im and A_jn, earlier and later in the data frame, are both taken from the frequent item set F. A two-layer loop is constructed: the outer loop takes A_im from F and the inner loop takes A_jn from F. Both t_rl and t_e are values close to 1. Suppose the frequent items in the frequent item set are denoted f_k, where k = 0, 1, 2, …, n; unlike the earlier use of n, here n is the number of frequent items in F. If the A_im taken by the outer loop is f_p and the A_jn taken by the inner loop is f_q, then p < q. The confidence condition is expressed as P(A_im|A_jn) > t_rl and the reverse confidence condition as P(A_jn|A_im) < t_e. Requiring the reverse confidence condition eliminates the interference of equivalent items, and the relations that satisfy both confidence thresholds are output to the coarse relation table R. After each pair is processed, the value of n is increased, and once n has taken its last value the value of j is increased, until A_jn has reached the last item in the frequent item set F. After A_jn has taken the last item, A_im moves to the next item: the value of m is increased first, and once m has taken its last value the value of i is increased, until A_im has reached the penultimate item in the frequent item set F. The output of the coarse relation table R then ends and the construction of the relation tree T begins. The whole process, including the construction of the coarse relation table R, is shown in FIG. 4.

The specific implementation of the examples is as follows:

Set t_rl = 0.9 and t_e = 0.95. For example, one item 2-15 is first taken from the frequent item set F, and then the next item 2-25 is taken in order. The i of the current item 2-25 is 2, which is not larger than the 2 of the first item, so this item is skipped. The next item is 7-40, whose i is 7, satisfying 7 > 2, so the flow can continue. If P(2-25|7-40) > 0.9 while P(7-40|2-25) < 0.95, the relation is part of the relation tree T, and the relation 2-25->7-40 is stored in the coarse relation table R. If P(2-25|7-40) > 0.9 and P(7-40|2-25) > 0.95 hold at the same time, then 2-25 and 7-40 constitute equivalent items, and there is no need to record the relation into T. If P(2-25|7-40) < 0.9, the relation is discarded directly. The next item 7-42 is then taken from the frequent item set F; since P(2-25|7-42) < 0.9, the relation 2-25->7-42 is discarded directly. A_jn continues to take subsequent items, which are checked in sequence; after all frequent items in F have been checked, A_im takes the next item 2-25. This step ends after A_im has taken all the items in the frequent item set.
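The two-layer loop and the two confidence checks can be sketched as below. This is an illustrative reading of the text, not the authors' code: it assumes first-level items only, represented as `(position, value)` pairs with 0-based positions, and it estimates each conditional probability by counting frames. The helper names and the toy frames are hypothetical.

```python
def contains(frame, item):
    """True if the frame carries the given value at the given byte position."""
    pos, value = item
    return len(frame) > pos and frame[pos] == value


def conditional(frames, a, b):
    """Estimate P(a | b): the share of frames containing b that also contain a."""
    with_b = [f for f in frames if contains(f, b)]
    if not with_b:
        return 0.0
    return sum(contains(f, a) for f in with_b) / len(with_b)


def coarse_relations(frames, items, t_rl, t_e):
    """Return rows (antecedent, consequent, reverse_confidence) of the table R."""
    rows = []
    ordered = sorted(items)
    for a in ordered:            # outer loop: candidate antecedent A_im
        for b in ordered:        # inner loop: candidate consequent A_jn
            if a[0] >= b[0]:
                continue         # the antecedent must sit earlier in the frame (i < j)
            conf = conditional(frames, a, b)   # P(A_im | A_jn)
            rev = conditional(frames, b, a)    # P(A_jn | A_im)
            if conf > t_rl and rev < t_e:
                rows.append((a, b, rev))
    return rows


# Hypothetical toy capture: bytes 0 and 1 always occur together (equivalent items).
frames = [bytes([1, 5, 9])] * 3 + [bytes([1, 5, 8])] * 3 + [bytes([2, 6, 7])] * 2
items = [(0, 1), (1, 5), (2, 9), (2, 8)]
rows = coarse_relations(frames, items, t_rl=0.9, t_e=0.95)
for row in rows:
    print(row)
```

In this toy data the pair (0, 1) and (1, 5) has both confidences equal to 1, so it is treated as a pair of equivalent items and no relation is recorded for it.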

In one embodiment, step S2.2 specifically includes:

Specifically, after the coarse relation table R is obtained, its contents need to be filtered. All relations in the coarse relation table are checked in sequence; whenever one consequent has several antecedents at the same time, only the relation with the maximum reverse confidence is kept and the others are removed. The remaining relations are sorted by the byte position i of their frequent items A_im, and a relation tree T containing s nodes is finally constructed. If several relation trees are obtained, they are placed under the same root node.

In specific implementation, each record of the coarse relation table consists of three items: the predecessor node A of B, the successor node B of A, and the reverse confidence P(B|A) between them, where A and B form a relation A->B. After the above steps, some nodes may have two predecessor nodes; for example, the records in R may include the two relations A->B and C->B. When constructing the relation tree, following the maximum reverse confidence principle, the relation with the larger reverse confidence is kept and the other is discarded, so that a relation tree T without cycles is formed.
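The filtering rule can be sketched in a few lines. The sketch assumes each record of R is a tuple `(antecedent, consequent, reverse_confidence)` with items written as `(position, value)` pairs; the function name and the third record in the sample table are hypothetical.

```python
def filter_relations(rows):
    """Keep, for every consequent, only the relation with the largest reverse confidence."""
    best = {}
    for a, b, rev in rows:
        # On a tie the relation seen first is kept (an arbitrary choice of this sketch).
        if b not in best or rev > best[b][2]:
            best[b] = (a, b, rev)
    # Sort the surviving relations by the byte position of their antecedent.
    return sorted(best.values())


rows = [
    ((2, 16), (17, 32), 0.42),   # A -> B
    ((4, 250), (17, 32), 0.60),  # C -> B, larger reverse confidence
    ((2, 16), (30, 1), 0.30),    # hypothetical extra relation
]
kept = filter_relations(rows)
print(kept)  # [((2, 16), (30, 1), 0.3), ((4, 250), (17, 32), 0.6)]
```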

After the coarse relation table R is generated, the construction of the relation tree T begins. Each node in the relation tree T has one predecessor node and any number of successor nodes. Because a node can have several successors but only one predecessor, the successors are stored in a successor node table. First, the Root of T is initialized, and then the records r_k in R are read in sequence, where k = 0, 1, …, n and n is given by the number of records in R. The antecedent of r_k is r_k(a) and the consequent is r_k(b). The first record taken from R is handled differently from the others.

When the first record r₁ is taken from R, the successor node r₁(a) is first added under the root node, and the successor node r₁(b) is then added after node r₁(a). The operation in the specific embodiment is as follows:

After constructing Root, suppose the first record r₁ read from the table R is 2-16->17-32 with P(b|a) = 0.42. This reverse confidence does not need to be considered when reading the first relation, so it is only necessary to add 2-16 to the successor node table of Root and then add 17-32 to the successor node table of 2-16.

The remaining records r_k are added somewhat differently. When r_k(a) is obtained, all nodes n_s in T are traversed first, where s = 0, 1, 2, …, m and m is the number of nodes currently in T. If there is an n_s = r_k(a), it is then checked whether r_k(b) exists in T. If r_k(b) does not exist, r_k(b) is added to the successor node table of n_s. If r_k(b) exists, let its predecessor node be bb; P(r_k(b)|r_k(a)) and P(r_k(b)|bb) are compared, the relation with the larger value is kept, and the other relation is removed from T, that is, r_k(b) is removed from the successor node table of the corresponding predecessor node. Returning to the upper-level judgment, if there is no n_s = r_k(a), then r_k(a) is added directly under the Root node; supposing the predecessor node of r_k(b) is bb, P(r_k(b)|r_k(a)) and P(r_k(b)|bb) are compared and the relation with the larger value is kept, the rest being the same as above. The whole flow is shown in fig. 5.

The specific implementation of the examples is as follows:

For example, suppose the relation tree already contains three nodes, Root, 2-16 and 17-32, where 2-16 is the predecessor node of 17-32. The second relation is now read from R; assume that the relation is 4-250->17-32 and that P(4-250->17-32) = 0.6. The antecedent 4-250 of the relation is considered first. Since the three existing nodes in T do not include 4-250, 4-250 is added to the successor node table of Root. Next, 17-32 is considered. A search in T finds that T already contains 17-32, so the two confidences must be compared to decide which branch to keep. When P(17-32|2-16) > P(17-32|4-250), the original branch of T is kept, no operation is performed, and the next relation is read. Otherwise, node 17-32 is deleted from the successor node table of 2-16, 4-250 is added to T with Root as its predecessor node, and the branch 4-250->17-32 is kept instead. The third relation is then read from R and the operation is repeated.
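The incremental construction can be sketched as follows, using a predecessor map instead of explicit node objects. This is a simplified illustration under assumptions: records are `(antecedent, consequent, reverse_confidence)` tuples, and a node that so far hangs directly under Root without a kept relation is treated as having reverse confidence 0, a case the text does not spell out. The function names are hypothetical.

```python
ROOT = "Root"


def build_relation_tree(rows):
    """Return a dict mapping each node to its single predecessor node."""
    parent = {}  # node -> predecessor node
    kept = {}    # consequent -> reverse confidence of the relation currently kept
    for a, b, rev in rows:
        # An antecedent not yet in the tree is added directly under Root.
        parent.setdefault(a, ROOT)
        # A new consequent is attached to its antecedent; an existing one is
        # moved only if the new relation has the larger reverse confidence.
        if b not in parent or rev > kept.get(b, 0.0):
            parent[b] = a
            kept[b] = rev
    return parent


def successor_tables(parent):
    """Derive the successor node table of every node from the predecessor map."""
    tables = {ROOT: []}
    for node in parent:
        tables.setdefault(node, [])
    for node, pred in parent.items():
        tables[pred].append(node)
    return {node: sorted(succ) for node, succ in tables.items()}


# The two relations of the example, with reverse confidences 0.42 and 0.6.
rows = [((2, 16), (17, 32), 0.42), ((4, 250), (17, 32), 0.60)]
tree = build_relation_tree(rows)
tables = successor_tables(tree)
print(tables[ROOT])      # [(2, 16), (4, 250)]
print(tables[(4, 250)])  # [(17, 32)]
```

Because an antecedent always sits at an earlier byte position than its consequent, re-attaching a node in this way cannot create a cycle.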

In one embodiment, the method comprises:

In one embodiment, step S3 specifically includes:

step S3.1: encoding each node of the relation tree T as n₁, n₂, …, n_s;

Specifically, suppose the relation tree T contains, besides Root, three nodes: 2-16, 4-250 and 17-32. The data frames frame_i are read in sequence from the frame set I, where i = 0, 1, …, u and u is the total number of data frames. If a frame frame_i takes the value 16 at byte 2, 320 at byte 4 and 0 at byte 17, the frame is converted to the Boolean vector (1, 0, 0). If another frame frame_j, where i ≠ j, takes the value 7 at byte 2 and 255 at byte 4, and is shorter than 18 bytes so that byte 17 does not exist, the frame can only be converted to the Boolean vector (0, 0, 0). The resulting vectors are added to the set I_v of Boolean vectors.
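The conversion of a frame into a Boolean vector can be sketched in Python as follows. This is an illustrative sketch only: the function names, the "position-value" string encoding of the tree nodes, and zero-based byte positions are assumptions, and byte 4 of the first frame is set to an arbitrary value other than 250.

```python
def parse_item(item):
    """Split a node label such as '17-32' into (byte position, value)."""
    pos, val = item.split("-")
    return int(pos), int(val)

def frame_to_vector(frame, items):
    """Return one 0/1 component per node: 1 if the frame holds the
    node's value at the node's byte position, 0 otherwise (including
    the case where the frame is too short to contain that position)."""
    vec = []
    for item in items:
        pos, val = parse_item(item)
        vec.append(1 if pos < len(frame) and frame[pos] == val else 0)
    return tuple(vec)

items = ["2-16", "4-250", "17-32"]      # nodes of T other than Root

frame_i = bytearray(18)                 # 18 bytes, so byte 17 exists
frame_i[2], frame_i[4], frame_i[17] = 16, 32, 0
frame_j = bytearray(17)                 # too short to have a byte 17
frame_j[2], frame_j[4] = 7, 255

I_v = [frame_to_vector(bytes(f), items) for f in (frame_i, frame_j)]
print(I_v)  # [(1, 0, 0), (0, 0, 0)]
```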

In one embodiment, step S4 specifically includes:

Specifically, hierarchical clustering is performed on the vectors in I_v obtained for the frames. Initially, I_v is treated as u clusters, each vector forming a cluster of its own, and the distance between every pair of clusters is calculated. When performing hierarchical clustering, the group average method may be used to calculate the distance between clusters. It operates as follows.

Every sample in one cluster is paired with every sample in the other, and the distance of each pair is calculated. For example, if clusters A and B contain 2 and 3 samples respectively, a total of 6 distances must be calculated. These 6 distances are averaged, and the average d is the distance between clusters A and B. The distance between individual samples is measured by the Jaccard distance.

A specific example is as follows:

Suppose cluster A contains the two samples (0,0,1) and (0,1,1) and cluster B contains the two samples (1,1,0) and (1,0,0). Four distances must first be calculated: between (0,0,1) and (1,1,0); (0,0,1) and (1,0,0); (0,1,1) and (1,1,0); and (0,1,1) and (1,0,0). By the definition of the Jaccard distance, these four distances are 1, 1, 0.667 and 1 respectively, so the average distance between the two clusters is 0.917.
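The figures in this example can be checked with a short Python sketch. The function names are hypothetical, and treating two all-zero vectors as being at distance 0 is an assumption, since the text does not address that case.

```python
from itertools import product

def jaccard_distance(u, v):
    """Jaccard distance between two Boolean vectors of equal length."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    # Assumption: two all-zero vectors are treated as identical.
    return 0.0 if either == 0 else 1.0 - both / either

def group_average_distance(cluster_a, cluster_b):
    """Average of the Jaccard distances over all cross-cluster pairs."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(jaccard_distance(u, v) for u, v in pairs) / len(pairs)

A = [(0, 0, 1), (0, 1, 1)]
B = [(1, 1, 0), (1, 0, 0)]
print([round(jaccard_distance(u, v), 3) for u, v in product(A, B)])
# [1.0, 1.0, 0.667, 1.0]
print(round(group_average_distance(A, B), 3))  # 0.917
```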

After the distance between every pair of clusters has been calculated, the minimum of all these distances is found and the two corresponding clusters are merged into one. The distances between every pair of clusters are then recalculated, and these steps are repeated until the number of remaining clusters is k, at which point the loop terminates. Here k is the preset number of clusters; the specific flow is shown in Fig. 6.

A specific example is as follows:

Suppose there are four clusters A, B, C and D, and k = 2. If the distance between A and B is 0.6, between A and C 0.8, between B and C 0.7, between A and D 0.9, between B and D 0.74, and between C and D 0.3, then the samples of clusters C and D are merged into a new cluster, denoted E in this embodiment. Three clusters remain at this point, so merging must continue, and the distances between the clusters are recalculated. Suppose that after recalculation the distances are 0.6 between A and B, 0.83 between A and E, and 0.73 between B and E. The distance between A and B is now the smallest, so A and B are merged. Only two clusters then remain, which satisfies the required value of k, and the result can be output.
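The merge loop can be sketched in Python as follows. This is a naive illustration rather than the implementation of the invention: the function names and the four sample vectors are hypothetical, all distances are recomputed from scratch on every pass, and ties are broken in favour of the first minimal pair found.

```python
from itertools import combinations

def jaccard(u, v):
    """Jaccard distance between two Boolean vectors (0.0 if both are all zero)."""
    union = sum(1 for a, b in zip(u, v) if a or b)
    inter = sum(1 for a, b in zip(u, v) if a and b)
    return 1.0 - inter / union if union else 0.0

def group_average(c1, c2):
    """Mean Jaccard distance over all cross-cluster sample pairs."""
    return sum(jaccard(u, v) for u in c1 for v in c2) / (len(c1) * len(c2))

def agglomerate(vectors, k, linkage=group_average):
    """Merge the closest pair of clusters until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[v] for v in vectors]        # every vector starts alone
    while len(clusters) > k:
        # Find the pair of clusters with the minimum linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical Boolean vectors standing in for I_v.
I_v = [(1, 0, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1)]
result = agglomerate(I_v, k=2)
print(result)
```

Here the two identical vectors are merged first (distance 0), then the remaining two (distance 0.5), leaving the k = 2 clusters requested.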

Once the number of clusters has reached k, the resulting clusters are output. The vectors in the output clusters correspond one-to-one to the actual frames, so the cluster analysis of these vectors can also be regarded as a cluster analysis of the data frames.

In one embodiment, the method further comprises:

Specifically, hierarchical clustering merges only two clusters at a time, so for a given value of k a single cluster may still contain frames of several different categories. The independence of each cluster therefore needs to be analyzed by comparing, in the cluster result tree, the distance computed when the target cluster was formed with the distance computed at the next merge involving it, as shown in Fig. 7. A cluster formed by merging two sub-clusters is called the parent cluster of those sub-clusters, and the formation distance d_f of a cluster is defined as the distance between its two sub-clusters. It then suffices to obtain two quantities: the difference between the formation distance of the cluster's parent cluster and that of the cluster itself, and the larger of the differences between the formation distance of the cluster and those of its sub-clusters. When the ratio of the former to the latter is less than the threshold t_ra, the current cluster is considered not to be independent.

A specific example is as follows:

Suppose clusters A and B merge to form cluster D, and clusters C and D merge to form cluster E, with t_ra = 0.2. Let the formation distances of clusters A, B, C, D and E be 0.4, 0.3, 0.2, 0.6 and 0.8 respectively, and let the independence analysis be performed on cluster D.

The difference between the formation distances of D and its parent cluster E is 0.8 - 0.6 = 0.2, while the larger of the sub-cluster differences is that between D and B, 0.6 - 0.3 = 0.3. The ratio is therefore 0.2 / 0.3 ≈ 0.667, which is greater than t_ra, and thus cluster D can be considered to represent an independent frame type. If cluster D needs to be analyzed in more depth, the frames of cluster D are taken as the set I again and the method returns to step S1 to continue the analysis.
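The independence test for cluster D can be reproduced with a minimal Python sketch. The function and parameter names are hypothetical, the form of the ratio is inferred from the worked figures above, and raising an error when no sub-cluster has a smaller formation distance is an assumption, since the text does not cover that case.

```python
def independence_ratio(d_cluster, d_parent, d_children):
    """Ratio of the parent-side gap to the largest sub-cluster-side gap.

    d_cluster  -- formation distance of the cluster under test
    d_parent   -- formation distance of its parent cluster
    d_children -- formation distances of its sub-clusters
    """
    parent_gap = d_parent - d_cluster
    child_gap = max(d_cluster - d for d in d_children)
    if child_gap <= 0:
        # Assumption: the ratio is left undefined in this case.
        raise ValueError("no sub-cluster formed at a smaller distance")
    return parent_gap / child_gap

def is_independent(d_cluster, d_parent, d_children, t_ra):
    """A cluster is not independent when the ratio is less than t_ra."""
    return independence_ratio(d_cluster, d_parent, d_children) >= t_ra

# Cluster D (0.6) with parent E (0.8) and sub-clusters A (0.4), B (0.3).
ratio = independence_ratio(0.6, 0.8, [0.4, 0.3])
print(round(ratio, 3))                                   # 0.667
print(is_independent(0.6, 0.8, [0.4, 0.3], t_ra=0.2))    # True
```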

In general, the method of the invention takes full account of the position of the inherent fields of different protocols within the frames and proposes the concept of a relation tree. By constructing the relation tree, it finds the key fields that characterize different protocol types, and then classifies frames of different protocol types using a hierarchical clustering algorithm based on the Jaccard distance. Taking position information into account makes the classification result more accurate, and the use of hierarchical clustering makes it possible to restore the hierarchical structure of the protocol.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims

1. A classification method for unknown bit stream protocols, comprising:

step S4: carrying out hierarchical clustering on the converted data frames;

wherein, step S1 specifically includes:

2. The method according to claim 1, wherein step S2 specifically comprises:

step S2.2: and filtering the contents in the coarse relation table.

3. The method according to claim 2, characterized in that step S2.1 comprises in particular:

4. The method according to claim 2, characterized in that step S2.2 comprises in particular:

5. The method of claim 4, wherein the method comprises:

6. The method according to claim 1, wherein step S3 specifically comprises:

step S3.1: encoding each node of the relation tree T as n₁, n₂, …, n_s;

7. The method according to claim 1, wherein step S4 specifically comprises:

8. The method of claim 7, wherein the method further comprises: