CN110457465B - Classification method for unknown bit stream protocol - Google Patents

Info

Publication number
CN110457465B
Authority
CN
China
Prior art keywords
frequent
frequent item
items
frame
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910541785.6A
Other languages
Chinese (zh)
Other versions
CN110457465A (en
Inventor
吴静
王思源
郭凡
周建国
周沫
潘玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN201910541785.6A
Publication of CN110457465A
Application granted
Publication of CN110457465B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/18 Protocol analysers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classification method for unknown bit stream protocols. First, bit strings that appear frequently at specific positions are taken as the features of a protocol: the frequently occurring information at every position of the data frames is found and a frequent item set is constructed. Next, the association relations among the constructed frequent items are detected, and frequent items that do not meet preset conditions are filtered out. Each data frame is then converted into a Boolean vector according to the frequent item set and the detected association relations, and the converted data frames are hierarchically clustered. Because position information is taken into account, the classification result is more accurate, and the hierarchical clustering can recover the layered architecture of the protocol.

Description

Classification method for unknown bit stream protocol
Technical Field
The invention relates to the field of communication, in particular to a classification method for unknown bit stream protocols.
Background
Network protocol analysis refers to analyzing and inferring the type, format and content of network protocols from captured network data in order to understand the state of a target network. In military applications, analyzing the protocols of a target network can reveal the protocol formats defined by an adversary, so that the content exchanged by the adversary can be parsed, additional information obtained, and a response to the adversary's actions prepared; the topology of the adversary's network can also be inferred from the protocols, revealing the positions and number of communicating entities. In civilian applications, protocol analysis reveals the information transmitted in a network, allowing a network administrator to manage, maintain, analyze and monitor the network environment.
Traditional analysis and detection of network protocols usually relies on fixed features. Conventional network protocols often use fixed formats; for example, application-layer protocols typically use fixed port numbers. An analyst can therefore determine the protocol type once the corresponding fixed feature is found, and then use known knowledge of that protocol to interpret the information it carries.
In the course of implementing the present invention, the inventors found that the prior-art methods have at least the following technical problem:
as networks grow, information security requirements rise and service types multiply, communicating parties increasingly use proprietary protocols whose formats are often unknown. In this case, it is difficult to analyze and interpret the traffic with existing methods that rely on prior knowledge.
The prior-art methods therefore either cannot classify unknown bit stream protocols or classify them poorly.
Disclosure of Invention
In view of the above, the present invention provides a classification method for an unknown bit stream protocol, so as to solve or at least partially solve the problem that the method in the prior art cannot classify the unknown bit stream protocol or has a poor classification effect.
The invention provides a classification method for an unknown bit stream protocol, which comprises the following steps:
step S1: taking a bit stream which is preset to frequently appear at a specific position as a characteristic of a protocol, finding out frequently appearing information at all positions of a data frame, and constructing a frequent item set;
step S2: detecting the incidence relation of the constructed frequent item set, and filtering frequent items which do not meet preset conditions;
step S3: converting different data frames into expression of Boolean vectors according to the constructed frequent item set and the detected incidence relation, wherein one Boolean vector corresponds to one data frame;
step S4: and carrying out hierarchical clustering on the converted data frames.
In one embodiment, step S1 specifically includes:
step S1.1: left-aligning, in sequence, all data frames in the set I captured from the network environment;
step S1.2: counting, over all frames, the values of the i-th byte from the frame header, and screening out the frequently occurring byte values;
step S1.3: judging whether the frequency of a value k of the i-th byte exceeds a threshold t_f; if so, the value k of the i-th byte forms a frequent item, denoted A_{ij}, where j indicates the j-th frequent item of the i-th byte that reaches the threshold t_f; the frequent items include first-level frequent items and n-level frequent items;
step S1.4: recording all frequent items as a set F, which contains the frequently occurring information at all positions, including first-level and n-level frequent items, n = 0, 1, 2, …, p; when a higher-level frequent item is generated, the lower-level frequent items it contains are removed from F.
In one embodiment, step S2 specifically includes:
step S2.1: detecting the incidence relation of the constructed frequent item set, and constructing a coarse relation table;
step S2.2: and filtering the contents in the coarse relation table.
In one embodiment, step S2.1 specifically includes:
finding any two frequent items A_{im} and A_{jn} that meet the minimum confidence and are located earlier and later in the frame, respectively, i.e. satisfying i < j;
judging the confidence between the two frequent items and the reverse confidence; when the confidence is greater than a threshold t_{rl} and the reverse confidence is less than t_e, recording the relation A_{im} -> A_{jn}, where the frequent item on the left of the symbol -> is called the antecedent and the frequent item on the right is called the consequent;
after all frequent items have been traversed, all recorded relations form the coarse relation table R.
In one embodiment, step S2.2 specifically includes:
sequentially checking all relations in the coarse relation table; when one consequent has several antecedents, retaining only the relation with the largest reverse confidence and rejecting the others;
sorting the remaining relations by the byte position i of their antecedent A_{im}, and finally constructing a relation tree T containing s nodes.
In one embodiment, the method comprises:
when more than one relation tree is finally constructed, placing the obtained relation trees under the same root node.
In one embodiment, step S3 specifically includes:
step S3.1: encoding the nodes of the relation tree T as n_1, n_2, …, n_s;
step S3.2: representing each frame in the frame set I by 0 and 1 according to whether each node is present, forming an s-dimensional Boolean vector in which the value of the k-th coordinate indicates whether n_k is present in the frame, and finally obtaining from the frame set I a set I_v of vectors.
In one embodiment, step S4 specifically includes:
step S4.1: measuring the distance between any two vectors v_i and v_j in the vector set I_v by the Jaccard distance, initially treating each sample as an independent cluster, and performing hierarchical clustering to obtain a dendrogram;
step S4.2: cutting the dendrogram into categories according to the set number of clusters, where the resulting categories correspond to the actual categories of the data frames.
In one embodiment, the method further comprises:
analyzing the independence of the frames in each cluster to recover the layered architecture of the protocol.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a classification method for unknown bit stream protocols, which comprises the steps of firstly, taking bit streams which are preset to frequently appear at specific positions as the characteristics of a protocol, finding out frequently appearing information at all positions of a data frame, and constructing a frequent item set; then, detecting the incidence relation of the constructed frequent item set, and filtering the frequent items which do not meet the preset conditions; converting different data frames into expression of Boolean vectors according to the constructed frequent item set and the detected incidence relation, wherein one Boolean vector corresponds to one data frame; and finally, performing hierarchical clustering on the converted data frames.
Based on frequent feature extraction and association rule mining, and taking into account that different types of protocols have different frequent strings, the invention provides a frame classification method for bit stream protocols of unknown format and content, so that the frames in each resulting cluster are of a relatively uniform type. The method fully considers the positions of the inherent fields of different protocols within the frames, finds the key fields that distinguish different protocol types by detecting the association relations of the constructed frequent item set and filtering out frequent items that do not meet preset conditions, and then classifies frames of different protocol types with a hierarchical clustering algorithm based on the Jaccard distance. Taking position information into account makes the classification result more accurate;
furthermore, the invention adopts hierarchical clustering, which can recover the layered architecture of the protocol, so that the independent clusters can be analyzed further.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart illustrating a classification method for unknown bit stream protocols according to the present invention;
FIG. 2 is a flow chart of first-level frequent extraction according to an embodiment of the present invention;
FIG. 3 is a flow chart of multi-level frequent extraction according to an embodiment of the present invention;
FIG. 4 is a flowchart of relational tree mining according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a confidence score according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating hierarchical clustering according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an independence analysis in accordance with an embodiment of the present invention.
Detailed Description
The invention aims to provide a classification method for unknown bit stream protocols, aiming at the problems that the unknown bit stream protocols cannot be classified or the classification effect is poor in the method in the prior art, so that the classification accuracy is improved and the classification effect is improved.
In order to achieve the above purpose, the main concept of the invention is as follows:
Based on frequent feature extraction and association rule mining, and taking into account that different types of protocols have different frequent strings, a frame classification method for bit stream protocols of unknown format and content is provided, so that the frames in each resulting cluster are of a relatively uniform type. The method fully considers the positions of the inherent fields of different protocols within the frames, introduces the concept of a relation tree, finds the key fields that distinguish different protocol types by constructing the relation tree, and then classifies frames of different protocol types with a hierarchical clustering algorithm based on the Jaccard distance. Taking position information into account makes the classification result more accurate, and the use of hierarchical clustering makes it possible to recover the layered architecture of the protocol.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The present embodiment provides a classification method for unknown bit stream protocols, please refer to fig. 1, which includes:
step S1: taking a bit stream which is preset to frequently appear at a specific position as a characteristic of a protocol, finding out frequently appearing information at all positions of a data frame, and constructing a frequent item set;
Specifically, through extensive practice and research the inventors found that common protocol reverse-analysis methods usually target a protocol with a single format, such as a specific application-layer protocol, for which the captured frames have similar formats; such methods partition and identify the protocol format and its semantics from the statistical characteristics of individual bytes. However, when the formats of the various protocols in an entire network system are unknown, captured frames often differ considerably, and methods that analyze only a protocol of one specific format may fail.
Network protocols contain many types of frames; commonly, frames are divided into data frames, which carry data, and control frames, which control the network. Data frames tend to be long with a less fixed format, while control frames are usually shorter with a more uniform format. In the Internet protocol suite, for example, frames in the same network environment belong to protocols at several different layers, and a single layer also contains different protocols: the transport layer, for instance, contains TCP and UDP. The first prerequisite for inferring unknown bit stream frames is therefore to partition the captured frames by category to obtain frame sets of purer type, and only then to infer their content.
To classify unknown frames, the features of the different frame types must first be found, and the frames are then classified by those features. For unknown frames, frequently occurring bytes or strings are a simple and easily processed feature, and they differ between frames of different protocol types. Such frequent bytes are called frequent items, and they must be extracted first. Frequent-item extraction usually relies on a traditional association rule mining algorithm, which considers only the content of the frequent items and is therefore poorly suited to the characteristics of bit stream protocols.
Therefore, the method provided by the invention considers the position information of the inherent fields of different protocol types in the data frame while extracting the characteristics, thereby enabling the clustering result to be more accurate.
The frequently occurring information is the extracted features for subsequent clustering.
Step S2: and detecting the incidence relation of the constructed frequent item set, and filtering the frequent items which do not meet the preset conditions.
Specifically, step S2 is to perform mining of association rules on the frequent item set constructed in S1, for example, mining out the frequent items that satisfy the minimum confidence threshold, and then performing further mining according to the association degree between two frequent items.
Step S3: and converting different data frames into expression of Boolean vectors according to the constructed frequent item set and the detected incidence relation, wherein one Boolean vector corresponds to one data frame.
Step S3 is to perform format conversion on the data frame to facilitate subsequent clustering.
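The conversion in step S3 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each relation-tree node is a single-byte frequent item written as a hypothetical (position, value) pair with 1-based positions, and the function name `to_boolean_vectors` is invented for the example.

```python
def to_boolean_vectors(frames, nodes):
    """Map each frame to an s-dimensional 0/1 vector.

    frames: list of bytes objects, assumed left-aligned.
    nodes:  list of (position, value) pairs standing in for the tree nodes
            n_1..n_s (assumption: single-byte items, 1-based positions).
    """
    vectors = []
    for frame in frames:
        vectors.append([
            # coordinate k is 1 when node n_k is present in this frame
            1 if len(frame) >= pos and frame[pos - 1] == value else 0
            for pos, value in nodes
        ])
    return vectors


frames = [bytes([15, 40]), bytes([15, 42]), bytes([9])]
nodes = [(1, 15), (2, 40), (2, 42)]
vectors = to_boolean_vectors(frames, nodes)
```

Each row of `vectors` corresponds to one frame, so the output has the same order as the frame set.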
Step S4: and carrying out hierarchical clustering on the converted data frames.
Specifically, clustering can be performed according to the distance between any two vectors v_i and v_j in the vector set I_v.
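A small sketch of distance-based clustering on Boolean vectors is given below. It uses the Jaccard distance named in the claims; the average-linkage merge rule and the function names are assumptions made for illustration, since the text above does not fix a linkage criterion.

```python
def jaccard_distance(u, v):
    """1 - |u AND v| / |u OR v| for two 0/1 vectors (0.0 if both are all zero)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


def agglomerate(vectors, n_clusters):
    """Bottom-up clustering: start with one cluster per sample, merge the closest pair."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Average linkage (an assumption): mean pairwise Jaccard distance.
        dists = [jaccard_distance(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


vectors = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
clusters = agglomerate(vectors, 2)
```

Recording the merge order instead of stopping at `n_clusters` would give the dendrogram that step S4 refers to.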
In one embodiment, step S1 specifically includes:
step S1.1: left-aligning, in sequence, all data frames in the set I captured from the network environment;
step S1.2: counting, over all frames, the values of the i-th byte from the frame header, and screening out the frequently occurring byte values;
step S1.3: judging whether the frequency of a value k of the i-th byte exceeds a threshold t_f; if so, the value k of the i-th byte forms a frequent item, denoted A_{ij}, where j indicates the j-th frequent item of the i-th byte that reaches the threshold t_f; the frequent items include first-level frequent items and n-level frequent items;
step S1.4: recording all frequent items as a set F, which contains the frequently occurring information at all positions, including first-level and n-level frequent items, n = 0, 1, 2, …, p; when a higher-level frequent item is generated, the lower-level frequent items it contains are removed from F.
Specifically, the data frame captured in step S1.1 is a data frame that has already been subjected to framing processing. Statistics are performed from the ith byte of the frame header, i.e. according to the position information.
First-level frequent extraction is a special case. As shown in fig. 2, it uses three thresholds: t_{f1}, t_r and t_l. Here t_{f1} is the frequency threshold: a value k at a byte position is considered frequent when its frequency exceeds this threshold. t_r is the remaining-frame stop condition: when processing reaches the i-th byte, the loop stops if the number of frames that still have an i-th byte is less than t_r. t_l is the frame-length stop condition: when i > t_l, the loop stops.
A first-level value k has only 256 possibilities, from 0 to 255. The number of occurrences of each of the 256 values is counted; denoting the number of occurrences of value k by s_k and the total number of frames by u, when
$$ s_k > u \cdot t_{f1} $$
is satisfied, the value k is considered frequent.
The specific implementation of the examples is as follows:
With the preset number of frames u = 3030, p = 2, t_{f1} = 0.04, t_l = 40 and t_r = 50, first-level frequent extraction is performed on the frame set I, taking i = 4 as an example. When processing reaches the 4th byte, the number of frames that still have a value in the 4th byte is c = 3030. First, c > t_r and 4 < t_l are both satisfied. The occurrences of all 256 values of the 4th byte are then counted. With t_{f1} = 0.04, the minimum count is u · t_{f1} ≈ 121 frames. Let s_k be the number of frames whose 4th byte has the value k; for example, s_k = 2424 when k = 17 and s_k = 1818 when k = 56. Since 2424 and 1818 both exceed 121, the values 17 and 56 both meet the frequency threshold. k ranges over the 256 values of a byte, increasing from 0 to 255, and each value is evaluated in turn. Finally, if only the two values k = 17 and k = 56 among the 256 satisfy
$$ s_k > u \cdot t_{f1} $$
then the two first-level frequent items mined at this position, 4-7 and 4-56, are added in turn to the frequent item set F; if no value satisfies the condition, the position contributes no frequent item. i is then increased step by step until the loop over i stops.
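The first-level extraction loop described above can be sketched as follows. This is only an illustration under stated assumptions: frames are Python `bytes` objects, positions are 1-based to match the "4th byte" wording, and the function name `first_level_frequent` is hypothetical.

```python
from collections import Counter


def first_level_frequent(frames, t_f1=0.04, t_r=50, t_l=40):
    """Return first-level frequent items as (position, value) pairs.

    t_f1: frequency threshold; t_r: remaining-frame stop condition;
    t_l: frame-length stop condition (names follow the text).
    """
    u = len(frames)
    min_count = u * t_f1            # minimum number of frames, e.g. 3030 * 0.04
    frequent = []
    i = 1
    while i <= t_l:                 # stop once i > t_l
        values = [f[i - 1] for f in frames if len(f) >= i]
        if len(values) < t_r:       # too few frames still have an i-th byte
            break
        counts = Counter(values)
        frequent.extend((i, k) for k in sorted(counts) if counts[k] > min_count)
        i += 1
    return frequent


# Synthetic frames: byte 1 is 17 in 60 frames and 56 in 40 frames; byte 2 varies.
frames = [bytes([17, n]) for n in range(60)] + [bytes([56, n]) for n in range(40)]
items = first_level_frequent(frames, t_f1=0.04, t_r=50, t_l=40)
```

In this synthetic set only the first byte position yields frequent items, because no value of the second byte occurs in more than 4 of the 100 frames.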
When p-level frequent items are generated, the (p-1)-level frequent items serve as the candidate set, and higher-level frequent items are generated and tested according to the Apriori principle. Compared with first-level extraction, p-level extraction additionally requires the thresholds t_c and t_{fp}. First, a pair of adjacent frequent items x and y is taken, where x is drawn from A_{im}, with m ranging from 0 to the number of frequent items of the i-th byte, and y is drawn from A_{(i+1)n}, with n ranging from 0 to the number of frequent items of the (i+1)-th byte. t_c bounds the difference in frequency between the two adjacent frequent items: when |P(x) - P(y)| < t_c, x and y are considered to form a candidate p-level frequent item A_{im,n}. The frequency P(A_{im,n}) is then computed and compared with the p-level frequency threshold t_{fp}; a p-level frequent item that meets the threshold is recorded in F, and the two (p-1)-level frequent items that form it are removed from F. Computing P(A_{im,n}) requires the AC algorithm for multi-pattern matching. After all possible combinations A_{im,n} for the current value of i have been processed, i is increased by 1 and the next byte position is processed.
The specific implementation of the examples is as follows:
With the preset number of frames u = 3030, p = 2 and i = 4, set t_c = 0.1 and t_{f2} = 1.5 × 10^{-4}. Second-level frequent items then need to be generated, as shown in fig. 3. All frequent items are taken from the first-level frequent items at i = 4, for example 4-7 and 4-56, and all frequent items are taken from those at i = 5, for example the three items 5-128, 5-250 and 5-18. The two adjacent groups are then fully combined, giving 6 combinations: 4-7,128; 4-7,250; 4-7,18; 4-56,128; 4-56,250; 4-56,18, where the last two numbers are the frequent values of the two consecutive bytes. The frequency difference of the two items in each combination is compared; the combinations satisfying |P(A_{4p}) - P(A_{5q})| < 0.1 are the 4 combinations 4-7,128; 4-7,18; 4-56,250; 4-56,18. For these combinations, the frequencies with which the value pairs 7,128; 7,18; 56,250; 56,18 appear in the 4th and 5th bytes are computed and each is compared with 1.5 × 10^{-4}. The combination that satisfies the condition, 4-56,18, is recorded in the frequent item set F. Since a higher-level frequent item has now been generated, the two lower-level frequent items 4-56 and 5-18 are removed from F.
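The second-level step can be sketched as below. The sketch makes several simplifying assumptions that are not in the text: first-level items are supplied as a hypothetical dictionary from (position, value) to frequency, the joint frequency is obtained by a direct scan of the frames rather than by AC multi-pattern matching, and the function name is invented.

```python
def second_level_frequent(frames, level1, t_c=0.1, t_f2=1.5e-4):
    """Combine adjacent first-level items into second-level items.

    level1: dict mapping (position, value) -> frequency, positions 1-based.
    Returns (remaining first-level items, second-level items).
    """
    u = len(frames)
    kept = dict(level1)
    level2 = {}
    for (i, x), px in level1.items():
        for (j, y), py in level1.items():
            # Only adjacent byte positions with similar frequency are candidates.
            if j != i + 1 or abs(px - py) >= t_c:
                continue
            joint = sum(
                1 for f in frames
                if len(f) >= j and f[i - 1] == x and f[j - 1] == y
            ) / u
            if joint > t_f2:
                level2[(i, (x, y))] = joint
                # The two lower-level items are removed once merged.
                kept.pop((i, x), None)
                kept.pop((j, y), None)
    return kept, level2


frames = [bytes([7, 128])] * 6 + [bytes([56, 18])] * 4
level1 = {(1, 7): 0.6, (2, 128): 0.6, (1, 56): 0.4, (2, 18): 0.4}
kept, level2 = second_level_frequent(frames, level1)
```

Here the pairs (7, 18) and (56, 128) are rejected by the t_c test before any joint frequency is computed, which is the pruning role the text assigns to t_c.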
In one embodiment, step S2 specifically includes:
step S2.1: detecting the incidence relation of the constructed frequent item set, and constructing a coarse relation table;
step S2.2: and filtering the contents in the coarse relation table.
In one embodiment, step S2.1 specifically includes:
finding any two frequent items A_{im} and A_{jn} that meet the minimum confidence and are located earlier and later in the frame, respectively, i.e. satisfying i < j;
judging the confidence between the two frequent items and the reverse confidence; when the confidence is greater than a threshold t_{rl} and the reverse confidence is less than t_e, recording the relation A_{im} -> A_{jn}, where the frequent item on the left of the symbol -> is called the antecedent and the frequent item on the right is called the consequent;
after all frequent items have been traversed, all recorded relations form the coarse relation table R.
Specifically, the frequent item set F obtained in step S1 is a csv table containing p-level frequent items, p = 0, 1, 2, …, n. Each row of the table represents one frequent item and records its byte position i and its content A_{ij}. The set T formed by all nodes of the relation tree satisfies
Figure BDA0002102777490000091
First, any two frequent items A_{im} and A_{jn} that meet the minimum confidence and lie earlier and later in the frame, respectively, are found; by the definition in step S1 they satisfy i < j. The confidence between the two and the reverse confidence are checked at the same time. When the confidence is greater than a threshold t_{rl} and the reverse confidence is less than t_e, the relation A_{im} -> A_{jn} is recorded, where the frequent item on the left of the symbol -> is called the antecedent and the one on the right the consequent. After all frequent items have been traversed, the recorded relations form the coarse relation table R.
The two frequent items A_{im} and A_{jn} are both taken from the frequent item set F by a two-level loop: the outer loop takes A_{im} from F and the inner loop takes A_{jn} from F. Both t_{rl} and t_e are values close to 1. Let the frequent items in F be denoted f_k, k = 0, 1, 2, …, n, where n, unlike its earlier use, is the number of frequent items in F. If the item A_{im} taken by the outer loop is f_p and the item A_{jn} taken by the inner loop is f_q, then p < q. The confidence condition is P(A_{im}|A_{jn}) > t_{rl} and the reverse confidence condition is P(A_{jn}|A_{im}) < t_e; requiring the reverse confidence condition eliminates the interference of equivalent items. Relations that satisfy both thresholds are output to the coarse relation table R. After a pair has been processed, n is increased; once n has taken its last value, j is increased, until A_{jn} reaches the last item in F. A_{im} then advances to its next item: m is increased first, and once m has taken its last value, i is increased, until A_{im} reaches the penultimate item in F. The output of the coarse relation table R then ends and construction of the relation tree T begins. The whole process, including construction of the coarse relation table R, is shown in fig. 4.
The specific implementation of the examples is as follows:
Set t_{rl} = 0.9 and t_e = 0.95. For example, one item 2-15 is first taken from the frequent item set F, and then the next item 2-25 is taken in order. The position i of 2-25 is 2, which is not greater than the position 2 of the first item, so this item is skipped. The next item is 7-40, whose position is 7; since 7 > 2, the flow continues. If P(2-25|7-40) > 0.9 and P(7-40|2-25) < 0.95, the relation is part of the relation tree T, so the relation 2-25 -> 7-40 is stored in the coarse relation table R. If P(2-25|7-40) > 0.9 and P(7-40|2-25) > 0.95 hold at the same time, 2-25 and 7-40 are equivalent items and the relation does not need to be recorded into T. If P(2-25|7-40) < 0.9, the relation is discarded directly. The next item 7-42 is then taken from F; since P(2-25|7-42) < 0.9, the relation 2-25 -> 7-42 is discarded directly. A_{jn} continues to be taken and checked in turn; after all frequent items in F have been checked, A_{im} advances to the next item 2-25. The step ends once A_{im} has taken all items in the frequent item set.
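The two-level loop over F can be sketched as follows. The sketch is an assumption-laden illustration: it handles only single-byte items written as (position, value) pairs with 1-based positions, estimates the conditional probabilities by counting frames, and uses hypothetical function names. The confidence follows the definition given in the text, P(A_{im}|A_{jn}), and the reverse confidence is P(A_{jn}|A_{im}).

```python
def has_item(frame, item):
    """True when the frame carries the given value at the given 1-based position."""
    pos, value = item
    return len(frame) >= pos and frame[pos - 1] == value


def conditional(frames, a, b):
    """Estimate P(a | b): share of frames containing b that also contain a."""
    with_b = [f for f in frames if has_item(f, b)]
    if not with_b:
        return 0.0
    return sum(has_item(f, a) for f in with_b) / len(with_b)


def coarse_relations(frames, items, t_rl=0.9, t_e=0.95):
    """Return rows (antecedent, consequent, reverse confidence) of the table R."""
    table = []
    ordered = sorted(items)
    for a in ordered:                       # outer loop: A_im
        for b in ordered:                   # inner loop: A_jn
            if a[0] >= b[0]:                # require i < j
                continue
            conf = conditional(frames, a, b)    # P(A_im | A_jn)
            rev = conditional(frames, b, a)     # P(A_jn | A_im)
            if conf > t_rl and rev < t_e:       # rev >= t_e marks equivalent items
                table.append((a, b, rev))
    return table


# Byte 1 is always 15; byte 2 is 40 in 6 frames and 42 in 4 frames.
frames = [bytes([15, 40])] * 6 + [bytes([15, 42])] * 4
table = coarse_relations(frames, [(1, 15), (2, 40), (2, 42)])
```

The reverse confidence is stored with each row because the later filtering step compares it when one consequent has several antecedents.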
In one embodiment, step S2.2 specifically includes:
sequentially checking all relations in the coarse relation table; when one consequent has several antecedents, retaining only the relation with the largest reverse confidence and rejecting the others;
sorting the remaining relations by the byte position i of their antecedent A_{im}, and finally constructing a relation tree T containing s nodes.
Specifically, after the coarse relation table R is obtained, its contents need to be filtered. All relations in the coarse relation table are checked in turn; whenever one consequent has several antecedents, only the relation with the largest reverse confidence is retained and the others are removed. The remaining relations are sorted by the byte position i of their antecedent A_{im}, and a relation tree T containing s nodes is finally constructed. If several relation trees are obtained, they are placed under the same root node.
In a specific implementation, each record of the coarse relation table consists of three items: a predecessor node A of B, a successor node B of A, and the reverse confidence P(B | A) between them, where A and B form a relation A->B. After the above steps, some nodes may have two predecessor nodes. For example, the records in R may include the two relations A->B and C->B. When the relation tree is constructed, following the principle of maximum reverse confidence, the relation with the larger of P(B | A) and P(B | C) is retained and the other is discarded, so that a relation tree T without cycles is formed.
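The filtering rule can be illustrated with a short Python sketch. It assumes, hypothetically, that each record of R is a tuple `(antecedent, consequent, reverse_confidence)` and that items are `(position, value)` pairs; neither representation is prescribed by the patent.

```python
def filter_relations(table):
    """Keep, for every consequent, only the relation with the largest
    reverse confidence, then sort by the antecedent's byte position."""
    best = {}
    for a, b, rev in table:
        if b not in best or rev > best[b][2]:
            best[b] = (a, b, rev)
    return sorted(best.values(), key=lambda rel: rel[0][0])


# Two relations compete for the same consequent 17-32.
coarse = [((2, 16), (17, 32), 0.42), ((4, 250), (17, 32), 0.6)]
kept = filter_relations(coarse)
```

Because every node ends up with at most one predecessor, and an antecedent always lies at a smaller byte position than its consequent, the surviving relations cannot form a cycle.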
After the coarse relation table R is generated, construction of the relation tree T begins. Each node in T has exactly one predecessor node but may have several successor nodes, so the successors of a node are held in a successor node table. First the Root of T is initialized, and then the records r_k in R are read in sequence, where k = 0, 1, …, n and n is determined by the number of records in R. The antecedent of r_k is r_k(a) and the consequent is r_k(b). The first record taken from R is handled differently from the others.
When the first record r_1 is taken out of R, the successor node r_1(a) is first added under the root node, and the successor node r_1(b) is then added under node r_1(a). A specific example is as follows:
after the Root is constructed, suppose the first record r_1 read from the table R is 2-16->17-32 with P(b | a) = 0.42. The reverse confidence does not need to be considered when the first relation is read, so it is only necessary to add 2-16 to the successor node table of Root and then add 17-32 to the successor node table of 2-16.
When the remaining records r_k are added, the procedure differs in detail. When r_k(a) is obtained, all nodes n_s in T are first traversed, where s = 0, 1, 2, …, m and m is the number of nodes currently in T. If some n_s = r_k(a), T is then checked for r_k(b). If r_k(b) does not exist, r_k(b) is added to the successor node table of n_s. If r_k(b) exists, its predecessor node is denoted bb, P(r_k(b) | r_k(a)) is compared with P(r_k(b) | bb), and the relation with the larger value is retained; the other relation is removed from T, that is, r_k(b) is deleted from the successor node table of the predecessor that lost the comparison. Returning to the first check, if no n_s = r_k(a) exists, r_k(a) is added directly under the Root node; then, with bb again denoting the predecessor node of r_k(b), P(r_k(b) | r_k(a)) is compared with P(r_k(b) | bb) and the larger relation is retained, the rest being the same as above. The whole flow is shown in fig. 5.
A specific example is as follows:
suppose the relation tree already contains three nodes, Root, 2-16 and 17-32, where 2-16 is the predecessor node of 17-32. The second relation is now read from R; assume it is 4-250->17-32 with P(17-32 | 4-250) = 0.6. The antecedent 4-250 of the relation is considered first. Since the three existing nodes of T do not include 4-250, 4-250 is added to the successor node table of Root. The consequent 17-32 is considered next. A search of T shows that T already contains 17-32, so the two confidences must be compared to decide which branch to keep. When P(17-32 | 2-16) > P(17-32 | 4-250), the original branch of T is kept, no operation is performed, and the next relation is read. Otherwise, node 17-32 is deleted from the successor node table of 2-16 and added to the successor node table of 4-250, whose own predecessor node is Root. The third relation is then read from R and the operation is repeated.
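The incremental insertion procedure can be sketched as a small Python class. This is an illustration rather than the patented implementation: the class and attribute names are hypothetical, items are modelled as `(position, value)` pairs, and a node hanging directly under Root is assumed to carry a reverse confidence of 0 so that any real relation can claim it.

```python
ROOT = "Root"


class RelationTree:
    """Each node has one predecessor and a table of successors."""

    def __init__(self):
        self.parent = {}             # node -> predecessor node
        self.children = {ROOT: []}   # node -> successor node table
        self.conf = {}               # node -> P(node | predecessor)

    def _attach(self, node, parent, conf):
        old = self.parent.get(node)
        if old is not None:          # detach from the losing predecessor
            self.children[old].remove(node)
        self.parent[node] = parent
        self.conf[node] = conf
        self.children.setdefault(node, [])
        self.children[parent].append(node)

    def insert(self, a, b, rev_conf):
        """Insert relation a->b with reverse confidence P(b | a)."""
        if a not in self.parent:     # unknown antecedent goes under Root
            self._attach(a, ROOT, 0.0)
        if b not in self.parent or rev_conf > self.conf[b]:
            self._attach(b, a, rev_conf)   # new node, or the better branch
        # otherwise the existing branch is kept and nothing changes


tree = RelationTree()
tree.insert((2, 16), (17, 32), 0.42)    # first record of R
tree.insert((4, 250), (17, 32), 0.6)    # competes for consequent 17-32
```

With the confidences used here (0.42 and 0.6), the second relation wins, so 17-32 moves from the successor table of 2-16 to that of 4-250.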
In one embodiment, the method comprises:
when more than one relation tree is finally constructed, placing the obtained relation trees under the same root node.
In one embodiment, step S3 specifically includes:
step S3.1: encoding the nodes of the relation tree T as n_1, n_2, …, n_s;
step S3.2: representing each frame in the frame set I, according to the presence or absence of each node, by 0s and 1s forming an s-dimensional Boolean vector, where the value of the k-th coordinate indicates whether n_k is present in the frame, and finally obtaining from the frame set I a set I_v of vectors.
Specifically, suppose there are 3 nodes under the relation tree T besides Root: 2-16, 4-250 and 17-32. The data frames frame_i are read in sequence from the frame set I, where i = 0, 1, …, u and u is the total number of data frames. Suppose a frame frame_i is read whose value is 16 at byte 2, 320 at byte 4 and 0 at byte 17; this frame is converted to the Boolean vector (1,0,0). If another frame frame_j, with i ≠ j, has the value 7 at byte 2 and 255 at byte 4, and its length does not reach 18 so that byte 17 does not exist, the frame can only be converted to the Boolean vector (0,0,0). The resulting vectors are added to the set I_v of Boolean vectors.
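The conversion can be sketched in a few lines of Python. The sketch assumes 0-based byte positions and models frames as lists of integers so that the values quoted in the example can be used as they stand; the function name is hypothetical.

```python
def frame_to_vector(frame, nodes):
    """Return one 0/1 coordinate per tree node: 1 if the frame is long
    enough and holds the node's value at the node's byte position."""
    return tuple(int(pos < len(frame) and frame[pos] == val)
                 for pos, val in nodes)


nodes = [(2, 16), (4, 250), (17, 32)]   # n_1, n_2, n_3 from the example

frame_i = [0] * 18                      # long enough to have a byte 17
frame_i[2], frame_i[4] = 16, 320        # values copied from the example
frame_j = [0, 0, 7, 0, 255]             # too short to have a byte 17

vectors = [frame_to_vector(f, nodes) for f in (frame_i, frame_j)]
```

A missing byte is treated the same way as a mismatching value, which is how the short frame ends up as the all-zero vector.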
In one embodiment, step S4 specifically includes:
step S4.1: measuring the distance between any two vectors v_i and v_j in the vector set I_v by the Jaccard distance, with each sample initially forming an independent cluster, and performing hierarchical clustering to obtain a dendrogram structure;
step S4.2: dividing the categories in turn according to the set number of clusters, where each resulting category corresponds to an actual category of data frames.
Specifically, hierarchical clustering is performed on the vectors in I_v. At the beginning, I_v is regarded as u clusters, each vector forming a separate cluster, and the distance between every pair of clusters is calculated. In hierarchical clustering, the group average method may be used to calculate the distance between clusters. It operates as follows:
all samples in the two clusters are paired one by one and the distance of each pair is calculated. For example, if clusters A and B contain 2 and 3 samples respectively, a total of 6 distances must be calculated. These 6 distances are averaged, and the average d is the distance between clusters A and B. The distance between individual samples is measured by the Jaccard distance.
A specific example is as follows:
if cluster A contains the two samples (0,0,1) and (0,1,1) and cluster B contains the two samples (1,1,0) and (1,0,0), then 4 distances need to be calculated first: between (0,0,1) and (1,1,0); (0,0,1) and (1,0,0); (0,1,1) and (1,1,0); and (0,1,1) and (1,0,0). By the definition of the Jaccard distance, these four distances are 1, 1, 0.667 and 1, so the average distance between the two clusters is 0.917.
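The arithmetic of this example can be reproduced with a short Python sketch. The function names are hypothetical, and the distance between two all-zero vectors is assumed to be 0, a case the text does not address.

```python
def jaccard_distance(u, v):
    """1 - |intersection| / |union| over the coordinates set to 1."""
    union = sum(1 for a, b in zip(u, v) if a or b)
    if union == 0:
        return 0.0      # assumption: two all-zero vectors are identical
    inter = sum(1 for a, b in zip(u, v) if a and b)
    return 1 - inter / union


def group_average(cluster_a, cluster_b):
    """Mean Jaccard distance over all cross-cluster sample pairs."""
    total = sum(jaccard_distance(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


A = [(0, 0, 1), (0, 1, 1)]
B = [(1, 1, 0), (1, 0, 0)]
pairwise = [jaccard_distance(a, b) for a in A for b in B]
d = group_average(A, B)
```

Here `pairwise` holds the four sample distances in the order listed in the text, and `d` is their mean.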
After the distances between every two clusters are calculated, the minimum of all the distances is found and the two corresponding clusters are merged into one; the distances between the clusters are then recalculated. These steps are repeated until the number of remaining clusters is k, at which point the loop terminates, where k is the set number of clusters. The specific flow is shown in fig. 6.
A specific example is as follows:
suppose there are four clusters A, B, C and D, with k = 2. The distance between A and B is 0.6, between A and C 0.8, between B and C 0.7, between A and D 0.9, between B and D 0.74, and between C and D 0.3. The samples in clusters C and D are therefore merged into a new cluster, denoted E in this embodiment. Three clusters remain at this point, so merging must continue, and the distances between the clusters are recalculated. Suppose that after recalculation the distances are 0.6 for AB, 0.83 for AE and 0.73 for BE. The distance between A and B is now the smallest, so A and B are merged. Only two clusters are left, which satisfies the k value, and the result can be output.
When the number of clusters reaches the k value, the divided clusters are output. The vectors in the resulting clusters correspond one-to-one to the actual frames, so the cluster analysis of these vectors can also be regarded as a cluster analysis of the data frames.
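The merge loop can be sketched in Python as below. As a simplifying assumption, the clusters A to D of the example are treated as single samples with the stated pairwise distances, so the recalculated group-average distances differ slightly from the illustrative figures quoted in the text, although the merge order is the same. All names are hypothetical.

```python
from itertools import combinations


def group_average(c1, c2, dist):
    """Mean distance over all cross-cluster sample pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(samples, k, dist):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[s] for s in samples]        # every sample starts alone
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], dist),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


# Pairwise distances from the example, used as sample-level distances.
PAIRS = {("A", "B"): 0.6, ("A", "C"): 0.8, ("B", "C"): 0.7,
         ("A", "D"): 0.9, ("B", "D"): 0.74, ("C", "D"): 0.3}


def table_dist(x, y):
    return PAIRS[tuple(sorted((x, y)))]


result = agglomerate(["A", "B", "C", "D"], 2, table_dist)
```

Recomputing every pairwise distance on each pass keeps the sketch short; a practical implementation would cache the distance matrix and update only the rows affected by a merge.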
In one embodiment, the method further comprises:
analyzing the independence of the frames of each cluster to restore the hierarchical architecture of the protocol.
Specifically, hierarchical clustering merges only two clusters at a time, so for a given value of k one cluster may contain frames of several different categories. The independence of the clusters then needs to be analyzed by considering, in the clustering result tree, the difference between the distance at which the target cluster was formed and the distance at which it is next merged, as shown in fig. 7. A cluster formed by merging two sub-clusters is called the parent cluster of those two sub-clusters, and the formation distance d_f of a cluster is defined as the distance between its two sub-clusters. It is then only necessary to obtain the difference between the formation distance of the cluster's parent cluster and that of the cluster, d_f(parent) - d_f(cluster), and the larger of the differences between the formation distance of the cluster and those of its sub-clusters, max(d_f(cluster) - d_f(sub-cluster)). When the ratio (d_f(parent) - d_f(cluster)) / max(d_f(cluster) - d_f(sub-cluster)) is less than t_ra, the current cluster is considered not to be independent.
A specific example is as follows:
suppose clusters A and B merge to form cluster D, and clusters C and D merge to form cluster E, with t_ra = 0.2. Let the formation distances of clusters A, B, C, D and E be 0.4, 0.3, 0.2, 0.6 and 0.8 respectively; the independence analysis is performed on cluster D.
The difference between the formation distances of D and its parent cluster E is 0.2, while the larger difference between the formation distances of D and its sub-clusters is that between D and B, which is 0.3. The ratio of the two is therefore 0.667, which is greater than t_ra, so cluster D can be considered to represent an independent frame type. If cluster D needs to be analyzed in more depth, the frames of cluster D are taken as the set I again and the analysis returns to step 1.
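The independence check for cluster D can be written out as a brief Python sketch. The function names are hypothetical, and returning infinity when the cluster's formation distance does not exceed that of any sub-cluster is an added assumption to avoid dividing by zero.

```python
def independence_ratio(d_parent, d_cluster, d_subclusters):
    """Gap up to the parent cluster divided by the larger gap down to
    the sub-clusters, all expressed as formation distances d_f."""
    up = d_parent - d_cluster
    down = max(d_cluster - d for d in d_subclusters)
    if down <= 0:
        return float("inf")     # assumption: degenerate case, no gap below
    return up / down


def is_independent(d_parent, d_cluster, d_subclusters, t_ra=0.2):
    """A cluster is not independent when the ratio falls below t_ra."""
    return independence_ratio(d_parent, d_cluster, d_subclusters) >= t_ra


# Cluster D: parent E formed at 0.8, D at 0.6, sub-clusters A and B at 0.4 and 0.3.
ratio = independence_ratio(0.8, 0.6, [0.4, 0.3])
independent = is_independent(0.8, 0.6, [0.4, 0.3], t_ra=0.2)
```

A ratio close to zero would mean the cluster is absorbed by its parent almost immediately after forming, which is the situation the threshold t_ra is meant to flag.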
Generally speaking, the method of the invention fully considers the position information of the inherent fields of different protocols in the frames, proposes the concept of the relation tree, finds the key fields that make up different protocol types by constructing the relation tree, and then classifies frames of different protocol types with a hierarchical clustering algorithm based on the Jaccard distance. Taking position information into account makes the classification result more accurate, and the use of hierarchical clustering makes it possible to restore the hierarchical structure of the protocol.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (8)

1. A classification method for unknown bit stream protocols, comprising:
step S1: taking a bit stream which is preset to frequently appear at a specific position as a characteristic of a protocol, finding out frequently appearing information at all positions of a data frame, and constructing a frequent item set;
step S2: detecting the incidence relation of the constructed frequent item set, and filtering frequent items which do not meet preset conditions;
step S3: converting different data frames into expression of Boolean vectors according to the constructed frequent item set and the detected incidence relation, wherein one Boolean vector corresponds to one data frame;
step S4: carrying out hierarchical clustering on the converted data frames;
wherein, step S1 specifically includes:
step S1.1: aligning the left sides of a set I formed by all data frames captured from a network environment in sequence;
step S1.2: counting all frames from the ith byte of the frame header, and screening out the frequently occurring bytes;
step S1.3: judging whether the frequency of occurrence of a value k of the i-th byte exceeds a threshold value; if so, the value k of the i-th byte forms a frequent item, and the obtained frequent item is denoted A_ij, where j indicates the j-th frequent item of the i-th byte that reaches the threshold standard t_f, the frequent items comprising first-level frequent items and n-level frequent items;
step S1.4: finally recording all the frequent items as a set F, the set F containing the information that appears frequently at all positions, including first-level and n-level frequent items, where n = 0, 1, 2, …, p, and, when a higher-level frequent item is generated, removing the content in the previous-level frequent item.
2. The method according to claim 1, wherein step S2 specifically comprises:
step S2.1: detecting the incidence relation of the constructed frequent item set, and constructing a coarse relation table;
step S2.2: and filtering the contents in the coarse relation table.
3. The method according to claim 2, characterized in that step S2.1 comprises in particular:
finding any two frequent items A_im and A_jn that satisfy the minimum confidence and are located one before the other in the frame, satisfying i < j;
judging the confidence between the two frequent items and the reverse confidence, and, when the confidence is greater than a threshold t_rl and the reverse confidence is less than t_e, recording the relation A_im->A_jn, where the frequent item on the left side of the symbol -> is called the antecedent and the frequent item on the right side is called the consequent;
after all the frequent items are taken out, all the relations form a rough relation table R.
4. The method according to claim 2, characterized in that step S2.2 comprises in particular:
sequentially checking all relations in the coarse relation table and, when one consequent has multiple antecedents, retaining only the relation with the maximum reverse confidence and rejecting the other relations;
sorting the remaining relations by the byte position i of their frequent items A_im, and finally constructing a relation tree T containing s nodes.
5. The method of claim 4, wherein the method comprises:
when more than one relation tree is finally constructed, placing the obtained relation trees under the same root node.
6. The method according to claim 1, wherein step S3 specifically comprises:
step S3.1: encoding the nodes of the relation tree T as n_1, n_2, …, n_s;
step S3.2: representing each frame in the frame set I, according to the presence or absence of each node, by 0s and 1s forming an s-dimensional Boolean vector, where the value of the k-th coordinate indicates whether n_k is present in the frame, and finally obtaining from the frame set I a set I_v of vectors.
7. The method according to claim 1, wherein step S4 specifically comprises:
step S4.1: measuring the distance between any two vectors v_i and v_j in the vector set I_v by the Jaccard distance, with each sample initially forming an independent cluster, and performing hierarchical clustering to obtain a dendrogram structure;
step S4.2: dividing the categories in turn according to the set number of clusters, where each resulting category corresponds to an actual category of data frames.
8. The method of claim 7, wherein the method further comprises:
analyzing the independence of the frames of each cluster to restore the hierarchical architecture of the protocol.
CN201910541785.6A 2019-06-21 2019-06-21 Classification method for unknown bit stream protocol Active CN110457465B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910541785.6A CN110457465B (en) 2019-06-21 2019-06-21 Classification method for unknown bit stream protocol

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910541785.6A CN110457465B (en) 2019-06-21 2019-06-21 Classification method for unknown bit stream protocol

Publications (2)

Publication Number Publication Date
CN110457465A CN110457465A (en) 2019-11-15
CN110457465B true CN110457465B (en) 2022-04-26

Family

ID=68480680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910541785.6A Active CN110457465B (en) 2019-06-21 2019-06-21 Classification method for unknown bit stream protocol

Country Status (1)

Country Link
CN (1) CN110457465B (en)

Also Published As

Publication number Publication date
CN110457465A (en) 2019-11-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant