CN101282251A - A Method for Mining Application Layer Protocol Identification Features

Info

Publication number: CN101282251A
Application number: CNA2008101060589A
Authority: CN
Inventors: 刘兴彬; 杨建华; 胡玥; 谢高岗
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2008-05-08
Filing date: 2008-05-08
Publication date: 2008-10-08
Anticipated expiration: 2028-05-08
Also published as: CN101282251B

Abstract

The invention discloses a method for mining application layer protocol identification features. The method comprises the following steps: step A, filtering the training packet set for the first time, encoding it, and extracting quasi-protocol identification feature data; step B, performing a first round of mining on the extracted quasi-protocol identification feature data to obtain multi-level frequent itemsets; step C, filtering the multi-level frequent itemsets a first time, correcting the frequency of the frequent itemsets that remain and mining them a second time, and then filtering them a second time to obtain the final protocol identification features; step D, if the byte identification rate of all final protocol identification features, or the sum of their packet identification rates, meets the requirement, the data of the second and subsequent packets is not mined; otherwise the second and subsequent packets are mined in a loop until the total identification rate meets the requirement. The method analyzes and mines a packet set and can extract all identification features of the corresponding application layer protocol, which greatly improves feature extraction efficiency and the overall identification rate.

Description

A Method for Mining Application Layer Protocol Identification Features

Technical Field

The invention relates to the technical field of computer network traffic monitoring and analysis, and in particular to a method for mining application layer protocol identification features.

Background Art

Application layer traffic identification is crucial to network planning, network management, traffic engineering, and security detection. Traditional application layer identification relies mainly on the protocol ports that the Internet Assigned Numbers Authority (IANA) assigns to applications. However, in order to hide traffic or for other reasons, many applications now use dynamic ports or encrypt their packet payloads, which poses a great challenge to traditional identification methods.

To solve this problem, identification based on Deep Packet Inspection (DPI) has been proposed. This approach finds characteristic strings in packets, assembles them into a library of application layer protocol identification features, and identifies traffic by feature matching.

The approach presupposes, however, that the application layer features of a protocol have been found correctly. The accuracy of these identification features strongly affects the identification rate, the accuracy rate, and the misidentification rate.

Currently there are two main methods for extracting application layer protocol features:

First, find the document that defines the application layer protocol and derive the protocol's application layer features from it. However, many new application layer protocols are proprietary, and their definition documents cannot be obtained; PPlive is one example. In addition, even when a protocol is public, frequent protocol updates make it very difficult to keep the identification features current.

Second, capture the packets of the protocol's communication with packet capture tools such as Wireshark or tcpdump, then manually inspect and compare the packets of each flow to discover the protocol's application layer features. This method, however, is inefficient and unreliable.

As new applications keep emerging, accurate application layer traffic identification requires continually finding and updating the application layer features of each protocol. Because no better mining method is available, manual analysis remains the most commonly used way of mining application layer protocol identification features. To improve the accuracy of their feature libraries, many vendors of network monitoring and analysis products maintain a dedicated team that regularly tracks applications, including packet capture and manual protocol analysis.

However, mining application layer protocol identification features by manual analysis has very low feature extraction efficiency and cannot keep up with the constant emergence of new application layer protocols and the frequent upgrading of existing ones. Moreover, the identification features found by manual observation are usually incomplete, which leads to low overall identification efficiency.

Summary of the Invention

The purpose of the invention is to provide a method for mining application layer protocol identification features that analyzes and mines a packet set fully automatically, can extract all identification features of the corresponding application layer protocol, and greatly improves feature extraction efficiency and the overall identification rate.

To achieve this purpose, the invention provides a method for mining application layer protocol identification features, comprising the following steps:

Step A, filtering the training packet set for the first time, encoding it, and extracting quasi-protocol identification feature data;

Step B, performing a first round of mining on the extracted quasi-protocol identification feature data to obtain multi-level frequent itemsets;

Step C, filtering the multi-level frequent itemsets a first time, correcting the frequency of the multi-level frequent itemsets that remain after the first filtering and mining them a second time, and then filtering them a second time to obtain the final protocol identification features.

The method for mining application layer protocol identification features further comprises the following step:

Step D, if the byte identification rate of all final protocol identification features, or the sum of their packet identification rates, meets the requirement, the data of the second and subsequent packets is not mined; otherwise the second and subsequent packets are mined in a loop until the total identification rate meets the requirement.

Step A comprises the following steps:

A1. capturing the training packet set, dividing it by flow, and storing it in flow structures;

A2. filtering the mixed traffic out of the training packet set using the mixed traffic filtering method;

A3. encoding the application layer payload by encoding the bytes of the extracted packets according to their position;

A4. extracting the quasi-protocol identification feature data from the encoded data.

In step A2, the mixed traffic filtering method comprises the following steps:

A21. filtering out the content of flows that conform to the HTTP protocol or the FTP protocol;

A22. filtering out TCP flows that lack a complete three-way handshake.
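Step A22 can be illustrated with a minimal sketch. It assumes, beyond what the text states, that a flow structure holds the packets of both directions in arrival order, that the handshake occupies the first three packets, and that each packet's TCP flag byte is available as an integer. The function name `has_complete_handshake` is hypothetical.

```python
SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit


def has_complete_handshake(flags):
    """Return True if the first three packets look like SYN, SYN+ACK, ACK.

    `flags` is the list of TCP flag bytes of a flow's packets in arrival
    order (an assumed representation, not taken from the patent text).
    """
    if len(flags) < 3:
        return False
    mask = SYN | ACK
    return (
        (flags[0] & mask) == SYN
        and (flags[1] & mask) == mask
        and (flags[2] & mask) == ACK
    )
```

A flow for which this check fails would be dropped before encoding, since its payload offsets cannot be trusted to start at the beginning of the conversation.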

In step A21, filtering out the content of flows that conform to the FTP protocol comprises the following steps:

filtering out the packets of flow structures that communicate in PASV mode;

filtering out the packets of flow structures that use port 20 or port 21.

The method for determining which flow structures communicate in FTP PASV mode comprises the following steps:

finding the flow structures that use port 21 and checking whether any packet belonging to such a structure begins with 227;

if so, further checking whether the packet is a PASV mode reply; if it is, the packet contains the IP address and port number on which the server will accept the PASV mode data connection from the client, and the destination IP address of the packet is the client's IP address; FTP data connections use the TCP protocol; these four values are recorded;

after all flow structures that use port 21 have been traversed, the flow information of all FTP PASV mode communication has been obtained.

Filtering out the packets of flow structures that communicate in PASV mode comprises the following step:

comparing each flow with the recorded flow information of the FTP PASV mode data connections; if four elements of the five-tuple in the flow structure are identical to the recorded flow information of a PASV mode data connection, the flow is deemed to be an FTP PASV mode flow and all packets in it are discarded.
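The PASV detection and matching described above can be sketched as follows. The reply format `227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)` and the port computation `p1 * 256 + p2` follow the FTP specification (RFC 959); the function names, the record layout, and the choice to match a data flow in either direction are assumptions made for illustration.

```python
import re

# Six comma-separated decimal numbers somewhere after the "227 " reply code.
PASV_RE = re.compile(
    rb"^227 .*?(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})"
)
TCP = 6  # IP protocol number of TCP; FTP data connections use TCP


def parse_pasv_reply(payload, client_ip):
    """Return (server_ip, server_port, client_ip, protocol) or None."""
    match = PASV_RE.match(payload)
    if match is None:
        return None
    h1, h2, h3, h4, p1, p2 = (int(g) for g in match.groups())
    if max(h1, h2, h3, h4, p1, p2) > 255:
        return None  # not a well-formed PASV reply
    return (f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2, client_ip, TCP)


def is_pasv_data_flow(flow_key, records):
    """Check whether four five-tuple elements match a recorded PASV connection.

    `flow_key` is (src_ip, dst_ip, src_port, dst_port, protocol).
    """
    src_ip, dst_ip, src_port, dst_port, proto = flow_key
    for server_ip, server_port, client_ip, rec_proto in records:
        if proto != rec_proto:
            continue
        # The client's ephemeral port is unknown, so only four values are compared.
        if (src_ip, src_port, dst_ip) == (server_ip, server_port, client_ip):
            return True
        if (dst_ip, dst_port, src_ip) == (server_ip, server_port, client_ip):
            return True
    return False
```

In use, `parse_pasv_reply` would be applied to every packet of the port 21 flows that starts with 227, passing the packet's destination IP as `client_ip`, and the resulting records would then be checked against every other flow with `is_pasv_data_flow`.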

In step A3, the method of encoding the bytes of the extracted packets according to their position comprises the following step:

encoding the value of each byte in the packet, which is represented by two hexadecimal digits, as a single item represented by five characters. Counting from the left, the first character is I, standing for Item. The second and third characters give the position of the byte within the first N extracted bytes, written in hexadecimal and counted from zero; if the position is less than sixteen, the second character is zero. The fourth and fifth characters are the two hexadecimal characters of the original byte.
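The encoding rule can be expressed as a short sketch. The function name `encode_payload`, the default value of N, and the use of uppercase hexadecimal digits are assumptions; the text fixes only the five-character layout.

```python
def encode_payload(payload, n=16):
    """Encode the first n payload bytes as position-tagged single items.

    Each item is 'I' + two hex digits for the zero-based position
    + two hex digits for the byte value.
    """
    if not 0 < n <= 256:
        raise ValueError("the position must fit in two hexadecimal digits")
    return [f"I{pos:02X}{value:02X}" for pos, value in enumerate(payload[:n])]
```

For example, a payload whose first byte is the ASCII character `9` (value 0x39) yields the single item `I0039` for position zero.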

In step A4, the quasi-protocol identification feature data comprises the encoded information of packets having the same offset, together with the associated auxiliary statistical data;

extracting the quasi-protocol identification feature data comprises the following steps:

A41. extracting the quasi-protocol identification feature data from TCP flows and importing it into the transaction database;

A42. extracting the quasi-protocol identification feature data from UDP flows and importing it into the transaction database.

Step B comprises the following steps:

Step B1: set an initial frequency ratio, multiply it by the total number of transactions in the database, and round the result to an integer to obtain the initial frequency. Compute the frequency of every single item in the database and filter out the single items whose frequency is lower than the initial frequency. Each remaining single item is called a level-1 frequent item, and the set of all level-1 frequent items is called the level-1 frequent itemset;

Step B2: select two level-(K-1) frequent items A and B from the level-(K-1) frequent itemset such that the first K-2 single items of A are identical to the first K-2 single items of B and the (K-1)-th single item of A has a different position from the (K-1)-th single item of B. If the position of the (K-1)-th single item of B is greater than that of the (K-1)-th single item of A, append the (K-1)-th single item of B to A to obtain a quasi level-K frequent item, and compute its frequency. If the frequency is greater than or equal to the initial frequency, the item is indeed a level-K frequent item. All level-K frequent items are obtained in this way and together form the level-K frequent itemset, where K≥2.
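Steps B1 and B2 resemble level-wise (Apriori-style) candidate generation. The sketch below is one possible reading under stated assumptions: a transaction is a collection of five-character single items, an item is a tuple of single items ordered by position, rounding is done by truncation, and the initial frequency is kept at 1 or more. All names are hypothetical.

```python
def position(single):
    """Position encoded in characters 2-3 of a single item such as 'I0145'."""
    return int(single[1:3], 16)


def frequency(item, transactions):
    """Number of transactions that contain every single item of `item`."""
    return sum(1 for t in transactions if all(s in t for s in item))


def mine_frequent_items(transactions, initial_ratio):
    """Return a list of levels; levels[k-1] holds the level-k frequent items."""
    tx = [frozenset(t) for t in transactions]
    # Assumption: truncate when rounding, and never go below 1.
    min_freq = max(1, int(initial_ratio * len(tx)))
    singles = sorted({s for t in tx for s in t})
    level = [(s,) for s in singles if frequency((s,), tx) >= min_freq]  # step B1
    levels = []
    while level:
        levels.append(level)
        next_level = []
        for a in level:  # step B2: join items that share their first K-2 singles
            for b in level:
                if a[:-1] == b[:-1] and position(b[-1]) > position(a[-1]):
                    candidate = a + (b[-1],)
                    if frequency(candidate, tx) >= min_freq:
                        next_level.append(candidate)
        level = next_level
    return levels
```

Requiring the appended single item to have a strictly greater position keeps every item ordered from left to right and prevents the same combination from being generated twice.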

A transaction comprises four attribute fields that store the extracted data: the encoded bytes, the packet length, the byte percentage, and the packet percentage.

Step C comprises the following steps:

C1. using the frequency correction and frequent item filtering methods, performing a first frequent item filtering on the multi-level frequent itemsets, followed by frequency correction and a second frequent item filtering;

C2. converting the frequent items in the corrected and filtered frequent itemset into feature strings, mining absolute packet length features and packet-length/content difference features, marking the corresponding transactions, and then filtering the quasi-protocol identification feature data corresponding to those transactions to obtain the final protocol identification features.

Step C1 comprises the following steps:

Step C11, performing the first frequent item filtering on the multi-level frequent itemsets;

Step C12, performing frequency correction and the second frequent item filtering on the multi-level frequent itemsets that remain after the first frequent item filtering, so as to eliminate containment relationships between frequent items.

Step C12 comprises the following steps:

Step C121, correcting the frequency of level-1 frequent items using the following formula;

$freq_{new} = \begin{cases} k_1 \times freq_{old}; & pos_0 = 0,\ 0.9 \leq k_1 < 1 \\ f(pos_0) \times freq_{old}; & pos_0 \neq 0,\ 0 < f(pos_0) \leq k_1 \end{cases}$

where $freq_{new}$ is the new frequency after correction, $freq_{old}$ is the frequency of the frequent item before correction, $pos_i$ is the position of the i-th single item in the frequent item with i starting from zero, and k is the number of single items in the frequent item; the positions of single items are numbered from zero, $k_1$ is a constant, and $f(pos_0)$ is a continuous monotonically decreasing function;

Step C122, correcting the frequency of level-2 frequent items using the following formula;

where $k_2$ is a constant and $f(pos_1 - pos_0)$ is a continuous monotonically decreasing function;

Step C123, first computing the average distance between the single items of a frequent item, denoted $ave_{dist}$, using the following formula;

$ave_{dist} = \frac{\sum_{i=1}^{k-1} (pos_i - pos_{i-1})}{k - 1}$

The distance is the absolute value of the difference between the positions of two adjacent single items in the item;

if $ave_{dist} \neq 1$, $ave_{dist}$ is corrected using the following formula;

where $k_3$ and $k_4$ are constants;

then the frequency of level-3 and level-4 frequent items is corrected using the following formula;

where $f_1(k)$ is a continuous monotonically increasing function of k; $f_2(ave_{dist})$ and $f_3(ave_{dist})$ are continuous monotonically decreasing functions of $ave_{dist}$ and satisfy $f_2(ave_{dist}) > f_3(ave_{dist})$;

Step C124, among frequent items that have a containment relationship, filtering out the frequent item with the lower frequency.
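Parts of step C12 can be sketched in code: the level-1 correction, the average distance, and the containment filter of step C124. The text does not give a concrete f, so the form `k1 / (1 + alpha * pos0)` and the values of `k1` and `alpha` are purely illustrative assumptions, as is the rule that the longer item is kept when two frequencies tie. The level-2 and level-3/4 corrections are not sketched because their formulas are not reproduced here.

```python
def correct_level1(freq_old, pos0, k1=0.95, alpha=0.05):
    """Level-1 correction: k1 * freq_old at position 0, f(pos0) * freq_old elsewhere."""
    if not 0.9 <= k1 < 1:
        raise ValueError("k1 must satisfy 0.9 <= k1 < 1")
    if pos0 == 0:
        return k1 * freq_old
    # Hypothetical f: continuous, decreasing, and within (0, k1] for pos0 >= 1.
    return k1 / (1 + alpha * pos0) * freq_old


def average_distance(positions):
    """Mean absolute gap between adjacent single-item positions of one item."""
    k = len(positions)
    if k < 2:
        raise ValueError("an average distance needs at least two single items")
    return sum(abs(positions[i] - positions[i - 1]) for i in range(1, k)) / (k - 1)


def drop_contained(freqs):
    """Among items in a containment relationship, drop the less frequent one.

    `freqs` maps an item (tuple of single items) to its corrected frequency.
    """
    dropped = set()
    for a in freqs:
        for b in freqs:
            if set(a) < set(b):  # a is strictly contained in b
                # Assumption: on a tie the longer item b is kept.
                dropped.add(a if freqs[a] <= freqs[b] else b)
    return {item: f for item, f in freqs.items() if item not in dropped}
```

Because the correction scales down items that start away from position zero, a short item near the start of the payload and a longer item containing it can end up with different corrected frequencies, and the containment filter then keeps only one of them.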

Step C2 comprises the following steps:

Step C21, converting the frequent items in the corrected and filtered frequent itemset into feature strings, retrieving from the transaction database the transactions that satisfy each feature string, mining absolute packet length features and packet-length/content difference features, and marking the transactions that satisfy those features;

Step C22, filtering repeatedly mined features, weak features, and suspected mixed-traffic features out of the quasi-protocol identification feature data corresponding to those transactions to obtain the final protocol identification features.

The beneficial effects of the invention are as follows:

1. The method analyzes the identification features of each application layer protocol very efficiently. It makes periodic updating of every protocol's identification features in the feature library practical, which lays the foundation for complete and accurate identification of all traffic;

2. The application layer protocol identification features obtained by the method are far more complete and reliable than those obtained with current techniques. Because the method mines features from a large volume of extracted and analyzed data, it obtains a more complete set of identification features. Because multiple candidate features pass through multi-level filtering and verification before the final features are produced, the features are also much more reliable than those that current techniques derive from a limited number of packets;

3. The method proposes indicators for measuring whether application layer protocol identification features are correct: identification rate, accuracy rate, false positive rate, and false negative rate. These indicators measure the automatically derived features from several aspects, so that the features achieve a high identification rate while maintaining a high accuracy rate and a low misidentification rate, and so that there is a basis for judging their correctness;

4. The method not only turns the current manual analysis of protocol features into automatic computer mining of application layer protocol identification features, but also improves the accuracy, completeness, and reliability of the resulting features. Previously the misidentification rate between protocols rose as the number of identified protocols grew; the method keeps that rate within a very small range, which greatly improves reliability.

Brief Description of the Drawings

Fig. 1 is a flowchart of the method for mining application layer protocol identification features according to the invention;

Fig. 2 is a flowchart of the training data extraction step of the invention;

Fig. 3 is a flowchart of the preliminary quasi-feature filtering step of the invention;

Fig. 4 is a flowchart of the secondary quasi-feature mining and filtering step of the invention.

Detailed Description

To make the purpose, technical solution, and advantages of the invention clearer, the method for mining application layer protocol identification features is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.

The invention proposes a method for mining application layer protocol identification features. The method needs only the set of packets captured during communication over the application layer protocol; by analyzing and mining this packet set it can extract all identification features of the protocol, with feature extraction efficiency and overall identification rate greatly improved over the prior art.

To explain the method more clearly, the terms used in the invention and the features it can mine are described first:

Flow: a five-tuple consisting of the source IP, destination IP, source port, destination port, and protocol of the two communicating parties.

Training packet set: the set of packets captured from the communication of one specific protocol; this captured set is used to mine the application layer protocol identification features.

Transaction: a record in the transaction database. In the invention a transaction comprises four attribute fields that store the extracted data: the encoded bytes, the packet length, the byte percentage, and the packet percentage.

Single item: the encoded representation of one byte B₁ of the application layer payload. The first character is I, standing for Item; the hexadecimal value given by the second and third characters is called the position of the item; the fourth and fifth characters give the hexadecimal value of byte B₁. Example: I0039.

Item: a sequence of one or more single items, separated by a punctuation mark such as a comma, in which the positions of the single items increase from left to right.

Frequency: the number of times an item occurs in the transaction database.

Frequent item: an item whose frequency is greater than or equal to the initial frequency (a preset positive integer).

Level-K frequent item: a frequent item containing K single items.

Frequent itemset: a set made up of frequent items of all levels.

Feature string: a combination of characters that appears at a fixed position in the application layer payload and can identify the protocol.

Absolute packet length feature: if the application layer payloads of all packets that satisfy a given feature string have the same length, those packets are said to satisfy the absolute packet length feature. This feature is combined with a feature string, so any packet that satisfies the absolute packet length feature necessarily satisfies a feature string. It reduces misidentification very effectively and improves identification accuracy.

Packet-length/content difference feature: if, for all packets that satisfy a given feature string, the length of the application layer payload differs from the value at one fixed position in the payload by the same fixed amount, those packets are said to satisfy the packet-length/content difference feature. This feature is likewise combined with a feature string, so any packet that satisfies it necessarily satisfies a feature string. It reduces misidentification very effectively and improves identification accuracy.
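The two length-based features defined above can be checked with a minimal sketch. It assumes that `payloads` already holds the application layer payloads of all packets matching one feature string, and that the value at the fixed position is an unsigned big-endian integer of `width` bytes; the field width, byte order, and function names are assumptions, not details given in the text.

```python
def has_absolute_length(payloads):
    """True if every matching payload has the same length."""
    return len({len(p) for p in payloads}) == 1


def length_content_difference(payloads, offset, width=1):
    """Return the constant (length - field value) shared by all payloads, or None."""
    diffs = set()
    for p in payloads:
        if len(p) < offset + width:
            return None  # the field is missing in this payload
        value = int.from_bytes(p[offset:offset + width], "big")
        diffs.add(len(p) - value)
    return diffs.pop() if len(diffs) == 1 else None
```

For instance, two payloads of 7 and 5 bytes whose second byte holds 5 and 3 respectively share the constant difference 2, which would correspond to a two-byte header followed by a body whose size is stored in the header.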

The method for mining application layer protocol identification features proposed by the invention can mine and extract three kinds of features: feature strings, absolute packet length features, and packet-length/content difference features. With these targets in mind, the method for automatically mining application layer protocol identification features is described below. As shown in Fig. 1, it comprises the following steps:

Step S100, filtering and encoding the training packet set using the mixed traffic filtering method and the application layer payload encoding method, and extracting the quasi-protocol identification feature data;

As shown in Fig. 2, step S100 comprises the following steps:

Step S110, capturing the training packet set, dividing it by flow, and storing it in flow structures;

The source IP, destination IP, and protocol number are extracted from the IP header of each packet. If the protocol number is 6, the transport layer uses TCP, and the source and destination port numbers are extracted from the TCP header; if the protocol number is 17, the transport layer uses UDP, and the source and destination port numbers are extracted from the UDP header; if the protocol number has any other value, the packet is discarded.

Packets with the same source IP, destination IP, source port, destination port, and protocol are grouped together and stored in one flow structure in order of arrival, with the flow's five-tuple as its label.
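Step S110 can be sketched as a grouping of already-parsed packets by five-tuple. The `Packet` type and the function name `group_by_flow` are hypothetical, and header parsing is assumed to have been done elsewhere. The key follows the text literally; treating both directions of a connection as one flow would need a normalized key instead.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    proto: int      # IP protocol number: 6 = TCP, 17 = UDP
    src_port: int
    dst_port: int
    payload: bytes


def group_by_flow(packets):
    """Group packets by five-tuple, keeping arrival order within each flow."""
    flows = defaultdict(list)
    for pkt in packets:
        if pkt.proto not in (6, 17):
            continue  # neither TCP nor UDP: discard the packet
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.proto)
        flows[key].append(pkt)
    return dict(flows)
```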

Step S120, filter the mixed traffic out of the training packet set using the mixed-traffic filtering method;

Step S121, filter out the contents of flows determined to be HTTP or FTP flows;

Many applications now embed advertisements in their pages, and these advertisements are delivered over HTTP. Some applications also support file download or data transfer over HTTP or FTP. As a result, traffic of several protocols is produced even when only one application is communicating. If an application uses either mechanism, the resulting HTTP or FTP traffic usually makes up a large share of the total. Unless it is filtered out, HTTP or FTP features will be mined together with the features of the target protocol, and HTTP or FTP traffic will then be misidentified as the target protocol during real-time identification.

Therefore, when the training packet set is captured, only one application is opened so that the captured packets essentially all use the same application layer protocol, and the contents of flows that conform to HTTP or FTP must still be filtered out.

Experiments show that the mixed traffic produced by other application layer protocols currently accounts for only a small share of the total traffic and does not significantly affect the mining results; it can be removed by weak-feature filtering.

Step S121A, first filter out the contents of HTTP flows;

Each TCP flow structure is examined in turn, and the packets it contains are discarded if both of the following conditions hold: 1) the flow structure uses port 80; 2) the fourth packet in the structure begins with any of GET, HEAD, POST, PUT, PATCH, COPY, MOVE, DELETE, LINK, UNLINK, or OPTION.
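The two HTTP conditions can be checked as in the sketch below. The function name and the argument layout (a pair of ports plus the payloads in arrival order) are assumptions made for illustration.

```python
# Method prefixes listed in the description.
HTTP_METHODS = (
    b"GET", b"HEAD", b"POST", b"PUT", b"PATCH", b"COPY",
    b"MOVE", b"DELETE", b"LINK", b"UNLINK", b"OPTION",
)


def is_http_flow(ports, payloads):
    """Return True if a TCP flow meets both HTTP conditions.

    ports: the two port numbers of the flow.
    payloads: payloads of the flow's packets in arrival order; the fourth
    one is the first packet after the three-way handshake.
    """
    if 80 not in ports:
        return False
    if len(payloads) < 4:
        return False
    return payloads[3].startswith(HTTP_METHODS)
```

A flow for which `is_http_flow` returns `True` would have all of its packets discarded before mining.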

Step S121B, then filter out the contents of FTP flows;

(1) filter out the packets of flow structures that communicate in PASV mode;

(2) filter out the packets of flow structures that use port 20 or port 21;

A flow structure that communicates in FTP PASV mode is identified as follows:

First, find the flow structures that use port 21 and check whether any packet belonging to such a structure begins with 227;

If so, check further whether the packet is a PASV-mode reply (that is, whether it matches the FTP reply format). If it is, the packet contains the IP address and port number on which the server is prepared to accept the PASV-mode data connection from the client, and the packet's destination IP address is the client's IP address. Because FTP data connections use TCP, four of the five elements of the five-tuple of the PASV-mode data flow are now known, and these four values are recorded. Once all flow structures that use port 21 have been traversed, the information for every flow that communicates in FTP PASV mode has been obtained;

Packets of flow structures that use port 20 are filtered out directly.

Each flow is then compared with the recorded PASV-mode data connection information. If four elements of the five-tuple in the flow structure match a recorded PASV-mode data connection, the flow is deemed an FTP PASV-mode flow and all of its packets are discarded.
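The PASV detection can be sketched as follows. The reply layout `227 ... (h1,h2,h3,h4,p1,p2)` with port `p1*256 + p2` is the standard FTP passive-mode reply; the function names and the tuple layouts are assumptions chosen for this example.

```python
import re

# "227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)": six comma-separated numbers.
PASV_RE = re.compile(
    rb"^227 .*?(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})"
)


def parse_pasv_reply(payload, client_ip):
    """Return (server_ip, server_port, client_ip, proto) or None."""
    match = PASV_RE.match(payload)
    if match is None:
        return None
    nums = [int(g) for g in match.groups()]
    if any(n > 255 for n in nums):
        return None
    server_ip = ".".join(str(n) for n in nums[:4])
    server_port = nums[4] * 256 + nums[5]
    return (server_ip, server_port, client_ip, "TCP")


def is_pasv_data_flow(flow_key, recorded):
    """flow_key: (src_ip, dst_ip, src_port, dst_port, proto).

    The client's data port is unknown, so only four elements are compared.
    """
    src_ip, dst_ip, src_port, dst_port, proto = flow_key
    for server_ip, server_port, client_ip, rec_proto in recorded:
        if proto != rec_proto:
            continue
        if (src_ip, src_port, dst_ip) == (server_ip, server_port, client_ip):
            return True
        if (dst_ip, dst_port, src_ip) == (server_ip, server_port, client_ip):
            return True
    return False
```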

Step S122, filter out TCP flows that lack a complete three-way handshake;

Based on the flag bits of the TCP three-way handshake packets, determine whether each TCP flow has a complete three-way handshake, and filter out the TCP flows that do not;
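A minimal sketch of the handshake test is shown below. It assumes the flow's packets are available in arrival order as raw TCP flag bytes and checks only the SYN and ACK bits; direction and sequence-number checks, which a fuller implementation might add, are left out.

```python
SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit


def has_complete_handshake(flags):
    """flags: TCP flag bytes of the flow's packets in arrival order."""
    if len(flags) < 3:
        return False
    first, second, third = (f & (SYN | ACK) for f in flags[:3])
    # Expected pattern: SYN, then SYN+ACK, then ACK.
    return first == SYN and second == (SYN | ACK) and third == ACK
```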

Step S130, encode the application layer payload using the position-based method for encoding the bytes of the extracted packets;

In a protocol header, the same data at different positions carries different meanings, and much of what the present invention analyzes concerns protocol headers. Moreover, without encoding, the results of a frequent-itemset search would be meaningless for the problem addressed by the present invention. To solve this problem, the present invention proposes a position-based method for encoding the bytes of the extracted packets and applies it to the contents of all packets. The encoding not only distinguishes identical bytes at different positions but is also decisive for the later filtering of quasi application layer protocol identification features and for the automatic generation of the application layer protocol identification features.

The encoding method is as follows. In the original packet, the value of one byte is written as two hexadecimal digits. Each byte is now encoded as a single item of five characters. The first character from the left is I, for Item. The second and third characters give the position of the byte among the first N extracted bytes, written in hexadecimal and counted from zero; if the position is less than sixteen, the second character is zero. The fourth and fifth characters are the two hexadecimal digits of the original byte.
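The encoding can be illustrated with a short sketch. The function names are hypothetical, and the use of uppercase hexadecimal digits is an assumption, since the description does not fix the letter case.

```python
def encode_payload(payload: bytes, n: int) -> list[str]:
    """Encode the first n payload bytes as five-character items 'Ippvv'.

    pp is the zero-based byte position and vv the byte value, both written
    as two hexadecimal digits (uppercase is an assumption of this sketch).
    """
    return [f"I{pos:02X}{byte:02X}" for pos, byte in enumerate(payload[:n])]


def item_position(item: str) -> int:
    """Byte position stored in the second and third characters."""
    return int(item[1:3], 16)


def item_value(item: str) -> int:
    """Original byte value stored in the fourth and fifth characters."""
    return int(item[3:5], 16)
```

For example, the payload `b"GET"` encodes to `["I0047", "I0145", "I0254"]`, so the byte `0x47` at position 0 and the same byte at any other position become different items.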

Step S140, extract the quasi-protocol identification feature data from the encoded data.

The quasi-protocol identification feature data comprises the encoded information of packets that have the same offset, together with the associated auxiliary statistical data.

The filtered flows are divided into TCP flows and UDP flows according to the upper-layer protocol type in the IP datagram header, and the quasi-protocol identification feature data is extracted from each kind separately, in the following steps:

Step S141, extract the quasi-protocol identification feature data from the TCP flows and import it into the transaction database;

For TCP flows, the encoded information of packets with the same offset is extracted: the first N encoded bytes of the i-th packet of the flow, the length of the i-th packet, the flow's total byte count as a percentage of the total bytes of all TCP flows with a complete three-way handshake in the training packet set, and the flow's total packet count as a percentage of the total packets of all such flows.

The data is then imported into different tables of the transaction database according to the value of i, preferably with 4≤i≤(n+4). Extraction starts at the fourth packet so that the data of the three handshake packets that establish the connection is not extracted. The maximum value of i is (n+4), meaning that n packets are extracted in total. The n packets are extracted so that, when the information mined from the first data packet is insufficient to identify the protocol effectively, mining can continue iteratively with the contents of the 2nd, 3rd, through n-th data packets;

Step S142, extract the quasi-protocol identification feature data from the UDP flows and import it into the transaction database.

Extraction for UDP flows is similar to that for TCP flows, except that 1≤i≤n, because UDP has no connection-establishment phase;

Step S200, mine multi-level frequent itemsets from the extracted quasi-protocol identification feature data;

Feature strings (for example a fixed version number or status) are among the most common identification features and generally appear in the protocol header. Each record extracted in the previous step is very likely protocol header data. If frequently occurring string combinations can be found in this data, they are likely to be features that identify the protocol. The problem of extracting feature strings can therefore be recast as the problem of mining the encoded data for frequent itemsets.

In the present invention, the level-1 frequent itemset is mined as follows. An initial frequency rate is set and multiplied by the total number of transactions in the transaction database, and the product is rounded to give the initial frequency. The frequency of every single item in the database is then computed, and single items whose frequency is below the initial frequency are filtered out. Each remaining single item is called a level-1 frequent item, and the set of all level-1 frequent items is called the level-1 frequent itemset;
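The level-1 step can be sketched as below. Representing each transaction as a list of encoded items is an assumption, and so is the use of Python's built-in `round`, because the description does not state the rounding mode.

```python
from collections import Counter


def level1_frequent_items(transactions, initial_rate):
    """Return (level-1 frequent items with their frequency, initial frequency).

    transactions: iterable of collections of encoded single items.
    initial_rate: initial frequency rate, between 0 and 1.
    """
    transactions = list(transactions)
    # Initial frequency = rate x number of transactions, rounded
    # (rounding mode is an assumption of this sketch).
    min_freq = round(initial_rate * len(transactions))
    # Count each single item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_freq}
    return frequent, min_freq
```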

Level-K (K≥2) frequent items are mined as follows. Two level-(K-1) frequent items A and B are selected from the level-(K-1) frequent itemset such that the first (K-2) single items of A are identical to the first (K-2) single items of B, and the (K-1)-th single item of A and the (K-1)-th single item of B are at different positions. If the position of B's (K-1)-th single item is greater than that of A's (K-1)-th single item, B's (K-1)-th single item is appended to A to form a quasi level-K frequent item (consisting of K single items). The frequency of this quasi level-K frequent item is computed; if it is greater than or equal to the initial frequency, the item is indeed a level-K frequent item. All level-K frequent items are obtained in this way and form the level-K frequent itemset.

For example, level-2 frequent items are mined as follows. Any two level-1 frequent items are selected from the level-1 frequent itemset and combined, according to their positions, into a quasi level-2 frequent item. The transaction database is then scanned to compute the frequency of this quasi level-2 frequent item; if it is greater than or equal to the initial frequency, the item is indeed a level-2 frequent item. All level-2 frequent items are obtained in this way and form the level-2 frequent itemset.

If a level-2 frequent item exists, it must be a combination of two level-1 frequent items. Because of the encoding used in the present invention, the single items within a level-2 frequent item must be arranged from left to right in ascending order of position, rather than being arbitrary combinations of level-1 frequent items. This greatly reduces the number of combinations and improves processing efficiency.

Level-3 frequent items are mined as follows. Two level-2 frequent items A and B are selected from the level-2 frequent itemset such that the first single item of A is identical to the first single item of B, and the second single item of A and the second single item of B are at different positions. If the position of B's second single item is greater than that of A's second single item, B's second single item is appended to A to form a quasi level-3 frequent item (consisting of three single items). The frequency of this quasi level-3 frequent item is computed; if it is greater than or equal to the initial frequency, the item is indeed a level-3 frequent item. All level-3 frequent items are obtained in this way and form the level-3 frequent itemset.

Frequent items of level 4 and higher are mined in the same way.

The iteration ends when no longer frequent item can be found. Combining the frequent items of all levels gives the set of frequent items whose frequency is greater than or equal to a set value. Frequent items that contain relatively many single items are called high-level frequent items, and those that contain few single items are called low-level frequent items.
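The level-by-level procedure resembles Apriori-style candidate generation restricted by byte position, and it can be sketched as follows. The sketch assumes items encoded as `Ippvv` strings and frequent items stored as tuples in ascending position order; all function names are hypothetical.

```python
def position(item):
    """Byte position encoded in characters 2 and 3 of an 'Ippvv' item."""
    return int(item[1:3], 16)


def frequency(candidate, transactions):
    """Number of transactions that contain every single item of candidate."""
    wanted = set(candidate)
    return sum(1 for t in transactions if wanted <= t)


def mine_frequent_items(transactions, min_freq):
    """Return a list of dicts, one per level, mapping item tuple -> frequency."""
    tx = [frozenset(t) for t in transactions]
    level = {}
    for single in sorted({i for t in tx for i in t}):
        f = frequency((single,), tx)
        if f >= min_freq:
            level[(single,)] = f
    levels = []
    while level:
        levels.append(level)
        candidates = sorted(level)
        next_level = {}
        for a in candidates:
            for b in candidates:
                # Same first K-2 single items, and B's last single item
                # lies at a greater position than A's last single item.
                if a[:-1] == b[:-1] and position(b[-1]) > position(a[-1]):
                    cand = a + (b[-1],)
                    f = frequency(cand, tx)
                    if f >= min_freq:
                        next_level[cand] = f
        level = next_level
    return levels
```

Because candidates are only extended toward greater positions, each combination of single items is generated once, which is the reduction in combinations that the description attributes to the encoding.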

Step S300, using the frequency correction and frequent-item filtering methods, perform the first filtering on the multi-level frequent itemsets, then correct the frequency of the multi-level frequent items that remain, perform secondary mining, and perform the second frequent-item filtering to obtain the final protocol identification features;

Step S310, using the frequency correction and frequent-item filtering methods, perform the first frequent-item filtering on the multi-level frequent itemsets, followed by frequency correction and the second frequent-item filtering;

This step comprises the first frequent-item filtering process, and the frequency correction and second frequent-item filtering process.

The first frequent-item filtering process removes redundant quasi-features and all-zero quasi-features;

The frequency correction and second frequent-item filtering process corrects the frequency of frequent items of level 4 and below, and then removes features of relatively low reliability through the second frequent-item filtering.

The frequency correction method substantially lowers the frequency of unreliable features, which ensures that they are removed in the second frequent-item filtering and greatly reduces the number of quasi-features. This greatly improves the identification accuracy of the features and lowers the probability that features of different protocols overlap.

As shown in FIG. 3, step S310 includes the following steps.

Step S311, perform the first frequent-item filtering on the multi-level frequent itemsets;

The multi-level frequent itemsets obtained in step S200 are highly redundant, that is, they include many redundant features, and the first frequent-item filtering must be applied to them.

A redundant feature is defined as follows: if the frequency of a frequent item equals the frequency of a high-level frequent item that contains it, the frequent item is called a redundant feature.

This is because the existence of a high-level frequent item implies that every one of its sub-items is a frequent item whose frequency is no less than that of the high-level frequent item.

The first frequent-item filtering starts from the level-1 frequent items and checks each level in turn, removing all redundant features. Because this is an equivalence-preserving filter, no feature is lost.

The first frequent-item filtering also removes frequent items whose content is all zeros.

An all-zero frequent item is one in which the last two characters of every single item are zero.

First, an all-zero feature is unlikely to be used in practice, because it conveys very little information. Second, an all-zero feature is unreliable in practice, because many protocols use zero padding; large numbers of all-zero frequent items appeared during feature mining for many protocols, and leaving them unfiltered would cause cross-misidentification between protocols.
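The first filtering pass can be sketched as follows, assuming frequent items are tuples of `Ippvv` strings mapped to their frequency. The function name is hypothetical.

```python
def first_filter(freq_items):
    """Drop all-zero frequent items and redundant features.

    freq_items: dict mapping a tuple of encoded single items to its frequency.
    """
    kept = {}
    for item, freq in freq_items.items():
        # All-zero: the byte value (last two characters) of every single item is 00.
        if all(single[3:5] == "00" for single in item):
            continue
        # Redundant: a longer frequent item contains this one with equal frequency.
        redundant = any(
            set(item) < set(other) and other_freq == freq
            for other, other_freq in freq_items.items()
        )
        if not redundant:
            kept[item] = freq
    return kept
```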

Step S312, correct the frequency of the multi-level frequent itemsets that remain after the first frequent-item filtering, and then perform the second frequent-item filtering to eliminate containment relationships between frequent items;

After the first frequent-item filtering, frequent items that contain one another may still exist. This must never be allowed in the final result, because the high-level frequent items would then automatically be masked: low-level frequent items occur more frequently, although high-level frequent items are less prone to misidentification.

The present invention therefore proposes to correct the frequency of frequent items of level 4 and below on the basis of their starting position and the average distance between their single items.

Low-level frequent items are more likely than high-level ones to cause misidentification, and they are generally more numerous. Some low-level frequent items have a large starting position and large distances between their single items; such items generally do not qualify as protocol identification features, and their presence causes a high misidentification rate. Their frequency can therefore be lowered according to these values so that they are filtered out. Length 4 is chosen as the boundary for frequency correction because frequent items longer than 4 are generally few and have relatively low frequency, and the maximum probability that one of them causes a misidentification is 1/2⁴⁰, so they cause no significant misidentification even if left unfiltered.

Further, the frequency correction and second frequent-item filtering for frequent items of level 4 and below comprise the following steps:

Step S3121, correct the frequency of level-1 frequent items using formula (1);

$freq_{new} = \begin{cases} k_{1} \times freq_{old}; & pos_{0} = 0,\ 0.9 \leq k_{1} < 1 \\ f(pos_{0}) \times freq_{old}; & pos_{0} \neq 0,\ 0 < f(pos_{0}) \leq k_{1} \end{cases} \qquad (1)$

Here, freq_new denotes the new frequency after correction, freq_old denotes the frequency of the frequent item before correction, pos_i denotes the position of the i-th single item in the frequent item, with i starting from zero, and k denotes the number of single items in the frequent item. Positions of single items are numbered from zero, k₁ is a constant, and f(pos₀) is a continuous, monotonically decreasing function.
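Formula (1) can be illustrated with the sketch below. The description does not give a concrete f, so f(pos₀) = k₁/(1 + pos₀) is used purely as a hypothetical stand-in that is continuous, monotonically decreasing, and satisfies 0 < f(pos₀) ≤ k₁ for pos₀ ≥ 1; the default k₁ = 0.95 is likewise an assumed value inside the stated range.

```python
def correct_level1(freq_old, pos0, k1=0.95):
    """Corrected frequency of a level-1 frequent item per formula (1).

    k1 must satisfy 0.9 <= k1 < 1. The decreasing function used for
    pos0 != 0 is a hypothetical choice, not taken from the description.
    """
    if not 0.9 <= k1 < 1:
        raise ValueError("k1 must satisfy 0.9 <= k1 < 1")
    if pos0 == 0:
        return k1 * freq_old
    f = k1 / (1 + pos0)  # assumed f(pos0): continuous, decreasing, 0 < f <= k1
    return f * freq_old
```

With this choice, a single item at position 0 keeps almost all of its frequency, while one that starts deep in the payload is penalized heavily.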

Step S3122, correct the frequency of level-2 frequent items using formula (2);

Here, k₂ is a constant, and f((pos₁-pos₀)) is a continuous, monotonically decreasing function.

For a frequent item whose starting position is within the first four positions and whose two single items are adjacent, the frequency is raised rather than lowered, in order to eliminate level-1 frequent items whose frequency is slightly greater than its own. Raising the frequency of this item does not affect the longer items that contain it, because even without the increase its frequency is greater than that of the longer frequent items that contain it.

The frequency is increased in the following steps for the same purpose.

Step S3123, correct the frequency of level-3 and level-4 frequent items using formula (5);

First, the average distance between the single items of the frequent item is computed using formula (3) and denoted ave_dist.

$ave_{dist} = \frac{\sum_{i=1}^{k-1} \left( pos_{i} - pos_{i-1} \right)}{k-1} \qquad (3)$

The distance is the absolute value of the difference between the positions of two adjacent single items in the frequent item.
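A small sketch of the average-distance computation in formula (3) follows. It assumes the positions of the single items are passed as a list in the order they appear in the frequent item; the function name is hypothetical.

```python
def average_distance(positions):
    """Average distance between adjacent single items of a frequent item.

    positions: byte positions pos_0 ... pos_{k-1} of the k single items.
    """
    k = len(positions)
    if k < 2:
        raise ValueError("at least two single items are required")
    total = sum(abs(positions[i] - positions[i - 1]) for i in range(1, k))
    return total / (k - 1)
```

A frequent item whose single items occupy consecutive bytes has an average distance of exactly 1, which is the case that the description treats separately.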

If ave_dist≠1, ave_dist is corrected using formula (4), so that the frequency is lowered by an appropriate amount rather than by too much at once;

Here, k₃ and k₄ are constants.

The frequency of level-3 and level-4 frequent items is corrected using formula (5).

Here, f₁(k) is a continuous, monotonically increasing function of k; f₂(ave_dist) and f₃(ave_dist) are continuous, monotonically decreasing functions of ave_dist, and they satisfy f₂(ave_dist)＞f₃(ave_dist).

Step S3124, among frequent items that have a containment relationship, filter out the one with the smaller frequency.

After the frequency correction, the frequencies of any two frequent items that have a containment relationship are compared, and the second frequent-item filtering removes the one with the smaller frequency.
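The second filtering can be sketched as below, taking the corrected frequencies as input. The description does not say what happens when two items in a containment relationship have equal corrected frequency; keeping the longer item in that case is an assumption of this sketch.

```python
def second_filter(corrected):
    """Remove, for each containing pair, the item with the smaller frequency.

    corrected: dict mapping a tuple of encoded single items to its
    corrected frequency.
    """
    dropped = set()
    items = list(corrected)
    for short in items:
        for long_ in items:
            if set(short) < set(long_):  # short is contained in long_
                if corrected[short] <= corrected[long_]:
                    dropped.add(short)  # tie keeps the longer item (assumption)
                else:
                    dropped.add(long_)
    return {k: v for k, v in corrected.items() if k not in dropped}
```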

Step S320, convert the frequent items in the corrected and filtered frequent itemsets into feature strings, mine the absolute packet length feature and the packet-length-to-content difference feature, mark the corresponding transactions, and then filter the quasi-protocol identification feature data corresponding to those transactions to obtain the final protocol identification features.

This step performs secondary mining on the transactions that satisfy a feature string, mining the absolute packet length feature and the packet-length-to-content difference feature and marking the transactions that satisfy either of them. It then filters weak features, repeatedly mined features, and suspected mixed features out of the quasi-protocol identification feature data corresponding to those transactions, which lowers the misidentification rate and yields the final protocol identification features.

The purpose of this step is to increase the accuracy and reliability of the protocol identification features, which greatly improves identification accuracy and lowers the misidentification rate. The method proposed in this step for mining the absolute packet length feature and the packet-length-to-content difference feature is what guarantees accurate protocol identification; without it, the misidentification rate between protocols would rise sharply. The filtering of weak features, repeatedly mined features, and suspected mixed features proposed in this step also lowers the misidentification rate between protocols to some extent.

Step S321, convert the frequent items in the corrected and filtered frequent itemsets into feature strings, retrieve from the transaction database the transactions that satisfy each feature string, mine the absolute packet length feature and the packet-length-to-content difference feature, and mark the transactions that satisfy either of them;

Each frequent item that remains after the first frequent-item filtering and after the frequency correction and second frequent-item filtering is converted into a feature string, and every transaction in the transaction database is scanned in turn. For each transaction that satisfies the feature string, the actual length of the corresponding packet is recorded first. Next, taking the actual content of the first N extracted bytes (that is, the value of the last two characters of each single item) and starting from byte 0, each pair of adjacent bytes is combined, giving (N-1) big-endian two-byte unsigned integers and (N-1) little-endian two-byte unsigned integers; two storage structures, each able to hold 2n×(N-1) values, are set up for them. Each of the (N-1) big-endian unsigned integers is subtracted from the packet length of the transaction. If the difference lies in [-n, n-1], the count at the corresponding position of the storage structure is incremented and the position is recorded there. The (N-1) little-endian unsigned integers are processed in the same way.

If an absolute packet length feature exists, the actual packet lengths of all, or the great majority, of the transactions that satisfy the feature string are the same value. If a packet-length-to-content difference feature exists, suppose for example that the difference from the big-endian two-byte unsigned integer starting at position 2 is 5; then for all, or the great majority, of the transactions that satisfy the feature string, the actual packet length minus the big-endian two-byte unsigned integer starting at position 2 equals 5.
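The difference counting can be sketched as follows. The input layout (a packet length paired with the first N payload bytes) and the use of `Counter` objects in place of the two fixed-size storage structures are assumptions made for brevity; `n_window` plays the role of n in the interval [-n, n-1].

```python
from collections import Counter


def length_difference_counts(transactions, n_window):
    """Count (start position, difference) pairs for both byte orders.

    transactions: iterable of (packet_length, first_N_bytes) for the
    transactions that satisfy one feature string.
    Returns (big_endian_counts, little_endian_counts).
    """
    big, little = Counter(), Counter()
    for pkt_len, head in transactions:
        for pos in range(len(head) - 1):
            pair = head[pos:pos + 2]
            values = (
                (big, int.from_bytes(pair, "big")),
                (little, int.from_bytes(pair, "little")),
            )
            for table, value in values:
                diff = pkt_len - value
                if -n_window <= diff <= n_window - 1:
                    table[(pos, diff)] += 1
    return big, little
```

If the great majority of matching transactions land on the same `(position, difference)` key, for instance `(2, 5)` as in the example given in the text, that key is a candidate packet-length-to-content difference feature.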

After all transactions in the transaction database have been scanned once, it can be determined whether, among all transactions that satisfy the feature string, the great majority (a fixed percentage) correspond to packets of equal length. If so, the absolute packet length feature is satisfied, and it is appended to the feature string. If not, it is further determined whether, for the great majority of the transactions, the packet length differs by a fixed amount from the two-byte value at a fixed starting position. If so, the packet-length-to-content difference feature is satisfied. In that case the feature string is corrected first, by removing from it the single items located at the positions covered by the difference feature, and the difference feature is then appended.

The transaction database is scanned again to recompute, for the corrected feature, its frequency, the total number of packets of all flows that satisfy the feature as a percentage of the total number of packets of the corresponding protocol (TCP or UDP), and the total number of bytes of all flows that satisfy the feature as a percentage of the total number of bytes of that protocol. Transactions that satisfy the new feature are marked.

Step S322, filter repeatedly mined features, weak features, and suspected mixed features out of the quasi-protocol identification feature data corresponding to the transactions to obtain the final protocol identification features.

The quasi-protocol identification features that remain after the previous step are called second-level quasi-features. For any two second-level quasi-features, the number of times they occur together in the transaction database is recorded. If this number is at least p₁% of the frequency of the quasi-feature with the smaller frequency, where p₁ is a preset value with 80≤p₁≤100, the two quasi-features are considered to be duplicates that were discovered from essentially the same transactions. The second-level quasi-feature with the smaller frequency is called a repeatedly mined feature and is filtered out.
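The duplicate test can be sketched as below, assuming each second-level quasi-feature is mapped to the set of transaction identifiers that satisfy it, so that its frequency is the size of that set. The function name and data layout are hypothetical.

```python
def drop_repeated_features(feature_tx, p1):
    """Remove repeatedly mined features.

    feature_tx: dict mapping a feature to the set of transaction ids that
    satisfy it. p1: preset percentage, 80 <= p1 <= 100.
    """
    dropped = set()
    feats = list(feature_tx)
    for i, a in enumerate(feats):
        for b in feats[i + 1:]:
            # The feature with the smaller frequency is the removal candidate.
            small = a if len(feature_tx[a]) <= len(feature_tx[b]) else b
            together = len(feature_tx[a] & feature_tx[b])
            small_freq = len(feature_tx[small])
            if small_freq and 100 * together >= p1 * small_freq:
                dropped.add(small)
    return {f: tx for f, tx in feature_tx.items() if f not in dropped}
```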

经过过滤后，再依次判断每个第二级准特征是否仅是一个1级频繁项，若是则称其为弱特征，因为其出现的概率为1/256，会造成很高的误识别，所以要将其过滤掉；同时对于剩余的所有第二级准特征，判断满足该第二级准特征所有流的总数据包和总字节数分别占该协议所对应的总数据包和总字节数的百分比是否小于p₂％(其中，p₂一个预先设定的值)，若任一个第二级准特征称小于p₂％，则该特征为疑似混杂特征，产生该疑似混杂特征的原因可能是训练数据包集合中混合进了其它协议的流量，要将其过滤掉(实验结果表明有很多流量不足1％的准特征造成了很大的误识别，过滤掉这些准特征后，识别率基本上没有降低，但误识别率变得很低)，把剩余的准特征作为该协议的最终协议识别特征。After filtering, it is judged in turn whether each second-level quasi-feature is only a level 1 frequent item. If so, it is called a weak feature, because its occurrence probability is 1/256, which will cause a high misidentification, so To filter it out; at the same time, for all the remaining second-level quasi-characteristics, it is judged that the total data packets and total bytes of all flows that meet the second-level quasi-characteristics account for the total data packets and total bytes corresponding to the protocol Whether the percentage of the number is less than p ₂ % (wherein, p ₂ is a preset value), if any second-level quasi-feature is said to be less than p ₂ %, then the feature is a suspected confounding feature, and the reason for the suspected confounding feature It may be that the traffic of other protocols is mixed in the training data packet set, which should be filtered out (experimental results show that there are many quasi-features with less than 1% of the traffic that cause a lot of misidentification. After filtering out these quasi-features, the recognition rate Basically no reduction, but the false recognition rate becomes very low), and the remaining quasi-features are used as the final protocol recognition features of the protocol.

Step S400: if the byte recognition rate of all final protocol identification features meets the requirement, or the sum of their data packet recognition rates meets the requirement, the data of the second and subsequent data packets is not mined; otherwise, the second and subsequent data packets are mined in a loop until the total recognition rate meets the requirement.

After correction, mining and filtering, if the byte recognition rate or the sum of the data packet recognition rates of all final protocol identification features reaches p₃% or more (where p₃ is a preset value, 80 ≤ p₃ ≤ 100), the data of the second and subsequent data packets is not mined. Otherwise, the second and subsequent data packets are mined in a loop until the total recognition rate reaches p₃% or more. The mined final protocol identification features are then used to generate the corresponding feature file in the feature library. At this point the traffic identification program can be run for real-time online identification.

Generating the corresponding feature file in the feature library from the final protocol identification features, and using a traffic identification program for real-time online identification, are prior art and are therefore not described in detail in the embodiments of the present invention.

The application layer protocol identification feature mining method of the present invention is further explained below, taking as an example the automatic analysis of the application layer protocol that Xunlei (Thunder) uses to communicate over TCP and the extraction of its identification features.

Step S100': extract the data of the Xunlei TCP flows that satisfy the conditions.

Table 1 shows part of the data of the first data packet of each of the 804 Xunlei TCP flows that satisfy the conditions.

Table 1:

| Transaction | First N encoded bytes | Packet length | Byte percentage | Packet percentage |
| --- | --- | --- | --- | --- |
| 1 | I0038, I0100, I0200, I0300, I040d, I0500, I0600, I0700, I0884, I09ab, I0a0c, I0b00 | 21 | 0.007039% | 0.00597% |
| 2 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09e5, I0a5c, I0b00 | 122 | 0.002133% | 0.000246% |
| 3 | I0038, I0100, I0200, I0300, I040d, I0500, I0600, I0700, I0884, I09f4, I0a1e, I0b00 | 21 | 0.011092% | 0.00957% |
| 4 | I0038, I0100, I0200, I0300, I040d, I0500, I0600, I0700, I0884, I0933, I0a00, I0b00 | 21 | 0.014291% | 0.012682% |
| 5 | I0038, I0100, I0200, I0300, I040d, I0500, I0600, I0700, I0884, I095d, I0a13, I0b00 | 21 | 0.011305% | 0.011376% |
| 6 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I099d, I0a5f, I0b00 | 122 | 0.001493% | 0.000159% |
| 7 | I0032, I0132, I0230, I032d, I0453, I0565, I0672, I0776, I082d, I0955, I1020, I1146 | 44 | 0.001067% | 0.000103% |
| 8 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09dc, I0a5c, I0b00 | 122 | 0.002133% | 0.000246% |
| ... | ... | ... | ... | ... |
| 803 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09e5, I0a5c, I0b00 | 122 | 0.001493% | 0.000269% |
| 804 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09e5, I0a5c, I0b00 | 122 | 0.001493% | 0.000269% |
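The encoded items in Table 1 follow the position-based encoding of the method: the letter I, two hexadecimal digits for the byte position counted from zero, and two hexadecimal digits for the byte value. A minimal sketch of that encoding is shown below; the function name and the default of 12 bytes are assumptions chosen to match the twelve items per row in Table 1.

```python
def encode_payload(payload, n=12):
    """Encode the first n payload bytes as position-tagged single items.

    Each item is 'I' + position (2 hex digits) + byte value (2 hex digits).
    """
    return [f"I{pos:02x}{byte:02x}" for pos, byte in enumerate(payload[:n])]


# Payload bytes chosen to reproduce transaction 1 of Table 1.
first_packet = bytes.fromhex("380000000d00000084ab0c00")
items = encode_payload(first_packet)
```

Two hexadecimal position digits limit this form of the encoding to the first 256 bytes of the payload.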

Step S200': find all frequent items whose frequency exceeds a threshold frequency.

The initial frequency rate is set to 0.02. The initial frequency is the product of the initial frequency rate and the number of transactions, truncated to an integer: 0.02 × 804 = 16.08, which gives 16. All frequent items with a frequency greater than 16 are mined. In this embodiment 31680 frequent items are found, some of which are listed in Table 2.
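The threshold arithmetic and the selection of first-level frequent items can be illustrated as follows. The function names are hypothetical, and the strict comparison follows the wording of this example (frequency greater than the threshold).

```python
import math
from collections import Counter

def initial_frequency(rate, num_transactions):
    """Truncate rate * num_transactions to an integer threshold."""
    return math.floor(rate * num_transactions)

def first_level_frequent_items(transactions, threshold):
    """Count each single item once per transaction and keep the frequent ones."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c > threshold}


threshold = initial_frequency(0.02, 804)  # 0.02 * 804 = 16.08, truncated to 16
```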

Table 2:

| No. | Frequency | Frequent items (all levels) |
| --- | --- | --- |
| 1 | 27 | I0032 |
| 2 | 723 | I0038 |
| 3 | 248 | I040d |
| 4 | 29 | I0955 |
| 5 | 29 | I0a20 |
| ... | ... | ... |
| 51 | 27 | I0032, I0132 |
| 52 | 27 | I0032, I0230 |
| 53 | 640 | I0000, I0200 |
| ... | ... | ... |
| 397 | 27 | I0032, I0132, I0230 |
| 398 | 27 | I0032, I0132, I032d |
| ... | ... | ... |
| 5423 | 720 | I0038, I0100, I0200, I0500, I0600 |
| 5424 | 720 | I0038, I0100, I0200, I0500, I0700 |
| 5425 | 720 | I0038, I0100, I0200, I0300, I0500, I0600, I0700 |
| 5426 | 228 | I0100, I0200, I0300, I040d, I0500, I0600, I0700 |
| ... | ... | ... |
| 10000 | 718 | I0038, I0100, I0200, I0300, I0500, I0600, I0700, I0b00 |
| ... | ... | ... |
| 31678 | 84 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09dc, I0a5c, I0b00 |
| 31679 | 63 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09e5, I0a5c, I0b00 |
| 31680 | 20 | I0067, I0165, I0274, I032f, I040d, I050a, I066c, I0765, I086e, I0967, I0a74, I0b68 |
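The multi-level frequent items of Table 2 are built level by level. According to the join rule of the method, two (K-1)-level frequent items that share their first K-2 single items are combined when the last single item of the second lies at a later position than the last single item of the first. The sketch below is one possible reading of that rule; the tuple representation, the function names and the strict threshold comparison (matching the wording of this example) are assumptions.

```python
def position(item):
    """Byte position encoded in characters 2-3 of an item such as 'I0a20'."""
    return int(item[1:3], 16)

def join_level(prev_level, transactions, threshold):
    """Build K-level frequent items from (K-1)-level ones.

    prev_level: iterable of tuples of items, each tuple ordered by position.
    Returns a dict mapping each K-level frequent item to its frequency.
    """
    tx_sets = [set(t) for t in transactions]
    prev = list(prev_level)
    result = {}
    for a in prev:
        for b in prev:
            # Same first K-2 single items; b's last item sits at a later
            # position than a's last item (so the two last items differ).
            if a[:-1] == b[:-1] and position(b[-1]) > position(a[-1]):
                candidate = a + (b[-1],)
                support = sum(1 for t in tx_sets if t.issuperset(candidate))
                if support > threshold:
                    result[candidate] = support
    return result
```

Calling `join_level` repeatedly, starting from the first-level frequent items wrapped as one-element tuples, yields the second level, the third level, and so on until no new frequent items appear.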

Step S300': quasi-feature filtering.

Step S310' (first filtering): perform the first filtering on the frequent items obtained in the previous step.

The frequent items obtained in step S200' are clearly highly redundant and need to be filtered. The first filtering removes redundant features and frequent items whose content is all zeros (see the embodiment for the filtering method).

In this example the first filtering removes 31649 frequent items, which greatly reduces the number of candidate features. Redundant-feature filtering is an equivalent filtering, so no information is lost in the process. After filtering, 31 frequent items remain, some of which are listed in Table 3.

Table 3:

| No. | Frequency | Frequent items (all levels) |
| --- | --- | --- |
| 2 | 723 | I0038 |
| 3 | 248 | I040d |
| 4 | 29 | I0955 |
| 5 | 29 | I0a20 |
| 5425 | 720 | I0038, I0100, I0200, I0300, I0500, I0600, I0700 |
| 5426 | 228 | I0100, I0200, I0300, I040d, I0500, I0600, I0700 |
| 10000 | 718 | I0038, I0100, I0200, I0300, I0500, I0600, I0700, I0b00 |
| ... | ... | ... |
| 31678 | 84 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09dc, I0a5c, I0b00 |
| 31679 | 63 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09e5, I0a5c, I0b00 |
| 31680 | 20 | I0067, I0165, I0274, I032f, I040d, I050a, I066c, I0765, I086e, I0967, I0a74, I0b68 |

Step S320' (frequency correction and second filtering of frequent items): for frequent items below level 4, that is, those containing fewer than four single items, the frequency is corrected based on the starting position and the average distance between the single items, and the second filtering is then performed.

See the embodiment for the method of correcting the frequencies. The frequent items after correction are shown in Table 4.

Table 4:

| No. | Frequency | Frequent items (all levels) |
| --- | --- | --- |
| 2 | 686 | I0038 |
| 3 | 137 | I040d |
| 4 | 10 | I0955 |
| 5 | 9 | I0a20 |
| 5425 | 720 | I0038, I0100, I0200, I0300, I0500, I0600, I0700 |
| 5426 | 228 | I0100, I0200, I0300, I040d, I0500, I0600, I0700 |
| 10000 | 718 | I0038, I0100, I0200, I0300, I0500, I0600, I0700, I0b00 |
| ... | ... | ... |
| 31678 | 84 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09dc, I0a5c, I0b00 |
| 31679 | 63 | I0038, I0100, I0200, I0300, I0472, I0500, I0600, I0700, I0864, I09e5, I0a5c, I0b00 |
| 31680 | 20 | I0067, I0165, I0274, I032f, I040d, I050a, I066c, I0765, I086e, I0967, I0a74, I0b68 |

The frequencies of entries 2, 3, 4 and 5 in Table 4 have been corrected. The other frequent items all contain four or more single items, so their frequencies are not corrected.

After the frequency correction, for any two frequent items where one contains the other, the one with the smaller frequency is filtered out. The results are listed in Table 5.
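The inclusion filter can be sketched as a pairwise comparison. The example uses a subset of the rows of Table 4 (entries 2, 3, 5425, 5426 and 10000) and is only an illustration: the function name is hypothetical, and leaving both items in place when their frequencies are equal is an assumption, since the text does not say what happens on a tie.

```python
def filter_included(frequent_items):
    """Drop the less frequent of any two frequent items where one contains the other.

    frequent_items: dict mapping a frozenset of single items to its frequency.
    """
    entries = list(frequent_items.items())
    removed = set()
    for i, (items_a, freq_a) in enumerate(entries):
        for items_b, freq_b in entries[i + 1:]:
            # Proper-subset test in either direction: an inclusion relation.
            if items_a < items_b or items_b < items_a:
                if freq_a < freq_b:
                    removed.add(items_a)
                elif freq_b < freq_a:
                    removed.add(items_b)
    return {k: v for k, v in frequent_items.items() if k not in removed}


table4_subset = {
    frozenset({"I0038"}): 686,
    frozenset({"I040d"}): 137,
    frozenset("I0038 I0100 I0200 I0300 I0500 I0600 I0700".split()): 720,
    frozenset("I0100 I0200 I0300 I040d I0500 I0600 I0700".split()): 228,
    frozenset("I0038 I0100 I0200 I0300 I0500 I0600 I0700 I0b00".split()): 718,
}
survivors = filter_included(table4_subset)
```

On this subset only entries 5425 and 5426 survive: I0038 and entry 10000 lose to entry 5425, and I040d loses to entry 5426.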

Table 5:

| No. | Frequency | Frequent items (all levels) |
| --- | --- | --- |
| 5425 | 720 | I0038, I0100, I0200, I0300, I0500, I0600, I0700 |
| 5426 | 228 | I0100, I0200, I0300, I040d, I0500, I0600, I0700 |
| 31680 | 20 | I0067, I0165, I0274, I032f, I040d, I050a, I066c, I0765, I086e, I0967, I0a74, I0b68 |

Step S400': secondary mining and filtering of quasi-features.

Step S410': for each frequent item remaining after the second filtering, mine whether it also satisfies an absolute packet length relationship and a relationship between packet length and content difference. If such a relationship exists, the newly found feature is appended and the old feature (the frequent item) is corrected.

The frequent item with serial number 5425 in Table 5 is converted into a feature string with the following meaning: byte 0 of the application layer payload (numbered from zero) is 0x38, and bytes 1, 2, 3, 5, 6 and 7 are zero. Its frequency is recorded as Support = 720.
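Reading a frequent item as a byte pattern amounts to decoding each single item back into a position and a value. The following sketch shows this for frequent item 5425; the function names are hypothetical and the payload bytes are taken from the rows of Table 1.

```python
def decode_item(item):
    """'I0038' -> (0, 0x38): byte position and byte value, both hexadecimal."""
    if len(item) != 5 or item[0] != "I":
        raise ValueError(f"not an encoded single item: {item!r}")
    return int(item[1:3], 16), int(item[3:5], 16)

def to_pattern(frequent_item):
    """Turn a frequent item into a {position: value} byte pattern."""
    return dict(decode_item(item) for item in frequent_item)

def satisfies(payload, pattern):
    """True if the payload has the required value at every pattern position."""
    return all(
        pos < len(payload) and payload[pos] == val
        for pos, val in pattern.items()
    )


pattern_5425 = to_pattern(
    ["I0038", "I0100", "I0200", "I0300", "I0500", "I0600", "I0700"]
)
```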

Taking this feature string as an example, the absolute packet length feature is mined as follows.

For example, a structure can be defined to record the distribution of packet lengths among the transactions that satisfy the feature string, as shown in Table 6.

Table 6:

| Packet length | 0 | 1 | 2 | 3 | ... | 1497 | 1498 | 1499 | 1500 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Count | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 |

Scan the transaction database and, for each transaction that satisfies the feature string, record its packet length (for example, if the packet length is 21, add 1 to the counter for packet length 21). After the scan, take the largest counter value Count_max and its corresponding packet length P_len. If (Count_max/Support) ≥ P₅% (a fixed percentage such as 95%), the transactions that satisfy the feature string are considered to also have the absolute packet length P_len, and this feature is appended to the feature string to form a combined feature.
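A compact sketch of the packet length histogram follows. It assumes that each transaction is a pair of (encoded items, packet length) and that Support equals the number of transactions matching the feature string; both are simplifications, and the function name is hypothetical.

```python
from collections import Counter

def mine_absolute_length(transactions, pattern, p5=95.0):
    """Return P_len if one packet length covers at least p5 percent of the
    transactions that contain every item in pattern, otherwise None."""
    lengths = Counter(
        p_len for items, p_len in transactions if pattern <= set(items)
    )
    support = sum(lengths.values())
    if support == 0:
        return None
    p_len, count_max = lengths.most_common(1)[0]
    if 100.0 * count_max / support >= p5:
        return p_len
    return None
```

A `Counter` plays the role of the fixed array of Table 6; only packet lengths that actually occur take up space.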

Taking the feature string converted from frequent item 5425 as an example, the feature describing the difference between packet length and content is mined as follows.

Let N be the maximum number of single items that a transaction in the transaction database can contain. For a transaction that contains N single items and satisfies the feature string, (N-1) big-endian two-byte unsigned integers and (N-1) little-endian two-byte unsigned integers can be extracted and combined (see the embodiment for the detailed combination method). They can be represented as follows. Table 7 holds the (N-1) big-endian unsigned integers; the index also denotes the position. For example, if a transaction contains the three single items I0001, I0102 and I0203, they form two big-endian and two little-endian two-byte unsigned integers. The big-endian integer combined from I0001 and I0102 is 0x0102 = 258, with index 0 (the position of the lower-positioned of the two combined single items); the little-endian integer combined from I0001 and I0102 is 0x0201 = 513, also with index 0. Table 8 holds the (N-1) little-endian two-byte unsigned integers.
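The combination of adjacent single items can be illustrated with standard two-byte packing, under which the bytes 0x01 and 0x02 give 0x0102 = 258 in big-endian order and 0x0201 = 513 in little-endian order. The detailed combination method is described elsewhere in the embodiment, so the sketch below is only an assumed reading in which each pair of items at consecutive positions is combined; the function name is hypothetical.

```python
def adjacent_pairs(items):
    """Return (position, big_endian, little_endian) for each pair of single
    items at consecutive byte positions."""
    decoded = sorted((int(i[1:3], 16), int(i[3:5], 16)) for i in items)
    pairs = []
    for (pos_a, val_a), (pos_b, val_b) in zip(decoded, decoded[1:]):
        if pos_b == pos_a + 1:
            raw = bytes([val_a, val_b])
            pairs.append(
                (pos_a, int.from_bytes(raw, "big"), int.from_bytes(raw, "little"))
            )
    return pairs


pairs = adjacent_pairs(["I0001", "I0102", "I0203"])
```

Three single items give two pairs, matching the (N-1) count in the text.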

Table 7: the (N-1) big-endian unsigned integers

| Index (position) | Value (big-endian) |
| --- | --- |
| 0 | BigVal₀ |
| 1 | BigVal₁ |
| ... | ... |
| N-2 | BigVal_N-2 |
| N-1 | BigVal_N-1 |

Table 8: the (N-1) little-endian unsigned integers

| Index (position) | Value (little-endian) |
| --- | --- |
| 0 | LittVal₀ |
| 1 | LittVal₁ |
| ... | ... |
| N-2 | LittVal_N-2 |
| N-1 | LittVal_N-1 |

The packet length of each transaction in the database is denoted P_len. Two further global structures are needed to record the differences between the packet length and the 2×(N-1) values above. For example, the structure shown in Table 9 can record the difference between the P_len of each transaction and its (N-1) big-endian two-byte unsigned integers.

Table 9: difference relationship table

| Index (position) | -n | -(n-1) | ... | 0 | 1 | ... | (n-1) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| N-2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| N-1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

In Table 9, n is a positive integer. Let MinVal_i = P_len - BigVal_i. If -n ≤ MinVal_i < n, add 1 to the counter in row i, column MinVal_i. After one scan of the database, take the largest counter value Count_Max, located at row Pos and column Val of the structure in Table 9. If (Count_Max/Support) ≥ P₆% (a fixed percentage such as 95%), the transactions that satisfy the feature string are considered to also satisfy the following feature: the packet length minus the big-endian two-byte unsigned integer formed by bytes Pos and (Pos+1) of the application layer payload equals Val. Check whether the feature string contains values at position Pos or (Pos+1); if so, delete the values at these two positions from the feature string. Then append the packet length/content difference feature to the feature string to form a combined feature.
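The difference table can be sketched with a counter keyed by (position, difference). The example below works on raw payload bytes rather than encoded items, takes Support to be the number of transactions passed in (all assumed to satisfy the feature string already), and uses n = 64 as an arbitrary window; these choices and the function name are assumptions.

```python
from collections import Counter

def mine_length_difference(transactions, n=64, p6=95.0, byteorder="big"):
    """Find (Pos, Val) such that P_len minus the two-byte integer at Pos
    equals Val in at least p6 percent of the transactions.

    transactions: list of (payload_bytes, packet_length) pairs.
    """
    support = len(transactions)
    counts = Counter()
    for payload, p_len in transactions:
        for pos in range(len(payload) - 1):
            value = int.from_bytes(payload[pos:pos + 2], byteorder)
            diff = p_len - value
            if -n <= diff < n:  # only differences inside the table are counted
                counts[(pos, diff)] += 1
    if not counts:
        return None
    (pos, val), count_max = counts.most_common(1)[0]
    if 100.0 * count_max / support >= p6:
        return pos, val
    return None


# First twelve payload bytes and packet lengths of transactions 1 and 2 in Table 1.
sample = [
    (bytes.fromhex("380000000d00000084ab0c00"), 21),
    (bytes.fromhex("380000007200000064e55c00"), 122),
]
```

On these two transactions the big-endian scan finds position 3 with difference 8 and the little-endian scan finds position 4 with difference 8, which agrees with the combined feature listed for entry 5425.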

Scan the transaction database again to recompute, for each corrected feature: its frequency; the total number of data packets of all flows satisfying the feature as a percentage of the total number of data packets of the protocol (the packet recognition rate of the quasi-feature); and the total number of bytes of all flows satisfying the feature as a percentage of the total number of bytes of the protocol (the byte recognition rate of the quasi-feature). Transactions that satisfy the feature are marked. The quasi-features corrected by the secondary mining are shown in Table 10.

Table 10:

| No. | Frequency | Packet recognition rate | Byte recognition rate | Quasi-feature |
| --- | --- | --- | --- | --- |
| 5425 | 720 | 99.5% | 99.94% | I0038, I0100, I0200, I0600, I0700*BE*BG:3*MV:8*LE*BG:4*MV:8 |
| 5426 | 228 | 18.79% | 20.58% | I0100, I0200, I0300, I040d, I0500, I0600, I0700*APL:21 |
| 31680 | 20 | 0.04% | 0% | I0067, I0165, I0274, I032f, I040d, I050a, I066c, I0765, I086e, I0967, I1074, I1168*APL:75 |

Step S420': filter weak features and repeatedly mined features.

Filter out quasi-features that contain only a single first-level frequent item. Such a quasi-feature occurs by chance with probability 1/256 and would cause a high misidentification rate.

Filter out repeatedly mined features (see the embodiment for the concept). The quasi-features with serial numbers 5425 and 5426 obtained in the previous step were mined from essentially the same transactions: in 99% of the cases where quasi-feature 5426 appears, quasi-feature 5425 appears as well. Quasi-feature 5426 is therefore considered a repeatedly mined feature and is filtered out.

Filter out quasi-features whose packet recognition rate or byte recognition rate is less than p₂%. Such a feature is called a suspected confounding feature; it probably arises because a small amount of traffic from other protocols is mixed into the training data packet set. Experiments show that many features with a recognition rate below 1% cause serious misidentification between protocols; filtering them out has essentially no effect on the recognition rate but greatly reduces the misidentification rate. The quasi-features with serial numbers 3 and 4 were filtered out according to this rule.

Only the following feature now remains, as shown in Table 11.

Table 11:

| No. | Frequency | Packet percentage | Byte percentage | Application layer feature |
| --- | --- | --- | --- | --- |
| 5425 | 720 | 99.502% | 99.939% | I0038, I0100, I0200, I0600, I0700*BE*BG:3*MV:8*LE*BG:4*MV:8 |

I0038, I0100, I0200, I0600, I0700 is the feature string: bytes 0 to 2 are 0x38, 0x00, 0x00, and bytes 6 and 7 are 0x00, 0x00. *BE*BG:3*MV:8 expresses a relationship between packet length and content: *BE means big-endian byte order, *BG:3 means the starting position is byte 3, and *MV:8 means that the packet length minus the big-endian two-byte unsigned integer formed by bytes 3 and 4 is 8. *LE*BG:4*MV:8 expresses a second such relationship: *LE means little-endian byte order, *BG:4 means the starting position is byte 4, and *MV:8 means that the packet length minus the little-endian two-byte unsigned integer formed by bytes 4 and 5 is 8.
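To make the notation concrete, the sketch below parses a combined feature and checks one packet against it. The grammar is inferred from the examples in Tables 10 and 11 only (comma-separated items, then `*`-separated BE/LE, BG, MV and APL tokens written with ASCII punctuation), so the parser, the function names and the choice to pass the packet length as a separate argument are assumptions.

```python
def parse_feature(feature):
    """Split a combined feature into fixed bytes, difference rules and APL."""
    head, *tokens = feature.split("*")
    fixed = {
        int(item.strip()[1:3], 16): int(item.strip()[3:5], 16)
        for item in head.split(",")
    }
    diffs, abs_len = [], None
    order = begin = None
    for token in tokens:
        key, _, arg = token.partition(":")
        if key in ("BE", "LE"):
            order = "big" if key == "BE" else "little"
        elif key == "BG":
            begin = int(arg)
        elif key == "MV":
            diffs.append((order, begin, int(arg)))
        elif key == "APL":
            abs_len = int(arg)
    return fixed, diffs, abs_len

def matches(payload, packet_length, feature):
    """True if the payload and packet length satisfy every part of the feature."""
    fixed, diffs, abs_len = parse_feature(feature)
    if abs_len is not None and packet_length != abs_len:
        return False
    if any(p >= len(payload) or payload[p] != v for p, v in fixed.items()):
        return False
    for order, begin, diff in diffs:
        if begin + 2 > len(payload):
            return False
        value = int.from_bytes(payload[begin:begin + 2], order)
        if packet_length - value != diff:
            return False
    return True


FEATURE_5425 = "I0038,I0100,I0200,I0600,I0700*BE*BG:3*MV:8*LE*BG:4*MV:8"
```

For transaction 1 of Table 1 (packet length 21, bytes 3 to 5 equal to 0x00, 0x0d, 0x00) both difference rules give 21 - 13 = 8, so the packet matches.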

Step S500': generate the feature file of the protocol from the mined protocol identification features. In this example the feature file is an XML file. Once the feature file has been generated, the traffic identification program can be run for online real-time identification.

The following is an example of the feature file generated from the features mined above:

<content_length>
<byte_number>10</byte_number>
<differ_value>8</differ_value>
</content_length>
</byte>
</byte>
</byte>
</byte>
</byte>
</type>
<type value="2">
<content_length>
<byte_number>2</byte_number>
<differ_value>8</differ_value>
</content_length>
</byte>
</byte>
</byte>
</byte>
</byte>
</type>
</tcp>

4. The method of the present invention not only turns the current manual analysis of protocol identification features into an automatic, computer-driven mining of application layer protocol identification features; it also improves the accuracy, completeness and reliability of the resulting features. Previously, the misidentification rate between protocols rose as the number of identified protocols grew. This method reduces the misidentification rate between protocols to a very small range, which greatly improves reliability.

Other aspects and features of the present invention will be apparent to those skilled in the art from the above description of specific embodiments of the present invention in conjunction with the accompanying drawings.

The specific embodiments of the present invention have been described and illustrated above. These embodiments are exemplary only and are not intended to limit the present invention, which should be interpreted according to the appended claims.

Claims

1. An application layer protocol identification feature mining method, characterized in that it comprises the following steps:

Step A: filter the training data packet set for the first time, encode it, and extract quasi-protocol identification feature data information;

Step B: perform a first mining on the extracted quasi-protocol identification feature data information to obtain multi-level frequent itemsets;

Step C: perform a first filtering on the multi-level frequent itemsets; correct the frequency of the multi-level frequent itemsets remaining after the first filtering and mine them a second time; and perform a second filtering on them to obtain the final protocol identification features.

2. The application layer protocol identification feature mining method according to claim 1, characterized in that it further comprises the following step:

Step D: if the byte recognition rate of all final protocol identification features meets the requirement, or the sum of the data packet recognition rates meets the requirement, the data of the second and subsequent data packets is not mined; otherwise, the second and subsequent data packets are mined in a loop until the total recognition rate meets the requirement.

3. The application layer protocol identification feature mining method according to claim 1 or 2, characterized in that step A comprises the following steps:

A1. Capture the training data packet set, divide it by flow, and store it in flow structures;

A2. Filter the mixed traffic out of the training data packet set using the mixed traffic filtering method;

A3. Encode the application layer payload by encoding the bytes of the extracted data packets based on their position;

A4. From the encoded data, extract the quasi-protocol identification feature data information.

4. The application layer protocol identification feature mining method according to claim 3, characterized in that, in step A2, the mixed traffic filtering method comprises the following steps:

A21. Filter out the content of flows that conform to the HTTP protocol and the FTP protocol;

A22. Filter out TCP flows that do not have a complete three-way handshake.

5. The application layer protocol identification feature mining method according to claim 4, characterized in that, in step A21, filtering out the content of flows that conform to the FTP protocol comprises the following steps:

Filter out the data packets of flow structures that communicate in PASV mode;

Filter out the data packets of flow structures that use ports 20 and 21.

6. The application layer protocol identification feature mining method according to claim 5, characterized in that the method for determining which flow structures communicate in FTP PASV mode comprises the following steps:

Find a flow structure that uses port 21, and determine whether any of the data packets belonging to it starts with 227;

If so, further determine whether that data packet is a PASV-mode response packet. If it is, the data packet contains the IP address and port number on which the server will accept the PASV-mode data connection from the client, the destination IP address of the data packet is the IP address of the client, and the FTP data connection uses the TCP protocol; record these four items;

After all flow structures that use port 21 have been traversed, the information on all flows that communicate in FTP PASV mode has been obtained.

7. The application layer protocol identification feature mining method according to claim 6, characterized in that filtering out the data packets of flow structures that communicate in PASV mode comprises the following step:

Compare each flow with the recorded information on FTP PASV-mode data connection flows. If four items of the five-tuple in the flow structure are the same as the recorded PASV-mode data connection flow information, the flow is considered to be a flow communicating in FTP PASV mode, and all data packets in the flow are discarded.

8. The application layer protocol identification feature mining method according to claim 3, characterized in that, in step A3, the method of encoding the bytes of the extracted data packets based on their position comprises the following step:

Encode each byte value in the data packet, represented by two hexadecimal digits, as a single item of 5 characters. The first character from the left is I, standing for Item. The second and third characters give the position of the byte within the first N extracted bytes, in hexadecimal, counting from zero; if the position is less than sixteen, the second character is zero. The fourth and fifth characters are the two hexadecimal digits of the original byte.

9. The application layer protocol identification feature mining method according to claim 3, characterized in that, in step A4, the quasi-protocol identification feature data information comprises the encoded information of data packets with the same offset and related auxiliary statistical information;

The extraction of the quasi-protocol identification feature data information comprises the following steps:

A41. Extract the quasi-protocol identification feature data information from the TCP flows and import it into the transaction database;

A42. Extract the quasi-protocol identification feature data information from the UDP flows and import it into the transaction database.

10. The application layer protocol identification feature mining method according to claim 3, characterized in that step B comprises the following steps:

Step B1: first set an initial frequency rate; multiply it by the total number of transactions in the database and round the result to an integer to obtain the initial frequency; compute the frequency of each single item in the database and filter out the single items whose frequency is less than the initial frequency; each remaining single item is called a first-level frequent item, and the set of all first-level frequent items is called the first-level frequent itemset;

Step B2: select two (K-1)-level frequent items A and B from the (K-1)-level frequent itemset such that the first K-2 single items of A are the same as the first K-2 single items of B and the (K-1)-th single item of A differs from the (K-1)-th single item of B. If the position of the (K-1)-th single item of B is greater than the position of the (K-1)-th single item of A, append the (K-1)-th single item of B to the end of A to obtain a quasi-K-level frequent item, and compute its frequency; if the frequency is greater than or equal to the initial frequency, it is a K-level frequent item. All K-level frequent items obtained in this way form the K-level frequent itemset, where K ≥ 2.

11. The application layer protocol identification feature mining method according to claim 10, characterized in that the transaction comprises four attribute fields, namely encoded byte, packet length, byte percentage and packet percentage, which store the extracted data.

12. The application layer protocol identification feature mining method according to claim 1, characterized in that step C comprises the following steps:

C1. Using frequency correction and frequent item filtering methods, performing the first frequent item filtering, the frequency correction and the second frequent item filtering on the multi-level frequent itemset;

C2. Converting the frequent items in the corrected and filtered frequent itemset into feature strings, mining the absolute packet length feature and the feature of the relationship between packet length and content difference, marking the corresponding transactions, and then filtering the quasi-protocol identification feature data information corresponding to those transactions to obtain the final protocol identification features.

13. The application layer protocol identification feature mining method according to claim 12, characterized in that step C1 comprises the following steps:

Step C11, performing the first frequent item filtering on the multi-level frequent itemset;

Step C12, performing frequency correction and the second frequent item filtering on the multi-level frequent itemset remaining after the first frequent item filtering, so as to eliminate inclusion relationships between frequent items.

14. The application layer protocol identification feature mining method according to claim 13, characterized in that step C12 comprises the following steps:

Step C121, correcting the frequency of the level-1 frequent items using the following formula;

$$freq_{new} = \begin{cases} k_{1} \times freq_{old}, & pos_{0} = 0,\ 0.9 \leq k_{1} < 1 \\ f(pos_{0}) \times freq_{old}, & pos_{0} \neq 0,\ 0 < f(pos_{0}) \leq k_{1} \end{cases}$$

where freq_new denotes the corrected frequency, freq_old denotes the frequency of the frequent item before correction, pos_i denotes the position of the i-th single item in the frequent item (i starts from zero), and k denotes the number of single items in the frequent item; single-item positions are numbered from zero, k_1 is a constant, and f(pos_0) is a continuous, monotonically decreasing function;

Step C122, correcting the frequency of the level-2 frequent items using the following formula;

where k_2 is a constant and f(pos_1 - pos_0) is a continuous, monotonically decreasing function;

Step C123, first computing the average distance between adjacent single items in a frequent item, denoted ave_dist, using the following formula;

$$ave_{dist} = \frac{\sum_{i=1}^{k-1} (pos_{i} - pos_{i-1})}{k-1}$$

The distance is the absolute value of the difference between the positions of two adjacent single items in the frequent item; if ave_dist ≠ 1, ave_dist is corrected using the following formula;

where k_3 and k_4 are constants;

then the frequencies of the level-3 and level-4 frequent items are corrected using the following formula;

where f_1(k) is a continuous, monotonically increasing function of k; f_2(ave_dist) and f_3(ave_dist) are continuous, monotonically decreasing functions of ave_dist, and satisfy f_2(ave_dist) > f_3(ave_dist);

Step C124, among frequent items that have an inclusion relationship, filtering out those with the smaller frequency.
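The parts of steps C121 to C124 whose definitions are stated in the text can be sketched as follows. This is an illustration under loud assumptions: the claim only requires f(pos_0) to be continuous, monotonically decreasing and bounded by 0 < f(pos_0) ≤ k_1, so the concrete choice `k1 / (1 + decay * pos0)`, the value `k1 = 0.95` and the `decay` parameter are hypothetical. The level-2 and higher-level correction formulas are not sketched. All function names are illustrative.

```python
def correct_level1(freq_old, pos0, k1=0.95, decay=0.1):
    """Step C121 with an ASSUMED decreasing f(pos0) = k1 / (1 + decay * pos0)."""
    if pos0 == 0:
        return k1 * freq_old
    # For pos0 > 0 the assumed f stays in (0, k1) and decreases with pos0.
    return k1 / (1.0 + decay * pos0) * freq_old


def average_distance(positions):
    """Step C123: mean gap between adjacent single-item positions.

    `positions` must be in increasing order and contain at least two entries.
    """
    k = len(positions)
    return sum(positions[i] - positions[i - 1] for i in range(1, k)) / (k - 1)


def drop_included(freqs):
    """Step C124: for each pair where one frequent item is contained in the
    other, drop the one with the smaller corrected frequency.

    `freqs` maps a frequent item (tuple of (position, code) single items) to
    its corrected frequency. Ties keep both items (an assumption; the text
    does not say how ties are handled).
    """
    dropped = set()
    items = list(freqs)
    for a in items:
        for b in items:
            if a != b and set(a) < set(b):
                if freqs[a] < freqs[b]:
                    dropped.add(a)
                elif freqs[b] < freqs[a]:
                    dropped.add(b)
    return {item: f for item, f in freqs.items() if item not in dropped}
```

With the assumed f, a level-1 item at offset 0 keeps 95% of its frequency, while an item further into the packet is discounted more heavily, which matches the stated intent that f decreases with position.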

15. The application layer protocol identification feature mining method according to claim 12, characterized in that step C2 comprises the following steps:

Step C21, converting the frequent items in the corrected and filtered frequent itemset into feature strings, retrieving from the transaction database the transactions that satisfy each feature string, mining the absolute packet length feature and the feature of the relationship between packet length and content difference, and marking the transactions that satisfy the absolute packet length feature and the feature of the relationship between packet length and content difference;

Step C22, filtering the quasi-protocol identification feature data information corresponding to the marked transactions to remove repeatedly mined features, weak features and suspected mixed features, so as to obtain the final protocol identification features.
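One possible reading of step C21 is sketched below. It rests on assumptions: the text does not define the feature string format, so the hex rendering with `??` for unconstrained offsets is hypothetical, as are the function names. Tallying the lengths of matching payloads is shown only as a plausible starting point for spotting an absolute packet length feature, not as the claimed mining procedure.

```python
from collections import Counter


def to_feature_string(frequent_item):
    """Render (offset, byte) pairs as hex, using '??' for unconstrained offsets."""
    by_offset = dict(frequent_item)
    last = max(by_offset)
    return " ".join(
        f"{by_offset[i]:02x}" if i in by_offset else "??" for i in range(last + 1)
    )


def matches(payload, frequent_item):
    """True if the payload carries every required byte at its offset."""
    return all(
        off < len(payload) and payload[off] == byte for off, byte in frequent_item
    )


def matching_lengths(payloads, frequent_item):
    """Tally the lengths of the payloads that satisfy the feature."""
    return Counter(len(p) for p in payloads if matches(p, frequent_item))
```

If nearly all matching payloads share one length, that length is a candidate for an absolute packet length feature; how dominant it must be is left open here.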