CN110290022B

CN110290022B - Unknown application layer protocol identification method based on adaptive clustering

Info

Publication number: CN110290022B
Application number: CN201910548327.5A
Authority: CN
Inventors: 洪征; 龚启缘; 冯文博; 李毅豪; 林培鸿; 周振吉; 付梦琳
Original assignee: Army Engineering University of PLA
Current assignee: Army Engineering University of PLA
Priority date: 2019-06-24
Filing date: 2019-06-24
Publication date: 2021-02-26
Anticipated expiration: 2039-06-24
Also published as: CN110290022A

Abstract

The invention provides an unknown application layer protocol identification method based on adaptive clustering, which comprises the following steps: data preprocessing, similarity calculation and unknown application layer protocol clustering. The method reassembles network flows from the acquired original network data, automatically extracts the application layer protocol data of each network flow, calculates the similarity between the application layer protocol data as the basis for identifying the application layer protocol, and finally clusters the application layer protocol data of the network flows with a clustering algorithm, thereby identifying unknown application layer protocols. The method can effectively improve the accuracy of unknown application layer protocol identification, avoids the training process of a protocol identification model, and has a wide range of application.

Description

Unknown application layer protocol identification method based on adaptive clustering

Technical Field

The invention relates to the technical field of networks, in particular to an unknown application layer protocol identification method based on self-adaptive clustering.

Background

Application layer protocol identification means extracting, from network traffic carrying application layer protocol data, the key features that identify an application layer protocol, and grouping application layer protocol data of the same type together based on those features.

An unknown protocol is a protocol whose specification is unknown. For reasons such as security and copyright protection, many manufacturers do not publish the details of the communication protocols used by their software. Many malware programs also communicate using protocols designed by their authors. Most of these protocols are application layer protocols whose specifications are not disclosed, and they are therefore unknown application layer protocols.

The application layer protocol identification technology is an important basis for network service providers and network administrators to provide differentiated service quality guarantee, implement intrusion detection, monitor traffic and the like.

According to the identification method used, current application layer protocol identification technologies include port number-based traffic classification, Deep Packet Inspection (DPI)-based traffic classification, host behavior-based traffic classification, and Deep Flow Inspection (DFI)-based traffic classification.

The traffic classification method based on port numbers distinguishes commonly used protocols such as HTTP, FTP and TELNET according to the port assignments recommended by IANA (Internet Assigned Numbers Authority). The method is simple, fast and convenient to implement. However, as networks have developed, the number of available port numbers can no longer meet the needs of a growing number of network applications. Dynamic port technology has emerged as a result, and the port number used by a given protocol is no longer fixed and may change. The traffic classification method based on port numbers cannot effectively identify a protocol that uses dynamic port technology.

The traffic classification method based on deep packet inspection identifies and classifies application layer protocols by extracting the payload features of application layer data packets and comparing them with a pre-constructed library of application layer payload features. The method solves the problem of non-fixed protocol port numbers and can identify protocols that use dynamic ports, but it cannot identify protocols whose specifications are unknown, nor encrypted protocols. In addition, the method must parse the application layer payload and maintain a very large payload feature library, so it occupies system resources and is computationally expensive.

The traffic classification method based on host behavior combines traffic features of the network layer, the transport layer and the application layer, and distinguishes different network protocols by analyzing host behavior. The method improves the accuracy of network traffic classification to a certain extent, but it cannot accurately identify application layer protocols whose specifications are unknown, nor encrypted protocols.

In recent years, researchers have proposed a traffic classification method based on deep flow inspection, which holds that, from a statistical point of view, the flow features corresponding to each protocol are unique. The flow feature information here includes the flow length, the flow duration, the transmission interval of data packets in the flow, and the like. Traffic of different protocols can be identified by analyzing the statistical features of flows during the interaction between protocol entities. The method only needs to extract the statistical features of the flow and does not analyze the application layer payload, so its computational complexity is low. The method also does not need to maintain a feature library and consumes fewer resources. In addition, such methods can identify encrypted traffic and unknown protocol traffic. However, these methods use distance as the measure of dissimilarity, which cannot accurately measure the dissimilarity between data points located in regions of different density, so their accuracy in traffic identification is low.

With the rise of technologies such as big data and machine learning, researchers have introduced machine learning methods into the field of network traffic classification to make up for the shortcomings of traditional network traffic classification technology. Machine learning methods fall into three main types: supervised learning, unsupervised learning and semi-supervised learning. Supervised learning computes and analyzes a given labeled data set to obtain the mapping between labels and data features, thereby training a model that can label unlabeled data. Supervised learning mainly comprises classification and regression. Unsupervised learning analyzes the features of a given unlabeled data set and partitions it into several sets, where the data within each set are typically similar in certain features. The representative unsupervised learning method is clustering. Semi-supervised learning lies between supervised and unsupervised learning, and the data set it uses is a mixture of a small amount of labeled data and a large amount of unlabeled data. It can be subdivided into four kinds of methods: classification, regression, clustering and dimension reduction.

Some researchers have applied supervised learning methods to the field of network traffic classification. In the research, classification models such as a convolutional neural network are utilized, marked network traffic data are adopted for model training, and the trained models can accurately classify the network traffic. However, such methods can only handle known protocols. If the protocol information is unknown, the network traffic for these protocols is difficult to classify.

Applying an unsupervised learning method to the field of network traffic classification is a feasible way to solve the above problems. However, unsupervised network traffic classification is still at an early stage. When an unsupervised learning method is applied to network traffic classification, three questions still lack in-depth research: how to convert network traffic into the input of a clustering method, how to calculate the similarity between network traffic, and how to improve the clustering accuracy once the similarity calculation problem has been solved.

Disclosure of Invention

The purpose of the invention is as follows: in order to solve the above technical problems, the invention provides an unknown application layer protocol identification method based on adaptive clustering. The method takes the application layer protocol data of network flows as the object of analysis. Network data of the same protocol has a certain similarity, and this similarity is used to distinguish different application layer protocols. The invention reassembles network flows from the collected original network data, extracts the application layer protocol data of each network flow, calculates the similarity between the protocol data, and takes that similarity as the basis for protocol identification. Finally, it clusters the application layer protocol data of the network flows with an improved hierarchical clustering method to identify unknown application layer protocols. The invention addresses the current difficulty of identifying unknown application layer protocols in the field of protocol identification and effectively improves the accuracy of unknown application layer protocol identification. In addition, the invention avoids the training process used in application layer protocol identification and can be used directly, without training a classification model.

The technical scheme is as follows: in order to achieve the technical effects, the invention provides the following technical scheme:

an unknown application layer protocol identification method based on adaptive clustering comprises the following steps:

(1) a data preprocessing stage: preprocessing the acquired network traffic data, and extracting application layer protocol data in the acquired network traffic data;

(2) a similarity calculation stage: selecting a fixed-length segment of data at the front of each piece of application layer protocol data as the input of the similarity calculation, and calculating the similarity between the application layer protocol data;

(3) unknown application layer protocol clustering stage: performing cluster initialization processing on the extracted application layer protocol data, dividing each piece of application layer protocol data into a cluster, performing cluster initialization by taking the similarity between the application layer protocol data calculated in the step (2) as initial inter-cluster similarity, and performing inter-cluster similarity calculation on a cluster class set obtained by initialization, wherein in the process of calculating the inter-cluster similarity, when the similarity between the application layer protocol data needs to be calculated, the similarity between the application layer protocol data calculated in the step (2) is directly inquired; and repeatedly iterating by using a clustering algorithm until a clustering stopping condition is reached, and finally outputting a cluster set to describe the classification information of the application layer protocol corresponding to the network flow.

Further, the workflow of the data preprocessing stage is as follows: data preprocessing can be subdivided into three sub-steps. The first sub-step is data filtering and sorting: network traffic carrying an application layer protocol payload is extracted through data filtering and then sorted according to transport layer protocol and port number information, so that data packets that may belong to the same flow are roughly gathered together. The second sub-step is flow reassembly, which obtains the information of each network flow on the basis of the data packets. The third sub-step is application layer protocol data extraction, which extracts the application layer payload.

The first sub-step, data filtering and sorting. Not all packets contain application layer data, but protocol identification is concerned only with application layer data, so traffic that does not contain an application layer protocol payload can be removed from the original network traffic through data filtering. Non-IP data packets can be filtered out by reading the FrameType field in the data link layer frame header, and non-TCP and non-UDP data packets can be filtered out by reading the Protocol field in the network layer IP header. For a unidirectional network flow, the IP addresses and port numbers of its packets are usually fixed, so the network traffic can be sorted according to packet header information such as the source IP, the destination IP, the source port number and the destination port number, and packets that may belong to the same flow are grouped together. It should be noted that the sorting operation only gathers together packets that may belong to the same network flow, which improves the efficiency of subsequent processing.

The second sub-step, stream reassembly. The purpose of stream reassembly is to merge the data packets belonging to the same network stream, that is, to splice together the application layer protocol data of those packets, so as to obtain a network stream containing complete application layer protocol data. For a TCP stream, both the client and the server send a handshake packet with the SYN flag bit set to 1 when the TCP connection is established, and a teardown packet with the FIN flag bit set to 1 when the connection is released. The start and end of a TCP stream can therefore be identified from the SYN and FIN flag bits in the TCP header. The sequence number in the TCP header and the length of the payload data are then used to reassemble the arriving data packets into an ordered stream. A UDP stream has no connection establishment or release procedure, so its start and end cannot be identified from the UDP header. The invention instead sets a maximum stream duration and identifies the start and end of a UDP stream from the sending times of the data packets. Starting from the first data packet, whose sending time is taken as the stream start time, the difference between the sending time of each captured data packet and the stream start time is calculated. If the difference is less than the maximum stream duration, the captured data packet belongs to the current UDP stream. If the difference is greater than the maximum stream duration, the UDP stream is considered to have ended, the captured data packet belongs to the next UDP stream, and its sending time is taken as the new stream start time. These steps are repeated until all data packets have been merged.

The third sub-step, application layer protocol data extraction. According to the characteristics of TCP streams and UDP streams, the payload portions of the TCP stream and the UDP stream are extracted as the application layer protocol data and stored for subsequent analysis.

Further, the workflow of the similarity calculation stage is as follows: the clustering algorithm divides the data set into a plurality of different clusters, the data in the clusters have higher similarity, and the similarity difference of the data among the different clusters is larger. The clustering algorithm firstly carries out cluster initialization on a data object set, namely, each data object is marked as a cluster, then the similarity between the clusters is calculated, whether two clusters are combined or not is determined by comparing the similarity between the clusters with a set threshold value, iteration is repeated, and finally clustering of the data set is finished.

The research behind the invention finds that a clustering algorithm repeatedly calculates the similarity between objects when it calculates the similarity between clusters. Consider, for example, the object set {A1, A2, A3, B1}, where A1, A2 and A3 are objects of the same type and B1 is an object of another type. After the clustering algorithm initializes the clusters, the cluster set {{A1}, {A2}, {A3}, {B1}} is obtained. The first clustering loop calculates the inter-cluster similarity among the clusters {A1}, {A2}, {A3} and {B1}. Assume that after the first clustering loop the cluster {A1} and the cluster {A2} are merged into the cluster {A1, A2}, so that the new cluster set is {{A1, A2}, {A3}, {B1}}. The second clustering loop then needs to calculate the inter-cluster similarity among the clusters {A1, A2}, {A3} and {B1}. For the clusters {A1, A2} and {A3}, the inter-object similarity between A1 and A3 and the inter-object similarity between A2 and A3 must be calculated first, and their average is then taken as the inter-cluster similarity. In this process, the similarities between A1 and A3 and between A2 and A3, which were already calculated in the first loop, are calculated again. A clustering algorithm only changes the cluster to which an object belongs; it neither adds nor deletes objects. Recalculating the inter-object similarity each time the inter-cluster similarity is calculated therefore reduces the clustering efficiency.

Based on this observation, the invention improves the clustering algorithm by dividing the similarity calculation in the clustering algorithm into two parts: the calculation of the similarity between data objects and the calculation of the similarity between clusters. Before clustering, the inter-object similarity is calculated to obtain the similarity between all pairs of data objects. When the clustering algorithm needs the similarity between objects in two clusters, it only reads the corresponding stored value and does not recalculate it, which improves the clustering efficiency. In this stage, the similarity between objects, that is, the similarity between application layer protocol data, is calculated. First, application layer protocol data is extracted from the network flow to obtain complete application layer protocol data. Application layer protocol data often contains a large amount of user data that is unrelated to protocol features, while the application layer protocol features are concentrated at the front of the application layer protocol data. Based on this observation, the invention selects a fixed-length payload at the front of the application layer protocol data for similarity calculation, and the calculated similarity between the application layer protocol data is used as the basis for protocol identification. On the one hand, this retains as much of the application layer protocol feature information as possible; on the other hand, it minimizes the influence of protocol-unrelated user data on the accuracy of the result. Finally, the calculated similarities between all application layer protocol data are stored for analysis and processing in the next stage.
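The precomputation described above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: `precompute_similarities` and the `similarity` callable are hypothetical names, and both directions of every pair are stored because the similarity of one object to another need not equal the similarity in the opposite direction.

```python
from itertools import permutations


def precompute_similarities(objects, similarity):
    """Compute every ordered pairwise similarity once, before clustering.

    `objects` is a list of data objects and `similarity` is any callable
    taking two objects (both are placeholders for illustration).
    Returns a dict mapping (i, j) index pairs to similarity values.
    """
    table = {}
    for i, j in permutations(range(len(objects)), 2):
        table[(i, j)] = similarity(objects[i], objects[j])
    return table
```

During clustering, an inter-object similarity is then a dictionary lookup such as `table[(i, j)]` instead of a fresh computation.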

Further, the workflow of the unknown application layer protocol clustering stage is as follows: at this stage, cluster initialization processing is performed on the application layer protocol data, and the application layer protocol data of each stream is divided into a cluster separately. When the inter-cluster similarity is calculated, the similarity value between the application layer protocol data obtained in the similarity calculation stage is read, so that the inter-cluster similarity calculation process is simplified, and the clustering efficiency is improved. The calculation steps of the inter-cluster similarity are as follows:

1) for any two clusters C1 and C2, where C1 = {A1, A2, …, An} and C2 = {B1, B2, …, Bm}, first the similarity of each piece of application layer protocol data Ai in cluster C1 to cluster C2 is calculated:

wherein similar(Ai, Bj) denotes the similarity of the application layer protocol data Ai to the application layer protocol data Bj, the value of similar(Ai, Bj) is obtained by querying the calculation result of step (2), m is the total number of pieces of application layer protocol data contained in cluster C2, and n is the total number of pieces of application layer protocol data contained in cluster C1;

2) the relative similarity similar(C1, C2) of C1 to C2 is calculated:

3) following steps 1) and 2), the relative similarity similar(C2, C1) of C2 to C1 is calculated;

4) finally, the similarity between cluster C1 and cluster C2 is calculated as: (similar(C1, C2) + similar(C2, C1))/2.
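The four steps above can be sketched as follows. The formulas for steps 1) and 2) are not reproduced in this text, so the sketch assumes that each is an arithmetic mean (over the m members of C2 in step 1, and over the n members of C1 in step 2), which is consistent with the earlier statement that the average value is taken as the inter-cluster similarity. The names `relative_similarity` and `cluster_similarity` and the lookup table `sim` are hypothetical.

```python
def relative_similarity(c1, c2, sim):
    """Relative similarity of cluster c1 to cluster c2 (steps 1 and 2).

    `c1` and `c2` are lists of object indices, and `sim[(i, j)]` is the
    precomputed similarity of object i to object j. Averaging is an
    assumption made for this sketch.
    """
    # Step 1 (assumed mean): similarity of each Ai to the whole cluster c2.
    per_item = [sum(sim[(a, b)] for b in c2) / len(c2) for a in c1]
    # Step 2 (assumed mean): combine the per-item values over c1.
    return sum(per_item) / len(c1)


def cluster_similarity(c1, c2, sim):
    """Steps 3 and 4: symmetrize the two relative similarities."""
    return (relative_similarity(c1, c2, sim) + relative_similarity(c2, c1, sim)) / 2
```

Only step 4), the average of the two directions, is stated explicitly in the text; a different definition for steps 1) and 2) would change `relative_similarity` but not the final symmetrization.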

In the clustering process, a similarity threshold value between clusters needs to be manually set. The inter-cluster similarity threshold refers to the minimum inter-cluster similarity that two clusters need to achieve to merge into one cluster. During clustering, if the similarity of the two closest clusters exceeds the threshold value of the similarity between the clusters, the two closest clusters are merged into one cluster, the cluster set is correspondingly updated, and iteration is repeated until the maximum value of the similarity between the clusters is smaller than the set threshold value of the similarity between the clusters. This also means that the similarity difference between the data in different clusters is already large and not suitable for further merging. Finally, a set of clusters is output, each cluster being a set of network flow information belonging to an application layer protocol.
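The merge loop described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the function names are hypothetical, the inter-cluster similarity is assumed to be built from arithmetic means, and a pair whose similarity exactly equals the threshold is merged, since the text only says that iteration stops once the maximum similarity is smaller than the threshold.

```python
from itertools import combinations


def _relative(x, y, sim):
    # Assumed mean-based relative similarity of cluster x to cluster y.
    return sum(sum(sim[(a, b)] for b in y) / len(y) for a in x) / len(x)


def _cluster_similarity(c1, c2, sim):
    return (_relative(c1, c2, sim) + _relative(c2, c1, sim)) / 2


def adaptive_cluster(n, sim, threshold):
    """Merge the most similar pair of clusters until none reaches `threshold`.

    `n` is the number of objects and `sim[(i, j)]` holds the precomputed
    similarity of object i to object j for every ordered pair with i != j.
    """
    clusters = [[i] for i in range(n)]  # cluster initialization
    while len(clusters) > 1:
        (i, j), best = max(
            (((p, q), _cluster_similarity(clusters[p], clusters[q], sim))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:  # stop condition from the text
            break
        clusters[i].extend(clusters[j])  # merge the two closest clusters
        del clusters[j]  # safe because i < j
    return clusters
```

The number of output clusters is not fixed in advance; it follows from the data and the manually chosen threshold.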

The advantageous effects are as follows: compared with the prior art, the invention has the following advantages:

the method fully utilizes the advantages of the clustering algorithm, does not need to label and train network flow data, calculates the similarity of application layer protocol data by automatically extracting the characteristics of the application layer protocol data, realizes the identification of the application layer protocol, can better solve the problem of difficult identification of unknown application layer protocols in the field of application layer protocol identification, and effectively improves the accuracy of the identification result of the unknown application layer protocols.

Drawings

FIG. 1 is a schematic overall flow chart of the adaptive clustering-based unknown application layer protocol identification method;

FIG. 2 is a schematic flow chart of a similarity algorithm designed according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of an inter-cluster similarity algorithm designed according to an embodiment of the present invention;

FIG. 4 is a schematic flow chart of an improved hierarchical clustering algorithm according to an embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the following drawings and specific embodiments.

The invention provides an unknown application layer protocol identification and classification method based on adaptive clustering, the flow of which is shown in figure 1 and comprises three stages:

(1) Data preprocessing stage: the acquired network traffic data is processed, and the original network traffic data is converted into the input of the similarity calculation stage through sub-steps such as data filtering and sorting and flow reassembly.

(2) Similarity calculation stage: the application layer protocol data of each network flow is extracted, a fixed number of bytes is intercepted from the front of the extracted application layer protocol data, and the similarity between the application layer protocol data is calculated.

(3) Unknown application layer protocol clustering stage: taking the application layer protocol data obtained in the step (1) and the similarity between the application layer protocols obtained in the step (2) as input, performing cluster initialization on the application layer protocol data, further calculating to obtain the similarity between clusters through an inter-cluster similarity algorithm, iteratively repeating by using an improved clustering algorithm until a clustering stop condition is reached, aggregating the application layer protocol data of the same kind in one cluster, and finally outputting a cluster set, wherein each cluster in the set is a set of network flow information corresponding to one application layer protocol.

The invention is further illustrated by the following specific examples.

(1) Data pre-processing

Data preprocessing is the basis for unknown application layer protocol identification. Its purpose is to extract network flows from the acquired network traffic so as to obtain application layer protocol data. Data preprocessing can be subdivided into three sub-steps. The first sub-step is data filtering and sorting: network traffic carrying an application layer protocol payload is obtained through data filtering, and sorting is then used to roughly gather together network data packets that may belong to the same flow. The second sub-step is stream reassembly, which obtains the information of each network stream. The third sub-step is application layer protocol data extraction, which extracts application layer protocol data on the basis of the network streams.

The first sub-step of data preprocessing filters out communication packets that do not need to be considered. The invention is concerned with network packets containing an application layer protocol payload, and the network flow involved can be a complete TCP connection or a complete UDP interaction. Non-IP data packets can be filtered out by reading the FrameType field in the data link layer frame header. TCP and UDP data packets are identified by reading the Protocol field in the network layer IP header, and non-TCP and non-UDP data packets are filtered out. For a unidirectional network flow, the IP addresses and port numbers of its data packets are fixed, so the network data packets can be sorted according to header information such as the source IP, the destination IP, the source port number and the destination port number, and data packets that may belong to the same flow are gathered together to improve the efficiency of flow reassembly.
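The filtering and sorting described above can be sketched as follows. This is a minimal illustration that assumes each packet has already been parsed into a dictionary; the key names (`ether_type`, `ip_proto`, `src_ip`, `src_port`, `dst_ip`, `dst_port`) and the function name `filter_and_sort` are hypothetical. The numeric constants are the standard EtherType value for IPv4 and the standard IP protocol numbers for TCP and UDP.

```python
from collections import defaultdict

ETHERTYPE_IPV4 = 0x0800           # standard EtherType for IPv4
IPPROTO_TCP, IPPROTO_UDP = 6, 17  # standard IP protocol numbers


def filter_and_sort(packets):
    """Drop non-IP and non-TCP/UDP packets, then group by header fields.

    `packets` is an iterable of dicts with hypothetical keys; a real
    capture parser would have to supply them. Packets are grouped per
    direction, matching the unidirectional flows described in the text.
    """
    groups = defaultdict(list)
    for pkt in packets:
        if pkt["ether_type"] != ETHERTYPE_IPV4:
            continue  # not an IP packet
        if pkt["ip_proto"] not in (IPPROTO_TCP, IPPROTO_UDP):
            continue  # neither TCP nor UDP
        key = (pkt["ip_proto"], pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"])
        groups[key].append(pkt)
    return dict(groups)
```

Each group only contains packets that may belong to the same flow; deciding where one flow ends and the next begins is left to the reassembly sub-step.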

In the second sub-step of data preprocessing, stream reassembly, the start and end of a TCP stream can be identified from the SYN and FIN flags in the TCP header, and the arriving packets are then reassembled into an ordered stream by using the relation between the sequence number in the TCP header and the length of the payload data. The specific process is as follows: TCP data packets are selected one at a time in order, the value of the SEQ (sequence number) field in the TCP header is read, and the payload data length of the TCP data packet is calculated from the total length field of the IP header and denoted TCP_data_len. If the SEQ value of the current data packet is equal to the sum of the SEQ value of the previous TCP data packet and the payload data length of that previous packet, the two TCP data packets are considered to belong to the same TCP stream.
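The sequence-number check described above can be sketched as follows. The function and key names are hypothetical. The payload length is taken as the IP total length minus the IP and TCP header lengths, all in bytes, and the comparison is done modulo 2^32 because TCP sequence numbers wrap around. The sketch deliberately ignores the fact that SYN and FIN segments each consume one sequence number.

```python
SEQ_MODULUS = 2 ** 32  # TCP sequence numbers are 32-bit and wrap around


def tcp_data_len(ip_total_len, ip_header_len, tcp_header_len):
    """Payload length in bytes, derived from the IP total length field."""
    return ip_total_len - ip_header_len - tcp_header_len


def is_next_segment(prev, cur):
    """Return True if `cur` directly follows `prev` in the same TCP stream.

    `prev` and `cur` are dicts with hypothetical keys `seq`,
    `ip_total_len`, `ip_header_len` and `tcp_header_len`.
    """
    prev_len = tcp_data_len(prev["ip_total_len"], prev["ip_header_len"],
                            prev["tcp_header_len"])
    expected_seq = (prev["seq"] + prev_len) % SEQ_MODULUS
    return cur["seq"] == expected_seq
```

A packet that fails the check is not necessarily from another stream; it may be a retransmission or an out-of-order segment, which a fuller reassembler would buffer and reorder.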

For UDP streams, the beginning and end of a stream cannot be identified from the UDP header, because UDP has no connection establishment or release procedure. The invention sets a maximum stream duration and judges the start and end of a UDP stream from the sending times of the data packets. The first UDP data packet in order is selected, and its sending time is taken as the stream start time. Each time a data packet is captured, the difference between its sending time and the stream start time is calculated. If the difference is less than the maximum stream duration, the captured data packet belongs to the current UDP stream; if the difference is greater than the maximum stream duration, the UDP stream is considered to have ended and the captured packet belongs to the next UDP stream.
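The duration-based splitting described above can be sketched as follows. The function name `split_udp_flows` is hypothetical, and the input is assumed to be the packets of one sorted group (same addresses and ports) in capture order. The text does not say what happens when the difference exactly equals the maximum duration; the sketch keeps such a packet in the current stream.

```python
def split_udp_flows(packets, max_duration):
    """Split one group of UDP packets into streams by maximum stream duration.

    `packets` is a list of (send_time, payload) tuples in capture order and
    `max_duration` uses the same time unit as `send_time`.
    Returns a list of streams, each a list of payloads.
    """
    flows = []
    start_time = None
    for send_time, payload in packets:
        # A packet later than start + max_duration opens the next stream
        # and its sending time becomes the new stream start time.
        if start_time is None or send_time - start_time > max_duration:
            flows.append([])
            start_time = send_time
        flows[-1].append(payload)
    return flows
```

For example, with a maximum duration of 5 seconds, packets sent at 0.0, 1.0, 5.5 and 6.0 seconds form two streams: the first two packets, then the last two.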

(2) Similarity calculation

The clustering algorithm divides the data set into a plurality of different clusters, the dividing standard is the similarity of data, the data in the clusters have higher similarity, and the similarity difference of the data among the different clusters is larger.

The clustering algorithm usually firstly performs cluster initialization on a data object set, marks each data object as a cluster, then calculates the similarity between the clusters, determines whether to combine two clusters or not by comparing the similarity between the clusters with a set threshold value, and finally completes the clustering of the data set after iteration. When the clustering algorithm calculates the similarity between clusters, the similarity between objects is repeatedly calculated, and the clustering efficiency is reduced by repeatedly calculating.

The invention improves the clustering process in a targeted manner by dividing the similarity calculation in the clustering method into two parts: the first part is the calculation of the inter-object similarity, and the second part is the calculation of the inter-cluster similarity. Before clustering, the similarity between all pairs of objects is obtained through the inter-object similarity calculation. During the inter-cluster similarity calculation, whenever the similarity between two objects is needed, the previously calculated result is used and no recalculation is required, which improves the clustering efficiency.

In this stage, similarity calculation is performed between objects, and a similarity calculation flow designed for the requirements of application layer protocol identification and classification is shown in fig. 2 and includes the substeps of slice processing, slice set selection, similarity calculation, similarity storage, and the like.

The application layer protocol data may be viewed as an ordered byte stream. In the application layer protocol payload, there are some short sequences with obvious protocol characteristics. These short sequences are mainly concentrated at the front of the application layer protocol payload. By intercepting a fixed number of bytes from the front of the application layer protocol data, the embodiment of the invention retains as much of the data containing protocol feature short sequences as possible while masking user data that is unrelated to protocol information. However, the intercepted application layer protocol data may still contain some user data that interferes with the determination.

The embodiment of the invention discovers that the user data and the protocol characteristic sequence are often separated by using a space or special characters through analyzing the protocol data. In order to reduce the influence of user data on protocol classification as much as possible and improve the accuracy of similarity calculation, the slicing method designed in the embodiment cuts application layer protocol data by taking a space and special characters as separators, and separates the user data from a protocol feature short sequence.

For example, typical HTTP protocol data is:

“GET / HTTP/1.1

Host:127.0.0.1

User-Agent:Mozilla/5.0”

The first line is a protocol feature sequence. In each of the remaining lines, the part to the left of the colon is a protocol feature sequence and the part to the right is user data. After slicing with the slicing method described here, the following slice set is obtained: { "GET", "HTTP", "1", "HOST", "127", "0", "1", "User", "Agent", "Mozilla", "5", "0" }. Each slice is either an application layer protocol feature short sequence or user data, so the protocol feature sequences and the user data are separated from each other.
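The slicing step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the prefix length of 64 bytes, the function name slice_payload, and the choice to treat every non-alphanumeric byte as a separator are all assumptions, since the text specifies only "a fixed length" and "a space and special characters".

```python
import re
from typing import List

# Assumption: the text only says "fixed length"; 64 bytes is illustrative.
PREFIX_LEN = 64

# Assumption: any run of non-alphanumeric characters (spaces, '/', ':',
# '.', '-', CR/LF, ...) is treated as one separator.
_SEPARATORS = re.compile(r"[^A-Za-z0-9]+")


def slice_payload(payload: bytes, prefix_len: int = PREFIX_LEN) -> List[str]:
    """Truncate the payload to its first prefix_len bytes and cut it into slices."""
    # latin-1 maps every byte to exactly one character, so decoding never fails.
    text = payload[:prefix_len].decode("latin-1")
    # Drop the empty strings that re.split yields at leading/trailing separators.
    return [piece for piece in _SEPARATORS.split(text) if piece]


if __name__ == "__main__":
    request = b"GET / HTTP/1.1\r\nHost: 127.0.0.1\r\nUser-Agent: Mozilla/5.0"
    print(slice_payload(request))
```

This sketch keeps repeated slices (for instance the two "1" slices from "1.1") and preserves their order; whether duplicates should be kept or collapsed is not stated in the text.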

After the intercepted application layer protocol data is sliced, the slice set of each piece of application layer protocol data is taken as the input of the similarity algorithm. The similarity calculation is implemented as follows: choose any two slice sets A = {a1, a2, ..., an} and B = {b1, b2, ..., bm}, and count the number of slices of the first slice set that appear in the second slice set, denoted num. The similarity of slice set A to B is then similar (A, B) = num/n, where n is the total number of slices in A. This step is performed in a loop until the similarity between all slice sets has been calculated. The calculated similarities are saved in a number of sets for later use. For example, if A = {a1, a2, a3, a4} and B = {a1, a4}, only slices a1 and a4 of A appear in B, so num = 2 and the similarity of A to B is similar (A, B) = 2/4 = 0.5. Similarly, the similarity of B to A is similar (B, A) = 2/2 = 1.
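The directional similarity and its precomputation for all pairs can be sketched as below. The function names, the nested-list storage, and the convention that an empty slice set has similarity 0 are assumptions made for illustration; the text defines only similar (A, B) = num/n.

```python
from typing import List, Sequence


def similar(a: Sequence[str], b: Sequence[str]) -> float:
    """Directional similarity of slice set a to slice set b: num / n."""
    if not a:
        # Assumption: the text does not define the empty case; use 0.
        return 0.0
    b_lookup = set(b)
    # num = number of slices of a that also appear in b
    num = sum(1 for piece in a if piece in b_lookup)
    return num / len(a)


def pairwise_similarity(slice_sets: Sequence[Sequence[str]]) -> List[List[float]]:
    """Compute similar(i, j) once for every ordered pair so that the
    clustering stage can look values up instead of recomputing them."""
    count = len(slice_sets)
    return [
        [similar(slice_sets[i], slice_sets[j]) for j in range(count)]
        for i in range(count)
    ]


if __name__ == "__main__":
    A = ["a1", "a2", "a3", "a4"]
    B = ["a1", "a4"]
    print(similar(A, B), similar(B, A))
```

Note that the measure is not symmetric, so the table stores both similar (A, B) and similar (B, A).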

(3) Unknown application layer protocol clustering

This step clusters the network flow information to meet the needs of application layer protocol identification and classification. The embodiment of the invention improves the hierarchical clustering method to design its clustering method. Fig. 4 is a schematic flow chart of the improved hierarchical clustering method used in the embodiment of the present invention; the whole clustering flow includes the sub-steps of cluster initialization, inter-cluster similarity calculation, similarity threshold comparison, and cluster merging.

The flow of the inter-cluster similarity algorithm is shown in fig. 3. The application layer protocol data obtained by data preprocessing is used as the input of the clustering algorithm, and cluster initialization is performed on the protocol data. Specifically, the protocol data of each network flow is stored in an independent array and an initial cluster mark is added to each array, so that after cluster initialization the protocol data of each network flow belongs to a different cluster. With the cluster set as the input of the inter-cluster similarity calculation method, any two clusters are selected. First, for each piece of application layer protocol data in one cluster, the mean of its similarities to all pieces of application layer protocol data in the other cluster is calculated as the similarity between that piece of data and the other cluster. Then the mean of these similarities over all application layer protocol data in the cluster is calculated as the relative similarity of the cluster to the other cluster. Finally, the average of the two relative similarities between the two clusters is calculated to obtain the inter-cluster similarity.

The specific operation is as follows: choose any two clusters C1 = {A1, A2, ..., An} and C2 = {B1, B2, ..., Bm}, and first calculate the similarity between each piece of protocol data Ai (i = 1, 2, ..., n) in cluster C1 and cluster C2:

$$\mathrm{similar}(A_i, C_2) = \frac{1}{m}\sum_{j=1}^{m} \mathrm{similar}(A_i, B_j)$$

where similar (Ai, Bj) represents the similarity of the application layer protocol data Ai to the application layer protocol data Bj, and m is the total number of pieces of application layer protocol data contained in cluster C2. Then calculate the relative similarity of C1 to C2:

$$\mathrm{similar}(C_1, C_2) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{similar}(A_i, C_2)$$

where n is the total number of pieces of application layer protocol data contained in cluster C1. The relative similarity similar (C2, C1) of cluster C2 to C1 is calculated by repeating the above steps. The similarity between cluster C1 and cluster C2 is then (similar (C1, C2) + similar (C2, C1))/2.
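The three-level averaging described above can be sketched as follows. This sketch assumes that clusters are lists of object indices and that sim is a precomputed table in which sim[i][j] holds the similarity of object i to object j; both representations, and all function names, are illustrative assumptions.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def object_to_cluster(i: int, cluster: Sequence[int], sim: Matrix) -> float:
    """similar(Ai, C2): mean similarity of object i to every object in cluster."""
    return sum(sim[i][j] for j in cluster) / len(cluster)


def relative_similarity(c1: Sequence[int], c2: Sequence[int], sim: Matrix) -> float:
    """similar(C1, C2): mean of similar(Ai, C2) over all objects Ai in c1."""
    return sum(object_to_cluster(i, c2, sim) for i in c1) / len(c1)


def inter_cluster_similarity(c1: Sequence[int], c2: Sequence[int], sim: Matrix) -> float:
    """Average of the two relative similarities similar(C1, C2) and similar(C2, C1)."""
    return (relative_similarity(c1, c2, sim) + relative_similarity(c2, c1, sim)) / 2


if __name__ == "__main__":
    # Hypothetical precomputed similarities for three objects.
    sim: List[List[float]] = [
        [1.0, 0.5, 0.2],
        [1.0, 1.0, 0.4],
        [0.6, 0.8, 1.0],
    ]
    print(inter_cluster_similarity([0, 1], [2], sim))
```

In the hypothetical table, the relative similarity of {0, 1} to {2} is (0.2 + 0.4)/2 = 0.3 and that of {2} to {0, 1} is (0.6 + 0.8)/2 = 0.7, so the inter-cluster similarity is about 0.5. Only table lookups are performed; no object similarity is recomputed.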

If the inter-cluster similarity is greater than the similarity threshold, the two clusters are merged and the cluster set is updated. Specifically, for any two such similar clusters C1 = {A1, A2, ..., An} and C2 = {B1, B2, ..., Bm}, one of the clusters is taken as the base and all data in the other cluster is added to it. For example, taking C1 as the base, after merging C1 = {A1, A2, ..., An, B1, B2, ..., Bm}, and cluster C2 is deleted from the cluster set at the same time, which completes the merging of the clusters and the update of the cluster set.

The steps of inter-cluster similarity calculation, similarity threshold comparison, and cluster merging are repeated until the algorithm meets the clustering termination condition. The termination condition is that the maximum inter-cluster similarity in the cluster set is smaller than the similarity threshold, so that no further clusters can be merged. The cluster set is then output, in which each cluster is a set of network flow information belonging to the same application layer protocol.
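The overall loop (cluster initialization, inter-cluster similarity calculation, threshold comparison, merging, termination) can be sketched as below. Several choices are assumptions rather than details taken from the text: the sketch merges the most similar pair in each round, whereas the text only requires that a merged pair exceed the threshold; a similarity exactly equal to the threshold is treated as not mergeable; and clusters are represented as lists of indices into a precomputed table sim.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def inter_cluster_similarity(c1: Sequence[int], c2: Sequence[int], sim: Matrix) -> float:
    """Average of the two relative similarities, using only table lookups."""
    def relative(src: Sequence[int], dst: Sequence[int]) -> float:
        per_object = [sum(sim[i][j] for j in dst) / len(dst) for i in src]
        return sum(per_object) / len(src)
    return (relative(c1, c2) + relative(c2, c1)) / 2


def adaptive_cluster(sim: Matrix, threshold: float) -> List[List[int]]:
    """Merge clusters until no pair has similarity above threshold."""
    # Cluster initialization: every object starts in its own cluster.
    clusters: List[List[int]] = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best_value, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                value = inter_cluster_similarity(clusters[x], clusters[y], sim)
                if value > best_value:
                    best_value, best_pair = value, (x, y)
        # Termination: the maximum inter-cluster similarity does not exceed the threshold.
        if best_pair is None or best_value <= threshold:
            break
        x, y = best_pair
        clusters[x].extend(clusters[y])  # take clusters[x] as the base
        del clusters[y]                  # remove the merged cluster from the set
    return clusters


if __name__ == "__main__":
    # Hypothetical similarities: objects 0 and 1 are alike, as are 2 and 3.
    sim = [
        [1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.8],
        [0.1, 0.1, 0.8, 1.0],
    ]
    print(adaptive_cluster(sim, threshold=0.5))
```

The number of output clusters is not fixed in advance; it follows from the threshold, which is what makes the clustering adaptive in this sketch.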

In conclusion, the unknown application layer protocol identification method based on adaptive clustering makes full use of the unsupervised nature of the clustering algorithm and of data similarity calculation. By automatically extracting application layer protocol features, it calculates the similarity of protocol data and performs protocol clustering on the basis of that similarity. The method avoids the complex process of labeling training data, can be used without training a model, and effectively improves the accuracy of unknown application layer protocol identification.

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. An unknown application layer protocol identification method based on adaptive clustering is characterized by comprising the following steps:

(1) preprocessing the acquired network traffic data, and extracting application layer protocol data in the acquired network traffic data;

(2) selecting a section of data with fixed length in front of each application layer protocol data as input of similarity calculation, and calculating the similarity between the application layer protocol data;

(3) performing cluster initialization processing on the extracted application layer protocol data, dividing each piece of application layer protocol data into a cluster, performing cluster initialization by taking the similarity between the application layer protocol data calculated in the step (2) as initial inter-cluster similarity, and performing inter-cluster similarity calculation on a cluster class set obtained by initialization, wherein in the process of calculating the inter-cluster similarity, when the similarity between the application layer protocol data needs to be calculated, the similarity between the application layer protocol data calculated in the step (2) is directly inquired; and repeatedly iterating by using a clustering algorithm until a clustering stopping condition is reached, and finally outputting a cluster set to describe the classification information of the application layer protocol corresponding to the network flow.

2. The method for identifying the unknown application layer protocol based on the adaptive clustering as claimed in claim 1, wherein the preprocessing comprises:

(20) data filtering and sorting: network traffic carrying an application layer protocol data payload is obtained by filtering the collected network traffic data, and network traffic that possibly belongs to the same flow is then gathered together by sorting on IP addresses or port numbers;

(21) flow recombination: merging network traffic belonging to the same network flow;

(22) extracting application layer protocol data: and extracting application layer protocol data from the network flow after the flow recombination.

3. The method for identifying unknown application layer protocols based on adaptive clustering according to claim 2, wherein the data filtering comprises the following specific steps: non-IP data packets are filtered out by reading the FrameType field in the data link layer frame header of the network traffic data, and non-TCP and non-UDP data packets are filtered out by reading the Protocol field in the network layer IP packet header of the network traffic data.

4. The method for identifying unknown application layer protocols based on adaptive clustering according to claim 3, wherein the sorting comprises:

according to the principle that the IP addresses and the port numbers of the data packets contained in the same network flow are the same, the network flows are sequenced according to any one of the source IP, the destination IP, the source port number and the destination port number of the IP data packet header, and the network flows possibly belonging to the same flow are gathered together.

5. The method for identifying unknown application layer protocols based on adaptive clustering according to claim 4, wherein the specific steps of the stream reassembly include:

for TCP flows, the beginning and the end of a TCP flow are identified according to the SYN and FIN flag bits of the TCP header, and the arriving data packets are then reassembled into an ordered flow using the relation between the sequence number in the TCP header and the payload length of the TCP data packet;

setting a maximum duration of a stream for a UDP stream, taking the transmission time of a data packet as start and end marks of the UDP stream, starting from a captured first UDP data packet containing application layer data, taking the transmission time of the data packet as the stream start time, calculating the difference between the transmission time and the stream start time every time one data packet is captured, and if the difference is less than the maximum duration of the stream, the captured data packet belongs to the UDP stream; if the difference is greater than the maximum duration of the stream, the UDP stream is considered to have ended and the captured UDP packet belongs to the next UDP stream.

6. The method for identifying unknown application layer protocol based on adaptive clustering as claimed in claim 2, wherein the calculating step of inter-cluster similarity comprises:

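(60) for any two clusters C1 = {A1, A2, …, An} and C2 = {B1, B2, …, Bm}, first the similarity of each piece of application layer protocol data Ai in cluster C1 to cluster C2 is calculated:

$$\mathrm{similar}(A_i, C_2) = \frac{1}{m}\sum_{j=1}^{m} \mathrm{similar}(A_i, B_j)$$

(61) the relative similarity of C1 to C2 is calculated:

$$\mathrm{similar}(C_1, C_2) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{similar}(A_i, C_2)$$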

(62) according to the steps (60) and (61), calculating the relative similarity (C2, C1) of C2 to C1;

(63) finally, the similarity between the cluster C1 and the cluster C2 is calculated as: (similar (C1, C2) + similar (C2, C1))/2.