CN115086043B

CN115086043B - Encrypted network traffic classification and identification method based on minimum common subsequence

Info

Publication number: CN115086043B
Application number: CN202210690984.5A
Authority: CN
Inventors: 刘瑶; 樊鹏程; 白晓羽; 熊静; 丁熠
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2022-06-17
Filing date: 2022-06-17
Publication date: 2023-03-21
Anticipated expiration: 2042-06-17
Also published as: CN115086043A

Abstract

The invention provides an encrypted network traffic classification and identification method based on a minimum common subsequence, and belongs to the technical field of information security. Using the KMP algorithm and the longest common subsequence algorithm, the method extracts a feature code family that represents a given behavior in network traffic, builds a database of feature code families for various behaviors, and matches the feature code family of the traffic to be detected against the families in the database, thereby classifying and identifying the traffic. The algorithms used have low complexity and require neither large-scale machine learning training nor large amounts of computing power, so the method is suitable for deployment on small embedded devices. Because feature codes rarely change with the device, the system, or time, the risk caused by overfitting is reduced; compared with conventional traffic detection algorithms, the method therefore achieves higher accuracy and shorter detection time, and can be applied in practice.

Description

Encrypted network traffic classification and identification method based on minimum common subsequence

Technical Field

The invention belongs to the technical field of information security, and particularly relates to an encrypted network traffic classification and identification method based on a minimum common subsequence.

Background

In recent years, information science has developed rapidly. The widespread use of many emerging technologies, particularly the Internet based on the TCP/IP protocol, has profoundly changed the way people live, and the Internet is becoming an indispensable part of the infrastructure of human society. Network traffic involves many closely connected entities such as hosts, networks, applications and users, and is a complex, multi-factor system concept. Each network application has its own traffic behavior characteristics. As new network applications and application-layer protocols continue to emerge, network traffic grows more complex by the day, and its variability, dynamics and heterogeneity become more pronounced. Network traffic identification is central to network behavior analysis, network planning and construction, network anomaly detection and network traffic modeling, and is also the prerequisite and basis for improving network management, improving quality of service and monitoring application security. Because unencrypted traffic is insecure, network traffic has increasingly been encrypted as network technology has developed. The advent of encryption technologies such as SSL, SSH, VPN and Tor has made earlier traffic identification methods unreliable. As these encryption methods spread, encrypted network traffic is growing rapidly and changing the threat landscape. Attackers use encryption to hide their activities, and encrypted traffic makes it far easier for them to conceal command-and-control activity. Encrypted network traffic must be identified before it can be analyzed.

High-accuracy identification and detection of encrypted traffic is therefore of great practical significance for ensuring network information security and keeping networks operating normally. As the basis of network management and network analysis, network traffic classification and identification technology has broad application prospects in fields such as network security and network quality-of-service evaluation, and is currently one of the important areas of computer network research.

At present, the following three methods are commonly used in the prior art to classify and identify encrypted network traffic:

1. The traffic classification and identification method based on network port mapping checks the source and destination port numbers of network data packets and identifies different network applications by mapping them to the port numbers conventionally used by the corresponding network protocols or applications. However, the Internet Assigned Numbers Authority (IANA) does not define communication port numbers for all applications, especially newer ones, so port numbers and applications do not always correspond one to one. In addition, the port numbers used by some common protocols during data transmission are not fixed, and the services of several network protocols can be wrapped inside a common application and share the same port number. As a result, the accuracy and reliability of this method keep declining, and it can no longer meet current requirements for network traffic classification and identification.

2. The traffic classification and identification method based on neural networks trains on and classifies traffic data using currently popular convolutional neural networks or autoencoder neural networks. This method consumes a large amount of computing resources in the training stage, and the resulting recognition model is usually large, so recognition is too slow and it is difficult to detect and identify network traffic in real time.

3. The traffic classification and identification method based on behavior characteristics relies on the principle that different network applications have different communication behavior patterns. Taking a macroscopic view of traffic characteristics, it classifies and identifies network traffic by analyzing how the behavior patterns that each network protocol and application maps onto the transport layer differ. However, this method has a large time and space overhead and poor real-time performance, and research progress on it has been limited in recent years.

Therefore, how to overcome the low recognition accuracy and slow recognition speed of current encrypted traffic identification methods has become a research hotspot.

Disclosure of Invention

In view of the problems described in the background, the present invention aims to provide a method for classifying and identifying encrypted network traffic based on a minimum common subsequence. Using the KMP algorithm and the longest common subsequence algorithm, the method extracts a feature code family that represents a given behavior in network traffic, builds a database of feature code families for various behaviors, and matches the feature code family of the traffic to be detected against the families in the database, thereby classifying and identifying the traffic. The method can greatly improve the speed and accuracy of network traffic detection and identification.

In order to realize the purpose, the technical scheme of the invention is as follows:

A method for classifying and identifying encrypted network traffic based on a minimum common subsequence comprises the following steps:

step 1, collecting the traffic of a specified behavior multiple times;

step 2, extracting the application layer payload data of each data packet in the traffic data collected in step 1 (ignoring data packets that carry no application layer payload), and converting the payload data to binary, so that each data packet yields one binary number and each collection yields one binary two-dimensional array;

step 3, searching for feature codes, specifically: comparing the binary number obtained from the first data packet in the first collection with the binary numbers in the two-dimensional arrays obtained from the other collections; if a run of more than 16 consecutive bits is identical, the match is deemed successful, and if a successful match is found in every binary two-dimensional array obtained from the other collections, the data packet is determined to have a feature code;

step 4, processing all remaining data packets in the first collection as in step 3, thereby extracting all distinct feature codes from the first collection of the specified behavior, and obtaining the feature code sequence of that collection from the chronological order in which the feature codes occur in the specified behavior;

step 5, processing the traffic data of the remaining collections as in step 3, and using a longest common subsequence algorithm to compute the longest common subsequence of the feature code sequences obtained from the multiple collections of the specified behavior, thereby obtaining the feature code family of the specified behavior;

step 6, obtaining a feature code family for each type of behavior according to steps 1 to 5, and then building a feature code family database;

step 7, extracting the feature code sequence of the traffic to be detected according to steps 1 to 4, and then comparing it with the feature codes in the feature code family database of step 6, thereby completing the classification and identification of the traffic to be detected.
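Steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, the payloads are assumed to have already been extracted from the capture files as `bytes`, the text does not say which matching run should be kept (the sketch assumes the longest one), and Python's built-in substring test stands in for the KMP search.

```python
from typing import List, Optional

# Threshold taken from the text: a match must exceed 16 consecutive bits.
MIN_MATCH_BITS = 16


def payload_to_bits(payload: bytes) -> str:
    """Render an application-layer payload as a string of '0'/'1' characters."""
    return "".join(f"{byte:08b}" for byte in payload)


def capture_to_bit_rows(payloads: List[bytes]) -> List[str]:
    """Build the binary two-dimensional array of one collection.

    Packets without application-layer payload (empty bytes) are ignored.
    """
    return [payload_to_bits(p) for p in payloads if p]


def find_feature_code(packet_bits: str,
                      other_captures: List[List[str]]) -> Optional[str]:
    """Return a run of packet_bits longer than MIN_MATCH_BITS that occurs in
    at least one packet of every other collection, or None.

    Assumption: the longest such run is kept. This brute-force search is for
    illustration only and is not optimized.
    """
    if not other_captures:
        return None
    n = len(packet_bits)
    for length in range(n, MIN_MATCH_BITS, -1):  # longest candidates first
        for start in range(n - length + 1):
            candidate = packet_bits[start:start + length]
            if all(any(candidate in row for row in capture)
                   for capture in other_captures):
                return candidate
    return None


def feature_code_sequence(capture: List[str],
                          other_captures: List[List[str]]) -> List[str]:
    """Feature codes of one collection, in packet (chronological) order."""
    codes = (find_feature_code(row, other_captures) for row in capture)
    return [code for code in codes if code is not None]
```

Because packets that yield no feature code are dropped, the resulting sequence never has more entries than the collection has packets.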

Further, in step 1, a traffic collection tool such as Wireshark or tcpdump is used.

Further, the traffic data in step 1 may be collected in different environments and on different devices, which increases the robustness of the final recognition result.

Further, the more times traffic is collected for the specified behavior in step 1, the higher the accuracy of the final recognition result; the number of collections is not less than 25.

Further, in step 3, a KMP matching algorithm may be used to find the feature code.
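For reference, a textbook Knuth-Morris-Pratt search over bit strings might look like the sketch below. The function names are illustrative and the patent does not prescribe this exact form; it only shows how a candidate bit pattern can be located inside another packet's binary number in linear time.

```python
from typing import List


def build_failure_table(pattern: str) -> List[int]:
    """table[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = build_failure_table(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1
```

A call such as `kmp_find(other_packet_bits, candidate) != -1` could then replace a plain substring test when checking whether a candidate feature code occurs in another collection.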

Further, the number of feature codes in step 4 is less than or equal to the number of packets of one collection result of the specified behavior.

In summary, by adopting the above technical scheme, the invention has the following beneficial effects:

The algorithms used in the traffic classification and identification method have low complexity and require no large-scale machine learning training, so little computing power is needed and the method is suitable for deployment on small embedded devices. In addition, feature codes rarely change with the device, the system, or time, which reduces the risk caused by overfitting. Compared with conventional traffic detection algorithms, the method therefore achieves higher accuracy and shorter detection time, and can be applied in practice.

Drawings

FIG. 1 is a schematic diagram of step 2 in Example 1 of the present invention.

FIG. 2 is a schematic diagram of step 3 in Example 1 of the present invention.

FIG. 3 is a schematic diagram of step 5 in Example 1 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.

step 1, collecting the traffic of a specified behavior multiple times, and labeling the collection results;

step 3, searching for feature codes, specifically: comparing the binary number obtained from the first data packet in the first collection with the binary numbers in the two-dimensional arrays obtained from the other collections; if a run of more than 16 consecutive bits is identical, the match is deemed successful, and if a successful match is found in every binary two-dimensional array obtained from the other collections, the data packet is determined to have a feature code;

Example 1

step 1, using a traffic collection tool such as Wireshark or tcpdump to collect the traffic of sending a WeChat voice message multiple times, labeling each collection, and storing it in pcap format, for example "WeChat sending voice 1.pcap" and "WeChat sending voice 2.pcap";

step 2, extracting the application layer data of each data packet in the pcap files collected in step 1 and converting it to binary, for example converting hexadecimal 6f to binary 0110 1111; "WeChat sending voice 1.pcap" contains multiple data packets, and converting the application layer data of each packet yields multiple binary numbers o1, o2, o3, ..., which together form the binary two-dimensional array m1 of "WeChat sending voice 1.pcap"; similarly, the binary two-dimensional array m2 is obtained from "WeChat sending voice 2.pcap"; the process is shown in FIG. 1;

step 3, searching for feature codes using the KMP algorithm, specifically: matching the binary number o1, obtained from the first data packet in the first two-dimensional array m1, against all binary numbers in the other arrays such as m2 and m3; when the matching length exceeds 16 bits and at least one match is found in every remaining mi (where mi is the i-th binary two-dimensional array), the matched binary number is determined to be a feature code and labeled s1; the process is shown in FIG. 2;

step 4, repeating step 3 for the remaining binary numbers o2, o3, o4, ..., oj in m1 to obtain the feature codes s2, s3, s4, ..., sk, giving k feature codes in total; since not every data packet necessarily yields a feature code, k is generally less than or equal to j; the feature code sequence of m1 is then obtained from the order in which the feature codes occur in m1, for example s1, s3, s2, s1, s5, s4, s1, s3, ...;

step 5, using the longest common subsequence algorithm to compute the longest common subsequence of the feature code sequences of the two-dimensional arrays m1, m2, m3, ... obtained from the multiple collections of the WeChat voice sending behavior, thereby obtaining the feature code family of WeChat voice sending; the process is shown in FIG. 3;

step 6, obtaining the feature code families of various specified behaviors according to steps 1 to 5, including behaviors such as sending a WeChat red packet and transferring a file over QQ, and then building a feature code family database;

step 7, extracting the feature code sequence of the traffic to be detected according to steps 1 to 4, and then comparing it with the feature codes in the feature code family database of step 6, thereby completing the classification and identification of the traffic to be detected.
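Steps 5 to 7 can be sketched as follows, under several stated assumptions. The longest common subsequence of many sequences is approximated here by folding the classic two-sequence dynamic program over the collections pairwise. The text also does not define the comparison rule of step 7, so the sketch assumes a behavior matches when its whole family appears, in order, within the observed feature code sequence. The labels `s1`, `s2`, ... and all function names are illustrative.

```python
from functools import reduce
from typing import Dict, List, Optional, Sequence


def lcs(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Longest common subsequence of two sequences (dynamic programming)."""
    rows, cols = len(a), len(b)
    dp = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover one optimal subsequence.
    out: List[str] = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def feature_code_family(sequences: List[Sequence[str]]) -> List[str]:
    """Assumption: fold the pairwise LCS over all collections of a behavior."""
    if not sequences:
        return []
    return list(reduce(lcs, sequences))


def is_subsequence(family: Sequence[str], observed: Sequence[str]) -> bool:
    """True if every code of family appears in observed, in the same order."""
    remaining = iter(observed)
    return all(code in remaining for code in family)


def classify(observed: Sequence[str],
             database: Dict[str, List[str]]) -> Optional[str]:
    """Hypothetical matching rule: pick the behavior with the longest family
    that is fully contained, in order, in the observed sequence."""
    best: Optional[str] = None
    for behavior, family in database.items():
        if family and is_subsequence(family, observed):
            if best is None or len(family) > len(database[best]):
                best = behavior
    return best
```

For instance, a database could be built as `{"wechat_voice": feature_code_family(sequences)}` from the feature code sequences of the labeled collections, and `classify` would then be called on the sequence extracted from the traffic to be detected.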

TABLE 1

TABLE 2

Tables 1 and 2 show the accuracy of the method of the invention in identifying various behaviors of popular software in an experimental environment. The accuracy is the average over multiple experiments. As the tables show, detection completes almost instantly for nearly all behaviors, with the longest taking no more than 0.1 s, which means that the method identifies traffic quickly, and its accuracy can exceed 95%. Some behaviors occur automatically without any user click; for example, a microblog client may refresh automatically and WeChat may fetch Moments (friend circle) updates automatically. Detecting such behaviors when the user has not actively performed them can therefore be regarded as normal.

The above are merely specific embodiments of the invention. Unless stated otherwise, any feature disclosed in this specification may be replaced by alternative features serving an equivalent or similar purpose; all of the disclosed features, or all of the method or process steps, may be combined in any way, except for mutually exclusive features and/or steps.

Claims

1. A method for classifying and identifying encrypted network traffic based on a minimum common subsequence, characterized by comprising the following steps:

step 1, collecting the traffic of a specified behavior multiple times;

step 2, extracting the application layer payload data of each data packet in the traffic data collected in step 1, and converting the payload data to binary, so that each data packet yields one binary number and each collection yields one binary two-dimensional array;

2. The method for classifying and identifying encrypted network traffic according to claim 1, wherein in step 1, traffic is collected using the Wireshark or tcpdump traffic collection tool.

3. The method for classifying and identifying encrypted network traffic according to claim 1, wherein the traffic data in step 1 are collected in different environments and on different devices, so that the robustness of the final identification result is increased.

4. The method for classifying and identifying encrypted network traffic according to claim 1, wherein the more times traffic is collected for the specified behavior in step 1, the higher the accuracy of the final identification result, and the number of collections is not less than 25.

5. The method for classifying and identifying encrypted network traffic according to claim 1, wherein a KMP matching algorithm is used to find the feature code in step 3.

6. The method for classifying and identifying encrypted network traffic according to claim 1, wherein the number of the feature codes in step 4 is less than or equal to the number of packets of one collection result of the specified behavior.