CN115086043B - Encryption network flow classification and identification method based on minimum public subsequence

Info

Publication number
CN115086043B
CN115086043B CN202210690984.5A CN202210690984A CN115086043B CN 115086043 B CN115086043 B CN 115086043B CN 202210690984 A CN202210690984 A CN 202210690984A CN 115086043 B CN115086043 B CN 115086043B
Authority
CN
China
Prior art keywords
feature code
collection
network traffic
flow
binary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210690984.5A
Other languages
Chinese (zh)
Other versions
CN115086043A (en)
Inventor
刘瑶
樊鹏程
白晓羽
熊静
丁熠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-06-17
Filing date
2022-06-17
Publication date
2023-03-21
Application filed by University of Electronic Science and Technology of China
Priority to CN202210690984.5A
Publication of CN115086043A
Application granted
Publication of CN115086043B
Current legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention provides a method for classifying and identifying encrypted network traffic based on a minimum common subsequence, and belongs to the technical field of information security. Using the KMP algorithm and a longest common subsequence algorithm, the method extracts a feature code family that represents a given behavior in network traffic, builds a database of feature code families for various behaviors, and matches the feature code family of the network traffic to be detected against the feature code families in the database, thereby classifying and identifying the traffic. The algorithms used have low complexity and require neither large-scale machine learning training nor substantial computing power, so the method is suitable for deployment on small embedded devices. Because the feature codes rarely change with the device, the system, or time, the risk of overfitting is reduced; compared with conventional traffic detection algorithms, the method therefore offers higher accuracy and shorter detection time and can be applied in practice.

Description

Encryption network flow classification and identification method based on minimum public subsequence
Technical Field
The invention belongs to the technical field of information security, and particularly relates to a method for classifying and identifying encrypted network traffic based on a minimum common subsequence.
Background
In recent years, information science has developed rapidly, and the widespread use of many emerging technologies, particularly the Internet based on the TCP/IP protocol, has profoundly changed how people live; the Internet is becoming an indispensable part of society's infrastructure. Network traffic involves many closely connected entities such as hosts, networks, applications, and users, and is a complex, multi-factor system concept. Each network application has its own traffic behavior characteristics, and with the continuous emergence of new network applications and application-layer protocols, network traffic grows more complex by the day, and its variability, dynamics, and heterogeneity become more pronounced. Network traffic identification is central to network behavior analysis, network planning and construction, network anomaly detection, and network traffic modeling, and is also the premise and basis for improving network management, improving quality of service, and monitoring application security. Because unencrypted traffic is insecure, network traffic has increasingly become encrypted as network technology has developed. The advent of encryption technologies such as SSL, SSH, VPN, and Tor has made earlier traffic identification methods unreliable. With the widespread use of these encryption methods, encrypted network traffic is growing rapidly and changing the threat landscape. Attackers use encryption to hide their activities, and encrypted traffic acts as a force multiplier for concealing their command-and-control activities. Encrypted network traffic must be identified before it can be analyzed.
Accurate identification and detection of encrypted traffic is therefore of real practical importance for ensuring network information security and keeping networks running normally. As the basis of network management and network analysis, network traffic classification and identification has broad application prospects in fields such as network security and quality-of-service evaluation, and is currently one of the important areas of computer network research.
At present, the following three methods are commonly used in the prior art to classify and identify encrypted network traffic:
1. The traffic classification and identification method based on network port mapping checks the source and destination port numbers of a network data packet and identifies the network application by mapping them against the port number rules used by the corresponding network protocol or application. However, the Internet Assigned Numbers Authority (IANA) does not define communication port numbers for all applications, especially newer ones, so port numbers and applications cannot always be mapped one to one. In addition, the port numbers used by some common protocols during data transmission are not fixed, and the services of several network protocols can be wrapped in a common application and share the same port number. As a result, the accuracy and reliability of this method keep declining, and it can no longer meet current requirements for network traffic classification and identification.
2. The traffic classification and identification method based on neural networks trains on and classifies traffic data using currently popular convolutional neural networks or autoencoder neural networks. This method consumes a large amount of computing resources in the training stage, and the resulting model is generally large, so identification is too slow and network traffic is difficult to detect and identify in real time.
3. The traffic classification and identification method based on behavior characteristics relies on the principle that different network applications have different communication behavior patterns. From the macroscopic view of traffic characteristics, it classifies and identifies network traffic by analyzing how the behavior patterns of each network protocol and application differ at the transport layer. However, this method has a large space and time overhead and poor real-time performance, and related research progress in recent years has been limited.
Therefore, how to overcome the low identification accuracy and slow identification speed of current encrypted traffic identification methods has become a research hotspot.
Disclosure of Invention
In view of the problems described in the background, the present invention aims to provide a method for classifying and identifying encrypted network traffic based on a minimum common subsequence. Using the KMP algorithm and a longest common subsequence algorithm, the method extracts a feature code family that represents a given behavior in network traffic, builds a database of feature code families for various behaviors, and matches the feature code family of the network traffic to be detected against the feature code families in the database, thereby classifying and identifying the traffic. The method can greatly improve both the speed and the accuracy of network traffic detection and identification.
In order to realize the purpose, the technical scheme of the invention is as follows:
A method for classifying and identifying encrypted network traffic based on a minimum common subsequence comprises the following steps:
step 1, collecting the traffic of a specified behavior multiple times;
step 2, extracting the application-layer payload data of each data packet in the traffic data collected in step 1 (data packets without application-layer payload data are ignored), then converting the application-layer payload data to binary, so that each data packet yields one binary number and each collection yields one two-dimensional binary array;
step 3, searching for feature codes, specifically: comparing the binary number obtained from the first data packet of the first collection with the binary numbers in the two-dimensional arrays obtained from the other collections; a match is deemed successful if more than 16 consecutive bits are identical, and if a successful match is found in the two-dimensional binary array of every other collection, the data packet is determined to contain a feature code;
step 4, processing all remaining data packets of the first collection according to step 3, thereby extracting all distinct feature codes of the first collection of the specified behavior, and obtaining the feature code sequence of that collection according to the chronological order in which the feature codes occur in the specified behavior;
step 5, processing the traffic data of the remaining collections according to step 3, and using a longest common subsequence algorithm to compute the longest common subsequence of the feature code sequences of the multiple collections of the specified behavior, thereby obtaining the feature code family of the specified behavior;
step 6, obtaining a feature code family for each type of behavior according to steps 1 to 5, and then building a feature code family database;
step 7, extracting the feature code sequence of the traffic to be detected according to steps 1 to 4, and then comparing it with the feature codes in the feature code family database of step 6, thereby completing the classification and identification of the traffic to be detected.
Further, in step 1, a traffic collection tool such as Wireshark or tcpdump is used.
Further, the traffic data in step 1 can be collected in different environments and on different devices, which increases the robustness of the final identification result.
Further, the more times traffic is collected for the specified behavior in step 1, the higher the accuracy of the final identification result; the number of collections is not less than 25.
Further, in step 3, the KMP matching algorithm may be used to search for the feature code.
Further, the number of feature codes in step 4 is less than or equal to the number of data packets in one collection of the specified behavior.
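As a rough illustration of steps 2 and 3, the following Python sketch converts payload bytes to a bit string and uses a KMP search to test whether a run of bits from one packet also occurs in at least one packet of every other collection. The names `payload_to_bits`, `kmp_find`, `find_feature_code` and `MIN_MATCH_BITS` are hypothetical and do not come from the patent. The sketch also assumes that "more than 16 consecutive bits" means a window of at least 17 bits and that the first such window found is taken as the feature code; the text does not specify how a feature code is delimited.

```python
from typing import List, Optional

# Assumption: "more than 16 consecutive bits" is read as at least 17 bits.
MIN_MATCH_BITS = 17


def payload_to_bits(payload: bytes) -> str:
    """Step 2: render an application-layer payload as a string of '0'/'1'."""
    return "".join(f"{byte:08b}" for byte in payload)


def kmp_find(text: str, pattern: str) -> int:
    """Return the first index of pattern in text, or -1 (Knuth-Morris-Pratt)."""
    if not pattern:
        return 0
    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    k = 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


def find_feature_code(
    packet_bits: str,
    other_collections: List[List[str]],
    min_bits: int = MIN_MATCH_BITS,
) -> Optional[str]:
    """Step 3 (sketch): return the first min_bits-long window of packet_bits
    that occurs in at least one packet of every other collection, else None."""
    for start in range(len(packet_bits) - min_bits + 1):
        window = packet_bits[start:start + min_bits]
        if all(
            any(kmp_find(other, window) != -1 for other in collection)
            for collection in other_collections
        ):
            return window
    return None
```

Here each element of `other_collections` plays the role of one two-dimensional binary array, stored as a list of bit strings. Sliding a fixed-length window is only one way to read step 3; an implementation could equally extend each match to its maximal length.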
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
The algorithms used in the traffic classification and identification method have low complexity and do not require large-scale machine learning training, so little computing power is needed and the method is suitable for deployment on small embedded devices. Moreover, the feature codes rarely change with the device, the system, or time, which reduces the risk of overfitting. Compared with conventional traffic detection algorithms, the method therefore offers higher accuracy and shorter detection time and can be applied in practice.
Drawings
FIG. 1 is a schematic view of step 2 in example 1 of the present invention.
FIG. 2 is a schematic view of step 3 in example 1 of the present invention.
FIG. 3 is a schematic view of step 5 in example 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
A method for classifying and identifying encrypted network traffic based on a minimum common subsequence comprises the following steps:
step 1, collecting the traffic of a specified behavior multiple times, and labeling the collection results;
step 2, extracting the application-layer payload data of each data packet in the traffic data collected in step 1 (data packets without application-layer payload data are ignored), then converting the application-layer payload data to binary, so that each data packet yields one binary number and each collection yields one two-dimensional binary array;
step 3, searching for feature codes, specifically: comparing the binary number obtained from the first data packet of the first collection with the binary numbers in the two-dimensional arrays obtained from the other collections; a match is deemed successful if more than 16 consecutive bits are identical, and if a successful match is found in the two-dimensional binary array of every other collection, the data packet is determined to contain a feature code;
step 4, processing all remaining data packets of the first collection according to step 3, thereby extracting all distinct feature codes of the first collection of the specified behavior, and obtaining the feature code sequence of that collection according to the chronological order in which the feature codes occur in the specified behavior;
step 5, processing the traffic data of the remaining collections according to step 3, and using a longest common subsequence algorithm to compute the longest common subsequence of the feature code sequences of the multiple collections of the specified behavior, thereby obtaining the feature code family of the specified behavior;
step 6, obtaining a feature code family for each type of behavior according to steps 1 to 5, and then building a feature code family database;
step 7, extracting the feature code sequence of the traffic to be detected according to steps 1 to 4, and then comparing it with the feature codes in the feature code family database of step 6, thereby completing the classification and identification of the traffic to be detected.
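Step 5 can be sketched in Python as follows, assuming each collection has already been reduced to a list of feature-code labels such as "s1" and "s2". The function names `lcs` and `feature_code_family` are hypothetical. Folding the pairwise longest common subsequence across collections is a simplification chosen for this sketch: it always yields a subsequence common to all collections, but not necessarily the longest one, and the patent does not state how more than two sequences are combined.

```python
from functools import reduce
from typing import List, Sequence


def lcs(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Longest common subsequence of two sequences (dynamic programming)."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover one LCS.
    result: List[str] = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


def feature_code_family(sequences: Sequence[Sequence[str]]) -> List[str]:
    """Step 5 (sketch): fold pairwise LCS over all feature code sequences.

    Expects at least one sequence; the pairwise fold is a heuristic, not an
    exact multi-sequence LCS.
    """
    return list(reduce(lcs, sequences))
```

For two sequences of lengths p and q the table costs O(p·q) time and memory, which stays small when each sequence holds at most one feature code per packet.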
Example 1
A method for classifying and identifying encrypted network traffic based on a minimum common subsequence comprises the following steps:
step 1, using a traffic collection tool such as Wireshark or tcpdump to collect the traffic of sending a WeChat voice message multiple times, labeling it, and storing it in pcap file format, for example "WeChat sending voice 1.pcap" and "WeChat sending voice 2.pcap";
step 2, extracting the application-layer data of each data packet in the pcap files collected in step 1 and converting it to binary, for example: hexadecimal 6f becomes binary 0110 1111; "WeChat sending voice 1.pcap" contains multiple data packets, and converting the application-layer data of each data packet to binary gives multiple binary numbers o1, o2, o3, ..., which form the two-dimensional binary array m1 of "WeChat sending voice 1.pcap"; similarly, the two-dimensional binary array m2 is obtained from "WeChat sending voice 2.pcap"; the process is shown in FIG. 1;
step 3, searching for feature codes using the KMP algorithm, specifically: matching the binary number o1, obtained from the first data packet in the two-dimensional array m1 of the first collection, against all the binary numbers in m2, m3, ...; when the matched length exceeds 16 bits and at least one match is found in every remaining mi (mi being the i-th two-dimensional binary array), the matched binary number is determined to be a feature code and denoted s1; the process is shown in FIG. 2;
step 4, repeating step 3 with the remaining binary numbers o2, o3, o4, ..., oj in m1 to obtain the feature codes s2, s3, s4, ..., sk, for k feature codes in total; since some data packets do not necessarily yield a feature code, k is generally less than or equal to j; the feature code sequence of m1 is then obtained from the order in which the feature codes occur in m1, for example s1, s3, s2, s1, s5, s4, s1, s3, ...;
step 5, using the longest common subsequence algorithm to compute the longest common subsequence of the feature code sequences of the two-dimensional arrays m1, m2, m3, ... obtained from the multiple collections of the WeChat voice sending behavior, thereby obtaining the feature code family of WeChat voice sending; the process is shown in FIG. 3;
step 6, obtaining the feature code families of various specified behaviors according to steps 1 to 5, including behaviors such as WeChat red packets and QQ file transfer, and then building a feature code family database;
step 7, extracting the feature code sequence of the traffic to be detected according to steps 1 to 4, and then comparing it with the feature codes in the feature code family database of step 6, thereby completing the classification and identification of the traffic to be detected.
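The comparison in step 7 is not spelled out in detail. One plausible reading, sketched below in Python, treats a behavior as detected when its feature code family appears in order (as a subsequence) within the feature code sequence of the traffic to be detected, and prefers the longest matching family. The function names, the behavior labels, and the example database contents are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def is_subsequence(needle: Sequence[str], haystack: Sequence[str]) -> bool:
    """True if needle's items occur in haystack in the same relative order."""
    remaining = iter(haystack)
    # "item in remaining" advances the iterator past the first match.
    return all(item in remaining for item in needle)


def classify(
    observed: Sequence[str],
    database: Dict[str, Sequence[str]],
) -> Optional[str]:
    """Step 7 (sketch): return the label of the longest feature code family
    that is a subsequence of the observed feature code sequence, or None."""
    best_label: Optional[str] = None
    best_length = 0
    for label, family in database.items():
        if len(family) > best_length and is_subsequence(family, observed):
            best_label, best_length = label, len(family)
    return best_label


# Hypothetical database: behavior label -> feature code family.
example_database = {
    "wechat_send_voice": ["s1", "s2", "s1"],
    "qq_file_transfer": ["s7", "s8"],
}
```

With this database, a call such as `classify(["s1", "s3", "s2", "s1", "s5"], example_database)` selects the WeChat voice family, because s1, s2, s1 occur in that order in the observed sequence; a sequence containing none of the stored families yields `None`.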
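TABLE 1
[Table provided as an image in the original publication: Figure BDA0003699747440000051]
TABLE 2
[Table provided as an image in the original publication: Figure BDA0003699747440000052]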
Tables 1 and 2 show the accuracy of the method of the invention in identifying various behaviors of several popular applications in an experimental environment. Each accuracy figure is the average over multiple experiments. As the tables show, almost all detections complete nearly instantly, with the slowest taking no more than 0.1 s, so the method identifies traffic quickly, and its accuracy can exceed 95%. Some behaviors occur automatically without the user tapping anything; for example, Weibo can refresh automatically and WeChat can fetch Moments updates automatically, so detecting such behaviors when the user has not actively performed them is normal.
The above are merely embodiments of the invention. Any feature disclosed in this specification may, unless stated otherwise, be replaced by alternative features serving equivalent or similar purposes; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except for mutually exclusive features and/or steps.

Claims (6)

1. A method for classifying and identifying encrypted network traffic based on a minimum common subsequence, characterized by comprising the following steps:
step 1, collecting the traffic of a specified behavior multiple times;
step 2, extracting the application-layer payload data of each data packet in the traffic data collected in step 1, then converting the application-layer payload data to binary, so that each data packet yields one binary number and each collection yields one two-dimensional binary array;
step 3, searching for feature codes, specifically: comparing the binary number obtained from the first data packet of the first collection with the binary numbers in the two-dimensional arrays obtained from the other collections; a match is deemed successful if more than 16 consecutive bits are identical, and if a successful match is found in the two-dimensional binary array of every other collection, the data packet is determined to contain a feature code;
step 4, processing all remaining data packets of the first collection according to step 3, thereby extracting all distinct feature codes of the first collection of the specified behavior, and obtaining the feature code sequence of that collection according to the chronological order in which the feature codes occur in the specified behavior;
step 5, processing the traffic data of the remaining collections according to step 3, and using a longest common subsequence algorithm to compute the longest common subsequence of the feature code sequences of the multiple collections of the specified behavior, thereby obtaining the feature code family of the specified behavior;
step 6, obtaining a feature code family for each type of behavior according to steps 1 to 5, and then building a feature code family database;
step 7, extracting the feature code sequence of the traffic to be detected according to steps 1 to 4, and then comparing it with the feature codes in the feature code family database of step 6, thereby completing the classification and identification of the traffic to be detected.
2. The method for classifying and identifying encrypted network traffic according to claim 1, wherein in step 1, traffic collection is performed by using a wireshark or tcpdump traffic collection tool.
3. The classification and identification method for encrypted network traffic according to claim 1, wherein the traffic data in step 1 are collected in different environments and devices, so that the robustness of the final identification result can be increased.
4. The classification and identification method for encrypted network traffic according to claim 1, wherein the more traffic collection times for the specified behavior in step 1, the higher the accuracy of the final identification result, and the collection times are not lower than 25.
5. The method for classifying and identifying encrypted network traffic according to claim 1, wherein a KMP matching algorithm is used to find the feature code in step 3.
6. The method for classifying and identifying encrypted network traffic according to claim 1, wherein the number of the feature codes in step 4 is less than or equal to the number of packets of one collection result of the specified behavior.
CN202210690984.5A 2022-06-17 2022-06-17 Encryption network flow classification and identification method based on minimum public subsequence Active CN115086043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210690984.5A CN115086043B (en) 2022-06-17 2022-06-17 Encryption network flow classification and identification method based on minimum public subsequence

Publications (2)

Publication Number Publication Date
CN115086043A CN115086043A (en) 2022-09-20
CN115086043B CN115086043B (en) 2023-03-21

Family

ID=83253784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210690984.5A Active CN115086043B (en) 2022-06-17 2022-06-17 Encryption network flow classification and identification method based on minimum public subsequence

Country Status (1)

Country Link
CN (1) CN115086043B (en)

Also Published As

Publication number Publication date
CN115086043A (en) 2022-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant