CN113381996A - C&C communication attack detection method based on machine learning

Info

Publication number: CN113381996A
Application number: CN202110637965.1A
Authority: CN
Inventors: 黄丽荣; 陈耿生; 蔡悦贞; 戴宏鹏; 黄嘉诚
Original assignee: China Telecom Fufu Information Technology Co Ltd
Current assignee: China Telecom Fufu Information Technology Co Ltd
Priority date: 2021-06-08
Filing date: 2021-06-08
Publication date: 2021-09-10
Anticipated expiration: 2041-06-08
Also published as: CN113381996B

Abstract

The invention discloses a C&C communication attack detection method based on machine learning, which comprises the following steps: acquiring continuous downlink traffic packets, filtering the traffic packets so that their lengths follow a normal distribution, and performing session aggregation on the traffic packets according to a specified condition; extracting session traffic features by using random cluster sampling and the Apriori algorithm; and calculating the similarity of the aggregated traffic context data by combining the edit distance and the Longest Common Subsequence (LCS), both of which measure sequence similarity. The invention can detect the communication of undiscovered malware without depending on a feature library; when a large number of attack traffic samples are examined, the detection has low time complexity and a short detection time.

Description

C&C communication attack detection method based on machine learning

Technical Field

The invention relates to the technical field of communication security, and in particular to a C&C communication attack detection method based on machine learning.

Background

At present, there are three approaches to C&C communication detection: detection based on the statistical features of traffic packets, detection based on feature codes in the traffic payload, and detection based on supervised machine learning over known malware.

Existing methods for detecting C&C communication attacks have several defects. First, they perform poorly on unpublished or undiscovered malware. Second, their detection effect depends heavily on a feature library and is therefore not comprehensive. Finally, because the network usage scenarios of normal users are increasingly diverse, the attribute features of normal user traffic can easily resemble those of malicious traffic. For example, when traffic is judged by packet size and inter-arrival time, the communication of some existing chat software shows features similar to those of malware. As a result, existing methods are limited in both detection precision and detection effect for C&C communication.

Each of the three approaches has its own disadvantages. In detection based on the statistical features of traffic packets, malware communication varies with network congestion and normal network application scenarios keep multiplying, so the statistical features of normal and malicious user traffic easily become similar and the false alarm rate is high. Detection based on feature codes in the traffic payload works well on known malware, but if the malware mutates, the feature code changes and the detection fails. Detection based on supervised machine learning learns from the traffic features of known malware, so its detection effect depends on the coverage of the training set and the soundness of the learning method.

Disclosure of Invention

The invention aims to provide a C&C communication attack detection method based on machine learning.

The technical scheme adopted by the invention is as follows:

The C&C communication attack detection method based on machine learning comprises the following steps:

step 1, traffic packet filtering: acquiring continuous downlink traffic packets and filtering them so that the packet lengths follow a normal distribution;

step 2, traffic session aggregation: performing session aggregation on the traffic packets according to a specified condition;

step 3, extracting session traffic features by using random cluster sampling and the Apriori algorithm;

step 4, calculating the similarity of the aggregated traffic context data by combining the edit distance and the Longest Common Subsequence (LCS), both of which measure sequence similarity;

step 5, judging whether abnormal C&C communication exists according to whether the context similarity of the downlink traffic of the session exceeds a set value.

Further, as a preferred embodiment, step 1 sets a filtering threshold according to the normal distribution of the traffic packet lengths and filters out part of the irrelevant traffic.

Further, as a preferred embodiment, in step 1 the critical packet length for small traffic packets is calculated from a set packet filtering rate, and the final filtering packet length is determined by combining the normal distribution estimate with a set threshold.

Further, as a preferred embodiment, in step 2, session aggregation is performed according to the source address, source port, destination address or destination port.

Further, as a preferred embodiment, when the amount of data to be processed in step 3 is too large, a reservoir sampling algorithm is used for probability sampling.

Further, as a preferred embodiment, in step 4, the edit distance is first calculated for each sequence pair, the sequence pairs with larger distance values are filtered out according to the result, and the LCS calculation is then performed on the remaining sequence pairs.
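The session aggregation of step 2 can be sketched as follows. This is a minimal illustration only: the `Packet` type, its field names, and the default choice of aggregating on all four fields are assumptions made for the example, since the text only states that aggregation uses the source address, source port, destination address or destination port.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet record; field names are illustrative only."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    payload: bytes


def aggregate_sessions(packets, key_fields=("src_ip", "src_port", "dst_ip", "dst_port")):
    """Group payloads into sessions keyed by the chosen address/port fields.

    Payloads keep their arrival order inside each session, so the
    downlink context of a session can be compared later.
    """
    sessions = defaultdict(list)
    for pkt in packets:
        key = tuple(getattr(pkt, field) for field in key_fields)
        sessions[key].append(pkt.payload)
    return dict(sessions)
```

Passing a shorter `key_fields` tuple, for example `("src_ip",)`, aggregates on a single field, which is one way to read the "or" in the specified condition.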

By adopting the above technical scheme, the network traffic is filtered, sampled and aggregated by session, context similarity detection is then performed on the session traffic data, and the presence of malware communication is detected from the result. The invention has the following advantages: 1. Undiscovered malware communication can be detected without relying on a feature library. 2. The method differs from existing supervised machine learning methods, which learn mainly from the traffic features of known malware and whose detection effect depends on the coverage of the training set and the soundness of the learning method. 3. For C&C communication detection, the detection algorithm based on downlink payload similarity has higher accuracy and recall than detection based on traffic packet statistics or payload feature codes, and has certain advantages in detection time.

Drawings

The invention is described in further detail below with reference to the accompanying drawings and the detailed description.

Fig. 1 is a schematic flow chart of the C&C communication attack detection method based on machine learning according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

As shown in Fig. 1, the invention discloses a C&C communication attack detection method based on machine learning, which comprises the following steps:

step 1, traffic packet filtering: acquiring continuous downlink traffic packets. Traffic in current network environments keeps growing, while the downlink traffic packets of malware are mostly small. The traffic packets are therefore filtered so that their lengths follow a normal distribution, which avoids wasting resources on meaningless analysis and detection of irrelevant traffic.

Further, as a preferred embodiment, step 1 sets a filtering threshold according to the normal distribution of the traffic packet lengths and filters out part of the irrelevant traffic. Specifically, the critical packet length for small traffic packets is calculated from a set packet filtering rate, and the final filtering packet length is determined by combining the normal distribution estimate with a set threshold.
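One possible reading of this threshold calculation is sketched below. The text does not give a formula, so everything here is an assumption made for illustration: the parameter names `filter_rate` and `max_len`, their default values, and the rule of taking the smaller of the normal-distribution estimate and the fixed threshold.

```python
from statistics import NormalDist, fmean, pstdev


def filter_cutoff(lengths, filter_rate=0.8, max_len=512):
    """Return a packet-length cutoff; longer packets are to be dropped.

    filter_rate: assumed fraction of (larger) packets to filter out.
    max_len: assumed fixed upper threshold on the cutoff.
    """
    if not 0.0 < filter_rate < 1.0:
        raise ValueError("filter_rate must be strictly between 0 and 1")
    mu = fmean(lengths)
    sigma = pstdev(lengths)
    if sigma == 0:
        # All packets have the same length; no distribution to estimate.
        return min(mu, max_len)
    # Normal-distribution estimate: the length below which a fraction
    # (1 - filter_rate) of the packets is expected to fall.
    estimated = NormalDist(mu, sigma).inv_cdf(1.0 - filter_rate)
    # Combine the estimate with the fixed threshold (stricter one wins).
    return min(estimated, max_len)


def filter_packets(packets, cutoff):
    """Keep only the packets whose length does not exceed the cutoff."""
    return [p for p in packets if len(p) <= cutoff]
```

Small packets are kept because the description states that malware downlink packets are mostly small; a deployment with different traffic would need a different combination rule.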

Further, as a preferred embodiment, the sampling in step 3 uses a sampling algorithm to extract from the population a sample that represents it, and the characteristics of the population are predicted from the characteristics of the extracted sample. Because the method detects content similarity in the payloads of continuous downlink traffic, and because related traffic may occur consecutively during an actual attack, a random cluster sampling algorithm is adopted. If the amount of data to be processed is too large, a reservoir sampling algorithm can be used for probability sampling instead.
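The two sampling strategies named above can be sketched as follows. The function names, the treatment of each session as one cluster, and the use of the classic single-pass reservoir scheme (Algorithm R) are assumptions for illustration; the text does not specify how clusters are formed or which reservoir variant is used.

```python
import random


def cluster_sample(sessions, n_clusters, rng=None):
    """Random cluster sampling: pick whole sessions at random.

    Each selected session keeps all of its payloads in order, so the
    continuity of consecutive downlink packets is preserved.
    """
    rng = rng or random.Random()
    keys = list(sessions)
    chosen = rng.sample(keys, min(n_clusters, len(keys)))
    return {key: sessions[key] for key in chosen}


def reservoir_sample(stream, k, rng=None):
    """Reservoir sampling (Algorithm R): k items from a stream of unknown size.

    Every item ends up in the sample with equal probability, using one
    pass over the stream and O(k) memory.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Reservoir sampling suits the large-volume case because it never needs to hold the full traffic capture in memory.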

Specifically, the similarity of downlink traffic packet sequences is detected mainly by combining two algorithms: calculating the Longest Common Subsequence (LCS) of two sequences and calculating the edit distance between them. The LCS measures the similarity of two sequences by the length of their longest common subsequence, which is generally obtained with a dynamic programming algorithm. The edit distance, also called the Levenshtein distance, is the minimum number of edits required to convert one character string into another, where an edit is the replacement of one character with another, the insertion of a character, or the deletion of a character.
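For reference, the standard dynamic programming recurrences for both measures are given below. The notation is introduced here for illustration and does not appear in the original: a and b are the two sequences, L(i, j) is the LCS length of the first i elements of a and the first j elements of b, and D(i, j) is the corresponding edit distance.

```latex
L(i,j) =
\begin{cases}
0, & i = 0 \text{ or } j = 0,\\
L(i-1,\,j-1) + 1, & a_i = b_j,\\
\max\bigl(L(i-1,\,j),\; L(i,\,j-1)\bigr), & a_i \neq b_j,
\end{cases}
\qquad
D(i,j) =
\begin{cases}
i, & j = 0,\\
j, & i = 0,\\
\min\bigl(D(i-1,\,j) + 1,\; D(i,\,j-1) + 1,\; D(i-1,\,j-1) + [a_i \neq b_j]\bigr), & \text{otherwise,}
\end{cases}
```

where [a_i ≠ b_j] is 1 if the two elements differ and 0 otherwise.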

Because the edit distance has low computational time complexity, it can be used first to remove irrelevant sequence pairs; and because the similarity calculated with the LCS is more accurate, the detection result is more reliable.
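The two-stage comparison and the threshold decision of step 5 can be sketched as below. This is only one possible arrangement: comparing consecutive payloads within a session, normalizing by the longer payload, averaging the LCS scores, and the parameter names and default values `max_dist_ratio` and `threshold` are all assumptions, not values given in the text.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # replace or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def session_similarity(payloads, max_dist_ratio=0.5):
    """Mean normalized LCS over consecutive payload pairs that pass
    the edit-distance prefilter; 0.0 if no pair passes."""
    scores = []
    for a, b in zip(payloads, payloads[1:]):
        longest = max(len(a), len(b))
        if longest == 0:
            continue
        # Stage 1: drop pairs whose edit distance is too large.
        if edit_distance(a, b) / longest > max_dist_ratio:
            continue
        # Stage 2: score the surviving pair with the LCS.
        scores.append(lcs_length(a, b) / longest)
    return sum(scores) / len(scores) if scores else 0.0


def is_suspicious(payloads, threshold=0.8):
    """Flag a session whose context similarity exceeds the set value."""
    return session_similarity(payloads) > threshold
```

For example, a session whose downlink payloads are near-identical commands such as `b"cmd:run id=1"` and `b"cmd:run id=2"` scores high, while a session of unrelated payloads is removed by the prefilter and scores 0.0.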

It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. The embodiments and features of the embodiments in the present application may be combined with each other without conflict. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Claims

1. A C&C communication attack detection method based on machine learning, characterized by comprising the following steps:

step 1, traffic packet filtering: acquiring continuous downlink traffic packets and filtering them so that the packet lengths follow a normal distribution;

step 2, traffic session aggregation: performing session aggregation on the traffic packets according to a specified condition;

step 3, extracting session traffic features by using random cluster sampling and the Apriori algorithm;

step 4, calculating the similarity of the aggregated traffic context data by combining the edit distance and the longest common subsequence, both of which measure sequence similarity;

step 5, judging whether abnormal C&C communication exists according to whether the context similarity of the downlink traffic of the session exceeds a set value.

2. The machine learning-based C&C communication attack detection method according to claim 1, characterized in that: in step 1, a filtering threshold is set according to the normal distribution of the traffic packet lengths, and part of the irrelevant traffic is filtered out.

3. The machine learning-based C&C communication attack detection method according to claim 1, characterized in that: in step 1, the critical packet length for small traffic packets is calculated from a set packet filtering rate, and the final filtering packet length is determined by combining the normal distribution estimate with a set threshold.

4. The machine learning-based C&C communication attack detection method according to claim 1, characterized in that: in step 2, session aggregation is performed according to the source address, source port, destination address or destination port.

5. The machine learning-based C&C communication attack detection method according to claim 1, characterized in that: in step 3, when the amount of data to be processed is too large, a reservoir sampling algorithm is used for probability sampling.

6. The machine learning-based C&C communication attack detection method according to claim 1, characterized in that: in step 4, the edit distance is calculated for each sequence pair, the sequence pairs with larger distance values are filtered out according to the result, and the LCS calculation is then performed on the remaining sequence pairs.