CN112333180A - APT attack detection method and system based on data mining

Info

Publication number: CN112333180A
Application number: CN202011187318.7A
Authority: CN
Inventors: 邢亚君; 彭海龙; 孟铭; 王德胜
Original assignee: Beijing An Xin Tian Xing Technology Co ltd
Current assignee: Beijing An Xin Tian Xing Technology Co ltd
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2021-02-05

Abstract

The invention discloses an APT attack detection method and system based on data mining. The method comprises the following steps: acquiring access frequency characteristics of a host to be detected and domain name popularity characteristics of the host to be detected based on the DNS log; acquiring flow characteristics of the host to be detected and port protocol mismatching characteristics of the host to be detected based on the network flow log; fusing the access frequency characteristic, the domain name popularity characteristic, the flow characteristic and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected; and inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by APT. The invention can effectively detect whether a host is attacked by APT.

Description

APT attack detection method and system based on data mining

Technical Field

The invention relates to the field of APT attack detection, in particular to an APT attack detection method and system based on data mining.

Background

With the development of communication technology, the informatization and networking of companies have become a trend. In this context, however, the advanced persistent threat (APT), which is persistent, stealthy and penetrating, has become a threat that cannot be ignored. APTs pose a great threat to organizations worldwide, so research into the defense against and detection of APT attacks has become an important direction for practitioners in the network security field. Because an APT attack has a clear target and a relatively high cost, it is generally well hidden: the attacker deliberately controls the behavior of the domain name used, so that this behavior is often difficult to distinguish from that of a normal domain name.

The K-nearest neighbor algorithm is a global, directly computed unsupervised detection algorithm. It focuses on the distance between a sample point and its neighbors and uses the absolute distance between the sample and its neighbors as the measure of abnormality, so the abnormality degree of the sample data depends on how the distance is calculated. In practice, abnormal points may form small clusters, and because the K-nearest neighbor algorithm relies on comparing a sample point with its nearest neighbors, it cannot achieve particularly good detection results on such small clusters.

Disclosure of Invention

The invention aims to provide an APT attack detection method and system based on data mining, which can effectively detect anomalies.

In order to achieve the purpose, the invention provides the following scheme:

an APT attack detection method based on data mining comprises the following steps:

acquiring access frequency characteristics of a host to be detected and domain name popularity characteristics of the host to be detected based on a DNS log, wherein the access frequency characteristics of the host to be detected represent a frequency within a set frequency range in the frequency of accessing each domain name by the host to be detected, and the domain name popularity characteristics represent the ratio of the number of hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period;

acquiring flow characteristics of a host to be detected and port protocol mismatching characteristics of the host to be detected based on a network flow log, wherein the flow characteristics of the host to be detected represent the ratio of uploading flow and downloading flow of the host to be detected, and the port protocol mismatching characteristics of the host to be detected represent the condition that a communication protocol is not matched with a port;

fusing the access frequency characteristic of the host to be detected, the domain name popularity characteristic of the host to be detected, the flow characteristic of the host to be detected and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected;

and inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by APT.

Optionally, the method further includes:

acquiring access frequency characteristics of a sample host and domain name popularity characteristics of the sample host based on a DNS log, wherein the access frequency characteristics of the sample host represent a frequency within a set frequency range in the frequency of accessing each domain name by the sample host, and the domain name popularity characteristics of the sample host represent the ratio of the number of hosts accessing the sample host in a set time period to the number of active hosts in the set time period;

acquiring flow characteristics of a sample host and port protocol mismatching characteristics of the sample host based on a network flow log, wherein the flow characteristics of the sample host represent the ratio of uploading flow and downloading flow of the sample host, and the port protocol mismatching characteristics of the sample host represent the condition that a communication protocol of the sample host is not matched with a port;

fusing the access frequency characteristic of the sample host, the domain name popularity characteristic of the sample host, the flow characteristic of the sample host and the port protocol mismatching characteristic of the sample host by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the sample host;

and training a decision tree based on an isolated forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model.

Optionally, the output of the decision tree model is the probability that the host to be detected is attacked by APT.

Optionally, the method further includes: giving an alarm when the output result of the decision tree model indicates that the host to be detected suffers from APT attack or the probability of suffering from APT attack is greater than a set threshold value.

The invention also provides an APT attack detection system based on data mining, which comprises:

the DNS log feature extraction module is used for acquiring access frequency features of a host to be detected and domain name popularity features of the host to be detected based on the DNS log, wherein the access frequency features of the host to be detected represent a frequency within a set frequency range in the frequency of the host to be detected accessing each domain name, and the domain name popularity features represent the ratio of the number of the hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period;

the flow log feature extraction module is used for acquiring flow features of the host to be detected and port protocol mismatching features of the host to be detected based on the network flow log, wherein the flow features of the host to be detected represent the ratio of uploading flow and downloading flow of the host to be detected, and the port protocol mismatching features of the host to be detected represent the condition that a communication protocol is not matched with a port;

the characteristic fusion module is used for fusing the access frequency characteristic of the host to be detected, the domain name popularity characteristic of the host to be detected, the flow characteristic of the host to be detected and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected;

and the anomaly detection module is used for inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by the APT.

Optionally,

the DNS log feature extraction module is further used for obtaining access frequency features of the sample host and domain name popularity features of the sample host based on the DNS log, wherein the access frequency features of the sample host represent a frequency in a set frequency range in the frequency of the sample host accessing each domain name, and the domain name popularity features of the sample host represent the ratio of the number of hosts accessing the sample host in a set time period to the number of active hosts in the set time period;

the flow log feature extraction module is further used for obtaining flow features of the sample host and port protocol mismatching features of the sample host based on the network flow log, wherein the flow features of the sample host represent the ratio of uploading flow and downloading flow of the sample host, and the port protocol mismatching features of the sample host represent the condition that a communication protocol of the sample host is not matched with a port;

the characteristic fusion module is also used for fusing the access frequency characteristic of the sample host, the domain name popularity characteristic of the sample host, the flow characteristic of the sample host and the port protocol mismatching characteristic of the sample host by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the sample host;

the system further comprises: the decision tree training module, which is used for training a decision tree based on an isolated forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model.

Optionally, the system further includes: the alarm module, which is used for giving an alarm when the output result of the decision tree model indicates that the host to be detected suffers from APT attack or the probability of suffering from APT attack is greater than a set threshold value.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects: the APT attack detection method and system based on data mining provided by the invention extract the characteristics of DNS logs/network flow logs, and then evaluate a data set by using an iForest anomaly detection algorithm to obtain an attack detection result. Due to the combination of the iForest algorithm, the accuracy and the effectiveness of the APT detection are improved, and a reliable basis is provided for the subsequent detection analysis.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of an APT attack detection method based on data mining according to embodiment 1 of the present invention;

Fig. 2 is a schematic structural diagram of an APT attack detection system based on data mining according to embodiment 2 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Example 1

Referring to fig. 1, the present embodiment provides a data mining-based APT attack detection method, including the following steps:

Step 101: acquiring access frequency characteristics of a host to be detected and domain name popularity characteristics of the host to be detected based on the DNS log, wherein the access frequency characteristics of the host to be detected represent a frequency within a set frequency range in the frequency of the host to be detected accessing each domain name, and the domain name popularity characteristics represent the ratio of the number of the hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period.

After a host is attacked by APT, its behavior is controlled by the attacker. Once the attacker has obtained the authority of the host, the behavior of stealing host information is hidden in normal flow, and we find that the host sends information to the command & control server at a relatively low frequency. The access frequency of the host is therefore taken as a characteristic for identifying whether the host is attacked by APT. In this embodiment, if none of the access frequencies of the host to be detected falls within the set frequency range, the host to be detected is considered not to be attacked by APT. If several access frequencies of the host to be detected fall within the set frequency range, these access frequencies are extracted one by one and each is combined with the other characteristics. For example, suppose the host to be detected accesses domain name A at frequency a, domain name B at frequency b, domain name C at frequency c, and domain name D at frequency d. If b, c and d are within the set frequency range, the frequencies b, c and d are extracted and each is combined with the other characteristics to form a detection identification vector; if the other characteristics are α, β and γ, three detection identification vectors [b, α, β, γ], [c, α, β, γ] and [d, α, β, γ] are obtained. In the subsequent steps, the characteristic elements in each of these vectors are fused to obtain three comprehensive characteristic values of the host to be detected, each comprehensive characteristic value is input to the decision tree model, and the host to be detected is considered to be attacked by APT as long as one output result indicates an APT attack.
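The construction of the detection identification vectors described above can be sketched as follows. This is only a minimal illustration: the log format (a list of `(host, domain)` pairs), the unit "queries per hour", the function names and the numeric range are all assumptions that do not come from the embodiment.

```python
from collections import Counter

def access_frequencies(dns_records, host, window_hours):
    """Return {domain: queries per hour} for one host.

    dns_records is a list of (host, domain) pairs covering window_hours
    hours of DNS log (hypothetical format).
    """
    counts = Counter(domain for h, domain in dns_records if h == host)
    return {domain: n / window_hours for domain, n in counts.items()}

def detection_vectors(frequencies, other_features, low, high):
    """Build one vector per domain whose access frequency lies in [low, high]."""
    return [
        [freq, *other_features]
        for _domain, freq in sorted(frequencies.items())
        if low <= freq <= high
    ]

# Toy log: host 10.0.0.5 queries three domains at different rates over 24 hours.
records = (
    [("10.0.0.5", "a.example")] * 240
    + [("10.0.0.5", "b.example")] * 24
    + [("10.0.0.5", "c.example")] * 12
    + [("10.0.0.9", "a.example")] * 5
)
freqs = access_frequencies(records, "10.0.0.5", window_hours=24)
# other_features stands in for the remaining characteristics (alpha, beta, gamma).
vectors = detection_vectors(freqs, other_features=[0.1, 2.0, 3], low=0.25, high=2.0)
```

An empty `vectors` list corresponds to the case in which the host has no access frequency within the set range.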

After a host is infected by malicious software, the malicious software escalates its authority to acquire access to sensitive files, but accessing sensitive files increases the risk of discovery. We have found that, in order to reduce this risk, malware often acquires sensitive information by infecting only a small portion of critical hosts in an intranet. Therefore, the present invention uses the ratio p of the number of hosts accessing a domain name to the number of active hosts during a period of time T as a feature for identifying whether the domain name (host) is subjected to APT attack.
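As a hedged illustration of the ratio p, the sketch below counts distinct hosts in a list of `(host, domain)` DNS records for one period T. The record format and the choice to treat every host that appears in the log as "active" are assumptions made for the example.

```python
def domain_popularity(dns_records, domain):
    """Ratio p = (hosts that queried `domain`) / (all active hosts) in the period.

    dns_records is a list of (host, domain) pairs for one time period T
    (hypothetical format); a host is treated as active if it appears at all.
    """
    active_hosts = {host for host, _ in dns_records}
    accessing_hosts = {host for host, d in dns_records if d == domain}
    if not active_hosts:
        return 0.0
    return len(accessing_hosts) / len(active_hosts)

records = [
    ("10.0.0.1", "intranet.example"),
    ("10.0.0.2", "intranet.example"),
    ("10.0.0.3", "intranet.example"),
    ("10.0.0.4", "intranet.example"),
    ("10.0.0.4", "rare-c2.example"),
]
p_common = domain_popularity(records, "intranet.example")
p_rare = domain_popularity(records, "rare-c2.example")
```

In this toy log the domain queried by a single host out of four active hosts gets a much lower p than the domain queried by all of them.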

Step 102: acquiring flow characteristics of the host to be detected and port protocol mismatching characteristics of the host to be detected based on the network flow log, wherein the flow characteristics of the host to be detected represent the ratio of uploading flow and downloading flow of the host to be detected, and the port protocol mismatching characteristics of the host to be detected represent the situation that a communication protocol is not matched with a port.

Generally, during normal network communication, the traffic in the downloading direction is much more than the traffic in the uploading direction. However, during the APT attack, the infected host needs to send the collected data to the command & control server, and the ratio of the upload traffic to the download traffic of the host may be higher than that of other hosts. Thus, the present invention uses the ratio of upload traffic to download traffic as a feature to identify whether a host is subject to an APT attack.
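A minimal sketch of this ratio is given below, assuming that the network flow log has already been reduced to `(src, dst, n_bytes)` tuples; that format and the handling of a zero download volume are illustrative choices, not part of the embodiment.

```python
def upload_download_ratio(flows, host):
    """Ratio of bytes sent by `host` to bytes received by `host`.

    flows is a list of (src, dst, n_bytes) tuples (hypothetical format).
    """
    uploaded = sum(n for src, _dst, n in flows if src == host)
    downloaded = sum(n for _src, dst, n in flows if dst == host)
    if downloaded == 0:
        # Assumption: no download at all is reported as an infinite ratio.
        return float("inf") if uploaded else 0.0
    return uploaded / downloaded

flows = [
    ("10.0.0.5", "203.0.113.7", 9000),   # host 10.0.0.5 uploads far more
    ("203.0.113.7", "10.0.0.5", 3000),   # than it downloads
    ("10.0.0.8", "198.51.100.2", 500),   # host 10.0.0.8 mostly downloads
    ("198.51.100.2", "10.0.0.8", 5000),
]
ratio_suspect = upload_download_ratio(flows, "10.0.0.5")
ratio_normal = upload_download_ratio(flows, "10.0.0.8")
```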

To penetrate the firewall of the target network, malware typically communicates over common protocols and ports. In general, the communication protocol is fixed during malware development, whereas the port actually used has to be configured according to the specific situation after the host is invaded, so the communication protocol may not match the port during an attack. For example, traffic on port 80 that uses a protocol other than HTTP is likely to be malicious traffic. Therefore, the invention takes the phenomenon that the communication protocol is not matched with the port as a characteristic for identifying whether the host is attacked by APT; for example, the number of connections in which the communication protocol does not match the port can be taken as the port protocol mismatch characteristic.
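Counting mismatches can be sketched as follows. The port table, the protocol labels and the `(src_host, dst_port, detected_protocol)` record format are assumptions for illustration; in practice the detected protocol would come from whatever protocol identification the flow collector provides.

```python
# Illustrative mapping from well-known ports to the protocol expected on them.
EXPECTED_PROTOCOL = {22: "ssh", 53: "dns", 80: "http", 443: "tls"}

def port_protocol_mismatches(flows, host):
    """Count flows from `host` whose detected protocol differs from the one
    expected on the destination port. Ports missing from the table are ignored.

    flows is a list of (src_host, dst_port, detected_protocol) tuples
    (hypothetical format).
    """
    return sum(
        1
        for src, port, proto in flows
        if src == host
        and port in EXPECTED_PROTOCOL
        and proto != EXPECTED_PROTOCOL[port]
    )

flows = [
    ("10.0.0.5", 80, "http"),
    ("10.0.0.5", 80, "unknown-binary"),  # non-HTTP traffic on port 80
    ("10.0.0.5", 443, "tls"),
    ("10.0.0.5", 53, "ssh"),             # SSH carried on the DNS port
    ("10.0.0.5", 8080, "http"),          # port not in the table, ignored
    ("10.0.0.9", 80, "ssh"),
]
mismatch_count = port_protocol_mismatches(flows, "10.0.0.5")
```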

In addition, after the infected host connects to the command & control server, the infected host and the command & control server can each confirm that the other is still online by sending data packets. Such data packets are commonly referred to as heartbeat packets. Heartbeat packets are typically small and strongly periodic. Thus, periodic small packets may also be used as a feature to identify whether a host is subject to an APT attack.
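The embodiment does not specify how periodicity is measured. One possible sketch, under the assumption that packets are available as `(timestamp_seconds, size_bytes)` pairs for a single host and destination, is to keep only small packets and test whether the gaps between them have a low coefficient of variation. The thresholds below are hypothetical defaults.

```python
from statistics import mean, pstdev

def looks_like_heartbeat(packets, max_size=128, max_cv=0.1, min_packets=5):
    """Heuristic check for periodic small packets.

    packets is a list of (timestamp_seconds, size_bytes) pairs. A sequence is
    flagged when at least min_packets packets are no larger than max_size and
    the gaps between them vary by at most max_cv (standard deviation / mean).
    """
    times = sorted(t for t, size in packets if size <= max_size)
    if len(times) < min_packets:
        return False
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return False
    return pstdev(gaps) / average_gap <= max_cv

periodic = [(60.0 * i, 64) for i in range(10)]                   # one small packet per minute
irregular = [(t, 64) for t in (0, 5, 100, 130, 400, 410, 900)]   # small but not periodic
bulk = [(60.0 * i, 1500) for i in range(10)]                     # periodic but large
```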

Step 103: fusing the access frequency characteristic of the host to be detected, the domain name popularity characteristic of the host to be detected, the flow characteristic of the host to be detected and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected.

After the above features of the host to be detected are extracted, the present embodiment adopts a fuzzy mathematical model to fuse the features:

(1) Define the set of influencing factors $A = \{A_1, A_2, A_3, \ldots, A_n\}$, for example $A$ = {access frequency feature, domain name popularity feature, port protocol mismatch feature, traffic feature}.

(2) Define the evaluation set $W = \{W_1, W_2, W_3, \ldots, W_m\}$.

(3) Define the single-factor mapping $f_1: A \to \mathcal{F}(W)$, $A_i \mapsto f_1(A_i) = (a_{i,1}, a_{i,2}, \ldots, a_{i,m}) \in \mathcal{F}(W)$, where $i$ and $j$ satisfy $1 \le i \le n$ and $1 \le j \le m$, the value of $a_{i,j}$ represents the evaluation value of the factor $A_i$ with respect to $W_j$, and $a_{i,1} + a_{i,2} + a_{i,3} + \ldots + a_{i,m} = 1$. From this fuzzy mapping the fuzzy relation can be obtained, and the fuzzy matrix in the model is shown in formula (1):

$$R = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,m} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,m} \end{bmatrix} \tag{1}$$

where $0 \le a_{i,j} \le 1$, $1 \le i \le n$, $1 \le j \le m$.

(4) Predefine an authoritative weight matrix, denoted $Z = [z_1, z_2, \ldots, z_n]$, where $z_1 + z_2 + \ldots + z_n = 1$. This step is very important for the convex optimization of the fuzzy matrix into a classical matrix. The value of each element represents the importance of the corresponding influencing factor in the set $A$; that is, the larger the value of $z_i$, the more important $A_i$ is.

The max-min composition operation is carried out on $R$ and $Z$, as shown in formula (2):

$$G = Z \circ R = [g_1, g_2, \ldots, g_m] \tag{2}$$

In formula (2), $\circ$ represents the max-min composition operation, that is, $g_j = \max_{1 \le i \le n} \min(z_i, a_{i,j})$. $G$ is then normalized to $G^* = [g_1^*, g_2^*, \ldots, g_m^*]$.

Finally, the comprehensive value $\Omega = G^* \times W^T$, which is the extracted characteristic, is obtained and is used in the subsequent anomaly detection algorithm.
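The fusion step can be sketched numerically as follows. The sketch makes three assumptions that the text does not state: the normalization divides each element of the composed vector by the sum of its elements, the evaluation set W is represented by numeric scores so that the final product is a single number, and the membership values in `R` and weights in `Z` are invented for illustration. The name `G_star` is used here for the normalized vector.

```python
def max_min_compose(Z, R):
    """Max-min composition of weights Z (length n) with fuzzy matrix R (n x m):
    g_j = max over i of min(z_i, a_ij)."""
    n, m = len(R), len(R[0])
    return [max(min(Z[i], R[i][j]) for i in range(n)) for j in range(m)]

def fuse(Z, R, W):
    """Return the comprehensive value: normalized composition times scores W."""
    G = max_min_compose(Z, R)
    total = sum(G)
    G_star = [g / total for g in G]   # assumption: normalize by the sum
    return sum(g * w for g, w in zip(G_star, W))

# Rows: access frequency, domain name popularity, port protocol mismatch, traffic.
# Columns: membership in three evaluation grades; each row sums to 1.
R = [
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
]
Z = [0.4, 0.3, 0.2, 0.1]   # illustrative weights summing to 1
W = [0.0, 0.5, 1.0]        # assumption: numeric score attached to each grade

G = max_min_compose(Z, R)
omega = fuse(Z, R, W)
```

With these illustrative numbers the composition gives G = [0.2, 0.3, 0.4], and the single value `omega` is what would be passed to the anomaly detection step.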

Step 104: inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by APT.

In this embodiment, the training process of the decision tree model is as follows:

(1) Acquiring access frequency characteristics of a sample host and domain name popularity characteristics of the sample host based on a DNS log, wherein the access frequency characteristics of the sample host represent a frequency within a set frequency range in the frequency of the sample host accessing each domain name, and the domain name popularity characteristics of the sample host represent the ratio of the number of hosts accessing the sample host within a set time period to the number of active hosts within the set time period. If no access frequency of a sample host falls within the set frequency range, the information corresponding to that sample host is discarded and is not used as sample data for training the decision tree. If a sample host has several access frequencies within the set frequency range, these access frequencies are extracted one by one and each is fused with the other characteristics to obtain several comprehensive characteristic values of the sample host, which are used as sample data for training the decision tree.

(2) Acquiring flow characteristics of the sample host and port protocol mismatching characteristics of the sample host based on a network flow log, wherein the flow characteristics of the sample host represent the ratio of uploading flow and downloading flow of the sample host, and the port protocol mismatching characteristics of the sample host represent the condition that a communication protocol of the sample host is not matched with a port.

(3) Fusing the access frequency characteristic of the sample host, the domain name popularity characteristic of the sample host, the flow characteristic of the sample host and the port protocol mismatching characteristic of the sample host by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the sample host.

(4) Training a decision tree based on an isolated forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model. The specific training process of the decision tree is as follows:

The iForest consists of t iTrees, each of which is a decision tree.

The specific steps for implementing the iTree are as follows:

1) randomly selecting a plurality of sample points from training data as subsamples, and putting the subsamples into a root node of a tree;

2) randomly selecting a feature as a new node, and randomly selecting a cutting point p under the current feature, wherein the cutting point p is generated between the maximum value and the minimum value of the specified dimensionality in the current node data;

3) with this cut point, a hyperplane is generated, and then the current node data space is divided into two subspaces: placing data smaller than p in the specified dimension on the left child of the current node, and placing data larger than or equal to p on the right child of the current node;

4) recursively applying steps 2) and 3) in the child nodes to keep constructing new child nodes until a child node contains only one piece of data or reaches the limited height of the tree.

After t iTrees are obtained, the training of the iForest is complete, and the model can then be used to evaluate data. Each test datum is passed through every iTree, and the depth at which it is isolated in its own node, or at which it reaches the deepest level of the tree, is recorded, thus yielding the average height at which the datum falls in the iForest.
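The four construction steps can be sketched in a few lines. This is an illustrative sketch rather than the implementation of the embodiment: the dictionary-based tree representation, the function names and the toy data are assumptions, and a node whose values are all equal in the chosen feature is simply turned into a leaf.

```python
import random

def build_itree(points, height_limit, rng, depth=0):
    """Recursively build one iTree from a list of equal-length numeric tuples."""
    # Step 4 stop conditions: at most one point left, or height limit reached.
    if len(points) <= 1 or depth >= height_limit:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))           # step 2: random feature
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                                  # simplification: nothing to cut
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)                     # step 2: random cutting point p
    left = [p for p in points if p[dim] < cut]    # step 3: smaller than p
    right = [p for p in points if p[dim] >= cut]  # step 3: greater than or equal to p
    return {
        "dim": dim,
        "cut": cut,
        "left": build_itree(left, height_limit, rng, depth + 1),
        "right": build_itree(right, height_limit, rng, depth + 1),
    }

def path_length(tree, x):
    """Number of edges from the root to the leaf that x falls into."""
    edges = 0
    while "cut" in tree:
        tree = tree["left"] if x[tree["dim"]] < tree["cut"] else tree["right"]
        edges += 1
    return edges

rng = random.Random(0)
# Toy one-dimensional comprehensive characteristic values; 0.90 is the outlier.
data = [(0.10,), (0.12,), (0.11,), (0.13,), (0.90,)]
subsample = rng.sample(data, 4)                   # step 1: random subsample
tree = build_itree(subsample, height_limit=8, rng=rng)
```

A forest would repeat the last two lines t times and average `path_length` over the resulting trees.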

The path length $h(x)$ of a sample point $x$ is the number of edges traversed from the root node of the iTree to the leaf node. Given a data set containing $y$ samples, the average path length of the tree is

$$c(y) = 2H(y-1) - \frac{2(y-1)}{y}$$

where $H(i)$ is the harmonic number, which may be estimated as $\ln(i) + 0.577215$. $c(y)$ is the average path length for a given number of samples $y$ and is used to normalize the path length $h(x)$ of the sample $x$.

The anomaly probability of sample $x$ is

$$s(x, y) = 2^{-\frac{E(h(x))}{c(y)}}$$

where $E(h(x))$ is the average of $h(x)$ over the iTrees of the iForest.

When the anomaly probability $s(x, y) \to 1$, that is, when the anomaly score of $x$ approaches 1, the sample is determined to be anomalous.
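The normalization and scoring can be checked numerically with the short sketch below. It assumes the standard isolation forest definitions, c(y) = 2H(y-1) - 2(y-1)/y and s(x, y) = 2^(-E(h(x))/c(y)), together with the harmonic number estimate ln(i) + 0.577215 given in the text; the function names and the choice to return 0 for y <= 1 are illustrative.

```python
import math

EULER_ESTIMATE = 0.577215  # constant used in the estimate H(i) = ln(i) + 0.577215

def harmonic(i):
    """Estimated harmonic number H(i)."""
    return math.log(i) + EULER_ESTIMATE

def average_path_length(y):
    """c(y): average path length for a tree built from y samples."""
    if y <= 1:
        return 0.0  # assumption: a single sample needs no split
    return 2.0 * harmonic(y - 1) - 2.0 * (y - 1) / y

def anomaly_score(mean_path_length, y):
    """s(x, y) for a sample whose average path length over all iTrees is given."""
    return 2.0 ** (-mean_path_length / average_path_length(y))

c_256 = average_path_length(256)
score_short = anomaly_score(2.0, 256)    # isolated after few splits
score_long = anomaly_score(12.0, 256)    # needs many splits
```

A sample whose average path length equals c(y) scores exactly 0.5, shorter paths push the score toward 1, and longer paths push it toward 0.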

Of course, in this embodiment, the label used for training the decision tree may be the abnormal probability (i.e. the probability of being attacked by APT) of the sample, or may be only the label representing whether the sample is abnormal or not.

Compared with other algorithms, the isolated forest has better performance, and the cost of its training process is almost independent of the size of the training data. Most applications indicate that only a very small number of samples (256 samples are selected by default) needs to be extracted for building each tree, and a good detection effect can be achieved when 100 decision trees are built.

Step 105: when the output result of the decision tree model indicates that the host to be detected is attacked by APT or the probability of being attacked by APT is greater than a set threshold value, giving an alarm to avoid greater loss.

The method adopts the iForest algorithm to detect host anomalies. The iForest algorithm selects features randomly, does not use any distance or density measure, and can separate micro-clusters through the data partitioning capability of the decision trees, so the obtained result is more accurate. In addition, the iForest algorithm has excellent computational performance: the basic model iTree can be constructed from only a small number of samples during model training, and the algorithm has linear time complexity and a very low storage requirement.

Example 2

Referring to fig. 2, the present embodiment provides an APT attack detection system based on data mining, where the system includes:

The DNS log feature extraction module 201 is configured to obtain, based on the DNS log, an access frequency feature of the host to be detected and a domain name popularity feature of the host to be detected, where the access frequency feature of the host to be detected indicates a frequency within a set frequency range among the frequencies at which the host to be detected accesses each domain name, and the domain name popularity feature indicates a ratio of the number of hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period.

The traffic log feature extraction module 202 is configured to obtain, based on the network traffic log, a traffic feature of the host to be detected and a port protocol mismatch feature of the host to be detected, where the traffic feature of the host to be detected indicates a ratio of upload traffic to download traffic of the host to be detected, and the port protocol mismatch feature of the host to be detected indicates a situation where a communication protocol is not matched with a port.

The feature fusion module 203 is configured to fuse, by using a fuzzy mathematical model, an access frequency feature of the host to be detected, a domain name popularity feature of the host to be detected, a traffic feature of the host to be detected, and a port protocol mismatch feature of the host to be detected, so as to obtain a comprehensive feature value of the host to be detected.

The anomaly detection module 204 is configured to input the comprehensive feature value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by APT.

The alarm module 205 is configured to issue an alarm when an output result of the decision tree model indicates that the host to be detected is attacked by APT or the probability of being attacked by APT is greater than a set threshold.

In this embodiment, the DNS log feature extraction module 201 is further configured to obtain, based on the DNS log, an access frequency feature of the sample host and a domain name popularity feature of the sample host, where the access frequency feature of the sample host indicates a frequency within a set frequency range among the frequencies at which the sample host accesses each domain name, and the domain name popularity feature of the sample host indicates a ratio of the number of hosts accessing the sample host in a set time period to the number of active hosts in the set time period. The traffic log feature extraction module 202 is further configured to obtain traffic features of the sample host and port protocol mismatch features of the sample host based on the network traffic log, where the traffic features of the sample host represent a ratio of upload traffic to download traffic of the sample host, and the port protocol mismatch features of the sample host represent a situation that a communication protocol of the sample host is not matched with a port. The feature fusion module 203 is further configured to fuse the access frequency feature of the sample host, the domain name popularity feature of the sample host, the traffic feature of the sample host, and the port protocol mismatch feature of the sample host by using a fuzzy mathematical model to obtain a comprehensive feature value of the sample host. The APT attack detection system provided in this embodiment may further include a decision tree training module, which is configured to train the decision tree based on an isolated forest algorithm by taking a set formed by the comprehensive feature values of all the sample hosts as a training set and taking the APT attack condition of all the sample hosts as a label to obtain a decision tree model. The output of the decision tree model is the probability that the host to be detected is attacked by APT.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. An APT attack detection method based on data mining is characterized by comprising the following steps:

2. The method of claim 1, wherein the method further comprises:

3. The data mining-based APT attack detection method according to claim 1, wherein the output of the decision tree model is a probability that the host to be detected is under APT attack.

4. The method for detecting APT attack based on data mining according to any one of claims 1-3, characterized in that the method further comprises: giving an alarm when the output result of the decision tree model indicates that the host to be detected suffers from APT attack or the probability of suffering from APT attack is greater than a set threshold value.

5. An APT attack detection system based on data mining, characterized by comprising:

6. The data mining based APT attack detection system according to claim 5,

the system further comprises: and the decision tree training module is used for training a decision tree based on an isolated forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model.

7. The data mining-based APT attack detection system of claim 5, wherein the output of the decision tree model is a probability that a host to be detected is under APT attack.

8. The data mining based APT attack detection system according to any one of claims 5-7, characterized in that said system further comprises: the alarm module, which is used for giving an alarm when the output result of the decision tree model indicates that the host to be detected suffers from APT attack or the probability of suffering from APT attack is greater than a set threshold value.