CN106817248B

CN106817248B - APT attack detection method

Info

Publication number: CN106817248B
Application number: CN201611179208.XA
Authority: CN
Inventors: 李兴华; 许勐璠; 苗美霞; 刘海; 马建峰; 李晖
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2016-12-19
Filing date: 2016-12-19
Publication date: 2020-10-16
Anticipated expiration: 2036-12-19
Also published as: CN106817248A

Abstract

The invention discloses an Advanced Persistent Threat (APT) attack detection method. The method uses a semi-supervised learning algorithm to label data with similar characteristics, so that a large-scale training data set is generated from a small amount of labeled data, and it introduces the information gain ratio to determine how strongly each feature influences detection. Features are extracted with the information gain ratio for each sub data set partitioned in the detection model, so that unknown attacks can be identified accurately. An improved k-means algorithm labels data with similar characteristics, which allows a large training data set to be labeled accurately on the basis of a small amount of manually labeled data and safeguards the detection precision of the model. Introducing the information gain ratio into the model determines the influence of different features on detection and reduces the effect of redundant and noisy features in the data, so that the important traffic features are selected, the generalization capability of the detection model is improved, and unknown attacks can be detected.

Description

APT attack detection method

Technical Field

The invention belongs to the field of network security, and particularly relates to an APT attack detection method.

Background

An advanced persistent threat (APT) is a new type of attack that is organized, targeted and extremely long lasting. The exposure of Stuxnet, Duqu and Flame, and the KillDisk attack against Ukrainian power plants in 2015, show that APT attacks pose a huge threat to the security of industrial control networks and critical information infrastructure. The main goal of an APT attack is to steal confidential information from, or cause specified damage to, military agencies, government agencies, national infrastructure and high-technology enterprises. APT attacks have two main characteristics. (1) The attack approach is advanced. Attackers mostly intrude by means of unknown attacks such as zero-day vulnerabilities; in the attacks on Iranian nuclear facilities, for example, the attackers exploited multiple unknown vulnerabilities through the Flame, Stuxnet and Duqu malware over a period of six years. Detecting unknown network attacks, however, remains very challenging. (2) The attack duration is long. Attackers may lie dormant for months or even years to achieve their goal. Operation Shady RAT, for example, began in 2006 and infiltrated and attacked the networks of as many as 72 companies and organizations worldwide, yet it was not discovered and reported by McAfee and Symantec until 2011. Likewise, the Night Dragon attack, which from 2007 stole confidential information about the oil field operations of five Western multinational energy companies, was discovered and reported by McAfee only four years later. Because of this property, APT attacks have to be detected from a large amount of network traffic or host behavior data. Although APT attacks take different approaches, they are all eventually reflected in the underlying raw data stream of the network.

Because machine learning (ML) has clear advantages in processing big data and in attack detection, it has great research prospects in the field of APT attack detection. The basic idea of an ML-based APT detection scheme is to extract features from current network traffic data or host behavior data, use historical traffic data known to be normal or abnormal as the training data set, and on that basis apply a classification algorithm to label and classify network traffic or host behavior so as to separate normal behavior from abnormal behavior. These methods, however, have the following problems. (1) The training data set for a particular network is limited. In a real APT attack scenario, training data is usually produced by manual labeling based on expert knowledge, so the accurately labeled data set is generally very small; most of the available data is unlabeled historical network traffic whose normal or abnormal status is unknown. Supervised learning needs a large amount of labeled data for training, and the scale and accuracy of the training data set directly affect the detection performance of the model. How to generate an accurate training data set from a small amount of labeled data, so as to guarantee the detection accuracy of the model, is therefore a major challenge in current APT attack detection. (2) Network traffic features are difficult to select. Current APT attack detection based on supervised learning mostly builds on traditional intrusion detection research and has certain advantages in detecting known attacks.

APT attacks, however, mostly use unknown attacks such as zero-day vulnerabilities. The attacks mounted against different target networks may differ, the zero-day vulnerabilities an attacker uses against the same target network may change dynamically, and different unknown attacks exhibit different underlying network traffic features. How to dynamically select appropriate network traffic features for a particular target network, so as to detect the unknown attacks that network suffers, is therefore another challenge. The advanced persistent threat is a novel, intelligent network attack whose main characteristics are the use of unknown attacks such as zero-day vulnerabilities and a long duration.

In summary, existing machine-learning-based APT attack detection has two problems: too little manually labeled data is available to train the detection model, and the traffic features of unknown attacks are difficult to select.

Disclosure of Invention

The invention aims to provide an APT attack detection method that solves two problems of current machine-learning-based APT attack detection: too little manually labeled data is available to train the detection model, and the traffic features of unknown attacks are difficult to select.

The invention is realized as an APT attack detection method, namely a semi-supervised learning attack detection method based on the information gain ratio for the APT environment. The method uses a semi-supervised learning algorithm to label data with similar characteristics and generates a large-scale training data set from a small amount of labeled data, thereby labeling historical data automatically and obtaining a larger, accurately labeled training data set. It introduces the information gain ratio to determine the influence of different features on detection, extracts features with the information gain ratio for each sub data set partitioned in the detection model, and selects for each target network the features most helpful for dividing the data samples, so that unknown attacks are identified accurately. Finally, a weighted majority algorithm based on information gain is used to optimize the APT attack detection performance of the model.

Further, the semi-supervised learning algorithm comprises the following specific steps:

(1) randomly select one piece of data from the labeled normal data and one from the labeled abnormal data as the cluster centers; for example, N_1 and N_5 are selected as the centers c_1 and c_2 of the labeled normal cluster and the labeled abnormal cluster;

(2) using the formula

$$d(N_i, c_k) = \sqrt{\sum_{m} \left(N_{i,m} - c_{k,m}\right)^2}$$

calculate the distance d(N_i, c_k) from each piece of data N_i to each of the cluster centers c_1 and c_2, and assign N_i to the cluster whose center gives the smaller d(N_i, c_k);

(3) using the formula

$$c'_k = \frac{1}{I'} \sum_{N_i \in \text{cluster } k} N_i$$

calculate the centroid of all points in each of the two clusters and take the centroids as the new cluster centers c'_1 and c'_2;

(4) repeat steps (2) and (3) until the total intra-cluster dispersion J reaches its minimum, where the dispersion is the sum of the distances d(N_i, c_k) from every piece of data N_i to its corresponding cluster center c_k;

(5) calculate the probability P_{l,k} with which each class of labeled data appears in each cluster, and label cluster k with the class l for which P_{l,k} is largest, finally obtaining the training data set D;

where N_{i,m} denotes the m-th feature value of the i-th piece of data, c_{k,m} denotes the m-th feature value of the k-th cluster center, and m ranges over the network traffic features; I is the total number of samples in the data set, and I' is the total number of data samples in cluster k.

Further, the random forest detection method based on the information gain ratio uses the Bootstrap resampling algorithm: one data sample is drawn from the set D with replacement each time, I draws are made in total, and repeated data is removed to obtain a sub-training set S_1. This step is repeated q times to obtain q sub-training data sets {S_1, S_2, ..., S_q}, from which q different decision trees are generated to build the random forest.

Further, each decision tree T_q is generated by the following specific steps:

step one, select the traffic feature with the largest information gain ratio as the root node of the decision tree;

step two, for the selected feature, find the threshold on the data set S_q that divides the data into leaf nodes fastest, and split the node at that threshold;

step three, before each non-leaf node selects a feature, take the remaining features as the candidate splitting feature set of the current node, and select the traffic feature with the largest information gain ratio as the non-leaf node split from the root node;

step four, repeat step two and step three until each feature corresponds to a leaf node, which completes the decision tree T_q corresponding to S_q.

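Further, the specific calculation formulas are as follows:

$$\mathrm{GainRatio}(S_q, m) = \frac{\mathrm{Gain}(S_q, m)}{\mathrm{Split}(S_q, m)}$$

$$\mathrm{Gain}(S_q, m) = H(S_q) - \sum_{v \in V(m)} \frac{|S_v|}{|S_q|} H(S_v)$$

$$\mathrm{Split}(S_q, m) = -\sum_{v \in V(m)} \frac{|S_v|}{|S_q|} \log_2 \frac{|S_v|}{|S_q|}$$

$$H(x) = -\sum_{l} p_l \log_2 p_l$$

where S_q is a subset of the training data set D randomly selected by Bootstrap resampling; GainRatio(S_q, m), Gain(S_q, m) and Split(S_q, m) denote the information gain ratio, the information gain and the split information of feature m on the sub data set S_q respectively; V(m) is the range of values of feature m; S_v is the subset of S_q whose value on feature m equals v; a is the total number of attribute values of feature m (for the feature Protocol_type, whose attribute values are TCP, UDP and ICMP, a is 3); H(x) is the entropy of data set x; and p_l is the proportion of class-l samples in the total data set.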

The invention provides an APT attack detection method that uses a semi-supervised learning algorithm to label data with similar characteristics and introduces the information gain ratio to determine how strongly each feature influences detection. Experimental results show that, compared with the conventional random forest model, the proposed model improves the accuracy and the detection rate for unknown attacks by 3.18% and 4.5% respectively, and reduces the false alarm rate and the false negative rate by 53.54% and 40.76% respectively. An improved k-means algorithm labels data with similar characteristics, which allows a large training data set to be labeled accurately on the basis of a small amount of manually labeled data and safeguards the detection precision of the model. Introducing the information gain ratio into the model determines the influence of different features on detection and reduces the effect of redundant and noisy features in the data, so that the important traffic features are selected, the generalization capability of the detection model is improved, and unknown attacks can be detected.

Drawings

Fig. 1 is a flowchart of an APT attack detection method according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of network traffic characteristics provided in an embodiment of the present invention.

Fig. 3 is a schematic diagram of a historical network traffic data marking process based on semi-supervised learning according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of traffic characteristic extraction based on information gain ratio according to an embodiment of the present invention.

Fig. 5 is a schematic diagram of a decision tree generation process based on feature extraction according to an embodiment of the present invention.

Fig. 6 is a schematic diagram of attack detection of Weighted Majority Algorithm (WMA) based on information gain according to an embodiment of the present invention.

Fig. 7 is a schematic diagram comparing detection performances of the present invention and a conventional Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM) algorithm according to an embodiment of the present invention.

Fig. 8 is a schematic diagram of the model detection rate and the training data set D accuracy under different D values according to the embodiment of the present invention.

Fig. 9 is a schematic diagram illustrating an influence of the number q of decision trees on the detection accuracy and the model detection time according to the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.

As shown in fig. 1, an APT attack detection method provided by an embodiment of the present invention includes the following steps:

S101: generate a large-scale training data set from a small amount of labeled data using a method based on semi-supervised learning;

S102: extract features with the information gain ratio for each sub data set partitioned in the detection model, so as to identify unknown attacks accurately.

The application of the principles of the present invention will now be described in further detail with reference to the accompanying drawings.

To address the two challenges in APT attack detection, a limited training data set and the difficulty of selecting network traffic features, the invention first generates a large-scale training data set from a small amount of labeled data using a method based on semi-supervised learning, and then extracts features with the information gain ratio for each sub data set partitioned in the detection model, so as to identify unknown attacks accurately.

Although APT attacks take different forms, they are all eventually reflected in the underlying network traffic data. The invention therefore adopts the most common network traffic features at present, including Duration, Protocol, Count, Srv_Count and so on; the specific features are shown in table 1. Character-type features in the data set are converted into numerical types for testing. A piece of data N_1 = (2, tcp, smtp, SF,1684,363,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,104,66,0.63,0.03,0.01,0.00,0.00, 0.00) in the data set is taken as an example. For the character-type feature Protocol_type, which has the three attribute values TCP, UDP and ICMP, 0 represents TCP, 1 represents UDP and 2 represents ICMP. The data set is also normalized to reduce the influence of differing scales among the features, so that the importance of a feature is not affected by the magnitude of its values. Min-max normalization maps the data into the range [0,1], as shown in formula (1):

$$x_{\text{Normalized}} = \frac{x_{\text{Initial}} - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where x_Normalized is the normalized value of a feature, x_Initial is its initial attribute value, x_min is the minimum value of the feature, and x_max is the maximum value of the feature. After this processing N_1 becomes (3.429×10^-5,0,0.59,0,1.286×10^-4,2.77×10^-6,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0.21,0.132,0.63,0.03,0.01,0.00,0.00,0.00,0.00,0.00).
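The preprocessing described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the helper names `encode_protocol` and `min_max_normalize` are hypothetical, and mapping a constant feature to 0.0 is an assumption the text does not address. The protocol codes follow the text, and the Duration range [0, 58329] is the one quoted later in the description.

```python
# Protocol codes as given in the text: TCP -> 0, UDP -> 1, ICMP -> 2.
PROTOCOL_CODES = {"tcp": 0, "udp": 1, "icmp": 2}


def encode_protocol(name):
    """Convert the character-type Protocol_type value to its numeric code."""
    return PROTOCOL_CODES[name.lower()]


def min_max_normalize(value, minimum, maximum):
    """Formula (1): scale a feature value into [0, 1].

    Assumption: a feature whose minimum equals its maximum is mapped to 0.0.
    """
    if maximum == minimum:
        return 0.0
    return (value - minimum) / (maximum - minimum)


# Duration of N_1 is 2, and the Duration feature ranges over [0, 58329].
duration = min_max_normalize(2, 0, 58329)
print(f"{duration:.3e}")        # 3.429e-05
print(encode_protocol("tcp"))   # 0
```

The first printed value matches the first component of the normalized N_1 given in the text.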

1.1 Training set generation based on semi-supervised learning

Because the volume of data in the target network is huge in APT attack detection, manual labeling that relies on expert knowledge yields only a small amount of accurately labeled data as training samples, so the trained model cannot detect anomalies accurately [19]. Semi-supervised learning was proposed for this kind of problem: unsupervised learning is assisted by a small amount of data carrying prior knowledge. To label historical data automatically and obtain a larger, accurately labeled training data set, the invention proposes an improved k-means semi-supervised learning algorithm, which is illustrated below with fig. 2 as an example.

(1) Randomly select one piece of data from the labeled normal data and one from the labeled abnormal data as the cluster centers; in fig. 2 the invention selects N_1 and N_5 as the centers c_1 and c_2 of the labeled (normal) cluster and the labeled (abnormal) cluster;

(2) Using formula (2), calculate the distance (similarity) d(N_i, c_k) from each piece of data N_i to each of the cluster centers c_1 and c_2, and assign N_i to the cluster whose center gives the smaller d(N_i, c_k);

(3) Using formula (3), calculate the centroid of all points in each of the two clusters and take the centroids as the new cluster centers c'_1 and c'_2;

(4) Repeat steps 2 and 3 until the total intra-cluster dispersion J reaches its minimum, where the dispersion is the sum of the distances d(N_i, c_k) from each piece of data N_i to its corresponding cluster center c_k;

(5) Calculate the probability P_{l,k} with which each class of labeled data appears in each cluster, and label cluster k with the class l for which P_{l,k} is largest, finally obtaining the training data set D.

The specific formulas are as follows:

$$d(N_i, c_k) = \sqrt{\sum_{m} \left(N_{i,m} - c_{k,m}\right)^2} \qquad (2)$$

$$c'_k = \frac{1}{I'} \sum_{N_i \in \text{cluster } k} N_i \qquad (3)$$

$$J = \sum_{k=1}^{2} \sum_{N_i \in \text{cluster } k} d(N_i, c_k) \qquad (4)$$

$$P_{l,k} = \frac{n_{l,k}}{n_l} \qquad (5)$$

$$l = \arg\max_{l} P_{l,k} \qquad (6)$$

where N_{i,m} denotes the m-th feature value of the i-th piece of data, e.g. N_{1,1} = 3.429×10^-5; c_{k,m} denotes the m-th feature value of the k-th cluster center; and m ranges over the network traffic features, of which there are 28. I is the total number of samples in the data set and I' is the total number of data samples in cluster k; for example, I = 20 in fig. 2, and I' = 15 in cluster 1 of the training data set D. d(N_i, c_k) is the Euclidean distance between the data N_i and the cluster center and describes their similarity. Since the invention divides the data set into two clusters, k is 1 or 2. P_{l,k} is the probability with which labeled data of class l appears in the k-th cluster (l = 0 or 1, where 0 denotes the normal class and 1 the abnormal class), n_{l,k} is the number of labeled samples of class l in the k-th cluster, and n_l is the total number of labeled samples of class l. Cluster k is therefore labeled with the class l for which P_{l,k} is largest; arg max f(x) denotes the value of the argument x at which f(x) attains its maximum. For example, in fig. 2 the probability that the normal labeled class appears in cluster 1 is P_{0,1} = 1 and the probability that the abnormal labeled class appears in cluster 1 is P_{1,1} = 0.33; since P_{0,1} > P_{1,1}, cluster 1 is labeled 0, i.e. cluster 1 is the normal class. Similarly, in cluster 2, P_{1,2} = 0.67 > P_{0,2}, so cluster 2 is labeled 1, i.e. cluster 2 is the abnormal class.
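Steps (1) to (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `label_by_clustering` and the toy data are hypothetical, the stopping rule "assignments no longer change" stands in for "J reaches its minimum", and P_{l,k} is taken as n_{l,k} / n_l as the definitions above suggest.

```python
import math
import random


def euclidean(a, b):
    """Formula (2): Euclidean distance between a record and a cluster center."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Formula (3): component-wise mean of the records in one cluster."""
    return [sum(p[m] for p in points) / len(points) for m in range(len(points[0]))]


def label_by_clustering(data, labels, max_iter=100, seed=0):
    """data: feature vectors; labels: 0 (normal), 1 (abnormal) or None (unlabeled).

    Returns (assignment, cluster_label): the cluster index of every record and
    the class assigned to each of the two clusters.
    """
    rng = random.Random(seed)
    normal = [i for i, lab in enumerate(labels) if lab == 0]
    abnormal = [i for i, lab in enumerate(labels) if lab == 1]
    # Step (1): one labeled normal and one labeled abnormal record seed the centers.
    centers = [list(data[rng.choice(normal)]), list(data[rng.choice(abnormal)])]
    assignment = None
    for _ in range(max_iter):
        # Step (2): assign every record to the nearer center.
        new_assignment = [min((0, 1), key=lambda k: euclidean(x, centers[k])) for x in data]
        if new_assignment == assignment:  # assumed convergence test
            break
        assignment = new_assignment
        # Step (3): recompute each center as the centroid of its cluster.
        for k in (0, 1):
            members = [x for x, a in zip(data, assignment) if a == k]
            if members:
                centers[k] = centroid(members)
    # Step (5): P[l] = n_{l,k} / n_l, and cluster k takes the class with the largest P.
    cluster_label = {}
    for k in (0, 1):
        probs = {}
        for lab, idxs in ((0, normal), (1, abnormal)):
            probs[lab] = sum(1 for i in idxs if assignment[i] == k) / len(idxs)
        cluster_label[k] = max(probs, key=probs.get)
    return assignment, cluster_label


# Toy data (made up): two well-separated groups, one labeled record in each.
data = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
labels = [0, None, None, 1, None, None]
assignment, cluster_label = label_by_clustering(data, labels)
training_labels = [cluster_label[a] for a in assignment]
print(training_labels)  # [0, 0, 0, 1, 1, 1]
```

Every record, labeled or not, inherits the class of its cluster, which is how a small labeled set is expanded into the training data set D.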

1.2 Traffic feature extraction based on information gain ratio

In machine-learning-based intrusion detection systems (IDS), the random forest algorithm generalizes well and detects attacks better than other classification algorithms, so it has become the commonly chosen baseline algorithm for attack detection. For APT attacks, however, different target networks may suffer different attacks, and different attacks exhibit different features. The invention therefore needs to select, for each target network, the features that are most helpful for dividing the data samples.

The more information a feature brings to the classification model, the more important the feature is. The amount of information in the model differs depending on whether the feature is present, and the difference between the amounts of information before and after the feature is introduced is the information gain that the feature brings to the model. To select more representative features while constructing the decision tree, the invention introduces the concept of information gain and uses the information gain ratio (Gain Ratio) to measure the ability of a given feature to distinguish training examples; the specific process is shown in fig. 4.

Assume that the training data set D generated in the previous section contains I different data samples {N_1, N_2, ..., N_I}. First, the Bootstrap resampling algorithm draws one data sample from the set D with replacement each time, for a total of I draws, and repeated data is removed to obtain a sub-training set S_1. This step is repeated q times to obtain q sub-training data sets {S_1, S_2, ..., S_q}, from which q different decision trees are generated to build the random forest. Each decision tree T_q is generated by the following specific steps:

1. select the traffic feature with the largest information gain ratio as the root node of the decision tree;

2. for the selected feature, find the threshold on the data set S_q that divides the data into leaf nodes fastest, and split the node at that threshold;

3. before each non-leaf node (including the root node) selects a feature, take the remaining features as the candidate splitting feature set of the current node, and select the traffic feature with the largest information gain ratio as the non-leaf node split from the root node;

4. repeat steps 2 and 3 until each feature corresponds to a leaf node, which completes the decision tree T_q corresponding to S_q.
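The Bootstrap step that produces the q sub-training sets can be sketched as below. The function name `bootstrap_subsets` and the seed parameter are hypothetical; the sketch only covers drawing I samples with replacement and removing duplicates, not tree construction.

```python
import random


def bootstrap_subsets(dataset, q, seed=0):
    """Return q deduplicated bootstrap samples of `dataset`.

    Each sample is built from len(dataset) draws with replacement, after which
    repeated records are removed, as described in the text.
    """
    rng = random.Random(seed)
    size = len(dataset)
    subsets = []
    for _ in range(q):
        drawn = [rng.randrange(size) for _ in range(size)]  # I draws with replacement
        unique = sorted(set(drawn))                         # drop repeated records
        subsets.append([dataset[i] for i in unique])
    return subsets


# Placeholder record names standing in for N_1 ... N_20.
D = [f"N{i}" for i in range(1, 21)]
subsets = bootstrap_subsets(D, q=3)
for s in subsets:
    print(len(s), s)
```

Because duplicates are removed, each sub-training set is usually smaller than D, which is why the text later weights the trees built from them differently.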

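The specific calculation formulas are as follows:

$$\mathrm{GainRatio}(S_q, m) = \frac{\mathrm{Gain}(S_q, m)}{\mathrm{Split}(S_q, m)} \qquad (7)$$

$$\mathrm{Gain}(S_q, m) = H(S_q) - \sum_{v \in V(m)} \frac{|S_v|}{|S_q|} H(S_v) \qquad (8)$$

$$\mathrm{Split}(S_q, m) = -\sum_{v \in V(m)} \frac{|S_v|}{|S_q|} \log_2 \frac{|S_v|}{|S_q|} \qquad (9)$$

$$H(x) = -\sum_{l} p_l \log_2 p_l \qquad (10)$$

where S_q is a subset of the training data set D randomly selected by Bootstrap resampling; GainRatio(S_q, m), Gain(S_q, m) and Split(S_q, m) denote the information gain ratio, the information gain and the split information of feature m on the sub data set S_q respectively; V(m) is the range of values of feature m; S_v is the subset of S_q whose value on feature m equals v; a is the total number of attribute values of feature m (for example, the feature Protocol_type has the attribute values TCP, UDP and ICMP, so a is 3); H(x) is the entropy of data set x; and p_l is the proportion of class-l samples in the total data set.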
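The quantities above translate directly into code for a categorical feature. The sketch below is illustrative only: the function names are hypothetical, the toy values are made up (they are not the data of table 1), and returning 0.0 when the split information is zero is an assumption.

```python
import math
from collections import Counter


def entropy(labels):
    """H(x): entropy of the class distribution in `labels`."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """GainRatio = Gain / Split for one categorical feature.

    values[i] is the feature value of record i and labels[i] is its class.
    """
    total = len(labels)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(v, []).append(lab)      # S_v for every value v
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    # Assumption: a feature with a single value carries no information.
    return gain / split if split > 0 else 0.0


# Made-up toy records: the protocol separates the two classes perfectly.
protocol = ["tcp", "tcp", "udp", "udp"]
classes = [0, 0, 1, 1]
print(gain_ratio(protocol, classes))  # 1.0
```

Here the gain and the split information are both 1 bit, so the ratio is 1; a feature taking a single value yields 0.0 under the stated assumption.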

The process is described below with reference to fig. 5. First, a data set S_q is randomly drawn from the training data set by the Bootstrap resampling algorithm, and the information gain ratio of each feature on S_q is calculated. Suppose that the feature count has the largest information gain ratio; count is then taken as the root node and construction of the decision tree T_q begins. The node is split according to the distribution of the attribute values of count in S_q, that is, a suitable threshold is found for dividing S_q. Suppose that the split threshold of count in S_q is 64: data in S_q whose count is less than or equal to 64 is classified as abnormal, and for data whose count is greater than 64 feature extraction continues. For count greater than 64, the feature with the largest information gain ratio among the remaining features, Dst_bytes, is selected as the second feature, and its split threshold is found in the same way. These steps are repeated until the data set is completely divided; in fig. 4, the data set is completely divided once the feature Protocol_type has been selected, at which point the decision tree T_q is complete.
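The text does not spell out how the split threshold of a numeric feature such as count is found. One common approach, used here purely as an assumption, is to try the midpoints between consecutive distinct values and keep the one with the largest information gain. The function name `best_threshold` and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the class distribution in `labels`."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) of the best binary split `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values
    (an assumed search strategy, not taken from the text).
    """
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Made-up count values: small counts are abnormal (1), large counts normal (0).
counts = [10, 40, 64, 70, 120, 200]
classes = [1, 1, 1, 0, 0, 0]
threshold, gain = best_threshold(counts, classes)
print(threshold, gain)  # 67.0 1.0
```

With this toy data the best split falls between 64 and 70, so every record with count at or below 64 lands on the abnormal side, mirroring the example in the text.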

The method above uses the information gain ratio rather than the information gain for feature selection because the information gain measure has an inherent bias: it favors features with more attribute values. When a feature has a large number of attribute values (for example the feature Duration, whose value range according to table 1 is [0,58329], so that it is regarded as having 58329 attribute values), the term

$$\sum_{v \in V(m)} \frac{|S_v|}{|S_q|} H(S_v)$$

in formula (8) approaches 0. Because the entropy H(S_q) of the data set S_q is fixed, the information gain Gain(S_q, m) becomes large and approaches H(S_q), so the decision tree is biased toward selecting that feature during feature selection. This in turn leads to overfitting: the selected feature only reflects the distribution of the data in the known training data set, so the model can classify only known traffic data and performs very poorly on unknown traffic data (that is, in detecting unknown attacks). The split information (Split Information) of a feature is the entropy of the corresponding data set with respect to the attribute values of that feature; for a given information gain, the importance of the feature decreases as its split information increases.

For example, when a set of I traffic records is completely divided by feature A (that is, divided into I groups, I > 2), the split information is log_2 I. Suppose there is also a Boolean feature B that partitions the same set; if it bisects the set exactly (into 2 groups), its split information is 1. If features are selected using the information gain alone instead of the information gain ratio, formula (8) gives Gain(S_q, A) > Gain(S_q, B), so feature A is selected as a non-leaf node (root node) of the decision tree. In practice, however, because feature A has so many attribute values, the data set is divided into many small subspaces, so that each leaf node may contain only pure normal or pure abnormal data. The decision tree then fits the training data perfectly, but when the test data set contains data whose value of feature A does not occur in the training data, the tree still classifies the test data through feature A alone without considering other features, which inevitably degrades the detection performance of the model considerably. The invention therefore introduces the information gain ratio to solve this problem. According to formula (7), the information gain ratio of feature B is higher, so feature B is preferentially selected as the non-leaf node (root node) of the decision tree, which avoids the loss of detection capability for unknown attacks that would result from selecting feature A with its many attribute values. Accordingly, the information gain ratio of the feature Protocol_type is calculated as follows:

$$\mathrm{GainRatio}(S_q, \text{Protocol\_type}) = \frac{\mathrm{Gain}(S_q, \text{Protocol\_type})}{\mathrm{Split}(S_q, \text{Protocol\_type})}$$

The information gain ratio thus serves as a compensating measure for the bias of the information gain: split information is introduced to penalize features with many attribute values, which improves the accuracy of the model in detecting unknown traffic.
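The comparison between a many-valued feature A and a Boolean feature B can be checked numerically. The data below is made up for illustration (16 records, 8 normal and 8 abnormal), and `gain_and_split` is a hypothetical helper. Feature A gives every record its own value; feature B bisects the set and agrees with the class for 14 of the 16 records.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of the class distribution in `labels`."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_and_split(values, labels):
    """Return (information gain, split information) of one feature."""
    total = len(labels)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(v, []).append(lab)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    split = -sum(len(g) / total * math.log2(len(g) / total) for g in groups.values())
    return gain, split


classes = [0] * 8 + [1] * 8
feature_a = list(range(16))                 # completely divides the set
feature_b = [0] * 7 + [1] + [0] + [1] * 7   # Boolean, two groups of 8

gain_a, split_a = gain_and_split(feature_a, classes)
gain_b, split_b = gain_and_split(feature_b, classes)

print("gain:      A =", round(gain_a, 3), " B =", round(gain_b, 3))
print("split:     A =", round(split_a, 3), " B =", round(split_b, 3))
print("gain ratio A =", round(gain_a / split_a, 3), " B =", round(gain_b / split_b, 3))
```

In this example plain information gain prefers A, whose split information is log_2 16 = 4, while the gain ratio prefers B, whose split information is 1. Whether B wins in general depends on how informative B is, so the example illustrates the penalty rather than proving the claim for every data set.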

The process of feature extraction and decision tree generation is illustrated below with fig. 3 as an example. Let S_q be a sub data set drawn by the Bootstrap algorithm from the training data set D = {N_1, N_2, ..., N_20}, with S_q = {N_1, N_2, N_3, N_4, N_5, N_6, N_7, N_8, N_9, N_10}. For ease of calculation, the two features Protocol_type and Service are selected for comparison; their attribute values are shown in table 1, where the data class normal/abnormal is denoted by the number 0/1. S_q then contains 7 normal and 3 abnormal pieces of data.

Table 1: Attribute values of the features Protocol_type and Service, and the class, for N_1 to N_10

Calculated in the same way, GainRatio(S_q, Service) is 18.4%. The feature Service therefore has a larger information gain ratio than Protocol_type, so the invention preferentially selects Service as the non-leaf node of the decision tree model T_1.

1.3 Attack detection based on the Weighted Majority Algorithm (WMA)

Because each decision tree is built from a sub data set generated randomly by the Bootstrap algorithm and then deduplicated, the size of each sub data set and its distribution of normal and abnormal data differ from those of the training data set D, and the information entropy H(S_q) of the sub data set S_q changes accordingly. As a result, the decision trees T_q built from the sub data sets S_q influence the final classification result to different degrees, and it is clearly inappropriate simply to take the classification output by the largest number of decision trees as the final result, as conventional RF does. In addition, test data passes through the model item by item during detection. If the data were detected and labeled with the semi-supervised learning method, every detection would require clustering all the data over many iterations, so the detection efficiency of the model would be extremely low and could not meet the real-time requirements of a practical environment; moreover, because the partitioning precision of the semi-supervised method is lower than that of a classification algorithm, using it alone for detection would reduce the overall detection precision of the model.

Therefore, as shown in fig. 6, the invention introduces a weighted majority algorithm that assigns a weight w_q to each decision tree when detecting network traffic data. The invention analyzes how far the size and data distribution of the deduplicated sub data set S_q, obtained by the Bootstrap resampling algorithm, differ from those of the training data set D, and uses the information gain Gain(S_q, l) of the sub data set S_q with respect to the training data set D to measure the influence of the corresponding decision tree T_q on the final detection result. Because l takes only the values 0 and 1, the overfitting problem caused by too many feature attribute values, described in the previous section, does not arise here, so the invention uses the information gain Gain(S_q, l) rather than the information gain ratio to measure the influence of each decision tree on the final detection result. Formula (11) derives the weight w_q of each decision tree from Gain(S_q, l), and formula (12) combines the weighted outputs of the trees:

$$y = \mathrm{sgn}\left(\sum_{q=1}^{Q} w_q\, y_q\right) \qquad (12)$$

For convenience of calculation, the output of each decision tree is divided into normal and abnormal and expressed as 1 and -1, giving q classification results (y_1, y_2, ..., y_q) with y_q ∈ {1, -1}, where -1 means that the decision tree outputs abnormal and 1 means that it outputs normal. S_{q,l} denotes the set of data of class l in the sub data set S_q, Q is the total number of decision trees, and sgn is the sign function. When the weighted sum is positive, y = 1 and the detected traffic data is normal traffic; when the weighted sum is negative, y = -1 and the detected traffic data is abnormal traffic.

For example, suppose the random forest contains two decision trees T₁ and T₂ whose weights, calculated by formula (11), are w₁ = 41% and w₂ = 33%, and that for a piece of data E₁ in the test data set E the two trees return the classification results 1 and -1 respectively. Because the tree that outputs 1 carries the larger weight, formula (12) gives y = 1, i.e. the detection result for traffic data E₁ is normal traffic.
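The worked example can be reproduced with a short sketch. It assumes that formula (12) has the usual weighted-majority form y = sgn(Σ w_q·y_q), which the example above suggests but which is not reproduced in this text. The function name `weighted_majority_vote` and the tie handling are hypothetical.

```python
def weighted_majority_vote(weights, votes):
    """Combine per-tree votes (+1 normal, -1 abnormal) using per-tree weights."""
    if len(weights) != len(votes):
        raise ValueError("weights and votes must have the same length")
    score = sum(w * y for w, y in zip(weights, votes))
    # Assumption: a tie (score == 0) is reported as 0; the source does not
    # say how ties are resolved.
    if score > 0:
        return 1
    if score < 0:
        return -1
    return 0

# Values from the example: w1 = 41%, w2 = 33%, tree outputs 1 and -1
y = weighted_majority_vote([0.41, 0.33], [1, -1])

# Hypothetical three-tree case: two low-weight trees vote normal,
# one high-weight tree votes abnormal
y_weighted = weighted_majority_vote([0.5, 0.2, 0.2], [-1, 1, 1])
```

In the three-tree case a plain majority vote would report normal, whereas the weighted vote reports abnormal, which shows how the weights change the outcome relative to conventional RF voting.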

As an APT attack progresses over time, its attack pattern may change dynamically, so the underlying network traffic data and the features that reveal the attack change as well. To cope with such dynamically changing attack patterns, the invention adds detected data to the training data set and removes earlier data, so that the training data set is updated dynamically.
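One way to picture this update is a fixed-size window over the training records. The sketch below is an illustration only: the source does not state how much earlier data is removed or when the model is retrained, so the fixed capacity, the class name `SlidingTrainingSet` and its methods are all assumptions.

```python
from collections import deque

class SlidingTrainingSet:
    """Keeps only the most recent `capacity` labeled records."""

    def __init__(self, capacity, initial=()):
        # A deque with maxlen drops the oldest record when a new one is appended
        self.records = deque(initial, maxlen=capacity)

    def add_detected(self, features, label):
        """Append a newly detected record with the label the model assigned."""
        self.records.append((features, label))

    def snapshot(self):
        """Return the current training records, oldest first."""
        return list(self.records)

# Hypothetical records: (feature tuple, label), with 1 = normal and -1 = abnormal
window = SlidingTrainingSet(3, [((0.1,), 1), ((0.2,), 1), ((0.9,), -1)])
window.add_detected((0.8,), -1)
current = window.snapshot()
```

After the call to `add_detected`, the earliest record has been dropped and the new one sits at the end of the window.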

As shown in FIG. 7, the accuracy, detection rate, false alarm rate and false negative rate of the present invention reach 80.77%, 77.51%, 0.92% and 2.79% respectively. Compared with conventional RF, the false alarm rate and the false negative rate are reduced by 53.54% and 40.76% respectively. The main reason is that, unlike the conventional RF algorithm, the invention uses the semi-supervised learning algorithm to give the model enough labeled samples for training, which ensures the effectiveness of the trained model and gives it higher detection accuracy. In addition, calculation from FIG. 7 shows that the accuracy and the detection rate of the invention are improved by 14.86%, 14.20%, 7.58% and 7.07% respectively compared with the KNN algorithm and the SVM algorithm. This is because, by introducing the information gain rate, the invention selects the traffic features that best reflect the current target network, so the detection model can detect the different abnormal behaviors present in different target networks and is suitable for different APT attack scenarios. Meanwhile, the invention weights the decision trees by information gain and obtains the final detection result based on WMA, so that different decision trees influence the detection performance of the model to different degrees, the model does not depend solely on the choice of the number of trees as in the standard random forest algorithm, and the detection precision of the model is greatly improved.

Effect of different scale training data sets on test Performance

FIG. 8 shows how the accuracy of the training data set D and the detection accuracy of the model vary with the scale of the unlabeled historical data. The invention sets the ratio of labeled accurate data to unlabeled historical data to 1:d, where d represents the scale of the unlabeled historical data. The figure shows that the accuracy of the training data set D decreases gradually as d increases, whereas the detection accuracy of the model first increases and then decreases, reaching a maximum of 80.77% at d = 9. When d is too small, there is too little data to train the model, so its detection accuracy is low; when d is too large, the generated training data set D is itself inaccurate, so the accuracy of the trained model is low. Therefore, the training data set D is constructed with d = 9.

Influence of decision tree number on detection performance

As can be seen from FIG. 9, when the number q of decision trees is 300, the detection accuracy reaches its highest value of 80.77%, and the model takes 112 s for detection. When q is greater than 300, the detection accuracy levels off while the detection time increases greatly, so the model can no longer guarantee real-time APT attack detection. The number q of decision trees is a major factor in the performance and efficiency of the model. When q is small, the detection accuracy of the model is poor. Because a random forest has the property of not overfitting, q could be made as large as possible to ensure the detection accuracy of the model. However, the complexity of the model is proportional to q, so an excessively large q makes detection take too long. Therefore, q = 300 is used in the experiment.

In conclusion, the invention uses a small amount of accurately labeled data to generate a large-scale, accurately labeled training data set, and at the same time extracts the features that best reflect the target network traffic, thereby ensuring the detection precision of the model for unknown attacks. To address the scarcity of training data and the difficulty of selecting network traffic features in the APT environment, the invention generates a large-scale training data set from a small amount of labeled data using a semi-supervised learning algorithm, trains a detection model, and introduces the information gain rate to extract features from APT target network traffic data, so as to detect unknown attacks in the target network. The effectiveness of the method is verified by comparing the experimental detection results with those of the RF, KNN and SVM algorithms, and the influence of the training data set scale and of the number of decision trees on detection performance is analyzed.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. An APT attack detection method, characterized in that the method is a semi-supervised learning attack detection method based on the information gain rate in an APT environment, which uses a semi-supervised learning algorithm to label data with similar features, uses a small amount of labeled data to generate a large-scale training data set, and introduces the information gain rate to determine the degree of influence of different features on detection; the information gain rate is used to extract the features of each sub data set divided in the detection model, so as to accurately identify unknown attacks;

the semi-supervised learning algorithm comprises the following specific steps:

(1) randomly selecting one piece of data from the labeled normal data and one piece of data from the labeled abnormal data, the selected data being N1 and N5 respectively, as the cluster centers c₁, c₂ of the normal data cluster and the abnormal data cluster;

(2) Using formulas

calculating the distance d(N_i, c_k) between each piece of data N_i and each of the cluster centers c₁, c₂, and assigning N_i to the cluster for which d(N_i, c_k) is smaller; N_i,m represents the m-th feature value of the i-th piece of data; M represents the number of network traffic features, and M is 28 in an APT environment; c_k,m represents the m-th feature value of the k-th cluster center;

(3) using formulas

calculating the centroid of all points in each of the two clusters and taking it as the new cluster center c′₁, c′₂; i represents the i-th data sample in cluster k, and I′ represents the total number of data samples in cluster k;
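Steps (1) to (3) can be sketched as a single iteration of a seeded two-cluster k-means. The distance formula is not reproduced in this text, so Euclidean distance is an assumption here. The function names, the 2-feature sample records (the method uses M = 28 features) and the fixed choice of seeds in place of a random draw are likewise hypothetical.

```python
import math

def assign_clusters(data, centers):
    """Assign each record to the index of its nearest center (Euclidean distance assumed)."""
    return [min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))
            for x in data]

def update_centers(data, assignment, centers):
    """Return the centroid of each cluster; an empty cluster keeps its old center."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [x for x, a in zip(data, assignment) if a == k]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(x[m] for x in members) / len(members)
                                 for m in range(len(old))))
    return new_centers

# Hypothetical records with 2 features each
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
# Step (1): one labeled normal and one labeled abnormal record as seeds
centers = [data[0], data[2]]
# Step (2): assign every record to the nearer center
assignment = assign_clusters(data, centers)
# Step (3): recompute each center as the centroid of its cluster
new_centers = update_centers(data, assignment, centers)
```

In a full run, steps (2) and (3) would be repeated until the centers stop moving; the stopping rule is not stated in this part of the text.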

2. The APT attack detection method according to claim 1, wherein the semi-supervised learning has a specific formula as follows:

l = argmax P_l,k;

3. The APT attack detection method of claim 1, wherein, for the information gain rate, the Bootstrap resampling algorithm is used to draw one data sample at a time from the set D with replacement, I times in total, and repeated data are removed to obtain a sub-training set S₁; this step is repeated q times to obtain q sub-training data sets {S₁, S₂, ..., S_q}, from which q different decision trees are generated to build a random forest.
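The resampling step can be sketched as follows. This is a minimal illustration: the function name `bootstrap_subsets`, its parameters and the sample records are hypothetical, and removing duplicates by record index (rather than by record value) is an assumption.

```python
import random

def bootstrap_subsets(dataset, draws, q, seed=None):
    """Build q sub-training sets: draw `draws` samples with replacement, then drop duplicates."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(q):
        picked = [rng.randrange(len(dataset)) for _ in range(draws)]
        unique_idx = sorted(set(picked))  # duplicate removal by record index
        subsets.append([dataset[i] for i in unique_idx])
    return subsets

# Hypothetical (record, label) pairs standing in for the training data set D
D = [("rec%d" % i, i % 2) for i in range(20)]
subsets = bootstrap_subsets(D, draws=len(D), q=5, seed=0)
```

Because duplicates are removed after sampling, the sub-training sets generally differ in size and in label distribution, which is why each resulting tree is later weighted individually.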

4. The APT attack detection method of claim 3, characterized in that generating each decision tree T_q comprises the following specific steps:

step two, for the selected feature, finding the threshold at which the data set S_q splits to leaf nodes fastest, and splitting the node at that threshold;

5. The APT attack detection method according to claim 3, wherein the specific calculation formula is as follows: