CN106817248B - APT attack detection method - Google Patents

Info

Publication number
CN106817248B
CN106817248B CN201611179208.XA CN201611179208A CN106817248B CN 106817248 B CN106817248 B CN 106817248B CN 201611179208 A CN201611179208 A CN 201611179208A CN 106817248 B CN106817248 B CN 106817248B
Authority
CN
China
Prior art keywords
data
cluster
detection
information gain
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611179208.XA
Other languages
Chinese (zh)
Other versions
CN106817248A (en)
Inventor
李兴华
许勐璠
苗美霞
刘海
马建峰
李晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201611179208.XA
Publication of CN106817248A
Application granted
Publication of CN106817248B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14Network analysis or design
    • H04L41/145Network analysis or design involving simulating, designing, planning or modelling of a network

Abstract

The invention discloses an Advanced Persistent Threat (APT) attack detection method. A semi-supervised learning algorithm marks data with similar characteristics, so that a small amount of marked data yields a large-scale training data set. An information gain ratio is introduced to determine how strongly each feature influences detection, and features are extracted with the information gain ratio for each sub data set divided in the detection model, so as to accurately identify unknown attacks. The method uses an improved k-means algorithm to mark data with similar characteristics, which enables accurate marking of a large training data set from a small amount of manually marked data and ensures the detection precision of the model. Introducing the information gain ratio into the model reduces the influence of redundant and noisy features in the data, so that important traffic features are selected, the generalization capability of the detection model is improved, and unknown attacks can be detected.

Description

APT attack detection method
Technical Field
The invention belongs to the field of network security, and particularly relates to an APT attack detection method.
Background
An advanced persistent threat (APT) is a new type of attack that is organized, targeted, and extremely long lasting. The exposure of Stuxnet, Duqu, and Flame, and of the KillDisk attacks against Ukrainian power plants in 2015, shows that APT attacks pose a huge threat to the security of industrial control networks and key information infrastructure. The main goal of an APT attack is to steal confidential information from, or cause specific damage to, military agencies, government agencies, national infrastructure, and high-technology enterprises. APT attacks have two main characteristics. (1) The attack approach is advanced. Attackers mostly intrude through unknown attacks such as 0-day vulnerabilities. For example, in the attack on the Iranian nuclear power station, the attackers exploited multiple unknown vulnerabilities over 6 years by means of the Flame, Stuxnet, and Duqu viruses. However, detecting unknown network attacks is currently very challenging. (2) The attack duration is long. Attackers may lie latent for months or even years to reach their target. For example, the "scout action" that began in 2006 infiltrated and attacked the networks of up to 72 companies and organizations worldwide, and was not discovered and reported by McAfee and Symantec until 2011. Similarly, the "Night Dragon" attack, which began in 2007 and stole confidential information on the oil field operations of 5 Western multinational energy companies, was discovered and reported by McAfee only 4 years later. Because of this property, APT attacks must be detected from a large amount of network traffic and host behavior data. Although APT attacks take different approaches, they are all eventually reflected in the underlying raw data stream of the network.
Because machine learning (ML) has obvious advantages in processing big data and in attack detection, it has great research prospects in the field of APT attack detection. The basic idea of a machine-learning-based APT attack detection scheme is to extract features from current network traffic data or host behavior data, use historical traffic data known to be normal or abnormal as a training data set, and on that basis use a classification algorithm to label and classify the network traffic data or host behavior, distinguishing normal behavior from abnormal behavior. However, these methods have the following problems. (1) The training data set for a particular network is limited. In an actual APT attack scenario, training data is usually produced by manual marking based on expert knowledge, so the accurately marked data set is generally very small; most of the data is unmarked historical network traffic whose normal or abnormal status is unknown. Supervised learning itself needs a large amount of labeled data as a training data set, and the scale and accuracy of the training data set directly affect the detection performance of the model. Therefore, how to generate an accurate training data set from a small amount of labeled data, so as to ensure the detection accuracy of the model, is a major challenge in current APT attack detection. (2) Network traffic features are difficult to select. At present, APT attack detection based on supervised learning mostly builds on traditional intrusion detection research and has certain advantages in detecting known attacks.
However, APT attacks mostly use unknown attacks such as 0-day vulnerabilities. The attacks used against different target networks may differ, the 0-day vulnerabilities that an attacker uses against the same target network may even change dynamically, and different unknown attacks exhibit different underlying network traffic features. Therefore, how to dynamically choose appropriate network traffic features for a particular target network, so as to detect the unknown attacks that the network suffers, is another challenge. The advanced persistent threat is a novel intelligent network attack whose main characteristics are the use of unknown attacks such as 0-day vulnerabilities and a long duration.
In summary, current machine-learning-based APT attack detection suffers from two problems: too little manually marked data for training the detection model, and difficulty in selecting traffic features for unknown attacks.
Disclosure of Invention
The invention aims to provide an APT attack detection method that solves two problems in current machine-learning-based APT attack detection: too little manually marked data for training the detection model, and difficulty in selecting traffic features for unknown attacks.
The invention is realized as follows. The APT attack detection method is a semi-supervised learning attack detection method based on the information gain ratio in an APT environment. It uses a semi-supervised learning algorithm to mark data with similar characteristics and generates a large-scale training data set from a small amount of marked data, which realizes automatic marking of historical data and yields a larger, accurately marked training data set. It introduces an information gain ratio to determine how strongly each feature influences detection, extracts features with the information gain ratio for each sub data set divided in the detection model, and selects the features that best help divide the data samples for different target networks, so as to accurately identify unknown attacks. Finally, it optimizes the APT attack detection performance of the model with a weighted majority algorithm based on information gain.
Further, the semi-supervised learning algorithm comprises the following specific steps:
(1) randomly selecting one piece of data from the marked normal data and one from the marked abnormal data as cluster centers, N_1 and N_5 being selected as the cluster centers c_1 and c_2 of the two marked data clusters;
(2) using the formula
$$d(N_i, c_k) = \sqrt{\sum_{m} (N_{i,m} - c_{k,m})^2}$$
to calculate the distance d(N_i, c_k) between each piece of data N_i and each of the cluster centers c_1 and c_2, and assigning N_i to the cluster with the smaller d(N_i, c_k);
(3) using the formula
$$c_k' = \frac{1}{I'} \sum_{N_i \in \text{cluster } k} N_i$$
to calculate the centroid of all points in each of the two clusters, and taking the centroids as the new cluster centers c_1', c_2';
(4) repeating steps (2) and (3) until the total intra-cluster dispersion J reaches its minimum, wherein the dispersion is the sum of the distances d(N_i, c_k) from each piece of data N_i to its corresponding cluster center c_k;
(5) calculating the probability P_{l,k} of each class of marked data in each cluster, and marking cluster k with the class l for which P_{l,k} is largest, finally obtaining the training data set D;
wherein N_{i,m} represents the m-th feature value of the i-th piece of data, c_{k,m} represents the m-th feature value of the k-th cluster center, and m indexes the network traffic features; I is the total number of samples in the data set, and I' is the total number of data samples in cluster k.
Further, the random forest detection method based on the information gain ratio uses a Bootstrap resampling algorithm: one data sample is drawn from the set D with replacement each time, I draws are made in total, and repeated data is removed to obtain a sub-training set S_1. This step is repeated q times to obtain q sub-training data sets {S_1, S_2, ..., S_q}, from which q different decision trees are generated to build a random forest.
Further, generating each decision tree T_q comprises the following specific steps:
step one, selecting the traffic feature with the largest information gain ratio as the root node of the decision tree;
step two, finding the threshold on the selected feature that divides the corresponding data set S_q into leaf nodes fastest, and splitting the node at that threshold;
step three, before each non-leaf node selects a feature, taking the remaining features as the splitting feature set of the current node, and selecting the traffic feature with the largest information gain ratio as the non-leaf node split from the root node;
step four, repeating step two and step three until each feature corresponds to a leaf node, which completes the decision tree T_q corresponding to S_q.
Further, the specific calculation formula is as follows:
$$\mathrm{GainRatio}(S_q, m) = \frac{\mathrm{Gain}(S_q, m)}{\mathrm{Split}(S_q, m)}$$
$$\mathrm{Gain}(S_q, m) = H(S_q) - \sum_{v \in V(m)} \frac{|S_v|}{|S_q|} H(S_v)$$
$$\mathrm{Split}(S_q, m) = -\sum_{v \in V(m)} \frac{|S_v|}{|S_q|} \log_2 \frac{|S_v|}{|S_q|}$$
$$H(x) = -\sum_{l} p_l \log_2 p_l$$
wherein S_q is a subset of the training data set D randomly selected by Bootstrap resampling; GainRatio(S_q, m), Gain(S_q, m), and Split(S_q, m) respectively represent the information gain ratio, the information gain, and the split information of feature m on the sub data set S_q; V(m) is the range of values of feature m; S_v is the subset of S_q whose value on feature m equals v; a represents the total number of attribute values of feature m, that is, the number of terms in the sums over V(m); for the feature Protocol_type, whose attribute values are TCP, UDP, and ICMP, a = 3; H(x) is the entropy of data set x; p_l is the proportion of class l samples in the total data set.
The invention provides an APT attack detection method that uses a semi-supervised learning algorithm to mark data with similar characteristics and introduces an information gain ratio to determine how strongly each feature influences detection. The experimental results show that, compared with the traditional random forest model, the proposed model improves accuracy and detection rate on unknown attacks by 3.18% and 4.5% respectively, and reduces the false alarm rate and the missed detection rate by 53.54% and 40.76% respectively. The method uses an improved k-means algorithm to mark data with similar characteristics, which enables accurate marking of a large training data set from a small amount of manually marked data and ensures the detection precision of the model. Introducing the information gain ratio into the model reduces the influence of redundant and noisy features in the data, so that important traffic features are selected, the generalization capability of the detection model is improved, and unknown attacks can be detected.
Drawings
Fig. 1 is a flowchart of an APT attack detection method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of network traffic characteristics provided in an embodiment of the present invention.
Fig. 3 is a schematic diagram of a historical network traffic data marking process based on semi-supervised learning according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of traffic characteristic extraction based on information gain ratio according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a decision tree generation process based on feature extraction according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of attack detection of Weighted Majority Algorithm (WMA) based on information gain according to an embodiment of the present invention.
Fig. 7 is a schematic diagram comparing detection performances of the present invention and a conventional Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM) algorithm according to an embodiment of the present invention.
Fig. 8 is a schematic diagram of the model detection rate and the training data set D accuracy under different D values according to the embodiment of the present invention.
Fig. 9 is a schematic diagram illustrating an influence of the number q of decision trees on the detection accuracy and the model detection time according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.
As shown in fig. 1, an APT attack detection method provided by an embodiment of the present invention includes the following steps:
S101: based on semi-supervised learning, generate a large-scale training data set from a small amount of labeled data;
S102: extract features with the information gain ratio for each sub data set divided in the detection model, so as to accurately identify unknown attacks.
The application of the principles of the present invention will now be described in further detail with reference to the accompanying drawings.
In order to solve two challenges of limited training data set and difficulty in selecting network flow characteristics in APT attack detection, firstly, a large-scale training data set is generated by utilizing a small amount of marked data based on a semi-supervised learning method, and then, characteristic extraction is carried out on each sub data set divided in a detection model by adopting an information gain rate so as to realize accurate identification of unknown attacks.
Although APT attacks take different forms, they are all eventually reflected in the underlying network traffic data. The invention therefore adopts the most common network traffic features at present, including Duration, Protocol_type, Count, Srv_count, and so on; the specific features are shown in table 1. The invention converts character-type features in the data set into numerical types for testing. Here, a piece of data in the data set, N_1 = (2, tcp, smtp, SF, 1684, 363, 0, 0, 0, 1, 1, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 104, 66, 0.63, 0.03, 0.01, 0.00, 0.00, 0.00), is taken as an example. For the character-type feature Protocol_type, which has the three attribute values TCP, UDP, and ICMP, the invention uses 0 to represent TCP, 1 to represent UDP, and 2 to represent ICMP. The data set is also normalized to reduce the influence of differing scales among the features, so that the importance of each feature is not affected by the magnitude of its values. The invention uses the min-max normalization method to scale the data to the range [0,1]; the specific calculation formula is shown as formula (1):
$$x_{Normalized} = \frac{x_{Initial} - x_{min}}{x_{max} - x_{min}} \qquad (1)$$
wherein x_{Normalized} is the normalized value of a feature, x_{Initial} is the initial attribute value of the feature, x_{min} is the minimum value of the feature, and x_{max} is the maximum value of the feature. After the above processing, N_1 is converted into (3.429×10^-5, 0, 0.59, 0, 1.286×10^-4, 2.77×10^-6, 0, 0, 0, 1, 1, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.21, 0.132, 0.63, 0.03, 0.01, 0.00, 0.00, 0.00, 0.00, 0.00).
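The encoding and normalization described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `PROTOCOL_CODES` and `min_max_normalize` are hypothetical, and mapping a constant feature to 0 is an assumption made to avoid division by zero. The Duration range [0, 58329] is the one the text quotes from table 1.

```python
# Hypothetical names; the code values follow the text (TCP=0, UDP=1, ICMP=2).
PROTOCOL_CODES = {"tcp": 0, "udp": 1, "icmp": 2}


def min_max_normalize(value, minimum, maximum):
    """Scale a feature value into [0, 1] as in formula (1)."""
    if maximum == minimum:
        # Assumption: a constant feature carries no information, map it to 0.
        return 0.0
    return (value - minimum) / (maximum - minimum)


# Duration of N_1 is 2, with the value range [0, 58329] quoted in the text.
duration = min_max_normalize(2, 0, 58329)
# Protocol_type of N_1 is tcp, encoded as 0 and then scaled over the codes 0..2.
protocol = min_max_normalize(PROTOCOL_CODES["tcp"], 0, 2)

print(f"{duration:.3e}", protocol)
```

The first value reproduces the 3.429×10^-5 entry of the normalized N_1, and the encoded protocol reproduces its second entry, 0.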
1.1 Training set generation based on semi-supervised learning
Because the data volume of the target network in APT attack detection is huge, manual marking that depends on expert knowledge can only produce a small amount of accurately marked data as training samples, so the trained model cannot detect anomalies accurately [19]. Semi-supervised learning was proposed for this kind of problem: a small amount of data with prior knowledge assists unsupervised learning. To mark historical data automatically and obtain a larger, accurately marked training data set, the invention provides an improved k-means semi-supervised learning algorithm, illustrated below with fig. 2 as an example.
(1) Randomly select one piece of data from the marked normal data and one from the marked abnormal data as cluster centers; in FIG. 2, the invention selects N_1 and N_5 as the cluster centers c_1 and c_2 of the marked (normal) data cluster and the marked (abnormal) data cluster, respectively;
(2) Use formula (2) to calculate the distance (similarity) d(N_i, c_k) between each piece of data N_i and each of the cluster centers c_1 and c_2, and assign N_i to the cluster with the smaller d(N_i, c_k);
(3) Use formula (3) to calculate the centroid of all points in each of the two clusters, and take the centroids as the new cluster centers c_1', c_2';
(4) Repeat steps (2) and (3) until the total intra-cluster dispersion J reaches its minimum, where the dispersion is the sum of the distances d(N_i, c_k) from each piece of data N_i to its corresponding cluster center c_k;
(5) Calculate the probability P_{l,k} of each class of marked data in each cluster, and mark cluster k with the class l for which P_{l,k} is largest, finally obtaining the training data set D.
The specific formula is as follows:
$$d(N_i, c_k) = \sqrt{\sum_{m} (N_{i,m} - c_{k,m})^2} \qquad (2)$$
$$c_k' = \frac{1}{I'} \sum_{N_i \in \text{cluster } k} N_i \qquad (3)$$
$$J = \sum_{k=1}^{2} \sum_{N_i \in \text{cluster } k} d(N_i, c_k) \qquad (4)$$
$$P_{l,k} = \frac{n_{l,k}}{n_l} \qquad (5)$$
$$l = \arg\max P_{l,k} \qquad (6)$$
wherein N_{i,m} represents the m-th feature value of the i-th piece of data, for example N_{1,1} = 3.429×10^-5; c_{k,m} represents the m-th feature value of the k-th cluster center; m indexes the network traffic features, of which there are 28. I is the total number of samples in the data set and I' is the total number of data samples in cluster k; for example, in fig. 2, I = 20, and I' = 15 in cluster 1 of the training data set D. d(N_i, c_k) is the Euclidean distance between data N_i and the cluster center, and describes their similarity. Since the invention divides the data set into two clusters, k = 1 or 2. P_{l,k} is the probability that the marked class l appears in the k-th cluster (l = 0 or 1, where 0 is the normal class and 1 is the abnormal class), n_{l,k} is the number of marked samples of class l in the k-th cluster, and n_l is the total number of marked samples of class l. Cluster k is marked with the class l for which P_{l,k} is largest; arg max f(x) denotes the value of the argument x at which the function f(x) attains its maximum. For example, in FIG. 2, the probability that the normal marked class appears in cluster 1 is P_{0,1} = 1 and the probability that the abnormal marked class appears in cluster 1 is P_{1,1} = 0.33; since P_{0,1} > P_{1,1}, cluster 1 is marked 0, that is, cluster 1 is the normal class. Similarly, in cluster 2, P_{1,2} = 0.67 > P_{0,2}, so cluster 2 is marked 1, that is, cluster 2 is the abnormal class.
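Steps (1) to (5) can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions rather than the patented procedure: the function names, the toy two-feature data, and the `labels` dictionary are hypothetical, and the loop stops when cluster assignments no longer change, a standard k-means stopping rule used here in place of an explicit check on J.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors, as in formula (2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Feature-wise mean of the points in one cluster, as in formula (3)."""
    return [sum(column) / len(points) for column in zip(*points)]


def label_clusters(data, labels, seed_normal, seed_abnormal, max_iter=100):
    """Cluster data into two groups and label each group.

    data: list of feature vectors; labels: {index: 0 or 1} for the few
    manually marked items; seed_*: indices of the initial cluster centers.
    """
    centers = [list(data[seed_normal]), list(data[seed_abnormal])]
    assign = None
    for _ in range(max_iter):
        new_assign = [min((0, 1), key=lambda k: distance(p, centers[k]))
                      for p in data]
        if new_assign == assign:  # assumption: stop once assignments are stable
            break
        assign = new_assign
        for k in (0, 1):
            members = [p for p, a in zip(data, assign) if a == k]
            if members:
                centers[k] = centroid(members)
    cluster_labels = []
    for k in (0, 1):
        probs = {}
        for l in (0, 1):
            n_l = sum(1 for v in labels.values() if v == l)
            n_lk = sum(1 for i, v in labels.items() if v == l and assign[i] == k)
            probs[l] = n_lk / n_l if n_l else 0.0  # P_{l,k} = n_{l,k} / n_l
        cluster_labels.append(max(probs, key=probs.get))
    return assign, cluster_labels


# Toy data (hypothetical): items 0 and 3 are the only manually marked ones.
data = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
        [0.9, 1.0], [1.0, 0.9], [0.95, 0.95]]
labels = {0: 0, 3: 1}
assign, cluster_labels = label_clusters(data, labels, seed_normal=0, seed_abnormal=3)
training_labels = [cluster_labels[a] for a in assign]
print(training_labels)
```

Every item inherits the label of its cluster, which is how a handful of marked samples yields a fully labeled training set D.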
1.2 Traffic feature extraction based on information gain ratio
In a machine-learning-based Intrusion Detection System (IDS), the random forest algorithm generalizes well and detects attacks better than other classification algorithms, so it has become a commonly chosen baseline algorithm for attack detection. However, for APT attacks, different target networks may suffer different attacks, and different attacks exhibit different features. The invention therefore needs to choose, for each target network, the features that best help divide the data samples.
The more information a feature brings to the classification model, the more important the feature is. The amount of information in the model changes depending on whether the feature is present, and the difference between the amount of information before and after is the information gain that the feature brings to the model. To select more representative features while constructing the decision tree, the invention introduces the concept of information gain and uses the information gain ratio (Gain Ratio) to measure the ability of a given feature to distinguish training examples; the specific process is shown in fig. 4.
The invention assumes that the training data set D generated in the previous section contains I different data samples {N_1, N_2, ..., N_I}. First, a Bootstrap resampling algorithm draws one data sample from the set D with replacement each time, for I draws in total, and repeated data is removed to obtain a sub-training set S_1. This step is repeated q times to obtain q sub-training data sets {S_1, S_2, ..., S_q}, from which q different decision trees are generated to build a random forest. Generating each decision tree T_q comprises the following specific steps:
1. selecting the traffic feature with the largest information gain ratio as the root node of the decision tree;
2. finding the threshold on the selected feature that divides the corresponding data set S_q into leaf nodes fastest, and splitting the node at that threshold;
3. before each non-leaf node (including the root node) selects a feature, taking the remaining features as the splitting feature set of the current node, and selecting the traffic feature with the largest information gain ratio as the non-leaf node split from the root node;
4. repeating steps 2 and 3 until each feature corresponds to a leaf node, which completes the decision tree T_q corresponding to S_q.
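The Bootstrap resampling that produces the q sub-training sets before these steps run can be sketched as follows. The function name `bootstrap_subsets` and the fixed seed are hypothetical choices for illustration; the draw count and the removal of repeated data follow the text.

```python
import random


def bootstrap_subsets(dataset, q, seed=0):
    """Build q sub-training sets by sampling with replacement, then de-duplicating."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    size = len(dataset)
    subsets = []
    for _ in range(q):
        # I draws with replacement from a data set of I samples.
        drawn = [rng.randrange(size) for _ in range(size)]
        # Remove repeated data, keeping each drawn sample once.
        unique = sorted(set(drawn))
        subsets.append([dataset[i] for i in unique])
    return subsets


# Stand-in for D = {N_1, ..., N_20}: the integers are placeholders for samples.
subsets = bootstrap_subsets(list(range(20)), q=5)
print([len(s) for s in subsets])
```

Because duplicates are dropped, each sub-training set is usually smaller than D, which is why the sub data sets differ from D in size and class distribution.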
The specific calculation formula is as follows:
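$$\mathrm{GainRatio}(S_q, m) = \frac{\mathrm{Gain}(S_q, m)}{\mathrm{Split}(S_q, m)} \qquad (7)$$
$$\mathrm{Gain}(S_q, m) = H(S_q) - \sum_{v \in V(m)} \frac{|S_v|}{|S_q|} H(S_v) \qquad (8)$$
$$\mathrm{Split}(S_q, m) = -\sum_{v \in V(m)} \frac{|S_v|}{|S_q|} \log_2 \frac{|S_v|}{|S_q|} \qquad (9)$$
$$H(x) = -\sum_{l} p_l \log_2 p_l \qquad (10)$$
wherein S_q is a subset of the training data set D randomly selected by Bootstrap resampling; GainRatio(S_q, m), Gain(S_q, m), and Split(S_q, m) respectively represent the information gain ratio, the information gain, and the split information of feature m on the sub data set S_q; V(m) is the range of values of feature m; S_v is the subset of S_q whose value on feature m equals v; a represents the total number of attribute values of feature m, that is, the number of terms in the sums over V(m); for example, the feature Protocol_type has the attribute values TCP, UDP, and ICMP, so a = 3; H(x) is the entropy of data set x; p_l is the proportion of class l samples in the total data set.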
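The entropy, information gain, split information, and gain ratio defined above can be computed for a categorical feature as in the following sketch. The function names and the toy feature values are hypothetical and are not taken from table 1; only the 7 normal / 3 abnormal class split matches the example used later in the text. Returning 0 when the split information is 0 is an assumption made to avoid division by zero.

```python
import math
from collections import Counter


def entropy(labels):
    """H(x) = -sum_l p_l * log2(p_l) over the class proportions p_l."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(values, labels):
    """GainRatio = Gain / Split for one categorical feature."""
    total = len(labels)
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(v, []).append(l)  # S_v: samples whose feature value is v
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split = -sum(len(g) / total * math.log2(len(g) / total)
                 for g in groups.values())
    # Assumption: a feature with a single value cannot split the data.
    return gain / split if split > 0 else 0.0


# 7 normal (0) and 3 abnormal (1) samples, as in the S_q example of the text.
classes = [0] * 7 + [1] * 3
# Hypothetical feature values for illustration only.
protocol = ["tcp"] * 5 + ["udp"] * 2 + ["icmp"] * 3

print(round(entropy(classes), 4))
print(round(gain_ratio(protocol, classes), 4))
```

Dividing the gain by the split information is what penalizes features that fragment the data into many small groups.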
The process is described below with fig. 5 as an example. The invention first uses the Bootstrap resampling algorithm to randomly extract a data set S_q from the training data set, and calculates the information gain ratio of each feature on S_q. Assume that the feature count has the largest information gain ratio; it is then taken as the root node to start constructing the decision tree T_q. The node is split according to the distribution of attribute values of the feature count in S_q, that is, a proper threshold is found to divide S_q. Assume here that the split threshold of the feature count in S_q is 64: data in S_q whose count is less than or equal to 64 is classified as abnormal, and for data whose count is greater than 64 a further feature is extracted. When the count is greater than 64, the feature with the largest information gain ratio among the remaining features, Dst_bytes, is selected as the second feature, and its split threshold is found in the same way. These steps are repeated until the data set is completely divided. In fig. 4, the data set is completely divided once the feature Protocol_type has been selected, at which point the construction of the decision tree T_q is complete.
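The text does not say how the "proper threshold" is found. One common choice, assumed here in the style of C4.5 and not confirmed by the source, is to try the midpoints between consecutive distinct values of the feature and keep the one with the largest information gain. The function name `best_threshold` and the toy count values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split 'value <= threshold'."""
    base = entropy(labels)
    total = len(labels)
    best = (None, -1.0)  # no split is possible if all values are identical
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2  # candidate threshold: midpoint of adjacent values
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        gain = base - (len(left) / total * entropy(left)
                       + len(right) / total * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best


# Toy count values (hypothetical): small counts are abnormal (1), large are normal (0).
counts = [10, 30, 64, 70, 120, 200]
classes = [1, 1, 1, 0, 0, 0]
threshold, gain = best_threshold(counts, classes)
print(threshold, gain)
```

With this toy data the chosen midpoint separates the two classes completely, so a node split there would produce two pure children.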
In the above method, the information gain ratio rather than the information gain is used for feature selection, because the information gain metric has an inherent bias in favor of features with more attribute values. When a feature has a large number of attribute values (for example the feature Duration, whose value range according to table 1 is [0,58329], so the invention considers it to have 58329 attribute values), the term $\sum_{v \in V(m)} \frac{|S_v|}{|S_q|} H(S_v)$ calculated in formula (8) approaches 0. Because the entropy H(S_q) of the data set S_q is fixed, the information gain Gain(S_q, m) becomes large and approaches H(S_q), so the decision tree is biased toward selecting such a feature during feature selection. However, this leads to overfitting: the selected feature only reflects the distribution of data in the known training data set, so the model can only classify known traffic data and classifies unknown traffic data very poorly (that is, its ability to detect unknown attacks is poor). The split information (Split Information) of a feature is the entropy of its corresponding data set with respect to the attribute values of the feature; for a given information gain, the importance of the feature decreases as the split information increases.
For example, when a set of I pieces of traffic data is completely divided by feature A (that is, into I groups, I > 2), the split information is log_2 I. Meanwhile, suppose a Boolean feature B partitions the same set; if it exactly bisects the set (that is, into 2 groups), its split information is 1. If features were selected by information gain alone rather than by information gain ratio, formula (8) gives Gain(S_q, A) > Gain(S_q, B), so feature A would be selected as a non-leaf node (root node) for constructing the decision tree. In practice, however, because feature A has many attribute values, the data set is divided into many small subsets, and each leaf node may contain only pure normal or pure abnormal data. The decision tree then fits the training data perfectly, but when the test data set contains data whose value of feature A does not appear in training, the constructed decision tree still classifies the test data through feature A alone without considering other features, which inevitably degrades the detection performance of the model greatly. The invention therefore introduces the information gain ratio to solve this problem. According to the information gain ratio formula, feature B clearly has the higher information gain ratio, so feature B is preferentially selected as a non-leaf node (root node) to construct the decision tree, which avoids the loss of detection capability on unknown attacks caused by selecting feature A with its many attribute values. The information gain ratio of the feature Protocol_type is calculated as follows:
Figure BDA0001184699140000111
Figure BDA0001184699140000112
The information gain ratio serves as a compensating measure for this problem of the information gain: split information is introduced to penalize features with many attribute values, which improves the accuracy of the model in detecting unknown traffic.
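The comparison between feature A (which divides I samples into I groups) and the Boolean feature B (which bisects them) can be worked through symbolically. The derivation below is a sketch that assumes binary class labels, so that H(S_q) ≤ 1, and that every single-sample group produced by A has zero entropy; it uses only the definitions of Gain, Split, and GainRatio given earlier.

```latex
\begin{aligned}
\mathrm{Split}(S_q, A) &= -\sum_{v=1}^{I} \frac{1}{I}\log_2\frac{1}{I} = \log_2 I, \\
\mathrm{Gain}(S_q, A) &= H(S_q) - \sum_{v=1}^{I} \frac{1}{I}\cdot 0 = H(S_q), \\
\mathrm{GainRatio}(S_q, A) &= \frac{H(S_q)}{\log_2 I} \le \frac{1}{\log_2 I}, \\
\mathrm{Split}(S_q, B) &= -2\cdot\frac{1}{2}\log_2\frac{1}{2} = 1, \\
\mathrm{GainRatio}(S_q, B) &= \frac{\mathrm{Gain}(S_q, B)}{1} = \mathrm{Gain}(S_q, B).
\end{aligned}
```

Under these assumptions B is preferred whenever Gain(S_q, B) > H(S_q) / log_2 I, a bound that shrinks as I grows; for I = 1024 it is at most 0.1.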
The process of feature extraction and decision tree generation is illustrated below with fig. 3 as an example. Let S_q be a sub data set extracted by the Bootstrap algorithm from the training data set D = {N_1, N_2, ..., N_20}, with S_q = {N_1, N_2, N_3, N_4, N_5, N_6, N_7, N_8, N_9, N_10}. For ease of calculation, the invention selects the 2 features Protocol_type and Service for comparison; their attribute values are shown in table 1, where the data classes normal/abnormal are represented by the numerals 0/1. The numbers of normal and abnormal data in S_q are then 7 and 3, respectively.
Table 1. Attribute values of the features Protocol_type and Service, and classes, for N_1 to N_10
Figure BDA0001184699140000121
Figure BDA0001184699140000122
Figure BDA0001184699140000123
Calculated in the same way, GainRatio(S_q, Service) is 18.4%. The feature Service therefore has a larger information gain ratio than Protocol_type, so the invention preferentially selects the feature Service as a non-leaf node of the decision tree model T_1.
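As a worked check of the entropy term that enters these calculations, the entropy of S_q follows directly from the stated class counts (7 normal, 3 abnormal). The base-2 logarithm is an assumption, taken from the log2 that appears in the split information above:

```latex
H(S_q) = -\sum_{l \in \{0,1\}} p_l \log_2 p_l
       = -\left(0.7 \log_2 0.7 + 0.3 \log_2 0.3\right)
       \approx 0.360 + 0.521
       \approx 0.881
```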
1.3 attack detection based on Weighted Majority Algorithm (WMA)
Because each decision tree is constructed from a sub data set that is generated randomly by the Bootstrap algorithm and then de-duplicated, the size of each sub data set and its distribution of normal/abnormal data differ from those of the training data set D, and the information entropy H(S_q) of the sub data set S_q changes accordingly. As a result, the decision trees T_q built from the sub data sets S_q influence the final classification result to different degrees, and it is clearly inappropriate to simply take the result returned by the largest number of decision trees as the final classification result, as conventional RF does. Moreover, test data pass through the model item by item during detection. If a method based on semi-supervised learning were used to detect and label the data, all data would have to be clustered over multiple iterations for every detection, making detection extremely inefficient and unable to meet the real-time requirements of a practical environment. In addition, because the partitioning accuracy of such a method is lower than that of a classification algorithm, detecting with it alone would reduce the overall detection accuracy of the model.
Thus, as shown in FIG. 6, the invention introduces a weighted majority algorithm that assigns a weight w_q to each decision tree when detecting network traffic data. By analyzing how far the size and data distribution of each sub data set S_q, obtained by Bootstrap resampling and de-duplication, differ from those of the training data set D, the invention uses the information gain Gain(S_q, l) of the sub data set S_q with respect to the training data set D to measure the degree of influence of the corresponding decision tree T_q on the final detection result. Because the class l takes only the values 0 and 1, the overfitting problem caused by too many attribute values, described in the previous section, does not arise. The invention therefore uses the information gain Gain(S_q, l) rather than the information gain ratio to measure the influence of each decision tree on the final detection result. The specific formulas are as follows:
Figure BDA0001184699140000131
Figure BDA0001184699140000132
For convenience of calculation, the classification result of each decision tree is divided into normal and abnormal, denoted by 1 and -1 respectively, giving q classification results (y_1, y_2, ..., y_q), i.e., y_q ∈ {1, -1}, where -1 indicates that the decision tree outputs abnormal and 1 indicates that it outputs normal. S_{q,l} denotes the data of class l in the sub data set S_q, Q is the total number of decision trees, and sgn is the sign function, namely
Figure BDA0001184699140000133
When y = 1, the detected traffic data is normal traffic.
Figure BDA0001184699140000134
When y = -1, the detected traffic data is abnormal traffic.
For example, suppose the random forest contains two decision trees T_1 and T_2 whose weights, calculated by formula (11), are w_1 = 41% and w_2 = 33%, and the classification results they return for a piece of data E_1 in the test data set E are 1 and -1, respectively. Formula (12) then gives
Figure BDA0001184699140000141
y = 1, i.e., the detection result for the traffic data E_1 is normal traffic.
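The weighted vote in this example can be reproduced with a short sketch. It assumes that formula (12) takes the sign of the weighted sum of the tree outputs, which is what the worked example implies; the function name `weighted_majority` and the handling of a zero sum (treated here as abnormal) are assumptions, since the text does not say how ties are resolved.

```python
def weighted_majority(weights, votes):
    """Combine per-tree votes (+1 normal, -1 abnormal) using per-tree weights.

    Returns 1 (normal) if the weighted sum is positive, otherwise -1 (abnormal).
    Assumption: a weighted sum of exactly zero is reported as abnormal.
    """
    if len(weights) != len(votes):
        raise ValueError("weights and votes must have the same length")
    score = sum(w * y for w, y in zip(weights, votes))
    return 1 if score > 0 else -1

# The two-tree example from the text: w_1 = 41%, w_2 = 33%, votes 1 and -1.
y = weighted_majority([0.41, 0.33], [1, -1])
print(y)  # 1, i.e. normal traffic
```

With equal weights the same two votes would tie, which is why the per-tree weights matter for the final result.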
As an APT attack progresses over time, its attack pattern may change dynamically, so the underlying network traffic data and the features that reveal the attack change as well. To cope with such dynamically changing attack patterns, the invention adds detected data to the training data set and removes earlier data, thereby updating the training data set dynamically.
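One simple way to realize such a dynamic update is a fixed-size sliding window over the labeled records. The sketch below is only an illustration under that assumption: the class name `SlidingTrainingSet`, the window size, and the policy of evicting the oldest record first are hypothetical, and the text does not specify how much earlier data is removed or when the model is retrained.

```python
from collections import deque

class SlidingTrainingSet:
    """Fixed-size training window: adding a record evicts the oldest one."""

    def __init__(self, initial, max_size):
        # deque with maxlen silently drops items from the left when full
        self._window = deque(initial, maxlen=max_size)

    def add(self, record, label):
        """Append a newly detected record together with its detected class."""
        self._window.append((record, label))

    def snapshot(self):
        """Return the current training data, oldest first."""
        return list(self._window)

training = SlidingTrainingSet([("a", 0), ("b", 1)], max_size=2)
training.add("c", 1)          # evicts ("a", 0)
print(training.snapshot())
```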
As shown in FIG. 7, the accuracy, detection rate, false alarm rate, and false negative rate of the invention reach 80.77%, 77.51%, 0.92%, and 2.79%, respectively. Compared with conventional RF, the false alarm rate and the false negative rate are reduced by 53.54% and 40.76%, respectively. The main reason is that, unlike the conventional RF algorithm, the invention uses a semi-supervised learning algorithm to give the model enough labeled samples for training, which ensures the effectiveness of the trained model and gives it higher detection accuracy. In addition, calculation from FIG. 7 shows that the accuracy and detection rate of the invention are improved by 14.86%, 14.20%, 7.58%, and 7.07% compared with the KNN algorithm and the SVM algorithm, respectively. This is because, by introducing the information gain ratio, the invention selects the traffic features that best reflect the current target network, so the detection model can detect the different abnormal behaviors present in different target networks and suits different APT attack scenarios. Furthermore, the invention uses information gain to weight the decision trees and obtains the final detection result based on WMA, so that different decision trees influence the detection result to different degrees and the model no longer depends solely on the choice of the number of trees, as the standard random forest algorithm does, which greatly improves the detection accuracy of the model.
Effect of different scale training data sets on test Performance
FIG. 8 shows how the accuracy of the training data set D and the detection accuracy of the model vary with the scale of the unlabeled historical data, where the invention sets the ratio of labeled accurate data to unlabeled historical data to 1:d, with d representing the scale of the unlabeled historical data. The figure shows that the accuracy of the training data set D decreases gradually as d increases, while the detection accuracy of the model first increases and then decreases, reaching a maximum of 80.77% at d = 9. When d is too small, there is too little data to train the model, so its detection accuracy is low; when d is too large, the generated training data set D is itself inaccurate, so the accuracy of the trained model is low. The training data set D is therefore constructed with d = 9.
Influence of decision tree number on detection performance
As can be seen from FIG. 9, when the number q of decision trees is 300, the detection accuracy reaches its highest value of 80.77%, and detection takes 112 s. When q exceeds 300, the detection accuracy levels off while the detection time increases sharply, so the model can no longer guarantee real-time APT attack detection. The number q of decision trees is a major factor in the performance and efficiency of the model: when q is small, the detection accuracy of the model is poor. On the other hand, since the random forest has the property of not overfitting, q can be made as large as possible to ensure the detection accuracy of the model. However, the complexity of the model is proportional to q, so when q is too large the detection time becomes too long. Therefore, q = 300 is used in the experiments.
In conclusion, the invention uses a small labeled accurate data set to generate a large-scale, accurately labeled training data set, while extracting the features that best reflect the target network traffic, thereby ensuring the detection accuracy of the model for unknown attacks. To address the scarcity of training data and the difficulty of selecting network traffic features in the APT environment, the invention uses a semi-supervised learning algorithm to generate a large-scale training data set from a small amount of labeled data and trains a detection model on it, and it introduces the information gain ratio to extract features from the traffic data of the APT target network so as to detect unknown attacks in the target network. The effectiveness of the method is verified by comparing the experimental detection results with those of the RF, KNN, and SVM algorithms, and the effects of training data sets of different scales and of the number of decision trees on detection performance are analyzed separately.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (5)

1. An APT attack detection method, characterized in that it is a semi-supervised learning attack detection method based on the information gain ratio in an APT environment, which uses a semi-supervised learning algorithm to label data with similar features, uses a small amount of labeled data to generate a large-scale training data set, and introduces the information gain ratio to determine the degree of influence of different features on detection; features are extracted from each sub data set divided in the detection model using the information gain ratio, so as to accurately identify unknown attacks;
the semi-supervised learning algorithm comprises the following specific steps:
(1) randomly selecting one piece of data from the labeled normal data and one from the labeled abnormal data, denoted N_1 and N_5 respectively, as the cluster centers c_1, c_2 of the labeled data clusters;
(2) Using formulas
Figure FDA0002590761170000011
calculating the distance d(N_i, c_k) from each piece of data N_i to each of the cluster centers c_1, c_2, and assigning N_i to the cluster for which d(N_i, c_k) is smaller; N_{i,m} denotes the m-th feature value of the i-th piece of data; m ranges over the network traffic features, of which there are 28 in the APT environment; c_{k,m} denotes the m-th feature value of the k-th cluster center;
(3) using formulas
Figure FDA0002590761170000012
calculating the centroid of all points in each of the two clusters and taking it as the new cluster center c'_1, c'_2; i denotes the i-th data sample in cluster k, and I' denotes the total number of data samples in cluster k;
(4) repeating steps (2) and (3) until the total intra-cluster dispersion J reaches its minimum, where the dispersion is the sum of the distances d(N_i, c_k) from every piece of data N_i to its corresponding cluster center c_k;
(5) calculating the probability P_{l,k} of each class l of labeled data in each cluster k, and labeling cluster k with the class l for which P_{l,k} is largest, finally obtaining the training data set D.
2. The APT attack detection method according to claim 1, wherein the semi-supervised learning has a specific formula as follows:
Figure FDA0002590761170000013
Figure FDA0002590761170000014
Figure FDA0002590761170000021
Figure FDA0002590761170000022
l = argmax P_{l,k}
wherein N_{i,m} denotes the m-th feature value of the i-th piece of data, c_{k,m} denotes the m-th feature value of the k-th cluster center, and m ranges over the network traffic features; I is the total number of samples in the data set, and I' is the total number of data samples in cluster k.
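The labeling steps (1) to (5) recited above can be sketched as a seeded two-cluster k-means. This is a minimal sketch under stated assumptions, not the claimed formulas: the Euclidean distance, the stopping rule (assignments no longer change), the fallback class for a cluster with no labeled members, and all function and parameter names are assumptions made for illustration.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def label_by_clustering(data, known, seed_normal, seed_abnormal, max_iter=100):
    """Label every record by the majority known class of its cluster.

    data: list of feature vectors; known: {index: 0 or 1} for labeled records;
    seed_normal / seed_abnormal: indices of the two initial cluster centers.
    """
    centers = [list(data[seed_normal]), list(data[seed_abnormal])]
    assign = None
    for _ in range(max_iter):
        # Step (2): assign each record to the nearer cluster center.
        new_assign = [min((0, 1), key=lambda k: distance(p, centers[k]))
                      for p in data]
        if new_assign == assign:  # assignments stable: dispersion stopped falling
            break
        assign = new_assign
        # Step (3): move each center to the centroid of its members.
        for k in (0, 1):
            members = [p for p, a in zip(data, assign) if a == k]
            if members:
                centers[k] = centroid(members)
    # Step (5): label each cluster with its most frequent known class.
    cluster_class = {}
    for k in (0, 1):
        counts = Counter(known[i] for i, a in enumerate(assign)
                         if a == k and i in known)
        # Assumption: a cluster without labeled members keeps its seed's class.
        cluster_class[k] = counts.most_common(1)[0][0] if counts else k
    return [cluster_class[a] for a in assign]

# Hypothetical 2-feature data: records 0 and 3 are the only labeled ones.
data = [[0, 0], [0.1, 0.2], [0.2, 0.1], [5, 5], [5.1, 4.9], [4.8, 5.2]]
labels = label_by_clustering(data, known={0: 0, 3: 1},
                             seed_normal=0, seed_abnormal=3)
print(labels)
```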
3. The APT attack detection method of claim 1, wherein, for the information gain ratio, a Bootstrap resampling algorithm is used to draw one data sample at a time from the set D with replacement, I times in total, and repeated data are removed to obtain a sub training data set S_1; this step is repeated q times to obtain q sub training data sets {S_1, S_2, ..., S_q}, from which q different decision trees are generated to build a random forest.
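The resampling step can be sketched as follows. The function name `bootstrap_subsets`, the `seed` parameter, and the choice to de-duplicate by record index (so that a record drawn several times is kept once) are assumptions made for this illustration.

```python
import random

def bootstrap_subsets(dataset, q, seed=None):
    """Draw q bootstrap samples of len(dataset) records each, with replacement,
    then drop repeated draws so each sub data set holds distinct records."""
    rng = random.Random(seed)
    size = len(dataset)
    subsets = []
    for _ in range(q):
        drawn = [rng.randrange(size) for _ in range(size)]  # I draws with replacement
        unique = sorted(set(drawn))                          # remove repeated data
        subsets.append([dataset[i] for i in unique])
    return subsets

# Hypothetical training set of 20 records, resampled into 5 sub data sets.
records = [f"N{i}" for i in range(1, 21)]
for subset in bootstrap_subsets(records, q=5, seed=0):
    print(len(subset), subset)
```

Because duplicates are removed, each sub data set is usually smaller than D and has its own class distribution, which is the variation the weighting of the decision trees is meant to account for.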
4. The APT attack detection method of claim 3, characterized in that each decision tree T_q is generated by the following specific steps:
step one, selecting the traffic feature with the largest information gain ratio as the root node of the decision tree;
step two, for the selected feature, finding the threshold on the data set S_q at which the feature splits to leaf nodes fastest, and splitting at that threshold;
step three, before each non-leaf node selects a feature, taking the remaining features as the splitting feature set of the current node, and selecting the traffic feature with the largest information gain ratio as the non-leaf node split from the root node;
step four, repeating step two and step three until each feature corresponds to a leaf node, thereby constructing the decision tree T_q corresponding to S_q.
5. The APT attack detection method according to claim 3, wherein the specific calculation formula is as follows:
Figure FDA0002590761170000023
Figure FDA0002590761170000031
Figure FDA0002590761170000032
Figure FDA0002590761170000033
wherein S_q is a subset of the training data set D randomly selected by Bootstrap resampling; GainRatio(S_q, m), Gain(S_q, m), and Split(S_q, m) respectively denote the information gain ratio, the information gain, and the split information of feature m on the sub data set S_q; V(m) is the range of values of feature m; S_v is the subset of S_q in which feature m takes the value v; a denotes the total number of attribute values of feature m, e.g., for the feature Protocol_type, whose attribute values include TCP, UDP, and ICMP, a = 3; H(x) is the entropy of data set x; p_l is the proportion of class-l samples in the data set.
CN201611179208.XA 2016-12-19 2016-12-19 APT attack detection method Active CN106817248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611179208.XA CN106817248B (en) 2016-12-19 2016-12-19 APT attack detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611179208.XA CN106817248B (en) 2016-12-19 2016-12-19 APT attack detection method

Publications (2)

Publication Number Publication Date
CN106817248A CN106817248A (en) 2017-06-09
CN106817248B true CN106817248B (en) 2020-10-16

Family

ID=59109994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611179208.XA Active CN106817248B (en) 2016-12-19 2016-12-19 APT attack detection method

Country Status (1)

Country Link
CN (1) CN106817248B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107276805B (en) * 2017-06-19 2020-06-05 北京邮电大学 Sample prediction method and device based on intrusion detection model and electronic equipment
CN107392015B (en) * 2017-07-06 2019-09-17 长沙学院 A kind of intrusion detection method based on semi-supervised learning
CN107562618A (en) * 2017-08-07 2018-01-09 北京奇安信科技有限公司 A kind of shellcode detection method and device
CN107872460B (en) * 2017-11-10 2019-09-24 重庆邮电大学 A kind of wireless sense network DoS attack lightweight detection method based on random forest
CN108961071B (en) * 2018-06-01 2023-07-21 中国平安人寿保险股份有限公司 Method for automatically predicting combined service income and terminal equipment
CN109728939B (en) * 2018-12-13 2022-04-26 杭州迪普科技股份有限公司 Network flow detection method and device
CN111343127B (en) * 2018-12-18 2021-03-16 北京数安鑫云信息技术有限公司 Method, device, medium and equipment for improving crawler recognition recall rate
CN109768985B (en) * 2019-01-30 2020-06-23 电子科技大学 Intrusion detection method based on flow visualization and machine learning algorithm
CN110222715B (en) * 2019-05-07 2021-07-27 国家计算机网络与信息安全管理中心 Sample homologous analysis method based on dynamic behavior chain and dynamic characteristics
CN112003869B (en) * 2020-08-28 2022-10-04 国网重庆市电力公司电力科学研究院 Vulnerability identification method based on flow
CN112269987B (en) * 2020-09-27 2023-01-24 西安电子科技大学 Intelligent model information leakage degree evaluation method, system, medium and equipment
CN114362972B (en) * 2020-09-27 2023-07-21 中国科学院计算机网络信息中心 Botnet hybrid detection method and system based on flow abstract and graph sampling
CN112333195B (en) * 2020-11-10 2021-11-30 西安电子科技大学 APT attack scene reduction detection method and system based on multi-source log correlation analysis
CN112583844B (en) * 2020-12-24 2021-09-03 北京航空航天大学 Big data platform defense method for advanced sustainable threat attack

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103001825A (en) * 2012-11-15 2013-03-27 中国科学院计算机网络信息中心 Method and system for detecting DNS (domain name system) traffic abnormality
CN103617156A (en) * 2013-11-14 2014-03-05 上海交通大学 Multi-protocol network file content inspection method
CN104834977A (en) * 2015-05-15 2015-08-12 浙江银江研究院有限公司 Traffic alarm condition level predication method based on distance metric learning
CN106156809A (en) * 2015-04-24 2016-11-23 阿里巴巴集团控股有限公司 For updating the method and device of disaggregated model

Also Published As

Publication number Publication date
CN106817248A (en) 2017-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant