CN112333180A - APT attack detection method and system based on data mining - Google Patents

APT attack detection method and system based on data mining

Info

Publication number
CN112333180A
Authority
CN
China
Prior art keywords
host
detected
sample
characteristic
flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011187318.7A
Other languages
Chinese (zh)
Inventor
邢亚君
彭海龙
孟铭
王德胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing An Xin Tian Xing Technology Co ltd
Original Assignee
Beijing An Xin Tian Xing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing An Xin Tian Xing Technology Co ltd filed Critical Beijing An Xin Tian Xing Technology Co ltd
Priority to CN202011187318.7A priority Critical patent/CN112333180A/en
Publication of CN112333180A publication Critical patent/CN112333180A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses an APT attack detection method and system based on data mining. The method comprises the following steps: acquiring an access frequency characteristic of a host to be detected and a domain name popularity characteristic of the host to be detected based on a DNS log; acquiring a flow characteristic of the host to be detected and a port protocol mismatching characteristic of the host to be detected based on a network flow log; fusing the access frequency characteristic, the domain name popularity characteristic, the flow characteristic and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected; and inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by APT. The invention can effectively detect whether a host is attacked by APT.

Description

APT attack detection method and system based on data mining
Technical Field
The invention relates to the field of APT attack detection, in particular to an APT attack detection method and system based on data mining.
Background
With the development of communication technology, the informatization and networking of companies have become a trend. In this context, however, the advanced persistent threat (APT), which is persistent, covert and penetrating, has become a threat that cannot be ignored. APT has posed a great threat to organizations worldwide, and research into the defense against and detection of APT attacks has therefore become an important direction for practitioners in the network security field. Because an APT attack has a clear target and a relatively high cost, it is generally well hidden: the attacker deliberately controls the behavior of the domain name used, so that this behavior is often difficult to distinguish from that of a normal domain name.
The K-nearest neighbor algorithm is a global, directly computed unsupervised detection algorithm. It focuses on the distance between a sample point and its neighbors and uses the absolute distance between the sample and its neighbors to judge the degree of abnormality, so the judged degree of abnormality depends on how the distance is calculated. In practice, abnormal points may themselves form small clusters, and because the K-nearest neighbor algorithm relies on comparing a sample point with its nearest neighbors, it cannot achieve a particularly good detection result on such small clusters.
Disclosure of Invention
The invention aims to provide an APT attack detection method and system based on data mining that can effectively detect anomalies.
In order to achieve the purpose, the invention provides the following scheme:
an APT attack detection method based on data mining comprises the following steps:
acquiring access frequency characteristics of a host to be detected and domain name popularity characteristics of the host to be detected based on a DNS log, wherein the access frequency characteristics of the host to be detected represent a frequency within a set frequency range in the frequency of accessing each domain name by the host to be detected, and the domain name popularity characteristics represent the ratio of the number of hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period;
acquiring flow characteristics of a host to be detected and port protocol mismatching characteristics of the host to be detected based on a network flow log, wherein the flow characteristics of the host to be detected represent the ratio of uploading flow and downloading flow of the host to be detected, and the port protocol mismatching characteristics of the host to be detected represent the condition that a communication protocol is not matched with a port;
fusing the access frequency characteristic of the host to be detected, the domain name popularity characteristic of the host to be detected, the flow characteristic of the host to be detected and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected;
and inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by APT.
Optionally, the method further includes:
acquiring access frequency characteristics of a sample host and domain name popularity characteristics of the sample host based on a DNS log, wherein the access frequency characteristics of the sample host represent a frequency within a set frequency range in the frequency of accessing each domain name by the sample host, and the domain name popularity characteristics of the sample host represent the ratio of the number of hosts accessing the sample host in a set time period to the number of active hosts in the set time period;
acquiring flow characteristics of a sample host and port protocol mismatching characteristics of the sample host based on a network flow log, wherein the flow characteristics of the sample host represent the ratio of uploading flow and downloading flow of the sample host, and the port protocol mismatching characteristics of the sample host represent the condition that a communication protocol of the sample host is not matched with a port;
fusing the access frequency characteristic of the sample host, the domain name popularity characteristic of the sample host, the flow characteristic of the sample host and the port protocol mismatching characteristic of the sample host by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the sample host;
and training a decision tree based on an isolated forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model.
Optionally, the output of the decision tree model is the probability that the host to be detected is attacked by APT.
Optionally, the method further includes: giving an alarm when the output result of the decision tree model indicates that the host to be detected suffers from an APT attack or that the probability of suffering from an APT attack is greater than a set threshold value.
The invention also provides an APT attack detection system based on data mining, which comprises:
the DNS log feature extraction module is used for acquiring access frequency features of a host to be detected and domain name popularity features of the host to be detected based on the DNS log, wherein the access frequency features of the host to be detected represent a frequency within a set frequency range in the frequency of the host to be detected accessing each domain name, and the domain name popularity features represent the ratio of the number of the hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period;
the flow log feature extraction module is used for acquiring flow features of the host to be detected and port protocol mismatching features of the host to be detected based on the network flow log, wherein the flow features of the host to be detected represent the ratio of uploading flow and downloading flow of the host to be detected, and the port protocol mismatching features of the host to be detected represent the condition that a communication protocol is not matched with a port;
the characteristic fusion module is used for fusing the access frequency characteristic of the host to be detected, the domain name popularity characteristic of the host to be detected, the flow characteristic of the host to be detected and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected;
and the anomaly detection module is used for inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by the APT.
Optionally,
the DNS log feature extraction module is further used for obtaining access frequency features of the sample host and domain name popularity features of the sample host based on the DNS log, wherein the access frequency features of the sample host represent a frequency in a set frequency range in the frequency of the sample host accessing each domain name, and the domain name popularity features of the sample host represent the ratio of the number of hosts accessing the sample host in a set time period to the number of active hosts in the set time period;
the flow log feature extraction module is further used for obtaining flow features of the sample host and port protocol mismatching features of the sample host based on the network flow log, wherein the flow features of the sample host represent the ratio of uploading flow and downloading flow of the sample host, and the port protocol mismatching features of the sample host represent the condition that a communication protocol of the sample host is not matched with a port;
the characteristic fusion module is also used for fusing the access frequency characteristic of the sample host, the domain name popularity characteristic of the sample host, the flow characteristic of the sample host and the port protocol mismatching characteristic of the sample host by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the sample host;
the system further comprises: and the decision tree training unit is used for training a decision tree based on an isolated forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model.
Optionally, the output of the decision tree model is the probability that the host to be detected is attacked by APT.
Optionally, the system further includes: an alarm module, used for giving an alarm when the output result of the decision tree model indicates that the host to be detected suffers from an APT attack or that the probability of suffering from an APT attack is greater than a set threshold value.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects: the APT attack detection method and system based on data mining provided by the invention extract the characteristics of DNS logs/network flow logs, and then evaluate a data set by using an iForest anomaly detection algorithm to obtain an attack detection result. Due to the combination of the iForest algorithm, the accuracy and the effectiveness of the APT detection are improved, and a reliable basis is provided for the subsequent detection analysis.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of an APT attack detection method based on data mining according to embodiment 1 of the present invention;
Fig. 2 is a schematic structural diagram of an APT attack detection system based on data mining according to embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1
Referring to fig. 1, the present embodiment provides a data mining-based APT attack detection method, including the following steps:
step 101: the method comprises the steps of obtaining access frequency characteristics of a host to be detected and domain name popularity characteristics of the host to be detected based on DNS logs, wherein the access frequency characteristics of the host to be detected represent a frequency within a set frequency range in the frequency of the host to be detected accessing each domain name, and the domain name popularity characteristics represent the ratio of the number of the hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period.
After a host is attacked by APT, its behavior is controlled by the attacker. Once the authority of the host has been obtained, the behavior of stealing host information is hidden in normal flow, and we find that the host sends information to the command & control server at a relatively low frequency. The access frequency of the host is therefore taken as a characteristic for identifying whether the host is attacked by APT. In this embodiment, if the host to be detected has no access frequency within the set frequency range, the host to be detected is considered not to be attacked by APT. If the host to be detected has a plurality of access frequencies within the set frequency range, these access frequencies are extracted one by one and each is fused with the other characteristics. For example, suppose the host to be detected accesses domain name A with frequency a, domain name B with frequency b, domain name C with frequency c, and domain name D with frequency d. If b, c and d are within the set frequency range, the frequencies b, c and d are extracted and each is combined with the other characteristics to form a detection identification vector; if the other characteristics are alpha, beta and gamma, three detection identification vectors [b, alpha, beta, gamma], [c, alpha, beta, gamma], [d, alpha, beta, gamma] are obtained. In the subsequent steps, the characteristic elements in each of these vectors are fused to obtain three comprehensive characteristic values of the host to be detected, each comprehensive characteristic value is input to the decision tree model, and the host to be detected is considered to be attacked by APT as long as one output result indicates that the host is attacked by APT.
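The construction of the detection identification vectors can be sketched as follows. This is a minimal illustration, not the patented implementation: the log shape (a list of `(host, domain)` pairs, one per DNS query), the use of a raw query count as the "frequency", and the function name `access_frequency_vectors` are all assumptions made for the example.

```python
from collections import Counter


def access_frequency_vectors(dns_records, host, freq_range, other_features):
    """Build one detection vector per domain whose access frequency is in range.

    dns_records:    list of (host, domain) pairs, one per DNS query (assumed shape).
    freq_range:     (low, high) inclusive bounds of the set frequency range.
    other_features: remaining host-level features, e.g. [alpha, beta, gamma].
    """
    low, high = freq_range
    # Count how often this host queried each domain in the observation window.
    counts = Counter(domain for h, domain in dns_records if h == host)
    vectors = {}
    for domain, freq in counts.items():
        if low <= freq <= high:
            # One vector per in-range frequency: [frequency, alpha, beta, gamma].
            vectors[domain] = [freq] + list(other_features)
    return vectors
```

An empty result corresponds to the case in which the host has no access frequency within the set range and is therefore treated as not attacked.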
After a host is infected by malicious software, the malicious software raises its authority to acquire access to sensitive files, but accessing sensitive files also increases the risk of discovery. We have found that, in order to reduce the risk of discovery, malware often acquires sensitive information by infecting only a small portion of critical hosts in an intranet. Therefore, the present invention uses the ratio p of the number of hosts accessing a domain name during a period of time T to the number of active hosts during T as a feature for identifying whether the domain name (host) is subjected to APT attack.
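The ratio p can be computed directly from the DNS log of the period T. The sketch below assumes the same hypothetical log shape, a list of `(host, domain)` pairs, and assumes that a host counts as "active" if it issued at least one DNS query in the period; neither detail is specified in the text.

```python
def domain_popularity(dns_records, domain):
    """Ratio p of hosts that accessed `domain` to all active hosts in the period.

    dns_records: list of (host, domain) pairs observed during the period T.
    """
    # Assumption: a host is active if it appears in the log at least once.
    active_hosts = {h for h, _ in dns_records}
    accessing_hosts = {h for h, d in dns_records if d == domain}
    if not active_hosts:
        return 0.0
    return len(accessing_hosts) / len(active_hosts)
```

A small p indicates a domain name contacted by only a few hosts, which is the pattern the paragraph above associates with targeted infections.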
Step 102: the method comprises the steps of obtaining flow characteristics of a host to be detected and port protocol mismatching characteristics of the host to be detected based on a network flow log, wherein the flow characteristics of the host to be detected represent the ratio of uploading flow and downloading flow of the host to be detected, and the port protocol mismatching characteristics of the host to be detected represent the situation that a communication protocol is not matched with a port.
Generally, during normal network communication, the traffic in the downloading direction is much more than the traffic in the uploading direction. However, during the APT attack, the infected host needs to send the collected data to the command & control server, and the ratio of the upload traffic to the download traffic of the host may be higher than that of other hosts. Thus, the present invention uses the ratio of upload traffic to download traffic as a feature to identify whether a host is subject to an APT attack.
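A minimal sketch of the flow characteristic follows. The flow record layout (dictionaries with `src`, `dst` and `bytes` keys) is an assumption for illustration, as is the handling of a host with no download traffic.

```python
def upload_download_ratio(flows, host):
    """Ratio of bytes sent by `host` to bytes received by `host`.

    flows: list of dicts with keys "src", "dst", "bytes" (assumed log shape).
    """
    upload = sum(f["bytes"] for f in flows if f["src"] == host)
    download = sum(f["bytes"] for f in flows if f["dst"] == host)
    if download == 0:
        # Assumption: treat "uploads only" as an unbounded ratio, "no traffic" as 0.
        return float("inf") if upload > 0 else 0.0
    return upload / download
```

A host whose ratio is well above that of its peers matches the exfiltration pattern described above.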
To penetrate the firewall of the target network, malware typically communicates over ports used by common protocols. In typical malware the communication protocol is fixed during development, whereas the port actually used has to be configured according to the specific situation after a host is invaded, so a mismatch between the communication protocol and the port can occur during an attack. For example, traffic on port 80 that uses a protocol other than HTTP is likely to be malicious. Therefore, the invention takes the mismatch between the communication protocol and the port as a characteristic for identifying whether the host is attacked by APT; for example, the number of times the communication protocol does not match the port can be taken as the port protocol mismatch characteristic.
In addition, after the infected host connects to the command & control server, the two can each confirm that the other remains online by sending data packets. Such data packets are commonly referred to as heartbeat packets. Heartbeat packets are typically small and strongly periodic. Thus, periodic small packets may also be used as a feature to identify whether a host is subject to an APT attack.
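Counting mismatches can be sketched as below. The port-to-protocol table `EXPECTED_PROTOCOL`, the flow record keys and the assumption that the log already carries a detected application protocol are illustrative choices, not details given in the text.

```python
# Assumed mapping of well-known ports to the protocol normally seen on them.
EXPECTED_PROTOCOL = {22: "ssh", 53: "dns", 80: "http", 443: "tls"}


def port_protocol_mismatches(flows, host):
    """Count flows sent by `host` whose protocol does not match the destination port.

    flows: list of dicts with keys "src", "dst_port", "protocol" (assumed shape).
    """
    count = 0
    for f in flows:
        if f["src"] != host:
            continue
        expected = EXPECTED_PROTOCOL.get(f["dst_port"])
        # Ports missing from the table are ignored rather than counted.
        if expected is not None and f["protocol"] != expected:
            count += 1
    return count
```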
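One possible way to test for periodic small packets is to check that the gaps between small packets are nearly constant. The sketch below is only an illustration of that idea: the size limit, the jitter limit (coefficient of variation of the gaps) and the minimum packet count are hypothetical parameters, and the text does not say how periodicity is measured.

```python
from statistics import mean, pstdev


def looks_like_heartbeat(packets, max_size=128, max_jitter=0.1, min_count=4):
    """Return True if the small packets arrive at nearly constant intervals.

    packets: list of (timestamp_seconds, size_bytes) for one host/server pair.
    All three thresholds are assumed values chosen for the example.
    """
    times = sorted(t for t, size in packets if size <= max_size)
    if len(times) < min_count:
        return False
    gaps = [b - a for a, b in zip(times, times[1:])]
    avg = mean(gaps)
    if avg <= 0:
        return False
    # Coefficient of variation: small values mean strongly periodic traffic.
    return pstdev(gaps) / avg <= max_jitter
```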
Step 103: the access frequency characteristic of the host to be detected, the domain name popularity characteristic of the host to be detected, the flow characteristic of the host to be detected and the port protocol mismatching characteristic of the host to be detected are fused by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected.
After the above features of the host to be detected are extracted, the present embodiment adopts a fuzzy mathematical model to fuse the features:
(1) Define the set of influencing factors $A = \{A_1, A_2, A_3, \ldots, A_n\}$, for example A = {access frequency feature, domain name popularity feature, port protocol mismatch feature, traffic feature}.
(2) Define the evaluation set $W = \{W_1, W_2, W_3, \ldots, W_m\}$.
(3) Define the single-factor mapping $f_1: A \to \mathcal{F}(W)$, $A_i \mapsto f_1(A_i) = (a_{i,1}, a_{i,2}, \ldots, a_{i,m}) \in \mathcal{F}(W)$, where i and j satisfy $1 \le i \le n$ and $1 \le j \le m$, the value $a_{i,j}$ represents the evaluation value of $A_i$ with respect to $W_j$, and $a_{i,1} + a_{i,2} + a_{i,3} + \ldots + a_{i,m} = 1$. With this fuzzy mapping, the fuzzy relation can be obtained, and the fuzzy matrix in the model is shown in formula (1).
$$R = \begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,m} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n,1} & a_{n,2} & \cdots & a_{n,m} \end{bmatrix}, \quad 0 \le a_{i,j} \le 1,\ 1 \le i \le n,\ 1 \le j \le m \tag{1}$$
(4) An authoritative weight matrix is predefined, denoted $Z = [z_1, z_2, \ldots, z_n]$, where $z_1 + z_2 + \ldots + z_n = 1$. This step is very important for convex optimization of the fuzzy matrix into a classical matrix. The value of each element of Z represents the importance of the corresponding influencing factor in the set A, namely, the larger the value of $z_i$, the more important $A_i$ is.
The max-min composition operation is carried out on R and Z, as shown in formula (2):
$$G = Z \circ R = (g_1, g_2, \ldots, g_m), \quad g_j = \max_{1 \le i \le n} \min(z_i, a_{i,j}) \tag{2}$$
In formula (2), $\circ$ represents the max-min composition operation. G is then normalized to $G^*$.
Finally, the comprehensive value $\Omega = G^* \times W^T$, which is the extracted characteristic, is obtained and used for the subsequent anomaly detection algorithm.
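The fusion steps can be illustrated with a short sketch. It assumes that the evaluation set W is given as numeric grades and that normalization divides G by the sum of its entries; the text does not state the normalization formula, so that choice and the function names are assumptions.

```python
def max_min_compose(z, r):
    """Max-min composition G = Z o R, with g_j = max_i min(z_i, a_ij).

    z: weight vector of length n.  r: fuzzy matrix with n rows and m columns.
    """
    n, m = len(r), len(r[0])
    return [max(min(z[i], r[i][j]) for i in range(n)) for j in range(m)]


def fuse(z, r, w):
    """Return the comprehensive value: normalized G multiplied by W transposed.

    w: numeric grade for each element of the evaluation set (assumed numeric).
    """
    g = max_min_compose(z, r)
    total = sum(g)
    if total == 0:
        return 0.0
    # Assumed normalization: scale G so that its entries sum to 1.
    g_star = [x / total for x in g]
    return sum(gs * wj for gs, wj in zip(g_star, w))
```

For example, with weights `z = [0.4, 0.3, 0.2, 0.1]`, a 4 x 3 fuzzy matrix whose rows each sum to 1, and grades `w = [1, 2, 3]`, `fuse` returns a single number per detection vector, which is the value passed to the anomaly detection step.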
Step 104: and inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolated forest algorithm to obtain a detection result of whether the host to be detected is attacked by APT.
In this embodiment, the training process of the decision tree model is as follows:
(1) the method comprises the steps of obtaining access frequency characteristics of a sample host and domain name popularity characteristics of the sample host based on a DNS log, wherein the access frequency characteristics of the sample host represent a frequency within a set frequency range in the frequency of the sample host accessing each domain name, and the domain name popularity characteristics of the sample host represent the ratio of the number of hosts accessing the sample host within a set time period to the number of active hosts within the set time period. If the access frequency within the set frequency range does not exist in the sample host, the information corresponding to the sample host is abandoned and is not used as the sample data of the training decision tree. If a certain sample host has a plurality of access frequencies within the set frequency range, extracting the access frequencies one by one respectively, and fusing the access frequencies with other characteristics respectively to obtain a plurality of comprehensive characteristic values of the sample host, wherein the characteristic values are used as sample information of a training decision tree.
(2) The method comprises the steps of obtaining flow characteristics of a sample host and port protocol mismatching characteristics of the sample host based on a network flow log, wherein the flow characteristics of the sample host represent the ratio of uploading flow and downloading flow of the sample host, and the port protocol mismatching characteristics of the sample host represent the condition that a communication protocol of the sample host is not matched with a port.
(3) And fusing the access frequency characteristic of the sample host, the domain name popularity characteristic of the sample host, the flow characteristic of the sample host and the port protocol mismatching characteristic of the sample host by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the sample host.
(4) And training a decision tree based on an isolated forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model. The specific training process of the decision tree is as follows:
The iForest consists of t iTrees, each of which is a decision tree.
The specific steps for implementing the iTree are as follows:
1) randomly selecting a plurality of sample points from training data as subsamples, and putting the subsamples into a root node of a tree;
2) randomly selecting a feature as a new node, and randomly selecting a cutting point p under the current feature, wherein the cutting point p is generated between the maximum value and the minimum value of the specified dimensionality in the current node data;
3) with this cut point, a hyperplane is generated, and then the current node data space is divided into two subspaces: placing data smaller than p in the specified dimension on the left child of the current node, and placing data larger than or equal to p on the right child of the current node;
4) repeating steps 2) and 3) recursively in the child nodes, continuously constructing new child nodes until a child node contains only one piece of data or reaches the height limit of the tree.
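Steps 1) to 4) can be sketched in a few lines. This is a simplified illustration under stated assumptions: trees are plain dictionaries, a node whose randomly chosen feature has a single value becomes a leaf, and the height limit is set to ceil(log2(sample size)), a common choice for isolation forests that the text does not specify. All function names are hypothetical.

```python
import math
import random


def build_itree(points, height_limit, depth=0, rng=random):
    """Recursively build one iTree from a list of equal-length numeric tuples."""
    if len(points) <= 1 or depth >= height_limit:
        return {"size": len(points)}  # leaf node
    dim = rng.randrange(len(points[0]))  # step 2: random feature
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}  # simplification: cannot split on this feature
    cut = rng.uniform(lo, hi)  # step 2: random cut point between min and max
    left = [p for p in points if p[dim] < cut]  # step 3: smaller than p
    right = [p for p in points if p[dim] >= cut]  # step 3: greater or equal
    return {
        "dim": dim,
        "cut": cut,
        "left": build_itree(left, height_limit, depth + 1, rng),  # step 4
        "right": build_itree(right, height_limit, depth + 1, rng),
    }


def build_iforest(data, n_trees=100, sample_size=256, rng=random):
    """Step 1 for each tree: draw a random subsample, then build the iTree."""
    size = min(sample_size, len(data))
    limit = math.ceil(math.log2(size)) if size > 1 else 0
    return [build_itree(rng.sample(data, size), limit, rng=rng) for _ in range(n_trees)]


def count_points(node):
    """Total number of training points stored in the leaves of a tree."""
    if "size" in node:
        return node["size"]
    return count_points(node["left"]) + count_points(node["right"])
```

Passing a seeded `random.Random` instance as `rng` makes the construction reproducible.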
After t iTrees are obtained, the training of the iForest is complete, and the model can then be used to evaluate data. Each test datum is passed through every iTree, and the depth at which it is isolated in a node of its own, or at which it reaches the deepest level of the tree, is recorded, thus yielding the average height at which the datum falls in the iForest.
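The traversal can be sketched as follows, assuming each iTree is stored as nested dictionaries in which internal nodes carry `dim`, `cut`, `left` and `right` and leaves carry `size`. This representation and the function names are assumptions made for the example.

```python
def path_length(node, x, depth=0):
    """Number of edges from the root to the leaf that x falls into."""
    if "size" in node:  # leaf reached
        return depth
    child = node["left"] if x[node["dim"]] < node["cut"] else node["right"]
    return path_length(child, x, depth + 1)


def mean_path_length(forest, x):
    """Average height at which x falls over all iTrees in the forest."""
    return sum(path_length(tree, x) for tree in forest) / len(forest)


# A tiny hand-built tree: split on feature 0 at 5.0, then on feature 1 at 2.0.
example_tree = {
    "dim": 0, "cut": 5.0,
    "left": {"size": 1},
    "right": {"dim": 1, "cut": 2.0, "left": {"size": 1}, "right": {"size": 1}},
}
```

A point that is separated near the root, such as `(1.0, 0.0)` in `example_tree`, has a short path, which is what marks it as easy to isolate.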
The path length h(x) of a sample point x is the number of edges traversed from the root node of the iTree to the leaf node. Given a data set containing y samples, the average path length of the tree is
$$c(y) = 2H(y-1) - \frac{2(y-1)}{y}$$
where H(i) is the harmonic number, which may be estimated as ln(i) + 0.577215. c(y) is the average path length for a given number of samples y and is used to normalize the path length h(x) of the sample x.
The anomaly probability of sample x is
$$s(x, y) = 2^{-\frac{E(h(x))}{c(y)}}$$
where E(h(x)) is the average of h(x) over all iTrees.
When the anomaly probability s(x, y) → 1, that is, when the anomaly score of x approaches 1, x is determined to be anomalous.
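The two formulas translate directly into code. The sketch below uses the harmonic number estimate given in the text; the function names are hypothetical, and the restriction to y > 2 is an assumption of the sketch.

```python
import math

EULER_GAMMA = 0.577215  # constant used in the harmonic number estimate


def harmonic(i):
    """Estimate of the harmonic number H(i) as ln(i) + 0.577215."""
    return math.log(i) + EULER_GAMMA


def c(y):
    """Average path length c(y) for a sample of size y (assumes y > 2)."""
    if y <= 2:
        raise ValueError("this sketch assumes more than two samples")
    return 2.0 * harmonic(y - 1) - 2.0 * (y - 1) / y


def anomaly_score(mean_path, y):
    """Anomaly score s(x, y) = 2 ** (-E(h(x)) / c(y))."""
    return 2.0 ** (-mean_path / c(y))
```

A mean path length equal to c(y) gives a score of exactly 0.5, and shorter paths push the score towards 1.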
Of course, in this embodiment, the label used for training the decision tree may be the abnormal probability (i.e. the probability of being attacked by APT) of the sample, or may be only the label representing whether the sample is abnormal or not.
Compared with other algorithms, the isolated forest performs better, and its training cost is almost independent of the size of the training data. Most applications indicate that only a very small number of samples (256 by default) need to be drawn to build each tree, and a good detection effect can be achieved with 100 decision trees.
Step 105: when the output result of the decision tree model indicates that the host to be detected is attacked by APT or the probability of being attacked by APT is greater than a set threshold value, an alarm is given to avoid greater loss.
The method adopts the iForest algorithm to detect host anomalies. The iForest algorithm selects features randomly and does not use any distance or density measure, and it can separate micro-clusters through the data partitioning capability of decision trees, so the obtained result is more accurate. In addition, the iForest algorithm has excellent computational performance: the basic iTree model can be constructed from only a small number of samples during training, giving linear time complexity and a very low storage requirement.
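The alarm rule can be sketched as below. It combines the threshold test with the earlier rule that a host counts as attacked as soon as one of its comprehensive characteristic values is flagged; the threshold value 0.6 and the function name are assumptions chosen for the example.

```python
def should_alarm(scores, threshold=0.6):
    """Return True if any anomaly probability of the host exceeds the threshold.

    scores: anomaly probabilities, one per comprehensive characteristic value
            of the host. threshold: assumed example value, set by the operator.
    """
    return any(score > threshold for score in scores)
```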
Example 2
Referring to fig. 2, the present embodiment provides an APT attack detection system based on data mining, where the system includes:
The DNS log feature extraction module 201 is configured to obtain, based on the DNS log, an access frequency feature of the host to be detected and a domain name popularity feature of the host to be detected, where the access frequency feature of the host to be detected indicates a frequency within a set frequency range among the frequencies at which the host to be detected accesses each domain name, and the domain name popularity feature indicates a ratio of the number of hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period.
The traffic log feature extraction module 202 is configured to obtain, based on the network traffic log, a traffic feature of the host to be detected and a port protocol mismatch feature of the host to be detected, where the traffic feature of the host to be detected indicates a ratio of upload traffic to download traffic of the host to be detected, and the port protocol mismatch feature of the host to be detected indicates a situation where a communication protocol is not matched with a port.
The feature fusion module 203 is configured to fuse, by using a fuzzy mathematical model, an access frequency feature of the host to be detected, a domain popularity feature of the host to be detected, a traffic feature of the host to be detected, and a port protocol mismatch feature of the host to be detected, so as to obtain a comprehensive feature value of the host to be detected.
The anomaly detection module 204 is configured to input the comprehensive feature value of the host to be detected into a trained decision tree model based on the isolation forest algorithm, to obtain a detection result indicating whether the host to be detected is under APT attack.
The alarm module 205 is configured to issue an alarm when the output of the decision tree model indicates that the host to be detected is under APT attack, or that the probability of its being under APT attack is greater than a set threshold.
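As an illustration of what the two feature extraction modules compute, the following Python sketch derives the four features from small in-memory logs. The record layouts, the `EXPECTED_PROTOCOL` table and all function names are hypothetical, since the log formats are not specified here. The sketch also assumes one particular reading of two features: the access frequency feature is taken as the per-domain query counts that fall within the set range, and popularity is measured per domain name as the share of active hosts that queried it.

```python
from collections import Counter

# Hypothetical record layouts (assumptions, not taken from the source):
#   DNS log entry:     (host, domain)
#   traffic log entry: (host, dst_port, protocol, bytes_up, bytes_down)
EXPECTED_PROTOCOL = {53: "dns", 80: "http", 443: "tls"}  # illustrative table only


def access_frequency_feature(dns_log, host, low, high):
    """Per-domain query counts of one host that fall within [low, high]."""
    counts = Counter(domain for h, domain in dns_log if h == host)
    return {d: n for d, n in counts.items() if low <= n <= high}


def domain_popularity(dns_log, domain):
    """Share of active hosts in the log window that queried the domain."""
    active = {h for h, _ in dns_log}
    visitors = {h for h, d in dns_log if d == domain}
    return len(visitors) / len(active) if active else 0.0


def traffic_ratio(traffic_log, host):
    """Ratio of uploaded bytes to downloaded bytes for one host."""
    up = sum(r[3] for r in traffic_log if r[0] == host)
    down = sum(r[4] for r in traffic_log if r[0] == host)
    if down == 0:
        return float("inf") if up else 0.0
    return up / down


def port_protocol_mismatches(traffic_log, host):
    """Number of flows whose protocol differs from the one expected on the port."""
    return sum(
        1
        for h, port, proto, _, _ in traffic_log
        if h == host and port in EXPECTED_PROTOCOL and proto != EXPECTED_PROTOCOL[port]
    )


dns_log = ([("h1", "a.com")] * 3 + [("h1", "b.com")] * 12
           + [("h2", "a.com"), ("h3", "c.com")])
traffic_log = [
    ("h1", 443, "tls", 900, 100),
    ("h1", 80, "ssh", 100, 100),   # SSH on port 80: counted as a mismatch
    ("h2", 53, "dns", 10, 50),
]
print(access_frequency_feature(dns_log, "h1", 10, 20))
print(domain_popularity(dns_log, "a.com"))
print(traffic_ratio(traffic_log, "h1"))
print(port_protocol_mismatches(traffic_log, "h1"))
```

A production system would read these records from the actual DNS and network traffic logs and restrict them to the set time period before computing the features.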
In this embodiment, the DNS log feature extraction module 201 is further configured to obtain, based on the DNS log, an access frequency feature of the sample host and a domain name popularity feature of the sample host, where the access frequency feature of the sample host indicates those frequencies, among the frequencies with which the sample host accesses each domain name, that fall within a set frequency range, and the domain name popularity feature of the sample host indicates the ratio of the number of hosts accessing the sample host in a set time period to the number of active hosts in that time period.
The traffic log feature extraction module 202 is further configured to obtain, based on the network traffic log, a traffic feature of the sample host and a port protocol mismatch feature of the sample host, where the traffic feature of the sample host indicates the ratio of upload traffic to download traffic of the sample host, and the port protocol mismatch feature of the sample host indicates cases where a communication protocol of the sample host does not match the port.
The feature fusion module 203 is further configured to fuse, by using a fuzzy mathematical model, the access frequency feature, the domain name popularity feature, the traffic feature, and the port protocol mismatch feature of the sample host, so as to obtain a comprehensive feature value of the sample host.
The APT attack detection system provided in this embodiment may further include a decision tree training module, configured to train a decision tree based on the isolation forest algorithm, taking the set of comprehensive feature values of all the sample hosts as the training set and the APT attack status of each sample host as the label, to obtain the decision tree model. The output of the decision tree model is the probability that the host to be detected is under APT attack.
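The fusion step can be pictured with a small Python sketch of weighted fuzzy evaluation. The fuzzy mathematical model itself is not restated in this part of the description, so everything below is an assumption for illustration: the piecewise-linear membership function `ramp`, the breakpoints, the equal weights, and the use of scalar inputs for each feature are all hypothetical. Each feature is mapped to a degree of membership in a "suspicious" set, and the weighted average of those degrees serves as the comprehensive feature value.

```python
def ramp(x, lo, hi):
    """Piecewise-linear membership: 0 at or below lo, 1 at or above hi."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))


# Hypothetical membership functions: how strongly each feature suggests an attack.
MEMBERSHIP = {
    "access_freq": lambda x: ramp(x, 0, 10),         # domains queried at a suspicious rate
    "popularity": lambda x: 1.0 - ramp(x, 0.0, 0.5),  # rarely visited targets score higher
    "traffic_ratio": lambda x: ramp(x, 1.0, 5.0),     # upload-heavy hosts score higher
    "mismatch": lambda x: ramp(x, 0, 3),              # port/protocol mismatches
}

# Hypothetical weights; they are normalised inside fuse().
WEIGHTS = {"access_freq": 0.25, "popularity": 0.25, "traffic_ratio": 0.25, "mismatch": 0.25}


def fuse(features, membership=MEMBERSHIP, weights=WEIGHTS):
    """Weighted average of membership degrees, giving a value in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[k] * membership[k](features[k]) for k in weights) / total


sample_host = {"access_freq": 4, "popularity": 0.05, "traffic_ratio": 3.0, "mismatch": 1}
print(fuse(sample_host))
```

The set of values produced this way for all sample hosts would form the training set, and the value for a host to be detected would be the input to the trained model.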
The embodiments in this description are described progressively: each embodiment focuses on its differences from the others, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the system disclosed in this embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for the relevant points, refer to the description of the method.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. An APT attack detection method based on data mining is characterized by comprising the following steps:
acquiring access frequency characteristics of a host to be detected and domain name popularity characteristics of the host to be detected based on a DNS log, wherein the access frequency characteristics of the host to be detected represent a frequency within a set frequency range in the frequency of accessing each domain name by the host to be detected, and the domain name popularity characteristics represent the ratio of the number of hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period;
acquiring flow characteristics of a host to be detected and port protocol mismatching characteristics of the host to be detected based on a network flow log, wherein the flow characteristics of the host to be detected represent the ratio of uploading flow and downloading flow of the host to be detected, and the port protocol mismatching characteristics of the host to be detected represent the condition that a communication protocol is not matched with a port;
fusing the access frequency characteristic of the host to be detected, the domain name popularity characteristic of the host to be detected, the flow characteristic of the host to be detected and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected;
and inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolation forest algorithm to obtain a detection result of whether the host to be detected is under APT attack.
2. The method of claim 1, wherein the method further comprises:
acquiring access frequency characteristics of a sample host and domain name popularity characteristics of the sample host based on a DNS log, wherein the access frequency characteristics of the sample host represent a frequency within a set frequency range in the frequency of accessing each domain name by the sample host, and the domain name popularity characteristics of the sample host represent the ratio of the number of hosts accessing the sample host in a set time period to the number of active hosts in the set time period;
acquiring flow characteristics of a sample host and port protocol mismatching characteristics of the sample host based on a network flow log, wherein the flow characteristics of the sample host represent the ratio of uploading flow and downloading flow of the sample host, and the port protocol mismatching characteristics of the sample host represent the condition that a communication protocol of the sample host is not matched with a port;
fusing the access frequency characteristic of the sample host, the domain name popularity characteristic of the sample host, the flow characteristic of the sample host and the port protocol mismatching characteristic of the sample host by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the sample host;
and training a decision tree based on an isolation forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model.
3. The data mining-based APT attack detection method according to claim 1, wherein the output of the decision tree model is a probability that the host to be detected is under APT attack.
4. The data mining-based APT attack detection method according to any one of claims 1-3, characterized in that the method further comprises: issuing an alarm when the output of the decision tree model indicates that the host to be detected is under APT attack or that the probability of its being under APT attack is greater than a set threshold.
5. An APT attack detection system based on data mining, characterized by comprising:
the DNS log feature extraction module is used for acquiring access frequency features of a host to be detected and domain name popularity features of the host to be detected based on the DNS log, wherein the access frequency features of the host to be detected represent a frequency within a set frequency range in the frequency of the host to be detected accessing each domain name, and the domain name popularity features represent the ratio of the number of the hosts accessing the host to be detected in a set time period to the number of active hosts in the set time period;
the flow log feature extraction module is used for acquiring flow features of the host to be detected and port protocol mismatching features of the host to be detected based on the network flow log, wherein the flow features of the host to be detected represent the ratio of uploading flow and downloading flow of the host to be detected, and the port protocol mismatching features of the host to be detected represent the condition that a communication protocol is not matched with a port;
the characteristic fusion module is used for fusing the access frequency characteristic of the host to be detected, the domain name popularity characteristic of the host to be detected, the flow characteristic of the host to be detected and the port protocol mismatching characteristic of the host to be detected by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the host to be detected;
and the anomaly detection module is used for inputting the comprehensive characteristic value of the host to be detected into a trained decision tree model based on an isolation forest algorithm to obtain a detection result of whether the host to be detected is under APT attack.
6. The data mining based APT attack detection system according to claim 5,
the DNS log feature extraction module is further used for obtaining access frequency features of the sample host and domain name popularity features of the sample host based on the DNS log, wherein the access frequency features of the sample host represent a frequency in a set frequency range in the frequency of the sample host accessing each domain name, and the domain name popularity features of the sample host represent the ratio of the number of hosts accessing the sample host in a set time period to the number of active hosts in the set time period;
the flow log feature extraction module is further used for obtaining flow features of the sample host and port protocol mismatching features of the sample host based on the network flow log, wherein the flow features of the sample host represent the ratio of uploading flow and downloading flow of the sample host, and the port protocol mismatching features of the sample host represent the condition that a communication protocol of the sample host is not matched with a port;
the characteristic fusion module is also used for fusing the access frequency characteristic of the sample host, the domain name popularity characteristic of the sample host, the flow characteristic of the sample host and the port protocol mismatching characteristic of the sample host by adopting a fuzzy mathematical model to obtain a comprehensive characteristic value of the sample host;
the system further comprises: a decision tree training module, which is used for training a decision tree based on an isolation forest algorithm by taking a set formed by the comprehensive characteristic values of the sample hosts as a training set and taking the APT attack condition of the sample hosts as a label to obtain the decision tree model.
7. The data mining-based APT attack detection system of claim 5, wherein the output of the decision tree model is a probability that a host to be detected is under APT attack.
8. The data mining-based APT attack detection system according to any one of claims 5-7, characterized in that the system further comprises: an alarm module, which is used for issuing an alarm when the output of the decision tree model indicates that the host to be detected is under APT attack or that the probability of its being under APT attack is greater than a set threshold.
CN202011187318.7A 2020-10-30 2020-10-30 APT attack detection method and system based on data mining Pending CN112333180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011187318.7A CN112333180A (en) 2020-10-30 2020-10-30 APT attack detection method and system based on data mining

Publications (1)

Publication Number Publication Date
CN112333180A true CN112333180A (en) 2021-02-05

Family

ID=74297893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011187318.7A Pending CN112333180A (en) 2020-10-30 2020-10-30 APT attack detection method and system based on data mining

Country Status (1)

Country Link
CN (1) CN112333180A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103916406A (en) * 2014-04-25 2014-07-09 上海交通大学 System and method for detecting APT attacks based on DNS log analysis
US9635049B1 (en) * 2014-05-09 2017-04-25 EMC IP Holding Company LLC Detection of suspicious domains through graph inference algorithm processing of host-domain contacts
CN108154029A (en) * 2017-10-25 2018-06-12 上海观安信息技术股份有限公司 Intrusion detection method, electronic equipment and computer storage media
SG11201906843TA (en) * 2017-01-24 2019-08-27 Ensco Int Inc Joint recognition system
CN111371757A (en) * 2020-02-25 2020-07-03 腾讯科技(深圳)有限公司 Malicious communication detection method and device, computer equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
司德睿等: "一种基于机器学习的安全威胁分析系统", 《信息技术与网络安全》 *
钟瑶: "基于数据挖掘的APT攻击检测方法研究与实现", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113452707A (en) * 2021-06-28 2021-09-28 华中科技大学 Scanner network scanning attack behavior detection method, medium and terminal
CN113452707B (en) * 2021-06-28 2022-07-22 华中科技大学 Scanner network scanning attack behavior detection method, medium and terminal
CN115378670A (en) * 2022-08-08 2022-11-22 北京永信至诚科技股份有限公司 APT attack identification method and device, electronic equipment and medium
CN115378670B (en) * 2022-08-08 2024-03-12 永信至诚科技集团股份有限公司 APT attack identification method and device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
KR102046789B1 (en) Deep-learning-based intrusion detection method, system and computer program for web applications
Kirubavathi et al. Botnet detection via mining of traffic flow characteristics
Alshamkhany et al. Botnet attack detection using machine learning
Najafabadi et al. Machine learning for detecting brute force attacks at the network level
Zhang et al. Network Intrusion Detection using Random Forests.
Shah et al. Fuzzy clustering for intrusion detection
Yoon et al. Communication pattern monitoring: Improving the utility of anomaly detection for industrial control systems
Rashid et al. Machine and deep learning based comparative analysis using hybrid approaches for intrusion detection system
Bagui et al. Using machine learning techniques to identify rare cyber‐attacks on the UNSW‐NB15 dataset
US11700269B2 (en) Analyzing user behavior patterns to detect compromised nodes in an enterprise network
US20200134175A1 (en) Chain of events representing an issue based on an enriched representation
CN110768946A (en) Industrial control network intrusion detection system and method based on bloom filter
CN112333180A (en) APT attack detection method and system based on data mining
CN117216660A (en) Method and device for detecting abnormal points and abnormal clusters based on time sequence network traffic integration
Le et al. Unsupervised monitoring of network and service behaviour using self organizing maps
Daneshgadeh et al. An empirical investigation of DDoS and Flash event detection using Shannon entropy, KOAD and SVM combined
WO2006008307A1 (en) Method, system and computer program for detecting unauthorised scanning on a network
Huang et al. Network forensic analysis using growing hierarchical SOM
Algaolahi et al. Port-scanning attack detection using supervised machine learning classifiers
CN116915450A (en) Topology pruning optimization method based on multi-step network attack recognition and scene reconstruction
Mahardhika et al. An implementation of Botnet dataset to predict accuracy based on network flow model
CN111901286A (en) APT attack detection method based on flow log
Nawaz et al. Attack detection from network traffic using machine learning
Puthran et al. Intrusion detection using data mining
US8869267B1 (en) Analysis for network intrusion detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210205