CN113094707B

CN113094707B - Lateral movement attack detection method and system based on heterogeneous graph network

Info

Publication number: CN113094707B
Application number: CN202110347685.7A
Authority: CN
Inventors: 卢志刚; 王天; 姜波; 刘俊荣; 刘松; 董璞
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2024-05-14
Anticipated expiration: 2041-03-31
Also published as: CN113094707A

Abstract

The invention relates to a lateral movement attack detection method and system based on a heterogeneous graph network. Starting from the authentication logs of an intranet, the method structures the login behavior between users and hosts as graphs, constructing a user login graph and a source host path graph, and then performs two-stage anomaly detection on these graphs. The first stage is based on the user login graph: a graph neural network algorithm that maximizes mutual information learns the behavior pattern of the hosts, and a local outlier factor algorithm then yields some anomalous samples. The second stage is based on the source host path graph and the labeled samples obtained in the first stage: a heterogeneous graph attention network algorithm performs semi-supervised learning to detect lateral movement attack behavior. The method detects lateral movement attack behavior simply and effectively without sample labels, outperforms most supervised learning methods, and achieves a high recall rate and a low false alarm rate.

Description

Lateral movement attack detection method and system based on heterogeneous graph network

Technical Field

The invention relates to the field of computer network security and is used to counter the lateral movement attack behavior carried out in advanced persistent threats; in particular, it relates to a lateral movement attack detection method and system based on a heterogeneous graph network.

Background

In recent years, with the rapid development of the Internet, the network environment has become increasingly complex and network attacks increasingly frequent. Among them, advanced persistent threats (Advanced Persistent Threat, APT) have benefited from advances in attack methodology and improvements in attacker organization, and occur more and more often. Compared with other attacks, APT attacks have a longer latency period and greater destructive power. Their attack methods are comprehensive, and customized attack tools can be developed through long-term observation of the target, so the threat they pose is severe. Detection of and protection against APT attacks has therefore become a major issue in network security.

Lateral movement is an extremely important link in an APT attack and is the main process by which an attacker carries out the attack after entering an intranet. According to the ATT&CK framework, lateral movement consists of the techniques an attacker uses to access and control remote systems on a network. After successfully intruding into the network and establishing a foothold, the attacker usually moves laterally within the network to prepare the next attack and to collect information about the target network, eventually gaining control of the whole network and achieving goals such as destroying the target network or infrastructure or stealing confidential data or core intellectual property. The resulting harm is severe.

At present, lateral movement attack detection is still at a relatively preliminary stage. Existing research mainly converts the problem into detecting abnormal users or hosts in the intranet: the behavior of a user or host is modeled, and behavior that exceeds a threshold is flagged as abnormal. According to the detection target, methods can be divided into a movement target type and a movement path type. Movement target methods mainly detect the users or hosts compromised by the attacker during lateral movement; movement path methods take the movement path generated during the lateral movement attack as the detection target. Much existing research focuses on movement-target-type detection, while the movement paths of lateral movement attacks have received less study.

In summary, a lateral movement attack usually steals user credentials and masquerades as a normal user, so it is highly concealed and difficult to detect. Existing research generally converts the problem into detecting abnormal users or hosts in the intranet, but the following shortcomings remain. First, massive multi-source security logs cause existing methods to have a high false alarm rate. Second, in a real network environment, few or no abnormal users or hosts can be observed, and those that are observed are not fully exploited. Third, an intranet is essentially an association graph composed of users and hosts, and detecting lateral movement attacks on such a graph remains to be studied.

Disclosure of Invention

To solve the above problems, the invention proposes HGLM (Lateral Movement detection using Heterogeneous Graph), a two-stage lateral movement attack detection method based on a heterogeneous graph network.

The principle of the invention is as follows: starting from the authentication logs of the intranet, the login behavior between users and hosts is structured as graphs, a user login graph and a source host path graph are constructed, and two-stage anomaly detection is then performed on these graphs. The first stage is based on the user login graph: a graph neural network algorithm that maximizes mutual information learns the behavior pattern of the hosts, and a local outlier factor algorithm then yields some anomalous samples. The second stage is based on the source host path graph and the labeled samples obtained in the first stage: a heterogeneous graph attention network algorithm performs semi-supervised learning to detect lateral movement attack behavior.

To achieve the above purpose, the invention adopts the following technical scheme:

A lateral movement attack detection method based on a heterogeneous graph network, comprising the following steps:

1) Data set extraction. Because a lateral movement attack involves login authentication behavior between users and hosts, data set extraction consists of collecting the authentication logs generated by intranet devices and constructing a data set from them.

2) Security log graph structuring. A user login graph and a source host path graph are constructed from the extracted data set.

3) Abnormal login behavior detection based on unsupervised learning: based on the user login graph, abnormal login behavior detection based on unsupervised learning is performed. This part is the first stage of the HGLM two-stage anomaly detection.

4) Lateral movement attack detection based on semi-supervised learning: based on the source host path graph and the small number of labeled samples from the first stage, lateral movement attack detection based on semi-supervised learning is performed. This part is the second stage of the HGLM two-stage anomaly detection.

Further, the security log graph structuring mainly comprises three parts, namely data preprocessing, user login graph construction and source host path graph construction.

A) The first step of log graph structuring is to preprocess the authentication log of the intranet. An authentication log typically contains attributes such as authentication time, source user, target user, source host, target host, and authentication status. The original log information is redundant and heterogeneous, so it must be processed into a format that suits the lateral movement attack scenario. First, because an attacker typically uses a compromised user to move laterally from one host to another, only authentication events whose source user and target user are the same need to be considered. Second, a lateral movement attack involves at least two hosts, so authentication events whose source host and target host are the same must be filtered out. In summary, the preprocessing flow is as follows: traverse each authentication event in the authentication log data set D and keep the events in which the source user equals the target user and the source host differs from the target host, yielding the processed data set D₁.
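The preprocessing rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the `AuthEvent` record, its field names, and the toy events are assumptions made for the example and do not come from the patent.

```python
from typing import NamedTuple


class AuthEvent(NamedTuple):
    # Hypothetical record layout mirroring the attributes listed in the text
    time: int        # authentication time (unit is an assumption)
    src_user: str
    dst_user: str
    src_host: str
    dst_host: str
    success: bool    # authentication status


def preprocess(events):
    """Keep events with the same source/target user and different source/target host."""
    return [
        e for e in events
        if e.src_user == e.dst_user and e.src_host != e.dst_host
    ]


# Toy data set D (made up for illustration)
D = [
    AuthEvent(1, "U1", "U1", "C1", "C2", True),   # kept
    AuthEvent(2, "U1", "U2", "C1", "C2", True),   # dropped: users differ
    AuthEvent(3, "U1", "U1", "C1", "C1", True),   # dropped: same host
]
D1 = preprocess(D)
```

Only the first toy event satisfies both conditions, so `D1` contains a single event.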

B) The user login graph (User Authentication Graph, UAG) is an undirected homogeneous graph that represents the login behavior pattern of a user between hosts within a certain period of time. Define the graph G_u = (V, E, F), where the nodes V represent hosts and the edges E represent the user's login connections between hosts. The number of logins of the user on each host under a sliding window is assigned to the nodes as the feature F; the edges carry no features and only represent connection relationships. This yields an attributed user login graph. Specifically, given the data set D₁, a user u, and a sliding window length L, the authentication events belonging to user u are first selected from D₁ to obtain the data set D_u. Second, the data are divided into several time windows according to the sliding window length L, over which the login count feature F of the user on each host is computed. Finally, each authentication event in D_u is traversed: the source host and the target host are added to the node set V, the edge connecting them is added to the edge set E (nodes and edges that already exist are ignored), and the login counts of the source host and the target host in F are incremented by one in the corresponding window. When the traversal ends, the attributed user login graph G_u = (V, E, F) of user u is obtained.
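A minimal sketch of this construction is shown below. It assumes events are `(time, user, src_host, dst_host)` tuples and that the windows are consecutive, non-overlapping intervals of length L starting at the user's first event; the text does not fix either detail, and the function name is hypothetical.

```python
from collections import defaultdict


def build_user_login_graph(events, user, window_len):
    """Return (V, E, F) for one user from preprocessed (time, user, src, dst) events."""
    d_u = [e for e in events if e[1] == user]          # data set D_u
    if not d_u:
        return set(), set(), {}
    t0 = min(e[0] for e in d_u)
    n_windows = (max(e[0] for e in d_u) - t0) // window_len + 1
    nodes, edges = set(), set()
    feats = defaultdict(lambda: [0] * n_windows)       # per-host login counts per window
    for t, _, src, dst in d_u:
        w = (t - t0) // window_len                     # index of the window containing t
        nodes.update((src, dst))                       # sets ignore repeated additions
        edges.add(frozenset((src, dst)))               # undirected edge
        feats[src][w] += 1
        feats[dst][w] += 1
    return nodes, edges, dict(feats)


# Toy events (made up for illustration)
events = [
    (0, "U1", "C1", "C2"),
    (5, "U1", "C2", "C1"),
    (12, "U1", "C1", "C3"),
    (3, "U2", "C4", "C5"),
]
V, E, F = build_user_login_graph(events, "U1", window_len=10)
```

With these toy events, user U1 gets three host nodes, two undirected edges, and a two-window count vector per host.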

C) The source host path graph (Host Path Graph, HPG) is a directed heterogeneous graph that represents the association between a user's login path to a target host and the source host. Define the graph G_p = (V, E, F). Two types of nodes are defined: one type represents a source host, V_src, and the other represents a login path from a user to a target host, V_path. There are also two types of edges. One is the sending edge E_send, which points from a source host node to a login path node and indicates that the user logs in from the source host to the target host. The other is the depending edge E_on, which points from a login path node to a source host node and indicates that the user's login path to the target host occurs on that source host. The two types of edges are symmetric. The number of occurrences of the login path on the source host under a sliding window, together with the statistical features F_statistic, are assigned to the nodes, and the edges only represent connection relationships; this yields the source host path graph. Specifically, given the data set D₁, a sliding window length L, and the statistical features F_statistic, each event in D₁ is traversed: the source host is added to V_src, the user and the target host are concatenated into a login path and added as a node to V_path, the edge pointing from the source host to the login path is added to E_send, the edge pointing from the login path to the source host is symmetrically added to E_on, and the sliding-window count feature is computed in the same way as for the user login graph. Finally, the login path nodes V_path in the graph are traversed and the statistical features F_statistic are added to them, while the source host nodes V_src are given one-hot encoded features, yielding the attributed source host path graph G_p = (V, E, F).
The statistical features used include the numbers of successful and failed authentications of the user to the target host, the ratio of the number of authentications of the user to the target host to the user's total number of authentications, and the minimum, maximum, and average of the time intervals between the user's authentication events to the target host.
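The construction of the source host path graph can be sketched as follows. The tuple layout `(time, user, src_host, dst_host)`, the representation of a login path as a `(user, target_host)` pair, and the non-overlapping windowing are assumptions made for this example; the statistical features are omitted here for brevity.

```python
from collections import defaultdict


def build_source_host_path_graph(events, window_len):
    """Build node sets, edge sets and count features from preprocessed events."""
    events = list(events)
    t0 = min(e[0] for e in events)
    n_windows = (max(e[0] for e in events) - t0) // window_len + 1
    v_src, v_path = set(), set()
    e_send, e_on = set(), set()
    counts = defaultdict(lambda: [0] * n_windows)   # path occurrences per window
    for t, user, src, dst in events:
        path = (user, dst)                          # login path: user -> target host
        v_src.add(src)
        v_path.add(path)
        e_send.add((src, path))                     # source host -> login path
        e_on.add((path, src))                       # login path -> source host
        counts[path][(t - t0) // window_len] += 1
    hosts = sorted(v_src)                           # fixed order for one-hot encoding
    one_hot = {h: [int(i == j) for j in range(len(hosts))]
               for i, h in enumerate(hosts)}
    return v_src, v_path, e_send, e_on, dict(counts), one_hot


# Toy events (made up for illustration)
events = [
    (0, "U1", "C1", "C2"),
    (4, "U1", "C3", "C2"),
    (11, "U1", "C1", "C4"),
]
V_src, V_path, E_send, E_on, F_count, F_src = build_source_host_path_graph(events, 10)
```

Because every event adds one edge to each edge set, E_on is always the mirror image of E_send, which matches the symmetry described above.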

Further, the abnormal login behavior detection based on unsupervised learning includes the following. Based on the user login graph, a graph neural network algorithm that maximizes mutual information (Deep Graph Infomax, DGI) is first used to learn the behavior pattern of the hosts; that is, hidden-layer feature representations of the samples are obtained by training to maximize the mutual information between the local features h and the global feature s. Specifically, in the graph, the feature vector of each node, learned through a graph convolution encoder, is the local feature h of the node, and the global feature s is obtained through an average readout function. Negative samples are then obtained by applying a random shuffling perturbation to the nodes, a discriminator scores the sample pairs composed of h and s, and the hidden-layer representations of the nodes are finally obtained. Based on the sample feature representations learned by DGI, detection is performed with the local outlier factor algorithm (Local Outlier Factor, LOF); a small number of labeled host samples are obtained by setting a threshold, and these labeled host samples are combined with the corresponding user to form login paths from the user to the target host, which are used for semi-supervised learning in the second stage.
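The LOF thresholding step can be illustrated with a small pure-Python sketch. The embeddings below stand in for DGI hidden-layer representations and are made up; the neighbor count `k`, the threshold of 1.5, and all names are assumptions chosen for the example, not values given in the patent.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor of each point, using its k nearest neighbours."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # k nearest neighbours of each point, excluding itself (ties broken by index)
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
           for i in range(n)]
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]   # distance to k-th neighbour

    def lrd(i):
        # local reachability density: inverse of the mean reachability distance
        reach = [max(k_dist[o], dist[i][o]) for o in knn[i]]
        return len(reach) / max(sum(reach), 1e-12)

    lrds = [lrd(i) for i in range(n)]
    return [sum(lrds[o] for o in knn[i]) / (k * lrds[i]) for i in range(n)]


# Toy host embeddings for one user (stand-ins for DGI outputs)
user = "U1"
hosts = ["C1", "C2", "C3", "C4", "C9"]
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

scores = lof_scores(embeddings, k=2)
threshold = 1.5                                      # hypothetical threshold
flagged = [i for i, s in enumerate(scores) if s > threshold]
# Combine flagged hosts with the user into labeled login paths for the second stage
labeled_paths = [(user, hosts[i]) for i in flagged]
```

In this toy example the four clustered hosts score exactly 1, while the isolated host scores well above the threshold and becomes the only labeled login path.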

Further, the lateral movement attack detection based on semi-supervised learning includes the following. Based on the source host path graph and the small number of labeled samples from the first stage, semi-supervised learning on the graph is performed with the heterogeneous graph attention network algorithm (Heterogeneous graph Attention Network, HAN), and more lateral movement attack behavior is detected by learning the associations between login path nodes. HAN introduces attention mechanisms into heterogeneous graphs, including node-level attention and semantic-level attention. With meta-paths defined on the graph, node-level attention learns the weights of a node's neighbors along each meta-path, while semantic-level attention learns the weights of the different meta-paths. The final node representation is then obtained through the corresponding aggregation operations. Specifically, two meta-paths are defined on the graph: meta-path p₁ = (v_path, e_on, v_src), from a path node to a source host node, and meta-path p₂ = (v_path, e_on, v_src, e_send, v_path), from a path node to a source host node to a path node. Based on these two meta-paths, node-level and semantic-level attention features are computed, the labeled samples from the first stage are used, semi-supervised learning is performed with the cross-entropy loss function as the objective, and lateral movement attack behavior is detected.
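To make the two meta-paths concrete, the sketch below derives the meta-path neighbors of each login path node from the depending edges E_on. The edge representation and function name are assumptions for illustration. Self-neighbors are excluded here for readability, although HAN-style implementations commonly treat a node as its own meta-path neighbor.

```python
from collections import defaultdict


def metapath_neighbors(e_on):
    """e_on: set of (path_node, src_host) edges; E_send is assumed to be its mirror."""
    hosts_of = defaultdict(set)    # path node -> source hosts (meta-path p1)
    paths_on = defaultdict(set)    # source host -> path nodes (reverse direction, E_send)
    for path, host in e_on:
        hosts_of[path].add(host)
        paths_on[host].add(path)
    p1 = {p: set(hs) for p, hs in hosts_of.items()}
    # p2: path -> source host -> path, i.e. login paths sharing a source host
    p2 = {p: set().union(*(paths_on[h] for h in hs)) - {p}
          for p, hs in hosts_of.items()}
    return p1, p2


# Toy depending edges (made up for illustration)
E_on = {
    (("U1", "C2"), "C1"),
    (("U1", "C4"), "C1"),
    (("U1", "C2"), "C3"),
    (("U2", "C5"), "C6"),
}
p1, p2 = metapath_neighbors(E_on)
```

Under p₂, two login paths become neighbors when they were launched from the same source host, which is the association between login path nodes that the second stage learns from.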

Based on the same inventive concept, the invention also provides a lateral movement attack detection system based on a heterogeneous graph network, which comprises:

a data acquisition module, used for collecting the authentication logs generated by intranet devices and constructing a data set;

a security log graph structuring module, used for constructing a user login graph and a source host path graph from the data set;

an abnormal login behavior detection module based on unsupervised learning, used for performing abnormal login behavior detection based on unsupervised learning on the user login graph;

and a lateral movement attack detection module based on semi-supervised learning, used for performing lateral movement attack detection based on semi-supervised learning on the source host path graph and the labeled samples obtained by the abnormal login behavior detection based on unsupervised learning.

Compared with the prior art, the invention has the beneficial effects that:

The method detects lateral movement attack behavior simply and effectively without sample labels. On the related public data set CMCS EVENTS, the AUC value exceeds 95%, and for some users the TPR reaches 100% with an FPR of 0. The effect exceeds that of most supervised learning methods, with a high recall rate and a low false alarm rate.

Drawings

Fig. 1 is an overall flow chart of the lateral movement attack detection based on a heterogeneous graph network according to the invention, where X denotes the initial features of the positive samples, X′ denotes the initial features of the negative samples after perturbation, H denotes the hidden-layer features of the positive samples after graph convolution, H′ denotes the hidden-layer features of the negative samples after graph convolution, D denotes the discriminator, R denotes the average readout function, S denotes the global feature computed by the average readout function, Z₁ to Z_p denote the hidden-layer features obtained by node-level attention, and Z denotes the hidden-layer features obtained by semantic-level attention.

Fig. 2 is a flow chart of the construction of the user login graph in the present invention.

Fig. 3 is a flow chart of the construction of the source host path graph in the present invention.

Fig. 4 is a flow chart of abnormal login behavior detection based on unsupervised learning in the present invention.

Fig. 5 is a flow chart of lateral movement attack detection based on semi-supervised learning in the present invention.

Fig. 6 is a graph of the detection performance results of the method HGLM of the present invention for different users on the public dataset CMCS EVENTS.

Detailed Description

In order to better understand the technical solution in the embodiments of the present invention and make the objects, features and advantages of the present invention more obvious and understandable, the technical core of the present invention will be further described in detail below with reference to the accompanying drawings and examples.

The invention discloses a lateral movement attack detection method based on a heterogeneous graph network. As shown in Fig. 1, the method mainly comprises four parts: data acquisition, security log graph structuring, abnormal login behavior detection based on unsupervised learning, and lateral movement attack detection based on semi-supervised learning. The main steps are as follows:

Step 100 is data set extraction, that is, collecting authentication logs generated by intranet devices for a period of time, to form a data set.

Step 200 is security log graph structuring, and mainly comprises three parts of data preprocessing, user login graph construction and source host path graph construction.

The construction of the user login graph is shown in Fig. 2.

Step 210: given the data set D₁, first define a user u and a sliding window length L, which are used to select the authentication events in D₁ belonging to user u and to compute the user's login count feature F on each host under the different windows.

Step 220: traverse the data set D₁.

Step 230: select the authentication events in D₁ that belong to user u.

Step 240: add the source host and the target host of the authentication event to the node set V of the graph, and add the edge connecting the source host and the target host to the edge set E (nodes and edges that already exist are ignored).

Step 250: compute the login count feature F of the user on each host under the different windows, incrementing by one the login counts of the source host and the target host in the corresponding window.

When the traversal ends, the attributed user login graph G_u = (V, E, F) of user u is obtained.

The construction of the source host path graph is shown in Fig. 3.

Step 260: given the data set D₁, first define a sliding window length L and the statistical features F_statistic to be extracted.

Step 270: traverse each event in D₁; add the source host to V_src, concatenate the user and the target host into a login path and add it as a node to V_path, add the edge pointing from the source host to the login path to E_send, and symmetrically add the edge pointing from the login path to the source host to E_on.

Step 280: compute the occurrence count feature F of the login path under the different windows, incrementing by one the occurrence count of the login path in the corresponding window.

Step 290: add the corresponding statistical features F_statistic to the login path node.

When the traversal ends, the source host nodes V_src in the graph are given one-hot encoded features, and the attributed source host path graph G_p = (V, E, F) is obtained.
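The statistical features added in Step 290 could be computed along the following lines. This is a sketch under stated assumptions: events are `(time, user, src_host, dst_host, success)` tuples, intervals are taken between consecutive events on the same login path, and a path with a single event gets interval statistics of 0; none of these details is specified in the text, and the function and key names are hypothetical.

```python
def path_statistics(events, user, dst_host):
    """Statistical features of the login path (user -> dst_host)."""
    user_events = [e for e in events if e[1] == user]
    path_events = sorted((e for e in user_events if e[3] == dst_host),
                         key=lambda e: e[0])
    n_success = sum(1 for e in path_events if e[4])
    n_fail = len(path_events) - n_success
    # share of the user's authentications that target this host
    ratio = len(path_events) / len(user_events) if user_events else 0.0
    # time intervals between consecutive authentication events on this path
    gaps = [b[0] - a[0] for a, b in zip(path_events, path_events[1:])]
    if gaps:
        g_min, g_max, g_mean = min(gaps), max(gaps), sum(gaps) / len(gaps)
    else:
        g_min = g_max = g_mean = 0.0
    return {
        "n_success": n_success, "n_fail": n_fail, "ratio": ratio,
        "gap_min": g_min, "gap_max": g_max, "gap_mean": g_mean,
    }


# Toy events (made up for illustration)
events = [
    (0, "U1", "C1", "C2", True),
    (10, "U1", "C3", "C2", False),
    (40, "U1", "C1", "C2", True),
    (50, "U1", "C1", "C4", True),
]
stats = path_statistics(events, "U1", "C2")
```

For the toy path U1 to C2, three of the user's four authentications target C2, two succeed and one fails, and the two intervals are 10 and 30 time units.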

Step 300 is two-stage anomaly detection, in which the first stage is abnormal login behavior detection based on unsupervised learning and the second stage is lateral movement attack detection based on semi-supervised learning.

Abnormal login behavior detection based on unsupervised learning is shown in Fig. 4.

Step 310: based on the user login graph, DGI is first used to learn the behavior pattern of the hosts, and negative samples are obtained by perturbing the nodes with random shuffling.

Step 320: the feature vector of each node in the graph, learned through the graph convolution encoder, is the local feature h of the node, and the global feature s is obtained through the average readout function. With maximization of the mutual information between the local and global features as the objective, a discriminator scores the positive and negative sample pairs composed of h and s, yielding the hidden-layer representations of the nodes.

Step 330: based on the sample feature representations obtained by DGI, detection is performed with the local outlier factor algorithm, and a small number of labeled host samples are obtained by setting a threshold.

Finally, the labeled host samples are combined with the corresponding user to form login path training samples from the user to the target host, which are used for semi-supervised learning in the second stage.
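The text does not write out the training objective used in Steps 310 and 320. For orientation, the published Deep Graph Infomax formulation uses a readout, a bilinear discriminator, and a binary cross-entropy style objective of roughly the following form; the symbols W, σ, and N (number of nodes) are taken from that formulation and are assumptions with respect to this patent, and h′_j denotes the hidden-layer feature of a shuffled (negative) node, matching H′ in Fig. 1.

```latex
s = R(H) = \sigma\!\left(\frac{1}{N}\sum_{i=1}^{N} h_i\right), \qquad
D(h_i, s) = \sigma\!\left(h_i^{\top} W s\right)

\mathcal{L} = \frac{1}{2N}\left(\sum_{i=1}^{N} \log D(h_i, s)
            + \sum_{j=1}^{N} \log\bigl(1 - D(h'_j, s)\bigr)\right)
```

Maximizing this objective pushes the discriminator to score positive pairs (h, s) high and negative pairs (h′, s) low, which is how the mutual information between local and global features is maximized in practice.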

Lateral movement attack detection based on semi-supervised learning is shown in Fig. 5.

Step 340: based on the source host path graph and the small number of labeled samples from the first stage, semi-supervised learning on the graph is performed with HAN. First, two meta-paths are defined: meta-path p₁ = (v_path, e_on, v_src), from a path node to a source host node, and meta-path p₂ = (v_path, e_on, v_src, e_send, v_path), from a path node to a source host node to a path node.

Step 350: based on the two meta-paths, node-level and semantic-level attention features are computed; using the labeled samples from the first stage, semi-supervised learning is performed with the cross-entropy loss function as the objective, and lateral movement attack behavior is detected.

Finally, the abnormal samples detected in the first and second stages are combined to give the lateral movement attack behavior detected by the HGLM model.
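The attention computations in Step 350 are not spelled out in the text. In the published HAN formulation they take roughly the form below, given here only as a reference sketch: Φ ranges over the meta-paths p₁ and p₂, N_i^Φ is the set of meta-path neighbors of node i, h′_i is the type-projected feature of node i, and a_Φ, q, W, b, and C are learned parameters. All of this notation is an assumption with respect to the patent.

```latex
\alpha_{ij}^{\Phi} =
  \frac{\exp\bigl(\sigma(a_{\Phi}^{\top}[h'_i \,\|\, h'_j])\bigr)}
       {\sum_{k \in N_i^{\Phi}} \exp\bigl(\sigma(a_{\Phi}^{\top}[h'_i \,\|\, h'_k])\bigr)},
\qquad
z_i^{\Phi} = \sigma\Bigl(\sum_{j \in N_i^{\Phi}} \alpha_{ij}^{\Phi}\, h'_j\Bigr)

w_{\Phi} = \frac{1}{|V|}\sum_{i \in V} q^{\top}\tanh\bigl(W z_i^{\Phi} + b\bigr),
\qquad
\beta_{\Phi} = \frac{\exp(w_{\Phi})}{\sum_{\Phi'} \exp(w_{\Phi'})},
\qquad
Z = \sum_{\Phi \in \{p_1, p_2\}} \beta_{\Phi}\, Z_{\Phi}

\mathcal{L} = -\sum_{l \in Y_L} Y^{l} \ln\bigl(C \cdot Z^{l}\bigr)
```

The first line is node-level attention, the second is semantic-level attention over the meta-paths, and the last is the cross-entropy loss summed over the labeled nodes Y_L, which here would be the login paths labeled in the first stage.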

Experiments on the public data set CMCS EVENTS show that the AUC value of the detection results of the disclosed method HGLM exceeds 95%, and that for some users the TPR reaches 100% with an FPR of 0; the effect exceeds that of most supervised learning methods, with a high recall rate and a low false alarm rate. The experimental results are shown in Table 1. Compared with existing methods, HGLM is simple and effective, requires no sample labels, and can outperform most supervised detection methods. In addition, the detection performance of the model for different users is shown in Fig. 6: for most users the recall rate exceeds 95% and the false alarm rate is below 5%.

Table 1. Performance comparison of lateral movement attack detection models
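For readers unfamiliar with the metrics quoted above, TPR (recall) and FPR (false alarm rate) can be computed from binary labels as in the sketch below. The labels and predictions are toy values made up for the example and are not the patent's experimental data.

```python
def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate for binary labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0   # recall
    fpr = fp / (fp + tn) if fp + tn else 0.0   # false alarm rate
    return tpr, fpr


# Toy labels: 2 malicious and 4 benign login paths
y_true = [1, 1, 0, 0, 0, 0]
perfect = tpr_fpr(y_true, [1, 1, 0, 0, 0, 0])
noisy = tpr_fpr(y_true, [1, 0, 1, 0, 0, 0])
```

A detector that flags exactly the malicious paths gives a TPR of 1 and an FPR of 0, the situation the text reports for some users; the second toy prediction misses one attack and raises one false alarm.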

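Based on the same inventive concept, another embodiment of the present invention provides a lateral movement attack detection system based on a heterogeneous graph network, comprising the data acquisition module, the security log graph structuring module, the abnormal login behavior detection module based on unsupervised learning, and the lateral movement attack detection module based on semi-supervised learning described above.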

Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps in the inventive method.

Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.

Parts of the invention not described in detail, such as the local outlier factor algorithm, are within the knowledge of those skilled in the art.

Finally, it should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail by using examples, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention, and all such modifications and equivalents are intended to be encompassed in the scope of the claims of the present invention.

Claims

1. A lateral movement attack detection method based on a heterogeneous graph network, characterized by comprising the following steps:

collecting the authentication logs generated by intranet devices and constructing a data set;

constructing a user login graph and a source host path graph from the data set;

performing abnormal login behavior detection based on unsupervised learning on the user login graph;

performing lateral movement attack detection based on semi-supervised learning on the source host path graph and the labeled samples obtained by the abnormal login behavior detection based on unsupervised learning;

wherein the source host path graph is a directed heterogeneous graph and represents the association between a login path from a user to a target host and the source host; two types of nodes are defined in the source host path graph, one type representing a source host V_src and the other representing a login path V_path from a user to a target host; there are also two types of edges, one being a sending edge E_send, which points from a source host node to a login path node from the user to the target host and indicates that the user logs in from the source host to the target host, and the other being a depending edge E_on, which points from a login path node from the user to the target host to a source host node and indicates that the login path from the user to the target host occurs on the source host, the two types of edges being symmetric; the number of occurrences of the login path on the source host under a sliding window and the statistical features F_statistic are assigned to the nodes, and the edges only represent connection relationships, thereby obtaining the source host path graph.

2. The method of claim 1, wherein data preprocessing is performed before the user login graph and the source host path graph are constructed; the data preprocessing comprises: traversing each authentication event in the authentication log data set D and keeping the events in which the source user equals the target user and the source host differs from the target host, to obtain a processed data set D₁.

3. The method of claim 2, wherein the user login graph is an undirected homogeneous graph representing the login behavior pattern of a user between hosts over a period of time; the construction process of the user login graph comprises: given the data set D₁, a user u, and a sliding window length L, selecting from D₁ the authentication events belonging to user u to obtain a data set D_u; dividing the data into a plurality of time windows according to the sliding window length L, and computing the login count feature F of the user on each host under the different windows; traversing each authentication event in D_u, adding the source host and the target host to the node set V of the graph, adding the edge connecting the source host and the target host to the edge set E, and incrementing by one the login counts of the source host and the target host in F under the corresponding window; and obtaining, after the traversal ends, the attributed user login graph G_u = (V, E, F) of user u.

4. The method of claim 1, wherein the statistical features F_statistic comprise: the numbers of successful and failed authentications of the user to the target host, the ratio of the number of authentications of the user to the target host to the user's total number of authentications, and the minimum, maximum, and average of the time intervals between the user's authentication events to the target host.

5. The method of claim 1, wherein the abnormal login behavior detection based on unsupervised learning comprises:

based on the user login graph, learning the behavior pattern of the hosts with a graph neural network algorithm that maximizes mutual information, that is, obtaining hidden-layer feature representations of the samples by training to maximize the mutual information between the local features h and the global feature s of the samples;

obtaining negative samples by applying a random shuffling perturbation to the nodes, and scoring the sample pairs composed of h and s with a discriminator to obtain the hidden-layer representations of the nodes;

based on the sample feature representations learned by the graph neural network algorithm, performing detection with a local outlier factor algorithm, obtaining a small number of labeled host samples by setting a threshold, and combining the labeled host samples with the corresponding user into login paths from the user to the target host for use in the semi-supervised learning of the second stage.

6. The method of claim 1, wherein the lateral movement attack detection based on semi-supervised learning comprises:

defining two meta-paths: a meta-path from a path node to a source host node, and a meta-path from a path node to a source host node to a path node;

based on the two meta-paths, computing node-level and semantic-level attention features, performing semi-supervised learning with the cross-entropy loss function as the objective using the labeled samples obtained by the abnormal login behavior detection based on unsupervised learning, and detecting lateral movement attack behavior.

7. A lateral movement attack detection system based on a heterogeneous graph network, employing the method of any of claims 1-6, comprising:

a security log graph structuring module, used for constructing a user login graph and a source host path graph from the data set;

8. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-6.

9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-6.