CN115913616A

CN115913616A - Method and device for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery

Info

Publication number: CN115913616A
Application number: CN202211163410.9A
Authority: CN
Inventors: 杨家海; 孙晓晴; 李城龙
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2022-09-23
Filing date: 2022-09-23
Publication date: 2023-04-04

Abstract

The application provides a method and a device for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery, relating to the technical field of network security. The method comprises the following steps: acquiring log information, determining network entities according to the log information, and constructing a heterogeneous user authentication graph, wherein the heterogeneous user authentication graph comprises the network entities and the relationships between them; processing the heterogeneous user authentication graph according to a meta-path-based random walk neighbor node sampling strategy, and determining a neighbor node set; performing feature aggregation on the neighbor node set according to a meta-path attention mechanism to obtain a characterization vector of a login link; and calculating the relative reconstruction error of the characterization vector, and identifying the login link according to the relative reconstruction error. The method processes the associations between nodes with the meta-path-based random walk neighbor node sampling strategy and the attention mechanism, and completes lateral movement identification automatically according to the relative reconstruction error. No anomaly detection threshold needs to be set manually, so the method is easy to deploy in real network scenarios and improves efficiency.

Description

Method and device for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery

Technical Field

The present application relates to the field of network security technologies, and in particular to a method and an apparatus for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery.

Background

Advanced Persistent Threat (APT) attacks are characterized by complex processes, long duration, high concealment and strong destructiveness, and they seriously threaten the interests of organizations and the privacy of individual users. The APT lifecycle comprises six stages: intelligence reconnaissance and attack tool construction, attack tool delivery and initial intrusion, command and control (C&C) communication, lateral movement, network asset and data discovery, and achievement of the final attack target. Lateral movement is the key step by which an attacker penetrates deeper into the network, expands the scope of the threat and achieves the final attack target. Accurately detecting and blocking lateral movement behavior can therefore effectively defend against APT attacks and prevent major security incidents and economic losses.

Existing lateral movement detection methods based on user authentication graph models have problems in both detection accuracy and practical feasibility. First, in terms of detection effect, current methods are limited by the expressive power of homogeneous graph or bipartite graph models, omit the rich multi-source heterogeneous information among the various network entities, and do not fully mine the internal network scenario. Second, in terms of feasibility, current methods place idealized requirements on training data set construction and model deployment that are difficult to meet in practice. In particular, supervised learning methods require large-scale labeled data sets for model training, but label information is difficult to acquire in real scenarios and the samples are often imbalanced. Unsupervised learning methods need purely benign data to model normal behavior, yet in practice, because of noise and attack samples, it is difficult to ensure that all links in the current graph data are normal. In addition, these methods require a manually set threshold for anomaly identification, and the threshold greatly influences the detection effect. Setting the detection threshold manually based on expert experience is difficult in practical applications.

Disclosure of Invention

In view of the above problems, a method and a device for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery are provided.

A first aspect of the application provides a method for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery, comprising the following steps:

acquiring log information, determining a network entity according to the log information, and constructing a heterogeneous user authentication graph, wherein the heterogeneous user authentication graph comprises the network entity and a relationship between the network entities;

processing the heterogeneous user authentication graph according to a random walk neighbor node sampling strategy based on a meta-path, and determining a neighbor node set;

performing feature aggregation on the neighbor node set according to a meta-path attention mechanism to obtain a characterization vector of a login link;

and calculating the relative reconstruction error of the characterization vector, and identifying the login link according to the relative reconstruction error.

Optionally, the log information includes one or more of a user authentication event log, a file access log, a process log, and a network flow log.

Optionally, processing the heterogeneous user authentication graph according to the meta-path-based random walk neighbor node sampling strategy to determine a neighbor node set includes:

for a given heterogeneous user authentication graph G = <V, E, X, T_V, T_E> and a given meta-path, performing a random walk whose i-th step follows a transition probability constrained by the node types of the meta-path [formula not reproduced], wherein the neighbor-set term of the formula denotes the set of nodes adjacent to node v that are of the indicated node type;

and selecting the nodes whose visit counts rank within a preset range to form the neighbor node set.

Optionally, before performing feature aggregation on the neighbor node set according to the meta-path attention mechanism, the method includes:

obtaining the feature aggregation expression of the meta-path attention mechanism [formula not reproduced], wherein V_A is the set of nodes of type A, P_A is the set of symmetric meta-paths that start and end at nodes of type A, W_A, b_A and α_A are respectively a weight matrix, a bias vector and an attention coefficient, and the per-path term is the characterization vector of node v obtained via meta-path p_j.

Optionally, performing feature aggregation on the neighbor node set according to the meta-path attention mechanism to obtain a characterization vector of the login link includes:

encoding the attribute information X_e of the login link in the neighbor node set into a vector h_e through a fully connected network;

processing the neighbor node set with a graph neural network to obtain a characterization vector h_u of the user node and a characterization vector h_d of the device node;

determining the characterization vector h_A of the login link according to h_e, h_u and h_d [combining formula not reproduced].

Optionally, calculating the relative reconstruction error of the characterization vector and identifying the login link according to the relative reconstruction error includes:

inputting the characterization vector into a white decoder and a gray decoder, and acquiring the reconstruction errors of the characterization vector on the white decoder and the gray decoder;

if the reconstruction error obtained at the white decoder is greater than the reconstruction error obtained at the gray decoder, the login link is considered an abnormal login;

the login link is considered as a normal login if the reconstruction error obtained at the white decoder is not greater than the reconstruction error obtained at the gray decoder.

Optionally, the white decoder and the gray decoder are trained as follows:

the white decoder is trained on normal login link samples with a loss function [formula not reproduced] defined in terms of the normal login link sample, the white decoder D_white, and the characterization vector of the normal login link sample;

the white decoder and the gray decoder are trained on unlabeled login link samples with a loss function [formula not reproduced] defined in terms of the unlabeled login link sample, the gray decoder D_gray, and the characterization vector of the unlabeled login link sample.

A second aspect of the present application provides a lateral movement attack detection apparatus based on heterogeneous graph abnormal link discovery, including:

a building module, configured to obtain log information, determine network entities according to the log information, and construct a heterogeneous user authentication graph;

a sampling module, configured to process the heterogeneous user authentication graph according to the meta-path-based random walk neighbor node sampling strategy and determine a neighbor node set;

a processing module, configured to perform feature aggregation on the nodes in the neighbor node set according to the meta-path attention mechanism to obtain a characterization vector of the login link;

and an identification module, configured to calculate the relative reconstruction error of the characterization vector and identify the login link according to the relative reconstruction error.

In a third aspect of the present application, a computer device is provided, which comprises a memory, a processor and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the method according to any implementation of the first aspect is implemented.

In a fourth aspect of the present application, a non-transitory computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the method according to any implementation of the first aspect.

The technical solution provided by the embodiments of the disclosure brings at least the following beneficial effects:

the associations between nodes are processed with the meta-path-based random walk neighbor node sampling strategy and the attention mechanism, and lateral movement identification is completed automatically according to the relative reconstruction error; no anomaly detection threshold needs to be set manually, the solution is easy to deploy in real network scenarios, and efficiency is improved.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

Fig. 1 is a flowchart illustrating a method for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery according to an exemplary embodiment of the present application;

Fig. 2 is a block diagram illustrating a lateral movement attack detection apparatus based on heterogeneous graph abnormal link discovery according to an exemplary embodiment of the present application;

fig. 3 is a block diagram of an electronic device.

Detailed Description

Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present application and should not be construed as limiting the present application.

Fig. 1 is a flowchart illustrating a method for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery according to an exemplary embodiment of the present application. As shown in Fig. 1, the method includes:

Step 101: obtaining log information, determining network entities according to the log information, and constructing a heterogeneous user authentication graph, wherein the heterogeneous user authentication graph comprises the network entities and the relationships between them.

In the embodiment of the application, various types of log information, such as user authentication event logs, file access logs, process logs and network flow logs, are parsed, network entities are extracted, and a heterogeneous user authentication graph containing the relationships among users, devices, files and processes is constructed. The relationships among the various entities can be read from the heterogeneous user authentication graph. The relationships among the types of network entities are as follows:

Login R_L: the adjacency matrix is M_L; if user i attempts to authenticate to device j, M_L(i,j) = 1, otherwise M_L(i,j) = 0;

Operate R_O: the adjacency matrix is M_O; if user i operates using device j, M_O(i,j) = 1, otherwise M_O(i,j) = 0;

Run R_R: the adjacency matrix is M_R; if process i runs on device j, M_R(i,j) = 1, otherwise M_R(i,j) = 0;

Communicate R_Cn: the adjacency matrix is M_Cn; if there is network traffic between devices i and j, M_Cn(i,j) = 1, otherwise M_Cn(i,j) = 0;

Control R_Co: the adjacency matrix is M_Co; if user i controls process j, M_Co(i,j) = 1, otherwise M_Co(i,j) = 0;

Access R_A: the adjacency matrix is M_A; if process i accesses file j while running, M_A(i,j) = 1, otherwise M_A(i,j) = 0.
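The six relations above can be illustrated with a minimal Python sketch. It assumes the logs have already been parsed into (relation, source, target) records; the record format, the entity names and the helper names `build_adjacency` and `entry` are hypothetical and not taken from the application. The adjacency matrices are stored sparsely as sets of pairs, and treating the communication relation as symmetric is also an assumption.

```python
from collections import defaultdict

# Hypothetical, already-parsed log records: (relation, source entity, target entity).
# Relation codes follow the text: L=login, O=operate, R=run, Cn=communicate,
# Co=control, A=access.
RECORDS = [
    ("L", "user:alice", "dev:ws1"),
    ("L", "user:alice", "dev:srv1"),
    ("O", "user:alice", "dev:ws1"),
    ("R", "proc:cmd.exe", "dev:ws1"),
    ("Cn", "dev:ws1", "dev:srv1"),
    ("Co", "user:alice", "proc:cmd.exe"),
    ("A", "proc:cmd.exe", "file:report.doc"),
]


def build_adjacency(records):
    """Return {relation: set of (source, target)}, a sparse 0/1 adjacency matrix."""
    adjacency = defaultdict(set)
    for relation, src, dst in records:
        adjacency[relation].add((src, dst))
        if relation == "Cn":
            # Assumption: traffic "between" two devices is stored symmetrically.
            adjacency[relation].add((dst, src))
    return adjacency


def entry(adjacency, relation, i, j):
    """M_relation(i, j): 1 if the relation holds between i and j, otherwise 0."""
    return 1 if (i, j) in adjacency.get(relation, set()) else 0


M = build_adjacency(RECORDS)
print(entry(M, "L", "user:alice", "dev:srv1"))
print(entry(M, "L", "user:alice", "dev:ws2"))
```

A sparse representation is used here because authentication graphs are typically far from dense; a dense matrix would work equally well for small examples.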

To express the complex associations among the various types of network entities more clearly, several symmetric meta-paths that start and end at user nodes or device nodes are designed on the heterogeneous user authentication graph, as shown in the table:

The above table lists the details of the meta-paths derived from the heterogeneous user authentication graph.

Step 102: processing the heterogeneous user authentication graph according to the meta-path-based random walk neighbor node sampling strategy, and determining a neighbor node set.

To reduce the dependence on label information, the encoding stage of abnormal login link detection first processes the heterogeneous user authentication graph with a meta-path-based random walk neighbor node sampling strategy to determine a neighbor node set. Specifically, for a given heterogeneous user authentication graph G = <V, E, X, T_V, T_E> and a given meta-path, the random walk moves at each step according to a transition probability constrained by the node types of the meta-path [formula not reproduced], in which the neighbor-set term denotes the set of nodes adjacent to node v that are of the indicated node type.

After the random walks, the nodes whose visit counts rank within a preset range are selected to form the neighbor node set.

In one possible embodiment, the five most frequently visited nodes are selected to form the neighbor node set.
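The sampling strategy described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the toy graph `GRAPH`, the helper names, the walk count and the walk length are hypothetical, and choosing uniformly among the neighbors of the required type is an assumption, since the transition probability formula is not available in this text.

```python
import random
from collections import Counter

# Toy heterogeneous graph: node -> (type, neighbours). Types: U = user, D = device.
GRAPH = {
    "u1": ("U", ["d1", "d2"]),
    "u2": ("U", ["d1"]),
    "u3": ("U", ["d2"]),
    "d1": ("D", ["u1", "u2"]),
    "d2": ("D", ["u1", "u3"]),
}


def typed_neighbours(node, wanted_type):
    """Neighbours of `node` whose type equals `wanted_type`."""
    return [n for n in GRAPH[node][1] if GRAPH[n][0] == wanted_type]


def meta_path_walk(start, meta_path, length, rng):
    """Walk up to `length` steps from `start`, cycling through a symmetric meta-path."""
    assert GRAPH[start][0] == meta_path[0] == meta_path[-1]
    walk, node = [start], start
    for step in range(1, length + 1):
        # First and last types coincide, so the type pattern repeats
        # with period len(meta_path) - 1.
        wanted = meta_path[step % (len(meta_path) - 1)]
        candidates = typed_neighbours(node, wanted)
        if not candidates:  # no neighbour of the required type: the walk stops
            break
        node = rng.choice(candidates)  # assumption: uniform over matching neighbours
        walk.append(node)
    return walk


def sample_neighbours(start, meta_path, walks=200, length=4, top_k=5, seed=0):
    """Return the top_k most visited nodes (excluding `start`) over repeated walks."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(walks):
        visited = meta_path_walk(start, meta_path, length, rng)[1:]
        visits.update(n for n in visited if n != start)
    return [node for node, _ in visits.most_common(top_k)]


print(sample_neighbours("u1", ("U", "D", "U")))
```

Ranking by visit count keeps the nodes that the constrained walks reach most often, which is one way to read "access times ranking within a preset range"; `top_k=5` mirrors the embodiment above.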

Step 103: performing feature aggregation on the neighbor node set according to the meta-path attention mechanism, and obtaining a characterization vector of the login link.

In the embodiment of the application, after the neighbor node set is determined, and considering that different meta-paths play different roles in lateral movement detection, the invention uses a meta-path attention mechanism to aggregate node features. In the feature aggregation expression [formula not reproduced], V_A is the set of nodes of type A, P_A is the set of symmetric meta-paths that start and end at nodes of type A, W_A, b_A and α_A are respectively a weight matrix, a bias vector and an attention coefficient, and the per-path term is the characterization vector of node v obtained via meta-path p_j.
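One plausible reading of the meta-path attention aggregation is sketched below. Because the aggregation formula is not available in this text, the exact form is an assumption: the per-meta-path vectors of a node are mixed with softmax-normalized attention coefficients, then passed through an affine map (a weight matrix and a bias vector) and a ReLU. The function names and the raw `scores` input are hypothetical.

```python
import math


def softmax(scores):
    """Normalise raw attention scores into coefficients that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(per_path_vectors, scores, weight, bias):
    """Attention-weighted sum of per-meta-path vectors, then an affine map and ReLU.

    per_path_vectors: one vector per meta-path p_j for the same node v.
    scores:           one raw attention score per meta-path (assumed given).
    weight, bias:     play the roles of W_A and b_A for the node type.
    """
    alphas = softmax(scores)
    dim = len(per_path_vectors[0])
    mixed = [
        sum(a * vec[k] for a, vec in zip(alphas, per_path_vectors))
        for k in range(dim)
    ]
    out = [sum(w * x for w, x in zip(row, mixed)) + b for row, b in zip(weight, bias)]
    return [max(0.0, value) for value in out]


h_paths = [[1.0, 0.0], [0.0, 1.0]]       # vectors of node v under two meta-paths
identity = [[1.0, 0.0], [0.0, 1.0]]      # stand-in for a learned weight matrix
print(aggregate(h_paths, scores=[0.0, 0.0], weight=identity, bias=[0.0, 0.0]))
```

With equal scores the two meta-paths contribute equally; raising one score shifts the aggregated vector toward that meta-path, which is the effect the attention coefficient is meant to capture.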

The invention uses a fully connected network and a graph neural network to aggregate the relationships among the nodes in the neighbor node set, specifically as follows:

the attribute information X_e of the login link in the neighbor node set is encoded into a vector h_e through a fully connected network;

the neighbor node set is processed with a graph neural network to obtain the characterization vector h_u of the user node and the characterization vector h_d of the device node;

finally, the attributes of the login link are combined with the information of the user node and the device node to obtain the characterization vector h_A of the login link [combining formula not reproduced].

Step 104: calculating the relative reconstruction error of the characterization vector, and identifying the login link according to the relative reconstruction error.

In this embodiment, after the characterization vector h_A of the login link is input into the decoder part, the relative reconstruction error of the login link is calculated to identify abnormal login links.

The invention adopts a dual-decoder structure comprising a white decoder and a gray decoder, which are trained as follows:

First, the white decoder is trained using a portion of the normal login link samples as weak supervision information, with a loss function [formula not reproduced] defined in terms of the normal login link attributes and the white decoder D_white.

Then, for an unlabeled login link, if the reconstruction error of the link on the white decoder is greater than that on the gray decoder, the link is considered an abnormal login; if the reconstruction error obtained at the white decoder is not greater than the reconstruction error obtained at the gray decoder, the login link is considered a normal login.

Accordingly, the loss function used to train the white decoder and the gray decoder on unlabeled login link data [formula not reproduced] is defined in terms of the unlabeled login link sample, the gray decoder D_gray, and the characterization vector of the unlabeled login link sample.
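The comparison rule for unlabeled login links can be sketched in a few lines. The sketch assumes that each decoder maps a characterization vector back to the login link attributes and that the reconstruction error is a squared error; both points, and the two toy decoders, are assumptions made only for illustration.

```python
def reconstruction_error(decoder, h, x):
    """Squared error between the decoder's reconstruction from h and the attributes x."""
    return sum((r - t) ** 2 for r, t in zip(decoder(h), x))


def is_abnormal(h, x, white_decoder, gray_decoder):
    """Abnormal only if the white decoder reconstructs strictly worse than the gray one."""
    white_error = reconstruction_error(white_decoder, h, x)
    gray_error = reconstruction_error(gray_decoder, h, x)
    return white_error > gray_error


# Toy stand-ins for trained decoders (hypothetical, for illustration only).
def white_decoder(h):
    return [2.0 * v for v in h]


def gray_decoder(h):
    return [v + 1.0 for v in h]


h = [1.0, 2.0]
print(is_abnormal(h, [2.0, 4.0], white_decoder, gray_decoder))  # white fits exactly
print(is_abnormal(h, [2.0, 3.0], white_decoder, gray_decoder))  # gray fits exactly
```

Because the decision compares two errors on the same link rather than one error against a fixed cutoff, no detection threshold has to be chosen by hand; a tie is treated as normal, matching the "not greater than" wording above.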

In addition, after the login links are identified, normal and abnormal login links can be further distinguished by maximizing the mutual information between an abnormal link and its characterization information. The method takes the abnormal login links detected by the dual decoder as positive samples and the pre-labeled and detected normal login links as negative samples, uses a bilinear binary classifier as the discriminator, and maximizes the mutual information between the abnormal link and its characterization information based on the Jensen-Shannon divergence [formula not reproduced].

In summary, the training loss function of the detector is: Loss = Loss_normal + Loss_unlabel - λ·Loss_reg, where λ is a hyperparameter that adjusts the importance of the mutual information regularization term.
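How the three terms combine can be sketched as below. The regularization term is estimated here with a commonly used Jensen-Shannon mutual information estimator computed from discriminator scores; whether the application uses exactly this estimator is an assumption, as are the function names and the idea that the scores come from the bilinear discriminator.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def jsd_mutual_information(positive_scores, negative_scores):
    """Jensen-Shannon style mutual information estimate from discriminator scores.

    positive_scores: discriminator outputs for positive samples.
    negative_scores: discriminator outputs for negative samples.
    """
    pos = sum(-softplus(-s) for s in positive_scores) / len(positive_scores)
    neg = sum(softplus(s) for s in negative_scores) / len(negative_scores)
    return pos - neg


def total_loss(loss_normal, loss_unlabel, loss_reg, lam):
    """Normal-sample loss plus unlabeled-sample loss minus lam times the regularizer."""
    return loss_normal + loss_unlabel - lam * loss_reg


reg = jsd_mutual_information([2.0, 3.0], [-2.0, -1.0])
print(total_loss(loss_normal=0.8, loss_unlabel=1.1, loss_reg=reg, lam=0.5))
```

The regularizer enters with a minus sign because training minimizes the total loss while the mutual information estimate is to be maximized; a larger λ gives that objective more weight relative to the two reconstruction losses.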

The method and the device use a neighbor node sampling strategy based on meta-path random walks over the heterogeneous graph, together with an attention mechanism, to fully mine the internal network scenario. Through the dual-decoder structure and the mutual information regularization, lateral movement identification is completed automatically according to the relative reconstruction error. This lowers the requirement for label information in the training data set and removes the dependence on a manually set anomaly detection threshold, so the solution is easy to deploy in real network scenarios and improves efficiency.

Fig. 2 is a block diagram of a lateral movement attack detection apparatus 200 based on heterogeneous graph abnormal link discovery according to an exemplary embodiment of the present application. As shown in Fig. 2, the apparatus includes a building module 210, a sampling module 220, a processing module 230, and an identification module 240.

The building module 210 is configured to obtain log information, determine network entities according to the log information, and construct a heterogeneous user authentication graph;

the sampling module 220 is configured to process the heterogeneous user authentication graph according to the meta-path-based random walk neighbor node sampling strategy and determine a neighbor node set;

the processing module 230 is configured to perform feature aggregation on the nodes in the neighbor node set according to the meta-path attention mechanism to obtain a characterization vector of the login link;

and the identification module 240 is configured to calculate the relative reconstruction error of the characterization vector and identify the login link according to the relative reconstruction error.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Fig. 3 illustrates a schematic block diagram of an example electronic device 300 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 3, the apparatus 300 includes a computing unit 301 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 302 or a computer program loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the device 300 can also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.

Various components in device 300 are connected to I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, or the like; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 301 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 301 executes the methods and processes described above, such as the lateral movement attack detection method. For example, in some embodiments, the lateral movement attack detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 300 via the ROM 302 and/or the communication unit 309. When the computer program is loaded into the RAM 303 and executed by the computing unit 301, one or more steps of the lateral movement attack detection method described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the lateral movement attack detection method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), the Internet, and blockchain networks.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in the cloud computing service system that overcomes the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A method for detecting lateral movement attacks based on heterogeneous graph abnormal link discovery, characterized by comprising the following steps:

2. The method of claim 1, wherein the log information comprises one or more of a user authentication event log, a file access log, a process log, and a network flow log.

3. The method of claim 1, wherein processing the heterogeneous user authentication graph according to the meta-path-based random walk neighbor node sampling strategy to determine a neighbor node set comprises:

[formula not reproduced] wherein the neighbor-set term denotes the set of nodes adjacent to node v that are of the indicated node type;

and selecting the nodes whose visit counts rank within a preset range to form the neighbor node set.

4. The method of claim 1, further comprising, before the feature aggregation of the neighbor node set according to the meta-path attention mechanism:

[formula not reproduced] wherein the per-path term is the characterization vector of node v obtained via meta-path p_j.

5. The method according to claim 1, wherein performing feature aggregation on the neighbor node set according to the meta-path attention mechanism to obtain a characterization vector of the login link comprises:

encoding the attribute information X_e of the login link in the neighbor node set into a vector h_e through a fully connected network;

processing the neighbor node set with a graph neural network to obtain the characterization vector h_u of the user node and the characterization vector h_d of the device node;

6. The method of claim 1, wherein calculating the relative reconstruction error of the characterization vector and identifying the login link according to the relative reconstruction error comprises:

the login link is considered a normal login if the reconstruction error obtained at the white decoder is not greater than the reconstruction error obtained at the gray decoder.

7. The method of claim 6, wherein, for the white decoder and the gray decoder:

the loss function for normal login link samples [formula not reproduced] is defined in terms of the normal login link sample, the white decoder D_white, and the characterization vector of the normal login link sample;

the loss function for unlabeled login link samples [formula not reproduced] is defined in terms of the unlabeled login link sample, the gray decoder D_gray, and the characterization vector of the unlabeled login link sample.

8. A lateral movement attack detection device based on heterogeneous graph abnormal link discovery is characterized by comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-7 when executing the computer program.

10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-7.