CN117315331A - Dynamic graph anomaly detection method and system based on GNN and LSTM - Google Patents

Dynamic graph anomaly detection method and system based on GNN and LSTM

Info

Publication number
CN117315331A
CN117315331A (application CN202311132125.5A)
Authority
CN
China
Prior art keywords
node
lstm
dynamic graph
graph
anomaly detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311132125.5A
Other languages
Chinese (zh)
Inventor
仝永杰
苗功勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongfu Safety Technology Co Ltd
Original Assignee
Zhongfu Safety Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongfu Safety Technology Co Ltd filed Critical Zhongfu Safety Technology Co Ltd
Priority to CN202311132125.5A priority Critical patent/CN117315331A/en
Publication of CN117315331A publication Critical patent/CN117315331A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/50Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention belongs to the field of dynamic graph anomaly detection and provides a dynamic graph anomaly detection method and system based on GNN and LSTM. Converting network data into a dynamic attribute graph model faithfully reflects the evolution of node attributes and network structure in the network; the node embedding scheme fully fuses the network topology, node attributes and other information, which greatly improves the accuracy of the model; finally, the method focuses on the influence of short-term patterns on the output and reasonably combines time-series information, improving the operating efficiency of the model.

Description

Dynamic graph anomaly detection method and system based on GNN and LSTM
Technical Field
The invention belongs to the field of dynamic graph anomaly detection, and particularly relates to a dynamic graph anomaly detection method and system based on GNN and LSTM.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the increasing informatization of society, daily life has become more and more convenient, but network anomalies also occur frequently and trouble people to varying degrees. Meanwhile, advances in information technology have produced large amounts of network-structured data (such as social networks, e-commerce networks and traffic networks) in various fields, making network anomaly detection based on big data one of the important research directions in data science today.
With advances in graph processing technology, the mainstream approach to this problem is to convert the network-structured data into a graph of the appropriate type and then perform data mining and analysis with graph processing techniques. Traditional graph anomaly detection algorithms generally perform feature selection, decomposition and dimensionality reduction according to the characteristics of the data, and then classify or cluster with machine learning algorithms; deep-learning-based algorithms generally embed nodes or edges with schemes such as Node2Vec or NetWalk, and then classify or cluster with a graph neural network (GNN) or traditional methods.
The inventors found that the feature engineering involved in traditional machine learning algorithms, such as feature selection and feature matrix decomposition, often depends too heavily on human data-processing experience, and handling highly complex nonlinear data is often cumbersome. Among deep-learning-based embedding methods, Node2Vec cannot handle the temporal information of a dynamic graph, while NetWalk preserves the dynamics of the graph to a certain extent but only updates edge representations and ignores the long-term and short-term patterns of nodes and of the graph structure.
Disclosure of Invention
In order to solve at least one of the technical problems in the background art, the invention provides a dynamic graph anomaly detection method and system based on GNN and LSTM. Traffic data is converted into a dynamic graph containing the network topology and node attributes through dedicated data preprocessing; node embedding vectors are obtained through graph attention network encoding and decoding of both structure and attributes; an LSTM autoencoder is combined with an attention model; and abnormal nodes and abnormal graphs in the dynamic network are detected by taking the reconstruction error of a node as its node anomaly score and the average node anomaly score of a graph as the graph anomaly score.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first aspect of the present invention provides a dynamic graph anomaly detection method based on GNN and LSTM, comprising the steps of:
acquiring network structure data;
converting the network structure data into a dynamic graph containing network topology and node attributes;
training the dynamic graph anomaly detection model based on the dynamic graph to obtain a trained dynamic graph anomaly detection model; the training process of the dynamic graph anomaly detection model comprises the following steps:
based on the dynamic graph comprising the network topology and node attributes, encoding with the trained GNN to obtain a latent representation of the dynamic graph, performing structure decoding on the latent representation to obtain a reconstructed adjacency matrix, performing attribute decoding in two dimensions to obtain a reconstructed attribute matrix, and combining the reconstructed adjacency matrix and attribute matrix to obtain the embedding vector of each node;
based on the node embedding vectors, introducing a text attention mechanism into the LSTM self-encoding mechanism, changing the original information transfer between LSTM units and adjusting the influence of the long-term and short-term behavior patterns of a node on the output, to obtain the reconstruction vector of the node, and obtaining the anomaly score of the node by combining its embedding vector and reconstruction vector;
and judging the data to be detected by adopting a trained dynamic graph abnormality detection model to obtain an abnormality result.
A second aspect of the present invention provides a GNN and LSTM based dynamic graph anomaly detection system, comprising:
the data acquisition module is used for acquiring network structure data;
the data preprocessing module is used for converting the network structure data into a dynamic graph containing a network topology structure and node attributes;
the anomaly detection model training module is used for training the dynamic graph anomaly detection model based on the dynamic graph to obtain a trained dynamic graph anomaly detection model; the training process of the dynamic graph anomaly detection model comprises the following steps:
based on the dynamic graph comprising the network topology and node attributes, encoding with the trained GNN to obtain a latent representation of the dynamic graph, performing structure decoding on the latent representation to obtain a reconstructed adjacency matrix, performing attribute decoding in two dimensions to obtain a reconstructed attribute matrix, and combining the reconstructed adjacency matrix and attribute matrix to obtain the embedding vector of each node;
based on the node embedding vectors, introducing a text attention mechanism into the LSTM self-encoding mechanism, changing the original information transfer between LSTM units and adjusting the influence of the long-term and short-term behavior patterns of a node on the output, to obtain the reconstruction vector of the node, and obtaining the anomaly score of the node by combining its embedding vector and reconstruction vector;
the anomaly detection module is used for judging the data to be detected by adopting a trained dynamic graph anomaly detection model to obtain anomaly scores.
A third aspect of the present invention provides a computer-readable storage medium.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a GNN and LSTM based dynamic graph anomaly detection method as described in the first aspect.
A fourth aspect of the invention provides a computer device.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a GNN and LSTM based dynamic graph anomaly detection method as described in the first aspect when the program is executed.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention converts network traffic information into node attribute vectors in the dynamic network and, through the GAT, makes the embedding vectors fuse the interaction information between nodes with the network topology information.
2. in the node embedding stage, attribute decoding in the GAT self-encoding process is performed in two dimensions, fully considering the information flow in both directions, so that the model obtains more effective node representations.
3. in the anomaly detection stage, a text attention mechanism is introduced into the LSTM self-encoding mechanism, so that the model reasonably combines the long-term and short-term behavior patterns of nodes, improving the detection performance of the model.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flowchart of a method for detecting dynamic graph anomalies provided by an embodiment of the present invention;
FIG. 2 is a diagram illustrating the conversion of network traffic data into dynamic graphs according to an embodiment of the present invention;
fig. 3 is a node embedding and anomaly detection model provided by an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
In order to solve the technical problems mentioned in the background of the application, the invention converts traffic data into a dynamic graph containing the network topology and node attributes through dedicated data preprocessing, obtains node embedding vectors through graph attention network encoding and decoding of both structure and attributes, and finally combines an LSTM autoencoder with an attention model, detecting abnormal nodes and abnormal graphs in the dynamic network by taking the reconstruction error of a node as its node anomaly score and the average node anomaly score of a graph as the graph anomaly score. Converting network data into a dynamic attribute graph model more faithfully reflects the evolution of node attributes and network structure in the network; the node embedding scheme fully fuses the network topology, node attributes and other information, retaining as many features as possible and greatly improving the accuracy of the model; finally, combining the attention mechanism with the LSTM focuses on the influence of short-term patterns on the output, reasonably incorporates the temporal information, and improves the operating efficiency of the model.
Example 1
As shown in fig. 1-3, the present embodiment provides a dynamic graph anomaly detection method based on GNN and LSTM, including the following steps:
the present embodiment uses network traffic data as an example to illustrate the specific process of the technology.
Step 1: data preprocessing.
Each traffic record contains at least three fields: source IP, destination IP and timestamp. The number of nodes is counted first. If the number of nodes exceeds a set threshold (for example, 1000), all traffic records associated with each node are counted to obtain its traffic count, the nodes are sorted by traffic count, the top N nodes are selected (N less than or equal to 1000), and only the traffic data among these N nodes is kept. If the number of nodes is below the set threshold, all traffic data is kept.
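A minimal sketch of this screening step is given below; the record format (source IP, destination IP, timestamp) and the helper name screen_nodes are illustrative assumptions rather than part of the patent text.

```python
# Sketch of the node-screening step, assuming traffic records are
# (source_ip, destination_ip, timestamp) tuples.
from collections import Counter

def screen_nodes(records, max_nodes=1000):
    """Keep only traffic among the top-N nodes ranked by traffic count."""
    counts = Counter()
    for src, dst, ts in records:          # count the flows each node participates in
        counts[src] += 1
        counts[dst] += 1
    if len(counts) <= max_nodes:          # below the threshold: keep all traffic data
        return records, sorted(counts)
    top = {n for n, _ in counts.most_common(max_nodes)}     # top-N nodes by traffic count
    kept = [(s, d, t) for s, d, t in records if s in top and d in top]
    return kept, sorted(top)
```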
The filtered traffic is then sorted by timestamp, a time interval T is chosen according to the data, and, starting from the earliest time point in the data, one group of data is taken every T to build a graph. Within one group of data, the number of flows from each node, acting as the source IP, to every node acting as the destination IP is counted, so that each node forms a vector and each graph forms a matrix, which is defined as the attribute matrix. In this way all the data form a sequence of matrices, and a corresponding sequence of adjacency matrices is generated according to whether the elements of each matrix are zero. In practical applications, traffic data is generally generated continuously; whenever the difference between the current time point and the time point of the previous graph reaches T, the data between the two time points is taken to form a graph, so that the network traffic data forms a dynamic graph. The process is shown in FIG. 2.
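Continuing the sketch above, the conversion of the filtered records into attribute and adjacency matrix sequences could look as follows; the dense NumPy layout and the fixed window length T are assumptions made for clarity.

```python
# Sketch of building one attribute matrix and one adjacency matrix per time window of length T.
import numpy as np

def build_dynamic_graph(records, nodes, T):
    """Return (attribute_matrices, adjacency_matrices) for the filtered records."""
    idx = {n: i for i, n in enumerate(nodes)}
    records = sorted(records, key=lambda r: r[2])            # sort by timestamp
    start, n = records[0][2], len(nodes)
    attrs, adjs = [], []
    X = np.zeros((n, n))
    for src, dst, ts in records:
        while ts - start >= T:                               # the window is full: close the current graph
            attrs.append(X)
            adjs.append((X != 0).astype(float))              # adjacency from non-zero attribute entries
            X, start = np.zeros((n, n)), start + T
        X[idx[src], idx[dst]] += 1                           # row i counts flows from node i to each destination
    attrs.append(X)
    adjs.append((X != 0).astype(float))
    return attrs, adjs
```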
Step 2: node embedding
GNN techniques mainly include graph convolutional networks (GCN), graph attention networks (GAT), GraphSAGE and so on. Referring to FIG. 3, this embodiment implements the embedding of network nodes with a GAT self-encoder. In the dynamic graph, each graph contains its own network topology and node attributes.
In this embodiment, a graph attention network (GAT) is employed.
In this embodiment, the conventional decoding scheme of the GAT self-encoder is improved: a dual self-encoder is adopted, and information in two dimensions is considered during attribute decoding.
The method specifically comprises the following steps:
step 201: calculating an attention score α for a graph attention network i,j
Wherein the method comprises the steps ofIs the vector of the ith node of the t-th diagram after data preprocessing, +.>The vector after data preprocessing of the jth node of the t-th graph is represented by W, which is a weight matrix whose parameters are shared in all graphs. />Is the neighbor set of node i, a T Is the learning parameter tensor of the attention mechanism. />Representing the concatenation of two vectors of two nodes in the last dimension.
Step 202: fusing the nodes around node i to obtain the latent representation of node i:

$$z_i^{t}=\sigma\left(\alpha_{i,i}Wx_i^{t}+\sum_{j\in\mathcal{N}_i}\alpha_{i,j}Wx_j^{t}\right) \quad (3)$$

where $\sigma(\cdot)$ is a nonlinear activation function, $\alpha_{i,i}$ is the self-attention score of node i with itself, and $\alpha_{i,j}$ is the attention score between nodes i and j.
Step 203: combining all latent representations $z_i^{t}$ obtained from a single graph into a matrix $Z$, and decoding the latent representation $Z$ into the graph structure to obtain the reconstructed adjacency matrix $\hat{A}$:

$$\hat{A}=\mathrm{sigmoid}\left(ZZ^{T}\right) \quad (4)$$

where $Z^{T}$ denotes the transpose of $Z$.
Step 204: performing attribute decoding of the latent representation $Z$ in two dimensions to obtain the reconstructed attribute matrix $\hat{X}$:

$$Z_A=\mathrm{ReLU}\left(Z^{T}W_{A(1)}+b_1\right) \quad (5)$$

$$\hat{X}=\mathrm{ReLU}\left(Z_A^{T}W_{A(2)}+b_2\right) \quad (6)$$

where $W_{A(1)}$ and $b_1$ are the weights and bias in one dimension, and $W_{A(2)}$ and $b_2$ are the weights and bias in the other dimension; these parameters are shared across all graphs.
Step 205: constructing a loss function based on the reconstruction errors:

$$L=\lambda L_S+\gamma L_A \quad (7)$$

where $\lambda$ and $\gamma$ are balance factors and $L_S$ and $L_A$ are the structure loss function and the attribute loss function, respectively defined as:

$$L_S=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(A_{ij}-\hat{A}_{ij}\right)^{2} \quad (8)$$

$$L_A=\frac{1}{N}\sum_{i=1}^{N}\left\|X_i-\hat{X}_i\right\|_2^{2} \quad (9)$$

where $A_{ij}$ is an element of the original adjacency matrix, $\hat{A}_{ij}$ is the corresponding element of the reconstructed adjacency matrix, $X_i$ is a row of the original attribute matrix, $\hat{X}_i$ is the corresponding row of the reconstructed attribute matrix, and N is the number of nodes.
Step 206: model training. When the loss function has fallen to a certain level or changes little, the network structure and parameters of the encoding part of the model (namely Step 201 and Step 202) and the embedding vectors $z_i^{t}$ of the nodes are saved.
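A minimal PyTorch sketch of the node-embedding stage (Steps 201 to 206) under the equations reconstructed above is shown below; the layer sizes, activation choices, balance factors and the class name GATAutoencoder are illustrative assumptions, not fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATAutoencoder(nn.Module):
    """Graph-attention encoder with a structure decoder and a two-dimension attribute decoder."""
    def __init__(self, n_nodes, in_dim, hid_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, hid_dim, bias=False)        # shared weight matrix W
        self.a = nn.Parameter(torch.randn(2 * hid_dim) * 0.1)  # attention parameter a
        self.W_A1 = nn.Linear(n_nodes, n_nodes)                # attribute decoding along the node dimension
        self.W_A2 = nn.Linear(hid_dim, in_dim)                 # attribute decoding along the feature dimension

    def encode(self, X, A):
        H = self.W(X)                                          # W x_i for every node, shape (N, hid)
        a_l, a_r = self.a[:H.size(1)], self.a[H.size(1):]
        e = F.leaky_relu(H @ a_l.unsqueeze(1) + (H @ a_r.unsqueeze(1)).T)  # e_ij = a^T [W x_i || W x_j]
        mask = (A + torch.eye(A.size(0))) == 0                 # attend only to neighbours and the node itself
        alpha = torch.softmax(e.masked_fill(mask, float("-inf")), dim=1)   # attention scores alpha_ij
        return torch.sigmoid(alpha @ H)                        # fused latent representations z_i

    def forward(self, X, A):
        Z = self.encode(X, A)
        A_hat = torch.sigmoid(Z @ Z.T)                         # structure decoding: reconstructed adjacency
        Z_A = F.relu(self.W_A1(Z.T))                           # first attribute-decoding dimension (on Z^T)
        X_hat = F.relu(self.W_A2(Z_A.T))                       # second dimension: reconstructed attribute matrix
        return Z, A_hat, X_hat

def reconstruction_loss(A, A_hat, X, X_hat, lam=0.5, gamma=0.5):
    n = A.size(0)
    loss_s = ((A - A_hat) ** 2).sum() / n                      # structure loss L_S
    loss_a = ((X - X_hat) ** 2).sum() / n                      # attribute loss L_A
    return lam * loss_s + gamma * loss_a
```

A training loop would iterate over the matrix sequence from Step 1, minimize reconstruction_loss with an optimizer such as Adam, and save the encoder together with the latent vectors once the loss stops decreasing.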
Step 3: anomaly detection model construction
Referring to FIG. 3, in this embodiment an LSTM self-encoder is applied to fuse the temporal information of the nodes for anomaly detection. Unlike the conventional LSTM self-encoding mechanism, a text attention model is used to change the original information transfer between LSTM units, so that the model reasonably combines the long-term and short-term behavior patterns of a node and pays more attention to the influence of the short-term pattern on the output.
The method specifically comprises the following steps:
step 301, setting a window parameter w.
Step 302, organizing the embedding vectors of each node into a time-ordered sequence, so that they can later be fed into the LSTM units at the corresponding time steps.
Step 303, collecting the hidden vectors output by the LSTM within one window at time t as:

$$H_t=\left[h_{t-w},h_{t-w+1},\ldots,h_{t-1}\right] \quad (10)$$
step 304, attention mechanism is introduced to reconstruct hidden vector.
Wherein Q is h And r T Is a parameter of the attention model,is a short-term hidden state of the reconstruction.
Step 305, processing $z^{t}$ and $\tilde{h}_{t-1}$ with the LSTM unit to obtain the hidden state at time t:

$$h_t=\mathrm{LSTM}\left(z^{t},\tilde{h}_{t-1}\right) \quad (13)$$
step 306, decoding with LSTM. The last hidden vectorAs initial parameter of the decoder, wherein +.>Is the reconstruction vector at t-1. As can be seen from the above equation, the output at time t is used as the input at time t-1 in the decoding stage.
Step 307, constructing a loss function:

$$L=\sum_{t}\left\|z^{t}-\hat{z}^{t}\right\|_2^{2} \quad (15)$$
Step 308, training the model until the loss has fallen to a certain level or changes little, and saving the model.
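The attention-augmented LSTM autoencoder of Step 3 might be sketched as follows in PyTorch, again under the reconstructed equations above; the attention form (tanh followed by softmax), the reverse-order decoding and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionLSTMAutoencoder(nn.Module):
    """LSTM autoencoder whose encoder receives an attention-reconstructed short-term hidden state."""
    def __init__(self, emb_dim, hid_dim, window):
        super().__init__()
        self.window = window
        self.enc = nn.LSTMCell(emb_dim, hid_dim)
        self.dec = nn.LSTMCell(emb_dim, hid_dim)
        self.out = nn.Linear(hid_dim, emb_dim)                # maps a decoder state to a reconstruction
        self.Q_h = nn.Linear(hid_dim, hid_dim, bias=False)    # attention parameters Q_h and r
        self.r = nn.Parameter(torch.randn(hid_dim) * 0.1)

    def _short_term_state(self, history):
        H = torch.stack(history[-self.window:], dim=0)        # last w hidden states, shape (w, hid)
        beta = torch.softmax(torch.tanh(self.Q_h(H)) @ self.r, dim=0)
        return beta @ H                                       # reconstructed short-term hidden state

    def forward(self, z_seq):                                 # z_seq: (T, emb_dim) embeddings of one node
        h = c = torch.zeros(self.enc.hidden_size)
        history = [h]
        for z_t in z_seq:                                     # encoding: h_t = LSTM(z_t, h~_{t-1})
            h_tilde = self._short_term_state(history)
            h, c = self.enc(z_t.unsqueeze(0), (h_tilde.unsqueeze(0), c.unsqueeze(0)))
            h, c = h.squeeze(0), c.squeeze(0)
            history.append(h)
        dh, dc = h.unsqueeze(0), c.unsqueeze(0)               # the last hidden state initialises the decoder
        z_hat, recons = self.out(dh), []
        for _ in range(len(z_seq)):                           # decoding: output at time t is input at time t-1
            recons.append(z_hat.squeeze(0))
            dh, dc = self.dec(z_hat, (dh, dc))
            z_hat = self.out(dh)
        recons.reverse()                                      # put the reconstructions back in time order
        return torch.stack(recons)                            # (T, emb_dim) reconstruction vectors
```

Training would minimize the squared error between z_seq and the returned reconstructions, matching the loss of Step 307.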
Step 4: abnormal node and abnormal graph judgment
Combining the embedding vector $z_i^{t}$ of a node with its reconstruction vector $\hat{z}_i^{t}$, the resulting error is used as its anomaly score, which can be expressed as:

$$s_t(i)=\left\|z_i^{t}-\hat{z}_i^{t}\right\|_2^{2} \quad (16)$$
during training, embedded vectors and reconstructed vectors of nodes are directly appliedThe abnormal score can be calculated, and when the abnormal score is actually needed, the coding part of the model in the step 2 is stored for vector embedding, and the step 3 is utilized for reconstructing vector calculation to obtain the abnormal score. According to s t (i) Can judge the abnormality degree of the node i at the moment t and set a threshold lambda 1 It can be determined whether the change point is abnormal.
For judging an abnormal graph, the average of the anomaly scores of all nodes in the graph, $S_t=\frac{1}{N}\sum_{i=1}^{N}s_t(i)$, is computed, and a threshold $\lambda_2$ is set; whether the graph is abnormal is judged in the same way as for an abnormal node.
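A short sketch of Step 4 under the score definition above; the thresholds lambda_1 and lambda_2 are assumed to be chosen by the user, for example from a quantile of scores observed on normal data.

```python
import torch

def node_anomaly_scores(z, z_hat):
    """s_t(i) = ||z_i^t - z_hat_i^t||^2 for every node of one graph; z, z_hat: (N, emb_dim)."""
    return ((z - z_hat) ** 2).sum(dim=1)

def judge(z, z_hat, lambda_1, lambda_2):
    scores = node_anomaly_scores(z, z_hat)
    abnormal_nodes = (scores > lambda_1).nonzero(as_tuple=True)[0]   # nodes above the node threshold
    graph_is_abnormal = bool(scores.mean() > lambda_2)               # graph score = mean node score
    return abnormal_nodes, graph_is_abnormal
```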
Example two
The embodiment provides a dynamic graph anomaly detection system based on GNN and LSTM, which comprises:
the data acquisition module is used for acquiring network structure data;
the data preprocessing module is used for converting the network structure data into a dynamic graph containing a network topology structure and node attributes;
the anomaly detection model training module is used for training the dynamic graph anomaly detection model based on the dynamic graph to obtain a trained dynamic graph anomaly detection model; the training process of the dynamic graph anomaly detection model comprises the following steps:
based on the dynamic graph comprising the network topology and node attributes, encoding with the trained GNN to obtain a latent representation of the dynamic graph, performing structure decoding on the latent representation to obtain a reconstructed adjacency matrix, performing attribute decoding in two dimensions to obtain a reconstructed attribute matrix, and combining the reconstructed adjacency matrix and attribute matrix to obtain the embedding vector of each node;
based on the node embedding vectors, introducing a text attention mechanism into the LSTM self-encoding mechanism, changing the original information transfer between LSTM units and adjusting the influence of the long-term and short-term behavior patterns of a node on the output, to obtain the reconstruction vector of the node, and obtaining the anomaly score of the node by combining its embedding vector and reconstruction vector;
the anomaly detection module is used for judging the data to be detected by adopting a trained dynamic graph anomaly detection model to obtain anomaly scores.
Example III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in a GNN and LSTM based dynamic graph anomaly detection method as described above.
Example IV
The embodiment provides a computer device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the steps in the dynamic graph anomaly detection method based on the GNN and the LSTM when executing the program.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random access Memory (Random AccessMemory, RAM), or the like.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The dynamic graph anomaly detection method based on the GNN and the LSTM is characterized by comprising the following steps of:
acquiring network structure data;
converting the network structure data into a dynamic graph containing network topology and node attributes;
training the dynamic graph anomaly detection model based on the dynamic graph to obtain a trained dynamic graph anomaly detection model; the training process of the dynamic graph anomaly detection model comprises the following steps:
based on the dynamic graph comprising the network topology and node attributes, encoding with the trained GNN to obtain a latent representation of the dynamic graph, performing structure decoding on the latent representation to obtain a reconstructed adjacency matrix, performing attribute decoding in two dimensions to obtain a reconstructed attribute matrix, and combining the reconstructed adjacency matrix and attribute matrix to obtain the embedding vector of each node;
based on the node embedding vectors, introducing a text attention mechanism into the LSTM self-encoding mechanism, changing the original information transfer between LSTM units and adjusting the influence of the long-term and short-term behavior patterns of a node on the output, to obtain the reconstruction vector of the node, and obtaining the anomaly score of the node by combining its embedding vector and reconstruction vector;
and judging the data to be detected by adopting a trained dynamic graph abnormality detection model to obtain an abnormality result.
2. The GNN and LSTM based dynamic graph anomaly detection method of claim 1 wherein said converting network structure data into a dynamic graph comprising network topology and node attributes comprises:
screening nodes and screening network structure data;
sorting the screened flow data according to time stamps, selecting a time interval T according to the data condition, and taking a group of data at intervals of T from the earliest time point of the data to prepare a graph;
taking the nodes obtained by screening as the nodes in the graph, and counting, within one group of data, the number of flows from each node acting as the source IP to every node acting as the destination IP, so that each node forms a vector and each graph forms a matrix, which is defined as the attribute matrix;
all the data form a group of matrix sequences, and a corresponding group of adjacency matrix sequences is generated according to whether the elements in the matrices are zero; whenever the difference between the current time point and the time point of the previous graph reaches T, the data between the two time points is taken to form a graph, so that the network traffic data forms a dynamic graph.
3. The method for detecting dynamic graph anomalies based on GNN and LSTM as claimed in claim 2, wherein said screening nodes and screening network structure data specifically comprises:
counting the number of nodes; if the number of nodes exceeds a set number threshold, counting all traffic information related to each node to obtain its traffic count, sorting the nodes by traffic count, selecting the top N nodes, and keeping only the traffic data among the N nodes; if the number of nodes is below the number threshold, keeping all traffic data.
4. The dynamic graph anomaly detection method based on GNN and LSTM according to claim 1, wherein encoding with the trained GNN to obtain the latent representation of the dynamic graph specifically comprises:
obtaining the attention scores of the graph attention network based on the t-th graph;
fusing the nodes around node i of the t-th graph to obtain the latent representation of node i;
combining all latent representations obtained in the graph into a matrix as the latent representation of the dynamic graph.
5. The dynamic graph anomaly detection method based on GNN and LSTM as claimed in claim 4, wherein the loss function for GNN training is:

$$L=\lambda L_S+\gamma L_A,\qquad L_S=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(A_{ij}-\hat{A}_{ij}\right)^{2},\qquad L_A=\frac{1}{N}\sum_{i=1}^{N}\left\|X_i-\hat{X}_i\right\|_2^{2}$$

where $\lambda$ and $\gamma$ are balance factors, $L_S$ and $L_A$ are the structure loss function and the attribute loss function, $A_{ij}$ is an element of the original adjacency matrix, $\hat{A}_{ij}$ is the corresponding element of the reconstructed adjacency matrix, $X_i$ is a row of the original attribute matrix, $\hat{X}_i$ is the corresponding row of the reconstructed attribute matrix, and N is the number of nodes.
6. The dynamic graph anomaly detection method based on GNN and LSTM according to claim 1, wherein changing the original information transfer between LSTM units with a text attention model and focusing on the influence of the short-term behavior pattern on the output to obtain the reconstruction vector of a node specifically comprises:
setting a window parameter, and arranging the embedding vectors of a node into a sequence to be input to the LSTM units at the corresponding moments;
at time t, after collecting the hidden vectors output by the LSTM units within one window, introducing an attention mechanism and reconstructing the hidden vectors to obtain a reconstructed short-term hidden state;
processing the node sequence and the reconstructed short-term hidden state by adopting an LSTM unit to obtain a hidden state at the moment t;
and decoding by using the LSTM, taking the last hidden vector as an initial parameter of the decoder, and taking the output of the moment t as the input of the moment t-1 to obtain a reconstruction vector of the moment t.
7. The dynamic graph anomaly detection method based on GNN and LSTM according to claim 1, wherein whether node i is abnormal is judged according to the magnitude of its anomaly score by comparing it with a set threshold: if the reconstruction error obtained by combining the embedding vector and the reconstruction vector is greater than the set reconstruction error threshold, node i is abnormal.
8. A dynamic graph anomaly detection system based on GNN and LSTM, comprising:
the data acquisition module is used for acquiring network structure data;
the data preprocessing module is used for converting the network structure data into a dynamic graph containing a network topology structure and node attributes;
the anomaly detection model training module is used for training the dynamic graph anomaly detection model based on the dynamic graph to obtain a trained dynamic graph anomaly detection model; the training process of the dynamic graph anomaly detection model comprises the following steps:
based on the dynamic graph comprising the network topology and node attributes, encoding with the trained GNN to obtain a latent representation of the dynamic graph, performing structure decoding on the latent representation to obtain a reconstructed adjacency matrix, performing attribute decoding in two dimensions to obtain a reconstructed attribute matrix, and combining the reconstructed adjacency matrix and attribute matrix to obtain the embedding vector of each node;
based on the node embedding vectors, introducing a text attention mechanism into the LSTM self-encoding mechanism, changing the original information transfer between LSTM units and adjusting the influence of the long-term and short-term behavior patterns of a node on the output, to obtain the reconstruction vector of the node, and obtaining the anomaly score of the node by combining its embedding vector and reconstruction vector;
the anomaly detection module is used for judging the data to be detected by adopting a trained dynamic graph anomaly detection model to obtain anomaly scores.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of a GNN and LSTM based dynamic graph anomaly detection method as claimed in any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of a GNN and LSTM based dynamic graph anomaly detection method as claimed in any one of claims 1 to 7 when the program is executed.
CN202311132125.5A 2023-09-04 2023-09-04 Dynamic graph anomaly detection method and system based on GNN and LSTM Pending CN117315331A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311132125.5A CN117315331A (en) 2023-09-04 2023-09-04 Dynamic graph anomaly detection method and system based on GNN and LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311132125.5A CN117315331A (en) 2023-09-04 2023-09-04 Dynamic graph anomaly detection method and system based on GNN and LSTM

Publications (1)

Publication Number Publication Date
CN117315331A true CN117315331A (en) 2023-12-29

Family

ID=89280214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311132125.5A Pending CN117315331A (en) 2023-09-04 2023-09-04 Dynamic graph anomaly detection method and system based on GNN and LSTM

Country Status (1)

Country Link
CN (1) CN117315331A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117555489A (en) * 2024-01-11 2024-02-13 烟台大学 Internet of things data storage transaction anomaly detection method, system, equipment and medium
CN117765737A (en) * 2024-02-21 2024-03-26 天津大学 Traffic abnormality detection method, device, apparatus, medium, and program product
CN117765737B (en) * 2024-02-21 2024-05-14 天津大学 Traffic abnormality detection method, device, apparatus, medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination