CN114443909A

CN114443909A - Dynamic graph anomaly detection method based on community structure

Info

Publication number: CN114443909A
Application number: CN202210019006.8A
Authority: CN
Inventors: 方凯彬; 李俊杰; 包先雨; 蔡伊娜; 林伟钦; 王歆
Original assignee: Shenzhen University; Shenzhen Academy of Inspection and Quarantine; Shenzhen Customs Information Center
Current assignee: Shenzhen University; Shenzhen Academy of Inspection and Quarantine; Shenzhen Customs Information Center
Priority date: 2022-01-10
Filing date: 2022-01-10
Publication date: 2022-05-06
Also published as: WO2023130728A1

Abstract

The invention discloses a dynamic graph anomaly detection method based on community structure, comprising the following steps: S1: formally define the problem of anomalous edge detection in a dynamic graph; S2: CmaGraph consists of a C-Block, an M-Block and an A-Block; the C-Block detects the evolving communities of the dynamic graph, and the M-Block reconstructs the distances between vertices within and between communities, so that the embeddings of vertices in the same community are close to each other and the embeddings of vertices in different communities are far apart; S3: the vertex embeddings are finally input to the A-Block for anomaly detection. The method detects anomalous data effectively and demonstrates the usefulness of community structure for anomaly detection. It fills the research gap in using community structure for anomaly detection within graph-embedding-based methods.

Description

Dynamic graph anomaly detection method based on community structure

Technical Field

The invention belongs to the technical field of dynamic graph anomaly detection, and particularly relates to a dynamic graph anomaly detection method based on community structure.

Background

Dynamic graph anomaly detection is an important research direction in the graph field. Anomalies in a dynamic graph include vertex anomalies, edge anomalies and subgraph anomalies. Many applications of dynamic graphs use edges to represent complex topology and temporal characteristics, so the detection of anomalous edges is a key part of dynamic graph anomaly detection. Anomalous edge detection in dynamic graphs is widely applied, for example in intrusion detection systems and social network fraud detection. By mining the anomalies in a dynamic graph, some security incidents can be avoided and economic losses avoided or reduced.

A graph embedding model maps the vertices, edges or subgraphs of a graph into a new vector space. In the new vector space, the embeddings can express different properties depending on the method, and they are learned without manual intervention. On large, complex graphs, graph embedding models perform better than traditional heuristic methods. Because of this, many studies extract features of a graph with a graph embedding method and perform anomaly detection on the extracted features. The present method is likewise based on graph embedding: it uses community structure to extract features of the dynamic graph and performs anomaly detection on the extracted features.

NetWalk is one of the classic and commonly used graph-embedding-based algorithms for dynamic graph anomaly detection. NetWalk dynamically updates the network representation as the dynamic graph is updated and uses the updated representation for anomaly detection. NetWalk first encodes the vertices of the dynamic graph as vectors with an autoencoder using clique embedding; it minimizes the distances between the embeddings of the vertices in each random walk and uses the reconstruction error of the autoencoder as a global regularization term. After the vertex embeddings are learned, a clustering-based technique is used to detect network anomalies incrementally and dynamically.

NetWalk is an important study in dynamic graph anomaly detection, but it does not address how to use community structure for anomaly detection when the dynamic graph has a clearly partitioned community structure.

Existing graph embedding methods do not consider anomaly detection based on community structure. In general, a community is defined as a set of vertices that share similar relationships, and these relationships differ from the other relationships in the network. In real life, many anomalies occur between communities. For example, in a computer network, a terminal often exchanges data with a fixed set of terminals, forming a data exchange relationship with them. The terminal and the set of terminals with which it frequently exchanges data can be considered to belong to the same community. However, when a hacker compromises the terminal and uses it as a springboard, it sends a large amount of abnormal data into the network to attack other terminals. This manifests as the node suddenly sending a large amount of data to nodes both inside and outside the community, whereas previously it normally sent data only to nodes inside the community. Such behavior can be considered anomalous. Therefore, a dynamic graph anomaly detection method based on community structure is proposed to solve the problems described above.

Disclosure of Invention

The invention aims to provide a dynamic graph anomaly detection method based on community structure, so as to solve the problems described in the background.

To achieve this purpose, the invention provides the following technical solution: a dynamic graph anomaly detection method based on community structure, comprising the following steps:

S1: first, formally define the problem of anomalous edge detection in a dynamic graph:

a dynamic graph is a sequence of graphs, where $G^t = (V^t, E^t)$ denotes the graph at timestamp $t$; as the graph is updated, the updated edge set is denoted by $E^t$, and the set of all vertices in $E^t$ is denoted by $V^t$;

$n = |V^t|$ and $m^t = |E^t|$; at timestamp $t$, $A^t$ denotes the adjacency matrix of $G^t$; given $G^t$, the goal of anomalous edge detection in the dynamic graph is to find the anomalous edges in $E^t$;

S2: CmaGraph consists of a C-Block, an M-Block and an A-Block; the C-Block detects the evolving communities of the dynamic graph, and the M-Block reconstructs the distances between vertices within and between communities, so that the embeddings of vertices in the same community are close to each other and the embeddings of vertices in different communities are far apart;

S3: the vertex embeddings are finally input to the A-Block for anomaly detection.

The goal of the C-Block is to detect evolving communities, that is, communities on the dynamic graph that change over time;

the adjacency matrix is used as the input of an autoencoder to obtain initial vertex embeddings, and k-means is applied to the vertex embeddings for community detection;

the autoencoder is a sparse evolutionary autoencoder (SeAutoencoder), which yields stable vertex embeddings, so that k-means yields stable community labels;

formally, at timestamp $t$, the adjacency matrix $A^t$ of $G^t$ is obtained, and the number of communities $k$, a hyperparameter, is set;

the vertex embeddings are constructed by the SeAutoencoder, a fully connected network with $l_s$ layers, whose forward propagation formula is as follows:

where $l = 1, \dots, l_s - 1$, the formula's two parameters are the weight matrix and the bias vector of the $l$-th layer of the SeAutoencoder, and $\sigma$ is the sigmoid function;

k-means is applied to $H^t$ to obtain the community label vector $c^t$, which holds the community label of each vertex; here $H^t \in \mathbb{R}^{n \times d}$, $d$ is the dimension of the vertex embedding, and $c^t \in \mathbb{R}^n$;

The reconstruction loss function of the SeAutoencoder is:

where $F$ denotes the Frobenius norm; a sparsity constraint is introduced, and the sparsity penalty term of the SeAutoencoder neurons is defined by the Kullback-Leibler divergence:

where $\rho$ is the sparsity parameter and the remaining symbol denotes the average activation of the $j$-th neuron in layer $l$;

a temporal loss $J_T$ is introduced between $H^t$ and $H^{t-1}$,

when $t = 1$, $J_T = 0$;

for the SeAutoencoder with $l_s$ layers, the loss function is:

where $\beta$ and $\lambda$ control the weights of the sparsity constraint and the temporal loss, respectively.
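The formulas referred to in this passage are not reproduced in this text. As a hedged sketch only, one set of forms consistent with the surrounding definitions is shown here. The layer superscripts, the symbols $W^{(l)}$, $b^{(l)}$, $\hat{A}^t$, $\hat{\rho}_j^{(l)}$, $J_R$ and $J_S$, the choice of $A^t$ as the first-layer input, and the squared Frobenius form of $J_T$ are all assumptions rather than the source's own notation.

```latex
\begin{align}
H^{t,(l+1)} &= \sigma\!\left(H^{t,(l)} W^{(l)} + b^{(l)}\right),
  \qquad l = 1, \dots, l_s - 1, \qquad H^{t,(1)} = A^t \\
J_R &= \left\lVert \hat{A}^t - A^t \right\rVert_F^2 \\
J_S &= \sum_{l} \sum_{j} \left[ \rho \log\frac{\rho}{\hat{\rho}_j^{(l)}}
      + (1 - \rho) \log\frac{1 - \rho}{1 - \hat{\rho}_j^{(l)}} \right] \\
J_T &= \begin{cases}
         \left\lVert H^t - H^{t-1} \right\rVert_F^2, & t > 1 \\
         0, & t = 1
       \end{cases} \\
J_{\mathrm{SeAutoencoder}} &= J_R + \beta J_S + \lambda J_T
\end{align}
```

Here $\hat{A}^t$ stands for the autoencoder's reconstruction of $A^t$ and $\hat{\rho}_j^{(l)}$ for the average activation of neuron $j$ in layer $l$.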

The goal of the M-Block is to reconstruct the distances between vertices within and between communities, so that vertices in the same community are close to each other in Euclidean distance and vertices in different communities are far apart;

the vertex embeddings and the community label vector are the inputs of the M-Block, and its output is the community-metric-enhanced vertex embeddings; the M-Block uses the community metric enhancement network (CenNet), a Siamese network, to enhance the vertex embeddings; CenNet is a deep metric learning method that reconstructs the distances between vertices in the evolving communities;

formally, at timestamp $t$, $H^t$ and $c^t$ are obtained from the C-Block; CenNet is a fully connected network with $l_c$ layers, each layer having $d$ neurons, whose forward propagation formula is:

where $l = 1, \dots, l_c - 1$, and the formula's two parameters are the weight matrix and the bias vector of the $l$-th layer of CenNet; the loss function of CenNet is the contrastive loss function:

where the distance term denotes the Euclidean distance between samples $i$ and $j$, each sample being a row of the matrix $O^t$; if samples $i$ and $j$ are in the same community, $y_{ij} = 1$, otherwise $y_{ij} = 0$; $b$ is the margin; when the data set is too large, the complexity of computing $J_{CenNet}$ is high, so for a given sample $i$ the index $j$ is obtained by negative sampling, which reduces the complexity; that is, some random samples $j$ are drawn from the data set to approximate $J_{CenNet}$.
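The contrastive loss formula itself is not reproduced in this text. A commonly used form that matches the symbols defined above ($y_{ij}$, the margin $b$, rows of $O^t$) is sketched here; the symbols $D_{ij}$ and $o_i^t$, and the absence of a constant factor such as $\tfrac{1}{2}$, are assumptions.

```latex
\begin{align}
D_{ij} &= \left\lVert o_i^t - o_j^t \right\rVert_2 \\
J_{CenNet} &= \sum_{i} \sum_{j}
  \left[ y_{ij}\, D_{ij}^2 + (1 - y_{ij}) \max\!\left(0,\; b - D_{ij}\right)^2 \right]
\end{align}
```

Under negative sampling, the inner sum over $j$ would run only over a small random subset of samples for each $i$.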

The goal of the A-Block is to obtain the anomaly scores of all edges in $E^t$; in CmaGraph, OC-NN is used as the anomaly detector;

the A-Block applies an edge encoder to $O^t$ to obtain edge embeddings; for vertex $u$ and vertex $v$, it uses

which makes use of the embedding distance information; the edge embeddings are then input into a One-Class neural network (OCNN), which is an anomaly detection model;

formally, at timestamp $t$, $O^t$ from the M-Block and $E^t$ are obtained; the edge encoder $\phi$ converts $O^t$ and $E^t$ into the edge embeddings $P^t$; the OCNN is a fully connected network with $l_a$ layers, in which each hidden layer has $d$ neurons and the output layer has only 1 neuron, representing the anomaly score; the forward propagation formula of the neural network is as follows:

where $l = 1, \dots, l_a - 2$, and the last layer does not apply an activation function, i.e.

where the remaining symbols are the weight matrix and the bias vector of the $l$-th layer of the OCNN and the anomaly score vector, respectively;

the loss function of the OCNN is:

where $r$ is the bias of the hyperplane and $\nu$ controls the number of data points allowed to cross the hyperplane, $\nu$ being equivalent to the percentage of anomalies; finally $s^t$ is obtained, and the anomalous edges can be classified by setting a threshold;

formally, at timestamp $t$, the updated edge set $E^t$ is obtained; the adjacency matrix $A^t$ is updated according to $E^t$ and used as the input of CmaGraph; the SeAutoencoder, CenNet and OCNN are then trained with the learning rate $\alpha$ and the previous weights.
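The OCNN loss formula is likewise not reproduced in this text. For orientation only, the objective of the published one-class neural network (OC-NN) model, written with the symbols $r$, $\nu$, $s^t$ and $m^t$ used above, would read as follows; the per-edge score $s_i^t$ and the Frobenius-norm regularizer over all layer weights $W^{(l)}$ are assumptions about how that objective carries over to this multi-layer setting.

```latex
\begin{equation}
J_{OCNN} = \frac{1}{2} \sum_{l} \left\lVert W^{(l)} \right\rVert_F^2
  + \frac{1}{\nu\, m^t} \sum_{i=1}^{m^t} \max\!\left(0,\; r - s_i^t\right) - r
\end{equation}
```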

Compared with the prior art, the invention has the following beneficial effects: the invention provides a dynamic graph anomaly detection method based on community structure, which uses a sparse evolutionary autoencoder and k-means to detect the evolving community structure of a dynamic graph, and uses a Siamese network to reconstruct the distances between vertices within and between communities, so that vertices in the same community are close to each other in Euclidean distance and vertices in different communities are far apart. Vertex embeddings are learned from the community structure and used for anomaly detection. The method detects anomalous data effectively and demonstrates the usefulness of community structure for anomaly detection. It fills the research gap in using community structure for anomaly detection within graph-embedding-based methods.

Drawings

FIG. 1 is a schematic flow chart of the dynamic graph anomaly detection method based on community structure according to the present invention;

FIG. 2 is a schematic diagram of (a) the input graph $G^{t-1}$ of the present invention;

FIG. 3 is a schematic diagram of (b) the visualized output of $G^{t-1}$ at the C-Block;

FIG. 4 is a schematic diagram of (c) the input graph $G^t$ of the present invention;

FIG. 5 is a schematic diagram of (d) the visualized output of $G^t$ at the C-Block;

FIG. 6 is a schematic diagram of (a) the visualized input of $G^{t-1}$ at the M-Block;

FIG. 7 is a schematic diagram of (b) the visualized output of $G^{t-1}$ at the M-Block.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

CmaGraph detects anomalous edges in a dynamic graph based on community structure. It detects the evolving communities of the dynamic graph and uses a Siamese network to reconstruct the distances between vertices within and between communities, so that vertices in the same community are close to each other in Euclidean distance and vertices in different communities are far apart. The distances between vertices implicitly preserve the intra-community and inter-community distance relationships. These distances are used to encode the edges, and the resulting edge embeddings are input into an existing anomalous edge detection algorithm.

Anomaly detection is the identification of events or observations in data that do not match the expected pattern. A dynamic graph is a sequence of graphs that changes over time. Dynamic graph anomaly detection is the identification of anomalous data in a dynamic graph. The invention provides a dynamic graph anomaly detection method based on community structure; the technical problem to be solved is how to make full use of the community structure of a dynamic graph to mine anomalous data when that community structure is clearly partitioned. The ultimate goal of the invention is, given a dynamic graph, to mine the anomalous data in it.

The purpose of the invention is as follows:

to address the research gap in using community structure for anomaly detection within graph-embedding-based methods, the invention provides CmaGraph, a dynamic graph anomaly detection method based on community structure.

The invention provides a dynamic graph anomaly detection method based on community structure, as shown in FIGS. 1-7, which comprises the following steps:

First, the problem of anomalous edge detection in a dynamic graph is formally defined:

A dynamic graph is a sequence of graphs, where $G^t = (V^t, E^t)$ denotes the graph at timestamp $t$. As the graph is updated, the updated edge set is denoted by $E^t$, and the set of all vertices in $E^t$ is denoted by $V^t$.

$n = |V^t|$ and $m^t = |E^t|$. At timestamp $t$, $A^t$ denotes the adjacency matrix of $G^t$. Given $G^t$, the goal of anomalous edge detection in the dynamic graph is to find the anomalous edges in $E^t$.
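As a minimal sketch of these definitions, the snippet below builds an adjacency matrix $A^t$ from an edge set $E^t$. It assumes an undirected, unweighted graph with vertices numbered `0..n-1`; the source does not say whether the graphs are directed or weighted, and the function name `adjacency_matrix` is hypothetical.

```python
def adjacency_matrix(n, edges):
    """Build a symmetric 0/1 adjacency matrix for n vertices.

    Assumption: the graph is undirected and unweighted.
    """
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = 1
        a[v][u] = 1
    return a


# Edge set E^t of a small path graph on n = 4 vertices.
edges_t = [(0, 1), (1, 2), (2, 3)]
A_t = adjacency_matrix(4, edges_t)
m_t = len(edges_t)  # m^t = |E^t|
```

Each row of `A_t` describes one vertex's neighbourhood, which is what the C-Block later takes as its input features.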

CmaGraph consists of three blocks, as shown in FIG. 1: the C-Block, the M-Block and the A-Block.

FIG. 1 is the CmaGraph flow diagram, in which (a) is the dynamic graph, (b) is the adjacency matrix, (c) is the C-Block, the evolving community detection block of the dynamic graph, (d) is the M-Block, the embedding distance reconstruction block, and (e) is the A-Block, the anomaly detection block.

The C-Block detects the evolving communities of the dynamic graph. The M-Block reconstructs the distances between vertices within and between communities, so that the embeddings of vertices in the same community are close to each other and the embeddings of vertices in different communities are far apart. Finally, the vertex embeddings are input to the A-Block for anomaly detection.

(I) The goal of the C-Block is to detect evolving communities, that is, communities on a dynamic graph that change over time. For example, the employees of an enterprise may form a community. Over time there is staff turnover: new employees join and others leave. The community still exists; only its structure has changed. In a dynamic graph, therefore, both the vertices in a community and the structure of the community change with time. In many real dynamic graphs the community structure changes over time, but not drastically. It is therefore first necessary to detect the evolving communities of the dynamic graph, under the constraint that these communities do not change too drastically.

Method: the adjacency matrix is used as the input of an autoencoder to obtain initial vertex embeddings, and k-means is applied to the vertex embeddings for community detection. In particular, the invention uses a sparse evolutionary autoencoder (SeAutoencoder), which yields stable vertex embeddings, so that k-means yields stable community labels. The input and output of the C-Block on a synthetic dynamic graph, together with the corresponding visualizations, are shown in FIGS. 2-5, which show that as the graph is updated the embeddings move only a small distance and the communities change little.

FIGS. 2-5 together show the inputs and outputs of the C-Block. FIG. 2 is (a) the input graph $G^{t-1}$; FIG. 3 is (b) the visualized output of $G^{t-1}$ at the C-Block; FIG. 4 is (c) the input graph $G^t$; FIG. 5 is (d) the visualized output of $G^t$ at the C-Block. In (b) and (d), c1 and c2 denote different communities, and the arrows in (d) indicate the direction in which the embeddings are updated relative to (b).
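The C-Block idea (embed each vertex from its adjacency row, then cluster the embeddings with k-means) can be sketched in a few lines. This is not the SeAutoencoder: there is no training, and the weights are hand-picked for illustration. The helper names, the self-loops added to the adjacency matrix, and the farthest-point initialization of k-means are assumptions made to keep the example small and deterministic.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense_sigmoid(h, w, b):
    """One fully connected layer, sigmoid(h @ w + b), computed row by row."""
    return [[sigmoid(sum(x * w[i][j] for i, x in enumerate(row)) + b[j])
             for j in range(len(b))]
            for row in h]


def dist2(p, q):
    return sum((a - c) ** 2 for a, c in zip(p, q))


def kmeans(points, k, iters=10):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Two triangles {0,1,2} and {3,4,5}; self-loops are added so that rows
# within one triangle are identical (an assumption for this toy example).
A_t = [[1 if (i < 3) == (j < 3) else 0 for j in range(6)] for i in range(6)]
# Hand-picked, untrained weights (illustrative only): 6 inputs -> d = 2.
W = [[1.0, -1.0]] * 3 + [[-1.0, 1.0]] * 3
b = [0.0, 0.0]
H_t = dense_sigmoid(A_t, W, b)   # vertex embeddings, shape n x d
c_t = kmeans(H_t, k=2)           # community label per vertex
```

In the method described here, the encoder weights would instead be learned by minimizing the SeAutoencoder loss, and `k` is the hyperparameter giving the number of communities.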

Formally, at timestamp $t$, the adjacency matrix $A^t$ of $G^t$ is obtained, and the number of communities $k$, a hyperparameter, is set. The invention constructs the vertex embeddings with the SeAutoencoder, a fully connected network with $l_s$ layers, whose forward propagation formula is as follows:

where $l = 1, \dots, l_s - 1$, the formula's two parameters are the weight matrix and the bias vector of the $l$-th layer of the SeAutoencoder, and $\sigma$ is the sigmoid function. k-means is applied to $H^t$ to obtain the community label vector $c^t$, which holds the community label of each vertex. Here $H^t \in \mathbb{R}^{n \times d}$, $d$ is the dimension of the vertex embedding, and $c^t \in \mathbb{R}^n$. The reconstruction loss function of the SeAutoencoder is:

where $F$ denotes the Frobenius norm. A sparsity constraint is introduced, and the sparsity penalty term of the SeAutoencoder neurons is defined by the Kullback-Leibler divergence:

where $\rho$ is the sparsity parameter and the remaining symbol denotes the average activation of the $j$-th neuron in layer $l$.

When the graph is updated, the vertex embeddings and community labels must not change too drastically. A temporal loss $J_T$ is therefore introduced between $H^t$ and $H^{t-1}$:

When $t = 1$, $J_T = 0$. For the SeAutoencoder with $l_s$ layers, the loss function is:

where $\beta$ and $\lambda$ control the weights of the sparsity constraint and the temporal loss, respectively.
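The three loss terms described above can be sketched numerically as follows. The exact formulas are not reproduced in this text, so the forms used here are assumptions: a squared Frobenius norm for the reconstruction and temporal terms, and the standard sparse-autoencoder KL penalty applied to the embedding layer only. All function names are hypothetical.

```python
import math


def frobenius_sq(x, y):
    """Squared Frobenius norm of the difference of two equal-shaped matrices."""
    return sum((a - b) ** 2 for rx, ry in zip(x, y) for a, b in zip(rx, ry))


def kl_sparsity(rho, hidden):
    """Sum over neurons j of KL(rho || rho_hat_j), where rho_hat_j is the
    average activation of neuron j over all vertices (activations in (0, 1))."""
    n = len(hidden)
    total = 0.0
    for col in zip(*hidden):
        rho_hat = sum(col) / n
        total += (rho * math.log(rho / rho_hat)
                  + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return total


def se_loss(a, a_hat, h_t, h_prev, rho, beta, lam):
    """Reconstruction + beta * sparsity + lam * temporal (assumed forms)."""
    j_r = frobenius_sq(a, a_hat)
    j_s = kl_sparsity(rho, h_t)
    # The temporal term is zero at the first timestamp (no previous embedding).
    j_t = 0.0 if h_prev is None else frobenius_sq(h_t, h_prev)
    return j_r + beta * j_s + lam * j_t


A = [[0, 1], [1, 0]]
H_now = [[0.2, 0.2], [0.2, 0.2]]
H_before = [[0.2, 0.2], [0.2, 0.7]]
first = se_loss(A, A, H_now, None, rho=0.2, beta=1.0, lam=2.0)
later = se_loss(A, A, H_now, H_before, rho=0.2, beta=1.0, lam=2.0)
```

With a perfect reconstruction and average activations equal to `rho`, only the temporal term contributes, which makes the role of `lam` easy to see: a larger value penalizes embeddings that drift between consecutive timestamps.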

(II) The goal of the M-Block is to reconstruct the distances between vertices within and between communities, so that vertices in the same community are close to each other in Euclidean distance and vertices in different communities are far apart. The vertex embeddings and the community label vector are the inputs of the M-Block, and its output is the community-metric-enhanced vertex embeddings. The M-Block uses the community metric enhancement network (CenNet), a Siamese network, to enhance the vertex embeddings; CenNet is a deep metric learning method that reconstructs the distances between vertices in the evolving communities. As shown in FIGS. 6-7, the enhanced vertex embeddings are more informative than the original ones, because the Euclidean distances between vertices implicitly preserve the intra-community and inter-community distance information.

FIGS. 6-7 together show the input and output of the M-Block for $G^{t-1}$. FIG. 6 is (a) the visualized input of $G^{t-1}$ at the M-Block; FIG. 7 is (b) the visualized output of $G^{t-1}$ at the M-Block.

Formally, at timestamp $t$, $H^t$ and $c^t$ are obtained from the C-Block. The invention uses CenNet, a fully connected network with $l_c$ layers, each layer having $d$ neurons, whose forward propagation formula is:

where $l = 1, \dots, l_c - 1$, and the formula's two parameters are the weight matrix and the bias vector of the $l$-th layer of CenNet. The loss function of CenNet is the contrastive loss function:

where the distance term denotes the Euclidean distance between samples $i$ and $j$, each sample being a row of the matrix $O^t$. If samples $i$ and $j$ are in the same community, $y_{ij} = 1$; otherwise $y_{ij} = 0$. $b$ is the margin. When the data set is too large, the complexity of computing $J_{CenNet}$ is high, so for a given sample $i$ the index $j$ is obtained by negative sampling, which reduces the complexity. That is, some random samples $j$ are drawn from the data set to approximate $J_{CenNet}$.
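A small sketch of a contrastive loss with sampled partners illustrates the description above. The formula is not reproduced in this text, so the common form (squared distance for same-community pairs, squared hinge on the margin for different-community pairs, no constant factor) is an assumption, as are the function names and the uniform sampling of partner indices.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def contrastive_pair(o_i, o_j, same_community, margin):
    """Pull same-community pairs together; push other pairs past the margin."""
    d = euclidean(o_i, o_j)
    if same_community:
        return d ** 2
    return max(0.0, margin - d) ** 2


def sampled_contrastive_loss(o, labels, margin, num_samples, seed=0):
    """Approximate the full pairwise loss: for each sample i, draw a few
    random partners j instead of summing over every j."""
    rng = random.Random(seed)
    n = len(o)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j in rng.sample(others, min(num_samples, len(others))):
            total += contrastive_pair(o[i], o[j], labels[i] == labels[j], margin)
    return total


# Embeddings that are already well separated by community.
O_t = [[0.0, 0.0], [0.0, 0.0], [10.0, 10.0], [10.0, 10.0]]
c_t = [0, 0, 1, 1]
loss = sampled_contrastive_loss(O_t, c_t, margin=1.0, num_samples=3)
```

Sampling `num_samples` partners per vertex brings the cost down from quadratic in the number of vertices to roughly linear, which is the motivation the text gives for negative sampling.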

(III) The goal of the A-Block is to obtain the anomaly scores of all edges in $E^t$. In CmaGraph, the invention uses OC-NN as the anomaly detector.

As shown in FIGS. 2-5, the A-Block applies an edge encoder to $O^t$ to obtain edge embeddings. For vertex $u$ and vertex $v$, the invention uses

which makes better use of the embedding distance information. The edge embeddings are then input into a One-Class neural network (OCNN), which is an anomaly detection model.
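The edge encoder's operator is not reproduced in this text. Purely as an illustration of a distance-preserving choice, the sketch below encodes an edge $(u, v)$ as the element-wise squared difference of the two vertex embeddings; this operator and the name `edge_embeddings` are assumptions, not the encoder specified by the source.

```python
def edge_embeddings(o, edges):
    """Encode each edge (u, v) as the element-wise squared difference of the
    vertex embeddings o[u] and o[v] (an assumed, illustrative operator)."""
    return [[(a - b) ** 2 for a, b in zip(o[u], o[v])] for u, v in edges]


O_t = [[0.0, 1.0], [0.0, 1.0], [3.0, 5.0]]
E_t = [(0, 1), (0, 2)]
P_t = edge_embeddings(O_t, E_t)
```

With an operator of this kind, the components of an edge embedding sum to the squared Euclidean distance between its endpoints, so edges inside a community map near the origin while edges that cross communities map far from it.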

Formally, at timestamp $t$, $O^t$ from the M-Block and $E^t$ are obtained. The edge encoder $\phi$ converts $O^t$ and $E^t$ into the edge embeddings $P^t$. The invention uses the OCNN, a fully connected network with $l_a$ layers, in which each hidden layer has $d$ neurons and the output layer has only 1 neuron, representing the anomaly score. The forward propagation formula of the neural network is as follows:

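where $l = 1, \dots, l_a - 2$ and the last layer does not apply an activation function; the formula's parameters are the weight matrix and the bias vector of the $l$-th layer of the OCNN, and the output is the anomaly score vector.

The loss function of the OCNN is:

where $r$ is the bias of the hyperplane and $\nu$ controls the number of data points allowed to cross the hyperplane, $\nu$ being equivalent to the percentage of anomalies. Finally $s^t$ is obtained, and the anomalous edges can be classified by setting a threshold.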
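The roles of $r$, $\nu$ and the final threshold can be sketched as follows. The loss formula is not reproduced in this text, so the code follows the objective of the published OC-NN model (a weight regularizer, a hinge term scaled by $1/\nu$, minus $r$); treating scores below the threshold as anomalous is likewise an assumption borrowed from that model, and both function names are hypothetical.

```python
def ocnn_objective(scores, r, nu, weight_sq_norm=0.0):
    """OC-NN style objective: 0.5 * ||W||^2 + hinge / (nu * N) - r.

    scores: network outputs, one per edge.
    weight_sq_norm: sum of squared weights, passed in to keep the sketch short.
    """
    n = len(scores)
    hinge = sum(max(0.0, r - s) for s in scores) / (nu * n)
    return 0.5 * weight_sq_norm + hinge - r


def flag_edges(scores, threshold):
    """Assumed convention: a score below the threshold marks an anomalous edge."""
    return [s < threshold for s in scores]


s_t = [1.0, 2.0, 3.0, 0.0]
objective = ocnn_objective(s_t, r=1.0, nu=0.25)
flags = flag_edges(s_t, threshold=0.5)
```

Only edges scoring below `r` contribute to the hinge term, and dividing by `nu` makes each such violation more costly as the assumed anomaly fraction shrinks.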

Formally, at timestamp $t$, the updated edge set $E^t$ is obtained. The adjacency matrix $A^t$ is updated according to $E^t$ and used as the input of CmaGraph. The SeAutoencoder, CenNet and OCNN are then trained with the learning rate $\alpha$ and the previous weights.
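The per-timestamp update described above amounts to a loop that folds each new edge set into the adjacency matrix and continues training from the previous weights. The sketch below shows only that control flow: `run_stream` and `toy_train_step` are hypothetical names, the toy step stands in for training the three networks, and accumulating edges across timestamps (rather than replacing them) is an assumption.

```python
def run_stream(n, edge_batches, train_step):
    """Process edge sets E^1, E^2, ... in order, warm-starting each step."""
    a = [[0] * n for _ in range(n)]
    weights = None          # no previous weights before the first timestamp
    history = []
    for edges_t in edge_batches:
        for u, v in edges_t:
            a[u][v] = a[v][u] = 1          # update A^t according to E^t
        weights = train_step(a, weights)   # continue from previous weights
        history.append(weights)
    return a, history


def toy_train_step(a, prev):
    """Stand-in for model training: the 'weights' are a single number that
    starts from the previous value instead of from scratch."""
    start = 0 if prev is None else prev
    return start + sum(map(sum, a))


A_final, history = run_stream(3, [[(0, 1)], [(1, 2)]], toy_train_step)
```

Starting each step from the previous weights is what lets the embeddings evolve smoothly instead of being relearned from scratch at every timestamp.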

The CmaGraph is summarized in algorithm 1.

In real life, many anomalies occur between communities, yet existing dynamic graph anomaly detection algorithms based on graph embedding models do not consider the community structure of the dynamic graph. The method uses a sparse evolutionary autoencoder and k-means to detect the evolving community structure of the dynamic graph, and uses a Siamese network to reconstruct the distances between vertices within and between communities, so that vertices in the same community are close to each other in Euclidean distance and vertices in different communities are far apart. Vertex embeddings are learned from the community structure and used for anomaly detection.

The method was tested repeatedly on three real-world dynamic graph data sets, covering a social network graph, a paper co-authorship graph and a computer network graph. In a quantitative analysis using the AUC metric, CmaGraph improves on NetWalk by 18% on average across the three data sets, which shows that CmaGraph is effective at detecting anomalous data and that community structure is useful for anomaly detection.
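For readers unfamiliar with the AUC metric used in this comparison, a minimal sketch of its pairwise definition follows: the probability that a randomly chosen anomalous edge receives a higher anomaly score than a randomly chosen normal edge, with ties counted as one half. The scores and labels are made up for illustration and are not results from the source; the sketch assumes scores are oriented so that larger means more anomalous.

```python
def auc(scores, labels):
    """Pairwise AUC: fraction of (anomalous, normal) pairs ranked correctly.

    labels: 1 for an anomalous edge, 0 for a normal edge.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Illustrative scores only.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so AUC can be compared across data sets with different anomaly rates.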

In summary, compared with the prior art, the method uses a sparse evolutionary autoencoder and k-means to detect the evolving community structure of the dynamic graph, and uses a Siamese network to reconstruct the distances between vertices within and between communities, so that vertices in the same community are close to each other in Euclidean distance and vertices in different communities are far apart. Vertex embeddings are learned from the community structure and used for anomaly detection. The method detects anomalous data effectively and demonstrates the usefulness of community structure for anomaly detection. It fills the research gap in using community structure for anomaly detection within graph-embedding-based methods.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims

1. A dynamic graph anomaly detection method based on community structure, characterized in that the method comprises the following steps:

a dynamic graph is a sequence of graphs, where $G^t = (V^t, E^t)$ denotes the graph at timestamp $t$; as the graph is updated, the updated edge set is denoted by $E^t$, and the set of all vertices in $E^t$ is denoted by $V^t$;

S3: the vertex embeddings are finally input to the A-Block for anomaly detection.

2. The dynamic graph anomaly detection method based on community structure according to claim 1, characterized in that: the goal of the C-Block is to detect evolving communities, that is, communities on the dynamic graph that change over time;

the autoencoder is a sparse evolutionary autoencoder, namely the SeAutoencoder, which yields stable vertex embeddings, so that k-means yields stable community labels;

the vertex embeddings are constructed by the SeAutoencoder, a fully connected network with $l_s$ layers, whose forward propagation formula is as follows:

where $l = 1, \dots, l_s - 1$;

k-means is applied to $H^t$ to obtain the community label vector $c^t$, which holds the community label of each vertex; here $H^t \in \mathbb{R}^{n \times d}$, $d$ is the dimension of the vertex embedding, and $c^t \in \mathbb{R}^n$; the reconstruction loss function of the SeAutoencoder is:

where $\rho$ is the sparsity parameter and the remaining symbol denotes the average activation of the $j$-th neuron in layer $l$;

a temporal loss $J_T$ is introduced between $H^t$ and $H^{t-1}$,

when $t = 1$, $J_T = 0$;

for the SeAutoencoder with $l_s$ layers, the loss function is:

3. The dynamic graph anomaly detection method based on community structure according to claim 1, characterized in that: the goal of the M-Block is to reconstruct the distances between vertices within and between communities, so that vertices in the same community are close to each other in Euclidean distance and vertices in different communities are far apart;

the vertex embeddings and the community label vector are the inputs of the M-Block, and its output is the community-metric-enhanced vertex embeddings; the M-Block uses the community metric enhancement network, namely CenNet, to enhance the vertex embeddings; CenNet is a Siamese network and a deep metric learning method that reconstructs the distances between vertices in the evolving communities;

where $l = 1, \dots, l_c - 1$, and the formula's two parameters are the weight matrix and the bias vector of the $l$-th layer of CenNet; the loss function of CenNet is the contrastive loss function, which is:

where the distance term denotes the Euclidean distance between samples $i$ and $j$, each sample being a row of the matrix $O^t$; if samples $i$ and $j$ are in the same community, $y_{ij} = 1$, otherwise $y_{ij} = 0$; $b$ is the margin; when the data set is too large, the complexity of computing $J_{CenNet}$ is high, so for a given sample $i$ the index $j$ is obtained by negative sampling, which reduces the complexity; that is, some random samples $j$ are drawn from the data set to approximate $J_{CenNet}$.

4. The dynamic graph anomaly detection method based on community structure according to claim 1, characterized in that: the goal of the A-Block is to obtain the anomaly scores of all edges in $E^t$; in CmaGraph, OC-NN is used as the anomaly detector;

the edge embeddings, which use the embedding distance information, are input into a One-Class neural network (OCNN), which is an anomaly detection model;

formally, at timestamp $t$, $O^t$ from the M-Block and $E^t$ are obtained; the edge encoder $\phi$ converts $O^t$ and $E^t$ into the edge embeddings $P^t$; the OCNN is a fully connected network with $l_a$ layers, in which each hidden layer has $d$ neurons and the output layer has only 1 neuron, representing the anomaly score; the forward propagation formula of the neural network is as follows:

where $l = 1, \dots, l_a - 2$, and the last layer does not apply an activation function, i.e.

The loss function of the OCNN is: