CN114218445A - Anomaly detection method based on dynamic heterogeneous information network representation of metagraph - Google Patents

Anomaly detection method based on dynamic heterogeneous information network representation of metagraph

Info

Publication number
CN114218445A
Authority
CN
China
Prior art keywords
network
heterogeneous information
metagraph
nodes
information network
Prior art date
Legal status
Pending
Application number
CN202111505386.8A
Other languages
Chinese (zh)
Inventor
赵翔
方阳
谭真
胡升泽
陈盈果
李欣奕
王吉
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111505386.8A
Publication of CN114218445A


Classifications

    • G06F16/9024: Information retrieval; indexing; data structures therefor; graphs; linked lists
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/048: Neural networks; activation functions
    • G06N3/08: Neural networks; learning methods


Abstract

The invention belongs to the field of data analysis and discloses an anomaly detection method based on a dynamic heterogeneous information network representation of a metagraph. The method obtains scientific-collaboration dynamic heterogeneous information network data comprising network node and network edge data; introduces a complex-space embedding mechanism to represent the given dynamic heterogeneous information network at timestamp 1; learns the dynamic heterogeneous information network from timestamp 2 to timestamp t with a triad-based metagraph dynamic embedding mechanism; processes the heterogeneous information network from timestamp 1 to timestamp t with a deep autoencoder based on a long short-term memory (LSTM) network and computes the graph prediction for timestamp t+1; and performs anomaly detection on the nodes in the network using the graph data from timestamps 1 to t+1 to obtain the anomaly detection result. Because the metagraph-based mechanism trains only on the change data set, the invention scales to large dynamic heterogeneous information networks and can predict the future network structure.

Description

Anomaly detection method based on dynamic heterogeneous information network representation of metagraph
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an anomaly detection method based on dynamic heterogeneous information network representation of a metagraph.
Background
Representation learning is a fundamental task in information retrieval. Its goal is to capture the features of information objects in a low-dimensional space. Most research on heterogeneous information network (HIN) representation learning has focused on static heterogeneous information networks. In practice, however, networks are dynamic and change constantly. A heterogeneous information network is a network with multiple types of nodes and edges, and most real networks, such as social networks and bibliographic networks, are dynamic heterogeneous information networks. Compared with a static network, a dynamic heterogeneous information network is therefore a more expressive tool and can model information-rich problems.
To make networks amenable to machine learning, network representation learning (also referred to as network embedding) has been widely studied, with the objective of embedding networks into a low-dimensional space. Most research has focused on static information networks. Classical network embedding models exploit random walks to explore static homogeneous networks. To represent static heterogeneous networks, many meta-path-based models have been proposed, using different mechanisms to model the relationships between heterogeneous information network nodes. Unlike static network embedding, techniques for dynamic heterogeneous information networks need to be incremental and scalable in order to handle network evolution efficiently. This makes most existing static embedding models, which must repeatedly process the entire network, unsuitable and inefficient.
Disclosure of Invention
In view of the above, the present invention provides a new extensible representation learning model, M-DHIN, to explore the evolution of dynamic heterogeneous information networks. The invention treats a dynamic heterogeneous information network as a series of snapshots with different timestamps. It first learns the initial embedding of the dynamic heterogeneous information network at the first timestamp using a static embedding method. The invention describes the characteristics of the initial heterogeneous information network through metagraphs; compared with traditional path-oriented static models, metagraphs preserve more structural and semantic information. The invention also employs a complex embedding scheme to better distinguish between symmetric and asymmetric metagraphs. Unlike traditional models that process the entire network at each timestamp, the invention builds a change data set that contains only the nodes involved in the triadic closure or opening processes, together with newly added or deleted nodes, and then trains on this change data set with the metagraph-based mechanism described above. With this arrangement, M-DHIN can be extended to large dynamic heterogeneous information networks, because it only needs to model the entire heterogeneous information network once and, over time, only needs to deal with the changing parts. Existing dynamic embedding models can only represent existing snapshots and cannot predict the future network structure. To give M-DHIN this capability, the invention introduces a deep autoencoder model based on a long short-term memory (LSTM) network, which processes the evolution of the graph and outputs a predicted graph through an LSTM encoder. Finally, the invention evaluates the proposed M-DHIN model on real datasets and demonstrates that it significantly and consistently outperforms state-of-the-art models.
The invention provides a novel dynamic heterogeneous information network embedding model, named M-DHIN, which offers an extensible way to capture the characteristics of a dynamic heterogeneous information network through metagraphs. The invention first learns the initial embedding of the entire heterogeneous information network at the first timestamp. Traditional network embedding methods adopt random walks or meta-paths, which are insufficient to fully describe the neighborhood structure of a node. Therefore, the invention proposes the metagraph to capture the structural information of the HIN. A metagraph is a typed subgraph that captures the connecting subgraph between two node types through typed edges, so the neighborhood structure of a node is fully preserved; some simple examples are given in FIG. 1. In training the model, the invention observes that the structure of a metagraph can be symmetric or asymmetric, as shown in FIG. 1(a) and FIG. 1(b), respectively. To better represent heterogeneous information network nodes, the invention combines a complex-space-oriented embedding scheme to handle the symmetric and asymmetric relationships between nodes. In complex space, the components of a node's embedding vector are complex numbers; that is, the invention divides a node's vector into real and imaginary parts.
Specifically, the anomaly detection method based on the dynamic heterogeneous information network representation of the metagraph disclosed by the invention comprises the following steps:
step 1, acquiring scientific-collaboration dynamic heterogeneous information network data comprising network node and network edge data;
step 2, introducing an embedding mechanism in complex space to represent the given dynamic heterogeneous information network at timestamp 1;
step 3, learning the dynamic heterogeneous information network from timestamp 2 to timestamp t by adopting a triad-based metagraph dynamic embedding mechanism;
step 4, processing the heterogeneous information network from timestamp 1 to timestamp t by using a deep autoencoder based on a long short-term memory network, and computing the graph prediction for timestamp t+1;
and step 5, carrying out anomaly detection on the nodes in the network by using the graph data from timestamps 1 to t+1 to obtain the anomaly detection result.
Further, the embedding mechanism described in step 2 represents the network by a metagraph-based complex-space embedding scheme. For the initial heterogeneous information network at timestamp 1, $G_1=(V_1,E_1)$, the concept of a heterogeneous information network triple is introduced to represent the relationship between nodes and metagraphs. A heterogeneous information network triple is written $(u,s,v)$, where u is the first node generated in the metagraph, v is the last node, and s is the metagraph connecting u and v; $\mathbf{u},\mathbf{v},\mathbf{s}\in\mathbb{C}^d$ are the representation vectors of u, v and s, respectively, and d is the dimension of the representation vectors. The probability that a heterogeneous information network triple holds is expressed as:

$$P(s\mid u,v)=\sigma(X_{uv}),$$

where $X\in\mathbb{C}^{n\times n}$ is a scoring matrix and $\sigma$ is an activation function. For a heterogeneous information network triple $(u,s,v)$, its complex-space embedding is written $\mathbf{u}=\mathrm{Re}(\mathbf{u})+i\,\mathrm{Im}(\mathbf{u})$, $\mathbf{v}=\mathrm{Re}(\mathbf{v})+i\,\mathrm{Im}(\mathbf{v})$ and $\mathbf{s}=\mathrm{Re}(\mathbf{s})+i\,\mathrm{Im}(\mathbf{s})$, where $\mathrm{Re}(\cdot)\in\mathbb{R}^d$ and $\mathrm{Im}(\cdot)\in\mathbb{R}^d$ denote the real and imaginary parts of a vector, respectively.

The Hadamard function is introduced to capture the relationship of u, v and s in complex space, expressed as:

$$\Phi(s,u,v)=\mathbf{s}\odot\mathbf{u}\odot\bar{\mathbf{v}},$$

where $\bar{\mathbf{v}}$ is the complex conjugate of $\mathbf{v}$ and $\odot$ is the element-wise product. One element of the scoring matrix is finally:

$$X_{uv}=\mathrm{Re}\big(\langle\mathbf{s},\mathbf{u},\bar{\mathbf{v}}\rangle\big),$$

and the corresponding score function is defined as:

$$f_s(u,v)=\mathrm{Re}\big(\langle\mathbf{s},\mathbf{u},\bar{\mathbf{v}}\rangle\big)=\langle\mathrm{Re}(\mathbf{s}),\mathrm{Re}(\mathbf{u}),\mathrm{Re}(\mathbf{v})\rangle+\langle\mathrm{Re}(\mathbf{s}),\mathrm{Im}(\mathbf{u}),\mathrm{Im}(\mathbf{v})\rangle+\langle\mathrm{Im}(\mathbf{s}),\mathrm{Re}(\mathbf{u}),\mathrm{Im}(\mathbf{v})\rangle-\langle\mathrm{Im}(\mathbf{s}),\mathrm{Im}(\mathbf{u}),\mathrm{Re}(\mathbf{v})\rangle,$$

where $\langle\cdot,\cdot,\cdot\rangle$ denotes the standard element-wise multilinear dot product.
Further, a triple in the triad-based metagraph dynamic embedding mechanism described in step 3 is a set containing three nodes: if every pair of nodes is connected, it is called a closed triple, and if there are only two edges among the three nodes, it is called an open triple. To obtain the dynamic heterogeneous information network embedding from timestamps 1 to t, a negative-sampling strategy is first used to form a training data set: nodes u and v are connected by metagraph s in a positive triple (u,s,v), and nodes u' and v' are connected by metagraph s' in a negative triple (u',s',v'). For each positive heterogeneous information network triple (u,s,v), a negative triple is generated by randomly replacing u and v with other nodes whose types are restricted to be the same as those of the replaced nodes, and replaced heterogeneous information network triples that are still positive are filtered out after sampling. The evolution of an open triple into a closed one (the triadic closure process) and of a closed triple into an open one are the basic changes of dynamic heterogeneous information network evolution; the positive and negative evolution triples serve as the training set, which is trained with the complex-space embedding mechanism of step 2 to obtain the representation learning of the dynamic heterogeneous information network at timestamps 1 to t.
Further, the dynamic heterogeneous information network has four kinds of change:
(1) Added edges form a triadic closure process: all metagraphs whose three nodes change from having only two edges among them to being fully interconnected are collected into the change training data set at timestamp t, denoted $\Delta G_t^{\mathrm{tc}}$. For a metagraph s with three nodes $v_1$, $v_2$ and $v_3$, where $(v_1,v_2)$ denotes the edge between $v_1$ and $v_2$, the metagraph $s^{\mathrm{tc}}$ obtained after the triadic closure process is defined as:

$$s^{\mathrm{tc}}=s\cup\{(v_1,v_3)\},\qquad s=\{v_1,v_2,v_3,(v_1,v_2),(v_2,v_3)\}.$$

(2) Deleted edges lead to a triadic opening process: all metagraphs containing triples that evolve from a cycle to a path with two edges are collected; these nodes at timestamp t are included in $\Delta G_t^{\mathrm{to}}$, and the metagraph $s^{\mathrm{to}}$ obtained after the triadic opening process is defined as

$$s^{\mathrm{to}}=s\setminus\{(v_1,v_3)\},\qquad s=\{v_1,v_2,v_3,(v_1,v_2),(v_2,v_3),(v_1,v_3)\}.$$

(3) An added node: given an existing node $v_1$ in a metagraph s and a newly added node $v_2$, s will be expanded into

$$s'=s\cup\{v_2,(v_1,v_2)\}.$$

(4) A deleted node: given existing nodes $v_1$ and $v_2$ in a metagraph s, if $v_2$ is deleted, then s will become

$$s'=s\setminus\{v_2,(v_1,v_2)\}.$$

Furthermore, the change set is constructed on the basis of the original metagraphs; that is, when the change set is trained, after the change process has finished, the nodes are trained on their original metagraphs. After the change set $\Delta G_t$ is obtained, only $\Delta G_t$ is trained instead of retraining the entire network, and the metagraph-based complex mechanism is used to obtain the embeddings of the changed nodes.
Further, in step 4, the deep autoencoder model is composed of an encoder part and a decoder part. To construct the input of the encoder, for a node, its metagraph is taken as its neighborhood to form its adjacency matrix A; then, for any pair of nodes u and v in the metagraph, the encoder input is composed of the time-ordered adjacency vectors of u and v, denoted $[\mathbf{a}_u^{1},\ldots,\mathbf{a}_u^{t}]$ and $[\mathbf{a}_v^{1},\ldots,\mathbf{a}_v^{t}]$, respectively. $\mathbf{a}_u$ is a combination of two parts: one part is the row of the adjacency matrix A that represents the nodes adjacent to u, further mapped to a d-dimensional vector through a fully connected layer, and the other part is the dynamic node embedding of node u. The encoder then processes the input to obtain low-dimensional representations $\mathbf{y}_u$ and $\mathbf{y}_v$. The encoder aims to predict the neighborhoods $\mathbf{a}_u^{t+1}$ and $\mathbf{a}_v^{t+1}$ from the embeddings at timestamp t, and the adjacency vectors predicted by the deep autoencoder are denoted $\hat{\mathbf{a}}_u^{t+1}$ and $\hat{\mathbf{a}}_v^{t+1}$.
further, for node u and its neighborhood
Figure BDA0003404098220000067
Where d is the embedding dimension, t is the total time step, and the concealment at the first level is expressed as
Figure BDA0003404098220000068
Wherein
Figure BDA0003404098220000069
Is the parameter matrix of layer 1 of the auto-encoder, d (1) is the representation dimension of layer 1,
Figure BDA00034040982200000610
f represents an S-shaped activation function for the deviation of the layer 1 of the encoder;
the k-layer output of the encoder is calculated as follows:
Figure BDA00034040982200000611
to fully capture the information about the past evolution of the metagraph, a long-short term memory network layer is further applied on the output of the encoder, for the first long-short term memory network layer, the hidden state representation is calculated as:
Figure BDA00034040982200000612
Figure BDA00034040982200000613
Figure BDA00034040982200000614
Figure BDA00034040982200000615
Figure BDA00034040982200000616
Figure BDA00034040982200000617
wherein
Figure BDA0003404098220000071
In order to activate the value of the input gate,
Figure BDA0003404098220000072
in order to activate the value of the forgetting gate,
Figure BDA0003404098220000073
is a new predicted candidate state for the current state,
Figure BDA0003404098220000074
the unit states of the long-short term memory network,
Figure BDA0003404098220000075
the value for activating the output gate, delta represents the activation function,
Figure BDA0003404098220000076
in the form of a matrix of parameters,
Figure BDA0003404098220000077
represents deviation, d(k+1)A representation dimension representing a k +1 layer;
the long-short term memory network has one layer, and the final output of the long-short term memory network can be expressed as
Figure BDA0003404098220000078
Figure BDA0003404098220000079
Wherein
Figure BDA00034040982200000710
The training objective is to minimize the following loss function:
Figure BDA00034040982200000711
the embedding at the time t is utilized to punish the incorrect neighborhood reconstruction at the time t +1, so that the node embedding of a future timestamp can be predicted by a depth automatic encoder model based on a long-short term memory network; f (.) represents the function employed to generate the prediction neighborhood at timestamp t +1, using the above-described auto-encoder framework as
Figure BDA00034040982200000712
A hyperparameter matrix, which balances the weights of the penalty observation neighbors, indicates a product by element.
Further, the gradient of the objective function with respect to the decoder weights is applied as follows:

$$\frac{\partial L_{t+1}}{\partial W^{(k+1)}}=\Big[2\big(\hat{\mathbf{a}}_u^{t+1}-\mathbf{a}_u^{t+1}\big)\odot\mathcal{B}\Big]\frac{\partial\hat{\mathbf{a}}_u^{t+1}}{\partial W^{(k+1)}},$$

where $W^{(k+1)}$ is the parameter matrix of layer k+1 of the autoencoder; after the derivatives are calculated, the SGD algorithm with Adam is applied to train the model.
Further, the dynamic heterogeneous network is composed of multiple types of nodes and edges, including an academic graph network dataset with four node types (authors, papers, publication venues and topics), a social graph network dataset with customer, restaurant, review and food nodes, and a movie knowledge graph dataset with movie, director, actor, producer and composer nodes. In the social graph network dataset, nodes represent users and their characteristics, and edges represent relationships between users; in the academic graph network dataset, nodes include authors and papers, and edges represent the associations between nodes.
The invention has the following beneficial effects:
the metagraph-based mechanism trains on the change data set; with this arrangement, the invention can be extended to a large dynamic heterogeneous information network, because the whole heterogeneous information network only needs to be modeled once and, over time, only the changed part needs to be processed;
compared with existing dynamic embedding models, which can only represent existing snapshots, the method can predict the future network structure;
the invention was evaluated on real datasets, demonstrating that it significantly and consistently outperforms the most advanced models.
Drawings
FIG. 1 is an illustration of metagraphs;
FIG. 2 is an exemplary diagram of a dynamic HIN model for a social network;
FIG. 3 is a diagram of the M-DHIN model framework of the present invention;
FIG. 4 is a schematic diagram of the process of forming the change set;
FIG. 5 shows the LSTM-based deep autoencoder of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
In the following, the invention will introduce the notation and definition of the dynamic heterogeneous information network and the metagraph. Next, the present invention will address the dynamic network representation learning problem for heterogeneous information networks. Table 1 lists the main terms and symbols used.
TABLE 1 Terminology and symbols
Dynamic heterogeneous information network: let $G=(V,E,T)$ be a directed graph, where V denotes the set of nodes and E denotes the set of edges between the nodes. Each node and edge is associated with a type mapping function, $\phi: V\to T_V$ and $\psi: E\to T_E$, respectively, where $T_V$ and $T_E$ represent the sets of node and edge types. G is a heterogeneous information network (HIN) if $|T_V|>1$ or $|T_E|>1$; otherwise, it is a homogeneous network. In addition, a dynamic heterogeneous information network is a series of network snapshots, denoted $\{G_1,\ldots,G_{Time}\}$. Two consecutive timestamps t and t+1 need to satisfy the following condition: $|V_{t+1}|\neq|V_t|$ or $|E_{t+1}|\neq|E_t|$, where $|V_t|$ and $|E_t|$ represent the numbers of nodes and edges at timestamp t, respectively. The present invention assumes that $T_V$ and $T_E$ remain unchanged throughout the network evolution. FIG. 2 illustrates an example of a dynamic HIN.
A metagraph is a subgraph of compatible node types and edge types, denoted $S=(T_{V_S},T_{E_S})$, where $T_{V_S}$ and $T_{E_S}$ represent the sets of node types and edge types in the metagraph, respectively.
As shown in fig. 1, the metagraph can be divided into two types, symmetric and asymmetric; the present invention will handle both cases in the proposed model.
Problem (dynamic heterogeneous information network representation learning). Given a series of dynamic networks $G_1,\ldots,G_{Time}$, dynamic HIN representation learning aims to learn, for each node v, a representation $\mathbf{y}_v\in\mathbb{R}^d$, where d is the dimension of the node representation. In particular, the method provided by the invention also obtains a representation $\mathbf{s}$ for each metagraph. These representations are able to capture the continuously evolving structural attributes of the dynamic network.
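For illustration, a small Python sketch (not part of the patent) of how such a snapshot sequence with typed nodes and edges might be represented; the class and field names are assumptions:

```python
# Illustrative only: a minimal container for dynamic HIN snapshots.
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    nodes: dict = field(default_factory=dict)  # node id -> node type in T_V
    edges: set = field(default_factory=set)    # (u, v, edge type in T_E)

# A dynamic HIN is a series of snapshots G_1 ... G_Time whose node or edge
# counts differ between consecutive timestamps.
g1 = Snapshot(nodes={"a1": "author", "p1": "paper"},
              edges={("a1", "p1", "writes")})
g2 = Snapshot(nodes={"a1": "author", "p1": "paper", "p2": "paper"},
              edges={("a1", "p1", "writes"), ("a1", "p2", "writes")})
dynamic_hin = [g1, g2]
assert len(g2.nodes) != len(g1.nodes)  # the dynamic condition |V_{t+1}| != |V_t|
```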
In this section, the invention presents the model M-DHIN. FIG. 3 gives an overview of M-DHIN. The M-DHIN model first introduces a complex embedding mechanism to represent the given dynamic heterogeneous information network at the initial timestamp. The invention then employs a triad-based metagraph dynamic embedding mechanism to learn the dynamic heterogeneous information network from timestamp 2 to timestamp t. Finally, the invention provides a deep autoencoder based on a long short-term memory network to perform the graph prediction for timestamp t+1.
Conventional heterogeneous information network representation learning methods (such as DeepWalk, node2vec and metapath2vec) must process the complete heterogeneous information network at each timestamp to generate up-to-date node vectors, which is time-consuming, inefficient and does not scale to large dynamic heterogeneous information networks. To solve this problem, the present invention proposes a new model named M-DHIN that is able to capture the major changes of the dynamic network. The initial step of M-DHIN is similar to static heterogeneous information network embedding: it represents the entire network (at the first timestamp) by a metagraph-based complex embedding scheme. Given the initial heterogeneous information network at timestamp 1, $G_1=(V_1,E_1)$, the invention introduces the concept of heterogeneous information network triples to represent the relationship between nodes and metagraphs. A heterogeneous information network triple is written $(u,s,v)$, where u is the first node generated in the metagraph, v is the last node, and s is the metagraph connecting them; $\mathbf{u},\mathbf{v},\mathbf{s}\in\mathbb{C}^d$ are the representation vectors of u, v and s, respectively, and d is the dimension of the representation vectors.
The probability that a heterogeneous information network triple holds is expressed as

$$P(s\mid u,v)=\sigma(X_{uv}), \quad (1)$$

where $X\in\mathbb{C}^{n\times n}$ is the scoring matrix, n is the number of training nodes, and $\sigma$ is the activation function, chosen here as the sigmoid function.
Note that a metagraph may be symmetric or asymmetric; that is, if the metagraph is symmetric, exchanging the first and last nodes does not change the semantics of the original metagraph, as shown in FIG. 1(a). Conversely, interchanging the head and tail nodes changes the semantics of an asymmetric metagraph, as shown in FIG. 1(b).
To solve this problem, the invention adapts a complex knowledge-embedding scheme to the network embedding task. For a heterogeneous information network triple $(u,s,v)$, its complex embeddings are written $\mathbf{u}=\mathrm{Re}(\mathbf{u})+i\,\mathrm{Im}(\mathbf{u})$, $\mathbf{v}=\mathrm{Re}(\mathbf{v})+i\,\mathrm{Im}(\mathbf{v})$ and $\mathbf{s}=\mathrm{Re}(\mathbf{s})+i\,\mathrm{Im}(\mathbf{s})$, where $\mathrm{Re}(\cdot),\mathrm{Im}(\cdot)\in\mathbb{R}^d$ denote the real and imaginary parts of a vector, respectively. The invention introduces the Hadamard function to capture the relationship of u, v and s in complex space, expressed as:

$$\Phi(s,u,v)=\mathbf{s}\odot\mathbf{u}\odot\bar{\mathbf{v}}, \quad (2)$$

where $\bar{\mathbf{v}}$ is the complex conjugate of $\mathbf{v}$ and $\odot$ is the element-wise product. However, the sigmoid function in equation 1 cannot be applied in complex space, so the invention retains only the real part of the objective function; the real part can still handle symmetric and asymmetric structures well, as detailed later. Thus, one element of the scoring matrix ends up being

$$X_{uv}=\mathrm{Re}\big(\langle\mathbf{s},\mathbf{u},\bar{\mathbf{v}}\rangle\big). \quad (3)$$

Given the scoring matrix, the corresponding score function is defined as

$$f_s(u,v)=\mathrm{Re}\big(\langle\mathbf{s},\mathbf{u},\bar{\mathbf{v}}\rangle\big)=\langle\mathrm{Re}(\mathbf{s}),\mathrm{Re}(\mathbf{u}),\mathrm{Re}(\mathbf{v})\rangle+\langle\mathrm{Re}(\mathbf{s}),\mathrm{Im}(\mathbf{u}),\mathrm{Im}(\mathbf{v})\rangle+\langle\mathrm{Im}(\mathbf{s}),\mathrm{Re}(\mathbf{u}),\mathrm{Im}(\mathbf{v})\rangle-\langle\mathrm{Im}(\mathbf{s}),\mathrm{Im}(\mathbf{u}),\mathrm{Re}(\mathbf{v})\rangle, \quad (4)$$

where $\langle\cdot,\cdot,\cdot\rangle$ is the standard element-wise multilinear dot product, e.g. $\langle a,b,c\rangle=\sum_k a_k b_k c_k$, where a, b, c are vectors and k indexes their dimensions.
Equation 4 is able to handle asymmetric metagraphs thanks to the complex conjugate applied to one of the embeddings. Furthermore, the score function is antisymmetric if $\mathbf{s}$ is purely imaginary (i.e., its real part is zero) and symmetric if $\mathbf{s}$ is real. Co-authorship is a symmetric relationship; citation is an antisymmetric relationship: (typically) paper A can cite paper B but B cannot cite A, in other words, B is cited by A. By separating the metagraph embedding $\mathbf{s}$ into its imaginary and real parts, the invention decomposes the metagraph scoring matrix $X_s$ into the sum of an antisymmetric matrix $X_s^{\mathrm{anti}}$ and a symmetric matrix $X_s^{\mathrm{sym}}$. Metagraph embedding thus naturally acts as a weight on each latent dimension: $\mathrm{Im}(\mathbf{s})$ weights the antisymmetric imaginary part $\mathrm{Im}(\langle\mathbf{u},\bar{\mathbf{v}}\rangle)$, and $\mathrm{Re}(\mathbf{s})$ weights the symmetric real part $\mathrm{Re}(\langle\mathbf{u},\bar{\mathbf{v}}\rangle)$; $\mathrm{Im}(\langle\mathbf{u},\bar{\mathbf{v}}\rangle)$ is antisymmetric and $\mathrm{Re}(\langle\mathbf{u},\bar{\mathbf{v}}\rangle)$ is symmetric. This mechanism therefore enables the invention to accurately and efficiently represent symmetric and asymmetric (including antisymmetric) metagraphs between pairs of nodes.
In this initial step, the present invention uses the state of the art method, GRAMI, to find all sub-graphs in the database that occur frequently and meet a given frequency threshold. The present invention then uses these mined subgraphs to compose a metagraph.
Algorithm 1 summarizes the initial complex embedding algorithm.
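The algorithm itself is reproduced only as an image in the original document. The following PyTorch sketch shows one plausible reading of the initial complex-embedding step; the function names, tensor layout and the use of binary cross-entropy for the objective of equation 10 are assumptions, not the patent's own code:

```python
import torch

def make_embedding(n, d):
    # real and imaginary parts of a complex embedding, stored separately
    return (torch.nn.Parameter(0.1 * torch.randn(n, d)),
            torch.nn.Parameter(0.1 * torch.randn(n, d)))

def f_score(s_re, s_im, u_re, u_im, v_re, v_im):
    # Eq. (4): Re(<s, u, conj(v)>)
    return ((s_re * u_re * v_re).sum(-1) + (s_re * u_im * v_im).sum(-1)
            + (s_im * u_re * v_im).sum(-1) - (s_im * u_im * v_re).sum(-1))

def train_initial(triples, n_nodes, n_metagraphs, d=128, lr=0.025, epochs=20):
    """triples: list of (u, s, v, label) index tuples from the negative sampler."""
    node_re, node_im = make_embedding(n_nodes, d)
    mg_re, mg_im = make_embedding(n_metagraphs, d)
    opt = torch.optim.Adam([node_re, node_im, mg_re, mg_im], lr=lr)
    u, s, v, y = (torch.tensor(x) for x in zip(*triples))
    for _ in range(epochs):
        opt.zero_grad()
        logits = f_score(mg_re[s], mg_im[s], node_re[u], node_im[u],
                         node_re[v], node_im[v])
        # maximizing log O_(u,s,v) (Eq. 10) == minimizing binary cross-entropy
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y.float())
        loss.backward()
        opt.step()
    return (node_re, node_im), (mg_re, mg_im)
```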
After the entire heterogeneous information network has been processed to obtain the initial embedding, the present invention further captures the structural evolution of the dynamic heterogeneous information network using triads. A triple is a set containing three nodes. If each node is connected to each of the others, it is called a closed triple; if there are only two edges between the three nodes, it is called an open triple. As mentioned previously, the evolution of open triple structures into closed structures (i.e., the triadic closure process) is a fundamental variation in the evolution of dynamic heterogeneous information networks. Thus, at this step, the present invention builds a change training data set accordingly to contain the nodes that have undergone triadic closure. Meanwhile, the invention cannot ignore that the triadic opening process also exists in dynamic heterogeneous information networks: two nodes in a triple may lose contact over time. In general, the present invention identifies four common scenarios describing dynamic heterogeneous information network changes:
(1) Added edges form a triadic closure process. As shown in FIG. 4(a), the present invention identifies all metagraphs whose three nodes change from having only two edges among them to being fully interconnected. These metagraphs are included in the change training data set at timestamp t, denoted $\Delta G_t^{\mathrm{tc}}$. For a metagraph s with three nodes $v_1$, $v_2$ and $v_3$, $(v_1,v_2)$ denotes the edge between $v_1$ and $v_2$. Then $s^{\mathrm{tc}}$ (obtained after the triadic closure process) is defined as:

$$s^{\mathrm{tc}}=s\cup\{(v_1,v_3)\},\qquad s=\{v_1,v_2,v_3,(v_1,v_2),(v_2,v_3)\}. \quad (5)$$

(2) Deleted edges lead to a triadic opening process. As shown in FIG. 4(b), the present invention collects all metagraphs containing triples that evolve from a cycle into a path with two edges; these nodes at timestamp t are included in $\Delta G_t^{\mathrm{to}}$. Similarly to the triadic closure process, $s^{\mathrm{to}}$, obtained after the triadic opening process, is defined as

$$s^{\mathrm{to}}=s\setminus\{(v_1,v_3)\},\qquad s=\{v_1,v_2,v_3,(v_1,v_2),(v_2,v_3),(v_1,v_3)\}. \quad (6)$$

(3) An added node. As shown in FIG. 4(c), given an existing node $v_1$ in a metagraph and a newly added node $v_2$, the metagraph s is expanded into

$$s'=s\cup\{v_2,(v_1,v_2)\}. \quad (7)$$

(4) A deleted node. As shown in FIG. 4(d), given existing nodes $v_1$ and $v_2$ in a metagraph, if $v_2$ is deleted, then s becomes

$$s'=s\setminus\{v_2,(v_1,v_2)\}. \quad (8)$$
In forming the change set, Change2vec differs from the model of the present invention mainly in that it only collects changed nodes and then forms meta-paths within the change set, which may lose contact with the original network and may miss many meta-paths. For example, two newly added nodes may be connected through a meta-path of the original network but not within the change set. In the model of the present invention, by contrast, the change set is constructed on the basis of the original metagraphs; that is, when training the change set, after the change process has finished, the nodes are trained on their original metagraphs. By doing so, the model of the present invention is better suited to training meta-paths and metagraphs. Furthermore, this operation ensures that the nodes in the change set are not embedded in isolation, as they remain connected to the original network. Note that a node may be involved in multiple scenarios, but a node is included in the change set only once, to avoid repeatedly computing node embeddings. After a node is included, it is computed according to the metagraph to which it belongs, and the change process of that metagraph can describe all possible scenarios experienced by the node.
After the change set $\Delta G_t$ is obtained, the present invention trains only $\Delta G_t$ instead of retraining the entire network, using the metagraph-based complex mechanism to obtain the embeddings of the changed nodes. Specifically, the node embeddings $Y_t$ evolve into $Y_{t+1}$ at timestamp t+1 by removing the representations of deleted nodes $Y_t^{\mathrm{del}}$, adding the embeddings of newly added nodes $Y_t^{\mathrm{add}}$, and replacing the representations of nodes changed in the triadic closure or opening processes $Y_t^{\mathrm{ch}}$.
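A small illustrative sketch of this $Y_t$ to $Y_{t+1}$ update; the dict-based bookkeeping and helper name are assumptions rather than the patent's notation:

```python
import numpy as np

def update_embeddings(Y_t, deleted, added, changed):
    """Y_t: node id -> embedding; deleted: ids; added/changed: id -> embedding."""
    Y_next = {v: e for v, e in Y_t.items() if v not in deleted}  # drop Y_t^del
    Y_next.update(added)    # embeddings of newly added nodes      (Y_t^add)
    Y_next.update(changed)  # nodes retrained after triadic closure/opening
    return Y_next

Y_t = {"a1": np.zeros(4), "a2": np.ones(4)}
Y_t1 = update_embeddings(Y_t, deleted={"a2"}, added={"a3": np.full(4, 2.0)},
                         changed={"a1": np.full(4, 0.5)})
```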
The training target will now be described. To obtain the dynamic heterogeneous information network embedding from timestamps 1 to t and observe the changes in the graph, the present invention first uses a negative sampling strategy to form a training data set. Nodes u and v are connected by a metagraph s in a positive triple (u,s,v), and nodes u' and v' are connected by a metagraph s' in a negative triple (u',s',v'). For each positive heterogeneous information network triple (u,s,v), the present invention generates a negative triple by randomly replacing u and v with other nodes while restricting them to be of the same type as the replaced node. Replaced heterogeneous information network triples that are still positive are filtered out after sampling. Note that the number of candidates for s is much smaller than for u or v, so negative samples are generated only by replacing u and v.
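For illustration, a minimal sketch of this sampling strategy; the data layout (node_type, nodes_by_type) is an assumption, and ratio=5 matches the negative sampling rate reported in the experimental section:

```python
import random

def sample_negatives(positive_triples, node_type, nodes_by_type, ratio=5):
    """Corrupt u and v (never s), keeping node types fixed; drop accidental positives."""
    positives = set(positive_triples)
    negatives = []
    for (u, s, v) in positive_triples:
        made = 0
        while made < ratio:
            u2 = random.choice(nodes_by_type[node_type[u]])
            v2 = random.choice(nodes_by_type[node_type[v]])
            if (u2, s, v2) not in positives:  # filter still-positive replacements
                negatives.append((u2, s, v2))
                made += 1
    return negatives
```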
After sampling, the invention possesses training data in the form $(u,s,v,H_{uv})$, where $H_{uv}\in\{1,0\}$ is a binary value indicating whether the HIN triple is positive. For a training example $(u,s,v,H_{uv})$, if $H_{uv}=1$, the objective function $O_{(u,s,v)}$ aims to maximize $P(s\mid u,v)$; otherwise $P(s\mid u,v)$ should be minimized. Thus, the objective function is as follows:

$$O_{(u,s,v)}=P(s\mid u,v)^{H_{uv}}\,\big[1-P(s\mid u,v)\big]^{1-H_{uv}}. \quad (9)$$

To simplify the calculations, the logarithm of $O_{(u,s,v)}$ is defined as

$$\log O_{(u,s,v)}=H_{uv}\log P(s\mid u,v)+(1-H_{uv})\log\big[1-P(s\mid u,v)\big], \quad (10)$$

where $P(s\mid u,v)$ is defined as

$$P(s\mid u,v)=\mathrm{sigmoid}\big(f_s(u,v)\big). \quad (11)$$

In particular, the objective of the invention is to maximize $\log O_{(u,s,v)}$. If a triple $(u,s,v)$ holds, $H_{uv}=1$, and the objective function becomes

$$\log O_{(u,s,v)}=\log P(s\mid u,v). \quad (12)$$

Maximizing this objective maximizes the probability $P(s\mid u,v)$; in turn, the invention obtains embeddings of u, v and s that maximize the probability that $(u,s,v)$ holds. Likewise, for a negative sample, where the triple $(u,s,v)$ does not hold and $H_{uv}=0$, the objective function becomes

$$\log O_{(u,s,v)}=\log\big[1-P(s\mid u,v)\big]. \quad (13)$$

Maximizing this objective minimizes the probability $P(s\mid u,v)$; accordingly, the invention obtains embeddings of u, v and s that minimize the probability that $(u,s,v)$ holds.

The present invention maximizes the above objective function using the stochastic gradient descent (SGD) algorithm with adaptive moment estimation (Adam). Specifically, for each training entry $(u,s,v,H_{uv})$, the embeddings are adjusted by back-propagating the partial derivatives of $\log O_{(u,s,v)}$ with respect to $\mathbf{u}$, $\mathbf{v}$ and $\mathbf{s}$, respectively (equations 14 to 16):

$$\frac{\partial\log O_{(u,s,v)}}{\partial\mathbf{u}},\qquad \frac{\partial\log O_{(u,s,v)}}{\partial\mathbf{v}},\qquad \frac{\partial\log O_{(u,s,v)}}{\partial\mathbf{s}}.$$
Algorithm 2 summarizes the dynamic embedding algorithm.
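Algorithm 2 likewise appears only as an image in the original. A minimal sketch of what the dynamic embedding step plausibly does, reusing f_score from the Algorithm 1 sketch above; the change_triples layout and the warm-start loop are assumptions:

```python
import torch

def dynamic_embedding(change_triples, node_emb, mg_emb, lr=0.025, epochs=20):
    """change_triples: timestamp -> list of labelled (u, s, v, y) triples
    touching the change set; embeddings are warm-started, so gradients flow
    only to the rows indexed by the changed triples."""
    node_re, node_im = node_emb
    mg_re, mg_im = mg_emb
    opt = torch.optim.Adam([node_re, node_im, mg_re, mg_im], lr=lr)
    for t in sorted(change_triples):
        u, s, v, y = (torch.tensor(x) for x in zip(*change_triples[t]))
        for _ in range(epochs):
            opt.zero_grad()
            logits = f_score(mg_re[s], mg_im[s], node_re[u], node_im[u],
                             node_re[v], node_im[v])
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                logits, y.float())
            loss.backward()
            opt.step()
    return (node_re, node_im), (mg_re, mg_im)
```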
After completing the above two steps, the model of the present invention is able to generate a heterogeneous information network representation at each timestamp; however, it cannot yet predict future structures in a dynamic heterogeneous information network. In other words, it can only generate node embeddings from the observed network evolution and cannot describe changes that may occur but have not yet been seen. To solve this problem, the present invention proposes a deep autoencoder model based on a long short-term memory network, which can generate future heterogeneous information network representations from the preceding sequential structural evolution. FIG. 5 illustrates this LSTM-based deep autoencoder.
Note that in predicting future networks, the present invention trains only the changed metagraphs, rather than every metagraph. As described above, each node is included in an original metagraph once, which also saves training time. Each changed metagraph is trained with the autoencoder. Accordingly, each node in the change set is computed only once, whether it is popular or not; in other words, the present invention treats each node in the change set equally. To predict the future state of a node, knowing its dynamic course matters more than its popularity. Thus, after obtaining the dynamic course of a given node, the present invention is able to predict its future state, whether the node is popular or not.
The deep autoencoder model of the present invention consists of an encoder part and a decoder part. To construct the input of the encoder, for a node, the present invention takes its metagraph as its neighborhood to form its adjacency matrix A. Then, for any pair of nodes u and v in the metagraph s, the encoder input consists of the time-ordered adjacency vectors of u and v, denoted $[\mathbf{a}_u^{1},\ldots,\mathbf{a}_u^{t}]$ and $[\mathbf{a}_v^{1},\ldots,\mathbf{a}_v^{t}]$, respectively. Specifically, $\mathbf{a}_u$ is a combination of two components: one is the row of the adjacency matrix A representing the nodes adjacent to u, further mapped to a d-dimensional vector through a fully connected layer; the other is the dynamic node embedding of node u learned in Algorithm 1 and Algorithm 2. The encoder then processes the input to obtain low-dimensional representations $\mathbf{y}_u$ and $\mathbf{y}_v$. The encoder aims to predict the neighborhoods $\mathbf{a}_u^{t+1}$ and $\mathbf{a}_v^{t+1}$ from the embeddings at timestamp t, and the adjacency vectors predicted by the deep autoencoder are denoted $\hat{\mathbf{a}}_u^{t+1}$ and $\hat{\mathbf{a}}_v^{t+1}$.
in particular, for node u and its neighborhood
Figure BDA0003404098220000177
Where d is the embedding dimension and t is the total time step, the concealment of the first layer is represented as
Figure BDA0003404098220000178
Wherein
Figure BDA0003404098220000181
Is a parameter matrix of the first layer of the auto-encoder, d(1)Is the representation dimension of the first layer.
Figure BDA0003404098220000182
For the first layer of the encoder, f is the activation function of the sigmoid. The output of the encoder (k layers) is then calculated as follows:
Figure BDA0003404098220000183
to fully capture information about the past evolution of the metagraph, the present invention further applies a long short term memory network layer on the output of the encoder. For the first long-short term memory network layer, the hidden state representation is calculated as
Figure BDA0003404098220000184
Figure BDA0003404098220000185
Figure BDA0003404098220000186
Figure BDA0003404098220000187
Figure BDA0003404098220000188
Figure BDA0003404098220000189
Wherein
Figure BDA00034040982200001810
In order to activate the value of the input gate,
Figure BDA00034040982200001811
in order to activate the value of the forgetting gate,
Figure BDA00034040982200001812
is a new predicted candidate state for the current state,
Figure BDA00034040982200001813
the unit states of the long-short term memory network,
Figure BDA00034040982200001814
to activate the value of the output gate, δ represents the activation function, here an S-shaped function is used,
Figure BDA00034040982200001815
in the form of a matrix of parameters,
Figure BDA00034040982200001816
the deviation is indicated. d(k+1)Representing the presentation dimension of the k +1 layer.
Assuming that the long short-term memory network has one layer, its final output can be expressed as

$$\mathbf{h}_t=\mathbf{o}_t\odot\tanh(\mathbf{C}_t),$$

where $\mathbf{h}_t\in\mathbb{R}^{d^{(k+1)}}$.
Finally, the aim of the invention is to minimize the following loss function:

$$L_{t+1}=\big\|\big(\hat{\mathbf{a}}_u^{t+1}-\mathbf{a}_u^{t+1}\big)\odot\mathcal{B}\big\|_F^2. \quad (19)$$

The invention penalizes incorrect neighborhood reconstruction at time t+1 using the embedding at time t; the deep autoencoder model based on the long short-term memory network can therefore predict node embeddings at future timestamps. To simplify notation, $f(\cdot)$ denotes the function used to generate the predicted neighborhood at timestamp t+1, implemented with the autoencoder framework described above, i.e. $\hat{\mathbf{a}}_u^{t+1}=f(\mathbf{a}_u^{1},\ldots,\mathbf{a}_u^{t})$. $\mathcal{B}$ is a hyperparameter matrix that balances the weight of penalizing observed neighbors, and $\odot$ indicates the element-wise product.
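A one-function sketch of this weighted reconstruction loss; the tensor shapes and the helper name are illustrative:

```python
import torch

def weighted_recon_loss(a_hat, a_true, B):
    # Eq. (19): || (a_hat - a_true) ⊙ B ||_F^2
    return (((a_hat - a_true) * B) ** 2).sum()
```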
To predict the heterogeneous information network embedding at the future timestamp t+1, the invention optimizes the objective function of the LSTM-based deep autoencoder framework. Specifically, the gradient of equation 19 with respect to the decoder weights is applied as shown below:

$$\frac{\partial L_{t+1}}{\partial W^{(k+1)}}=\Big[2\big(\hat{\mathbf{a}}_u^{t+1}-\mathbf{a}_u^{t+1}\big)\odot\mathcal{B}\Big]\frac{\partial\hat{\mathbf{a}}_u^{t+1}}{\partial W^{(k+1)}}, \quad (20)$$

where $W^{(k+1)}$ is the parameter matrix of layer k+1 of the autoencoder. After calculating the derivatives, the present invention applies the SGD algorithm with Adam to train the model.
Algorithm 3 details the LSTM-based deep autoencoder for graph prediction. It incorporates the evolution of the continuously changing metagraphs from Algorithm 1 and Algorithm 2 to form the adjacency matrix, and also forms $\mathbf{a}_v^{t}$, the input of the autoencoder, using the dynamic embeddings learned in Algorithm 1 and Algorithm 2.
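Algorithm 3 is also reproduced only as an image. The sketch below shows one plausible PyTorch realization of the LSTM-based deep autoencoder described above; the class name and hidden sizes are assumptions, while the two encoder layers and two LSTM layers follow the configuration reported later:

```python
import torch
import torch.nn as nn

class LSTMGraphAutoencoder(nn.Module):
    def __init__(self, n_nodes, d=128, hidden=256):
        super().__init__()
        # encoder: adjacency row plus dynamic node embedding -> low-dim y
        self.encoder = nn.Sequential(
            nn.Linear(n_nodes + d, hidden), nn.Sigmoid(),
            nn.Linear(hidden, d), nn.Sigmoid())
        self.lstm = nn.LSTM(input_size=d, hidden_size=d,
                            num_layers=2, batch_first=True)
        # decoder: last LSTM hidden state -> predicted adjacency row at t+1
        self.decoder = nn.Sequential(
            nn.Linear(d, hidden), nn.Sigmoid(),
            nn.Linear(hidden, n_nodes), nn.Sigmoid())

    def forward(self, adj_seq, emb_seq):
        # adj_seq: (batch, t, n_nodes) adjacency rows over time
        # emb_seq: (batch, t, d) dynamic embeddings from Algorithms 1 and 2
        y = self.encoder(torch.cat([adj_seq, emb_seq], dim=-1))
        h, _ = self.lstm(y)
        return self.decoder(h[:, -1])  # predicted neighborhood at t+1
```

In a training loop, the output would be scored against the true adjacency row at t+1 with the weighted reconstruction loss sketched above and optimized with Adam.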
To evaluate the performance of M-DHIN, the present invention performed experiments using four dynamic datasets extracted from DBLP, YELP, YAGO and Freebase. Descriptive statistics of these datasets are shown in Table 2. For simplicity, the invention reports only the statistics of the initial and final timestamps for the different time spans (months and years).
Table 2. Dataset statistics.
DBLP is a bibliographic dataset in computer science. The present invention extracts a subset spanning 15 consecutive monthly timestamps from October 2015 to December 2016. In October 2015, it contained 110,634 papers (P), 92,473 authors (A), 4,274 topics (T) and 118 publication venues (V). In November 2016, it contained 135,348 papers (P), 116,137 authors (A), 4,476 topics (T) and 121 publication venues (V). Authors are divided into four label areas: database, machine learning, data mining and information retrieval.
YELP is a social media dataset containing restaurant reviews. The extracted dynamic HIN has 12 consecutive monthly snapshots from January 2016 to December 2016. In January 2016, it contained 81,240 reviews (V), 43,927 customers (C), 74 food-related keywords (K) and 23,421 restaurants (R). In December 2016, it contained 102,367 reviews (V), 51,299 customers (C), 86 food-related keywords (K) and 29,777 restaurants (R). Restaurants are divided into three types: American restaurants, sushi restaurants and fast food.
YAGO captures world knowledge, and the present invention extracts a subset containing 10 annual snapshots of movies from 2009 to 2018. In 2009, there were 5,334 movies (M), 8,346 actors (A), 1,345 directors (D), 1,123 composers (C) and 2,876 producers (P). In 2018, there were 7,476 movies (M), 10,212 actors (A), 1,872 directors (D), 1,342 composers (C) and 3,537 producers (P). Movies are divided into five types: horror, action, adventure, crime and science fiction.
Freebase contains world knowledge and facts, and the extracted subset relates to video games. It consists of 12 monthly snapshots from January 2016 to December 2016. At the beginning of January 2016, it contained 3,435 games (G), 1,284 publishers (P), 1,768 developers (D) and 154 designers (S). By December 2016, it contained 4,122 games (G), 1,673 publishers (P), 2,022 developers (D) and 201 designers (S). The games fall into one of three categories: action, adventure and strategy.
In terms of experimental evaluation, the invention holds that measuring the performance of different models on the anomaly detection task reflects the degree to which the models can describe and capture the characteristics of the dynamic HIN: anomaly detection tests a model's ability to detect unexpected events during dynamic HIN evolution.
The present invention incorporates two types of benchmarks: one consisting of static embedding methods and the other of dynamic embedding methods. For the static embedding methods, both homogeneous and heterogeneous approaches are considered. DeepWalk and node2vec were originally designed to represent homogeneous networks; metapath2vec and MetaGraph2Vec are designed for heterogeneous networks using meta-paths and metagraphs, respectively. Note that the present invention does not apply methods that use textual information, since its datasets contain only nodes and edges.
DeepWalk captures the structural information of the HIN using random walks and learns the representation using homogeneous SkipGram. It has two main hyperparameters, the walk length (wl) of the random walk and the window size (ws) of the SkipGram mechanism. To report the best performance, the present invention uses grid search over wl ∈ {20,40,60,80} and ws ∈ {3,5,7} to find the best configuration for different tasks.
node2vec is an extension of DeepWalk: it uses biased random walks to better explore the structure, and likewise uses SkipGram to learn the network embedding. The present invention uses the same wl and ws as for DeepWalk. For its bias parameters p and q, the invention performs a grid search over p ∈ {0.5,1,1.5,2,5} and q ∈ {0.5,1,1.5,2,5}.
metapath2vec captures the structural information of the HIN using meta-paths and learns embeddings using heterogeneous SkipGram, which restricts the context window to one specific type. The present invention uses the same wl and ws as for DeepWalk.
MetaGraph2Vec builds a metagraph by simply combining several meta-paths and is essentially a path-oriented model. It then learns the final representation using heterogeneous SkipGram. The present invention sets wl and ws to the same values as for DeepWalk.
For fair comparison, the present invention also evaluated the performance of four dynamic embedding models: DynamicTriad, DynGEM, dyngraph2vec and Change2vec.
DynamicTriad describes the evolution of a network based only on the triadic closure process and is designed for homogeneous networks. β0 and β1 are two hyperparameters, representing the weight of the triadic closure process and the weight of temporal smoothness, respectively. The invention performs a grid search over β0 ∈ {0.01, 0.1, 1, 10} and β1 ∈ {0.01, 0.1, 1, 10}.
Change2vec first learns the initial embedding of dynamic HINs, and then samples the changing set of nodes for training using the metapath2vec model. The present invention sets its configuration to be the same as metapath2 vec.
DynGEM captures the dynamics of the HIN at timestamp t by a deep autoencoder using only the snapshot at timestamp t-1. α, ν1 and ν2 are relative-weight hyperparameters selected by grid search from α ∈ {10^-6, 10^-5}, ν1 ∈ {10^-4, 10^-6} and ν2 ∈ {10^-3, 10^-6}.
dyngraph2vec uses a deep-LSTM-based autoencoder to process the previous snapshots from a look-back window of length lb, i.e., training snapshots of length lb. M-DHIN trains on all previous timestamp snapshots through metagraph embedding and uses the autoencoder only to predict the final snapshot, whereas dyngraph2vec learns the embeddings of all snapshot graphs with the autoencoder. Thus, due to limited hardware resources, lb is limited, as noted in the original publication, to at most 10; lb is therefore selected from {3,4,5,6,7,8,9,10}.
As for the other parameters, such as learning rate and embedding dimension, the present invention directly adopts the optimal settings reported in the respective papers.
The present invention also adds a variant of M-DHIN, named M-DHIN-MG, which uses only dynamic complex embedding through the metagraph, without the deep-LSTM-based autoencoder mechanism, to measure the effectiveness of the autoencoder in an ablation analysis.
To evaluate the performance of M-DHIN, the present invention utilizes a grid search to find the best experimental configuration. Specifically, the node and metagraph embedding dimension is selected from {32,64,128,256}, the learning rate in SGD from {0.01,0.02,0.025,0.05,0.1}, the negative sampling ratio from {3,4,5,6,7}, the number of autoencoder layers from {2,3,4}, the number of LSTM layers from {2,3,4}, and the number of training epochs from {5,10,15,20,25,30,35,40}. To balance effectiveness and efficiency, the following configuration was selected to generate the experimental results reported below: the embedding dimension is set to 128, the learning rate to 0.025, the negative sampling rate to 5 (i.e., 5 negative samples per positive sample), the numbers of layers in the autoencoder and in the LSTM both to 2, and the number of training epochs to 20.
All experiments were performed on a 64-bit Ubuntu 16.04.1 LTS system with an Intel(R) Core(TM) i9-7900X CPU, 64 GB RAM and a GTX-1080 GPU with 8 GB of memory.
The stability of M-DHIN was measured by parameter sensitivity analysis and statistical significance was reported using paired two-tailed t-tests.
The anomaly detection task is to detect newly arriving nodes or edges that do not belong to an existing cluster. The invention uses the k-means clustering algorithm to generate clusters based on the dynamic HIN, and then uses anomaly injection to create anomalous nodes and edges.
Specifically, the present invention injects 1% and 5% anomalies to build the test datasets, and uses the area under the curve (AUC) score to evaluate the performance of all models; a higher AUC score represents better performance.
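For illustration, a hedged sketch of this evaluation protocol; the 1%/5% injection ratio, k-means clustering and AUC scoring follow the text, while the distance-to-nearest-centroid anomaly score and all names are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score

def evaluate_anomaly_detection(X_normal, inject_ratio=0.01, k=10, seed=0):
    rng = np.random.default_rng(seed)
    n_anom = max(1, int(inject_ratio * len(X_normal)))
    # inject anomalies: embeddings drawn far from the normal clusters
    X_anom = rng.normal(loc=10.0, scale=1.0, size=(n_anom, X_normal.shape[1]))
    X = np.vstack([X_normal, X_anom])
    y = np.r_[np.zeros(len(X_normal)), np.ones(n_anom)]
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_normal)
    scores = km.transform(X).min(axis=1)  # distance to the nearest cluster center
    return roc_auc_score(y, scores)       # higher AUC = better detection
```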
Table 3 shows the experimental results. M-DHIN is significantly better than the other baselines when detecting both 1% and 5% anomalies on every dataset. metapath2vec and MetaGraph2Vec outperform the homogeneous static network models DeepWalk and node2vec, indicating that meta-paths are more powerful than random walks in identifying anomalies: a meta-path defines specific relationships and is therefore more sensitive to outliers, whereas a random walk treats every node equally. As described above, MetaGraph2Vec is in effect a meta-path-based model, because it simply combines meta-paths. Change2vec outperforms DynGEM and dyngraph2vec because meta-paths are more expressive than the adjacency matrix of a node neighborhood: the adjacency matrix cannot describe the specific relationships between nodes, only the existence of nodes and edges, which it treats equally, and it is therefore less effective in anomaly detection. Note that M-DHIN-MG is superior to Change2vec, which confirms that the GRAMI-generated metagraphs of the present invention are more sensitive to anomalies than meta-paths, because metagraphs can express more relational information. The performance of M-DHIN is even better than that of M-DHIN-MG, which the invention attributes to using historical information to help identify whether newly arrived nodes or edges are anomalous.
TABLE 3 Experimental results of the anomaly detection task
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to include either of the permutations as a matter of course. That is, if X employs A; b is used as X; or X employs both A and B, then "X employs A or B" is satisfied in any of the foregoing examples.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art upon reading and understanding this specification and the annexed drawings. The present disclosure includes all such modifications and alterations and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above-described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component that performs the specified function of the described component (i.e., that is functionally equivalent), even if not structurally equivalent to the disclosed structure that performs the function in the exemplary implementations of the disclosure illustrated herein. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such a feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes", "has", "contains" or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If implemented as a software functional module and sold or used as a stand-alone product, the integrated module may also be stored in a computer-readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Each apparatus or system described above may execute the storage method in the corresponding method embodiment.
In summary, the above-mentioned embodiment is one implementation manner of the present invention, but the implementation of the present invention is not limited thereto; any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims (9)

1. An anomaly detection method based on the dynamic heterogeneous information network representation of the metagraph, characterized by comprising the following steps:
step 1, acquiring scientific-collaboration dynamic heterogeneous information network data comprising network nodes and network edge data;
step 2, introducing a complex-space embedding mechanism to represent the given dynamic heterogeneous information network at timestamp 1;
step 3, learning the dynamic heterogeneous information network from timestamp 2 to timestamp t by adopting a ternary metagraph dynamic embedding mechanism;
step 4, processing the heterogeneous information networks from timestamp 1 to timestamp t with a deep autoencoder based on a long short-term memory network, and predicting the graph at timestamp t+1 after analysis and calculation;
step 5, performing anomaly detection on the nodes in the network using the graph data from timestamps 1 to t+1 to obtain the anomaly detection result.
2. The method of claim 1, wherein the embedding mechanism in step 2 represents the network by a metagraph-based complex-space embedding scheme. Let $G_1=(V_1,E_1)$ be the initial heterogeneous information network at timestamp 1. To represent the relationship between nodes and metagraphs, the concept of a heterogeneous information network triple is introduced: a triple is written $(u,s,v)$, where $u$ is the first node in the metagraph, $v$ is the last node, and $s$ is the metagraph connecting $u$ and $v$; $\mathbf{u},\mathbf{v}\in\mathbb{R}^d$ and $\mathbf{s}\in\mathbb{R}^d$ are the representation vectors of $u$, $v$ and $s$, respectively, and $d$ is the dimension of the representation vectors. The probability that a heterogeneous information network triple holds is expressed as

$$P(s\mid u,v)=\sigma(X_{uv}),$$

where $X$ is a scoring matrix and $\sigma$ is an activation function. For a heterogeneous information network triple $(u,s,v)$, its complex-space embedding is written $\mathbf{u}=\mathrm{Re}(\mathbf{u})+i\,\mathrm{Im}(\mathbf{u})$, $\mathbf{v}=\mathrm{Re}(\mathbf{v})+i\,\mathrm{Im}(\mathbf{v})$ and $\mathbf{s}=\mathrm{Re}(\mathbf{s})+i\,\mathrm{Im}(\mathbf{s})$, where $\mathrm{Re}(\cdot)\in\mathbb{R}^d$ and $\mathrm{Im}(\cdot)\in\mathbb{R}^d$ denote the real and imaginary parts of the vector, respectively.

The Hadamard product is introduced to capture the relationship of $\mathbf{u}$, $\mathbf{v}$ and $\mathbf{s}$ in complex space, expressed as

$$\mathbf{u}\circ\bar{\mathbf{v}}=\big(\mathrm{Re}(\mathbf{u})\circ\mathrm{Re}(\mathbf{v})+\mathrm{Im}(\mathbf{u})\circ\mathrm{Im}(\mathbf{v})\big)+i\,\big(\mathrm{Im}(\mathbf{u})\circ\mathrm{Re}(\mathbf{v})-\mathrm{Re}(\mathbf{u})\circ\mathrm{Im}(\mathbf{v})\big),$$

where $\bar{\mathbf{v}}$ is the complex conjugate of $\mathbf{v}$ and $\circ$ is the element-wise product. One element of the scoring matrix is finally

$$X_{uv}=\mathrm{Re}\big(\langle\mathbf{u},\mathbf{s},\bar{\mathbf{v}}\rangle\big),$$

and the corresponding score function is defined as

$$f(u,s,v)=\mathrm{Re}\Big(\sum_{k=1}^{d}u_k\,s_k\,\bar{v}_k\Big),$$

where $\langle\cdot,\cdot,\cdot\rangle$ denotes the standard element-wise multilinear dot product.
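As a concrete illustration, the following is a minimal Python sketch of the complex-space triple scoring above, in the ComplEx-style form $\mathrm{Re}(\langle\mathbf{u},\mathbf{s},\bar{\mathbf{v}}\rangle)$. The dimension, random vectors, and all names are illustrative assumptions, not the patent's own implementation:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)

def random_complex(dim):
    """Draw a complex embedding vector u = Re(u) + i*Im(u)."""
    return rng.normal(size=dim) + 1j * rng.normal(size=dim)

u, s, v = (random_complex(d) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(u, s, v):
    # Multilinear dot product with the complex conjugate of v; only the
    # real part contributes to the score.
    return np.real(np.sum(u * s * np.conj(v)))

# Probability that the HIN triple (u, s, v) holds: P(s|u,v) = sigma(X_uv).
p = sigmoid(score(u, s, v))
print(f"P(s|u,v) = {p:.4f}")
```

Taking only the real part of the multilinear product is what lets a single complex embedding capture asymmetric relations between $u$ and $v$.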
3. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 1, wherein a triple in the ternary metagraph dynamic embedding mechanism of step 3 is a set of three nodes: it is called a closed triple if every pair of the three nodes is connected, and an open triple if there are only two edges among the three nodes. To obtain the dynamic heterogeneous information network embedding from timestamps 1 to t, a negative sampling strategy is first used to form the training data set: in a positive triple (u, s, v) the nodes u and v are connected by the metagraph s, and in a negative triple (u', s', v') the nodes u' and v' are connected by the metagraph s'. For each positive heterogeneous information network triple (u, s, v), a negative triple is generated by randomly replacing u and v with other nodes whose types are restricted to be the same as those of the replaced nodes, and any replaced triple that is still positive is filtered out after sampling. The evolution of an open triple structure into a closed triple, and of a closed triple into an open triple structure, are the basic changes of dynamic heterogeneous information network evolution; the positive and negative evolution triples are used as the training set, and the complex-space embedding mechanism of step 2 is adopted for training to obtain the representation learning of the dynamic heterogeneous information network from timestamps 1 to t.
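A toy sketch of this negative-sampling strategy follows: a positive triple is corrupted by replacing its head or tail with a random node of the same type, and corruptions that are still positive are discarded. The node types, triple store, and names are assumptions for illustration:

```python
import random

nodes_by_type = {"author": ["a1", "a2", "a3"], "paper": ["p1", "p2", "p3"]}
node_type = {n: t for t, ns in nodes_by_type.items() for n in ns}
positives = {("a1", "writes", "p1"), ("a2", "writes", "p2")}

def corrupt(triple, max_tries=20):
    u, s, v = triple
    for _ in range(max_tries):
        if random.random() < 0.5:  # replace the head, keeping its type
            candidate = (random.choice(nodes_by_type[node_type[u]]), s, v)
        else:                      # replace the tail, keeping its type
            candidate = (u, s, random.choice(nodes_by_type[node_type[v]]))
        # Filter out corrupted triples that are still positive.
        if candidate not in positives and candidate != triple:
            return candidate
    return None

for pos in positives:
    print(pos, "->", corrupt(pos))
```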
4. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 3, characterized in that there are four kinds of changes to the dynamic heterogeneous information network; a code sketch of all four follows this list:

(1) An added edge forms a ternary closure process: collect all metagraphs in which three nodes that previously had only two edges among them become fully interconnected; at timestamp t these constitute the changed training set $S_{\mathrm{add}}^{t}$. For three nodes $v_1$, $v_2$ and $v_3$ in a metagraph $s$, with $(v_1,v_2)$ denoting the edge between $v_1$ and $v_2$, the metagraph obtained after the ternary closure process is defined as

$$s^{t}=\{(v_1,v_2),(v_2,v_3)\},\qquad s^{t+1}=s^{t}\cup\{(v_1,v_3)\}.$$

(2) A deleted edge results in a ternary opening process: collect all metagraphs whose triples evolve from a triangle into a path with two edges; at timestamp t these nodes are included in $S_{\mathrm{del}}^{t}$, and the metagraph after the ternary opening process is defined as

$$s^{t}=\{(v_1,v_2),(v_2,v_3),(v_1,v_3)\},\qquad s^{t+1}=s^{t}\setminus\{(v_1,v_3)\}.$$

(3) An added node: given an existing node $v_1$ in the metagraph and a newly added node $v_2$, the metagraph $s^{t}$ is expanded into $s^{t+1}=s^{t}\cup\{(v_1,v_2)\}$.

(4) A deleted node: given existing nodes $v_1$ and $v_2$ in the metagraph, if $v_2$ is deleted, then $s^{t}$ becomes $s^{t+1}=s^{t}\setminus\{(v_1,v_2)\}$.
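The four elementary changes can be sketched as set operations over a metagraph's edges. Representing an edge as a frozenset of two node names is an assumption made here for illustration:

```python
def closure(edges, v1, v3):
    """(1) An added edge closes an open triad: add (v1, v3)."""
    return edges | {frozenset((v1, v3))}

def opening(edges, v1, v3):
    """(2) A deleted edge opens a closed triad: remove (v1, v3)."""
    return edges - {frozenset((v1, v3))}

def add_node(edges, v1, v2):
    """(3) A newly added node v2 attaches to existing node v1."""
    return edges | {frozenset((v1, v2))}

def delete_node(edges, v2):
    """(4) A deleted node v2 takes all of its incident edges with it."""
    return {e for e in edges if v2 not in e}

s_t = {frozenset(("v1", "v2")), frozenset(("v2", "v3"))}   # open triad
s_t1 = closure(s_t, "v1", "v3")                            # now a triangle
assert opening(s_t1, "v1", "v3") == s_t                    # (2) undoes (1)
print(sorted(tuple(sorted(e)) for e in s_t1))
```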
5. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 1, characterized in that a change set is constructed on the basis of the original metagraphs; that is, when the change set is trained, nodes are trained on the metagraph obtained after the change process has finished. After the changed metagraph $s^{t+1}$ is obtained, only the change set $S^{t+1}$ is trained, and the embeddings of the changed nodes are obtained using the metagraph-based complex-space mechanism instead of retraining the entire network.
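As a tiny illustration of this incremental scheme (the triple store and changed-node set below are assumptions), only the triples that touch changed nodes would be re-trained, and all other embeddings would be reused:

```python
all_triples = [("a1", "writes", "p1"), ("a2", "writes", "p2"),
               ("a3", "writes", "p2")]
changed_nodes = {"a3", "p2"}   # nodes touched by this timestamp's changes

# Only triples involving changed nodes go through the complex-space
# training step; everything else keeps its existing embedding.
train_set = [t for t in all_triples
             if t[0] in changed_nodes or t[2] in changed_nodes]
print(train_set)
```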
6. The method of claim 1, wherein in step 4 the deep autoencoder model consists of an encoder and a decoder. To construct the encoder input for a node, its metagraph is taken as its neighboring nodes to form its adjacency matrix $A$; then, for any pair of nodes $u$ and $v$ in the metagraph, the encoder input consists of the time-series adjacency vectors of $u$ and $v$, denoted $\mathbf{a}_u^{1..t}$ and $\mathbf{a}_v^{1..t}$, respectively. $\mathbf{a}_u$ is the combination of two parts: one part is the row of the adjacency matrix $A$ that records the nodes adjacent to $u$, further mapped to a $d$-dimensional vector through a fully connected layer, and the other part is the dynamic node embedding of node $u$. The encoder then processes the input to obtain low-dimensional representations $\mathbf{y}_u$ and $\mathbf{y}_v$. The goal of the encoder is to predict the neighborhoods $\mathbf{a}_u^{t+1}$ and $\mathbf{a}_v^{t+1}$ from the embedding at timestamp t; the adjacency vectors predicted by the deep autoencoder are denoted $\hat{\mathbf{a}}_u^{t+1}$ and $\hat{\mathbf{a}}_v^{t+1}$.
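The two-part encoder input can be sketched as follows. The use of PyTorch, the sizes, and the random tensors are assumptions for illustration; the claim only fixes the structure (adjacency row through a fully connected layer, concatenated with the dynamic embedding):

```python
import torch
import torch.nn as nn

n_nodes, d, t_steps = 50, 16, 4

fc = nn.Linear(n_nodes, d)                       # maps adjacency row -> R^d
adj_rows = torch.rand(t_steps, n_nodes).round()  # adjacency row of u at each t
dyn_emb = torch.randn(t_steps, d)                # dynamic embedding of u at each t

# Time-series input a_u in R^{t x 2d}: [FC(adjacency row) || dynamic embedding]
a_u = torch.cat([fc(adj_rows), dyn_emb], dim=-1)
print(a_u.shape)  # torch.Size([4, 32])
```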
7. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 6, characterized in that, for a node $u$ and its neighborhood $\mathbf{a}_u\in\mathbb{R}^{t\times d}$, where $d$ is the embedding dimension and $t$ is the total number of time steps, the hidden representation at the first layer is expressed as

$$\mathbf{y}_u^{(1)}=f\big(W^{(1)}\mathbf{a}_u+\mathbf{b}^{(1)}\big),$$

where $W^{(1)}$ is the parameter matrix of layer 1 of the autoencoder, $d^{(1)}$ is the representation dimension of layer 1, $\mathbf{b}^{(1)}$ is the bias of layer 1 of the encoder, and $f$ denotes a sigmoid activation function.

The output of the k-th layer of the encoder is calculated as follows:

$$\mathbf{y}_u^{(k)}=f\big(W^{(k)}\mathbf{y}_u^{(k-1)}+\mathbf{b}^{(k)}\big).$$

To fully capture the information about the past evolution of the metagraph, a long short-term memory (LSTM) network layer is further applied on the output of the encoder. For the first LSTM layer, the hidden-state representation is calculated as:

$$\mathbf{i}_u^{\,\tau}=\delta\big(W_i^{(k+1)}\mathbf{y}_u^{(k),\tau}+U_i\mathbf{h}_u^{\,\tau-1}+\mathbf{b}_i\big),$$
$$\mathbf{f}_u^{\,\tau}=\delta\big(W_f^{(k+1)}\mathbf{y}_u^{(k),\tau}+U_f\mathbf{h}_u^{\,\tau-1}+\mathbf{b}_f\big),$$
$$\tilde{\mathbf{C}}_u^{\,\tau}=\tanh\big(W_C^{(k+1)}\mathbf{y}_u^{(k),\tau}+U_C\mathbf{h}_u^{\,\tau-1}+\mathbf{b}_C\big),$$
$$\mathbf{C}_u^{\,\tau}=\mathbf{f}_u^{\,\tau}\circ\mathbf{C}_u^{\,\tau-1}+\mathbf{i}_u^{\,\tau}\circ\tilde{\mathbf{C}}_u^{\,\tau},$$
$$\mathbf{o}_u^{\,\tau}=\delta\big(W_o^{(k+1)}\mathbf{y}_u^{(k),\tau}+U_o\mathbf{h}_u^{\,\tau-1}+\mathbf{b}_o\big),$$
$$\mathbf{h}_u^{\,\tau}=\mathbf{o}_u^{\,\tau}\circ\tanh\big(\mathbf{C}_u^{\,\tau}\big),$$

where $\mathbf{i}_u^{\,\tau}$ is the activation value of the input gate, $\mathbf{f}_u^{\,\tau}$ is the activation value of the forget gate, $\tilde{\mathbf{C}}_u^{\,\tau}$ is the newly predicted candidate state, $\mathbf{C}_u^{\,\tau}$ is the cell state of the LSTM, $\mathbf{o}_u^{\,\tau}$ is the activation value of the output gate, $\delta$ denotes the activation function, $W_*^{(k+1)}$ and $U_*$ are parameter matrices, $\mathbf{b}_*$ denotes the biases, and $d^{(k+1)}$ is the representation dimension of layer k+1.

The LSTM network has $l$ layers, so its final output can be expressed as

$$\mathbf{y}_u^{(k+l)}=\mathbf{h}_u^{(k+l),\,\tau},\qquad \mathbf{y}_u^{(k+l)}\in\mathbb{R}^{d^{(k+l)}}.$$

The training objective is to minimize the following loss function:

$$L_{t+1}=\big\|\big(\hat{\mathbf{a}}_u^{\,t+1}-\mathbf{a}_u^{\,t+1}\big)\odot\boldsymbol{\beta}\big\|_F^2.$$

The embedding at time t is used to penalize incorrect neighborhood reconstruction at time t+1, so that the LSTM-based deep autoencoder model can predict node embeddings at a future timestamp. Here $\hat{\mathbf{a}}_u^{\,t+1}=f(\mathbf{a}_u^{1..t})$, where $f(\cdot)$ denotes the function employed to generate the predicted neighborhood at timestamp t+1, namely the autoencoder framework described above; $\boldsymbol{\beta}$ is a hyperparameter matrix that balances the weight of the penalty on observed neighbors, and $\odot$ denotes the element-wise product.
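To make the encoder–LSTM–decoder pipeline and the weighted loss above concrete, here is a compact PyTorch sketch; the framework choice, layer sizes, and the particular β weighting scheme are assumptions for illustration, not values prescribed by the claim:

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.decoder = nn.Linear(hid_dim, in_dim)

    def forward(self, a_seq):           # a_seq: (batch, t, in_dim)
        y = self.encoder(a_seq)         # per-timestamp encoding
        h, _ = self.lstm(y)             # captures the past evolution
        return self.decoder(h[:, -1])   # predicted neighborhood at t+1

def weighted_loss(pred, target, beta):
    # L = || (a_hat - a) (.) B ||_F^2 : B up-weights observed neighbors, so
    # reconstructing an existing edge incorrectly is penalized more heavily.
    return torch.sum(((pred - target) * beta) ** 2)

model = LSTMAutoencoder(in_dim=32, hid_dim=16)
a_seq = torch.rand(8, 4, 32)                # 8 nodes, t=4 timestamps
target = torch.rand(8, 32).round()          # neighborhood vector at t+1
beta = torch.where(target > 0, 5.0, 1.0)    # assumed form of the B matrix
loss = weighted_loss(model(a_seq), target, beta)
print(loss.item())
```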
8. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 7, characterized in that the gradient of the objective function with respect to the decoder weights is computed as follows:

$$\frac{\partial L_{t+1}}{\partial W^{(k+l)}}=2\Big[\big(\hat{\mathbf{a}}_u^{\,t+1}-\mathbf{a}_u^{\,t+1}\big)\odot\boldsymbol{\beta}\Big]\,\frac{\partial\hat{\mathbf{a}}_u^{\,t+1}}{\partial W^{(k+l)}},$$

where $W^{(k+l)}$ is the parameter matrix of layer k+l of the autoencoder; after the derivatives are calculated, the SGD algorithm with the Adam optimizer is applied to train the model.
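Under the same illustrative assumptions as the previous sketch (and reusing its model, a_seq, target, beta and weighted_loss), the training step in this claim reduces to letting autograd compute the gradients, decoder weights included, and applying Adam:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
    optimizer.zero_grad()
    loss = weighted_loss(model(a_seq), target, beta)
    loss.backward()     # dL/dW for every parameter, decoder included
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.3f}")
```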
9. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 1, characterized in that the network nodes comprise author nodes, literature nodes, publishing platform nodes and topic nodes, and the network edge data are the association relationships among the network nodes.
CN202111505386.8A 2021-12-10 2021-12-10 Anomaly detection method based on dynamic heterogeneous information network representation of metagraph Pending CN114218445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111505386.8A CN114218445A (en) 2021-12-10 2021-12-10 Anomaly detection method based on dynamic heterogeneous information network representation of metagraph


Publications (1)

Publication Number Publication Date
CN114218445A true CN114218445A (en) 2022-03-22

Family

ID=80700763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111505386.8A Pending CN114218445A (en) 2021-12-10 2021-12-10 Anomaly detection method based on dynamic heterogeneous information network representation of metagraph

Country Status (1)

Country Link
CN (1) CN114218445A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114826735A (en) * 2022-04-25 2022-07-29 国家计算机网络与信息安全管理中心 VoIP malicious behavior detection method and system based on heterogeneous neural network technology
CN114826735B (en) * 2022-04-25 2023-11-03 国家计算机网络与信息安全管理中心 VoIP malicious behavior detection method and system based on heterogeneous neural network technology


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination