CN114218445A - Anomaly detection method based on dynamic heterogeneous information network representation of metagraph - Google Patents

Anomaly detection method based on dynamic heterogeneous information network representation of metagraph

Info

Publication number
CN114218445A
Authority
CN
China
Prior art keywords
network
heterogeneous information
metagraph
nodes
information network
Prior art date
Legal status
Pending
Application number
CN202111505386.8A
Other languages
Chinese (zh)
Inventor
赵翔
方阳
谭真
胡升泽
陈盈果
李欣奕
王吉
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111505386.8A
Publication of CN114218445A


Classifications

    • G06F16/9024: Information retrieval; indexing; data structures therefor; graphs; linked lists
    • G06F18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N3/044: Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/048: Neural networks; activation functions
    • G06N3/08: Neural networks; learning methods


Abstract

The invention belongs to the field of data analysis and discloses an anomaly detection method based on a dynamic heterogeneous information network representation of a metagraph. The method obtains scientific-collaboration dynamic heterogeneous information network data comprising network node and network edge data; introduces a complex-space embedding mechanism to represent the given dynamic heterogeneous information network at timestamp 1; learns the dynamic heterogeneous information network from timestamp 2 to timestamp t with a triad-based metagraph dynamic embedding mechanism; processes the heterogeneous information network from timestamp 1 to timestamp t with a deep autoencoder based on a long short-term memory (LSTM) network and computes the graph prediction for timestamp t+1; and performs anomaly detection on the nodes in the network using the graph data from timestamps 1 to t+1 to obtain the anomaly detection result. Because the metagraph-based mechanism trains only on the change data set, the invention scales to large dynamic heterogeneous information networks and can predict the future network structure.

Description

Anomaly detection method based on dynamic heterogeneous information network representation of metagraph
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to an anomaly detection method based on dynamic heterogeneous information network representation of a metagraph.
Background
Representation learning is a fundamental task in information retrieval. Its goal is to capture the features of information objects in a low-dimensional space. Most research on heterogeneous information network (HIN) representation learning has focused on static heterogeneous information networks. In practice, however, networks are dynamic and change constantly. A heterogeneous information network is a network with multiple types of nodes and edges, and most real networks, such as social networks and bibliographic networks, are dynamic heterogeneous information networks. Compared with a static network, a dynamic heterogeneous information network is therefore a more expressive tool and can model information-rich problems.
To make networks amenable to machine learning, network representation learning (also referred to as network embedding) has been widely studied, with the objective of embedding networks into a low-dimensional space. Most research has focused on static information networks. Classical network embedding models exploit random walks to explore static homogeneous networks. To represent static heterogeneous networks, many meta-path-based models have been proposed, using different mechanisms to model the relationships between heterogeneous information network nodes. Unlike static network embedding, techniques for dynamic heterogeneous information networks need to be incremental and scalable in order to handle network evolution efficiently. This makes most existing static embedding models, which must repeatedly process the entire network, unsuitable and inefficient.
Disclosure of Invention
In view of the above, the present invention provides a new extensible representation learning model, M-DHIN, to explore the evolution of dynamic heterogeneous information networks. The invention treats a dynamic heterogeneous information network as a series of snapshots with different timestamps. It first learns the initial embedding of the dynamic heterogeneous information network at the first timestamp using a static embedding method. The invention describes the characteristics of the initial heterogeneous information network through metagraphs; compared with traditional path-oriented static models, metagraphs preserve more structural and semantic information. The invention also employs a complex embedding scheme to better distinguish between symmetric and asymmetric metagraphs. Unlike traditional models that process the entire network at each timestamp, the invention builds a change data set that contains only the nodes involved in the triadic closure or opening processes, together with newly added or deleted nodes, and then trains on this change data set with the metagraph-based mechanism described above. With this arrangement, M-DHIN can be extended to large dynamic heterogeneous information networks, because it only needs to model the entire heterogeneous information network once and, over time, only needs to deal with the changing parts. Existing dynamic embedding models can only represent existing snapshots and cannot predict the future network structure. To give M-DHIN this capability, the invention introduces a deep autoencoder model based on a long short-term memory (LSTM) network, which processes the evolution of the graph and outputs a predicted graph through an LSTM encoder. Finally, the invention evaluates the proposed M-DHIN model on real datasets and demonstrates that it significantly and consistently outperforms state-of-the-art models.
The invention provides a novel dynamic heterogeneous information network embedding model, named M-DHIN, which offers an extensible way to capture the characteristics of a dynamic heterogeneous information network through metagraphs. The invention first learns the initial embedding of the entire heterogeneous information network at the first timestamp. Traditional network embedding methods adopt random walks or meta-paths, which are insufficient to fully describe the neighborhood structure of a node. Therefore, the invention proposes the metagraph to capture the structural information of the HIN. A metagraph is a typed subgraph that captures the connecting subgraph between two node types through typed edges, so the neighborhood structure of a node is fully preserved; some simple examples are given in FIG. 1. In training the model, the invention observes that the structure of a metagraph can be symmetric or asymmetric, as shown in FIG. 1(a) and FIG. 1(b), respectively. To better represent heterogeneous information network nodes, the invention combines a complex-space-oriented embedding scheme to handle the symmetric and asymmetric relationships between nodes. In complex space, the components of a node's embedding vector are complex numbers; that is, the invention divides a node's vector into real and imaginary parts.
Specifically, the anomaly detection method based on the dynamic heterogeneous information network representation of the metagraph disclosed by the invention comprises the following steps:
step 1, acquiring scientific-collaboration dynamic heterogeneous information network data comprising network node and network edge data;
step 2, introducing an embedding mechanism in complex space to represent the given dynamic heterogeneous information network at timestamp 1;
step 3, learning the dynamic heterogeneous information network from timestamp 2 to timestamp t by adopting a triad-based metagraph dynamic embedding mechanism;
step 4, processing the heterogeneous information network from timestamp 1 to timestamp t by using a deep autoencoder based on a long short-term memory network, and computing the graph prediction for timestamp t+1;
and step 5, carrying out anomaly detection on the nodes in the network by using the graph data from timestamps 1 to t+1 to obtain the anomaly detection result.
Further, the embedding mechanism described in step 2 represents the network by a metagraph-based complex-space embedding scheme. For the initial heterogeneous information network at timestamp 1, $G_1=(V_1,E_1)$, the concept of a heterogeneous information network triple is introduced to represent the relationship between nodes and metagraphs. A heterogeneous information network triple is written $(u,s,v)$, where u is the first node generated in the metagraph, v is the last node, and s is the metagraph connecting u and v; $\mathbf{u},\mathbf{v},\mathbf{s}\in\mathbb{C}^d$ are the representation vectors of u, v and s, respectively, and d is the dimension of the representation vectors. The probability that a heterogeneous information network triple holds is expressed as:

$$P(s\mid u,v)=\sigma(X_{uv}),$$

where $X\in\mathbb{C}^{n\times n}$ is a scoring matrix and $\sigma$ is an activation function. For a heterogeneous information network triple $(u,s,v)$, its complex-space embedding is written $\mathbf{u}=\mathrm{Re}(\mathbf{u})+i\,\mathrm{Im}(\mathbf{u})$, $\mathbf{v}=\mathrm{Re}(\mathbf{v})+i\,\mathrm{Im}(\mathbf{v})$ and $\mathbf{s}=\mathrm{Re}(\mathbf{s})+i\,\mathrm{Im}(\mathbf{s})$, where $\mathrm{Re}(\cdot)\in\mathbb{R}^d$ and $\mathrm{Im}(\cdot)\in\mathbb{R}^d$ denote the real and imaginary parts of a vector, respectively.

The Hadamard function is introduced to capture the relationship of u, v and s in complex space, expressed as:

$$\Phi(s,u,v)=\mathbf{s}\odot\mathbf{u}\odot\bar{\mathbf{v}},$$

where $\bar{\mathbf{v}}$ is the complex conjugate of $\mathbf{v}$ and $\odot$ is the element-wise product. One element of the scoring matrix is finally:

$$X_{uv}=\mathrm{Re}\big(\langle\mathbf{s},\mathbf{u},\bar{\mathbf{v}}\rangle\big),$$

and the corresponding score function is defined as:

$$f_s(u,v)=\mathrm{Re}\big(\langle\mathbf{s},\mathbf{u},\bar{\mathbf{v}}\rangle\big)=\langle\mathrm{Re}(\mathbf{s}),\mathrm{Re}(\mathbf{u}),\mathrm{Re}(\mathbf{v})\rangle+\langle\mathrm{Re}(\mathbf{s}),\mathrm{Im}(\mathbf{u}),\mathrm{Im}(\mathbf{v})\rangle+\langle\mathrm{Im}(\mathbf{s}),\mathrm{Re}(\mathbf{u}),\mathrm{Im}(\mathbf{v})\rangle-\langle\mathrm{Im}(\mathbf{s}),\mathrm{Im}(\mathbf{u}),\mathrm{Re}(\mathbf{v})\rangle,$$

where $\langle\cdot,\cdot,\cdot\rangle$ denotes the standard element-wise multilinear dot product.
Further, a triple in the triad-based metagraph dynamic embedding mechanism described in step 3 is a set containing three nodes: if every pair of nodes is connected, it is called a closed triple, and if there are only two edges among the three nodes, it is called an open triple. To obtain the dynamic heterogeneous information network embedding from timestamps 1 to t, a negative-sampling strategy is first used to form a training data set: nodes u and v are connected by metagraph s in a positive triple (u,s,v), and nodes u' and v' are connected by metagraph s' in a negative triple (u',s',v'). For each positive heterogeneous information network triple (u,s,v), a negative triple is generated by randomly replacing u and v with other nodes whose types are restricted to be the same as those of the replaced nodes, and replaced heterogeneous information network triples that are still positive are filtered out after sampling. The evolution of an open triple into a closed one (the triadic closure process) and of a closed triple into an open one are the basic changes of dynamic heterogeneous information network evolution; the positive and negative evolution triples serve as the training set, which is trained with the complex-space embedding mechanism of step 2 to obtain the representation learning of the dynamic heterogeneous information network at timestamps 1 to t.
Further, the dynamic heterogeneous information network has four kinds of change:
(1) Added edges form a triadic closure process: all metagraphs whose three nodes change from having only two edges among them to being fully interconnected are collected into the change training data set at timestamp t, denoted $\Delta G_t^{\mathrm{tc}}$. For a metagraph s with three nodes $v_1$, $v_2$ and $v_3$, where $(v_1,v_2)$ denotes the edge between $v_1$ and $v_2$, the metagraph $s^{\mathrm{tc}}$ obtained after the triadic closure process is defined as:

$$s^{\mathrm{tc}}=s\cup\{(v_1,v_3)\},\qquad s=\{v_1,v_2,v_3,(v_1,v_2),(v_2,v_3)\}.$$

(2) Deleted edges lead to a triadic opening process: all metagraphs containing triples that evolve from a cycle to a path with two edges are collected; these nodes at timestamp t are included in $\Delta G_t^{\mathrm{to}}$, and the metagraph $s^{\mathrm{to}}$ obtained after the triadic opening process is defined as

$$s^{\mathrm{to}}=s\setminus\{(v_1,v_3)\},\qquad s=\{v_1,v_2,v_3,(v_1,v_2),(v_2,v_3),(v_1,v_3)\}.$$

(3) An added node: given an existing node $v_1$ in a metagraph s and a newly added node $v_2$, s will be expanded into

$$s'=s\cup\{v_2,(v_1,v_2)\}.$$

(4) A deleted node: given existing nodes $v_1$ and $v_2$ in a metagraph s, if $v_2$ is deleted, then s will become

$$s'=s\setminus\{v_2,(v_1,v_2)\}.$$

Furthermore, the change set is constructed on the basis of the original metagraphs; that is, when the change set is trained, after the change process has finished, the nodes are trained on their original metagraphs. After the change set $\Delta G_t$ is obtained, only $\Delta G_t$ is trained instead of retraining the entire network, and the metagraph-based complex mechanism is used to obtain the embeddings of the changed nodes.
Further, in step 4, the deep autoencoder model is composed of an encoder part and a decoder part. To construct the input of the encoder, for a node, its metagraph is taken as its neighborhood to form its adjacency matrix A; then, for any pair of nodes u and v in the metagraph, the encoder input is composed of the time-ordered adjacency vectors of u and v, denoted $[\mathbf{a}_u^{1},\ldots,\mathbf{a}_u^{t}]$ and $[\mathbf{a}_v^{1},\ldots,\mathbf{a}_v^{t}]$, respectively. $\mathbf{a}_u$ is a combination of two parts: one part is the row of the adjacency matrix A that represents the nodes adjacent to u, further mapped to a d-dimensional vector through a fully connected layer, and the other part is the dynamic node embedding of node u. The encoder then processes the input to obtain low-dimensional representations $\mathbf{y}_u$ and $\mathbf{y}_v$. The encoder aims to predict the neighborhoods $\mathbf{a}_u^{t+1}$ and $\mathbf{a}_v^{t+1}$ from the embeddings at timestamp t, and the adjacency vectors predicted by the deep autoencoder are denoted $\hat{\mathbf{a}}_u^{t+1}$ and $\hat{\mathbf{a}}_v^{t+1}$.
further, for node u and its neighborhood
Figure BDA0003404098220000067
Where d is the embedding dimension, t is the total time step, and the concealment at the first level is expressed as
Figure BDA0003404098220000068
Wherein
Figure BDA0003404098220000069
Is the parameter matrix of layer 1 of the auto-encoder, d (1) is the representation dimension of layer 1,
Figure BDA00034040982200000610
f represents an S-shaped activation function for the deviation of the layer 1 of the encoder;
the k-layer output of the encoder is calculated as follows:
Figure BDA00034040982200000611
to fully capture the information about the past evolution of the metagraph, a long-short term memory network layer is further applied on the output of the encoder, for the first long-short term memory network layer, the hidden state representation is calculated as:
Figure BDA00034040982200000612
Figure BDA00034040982200000613
Figure BDA00034040982200000614
Figure BDA00034040982200000615
Figure BDA00034040982200000616
Figure BDA00034040982200000617
wherein
Figure BDA0003404098220000071
In order to activate the value of the input gate,
Figure BDA0003404098220000072
in order to activate the value of the forgetting gate,
Figure BDA0003404098220000073
is a new predicted candidate state for the current state,
Figure BDA0003404098220000074
the unit states of the long-short term memory network,
Figure BDA0003404098220000075
the value for activating the output gate, delta represents the activation function,
Figure BDA0003404098220000076
in the form of a matrix of parameters,
Figure BDA0003404098220000077
represents deviation, d(k+1)A representation dimension representing a k +1 layer;
the long-short term memory network has one layer, and the final output of the long-short term memory network can be expressed as
Figure BDA0003404098220000078
Figure BDA0003404098220000079
Wherein
Figure BDA00034040982200000710
The training objective is to minimize the following loss function:
Figure BDA00034040982200000711
the embedding at the time t is utilized to punish the incorrect neighborhood reconstruction at the time t +1, so that the node embedding of a future timestamp can be predicted by a depth automatic encoder model based on a long-short term memory network; f (.) represents the function employed to generate the prediction neighborhood at timestamp t +1, using the above-described auto-encoder framework as
Figure BDA00034040982200000712
A hyperparameter matrix, which balances the weights of the penalty observation neighbors, indicates a product by element.
Further, the gradient of the objective function with respect to the decoder weights is applied as follows:

$$\frac{\partial L_{t+1}}{\partial W^{(k+1)}}=\Big[2\big(\hat{\mathbf{a}}_u^{t+1}-\mathbf{a}_u^{t+1}\big)\odot\mathcal{B}\Big]\frac{\partial\hat{\mathbf{a}}_u^{t+1}}{\partial W^{(k+1)}},$$

where $W^{(k+1)}$ is the parameter matrix of layer k+1 of the autoencoder; after the derivatives are calculated, the SGD algorithm with Adam is applied to train the model.
Further, the dynamic heterogeneous network is composed of multiple types of nodes and edges, including an academic graph network dataset with four node types (authors, papers, publication venues and topics), a social graph network dataset with customer, restaurant, review and food nodes, and a movie knowledge graph dataset with movie, director, actor, producer and composer nodes. In the social graph network dataset, nodes represent users and their characteristics, and edges represent relationships between users; in the academic graph network dataset, nodes include authors and papers, and edges represent the associations between nodes.
The invention has the following beneficial effects:
the metagraph-based mechanism trains on the change data set; with this arrangement, the invention can be extended to a large dynamic heterogeneous information network, because the whole heterogeneous information network only needs to be modeled once and, over time, only the changed part needs to be processed;
compared with existing dynamic embedding models, which can only represent existing snapshots, the method can predict the future network structure;
the invention was evaluated on real datasets, demonstrating that it significantly and consistently outperforms the most advanced models.
Drawings
FIG. 1 is an illustration of metagraphs;
FIG. 2 is an exemplary diagram of a dynamic HIN model for a social network;
FIG. 3 is a diagram of the M-DHIN model framework of the present invention;
FIG. 4 is a schematic diagram of the process of forming the change set;
FIG. 5 shows the LSTM-based deep autoencoder of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
In the following, the invention will introduce the notation and definition of the dynamic heterogeneous information network and the metagraph. Next, the present invention will address the dynamic network representation learning problem for heterogeneous information networks. Table 1 lists the main terms and symbols used.
TABLE 1 Terminology and symbols
Dynamic heterogeneous information network: let $G=(V,E,T)$ be a directed graph, where V denotes the set of nodes and E denotes the set of edges between the nodes. Each node and edge is associated with a type mapping function, $\phi: V\to T_V$ and $\psi: E\to T_E$, respectively, where $T_V$ and $T_E$ represent the sets of node and edge types. G is a heterogeneous information network (HIN) if $|T_V|>1$ or $|T_E|>1$; otherwise, it is a homogeneous network. In addition, a dynamic heterogeneous information network is a series of network snapshots, denoted $\{G_1,\ldots,G_{Time}\}$. Two consecutive timestamps t and t+1 need to satisfy the following condition: $|V_{t+1}|\neq|V_t|$ or $|E_{t+1}|\neq|E_t|$, where $|V_t|$ and $|E_t|$ represent the numbers of nodes and edges at timestamp t, respectively. The present invention assumes that $T_V$ and $T_E$ remain unchanged throughout the network evolution. FIG. 2 illustrates an example of a dynamic HIN.
A metagraph is a subgraph of compatible node types and edge types, denoted $S=(T_{V_S},T_{E_S})$, where $T_{V_S}$ and $T_{E_S}$ represent the sets of node types and edge types in the metagraph, respectively.
As shown in fig. 1, the metagraph can be divided into two types, symmetric and asymmetric; the present invention will handle both cases in the proposed model.
Problem (dynamic heterogeneous information network representation learning). Given a series of dynamic networks $G_1,\ldots,G_{Time}$, dynamic HIN representation learning aims to learn, for each node v, a representation $\mathbf{y}_v\in\mathbb{R}^d$, where d is the dimension of the node representation. In particular, the method provided by the invention also obtains a representation $\mathbf{s}$ for each metagraph. These representations are able to capture the continuously evolving structural attributes of the dynamic network.
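For illustration, a small Python sketch (not part of the patent) of how such a snapshot sequence with typed nodes and edges might be represented; the class and field names are assumptions:

```python
# Illustrative only: a minimal container for dynamic HIN snapshots.
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    nodes: dict = field(default_factory=dict)  # node id -> node type in T_V
    edges: set = field(default_factory=set)    # (u, v, edge type in T_E)

# A dynamic HIN is a series of snapshots G_1 ... G_Time whose node or edge
# counts differ between consecutive timestamps.
g1 = Snapshot(nodes={"a1": "author", "p1": "paper"},
              edges={("a1", "p1", "writes")})
g2 = Snapshot(nodes={"a1": "author", "p1": "paper", "p2": "paper"},
              edges={("a1", "p1", "writes"), ("a1", "p2", "writes")})
dynamic_hin = [g1, g2]
assert len(g2.nodes) != len(g1.nodes)  # the dynamic condition |V_{t+1}| != |V_t|
```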
In this section, the invention presents the model M-DHIN. FIG. 3 gives an overview of M-DHIN. The M-DHIN model first introduces a complex embedding mechanism to represent the given dynamic heterogeneous information network at the initial timestamp. The invention then employs a triad-based metagraph dynamic embedding mechanism to learn the dynamic heterogeneous information network from timestamp 2 to timestamp t. Finally, the invention provides a deep autoencoder based on a long short-term memory network to perform the graph prediction for timestamp t+1.
Conventional heterogeneous information network representation learning methods (such as DeepWalk, node2vec and metapath2vec) must process the complete heterogeneous information network at each timestamp to generate up-to-date node vectors, which is time-consuming, inefficient and does not scale to large dynamic heterogeneous information networks. To solve this problem, the present invention proposes a new model named M-DHIN that is able to capture the major changes of the dynamic network. The initial step of M-DHIN is similar to static heterogeneous information network embedding: it represents the entire network (at the first timestamp) by a metagraph-based complex embedding scheme. Given the initial heterogeneous information network at timestamp 1, $G_1=(V_1,E_1)$, the invention introduces the concept of heterogeneous information network triples to represent the relationship between nodes and metagraphs. A heterogeneous information network triple is written $(u,s,v)$, where u is the first node generated in the metagraph, v is the last node, and s is the metagraph connecting them; $\mathbf{u},\mathbf{v},\mathbf{s}\in\mathbb{C}^d$ are the representation vectors of u, v and s, respectively, and d is the dimension of the representation vectors.
The probability that a heterogeneous information network triple holds is expressed as

$$P(s\mid u,v)=\sigma(X_{uv}), \quad (1)$$

where $X\in\mathbb{C}^{n\times n}$ is the scoring matrix, n is the number of training nodes, and $\sigma$ is the activation function, chosen here as the sigmoid function.
Note that a metagraph may be symmetric or asymmetric; that is, if the metagraph is symmetric, exchanging the first and last nodes does not change the semantics of the original metagraph, as shown in FIG. 1(a). Conversely, interchanging the head and tail nodes changes the semantics of an asymmetric metagraph, as shown in FIG. 1(b).
To solve this problem, the invention adapts a complex knowledge-embedding scheme to the network embedding task. For a heterogeneous information network triple $(u,s,v)$, its complex embeddings are written $\mathbf{u}=\mathrm{Re}(\mathbf{u})+i\,\mathrm{Im}(\mathbf{u})$, $\mathbf{v}=\mathrm{Re}(\mathbf{v})+i\,\mathrm{Im}(\mathbf{v})$ and $\mathbf{s}=\mathrm{Re}(\mathbf{s})+i\,\mathrm{Im}(\mathbf{s})$, where $\mathrm{Re}(\cdot),\mathrm{Im}(\cdot)\in\mathbb{R}^d$ denote the real and imaginary parts of a vector, respectively. The invention introduces the Hadamard function to capture the relationship of u, v and s in complex space, expressed as:

$$\Phi(s,u,v)=\mathbf{s}\odot\mathbf{u}\odot\bar{\mathbf{v}}, \quad (2)$$

where $\bar{\mathbf{v}}$ is the complex conjugate of $\mathbf{v}$ and $\odot$ is the element-wise product. However, the sigmoid function in equation 1 cannot be applied in complex space, so the invention retains only the real part of the objective function; the real part can still handle symmetric and asymmetric structures well, as detailed later. Thus, one element of the scoring matrix ends up being

$$X_{uv}=\mathrm{Re}\big(\langle\mathbf{s},\mathbf{u},\bar{\mathbf{v}}\rangle\big). \quad (3)$$

Given the scoring matrix, the corresponding score function is defined as

$$f_s(u,v)=\mathrm{Re}\big(\langle\mathbf{s},\mathbf{u},\bar{\mathbf{v}}\rangle\big)=\langle\mathrm{Re}(\mathbf{s}),\mathrm{Re}(\mathbf{u}),\mathrm{Re}(\mathbf{v})\rangle+\langle\mathrm{Re}(\mathbf{s}),\mathrm{Im}(\mathbf{u}),\mathrm{Im}(\mathbf{v})\rangle+\langle\mathrm{Im}(\mathbf{s}),\mathrm{Re}(\mathbf{u}),\mathrm{Im}(\mathbf{v})\rangle-\langle\mathrm{Im}(\mathbf{s}),\mathrm{Im}(\mathbf{u}),\mathrm{Re}(\mathbf{v})\rangle, \quad (4)$$

where $\langle\cdot,\cdot,\cdot\rangle$ is the standard element-wise multilinear dot product, e.g. $\langle a,b,c\rangle=\sum_k a_k b_k c_k$, where a, b, c are vectors and k indexes their dimensions.
Equation 4 is able to handle asymmetric metagraphs thanks to the complex conjugate applied to one of the embeddings. Furthermore, the score function is antisymmetric if $\mathbf{s}$ is purely imaginary (i.e., its real part is zero) and symmetric if $\mathbf{s}$ is real. Co-authorship is a symmetric relationship; citation is an antisymmetric relationship: (typically) paper A can cite paper B but B cannot cite A, in other words, B is cited by A. By separating the metagraph embedding $\mathbf{s}$ into its imaginary and real parts, the invention decomposes the metagraph scoring matrix $X_s$ into the sum of an antisymmetric matrix $X_s^{\mathrm{anti}}$ and a symmetric matrix $X_s^{\mathrm{sym}}$. Metagraph embedding thus naturally acts as a weight on each latent dimension: $\mathrm{Im}(\mathbf{s})$ weights the antisymmetric imaginary part $\mathrm{Im}(\langle\mathbf{u},\bar{\mathbf{v}}\rangle)$, and $\mathrm{Re}(\mathbf{s})$ weights the symmetric real part $\mathrm{Re}(\langle\mathbf{u},\bar{\mathbf{v}}\rangle)$; $\mathrm{Im}(\langle\mathbf{u},\bar{\mathbf{v}}\rangle)$ is antisymmetric and $\mathrm{Re}(\langle\mathbf{u},\bar{\mathbf{v}}\rangle)$ is symmetric. This mechanism therefore enables the invention to accurately and efficiently represent symmetric and asymmetric (including antisymmetric) metagraphs between pairs of nodes.
In this initial step, the present invention uses the state of the art method, GRAMI, to find all sub-graphs in the database that occur frequently and meet a given frequency threshold. The present invention then uses these mined subgraphs to compose a metagraph.
Algorithm 1 summarizes the initial complex embedding algorithm.
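The algorithm itself is reproduced only as an image in the original document. The following PyTorch sketch shows one plausible reading of the initial complex-embedding step; the function names, tensor layout and the use of binary cross-entropy for the objective of equation 10 are assumptions, not the patent's own code:

```python
import torch

def make_embedding(n, d):
    # real and imaginary parts of a complex embedding, stored separately
    return (torch.nn.Parameter(0.1 * torch.randn(n, d)),
            torch.nn.Parameter(0.1 * torch.randn(n, d)))

def f_score(s_re, s_im, u_re, u_im, v_re, v_im):
    # Eq. (4): Re(<s, u, conj(v)>)
    return ((s_re * u_re * v_re).sum(-1) + (s_re * u_im * v_im).sum(-1)
            + (s_im * u_re * v_im).sum(-1) - (s_im * u_im * v_re).sum(-1))

def train_initial(triples, n_nodes, n_metagraphs, d=128, lr=0.025, epochs=20):
    """triples: list of (u, s, v, label) index tuples from the negative sampler."""
    node_re, node_im = make_embedding(n_nodes, d)
    mg_re, mg_im = make_embedding(n_metagraphs, d)
    opt = torch.optim.Adam([node_re, node_im, mg_re, mg_im], lr=lr)
    u, s, v, y = (torch.tensor(x) for x in zip(*triples))
    for _ in range(epochs):
        opt.zero_grad()
        logits = f_score(mg_re[s], mg_im[s], node_re[u], node_im[u],
                         node_re[v], node_im[v])
        # maximizing log O_(u,s,v) (Eq. 10) == minimizing binary cross-entropy
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y.float())
        loss.backward()
        opt.step()
    return (node_re, node_im), (mg_re, mg_im)
```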
After the entire heterogeneous information network has been processed to obtain the initial embedding, the present invention further captures the structural evolution of the dynamic heterogeneous information network using triads. A triple is a set containing three nodes. If each node is connected to each of the others, it is called a closed triple; if there are only two edges between the three nodes, it is called an open triple. As mentioned previously, the evolution of open triple structures into closed structures (i.e., the triadic closure process) is a fundamental variation in the evolution of dynamic heterogeneous information networks. Thus, at this step, the present invention builds a change training data set accordingly to contain the nodes that have undergone triadic closure. Meanwhile, the invention cannot ignore that the triadic opening process also exists in dynamic heterogeneous information networks: two nodes in a triple may lose contact over time. In general, the present invention identifies four common scenarios describing dynamic heterogeneous information network changes:
(1) Added edges form a triadic closure process. As shown in FIG. 4(a), the present invention identifies all metagraphs whose three nodes change from having only two edges among them to being fully interconnected. These metagraphs are included in the change training data set at timestamp t, denoted $\Delta G_t^{\mathrm{tc}}$. For a metagraph s with three nodes $v_1$, $v_2$ and $v_3$, $(v_1,v_2)$ denotes the edge between $v_1$ and $v_2$. Then $s^{\mathrm{tc}}$ (obtained after the triadic closure process) is defined as:

$$s^{\mathrm{tc}}=s\cup\{(v_1,v_3)\},\qquad s=\{v_1,v_2,v_3,(v_1,v_2),(v_2,v_3)\}. \quad (5)$$

(2) Deleted edges lead to a triadic opening process. As shown in FIG. 4(b), the present invention collects all metagraphs containing triples that evolve from a cycle into a path with two edges; these nodes at timestamp t are included in $\Delta G_t^{\mathrm{to}}$. Similarly to the triadic closure process, $s^{\mathrm{to}}$, obtained after the triadic opening process, is defined as

$$s^{\mathrm{to}}=s\setminus\{(v_1,v_3)\},\qquad s=\{v_1,v_2,v_3,(v_1,v_2),(v_2,v_3),(v_1,v_3)\}. \quad (6)$$

(3) An added node. As shown in FIG. 4(c), given an existing node $v_1$ in a metagraph and a newly added node $v_2$, the metagraph s is expanded into

$$s'=s\cup\{v_2,(v_1,v_2)\}. \quad (7)$$

(4) A deleted node. As shown in FIG. 4(d), given existing nodes $v_1$ and $v_2$ in a metagraph, if $v_2$ is deleted, then s becomes

$$s'=s\setminus\{v_2,(v_1,v_2)\}. \quad (8)$$
In forming the change set, Change2vec differs from the model of the present invention mainly in that it only collects changed nodes and then forms meta-paths within the change set, which may lose contact with the original network and may miss many meta-paths. For example, two newly added nodes may be connected through a meta-path of the original network but not within the change set. In the model of the present invention, by contrast, the change set is constructed on the basis of the original metagraphs; that is, when training the change set, after the change process has finished, the nodes are trained on their original metagraphs. By doing so, the model of the present invention is better suited to training meta-paths and metagraphs. Furthermore, this operation ensures that the nodes in the change set are not embedded in isolation, as they remain connected to the original network. Note that a node may be involved in multiple scenarios, but a node is included in the change set only once, to avoid repeatedly computing node embeddings. After a node is included, it is computed according to the metagraph to which it belongs, and the change process of that metagraph can describe all possible scenarios experienced by the node.
After the change set $\Delta G_t$ is obtained, the present invention trains only $\Delta G_t$ instead of retraining the entire network, using the metagraph-based complex mechanism to obtain the embeddings of the changed nodes. Specifically, the node embeddings $Y_t$ evolve into $Y_{t+1}$ at timestamp t+1 by removing the representations of deleted nodes $Y_t^{\mathrm{del}}$, adding the embeddings of newly added nodes $Y_t^{\mathrm{add}}$, and replacing the representations of nodes changed in the triadic closure or opening processes $Y_t^{\mathrm{ch}}$.
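A small illustrative sketch of this $Y_t$ to $Y_{t+1}$ update; the dict-based bookkeeping and helper name are assumptions rather than the patent's notation:

```python
import numpy as np

def update_embeddings(Y_t, deleted, added, changed):
    """Y_t: node id -> embedding; deleted: ids; added/changed: id -> embedding."""
    Y_next = {v: e for v, e in Y_t.items() if v not in deleted}  # drop Y_t^del
    Y_next.update(added)    # embeddings of newly added nodes      (Y_t^add)
    Y_next.update(changed)  # nodes retrained after triadic closure/opening
    return Y_next

Y_t = {"a1": np.zeros(4), "a2": np.ones(4)}
Y_t1 = update_embeddings(Y_t, deleted={"a2"}, added={"a3": np.full(4, 2.0)},
                         changed={"a1": np.full(4, 0.5)})
```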
The training target will now be described. To obtain the dynamic heterogeneous information network embedding from timestamps 1 to t and observe the changes in the graph, the present invention first uses a negative sampling strategy to form a training data set. Nodes u and v are connected by a metagraph s in a positive triple (u,s,v), and nodes u' and v' are connected by a metagraph s' in a negative triple (u',s',v'). For each positive heterogeneous information network triple (u,s,v), the present invention generates a negative triple by randomly replacing u and v with other nodes while restricting them to be of the same type as the replaced node. Replaced heterogeneous information network triples that are still positive are filtered out after sampling. Note that the number of candidates for s is much smaller than for u or v, so negative samples are generated only by replacing u and v.
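For illustration, a minimal sketch of this sampling strategy; the data layout (node_type, nodes_by_type) is an assumption, and ratio=5 matches the negative sampling rate reported in the experimental section:

```python
import random

def sample_negatives(positive_triples, node_type, nodes_by_type, ratio=5):
    """Corrupt u and v (never s), keeping node types fixed; drop accidental positives."""
    positives = set(positive_triples)
    negatives = []
    for (u, s, v) in positive_triples:
        made = 0
        while made < ratio:
            u2 = random.choice(nodes_by_type[node_type[u]])
            v2 = random.choice(nodes_by_type[node_type[v]])
            if (u2, s, v2) not in positives:  # filter still-positive replacements
                negatives.append((u2, s, v2))
                made += 1
    return negatives
```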
After sampling, the invention possesses training data in the form $(u,s,v,H_{uv})$, where $H_{uv}\in\{1,0\}$ is a binary value indicating whether the HIN triple is positive. For a training example $(u,s,v,H_{uv})$, if $H_{uv}=1$, the objective function $O_{(u,s,v)}$ aims to maximize $P(s\mid u,v)$; otherwise $P(s\mid u,v)$ should be minimized. Thus, the objective function is as follows:

$$O_{(u,s,v)}=P(s\mid u,v)^{H_{uv}}\,\big[1-P(s\mid u,v)\big]^{1-H_{uv}}. \quad (9)$$

To simplify the calculations, the logarithm of $O_{(u,s,v)}$ is defined as

$$\log O_{(u,s,v)}=H_{uv}\log P(s\mid u,v)+(1-H_{uv})\log\big[1-P(s\mid u,v)\big], \quad (10)$$

where $P(s\mid u,v)$ is defined as

$$P(s\mid u,v)=\mathrm{sigmoid}\big(f_s(u,v)\big). \quad (11)$$

In particular, the objective of the invention is to maximize $\log O_{(u,s,v)}$. If a triple $(u,s,v)$ holds, $H_{uv}=1$, and the objective function becomes

$$\log O_{(u,s,v)}=\log P(s\mid u,v). \quad (12)$$

Maximizing this objective maximizes the probability $P(s\mid u,v)$; in turn, the invention obtains embeddings of u, v and s that maximize the probability that $(u,s,v)$ holds. Likewise, for a negative sample, where the triple $(u,s,v)$ does not hold and $H_{uv}=0$, the objective function becomes

$$\log O_{(u,s,v)}=\log\big[1-P(s\mid u,v)\big]. \quad (13)$$

Maximizing this objective minimizes the probability $P(s\mid u,v)$; accordingly, the invention obtains embeddings of u, v and s that minimize the probability that $(u,s,v)$ holds.

The present invention maximizes the above objective function using the stochastic gradient descent (SGD) algorithm with adaptive moment estimation (Adam). Specifically, for each training entry $(u,s,v,H_{uv})$, the embeddings are adjusted by back-propagating the partial derivatives of $\log O_{(u,s,v)}$ with respect to $\mathbf{u}$, $\mathbf{v}$ and $\mathbf{s}$, respectively (equations 14 to 16):

$$\frac{\partial\log O_{(u,s,v)}}{\partial\mathbf{u}},\qquad \frac{\partial\log O_{(u,s,v)}}{\partial\mathbf{v}},\qquad \frac{\partial\log O_{(u,s,v)}}{\partial\mathbf{s}}.$$
Algorithm 2 summarizes the dynamic embedding algorithm.
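Algorithm 2 likewise appears only as an image in the original. A minimal sketch of what the dynamic embedding step plausibly does, reusing f_score from the Algorithm 1 sketch above; the change_triples layout and the warm-start loop are assumptions:

```python
import torch

def dynamic_embedding(change_triples, node_emb, mg_emb, lr=0.025, epochs=20):
    """change_triples: timestamp -> list of labelled (u, s, v, y) triples
    touching the change set; embeddings are warm-started, so gradients flow
    only to the rows indexed by the changed triples."""
    node_re, node_im = node_emb
    mg_re, mg_im = mg_emb
    opt = torch.optim.Adam([node_re, node_im, mg_re, mg_im], lr=lr)
    for t in sorted(change_triples):
        u, s, v, y = (torch.tensor(x) for x in zip(*change_triples[t]))
        for _ in range(epochs):
            opt.zero_grad()
            logits = f_score(mg_re[s], mg_im[s], node_re[u], node_im[u],
                             node_re[v], node_im[v])
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                logits, y.float())
            loss.backward()
            opt.step()
    return (node_re, node_im), (mg_re, mg_im)
```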
After completing the above two steps, the model of the present invention is able to generate a heterogeneous information network representation at each timestamp; however, it cannot yet predict future structures in a dynamic heterogeneous information network. In other words, it can only generate node embeddings from the observed network evolution and cannot describe changes that may occur but have not yet been seen. To solve this problem, the present invention proposes a deep autoencoder model based on a long short-term memory network, which can generate future heterogeneous information network representations from the preceding sequential structural evolution. FIG. 5 illustrates this LSTM-based deep autoencoder.
Note that in predicting future networks, the present invention trains only the changed metagraphs, rather than every metagraph. As described above, each node is included in an original metagraph once, which also saves training time. Each changed metagraph is trained with the autoencoder. Accordingly, each node in the change set is computed only once, whether it is popular or not; in other words, the present invention treats each node in the change set equally. To predict the future state of a node, knowing its dynamic course matters more than its popularity. Thus, after obtaining the dynamic course of a given node, the present invention is able to predict its future state, whether the node is popular or not.
The deep autoencoder model of the present invention consists of an encoder part and a decoder part. To construct the input of the encoder, for a node, the present invention takes its metagraph as its neighborhood to form its adjacency matrix A. Then, for any pair of nodes u and v in the metagraph s, the encoder input consists of the time-ordered adjacency vectors of u and v, denoted $[\mathbf{a}_u^{1},\ldots,\mathbf{a}_u^{t}]$ and $[\mathbf{a}_v^{1},\ldots,\mathbf{a}_v^{t}]$, respectively. Specifically, $\mathbf{a}_u$ is a combination of two components: one is the row of the adjacency matrix A representing the nodes adjacent to u, further mapped to a d-dimensional vector through a fully connected layer; the other is the dynamic node embedding of node u learned in Algorithm 1 and Algorithm 2. The encoder then processes the input to obtain low-dimensional representations $\mathbf{y}_u$ and $\mathbf{y}_v$. The encoder aims to predict the neighborhoods $\mathbf{a}_u^{t+1}$ and $\mathbf{a}_v^{t+1}$ from the embeddings at timestamp t, and the adjacency vectors predicted by the deep autoencoder are denoted $\hat{\mathbf{a}}_u^{t+1}$ and $\hat{\mathbf{a}}_v^{t+1}$.
in particular, for node u and its neighborhood
Figure BDA0003404098220000177
Where d is the embedding dimension and t is the total time step, the concealment of the first layer is represented as
Figure BDA0003404098220000178
Wherein
Figure BDA0003404098220000181
Is a parameter matrix of the first layer of the auto-encoder, d(1)Is the representation dimension of the first layer.
Figure BDA0003404098220000182
For the first layer of the encoder, f is the activation function of the sigmoid. The output of the encoder (k layers) is then calculated as follows:
Figure BDA0003404098220000183
to fully capture information about the past evolution of the metagraph, the present invention further applies a long short term memory network layer on the output of the encoder. For the first long-short term memory network layer, the hidden state representation is calculated as
Figure BDA0003404098220000184
Figure BDA0003404098220000185
Figure BDA0003404098220000186
Figure BDA0003404098220000187
Figure BDA0003404098220000188
Figure BDA0003404098220000189
Wherein
Figure BDA00034040982200001810
In order to activate the value of the input gate,
Figure BDA00034040982200001811
in order to activate the value of the forgetting gate,
Figure BDA00034040982200001812
is a new predicted candidate state for the current state,
Figure BDA00034040982200001813
the unit states of the long-short term memory network,
Figure BDA00034040982200001814
to activate the value of the output gate, δ represents the activation function, here an S-shaped function is used,
Figure BDA00034040982200001815
in the form of a matrix of parameters,
Figure BDA00034040982200001816
the deviation is indicated. d(k+1)Representing the presentation dimension of the k +1 layer.
Assuming that the long short-term memory network has one layer, its final output can be expressed as

$$\mathbf{h}_t=\mathbf{o}_t\odot\tanh(\mathbf{C}_t),$$

where $\mathbf{h}_t\in\mathbb{R}^{d^{(k+1)}}$.
Finally, the aim of the invention is to minimize the following loss function:

$$L_{t+1}=\big\|\big(\hat{\mathbf{a}}_u^{t+1}-\mathbf{a}_u^{t+1}\big)\odot\mathcal{B}\big\|_F^2. \quad (19)$$

The invention penalizes incorrect neighborhood reconstruction at time t+1 using the embedding at time t; the deep autoencoder model based on the long short-term memory network can therefore predict node embeddings at future timestamps. To simplify notation, $f(\cdot)$ denotes the function used to generate the predicted neighborhood at timestamp t+1, implemented with the autoencoder framework described above, i.e. $\hat{\mathbf{a}}_u^{t+1}=f(\mathbf{a}_u^{1},\ldots,\mathbf{a}_u^{t})$. $\mathcal{B}$ is a hyperparameter matrix that balances the weight of penalizing observed neighbors, and $\odot$ indicates the element-wise product.
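A one-function sketch of this weighted reconstruction loss; the tensor shapes and the helper name are illustrative:

```python
import torch

def weighted_recon_loss(a_hat, a_true, B):
    # Eq. (19): || (a_hat - a_true) ⊙ B ||_F^2
    return (((a_hat - a_true) * B) ** 2).sum()
```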
To predict the heterogeneous information network embedding at the future timestamp t+1, the invention optimizes the objective function of the LSTM-based deep autoencoder framework. Specifically, the gradient of equation 19 with respect to the decoder weights is applied as shown below:

$$\frac{\partial L_{t+1}}{\partial W^{(k+1)}}=\Big[2\big(\hat{\mathbf{a}}_u^{t+1}-\mathbf{a}_u^{t+1}\big)\odot\mathcal{B}\Big]\frac{\partial\hat{\mathbf{a}}_u^{t+1}}{\partial W^{(k+1)}}, \quad (20)$$

where $W^{(k+1)}$ is the parameter matrix of layer k+1 of the autoencoder. After calculating the derivatives, the present invention applies the SGD algorithm with Adam to train the model.
Algorithm 3 details the LSTM-based deep autoencoder for graph prediction. It incorporates the evolution of the continuously changing metagraphs from Algorithm 1 and Algorithm 2 to form the adjacency matrix, and also forms $\mathbf{a}_v^{t}$, the input of the autoencoder, using the dynamic embeddings learned in Algorithm 1 and Algorithm 2.
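Algorithm 3 is also reproduced only as an image. The sketch below shows one plausible PyTorch realization of the LSTM-based deep autoencoder described above; the class name and hidden sizes are assumptions, while the two encoder layers and two LSTM layers follow the configuration reported later:

```python
import torch
import torch.nn as nn

class LSTMGraphAutoencoder(nn.Module):
    def __init__(self, n_nodes, d=128, hidden=256):
        super().__init__()
        # encoder: adjacency row plus dynamic node embedding -> low-dim y
        self.encoder = nn.Sequential(
            nn.Linear(n_nodes + d, hidden), nn.Sigmoid(),
            nn.Linear(hidden, d), nn.Sigmoid())
        self.lstm = nn.LSTM(input_size=d, hidden_size=d,
                            num_layers=2, batch_first=True)
        # decoder: last LSTM hidden state -> predicted adjacency row at t+1
        self.decoder = nn.Sequential(
            nn.Linear(d, hidden), nn.Sigmoid(),
            nn.Linear(hidden, n_nodes), nn.Sigmoid())

    def forward(self, adj_seq, emb_seq):
        # adj_seq: (batch, t, n_nodes) adjacency rows over time
        # emb_seq: (batch, t, d) dynamic embeddings from Algorithms 1 and 2
        y = self.encoder(torch.cat([adj_seq, emb_seq], dim=-1))
        h, _ = self.lstm(y)
        return self.decoder(h[:, -1])  # predicted neighborhood at t+1
```

In a training loop, the output would be scored against the true adjacency row at t+1 with the weighted reconstruction loss sketched above and optimized with Adam.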
To evaluate the performance of M-DHIN, the present invention performed experiments using four dynamic datasets extracted from DBLP, YELP, YAGO and Freebase. Descriptive statistics of these datasets are shown in Table 2. For simplicity, the invention reports only the statistics of the initial and final timestamps for the different time spans (months and years).
Table 2. Dataset statistics.
DBLP is a bibliographic dataset in computer science. The present invention extracts a subset spanning 15 consecutive monthly timestamps from October 2015 to December 2016. In October 2015, it contained 110,634 papers (P), 92,473 authors (A), 4,274 topics (T) and 118 publication venues (V). In November 2016, it contained 135,348 papers (P), 116,137 authors (A), 4,476 topics (T) and 121 publication venues (V). Authors are divided into four label areas: database, machine learning, data mining and information retrieval.
YELP is a social media dataset containing restaurant reviews. The extracted dynamic HIN has 12 consecutive monthly snapshots from January 2016 to December 2016. In January 2016, it contained 81,240 reviews (V), 43,927 customers (C), 74 food-related keywords (K) and 23,421 restaurants (R). In December 2016, it contained 102,367 reviews (V), 51,299 customers (C), 86 food-related keywords (K) and 29,777 restaurants (R). Restaurants are divided into three types: American restaurants, sushi restaurants and fast food.
YAGO captures world knowledge, and the present invention extracts a subset containing 10 annual snapshots of movies from 2009 to 2018. In 2009, there were 5,334 movies (M), 8,346 actors (A), 1,345 directors (D), 1,123 composers (C) and 2,876 producers (P). In 2018, there were 7,476 movies (M), 10,212 actors (A), 1,872 directors (D), 1,342 composers (C) and 3,537 producers (P). Movies are divided into five types: horror, action, adventure, crime and science fiction.
Freebase contains world knowledge and facts, and the extracted subset relates to video games. It consists of 12 monthly snapshots from January 2016 to December 2016. At the beginning of January 2016, it contained 3,435 games (G), 1,284 publishers (P), 1,768 developers (D) and 154 designers (S). By December 2016, it contained 4,122 games (G), 1,673 publishers (P), 2,022 developers (D) and 201 designers (S). The games fall into one of three categories: action, adventure and strategy.
In terms of experimental evaluation, the invention holds that measuring the performance of different models on the anomaly detection task reflects the degree to which the models can describe and capture the characteristics of the dynamic HIN: anomaly detection tests a model's ability to detect unexpected events during dynamic HIN evolution.
The present invention incorporates two types of benchmarks: one consisting of static embedding methods and the other of dynamic embedding methods. For the static embedding methods, both homogeneous and heterogeneous approaches are considered. DeepWalk and node2vec were originally designed to represent homogeneous networks; metapath2vec and MetaGraph2Vec are designed for heterogeneous networks using meta-paths and metagraphs, respectively. Note that the present invention does not apply methods that use textual information, since its datasets contain only nodes and edges.
DeepWalk captures the structural information of the HIN using random walks and learns the representation using homogeneous SkipGram. It has two main hyperparameters, the walk length (wl) of the random walk and the window size (ws) of the SkipGram mechanism. To report the best performance, the present invention uses grid search over wl ∈ {20,40,60,80} and ws ∈ {3,5,7} to find the best configuration for different tasks.
node2vec is an extension of DeepWalk: it uses biased random walks to better explore the structure, and likewise uses SkipGram to learn the network embedding. The present invention uses the same wl and ws as for DeepWalk. For its bias parameters p and q, the invention performs a grid search over p ∈ {0.5,1,1.5,2,5} and q ∈ {0.5,1,1.5,2,5}.
metapath2vec captures the structural information of the HIN using meta-paths and learns embeddings using heterogeneous SkipGram, which restricts the context window to one specific type. The present invention uses the same wl and ws as for DeepWalk.
MetaGraph2Vec builds a metagraph by simply combining several meta-paths and is essentially a path-oriented model. It then learns the final representation using heterogeneous SkipGram. The present invention sets wl and ws to the same values as for DeepWalk.
For fair comparison, the present invention also evaluated the performance of four dynamic embedding models: DynamicTriad, DynGEM, dyngraph2vec and Change2vec.
DynamicTriad describes the evolution of a network based only on the triadic closure process and is designed for homogeneous networks. β0 and β1 are two hyperparameters, representing the weight of the triadic closure process and the weight of temporal smoothness, respectively. The invention performs a grid search over β0 ∈ {0.01, 0.1, 1, 10} and β1 ∈ {0.01, 0.1, 1, 10}.
Change2vec first learns the initial embedding of dynamic HINs, and then samples the changing set of nodes for training using the metapath2vec model. The present invention sets its configuration to be the same as metapath2 vec.
DynGEM captures the dynamics of the HIN at timestamp t by a deep autoencoder using only the snapshot at timestamp t-1. α, ν1 and ν2 are relative-weight hyperparameters selected by grid search from α ∈ {10^-6, 10^-5}, ν1 ∈ {10^-4, 10^-6} and ν2 ∈ {10^-3, 10^-6}.
dyngraph2vec uses a deep-LSTM-based autoencoder to process the previous snapshots from a look-back window of length lb, i.e., training snapshots of length lb. M-DHIN trains on all previous timestamp snapshots through metagraph embedding and uses the autoencoder only to predict the final snapshot, whereas dyngraph2vec learns the embeddings of all snapshot graphs with the autoencoder. Thus, due to limited hardware resources, lb is limited, as noted in the original publication, to at most 10; lb is therefore selected from {3,4,5,6,7,8,9,10}.
As for the other parameters, such as learning rate and embedding dimension, the present invention directly adopts the optimal settings reported in the respective papers.
The present invention also adds a variant of M-DHIN, named M-DHIN-MG, which uses only dynamic complex embedding through the metagraph, without the deep-LSTM-based autoencoder mechanism, to measure the effectiveness of the autoencoder in an ablation analysis.
To evaluate the performance of M-DHIN, the present invention utilizes a grid search to find the best experimental configuration. Specifically, the node and metagraph embedding dimension is selected from {32,64,128,256}, the learning rate in SGD from {0.01,0.02,0.025,0.05,0.1}, the negative sampling ratio from {3,4,5,6,7}, the number of autoencoder layers from {2,3,4}, the number of LSTM layers from {2,3,4}, and the number of training epochs from {5,10,15,20,25,30,35,40}. To balance effectiveness and efficiency, the following configuration was selected to generate the experimental results reported below: the embedding dimension is set to 128, the learning rate to 0.025, the negative sampling rate to 5 (i.e., 5 negative samples per positive sample), the numbers of layers in the autoencoder and in the LSTM both to 2, and the number of training epochs to 20.
All experiments were performed on a 64-bit Ubuntu 16.04.1 LTS system with an Intel(R) Core(TM) i9-7900X CPU, 64 GB RAM and a GTX-1080 GPU with 8 GB of memory.
The stability of M-DHIN was measured by parameter sensitivity analysis and statistical significance was reported using paired two-tailed t-tests.
The anomaly detection task is to detect newly arriving nodes or edges that do not belong to an existing cluster. The invention uses the k-means clustering algorithm to generate clusters based on the dynamic HIN, and then uses anomaly injection to create anomalous nodes and edges.
Specifically, the present invention injects 1% and 5% anomalies to build the test datasets, and uses the area under the curve (AUC) score to evaluate the performance of all models; a higher AUC score represents better performance.
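For illustration, a hedged sketch of this evaluation protocol; the 1%/5% injection ratio, k-means clustering and AUC scoring follow the text, while the distance-to-nearest-centroid anomaly score and all names are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score

def evaluate_anomaly_detection(X_normal, inject_ratio=0.01, k=10, seed=0):
    rng = np.random.default_rng(seed)
    n_anom = max(1, int(inject_ratio * len(X_normal)))
    # inject anomalies: embeddings drawn far from the normal clusters
    X_anom = rng.normal(loc=10.0, scale=1.0, size=(n_anom, X_normal.shape[1]))
    X = np.vstack([X_normal, X_anom])
    y = np.r_[np.zeros(len(X_normal)), np.ones(n_anom)]
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_normal)
    scores = km.transform(X).min(axis=1)  # distance to the nearest cluster center
    return roc_auc_score(y, scores)       # higher AUC = better detection
```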
Table 3 shows the experimental results. M-DHIN is significantly better than the other baselines when detecting both 1% and 5% anomalies on every dataset. metapath2vec and MetaGraph2Vec outperform the homogeneous static network models DeepWalk and node2vec, indicating that meta-paths are more powerful than random walks in identifying anomalies: a meta-path defines specific relationships and is therefore more sensitive to outliers, whereas a random walk treats every node equally. As described above, MetaGraph2Vec is in effect a meta-path-based model, because it simply combines meta-paths. Change2vec outperforms DynGEM and dyngraph2vec because meta-paths are more expressive than the adjacency matrix of a node neighborhood: the adjacency matrix cannot describe the specific relationships between nodes, only the existence of nodes and edges, which it treats equally, and it is therefore less effective in anomaly detection. Note that M-DHIN-MG is superior to Change2vec, which confirms that the GRAMI-generated metagraphs of the present invention are more sensitive to anomalies than meta-paths, because metagraphs can express more relational information. The performance of M-DHIN is even better than that of M-DHIN-MG, which the invention attributes to using historical information to help identify whether newly arrived nodes or edges are anomalous.
TABLE 3 Experimental results of the anomaly detection task
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to include either of the permutations as a matter of course. That is, if X employs A; b is used as X; or X employs both A and B, then "X employs A or B" is satisfied in any of the foregoing examples.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art upon reading and understanding this specification and the annexed drawings. The present disclosure includes all such modifications and alterations and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above-described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component that performs the specified function of the described component (i.e., that is functionally equivalent), even if not structurally equivalent to the disclosed structure that performs the function in the exemplary implementations of the disclosure illustrated herein. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such a feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes", "has", "contains" or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If implemented as a software functional module and sold or used as a stand-alone product, the integrated module may also be stored in a computer-readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Each apparatus or system described above may execute the storage method in the corresponding method embodiment.
In summary, the above-mentioned embodiment is one implementation manner of the present invention, but the implementation of the present invention is not limited thereto; any other changes, modifications, substitutions, combinations, and simplifications that do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims (9)

1. An anomaly detection method based on the dynamic heterogeneous information network representation of the metagraph, characterized by comprising the following steps:
step 1, acquiring scientific-collaboration dynamic heterogeneous information network data comprising network nodes and network edge data;
step 2, introducing a complex-space embedding mechanism to represent the given dynamic heterogeneous information network at timestamp 1;
step 3, learning the dynamic heterogeneous information network from timestamp 2 to timestamp t by adopting a ternary metagraph dynamic embedding mechanism;
step 4, processing the heterogeneous information networks from timestamp 1 to timestamp t with a deep autoencoder based on a long short-term memory network, and predicting the graph at timestamp t+1 after analysis and calculation;
step 5, performing anomaly detection on the nodes in the network using the graph data from timestamps 1 to t+1 to obtain the anomaly detection result.
2. The method of claim 1, wherein the embedding mechanism in step 2 represents the network by a metagraph-based complex-space embedding scheme. Let $G_1=(V_1,E_1)$ be the initial heterogeneous information network at timestamp 1. To represent the relationship between nodes and metagraphs, the concept of a heterogeneous information network triple is introduced: a triple is written $(u,s,v)$, where $u$ is the first node in the metagraph, $v$ is the last node, and $s$ is the metagraph connecting $u$ and $v$; $\mathbf{u},\mathbf{v}\in\mathbb{R}^d$ and $\mathbf{s}\in\mathbb{R}^d$ are the representation vectors of $u$, $v$ and $s$, respectively, and $d$ is the dimension of the representation vectors. The probability that a heterogeneous information network triple holds is expressed as

$$P(s\mid u,v)=\sigma(X_{uv}),$$

where $X$ is a scoring matrix and $\sigma$ is an activation function. For a heterogeneous information network triple $(u,s,v)$, its complex-space embedding is written $\mathbf{u}=\mathrm{Re}(\mathbf{u})+i\,\mathrm{Im}(\mathbf{u})$, $\mathbf{v}=\mathrm{Re}(\mathbf{v})+i\,\mathrm{Im}(\mathbf{v})$ and $\mathbf{s}=\mathrm{Re}(\mathbf{s})+i\,\mathrm{Im}(\mathbf{s})$, where $\mathrm{Re}(\cdot)\in\mathbb{R}^d$ and $\mathrm{Im}(\cdot)\in\mathbb{R}^d$ denote the real and imaginary parts of the vector, respectively.

The Hadamard product is introduced to capture the relationship of $\mathbf{u}$, $\mathbf{v}$ and $\mathbf{s}$ in complex space, expressed as

$$\mathbf{u}\circ\bar{\mathbf{v}}=\big(\mathrm{Re}(\mathbf{u})\circ\mathrm{Re}(\mathbf{v})+\mathrm{Im}(\mathbf{u})\circ\mathrm{Im}(\mathbf{v})\big)+i\,\big(\mathrm{Im}(\mathbf{u})\circ\mathrm{Re}(\mathbf{v})-\mathrm{Re}(\mathbf{u})\circ\mathrm{Im}(\mathbf{v})\big),$$

where $\bar{\mathbf{v}}$ is the complex conjugate of $\mathbf{v}$ and $\circ$ is the element-wise product. One element of the scoring matrix is finally

$$X_{uv}=\mathrm{Re}\big(\langle\mathbf{u},\mathbf{s},\bar{\mathbf{v}}\rangle\big),$$

and the corresponding score function is defined as

$$f(u,s,v)=\mathrm{Re}\Big(\sum_{k=1}^{d}u_k\,s_k\,\bar{v}_k\Big),$$

where $\langle\cdot,\cdot,\cdot\rangle$ denotes the standard element-wise multilinear dot product.
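As a concrete illustration, the following is a minimal Python sketch of the complex-space triple scoring above, in the ComplEx-style form $\mathrm{Re}(\langle\mathbf{u},\mathbf{s},\bar{\mathbf{v}}\rangle)$. The dimension, random vectors, and all names are illustrative assumptions, not the patent's own implementation:

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)

def random_complex(dim):
    """Draw a complex embedding vector u = Re(u) + i*Im(u)."""
    return rng.normal(size=dim) + 1j * rng.normal(size=dim)

u, s, v = (random_complex(d) for _ in range(3))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(u, s, v):
    # Multilinear dot product with the complex conjugate of v; only the
    # real part contributes to the score.
    return np.real(np.sum(u * s * np.conj(v)))

# Probability that the HIN triple (u, s, v) holds: P(s|u,v) = sigma(X_uv).
p = sigmoid(score(u, s, v))
print(f"P(s|u,v) = {p:.4f}")
```

Taking only the real part of the multilinear product is what lets a single complex embedding capture asymmetric relations between $u$ and $v$.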
3. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 1, wherein a triple in the ternary metagraph dynamic embedding mechanism of step 3 is a set of three nodes: it is called a closed triple if every pair of the three nodes is connected, and an open triple if there are only two edges among the three nodes. To obtain the dynamic heterogeneous information network embedding from timestamps 1 to t, a negative sampling strategy is first used to form the training data set: in a positive triple (u, s, v) the nodes u and v are connected by the metagraph s, and in a negative triple (u', s', v') the nodes u' and v' are connected by the metagraph s'. For each positive heterogeneous information network triple (u, s, v), a negative triple is generated by randomly replacing u and v with other nodes whose types are restricted to be the same as those of the replaced nodes, and any replaced triple that is still positive is filtered out after sampling. The evolution of an open triple structure into a closed triple, and of a closed triple into an open triple structure, are the basic changes of dynamic heterogeneous information network evolution; the positive and negative evolution triples are used as the training set, and the complex-space embedding mechanism of step 2 is adopted for training to obtain the representation learning of the dynamic heterogeneous information network from timestamps 1 to t.
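A toy sketch of this negative-sampling strategy follows: a positive triple is corrupted by replacing its head or tail with a random node of the same type, and corruptions that are still positive are discarded. The node types, triple store, and names are assumptions for illustration:

```python
import random

nodes_by_type = {"author": ["a1", "a2", "a3"], "paper": ["p1", "p2", "p3"]}
node_type = {n: t for t, ns in nodes_by_type.items() for n in ns}
positives = {("a1", "writes", "p1"), ("a2", "writes", "p2")}

def corrupt(triple, max_tries=20):
    u, s, v = triple
    for _ in range(max_tries):
        if random.random() < 0.5:  # replace the head, keeping its type
            candidate = (random.choice(nodes_by_type[node_type[u]]), s, v)
        else:                      # replace the tail, keeping its type
            candidate = (u, s, random.choice(nodes_by_type[node_type[v]]))
        # Filter out corrupted triples that are still positive.
        if candidate not in positives and candidate != triple:
            return candidate
    return None

for pos in positives:
    print(pos, "->", corrupt(pos))
```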
4. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 3, characterized in that there are four kinds of changes to the dynamic heterogeneous information network; a code sketch of all four follows this list:

(1) An added edge forms a ternary closure process: collect all metagraphs in which three nodes that previously had only two edges among them become fully interconnected; at timestamp t these constitute the changed training set $S_{\mathrm{add}}^{t}$. For three nodes $v_1$, $v_2$ and $v_3$ in a metagraph $s$, with $(v_1,v_2)$ denoting the edge between $v_1$ and $v_2$, the metagraph obtained after the ternary closure process is defined as

$$s^{t}=\{(v_1,v_2),(v_2,v_3)\},\qquad s^{t+1}=s^{t}\cup\{(v_1,v_3)\}.$$

(2) A deleted edge results in a ternary opening process: collect all metagraphs whose triples evolve from a triangle into a path with two edges; at timestamp t these nodes are included in $S_{\mathrm{del}}^{t}$, and the metagraph after the ternary opening process is defined as

$$s^{t}=\{(v_1,v_2),(v_2,v_3),(v_1,v_3)\},\qquad s^{t+1}=s^{t}\setminus\{(v_1,v_3)\}.$$

(3) An added node: given an existing node $v_1$ in the metagraph and a newly added node $v_2$, the metagraph $s^{t}$ is expanded into $s^{t+1}=s^{t}\cup\{(v_1,v_2)\}$.

(4) A deleted node: given existing nodes $v_1$ and $v_2$ in the metagraph, if $v_2$ is deleted, then $s^{t}$ becomes $s^{t+1}=s^{t}\setminus\{(v_1,v_2)\}$.
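The four elementary changes can be sketched as set operations over a metagraph's edges. Representing an edge as a frozenset of two node names is an assumption made here for illustration:

```python
def closure(edges, v1, v3):
    """(1) An added edge closes an open triad: add (v1, v3)."""
    return edges | {frozenset((v1, v3))}

def opening(edges, v1, v3):
    """(2) A deleted edge opens a closed triad: remove (v1, v3)."""
    return edges - {frozenset((v1, v3))}

def add_node(edges, v1, v2):
    """(3) A newly added node v2 attaches to existing node v1."""
    return edges | {frozenset((v1, v2))}

def delete_node(edges, v2):
    """(4) A deleted node v2 takes all of its incident edges with it."""
    return {e for e in edges if v2 not in e}

s_t = {frozenset(("v1", "v2")), frozenset(("v2", "v3"))}   # open triad
s_t1 = closure(s_t, "v1", "v3")                            # now a triangle
assert opening(s_t1, "v1", "v3") == s_t                    # (2) undoes (1)
print(sorted(tuple(sorted(e)) for e in s_t1))
```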
5. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 1, characterized in that a change set is constructed on the basis of the original metagraphs; that is, when the change set is trained, nodes are trained on the metagraph obtained after the change process has finished. After the changed metagraph $s^{t+1}$ is obtained, only the change set $S^{t+1}$ is trained, and the embeddings of the changed nodes are obtained using the metagraph-based complex-space mechanism instead of retraining the entire network.
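As a tiny illustration of this incremental scheme (the triple store and changed-node set below are assumptions), only the triples that touch changed nodes would be re-trained, and all other embeddings would be reused:

```python
all_triples = [("a1", "writes", "p1"), ("a2", "writes", "p2"),
               ("a3", "writes", "p2")]
changed_nodes = {"a3", "p2"}   # nodes touched by this timestamp's changes

# Only triples involving changed nodes go through the complex-space
# training step; everything else keeps its existing embedding.
train_set = [t for t in all_triples
             if t[0] in changed_nodes or t[2] in changed_nodes]
print(train_set)
```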
6. The method of claim 1, wherein in step 4 the deep autoencoder model consists of an encoder and a decoder. To construct the encoder input for a node, its metagraph is taken as its neighboring nodes to form its adjacency matrix $A$; then, for any pair of nodes $u$ and $v$ in the metagraph, the encoder input consists of the time-series adjacency vectors of $u$ and $v$, denoted $\mathbf{a}_u^{1..t}$ and $\mathbf{a}_v^{1..t}$, respectively. $\mathbf{a}_u$ is the combination of two parts: one part is the row of the adjacency matrix $A$ that records the nodes adjacent to $u$, further mapped to a $d$-dimensional vector through a fully connected layer, and the other part is the dynamic node embedding of node $u$. The encoder then processes the input to obtain low-dimensional representations $\mathbf{y}_u$ and $\mathbf{y}_v$. The goal of the encoder is to predict the neighborhoods $\mathbf{a}_u^{t+1}$ and $\mathbf{a}_v^{t+1}$ from the embedding at timestamp t; the adjacency vectors predicted by the deep autoencoder are denoted $\hat{\mathbf{a}}_u^{t+1}$ and $\hat{\mathbf{a}}_v^{t+1}$.
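The two-part encoder input can be sketched as follows. The use of PyTorch, the sizes, and the random tensors are assumptions for illustration; the claim only fixes the structure (adjacency row through a fully connected layer, concatenated with the dynamic embedding):

```python
import torch
import torch.nn as nn

n_nodes, d, t_steps = 50, 16, 4

fc = nn.Linear(n_nodes, d)                       # maps adjacency row -> R^d
adj_rows = torch.rand(t_steps, n_nodes).round()  # adjacency row of u at each t
dyn_emb = torch.randn(t_steps, d)                # dynamic embedding of u at each t

# Time-series input a_u in R^{t x 2d}: [FC(adjacency row) || dynamic embedding]
a_u = torch.cat([fc(adj_rows), dyn_emb], dim=-1)
print(a_u.shape)  # torch.Size([4, 32])
```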
7. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 6, characterized in that, for a node $u$ and its neighborhood $\mathbf{a}_u\in\mathbb{R}^{t\times d}$, where $d$ is the embedding dimension and $t$ is the total number of time steps, the hidden representation at the first layer is expressed as

$$\mathbf{y}_u^{(1)}=f\big(W^{(1)}\mathbf{a}_u+\mathbf{b}^{(1)}\big),$$

where $W^{(1)}$ is the parameter matrix of layer 1 of the autoencoder, $d^{(1)}$ is the representation dimension of layer 1, $\mathbf{b}^{(1)}$ is the bias of layer 1 of the encoder, and $f$ denotes a sigmoid activation function.

The output of the k-th layer of the encoder is calculated as follows:

$$\mathbf{y}_u^{(k)}=f\big(W^{(k)}\mathbf{y}_u^{(k-1)}+\mathbf{b}^{(k)}\big).$$

To fully capture the information about the past evolution of the metagraph, a long short-term memory (LSTM) network layer is further applied on the output of the encoder. For the first LSTM layer, the hidden-state representation is calculated as:

$$\mathbf{i}_u^{\,\tau}=\delta\big(W_i^{(k+1)}\mathbf{y}_u^{(k),\tau}+U_i\mathbf{h}_u^{\,\tau-1}+\mathbf{b}_i\big),$$
$$\mathbf{f}_u^{\,\tau}=\delta\big(W_f^{(k+1)}\mathbf{y}_u^{(k),\tau}+U_f\mathbf{h}_u^{\,\tau-1}+\mathbf{b}_f\big),$$
$$\tilde{\mathbf{C}}_u^{\,\tau}=\tanh\big(W_C^{(k+1)}\mathbf{y}_u^{(k),\tau}+U_C\mathbf{h}_u^{\,\tau-1}+\mathbf{b}_C\big),$$
$$\mathbf{C}_u^{\,\tau}=\mathbf{f}_u^{\,\tau}\circ\mathbf{C}_u^{\,\tau-1}+\mathbf{i}_u^{\,\tau}\circ\tilde{\mathbf{C}}_u^{\,\tau},$$
$$\mathbf{o}_u^{\,\tau}=\delta\big(W_o^{(k+1)}\mathbf{y}_u^{(k),\tau}+U_o\mathbf{h}_u^{\,\tau-1}+\mathbf{b}_o\big),$$
$$\mathbf{h}_u^{\,\tau}=\mathbf{o}_u^{\,\tau}\circ\tanh\big(\mathbf{C}_u^{\,\tau}\big),$$

where $\mathbf{i}_u^{\,\tau}$ is the activation value of the input gate, $\mathbf{f}_u^{\,\tau}$ is the activation value of the forget gate, $\tilde{\mathbf{C}}_u^{\,\tau}$ is the newly predicted candidate state, $\mathbf{C}_u^{\,\tau}$ is the cell state of the LSTM, $\mathbf{o}_u^{\,\tau}$ is the activation value of the output gate, $\delta$ denotes the activation function, $W_*^{(k+1)}$ and $U_*$ are parameter matrices, $\mathbf{b}_*$ denotes the biases, and $d^{(k+1)}$ is the representation dimension of layer k+1.

The LSTM network has $l$ layers, so its final output can be expressed as

$$\mathbf{y}_u^{(k+l)}=\mathbf{h}_u^{(k+l),\,\tau},\qquad \mathbf{y}_u^{(k+l)}\in\mathbb{R}^{d^{(k+l)}}.$$

The training objective is to minimize the following loss function:

$$L_{t+1}=\big\|\big(\hat{\mathbf{a}}_u^{\,t+1}-\mathbf{a}_u^{\,t+1}\big)\odot\boldsymbol{\beta}\big\|_F^2.$$

The embedding at time t is used to penalize incorrect neighborhood reconstruction at time t+1, so that the LSTM-based deep autoencoder model can predict node embeddings at a future timestamp. Here $\hat{\mathbf{a}}_u^{\,t+1}=f(\mathbf{a}_u^{1..t})$, where $f(\cdot)$ denotes the function employed to generate the predicted neighborhood at timestamp t+1, namely the autoencoder framework described above; $\boldsymbol{\beta}$ is a hyperparameter matrix that balances the weight of the penalty on observed neighbors, and $\odot$ denotes the element-wise product.
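To make the encoder–LSTM–decoder pipeline and the weighted loss above concrete, here is a compact PyTorch sketch; the framework choice, layer sizes, and the particular β weighting scheme are assumptions for illustration, not values prescribed by the claim:

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.decoder = nn.Linear(hid_dim, in_dim)

    def forward(self, a_seq):           # a_seq: (batch, t, in_dim)
        y = self.encoder(a_seq)         # per-timestamp encoding
        h, _ = self.lstm(y)             # captures the past evolution
        return self.decoder(h[:, -1])   # predicted neighborhood at t+1

def weighted_loss(pred, target, beta):
    # L = || (a_hat - a) (.) B ||_F^2 : B up-weights observed neighbors, so
    # reconstructing an existing edge incorrectly is penalized more heavily.
    return torch.sum(((pred - target) * beta) ** 2)

model = LSTMAutoencoder(in_dim=32, hid_dim=16)
a_seq = torch.rand(8, 4, 32)                # 8 nodes, t=4 timestamps
target = torch.rand(8, 32).round()          # neighborhood vector at t+1
beta = torch.where(target > 0, 5.0, 1.0)    # assumed form of the B matrix
loss = weighted_loss(model(a_seq), target, beta)
print(loss.item())
```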
8. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 7, characterized in that the gradient of the objective function with respect to the decoder weights is computed as follows:

$$\frac{\partial L_{t+1}}{\partial W^{(k+l)}}=2\Big[\big(\hat{\mathbf{a}}_u^{\,t+1}-\mathbf{a}_u^{\,t+1}\big)\odot\boldsymbol{\beta}\Big]\,\frac{\partial\hat{\mathbf{a}}_u^{\,t+1}}{\partial W^{(k+l)}},$$

where $W^{(k+l)}$ is the parameter matrix of layer k+l of the autoencoder; after the derivatives are calculated, the SGD algorithm with the Adam optimizer is applied to train the model.
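Under the same illustrative assumptions as the previous sketch (and reusing its model, a_seq, target, beta and weighted_loss), the training step in this claim reduces to letting autograd compute the gradients, decoder weights included, and applying Adam:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
    optimizer.zero_grad()
    loss = weighted_loss(model(a_seq), target, beta)
    loss.backward()     # dL/dW for every parameter, decoder included
    optimizer.step()
    print(f"epoch {epoch}: loss={loss.item():.3f}")
```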
9. The anomaly detection method based on the metagraph dynamic heterogeneous information network representation according to claim 1, characterized in that the network nodes comprise author nodes, literature nodes, publishing platform nodes and topic nodes, and the network edge data are the association relationships among the network nodes.
CN202111505386.8A 2021-12-10 2021-12-10 Anomaly detection method based on dynamic heterogeneous information network representation of metagraph Pending CN114218445A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111505386.8A CN114218445A (en) 2021-12-10 2021-12-10 Anomaly detection method based on dynamic heterogeneous information network representation of metagraph


Publications (1)

Publication Number Publication Date
CN114218445A true CN114218445A (en) 2022-03-22

Family

ID=80700763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111505386.8A Pending CN114218445A (en) 2021-12-10 2021-12-10 Anomaly detection method based on dynamic heterogeneous information network representation of metagraph

Country Status (1)

Country Link
CN (1) CN114218445A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114826735A (en) * 2022-04-25 2022-07-29 国家计算机网络与信息安全管理中心 VoIP malicious behavior detection method and system based on heterogeneous neural network technology
CN114826735B (en) * 2022-04-25 2023-11-03 国家计算机网络与信息安全管理中心 VoIP malicious behavior detection method and system based on heterogeneous neural network technology


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination