CN116257662A - Heterogeneous graph community discovery method based on K neighbor graph neural network - Google Patents


Publication number
CN116257662A
Authority
CN
China
Prior art keywords
matrix
meta
path
graph
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310003986.7A
Other languages
Chinese (zh)
Inventor
刘小洋
吴玉蝶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN202310003986.7A priority Critical patent/CN116257662A/en
Publication of CN116257662A publication Critical patent/CN116257662A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a heterogeneous graph community discovery method based on a K-nearest-neighbor graph neural network, comprising the following steps. S1: compute node similarities from the node feature vectors to form a similarity matrix, and construct the K-nearest-neighbor graph adjacency matrix A_k; concatenate A_k with the adjacency matrix of the input graph to which the identity matrix has been added (Figure DDA0004035383460000011) to obtain a K-nearest-neighbor graph adjacency matrix carrying feature-space topology information (Figure DDA0004035383460000012). S2: assign attention scores to the different heterogeneous node-edge relations. S3: pass (Figure DDA0004035383460000013) through a meta-path information conversion layer to generate a meta-path transition matrix, generate a meta-path conversion matrix from it, then adaptively learn meta-paths, and fuse the meta-path information through a GCN to obtain node representations. S4: perform a k-means operation on the learned node representations, according to the number of node types, to form a community partition of the specific nodes. By fusing meta-path information, the method takes into account the possibility of node co-occurrence under higher-order relations and increases the probability that nodes with no connection in a meta-path are assigned to the same community.

Description

Heterogeneous graph community discovery method based on K neighbor graph neural network
Technical Field
The invention relates to the technical field of data mining, in particular to a heterogeneous graph community discovery method based on a K neighbor graph neural network.
Background
Community discovery is a fundamental and important research area in network science. It aims to divide the tightly connected nodes of a network into communities, so that nodes within a community are densely connected while connections between communities are sparse. Its theoretical and practical significance within network science should not be underestimated. In a social network, platform sponsors can promote products or deliver topic recommendations to target users in a detected community. In metabolic networks and protein-protein interaction networks, community discovery reveals the complexity of metabolism and identifies proteins with similar biological functions. In citation networks, community discovery helps determine the importance, interrelationships and evolution of research topics and identify research trends.
Many excellent works have addressed the community discovery task, and the resulting models fall into three main categories. The first, conventional community discovery, depends heavily on the structure of the network and discovers communities from its topology, for example by graph partitioning or hierarchical clustering. The second generates node sequences by graph embedding, random walks and similar techniques, and discovers communities by learning node embeddings more effectively. The third uses graph neural networks, which in recent years have been widely applied to tasks on graphs and have produced important results in community discovery: they propagate node features to neighbors and perform operations such as convolution over the underlying topology. Node representations learned by graph neural networks have been shown to achieve state-of-the-art community discovery performance on most datasets.
However, most existing community discovery work focuses on homogeneous graphs, in which all nodes share one type and all edges share one type. Because real-world networks often have multiple node types and multiple edge types, such solutions are clearly less effective on networks that reflect reality more closely. A graph with multiple types of nodes and edges is called a heterogeneous graph; such graphs are ubiquitous in the real world, contain more comprehensive information and richer semantics, and are therefore widely used in many data mining tasks.
Community discovery on heterogeneous graphs is very challenging: because of their high heterogeneity, traditional graph neural network models cannot be applied to them directly. Heterogeneous graphs contain a kind of composite relation with rich semantic information, called a meta-path, which is widely used in heterogeneous graph research. Because meta-paths carry rich multi-hop semantic information, some nodes without a direct edge between them are nevertheless very likely to belong to the same community. Community discovery on a heterogeneous graph should therefore take full account of similarity in the node feature space and of higher-order relations between nodes, rather than being limited to the existing topological edges. To capture higher-order information between nodes effectively, many innovative community discovery studies have been proposed, but most rely on manually defined meta-paths that must be revised by hand for each dataset. Such work depends heavily on the quality of the meta-paths, and different experts choose different meta-paths, leading to markedly different model results. In addition, each meta-path carries different semantic information, so it is preferable to distinguish the importance of each meta-path in order to fit the research task better and learn more effective node representations.
Disclosure of Invention
The invention aims to solve at least the technical problems existing in the prior art, and in particular provides a heterogeneous graph community discovery method based on a K-nearest-neighbor graph neural network.
In order to achieve the above object of the present invention, the present invention provides a heterogeneous graph community discovery method based on a K-nearest neighbor graph neural network, comprising the steps of:
S1, calculating node similarity from the feature vectors of the nodes to form a similarity matrix, the similarity matrix consisting of the similarities between a target node and the other nodes of the heterogeneous graph; selecting, according to the similarity matrix, the K nodes most similar to the target node as its neighbors and connecting them to it with edges, thereby constructing the K-nearest-neighbor graph adjacency matrix A_k;
concatenating the K-nearest-neighbor graph adjacency matrix A_k with the adjacency matrix of the input graph to which the identity matrix has been added (Figure BDA0004035383440000021) to obtain the K-nearest-neighbor graph adjacency matrix with feature-space topology information (Figure BDA0004035383440000022);
the adjacency matrix (Figure BDA0004035383440000023) is obtained by a Concat operation on the input graph matrix and the identity matrix. The input graph is the input heterogeneous graph.
Step S1 exploits the structural information in the feature space: it builds a similarity matrix from the node feature vectors and generates a K-nearest-neighbor graph to reinforce the similarity between nodes, thereby increasing the likelihood that nodes without a connecting edge are assigned to the same community.
S2, using a weight matrix to assign attention scores to the different heterogeneous node-edge relations, so as to distinguish the importance of different meta-paths;
S3, passing the K-nearest-neighbor graph adjacency matrix with feature-space topology information (Figure BDA0004035383440000024) through a meta-path information conversion layer to generate a meta-path transition matrix, generating a meta-path conversion matrix from it by matrix multiplication, then adaptively learning meta-paths, fusing the meta-path information through a GCN, and capturing higher-order relations between nodes to obtain node representations;
S4, performing a k-means operation on the learned node representations according to the number of node types, to form a community partition of the specific nodes.
Further, the methods for measuring node similarity include: cosine similarity, heat kernel, and dot product.
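The K-nearest-neighbor construction of step S1 can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation: the function names, the heat-kernel bandwidth `t`, and the choice of a 0/1 directed adjacency matrix are assumptions made for the example.

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    return Xn @ Xn.T

def heat_kernel_similarity(X, t=1.0):
    """Heat-kernel similarity exp(-||x_i - x_j||^2 / t); t is an assumed bandwidth."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / t)

def dot_product_similarity(X):
    """Pairwise dot products between the rows of X."""
    return X @ X.T

def knn_adjacency(S, k):
    """Directed 0/1 adjacency linking each node to its k most similar other nodes."""
    S = S.astype(float).copy()
    np.fill_diagonal(S, -np.inf)           # a node is not its own neighbor
    idx = np.argsort(-S, axis=1)[:, :k]    # indices of the k largest similarities per row
    A_k = np.zeros(S.shape)
    rows = np.repeat(np.arange(S.shape[0]), k)
    A_k[rows, idx.ravel()] = 1.0
    return A_k

# Four nodes with 2-dimensional features: nodes 0/1 and nodes 2/3 are similar.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A_k = knn_adjacency(cosine_similarity(X), k=1)
```

With `k=1`, each node is linked to its single most similar node, so nodes 0 and 1 point to each other, as do nodes 2 and 3, even if the input graph has no edge between them.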
Further, the calculation formula by which the K-nearest-neighbor graph adjacency matrix with feature-space topology information (Figure BDA0004035383440000025) generates the meta-path transition matrix through the meta-path information conversion layer is:
(Figure BDA0004035383440000026)
where T denotes the meta-path transition matrix carrying heterogeneous edge information;
f(·;·) denotes the function used to generate the meta-path transition matrix;
(Figure BDA0004035383440000031) denotes the K-nearest-neighbor graph adjacency matrix with feature-space topology information;
W_att denotes the convolution weights at the MP-Trans layer;
conv_{1×1} denotes a 1×1 convolution layer;
(Figure BDA0004035383440000032) is the parameter of the 1×1 convolution, (Figure BDA0004035383440000033) denotes the number of heterogeneous edge types, and (Figure BDA0004035383440000034) denotes a heterogeneous relation;
(Figure BDA0004035383440000035) denotes the heterogeneous adjacency matrix whose node-edge relation is (Figure BDA0004035383440000036), and (Figure BDA0004035383440000037) denotes the i-th set of edge types;
(Figure BDA0004035383440000038) denotes a convex combination of heterogeneous adjacency matrices.
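The convex combination of heterogeneous adjacency matrices described above can be illustrated with a short NumPy sketch. It assumes that the 1×1 convolution weights are normalized with a softmax so that they are non-negative and sum to one; the source only states that the result is a convex combination, so the softmax, the function names and the array layout `(num_edge_types, N, N)` are assumptions for illustration.

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-D weight vector."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

def soft_select(adj_stack, w_att):
    """Return sum_t alpha_t * A_t with alpha = softmax(w_att).

    adj_stack: array of shape (num_edge_types, N, N)
    w_att:     array of shape (num_edge_types,), one weight per edge type
    """
    alpha = softmax(w_att)
    return np.tensordot(alpha, adj_stack, axes=1), alpha

# Two edge types plus the identity matrix on a 3-node toy graph.
A1 = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]], dtype=float)
A2 = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
adj_stack = np.stack([A1, A2, np.eye(3)])
w_att = np.array([2.0, 0.0, 0.0])   # hypothetical learned weights

T, alpha = soft_select(adj_stack, w_att)
```

Because the weights `alpha` are attention-like scores over edge types, inspecting them after training is one way to read off which node-edge relations the layer considers important.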
Further, the calculation formula for generating the meta-path conversion matrix by matrix multiplication is:
(Figure BDA0004035383440000039)
where $A^{(l)}$ denotes the meta-path conversion matrix of the current layer;
(Figure BDA00040353834400000310) denotes a matrix;
$A^{(l-1)}$ denotes the meta-path conversion matrix of the previous layer;
f(·;·) denotes the function used to generate the meta-path transition matrix;
(Figure BDA00040353834400000311) denotes the heterogeneous graph input that fuses the K-nearest-neighbor graph information and the added identity matrix;
$W_{att}^{(l)}$ denotes the convolution weights of the current layer.
Further, adaptively learning the meta-path includes:
given a meta-path P formed by a series of composite relations (Figure BDA00040353834400000312), the meta-path conversion matrix corresponding to the meta-path can be calculated by:
(Figure BDA00040353834400000313)
where (Figure BDA00040353834400000314) denotes the heterogeneous adjacency matrix whose edge type is (Figure BDA00040353834400000315);
(Figure BDA00040353834400000316) denotes the set of edge types;
(Figure BDA00040353834400000317) denotes the weight of each meta-path transition matrix;
$A_P$ denotes the weighted sum of the l heterogeneous adjacency matrices.
These transition adjacency matrices, i.e. adjacency matrices with heterogeneous node-edge relations, contain the original node-edge relation information, but the characteristics of the original edges themselves would be ignored when meta-path information is fused. The identity matrix I is therefore added to the heterogeneous adjacency matrices. In this way meta-paths of any length can be learned: when l meta-path conversion matrices are multiplied, meta-path information of lengths up to l+1 can be fused.
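The effect of adding the identity matrix can be checked on a toy example. The sketch below multiplies two soft-selected matrices on a 3-node chain; the mixing weights are hypothetical values chosen by hand rather than learned ones, and the helper name `combine` is an assumption.

```python
import numpy as np

# Toy chain 0 -> 1 -> 2 with one edge type per hop.
A_ab = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]], dtype=float)  # edge 0 -> 1
A_bc = np.array([[0, 0, 0], [0, 0, 1], [0, 0, 0]], dtype=float)  # edge 1 -> 2
I = np.eye(3)

def combine(alpha, mats):
    """Weighted sum of candidate matrices (weights are assumed to sum to 1)."""
    return sum(a * M for a, M in zip(alpha, mats))

mats = [A_ab, A_bc, I]
T1 = combine([0.5, 0.0, 0.5], mats)   # first layer: mostly A_ab, partly I
T2 = combine([0.0, 0.5, 0.5], mats)   # second layer: mostly A_bc, partly I

with_identity = T1 @ T2               # contains paths of length 0, 1 and 2
without_identity = A_ab @ A_bc        # contains only the length-2 path
```

Expanding the product gives `0.25 * (A_ab @ A_bc + A_ab + A_bc + I)`, so the original one-hop edges survive alongside the two-hop meta-path, whereas the product without the identity keeps only the two-hop connection from node 0 to node 2.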
Further, obtaining the node representation comprises the following steps:
the meta-path conversion matrix is obtained by l successive multiplications, and a GCN and an MLP are applied to each channel of the meta-path conversion matrix, so that the node representation is:
(Figure BDA0004035383440000041)
where ∥ is the concatenation operator, which splices together the node representations of the channels;
C denotes the number of channels;
σ denotes an activation function;
(Figure BDA0004035383440000042) denotes, for a given channel at layer l, the meta-path conversion matrix (Figure BDA0004035383440000043) fused with the self-loop identity matrix and the K-nearest-neighbor graph adjacency matrix;
(Figure BDA0004035383440000044) is the degree matrix of (Figure BDA0004035383440000045);
$X \in \mathbb{R}^{N \times d}$ is the feature matrix of the input heterogeneous graph, N is the number of nodes, and d is the node feature dimension;
$W \in \mathbb{R}^{d \times dim}$ is a trainable weight matrix of the neural network model, dim is the output embedding dimension of the model, and $Z \in \mathbb{R}^{N \times dim}$ is the final output node embedding.
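A per-channel graph convolution followed by concatenation can be sketched as below. The sketch assumes row normalization by the inverse degree matrix, a ReLU activation, and a weight matrix shared across channels; it omits the MLP mentioned in the text, so its output has `C * dim` columns rather than `dim`. All names are illustrative.

```python
import numpy as np

def channel_gcn(A_channels, X, W):
    """Concatenate relu(D_i^{-1} (A_i + I) X W) over the channels i.

    A_channels: iterable of (N, N) non-negative meta-path matrices, one per channel
    X:          (N, d) node feature matrix
    W:          (d, dim) weight matrix (shared across channels in this sketch)
    """
    N = X.shape[0]
    outputs = []
    for A in A_channels:
        A_tilde = A + np.eye(N)                    # add self-loops
        deg = A_tilde.sum(axis=1, keepdims=True)   # row degrees, always >= 1
        H = (A_tilde / deg) @ X @ W                # inverse-degree normalization
        outputs.append(np.maximum(H, 0.0))         # ReLU
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
N, d, dim, C = 4, 3, 2, 2
A_channels = [rng.integers(0, 2, size=(N, N)).astype(float) for _ in range(C)]
X = rng.normal(size=(N, d))
W = rng.normal(size=(d, dim))

Z = channel_gcn(A_channels, X, W)   # shape (N, C * dim)
```

Keeping one output block per channel lets each channel specialize in a different learned meta-path before a later layer (here omitted) mixes them into the final embedding.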
In summary, due to the adoption of the technical scheme, the invention has the following advantages:
The feature information of the node space is taken into account: a K-nearest-neighbor graph topology is constructed and fused with the existing network topology, which fully reinforces the similarity between nodes and increases the probability that nodes with no connection in a meta-path are assigned to the same community.
The invention provides a heterogeneous graph neural network community discovery method, KGNN_HCD, which learns meta-paths end to end, captures higher-order relations, distinguishes the importance of different meta-paths, learns high-quality node representations, and supports interpretable analysis of the meta-path conversion layer.
Extensive comparative experiments were carried out on three real heterogeneous datasets (ACM, DBLP and IMDB) against node community discovery models such as CP-GNN and GTN. The results show that, compared with existing heterogeneous community discovery methods such as CP-GNN and GTN, KGNN_HCD improves significantly on metrics such as F1, NMI, ARI and purity.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic diagram of a heterogeneous graph according to the present invention. FIG. 1 (a) shows the three node types: author, paper and conference; FIG. 1 (b) is an example of a heterogeneous graph on the DBLP dataset; FIG. 1 (c) shows the three meta-paths used in DBLP: author-paper-author (APA), author-paper-author-paper-author (APAPA) and author-paper-conference-paper-author (APCPA).
Fig. 2 is a frame diagram of a heterogeneous graph community discovery method fusing K neighbor graph information.
Fig. 3 is a schematic diagram of the construction process of the K-nearest neighbor graph based on the feature vector according to the present invention.
FIG. 4 is four metrics for three datasets in different dimensions according to the present invention.
FIG. 5 is an analysis of K in the K-nearest-neighbor graph on the three datasets.
FIG. 6 shows the NMI and ARI of MP-Trans with different numbers of layers on the three datasets.
FIG. 7 is a numerical analysis of the MP-Trans layer on the ACM dataset.
FIG. 8 is a numerical analysis of the MP-Trans layer on the IMDB dataset.
Fig. 9 is a visual results graph of ACM, DBLP and IMDB dataset ablation studies for community discovery.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Three representative methods related to community discovery at present are a traditional community discovery method, a community discovery algorithm based on network embedding and a community discovery method based on heterogeneous graph neural network.
Traditional community discovery
Traditional community discovery methods, which have attracted a great deal of research attention, rely mainly on network structure to explore communities. The Infomap algorithm encodes the communities in the network and the nodes inside each community at the same time, producing a unique code for every node. It then performs random walks on the network to obtain the total code length; the optimal solution is reached when the code length is shortest, at which point tightly connected nodes are assigned to the same community. The label propagation algorithm (Label Propagation Algorithm, LPA) depends heavily on the topology of the network: it uses the labels of labeled nodes to predict the labels of unlabeled nodes, identifying communities by diffusion. However, the methods above all target homogeneous graphs. Some strong work has also emerged on community discovery in heterogeneous graphs. Het-SC and Het-RSC obtain community partitions by applying k-means clustering to the eigenvectors corresponding to the different node types. AGGMMR proposes a framework that performs community detection from attribute and topology information through a greedy modularity maximization model. AJNMF builds a heterogeneous network matrix that jointly considers link information and node content information, based on regularization and non-negative matrix factorization, and introduces a regulating function to reduce the influence of noisy data on communities, improving the community discovery result. Other work uses meta-paths to capture higher-order relations between nodes and discover community structure in heterogeneous graphs.
Community discovery based on network embedding
Network embedding is a network representation learning approach that replaces a high-dimensional, sparse vector space with a low-dimensional, dense one and strengthens the expressive power of node embeddings according to community characteristics, so many researchers have sought to solve community discovery through graph embedding. DeepWalk captures the co-occurrence of nodes in the graph by simulating uniform random walks on the network and learns node vector representations from them; its sampling strategy can be regarded as the special case of node2vec with p=1 and q=1. Node2vec proposes random walks based on DFS and BFS, mining node representations that reflect homophily and structural similarity respectively. CDE proposes a novel embedding-based approach: it encodes the inherent community structure into a structure embedding through known community memberships, and then, based on node attributes and the community structure embedding, formulates community discovery as a matrix factorization optimization problem. NEC proposes a learnable network embedding algorithm for community discovery in heterogeneous graphs that jointly learns graph-structure-based and cluster-oriented representations, after which communities are discovered with k-means. Metapath2vec, a vertex embedding method for heterogeneous information networks (Heterogeneous Information Network, HIN) proposed by Dong et al. in 2017, uses meta-paths to guide random walks on heterogeneous graphs so that the generated node sequences carry rich semantic information. Compared with some earlier models, HIN2Vec retains more context information: it not only assumes that two nodes connected by a relation are related, but also distinguishes the different relations between nodes and treats them differently by jointly learning relation vectors.
Community discovery based on graph neural network
Graph neural networks are a deep learning approach for the graph domain: they capture the dependencies within a graph, cope with the unordered nature of graph input, and learn a state embedding from the neighbors of each node, so scholars have studied many novel methods based on them. As one of the most widely used deep learning techniques, graph neural networks learn from graph-structured data and extract and discover the features and patterns in it, which gives them great capability in community discovery. LGNN is an advanced deep-learning graph neural network model that exploits the adjacency information of edges in the graph; it has strong feature representation capability and can be used for homogeneous community discovery. HAN is one of the earliest works on heterogeneous graphs and requires meta-paths to be predefined for each dataset. It uses a hierarchical attention mechanism to capture node-level and semantic-level importance, and uses GAT to assign attention scores to neighbors under the different meta-paths to produce the final representation. MAGNN improves further on HAN: HAN considers only the two end nodes of a meta-path, whereas MAGNN proposes several meta-path encoders that encode all the information along the path, including intermediate nodes, and provides an excellent community discovery solution. CP-GNN proposes context paths built on meta-paths: if two primary nodes are connected by a context path, the two nodes have a semantic relation. CP-GNN learns node embeddings by maximizing the co-occurrence probability of context neighbor nodes, captures higher-order relations, and does not require predefined meta-paths.
GTN is a method for heterogeneous graphs that does not require manually defined meta-paths. It can be regarded as a graph analogue of spatial transformer networks, which explicitly learn spatial transformations of input images or features; GTN obtains effective node representations, reduces the effect of high heterogeneity, and improves community discovery results.
For ease of the description that follows, the relevant terms and network model definitions are presented here. Table 1 summarizes the symbols used in this patent.
Table 1. Symbols used in this patent
Figure BDA0004035383440000061
Figure BDA0004035383440000071
Heterogeneous graph: a heterogeneous graph (or heterogeneous information network) is an abstract modeling language for heterogeneous relational data and many complex systems, as shown in FIG. 1. Unlike a homogeneous graph, a heterogeneous graph has different types of nodes and edges. It is generally defined as (Figure BDA0004035383440000072), where V denotes the node set and ε denotes the edge set, together with a node type mapping function (Figure BDA0004035383440000073) and an edge type mapping function (Figure BDA0004035383440000074), and (Figure BDA0004035383440000075). Each node (Figure BDA0004035383440000077) of the graph (Figure BDA0004035383440000076) corresponds to one node type, i.e. (Figure BDA0004035383440000078). Likewise, each edge e_ij ∈ ε of the graph corresponds to one edge type, i.e. (Figure BDA0004035383440000079).
As shown in FIG. 1 (a) and FIG. 1 (b), a simple heterogeneous graph is constructed using the DBLP dataset as a prototype. The graph has three node types: paper (P), author (A) and conference (C). It has two edge types: paper-author (P-A) and paper-conference (P-C).
Meta-path: in a heterogeneous graph, a path that connects two heterogeneous nodes through a composite relation is called a meta-path. Meta-paths capture semantic information effectively and are a basic tool for studying heterogeneous graphs. A meta-path P is constructed in the form (Figure BDA00040353834400000710) (which can be abbreviated as (Figure BDA00040353834400000711)), where (Figure BDA00040353834400000712), and (Figure BDA00040353834400000714) denotes the connection symbol; its length is the number of heterogeneous relations (Figure BDA00040353834400000713).
As in FIG. 1 (c), two authors can be connected by different meta-paths, such as author-paper-author-paper-author (APAPA) and author-paper-conference-paper-author (APCPA). Although only the middle node differs, the semantic information expressed by the two meta-paths is completely different: the former indicates that author 1 and author 4 have each co-authored with author 2, and the latter indicates that papers written by the two authors belong to the same conference. Meta-paths of different lengths also differ in the richness of the semantic information they express; for example, of the meta-paths author-paper-author (APA) and author-paper-conference-paper-author (APCPA), the latter carries richer semantic information.
Heterogeneous adjacency matrix: in a heterogeneous graph the node types and edge types differ, so representing the original graph with a conventional adjacency matrix would lose the heterogeneity of the graph. This patent therefore constructs heterogeneous adjacency matrices to represent the heterogeneous graph. Given a heterogeneous graph, its node type set is (Figure BDA0004035383440000081) and its edge type set is (Figure BDA0004035383440000082), and (Figure BDA0004035383440000083). (Figure BDA0004035383440000084) (Figure BDA0004035383440000085) denotes the node-edge relation matrix between (Figure BDA0004035383440000086) and (Figure BDA0004035383440000087): if an edge exists between (Figure BDA0004035383440000088) and (Figure BDA0004035383440000089), the corresponding element is 1, and otherwise 0, while the values of the other three node-edge relation type matrices are all 0. The heterogeneous graph can be expressed as (Figure BDA00040353834400000810), where N is the number of nodes in the heterogeneous graph. In addition, the node-edge relation matrix is referred to herein as a heterogeneous adjacency matrix, because it stores the node-edge relation: an entry is 1 where an edge exists and 0 otherwise.
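One way to hold the per-type adjacency matrices described above is a single array with one N×N slice per edge type. The sketch below builds such a stack from a typed edge list for a DBLP-like toy graph; the node numbering, the edge-type names and the `(num_edge_types, N, N)` layout are assumptions for illustration.

```python
import numpy as np

def build_hetero_adjacency(num_nodes, typed_edges, edge_types):
    """Stack one 0/1 adjacency matrix per edge type.

    typed_edges: iterable of (source, target, edge_type) triples
    edge_types:  ordered list of edge-type names
    Returns an array of shape (len(edge_types), num_nodes, num_nodes).
    """
    type_index = {t: i for i, t in enumerate(edge_types)}
    stack = np.zeros((len(edge_types), num_nodes, num_nodes))
    for src, dst, t in typed_edges:
        stack[type_index[t], src, dst] = 1.0   # 1 where an edge of this type exists
    return stack

# Toy graph: authors 0-1, papers 2-3, conference 4.
edge_types = ["PA", "AP", "PC", "CP"]
typed_edges = [
    (2, 0, "PA"), (0, 2, "AP"),   # paper 2 written by author 0
    (3, 1, "PA"), (1, 3, "AP"),   # paper 3 written by author 1
    (2, 4, "PC"), (4, 2, "CP"),   # paper 2 published at conference 4
    (3, 4, "PC"), (4, 3, "CP"),   # paper 3 published at conference 4
]
A = build_hetero_adjacency(5, typed_edges, edge_types)
```

Each slice `A[t]` is zero everywhere except for the node pairs joined by edge type `t`, which is what preserves the heterogeneity that a single merged adjacency matrix would lose.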
Where no ambiguity arises, given a composite relation (Figure BDA00040353834400000811) or an edge type sequence (Figure BDA00040353834400000812), a meta-path can also be represented directly by the object types, i.e. $P = (v_1 v_2 \dots v_{l+1})$. The adjacency matrix $A_P$ representing the meta-path P can be obtained according to the following formula:
(Figure BDA00040353834400000813)
where $A_P$ denotes the meta-path conversion matrix corresponding to the meta-path P; see formula (8) for details;
(Figure BDA00040353834400000814) denotes the node-edge relation matrix formed by $A_l$ and (Figure BDA00040353834400000815);
$A_l$ denotes the adjacency matrix of node-edge type l in the heterogeneous graph.
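A meta-path adjacency matrix can be illustrated by multiplying the per-type adjacency matrices along the path. The example below uses a toy DBLP-like graph and assumes the convention that row i, column j of a matrix marks an edge from node i to node j, with matrices multiplied left to right in path order; the source formula may use a different ordering convention.

```python
import numpy as np
from functools import reduce

N = 6  # authors 0, 1, 2; papers 3, 4; conference 5

def adj(edges):
    """0/1 adjacency matrix with A[s, d] = 1 for each edge s -> d."""
    A = np.zeros((N, N))
    for s, d in edges:
        A[s, d] = 1.0
    return A

A_AP = adj([(0, 3), (1, 3), (2, 4)])   # authors 0 and 1 wrote paper 3, author 2 wrote paper 4
A_PA = A_AP.T
A_PC = adj([(3, 5), (4, 5)])           # both papers appeared at conference 5
A_CP = A_PC.T

def metapath_adjacency(mats):
    """Multiply the adjacency matrices in path order."""
    return reduce(np.matmul, mats)

APA = metapath_adjacency([A_AP, A_PA])
APCPA = metapath_adjacency([A_AP, A_PC, A_CP, A_PA])
```

Here `APA[0, 1]` is 1 because authors 0 and 1 share a paper, while `APA[0, 2]` is 0. `APCPA[0, 2]` is 1 because their papers share a conference, which shows how a longer meta-path links nodes that have no direct or co-author connection.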
KNN_graph: the K-nearest-neighbor graph (KNN_graph) is a weighted directed graph (Figure BDA00040353834400000816), where (Figure BDA00040353834400000817) denotes the node set (Figure BDA00040353834400000818) and $\varepsilon_k$ denotes the edge set $\varepsilon_k = \{e_{k1}, e_{k2}, \dots, e_{km}\}$. Unlike an ordinary graph, in the K-nearest-neighbor graph $\varepsilon_k$ stores the edges from (Figure BDA00040353834400000819) to its K most similar nodes under some similarity measure.
Community: given a set of communities (Figure BDA00040353834400000820), each community $C_m$ is a partition of the original graph that preserves regional structure and clustering properties. A node $v_i$ is clustered into a community $C_m$ on the condition that its degree inside the community exceeds its degree outside the community.
Graph roll-up network (GCN): GCN as a specific graph-based neural network model f (X, a) constructs a multi-layer graph convolutional network with the following layer propagation rules:
Figure BDA00040353834400000821
wherein H is (l) A feature representation for the first layer; the upper mark in the text (l) All representing a layer i graph roll-up network.
Figure BDA00040353834400000822
is the adjacency matrix of the heterogeneous graph with self-connections added;
Figure BDA00040353834400000823
is the degree matrix;
W^(l) ∈ R^{d×d} is a trainable weight matrix;
σ(·) denotes an activation function, such as ReLU(·) = max(0, ·).
In this work, because the object of study is a directed graph, the graph convolution operation uses the inverse of the degree matrix
Figure BDA0004035383440000094
to perform the normalization.
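The propagation rule with inverse-degree normalization described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name `gcn_layer`, the use of row sums as node degrees, and the toy graph and weights are all chosen here for illustration.

```python
import numpy as np

def gcn_layer(adj, h, w):
    """One graph convolution step: ReLU(D^-1 (A + I) H W)."""
    a_tilde = adj + np.eye(adj.shape[0])     # add self-connections
    deg = a_tilde.sum(axis=1)                # row sums, assumed here to be the node degree
    a_norm = a_tilde / deg[:, None]          # D^-1 (A + I): every row sums to 1
    return np.maximum(0.0, a_norm @ h @ w)   # ReLU activation

# Toy directed graph with 3 nodes and edges 0 -> 1 and 1 -> 2 (illustrative only)
adj = np.array([[0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 0.0, 0.0]])
h0 = np.eye(3)           # one-hot input features
w0 = np.ones((3, 2))     # fixed weights so the result is easy to check by hand
h1 = gcn_layer(adj, h0, w0)
```

Because each row of the normalized matrix sums to 1, the all-ones weight matrix makes every output entry equal to 1, which gives a quick sanity check of the normalization.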
The method provided by the invention aims to find communities of nodes of the same target type in a heterogeneous graph, to mine structural information in the feature space, and to generate high-quality node embeddings. The proposed KGNN_HCD generates a K-nearest neighbor graph from the node features to capture the underlying topology of the feature space, constructs heterogeneous adjacency matrices from the dataset to find salient meta-paths, and runs a more effective graph convolutional network to learn more powerful node representations.
The invention provides a heterogeneous graph community discovery method, KGNN_HCD, based on a K-nearest neighbor graph neural network; the architecture of the method is shown in FIG. 2. In FIG. 2, KGNN_HCD takes the heterogeneous adjacency matrices together with the K-nearest neighbor graph adjacency matrix, which represents the topology of the feature space, as the input of the GCN. It assigns weights to the different heterogeneous point-edge relations with a weight matrix, generates meta-path transition matrices in the MP-Trans layer, and from them further generates the meta-path conversion matrix. Graph convolution is then performed on the new graph structure in the GCN, and the k-means method is applied to the learned node representations to find the community structure in the heterogeneous graph.
1. The proposed method
1.1 K-nearest neighbor graph information fusion
To obtain the structural information in the feature space, a K-nearest neighbor graph is constructed from the input feature matrix X of the heterogeneous graph:
Figure BDA0004035383440000091
It can reveal the underlying structure of the data, where A_k denotes the adjacency matrix of the K-nearest neighbor graph. The specific operation is shown in FIG. 3.
In FIG. 3, node similarity is calculated from the feature vectors of the nodes, and the K nodes most similar to the target node are selected according to the similarity matrix to construct the K-nearest neighbor graph adjacency matrix required by the model. The core of constructing the K-nearest neighbor graph in the KGNN_HCD method is the choice of similarity measure; with a given measure, the similarity between nodes can be calculated to form a similarity matrix S. The invention can adopt the following three similarity measures, where x_i and x_j are the feature vectors of nodes v_i and v_j:
1) Cosine similarity, measured by the cosine of the angle between two vectors, with values in the range [-1, 1]:
Figure BDA0004035383440000092
2) Heat kernel: the similarity is calculated as shown in formula (4), where t is the time parameter in the heat conduction equation:
Figure BDA0004035383440000093
the term "distance" refers to the distance between two vectors.
3) Dot product, which is mainly applied to discrete data such as bag-of-words features, where the calculated similarity depends only on the number of shared words; the calculation is shown in formula (5):
S_ij = x_j^T x_i (5)
Figure BDA0004035383440000101
is the transpose symbol.
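The three similarity measures listed above can be sketched as follows. Formulas (3) and (4) survive only as figure placeholders in this text, so the cosine form and, in particular, the heat kernel form exp(-||x_i - x_j||^2 / t) used here are the commonly used definitions and should be read as assumptions; the function names and toy features are also illustrative.

```python
import numpy as np

def cosine_similarity(x):
    """S_ij = x_i . x_j / (||x_i|| ||x_j||), values in [-1, 1]."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)       # guard against zero vectors
    return unit @ unit.T

def heat_kernel_similarity(x, t=1.0):
    """S_ij = exp(-||x_i - x_j||^2 / t); this form is an assumption."""
    sq = np.sum(x ** 2, axis=1)
    sq_dist = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.exp(-np.clip(sq_dist, 0.0, None) / t)

def dot_product_similarity(x):
    """S_ij = x_j^T x_i, as in formula (5)."""
    return x @ x.T

# Toy feature matrix: three nodes with 2-dimensional features
x = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 0.0]])
s_cos = cosine_similarity(x)
s_heat = heat_kernel_similarity(x, t=2.0)
s_dot = dot_product_similarity(x)
```

Nodes 0 and 2 point in the same direction, so their cosine similarity is 1 even though their dot product is 2; this is the kind of scale difference that makes the choice of measure matter.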
In the present invention, cosine similarity is used throughout as the similarity measure to obtain the similarity matrix S. The K most similar nodes of each node are then selected as its neighbors and connected to it by edges, generating an undirected K-nearest neighbor graph and thus the required K-nearest neighbor graph adjacency matrix A_k.
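The construction of A_k from the cosine similarity matrix can be sketched as below. The source does not say whether a node may be its own neighbor or how the graph is made undirected, so excluding self-similarity and symmetrizing with an element-wise maximum are assumptions of this sketch, as are the function name and toy data.

```python
import numpy as np

def knn_adjacency(x, k):
    """Build an undirected K-nearest neighbor adjacency matrix from features x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    s = unit @ unit.T                          # cosine similarity matrix S
    np.fill_diagonal(s, -np.inf)               # assumption: a node is not its own neighbor
    top_k = np.argsort(-s, axis=1)[:, :k]      # indices of the k most similar nodes per row
    n = x.shape[0]
    a_k = np.zeros((n, n))
    a_k[np.repeat(np.arange(n), k), top_k.ravel()] = 1.0
    return np.maximum(a_k, a_k.T)              # assumption: symmetrize to get an undirected graph

# Toy features: nodes 0 and 1 are similar, nodes 2 and 3 are similar
x_toy = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0],
                  [0.1, 0.9]])
a_k = knn_adjacency(x_toy, k=1)
```

With k = 1 the toy graph contains exactly the edges 0-1 and 2-3, which connects nodes that are close in feature space regardless of whether they were linked in the original graph.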
After the above operation, the underlying structural information in the feature space has been acquired. At the same time, the K most similar neighbors are added for each node to further aggregate information, which increases node similarity and raises the likelihood that nodes co-occur. The adjacency matrix A_k of the K-nearest neighbor graph is spliced with the adjacency matrix of the input graph with the identity matrix added,
Figure BDA0004035383440000102
to obtain the final input
Figure BDA0004035383440000103
1.2 Meta-path information conversion
Rich meta-path semantics are an important feature of heterogeneous information networks: under different meta-paths the connection relations between objects differ and so do the path definitions, which can affect many specific tasks. Previous meta-path-based work depends heavily on predefined meta-paths, so model performance is directly affected by which meta-paths domain experts define by hand. In contrast, KGNN_HCD can learn meta-paths end to end and can generate meta-paths of any length according to the characteristics of the dataset and the task requirements.
At this point, the K-nearest neighbor graph adjacency matrix expressing the feature-space topology has already been added to the heterogeneous graph. When meta-path information is subsequently aggregated, note that point-edge relations of different types play different roles and show different importance when learning node embeddings for a specific task. A weight convolution W_att is therefore introduced to constrain point-edge relations of different importance; the weight convolution is normalized by a softmax function so that the point-edge relation matrices are expressed differentially, generating a meta-path transition matrix with heterogeneous edge information. The calculation formula is as follows.
Figure BDA0004035383440000104
where T denotes the meta-path transition matrix with heterogeneous edge information;
f(·;·) denotes the function used to generate the meta-path transition matrix;
Figure BDA0004035383440000105
representing a K-nearest neighbor graph adjacency matrix with feature space topology information;
W_att denotes the weight convolution in the MP-Trans layer;
conv_{1×1} denotes a 1 × 1 convolution layer;
Figure BDA0004035383440000111
is a parameter of a 1 x 1 convolution;
Figure BDA0004035383440000112
representing a set of edge types;
Figure BDA0004035383440000113
denotes the heterogeneous adjacency matrix for the following point-edge relation:
Figure BDA0004035383440000114
W_i^(l) = softmax(W_att) gives the coefficients of the convex combination of the heterogeneous adjacency matrices.
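The softmax-weighted convex combination described above can be sketched for a single output channel as follows. Formula (6) is only a figure placeholder in this text, so this is a sketch of the mechanism as described in the prose (a 1 × 1 convolution over the stack of candidate adjacency matrices), not a reproduction of the formula; `meta_path_transition` and the toy matrices are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # subtract the max for numerical stability
    return e / e.sum()

def meta_path_transition(adjs, w_att):
    """Convex combination of candidate adjacency matrices.

    adjs:  array of shape (E, N, N), one matrix per point-edge relation
    w_att: array of shape (E,), unnormalized weights of the 1 x 1 convolution
    """
    alpha = softmax(w_att)                         # non-negative weights that sum to 1
    return np.tensordot(alpha, adjs, axes=1), alpha

# Toy stack: two edge types plus the identity matrix on a 2-node graph
adjs = np.stack([np.array([[0.0, 1.0], [0.0, 0.0]]),
                 np.array([[0.0, 0.0], [1.0, 0.0]]),
                 np.eye(2)])
t_matrix, alpha = meta_path_transition(adjs, np.zeros(3))
```

With all weights equal, every relation receives a coefficient of 1/3; during training the weights would move away from uniform, which is what lets the model express that some relations matter more than others.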
So that the proposed method can learn different behaviors with the same mechanism and combine them as knowledge, KGNN_HCD sets the number of output channels of the convolution layer to C in order to fully consider different types of meta-paths, similar to a multi-head attention mechanism. In the first layer of meta-path information fusion, that is, in the meta-path information conversion layer (MP-Trans Layer), two meta-path transition matrices T_1 ∈ R^{N×N×C} and T_2 ∈ R^{N×N×C} are calculated. Both transition matrices carry heterogeneous point-edge relation information (for the ACM dataset, for example, PA, AP, PS and SP). Multiplying T_1 by T_2 performs the first fusion of meta-path information; at this point the meta-path information under all valid combinations of point-edge relations with meta-path length 3 is learned. For numerical stability, KGNN_HCD also normalizes the meta-path conversion matrix A^(l) with the degree matrix of the heterogeneous graph, i.e.
Figure BDA0004035383440000115
Each subsequent meta-path conversion matrix can be obtained by multiplying the meta-path transition matrix computed in the current layer by the meta-path conversion matrix of the previous layer, i.e.
Figure BDA0004035383440000116
where A^(l) denotes the meta-path conversion matrix of the current layer;
Figure BDA0004035383440000117
a representation matrix;
A^(l-1) denotes the meta-path conversion matrix of the previous layer;
f(·;·) denotes the function used to generate the meta-path transition matrix;
Figure BDA0004035383440000118
representing the fusion K neighbor graph information and the heterogeneous graph input of the added identity matrix;
W_att^(l) denotes the weight convolution of the current layer.
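The layer-by-layer recursion can be sketched as below. The two formulas it paraphrases are only figure placeholders in this text, so the multiplication order, the use of row sums as the degree matrix, and the names `row_normalize` and `stack_mp_trans` are assumptions made for illustration.

```python
import numpy as np

def row_normalize(m):
    """Multiply by the inverse degree matrix, taking row sums as degrees (assumption)."""
    deg = m.sum(axis=1, keepdims=True)
    return m / np.where(deg == 0, 1.0, deg)     # leave all-zero rows untouched

def stack_mp_trans(t_list):
    """A(1) = D^-1 T1 T2, then A(l) = D^-1 A(l-1) T for each further transition matrix."""
    a = row_normalize(t_list[0] @ t_list[1])
    for t in t_list[2:]:
        a = row_normalize(a @ t)
    return a

# Toy graph with papers 0 and 2 and author 1 (hypothetical data)
t1 = np.array([[0.0, 1.0, 0.0],      # paper -> author edges
               [0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
t2 = np.array([[0.0, 0.0, 0.0],      # author -> paper edges
               [1.0, 0.0, 1.0],
               [0.0, 0.0, 0.0]])
a1 = stack_mp_trans([t1, t2])
```

In the toy example the product links each paper to itself and to the other paper through the shared author, and the normalization splits the weight evenly between the two.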
Next, we explain how KGNN_HCD learns meta-paths of arbitrary length. Given a meta-path P formed by the series of composite relations
Figure BDA0004035383440000119
the meta-path conversion matrix corresponding to P can be calculated as follows:
Figure BDA00040353834400001110
where
Figure BDA00040353834400001111
denotes the heterogeneous adjacency matrix for the following edge type:
Figure BDA00040353834400001112
Figure BDA00040353834400001113
representing a set of edge types;
Figure BDA00040353834400001114
can be regarded as the weight of each meta-path transition matrix, so that KGNN_HCD can distinguish meta-paths of different importance. A_P can then be seen as a weighted sum over the heterogeneous adjacency matrices, so the proposed method can learn meta-path information of arbitrary length.
Since these transition adjacency matrices contain only the original point-edge relation information, the original edges themselves would be ignored when meta-path information is fused. It is therefore necessary to add an identity matrix I to the heterogeneous adjacency matrices. In this way KGNN_HCD can learn meta-paths of any length: when l meta-path conversion matrices are multiplied together, meta-path information of length l+1 can be fused.
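The role of the identity matrix can be checked with a small worked example: when each layer mixes an edge-type matrix with I, the product of two layers expands into the length-3 path plus the shorter paths. The weights 0.6/0.4 and 0.7/0.3 and the toy matrices are hypothetical values chosen for illustration.

```python
import numpy as np

# Toy relation matrices on a 3-node graph (hypothetical)
a_pa = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]])
a_ap = a_pa.T
eye = np.eye(3)

# Each layer is a convex combination of one relation and the identity
q1 = 0.6 * a_pa + 0.4 * eye
q2 = 0.7 * a_ap + 0.3 * eye
product = q1 @ q2

# Expanding the product by hand: PAP, PA alone, AP alone, and the identity
expanded = (0.42 * (a_pa @ a_ap)    # meta-path P-A-P
            + 0.18 * a_pa           # original edge P-A survives
            + 0.28 * a_ap           # original edge A-P survives
            + 0.12 * eye)           # self-loop
```

Without the identity terms only the 0.42 · A_PA · A_AP term would remain, so the original single-hop edges would be lost; with them, shorter meta-paths stay available in deeper layers.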
1.3 Community discovery
The main goal of this module is to perform community discovery on the learned node representations. The final node representation should contain rich meta-path semantic information, but because heterogeneous graphs are scale-free, the variance of the data is high. To cope with this challenge, KGNN_HCD extends meta-path-level information fusion to multiple channels by increasing the number of output channels C. After the meta-path conversion matrix has been multiplied l times, GCN and an MLP are applied to each channel of the meta-path conversion matrix, and the final node representation is
Figure BDA0004035383440000121
where || is the concatenation operator, which splices together the node representations of all channels;
C denotes the number of channels;
σ denotes an activation function;
Figure BDA0004035383440000122
denotes the meta-path conversion matrix obtained by merging the i-th channel of the l-th layer
Figure BDA0004035383440000123
with the self-loop identity matrix and the K-nearest neighbor graph adjacency matrix;
Figure BDA0004035383440000124
is the degree matrix of the following matrix:
Figure BDA0004035383440000125
X ∈ R^{N×d} is the feature matrix of the input heterogeneous graph, N is the number of nodes, and d is the node feature dimension;
W ∈ R^{d×dim} is the trainable weight matrix of the neural network model, dim is the output embedding dimension of the model, and Z ∈ R^{N×dim} is the final output node embedding.
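The per-channel graph convolution followed by concatenation can be sketched as follows. The formula itself is only a figure placeholder in this text, so the sketch follows the symbol definitions above; the function name `channel_embedding` is hypothetical, and the sketch stops at the concatenation (which has C · dim columns) without the MLP mentioned in the prose.

```python
import numpy as np

def channel_embedding(a_channels, x, w):
    """Concatenate ReLU(D_i^-1 (A_i + I) X W) over all channels i."""
    outputs = []
    for a in a_channels:                            # a: (N, N) conversion matrix of one channel
        a_tilde = a + np.eye(a.shape[0])            # add self-loops
        deg = a_tilde.sum(axis=1, keepdims=True)    # row sums as degrees (assumption)
        outputs.append(np.maximum(0.0, (a_tilde / deg) @ x @ w))
    return np.concatenate(outputs, axis=1)          # shape (N, C * dim)

# Toy setting: N = 2 nodes, C = 2 channels, d = 2 features, dim = 1
a_channels = [np.zeros((2, 2)),
              np.array([[0.0, 1.0], [1.0, 0.0]])]
x_feat = np.eye(2)
w_mat = np.array([[1.0], [2.0]])
z = channel_embedding(a_channels, x_feat, w_mat)
```

The first channel has no edges, so each node keeps its own projected feature; the second channel averages the two nodes, which shows how different channels can encode different meta-path views of the same nodes.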
After the node representations are obtained, the classical k-means algorithm is applied to the learned embeddings to discover communities. Specifically, k is set to the number of node categories, yielding k clusters; the ground-truth community labels of the dataset are then compared with the community labels obtained by KGNN_HCD, and the community discovery result is evaluated with commonly used community discovery metrics.
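The clustering step can be illustrated with a compact version of Lloyd's k-means algorithm. This is a teaching sketch with random initialization and a fixed number of iterations; the function name `kmeans` and the toy embeddings are hypothetical, and an actual experiment would more likely rely on a library implementation.

```python
import numpy as np

def kmeans(z, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm on the rows of z; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = z[rng.choice(len(z), size=k, replace=False)]   # copy of k random rows
    for _ in range(n_iter):
        dist = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)                         # assign to nearest center
        for c in range(k):
            if np.any(labels == c):                          # skip empty clusters
                centers[c] = z[labels == c].mean(axis=0)
    return labels

# Toy embeddings: two well-separated groups of three nodes each
z_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels = kmeans(z_toy, k=2)
```

Cluster indices are arbitrary, so the result should be compared with the ground-truth labels only through permutation-invariant metrics.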
2. Experimental analysis
2.1 Datasets
The experiments use three real-world heterogeneous graph datasets: the heterogeneous citation networks ACM and DBLP, and the movie dataset IMDB. Their details, including the numbers of nodes, edge types and meta-paths, are summarized in Table 2.
Table 2 Statistics of the datasets
Figure BDA0004035383440000131
·ACM
The ACM (Association for Computing Machinery) dataset is a bibliographic information network. Papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM and VLDB were extracted. The dataset includes 3 node types: 3025 Paper (P), 5835 Author (A) and 56 Subject (S) nodes. Nodes of type P carry one of 3 labels, Database, Wireless Communication and Data Mining, so the Paper nodes fall into 3 classes. The ACM dataset also contains 4 edge types: 9936 P-A and A-P edges and 3025 P-S and S-P edges. The dataset therefore has 4 heterogeneous adjacency matrices, one per point-edge relation: A_{P-A}, A_{A-P}, A_{P-S} and A_{S-P}. For the ACM dataset, the meta-paths with semantic information are PAP and PSP.
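How a meta-path such as PAP arises from the heterogeneous adjacency matrices can be shown on a toy example. The 3-paper, 2-author graph below is hypothetical and far smaller than ACM; it uses rectangular paper-by-author blocks for readability, whereas the method described in this document works with N × N matrices over all nodes.

```python
import numpy as np

# Rows are papers p0..p2, columns are authors a0..a1 (hypothetical toy data)
a_p_a = np.array([[1, 0],     # p0 written by a0
                  [1, 1],     # p1 written by a0 and a1
                  [0, 1]])    # p2 written by a1
a_a_p = a_p_a.T               # the reverse relation A-P

# Entry (i, j) counts the authors shared by papers i and j
pap = a_p_a @ a_a_p
```

Papers p0 and p2 share no author, so their PAP entry is 0, while p1 is connected to both; this is the co-authorship semantics carried by the PAP meta-path.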
·DBLP
The DBLP (DataBase systems and Logic Programming) dataset is a network reflecting the relationship between authors and papers. There are 3 node types: 14328 Paper (P), 4057 Author (A) and 20 Conference (C) nodes. The authors fall into 4 areas: Database, Data Mining, Machine Learning and Information Retrieval. Each author's research area is labeled according to the conferences they submitted to. The DBLP dataset contains 4 edge types: 19645 P-A and A-P edges and 14328 P-C and C-P edges; the corresponding heterogeneous adjacency matrices are A_{P-A}, A_{A-P}, A_{P-C} and A_{C-P}. For the DBLP dataset, the meta-paths with semantic information are APA, APCPA and APAPA.
·IMDB
IMDB (Internet Movie Database) is a website for movies and related information that records users' preferences for different movies. This dataset contains 3 node types: 2939 Movie (M), 5841 Actor (A) and 2269 Director (D) nodes. The nodes of type Movie are labeled with one of three attributes: Action, Comedy and Drama. There are 4 edge types, namely 4661 M-A and A-M and 13983 M-D and D-M, and the corresponding heterogeneous adjacency matrices are A_{M-A}, A_{A-M}, A_{M-D} and A_{D-M}. When studying the IMDB dataset, tasks can be mined by studying the two meta-paths with semantic information, MAM and MDM.
2.2 Baseline model
The proposed KGNN_HCD method is compared with state-of-the-art heterogeneous graph community discovery methods such as CP-GNN and GTN. To verify the fairness and effectiveness of the experiments, traditional network embedding algorithms are compared first. These were originally designed for homogeneous graphs, so node heterogeneity is ignored in the experiments and the whole heterogeneous graph is treated as homogeneous. For a fair comparison, methods based on graph neural networks are also included; most of them are designed for heterogeneous network embedding and use meta-paths to capture the characteristics of heterogeneous information networks, though in different ways.
Conventional network embedding method
Infomap is an algorithm based on information coding theory. It encodes random-walk paths and uses the code length as the objective function to optimize the community partition. Here, the method treats the heterogeneous graph as a homogeneous graph and converts the community discovery problem into an information compression problem.
Node2vec is a network embedding method that jointly considers DFS and BFS neighborhoods and can be regarded as an extension of DeepWalk. It uses a second-order random walk that can roam far to capture macroscopic features of the network and stay local to preserve the community information of a node.
Metapath2vec (Mp2vec) is a state-of-the-art heterogeneous information network embedding method. It uses meta-path-based random walks to construct a heterogeneous neighborhood for each vertex and then uses the Skip-Gram model to complete the vertex embedding. On the basis of metapath2vec, its authors also proposed metapath2vec++ to model structural and semantic associations in heterogeneous networks at the same time.
The HIN2vec model learns the rich information in heterogeneous information networks by studying the different types of relationships between nodes and the network structure. The method samples meta-path-based random walks of a given size, feeds them into a neural network, encodes the rich information embedded in the meta-paths and in the overall network structure according to the different semantics of different meta-paths, and learns more meaningful node representations.
Method based on graph neural network
GCN is a state-of-the-art graph convolution method for homogeneous graphs. It is a semi-supervised graph convolutional network model that performs convolution in the graph Fourier domain and captures complex features globally by aggregating neighbor information for node representation learning. Here, GCN is tested on the homogeneous graphs induced by each meta-path, and the best result over the meta-paths is reported.
·GCN_KG is a special GCN. To allow a better comparison and to explore the advantages of the K-nearest neighbor graph, the input of the model abandons the traditional topology graph and instead uses the sparse K-nearest neighbor graph computed from the feature matrix as the input graph of the GCN; this variant is denoted GCN_KG.
GAT is a semi-supervised homogeneous graph neural network model. It performs convolution in the graph spatial domain and introduces an attention mechanism. For each node, it aggregates neighbor representations using importance scores learned by node-level attention. Likewise, GAT is tested on all meta-paths and the best performance is reported.
HAN is one of the early attempts to handle heterogeneous graphs. It converts a heterogeneous information network into multiple homogeneous graphs through given symmetric meta-paths, captures node-level and semantic-level importance with a hierarchical attention mechanism, applies node-level attention through GAT on each meta-path neighbor graph, and produces the final representation.
CP-GNN proposes capturing higher-order relationships between nodes with context paths and builds a heterogeneous graph neural network model based on context paths. It recursively embeds higher-order relationships between nodes into the node embeddings through an attention mechanism that distinguishes the importance of different relationships. Node embeddings are learned by maximizing the expected co-occurrence of nodes connected by context paths.
GTN is a method for heterogeneous graphs that can automatically discover valuable meta-paths instead of relying on manual selection as HAN does. It considers all possible cases by computing all possible meta-path-based graphs and then performs graph convolution on the resulting graph.
2.3 Implementation procedure
For the proposed KGNN_HCD method, the parameters are randomly initialized and the model is optimized with Adam. The hyperparameters are then selected: the learning rate is set to 0.005 and the regularization parameter to 0.001, chosen so that each baseline yields its best performance. For the random-walk-based models, the window size is set to 5, the walk length to 100, the number of walks per node to 40, and the number of negative samples to 7. For the semi-supervised graph neural networks, including GCN, GAT and HAN, exactly the same training, validation and test splits are used to ensure fairness. For KGNN_HCD, the number of meta-path information conversion layers is set to 3 for both the DBLP and IMDB datasets and to 2 for the ACM dataset, and the value of K for the K-nearest neighbor graph is set to 4. For a fair comparison, the embedding dimension of all the above algorithms is set to 64.
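The settings listed above can be collected in a small configuration sketch. The values are taken from the text; the key names, the helper `layers_for`, and the reading of the two "step size" values as walk length and walks per node are assumptions made here for illustration.

```python
# Hyperparameters for KGNN_HCD as stated in the text (key names are hypothetical)
KGNN_HCD_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 0.005,
    "regularization": 0.001,
    "embedding_dim": 64,
    "knn_k": 4,
    "mp_trans_layers": {"ACM": 2, "DBLP": 3, "IMDB": 3},
}

# Settings for the random-walk-based baselines
# (walk_length and walks_per_node are an interpretation of the two "step size" values)
RANDOM_WALK_CONFIG = {
    "window_size": 5,
    "walk_length": 100,
    "walks_per_node": 40,
    "negative_samples": 7,
}

def layers_for(dataset):
    """Return the number of MP-Trans layers used for a dataset."""
    return KGNN_HCD_CONFIG["mp_trans_layers"][dataset]
```

Keeping the per-dataset layer count in one mapping makes it explicit that ACM uses a shallower stack than DBLP and IMDB.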
2.4 Performance comparison
In this experiment, the proposed KGNN_HCD method first learns node embeddings and then applies the k-means algorithm to these embeddings for community discovery, where k is set to the number of node classes. The performance of KGNN_HCD is evaluated against the ground-truth labels with four metrics, F1, NMI, ARI and Purity, and compared with 11 methods such as Infomap. The results are shown in Table 3.
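Two of the four metrics, Purity and NMI, can be sketched directly from the contingency table of true and predicted labels. The source does not state which NMI normalization it uses, so dividing by the geometric mean of the two entropies is an assumption of this sketch; the function names are illustrative.

```python
import numpy as np

def contingency(y_true, y_pred):
    """Count matrix with one row per true class and one column per predicted cluster."""
    _, ti = np.unique(y_true, return_inverse=True)
    _, pi = np.unique(y_pred, return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1))
    np.add.at(table, (ti, pi), 1)
    return table

def purity(y_true, y_pred):
    """Fraction of nodes that belong to the majority class of their cluster."""
    table = contingency(y_true, y_pred)
    return table.max(axis=0).sum() / table.sum()

def nmi(y_true, y_pred):
    """Mutual information divided by the geometric mean of the entropies (assumption)."""
    p = contingency(y_true, y_pred) / len(y_true)
    pt, pp = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = np.sum(p[nz] * np.log(p[nz] / np.outer(pt, pp)[nz]))
    ht = -np.sum(pt * np.log(pt))
    hp = -np.sum(pp * np.log(pp))
    if ht == 0 or hp == 0:          # degenerate case with a single class or cluster
        return 0.0
    return mi / np.sqrt(ht * hp)

y_true = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]      # same partition with swapped cluster indices
mixed = [0, 1, 0, 1]        # every cluster contains one node of each class
```

Both metrics ignore the arbitrary numbering of clusters, which is why the swapped labeling still scores 1.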
Table 3 comparison of the performance of different community discovery methods
Figure BDA0004035383440000151
Figure BDA0004035383440000161
As is clear from Table 3, the F1 index of the proposed KGNN_HCD method is 2.68% better than that of the GTN method on the ACM dataset, 1.35% better on the DBLP dataset, and 1.89% better on the IMDB dataset. The NMI and ARI indices also improve over GTN, with reported gains of 2.54% and 2.56%, 2.59% and 1.47%, and 1.22% and 1.67%. These results indicate that, by fusing K-nearest neighbor graph information and converting meta-path information, the structural information of the feature space can be captured and more meaningful meta-paths can be acquired adaptively, so as to learn node representations better suited to community discovery.
All graph-based embedding methods outperform the traditional Infomap algorithm, showing their great potential in community discovery tasks. Among the network-embedding-based methods, HIN2vec performs better overall than Node2vec and Metapath2vec, but its advantage is pronounced only on the ACM dataset, and its results on the DBLP and IMDB datasets are not ideal. This is because, although HIN2vec can automatically discover meta-paths through random walks, the discovered meta-paths may not be suitable for community discovery. In addition, Metapath2vec works better on the DBLP dataset than on the ACM dataset, indicating the importance of higher-order relationships.
GCN, GCN_KG and LGNN perform worst among the GNN-based baselines. A possible reason is that they were originally proposed for homogeneous graphs and do not consider the complex context information in heterogeneous graphs. All results of GCN_KG are better than those of GCN, further indicating the necessity of introducing the K-nearest neighbor graph to obtain feature-space information. GAT outperforms GCN and LGNN, which strongly supports the importance of the attention mechanism. The attention mechanism in GAT can be seen as a simple way to distinguish node types and edge types in a heterogeneous graph. Thanks to meta-paths, HAN can explicitly mine complex semantic information and achieves good results. CP-GNN learns meta-paths through context paths, captures higher-order relationships well, and performs well. The graph transformation network of GTN can identify useful connections between nodes that have no edge in the original graph and needs no domain knowledge to learn meta-paths; the results confirm that the model performs well. Among the graph-neural-network-based methods, HAN, CP-GNN, GTN and KGNN_HCD perform better than GCN and GAT, which shows that community discovery models based on heterogeneous graph neural networks account well for the heterogeneity of nodes and edges.
2.5 Parameter sensitivity experiments
This section presents parameter sensitivity experiments on the three datasets ACM, DBLP and IMDB.
The effect of the dimension of the final embedding Z is tested first. Five experiments are run with the embedding dimension set to 4, 16, 64, 128 and 1024 respectively; as a control, K in the K-nearest neighbor graph is fixed at 4. The results are shown in FIG. 4.
As can be seen from FIG. 4, on all three datasets the performance first increases and then slowly decreases as the embedding dimension grows. The reason may be that KGNN_HCD needs a suitable dimension to encode meta-path information: a smaller dimension may fail to capture meaningful information, and a larger dimension may introduce additional redundancy. Over multiple experiments, an embedding dimension of 64 was found to be the most appropriate.
To examine the effect of the number of neighbors K in the K-nearest neighbor graph on the proposed method, the performance of KGNN_HCD is studied for K ranging from 2 to 10, as shown in FIG. 5.
FIG. 5 shows that on all three datasets, ACM, DBLP and IMDB, the accuracy of KGNN_HCD first increases and then starts to decrease. This is probably because, when K is small, there are fewer choices when selecting neighbors, so the nodes most similar to the target node are easily selected. As K increases, however, some nodes that are less similar to the target node are forced to keep edges, which increases the complexity of the graph and makes it harder to find the nodes most similar to the target node. At the same time, as the graph becomes denser, the features are more easily over-smoothed, which is another reason for the degradation of model performance. As can be seen from FIG. 5, KGNN_HCD works best when K is 4.
To investigate the effect of the number of meta-path conversion layers on KGNN_HCD, the number of layers is set to 2, 3, 4 and 5 respectively, and the NMI and ARI indices of the model on the three datasets are observed. The results are shown in FIG. 6.
Taking the DBLP dataset as an example, FIG. 6 shows that the performance of KGNN_HCD keeps decreasing as the number of layers increases. This is because the role of the meta-path information conversion layer is to aggregate meta-paths, and the longest effective meta-path for the DBLP dataset has length 5, namely APCPA. When the number of meta-path information conversion layers increases, meta-paths without semantic information are also used by KGNN_HCD, which introduces additional redundancy and hurts performance. Multiple experiments therefore show that the overall effect of the KGNN_HCD method is best when the number of meta-path information conversion layers is set to 2 for the ACM dataset and to 3 for the DBLP and IMDB datasets.
2.6 Interpretability analysis of the MP-Trans Layer
The MP-Trans Layer fuses two meta-path information matrices to obtain a meta-path conversion matrix that carries importance weights, and thereby expresses the differences between meta-paths. This section explains in detail how the MP-Trans Layer distinguishes the importance of each meta-path from the generated meta-path transition matrices. For ease of explanation, the number of output channels is set to 1, and the convex combination of the adjacency matrices of the input heterogeneous graph is defined as
Figure BDA0004035383440000171
Correspondingly, the meta-path conversion matrix A^(l) generated by the l-th MP-Trans Layer can be obtained from the meta-path conversion matrix output by the previous layer and the current meta-path transition matrix as follows.
Figure BDA0004035383440000181
where D^(1) denotes the degree matrix,
Figure BDA0004035383440000182
denotes the heterogeneous adjacency matrix whose point-edge relation type is
Figure BDA0004035383440000183
and
Figure BDA0004035383440000184
is the weight of the heterogeneous adjacency matrix
Figure BDA0004035383440000185
As can be seen from the above, two meta-path transition matrices are generated in the first MP-Trans Layer, and they are assigned the corresponding coefficients
Figure BDA0004035383440000186
and
Figure BDA0004035383440000187
respectively. Note that the two meta-path transition matrices generated in the first layer are multiplied together, so meta-path information of path length 3, such as P-A-P, is actually fused here. To fuse further information, the weights of the heterogeneous adjacency matrices are computed once more, giving
Figure BDA0004035383440000188
and multiplying this by the meta-path conversion matrix already obtained fuses meta-path information of length 4. Therefore, to study a salient meta-path P composed of the series of composite relations
Figure BDA0004035383440000189
the heterogeneous adjacency matrices corresponding to this meta-path,
Figure BDA00040353834400001810
can be used to express it, and only l-1 MP-Trans Layers need to be stacked to fuse the information of the meta-path P. From formula (8), the weights corresponding to the heterogeneous adjacency matrix of each type, combined as
Figure BDA00040353834400001811
give the contribution of the meta-path P, that is, the importance of this meta-path P.
This explains why KGNN_HCD can express the importance of different meta-paths. For the meta-path
Figure BDA00040353834400001812
the weight
Figure BDA00040353834400001813
is an attention score that reflects the semantic information and the importance of the meta-path in a specific task. FIG. 7 and FIG. 8 summarize the attention scores of the heterogeneous adjacency matrix of each relation type, together with the learned attention scores of all meta-paths of length 3, taking the ACM and IMDB datasets as examples.
FIG. 7(a) and 7(b) and FIG. 8(a) and 8(b) show the attention scores of the adjacency matrices (edge types) in the first and second meta-path information conversion layers (MP-Trans Layers) for the ACM and IMDB datasets. On the ACM dataset, the attention score of the heterogeneous edge relation PA is highest in the first meta-path conversion layer, and in the second layer the attention score of the heterogeneous edge relation AP is highest. The importance of the meta-path PAP must therefore be high, which can be verified in FIG. 7(c). On the IMDB dataset, the attention scores of the heterogeneous edge relations MA and MD in the first layer differ little, but in the second layer the attention score of DM is clearly higher than that of AM, so the meta-path MDM is more important than the meta-path MAM, which is confirmed by the heat maps that follow. In all four figures the attention weight of the identity matrix is also relatively high; as discussed in section 4.2, this means KGNN_HCD can also keep shorter meta-paths in deeper layers.
Figs. 7(c) and 8(c) visualize, for the ACM and IMDB datasets respectively, the correlation between each heterogeneous edge relation in the first MP-Trans layer and each heterogeneous edge relation in the second MP-Trans layer (the values are magnified 100 times for a clearer representation). It can be seen directly that the meta-path PAP is more important than the meta-path PSP. For example, in the first MP-Trans layer the attention score of PA is 0.1691 and that of PS is 0.1689, and in the second MP-Trans layer the attention score of AP is 0.1721 and that of SP is 0.1707. Then, according to the formula in section 5.6,
Figure BDA0004035383440000191
the importance of the meta-path PAP is
Figure BDA0004035383440000192
Similarly, the importance of the meta-path PSP is 0.0288, so in the community discovery task the meta-path PAP is more important than the meta-path PSP, which is also demonstrated in Metapath2vec. For the IMDB dataset, in the first MP-Trans layer the attention score of MD is 0.1656 and that of MA is 0.1645, and in the second MP-Trans layer the attention score of DM is 0.1591 and that of AM is 0.1549. The importance of the meta-path MDM is then
Figure BDA0004035383440000193
The importance of the meta-path MAM is 0.0255, so the meta-path MDM is more important than MAM, which is also demonstrated in Metapath2vec.
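The arithmetic in this passage can be checked with a short sketch. It assumes that the importance of a two-hop meta-path is the product of the attention scores of its two edge relations in the first and second MP-Trans layers; the formula image is not reproduced in this text, but this assumption is consistent with the stated values 0.0288 and 0.0255. The function name `metapath_importance` is hypothetical.

```python
from math import prod

# Attention scores quoted in the text: (first-layer score, second-layer score).
scores = {
    "PAP": (0.1691, 0.1721),  # PA in layer 1, AP in layer 2 (ACM)
    "PSP": (0.1689, 0.1707),  # PS in layer 1, SP in layer 2 (ACM)
    "MDM": (0.1656, 0.1591),  # MD in layer 1, DM in layer 2 (IMDB)
    "MAM": (0.1645, 0.1549),  # MA in layer 1, AM in layer 2 (IMDB)
}

def metapath_importance(layer_scores):
    """Assumed rule: multiply the per-layer attention scores along the meta-path."""
    return prod(layer_scores)

# Round to four decimals, matching the precision used in the text.
importance = {name: round(metapath_importance(s), 4) for name, s in scores.items()}
for name, value in importance.items():
    print(name, value)
```

Under this assumption PAP evaluates to about 0.0291 against 0.0288 for PSP, and MDM to about 0.0263 against 0.0255 for MAM, which matches the ordering described above.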
2.7 Ablation experiments
To verify the effectiveness of the individual components of KGNN_HCD, further experiments were performed on different KGNN_HCD variants. The performance results on the three datasets are given in Fig. 9. Here, KGNN_HCD_w/oKG removes the K-nearest-neighbor graph information fusion module, in order to explore the importance of that module within KGNN_HCD. KGNN_HCD_w/oI omits the identity matrix. KGNN_HCD_w/oI&KG removes both the identity matrix and the K-nearest-neighbor graph information fusion module; comparing it with KGNN_HCD_w/oKG and KGNN_HCD_w/oI highlights the contribution of each of the two parts.
As can be seen from Fig. 9, on all three datasets KGNN_HCD_w/oKG performs consistently worse than KGNN_HCD, by a significant margin, which illustrates the effectiveness and necessity of K-nearest-neighbor graph information fusion. Next, to verify the effect of the identity matrix, KGNN_HCD_w/oI was trained and evaluated. KGNN_HCD_w/oI has exactly the same architecture as KGNN_HCD, except that its candidate adjacency matrices do not include the identity matrix. As the figure shows, KGNN_HCD_w/oI generally performs worse than KGNN_HCD. If both modules are removed and KGNN_HCD_w/oI&KG is trained directly, the result is poor; adding back either one of the two modules, as in KGNN_HCD_w/oI or KGNN_HCD_w/oKG, improves the final performance, which indirectly confirms the effectiveness of the identity matrix and of the K-nearest-neighbor graph information fusion module.
In summary, the present invention provides KGNN_HCD, a heterogeneous graph community discovery method that incorporates a K-nearest-neighbor graph neural network, to address existing problems in heterogeneous graph community discovery. The method makes effective use of the structural information of the feature space to enhance node similarity, which improves the likelihood that nodes without direct connections are assigned to communities; it also avoids predefined meta-paths and learns meta-paths end to end. To evaluate the rationality and effectiveness of KGNN_HCD, it was compared with 11 different models, including CP-GNN and GTN, on three real heterogeneous datasets (ACM, DBLP and IMDB). The results show that KGNN_HCD outperforms the baselines and that the weight convolution distinguishes well between the importance of different meta-paths. To explain the MP-Trans layer in KGNN_HCD, an interpretability analysis of that module was carried out. To study the influence of the hyperparameters on the model, the K value of the K-nearest-neighbor graph, the embedding dimension and the number of meta-path conversion layers were varied. Ablation experiments were performed on the three datasets to verify the validity of each module. Finally, the high-order relations among nodes are captured by the GCN, the learned node representations are clustered, and the community structure in the heterogeneous graph is found; the experimental results demonstrate the rationality and effectiveness of the proposed KGNN_HCD method.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (6)

1. A heterogeneous graph community discovery method based on a K neighbor graph neural network is characterized by comprising the following steps:
S1, calculating node similarity from the feature vectors of the nodes to form a similarity matrix; selecting, according to the similarity matrix, the K nodes most similar to each target node as its neighbors and connecting them to it by edges, thereby constructing the K-nearest-neighbor graph adjacency matrix A_k;
splicing the K-nearest-neighbor graph adjacency matrix A_k with the adjacency matrix of the input graph to which the identity matrix has been added,
Figure FDA0004035383430000011
to obtain the K-nearest-neighbor graph adjacency matrix with feature-space topology information,
Figure FDA0004035383430000012
S2, using a weight matrix to assign attention scores to the different heterogeneous edge relations, so as to distinguish the importance of different meta-paths;
S3, passing the K-nearest-neighbor graph adjacency matrix with feature-space topology information,
Figure FDA0004035383430000013
through the meta-path information conversion layer to generate a meta-path transition matrix; generating the meta-path transition matrices of successive layers by matrix multiplication; then adaptively learning meta-paths; and fusing the meta-path information through a GCN to capture high-order relations among nodes and obtain the node representations;
S4, performing k-means clustering on the learned node representations according to the number of node types, to form a community partition of the specific nodes.
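Step S1 can be illustrated with a minimal NumPy sketch. It assumes cosine similarity as the similarity measure and interprets "splicing" as stacking the edge-type adjacency matrices, the identity matrix and A_k along a new leading axis; the helper names `knn_adjacency` and `stack_candidates` and the toy data are hypothetical and not taken from the patent.

```python
import numpy as np

def knn_adjacency(X, k):
    """Return the N x N adjacency matrix of the K-nearest-neighbor graph of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)      # unit-length rows; guards zero vectors
    S = Xn @ Xn.T                              # cosine similarity matrix
    np.fill_diagonal(S, -np.inf)               # a node is never its own neighbor
    nbrs = np.argsort(-S, axis=1)[:, :k]       # k most similar nodes per row
    A_k = np.zeros(S.shape)
    rows = np.repeat(np.arange(S.shape[0]), k)
    A_k[rows, nbrs.ravel()] = 1.0              # edge from each target node to its neighbors
    return A_k

def stack_candidates(edge_adjs, A_k):
    """Stack edge-type adjacencies, the identity matrix and A_k into a (T, N, N) array."""
    identity = np.eye(A_k.shape[0])
    return np.stack([*edge_adjs, identity, A_k])

# Toy example: four nodes with 2-d features and one input edge type.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A_in = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], dtype=float)
A_k = knn_adjacency(X, k=1)
A_tilde = stack_candidates([A_in], A_k)
print(A_k)
print(A_tilde.shape)
```

In this toy input the K-nearest-neighbor graph links nodes 2 and 3, which share no edge in the input graph, so feature-space similarity supplies structure that the original topology lacks.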
2. The heterogeneous graph community discovery method based on the K neighbor graph neural network according to claim 1, wherein the node similarity measurement method comprises: cosine similarity, heat kernel, and dot product.
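The three similarity measures named in claim 2 can be sketched as follows. The Gaussian form of the heat kernel, exp(-||x_i - x_j||^2 / t), and its bandwidth parameter `t` are assumptions, since the claim does not give a formula; all function names are hypothetical.

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    Xn = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    return Xn @ Xn.T

def heat_kernel_similarity(X, t=1.0):
    """Pairwise heat-kernel similarity exp(-squared distance / t) (assumed form)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0, None)
    return np.exp(-d2 / t)

def dot_product_similarity(X):
    """Pairwise inner products between the rows of X."""
    return X @ X.T

X = np.array([[1.0, 0.0], [0.0, 2.0]])
print(cosine_similarity(X))
print(heat_kernel_similarity(X, t=1.0))
print(dot_product_similarity(X))
```

Any of the three matrices can serve as the similarity matrix from which the K most similar neighbors are selected.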
3. The heterogeneous graph community discovery method based on the K-nearest neighbor graph neural network according to claim 1, wherein the K-nearest neighbor graph adjacency matrix with feature-space topology information
Figure FDA0004035383430000014
generates the meta-path transition matrix through the meta-path information conversion layer according to the following formula:
Figure FDA0004035383430000015
wherein T represents the meta-path transition matrix carrying heterogeneous edge information;
f(·;·) represents the function used to generate the meta-path transition matrix;
Figure FDA0004035383430000021
represents the K-nearest neighbor graph adjacency matrix with feature-space topology information;
W_att represents the weight convolution at the MP-Trans layer;
conv_1×1 represents a convolution layer;
Figure FDA00040353834300000214
is a parameter of the 1×1 convolution;
Figure FDA0004035383430000022
represents the number of heterogeneous edge types;
Figure FDA0004035383430000023
represents a heterogeneous relation;
Figure FDA0004035383430000024
represents, for the edge relation
Figure FDA0004035383430000025
the corresponding heterogeneous adjacency matrix;
Figure FDA0004035383430000026
represents the i-th set of edge types;
W_i^(l) = softmax(W_att) represents a convex combination of the heterogeneous adjacency matrices.
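The soft selection described in claim 3 can be sketched as a softmax-weighted sum of the candidate adjacency matrices, which is what a 1×1 convolution over the edge-type axis computes. This is a sketch under that reading of the claim; the formula image is not available here, and the names `softmax` and `mp_trans` are hypothetical.

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-d weight vector."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

def mp_trans(A_stack, w_att):
    """Convex combination of the stacked adjacency matrices.

    A_stack has shape (T, N, N), one matrix per edge type (including the
    identity and K-nearest-neighbor matrices); w_att has shape (T,).
    Returns the combined N x N matrix and the attention scores.
    """
    alpha = softmax(w_att)
    return np.tensordot(alpha, A_stack, axes=1), alpha

# Two candidate matrices: one edge type and the identity.
A_stack = np.array([[[0.0, 1.0], [0.0, 0.0]],
                    [[1.0, 0.0], [0.0, 1.0]]])
T_mat, alpha = mp_trans(A_stack, np.array([1.0, 0.0]))
print(alpha)
print(T_mat)
```

Because the attention scores are non-negative and sum to one, the output stays inside the convex hull of the candidate matrices, and the scores can be read directly as edge-type importances.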
4. The heterogeneous graph community discovery method based on the K-nearest neighbor graph neural network according to claim 1, wherein the calculation formula for generating the meta-path transition matrix by matrix multiplication is as follows:
Figure FDA0004035383430000027
wherein A^(l) represents the meta-path transition matrix of the current layer;
Figure FDA0004035383430000028
represents a matrix;
A^(l-1) represents the meta-path transition matrix of the previous layer;
f(·;·) represents the function used to generate the meta-path transition matrix;
Figure FDA0004035383430000029
represents the heterogeneous graph input that fuses the K-nearest neighbor graph information and the added identity matrix;
W_att^(l) represents the weight convolution of the current layer.
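The layer-by-layer composition in claim 4 can be sketched as repeated multiplication of the previous layer's matrix by a fresh soft selection. The recurrence A^(l) = A^(l-1) · f(Ã; W_att^(l)) is an assumption, because the formula image is missing; the optional row normalization is likewise only a guess at the role of the unnamed matrix in the claim. The names `soft_select` and `compose_metapaths` are hypothetical.

```python
import numpy as np

def soft_select(A_stack, w_att):
    """Softmax-weighted sum of the stacked (T, N, N) adjacency matrices."""
    e = np.exp(w_att - np.max(w_att))
    return np.tensordot(e / e.sum(), A_stack, axes=1)

def compose_metapaths(A_stack, layer_weights, normalize=False):
    """Multiply one soft selection per layer to obtain a meta-path matrix."""
    A = soft_select(A_stack, layer_weights[0])
    for w in layer_weights[1:]:
        A = A @ soft_select(A_stack, w)
        if normalize:
            deg = A.sum(axis=1, keepdims=True)
            A = A / np.where(deg > 0, deg, 1.0)   # assumed inverse-degree scaling
    return A

# Nodes: paper 0, paper 1, author 0. Both papers are written by the author.
A_PA = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
A_AP = A_PA.T
A_stack = np.stack([A_PA, A_AP, np.eye(3)])

# Weights that almost exclusively pick PA in layer 1 and AP in layer 2.
layer_weights = [np.array([50.0, 0.0, 0.0]), np.array([0.0, 50.0, 0.0])]
A_pap = compose_metapaths(A_stack, layer_weights)
print(np.round(A_pap, 3))
```

With these weights the result approximates the PAP meta-path matrix, in which the two papers become connected because they share an author.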
5. The heterogeneous graph community discovery method based on the K-nearest neighbor graph neural network according to claim 1, wherein adaptively learning the meta-paths comprises:
given a meta-path P formed by the series of composite relations
Figure FDA00040353834300000210
the meta-path transition matrix corresponding to P can be calculated as follows:
Figure FDA00040353834300000211
wherein
Figure FDA00040353834300000212
represents, for the edge type
Figure FDA00040353834300000213
the corresponding heterogeneous adjacency matrix;
Figure FDA0004035383430000031
represents the set of edge types;
Figure FDA0004035383430000032
represents the weight of each meta-path transition matrix;
A_P represents a weighted sum of the l heterogeneous adjacency matrices.
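Why multiplying soft selections learns meta-paths can be checked numerically: the product of two convex combinations expands into a weighted sum over every edge-type sequence, where each sequence's weight is the product of its per-layer attention scores. The sketch below assumes that reading of claim 5; the helper `expand_metapaths` and the random test data are hypothetical.

```python
import itertools
from functools import reduce
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-d weight vector."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

def expand_metapaths(A_stack, alphas):
    """Enumerate every edge-type sequence with its weight and adjacency product."""
    n_types = A_stack.shape[0]
    total = np.zeros(A_stack.shape[1:])
    weights = {}
    for path in itertools.product(range(n_types), repeat=len(alphas)):
        # Weight of this meta-path: product of the per-layer attention scores.
        weight = float(np.prod([alphas[l][t] for l, t in enumerate(path)]))
        # Adjacency matrix of this meta-path: product along the sequence.
        A_path = reduce(np.matmul, [A_stack[t] for t in path])
        weights[path] = weight
        total += weight * A_path
    return total, weights

rng = np.random.default_rng(0)
A_stack = rng.integers(0, 2, size=(3, 4, 4)).astype(float)
alphas = [softmax(rng.normal(size=3)) for _ in range(2)]

total, weights = expand_metapaths(A_stack, alphas)
composed = (np.tensordot(alphas[0], A_stack, axes=1)
            @ np.tensordot(alphas[1], A_stack, axes=1))
print(np.allclose(total, composed))
print(max(weights, key=weights.get))
```

The dictionary `weights` is what an importance ranking of meta-paths would be read from: its keys are edge-type index sequences and its values sum to one.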
6. The heterogeneous graph community discovery method based on the K-nearest neighbor graph neural network according to claim 1, wherein obtaining the node representation comprises the following steps:
after the meta-path transition matrices have been multiplied l times, a GCN and an MLP are applied to each channel of the resulting meta-path transition matrix, and the node representation is:
Figure FDA0004035383430000033
wherein ‖ is the concatenation operator;
c represents the number of channels;
σ represents an activation function;
Figure FDA0004035383430000034
represents, for the channel with index
Figure FDA0004035383430000035
at layer l, the meta-path transition matrix fused with the self-loop identity matrix and the K-nearest neighbor graph adjacency matrix;
Figure FDA0004035383430000036
denotes, for
Figure FDA0004035383430000037
the corresponding degree matrix;
X ∈ R^(N×d) is the feature matrix of the input heterogeneous graph, N is the number of nodes, and d is the node feature dimension;
W ∈ R^(d×dim) is the trainable weight matrix of the neural network model, dim is the output embedding dimension of the model, and Z ∈ R^(N×dim) is the final output node embedding representation.
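The per-channel graph convolution in claim 6 can be sketched as follows. The sketch assumes the common form σ(D̃⁻¹ Ã X W) with ReLU as the activation and a weight matrix shared across channels, and it omits the MLP mentioned in the claim, so its output has C·dim columns rather than dim; the function name `gcn_channels` is hypothetical.

```python
import numpy as np

def gcn_channels(A_channels, X, W):
    """Apply one GCN propagation per channel and concatenate the results.

    A_channels: (C, N, N) meta-path matrices, one per channel.
    X: (N, d) node features. W: (d, dim) weight matrix shared by all channels.
    Returns an (N, C * dim) embedding matrix.
    """
    outputs = []
    for A in A_channels:
        A_hat = A + np.eye(A.shape[0])            # add self-loops
        deg = A_hat.sum(axis=1, keepdims=True)    # row degrees (at least 1)
        H = (A_hat / deg) @ X @ W                 # D^-1 A X W
        outputs.append(np.maximum(H, 0.0))        # ReLU as the assumed activation
    return np.concatenate(outputs, axis=1)

A_channels = np.array([
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.eye(2)
Z = gcn_channels(A_channels, X, W)
print(Z.shape)
print(Z)
```

The rows of `Z` are the node embeddings that a clustering step such as k-means would then partition into communities.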
CN202310003986.7A 2023-01-03 2023-01-03 Heterogeneous graph community discovery method based on K neighbor graph neural network Pending CN116257662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310003986.7A CN116257662A (en) 2023-01-03 2023-01-03 Heterogeneous graph community discovery method based on K neighbor graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310003986.7A CN116257662A (en) 2023-01-03 2023-01-03 Heterogeneous graph community discovery method based on K neighbor graph neural network

Publications (1)

Publication Number Publication Date
CN116257662A true CN116257662A (en) 2023-06-13

Family

ID=86681867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310003986.7A Pending CN116257662A (en) 2023-01-03 2023-01-03 Heterogeneous graph community discovery method based on K neighbor graph neural network

Country Status (1)

Country Link
CN (1) CN116257662A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151279A (en) * 2023-08-15 2023-12-01 哈尔滨工业大学 Isomorphic network link prediction method and system based on line graph neural network
CN117520995A (en) * 2024-01-03 2024-02-06 中国海洋大学 Abnormal user detection method and system in network information platform
CN117520995B (en) * 2024-01-03 2024-04-02 中国海洋大学 Abnormal user detection method and system in network information platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination