CN116257662A

CN116257662A - Heterogeneous graph community discovery method based on K neighbor graph neural network

Info

Publication number: CN116257662A
Application number: CN202310003986.7A
Authority: CN
Inventors: 刘小洋; 吴玉蝶
Original assignee: Chongqing University of Technology
Current assignee: Chongqing University of Technology
Priority date: 2023-01-03
Filing date: 2023-01-03
Publication date: 2023-06-13

Abstract

The invention provides a heterogeneous graph community discovery method based on a K-nearest-neighbor graph neural network, comprising the following steps: S1, calculating node similarity from the feature vectors of the nodes to form a similarity matrix, and constructing the K-nearest-neighbor graph adjacency matrix $A_k$; concatenating $A_k$ with the adjacency matrix of the input graph, to which an identity matrix has been added, to obtain a K-nearest-neighbor graph adjacency matrix carrying feature-space topology information; S2, assigning attention scores to the different heterogeneous node-edge relations; S3, passing that adjacency matrix through a meta-path information conversion layer to generate meta-path transition matrices, generating the meta-path conversion matrix from them, then adaptively learning meta-paths and fusing the meta-path information through a GCN to obtain node representations; S4, performing k-means on the learned node representations according to the number of node types to form a community partition of the specific nodes. By fusing meta-path information, the method accounts for the possibility of node co-occurrence under higher-order relations and raises the probability that nodes without a connection in a meta-path are assigned to the same community.

Description

Heterogeneous graph community discovery method based on K neighbor graph neural network

Technical Field

The invention relates to the technical field of data mining, in particular to a heterogeneous graph community discovery method based on a K neighbor graph neural network.

Background

Community discovery is a fundamental and important research area in network science. It aims to divide tightly connected nodes of a network into communities, so that nodes within a community are densely connected and connections between communities are sparse. Neither the theoretical nor the practical significance of community discovery within network science should be underestimated. In social networks, platform operators promote products and deliver topic recommendations to target users in a detected community. In metabolic networks and protein-protein interaction networks, community discovery reveals the complexity of metabolism and identifies proteins with similar biological functions. In citation networks, community discovery helps determine the importance, interrelationships and evolution of research topics and identify research trends.

Many excellent works have emerged for the community discovery task, and these models fall into three main categories. First, conventional community discovery methods depend heavily on the structure of the network and discover communities from its topology, for example graph partitioning and hierarchical clustering. Second, methods based on graph embedding, random walks and the like generate node sequences and discover communities by learning node embeddings more effectively. Third, in recent years graph neural networks have been widely used for various tasks on graphs and have achieved important results in community discovery: they propagate node features to neighbors and perform operations such as convolution over the underlying topology. Node representations learned by graph neural networks have been shown to achieve state-of-the-art community discovery performance on most datasets.

However, most existing community discovery work focuses on homogeneous graphs, in which all nodes share one type and all edges share one type. Because real-world networks often have multiple node types and multiple edge types, such solutions are clearly less effective on these more complex real networks. A graph with multiple types of nodes and edges, which is ubiquitous in the real world, is called a heterogeneous graph; it contains more comprehensive information and richer semantics and is therefore widely used in many data mining tasks.

Community discovery on heterogeneous graphs is very challenging: because of their high heterogeneity, traditional graph neural network models cannot be applied to them directly. A heterogeneous graph also contains composite relations with rich semantic information, called meta-paths, which are widely used in heterogeneous graph research. Because meta-paths carry rich multi-hop semantic information, some nodes without a direct edge are very likely to belong to the same community. Community discovery on a heterogeneous graph should therefore fully consider both the similarity of nodes in the feature space and the higher-order relations between nodes, rather than being limited to the existing topological edges. To capture higher-order information between nodes effectively, many innovative studies on community discovery have been proposed, but most of them rely on manually defined meta-paths that must be adjusted by hand for each dataset. These works depend heavily on the quality of the meta-paths: different meta-paths chosen by experts lead to markedly different model results. In addition, each meta-path carries different semantic information, so it is preferable to distinguish the importance of each meta-path in order to better fit the research task and learn more effective node representations.

Disclosure of Invention

The invention aims to solve at least the technical problems existing in the prior art, and in particular creatively provides a heterogeneous graph community discovery method based on a K-nearest-neighbor graph neural network.

In order to achieve the above object of the present invention, the present invention provides a heterogeneous graph community discovery method based on a K-nearest neighbor graph neural network, comprising the steps of:

S1, calculating node similarity from the feature vectors of the nodes to form a similarity matrix, the similarity matrix consisting of the similarities between a target node and the other nodes of the heterogeneous graph; selecting, according to the similarity matrix, the K nodes most similar to the target node as its neighbors and connecting them with edges, thereby constructing the K-nearest-neighbor graph adjacency matrix $A_k$;

concatenating the K-nearest-neighbor graph adjacency matrix $A_k$ with the adjacency matrix of the input graph, to which an identity matrix has been added, to obtain the K-nearest-neighbor graph adjacency matrix $\tilde{\mathbb{A}}$ carrying feature-space topology information;

the adjacency matrix with the added identity matrix is obtained by performing a Concat operation on the input graph matrix and the identity matrix. The input graph is the input heterogeneous graph.

Step S1 uses the structural information in the feature space: a similarity matrix is built from the node feature vectors and a K-nearest-neighbor graph is generated to strengthen the similarity between nodes, which increases the likelihood that nodes without a connecting edge are assigned to the same community.

S2, using a weight matrix to assign attention scores to the different heterogeneous node-edge relations, so as to distinguish the importance of different meta-paths;

S3, passing the K-nearest-neighbor graph adjacency matrix $\tilde{\mathbb{A}}$ carrying feature-space topology information through a meta-path information conversion layer to generate meta-path transition matrices, generating the meta-path conversion matrix by matrix multiplication, then adaptively learning meta-paths, fusing the meta-path information through a GCN, and capturing higher-order relations between nodes to obtain node representations;

S4, performing k-means on the learned node representations according to the number of node types to form a community partition of the specific nodes.
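Step S4 can be illustrated with a minimal NumPy sketch of Lloyd's k-means. This is only an illustration of the clustering step, not the patent's implementation: the function name `kmeans`, the random initialization, the iteration cap and the toy embedding matrix `Z` are all assumptions made for this example.

```python
import numpy as np

def kmeans(Z, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on the rows of Z (N x dim). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct rows of Z (an assumed strategy).
    centers = Z[rng.choice(len(Z), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance of every node embedding to every center.
        dists = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy node embeddings forming two well-separated groups.
Z = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
labels, centers = kmeans(Z, k=2)
```

Each resulting label is read as the community assignment of the corresponding node.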

Further, the node similarity may be measured by any of: cosine similarity, heat kernel, or dot product.

Further, the calculation formula by which the K-nearest-neighbor graph adjacency matrix $\tilde{\mathbb{A}}$ carrying feature-space topology information generates the meta-path transition matrix through the meta-path information conversion layer is:

$$T = f(\tilde{\mathbb{A}}; W_{att}) = \mathrm{conv}_{1\times1}\big(\tilde{\mathbb{A}}; \mathrm{softmax}(W_{att})\big) = \sum_{t_i\in\mathcal{T}^e} W_i^{(l)} A_{t_i}$$

wherein $T$ represents a meta-path transition matrix carrying heterogeneous edge information;

$f(\cdot;\cdot)$ represents the function used to generate the meta-path transition matrix;

$\tilde{\mathbb{A}}$ represents the K-nearest-neighbor graph adjacency matrix carrying feature-space topology information;

$W_{att}$ represents the weight convolution at the MP-Trans layer;

$\mathrm{conv}_{1\times1}$ represents a convolution layer;

$W_{att} \in \mathbb{R}^{1\times1\times K}$ is the parameter of the 1×1 convolution, and $K$ represents the number of heterogeneous edge types;

$t_i$ represents a heterogeneous relation;

$A_{t_i}$ represents the heterogeneous adjacency matrix whose node-edge relation is $t_i$;

$\mathcal{T}^e$ represents the set of edge types, of which $t_i$ is the i-th element;

$W_i^{(l)} = \mathrm{softmax}(W_{att})$ represents the convex combination of the heterogeneous adjacency matrices.

Further, the calculation formula for generating the meta-path conversion matrix by matrix multiplication is:

$$A^{(l)} = D^{-1} A^{(l-1)} f(\tilde{\mathbb{A}}; W_{att}^{(l)})$$

wherein $A^{(l)}$ represents the meta-path conversion matrix of the current layer;

$D$ represents the degree matrix;

$A^{(l-1)}$ represents the meta-path conversion matrix of the previous layer;

$\tilde{\mathbb{A}}$ represents the heterogeneous graph input fused with the K-nearest-neighbor graph information and the added identity matrix;

$W_{att}^{(l)}$ represents the weight convolution of the current layer.

Further, adaptively learning the meta-path comprises:

given a meta-path P formed by the chain of composite relations $t_1 \circ t_2 \circ \cdots \circ t_l$, the meta-path conversion matrix corresponding to P may be calculated by:

$$A_P = \Big(\sum_{t_i\in\mathcal{T}^e} W_i^{(1)} A_{t_i}\Big)\Big(\sum_{t_i\in\mathcal{T}^e} W_i^{(2)} A_{t_i}\Big)\cdots\Big(\sum_{t_i\in\mathcal{T}^e} W_i^{(l)} A_{t_i}\Big)$$

wherein $A_{t_i}$ represents the heterogeneous adjacency matrix of edge type $t_i$;

$\mathcal{T}^e$ represents the set of edge types;

$W_i^{(l)}$ represents the weight in each meta-path transition matrix;

$A_P$ represents the product of the $l$ weighted sums of heterogeneous adjacency matrices.

These transition adjacency matrices, i.e. adjacency matrices carrying heterogeneous node-edge relations, contain the original node-edge relation information, but the features of the original edges themselves would be ignored when meta-path information is fused. An identity matrix I is therefore added to the heterogeneous adjacency matrices. In this way meta-paths of any length can be learned, and when the meta-path conversion matrices are multiplied, meta-path information of l+1 lengths can be fused.
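As a worked illustration of why the identity matrix matters, consider a single heterogeneous adjacency matrix $A$ and two layers whose softmax weights are written here as $(a_1, b_1)$ and $(a_2, b_2)$ with $a_j + b_j = 1$. These symbols are introduced only for this example and do not appear in the original formulas.

```latex
\begin{aligned}
(a_1 A + b_1 I)(a_2 A + b_2 I)
  &= a_1 a_2\,A^{2} + (a_1 b_2 + b_1 a_2)\,A + b_1 b_2\,I
\end{aligned}
```

The $A^{2}$ term corresponds to a length-2 meta-path, while the $A$ and $I$ terms, which exist only because $I$ was added, retain the length-1 and length-0 information, so the product covers shorter paths as well as the longest one.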

Further, obtaining the node representation comprises the following steps:

after the meta-path conversion matrix has been multiplied l times, a GCN and an MLP are applied to each channel of the meta-path conversion matrix, and the node representation is:

$$Z = \Big\Vert_{i=1}^{C}\, \sigma\big(\tilde{D}_i^{-1}\,\tilde{A}_i^{(l)}\,X\,W\big)$$

where $\Vert$ is the concatenation operator, which splices the node representations of the channels;

$C$ represents the number of channels;

$\sigma$ represents an activation function;

$\tilde{A}_i^{(l)}$ represents the meta-path conversion matrix of the i-th channel at layer l, fused with the self-loop identity matrix and the K-nearest-neighbor graph adjacency matrix;

$\tilde{D}_i$ is the degree matrix of $\tilde{A}_i^{(l)}$;

$X \in \mathbb{R}^{N\times d}$ is the feature matrix of the input heterogeneous graph, N is the number of nodes, and d is the node feature dimension;

$W \in \mathbb{R}^{d\times dim}$ is a trainable weight matrix of the neural network model, dim is the output embedding dimension of the model, and $Z \in \mathbb{R}^{N\times dim}$ is the final output node embedding.
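The per-channel graph convolution and concatenation described above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the function name `node_representation` is hypothetical, ReLU is assumed as the activation, row sums are assumed as the degrees, the input matrices are assumed non-negative, and the MLP that would map the concatenated output to the final dimension is omitted.

```python
import numpy as np

def node_representation(A_l, X, W):
    """A_l: (N, N, C) meta-path conversion matrices, X: (N, d), W: (d, h).

    Returns the concatenation over channels of ReLU(D~^{-1} A~ X W),
    shape (N, C*h). Hypothetical helper for illustration only.
    """
    n, _, c = A_l.shape
    outputs = []
    for i in range(c):
        A_tilde = A_l[:, :, i] + np.eye(n)          # add self-loops
        deg = A_tilde.sum(axis=1, keepdims=True)    # row degrees, >= 1 here
        H = (A_tilde / deg) @ X @ W                 # D~^{-1} A~ X W
        outputs.append(np.maximum(H, 0.0))          # ReLU activation
    return np.concatenate(outputs, axis=1)

# Two channels on a 2-node toy graph: an empty channel and a single edge pair.
A_l = np.zeros((2, 2, 2))
A_l[:, :, 1] = np.array([[0.0, 1.0], [1.0, 0.0]])
X = np.eye(2)
W = np.eye(2)
Z = node_representation(A_l, X, W)
```

In the empty channel each node keeps only its own features, while in the second channel each node averages its own features with its neighbor's.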

In summary, due to the adoption of the technical scheme, the invention has the following advantages:

The feature-space information of the nodes is taken into account: a K-nearest-neighbor graph topology is constructed and fused with the existing network topology, which fully strengthens node similarity and raises the probability that nodes without a connection in a meta-path are assigned to the same community.

The invention provides a heterogeneous graph neural network community discovery method, KGNN_HCD, which learns meta-paths end to end, captures higher-order relations, distinguishes the importance of different meta-paths, learns high-quality node representations, and supports interpretable analysis of the meta-path conversion layers.

Extensive comparison experiments were carried out on three real heterogeneous datasets, ACM, DBLP and IMDB, against node community discovery models such as CP-GNN and GTN. The experimental results show that, compared with these existing heterogeneous community discovery methods, KGNN_HCD improves significantly on metrics such as F1, NMI, ARI and purity.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is a schematic diagram of a heterogeneous graph according to the present invention: FIG. 1 (a) shows the three node types, namely author, paper and conference; FIG. 1 (b) is an example of a heterogeneous graph on the DBLP dataset; FIG. 1 (c) shows the three meta-paths used in DBLP, namely author-paper-author (APA), author-paper-author-paper-author (APAPA) and author-paper-conference-paper-author (APCPA).

Fig. 2 is a frame diagram of a heterogeneous graph community discovery method fusing K neighbor graph information.

Fig. 3 is a schematic diagram of the construction process of the K-nearest neighbor graph based on the feature vector according to the present invention.

FIG. 4 shows the four metrics on the three datasets under different dimensions according to the present invention.

FIG. 5 is an analysis of K in the K-nearest-neighbor graph on the three datasets.

FIG. 6 shows the NMI and ARI of MP-Trans with different numbers of layers on the three datasets.

FIG. 7 is a numerical analysis of the MP-Trans Layer on the ACM dataset.

FIG. 8 is a numerical analysis of the MP-Trans Layer on the IMDB dataset.

Fig. 9 is a visual results graph of ACM, DBLP and IMDB dataset ablation studies for community discovery.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.

Three representative methods related to community discovery at present are a traditional community discovery method, a community discovery algorithm based on network embedding and a community discovery method based on heterogeneous graph neural network.

Traditional community discovery

Traditional community discovery methods, which have attracted a great deal of research attention, rely mainly on network structure to explore communities. The Infomap algorithm encodes both the communities in the network and the nodes inside each community, generating a unique code for each node. A random walk is then performed on the network to obtain the total code length; when the code length is shortest, tightly connected nodes are assigned to the same community and the optimal solution is obtained. The Label Propagation Algorithm (LPA) depends heavily on the network topology and uses the labels of labeled nodes to predict the labels of unlabeled nodes, identifying communities by diffusion. However, the methods above target homogeneous graphs. Some notable work has also addressed community discovery on heterogeneous graphs. Het-SC and Het-RSC obtain community partitions by applying k-means clustering to the eigenvectors corresponding to different node types. AGGMMR proposes a framework that performs community detection with attribute and topology information through a greedy modularity maximization model. AJNMF jointly considers link information and node content information, using regularization and non-negative matrix factorization to construct a heterogeneous network matrix, and introduces a regulating function to reduce the influence of noisy data on communities, thereby improving community discovery. Other work in the literature uses meta-paths to capture higher-order relations between nodes and discover community structure in heterogeneous graphs.

Network-embedding-based community discovery

Network-embedding-based community discovery is a network representation learning approach that represents a high-dimensional, sparse vector space with a low-dimensional, dense one and strengthens the expressive power of node embeddings according to community characteristics, so many researchers address community discovery through graph embedding. DeepWalk obtains the co-occurrence relations of nodes in a graph by simulating uniform random walks on the network in order to learn vector representations of the nodes; the sampling strategy in DeepWalk can be regarded as the special case of node2vec with p=1 and q=1. Node2vec proposes random walks based on DFS and BFS, which capture node representations reflecting homophily and structural similarity respectively. CDE proposes a novel embedding-based approach: it encodes the inherent community structure into a structure embedding through known community memberships and then, based on node attributes and the community structure embedding, formulates community discovery as a matrix factorization optimization problem. NEC proposes a learnable network embedding algorithm for community discovery in heterogeneous graphs that jointly learns graph-structure-based and cluster-oriented representations, after which community discovery is performed with k-means. Metapath2vec, proposed by Dong et al. in 2017, is a node embedding method for heterogeneous information networks (HIN) that uses meta-paths to guide random walks on heterogeneous graphs, so that the generated node sequences contain rich semantic information. Compared with some earlier models, HIN2Vec retains more context information: it not only assumes that two nodes with a relation are related, but also distinguishes the different relations between nodes and treats them differently by jointly learning relation vectors.

Community discovery based on graph neural network

Community discovery based on graph neural networks is a deep learning approach on the graph domain that can capture the dependencies in a graph, handle the unordered nature of the input graph, and learn a state embedding from each node's neighbors, so researchers have studied many novel methods based on graph neural networks. As one of the most widely used deep learning techniques, the graph neural network learns from graph-structured data, extracting and discovering its features and patterns, and has shown great capability in community discovery. LGNN is an advanced deep-learning-based graph neural network model that exploits the adjacency information of edges in the graph, has strong feature representation capability, and can be used for community discovery on homogeneous graphs. HAN is one of the earliest works on heterogeneous graphs and requires meta-paths to be predefined for each dataset. It uses a hierarchical attention mechanism to capture node-level and semantic-level importance, with GAT assigning attention scores to neighbors under different meta-paths, to produce the final representation. MAGNN improves further on HAN: HAN considers only the two end nodes of a meta-path, whereas MAGNN proposes several meta-path encoders that encode all information along the path, including intermediate nodes, and provides an excellent community discovery solution. CP-GNN proposes context paths built on meta-paths: if two primary nodes are connected by a context path, the two nodes have a semantic relation. CP-GNN learns node embeddings by maximizing the co-occurrence probability of context neighbor nodes, captures higher-order relations, and does not require predefined meta-paths. GTN is a method for heterogeneous graphs that does not require manually defined meta-paths. It can be regarded as a graph analogue of spatial transformer networks, which explicitly learn spatial transformations of input images or features; it obtains effective node representations, reduces the effect of high heterogeneity, and improves community discovery results.

For ease of the description that follows, the relevant terms and network model definitions are presented here. Table 1 summarizes the symbols used in this patent.

Table 1 Symbols used in this patent

Heterogeneous graph: a heterogeneous graph (or heterogeneous information network) is an abstract modeling language for heterogeneous relational data and many complex systems, as shown in FIG. 1. Unlike a homogeneous graph, a heterogeneous graph has different types of nodes and edges and is generally defined as

$$G = (\mathcal{V}, \varepsilon)$$

where $\mathcal{V}$ denotes the node set and $\varepsilon$ denotes the edge set, together with a node type mapping function $\phi: \mathcal{V} \to \mathcal{T}^v$ and an edge type mapping function $\psi: \varepsilon \to \mathcal{T}^e$, and $|\mathcal{T}^v| + |\mathcal{T}^e| > 2$. Each node $v_i \in \mathcal{V}$ in graph $G$ corresponds to one node type, i.e. $\phi(v_i) \in \mathcal{T}^v$. Likewise, each edge $e_{ij} \in \varepsilon$ in the graph corresponds to one edge type, i.e. $\psi(e_{ij}) \in \mathcal{T}^e$.

As shown in FIG. 1 (a) and FIG. 1 (b), a simple heterogeneous graph is constructed using the DBLP dataset as a prototype. This graph has three node types, paper (P), author (A) and conference (C), and two edge types, paper-author (P-A) and paper-conference (P-C).

Meta-path: in a heterogeneous graph, a path that connects two nodes through a composite relation is called a meta-path. Meta-paths capture semantic information effectively and are a basic tool for studying heterogeneous graphs. A meta-path P is constructed in the form $v_1 \xrightarrow{t_1} v_2 \xrightarrow{t_2} \cdots \xrightarrow{t_l} v_{l+1}$ (which can also be abbreviated as $t_1 \circ t_2 \circ \cdots \circ t_l$), where $\circ$ denotes the composition operator; the length of P is the number of heterogeneous relations $t$ it contains.

As shown in FIG. 1 (c), two authors can be connected by different meta-paths, such as author-paper-author-paper-author (APAPA) and author-paper-conference-paper-author (APCPA). Although only the middle node differs, the two meta-paths express completely different semantics: the former indicates that author 1 and author 4 have each co-authored a paper with author 2, and the latter indicates that papers written by the two authors belong to the same conference. Meta-paths of different lengths also differ in the richness of the semantics they express: compared with author-paper-author (APA), author-paper-conference-paper-author (APCPA) carries richer semantic information.

Heterogeneous adjacency matrix: because node types and edge types differ in a heterogeneous graph, representing the original graph with a conventional adjacency matrix would lose its heterogeneity, so this patent constructs heterogeneous adjacency matrices to represent the heterogeneous graph. Given a heterogeneous graph with node type set $\mathcal{T}^v$ and edge type set $\mathcal{T}^e$, $A_t \in \mathbb{R}^{N\times N}$ denotes the node-edge relation matrix for edge type $t \in \mathcal{T}^e$, i.e. for the relation between the two node types that $t$ connects: if an edge of type $t$ exists between nodes $v_i$ and $v_j$, the corresponding element is 1, and otherwise 0; the entries belonging to the three other node-edge relation types are all 0. The heterogeneous graph can be expressed as $\mathbb{A} \in \mathbb{R}^{N\times N\times|\mathcal{T}^e|}$, where N is the number of nodes in the heterogeneous graph. The node-edge relation matrix is referred to herein as a heterogeneous adjacency matrix, because it stores the node-edge relation: an entry is 1 where an edge exists and 0 otherwise.
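The construction of heterogeneous adjacency matrices can be illustrated with a small NumPy sketch. The function name `heterogeneous_adjacency`, the typed edge-list input format and the toy graph are assumptions made for this example; the patent does not specify a data layout.

```python
import numpy as np

def heterogeneous_adjacency(num_nodes, typed_edges, edge_types):
    """Return a tensor of shape (N, N, K); slice k holds the edges of edge_types[k].

    typed_edges is an iterable of (source, target, edge_type) triples.
    Hypothetical helper for illustration only.
    """
    A = np.zeros((num_nodes, num_nodes, len(edge_types)), dtype=int)
    index = {t: i for i, t in enumerate(edge_types)}
    for src, dst, t in typed_edges:
        A[src, dst, index[t]] = 1   # 1 where an edge of this type exists
    return A

# Toy graph: nodes 0 and 1 are papers, node 2 is an author, node 3 is a conference.
edges = [(0, 2, "PA"), (2, 0, "AP"), (1, 2, "PA"), (2, 1, "AP"),
         (0, 3, "PC"), (3, 0, "CP")]
A = heterogeneous_adjacency(4, edges, ["PA", "AP", "PC", "CP"])
```

Each slice `A[:, :, k]` is one N x N heterogeneous adjacency matrix, and entries that belong to other relation types remain 0 in that slice.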

Where it causes no ambiguity, if a composite relation $R = t_1 \circ t_2 \circ \cdots \circ t_l$ or an edge type sequence $(t_1, t_2, \ldots, t_l)$ is given, the meta-path can also be represented directly by the object types, i.e. $P = (v_1 v_2 \ldots v_{l+1})$. The adjacency matrix $A_P$ representing the meta-path P can be obtained according to the following formula:

$$A_P = A_{t_1} A_{t_2} \cdots A_{t_l}$$

wherein $A_P$ represents the meta-path conversion matrix corresponding to the meta-path P, for which reference can be made to formula (8);

the product $A_{t_j} A_{t_{j+1}}$ represents the node-edge relation matrix formed by $A_{t_j}$ and $A_{t_{j+1}}$;

$A_{t_l}$ represents the adjacency matrix of node-edge type $t_l$ in the heterogeneous graph.
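A small NumPy example may help show how multiplying heterogeneous adjacency matrices yields a meta-path adjacency matrix. The toy author-paper graph and the variable names are assumptions made for this illustration.

```python
import numpy as np

# Toy graph: nodes 0, 1, 2 are authors; nodes 3, 4 are papers.
N = 5
A_AP = np.zeros((N, N), dtype=int)          # author -> paper edges
for author, paper in [(0, 3), (1, 3), (1, 4), (2, 4)]:
    A_AP[author, paper] = 1
A_PA = A_AP.T                               # paper -> author edges

# Meta-path author-paper-author (APA): product of the two relation matrices.
A_APA = A_AP @ A_PA
# Meta-path APAPA: compose APA with itself.
A_APAPA = A_APA @ A_APA
```

Here `A_APA[0, 1]` is 1 because authors 0 and 1 share paper 3, while `A_APA[0, 2]` is 0 because authors 0 and 2 share no paper. `A_APAPA[0, 2]` is nonzero, since both have co-authored with author 1, which is the kind of higher-order relation between unconnected nodes that meta-paths expose.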

KNN_graph: the K-nearest-neighbor graph (KNN_graph) is a weighted directed graph $G_k = (\mathcal{V}_k, \varepsilon_k)$, where $\mathcal{V}_k$ represents the node set $\mathcal{V}_k = \{v_{k1}, v_{k2}, \ldots, v_{kn}\}$ and $\varepsilon_k$ represents the edge set $\varepsilon_k = \{e_{k1}, e_{k2}, \ldots, e_{km}\}$. Unlike an ordinary graph, in the K-nearest-neighbor graph $\varepsilon_k$ stores, for each node $v_{ki}$, the edges to its K most similar nodes under some similarity measure.

Community: given a set of communities $C = \{C_1, C_2, \ldots, C_M\}$, each community $C_m$ is a partition of the original graph that preserves the regional structure and clustering properties. The condition for a node $v_i$ to be clustered into a community $C_m$ is that its degree inside the community exceeds its degree outside the community.
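The membership condition above, that a node's internal degree exceeds its external degree, can be checked with a short sketch. The function name `satisfies_community_condition` and the toy graph are hypothetical, and an unweighted adjacency matrix is assumed.

```python
import numpy as np

def satisfies_community_condition(A, members, node):
    """True if `node` has more neighbors inside `members` than outside it."""
    inside = set(members)
    neighbors = [int(j) for j in np.flatnonzero(A[node])]
    internal = sum(1 for j in neighbors if j in inside)
    external = len(neighbors) - internal
    return internal > external

# Toy undirected graph with edges (0,1), (0,2), (1,2), (2,3).
A = np.zeros((4, 4), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3)]:
    A[i, j] = A[j, i] = 1

in_triangle = satisfies_community_condition(A, [0, 1, 2], 2)   # 2 inside, 1 outside
alone = satisfies_community_condition(A, [3], 3)               # 0 inside, 1 outside
```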

Graph convolutional network (GCN): as a specific graph-based neural network model f(X, A), the GCN constructs a multi-layer graph convolutional network with the following layer-wise propagation rule:

$$H^{(l+1)} = \sigma\big(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}\,W^{(l)}\big)$$

wherein $H^{(l)}$ is the feature representation of the l-th layer; throughout the text the superscript $(l)$ denotes the l-th graph convolution layer;

$\tilde{A} = A + I$ is the heterogeneous graph adjacency matrix with self-connections added;

$\tilde{D}$ is the degree matrix of $\tilde{A}$;

$W^{(l)} \in \mathbb{R}^{d\times d}$ is a trainable weight matrix;

$\sigma(\cdot)$ represents an activation function, such as ReLU(·) = max(0, ·).

In this work, since the object of study is a directed graph, the graph convolution uses the inverse degree matrix, $\tilde{D}^{-1}\tilde{A}$, to perform the normalization.
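The propagation rule and its directed-graph variant can be compared in a minimal NumPy sketch. The function name `gcn_layer`, the `symmetric` flag, the use of row sums as degrees and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def gcn_layer(A, H, W, symmetric=True):
    """One graph convolution layer with ReLU activation.

    symmetric=True  uses D~^{-1/2} A~ D~^{-1/2};
    symmetric=False uses D~^{-1} A~, the directed-graph normalization.
    Degrees are taken as row sums of A~ (an assumption of this sketch).
    """
    A_tilde = A + np.eye(A.shape[0])            # add self-connections
    d = A_tilde.sum(axis=1)                     # degrees, >= 1 for non-negative A
    if symmetric:
        d_inv_sqrt = 1.0 / np.sqrt(d)
        A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    else:
        A_hat = A_tilde / d[:, None]
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU

# Directed toy graph with a single edge 0 -> 1.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
H = np.eye(2)
W = np.eye(2)
out_sym = gcn_layer(A, H, W, symmetric=True)
out_dir = gcn_layer(A, H, W, symmetric=False)
```

With the directed normalization every row of the propagated matrix sums to 1, so each node takes a plain average over itself and its neighbors.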

The method provided by the invention aims to find communities of nodes of the same target type in a heterogeneous graph, mining structural information in the feature space and generating high-quality node embeddings. A K-nearest-neighbor graph is generated from the node features to obtain the underlying topology information in the feature space; the proposed KGNN_HCD constructs heterogeneous adjacency matrices from the dataset to find meaningful meta-paths, and applies a more effective graph convolutional network to learn more powerful node representations.

The invention provides a heterogeneous graph community discovery method based on a K-nearest-neighbor graph neural network, KGNN_HCD; the architecture of the method is shown in FIG. 2. In FIG. 2, KGNN_HCD takes the heterogeneous adjacency matrices together with the K-nearest-neighbor graph adjacency matrix representing feature-space topology information as the input of the GCN, assigns weights to the different heterogeneous node-edge relations with a weight matrix, generates meta-path transition matrices at the MP-Trans layer, and from them generates the meta-path conversion matrix. Graph convolution is then performed on the new graph structure in the GCN, and k-means is applied to the learned node representations to find the community structure in the heterogeneous graph.

1. The proposed method

1.1 K-nearest-neighbor graph information fusion

To obtain the structural information in the feature space, a K-nearest-neighbor graph $G_k = (A_k, X)$ is constructed from the input feature matrix X of the heterogeneous graph. It reveals the underlying structure of the data, where $A_k$ represents the adjacency matrix of the K-nearest-neighbor graph. The specific operation is shown in FIG. 3.

In FIG. 3, node similarity is calculated from the feature vectors of the nodes, and the K nodes most similar to each target node are selected according to the similarity matrix to construct the K-nearest-neighbor graph adjacency matrix required by the model. The core of constructing the K-nearest-neighbor graph in KGNN_HCD is the choice of similarity metric, from which the similarity between nodes is computed to form a similarity matrix S. The invention may use any of the following three similarity measures, where $x_i$ and $x_j$ are the feature vectors of nodes $v_i$ and $v_j$:

1) Cosine similarity, which measures similarity by the cosine of the angle between two vectors and takes values in [-1, 1]:

$$S_{ij} = \frac{x_i \cdot x_j}{\lVert x_i\rVert\,\lVert x_j\rVert}$$

2) Heat kernel, whose similarity is calculated as shown in formula (4), where t is the time parameter in the heat conduction equation:

$$S_{ij} = e^{-\frac{\lVert x_i - x_j\rVert^{2}}{t}} \quad (4)$$

where $\lVert x_i - x_j\rVert$ is the distance between the two vectors.

3) Dot product, which is mainly applied to discrete data such as bag-of-words features, where the calculated similarity depends only on the number of identical words, as shown in formula (5):

$$S_{ij} = x_j^{T} x_i \quad (5)$$

where $^{T}$ is the transpose symbol.
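The three similarity measures listed above can be computed for all node pairs at once, as in the following NumPy sketch. The function names and the toy feature matrix are assumptions made for illustration, and the heat-kernel form assumes the squared Euclidean distance divided by t.

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X (N x d)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)        # guard against zero vectors
    return Xn @ Xn.T

def heat_kernel_similarity(X, t=1.0):
    """Pairwise heat-kernel similarity exp(-||x_i - x_j||^2 / t)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.clip(d2, 0.0, None) / t)  # clip tiny negative round-off

def dot_product_similarity(X):
    """Pairwise dot products x_j^T x_i."""
    return X @ X.T

# Three toy feature vectors: two parallel, one orthogonal to them.
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
S_cos = cosine_similarity(X)
S_heat = heat_kernel_similarity(X, t=2.0)
S_dot = dot_product_similarity(X)
```

The parallel vectors get cosine similarity 1 regardless of their lengths, whereas the heat kernel and dot product are both sensitive to vector magnitude.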

In the invention, cosine similarity is uniformly chosen as the similarity measure to obtain the similarity matrix S; the top K most similar nodes of each node are then selected as its neighbors and connected by edges, generating an undirected K-nearest-neighbor graph and thus the required K-nearest-neighbor graph adjacency matrix $A_k$.
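The top-K selection can be sketched as follows. The function name `knn_adjacency` is hypothetical, and two details are assumptions of this sketch rather than statements of the patent: a node is excluded from its own neighbor list, and the graph is made undirected by keeping an edge if either endpoint selected the other.

```python
import numpy as np

def knn_adjacency(S, k):
    """Build an undirected K-nearest-neighbor adjacency matrix from similarity S."""
    n = S.shape[0]
    S = S.astype(float).copy()
    np.fill_diagonal(S, -np.inf)                 # a node is not its own neighbor
    # Column indices of the k largest similarities in each row.
    nbrs = np.argsort(-S, axis=1, kind="stable")[:, :k]
    A = np.zeros((n, n), dtype=int)
    rows = np.repeat(np.arange(n), k)
    A[rows, nbrs.ravel()] = 1
    return np.maximum(A, A.T)                    # symmetrize: undirected graph

# Toy similarity matrix in which nodes {0, 1} and {2, 3} are mutually closest.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
A_k = knn_adjacency(S, k=1)
```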

After the above operation, the underlying structural information in the feature space has been obtained. In addition, the K most similar neighbors are added for each node to aggregate further information, which increases node similarity and raises the likelihood of node co-occurrence. The K-nearest-neighbor graph adjacency matrix $A_k$ is concatenated with the adjacency matrix of the input graph, to which the identity matrix has been added, to obtain the final input $\tilde{\mathbb{A}}$.
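One way to picture the concatenation is to stack the heterogeneous adjacency slices, an identity slice and the K-nearest-neighbor slice along a channel axis. This is a sketch under assumptions: the function name `fuse_inputs`, the (N, N, K) tensor layout and the order of the slices are not specified in the text.

```python
import numpy as np

def fuse_inputs(hetero_adj, A_k):
    """Stack heterogeneous slices, an identity slice and the KNN slice.

    hetero_adj: (N, N, K) heterogeneous adjacency tensor; A_k: (N, N).
    Returns a tensor of shape (N, N, K + 2). Hypothetical helper.
    """
    n = A_k.shape[0]
    identity = np.eye(n, dtype=hetero_adj.dtype)
    knn = A_k.astype(hetero_adj.dtype)
    return np.concatenate(
        [hetero_adj, identity[:, :, None], knn[:, :, None]], axis=2
    )

# Toy input: two edge types on three nodes plus a small KNN graph.
hetero_adj = np.zeros((3, 3, 2), dtype=int)
hetero_adj[0, 1, 0] = 1
hetero_adj[1, 0, 1] = 1
A_k = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
fused = fuse_inputs(hetero_adj, A_k)
```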

1.2 Meta-path information conversion

The rich semantics of meta-paths are an important feature of heterogeneous information networks. The connection relations between objects differ under different meta-paths, and so do the path definitions, which can affect many specific tasks. Previous meta-path-based works depend heavily on predefined meta-paths, and differences among the meta-paths manually defined by domain experts directly affect model performance. In contrast, KGNN_HCD learns meta-paths end to end and can generate meta-paths of any length according to the characteristics of the dataset and the requirements of the task.

In the previous section, a K-nearest-neighbor graph adjacency matrix expressing the feature space topology was added to the heterogeneous graph, giving the input $\tilde{\mathbb{A}}$. When meta-path information is aggregated, node-edge relations of different types play different roles and show different importance in learning node embeddings for a specific task. A weight convolution $W_{att}$ is therefore introduced to constrain node-edge relations of different importance. The weight convolution is normalized by a softmax function so that the node-edge relation matrices are expressed with different weights, which generates a meta-path transition matrix $T$ carrying heterogeneous edge information. The calculation formula is as follows:

$$T = \mathrm{conv}_{1\times 1}\big(\tilde{\mathbb{A}};\, W^{(l)}\big) = \sum_{t_i \in \mathcal{T}^e} W_i^{(l)} A_{t_i}$$

where $W_{att}$ represents the weight convolution at the MP-Trans layer;

$\mathrm{conv}_{1\times 1}$ represents a convolution layer;

$W^{(l)}$ is the parameter of the 1×1 convolution;

$\mathcal{T}^e$ represents the set of edge types;

$A_{t_i}$ represents the heterogeneous adjacency matrix whose node-edge relation is $t_i$;

$W_i^{(l)} = \mathrm{softmax}(W_{att})_i$ represents the coefficients of the convex combination of heterogeneous adjacency matrices.
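The softmax-normalized convex combination can be illustrated with a short sketch for a single output channel. The names `softmax` and `transition_matrix`, the toy matrices and the raw weights are assumptions chosen for illustration; in the method the raw weights would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transition_matrix(adjs, raw_weights):
    """Return sum_i w_i * A_i with w = softmax(raw_weights), plus the weights w."""
    weights = softmax(raw_weights)
    n = len(adjs[0])
    combined = [
        [sum(w * adj[r][c] for w, adj in zip(weights, adjs)) for c in range(n)]
        for r in range(n)
    ]
    return combined, weights


# Two hypothetical edge-type matrices over the same two nodes.
a_pa = [[0, 1], [0, 0]]
a_ap = [[0, 0], [1, 0]]
t, w = transition_matrix([a_pa, a_ap], raw_weights=[0.0, 0.0])
```

Because the weights are non-negative and sum to 1, the result is a convex combination of the candidate adjacency matrices; equal raw weights give each edge type the coefficient 0.5.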

So that the proposed method can learn different behaviors with the same mechanism and combine them as knowledge, KGNN_HCD sets the number of output channels of the convolution layer to C in order to fully consider different types of meta-paths, which is similar to the multi-head attention mechanism. In the first layer of meta-path information fusion, i.e. the meta-path information conversion layer (MP-Trans Layer), two meta-path transition matrices $T_1 \in \mathbb{R}^{N\times N\times C}$ and $T_2 \in \mathbb{R}^{N\times N\times C}$ are calculated. Both carry heterogeneous node-edge relation information (for the ACM dataset: PA, AP, PS, SP). Multiplying $T_1$ by $T_2$ performs the first fusion of meta-path information, which learns the meta-path information of all valid node-edge relation combinations with meta-path length 3. For numerical stability, KGNN_HCD normalizes the meta-path conversion matrix $A^{(l)}$ with the degree matrix of the heterogeneous graph, i.e.

$$A^{(1)} = D^{-1} T_1 T_2$$

Each subsequent meta-path conversion matrix is obtained by multiplying the meta-path conversion matrix of the previous layer by the meta-path transition matrix calculated at the current layer from the input $\tilde{\mathbb{A}}$, namely

$$A^{(l)} = D^{-1} A^{(l-1)}\, \mathrm{conv}_{1\times 1}\big(\tilde{\mathbb{A}};\, \mathrm{softmax}(W_{att}^{(l)})\big)$$

where $A^{(l)}$ represents the meta-path conversion matrix of the current layer;

$D$ represents the degree matrix;

$A^{(l-1)}$ represents the meta-path conversion matrix of the previous layer;

$W_{att}^{(l)}$ represents the weight convolution of the current layer.
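The layer recursion can be sketched as below for a single channel. The helper names are hypothetical, and taking D as the diagonal matrix of row sums of the product being normalized is an assumption, since the text only states that a degree matrix is used.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def degree_normalize(m):
    """Left-multiply by D^{-1}, where D holds the row sums of m (zero rows stay zero)."""
    out = []
    for row in m:
        degree = sum(row)
        out.append([v / degree if degree else 0.0 for v in row])
    return out


def stack_layers(transitions):
    """A^(1) = D^-1 T_1 T_2, then A^(l) = D^-1 A^(l-1) T for each further T."""
    a = degree_normalize(matmul(transitions[0], transitions[1]))
    for t in transitions[2:]:
        a = degree_normalize(matmul(a, t))
    return a


# Toy transition matrices standing in for T_1 and T_2.
t1 = [[0, 1], [1, 1]]
t2 = [[1, 1], [0, 1]]
a1 = stack_layers([t1, t2])
```

Normalizing after every multiplication keeps the entries of each row bounded, which is the numerical stability motive given in the text.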

Next, how KGNN_HCD learns meta-paths of arbitrary length is explained. Given a meta-path P formed by a series of composite relations $t_1, t_2, \ldots, t_l$, the corresponding meta-path conversion matrix can be calculated as follows:

$$A_P = \prod_{j=1}^{l}\Big(\sum_{t_i \in \mathcal{T}^e} W_i^{(j)} A_{t_i}\Big)$$

where $A_{t_i}$ represents the heterogeneous adjacency matrix whose edge type is $t_i$;

$\mathcal{T}^e$ represents the set of edge types;

$W_i^{(j)}$ can be regarded as the weight of $A_{t_i}$ in the j-th meta-path transition matrix, so that KGNN_HCD can distinguish meta-paths of different importance. $A_P$ can then be seen as a weighted sum of the adjacency matrices of all meta-paths composed of l relations, so the proposed method can learn meta-path information of arbitrary length.

Because the product of these transition matrices only contains meta-paths of the full length, the original edges themselves and other shorter meta-paths would be ignored when meta-path information is fused. It is therefore necessary to add the identity matrix I to the heterogeneous adjacency matrices. In this way KGNN_HCD can learn meta-paths of any length: when l meta-path transition matrices are multiplied, meta-path information of length up to l+1 can be fused.
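The effect of the identity matrix can be checked numerically on a toy graph. In the sketch below, the three-node graph, the weights 0.7/0.3 and 0.8/0.2, and the helper names are all assumptions for illustration. The product of two transition matrices that each include I expands into a weighted sum of the length-3 meta-path P-A-P, the original P-A and A-P edges, and I.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def weighted_sum(mats, weights):
    """Return sum_i weights[i] * mats[i] for equally sized square matrices."""
    n = len(mats[0])
    return [
        [sum(w * m[r][c] for w, m in zip(weights, mats)) for c in range(n)]
        for r in range(n)
    ]


# Hypothetical toy graph: node 0 = paper, node 1 = author, node 2 = paper.
a_pa = [[0, 1, 0], [0, 0, 0], [0, 1, 0]]
a_ap = [[0, 0, 0], [1, 0, 1], [0, 0, 0]]
eye = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Each transition matrix is a convex combination that includes the identity.
t1 = weighted_sum([a_pa, eye], [0.7, 0.3])
t2 = weighted_sum([a_ap, eye], [0.8, 0.2])
product = matmul(t1, t2)

# (0.7 A_PA + 0.3 I)(0.8 A_AP + 0.2 I) expanded term by term.
expansion = weighted_sum(
    [matmul(a_pa, a_ap), a_pa, a_ap, eye], [0.56, 0.14, 0.24, 0.06]
)
max_gap = max(abs(p - e) for pr, er in zip(product, expansion) for p, e in zip(pr, er))
```

Entry (0, 2) of `product` carries the P-A-P connection between the two papers with weight 0.56, while entry (0, 1) keeps the original P-A edge with weight 0.14. Without I in the candidate set only the P-A-P term would remain.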

1.3 Community discovery

The main goal of this module is to perform community discovery on the learned node representations. The final node representation should contain rich meta-path semantic information, but because heterogeneous graphs are scale-free, the variance of the data is high. To cope with this challenge, KGNN_HCD extends meta-path-level information fusion to multiple channels by increasing the number of output channels C. After the meta-path conversion matrix has been obtained through l multiplications, a GCN and an MLP are applied to each channel of the meta-path conversion matrix, and the final node representation is

$$Z = \Big\Vert_{i=1}^{C}\, \sigma\big(\tilde{D}_i^{-1} \tilde{A}_i^{(l)} X W\big)$$

where $\Vert$ is the concatenation operator, which splices the node representations of the channels;

C represents the number of channels;

$\sigma$ represents an activation function;

$\tilde{A}_i^{(l)}$ represents the meta-path conversion matrix of the i-th channel of the l-th layer;

$\tilde{D}_i$ is the degree matrix of $\tilde{A}_i^{(l)}$;

$X$ is the node feature matrix and $W$ is a trainable weight matrix.
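The per-channel graph convolution and the concatenation can be sketched as follows. This is a simplified illustration under stated assumptions: ReLU is used as the activation, the MLP mentioned in the text is omitted, the degree matrix is taken as the row sums of each channel matrix, and all names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def degree_normalize(m):
    """Left-multiply by D^{-1}, where D holds the row sums of m (zero rows stay zero)."""
    out = []
    for row in m:
        degree = sum(row)
        out.append([v / degree if degree else 0.0 for v in row])
    return out


def relu(m):
    """Element-wise ReLU, used here as the activation function."""
    return [[max(0.0, v) for v in row] for row in m]


def node_representation(channels, x, w):
    """Apply sigma(D^-1 A X W) to every channel and concatenate the results."""
    xw = matmul(x, w)
    per_channel = [relu(matmul(degree_normalize(a), xw)) for a in channels]
    # Concatenate the per-channel outputs row by row.
    return [sum((out[r] for out in per_channel), []) for r in range(len(x))]


# Two toy channels over two nodes, 2-dimensional features, output dimension 1.
channels = [[[1, 1], [1, 1]], [[1, 0], [0, 1]]]
x = [[1.0, 0.0], [0.0, 1.0]]
w = [[1.0], [-1.0]]
z = node_representation(channels, x, w)
```

Each channel contributes one block of columns to Z, so a model with C channels and per-channel output dimension dim yields C × dim columns in this sketch.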

After the node representations are obtained, the classical k-means algorithm is applied to the learned embeddings to discover communities. Specifically, k is set to the number of node categories, which yields k clusters. The community labels obtained by KGNN_HCD are then compared with the ground-truth community labels of the dataset, and the community discovery result is evaluated with common community discovery metrics.
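The clustering step can be sketched with a small Lloyd-style k-means in plain Python. The deterministic farthest-point initialization is an assumption made so that the example is reproducible; the text does not specify an initialization, and the embeddings below are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    """Cluster points into k groups and return one label per point."""
    # Assumption: deterministic farthest-point initialization starting at point 0.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [None] * len(points)
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical 2-dimensional embeddings forming two well-separated groups.
embeddings = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [0.5, 0.5]]
labels = kmeans(embeddings, k=2)
```

Here k = 2 plays the role of the number of node categories, and each resulting cluster is read as one community.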

2. Experimental analysis

2.1 Datasets

The experiments use three real-world heterogeneous graph datasets: the heterogeneous citation networks ACM and DBLP, and the movie dataset IMDB. Their detailed information, including the number of nodes, edge types, and meta-paths, is summarized in Table 2.

Table 2 Statistics of the datasets

·ACM

The ACM (Association for Computing Machinery) dataset is a bibliographic information network. Papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM and VLDB were extracted. The whole dataset includes 3 node types: 3025 Paper (P), 5835 Author (A), and 56 Subject (S) nodes. Nodes of type P carry one of 3 labels, Database, Wireless Communication, and Data Mining, that is, Paper nodes can be classified into 3 classes. The ACM dataset also contains 4 edge types: 9936 P-A and A-P edges and 3025 P-S and S-P edges. The dataset therefore has 4 heterogeneous adjacency matrices with the corresponding node-edge relations, namely $A_{P-A}$, $A_{A-P}$, $A_{P-S}$ and $A_{S-P}$. For the ACM dataset, the meta-paths with semantic information are PAP and PSP.

·DBLP

The DBLP (DataBase systems and Logic Programming) dataset is a network that reflects the relationships between authors and papers. Nodes are of 3 types: 14328 Paper (P), 4057 Author (A) and 20 Conference (C) nodes. The authors fall into 4 research areas: Database, Data Mining, Machine Learning, and Information Retrieval. Each author's research area is labeled according to the conferences to which the author has submitted. The DBLP dataset contains 4 edge types: 19645 P-A and A-P edges and 14328 P-C and C-P edges. The corresponding heterogeneous adjacency matrices are $A_{P-A}$, $A_{A-P}$, $A_{P-C}$ and $A_{C-P}$. For the DBLP dataset, the meta-paths with semantic information are APA, APCPA, and APAPA.

·IMDB

IMDB (Internet Movie Database) is a website about movies and related information that records users' preferences for different movies. The dataset contains 3 node types: 2939 Movie (M), 5841 Actor (A), and 2269 Director (D) nodes. Nodes of type Movie are labeled with one of three classes: Action, Comedy, and Drama. There are 4 edge types: 4661 M-A and A-M edges and 13983 M-D and D-M edges. The corresponding heterogeneous adjacency matrices are $A_{M-A}$, $A_{A-M}$, $A_{M-D}$ and $A_{D-M}$. When the IMDB dataset is studied, tasks can be mined through the two meta-paths with semantic information, namely MAM and MDM.

2.2 Baseline model

The proposed KGNN_HCD method is compared with state-of-the-art heterogeneous community discovery methods such as CP-GNN and GTN. To verify the fairness and effectiveness of the experiments, traditional network embedding algorithms are compared first. These algorithms were originally designed for homogeneous graphs, so node heterogeneity is ignored in the experiments and the whole heterogeneous graph is treated as a homogeneous one. For a fair comparison, methods based on graph neural networks are also included. Most of them are designed for heterogeneous network embedding and use meta-paths to capture the characteristics of heterogeneous information networks, although in different ways.

Conventional network embedding method

Infomap is an algorithm based on information coding theory. It encodes the paths of random walks and uses the code length as the objective function to optimize the community partition. Here, the method treats the heterogeneous graph as a homogeneous graph and converts the community discovery problem into an information compression problem.

Node2vec is a network embedding method that jointly considers DFS neighborhoods and BFS neighborhoods, and can be regarded as an extension of DeepWalk. Node2vec uses a second-order random walk that can move far away to describe macroscopic features of the network and can stay local to preserve the community information of a node.

Metapath2vec (Mp2vec) is a state-of-the-art heterogeneous information network embedding method. It uses meta-path-based random walks to construct a heterogeneous neighborhood for each vertex, and then uses the Skip-Gram model to complete vertex embedding. On the basis of Metapath2vec, its authors also proposed Metapath2vec++ to model both structural and semantic associations in heterogeneous networks.

The HIN2vec model learns the rich information in heterogeneous information networks by studying the different types of relationships and the network structure between nodes. The method samples random walks based on meta-paths of a given size and feeds them into a neural network. It encodes the rich information embedded in the meta-paths and the whole network structure according to the different semantics of different meta-paths, and learns more meaningful node representations.

Method based on graph neural network

GCN is a state-of-the-art graph convolution method for homogeneous graphs. It is a semi-supervised graph convolutional network model that performs convolution in the graph Fourier domain and globally captures complex features by aggregating neighbor information for node representation learning. Here, GCN is tested on the homogeneous graphs induced by the meta-paths, and the best result over the meta-paths is reported.

GCN_KG is a special GCN. To explore the advantages of the K-nearest-neighbor graph, the model input abandons the original topological graph and instead uses the sparse K-nearest-neighbor graph calculated from the feature matrix as the input graph of the GCN. This variant is denoted GCN_KG.

GAT is a semi-supervised homogeneous graph neural network model. The model performs convolution in the graph spatial domain and introduces an attention mechanism. For each node, it aggregates the neighbor representations using importance scores learned by node-level attention. Likewise, GAT is tested on all meta-paths and the best performance is reported.

HAN is one of the early attempts to handle heterogeneous graphs. It converts a heterogeneous information network into multiple homogeneous graphs through given symmetric meta-paths and captures node-level and semantic-level importance with a hierarchical attention mechanism. Node-level attention on each meta-path neighbor graph is implemented through GAT, which gives the final representation.

CP-GNN proposes to capture higher-order relationships between nodes using context paths and builds a heterogeneous graph neural network model based on context paths. It recursively embeds higher-order relationships between nodes into the node embeddings through an attention mechanism that distinguishes the importance of different relationships. Node embeddings are learned by maximizing the expected co-occurrence of nodes connected by context paths.

GTN is a method for heterogeneous graphs that can automatically discover valuable meta-paths instead of relying on manual selection as HAN does. It considers all possible cases by computing all possible meta-path-based graphs and then performs graph convolution on the resulting graphs.

2.3 Implementation details

For the proposed KGNN_HCD method, the parameters are randomly initialized and the model is optimized with Adam; the learning rate is set to 0.005 and the regularization parameter to 0.001. The hyperparameters of each baseline are selected so that it yields its best performance. For the random-walk-based models, the window size is set to 5, the walk length to 100, the number of walks per node to 40, and the number of negative samples to 7. For the semi-supervised graph neural networks, including GCN, GAT, and HAN, exactly the same training, validation, and test splits are used to ensure fairness. For KGNN_HCD, the number of meta-path information conversion layers is set to 3 for the DBLP and IMDB datasets and to 2 for the ACM dataset; for the K-nearest-neighbor graph, K is set to 4. For a fair comparison, the embedding dimension of all the above algorithms is set to 64.

2.4 Performance comparison

In this experiment, the proposed KGNN_HCD method first learns node embeddings and then applies the k-means algorithm to these embeddings for community discovery, where k is set to the number of node classes. The performance of KGNN_HCD is evaluated against the ground-truth labels with four metrics, F1, NMI, ARI, and Purity, and compared with 11 methods such as Infomap. The experimental results are shown in Table 3.
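Two of the four metrics, Purity and NMI, can be sketched in plain Python as follows. The function names are hypothetical, and NMI is normalized here by the arithmetic mean of the two entropies, which is one common variant; the text does not state which normalization was used.

```python
import math
from collections import Counter


def purity(true_labels, pred_labels):
    """Fraction of nodes that belong to the majority true class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural logarithm) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Normalized mutual information with arithmetic-mean normalization."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_t, count_p = Counter(true_labels), Counter(pred_labels)
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p])) for (t, p), c in joint.items()
    )
    denom = entropy(true_labels) + entropy(pred_labels)
    # Two trivial single-cluster partitions are treated as identical.
    return 1.0 if denom == 0 else 2.0 * mutual_info / denom


truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]   # same partition, cluster ids permuted
unrelated = [0, 1, 0, 1]  # carries no information about the truth
```

Both metrics are invariant to the numbering of the clusters, which matters because k-means assigns cluster ids arbitrarily.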

Table 3 Performance comparison of different community discovery methods

As is clear from Table 3, the F1 score of the proposed KGNN_HCD method is 2.68% higher than that of the GTN method on the ACM dataset, 1.35% higher on the DBLP dataset, and 1.89% higher on the IMDB dataset; for NMI and ARI the gains are 2.54% and 2.56%, 2.59% and 1.47% on the ACM dataset, and 1.22% and 1.67% on the DBLP dataset. These results indicate that fusing K-nearest-neighbor graph information with the converted meta-path information captures the structural information of the feature space and adaptively acquires more meaningful meta-paths, so that node representations better suited to community discovery are learned.

All graph-based embedding methods outperform the traditional Infomap algorithm, which shows their great potential in community discovery tasks. Among the network-embedding-based methods, the overall performance of HIN2vec is better than that of Node2vec and Metapath2vec, but its advantage is prominent only on the ACM dataset, and its results on the DBLP and IMDB datasets are not ideal. The reason is that, although HIN2vec can automatically discover meta-paths through random walks, the discovered meta-paths may not be suitable for community discovery. In addition, Metapath2vec works better on the DBLP dataset than on the ACM dataset, which indicates the importance of higher-order relationships.

GCN, GCN_KG and LGNN perform worst among the GNN-based baselines. A possible reason is that they were originally proposed for homogeneous graphs and do not consider the complex context information in heterogeneous graphs. All results of GCN_KG are better than those of GCN, which further indicates the necessity of introducing the K-nearest-neighbor graph to obtain feature space information. The performance of GAT is better than that of GCN and LGNN, which strongly supports the importance of the attention mechanism. The attention mechanism in GAT can be seen as a simple way to distinguish node types and edge types in a heterogeneous graph. Thanks to meta-paths, HAN can explicitly mine complex semantic information and achieves good results. CP-GNN learns meta-paths on the basis of context paths, captures higher-order relationships well, and performs well. The graph transformer network of GTN can identify useful connections between nodes that have no edge in the original graph and needs no domain knowledge to learn meta-paths, and the results show that this model performs better. Among the graph-neural-network-based methods, HAN, CP-GNN, GTN and KGNN_HCD perform better than GCN and GAT, which shows that community discovery models based on heterogeneous graph neural networks account well for the heterogeneity of nodes and edges.

2.5 Parameter sensitivity experiments

This section presents parameter sensitivity experiments on the three datasets ACM, DBLP and IMDB.

The effect of the dimension of the final embedding Z was tested first. Five experiments were performed with the embedding dimension set to 4, 16, 64, 128 and 1024, respectively. To control the other variables, K in the K-nearest-neighbor graph was uniformly set to 4. The results are shown in Fig. 4.

As can be seen from Fig. 4, on all three datasets the performance first increases and then slowly decreases as the embedding dimension grows. A likely reason is that KGNN_HCD requires a suitable dimension to encode meta-path information: a smaller dimension may prevent the model from learning meaningful information, while a larger dimension may introduce additional redundancy. According to multiple experiments, an embedding dimension of 64 is the most appropriate.

To examine the effect of the number of neighbors K in the K-nearest-neighbor graph on the proposed method, the performance of KGNN_HCD for K ranging from 2 to 10 was studied, as shown in Fig. 5.

Fig. 5 shows that on all three datasets, ACM, DBLP and IMDB, the accuracy of KGNN_HCD first increases and then starts to decrease. This is probably because, when K is small, few neighbors are selected and the nodes most similar to the target node are easily chosen. As K increases, however, some nodes that are less similar to the target node are forced to keep edges, which increases the complexity of the graph and makes it harder to find the nodes that are most similar to the target node. In addition, as the graph becomes denser the features are more easily over-smoothed, which is another reason for the degradation of model performance. As can be seen from Fig. 5, KGNN_HCD works best when K is 4.

To investigate the effect of the meta-path conversion layers on KGNN_HCD, the number of layers was set to 2, 3, 4 and 5, respectively, and the NMI and ARI of the model on the three datasets were observed. The results are shown in Fig. 6.

Taking the DBLP dataset as an example, Fig. 6 shows that the performance of KGNN_HCD keeps decreasing as the number of layers increases. This is because the role of the meta-path information conversion layer is to aggregate meta-paths, and the longest effective meta-path for the DBLP dataset has length 5, i.e. APCPA. When the number of meta-path information conversion layers increases further, meta-paths without semantic information are also used by KGNN_HCD, which causes additional redundancy and harms performance. Several experiments show that the overall effect of the KGNN_HCD method is best when the number of meta-path information conversion layers is set to 2 for the ACM dataset and to 3 for the DBLP and IMDB datasets.

2.6 Interpretability analysis of the MP-Trans Layer

The MP-Trans Layer fuses two meta-path transition matrices to obtain a meta-path conversion matrix that carries importance weights, and thereby expresses the differences among meta-paths. This section explains in detail how the MP-Trans Layer distinguishes the importance of each meta-path from the generated meta-path transition matrices. For convenience of explanation, the number of output channels is set to 1, and the convex combination of the adjacency matrices of the input heterogeneous graph is defined as

$$T_j = \sum_{t_i \in \mathcal{T}^e} W_i^{(j)} A_{t_i}$$

Correspondingly, the meta-path conversion matrix $A^{(l)}$ generated by an MP-Trans Layer is obtained from the meta-path conversion matrix output by the previous layer and the current meta-path transition matrix. For the first layer this gives

$$A^{(1)} = \big(D^{(1)}\big)^{-1} T_1 T_2 = \big(D^{(1)}\big)^{-1} \sum_{t_i, t_k \in \mathcal{T}^e} W_i^{(1)} W_k^{(2)} A_{t_i} A_{t_k}$$

where $D^{(1)}$ represents the degree matrix, $A_{t_i}$ represents the heterogeneous adjacency matrix whose node-edge relation type is $t_i$, and $W_i^{(j)}$ is the weight of the heterogeneous adjacency matrix $A_{t_i}$ in the j-th transition matrix. From the above, two meta-path transition matrices are generated in the first MP-Trans Layer, with the two coefficients $W_i^{(1)}$ and $W_k^{(2)}$ allocated correspondingly. Because the two generated meta-path transition matrices are multiplied, what is actually fused here is meta-path information of path length 3, e.g. P-A-P. To fuse further information, one more weighted combination of the heterogeneous adjacency matrices, $\sum_{t_i \in \mathcal{T}^e} W_i^{(3)} A_{t_i}$, is calculated and multiplied with the obtained meta-path conversion matrix, which fuses meta-path information of length 4. Therefore, a meaningful meta-path P composed of a series of composite relations $t_1, t_2, \ldots, t_l$ can be expressed by the product of the corresponding heterogeneous adjacency matrices $A_{t_1} A_{t_2} \cdots A_{t_l}$, and only l-1 MP-Trans Layers need to be stacked to fuse the information of meta-path P. From formula (8), the product of the weights of the corresponding heterogeneous adjacency matrices, namely $\prod_{j=1}^{l} W_{t_j}^{(j)}$, where $W_{t_j}^{(j)}$ is the weight of $A_{t_j}$ in the j-th transition matrix, is the contribution of meta-path P, i.e. the importance of this meta-path.

This explains why KGNN_HCD can express the importance of different meta-paths. The weight of a meta-path $(t_1, t_2, \ldots, t_l)$, i.e. the product $\prod_{j=1}^{l} W_{t_j}^{(j)}$ of the weights assigned to its relations in the successive transition matrices, is an attention score that provides semantic information and indicates the importance of the meta-path in a particular task. Fig. 7 and Fig. 8 summarize the attention scores of the heterogeneous adjacency matrix of each relation type and the learned attention scores of all meta-paths of length 3, taking the ACM and IMDB datasets as examples.

Fig. 7(a), Fig. 7(b), Fig. 8(a) and Fig. 8(b) show the attention scores of the adjacency matrices (edge types) in the first and second meta-path information conversion layers (MP-Trans Layers) on the ACM and IMDB datasets. On the ACM dataset, the attention score of the heterogeneous edge relation PA is the highest in the first layer, and the attention score of the heterogeneous edge relation AP is the highest in the second layer. The importance of the meta-path PAP must therefore be high, which can be verified in Fig. 7(c). On the IMDB dataset, the attention scores of the heterogeneous edge relations MA and MD in the first layer do not differ much, but in the second layer the attention score of DM is significantly higher than that of AM, so the importance of meta-path MDM is higher than that of meta-path MAM, which is confirmed by the subsequent heat maps. In all four figures the attention weight of the identity matrix is also relatively high, which means that, as discussed in section 4.2, KGNN_HCD can still adhere to shorter meta-paths in deeper layers.

Fig. 7(c) and Fig. 8(c) visualize the correlation between each heterogeneous edge relation in the first MP-Trans Layer and each heterogeneous edge relation in the second MP-Trans Layer on the ACM and IMDB datasets (the values are magnified 100 times for a clearer representation). It can be seen intuitively that the meta-path PAP is more important than the meta-path PSP. For example, in the first MP-Trans Layer the attention score of PA is 0.1691 and that of PS is 0.1689, and in the second MP-Trans Layer the attention score of AP is 0.1721 and that of SP is 0.1707. According to the formula in 5.6, the importance of meta-path PAP is then 0.1691 × 0.1721 ≈ 0.0291. Similarly, the importance of meta-path PSP is 0.0288. In the community discovery task the meta-path PAP is therefore more important than the meta-path PSP, which is also observed with Metapath2vec. Likewise, for the IMDB dataset, in the first MP-Trans Layer the attention score of MD is 0.1656 and that of MA is 0.1645, and in the second MP-Trans Layer the attention score of DM is 0.1591 and that of AM is 0.1549. The importance of meta-path MDM is then 0.1656 × 0.1591 ≈ 0.0263, and the importance of meta-path MAM is 0.0255, so the meta-path MDM is more important than MAM, which is also observed with Metapath2vec.
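The arithmetic behind these importance values can be reproduced with a few lines of Python. The sketch assumes, consistently with the reported values 0.0288 for PSP and 0.0255 for MAM, that the importance of a meta-path is the product of the attention scores of its relations in the successive layers; the function name `metapath_importance` is hypothetical.

```python
def metapath_importance(scores):
    """Multiply the attention scores of the relations along a meta-path."""
    importance = 1.0
    for score in scores:
        importance *= score
    return importance


# Attention scores quoted in the text: (first-layer relation, second-layer relation).
acm = {"PAP": (0.1691, 0.1721), "PSP": (0.1689, 0.1707)}
imdb = {"MDM": (0.1656, 0.1591), "MAM": (0.1645, 0.1549)}

ranked = {
    name: round(metapath_importance(scores), 4)
    for name, scores in {**acm, **imdb}.items()
}
```

The resulting values are 0.0291 for PAP, 0.0288 for PSP, 0.0263 for MDM and 0.0255 for MAM, which gives the orderings PAP > PSP and MDM > MAM stated above.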

2.7 Ablation experiments

To verify the effectiveness of the individual components of KGNN_HCD, further experiments were performed on different KGNN_HCD variants. The performance results on the three datasets are given in Fig. 9. Here KGNN_HCD_w/oKG removes the K-nearest-neighbor graph information fusion module in order to explore the importance of this module in KGNN_HCD. KGNN_HCD_w/oI omits the identity matrix. KGNN_HCD_w/oI&KG removes both the identity matrix and the K-nearest-neighbor graph information fusion module; comparing it with KGNN_HCD_w/oKG and KGNN_HCD_w/oI better highlights the contribution of the two parts.

As can be seen from Fig. 9, on all three datasets KGNN_HCD_w/oKG is always worse than KGNN_HCD by a significant gap, which illustrates the effectiveness and necessity of K-nearest-neighbor graph information fusion. Next, to verify the effect of the identity matrix, KGNN_HCD_w/oI was trained and evaluated. KGNN_HCD_w/oI has exactly the same architecture as KGNN_HCD, but its candidate adjacency matrices do not include the identity matrix. As the figure shows, KGNN_HCD_w/oI is always worse than KGNN_HCD. If both modules are removed and KGNN_HCD_w/oI&KG is trained directly, the result is poor, whereas adding back either of the two modules, as in KGNN_HCD_w/oI or KGNN_HCD_w/oKG, improves the final performance. This indirectly reflects the effectiveness of the identity matrix and of the K-nearest-neighbor graph information fusion module.

In summary, the present invention provides KGNN_HCD, a heterogeneous graph community discovery method fused with a K-nearest-neighbor graph neural network, to solve the existing problems of heterogeneous graph community discovery. The method effectively uses the structural information of the feature space to enhance node similarity, which improves the likelihood that unconnected nodes are assigned to the same community. It also avoids the use of predefined meta-paths and learns meta-paths end to end. To evaluate the rationality and effectiveness of the KGNN_HCD method, 11 different models such as CP-GNN and GTN were compared on the three real heterogeneous datasets ACM, DBLP and IMDB. The results show that KGNN_HCD outperforms the baselines and that the weight convolution distinguishes the importance of different meta-paths well. To explain the MP-Trans Layer in KGNN_HCD clearly, an interpretability analysis of the module was carried out. To study the influence of the hyperparameters of KGNN_HCD on the model, the K value of the K-nearest-neighbor graph, the embedding dimension, and the number of meta-path conversion layers were varied. Ablation experiments were performed on the three datasets to verify the validity of each module. Finally, the higher-order relations among nodes are captured through GCN, the learned node representations are clustered, and the community structure in the heterogeneous graph is discovered. The experimental results comprehensively demonstrate the rationality and effectiveness of the proposed KGNN_HCD method.

While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A heterogeneous graph community discovery method based on a K neighbor graph neural network is characterized by comprising the following steps:

S1, calculating node similarity according to the feature vectors of the nodes to form a similarity matrix; selecting, according to the similarity matrix, the K nodes most similar to the target node as its neighbors and connecting them by edges, thereby constructing a K-nearest-neighbor graph adjacency matrix $A_k$;

generating meta-path transition matrices through the meta-path information conversion layer, generating a meta-path conversion matrix by matrix multiplication, then adaptively learning meta-paths, and fusing the meta-path information through GCN to capture high-order relations among nodes and obtain node representations;

2. The heterogeneous graph community discovery method based on the K-nearest-neighbor graph neural network according to claim 1, wherein the node similarity measure comprises: cosine similarity, heat kernel, or dot product.

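3. The heterogeneous graph community discovery method based on the K-nearest-neighbor graph neural network according to claim 1, wherein the meta-path transition matrix is generated from the K-nearest-neighbor graph adjacency matrix with feature space topology information $\tilde{\mathbb{A}}$ by the following calculation formula:

$$T = \mathrm{conv}_{1\times 1}\big(\tilde{\mathbb{A}};\, W^{(l)}\big) = \sum_{t_i \in \mathcal{T}^e} W_i^{(l)} A_{t_i}, \qquad W^{(l)} = \mathrm{softmax}(W_{att})$$

wherein $W_{att}$ represents the weight convolution at the MP-Trans layer;

$\mathrm{conv}_{1\times 1}$ represents a convolution layer;

$W^{(l)} \in \mathbb{R}^{1\times 1\times |\mathcal{T}^e|}$ is the parameter of the 1×1 convolution, $|\mathcal{T}^e|$ represents the number of heterogeneous edge types, and $t_i$ represents a heterogeneous relation;

$A_{t_i}$ represents the heterogeneous adjacency matrix whose node-edge relation is $t_i$, and $\mathcal{T}^e$ represents the set of edge types;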

4. The heterogeneous graph community discovery method based on the K-nearest-neighbor graph neural network according to claim 1, wherein the calculation formula for generating the meta-path conversion matrix by matrix multiplication is as follows:

$$A^{(l)} = D^{-1} A^{(l-1)}\, \mathrm{conv}_{1\times 1}\big(\tilde{\mathbb{A}};\, \mathrm{softmax}(W_{att}^{(l)})\big)$$

wherein $D$ represents the degree matrix;

$A^{(l-1)}$ represents the meta-path conversion matrix of the previous layer;

$W_{att}^{(l)}$ represents the weight convolution of the current layer.

5. The heterogeneous graph community discovery method based on the K-nearest-neighbor graph neural network according to claim 1, wherein adaptively learning the meta-path comprises:

given a meta-path P formed by a series of composite relations $t_1, t_2, \ldots, t_l$, calculating the corresponding meta-path conversion matrix as

$$A_P = \prod_{j=1}^{l}\Big(\sum_{t_i \in \mathcal{T}^e} W_i^{(j)} A_{t_i}\Big)$$

wherein $A_{t_i}$ represents the heterogeneous adjacency matrix whose edge type is $t_i$;

$\mathcal{T}^e$ represents the set of edge types;

$W_i^{(j)}$ represents the weight of each meta-path transition matrix;

$A_P$ represents a weighted sum of the l heterogeneous adjacency matrices.

6. The heterogeneous graph community discovery method based on the K-nearest-neighbor graph neural network according to claim 1, wherein the node representation is obtained as follows:

$$Z = \Big\Vert_{i=1}^{C}\, \sigma\big(\tilde{D}_i^{-1} \tilde{A}_i^{(l)} X W\big)$$

wherein $\Vert$ is the concatenation operator;

C represents the number of channels;

$\sigma$ represents an activation function;

$\tilde{A}_i^{(l)}$ represents the meta-path conversion matrix of the i-th channel of the l-th layer;

$\tilde{D}_i$ is the degree matrix of $\tilde{A}_i^{(l)}$;

$W \in \mathbb{R}^{d\times dim}$ is a trainable matrix of the neural network model, dim is the output embedding dimension of the model, and $Z \in \mathbb{R}^{N\times dim}$ is the node embedding representation that is finally output.