CN115761275A - Unsupervised community discovery method and system based on graph neural network

Info

Publication number: CN115761275A
Application number: CN202211088226.2A
Authority: CN
Inventors: 维玉; 孙旭; 王政凯
Original assignee: Qilu University of Technology
Current assignee: Qilu University of Technology
Priority date: 2022-09-07
Filing date: 2022-09-07
Publication date: 2023-03-07

Abstract

The invention provides an unsupervised community discovery method and system based on a graph neural network, relating to the technical field of data mining. The method uses the graph structure and the node information simultaneously: an undirected graph of the community data to be discovered is acquired together with the adjacency matrix and the feature matrix of its nodes; the adjacency matrix and the feature matrix are input into a double-layer graph attention encoder to generate a graph embedding; a self-expression mechanism constrains the graph embedding so that it is better suited to the community discovery task; and a multilayer perceptron model classifies the nodes in the constrained graph embedding to generate the community structure of the graph, thereby realizing community discovery. The information required by the community discovery task is effectively combined, and community discovery performance is improved.

Description

Unsupervised community discovery method and system based on graph neural network

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to an unsupervised community discovery method and system based on a graph neural network.

Background

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

Graph structures are widely used in real life and can represent many complex systems, such as social networks, citation networks and online shopping networks. Community discovery divides the nodes of a graph into several subsets such that the nodes within each subset have similar features and connections while the association between different subsets is weak; these subsets are called communities. Community discovery is an important task in graph analysis and has practical significance: for example, by identifying communities in a social network, people a user may know can be recommended to that user, and by identifying communities on a shopping website, product advertisements can be pushed accurately to consumers.

Traditional community discovery methods, such as spectral clustering and graph partitioning, mainly discover communities from the network structure. Spectral clustering partitions the network based on a normalized Laplacian matrix and a regularized adjacency matrix; Siemon et al. discovered communities in a brain network based on spectral analysis of the normalized Laplacian operator, describing the network structure at the system level rather than at the level of single nodes or edges. Graph partitioning divides a graph into a specified number of communities. In recent years, community discovery has gradually shifted from traditional methods to deep-learning-based methods, and unsupervised community discovery models based on deep learning have also been proposed. For example, Xie et al. added a transfer learning method to an autoencoder framework to perform unsupervised community discovery, addressing the problems of missing data and feature imbalance in graphs; Ye et al. proposed a deep autoencoder-like non-negative matrix factorization (DANMF) model which, unlike traditional non-negative matrix factorization methods, is based on an autoencoder framework, adopts hierarchical mappings, and combines multiple factors in the mappings to generate communities.

The graph neural network extends the convolutional neural network to graph data; it performs very well on graph data and has attracted great interest. Graph neural networks aim to model, end to end and with deep learning methods, the transfer, transformation and aggregation of information between the nodes of a graph, and are used for downstream tasks such as node classification, link prediction, graph generation and network embedding. They are widely used in computer vision, traffic, natural language processing, recommendation systems and other fields. More recently, graph neural networks have increasingly been applied to community discovery. For example, Chen et al. built a graph neural network model with multi-scale graph operators, augmented with non-backtracking operators, for supervised community discovery; Jin et al. combined the GCN framework with Markov random fields for semi-supervised community discovery; Shchur et al. combined a Bernoulli-Poisson (BP) probabilistic model with a two-layer GCN, learning community membership vectors by minimizing the negative log-likelihood of the BP model and setting a threshold to identify and remove weak associations, in order to discover overlapping communities; Xu et al. learned distinguishable community representations in an integrated, self-trained adversarial learning framework that uses a graph attention autoencoder as the generator, with discriminators that ensure diversity in the distribution of community members.

Most community discovery algorithms based on graph neural networks are supervised or semi-supervised, and few methods combine graph neural networks with unsupervised learning. However, labels in the networks of large-scale systems are often scarce and expensive to obtain, so combining graph neural networks with unsupervised learning for community discovery is a problem that urgently needs to be solved. Wang et al. designed a label sampling model based on the GCN framework to locate community centers, and combined this model with GCN for unsupervised community discovery. Bandyopadhyay et al. trained a graph representation learning algorithm with a generic unsupervised loss and then found communities by applying the k-means++ algorithm; however, the embeddings generated by this algorithm are generic and do not provide targeted information for the downstream task. Perozzi et al. learned latent representations from local information obtained by truncated random walks, treating the walks as the equivalent of sentences, and then applied a clustering algorithm for community discovery; however, the clustering algorithm does little to improve the accuracy of community discovery.

Therefore, the technical problems to be solved urgently are as follows:

(1) How to combine graph neural networks with unsupervised learning for community discovery tasks.

(2) How to optimize the generated generic embedding into an embedding that carries information specific to the community discovery task.

(3) How to improve the accuracy of community allocation in the process of generating communities.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides an unsupervised community discovery method and system based on a graph neural network. The graph structure and the node information are used simultaneously: the adjacency matrix and the feature matrix of the graph are input into a graph attention network to generate the embedding of the graph; the generated embedding is constrained by the self-expression principle so that it is better suited to the community discovery task; and a multilayer perceptron classifies the node embeddings to generate the community structure of the graph. The information required by the community discovery task is effectively combined, and community discovery performance is improved.

In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:

A first aspect of the invention provides an unsupervised community discovery method based on a graph neural network.

The unsupervised community discovery method based on a graph neural network comprises the following steps:

acquiring an undirected graph of the community data to be discovered, together with the adjacency matrix and the feature matrix of its nodes;

inputting the adjacency matrix and the feature matrix into a double-layer graph attention encoder to generate a graph embedding;

constraining the graph embedding with a self-expression mechanism, so that the graph embedding is better suited to the community discovery task;

and classifying the nodes in the constrained graph embedding by using a multilayer perceptron model to generate the community structure of the graph, thereby realizing community discovery.

Furthermore, the double-layer graph attention encoder consists of a double-layer graph attention network (GAT) and combines the overall graph structure with the node information; an attention mechanism aggregates the neighbors of each node and adaptively assigns different weights to different neighbors, so that a hidden representation of the node is learned and expressive capacity is improved.

Further, generating graph embedding specifically comprises the following steps:

calculating the correlation degree of the two nodes;

normalizing the correlation degree to generate a weight;

carrying out weighted summation according to an attention mechanism to obtain a feature vector of a node;

and generating the embedding of the nodes by taking the feature vectors of the nodes as input.

Further, when calculating the correlation degree of two nodes, a new proximity matrix S is generated by means of the transition matrix B to acquire the t-order neighbor information of the nodes.

Further, the double-layer graph attention encoder trains by using an information maximization principle:

randomly corrupting the structure of the original undirected graph by using a destruction function to generate two corrupted views, wherein the destruction function randomly deletes some of the edges and randomly masks some of the node features in the graph, and the nodes in the two views are the same as the nodes of the original undirected graph;

after the two corrupted views are obtained, generating the embeddings of the two views separately with the double-layer graph attention encoder, which learns the embedding of the graph by maximizing the mutual information between the two corrupted views.

Furthermore, constraining the graph embedding with the self-expression mechanism comprises learning the self-expression coefficient matrix in batches and generating from it a pairwise similarity matrix, which constrains the graph embedding.

Further, the multilayer perceptron model comprises two hidden layers; it maps the node embeddings in the generated graph embedding to their corresponding community vectors and converts the community vectors into probability distributions.

A second aspect of the invention provides an unsupervised community discovery system based on a graph neural network.

An unsupervised community discovery system based on a graph neural network comprises a data acquisition module, a graph embedding module, a self-expression module and a node classification module;

a data acquisition module configured to: acquire an undirected graph of the community data to be discovered, together with the adjacency matrix and the feature matrix of its nodes;

a graph embedding module configured to: input the adjacency matrix and the feature matrix into a double-layer graph attention encoder to generate a graph embedding;

a self-expression module configured to: constrain the graph embedding with a self-expression mechanism, so that the graph embedding is better suited to the community discovery task;

a node classification module configured to: classify the nodes in the constrained graph embedding by using a multilayer perceptron model to generate the community structure of the graph, thereby realizing community discovery.

A third aspect of the present invention provides a computer readable storage medium, on which a program is stored, which program, when being executed by a processor, carries out the steps of a graph neural network-based unsupervised community discovery method according to the first aspect of the present invention.

A fourth aspect of the present invention provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor implements the steps in the unsupervised community discovery method based on graph neural network according to the first aspect of the present invention when executing the program.

The above one or more technical solutions have the following beneficial effects:

the invention provides an unsupervised community discovery method and system based on a graph neural network.

The invention adopts the self-expression principle to constrain the generated embedding so that it is better suited to the community discovery task; the generated generic embedding is optimized into an embedding that carries information specific to the community discovery task, which improves the accuracy of the method.

The invention uses a multilayer perceptron to classify the node embeddings and generate the community structure of the graph; the method effectively combines the information required by the community discovery task, improves community discovery performance, and improves the accuracy of community assignment in the process of generating communities.

Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; the exemplary embodiments of the invention and their description serve to explain the invention and do not limit it.

FIG. 1 is a flow chart of the method of the first embodiment.

FIG. 2 is a system configuration diagram of the second embodiment.

Detailed Description

The invention is further described with reference to the following figures and examples.

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

Embodiment one

This embodiment discloses an unsupervised community discovery method based on a graph neural network, as shown in FIG. 1, including:

S1, acquiring an undirected graph of the community data to be discovered, together with the adjacency matrix and the feature matrix of its nodes;

The node data set on which community discovery is to be performed is represented by an undirected graph together with the adjacency matrix and the feature matrix of its nodes. The undirected graph is $G = (V, E)$, where $V = \{v_1, v_2, \dots, v_N\}$ is the set of $N$ nodes in the graph and $E$ is the set of edges between the nodes; $X \in \mathbb{R}^{N \times F}$ is the feature matrix of the nodes, composed of the $F$-dimensional word vectors of the $N$ nodes in the data set; $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix of the graph; $K$ is the number of communities in the graph; and $C \in \mathbb{R}^{N \times K}$ is the output community matrix of the nodes, which finally maps each node to its corresponding community. Nodes that are closely connected or have similar features belong to the same community.
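As an illustration of the inputs described in step S1, the following minimal sketch builds the adjacency matrix $A$ and a feature matrix $X$ for a small undirected graph. The toy edge list, the feature values and the helper name `build_adjacency` are hypothetical and are not taken from the specification.

```python
def build_adjacency(num_nodes, edges):
    """Return the symmetric N x N adjacency matrix A of an undirected graph."""
    A = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        A[i][j] = 1.0
        A[j][i] = 1.0  # undirected graph: store the edge in both directions
    return A


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N, F, K = 6, 3, 2  # number of nodes, feature dimension, number of communities

A = build_adjacency(N, edges)

# Hypothetical F-dimensional node features (for example, word-count vectors).
X = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]
```

Plain nested lists are used here only to keep the sketch dependency-free; a practical implementation would more likely use a sparse matrix type.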

S2, inputting the adjacency matrix and the feature matrix into a double-layer graph attention encoder to generate graph embedding;

To generate the graph embedding, a double-layer graph attention encoder is provided. The encoder consists of a double-layer graph attention network (GAT) and combines the overall graph structure with the node information: an attention mechanism aggregates the neighbors of each node and adaptively assigns different weights to different neighbors, so that the learned hidden representation of a node has strong expressive capacity. The specific steps are as follows:

S2-1, calculating the correlation degree of the two nodes;

For node $v_i$ and each of its neighbor nodes $v_j$, a single fully-connected layer is used to compute the correlation of $v_j$ to $v_i$, i.e. the degree of correlation of the two nodes:

$$h_{ij} = \mathrm{LeakyReLU}\left(a^{T}\,[W X_i \,\|\, W X_j]\right) \tag{1}$$

where $X_i \in \mathbb{R}^{d^{(l-1)}}$ and $X_j \in \mathbb{R}^{d^{(l-1)}}$ are the feature vectors, of length $d^{(l-1)}$, of nodes $v_i$ and $v_j$ at layer $l-1$; the weight parameter $W \in \mathbb{R}^{d^{(l)} \times d^{(l-1)}}$ performs the feature transformation of a node; $a \in \mathbb{R}^{2d^{(l)}}$ is a weight parameter; $\|$ denotes concatenation; and LeakyReLU is the activation function.

Conventional GAT-based systems consider only the first-order neighbors of a node. In most graphs, however, a node is related to more than its first-order neighbors, so restricting attention to them loses information. This embodiment therefore generates a new proximity matrix $S$ by means of the transition matrix $B$ (formula 2) to capture the $t$-order neighbor information of each node.

In formula 2, $B_{ij}$ is the element in row $i$, column $j$ of the transition matrix $B$: $B_{ij} = 1/d_i$ when there is an edge between node $v_i$ and node $v_j$, and $B_{ij} = 0$ otherwise, where $d_i$ is the degree of node $v_i$. $S_{ij}$ represents the topological correlation between node $v_i$ and its neighbor $v_j$ of up to order $t$; $S_{ij} > 0$ indicates that $v_i$ and $v_j$ are neighbors. Different values of $t$ may be chosen for different data sets.
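The construction of the proximity matrix can be sketched as follows. The transition matrix follows the definition in the text ($B_{ij} = 1/d_i$ on edges, 0 elsewhere). The way the powers of $B$ are combined is an assumption: the sketch averages $B, B^2, \dots, B^t$, a form commonly used for $t$-order proximity, because the formula itself is not reproduced in this text. The function names are hypothetical.

```python
def transition_matrix(A):
    """B[i][j] = 1/d_i if there is an edge between i and j, else 0."""
    B = []
    for row in A:
        degree = sum(1 for a in row if a != 0)
        if degree == 0:
            B.append([0.0] * len(row))  # isolated node: no transitions
        else:
            B.append([1.0 / degree if a != 0 else 0.0 for a in row])
    return B


def matmul(P, Q):
    """Plain dense matrix product of two nested-list matrices."""
    inner, cols = len(Q), len(Q[0])
    return [[sum(P[i][x] * Q[x][j] for x in range(inner)) for j in range(cols)]
            for i in range(len(P))]


def proximity_matrix(A, t):
    """ASSUMED form: S = (B + B^2 + ... + B^t) / t."""
    if t < 1:
        raise ValueError("t must be at least 1")
    B = transition_matrix(A)
    n = len(A)
    power = B
    total = [row[:] for row in B]
    for _ in range(t - 1):
        power = matmul(power, B)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return [[value / t for value in row] for row in total]
```

On the path graph 0-1-2, for instance, nodes 0 and 2 are not adjacent, yet with $t = 2$ the entry $S_{02}$ becomes positive, so the attention step can treat them as neighbors.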

S2-2, carrying out normalization operation on the correlation degree to generate a weight;

For better weight assignment, the calculated degrees of correlation with all neighbors are normalized by softmax (formula 3).

In formula 3, $\alpha$ is the weight coefficient and exp is the exponential function with base $e$; the formula ensures that the weight coefficients of all neighbors of a node sum to 1.

Adding the topological weight $S_{ij}$ of formula 2 and combining formulas 1, 2 and 3 yields the complete weight coefficient formula.
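For orientation, a softmax normalization over the neighbors of a node that is consistent with the description of formula 3 can be written as below. The neighbor-set symbol $N_i$ is assumed notation. The second line is only one plausible form of the complete weight coefficient: it assumes that the topological weight $S_{ij}$ multiplies the attention score inside the activation, which this text does not confirm.

```latex
% Softmax over the neighbors of v_i (N_i is assumed notation for the neighbor set)
\alpha_{ij} = \frac{\exp(h_{ij})}{\sum_{k \in N_i} \exp(h_{ik})}

% ASSUMED combined form: S_{ij} scales the score before the activation
\alpha_{ij} = \frac{\exp\!\left(\mathrm{LeakyReLU}\!\left(S_{ij}\, a^{T}[W X_i \,\|\, W X_j]\right)\right)}
                   {\sum_{k \in N_i} \exp\!\left(\mathrm{LeakyReLU}\!\left(S_{ik}\, a^{T}[W X_i \,\|\, W X_k]\right)\right)}
```

In both forms the coefficients $\alpha_{ij}$ of a node sum to 1 over its neighbors, as the text requires.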

S2-3, carrying out weighted summation according to an attention mechanism to obtain the feature vector of the node;

After the weight coefficients are calculated, a weighted summation is carried out according to the attention mechanism to obtain the feature vector of node $v_i$ at layer $l$.

S2-4, generating the embedding of the nodes by taking the feature vectors of the nodes as input.

The embeddings of the nodes are generated by the double-layer graph attention network with $X_i$ as input. In the resulting node embedding, $W^{0}$ denotes the weight parameter of the first-layer feature transformation and $W^{1}$ denotes the weight parameter of the second-layer feature transformation; both the graph structure and the node attributes are used in generating the node embeddings.
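Steps S2-1 to S2-4 can be sketched as a single-head attention layer applied twice. This is a minimal illustration under several assumptions that are not taken from the specification: the score uses formula 1 without the topological weight, the LeakyReLU slope is 0.2, a LeakyReLU is applied between the two layers, and neighbors are the nodes with $S_{ij} > 0$. The names `gat_layer` and `encoder` are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    """LeakyReLU with an assumed negative slope."""
    return x if x > 0 else slope * x


def matvec(W, x):
    """Multiply the matrix W (list of rows) by the vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gat_layer(X, S, W, a):
    """One single-head attention layer.

    X: list of N feature vectors; S: N x N proximity matrix (S[i][j] > 0
    marks j as a neighbor of i); W: d_out x d_in weight matrix;
    a: attention vector of length 2 * d_out.
    """
    H = [matvec(W, x) for x in X]  # W X_i for every node
    out = []
    for i in range(len(X)):
        nbrs = [j for j in range(len(X)) if S[i][j] > 0]
        if not nbrs:
            out.append([0.0] * len(H[i]))
            continue
        # Correlation h_ij = LeakyReLU(a^T [W X_i || W X_j]) as in formula 1.
        scores = [leaky_relu(sum(c * v for c, v in zip(a, H[i] + H[j])))
                  for j in nbrs]
        # Softmax over the neighbors so that the weights sum to 1.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Weighted sum of the transformed neighbor features.
        out.append([sum(al * H[j][k] for al, j in zip(alphas, nbrs))
                    for k in range(len(H[i]))])
    return out


def encoder(X, S, W0, a0, W1, a1):
    """Two stacked attention layers; the LeakyReLU in between is an assumption."""
    hidden = [[leaky_relu(v) for v in row] for row in gat_layer(X, S, W0, a0)]
    return gat_layer(hidden, S, W1, a1)
```

With an all-zero attention vector every neighbor receives the same weight, so the layer reduces to a plain average of the transformed neighbor features; a trained attention vector shifts that average toward the more relevant neighbors.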

To better exploit graph structure and node content to generate the embedding, the information maximization principle is adopted to train the two-layer graph attention encoder:

(1) A destruction function randomly corrupts the structure of the original undirected graph $G$ to generate two corrupted views $G_1$ and $G_2$. The destruction function randomly deletes some of the edges in the graph and randomly masks some of the node features; the nodes in the two views $G_1$ and $G_2$ are the same as the nodes of the original undirected graph $G$;

(2) After the two corrupted views are obtained, the double-layer graph attention encoder generates the embeddings $Z_1$ and $Z_2$ of the two views $G_1$ and $G_2$ respectively, and the graph embedding is learned by maximizing the mutual information between the two corrupted views.
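A destruction function of the kind described in (1) might look like the sketch below. The drop and mask probabilities, the choice to mask whole feature dimensions for all nodes at once, and the name `corrupt` are assumptions for illustration; the specification only states that some edges are deleted and some node features are masked at random.

```python
import random


def corrupt(edges, X, edge_drop=0.2, feat_mask=0.2, rng=None):
    """Return one corrupted view (kept_edges, masked_features) of a graph.

    edges: list of (i, j) pairs; X: list of N feature vectors.
    Each edge is deleted with probability edge_drop, and each feature
    dimension is zeroed for every node with probability feat_mask
    (an assumed masking scheme). The node set itself is never changed.
    """
    rng = rng or random.Random()
    kept_edges = [e for e in edges if rng.random() >= edge_drop]
    num_features = len(X[0])
    masked_dims = {f for f in range(num_features) if rng.random() < feat_mask}
    X_masked = [[0.0 if f in masked_dims else value
                 for f, value in enumerate(row)] for row in X]
    return kept_edges, X_masked


# Two independently corrupted views of the same hypothetical graph.
rng = random.Random(0)
toy_edges = [(0, 1), (1, 2), (2, 3)]
toy_X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
view_1 = corrupt(toy_edges, toy_X, rng=rng)
view_2 = corrupt(toy_edges, toy_X, rng=rng)
```

Both views keep every node of the original graph, which is what allows node $i$ in one view to be paired with node $i$ in the other.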

Node $i$ in the two views is denoted $G_{1i}$ and $G_{2i}$ respectively, with corresponding embeddings $Z_{1i}$ and $Z_{2i}$. The pair $(Z_{1i}, Z_{2i})$ is regarded as a positive sample, and the embeddings of the nodes other than node $i$ in both views are regarded as negative samples; the set of negative-sample nodes is denoted $V_{-i} = \{ j \in V \mid j \neq i \}$. A noise-contrastive objective function is used,

where $\theta(\cdot) = \cos(\cdot)$ denotes the cosine similarity between two embeddings.

The objective also contains a temperature coefficient, and $\delta$ denotes the parameters of the double-layer graph attention encoder. All nodes in the two views other than the positive sample pair are regarded as negative samples; the negative samples come from both views and correspond to the second and third terms of the denominator respectively. In essence, the objective maximizes the consistency of the embeddings of node $i$ in the two views; $G_1$ and $G_2$ serve only to compute the training loss of the objective function.
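A noise-contrastive loss with the structure described above (one positive pair, plus negatives drawn from both views in the denominator) can be sketched as follows. The temperature value, the negative-log form, the symmetric averaging over the two views and the function names are assumptions, since the objective formula is not reproduced in this text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def node_loss(i, Z_a, Z_b, tau=0.5):
    """Contrastive loss of node i, anchored in view a (tau is assumed)."""
    n = len(Z_a)
    # Positive pair: the same node in the two views.
    pos = math.exp(cosine(Z_a[i], Z_b[i]) / tau)
    # Negatives from the other view.
    inter = sum(math.exp(cosine(Z_a[i], Z_b[j]) / tau) for j in range(n) if j != i)
    # Negatives from the anchor's own view.
    intra = sum(math.exp(cosine(Z_a[i], Z_a[j]) / tau) for j in range(n) if j != i)
    return -math.log(pos / (pos + inter + intra))


def contrastive_loss(Z1, Z2, tau=0.5):
    """Average of the per-node losses, anchored once in each view."""
    n = len(Z1)
    total = sum(node_loss(i, Z1, Z2, tau) + node_loss(i, Z2, Z1, tau)
                for i in range(n))
    return total / (2 * n)
```

Minimizing this loss pulls the two embeddings of the same node together and pushes them away from the embeddings of all other nodes, which is the consistency property the text describes.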

S3, adopting a self-expression mechanism to constrain the graph embedding, so that the graph embedding is more suitable for community discovery tasks;

The double-layer graph attention encoder generates the node embeddings; however, embeddings generated in this way are generic and lack information specific to the community discovery task. To address this problem, a self-expression mechanism is introduced, under which a data point drawn from multiple linear subspaces is represented by a linear combination of the remaining points in the same subspace.

Based on the principle that one node can be used to reconstruct another node, a pairwise similarity matrix $M \in \mathbb{R}^{N \times N}$ is learned. Because learning this matrix over all nodes at once is costly, a batch learning method is introduced: the self-expression coefficient matrix is learned in batches, the pairwise similarity matrix is generated from it, and the graph embedding is thereby constrained.

The node embeddings $Z_i$ obtained in step S2 are concatenated to obtain a matrix $Z$ formed by the embeddings of the $N$ nodes. The self-expression mechanism can be expressed mathematically as a simple equation, $Z = ZE$, where $E \in \mathbb{R}^{N \times N}$ is the self-expression coefficient matrix. Each node $i \in V$ is represented as a linear combination of the other nodes of the subset to which it belongs, with coefficients $e_{ij}$,

where $e_{ij}$ is the element in row $i$, column $j$ of the matrix $E$. To avoid the trivial solution in which $E$ is the identity matrix, $e_{ii} = 0$ is set.

The self-expression coefficient matrix $E$ is learned, and the pairwise similarity matrix $M$ is finally generated from it. If the subspaces are independent, minimizing certain norms of $E$ guarantees that $E$ has a block-diagonal structure (up to a permutation), that is, nonzero entries $e_{ij} \neq 0$ occur when point $v_i$ and point $v_j$ belong to the same subspace. In this case, each block in $E$ corresponds to a subset of nodes. This can be formulated mathematically as an optimization problem, shown in formula 8.

In formula 8, $\|E\|_p$ denotes a matrix norm of $E$, and the objective minimizes that norm; s.t. means "subject to"; $Z$ is the embedding of all nodes; $E$ is the self-expression coefficient matrix; $\mathrm{diag}(E)$ denotes the elements on the diagonal of $E$; and $\mathrm{diag}(E) = 0$ means $e_{ii} = 0$. Candidate norms for the matrix $E$ include the nuclear norm used in Low-Rank Representation (LRR), the $\ell_1$ norm used in Sparse Subspace Clustering (SSC), and the Frobenius norm used in Least Squares Regression (LSR). Here, the squared Frobenius matrix norm is chosen. However, this does not allow the hard constraint to be met with an exact reconstruction; to solve this problem, the hard constraint $Z = ZE$ is relaxed into a soft constraint given by the squared Frobenius norm of $Z - ZE$, which introduces the new objective loss function of formula 9,

where $\chi$ is a weight parameter.
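The relaxed objective can be illustrated with the sketch below, which adds the squared Frobenius norm of $E$ to a weighted squared reconstruction error. Two points are assumptions: the exact scaling of the two terms (the sketch uses $\|E\|_F^2 + \chi\,\|Z - \hat{Z}\|_F^2$), and the storage convention, in which each row of `Z` is one node embedding and node $i$ is reconstructed as $\sum_{j \neq i} e_{ij} Z_j$. The function name is hypothetical.

```python
def self_expression_loss(Z, E, chi=1.0):
    """ASSUMED form: ||E||_F^2 + chi * ||Z - Z_hat||_F^2 with e_ii fixed to 0.

    Z: list of N node embeddings (one row per node).
    E: N x N self-expression coefficients; the diagonal is ignored,
       which enforces the constraint e_ii = 0.
    """
    n, d = len(Z), len(Z[0])
    # Regularizer: squared Frobenius norm of E without its diagonal.
    reg = sum(E[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
    # Reconstruction error: each node is rebuilt from the other nodes.
    recon_err = 0.0
    for i in range(n):
        z_hat = [sum(E[i][j] * Z[j][k] for j in range(n) if j != i)
                 for k in range(d)]
        recon_err += sum((Z[i][k] - z_hat[k]) ** 2 for k in range(d))
    return reg + chi * recon_err
```

For embeddings that fall into two groups, a block-structured $E$ that reconstructs each node from a member of its own group gives zero reconstruction error, so for a sufficiently large $\chi$ it scores better than $E = 0$.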

The pairwise similarity matrix $M$ is typically constructed as $|E| + |E^{T}|$. Recently, many heuristic algorithms have been proposed to enhance the block structure, which can improve the clustering performance of $M$. A heuristic is therefore used to construct the pairwise similarity matrix. First, the rank-$r$ singular value decomposition of the matrix $E$ is computed, i.e. $E = U \Sigma V^{T}$, with $r = kd + 1$, where $d$ is the largest intrinsic dimension of the subspaces and $k$ is the number of communities. After the singular value decomposition is obtained, a matrix $P$ is formed from it and each row of $P$ is normalized to unit norm. Finally, a matrix $P'$ is obtained by setting the negative values in $P$ to 0, and the similarity matrix is set to $M = (P' + P'^{T}) / \|P\|_{\infty}$, so that $M_{ij} \in [0, 1]$.

Computing the full $N \times N$ pairwise similarity matrix over all nodes is expensive and consumes a great deal of time and memory; batch learning is used to solve this problem.

A batch of $Q$ nodes is drawn at random from the set of $N$ nodes, where $Q \le N$. In formula 9, the training loss is computed for each batch, with complexity $O(Q^{3})$, much smaller than the complexity $O(N^{3})$. However, this does not yield the complete similarity matrix of the graph: the similarity $s_{ij}$ is calculated only when node $i$ and node $j$ belong to the same batch. The set of node pairs whose similarity is calculated in batch learning is recorded; when $Q < N$ this set does not contain all node pairs, so the present embodiment calculates similarities by batch learning without using the similarities of all node pairs.
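The batch sampling and the resulting set of node pairs can be sketched as follows. The function names are hypothetical, and the choice to record ordered pairs with $i \neq j$ is an assumption made for illustration.

```python
import random


def sample_batch(num_nodes, batch_size, rng=None):
    """Draw batch_size distinct node indices out of num_nodes at random."""
    if batch_size > num_nodes:
        raise ValueError("batch size Q must not exceed the number of nodes N")
    rng = rng or random.Random()
    return sorted(rng.sample(range(num_nodes), batch_size))


def batch_pairs(batch):
    """All ordered pairs (i, j), i != j, of nodes that share the batch."""
    return {(i, j) for i in batch for j in batch if i != j}


# Hypothetical sizes: N = 10 nodes, batches of Q = 4.
rng = random.Random(0)
batch = sample_batch(10, 4, rng)
pairs = batch_pairs(batch)
```

Only the pairs collected this way have a similarity value; with $Q = 4$ and $N = 10$ a single batch covers 12 of the 90 ordered node pairs.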

S4, classifying the nodes in the constrained graph embedding by using a multilayer perceptron model to generate the community structure of the graph, thereby realizing community discovery.

Community discovery is carried out by a multilayer perceptron model, a neural network with trainable parameters. The model comprises two hidden layers; it maps the node embeddings in the generated graph embedding to their corresponding community vectors and converts the community vectors into probability distributions. The specific formula is as follows:

$$C_i = \mathrm{softmax}\big(\mathrm{LeakyReLU}\big(\mathrm{LeakyReLU}(W_2 Z_i + b_2)\, W_3 + b_3\big)\, W_4 + b_4\big) \tag{10}$$

where $W_2$, $W_3$ and $b_2$, $b_3$ are the weight parameters and bias parameters of the two hidden layers, and $W_4$, $b_4$ are the weight parameter and bias parameter of the output layer; all of these are trainable parameters. LeakyReLU is the activation function of the hidden layers, and softmax is the activation function of the output layer, which converts the community vector into a probability distribution. $C_i \in \mathbb{R}^{K}$, and the $k$-th element of $C_i$, $C_{ik}$, represents the probability that the $i$-th node belongs to the $k$-th community. Formula 10 ensures that nodes with similar embeddings are mapped to the same community vector element. However, the previously generated embeddings are generic: they lack information relevant to the community discovery task and may not be the best choice for generating communities. Therefore, the pairwise similarity information of the nodes learned in step S3 is used to train the parameters in formula 10, which generates the communities of the nodes and optimizes the embedding.
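The forward pass of formula 10 can be sketched as below. For simplicity the sketch applies every layer as $W x + b$ on a column vector, whereas formula 10 writes the second and third layers with the weight on the right; the LeakyReLU slope of 0.01, the parameter values in the usage lines and the function names are assumptions.

```python
import math


def leaky_relu_vec(v, slope=0.01):
    """Element-wise LeakyReLU with an assumed negative slope."""
    return [x if x > 0 else slope * x for x in v]


def affine(W, b, x):
    """Compute W x + b for a weight matrix W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(v):
    """Numerically stable softmax."""
    top = max(v)
    exps = [math.exp(x - top) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def community_probs(z, params):
    """Map one node embedding z to a distribution over K communities."""
    (W2, b2), (W3, b3), (W4, b4) = params
    h1 = leaky_relu_vec(affine(W2, b2, z))   # first hidden layer
    h2 = leaky_relu_vec(affine(W3, b3, h1))  # second hidden layer
    return softmax(affine(W4, b4, h2))       # output layer, K probabilities


# Hypothetical parameters: 2-dimensional embedding, K = 3 communities.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = [
    (identity, [0.0, 0.0]),
    (identity, [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
]
C_i = community_probs([1.0, -1.0], params)
```

The community assigned to node $i$ is then the index of the largest entry of `C_i`.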

Let the community membership matrix be $C = [C_1, \dots, C_N]^{T}$, $C \in \mathbb{R}^{N \times K}$, composed of the $K$-dimensional vectors of the $N$ nodes, and let $\phi$ denote the parameters of the community discovery module. With the complete node similarity matrix $M$, an objective function over all node pairs would be minimized.

However, that objective function requires the complete node similarity matrix and has high complexity; moreover, because of noise in the data set, some pairwise similarities do not reflect the true similarity between nodes and are unsuitable for community discovery. Therefore, only the data of the node-pair set used in the batch learning of step S3 are used. Furthermore, two thresholds $\theta_{low}$ and $\theta_{high}$ are introduced to remove the pairwise similarities distorted by noise in the data set. For example, a pairwise similarity close to 1 indicates strong similarity between the nodes and a pairwise similarity close to 0 indicates weak similarity, whereas a pairwise similarity of 0.5 indicates neither that the node pair is similar nor that it is dissimilar; such a value carries little information and would disturb the parameters of the multilayer perceptron model, so it is removed to improve the accuracy of the model. Let $0 < \theta_{low} \le \theta_{high} < 1$ and $\theta_{high} = 1 - \theta_{low}$. After the uninformative similarities are removed, the remaining set of node pairs consists of the pairs whose similarity is at most $\theta_{low}$ or at least $\theta_{high}$:

node pairs whose similarity is greater than or equal to $\theta_{high}$ are assigned to the same cluster, and node pairs whose similarity is less than or equal to $\theta_{low}$ are assigned to different clusters. Based on this unsupervised approach, a set of connected and unconnected soft constraints is derived to guide the generation of communities, and the optimization of formula 13 is designed.

Formula 13 discards the node pairs that carry little information and uses only the connected or unconnected node pairs. To make the probability distributions of each node $i$ over the $K$ different communities more similar and to avoid forming fragmented communities, formula 13 is further optimized into formula 14.

The second term in formula 14 ensures that the distribution of communities is close to orthogonal and that the community sizes are balanced. The total training loss for node embedding and community discovery is formulated as a weighted sum of the two losses (formula 15),

where $\gamma$ is a weight coefficient. The community discovery method of this embodiment is packaged into a standalone community discovery model, USCom, which adopts an iterative procedure: the self-expression layer is first computed for each batch, and the parameters of the community-generating neural network are then updated by minimizing formula 15.
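The threshold filtering of node pairs described above can be sketched as follows. The function name and the toy similarity values are hypothetical; the rule itself (keep pairs at or above $\theta_{high} = 1 - \theta_{low}$ as connected, pairs at or below $\theta_{low}$ as unconnected, and discard the rest) follows the text.

```python
def filter_pairs(M, pairs, theta_low):
    """Split the batch pairs into connected and unconnected constraint sets.

    M: pairwise similarity matrix with entries in [0, 1].
    pairs: iterable of (i, j) node pairs that have a computed similarity.
    theta_low: lower threshold; theta_high is derived as 1 - theta_low.
    """
    if not 0.0 < theta_low <= 0.5:
        raise ValueError("need 0 < theta_low <= theta_high < 1")
    theta_high = 1.0 - theta_low
    connected, unconnected = set(), set()
    for i, j in pairs:
        if M[i][j] >= theta_high:
            connected.add((i, j))    # should fall in the same community
        elif M[i][j] <= theta_low:
            unconnected.add((i, j))  # should fall in different communities
        # Pairs strictly between the thresholds carry little information
        # and are dropped.
    return connected, unconnected


# Hypothetical similarities for three nodes.
M_toy = [[1.0, 0.9, 0.5],
         [0.9, 1.0, 0.1],
         [0.5, 0.1, 1.0]]
connected, unconnected = filter_pairs(M_toy, {(0, 1), (0, 2), (1, 2)}, 0.2)
```

Here the pair (0, 2), with similarity 0.5, is discarded, while (0, 1) becomes a connected constraint and (1, 2) an unconnected one.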

Evaluation of experiments

The effect of the unsupervised community discovery method based on a graph neural network provided by this embodiment is evaluated on four data sets; performing community discovery on these data sets demonstrates the effectiveness of the proposed method. The statistics of the data sets are shown in Table 1; the four data sets are among the most widely used classical data sets for graph neural networks.

Table 1: Statistics of the data sets

To demonstrate the effectiveness of the method, the model USCom was compared to nine baseline models.

K-means: K-means is the basis of many clustering algorithms; it clusters nodes using their features.

DeepWalk: DeepWalk is a representation learning model based on random walks that focuses on the graph structure.

VGAE: like GAE, VGAE encodes first and then decodes; unlike GAE, the VGAE encoder learns the normal distribution of the low-dimensional vector representation, and that distribution is sampled to obtain the final vector representation.

MGAE: MGAE is based on an autoencoder and learns the representation of the graph in an unsupervised manner by combining graph structure information with the content information of the nodes; the learned feature representation is then input to a spectral clustering algorithm to obtain the clustering of the graph.

ARGE: ARGE is a graph embedding framework with an added adversarial mechanism. The framework encodes the overall structure of the graph and the content of the nodes to obtain a representation of the graph, which is input into a decoder to reconstruct the graph structure. The adversarial mechanism regularizes the latent data distribution to match a prior distribution.

ARVGE: similar to ARGE, ARVGE adds an adversarial mechanism to a variational graph autoencoder, learning the adversarial module jointly with the variational graph autoencoder.

AGC: AGC uses high-order graph convolution to obtain the graph embedding, adaptively selects different orders for different graphs, and finally uses spectral clustering to discover communities.

GUCD: GUCD uses an autoencoder framework to obtain the community assignment of the nodes directly.

SENet: SENet improves the graph structure by using shared-neighbor information, uses a three-layer embedding network to generate node representations under the constraint of a spectral clustering loss, and finally obtains the community assignment of the nodes with the k-means algorithm.

Three common metrics are used: accuracy (Acc), normalized mutual information (NMI), and macro F1-score (F1). Higher values indicate better community discovery results.
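Because community labels produced without supervision are arbitrary, Acc and F1 are normally computed after matching predicted communities to ground-truth classes. The following is a minimal sketch of the three metrics under stated assumptions: the matching is found by brute force over label permutations (practical only for a small number of communities; larger cases are usually solved with the Hungarian algorithm), and NMI is normalized by the arithmetic mean of the two entropies. The embodiment does not state which matching or normalization was used, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def best_label_mapping(y_true, y_pred):
    """Map each predicted community to a distinct true class so that
    the number of matching nodes is maximal (brute force)."""
    true_labels = sorted(set(y_true))
    pred_labels = sorted(set(y_pred))
    if len(pred_labels) > len(true_labels):
        raise ValueError("sketch assumes no more communities than classes")
    overlap = Counter(zip(y_pred, y_true))
    best_map, best_hits = None, -1
    for perm in permutations(true_labels, len(pred_labels)):
        hits = sum(overlap[(p, t)] for p, t in zip(pred_labels, perm))
        if hits > best_hits:
            best_map, best_hits = dict(zip(pred_labels, perm)), hits
    return best_map

def accuracy(y_true, y_pred):
    mapping = best_label_mapping(y_true, y_pred)
    return sum(mapping[p] == t for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    mapping = best_label_mapping(y_true, y_pred)
    mapped = [mapping[p] for p in y_pred]
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, mapped))
        fp = sum(t != c and p == c for t, p in zip(y_true, mapped))
        fn = sum(t == c and p != c for t, p in zip(y_true, mapped))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def nmi(y_true, y_pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(y_true)
    ct, cp = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mi = sum(c / n * math.log(c * n / (ct[t] * cp[p]))
             for (t, p), c in joint.items())
    h_true = -sum(c / n * math.log(c / n) for c in ct.values())
    h_pred = -sum(c / n * math.log(c / n) for c in cp.values())
    if h_true + h_pred == 0:
        return 1.0  # both partitions consist of a single group
    return 2 * mi / (h_true + h_pred)
```

For example, a prediction that recovers the true partition under different label names scores 1.0 on all three metrics, since NMI is invariant to relabeling and the matching step removes the label mismatch for Acc and F1.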

The results of the experiment are shown in Table 2:

Table 2: Results of USCom and the baselines in the community discovery task

Table 2 shows the performance of USCom and the nine baseline models in the community discovery task on the four data sets, with bold numbers marking the best results. On each of the four data sets, USCom was run ten times and the results were averaged. USCom achieves the best results on every metric except NMI on the Cora data set. On Cora, USCom's F1-score is 8.75% higher and its Acc 2.63% higher than those of the second-best baseline. On Pubmed, USCom's Acc is 5.37% higher and its NMI 3.63% higher than those of the second-best model, SENet; this is because this embodiment introduces an attention mechanism that adaptively assigns different weights to the neighbors of a node and thus better integrates the graph structure with the node feature information. On Wiki, USCom's Acc, NMI and F1-score are 5.15%, 6.23% and 4.24% higher, respectively, than those of the second-best model, AGC, which shows that adding the self-expression mechanism effectively combines the information required by the community discovery task and improves the performance of the model. The results also show that models using both the graph structure and the node features generally outperform K-means, which uses only node features, and DeepWalk, which uses only the graph structure; this further supports the advantage of the model of this embodiment, which uses both kinds of information.

Example two

This embodiment discloses an unsupervised community discovery system based on a graph neural network;

as shown in fig. 2, the unsupervised community discovery system based on a graph neural network includes a data acquisition module, a graph embedding module, a self-expression module and a node classification module;

a data acquisition module configured to: acquire an undirected graph of the community data to be discovered, together with the adjacency matrix and the feature matrix of its nodes;

Example three

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in a graph neural network-based unsupervised community discovery method according to embodiment 1 of the present disclosure.

Example four

An object of the present embodiment is to provide an electronic apparatus.

An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the unsupervised community discovery method based on the graph neural network according to embodiment 1 of the present disclosure.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An unsupervised community discovery method based on a graph neural network is characterized by comprising the following steps:

acquiring an undirected graph of the community data to be discovered, together with an adjacency matrix and a feature matrix of the nodes;

constraining the graph embedding by adopting a self-expression mechanism, so that the graph embedding is more suitable for community discovery tasks;

2. The unsupervised community discovery method based on graph neural network as claimed in claim 1, wherein said two-layer graph attention encoder is composed of two-layer graph attention network GAT, and combines whole graph structure and node information; and aggregation operation is carried out on the node neighbors by using an attention mechanism, different weights are adaptively distributed to different neighbors, hidden representation of the node is learned, and expression capacity is improved.

3. The unsupervised community discovery method based on graph neural network as claimed in claim 2, wherein the graph embedding is generated by the specific steps of:

calculating the correlation degree of the two nodes;

normalizing the correlation degree to generate a weight;

and generating the embedding of the node by taking the feature vector of the node as input.

4. The unsupervised community discovery method based on graph neural network as claimed in claim 3, wherein when calculating the correlation degree of two nodes, a new proximity matrix S is generated by means of the transfer matrix B to obtain t-order neighbor information of the nodes.

5. The unsupervised community discovery method based on graph neural network as claimed in claim 2, wherein said two-layer graph attention encoder is trained using information maximization principle:

randomly destroying the structure of the original undirected graph by using a destruction function to generate two damaged views, wherein the destruction function randomly deletes part of edges in the graph and randomly covers part of node characteristics, and nodes in the two views are consistent with the nodes of the original undirected graph;

6. The unsupervised community discovery method based on graph neural network as claimed in claim 1, wherein said self-expression mechanism constrains the graph embedding by learning a self-expression coefficient matrix in batches to generate a pairwise similarity matrix, which is used to constrain the graph embedding.

7. The unsupervised community discovery method based on graph neural network as claimed in claim 1, wherein said multilayer perceptron model comprises two hidden layers, and maps the node embedding in the previously generated graph embedding to its corresponding community vector, and converts the community vector into probability distribution.

8. An unsupervised community discovery system based on a graph neural network is characterized by comprising a data acquisition module, a graph embedding module, a self-expression module and a node classification module;

a self-expression module configured to: the self-expression mechanism is adopted to constrain the graph embedding, so that the graph embedding is more suitable for community discovery tasks;

9. A computer-readable storage medium, on which a program is stored which, when executed by a processor, carries out the steps of the graph neural network-based unsupervised community discovery method according to any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor and a program stored in the memory and executable on the processor, wherein the processor, when executing the program, performs the steps of the graph neural network-based unsupervised community discovery method according to any one of claims 1 to 7.