CN116304367B

CN116304367B - Algorithm and device for obtaining communities based on graph self-encoder self-supervision training

Info

Publication number: CN116304367B
Application number: CN202310163573.5A
Authority: CN
Inventors: 王静红; 王慧; 王威; 于富强
Original assignee: Hebei Normal University
Current assignee: Hebei Normal University
Priority date: 2023-02-24
Filing date: 2023-02-24
Publication date: 2023-12-01
Anticipated expiration: 2043-02-24
Also published as: CN116304367A

Abstract

The invention discloses an algorithm and a device for obtaining communities based on self-supervision training of a graph self-encoder, and relates to the technical field of neural learning methods; the algorithm comprises the steps of combining self-supervision training and self-expression principles to obtain a similarity matrix of paired nodes of a network diagram and guiding generation of a node embedding matrix, wherein the device comprises a community obtaining module, a community obtaining module and a community clustering module, wherein the community obtaining module is used for obtaining the network diagram, obtaining a first damage diagram and a second damage diagram through damage function processing, obtaining the node embedding matrix through pre-training processing, comparing and learning the first damage diagram and the second damage diagram through a noise comparison function based on a standardized mutual information maximization principle, training based on a self-supervision training model until a loss function is minimized and obtaining the node embedding matrix, obtaining the similarity matrix through the self-expression principles and regularization processing and guiding generation of the node embedding matrix, and further obtaining the community clustering matrix; the method combines self-supervision training and self-expression principle, and uses a similarity matrix to guide generation of node embedding matrix and the like, so that efficient and accurate community discovery is realized.

Description

Algorithm and device for obtaining communities based on graph self-encoder self-supervision training

Technical Field

The invention relates to the technical field of neural learning methods, in particular to an algorithm and a device for obtaining communities based on self-supervision training of a graph self-encoder.

Background

The writer searches for TACD_ALL (paper AND community AND neural network AND node AND supervision AND matrix), AND a closer prior art scheme is obtained as follows.

Application publication number is CN114741519A, named a paper correlation analysis method based on graph convolution neural network and knowledge base. The method comprises the steps of extracting key information in a paper set, constructing a paper set knowledge base, combining a graph convolutional neural network, providing an improved acceptance-GCN model to complete paper category division, using a NOCO model to complete paper community discovery, and further completing correlation analysis of papers in the paper set knowledge base. A new graph node classification model is proposed: an acceptance-GCN model. The method of acceptance originally used for the CNN model is combined with the GCN model, so that the new model can effectively solve the problem of fitting and smoothing while enhancing the feature learning capability. Experiments show that the model is used for classifying paper nodes, and better effect can be achieved compared with the prior art.

The authorization bulletin number is CN114357312B, and the name is community discovery method and personality recommendation method based on automatic modeling of the graphic neural network. Acquiring a graphic neural network structure component and constructing a graphic neural network search space; sampling the graph neural network search space to obtain a graph neural network structured initial population; calculating the fitness of each graph neural network model and selecting a plurality of graph neural network structure groups as father generation; searching the child graph neural network structure, calculating the fitness of each child graph neural network structure and updating the parent graph neural network structure group; selecting the optimal graph neural network structure modeling in the parent graph neural network structure group and obtaining a coefficient matrix of graph data; and decomposing the coefficient matrix of the graph data to obtain a similarity matrix of the graph data, and clustering to realize community discovery. A personality recommendation method including the community discovery method based on the graph neural network automatic modeling is also disclosed. The method has high reliability and high accuracy, and is more scientific and reasonable.

Application publication number CN111950594a, an unsupervised graph on a large-scale attribute map named sub-sampling-based, represents a learning method and apparatus. The method comprises the following steps: sub-sampling the attribute graph according to the structure information and the node attribute information of the attribute graph to generate a plurality of sub-graphs; and learning the graph self-encoder on each subgraph by utilizing the structure information, the node attribute information and the community information of the attribute graph to obtain the low-dimensional vector representation of the nodes in the attribute graph. The self-encoder comprises an encoder and a decoder; the encoder adopts a graph convolution neural network; the decoder comprises a graph structure loss reconstruction decoder, a graph content loss reconstruction decoder and a graph community loss reconstruction decoder. The method is used for supporting a user to learn low-dimensional vector representations of nodes in a large-scale attribute graph in an unsupervised mode, the vector representations can keep topological structure information and node attribute information on the graph as far as possible, and the vectors are applied to different downstream tasks as inputs to perform data mining tasks on the graph.

In combination with the above three patent documents and prior art schemes, the inventors analyzed the prior art schemes as follows.

(1) Prior art solution

The graph or network is ubiquitous in our daily lives. Network representation learning is the task of mapping different components of the network, such as nodes, edges, or the entire graph, to vector space in order to facilitate network downstream tasks. In the real world we have formed many complex networks based on the closeness of community structure, such as social networks, citation networks, transportation networks and protein interaction networks. There are complex interactions between nodes in a network, and interactions and node properties will cause the network to form different communities. From a topological perspective, the connections of nodes inside the community are relatively tight, while the connections of nodes outside are relatively sparse. Community discovery is one of the most important tasks in network analysis, and node modularization mining of potential community structures is quite important for understanding complex systems and knowledge discovery, and has been widely used in social, biological, computer engineering and other fields.

Most graph neural networks Graph neural network, GNNs, are represented in the form of messaging networks, where each node aggregates messages from neighboring nodes with its own message to update its vector representation. However, community discovery algorithms based on graph neural networks have not been explored much in comparison to other tasks such as node classification and link prediction.

In recent years, deep learning and convolutional neural network CNN has made remarkable breakthrough in various fields such as machine translation and reading understanding, for example, natural language processing NLP, object detection and image classification in computer vision CV. The graph neural network-based representation learning method has a supervised learning method and an unsupervised learning method. The supervised learning method requires that the data contain marked data, but in reality most are unlabeled data, and marked data can be costly. The unsupervised learning does not need a large number of marked nodes to train, and meanwhile, learning characterization is performed and local characteristics of the sample are reserved to output discrimination characteristics. The community discovery task is essentially an unsupervised learning task, but direct training of GNNs for community discovery in existing methods has challenges.

Most of the existing representation learning methods aim at isomorphic networks to carry out node embedded learning, and the social network is considered to be isomorphic, and all sides of all nodes in the network belong to a single type. Deep walk is the first embedding algorithm to learn neighborhood features by learning random encoding of each node scope in isomorphic graphs, and both the algorithm and node2vec algorithm use self-encoder for node traversal.

In recent years, some studies have enabled performance of close-supervised learning using self-supervised learning, and Velickovic et al in 2019 proposed DGI to maximize information between different graphic entities including graphic level to node level, corrupted version of graphics, etc., by expanding the concept of information maximization.

Traditionally, a method has been proposed to train a network representation learning algorithm on a generic unsupervised penalty and then apply a clustering algorithm as a post-processing step to find communities. Zhang et al in 2019 propose an adaptive graph convolution method that performs a higher order graph convolution to obtain a smoothed node embedding that captures the global cluster structure, the obtained node embedding then being used to detect communities using spectral clustering. He et al in 2020 propose a community-centric graph rolling network method that obtains node community membership in the hidden layer of the encoder, and introduces a community-centric double decoder to reconstruct the network structure and node attributes in an unsupervised manner. Our work is moving towards directly obtaining node communities within the framework of a graph neural network.

(2) Disadvantages of the prior art

Compared with other tasks such as node classification, link prediction and the like, the community discovery algorithm based on the graph neural network has not been studied and explored deeply. In real world networks, it is costly to directly obtain the real community labels or pairwise constraints, and the prior art mostly works with the node representation learning module and clustering algorithm independently. The existing unsupervised representation learning method is difficult to process complex network data, only low-level semantic features of the network data can be obtained, attribute network data cannot be processed, and an efficient and accurate unsupervised representation learning method is needed to perform community discovery network analysis tasks on the complex network.

Problems and considerations in the prior art:

how to solve the technical problems of low efficiency and poor accuracy of the community.

Disclosure of Invention

The invention provides an algorithm and a device for obtaining communities based on self-supervision training of a graph self-encoder, which solve the technical problems of low efficiency and poor accuracy of community discovery.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

an algorithm for obtaining communities based on self-supervision training of a graph self-encoder comprises the steps of combining self-supervision training with a self-expression principle to obtain a similarity matrix S of paired nodes of a network graph G, guiding generation of a node embedding matrix Z, and obtaining communities in an integrated mode.

The further technical proposal is that: the method specifically comprises the following steps: obtaining a network graph G, and obtaining a first damage graph G through damage function processing ₁ And a second damage graph G ₂ The node embedded matrix Z after the pre-training is obtained through the pre-training treatment, and a first damage graph G ₁ And a second damage graph G ₂ And (3) comparing and learning by a noise comparison function based on a standardized mutual information maximization principle, training by a self-supervision training model until a loss function is minimized, obtaining an optimized node embedded matrix Z, obtaining a similarity matrix S of paired nodes by self-expression principle and regularization treatment, and obtaining a community cluster matrix C of the nodes based on the similarity matrix S and the fully-connected multi-layer perceptron.

The further technical proposal is that: in the step of obtaining the community cluster matrix C of the nodes, probability distribution of due connection and no connection relation of the nodes is obtained in an unsupervised mode so as to guide the formation of communities.

The further technical proposal is that: g= (V, E, X), where v= {1,2, …, N } is a set of nodes,is a set of edges, assuming each node is in vector x _i Including some attribute values x _i ∈R ^F ，X＝[x ₁ ,x ₂ ,…,x _i ,…,x _N ] ^T ∈R ^N×F Is a node attribute matrix of the network graph; the object is to learn the function f V-K ]Wherein [ K ]]= {1,2, …, K } is an index set representing community clusters, each node being mapped to one community by the network structure and node properties of the graph.

The further technical proposal is that:

in the formula (1), Z is a node embedding matrix, each behavior of Z is a vector representation of a node, and Z is E R ^|V|×F′ The method comprises the steps of carrying out a first treatment on the surface of the X is the attribute matrix of the node, A is the adjacency matrix of the input network graph,I∈R ^|V|×|V| is a unit matrix, D is a degree matrix, < >>Relu (·) is the activation function, W ₀ 、W ₁ All are weight parameters of the picture scroll lamination; z is Z ₁ Is a first damage graph G ₁ Node embedded matrix of Z ₂ Is a second damage graph G ₂ Is embedded in the matrix, and parameter sharing is maintained during GCN encoder training.

The further technical proposal is that:

in the formula (8), S is a similarity matrix, L is a characteristic matrix generated after singular value decomposition of a coefficient matrix, L' is a normalized L matrix, and a negative element is set to be 0, L ^T Is the transpose of the L matrix, L _∞ Is an infinite norm of the L matrix.

The further technical proposal is that: the community cluster matrix C comprises N rows of community vectors C _i ，

C _i ＝Softmax(MLP(Z _i ))∈R ^K (9)

In the formula (9), C _i For the community vector of the ith node in the community cluster matrix, softmax (·) is an activation function, MLP is a three-layer neural network, and MLP takes each node vector Z _i Mapping to a K-dimensional vector, wherein K is the number of community clusters, and K is assumed to be known; the softmax layer converts the K-dimensional vector into a probability distribution such that C _iK Namely C _i Represents the probability that the ith node belongs to the kth cluster, such that there is a similarity inlayThe incoming nodes will be mapped to similar locations in the (K-1) dimensional probability distribution.

The further technical proposal is that: continuously updating node embedding to guide the generation of a community cluster matrix by training MLP parameters; the optimization objective of the community is found as formula (10),

in the formula (10), C _i C is the community vector of the ith node in the community cluster matrix _j S is a community vector of the jth node in the community cluster matrix _ij For the similarity matrix of the node i and the node j, when S _ij When the value is high, the node pair (i, j) is constrained in the same community, when S _ij When the value is low, then node i and node j are in different communities.

The further technical proposal is that: the joint optimization is performed according to the self-supervision training parameters and the MLP parameters, and the total loss consists of a weighted sum of two loss parts of training node embedding and community discovery, as shown in a formula (11),

min L _total ＝αL _ss +L _comm (11)

in formula (11), min L _total To minimize the overall training loss, α is the weighting factor in the optimization process, L _ss For self-supervision and contrast of the loss of training node embedding, L _comm Is a loss of community discovery.

The device for obtaining communities based on self-supervision training of the graph self-encoder comprises a community obtaining module, a network graph G obtaining module and a first damage graph G obtaining module, wherein the network graph G is obtained through damage function processing ₁ And a second damage graph G ₂ The node embedded matrix Z after the pre-training is obtained through the pre-training treatment, and a first damage graph G ₁ And a second damage graph G ₂ Based on the principle of maximizing the standardized mutual information, the method is subjected to noise comparison function comparison learning, is trained based on a self-supervision training model until the loss function is minimized, an optimized node embedded matrix Z is obtained, and is obtained through self-expression principle and regularization treatmentAnd obtaining a similarity matrix S of paired nodes, guiding to generate a node embedding matrix Z by using the similarity matrix S, and obtaining a community cluster matrix C of the nodes based on the similarity matrix S and the fully-connected multi-layer perceptron.

The further technical proposal is that: the community acquisition module is further used for acquiring a community cluster matrix C of the nodes, wherein the community cluster matrix C is used for acquiring probability distribution of due connection and no connection relation of the nodes in an unsupervised mode so as to guide formation of communities.

An apparatus for obtaining communities based on graph self-encoder self-supervision training comprises a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor realizes the corresponding steps when executing the computer program.

An apparatus for obtaining communities based on graph self-encoder self-supervised training includes a computer readable storage medium storing a computer program which when executed by a processor performs the respective steps described above.

The beneficial effects of adopting above-mentioned technical scheme to produce lie in:

firstly, an algorithm for obtaining communities based on self-supervision training of a graph self-encoder comprises the steps of combining self-supervision training with a self-expression principle to obtain a similarity matrix S of paired nodes of a network graph G, guiding a generating node to be embedded into a matrix Z, and obtaining communities in an integrated mode. According to the technical scheme, the self-supervision training is combined with the self-expression principle, the similarity matrix S is used for guiding the generation of the node embedding matrix Z and the like, and the community discovery is efficient and accurate.

Second, a device for obtaining communities based on self-supervision training of a graph self-encoder comprises a community obtaining module for obtaining a network graph G and obtaining a first damage graph G through damage function processing ₁ And a second damage graph G ₂ The node embedded matrix Z after the pre-training is obtained through the pre-training treatment, and a first damage graph G ₁ And a second damage graph G ₂ Based on the principle of maximizing the standardized mutual information, the noise comparison function is used for comparison and learning, and based on the self-supervision training model, the training is carried out until the loss functionThe number is minimized, an optimized node embedding matrix Z is obtained, a similarity matrix S of paired nodes is obtained through self-expression principle and regularization treatment, the node embedding matrix Z is guided to be generated through the similarity matrix S, and a community cluster matrix C of the nodes is obtained based on the similarity matrix S and the fully-connected multi-layer perceptron. According to the technical scheme, the self-supervision training is combined with the self-expression principle, the similarity matrix S is used for guiding the generation of the node embedding matrix Z and the like, and the community discovery is efficient and accurate.

See the description of the detailed description section.

Drawings

FIG. 1 is a flow chart of the present application;

FIG. 2 is a data flow diagram of the present application;

FIG. 3 is a first data plot of algorithm comparison experimental results;

FIG. 4 is a second data plot of results of an algorithm comparison experiment;

FIG. 5 is a graph comparing the performance of five algorithms on Cora;

fig. 6 is a graph comparing the performance of five algorithms on a citieser.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present application is not limited to the specific embodiments disclosed below.

Example 1:

the application discloses an algorithm for obtaining communities based on self-supervision training of a graph self-encoder, which comprises the steps of combining the self-supervision training with a self-expression principle to obtain a similarity matrix S of paired nodes of a network graph G, guiding a node to be generated to be embedded into a matrix Z, and obtaining communities in an integrated mode. The method specifically comprises the following steps:

obtaining a network graph G, and obtaining a first damage graph G through damage function processing ₁ And a second damage graph G ₂ The node embedded matrix Z after the pre-training is obtained through the pre-training treatment, and a first damage graph G ₁ And a second damage graph G ₂ And (3) comparing and learning by a noise comparison function based on a standardized mutual information maximization principle, training by a self-supervision training model until a loss function is minimized, obtaining an optimized node embedding matrix Z, obtaining a similarity matrix S of paired nodes by self-expression principle and regularization treatment, guiding to generate the node embedding matrix Z by the similarity matrix S, and obtaining a community cluster matrix C of the nodes based on the similarity matrix S and the fully-connected multi-layer perceptron.

The step of obtaining the community cluster matrix C of nodes includes obtaining probability distribution of due connection and no connection relation of the nodes in an unsupervised manner so as to guide formation of communities.

G= (V, E, X), where v= {1,2, …, N } is a set of nodes,is a set of edges, assuming each node is in vector x _i Including some attribute values x _i ∈R ^F ，X＝[x ₁ ,x ₂ ,…,x _i ,…,x _N ] ^T ∈R ^N×F Is a node attribute matrix of the network graph; the object is to learn the function f V-K]Wherein [ K ]]= {1,2, …, K } is an index set representing community clusters, each node being mapped to one community by the network structure and node properties of the graph.

The community cluster matrix C comprises N rows of community vectors C _i 。

C _i ＝Softmax(MLP(Z _i ))∈R ^K (9)

In the formula (9), C _i For the community vector of the ith node in the community cluster matrix, softmax (·) is an activation function, MLP is a three-layer neural network, and MLP takes each node vector Z _i Mapping to a K-dimensional vector, wherein K is the number of community clusters, and K is assumed to be known; the softmax layer converts the K-dimensional vector into a probability distribution such that C _iK Namely C _i Represents the probability that the ith node belongs to the kth cluster, such that nodes with similar embeddings will be mapped to similar positions in the (K-1) dimensional probability distribution.

Continuously updating node embedding to guide the generation of a community cluster matrix by training MLP parameters; the optimization objective of the community is found as formula (10).

And (3) carrying out joint optimization according to the self-supervision training parameters and the MLP parameters, wherein the total loss consists of a weighted sum of two loss parts of training node embedding and community discovery, as shown in a formula (11).

min L _total ＝αL _ss +L _comm (11)

Example 2:

the invention discloses a device for obtaining communities based on self-supervision training of a graph self-encoder, which comprises the following modules:

the community obtaining module is used for obtaining a network graph G, and obtaining a first damage graph G through damage function processing ₁ And a second damage graph G ₂ The node embedded matrix Z after the pre-training is obtained through the pre-training treatment, and a first damage graph G ₁ And a second damage graph G ₂ Based on the principle of maximizing standardized mutual information, comparing and learning by a noise comparison function, training by a self-supervision training model until a loss function is minimized, obtaining an optimized node embedded matrix Z, and obtaining a similarity matrix S of paired nodes by self-expression principle and regularization treatment so as toThe similarity matrix S guides generation of a node embedding matrix Z, and based on the similarity matrix S and the fully-connected multi-layer perceptron, probability distribution of due connection and no connection relation of the nodes is obtained in an unsupervised mode so as to guide formation of communities and obtain a community cluster matrix C of the nodes.

Example 3:

the invention discloses a device for obtaining communities based on self-supervision training of a graph self-encoder, which is electronic equipment, wherein the electronic equipment comprises the device of the embodiment 2.

Example 4:

the invention discloses a device for obtaining communities based on self-supervision training of a graph self-encoder, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the memory and the processor form an electronic terminal, and the processor realizes the steps of the embodiment 1 when executing the computer program.

Example 5:

the present invention discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of embodiment 1.

Compared with the above embodiment, the program modules may be hardware modules made by using the existing logic operation technology, so as to implement the corresponding logic operation steps, communication steps and control steps, and further implement the corresponding steps, where the logic operation unit is not described in detail in the prior art.

The research and development process comprises the following steps:

the invention is characterized in that: unsupervised represents an improvement in the community discovery method.

1 the most basic technical problems to be solved

An unsupervised loss function is designed based on the graph convolution neural network, the graph convolution neural network is trained in a self-supervision mode, and an algorithm for extracting communities is realized in an integrated mode.

2 core technical scheme

We use g= (V, E, X) to represent an input network graph, where v= {1,2, …, N } is a set ofThe node point is a node point which,is a set of edges. Assume that each node is in vector x _i Including some attribute values x _i ∈R ^F ，X＝[x ₁ ,x ₂ ,…,x _i ,…,x _N ] ^T ∈R ^N×F Is a node attribute matrix of the network graph. The goal of our algorithm is to learn the function f V.fwdarw.K]Wherein [ K ]]= {1,2, …, K } is an index set representing community clusters, each node being mapped to one community by the network structure and node properties of the graph.

The research network comprises a quotation network, a social network and the like, the quotation network is used in experiments, and the community algorithm is not only used for the quotation network, but also can be used for the social network, and the network with node, side and node attributes can be used. Four citation networks were used in the experiments. .

The meaning of the symbols is given in table 1.

Table 1: meaning table of symbols

As shown in fig. 1, which is a flowchart of a community discovery algorithm based on self-supervision training of a graph self-encoder, the partitioning includes the following steps:

Step 1: generating a damage graph G from an input graph G ₁ And G ₂ Positive and negative sample pairs of node V are defined.

As shown in FIG. 2, a diagram of the overall framework of a community discovery algorithm based on self-supervised training of a graph self-encoder is provided.

We generate two damage graphs G from an input graph G using a damage function ₁ And G ₂ . The damage graph is generated by randomly deleting small parts of edges from the input graph G, and keeping the vertices unchanged. For any node i ε V, at G ₁ And G ₂ Corresponding node i in (a) is denoted as G ₁ (i) And G ₂ (i) We will (G) ₁ (i),G ₂ (i) Defined as positive sample pairs. Randomly selecting a group of nodes V _i- = { j∈v|j+.i }, we will (G) ₁ (i),G ₁ (j))(G ₁ (i),G ₂ (j) Defined as a negative sample pair.

Step 2: the pre-training mode generates a node embedding matrix Z of the input graph G.

We will corrupt graph G using a graph convolution GCN encoder ₁ And G ₂ Generating a corresponding node embedded representation Z ₁ And Z ₂ . The graph convolutional encoder generation process is shown in equation (1).

In the formula (1), Z is the generated node embedded matrix, each behavior of Z is a vector representation of a node, and Z epsilon R ^|V|×F′ . X is the attribute matrix of the node, A is the adjacency matrix of the input network graph,I∈R ^|V|×|V| is a unit matrix, D is a degree matrix, < >>Relu (·) is the activation function, W ₀ 、W ₁ The weight parameters for the graph roll stack. Z is Z ₁ And Z ₂ Is a graph G ₁ And G ₂ Is embedded in the matrix, and parameter sharing is maintained during GCN encoder training.

Step 3: minimizing two graphs G ₁ And G ₂ Finally generating the embedded matrix Z of the node.

In the method, network structure and node attribute information of the graph are considered simultaneously, a model is trained based on a self-supervision mode according to the principle of mutual information maximization between two damaged graphs of an input graph, and an embedded matrix Z of the node is finally generated. G ₁ And G ₂ As shown in equation (2), our training goal is to minimize the loss function.

In the formula (2), Z _1i And Z _2i Respectively, node i is at G ₁ (i) And G ₂ (i) We calculate the embedding loss of the positive and negative pairs using cosine similarity cos (·), τ being a temperature parameter. The loss function includes two parts, a positive sample pair and a negative sample pair.

Step 4: and (3) learning to generate a similarity matrix S of paired nodes according to the node embedding matrix Z generated in the step (1-3).

The objective problem is to derive a node similarity matrix S according to a node embedding matrix Z. Using the principle of self-expression of nodes, we try to represent node i using the linear sum of other nodes, as in equation (3).

In the formula (3), Z _i Is the embedded vector of node i, Z _j Is the embedded vector of node j, p _ij Is the similarity coefficient between node i and node j, defining p _ii ＝0。

We use the F-norm normalized reconstruction node embedding matrix Z, optimize the objective function as equation (4),

in the formula (4), Z is a node embedding matrix, P is a similarity coefficient matrix, and lambda ₁ Is an optimized weight parameter.

Following is the process of constructing a similarity matrix S for nodes, we train the data using matrix decomposition and batch processing techniques. First, a coefficient matrix is calculated according to the similarity coefficient P between nodesAs in equation (5),

because of the large data dimension in the dataset, we randomly sample the data using batch processing techniques, matrix coefficients for ease of computation and storageSVD decomposition with rank r is performed, as in equation (6) (7),

in the formula (6), r=4k+1, u is a left eigenvector matrix, Σ is a singular value diagonal matrix, V ^T Is the right eigenvector matrix.

Regularizing each row of L, and setting the negative value in L as 0 to obtain L'.

Finally, constructing a similarity matrix S, such as formula (8), S _ij ∈[0,1]，

In equation (8), the similarity matrix S.

We randomly sample and select M nodes, wherein M is less than or equal to N. Batch processing trains the loss in equation (4). The node similarity matrix S generated in the step can guide the node obtained in the step 1-3 to be embedded into Z to generate a community C.

Step 5: a community cluster matrix C of nodes is discovered using a fully connected multi-layer perceptron-based approach.

We use a method with trainable parameters W in this step _MLP Is a fully connected multi-layer perceptron (MLP) that embeds and maps each node to its corresponding nodeA community member vector as shown in equation (9).

C _i ＝Softmax(MLP(Z _i ))∈R ^K (9)

In the formula (9), C _i For the community vector of the ith node in the community cluster matrix, softmax (·) is an activation function, MLP is a three-layer neural network, and MLP takes each node vector Z _i Mapping to a K-dimensional vector, K being the number of community clusters, we assume that K is known. The softmax layer converts the K-dimensional vector into a probability distribution such that C _iK (C _i K-th element of (c-1) represents the probability that the i-th node belongs to the K-th cluster, such that nodes with similar embeddings will be mapped to similar locations in the (K-1) dimensional probability distribution.

By using the information learned from the paired node similarity obtained in step 4, the MLP parameters in this step are trained to continuously update the node embedment to guide the generation of the community cluster matrix. The optimization objective of the community is found as formula (10),

in the formula (10), C _i C is the community vector of the ith node in the community cluster matrix _j S is a community vector of the jth node in the community cluster matrix _ij For the similarity matrix of the node i and the node j, when S _ij When the value is high, the node pair (i, j) should be approximately constrained to be in the same community, when S _ij When the value is low, then node i and node j are in different communities. Thus, we generate a probability distribution of how a set of nodes should be connected and not connected in an unsupervised manner to guide the formation of communities.

The invention solves the problem of optimization goal, we are to perform joint optimization according to the self-supervision training parameters and MLP parameters, the total loss is composed of the weighted sum of the training node embedding and community discovery two-part loss, as in formula (11),

min L _total ＝αL _ss +L _comm (11)

in the formula (11), alpha is an optimized weight factor, and consists of two parts of a formula (2) and a formula (10). The node pair similarity value is obtained by solving the batch learning technology of the step 4. The entire algorithm proceeds in an iterative fashion by solving for the similarity of each batch node, and then updating the parameters of the neural network by minimizing equation (11).

The whole algorithm flow is to input a network diagram G, including a node number N, a community category number K, a batch processing size M, a batch number H and output a community member vector C of any node i _i . Firstly, initializing parameters of a self-supervision training graph neural network and a clustering MLP neural network, pre-training to obtain an embedding matrix Z of nodes, randomly selecting node samples according to batch size and batch number, optimizing an objective function according to a formula (4), and batch learning a similarity matrix of paired nodes in the network; iterative algorithm, using self-supervision graph neural network to generate node embedding matrix Z, generating community member vector C for any node i _i Parameters in the neural network and the MLP network are continuously updated according to the loss function formula (11).

The distinguishing technical characteristics are as follows:

the application combines self-supervision training and self-expression principle to generate a pair node similarity matrix, and guides the generation of a node embedding matrix, and extracts an algorithm of a community in an integrated mode.

3 beneficial technical effects

After the application runs for a period of time internally, the feedback of field technicians is beneficial in that:

the SECD algorithm can achieve better performance on all data sets and all indicators. In terms of clustering accuracy ACC performance, the SECD algorithm is improved by 10.1% on the Cora data set, 2.9% on the Citeser data set, 11.3% on the Wiki data set and 5.8% on the Pubmed data set compared with the AGC algorithm. Compared with the AGC algorithm, the SECD is improved by 4.1% on the Cora data set, 3.0% on the Citeseer data set, 12.1% on the Wiki data set and 13.8% on the Pubmed data set in the mutual information NMI performance.

See tables 3 and 4 for specific data.

As shown in fig. 3 and 4, the performance of the algorithm is improved on two indexes of accuracy ACC and NMI standards mutual information. The "diagonal" bar graph shows the performance of the proposed algorithm versus the other two most advanced algorithms.

4 inventive concept

Compared with other tasks such as node classification, link prediction and the like, the community discovery algorithm based on the graph neural network has not been studied and explored deeply. In real world networks, it is costly to directly obtain the real community labels or pairwise constraints, and the prior art mostly works with the node representation learning module and clustering algorithm independently. The invention aims to solve the problems that an unsupervised loss function is designed based on a graph convolution neural network, the graph convolution neural network is trained in a self-supervision mode, and an algorithm for extracting communities is realized in an integrated mode.

Application description:

the invention provides a community discovery algorithm based on self-supervision training of a graph self-encoder, which fuses a self-supervision graph neural network with a self-expression principle, and simultaneously considers the network structure and node attribute information of the graph to solve the community discovery problem. Training in an end-to-end manner is performed on a plurality of public network data sets, and compared with the performance of a plurality of algorithms, the algorithm of the invention achieves the best performance in community discovery tasks, and has the advantage of being applied to real data sets.

Application example 1: experiments were performed on the Cora dataset using the present algorithm. The Cora dataset is a citation network of machine-learned papers, containing 2708 papers, 5429 edges, for a total of 7 categories. Each paper in the dataset is described by a word vector of 0/1 value, representing the absence/presence of the corresponding word in the dictionary. The dictionary consists of 1433 unique words, each paper consists of 1433 features, each feature being represented only by 0/1.

Application example 2: experiments were performed on the citeser dataset using the present algorithm. The citeser dataset is a quotation network of document words, containing sparse word feature vector packages for each document and a list of quotation links between documents. These tags contain six areas: agents, artificial intelligence, databases, information retrieval, machine language, and human-machine interaction.

Application example 3: experiments were performed on Wiki datasets using the present algorithm. The Wiki dataset is a quotation network of a collection of web pages, including 2405 web pages, 17981 edges. Wherein the nodes are web pages and if one links to another, then it is connected to the other.

Application example 4: the present algorithm was applied to experiments on Pubmed datasets. The Pubmed dataset is a citation network comprising 19717 scientific publications on diabetes from Pubmed database, divided into three categories: diabetes Mellitus, experimenal, diabetes Mellitus Type, diabetes Mellitus Type 2, citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word vector in a dictionary of 500 unique words.

See table 2 for details of the four data sets. Cora, citeseer and Pubmed are reference datasets in which nodes correspond to papers, and are connected by an edge if one paper references another. Wiki is a collection of web pages where a node is a web page that if linked to another web page, then links to the other web page.

Table 2: experimental data set information

In the application embodiment of the invention, the algorithm is compared with two attribute map embedding algorithms and six node clustering algorithms. These eight algorithms are classified into three types, using only node features, using only network structures, using both node features and network structures. The SECD algorithm provided by the invention is an algorithm which uses node characteristics and network structure information simultaneously. The community discovery results are evaluated using two index accuracy ACC and normalized mutual information NMI. And performing evaluation index calculation according to the real community information in the data set. See table 3 for accuracy results of the SECD algorithm on the four data sets and table 4 for NMI mutual information results of the SECD algorithm on the four data sets.

For the attribute map embedding method, two algorithms are compared: GAE and VGAE combine the graph convolution network with a variational automatic encoder for representation learning in downstream task node classification and community discovery. For the node clustering method, a comparison is made with six other algorithms. These six algorithms can be divided into three classes:

(1) The method uses only features. Kmeans and spectral clustering are two conventional clustering algorithms. Spectran-F takes as input the cosine similarity of node features.

(2) The method uses only graph structures. Spectran-G is a Spectral cluster with the adjacency matrix as the input similarity matrix. Deep walk is embedded by learning nodes using skip gram on a random walk path generated on the graph.

(3) The method uses both features and graph structures. The AGC uses higher order graph convolution to filter node features and select the number of graph convolution layers for different data sets. The GUCD introduces local enhancements to potential communities, obtains node community membership in the hidden layer of the encoder, and introduces community-centric double decoders.

Table 3: SECD algorithm ACC on four datasets

Table 4: the SECD algorithm NMI on four datasets

The average performance was calculated by running ten times on each dataset. The SECD algorithm can achieve better performance on all data sets and all indicators. In terms of clustering accuracy ACC performance, the SECD algorithm is improved by 10.1% on the Cora data set, 2.9% on the Citeser data set, 11.3% on the Wiki data set and 5.8% on the Pubmed data set compared with the AGC algorithm. Compared with the AGC algorithm, the SECD is improved by 4.1% on the Cora data set, 3.0% on the Citeseer data set, 12.1% on the Wiki data set and 13.8% on the Pubmed data set in the mutual information NMI performance.

As shown in fig. 5 and 6, the performance effects of the present algorithm are shown in comparison with the four most advanced algorithms in the Cora dataset and the Citeseer dataset.

At present, the technical scheme of the invention has been subjected to pilot-scale test, namely, smaller-scale test of products before large-scale mass production; after the pilot test is completed, the use investigation of the user is performed in a small range, and the investigation result shows that the user satisfaction is higher; now, the preparation of the formal production of the product for industrialization (including intellectual property risk early warning investigation) is started.

Claims

1. An algorithm for obtaining communities based on graph self-encoder self-supervision training, which is characterized in that: combining self-supervision training with a self-expression principle to obtain a similarity matrix S of paired nodes of a network diagram G, guiding a node to be generated to be embedded into a matrix Z, and obtaining communities in an integrated mode; the method specifically comprises the following steps: obtaining a network graph G, and obtaining a first damage graph G through damage function processing ₁ And a second damage graph G ₂ The node embedded matrix Z after the pre-training is obtained through the pre-training treatment, and a first damage graph G ₁ And a second damage graph G ₂ Based on a standardized mutual information maximization principle, performing noise contrast function contrast learning, training based on a self-supervision training model until a loss function is minimized, obtaining an optimized node embedded matrix Z, obtaining a similarity matrix S of paired nodes through self-expression principle and regularization processing, and obtaining a community cluster matrix C of the nodes based on the similarity matrix S and the fully-connected multi-layer perceptron;

G= (V, E, X), where v= {1,2, …, N } is a set of nodes,is a set of edges, assuming each node is in vector x _i Including some attribute values x _i ∈R ^F ，X＝[x ₁ ,x ₂ ,…,x _i ,…,x _N ] ^T ∈R ^N×F Is a node attribute matrix of the network graph; the object is to learn the function f V-K]Wherein [ K ]]= {1,2, …, K } is an index set representing community clusters, each node is mapped to one community by the network structure and node attribute of the graph;

g is a quotation network diagram, V is a group of nodes which are papers, documents or web pages, E is a quotation relationship among the edges of the papers, the documents or the web pages;

in the formula (1), Z is a node embedding matrix, each behavior of Z is a vector representation of a node, and Z is E R ^|V|×F′ The method comprises the steps of carrying out a first treatment on the surface of the X is the attribute matrix of the node, A is the adjacency matrix of the input network graph,I∈R ^|V|×|V| is a unit matrix, D is a degree matrix, < >>Relu (·) is the activation function, W ₀ 、W ₁ All are weight parameters of the picture scroll lamination; z is Z ₁ Is a first damage graph G ₁ Node embedded matrix of Z ₂ Is a second damage graph G ₂ Is embedded into the matrix, and parameter sharing is kept in the GCN encoder training process;

according to the generated node embedding matrix Z, learning to generate a similarity matrix S of paired nodes;

using the self-expression principle of the node, and using the linear addition of other nodes to express the node i, such as a formula (3);

In the formula (3), Z _i Is the embedded vector of node i, Z _j Is the embedded vector of node j, p _ij Is the similarity coefficient between node i and node j, defining p _ii ＝0；

The reconstruction node embedding matrix Z is normalized by the F-norm, the objective function is optimized as in equation (4),

in the formula (4), Z is a node embedding matrix, P is a similarity coefficient matrix, and lambda ₁ Is an optimized weight parameter;

the following is a process of constructing a similarity matrix S for the nodes, training data using matrix decomposition and batch processing; calculating coefficient matrix according to similarity coefficient P between nodesAs in equation (5),

because the data dimension in the data set is large, the data is randomly sampled by batch processing, and the coefficient matrix is used for facilitating calculation and storageSVD decomposition with rank r is performed, as in equation (6) and equation (7),

in the formula (6), r=4k+1, u is a left eigenvector matrix, Σ is a singular value diagonal matrix, V ^T Is a right eigenvector matrix;

regularizing each row of L, and setting a negative value in L as 0 to obtain L';

In the formula (8), S is a similarity matrix, L is a characteristic matrix generated after singular value decomposition of a coefficient matrix, L' is a normalized L matrix, and a negative element is set to be 0, L ^T Is the transpose of the L matrix, L _∞ Is an infinite norm of the L matrix;

randomly sampling and selecting M nodes, wherein M is less than or equal to N; the loss in batch training equation (4); in the step, the node similarity matrix S is generated, so that the obtained node embedding Z can be guided to generate a community C;

the community cluster matrix C comprises N rows of community vectors C _i ，

C _i ＝Softmax(MLP(Z _i ))∈R ^K (9)

In the formula (9), C _i For the community vector of the ith node in the community cluster matrix, softmax (·) is an activation function, MLP is a three-layer neural network, and MLP takes each node vector Z _i Mapping to a K-dimensional vector, wherein K is the number of community clusters, and K is assumed to be known; the softmax layer converts the K-dimensional vector into a probability distribution such that C _iK Namely C _i Represents the probability that the ith node belongs to the kth cluster, such that nodes with similar embeddings will be mapped to similar locations in the (K-1) dimensional probability distribution;

in the step of obtaining the community cluster matrix C of the nodes, probability distribution of due connection and no connection relation of the nodes is obtained in an unsupervised mode so as to guide the formation of communities;

continuously updating node embedding to guide the generation of a community cluster matrix by training MLP parameters; the optimization objective of the community is found as formula (10),

In the formula (10), C _i C is the community vector of the ith node in the community cluster matrix _j S is a community vector of the jth node in the community cluster matrix _ij For the similarity matrix of the node i and the node j, when S _ij When the value is high, the node pair (i, j) is constrained in the same community, when S _ij When the value is low, the node i and the node j are in different communities;

the joint optimization is performed according to the self-supervision training parameters and the MLP parameters, and the total loss consists of a weighted sum of two loss parts of training node embedding and community discovery, as shown in a formula (11),

minL _total ＝αL _ss +L _comm (11)

in the formula (11), minL _total To minimize the overall training loss, α is the weighting factor in the optimization process, L _ss For self-supervision and contrast of the loss of training node embedding, L _comm Is a loss of community discovery.

2. An apparatus for obtaining communities based on self-supervision training of a graph self-encoder, which is characterized in that: the method comprises a community obtaining module for obtaining a network graph G, and obtaining a first damage graph G through damage function processing ₁ And a second damage graph G ₂ The node embedded matrix Z after the pre-training is obtained through the pre-training treatment, and a first damage graph G ₁ And a second damage graph G ₂ Based on a standardized mutual information maximization principle, performing noise contrast function contrast learning, training based on a self-supervision training model until a loss function is minimized, obtaining an optimized node embedding matrix Z, performing self-expression principle and regularization processing to obtain a similarity matrix S of paired nodes, and guiding generation of node embedding moments by the similarity matrix S The matrix Z is used for obtaining a community cluster matrix C of the nodes based on the similarity matrix S and the fully-connected multi-layer perceptron;

the community cluster matrix C comprises N rows of community vectors C _i ，

C _i ＝Softmax(MLP(Z _i ))∈R ^K (9)

In the formula (9), C _i For the community vector of the ith node in the community cluster matrix, softmax (·) is an activation function, MLP is a three-layer neural network, and MLP takes each node vector Z _i Mapping to a K-dimensional vector, wherein K is the number of community clusters, and K is assumed to be known; the softmax layer converts the K-dimensional vector into a probability distribution such that C _iK Namely C _i Represents the probability that the ith node belongs to the kth cluster, such that nodes with similar embeddingsThe points will be mapped to similar locations in the (K-1) dimensional probability distribution;

minL _total ＝αL _ss +L _comm (11)

3. The apparatus for obtaining communities based on graph-based self-encoder self-supervised training of claim 2, wherein: the community acquisition module is further used for acquiring a community cluster matrix C of the nodes, wherein the community cluster matrix C is used for acquiring probability distribution of due connection and no connection relation of the nodes in an unsupervised mode so as to guide formation of communities.

4. An apparatus for obtaining communities based on graph self-encoder self-supervised training, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that: the processor, when executing the computer program, implements the corresponding steps of claim 1.

5. An apparatus for obtaining communities based on graph self-encoder self-supervised training includes a computer readable storage medium storing a computer program, characterized in that: which when executed by a processor carries out the respective steps of claim 1.