CN112784118A

CN112784118A - Community discovery method and device in graph sensitive to triangle structure

Info

Publication number: CN112784118A
Application number: CN202110018835.XA
Authority: CN
Inventors: 张吉; 王佳麟; 高军
Original assignee: Peking University; Zhejiang Lab
Current assignee: Peking University; Zhejiang Lab
Priority date: 2021-01-07
Filing date: 2021-01-07
Publication date: 2021-05-11

Abstract

The invention relates to a triangle-structure-sensitive method and device for community discovery in graphs. The method comprises the following steps: using the graph encoder of a graph autoencoder to fuse the structural information and node content information of a graph through a graph neural network model, so as to learn hidden-layer vector representations of the nodes in the graph; using the graph decoder of the graph autoencoder to reconstruct, from the hidden-layer vector representations of the nodes, both the edge relation between pairs of nodes and the triangle structures in the graph; and performing graph clustering using the structural information and node content information in the reconstructed graph, thereby discovering communities. The invention is an unsupervised, triangle-structure-sensitive community discovery scheme based on a graph autoencoder; it performs community discovery in graphs efficiently and adaptively, can be applied on different platforms, and offers high scalability and flexibility.

Description

Community discovery method and device in graph sensitive to triangle structure

Technical Field

The invention belongs to the field of information technology. Many real-world scenarios and applications can be described with graphs, such as social network graphs, paper citation graphs, and user-commodity graphs on e-commerce platforms. The triangle structures in a graph are important for how communities form and for discovering them. Building on graph neural network techniques and on the triangle structures in the graph, the method learns node representations from the data in a self-supervised manner and clusters them to discover the community structure of the graph. It can be widely applied to graphs from different online platforms, such as social networks and e-commerce.

Background

Graph structures are widely used to describe complex real-world scenarios, such as social networks, the World Wide Web, urban traffic networks, and user-commodity networks in e-commerce. Community structure is a common feature of all kinds of graphs: the whole graph is composed of many communities, and the communities reflect how closely nodes are connected. Community discovery algorithms help to understand the node clusters, independent groups and network structure in a graph. This in turn helps to infer similar behaviors and preferences within peer groups, to estimate resilience, and to find nested relationships, and it can also provide a basis for other data mining tasks. Examples include querying, in an e-commerce system, for the cheating groups that collaborate with a given target cheating user, and querying, in a social network, for the communities of common interest to one or more target users.

The community discovery task on a graph is generally carried out by clustering the nodes of the graph into communities. Nodes within a community are closely connected, while connections between communities are sparse, so a community is usually a dense subgraph. Triangles are the basic building blocks of dense subgraphs, so exploiting the triangle structures in the graph is very important for community discovery. Traditional clustering algorithms, such as Kernighan-Lin (K-L) bisection, graph bisection and spectral clustering, mainly use the connection information in the graph to find communities and make little use of the content information of the nodes. Some deep-learning-based graph clustering algorithms try to fuse the structural information and node content information of a graph in one model and learn node vector representations for clustering. However, these models attend only to simple structural information and make little use of high-order structures in the graph (such as triangles), so they cannot mine community information well.

For the community discovery problem in graphs, the invention provides a model that considers graph structure and node content together and that also incorporates the triangle structures in the graph; it has important academic value and broad application prospects.

Disclosure of Invention

The invention provides an unsupervised, triangle-structure-sensitive community discovery method and device based on a graph autoencoder. The method fuses the structural information and content information of the graph with a graph neural network in the encoder, and learns the high-order structural information of the graph by reconstructing triangle structures in the decoder. In this way, the method can perform the community discovery task in a graph efficiently and adaptively, and can be applied on different platforms.

The technical scheme adopted by the invention is as follows:

An unsupervised, triangle-structure-sensitive community discovery method based on a graph autoencoder, comprising the following steps:

using the graph encoder of the graph autoencoder to fuse the structural information and node content information of a graph through a graph neural network model, so as to learn hidden-layer vector representations of the nodes in the graph;

using the graph decoder of the graph autoencoder to reconstruct, from the hidden-layer vector representations of the nodes, the edge relation between pairs of nodes in the graph and the triangle structures in the graph;

and performing graph clustering using the structural information and node content information in the reconstructed graph, thereby discovering communities.

The method is based on an auto-encoder structure and consists of a graph encoder and a graph decoder.

The graph encoder uses a graph neural network model to fuse the structural information and node content information of the graph, so as to learn hidden-layer vector representations of the nodes. The input of the graph encoder is the adjacency matrix of the graph and the node feature matrix; the two are fused by a graph neural network model such as a graph convolutional network or a graph attention network. Note that if the original graph has only structural information and no node content information, the degree of each node may be used as its content information. The multi-layer graph neural network yields the hidden-layer vector representations of the nodes, which the subsequent decoder uses to decode and reconstruct the information already present in the graph (such as its structure and content).

The graph decoder reconstructs the structure of the graph from the hidden-layer vector representations of the nodes. In its decoder, a traditional unsupervised graph neural network (the graph autoencoder) usually attends only to simple low-order structural information, such as whether an edge exists between two nodes. For community detection this information is insufficient: as described above, a community is usually a dense subgraph, and triangles are the key components of dense subgraphs. The invention therefore reconstructs not only the edge relation between pairs of nodes but also the triangle structures in the graph.

For the reconstruction of edge information, given two nodes A and B joined by an edge in the original graph, the invention computes the likelihood of an edge between A and B through a one-layer inner-product network, thereby reconstructing the edge information already present in the graph.

For the reconstruction of triangle structures, given an edge between A and B, the neighbor sets of A and B are searched. If C is a neighbor of A or B, triangle information is learned according to whether C is connected to both A and B (that is, whether A, B and C form a triangle). Negative sampling is performed at the same time: a node D that is connected to neither A nor B is sampled, and the triangle information is reconstructed and learned from the relations among A, B, C and D.
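The neighbor search and negative sampling around an edge can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the adjacency-set representation, the function name `sample_triangle_context`, and the uniform choice of the negative node D are all assumptions made for the example.

```python
import random

def sample_triangle_context(adj, a, b, rng):
    """For an existing edge (a, b), return the candidate nodes C
    (neighbours of a or b) and one negative node D adjacent to neither.

    `adj` maps each node to the set of its neighbours (undirected graph).
    Uniform sampling of D is an assumption of this sketch.
    """
    if b not in adj[a]:
        raise ValueError("(a, b) must be an edge of the graph")
    candidates = (adj[a] | adj[b]) - {a, b}
    # Nodes with no edge to a or to b form the negative-sampling pool.
    pool = sorted(set(adj) - candidates - {a, b})
    negative = rng.choice(pool) if pool else None
    return sorted(candidates), negative

# Toy graph: triangle 0-1-2, a pendant node 3 on node 0, and a far edge 4-5.
toy_adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {5}, 5: {4}}
candidates, negative = sample_triangle_context(toy_adj, 0, 1, random.Random(0))
```

For the edge (0, 1), node 2 closes a triangle while node 3 is attached to only one endpoint; the negative node is drawn from the far component.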

The community detection step performs graph clustering, for example with the K-means algorithm, on the hidden-layer node vector representations learned by the graph autoencoder, so as to find communities.

Furthermore, the method is designed to be scalable. Practical production applications often process massive data, and the graphs involved may be very large (for example, tens of millions of nodes and hundreds of millions of edges). To ensure scalability (that is, the feasibility of running on large graphs), the invention also provides a way to run the algorithm on a large graph. Some theoretical justification is given first. A graph neural network learns by aggregating the structural and attribute features of the neighbors around a central node and passing these features on after a local transformation, so the model is inherently local: for each node, learning is confined to a "receptive field". In other words, information from far away (for example, from a node whose shortest distance to the node in question is more than 20) is useless for the model's learning. Likewise, for a central node, only the triangle structures in its vicinity are of significant value. Therefore, to improve scalability, sub-graph sampling can be performed on the original large graph so that only the neighbor nodes around a central node are kept, and the graph autoencoder model is then learned on the sampled sub-graphs. This preserves the learning effect of the model while ensuring the running efficiency and scalability of the algorithm.
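The locality argument can be illustrated with a breadth-first search that keeps only the nodes within a fixed number of hops of a central node. This is a sketch under assumptions: the hop limit `k`, the adjacency-set representation and the function name `k_hop_subgraph` are chosen for the example and do not come from the source.

```python
from collections import deque

def k_hop_subgraph(adj, center, k):
    """Return the nodes within k hops of `center` and the edges among them."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond the receptive field
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    nodes = set(dist)
    # Keep each undirected edge once, as an ordered pair (u, v) with u < v.
    edges = {(u, v) for u in nodes for v in adj[u] if v in nodes and u < v}
    return nodes, edges

# Path graph 0-1-2-3-4: the 1-hop neighbourhood of node 2 is {1, 2, 3}.
path_adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
hop_nodes, hop_edges = k_hop_subgraph(path_adj, 2, 1)
```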

Based on the same inventive concept, the invention also provides a triangle-structure-sensitive device for community discovery in graphs, which adopts the above method and comprises:

the graph encoder module is used for fusing structure information and node content information in the graph through a graph neural network model so as to learn hidden layer vector representation of the nodes in the graph;

the decoder module is used for reconstructing a connection edge relation between two points in the graph and a triangle structure in the graph according to the hidden vector representation of the nodes in the graph;

and the clustering module is used for carrying out graph clustering by using the structure information and the node content information in the reconstructed graph so as to discover communities.

The invention is an unsupervised, triangle-structure-sensitive community discovery method based on a graph autoencoder. It is scalable, can be applied to graph data of different scales to mine information, and generates vectors for the nodes of a graph using information of different dimensions in the graph. The advantages of the invention are:

1) The invention is an unsupervised learning model that needs no labeled data set; it is highly scalable and applies to a wide range of scenarios (graphs).

2) The method uses a graph autoencoder to learn vector representations of the graph nodes: in the encoder stage, a graph neural network fuses the structural information of the graph with the content information of the nodes; in the decoder stage, the original edge information and triangle information of the graph are reconstructed, so as to learn node representations better suited to community discovery in the graph.

3) The framework of the method is highly flexible: the encoder can be replaced by different graph neural network models, such as a graph convolution model or a graph attention model; the decoder can use different loss functions for different application scenarios (for example, reconstruction of node attribute information can be added for attribute-rich graphs); and the sub-graph sampling part can use different sampling methods for efficient sampling.

4) The node vector representations learned by the model contain rich information about the graph. They can be used for community discovery and for other graph tasks (such as node classification and link prediction), and can serve different real-world applications, such as discovering communities with shared interests in a social network.

Drawings

FIG. 1 is the overall framework and flow diagram of the method of the invention, where $A$ is the adjacency matrix of the graph, $X$ is the node attribute vector matrix of the graph, $A_t$ is the adjacency matrix of a sampled subgraph, and $X_t$ is the node attribute vector matrix of that subgraph. The two remaining symbols in the figure represent, respectively, the edge relation and the triangle edge relation reconstructed by the graph autoencoder.

Fig. 2 shows three relations of the triangular structure.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, the present invention shall be described in further detail with reference to the following detailed description and accompanying drawings.

The invention uses a graph neural network in an unsupervised setting to learn low-dimensional vector representations of the nodes in a graph, and then uses these representations for community mining in the graph. To improve scalability on large graphs, sub-graph sampling is used to reduce the scale of the training data. The sampling method can be chosen according to the graph scenario at hand: for example, nodes can simply be sampled according to their degrees, or edges can be sampled, after which sub-graphs are extracted from the large graph according to the sampled nodes and edges. Specific sampling methods can be found in: Zeng H, Zhou H, Srivastava A, Kannan R, Prasanna V. GraphSAINT: Graph Sampling Based Inductive Learning Method. arXiv preprint arXiv:1907.04931. 2019 Jul 10.

The overall framework of the method is shown in FIG. 1. A graph $G = \{V, E, X\}$ is given, where $V$ is the set of nodes in the graph, $E$ is the set of edges, and $A$ is the adjacency matrix of $G$. For an attributed graph, $X$ is the node attribute vector matrix; for a non-attributed graph, $X$ can be initialized with the node degrees. The method first applies sub-graph sampling to $G$ to generate a number of sub-graphs, and then trains the graph autoencoder on each sub-graph. The graph autoencoder has two modules: a graph encoder followed by a graph decoder. The graph encoder can be a common graph neural network model, such as a graph convolutional network or a graph attention network. The graph decoder is divided into two parts: one reconstructs the edge relations in the graph, and the other reconstructs the triangle-structure relations in the graph.
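As a small illustration of the inputs, the sketch below builds the adjacency matrix of an undirected graph from an edge list and initializes the feature matrix with node degrees, as suggested for non-attributed graphs. The one-column layout of the features and the function name `adjacency_and_degree_features` are assumptions made for this example.

```python
def adjacency_and_degree_features(num_nodes, edges):
    """Build the adjacency matrix A of an undirected graph and a
    one-column feature matrix X that holds each node's degree."""
    a = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        a[u][v] = 1
        a[v][u] = 1
    # With no node attributes available, use the degree as the only feature.
    x = [[sum(row)] for row in a]
    return a, x

A, X = adjacency_and_degree_features(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
```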

Each component, sub-graph sampling and the graph autoencoder, is described in detail below:

1. Sub-graph sampling

For sub-graph sampling, the invention adopts a simple and efficient way of sampling the edges of the graph. The effectiveness of this kind of sampling has been proven theoretically in the literature: when the model is computed on sampled sub-graphs, the sampling method reduces the variance of the model across iteration rounds. For details see: Chen J, Ma T, Xiao C. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In International Conference on Learning Representations. 2018 Feb 15.

The sampling strategy of the invention depends on the degrees of the nodes in the graph. For node $u$ and node $v$ joined by an edge, the probability that this edge is sampled is $p_{u,v} \propto 1/d_u + 1/d_v$, where $\propto$ denotes proportionality and $d_u$ and $d_v$ are the degrees of $u$ and $v$.

Given the size of the sampled sub-graph (for example, how many edges to sample), edges are sampled from the whole graph according to this probability, and the sub-graph is then extracted from the sampled edges, which determines the sampled sub-graph.
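The edge sampler can be sketched as follows. The probability $p_{u,v} \propto 1/d_u + 1/d_v$ comes from the text; sampling with replacement and extracting the sub-graph as the one induced by the endpoints of the sampled edges are assumptions of this sketch, as are the function names.

```python
import random

def edge_sampling_probabilities(edges):
    """Normalised probabilities p_{u,v} proportional to 1/d_u + 1/d_v."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    weights = [1 / degree[u] + 1 / degree[v] for u, v in edges]
    total = sum(weights)
    return [w / total for w in weights]

def sample_subgraph(edges, num_draws, rng):
    """Draw edges by probability, then keep the sub-graph induced
    by the endpoints of the drawn edges (an assumed extraction rule)."""
    probs = edge_sampling_probabilities(edges)
    picked = set(rng.choices(edges, weights=probs, k=num_draws))
    nodes = {n for edge in picked for n in edge}
    induced = [e for e in edges if e[0] in nodes and e[1] in nodes]
    return nodes, induced

toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
probs = edge_sampling_probabilities(toy_edges)
sub_nodes, sub_edges = sample_subgraph(toy_edges, 2, random.Random(0))
```

In the toy graph the edge (2, 3) has the largest probability, because node 3 has degree 1: edges that touch low-degree nodes are favored.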

2. Graph autoencoder

After a number of sub-graphs have been obtained, the neural network of the graph autoencoder is trained on each of them. The graph autoencoder is divided into an encoder and a decoder, and the decoder is further divided into a decoder for the edge-structure reconstruction loss and a decoder for the triangle reconstruction loss. Their specific implementations are given below. Suppose the sub-graph is $G_t = \{A_t, X_t\}$, where $A_t$ is the adjacency matrix of the sub-graph and $X_t$ is its node attribute vector matrix.

2.1) Graph encoder

The graph encoder uses a graph neural network to encode the structural information and node content information of the graph. Various graph neural networks can be used, such as graph convolutional networks and graph attention networks. The graph convolutional network form is given here:

$$Z^{l+1} = \sigma\left(\tilde{D}^{-1/2} \tilde{A}_t \tilde{D}^{-1/2} Z^{l} W^{l}\right)$$

where $l$ denotes the $l$-th layer of the network and $Z^{l}$ is the node vector representation at the $l$-th layer. Here $\tilde{A}_t = A_t + I$, where $I$ is the identity matrix of the same size as $A_t$, $\tilde{D}$ is the degree matrix of $\tilde{A}_t$, $W^{l}$ is the trainable parameter of the $l$-th layer, and $\sigma$ is the activation function of the network, usually set to ReLU. After the $L$ layers of the network, the hidden-layer node vector representation $Z_t$ is obtained.
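A single graph convolution layer of the kind described in this section can be sketched in plain Python. The symmetric normalization $\tilde{D}^{-1/2}\tilde{A}_t\tilde{D}^{-1/2}$ follows the widely used graph convolutional network formulation and is an assumption here; the helper names `matmul` and `gcn_layer` are hypothetical.

```python
import math

def matmul(p, q):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*q)] for row in p]

def gcn_layer(a_t, z, w):
    """One layer: ReLU(D~^-1/2 (A_t + I) D~^-1/2 Z W)."""
    n = len(a_t)
    # Add self-loops: A~ = A_t + I.
    a_tilde = [[a_t[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # D~ is the degree matrix of A~; only D~^-1/2 is needed.
    d_inv_sqrt = [1 / math.sqrt(sum(row)) for row in a_tilde]
    a_norm = [[d_inv_sqrt[i] * a_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
              for i in range(n)]
    h = matmul(matmul(a_norm, z), w)
    return [[max(0.0, value) for value in row] for row in h]

# Two connected nodes with scalar features 1 and 3 and an identity weight.
layer_out = gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]])
```

With self-loops added, each node averages its own feature with its neighbor's, so both outputs are close to 2.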

2.2) Graph decoder

The graph decoder reconstructs the edge information and the triangle structures in the graph. Reconstructing the edges reflects the basic local structure of the graph, while reconstructing the triangles better captures the high-order community information in the graph. By learning from the two jointly, the algorithm can learn richer information about the graph.

a) Reconstruction of edge information in the graph

The reconstruction of edge information in the graph relies on a loss $L_{u,v}$ defined for each edge (formula not reproduced here), where the edge contains node $u$ and node $v$, $z_u$ denotes the hidden-layer node vector of node $u$, and $z_v$ denotes the hidden-layer node vector of node $v$.

An inner product operation is performed on the hidden-layer node vector representation $Z_t$ learned by the model to reconstruct all edges in the sub-graph, giving a reconstructed adjacency matrix that represents the reconstructed edge relation; the element at each position $(u, v)$ of this matrix enters the term $L_{u,v}$. The loss function is then defined from the difference between the reconstructed sub-graph adjacency matrix and the true sub-graph adjacency matrix $A_t$. Specifically, for each edge $(u, v)$ present in the graph, the computed loss is $L_{u,v}$.
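The inner-product reconstruction can be sketched as below. Passing the inner product through a sigmoid and scoring observed edges with a negative log-likelihood is the common choice in graph autoencoders and is an assumption here, since the exact form of $L_{u,v}$ is not given in this text; the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def reconstruct_adjacency(z):
    """Edge likelihoods sigmoid(z_u . z_v) for every pair of nodes."""
    return [[sigmoid(sum(a * b for a, b in zip(z_u, z_v))) for z_v in z] for z_u in z]

def edge_loss(z, edges):
    """Mean negative log-likelihood of the observed edges (assumed form)."""
    a_hat = reconstruct_adjacency(z)
    return -sum(math.log(a_hat[u][v]) for u, v in edges) / len(edges)

# Nodes 0 and 1 have aligned embeddings; node 2 points the opposite way.
embeddings = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
a_hat = reconstruct_adjacency(embeddings)
```

Aligned embeddings give an edge likelihood above 0.5 and opposed embeddings one below 0.5, so an edge between nodes 0 and 1 is cheaper to reconstruct than an edge between nodes 0 and 2.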

b) Reconstruction of triangle information in the graph

The triangle structure is an important building block of dense subgraphs (communities) and is very helpful for mining high-order information in the graph. To make better use of triangle information, the decoder module explicitly reconstructs the triangle structures in the graph. A triangle consists of three nodes. Given an edge, the relation among the edge's two endpoints and a third node falls into one of the three cases shown in FIG. 2 (the case in which none of the three nodes are connected is excluded, since it is already covered by the reconstruction of edge information). In case 1 all three nodes are connected to one another (three edges), and the three nodes are most tightly connected; in case 2 there are two edges, and the nodes are fairly tightly connected; in case 3 there is one edge, and the nodes are most loosely connected. So that the model can learn these three different relations, the following loss functions are designed.
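The three cases can be told apart by counting how many endpoints of the given edge the third node is linked to. A minimal sketch, assuming an adjacency-set representation and a hypothetical function name `triangle_case`:

```python
def triangle_case(adj, u, v, w):
    """Classify node w relative to the edge (u, v).

    case 1: w is linked to both u and v (three edges, a triangle)
    case 2: w is linked to exactly one of them (two edges)
    case 3: w is linked to neither (only the edge u-v itself)
    """
    if v not in adj[u]:
        raise ValueError("(u, v) must be an edge of the graph")
    if w in (u, v):
        raise ValueError("w must differ from u and v")
    links = int(w in adj[u]) + int(w in adj[v])
    return 3 - links

# Triangle 0-1-2, node 3 attached to node 0 only, node 4 isolated.
case_graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: set()}
cases = {w: triangle_case(case_graph, 0, 1, w) for w in (2, 3, 4)}
```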

First, the mutual information between two vector representations is defined, to reflect how correlated the two vectors are:

$$D(e_1, e_2) = \sigma(e_1^{T} W_d e_2)$$

where $\sigma$ is the logistic sigmoid activation function and $W_d \in \mathbb{R}^{F \times F'}$ is a trainable parameter of the network, with $F$ the dimension of $e_1$ and $F'$ the dimension of $e_2$.
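The score $D(e_1, e_2) = \sigma(e_1^{T} W_d e_2)$ is a bilinear form followed by a sigmoid. The sketch below evaluates it in plain Python; the function name `bilinear_score` and the toy values are assumptions chosen so that $e_1$ has dimension $F = 4$ and $e_2$ has dimension $F' = 2$.

```python
import math

def bilinear_score(e1, e2, w_d):
    """D(e1, e2) = sigmoid(e1^T W_d e2), with W_d of shape F x F'."""
    if len(w_d) != len(e1) or any(len(row) != len(e2) for row in w_d):
        raise ValueError("W_d must have shape len(e1) x len(e2)")
    projected = [sum(w * x for w, x in zip(row, e2)) for row in w_d]  # W_d e2
    logit = sum(a * b for a, b in zip(e1, projected))                 # e1^T (W_d e2)
    return 1 / (1 + math.exp(-logit))

e1 = [1.0, 0.0, 0.0, 1.0]          # F = 4
e2 = [1.0, 2.0]                    # F' = 2
w_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
score = bilinear_score(e1, e2, w_d)
```

Here $W_d e_2 = (1, 2, 1, 2)$ and $e_1^{T} W_d e_2 = 3$, so the score is $\sigma(3)$. Allowing $F \neq F'$ is what lets a vector built from two nodes be compared with the vector of a single node.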

Next, given an edge $(u, v)$ in the sub-graph, all neighbor nodes $R$ of $u$ and $v$ are found, $R \in N_u \cup N_v$, where $N_u$ denotes the set of neighbor nodes of $u$. The relation between a node of $R$ and $(u, v)$ then belongs to case 1 or case 2. To learn this difference, a loss function $L_1$ is designed (formula not reproduced here), in which $z_{u,v}$ denotes the concatenation of the hidden-layer vectors $z_u$ and $z_v$, $z_i$ denotes the hidden-layer vector of node $i$, $i$ ranges over the nodes of $R$ that satisfy case 1, and $j$ ranges over the nodes of $R$ that satisfy case 2.

For the difference between case 2 and case 3, given the edge $(u, v)$, a loss function $L_2$ is designed (formula not reproduced here).

The loss function for reconstructing triangles is then:

$$L_t = \alpha L_1 + L_2$$

where $\alpha$ adjusts the relative weight of the two terms.
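To make the combination $L_t = \alpha L_1 + L_2$ concrete, the sketch below uses one plausible contrastive form for the two terms. This form is purely an assumption: the formulas for $L_1$ and $L_2$ are not given in this text, so the binary cross-entropy used here, the function names, and the toy scores are hypothetical stand-ins. The inputs are values of $D$ for third nodes in case 1, case 2 and case 3.

```python
import math

def contrastive_loss(positive_scores, negative_scores):
    """Binary cross-entropy that pushes positive scores towards 1 and
    negative scores towards 0; scores are values of D in (0, 1)."""
    pos = -sum(math.log(s) for s in positive_scores) / len(positive_scores)
    neg = -sum(math.log(1 - s) for s in negative_scores) / len(negative_scores)
    return pos + neg

def triangle_loss(alpha, case1_scores, case2_scores, case3_scores):
    """L_t = alpha * L_1 + L_2, with hypothetical forms for L_1 and L_2."""
    l1 = contrastive_loss(case1_scores, case2_scores)  # case 1 versus case 2
    l2 = contrastive_loss(case2_scores, case3_scores)  # case 2 versus case 3
    return alpha * l1 + l2

# Scores that respect the ordering case 1 > case 2 > case 3, and the reverse.
well_ordered = triangle_loss(1.0, [0.9], [0.5], [0.1])
badly_ordered = triangle_loss(1.0, [0.1], [0.5], [0.9])
```

Under this assumed form, scores that rank case 1 above case 2 above case 3 yield the smaller loss, and raising `alpha` puts more weight on separating full triangles from two-edge configurations.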

The training and prediction stages of the algorithm are as follows:

In the training stage, a number of sub-graphs are sampled. On each sub-graph, the loss of the neural network is computed from the edge reconstruction loss and the triangle-structure reconstruction loss of that sub-graph, and the network is trained by back-propagation and gradient descent to obtain the trained parameters. In the prediction stage, the adjacency matrix and node vector matrix of the whole graph are input, and the encoder of the graph autoencoder, with its trained parameters, computes the hidden-layer node vector matrix, which is the final learned low-dimensional vector representation of the nodes. Community information in the graph is then mined by K-means clustering of these low-dimensional node vectors. For the K-means clustering method see: Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002 Aug 7;24(7):881-92.
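The final clustering step can be illustrated with a plain Lloyd-style K-means on node embeddings. This sketch is not the accelerated algorithm of the cited reference; the farthest-point initialization, the fixed iteration count and the toy embeddings are assumptions chosen to keep the example deterministic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Cluster points into k groups and return one label per point."""
    # Deterministic initialisation: first point, then farthest points.
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda pt: min(squared_distance(pt, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: squared_distance(pt, centers[c]))
                  for pt in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-D node embeddings forming two well-separated communities.
node_vectors = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
communities = kmeans(node_vectors, 2)
```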

To test the effectiveness of the method, experiments were performed on three public graph datasets: Cora, Citeseer and Pubmed. All three are paper citation datasets in which nodes are papers and edges are citation relations. Cora has 2708 nodes and 5429 edges, Citeseer has 3327 nodes and 4732 edges, and Pubmed has 19717 nodes and 44338 edges. The evaluation metrics in the experiments include normalized mutual information (NMI) and community classification accuracy, which are commonly used for community detection algorithms. The experimental results show that the triangle-structure-sensitive graph autoencoder outperforms a graph autoencoder that reconstructs only the edge information in the graph, by an average of 3 percentage points.
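For reference, normalized mutual information between true and predicted community labels can be computed as sketched below. The source does not say which normalization was used in the experiments; dividing by the arithmetic mean of the two entropies is an assumption of this example, as is the function name.

```python
import math
from collections import Counter

def normalized_mutual_information(true_labels, predicted_labels):
    """NMI = I(T; P) / ((H(T) + H(P)) / 2), using natural logarithms."""
    n = len(true_labels)
    count_true = Counter(true_labels)
    count_pred = Counter(predicted_labels)
    count_joint = Counter(zip(true_labels, predicted_labels))
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )
    entropy_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    entropy_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())
    mean_entropy = (entropy_true + entropy_pred) / 2
    # Both partitions trivial (a single community each): treat as identical.
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0

# NMI ignores label names: swapping community ids still gives a perfect score.
nmi_perfect = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```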

Based on the same inventive concept, another embodiment of the present invention provides a triangle-structure-sensitive device for community discovery in graphs, which employs the above method and includes the graph encoder module, the decoder module and the clustering module described above.

Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps of the inventive method.

Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.

With the advent of the information age, the volume of data on the internet has grown enormously and the relations among data have become ever more complex; the graph data structure can describe and express the characteristics of the data and the relations among them well. In graph data mining, analyzing community information helps people understand the information in a graph from a macroscopic view and has very broad applications. Because manually labeling massive data is extremely costly, community mining on graphs in an unsupervised setting is very important. In particular, the triangle structure is an important component of communities and should be fully exploited when learning community information in a graph. The method performs graph representation learning with graph neural networks, a deep learning technique that has been widely studied and used in recent years, and improves scalability through a sub-graph sampling strategy. It obtains node vector representations by fusing, in the encoder, the structural information of the graph with the attribute information of its nodes, and captures the high-order information in the graph by reconstructing triangle structures in the decoder, for use by downstream community information mining tasks on graphs.

The foregoing disclosure of the specific embodiments of the present invention and the accompanying drawings is directed to an understanding of the present invention and its implementation, and it will be appreciated by those skilled in the art that various alternatives, modifications, and variations may be made without departing from the spirit and scope of the invention. The present invention should not be limited to the disclosure of the embodiments and drawings in the specification, and the scope of the present invention is defined by the scope of the claims.

Claims

1. A community discovery method in a graph sensitive to a triangle structure is characterized by comprising the following steps:

2. The method of claim 1, wherein sub-graph sampling is used to reduce the scale of the training data, and the graph autoencoder is then learned on the sampled sub-graphs.

3. The method of claim 2, wherein the sub-graph sampling comprises:

for node $u$ and node $v$ joined by an edge, the probability that the edge is sampled is $p_{u,v} \propto 1/d_u + 1/d_v$;

given the size of the sampled sub-graph, sampling edges from the whole graph according to the probability, and extracting the sub-graph from the sampled edges, thereby determining the sampled sub-graph.

4. The method of claim 1, wherein the edge relation between two nodes in the graph is reconstructed as follows: given two nodes A and B joined by an edge in the original graph, the likelihood of an edge between A and B is computed through a one-layer inner-product network, thereby reconstructing the edge information already present in the graph.

5. The method of claim 4, wherein the loss function for reconstructing the edge relation between two nodes in the graph is calculated as follows:

performing an inner product operation on the learned hidden-layer node vector representation $Z_t$ and reconstructing all edges in the sub-graph to obtain a reconstructed sub-graph adjacency matrix;

for each edge $(u, v)$ present in the graph, defining a loss function $L_{u,v}$ from the reconstructed sub-graph adjacency matrix and the true sub-graph adjacency matrix $A_t$ (formula not reproduced here),

where $z_u$ denotes the hidden-layer node vector of node $u$ and $z_v$ denotes the hidden-layer node vector of node $v$.

6. The method of claim 1, wherein the triangle structures in the graph are reconstructed as follows: given an edge between A and B, the neighbor sets of A and B are searched; for a node C that is a neighbor of A or B, triangle information is learned according to whether C is connected to both A and B; negative sampling is performed at the same time, sampling a node D that is connected to neither A nor B; and the triangle information is reconstructed and learned from the relations among A, B, C and D.

7. The method of claim 6, wherein the loss function for reconstructing the triangle structures in the graph is calculated as follows:

given an edge, the relation among three nodes is summarized into three cases: in case 1 all three nodes are connected to one another, in case 2 there are two edges, and in case 3 there is one edge;

the mutual information between two vector representations is defined to reflect the correlation of the two vectors: $D(e_1, e_2) = \sigma(e_1^{T} W_d e_2)$, where $\sigma$ is the logistic sigmoid activation function, $W_d \in \mathbb{R}^{F \times F'}$ is a trainable parameter of the network, $F$ is the dimension of the vector $e_1$, and $F'$ is the dimension of the vector $e_2$;

given an edge $(u, v)$ in the sub-graph, all neighbor nodes $R$ of $u$ and $v$ are found, $R \in N_u \cup N_v$, where $N_u$ denotes the set of neighbor nodes of $u$, and the relation between $R$ and $(u, v)$ belongs to case 1 or case 2; for the difference between case 1 and case 2, the loss function is $L_1$ (formula not reproduced here),

where $z_{u,v}$ denotes the concatenation of the hidden-layer vectors $z_u$ and $z_v$, $z_i$ denotes the hidden-layer vector of node $i$, $i$ ranges over the nodes in $R$ that satisfy case 1, and $j$ ranges over the nodes in $R$ that satisfy case 2;

for the difference between case 2 and case 3, given the edge $(u, v)$, the loss function is $L_2$ (formula not reproduced here);

the loss function for reconstructing triangles is then:

$L_t = \alpha L_1 + L_2$

where $\alpha$ adjusts the relative weight of $L_1$ and $L_2$.

8. A community discovery device in a graph sensitive to a triangle structure by using the method of any one of claims 1 to 7, comprising:

9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.