CN113157957A - Attribute graph document clustering method based on graph convolution neural network - Google Patents

Attribute graph document clustering method based on graph convolution neural network

Info

Publication number
CN113157957A
Authority
CN
China
Prior art keywords
graph
clustering
neural network
attribute
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110244762.6A
Other languages
Chinese (zh)
Inventor
冀俊忠
梁烨
雷名龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110244762.6A
Publication of CN113157957A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G06F 18/232 Non-hierarchical techniques
    • G06F 18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an attribute graph document clustering method based on a graph convolutional neural network, belonging to the field of graph data mining. Specifically, a cross-layer linked graph convolutional neural network learns features of the document attribute graph; a deep cluster estimation model estimates the optimal number of clusters from the node features; the two steps are executed alternately to complete training. The trained model then produces the features of all document attribute graph nodes to be clustered, together with the estimated number of clusters. Finally, taking the features and the estimated cluster number as input, k-means clustering yields the document attribute graph clustering result. When training the cross-layer linked graph convolutional neural network, a self-separation regularization term based on pairwise node similarity encourages the features of nodes in the same cluster to be similar and the features of nodes in different clusters to be far apart, which effectively improves graph clustering performance. The cluster estimation module realizes data-driven estimation of the cluster number, making the whole system better suited to real, label-free data environments.

Description

Attribute graph document clustering method based on graph convolution neural network
Technical Field
The invention belongs to the field of graph data mining, and particularly relates to an attribute graph document clustering method based on a graph convolutional neural network.
Background
Attribute graph clustering is a basic task in the field of graph data mining; its goal is to divide the nodes of a graph into mutually disjoint clusters according to node attributes and graph structure information. Compared with traditional graph clustering methods that use only graph structure information, attribute graph clustering is better suited to scenarios in which nodes carry rich content information. It has wide practical application in fields such as community discovery, protein function module detection, and financial network fraud detection.
A number of graph clustering methods based on deep models have been proposed. Compared with shallow graph clustering methods, deep graph clustering better captures the nonlinear and complex node relations in a graph, which helps improve clustering performance. At present, most existing deep graph clustering methods complete the task with a two-step framework: a feature learning step uses a deep model to learn low-dimensional node features, and a clustering step runs a conventional clustering method, such as k-means or spectral clustering, to finish the graph clustering task. Whether the feature learning step can learn the true features of the attribute graph is crucial for the graph clustering task. Early deep models usually used variants of the Graph Autoencoder (GAE) to capture graph structure information, but GAEs train the neural network using only the structural features of the graph and ignore the node attribute information in the attribute graph, which limits their performance on attribute graph clustering tasks.
In recent years, attribute graph clustering methods have generally used Graph Neural Networks (GNNs) for node feature learning. GNNs aggregate the attribute information of neighboring nodes with learned weights and update the node features iteratively; this forward propagation fuses the structural features and node attributes of the attribute graph and improves data utilization, so GNNs apply naturally to attribute graph clustering tasks and improve clustering performance. Furthermore, the goal of graph clustering is to detect local substructures with dense intra-cluster connections and sparse inter-cluster connections, and the node features learned by GNNs preserve the local similarity of the graph, which benefits the graph clustering task. However, current methods have two limitations. First, the feature learning process lacks guidance from the clustering task, so it is difficult to learn cluster-friendly node features; node distributions in the feature space easily overlap, which hinders further clustering. Second, these methods require the number of clusters to be set manually in advance; in real applications, network data is large and complex, and the number of clusters is usually hard to estimate by hand. Moreover, the actual cluster number is highly task-dependent, and the optimal number of clusters should be determined by the node features themselves. Therefore, a parameter-free attribute graph clustering method based on a graph convolutional neural network is important for graph data mining.
Document clustering aims to divide documents into groups by content, so that documents with similar contents fall into the same group. Existing document clustering methods are based on hierarchy, partitioning, density, and so on; their main idea is to cluster documents with similar features into the same cluster. However, current methods consider only the similarity between document contents during clustering and ignore the citation relationships that exist between documents. Documents that cite each other generally have higher similarity, so the citation relationships of documents can also provide valuable information for clustering.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an attribute graph document clustering method based on a graph convolutional neural network. It addresses the fact that the citation relations of documents are not exploited during document clustering; it can cope with the unbalanced cluster structures in real graph data and learn node features friendly to the clustering task from them, estimate the number of clusters of the graph data from the node features, and thus realize parameter-free attribute graph clustering.
An undirected attribute graph may be represented as G = (V, E, X), where V = {v_1, v_2, ..., v_n} is the set of nodes and E is the set of edges. The adjacency matrix of the graph is denoted A: if there is a connection between nodes $v_i$ and $v_j$, then $A_{ij} = 1$, otherwise $A_{ij} = 0$. $X \in \mathbb{R}^{n \times m}$ is the node attribute matrix of graph G, where n is the number of nodes and m is the dimension of the node attributes. The purpose of attribute graph clustering is to divide the nodes of the attribute graph G into k mutually disjoint clusters; in the invention, k is estimated from the node features by the cluster estimation module.
In order to achieve this purpose, the invention adopts the following technical scheme. First, a cross-layer linked graph convolutional neural network is provided that can cope with the unbalanced cluster structures in real graph data and learn valuable node features from them, with strong robustness. Then, a regularization term that encourages node feature self-separation is provided, realizing joint optimization of feature learning and graph clustering. Finally, a cluster estimation module is provided to realize data-driven estimation of the cluster number.
The specific technical scheme is as follows: first, the cross-layer linked graph convolutional neural network is trained with the attribute graph data, self-separation of the node features is encouraged during training, and the features of the graph nodes are output; second, the cluster estimation module is trained with the features of the graph nodes and outputs the optimal cluster number; the two steps are executed alternately until the maximum number of iterations is reached; finally, the features of the graph nodes are clustered with the k-means algorithm.
step (1) attribute graph feature learning:
step (1.1) attribute graph data encoding: the method uses a cross-layer linked graph convolutional neural network to encode the attribute graph data. Encoding the graph data significantly reduces the computational complexity of clustering and avoids the overfitting that graph data sparsity may cause. The propagation rule of the graph convolutional neural network from layer $l-1$ to layer $l$ is:

$$h_i^{(l)} = \mathrm{ReLU}\Big(W^{(l)} \sum_{v_j \in N(v_i \mid A) \cup \{v_i\}} \frac{h_j^{(l-1)}}{\sqrt{\deg(v_i)\,\deg(v_j)}}\Big) \tag{1}$$

where $N(v_i \mid A)$ denotes, in the citation network represented by adjacency matrix $A$, the set of documents that have a citation relation with document $v_i$, i.e., its neighbor documents, $i = 1, \ldots, n$. $W^{(l)}$ is the parameter matrix of the $l$-th layer, and $\deg(v)$ is the degree of node $v$. When $l = 1$, $h_j^{(0)} = x_j$ in equation (1); that is, the first graph convolution layer aggregates the raw features of the neighbor documents. $\mathrm{ReLU}(\cdot)$ is a nonlinear activation function.

The cross-layer linked graph convolutional neural network concatenates the output vectors of every graph convolution layer. Let $h_i^{(l)}$ denote the output of the $l$-th graph convolution layer for node $v_i$; the encoding result $d_i$ of the cross-layer linked network for node $v_i$ is the concatenation of the outputs of all $L$ layers for node $v_i$:

$$d_i = \big[h_i^{(1)}; h_i^{(2)}; \ldots; h_i^{(L)}\big] \tag{2}$$

The encoding result $d_i$ is the concatenated vector of the outputs of all graph convolution layers; a linear mapping is applied to it to output the node feature $z_i$ learned by the graph convolutional neural network. In the encoding step, the method uses a 6-layer cross-layer linked graph convolutional neural network.
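As a concrete illustration of equations (1) and (2), the following is a minimal PyTorch sketch of the cross-layer linked encoder. It is not the patent's reference implementation; the symmetric degree normalization, layer widths, and class names are assumptions consistent with the formulas above.

```python
# Minimal sketch of the cross-layer linked GCN encoder (PyTorch).
# The symmetric normalization D^{-1/2}(A+I)D^{-1/2}, layer widths, and
# names are assumptions consistent with eqs. (1)-(2), not the patent's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Symmetrically normalized adjacency with self-loops added."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return torch.diag(d_inv_sqrt) @ A_hat @ torch.diag(d_inv_sqrt)

class CrossLayerGCNEncoder(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, z_dim: int, num_layers: int = 6):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * num_layers
        self.layers = nn.ModuleList(
            nn.Linear(dims[l], dims[l + 1], bias=False) for l in range(num_layers)
        )
        # Linear map from the concatenation of all layer outputs to z (eq. 2).
        self.project = nn.Linear(hidden_dim * num_layers, z_dim)

    def forward(self, X: torch.Tensor, A_norm: torch.Tensor) -> torch.Tensor:
        h, outputs = X, []
        for W in self.layers:
            h = F.relu(A_norm @ W(h))  # eq. (1): aggregate neighbors, then ReLU
            outputs.append(h)
        d = torch.cat(outputs, dim=1)  # eq. (2): cross-layer concatenation d_i
        return self.project(d)         # node features z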
Step (1.2) node feature decoding:

The method uses a multilayer perceptron to decode the attribute matrix:

$$\hat{x}_i = \mathrm{MLP}_s(z_i; W_D) \tag{3}$$

where $\hat{x}_i$ is the decoded output of node feature $z_i \in \mathbb{R}^{d_e}$, $d_e$ is the dimension of the encoded vector, and $\mathrm{MLP}_s$ denotes an $s$-layer multilayer perceptron; the method uses a 2-layer perceptron. $W_D$ contains the decoder parameters. So that the node feature $z_i$ retains the true original graph information and clustering accuracy is ensured, the decoder output $\hat{x}_i$ should preserve the original graph attribute information $x_i$ as much as possible. The method therefore uses the mean squared error (MSE) loss as the optimization target of the graph convolutional neural network:

$$\mathcal{L}_{res} = \frac{1}{n}\sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 \tag{4}$$

The mean squared error measures the difference between $x_i$ and $\hat{x}_i$. The method optimizes this objective during training so that the node feature $z_i$ can be decoded back into the original graph attribute information $x_i$, thereby ensuring that $z_i$ retains the graph attribute information.
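A matching sketch of the two-layer perceptron decoder and the reconstruction loss of equation (4) follows; the hidden width of 500 is borrowed from the detailed description below, and the remaining choices are illustrative assumptions.

```python
# Sketch of the 2-layer MLP decoder (eq. 3) and the MSE loss (eq. 4).
# The hidden width (500) follows the detailed description; other choices
# are illustrative assumptions.
import torch
import torch.nn as nn

class MLPDecoder(nn.Module):
    def __init__(self, z_dim: int, attr_dim: int, hidden: int = 500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, attr_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)  # x_hat: reconstructed node attributes

def reconstruction_loss(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    # L_res = (1/n) * sum_i ||x_i - x_hat_i||^2  (eq. 4)
    return ((x - x_hat) ** 2).sum(dim=1).mean()
```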
Step (1.3) cluster-friendly graph feature learning:

The method provides a regularization term that encourages the natural cluster structures of the nodes to separate in the feature space; the regularization term effectively alleviates the overlapping problem in graph clustering and further improves clustering performance. Specifically, the pairwise similarity of the graph node features is first modeled with a Student t distribution $Q$:

$$q_{ij} = \frac{\big(1 + \lVert z_i - z_j \rVert^2\big)^{-1}}{\sum_{i' \neq j'} \big(1 + \lVert z_{i'} - z_{j'} \rVert^2\big)^{-1}} \tag{5}$$

where $z_i$ is the feature vector of node $v_i$ and $q_{ij}$ can be seen as the probability that nodes $v_i$ and $v_j$ have similar features in the feature space. Because the Student t distribution is heavy-tailed, nodes with lower similarity lie farther apart in the feature space, which macroscopically separates the natural cluster structures in the graph and relieves intra-cluster crowding. To strengthen the tendency toward inter-cluster separation and separate the nodes further in the feature space, the model encourages the distribution $Q$ to approach another Student t distribution $P$ with a higher degree of freedom, realizing cluster-friendly feature learning:

$$p_{ij} = \frac{\big(1 + \lVert z_i - z_j \rVert^2 / \theta\big)^{-\frac{\theta+1}{2}}}{\sum_{i' \neq j'} \big(1 + \lVert z_{i'} - z_{j'} \rVert^2 / \theta\big)^{-\frac{\theta+1}{2}}} \tag{6}$$

where the parameter $\theta$ controls the degree of freedom of the Student t distribution and is set to $d_e - 1$, with $d_e$ the dimension of the feature $z$. The higher degree of freedom makes the target distribution $P$ more concentrated than $Q$, so $P$ assigns higher probability to similar nodes in the feature space, making them more compact, and lower probability to dissimilar nodes, making them more dispersed, thereby achieving self-separating regularization. The self-separation regularization term is defined as:

$$\mathcal{L}_{reg} = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{7}$$

where KL is the Kullback-Leibler divergence, which measures the asymmetric distance between distributions $P$ and $Q$, and $p_{ij}$ and $q_{ij}$ are the pairwise similarities of the node features. By optimizing the KL divergence between $P$ and $Q$, the encoder increases the inter-cluster distance and decreases the intra-cluster distance. This helps the model learn cluster-friendly features and thus improves graph clustering performance. In summary, the feature learning optimization objective of the graph convolutional neural network is:

$$\mathcal{L} = \mathcal{L}_{res} + \alpha \mathcal{L}_{reg} = \frac{1}{n}\sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 + \alpha\, \mathrm{KL}(P \,\Vert\, Q) \tag{8}$$

where $\hat{x}_i$ is the output of the encoding-decoding process for input node attribute $x_i$, and $\alpha = 0.01$ is a hyperparameter. By optimizing this objective, the graph convolutional neural network learns the node features of the graph while simultaneously optimizing the node distribution for clustering.
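The self-separation regularizer of equations (5)-(7) can be sketched as follows; treating $P$ as a fixed (detached) target during backpropagation is a design assumption of this sketch rather than something the patent states.

```python
# Sketch of the self-separation regularizer (eqs. 5-7): pairwise Student-t
# similarities with 1 and d_e - 1 degrees of freedom, compared via KL(P||Q).
import torch

def student_t_similarity(z: torch.Tensor, dof: float) -> torch.Tensor:
    sq_dist = torch.cdist(z, z).pow(2)                      # ||z_i - z_j||^2
    kernel = (1.0 + sq_dist / dof).pow(-(dof + 1.0) / 2.0)  # heavy-tailed kernel
    kernel = kernel * (1.0 - torch.eye(z.size(0)))          # exclude i == j pairs
    return kernel / kernel.sum()                            # normalize over all pairs

def self_separation_loss(z: torch.Tensor) -> torch.Tensor:
    Q = student_t_similarity(z, dof=1.0)                    # eq. (5)
    # eq. (6); detaching P (treating it as a fixed target) is an assumption
    P = student_t_similarity(z, dof=float(z.size(1) - 1)).detach()
    eps = 1e-12                                             # numerical stability
    return (P * ((P + eps).log() - (Q + eps).log())).sum()  # eq. (7), KL(P||Q)
```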
Step (2): estimating the number of clusters:
most of the traditional clustering methods need a user to specify the number of clustering clusters, and in order to expand the model to a parameter-free condition, the method provides a deep clustering estimation model. The model aims to estimate the optimal cluster number from the node features z.
The model uses a softmax autoencoder for cluster estimation. The softmax autoencoder encodes and decodes the node features $z$ from step (1), and the output layer of the encoder uses a softmax nonlinearity that converts the node features into a soft cluster probability distribution. The number of hidden-layer neurons represents the upper limit of the number of clusters. The output of the softmax encoder is

$$y_i = \mathrm{softmax}\big(\mathrm{enc}(z_i)\big) \in \mathbb{R}^{d_c} \tag{9}$$

where $d_c$ is the number of hidden units. Because softmax activation is used, the components of the cluster assignment $y_i$ sum to 1, i.e., $\sum_{j=1}^{d_c} y_{ij} = 1$. The method sets $d_c$ to the total number of nodes, so that the cluster estimation is completely independent of user-supplied parameters, realizing parameter-free cluster estimation.
The softmax autoencoder estimates the number of clusters by counting the cluster labels, but by itself it tends to produce a uniformly distributed cluster assignment, which is disadvantageous for cluster estimation. Therefore, the model learns a concentrated soft cluster distribution by introducing an additional Gini-index regularization, which can be expressed as:

$$\mathcal{L}_{gini} = \sum_{i=1}^{n} \Big(1 - \sum_{j=1}^{d_c} y_{ij}^2\Big) \tag{10}$$

Ignoring the constant 1, the above can be written as:

$$\mathcal{L}_{gini} = -\sum_{i=1}^{n} \sum_{j=1}^{d_c} y_{ij}^2$$

Optimizing the Gini-index loss promotes low-entropy cluster assignments. Combining the reconstruction loss with the Gini-index regularization, the overall loss function of the softmax autoencoder can be expressed as:

$$\mathcal{L}_c = \frac{1}{n}\sum_{i=1}^{n} \lVert z_i - \hat{z}_i \rVert^2 + \beta \mathcal{L}_{gini} \tag{11}$$

where $z_i$ and $\hat{z}_i$ are the input and output of the softmax autoencoder and $\beta = 0.1$ is a hyperparameter.

The softmax autoencoder is trained with this objective to obtain the cluster assignment result $Y$, where $y_{ij}$ is the probability of assigning node $v_i$ to the $j$-th cluster. Finally, the model counts the number of distinct labels over all nodes as the estimate of the cluster number $k$:

$$k = \mathrm{Card}\big(\{\arg\max_j Y_{ij} \mid i = 1, \ldots, n\}\big) \tag{12}$$

where the Card function counts the number of distinct elements in the set.
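A sketch of the softmax autoencoder of equations (9)-(12) follows; the single linear encoder and decoder layers are an assumption of this sketch, since the patent specifies only the softmax output layer and the hidden width $d_c$.

```python
# Sketch of the softmax autoencoder for cluster-number estimation (eqs. 9-12).
# Single linear encoder/decoder layers are an assumption of this sketch.
import torch
import torch.nn as nn

class SoftmaxAutoencoder(nn.Module):
    def __init__(self, z_dim: int, d_c: int):
        super().__init__()
        self.encoder = nn.Linear(z_dim, d_c)
        self.decoder = nn.Linear(d_c, z_dim)

    def forward(self, z: torch.Tensor):
        y = torch.softmax(self.encoder(z), dim=1)  # soft assignment, eq. (9)
        z_hat = self.decoder(y)                    # reconstruct node features
        return y, z_hat

def cluster_estimation_loss(z, z_hat, y, beta: float = 0.1) -> torch.Tensor:
    mse = ((z - z_hat) ** 2).sum(dim=1).mean()  # reconstruction term of eq. (11)
    gini = -(y ** 2).sum()                      # eq. (10) with the constant dropped
    return mse + beta * gini                    # eq. (11)

def estimate_k(y: torch.Tensor) -> int:
    # eq. (12): number of distinct hard labels over all nodes
    return int(torch.unique(y.argmax(dim=1)).numel())
```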
Step (3): alternately training the graph convolution and cluster estimation modules: the feature learning and cluster estimation above are optimized simultaneously by alternating training. In each iteration, the parameters of the softmax autoencoder are first fixed and feature learning is optimized with equation (8); the parameters of the feature learning model are then fixed and the parameters of the softmax autoencoder are optimized with equation (11); finally, the cluster number k is calculated from the cluster assignment Y with equation (12).
Step (4): outputting the clustering result: the maximum number of optimization iterations of the model is set to 200; once it is reached, the node features $z$ and the estimated cluster number $k_e$ are output. Finally, with $z$ and $k_e$ as input, the k-means clustering algorithm is run to obtain the graph clustering result.
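Putting steps (1)-(4) together, the alternating schedule can be sketched as below, reusing the sketches above; the optimizer settings mirror the detailed description (Adam, learning rate $10^{-3}$), and everything else is an illustrative assumption.

```python
# Sketch of the alternating optimization (step 3) and final clustering (step 4),
# reusing CrossLayerGCNEncoder, MLPDecoder, SoftmaxAutoencoder and the loss
# functions sketched above. Optimizer settings follow the detailed description.
import torch
from sklearn.cluster import KMeans

def train_and_cluster(X, A_norm, encoder, decoder, sae,
                      alpha=0.01, beta=0.1, max_iters=200):
    opt_feat = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    opt_sae = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(max_iters):
        # (a) fix the softmax autoencoder; optimize feature learning (eq. 8)
        z = encoder(X, A_norm)
        loss = reconstruction_loss(X, decoder(z)) + alpha * self_separation_loss(z)
        opt_feat.zero_grad(); loss.backward(); opt_feat.step()
        # (b) fix feature learning; optimize the softmax autoencoder (eq. 11)
        z = encoder(X, A_norm).detach()
        y, z_hat = sae(z)
        sae_loss = cluster_estimation_loss(z, z_hat, y, beta)
        opt_sae.zero_grad(); sae_loss.backward(); opt_sae.step()
    # step (4): estimate k_e (eq. 12) and run k-means on the final features z
    z = encoder(X, A_norm).detach()
    k_e = estimate_k(sae(z)[0])
    labels = KMeans(n_clusters=k_e, n_init=10).fit_predict(z.numpy())
    return labels, k_e
```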
Advantageous effects
The invention effectively improves the clustering accuracy.
Drawings
FIG. 1: a flow chart of an attribute graph clustering method based on a graph convolution neural network is disclosed.
FIG. 2: an attribute graph clustering model structure chart based on a graph convolution neural network.
FIG. 3: and (5) visually displaying the clustering result of the Cora data set.
Detailed Description
The validity of the method is verified using the Cora, Citeseer, and Pubmed literature databases as examples. The document attribute graph data is first constructed from these databases. A document attribute graph may be expressed as G = (A, X), where A is the adjacency matrix: if a citation relationship exists between documents $v_i$ and $v_j$, then $A_{ij} = 1$; otherwise $A_{ij} = 0$. X is the document attribute matrix; the $i$-th row vector $x_i$ of X contains a description of the contents of document $v_i$. X is constructed as follows: (1) eliminate the function words in the documents, i.e., adverbs, prepositions, conjunctions, auxiliary words, and so on; (2) eliminate words with frequency less than 10; (3) construct a word vector feature for each document from the remaining words: if the $j$-th word occurs in document $v_i$, then $x_{ij} = 1$, otherwise $x_{ij} = 0$. The parameters of the constructed document attribute graphs are as follows (see the sketch after Table 1 for an illustrative construction):
TABLE 1

Data set    Nodes    Edges    True clusters    Feature dimension
Cora        2708     5429     7                1433
Citeseer    3327     4732     6                3703
Pubmed      19717    44338    3                500
Wiki        2405     17981    17               4973
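For illustration, the construction rules above can be sketched as follows on a toy corpus; the tokenized documents, citation pairs, and stop-word list are hypothetical inputs, not the patent's preprocessing code.

```python
# Sketch of building the document attribute graph G = (A, X) from a toy
# citation corpus, following rules (1)-(3) above. Inputs are hypothetical.
import numpy as np
from collections import Counter

def build_attribute_graph(docs, citations, stopwords, min_freq=10):
    """docs: list of token lists; citations: (i, j) index pairs of citing docs."""
    n = len(docs)
    A = np.zeros((n, n), dtype=np.int8)
    for i, j in citations:              # symmetric adjacency: A_ij = 1 iff a
        A[i, j] = A[j, i] = 1           # citation exists between v_i and v_j
    counts = Counter(w for d in docs for w in d if w not in stopwords)
    vocab = sorted(w for w, c in counts.items() if c >= min_freq)
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((n, len(vocab)), dtype=np.int8)
    for i, d in enumerate(docs):        # x_ij = 1 iff word j occurs in doc v_i
        for w in d:
            if w in index:
                X[i, index[w]] = 1
    return A, X, vocab
```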
The following describes the specific implementation steps of the invention on the four attribute graph data sets:
step (1) attribute graph feature learning:
step (1.1) attribute graph data encoding: as shown in the encoder portion of FIG. 2, the invention uses a 6-layer graph convolutional neural network to encode the four data sets in Table 1; the output of each graph convolution layer is used as the input of the next layer, and the forward propagation from layer $l-1$ to layer $l$ can be expressed as in equation (1):

$$h_i^{(l)} = \mathrm{ReLU}\Big(W^{(l)} \sum_{v_j \in N(v_i \mid A) \cup \{v_i\}} \frac{h_j^{(l-1)}}{\sqrt{\deg(v_i)\,\deg(v_j)}}\Big)$$

Taking the Cora attribute graph as an example, for a document $v_i$ in the attribute graph, $N(v_i \mid A)$ denotes the set of documents that have a citation relation with document $v_i$, i.e., its neighbor documents, $i = 1, 2, \ldots, 2708$; that is, there are 2708 documents in total. $W^{(l)}$ is the parameter matrix of the $l$-th layer, and $\deg(v_i)$ is the degree of node $v_i$. When $l = 1$, $h_j^{(0)} = x_j$ in equation (1); that is, the first graph convolution layer aggregates the raw features of the neighbor documents of $v_i$. $\mathrm{ReLU}(\cdot)$ is the nonlinear activation function $\mathrm{ReLU}(x) = \max(0, x)$.
As shown in FIG. 2, in addition to forward propagation, the output of each graph convolution layer is passed to an inter-layer aggregation module that concatenates the node features from the different layers, as in equation (2):

$$d_i = \big[h_i^{(1)}; h_i^{(2)}; \ldots; h_i^{(6)}\big]$$

A linear mapping is applied to the concatenated vector to further reduce its dimension and obtain the node feature vector $z$. For the Cora data set, the linear mapping yields a 16-dimensional node feature vector; for the larger Pubmed data set, it yields a 32-dimensional node feature vector.
Step (1.2) node feature decoding: as shown in the decoder portion of FIG. 2, the invention uses a two-layer multilayer perceptron to decode the encoded features $z_i$ of the four data sets, as in equation (3):

$$\hat{x}_i = \mathrm{MLP}_2(z_i; W_D)$$

where $\hat{x}_i$ is the decoded output of node feature $z_i \in \mathbb{R}^{d_e}$ and $d_e$ is the dimension of the encoded vector; for the Cora, Citeseer, Wiki, and Pubmed data sets, $d_e$ is set to 16 and 32, respectively. $\mathrm{MLP}_2$ is a two-layer multilayer perceptron and $W_D$ contains the decoder parameters. The decoder tries to reconstruct the node attribute information from the feature $z_i$, so the method sets the hidden-layer structure of the multilayer perceptron to $d_e$-500-1000-$m$, where $m$ is the dimension of the node attributes; the value of $m$ for each of the four data sets is given in the fifth column of Table 1.
Step (1.3) cluster-friendly graph feature learning: as shown in the middle of FIG. 2, the Student t distribution $Q$ is first used to model the pairwise similarity of the node features $z_i$ of the four data sets, as in equation (5):

$$q_{ij} = \frac{\big(1 + \lVert z_i - z_j \rVert^2\big)^{-1}}{\sum_{i' \neq j'} \big(1 + \lVert z_{i'} - z_{j'} \rVert^2\big)^{-1}}$$

The distribution $Q \in \mathbb{R}^{n \times n}$ is represented as a matrix whose element in row $i$ and column $j$ is $q_{ij}$, where $n$ is the number of nodes of the data set; the values of $n$ for the Cora, Citeseer, Wiki, and Pubmed data sets are given in the second column of Table 1. Then the Student t distribution $P \in \mathbb{R}^{n \times n}$ with a higher degree of freedom is computed in the same way, as in equation (6):

$$p_{ij} = \frac{\big(1 + \lVert z_i - z_j \rVert^2 / \theta\big)^{-\frac{\theta+1}{2}}}{\sum_{i' \neq j'} \big(1 + \lVert z_{i'} - z_{j'} \rVert^2 / \theta\big)^{-\frac{\theta+1}{2}}}$$

where the parameter $\theta$ controls the degree of freedom of the Student t distribution and is set to $d_e - 1$, with $d_e$ the dimension of the feature $z$; $\theta$ is set to 1432, 3702, 499, and 4972 for the Cora, Citeseer, Wiki, and Pubmed data sets, respectively. The higher degree of freedom makes the target distribution $P$ more concentrated than $Q$, so $P$ assigns higher probability to similar nodes in the feature space, making them more compact, and lower probability to dissimilar nodes, making them more dispersed, thereby achieving self-separating regularization. The method optimizes the graph convolutional neural network with the following regularization term, as in equation (7):

$$\mathcal{L}_{reg} = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

where KL is the Kullback-Leibler divergence, measuring the asymmetric distance between distributions $P$ and $Q$, and $p_{ij}$ and $q_{ij}$ are the pairwise similarities of the node features. In summary, the feature learning optimization objective of the graph convolutional neural network is equation (8):

$$\mathcal{L} = \mathcal{L}_{res} + \alpha \mathcal{L}_{reg}$$

where $\hat{x}_i$ in $\mathcal{L}_{res}$ is the output of the encoding-decoding process for input node attribute $x_i$, and $\alpha = 0.01$ is a hyperparameter.
Step (2): estimating the number of clustering clusters: as shown in fig. 2 cluster estimation module: the invention uses a softmax autoencoder for cluster estimation. And (3) performing encoding and decoding operation on the node characteristics z in the step (1) by using a softmax self-encoder, and using a softmax nonlinear function in an output layer of the encoder, wherein the nonlinear function converts the characteristics of the nodes into soft clustering probability distribution. The output of the softmax encoder is
Figure BDA0002963682720000103
Wherein d iscIs the number of hidden units. For different data sets: cora, Citeser, Wiki and Pubmed, the invention will be described in detailcSet to the total number of nodes n.
The Gini-index regularization of the cluster estimation module can be expressed as in equation (10):

$$\mathcal{L}_{gini} = \sum_{i=1}^{n} \Big(1 - \sum_{j=1}^{d_c} y_{ij}^2\Big)$$

Ignoring the constant 1, this can be written as:

$$\mathcal{L}_{gini} = -\sum_{i=1}^{n} \sum_{j=1}^{d_c} y_{ij}^2$$

With equation (10) as the regularization term and the MSE loss as the main optimization objective, the overall loss function of the softmax autoencoder is equation (11):

$$\mathcal{L}_c = \frac{1}{n}\sum_{i=1}^{n} \lVert z_i - \hat{z}_i \rVert^2 + \beta \mathcal{L}_{gini}$$

where $z_i$ and $\hat{z}_i$ represent the input and output of the softmax autoencoder; $\beta$ is set to 0.1 uniformly for the four data sets. The softmax autoencoder is trained with this objective to obtain the cluster assignment result $Y$, where $y_{ij}$ is the probability of assigning node $v_i$ to the $j$-th cluster. Finally, the model counts the number of distinct labels over all nodes as the estimate of the cluster number $k$, computed as in equation (12):

$$k = \mathrm{Card}\big(\{\arg\max_j Y_{ij} \mid i = 1, \ldots, n\}\big)$$

where the Card function counts the number of distinct elements in the set.
Step (3): alternately training the graph convolution and cluster estimation modules: the feature learning and cluster estimation are optimized simultaneously by alternating training. In each iteration, the parameters of the softmax autoencoder are first fixed and feature learning is optimized with equation (8); the parameters of the feature learning model are then fixed and the parameters of the softmax autoencoder are optimized with equation (11); finally, the cluster number k is calculated from the cluster assignment Y with equation (12). The neural network in each module is optimized with the Adam optimizer, with a fixed learning rate of $10^{-3}$ and a dropout rate of 0.2.
Step (4): outputting the clustering result: the maximum number of optimization iterations of the model is set to 200; once it is reached, the node features $z$ and the estimated cluster number $k_e$ are output. Finally, with $z$ and $k_e$ as input, the k-means clustering algorithm is run to obtain the graph clustering result.
To illustrate the beneficial effects of the method of the invention, comparison tests against a number of different algorithms were performed in the implementation. The k-means algorithm is a traditional distance-based clustering method whose main idea is to assign samples to the nearest cluster centers. DeepWalk performs truncated random walks on the graph to generate a set of node sequences, learns node representations with a Skip-gram model, and finally clusters with the k-means algorithm. GraphEncoder learns node representations by training stacked sparse autoencoders and applies spectral clustering to obtain the clustering result. MGAE trains a graph convolutional neural network with an unsupervised reconstruction loss and performs spectral clustering on the node features.
The experimental results of the above comparison algorithm on four data sets of Cora, Citeseer, Wiki and Pubmed are shown in table 2:
TABLE 2

[Table 2 is reproduced as an image in the original document; it reports the ACC, F1, NMI, and ARI scores of each compared algorithm on the four data sets.]
As shown in Table 2, the method of the invention achieves the best results on the real attribute graph data sets, with better performance on the ACC, F1, NMI, and ARI clustering indexes. This shows that the method is reasonable and reliable.
The self-separation regularization term provided by the invention improves the distribution of the node features in the high-dimensional space, which is verified by feature vector visualization: the t-SNE algorithm is run on the feature vectors z of the Cora data set, and the visualization result is shown in FIG. 3. Comparing the first row (a)-(e) with the second row (f)-(j) of FIG. 3, a strong contrast can be observed between not using and using the self-separation regularization term: the cluster structures in the second row are denser and the gaps between different clusters are more pronounced, indicating that self-separation regularization has a large effect on increasing the inter-cluster distance and decreasing the intra-cluster distance.
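The visualization in FIG. 3 can be reproduced along the lines of the following sketch; the output path and plotting parameters are illustrative assumptions.

```python
# Sketch of the t-SNE visualization used for FIG. 3: project the learned
# node features z to 2-D and color points by cluster label. Parameters are
# illustrative assumptions.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_features(z, labels, path="tsne_cora.png"):
    z2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(z)
    plt.figure(figsize=(6, 6))
    plt.scatter(z2d[:, 0], z2d[:, 1], c=labels, s=5, cmap="tab10")
    plt.axis("off")
    plt.savefig(path, dpi=200)
```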
In summary, the invention is based on a graph convolutional neural network. It first encodes and decodes the attribute graph with the proposed cross-layer linked graph convolutional neural network, learning the features of the graph nodes without supervision; during this process, a self-separation regularization term optimizes the feature distribution of the nodes so that they exhibit a more distinct cluster structure. A cluster estimation module is then provided to estimate the optimal number of clusters from the node features. The experimental results and visualization experiments on real data sets show that the method is reasonable and reliable and can provide dependable help for attribute graph clustering tasks.

Claims (9)

1. An attribute graph document clustering method based on a graph convolutional neural network, characterized in that:
step (1), document attribute graph feature learning is performed with a cross-layer linked graph convolutional neural network, comprising an encoding stage and a decoding stage, to obtain the features z of all graph nodes; the features z are used to complete the separation of the natural cluster structures of the graph nodes in the feature space;
step (2), the optimal number of clusters is estimated from the node features z with a deep cluster estimation model;
step (3), the above two steps are executed alternately until the maximum number of iterations is reached, completing training;
step (4), the features of all document attribute graph nodes to be clustered and the estimated number of clusters are obtained with the trained cross-layer linked graph convolutional neural network and deep cluster estimation model; taking the features and the estimated number of clusters as input, a document attribute graph clustering result is obtained with the k-means clustering method.
2. The attribute graph document clustering method based on a graph convolutional neural network of claim 1, characterized in that step (1) further comprises:
step (1.1) attribute graph data encoding: the attribute graph data is encoded; the input document attribute graph is G = (A, X), where A is the adjacency matrix: if a citation relationship exists between documents $v_i$ and $v_j$, then $A_{ij} = 1$, otherwise $A_{ij} = 0$; X is the document attribute matrix, each row vector of which represents a description of the contents of a document, the $i$-th row vector $x_i$ of X representing the description of the contents of document $v_i$; the propagation rule of the graph convolutional neural network from layer $l-1$ to layer $l$ is:

$$h_i^{(l)} = \mathrm{ReLU}\Big(W^{(l)} \sum_{v_j \in N(v_i \mid A) \cup \{v_i\}} \frac{h_j^{(l-1)}}{\sqrt{\deg(v_i)\,\deg(v_j)}}\Big) \quad (1)$$

where $N(v_i \mid A)$ denotes, in the citation network represented by adjacency matrix A, the set of documents that have a citation relationship with document $v_i$, i.e., the neighbor documents, $i = 1, \ldots, n$ for n documents in total; $W^{(l)}$ is the parameter matrix of the $l$-th layer; $\deg(v)$ is the degree of node v; when $l = 1$, $h_j^{(0)} = x_j$ in equation (1), i.e., the first graph convolution layer aggregates the raw features of the neighbor documents; $\mathrm{ReLU}(\cdot)$ is a nonlinear activation function;
the cross-layer linked graph convolutional neural network concatenates the output vectors of every graph convolution layer: with $h_i^{(l)}$ denoting the output of the $l$-th graph convolution layer for node $v_i$, the encoding result $d_i$ of the cross-layer linked graph convolutional neural network for node $v_i$ is the concatenation of the outputs of all layers for node $v_i$, expressed as:

$$d_i = \big[h_i^{(1)}; h_i^{(2)}; \ldots; h_i^{(L)}\big] \quad (2)$$

a linear mapping is applied to the encoding result, outputting the node feature $z_i$ of node $v_i$ learned by the graph convolutional neural network;
step (1.2) node feature decoding:
the attribute matrix is decoded with a multilayer perceptron:

$$\hat{x}_i = \mathrm{MLP}_s(z_i; W_D) \quad (3)$$

where $\hat{x}_i$ is the decoded output of node feature $z_i \in \mathbb{R}^{d_e}$, $d_e$ is the dimension of the encoded vector $z_i$, $\mathrm{MLP}_s$ is an s-layer multilayer perceptron, and $W_D$ contains the decoder parameters.
3. The attribute graph document clustering method based on a graph convolutional neural network of claim 2, characterized in that:
X is constructed as follows: (1) eliminate the function words in all documents; (2) eliminate all words whose frequency in the documents is less than 10; (3) construct a word vector feature for each document from the remaining words: if the $j$-th word occurs in document $v_i$, then $x_{ij} = 1$, otherwise $x_{ij} = 0$.
4. The attribute graph document clustering method based on a graph convolutional neural network of claim 1, characterized in that: the feature learning optimization objective of the cross-layer linked graph convolutional neural network is:

$$\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 + \alpha\, \mathrm{KL}(P \,\Vert\, Q) \quad (8)$$

where $\hat{x}_i$ is the output of the encoding-decoding process for input node attribute $x_i$ and $\alpha$ is a hyperparameter;
$q_{ij}$ represents the probability that nodes $v_i$ and $v_j$ have similar features in the feature space, specifically:

$$q_{ij} = \frac{\big(1 + \lVert z_i - z_j \rVert^2\big)^{-1}}{\sum_{i' \neq j'} \big(1 + \lVert z_{i'} - z_{j'} \rVert^2\big)^{-1}} \quad (5)$$

$p_{ij}$ defines another Student t distribution with a higher degree of freedom, which Q is encouraged to approach to realize cluster-friendly feature learning, specifically:

$$p_{ij} = \frac{\big(1 + \lVert z_i - z_j \rVert^2 / \theta\big)^{-\frac{\theta+1}{2}}}{\sum_{i' \neq j'} \big(1 + \lVert z_{i'} - z_{j'} \rVert^2 / \theta\big)^{-\frac{\theta+1}{2}}} \quad (6)$$

where the parameter $\theta$ controls the degree of freedom of the Student t distribution and is set to $d_e - 1$;
optimizing this objective function realizes, while the graph convolutional neural network learns the node features of the graph, the optimization of the node distribution with clustering as the target.
5. The attribute graph document clustering method based on a graph convolutional neural network of claim 1, characterized in that step (2) further comprises:
the deep cluster estimation model adopts a softmax autoencoder for cluster estimation; the softmax autoencoder encodes and decodes the node features z of step (1), the output layer of the encoder using a softmax nonlinearity that converts the node features into a soft cluster probability distribution; the number of hidden-layer neurons represents the upper limit of the number of clusters; the output of the softmax encoder is the cluster assignment result

$$y_i = \mathrm{softmax}\big(\mathrm{enc}(z_i)\big) \in \mathbb{R}^{d_c} \quad (9)$$

where $d_c$ is the number of hidden units; because softmax activation is used, the components of the cluster assignment $y_i$ sum to 1, i.e., $\sum_{j=1}^{d_c} y_{ij} = 1$;
$d_c$ is set to the total number of nodes to realize completely parameter-free cluster estimation.
6. The attribute graph document clustering method based on a graph convolutional neural network of claim 5, characterized in that: the overall loss function of the softmax autoencoder can be expressed as:

$$\mathcal{L}_c = \frac{1}{n}\sum_{i=1}^{n} \lVert z_i - \hat{z}_i \rVert^2 + \beta \mathcal{L}_{gini} \quad (11)$$

where $z_i$ and $\hat{z}_i$ represent the input and output of the softmax autoencoder, $\beta$ is a hyperparameter, and $y_{ij}$ represents the probability of assigning node $v_i$ to the $j$-th cluster.
7. The attribute graph document clustering method based on a graph convolutional neural network of claim 6, characterized in that the softmax autoencoder counts the number of distinct labels over all nodes as the estimate of the cluster number k, the calculation of k being described as:

$$k = \mathrm{Card}\big(\{\arg\max_j Y_{ij} \mid i = 1, \ldots, n\}\big) \quad (12)$$

where the Card function counts the number of distinct elements in the set and $Y_{ij}$ represents the probability that document $v_i$ belongs to the $j$-th cluster.
8. The attribute graph document clustering method based on a graph convolutional neural network of claim 1, characterized in that step (3) is as follows: the cross-layer linked graph convolutional neural network and the deep cluster estimation model are optimized simultaneously by alternating training; in each iteration, the parameters of the softmax autoencoder are first fixed and feature learning is optimized with equation (8); the parameters of the feature learning model are then fixed and the parameters of the softmax autoencoder are optimized with equation (11); finally, the cluster number k is calculated from the cluster assignment Y with equation (12).
9. The attribute graph document clustering method based on a graph convolutional neural network of claim 1, characterized in that: the attribute graph data encoding step uses a 6-layer cross-layer linked graph convolutional neural network.
CN202110244762.6A 2021-03-05 2021-03-05 Attribute graph document clustering method based on graph convolution neural network Pending CN113157957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110244762.6A CN113157957A (en) 2021-03-05 2021-03-05 Attribute graph document clustering method based on graph convolution neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110244762.6A CN113157957A (en) 2021-03-05 2021-03-05 Attribute graph document clustering method based on graph convolution neural network

Publications (1)

Publication Number Publication Date
CN113157957A true CN113157957A (en) 2021-07-23

Family

ID=76884221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110244762.6A Pending CN113157957A (en) 2021-03-05 2021-03-05 Attribute graph document clustering method based on graph convolution neural network

Country Status (1)

Country Link
CN (1) CN113157957A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505849A (en) * 2021-07-27 2021-10-15 电子科技大学 Multilayer network clustering method based on comparison learning
CN113688386A (en) * 2021-07-26 2021-11-23 中国人民解放军陆军工程大学 Graph structure-based intelligent detection method and system for malicious PDF (Portable document Format) document
CN113869404A (en) * 2021-09-27 2021-12-31 北京工业大学 Self-adaptive graph volume accumulation method for thesis network data
CN114462524A (en) * 2022-01-19 2022-05-10 北京工业大学 Clustering method for data center batch processing operation
CN115545098A (en) * 2022-09-23 2022-12-30 青海师范大学 Node classification method of three-channel graph neural network based on attention mechanism
CN115905617A (en) * 2023-03-02 2023-04-04 南京邮电大学 Video scoring prediction method based on deep neural network and double regularization
WO2024055677A1 (en) * 2022-09-15 2024-03-21 华为技术有限公司 Deep clustering method, apparatus and system
CN117971354A (en) * 2024-03-29 2024-05-03 苏州元脑智能科技有限公司 Heterogeneous acceleration method, device, equipment and storage medium based on end-to-end learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190228312A1 (en) * 2018-01-25 2019-07-25 SparkCognition, Inc. Unsupervised model building for clustering and anomaly detection
CN110399493A (en) * 2019-07-29 2019-11-01 中南大学 A kind of author's disambiguation method based on incremental learning
CN110889015A (en) * 2019-10-31 2020-03-17 天津工业大学 Independent decoupling convolutional neural network characterization algorithm for graph data
CN111259979A (en) * 2020-02-10 2020-06-09 大连理工大学 Deep semi-supervised image clustering method based on label self-adaptive strategy
CN111985623A (en) * 2020-08-28 2020-11-24 复旦大学 Attribute graph group discovery method based on maximized mutual information and graph neural network
CN112380435A (en) * 2020-11-16 2021-02-19 北京大学 Literature recommendation method and recommendation system based on heterogeneous graph neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190228312A1 (en) * 2018-01-25 2019-07-25 SparkCognition, Inc. Unsupervised model building for clustering and anomaly detection
CN110399493A (en) * 2019-07-29 2019-11-01 中南大学 A kind of author's disambiguation method based on incremental learning
CN110889015A (en) * 2019-10-31 2020-03-17 天津工业大学 Independent decoupling convolutional neural network characterization algorithm for graph data
CN111259979A (en) * 2020-02-10 2020-06-09 大连理工大学 Deep semi-supervised image clustering method based on label self-adaptive strategy
CN111985623A (en) * 2020-08-28 2020-11-24 复旦大学 Attribute graph group discovery method based on maximized mutual information and graph neural network
CN112380435A (en) * 2020-11-16 2021-02-19 北京大学 Literature recommendation method and recommendation system based on heterogeneous graph neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李亚芳 et al. (Li Yafang et al.): "基于社区优化的深度网络嵌入方法" (Deep network embedding method based on community optimization), 《计算机应用》 (Journal of Computer Applications), 25 January 2021 (2021-01-25), pages 1957-1963 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688386A (en) * 2021-07-26 2021-11-23 中国人民解放军陆军工程大学 Graph structure-based intelligent detection method and system for malicious PDF (Portable document Format) document
CN113505849A (en) * 2021-07-27 2021-10-15 电子科技大学 Multilayer network clustering method based on comparison learning
CN113505849B (en) * 2021-07-27 2023-09-19 电子科技大学 Multi-layer network clustering method based on contrast learning
CN113869404A (en) * 2021-09-27 2021-12-31 北京工业大学 Self-adaptive graph volume accumulation method for thesis network data
CN113869404B (en) * 2021-09-27 2024-05-28 北京工业大学 Self-adaptive graph roll accumulation method for paper network data
CN114462524A (en) * 2022-01-19 2022-05-10 北京工业大学 Clustering method for data center batch processing operation
WO2024055677A1 (en) * 2022-09-15 2024-03-21 华为技术有限公司 Deep clustering method, apparatus and system
CN115545098A (en) * 2022-09-23 2022-12-30 青海师范大学 Node classification method of three-channel graph neural network based on attention mechanism
CN115545098B (en) * 2022-09-23 2023-09-08 青海师范大学 Node classification method of three-channel graph neural network based on attention mechanism
CN115905617A (en) * 2023-03-02 2023-04-04 南京邮电大学 Video scoring prediction method based on deep neural network and double regularization
CN117971354A (en) * 2024-03-29 2024-05-03 苏州元脑智能科技有限公司 Heterogeneous acceleration method, device, equipment and storage medium based on end-to-end learning
CN117971354B (en) * 2024-03-29 2024-06-14 苏州元脑智能科技有限公司 Heterogeneous acceleration method, device, equipment and storage medium based on end-to-end learning

Similar Documents

Publication Publication Date Title
CN113157957A (en) Attribute graph document clustering method based on graph convolution neural network
CN112417219B (en) Hyper-graph convolution-based hyper-edge link prediction method
CN108108854B (en) Urban road network link prediction method, system and storage medium
CN110929029A (en) Text classification method and system based on graph convolution neural network
CN109389151B (en) Knowledge graph processing method and device based on semi-supervised embedded representation model
CN111785329B (en) Single-cell RNA sequencing clustering method based on countermeasure automatic encoder
CN112966114B (en) Literature classification method and device based on symmetrical graph convolutional neural network
CN113693563B (en) Brain function network classification method based on hypergraph attention network
CN112765352A (en) Graph convolution neural network text classification method based on self-attention mechanism
CN111861756B (en) Group partner detection method based on financial transaction network and realization device thereof
CN111476261A (en) Community-enhanced graph convolution neural network method
CN113065649A (en) Complex network topology graph representation learning method, prediction method and server
CN114118369B (en) Image classification convolutional neural network design method based on group intelligent optimization
CN113220897A (en) Knowledge graph embedding model based on entity-relation association graph
CN114880538A (en) Attribute graph community detection method based on self-supervision
CN115640842A (en) Network representation learning method based on graph attention self-encoder
CN114329233A (en) Cross-region cross-scoring collaborative filtering recommendation method and system
CN116469561A (en) Breast cancer survival prediction method based on deep learning
CN108805280B (en) Image retrieval method and device
CN117036760A (en) Multi-view clustering model implementation method based on graph comparison learning
CN114037014A (en) Reference network clustering method based on graph self-encoder
CN117349494A (en) Graph classification method, system, medium and equipment for space graph convolution neural network
CN115691817A (en) LncRNA-disease association prediction method based on fusion neural network
CN116542080A (en) Condition generation countermeasure network topology optimization method and system based on contrast learning
CN114861863A (en) Heterogeneous graph representation learning method based on meta-path multi-level graph attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination