CN112084328A

CN112084328A - Scientific paper clustering analysis method based on variational graph autoencoder and K-Means

Info

Publication number: CN112084328A
Application number: CN202010742851.9A
Authority: CN
Inventors: 徐新黎; 刘锐; 肖云月; 杨旭华; 许营坤
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2020-07-29
Filing date: 2020-07-29
Publication date: 2020-12-15

Abstract

A scientific paper clustering analysis method based on a variational graph autoencoder and K-Means. A citation network G = (V, E, F) is constructed from existing scientific paper data, and a variational graph autoencoder composed of an encoder and a decoder is built from the adjacency matrix A of the citation relations between papers and the feature matrix F of the papers' keyword attributes. The training objective is to minimize the distance between the reconstructed adjacency matrix and the original adjacency matrix A, together with the divergence between the distribution of the node representation vectors and a normal distribution. Unsupervised training yields a multidimensional Gaussian distribution, and the low-dimensional embedding vector z of each node is obtained by sampling from that distribution. The K-Means algorithm then clusters the low-dimensional embedding vectors z to obtain the partition of the scientific papers, which is displayed in two dimensions after dimensionality reduction with the t-SNE algorithm. The invention improves the accuracy of clustering analysis of scientific papers and reduces its computational cost.

Description

Scientific paper clustering analysis method based on variational graph autoencoder and K-Means

Technical Field

The invention relates to the fields of network science and machine learning, and in particular to a scientific paper clustering analysis method based on a variational graph autoencoder and K-Means.

Background

Academic papers have a history of more than 350 years and have formed a complex, ultra-large-scale citation network for knowledge flow and information dissemination. Implicit in the citation network are research communities made up of authors whose research directions are similar or related. A community detection algorithm for complex networks can divide the citation network into different research groups. Besides author clustering, clustering analysis of the citation network also includes journal clustering, article clustering, and so on. The citation network is a growing scientific network whose size increases over time, which makes clustering analysis of scientific papers more difficult and places new demands on their classification and management.

One of the most fundamental problems in analyzing a citation network efficiently is how to represent the network. Traditional data mining operates directly on the adjacency matrix, but the high-dimensional, sparse adjacency matrix greatly increases storage and computation costs and also prevents many machine learning methods from being applied directly. To address this, a number of network representation learning methods, including DeepWalk, LINE, and Node2vec, have been proposed in recent years; their main aim is to obtain a low-dimensional representation of network data.

Besides the links formed by references between publications, each node in the citation network also has relatively rich keyword attributes. However, most existing network representation learning methods map either the network structure or the node attributes into the latent space, and do not explore how the low-dimensional node representation depends on both the node attributes and the network structure. Following the successful application of the variational autoencoder to image generation, the variational graph autoencoder proposed by Kipf et al. in 2016 captures both node attributes and network structure simultaneously and maps each node to a multivariate Gaussian distribution. The feature information of the citation network can therefore be obtained with the unsupervised variational graph autoencoder, and once the corresponding embedding vectors representing this information are obtained, the K-Means clustering algorithm can be used to improve the accuracy of partitioning scientific papers.

Disclosure of Invention

To address problems such as the increasing difficulty of classifying and managing papers and the low partition accuracy caused by the ever-growing scale of citation networks, the invention provides an effective scientific paper clustering analysis method based on a variational graph autoencoder and K-Means.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A scientific paper clustering analysis method based on a variational graph autoencoder and K-Means comprises the following steps:

Step one: the scientific paper data to be analyzed is expressed as a citation network G = (V, E, F), where V = {v₁, v₂, ..., v_n} is the node set, each node represents a scientific paper, and the number of nodes, i.e. the total number of papers, is n = |V|; E is the edge set: if a citation relationship exists between two papers, an edge connects their corresponding nodes, and the edges among all papers form an n × n adjacency matrix A; F = {f₁, f₂, ..., f_m} is the set of keyword attributes of the papers, the number of attributes is m = |F|, and the attributes of all papers are represented as an n × m attribute feature matrix X;

Step two: construct a variational graph autoencoder consisting of an encoder and a decoder. The encoder is a two-layer graph convolutional network (GCN) whose inputs are the feature matrix X and the adjacency matrix A of the citation network; it learns the mean and variance of the low-dimensional node vector representation, samples from them using the reparameterization method, and outputs the n × d low-dimensional embedding vectors of the nodes, where 2 ≤ d ≤ n. The decoder takes the low-dimensional node vectors as input, reconstructs the graph by computing pairwise the probability that an edge exists between two nodes, and outputs the reconstructed adjacency matrix;

Step three: train the variational graph autoencoder with the scientific paper data, the training objective being to minimize the distance between the reconstructed adjacency matrix and the original adjacency matrix A, together with the divergence between the distribution of the node representation vectors and a normal distribution; training yields the parameters of the GCN, the GCN determines a multidimensional Gaussian distribution, and the low-dimensional embedding vectors of the nodes are obtained by sampling from this distribution;

Step four: set the desired number of clusters for the scientific papers, and cluster the low-dimensional embedding vectors with the K-Means algorithm to obtain the partition of the scientific papers;

Step five: reduce the dimensionality of the partition result of the scientific papers with the t-SNE algorithm, and display it in two dimensions using the Matplotlib plotting library.
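As an illustration of step one, the adjacency matrix A and the feature matrix X could be assembled from raw citation pairs and keyword lists roughly as follows. This is a minimal sketch rather than the patented implementation: the function name `build_citation_network`, the input format, the undirected treatment of citations, and the binary keyword encoding are all assumptions made for the example, and the toy papers and keywords are invented.

```python
def build_citation_network(paper_ids, citations, keywords_by_paper):
    """Return (A, X, vocab) for a citation network G = (V, E, F).

    All names and the input layout are hypothetical, chosen for this sketch.
    """
    index = {pid: i for i, pid in enumerate(paper_ids)}
    n = len(paper_ids)

    # n x n adjacency matrix: an undirected edge joins two papers
    # whenever one of them cites the other (assumption: direction is ignored).
    A = [[0] * n for _ in range(n)]
    for citing, cited in citations:
        i, j = index[citing], index[cited]
        if i != j:
            A[i][j] = A[j][i] = 1

    # Keyword attribute set F = {f_1, ..., f_m}, sorted for a stable column order.
    vocab = sorted({kw for kws in keywords_by_paper.values() for kw in kws})
    col = {kw: k for k, kw in enumerate(vocab)}

    # n x m feature matrix: X[i][k] = 1 if paper i carries keyword k
    # (assumption: binary encoding).
    X = [[0] * len(vocab) for _ in range(n)]
    for pid, kws in keywords_by_paper.items():
        for kw in kws:
            X[index[pid]][col[kw]] = 1
    return A, X, vocab


# Toy data, invented for illustration only.
papers = ["A", "B", "C", "D"]
cites = [("A", "B"), ("B", "C"), ("D", "C")]
keywords = {"A": ["graph"], "B": ["graph", "clustering"],
            "C": ["clustering"], "D": ["embedding"]}
A, X, vocab = build_citation_network(papers, cites, keywords)
```

For a network of realistic size a sparse matrix representation would normally replace the nested lists, since A is mostly zeros.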

The technical conception of the invention is as follows: first, a citation network is constructed from the scientific paper data; the feature matrix X and the adjacency matrix A of the citation network are input into a variational graph autoencoder, which is trained in an unsupervised manner to obtain node embedding vectors; the scientific papers are then partitioned with K-Means and displayed visually after dimensionality reduction, thereby improving the accuracy of clustering analysis of scientific papers and reducing its computational cost.
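The clustering stage of this pipeline can be sketched as plain Lloyd's algorithm applied to the node embedding vectors. The sketch below is illustrative only: the text does not specify how K-Means is initialized, so the farthest-first initialization here is an assumption chosen to make the example deterministic, and the toy embeddings are invented rather than produced by a trained encoder.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-first initialization (assumption): start from the first point,
    # then repeatedly add the point farthest from all chosen centers.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))

    labels = [None] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented 2-dimensional embeddings z for six papers forming two groups.
z = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
     [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
labels = kmeans(z, k=2)
```

Here `k` plays the role of the desired number of clusters set before clustering; a library implementation such as scikit-learn's `KMeans` would be the usual choice in practice.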

The beneficial effects of the invention are as follows: an unsupervised citation network clustering model based on a variational graph autoencoder and K-Means is used to analyze the categories of scientific papers; it avoids the labeling cost of supervised classification training, improves the accuracy of partitioning scientific papers, and reduces the computational cost of the analysis.
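No labels are needed because both terms of the training objective depend only on A and X. The text names the two terms (a distance between the reconstructed and original adjacency matrices, and a divergence from a normal distribution) without giving their exact form here. Assuming the method follows the standard variational graph autoencoder formulation of Kipf et al., with a cross-entropy reconstruction term and an inner-product decoder, the objective could be written as follows; the symbols q, p, μ_i and σ_i are taken from that formulation and are assumptions with respect to this document.

```latex
\begin{aligned}
q(Z \mid X, A) &= \prod_{i=1}^{n} \mathcal{N}\!\left(z_i \mid \mu_i, \operatorname{diag}(\sigma_i^2)\right), \qquad
p(Z) = \prod_{i=1}^{n} \mathcal{N}(z_i \mid 0, I), \\
p(A_{ij} = 1 \mid z_i, z_j) &= \operatorname{sigmoid}\!\left(z_i^{\top} z_j\right), \\
\mathcal{L} &= -\,\mathbb{E}_{q(Z \mid X, A)}\!\left[\log p(A \mid Z)\right]
  + \mathrm{KL}\!\left(q(Z \mid X, A) \,\|\, p(Z)\right).
\end{aligned}
```

Minimizing the first term pulls the reconstructed adjacency matrix toward A, and minimizing the second keeps the distribution of node representations close to a standard normal distribution.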

Drawings

Fig. 1 is a schematic diagram of a simple citation network, in which the nodes represent articles in the citation network and A, B, C, D, E, and F are the corresponding article labels; if one article cites another, an edge connects the two nodes.

Fig. 2 is a two-dimensional display of the scientific paper clustering results for the Cora dataset, a citation network example.

Fig. 3 is a flow chart of the scientific paper clustering method based on a variational graph autoencoder and K-Means.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to Fig. 1 to Fig. 3, a scientific paper clustering analysis method based on a variational graph autoencoder and K-Means includes the following steps:

Step two: construct a variational graph autoencoder consisting of an encoder and a decoder. The encoder is a two-layer graph convolutional network (GCN) whose inputs are the feature matrix X and the adjacency matrix A of the citation network; it learns the mean and variance of the low-dimensional node vector representation, samples from them using the reparameterization method, and outputs the n × d low-dimensional embedding vectors of the nodes, where 2 ≤ d ≤ n. The decoder takes the low-dimensional node vectors as input, reconstructs the graph by computing pairwise the probability that an edge exists between two nodes, and outputs the reconstructed adjacency matrix.
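A single, untrained forward pass through such an encoder and decoder might look like the sketch below. It is written with the Python standard library only so that it runs without dependencies, and it is not the patented implementation. Several details are assumptions borrowed from the standard variational graph autoencoder formulation: the symmetric normalization of A with self-loops, the ReLU activation, the shared first layer, the parameterization of the variance through log σ, and the inner-product decoder. The weights are random, and all function and variable names are hypothetical.

```python
import math
import random


def matmul(P, Q):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Q)] for row in P]


def normalize_adjacency(A):
    """D^{-1/2} (A + I) D^{-1/2}, the usual GCN normalization (assumption)."""
    n = len(A)
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    d = [sum(row) ** -0.5 for row in A_tilde]
    return [[d[i] * A_tilde[i][j] * d[j] for j in range(n)] for i in range(n)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def vgae_forward(A, X, hidden=4, d=2, seed=0):
    """Encode (A, X), sample embeddings Z, decode edge probabilities P."""
    rng = random.Random(seed)
    A_norm = normalize_adjacency(A)
    W0 = random_matrix(len(X[0]), hidden, rng)   # first GCN layer
    W_mu = random_matrix(hidden, d, rng)         # second layer, mean head
    W_logsig = random_matrix(hidden, d, rng)     # second layer, log-sigma head

    # Layer 1: H = ReLU(A_norm X W0)
    H = [[max(0.0, v) for v in row] for row in matmul(A_norm, matmul(X, W0))]
    # Layer 2: mean and log standard deviation of each node's Gaussian
    mu = matmul(A_norm, matmul(H, W_mu))
    log_sigma = matmul(A_norm, matmul(H, W_logsig))

    # Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1)
    Z = [[m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu_i, ls_i)]
         for mu_i, ls_i in zip(mu, log_sigma)]

    # Inner-product decoder: P[i][j] = sigmoid(z_i . z_j)
    P = [[1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(zi, zj)))) for zj in Z]
         for zi in Z]
    return Z, P


# Three papers in a chain, two keyword attributes (invented toy input).
A_toy = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X_toy = [[1, 0], [0, 1], [1, 1]]
Z, P = vgae_forward(A_toy, X_toy)
```

`Z` is the n × d embedding matrix and `P` holds the pairwise edge probabilities that form the reconstructed adjacency matrix. Sampling through `mu + sigma * eps` keeps the random draw separate from the parameters, which is what allows gradients to reach the encoder weights during training.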

The specific implementation steps described above make the invention clearer. Any modification or variation of the invention made within its spirit and the scope of the claims falls within the protection scope of the invention.

Claims

1. A scientific paper clustering analysis method based on a variational graph autoencoder and K-Means, characterized in that the method comprises the following steps:

step two: constructing a variational graph autoencoder consisting of an encoder and a decoder, wherein the encoder is a two-layer graph convolutional network (GCN) that takes as input the feature matrix X and the adjacency matrix A of the citation network, learns the mean μ and variance σ of the low-dimensional node vector representation, samples from the mean μ and variance σ using the reparameterization method, and outputs the n × d low-dimensional embedding vectors z of the nodes, where 2 ≤ d ≤ n; the decoder takes the low-dimensional node vectors z as input, reconstructs the graph by computing pairwise the probability that an edge exists between two nodes, and outputs the reconstructed adjacency matrix;

step three: training the variational graph autoencoder with the scientific paper data, the training objective being to minimize the distance between the reconstructed adjacency matrix and the original adjacency matrix A, together with the divergence between the distribution of the node representation vectors and a normal distribution; training yields the parameters of the GCN, the GCN determines a multidimensional Gaussian distribution, and the low-dimensional embedding vectors z of the nodes are obtained by sampling from this distribution;

step four: setting the desired number of clusters for the scientific papers, and clustering the low-dimensional embedding vectors z with the K-Means algorithm to obtain the partition of the scientific papers;