CN112084328A - Scientific and technological thesis clustering analysis method based on variational graph self-encoder and K-Means - Google Patents

Info

Publication number
CN112084328A
Authority
CN
China
Prior art keywords
scientific
thesis
encoder
node
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010742851.9A
Other languages
Chinese (zh)
Inventor
徐新黎
刘锐
肖云月
杨旭华
许营坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202010742851.9A
Publication of CN112084328A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

A scientific and technological paper clustering analysis method based on a variational graph self-encoder and K-Means. A citation network G = (V, E, F) is constructed from existing scientific and technological paper data. A variational graph self-encoder composed of an encoder and a decoder is built from the adjacency matrix A of the citation relations between papers and the feature matrix F of the papers' keyword attributes. The training objective is to minimize the distance between the reconstructed adjacency matrix $\hat{A}$ and the original adjacency matrix A, together with the divergence between the distribution of the node representation vectors and the normal distribution. Unsupervised training yields a multidimensional Gaussian distribution, and sampling from this distribution gives the low-dimensional embedding vector z of each node. The K-Means algorithm then clusters the low-dimensional embedding vectors z to obtain the partition of the papers, which is displayed in two dimensions after dimensionality reduction with the t-SNE algorithm. The invention improves the accuracy of scientific and technological paper clustering analysis and reduces its computational cost.

Description

Scientific and technological thesis clustering analysis method based on variational graph self-encoder and K-Means
Technical Field
The invention relates to the field of network science and machine learning, in particular to a scientific and technological thesis clustering analysis method based on a variational graph self-encoder and K-Means.
Background
Academic papers have developed over more than 350 years, forming a complex citation network for very-large-scale knowledge flow and information dissemination. Implicit in the citation network are research groups made up of authors with similar or related research directions. Community discovery algorithms for complex networks can divide the citation network into such research groups. Besides author clustering, cluster analysis of citation networks includes journal clustering, article clustering, and so on. The citation network is a continually growing scientific network; as it grows over time, cluster analysis of scientific papers becomes harder, which places new demands on the classification management of papers.
One of the most fundamental problems in analyzing a citation network efficiently is how to represent the network. Traditional data mining operates directly on the adjacency matrix, but a high-dimensional sparse adjacency matrix greatly increases storage and computation costs, and many machine learning methods cannot be applied to it directly. To address this, a number of network representation learning methods, including DeepWalk, LINE, and Node2vec, have been proposed in recent years, chiefly to obtain low-dimensional representations of network data.
In addition to the links formed by references between publications, each node in a citation network has relatively rich keyword attributes. However, most existing network representation learning methods map either the network structure or the node attributes into the latent space, and do not explore how the low-dimensional node representation depends on both the node attributes and the network structure. Following the successful application of variational self-encoders to image generation, the variational graph self-encoder proposed by Kipf et al. in 2016 captures both node attributes and network structure and maps each node to a multivariate Gaussian distribution. The feature information of a citation network can therefore be obtained with an unsupervised variational graph self-encoder; once the embedding vectors representing this information are obtained, clustering them with the K-Means algorithm can improve the accuracy of paper partitioning.
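The source does not write out the model equations. As a hedged reference, the variational graph auto-encoder of Kipf and Welling is usually stated as follows; the self-loop normalization $\tilde{A}$, the weight matrices $W_0$, $W_\mu$, $W_\sigma$, and the standard-normal prior are assumptions taken from that formulation, not from this patent.

```latex
\begin{align}
\tilde{A} &= \tilde{D}^{-1/2}(A + I)\,\tilde{D}^{-1/2},
  \qquad \tilde{D}_{ii} = \textstyle\sum_j (A + I)_{ij} \\
H &= \mathrm{ReLU}(\tilde{A} X W_0), \qquad
  \mu = \tilde{A} H W_\mu, \qquad
  \log \sigma = \tilde{A} H W_\sigma \\
q(Z \mid X, A) &= \prod_{i=1}^{n} \mathcal{N}\!\left(z_i \mid \mu_i, \operatorname{diag}(\sigma_i^2)\right),
  \qquad z_i = \mu_i + \sigma_i \odot \epsilon_i,\ \ \epsilon_i \sim \mathcal{N}(0, I) \\
\hat{A}_{ij} &= p(A_{ij} = 1 \mid z_i, z_j) = \operatorname{sigmoid}(z_i^{\top} z_j) \\
\mathcal{L} &= \mathbb{E}_{q(Z \mid X, A)}\!\left[\log p(A \mid Z)\right]
  - \mathrm{KL}\!\left[\,q(Z \mid X, A)\,\|\,p(Z)\,\right],
  \qquad p(Z) = \prod_{i=1}^{n} \mathcal{N}(z_i \mid 0, I)
\end{align}
```

Training maximizes $\mathcal{L}$: the first term rewards a reconstructed adjacency matrix $\hat{A}$ close to $A$, and the second keeps each node's Gaussian close to the normal prior.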
Disclosure of Invention
To address the increasingly difficult classification management and low partition accuracy of papers caused by the ever-growing scale of citation networks, the invention provides an effective scientific and technological paper clustering analysis method based on a variational graph self-encoder and K-Means.
The technical scheme adopted by the invention to solve the technical problem is as follows:
A scientific and technological paper clustering analysis method based on a variational graph self-encoder and K-Means, comprising the following steps:
Step one: Represent the scientific and technological paper data to be analyzed as a citation network $G = (V, E, F)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the node set, each node represents one paper, and the number of nodes, i.e. the total number of papers, is $n = |V|$. $E$ is the edge set: if a citation relation exists between two papers, an edge connects their corresponding nodes, and the edge relations of all papers form an $n \times n$ adjacency matrix $A$. The keyword attributes of the papers are $F = \{f_1, f_2, \ldots, f_m\}$, with the number of attributes $m = |F|$, and the attributes of all papers are represented as an $n \times m$ attribute feature matrix $X$;
Step two: Construct a variational graph self-encoder consisting of an encoder and a decoder. The encoder is a two-layer graph convolutional network (GCN) whose inputs are the feature matrix $X$ and the adjacency matrix $A$ of the citation network. It learns the mean and variance of the low-dimensional node representations, samples from them using the reparameterization method, and outputs the $n \times d$ low-dimensional embedding vectors of the nodes, where $2 \le d \le n$. The decoder takes the low-dimensional node vectors as input, reconstructs the graph by computing, for every pair of nodes, the probability that an edge exists between them, and outputs the reconstructed adjacency matrix $\hat{A}$;
Step three: Train the variational graph self-encoder on the paper data. The training objective is to minimize the distance between the reconstructed adjacency matrix $\hat{A}$ and the original adjacency matrix $A$, together with the divergence between the distribution of the node representation vectors and the normal distribution. Training yields the GCN parameters; the GCN determines a multidimensional Gaussian distribution, and sampling from this distribution gives the low-dimensional embedding vectors of the nodes;
Step four: Set the desired number of partitions of the papers and cluster the low-dimensional embedding vectors with the K-Means algorithm to obtain the partition of the papers;
Step five: Reduce the dimensionality of the partition result with the t-SNE algorithm and display it in two dimensions using the Matplotlib plotting library.
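Steps one and two can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions rather than the patented implementation: the toy papers, keywords, and citations are hypothetical, citations are treated as undirected edges, the weights are random and untrained, and the inner-product decoder and self-loop normalization follow the usual variational graph auto-encoder formulation, which the source does not spell out.

```python
import numpy as np

def build_citation_network(n_papers, citations, paper_keywords, vocabulary):
    """Return the n x n adjacency matrix A and the n x m feature matrix X."""
    A = np.zeros((n_papers, n_papers))
    for i, j in citations:
        A[i, j] = A[j, i] = 1.0  # assumption: a citation is an undirected edge
    index = {kw: k for k, kw in enumerate(vocabulary)}
    X = np.zeros((n_papers, len(vocabulary)))
    for i, kws in enumerate(paper_keywords):
        for kw in kws:
            X[i, index[kw]] = 1.0  # binary keyword indicator
    return A, X

def normalize_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 (assumed GCN convention)."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def encode(A_norm, X, W0, W_mu, W_log_sigma, rng):
    """Two-layer GCN encoder followed by reparameterized sampling."""
    H = np.maximum(A_norm @ X @ W0, 0.0)      # first layer with ReLU
    mu = A_norm @ H @ W_mu                    # mean of each node's Gaussian
    log_sigma = A_norm @ H @ W_log_sigma      # log standard deviation
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(log_sigma) * eps          # reparameterization
    return z, mu, log_sigma

def decode(z):
    """Pairwise edge probabilities: sigmoid of the inner products z_i . z_j."""
    return 1.0 / (1.0 + np.exp(-(z @ z.T)))

# Hypothetical toy data: six papers and four keywords.
citations = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
vocabulary = ["graph", "clustering", "encoder", "network"]
paper_keywords = [["graph", "network"], ["graph"], ["graph", "clustering"],
                  ["encoder"], ["encoder", "clustering"], ["encoder", "network"]]

A, X = build_citation_network(6, citations, paper_keywords, vocabulary)
rng = np.random.default_rng(0)
hidden, d = 8, 2  # hidden width and embedding size (2 <= d <= n)
W0 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
W_mu = rng.normal(scale=0.1, size=(hidden, d))
W_log_sigma = rng.normal(scale=0.1, size=(hidden, d))

z, mu, log_sigma = encode(normalize_adjacency(A), X, W0, W_mu, W_log_sigma, rng)
A_hat = decode(z)  # reconstructed adjacency matrix, entries in (0, 1)
```

With untrained weights `A_hat` carries no useful signal; the sketch only shows the shapes and data flow that training in step three would then optimize.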
The technical conception of the invention is as follows: first, a citation network is constructed from the scientific and technological paper data; the feature matrix X and the adjacency matrix A of the citation network are input into a variational graph self-encoder, which is trained in an unsupervised manner to obtain the node embedding vectors; the papers are then partitioned with K-Means and displayed visually after dimensionality reduction. This improves the accuracy of scientific and technological paper clustering analysis and reduces its computational cost.
The beneficial effects of the invention are as follows: the categories of scientific and technological papers are analyzed with an unsupervised citation network clustering model based on a variational graph self-encoder and K-Means, so the labeling cost of supervised classification training is avoided, the accuracy of paper partitioning is improved, and the computational cost of the analysis is reduced.
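The clustering and visualization stages of this pipeline can be sketched as below. This is an illustrative sketch, not the patent's code: `kmeans` is a plain NumPy version of Lloyd's algorithm, `plot_clusters` is a hypothetical helper that assumes scikit-learn and Matplotlib are installed (imported lazily so the clustering part runs without them), and the two-blob data merely stands in for trained embedding vectors.

```python
import numpy as np

def assign_labels(Z, centers):
    """Index of the nearest center for every row of Z."""
    dists = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(Z, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign_labels(Z, centers)
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign_labels(Z, centers), centers

def plot_clusters(Z, labels, path="clusters.png"):
    """Hypothetical helper: t-SNE to 2-D, then a Matplotlib scatter plot."""
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt
    Z2 = TSNE(n_components=2, perplexity=min(30, len(Z) - 1),
              init="random", random_state=0).fit_transform(Z)
    plt.figure(figsize=(5, 5))
    plt.scatter(Z2[:, 0], Z2[:, 1], c=labels, cmap="tab10", s=10)
    plt.savefig(path)
    plt.close()

# Stand-in for trained embeddings: two well-separated groups of 20 nodes.
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(10.0, 0.5, size=(20, 2))])
labels, centers = kmeans(Z, k=2)
# plot_clusters(Z, labels)  # uncomment when scikit-learn and Matplotlib are available
```

The number of clusters `k` corresponds to the desired number of partitions that the method asks the user to set.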
Drawings
Fig. 1 is a schematic diagram of a simple citation network. Each node represents an article in the citation network, A, B, C, D, E, and F are the article labels, and an edge connects two nodes if one of the articles cites the other.
Fig. 2 is a two-dimensional display of the paper clustering result on the Cora data set, used as the citation network example.
FIG. 3 is a flow chart of a scientific paper clustering method based on variational graph auto-encoders and K-Means.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to Figs. 1 to 3, a scientific and technological paper clustering analysis method based on a variational graph self-encoder and K-Means includes the following steps:
Step one: Represent the scientific and technological paper data to be analyzed as a citation network $G = (V, E, F)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the node set, each node represents one paper, and the number of nodes, i.e. the total number of papers, is $n = |V|$. $E$ is the edge set: if a citation relation exists between two papers, an edge connects their corresponding nodes, and the edge relations of all papers form an $n \times n$ adjacency matrix $A$. The keyword attributes of the papers are $F = \{f_1, f_2, \ldots, f_m\}$, with the number of attributes $m = |F|$, and the attributes of all papers are represented as an $n \times m$ attribute feature matrix $X$;
Step two: Construct a variational graph self-encoder consisting of an encoder and a decoder. The encoder is a two-layer graph convolutional network (GCN) whose inputs are the feature matrix $X$ and the adjacency matrix $A$ of the citation network. It learns the mean and variance of the low-dimensional node representations, samples from them using the reparameterization method, and outputs the $n \times d$ low-dimensional embedding vectors of the nodes, where $2 \le d \le n$. The decoder takes the low-dimensional node vectors as input, reconstructs the graph by computing, for every pair of nodes, the probability that an edge exists between them, and outputs the reconstructed adjacency matrix $\hat{A}$;
Step three: Train the variational graph self-encoder on the paper data. The training objective is to minimize the distance between the reconstructed adjacency matrix $\hat{A}$ and the original adjacency matrix $A$, together with the divergence between the distribution of the node representation vectors and the normal distribution. Training yields the GCN parameters; the GCN determines a multidimensional Gaussian distribution, and sampling from this distribution gives the low-dimensional embedding vectors of the nodes;
Step four: Set the desired number of partitions of the papers and cluster the low-dimensional embedding vectors with the K-Means algorithm to obtain the partition of the papers;
Step five: Reduce the dimensionality of the partition result with the t-SNE algorithm and display it in two dimensions using the Matplotlib plotting library.
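The training objective of step three can be sketched as a single loss function. The source only says "distance" and "divergence"; the binary cross-entropy reconstruction term, the closed-form Gaussian KL term, and the 1/n weighting of the KL term below are assumptions borrowed from common variational graph auto-encoder implementations, and `vgae_loss` is a hypothetical name.

```python
import numpy as np

def vgae_loss(A, A_hat, mu, log_sigma, eps=1e-10):
    """Reconstruction distance plus KL divergence to the standard normal.

    A         : n x n original adjacency matrix (0/1 entries)
    A_hat     : n x n reconstructed edge probabilities in (0, 1)
    mu        : n x d means of the node Gaussians
    log_sigma : n x d log standard deviations of the node Gaussians
    """
    n = A.shape[0]
    # Assumed distance measure: mean binary cross-entropy between A and A_hat.
    recon = -np.mean(A * np.log(A_hat + eps)
                     + (1.0 - A) * np.log(1.0 - A_hat + eps))
    # Closed-form KL[N(mu, sigma^2) || N(0, I)], averaged over nodes.
    kl = -0.5 * np.mean(np.sum(
        1.0 + 2.0 * log_sigma - mu ** 2 - np.exp(2.0 * log_sigma), axis=1))
    return recon + kl / n, recon, kl

# Sanity check on a 4-node path graph with an uninformative reconstruction.
n, d = 4, 2
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = np.full((n, n), 0.5)
mu = np.zeros((n, d))
log_sigma = np.zeros((n, d))
loss, recon, kl = vgae_loss(A, A_hat, mu, log_sigma)
```

In an actual training loop this value would be minimized with respect to the GCN weights by a gradient-based optimizer; that part is omitted here because it needs an automatic differentiation library.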
As described above, the specific implementation steps set out in this patent make the invention clearer. Any modification or variation made within the spirit of the invention and the scope of the claims falls within the scope of protection of the invention.

Claims (1)

1. A scientific and technological paper clustering analysis method based on a variational graph self-encoder and K-Means, characterized in that the method comprises the following steps:
Step one: Represent the scientific and technological paper data to be analyzed as a citation network $G = (V, E, F)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the node set, each node represents one paper, and the number of nodes, i.e. the total number of papers, is $n = |V|$. $E$ is the edge set: if a citation relation exists between two papers, an edge connects their corresponding nodes, and the edge relations of all papers form an $n \times n$ adjacency matrix $A$. The keyword attributes of the papers are $F = \{f_1, f_2, \ldots, f_m\}$, with the number of attributes $m = |F|$, and the attributes of all papers are represented as an $n \times m$ attribute feature matrix $X$;
Step two: Construct a variational graph self-encoder consisting of an encoder and a decoder. The encoder is a two-layer graph convolutional network (GCN) whose inputs are the feature matrix $X$ and the adjacency matrix $A$ of the citation network. It learns the mean $\mu$ and variance $\sigma$ of the low-dimensional node representations, samples from $\mu$ and $\sigma$ using the reparameterization method, and outputs the $n \times d$ low-dimensional embedding vectors $z$ of the nodes, where $2 \le d \le n$. The decoder takes the low-dimensional node vectors $z$ as input, reconstructs the graph by computing, for every pair of nodes, the probability that an edge exists between them, and outputs the reconstructed adjacency matrix $\hat{A}$;
Step three: Train the variational graph self-encoder on the paper data. The training objective is to minimize the distance between the reconstructed adjacency matrix $\hat{A}$ and the original adjacency matrix $A$, together with the divergence between the distribution of the node representation vectors and the normal distribution. Training yields the GCN parameters; the GCN determines a multidimensional Gaussian distribution, and sampling from this distribution gives the low-dimensional embedding vectors $z$ of the nodes;
Step four: Set the desired number of partitions of the papers and cluster the low-dimensional embedding vectors $z$ with the K-Means algorithm to obtain the partition of the papers;
Step five: Reduce the dimensionality of the partition result with the t-SNE algorithm and display it in two dimensions using the Matplotlib plotting library.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010742851.9A CN112084328A (en) 2020-07-29 2020-07-29 Scientific and technological thesis clustering analysis method based on variational graph self-encoder and K-Means

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010742851.9A CN112084328A (en) 2020-07-29 2020-07-29 Scientific and technological thesis clustering analysis method based on variational graph self-encoder and K-Means

Publications (1)

Publication Number Publication Date
CN112084328A true CN112084328A (en) 2020-12-15

Family

ID=73735972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010742851.9A Pending CN112084328A (en) 2020-07-29 2020-07-29 Scientific and technological thesis clustering analysis method based on variational graph self-encoder and K-Means

Country Status (1)

Country Link
CN (1) CN112084328A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105589948A (en) * 2015-12-18 2016-05-18 重庆邮电大学 Document citation network visualization and document recommendation method and system
CN105718528A (en) * 2016-01-15 2016-06-29 上海交通大学 Academic map display method based on reference relationship among thesises
US20190156946A1 (en) * 2017-11-17 2019-05-23 Accenture Global Solutions Limited Accelerated clinical biomarker prediction (acbp) platform
CN110580289A (en) * 2019-08-28 2019-12-17 浙江工业大学 Scientific and technological paper classification method based on stacking automatic encoder and citation network
CN111291190A (en) * 2020-03-23 2020-06-16 腾讯科技(深圳)有限公司 Training method of encoder, information detection method and related device
CN111428091A (en) * 2020-03-19 2020-07-17 腾讯科技(深圳)有限公司 Encoder training method, information recommendation method and related device

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
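BINYUAN HUI et al.: "Collaborative graph convolutional networks: Unsupervised learning meets semi-supervised learning", 《PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》, vol. 34, no. 04, pages 4215 - 4222 *
CHUN WANG et al.: "Attributed graph clustering: A deep attentional embedding approach", 《ARXIV PREPRINT ARXIV: 1906.06532》, pages 1 - 7 *
THOMAS N. KIPF et al.: "Variational graph auto-encoders", 《MACHINE LEARNING》, pages 1 - 3 *
余平刚: "Attributed network representation learning and deep embedded clustering based on variational autoencoders", 《China Master's Theses Full-text Database (Information Science and Technology)》, no. 2019, pages 139 - 17 *
林春燕, 朱东华: "Fuzzy clustering algorithm for scientific literature", Journal of Computer Applications, no. 11, pages 68 - 69 *
白铂; 刘玉婷; 马驰骋; 王光辉; 闫桂英; 闫凯; 张明; 周志恒: "Graph neural networks", Scientia Sinica Mathematica, no. 03, pages 367 - 384 *
陈梦雪; 刘勇: "Network representation learning framework based on adversarial graph convolution", Pattern Recognition and Artificial Intelligence, no. 11, pages 1042 - 1050 *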

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800749A (en) * 2021-01-08 2021-05-14 北京师范大学 Academic space construction method based on H-GCN
CN112784121A (en) * 2021-01-28 2021-05-11 浙江工业大学 Traffic accident prediction method based on space-time diagram representation learning
CN112836736A (en) * 2021-01-28 2021-05-25 哈尔滨理工大学 Hyperspectral image semi-supervised classification method based on depth self-encoder composition
CN112836736B (en) * 2021-01-28 2022-12-30 哈尔滨理工大学 Hyperspectral image semi-supervised classification method based on depth self-encoder composition
CN112990721A (en) * 2021-03-24 2021-06-18 山西大学 Electric power user value analysis method and system based on payment behaviors
WO2022227957A1 (en) * 2021-04-25 2022-11-03 浙江师范大学 Graph autoencoder-based fusion subspace clustering method and system
CN114817578A (en) * 2022-06-29 2022-07-29 北京邮电大学 Scientific and technological thesis citation relation representation learning method, system and storage medium
CN114817578B (en) * 2022-06-29 2022-09-09 北京邮电大学 Scientific and technological thesis citation relation representation learning method, system and storage medium
CN117113240A (en) * 2023-10-23 2023-11-24 华南理工大学 Dynamic network community discovery method, device, equipment and storage medium
CN117113240B (en) * 2023-10-23 2024-03-26 华南理工大学 Dynamic network community discovery method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination