CN116578708A - Paper data name disambiguation algorithm based on graph neural network - Google Patents


Info

Publication number
CN116578708A
CN116578708A
Authority
CN
China
Prior art keywords
paper
same
author
name
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310584872.6A
Other languages
Chinese (zh)
Inventor
张华熊
汤哲冲
方志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Sci Tech University ZSTU
Original Assignee
Zhejiang Sci Tech University ZSTU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Sci Tech University ZSTU filed Critical Zhejiang Sci Tech University ZSTU
Priority to CN202310584872.6A priority Critical patent/CN116578708A/en
Publication of CN116578708A publication Critical patent/CN116578708A/en
Pending legal-status Critical Current

Classifications

    • G06F16/355 Information retrieval of unstructured textual data; clustering/classification: class or cluster creation or modification
    • G06F16/367 Creation of semantic tools: ontology
    • G06F18/22 Pattern recognition: matching criteria, e.g. proximity measures
    • G06F18/231 Pattern recognition, clustering techniques: hierarchical techniques (dendrogram)
    • G06F40/216 Natural language analysis: parsing using statistical methods
    • G06F40/284 Natural language analysis: lexical analysis, e.g. tokenisation or collocates
    • G06N3/042 Neural networks: knowledge-based neural networks; logical representations of neural networks
    • G06N3/0455 Neural networks, combinations of networks: auto-encoder networks; encoder-decoder networks
    • G06N3/088 Learning methods: non-supervised learning, e.g. competitive learning
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a paper data name disambiguation algorithm based on a graph neural network, which takes each paper as a node of a heterogeneous network, establishes edges through strong correlations among paper attribute features, learns a characterization vector for each paper with an unsupervised graph autoencoder, enhances the papers' vector representations with a hierarchical attention network, and finally disambiguates same-name authors with a hierarchical clustering algorithm. Compared with traditional methods, the method characterizes the nodes of the heterogeneous network with a graph neural network, which makes full use of the association information among nodes and improves disambiguation accuracy; it learns the paper representation vectors with an unsupervised graph autoencoder, avoiding the large amounts of labeled data required by traditional disambiguation methods; and it adopts a hierarchical attention network to learn the weight relations between nodes and between meta-paths, further improving the accuracy of the papers' vector representations and of disambiguation.

Description

Paper data name disambiguation algorithm based on graph neural network
Technical Field
The invention belongs to the technical field of entity disambiguation, and particularly relates to a paper data name disambiguation algorithm based on a graph neural network.
Background
The advent of digital libraries has provided high-quality academic information resources, allowing scholars to conveniently access vast numbers of academic journals, papers, and scholar profiles, which facilitates their research. As scientific research deepens, researchers increasingly demand high-quality academic resources to support their work, so ensuring the accuracy of the data in digital libraries has become particularly important. However, because duplicate author names are ubiquitous and cultural differences lead to inconsistent recording conventions, academic databases contain large numbers of same-name authors. This greatly hampers users' information retrieval: users must spend considerable time screening search results, which increases retrieval difficulty and hinders scientific research. Furthermore, the presence of same-name scholars may cause a scholar's research output to be wrongly attributed to others, affecting the scholar's recognition and reputation and even leading to confusion and incorrect citations. Meanwhile, the citation counts of same-name scholars' articles may be wrongly credited to one particular scholar, distorting academic ranking and evaluation and undermining scientometrics. Disambiguating same-name scholars has therefore become an urgent problem for literature databases.
For the disambiguation of same-name authors, existing solutions fall mainly into the following categories:
1. Supervised disambiguation methods, which classify the documents of same-name authors mainly by training a classification model on a manually labeled training set. However, supervised disambiguation requires a pre-labeled data set, and the cost of manual labeling is too high for disambiguation over large amounts of data, so this approach has inherent limitations.
2. Unsupervised disambiguation methods, which mainly compute similarity from document attribute features and disambiguate with a clustering algorithm. Unsupervised methods need no pre-labeled data set, but a suitable similarity threshold is difficult to choose when computing document similarity; moreover, the number of same-name authors, i.e., the number of clusters, cannot be determined in advance, so disambiguation accuracy is relatively low.
3. Semi-supervised methods, which lie between supervised and unsupervised and use a small amount of labeled data to train a classifier that classifies a large amount of unlabeled data, improving disambiguation accuracy. However, such methods are structurally more complex, their overall performance depends heavily on the completeness of the manually labeled information, the demands on data quality are high, and human labeling may introduce noise.
4. Graph-based disambiguation methods, which usually take authors or papers as network nodes, construct graphs from the relations between papers or between authors and papers, and finally disambiguate by computing inter-node similarity or by clustering. These methods generally disambiguate well, but existing graph-based methods consider only simple relations such as co-authorship and citation, and networks built from such simple relations cannot effectively capture the rich semantic and structural information in paper data.
Disclosure of Invention
In view of the above, the invention provides a paper data name disambiguation algorithm based on a graph neural network, which takes each paper as a node of a heterogeneous network, establishes edges through strong correlations among paper attribute features, learns a characterization vector for each paper with an unsupervised graph autoencoder, enhances the papers' vector representations with a hierarchical attention network, and finally disambiguates same-name authors with a hierarchical clustering algorithm.
A graph neural network-based paper data name disambiguation algorithm, comprising the steps of:
(1) Extracting paper characteristics of each paper in the paper data set by utilizing characteristic engineering as metadata of name disambiguation, and taking each paper as a node in a heterogeneous network;
(2) Dividing the paper data set into a plurality of same-name author clusters with a conversion method based on pinyin initial consonants, so as to resolve the problem that the same author's name has several different spellings;
(3) Word2Vec is used for carrying out Word vector embedding representation on paper features, feature vectors of each paper are generated, further, a triplet loss model is adopted for adjusting the feature vectors, and finally preliminary clustering is carried out based on the feature vectors;
(4) Constructing an academic relationship network from the papers' shared corresponding authors, and performing secondary clustering of same-name authors within the same relationship network based on strong rules;
(5) Learning a distributed representation of nodes in the academic relationship network by using a graph automatic encoder, so as to obtain a characterization vector of each node containing paper attribute information and inter-paper relationship information;
(6) Using a hierarchical attention network comprising a node level and a semantic level to learn the weight relationships among different nodes on the same meta-path and among different meta-paths, and enhancing the characterization vectors of the paper nodes through weighted fusion;
(7) Clustering is carried out through a hierarchical clustering algorithm according to the paper representation vector obtained after enhancement, so that name disambiguation is realized.
Further, the paper features extracted in step (1) consist of two parts, paper attribute features and paper relation features, wherein the paper attribute features comprise the author name (first author), email address, affiliation name and title, and the paper relation features comprise co-authors, keywords and publication venue.
Further, the specific implementation process of the step (2) is as follows:
step1: the author names of all papers are regarded as classes, and a class set A= { a is formed 1 ,a 2 ,…,a n };
Step2: unifying all author names into lowercase and removing special symbols (e.g., commas, semicolons, connectors, etc.);
step3: the pinyin in the name of the author is fully written with unique Chinese characters (for example, zeng corresponds to the ever and Zheng corresponds to Zheng);
step4: analyzing whether the name of the author is a pinyin full name or an initial consonant short letter, and analyzing the pinyin full name into pinyin, the initial consonant corresponding to the pinyin and Chinese characters corresponding to the pinyin;
step5: if any two classes a in set A 1 And a 2 The author names are all pinyin full-writing and the corresponding Chinese characters are the same, or class a 1 And a 2 The author name containing the initial and the corresponding initial are the same, then a will be 1 And a 2 Merging into class a 12 And group a 12 Adding to set A while removing a 1 And a 2
Step6: step5 is repeatedly executed until no class in the set A can be recombined, and clustering is finished.
Further, in step (3), Word2Vec is used to generate word vectors for the features of each paper, TF-IDF is then used to compute the weight of each feature, and finally the feature vector of each paper is obtained as the weighted sum of all its word vectors:
x_i = Σ_{x_m ∈ D_i} f_m · v(x_m)
where x_m denotes a paper feature, D_i denotes the feature set of paper i, x_i denotes the feature vector of paper i, v(x_m) denotes the word vector of feature x_m, and f_m denotes the weight coefficient of feature x_m.
Further, in step (3), the feature vectors are adjusted with a triplet loss model: a large number of positive and negative sample pairs serve as training data, where a positive pair is two papers by the same author and a negative pair is two papers by different authors; the triplet loss model is trained with the loss function ζ_d below, and after training, Word2Vec is recomputed within the model to regenerate each paper's feature vector:
ζ_d = Σ y_ij (1 - y_ik) [ d_ij - d_ik + m ]_+
where y_ij = 1 means papers i and j belong to the same author (a positive pair), y_ik = 0 means papers i and k belong to different authors (a negative pair), d_ij denotes the Euclidean distance between the feature vectors of papers i and j, d_ik that between papers i and k, m is a fixed boundary distance constant, and [x]_+ = max(0, x) is the hinge loss.
Further, in step (3), the similarity between the feature vectors of any two paper nodes in the heterogeneous network is computed with cosine similarity using the adjusted feature vectors, and if the similarity is high enough (i.e., greater than a threshold), an edge is constructed between the two nodes.
Further, since an email address is unique, provided the email information is not missing, two same-name authors who share the same email address are considered the same person; accordingly, in step (4), scholars who have a co-authorship with the same corresponding author belong to the same academic relationship network.
Further, the strong rules in step (4) include:
(1) two papers can be considered to belong to the same author if their authors have the same name, their address information is the same, and they share a co-author;
(2) two papers can be considered to belong to the same author if their authors have the same name, their address information is the same, and they are published in the same venue;
(3) two papers can be considered to belong to the same author if their authors have the same name, their address information is the same, and they share a keyword.
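The three strong rules above can be sketched as a pairwise check. This is an illustrative sketch only: the field names ('name', 'org', 'coauthors', 'venue', 'keywords') are hypothetical stand-ins for the patent's paper attributes, not taken from the source.

```python
# Hedged sketch of the three "strong rules": every rule requires the same
# author name and the same affiliation, plus one of (shared co-author,
# same venue, shared keyword). Field names are illustrative assumptions.
def same_author_by_strong_rules(p1: dict, p2: dict) -> bool:
    """Return True if any strong rule says the two papers share an author."""
    if p1["name"] != p2["name"] or p1["org"] != p2["org"]:
        return False  # all three rules require same name and same affiliation
    shared_coauthor = bool(set(p1["coauthors"]) & set(p2["coauthors"]))
    same_venue = p1["venue"] == p2["venue"]
    shared_keyword = bool(set(p1["keywords"]) & set(p2["keywords"]))
    return shared_coauthor or same_venue or shared_keyword
```

Papers that pass this check would be merged in the second clustering round; pairs that fail are left for the graph-based stages.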
In step (6), the neighbor nodes on the same meta-path are first weighted and fused through a graph attention network to obtain node-level paper characterization vectors; a semantic-level attention mechanism then learns the importance of the different meta-paths and fuses the semantics of each meta-path to obtain the final paper characterization vector. A meta-path is a path formed by nodes connected through the same paper relation feature.
Based on the technical scheme, the invention has the following beneficial technical effects:
1. The method characterizes the nodes of the heterogeneous network with a graph neural network, making full use of the association information among nodes and improving disambiguation accuracy.
2. The method learns the paper representation vectors with an unsupervised graph autoencoder, avoiding the large amounts of labeled data required by traditional disambiguation methods.
3. The invention adopts a hierarchical attention network to learn the weight relations between nodes and between meta-paths, further improving the accuracy of the papers' vector representations and of disambiguation.
Drawings
Fig. 1 is a flow chart of the data name disambiguation algorithm of the present invention.
FIG. 2 is a schematic diagram of a network structure of a triplet loss model.
FIG. 3 is a flow chart of the same-name author disambiguation algorithm of the present invention.
Fig. 4 is a schematic diagram of the network structure of the graph autoencoder.
Fig. 5 is a schematic diagram of the architecture of a hierarchical attention mechanism network.
Fig. 6 is a schematic diagram of node level weight calculation.
Detailed Description
In order to describe the present invention more particularly, the technical scheme of the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the paper data name disambiguation algorithm based on the graph neural network of the invention comprises the following steps:
(1) Data preprocessing.
The original data must be preprocessed, including data cleaning and normalization.
First, for possible problems of missing information and abnormal characters in the original data, the special marker "null" is filled in uniformly. Then, paper attribute features and paper relation features, including author names, email addresses, co-authors, affiliation names, titles, keywords, and publication venues, are extracted through feature engineering and used as disambiguation metadata. Next, the acquired text data is denoised and tokenized: punctuation and special symbols, redundant whitespace and line breaks, stop words, and useless words are removed and the strings are lowercased; after denoising, the NLTK toolkit is used for tokenization and stemming. Finally, an attribute tag is added before each feature to facilitate the subsequent feature weight calculation with TF-IDF.
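The denoising and tokenization step can be sketched in a few lines of standard-library Python. This is a minimal illustration: the stop-word list is a toy stand-in (the patent uses NLTK's resources, which are not reproduced here), and no stemming is shown.

```python
import re

# Toy stop-word list; a real pipeline would use NLTK's stopword corpus.
STOPWORDS = {"a", "an", "the", "of", "and", "in", "for", "on", "with"}

def clean_text(text: str) -> list[str]:
    """Denoise and tokenize one text field, as sketched in the preprocessing step."""
    text = text if text else "null"            # fill missing values with "null"
    text = text.lower()                        # lowercase the string
    text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation / special symbols
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace and line breaks
    return [t for t in text.split() if t not in STOPWORDS]
```

Each cleaned token list would then be prefixed with its attribute tag before TF-IDF weighting.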
(2) Resolving the one-person-many-names problem.
The invention uses a method based on pinyin initial consonants to resolve the problem that the same author's name has several different spellings; the specific implementation is as follows:
Step 1: treat the author name of each paper as one class, forming the set A = {a_1, a_2, …, a_n};
Step 2: convert the author names to lowercase and remove special symbols (e.g., commas, semicolons, hyphens);
Step 3: map each pinyin syllable of a name to a unique Chinese character, e.g., "Zeng" to 曾 and "Zheng" to 郑;
Step 4: determine whether the author name is written in full pinyin or with initial-consonant abbreviations, and parse each full-pinyin name into its pinyin, the initial consonants corresponding to the pinyin, and the Chinese characters corresponding to the pinyin;
Step 5: if classes a_i and a_j both have full-pinyin author names whose corresponding Chinese characters are identical, or both contain initials and the corresponding initials match, merge a_i and a_j into a_ij, add a_ij to set A, and remove a_i and a_j; otherwise go to Step 7;
Step 6: if more than one class remains in the class set, repeat Step 4 and Step 5;
Step 7: clustering ends.
(3) Feature embedding.
The invention uses a Word2Vec model to generate a word vector for each feature, computes the weight of each feature with TF-IDF, and finally obtains the feature vector x_i of each paper through a weighted fusion of all its word vectors:
x_i = Σ_{x_m ∈ D_i} f_m · v(x_m)
where x_m denotes a paper feature, D_i denotes the feature set of the paper, v(x_m) denotes the word vector corresponding to each feature, and f_m denotes the weight coefficient corresponding to each feature.
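The TF-IDF-weighted fusion of feature word vectors can be sketched as below. This is a toy illustration: `word_vecs`, `doc_freq` and `n_docs` stand in for a trained Word2Vec model and corpus document-frequency statistics, which are hypothetical inputs here.

```python
import math
import numpy as np

def paper_vector(features, word_vecs, doc_freq, n_docs):
    """TF-IDF-weighted sum of feature word vectors for one paper.

    features  : list of feature tokens for the paper (duplicates allowed)
    word_vecs : dict token -> np.ndarray word vector (stand-in for Word2Vec)
    doc_freq  : dict token -> number of papers containing the token
    n_docs    : total number of papers in the corpus
    """
    vec = np.zeros(next(iter(word_vecs.values())).shape)
    tf = {f: features.count(f) / len(features) for f in set(features)}
    for f in set(features):
        idf = math.log(n_docs / (1 + doc_freq.get(f, 0)))  # smoothed IDF
        vec += tf[f] * idf * word_vecs[f]                  # f_m * v(x_m)
    return vec
```

The resulting vectors are what the triplet loss model then adjusts.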
After the papers' feature vectors are obtained, they are adjusted through the triplet loss model to obtain a more accurate result. The structure of the triplet loss model is shown in Fig. 2: given two papers D_i and D_j, if they belong to the same author, they form a positive sample pair; conversely, if they belong to different authors, they form a negative sample pair. The purpose of the triplet loss model is to find an accurate distance boundary m that separates positive and negative sample pairs, drawing positive pairs ever closer together and pushing negative pairs ever farther apart. The loss function ζ_d is:
ζ_d = Σ y_ij (1 - y_ik) [ d_ij - d_ik + m ]_+
where d_ij denotes the distance between paper nodes i and j, usually the Euclidean distance d_ij = ‖x_i - x_j‖; y_ij = 1 means the two papers belong to the same author, i.e., form a positive pair; y_ik = 0 means the two papers belong to different authors, i.e., form a negative pair; [x]_+ = max(0, x) is the hinge loss; and m is a fixed boundary distance constant.
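One term of this margin-based loss can be computed directly; the sketch below evaluates [d(a, p) - d(a, n) + m]_+ for a single (anchor, positive, negative) triplet of paper vectors, assuming Euclidean distance as in the text.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss [d(a,p) - d(a,n) + m]_+ for one triplet of paper vectors."""
    d_pos = np.linalg.norm(anchor - positive)   # same-author pair distance
    d_neg = np.linalg.norm(anchor - negative)   # different-author pair distance
    return max(0.0, d_pos - d_neg + margin)
```

Summing this over many sampled triplets gives ζ_d; gradient descent on it pulls same-author papers together and pushes different-author papers apart by at least the margin m.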
Finally, the similarity between the paper feature vectors is computed with cosine similarity, and if the similarity between two papers is high enough, an edge is constructed between their corresponding nodes.
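The edge-construction step can be sketched as a pairwise cosine-similarity scan; the threshold value below is illustrative, since the patent does not fix one.

```python
import numpy as np

def build_edges(vectors, threshold=0.8):
    """Connect two paper nodes when their cosine similarity exceeds the threshold."""
    edges = []
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = vectors[i], vectors[j]
            cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            if cos > threshold:
                edges.append((i, j))
    return edges
```

The resulting edge list defines the adjacency matrix A consumed by the graph autoencoder in the next stage.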
(4) Relationship network construction.
Because email addresses are unique, provided the email information is not missing, two same-name authors who share the same email address are considered the same person, so scholars who co-author with the same corresponding author can be considered to belong to the same academic relationship network. The method constructs a heterogeneous academic relationship network from the papers' shared corresponding authors and disambiguates same-name authors within the same academic relationship network with the following algorithm:
As shown in Fig. 3, a first round of clustering is performed by comparing the email information of papers a_1 and a_2, which reduces the complexity of subsequent clustering and improves efficiency. A second round of clustering then uses the affiliation: if the affiliations are identical, the two papers are considered to belong to the same author; if the top-level institutions are the same but the sub-institutions differ, the decision is re-examined through citation and co-authorship relations, and the papers are clustered if both relations match. When the top-level institutions cannot be matched, disambiguation proceeds by matching the co-authorship relation, the citation relation and the discipline; if all three features match, the papers are considered to belong to the same author and are merged into the corresponding cluster.
(5) Relationship network learning.
The invention uses an unsupervised graph autoencoder to learn distributed representations of the nodes in the heterogeneous network and then predicts the link relations between nodes, thereby obtaining new paper vector representations. The model structure of the graph autoencoder is shown in Fig. 4: it consists of a node encoder Z = g_1(Y, A) and an edge decoder Â = g_2(Z), where Y ∈ R^{N×F} is the embedding matrix of the nodes, A ∈ R^{N×N} is the adjacency matrix of graph G, mainly used to represent the relations between nodes, Z ∈ R^{N×d} is the node embedding matrix, and Â is the adjacency matrix predicted by the model; the goal is to minimize the reconstruction error between the predicted adjacency matrix Â and the original adjacency matrix A.
Encoding part: the graph autoencoder uses a two-layer graph convolutional network (GCN) as the encoder to obtain the node embeddings; the encoder g_1 is computed as:
Z = g_1(Y, A) = Ã · ReLU(Ã Y W_0) · W_1
where Ã = D^{-1/2} A D^{-1/2} is the symmetrically normalized adjacency matrix, D is the node degree matrix of graph G, ReLU(x) = max(0, x), and W_0 and W_1 are the parameters of the first and second layers of the neural network.
Decoding part: the graph autoencoder reconstructs the structural information of the original graph by inner product; the decoder g_2 is computed as:
g_2(Z) = sigmoid(Z^T Z)
The probability of an edge between nodes D_i and D_j is:
p(A_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j)
Cross entropy is used as the loss function:
L = - Σ_{i,j} [ A_ij · log Â_ij + (1 - A_ij) · log(1 - Â_ij) ]
finally, the graph-based automatic encoder can obtain a latent variable Z= [ Z ] containing paper attribute characteristic information and inter-paper relation information 1 ,z 2 ,…,z n ]And treat it as a new vector representation of the paper.
(6) Relationship network enhancement.
The invention uses a hierarchical attention network comprising a node level and a semantic level to enhance the vector representations of the paper nodes; the structure of the network is shown in Fig. 5. In the node-level weight calculation, the neighbor nodes on the same meta-path are weighted and fused through a graph attention network to obtain a better node embedding; the node-level calculation process is shown in Fig. 6. Since the vector representation of each paper has already been obtained with the graph autoencoder, only the weight of each neighbor node needs to be computed; the weight of the neighbor node j of center node i is:
N_ij = att_node(n_i, n_j) = σ(n^T · [n_i ‖ n_j])
where N_ij denotes the importance of node j to node i; it should be noted that because the heterogeneous network is asymmetric, the weight coefficient N_ij is also asymmetric. att_node denotes the node-level weight network used to generate the weights, which is shared by the nodes on the same meta-path; σ denotes the sigmoid activation function; n denotes the node-level attention vector, obtained by training a single-layer feed-forward neural network; and n_i, n_j denote the embedding vectors of the corresponding nodes.
The attention values of the nodes are normalized to obtain the weight coefficients M_ij:
M_ij = softmax(N_ij) = exp(N_ij) / Σ_{k ∈ r_i} exp(N_ik)
where r_i denotes the neighbor nodes of node i on the same meta-path (including node i itself).
The embedding of center node i is obtained by aggregating the neighbor nodes on the meta-path:
z_i = σ( Σ_{j ∈ r_i} M_ij · n_j )
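The node-level score, softmax normalization, and aggregation steps can be sketched together. This is a single-head toy version: the attention vector is passed in directly rather than trained, and the concatenation layout [h_i ‖ h_j] follows the formula above.

```python
import numpy as np

def node_level_embedding(center, neighbors, attn_vec):
    """Node-level attention: score each neighbor with σ(nᵀ[h_i ‖ h_j]),
    softmax-normalize the scores, then aggregate the neighbors."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # raw score per neighbor from the concatenated pair
    scores = np.array([attn_vec @ np.concatenate([center, h]) for h in neighbors])
    scores = sigmoid(scores)
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over neighbors
    return sigmoid((weights[:, None] * np.array(neighbors)).sum(axis=0))
```

In the full model this is computed once per meta-path, yielding one embedding of node i per meta-path for the semantic level to fuse.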
After the node representations under each meta-path are obtained, a semantic-level attention mechanism learns the importance of the different meta-paths and fuses the semantics of each meta-path to obtain the final vector representation. For a given meta-path r_i, its weight w_{r_i} is computed as:
w_{r_i} = att_sem(Z_{r_i}) = (1/|V|) Σ_{i ∈ V} q^T · tanh(W · z_i + b)
where att_sem denotes the semantic-level weight network; W is a weight matrix; q is the semantic-level attention vector, obtained through a feed-forward neural network; V denotes the set of nodes; and b is a bias vector. The weight coefficients S_i of the different types of meta-paths are obtained by softmax normalization:
S_i = exp(w_{r_i}) / Σ_j exp(w_{r_j})
and carrying out weighted calculation on the weight coefficient of the meta-path and node embedding, so as to obtain the final representation of the node i as follows:
the model adopts the loss entropy as a loss function, and a specific calculation formula is as follows:
wherein: c is the parameter of the classifier, y l Representing labeled nodes, Y l And Z l Is the tag value and the predicted value of the tag data.
(7) Clustering.
Finally, according to the paper characterization vectors obtained after enhancement, clustering is performed with a hierarchical clustering algorithm, thereby realizing name disambiguation.
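A minimal sketch of this final step using SciPy's agglomerative clustering tools (the 2-D toy vectors stand in for the enhanced characterization vectors; the linkage method and cluster count are illustrative choices, not specified by the patent):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six toy "paper characterization vectors" under one ambiguous name
X = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],   # papers of one author
    [5.0, 5.1], [5.1, 5.0], [4.9, 5.2],     # papers of another
])
Z = linkage(X, method="average")                 # build the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two clusters
```

Each resulting cluster label corresponds to one disambiguated author identity.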
The foregoing description of the embodiments is provided to enable a person of ordinary skill in the art to make and use the present invention. It will be apparent to those skilled in the art that various modifications to the above-described embodiments may readily be made, and the generic principles described herein may be applied to other embodiments without inventive effort. Therefore, the present invention is not limited to the above-described embodiments; improvements and modifications made by those skilled in the art on the basis of this disclosure shall fall within the protection scope of the present invention.

Claims (9)

1. A graph neural network-based paper data name disambiguation algorithm, comprising the steps of:
(1) Extracting the paper features of each paper in the paper data set by feature engineering, as the metadata for name disambiguation, and taking each paper as a node in a heterogeneous network;
(2) Dividing the paper data set into a plurality of same-name author clusters by a conversion method based on pinyin initials, so as to handle the problem that the name of one author has several different spellings;
(3) Using Word2Vec to embed the paper features as word vectors and generate a feature vector for each paper, further adjusting the feature vectors with a triplet-loss model, and finally performing preliminary clustering based on the feature vectors;
(4) Constructing an academic relationship network according to the shared corresponding authors of the papers, and performing secondary clustering of the same-name authors within the same relationship network based on strong rules;
(5) Learning a distributed representation of the nodes in the academic relationship network with a graph autoencoder, thereby obtaining for each node a characterization vector containing both paper attribute information and inter-paper relationship information;
(6) Using a hierarchical attention network comprising a node level and a semantic level to learn the weight relationships among different nodes on the same meta-path and the weight relationships among different meta-paths, and enhancing the characterization vectors of the paper nodes through weighted fusion;
(7) Clustering the enhanced paper characterization vectors with a hierarchical clustering algorithm, thereby realizing name disambiguation.
2. The paper data name disambiguation algorithm of claim 1, wherein: the paper features extracted in step (1) consist of two parts, paper attribute features and paper relation features; the paper attribute features comprise the author names, email addresses, affiliation addresses and titles, and the paper relation features comprise the co-authors, keywords and publication venues.
3. The paper data name disambiguation algorithm of claim 1, wherein step (2) is implemented as follows:
Step1: the author names of all papers are regarded as classes, forming a class set A = {a_1, a_2, …, a_n};
Step2: all author names are converted to lowercase and special symbols are removed;
Step3: the full-pinyin spelling in each author name is mapped to its corresponding Chinese characters;
Step4: each author name is analyzed to determine whether it is a full-pinyin spelling or an initials abbreviation, and a full-pinyin spelling is parsed into the pinyin, the initials corresponding to the pinyin, and the Chinese characters corresponding to the pinyin;
Step5: if, for any two classes a_1 and a_2 in set A, both author names are full-pinyin spellings whose corresponding Chinese characters are the same, or classes a_1 and a_2 contain initials and the corresponding initials are the same, then a_1 and a_2 are merged into a class a_12, a_12 is added to set A, and a_1 and a_2 are removed;
Step6: Step5 is repeated until no classes in set A can be merged, at which point the clustering is finished.
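The Step1–Step6 loop can be sketched with union-find; the `same` predicate below compares normalized names against their initials as a simplified stand-in for the full pinyin/Chinese-character comparison (all names and helper names here are illustrative):

```python
import re

def normalize(name):
    # Step2: lowercase and strip special symbols
    return re.sub(r"[^a-z ]", "", name.lower()).strip()

def initials(name):
    return " ".join(w[0] for w in name.split())

def same(a, b):
    # simplified stand-in for the full-pinyin / initials comparison of Step5
    return a == b or initials(a) == b or a == initials(b)

def merge_classes(names):
    parent = list(range(len(names)))          # Step1: each name is a class
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    norm = [normalize(n) for n in names]
    for i in range(len(names)):               # Step5: pairwise merging
        for j in range(i + 1, len(names)):
            if same(norm[i], norm[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i, n in enumerate(names):             # Step6: collect final classes
        groups.setdefault(find(i), []).append(n)
    return list(groups.values())
```

Union-find makes the repeated merging of Step6 implicit: transitively matching names end up in one class after a single pairwise pass.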
4. The paper data name disambiguation algorithm of claim 1, wherein: in step (3), Word2Vec is first used to generate a word vector for each paper feature, TF-IDF is then used to calculate the weight of each feature, and the feature vector of each paper is finally obtained as the weighted sum of all its word vectors, with the specific calculation formula:

x_i = Σ_{x_m ∈ D_i} f_m · w_m

wherein: x_m represents a paper feature, D_i represents the feature set of paper i, x_i represents the feature vector of paper i, w_m represents the word vector of paper feature x_m, and f_m represents the weight coefficient of paper feature x_m.
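A toy sketch of this weighted summation (the word vectors and weights below are made-up values; in the method they come from Word2Vec and TF-IDF respectively):

```python
import numpy as np

def paper_vector(features, word_vecs, weights):
    # x_i = sum over features x_m in D_i of f_m * w_m
    return sum(weights[f] * word_vecs[f] for f in features)

word_vecs = {"graph": np.array([1.0, 0.0]), "network": np.array([0.0, 1.0])}
weights = {"graph": 0.7, "network": 0.3}
x = paper_vector(["graph", "network"], word_vecs, weights)
```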
5. The paper data name disambiguation algorithm of claim 1, wherein: in step (3), the feature vectors are adjusted with a triplet-loss model, i.e. a large number of positive and negative sample pairs are used as training data, a positive sample pair being two papers belonging to the same author and a negative sample pair being two papers belonging to different authors; the triplet-loss model is trained with the following loss function ξ_d, and after training, Word2Vec is recalculated within the model to regenerate the feature vector of each paper:

ξ_d = Σ_{(i,j,k)} [d_ij − d_ik + m]_+

wherein: y_ij = 1 denotes that paper i and paper j belong to the same author, i.e. a positive sample pair; y_ik = 0 denotes that paper i and paper k belong to different authors, i.e. a negative sample pair; d_ij is the Euclidean distance between the feature vectors of paper i and paper j; d_ik is the Euclidean distance between the feature vectors of paper i and paper k; m is a fixed margin constant; and [·]_+ is the hinge loss function.
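The hinge term [d_ij − d_ik + m]_+ for a single triplet can be sketched as (input values illustrative):

```python
import numpy as np

def triplet_loss(anchor, pos, neg, m=1.0):
    # [d_ij - d_ik + m]_+ : penalize when the same-author pair is not
    # at least m closer than the different-author pair
    d_pos = np.linalg.norm(anchor - pos)
    d_neg = np.linalg.norm(anchor - neg)
    return max(d_pos - d_neg + m, 0.0)
```

The loss is zero once the negative paper is pushed at least m farther from the anchor than the positive paper.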
6. The paper data name disambiguation algorithm of claim 1, wherein: based on the feature vectors obtained after adjustment, the similarity between the feature vectors of any two paper nodes in the heterogeneous network is computed by traversal using cosine similarity, and if the similarity is sufficiently high, an edge is constructed between the two nodes.
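A sketch of this edge-construction step (the 0.8 threshold is an illustrative choice; the claim only requires the similarity to be "sufficiently high"):

```python
import numpy as np

def build_edges(vectors, threshold=0.8):
    # connect papers i and j when cosine similarity >= threshold
    edges = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            a, b = vectors[i], vectors[j]
            sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
            if sim >= threshold:
                edges.append((i, j))
    return edges
```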
7. The paper data name disambiguation algorithm of claim 1, wherein: since an email address is unique, two same-name authors who share the same email address are considered to be the same person when email information is available; and in step (4), scholars who share the same corresponding author are placed in the same academic relationship network.
8. The paper data name disambiguation algorithm of claim 1, wherein the strong rules in step (4) comprise:
(1) two papers can be considered to belong to the same author if their authors have the same name, their address information is the same, and they share a common co-author;
(2) two papers can be considered to belong to the same author if their authors have the same name, their address information is the same, and they are published in the same venue;
(3) two papers can be considered to belong to the same author if their authors have the same name, their address information is the same, and they share a common keyword.
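The three strong rules share one template (same name, same address, plus one shared signal), so they can be sketched as a single predicate; the dictionary field names are illustrative assumptions:

```python
def same_author_strong_rules(p, q):
    # rules (1)-(3): same name and same address, plus a shared co-author,
    # the same publication venue, or a shared keyword
    if p["name"] != q["name"] or p["address"] != q["address"]:
        return False
    return (bool(set(p["coauthors"]) & set(q["coauthors"]))
            or p["venue"] == q["venue"]
            or bool(set(p["keywords"]) & set(q["keywords"])))
```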
9. The paper data name disambiguation algorithm of claim 1, wherein in step (6): first, the neighbor nodes on the same meta-path are weighted and fused through a graph attention network to obtain node-level paper characterization vectors; then a semantic-level attention mechanism learns the importance of the different meta-paths and fuses their semantics to obtain the final paper characterization vector; a meta-path is a path formed by nodes connected on the basis of the same paper relation feature.
CN202310584872.6A 2023-05-23 2023-05-23 Paper data name disambiguation algorithm based on graph neural network Pending CN116578708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310584872.6A CN116578708A (en) 2023-05-23 2023-05-23 Paper data name disambiguation algorithm based on graph neural network

Publications (1)

Publication Number Publication Date
CN116578708A true CN116578708A (en) 2023-08-11

Family

ID=87539389


Country Status (1)

Country Link
CN (1) CN116578708A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312565A (en) * 2023-11-28 2023-12-29 山东科技大学 Literature author name disambiguation method based on relation fusion and representation learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination