CN112131872A - Document author duplicate name disambiguation method and construction system - Google Patents

Document author duplicate name disambiguation method and construction system

Info

Publication number
CN112131872A
CN112131872A (application CN202010987031.6A)
Authority
CN
China
Prior art keywords
document
author
data
similarity
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010987031.6A
Other languages
Chinese (zh)
Inventor
李微
胡晟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sanluoxuan Big Data Technology Kunshan Co ltd
Original Assignee
Sanluoxuan Big Data Technology Kunshan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sanluoxuan Big Data Technology Kunshan Co ltd filed Critical Sanluoxuan Big Data Technology Kunshan Co ltd
Priority to CN202010987031.6A
Publication of CN112131872A
Current legal status: Withdrawn

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F40/00 Handling natural language data → G06F40/20 Natural language analysis → G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The invention discloses a method and a construction system for disambiguating duplicate names of document authors, comprising the following steps: step one: reading document data and scholar data from a database; step two: training a Word2Vec model and predicting a document vector for each document; step three: constructing a partner (collaboration) relation network graph of the authors to be disambiguated, and computing node similarity and clustering; step four: acquiring the document vectors of the documents in each cluster produced by the partner relation graph clustering, and computing the similarity between document clusters and clustering them. The invention achieves a high level of both accuracy and recall in the disambiguation result, and is suitable for multilingual, multi-type document collections such as Chinese papers, English papers and patents.

Description

Document author duplicate name disambiguation method and construction system
Technical Field
The invention belongs to the technical field of document processing, and particularly relates to a method for disambiguating duplicate names of document authors.
Background
With the rapid development of science and technology and the continuous fusion of information, the duplicate-name phenomenon that is widespread in the real world greatly affects data retrieval and processing, especially when flexible and varied natural-language data are handled. Named entity disambiguation arose to address this; it studies how to match ambiguous entity mentions with the correct entities in a knowledge base. Author disambiguation is a form of named entity disambiguation: in the real world different people may share the same name, and in many applications such as scientific literature management and information integration, personal names are used as identifiers of the retrieved information, so name ambiguity can greatly impair the quality of retrieval. Author disambiguation is essentially a classification problem that requires accurately partitioning documents and assigning them to the different real-world authors who share a name.
Many approaches can complete the task of document author name disambiguation. Most existing methods are based on information contained in the documents themselves and mainly include feature-based methods, graph-partition-based methods, and methods based on classifying web resources.
Disclosure of Invention
The invention mainly solves the technical problem of providing a method and a construction system for duplicate name disambiguation of document authors, so as to solve the technical problem.
In order to solve the technical problems, the invention adopts a technical scheme that: a method for disambiguating duplicate names of document authors comprises the following steps:
step one: reading document data and scholar data from a database;
step two: training and predicting a document vector of each document by using a Word2Vec model;
step three: constructing a partner relation network graph of the author to be disambiguated and calculating node similarity and clustering;
step four: acquiring the document vectors of the documents in each cluster produced by the partner relation graph clustering, and computing the similarity between the document clusters and clustering them.
Further, the first step specifically includes:
reading relevant data from a literature database and a scholars database of a company respectively, comprising:
(1) ID, title, author, organization, abstract, journal, year, keyword in Chinese thesis data;
(2) ID, title, author, organization, abstract, journal, year, keyword in English paper data;
(3) ID, title, inventor, abstract, date, unit of publication in patent data;
wherein the Chinese paper-author, English paper-author and patent-inventor fields are used to extract the collaboration data, namely the nodes and edges of the collaboration network; the Chinese paper abstract, the English paper abstract and the patent abstract are used to train a word vector model with the Word2Vec model and to extract document vectors, so that textual information can be incorporated in the duplicate name disambiguation process.
Further, the second step specifically includes:
the subject content of a document comprises its title, keywords and abstract; the title and abstract are concatenated into one string and segmented into words, feature words are extracted, and the feature words are merged with the keywords and then trained with the Skip-Gram model of Word2Vec, with the output dimension set, to obtain a word vector model;
and finally, the IDF value α_i and the word vector ω_i of every feature word of the document D_i are computed over all documents; the document vector p_i is calculated by the following formula:
Figure BDA0002689606160000021
further, the third step specifically includes:
(1) acquiring the data of the nodes, including the name of the author to be disambiguated and the names of the collaborators, wherein the nodes of the author to be disambiguated are designed in the form "author name-document id", so the number of such nodes equals the number of documents, and the collaborator nodes are designed as "author name";
(2) acquiring the data of the edges and extracting the pairwise correspondences between author names;
(3) representing the "author-paper" relations of all extracted documents as a graph G = (V, E, W), wherein each node v ∈ V represents an instance of an author and an undirected edge e ∈ E indicates that two authors co-authored a document;
(4) calculating the similarity of the nodes to be disambiguated, wherein the similarity function is as follows:
Figure BDA0002689606160000031
where P_ij is the set of valid paths of length no greater than 4 joining the two nodes v_i and v_j;
(5) and constructing a similarity matrix, and clustering by using AP clustering.
Further, the fourth step specifically includes:
(1) computing the similarity between two document vectors p_i and p_j; the document vector similarity s_ij is calculated by the following formula:
s_ij = (p_i · p_j) / (‖p_i‖ × ‖p_j‖)
(2) computing the similarity between two document clusters c_a and c_b; the document cluster similarity s_ab is calculated by the following formula:
s_ab = max{ s_ij : p_i ∈ c_a, p_j ∈ c_b }
the invention adopts a further technical scheme for solving the technical problems that:
a construction system for duplicate name disambiguation of document authors comprises:
the data acquisition module comprises a database connecting component for connecting a database; the query component is used for executing database query statements and returning corresponding results;
the data preprocessing module comprises a document deduplication component and is used for removing duplicated documents; an error document format modification component for modifying an error document format; the author organization normalization component is used for normalizing the unit information of the author; the key attribute missing value processing component is used for processing the key attribute missing record; the document structuring component is used for converting document data into json files so as to facilitate subsequent processing;
the document vector generation module, comprising a user-defined word segmentation dictionary component for expanding the word segmentation dictionary with the keywords; a word vector model training component for training the Chinese and English literature data separately with the Skip-Gram model of the Word2Vec model to obtain and store word vector models; and a document vector generation component, which feeds each word of a document into the word vector model to predict its word vector, calculates its IDF value as a weight, and finally combines the weighted word vectors into the document vector;
the partner relation graph clustering module, comprising a partner relation graph construction component for reading the document data under the name of the author to be disambiguated and constructing the collaboration network graph, in which single-author documents are stored separately as their own cluster divisions; a similarity calculation component for calculating the path similarity value of each author node to be disambiguated and constructing the similarity matrix; and a clustering component for performing AP clustering on the basis of the similarity matrix to obtain the final clusters;
the semantic feature clustering module, comprising a cluster data loading component for reading the document vector data of each cluster from the partner relation graph clustering module and the institution data of the authors to be disambiguated; a similarity calculation component for calculating the similarity between document clusters; and a clustering component for clustering the document clusters on the basis of the similarity to obtain the final document cluster division information.
The invention has the following beneficial effects:
1. The invention considers not only the strong feature information of the document collaborator relationship but also weak feature information of the documents such as semantic information, so it can mine the document data to the greatest extent and obtain a more accurate and complete document cluster division;
2. The invention can be used for duplicate name disambiguation of document authors across different formats and languages, and has good compatibility.
Drawings
FIG. 1 is a schematic diagram of the duplicate name disambiguation method of the present invention;
FIG. 2 is a schematic view of the Word2Vec model of the present invention;
FIG. 3 is a document vector generation flow diagram of the present invention;
FIG. 4 is a flowchart of partner relationship graph clustering in accordance with the present invention;
FIG. 5 is a repeat path process flow diagram of the present invention;
FIG. 6 is a semantic feature clustering flow diagram of the present invention;
FIG. 7 is a system block diagram of the present invention;
FIG. 8 is a table comparing the accuracy of the Chinese paper, patent and English paper evaluation results of the present invention;
FIG. 9 is a comparison table of Chinese paper, patent and English paper evaluation result recall rates of the present invention;
FIG. 10 is a comparison table of the evaluation results F1 of the Chinese paper, patent and English paper according to the present invention.
Detailed Description
The following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the invention easier for those skilled in the art to understand, and thereby define the protection scope of the invention more clearly.
Embodiment: a method for disambiguating duplicate names of document authors, as shown in FIG. 1, comprises the following steps:
step one: reading document data and scholar data from a database;
step two: training and predicting a document vector of each document by using a Word2Vec model;
step three: constructing a partner relation network graph of the author to be disambiguated and calculating node similarity and clustering;
step four: acquiring the document vectors of the documents in each cluster produced by the partner relation graph clustering, and computing the similarity between the document clusters and clustering them.
The first step specifically comprises:
reading relevant data from a literature database and a scholars database of a company respectively, comprising:
(1) ID, title, author, organization, abstract, journal, year, keyword in Chinese thesis data;
(2) ID, title, author, organization, abstract, journal, year, keyword in English paper data;
(3) ID, title, inventor, abstract, date, unit of publication in patent data;
wherein the Chinese paper-author, English paper-author and patent-inventor fields are used to extract the collaboration data, namely the nodes and edges of the collaboration network; the Chinese paper abstract, the English paper abstract and the patent abstract are used to train a word vector model with the Word2Vec model and to extract document vectors, so that textual information can be incorporated in the duplicate name disambiguation process.
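As an illustrative aid, the following minimal Python sketch shows how the three record types listed above could be read and unified for the later steps; the database file, table names and column names (cn_paper, en_paper, patent, authors, inventors and so on) are assumptions for demonstration and are not the schema of the actual literature and scholar databases.

import json
import sqlite3

# Fields listed in (1)-(3) above, expressed as illustrative column names.
FIELDS = {
    "cn_paper": ["id", "title", "authors", "org", "abstract", "journal", "year", "keywords"],
    "en_paper": ["id", "title", "authors", "org", "abstract", "journal", "year", "keywords"],
    "patent":   ["id", "title", "inventors", "abstract", "date", "applicant"],
}

def load_documents(db_path, author_name):
    """Read every document type mentioning author_name and unify the fields."""
    docs = []
    with sqlite3.connect(db_path) as conn:
        for table, cols in FIELDS.items():
            author_col = "inventors" if table == "patent" else "authors"
            query = ("SELECT " + ", ".join(cols) + " FROM " + table +
                     " WHERE " + author_col + " LIKE ?")
            for row in conn.execute(query, ("%" + author_name + "%",)):
                rec = dict(zip(cols, row))
                rec["doc_type"] = table
                # authors/inventors feed the collaboration network (step three);
                # title/abstract/keywords feed the document vectors (step two)
                docs.append(rec)
    return docs

# The preprocessing described later serialises the unified records to JSON:
# with open("docs.json", "w", encoding="utf-8") as f:
#     json.dump(load_documents("literature.db", "Zhang San"), f, ensure_ascii=False)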
The second step specifically comprises:
the subject content of a document comprises its title, keywords and abstract; the title and abstract are concatenated into one string and segmented into words, feature words are extracted, and the feature words are merged with the keywords and then trained with the Skip-Gram model of Word2Vec, with the output dimension set, to obtain a word vector model; as shown in fig. 2, ω(t) represents the target word in the text, and ω(t-2), ω(t-1), ω(t+1), ω(t+2) and so on are the words adjacent to the target word in the text;
and finally, the IDF value α_i and the word vector ω_i of every feature word of the document D_i are computed over all documents; the document vector p_i is calculated by the following formula:
Figure BDA0002689606160000051
The specific steps of the document vector model generation process are shown in fig. 3; the Chinese and English steps are the same but are trained separately.
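For concreteness, here is a minimal sketch of this step under stated assumptions: gensim (4.x) is used as the Skip-Gram Word2Vec implementation, and the document vector is taken to be the IDF-weighted sum of the feature-word vectors; the exact weighting and normalisation of the formula shown as an image above may differ, so this is an illustration rather than the patented formula.

import math
from collections import Counter

from gensim.models import Word2Vec

def train_word_vectors(tokenized_docs, dim=100):
    # sg=1 selects the Skip-Gram architecture; vector_size is the output dimension
    return Word2Vec(sentences=tokenized_docs, vector_size=dim, sg=1,
                    window=5, min_count=1, workers=4)

def idf_table(tokenized_docs):
    # alpha_i: IDF of each feature word over the whole corpus
    n_docs = len(tokenized_docs)
    df = Counter(w for doc in tokenized_docs for w in set(doc))
    return {w: math.log(n_docs / df[w]) for w in df}

def document_vector(doc_tokens, model, idf):
    # p_i as the IDF-weighted combination of the word vectors omega_i (assumed form)
    return sum(idf.get(w, 0.0) * model.wv[w] for w in doc_tokens if w in model.wv)

# Usage sketch:
# docs = [["author", "name", "disambiguation"], ["word", "vector", "clustering"]]
# model = train_word_vectors(docs)
# idf = idf_table(docs)
# p0 = document_vector(docs[0], model, idf)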
The flowchart of the third step is shown in fig. 4, and specifically includes:
(1) acquiring the data of the nodes, including the name of the author to be disambiguated and the names of the collaborators, wherein the nodes of the author to be disambiguated are designed in the form "author name-document id", so the number of such nodes equals the number of documents, and the collaborator nodes are designed as "author name";
(2) acquiring the data of the edges: in the Chinese paper, English paper and patent collaboration data, several names appear together in one column (for example: Zhang San, Li Si, Wang Wu), and the pairwise correspondences need to be extracted (for example: Zhang San-Li Si, Zhang San-Wang Wu, Li Si-Wang Wu). The final similarity calculation depends on path retrieval between the target authors, and when building the relation graph of a paper that contains a target author, connecting all non-target nodes pairwise would produce many redundant searches of little significance and greatly increase the amount of computation. As shown in fig. 5, if A1 collaborates with B and C on document P1, then when calculating the paths between A1 and A2, A1 → C → A2 can be regarded as a path worth searching, while A1 → B → C → A2 is an invalid path. Therefore, when adding a relation graph containing a target-author document to the collaboration graph, only the edges between the target author and the collaborators are established and the connections between collaborators are not considered; for example, when the P1 document relation graph in FIG. 5 is established, there are only the two edges A1 → B and A1 → C, and no edge B → C;
(3) representing the "author-paper" relations of all extracted documents as a graph G = (V, E, W), wherein each node v ∈ V represents an instance of an author name and an undirected edge e ∈ E indicates that two authors co-authored a document, which eliminates redundant paths in the subsequent work. For a single-author paper, since there are no collaborators, collaborator relation clustering cannot be used; such papers are labeled and stored separately and are used later in the semantic feature clustering;
(4) after the graph is constructed, the similarity of the nodes to be disambiguated is calculated. To increase the efficiency of the graph path search, only paths of length no greater than 4 are taken and the valid paths are searched before the similarity is calculated. The length and number of the valid paths are the parameters used to design the similarity calculation, and an obvious fact is that the shorter and the more numerous the valid paths, the greater the similarity of the two nodes. Therefore, for authors v_i and v_j, with P_ij the set of valid paths of length no greater than 4 joining the two nodes, the similarity function is designed as follows:
Figure BDA0002689606160000061
if no path exists between two nodes, the similarity is set to -10 rather than negative infinity, for the accuracy of the subsequent clustering result;
(5) calculating the similarity of the nodes to be disambiguated and constructing the similarity matrix, then clustering with AP (affinity propagation), setting the initial preference parameter to the median of all the data, and obtaining the clustering result (an illustrative code sketch of steps (1)-(5) follows the pseudo-code below);
the algorithm pseudo-code is as follows:
Figure BDA0002689606160000071
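The following rough Python sketch illustrates steps (1)-(5): the collaboration graph is built with networkx, each pair of to-be-disambiguated nodes is scored from the valid paths of length at most 4 between them, missing paths get a similarity of -10, and affinity propagation is run on the resulting matrix with the median as the preference. The per-path contribution of 1/(path length) is an assumption standing in for the similarity formula given as an image above.

import networkx as nx
import numpy as np
from sklearn.cluster import AffinityPropagation

def build_graph(docs, target_name):
    g = nx.Graph()
    target_nodes = []
    for d in docs:
        node = target_name + "#" + str(d["id"])    # "author name - document id" node
        target_nodes.append(node)
        g.add_node(node)
        for co in d["authors"]:
            if co != target_name:
                g.add_edge(node, co)               # only target-collaborator edges
    return g, target_nodes

def path_similarity(g, u, v, max_len=4):
    paths = list(nx.all_simple_paths(g, u, v, cutoff=max_len))
    if not paths:
        return -10.0                               # no path: -10 rather than -inf
    # shorter and more numerous valid paths -> larger similarity (assumed form)
    return sum(1.0 / (len(p) - 1) for p in paths)

def ap_cluster(docs, target_name):
    g, nodes = build_graph(docs, target_name)
    n = len(nodes)
    s = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s[i, j] = s[j, i] = path_similarity(g, nodes[i], nodes[j])
    ap = AffinityPropagation(affinity="precomputed",
                             preference=np.median(s), random_state=0)
    return ap.fit_predict(s)                       # cluster label per document node

# Usage sketch:
# docs = [{"id": 1, "authors": ["Zhang San", "Li Si"]},
#         {"id": 2, "authors": ["Zhang San", "Li Si", "Wang Wu"]},
#         {"id": 3, "authors": ["Zhang San", "Qian Qi"]}]
# labels = ap_cluster(docs, "Zhang San")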
the effect of the relation clustering of the collaborators is good under the condition that the number of the collaborators is sufficient, but the relation clustering of the collaborators has the limitation that the relation clustering of the collaborators cannot be completely applied to an actual literature database. Many documents in document databases are rare in relation to collaborators and documents written by single authors, and clustering results by using the strong feature of the relationship between collaborators shows that although the accuracy of the same author in a result cluster is very high, a large number of clusters are increased compared with actual results. Therefore, the obtained results need to be clustered continuously by adopting other characteristics of the literature until the final result is similar to or identical to the actual result.
To address this shortcoming of the partner relation graph clustering algorithm, the method also uses subject content features: because an author keeps relatively fixed research topics and directions over a certain period, the semantic similarity of the subject content of documents under the same real author is high, whereas the subject content similarity of documents by different same-named authors is not. Making full use of subject content features to assist disambiguation on top of partner relation graph clustering effectively reduces the influence of single-author papers and scholars with few collaborations, and further improves the disambiguation effect. The flow chart of this step is shown in fig. 6.
The documents handled by the system basically all contain the authors' institution feature, but most do not contain college or department information. Therefore, the institution features of the authors in the document clusters obtained from collaborator relation clustering are first extracted and marked, and clusters belonging to different institutions are not merged in the subsequent clustering step. In this clustering, the subject content feature similarity of documents is computed with cosine similarity. Let the previously generated document vectors of two documents under the duplicated author name be p_i and p_j; the document vector similarity s_ij is calculated by the following formula:
s_ij = (p_i · p_j) / (‖p_i‖ × ‖p_j‖)
However, since this clustering is performed after the collaborator clustering, that is, the similarity between two document clusters must be calculated, the two-document formula cannot simply be used; instead, the similarity of every document pair across the two clusters is computed and the maximum result is selected as the final similarity. Let the two clusters be c_a and c_b; the document cluster similarity s_ab is calculated by the following formula:
s_ab = max{ s_ij : p_i ∈ c_a, p_j ∈ c_b }
After the similarities are obtained, the target clusters are merged again by hierarchical clustering until the clustering no longer changes or the threshold is reached. Different thresholds were tried in the experiments; to obtain the best effect, the threshold is finally set to 0.7 for Chinese papers and patents and to 0.65 for English papers (an illustrative code sketch follows the pseudo-code below).
The algorithm pseudo-code is as follows:
Figure BDA0002689606160000091
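A rough sketch of this step follows, under stated assumptions: cosine similarity between document vectors, cluster-to-cluster similarity as the maximum over all cross-cluster document pairs, greedy agglomerative merging until no pair reaches the threshold (0.7 for Chinese papers and patents, 0.65 for English papers), and a simple guard so that clusters with different known institutions are never merged.

import numpy as np

def cosine(p_i, p_j):
    # s_ij = (p_i . p_j) / (|p_i| * |p_j|)
    return float(np.dot(p_i, p_j) / (np.linalg.norm(p_i) * np.linalg.norm(p_j)))

def cluster_similarity(c_a, c_b):
    # s_ab = max over all pairs (p_i in c_a, p_j in c_b) of s_ij
    return max(cosine(p_i, p_j) for p_i in c_a for p_j in c_b)

def merge_clusters(clusters, threshold=0.7, institutions=None):
    """clusters: list of lists of document vectors; institutions: optional labels."""
    clusters = [list(c) for c in clusters]
    inst = list(institutions) if institutions else [None] * len(clusters)
    while len(clusters) > 1:
        best, bi, bj = threshold, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if inst[i] and inst[j] and inst[i] != inst[j]:
                    continue                      # different institutions: never merge
                s = cluster_similarity(clusters[i], clusters[j])
                if s >= best:
                    best, bi, bj = s, i, j
        if bi < 0:                                # no pair reaches the threshold
            break
        inst_bj = inst.pop(bj)
        clusters[bi].extend(clusters.pop(bj))
        if inst[bi] is None:
            inst[bi] = inst_bj
    return clusters

# Usage sketch (three toy document vectors):
# merged = merge_clusters([[np.array([1.0, 0.0])],
#                          [np.array([0.9, 0.1])],
#                          [np.array([0.0, 1.0])]], threshold=0.7)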
a construction system for duplicate name disambiguation of document authors comprises:
the data acquisition module (mainly used for acquiring the relevant data, including Chinese paper data, English paper data and patent data, from the database), comprising a database connection component for connecting to the database, and a query component for executing database query statements and returning the corresponding results;
a data preprocessing module (mainly used for preprocessing data) comprising a document deduplication component for removing duplicate documents; an error document format modification component for modifying an error document format; the author organization normalization component is used for normalizing the unit information of the author; the key attribute missing value processing component is used for processing the key attribute missing record; the document structuring component is used for converting document data into json files so as to facilitate subsequent processing;
the document vector generation module (which mainly uses Google's open-source Word2Vec algorithm to train a word vector model on the documents and to predict and generate document vectors), comprising a user-defined word segmentation dictionary component for expanding the word segmentation dictionary with the keywords; a word vector model training component for training the Chinese and English literature data separately with the Skip-Gram model of the Word2Vec model to obtain and store word vector models; and a document vector generation component, which feeds each word of a document into the word vector model to predict its word vector, calculates its IDF value as a weight, and finally combines the weighted word vectors into the document vector;
the partner relation graph clustering module (mainly used for generating the collaboration network graph of the author to be disambiguated, computing path similarity and clustering), comprising a partner relation graph construction component for reading the document data under the name of the author to be disambiguated and constructing the collaboration network graph, in which single-author documents are stored separately as their own cluster divisions; a similarity calculation component for calculating the path similarity value of each author node to be disambiguated and constructing the similarity matrix; and a clustering component for performing AP clustering on the basis of the similarity matrix to obtain the final clusters;
the semantic feature clustering module (which mainly takes the graph clustering results, computes text similarity and finally completes the clustering of document clusters), comprising a cluster data loading component for reading the document vector data of each cluster from the partner relation graph clustering module and the institution data of the authors to be disambiguated; a similarity calculation component for calculating the similarity between document clusters; and a clustering component for clustering the document clusters on the basis of the similarity to obtain the final document cluster division information.
To verify the accuracy of the disambiguation clustering algorithm, comparative experiments between the stepwise disambiguation clustering and the baseline methods were designed.
First, data preparation
The document data all come from web crawlers and include Chinese documents, such as Chinese papers and Chinese patents, as well as English papers. Because the total volume of documents is large, only part of the data is selected for testing. Because these data have no ready-made labels on the web, the Chinese evaluation data, including the Chinese paper and patent test data, were labeled manually; the manual labeling mainly combines the e-mail addresses appearing in the documents, the affiliated institutions, the Baidu search engine and scholars' homepages for judgment, and a few retired teachers who could not be linked were deleted manually. To guarantee the objectivity and accuracy of the test set, the same data were sent to several people for labeling and the results were then unified. The selected Chinese documents belong to 8 different author names, and the English paper test data come from a paper disambiguation competition held by AMiner; the training set provided by the competition can be used as the test data for this evaluation. In selecting author names, to ensure the validity and generality of the test data, we selected names of two different kinds:
(1) names with serious duplication, such as "Li Dong" in the Chinese documents and "Chen, Yong" in the English documents;
(2) names with less serious duplication, such as "Zhao Tiejun" in the Chinese documents and "Shi, Xianming" in the English documents. The annotated Chinese document data are shown in Table 1.
TABLE 1 Chinese documentation data after annotation
Figure BDA0002689606160000111
The selected English paper data are shown in Table 2.
TABLE 2 English discourse data
Figure BDA0002689606160000112
Second, evaluation method
For the evaluation indices of disambiguation quality of the different methods, the invention uses evaluation methods commonly applied to clustering indices in information retrieval and statistical classification and defines Pairwise Precision, Pairwise Recall and their harmonic value Pairwise F1. Disambiguation is assessed by counting the number of document pairs correctly divided under a scholar's name. Specifically, if two documents have the same label both in the set to be evaluated and in the manually labeled set, they are called a correct pair. If two papers with the same label in the evaluated set do not have the same label in the manually labeled data set, they are called a mispredicted pair. The indices are defined as follows.
Let the document set be P = {P_1, P_2, P_3, ...}, let C be the disambiguated set to be evaluated, let M be the true classification set of the manually labeled documents, and let n be the number of documents in the set.
(1) Pairwise Precision
Pairwise Precision is the ratio of TP, the number of document pairs accurately divided under the corresponding author name in the set C to be evaluated, to P_C, the number of document pairs in all the divisions of C; the higher the value, the more accurate the clustering result. The index is given by formula 5:
PairwisePrecision = TP / P_C        (5)
(2) Pairwise Recall
Pairwise Recall is the ratio of TP, the number of document pairs accurately divided under the corresponding author name in the set C to be evaluated, to P_M, the number of document pairs in all the divisions of the manually labeled set M; the higher the value, the more completely the documents of the same author are gathered in the clustering result. The index is given by formula 6:
PairwiseRecall = TP / P_M        (6)
(3) F1 value (F1-Measure)
The F1 value is the harmonic mean of Pairwise Precision and Pairwise Recall; it considers accuracy and recall together, and the higher the value, the better the overall clustering performance. The index is given by formula 7:
PairwiseF1 = (2 × PairwisePrecision × PairwiseRecall) / (PairwisePrecision + PairwiseRecall)        (7)
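As an illustration, a small Python sketch of the three pairwise indices defined by formulas 5-7 is given below; the dictionary-based input format (document id to cluster label) is an assumption made for the example.

from itertools import combinations

def pairwise_scores(predicted, truth):
    """predicted/truth: dicts mapping document id -> cluster label."""
    def same_cluster_pairs(labels):
        return {frozenset(pair) for pair in combinations(labels, 2)
                if labels[pair[0]] == labels[pair[1]]}

    pairs_c = same_cluster_pairs(predicted)   # document pairs grouped together in C
    pairs_m = same_cluster_pairs(truth)       # document pairs grouped together in M
    tp = len(pairs_c & pairs_m)               # correctly divided pairs
    precision = tp / len(pairs_c) if pairs_c else 0.0     # formula 5
    recall = tp / len(pairs_m) if pairs_m else 0.0        # formula 6
    f1 = (2 * precision * recall / (precision + recall)   # formula 7
          if precision + recall else 0.0)
    return precision, recall, f1

# Usage sketch:
# pairwise_scores({"d1": 0, "d2": 0, "d3": 1}, {"d1": 0, "d2": 1, "d3": 1})
# -> (0.0, 0.0, 0.0) because the only predicted pair (d1, d2) is not a true pair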
third, analysis of experimental results
In order to evaluate the effect of the disambiguation clustering algorithm, baseline algorithms need to be set for the comparative experiments, which are described below.
(1) For the problem of integrating documents with authors, the simplest and most common method is feature rule matching: the institution features of the documents are matched against the scholar features in the scholar information base, the documents under the same to-be-disambiguated author name and the same institution are merged, and, when no institution information is available, the documents sharing the same collaborators are merged.
(2) The GHOST algorithm performs well and is often used as a baseline in the field of duplicate name disambiguation, and the first step of the stepwise clustering also borrows some of its ideas, so the GHOST algorithm is used for comparison as well. The idea of GHOST is to construct a collaboration graph of the authors to be disambiguated, compute similarity from the paths between the to-be-disambiguated authors in the graph, and then perform AP clustering. To make the comparison fair, the AP clustering parameters in the comparative experiment are the same as those of the first clustering step in TSC, with the median selected as the initial value.
(3) The experimental results are shown in tables 3, 4 and 5, while the disambiguation result accuracy, recall and F-value comparisons are shown in fig. 8, 9 and 10, respectively.
The results of the comparative experiments in the Chinese thesis are shown in Table 3.
TABLE 3 Experimental results of Chinese thesis
Figure BDA0002689606160000131
The results of the patent comparative experiments are shown in table 4.
Figure BDA0002689606160000132
The results of the comparative experiments in the English paper are shown in Table 5.
Table 5 English paper experimental results
Figure BDA0002689606160000133
Figure BDA0002689606160000141
Comparing the experimental results leads to the following conclusions:
(1) In terms of accuracy, the GHOST algorithm is the most accurate among the baseline algorithms because the collaborator relationship it considers is a strongly distinguishing feature; however, its recall is not high because only the collaborator relationship attribute is considered.
(2) Since the rule-based method depends on the institution attribute, its accuracy is clearly lower than that of the GHOST algorithm and the algorithm provided by the invention when the institution information is inaccurate or missing.
(3) The experimental results for the names with less serious duplication are better than those for the heavily duplicated names.
(4) When the number of papers is small, for example a Chinese paper author name with only 13 papers, the evaluation result of the stepwise clustering algorithm provided by the invention is lower than that of the Rule method, so the efficient and effective Rule method should be considered first when an author has only a small number of documents.
In the comparison of the overall F values, except on the patent data where the Rule method is almost the same as the algorithm provided by the invention, the algorithm provided by the invention greatly improves the F value on both the Chinese papers and the English papers, which proves the effectiveness of the algorithm.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent structural changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A method for disambiguating duplicate names of document authors is characterized in that: the method comprises the following steps:
step one: reading document data and scholar data from a database;
step two: training and predicting a document vector of each document by using a Word2Vec model;
step three: constructing a partner relation network graph of the author to be disambiguated and calculating node similarity and clustering;
step four: acquiring the document vectors of the documents in each cluster produced by the partner relation graph clustering, and computing the similarity between the document clusters and clustering them.
2. The method for disambiguating duplicate names of document authors as claimed in claim 1, wherein the first step specifically comprises:
reading relevant data from the literature database and the scholars database respectively, comprising:
(1) ID, title, author, organization, abstract, journal, year, keyword in Chinese thesis data;
(2) ID, title, author, organization, abstract, journal, year, keyword in English paper data;
(3) ID, title, inventor, abstract, date, unit of publication in patent data;
wherein the Chinese paper-author, English paper-author and patent-inventor fields are used to extract the collaboration data, namely the nodes and edges of the collaboration network; the Chinese paper abstract, the English paper abstract and the patent abstract are used to train a word vector model with the Word2Vec model and to extract document vectors, so that textual information can be incorporated in the duplicate name disambiguation process.
3. The method for disambiguating duplicate names of document authors as claimed in claim 1, wherein the second step specifically comprises:
the subject content of a document comprises its title, keywords and abstract; the title and abstract are concatenated into one string and segmented into words, feature words are extracted, and the feature words are merged with the keywords and then trained with the Skip-Gram model of Word2Vec, with the output dimension set, to obtain a word vector model;
and finally, the IDF value α_i and the word vector ω_i of every feature word of the document D_i are computed over all documents; the document vector p_i is calculated by the following formula:
Figure FDA0002689606150000011
4. The method for disambiguating duplicate names of document authors as claimed in claim 1, wherein the third step specifically comprises:
(1) acquiring the data of the nodes, including the name of the author to be disambiguated and the names of the collaborators, wherein the nodes of the author to be disambiguated are designed in the form "author name-document id", so the number of such nodes equals the number of documents, and the collaborator nodes are designed as "author name";
(2) acquiring the data of the edges and extracting the pairwise correspondences between author names;
(3) representing the "author-paper" relations of all extracted documents as a graph G = (V, E, W), wherein each node v ∈ V represents an instance of an author and an undirected edge e ∈ E indicates that two authors co-authored a document;
(4) calculating the similarity of the nodes to be disambiguated, wherein the similarity function is as follows:
Figure FDA0002689606150000023
where P_ij is the set of valid paths of length no greater than 4 joining the two nodes v_i and v_j;
(5) and constructing a similarity matrix, and clustering by using the AP.
5. The method for disambiguating duplicate names of document authors as claimed in claim 1, wherein the fourth step specifically comprises:
(1) computing the similarity between two document vectors p_i and p_j; the document vector similarity s_ij is calculated by the following formula:
s_ij = (p_i · p_j) / (‖p_i‖ × ‖p_j‖)
(2) computing the similarity between two document clusters c_a and c_b; the document cluster similarity s_ab is calculated by the following formula:
s_ab = max{ s_ij : p_i ∈ c_a, p_j ∈ c_b }
6. a construction system for duplicate name disambiguation of document authors is characterized in that: the method comprises the following steps:
the data acquisition module comprises a database connecting component for connecting a database; the query component is used for executing database query statements and returning corresponding results;
the data preprocessing module comprises a document deduplication component and is used for removing duplicated documents; an error document format modification component for modifying an error document format; the author organization normalization component is used for normalizing the unit information of the author; the key attribute missing value processing component is used for processing the key attribute missing record; the document structuring component is used for converting document data into json files so as to facilitate subsequent processing;
the document vector generation module, comprising a user-defined word segmentation dictionary component for expanding the word segmentation dictionary with the keywords; a word vector model training component for training the Chinese and English literature data separately with the Skip-Gram model of the Word2Vec model to obtain and store word vector models; and a document vector generation component, which feeds each word of a document into the word vector model to predict its word vector, calculates its IDF value as a weight, and finally combines the weighted word vectors into the document vector;
the partner relation graph clustering module, comprising a partner relation graph construction component for reading the document data under the name of the author to be disambiguated and constructing the collaboration network graph, in which single-author documents are stored separately as their own cluster divisions; a similarity calculation component for calculating the path similarity value of each author node to be disambiguated and constructing the similarity matrix; and a clustering component for performing AP clustering on the basis of the similarity matrix to obtain the final clusters;
the semantic feature clustering module, comprising a cluster data loading component for reading the document vector data of each cluster from the partner relation graph clustering module and the institution data of the authors to be disambiguated; a similarity calculation component for calculating the similarity between document clusters; and a clustering component for clustering the document clusters on the basis of the similarity to obtain the final document cluster division information.
CN202010987031.6A 2020-09-18 2020-09-18 Document author duplicate name disambiguation method and construction system Withdrawn CN112131872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010987031.6A CN112131872A (en) 2020-09-18 2020-09-18 Document author duplicate name disambiguation method and construction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010987031.6A CN112131872A (en) 2020-09-18 2020-09-18 Document author duplicate name disambiguation method and construction system

Publications (1)

Publication Number Publication Date
CN112131872A true CN112131872A (en) 2020-12-25

Family

ID=73841419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010987031.6A Withdrawn CN112131872A (en) 2020-09-18 2020-09-18 Document author duplicate name disambiguation method and construction system

Country Status (1)

Country Link
CN (1) CN112131872A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650852A (en) * 2021-01-06 2021-04-13 广东泰迪智能科技股份有限公司 Event merging method based on named entity and AP clustering
CN112836518B (en) * 2021-01-29 2023-12-26 华南师范大学 Method, system and storage medium for processing name disambiguation model
CN112836518A (en) * 2021-01-29 2021-05-25 华南师范大学 Name disambiguation model processing method, system and storage medium
CN113111178A (en) * 2021-03-04 2021-07-13 中国科学院计算机网络信息中心 Method and device for disambiguating homonymous authors based on expression learning without supervision
CN113255324B (en) * 2021-03-09 2022-02-18 西安循数信息科技有限公司 Method for disambiguating inventor names in patent data
CN113255324A (en) * 2021-03-09 2021-08-13 西安循数信息科技有限公司 Method for disambiguating inventor names in patent data
CN112765418B (en) * 2021-04-08 2022-04-01 中译语通科技股份有限公司 Alias merging and storing method, system, terminal and medium based on graph structure
CN112765418A (en) * 2021-04-08 2021-05-07 中译语通科技股份有限公司 Alias merging and storing method, system, terminal and medium based on graph structure
CN112835852A (en) * 2021-04-20 2021-05-25 中译语通科技股份有限公司 Character duplicate name disambiguation method, system and equipment for improving filing-by-filing efficiency
CN112835852B (en) * 2021-04-20 2021-08-17 中译语通科技股份有限公司 Character duplicate name disambiguation method, system and equipment for improving filing-by-filing efficiency
CN113434659B (en) * 2021-06-17 2023-03-17 天津大学 Implicit conflict sensing method in collaborative design process
CN113434659A (en) * 2021-06-17 2021-09-24 天津大学 Implicit conflict sensing method in collaborative design process
CN113780001B (en) * 2021-08-12 2023-12-15 北京工业大学 Visual analysis method for academic paper homonymy disambiguation
CN113780001A (en) * 2021-08-12 2021-12-10 北京工业大学 Visual analysis method for homonymous disambiguation of academic papers
CN114328488A (en) * 2021-12-27 2022-04-12 中科大数据研究院 Chinese and English literature author name fusion disambiguation method
CN114328488B (en) * 2021-12-27 2023-03-14 中科大数据研究院 Chinese and English literature author name fusion disambiguation method
CN116776854A (en) * 2023-08-25 2023-09-19 湖南汇智兴创科技有限公司 Online multi-version document content association method, device, equipment and medium
CN116776854B (en) * 2023-08-25 2023-11-03 湖南汇智兴创科技有限公司 Online multi-version document content association method, device, equipment and medium
CN117312565A (en) * 2023-11-28 2023-12-29 山东科技大学 Literature author name disambiguation method based on relation fusion and representation learning
CN117312565B (en) * 2023-11-28 2024-02-06 山东科技大学 Literature author name disambiguation method based on relation fusion and representation learning
CN117610541A (en) * 2024-01-17 2024-02-27 之江实验室 Author disambiguation method and device for large-scale data and readable storage medium

Similar Documents

Publication Publication Date Title
CN112131872A (en) Document author duplicate name disambiguation method and construction system
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN111914558A (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN110750995B (en) File management method based on custom map
Fengmei et al. FSFP: Transfer learning from long texts to the short
CN108287911A (en) A kind of Relation extraction method based on about fasciculation remote supervisory
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN114443855A (en) Knowledge graph cross-language alignment method based on graph representation learning
CN113610626A (en) Bank credit risk identification knowledge graph construction method and device, computer equipment and computer readable storage medium
CN111597330A (en) Intelligent expert recommendation-oriented user image drawing method based on support vector machine
Sun et al. Important attribute identification in knowledge graph
Zadgaonkar et al. An Approach for Analyzing Unstructured Text Data Using Topic Modeling Techniques for Efficient Information Extraction
Song et al. Research on intelligent question answering system based on college enrollment
Ziv et al. CompanyName2Vec: Company Entity Matching Based on Job Ads
Rao et al. Automatic identification of concepts and conceptual relations from patents using machine learning methods
Sun et al. Generalized abbreviation prediction with negative full forms and its application on improving chinese web search
Lu et al. Overview of knowledge mapping construction technology
Chen Natural language processing in web data mining
Yongmei et al. Research on Domain-independent Opinion Target Extraction
CN117407511B (en) Electric power safety regulation intelligent question-answering method and system based on Bert model
CN111241283B (en) Rapid characterization method for portrait of scientific research student
Wang et al. Research and implementation of SVM and bootstrapping fusion algorithm in emotion analysis of stock review texts
Azeroual A text and data analytics approach to enrich the quality of unstructured research information
CN111259166B (en) Scientific research entity linking method and device based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201225)