CN112131872A - Document author duplicate name disambiguation method and construction system - Google Patents

Document author duplicate name disambiguation method and construction system

Info

Publication number
CN112131872A
CN112131872A (application CN202010987031.6A)
Authority
CN
China
Prior art keywords
document
author
data
similarity
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010987031.6A
Other languages
Chinese (zh)
Inventor
李微
胡晟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sanluoxuan Big Data Technology Kunshan Co ltd
Original Assignee
Sanluoxuan Big Data Technology Kunshan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sanluoxuan Big Data Technology Kunshan Co ltd filed Critical Sanluoxuan Big Data Technology Kunshan Co ltd
Priority to CN202010987031.6A
Publication of CN112131872A
Current legal status: Withdrawn

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F40/00 Handling natural language data → G06F40/20 Natural language analysis → G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The invention discloses a method and a construction system for disambiguating duplicate names of document authors, comprising the following steps: step one: reading document data and scholar data from a database; step two: training a Word2Vec model and predicting a document vector for each document; step three: constructing a partner (collaboration) relation network graph of the authors to be disambiguated, and computing node similarity and clustering; step four: acquiring the document vectors of the documents in each cluster produced by the partner relation graph clustering, and computing the similarity between document clusters and clustering them. The invention achieves a high level of both accuracy and recall in the disambiguation result, and is suitable for multilingual, multi-type document collections such as Chinese papers, English papers and patents.

Description

Document author duplicate name disambiguation method and construction system
Technical Field
The invention belongs to the technical field of document processing, and particularly relates to a method for disambiguating duplicate names of document authors.
Background
With the rapid development of science and technology and the continuous fusion of information, the duplicate-name phenomenon that is widespread in the real world greatly affects data retrieval and processing, especially when flexible and varied natural-language data are handled. Named entity disambiguation arose to address this; it studies how to match ambiguous entity mentions with the correct entities in a knowledge base. Author disambiguation is a form of named entity disambiguation: in the real world different people may share the same name, and in many applications such as scientific literature management and information integration, personal names are used as identifiers of the retrieved information, so name ambiguity can greatly impair the quality of retrieval. Author disambiguation is essentially a classification problem that requires accurately partitioning documents and assigning them to the different real-world authors who share a name.
Many approaches can complete the task of document author name disambiguation. Most existing methods are based on information contained in the documents themselves and mainly include feature-based methods, graph-partition-based methods, and methods based on classifying web resources.
Disclosure of Invention
The invention mainly solves the technical problem of providing a method and a construction system for duplicate name disambiguation of document authors, so as to solve the technical problem.
In order to solve the technical problems, the invention adopts a technical scheme that: a method for disambiguating duplicate names of document authors comprises the following steps:
step one: reading document data and scholar data from a database;
step two: training and predicting a document vector of each document by using a Word2Vec model;
step three: constructing a partner relation network graph of the author to be disambiguated and calculating node similarity and clustering;
step four: acquiring the document vectors of the documents in each cluster produced by the partner relation graph clustering, and computing the similarity between the document clusters and clustering them.
Further, the first step specifically includes:
reading relevant data from a literature database and a scholars database of a company respectively, comprising:
(1) ID, title, author, organization, abstract, journal, year, keyword in Chinese thesis data;
(2) ID, title, author, organization, abstract, journal, year, keyword in English paper data;
(3) ID, title, inventor, abstract, date, unit of publication in patent data;
wherein the Chinese paper-author, English paper-author and patent-inventor fields are used to extract the collaboration data, namely the nodes and edges of the collaboration network; the Chinese paper abstract, the English paper abstract and the patent abstract are used to train a word vector model with the Word2Vec model and to extract document vectors, so that textual information can be incorporated in the duplicate name disambiguation process.
Further, the second step specifically includes:
the subject content of a document comprises its title, keywords and abstract; the title and abstract are concatenated into one string and segmented into words, feature words are extracted, and the feature words are merged with the keywords and then trained with the Skip-Gram model of Word2Vec, with the output dimension set, to obtain a word vector model;
and finally, the IDF value α_i and the word vector ω_i of every feature word of the document D_i are computed over all documents; the document vector p_i is calculated by the following formula:
Figure BDA0002689606160000021
further, the third step specifically includes:
(1) acquiring the data of the nodes, including the name of the author to be disambiguated and the names of the collaborators, wherein the nodes of the author to be disambiguated are designed in the form "author name-document id", so the number of such nodes equals the number of documents, and the collaborator nodes are designed as "author name";
(2) acquiring the data of the edges and extracting the pairwise correspondences between author names;
(3) representing the "author-paper" relations of all extracted documents as a graph G = (V, E, W), wherein each node v ∈ V represents an instance of an author and an undirected edge e ∈ E indicates that two authors co-authored a document;
(4) calculating the similarity of the nodes to be disambiguated, wherein the similarity function is as follows:
Figure BDA0002689606160000031
where P_ij is the set of valid paths of length no greater than 4 joining the two nodes v_i and v_j;
(5) and constructing a similarity matrix, and clustering by using AP clustering.
Further, the fourth step specifically includes:
(1) computing the similarity between two document vectors p_i and p_j; the document vector similarity s_ij is calculated by the following formula:
s_ij = (p_i · p_j) / (‖p_i‖ × ‖p_j‖)
(2) computing the similarity between two document clusters c_a and c_b; the document cluster similarity s_ab is calculated by the following formula:
s_ab = max{ s_ij : p_i ∈ c_a, p_j ∈ c_b }
the invention adopts a further technical scheme for solving the technical problems that:
a construction system for duplicate name disambiguation of document authors comprises:
the data acquisition module comprises a database connecting component for connecting a database; the query component is used for executing database query statements and returning corresponding results;
the data preprocessing module comprises a document deduplication component and is used for removing duplicated documents; an error document format modification component for modifying an error document format; the author organization normalization component is used for normalizing the unit information of the author; the key attribute missing value processing component is used for processing the key attribute missing record; the document structuring component is used for converting document data into json files so as to facilitate subsequent processing;
the document vector generation module, comprising a user-defined word segmentation dictionary component for expanding the word segmentation dictionary with the keywords; a word vector model training component for training the Chinese and English literature data separately with the Skip-Gram model of the Word2Vec model to obtain and store word vector models; and a document vector generation component, which feeds each word of a document into the word vector model to predict its word vector, calculates its IDF value as a weight, and finally combines the weighted word vectors into the document vector;
the partner relation graph clustering module, comprising a partner relation graph construction component for reading the document data under the name of the author to be disambiguated and constructing the collaboration network graph, in which single-author documents are stored separately as their own cluster divisions; a similarity calculation component for calculating the path similarity value of each author node to be disambiguated and constructing the similarity matrix; and a clustering component for performing AP clustering on the basis of the similarity matrix to obtain the final clusters;
the semantic feature clustering module, comprising a cluster data loading component for reading the document vector data of each cluster from the partner relation graph clustering module and the institution data of the authors to be disambiguated; a similarity calculation component for calculating the similarity between document clusters; and a clustering component for clustering the document clusters on the basis of the similarity to obtain the final document cluster division information.
The invention has the following beneficial effects:
1. The invention considers not only the strong feature information of the document collaborator relationship but also weak feature information of the documents such as semantic information, so it can mine the document data to the greatest extent and obtain a more accurate and complete document cluster division;
2. The invention can be used for duplicate name disambiguation of document authors across different formats and languages, and has good compatibility.
Drawings
FIG. 1 is a schematic diagram of the duplicate name disambiguation method of the present invention;
FIG. 2 is a schematic view of the Word2Vec model of the present invention;
FIG. 3 is a document vector generation flow diagram of the present invention;
FIG. 4 is a flowchart of partner relationship graph clustering in accordance with the present invention;
FIG. 5 is a repeat path process flow diagram of the present invention;
FIG. 6 is a semantic feature clustering flow diagram of the present invention;
FIG. 7 is a system block diagram of the present invention;
FIG. 8 is a table comparing the accuracy of the Chinese paper, patent and English paper evaluation results of the present invention;
FIG. 9 is a comparison table of Chinese paper, patent and English paper evaluation result recall rates of the present invention;
FIG. 10 is a comparison table of the evaluation results F1 of the Chinese paper, patent and English paper according to the present invention.
Detailed Description
The following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the invention easier for those skilled in the art to understand, and thereby define the protection scope of the invention more clearly.
Embodiment: a method for disambiguating duplicate names of document authors, as shown in FIG. 1, comprises the following steps:
step one: reading document data and scholar data from a database;
step two: training and predicting a document vector of each document by using a Word2Vec model;
step three: constructing a partner relation network graph of the author to be disambiguated and calculating node similarity and clustering;
step four: acquiring the document vectors of the documents in each cluster produced by the partner relation graph clustering, and computing the similarity between the document clusters and clustering them.
The first step specifically comprises:
reading relevant data from a literature database and a scholars database of a company respectively, comprising:
(1) ID, title, author, organization, abstract, journal, year, keyword in Chinese thesis data;
(2) ID, title, author, organization, abstract, journal, year, keyword in English paper data;
(3) ID, title, inventor, abstract, date, unit of publication in patent data;
wherein the Chinese paper-author, English paper-author and patent-inventor fields are used to extract the collaboration data, namely the nodes and edges of the collaboration network; the Chinese paper abstract, the English paper abstract and the patent abstract are used to train a word vector model with the Word2Vec model and to extract document vectors, so that textual information can be incorporated in the duplicate name disambiguation process.
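As an illustrative aid, the following minimal Python sketch shows how the three record types listed above could be read and unified for the later steps; the database file, table names and column names (cn_paper, en_paper, patent, authors, inventors and so on) are assumptions for demonstration and are not the schema of the actual literature and scholar databases.

import json
import sqlite3

# Fields listed in (1)-(3) above, expressed as illustrative column names.
FIELDS = {
    "cn_paper": ["id", "title", "authors", "org", "abstract", "journal", "year", "keywords"],
    "en_paper": ["id", "title", "authors", "org", "abstract", "journal", "year", "keywords"],
    "patent":   ["id", "title", "inventors", "abstract", "date", "applicant"],
}

def load_documents(db_path, author_name):
    """Read every document type mentioning author_name and unify the fields."""
    docs = []
    with sqlite3.connect(db_path) as conn:
        for table, cols in FIELDS.items():
            author_col = "inventors" if table == "patent" else "authors"
            query = ("SELECT " + ", ".join(cols) + " FROM " + table +
                     " WHERE " + author_col + " LIKE ?")
            for row in conn.execute(query, ("%" + author_name + "%",)):
                rec = dict(zip(cols, row))
                rec["doc_type"] = table
                # authors/inventors feed the collaboration network (step three);
                # title/abstract/keywords feed the document vectors (step two)
                docs.append(rec)
    return docs

# The preprocessing described later serialises the unified records to JSON:
# with open("docs.json", "w", encoding="utf-8") as f:
#     json.dump(load_documents("literature.db", "Zhang San"), f, ensure_ascii=False)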
The second step specifically comprises:
the subject content of a document comprises its title, keywords and abstract; the title and abstract are concatenated into one string and segmented into words, feature words are extracted, and the feature words are merged with the keywords and then trained with the Skip-Gram model of Word2Vec, with the output dimension set, to obtain a word vector model; as shown in fig. 2, ω(t) represents the target word in the text, and ω(t-2), ω(t-1), ω(t+1), ω(t+2) and so on are the words adjacent to the target word in the text;
and finally, the IDF value α_i and the word vector ω_i of every feature word of the document D_i are computed over all documents; the document vector p_i is calculated by the following formula:
Figure BDA0002689606160000051
The specific steps of the document vector model generation process are shown in fig. 3; the Chinese and English steps are the same but are trained separately.
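For concreteness, here is a minimal sketch of this step under stated assumptions: gensim (4.x) is used as the Skip-Gram Word2Vec implementation, and the document vector is taken to be the IDF-weighted sum of the feature-word vectors; the exact weighting and normalisation of the formula shown as an image above may differ, so this is an illustration rather than the patented formula.

import math
from collections import Counter

from gensim.models import Word2Vec

def train_word_vectors(tokenized_docs, dim=100):
    # sg=1 selects the Skip-Gram architecture; vector_size is the output dimension
    return Word2Vec(sentences=tokenized_docs, vector_size=dim, sg=1,
                    window=5, min_count=1, workers=4)

def idf_table(tokenized_docs):
    # alpha_i: IDF of each feature word over the whole corpus
    n_docs = len(tokenized_docs)
    df = Counter(w for doc in tokenized_docs for w in set(doc))
    return {w: math.log(n_docs / df[w]) for w in df}

def document_vector(doc_tokens, model, idf):
    # p_i as the IDF-weighted combination of the word vectors omega_i (assumed form)
    return sum(idf.get(w, 0.0) * model.wv[w] for w in doc_tokens if w in model.wv)

# Usage sketch:
# docs = [["author", "name", "disambiguation"], ["word", "vector", "clustering"]]
# model = train_word_vectors(docs)
# idf = idf_table(docs)
# p0 = document_vector(docs[0], model, idf)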
The flowchart of the third step is shown in fig. 4, and specifically includes:
(1) acquiring the data of the nodes, including the name of the author to be disambiguated and the names of the collaborators, wherein the nodes of the author to be disambiguated are designed in the form "author name-document id", so the number of such nodes equals the number of documents, and the collaborator nodes are designed as "author name";
(2) acquiring the data of the edges: in the Chinese paper, English paper and patent collaboration data, several names appear together in one column (for example: Zhang San, Li Si, Wang Wu), and the pairwise correspondences need to be extracted (for example: Zhang San-Li Si, Zhang San-Wang Wu, Li Si-Wang Wu). The final similarity calculation depends on path retrieval between the target authors, and when building the relation graph of a paper that contains a target author, connecting all non-target nodes pairwise would produce many redundant searches of little significance and greatly increase the amount of computation. As shown in fig. 5, if A1 collaborates with B and C on document P1, then when calculating the paths between A1 and A2, A1 → C → A2 can be regarded as a path worth searching, while A1 → B → C → A2 is an invalid path. Therefore, when adding a relation graph containing a target-author document to the collaboration graph, only the edges between the target author and the collaborators are established and the connections between collaborators are not considered; for example, when the P1 document relation graph in FIG. 5 is established, there are only the two edges A1 → B and A1 → C, and no edge B → C;
(3) representing the "author-paper" relations of all extracted documents as a graph G = (V, E, W), wherein each node v ∈ V represents an instance of an author name and an undirected edge e ∈ E indicates that two authors co-authored a document, which eliminates redundant paths in the subsequent work. For a single-author paper, since there are no collaborators, collaborator relation clustering cannot be used; such papers are labeled and stored separately and are used later in the semantic feature clustering;
(4) after the graph is constructed, the similarity of the nodes to be disambiguated is calculated. To increase the efficiency of the graph path search, only paths of length no greater than 4 are taken and the valid paths are searched before the similarity is calculated. The length and number of the valid paths are the parameters used to design the similarity calculation, and an obvious fact is that the shorter and the more numerous the valid paths, the greater the similarity of the two nodes. Therefore, for authors v_i and v_j, with P_ij the set of valid paths of length no greater than 4 joining the two nodes, the similarity function is designed as follows:
Figure BDA0002689606160000061
if no path exists between two nodes, the similarity is set to -10 rather than negative infinity, for the accuracy of the subsequent clustering result;
(5) calculating the similarity of the nodes to be disambiguated and constructing the similarity matrix, then clustering with AP (affinity propagation), setting the initial preference parameter to the median of all the data, and obtaining the clustering result (an illustrative code sketch of steps (1)-(5) follows the pseudo-code below);
the algorithm pseudo-code is as follows:
Figure BDA0002689606160000071
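The following rough Python sketch illustrates steps (1)-(5): the collaboration graph is built with networkx, each pair of to-be-disambiguated nodes is scored from the valid paths of length at most 4 between them, missing paths get a similarity of -10, and affinity propagation is run on the resulting matrix with the median as the preference. The per-path contribution of 1/(path length) is an assumption standing in for the similarity formula given as an image above.

import networkx as nx
import numpy as np
from sklearn.cluster import AffinityPropagation

def build_graph(docs, target_name):
    g = nx.Graph()
    target_nodes = []
    for d in docs:
        node = target_name + "#" + str(d["id"])    # "author name - document id" node
        target_nodes.append(node)
        g.add_node(node)
        for co in d["authors"]:
            if co != target_name:
                g.add_edge(node, co)               # only target-collaborator edges
    return g, target_nodes

def path_similarity(g, u, v, max_len=4):
    paths = list(nx.all_simple_paths(g, u, v, cutoff=max_len))
    if not paths:
        return -10.0                               # no path: -10 rather than -inf
    # shorter and more numerous valid paths -> larger similarity (assumed form)
    return sum(1.0 / (len(p) - 1) for p in paths)

def ap_cluster(docs, target_name):
    g, nodes = build_graph(docs, target_name)
    n = len(nodes)
    s = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s[i, j] = s[j, i] = path_similarity(g, nodes[i], nodes[j])
    ap = AffinityPropagation(affinity="precomputed",
                             preference=np.median(s), random_state=0)
    return ap.fit_predict(s)                       # cluster label per document node

# Usage sketch:
# docs = [{"id": 1, "authors": ["Zhang San", "Li Si"]},
#         {"id": 2, "authors": ["Zhang San", "Li Si", "Wang Wu"]},
#         {"id": 3, "authors": ["Zhang San", "Qian Qi"]}]
# labels = ap_cluster(docs, "Zhang San")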
the effect of the relation clustering of the collaborators is good under the condition that the number of the collaborators is sufficient, but the relation clustering of the collaborators has the limitation that the relation clustering of the collaborators cannot be completely applied to an actual literature database. Many documents in document databases are rare in relation to collaborators and documents written by single authors, and clustering results by using the strong feature of the relationship between collaborators shows that although the accuracy of the same author in a result cluster is very high, a large number of clusters are increased compared with actual results. Therefore, the obtained results need to be clustered continuously by adopting other characteristics of the literature until the final result is similar to or identical to the actual result.
To address this shortcoming of the partner relation graph clustering algorithm, the method also uses subject content features: because an author keeps relatively fixed research topics and directions over a certain period, the semantic similarity of the subject content of documents under the same real author is high, whereas the subject content similarity of documents by different same-named authors is not. Making full use of subject content features to assist disambiguation on top of partner relation graph clustering effectively reduces the influence of single-author papers and scholars with few collaborations, and further improves the disambiguation effect. The flow chart of this step is shown in fig. 6.
The documents handled by the system basically all contain the authors' institution feature, but most do not contain college or department information. Therefore, the institution features of the authors in the document clusters obtained from collaborator relation clustering are first extracted and marked, and clusters belonging to different institutions are not merged in the subsequent clustering step. In this clustering, the subject content feature similarity of documents is computed with cosine similarity. Let the previously generated document vectors of two documents under the duplicated author name be p_i and p_j; the document vector similarity s_ij is calculated by the following formula:
s_ij = (p_i · p_j) / (‖p_i‖ × ‖p_j‖)
However, since this clustering is performed after the collaborator clustering, that is, the similarity between two document clusters must be calculated, the two-document formula cannot simply be used; instead, the similarity of every document pair across the two clusters is computed and the maximum result is selected as the final similarity. Let the two clusters be c_a and c_b; the document cluster similarity s_ab is calculated by the following formula:
s_ab = max{ s_ij : p_i ∈ c_a, p_j ∈ c_b }
After the similarities are obtained, the target clusters are merged again by hierarchical clustering until the clustering no longer changes or the threshold is reached. Different thresholds were tried in the experiments; to obtain the best effect, the threshold is finally set to 0.7 for Chinese papers and patents and to 0.65 for English papers (an illustrative code sketch follows the pseudo-code below).
The algorithm pseudo-code is as follows:
Figure BDA0002689606160000091
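A rough sketch of this step follows, under stated assumptions: cosine similarity between document vectors, cluster-to-cluster similarity as the maximum over all cross-cluster document pairs, greedy agglomerative merging until no pair reaches the threshold (0.7 for Chinese papers and patents, 0.65 for English papers), and a simple guard so that clusters with different known institutions are never merged.

import numpy as np

def cosine(p_i, p_j):
    # s_ij = (p_i . p_j) / (|p_i| * |p_j|)
    return float(np.dot(p_i, p_j) / (np.linalg.norm(p_i) * np.linalg.norm(p_j)))

def cluster_similarity(c_a, c_b):
    # s_ab = max over all pairs (p_i in c_a, p_j in c_b) of s_ij
    return max(cosine(p_i, p_j) for p_i in c_a for p_j in c_b)

def merge_clusters(clusters, threshold=0.7, institutions=None):
    """clusters: list of lists of document vectors; institutions: optional labels."""
    clusters = [list(c) for c in clusters]
    inst = list(institutions) if institutions else [None] * len(clusters)
    while len(clusters) > 1:
        best, bi, bj = threshold, -1, -1
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if inst[i] and inst[j] and inst[i] != inst[j]:
                    continue                      # different institutions: never merge
                s = cluster_similarity(clusters[i], clusters[j])
                if s >= best:
                    best, bi, bj = s, i, j
        if bi < 0:                                # no pair reaches the threshold
            break
        inst_bj = inst.pop(bj)
        clusters[bi].extend(clusters.pop(bj))
        if inst[bi] is None:
            inst[bi] = inst_bj
    return clusters

# Usage sketch (three toy document vectors):
# merged = merge_clusters([[np.array([1.0, 0.0])],
#                          [np.array([0.9, 0.1])],
#                          [np.array([0.0, 1.0])]], threshold=0.7)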
a construction system for duplicate name disambiguation of document authors comprises:
the data acquisition module (mainly used for acquiring the relevant data, including Chinese paper data, English paper data and patent data, from the database), comprising a database connection component for connecting to the database, and a query component for executing database query statements and returning the corresponding results;
a data preprocessing module (mainly used for preprocessing data) comprising a document deduplication component for removing duplicate documents; an error document format modification component for modifying an error document format; the author organization normalization component is used for normalizing the unit information of the author; the key attribute missing value processing component is used for processing the key attribute missing record; the document structuring component is used for converting document data into json files so as to facilitate subsequent processing;
the document vector generation module (which mainly uses Google's open-source Word2Vec algorithm to train a word vector model on the documents and to predict and generate document vectors), comprising a user-defined word segmentation dictionary component for expanding the word segmentation dictionary with the keywords; a word vector model training component for training the Chinese and English literature data separately with the Skip-Gram model of the Word2Vec model to obtain and store word vector models; and a document vector generation component, which feeds each word of a document into the word vector model to predict its word vector, calculates its IDF value as a weight, and finally combines the weighted word vectors into the document vector;
the partner relation graph clustering module (mainly used for generating the collaboration network graph of the author to be disambiguated, computing path similarity and clustering), comprising a partner relation graph construction component for reading the document data under the name of the author to be disambiguated and constructing the collaboration network graph, in which single-author documents are stored separately as their own cluster divisions; a similarity calculation component for calculating the path similarity value of each author node to be disambiguated and constructing the similarity matrix; and a clustering component for performing AP clustering on the basis of the similarity matrix to obtain the final clusters;
the semantic feature clustering module (which mainly takes the graph clustering results, computes text similarity and finally completes the clustering of document clusters), comprising a cluster data loading component for reading the document vector data of each cluster from the partner relation graph clustering module and the institution data of the authors to be disambiguated; a similarity calculation component for calculating the similarity between document clusters; and a clustering component for clustering the document clusters on the basis of the similarity to obtain the final document cluster division information.
To verify the accuracy of the disambiguation clustering algorithm, comparative experiments between the stepwise disambiguation clustering and the baseline methods were designed.
First, data preparation
The document data all come from web crawlers and include Chinese documents, such as Chinese papers and Chinese patents, as well as English papers. Because the total volume of documents is large, only part of the data is selected for testing. Because these data have no ready-made labels on the web, the Chinese evaluation data, including the Chinese paper and patent test data, were labeled manually; the manual labeling mainly combines the e-mail addresses appearing in the documents, the affiliated institutions, the Baidu search engine and scholars' homepages for judgment, and a few retired teachers who could not be linked were deleted manually. To guarantee the objectivity and accuracy of the test set, the same data were sent to several people for labeling and the results were then unified. The selected Chinese documents belong to 8 different author names, and the English paper test data come from a paper disambiguation competition held by AMiner; the training set provided by the competition can be used as the test data for this evaluation. In selecting author names, to ensure the validity and generality of the test data, we selected names of two different kinds:
(1) names with serious duplication, such as "Li Dong" in the Chinese documents and "Chen, Yong" in the English documents;
(2) names with less serious duplication, such as "Zhao Tiejun" in the Chinese documents and "Shi, Xianming" in the English documents. The annotated Chinese document data are shown in Table 1.
TABLE 1 Chinese documentation data after annotation
Figure BDA0002689606160000111
The selected English paper data are shown in Table 2.
TABLE 2 English discourse data
Figure BDA0002689606160000112
Second, evaluation method
For the evaluation indices of disambiguation quality of the different methods, the invention uses evaluation methods commonly applied to clustering indices in information retrieval and statistical classification and defines Pairwise Precision, Pairwise Recall and their harmonic value Pairwise F1. Disambiguation is assessed by counting the number of document pairs correctly divided under a scholar's name. Specifically, if two documents have the same label both in the set to be evaluated and in the manually labeled set, they are called a correct pair. If two papers with the same label in the evaluated set do not have the same label in the manually labeled data set, they are called a mispredicted pair. The indices are defined as follows.
Let the document set be P = {P_1, P_2, P_3, ...}, let C be the disambiguated set to be evaluated, let M be the true classification set of the manually labeled documents, and let n be the number of documents in the set.
(1) Pairwise Precision
Pairwise Precision is the ratio of TP, the number of document pairs accurately divided under the corresponding author name in the set C to be evaluated, to P_C, the number of document pairs in all the divisions of C; the higher the value, the more accurate the clustering result. The index is given by formula 5:
PairwisePrecision = TP / P_C        (5)
(2) Pairwise Recall
Pairwise Recall is the ratio of TP, the number of document pairs accurately divided under the corresponding author name in the set C to be evaluated, to P_M, the number of document pairs in all the divisions of the manually labeled set M; the higher the value, the more completely the documents of the same author are gathered in the clustering result. The index is given by formula 6:
PairwiseRecall = TP / P_M        (6)
(3) F1 value (F1-Measure)
The F1 value is the harmonic mean of Pairwise Precision and Pairwise Recall; it considers accuracy and recall together, and the higher the value, the better the overall clustering performance. The index is given by formula 7:
PairwiseF1 = (2 × PairwisePrecision × PairwiseRecall) / (PairwisePrecision + PairwiseRecall)        (7)
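As an illustration, a small Python sketch of the three pairwise indices defined by formulas 5-7 is given below; the dictionary-based input format (document id to cluster label) is an assumption made for the example.

from itertools import combinations

def pairwise_scores(predicted, truth):
    """predicted/truth: dicts mapping document id -> cluster label."""
    def same_cluster_pairs(labels):
        return {frozenset(pair) for pair in combinations(labels, 2)
                if labels[pair[0]] == labels[pair[1]]}

    pairs_c = same_cluster_pairs(predicted)   # document pairs grouped together in C
    pairs_m = same_cluster_pairs(truth)       # document pairs grouped together in M
    tp = len(pairs_c & pairs_m)               # correctly divided pairs
    precision = tp / len(pairs_c) if pairs_c else 0.0     # formula 5
    recall = tp / len(pairs_m) if pairs_m else 0.0        # formula 6
    f1 = (2 * precision * recall / (precision + recall)   # formula 7
          if precision + recall else 0.0)
    return precision, recall, f1

# Usage sketch:
# pairwise_scores({"d1": 0, "d2": 0, "d3": 1}, {"d1": 0, "d2": 1, "d3": 1})
# -> (0.0, 0.0, 0.0) because the only predicted pair (d1, d2) is not a true pair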
third, analysis of experimental results
In order to evaluate the effect of the disambiguation clustering algorithm, baseline algorithms need to be set for the comparative experiments, which are described below.
(1) For the problem of integrating documents with authors, the simplest and most common method is feature rule matching: the institution features of the documents are matched against the scholar features in the scholar information base, the documents under the same to-be-disambiguated author name and the same institution are merged, and, when no institution information is available, the documents sharing the same collaborators are merged.
(2) The GHOST algorithm performs well and is often used as a baseline in the field of duplicate name disambiguation, and the first step of the stepwise clustering also borrows some of its ideas, so the GHOST algorithm is used for comparison as well. The idea of GHOST is to construct a collaboration graph of the authors to be disambiguated, compute similarity from the paths between the to-be-disambiguated authors in the graph, and then perform AP clustering. To make the comparison fair, the AP clustering parameters in the comparative experiment are the same as those of the first clustering step in TSC, with the median selected as the initial value.
(3) The experimental results are shown in tables 3, 4 and 5, while the disambiguation result accuracy, recall and F-value comparisons are shown in fig. 8, 9 and 10, respectively.
The results of the comparative experiments in the Chinese thesis are shown in Table 3.
TABLE 3 Experimental results of Chinese thesis
Figure BDA0002689606160000131
The results of the patent comparative experiments are shown in table 4.
Figure BDA0002689606160000132
The results of the comparative experiments in the English paper are shown in Table 5.
Table 5 English paper experimental results
Figure BDA0002689606160000133
Figure BDA0002689606160000141
Comparing the experimental results leads to the following conclusions:
(1) In terms of accuracy, the GHOST algorithm is the most accurate among the baseline algorithms because the collaborator relationship it considers is a strongly distinguishing feature; however, its recall is not high because only the collaborator relationship attribute is considered.
(2) Since the rule-based method depends on the institution attribute, its accuracy is clearly lower than that of the GHOST algorithm and the algorithm provided by the invention when the institution information is inaccurate or missing.
(3) The experimental results for the names with less serious duplication are better than those for the heavily duplicated names.
(4) When the number of papers is small, for example a Chinese paper author name with only 13 papers, the evaluation result of the stepwise clustering algorithm provided by the invention is lower than that of the Rule method, so the efficient and effective Rule method should be considered first when an author has only a small number of documents.
In the comparison of the overall F values, except on the patent data where the Rule method is almost the same as the algorithm provided by the invention, the algorithm provided by the invention greatly improves the F value on both the Chinese papers and the English papers, which proves the effectiveness of the algorithm.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent structural changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to other related technical fields, are included in the scope of the present invention.

Claims (6)

1. A method for disambiguating duplicate names of document authors is characterized in that: the method comprises the following steps:
step one: reading document data and scholar data from a database;
step two: training and predicting a document vector of each document by using a Word2Vec model;
step three: constructing a partner relation network graph of the author to be disambiguated and calculating node similarity and clustering;
step four: acquiring the document vectors of the documents in each cluster produced by the partner relation graph clustering, and computing the similarity between the document clusters and clustering them.
2. The method for disambiguating duplicate names of document authors as claimed in claim 1, wherein the first step specifically comprises:
reading relevant data from the literature database and the scholars database respectively, comprising:
(1) ID, title, author, organization, abstract, journal, year, keyword in Chinese thesis data;
(2) ID, title, author, organization, abstract, journal, year, keyword in English paper data;
(3) ID, title, inventor, abstract, date, unit of publication in patent data;
wherein the Chinese paper-author, English paper-author and patent-inventor fields are used to extract the collaboration data, namely the nodes and edges of the collaboration network; the Chinese paper abstract, the English paper abstract and the patent abstract are used to train a word vector model with the Word2Vec model and to extract document vectors, so that textual information can be incorporated in the duplicate name disambiguation process.
3. The method for disambiguating duplicate names of document authors as claimed in claim 1, wherein the second step specifically comprises:
the subject content of a document comprises its title, keywords and abstract; the title and abstract are concatenated into one string and segmented into words, feature words are extracted, and the feature words are merged with the keywords and then trained with the Skip-Gram model of Word2Vec, with the output dimension set, to obtain a word vector model;
and finally, the IDF value α_i and the word vector ω_i of every feature word of the document D_i are computed over all documents; the document vector p_i is calculated by the following formula:
Figure FDA0002689606150000011
4. The method for disambiguating duplicate names of document authors as claimed in claim 1, wherein the third step specifically comprises:
(1) acquiring the data of the nodes, including the name of the author to be disambiguated and the names of the collaborators, wherein the nodes of the author to be disambiguated are designed in the form "author name-document id", so the number of such nodes equals the number of documents, and the collaborator nodes are designed as "author name";
(2) acquiring the data of the edges and extracting the pairwise correspondences between author names;
(3) representing the "author-paper" relations of all extracted documents as a graph G = (V, E, W), wherein each node v ∈ V represents an instance of an author and an undirected edge e ∈ E indicates that two authors co-authored a document;
(4) calculating the similarity of the nodes to be disambiguated, wherein the similarity function is as follows:
Figure FDA0002689606150000023
where P_ij is the set of valid paths of length no greater than 4 joining the two nodes v_i and v_j;
(5) and constructing a similarity matrix, and clustering by using the AP.
5. The method for disambiguating duplicate names of document authors as claimed in claim 1, wherein the fourth step specifically comprises:
(1) computing the similarity between two document vectors p_i and p_j; the document vector similarity s_ij is calculated by the following formula:
s_ij = (p_i · p_j) / (‖p_i‖ × ‖p_j‖)
(2) computing the similarity between two document clusters c_a and c_b; the document cluster similarity s_ab is calculated by the following formula:
s_ab = max{ s_ij : p_i ∈ c_a, p_j ∈ c_b }
6. a construction system for duplicate name disambiguation of document authors is characterized in that: the method comprises the following steps:
the data acquisition module comprises a database connecting component for connecting a database; the query component is used for executing database query statements and returning corresponding results;
the data preprocessing module comprises a document deduplication component and is used for removing duplicated documents; an error document format modification component for modifying an error document format; the author organization normalization component is used for normalizing the unit information of the author; the key attribute missing value processing component is used for processing the key attribute missing record; the document structuring component is used for converting document data into json files so as to facilitate subsequent processing;
the document vector generation module, comprising a user-defined word segmentation dictionary component for expanding the word segmentation dictionary with the keywords; a word vector model training component for training the Chinese and English literature data separately with the Skip-Gram model of the Word2Vec model to obtain and store word vector models; and a document vector generation component, which feeds each word of a document into the word vector model to predict its word vector, calculates its IDF value as a weight, and finally combines the weighted word vectors into the document vector;
the partner relation graph clustering module, comprising a partner relation graph construction component for reading the document data under the name of the author to be disambiguated and constructing the collaboration network graph, in which single-author documents are stored separately as their own cluster divisions; a similarity calculation component for calculating the path similarity value of each author node to be disambiguated and constructing the similarity matrix; and a clustering component for performing AP clustering on the basis of the similarity matrix to obtain the final clusters;
the semantic feature clustering module, comprising a cluster data loading component for reading the document vector data of each cluster from the partner relation graph clustering module and the institution data of the authors to be disambiguated; a similarity calculation component for calculating the similarity between document clusters; and a clustering component for clustering the document clusters on the basis of the similarity to obtain the final document cluster division information.
CN202010987031.6A 2020-09-18 2020-09-18 Document author duplicate name disambiguation method and construction system Withdrawn CN112131872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010987031.6A CN112131872A (en) 2020-09-18 2020-09-18 Document author duplicate name disambiguation method and construction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010987031.6A CN112131872A (en) 2020-09-18 2020-09-18 Document author duplicate name disambiguation method and construction system

Publications (1)

Publication Number Publication Date
CN112131872A true CN112131872A (en) 2020-12-25

Family

ID=73841419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010987031.6A Withdrawn CN112131872A (en) 2020-09-18 2020-09-18 Document author duplicate name disambiguation method and construction system

Country Status (1)

Country Link
CN (1) CN112131872A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650852A (en) * 2021-01-06 2021-04-13 广东泰迪智能科技股份有限公司 Event merging method based on named entity and AP clustering
CN112836518B (en) * 2021-01-29 2023-12-26 华南师范大学 Method, system and storage medium for processing name disambiguation model
CN112836518A (en) * 2021-01-29 2021-05-25 华南师范大学 Name disambiguation model processing method, system and storage medium
CN113111178A (en) * 2021-03-04 2021-07-13 中国科学院计算机网络信息中心 Method and device for disambiguating homonymous authors based on expression learning without supervision
CN113255324B (en) * 2021-03-09 2022-02-18 西安循数信息科技有限公司 Method for disambiguating inventor names in patent data
CN113255324A (en) * 2021-03-09 2021-08-13 西安循数信息科技有限公司 Method for disambiguating inventor names in patent data
CN112765418B (en) * 2021-04-08 2022-04-01 中译语通科技股份有限公司 Alias merging and storing method, system, terminal and medium based on graph structure
CN112765418A (en) * 2021-04-08 2021-05-07 中译语通科技股份有限公司 Alias merging and storing method, system, terminal and medium based on graph structure
CN112835852A (en) * 2021-04-20 2021-05-25 中译语通科技股份有限公司 Character duplicate name disambiguation method, system and equipment for improving filing-by-filing efficiency
CN112835852B (en) * 2021-04-20 2021-08-17 中译语通科技股份有限公司 Character duplicate name disambiguation method, system and equipment for improving filing-by-filing efficiency
CN113434659B (en) * 2021-06-17 2023-03-17 天津大学 Implicit conflict sensing method in collaborative design process
CN113434659A (en) * 2021-06-17 2021-09-24 天津大学 Implicit conflict sensing method in collaborative design process
CN113780001B (en) * 2021-08-12 2023-12-15 北京工业大学 Visual analysis method for academic paper homonymy disambiguation
CN113780001A (en) * 2021-08-12 2021-12-10 北京工业大学 Visual analysis method for homonymous disambiguation of academic papers
CN114328488A (en) * 2021-12-27 2022-04-12 中科大数据研究院 Chinese and English literature author name fusion disambiguation method
CN114328488B (en) * 2021-12-27 2023-03-14 中科大数据研究院 Chinese and English literature author name fusion disambiguation method
CN116776854A (en) * 2023-08-25 2023-09-19 湖南汇智兴创科技有限公司 Online multi-version document content association method, device, equipment and medium
CN116776854B (en) * 2023-08-25 2023-11-03 湖南汇智兴创科技有限公司 Online multi-version document content association method, device, equipment and medium
CN117312565A (en) * 2023-11-28 2023-12-29 山东科技大学 Literature author name disambiguation method based on relation fusion and representation learning
CN117312565B (en) * 2023-11-28 2024-02-06 山东科技大学 Literature author name disambiguation method based on relation fusion and representation learning
CN117610541A (en) * 2024-01-17 2024-02-27 之江实验室 Author disambiguation method and device for large-scale data and readable storage medium

Similar Documents

Publication Publication Date Title
CN112131872A (en) Document author duplicate name disambiguation method and construction system
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN111914558A (en) Course knowledge relation extraction method and system based on sentence bag attention remote supervision
CN110750995B (en) File management method based on custom map
Fengmei et al. FSFP: Transfer learning from long texts to the short
CN108287911A (en) A kind of Relation extraction method based on about fasciculation remote supervisory
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN114443855A (en) Knowledge graph cross-language alignment method based on graph representation learning
CN113610626A (en) Bank credit risk identification knowledge graph construction method and device, computer equipment and computer readable storage medium
CN111597330A (en) Intelligent expert recommendation-oriented user image drawing method based on support vector machine
Sun et al. Important attribute identification in knowledge graph
Zadgaonkar et al. An Approach for Analyzing Unstructured Text Data Using Topic Modeling Techniques for Efficient Information Extraction
Song et al. Research on intelligent question answering system based on college enrollment
Ziv et al. CompanyName2Vec: Company Entity Matching Based on Job Ads
Rao et al. Automatic identification of concepts and conceptual relations from patents using machine learning methods
Sun et al. Generalized abbreviation prediction with negative full forms and its application on improving chinese web search
Lu et al. Overview of knowledge mapping construction technology
Chen Natural language processing in web data mining
Yongmei et al. Research on Domain-independent Opinion Target Extraction
CN117407511B (en) Electric power safety regulation intelligent question-answering method and system based on Bert model
CN111241283B (en) Rapid characterization method for portrait of scientific research student
Wang et al. Research and implementation of SVM and bootstrapping fusion algorithm in emotion analysis of stock review texts
Azeroual A text and data analytics approach to enrich the quality of unstructured research information
CN111259166B (en) Scientific research entity linking method and device based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201225)