CN114328488B

CN114328488B - Chinese and English literature author name fusion disambiguation method

Info

Publication number: CN114328488B
Application number: CN202111615229.2A
Authority: CN
Inventors: 贾士杨; 冯凯; 王元卓; 彭亮
Original assignee: China Science And Technology Big Data Research Institute
Current assignee: China Science And Technology Big Data Research Institute
Priority date: 2021-12-27
Filing date: 2021-12-27
Publication date: 2023-03-14
Anticipated expiration: 2041-12-27
Also published as: CN114328488A

Abstract

The invention belongs to the technical field of name disambiguation, and particularly relates to a fusion disambiguation method for the names of authors of Chinese and English documents. The method disambiguates Chinese author names and English author names on the basis of semantic fingerprints, author collaboration network similarity, author citation network similarity and the like, and then uses the Chinese and English disambiguation results to disambiguate the pinyin of Chinese authors against the names appearing in English documents. It can accurately determine whether the authors of different documents are the same person, identifies the same author across Chinese and English documents, quickly locates the author being sought, achieves high accuracy, and facilitates retrieval work. In addition, the method introduces a similarity measure over the authors' research durations, which assists in disambiguating the English names of Chinese authors; it can also determine an author's age range and filter out same-name authors outside that range, improving disambiguation accuracy.

Description

Chinese and English literature author name fusion disambiguation method

Technical Field

The invention belongs to the technical field of name disambiguation, and particularly relates to a name disambiguation method for authors of Chinese and English documents.

Background

With the rapid development of the internet, large numbers of scientific documents such as papers and patents continue to appear. When people search these massive collections for useful information, a common approach is to search by author name and retrieve all documents that the author has published. In doing so, however, they find that many authors share the same name, so the author being sought is difficult to locate quickly, which hinders their work.

The ambiguity of author names in literature is a long-standing problem, and mainly takes the following forms:

1. Chinese author name ambiguity. For example, "Zhang Wei" is a very common name; when people with this name publish documents such as papers and patents, the byline is always "Zhang Wei", and it is hard to tell from the documents which "Zhang Wei" published them.

2. English author name ambiguity. As with Chinese authors, many different people share the same English name, and distinguishing them is likewise difficult.

3. English names of Chinese authors. With the internationalization of academia, domestic authors publish more and more documents in international journals and conferences, and their bylines mostly use pinyin, such as "Zhang San" or "San Zhang". Because pinyin does not preserve the Chinese characters, "Zhang San" can correspond to several different Chinese names, and "Lin Yang" can correspond either to a person surnamed Lin or to a person surnamed Yang. In such cases it is even harder to tell which person is meant, and when an author's Chinese and English academic output is evaluated together, the results often lack rigor and practical value.

In view of the above problems, name disambiguation is a pressing difficulty in constructing a document knowledge base and performing document retrieval, and solving it is of great significance and value.

Disclosure of Invention

Aiming at the defects and problems existing in the existing Chinese and English author name distinction, the invention provides a Chinese and English literature author name disambiguation method.

The technical scheme adopted by the invention for solving the technical problems is as follows: a Chinese and English literature author name fusion disambiguation method comprises the following steps:

Step one, Chinese document author name disambiguation, comprising the following steps:

S1, cleaning author names: removing symbols from the author name, and converting the author name into the "surname + first name" format according to common surnames;

S2, cleaning the author's institution: normalizing each author institution to the name of the parent institution to which it belongs;

S3, comparing the authors of the Chinese documents pairwise to judge whether their names are the same;

if the names are different, the results are aggregated to obtain the Chinese disambiguation result;

if the names are the same, respectively calculating the institution similarity, collaboration network similarity, citation network similarity and document content similarity, and judging from these results whether the two are the same author; the judgment standard is as follows:

if the institution similarity is greater than or equal to 0.9 and at least one of the three similarities (author collaboration network similarity, author citation network similarity and document content similarity) is greater than 0.8, the two are considered the same person;

if the institution similarity is less than 0.9 and two or more of the three similarities (author collaboration network similarity, author citation network similarity and document content similarity) are greater than 0.8, the two are considered the same person;

(1) if the two are the same author, they are marked with the same author ID, and the results of the pairwise calculations are aggregated to obtain the Chinese disambiguation result;

(2) if not, the results are aggregated to obtain the Chinese disambiguation result.
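The two-branch judgment standard above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_same_author` and its parameters are hypothetical, the similarity of the authors' institutions is passed as `institution_sim`, and the remaining similarities are passed as a list so that the same sketch also fits the case where a fourth similarity is added.

```python
from typing import Sequence


def is_same_author(institution_sim: float,
                   other_sims: Sequence[float],
                   inst_threshold: float = 0.9,
                   sim_threshold: float = 0.8) -> bool:
    """Apply the judgment standard to one pair of same-name records.

    other_sims holds the supporting similarities (collaboration network,
    citation network, document content, ...). Names are hypothetical.
    """
    # Count supporting similarities that are strictly greater than 0.8.
    strong = sum(1 for s in other_sims if s > sim_threshold)
    # High institution similarity needs one supporting signal, otherwise two.
    required = 1 if institution_sim >= inst_threshold else 2
    return strong >= required
```

Passing the similarities as a list keeps the rule itself independent of how many supporting signals a given step uses.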

Step two, English document author name disambiguation, comprising the following steps:

S1, cleaning author names: removing symbols from the author name, and uniformly converting the pinyin of the author name into the "first name + surname" format;

S2, cleaning the author's institution: removing symbols from the institution name, and expanding abbreviated institution names to their full form;

S3, comparing the authors of the English documents pairwise to judge whether their names are the same;

if the names are different, the results are aggregated to obtain the English disambiguation result;

if the names are the same, respectively calculating the institution similarity, collaboration network similarity, citation network similarity and document content similarity, and judging from these results whether the two are the same author; the judgment standard is as follows:

if the institution similarity is less than 0.9 and two or more of the three similarities (author collaboration network similarity, author citation network similarity and document content similarity) are greater than 0.8, the two are considered the same person;

(1) if the two are the same author, they are marked with the same author ID, and the results of the pairwise calculations are aggregated to obtain the English disambiguation result;

(2) if not, the results are aggregated to obtain the English disambiguation result.

Step three, Chinese and English author name fusion disambiguation, comprising the following steps:

S1, converting the Chinese document authors, and the Chinese authors in cited documents, obtained from the Chinese disambiguation result into pinyin in the "first name + surname" format, translating the institutions to which the Chinese authors belong into English, and grouping by author ID;

S2, grouping the English disambiguation result by author ID;

S3, comparing the names of the Chinese and English document authors pairwise to judge whether they are the same;

if the names are different, the results are aggregated to complete the name disambiguation of Chinese and English document authors;

if the names are the same, respectively calculating the institution similarity, collaboration network similarity, citation network similarity, document content similarity and research duration similarity of the Chinese and English documents, and judging from these results whether the documents belong to the same author; the judgment standard is as follows:

if the institution similarity is greater than or equal to 0.9 and at least one of the four similarities (author collaboration network similarity, author citation network similarity, document content similarity and research duration similarity) is greater than 0.8, the authors are considered the same person;

if the institution similarity is less than 0.9 and two or more of the four similarities (author collaboration network similarity, author citation network similarity, document content similarity and research duration similarity) are greater than 0.8, the authors are considered the same person;

(1) if the Chinese document and the English document have the same author, the record is marked with the author ID of the English document, completing the name disambiguation of Chinese and English authors;

(2) if the Chinese document and the English document do not have the same author, the results are aggregated to complete the name disambiguation of Chinese and English document authors.

In the above method for fusion disambiguation of Chinese and English document author names, calculating the document content similarity for Chinese document authors in step one comprises the following steps:

(1) concatenating the title, the abstract and the keywords into a character string E;

(2) segmenting the character string E with jieba and extracting keywords based on the TF-IDF algorithm, and taking the Top 10 words and their weights to generate a {word + weight} array F;

(3) converting the weights in array F into integer weights from 1 to 5 to obtain a converted {word + weight} array G; the conversion criteria are:

weight less than 0.2: converted to 1;

weight greater than or equal to 0.2 and less than 0.4: converted to 2;

weight greater than or equal to 0.4 and less than 0.6: converted to 3;

weight greater than or equal to 0.6 and less than 0.8: converted to 4;

weight greater than or equal to 0.8: converted to 5;

(4) calculating the hash value of array G using SimHash to obtain the semantic fingerprint H of the text;

(5) calculating semantic fingerprints H1 and H2 of two Chinese documents whose authors have the same name according to steps (1) to (4);

(6) calculating the content similarity of the two documents from the Hamming distance between H1 and H2, according to the following standard:

Hamming distance = 0: similarity = 1;

Hamming distance = 1: similarity = 0.9;

Hamming distance = 2: similarity = 0.8;

Hamming distance >= 3: similarity = 0;

if the Hamming distance is greater than or equal to 3, the two documents are dissimilar;

if the Hamming distance is less than 3, the two documents are similar.
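A minimal sketch of steps (3) to (6) follows. It assumes the TF-IDF keyword weights have already been extracted, so the input is a plain word-to-weight mapping. The 64-bit fingerprint length and the use of MD5 as the per-word hash are assumptions made for illustration; the text does not specify either. All function names are hypothetical.

```python
import hashlib
from typing import Dict


def bucket_weight(weight: float) -> int:
    """Map a TF-IDF weight to an integer weight from 1 to 5."""
    if weight < 0.2:
        return 1
    if weight < 0.4:
        return 2
    if weight < 0.6:
        return 3
    if weight < 0.8:
        return 4
    return 5


def simhash(weighted_words: Dict[str, int], bits: int = 64) -> int:
    """Compute a SimHash fingerprint from {word: integer weight}."""
    acc = [0] * bits
    for word, weight in weighted_words.items():
        # Assumed per-word hash: first bits/8 bytes of the MD5 digest.
        digest = hashlib.md5(word.encode("utf-8")).digest()[:bits // 8]
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def content_similarity(distance: int) -> float:
    """Convert a Hamming distance to a similarity using the table above."""
    return {0: 1.0, 1: 0.9, 2: 0.8}.get(distance, 0.0)


# Hypothetical keyword weights for two documents.
doc1 = {w: bucket_weight(x) for w, x in {"disambiguation": 0.85, "author": 0.45}.items()}
doc2 = dict(doc1)
similarity = content_similarity(hamming_distance(simhash(doc1), simhash(doc2)))
```

Because two fingerprints are compared with a single XOR and bit count, no trained model is needed to compare documents.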

In the above method for fusion disambiguation of Chinese and English document author names, calculating the document content similarity for English document authors in step two comprises the following steps:

(1) concatenating the title, the abstract and the keywords into a character string E';

(2) extracting keywords from the character string E' based on the TF-IDF algorithm using NLTK, and taking the Top 10 words and their weights to generate a {word + weight} array F';

(3) converting the weights in array F' into integer weights from 1 to 5 to obtain a converted {word + weight} array G'; the conversion criteria are:

weight less than 0.2: converted to 1;

weight greater than or equal to 0.2 and less than 0.4: converted to 2;

weight greater than or equal to 0.4 and less than 0.6: converted to 3;

weight greater than or equal to 0.6 and less than 0.8: converted to 4;

weight greater than or equal to 0.8: converted to 5;

(4) calculating the hash value of array G' using SimHash to obtain the semantic fingerprint H' of the text;

(5) calculating semantic fingerprints H1' and H2' of two English documents whose authors have the same name according to steps (1) to (4);

(6) calculating the content similarity of the two documents from the Hamming distance between H1' and H2':

if the Hamming distance is less than 3, the two documents are similar.

In the above method for fusion disambiguation of Chinese and English document author names, calculating the similarity of the contents of the documents published by the authors in step three comprises the following steps:

S1, calculating the semantic fingerprint of the Chinese documents, comprising the following steps:

(1) grouping all Chinese data by author ID after Chinese author disambiguation, where one author may correspond to several documents; according to the grouping result, merging the abstract, title and keywords of each document of the same author, the merged results being recorded as A1, A2, A3, ...;

(2) matching A1, A2, A3, ... against a Chinese research topic set Topic_zh to obtain the Chinese research topics each contains and their occurrence counts, generating {Chinese research topic + occurrence count} arrays B1, B2, B3, ...;

(3) converting the Chinese research topics in B1, B2, B3, ... into English using zh_to_en, generating {English research topic + occurrence count} arrays C1_zh_to_en, C2_zh_to_en, C3_zh_to_en, ...;

(4) merging C1_zh_to_en, C2_zh_to_en, C3_zh_to_en, ..., adding the occurrence counts of identical research topics, and taking the 10 research topics with the most occurrences to obtain the final {English research topic + occurrence count} array C_zh_to_en;

(5) calculating the hash value of C_zh_to_en using SimHash to obtain the semantic fingerprint D_zh of the Chinese documents;

S2, calculating the semantic fingerprint of the English documents, comprising the following steps:

(1) grouping all English data by author ID after English author disambiguation, where one author may correspond to several documents; according to the grouping result, merging the abstract, title and keywords of each document of the same author, the merged results being recorded as A1', A2', A3', ...;

(2) matching A1', A2', A3', ... against an English research topic set Topic_en to obtain the English research topics each contains and their occurrence counts, generating {English research topic + occurrence count} arrays B1', B2', B3', ...;

(3) merging B1', B2', B3', ..., adding the occurrence counts of identical research topics, and taking the 10 research topics with the most occurrences to obtain the final {English research topic + occurrence count} array C_en;

(4) calculating the hash value of C_en using SimHash to obtain the semantic fingerprint D_en of the English documents;

S3, calculating the document content similarity from the Hamming distance between D_zh and D_en:

if the Hamming distance between D_zh and D_en is less than 3, the contents of the two sets of documents are similar;

if the Hamming distance between D_zh and D_en is greater than or equal to 3, the contents of the two sets of documents are not similar.
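The merging step, in which per-document topic counts are summed and the 10 most frequent topics are kept, can be sketched with `collections.Counter`. The function name `top_topics` and the sample topics are hypothetical, and the topic matching and Chinese-to-English conversion are assumed to have been done already.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_topics(per_document_counts: Iterable[Dict[str, int]],
               k: int = 10) -> List[Tuple[str, int]]:
    """Merge {topic: occurrence count} arrays and keep the k most frequent."""
    merged: Counter = Counter()
    for counts in per_document_counts:
        # Counter.update adds the counts of identical topics together.
        merged.update(counts)
    return merged.most_common(k)


# Hypothetical per-document topic counts for one author.
author_topics = top_topics(
    [{"name disambiguation": 2, "knowledge graph": 1},
     {"name disambiguation": 1, "text mining": 5}],
    k=2,
)
```

The resulting list of topic and count pairs plays the role of the array that is then hashed into the author-level semantic fingerprint.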

In the above method for fusion disambiguation of Chinese and English document author names, in step three the research duration similarity is calculated for documents in which the pinyin of the Chinese author's name is the same as the author name in the English document, the calculation comprising the following steps:

(1) grouping all Chinese data by author ID after Chinese author disambiguation; according to the grouping result, finding the earliest-published document among all Chinese documents of the author, and taking the difference between its publication year and the current year as the research duration R_zh of the Chinese document author;

(2) grouping all English data by author ID after English author disambiguation; according to the grouping result, finding the earliest-published document among all English documents of the author, and taking the difference between its publication year and the current year as the research duration R_en of the English document author;

(3) if the pinyin of the Chinese document author is the same as the author name in the English document, calculating the difference R_diff between R_zh and R_en;

(4) calculating the research duration similarity according to the following standard:

R_diff = 0: similarity = 1;

1 <= R_diff <= 2: similarity = 0.9;

3 <= R_diff <= 4: similarity = 0.8;

R_diff > 4: similarity = 0.
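The research duration calculation can be sketched as below. The function names are hypothetical, publication times are assumed to be whole years, and taking the absolute value of the difference is an assumption, since the text only speaks of "a difference".

```python
from typing import Iterable


def research_duration(publication_years: Iterable[int], current_year: int) -> int:
    """Years elapsed since the author's earliest publication."""
    return current_year - min(publication_years)


def duration_similarity(r_zh: int, r_en: int) -> float:
    """Map the difference between two research durations to a similarity."""
    r_diff = abs(r_zh - r_en)  # assumption: the sign of the difference is ignored
    if r_diff == 0:
        return 1.0
    if r_diff <= 2:
        return 0.9
    if r_diff <= 4:
        return 0.8
    return 0.0


# Hypothetical publication years for one author ID on each side.
r_zh = research_duration([2015, 2010, 2019], current_year=2021)
r_en = research_duration([2012, 2018], current_year=2021)
similarity = duration_similarity(r_zh, r_en)
```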

In the above method for fusion disambiguation of Chinese and English document author names, in step one, step two and step three the Jaccard similarity coefficient is used to calculate both the collaboration network similarity and the citation network similarity.

The invention has the following beneficial effects: the method can accurately determine whether the authors of different documents are the same person, identifies the same author across Chinese and English documents, quickly locates the author being sought, achieves high accuracy, and facilitates retrieval work.

The method compares document similarity on the basis of semantic fingerprints, which greatly simplifies the comparison process and improves its efficiency; compared with model-based document similarity calculation, it avoids model training and saves training resources.

The method introduces a similarity measure over the authors' research durations, which assists in disambiguating the English names of Chinese authors; the research duration also makes it possible to determine an author's age range and filter out same-name authors outside that range, improving disambiguation accuracy.

Drawings

FIG. 1 is a flow chart of Chinese author name disambiguation in the present invention.

FIG. 2 is a flow chart of English author name disambiguation in the present invention.

FIG. 3 is a flow chart of Chinese and English author name fusion disambiguation in the present invention.

Detailed Description

The method disambiguates same-name authors mainly on the basis of the following aspects:

(1) Whether the authors' institutions are the same: if the institutions are the same, the probability that the authors are the same person is higher.

(2) Author collaboration network similarity: experts and scholars generally have stable research collaboration teams, so if the collaboration networks of two authors are highly similar, the two are very likely the same person.

(3) Author citation network similarity: a citation network is the set of authors of the documents that an author cites; an author normally cites documents by the same group of other authors because they work in the same research field. In addition, for a supervisor, each student has a different name, the students' collaboration networks have very low similarity, and their research directions may differ greatly; but they share a common feature: the students' papers very likely carry the supervisor's name and very likely cite papers by the supervisor or by fellow students, so citation network similarity can be used for name disambiguation.

(4) Similarity of the contents of the documents published by the authors: the research content of one author generally does not change greatly, so the content similarity of two documents helps determine whether they are by the same author.

(5) Similarity of the authors' research durations: the research duration in this application is defined as the time elapsed since the author's earliest published document. If the research durations are the same, the probability that the authors are the same person is higher.

The invention is further illustrated by the following examples in conjunction with the drawings. It is noted that all letters appearing in the present invention are exemplary representations.

Example 1: this embodiment provides a method for fusion disambiguation of Chinese and English document author names, which comprises Chinese author name disambiguation, English author name disambiguation, and disambiguation of the pinyin of Chinese authors against the names in English documents, wherein:

1. Chinese author name disambiguation, as shown in FIG. 1, comprises the following steps:

Step one, cleaning author names: removing symbols (including spaces, semicolons, commas and other symbols) from the author name, and converting the author name according to common surnames so that it is uniformly in the "surname + first name" format; for example, "Chong Lin" is converted into "Lin Chong".

Step two, cleaning the author's institution: normalizing each author institution to the name of the parent institution to which it belongs; for example, "xx department of xx hospital" is normalized to "xx hospital", and "xx college of xx university" is normalized to "xx university"; company-type suffixes such as "limited company", "joint-stock limited company" and "group joint-stock limited company" are removed, for example "Alibaba Group Joint-Stock Limited Company" is converted into "Alibaba".

Step three, calculating the institution similarity of the authors:

taking the institutions of two articles by same-name authors as feature values, calculating the frequency of every token in the feature values to generate term frequency vectors, and then calculating the similarity of the term frequency vectors with the cosine similarity formula to obtain the institution similarity;

Example:

1) The two institutions are "Xiaomi Technology Limited Liability Company" and "Guangdong Xiaomi Technology Limited Liability Company";

2) After the institution cleaning of step two, the two feature values are "Xiaomi Technology" (小米科技) and "Guangdong Xiaomi Technology" (广东小米科技);

3) The tokens of the feature value "Xiaomi Technology", one per character, are: [小, 米, 科, 技]

4) The tokens of the feature value "Guangdong Xiaomi Technology" are: [广, 东, 小, 米, 科, 技]

5) The merged tokens of the two feature values are: [广, 东, 小, 米, 科, 技]

6) The term frequency vector of the feature value "Xiaomi Technology" is: [0,0,1,1,1,1]

7) The term frequency vector of the feature value "Guangdong Xiaomi Technology" is: [1,1,1,1,1,1]

8) Calculating the similarity of the two term frequency vectors a and b according to the cosine similarity algorithm:

$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \frac{4}{\sqrt{4}\,\sqrt{6}} \approx 0.816$$
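The term frequency and cosine calculation in this example can be sketched as follows. The function names are hypothetical. The first call reuses the two six-element term frequency vectors from the example; the second builds vectors from whitespace-separated placeholder tokens, with the merged vocabulary sorted alphabetically (an arbitrary but deterministic ordering chosen for this sketch).

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def term_frequency_vectors(tokens_a: Sequence[str],
                           tokens_b: Sequence[str]) -> Tuple[List[int], List[int]]:
    """Build aligned term frequency vectors over the merged vocabulary."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = sorted(set(freq_a) | set(freq_b))
    return ([freq_a[t] for t in vocabulary],
            [freq_b[t] for t in vocabulary])


def cosine_similarity(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0


# Vectors taken from the worked example above.
sim_example = cosine_similarity([0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1])

# Vectors built from placeholder institution tokens.
vec_a, vec_b = term_frequency_vectors("AA BB CD".split(), "AA CD EE".split())
sim_tokens = cosine_similarity(vec_a, vec_b)
```

For the example vectors the result is 4 / (2 × √6), roughly 0.816, which is below the 0.9 institution threshold, so two further similarities would have to exceed 0.8 for the pair to be merged.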

Step four, calculating the similarity of the author collaboration networks:

if the collaboration networks of two same-name authors are A and B respectively, the similarity is given by the Jaccard similarity coefficient:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Example: collaboration network A is [a, b, c] and collaboration network B is [a, c, d, e]; according to the collaboration network similarity formula, the similarity is:

$$J(A, B) = \frac{|\{a, c\}|}{|\{a, b, c, d, e\}|} = \frac{2}{5} = 0.4$$
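Since the method states that the Jaccard similarity coefficient is used for both the collaboration network and the citation network, the example can be checked with a short sketch. The function name is hypothetical, and returning 0 for two empty networks is an assumption; the text does not cover that case.

```python
from typing import Iterable


def jaccard_similarity(network_a: Iterable[str], network_b: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(network_a), set(network_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty networks carry no evidence
    return len(set_a & set_b) / len(union)


# Networks from the example: shared members {a, c}, five members in total.
sim = jaccard_similarity(["a", "b", "c"], ["a", "c", "d", "e"])
```

The same function applies unchanged to citation networks, where the set members are cited authors instead of co-authors.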

Step five, calculating the similarity of the author citation networks:

the citation network similarity algorithm is the same as the collaboration network similarity algorithm; if the citation networks of two same-name authors are C and D respectively, the similarity is:

$$J(C, D) = \frac{|C \cap D|}{|C \cup D|}$$

Step six, calculating the similarity of the contents of the documents published by the authors, comprising the following:

S1, the title, abstract and keywords are used to calculate the content similarity, and are concatenated into a character string E;

S2, segmenting the character string E with jieba and extracting keywords based on the TF-IDF algorithm, and taking the Top 10 words and their weights to generate a {word + weight} array, recorded as F;

S3, converting the weights in array F into integer weights from 1 to 5, the converted {word + weight} array being recorded as G; the conversion standard is as follows:

weight less than 0.2: converted to 1;

weight greater than or equal to 0.2 and less than 0.4: converted to 2;

weight greater than or equal to 0.4 and less than 0.6: converted to 3;

weight greater than or equal to 0.6 and less than 0.8: converted to 4;

weight greater than or equal to 0.8: converted to 5.

S4, calculating the hash value of array G using SimHash to obtain the semantic fingerprint H of the text;

S5, calculating semantic fingerprints H1 and H2 of two documents whose authors have the same name according to steps S1 to S4;

S6, calculating the content similarity of the two documents from the Hamming distance between H1 and H2, according to the following standard:

Hamming distance = 0: similarity = 1;

Hamming distance = 1: similarity = 0.9;

Hamming distance = 2: similarity = 0.8;

Hamming distance >= 3: similarity = 0;

if the Hamming distance is less than 3, the two documents are similar.

Step seven, judging whether the authors are the same person:

if the institution similarity is less than 0.9 and two or more of the three similarities (author collaboration network similarity, author citation network similarity and document content similarity) are greater than 0.8, the two are considered the same person.

Step eight, calculating the similarity of the documents of same-name authors pairwise according to steps one to seven and judging whether they are the same person; marking the same person with the same author ID, and aggregating the results of the pairwise calculations to complete Chinese author name disambiguation.
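One way to turn the pairwise judgments into shared author IDs is a union-find structure, sketched below. This is an assumption made for illustration: the text only says that the same person is marked with the same author ID and that the pairwise results are aggregated, and it does not say how chains of matches are resolved. Here matches are treated as transitive. The names `assign_author_ids` and `same_person` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

Record = TypeVar("Record")


def assign_author_ids(records: Sequence[Record],
                      same_person: Callable[[Record, Record], bool]) -> List[int]:
    """Compare same-name records pairwise and give matching records one ID."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if same_person(records[i], records[j]):
            parent[find(j)] = find(i)  # merge the two groups

    return [find(i) for i in range(len(records))]


# Toy usage: records "match" when their first character is equal.
ids = assign_author_ids(["x1", "x2", "y1"], lambda a, b: a[0] == b[0])
```

In practice `same_person` would wrap the similarity calculations and the judgment standard of step seven.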

2. English document author name disambiguation, as shown in FIG. 2, comprises the following steps:

Step one, cleaning author names:

removing symbols such as spaces, semicolons and commas from the author name, and uniformly converting the pinyin of the author name into the "first name + surname" format; for example, "Wang Yuanzhuo" is converted into "Yuanzhuo Wang".
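A minimal sketch of this name cleaning is given below. It assumes two-part pinyin names and a small, purely illustrative surname list (`COMMON_SURNAMES`); a real list of common surnames would be far larger. When both parts look like surnames the order is left unchanged, because the sketch has no way to decide which part is the surname.

```python
import re

# Hypothetical, illustrative subset of common surnames in pinyin.
COMMON_SURNAMES = {"wang", "zhang", "li", "lin", "yang", "feng", "jia", "peng"}


def normalize_pinyin_name(raw: str) -> str:
    """Strip symbols and return the name as 'first name + surname'."""
    # Split on anything that is not a letter (spaces, commas, semicolons, ...).
    parts = [p for p in re.split(r"[^A-Za-z]+", raw) if p]
    if len(parts) == 2:
        first, second = parts
        # Swap only when the leading part is clearly the surname.
        if first.lower() in COMMON_SURNAMES and second.lower() not in COMMON_SURNAMES:
            parts = [second, first]
    return " ".join(p.capitalize() for p in parts)


cleaned = normalize_pinyin_name("Wang, Yuanzhuo;")
```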

Step two, cleaning the author's institution:

removing symbols from the institution name, and expanding abbreviated institution names to their full form; for example, "Univ" is expanded to "University".

Step three, calculating the institution similarity of the authors:

taking the institutions of two articles by same-name authors as feature values, calculating the frequency of every word in the feature values to generate term frequency vectors, and calculating the similarity of the term frequency vectors with the cosine similarity formula to obtain the institution similarity;

Example:

1) The two institutions are "AA BB CD" and "AA CD EE";

2) The words of the feature value "AA BB CD" (split on spaces) are: [AA, BB, CD];

3) The words of the feature value "AA CD EE" (split on spaces) are: [AA, CD, EE];

4) The merged words of the two feature values are: [AA, BB, CD, EE];

5) The term frequency vector of the feature value "AA BB CD" is: [1,1,1,0];

6) The term frequency vector of the feature value "AA CD EE" is: [1,0,1,1];

7) Calculating the similarity of the two term frequency vectors a and b according to the cosine similarity algorithm:

$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \frac{2}{\sqrt{3}\,\sqrt{3}} \approx 0.667$$

Step four, calculating the similarity of the author collaboration networks:

from the collaboration networks A' and B' of the same-name authors, the author collaboration network similarity is calculated as the Jaccard similarity coefficient:

$$J(A', B') = \frac{|A' \cap B'|}{|A' \cup B'|}$$

Step five, calculating the similarity of the author citation networks:

the citation network similarity algorithm is the same as the collaboration network similarity algorithm; from the citation networks C' and D' of the same-name authors, the author citation network similarity is calculated as:

$$J(C', D') = \frac{|C' \cap D'|}{|C' \cup D'|}$$

Step six, calculating the similarity of the contents of the documents published by the authors:

S1, in a document, the title, abstract and keywords carry the most information and the most accurate information, so they are used to calculate the content similarity; the title, abstract and keywords are concatenated into a character string E';

S2, segmenting the character string E' with NLTK, calculating the TF-IDF value of each word as its weight, and taking the Top 10 words by weight together with their weights to generate a {word + weight} array, recorded as F';

Calculating the TF-IDF value:

(1) TF-IDF is a statistical method for evaluating how important a word is to one document within a document collection or corpus.

Wherein:

a larger TF indicates that the word occurs more frequently and is more important in the article;

with the total number of documents in the corpus fixed, the fewer documents contain the word, the larger the IDF; the word is then more distinctive and has better category-distinguishing capability.

TF-IDF = TF (term frequency) × IDF (inverse document frequency), that is, TF-IDF is the product of the term frequency TF and the inverse document frequency IDF.

(2) 10000 documents are randomly extracted from all English documents to serve as a corpus;

(3) the character string E' is segmented with NLTK; taking one of the resulting words, word1, as an example, its TF-IDF is calculated as follows:

counting the number of occurrences of word1 in the character string E', recorded as w_count1; then TF_1 = w_count1 ÷ (total number of words in E');

searching 10000 documents in a corpus to find out how many documents word1 appears, and recording the number as

the TF-IDF of word1 = TF_1 × IDF_1.
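The TF-IDF calculation for a single word can be sketched as follows. The term frequency follows the definition given above. The inverse document frequency formula is not legible in this text, so the smoothed logarithmic form used here, log(N / (1 + number of documents containing the word)), is an assumption and not necessarily the formula the authors use. The function name and the toy corpus are hypothetical, and tokenization is assumed to have been done already.

```python
import math
from typing import Collection, Sequence


def tf_idf(word: str,
           document_tokens: Sequence[str],
           corpus: Sequence[Collection[str]]) -> float:
    """TF-IDF of one word in one tokenized document."""
    # TF: occurrences of the word divided by the total number of words.
    tf = document_tokens.count(word) / len(document_tokens)
    # Number of corpus documents that contain the word.
    doc_count = sum(1 for doc in corpus if word in doc)
    # ASSUMED IDF form with +1 smoothing; the source formula is not shown.
    idf = math.log(len(corpus) / (1 + doc_count))
    return tf * idf


# Toy data: a four-word document and a four-document corpus.
tokens = ["graph", "neural", "graph", "name"]
corpus = [{"graph"}, {"name"}, {"name", "graph"}, {"other"}]
weight = tf_idf("graph", tokens, corpus)
```

With this smoothing a word that appears in every corpus document gets a slightly negative weight, so such words naturally fall out of the Top 10.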

S3, converting the weights in the number group F 'into integer weights of 1-5, and marking the converted "{ word + weight } array" as G', wherein the conversion mode is as follows:

weight less than 0.2: converted to 1

weight greater than or equal to 0.2 and less than 0.4: converted to 2

weight greater than or equal to 0.4 and less than 0.6: converted to 3

weight greater than or equal to 0.6 and less than 0.8: converted to 4

weight greater than or equal to 0.8: converted to 5;
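The conversion table above maps directly onto a small function. The sketch below is illustrative; the name `to_integer_weight` is hypothetical.

```python
def to_integer_weight(weight):
    """Map a TF-IDF weight to an integer weight of 1-5 using the table above."""
    if weight < 0.2:
        return 1
    if weight < 0.4:
        return 2
    if weight < 0.6:
        return 3
    if weight < 0.8:
        return 4
    return 5
```

Explicit comparisons are used instead of a formula such as `int(weight / 0.2) + 1`, because floating-point division can place a boundary value like 0.6 in the wrong bucket.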

S4, the hash value of G′ is calculated using SimHash; this is the semantic fingerprint of the text, denoted H′;

S5, the semantic fingerprints of two documents of the same author are calculated according to steps S1 to S4 and denoted H1′ and H2′;

S6, the Hamming distance between H1′ and H2′ is calculated: if it is less than 3, the contents of the two documents are similar; otherwise they are not similar. The Hamming distance is converted to a similarity as follows:

hamming distance =0, similarity =1;

hamming distance =1, similarity =0.9;

hamming distance =2, similarity =0.8;

hamming distance > =3, similarity =0.
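Steps S4 to S6 can be sketched as below. The text does not state the fingerprint width or the per-word hash function, so the 64-bit width and the use of MD5 are assumptions made only for illustration; `simhash`, `hamming_distance` and `hamming_to_similarity` are hypothetical names.

```python
import hashlib


def simhash(weighted_words, bits=64):
    """SimHash of (word, integer_weight) pairs; 64 bits and MD5 are assumptions."""
    totals = [0] * bits
    for word, weight in weighted_words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the word's hash bit is 1, subtract it otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def hamming_to_similarity(distance):
    """Conversion table from S6: 0 -> 1, 1 -> 0.9, 2 -> 0.8, >= 3 -> 0."""
    return {0: 1.0, 1: 0.9, 2: 0.8}.get(distance, 0.0)
```

Two documents whose weighted keyword arrays are identical produce identical fingerprints, so their distance is 0 and their similarity is 1.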

Step seven, judging whether the authors are the same person

Step eight, the same-name authors are aggregated to complete disambiguation

According to steps one to seven, the similarities of the documents of same-name authors are calculated in pairs and it is judged whether the authors are the same person; documents of the same person are marked with the same author ID, and the pairwise results are aggregated to complete the disambiguation of English authors.
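One way to aggregate pairwise "same person" decisions into shared author IDs is a union-find structure. The text does not say how the aggregation is performed, so the sketch below is an assumption; in particular it treats the same-person relation as transitive. The name `assign_author_ids` is hypothetical.

```python
def assign_author_ids(doc_ids, same_person_pairs):
    """Give every document an author ID so that linked documents share one ID."""
    parent = {d: d for d in doc_ids}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_person_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # The representative document ID doubles as the author ID (assumption).
    return {d: find(d) for d in doc_ids}
```

For example, if d1 and d2 are judged to be by the same person, and so are d2 and d3, all three documents receive one author ID while an unlinked d4 keeps its own.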

3. The disambiguation of Chinese author names against pinyin author names in English literature, as shown in FIG. 3, includes the following steps:

Step one, author name conversion

The names were already cleaned during the separate disambiguation of Chinese authors and English authors, and the pinyin names of English-literature authors were normalized to the "first name + last name" format, so no additional cleaning is needed. The Chinese authors of the Chinese documents and the Chinese authors in their cited documents are all converted to pinyin in the "first name + last name" format. If both characters of a Chinese name are common surnames, such as 林杨 (Lin Yang), the name is converted into the pinyin array {"Yang Lin", "Lin Yang"}, and both forms are tried when matching against author names in English documents.

Step two, author institution conversion

The affiliated institutions were already cleaned during English author disambiguation, so no additional cleaning is needed.

For Chinese literature, company-type words are not removed; Google Translate and Wikipedia are used to translate the Chinese institution names into English.

Step three, calculating the author institution similarity

The institutions given in two articles of the same-name author are taken as feature values; the word frequencies of all words in the feature values are calculated to generate word-frequency vectors, and the vector similarity is calculated with the cosine similarity formula. This vector similarity is the institution similarity.
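The word-frequency cosine similarity of step three can be illustrated with the sketch below. Whitespace splitting and lowercasing are assumptions about tokenization, and `institution_similarity` is a hypothetical name.

```python
import math
from collections import Counter


def institution_similarity(inst_a, inst_b):
    """Cosine similarity between word-frequency vectors of two institution names."""
    va = Counter(inst_a.lower().split())
    vb = Counter(inst_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty institution string matches nothing
    return dot / (norm_a * norm_b)
```

Under this sketch, "Tsinghua University" and "Tsinghua University Beijing" score 2/√6, roughly 0.82, which would fall below the 0.9 institution threshold used in step eight.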

Step four, calculating the similarity of the author cooperation network

All Chinese data, after Chinese author disambiguation, are grouped by author ID; one author may correspond to multiple documents. The collaborators in all documents of the same author are merged to generate the author's collaboration network, denoted M;

all English data, after English author disambiguation, are grouped by author ID; one author may correspond to multiple documents. The collaborators in all documents of the same author are merged to generate the author's collaboration network, denoted N;

if the pinyin of the Chinese document author's name is the same as the author name in the English document, the cooperation network similarity is calculated as follows:
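The formula itself is not reproduced in this text, but claim 3 names the Jaccard similarity coefficient for the network similarities. The sketch below assumes the standard definition |M ∩ N| / |M ∪ N|; the record layout (a dict with an "authors" list) and the names `collaboration_network` and `jaccard_similarity` are hypothetical.

```python
def collaboration_network(documents, author):
    """Merge the co-authors across all documents of one disambiguated author."""
    return {name for doc in documents for name in doc["authors"] if name != author}


def jaccard_similarity(network_a, network_b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(network_a), set(network_b)
    union = a | b
    if not union:
        return 0.0  # two empty networks: treated as no evidence (assumption)
    return len(a & b) / len(union)
```

The same function applies to any pair of name sets, so it could equally be used for networks built from the authors of cited documents.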

Step five, calculating the similarity of the author citation network

All Chinese data, after Chinese author disambiguation, are grouped by author ID; one author may correspond to multiple documents. The authors of the cited documents in all documents of the same author are merged to generate the author's citation network, denoted P;

all English data, after English author disambiguation, are grouped by author ID; one author may correspond to multiple documents. The authors of the cited documents in all documents of the same author are merged to generate the author's citation network, denoted Q;

if the pinyin of the Chinese document author's name is the same as the author name in the English document, the citation network similarity is calculated as follows:

Calculating the similarity between Chinese and English documents requires translating documents in one language into the other, but translation accuracy cannot be guaranteed at large scale, so the research topics of Microsoft Academic are introduced to assist the similarity calculation. These research topics are technical terms extracted, using technologies such as artificial intelligence and natural language processing, from the hundred-million-scale article collection of Microsoft Academic, more than 700,000 in total. The research topics are translated using Google Translate and Wikipedia to generate the Chinese research topic set, denoted Topics_zh; the English research topic set is denoted Topics_en; and the correspondence between the Chinese and English research topic sets is denoted zh_To_en.

The method comprises the following steps:

(4) C1_zh_to_en, C2_zh_to_en and C3_zh_to_en are merged, the occurrence counts of identical research topics are added, and the 10 research topics with the highest occurrence counts are taken to obtain the final {English research topic + occurrence count} array C_zh_to_en; this array contains the research topics that occur most often across all Chinese documents of the author and is therefore highly representative.

(5) The hash value of C_zh_to_en is calculated using SimHash to obtain the semantic fingerprint of the Chinese documents, denoted D_zh;
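The merge in step (4) amounts to summing per-document topic counts and keeping the ten most frequent topics. A minimal sketch, assuming each per-document array is a mapping from topic to count (the name `merge_topic_counts` is hypothetical):

```python
from collections import Counter


def merge_topic_counts(per_document_counts, k=10):
    """Sum topic counts over all documents of an author and keep the top k."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)  # adds counts for topics that are already present
    return total.most_common(k)
```

The resulting (topic, count) pairs have the same shape as a {word + weight} array, so they can be passed to a SimHash routine to obtain the fingerprint.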

S2, calculating the semantic fingerprint of the English documents, comprising the following steps:

(1) All English data, after English author disambiguation, are grouped by author ID; one author may correspond to multiple documents. According to the grouping result, the abstract, title and keywords of each document of the same author are merged, and the results are denoted A1′, A2′, A3′ ...;

(3) B1′, B2′, B3′ ... are merged, the occurrence counts of identical research topics are added, and the 10 research topics with the highest occurrence counts are taken to obtain the final {English research topic + occurrence count} array C_en; this array contains the research topics that occur most often across all English documents of the author and is therefore highly representative.

(4) The hash value of C_en is calculated using SimHash to obtain the semantic fingerprint of the English documents, denoted D_en;

Step three, calculating the Hamming distance between D_zh and D_en:

if the Hamming distance between D_zh and D_en is greater than or equal to 3, the contents of the two documents are not similar;

the Hamming distance is converted to a similarity as follows:

hamming distance =0, similarity =1;

hamming distance =1, similarity =0.9;

hamming distance =2, similarity =0.8;

hamming distance > =3, similarity =0.

Step seven, if the pinyin of the Chinese-literature author's name is the same as the name of the English-literature author, the scientific research duration similarity is calculated, comprising the following steps:

S1, calculating the scientific research duration of the Chinese-literature author: all Chinese data, after Chinese author disambiguation, are grouped by author ID; one author may correspond to multiple documents. According to the grouping result, the earliest-published of all the author's Chinese documents is found, and the difference between its publication year and the current year is calculated; this is the author's scientific research duration, denoted R_zh;

S2, calculating the scientific research duration of the English-literature author: all English data, after English author disambiguation, are grouped by author ID; one author may correspond to multiple documents. According to the grouping result, the earliest-published of all the author's English documents is found, and the difference between its publication year and the current year is calculated; this is the author's scientific research duration, denoted R_en;

S3, the difference R_diff between R_zh and R_en is calculated and converted into the scientific research duration similarity according to the following standard:

R_diff = 0: similarity = 1;

1 ≤ R_diff ≤ 2: similarity = 0.9;

3 ≤ R_diff ≤ 4: similarity = 0.8;

R_diff > 4: similarity = 0.
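The duration comparison of step seven can be sketched as follows. Taking the absolute value of the difference and working in whole years are assumptions; `research_duration` and `duration_similarity` are hypothetical names.

```python
def research_duration(publication_years, current_year):
    """Years elapsed between the author's earliest publication and now."""
    return current_year - min(publication_years)


def duration_similarity(r_zh, r_en):
    """Convert the gap between two research durations (whole years) to a similarity."""
    r_diff = abs(r_zh - r_en)  # absolute difference is an assumption
    if r_diff == 0:
        return 1.0
    if r_diff <= 2:
        return 0.9
    if r_diff <= 4:
        return 0.8
    return 0.0
```

Because both durations are measured against the same current year, R_diff is equal to the gap between the earliest Chinese and the earliest English publication years.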

Step eight, judging whether the Chinese-literature author and the English-literature author are the same person, according to the following standard:

if the institution similarity is greater than or equal to 0.9 and one of the four similarities, namely the author cooperation network similarity, the author citation network similarity, the document content similarity and the scientific research duration similarity, is greater than 0.8, the two are regarded as the same person;

if the institution similarity is less than 0.9 and two or more of the four similarities, namely the author cooperation network similarity, the author citation network similarity, the document content similarity and the scientific research duration similarity, are greater than 0.8, the two are regarded as the same person.
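The judgment standard of step eight can be written as a short predicate. The sketch reads "one of the four similarities" as "at least one", which is an interpretation rather than something the text states; `is_same_person` is a hypothetical name.

```python
def is_same_person(institution_sim, other_sims, threshold=0.8):
    """Apply the step-eight rule to the institution similarity and the other four.

    other_sims holds the cooperation network, citation network, document
    content and research duration similarities, in any order.
    """
    hits = sum(1 for s in other_sims if s > threshold)
    required = 1 if institution_sim >= 0.9 else 2
    return hits >= required
```

Note that the comparison is strict: a similarity of exactly 0.8 (Hamming distance 2, or a duration gap of 3 to 4 years) does not count toward the required number.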

Step nine, according to steps one to eight, the similarities of the documents of same-name authors are calculated in pairs and it is judged whether the authors are the same person; if they are, the author ID of the Chinese documents is changed to the author ID of the English documents, and the pairwise results are aggregated to complete the Chinese and English author name fusion disambiguation.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be covered thereby.

Claims

1. A Chinese and English literature author name fusion disambiguation method is characterized in that: the method comprises the following steps:

S3, comparing the authors of the Chinese documents pairwise to judge whether their names are the same;

if the names are different, aggregating the results to obtain the Chinese disambiguation result;

if the names are the same, respectively calculating the institution similarity, cooperation network similarity, citation network similarity and document content similarity, and judging from these results whether the two are the same author;

the judgment standard is as follows:

(2) If not, aggregating the results to obtain a Chinese disambiguation result; the method for calculating the similarity of the contents of the Chinese documents comprises the following steps:

weight less than 0.2: converted to 1;

weight greater than or equal to 0.2 and less than 0.4: converted to 2;

weight greater than or equal to 0.4 and less than 0.6: converted to 3;

weight greater than or equal to 0.6 and less than 0.8: converted to 4;

weight greater than or equal to 0.8: converted to 5;

hamming distance =0, similarity =1;

hamming distance =1, similarity =0.9;

hamming distance =2, similarity =0.8;

hamming distance > =3, similarity =0;

if the Hamming distance is more than or equal to 3, the two literatures are not similar;

if the Hamming distance is less than 3, the two documents are similar;

S1, author name cleaning: removing symbols from the author names, and uniformly converting pinyin author names into the first name + last name format;

S2, author institution cleaning: removing symbols from the institution names, and expanding abbreviated institution names to their full form;

if the names are different, aggregating the results to obtain the English disambiguation result;

(1) If they are the same author, marking them with the same author ID, and aggregating the pairwise results to obtain the English disambiguation result;

(2) If they are not the same author, aggregating the results to obtain the English disambiguation result;

the method for calculating the document content similarity of the author of the English document comprises the following steps:

(1) Concatenating the title, abstract and keywords into a character string E′;

(2) Using NLTK, extracting keywords from the character string E′ based on the TF-IDF algorithm, and taking the Top 10 words and their weights to generate a {word + weight} array F′;

(3) Converting the weights in the array F′ into integer weights of 1-5 to obtain the converted {word + weight} array G′; the conversion criteria are:

weight less than 0.2: converted to 1;

weight greater than or equal to 0.2 and less than 0.4: converted to 2;

weight greater than or equal to 0.4 and less than 0.6: converted to 3;

weight greater than or equal to 0.6 and less than 0.8: converted to 4;

weight greater than or equal to 0.8: converted to 5;

(4) Calculating the hash value of the array G′ using SimHash to obtain the semantic fingerprint H′ of the text;

(5) Calculating, according to steps (1) to (4), the semantic fingerprints H1′ and H2′ of two documents of the same author;

(6) The content similarity of the two documents is calculated according to the Hamming distance,

if the Hamming distance is less than 3, the two documents are similar;

S1, converting the Chinese document authors obtained from the Chinese disambiguation result, and the Chinese authors in the cited documents, into pinyin in the first name + last name format, translating the institutions of the Chinese authors into English, and grouping by author ID;

S2, grouping the English disambiguation result by author ID;

S3, comparing the names of the Chinese and English document authors pairwise to judge whether they are the same; if the names are different, aggregating the results to complete the name disambiguation of the Chinese and English document authors;

if the names are the same, respectively calculating the institution similarity, cooperation network similarity, citation network similarity, document content similarity and scientific research duration similarity of the Chinese and English documents, and judging from these results whether they belong to the same author; the judgment standard is as follows: if the institution similarity is greater than or equal to 0.9 and one of the four similarities, namely the author cooperation network similarity, the author citation network similarity, the document content similarity and the scientific research duration similarity, is greater than 0.8, they are regarded as the same person;

(2) If the Chinese document and the English document are not by the same author, aggregating the results to complete the name disambiguation of the Chinese and English document authors;

the method for calculating the similarity of the contents of the documents published by the authors comprises the following steps:

(1) Grouping all Chinese data after Chinese author disambiguation by author ID, wherein one author may correspond to multiple documents, and, according to the grouping result, merging the abstract, title and keywords of each document of the same author, denoted A1, A2, A3 ...;

(4) Merging C1_zh_to_en, C2_zh_to_en and C3_zh_to_en, adding the occurrence counts of identical research topics, and taking the 10 research topics with the highest occurrence counts to obtain the final {English research topic + occurrence count} array C_zh_to_en;

(1) Grouping all English data after English author disambiguation by author ID, wherein one author may correspond to multiple documents; according to the grouping result, merging the abstract, title and keywords of each document of the same author, denoted A1′, A2′, A3′ ...;

(2) Using the English research topic set Topics_en to match against A1′, A2′, A3′ ... respectively, matching the English research topics within a preset distance, obtaining the English research topics each contains and their occurrence counts, and generating the {English research topic + occurrence count} arrays B1′, B2′, B3′ ...;

(3) Merging B1′, B2′, B3′ ..., adding the occurrence counts of identical research topics, and taking the 10 research topics with the highest occurrence counts to obtain the final {English research topic + occurrence count} array C_en;

2. The Chinese-English literature author name fusion disambiguation method of claim 1, wherein: in step three, the scientific research duration similarity is calculated for documents in which the pinyin of the Chinese document author's name is the same as the English document author's name, and the calculation method comprises the following steps:

(3) If the pinyin of the Chinese document author's name is the same as the author name in the English document, calculating the difference R_diff between R_zh and R_en;

(4) Calculating the scientific research duration similarity according to the following standard:

R_diff = 0: similarity = 1;

1 ≤ R_diff ≤ 2: similarity = 0.9;

3 ≤ R_diff ≤ 4: similarity = 0.8;

R_diff > 4: similarity = 0.

3. The Chinese-English literature author name fusion disambiguation method of claim 1, wherein: in step one, step two and step three, Jaccard similarity coefficients are used to calculate the cooperation network similarity and the citation network similarity respectively.