CN114328488A

CN114328488A - Chinese and English literature author name fusion disambiguation method

Info

Publication number: CN114328488A
Application number: CN202111615229.2A
Authority: CN
Inventors: 贾士杨; 冯凯; 王元卓; 彭亮
Original assignee: China Science And Technology Big Data Research Institute
Current assignee: China Science And Technology Big Data Research Institute
Priority date: 2021-12-27
Filing date: 2021-12-27
Publication date: 2022-04-12
Anticipated expiration: 2041-12-27
Also published as: CN114328488B

Abstract

The invention belongs to the technical field of name disambiguation, and in particular relates to a fusion disambiguation method for author names in Chinese and English literature. The method disambiguates Chinese author names and English author names based on semantic fingerprints, author cooperation network similarity, author citation network similarity and related features, and then uses the Chinese and English disambiguation results to disambiguate the pinyin names under which Chinese authors sign English documents. It can accurately determine whether the authors of different documents are the same person, identify the same author across Chinese and English literature, and quickly locate the author being searched for with high accuracy, which facilitates retrieval work. In addition, the invention introduces a similarity measure over authors' scientific research durations, which assists in disambiguating the English names of Chinese authors, helps determine an author's age range, filters out same-name authors outside that range, and improves disambiguation accuracy.

Description

Chinese and English literature author name fusion disambiguation method

Technical Field

The invention belongs to the technical field of name disambiguation, and particularly relates to a Chinese and English literature author name disambiguation method.

Background

With the rapid development of the internet, large numbers of scientific documents such as articles and patents are continuously emerging. When people look for useful information in this mass of documents, a common approach is to search by author name and retrieve all of that author's publications. In doing so, however, they find that many authors share the same name, so the author being sought is hard to locate quickly, which hinders their work.

Ambiguity of author names in the literature is a long-standing issue, and mainly involves the following problems:

1. Chinese author name ambiguity. For example, "Zhang Wei" is a very common name; when people with this name publish documents such as papers and patents, they all sign as "Zhang Wei", and it is hard to tell from the documents which "Zhang Wei" published them.

2. English author name ambiguity. As with Chinese authors, many different people share the same English name, and distinguishing them is a difficult problem.

3. English names of Chinese authors. With the internationalization of academia, domestic authors publish more and more documents in international journals and conferences, usually signing in pinyin, for example "Zhang San" or "San Zhang". Because different Chinese characters can share the same pinyin, "Zhang San" can correspond to several different Chinese names; likewise, because the name order varies, "Lin Yang" can correspond to a person with surname Lin and given name Yang, or to a person with surname Yang and given name Lin. In such cases it is even harder to tell which specific person is meant, and an evaluation of an author's combined Chinese and English academic output often lacks scientific rigor and practical value.

In view of the above problems, name disambiguation is a difficulty that urgently needs to be solved when constructing a document knowledge base and performing document retrieval, and solving it has significant value.

Disclosure of Invention

Aiming at the defects and problems existing in the existing Chinese and English author name distinction, the invention provides a Chinese and English literature author name disambiguation method.

The technical scheme adopted by the invention for solving the technical problems is as follows: a Chinese and English literature author name fusion disambiguation method comprises the following steps:

Step one, Chinese literature author name disambiguation, comprising the following steps:

S1, cleaning author names: removing symbols from the author name, and converting the author name into the "surname + given name" format according to common surnames;

S2, cleaning the author's institution: normalizing each author institution to the name of its parent institution;

S3, comparing the authors of the Chinese documents pairwise and judging whether the author names are the same;

if the names are different, aggregating the results to obtain the Chinese disambiguation result;

if the names are the same, calculating the institution similarity, cooperation network similarity, citation network similarity and document content similarity respectively, and judging from these four results whether the two are the same author; the judgment criteria are as follows:

if the institution similarity is greater than or equal to 0.9 and at least one of the three similarities (author cooperation network similarity, author citation network similarity and document content similarity) is greater than 0.8, the two are considered the same person;

if the institution similarity is less than 0.9 and two or more of the three similarities (author cooperation network similarity, author citation network similarity and document content similarity) are greater than 0.8, the two are considered the same person;

(1) if the two are the same author, marking them with the same author ID, and aggregating the results of the pairwise calculations to obtain the Chinese disambiguation result;

(2) if they are not the same author, aggregating the results to obtain the Chinese disambiguation result.
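The two-branch judgment rule above can be sketched in Python as follows. This is only an illustration of the stated thresholds; the function name `is_same_author` and its parameter names are hypothetical and do not come from the patent.

```python
def is_same_author(institution_sim, other_sims,
                   inst_threshold=0.9, sim_threshold=0.8):
    """Apply the judgment rule to one pair of same-name authors.

    institution_sim: institution similarity in [0, 1].
    other_sims: the remaining similarities (cooperation network,
        citation network, document content, ...).
    """
    # Count the supporting similarities that are strictly greater than 0.8.
    strong = sum(1 for s in other_sims if s > sim_threshold)
    # A matching institution (>= 0.9) needs one supporting signal,
    # otherwise two or more are required.
    required = 1 if institution_sim >= inst_threshold else 2
    return strong >= required
```

Because the supporting similarities are passed as a list, the same sketch also covers a variant of the rule with a different number of supporting similarities.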

Step two, English literature author name disambiguation, comprising the following steps:

S1, cleaning author names: removing symbols from the author name, and uniformly converting the pinyin of the author name into the "first name + surname" format;

S2, cleaning the author's institution: removing symbols from the institution name, and expanding abbreviations to the full institution name;

S3, comparing the authors of the English documents pairwise and judging whether the author names are the same;

if the names are different, aggregating the results to obtain the English disambiguation result;

if the names are the same, calculating the institution similarity, cooperation network similarity, citation network similarity and document content similarity respectively, and judging from these four results whether the two are the same author; the judgment criteria are as follows:

(1) if the two are the same author, marking them with the same author ID, and aggregating the results of the pairwise calculations to obtain the English disambiguation result;

(2) if they are not the same author, aggregating the results to obtain the English disambiguation result.

Step three, Chinese and English author name fusion disambiguation, comprising the following steps:

S1, converting the Chinese document authors and the Chinese authors in cited documents, both obtained from the Chinese disambiguation result, into pinyin in the "first name + surname" format, and translating the institutions of the Chinese authors into English; then grouping by author ID;

S2, grouping the English disambiguation result by author ID;

S3, comparing the names of the Chinese and English literature authors pairwise and judging whether the names are the same;

if the names are different, aggregating the results to complete the name disambiguation of Chinese and English literature authors;

if the names are the same, calculating the institution similarity, cooperation network similarity, citation network similarity, document content similarity and scientific research duration similarity of the Chinese and English documents respectively, and judging from these five results whether the two are the same author; the judgment criteria are as follows:

if the institution similarity is greater than or equal to 0.9 and at least one of the four similarities (author cooperation network similarity, author citation network similarity, document content similarity and scientific research duration similarity) is greater than 0.8, the two are considered the same person;

if the institution similarity is less than 0.9 and two or more of the four similarities (author cooperation network similarity, author citation network similarity, document content similarity and scientific research duration similarity) are greater than 0.8, the two are considered the same person;

(1) if the Chinese literature author and the English literature author are the same author, marking them with the ID of the English literature author to complete the name disambiguation of Chinese and English authors;

(2) if the Chinese literature author and the English literature author are not the same author, aggregating the results to complete the name disambiguation of Chinese and English literature authors.

In the above method for fusion disambiguation of author names in Chinese and English literature, the document content similarity of Chinese literature authors in step one is calculated as follows:

(1) splicing the title, the abstract and the keywords into a character string E;

(2) extracting keywords from the character string E with jieba word segmentation based on the TF-IDF algorithm, and taking the Top 10 words and their weights to generate a {word + weight} array F;

(3) converting the weights in the array F into integer weights of 1 to 5 to obtain the converted {word + weight} array G; the conversion criteria are:

weight less than 0.2: converted to 1

weight greater than or equal to 0.2 and less than 0.4: converted to 2

weight greater than or equal to 0.4 and less than 0.6: converted to 3

weight greater than or equal to 0.6 and less than 0.8: converted to 4

weight greater than or equal to 0.8: converted to 5;

(4) calculating the hash value of the array G using SimHash to obtain the semantic fingerprint H of the text;

(5) calculating the semantic fingerprints H1 and H2 of two Chinese documents of the same-name author according to steps (1) to (4);

(6) calculating the content similarity of the two documents from the Hamming distance between H1 and H2, with the following criteria:

Hamming distance = 0: similarity is 1

Hamming distance = 1: similarity is 0.9

Hamming distance = 2: similarity is 0.8

Hamming distance ≥ 3: similarity is 0;

if the Hamming distance is greater than or equal to 3, the two documents are not similar;

if the Hamming distance is less than 3, the two documents are similar.
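The weight conversion, fingerprinting and Hamming-distance steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the 64-bit fingerprint width and the use of MD5 as the per-word hash are assumptions (the text only names SimHash), all function names are hypothetical, and the keyword array F is taken as given rather than extracted with jieba.

```python
import hashlib


def bucket_weight(weight):
    """Map a TF-IDF weight to an integer weight 1-5 using the stated bands."""
    if weight < 0.2:
        return 1
    if weight < 0.4:
        return 2
    if weight < 0.6:
        return 3
    if weight < 0.8:
        return 4
    return 5


def simhash(weighted_words, bits=64):
    """Compute a SimHash fingerprint from (word, integer weight) pairs.

    ASSUMPTION: each word is hashed with MD5 truncated to `bits` bits.
    """
    totals = [0] * bits
    for word, weight in weighted_words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the word's hash bit is 1, subtract otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def content_similarity(h1, h2):
    """Convert a Hamming distance to a similarity per the table above."""
    return {0: 1.0, 1: 0.9, 2: 0.8}.get(hamming(h1, h2), 0.0)
```

For example, an array F such as `[("graph", 0.73), ("network", 0.41)]` would first be converted with `bucket_weight` into G, and the two resulting fingerprints H1 and H2 would then be compared with `content_similarity(H1, H2)`.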

In the above method for fusion disambiguation of author names in Chinese and English literature, the document content similarity of English literature authors in step two is calculated as follows:

(1) splicing the title, the abstract and the keywords into a character string E';

(2) extracting keywords from the character string E' with NLTK based on the TF-IDF algorithm, and taking the Top 10 words and their weights to generate a {word + weight} array F';

(3) converting the weights in the array F' into integer weights of 1 to 5 to obtain the converted {word + weight} array G'; the conversion criteria are:

weight less than 0.2: converted to 1

weight greater than or equal to 0.2 and less than 0.4: converted to 2

weight greater than or equal to 0.4 and less than 0.6: converted to 3

weight greater than or equal to 0.6 and less than 0.8: converted to 4

weight greater than or equal to 0.8: converted to 5;

(4) calculating the hash value of the array G' using SimHash to obtain the semantic fingerprint H' of the text;

(5) calculating the semantic fingerprints H1' and H2' of two English documents of the same-name author according to steps (1) to (4);

(6) calculating the content similarity of the two documents from the Hamming distance between H1' and H2':

if the Hamming distance is less than 3, the two documents are similar.

In the above method for fusion disambiguation of author names in Chinese and English literature, the similarity of the contents of documents published by the authors in step three is calculated as follows:

S1, calculating the semantic fingerprint of the Chinese documents, comprising the following steps:

(1) grouping all Chinese data by author ID after Chinese author disambiguation, where one author may correspond to multiple documents; for each author ID group, merging the abstracts, the titles and the keywords of all documents of the same author respectively, the merged results being denoted A1, A2, A3, ...;

(2) matching A1, A2, A3, ... against the Chinese research topic set Topic_zh to obtain the Chinese research topics each contains and their occurrence counts, generating {Chinese research topic + occurrence count} arrays B1, B2, B3, ...;

(3) converting the Chinese research topics in B1, B2, B3, ... into English using zh_to_en, generating {English research topic + occurrence count} arrays C1_zh_to_en, C2_zh_to_en, C3_zh_to_en, ...;

(4) merging C1_zh_to_en, C2_zh_to_en, C3_zh_to_en, ..., summing the occurrence counts of identical research topics, and taking the 10 research topics with the highest occurrence counts to obtain the final {English research topic + occurrence count} array C_zh_to_en;

(5) calculating the hash value of C_zh_to_en using SimHash to obtain the semantic fingerprint D_zh of the Chinese documents;

S2, calculating the semantic fingerprint of the English documents, comprising the following steps:

(1) grouping all English data by author ID after English author disambiguation, where one author may correspond to multiple documents; for each author ID group, merging the abstracts, the titles and the keywords of all documents of the same author respectively, the merged results being denoted A1', A2', A3', ...;

(2) matching A1', A2', A3', ... against the English research topic set Topic_en to obtain the English research topics each contains and their occurrence counts, generating {English research topic + occurrence count} arrays B1', B2', B3', ...;

(3) merging B1', B2', B3', ..., summing the occurrence counts of identical research topics, and taking the 10 research topics with the highest occurrence counts to obtain the final {English research topic + occurrence count} array C_en;

(4) calculating the hash value of C_en using SimHash to obtain the semantic fingerprint D_en of the English documents;

S3, calculating the document content similarity from the Hamming distance between D_zh and D_en:

if the Hamming distance between D_zh and D_en is less than 3, the contents of the two authors' documents are similar;

if the Hamming distance between D_zh and D_en is greater than or equal to 3, the contents of the two authors' documents are not similar.
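The topic counting, translation and merging steps that precede fingerprinting can be sketched in Python as follows. This is an illustration under assumptions: topics are matched by plain substring counting, zh_to_en is modeled as a dictionary from Chinese topics to English topics, and the function names `count_topics`, `translate_topics` and `top_topics` are hypothetical.

```python
from collections import Counter


def count_topics(merged_text, topic_set):
    """Count how often each known research topic occurs in a merged text.

    ASSUMPTION: a topic "occurs" wherever it appears as a substring.
    """
    counts = Counter()
    for topic in topic_set:
        n = merged_text.count(topic)
        if n:
            counts[topic] = n
    return counts


def translate_topics(counts, zh_to_en):
    """Map Chinese topics to English, summing counts that collide.

    ASSUMPTION: zh_to_en is a dict; untranslatable topics are kept as-is.
    """
    translated = Counter()
    for topic, n in counts.items():
        translated[zh_to_en.get(topic, topic)] += n
    return translated


def top_topics(count_arrays, k=10):
    """Merge several topic-count arrays and keep the k most frequent topics."""
    merged = Counter()
    for counts in count_arrays:
        merged.update(counts)  # Counter.update adds counts of equal keys
    return merged.most_common(k)
```

The list returned by `top_topics` plays the role of C_zh_to_en or C_en, that is, the (topic, count) pairs that would be passed to SimHash.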

In the above method for fusion disambiguation of author names in Chinese and English literature, in step three the scientific research duration similarity is calculated for Chinese authors whose pinyin name is the same as an author name in the English documents, as follows:

(1) grouping all Chinese data by author ID after Chinese author disambiguation; for each group, finding the document with the earliest publication time among all Chinese documents of the author, and taking the difference between the publication year of that earliest document and the current year as the scientific research duration R_zh of the Chinese literature author;

(2) grouping all English data by author ID after English author disambiguation; for each group, finding the document with the earliest publication time among all English documents of the author, and taking the difference between the publication year of that earliest document and the current year as the scientific research duration R_en of the English literature author;

(3) if the pinyin of the Chinese literature author's name is the same as the author name in the English documents, calculating the difference R_diff between R_zh and R_en;

(4) calculating the scientific research duration similarity, with the following criteria:

R_diff = 0: similarity is 1

1 ≤ R_diff ≤ 2: similarity is 0.9

3 ≤ R_diff ≤ 4: similarity is 0.8

R_diff > 4: similarity is 0.
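The research duration comparison above can be sketched in Python as follows. The sketch assumes that durations are whole years and that R_diff is the absolute difference between R_zh and R_en; the function names are hypothetical.

```python
def research_duration(publication_years, current_year):
    """Years elapsed between an author's earliest publication and now."""
    return current_year - min(publication_years)


def duration_similarity(r_zh, r_en):
    """Map the duration difference R_diff to a similarity per the table above.

    ASSUMPTION: R_diff is taken as an absolute difference in whole years.
    """
    r_diff = abs(r_zh - r_en)
    if r_diff == 0:
        return 1.0
    if r_diff <= 2:
        return 0.9
    if r_diff <= 4:
        return 0.8
    return 0.0
```

For instance, an author group whose earliest Chinese document is from 2009 and whose earliest English document is from 2011 has R_diff = 2 whatever the current year, giving a similarity of 0.9.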

In the above method for fusion disambiguation of author names in Chinese and English literature, in step one, step two and step three, the Jaccard similarity coefficient is used to calculate both the cooperation network similarity and the citation network similarity.

The invention has the following beneficial effects: the method can accurately determine whether the authors of different documents are the same person, can identify the same author across Chinese and English literature, and can quickly locate the author being searched for with high accuracy, which facilitates retrieval work.

The method compares document similarity using semantic fingerprints, which greatly simplifies the comparison process and improves comparison efficiency; compared with model-based document similarity calculation, it avoids model training and saves training resources.

The invention introduces a similarity measure over authors' scientific research durations, which assists in disambiguating the English names of Chinese authors; the research duration also helps determine an author's age range, so that same-name authors outside that range can be filtered out, improving disambiguation accuracy.

Drawings

FIG. 1 is a flow chart of Chinese author name disambiguation in the present invention.

FIG. 2 is a flow chart of English author name disambiguation in the present invention.

FIG. 3 is a flow chart of Chinese and English author name fusion disambiguation in the present invention.

Detailed Description

The same-name author disambiguation method designed by the invention mainly considers the following aspects:

(1) Whether the authors' institutions are the same: if the institutions are the same, the probability that the authors are the same person is higher.

(2) Author cooperation network similarity: scholars generally have stable research collaboration teams, so if the cooperation networks of two authors are highly similar, the two are very likely the same person.

(3) Author citation network similarity: a citation network is the set of authors of the documents that an author cites. Authors working in the same research field usually cite documents by the same other authors. Moreover, for some supervisors and senior supervisors with many students, the students' cooperation networks may have very low similarity and their research directions may differ considerably, yet they share a common feature: a student's paper will very likely carry the supervisor's name, and will very likely cite papers by the supervisor or by fellow students. Name disambiguation can therefore make use of citation network similarity.

(4) Similarity of the contents of the documents published by the authors: the research content of a single author generally does not change greatly, so the content similarity of two documents can indicate whether they were written by the same author.

(5) Similarity of the authors' scientific research durations: in this application, the scientific research duration is defined as the time elapsed since the author's earliest published document. If the research durations are the same, the probability that the authors are the same person is higher.

The invention is further illustrated with reference to the following figures and examples. It is noted that all letters appearing in the present invention are exemplary representations.

Example 1: This embodiment provides a Chinese and English literature author name fusion disambiguation method, which comprises Chinese author name disambiguation, English author name disambiguation, and disambiguation between Chinese authors and pinyin names in English literature, wherein:

First, Chinese author name disambiguation, as shown in FIG. 1, comprises the following steps:

Step one, cleaning author names: removing symbols (including spaces, semicolons, commas and other symbols) from the author name, and converting the author name according to common surnames into a uniform "surname + given name" format; for example, a raw variant of the name "Lin Chong" is normalized to the surname + given name form "Lin Chong".

Step two, cleaning the author's institution: normalizing each author institution to the name of its parent institution; for example, "xx department of xx hospital" is normalized to "xx hospital", and "xx college of xx university" is normalized to "xx university"; company-type suffixes such as "limited company", "company limited by shares" and "group company limited by shares" are removed, so that, for example, "Alibaba Group Company Limited by Shares" is converted to "Alibaba".

Step three, calculating the author institution similarity:

taking the institutions from two articles by the same-name author as feature values, calculating the frequencies of all characters in the feature values to generate word frequency vectors, and then calculating the similarity of the word frequency vectors with the cosine similarity formula to obtain the institution similarity;

Example:

1) the two institutions are "Xiaomi Technology Limited Liability Company" and "Guangdong Xiaomi Technology Limited Liability Company";

2) after institution cleaning in step two, the two feature values are "Xiaomi Technology" and "Guangdong Xiaomi Technology";

3) all the characters of the feature value "Xiaomi Technology" (Xiao Mi Ke Ji) are: [Xiao, Mi, Ke, Ji]

4) all the characters of the feature value "Guangdong Xiaomi Technology" (Guang Dong Xiao Mi Ke Ji) are: [Guang, Dong, Xiao, Mi, Ke, Ji]

5) merging all the characters of the two feature values gives: [Guang, Dong, Xiao, Mi, Ke, Ji]

6) the word frequency vector of the feature value "Xiaomi Technology" is: [0,0,1,1,1,1]

7) the word frequency vector of the feature value "Guangdong Xiaomi Technology" is: [1,1,1,1,1,1]

8) calculating the similarity of the two word frequency vectors x and y with the cosine similarity formula:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{4}{\sqrt{4} \times \sqrt{6}} \approx 0.816$$
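The word-frequency cosine calculation in this example can be reproduced with a short Python sketch. The function name `cosine_similarity` is hypothetical; tokens are single characters for Chinese institution names and space-separated words for English ones.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = set(freq_a) | set(freq_b)
    # Counter returns 0 for tokens that are absent from one side.
    dot = sum(freq_a[t] * freq_b[t] for t in vocabulary)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Character-level comparison of the two cleaned institution names.
sim = cosine_similarity(list("小米科技"), list("广东小米科技"))
```

Here `sim` evaluates to 4 / (2 × √6), roughly 0.816, which falls below the 0.9 institution threshold used in the judgment rule.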

Step four, calculating the author cooperation network similarity:

given the cooperation networks A and B from two documents by the same-name author, the similarity is the Jaccard similarity coefficient:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Example: if cooperation network A is [a, b, c] and cooperation network B is [a, c, d, e], the similarity according to the cooperation network similarity formula is:

$$J(A, B) = \frac{|\{a, c\}|}{|\{a, b, c, d, e\}|} = \frac{2}{5} = 0.4$$
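The Jaccard similarity coefficient used for the cooperation network (and likewise for the citation network) can be sketched in Python as follows. The function name `jaccard` is hypothetical, and returning 0 for two empty networks is an assumption the text does not address.

```python
def jaccard(network_a, network_b):
    """Jaccard similarity coefficient |A ∩ B| / |A ∪ B| of two author sets."""
    set_a, set_b = set(network_a), set(network_b)
    union = set_a | set_b
    if not union:
        # ASSUMPTION: two empty networks are treated as not similar.
        return 0.0
    return len(set_a & set_b) / len(union)


# Cooperation networks from the example: 2 shared authors out of 5 distinct.
similarity = jaccard(["a", "b", "c"], ["a", "c", "d", "e"])
```

With these inputs the intersection is {a, c} and the union has five members, so `similarity` is 2/5 = 0.4.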

Step five, calculating the author citation network similarity:

the citation network similarity algorithm is the same as the cooperation network similarity algorithm; given the citation networks C and D from two documents by the same-name author, the similarity is:

$$J(C, D) = \frac{|C \cap D|}{|C \cup D|}$$

Step six, calculating the similarity of the contents of the documents published by the authors, comprising the following:

S1, using the title, the abstract and the keywords to calculate the content similarity, and splicing the title, the abstract and the keywords into a character string E;

S2, extracting keywords from the character string E with jieba word segmentation based on the TF-IDF algorithm, and taking the Top 10 words and their weights to generate a {word + weight} array, denoted F;

S3, converting the weights in the array F into integer weights of 1 to 5, the converted {word + weight} array being denoted G; the conversion criteria are:

weight less than 0.2: converted to 1;

weight greater than or equal to 0.2 and less than 0.4: converted to 2;

weight greater than or equal to 0.4 and less than 0.6: converted to 3;

weight greater than or equal to 0.6 and less than 0.8: converted to 4;

weight greater than or equal to 0.8: converted to 5.

S4, calculating the hash value of the array G using SimHash to obtain the semantic fingerprint H of the text;

S5, calculating the semantic fingerprints H1 and H2 of two documents of the same-name author according to steps S1 to S4;

S6, calculating the content similarity of the two documents from the Hamming distance between H1 and H2, with the following criteria:

Hamming distance = 0: similarity is 1

Hamming distance = 1: similarity is 0.9

Hamming distance = 2: similarity is 0.8

Hamming distance ≥ 3: similarity is 0;

if the Hamming distance is less than 3, the two documents are similar.

Step seven, judging whether the authors are the same person:

if the institution similarity is greater than or equal to 0.9 and at least one of the three similarities (author cooperation network similarity, author citation network similarity and document content similarity) is greater than 0.8, the two are considered the same person;

if the institution similarity is less than 0.9 and two or more of the three similarities (author cooperation network similarity, author citation network similarity and document content similarity) are greater than 0.8, the two are considered the same person.

Step eight, calculating the similarities of the documents of same-name authors pairwise according to steps one to seven and judging whether they are the same person; marking the same person with the same author ID, and aggregating the results of the pairwise calculations to complete Chinese author name disambiguation.

Second, English literature author name disambiguation, as shown in FIG. 2, comprises the following steps:

Step one, cleaning author names:

removing symbols from the author name, such as spaces, semicolons and commas; converting the format of the pinyin author name uniformly into the "first name + surname" format; for example, "Wang Yuanzhuo" is converted to "Yuanzhuo Wang".

Step two, cleaning the author's institution:

removing symbols from the institution name, and expanding abbreviations to the full institution name; for example, "Univ" is expanded to "University".

Step three, calculating the author institution similarity:

taking the institutions from two articles by the same-name author as feature values, calculating the frequencies of all words in the feature values to generate word frequency vectors, and calculating the similarity of the word frequency vectors with the cosine similarity formula to obtain the institution similarity;

Example:

1) the two institutions are "AA BB CD" and "AA CD EE";

2) all the words of the feature value "AA BB CD" (split on spaces) are: [AA, BB, CD];

3) all the words of the feature value "AA CD EE" (split on spaces) are: [AA, CD, EE];

4) merging all the words of the two feature values gives: [AA, BB, CD, EE];

5) the word frequency vector of the feature value "AA BB CD" is: [1,1,1,0];

6) the word frequency vector of the feature value "AA CD EE" is: [1,0,1,1];

7) calculating the similarity of the two word frequency vectors x and y with the cosine similarity formula:

$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{2}{\sqrt{3} \times \sqrt{3}} \approx 0.667$$

Step four, calculating the author cooperation network similarity:

given the cooperation networks A' and B' from two documents by the same-name author, the author cooperation network similarity is calculated as the Jaccard similarity coefficient:

$$J(A', B') = \frac{|A' \cap B'|}{|A' \cup B'|}$$

Step five, calculating the author citation network similarity:

the citation network similarity algorithm is the same as the cooperation network similarity algorithm; given the citation networks C' and D' from two documents by the same-name author, the author citation network similarity is calculated as:

$$J(C', D') = \frac{|C' \cap D'|}{|C' \cup D'|}$$

Step six, calculating the similarity of the contents of the documents published by the authors:

S1, the title, abstract and keywords of a document carry the most concentrated and accurate information about it, so they are used to calculate the content similarity; the title, the abstract and the keywords are spliced into a character string E';

S2, segmenting the character string E' into words using NLTK, calculating the TF-IDF value of each word as its weight, and taking the 10 words with the highest weights together with their weights to generate a {word + weight} array, denoted F';

The TF-IDF value is calculated as follows:

First, TF-IDF is a statistical analysis method for keywords, used to evaluate how important a word is to a document in a document set or corpus. It is the product of the term frequency TF and the inverse document frequency IDF:

TF-IDF = TF (term frequency) × IDF (inverse document frequency)

where:

a larger TF indicates that the word occurs more frequently and is more important in the article;

with the total number of documents in the corpus fixed, the fewer documents contain the word, the larger the IDF, indicating that the word is more distinctive and has better category-discriminating ability.

Second, 10000 documents are randomly sampled from all English documents to serve as the corpus;

Third, the character string E' is segmented using NLTK; taking one resulting word, word1, as an example, its TF-IDF is calculated as follows:

counting the occurrences of word1 in the character string E', denoted w_count1, then TF_1 = w_count1 ÷ (total number of words in E');

counting how many of the 10000 corpus documents contain word1, denoted p_count1, from which IDF_1 is obtained;

the TF-IDF of word1 is TF_1 × IDF_1.
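The TF-IDF weighting described above can be sketched in Python as follows. Several points are assumptions rather than statements from the patent: the IDF is taken as log(N ÷ (p_count + 1)), a common smoothed variant; the input is already tokenized (the text uses NLTK for segmentation); and the function name `tf_idf_top` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_top(tokens, corpus_docs, k=10):
    """Return the k highest-weighted (word, tf-idf) pairs for one document.

    tokens: list of words of the spliced string E'.
    corpus_docs: list of sets, each holding the words of one corpus document.
    """
    n_docs = len(corpus_docs)
    total_words = len(tokens)
    scores = {}
    for word, w_count in Counter(tokens).items():
        tf = w_count / total_words
        # p_count: number of corpus documents that contain the word.
        p_count = sum(1 for doc in corpus_docs if word in doc)
        # ASSUMPTION: smoothed IDF; the exact formula is not given in the text.
        idf = math.log(n_docs / (p_count + 1))
        scores[word] = tf * idf
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

The returned list corresponds to the array F'. With this smoothing, a word that appears in every corpus document receives a slightly negative weight and therefore ranks last.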

S3, converting the weights in the array F 'into integer weights of 1-5, and recording the converted array of { word + weight } as G', wherein the conversion mode is as follows:

weight less than 0.2: is converted to 1

Weight is greater than or equal to 0.2 and less than 0.4: is converted into 2

Weight is greater than or equal to 0.4 and less than 0.6: is converted to 3

Weight is greater than or equal to 0.6 and less than 0.8: is converted to 4

Weight of 0.8 or more: turning to 5;

S4, calculating the hash value of G' using SimHash; this hash value is the semantic fingerprint of the text and is denoted H';

S5, calculating the semantic fingerprints of two documents of the same author according to steps S1-S4, denoted H1' and H2' respectively;

S6, comparing H1' and H2' by Hamming distance: if the Hamming distance between H1' and H2' is less than 3, the contents of the two documents are similar; otherwise they are not similar. The Hamming distance is converted to a similarity as follows:

Hamming distance = 0: similarity is 1

Hamming distance = 1: similarity is 0.9

Hamming distance = 2: similarity is 0.8

Hamming distance ≥ 3: similarity is 0;
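Steps S4 to S6 can be illustrated with a compact SimHash sketch. Several details are assumptions not stated in the text: the 64-bit fingerprint length, the use of MD5 as the per-word hash, and the function names `simhash`, `hamming_distance` and `fingerprint_similarity`.

```python
import hashlib


def simhash(weighted_words, bits=64):
    """Compute a SimHash fingerprint from (word, integer weight) pairs."""
    totals = [0] * bits
    for word, weight in weighted_words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Add the weight where the word's hash bit is 1, subtract otherwise.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint_similarity(a, b):
    """Apply the distance-to-similarity table of step S6."""
    return {0: 1.0, 1: 0.9, 2: 0.8}.get(hamming_distance(a, b), 0.0)
```

Here `fingerprint_similarity(simhash(g1), simhash(g2))` would correspond to comparing H1' and H2', where `g1` and `g2` are the G' arrays of the two documents.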

Step seven, judging whether the authors are the same person

Step eight, aggregating same-name authors to complete disambiguation

According to steps one to seven, the document similarities of authors with the same name are calculated pairwise to judge whether the authors are the same person; documents judged to belong to the same person are marked with the same author ID, and the pairwise results are aggregated to complete the disambiguation of English authors.
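One possible way to aggregate the pairwise judgments into author IDs is a union-find structure, sketched below. This is an assumption rather than the method stated in the text: it merges documents transitively (if A matches B and B matches C, all three share one ID) and uses the smallest document ID in each group as the author ID. The name `aggregate_author_ids` is hypothetical.

```python
def aggregate_author_ids(doc_ids, same_person_pairs):
    """Assign one author ID per group of documents linked by pairwise matches."""
    parent = {d: d for d in doc_ids}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_person_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the representative of the merged group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {d: find(d) for d in doc_ids}
```

For example, documents `d1`, `d2` and `d3` linked by the pairs (`d1`, `d2`) and (`d2`, `d3`) would all receive the author ID `d1`, while an unlinked `d4` keeps its own ID.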

The disambiguation between Chinese authors and the pinyin names in English literature, as shown in Figure 3, comprises the following steps:

Step one, author name conversion

The names were already cleaned during Chinese author disambiguation and English author disambiguation respectively, and the pinyin format of English authors was normalized to the "first name + last name" format, so no additional cleaning is needed. The Chinese authors of the Chinese literature and the Chinese authors of its cited literature are all converted into pinyin in the "first name + last name" format. If both the given name and the surname of a Chinese name are common family names, as in "Lin Yang" (林杨), the name is converted into the pinyin array {"Yang Lin", "Lin Yang"}, and both forms are tried when matching against author names in the English literature.
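The two-form matching described above can be sketched as follows. The input is assumed to be already converted to pinyin, and `COMMON_SURNAMES` is a small hypothetical sample rather than the list actually used; the function names are likewise illustrative.

```python
# Hypothetical sample of common family names in pinyin.
COMMON_SURNAMES = {"Lin", "Yang", "Wang", "Li", "Zhang"}


def pinyin_name_variants(surname, given_name):
    """Return the 'first name + last name' form, plus the swapped form
    when both parts could be family names."""
    variants = [f"{given_name} {surname}"]
    if surname in COMMON_SURNAMES and given_name in COMMON_SURNAMES:
        variants.append(f"{surname} {given_name}")
    return variants


def matches_english_name(surname, given_name, english_name):
    """Check whether an English author name matches any pinyin variant."""
    return english_name in pinyin_name_variants(surname, given_name)
```

For the name 林杨 (surname Lin, given name Yang) this yields both "Yang Lin" and "Lin Yang", so either spelling in an English document would be treated as a candidate match.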

Step two, conversion of the author's affiliated institution

The affiliated institutions were already cleaned during English author disambiguation, so no additional cleaning is needed.

For Chinese literature, company-type words are not removed, and Google Translate and Wikipedia are used to translate the Chinese institution names into English.

Step three, calculating the similarity of the authors' institutions

The institutions of two articles by same-name authors are taken as feature values, the word frequencies of all words in the feature values are calculated to generate word frequency vectors, and the vector similarity is calculated with the cosine similarity formula; this vector similarity is the mechanism similarity, i.e. the similarity of the affiliated institutions.
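The cosine similarity of word frequency vectors can be sketched as below. Whitespace tokenization and lowercasing are assumptions made for the illustration, and the name `institution_similarity` is hypothetical.

```python
import math
from collections import Counter


def institution_similarity(inst_a, inst_b):
    """Cosine similarity between the word frequency vectors of two names."""
    va = Counter(inst_a.lower().split())
    vb = Counter(inst_b.lower().split())
    # Dot product over the words the two names share.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Two names that share one of their two words, such as "Peking University" and "Tsinghua University", score 0.5 under this sketch, which is well below the 0.9 level used in the final judgment.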

Step four, calculating the similarity of the author cooperation network

All Chinese data obtained after Chinese author disambiguation are grouped by author ID, where one author may correspond to multiple documents; the collaborators in all documents of the same author are merged to generate the author's cooperation network, denoted M;

all English data obtained after English author disambiguation are grouped by author ID, where one author may correspond to multiple documents; the collaborators in all documents of the same author are merged to generate the author's cooperation network, denoted N;

if the pinyin of the Chinese document author's name is the same as the author's name in the English document, the similarity of the cooperation networks is calculated with the Jaccard similarity coefficient (see claim 6):

cooperation network similarity = |M ∩ N| ÷ |M ∪ N|

Step five, calculating the similarity of the author citation network

All Chinese data obtained after Chinese author disambiguation are grouped by author ID, where one author may correspond to multiple documents; the authors of the cited documents in all documents of the same author are merged to generate the author's citation network, denoted P;

all English data obtained after English author disambiguation are grouped by author ID, where one author may correspond to multiple documents; the authors of the cited documents in all documents of the same author are merged to generate the author's citation network, denoted Q;

if the pinyin of the Chinese document author's name is the same as the author's name in the English document, the similarity of the citation networks is calculated with the Jaccard similarity coefficient (see claim 6):

citation network similarity = |P ∩ Q| ÷ |P ∪ Q|
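Claim 6 states that the Jaccard similarity coefficient is used for both the cooperation network and the citation network. A minimal sketch follows, assuming each network is represented as a collection of author name strings and that two empty networks score 0; the function name is hypothetical.

```python
def jaccard_similarity(network_a, network_b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(network_a), set(network_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

The same function would serve for the cooperation networks M and N and for the citation networks P and Q.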

Calculating the similarity between Chinese and English documents requires translating documents from one language into the other, but translation accuracy cannot be guaranteed for large-scale translation, so the research topics of Microsoft Academic are introduced to assist the similarity calculation. The research topics of Microsoft Academic are technical terms, more than 700,000 in total, extracted from the hundred-million-scale article collection of Microsoft Academic by technologies such as artificial intelligence and natural language processing. These research topics are translated using Google Translate and Wikipedia to produce the Chinese research topic set, denoted Topics_zh; the English research topic set is denoted Topics_en; and the correspondence between the Chinese and English research topic sets is denoted zh_To_en.

The method comprises the following steps:

(4) merging C1_zh_to_en, C2_zh_to_en and C3_zh_to_en, adding up the occurrence counts of identical research topics, and taking the 10 research topics with the highest occurrence counts to obtain the final {English research topic + occurrence count} array C_zh_to_en; this array contains the research topics that occur most often across all Chinese documents of the author and is therefore highly representative.

(5) Calculating the hash value of C_zh_to_en using SimHash to obtain the semantic fingerprint of the Chinese documents, denoted D_zh;

(3) merging B1', B2' and B3', adding up the occurrence counts of identical research topics, and taking the 10 research topics with the highest occurrence counts to obtain the final {English research topic + occurrence count} array C_en; this array contains the research topics that occur most often across all English documents of the author and is therefore highly representative.

(4) Calculating the hash value of C_en using SimHash to obtain the semantic fingerprint of the English documents, denoted D_en;
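The merge-and-take-top-10 operation used to build C_zh_to_en and C_en can be sketched with a counter. The per-document inputs are assumed to be dictionaries mapping an English research topic to its occurrence count; the name `merge_topic_counts` is hypothetical.

```python
from collections import Counter


def merge_topic_counts(per_document_counts, k=10):
    """Sum topic counts across documents and keep the k most frequent topics."""
    total = Counter()
    for counts in per_document_counts:
        total.update(counts)  # adds the counts of topics already present
    return total.most_common(k)
```

The resulting list of (topic, count) pairs has the same shape as the weighted word arrays, so it could be passed to a SimHash routine to obtain D_zh or D_en.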

Step three, calculating the Hamming distance between D_zh and D_en;

if the Hamming distance between D_zh and D_en is greater than or equal to 3, the contents of the documents are not similar;

the Hamming distance is converted to a similarity as follows:

Hamming distance = 0: similarity is 1

Hamming distance = 1: similarity is 0.9

Hamming distance = 2: similarity is 0.8

Hamming distance ≥ 3: similarity is 0.

Step seven, if the pinyin of the Chinese literature author's name is the same as the name of the English literature author, calculating the similarity of the scientific research duration, comprising the following steps:

S1, calculating the scientific research duration of the Chinese literature author: all Chinese data obtained after Chinese author disambiguation are grouped by author ID, where one author may correspond to multiple documents; from the grouping result, the earliest-published document among all Chinese documents of the author is found, and the difference between its publication year and the current year is calculated; this difference is the author's scientific research duration, denoted R_zh;

S2, calculating the scientific research duration of the English literature author: all English data obtained after English author disambiguation are grouped by author ID, where one author may correspond to multiple documents; from the grouping result, the earliest-published document among all English documents of the author is found, and the difference between its publication year and the current year is calculated; this difference is the author's scientific research duration, denoted R_en;

S3, calculating the difference R_diff between R_zh and R_en and converting it to the similarity of the scientific research duration according to the following standard:

R_diff = 0: similarity is 1

1 ≤ R_diff ≤ 2: similarity is 0.9

3 ≤ R_diff ≤ 4: similarity is 0.8

R_diff > 4: similarity is 0;
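The scientific research duration and its similarity table can be sketched as follows, assuming publication years are integers and R_diff is taken as an absolute difference (the text only says "difference"). The function names are hypothetical.

```python
def research_duration(publication_years, current_year):
    """Years elapsed between the author's earliest publication and now."""
    return current_year - min(publication_years)


def duration_similarity(r_zh, r_en):
    """Convert the duration difference R_diff to a similarity score."""
    r_diff = abs(r_zh - r_en)  # assumption: the sign of the difference is ignored
    if r_diff == 0:
        return 1.0
    if r_diff <= 2:
        return 0.9
    if r_diff <= 4:
        return 0.8
    return 0.0
```

For instance, an author whose earliest Chinese paper appeared in 2010 and earliest English paper in 2012 would have R_diff = 2 and a similarity of 0.9.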

Step eight, judging whether the Chinese literature author and the English literature author are the same person, wherein the judgment standard is as follows:

if the mechanism similarity is less than 0.9 and two or more of the four similarities, namely the author cooperation network similarity, the author citation network similarity, the document content similarity and the scientific research duration similarity, are greater than 0.8, the two authors are considered the same person.
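The judgment can be sketched as a small function that combines the criterion of step eight (mechanism similarity below 0.9) with the criterion stated in claim 1 (mechanism similarity of 0.9 or more). Whether the 0.8 comparison is strict is not clear from the text, so the sketch assumes a strict comparison and exposes the threshold as a parameter; the name `is_same_person` is hypothetical.

```python
def is_same_person(mechanism_sim, other_sims, threshold=0.8):
    """Decide whether a Chinese and an English author are the same person.

    other_sims holds the cooperation network, citation network, document
    content and scientific research duration similarities.
    """
    hits = sum(1 for s in other_sims if s > threshold)
    # Claim 1: a strong institution match needs one supporting similarity.
    # Step eight: a weaker institution match needs two.
    required = 1 if mechanism_sim >= 0.9 else 2
    return hits >= required
```

Under these assumptions, a pair with mechanism similarity 0.95 and a single supporting similarity of 0.9 is accepted, while the same pair with mechanism similarity 0.5 is rejected unless a second similarity also exceeds the threshold.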

Step nine, according to steps one to eight, the similarities between the Chinese and English documents of authors with matching names are calculated pairwise to judge whether the authors are the same person; if so, the author ID of the Chinese documents is changed to the author ID of the English documents; the pairwise results are then aggregated to complete the Chinese and English author name fusion disambiguation.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be covered thereby.

Claims

1. A Chinese and English literature author name fusion disambiguation method is characterized in that: the method comprises the following steps:

(2) if not, aggregating the results to obtain a Chinese disambiguation result;

(2) if not, aggregating the results to obtain English disambiguation results;

S2, grouping the author IDs obtained from the English disambiguation result;

if they are the same, respectively calculating the mechanism similarity, cooperation network similarity, citation network similarity, document content similarity and scientific research duration similarity between the Chinese and English documents, and judging from these five similarities whether the documents belong to the same author; the judgment standard is as follows: if the mechanism similarity is greater than or equal to 0.9 and one of the four similarities, namely the author cooperation network similarity, the author citation network similarity, the document content similarity and the scientific research duration similarity, is greater than 0.8, the authors are considered the same person;

2. The Chinese-English literature author name fusion disambiguation method of claim 1, wherein: the method for calculating the similarity of the document contents of the author of the Chinese document in the first step comprises the following steps:

weight less than 0.2: converted to 1

weight greater than or equal to 0.2 and less than 0.4: converted to 2

weight greater than or equal to 0.4 and less than 0.6: converted to 3

weight greater than or equal to 0.6 and less than 0.8: converted to 4

weight greater than or equal to 0.8: converted to 5

Hamming distance = 0: similarity is 1

Hamming distance = 1: similarity is 0.9

Hamming distance = 2: similarity is 0.8

Hamming distance ≥ 3: similarity is 0;

if the Hamming distance is less than 3, the two documents are similar.

3. The Chinese-English literature author name fusion disambiguation method according to claim 1 or 2, wherein: the method for calculating the document content similarity of English document authors in step two comprises the following steps:

weight less than 0.2: converted to 1

weight greater than or equal to 0.2 and less than 0.4: converted to 2

weight greater than or equal to 0.4 and less than 0.6: converted to 3

weight greater than or equal to 0.6 and less than 0.8: converted to 4

weight greater than or equal to 0.8: converted to 5

if the Hamming distance is less than 3, the two documents are similar.

4. The Chinese-English literature author name fusion disambiguation method of claim 1, wherein: the method for calculating the similarity of the content of the literature published by the authors in the third step comprises the following steps:

S3, calculating the similarity of the document contents from the Hamming distance between D_zh and D_en, wherein if the Hamming distance between D_zh and D_en is less than 3, the contents of the two documents are similar;

5. The Chinese-English literature author name fusion disambiguation method of claim 1, wherein: in step three, the similarity of the scientific research duration is calculated for documents in which the pinyin of the Chinese document author's name is the same as the name of the English document author, and the calculation method comprises the following steps:

(3) if the pinyin of the author's name in the Chinese document is the same as the name of the author in the English document, calculating the difference R_diff between R_zh and R_en:

R_diff = 0: similarity is 1

1 ≤ R_diff ≤ 2: similarity is 0.9

3 ≤ R_diff ≤ 4: similarity is 0.8

R_diff > 4: similarity is 0.

6. The Chinese-English literature author name fusion disambiguation method of claim 1, wherein: in step one, step two and step three, the Jaccard similarity coefficient is used to calculate the cooperation network similarity and the citation network similarity respectively.