CN111008285B - Author disambiguation method based on thesis key attribute network

Info

Publication number: CN111008285B
Application number: CN201911207075.6A
Authority: CN
Inventors: 冯凯; 康锐文; 王元卓; 刘冰冰; 彭亮; 贾士杨
Original assignee: Big Data Research Institute Institute Of Computing Technology Chinese Academy Of Sciences
Current assignee: China Science And Technology Big Data Research Institute
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2021-04-13
Anticipated expiration: 2039-11-29
Also published as: CN111008285A

Abstract

The invention discloses an author disambiguation method based on a paper key attribute network. The key attribute relationship network is built by collecting the key attributes of papers and linking them through their correlations: a relationship network of paper co-authors, a relationship network of the same institution and a relationship network of the same field are formed separately and then combined into the relationship network of paper key attributes. The method extracts the paper name, author institution and author field from each paper and builds the relationship network around the author name. When disambiguating a paper's author, it matches the author name and then compares the institutions and fields associated with that name in the network, which effectively resolves the case in which one name corresponds to different actual authors. In addition, by using the paper name to find the co-authors of the author being disambiguated and then matching against the co-authors of those co-authors, it effectively resolves the case in which one actual author's name is written in different ways.

Description

Author disambiguation method based on thesis key attribute network

Technical Field

The invention belongs to the technical field of paper author disambiguation, covering both authors who share one name and single authors who appear under several names, and particularly relates to an author disambiguation method based on a paper key attribute network.

Background Art

In recent years, with the development of the Internet, nearly every aspect of life has become closely tied to it, and academic activity is no exception. Most academic results can now be queried online. With such massive data, however, accurately retrieving the data one needs becomes particularly important. At present, most paper platforms support searching by author to retrieve the papers published by the queried author, so the accuracy of author names is particularly important. In practice, the following two situations commonly occur.

The first is that the name of one author may be presented in different ways across that author's papers. For example, an author whose real name is "Zhang San" may appear as "San Zhang" in some foreign literature, or in an abbreviated form such as "Zhang S.".

The second is that different authors may have the same name. For example, two authors at different institutions may both be called "Li Si"; or two authors whose names are written differently in Chinese may both romanize as "Wang Wu", so that the corresponding-author name written in some foreign literature is "Wu Wang" for both.

These two situations make paper retrieval very difficult. Many paper search engines in existing systems retrieve directly by character-string matching; as the data volume grows, the accuracy of the results can no longer be guaranteed, and in most cases the results must be screened manually. As the demand for author accuracy has risen, many methods for disambiguating paper authors have appeared, but traditional methods only perform simple matching on dimensions such as institution, keywords and publication information. As the data volume grows, the papers they return become disorganized, and researchers must spend a long time screening them afterwards, which seriously reduces research efficiency.

Disclosure of Invention

The invention provides an author disambiguation method based on a paper key attribute network, motivated by the current need to disambiguate paper authors and by the limited effectiveness of traditional disambiguation methods on large data volumes. It merges data belonging to one actual author whose name is written in different ways, and separates data that share a name but belong to different actual authors.

The technical scheme adopted to achieve this purpose is as follows. An author disambiguation method based on a paper key attribute network establishes a key attribute relationship network, which is formed by collecting the key attributes of papers and linking them through their correlations. The entity nodes in the relationship network mainly comprise: author name, author institution, author field and paper name. The authors are clustered along three dimensions, namely paper name, institution and field, to form a relationship network of paper co-authors, a relationship network of the same institution and a relationship network of the same field, which together form the relationship network of paper key attributes. The implementation logic of the author disambiguation method based on the key attribute relationship network comprises the following steps 1 to 7.

Step 1: Input unit A1 into the relationship network.

Step 2: Insert the field, institution and paper name of unit A1 into the relationship network by a Merge operation.

Step 3: Query whether the author name N1 of A1 is identical to any N node in the relationship network.

Step 4: If an identical node exists, enter FLOW1, which mainly judges whether the identical name corresponds to different actual authors.

FLOW1 comprises the following steps (1) to (7).

(1) Take the lists of fields (F) and institutions (O) associated with the N node identical to N1, denoted F-List and O-List respectively.

(2) Match the fields F associated with N1 against F-List; each successful match has weight 1. Compute the field weight sum, denoted SumWeightField.

(3) Match the institutions O associated with N1 against O-List; each successful match has weight 2. Compute the institution weight sum, denoted SumWeightOrg.

(4) Compute the total weight: SumWeight = SumWeightField + SumWeightOrg.

(5) If SumWeight > 2, mark N1 and the matched N node as the same person.

(6) If SumWeight ≤ 2, mark N1 and the matched N node as two different persons.

(7) Output the result.
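The FLOW1 scoring above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes that "matching" means exact equality of already-normalized strings and that each matched item counts once. The function and constant names are hypothetical.

```python
FIELD_WEIGHT = 1           # weight of one matched field, per step (2)
ORG_WEIGHT = 2             # weight of one matched institution, per step (3)
SAME_PERSON_THRESHOLD = 2  # SumWeight must exceed this, per steps (5) and (6)


def flow1_same_person(new_fields, new_orgs, f_list, o_list):
    """Return (SumWeight, is_same_person) for a same-name candidate.

    new_fields / new_orgs: fields and institutions of the incoming unit (N1).
    f_list / o_list: F-List and O-List of the existing same-name N node.
    Assumption: a match is exact string equality.
    """
    sum_weight_field = FIELD_WEIGHT * len(set(new_fields) & set(f_list))
    sum_weight_org = ORG_WEIGHT * len(set(new_orgs) & set(o_list))
    sum_weight = sum_weight_field + sum_weight_org
    return sum_weight, sum_weight > SAME_PERSON_THRESHOLD


# Made-up data: one shared field and one shared institution.
print(flow1_same_person({"data mining"}, {"org x"},
                        {"data mining", "nlp"}, {"org x"}))  # (3, True)
# Same field but a different institution.
print(flow1_same_person({"data mining"}, {"org y"},
                        {"data mining"}, {"org x"}))         # (1, False)
```

Because the comparison is strict, a score of exactly 2 (for example, an institution match alone) is treated as two different persons.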

Step 5: If no identical node exists, enter FLOW2, which mainly judges whether the same actual author appears under a different name form.

FLOW2 comprises the following steps (1) to (8).

(1) From the relationship network, take the paper name node list Title-List whose nodes equal the paper name (T) of A1, the field node list Field-List whose nodes equal the field (F) of A1, and the institution node list Org-List whose nodes equal the institution (O) of A1.

(2) Through the author name nodes (N nodes) associated with Title-List, link the paper's author with the co-authors: query the co-authors of N1, then query back from those co-authors to screen potential matching authors. The N-List obtained here consists of the authors who collaborated with N1 in A1, and they are then re-associated with their own co-authors. This step rests on the practical observation that an author who has collaborated with N1 may collaborate with N1 more than once. The main steps are as follows:

a) Through Title-List, query the associated author names, N-List.

b) Through N-List, query the associated paper names, T-List.

c) Through T-List, query the associated author names and output them as N-Title-List.

(3) Query the author name nodes associated with Field-List and output them as N-Field-List.

(4) Query the author name nodes associated with Org-List and output them as N-Org-List.

(5) Match N1 by relevance against N-Title-List, N-Field-List and N-Org-List; record the results as Ret-Title-List, Ret-Field-List and Ret-Org-List, with weights 3, 2 and 1 respectively.

(6) Aggregate Ret-Title-List, Ret-Field-List and Ret-Org-List by value, take the intersection, compute the weight sum SumWeight of each aggregated result set, and output the result as Ret-List.

(7) Take the entry of Ret-List with the highest weight sum SumWeight: if SumWeight > 4 it is the same author; if SumWeight ≤ 4 it is a different author.

(8) For the entries whose weight sum is the highest and exceeds 4, perform author-name relevance matching again and take the one with the highest relevance.
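Sub-steps a) to c) of FLOW2 step (2) amount to a two-hop expansion over co-authorship. The sketch below illustrates one way to read them, assuming the network is available as two plain dictionaries (`authors_of` mapping a paper name to its author names, `papers_of` mapping an author name to paper names). Both dictionaries and the function name are hypothetical, and the source does not say whether the direct co-authors should be excluded from the result, so they are kept.

```python
def coauthor_candidates(title, authors_of, papers_of):
    """Return N-Title-List for the paper name `title` of the incoming unit."""
    # a) Title-List -> N-List: authors already linked to this paper name,
    #    i.e. the co-authors of N1 on this paper.
    n_list = set(authors_of.get(title, ()))
    # b) N-List -> T-List: every paper those co-authors appear on.
    t_list = set()
    for author in n_list:
        t_list |= set(papers_of.get(author, ()))
    # c) T-List -> N-Title-List: every author on those papers.
    n_title_list = set()
    for paper in t_list:
        n_title_list |= set(authors_of.get(paper, ()))
    return n_title_list


# Made-up data: "li si" co-wrote "paper b" with "zhang san" earlier, and is
# also listed on "paper a", the paper of the incoming unit "san zhang".
authors_of = {"paper a": {"li si"}, "paper b": {"li si", "zhang san"}}
papers_of = {"li si": {"paper a", "paper b"}, "zhang san": {"paper b"}}
candidates = coauthor_candidates("paper a", authors_of, papers_of)
print(sorted(candidates))  # ['li si', 'zhang san']
```

In this example "zhang san" surfaces as a candidate for the incoming name "san zhang" purely through the shared co-author, which is the repeated-collaboration assumption stated in step (2).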

Step 6: Feed the result of step 4 or step 5 back into the relationship network: if it is a new author name node, insert it into the network; otherwise, update the existing author name node and add the new alias to it.

Step 7: Repeat the above six steps, so that disambiguation is achieved while the relationship network is being built.

The unit mentioned above: one input of information is called a unit. A unit is one entry of the author information list extracted from a paper and includes the author name (N), field (F), institution (O) and paper name (T). A1 denotes a specific instance of a unit; this convention applies throughout the remainder of the document and is not repeated.

Step 1 above mainly inputs the unit data into the relationship network and includes the following steps (1) to (2).

(1) Convert all characters of the author name, paper name, field and institution to lower case.

(2) Remove special characters such as "-" from the data.
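A minimal Python sketch of this normalization is given below. The source names only "-" as a special character, so the character class used here (anything that is not a letter, digit, underscore or whitespace) is an assumption, as is the collapsing of repeated whitespace; the function name is hypothetical.

```python
import re

# Assumption: "special characters" are all characters other than word
# characters and whitespace. In Python 3, \w is Unicode-aware for str
# patterns, so Chinese characters in names are preserved.
_SPECIAL = re.compile(r"[^\w\s]")


def normalize(text: str) -> str:
    """Lower-case the text, drop special characters, tidy whitespace."""
    lowered = text.lower()                 # sub-step (1)
    stripped = _SPECIAL.sub("", lowered)   # sub-step (2)
    return " ".join(stripped.split())      # assumed extra: collapse spaces


print(normalize("Zhang-San"))         # zhangsan
print(normalize("  Data  Mining. "))  # data mining
```

Note that removing "-" joins the two halves of a hyphenated name; replacing it with a space instead would be an equally plausible reading and would change later name matching.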

Step 2 above includes the following steps (1) to (5).

(1) Extract the field nodes of A1 and insert them into the relationship network one by one; let F1 be one field node of A1.

(2) Determine whether a node identical to F1 exists in the relationship network.

(3) If so, ignore it.

(4) If not, insert it into the relationship network.

(5) Repeat steps (1) to (4) for the institution and paper name nodes.
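The Merge operation described above is an insert-if-absent. The following Python sketch shows the idea on a plain set of `(kind, value)` node keys; the set representation and the function names are illustrative assumptions, since the source does not specify how the network is stored.

```python
def merge_node(nodes: set, kind: str, value: str) -> bool:
    """Insert (kind, value) only if absent; return True if it was added."""
    key = (kind, value)
    if key in nodes:
        return False  # identical node already exists: ignore it
    nodes.add(key)
    return True


def merge_unit_attributes(nodes, fields, orgs, titles):
    """Merge field (F), institution (O) and paper name (T) nodes in turn."""
    added = []
    for kind, values in (("F", fields), ("O", orgs), ("T", titles)):
        for value in values:
            if merge_node(nodes, kind, value):
                added.append((kind, value))
    return added


# Made-up data: the field "data mining" is already in the network.
nodes = {("F", "data mining")}
added = merge_unit_attributes(
    nodes, ["data mining", "graph theory"], ["org x"], ["paper a"])
print(added)  # [('F', 'graph theory'), ('O', 'org x'), ('T', 'paper a')]
```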

The beneficial effects of the invention are as follows. The method extracts the paper name, author institution and author field from each paper and builds a relationship network around the author name. When disambiguating a paper's author, it matches the author name and then compares the institutions and fields associated with that name in the network, which effectively resolves the case in which one name corresponds to different actual authors. In addition, by using the paper name to find the co-authors (Coop Author List) of the author being disambiguated and then matching against the co-authors of the Coop Author List, it effectively resolves the case in which one actual author's name is written in different ways.

Drawings

FIG. 1 is an exemplary diagram of a key attribute relationship network.

FIG. 2 is a general flow diagram of a disambiguation method.

FIG. 3 is a flowchart of FLOW1.

FIG. 4 is a flowchart of FLOW2.

Detailed Description

The invention provides an author disambiguation method based on a paper key attribute network, motivated by the current need to disambiguate paper authors and by the limited effectiveness of traditional disambiguation methods on large data volumes. The method extracts the paper name, author institution and author field from each paper and builds a relationship network around the author name. When disambiguating a paper's author, it matches the author name and then compares the institutions and fields associated with that name in the network, which effectively resolves the case in which one name corresponds to different actual authors. In addition, by using the paper name to find the co-authors of the author being disambiguated and then matching against the co-authors of those co-authors, it effectively resolves the case in which one actual author's name is written in different ways. The relationship network is briefly described first, and the logic for implementing the method is explained afterwards.

The key attribute relationship network is formed by collecting the key attributes of papers and linking them through their correlations. The entity nodes in the relationship network mainly comprise: author name, author institution, author field and paper name. The authors are clustered along three dimensions, namely paper name, institution and field, to form a relationship network of paper co-authors, a relationship network of the same institution and a relationship network of the same field, which together form the relationship network of paper key attributes.

FIG. 1 is an exemplary diagram of a key attribute relationship network, in which N denotes an author name, F a field, O an institution and T a paper name; the key attribute relationship network is formed by the relationships between these nodes. Unit: for convenience of description, one input of information is called a unit. A unit is one entry of the author information list extracted from a paper and includes the author name (N), field (F), institution (O) and paper name (T). A1 denotes a specific instance of a unit; this convention applies throughout the remainder of the document and is not repeated.
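One possible in-memory representation of a unit and of the N/F/O/T network is sketched below in Python. The class names, the adjacency-set layout and the choice to link every attribute node directly to the author name node are assumptions made for illustration; the source does not prescribe a storage model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One author record extracted from a paper."""
    name: str   # N: author name
    field: str  # F: field
    org: str    # O: institution
    title: str  # T: paper name


class AttributeNetwork:
    """Undirected graph whose nodes are (kind, value) pairs, kind in N/F/O/T."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_unit(self, unit: Unit) -> None:
        n = ("N", unit.name)
        for attr in (("F", unit.field), ("O", unit.org), ("T", unit.title)):
            # Link the author name node to each of its attribute nodes.
            self.adj[n].add(attr)
            self.adj[attr].add(n)

    def neighbors(self, node, kind):
        """Values of the neighbors of `node` that have the given kind."""
        return {v for k, v in self.adj.get(node, ()) if k == kind}


# Made-up data: two authors sharing a field, an institution and a paper.
net = AttributeNetwork()
net.add_unit(Unit("zhang san", "data mining", "org x", "paper a"))
net.add_unit(Unit("li si", "data mining", "org x", "paper a"))
print(net.neighbors(("T", "paper a"), "N"))  # the two co-authors of "paper a"
```

Querying the N neighbors of a T, O or F node yields the co-author, same-institution and same-field groupings respectively.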

The implementation logic of the author disambiguation method based on the key attribute relationship network is described in detail below. FIG. 2 is the general flowchart of the disambiguation method.

The flow in FIG. 2 is explained as follows.

(1) Input unit A1 into the relationship network.

(2) Insert the field, institution and paper name of unit A1 into the relationship network by a Merge operation.

(3) Query whether the author name N1 of A1 is identical to any N node in the relationship network.

(4) If an identical node exists, enter FLOW1, which mainly judges whether the identical name corresponds to different actual authors.

(5) If no identical node exists, enter FLOW2, which mainly judges whether the same actual author appears under a different name form.

(6) Feed the result of step (4) or (5) back into the relationship network: if it is a new author name node, insert it into the network; otherwise, update the existing author name node and add the new alias to it.

(7) Repeat the above six steps, so that disambiguation is achieved while the relationship network is being built.
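The routing in steps (3) to (7) can be sketched as a simple loop. The Python code below is an illustrative skeleton under several assumptions: units arrive already normalized and merged (steps (1) and (2)), authors are kept in a plain list of dictionaries, only the primary name (not aliases) is compared in step (3), and the two decision procedures are passed in as callables. The stand-ins `demo_flow1` and `demo_flow2` are hypothetical placeholders, not the FLOW1 and FLOW2 procedures themselves.

```python
def process_units(units, is_same_person, find_alias_target):
    """Route each unit to FLOW1 or FLOW2, then update the author records."""
    authors = []  # one dict per actual author discovered so far
    for unit in units:
        same_name = [a for a in authors if a["name"] == unit["name"]]
        if same_name:
            # Step (4): an identical name exists; is it the same person?
            target = next(
                (a for a in same_name if is_same_person(unit, a)), None)
        else:
            # Step (5): no identical name; is it an alias of a known author?
            target = find_alias_target(unit, authors)
        if target is None:
            # Step (6): new author name node.
            target = {"name": unit["name"], "aliases": set(),
                      "fields": set(), "orgs": set(), "titles": set()}
            authors.append(target)
        elif unit["name"] != target["name"]:
            # Step (6): known author under a new spelling.
            target["aliases"].add(unit["name"])
        target["fields"].add(unit["field"])
        target["orgs"].add(unit["org"])
        target["titles"].add(unit["title"])
    return authors


def demo_flow1(unit, author):
    # Stand-in: field match weighs 1, institution match weighs 2.
    score = (unit["field"] in author["fields"]) \
        + 2 * (unit["org"] in author["orgs"])
    return score > 2


def demo_flow2(unit, authors):
    # Stand-in that never finds an alias.
    return None


units = [
    {"name": "li si", "field": "data mining", "org": "org x", "title": "paper a"},
    {"name": "li si", "field": "data mining", "org": "org x", "title": "paper b"},
    {"name": "li si", "field": "botany", "org": "org y", "title": "paper c"},
]
authors = process_units(units, demo_flow1, demo_flow2)
print(len(authors))  # 2
```

With this made-up input, the first two records are merged into one "li si" and the third becomes a separate author with the same name.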

Step (1) mainly inputs the unit data into the relationship network and comprises the following steps:

1. Convert all characters of the author name, paper name, field and institution to lower case.

2. Remove special characters such as "-" from the data.

The main steps of step (2) are as follows:

1. Extract the field nodes of A1 and insert them into the relationship network one by one; let F1 be one field node of A1.

2. Determine whether a node identical to F1 exists in the relationship network.

3. If so, ignore it.

4. If not, insert it into the relationship network.

5. Repeat steps 1 to 4 for the institution and paper name nodes.

In step (4), when N1 is found to be identical to an N node in the relationship network, FLOW1 is performed. FIG. 3 is the flowchart of FLOW1, which is described in detail below.

1. Take the lists of fields (F) and institutions (O) associated with the N node identical to N1, denoted F-List and O-List respectively.

2. Match the fields F associated with N1 against F-List; each successful match has weight 1. Compute the field weight sum, denoted SumWeightField.

3. Match the institutions O associated with N1 against O-List; each successful match has weight 2. Compute the institution weight sum, denoted SumWeightOrg.

4. Compute the total weight: SumWeight = SumWeightField + SumWeightOrg.

5. If SumWeight > 2, mark N1 and the matched N node as the same person.

6. If SumWeight ≤ 2, mark N1 and the matched N node as two different persons.

7. Output the result.
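As a worked illustration of the FLOW1 threshold, suppose each matched field contributes 1 and each matched institution contributes 2. The match counts m_F and m_O are notation introduced here for the example and do not appear in the source.

```latex
\begin{aligned}
&\mathrm{SumWeight} = 1 \cdot m_F + 2 \cdot m_O \\
&m_F = 1,\ m_O = 1: \quad \mathrm{SumWeight} = 3 > 2 \ \Rightarrow\ \text{same person} \\
&m_F = 0,\ m_O = 1: \quad \mathrm{SumWeight} = 2 \le 2 \ \Rightarrow\ \text{two persons} \\
&m_F = 2,\ m_O = 0: \quad \mathrm{SumWeight} = 2 \le 2 \ \Rightarrow\ \text{two persons}
\end{aligned}
```

Under this reading, an institution match alone never suffices; at least one additional field or institution match is needed before two same-name records are treated as one person.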

In step (5), when N1 is found not to be identical to any N node in the relationship network, FLOW2 is performed. FIG. 4 is the flowchart of FLOW2, which is described in detail below.

1. From the relationship network, take the paper name node list Title-List whose nodes equal the paper name (T) of A1, the field node list Field-List whose nodes equal the field (F) of A1, and the institution node list Org-List whose nodes equal the institution (O) of A1.

2. Through the author name nodes (N nodes) associated with Title-List, link the paper's author with the co-authors: query the co-authors of N1, then query back from those co-authors to screen potential matching authors. The N-List obtained here consists of the authors who collaborated with N1 in A1, and they are then re-associated with their own co-authors. This step rests on the practical observation that an author who has collaborated with N1 may collaborate with N1 more than once. The main steps are as follows:

a) Through Title-List, query the associated author names, N-List.

b) Through N-List, query the associated paper names, T-List.

c) Through T-List, query the associated author names and output them as N-Title-List.

3. Query the author name nodes associated with Field-List and output them as N-Field-List.

4. Query the author name nodes associated with Org-List and output them as N-Org-List.

5. Match N1 by relevance against N-Title-List, N-Field-List and N-Org-List; record the results as Ret-Title-List, Ret-Field-List and Ret-Org-List, with weights 3, 2 and 1 respectively.

6. Aggregate Ret-Title-List, Ret-Field-List and Ret-Org-List by value, take the intersection, compute the weight sum SumWeight of each aggregated result set, and output the result as Ret-List.

7. Take the entry of Ret-List with the highest weight sum SumWeight: if SumWeight > 4 it is the same author; if SumWeight ≤ 4 it is a different author.

8. For the entries whose weight sum is the highest and exceeds 4, perform author-name relevance matching again and take the one with the highest relevance.
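Steps 5 to 8 of FLOW2 can be read as a weighted vote over candidate names. The Python sketch below is one interpretation, not the patented implementation. The source does not define "relevance matching", so `name_similarity` (a token-order-insensitive ratio from the standard-library `difflib.SequenceMatcher`) and the cut-off `NAME_MATCH_MIN` are assumptions. Summing the weights of every list in which a candidate appears is likewise an assumed reading of "aggregate by value, take the intersection".

```python
from collections import defaultdict
from difflib import SequenceMatcher

WEIGHTS = {"title": 3, "field": 2, "org": 1}  # per step 5
SAME_AUTHOR_THRESHOLD = 4                     # per step 7
NAME_MATCH_MIN = 0.8                          # assumed relevance cut-off


def name_similarity(a: str, b: str) -> float:
    """Hypothetical relevance measure in [0, 1], ignoring token order."""
    key_a = " ".join(sorted(a.split()))
    key_b = " ".join(sorted(b.split()))
    return SequenceMatcher(None, key_a, key_b).ratio()


def flow2_best_match(n1, n_title_list, n_field_list, n_org_list):
    """Return the existing name judged to be the same author as n1, or None."""
    lists = {"title": n_title_list, "field": n_field_list, "org": n_org_list}
    sum_weight = defaultdict(int)
    for dim, names in lists.items():
        for name in set(names):
            if name_similarity(n1, name) >= NAME_MATCH_MIN:  # step 5
                sum_weight[name] += WEIGHTS[dim]             # step 6
    if not sum_weight:
        return None
    best = max(sum_weight.values())                          # step 7
    if best <= SAME_AUTHOR_THRESHOLD:
        return None
    tied = [n for n, w in sum_weight.items() if w == best]
    return max(tied, key=lambda n: name_similarity(n1, n))   # step 8


# Made-up data: "zhang san" is reachable via co-authorship and shares a field.
print(flow2_best_match("san zhang",
                       {"li si", "zhang san"}, {"zhang san"}, set()))
# Co-authorship plus institution only gives 3 + 1 = 4, which is not enough.
print(flow2_best_match("san zhang", {"zhang san"}, set(), {"zhang san"}))
```

With weights 3, 2 and 1 and a strict threshold of 4, a candidate must appear in at least the co-author list and the field list to be accepted.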

Claims

1. An author disambiguation method based on a paper key attribute network, characterized in that a key attribute relationship network is established, the key attribute relationship network being formed by collecting the key attributes of papers and linking them through their correlations, wherein the entity nodes in the relationship network comprise: author name, author institution, author field and paper name; the authors are clustered along three dimensions, namely paper name, institution and field, to form a relationship network of paper co-authors, a relationship network of the same institution and a relationship network of the same field, which finally form the relationship network of paper key attributes; the implementation logic of the author disambiguation method based on the key attribute relationship network comprises the following steps:

step 1: inputting unit A1 into the relationship network;

step 2: inserting the field, institution and paper name of unit A1 into the relationship network by a Merge operation;

step 3: querying whether the author name N1 of A1 is identical to any N node in the relationship network;

step 4: if an identical node exists, entering FLOW1 to judge whether the identical name corresponds to different actual authors; FLOW1 comprises the following steps (1) to (7):

(1) taking the lists of fields F and institutions O associated with the N node identical to N1, denoted F-List and O-List respectively;

(2) matching the fields F associated with N1 against F-List, each successful match having weight 1, and computing the field weight sum, denoted SumWeightField;

(3) matching the institutions O associated with N1 against O-List, each successful match having weight 2, and computing the institution weight sum, denoted SumWeightOrg;

(4) computing the total weight: SumWeight = SumWeightField + SumWeightOrg;

(5) if SumWeight > 2, marking N1 and the matched N node as the same person;

(6) if SumWeight ≤ 2, marking N1 and the matched N node as two different persons;

(7) outputting the result;

step 5: if no identical node exists, entering FLOW2 to judge whether the same actual author appears under a different name form; FLOW2 comprises the following steps (1) to (8):

(1) taking from the relationship network the paper name node list Title-List whose nodes equal the paper name T of A1, the field node list Field-List whose nodes equal the field F of A1, and the institution node list Org-List whose nodes equal the institution O of A1;

(2) through the author name nodes (N nodes) associated with Title-List, linking the paper's author with the co-authors; by querying the co-authors of N1 and then querying back from them, screening the matching authors; the N-List obtained here consists of the authors who collaborated with N1 in A1, re-associated with their own co-authors; this step is based on the practical observation that an author who has collaborated with N1 collaborates with N1 more than once; the steps are as follows:

a) through Title-List, querying the associated author names N-List;

b) through N-List, querying the associated paper names T-List;

c) through T-List, querying the associated author names and outputting them as N-Title-List;

(3) querying the author name nodes associated with Field-List and outputting them as N-Field-List;

(4) querying the author name nodes associated with Org-List and outputting them as N-Org-List;

(5) matching N1 by relevance against N-Title-List, N-Field-List and N-Org-List, and recording the results as Ret-Title-List, Ret-Field-List and Ret-Org-List, with weights 3, 2 and 1 respectively;

(6) aggregating Ret-Title-List, Ret-Field-List and Ret-Org-List by value, taking the intersection, computing the weight sum SumWeight of each aggregated result set, and outputting the result as Ret-List;

(7) taking the entry of Ret-List with the highest weight sum SumWeight; if SumWeight > 4, it is the same author; if SumWeight ≤ 4, it is a different author;

(8) for the entries whose weight sum is the highest and exceeds 4, performing author-name relevance matching again and taking the one with the highest relevance;

step 6: feeding the result of step 4 or step 5 back into the relationship network; if it is a new author name node, inserting it into the relationship network; otherwise, updating the existing author name node in the relationship network and adding the new alias to it;

step 7: repeating the above six steps, so that disambiguation is achieved while the relationship network is being built;

wherein one input of information is called a unit; a unit is one entry of the author information list extracted from a paper and includes: author name N, field F, institution O and paper name T; A1 denotes a specific instance of a unit.

2. The author disambiguation method based on paper key attribute network as claimed in claim 1, wherein said step 1 is inputting unit data into a relational network, comprising the steps of:

(1) converting all characters of the author name, paper name, field and institution data to lower case;

(2) removing special characters from the data.

3. The author disambiguation method based on paper key attribute network as claimed in claim 1, wherein step 2 comprises the steps of:

(1) extracting the domain nodes in A1, and sequentially inserting the domain nodes into the relational network, wherein F1 is one domain node in A1;

(2) judging whether the same node as F1 exists in the relational network;

(3) if so, ignoring;

(4) if not, inserting into the relation network;

(5) the rest of the institutions and the paper title repeat the above steps (1) to (4).