CN117610541A - Author disambiguation method and device for large-scale data and readable storage medium - Google Patents


Info

Publication number
CN117610541A
CN117610541A
Authority
CN
China
Prior art keywords
author
data
document
data block
pairs
Prior art date
Legal status
Pending
Application number
CN202410067264.2A
Other languages
Chinese (zh)
Inventor
陆矜菁
姬朋立
严笑然
刘洋
顾剑波
侯炜华
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410067264.2A priority Critical patent/CN117610541A/en
Publication of CN117610541A publication Critical patent/CN117610541A/en
Pending legal-status Critical Current


Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to an author disambiguation method and device for large-scale data, and a readable storage medium. The method comprises the following steps: generating corresponding predicate instance data based on the large-scale data and a predefined predicate, wherein the large-scale data includes structurally stored document information and corresponding author information, and the predicate instance data includes document author pairs; splitting the predicate instance data into a plurality of initial data blocks; merging document author pairs with the same author in the plurality of initial data blocks based on a predetermined first-order logic rule to generate a final data block; and generating author identification information corresponding to the large-scale data based on the document author pairs in the final data block. The method solves the problem of low author disambiguation efficiency for large-scale academic document data in the related art.

Description

Author disambiguation method and device for large-scale data and readable storage medium
Technical Field
The present application relates to the field of text processing technologies, and in particular, to a method and apparatus for author disambiguation of large-scale data, and a readable storage medium.
Background
Because the volume of academic document data is huge and continually increasing, identifying same-name authors has become an important problem for improving the accuracy of data retrieval. In the prior art, some techniques identify authors by constructing similarity matrices from an associated heterogeneous network and semantic representations of papers, then clustering and disambiguating: representation vectors are extracted to generate separate similarity matrices, which are combined by weighted summation; an unsupervised clustering method is applied; and the disambiguation result is obtained after assigning the discrete points left over from clustering to clusters. However, the existing author disambiguation methods provide no way to accelerate the algorithm for large-scale and ultra-large-scale academic document data, adapt poorly to the huge document data volumes found in real situations, and complete author disambiguation inefficiently.
Disclosure of Invention
This embodiment provides an author disambiguation method and device for large-scale data, and a readable storage medium, to solve the problem of low author disambiguation efficiency for large-scale academic document data in the related art.
In a first aspect, in this embodiment there is provided a method of author disambiguation of large-scale data, the method comprising:
Generating corresponding predicate instance data based on the large-scale data and a predefined predicate; wherein the large-scale data includes structurally stored literature information and corresponding author information; the predicate instance data includes a document author pair;
splitting the predicate instance data into a plurality of initial data blocks;
merging literature author pairs with the same author in the plurality of initial data blocks based on a predetermined first-order logic rule to generate a final data block;
and generating author identification information corresponding to the large-scale data based on the document author pairs in the final data block.
In some of these embodiments, the merging, based on a predetermined first-order logic rule, pairs of literature authors having the same author in the plurality of initial data blocks, generating a final data block includes:
acquiring a literature author pair with the same author in the initial data block based on a predetermined first-order logic rule;
merging the document author pairs to obtain updated data blocks corresponding to the initial data blocks;
and repeatedly combining the plurality of updated data blocks based on the document author pairs with the same author in the plurality of updated data blocks until a final data block is generated.
In some of these embodiments, the obtaining, based on a predetermined first-order logic rule, a document author pair having the same author in the initial data block includes:
based on the first-order logic rule, obtaining probability values of the same authors among the pairs of literature authors in the initial data block;
based on the probability value, aggregating authors in each literature author pair to obtain an author group and an independent author corresponding to the initial data block;
and merging literature author pairs corresponding to each author in the author group, and acquiring the mapping relation between the author group and the corresponding author.
In some embodiments, the obtaining, based on the first-order logical rule, a probability value for each document author pair in the initial data block having the same author comprises:
generating rule instance data based on the first-order logic rule and predicate instance data in the initial data block;
based on the Markov logic network, calculating and obtaining the probability value of the same author between each document author pair corresponding to the rule instance data.
In some embodiments, the aggregating the authors in each document author pair based on the probability value to obtain the author group and the individual authors corresponding to the initial data block includes:
Establishing an author network by taking authors in each document author pair in the initial data block as nodes and taking probability values of the same authors among each document author pair as edges;
and aggregating the author network with the aim of maximizing the modularity of the author network to obtain the author group and the individual authors.
In some embodiments, the repeatedly merging the plurality of updated data blocks based on the document author pairs having the same author among the plurality of updated data blocks until a final data block is generated includes:
determining a data block to be merged from the plurality of updated data blocks based on a predetermined data block size threshold;
combining the data blocks to be combined in pairs to obtain a combined data block;
merging the document author pairs with the same author in the merged data block to generate an updated data block corresponding to the merged data block;
and merging the updated data blocks serving as the data blocks to be merged in pairs until a final data block is generated.
In some of these embodiments, the generating the corresponding predicate instance data based on the large-scale data and the predefined predicates includes:
Generating each document author pair based on the document information and the corresponding author information in the large-scale data;
combining the document author pairs to generate pairing data;
and generating predicate instance data corresponding to the paired data based on the predefined predicates, and the author information and the literature information corresponding to the paired data.
In some of these embodiments, prior to the generating the corresponding predicate instance data based on the large-scale data and the predefined predicates, the method further comprises:
collecting original document data and preprocessing the original document data to generate large-scale data, wherein the large-scale data comprises a document identifier and an author identifier;
and carrying out structured storage on the large-scale data.
In a second aspect, in this embodiment, there is provided an author disambiguation apparatus for large-scale data, the apparatus comprising:
the first generation module is used for generating corresponding predicate instance data based on the large-scale data and the predefined predicates; wherein the large-scale data includes structurally stored literature information and corresponding author information; the predicate instance data includes a document author pair;
The segmentation module is used for segmenting the predicate instance data into a plurality of initial data blocks;
the merging module is used for merging the document author pairs with the same author in the plurality of initial data blocks based on a predetermined first-order logic rule to generate a final data block;
and the second generation module is used for generating the author identification information corresponding to the large-scale data based on the document author pairs in the final data block.
In a third aspect, in this embodiment, there is provided a readable storage medium having stored thereon a program which, when executed by a processor, implements the steps of the author disambiguation method for large-scale data according to the first aspect.
Compared with the related art, the method for disambiguating the authors of the large-scale data provided in the present embodiment generates corresponding predicate instance data based on the large-scale data and the predefined predicates, and uses the corresponding predicate instance data as basic data for performing subsequent disambiguation of the authors; the predicate instance data is segmented into a plurality of initial data blocks, so that the requirement on hardware resources for data processing is reduced, and the processing efficiency of large-scale data is improved; merging literature author pairs with the same author in a plurality of initial data blocks based on a predetermined first-order logic rule to generate a final data block, and completing author disambiguation based on a modeling reasoning mode, so that the logicality and the interpretability of a disambiguation process are improved; by generating the author identification information corresponding to the large-scale data based on the document author pairs in the final data block, the same authors in the large-scale document are identified, and the problem of low author disambiguation efficiency of large-scale academic document data in the related technology is solved.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below; other features, objects, and advantages of the application will become apparent from the description and the drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a block diagram of a computer hardware architecture of an author disambiguation method for large-scale data according to some embodiments of the present application;
FIG. 2 is a flow chart of an author disambiguation method for large scale data according to some embodiments of the present application;
FIG. 3 is a flow chart of merging initial data blocks to generate a final data block according to some embodiments of the present application;
FIG. 4 is a flow chart of merging pairs of document authors in an initial data block according to some embodiments of the present application;
FIG. 5 is a flow chart of acquiring the same author probability value for an initial data block according to some embodiments of the present application;
FIG. 6 is a flow chart of acquiring an author group and individual authors of an initial data block according to some embodiments of the present application;
FIG. 7 is a flow chart of repeatedly merging updated data blocks to generate a final data block according to some embodiments of the present application;
FIG. 8 is a flow diagram of generating predicate instance data based on massive data according to some embodiments of the present application;
FIG. 9 is a flow chart of generating large-scale data according to some embodiments of the present application;
FIG. 10 is a flow chart of an author disambiguation method for large scale data according to some preferred embodiments of the present application;
fig. 11 is a block diagram of the author disambiguation device of large-scale data according to some embodiments of the present application.
Detailed Description
For a clearer understanding of the objects, technical solutions and advantages of the present application, the present application is described and illustrated below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Unless defined otherwise, technical or scientific terms used herein shall have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these," and the like in this application are not intended to be limiting in number, but rather are singular or plural. The terms "comprising," "including," "having," and any variations thereof, as used in the present application, are intended to cover a non-exclusive inclusion; for example, a process, method, and system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the list of steps or modules (units), but may include other steps or modules (units) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. Reference to "a plurality" in this application means two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. Typically, the character "/" indicates that the associated object is an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this application, merely distinguish similar objects and do not represent a particular ordering of objects.
The author disambiguation method of large-scale data provided by the embodiment of the application can be executed in a server, a computer or similar computing device. When the method is applied to a computer, fig. 1 is a block diagram of a hardware structure of a computer of an author disambiguation method of large-scale data according to some embodiments of the present application. As shown in fig. 1, the computer may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a central processing unit CPU, a microprocessor MCU, a programmable logic device FPGA, or the like. The computer may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and is not intended to limit the configuration of the computer described above. For example, the computer may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs of application software and modules, such as computer programs corresponding to the author disambiguation method of large-scale data in the present embodiment, and the processor 102 executes the computer programs stored in the memory 104 to perform various functional applications and data processing, i.e., to implement the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some embodiments, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
In this embodiment, an author disambiguation method of large-scale data is provided, and fig. 2 is a flowchart of an author disambiguation method of large-scale data according to some embodiments of the present application, as shown in fig. 2, where the flowchart includes the following steps:
step S201, based on the large-scale data and the predefined predicates, generating corresponding predicate instance data; wherein the large-scale data includes structurally stored literature information and corresponding author information; the predicate instance data includes a document author pair.
Large-scale data refers to structurally stored data including document information and corresponding author information; in this embodiment it may be data on the scale of GB or more. Specifically, the large-scale data includes document author pairs describing the correspondence between authors and documents in the large-scale data. Each document author pair is identified by a paId, e.g. paId1, paId2, and includes a mutually corresponding author identifier and document identifier.
Further, the large-scale data may further include author information corresponding to the author identification, and document information corresponding to the document identification. The author information may include the name, mailbox, address, etc. of the author, and the document information may include the collaborators, sponsors, etc. of the document.
Predicates represent relationships between multiple objects or attributes of the objects. In this embodiment, predicates may be defined based on the content of the large-scale data, and the actual needs of author disambiguation.
In a specific embodiment, to determine whether the authors of two documents are the same person, it can first be determined whether the author names are the same, and then whether the collaborators are the same, whether the authors belong to the same organization, whether the documents are sponsored by the same organization, whether the author mailboxes are the same, and so on. Accordingly, predicates may be defined as shown below, including author attribute predicates and document attribute predicates.
In the following, (paId1, paId2) is the combination generated by pairing two document author pairs, and the predicates describe the relationship between the two document author pairs.
Each predicate is specifically described below:
SameAuthor(paId1, paId2): judged according to the identifiers of the authors; 1 if consistent, 0 if inconsistent;
SameName(paId1, paId2): judged according to whether the author names are consistent; 1 if consistent, 0 if inconsistent. Some authors use name shorthand, such as Moses, Jacob and Moses, J; the adopted method is to process the data and judge according to the initials and whether the last-name and first-name parts longer than one character are respectively the same;
SameEmail(paId1, paId2): judged according to the mailbox addresses corresponding to the authors; 1 if consistent, 0 if inconsistent;
SameAddr(paId1, paId2): judged according to the addresses corresponding to the authors; 1 if consistent, 0 if inconsistent;
HasCoAuthor(paId1, paId2): from document p1 corresponding to paId1 and document p2 corresponding to paId2, extract all author names of p1 other than A1 (i.e., the author name corresponding to the selected paId1) and all author names of p2 other than A2 (i.e., the author name corresponding to the selected paId2), generating two lists namelist1 = [name1, name2, …] and namelist2 = [name3, name4, …]; compare whether the two lists have overlapping items; if so, HasCoAuthor is 1, and if not, HasCoAuthor is 0;
HasSameGrant(paId1, paId2): judged according to the sponsoring institutions of the documents; generate the sponsoring-institution list corresponding to each document and compare whether the two lists have overlapping items; if so, HasSameGrant is 1, and if not, HasSameGrant is 0;
SameRwAuthor(paId1, paId2): judged according to whether the names of the corresponding (communication) authors of the documents are the same; 1 if consistent, 0 if inconsistent.
The predicates above may all serve as observation predicates, with SameAuthor(paId1, paId2) being both an observation predicate and the target predicate.
The predicate instance data refers to observation predicate instance data generated by processing large-scale data according to predefined predicates. Specifically, each document author pair in the large-scale data is combined and paired, and corresponding predicate instance data is generated according to the author attribute and the document attribute of the document author pair.
Further, in an embodiment in which predicate instance data is generated by means of machine learning, a part of predicate instance data generated by SameAuthor may be used as an observation predicate instance, and another part may be used as a target predicate instance, so as to facilitate testing of a model.
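As an illustrative sketch only (not part of the claimed method), the evaluation of the observation predicates for one pairing of document author pairs could look as follows in Python; the record field names (name, email, addr, coauthors, grants, corresponding_author) are assumptions for illustration:

```python
def initials_match(n1: str, n2: str) -> bool:
    """Loose name match handling shorthands like 'Moses, Jacob' vs 'Moses, J':
    last names must match, and first-name parts longer than one character
    must match fully, otherwise only the initial is compared."""
    p1 = [p.strip() for p in n1.split(",")]
    p2 = [p.strip() for p in n2.split(",")]
    if len(p1) != 2 or len(p2) != 2:
        return n1 == n2
    last_ok = p1[0].lower() == p2[0].lower()
    f1, f2 = p1[1].lower(), p2[1].lower()
    first_ok = (f1 == f2) if min(len(f1), len(f2)) > 1 else (f1[:1] == f2[:1])
    return last_ok and first_ok

def predicate_instances(a1: dict, a2: dict) -> dict:
    """Return the 0/1 observation predicate values for a pair of
    (hypothetical) author records; SameAuthor itself is judged separately
    from the author identifiers."""
    return {
        "SameName": int(initials_match(a1["name"], a2["name"])),
        "SameEmail": int(a1["email"] == a2["email"] != ""),
        "SameAddr": int(a1["addr"] == a2["addr"] != ""),
        "HasCoAuthor": int(bool(set(a1["coauthors"]) & set(a2["coauthors"]))),
        "HasSameGrant": int(bool(set(a1["grants"]) & set(a2["grants"]))),
        "SameRwAuthor": int(a1["corresponding_author"] == a2["corresponding_author"] != ""),
    }
```

The empty-string guards keep missing mailboxes or addresses from matching each other vacuously.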
Step S202, predicate instance data is segmented into a plurality of initial data blocks.
The predicate instance data is partitioned according to its size, such that each initial data block is on the order of one hundred thousand records or fewer; the size of a data block is generally on the order of MB.
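A minimal sketch of this splitting step, assuming the predicate instance data is held as a list of rows; the block-size cap is a tunable parameter, since the embodiment only fixes the order of magnitude:

```python
def split_into_blocks(pair_rows: list, max_block_size: int = 100_000) -> list:
    """Split predicate instance rows into initial data blocks of at most
    max_block_size rows each (hypothetical helper, for illustration)."""
    return [pair_rows[i:i + max_block_size]
            for i in range(0, len(pair_rows), max_block_size)]
```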
Step S203, merging the document author pairs with the same author in the plurality of initial data blocks based on a predetermined first-order logic rule, and generating a final data block.
The first-order logic rules of the present embodiment are rules for determining whether a plurality of documents have the same author. For example, in the specific embodiment described in step S201, to determine whether multiple documents have the same author, it may first be determined whether the author names are the same, and then whether the addresses are consistent, the mailboxes are consistent, or there are common collaborators, the same sponsor, etc. Accordingly, SameName may be combined with the other author attribute predicates and document attribute predicates to design the following rules:
Rule 1: if the names of two authors are the same and their corresponding addresses are the same, the two are presumed to be the same author;
Rule 2: if the names of two authors are the same and their corresponding mailboxes are the same, the two are presumed to be the same author;
Rule 3: if the names are the same and the collaborators of their respectively corresponding documents overlap, the two are presumed to be the same author;
Rule 4: if the names are the same and their corresponding documents are sponsored by the same sponsor, the two are presumed to be the same author;
Rule 5: if the names are the same and the corresponding (communication) authors of their respective documents have the same name, the two are presumed to be the same author;
Rule 6: if the documents corresponding to the two authors have a common collaborator, a common sponsor and the same corresponding author, the two are presumed to be the same author.
In addition, the rules defined according to transitivity are:
Rule 7: if P1 and P2 are the same author and P2 and P3 are the same author, then P1 and P3 are presumed to be the same author.
Other rules also include:
Rule 8: if the names of two authors are different, the two are presumed not to be the same author;
Rule 9: the order of P1 and P2 has no effect on the determination of whether they are the same author.
The corresponding rules are expressed as follows:
1.SameName(P1,P2)&SameAddr(P1,P2)&(P1!=P2)->SameAuthor(P1,P2);
2.SameName(P1,P2)&SameEmail(P1,P2)&(P1!=P2)->SameAuthor(P1,P2);
3.SameName(P1,P2)&HasCoAuthor(P1,P2)&(P1!=P2)->SameAuthor(P1,P2);
4.SameName(P1,P2)&HasSameGrant(P1,P2)&(P1!=P2)->SameAuthor(P1,P2);
5.SameName(P1,P2)&SameRwAuthor(P1,P2)&(P1!=P2)->SameAuthor(P1,P2);
6.HasCoAuthor(P1,P2)&HasSameGrant(P1,P2)&SameRwAuthor(P1,P2)&(P1!=P2)->SameAuthor(P1,P2);
7.SameAuthor(P1,P2)&SameAuthor(P2,P3)&(P1!=P3)->SameAuthor(P1,P3);
8.!SameName(P1,P2)->!SameAuthor(P1,P2);
9.SameAuthor(P1,P2) = SameAuthor(P2,P1).
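For illustration, rules 1-6 can be sketched as hard implications over the 0/1 predicate values. In the actual method the rules are soft, weighted formulas in a Markov logic network (and rule 8 is resolved by those weights), so this hard-rule version only shows the rule bodies:

```python
# Bodies of rules 1-6: each tuple lists the predicates that must all be 1
# for the rule to infer SameAuthor(P1, P2).
RULE_BODIES = [
    ("SameName", "SameAddr"),                        # rule 1
    ("SameName", "SameEmail"),                       # rule 2
    ("SameName", "HasCoAuthor"),                     # rule 3
    ("SameName", "HasSameGrant"),                    # rule 4
    ("SameName", "SameRwAuthor"),                    # rule 5
    ("HasCoAuthor", "HasSameGrant", "SameRwAuthor"), # rule 6
]

def infer_same_author(preds: dict) -> bool:
    """True if any rule body 1-6 is fully satisfied by the 0/1 predicate
    values of a pairing (illustrative hard-logic stand-in only)."""
    return any(all(preds[p] for p in body) for body in RULE_BODIES)
```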
In the present application, inference based on the first-order logic rules can be performed on the segmented data blocks through a sub-flow, and document author pairs having the same author are merged according to the inference result, further reducing the size of the data blocks. The processed data blocks are then merged again and fed back into the sub-flow for processing; this is repeated until only one final data block remains, and the inference result of the sub-flow processing is output.
Step S204, based on the document author pairs in the final data block, author identification information corresponding to the large-scale data is generated.
The inference result obtained by sub-flow processing of the final data block comprises the merged document author pairs, in which one author identifier may correspond to one or more document identifiers having the same author.
Further, a mapping dictionary including the author group identifier determined to be the same author and the identifiers of the corresponding plurality of authors may also be output.
Unique identifiers are assigned to all authors according to the final data block and the author identifiers in the mapping record table, and are updated into the large-scale data.
Through the steps S201-S204, corresponding predicate instance data is generated based on the large-scale data and the predefined predicates and is used as basic data for performing subsequent author disambiguation; the predicate instance data is segmented into a plurality of initial data blocks, so that the requirement on hardware resources for data processing is reduced, and the processing efficiency of large-scale data is improved; merging literature author pairs with the same author in a plurality of initial data blocks based on a predetermined first-order logic rule to generate a final data block, and completing author disambiguation based on a modeling reasoning mode, so that the logicality and the interpretability of a disambiguation process are improved; by generating the author identification information corresponding to the large-scale data based on the document author pairs in the final data block, the same authors in the large-scale document are identified, and the problem of low author disambiguation efficiency of large-scale academic document data in the related technology is solved.
In some embodiments, fig. 3 is a flow chart of merging initial data blocks to generate final data blocks according to some embodiments of the present application, as shown in fig. 3, the flow comprising the steps of:
step S301, a document author pair with the same author in an initial data block is acquired based on a predetermined first-order logic rule.
In the embodiment of step S201, according to the first-order logic rules described in step S203, each initial data block is checked for document author pairs having the same author, i.e., cases where SameAuthor(paId1, paId2) = 1.
Step S302, merging the document author pairs to obtain an updated data block corresponding to the initial data block.
For document author pairs with SameAuthor(paId1, paId2) = 1, paId1 and paId2 are merged, so that the same author corresponds to the multiple documents in paId1 and paId2. An updated data block is generated from the updated document author pairs. The size of the updated data block is typically smaller than that of the initial data block.
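The pair-merging step can be sketched with a union-find structure (an illustrative choice, not prescribed by the embodiment): paId identifiers judged to denote the same author are unioned, and each resulting representative collects its document identifiers:

```python
class UnionFind:
    """Minimal disjoint-set structure over arbitrary hashable ids."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def merge_pairs(pairs: dict, same_author_edges: list) -> dict:
    """pairs: {paId: (author_id, doc_id)} (hypothetical layout);
    same_author_edges: [(paId1, paId2), ...] with SameAuthor = 1.
    Returns {representative paId: set of document ids} for the updated block."""
    uf = UnionFind()
    for a, b in same_author_edges:
        uf.union(a, b)
    merged = {}
    for pa_id, (_author, doc) in pairs.items():
        merged.setdefault(uf.find(pa_id), set()).add(doc)
    return merged
```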
Step S303, repeatedly combining the plurality of updated data blocks based on the document author pairs with the same author in the plurality of updated data blocks until a final data block is generated.
The processed updated data blocks are merged again and fed into the sub-flow for processing; this is repeated until only one final data block remains, and the inference result of the sub-flow processing is output.
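A compact sketch of this repeat-merge loop, with process_block standing in for the per-block sub-flow (a hypothetical callable, assumed to take and return a data block):

```python
def merge_until_one(blocks: list, process_block) -> list:
    """Pairwise-combine blocks and re-run the per-block inference until a
    single final block remains (illustrative control flow only)."""
    while len(blocks) > 1:
        combined = [blocks[i] + blocks[i + 1] if i + 1 < len(blocks) else blocks[i]
                    for i in range(0, len(blocks), 2)]
        blocks = [process_block(b) for b in combined]
    return blocks[0]
```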
Through the steps S301-S303, acquiring a document author pair with the same author in the initial data block based on a predetermined first-order logic rule, and performing author disambiguation on each initial data block; combining the document author pairs to obtain updated data blocks corresponding to the initial data blocks, and reducing the scale of each initial data block; the multiple update data blocks are repeatedly combined based on the document author pairs with the same author in the multiple update data blocks until a final data block is generated, and author disambiguation of large-scale data is performed in a repeated combination mode without consuming excessive hardware resources, so that author disambiguation efficiency of document data is improved.
In some embodiments, FIG. 4 is a flow chart of merging pairs of document authors in an initial data block according to some embodiments of the present application, as shown in FIG. 4, the flow comprising the steps of:
step S401, based on the first order logic rule, obtaining probability values of the same authors among the pairs of literature authors in the initial data block.
The document author pairs in the initial data block may be grouped, for example based on the ordering of author names, so that the author names within a group are the same or similar. According to the predefined first-order logic rules, the probability value that two document author pairs combined pairwise within a group share the same author is obtained. The probability values may be obtained by constructing a Markov logic network.
The Markov logic network holds that the fewer first-order logic rules a set of entity relationships violates, the more likely that set of entity relationships is to hold; conversely, the more first-order logic rules it violates, the less likely it is to hold. In this way, the probability that two document author pairs in the initial data block share the same author can be obtained from the number and weights of the first-order logic rules satisfied or violated between them.
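The weighted-rule scoring described above can be sketched as follows; the two rules, their weights, and the two candidate worlds are illustrative assumptions of this sketch, not part of the embodiment:

```python
import math

def world_score(rule_weights, satisfied_counts):
    """Unnormalized Markov-logic score exp(sum_j w_j * n_j), where n_j counts
    the groundings of rule j that are satisfied in this world."""
    return math.exp(sum(w * n for w, n in zip(rule_weights, satisfied_counts)))

# Two candidate "worlds" for one pair of document authors, scored against two
# hypothetical rules (shared coauthor -> same author, weight 1.5;
# shared affiliation -> same author, weight 0.8).
weights = [1.5, 0.8]
world_a = [1, 1]   # satisfies both rules
world_b = [0, 1]   # violates the first rule
score_a = world_score(weights, world_a)
score_b = world_score(weights, world_b)
prob_same = score_a / (score_a + score_b)  # normalized over the two worlds
```

The world that violates fewer weighted rules receives the higher probability, which is the intuition the embodiment relies on.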
Step S402, based on the probability value, the authors in each literature author pair are aggregated to obtain an author group and an individual author corresponding to the initial data block.
According to the probability values of two document author pairs sharing the same author, the authors in each document author pair in the initial data block are aggregated. The aggregation method is not limited in this embodiment; it may be a clustering method based on distance or density partitioning, or a graph-based clustering method such as the Louvain algorithm. The aggregation result yields the author groups and individual authors corresponding to the initial data block, where all authors within an author group are considered to be the same author.
Step S403, merging the document author pairs corresponding to each author in the author group, and obtaining the mapping relation between the author group and the corresponding author.
An author group identifier c_id may be set, and the correspondence between the author group identifier c_id and the document author pair identifiers paId of the authors in the group may be recorded and stored as a mapping dictionary author_map, with author_map[paId] = c_id.
Unique identifiers are set for the individual authors in the initial data block, and the corresponding paIds are updated into the document structured database. The document author pairs corresponding to the authors in each author group are merged: according to the mapping dictionary, all authors in an author group are assigned the same unique identifier, which is updated into the document structured database. The size of the merged data block is thereby greatly reduced.
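The mapping dictionary described above can be sketched as follows; the `c_` prefix for group identifiers is an assumption of this sketch:

```python
def merge_author_group(group_paids, group_index, author_map):
    """Record that every paId in group_paids belongs to one author group.
    author_map is the mapping dictionary described above: author_map[paId] = c_id."""
    c_id = f"c_{group_index}"
    for paid in group_paids:
        author_map[paid] = c_id
    return c_id

author_map = {}
c_id = merge_author_group(["P1A1", "P2A3", "P5A2"], 0, author_map)
# All three document author pairs now resolve to the same unique identifier.
```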
Through the steps S401-S403, the probability value of the same author among each literature author pair in the initial data block is obtained based on a first-order logic rule, and logic reasoning is carried out on predicate instance data in the initial data block; the authors in each literature author pair are aggregated based on the probability value to obtain an author group and an independent author corresponding to the initial data block, and author disambiguation is carried out on the initial data block; by merging the literature author pairs corresponding to each author in the author group and obtaining the mapping relation between the author group and the corresponding author, the size of the data block is reduced according to the processing result of author disambiguation, and the hardware resources required by subsequent data block merging are saved.
In some embodiments, fig. 5 is a flowchart of acquiring the same author probability value for an initial data block according to some embodiments of the present application, as shown in fig. 5, the flowchart including the steps of:
step S501, rule instance data is generated based on the first order logic rule and predicate instance data in the initial data block.
The first-order logic rules are instantiated according to the predicate instance data in the initial data block. That is, for a given first-order logic rule, the data is queried for entities and relationships that simultaneously match the rule's body and head, and each such match forms an instantiated (grounded) rule.
Step S502, calculating and obtaining probability values of the same authors among the literature author pairs corresponding to the rule instance data based on the Markov logic network.
For each instantiated rule, Łukasiewicz logic can be used to translate the disjunctive clause corresponding to the rule into a continuous expression:

$$d_j(y) = \min\Bigl(1,\ \sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i)\Bigr)$$

wherein $d_j(\cdot)$ is the function corresponding to the instantiated first-order logic rule $j$; $I_j^+$ and $I_j^-$ respectively denote the non-negated variables and the negated variables of rule $r_j$; and $y_i$ denotes the value of a predicate instance datum.
Based on the above rules, the probability distribution function of the Markov logic network is:

$$P(y) = \frac{1}{Z} \exp\Bigl(\sum_j w_j\, d_j(y)\Bigr)$$

wherein $Z$ is a normalization constant and $w_j$ is the weight corresponding to rule $r_j$.
The weights can be learned by maximum likelihood estimation; the weight gradient for rule $r_j$ is:

$$\frac{\partial \log P_w(x)}{\partial w_j} = n_j(x) - \sum_{x'} P_w(x')\, n_j(x')$$

where $x$ is the observed data, $x'$ ranges over all possible data, and $n_j(\cdot)$ counts the true groundings of rule $r_j$. The gradient can be understood as the difference between the number of true groundings of rule $r_j$ in the current world $x$ and the mathematical expectation of that number over all possible worlds. The log-likelihood function is optimized using a gradient method.
And finally outputting the probability values of the same authors among all the pairs of literature authors in the initial data block.
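The Łukasiewicz soft truth of a grounded clause can be sketched as follows; the example rule and predicate values are illustrative assumptions of this sketch:

```python
def rule_soft_truth(y, pos_idx, neg_idx):
    """Łukasiewicz soft truth of a grounded clause:
    min(1, sum of non-negated values + sum of (1 - value) for negated ones)."""
    return min(1.0, sum(y[i] for i in pos_idx) + sum(1.0 - y[i] for i in neg_idx))

# Hypothetical grounded rule: coAuthor AND sameName -> sameAuthor, written as
# the clause: NOT coAuthor OR NOT sameName OR sameAuthor.
# y holds the current values of the three predicate instances.
y = [1.0, 1.0, 0.3]   # coAuthor, sameName, sameAuthor
d = rule_soft_truth(y, pos_idx=[2], neg_idx=[0, 1])
# d = min(1, 0.3 + 0 + 0) = 0.3: the rule is largely violated in this world.
```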
Through the steps S501-S502, rule instance data are generated based on the first-order logic rules and predicate instance data in the initial data block, and instance data are provided for logic reasoning; the probability value of the same author between each document author pair corresponding to the rule instance data is calculated and obtained through the Markov logic network, and the probability value is used as basic data for author disambiguation through clustering.
In some embodiments, FIG. 6 is a flow chart of acquiring an author group and individual authors of an initial data block according to some embodiments of the present application, as shown in FIG. 6, the flow comprising the steps of:
in step S601, an author network is established with authors in each document author pair in the initial data block as nodes and probability values of the same authors in each document author pair as edges.
The present embodiment uses the Louvain algorithm for author aggregation. The Louvain algorithm is a community discovery algorithm based on multi-level modularity optimization. The author corresponding to each document author pair paId may be regarded as a node, and the probability values output in step S502 may be regarded as the edges between nodes. A "community" may then be regarded as an "author group", the authors within each author group being most likely the same author.
For a weighted graph, modularity is defined as:

$$Q = \frac{1}{2m} \sum_{ij} \Bigl[A_{ij} - \frac{k_i k_j}{2m}\Bigr]\, \delta(c_i, c_j)$$

wherein $A_{ij}$ represents the edge weight between nodes $i$ and $j$; $k_i$ and $k_j$ are respectively the sums of the weights of the edges attached to nodes $i$ and $j$; $m$ is the sum of all edge weights in the graph; $c_i$ and $c_j$ are the communities of nodes $i$ and $j$; and $\delta$ is the Kronecker delta function ($\delta(x, y) = 1$ if $x = y$, otherwise $0$).
Step S602, aggregating the author network with the aim of maximizing the modularity of the author network to obtain an author group and individual authors.
The Louvain algorithm proceeds in two stages: first, all nodes are traversed and each node is moved into the neighboring community that maximizes the modularity gain; second, after convergence, each community is merged into a super-node and the network is reconstructed. The two stages are repeated until the algorithm stabilizes and the modularity of the whole graph is maximized. The modularity gain of moving node $i$ into community $B$ is computed as:

$$\Delta Q = \frac{k_{i,in}}{2m} - \frac{\Sigma_{tot}\, k_i}{2m^2}$$

wherein $k_{i,in}$ is the sum of the weights of the edges between node $i$ (or community $A$) and the nodes of the community $B$ into which it is moved; $\Sigma_{tot}$ is the sum of the weights of the edges incident to all nodes of community $B$.
Authors are aggregated according to the above algorithm; each discovered community is an "author group".
The aggregated author data has two types, namely an author group comprising two or more author nodes and a single author node, which are respectively abbreviated as an author group and a single author, and the following processes are respectively described in detail:
(1) The author nodes in the author group are all determined to be the same author, a new identifier c_id is set, a mapping relation is recorded, and c_id and paId are one-to-many relations and can be saved as dictionary author_map, author_map [ paId ] =c_id.
(2) Between two individual authors, the authors are determined to be non-identical, i.e., sameAuthor(paId1, paId2) = 0; between an individual author and an author group, the authors are determined to be non-identical, i.e., sameAuthor(paId, c_id) = 0; and between two author groups, the authors are determined to be non-identical, i.e., sameAuthor(c_id1, c_id2) = 0.
And then outputting the aggregated author data and the mapping relation, wherein the aggregated author data identifier is paId or c_id, and the size of the aggregated data block is greatly reduced compared with that of the initial data block.
Through the steps S601-S602, an author network is established by taking authors in each document author pair in an initial data block as nodes and taking probability values of the same authors among each document author pair as edges, and author clustering is carried out in a graph mode; by aggregating the author network with the aim of maximizing the modularity of the author network, an author group and individual authors are obtained, and the efficiency and accuracy of author disambiguation are improved.
In some embodiments, fig. 7 is a flowchart of generating a final data block by repeatedly merging updated data blocks according to some embodiments of the present application, as shown in fig. 7, the flowchart including the steps of:
in step S701, a data block to be merged is determined from a plurality of updated data blocks based on a predetermined data block size threshold.
In some embodiments, a data block size threshold may be preset to define the size of the data blocks to be consolidated. For example, in a particular embodiment, update data blocks greater than 10MB are not selected as data blocks to be merged when first merged.
Step S702, merging the data blocks to be merged in pairs to obtain a merged data block.
The updated data blocks below 10MB are merged in pairs to generate merged data blocks and a mapping dictionary. The pairwise merging procedure merges the document authors of data block 1 and data block 2 into one merged data block, where the authors in data block 1 and data block 2 fall into two types: author groups and individual authors. The merging logic for author groups during pairwise merging is as follows:
If a node paId_a is an individual author in data block 1 and belongs to author group c_id_b in data block 2, then paId_a in data block 1 may be merged into the author group c_id_b, and paId_a may be mapped to c_id_b.
If the same paId exists in author group 1 of data block 1 and author group 2 of data block 2, the two author groups are the same author group; c_id1 and c_id2 are merged into a single author group identifier c_id, and the mapping dictionary is updated.
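The two merging rules above can be sketched as a merge of two per-block mapping dictionaries; the alias-chasing scheme used here is one possible implementation, not the embodiment's prescribed one:

```python
def merge_block_maps(map1, map2):
    """Merge two per-block author_map dictionaries (paId -> c_id).
    If the same paId appears in a group of each block, the two groups are the
    same author group and their c_ids are unified; a paId known only in one
    block simply keeps (or adopts) the group it has there."""
    alias = {}                        # c_id -> canonical c_id after unification

    def canon(c):
        while c in alias:
            c = alias[c]
        return c

    merged = dict(map1)
    for paid, c2 in map2.items():
        c2 = canon(c2)
        if paid in merged:
            c1 = canon(merged[paid])
            if c1 != c2:
                alias[c2] = c1        # same paId in both groups: unify the groups
        else:
            merged[paid] = c2
    return {paid: canon(c) for paid, c in merged.items()}

merged_map = merge_block_maps({"P1A1": "c_a", "P2A1": "c_a"},
                              {"P2A1": "c_b", "P3A2": "c_b"})
# P2A1 appears in both groups, so c_a and c_b are unified:
# all three paIds end up in one author group.
```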
Step S703, merging the document author pairs with the same author in the merged data block to generate an updated data block corresponding to the merged data block.
Acquiring a document author pair with the same author in the combined data block according to the flow of the steps S301-S302; and merging the document author pairs to obtain updated data blocks corresponding to the merged data blocks.
And step S704, the update data blocks are used as data blocks to be combined in pairs, and the data blocks are combined until a final data block is generated.
The updated data blocks are taken as the data blocks to be merged, together with the data blocks whose size exceeded the threshold at the first merging; the merging is then repeated, i.e. steps S702-S703 are repeated, until a final data block and the mapping dictionary between c_id and paId are generated.
Through the steps S701 to S704, determining the data blocks to be combined from the plurality of updated data blocks based on a predetermined data block size threshold, so as to reduce the requirement of data block combination on the hardware of the computing device; the data blocks to be combined are combined in pairs to obtain combined data blocks, and the number of the data blocks is gradually reduced; the method comprises the steps of merging literature author pairs with the same author in the merged data block to generate an updated data block corresponding to the merged data block, and reducing the scale of the merged data block; by updating the repeated merging of the data blocks until the final data block is generated, the accuracy and efficiency of author disambiguation of large-scale data is improved without requiring excessive hardware resources.
In some embodiments, FIG. 8 is a flow chart of generating predicate instance data based on large-scale data of some embodiments of the present application, as shown in FIG. 8, the flow comprising the steps of:
step S801 generates each document author pair based on document information and corresponding author information in the large-scale data.
Each document author pair has a unique identifier, the paId. For example, if document P1 has authors A1 and A2, and document P2 has authors A3 and A4, the document author pairs P1A1, P1A2, P2A3 and P2A4 are generated; any PxAa and PyAb (generally with x ≠ y) can then be paired to generate a set of data.
Step S802, combining the document author pairs to generate pairing data.
Specifically, document author pairs may be grouped based on the ordering of author names, with the author names in the same group being the same or similar. The document author pairs within the same group are paired pairwise to generate the paired data. Further, the data in the database can be processed according to the predicate definitions: the paId pairs that need to be judged are extracted, and the different observed predicate instance data are generated according to the different predicates.
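The grouping and pairwise pairing described above can be sketched as follows; the lowercase name normalization and the example records are assumptions of this sketch:

```python
from collections import defaultdict
from itertools import combinations

def generate_pairing_data(doc_authors):
    """doc_authors: list of (paId, author_name) records. Group document author
    pairs by a normalized author name and pair them up within each group."""
    groups = defaultdict(list)
    for paid, name in doc_authors:
        groups[name.strip().lower()].append(paid)   # simple normalization assumption
    pairs = []
    for paids in groups.values():
        pairs.extend(combinations(sorted(paids), 2))
    return pairs

pairs = generate_pairing_data([
    ("P1A1", "Wei Zhang"), ("P2A3", "wei zhang"), ("P3A2", "Li Na"),
])
# Only the two "Wei Zhang" records are paired; "Li Na" has no candidate partner.
```

Pairing only within name groups keeps the number of candidate pairs far below the quadratic worst case over the whole data set.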
Step S803, based on the predicate defined in advance, and the author information and literature information corresponding to the paired data, predicate instance data corresponding to the paired data is generated.
Through the steps S801-S803, each document author pair is generated based on the document information and the corresponding author information in the large-scale data, and matching information of the documents and the authors is provided; the paired data are generated by combining the author pairs of each document, and the same author judgment is carried out on all the author pairs of the document, so that omission is avoided; based on the pre-defined predicates, and the author information and literature information corresponding to the paired data, predicate instance data corresponding to the paired data are generated and serve as basic data for the same author judgment of the initial data block, and accuracy of author disambiguation is improved.
In some embodiments, fig. 9 is a flowchart of generating large-scale data according to some embodiments of the present application, as shown in fig. 9, the flowchart including the steps of:
step S901, collecting original document data and preprocessing, generating large-scale data, where the large-scale data includes document identification and author identification.
The structured data of the documents is collected and preprocessed. Preprocessing mainly includes data cleaning and identifier setting: a unique document identifier (uid) and a unique author identifier (author_id) are set, and a specific author of a specific document is identified by setting the paId, namely uid + author_id.
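A minimal sketch of the identifier convention above; the `#` separator is an assumption of this sketch (the embodiment simply concatenates uid and author_id):

```python
def make_paid(uid, author_id):
    """Document author pair identifier: paId built from uid + author_id.
    A separator keeps the concatenation unambiguous in this sketch."""
    return f"{uid}#{author_id}"

paid = make_paid("doc0001", "auth042")
```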
Step S902, performing structured storage on the large-scale data.
This mainly comprises building tables according to the data types of each feature column of the document data set and providing a search and query interface, so as to facilitate subsequent data analysis, development and other work.
Through the steps S901-S902, raw document data are collected and preprocessed to generate large-scale data, and document information and author information are identified through identifiers; by carrying out structured storage on large-scale data, the efficiency of data query and analysis is improved.
The present embodiment is described and illustrated below by way of preferred embodiments. FIG. 10 is a flow chart of an author disambiguation method for large scale data according to some preferred embodiments of the present application, as shown in FIG. 10, the flow comprising the steps of:
step S1001, collecting document structured data and preprocessing the data to generate large-scale data, wherein the large-scale data comprises a document identifier and an author identifier;
step S1002, importing the preprocessed structured data into a distributed structured database for storage;
step S1003, generating each document author pair based on the document information and the corresponding author information in the large-scale data;
Step S1004, combining the document author pairs to generate pairing data;
step S1005, generating predicate instance data corresponding to the paired data based on the predefined predicates, and the author information and literature information corresponding to the paired data;
step S1006, cutting predicate instance data into a plurality of initial data blocks;
step S1007, for each initial data block, rule instance data is generated based on the first order logic rule and predicate instance data in the initial data block;
step S1008, calculating and obtaining probability values of the same authors among the literature author pairs corresponding to the rule instance data in each initial data block based on a Markov logic network;
step S1009, establishing an author network by taking authors in each document author pair in the initial data block as nodes and taking probability values of the same authors among each document author pair as edges;
step S1010, aggregating the author network with the aim of maximizing the modularity of the author network to obtain an author group and individual authors of the initial data block;
step S1011, merging document author pairs corresponding to each author in the author group to obtain an updated data block corresponding to the initial data block, and obtaining a mapping relation between the author group and the corresponding author;
The steps S1007 to S1011 can be regarded as a sub-process.
Step S1012, executing steps S1007-S1011 on each initial data block to obtain each updated data block corresponding to each initial data block, and obtaining the mapping relation between the author group and the corresponding author in each updated data block;
step S1013, selecting data blocks to be combined with the size smaller than or equal to the data block size threshold from the updated data blocks;
step S1014, merging the data blocks to be merged in pairs to obtain a merged data block;
step S1015, taking each combined data block as an initial data block, and executing the sub-flows of the steps S1007-S1011 to obtain an updated data block corresponding to each combined data block;
step S1016, combining the updated data block and the data blocks to be combined with the size larger than the data block size threshold in step S1013, and repeating steps S1014-S1015 until a final data block is generated;
step S1017, based on the document author pair in the final data block, author identification information corresponding to the large-scale data is generated.
Through the steps S1001-S1017, the original document data is marked through preprocessing; generating predicate instance data as basic data for subsequent author disambiguation; the data block segmentation reduces the requirement on hardware resources for data processing and improves the processing efficiency of large-scale data; merging literature author pairs with the same author in an initial data block based on a predetermined first-order logic rule, and completing author disambiguation based on a modeling reasoning mode, so that the logicality and the interpretability of a disambiguation process are improved; the initial data blocks are reduced in scale through aggregation and combination of the data blocks, so that resources required by data processing are reduced, and accuracy of author disambiguation is improved; under the condition that excessive hardware resources are not required to be consumed, the author disambiguation of the large-scale data is carried out in a repeated merging mode, and the author disambiguation efficiency of the document data is improved.
In some embodiments, the present application also provides an author disambiguation device for large-scale data. The author disambiguation device for large-scale data is used to implement the above embodiments and preferred embodiments, and is not described in detail. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. In some embodiments, fig. 11 is a block diagram of the author disambiguation device of the large-scale data of the present embodiment, as shown in fig. 11, the device comprising:
a first generation module 1101, configured to generate corresponding predicate instance data based on the large-scale data and a predefined predicate; wherein the large-scale data includes structurally stored literature information and corresponding author information; the predicate instance data includes a document author pair;
a splitting module 1102, configured to split predicate instance data into a plurality of initial data blocks;
a merging module 1103, configured to merge, based on a predetermined first-order logic rule, document author pairs having the same author in a plurality of initial data blocks, and generate a final data block;
a second generation module 1104 is configured to generate author identification information corresponding to the large-scale data based on the document author pairs in the final data block.
The author disambiguation device of large-scale data of the present embodiment generates corresponding predicate instance data based on the large-scale data and a predefined predicate through the first generation module 1101, as basic data for performing author disambiguation subsequently; the predicate instance data is segmented into a plurality of initial data blocks through the segmentation module 1102, so that the requirement on hardware resources for data processing is reduced, and the processing efficiency of large-scale data is improved; merging literature author pairs with the same author in a plurality of initial data blocks based on a predetermined first-order logic rule through a merging module 1103 to generate a final data block, and completing author disambiguation based on a modeling reasoning mode, so that the logicality and the interpretability of a disambiguation process are improved; the second generation module 1104 generates the author identification information corresponding to the large-scale data based on the document author pairs in the final data block, so that the same authors in the large-scale document are identified, and the problem of low author disambiguation efficiency of the large-scale academic document data in the related technology is solved.
In some embodiments, the merging module includes a first obtaining submodule, a first merging submodule and a second merging submodule, where the first obtaining submodule is used to obtain a document author pair with the same author in the initial data block based on a predetermined first-order logic rule; the first merging submodule is used for merging the document author pairs to obtain updated data blocks corresponding to the initial data blocks; the second merging sub-module is used for repeatedly merging the plurality of updated data blocks based on the document author pairs with the same author in the plurality of updated data blocks until a final data block is generated.
The author disambiguation device of large-scale data in this embodiment obtains, through a first obtaining submodule, a document author pair having the same author in an initial data block based on a predetermined first-order logic rule, and performs author disambiguation on each initial data block; merging the document author pairs through a first merging sub-module to obtain updated data blocks corresponding to the initial data blocks, and reducing the scale of each initial data block; and repeatedly merging the plurality of updated data blocks until a final data block is generated based on the document author pairs with the same author in the plurality of updated data blocks by the second merging sub-module, and carrying out author disambiguation of large-scale data in a repeated merging mode under the condition of not consuming excessive hardware resources, thereby improving the author disambiguation efficiency of document data.
In some embodiments, the first obtaining submodule includes a first obtaining unit, an aggregation unit and a merging unit, where the first obtaining unit is configured to obtain a probability value of having the same author between each document author pair in the initial data block based on a first-order logic rule; the aggregation unit is used for aggregating authors in each literature author pair based on the probability value to obtain an author group and an independent author corresponding to the initial data block; the merging unit is used for merging document author pairs corresponding to each author in the author group and obtaining the mapping relation between the author group and the corresponding author.
According to the author disambiguation device of the large-scale data, a first acquisition unit is used for acquiring probability values of the same authors among pairs of literature authors in an initial data block based on a first-order logic rule, and logic reasoning is carried out on predicate instance data in the initial data block; the authors in each literature author pair are aggregated through an aggregation unit based on the probability value to obtain an author group and an independent author corresponding to the initial data block, and author disambiguation is carried out on the initial data block; and merging the literature author pairs corresponding to each author in the author group through a merging unit, acquiring the mapping relation between the author group and the corresponding author, reducing the scale of the data block according to the processing result of author disambiguation, and saving the hardware resources required by subsequent data block merging.
In some embodiments, the first obtaining unit includes a generating subunit and a calculating subunit, the generating subunit is configured to generate rule instance data based on the first-order logic rule and predicate instance data in the initial data block; the calculation subunit is used for calculating and obtaining probability values of the same authors among the literature author pairs corresponding to the rule instance data based on the Markov logic network.
The author disambiguation device of the large-scale data of the embodiment generates rule instance data by a generation subunit based on the first-order logic rule and predicate instance data in the initial data block, and provides instance data for logic reasoning; the probability value of the same author between each document author pair corresponding to the rule instance data is calculated and obtained through a calculation subunit based on a Markov logic network and is used as basic data for author disambiguation through clustering.
In some embodiments, the aggregation unit includes a building subunit and an aggregation subunit, where the building subunit is configured to build an author network with authors in each document author pair in the initial data block as nodes and with probability values of the same authors between each document author pair as edges; the aggregation subunit is used for aggregating the author network with the aim of maximizing the modularity of the author network to obtain an author group and individual authors.
The author disambiguation device of the large-scale data in this embodiment establishes an author network by establishing a subunit with authors in each document author pair in an initial data block as nodes and with probability values of the same authors between each document author pair as edges, and performs author clustering in a graph manner; and the aggregation subunit aggregates the author network by taking the modularity maximization of the author network as a target to obtain an author group and individual authors, so that the efficiency and accuracy of author disambiguation are improved.
In addition, in combination with the author disambiguation method of large-scale data provided in the above embodiment, a readable storage medium may also be provided for implementation in the present embodiment. The readable storage medium has a program stored thereon; the program, when executed by a processor, implements the author disambiguation method for any of the large-scale data in the above embodiments.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and are not described in detail in this embodiment.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present application, are within the scope of the present application in light of the embodiments provided herein.
It is evident that the drawings are only examples or embodiments of the present application, from which the present application can also be adapted to other similar situations by a person skilled in the art without the inventive effort. In addition, it should be appreciated that while the development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as an admission of insufficient detail.
The term "embodiment" in this application means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive. It will be clear or implicitly understood by those of ordinary skill in the art that the embodiments described in this application can be combined with other embodiments without conflict.
The above embodiments express only a few implementations of the present application; their description is relatively specific and detailed, but should not therefore be construed as limiting the scope of the patent. It should be noted that a person skilled in the art could make various modifications and improvements without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A method of author disambiguation of large-scale data, the method comprising:
generating corresponding predicate instance data based on the large-scale data and a predefined predicate; wherein the large-scale data includes structurally stored document information and corresponding author information, and the predicate instance data includes document author pairs;
splitting the predicate instance data into a plurality of initial data blocks;
merging document author pairs having the same author in the plurality of initial data blocks based on a predetermined first-order logic rule to generate a final data block;
and generating author identification information corresponding to the large-scale data based on the document author pairs in the final data block.
2. The method of claim 1, wherein merging pairs of document authors having the same author in the plurality of initial data blocks based on a predetermined first order logic rule, generating a final data block comprises:
acquiring document author pairs having the same author in the initial data blocks based on a predetermined first-order logic rule;
merging the document author pairs to obtain updated data blocks corresponding to the initial data blocks;
and repeatedly combining the plurality of updated data blocks based on the document author pairs with the same author in the plurality of updated data blocks until a final data block is generated.
3. The method of claim 2, wherein the obtaining a document author pair having the same author in the initial data block based on a predetermined first order logical rule comprises:
obtaining, based on the first-order logic rule, probability values that the document author pairs in the initial data block have the same author;
aggregating the authors in each document author pair based on the probability values to obtain author groups and individual authors corresponding to the initial data block;
and merging the document author pairs corresponding to each author in the author groups, and acquiring the mapping relation between the author groups and the corresponding authors.
4. The method of claim 3, wherein the obtaining a probability value for each document author pair in the initial data block having the same author based on the first order logical rule comprises:
generating rule instance data based on the first-order logic rule and the predicate instance data in the initial data block;
and calculating, based on a Markov logic network, the probability value that each document author pair corresponding to the rule instance data has the same author.
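As a rough illustration of the Markov-logic-network step in claim 4: in an MLN, each satisfied first-order rule contributes its weight to a log-linear score, which is squashed into a probability. The sketch below is a simplified stand-in that scores each pair independently rather than performing full joint MLN inference; the rule names and weights are illustrative assumptions, not the patent's actual rule set:

```python
import math

def same_author_probability(rule_hits, weights):
    """Simplified stand-in for Markov-logic-network inference:
    score a candidate document author pair by the weighted sum of the
    first-order rules it satisfies, then map the score to a probability
    with a logistic function. A full MLN would reason jointly over all
    rule groundings; rule names and weights here are illustrative."""
    score = sum(weights[r] for r in rule_hits if r in weights)
    return 1.0 / (1.0 + math.exp(-score))

# Illustrative rule weights (assumed, not from the patent).
weights = {"shared_coauthor": 2.0, "shared_affiliation": 1.5, "shared_venue": 0.5}
p = same_author_probability(["shared_coauthor", "shared_venue"], weights)
```

With no satisfied rules the score is zero and the probability is exactly 0.5, i.e. the model is indifferent; stronger evidence such as a shared coauthor pushes the probability toward 1.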
5. The method of claim 3, wherein aggregating authors in each of the document author pairs based on the probability values to obtain an author group and individual authors corresponding to the initial data block comprises:
establishing an author network by taking the authors in each document author pair in the initial data block as nodes and taking the probability values that the document author pairs have the same author as edge weights;
and aggregating the author network with the aim of maximizing its modularity to obtain the author groups and the individual authors.
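The author-network aggregation of claim 5 can be sketched as follows. The patent maximizes network modularity (a Louvain-style objective); as a simpler stand-in, this sketch unions authors across edges whose same-author probability clears a threshold, which yields the same grouping when the network is well separated. All names and the threshold are illustrative assumptions:

```python
def aggregate_authors(edges, threshold=0.5):
    """Aggregate an author network into author groups and individual
    authors. `edges` maps (author_a, author_b) to the probability that
    the two mentions are the same author. Stand-in for modularity
    maximization: union-find over edges above the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), prob in edges.items():
        find(a), find(b)            # register both nodes
        if prob >= threshold:
            parent[find(a)] = find(b)

    # Collect connected components: multi-member clusters are author
    # groups, singletons are individual authors.
    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    groups = [c for c in clusters.values() if len(c) > 1]
    individuals = [next(iter(c)) for c in clusters.values() if len(c) == 1]
    return groups, individuals
```

For example, edges {(a1, a2): 0.9, (a2, a3): 0.8, (a3, a4): 0.1} yield one author group {a1, a2, a3} and one individual author a4.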
6. The method of claim 2, wherein the repeatedly merging the plurality of updated data blocks based on the document author pairs having the same author among the plurality of updated data blocks until a final data block is generated comprises:
determining a data block to be merged from the plurality of updated data blocks based on a predetermined data block size threshold;
merging the data blocks to be merged in pairs to obtain merged data blocks;
merging the document author pairs having the same author in each merged data block to generate an updated data block corresponding to that merged data block;
and taking the updated data blocks as new data blocks to be merged and repeating the pairwise merging until a final data block is generated.
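The iterative pairwise merging of claim 6 can be sketched as below; `dedupe` stands in for the first-order-logic merging step applied inside each merged block, and the size threshold keeps each individual merge tractable. Function names and the default threshold value are assumptions:

```python
def iterative_merge(blocks, dedupe, size_threshold=10000):
    """Iteratively merge data blocks in pairs. `blocks` is a list of
    lists of document author pairs; `dedupe` merges same-author pairs
    within one block. Only blocks at or under the size threshold are
    paired up, so no single merge grows unboundedly."""
    while len(blocks) > 1:
        small = [b for b in blocks if len(b) <= size_threshold]
        large = [b for b in blocks if len(b) > size_threshold]
        if len(small) < 2:          # nothing left to pair up
            blocks = small + large
            break
        merged = []
        # Combine blocks two at a time, then dedupe each result.
        for i in range(0, len(small) - 1, 2):
            merged.append(dedupe(small[i] + small[i + 1]))
        if len(small) % 2:          # odd block carries over unchanged
            merged.append(small[-1])
        blocks = merged + large
    return blocks
```

Each round roughly halves the number of mergeable blocks, so the loop terminates after a logarithmic number of rounds in the initial block count.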
7. The method of claim 1, wherein the generating corresponding predicate instance data based on the large-scale data and a predefined predicate comprises:
generating each document author pair based on the document information and the corresponding author information in the large-scale data;
combining the document author pairs in pairs to generate pairing data;
and generating the predicate instance data corresponding to the pairing data based on the predefined predicate and the author information and document information corresponding to the pairing data.
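The predicate-instance generation of claim 7 can be illustrated with a minimal sketch; the `SameName` predicate and the data layout are assumptions for illustration, not the patent's actual predicate set:

```python
from itertools import combinations

def build_predicate_instances(documents):
    """Build document author pairs, pairing data, and predicate
    instances. `documents` maps a document id to metadata holding an
    "authors" list (field names assumed for illustration)."""
    # Step 1: one (document, author) pair per listed author.
    doc_author_pairs = [
        (doc_id, author)
        for doc_id, meta in documents.items()
        for author in meta["authors"]
    ]
    # Step 2: combine the pairs two at a time to form pairing data.
    pairing_data = list(combinations(doc_author_pairs, 2))
    # Step 3: instantiate an example predicate for each pairing,
    # e.g. SameName(p1, p2) when the two author strings match.
    instances = [
        {"predicate": "SameName", "args": (p1, p2), "value": p1[1] == p2[1]}
        for p1, p2 in pairing_data
    ]
    return doc_author_pairs, pairing_data, instances
```

For two documents with authors ["Wang Wei"] and ["Wang Wei", "Li Na"], this produces three document author pairs, three pairings, and exactly one true `SameName` instance (the two "Wang Wei" mentions).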
8. The method of claim 1, wherein prior to the generating the corresponding predicate instance data based on the large-scale data and a predefined predicate, the method further comprises:
collecting original document data and preprocessing the original document data to generate the large-scale data, wherein the large-scale data includes document identifiers and author identifiers;
and carrying out structured storage on the large-scale data.
9. An author disambiguation apparatus for large-scale data, the apparatus comprising:
the first generation module is used for generating corresponding predicate instance data based on the large-scale data and a predefined predicate; wherein the large-scale data includes structurally stored document information and corresponding author information, and the predicate instance data includes document author pairs;
the segmentation module is used for splitting the predicate instance data into a plurality of initial data blocks;
the merging module is used for merging the document author pairs with the same author in the plurality of initial data blocks based on a predetermined first-order logic rule to generate a final data block;
and the second generation module is used for generating the author identification information corresponding to the large-scale data based on the document author pairs in the final data block.
10. A readable storage medium having a program stored thereon, wherein the program, when executed by a processor, performs the steps of the author disambiguation method for large-scale data according to any one of claims 1 to 8.
CN202410067264.2A 2024-01-17 2024-01-17 Author disambiguation method and device for large-scale data and readable storage medium Pending CN117610541A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410067264.2A CN117610541A (en) 2024-01-17 2024-01-17 Author disambiguation method and device for large-scale data and readable storage medium


Publications (1)

Publication Number Publication Date
CN117610541A true CN117610541A (en) 2024-02-27

Family

ID=89946537

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410067264.2A Pending CN117610541A (en) 2024-01-17 2024-01-17 Author disambiguation method and device for large-scale data and readable storage medium

Country Status (1)

Country Link
CN (1) CN117610541A (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100211533A1 (en) * 2009-02-18 2010-08-19 Microsoft Corporation Extracting structured data from web forums
US20100318499A1 (en) * 2009-06-15 2010-12-16 Microsoft Corporation Declarative framework for deduplication
CN105528349A (en) * 2014-09-29 2016-04-27 华为技术有限公司 Method and apparatus for analyzing question based on knowledge base
CN109166069A (en) * 2018-07-17 2019-01-08 华中科技大学 Data correlation method, system and equipment based on Markov logical network
CN109491991A (en) * 2018-11-08 2019-03-19 四川长虹电器股份有限公司 A kind of unsupervised data auto-cleaning method
CN112131872A (en) * 2020-09-18 2020-12-25 三螺旋大数据科技(昆山)有限公司 Document author duplicate name disambiguation method and construction system
CN113111178A (en) * 2021-03-04 2021-07-13 中国科学院计算机网络信息中心 Method and device for disambiguating homonymous authors based on expression learning without supervision
CN113360675A (en) * 2021-06-25 2021-09-07 中关村智慧城市产业技术创新战略联盟 Knowledge graph specific relation completion method based on Internet open world
CN113688257A (en) * 2021-08-19 2021-11-23 安徽工大信息技术有限公司 Author name identity judgment method based on large-scale literature data
CN113868407A (en) * 2021-08-17 2021-12-31 北京智谱华章科技有限公司 Evaluation method and device for review recommendation algorithm based on scientific research big data
CN114896408A (en) * 2022-03-24 2022-08-12 北京大学深圳研究生院 Construction method of material knowledge graph, material knowledge graph and application
CN114969387A (en) * 2022-05-31 2022-08-30 北京智谱华章科技有限公司 Document author information disambiguation method and device and electronic equipment
CN117236321A (en) * 2023-09-27 2023-12-15 上海市研发公共服务平台管理中心 Method, system, terminal and medium for disambiguating name of scientific research result and merging result

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUNJUN GAO et al.: "A Hybrid Data Cleaning Framework Using Markov Logic Networks", arXiv, 14 March 2019 (2019-03-14), pages 1-11 *
ZHANG YUFANG et al.: "Application of Markov Logic Networks in Data Deduplication", Journal of Chongqing University, vol. 33, no. 08, 15 August 2010 (2010-08-15), pages 36-41 *
TAN XIAO: "Research Progress of Knowledge Graph and Analysis of Its Frontier Topics", Library and Information, no. 02, 20 April 2020 (2020-04-20), pages 50-63 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination