CN112183100A - Multi-source homonymous expert disambiguation method

Info

Publication number: CN112183100A
Application number: CN202011082199.9A
Authority: CN
Inventors: 李林; 李成中; 谭祥; 巴宗岳
Original assignee: Inspur Tianyuan Communication Information System Co Ltd
Current assignee: Inspur Tianyuan Communication Information System Co Ltd
Priority date: 2020-10-12
Filing date: 2020-10-12
Publication date: 2021-01-05

Abstract

The invention discloses a multi-source homonymous expert disambiguation method in the technical field of engineering science and technology. The method cleans and processes expert data from a plurality of expert libraries and analyzes the uniqueness of expert names. It then combines expert achievement association with collision disambiguation processing to integrate multi-source knowledge bases containing expert entity definitions, and performs entity disambiguation on the large number of experts who share a name, so as to determine which real-world entity each record refers to, determine the semantics of the entity, and establish a unified expert library. The invention can effectively resolve the homonymous-expert ambiguity that is widespread in various texts and provide accurate expert data support for semantic search engines, intelligent question-answering systems and the like in the field of engineering science and technology.

Description

Multi-source homonymous expert disambiguation method

Technical Field

The invention relates to the technical field of engineering science and technology, in particular to a multi-source homonymous expert disambiguation method.

Background

Multi-source homonymous expert disambiguation means eliminating the ambiguity among same-name experts across a plurality of expert libraries, so that experts with the same name are distinguished according to the real-world entities they correspond to. In the field of engineering science and technology there are many expert libraries built by different institutions, and the underlying data of these libraries are not integrated, so a large number of experts share a name. As a result, searching for experts in this field and in document databases is inefficient, and users must spend considerable time picking out the expert information they are interested in from among the same-name experts.

Disclosure of Invention

The technical task of the invention is to address the above shortcomings by providing a multi-source homonymous expert disambiguation method that can effectively resolve the homonymous-expert ambiguity that is widespread in various texts and provide accurate expert data support for semantic search engines, intelligent question-answering systems and the like in the field of engineering science and technology.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A multi-source homonymous expert disambiguation method, in which expert data from a plurality of expert libraries are cleaned and processed, and expert name uniqueness analysis is carried out;

expert achievement association and collision disambiguation processing are then combined to integrate multi-source knowledge bases containing expert entity definitions, and entity disambiguation is performed on the large number of experts who share a name, so as to determine which real-world entity each record refers to, determine the semantics of the entity, and establish a unified expert library.

The method cleans and preprocesses the expert library data and then applies uniqueness analysis, expert achievement association and collision disambiguation to perform entity disambiguation on the large number of experts who share a name. It outputs the expert disambiguation results and establishes a knowledge base. It effectively resolves the homonymous-expert ambiguity that is widespread in various texts in a given technical field, and is suitable for semantic search, question-answering systems, knowledge base expansion, heterogeneous knowledge base fusion and similar applications.

Preferably, the cleaning process includes expert data attribute value confirmation, in which expert attribute values that do not conform to conventional logic and are obviously wrong are set to empty.
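The attribute value confirmation described above can be sketched as follows. This is only an illustrative sketch: the field names (`birth_date`, `mobile`), the plausible age range and the 11-digit mobile format are assumptions made for the example, since the text does not list concrete validation rules.

```python
import re
from datetime import date


def _valid_birth_date(value, today=None):
    """True if value is an ISO date giving a plausible age (assumed range: 18 to 120)."""
    today = today or date.today()
    try:
        born = date.fromisoformat(value)
    except (TypeError, ValueError):
        return False
    age = (today - born).days / 365.25
    return 18 <= age <= 120


def _valid_mobile(value):
    """True for an 11-digit number starting with 1 (assumed mobile format)."""
    return isinstance(value, str) and re.fullmatch(r"1\d{10}", value) is not None


# Hypothetical rule table: attribute name -> validity check.
VALIDATORS = {"birth_date": _valid_birth_date, "mobile": _valid_mobile}


def clean_record(record):
    """Return a copy of record with values that fail their check set to None."""
    cleaned = dict(record)
    for field, is_valid in VALIDATORS.items():
        if cleaned.get(field) is not None and not is_valid(cleaned[field]):
            cleaned[field] = None
    return cleaned
```

Emptying a bad value rather than dropping the whole record keeps the remaining attributes available for the later disambiguation steps.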

Preferably, the process of analyzing the uniqueness of the expert name is as follows:

if the name is unique, the multi-source expert record is extracted directly and stored in the library;

if the name is not unique, a judgment is made according to the expert achievement association result, and the result is stored after collision disambiguation processing.
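The two branches above amount to a routing step, which might look like the following sketch. The record layout (a dict with a `name` key) and the function name are assumptions for illustration only.

```python
from collections import defaultdict


def split_by_name_uniqueness(records):
    """Split records into (unique, ambiguous).

    unique: records whose name occurs exactly once across all sources;
            these can be stored directly.
    ambiguous: {name: [records]} for names occurring more than once;
               these go on to achievement association and collision disambiguation.
    """
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["name"]].append(rec)
    unique = [recs[0] for recs in by_name.values() if len(recs) == 1]
    ambiguous = {name: recs for name, recs in by_name.items() if len(recs) > 1}
    return unique, ambiguous
```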

Furthermore, a relational database of experts and journal papers is established to associate each expert with his or her journal paper achievements, and the uniqueness of the expert names is then further judged according to the expert achievement association result.
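One way to picture the expert and journal paper relation is a three-table schema with a join that finds same-name records sharing a paper. The sketch below uses Python's built-in `sqlite3` module. The table and column names, and the idea that a shared paper is evidence that two records refer to the same person, are assumptions for illustration and are not specified in the text.

```python
import sqlite3


def build_relation_db(experts, papers, authorship):
    """Create an in-memory database linking expert records to journal papers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE expert (id INTEGER PRIMARY KEY, source TEXT, name TEXT);
        CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE expert_paper (
            expert_id INTEGER REFERENCES expert(id),
            paper_id INTEGER REFERENCES paper(id),
            PRIMARY KEY (expert_id, paper_id)
        );
    """)
    conn.executemany("INSERT INTO expert VALUES (?, ?, ?)", experts)
    conn.executemany("INSERT INTO paper VALUES (?, ?)", papers)
    conn.executemany("INSERT INTO expert_paper VALUES (?, ?)", authorship)
    return conn


def shared_paper_pairs(conn):
    """Return (id_a, id_b) pairs of same-name expert records that share a paper."""
    rows = conn.execute("""
        SELECT DISTINCT a.id, b.id
        FROM expert a
        JOIN expert b ON a.name = b.name AND a.id < b.id
        JOIN expert_paper ea ON ea.expert_id = a.id
        JOIN expert_paper eb ON eb.expert_id = b.id AND eb.paper_id = ea.paper_id
        ORDER BY a.id, b.id
    """)
    return rows.fetchall()
```

The `a.id < b.id` condition reports each unordered pair once, which matches the non-directional nature of the comparison.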

Preferably, the collision disambiguation processing adopts the rule intersection fusion of a triangular nondirectional collision disambiguation matrix:

a collision disambiguation rule engine is established, and the collision disambiguation rules are intersected and fused based on the triangular nondirectional collision disambiguation matrix.
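The text does not define the triangular nondirectional collision disambiguation matrix in detail. One possible reading, sketched below, is that pairwise comparison of same-name records is symmetric (nondirectional), so only the upper triangle of the pairwise matrix needs to be evaluated, and that "intersection fusion" means a pair counts as the same expert only when every rule that can decide agrees. The rule functions and field names are hypothetical.

```python
from itertools import combinations


def _equal_if_present(field):
    """Build a rule: True/False if both records have the field, None if undecidable."""
    def rule(a, b):
        if a.get(field) is None or b.get(field) is None:
            return None
        return a[field] == b[field]
    return rule


# Hypothetical rule set; a real rule engine would hold many more rules.
RULES = [_equal_if_present("mobile"), _equal_if_present("birth_date")]


def fuse_rules(a, b, rules=RULES):
    """Intersect rule verdicts: match only if some rule decides and none disagrees."""
    verdicts = [v for v in (rule(a, b) for rule in rules) if v is not None]
    return bool(verdicts) and all(verdicts)


def collision_matrix(records, rules=RULES):
    """Upper-triangular verdicts {(i, j): bool} for i < j.

    The comparison is symmetric, so the lower triangle and diagonal are omitted.
    """
    return {
        (i, j): fuse_rules(records[i], records[j], rules)
        for i, j in combinations(range(len(records)), 2)
    }
```

For n same-name records this evaluates n(n-1)/2 pairs instead of n squared.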

Further, the experts are grouped according to the first character of the MD5 value of each expert name, the experts with the same name are queried within each group, duplicate-data judgment is carried out based on the collision disambiguation rules, and the duplicate data are fused.
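The grouping step can be illustrated with Python's `hashlib`. The sketch assumes that the "MD5 value" is the hexadecimal digest of the UTF-8 encoded name, so its first character yields at most 16 groups. The record layout and function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_bucket(name):
    """First hex character of the MD5 digest of the name (assumed UTF-8 encoding)."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()[0]


def group_by_md5_initial(records):
    """Return {bucket: {name: [records]}} so same-name lookups stay within one bucket."""
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in records:
        buckets[md5_bucket(rec["name"])][rec["name"]].append(rec)
    return buckets


def same_name_groups(buckets):
    """Yield (bucket, name, records) for every name that occurs more than once."""
    for bucket, names in buckets.items():
        for name, recs in names.items():
            if len(recs) > 1:
                yield bucket, name, recs
```

Because identical names always hash to the same bucket, each bucket can be processed independently, for example in parallel or in separate batches.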

Further, based on the intersection fusion result, collision disambiguation processing is performed according to the expert data attribute values, wherein the attribute values include birth date, mobile phone number, publications, biography, research field and data source authority.
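How data source authority might enter the processing is not spelled out in the text. The sketch below shows one plausible use: once records have been judged to be the same expert, conflicting attribute values are resolved in favor of the more authoritative source. The authority scale, source names and field names are all assumptions for illustration.

```python
# Hypothetical authority ranking; higher wins. The text gives no concrete scale.
SOURCE_AUTHORITY = {"national_academy": 3, "university": 2, "web_crawl": 1}


def fuse_duplicates(records, authority=SOURCE_AUTHORITY):
    """Merge records judged to be the same expert into one record.

    For each attribute, the first non-empty value from the most
    authoritative source is kept; unknown sources rank lowest.
    """
    ranked = sorted(
        records,
        key=lambda r: authority.get(r.get("source"), 0),
        reverse=True,
    )
    fused = {}
    for rec in ranked:
        for field, value in rec.items():
            if field == "source":
                continue
            if fused.get(field) is None and value is not None:
                fused[field] = value
    # Keep provenance so the merged record can be traced back to its sources.
    fused["sources"] = sorted({r["source"] for r in records if r.get("source")})
    return fused
```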

Preferably, the plurality of expert libraries are expert libraries in the field of engineering science and technology; the multi-source homonymous experts that have undergone collision disambiguation are stored, and a unified expert library for the field of engineering science and technology is established.

The invention also claims a multi-source homonymous expert disambiguation device, comprising: at least one memory and at least one processor;

the at least one memory is used for storing a machine readable program;

the at least one processor is used for calling the machine readable program and executing the above method.

The invention also claims a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the above-described method.

Compared with the prior art, the multi-source homonymous expert disambiguation method has the following beneficial effects:

By means of uniqueness analysis, expert achievement association, collision disambiguation and related techniques, the method integrates multi-source knowledge bases containing expert entity definitions and performs entity disambiguation on the large number of experts who share a name. It can effectively resolve the homonymous-expert ambiguity that is widespread in various texts in a given field, and provides accurate expert data support for semantic search engines, intelligent question-answering systems and the like in that field.

Drawings

FIG. 1 is a flow chart of a multi-source homonymous expert disambiguation method provided by one embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific examples.

An embodiment of the invention provides a multi-source homonymous expert disambiguation method. Referring to FIG. 1, expert data from a plurality of expert libraries undergo cleaning processing such as attribute value confirmation, and then undergo expert name uniqueness analysis.

Here, expert data attribute value confirmation means setting to empty the expert attribute values that do not conform to conventional logic and are obviously wrong.

The process of analyzing the uniqueness of the expert name is as follows:

A relational database of experts and journal papers is established to associate each expert with his or her journal paper achievements, and the uniqueness of the expert names is then further judged according to the expert achievement association result.

The collision disambiguation processing adopts rule intersection fusion based on a triangular nondirectional collision disambiguation matrix:

The experts are grouped according to the first character of the MD5 value of each expert name, the experts with the same name are queried within each group, duplicate-data judgment is carried out based on the collision disambiguation rules, and the duplicate data are fused.

Based on the intersection fusion result, collision disambiguation processing is performed according to the expert data attribute values, wherein the attribute values include birth date, mobile phone number, publications, biography, research field and data source authority.

The multi-source homonymous expert data that have undergone collision disambiguation are stored, and a unified expert library for the field is established.

An embodiment of the invention also provides a multi-source homonymous expert disambiguation method applied to the field of engineering science and technology. Based on disambiguation techniques such as expert attribute value judgment, name uniqueness analysis, expert achievement association and triangular nondirectional collision disambiguation matrix rules, the method integrates multi-source knowledge bases containing expert entity definitions, performs entity disambiguation on the large number of experts who share a name, determines which real-world entity each record refers to, determines the semantics of the entity, and establishes a unified expert library for the field of engineering science and technology.

The specific implementation process is as follows:

Data cleaning: the data of a plurality of expert libraries in the field of engineering science and technology undergo cleaning and preprocessing such as attribute value confirmation, and the expert attribute values that do not conform to conventional logic and are obviously wrong are set to empty.

Expert name uniqueness analysis: it is confirmed whether the expert name is unique. If so, the multi-source expert record is extracted directly and stored in the target library; if not, a judgment is made according to the expert achievement association result.

Achievement association: a relational database of experts and journal papers is established to associate each expert with his or her journal paper achievements.

Rule intersection fusion of the triangular nondirectional collision disambiguation matrix: a collision disambiguation rule engine is established, and the collision disambiguation rules are intersected and fused based on the triangular nondirectional collision disambiguation matrix.

Fusion of duplicate expert data: the experts are grouped according to the first character of the MD5 value of each expert name, the experts with the same name are queried within each group, duplicate-data judgment is carried out based on the collision disambiguation rules, and the duplicate data are fused.

Collision disambiguation processing: based on the intersection fusion result, collision disambiguation processing is performed according to attribute values such as the expert's birth date, mobile phone number, publications, biography, research field and data source authority.

Storage of disambiguated multi-source homonymous experts: the multi-source homonymous expert data that have undergone collision disambiguation are stored in the target library, and a unified expert library for the field of engineering science and technology is established.

An embodiment of the invention also provides a multi-source homonymous expert disambiguation device, comprising: at least one memory and at least one processor;

the at least one memory is used for storing a machine readable program;

the at least one processor is configured to invoke the machine readable program to perform the multi-source homonymous expert disambiguation method described in any of the above embodiments.

An embodiment of the present invention further provides a computer-readable medium storing computer instructions which, when executed by a processor, cause the processor to execute the multi-source homonymous expert disambiguation method described in the above embodiments of the present invention. Specifically, a system or an apparatus may be provided with a storage medium on which software program code realizing the functions of any of the above-described embodiments is stored, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program code stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.

Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.

Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.

While the invention has been shown and described in detail in the drawings and the preferred embodiments, the invention is not limited to the disclosed embodiments. It will be apparent to those skilled in the art that the technical means of the various embodiments described above may be combined to obtain further embodiments of the invention, which also fall within the scope of protection of the invention.

Claims

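1. A multi-source homonymous expert disambiguation method, characterized in that expert data from a plurality of expert libraries are cleaned and processed, and expert name uniqueness analysis is carried out;

expert achievement association and collision disambiguation processing are combined to integrate multi-source knowledge bases containing expert entity definitions, and entity disambiguation is performed on the large number of experts who share a name, so as to determine which real-world entity each record refers to, determine the semantics of the entity, and establish a unified expert library.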

2. The method of claim 1, wherein the cleaning process comprises expert data attribute value confirmation, in which expert attribute values that do not conform to conventional logic and are obviously wrong are set to empty.

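3. The method of claim 1, wherein the expert name uniqueness analysis comprises the following steps:

if the name is unique, the multi-source expert record is extracted directly and stored in the library;

if the name is not unique, a judgment is made according to the expert achievement association result, and the result is stored after collision disambiguation processing.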

4. The method as claimed in claim 3, wherein a relational database of experts and journal papers is established to associate each expert with his or her journal paper achievements, and the uniqueness of the expert names is further judged according to the expert achievement association result.

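5. The method of claim 1, 2 or 3, wherein the collision disambiguation processing adopts rule intersection fusion based on a triangular nondirectional collision disambiguation matrix:

a collision disambiguation rule engine is established, and the collision disambiguation rules are intersected and fused based on the triangular nondirectional collision disambiguation matrix.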

6. The multi-source homonymous expert disambiguation method as claimed in claim 5, wherein the experts are grouped according to the first character of the MD5 value of each expert name, the experts with the same name are queried within each group, duplicate-data judgment is carried out based on the collision disambiguation rules, and the duplicate data are fused.

7. The multi-source homonymous expert disambiguation method of claim 6, wherein, based on the intersection fusion result, the collision disambiguation processing is performed according to expert data attribute values, the attribute values including birth date, mobile phone number, publications, biography, research field and data source authority.

8. The multi-source homonymous expert disambiguation method of claim 1 or 2, wherein the plurality of expert libraries are expert libraries in the field of engineering science and technology; the multi-source homonymous experts that have undergone collision disambiguation are stored, and a unified expert library for the field of engineering science and technology is established.

9. A multi-source homonymous expert disambiguation device, comprising: at least one memory and at least one processor;

the at least one memory is used for storing a machine readable program;

the at least one processor is configured to invoke the machine readable program to perform the method of any of claims 1 to 8.

10. A computer readable medium, characterized in that it has stored thereon computer instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 8.