CN112183100A - Multi-source homonymous expert disambiguation method

Multi-source homonymous expert disambiguation method

Info

Publication number
CN112183100A
Authority
CN
China
Prior art keywords
expert
disambiguation
collision
source
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011082199.9A
Other languages
Chinese (zh)
Inventor
李林
李成中
谭祥
巴宗岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Tianyuan Communication Information System Co Ltd
Original Assignee
Inspur Tianyuan Communication Information System Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-10-12
Filing date
2020-10-12
Publication date
2021-01-05
Application filed by Inspur Tianyuan Communication Information System Co Ltd
Priority to CN202011082199.9A
Publication of CN112183100A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multi-source homonymous expert disambiguation method, belonging to the technical field of engineering science and technology. The method cleans expert data from a plurality of expert libraries and performs expert name uniqueness analysis. Combining expert achievement association with collision disambiguation processing, it integrates a multi-source knowledge base containing expert entity definitions and performs entity disambiguation on the large number of existing duplicate expert names, so as to determine the correct referent of each entity, determine the semantics of the entity, and establish a unified expert library. The invention can effectively solve the homonymous expert ambiguity problem that widely exists in various texts, and provides accurate expert data support for semantic search engines, intelligent question-answering systems and the like in the field of engineering science and technology.

Description

Multi-source homonymous expert disambiguation method
Technical Field
The invention relates to the technical field of engineering science and technology, in particular to a multi-source homonymous expert disambiguation method.
Background
Multi-source homonymous expert disambiguation refers to eliminating the ambiguity of homonymous experts across a plurality of expert libraries, distinguishing experts who share a name according to the different real-world entities they correspond to. Because many expert libraries built by different institutions exist in the field of engineering science and technology, and the underlying data of these libraries has not been integrated, a large number of duplicate expert names exist. As a result, searching for experts in engineering science and technology expert libraries and literature databases is inefficient, and a user must spend considerable time screening the expert of interest out of the experts who share the same name.
Disclosure of Invention
The technical task of the invention is to provide, in view of the above deficiencies, a multi-source homonymous expert disambiguation method that can effectively solve the homonymous expert ambiguity problem widely existing in various texts, and that provides accurate expert data support for semantic search engines, intelligent question-answering systems and the like in the field of engineering science and technology.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A multi-source homonymous expert disambiguation method, in which expert data from a plurality of expert libraries is cleaned and expert name uniqueness analysis is performed;
expert achievement association and collision disambiguation processing are combined, a multi-source knowledge base containing expert entity definitions is integrated, and entity disambiguation is performed on the large number of existing duplicate expert names, so as to determine the correct referent of each entity, determine the semantics of the entity, and establish a unified expert library.
The method cleans and preprocesses the expert library data and, by means of uniqueness analysis, expert achievement association and collision disambiguation, performs entity disambiguation on the large number of existing duplicate expert names, outputs the expert disambiguation results and establishes a knowledge base. It effectively solves the homonymous expert ambiguity problem widely existing in various texts in a given technical field, and is suitable for applications such as semantic search, question-answering systems, knowledge base expansion and heterogeneous knowledge base fusion.
Preferably, the cleaning processing includes expert data attribute value confirmation, in which expert attribute values that do not conform to conventional logic and are obviously erroneous are set to null.
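The attribute value confirmation described above can be illustrated with a minimal sketch. The field names (`birth_date`, `mobile`), the date format, the 1900 lower bound and the 11-digit mobile number pattern are all assumptions made for illustration; the source does not enumerate the checks it applies.

```python
import re
from datetime import date, datetime

# Assumed plausibility rule: an 11-digit mobile number starting with 1.
MOBILE_RE = re.compile(r"^1\d{10}$")


def _valid_birth_date(value, today):
    """Return True if value parses as YYYY-MM-DD and lies in a plausible range."""
    try:
        born = datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return False
    return date(1900, 1, 1) <= born <= today


def clean_expert(record, today=None):
    """Return a copy of record with implausible attribute values set to None."""
    today = today or date.today()
    cleaned = dict(record)
    if not _valid_birth_date(cleaned.get("birth_date"), today):
        cleaned["birth_date"] = None
    mobile = cleaned.get("mobile")
    if not (isinstance(mobile, str) and MOBILE_RE.match(mobile)):
        cleaned["mobile"] = None
    return cleaned
```

Setting a bad value to None rather than dropping the whole record keeps the remaining attributes available for the later comparison steps.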
Preferably, the expert name uniqueness analysis proceeds as follows:
if the name is unique, the expert is directly extracted and stored in the library;
if the name is not unique, a judgment is made according to the expert achievement association result, and the result obtained after collision disambiguation processing is stored.
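A minimal sketch of this branching, assuming each expert record is a dictionary with a `name` key (a hypothetical layout): records whose name occurs once across all sources go straight to storage, and the rest are held back for further judgment.

```python
from collections import defaultdict


def split_by_name_uniqueness(records):
    """Split records into uniquely named ones and groups that share a name."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["name"]].append(rec)
    # Names that occur once can be stored directly.
    unique = [recs[0] for recs in by_name.values() if len(recs) == 1]
    # Names that occur more than once need achievement association
    # and collision disambiguation before storage.
    ambiguous = {name: recs for name, recs in by_name.items() if len(recs) > 1}
    return unique, ambiguous
```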
Furthermore, a relation library of experts and journal papers is established to associate experts with their journal paper achievements, and the uniqueness of an expert name is then judged according to the expert achievement association result.
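One possible shape for such a relation library is sketched below with Python's built-in `sqlite3` module. The table and column names are hypothetical, and treating a shared paper as evidence that two same-name records refer to one person is an assumption; the source does not state how the association result is evaluated.

```python
import sqlite3


def build_relation_db(experts, papers, authorship):
    """Create an in-memory expert/paper relation library and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE expert (id INTEGER PRIMARY KEY, name TEXT, source TEXT);
        CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE expert_paper (
            expert_id INTEGER REFERENCES expert(id),
            paper_id INTEGER REFERENCES paper(id),
            PRIMARY KEY (expert_id, paper_id)
        );
        """
    )
    conn.executemany("INSERT INTO expert VALUES (?, ?, ?)", experts)
    conn.executemany("INSERT INTO paper VALUES (?, ?)", papers)
    conn.executemany("INSERT INTO expert_paper VALUES (?, ?)", authorship)
    return conn


def shared_paper_pairs(conn):
    """Return id pairs of same-name expert records linked to a common paper."""
    return conn.execute(
        """
        SELECT DISTINCT a.id, b.id
        FROM expert a
        JOIN expert b ON a.name = b.name AND a.id < b.id
        JOIN expert_paper pa ON pa.expert_id = a.id
        JOIN expert_paper pb ON pb.expert_id = b.id AND pb.paper_id = pa.paper_id
        ORDER BY a.id, b.id
        """
    ).fetchall()
```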
Preferably, the collision disambiguation processing adopts rule intersection fusion based on a triangular nondirectional collision disambiguation matrix:
a collision disambiguation rule engine is established, and the collision disambiguation rules are intersected and fused based on the triangular nondirectional collision disambiguation matrix.
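The source does not define the triangular nondirectional matrix further. One plausible reading, offered here only as an assumption, is that the comparison between two records is symmetric, so only the pairs above the diagonal (i < j) need to be evaluated, and that intersecting the rules means a pair matches only when every applicable rule agrees. The rule factory `same_value` and the fields it checks are hypothetical.

```python
from itertools import combinations


def same_value(field):
    """Build a rule returning True/False, or None when the field is missing."""
    def rule(a, b):
        if a.get(field) is None or b.get(field) is None:
            return None  # rule not applicable to this pair
        return a[field] == b[field]
    return rule


# Assumed rule set; the source does not list the engine's rules.
RULES = (same_value("mobile"), same_value("birth_date"))


def rules_agree(a, b, rules=RULES):
    """Intersect rule verdicts: at least one rule applies and none disagrees."""
    verdicts = [v for v in (rule(a, b) for rule in rules) if v is not None]
    return bool(verdicts) and all(verdicts)


def upper_triangle_matches(records, rules=RULES):
    """Evaluate each unordered pair (i < j) once and return matching pairs."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if rules_agree(records[i], records[j], rules)
    ]
```

Evaluating only the upper triangle roughly halves the number of comparisons relative to a full matrix of n by n entries, which is the practical benefit of a symmetric comparison.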
Further, experts are grouped according to the first character of the MD5 value of the expert name, experts with the same name are queried within each group, duplicate data is identified based on the collision disambiguation rules, and the duplicate data is fused.
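The grouping step can be sketched as follows, assuming the group key is the first hexadecimal character of the MD5 digest of the UTF-8 encoded name, which yields at most 16 groups. The record layout is hypothetical. Because identical names always hash to the same digest, all records sharing a name land in the same group.

```python
import hashlib
from collections import defaultdict


def md5_bucket(name):
    """Return the first hex character of the MD5 digest of the name."""
    return hashlib.md5(name.encode("utf-8")).hexdigest()[0]


def group_same_name(records):
    """Map bucket -> name -> list of records carrying that name."""
    buckets = defaultdict(lambda: defaultdict(list))
    for rec in records:
        buckets[md5_bucket(rec["name"])][rec["name"]].append(rec)
    return buckets
```

Each group can then be processed independently, which keeps the same-name query within a group small.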
Further, based on the intersection fusion result, collision disambiguation processing is performed according to expert data attribute values, the attribute values including birth date, mobile phone number, literature achievements, profile, research field and data source authority.
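How the attribute values are weighed is not specified in the source. The sketch below assumes that data source authority acts as a tie-breaker: when records already judged to be the same expert carry conflicting values, the value from the more authoritative source is kept. The attribute names and the `authority` ranking are hypothetical.

```python
# Assumed attribute names for the values listed in the text.
ATTRIBUTES = ("birth_date", "mobile", "papers", "profile", "research_field")


def fuse(records, authority):
    """Merge records of one expert, preferring higher-authority sources."""
    # Sort most authoritative first; unknown sources get rank 0.
    ranked = sorted(
        records, key=lambda r: authority.get(r["source"], 0), reverse=True
    )
    merged = {
        "name": ranked[0]["name"],
        "sources": [r["source"] for r in ranked],
    }
    for attr in ATTRIBUTES:
        # Take the first non-null value in authority order.
        merged[attr] = next(
            (r[attr] for r in ranked if r.get(attr) is not None), None
        )
    return merged
```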
Preferably, the plurality of expert libraries are expert libraries in the field of engineering science and technology; the multi-source homonymous experts processed by collision disambiguation are stored, and a unified expert library in the field of engineering science and technology is established.
The invention also claims a multi-source homonymous expert disambiguation device, comprising: at least one memory and at least one processor;
the at least one memory is used for storing a machine readable program;
the at least one processor is used for calling the machine readable program and executing the above method.
The invention also claims a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the above method.
Compared with the prior art, the multi-source homonymous expert disambiguation method has the following beneficial effects:
by means of uniqueness analysis, expert achievement association, collision disambiguation and similar techniques, the method integrates a multi-source knowledge base containing expert entity definitions and performs entity disambiguation on the large number of existing duplicate expert names. It can effectively solve the homonymous expert ambiguity problem widely existing in various texts in a given field, and provides accurate expert data support for semantic search engines, intelligent question-answering systems and the like in that field.
Drawings
FIG. 1 is a flow chart of a multi-source homonymous expert disambiguation method provided by one embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
An embodiment of the invention provides a multi-source homonymous expert disambiguation method in which, referring to FIG. 1, expert data from a plurality of expert libraries undergoes cleaning processing such as attribute value confirmation and is then subjected to expert name uniqueness analysis;
expert achievement association and collision disambiguation processing are combined, a multi-source knowledge base containing expert entity definitions is integrated, and entity disambiguation is performed on the large number of existing duplicate expert names, so as to determine the correct referent of each entity, determine the semantics of the entity, and establish a unified expert library.
Here, expert data attribute value confirmation means setting to null the expert attribute values that do not conform to conventional logic and are obviously erroneous.
The expert name uniqueness analysis proceeds as follows:
if the name is unique, the expert is directly extracted and stored in the library;
if the name is not unique, a judgment is made according to the expert achievement association result, and the result obtained after collision disambiguation processing is stored.
A relation library of experts and journal papers is established to associate experts with their journal paper achievements, and the uniqueness of an expert name is then judged according to the expert achievement association result.
The collision disambiguation processing adopts rule intersection fusion based on a triangular nondirectional collision disambiguation matrix:
a collision disambiguation rule engine is established, and the collision disambiguation rules are intersected and fused based on the triangular nondirectional collision disambiguation matrix.
Experts are grouped according to the first character of the MD5 value of the expert name, experts with the same name are queried within each group, duplicate data is identified based on the collision disambiguation rules, and the duplicate data is fused.
Based on the intersection fusion result, collision disambiguation processing is performed according to expert data attribute values, the attribute values including birth date, mobile phone number, literature achievements, profile, research field and data source authority.
The multi-source homonymous expert data processed by collision disambiguation is stored, and a unified expert library in the field is established.
The method cleans and preprocesses the expert library data and, by means of uniqueness analysis, expert achievement association and collision disambiguation, performs entity disambiguation on the large number of existing duplicate expert names, outputs the expert disambiguation results and establishes a knowledge base. It effectively solves the homonymous expert ambiguity problem widely existing in various texts in a given technical field, and is suitable for applications such as semantic search, question-answering systems, knowledge base expansion and heterogeneous knowledge base fusion.
An embodiment of the invention also provides a multi-source homonymous expert disambiguation method applied to the field of engineering science and technology. Based on disambiguation techniques such as expert attribute value judgment, name uniqueness analysis, expert achievement association and triangular nondirectional collision disambiguation matrix rules, it integrates a multi-source knowledge base containing expert entity definitions, performs entity disambiguation on the large number of existing duplicate expert names, determines the correct referent of each entity, determines the semantics of the entity, and establishes a unified expert library in the field of engineering science and technology.
The specific implementation process is as follows:
Data cleaning: cleaning and preprocessing, such as attribute value confirmation, are performed on the data of a plurality of expert libraries in the field of engineering science and technology, and expert attribute values that do not conform to conventional logic and are obviously erroneous are set to null.
Expert name uniqueness analysis: whether the expert name is unique is confirmed; if so, the expert is directly extracted and stored in the target library; if not, a judgment is made according to the expert achievement association result.
Achievement association: a relation library of experts and journal papers is established to associate experts with their journal paper achievements.
Rule intersection fusion of the triangular nondirectional collision disambiguation matrix: a collision disambiguation rule engine is established, and the collision disambiguation rules are intersected and fused based on the triangular nondirectional collision disambiguation matrix.
Fusion of duplicate expert data: experts are grouped according to the first character of the MD5 value of the expert name, experts with the same name are queried within each group, duplicate data is identified based on the collision disambiguation rules, and the duplicate data is fused.
Collision disambiguation processing: based on the intersection fusion result, collision disambiguation processing is performed according to attribute values such as the expert's birth date, mobile phone number, literature achievements, profile, research field and data source authority.
Disambiguation and storage of multi-source homonymous experts: the multi-source homonymous expert data processed by collision disambiguation is stored in the target library, and a unified expert library in the field of engineering science and technology is established.
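The order of the steps above can be summarized in a short orchestration sketch. The three callables `clean`, `is_same_expert` and `fuse` are hypothetical placeholders for the cleaning check, the collision disambiguation rules and the fusion step; the greedy clustering of same-name records is an assumption, since the source does not say how matching pairs are grouped.

```python
def disambiguate(records, clean, is_same_expert, fuse):
    """Clean, split by name uniqueness, cluster same-name records, and fuse."""
    cleaned = [clean(r) for r in records]

    # Group cleaned records by name, preserving first-seen order.
    by_name = {}
    for rec in cleaned:
        by_name.setdefault(rec["name"], []).append(rec)

    target_library = []
    for recs in by_name.values():
        if len(recs) == 1:
            # Unique name: store directly.
            target_library.append(recs[0])
            continue
        # Non-unique name: place each record in the first cluster that
        # contains a matching record, otherwise start a new cluster.
        clusters = []
        for rec in recs:
            for cluster in clusters:
                if any(is_same_expert(rec, other) for other in cluster):
                    cluster.append(rec)
                    break
            else:
                clusters.append([rec])
        target_library.extend(fuse(cluster) for cluster in clusters)
    return target_library
```

Passing the rule and fusion logic in as parameters keeps the sketch independent of any particular rule set.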
An embodiment of the invention also provides a multi-source homonymous expert disambiguation device, which comprises: at least one memory and at least one processor;
the at least one memory is used for storing a machine readable program;
the at least one processor is configured to invoke the machine readable program to perform the multi-source homonymous expert disambiguation method described in any of the above embodiments.
An embodiment of the present invention further provides a computer-readable medium storing computer instructions which, when executed by a processor, cause the processor to execute the multi-source homonymous expert disambiguation method described in the above embodiments of the present invention. Specifically, a system or an apparatus may be provided that is equipped with a storage medium on which software program code realizing the functions of any of the above-described embodiments is stored, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program code stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
While the invention has been shown and described in detail in the drawings and the preferred embodiments, the invention is not limited to the embodiments disclosed. Those skilled in the art will appreciate that the technical means in the various embodiments described above may be combined to obtain further embodiments of the invention, which also fall within the scope of protection of the invention.

Claims (10)

1. A multi-source homonymous expert disambiguation method, characterized in that expert data from a plurality of expert libraries is cleaned and expert name uniqueness analysis is performed;
expert achievement association and collision disambiguation processing are combined, a multi-source knowledge base containing expert entity definitions is integrated, and entity disambiguation is performed on the large number of existing duplicate expert names, so as to determine the correct referent of each entity, determine the semantics of the entity, and establish a unified expert library.
2. The multi-source homonymous expert disambiguation method of claim 1, wherein the cleaning processing comprises expert data attribute value confirmation, in which expert attribute values that do not conform to conventional logic and are obviously erroneous are set to null.
3. The multi-source homonymous expert disambiguation method of claim 1, wherein the expert name uniqueness analysis proceeds as follows:
if the name is unique, the expert is directly extracted and stored in the library;
if the name is not unique, a judgment is made according to the expert achievement association result, and the result obtained after collision disambiguation processing is stored.
4. The multi-source homonymous expert disambiguation method of claim 3, wherein a relation library of experts and journal papers is established to associate experts with their journal paper achievements, and the uniqueness of an expert name is then judged according to the expert achievement association result.
5. The multi-source homonymous expert disambiguation method of claim 1, 2 or 3, wherein the collision disambiguation processing adopts rule intersection fusion based on a triangular nondirectional collision disambiguation matrix:
a collision disambiguation rule engine is established, and the collision disambiguation rules are intersected and fused based on the triangular nondirectional collision disambiguation matrix.
6. The multi-source homonymous expert disambiguation method of claim 5, wherein experts are grouped according to the first character of the MD5 value of the expert name, experts with the same name are queried within each group, duplicate data is identified based on the collision disambiguation rules, and the duplicate data is fused.
7. The multi-source homonymous expert disambiguation method of claim 6, wherein, based on the intersection fusion result, collision disambiguation processing is performed according to expert data attribute values, the attribute values including birth date, mobile phone number, literature achievements, profile, research field and data source authority.
8. The multi-source homonymous expert disambiguation method of claim 1 or 2, wherein the plurality of expert libraries are expert libraries in the field of engineering science and technology; the multi-source homonymous experts processed by collision disambiguation are stored, and a unified expert library in the field of engineering science and technology is established.
9. A multi-source homonymous expert disambiguation device, comprising: at least one memory and at least one processor;
the at least one memory is used for storing a machine readable program;
the at least one processor is configured to invoke the machine readable program to perform the method of any of claims 1 to 8.
10. A computer readable medium, characterized in that it has stored thereon computer instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 8.
CN202011082199.9A 2020-10-12 2020-10-12 Multi-source homonymous expert disambiguation method Pending CN112183100A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011082199.9A CN112183100A (en) 2020-10-12 2020-10-12 Multi-source homonymous expert disambiguation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011082199.9A CN112183100A (en) 2020-10-12 2020-10-12 Multi-source homonymous expert disambiguation method

Publications (1)

Publication Number Publication Date
CN112183100A (en) 2021-01-05

Family

ID=73949172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011082199.9A Pending CN112183100A (en) 2020-10-12 2020-10-12 Multi-source homonymous expert disambiguation method

Country Status (1)

Country Link
CN (1) CN112183100A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113987113A (en) * 2021-06-25 2022-01-28 四川大学 Multi-site naming service fusion method and device, storage medium and server
CN113987113B (en) * 2021-06-25 2023-09-22 四川大学 Multi-station naming service fusion method, device, storage medium and server
CN113961652A (en) * 2021-12-22 2022-01-21 北京金堤科技有限公司 Information association method and device, computer storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210105