CN113761218A - Entity linking method, device, equipment and storage medium - Google Patents

Info

Publication number
CN113761218A
Authority
CN
China
Prior art keywords
entity
candidate
linked
disambiguation
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110461405.5A
Other languages
Chinese (zh)
Inventor
刘一仝
郑孙聪
周博通
费昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110461405.5A priority Critical patent/CN113761218A/en
Publication of CN113761218A publication Critical patent/CN113761218A/en
Pending legal-status Critical Current

Classifications

    • G06F16/367 — Ontology (creation of semantic tools for information retrieval of unstructured textual data)
    • G06F16/3344 — Query execution using natural language analysis
    • G06F16/335 — Filtering based on additional data, e.g. user or group profiles
    • G06F40/194 — Calculation of difference between files
    • G06F40/295 — Named entity recognition
    • G06F40/30 — Semantic analysis
    • G06N3/084 — Backpropagation, e.g. using gradient descent

Abstract

The application discloses an entity linking method, apparatus, device, and storage medium in the field of natural language processing. The method includes: obtaining, from a text to be recognized, an entity to be linked and the context corresponding to the entity to be linked; determining, according to the entity to be linked, a plurality of candidate entities in a knowledge graph and the entity information of each candidate entity; matching the context against the entity information corresponding to each candidate entity to obtain a first matching score; determining target candidate entities among the candidate entities according to the first matching score; determining a target entity among the target candidate entities based on a disambiguation score between the entity to be linked and each target candidate entity; and determining the target entity as the link entity corresponding to the entity to be linked.

Description

Entity linking method, device, equipment and storage medium
Technical Field
The present application relates to the field of natural language processing, and in particular, to a method, an apparatus, a device, and a storage medium for entity linking.
Background
Entity Linking (EL) technology has been a hot spot in the natural language processing field in recent years and plays a very important role in scenarios such as knowledge graph construction. Specifically, entity linking is a technique for mapping an entity appearing in a text to be recognized to a given knowledge graph, matching the entity to be recognized with an entity existing in the knowledge graph so that natural language tasks such as question answering, semantic search, and information extraction can be completed.
Entity linking can comprise two processes, entity recognition and entity disambiguation: entity recognition determines a plurality of candidate entities in the knowledge graph according to the entity to be recognized, and entity disambiguation selects, from all the candidate entities, the unique entity to which the entity to be recognized refers. Entity disambiguation is necessary because words are ambiguous, so matching must be performed according to the context in which the entity to be recognized appears.
Entity disambiguation methods based on pre-trained language models consider the entity information of all candidate entities in the knowledge graph. When there are many candidate entities, such a method must compute a disambiguation score for every one of them, which entails a large amount of computation and a long running time. How to reduce the computation required by a pre-trained language model for entity disambiguation has therefore become an urgent problem to be solved.
Disclosure of Invention
When determining the link entity of an entity to be linked in a knowledge graph, a plurality of candidate entities are first coarsely screened according to the context of the entity to be linked; the screened candidate entities and the entity to be linked are then input into an entity disambiguation model to obtain disambiguation scores between the entity to be linked and the candidate entities; and the link entity is finally determined among the candidate entities through the disambiguation scores.
In view of the above, an aspect of the present application provides a method for linking entities, including:
and acquiring the entity to be linked and the context content corresponding to the entity to be linked in the text to be identified.
And determining entity information of each candidate entity in a plurality of candidate entities in the knowledge graph according to the entity to be linked.
And matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score.
A target candidate entity of the plurality of candidate entities is determined based on the first match score.
Determining a target entity of the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entities.
And determining the target entity as a link entity corresponding to the entity to be linked.
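As an illustration only, the claimed flow above can be sketched with toy data; the knowledge graph, the scoring function (a simple Jaccard overlap standing in for the first matching score and the disambiguation score), and the threshold below are all assumptions, not the patent's actual implementation:

```python
# Toy stand-ins for the knowledge graph and the scores; all names and values
# here are illustrative assumptions, not the patent's actual implementation.
KG = {
    "apple": [
        {"name": "apple (fruit)", "info": "sweet edible fruit of the apple tree"},
        {"name": "apple (company)", "info": "consumer electronics technology company"},
    ]
}

def match_score(context, info):
    # Placeholder "first matching score": Jaccard overlap of the two token sets.
    a, b = set(context.lower().split()), set(info.lower().split())
    return len(a & b) / max(len(a | b), 1)

def link_entity(mention, context, threshold=0.05):
    candidates = KG.get(mention, [])
    # Coarse screening: keep candidates whose first matching score passes a threshold.
    targets = [c for c in candidates if match_score(context, c["info"]) >= threshold]
    if not targets:
        targets = candidates  # fall back so at least one candidate is scored
    # Disambiguation stand-in: pick the highest-scoring target candidate.
    return max(targets, key=lambda c: match_score(context, c["info"]))["name"]

print(link_entity("apple", "students like to eat a sweet fruit"))  # apple (fruit)
```

In a real implementation the final selection would use the entity disambiguation model's score rather than the same lexical overlap.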
Another aspect of the present application provides an entity linking apparatus, including:
an acquiring unit, configured to acquire, from a text to be recognized, an entity to be linked and the context corresponding to the entity to be linked;
a determining unit, configured to determine, according to the entity to be linked, a plurality of candidate entities in a knowledge graph and entity information of each candidate entity;
a matching unit, configured to match the context against the entity information corresponding to each candidate entity to obtain a first matching score;
the determining unit being further configured to determine a target candidate entity among the plurality of candidate entities according to the first matching score;
the determining unit being further configured to determine a target entity among the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entities; and
the determining unit being further configured to determine the target entity as the link entity corresponding to the entity to be linked.
In one possible design, the determining unit is specifically configured to: rank the entity information corresponding to a target candidate entity by relevance to the context; determine key information within that entity information according to the ranking result; input the entity to be linked, the context, the target candidate entity, and the key information into an entity disambiguation model; and obtain, through the entity disambiguation model, the disambiguation score between the entity to be linked and the target candidate entity.
In a possible design, the matching unit is specifically configured to segment the context and the entity information corresponding to each candidate entity, obtaining first tokens contained in the context and second tokens contained in the entity information of each candidate entity, then calculate the cosine similarity between the word vectors corresponding to the first tokens and the word vectors corresponding to the second tokens, and determine the first matching score according to the cosine similarity.
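A minimal sketch of this step, using toy word vectors; the vectors and the aggregation of many pairwise similarities into one score are assumptions, since the patent does not fix a specific aggregation here:

```python
import math

def cosine(u, v):
    # Cosine similarity between two word vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

# Toy 3-dimensional word vectors; a real system would use trained embeddings.
vec = {"fruit": [1.0, 0.2, 0.0], "apple": [0.9, 0.3, 0.1], "phone": [0.0, 0.1, 1.0]}

first_tokens = ["apple", "fruit"]    # tokens from the context
second_tokens = ["fruit", "phone"]   # tokens from the candidate's entity information

# One plausible aggregation (an assumption): average, over the context tokens,
# of each token's best cosine match among the entity-information tokens.
first_matching_score = sum(
    max(cosine(vec[c], vec[e]) for e in second_tokens) for c in first_tokens
) / len(first_tokens)
```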
In one possible design, the matching unit is specifically configured to obtain a similarity distribution vector for each candidate entity according to the cosine similarities, input the similarity distribution vector of each candidate entity into a first model, and obtain, through the first model, the first matching score between the context and the entity information of each candidate entity, the first model determining a distribution score according to the similarity distribution vector.
In a possible design, the matching unit is specifically configured to determine, according to the numerical values of the cosine similarities, the number of second tokens falling within each of a plurality of preset cosine similarity intervals, and to determine the similarity distribution vector of each candidate entity according to the number of second tokens in each preset interval.
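This binning can be sketched as follows; the interval boundaries below are assumed example values, not values given by the patent:

```python
def similarity_distribution(sims, bins=((-1.0, 0.0), (0.0, 0.5), (0.5, 1.0))):
    # Count how many second tokens fall into each preset cosine similarity interval.
    vector = [0] * len(bins)
    for s in sims:
        for i, (lo, hi) in enumerate(bins):
            if lo <= s < hi or (hi == 1.0 and s == 1.0):  # close the top interval
                vector[i] += 1
                break
    return vector

# Cosine similarities of one candidate's second tokens against the context tokens.
print(similarity_distribution([0.9, 0.3, -0.2, 1.0, 0.55]))  # prints [1, 1, 3]
```

The resulting vector is what would be fed to the first model to produce the distribution score.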
In a possible design, the acquiring unit is specifically configured to acquire word frequency information corresponding to the entity information of each candidate entity.
The determining unit is specifically configured to determine a weight value for each candidate entity according to the word frequency information, determine a second matching score for each candidate entity according to the weight value and the first matching score, and, if the second matching score exceeds a first preset threshold, determine that candidate entity as a target candidate entity.
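One way such a word-frequency weight could work is an IDF-style weight, where rarer entity-information tokens weigh more; everything below (the weighting formula, the toy corpus, and the threshold value) is an illustrative assumption rather than the patent's formula:

```python
import math

def idf_weight(entity_tokens, corpus):
    # IDF-style weight (an assumption): rarer entity-information tokens weigh more.
    n = len(corpus)
    def df(t):
        return sum(t in doc for doc in corpus)
    return sum(math.log((n + 1) / (df(t) + 1)) for t in entity_tokens) / len(entity_tokens)

# Toy corpus of token sets standing in for the word frequency information.
corpus = [{"fruit", "sweet"}, {"fruit", "tree"}, {"phone", "brand"}]

first_matching_score = 0.6
weight = idf_weight({"phone", "brand"}, corpus)
second_matching_score = weight * first_matching_score
is_target = second_matching_score > 0.3  # first preset threshold (assumed value)
```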
In one possible design, the determining unit is specifically configured to segment the context and the entity information corresponding to the target candidate entity, obtaining third tokens contained in the context and fourth tokens contained in the entity information of the target candidate entity; calculate the cosine similarity between the word vectors corresponding to the third tokens and the word vectors corresponding to the fourth tokens; sort the fourth tokens according to the numerical values of those cosine similarities; determine target tokens among the fourth tokens according to the sorting result, where the cosine similarity between a target token and the third tokens exceeds a second preset threshold; and determine the entity information corresponding to the target tokens as the key information.
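A sketch of this key-information selection, with a toy similarity function standing in for cosine similarity over word vectors; the similarity values and the threshold are assumptions:

```python
def select_key_tokens(context_tokens, entity_tokens, cosine, threshold=0.5):
    # Rank entity-information tokens by their best match against the context
    # and keep those above the second preset threshold (an assumed value).
    best = {e: max(cosine(c, e) for c in context_tokens) for e in entity_tokens}
    ranked = sorted(entity_tokens, key=lambda e: best[e], reverse=True)
    return [e for e in ranked if best[e] > threshold]

# Toy similarity over strings: 1.0 if equal, 0.8 if same first letter, else 0.1.
def toy_cos(a, b):
    if a == b:
        return 1.0
    return 0.8 if a[0] == b[0] else 0.1

print(select_key_tokens(["fruit", "food"], ["fruit", "firm", "phone"], toy_cos))
# ['fruit', 'firm']
```

The tokens kept here stand for the key information that is later fed, together with the mention and context, into the disambiguation model.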
In a possible design, the determining unit is specifically configured to perform similarity calculation on the context content and the key information through an entity disambiguation model, and determine a disambiguation score between the entity to be linked and the target candidate entity according to a calculation result of the similarity calculation.
In one possible design, the entity linking apparatus further includes a training unit. The acquiring unit is further configured to acquire a pre-trained language model comprising a word embedding layer and a task layer, where the word embedding layer represents the input corpus as word vectors and the task layer completes the pre-training language task.
And the training unit is specifically used for training the pre-training language model according to the first corpus data and updating the word embedding layer parameters of the pre-training language model.
And the determining unit is also used for determining the disambiguation model of the entity to be trained according to the updated word embedding layer parameters of the pre-training language model.
And the training unit is also used for training the entity disambiguation model to be trained according to second corpus data, and the second corpus data comprises the entity sample to be disambiguated and the labeling information of the entity sample to be disambiguated. And when the training result meets the training condition, obtaining the entity disambiguation model.
In one possible design, the acquiring unit is specifically configured to acquire the updated word embedding layer parameters of the pre-trained language model.
The determining unit is further configured to initialize the word embedding layer parameters of the entity disambiguation model to be trained according to the updated word embedding layer parameters of the pre-trained language model, where the entity disambiguation model to be trained comprises a word embedding layer and a task layer, and the parameters of its word embedding layer are the same as those of the word embedding layer of the pre-trained language model.
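The parameter hand-over can be sketched framework-free with plain parameter tables; the shapes and values below are illustrative assumptions:

```python
# The pre-trained model's parameters after updating on the first corpus data.
pretrained = {"embedding": [[0.1, 0.2], [0.3, 0.4]], "task_layer": [[9.0]]}

# Initialize the to-be-trained disambiguation model: the word embedding layer is
# copied from the pre-trained model, while the task layer starts fresh.
disambiguation_model = {
    "embedding": [row[:] for row in pretrained["embedding"]],  # same parameters
    "task_layer": [[0.0]],                                     # new task layer
}
```

Training on the second corpus data would then update both layers of the disambiguation model without touching the original pre-trained model.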
In a possible design, the determining unit is specifically configured to determine a target candidate entity as the target entity if the disambiguation score between the entity to be linked and that target candidate entity exceeds a third preset threshold, and, if none of the disambiguation scores corresponding to the target candidate entities exceeds the third preset threshold, to determine the target candidate entity with the highest disambiguation score as the target entity.
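A sketch of this selection rule; the threshold value and the scores are assumptions:

```python
def pick_target(disambiguation_scores, threshold=0.8):
    # threshold stands in for the "third preset threshold"; its value is assumed.
    above = {e: s for e, s in disambiguation_scores.items() if s > threshold}
    pool = above or disambiguation_scores  # fall back to all when none exceed it
    return max(pool, key=pool.get)         # highest-scoring candidate wins

print(pick_target({"apple (fruit)": 0.92, "apple (company)": 0.40}))
# apple (fruit)
```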
In one possible design, the entity linking apparatus further includes a receiving unit.
And the receiving unit is specifically used for receiving the text query instruction.
And the acquisition unit is also used for acquiring a query text according to the text query instruction, wherein the query text is a text to be identified.
And the determining unit is also used for determining at least one query result related to the query text according to the link entity and feeding back the at least one query result.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
In the embodiment of the application, an entity linking method is provided. When an entity disambiguation model is used to disambiguate candidate entities in a knowledge graph, the candidate entities are first coarsely screened according to the context of the entity to be linked, and only the screened candidate entities, together with the entity to be linked, are input into the entity disambiguation model. This reduces the matching workload of the entity disambiguation model and allows the disambiguation score of each candidate entity for the entity to be linked to be determined more efficiently, shortening the entity linking time and improving the completion efficiency of entity linking tasks.
Drawings
Fig. 1 is a schematic flowchart of an entity linking method according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of another entity linking method according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a method for screening candidate entities according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of candidate entity screening provided in the present application;
fig. 5 is a schematic flowchart of a method for acquiring an entity disambiguation model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a pre-training model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an entity linking apparatus according to an embodiment of the present application;
fig. 8 is a schematic diagram of an embodiment of a computer device according to an embodiment of the present application.
Detailed Description
When determining the link entity of an entity to be linked in a knowledge graph, a plurality of candidate entities are first coarsely screened according to the context of the entity to be linked; the screened candidate entities and the entity to be linked are then input into an entity disambiguation model to obtain disambiguation scores between the entity to be linked and the candidate entities; and the link entity is finally determined among the candidate entities through the disambiguation scores.
The terms "first," "second," "third," "fourth," and the like in the description, claims, and drawings of the present application, if any, are used to distinguish between similar elements and not necessarily to describe a particular sequence or chronological order. It is to be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the application described herein can, for example, be practiced in orders other than those illustrated or described. Furthermore, the terms "comprises" and "comprising" and any variations thereof are intended to cover a non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, article, or apparatus.
As network data grows exponentially, the internet has become the largest repository of knowledge, with large amounts of data presented in natural language. To understand the semantics of natural language on the internet, it must be connected with a knowledge base and annotated using the knowledge that base contains. Natural language is inherently ambiguous: the same entity may correspond to multiple names, and one name may correspond to multiple distinct entities. To annotate a natural language text, each entity appearing in it must therefore be matched with a unique entity (a piece of knowledge) in the knowledge base, and the key to this step is entity linking technology.
Entity linking maps certain strings (entity mentions) in natural language text to entities in a knowledge base, for example, mapping "apple" in the natural language text "students like to eat apples" to some entity in the knowledge base. Because the knowledge base contains synonyms and homonyms, this mapping process requires entity disambiguation: the technique of selecting, among all candidate entities in the knowledge graph, the entity that the mention actually refers to. In the above example, "apple" in "students like to eat apples" should refer to the entity "apple (fruit)" in the knowledge base rather than the entity "apple (electronic product)".
As can be seen from the above examples, the difficulty of entity linking lies in two areas: many words for one sense, and one word for many senses. Many-words-one-sense means that an entity may have multiple names — its standard name, aliases, abbreviations, and so on can all refer to it; for example, two different words for "potato" in the original language (rendered here as "potato" and "spud") refer to the same entity. One-word-many-senses means that one name may refer to several entities; to resolve this, entity information in the knowledge base is used for entity disambiguation. Entity linking therefore typically involves two steps: mention recognition and entity disambiguation. The first step is mention recognition; for example, a mention-entity dictionary may be constructed to manage the standard names and aliases of entities, and entities in the natural language text and candidate entities in the knowledge base are then identified according to this dictionary. The second step performs entity disambiguation on the candidate entities according to the entity information in the knowledge base and the content of the natural language text, to determine the final entity sequence.
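The mention-entity dictionary described above can be sketched as a plain mapping from names and aliases to knowledge-base entities; the identifiers below are made up for illustration:

```python
# Hypothetical mention-entity dictionary; the Q-ids are invented identifiers.
mention_dict = {
    "potato": ["Q10998"],        # standard name
    "spud": ["Q10998"],          # alias of the same entity (many words, one sense)
    "apple": ["Q89", "Q312"],    # one word, many senses: fruit vs. company
}

def candidates(mention):
    # Mention recognition step: look up the candidate entities for a mention.
    return mention_dict.get(mention, [])
```

A mention with several candidates (like "apple") is exactly the case that the disambiguation step must resolve.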
The knowledge base used for entity disambiguation may be a knowledge graph, a very large semantic network system whose aim is to record the relationships between entities or concepts. Through large-scale data collection, the knowledge graph is organized into a knowledge base that machines can process and that can be displayed visually. A knowledge graph consists of nodes and the edges connecting them, where nodes may be entities and edges represent the relationships between entities. For example, "apple" and "fruit" correspond to two nodes connected by an edge pointing from "apple" to "fruit"; the edge may be an "attribute", representing that an attribute of "apple" is "fruit".
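A minimal sketch of such a node-and-edge structure as (head, relation, tail) triples, using the "apple"/"fruit" example from the text (the second triple is an invented addition for illustration):

```python
# Minimal sketch of a knowledge graph as (head, relation, tail) triples.
triples = [("apple", "attribute", "fruit"), ("fruit", "subclass_of", "food")]

def neighbors(node):
    # Outgoing edges of a node, i.e. the relationships it participates in.
    return [(r, t) for h, r, t in triples if h == node]

print(neighbors("apple"))  # [('attribute', 'fruit')]
```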
Since the knowledge graph is the basis for machines to learn natural language and stores a large amount of entity information, its security must be strictly guaranteed so that its contents cannot easily be tampered with. Knowledge graphs may therefore be stored using blockchain technology to maintain their security and stability. The blockchain technique is briefly described below:
the blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (blockchain), which is essentially a decentralized database, is a string of data blocks associated by using cryptography, and each data block contains information of a batch of network transactions, which is used to verify the validity (anti-counterfeiting) of the information and generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform may include processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for the identity information of all blockchain participants, including maintenance of public/private key generation (account management), key management, and the correspondence between users' real identities and blockchain addresses (authority management); with authorization, it can supervise and audit the transactions of certain real identities and provide rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and, after consensus on a valid request is reached, record it to storage; for a new service request, the basic service first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; developers can define contract logic in a programming language, publish it to the blockchain (contract registration), and have execution triggered by keys or other events according to the logic of the contract clauses, completing the contract logic while also providing functions for upgrading and canceling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract setting, and cloud adaptation during product release, and for visual output of real-time states during product operation, such as alarms, monitoring of network conditions, and monitoring of node device health.
The platform product services layer provides the basic capabilities and implementation framework of typical applications; developers can build on these capabilities, superposing the characteristics of their business, to complete the blockchain implementation of their business logic. The application services layer provides blockchain-based application services for business participants to use.
As can be seen from the above description, the key to entity linking is entity disambiguation, which arises from the diversity and ambiguity inherent in natural language. Diversity means the same entity has different mentions in text: in one passage of natural language text, "flyer", "helper", and "MJ" may all refer to American basketball player A. Ambiguity means the same mention may refer to different entities in the knowledge graph: for example, A may refer to an American basketball player or to an Irish politician. In the prior art there are various methods for entity disambiguation; most commonly, a neural network is used to obtain a natural language model for entity disambiguation, and that model then completes the entity disambiguation task.
Specifically, an entity mention (the entity to be linked) in a natural language text may be determined first, and then all candidate entities whose names match the mention may be determined in the knowledge graph. Information about the candidate entities is obtained through the network relationships in the knowledge graph; finally, the mention, the context of the mention, the candidate entities, and the candidate entity information are input into a natural language model, which matches the candidate entity information against the context to find the entity information most relevant to it and determines the corresponding candidate entity as the target entity for the mention. In this method the natural language model must analyze all entity information of all candidate entities to obtain a disambiguation score for each candidate, which makes its computation huge and time-consuming. How to reduce the amount of computation for the natural language model has therefore become an urgent problem to be solved.
To address the above problems, embodiments of the present application provide an entity linking method. After the candidate entities of an entity to be linked are determined in the knowledge graph, the candidate entities are first coarsely screened according to the context of the entity to be linked; the entity information of the remaining candidates is then screened as well; and only the screened candidate entities and the entity to be linked are input into an entity disambiguation model to obtain the disambiguation result. Through the processes of coarse entity screening and information screening, the amount of data input to the entity disambiguation model can be reduced, thereby reducing the computation of the natural language model.
Before introducing the scheme of the application, the field of Natural Language Processing (NLP) is briefly introduced. NLP is an important direction in computer science and artificial intelligence that studies theories and methods enabling effective communication between humans and computers in natural language. It is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, the language people use every day, and is thus closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like.
Fig. 1 is a schematic flowchart of an entity linking method according to an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
S1. Determine the entity to be linked in the text to be recognized.

S2. Determine candidate entities in the knowledge graph according to the entity to be linked.

S3. Coarsely screen the candidate entities according to the context content of the entity to be linked.

S4. Screen the entity information of the candidate entities based on deep relevance ranking.

S5. Perform entity disambiguation with a natural language model based on the screened candidate entities and entity information.

S6. Output the disambiguation result.
When a piece of natural language text needs to be recognized, word segmentation can first be performed on the text to be recognized, and the entities to be linked (entity mentions) in the text are determined from the segmentation result. A plurality of candidate entities are then determined in a preselected knowledge graph according to the entity mentions, and the entity information of the candidate entities is acquired. Next, the plurality of candidate entities are coarsely screened according to the context content corresponding to the entity mention (the text to be recognized) and the entity information of the candidate entities; there are several specific coarse-screening processes, described in detail below. After the coarse screening, the entity information of the remaining candidate entities is further screened to obtain information more useful for entity disambiguation. Entity disambiguation is then performed by combining the screened candidate entities and entity information: specifically, a natural language model matches the context content of the entity mention against the entity information of each candidate entity, outputs a disambiguation result, and determines the unique target entity corresponding to the entity mention. Because the coarse screening of candidate entities and the screening of entity information take place before the data are input into the natural language model, the matching workload of the natural language model is reduced and the disambiguation score of each candidate entity for the entity to be linked can be determined more efficiently, which reduces the computation and duration of entity linking and improves the efficiency of completing entity linking tasks.
The entity linking method is described in detail below. Fig. 2 is a schematic flowchart of an entity linking method according to an embodiment of the present application; as shown in fig. 2, the method includes the following steps.
201. Acquire the entity to be linked and the context content of the entity to be linked in the text to be recognized.

When a machine needs to perform semantic recognition on a natural language text, at least one entity in the text to be recognized must be determined. Because of the diversity and ambiguity of natural language, the machine first needs to solve the diversity problem, i.e., determine the entity to be linked in the text to be recognized. For example, a word segmentation operation may be performed on the text, and the entity to be linked determined from the resulting words. Suppose a text to be recognized is "Liu X will hold a solo concert; it is reported that the Iron-Lung Singer will sing more than 20 songs, fully displaying the charm of the Heavenly King." By segmenting this text, it can be determined that the words "Liu X", "Iron-Lung Singer", and "Heavenly King" all refer to the film and pop star Liu X, so one entity to be linked in the text is "Liu X".

Illustratively, the diversity problem of natural language can be solved with a mention-entity dictionary, a knowledge base that stores the various mentions of each entity, including its standard name, aliases, name abbreviations, and the like. When determining the entity to be linked in the text to be recognized, the final entity mention can be determined by matching the words of the text against the standard names, aliases, abbreviations, and the like stored in the mention-entity dictionary.
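The mention-entity dictionary lookup described above can be sketched as a plain mapping from every known surface form of an entity (standard name, alias, nickname) to a canonical entity identifier. The names and identifiers below are illustrative assumptions, not values from the application:

```python
# Minimal sketch of a mention-entity dictionary: every surface form
# (standard name, alias, nickname) maps to the same canonical entity id.
# All names and ids here are illustrative.
MENTION_DICT = {
    "Liu X": "ENT_liu_x",
    "Iron-Lung Singer": "ENT_liu_x",
    "Heavenly King": "ENT_liu_x",
}

def find_mentions(tokens):
    """Return (surface form, canonical entity id) pairs for the tokens
    whose surface form appears in the dictionary."""
    return [(t, MENTION_DICT[t]) for t in tokens if t in MENTION_DICT]

print(find_mentions(["Liu X", "will", "hold", "a", "concert"]))
# [('Liu X', 'ENT_liu_x')]
```

In practice the dictionary would be far larger and the matching would handle overlapping multi-word mentions, but the data shape is the same.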
When the entity to be linked has been determined, the machine still needs to solve the ambiguity problem of natural language. The essence of the ambiguity problem is that one word can carry multiple meanings, so it cannot be resolved from the surface characters alone; the meaning of the entity mention must be determined in combination with its context. Therefore, the context content of the entity to be linked is acquired from the text to be recognized, and the entity to be linked is disambiguated in combination with that context.
202. Determine a plurality of candidate entities and the entity information of each candidate entity in the knowledge graph according to the entity to be linked.

After the entity to be linked in the text to be recognized is determined, a plurality of candidate entities are first determined in the knowledge graph according to the entity mention, and the entity information corresponding to each candidate entity is acquired by using the edge relations in the knowledge graph. Continuing the above example, when the entity to be linked is determined to be "Liu X", the candidate "Liu X" entities in the knowledge graph can be determined from that name. Since the knowledge graph contains a large amount of entity information, suppose it includes two "Liu X" entities: both need to be determined as candidate entities, and the entity information corresponding to each of them is then acquired. For example, the entity information may include nationality, occupation, age, gender, and the like.
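Under the assumption that entity information hangs off each entity node as attribute edges, the candidate lookup in this step can be sketched as follows; the graph contents and identifiers are illustrative:

```python
# Sketch: entities with their attribute edges, plus an index from a surface
# form to every entity bearing that name. All data here is illustrative.
KNOWLEDGE_GRAPH = {
    "Liu_1": {"occupation": "singer", "nationality": "CN", "gender": "male"},
    "Liu_2": {"occupation": "bus driver", "nationality": "CN", "gender": "male"},
}
SURFACE_INDEX = {"Liu X": ["Liu_1", "Liu_2"]}

def candidates_with_info(mention):
    """Return every candidate entity for the mention together with the
    entity information read off its edges in the knowledge graph."""
    return {eid: KNOWLEDGE_GRAPH[eid] for eid in SURFACE_INDEX.get(mention, [])}

cands = candidates_with_info("Liu X")
print(sorted(cands))  # ['Liu_1', 'Liu_2']
```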
203. Match the context content of the entity to be linked with the entity information of each candidate entity to obtain a first matching score.

After the plurality of candidate entities is obtained, they need to be coarsely screened; the purpose is to delete candidate entities with extremely low relevance, thereby reducing the workload of the subsequent entity disambiguation process. Specifically, the candidate entities may be screened according to the context content of the entity to be linked and the entity information of each candidate entity: the first matching score is obtained by determining the degree of matching between the context content of the entity to be linked and the entity information of each candidate entity.

204. Determine a target candidate entity among the plurality of candidate entities based on the first matching score.

After the first matching score between the context content of the entity to be linked and the entity information of each candidate entity is determined, the candidate entities may be screened according to that score. For example, the candidate entities may be sorted by matching score, and a preset number of the highest-scoring candidates taken as target candidate entities. Alternatively, the target candidate entities may be selected against a preset threshold: if the first matching score of a candidate entity is higher than the preset threshold, that candidate entity is determined to be a target candidate entity.
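The two selection strategies just described (top-k by score, or a preset threshold) can be sketched in a few lines; the threshold value and scores are arbitrary placeholders:

```python
def select_target_candidates(first_scores, threshold=None, top_k=None):
    """first_scores: {candidate_id: first matching score}. Keep either
    every candidate above the preset threshold, or the top_k best-scoring
    candidates, in descending score order."""
    ranked = sorted(first_scores, key=first_scores.get, reverse=True)
    if threshold is not None:
        return [c for c in ranked if first_scores[c] > threshold]
    return ranked[:top_k]

scores = {"Liu_1": 0.82, "Liu_2": 0.15}
print(select_target_candidates(scores, threshold=0.5))  # ['Liu_1']
print(select_target_candidates(scores, top_k=2))        # ['Liu_1', 'Liu_2']
```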
In the above example, the entity to be linked, "Liu X", is matched against the two candidate "Liu X" entities. Suppose the occupation of one candidate is bus driver: its matching score, determined in combination with "hold a solo concert" in the context content of the entity to be linked, falls below the preset threshold, so this candidate is deleted, and the other candidate "Liu X" is determined to be the target candidate entity.
205. Rank the entity information of the target candidate entity by relevance according to the context content of the entity to be linked.

After the coarse screening, the entity information of the target candidate entities needs to be ranked by relevance. Each entity in the knowledge graph carries a large amount of entity information, and for a given context the degree of association of each piece of information with that context differs. Performing entity disambiguation against all the entity information of a target candidate entity would impose a heavy workload and make disambiguation extremely inefficient; therefore, before entity disambiguation, the entity information of the target candidate entity is ranked by relevance, and the information most related to the context content of the entity to be linked is screened out for the subsequent disambiguation process.

206. Determine the key information of the target candidate entity according to the relevance ranking result.

In combination with the preceding step, the key information of the target candidate entity is determined from the relevance ranking result, and entity disambiguation is performed using that key information. Illustratively, in the above example, the context content of the entity to be linked "Liu X" includes "hold a solo concert" and "sing", from which it can be seen that the occupation attribute is most strongly associated with the context. The entity information of the target candidate entity can therefore be ranked by degree of association, and the occupation information determined as the key information.
207. Acquire a disambiguation score between the entity to be linked and the target candidate entity according to the entity to be linked, the context content of the entity to be linked, the target candidate entity, and the key information of the target candidate entity.

After the candidate entities and their entity information have been screened, the entity to be linked, the context content of the entity to be linked, the target candidate entity, and the key information of the target candidate entity are input into the final entity disambiguation model to obtain a disambiguation score and complete the entity disambiguation process.

It will be appreciated that the entity disambiguation model is a natural language processing model and may include a word embedding layer and a task layer. The word embedding layer produces a good word vector representation of the input natural language; the task layer operates on the word vectors to obtain the final task result. In the entity disambiguation model, the task of the task layer is entity disambiguation: it determines the relevance between word vectors, evaluates the relevance between the entity to be linked and the candidate entity, and obtains a disambiguation score. The disambiguation score is an evaluation of the correlation between the entity to be linked and a candidate entity; according to the disambiguation scores, the target entity corresponding to the entity to be linked can finally be determined among the target candidate entities.
208. Determining a target entity of the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entities.
For example, a third preset threshold may be set in advance; if the disambiguation score between the entity to be linked and a target candidate entity exceeds the third preset threshold, that target candidate entity is determined to be the target entity. If none of the disambiguation scores between the entity to be linked and the target candidate entities exceeds the third preset threshold, the target candidate entity with the highest disambiguation score may be determined to be the target entity.
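The threshold-with-fallback rule in this step can be sketched as follows; the threshold value is an assumption:

```python
def pick_target_entity(disamb_scores, third_threshold=0.5):
    """disamb_scores: {target_candidate_id: disambiguation score}.
    A candidate above the third preset threshold is the target entity;
    if none qualifies, fall back to the highest-scoring candidate."""
    above = [c for c, s in disamb_scores.items() if s > third_threshold]
    if above:
        # if several candidates exceed the threshold, keep the best of them
        return max(above, key=disamb_scores.get)
    return max(disamb_scores, key=disamb_scores.get)

print(pick_target_entity({"Liu_1": 0.91, "Liu_3": 0.30}))  # Liu_1
print(pick_target_entity({"Liu_1": 0.40, "Liu_3": 0.30}))  # Liu_1 (fallback)
```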
209. Determine the target entity as the link entity corresponding to the entity to be linked.

After the target entity corresponding to the entity to be linked is determined, the two can be linked to complete the related natural language task, such as semantic search or keyword recognition. For example, the machine performs the entity linking process on the basis of a received text query instruction; after determining the link entity corresponding to the entity to be linked in the text to be recognized, it can, in response to the text query instruction, obtain a plurality of query results from the link entity in the knowledge graph (all drawn from the entity information of the link entity), and then feed back and display those query results.

In the embodiment of the application, an entity linking method is provided. When an entity disambiguation model is used to disambiguate candidate entities in a knowledge graph, the plurality of candidate entities is first coarsely screened according to the context content of the entity to be linked, the entity information of the remaining candidate entities is then screened, and finally the screened candidate entities and entity information are input into the entity disambiguation model to obtain the final entity disambiguation result, so that the input data amount and the workload of the entity disambiguation model are reduced and the efficiency of entity linking is improved.
The two processes of coarse screening of candidate entities and selection of key information of target candidate entities will be described in detail below with reference to the embodiment shown in fig. 2. Fig. 3 is a schematic flowchart of a method for screening candidate entities according to an embodiment of the present application, including:
301. Perform word segmentation on the context content of the entity to be linked to obtain the first participles included in the context content.

In the coarse screening of candidate entities, word segmentation is first performed on the context content of the entity to be linked, and the contextual information is represented by the first participles included in the context content. Fig. 4 is a structural schematic diagram of candidate entity screening according to an embodiment of the present application; as shown in fig. 4, the context content is segmented into a plurality of first participles, and each first participle is matched against the entity information corresponding to the candidate entities.
302. Perform word segmentation on the entity information corresponding to the candidate entity to obtain the second participles included in the entity information.

Similarly, the entity information corresponding to the candidate entity is also segmented. As shown in fig. 4, a candidate entity may include multiple pieces of entity information; the machine segments each piece of entity information to obtain a plurality of second participles, and each second participle is matched against each first participle included in the context content of the entity to be linked.

303. Calculate the cosine similarity between the word vector corresponding to each first participle and the word vector corresponding to each second participle.

In the specific matching process, each first participle and each second participle is represented by a word vector, and the cosine similarity between the word vector of each first participle and the word vector of each second participle is calculated in turn. It can be understood that if there are M first participles and N second participles, M × N cosine similarity calculations are needed, yielding M × N cosine similarity values.
304. Count the number of cosine similarity values falling in each preset cosine similarity interval.

After the plurality of cosine similarity values is obtained, it needs to be represented statistically. Since cosine similarity lies in the range [-1, 1], that range can be divided into intervals by a preset step size; for example, a step size of 0.2 divides [-1, 1] into 10 intervals. The number of cosine similarity values falling into each interval is then counted. For example, with 5 first participles and 10 second participles there are 50 similarity values, and the count for each interval is recorded, e.g., 10 values falling in [-1, -0.8].
305. Determine the similarity distribution vector corresponding to the candidate entity according to the counts in the cosine similarity intervals.

After the number of cosine similarity values in each interval is counted, the similarity distribution vector corresponding to the candidate entity is determined from the statistics. For example, in the above example, 10 values fall in [-1, -0.8], 2 values fall in [-0.8, -0.6], 0 values fall in [-0.6, -0.4], and so on. It can be understood that the dimension of the similarity distribution vector depends on the step size of the cosine similarity intervals: with a step size of 0.2 it is a ten-dimensional vector, and with a step size of 0.1 a twenty-dimensional vector.
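Steps 303 to 305 can be sketched end to end: all M × N cosine similarities are computed and binned over [-1, 1] with a preset step size, giving the similarity distribution vector. The toy word vectors below are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def similarity_distribution(first_vecs, second_vecs, step=0.2):
    """Bin the M*N cosine similarities over [-1, 1]; with step 0.2 the
    result is the ten-dimensional similarity distribution vector."""
    n_bins = int(round(2 / step))
    hist = [0] * n_bins
    for u in first_vecs:
        for v in second_vecs:
            # clamp a similarity of exactly +1.0 into the last bin
            idx = min(int((cosine(u, v) + 1) / step), n_bins - 1)
            hist[idx] += 1
    return hist

# one context-word vector vs. two entity-information vectors
print(similarity_distribution([(1.0, 0.0)], [(1.0, 0.0), (-1.0, 0.0)]))
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```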
306. Determine the first matching score between the context content of the entity to be linked and the entity information of the candidate entity according to the similarity distribution vector.

Once the similarity distribution vector is determined, it can be input into a first model to obtain the first matching score. It can be understood that the first model evaluates the distribution of the cosine similarities from the similarity distribution vector, i.e., it determines a distribution score from the vector and uses that score to assess the overall similarity. The first model may also simply be a fully connected layer of a model, into which the similarity distribution vector is input to obtain the distribution score; this is not specifically limited.
307. Acquire the word frequency information corresponding to the entity information of the candidate entity, and determine the weight value corresponding to the candidate entity according to the word frequency information.

The word frequency information describes properties of the first participles included in the context and the second participles included in the entity information. For a first participle it includes the frequency with which the participle appears in the context and its inverse document frequency with respect to the context, which together reflect how important the participle is for the semantic understanding of the context. The main idea is that if a word appears frequently in one article but rarely in other articles, the word or phrase has good discriminative power and is suitable for classification and recognition. Similarly, the frequency of a second participle in the entity information and its inverse document frequency with respect to the entity information reflect the importance of that participle to the entity information. The second participles of each candidate entity can be evaluated with this word frequency information, and the weight value of each candidate entity determined from the evaluation result.
308. Determine a second matching score corresponding to the candidate entity according to the weight value and the first matching score.

A second matching score can be obtained by combining the weight value of each candidate entity with the first matching score of that candidate entity for the entity to be linked. The candidate entities are then screened according to the second matching score.
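The application does not fix a formula for turning the word frequency information into a weight. One plausible reading, sketched below under that assumption, weights each candidate by the mean tf-idf of its entity-information participles and folds the weight into the first matching score:

```python
import math

def tf_idf(term, doc, corpus):
    """doc and corpus entries are token lists; classic tf-idf."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / (1 + df))

def second_match_score(first_score, info_tokens, corpus):
    """Assumed combination (not specified by the application): the weight
    is the mean tf-idf of the candidate's second participles, and the
    second matching score is weight * first matching score."""
    terms = set(info_tokens)
    weight = sum(tf_idf(t, info_tokens, corpus) for t in terms) / len(terms)
    return weight * first_score

corpus = [["singer", "concert"], ["driver", "bus"], ["actor", "film"]]
score = second_match_score(0.8, ["singer", "concert"], corpus)
print(round(score, 4))
```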
309. Determine the target candidate entities among the candidate entities according to the second matching score.

Specifically, a first preset threshold may be set; if the second matching score of a candidate entity exceeds the first preset threshold, that candidate entity is determined to be a target candidate entity. Alternatively, the priority levels of the plurality of candidate entities may be determined according to the second matching scores, and the target candidate entities determined according to those priority levels.

In this embodiment, word segmentation is performed on the context content corresponding to the entity to be linked and on the entity information of the candidate entities; a similarity distribution vector is then obtained from the cosine similarities between the word vectors of the first and second participles; the degree of correlation between the context content and the entity information is evaluated from the similarity distribution vector; and finally the candidate entities highly correlated with the context content are screened out as the target candidate entities.
After the coarse screening of the candidate entities is completed and the target candidate entities are obtained, the entity information of the target candidate entities also needs to be screened. The purpose is to select, as comparison information for subsequent entity disambiguation, the key information most relevant to the context content corresponding to the entity to be linked. Specifically, the entity information of the candidate entities can be ranked based on the idea of deep relevance ranking, and the important information then selected as key information, according to the ranking result, for input into the entity disambiguation model.

Deep relevance ranking may likewise be based on cosine similarity. First, word segmentation is performed on the context content corresponding to the entity to be linked to obtain a plurality of third participles. It can be understood that the third participles may be the same as or different from the first participles of the coarse screening process; the specific segmentation result depends on the algorithm used for the word segmentation operation, which is not limited here. Word segmentation is then performed on the entity information of the target candidate entity to obtain fourth participles, which likewise may be the same as or different from the second participles of the coarse screening process.

After the third and fourth participles are obtained, their cosine similarities are calculated in turn: the word vectors of the third and fourth participles are determined, and the cosine similarity between the word vector of each third participle and that of each fourth participle is computed. The plurality of fourth participles is then ranked by the resulting cosine similarity values, the fourth participles highly similar to the third participles are screened out, and the entity information corresponding to those fourth participles is determined as the key information of the target candidate entity.
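The key-information selection just described can be sketched as follows: each entity-information field is scored by the best cosine similarity between any of its fourth participles and any third participle of the context, and the top-scoring fields are kept. Field names and word vectors are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def select_key_info(third_vecs, info_fields, top_k=1):
    """info_fields: {field name: word vectors of its fourth participles}.
    Score each field by its best similarity to any context participle
    and keep the top_k fields as key information."""
    field_score = {
        name: max(cosine(u, v) for u in third_vecs for v in vecs)
        for name, vecs in info_fields.items()
    }
    return sorted(field_score, key=field_score.get, reverse=True)[:top_k]

context = [(0.9, 0.1)]                  # e.g. "concert", "sing"
fields = {"occupation": [(1.0, 0.0)],   # close to the context direction
          "age": [(0.0, 1.0)]}          # nearly orthogonal to it
print(select_key_info(context, fields))  # ['occupation']
```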
By screening the entity information of the target candidate entity in this way, the key information most relevant to the context content of the entity to be linked can be selected for entity disambiguation, so the subsequent disambiguation process does not need to compare all the entity information of the target candidate entity. This reduces the workload of the entity disambiguation model, improves the efficiency of the whole entity linking process, and shortens its duration.
The entity disambiguation model obtains the disambiguation score between a candidate entity and the entity to be linked by comparing the entity information of the candidate entity with the context content of the entity to be linked. The disambiguation score measures the degree of matching between the two; the goal is to find, among the candidate entities, a target entity highly consistent with the entity to be linked, i.e., a target entity that refers to the same real-world entity. To improve the accuracy of the disambiguation result, entity disambiguation may be performed with a natural language model; specifically, the entity disambiguation model may be obtained on the basis of a pre-trained model.
The current mainstream natural language understanding models are fine-tuned models built on pre-trained models. Using publicly available unlabeled data, a model is first pre-trained on tasks such as masked language modeling (MLM) or next sentence prediction (NSP). A downstream network structure is then designed for the specific natural language understanding task, the whole network is fine-tuned with the labeled data of that task, and a neural network model for the specific task is finally obtained. Accordingly, to obtain the entity disambiguation model, a pre-trained model is first acquired, a downstream network structure is designed for the entity disambiguation task, the task layer of the trained pre-trained model is replaced to form the entity disambiguation model, and the entity disambiguation model is then trained on entity disambiguation sample data; when the training requirement is met, the final entity disambiguation model is obtained.
By function, the network structure of a natural language model can be divided into a word embedding layer, a transformer layer, and a downstream task layer. The word embedding layer segments the input natural language and converts words into word vectors. The transformer layer processes and understands the input natural language data, determining the logical relations among the words and word sequences. The downstream task layer is specific to the natural language task at hand; different tasks have different downstream network structures. It operates on the word vectors, via the logical relations established by the transformer layer, to produce the task-specific output for the input natural language. The word embedding layer thus mainly enables the machine to recognize text data by vectorizing it, and is independent of the specific task; the transformer layer and the downstream task layer must understand and react to the input text to produce an output, so they are closely tied to the specific language task and can collectively be called the task layer.

Because a natural language model has many network layers, randomly initializing the parameters of every layer would require a large amount of training data and make training extremely slow. At present, a pre-training approach is therefore generally adopted to obtain a natural language model for a specific task. First, a network model is pre-trained on a pre-task with a large training data set, yielding a pre-trained model that has learned some language knowledge. The parameters of the task model (the natural language model) are then initialized from part of the pre-trained model's network parameters, so that the task model is trained on top of the pre-trained model. This reduces the amount of training data the task model needs, accelerates its convergence, and improves the efficiency of obtaining it.
The following describes a training process of the entity disambiguation model based on a pre-training model, and fig. 5 is a schematic flow chart of a method for acquiring the entity disambiguation model according to an embodiment of the present application, including:
501. Acquire a pre-trained model.

The network structure of the pre-trained model can be divided into a first word embedding layer and a first task layer. The first word embedding layer converts the input corpus into word vectors; the first task layer is tied to the pre-task of the pre-trained model and operates on the word vectors to produce the output for the input corpus. For example, the pre-trained model may be a model pre-trained on tasks such as MLM or NSP, and the primary purpose of the first word embedding layer is to produce a good word vector representation of the input corpus.
502. Train the pre-trained model with the unlabeled corpus.

When a natural language model for entity disambiguation is obtained from the pre-trained model, the pre-trained model first needs to be fine-tuned with an entity disambiguation corpus, so that it learns knowledge of entity disambiguation in a more targeted way while the training speed is improved.

A sample in the entity disambiguation corpus includes an entity to be linked, the context of the entity to be linked, a candidate entity, and the entity information of the candidate entity. The entity disambiguation corpus is input into the pre-trained model so that the model acquires knowledge of entity disambiguation and its word embedding layer can conveniently represent the corpus as word vectors.
503. Determine the word embedding layer parameters of the trained pre-trained model.

After the training of the pre-trained model is finished, its word embedding layer can represent the input entity disambiguation corpus well. The word embedding layer parameters of the pre-trained model are therefore acquired and used to initialize the word embedding layer parameters of the entity disambiguation model, completing the transfer of knowledge; this accelerates the subsequent training of the entity disambiguation model and allows it to converge quickly.
504. Determine the word embedding layer parameters of the entity disambiguation model according to the word embedding layer parameters of the trained pre-trained model.

A target task layer structure is designed for the entity disambiguation task, so that the target task layer can match the context content of the entity to be linked against the entity information of the candidate entities and evaluate their degree of similarity. The word embedding layer structure of the pre-trained model and the target task layer structure are then combined to form the structure of the entity disambiguation model, the word embedding layer parameters of the entity disambiguation model are initialized with those of the pre-trained model, and the model to be trained is finally obtained.
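The parameter transfer in this step can be sketched with plain dictionaries standing in for real weight tensors; the structure and sizes are illustrative:

```python
import copy
import random

def build_disambiguation_model(pretrained_params, task_dim=4, seed=0):
    """Reuse the pre-trained model's word embedding parameters (a deep
    copy, so later fine-tuning does not touch the source) and attach a
    freshly, randomly initialized task layer for entity disambiguation."""
    rng = random.Random(seed)
    return {
        "embedding": copy.deepcopy(pretrained_params["embedding"]),
        "task": {"w": [rng.uniform(-0.1, 0.1) for _ in range(task_dim)]},
    }

pretrained = {"embedding": {"w": [0.1, 0.2, 0.3]}, "task": {"w": [0.9]}}
model = build_disambiguation_model(pretrained)
print(model["embedding"])  # {'w': [0.1, 0.2, 0.3]}
```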
505. And training the entity disambiguation model by using the entity disambiguation sample to obtain the trained entity disambiguation model.
Specifically, an entity disambiguation sample comprises an entity to be linked, the context content of the entity to be linked, a candidate entity, the entity information of the candidate entity, and the label information of the entity to be linked. The training process comprises: inputting the entity to be linked, the context content of the entity to be linked, the candidate entity, and the entity information of the candidate entity into the entity disambiguation model to obtain an entity disambiguation result; calculating the loss between the entity disambiguation result and the label information of the entity to be linked; propagating the loss backward; and adjusting the parameters of the entity disambiguation model. After multiple backward iterations, the training process ends when a training condition is met, for example, when a preset number of training iterations is reached or the loss value falls below a preset threshold, yielding the final entity disambiguation model.
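The loop just described (forward pass, loss against the labels, backward update, stop condition) can be illustrated with a deliberately tiny one-parameter model. The learning rate, squared-error loss, and stopping values below are assumptions for illustration, not values from the patent:

```python
def train(samples, lr=0.1, max_epochs=200, loss_threshold=1e-3):
    """Fit score = w * x by gradient descent on mean squared loss.

    Mirrors the patent's training loop: compute a result, measure the loss
    against the labels, propagate it backward, and stop once a training
    condition (epoch budget or loss threshold) is met.
    """
    w = 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < loss_threshold:        # training condition met: stop early
            break
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad                   # backward parameter adjustment
    return w, loss

w, final_loss = train([(1.0, 2.0), (2.0, 4.0)])  # true relation: y = 2x
```

The same skeleton applies when the model is a full neural disambiguation network; only the forward computation and gradient step grow more complex.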
By the method, the entity disambiguation model can be obtained based on the pre-training model, the training process of the entity disambiguation model is accelerated by knowledge migration, and the training efficiency of the entity disambiguation model is improved.
FIG. 6 is a schematic structural diagram of a pre-training model according to an embodiment of the present application. The context content of the entity to be linked, the candidate entity, and the entity information of the candidate entity are input into the pre-trained model BERT, which compares the context content of the entity to be linked with the entity information of the candidate entity to obtain a final disambiguation score.
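One common way to feed such a pair of text segments into BERT is to concatenate them with special tokens. The `[CLS] ... [SEP] ... [SEP]` layout below is an assumption about the input format sketched in FIG. 6, not a detail stated in the patent:

```python
def build_bert_input(context, candidate, entity_info):
    """Concatenate the two text segments in the sequence-pair format BERT
    expects; the exact segment layout here is an assumption."""
    return f"[CLS] {context} [SEP] {candidate}: {entity_info} [SEP]"

seq = build_bert_input(
    "Apple released a new phone",          # context of the entity to be linked
    "Apple Inc.",                          # candidate entity
    "American technology company")         # entity information of the candidate
```

The score attached to the `[CLS]` position of such a sequence would then serve as the disambiguation score for that candidate.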
Fig. 7 is a schematic structural diagram of an entity linking apparatus according to an embodiment of the present application, including:
an obtaining unit 701, configured to obtain an entity to be linked and context content corresponding to the entity to be linked in a text to be identified.
A determining unit 702, configured to determine, according to the entity to be linked, entity information of each of a plurality of candidate entities and a plurality of candidate entities in a knowledge graph.
The matching unit 703 is configured to match the context content with the entity information corresponding to each candidate entity to obtain a first matching score.
The determining unit 702 is further configured to determine a target candidate entity of the plurality of candidate entities according to the first matching score.
The determining unit 702 is further configured to determine a target entity in the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entities.
The determining unit 702 is further configured to determine the target entity as a link entity corresponding to the entity to be linked.
In a possible design, the determining unit 702 is specifically configured to: perform relevance ranking on the entity information corresponding to the target candidate entity according to the context content; determine key information in the entity information corresponding to the target candidate entity according to the relevance ranking result; and input the entity to be linked, the context content, the target candidate entity, and the key information into an entity disambiguation model, so as to obtain the disambiguation score between the entity to be linked and the target candidate entity through the entity disambiguation model.
In a possible design, the matching unit 703 is specifically configured to perform word segmentation on the context content and the entity information corresponding to each candidate entity to obtain a first word segment included in the context content and a second word segment included in the entity information corresponding to each candidate entity. And calculating cosine similarity of the word vector corresponding to the first word segmentation and the word vector corresponding to the second word segmentation, and determining a first matching score according to the cosine similarity.
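The cosine-similarity matching performed by the matching unit can be sketched as below. The word vectors are hypothetical toy values, and taking the maximum pairwise similarity as the first matching score is one assumed aggregation choice among several possible ones:

```python
import math

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy word vectors (hypothetical values, not from any real embedding model).
vectors = {"apple": [1.0, 0.0], "fruit": [0.8, 0.6], "phone": [0.0, 1.0]}
first_segs = ["apple"]            # word segments from the context content
second_segs = ["fruit", "phone"]  # word segments from a candidate's entity information

sims = [cosine(vectors[a], vectors[b]) for a in first_segs for b in second_segs]
first_matching_score = max(sims)  # one simple way to aggregate into a score
```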
In a possible design, the matching unit 703 is specifically configured to obtain a similarity distribution vector corresponding to each candidate entity according to cosine similarity, input the similarity distribution vector corresponding to each candidate entity into a first model, obtain a first matching score between context content and entity information corresponding to each candidate entity through the first model, and the first model is configured to determine a distribution score according to the similarity distribution vector.
In a possible design, the matching unit 703 is specifically configured to determine the number of second participles in each preset cosine similarity interval according to the numerical value of the cosine similarity, where a plurality of cosine similarity intervals are preset for the cosine similarity, and to determine the similarity distribution vector corresponding to each candidate entity according to the number of second participles in each preset cosine similarity interval.
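Building the similarity distribution vector amounts to a histogram over preset intervals. The interval bounds below are assumed values for illustration; the patent only states that a plurality of intervals is preset:

```python
def similarity_distribution(similarities,
                            bins=((-1.0, 0.0), (0.0, 0.5), (0.5, 1.0))):
    """Count how many second participles fall into each preset cosine
    similarity interval; the resulting counts form the distribution vector."""
    vec = [0] * len(bins)
    for s in similarities:
        for i, (lo, hi) in enumerate(bins):
            # half-open intervals, with 1.0 folded into the last bin
            if lo <= s < hi or (s == hi == 1.0):
                vec[i] += 1
                break
    return vec

# Cosine similarities of one candidate's second participles (toy values).
dist = similarity_distribution([0.92, 0.61, 0.18, -0.2])
```

This vector would then be input into the first model, which maps the distribution to a first matching score.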
In a possible design, the obtaining unit 701 is specifically configured to obtain word frequency information corresponding to entity information of each candidate entity.
The determining unit 702 is specifically configured to determine a weight value corresponding to each candidate entity according to the word frequency information, and determine a second matching score corresponding to each candidate entity according to the weight value and the first matching score. And if the second matching score exceeds a first preset threshold value, determining the candidate entity as a target candidate entity.
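One plausible reading of the word-frequency weighting is an IDF-style scheme in which rarer entity-information terms yield a larger weight. The smoothed-IDF formula and the threshold value below are assumptions for illustration, not the patent's stated formula:

```python
import math

def idf_weight(doc_freq, total_docs):
    """Word-frequency-based weight: rarer terms in the entity information
    yield a larger weight (a smoothed IDF, assumed here for illustration)."""
    return math.log((1 + total_docs) / (1 + doc_freq))

def second_matching_score(first_score, doc_freq, total_docs):
    """Combine the weight value with the first matching score."""
    return first_score * idf_weight(doc_freq, total_docs)

FIRST_PRESET_THRESHOLD = 1.0  # assumed value for illustration

rare = second_matching_score(0.8, doc_freq=2, total_docs=100)
common = second_matching_score(0.8, doc_freq=90, total_docs=100)
is_target_candidate = rare > FIRST_PRESET_THRESHOLD
```

Under this scheme, a candidate whose entity information shares distinctive (rare) terms with the context clears the threshold, while one sharing only very common terms does not.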
In a possible design, the determining unit 702 is specifically configured to: perform word segmentation processing on the context content and the entity information corresponding to the target candidate entity to obtain third word segments included in the context content and fourth word segments included in the entity information corresponding to the target candidate entity; calculate the cosine similarity of the word vectors corresponding to the third word segments and the word vectors corresponding to the fourth word segments; sort the fourth word segments according to the numerical value of that cosine similarity; determine target word segments in the fourth word segments according to the sorting result, where the cosine similarity between the target word segments and the third word segments exceeds a second preset threshold; and determine the entity information corresponding to the target word segments as the key information.
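The key-information selection can be sketched as ranking each entity-information segment by its best similarity to any context segment and keeping those above the threshold. The toy vectors and the 0.5 threshold are illustrative assumptions:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_key_segments(context_segs, entity_segs, vectors, threshold=0.5):
    """Rank entity-information word segments by their best similarity to any
    context segment and keep those exceeding the second preset threshold
    (assumed here to be 0.5)."""
    scored = [(seg, max(cosine(vectors[seg], vectors[c]) for c in context_segs))
              for seg in entity_segs]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # relevance ranking
    return [seg for seg, sim in scored if sim > threshold]

# Toy word vectors (hypothetical values).
vectors = {"apple": [1.0, 0.0], "fruit": [0.8, 0.6], "gadget": [0.0, 1.0]}
key_info = select_key_segments(["apple"], ["fruit", "gadget"], vectors)
```

Only the surviving segments ("fruit" here) would be passed on to the entity disambiguation model as key information, shortening its input.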
In one possible design, the determining unit 702 is specifically configured to perform similarity calculation on the context content and the key information through an entity disambiguation model, and determine a disambiguation score between the entity to be linked and the target candidate entity according to a calculation result of the similarity calculation.
In one possible design, the entity linking apparatus further includes a training unit 704. The obtaining unit 701 is further configured to obtain a pre-training language model, where the pre-training language model includes a word embedding layer and a task layer, the word embedding layer is configured to perform word vector representation on the input corpus, and the task layer is configured to complete a pre-training language task.
The training unit 704 is specifically configured to train the pre-training language model according to the first corpus data, and update the word embedding layer parameter of the pre-training language model.
The determining unit 702 is further configured to determine a disambiguation model of the entity to be trained according to the updated word embedding layer parameter of the pre-training language model.
The training unit 704 is further configured to train the entity to be trained disambiguation model according to second corpus data, where the second corpus data includes the entity sample to be disambiguated and the label information of the entity sample to be disambiguated. And when the training result meets the training condition, obtaining the entity disambiguation model.
In one possible design, the obtaining unit 701 is specifically configured to obtain word embedding layer parameters of the updated pre-trained language model.
The determining unit 702 is further configured to initialize a parameter of a word embedding layer of the entity to be trained disambiguation model according to the updated word embedding layer parameter of the pre-training language model, where the entity to be trained disambiguation model includes the word embedding layer and a task layer, and the model parameter of the word embedding layer of the entity to be trained disambiguation model is the same as the model parameter of the word embedding layer of the pre-training language model.
In one possible design, the determining unit 702 is specifically configured to determine the target candidate entity as the target entity if the disambiguation score between the entity to be linked and the target candidate entity exceeds a third preset threshold. And if the disambiguation scores corresponding to the target candidate entities do not exceed a third preset threshold, determining the target candidate entity with the highest disambiguation score as the target entity.
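The target-entity decision rule in this design can be sketched directly; the 0.9 threshold value is an assumption for illustration:

```python
def pick_target_entity(disambiguation_scores, threshold=0.9):
    """Return the target entity: a candidate whose disambiguation score
    exceeds the third preset threshold (assumed to be 0.9) wins; if no
    candidate exceeds it, fall back to the highest-scoring candidate."""
    above = {c: s for c, s in disambiguation_scores.items() if s > threshold}
    pool = above if above else disambiguation_scores
    return max(pool, key=pool.get)
```

For example, `{"Apple Inc.": 0.95, "apple (fruit)": 0.40}` selects "Apple Inc." via the threshold branch, while `{"a": 0.50, "b": 0.60}` falls back to the highest score and selects "b".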
In one possible design, the entity linking apparatus further includes a receiving unit 705.
The receiving unit 705 is specifically configured to receive a text query instruction.
The obtaining unit 701 is further configured to obtain a query text according to the text query instruction, where the query text is a text to be identified.
The determining unit 702 is further configured to determine at least one query result related to the query text according to the link entity, and feed back the at least one query result.
Embodiments of the present application further provide a computer device, which may be deployed in a server. Referring to fig. 8, fig. 8 is a diagram of an embodiment of a computer device in an embodiment of the present application. As shown in the figure, the server 800 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing an application 842 or data 844. The memory 832 and the storage medium 830 may be transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 822 may be configured to communicate with the storage medium 830 to execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input-output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 8.
In the embodiment of the present application, the CPU 822 included in the server is configured to execute the steps executed by the target server in the embodiments shown in fig. 2 to 5.
Also provided in the embodiments of the present application is a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the computer program causes the computer to execute the steps executed by the server in the method described in the foregoing embodiments shown in fig. 2 to 5.
Also provided in embodiments of the present application is a computer program product including a program, which when run on a computer, causes the computer to perform the steps performed by the server in the method described in the foregoing embodiments shown in fig. 2 to 5.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method of entity linking, the method comprising:
acquiring an entity to be linked and context content corresponding to the entity to be linked in a text to be identified;
determining a plurality of candidate entities and entity information of each candidate entity in the plurality of candidate entities in a knowledge graph according to the entity to be linked;
matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score;
determining a target candidate entity of the plurality of candidate entities according to the first matching score;
determining a target entity of the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entities;
and determining the target entity as a link entity corresponding to the entity to be linked.
2. The method of claim 1, wherein prior to said determining a target entity of said target candidate entities based on a disambiguation score between said entity to be linked and said target candidate entities, said method further comprises:
performing relevance sequencing on entity information corresponding to the target candidate entity according to the context content;
determining key information in entity information corresponding to the target candidate entity according to the relevance sorting result;
inputting the entity to be linked, the context content, the target candidate entity and the key information into the entity disambiguation model, and acquiring the disambiguation score between the entity to be linked and the target candidate entity through the entity disambiguation model.
3. The method of claim 1, wherein the matching the context with the entity information corresponding to each candidate entity to obtain a first matching score comprises:
performing word segmentation processing on the context content and the entity information corresponding to each candidate entity to obtain a first word segmentation included in the context content and a second word segmentation included in the entity information corresponding to each candidate entity;
calculating cosine similarity of the word vector corresponding to the first participle and the word vector corresponding to the second participle;
and determining the first matching score according to the cosine similarity.
4. The method of claim 3, wherein determining the first match score based on the cosine similarity comprises:
obtaining a similarity distribution vector corresponding to each candidate entity according to the cosine similarity;
inputting the similarity distribution vector corresponding to each candidate entity into a first model, and acquiring, through the first model, the first matching score between the context content and the entity information corresponding to each candidate entity; wherein the first model is used for determining distribution scores according to the similarity distribution vectors.
5. The method of claim 4, wherein the obtaining the similarity distribution vector corresponding to each candidate entity according to the cosine similarity comprises:
determining the number of second participles in each preset cosine similarity interval according to the numerical value of the cosine similarity, wherein a plurality of cosine similarity intervals are preset for the cosine similarity;
and determining the similarity distribution vector corresponding to each candidate entity according to the number of second participles in each preset cosine similarity interval.
6. The method of claim 5, wherein determining a target candidate entity of the plurality of candidate entities based on the first match score comprises:
acquiring word frequency information corresponding to the entity information of each candidate entity;
determining a weight value corresponding to each candidate entity according to the word frequency information, and determining a second matching score corresponding to each candidate entity according to the weight value and the first matching score;
and if the second matching score exceeds a first preset threshold value, determining the candidate entity as the target candidate entity.
7. The method of claim 2, wherein the performing the relevance ranking of the entity information corresponding to the target candidate entity according to the context content comprises:
performing word segmentation processing on the context content and the entity information corresponding to the target candidate entity to obtain a third word segmentation included in the context content and a fourth word segmentation included in the entity information corresponding to the target candidate entity;
calculating cosine similarity of a word vector corresponding to the third participle and a word vector corresponding to the fourth participle;
sorting the fourth participles according to the numerical value of the cosine similarity of the word vector corresponding to the third participle and the word vector corresponding to the fourth participle;
determining key information in the entity information corresponding to the target candidate entity according to the relevance ranking result comprises:
determining target participles in the fourth participles according to the sequencing result of the fourth participles, wherein the cosine similarity between the target participles and the third participles exceeds a second preset threshold;
and determining the entity information corresponding to the target word segmentation as the key information.
8. The method of claim 2, wherein the obtaining the disambiguation score between the entity to be linked and the target candidate entity through the entity disambiguation model comprises:
calculating the similarity of the context content and the key information through the entity disambiguation model;
and determining the disambiguation score between the entity to be linked and the target candidate entity according to the calculation result of the similarity calculation.
9. The method of claim 2, wherein prior to obtaining the disambiguation score between the entity to be linked and the target candidate entity via the entity disambiguation model, the method further comprises:
the method comprises the steps of obtaining a pre-training language model, wherein the pre-training language model comprises a word embedding layer and a task layer, the word embedding layer is used for carrying out word vector representation on input linguistic data, and the task layer is used for completing a pre-training language task;
training the pre-training language model according to first corpus data, and updating word embedding layer parameters of the pre-training language model;
determining a disambiguation model of the entity to be trained according to the updated word embedding layer parameters of the pre-training language model;
training the entity disambiguation model to be trained according to second corpus data, wherein the second corpus data comprises an entity sample to be disambiguated and the labeling information of the entity sample to be disambiguated;
and when the training result meets the training condition, obtaining the entity disambiguation model.
10. The method of claim 9, wherein determining an entity to be trained disambiguation model based on the updated word embedding layer parameters of the pre-trained language model comprises:
acquiring the updated word embedding layer parameters of the pre-training language model;
initializing parameters of a word embedding layer of the entity disambiguation model to be trained according to the updated parameters of the word embedding layer of the pre-training language model, wherein the entity disambiguation model to be trained comprises a word embedding layer and a task layer, and the model parameters of the word embedding layer of the entity disambiguation model to be trained are the same as those of the word embedding layer of the pre-training language model.
11. The method according to any one of claims 1 to 10, wherein determining a target entity of the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entities comprises:
if the disambiguation score between the entity to be linked and the target candidate entity exceeds a third preset threshold, determining the target candidate entity as the target entity;
and if the disambiguation score between the entity to be linked and the target candidate entity does not exceed the third preset threshold, determining the target candidate entity with the highest disambiguation score as the target entity.
12. The method according to claim 1, wherein before the entity to be linked and the context content corresponding to the entity to be linked are obtained from the text to be identified, the method further comprises:
receiving a text query instruction;
acquiring a query text according to the text query instruction, wherein the query text is the text to be identified;
after the target entity is determined as the link entity corresponding to the entity to be linked, the method further includes:
determining at least one query result related to the query text according to the link entity;
and feeding back the at least one query result.
13. An entity linking apparatus, the apparatus comprising:
the acquiring unit is used for acquiring an entity to be linked and context content corresponding to the entity to be linked in a text to be identified;
the determining unit is used for determining a plurality of candidate entities and entity information of each candidate entity in the candidate entities in a knowledge graph according to the entity to be linked;
the matching unit is used for matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score;
the determining unit is further configured to determine a target candidate entity of the plurality of candidate entities according to the first matching score;
the determining unit is further configured to determine a target entity in the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entities;
the determining unit is further configured to determine the target entity as a link entity corresponding to the entity to be linked.
14. A computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute a program in the memory to implement the method of any one of claims 1 to 12;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
15. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 12.
CN202110461405.5A 2021-04-27 2021-04-27 Entity linking method, device, equipment and storage medium Pending CN113761218A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110461405.5A CN113761218A (en) 2021-04-27 2021-04-27 Entity linking method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110461405.5A CN113761218A (en) 2021-04-27 2021-04-27 Entity linking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113761218A true CN113761218A (en) 2021-12-07

Family

ID=78786911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110461405.5A Pending CN113761218A (en) 2021-04-27 2021-04-27 Entity linking method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113761218A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919347A (en) * 2021-12-14 2022-01-11 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN113947087A (en) * 2021-12-20 2022-01-18 太极计算机股份有限公司 Label-based relation construction method and device, electronic equipment and storage medium
CN114330331A (en) * 2021-12-27 2022-04-12 北京天融信网络安全技术有限公司 Method and device for determining importance of word segmentation in link
CN114491318A (en) * 2021-12-16 2022-05-13 北京百度网讯科技有限公司 Method, device and equipment for determining target information and storage medium
CN115795051A (en) * 2022-12-02 2023-03-14 中科雨辰科技有限公司 Data processing system for obtaining link entity based on entity relationship
CN116127053A (en) * 2023-02-14 2023-05-16 北京百度网讯科技有限公司 Entity word disambiguation, knowledge graph generation and knowledge recommendation methods and devices

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009052277A1 (en) * 2007-10-17 2009-04-23 Evri, Inc. Nlp-based entity recognition and disambiguation
US20110106807A1 (en) * 2009-10-30 2011-05-05 Janya, Inc Systems and methods for information integration through context-based entity disambiguation
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN110555208A (en) * 2018-06-04 2019-12-10 北京三快在线科技有限公司 ambiguity elimination method and device in information query and electronic equipment
CN110569496A (en) * 2018-06-06 2019-12-13 腾讯科技(深圳)有限公司 Entity linking method, device and storage medium
CN112069826A (en) * 2020-07-15 2020-12-11 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network
WO2021073119A1 (en) * 2019-10-15 2021-04-22 平安科技(深圳)有限公司 Method and apparatus for entity disambiguation based on intention recognition model, and computer device

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009052277A1 (en) * 2007-10-17 2009-04-23 Evri, Inc. Nlp-based entity recognition and disambiguation
US20110106807A1 (en) * 2009-10-30 2011-05-05 Janya, Inc Systems and methods for information integration through context-based entity disambiguation
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN110555208A (en) * 2018-06-04 2019-12-10 北京三快在线科技有限公司 ambiguity elimination method and device in information query and electronic equipment
CN110569496A (en) * 2018-06-06 2019-12-13 腾讯科技(深圳)有限公司 Entity linking method, device and storage medium
WO2021073119A1 (en) * 2019-10-15 2021-04-22 平安科技(深圳)有限公司 Method and apparatus for entity disambiguation based on intention recognition model, and computer device
CN112069826A (en) * 2020-07-15 2020-12-11 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Yahui: "Research on Entity Linking Methods for Clinical Medicine", China Master's Theses Full-text Database (Medicine and Health Sciences), 15 February 2018 (2018-02-15) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919347A (en) * 2021-12-14 2022-01-11 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN113919347B (en) * 2021-12-14 2022-04-05 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN114491318A (en) * 2021-12-16 2022-05-13 北京百度网讯科技有限公司 Method, device and equipment for determining target information and storage medium
CN114491318B (en) * 2021-12-16 2023-09-01 北京百度网讯科技有限公司 Determination method, device, equipment and storage medium of target information
CN113947087A (en) * 2021-12-20 2022-01-18 太极计算机股份有限公司 Label-based relation construction method and device, electronic equipment and storage medium
CN114330331A (en) * 2021-12-27 2022-04-12 北京天融信网络安全技术有限公司 Method and device for determining importance of word segmentation in link
CN114330331B (en) * 2021-12-27 2022-09-16 北京天融信网络安全技术有限公司 Method and device for determining importance of word segmentation in link
CN115795051A (en) * 2022-12-02 2023-03-14 中科雨辰科技有限公司 Data processing system for obtaining link entity based on entity relationship
CN116127053A (en) * 2023-02-14 2023-05-16 北京百度网讯科技有限公司 Entity word disambiguation, knowledge graph generation and knowledge recommendation methods and devices
CN116127053B (en) * 2023-02-14 2024-01-02 北京百度网讯科技有限公司 Entity word disambiguation, knowledge graph generation and knowledge recommendation methods and devices

Similar Documents

Publication Publication Date Title
CN111737474B (en) Method and device for training business model and determining text classification category
CN113761218A (en) Entity linking method, device, equipment and storage medium
CN110097085B (en) Lyric text generation method, training method, device, server and storage medium
Rahman et al. Classifying non-functional requirements using RNN variants for quality software development
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
CN110929125B (en) Search recall method, device, equipment and storage medium thereof
CN108563703A (en) Charge determination method, device, computer equipment and storage medium
CN106663124A (en) Generating and using a knowledge-enhanced model
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
CN113420145B (en) Semi-supervised learning-based bid-bidding text classification method and system
Jiang et al. Learning numeral embedding
CN110866102A (en) Search processing method
Helmy et al. Applying deep learning for Arabic keyphrase extraction
CN110597956A (en) Searching method, searching device and storage medium
CN114077841A (en) Semantic extraction method and device based on artificial intelligence, electronic equipment and medium
Almiman et al. Deep neural network approach for Arabic community question answering
CN115168590A (en) Text feature extraction method, model training method, device, equipment and medium
CN114662477A (en) Stop word list generating method and device based on traditional Chinese medicine conversation and storage medium
Kalra et al. Generation of domain-specific vocabulary set and classification of documents: weight-inclusion approach
Trupthi et al. Possibilistic fuzzy C-means topic modelling for twitter sentiment analysis
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
CN113515699A (en) Information recommendation method and device, computer-readable storage medium and processor
CN115481313A (en) News recommendation method based on text semantic mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination