CN113761218B - Method, device, equipment and storage medium for entity linking - Google Patents


Info

Publication number
CN113761218B
CN113761218B CN202110461405A
Authority
CN
China
Prior art keywords
entity
word
determining
candidate
disambiguation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110461405.5A
Other languages
Chinese (zh)
Other versions
CN113761218A (en)
Inventor
刘一仝
郑孙聪
周博通
费昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110461405.5A priority Critical patent/CN113761218B/en
Publication of CN113761218A publication Critical patent/CN113761218A/en
Application granted granted Critical
Publication of CN113761218B publication Critical patent/CN113761218B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, an apparatus, a device and a storage medium for entity linking, relating to the field of natural language processing. The method comprises: acquiring an entity to be linked and the context content corresponding to the entity to be linked from a text to be identified, and determining, in a knowledge graph, a plurality of candidate entities and the entity information of each of the plurality of candidate entities according to the entity to be linked; matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score, and determining target candidate entities among the plurality of candidate entities according to the first matching score; and determining a target entity among the target candidate entities based on the disambiguation scores between the entity to be linked and the target candidate entities, and determining the target entity as the link entity corresponding to the entity to be linked.

Description

Method, device, equipment and storage medium for entity linking
Technical Field
The present application relates to the field of natural language processing, and in particular, to a method, an apparatus, a device, and a storage medium for entity linking.
Background
Entity linking (EL) has been a hot topic in the field of natural language processing in recent years and plays a very important role in scenarios such as knowledge graph construction. Specifically, entity linking is a technology for mapping an entity appearing in a text to be identified onto a given knowledge graph, i.e., for matching the entity to be identified to an entity already present in the knowledge graph, so as to complete natural language tasks such as question answering, semantic search and information extraction.
Entity linking may include two processes: entity recognition, which determines a plurality of candidate entities in the knowledge graph according to the entity to be identified, and entity disambiguation, which selects from all candidate entities the unique entity that the entity to be identified points to. Entity disambiguation is needed because a name is inherently ambiguous, so matching must be performed against the context content surrounding the entity to be identified; generally, a pre-trained language model can be adopted for entity disambiguation.
An entity disambiguation method based on a pre-trained language model considers the entity information of all candidate entities in the knowledge graph, which allows the disambiguation to cover more relevant information; however, when there are too many candidate entities, computing a disambiguation score for every one of them leads to a huge amount of computation and long processing time. Therefore, how to reduce the amount of computation a pre-trained language model spends on entity disambiguation has become a problem to be solved.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, a device and a storage medium for entity linking. When determining the link entity of an entity to be linked in a knowledge graph, a plurality of candidate entities are first coarsely screened according to the context content of the entity to be linked; the screened candidate entities and the entity to be linked are then input into an entity disambiguation model to obtain disambiguation scores between the entity to be linked and the candidate entities, and the link entity is finally determined among the candidate entities from these disambiguation scores. The coarse screening reduces the amount of data input into the entity disambiguation model, thereby reducing its amount of computation.
In view of this, the present application provides, in one aspect, a method for entity linking, including:
And acquiring the entity to be linked and the context content corresponding to the entity to be linked in the text to be identified.
And determining a plurality of candidate entities and entity information of each candidate entity in the plurality of candidate entities in the knowledge graph according to the entity to be linked.
And matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score.
A target candidate entity of the plurality of candidate entities is determined based on the first matching score.
A target entity of the target candidate entities is determined based on the disambiguation score between the entity to be linked and the target candidate entity.
And determining the target entity as a link entity corresponding to the entity to be linked.
In another aspect, the present application provides an apparatus for entity linking, including:
the acquisition unit is used for acquiring the entity to be linked and the context content corresponding to the entity to be linked in the text to be identified.
And the determining unit is used for determining a plurality of candidate entities and entity information of each candidate entity in the plurality of candidate entities in the knowledge graph according to the entity to be linked.
And the matching unit is used for matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score.
And the determining unit is further used for determining a target candidate entity in the plurality of candidate entities according to the first matching score.
And the determining unit is also used for determining the target entity in the target candidate entities based on the disambiguation scores between the entity to be linked and the target candidate entities.
And the determining unit is also used for determining the target entity as a link entity corresponding to the entity to be linked.
In one possible design, the determining unit is specifically configured to perform relevance ranking on entity information corresponding to the target candidate entity according to the context content, determine key information in the entity information corresponding to the target candidate entity according to a relevance ranking result, input the entity to be linked, the context content, the target candidate entity and the key information into the entity disambiguation model, and obtain a disambiguation score between the entity to be linked and the target candidate entity through the entity disambiguation model.
In one possible design, the matching unit is specifically configured to perform word segmentation on the context content and on the entity information corresponding to each candidate entity, obtaining the first segments contained in the context content and the second segments contained in the entity information of each candidate entity, then calculate the cosine similarity between the word vectors corresponding to the first segments and the word vectors corresponding to the second segments, and determine the first matching score according to these cosine similarities.
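The segment-wise cosine-similarity matching described above can be sketched as follows (a minimal illustration; the function names and toy vectors are placeholders, not from the patent, and a real implementation would use trained word embeddings):

```python
import math

def cosine(u, v):
    # Cosine similarity of two word vectors; 0.0 if either vector is all-zero.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_similarities(first_vecs, second_vecs):
    # One row per first segment (context side), one column per
    # second segment (entity-information side).
    return [[cosine(u, v) for v in second_vecs] for u in first_vecs]
```

A first matching score could then be aggregated from this similarity matrix, for example by averaging or by the distribution-vector approach described in the following designs.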
In one possible design, the matching unit is specifically configured to obtain a similarity distribution vector corresponding to each candidate entity according to the cosine similarities, input the similarity distribution vector corresponding to each candidate entity into a first model, and obtain, through the first model, the first matching score between the context content and the entity information corresponding to each candidate entity, the score being determined from the similarity distribution vector.
In one possible design, the matching unit is specifically configured to determine, according to the values of the cosine similarities, the number of second segments falling within each preset cosine similarity interval, where a plurality of cosine similarity intervals are preset, and to determine the similarity distribution vector corresponding to each candidate entity according to the number of second segments in each preset cosine similarity interval.
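The similarity distribution vector amounts to a histogram of per-segment cosine similarities; a sketch follows (the interval boundaries are an illustrative choice, the patent does not fix them):

```python
def similarity_distribution(sims, bins):
    # sims: cosine similarities of the second segments against the context.
    # bins: preset similarity intervals, e.g. [(0.0, 0.3), (0.3, 0.7), (0.7, 1.01)].
    counts = [0] * len(bins)
    for s in sims:
        for i, (lo, hi) in enumerate(bins):
            if lo <= s < hi:
                counts[i] += 1
                break
    return counts
```

The resulting count vector is what would be fed into the first model to produce the first matching score.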
In one possible design, the obtaining unit is specifically configured to obtain word frequency information corresponding to the entity information of each candidate entity.
The determining unit is specifically configured to determine a weight value corresponding to each candidate entity according to the word frequency information, and determine a second matching score corresponding to each candidate entity according to the weight value and the first matching score. And if the second matching score exceeds the first preset threshold, determining the candidate entity as a target candidate entity.
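The word-frequency weighting and threshold screening above can be sketched as follows. The IDF-style weighting is an assumption on my part: the patent says the weight is derived from word frequency information but leaves the exact scheme open.

```python
import math

def frequency_weight(doc_freq, num_candidates):
    # Hypothetical weighting in the spirit of inverse document frequency:
    # terms appearing in many candidate descriptions contribute less.
    return math.log((1 + num_candidates) / (1 + doc_freq))

def second_matching_score(first_score, weight):
    # Second matching score = weight applied to the first matching score.
    return weight * first_score

def coarse_screen(scored, threshold):
    # Keep every candidate whose second matching score exceeds
    # the first preset threshold.
    return [c for c, s in scored if s > threshold]
```

Candidates surviving `coarse_screen` are the target candidate entities passed on to the disambiguation model.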
In one possible design, the determining unit is specifically configured to perform word segmentation processing on the context content and entity information corresponding to the target candidate entity, so as to obtain a third word segment included in the context content and a fourth word segment included in the entity information corresponding to the target candidate entity; and calculating cosine similarity of the word vector corresponding to the third segmentation word and the word vector corresponding to the fourth segmentation word. Sorting the fourth word segments according to the cosine similarity values of the word vectors corresponding to the third word segments and the word vectors corresponding to the fourth word segments, and determining target word segments in the fourth word segments according to sorting results of the fourth word segments, wherein the cosine similarity between the target word segments and the third word segments exceeds a second preset threshold; and determining entity information corresponding to the target word as key information.
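The relevance ranking of the fourth segments can be sketched as follows (a minimal illustration; the argument layout is my own, and `max_sims` stands for each fourth segment's best cosine similarity against any third segment):

```python
def key_information(fourth_segments, max_sims, threshold):
    # Rank the entity-side segments by their best similarity to the
    # context, descending, and keep those exceeding the second
    # preset threshold as key information.
    ranked = sorted(zip(fourth_segments, max_sims),
                    key=lambda p: p[1], reverse=True)
    return [seg for seg, s in ranked if s > threshold]
```

Only this key information, rather than the candidate's full description, then accompanies the target candidate entity into the disambiguation model.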
In one possible design, the determining unit is specifically configured to perform similarity calculation on the context content and the key information through the entity disambiguation model, and determine a disambiguation score between the entity to be linked and the target candidate entity according to a calculation result of the similarity calculation.
In one possible design, the entity linking apparatus further comprises a training unit. The acquisition unit is further configured to acquire a pre-trained language model comprising a word embedding layer and a task layer, where the word embedding layer is used for representing the input corpus as word vectors and the task layer is used for completing the pre-training language task.
The training unit is specifically configured to train the pre-training language model according to the first corpus data, and update word embedding layer parameters of the pre-training language model.
And the determining unit is also used for determining an entity disambiguation model to be trained according to the word embedding layer parameters of the updated pre-training language model.
The training unit is further used for training the entity disambiguation model to be trained according to second corpus data, and the second corpus data comprises the entity samples to be disambiguated and labeling information of the entity samples to be disambiguated. And when the training result meets the training condition, obtaining the entity disambiguation model.
In one possible design, the obtaining unit is specifically configured to obtain word embedding layer parameters of the updated pre-trained language model.
The determining unit is further used for initializing parameters of a word embedding layer of the entity disambiguation model to be trained according to the updated word embedding layer parameters of the pre-training language model, wherein the entity disambiguation model to be trained comprises the word embedding layer and a task layer, and the word embedding layer of the entity disambiguation model to be trained is identical to the model parameters of the word embedding layer of the pre-training language model.
In one possible design, the determining unit is specifically configured to determine the target candidate entity as the target entity if the disambiguation score between the entity to be linked and the target candidate entity exceeds a third preset threshold. And if the disambiguation scores corresponding to the target candidate entities do not exceed a third preset threshold, determining the target candidate entity with the highest disambiguation score as the target entity.
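The two-branch target selection above can be sketched as follows (a minimal illustration; names are placeholders):

```python
def pick_target_entity(scored, threshold):
    # scored: (target candidate entity, disambiguation score) pairs.
    above = [(c, s) for c, s in scored if s > threshold]
    # If no candidate clears the third preset threshold, fall back to
    # the candidate with the highest disambiguation score.
    pool = above if above else scored
    return max(pool, key=lambda p: p[1])[0]
```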
In one possible design, the entity linking apparatus further comprises a receiving unit.
The receiving unit is specifically configured to receive a text query instruction.
The acquisition unit is also used for acquiring a query text according to the text query instruction, wherein the query text is a text to be identified.
And the determining unit is also used for determining at least one query result related to the query text according to the link entity and feeding back the at least one query result.
Another aspect of the application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods of the above aspects.
From the above technical solutions, the embodiment of the present application has the following advantages:
In the embodiments of the application, when the entity disambiguation model is used to disambiguate candidate entities in the knowledge graph, the plurality of candidate entities are first coarsely screened according to the context content of the entity to be linked, and only the screened candidate entities are input, together with the entity to be linked, into the entity disambiguation model. This reduces the matching workload of the entity disambiguation model and lets the disambiguation score of each candidate entity for the entity to be linked be determined more efficiently, shortening the duration of entity linking and improving the completion efficiency of entity linking tasks.
Drawings
Fig. 1 is a flow chart of an entity linking method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating another entity linking method according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for screening candidate entities according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a candidate entity screening structure according to an embodiment of the present application;
fig. 5 is a flowchart of a method for obtaining an entity disambiguation model according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a pre-training model according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an apparatus for entity linking according to an embodiment of the present application;
fig. 8 is a schematic diagram of an embodiment of a computer device according to an embodiment of the present application.
Detailed Description
The embodiments of the application provide a method, an apparatus, a device and a storage medium for entity linking. When determining the link entity of an entity to be linked in a knowledge graph, a plurality of candidate entities are first coarsely screened according to the context content of the entity to be linked; the screened candidate entities and the entity to be linked are then input into an entity disambiguation model to obtain disambiguation scores between the entity to be linked and the candidate entities, and the link entity is finally determined among the candidate entities from these disambiguation scores. The coarse screening reduces the amount of data input into the entity disambiguation model, thereby reducing its amount of computation.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
As network data grows exponentially, the internet has become the largest data warehouse, with large amounts of network data presented on it in natural language. To understand the specific semantic information of natural language on the internet, the natural language needs to be connected with a knowledge base, and its labeling is completed using the knowledge in that base. Natural language itself is highly ambiguous: the same entity may correspond to multiple names, and one name may correspond to multiple homonymous entities. Therefore, to label a natural language text, each entity appearing in it must be associated with a unique entity (piece of knowledge) in the knowledge base, and the key to this step is entity linking technology.
Entity linking maps certain strings (entity mentions) in a natural language text onto entities in a knowledge base. For example, the "apple" in the natural language text "the student likes to eat an apple" is mapped to some entity in the knowledge base. It will be appreciated that synonymy and homonymy both occur in a knowledge base, so the mapping process requires entity disambiguation. Entity disambiguation is the technique of selecting the referred-to entity among all candidate entities of a knowledge graph; in the above example, "apple" in "the student likes to eat an apple" should refer to the "apple (fruit)" entity in the knowledge base, not the "apple (electronic product)" entity.
As the above example shows, the difficulty of entity linking lies in two aspects: many names for one entity, and many entities for one name. The former means that an entity may have multiple names: its standard name, aliases, abbreviations and so on may all be used to refer to it (for example, the several different Chinese names for the potato all refer to the same entity). The latter means that one name may refer to a plurality of entities; to resolve this, the entity information in the knowledge base is used for entity disambiguation. Entity linking thus generally comprises two steps: mention recognition and entity disambiguation. The first step is name recognition; for example, a name-entity dictionary can be constructed to manage the standard names and aliases of entities, and then the entities in the natural language text and the candidate entities in the knowledge base are identified according to this dictionary. The second step performs entity disambiguation on the candidate entities according to the entity information in the knowledge base and the content of the natural language text, determining the final entity sequence.
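The name-entity dictionary used in the first step can be sketched as a plain mapping from surface names to entity identifiers (all names and ids here are illustrative, not from any real knowledge base):

```python
# Toy name-entity dictionary: standard names and aliases map to the
# knowledge-base entity ids they may refer to.
NAME_TO_ENTITIES = {
    "apple": ["apple_fruit", "apple_inc"],
    "Apple Inc.": ["apple_inc"],
}

def candidate_entities(mention):
    # Step one of entity linking: collect every knowledge-base entity
    # sharing the mention's name; step two must then disambiguate.
    return NAME_TO_ENTITIES.get(mention, [])
```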
The knowledge base used for entity disambiguation can be a knowledge graph. A knowledge graph is a very large-scale semantic network used to determine the association relationships between entities or concepts. Through large-scale data collection, the knowledge graph is organized into a knowledge base that a machine can process, and can be displayed visually. A knowledge graph consists of nodes and the edges connecting them, where a node can be an entity and an edge represents a relationship between entities. For example, "apple" and "fruit" correspond to two nodes in the knowledge graph connected by an edge pointing from "apple" to "fruit"; this edge can be labeled "attribute", representing that an attribute of "apple" is "fruit".
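The node-and-edge structure just described can be sketched as a list of subject-relation-object triples (the contents are illustrative):

```python
# A minimal knowledge-graph sketch: nodes are entities or concepts,
# edges carry the relation between them.
TRIPLES = [
    ("apple_fruit", "attribute", "fruit"),
    ("apple_inc", "attribute", "company"),
]

def entity_information(entity):
    # Gather an entity's information by following its outgoing edges,
    # as done when collecting candidate entity information.
    return [(rel, obj) for subj, rel, obj in TRIPLES if subj == entity]
```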
Since the knowledge graph is the basis for machine understanding of natural language and a large amount of entity information is stored in it, its security must be strictly guaranteed, ensuring that its contents cannot be easily tampered with. Thus, the knowledge graph may be stored using blockchain technology to maintain its security and stability. Blockchain technology is briefly described below:
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association using cryptographic methods, each block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer and an application service layer.
The blockchain underlying platform may include processing modules for user management, basic services, smart contracts and operation management. The user management module is responsible for the identity management of all blockchain participants, including maintaining public/private key generation (account management), key management, and the correspondence between a user's real identity and its blockchain address (authority management), and, where authorized, supervising and auditing the transactions of certain real identities and providing risk-control rule configuration (risk-control audit). The basic service module is deployed on all blockchain node devices to verify the validity of service requests and, after consensus on a valid request, record it in storage; for a new service request, the basic service first performs interface adaptation analysis and authentication, encrypts the service information through an identification algorithm (identification management), transmits it completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering and contract execution: a developer can define contract logic in some programming language and publish it on the blockchain (contract registration), and the contract logic is executed when triggered by a call or another event, with a contract upgrade function also provided. The operation management module is mainly responsible for deployment during product release, configuration modification, contract settings, cloud adaptation, and visual output of the real-time status of the product in operation, for example: alarms, monitoring network conditions and monitoring node device health.
The platform product service layer provides the basic capabilities and implementation framework of typical applications; developers can complete the blockchain implementation of their business logic on these basic capabilities by superimposing the characteristics of the business. The application service layer provides blockchain-based application services for business participants to use.
From the above description, the key to entity linking is entity disambiguation, which is essentially necessitated by the diversity and ambiguity of natural language. Diversity means that the same entity has different mentions in text; for example, in a certain natural language text, several nicknames and the initials "MJ" may all refer to the same American basketball player A. Ambiguity means that the same mention may refer to different entities; for example, in a certain knowledge graph, the name A may refer to more than one person. In the prior art there are various methods for entity disambiguation; most commonly, a neural network is used to obtain a natural language model for entity disambiguation, and this model is then used to complete the entity disambiguation task.
Specifically, an entity mention (entity to be linked) in a natural language text may be determined first, and then all candidate entities whose names match the mention are determined in the knowledge graph. The information of each candidate entity is obtained through the network relationships in the knowledge graph; the entity mention, the context content of the mention, the candidate entities and the candidate entity information are input into a natural language model, which matches the candidate entity information against the context content to find the entity information most relevant to it and determines the corresponding candidate entity as the target entity for the mention. In this method, the natural language model must analyze all entity information of every candidate entity to obtain each candidate's disambiguation score, resulting in a huge amount of computation and long processing time. Therefore, how to reduce the amount of computation of the natural language model has become a problem to be solved.
To address the above problems, the embodiments of the present application provide a method for entity linking: after the candidate entities of the entity to be linked are determined in the knowledge graph, the plurality of candidate entities are first coarsely screened according to the context content of the entity to be linked, the entity information of the screened candidate entities is itself screened, and the screened candidate entities and the entity to be linked are then input into the entity disambiguation model to obtain the entity disambiguation result. Through these processes of coarse entity screening and information screening, the amount of data input into the entity disambiguation model can be reduced, thereby reducing the amount of computation of the natural language model.
Before the proposal of the present application is introduced, the field of natural language processing (NLP) is briefly introduced. Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language, and is a science integrating linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use daily, and is therefore closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Fig. 1 is a flow chart of an entity linking method according to an embodiment of the present application. As shown in Fig. 1, the method includes the following steps:
S1, determining an entity to be linked in a text to be identified.
S2, determining candidate entities in the knowledge graph according to the entities to be linked.
S3, carrying out candidate entity coarse screening according to the context content of the entity to be linked.
S4, screening entity information of the candidate entities based on depth relevance ranking.
S5, performing entity disambiguation by using a natural language model based on the screened candidate entity and entity information.
S6, outputting a disambiguation result.
When a section of natural language text needs to be identified, the text to be identified may first be segmented using a word segmentation technique, and the entity to be linked (the entity name) is determined from the segmentation result. A plurality of candidate entities is then determined in a preselected knowledge graph according to the entity name, and the entity information of these candidate entities is obtained. The candidate entities are first coarsely screened according to the context corresponding to the entity name (the text to be identified) and the entity information of the candidates; the specific coarse screening process can take various forms, which are described in detail below. After the coarse screening, the entity information of the remaining candidates is screened to obtain the information most useful for entity disambiguation. Entity disambiguation is then performed by combining the screened candidates and entity information: specifically, a natural language model matches the context of the entity name against the entity information of each candidate, and a disambiguation result is output that determines the unique target entity corresponding to the entity name. Because the coarse screening of candidates and the screening of entity information happen before data is input to the natural language model, the matching workload of the model is reduced and the disambiguation score of each candidate for the entity to be linked can be determined more efficiently, which reduces the calculation amount and duration of entity linking and improves the completion efficiency of the entity linking task.
The following describes an entity linking method specifically, and fig. 2 is a schematic flow chart of an entity linking method according to an embodiment of the present application, as shown in fig. 2, and the method includes the following steps.
201. Acquiring the entity to be linked and the context of the entity to be linked from the text to be identified.
When a machine needs to perform semantic recognition on natural language text, at least one entity in the text to be identified must first be determined. Because natural language is diverse and ambiguous, the machine first has to solve the diversity problem, i.e., determine the entity to be linked in the text to be identified. For example, the text may first be segmented, and the entity to be linked determined from the resulting words. Suppose a certain text to be identified reads: "Liu Mou is about to hold a concert, at which the 'iron lung singer' will sing more than 20 songs to display the charm of the 'heavenly king'." By segmenting this text, it can be determined that the words "Liu Mou", "iron lung singer" and "heavenly king" all represent the movie star Liu Mou, so one entity to be linked in this text is "Liu Mou".
Illustratively, the natural language diversity problem may be solved with a name-entity dictionary: a knowledge base stores the various names of each entity, including its standard name, aliases, name abbreviations, and the like. When determining the entity to be linked in the text to be identified, the words included in the text are matched against the standard names, aliases and abbreviations of the entities in the entity dictionary, and the final entity mention is determined from the matches.
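The dictionary lookup described above can be sketched as follows. This is a minimal illustration, not the claimed embodiment; the alias table and entity identifiers are invented for the example.

```python
# Minimal sketch of mention detection with a name-entity dictionary.
# The alias table and entity IDs below are illustrative only.
ALIAS_DICT = {
    "flying man": "player_A",   # alias
    "MJ": "player_A",           # name abbreviation
    "Liu Mou": "star_liu",      # standard name
}

def find_mentions(tokens):
    """Return (word, matched entity name) pairs for segmented words found in the dictionary."""
    return [(t, ALIAS_DICT[t]) for t in tokens if t in ALIAS_DICT]

mentions = find_mentions(["MJ", "scored", "again"])
```

In practice the dictionary would be far larger and the matching fuzzier (e.g., longest-match over character spans), but the principle of mapping surface forms to a canonical entity name is the same.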
After the entity to be linked is determined, the machine still needs to solve the ambiguity problem of natural language. Ambiguity essentially arises because words have multiple senses; it cannot be resolved from the surface text alone, and the meaning of an entity mention must be determined in combination with its context. Therefore, the context of the entity to be linked needs to be obtained from the text to be identified, and the target entity determined in combination with that context.
202. Determining a plurality of candidate entities and the entity information of each candidate entity in the knowledge graph according to the entity to be linked.
After the entity to be linked is determined in the text to be identified, a plurality of candidate entities is determined in the knowledge graph according to the entity name, and the entity information corresponding to each candidate is acquired through the edge relationships in the knowledge graph. In the above example, once the entity to be linked is determined to be "Liu Mou", the candidates named "Liu Mou" in the knowledge graph can be determined from those two words. Since the knowledge graph includes massive entity information, suppose it includes two "Liu Mou" entities; both must be determined as candidate entities, and the entity information corresponding to each "Liu Mou" entity is then obtained, which may include nationality, works, age, gender, and the like.
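The retrieval of candidates and their entity information can be sketched with a toy knowledge graph. The dictionary-of-dictionaries representation and all entries are illustrative assumptions; a real knowledge graph would be a graph store queried over edge relationships.

```python
# Toy knowledge graph: entity ID -> entity information. Entries are illustrative.
KG = {
    "Liu Mou#1": {"name": "Liu Mou", "occupation": "singer", "works": ["concert album"]},
    "Liu Mou#2": {"name": "Liu Mou", "occupation": "bus driver"},
    "Zhang X":   {"name": "Zhang X", "occupation": "writer"},
}

def candidates(mention):
    """All knowledge-graph entities whose name matches the mention, with their entity information."""
    return {eid: info for eid, info in KG.items() if info["name"] == mention}
```

For the mention "Liu Mou", this returns both "Liu Mou" entities together with their information, which is exactly the input the subsequent coarse screening operates on.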
203. Matching the context of the entity to be linked with the entity information of each candidate entity to obtain a first matching score.
After the candidate entities are obtained, they first need to be coarsely screened. The purpose of this screening is to delete candidates with extremely low relevance, which reduces the workload of the subsequent entity disambiguation process. Specifically, the candidates may be screened by combining the context of the entity to be linked with the entity information of each candidate: the degree of match between the two is determined to obtain a first matching score.
204. A target candidate entity of the plurality of candidate entities is determined based on the first matching score.
After the first matching score between the context of the entity to be linked and the entity information of each candidate is determined, the candidates may be filtered according to this score. For example, the candidates may be ranked by matching score, and a preset number of the highest-scoring candidates taken as target candidate entities. Alternatively, target candidates may be selected according to a preset threshold: if the first matching score of a candidate is higher than the threshold, that candidate is determined to be a target candidate entity.
In the above example, the entity to be linked "Liu Mou" matches two candidate entities named "Liu Mou". If one candidate "Liu Mou" works as a bus driver, the matching score determined in combination with the context phrase "hold a concert" will be lower than the preset threshold, so that candidate is deleted and the other candidate "Liu Mou" is determined to be the target candidate entity.
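The two selection strategies just described (keep the top-N candidates, or keep those above a threshold) can be sketched as one helper. The function name and signature are illustrative.

```python
def coarse_screen(scored, threshold=None, top_n=None):
    """Coarse screening of candidates.
    scored: {entity_id: first matching score}.
    Keeps candidates above the threshold and/or the top_n highest-scoring ones,
    returned in descending score order."""
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        ranked = [(e, s) for e, s in ranked if s >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [e for e, _ in ranked]
```

With the example above, a bus-driver candidate scoring 0.2 against a concert context is dropped by `threshold=0.5` while the singer candidate survives.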
205. Ranking the entity information of the target candidate entity by relevance according to the context of the entity to be linked.
After the coarse screening of candidates, the entity information of the target candidate entity still needs to be ranked by relevance. Each entity in the knowledge graph carries massive entity information, and for a given context different pieces of information have different relevance. If all entity information of the target candidate were combined for entity disambiguation, the workload would be large and the disambiguation efficiency very low. Therefore, before entity disambiguation, the entity information of the target candidate is ranked by relevance, and the information most relevant to the context of the entity to be linked is screened out for the subsequent disambiguation.
206. Determining key information of the target candidate entity according to the relevance ranking result.
In combination with the above steps, the key information of the target candidate entity can be determined from the relevance ranking result, and entity disambiguation performed using that key information. In the above example, the context of the entity "Liu Mou" includes "concert" and "sing", from which it can be seen that the "works" information has the highest relevance to the context. The entity information of the target candidate can therefore be ranked by relevance, and the "works" entity information determined to be the key information.
207. Acquiring a disambiguation score between the entity to be linked and the target candidate entity according to the entity to be linked, its context, the target candidate entity, and the key information of the target candidate entity.
After the candidate entities and their entity information have been screened, the entity to be linked, its context, the target candidate entity and the key information of the target candidate are input into the final entity disambiguation model to obtain the disambiguation score, completing the entity disambiguation process.
It will be appreciated that the entity disambiguation model is a natural language processing model and may include a word embedding layer and a task layer. The word embedding layer converts the input text into word vectors, and the task layer operates on the word vectors to obtain the final task result. In the entity disambiguation model, the task of the task layer is entity disambiguation: it determines the correlation between word vectors and evaluates the degree of correlation between the entity to be linked and the candidate entity to obtain a disambiguation score. The disambiguation score is an evaluation of the degree of correlation between the entity to be linked and a candidate; the target entity among the plurality of target candidates is finally determined according to this score.
208. A target entity of the target candidate entities is determined based on the disambiguation score between the entity to be linked and the target candidate entity.
For example, a third preset threshold may be set in advance: if the disambiguation score between the entity to be linked and a target candidate exceeds the third preset threshold, that candidate is determined to be the target entity. If no disambiguation score exceeds the third preset threshold, the target candidate with the highest disambiguation score may be determined to be the target entity.
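The selection rule with its fallback can be sketched as follows; the function name is illustrative.

```python
def pick_target(scores, threshold):
    """Target entity selection.
    scores: {entity_id: disambiguation score}.
    Prefer candidates whose score exceeds the third preset threshold; if none
    does, fall back to the highest-scoring candidate overall."""
    above = {e: s for e, s in scores.items() if s > threshold}
    pool = above if above else scores
    return max(pool, key=pool.get)
```

Note the fallback guarantees a target entity is always produced, even when all candidates score below the threshold.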
209. Determining the target entity to be the link entity corresponding to the entity to be linked.
After the target entity corresponding to the entity to be linked is determined, the two can be linked to complete the related natural language task, for example semantic search or keyword recognition. For instance, if the machine performs the entity linking process in response to a received text query instruction, then once the link entity corresponding to the entity to be linked is determined, the machine may respond to the query instruction by obtaining a plurality of query results from the entity information of the link entity in the knowledge graph, and then feed back and display those results.
In the embodiment of the present application, when entity disambiguation is performed on candidate entities in a knowledge graph using an entity disambiguation model, the candidates are first coarsely screened according to the context of the entity to be linked, the entity information of the remaining candidates is then screened, and the screened candidates and entity information are finally input into the entity disambiguation model to obtain the final disambiguation result. This reduces the matching workload of the entity disambiguation model, allows the disambiguation score of each candidate for the entity to be linked to be determined more efficiently, shortens the duration of entity linking, and improves the completion efficiency of the entity linking task.
The process of coarse screening of candidate entities and key information selection of target candidate entities will be described in detail below in connection with the embodiment shown in fig. 2. Fig. 3 is a flow chart of a method for screening candidate entities according to an embodiment of the present application, including:
301. Performing word segmentation on the context of the entity to be linked to obtain the first word segments included in the context.
In the coarse screening of candidate entities, the context of the entity to be linked is segmented, and the contextual information is represented by the first word segments included in the context. Fig. 4 is a schematic diagram of a candidate entity screening structure according to an embodiment of the present application. As shown in Fig. 4, word segmentation divides the context into a plurality of first word segments, and each first word segment needs to be matched against the entity information corresponding to the candidate entity.
302. Performing word segmentation on the entity information corresponding to the candidate entity to obtain the second word segments included in the entity information.
Similarly, the entity information corresponding to the candidate entity is segmented. As shown in Fig. 4, a candidate entity may include a plurality of pieces of entity information; the machine segments each piece to obtain a plurality of second word segments, and each second word segment needs to be matched against each first word segment included in the context of the entity to be linked.
303. Calculating the cosine similarity between the word vector corresponding to each first word segment and the word vector corresponding to each second word segment.
Specifically, each first word segment and each second word segment is first represented as a word vector, and the cosine similarity between the word vector corresponding to each first word segment and that corresponding to each second word segment is then calculated. It can be understood that if there are M first word segments and N second word segments, M×N cosine similarity calculations are needed, yielding M×N cosine similarity values.
304. Counting, for each preset cosine similarity interval, the number of cosine similarity values falling within it.
After the plurality of cosine similarity values is obtained, they need to be represented statistically. Since cosine similarity takes values within [-1, 1], this interval may first be divided according to a preset step; for example, with a step of 0.2 the interval [-1, 1] is divided into 10 sub-intervals. The number of cosine similarity values falling into each sub-interval is then counted. For example, with 5 first word segments and 10 second word segments, a total of 50 cosine similarity calculations is performed, and the number of values falling into each cosine similarity interval is counted, e.g., 10 values fall within [-1, -0.8].
305. Determining the similarity distribution vector corresponding to the candidate entity according to the count in each cosine similarity interval.
After the counts per cosine similarity interval are obtained, the similarity distribution vector corresponding to the candidate entity is determined from them. In the above example, 10 cosine similarity values fall within [-1, -0.8], 2 within [-0.8, -0.6], 0 within [-0.6, -0.4], and so on. It can be understood that the dimensionality of the similarity distribution vector depends on the step of the cosine similarity intervals: with a step of 0.2 it is a ten-dimensional vector, and with a step of 0.1 a twenty-dimensional vector.
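Steps 303-305 amount to building a histogram over the M×N pairwise cosine similarities. A minimal sketch (the helper names are illustrative, and toy 2-dimensional word vectors stand in for real embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity of two word vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_distribution(ctx_vecs, info_vecs, step=0.2):
    """Histogram of the M*N cosine similarities over [-1, 1] with the given step;
    step=0.2 gives the ten-dimensional vector described in the text."""
    bins = int(round(2 / step))
    hist = [0] * bins
    for u in ctx_vecs:            # M first-word-segment vectors (context)
        for v in info_vecs:       # N second-word-segment vectors (entity info)
            sim = cosine(u, v)
            idx = min(int((sim + 1) / step), bins - 1)  # clamp sim == 1.0 into last bin
            hist[idx] += 1
    return hist
```

The resulting vector is what step 306 feeds into the first model to obtain the first matching score.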
306. Determining the first matching score between the context of the entity to be linked and the entity information of the candidate entity according to the similarity distribution vector.
After the similarity distribution vector is determined, it can be input into a first model to obtain the first matching score. It will be appreciated that the first model evaluates the distribution of cosine similarities from the similarity distribution vector, i.e., determines a distribution score from the vector, and uses that score to evaluate the similarity between the context and the entity information. For example, the first model may be a fully connected layer into which the similarity distribution vector is input to obtain the distribution score; this is not specifically limited here.
307. Acquiring word frequency information corresponding to the entity information of the candidate entity, and determining the weight value of the candidate entity according to the word frequency information.
The word frequency information describes properties of the first word segments included in the context and the second word segments included in the entity information, including the frequency of occurrence of a first word segment in the context and its inverse document frequency with respect to the context, and reflects how important the first word segment is for semantic understanding of the context. Similarly, the frequency of occurrence of a second word segment in the entity information and its inverse document frequency with respect to the entity information reflect its importance for that entity information. The second word segments of each candidate entity can be evaluated with this word frequency information, and the weight value of each candidate determined from the evaluation; for example, if the second word segments included in a candidate have good discriminative power, the weight value of that candidate should be increased appropriately.
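The term frequency and inverse document frequency statistics can be sketched as follows. How exactly the patent combines these statistics into a candidate weight, and the weight into a second matching score, is not fixed by the text; averaging tf-idf over a candidate's second word segments and multiplying weight by the first score are simple assumed choices for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """tf-idf of a term in one tokenised document, against a corpus of tokenised documents."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)       # documents containing the term
    idf = math.log(len(corpus) / (1 + df))          # smoothed inverse document frequency
    return tf * idf

def candidate_weight(doc, corpus):
    """Assumed aggregation: average tf-idf of a candidate's second word segments.
    Discriminative (rare) segments raise the weight; ubiquitous ones lower it."""
    terms = set(doc)
    return sum(tf_idf(t, doc, corpus) for t in terms) / len(terms)

def second_score(weight, first_score):
    """Assumed combination of the weight value and the first matching score."""
    return weight * first_score
```

Here `corpus` would be the segmented entity information of all candidates, so segments shared by every candidate contribute little to any weight.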
308. Determining a second matching score corresponding to the candidate entity according to the weight value and the first matching score.
The second matching score can be obtained by combining the weight value of each candidate entity with its first matching score for the entity to be linked. The candidates are then screened according to the second matching score.
309. Determining the target candidate entities among the candidate entities according to the second matching scores.
Specifically, a first preset threshold may be set, and if the second matching score exceeds the first preset threshold, the candidate entity is determined to be a target candidate entity. The priority level of the plurality of candidate entities may also be determined based on the second matching score, and the target candidate entity of the candidate entities may be determined based on the priority level.
In the above embodiment, the context corresponding to the entity to be linked and the entity information of the candidate entities are segmented, a similarity distribution vector is obtained from the cosine similarities between the word vectors of the first and second word segments, the relevance between the context and the entity information is evaluated from that vector, and the candidates highly relevant to the context are finally screened out as target candidate entities.
After the coarse screening yields the target candidate entities, their entity information must also be screened, so that the key information most relevant to the context of the entity to be linked is selected as the comparison information for the subsequent entity disambiguation. Specifically, the entity information of the candidate can be ranked based on the idea of depth relevance ranking, and the important information then selected as key information from the ranking result and input into the entity disambiguation model.
Depth relevance ranking may also be based on cosine similarity. First, the context corresponding to the entity to be linked is segmented to obtain a plurality of third word segments. It can be understood that the third word segments may be the same as or different from the first word segments in the coarse screening; the specific segmentation result depends on the algorithm used for the word segmentation operation and is not limited here. The entity information of the target candidate entity is then segmented to obtain fourth word segments, which likewise may be the same as or different from the second word segments in the coarse screening.
After the third and fourth word segments are obtained, cosine similarities between them are calculated in turn: specifically, the word vectors of the third and fourth word segments are determined first, and the cosine similarity between the vector of each third word segment and that of each fourth word segment is then calculated. The fourth word segments are sorted by the resulting cosine similarity values, those with high similarity to the third word segments are screened out, and the entity information they belong to is determined to be the key information of the target candidate entity.
By using the method to screen the entity information of the target candidate entity, the key information more relevant to the context content of the entity to be linked can be selected to perform entity disambiguation, so that the subsequent entity disambiguation process does not need to compare and identify all the entity information of the target candidate entity, thereby reducing the workload of an entity disambiguation model, improving the efficiency of the whole entity link and reducing the duration of the entity link.
The entity disambiguation model is briefly introduced below. Its disambiguation score is obtained by comparing the entity information of a candidate entity with the context of the entity to be linked, and measures the degree of match between the two. The aim is to find, among the candidates, a target entity that refers to the same thing as the entity to be linked, i.e., the obtained target entity and the entity to be linked denote the same entity. To improve the accuracy of the disambiguation result, entity disambiguation may be performed with a natural language model; in particular, the entity disambiguation model may be obtained based on a pre-trained model.
Existing mainstream natural language understanding models are fine-tuned from pre-trained models. A model is first pre-trained on tasks such as masked language modeling (MLM) or next sentence prediction (NSP) using public unlabeled data; a downstream network structure is then designed for the specific natural language understanding task, and the whole network is fine-tuned with the labeled data of that task to obtain a task-specific neural network model. Accordingly, to obtain the entity disambiguation model, a pre-trained model is first acquired, a downstream network structure is designed for the entity disambiguation task, and the task layer of the trained pre-trained model is replaced to obtain the entity disambiguation model; this model is then trained on entity disambiguation sample data, and the final entity disambiguation model is obtained when the training requirement is met.
By network structure, a natural language model can be divided into a word embedding layer, a converter (transformer) layer, and a downstream task layer. The word embedding layer segments the input natural language and converts the words into word vectors. The transformer layer processes and understands the input natural language data, determining the logical relationships and word order among the words. The downstream layer is tied to the specific natural language task, with different tasks having different downstream structures; it operates on the word vectors through the logical relationships captured by the transformer layer, finally producing the task-specific output for the input. The word embedding layer therefore mainly serves machine recognition of the text data, vectorizing it independently of the specific task; the transformer layer and the downstream layer must understand the input text data and react to it to obtain the output, so they are closely tied to the specific language task and may collectively be called the task layer.
Because the network structure of a natural language model is very deep, randomly initializing the parameters of every layer would require a large amount of training data and make the training process very slow. In the prior art, a pre-training approach is therefore generally adopted to obtain a natural language model for a specific task. A network is first pre-trained on a pre-task with a large training data set to obtain a pre-trained model, which can be understood as having learned part of the language knowledge. The parameters of the task model (the natural language model) are then initialized from part of the network parameters of the pre-trained model, so that the task model is trained on the basis of the pre-trained model. This reduces the amount of training data needed for the task model, accelerates its convergence, and improves the efficiency of obtaining it.
The following describes a training process of an entity disambiguation model based on a pre-training model, and fig. 5 is a flow chart of a method for obtaining an entity disambiguation model according to an embodiment of the present application, where the method includes:
501. A pre-training model is obtained.
The network structure of the pre-training model can be divided into a first word embedding layer and a first task layer. The first word embedding layer converts the input corpus into word vectors; the first task layer is tied to the pre-task of the pre-training model and operates on the word vectors according to that task to produce the output for the input corpus. The pre-training model may be a model pre-trained on tasks such as MLM or NSP, the main purpose being that the first word embedding layer learns a good word-vector representation of the input corpus.
502. Training the pre-training model with unlabeled corpus.
When the natural language model for entity disambiguation is obtained from the pre-training model, the pre-training model first needs to be fine-tuned with entity disambiguation corpus, so that it learns knowledge specific to entity disambiguation; this also speeds up the subsequent training.
Inputting the entity disambiguation corpus into the pre-training model lets it acquire entity disambiguation knowledge, so that the word embedding layer can better represent the word vectors of this corpus.
503. Word embedding layer parameters of the trained pre-training model are determined.
After the training of the pre-training model has finished, its word embedding layer can represent the input entity disambiguation corpus well. The word embedding layer parameters of the pre-training model are therefore acquired and used to initialize the word embedding layer parameters of the entity disambiguation model, completing the transfer of knowledge; this accelerates the subsequent training of the entity disambiguation model and lets it converge quickly.
504. Word embedding layer parameters of the entity disambiguation model are determined according to the word embedding layer parameters of the trained pre-training model.
The structure of the target task layer is designed for the entity disambiguation task, so that the target task layer can match the context content of the entity to be linked against the entity information of the candidate entity to evaluate their similarity. The word embedding layer structure of the pre-training model and the target task layer structure are then combined to obtain the structure of the entity disambiguation model, the word embedding layer parameters of the entity disambiguation model are initialized with the word embedding layer parameters of the pre-training model, and the model to be trained is finally obtained.
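The knowledge migration of step 504 can be sketched as follows. The dictionary-based parameter representation, the `word_embedding.` name prefix, and the function name are illustrative assumptions rather than details from the embodiment; only the word embedding layer weights are copied, while the target task layer keeps its fresh initialization.

```python
def init_from_pretrained(pretrained_params, disambig_params):
    # Copy only the word-embedding-layer weights of the trained
    # pre-training model into the entity disambiguation model; the
    # task-layer weights keep their fresh initialization (step 504).
    # Parameter dicts and the name prefix are illustrative assumptions.
    for name, weights in pretrained_params.items():
        if name.startswith("word_embedding."):
            disambig_params[name] = [row[:] for row in weights]  # deep copy
    return disambig_params
```

In a real implementation the same idea would be expressed by loading the pre-trained embedding weights into the new model's state before fine-tuning begins.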
505. The entity disambiguation model is trained with the entity disambiguation samples to obtain a trained entity disambiguation model.
Specifically, an entity disambiguation sample comprises an entity to be linked, the context content of the entity to be linked, a candidate entity with its entity information, and the labeling information of the entity to be linked. During training, the entity to be linked, its context content, the candidate entity, and the entity information of the candidate entity are input into the entity disambiguation model to obtain an entity disambiguation result; the loss between the entity disambiguation result and the labeling information of the entity to be linked is then calculated, the loss is back-propagated, and the parameters of the entity disambiguation model are adjusted. Through repeated iterations, the training process ends when a training condition is met, for example when a preset number of training rounds is reached or the loss value falls below a preset threshold, yielding the final entity disambiguation model.
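The iterative loop of step 505 (forward pass, loss, back-propagation, and a stopping condition of either a preset round count or a loss threshold) can be illustrated with a deliberately minimal single-feature scorer. The logistic loss and all names below are illustrative stand-ins for the actual entity disambiguation model, not the embodiment's network.

```python
import math

def train_disambiguation_model(samples, labels, lr=0.5,
                               max_epochs=200, loss_threshold=0.05):
    # Toy stand-in for step 505: forward pass -> loss -> backward
    # update, repeated until a preset epoch count is reached or the
    # mean loss falls below a preset threshold.
    w, b = 0.0, 0.0  # single-feature scorer standing in for the model
    n = len(samples)
    for epoch in range(max_epochs):
        total_loss, grad_w, grad_b = 0.0, 0.0, 0.0
        for x, y in zip(samples, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # disambiguation score
            total_loss += -(y * math.log(p + 1e-9)
                            + (1 - y) * math.log(1 - p + 1e-9))
            grad_w += (p - y) * x
            grad_b += (p - y)
        w -= lr * grad_w / n   # "back-propagate the loss, adjust parameters"
        b -= lr * grad_b / n
        if total_loss / n < loss_threshold:  # training condition met
            break
    return w, b
```

The stopping test mirrors "preset number of training rounds reached or loss value below a preset threshold" from the description.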
In this way, the entity disambiguation model can be obtained on the basis of the pre-training model; knowledge migration speeds up the training process of the entity disambiguation model and improves its training efficiency.
FIG. 6 is a schematic structural diagram of a pre-training model according to an embodiment of the present application. The context of the entity to be linked, the candidate entity, and the entity information of the candidate entity are input to the pre-training model BERT, which compares the context of the entity to be linked with the entity information of the candidate entity to obtain the final disambiguation score.
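One way to assemble the paired input suggested by FIG. 6 is sketched below. The `[CLS]`/`[SEP]` packing follows BERT's usual sentence-pair convention; the exact input format used in the embodiment may differ, and the function name is an illustrative assumption.

```python
def build_bert_input(context, candidate, entity_info,
                     cls="[CLS]", sep="[SEP]"):
    # Pack the mention context on one side and the candidate entity plus
    # its entity information on the other, so a BERT-style sentence-pair
    # model can compare them and emit a disambiguation score.
    return f"{cls} {context} {sep} {candidate} {sep} {entity_info} {sep}"
```

The resulting string would then be tokenized and fed to the pre-trained encoder, whose pooled output is scored by the target task layer.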
Fig. 7 is a schematic structural diagram of an apparatus for entity linking according to an embodiment of the present application, including:
the obtaining unit 701 is configured to obtain, in the text to be identified, the entity to be linked and the context content corresponding to the entity to be linked.
A determining unit 702, configured to determine, in the knowledge graph, a plurality of candidate entities and entity information of each candidate entity in the plurality of candidate entities according to the entity to be linked.
And a matching unit 703, configured to match the context content with the entity information corresponding to each candidate entity, so as to obtain a first matching score.
The determining unit 702 is further configured to determine a target candidate entity of the plurality of candidate entities according to the first matching score.
The determining unit 702 is further configured to determine a target entity in the target candidate entities based on the disambiguation score between the entity to be linked and the target candidate entity.
The determining unit 702 is further configured to determine the target entity as a link entity corresponding to the entity to be linked.
In one possible design, the determining unit 702 is specifically configured to perform relevance ranking on entity information corresponding to the target candidate entity according to the context content, determine key information in the entity information corresponding to the target candidate entity according to the relevance ranking result, input the entity to be linked, the context content, the target candidate entity and the key information to an entity disambiguation model, and obtain a disambiguation score between the entity to be linked and the target candidate entity through the entity disambiguation model.
In one possible design, the matching unit 703 is specifically configured to perform word segmentation processing on the context content and the entity information corresponding to each candidate entity, so as to obtain a first word segment included in the context content and a second word segment included in the entity information corresponding to each candidate entity; and to calculate the cosine similarity of the word vector corresponding to the first word segment and the word vector corresponding to the second word segment, and determine the first matching score according to the cosine similarity.
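The cosine similarity computation described above can be sketched as follows. Aggregating per-segment best matches into an average is one simple illustrative choice for turning the segment-level similarities into a single first matching score; the embodiment does not prescribe this particular aggregation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def first_match_score(context_vecs, entity_vecs):
    # For each first word segment (context), take its best cosine match
    # among the second word segments (entity information), then average.
    # This aggregation is an illustrative assumption.
    if not context_vecs or not entity_vecs:
        return 0.0
    best = [max(cosine(c, e) for e in entity_vecs) for c in context_vecs]
    return sum(best) / len(best)
```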
In one possible design, the matching unit 703 is specifically configured to obtain a similarity distribution vector corresponding to each candidate entity according to the cosine similarity, input the similarity distribution vector corresponding to each candidate entity to a first model, and obtain the first matching score of the context content and the entity information corresponding to each candidate entity through the first model, where the first model is used to determine a distribution score according to the similarity distribution vector.
In one possible design, the matching unit 703 is specifically configured to determine the number of second word segments in each preset cosine similarity interval according to the value of the cosine similarity, where a plurality of cosine similarity intervals are preset for the cosine similarity; and to determine the similarity distribution vector corresponding to each candidate entity according to the number of second word segments in each preset cosine similarity interval.
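The similarity distribution vector can be illustrated as a histogram over the preset cosine similarity intervals; the interval boundaries chosen below are illustrative assumptions, as the embodiment leaves them open.

```python
def similarity_distribution(similarities,
                            bins=((-1.0, 0.0), (0.0, 0.5),
                                  (0.5, 0.8), (0.8, 1.0))):
    # Count how many second-word-segment similarities fall into each
    # preset cosine similarity interval; the counts form the vector.
    vector = [0] * len(bins)
    for s in similarities:
        for i, (lo, hi) in enumerate(bins):
            # half-open intervals, with the top bin closed at 1.0
            if lo <= s < hi or (hi == 1.0 and s == 1.0):
                vector[i] += 1
                break
    return vector
```

The resulting fixed-length vector is what would be fed to the first model to produce the distribution score.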
In one possible design, the obtaining unit 701 is specifically configured to obtain word frequency information corresponding to entity information of each candidate entity.
The determining unit 702 is specifically configured to determine a weight value corresponding to each candidate entity according to the word frequency information, and determine a second matching score corresponding to each candidate entity according to the weight value and the first matching score. And if the second matching score exceeds the first preset threshold, determining the candidate entity as a target candidate entity.
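A sketch of the word frequency weighting follows. The IDF-style rarity weight is one plausible instantiation of "determining a weight value according to the word frequency information"; the embodiment does not fix a particular formula, and all names here are illustrative.

```python
import math

def candidate_weight(doc_freqs, total_docs):
    # IDF-style rarity weight derived from word frequency information:
    # rarer terms in the entity information yield a larger weight.
    # (Illustrative assumption; the patent leaves the formula open.)
    idfs = [math.log(total_docs / (1 + df)) for df in doc_freqs]
    return sum(idfs) / len(idfs)

def target_candidates(first_scores, weights, threshold):
    # Second matching score = weight * first matching score; keep the
    # candidates whose second score exceeds the first preset threshold.
    return [i for i, (s, w) in enumerate(zip(first_scores, weights))
            if s * w > threshold]
```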
In one possible design, the determining unit 702 is specifically configured to perform word segmentation processing on the context content and the entity information corresponding to the target candidate entity, so as to obtain a third word segment included in the context content and a fourth word segment included in the entity information corresponding to the target candidate entity; calculate the cosine similarity of the word vector corresponding to the third word segment and the word vector corresponding to the fourth word segment; sort the fourth word segments according to the cosine similarity values between the word vector corresponding to the third word segment and the word vectors corresponding to the fourth word segments; determine target word segments among the fourth word segments according to the sorting result, where the cosine similarity between a target word segment and the third word segment exceeds a second preset threshold; and determine the entity information corresponding to the target word segments as the key information.
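The relevance ranking and key information selection just described can be sketched as below; the function signature and the pairing of each fourth word segment with its best similarity value are illustrative assumptions.

```python
def key_information(fourth_segments, similarities, threshold):
    # Rank the entity-information segments (fourth word segments) by
    # their cosine similarity to the context segments, then keep those
    # whose similarity exceeds the second preset threshold as the
    # key information.
    ranked = sorted(zip(fourth_segments, similarities),
                    key=lambda pair: pair[1], reverse=True)
    return [seg for seg, sim in ranked if sim > threshold]
```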
In one possible design, the determining unit 702 is specifically configured to perform similarity calculation on the context content and the key information through the entity disambiguation model, and determine a disambiguation score between the entity to be linked and the target candidate entity according to a calculation result of the similarity calculation.
In one possible design, the entity linking apparatus further includes a training unit 704. The obtaining unit 701 is further configured to obtain a pre-training language model, where the pre-training language model includes a word embedding layer and a task layer, the word embedding layer is configured to perform word vector representation on an input corpus, and the task layer is configured to complete a pre-training language task.
The training unit 704 is specifically configured to train the pre-training language model according to the first corpus data, and update word embedding layer parameters of the pre-training language model.
The determining unit 702 is further configured to determine an entity disambiguation model to be trained according to the word embedding layer parameters of the updated pre-training language model.
The training unit 704 is further configured to train the entity disambiguation model to be trained according to second corpus data, where the second corpus data includes the entity sample to be disambiguated and labeling information of the entity sample to be disambiguated. And when the training result meets the training condition, obtaining the entity disambiguation model.
In one possible design, the obtaining unit 701 is specifically configured to obtain word embedding layer parameters of the updated pre-trained language model.
The determining unit 702 is further configured to initialize parameters of a word embedding layer of the to-be-trained entity disambiguation model according to the updated word embedding layer parameters of the pre-training language model, where the to-be-trained entity disambiguation model includes a word embedding layer and a task layer, and the word embedding layer of the to-be-trained entity disambiguation model is the same as model parameters of the word embedding layer of the pre-training language model.
In one possible design, the determining unit 702 is specifically configured to determine the target candidate entity as the target entity if the disambiguation score between the entity to be linked and the target candidate entity exceeds a third preset threshold. And if the disambiguation scores corresponding to the target candidate entities do not exceed a third preset threshold, determining the target candidate entity with the highest disambiguation score as the target entity.
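The two-branch decision rule above can be sketched as follows; the representation of candidates as (name, score) pairs is an illustrative assumption.

```python
def pick_target_entity(scored_candidates, threshold):
    # If any disambiguation score exceeds the third preset threshold,
    # choose among those candidates; otherwise fall back to the target
    # candidate entity with the highest disambiguation score.
    above = [(e, s) for e, s in scored_candidates if s > threshold]
    pool = above if above else scored_candidates
    return max(pool, key=lambda pair: pair[1])[0]
```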
In a possible design, the entity linking apparatus further comprises a receiving unit 705.
The receiving unit 705 is specifically configured to receive a text query instruction.
The obtaining unit 701 is further configured to obtain a query text according to the text query instruction, where the query text is a text to be identified.
The determining unit 702 is further configured to determine at least one query result related to the query text according to the linking entity, and feed back the at least one query result.
The present application further provides a computer device that may be deployed on a server. Referring to fig. 8, fig. 8 is a schematic diagram of an embodiment of a computer device according to the present application. The server 800 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 842 or data 844. The memory 832 and the storage medium 830 may be transitory or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processor 822 may be configured to communicate with the storage medium 830 to execute the series of instruction operations in the storage medium 830 on the server 800.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 8.
In an embodiment of the present application, the CPU 822 included in the server is configured to perform the steps performed by the server in the embodiments shown in FIGS. 2-5.
Embodiments of the present application also provide a computer-readable storage medium having a computer program stored therein, which when run on a computer causes the computer to perform the steps performed by the server in the method described in the embodiments of fig. 2 to 5 as described above.
There is also provided in an embodiment of the application a computer program product comprising a program which, when run on a computer, causes the computer to perform the steps performed by the server in the method described in the embodiments of figures 2 to 5 as described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (23)

1. A method of entity linking, the method comprising:
Acquiring an entity to be linked and context content corresponding to the entity to be linked in a text to be identified;
determining a plurality of candidate entities and entity information of each candidate entity in the plurality of candidate entities in a knowledge graph according to the entity to be linked;
Matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score;
Determining a target candidate entity in the plurality of candidate entities according to the first matching score;
performing relevance ranking on entity information corresponding to the target candidate entity according to the context content;
Determining key information in entity information corresponding to the target candidate entity according to the relevance sorting result;
inputting the entity to be linked, the context content, the target candidate entity and the key information into an entity disambiguation model, and acquiring disambiguation scores between the entity to be linked and the target candidate entity through the entity disambiguation model;
Determining a target entity in the target candidate entities based on the disambiguation score between the entity to be linked and the target candidate entity;
determining the target entity as a link entity corresponding to the entity to be linked;
The acquiring, by the entity disambiguation model, the disambiguation score between the entity to be linked and the target candidate entity includes:
Performing similarity calculation on the context content and the key information through the entity disambiguation model;
and determining the disambiguation score between the entity to be linked and the target candidate entity according to the calculation result of the similarity calculation.
2. The method of claim 1, wherein the matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score includes:
performing word segmentation processing on the context content and the entity information corresponding to each candidate entity to obtain a first word segment included in the context content and a second word segment included in the entity information corresponding to each candidate entity;
Calculating cosine similarity of the word vector corresponding to the first word and the word vector corresponding to the second word;
And determining the first matching score according to the cosine similarity.
3. The method of claim 2, wherein said determining said first matching score from said cosine similarity comprises:
acquiring a similarity distribution vector corresponding to each candidate entity according to the cosine similarity;
Inputting the similarity distribution vector corresponding to each candidate entity into a first model, and acquiring a first matching score of the context content and the entity information corresponding to each candidate entity through the first model; the first model is used for determining distribution scores according to the similarity distribution vectors.
4. The method of claim 3, wherein the obtaining the similarity distribution vector corresponding to each candidate entity according to the cosine similarity includes:
Determining the number of second word segments in each preset cosine similarity interval according to the cosine similarity value; wherein a plurality of cosine similarity intervals are preset for the cosine similarity;
And determining the similarity distribution vector corresponding to each candidate entity according to the number of second word segments in each preset cosine similarity interval.
5. The method of claim 4, wherein determining a target candidate entity of the plurality of candidate entities based on the first matching score comprises:
acquiring word frequency information corresponding to the entity information of each candidate entity;
determining a weight value corresponding to each candidate entity according to the word frequency information, and determining a second matching score corresponding to each candidate entity according to the weight value and the first matching score;
And if the second matching score exceeds a first preset threshold, determining the candidate entity as the target candidate entity.
6. The method according to claim 1, wherein said relevance ranking of entity information corresponding to the target candidate entity according to the context content comprises:
performing word segmentation processing on the context content and the entity information corresponding to the target candidate entity to obtain a third word segment included in the context content and a fourth word segment included in the entity information corresponding to the target candidate entity;
Calculating cosine similarity of the word vector corresponding to the third word segmentation and the word vector corresponding to the fourth word segmentation;
Sorting the fourth word segment according to the value of cosine similarity of the word vector corresponding to the third word segment and the word vector corresponding to the fourth word segment;
the determining key information in the entity information corresponding to the target candidate entity according to the relevance ranking result comprises the following steps:
determining target word segments among the fourth word segments according to the sorting result of the fourth word segments, wherein the cosine similarity between the target word segments and the third word segment exceeds a second preset threshold;
and determining entity information corresponding to the target word segments as the key information.
7. The method of claim 1, wherein prior to obtaining the disambiguation score between the entity to be linked and the target candidate entity by the entity disambiguation model, the method further comprises:
The method comprises the steps of obtaining a pre-training language model, wherein the pre-training language model comprises a word embedding layer and a task layer, the word embedding layer is used for carrying out word vector representation on input linguistic data, and the task layer is used for completing a pre-training language task;
Training the pre-training language model according to the first corpus data, and updating word embedding layer parameters of the pre-training language model;
Determining an entity disambiguation model to be trained according to the updated word embedding layer parameters of the pre-training language model;
Training the entity disambiguation model to be trained according to second corpus data, wherein the second corpus data comprises entity samples to be disambiguated and labeling information of the entity samples to be disambiguated;
and when the training result meets the training condition, obtaining the entity disambiguation model.
8. The method of claim 7, wherein the determining the entity disambiguation model to be trained based on the updated word embedding layer parameters of the pre-trained language model comprises:
Acquiring word embedding layer parameters of the updated pre-training language model;
initializing parameters of a word embedding layer of the entity disambiguation model to be trained according to the updated word embedding layer parameters of the pre-training language model, wherein the entity disambiguation model to be trained comprises a word embedding layer and a task layer, and the word embedding layer of the entity disambiguation model to be trained is identical to the model parameters of the word embedding layer of the pre-training language model.
9. The method according to any one of claims 1 to 8, wherein determining a target entity of the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entity comprises:
If the disambiguation score between the entity to be linked and the target candidate entity exceeds a third preset threshold, determining that the target candidate entity is the target entity;
And if none of the disambiguation scores between the entity to be linked and the target candidate entities exceeds the third preset threshold, determining the target candidate entity with the highest disambiguation score as the target entity.
10. The method of claim 1, wherein before obtaining the entity to be linked and the context content corresponding to the entity to be linked in the text to be identified, the method further comprises:
Receiving a text query instruction;
Acquiring a query text according to the text query instruction, wherein the query text is the text to be identified;
after the target entity is determined to be the link entity corresponding to the entity to be linked, the method further includes:
determining at least one query result related to the query text according to the link entity;
And feeding back the at least one query result.
11. An apparatus for entity linking, the apparatus comprising:
The acquisition unit is used for acquiring the entity to be linked and the context content corresponding to the entity to be linked in the text to be identified;
The determining unit is used for determining a plurality of candidate entities and entity information of each candidate entity in the plurality of candidate entities in a knowledge graph according to the entity to be linked;
the matching unit is used for matching the context content with the entity information corresponding to each candidate entity to obtain a first matching score;
the determining unit is further configured to determine a target candidate entity from the plurality of candidate entities according to the first matching score;
The determining unit is further configured to perform relevance ranking on entity information corresponding to the target candidate entity according to the context content; determining key information in entity information corresponding to the target candidate entity according to the relevance sorting result; inputting the entity to be linked, the context content, the target candidate entity and the key information into an entity disambiguation model, and acquiring disambiguation scores between the entity to be linked and the target candidate entity through the entity disambiguation model;
The determining unit is further configured to determine a target entity in the target candidate entities based on a disambiguation score between the entity to be linked and the target candidate entity;
the determining unit is further configured to determine the target entity as a link entity corresponding to the entity to be linked;
wherein, the determining unit is specifically configured to:
Performing similarity calculation on the context content and the key information through the entity disambiguation model;
and determining the disambiguation score between the entity to be linked and the target candidate entity according to the calculation result of the similarity calculation.
12. The apparatus according to claim 11, wherein the matching unit is specifically configured to:
performing word segmentation processing on the context content and the entity information corresponding to each candidate entity to obtain a first word segment included in the context content and a second word segment included in the entity information corresponding to each candidate entity;
Calculating cosine similarity of the word vector corresponding to the first word and the word vector corresponding to the second word;
And determining the first matching score according to the cosine similarity.
13. The apparatus according to claim 12, wherein the matching unit is specifically configured to:
acquiring a similarity distribution vector corresponding to each candidate entity according to the cosine similarity;
Inputting the similarity distribution vector corresponding to each candidate entity into a first model, and acquiring a first matching score of the context content and the entity information corresponding to each candidate entity through the first model; the first model is used for determining distribution scores according to the similarity distribution vectors.
14. The apparatus according to claim 13, wherein the matching unit is specifically configured to:
Determining the number of second word segments in each preset cosine similarity interval according to the cosine similarity value; wherein a plurality of cosine similarity intervals are preset for the cosine similarity;
And determining the similarity distribution vector corresponding to each candidate entity according to the number of second word segments in each preset cosine similarity interval.
15. The apparatus of claim 14, wherein the obtaining unit is specifically configured to obtain word frequency information corresponding to entity information of each candidate entity;
The determining unit is specifically configured to determine a weight value corresponding to each candidate entity according to the word frequency information, and determine a second matching score corresponding to each candidate entity according to the weight value and the first matching score; and if the second matching score exceeds a first preset threshold, determining the candidate entity as the target candidate entity.
16. The apparatus according to claim 11, wherein the determining unit is specifically configured to:
performing word segmentation processing on the context content and the entity information corresponding to the target candidate entity to obtain a third word segment included in the context content and a fourth word segment included in the entity information corresponding to the target candidate entity;
Calculating cosine similarity of the word vector corresponding to the third word segmentation and the word vector corresponding to the fourth word segmentation;
Sorting the fourth word segment according to the value of cosine similarity of the word vector corresponding to the third word segment and the word vector corresponding to the fourth word segment;
determining target word segments among the fourth word segments according to the sorting result of the fourth word segments, wherein the cosine similarity between the target word segments and the third word segment exceeds a second preset threshold;
and determining entity information corresponding to the target word segments as the key information.
17. The apparatus of claim 11, further comprising a training unit;
The acquisition unit is further used for acquiring a pre-training language model, the pre-training language model comprises a word embedding layer and a task layer, the word embedding layer is used for carrying out word vector representation on input corpus, and the task layer is used for completing a pre-training language task;
the training unit is specifically configured to train the pre-training language model according to the first corpus data, and update word embedding layer parameters of the pre-training language model;
the determining unit is further used for determining an entity disambiguation model to be trained according to the updated word embedding layer parameters of the pre-training language model;
the training unit is further configured to train the entity disambiguation model to be trained according to second corpus data, where the second corpus data includes an entity sample to be disambiguated and labeling information of the entity sample to be disambiguated; and when the training result meets the training condition, obtaining the entity disambiguation model.
18. The apparatus according to claim 17, wherein the acquisition unit is configured to acquire the updated word embedding layer parameters of the pre-training language model;
the determining unit is further configured to initialize the parameters of the word embedding layer of the entity disambiguation model to be trained according to the updated word embedding layer parameters of the pre-training language model, where the entity disambiguation model to be trained includes a word embedding layer and a task layer, and the model parameters of the word embedding layer of the entity disambiguation model to be trained are the same as those of the word embedding layer of the pre-training language model.
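The parameter transfer described in claims 17–18 amounts to initializing one model's embedding layer from another's. A minimal stand-in sketch, assuming NumPy arrays for the layer weights (the class name, vocabulary size, and dimensions are illustrative, not from the patent):

```python
import numpy as np

class WordEmbeddingLayer:
    """Minimal stand-in for a model's word embedding layer."""
    def __init__(self, vocab_size, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(size=(vocab_size, dim))

# word embedding layer of the pre-training language model,
# assumed already updated by training on the first corpus data
pretrained_embedding = WordEmbeddingLayer(vocab_size=1000, dim=64, seed=42)

# entity disambiguation model to be trained: its word embedding layer is
# initialized with the pre-trained layer's parameters, so that both
# layers hold identical model parameters before disambiguation training
disambiguation_embedding = WordEmbeddingLayer(vocab_size=1000, dim=64)
disambiguation_embedding.weights = pretrained_embedding.weights.copy()
```

Copying (rather than aliasing) the weight matrix lets the disambiguation model's embeddings diverge during its own training on the second corpus data without mutating the pre-trained model.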
19. The apparatus according to any one of claims 11 to 18, wherein the determining unit is specifically configured to determine the target candidate entity as the target entity if the disambiguation score between the entity to be linked and the target candidate entity exceeds a third preset threshold; and, if none of the disambiguation scores between the entity to be linked and the target candidate entities exceeds the third preset threshold, to determine the target candidate entity with the highest disambiguation score as the target entity.
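The threshold-with-fallback selection of claim 19 can be sketched as below. The candidate names, scores, and the 0.8 threshold are invented for illustration; when several candidates exceed the threshold, this sketch picks the best of them, which is one reasonable reading of the claim.

```python
def pick_target_entity(disambiguation_scores, threshold=0.8):
    """disambiguation_scores: candidate entity -> disambiguation score
    against the entity to be linked.  A candidate exceeding the threshold
    is chosen; if none exceeds it, fall back to the highest scorer."""
    above = {e: s for e, s in disambiguation_scores.items() if s > threshold}
    pool = above if above else disambiguation_scores
    return max(pool, key=pool.get)

# one candidate clears the threshold -> it is the target entity
winner = pick_target_entity({"Apple Inc.": 0.91, "apple (fruit)": 0.32})

# no candidate clears the threshold -> fall back to the highest score
fallback = pick_target_entity({"Apple Inc.": 0.55, "apple (fruit)": 0.32})
```

In both toy cases the selected target entity is "Apple Inc.", once via the threshold branch and once via the highest-score fallback.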
20. The apparatus of claim 11, further comprising a receiving unit;
the receiving unit is specifically configured to receive a text query instruction;
the acquisition unit is further configured to acquire a query text according to the text query instruction, where the query text is the text to be identified;
the determining unit is further configured to determine at least one query result related to the query text according to the linked entity, and to feed back the at least one query result.
21. A computer device, comprising: memory, transceiver, processor, and bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory to implement the method of any one of claims 1 to 10;
and the bus system is configured to connect the memory and the processor, so that the memory and the processor communicate.
22. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 10.
23. A computer program product, characterized in that the computer program product comprises a program which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 10.
CN202110461405.5A 2021-04-27 2021-04-27 Method, device, equipment and storage medium for entity linking Active CN113761218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110461405.5A CN113761218B (en) 2021-04-27 2021-04-27 Method, device, equipment and storage medium for entity linking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110461405.5A CN113761218B (en) 2021-04-27 2021-04-27 Method, device, equipment and storage medium for entity linking

Publications (2)

Publication Number Publication Date
CN113761218A CN113761218A (en) 2021-12-07
CN113761218B true CN113761218B (en) 2024-05-10

Family

ID=78786911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110461405.5A Active CN113761218B (en) 2021-04-27 2021-04-27 Method, device, equipment and storage medium for entity linking

Country Status (1)

Country Link
CN (1) CN113761218B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113919347B (en) * 2021-12-14 2022-04-05 山东捷瑞数字科技股份有限公司 Method and device for extracting and matching internal link words of text data
CN114491318B (en) * 2021-12-16 2023-09-01 北京百度网讯科技有限公司 Determination method, device, equipment and storage medium of target information
CN113947087B (en) * 2021-12-20 2022-04-15 太极计算机股份有限公司 Label-based relation construction method and device, electronic equipment and storage medium
CN114330331B (en) * 2021-12-27 2022-09-16 北京天融信网络安全技术有限公司 Method and device for determining importance of word segmentation in link
CN115795051B (en) * 2022-12-02 2023-05-23 中科雨辰科技有限公司 Data processing system for acquiring link entity based on entity relationship
CN116127053B (en) * 2023-02-14 2024-01-02 北京百度网讯科技有限公司 Entity word disambiguation, knowledge graph generation and knowledge recommendation methods and devices
CN116306925B (en) * 2023-03-14 2024-05-03 中国人民解放军总医院 Method and system for generating end-to-end entity link

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009052277A1 (en) * 2007-10-17 2009-04-23 Evri, Inc. Nlp-based entity recognition and disambiguation
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN110390106A (en) * 2019-07-24 2019-10-29 中南民族大学 Semantic disambiguation method, device, equipment and storage medium based on bi-directional association
CN110555208A (en) * 2018-06-04 2019-12-10 北京三快在线科技有限公司 ambiguity elimination method and device in information query and electronic equipment
CN110569496A (en) * 2018-06-06 2019-12-13 腾讯科技(深圳)有限公司 Entity linking method, device and storage medium
CN112069826A (en) * 2020-07-15 2020-12-11 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN112434533A (en) * 2020-11-16 2021-03-02 广州视源电子科技股份有限公司 Entity disambiguation method, apparatus, electronic device, and computer-readable storage medium
WO2021073119A1 (en) * 2019-10-15 2021-04-22 平安科技(深圳)有限公司 Method and apparatus for entity disambiguation based on intention recognition model, and computer device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110106807A1 (en) * 2009-10-30 2011-05-05 Janya, Inc Systems and methods for information integration through context-based entity disambiguation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Cheikh Brahim El Vaigh et al.; Using Knowledge Base Semantics in Context-Aware Entity Linking; DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019; 2019; full text. *
Research on Clinical Medical Entity Linking Methods; Zhao Yahui; China Master's Theses Full-text Database (Medicine and Health Sciences); 2018-02-15; full text. *
A Survey of Entity Linking Research Based on Deep Learning; Li Tianran et al.; Acta Scientiarum Naturalium Universitatis Pekinensis (Natural Science Edition); 2020-10-13; Vol. 57, No. 01; full text. *

Also Published As

Publication number Publication date
CN113761218A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN113761218B (en) Method, device, equipment and storage medium for entity linking
CN111177569A (en) Recommendation processing method, device and equipment based on artificial intelligence
KR101754473B1 (en) Method and system for automatically summarizing documents to images and providing the image-based contents
CN106663124A (en) Generating and using a knowledge-enhanced model
CN107844533A (en) A kind of intelligent Answer System and analysis method
Vysotska et al. Development of Information System for Textual Content Categorizing Based on Ontology.
CN110717038B (en) Object classification method and device
CN113254678B (en) Training method of cross-media retrieval model, cross-media retrieval method and equipment thereof
JP2018022496A (en) Method and equipment for creating training data to be used for natural language processing device
JP7198408B2 (en) Trademark information processing device and method, and program
Rocha et al. Siameseqat: A semantic context-based duplicate bug report detection using replicated cluster information
CN109947923A (en) A kind of elementary mathematics topic type extraction method and system based on term vector
CN116955730A (en) Training method of feature extraction model, content recommendation method and device
CN113821587B (en) Text relevance determining method, model training method, device and storage medium
CN115222443A (en) Client group division method, device, equipment and storage medium
Greensmith et al. An artificial immune system approach to semantic document classification
Iyer et al. Image captioning-based image search engine: An alternative to retrieval by metadata
Kalra et al. Generation of domain-specific vocabulary set and classification of documents: weight-inclusion approach
Eshmawi et al. Design of Automated Opinion Mining Model Using Optimized Fuzzy Neural Network.
CN113515699A (en) Information recommendation method and device, computer-readable storage medium and processor
CN113569018A (en) Question and answer pair mining method and device
Ferland et al. Automatically resolve trouble tickets with hybrid NLP
CN113468311B (en) Knowledge graph-based complex question and answer method, device and storage medium
CN112966095B (en) Software code recommendation method based on JEAN
CN109189893A (en) A kind of method and apparatus of automatically retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant