CN110674304A - Entity disambiguation method and device, readable storage medium and electronic equipment

Entity disambiguation method and device, readable storage medium and electronic equipment

Info

Publication number
CN110674304A
CN110674304A
Authority
CN
China
Prior art keywords
entity
word
text
processed
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910952886.2A
Other languages
Chinese (zh)
Inventor
陈栋
齐云飞
付骁弈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN201910952886.2A priority Critical patent/CN110674304A/en
Publication of CN110674304A publication Critical patent/CN110674304A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an entity disambiguation method and device, a readable storage medium, and electronic equipment. The method comprises the following steps: first, a text to be processed containing at least two entities is input into an entity extraction language model for entity extraction, to obtain the entities contained in the text to be processed; the text to be processed is also input into a bidirectional language model for processing, to obtain a word vector sequence of the text to be processed; then, the word vector of each word in any entity is acquired; next, the word vector of that entity is computed from the word vectors of its words; the similarity between every two entities is then calculated from the word vectors of the entities; finally, the entities are clustered according to the pairwise similarities between the entities in the text to be processed, thereby realizing entity disambiguation. The scheme of the embodiments can improve the accuracy of entity disambiguation.

Description

Entity disambiguation method and device, readable storage medium and electronic equipment
Technical Field
The present application relates to the field of natural language processing, and in particular, to an entity disambiguation method, apparatus, readable storage medium, and electronic device.
Background
Text is composed of a large number of words, and the words that make up a text usually include many ambiguous entity words. After entity disambiguation is performed on a text, information extraction, text summary construction, and the like can be carried out on the disambiguation result, so the accuracy of entity disambiguation directly affects the accuracy of information extraction and other downstream processing.
In the prior art, one way of performing entity disambiguation is to carry out word vector training on a large number of related texts to obtain a word embedding matrix, extract entities from the text to be processed, convert the extracted entities into vectors, and cluster the entities to complete disambiguation. In this entity disambiguation mode, each word corresponds to exactly one word vector; if one word carries several meanings, this mode cannot distinguish the different meanings expressed. In addition, new words that do not exist in the word embedding matrix cannot be converted into vectors at all. The result of disambiguation using this entity disambiguation approach therefore has a large error.
Disclosure of Invention
In order to overcome at least the above-mentioned deficiencies in the prior art, it is an object of the present application to provide an entity disambiguation method comprising:
inputting a text to be processed comprising at least two entities into a pre-trained entity extraction language model for entity extraction to obtain entities included in the text to be processed;
inputting the text to be processed into a pre-trained bidirectional language model for processing to obtain a word vector sequence of the text to be processed, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words;
aiming at any entity in the text to be processed, acquiring a word vector of each word in the any entity from the word vector sequence according to the position of the any entity and each word in the any entity in the text;
calculating to obtain a word vector of any entity according to the word vector of each word in any entity;
calculating the similarity between every two entities according to the word vectors of the entities;
and clustering the entities according to the similarity between every two entities in the text to be processed so as to realize entity disambiguation.
Optionally, the step of obtaining, for any entity in the text to be processed, a word vector of each word in the any entity from the word vector sequence according to the position of the any entity and each word in the any entity in the text includes:
obtaining an identification sequence based on the text to be processed, wherein the position of a word in each entity in the text to be processed is represented by a first identifier, and other words except the entities are represented by second identifiers;
and aiming at the any entity, acquiring a word vector of a corresponding position in the word vector sequence according to the position of the first identifier of each word in the any entity in the identification sequence, thereby acquiring the word vector of each word in the any entity.
Optionally, the step of obtaining the word vector of the arbitrary entity by calculating the word vector of each word in the arbitrary entity includes calculating an average vector of the word vectors of all the words in the entity, and using the average vector as the word vector of the entity.
Optionally, the step of calculating the similarity between each two entities according to the word vector of each entity includes calculating the similarity between each two entities by using a cosine similarity algorithm.
Optionally, before the step of inputting the text to be processed including at least two entities into a pre-trained entity extraction language model for entity extraction to obtain the entities included in the text to be processed, the method further includes:
inputting a plurality of training texts with marked entities as training samples into an entity extraction language model for training;
comparing the output entity label with the labeled entity, and calculating to obtain a loss function value of the training;
and if the loss function value is smaller than a preset loss value, judging that the entity extraction language model is trained, if the loss function value is not smaller than the preset loss value, adjusting parameters in the entity extraction language model, inputting a plurality of training texts marked with entities as training samples into the entity extraction language model after the parameters are adjusted, and repeating the steps until the loss function value is smaller than the preset loss value.
It is another object of the present application to provide an entity disambiguation apparatus comprising:
the entity extraction module is used for inputting the text to be processed comprising at least two entities into a pre-trained entity extraction language model for entity extraction to obtain the entities contained in the text to be processed;
the word vector acquisition module is used for inputting the text to be processed into a pre-trained bidirectional language model for processing to obtain a word vector sequence of the text to be processed, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words;
a word vector corresponding module, configured to, for any entity in the text to be processed, obtain, from the word vector sequence, a word vector for each word in the any entity according to the any entity and a position of each word in the any entity in the text;
the word vector calculation module is used for calculating and obtaining the word vector of any entity according to the word vector of each word in any entity;
the similarity calculation module is used for calculating the similarity between every two entities according to the word vectors of the entities;
and the entity disambiguation module is used for clustering the entities according to the similarity between every two entities in the text to be processed so as to realize entity disambiguation.
Optionally, the word vector correspondence module is specifically configured to:
obtaining an identification sequence based on the text to be processed, wherein the position of a word in each entity in the text to be processed is represented by a first identifier, and other words except the entities are represented by second identifiers;
and aiming at the any entity, acquiring a word vector of a corresponding position in the word vector sequence according to the position of the first identifier of each word in the any entity in the identification sequence, thereby acquiring the word vector of each word in the any entity.
Optionally, the word vector calculation module is specifically configured to calculate an average vector of word vectors of all words in the entity, and use the average vector as the word vector of the entity.
Another object of the present application is to provide a readable storage medium storing an executable program which, when executed by a processor, implements the method according to any embodiment of the present application.
Another object of the present application is to provide an electronic device comprising a memory and a processor, the memory being in communication with the processor, the memory having stored therein an executable program, and the processor implementing the method according to any embodiment of the present application when executing the executable program.
Compared with the prior art, the method has the following beneficial effects:
according to the entity disambiguation method, the entity disambiguation device, the readable storage medium and the electronic device, each entity in the text is extracted, the word vector of each word of the entity is obtained according to the position of each word in the text, the word vector of each entity is calculated, the entity disambiguation is performed after the similarity is calculated according to the word vector of each entity, and because each word vector is related to the context of each word, the vector expression of the same entity with different positions is also related to the context, so that the entity disambiguation accuracy can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 2 is a first flowchart of an entity disambiguation method provided in an embodiment of the present application;
FIG. 3 is a second flowchart of an entity disambiguation method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of an entity disambiguation entity extraction result provided in an embodiment of the present application;
FIG. 5 is a third flowchart of an entity disambiguation method provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a similarity matrix provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the structure of input data for a bi-directional language model;
FIG. 8 is a first schematic diagram of predicting masked words during bidirectional language model training;
FIG. 9 is a second schematic diagram of predicting masked words during bidirectional language model training;
fig. 10 is a functional block diagram of an entity disambiguation apparatus provided in an embodiment of the present application.
Icon: 100-an electronic device; 110-entity disambiguation means; 111-entity extraction module; 112-word vector acquisition module; 113-word vector correspondence module; 114-a word vector calculation module; 115-similarity calculation module; 116-entity disambiguation module; 120-a memory; 130-a processor.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it should further be noted that, unless expressly stated or limited otherwise, the terms "disposed", "mounted", "connected", and "coupled" are to be construed broadly, for example, as a fixed connection, a removable connection, or an integral connection; as a mechanical or an electrical connection; as a direct connection, an indirect connection through an intermediate medium, or an internal communication between two elements. The specific meanings of the above terms in the present application can be understood by those of ordinary skill in the art on a case-by-case basis.
With the development of technologies related to natural language processing, natural language processing techniques are increasingly applied in various fields. For example, entity disambiguation is an important technical means in information extraction, text summary construction, topic modeling, domain knowledge mining, and automatic translation of professional literature. A very important expression form of natural language is text. Text generally refers to a sentence, or a combination of sentences, having a complete and systematic meaning, and a sentence is generally composed of words. Among the words composing a text, many that are written differently may express the same meaning, while identical words may express different meanings; therefore, the disambiguation result for an entity will often directly affect the other processes that consume that result.
In the prior art, there are generally two methods for implementing entity disambiguation. The first is the clustering method based on Word Embedding: word vector training is performed on a large number of texts of the related industry to obtain a word embedding matrix. After the word embedding matrix is obtained, the entities extracted from the text to be processed are converted into vectors, clustering is performed according to the word embedding matrix, and similar words are grouped into one class. The first drawback of this disambiguation method is that each word has only one word vector, so the problem of polysemy cannot be solved. A second drawback is that entities at the phrase level (noun phrases, verb phrases, adjective phrases, etc.) cannot be disambiguated. In addition, because the word embedding matrix is derived from existing text, if a word never appears in the text used for word vector training, no vector for that word exists in the final word vector matrix, so the words in the text to be processed cannot always be converted into word vectors.
Another disambiguation approach in the prior art is the knowledge-graph based method. In this method, an entity extracted from a text is aligned with an entity in an existing knowledge graph (entity linking); if several different entity mentions in the text can be aligned to the same entity in the knowledge graph, those mentions are considered to have the same semantics. This approach likewise cannot disambiguate phrase-level entities (noun phrases, verb phrases, adjective phrases, etc.). In addition, when computing entity similarity, this approach generally requires a relatively complex model such as a neural network.
In summary, a common disadvantage of the existing entity disambiguation techniques is that the accuracy of entity disambiguation is low.
Referring to fig. 1, fig. 1 is a schematic block diagram of a structure of an electronic device 100 provided in an embodiment of the present application, where the electronic device 100 includes an entity disambiguation apparatus 110, a memory 120, and a processor 130, and the memory 120 and the processor 130 are electrically connected to each other directly or indirectly for implementing data interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The entity disambiguation apparatus 110 includes at least one software function module which may be stored in the memory 120 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 130 is used to execute executable modules stored in the memory 120, such as software functional modules and computer programs included in the entity disambiguation apparatus 110.
In order to solve the above problem, the present embodiment provides an entity disambiguation method applicable to the electronic device 100, please refer to fig. 2, where the method includes steps S010 to S060.
And S010, adopting an entity extraction language model to extract entities to obtain the entities included in the text to be processed.
Specifically, a text to be processed including at least two entities is input into a pre-trained entity extraction language model for entity extraction, and the entities included in the text to be processed are obtained.
In this embodiment, an entity refers to a character, a word, or a phrase having a specific meaning in the text, where the phrase may be a noun phrase, a verb phrase, an adjective phrase, or the like.
And step S020, acquiring a word vector sequence of the text to be processed by adopting a bidirectional language model.
Specifically, the text to be processed is input into a pre-trained bidirectional language model for processing, and a word vector sequence of the text to be processed is obtained, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words.
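To make step S020 concrete, the following minimal sketch shows one way such a context-dependent word vector sequence could be obtained. It is an illustration only: the patent does not name a toolkit, so the HuggingFace transformers library, the public bert-base-chinese checkpoint, and the sample sentence are all assumptions.

```python
# Illustrative sketch only: `transformers` and `bert-base-chinese` stand in
# for the pre-trained bidirectional language model described in the patent.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

text = "乔布斯拿着吃了一半的苹果去了苹果公司。"  # sample text to be processed
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One context-dependent vector per token, in sentence order: the patent's
# "word vector sequence" of the text to be processed.
word_vector_sequence = outputs.last_hidden_state[0]  # shape: (seq_len, hidden_dim)
print(word_vector_sequence.shape)
```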
Step S030, each word vector corresponding to each entity is obtained.
Specifically, for any entity in the text to be processed, the word vector of each word in the any entity is obtained from the word vector sequence according to the position of the any entity and each word in the any entity in the text.
Step S040, a word vector of the arbitrary entity is obtained by calculation according to the word vector of each word in the arbitrary entity.
And S050, calculating the similarity between every two entities according to the word vectors of the entities.
And step S060, clustering the entities according to the similarity between every two entities in the text to be processed so as to implement entity disambiguation.
In this embodiment, the entities in the text to be processed are extracted with the entity extraction language model, and a word vector sequence that depends on the context of the text to be processed is extracted with the bidirectional pre-trained model. The word vector of each word in each extracted entity is then obtained from the word vector sequence, and the context-dependent word vector of each entity is computed from the word vectors of the words in that entity. In this way, every word vector reflects the word itself, its position, and its context; for a polysemous word, for example, different word vectors are obtained at different positions. Disambiguating with the entity word vectors obtained by this scheme can therefore recognize both polysemy (one word with several meanings) and synonymy (several words with the same meaning), improving the accuracy of entity disambiguation.
Referring to fig. 3, optionally, in the present embodiment, step S030 includes sub-steps S031-S032.
Step S031, an identification sequence is obtained based on the text to be processed, where a position of a word in each entity in the text to be processed is represented by a first identifier, and words other than the entity are represented by a second identifier.
Step S032, a word vector is obtained according to the first identifier of each word in the arbitrary entity.
Specifically, for the any entity, a word vector of a corresponding position in a word vector sequence is obtained according to the position of the first identifier of each word in the any entity in the identification sequence, so as to obtain the word vector of each word in the any entity.
The embodiment is used for respectively acquiring the word vector of each word in each entity.
The following explains, with a practical example, how to obtain the word vector of each word in an entity. Suppose the text to be processed is "Jobs (full name: Steve Jobs) went to Apple Inc. holding a half-eaten apple." and the entities extracted in step S010 are "Jobs", "Steve Jobs", "apple", and "Apple"; referring to fig. 4, mean denotes the average vector computed over the word vectors within an entity. If the first identifier is 1 and the second identifier is 0, the identification sequence obtained from the text to be processed marks each character belonging to an entity with 1 and every other character with 0; the word vector of the word at each first-identifier position of an entity can then be obtained from the word vector sequence according to the position of that first identifier in the identification sequence.
In this embodiment, the first identifier may also be represented by 1, 2, and 3, where 1 represents the beginning of the entity, 2 represents the middle of the entity, 3 represents the end of the entity, and the second identifier is represented by 0. At this time, the identification sequence obtained from the text to be processed is "1, 2, 3, 0, 1, 2, 3, 0, 1, 3, 0". Then, a word vector of a word corresponding to the first identifier position of the entity may be obtained from the word vector sequence according to the position of the first identifier corresponding to the entity in the identification sequence.
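As a minimal sketch of sub-steps S031 and S032 (illustrative code, not the patent's implementation; the 1/0 tagging scheme follows the example above, and the helper names are assumptions), the identification sequence can be turned into entity spans, and each span can index the word vector sequence:

```python
from typing import List, Tuple

def entity_spans(id_sequence: List[int]) -> List[Tuple[int, int]]:
    """Recover (start, end) spans of consecutive first identifiers (1)."""
    spans, start = [], None
    for i, tag in enumerate(id_sequence + [0]):  # sentinel 0 closes a trailing span
        if tag == 1 and start is None:
            start = i
        elif tag != 1 and start is not None:
            spans.append((start, i))
            start = None
    return spans

def entity_word_vectors(span, word_vector_sequence):
    """Word vector of each word in one entity, looked up by position (S032)."""
    start, end = span
    return word_vector_sequence[start:end]

id_sequence = [1, 1, 0, 0, 1, 1, 1, 0]                  # toy 8-character text, two entities
word_vector_sequence = [[float(i)] for i in range(8)]    # stand-in 1-d word vectors
for span in entity_spans(id_sequence):
    print(span, entity_word_vectors(span, word_vector_sequence))
```

Note that with the plain 1/0 scheme two adjacent entities would merge into a single span; the 1/2/3 begin/middle/end identifiers described above remove exactly that ambiguity.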
Optionally, in this embodiment, the step of obtaining the word vector of the arbitrary entity by calculating the word vector of each word in the arbitrary entity includes calculating an average vector of the word vectors of all the words in the entity, and taking the average vector as the word vector of the entity.
This embodiment averages the word vectors contained in an entity to obtain the entity vector. The averaging improves the accuracy of the entity vector, making the entity disambiguation result more accurate.
For example, in the text to be processed listed above, each character of "Steve" (史, 蒂, 夫) corresponds to a 300-dimensional vector, so "Steve" corresponds to three 300-dimensional vectors; after the three vectors are averaged element-wise, a single 300-dimensional vector is obtained, which is the entity expression, i.e., the word vector, of "Steve".
To aid understanding, the following example illustrates the word vector averaging process:
In an entity comprising three words, suppose the word vectors of the three words are (a1, a2, a3), (b1, b2, b3), and (c1, c2, c3); the word vector of this entity is then ((a1+b1+c1)/3, (a2+b2+c2)/3, (a3+b3+c3)/3).
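The same averaging written as a short numpy sketch (the function name is an illustrative assumption):

```python
import numpy as np

def entity_vector(word_vectors: np.ndarray) -> np.ndarray:
    """Average the word vectors of all words in an entity (step S040)."""
    return word_vectors.mean(axis=0)

# Three toy word vectors for an entity of three words.
vectors = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [7.0, 8.0, 9.0]])
print(entity_vector(vectors))  # [4. 5. 6.] = ((1+4+7)/3, (2+5+8)/3, (3+6+9)/3)
```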
Optionally, in this embodiment, the step of calculating the similarity between each two entities according to the word vector of each entity includes calculating the similarity between each two entities by using a cosine similarity algorithm.
The embodiment is used for specifically calculating the similarity between every two entities in the text to be processed.
Again to aid understanding, the following example illustrates the calculation of entity similarity:
Suppose the word vector of entity A is expressed as [1, 2, 3, 4, 1], and the word vector of entity B is expressed as [1, 2, 3, 4, 3].
The cosine similarity S is computed as S = M/N, where M is the dot product of word vector A and word vector B, and N is the product of the magnitudes of word vector A and word vector B. M and N are respectively:
M = 1×1 + 2×2 + 3×3 + 4×4 + 1×3 = 33
N = √(1²+2²+3²+4²+1²) × √(1²+2²+3²+4²+3²) = √31 × √39 ≈ 34.77
Finally, the cosine similarity S = 33/34.77 ≈ 0.949 is obtained.
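The worked example can be checked with a few lines of numpy (a hedged sketch; the numbers match the calculation above):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """S = M / N: dot product over the product of the vector magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([1, 2, 3, 4, 1], dtype=float)
B = np.array([1, 2, 3, 4, 3], dtype=float)
print(cosine_similarity(A, B))  # ≈ 0.949, matching the worked example
```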
Referring to fig. 5, optionally, in this embodiment, before step S010, the method further includes step S110 to step S130.
Step S110, inputting a plurality of training texts with marked entities as training samples into an entity extraction language model for training.
And step S120, comparing the output entity label with the labeled entity, and calculating to obtain a loss function value of the training.
Step S130, determining whether the loss function value is smaller than a preset loss value.
Step S140, if the loss function value is not less than the preset loss value, adjusting parameters in the entity extraction language model, inputting a plurality of training texts marked with entities as training samples into the entity extraction language model after parameter adjustment for training, and repeating the steps until the loss function value is less than the preset loss value.
And if the loss function value is smaller than a preset loss value, judging that the entity extraction language model training is finished.
The embodiment is used for carrying out model training to obtain the entity extraction language model capable of carrying out entity extraction.
In this embodiment, when the entity extraction language model is trained, the training text may be annotated with entities according to the types of entities that actually need to be extracted, and training may then be performed using an existing entity language model training method. The trained entity extraction language model can thus extract entities at the phrase level (noun phrases, verb phrases, adjective phrases, etc.), so vectors of phrase-level entities can be obtained and disambiguation of phrase-level entities can be achieved.
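The following schematic sketch mirrors the training loop of steps S110 to S140. It is a toy stand-in under stated assumptions: the patent fixes neither an architecture nor a loss function, so a small linear tagger, synthetic annotations, cross-entropy loss, and a step cap are used purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 4)                        # toy stand-in for the entity extraction language model
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
preset_loss = 0.1                              # the preset loss value of step S130

features = torch.randn(64, 8)                  # stand-in for featurized training texts
teacher = nn.Linear(8, 4)                      # synthetic "annotation" so the toy task is learnable
labels = teacher(features).argmax(dim=1).detach()

for step in range(5000):                       # safety cap for this toy example
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)    # S120: compare output tags with the annotations
    if loss.item() < preset_loss:              # S130: loss below the preset value -> trained
        break
    loss.backward()
    optimizer.step()                           # S140: adjust parameters and repeat

print(f"stopped at step {step}, loss {loss.item():.4f}")
```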
Finally, clustering can be carried out according to the obtained word vectors, thereby realizing entity disambiguation. For ease of understanding, a specific clustering method is illustrated using the text to be processed "Jobs (full name: Steve Jobs) went to Apple Inc. holding a half-eaten apple." as an example.
In this text to be processed, the extracted entities include "Jobs", "Steve Jobs", "apple" (the fruit), and "Apple" (the company), and the similarity between every two of the four entities is calculated. Fig. 6 shows the similarity matrix formed by these pairwise similarities. Clustering can then be performed according to the similarities between the entities.
As can be seen from the similarity matrix, the diagonal elements all equal 1, meaning that every entity has a similarity of 100% with itself. For example, "Jobs" has a similarity of 93% with "Steve Jobs", indicating that the two entities are highly similar and probably belong to the same category. By contrast, "apple (the fruit)" and "Apple (the company)" share the same surface name, but the similarity calculated from their vector expressions is very low, indicating that the two are not the same thing. The purpose of clustering is to group elements with high similarity together and separate elements with low similarity; the elements that end up in the same class denote the same thing, for example: Jobs, Steve Jobs, old Jobs, Apple founder Jobs, Jobs of Apple in the United States.
When clustering is performed according to a similarity threshold, entities whose pairwise similarity is greater than the threshold may be grouped into one class.
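A minimal sketch of this threshold clustering follows (illustrative: the patent does not prescribe a specific clustering algorithm, and all similarity values except the 93% mentioned above are made up for the example). Entities whose pairwise similarity exceeds the threshold are joined into one class via a simple union-find.

```python
def threshold_cluster(sim, names, threshold=0.8):
    """Group entities whose pairwise similarity exceeds the threshold."""
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sim[i][j] > threshold:
                parent[find(i)] = find(j)  # merge the two entities' classes

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())

names = ["Jobs", "Steve Jobs", "apple (fruit)", "Apple (company)"]
sim = [[1.00, 0.93, 0.20, 0.35],
       [0.93, 1.00, 0.18, 0.30],
       [0.20, 0.18, 1.00, 0.25],
       [0.35, 0.30, 0.25, 1.00]]
print(threshold_cluster(sim, names))
# [['Jobs', 'Steve Jobs'], ['apple (fruit)'], ['Apple (company)']]
```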
In this embodiment, the bidirectional language model may be a model trained by Google with an attention mechanism, for example the Bert model. The Bert model uses 12 layers of Transformer encoders to extract features from the input sentences, and its processing of data proceeds roughly as follows: the input is first segmented with the WordPiece tool and special separators are inserted (e.g. [CLS], used to separate samples, and [SEP], used to separate different sentences within a sample); the input then passes through a Token Embeddings layer, which converts each word into a vector of fixed dimension, a Position Embeddings layer, which encodes the position of each word in the sentence, and a Segment Embeddings layer, which distinguishes the input sentences; in addition, Masking is applied to mask some of the characters or words in a sentence. Taking the training text "my dog is cute, he likes playing." as an example, after this text is input into the above network structure of the Bert model, the result obtained is as shown in fig. 7, where Input denotes the input layer; E[CLS] denotes the vector of the separator "[CLS]" after processing by the Token Embeddings layer; Emy, Edog, Eis, Ecute, Ehe, Elikes, Eplay, and Eing denote the vectors of "my", "dog", "is", "cute", "he", "likes", "play", and "ing" after processing by the Token Embeddings layer; and E[SEP] denotes the vector of the separator "[SEP]" after processing by the Token Embeddings layer. EA indicates that the word at the corresponding position belongs to sentence type A, and EB that it belongs to sentence type B. Ei (i a natural number) is the position vector of a word, representing the position of the word in the sentence.
For the Bert model, characters or words in the training text are randomly masked during training, and the masked characters or words are then predicted by the Transformer network (the 12 layers of Transformer encoders in the middle of the model). For example, suppose the training text is "Harbin is the capital of Heilongjiang Province and a famous international ice-and-snow culture city" (哈尔滨是黑龙江的省会，国际冰雪文化名城). When the masked characters are "尔", "黑", "国", and "雪", the masked text is input into the Transformer network, which outputs the predictions for the masked characters: "尔", "黑", "国", and "雪". When the masked words are "哈尔滨" (Harbin) and "冰雪" (ice and snow), the masked text is input into the Transformer network, which outputs the predictions: "哈尔滨" and "冰雪". Schematic diagrams of this prediction during model training are shown in fig. 8 and fig. 9.
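The following hedged sketch illustrates masked-character prediction of this kind; the transformers library and the public bert-base-chinese checkpoint are assumptions standing in for the model described above.

```python
# Illustrative sketch: mask one character and let a pre-trained masked
# language model predict it, as in the Harbin example above.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

text = "哈尔滨是黑龙江的省会"                  # "Harbin is the capital of Heilongjiang"
inputs = tokenizer(text, return_tensors="pt")
masked_pos = 2                                 # the character "尔" (position after [CLS])
inputs["input_ids"][0, masked_pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits[0, masked_pos].argmax(-1).item()
print(tokenizer.decode([predicted_id]))        # a well-trained model recovers "尔"
```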
In this embodiment, the bi-directional language model may also be trained using LSTM, CNN, or RNN.
Referring to fig. 10, an embodiment of the present application further provides an entity disambiguation apparatus 110, which includes an entity extraction module 111, a word vector acquisition module 112, a word vector correspondence module 113, a word vector calculation module 114, a similarity calculation module 115, and an entity disambiguation module 116. The entity disambiguation apparatus 110 includes software function modules that may be stored in the memory 120 in the form of software or firmware or solidified in the operating system (OS) of the electronic device 100.
The entity extraction module 111 is configured to input a to-be-processed text including at least two entities into a pre-trained entity extraction language model for entity extraction, so as to obtain entities included in the to-be-processed text.
The entity extraction module 111 in this embodiment is used in the step S010, and the specific description about the entity extraction module 111 may refer to the description about the step S010.
A word vector obtaining module 112, configured to input the to-be-processed text into a pre-trained bi-directional language model for processing, so as to obtain a word vector sequence of the to-be-processed text, where the word vector sequence is formed by arranging word vectors of words in the to-be-processed text according to a sequence of the words in the to-be-processed text, and the word vectors are calculated according to a context of the words;
the word vector obtaining module 112 in this embodiment is used in step S020, and specific description about the word vector obtaining module 112 may refer to the description of step S020.
And a word vector corresponding module 113, configured to, for any entity in the text to be processed, obtain, from the word vector sequence, a word vector of each word in the any entity according to the position of the any entity and each word in the any entity in the text.
The word vector correspondence module 113 in this embodiment is used in step S030, and the detailed description of the word vector correspondence module 113 can refer to the description of step S030.
And the word vector calculation module 114 is configured to calculate and obtain a word vector of the arbitrary entity according to the word vector of each word in the arbitrary entity.
The word vector calculation module 114 in this embodiment is used in step S040, and the detailed description about the word vector calculation module 114 may refer to the description of step S040.
And the similarity calculation module 115 is configured to calculate a similarity between each two entities according to the word vector of each entity.
The similarity calculation module 115 in this embodiment is used in step S050, and specific description about the similarity calculation module 115 may refer to the description of step S050.
And the entity disambiguation module 116 is configured to cluster entities according to the similarity between every two entities in the text to be processed, so as to implement entity disambiguation.
The entity disambiguation module 116 in this embodiment is used in step S060, and the detailed description about the entity disambiguation module 116 may refer to the description about step S060.
Another object of the present application is to provide a readable storage medium storing an executable program; when the processor 130 executes the executable program, the method according to any one of the embodiments can be implemented.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of entity disambiguation, the method comprising:
inputting a text to be processed comprising at least two entities into a pre-trained entity extraction language model for entity extraction to obtain entities included in the text to be processed;
inputting the text to be processed into a pre-trained bidirectional language model for processing to obtain a word vector sequence of the text to be processed, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words;
aiming at any entity in the text to be processed, acquiring a word vector of each word in the any entity from the word vector sequence according to the position of the any entity and each word in the any entity in the text;
calculating to obtain a word vector of any entity according to the word vector of each word in any entity;
calculating the similarity between every two entities according to the word vectors of the entities;
and clustering the entities according to the similarity between every two entities in the text to be processed so as to realize entity disambiguation.
2. The method according to claim 1, wherein the step of obtaining, for any entity in the text to be processed, a word vector for each word in the any entity from the word vector sequence according to the position of the any entity and each word in the any entity in the text comprises:
obtaining an identification sequence based on the text to be processed, wherein the position of a word in each entity in the text to be processed is represented by a first identifier, and other words except the entities are represented by second identifiers;
and aiming at the any entity, acquiring a word vector of a corresponding position in a word vector sequence according to the position of the first identifier of each word in the any entity in the identification sequence, thereby acquiring the word vector of each word in the any entity.
3. The method of claim 1, wherein the step of obtaining the word vector of the arbitrary entity by calculating the word vector of each word in the arbitrary entity comprises calculating an average vector of the word vectors of all words in the entity, and using the average vector as the word vector of the entity.
4. The method of claim 1, wherein the step of calculating the similarity between each two entities according to the word vectors of the respective entities comprises calculating the similarity between each two entities by a cosine similarity algorithm.
5. The method according to claim 1, wherein before the step of inputting the text to be processed including at least two entities into the pre-trained entity extraction language model for entity extraction to obtain the entities included in the text to be processed, the method further comprises:
inputting a plurality of training texts with marked entities as training samples into an entity extraction language model for training;
comparing the output entity label with the labeled entity, and calculating to obtain a loss function value of the training;
and if the loss function value is smaller than a preset loss value, judging that the entity extraction language model is trained, if the loss function value is not smaller than the preset loss value, adjusting parameters in the entity extraction language model, inputting a plurality of training texts marked with entities as training samples into the entity extraction language model after the parameters are adjusted, and repeating the steps until the loss function value is smaller than the preset loss value.
6. An entity disambiguation apparatus, the apparatus comprising:
the entity extraction module is used for inputting the text to be processed comprising at least two entities into a pre-trained entity extraction language model for entity extraction to obtain the entities contained in the text to be processed;
the word vector acquisition module is used for inputting the text to be processed into a pre-trained bidirectional language model for processing to obtain a word vector sequence of the text to be processed, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words;
a word vector corresponding module, configured to, for any entity in the text to be processed, obtain, from the word vector sequence, a word vector for each word in the any entity according to the any entity and a position of each word in the any entity in the text;
the word vector calculation module is used for calculating and obtaining the word vector of any entity according to the word vector of each word in any entity;
the similarity calculation module is used for calculating the similarity between every two entities according to the word vectors of the entities;
and the entity disambiguation module is used for clustering the entities according to the similarity between every two entities in the text to be processed so as to realize entity disambiguation.
7. The apparatus of claim 6, wherein the word vector correspondence module is specifically configured to:
obtaining an identification sequence based on the text to be processed, wherein the position of a word in each entity in the text to be processed is represented by a first identifier, and other words except the entities are represented by second identifiers;
and aiming at the any entity, acquiring a word vector of a corresponding position in the word vector sequence according to the position of the first identifier of each word in the any entity in the identification sequence, thereby acquiring the word vector of each word in the any entity.
8. The apparatus of claim 6, wherein the word vector calculation module is specifically configured to calculate an average vector of word vectors of all words in the entity, and use the average vector as the word vector of the entity.
9. A readable storage medium, characterized in that the readable storage medium stores an executable program, which when executed by a processor implements the method according to any one of claims 1-5.
10. An electronic device, comprising a memory and a processor, the memory being communicatively coupled to the processor, wherein the memory stores an executable program, and wherein the processor, when executing the executable program, implements the method of any of claims 1-5.
CN201910952886.2A 2019-10-09 2019-10-09 Entity disambiguation method and device, readable storage medium and electronic equipment Pending CN110674304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910952886.2A CN110674304A (en) 2019-10-09 2019-10-09 Entity disambiguation method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN110674304A (en) 2020-01-10

Family

ID=69081044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910952886.2A Pending CN110674304A (en) 2019-10-09 2019-10-09 Entity disambiguation method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110674304A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160292281A1 (en) * 2015-04-01 2016-10-06 Microsoft Technology Licensing, Llc Obtaining content based upon aspect of entity
US20180189265A1 (en) * 2015-06-26 2018-07-05 Microsoft Technology Licensing, Llc Learning entity and word embeddings for entity disambiguation
CN107168952A (en) * 2017-05-15 2017-09-15 北京百度网讯科技有限公司 Information generating method and device based on artificial intelligence
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN107885721A (en) * 2017-10-12 2018-04-06 北京知道未来信息技术有限公司 A kind of name entity recognition method based on LSTM
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN109472032A (en) * 2018-11-14 2019-03-15 北京锐安科技有限公司 A kind of determination method, apparatus, server and the storage medium of entity relationship diagram
CN109885698A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of knowledge mapping construction method and device, electronic equipment
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN110287302A (en) * 2019-06-28 2019-09-27 中国船舶工业综合技术经济研究院 A kind of science and techniques of defence field open source information confidence level determines method and system

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414759A (en) * 2020-03-12 2020-07-14 北京明略软件系统有限公司 Method and system for entity disambiguation
CN111581949A (en) * 2020-05-12 2020-08-25 上海市研发公共服务平台管理中心 Method and device for disambiguating name of learner, storage medium and terminal
CN111597336B (en) * 2020-05-14 2023-12-22 腾讯科技(深圳)有限公司 Training text processing method and device, electronic equipment and readable storage medium
CN111597336A (en) * 2020-05-14 2020-08-28 腾讯科技(深圳)有限公司 Processing method and device of training text, electronic equipment and readable storage medium
CN112949319A (en) * 2021-03-12 2021-06-11 江南大学 Method, device, processor and storage medium for marking ambiguous words in text
CN112949319B (en) * 2021-03-12 2023-01-06 江南大学 Method, device, processor and storage medium for marking ambiguous words in text
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113239149A (en) * 2021-05-14 2021-08-10 北京百度网讯科技有限公司 Entity processing method, entity processing device, electronic equipment and storage medium
CN113239149B (en) * 2021-05-14 2024-01-19 北京百度网讯科技有限公司 Entity processing method, device, electronic equipment and storage medium
CN113343669A (en) * 2021-05-20 2021-09-03 北京明略软件系统有限公司 Method and system for learning word vector, electronic equipment and storage medium
CN116266266A (en) * 2022-11-08 2023-06-20 美的集团(上海)有限公司 Multi-tone word disambiguation method, device, equipment and storage medium
CN115438674A (en) * 2022-11-08 2022-12-06 腾讯科技(深圳)有限公司 Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN116266266B (en) * 2022-11-08 2024-02-20 美的集团(上海)有限公司 Multi-tone word disambiguation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200110