CN110674304A - Entity disambiguation method and device, readable storage medium and electronic equipment

Entity disambiguation method and device, readable storage medium and electronic equipment

Info

Publication number
CN110674304A
CN110674304A
Authority
CN
China
Prior art keywords
entity
word
text
processed
entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910952886.2A
Other languages
Chinese (zh)
Inventor
陈栋
齐云飞
付骁弈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN201910952886.2A priority Critical patent/CN110674304A/en
Publication of CN110674304A publication Critical patent/CN110674304A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an entity disambiguation method and device, a readable storage medium, and electronic equipment. The method comprises the following steps: first, a text to be processed containing at least two entities is input into an entity extraction language model for entity extraction, to obtain the entities contained in the text to be processed; the text to be processed is also input into a bidirectional language model for processing, to obtain a word vector sequence of the text to be processed; then, the word vector of each word in any entity is acquired; next, the word vector of that entity is computed from the word vectors of its words; the similarity between every two entities is then calculated from the word vectors of the entities; finally, the entities are clustered according to the pairwise similarities between the entities in the text to be processed, thereby realizing entity disambiguation. The scheme of the embodiments can improve the accuracy of entity disambiguation.

Description

Entity disambiguation method and device, readable storage medium and electronic equipment
Technical Field
The present application relates to the field of natural language processing, and in particular, to an entity disambiguation method, apparatus, readable storage medium, and electronic device.
Background
Text is composed of a large number of words, and the words that make up a text usually include many ambiguous entity words. After entity disambiguation is performed on a text, information extraction, text summary construction, and the like can be carried out on the disambiguation result, so the accuracy of entity disambiguation directly affects the accuracy of information extraction and other downstream processing.
In the prior art, one way of performing entity disambiguation is to carry out word vector training on a large number of related texts to obtain a word embedding matrix, extract entities from the text to be processed, convert the extracted entities into vectors, and cluster the entities to complete disambiguation. In this entity disambiguation mode, each word corresponds to exactly one word vector; if one word carries several meanings, this mode cannot distinguish the different meanings expressed. In addition, new words that do not exist in the word embedding matrix cannot be converted into vectors at all. The result of disambiguation using this entity disambiguation approach therefore has a large error.
Disclosure of Invention
In order to overcome at least the above-mentioned deficiencies in the prior art, it is an object of the present application to provide an entity disambiguation method comprising:
inputting a text to be processed comprising at least two entities into a pre-trained entity extraction language model for entity extraction to obtain entities included in the text to be processed;
inputting the text to be processed into a pre-trained bidirectional language model for processing to obtain a word vector sequence of the text to be processed, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words;
aiming at any entity in the text to be processed, acquiring a word vector of each word in the any entity from the word vector sequence according to the position of the any entity and each word in the any entity in the text;
calculating to obtain a word vector of any entity according to the word vector of each word in any entity;
calculating the similarity between every two entities according to the word vectors of the entities;
and clustering the entities according to the similarity between every two entities in the text to be processed so as to realize entity disambiguation.
Optionally, the step of obtaining, for any entity in the text to be processed, a word vector of each word in the any entity from the word vector sequence according to the position of the any entity and each word in the any entity in the text includes:
obtaining an identification sequence based on the text to be processed, wherein the position of a word in each entity in the text to be processed is represented by a first identifier, and other words except the entities are represented by second identifiers;
and aiming at the any entity, acquiring a word vector of a corresponding position in the word vector sequence according to the position of the first identifier of each word in the any entity in the identification sequence, thereby acquiring the word vector of each word in the any entity.
Optionally, the step of obtaining the word vector of the arbitrary entity by calculating the word vector of each word in the arbitrary entity includes calculating an average vector of the word vectors of all the words in the entity, and using the average vector as the word vector of the entity.
Optionally, the step of calculating the similarity between each two entities according to the word vector of each entity includes calculating the similarity between each two entities by using a cosine similarity algorithm.
Optionally, before the step of inputting the text to be processed including at least two entities into a pre-trained entity extraction language model for entity extraction to obtain the entities included in the text to be processed, the method further includes:
inputting a plurality of training texts with marked entities as training samples into an entity extraction language model for training;
comparing the output entity label with the labeled entity, and calculating to obtain a loss function value of the training;
and if the loss function value is smaller than a preset loss value, judging that the entity extraction language model is trained, if the loss function value is not smaller than the preset loss value, adjusting parameters in the entity extraction language model, inputting a plurality of training texts marked with entities as training samples into the entity extraction language model after the parameters are adjusted, and repeating the steps until the loss function value is smaller than the preset loss value.
It is another object of the present application to provide an entity disambiguation apparatus comprising:
the entity extraction module is used for inputting the text to be processed comprising at least two entities into a pre-trained entity extraction language model for entity extraction to obtain the entities contained in the text to be processed;
the word vector acquisition module is used for inputting the text to be processed into a pre-trained bidirectional language model for processing to obtain a word vector sequence of the text to be processed, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words;
a word vector corresponding module, configured to, for any entity in the text to be processed, obtain, from the word vector sequence, a word vector for each word in the any entity according to the any entity and a position of each word in the any entity in the text;
the word vector calculation module is used for calculating and obtaining the word vector of any entity according to the word vector of each word in any entity;
the similarity calculation module is used for calculating the similarity between every two entities according to the word vectors of the entities;
and the entity disambiguation module is used for clustering the entities according to the similarity between every two entities in the text to be processed so as to realize entity disambiguation.
Optionally, the word vector correspondence module is specifically configured to:
obtaining an identification sequence based on the text to be processed, wherein the position of a word in each entity in the text to be processed is represented by a first identifier, and other words except the entities are represented by second identifiers;
and aiming at the any entity, acquiring a word vector of a corresponding position in the word vector sequence according to the position of the first identifier of each word in the any entity in the identification sequence, thereby acquiring the word vector of each word in the any entity.
Optionally, the word vector calculation module is specifically configured to calculate an average vector of word vectors of all words in the entity, and use the average vector as the word vector of the entity.
Another object of the present application is to provide a readable storage medium storing an executable program which, when executed by a processor, implements the method according to any embodiment of the present application.
Another object of the present application is to provide an electronic device comprising a memory and a processor, the memory being in communication with the processor, the memory having stored therein an executable program, and the processor implementing the method according to any embodiment of the present application when executing the executable program.
Compared with the prior art, the method has the following beneficial effects:
according to the entity disambiguation method, the entity disambiguation device, the readable storage medium and the electronic device, each entity in the text is extracted, the word vector of each word of the entity is obtained according to the position of each word in the text, the word vector of each entity is calculated, the entity disambiguation is performed after the similarity is calculated according to the word vector of each entity, and because each word vector is related to the context of each word, the vector expression of the same entity with different positions is also related to the context, so that the entity disambiguation accuracy can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 2 is a first flowchart of an entity disambiguation method provided in an embodiment of the present application;
FIG. 3 is a second flowchart of an entity disambiguation method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of an entity disambiguation entity extraction result provided in an embodiment of the present application;
FIG. 5 is a third flowchart of an entity disambiguation method provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a similarity matrix provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the structure of input data for a bi-directional language model;
FIG. 8 is a first schematic diagram of predicting masked words during bidirectional language model training;
FIG. 9 is a second schematic diagram of predicting masked words during bidirectional language model training;
fig. 10 is a functional block diagram of an entity disambiguation apparatus provided in an embodiment of the present application.
Icon: 100-an electronic device; 110-entity disambiguation means; 111-entity extraction module; 112-word vector acquisition module; 113-word vector correspondence module; 114-a word vector calculation module; 115-similarity calculation module; 116-entity disambiguation module; 120-a memory; 130-a processor.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present application, it should further be noted that, unless expressly stated or limited otherwise, the terms "disposed", "mounted", "connected", and "coupled" are to be construed broadly, for example, as a fixed connection, a removable connection, or an integral connection; as a mechanical or an electrical connection; as a direct connection, an indirect connection through an intermediate medium, or an internal communication between two elements. The specific meanings of the above terms in the present application can be understood by those of ordinary skill in the art on a case-by-case basis.
With the development of technologies related to natural language processing, natural language processing techniques are increasingly applied in various fields. For example, entity disambiguation is an important technical means in information extraction, text summary construction, topic modeling, domain knowledge mining, and automatic translation of professional literature. A very important expression form of natural language is text. Text generally refers to a sentence, or a combination of sentences, having a complete and systematic meaning, and a sentence is generally composed of words. Among the words composing a text, many that are written differently may express the same meaning, while identical words may express different meanings; therefore, the disambiguation result for an entity will often directly affect the other processes that consume that result.
In the prior art, there are generally two methods for implementing entity disambiguation. The first is the clustering method based on Word Embedding: word vector training is performed on a large number of texts of the related industry to obtain a word embedding matrix. After the word embedding matrix is obtained, the entities extracted from the text to be processed are converted into vectors, clustering is performed according to the word embedding matrix, and similar words are grouped into one class. The first drawback of this disambiguation method is that each word has only one word vector, so the problem of polysemy cannot be solved. A second drawback is that entities at the phrase level (noun phrases, verb phrases, adjective phrases, etc.) cannot be disambiguated. In addition, because the word embedding matrix is derived from existing text, if a word never appears in the text used for word vector training, no vector for that word exists in the final word vector matrix, so the words in the text to be processed cannot always be converted into word vectors.
Another disambiguation approach in the prior art is the knowledge-graph based method. In this method, an entity extracted from a text is aligned with an entity in an existing knowledge graph (entity linking); if several different entity mentions in the text can be aligned to the same entity in the knowledge graph, those mentions are considered to have the same semantics. This approach likewise cannot disambiguate phrase-level entities (noun phrases, verb phrases, adjective phrases, etc.). In addition, when computing entity similarity, this approach generally requires a relatively complex model such as a neural network.
In summary, a common disadvantage of the existing entity disambiguation techniques is that the accuracy of entity disambiguation is low.
Referring to fig. 1, fig. 1 is a schematic block diagram of a structure of an electronic device 100 provided in an embodiment of the present application, where the electronic device 100 includes an entity disambiguation apparatus 110, a memory 120, and a processor 130, and the memory 120 and the processor 130 are electrically connected to each other directly or indirectly for implementing data interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The entity disambiguation apparatus 110 includes at least one software function module which may be stored in the memory 120 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 130 is used to execute executable modules stored in the memory 120, such as software functional modules and computer programs included in the entity disambiguation apparatus 110.
In order to solve the above problem, the present embodiment provides an entity disambiguation method applicable to the electronic device 100, please refer to fig. 2, where the method includes steps S010 to S060.
And S010, adopting an entity extraction language model to extract entities to obtain the entities included in the text to be processed.
Specifically, a text to be processed including at least two entities is input into a pre-trained entity extraction language model for entity extraction, and the entities included in the text to be processed are obtained.
In this embodiment, an entity refers to a character, a word, or a phrase having a specific meaning in the text, where the phrase may be a noun phrase, a verb phrase, an adjective phrase, or the like.
And step S020, acquiring a word vector sequence of the text to be processed by adopting a bidirectional language model.
Specifically, the text to be processed is input into a pre-trained bidirectional language model for processing, and a word vector sequence of the text to be processed is obtained, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words.
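To make step S020 concrete, the following minimal sketch shows one way such a context-dependent word vector sequence could be obtained. It is an illustration only: the patent does not name a toolkit, so the HuggingFace transformers library, the public bert-base-chinese checkpoint, and the sample sentence are all assumptions.

```python
# Illustrative sketch only: `transformers` and `bert-base-chinese` stand in
# for the pre-trained bidirectional language model described in the patent.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

text = "乔布斯拿着吃了一半的苹果去了苹果公司。"  # sample text to be processed
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One context-dependent vector per token, in sentence order: the patent's
# "word vector sequence" of the text to be processed.
word_vector_sequence = outputs.last_hidden_state[0]  # shape: (seq_len, hidden_dim)
print(word_vector_sequence.shape)
```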
Step S030, each word vector corresponding to each entity is obtained.
Specifically, for any entity in the text to be processed, the word vector of each word in the any entity is obtained from the word vector sequence according to the position of the any entity and each word in the any entity in the text.
Step S040, a word vector of the arbitrary entity is obtained by calculation according to the word vector of each word in the arbitrary entity.
And S050, calculating the similarity between every two entities according to the word vectors of the entities.
And step S060, clustering the entities according to the similarity between every two entities in the text to be processed so as to implement entity disambiguation.
In this embodiment, the entities in the text to be processed are extracted with the entity extraction language model, and a word vector sequence that depends on the context of the text to be processed is extracted with the bidirectional pre-trained model. The word vector of each word in each extracted entity is then obtained from the word vector sequence, and the context-dependent word vector of each entity is computed from the word vectors of the words in that entity. In this way, every word vector reflects the word itself, its position, and its context; for a polysemous word, for example, different word vectors are obtained at different positions. Disambiguating with the entity word vectors obtained by this scheme can therefore recognize both polysemy (one word with several meanings) and synonymy (several words with the same meaning), improving the accuracy of entity disambiguation.
Referring to fig. 3, optionally, in the present embodiment, step S030 includes sub-steps S031-S032.
Step S031, an identification sequence is obtained based on the text to be processed, where a position of a word in each entity in the text to be processed is represented by a first identifier, and words other than the entity are represented by a second identifier.
Step S032, a word vector is obtained according to the first identifier of each word in the arbitrary entity.
Specifically, for the any entity, a word vector of a corresponding position in a word vector sequence is obtained according to the position of the first identifier of each word in the any entity in the identification sequence, so as to obtain the word vector of each word in the any entity.
The embodiment is used for respectively acquiring the word vector of each word in each entity.
The following explains, with a practical example, how to obtain the word vector of each word in an entity. Suppose the text to be processed is "Jobs (full name: Steve Jobs) went to Apple Inc. holding a half-eaten apple." and the entities extracted in step S010 are "Jobs", "Steve Jobs", "apple", and "Apple"; referring to fig. 4, mean denotes the average vector computed over the word vectors within an entity. If the first identifier is 1 and the second identifier is 0, the identification sequence obtained from the text to be processed marks each character belonging to an entity with 1 and every other character with 0; the word vector of the word at each first-identifier position of an entity can then be obtained from the word vector sequence according to the position of that first identifier in the identification sequence.
In this embodiment, the first identifier may also be represented by 1, 2, and 3, where 1 represents the beginning of the entity, 2 represents the middle of the entity, 3 represents the end of the entity, and the second identifier is represented by 0. At this time, the identification sequence obtained from the text to be processed is "1, 2, 3, 0, 1, 2, 3, 0, 1, 3, 0". Then, a word vector of a word corresponding to the first identifier position of the entity may be obtained from the word vector sequence according to the position of the first identifier corresponding to the entity in the identification sequence.
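As a minimal sketch of sub-steps S031 and S032 (illustrative code, not the patent's implementation; the 1/0 tagging scheme follows the example above, and the helper names are assumptions), the identification sequence can be turned into entity spans, and each span can index the word vector sequence:

```python
from typing import List, Tuple

def entity_spans(id_sequence: List[int]) -> List[Tuple[int, int]]:
    """Recover (start, end) spans of consecutive first identifiers (1)."""
    spans, start = [], None
    for i, tag in enumerate(id_sequence + [0]):  # sentinel 0 closes a trailing span
        if tag == 1 and start is None:
            start = i
        elif tag != 1 and start is not None:
            spans.append((start, i))
            start = None
    return spans

def entity_word_vectors(span, word_vector_sequence):
    """Word vector of each word in one entity, looked up by position (S032)."""
    start, end = span
    return word_vector_sequence[start:end]

id_sequence = [1, 1, 0, 0, 1, 1, 1, 0]                  # toy 8-character text, two entities
word_vector_sequence = [[float(i)] for i in range(8)]    # stand-in 1-d word vectors
for span in entity_spans(id_sequence):
    print(span, entity_word_vectors(span, word_vector_sequence))
```

Note that with the plain 1/0 scheme two adjacent entities would merge into a single span; the 1/2/3 begin/middle/end identifiers described above remove exactly that ambiguity.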
Optionally, in this embodiment, the step of obtaining the word vector of the arbitrary entity by calculating the word vector of each word in the arbitrary entity includes calculating an average vector of the word vectors of all the words in the entity, and taking the average vector as the word vector of the entity.
This embodiment averages the word vectors contained in an entity to obtain the entity vector. The averaging improves the accuracy of the entity vector, making the entity disambiguation result more accurate.
For example, in the text to be processed listed above, each character of "Steve" (史, 蒂, 夫) corresponds to a 300-dimensional vector, so "Steve" corresponds to three 300-dimensional vectors; after the three vectors are averaged element-wise, a single 300-dimensional vector is obtained, which is the entity expression, i.e., the word vector, of "Steve".
To aid understanding, the following example illustrates the word vector averaging process:
In an entity comprising three words, suppose the word vectors of the three words are (a1, a2, a3), (b1, b2, b3), and (c1, c2, c3); the word vector of this entity is then ((a1+b1+c1)/3, (a2+b2+c2)/3, (a3+b3+c3)/3).
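The same averaging written as a short numpy sketch (the function name is an illustrative assumption):

```python
import numpy as np

def entity_vector(word_vectors: np.ndarray) -> np.ndarray:
    """Average the word vectors of all words in an entity (step S040)."""
    return word_vectors.mean(axis=0)

# Three toy word vectors for an entity of three words.
vectors = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0],
                    [7.0, 8.0, 9.0]])
print(entity_vector(vectors))  # [4. 5. 6.] = ((1+4+7)/3, (2+5+8)/3, (3+6+9)/3)
```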
Optionally, in this embodiment, the step of calculating the similarity between each two entities according to the word vector of each entity includes calculating the similarity between each two entities by using a cosine similarity algorithm.
The embodiment is used for specifically calculating the similarity between every two entities in the text to be processed.
Again to aid understanding, the following example illustrates the calculation of entity similarity:
Suppose the word vector of entity A is expressed as [1, 2, 3, 4, 1], and the word vector of entity B is expressed as [1, 2, 3, 4, 3].
The cosine similarity S is computed as S = M/N, where M is the dot product of word vector A and word vector B, and N is the product of the magnitudes of word vector A and word vector B. M and N are respectively:
M = 1×1 + 2×2 + 3×3 + 4×4 + 1×3 = 33
N = √(1²+2²+3²+4²+1²) × √(1²+2²+3²+4²+3²) = √31 × √39 ≈ 34.77
Finally, the cosine similarity S = 33/34.77 ≈ 0.949 is obtained.
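The worked example can be checked with a few lines of numpy (a hedged sketch; the numbers match the calculation above):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """S = M / N: dot product over the product of the vector magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A = np.array([1, 2, 3, 4, 1], dtype=float)
B = np.array([1, 2, 3, 4, 3], dtype=float)
print(cosine_similarity(A, B))  # ≈ 0.949, matching the worked example
```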
Referring to fig. 5, optionally, in this embodiment, before step S010, the method further includes step S110 to step S130.
Step S110, inputting a plurality of training texts with marked entities as training samples into an entity extraction language model for training.
And step S120, comparing the output entity label with the labeled entity, and calculating to obtain a loss function value of the training.
Step S130, determining whether the loss function value is smaller than a preset loss value.
Step S140, if the loss function value is not less than the preset loss value, adjusting parameters in the entity extraction language model, inputting a plurality of training texts marked with entities as training samples into the entity extraction language model after parameter adjustment for training, and repeating the steps until the loss function value is less than the preset loss value.
And if the loss function value is smaller than a preset loss value, judging that the entity extraction language model training is finished.
The embodiment is used for carrying out model training to obtain the entity extraction language model capable of carrying out entity extraction.
In this embodiment, when the entity extraction language model is trained, the training text may be annotated with entities according to the types of entities that actually need to be extracted, and training may then be performed using an existing entity language model training method. The trained entity extraction language model can thus extract entities at the phrase level (noun phrases, verb phrases, adjective phrases, etc.), so vectors of phrase-level entities can be obtained and disambiguation of phrase-level entities can be achieved.
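The following schematic sketch mirrors the training loop of steps S110 to S140. It is a toy stand-in under stated assumptions: the patent fixes neither an architecture nor a loss function, so a small linear tagger, synthetic annotations, cross-entropy loss, and a step cap are used purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 4)                        # toy stand-in for the entity extraction language model
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()
preset_loss = 0.1                              # the preset loss value of step S130

features = torch.randn(64, 8)                  # stand-in for featurized training texts
teacher = nn.Linear(8, 4)                      # synthetic "annotation" so the toy task is learnable
labels = teacher(features).argmax(dim=1).detach()

for step in range(5000):                       # safety cap for this toy example
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)    # S120: compare output tags with the annotations
    if loss.item() < preset_loss:              # S130: loss below the preset value -> trained
        break
    loss.backward()
    optimizer.step()                           # S140: adjust parameters and repeat

print(f"stopped at step {step}, loss {loss.item():.4f}")
```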
Finally, clustering can be carried out according to the obtained word vectors, thereby realizing entity disambiguation. For ease of understanding, a specific clustering method is illustrated using the text to be processed "Jobs (full name: Steve Jobs) went to Apple Inc. holding a half-eaten apple." as an example.
In this text to be processed, the extracted entities include "Jobs", "Steve Jobs", "apple" (the fruit), and "Apple" (the company), and the similarity between every two of the four entities is calculated. Fig. 6 shows the similarity matrix formed by these pairwise similarities. Clustering can then be performed according to the similarities between the entities.
As can be seen from the similarity matrix, the diagonal elements all equal 1, meaning that every entity has a similarity of 100% with itself. For example, "Jobs" has a similarity of 93% with "Steve Jobs", indicating that the two entities are highly similar and probably belong to the same category. By contrast, "apple (the fruit)" and "Apple (the company)" share the same surface name, but the similarity calculated from their vector expressions is very low, indicating that the two are not the same thing. The purpose of clustering is to group elements with high similarity together and separate elements with low similarity; the elements that end up in the same class denote the same thing, for example: Jobs, Steve Jobs, old Jobs, Apple founder Jobs, Jobs of Apple in the United States.
When clustering is performed according to a similarity threshold, entities whose pairwise similarity is greater than the threshold may be grouped into one class.
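A minimal sketch of this threshold clustering follows (illustrative: the patent does not prescribe a specific clustering algorithm, and all similarity values except the 93% mentioned above are made up for the example). Entities whose pairwise similarity exceeds the threshold are joined into one class via a simple union-find.

```python
def threshold_cluster(sim, names, threshold=0.8):
    """Group entities whose pairwise similarity exceeds the threshold."""
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sim[i][j] > threshold:
                parent[find(i)] = find(j)  # merge the two entities' classes

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())

names = ["Jobs", "Steve Jobs", "apple (fruit)", "Apple (company)"]
sim = [[1.00, 0.93, 0.20, 0.35],
       [0.93, 1.00, 0.18, 0.30],
       [0.20, 0.18, 1.00, 0.25],
       [0.35, 0.30, 0.25, 1.00]]
print(threshold_cluster(sim, names))
# [['Jobs', 'Steve Jobs'], ['apple (fruit)'], ['Apple (company)']]
```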
In this embodiment, the bidirectional language model may be a model trained by Google with an attention mechanism, for example the Bert model. The Bert model uses 12 layers of Transformer encoders to extract features from the input sentences, and its processing of data proceeds roughly as follows: the input is first segmented with the WordPiece tool and special separators are inserted (e.g. [CLS], used to separate samples, and [SEP], used to separate different sentences within a sample); the input then passes through a Token Embeddings layer, which converts each word into a vector of fixed dimension, a Position Embeddings layer, which encodes the position of each word in the sentence, and a Segment Embeddings layer, which distinguishes the input sentences; in addition, Masking is applied to mask some of the characters or words in a sentence. Taking the training text "my dog is cute, he likes playing." as an example, after this text is input into the above network structure of the Bert model, the result obtained is as shown in fig. 7, where Input denotes the input layer; E[CLS] denotes the vector of the separator "[CLS]" after processing by the Token Embeddings layer; Emy, Edog, Eis, Ecute, Ehe, Elikes, Eplay, and Eing denote the vectors of "my", "dog", "is", "cute", "he", "likes", "play", and "ing" after processing by the Token Embeddings layer; and E[SEP] denotes the vector of the separator "[SEP]" after processing by the Token Embeddings layer. EA indicates that the word at the corresponding position belongs to sentence type A, and EB that it belongs to sentence type B. Ei (i a natural number) is the position vector of a word, representing the position of the word in the sentence.
For the Bert model, characters or words in the training text are randomly masked during training, and the masked characters or words are then predicted by the Transformer network (the 12 layers of Transformer encoders in the middle of the model). For example, suppose the training text is "Harbin is the capital of Heilongjiang Province and a famous international ice-and-snow culture city" (哈尔滨是黑龙江的省会，国际冰雪文化名城). When the masked characters are "尔", "黑", "国", and "雪", the masked text is input into the Transformer network, which outputs the predictions for the masked characters: "尔", "黑", "国", and "雪". When the masked words are "哈尔滨" (Harbin) and "冰雪" (ice and snow), the masked text is input into the Transformer network, which outputs the predictions: "哈尔滨" and "冰雪". Schematic diagrams of this prediction during model training are shown in fig. 8 and fig. 9.
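The following hedged sketch illustrates masked-character prediction of this kind; the transformers library and the public bert-base-chinese checkpoint are assumptions standing in for the model described above.

```python
# Illustrative sketch: mask one character and let a pre-trained masked
# language model predict it, as in the Harbin example above.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

text = "哈尔滨是黑龙江的省会"                  # "Harbin is the capital of Heilongjiang"
inputs = tokenizer(text, return_tensors="pt")
masked_pos = 2                                 # the character "尔" (position after [CLS])
inputs["input_ids"][0, masked_pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits[0, masked_pos].argmax(-1).item()
print(tokenizer.decode([predicted_id]))        # a well-trained model recovers "尔"
```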
In this embodiment, the bi-directional language model may also be trained using LSTM, CNN, or RNN.
Referring to fig. 10, an embodiment of the present application further provides an entity disambiguation apparatus 110, which includes an entity extraction module 111, a word vector acquisition module 112, a word vector correspondence module 113, a word vector calculation module 114, a similarity calculation module 115, and an entity disambiguation module 116. The entity disambiguation apparatus 110 includes software function modules that may be stored in the memory 120 in the form of software or firmware or solidified in the operating system (OS) of the electronic device 100.
The entity extraction module 111 is configured to input a to-be-processed text including at least two entities into a pre-trained entity extraction language model for entity extraction, so as to obtain entities included in the to-be-processed text.
The entity extraction module 111 in this embodiment is used in the step S010, and the specific description about the entity extraction module 111 may refer to the description about the step S010.
A word vector obtaining module 112, configured to input the to-be-processed text into a pre-trained bi-directional language model for processing, so as to obtain a word vector sequence of the to-be-processed text, where the word vector sequence is formed by arranging word vectors of words in the to-be-processed text according to a sequence of the words in the to-be-processed text, and the word vectors are calculated according to a context of the words;
the word vector obtaining module 112 in this embodiment is used in step S020, and specific description about the word vector obtaining module 112 may refer to the description of step S020.
And a word vector corresponding module 113, configured to, for any entity in the text to be processed, obtain, from the word vector sequence, a word vector of each word in the any entity according to the position of the any entity and each word in the any entity in the text.
The word vector correspondence module 113 in this embodiment is used in step S030, and the detailed description of the word vector correspondence module 113 can refer to the description of step S030.
And the word vector calculation module 114 is configured to calculate and obtain a word vector of the arbitrary entity according to the word vector of each word in the arbitrary entity.
The word vector calculation module 114 in this embodiment is used in step S040, and the detailed description about the word vector calculation module 114 may refer to the description of step S040.
And the similarity calculation module 115 is configured to calculate a similarity between each two entities according to the word vector of each entity.
The similarity calculation module 115 in this embodiment is used in step S050, and specific description about the similarity calculation module 115 may refer to the description of step S050.
And the entity disambiguation module 116 is configured to cluster entities according to the similarity between every two entities in the text to be processed, so as to implement entity disambiguation.
The entity disambiguation module 116 in this embodiment is used in step S060, and the detailed description about the entity disambiguation module 116 may refer to the description about step S060.
Another object of the present application is to provide a readable storage medium storing an executable program; when the processor 130 executes the executable program, the method according to any one of the embodiments can be implemented.
The above description is only for various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions are included in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of entity disambiguation, the method comprising:
inputting a text to be processed comprising at least two entities into a pre-trained entity extraction language model for entity extraction to obtain entities included in the text to be processed;
inputting the text to be processed into a pre-trained bidirectional language model for processing to obtain a word vector sequence of the text to be processed, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words;
aiming at any entity in the text to be processed, acquiring a word vector of each word in the any entity from the word vector sequence according to the position of the any entity and each word in the any entity in the text;
calculating to obtain a word vector of any entity according to the word vector of each word in any entity;
calculating the similarity between every two entities according to the word vectors of the entities;
and clustering the entities according to the similarity between every two entities in the text to be processed so as to realize entity disambiguation.
2. The method according to claim 1, wherein the step of obtaining, for any entity in the text to be processed, a word vector for each word in the any entity from the word vector sequence according to the position of the any entity and each word in the any entity in the text comprises:
obtaining an identification sequence based on the text to be processed, wherein the position of a word in each entity in the text to be processed is represented by a first identifier, and other words except the entities are represented by second identifiers;
and aiming at the any entity, acquiring a word vector of a corresponding position in a word vector sequence according to the position of the first identifier of each word in the any entity in the identification sequence, thereby acquiring the word vector of each word in the any entity.
3. The method of claim 1, wherein the step of obtaining the word vector of the arbitrary entity by calculating the word vector of each word in the arbitrary entity comprises calculating an average vector of the word vectors of all words in the entity, and using the average vector as the word vector of the entity.
4. The method of claim 1, wherein the step of calculating the similarity between each two entities according to the word vectors of the respective entities comprises calculating the similarity between each two entities by a cosine similarity algorithm.
5. The method according to claim 1, wherein before the step of inputting the text to be processed including at least two entities into the pre-trained entity extraction language model for entity extraction to obtain the entities included in the text to be processed, the method further comprises:
inputting a plurality of training texts with marked entities as training samples into an entity extraction language model for training;
comparing the output entity label with the labeled entity, and calculating to obtain a loss function value of the training;
and if the loss function value is smaller than a preset loss value, judging that the entity extraction language model is trained, if the loss function value is not smaller than the preset loss value, adjusting parameters in the entity extraction language model, inputting a plurality of training texts marked with entities as training samples into the entity extraction language model after the parameters are adjusted, and repeating the steps until the loss function value is smaller than the preset loss value.
6. An entity disambiguation apparatus, the apparatus comprising:
the entity extraction module is used for inputting the text to be processed comprising at least two entities into a pre-trained entity extraction language model for entity extraction to obtain the entities contained in the text to be processed;
the word vector acquisition module is used for inputting the text to be processed into a pre-trained bidirectional language model for processing to obtain a word vector sequence of the text to be processed, wherein the word vector sequence is formed by arranging word vectors of all words in the text to be processed according to the sequence of the words in the text to be processed, and the word vectors are calculated according to the context of the words;
a word vector corresponding module, configured to, for any entity in the text to be processed, obtain, from the word vector sequence, a word vector for each word in the any entity according to the any entity and a position of each word in the any entity in the text;
the word vector calculation module is used for calculating and obtaining the word vector of any entity according to the word vector of each word in any entity;
the similarity calculation module is used for calculating the similarity between every two entities according to the word vectors of the entities;
and the entity disambiguation module is used for clustering the entities according to the similarity between every two entities in the text to be processed so as to realize entity disambiguation.
7. The apparatus of claim 6, wherein the word vector correspondence module is specifically configured to:
obtaining an identification sequence based on the text to be processed, wherein the position of a word in each entity in the text to be processed is represented by a first identifier, and other words except the entities are represented by second identifiers;
and aiming at the any entity, acquiring a word vector of a corresponding position in the word vector sequence according to the position of the first identifier of each word in the any entity in the identification sequence, thereby acquiring the word vector of each word in the any entity.
8. The apparatus of claim 6, wherein the word vector calculation module is specifically configured to calculate an average vector of word vectors of all words in the entity, and use the average vector as the word vector of the entity.
9. A readable storage medium, characterized in that the readable storage medium stores an executable program, which when executed by a processor implements the method according to any one of claims 1-5.
10. An electronic device, comprising a memory and a processor, the memory being communicatively coupled to the processor, wherein the memory stores an executable program, and wherein the processor, when executing the executable program, implements the method of any of claims 1-5.
CN201910952886.2A 2019-10-09 2019-10-09 Entity disambiguation method and device, readable storage medium and electronic equipment Pending CN110674304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910952886.2A CN110674304A (en) 2019-10-09 2019-10-09 Entity disambiguation method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN110674304A (en) 2020-01-10

Family

ID=69081044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910952886.2A Pending CN110674304A (en) 2019-10-09 2019-10-09 Entity disambiguation method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110674304A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160292281A1 (en) * 2015-04-01 2016-10-06 Microsoft Technology Licensing, Llc Obtaining content based upon aspect of entity
US20180189265A1 (en) * 2015-06-26 2018-07-05 Microsoft Technology Licensing, Llc Learning entity and word embeddings for entity disambiguation
CN107168952A (en) * 2017-05-15 2017-09-15 北京百度网讯科技有限公司 Information generating method and device based on artificial intelligence
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN107885721A (en) * 2017-10-12 2018-04-06 北京知道未来信息技术有限公司 A kind of name entity recognition method based on LSTM
CN108280061A (en) * 2018-01-17 2018-07-13 北京百度网讯科技有限公司 Text handling method based on ambiguity entity word and device
CN109472032A (en) * 2018-11-14 2019-03-15 北京锐安科技有限公司 A kind of determination method, apparatus, server and the storage medium of entity relationship diagram
CN109885698A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of knowledge mapping construction method and device, electronic equipment
CN110083831A (en) * 2019-04-16 2019-08-02 武汉大学 A kind of Chinese name entity recognition method based on BERT-BiGRU-CRF
CN110134965A (en) * 2019-05-21 2019-08-16 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for information processing
CN110276075A (en) * 2019-06-21 2019-09-24 腾讯科技(深圳)有限公司 Model training method, name entity recognition method, device, equipment and medium
CN110287302A (en) * 2019-06-28 2019-09-27 中国船舶工业综合技术经济研究院 A kind of science and techniques of defence field open source information confidence level determines method and system

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414759A (en) * 2020-03-12 2020-07-14 北京明略软件系统有限公司 Method and system for entity disambiguation
CN111581949A (en) * 2020-05-12 2020-08-25 上海市研发公共服务平台管理中心 Method and device for disambiguating name of learner, storage medium and terminal
CN111597336B (en) * 2020-05-14 2023-12-22 腾讯科技(深圳)有限公司 Training text processing method and device, electronic equipment and readable storage medium
CN111597336A (en) * 2020-05-14 2020-08-28 腾讯科技(深圳)有限公司 Processing method and device of training text, electronic equipment and readable storage medium
CN112949319A (en) * 2021-03-12 2021-06-11 江南大学 Method, device, processor and storage medium for marking ambiguous words in text
CN112949319B (en) * 2021-03-12 2023-01-06 江南大学 Method, device, processor and storage medium for marking ambiguous words in text
CN113158687A (en) * 2021-04-29 2021-07-23 新声科技(深圳)有限公司 Semantic disambiguation method and device, storage medium and electronic device
CN113239149A (en) * 2021-05-14 2021-08-10 北京百度网讯科技有限公司 Entity processing method, entity processing device, electronic equipment and storage medium
CN113239149B (en) * 2021-05-14 2024-01-19 北京百度网讯科技有限公司 Entity processing method, device, electronic equipment and storage medium
CN113343669A (en) * 2021-05-20 2021-09-03 北京明略软件系统有限公司 Method and system for learning word vector, electronic equipment and storage medium
CN116266266A (en) * 2022-11-08 2023-06-20 美的集团(上海)有限公司 Multi-tone word disambiguation method, device, equipment and storage medium
CN115438674A (en) * 2022-11-08 2022-12-06 腾讯科技(深圳)有限公司 Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN116266266B (en) * 2022-11-08 2024-02-20 美的集团(上海)有限公司 Multi-tone word disambiguation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200110