CN113283236A - Entity disambiguation method in complex Chinese text

Entity disambiguation method in complex Chinese text

Info

Publication number
CN113283236A
CN113283236A (application CN202110603755.0A)
Authority
CN
China
Prior art keywords
entity
similarity
disambiguated
layer
text
Prior art date
Legal status
Granted
Application number
CN202110603755.0A
Other languages
Chinese (zh)
Other versions
CN113283236B (en)
Inventor
王玉龙
王闯
刘同存
王纯
张乐剑
王晶
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202110603755.0A
Publication of CN113283236A
Application granted
Publication of CN113283236B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 40/279: Handling natural language data; natural language analysis; recognition of textual entities
    • G06F 40/216: Natural language analysis; parsing using statistical methods
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/044: Neural network architectures; recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/084: Neural network learning methods; backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method of entity disambiguation in complex Chinese text, comprising: extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated; for each entity mention, selecting a plurality of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, then calculating a first similarity between the mention and each pre-candidate entity, and selecting a plurality of entities as candidate entities according to the first similarity; and calculating the disambiguation similarity between each entity mention and each of its candidate entities, and judging whether the maximum disambiguation similarity over all candidates exceeds a disambiguation-similarity threshold: if so, the mention is a linkable entity and is linked to the candidate entity in the knowledge base with the maximum disambiguation similarity. The invention belongs to the field of information technology; it effectively resolves entity ambiguity in complex Chinese text and improves entity recall and entity-linking accuracy.

Description

Entity disambiguation method in complex Chinese text
Technical Field
The invention relates to an entity disambiguation method for complex Chinese text, and belongs to the field of information technology.
Background
Natural language exhibits widespread ambiguity and non-standard usage: abbreviations, shorthand, and habitual usage mean that, in different contexts, the same word can carry different meanings and different words can carry the same meaning. Chinese is especially rich in semantics and forms of expression; texts of literary works such as novels in particular feature large casts of characters, many scenes, and intricate organizational structures. The person, scene, and organization entities in these novels therefore present many ambiguity problems, which pose a great challenge to downstream natural-language-processing tasks over novel text.
Although entity disambiguation has been studied extensively in the general domain, in complex Chinese texts such as novels the ambiguity problem is still resolved by inefficient manual processing, and a systematic solution is lacking. Conventional entity disambiguation comprises clustering-based methods and knowledge-base-based entity-linking methods. Clustering-based methods generally exploit only shallow semantic information, so entity linking is now the usual route to disambiguation. Entity linking consists of three parts: entity-mention recognition, candidate-entity generation, and candidate-entity ranking. It is constrained by the scale and update speed of the knowledge base, and traditional entity linking relies on hand-built alias tables when generating candidate entities, so entity recall is low in complex settings such as novel text. Meanwhile, entities from the novel domain in general-purpose knowledge bases usually lack effective, usable alias information and statistical information. In addition, niche novels or novels still being updated contain many unlinkable entities, which entity linking alone cannot disambiguate well.
Therefore, how to effectively resolve entity ambiguity in complex Chinese texts such as novels, while improving entity recall and entity-linking accuracy, has become a technical problem that practitioners urgently need to solve.
Disclosure of Invention
In view of the above, the present invention provides an entity disambiguation method for complex Chinese text that can effectively resolve entity ambiguity in this domain and improve entity recall and entity-linking accuracy.
To achieve this object, the present invention provides a method for disambiguating entities in complex Chinese text, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a plurality of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming that mention's pre-candidate entity set; then calculating a first similarity between each entity mention and each pre-candidate entity in its set, and selecting a plurality of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming that mention's candidate entity set; the entity knowledge base stores two pieces of information for each entity, an entity name and a description text, and the first similarity between an entity mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum of the disambiguation similarities between the mention and all its candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the entity knowledge base corresponding to the maximum disambiguation similarity, after which the next mention is judged in the same way; if not, the mention is an unlinkable entity, and the next mention is judged in the same way; once all mentions have been judged, proceed to step four; the disambiguation similarity describes how similar an entity mention is to a candidate entity;
and step four, clustering all unlinkable entities into a plurality of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number.
Compared with the prior art, the invention has the following beneficial effects: it applies an entity retrieval technique and a candidate-entity generation method based on semantic-similarity calculation to entity disambiguation in Chinese text, so that candidate generation no longer depends on manually built alias tables, which effectively improves entity recall in complex Chinese text environments; candidate generation and ranking use only the entity names and description texts in the constructed entity knowledge base, which effectively improves linking accuracy in the ranking stage; by adopting multiple similarity calculation models, namely a two-tower model and an interactive model, the method makes effective use of mention information and entity information even in the absence of external statistical knowledge; through knowledge distillation, the category information output by the mention-recognition model guides the ranking of candidate entities, enabling accurate linking of entity mentions; and the invention also achieves a degree of disambiguation for unlinkable entities in Chinese text.
Drawings
FIG. 1 is a flow chart of a method of entity disambiguation in complex Chinese text in accordance with the present invention.
FIG. 2 is a flowchart illustrating an embodiment of step two in FIG. 1.
FIG. 3 is a detailed calculation flowchart of the two-tower similarity calculation model.
FIG. 4 is a flowchart of step three in FIG. 1, illustrating the calculation of the disambiguation similarity between entity mention A to be disambiguated and candidate entity B in its candidate set.
FIG. 5 is a detailed calculation flowchart of the interactive similarity calculation model.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides a method for entity disambiguation in complex Chinese texts, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a plurality of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming that mention's pre-candidate entity set; then calculating a first similarity between each entity mention and each pre-candidate entity in its set, and selecting a plurality of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming that mention's candidate entity set; the entity knowledge base stores two pieces of information for each entity, an entity name and a description text, and the first similarity between an entity mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum of the disambiguation similarities between the mention and all its candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the entity knowledge base corresponding to the maximum disambiguation similarity, after which the next mention is judged in the same way; if not, the mention is an unlinkable entity, and the next mention is judged in the same way; once all mentions have been judged, proceed to step four; the disambiguation similarity describes how similar an entity mention is to a candidate entity, and the disambiguation-similarity threshold can be set according to actual business needs;
and step four, clustering all unlinkable entities into a plurality of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number (a minimal clustering sketch follows).
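The patent leaves the clustering algorithm of step four unspecified; as one possible realization, here is a minimal sketch that groups unlinkable mentions by the cosine distance of their context-text vectors. The encoder producing vecs, the linkage choice, and the distance threshold are all assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2
                                                     # (older versions name the
                                                     # `metric` parameter `affinity`)

def group_unlinkable(mentions, vecs, distance_threshold=0.5):
    """Cluster unlinkable mentions and label each with its group number."""
    labels = AgglomerativeClustering(
        n_clusters=None,                  # let the threshold pick the group count
        metric="cosine", linkage="average",
        distance_threshold=distance_threshold,
    ).fit_predict(np.asarray(vecs))
    return [(mention, int(group)) for mention, group in zip(mentions, labels)]
```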
Step one may further comprise:
constructing and training an entity-mention recognition model, which stacks a BiLSTM-CRF (Bidirectional Long Short-Term Memory - Conditional Random Field) structure on top of a public Chinese pre-trained language model; the model's input is a Chinese text and its output is the entity mentions recognized in that text, whose types may include person, scene, organization, and so on;
thus, the Chinese text to be disambiguated is input into the trained mention-recognition model, and the model's output is all the entity mentions to be disambiguated extracted from that text. A minimal sketch of such a model follows.
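As one possible realization, a sketch assuming the HuggingFace transformers package for the public Chinese pre-trained model and the pytorch-crf package for the CRF layer; the checkpoint name and hyper-parameters are illustrative, not specified by the patent.

```python
import torch.nn as nn
from transformers import BertModel   # assumed: HuggingFace transformers
from torchcrf import CRF             # assumed: pytorch-crf package

class MentionRecognizer(nn.Module):
    """Chinese pre-trained encoder -> BiLSTM -> CRF tagger over BIO-style
    tags for person / scene / organization mentions."""
    def __init__(self, num_tags, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(h)
        emissions = self.emit(h)
        if tags is not None:  # training: CRF negative log-likelihood
            return -self.crf(emissions, tags, mask=attention_mask.bool())
        return self.crf.decode(emissions, mask=attention_mask.bool())
```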
The present invention can use a published Chinese knowledge base as the entity knowledge base.
As shown in FIG. 2, step two in FIG. 1 may further include the following steps (a minimal end-to-end sketch follows the list):
step 21, searching the entity knowledge base with an entity retrieval technique, such as a k-nearest-neighbor algorithm, and selecting a plurality of pre-candidate entities for each entity mention to be disambiguated, forming that mention's pre-candidate entity set;
step 22, obtaining the context text of each entity mention from the Chinese text to be disambiguated, where the context text may consist of the mention itself, the m words to its left and right, and the Chinese text to be disambiguated; and simultaneously obtaining the knowledge text of each pre-candidate entity of each mention, where the knowledge text may consist of the pre-candidate entity's name and its description text in the entity knowledge base; the value of m can be set according to actual business needs;
for example, the specific format of the context text of an entity mention may be: [CLS] entity mention [SEP] m words to the left of the mention [SEP] m words to the right of the mention [SEP] Chinese text to be disambiguated, and the specific format of the knowledge text of a pre-candidate entity may be: [CLS] pre-candidate entity name [SEP] description text, where [CLS] is the sentence-start token and [SEP] is the mid-sentence and sentence-end separator;
step 23, taking the context text of each entity mention and the knowledge text of each of its pre-candidate entities as a text pair and feeding each pair into a two-tower similarity calculation model to obtain its first similarity; the two-tower model computes a representation vector for each of the two input texts and then the cosine similarity between them, which is the first similarity of the input pair;
and step 24, sorting the first similarities between each mention's context text and the knowledge texts of all its pre-candidate entities in descending order, then selecting the several top-ranked entities as that mention's candidate entities, all candidate entities forming the mention's candidate entity set.
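A minimal sketch of steps 21 to 24 under stated assumptions: encode(text) is a hypothetical sentence encoder returning a NumPy vector, the trained two-tower model of step 23 is abstracted here as cosine similarity over those encoder vectors, and kb is a list of (entity name, description text) records whose name vectors have been precomputed into kb_name_vecs.

```python
import numpy as np

def generate_candidates(mention, left_ctx, right_ctx, doc, kb, kb_name_vecs,
                        encode, k_pre=50, k_cand=5):
    # step 21: k-nearest-neighbour retrieval of pre-candidates over the
    # entity-name vectors of the knowledge base
    q = encode(mention)
    sims = kb_name_vecs @ q / (
        np.linalg.norm(kb_name_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    pre = np.argsort(-sims)[:k_pre]
    # step 22: context text of the mention in the format described above
    ctx = f"[CLS] {mention} [SEP] {left_ctx} [SEP] {right_ctx} [SEP] {doc}"
    v_ctx = encode(ctx)
    # steps 23-24: first similarity for each (context, knowledge text) pair,
    # then keep the top-ranked pre-candidates as candidates
    scored = []
    for i in pre:
        name, desc = kb[i]
        v_know = encode(f"[CLS] {name} [SEP] {desc}")
        cos = float(v_ctx @ v_know /
                    (np.linalg.norm(v_ctx) * np.linalg.norm(v_know) + 1e-9))
        scored.append((cos, int(i)))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k_cand]]   # indices into kb
```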
The two-tower similarity calculation model comprises two identical tower networks and a similarity calculation network; each tower network comprises an embedding layer and multiple layers of blocks, and each block in turn comprises an encoding layer and a combination layer. As shown in FIG. 3, the two-tower model computes as follows:
step A1, the two texts of the input pair are fed into the embedding layers of the two tower networks respectively;
step A2, each tower network processes its input text as follows: the embedding layer converts the input text into a representation vector using a Transformer-based deep language model such as BERT (Bidirectional Encoder Representations from Transformers); this vector, recorded as the embedding-layer representation vector, is passed to the encoding layer of the first block. The encoding layer of the first block receives the embedding-layer representation vector, while the encoding layers of the other blocks receive the vector passed up from the previous block; each block's encoding layer computes an output vector from its input, concatenates that output with the embedding-layer representation vector into an encoding-layer representation vector, and passes it to the block's combination layer. The combination layer computes an output vector from the encoding-layer representation vector it receives, concatenates that output with the embedding-layer representation vector and the encoding-layer representation vector into a combination-layer representation vector, and passes it to the encoding layer of the next block; the combination-layer representation vector of the last block is output to the similarity calculation network;
step A3, the similarity calculation network receives the combination-layer representation vectors from the two tower networks, maps them to the same dimension through an average-pooling layer, and computes the cosine similarity between the two vectors; this cosine similarity is the first similarity output by the two-tower model.
The encoding layer and the combination layer of each block of the two-tower similarity calculation model may each consist of a multi-layer feed-forward network (FFN), where:
each FFN layer of the encoding layer receives the vector passed from the previous FFN layer and passes its computed output to the next FFN layer; the first FFN layer receives the embedding-layer representation vector from the embedding layer, and the last FFN layer concatenates its output with the embedding-layer representation vector into the encoding-layer representation vector, which is output to the block's combination layer; and
each FFN layer of the combination layer likewise receives the vector from the previous FFN layer and passes its output to the next; the first FFN layer receives the encoding-layer representation vector from the block's encoding layer, and the last FFN layer concatenates its output with the received embedding-layer representation vector and encoding-layer representation vector into the combination-layer representation vector, which is output to the encoding layer of the next block. A sketch of the whole model follows.
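A minimal PyTorch sketch of the two-tower model as described: each block concatenates its FFN outputs with the embedding-layer representation exactly as above, and the towers take the embedding-layer vector (e.g. a BERT sentence vector computed upstream) as input. The dimensions and the number of blocks are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One block: an encoding layer and a combination layer, each a small FFN."""
    def __init__(self, in_dim, emb_dim, hidden):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden))
        # the combination layer consumes the encoding-layer representation
        self.combine = nn.Sequential(nn.Linear(hidden + emb_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        # output: [combine_out ; emb ; encoding-layer representation]
        self.out_dim = hidden + emb_dim + (hidden + emb_dim)

    def forward(self, x, emb):
        enc_repr = torch.cat([self.encode(x), emb], dim=-1)  # encoding-layer repr
        comb_out = self.combine(enc_repr)
        return torch.cat([comb_out, emb, enc_repr], dim=-1)  # combination-layer repr

class Tower(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, n_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_dim = emb_dim
        for _ in range(n_blocks):
            blk = Block(in_dim, emb_dim, hidden)
            self.blocks.append(blk)
            in_dim = blk.out_dim

    def forward(self, emb):                  # emb: (batch, emb_dim) text vector
        x = emb
        for blk in self.blocks:
            x = blk(x, emb)
        return x

class TwoTowerSimilarity(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, n_blocks=2, pooled_dim=128):
        super().__init__()
        self.tower_a = Tower(emb_dim, hidden, n_blocks)
        self.tower_b = Tower(emb_dim, hidden, n_blocks)
        self.pooled_dim = pooled_dim

    def forward(self, emb_a, emb_b):
        ra, rb = self.tower_a(emb_a), self.tower_b(emb_b)
        # average pooling maps both representations to the same dimension
        pa = F.adaptive_avg_pool1d(ra.unsqueeze(1), self.pooled_dim).squeeze(1)
        pb = F.adaptive_avg_pool1d(rb.unsqueeze(1), self.pooled_dim).squeeze(1)
        return F.cosine_similarity(pa, pb, dim=-1)   # the first similarity
```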
As shown in FIG. 4, taking entity mention A to be disambiguated and candidate entity B in its candidate set as an example, the calculation in step three of FIG. 1 of the disambiguation similarity between each entity mention and each candidate entity in its candidate set may further comprise the following steps (a sketch follows the list):
step 31, constructing a disambiguation-entity text set for mention A, which may include: mention A itself, and the context text of mention A;
step 32, constructing a candidate-entity text set for candidate entity B, which may include: the entity name of B and the description text of B in the entity knowledge base;
step 33, constructing a similarity set for mention A; then selecting one text from the disambiguation-entity text set of mention A and one from the candidate-entity text set of candidate entity B to form several text pairs, feeding each pair into both the two-tower similarity calculation model and the interactive similarity calculation model to obtain the first similarity and the second similarity of each pair, and finally writing the first and second similarities of all pairs into the similarity set of mention A; the interactive model first computes the representation vectors of the two input texts and the similarity matrix between them, then regenerates new representation vectors weighted by the similarity matrix, and finally concatenates the two new vectors and passes them through a fully connected layer to obtain a final similarity score, which is the second similarity of the input pair;
for example, the disambiguation-entity text set of mention A, {mention, context}, and the candidate-entity text set of candidate entity B, {entity name, description text}, form 4 text pairs: (mention, entity name), (mention, description text), (context, entity name), (context, description text);
and step 34, performing weighted voting over all similarity values in the similarity set of mention A to obtain a final similarity value, which is the disambiguation similarity between mention A and candidate entity B.
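A sketch of steps 31 to 34 together with the linking decision of step three, assuming callables first_sim and second_sim for the two trained models; the voting weights w1 and w2 and the threshold value are illustrative, since the patent leaves both to business needs.

```python
def disambiguation_similarity(mention_A, context_A, name_B, desc_B,
                              first_sim, second_sim, w1=0.5, w2=0.5):
    """Steps 31-34: score the four text pairs with both models and combine
    the resulting similarity set by a weighted vote (here a weighted mean)."""
    pairs = [(mention_A, name_B), (mention_A, desc_B),
             (context_A, name_B), (context_A, desc_B)]
    votes = [w1 * first_sim(a, b) + w2 * second_sim(a, b) for a, b in pairs]
    return sum(votes) / len(votes)

def link_mention(mention_A, context_A, candidates, first_sim, second_sim,
                 threshold=0.7):
    """Step three: link to the best-scoring candidate if it clears the
    threshold, otherwise report the mention as unlinkable (None)."""
    scored = [(disambiguation_similarity(mention_A, context_A, c["name"],
                                         c["desc"], first_sim, second_sim), c)
              for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > threshold else None
```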
As shown in FIG. 5, the interactive similarity calculation model computes as follows (a sketch follows the list):
step B1, the two texts of the input pair are converted into representation vectors;
step B2, the similarity of the two representation vectors is computed as a dot product, yielding a similarity matrix;
step B3, interactive attention encoding is applied to the two representation vectors, and similarity-weighted representation vectors are generated using the similarity matrix; specifically, each word of one representation serves as the query, the other representation serves as the keys and values, and the similarity matrix of the two representations serves as the attention weight, producing two new representation vectors (query, key, and value being the standard roles in attention calculation);
and step B4, feature enhancement is applied to the two representation vectors: they are added and subtracted and the results are concatenated, since addition and subtraction focus attention on the similar and dissimilar parts of the two inputs; the concatenated vector is then passed through a fully connected layer to obtain a final similarity score, which is the second similarity output by the interactive similarity calculation model.
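A minimal sketch of steps B1 to B4 over token-level representation vectors of shape (batch, length, dimension); mean-pooling the attended token vectors into a single vector before the fully connected layer is an assumption, as the patent does not specify the pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveSimilarity(nn.Module):
    """Steps B1-B4 over token-level representations of the two texts."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        # fully connected layer over [u + v ; u - v]
        self.fc = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, a, b):                  # a: (B, La, D), b: (B, Lb, D)
        # B2: dot-product similarity matrix of the two representations
        sim = torch.bmm(a, b.transpose(1, 2))                         # (B, La, Lb)
        # B3: interactive attention -- each word of one text queries the
        # other, with the similarity matrix as the attention weight
        a_new = torch.bmm(F.softmax(sim, dim=-1), b)                  # (B, La, D)
        b_new = torch.bmm(F.softmax(sim.transpose(1, 2), dim=-1), a)  # (B, Lb, D)
        u, v = a_new.mean(dim=1), b_new.mean(dim=1)   # pool tokens to vectors
        # B4: feature enhancement -- add and subtract, then concatenate
        feat = torch.cat([u + v, u - v], dim=-1)
        return torch.sigmoid(self.fc(feat)).squeeze(-1)  # the second similarity
```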
During training, the idea of knowledge distillation can be used, with the entity-mention recognition model as the teacher network and the entity-linking model (comprising the two-tower and interactive similarity calculation models) as the student network. The output of the mention-recognition model includes the category of each mention, which helps locate the category of candidate entities and can guide the linking result of the entity-linking model. The loss function of the final entity-linking model is therefore the original loss of the entity-linking model (generally a cross-entropy loss) plus a knowledge-distillation loss, where the distillation loss combines the loss of the mention-recognition model (generally a cross-entropy loss) with a divergence measure between the mention-recognition model and the entity-linking model. A sketch of one such combined loss follows.
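One possible formulation of this combined loss, assuming KL divergence as the divergence measure between the teacher's (mention-recognition) and the student's (entity-linking) category outputs; alpha and temperature are illustrative hyper-parameters not fixed by the patent.

```python
import torch.nn.functional as F

def entity_link_loss(link_logits, link_labels,
                     teacher_cat_logits, student_cat_logits,
                     alpha=0.5, temperature=2.0):
    """Original entity-linking loss (cross-entropy) plus a distillation term
    pulling the student's category distribution toward the teacher's."""
    ce = F.cross_entropy(link_logits, link_labels)
    t = temperature
    kd = F.kl_div(F.log_softmax(student_cat_logits / t, dim=-1),
                  F.softmax(teacher_cat_logits / t, dim=-1),
                  reduction="batchmean") * (t * t)
    return ce + alpha * kd
```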
The above describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (7)

1. A method for disambiguating entities in complex Chinese text, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a plurality of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming that mention's pre-candidate entity set; then calculating a first similarity between each entity mention and each pre-candidate entity in its set, and selecting a plurality of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming that mention's candidate entity set; the entity knowledge base stores two pieces of information for each entity, an entity name and a description text, and the first similarity between an entity mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum of the disambiguation similarities between the mention and all its candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the entity knowledge base corresponding to the maximum disambiguation similarity, after which the next mention is judged in the same way; if not, the mention is an unlinkable entity, and the next mention is judged in the same way; once all mentions have been judged, proceed to step four; the disambiguation similarity describes how similar an entity mention is to a candidate entity;
and step four, clustering all unlinkable entities into a plurality of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number.
2. The method of claim 1, wherein step one further comprises:
an entity-mention recognition model is constructed and trained, the model consisting of a Chinese pre-trained model with a BiLSTM-CRF structure; its input is a Chinese text and its output is the entity mentions recognized from the input text, the mention types including person, scene, and organization,
such that the Chinese text to be disambiguated is input into the trained mention-recognition model, whose output is all the entity mentions to be disambiguated extracted from that text.
3. The method of claim 1, wherein step two further comprises:
step 21, searching the entity knowledge base with an entity retrieval technique and selecting a plurality of pre-candidate entities for each entity mention to be disambiguated, forming that mention's pre-candidate entity set;
step 22, obtaining the context text of each entity mention from the Chinese text to be disambiguated, the context text comprising the mention itself, the m words to its left and right, and the Chinese text to be disambiguated; and simultaneously obtaining the knowledge text of each pre-candidate entity of each mention, the knowledge text comprising the pre-candidate entity's name and its description text in the entity knowledge base;
step 23, taking the context text of each entity mention and the knowledge text of each of its pre-candidate entities as a text pair and feeding each pair into a two-tower similarity calculation model to obtain its first similarity; the two-tower model computes a representation vector for each of the two input texts and then the cosine similarity between them, which is the first similarity of the input pair;
and step 24, sorting the first similarities between each mention's context text and the knowledge texts of all its pre-candidate entities in descending order, then selecting the several top-ranked entities as that mention's candidate entities, all candidate entities forming the mention's candidate entity set.
4. The method of claim 1, wherein in step three, the calculation of the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, taking mention A and candidate entity B in its candidate set as an example, further comprises:
step 31, constructing a disambiguation-entity text set for mention A, the disambiguation-entity text set comprising: mention A itself, and the context text of mention A;
step 32, constructing a candidate-entity text set for candidate entity B, the candidate-entity text set comprising: the entity name of B and the description text of B in the entity knowledge base;
step 33, constructing a similarity set for mention A; then selecting one text from the disambiguation-entity text set of mention A and one from the candidate-entity text set of candidate entity B to form several text pairs, feeding each pair into both the two-tower similarity calculation model and the interactive similarity calculation model to obtain the first similarity and the second similarity of each pair, and finally writing the first and second similarities of all pairs into the similarity set of mention A; wherein the two-tower model first computes the representation vectors of the two input texts and then the cosine similarity between the two representation vectors, that cosine similarity being the first similarity of the input pair, and the interactive model first computes the representation vectors of the two input texts and the similarity matrix of the two representation vectors, then regenerates new similarity-weighted representation vectors from the similarity matrix, and finally concatenates the two new vectors and passes them through a fully connected layer to obtain a final similarity score, that score being the second similarity of the input pair;
and step 34, performing weighted voting over all similarity values in the similarity set of mention A to obtain a final similarity value, which is the disambiguation similarity between mention A and candidate entity B.
5. The method according to claim 3 or 4, wherein the two-tower similarity calculation model comprises two identical tower networks and a similarity calculation network, each tower network comprising an embedding layer and multiple layers of blocks, and each block further comprising an encoding layer and a combination layer; the two-tower model computes as follows:
step A1, the two texts of the input pair are fed into the embedding layers of the two tower networks respectively;
step A2, each tower network processes its input text as follows: the embedding layer converts the input text into a representation vector, recorded as the embedding-layer representation vector, and passes it to the encoding layer of the first block; the encoding layer of the first block receives the embedding-layer representation vector, while the encoding layers of the other blocks receive the vector passed up from the previous block; each block's encoding layer computes an output vector from its input, concatenates that output with the embedding-layer representation vector into an encoding-layer representation vector, and passes it to the block's combination layer; the combination layer computes an output vector from the encoding-layer representation vector it receives, concatenates that output with the embedding-layer representation vector and the encoding-layer representation vector into a combination-layer representation vector, and passes it to the encoding layer of the next block; the combination-layer representation vector of the last block is output to the similarity calculation network;
step A3, the similarity calculation network receives the combination-layer representation vectors from the two tower networks, maps them to the same dimension through an average-pooling layer, and computes the cosine similarity between the two vectors; this cosine similarity is the first similarity output by the two-tower model.
6. The method of claim 5, wherein the encoding layer and the combination layer of each block of the two-tower similarity calculation model each consist of a multi-layer feed-forward network (FFN), wherein:
each FFN layer of the encoding layer receives the vector passed from the previous FFN layer and passes its computed output to the next FFN layer; the first FFN layer receives the embedding-layer representation vector from the embedding layer, and the last FFN layer concatenates its output with the embedding-layer representation vector into the encoding-layer representation vector, which is output to the block's combination layer,
and each FFN layer of the combination layer likewise receives the vector from the previous FFN layer and passes its output to the next; the first FFN layer receives the encoding-layer representation vector from the block's encoding layer, and the last FFN layer concatenates its output with the received embedding-layer representation vector and encoding-layer representation vector into the combination-layer representation vector, which is output to the encoding layer of the next block.
7. The method of claim 4, wherein the interactive similarity calculation model computes as follows:
step B1, the two texts of the input pair are converted into representation vectors;
step B2, the similarity of the two representation vectors is computed as a dot product, yielding a similarity matrix;
step B3, interactive attention encoding is applied to the two representation vectors, and similarity-weighted representation vectors are generated using the similarity matrix, specifically: each word of one representation serves as the query, the other representation serves as the keys and values, and the similarity matrix of the two representations serves as the attention weight, producing two new representation vectors;
and step B4, feature enhancement is applied to the two representation vectors: they are added and subtracted and the results are concatenated, and the concatenated vector is then passed through a fully connected layer to obtain a final similarity score, which is the second similarity output by the interactive similarity calculation model.
CN202110603755.0A, priority date 2021-05-31, filing date 2021-05-31: Entity disambiguation method in complex Chinese text. Granted as CN113283236B (active).

Priority Applications (1)

Application Number: CN202110603755.0A (granted as CN113283236B); Priority Date: 2021-05-31; Filing Date: 2021-05-31; Title: Entity disambiguation method in complex Chinese text

Applications Claiming Priority (1)

Application Number: CN202110603755.0A (granted as CN113283236B); Priority Date: 2021-05-31; Filing Date: 2021-05-31; Title: Entity disambiguation method in complex Chinese text

Publications (2)

CN113283236A, published 2021-08-20
CN113283236B, published 2022-07-19

Family

Family ID: 77282713

Family Applications (1)

Application Number: CN202110603755.0A (granted as CN113283236B, active); Priority Date: 2021-05-31; Filing Date: 2021-05-31; Title: Entity disambiguation method in complex Chinese text

Country Status (1)

Country Link
CN (1) CN113283236B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
US20200012719A1 (en) * 2018-07-08 2020-01-09 International Business Machines Corporation Automated entity disambiguation
CN111581973A (en) * 2020-04-24 2020-08-25 中国科学院空天信息创新研究院 Entity disambiguation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭咏梅 et al., "Named entity disambiguation combining entity linking and entity clustering", Journal of Beijing University of Posts and Telecommunications (《北京邮电大学学报》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947087A (en) * 2021-12-20 2022-01-18 太极计算机股份有限公司 Label-based relation construction method and device, electronic equipment and storage medium
CN114818736A (en) * 2022-05-31 2022-07-29 北京百度网讯科技有限公司 Text processing method, chain finger method and device for short text and storage medium
CN115600603A (en) * 2022-12-15 2023-01-13 南京邮电大学(Cn) Named entity disambiguation method for Chinese coronary heart disease diagnosis report
CN115600603B (en) * 2022-12-15 2023-04-07 南京邮电大学 Named entity disambiguation method for Chinese coronary heart disease diagnosis report
CN116306504A (en) * 2023-05-23 2023-06-23 匀熵智能科技(无锡)有限公司 Candidate entity generation method and device, storage medium and electronic equipment
CN116306504B (en) * 2023-05-23 2023-08-08 匀熵智能科技(无锡)有限公司 Candidate entity generation method and device, storage medium and electronic equipment

Also Published As

CN113283236B (en), published 2022-07-19

Similar Documents

Publication Publication Date Title
CN113283236B (en) Entity disambiguation method in complex Chinese text
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN107818164A (en) A kind of intelligent answer method and its system
CN107798140A (en) A kind of conversational system construction method, semantic controlled answer method and device
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111931506A (en) Entity relationship extraction method based on graph information enhancement
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN113158671B (en) Open domain information extraction method combined with named entity identification
CN111666764A (en) XLNET-based automatic summarization method and device
CN111061951A (en) Recommendation model based on double-layer self-attention comment modeling
CN112163607A (en) Network social media emotion classification method based on multi-dimension and multi-level combined modeling
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN113704434A (en) Knowledge base question and answer method, electronic equipment and readable storage medium
CN112328773A (en) Knowledge graph-based question and answer implementation method and system
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN115310551A (en) Text analysis model training method and device, electronic equipment and storage medium
CN113326367B (en) Task type dialogue method and system based on end-to-end text generation
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN116910272B (en) Academic knowledge graph completion method based on pre-training model T5

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant