CN113283236B - Entity disambiguation method in complex Chinese text - Google Patents
Entity disambiguation method in complex Chinese text
- Publication number: CN113283236B (application CN202110603755.0A)
- Authority: CN (China)
- Prior art keywords: entity, similarity, disambiguated, layer, text
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/279 — Recognition of textual entities (natural language analysis)
- G06F40/216 — Parsing using statistical methods
- G06F40/30 — Semantic analysis
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/084 — Backpropagation, e.g. using gradient descent
Abstract
A method of entity disambiguation in complex Chinese text, comprising: extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated; for each mention, using an entity retrieval technique to select several entities from an entity knowledge base as pre-candidate entities, then calculating the first similarity between the mention and each pre-candidate entity, and selecting several entities as candidate entities according to the first similarity; and calculating the disambiguation similarity between each mention and each of its candidate entities, and judging whether the maximum disambiguation similarity over all of a mention's candidates exceeds the disambiguation-similarity threshold: if so, the mention is a linkable entity and is linked to the candidate entity in the knowledge base that attains the maximum. The invention belongs to the technical field of information; it can effectively resolve entity ambiguity in complex Chinese texts and improves entity recall and entity-linking accuracy.
Description
Technical Field
The invention relates to an entity disambiguation method for complex Chinese text and belongs to the technical field of information.
Background
Natural language is rife with ambiguity and non-standard usage: abbreviations, shorthand, and habits of usage mean that the same word can carry different meanings in different contexts, and that different words can carry the same meaning. Chinese is especially rich in semantics and forms of expression, and literary works such as novels often contain large casts of characters, many scenes, and intricate organizational structures. The person, scene, and organization entities in such novels therefore present many ambiguity problems, which pose a great challenge to downstream natural language processing tasks built on these texts.
Although entity disambiguation in the general domain has been studied extensively, in complex Chinese texts such as novels the ambiguity problem still relies on inefficient manual processing, and a systematic solution is lacking. Conventional entity disambiguation comprises clustering-based disambiguation and knowledge-base-based entity linking. Clustering-based methods can generally exploit only shallow semantic information, so entity linking is the prevailing approach. Entity linking consists of three parts: entity mention recognition, candidate entity generation, and candidate entity ranking. It is limited by the scale and update speed of the knowledge base, and traditional methods depend on a manually built alias table for candidate generation, which yields low entity recall in a complex setting such as novel text. Meanwhile, novel-domain entities in general knowledge bases usually lack usable alias and statistical information. In addition, niche novels and novels still being serialized contain many unlinkable entities, which entity linking alone cannot disambiguate well.
Therefore, effectively resolving entity ambiguity in complex Chinese texts such as novels, while improving entity recall and entity-linking accuracy, has become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, the present invention provides an entity disambiguation method for complex Chinese text that can effectively resolve entity ambiguity in this domain and improve entity recall and entity-linking accuracy.
To achieve this object, the present invention provides a method for disambiguating entities in complex Chinese text, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, using an entity retrieval technique to select several entities from an entity knowledge base as pre-candidate entities for each entity mention to be disambiguated, all of which form that mention's pre-candidate entity set; then calculating the first similarity between each mention and each entity in its pre-candidate set, and selecting several pre-candidate entities as candidate entities according to the first similarity, all of which form that mention's candidate entity set; the entity knowledge base stores two pieces of information per entry, an entity name and a description text, and the first similarity between a mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum disambiguation similarity over all of a mention's candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity, and it is linked to the candidate entity in the knowledge base that attains the maximum; if not, the mention is an unlinkable entity; in either case the judgment proceeds to the next mention, and once all mentions have been judged, the method continues with step four; the disambiguation similarity describes how similar a mention and a candidate entity are;
and step four, clustering all unlinkable entities into several groups, assigning each group a number, and labeling each unlinkable entity with its group number.
Compared with the prior art, the invention has the following beneficial effects: by applying an entity retrieval technique and a candidate generation method based on semantic similarity to Chinese entity disambiguation, candidate entities are generated without a manually built alias table, which effectively improves entity recall in complex Chinese text; candidate generation and ranking use only the entity names and description texts in the constructed knowledge base, which effectively improves linking accuracy in the ranking stage; by combining multiple similarity models, namely a two-tower similarity calculation model and an interactive similarity calculation model, mention and entity information is exploited effectively even without external statistical knowledge; through knowledge distillation, the category information output by the mention recognition model guides candidate ranking, enabling accurate linking of entity mentions; and the invention also disambiguates unlinkable entities in Chinese text to a certain degree.
Drawings
FIG. 1 is a flow chart of a method of entity disambiguation in complex Chinese text in accordance with the present invention.
FIG. 2 is a detailed flowchart of step two in FIG. 1.
Fig. 3 is a specific calculation flowchart of the two-tower similarity calculation model.
FIG. 4 is a detailed flowchart of step three in FIG. 1, computing the disambiguation similarity between entity mention A to be disambiguated and candidate entity B in its candidate set.
Fig. 5 is a specific calculation flowchart of the interactive similarity calculation model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides a method for entity disambiguation in complex Chinese text, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, using an entity retrieval technique to select several entities from an entity knowledge base as pre-candidate entities for each entity mention to be disambiguated, all of which form that mention's pre-candidate entity set; then calculating the first similarity between each mention and each entity in its pre-candidate set, and selecting several pre-candidate entities as candidate entities according to the first similarity, all of which form that mention's candidate entity set; the entity knowledge base stores two pieces of information per entry, an entity name and a description text, and the first similarity between a mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum disambiguation similarity over all of a mention's candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity, and it is linked to the candidate entity in the knowledge base that attains the maximum; if not, the mention is an unlinkable entity; in either case the judgment proceeds to the next mention, and once all mentions have been judged, the method continues with step four; the disambiguation similarity describes how similar a mention and a candidate entity are, and its threshold can be set according to actual service requirements;
and step four, clustering all unlinkable entities into several groups, assigning each group a number, and labeling each unlinkable entity with its group number.
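The patent does not fix a clustering algorithm for step four. As one illustrative sketch, unlinkable mentions (represented here by arbitrary context vectors) can be grouped greedily by cosine similarity to a running group centroid; the `threshold` value and the greedy scheme itself are assumptions, not part of the invention:

```python
import numpy as np

def group_unlinkable(vectors, names, threshold=0.8):
    """Greedily group unlinkable entity mentions: a mention joins the first
    existing group whose centroid has cosine similarity above `threshold`,
    otherwise it starts a new group and receives the next group number."""
    centroids, groups = [], []          # one running centroid sum per group
    labels = {}                         # mention name -> group number
    for vec, name in zip(vectors, names):
        vec = vec / np.linalg.norm(vec)
        for gid, c in enumerate(centroids):
            if float(vec @ (c / np.linalg.norm(c))) > threshold:
                centroids[gid] = c + vec
                groups[gid].append(name)
                labels[name] = gid
                break
        else:
            centroids.append(vec.copy())
            groups.append([name])
            labels[name] = len(groups) - 1
    return labels, groups
```

Each mention is then labeled with its group number, which realizes the partial disambiguation of unlinkable entities described above.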
The first step may further comprise:
an entity mention recognition model is constructed and trained. The model consists of a publicly available Chinese pre-trained model with a BiLSTM-CRF (Bidirectional Long Short-Term Memory - Conditional Random Field) structure on top; its input is Chinese text and its output is the entity mentions recognized in that text, whose types may include persons, scenes, organizations, and so on.
The Chinese text to be disambiguated is thus fed into the trained mention recognition model, whose output is all the entity mentions to be disambiguated extracted from that text.
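The full BERT-plus-BiLSTM-CRF tagger is too large to reproduce here, but its decoding step, Viterbi search over the CRF layer's emission and transition scores, can be sketched in isolation. The scores below are placeholders, not trained parameters:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Viterbi decoding for the CRF layer of a BiLSTM-CRF tagger.
    emissions:   (seq_len, n_tags) per-token tag scores from the BiLSTM
    transitions: (n_tags, n_tags) tag-transition scores
    Returns the highest-scoring tag sequence (e.g. BIO labels for
    person / scene / organization mentions)."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()         # best score of a path ending in each tag
    back = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # cand[i, j] = best path ending in tag i, then transitioning to tag j
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final tag
    tags = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]
```

In the real model the emissions come from the BiLSTM over the pre-trained encoder's outputs, and the transition matrix is learned jointly.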
The present invention can use the published chinese knowledge base as the entity knowledge base.
As shown in fig. 2, step two in fig. 1 may further include:
For example, the context text of an entity mention to be disambiguated may take the form: [CLS] mention [SEP] m words to the left of the mention [SEP] m words to the right of the mention [SEP] Chinese text to be disambiguated; the knowledge text of a pre-candidate entity may take the form: [CLS] pre-candidate entity name [SEP] description text, where [CLS] is the sentence-start token and [SEP] is the mid-sentence and end-of-sentence separator;
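The two text formats above can be assembled as plain strings before tokenization. This sketch uses character windows of width m, an assumption since Chinese BERT-style models are typically character-level and the text only says "m words":

```python
def build_context_text(text, start, end, m=10):
    """Context text of a mention spanning text[start:end], in the format
    [CLS] mention [SEP] left m chars [SEP] right m chars [SEP] full text."""
    mention = text[start:end]
    left, right = text[max(0, start - m):start], text[end:end + m]
    return f"[CLS]{mention}[SEP]{left}[SEP]{right}[SEP]{text}"

def build_knowledge_text(name, description):
    """Knowledge text of a pre-candidate entity: [CLS] name [SEP] description."""
    return f"[CLS]{name}[SEP]{description}"
```

In practice the tokenizer of the chosen pre-trained model would insert its own [CLS]/[SEP] tokens; the strings here only illustrate the layout.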
and step 24, ranking the first similarities between the context text of each mention and the knowledge texts of all its pre-candidate entities in descending order, then selecting the several top-ranked entities as that mention's candidate entities, all of which form its candidate entity set.
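Step 24 is a plain descending sort followed by a top-k cut; a minimal sketch (the cutoff k is left open by the text):

```python
def select_candidates(pre_candidates, similarities, k=5):
    """Rank pre-candidate entities by first similarity (descending) and keep
    the top k as the mention's candidate entity set."""
    ranked = sorted(zip(pre_candidates, similarities),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```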
The two-tower similarity calculation model comprises two identical tower networks and a similarity calculation network. Each tower network comprises an embedding layer and several stacked blocks, and each block in turn comprises an encoding layer and a combination layer. As shown in fig. 3, the model computes as follows:
step A1, the two texts of the input text pair are fed into the embedding layers of the two tower networks respectively;
step A2, each tower network processes its input text as follows: the embedding layer converts the input text into a representation vector using a transformer-based deep language model such as BERT (Bidirectional Encoder Representations from Transformers), and this embedding-layer representation vector is passed to the encoding layer of the first block. The encoding layer of the first block receives the embedding-layer representation vector, while the encoding layers of the other blocks receive the output of the previous block; each encoding layer computes an output vector from its input, concatenates it with the embedding-layer representation vector to form the encoding-layer representation vector, and passes that to its block's combination layer. The combination layer computes an output vector from the encoding-layer representation vector, concatenates it with the embedding-layer and encoding-layer representation vectors to form the combination-layer representation vector, and passes that to the encoding layer of the next block; the combination-layer representation vector of the last block is output to the similarity calculation network;
and step A3, the similarity calculation network receives the combination-layer representation vectors from the two tower networks, maps them to the same dimension via an average pooling layer, and computes the cosine similarity between the two pooled vectors; this cosine similarity is the first similarity output by the two-tower model.
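Step A3 reduces to mean pooling over the token axis followed by cosine similarity; a direct numpy sketch (the token-axis pooling interpretation is an assumption consistent with mapping variable-length sequences to a common dimension):

```python
import numpy as np

def tower_similarity(repr_a, repr_b):
    """Final step of the two-tower model: average-pool each tower's
    combination-layer representation over the token axis, then return the
    cosine similarity between the pooled vectors (the first similarity)."""
    pooled_a = repr_a.mean(axis=0)      # (seq_len_a, dim) -> (dim,)
    pooled_b = repr_b.mean(axis=0)      # (seq_len_b, dim) -> (dim,)
    return float(pooled_a @ pooled_b /
                 (np.linalg.norm(pooled_a) * np.linalg.norm(pooled_b)))
```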
The encoding layer and combination layer of each block of the two-tower model can each consist of several layers of feed-forward networks (FFNs), where:
each FFN of the encoding layer receives the vector output by the previous FFN and passes its own output to the next; the first FFN receives the embedding-layer representation vector from the embedding layer, and the last FFN concatenates its output with the embedding-layer representation vector to form the encoding-layer representation vector, which is passed to the block's combination layer;
each FFN of the combination layer likewise receives the previous FFN's output and passes its own output onward; the first FFN receives the encoding-layer representation vector from the block's encoding layer, and the last FFN concatenates its output with the embedding-layer and encoding-layer representation vectors to form the combination-layer representation vector, which is passed to the encoding layer of the next block.
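The concatenation pattern of one block can be sketched with single-layer FFNs; the widths and the random (untrained) weights are illustrative assumptions, since the patent does not fix layer counts or dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def ffn(x, w, b):
    """One feed-forward layer with ReLU activation."""
    return np.maximum(x @ w + b, 0.0)

def tower_block(embed_vec, block_in, dim):
    """One block of a tower network: the encoding layer's output is
    concatenated with the embedding representation; the combination layer's
    output is concatenated with both the embedding-layer and encoding-layer
    representations, and fed to the next block's encoding layer."""
    enc_out = ffn(block_in, rng.standard_normal((block_in.shape[-1], dim)),
                  np.zeros(dim))
    enc_repr = np.concatenate([enc_out, embed_vec], axis=-1)
    comb_out = ffn(enc_repr, rng.standard_normal((enc_repr.shape[-1], dim)),
                   np.zeros(dim))
    comb_repr = np.concatenate([comb_out, embed_vec, enc_repr], axis=-1)
    return comb_repr
```

The growing concatenations mean each block's output dimension exceeds its input's, which is why step A3 needs pooling to a common dimension before the cosine similarity.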
As shown in fig. 4, taking entity mention A to be disambiguated and candidate entity B in its candidate set as an example, the computation in step three of fig. 1 of the disambiguation similarity between each mention and each of its candidate entities may further include:
For example, the disambiguation text set of mention A, {mention, context text}, and the candidate text set of candidate entity B, {entity name, description text}, can form 4 groups of text pairs: (mention, entity name), (mention, description text), (context text, entity name), and (context text, description text);
and step 34, performing weighted voting over all similarity values in mention A's similarity set to obtain the final similarity value, i.e., the disambiguation similarity between mention A and candidate entity B.
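The weighted voting of step 34 can be sketched as a weighted average of the first and second similarities of all text pairs; the uniform default weights are an assumption, since the patent leaves the weighting scheme open:

```python
def weighted_vote(similarities, weights=None):
    """Combine the similarity values for one (mention, candidate) pair into
    a single disambiguation similarity by weighted averaging."""
    if weights is None:
        weights = [1.0] * len(similarities)     # uniform weights by default
    return sum(s * w for s, w in zip(similarities, weights)) / sum(weights)
```

The resulting value is then compared against the disambiguation-similarity threshold of step three.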
As shown in fig. 5, the specific calculation process of the interactive similarity calculation model is as follows:
step B1, the two texts of the input text pair are each converted into a representation vector;
step B2, the similarity of the two representation vectors is computed by dot product, yielding a similarity matrix;
step B3, the two representation vectors are interactively attention-encoded, and similarity-weighted representation vectors are generated from the similarity matrix: each word of one representation vector serves as a Query, the other representation vector serves as the Keys and Values, and the similarity matrix of the two vectors provides the attention weights, yielding two new representation vectors (Query, Key, and Value here follow the standard attention computation pattern);
and step B4, the two representation vectors are feature-enhanced: their sum and difference are concatenated (adding and subtracting highlights the similar and dissimilar parts of the two inputs), and the concatenated vector is passed through a fully connected layer to obtain the final similarity score, i.e., the second similarity output by the interactive similarity calculation model.
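Steps B2 through B4 can be sketched in numpy. The pooling over tokens before the fully connected layer, and the untrained weights `w_fc`/`b_fc`, are illustrative assumptions needed to obtain a single score:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_similarity(a, b, w_fc, b_fc):
    """Sketch of the interactive model: dot-product similarity matrix (B2),
    cross-attention with that matrix as weights (B3), then feature
    enhancement by concatenating sum and difference features, pooled and
    passed through a fully connected layer (B4).
    a: (len_a, dim), b: (len_b, dim) token representation vectors."""
    sim = a @ b.T                          # (len_a, len_b) similarity matrix
    a_att = softmax(sim, axis=1) @ b       # a's tokens re-expressed over b
    b_att = softmax(sim.T, axis=1) @ a     # b's tokens re-expressed over a
    feat = np.concatenate([(a_att + a).mean(0) + (b_att + b).mean(0),
                           (a_att - a).mean(0) + (b_att - b).mean(0)])
    return float(feat @ w_fc + b_fc)       # second similarity score
```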
During training, the idea of knowledge distillation can be used, with the entity mention recognition model as the teacher network and the entity linking model (comprising the two-tower and interactive similarity calculation models) as the student network. The output of the mention recognition model includes the category of each mention, which helps locate the category of candidate entities and can guide the linking results of the entity linking model. The loss function of the final entity linking model is therefore its original loss (generally a cross entropy loss) plus a knowledge distillation loss, where the distillation loss is the loss of the mention recognition model (generally a cross entropy loss) plus a divergence measure between the mention recognition model and the entity linking model.
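The combined training objective can be sketched as follows; the mixing coefficient `alpha` and the use of cross entropy as the divergence between teacher and student category distributions are illustrative assumptions:

```python
import numpy as np

def cross_entropy(pred, target, eps=1e-12):
    """Cross entropy between a predicted and a target distribution."""
    return float(-(target * np.log(pred + eps)).sum(axis=-1).mean())

def total_loss(link_pred, link_gold, teacher_cat, student_cat, alpha=0.5):
    """Student (entity-linking) loss: its own cross-entropy linking loss
    plus a distillation term that pulls the student's entity-category
    distribution toward the mention recognition teacher's."""
    link_loss = cross_entropy(link_pred, link_gold)
    distill_loss = cross_entropy(student_cat, teacher_cat)
    return link_loss + alpha * distill_loss
```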
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (7)
1. A method for disambiguating entities in complex Chinese texts, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, using an entity retrieval technique to select several entities from an entity knowledge base as pre-candidate entities for each entity mention to be disambiguated, all of which form that mention's pre-candidate entity set; then calculating the first similarity between each mention and each entity in its pre-candidate set, and selecting several pre-candidate entities as candidate entities according to the first similarity, all of which form that mention's candidate entity set; the entity knowledge base stores two pieces of information per entry, an entity name and a description text, and the first similarity between a mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum disambiguation similarity over all of a mention's candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity, and it is linked to the candidate entity in the knowledge base that attains the maximum; if not, the mention is an unlinkable entity; in either case the judgment proceeds to the next mention, and once all mentions have been judged, the method continues with step four; the disambiguation similarity describes how similar a mention and a candidate entity are;
and step four, clustering all unlinkable entities into several groups, assigning each group a number, and labeling each unlinkable entity with its group number.
2. The method of claim 1, wherein step one further comprises:
constructing and training an entity mention recognition model, the model being formed by a Chinese pre-trained model with a BiLSTM-CRF structure, whose input is Chinese text and whose output is the entity mentions recognized in that text, the mention types comprising persons, scenes and organizations,
such that the Chinese text to be disambiguated is input into the trained mention recognition model, whose output is all the entity mentions to be disambiguated extracted from that text.
3. The method of claim 1, wherein step two further comprises:
step 21, searching the entity knowledge base with an entity retrieval technique and selecting several pre-candidate entities for each entity mention to be disambiguated, which form that mention's pre-candidate entity set;
step 22, obtaining from the Chinese text to be disambiguated the context text of each mention, comprising the mention itself, the m words to its left and right, and the Chinese text to be disambiguated, and simultaneously obtaining the knowledge text of each of the mention's pre-candidate entities, comprising the pre-candidate entity's name and its description text in the entity knowledge base;
step 23, taking the context text of each mention and the knowledge text of each of its pre-candidate entities as a group of text pairs and inputting them into the two-tower similarity calculation model to obtain the first similarity of each group, where the two-tower model computes the representation vectors of the two texts of an input pair and then the cosine similarity between them, which is the pair's first similarity;
and step 24, ranking the first similarities between the context text of each mention and the knowledge texts of all its pre-candidate entities in descending order, then selecting the several top-ranked entities as that mention's candidate entities, all of which form its candidate entity set.
4. The method of claim 1, wherein in step three, the disambiguation similarity between each of the indications of entities to be disambiguated and each of the candidate entities in its candidate entity set is calculated, further comprising:
Step 31: construct a disambiguated-entity text set for the entity mention A to be disambiguated, the set comprising the mention A itself and the context text of A;
Step 32: construct a candidate-entity text set for the candidate entity B, the set comprising the candidate entity B and the description texts of B in the entity knowledge base;
Step 33: construct a similarity set for the entity mention A; select one text from the disambiguated-entity text set of A and one text from the candidate-entity text set of B to form multiple text pairs; input each pair into both the dual-tower similarity calculation model and an interactive similarity calculation model to obtain the first similarity and the second similarity output by the respective models for each pair; and write the first and second similarities of all pairs into the similarity set of A; wherein the dual-tower model calculates a respective expression vector for each of the two texts in the input pair and then the cosine similarity between the two expression vectors, this cosine similarity being the first similarity of the pair, and the interactive model calculates the respective expression vectors of the two texts together with a similarity matrix between the two expression vectors, generates new similarity-weighted expression vectors from that matrix, concatenates the two newly generated expression vectors, and obtains a final similarity score through a fully connected layer, this score being the second similarity of the pair;
Step 34: perform weighted voting over all similarity values in the similarity set of the entity mention A to obtain a final similarity value, namely the disambiguation similarity between the mention A and the candidate entity B.
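Step 34's weighted vote can be illustrated as below; the particular weight values are hypothetical, since the claim specifies only that a weighted vote is taken over the similarity set.

```python
def disambiguation_similarity(similarities, weights):
    # Weighted vote over all first and second similarities in the
    # similarity set of mention A (step 34). The weights are an
    # assumption; the claim does not fix them.
    total = sum(weights)
    return sum(s * w for s, w in zip(similarities, weights)) / total
```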
5. The method according to claim 3 or 4, wherein the dual-tower similarity calculation model comprises two identical tower networks and a similarity calculation network, each tower network comprising an embedding layer and multiple blocks, each block in turn comprising a coding layer and a combination layer, and wherein the dual-tower similarity calculation model computes as follows:
Step A1: input the two texts of the input text pair into the embedding layers of the two tower networks, respectively;
Step A2: each tower network processes its input text as follows: the embedding layer converts the input text into an expression vector, recorded as the embedding-layer expression vector, and passes it to the coding layer of the first block; the coding layer of the first block receives the embedding-layer expression vector, while the coding layers of the remaining blocks receive the input vector passed on by the previous block; the coding layer of each block computes an output vector from its input, concatenates that output vector with the embedding-layer expression vector to form the coding-layer expression vector, and passes it to the combination layer of the same block; the combination layer computes an output vector from the received coding-layer expression vector, concatenates that output with the embedding-layer expression vector and the coding-layer expression vector to form the combination-layer expression vector, and passes it to the coding layer of the next block; the combination-layer expression vector of the last block is passed to the similarity calculation network;
Step A3: the similarity calculation network receives the combination-layer expression vectors from the two tower networks, maps them to the same dimension through an average pooling layer, and then calculates the cosine similarity between the two resulting vectors, this cosine similarity being the first similarity output by the dual-tower similarity calculation model.
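A minimal sketch of one tower's forward pass (steps A1-A3). The concrete coding and combination layers here are stand-in callables, and the pooling-by-reshape used to map the two outputs to a common dimension is an assumption, not the claimed implementation.

```python
import numpy as np

def tower_forward(embed_vec, blocks):
    # One tower network (step A2): each block's coding-layer output is
    # concatenated with the embedding-layer vector; the combination-layer
    # output is concatenated with both. `blocks` is a list of
    # (coding_layer, combination_layer) callables (stand-ins here).
    x = embed_vec
    for coding_layer, combination_layer in blocks:
        enc_repr = np.concatenate([coding_layer(x), embed_vec])
        comb_out = combination_layer(enc_repr)
        x = np.concatenate([comb_out, embed_vec, enc_repr])
    return x  # combination-layer expression vector of the last block

def first_similarity(u, v, dim):
    # Step A3: average-pool both tower outputs down to `dim` values,
    # then take the cosine similarity of the pooled vectors.
    pool = lambda w: w[: len(w) - len(w) % dim].reshape(dim, -1).mean(axis=1)
    u, v = pool(u), pool(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```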
6. The method of claim 5, wherein the coding layer and the combination layer of each block of the dual-tower similarity calculation model each consist of multiple feed-forward network (FFN) layers, wherein:
each FFN layer of the coding layer receives the input vector from the previous FFN layer and passes its computed output vector to the next FFN layer, wherein the first FFN layer receives the embedding-layer expression vector from the embedding layer, and the last FFN layer concatenates its computed output vector with the embedding-layer expression vector into the coding-layer expression vector, which is output to the combination layer of the same block; and
each FFN layer of the combination layer receives the input vector from the previous FFN layer and passes its computed output vector to the next FFN layer, wherein the first FFN layer receives the coding-layer expression vector from the coding layer of the same block, and the last FFN layer concatenates its computed output vector with the embedding-layer expression vector and the coding-layer expression vector into the combination-layer expression vector, which is output to the coding layer of the next block.
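The stacked-FFN coding layer of claim 6 can be sketched as follows. The linear-ReLU-linear parameterization of each FFN is an assumption, since the claim does not fix the activation function or layer widths; the combination layer would be structured analogously.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # One feed-forward layer: linear -> ReLU -> linear.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def coding_layer(embed_repr, ffn_params):
    # Stack of FFN layers; the last layer's output is concatenated with
    # the embedding-layer expression vector to form the coding-layer
    # expression vector passed to the combination layer.
    h = embed_repr
    for W1, b1, W2, b2 in ffn_params:
        h = ffn(h, W1, b1, W2, b2)
    return np.concatenate([h, embed_repr])
```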
7. The method according to claim 4, wherein the interactive similarity calculation model computes as follows:
Step B1: convert the two texts of the input text pair into expression vectors, respectively;
Step B2: calculate the similarity of the two expression vectors by dot product to obtain a similarity matrix;
Step B3: perform interactive attention coding on the two expression vectors, using the similarity matrix to generate similarity-weighted expression vectors; specifically, take each word of one expression vector as a query, the other expression vector as keys and values, and the similarity matrix of the two expression vectors as attention weights, thereby obtaining two new expression vectors;
Step B4: perform feature enhancement on the two expression vectors, namely add and subtract them and concatenate the results, then pass the concatenated vector through a fully connected layer to obtain the final similarity score, namely the second similarity output by the interactive similarity calculation model.
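Steps B1-B4 can be outlined as below, with word-vector matrices standing in for the two texts. The mean-pooling to sentence vectors and the exact order of the enhancement features are assumptions, and the final fully connected layer is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_features(A, B):
    # A: (m, d) word vectors of text 1; B: (n, d) word vectors of text 2.
    S = A @ B.T                                    # step B2: similarity matrix
    A_new = softmax(S, axis=1) @ B                 # step B3: A attends over B
    B_new = softmax(S.T, axis=1) @ A               #          B attends over A
    a, b = A_new.mean(axis=0), B_new.mean(axis=0)  # pool to sentence vectors
    # Step B4: add/subtract then concatenate; a fully connected layer
    # (omitted here) would map this to the second-similarity score.
    return np.concatenate([a, b, a + b, a - b])
```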
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110603755.0A CN113283236B (en) | 2021-05-31 | 2021-05-31 | Entity disambiguation method in complex Chinese text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113283236A CN113283236A (en) | 2021-08-20 |
CN113283236B true CN113283236B (en) | 2022-07-19 |
Family
ID=77282713
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110603755.0A (Expired - Fee Related) CN113283236B (en) | Entity disambiguation method in complex Chinese text | 2021-05-31 | 2021-05-31 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113283236B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113947087B (en) * | 2021-12-20 | 2022-04-15 | Taiji Computer Co., Ltd. | Label-based relation construction method and device, electronic equipment and storage medium |
CN114818736B (en) * | 2022-05-31 | 2023-06-09 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Text processing method, entity linking method and device for short text, and storage medium |
CN115600603B (en) * | 2022-12-15 | 2023-04-07 | Nanjing University of Posts and Telecommunications | Named entity disambiguation method for Chinese coronary heart disease diagnosis report |
CN116306504B (en) * | 2023-05-23 | 2023-08-08 | Yunshang Intelligent Technology (Wuxi) Co., Ltd. | Candidate entity generation method and device, storage medium and electronic equipment |
CN118468854A (en) * | 2024-07-09 | 2024-08-09 | Nanjing University of Information Science and Technology | Task demand analysis method, system and storage medium based on entity linking |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107102989A (en) * | 2017-05-24 | 2017-08-29 | Nanjing University | Entity disambiguation method based on word vectors and convolutional neural networks |
CN107861939A (en) * | 2017-09-30 | 2018-03-30 | Kunming University of Science and Technology | Domain entity disambiguation method fusing word vectors and topic models |
CN111581973A (en) * | 2020-04-24 | 2020-08-25 | Aerospace Information Research Institute, Chinese Academy of Sciences | Entity disambiguation method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10810375B2 (en) * | 2018-07-08 | 2020-10-20 | International Business Machines Corporation | Automated entity disambiguation |
Non-Patent Citations (1)
Title |
---|
Named entity disambiguation combining entity linking and entity clustering; Tan Yongmei et al.; Journal of Beijing University of Posts and Telecommunications; 2014-10-31; full text * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113283236B (en) | Entity disambiguation method in complex Chinese text | |
CN109918666B (en) | Chinese punctuation mark adding method based on neural network | |
CN109271529B (en) | Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian | |
CN111209401A (en) | System and method for classifying and processing sentiment polarity of online public opinion text information | |
CN112270193A (en) | Chinese named entity identification method based on BERT-FLAT | |
CN107798140A (en) | Conversational system construction method, and semantic-controlled question answering method and device | |
CN107818164A (en) | Intelligent question answering method and system | |
CN111931506A (en) | Entity relationship extraction method based on graph information enhancement | |
CN115858758A (en) | Intelligent customer service knowledge graph system with multiple unstructured data identification | |
CN110532395B (en) | Semantic embedding-based word vector improvement model establishing method | |
CN112784878B (en) | Intelligent correction method and system for Chinese treatises | |
CN113158671B (en) | Open domain information extraction method combined with named entity identification | |
CN115310551A (en) | Text analysis model training method and device, electronic equipment and storage medium | |
CN111666764A (en) | XLNET-based automatic summarization method and device | |
CN116910272B (en) | Academic knowledge graph completion method based on pre-training model T5 | |
CN113012822A (en) | Medical question-answering system based on generating type dialogue technology | |
CN116662500A (en) | Method for constructing question-answering system based on BERT model and external knowledge graph | |
CN114064901B (en) | Book comment text classification method based on knowledge graph word meaning disambiguation | |
CN114443846B (en) | Classification method and device based on multi-level text different composition and electronic equipment | |
CN114020871B (en) | Multi-mode social media emotion analysis method based on feature fusion | |
CN116010553A (en) | Viewpoint retrieval system based on two-way coding and accurate matching signals | |
CN115510841A (en) | Text matching method based on data enhancement and graph matching network | |
CN111666374A (en) | Method for integrating additional knowledge information into deep language model | |
CN114972907A (en) | Image semantic understanding and text generation based on reinforcement learning and contrast learning | |
CN114757184A (en) | Method and system for realizing knowledge question answering in aviation field |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220719 |