CN113283236B - Entity disambiguation method in complex Chinese text - Google Patents

Entity disambiguation method in complex Chinese text

Info

Publication number
CN113283236B
CN113283236B (application CN202110603755.0A; publication CN113283236A/B)
Authority
CN
China
Prior art keywords
entity
similarity
disambiguated
layer
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110603755.0A
Other languages
Chinese (zh)
Other versions
CN113283236A (en
Inventor
王玉龙
王闯
刘同存
王纯
张乐剑
王晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202110603755.0A priority Critical patent/CN113283236B/en
Publication of CN113283236A publication Critical patent/CN113283236A/en
Application granted granted Critical
Publication of CN113283236B publication Critical patent/CN113283236B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method of entity disambiguation in complex Chinese text comprises: extracting all entity mentions to be disambiguated from the Chinese text; for each mention, using an entity retrieval technique to select a number of entities from an entity knowledge base as pre-candidate entities, calculating a first similarity between the mention and each pre-candidate entity, and selecting a number of them as candidate entities according to that similarity; and calculating a disambiguation similarity between each mention and each of its candidate entities, judging whether the maximum disambiguation similarity over all candidates exceeds a threshold, and, if so, treating the mention as a linkable entity and linking it to the candidate entity in the knowledge base corresponding to that maximum. The invention belongs to the field of information technology; it can effectively resolve entity ambiguity in complex Chinese texts and improves both entity recall and entity-linking accuracy.

Description

Entity disambiguation method in complex Chinese text
Technical Field
The invention relates to an entity disambiguation method for complex Chinese text, and belongs to the field of information technology.
Background
Natural language exhibits widespread ambiguity and non-standard usage: shorthand, abbreviations, and habits of use mean that in different contexts the same word can express different meanings and different words can express the same meaning. Chinese is particularly rich in semantics and forms of expression, and literary works such as novels often feature large casts of characters, many scenes, and intricate organizational structures. Entities such as characters, scenes, and organizations in novels therefore present many ambiguity problems, which pose a great challenge to downstream natural language processing tasks over such texts.
Although entity disambiguation in the general domain has been studied extensively, in complex Chinese texts such as novels the ambiguity problem still relies on inefficient manual processing, and a systematic solution is lacking. Conventional entity disambiguation comprises clustering-based disambiguation and knowledge-base-based entity linking. Clustering-based methods can generally exploit only shallow semantic information, so entity linking is now the dominant approach. Entity linking consists of three parts: entity mention recognition, candidate entity generation, and candidate entity ranking. It is limited by the scale and update speed of the knowledge base, and traditional methods depend on hand-built alias tables for candidate generation, so entity recall is low in a complex setting such as novel text. Meanwhile, novel-domain entities in general knowledge bases usually lack usable alias and statistical information. In addition, obscure novels and novels still being serialized contain many unlinkable entities, which entity linking alone cannot disambiguate well.
Therefore, how to effectively resolve entity ambiguity in complex Chinese texts such as novels, and to improve entity recall and entity-linking accuracy, is a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, the present invention provides an entity disambiguation method for complex Chinese text that can effectively resolve entity ambiguity in complex Chinese texts and improve entity recall and entity-linking accuracy.
To achieve this object, the invention provides a method for disambiguating entities in complex Chinese text, comprising:
step one, extracting all entity mentions to be disambiguated from a Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a number of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming the mention's pre-candidate entity set; then calculating a first similarity between the mention and each pre-candidate entity in the set, and selecting a number of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming the mention's candidate entity set; the entity knowledge base stores two pieces of information per entity, an entity name and a description text, and the first similarity between a mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum disambiguation similarity over all its candidates exceeds a disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the knowledge base corresponding to that maximum; if not, the mention is an unlinkable entity; in either case the next mention is then judged in the same way, and step four follows once all mentions have been judged; the disambiguation similarity describes the similarity between an entity mention and a candidate entity;
step four, clustering all unlinkable entities into a number of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number.
Compared with the prior art, the invention has the following beneficial effects: applying entity retrieval and semantic-similarity-based candidate generation to entity disambiguation in Chinese text removes the dependence on hand-built alias tables and effectively improves entity recall in a complex Chinese text environment; generating and ranking candidates using only the entity names and description texts in the constructed knowledge base effectively improves link accuracy in the ranking stage; combining multiple similarity models, a two-tower similarity model and an interactive similarity model, exploits mention and entity information even without external statistical knowledge; via knowledge distillation, the category information output by the mention recognition model guides candidate ranking and enables accurate linking; and the invention also disambiguates unlinkable entities in Chinese text to a certain degree.
Drawings
FIG. 1 is a flow chart of a method of entity disambiguation in complex Chinese text in accordance with the present invention.
FIG. 2 is a detailed flowchart of step two in FIG. 1.
FIG. 3 is a detailed calculation flowchart of the two-tower similarity model.
FIG. 4 is a detailed flowchart of calculating, in step three of FIG. 1, the disambiguation similarity between an entity mention A to be disambiguated and a candidate entity B in its candidate set.
FIG. 5 is a detailed calculation flowchart of the interactive similarity model.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides a method for entity disambiguation in complex Chinese texts, comprising:
step one, extracting all entity mentions to be disambiguated from a Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a number of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming the mention's pre-candidate entity set; then calculating a first similarity between the mention and each pre-candidate entity in the set, and selecting a number of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming the mention's candidate entity set; the entity knowledge base stores two pieces of information per entity, an entity name and a description text, and the first similarity between a mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum disambiguation similarity over all its candidates exceeds a disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the knowledge base corresponding to that maximum; if not, the mention is an unlinkable entity; in either case the next mention is then judged in the same way, and step four follows once all mentions have been judged; the disambiguation similarity describes the similarity between an entity mention and a candidate entity, and the disambiguation-similarity threshold can be set according to actual business needs;
step four, clustering all unlinkable entities into a number of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number.
Step one may further comprise:
constructing and training an entity mention recognition model, formed by adding a BiLSTM-CRF (Bidirectional Long Short-Term Memory - Conditional Random Field) structure on top of a public Chinese pre-trained model; its input is Chinese text and its output is the entity mentions recognized in that text, whose types may include characters, scenes, organizations, and the like.
The Chinese text to be disambiguated is thus input to the trained mention recognition model, whose output is all the entity mentions to be disambiguated extracted from that text.
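As a minimal illustration of the CRF decoding step that a BiLSTM-CRF tagger performs over the BiLSTM's per-token scores, the pure-Python sketch below runs Viterbi decoding over a toy BIO tag set (the tag names, score values, and function names are illustrative assumptions, not from the patent):

```python
import math

# Toy BIO scheme: PER marks a character-name mention.
TAGS = ["O", "B-PER", "I-PER"]

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions: list of per-token score dicts {tag: score} (from the BiLSTM).
    transitions: dict {(prev_tag, tag): score} (the CRF layer); a missing
    pair is treated as forbidden (-inf).
    """
    n = len(emissions)
    # best[t][tag] = (score, prev_tag) of the best path ending in `tag` at t
    best = [{tag: (emissions[0][tag], None) for tag in TAGS}]
    for t in range(1, n):
        layer = {}
        for tag in TAGS:
            score, prev = max(
                (best[t - 1][p][0]
                 + transitions.get((p, tag), -math.inf)
                 + emissions[t][tag], p)
                for p in TAGS
            )
            layer[tag] = (score, prev)
        best.append(layer)
    # Backtrack from the best final tag.
    tag = max(TAGS, key=lambda g: best[-1][g][0])
    path = [tag]
    for t in range(n - 1, 0, -1):
        tag = best[t][tag][1]
        path.append(tag)
    return path[::-1]
```

The CRF transition scores are what prevent invalid sequences such as an I-PER tag directly following O, which per-token classification alone cannot rule out.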
The invention can use a published Chinese knowledge base as the entity knowledge base.
As shown in FIG. 2, step two in FIG. 1 may further include:
step 21, searching the entity knowledge base with an entity retrieval technique, such as a k-nearest-neighbor algorithm, and selecting a number of pre-candidate entities for each entity mention to be disambiguated, forming each mention's pre-candidate entity set;
step 22, obtaining from the Chinese text a context text for each mention, which may consist of the mention itself, the m words to its left and right, and the Chinese text to be disambiguated, and obtaining a knowledge text for each of the mention's pre-candidate entities, which may consist of the pre-candidate entity's name and its description text in the knowledge base; the value of m may be set according to actual business needs;
for example, the context text of a mention may be formatted as: [CLS] mention [SEP] m words to the left [SEP] m words to the right [SEP] Chinese text to be disambiguated, and the knowledge text of a pre-candidate entity as: [CLS] pre-candidate entity name [SEP] description text, where [CLS] is a sentence-start identifier and [SEP] is a mid-sentence and sentence-end delimiter;
step 23, taking the context text of each mention and the knowledge text of each pre-candidate entity as a text pair and feeding it to the two-tower similarity model to obtain the pair's first similarity; the model computes a representation vector for each of the two texts and then the cosine similarity between them, which is the pair's first similarity;
step 24, sorting the first similarities between a mention's context text and the knowledge texts of all its pre-candidate entities in descending order, selecting the top-ranked entities as the mention's candidate entities, and forming the mention's candidate entity set from them.
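Steps 22-24 can be sketched in pure Python as follows, with toy vectors standing in for the model's representation vectors (`context_text` and `top_k_candidates` are hypothetical helper names introduced here for illustration):

```python
import math

def context_text(mention, left, right, doc):
    # [CLS] mention [SEP] left context [SEP] right context [SEP] full text
    return "[CLS]" + mention + "[SEP]" + left + "[SEP]" + right + "[SEP]" + doc

def cosine(u, v):
    # Cosine similarity between two representation vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_candidates(mention_vec, precandidates, k):
    """precandidates: list of (entity_name, representation_vector).
    Rank by first similarity (cosine) in descending order, keep top k."""
    ranked = sorted(precandidates,
                    key=lambda e: cosine(mention_vec, e[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

In the patent the vectors come from the two-tower model of step 23; here they are fixed toy inputs so the ranking logic can be seen in isolation.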
The two-tower similarity model comprises two identical tower networks and a similarity calculation network. Each tower network comprises an embedding layer and several layers of blocks, and each block comprises an encoding layer and a combination layer. As shown in FIG. 3, the model computes as follows:
step A1, the two texts of the input pair are fed to the embedding layers of the two tower networks respectively;
step A2, each tower processes its input text as follows: the embedding layer converts the text into a representation vector using a transformer-based deep language model such as BERT (Bidirectional Encoder Representations from Transformers), records it as the embedding-layer representation vector, and passes it to the encoding layer of the first block. The encoding layer of the first block receives that vector; the encoding layers of later blocks receive the vector passed from the previous block. Each encoding layer computes an output vector from its input, concatenates it with the embedding-layer representation vector to form the encoding-layer representation vector, and passes the result to the block's combination layer. The combination layer computes an output vector from the encoding-layer representation vector, concatenates it with the embedding-layer and encoding-layer representation vectors to form the combination-layer representation vector, and passes it to the encoding layer of the next block; the combination-layer representation vector of the last block is output to the similarity calculation network;
step A3, the similarity calculation network receives the combination-layer representation vectors from the two towers, maps them to the same dimension through an average pooling layer, and computes the cosine similarity between the two vectors, which is the first similarity output by the two-tower model.
The encoding layer and combination layer of each block may each consist of several feed-forward network (FFN) layers, where:
each FFN layer of the encoding layer receives the vector from the previous FFN layer and passes its output to the next; the first FFN layer receives the embedding-layer representation vector from the embedding layer, and the last FFN layer concatenates its output with the embedding-layer representation vector into the encoding-layer representation vector, which is passed to the block's combination layer;
each FFN layer of the combination layer likewise receives from the previous layer and passes to the next; the first FFN layer receives the encoding-layer representation vector from the block's encoding layer, and the last FFN layer concatenates its output with the embedding-layer and encoding-layer representation vectors into the combination-layer representation vector, which is passed to the next block's encoding layer.
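The block structure above, with its repeated concatenation of earlier representations, can be sketched as follows (pure Python; the toy affine-ReLU `ffn` is a stand-in for a trained feed-forward layer, and all names and parameter values are illustrative assumptions). Concatenation makes the representation grow at every block, which is visible in the vector lengths:

```python
def ffn(vec, weight=0.5, bias=0.1):
    # Toy stand-in for a feed-forward layer: elementwise affine + ReLU.
    return [max(0.0, weight * x + bias) for x in vec]

def block(input_vec, emb_vec):
    """One block of a tower: encoding layer then combination layer,
    each re-concatenating earlier representations (skip connections)."""
    enc_out = ffn(input_vec)
    enc_repr = enc_out + emb_vec                # encoding-layer representation
    comb_out = ffn(enc_repr)
    comb_repr = comb_out + emb_vec + enc_repr   # combination-layer representation
    return comb_repr

def tower(emb_vec, n_blocks=2):
    # The first block consumes the embedding-layer representation;
    # later blocks consume the previous block's combination-layer output.
    x = emb_vec
    for _ in range(n_blocks):
        x = block(x, emb_vec)
    return x
```

For an input of length L and embedding of length E, one block emits a vector of length 2L + 3E, so a 2-dimensional embedding grows to 10 values after one block and 26 after two; in the real model an average pooling layer then maps the two towers' outputs to a common dimension before the cosine similarity.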
As shown in FIG. 4, taking entity mention A to be disambiguated and candidate entity B in its candidate set as an example, the calculation in step three of FIG. 1 of the disambiguation similarity between each mention and each of its candidate entities may further include:
step 31, constructing a disambiguation-entity text set for mention A, which may include: the mention A itself and the context text of A;
step 32, constructing a candidate-entity text set for candidate entity B, which may include: the name of B and B's description text in the entity knowledge base;
step 33, constructing a similarity set for mention A; then selecting one text each from A's disambiguation-entity text set and B's candidate-entity text set to form several text pairs, feeding each pair to both the two-tower similarity model and the interactive similarity model to obtain its first and second similarities, and writing the first and second similarities of all pairs into A's similarity set; the interactive similarity model first computes the representation vectors of the two texts and their similarity matrix, then regenerates new representation vectors weighted by that matrix, and finally concatenates the two new vectors and passes them through a fully connected layer to obtain a final similarity score, the pair's second similarity;
for example, mention A's disambiguation-entity text set {mention, context} and candidate entity B's text set {entity name, description} may form 4 text pairs: (mention, entity name), (mention, description), (context, entity name), (context, description);
step 34, performing weighted voting over all similarity values in mention A's similarity set to obtain a final similarity value, namely the disambiguation similarity between mention A and candidate entity B.
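The weighted vote of step 34 and the threshold decision of step three can be sketched as follows (pure Python; a simple weighted average is one plausible reading of "weighted voting", and the function names and weights are illustrative assumptions):

```python
def disambiguation_similarity(sim_set, weights=None):
    """sim_set: the first and second similarities of all text pairs
    for one (mention, candidate) combination. Uniform weights by default."""
    if weights is None:
        weights = [1.0] * len(sim_set)
    return sum(w * s for w, s in zip(weights, sim_set)) / sum(weights)

def choose_link(cand_sims, threshold):
    """cand_sims: {candidate_name: disambiguation similarity}.
    Link to the best candidate if it clears the threshold, else the
    mention is treated as an unlinkable entity (None)."""
    best = max(cand_sims, key=cand_sims.get)
    return best if cand_sims[best] > threshold else None
```

With four text pairs scored by two models, `sim_set` would hold eight values per candidate; the threshold, as the patent notes, is tuned to business needs.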
As shown in FIG. 5, the interactive similarity model computes as follows:
step B1, converting the two texts of the input pair into representation vectors;
step B2, computing the similarity of the two representation vectors by dot product to obtain a similarity matrix;
step B3, performing interactive attention encoding on the two representation vectors and generating similarity-weighted representations from the matrix; concretely, each word of one representation vector serves as a query, the other representation vector serves as keys and values, and the similarity matrix supplies the attention weights, yielding two new representation vectors (query, key, and value being the standard roles in attention computation);
step B4, performing feature enhancement on the two vectors, that is, adding and subtracting them and concatenating the results, since addition and subtraction better highlight where the two inputs are similar or dissimilar, and then passing the concatenated vector through a fully connected layer to obtain the final similarity score, the second similarity output by the interactive similarity model.
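Steps B1-B4 can be sketched in pure Python as below, with lists of token vectors standing in for the text representations; the pooling step and the fixed-weight "fully connected layer" are simplifying assumptions made so the sketch is self-contained:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attend(A, B):
    """A, B: lists of token vectors. Each token of A is a query over B
    (the keys/values); its dot-product similarity row, softmax-normalized,
    is the attention weight (step B3)."""
    out = []
    dim = len(B[0])
    for a in A:
        sims = [sum(x * y for x, y in zip(a, b)) for b in B]   # step B2
        w = softmax(sims)
        out.append([sum(w[j] * B[j][d] for j in range(len(B)))
                    for d in range(dim)])
    return out

def interact_score(A, B):
    newA = cross_attend(A, B)   # A re-expressed in terms of B
    newB = cross_attend(B, A)   # B re-expressed in terms of A
    # Mean-pool each side to a single vector (simplification).
    pa = [sum(col) / len(newA) for col in zip(*newA)]
    pb = [sum(col) / len(newB) for col in zip(*newB)]
    # Step B4: feature enhancement by addition and subtraction, then concat.
    feats = ([x + y for x, y in zip(pa, pb)]
             + [x - y for x, y in zip(pa, pb)] + pa + pb)
    # Toy fully connected layer (uniform weights) + sigmoid -> score in (0, 1).
    z = sum(feats) / len(feats)
    return 1.0 / (1.0 + math.exp(-z))
```

The real model would learn the fully connected layer's weights end to end; the sketch only shows the data flow of similarity matrix, attention weighting, and feature enhancement.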
During training, knowledge distillation can be used: the entity mention recognition model serves as the teacher network and the entity linking model (comprising the two-tower and interactive similarity models) as the student network. The mention recognition model's output includes the mention's category information, which helps locate candidate entity categories and can guide the linking model's results. The final loss of the entity linking model is therefore its original loss (typically a cross-entropy loss) plus a knowledge-distillation loss, which combines the mention recognition model's loss (typically a cross-entropy loss) with a difference-measure function between the outputs of the mention recognition model and the entity linking model.
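One common way to realize such a combined objective is a hard-label cross-entropy term plus a weighted term that pulls the student toward the teacher's soft predictions; the sketch below (pure Python, with `alpha` as an assumed weighting hyperparameter not specified in the patent) shows that shape:

```python
import math

def cross_entropy(pred, target):
    # Standard cross entropy; eps guards against log(0).
    eps = 1e-12
    return -sum(t * math.log(p + eps) for p, t in zip(pred, target))

def distillation_loss(student_probs, gold, teacher_probs, alpha=0.5):
    """Total loss = student cross entropy on the gold labels
    + alpha * cross entropy against the teacher's soft labels
    (the difference-measure between teacher and student outputs)."""
    hard = cross_entropy(student_probs, gold)
    soft = cross_entropy(student_probs, teacher_probs)
    return hard + alpha * soft
```

When the teacher's distribution matches the gold labels exactly, the soft term reduces to the hard term, so disagreement between teacher and student is what the extra term penalizes.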
The above description covers only preferred embodiments of the invention and is not intended to limit its scope; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principle of the invention shall fall within its scope of protection.

Claims (7)

1. A method for disambiguating entities in complex Chinese text, comprising:
step one, extracting all entity mentions to be disambiguated from a Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a number of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming the mention's pre-candidate entity set; then calculating a first similarity between the mention and each pre-candidate entity in the set, and selecting a number of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming the mention's candidate entity set; wherein the entity knowledge base stores two pieces of information per entity, an entity name and a description text, and the first similarity between a mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum disambiguation similarity over all its candidates exceeds a disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the knowledge base corresponding to that maximum; if not, the mention is an unlinkable entity; in either case the next mention is then judged in the same way, and step four follows once all mentions have been judged; wherein the disambiguation similarity describes the similarity between an entity mention and a candidate entity;
step four, clustering all unlinkable entities into a number of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number.
2. The method of claim 1, wherein step one further comprises:
constructing and training an entity mention recognition model, wherein the model is formed by a Chinese pre-trained model together with a BiLSTM-CRF structure, its input is a Chinese text, its output is the entity mentions recognized from the input Chinese text, and the mention types comprise persons, scenes, and organizations;
thus, the Chinese text to be disambiguated is input into the trained entity mention recognition model, and the model outputs all the entity mentions to be disambiguated extracted from that text.
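The encoder plus BiLSTM-CRF model itself is not reproduced here; the sketch below only shows the final step the claim implies: decoding the model's per-character BIO tag sequence into typed entity mentions. The tag names and the function are illustrative assumptions.

```python
# Decode BIO tags (e.g. "B-PER", "I-PER", "O") emitted by a tagger such as the
# claimed BiLSTM-CRF into (mention_text, type) spans.
def decode_mentions(chars, tags):
    """chars: list of characters; tags: one BIO tag per character."""
    mentions, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:               # close the open mention span
                mentions.append(("".join(chars[start:i]), etype))
                start, etype = None, None
            if tag.startswith("B-"):            # open a new mention span
                start, etype = i, tag[2:]
    return mentions
```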
3. The method of claim 1, wherein step two further comprises:
step 21, searching the entity knowledge base with an entity retrieval technique and selecting a plurality of pre-candidate entities for each entity mention to be disambiguated, so as to form the pre-candidate entity set of that mention;
step 22, obtaining from the Chinese text to be disambiguated a context text for each entity mention to be disambiguated, the context text comprising the mention itself and the m words on each side of it in the Chinese text to be disambiguated, and at the same time obtaining a knowledge text for each pre-candidate entity of each mention, the knowledge text comprising the pre-candidate entity's name and its description text in the entity knowledge base;
step 23, taking the context text of each entity mention to be disambiguated and the knowledge text of each of its pre-candidate entities as a group of text pairs and inputting each pair into a dual-tower similarity calculation model to obtain the first similarity of each text pair, wherein the dual-tower similarity calculation model computes a representation vector for each of the two input texts and then the cosine similarity between the two representation vectors, this cosine similarity being the first similarity of the input text pair;
and step 24, sorting the first similarities between the context text of each entity mention to be disambiguated and the knowledge texts of all its pre-candidate entities in descending order, then selecting the top-ranked entities as the candidate entities of that mention, all the candidate entities forming the mention's candidate entity set.
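Steps 23 and 24 amount to ranking pre-candidates by cosine similarity and keeping the top k. A minimal sketch, assuming the context and knowledge texts have already been encoded into vectors (by any dual-tower encoder); the function name and `k` are illustrative:

```python
import numpy as np

def top_k_candidates(context_vec, knowledge_vecs, k=5):
    """Rank pre-candidates by the 'first similarity' (cosine similarity)
    between the mention's context vector and each pre-candidate's knowledge
    vector; keep the k best as the candidate set (claim 3, steps 23-24)."""
    c = context_vec / np.linalg.norm(context_vec)
    K = knowledge_vecs / np.linalg.norm(knowledge_vecs, axis=1, keepdims=True)
    sims = K @ c                       # cosine similarity to every pre-candidate
    order = np.argsort(-sims)[:k]      # descending sort, top-k indices
    return order.tolist(), sims[order].tolist()
```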
4. The method of claim 1, wherein in step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate entity set further comprises:
step 31, constructing a disambiguated-entity text set for the entity mention A to be disambiguated, the set comprising: the mention A itself and the context text of A;
step 32, constructing a candidate entity text set for the candidate entity B, the set comprising: the candidate entity B and the description text of B in the entity knowledge base;
step 33, constructing a similarity set for the entity mention A to be disambiguated, then selecting one text each from the disambiguated-entity text set of A and the candidate entity text set of B to form several groups of text pairs, then inputting each text pair into both a dual-tower similarity calculation model and an interactive similarity calculation model to obtain, respectively, the first similarity and the second similarity of each pair, and finally writing the first and second similarities of all the text pairs into the similarity set of A, wherein the dual-tower similarity calculation model computes a representation vector for each of the two input texts and then the cosine similarity between the two representation vectors, this cosine similarity being the first similarity of the input text pair, and the interactive similarity calculation model computes a representation vector for each of the two input texts and a similarity matrix between the two representation vectors, then generates new similarity-weighted representation vectors from the similarity matrix, and finally splices the two newly generated representation vectors together and obtains a final similarity score through a fully connected layer, this score being the second similarity of the input text pair;
and step 34, performing weighted voting over all the similarity values in the similarity set of the entity mention A to be disambiguated to obtain a final similarity value, namely the disambiguation similarity between the mention A and the candidate entity B.
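The weighted voting of step 34 can be sketched as a weighted average of the similarity set. The claim does not fix the weights, so the uniform default below is an assumption, as is the function name:

```python
# Step 34 sketch: combine the first/second similarities of all text pairs for
# one (mention A, candidate B) pair into a single disambiguation similarity.
def weighted_vote(similarity_set, weights=None):
    """similarity_set: list of first/second similarity values; weights: one
    per value (uniform if omitted, an assumption). Returns the weighted mean."""
    if weights is None:
        weights = [1.0] * len(similarity_set)
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, similarity_set)) / total
```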
5. The method according to claim 3 or 4, wherein the dual-tower similarity calculation model comprises two identical tower networks and a similarity calculation network, each tower network comprising an embedding layer and multiple layers of blocks, and each block further comprising an encoding layer and a combination layer, the calculation of the dual-tower similarity calculation model proceeding as follows:
step A1, inputting the two texts of the input text pair into the embedding layers of the two tower networks respectively;
step A2, each tower network processes its input text as follows: the embedding layer converts the input text into a representation vector, recorded as the embedding-layer representation vector, which is then input into the encoding layer of the first block; the encoding layer of the first block receives the embedding-layer representation vector from the embedding layer, while the encoding layers of the other blocks receive the input vector passed from the previous block; the encoding layer of each block computes an output vector from its received input vector, splices the computed output vector with the embedding-layer representation vector into an encoding-layer representation vector, and passes the spliced encoding-layer representation vector to the combination layer of the same block; the combination layer receives the encoding-layer representation vector from the encoding layer of its block, computes an output vector, splices the computed output vector with the received embedding-layer representation vector and encoding-layer representation vector into a combination-layer representation vector, and passes the spliced combination-layer representation vector to the encoding layer of the next block, the combination-layer representation vector of the last block being passed to the similarity calculation network;
and step A3, the similarity calculation network receives the combination-layer representation vectors passed from the two tower networks, maps the two combination-layer representation vectors into the same dimension through an average pooling layer, and then computes the cosine similarity between the two representation vectors, this cosine similarity being the first similarity output by the dual-tower similarity calculation model.
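A minimal one-block sketch of this tower flow, assuming the texts are already embedded as vectors. The ReLU activation, the single-block depth, and all dimensions are assumptions; with shared tower weights the two outputs already match in dimension, which stands in for the average-pooling step of A3:

```python
import numpy as np

def ffn(x, w):
    # one feed-forward layer with ReLU (activation is an assumption)
    return np.maximum(x @ w, 0.0)

def tower(embed, w_enc, w_comb):
    """One single-block tower per step A2: the encoding layer splices its
    output with the embedding-layer vector; the combination layer splices
    its output with both the embedding- and encoding-layer vectors."""
    enc = np.concatenate([ffn(embed, w_enc), embed])        # encoding-layer repr
    comb = np.concatenate([ffn(enc, w_comb), embed, enc])   # combination-layer repr
    return comb

def first_similarity(embed_a, embed_b, w_enc, w_comb):
    # step A3: cosine similarity between the two tower outputs
    a, b = tower(embed_a, w_enc, w_comb), tower(embed_b, w_enc, w_comb)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```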
6. The method of claim 5, wherein the encoding layer and the combination layer of each block of the dual-tower similarity calculation model each consist of a multi-layer feedforward neural network (FFN), wherein:
each FFN layer of the encoding layer receives the input vector passed from the previous FFN layer and outputs its computed output vector to the next FFN layer, the first FFN layer receiving the embedding-layer representation vector passed from the embedding layer, and the last FFN layer splicing its computed output vector with the embedding-layer representation vector into the encoding-layer representation vector, which is output to the combination layer of the same block,
and each FFN layer of the combination layer receives the input vector passed from the previous FFN layer and outputs its computed output vector to the next FFN layer, the first FFN layer receiving the encoding-layer representation vector passed from the encoding layer of the same block, and the last FFN layer splicing its computed output vector with the embedding-layer representation vector and the encoding-layer representation vector into the combination-layer representation vector, which is output to the encoding layer of the next block.
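The FFN chaining of claim 6 can be sketched as below: only the last FFN's output is spliced with the skip vectors. ReLU and the layer sizes are illustrative assumptions, not fixed by the claim:

```python
import numpy as np

def ffn_stack(x, weight_list):
    """A multi-layer FFN chain: each FFN feeds its output to the next."""
    h = x
    for w in weight_list:
        h = np.maximum(h @ w, 0.0)       # ReLU FFN (assumed activation)
    return h

def encoding_layer(embed_vec, weight_list):
    # per claim 6, the last FFN's output is spliced with the embedding-layer
    # vector to form the encoding-layer representation vector
    return np.concatenate([ffn_stack(embed_vec, weight_list), embed_vec])
```

The combination layer is the same pattern with one extra skip vector (the encoding-layer representation) in the final splice.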
7. The method according to claim 4, wherein the interactive similarity calculation model performs the following calculation:
step B1, converting each of the two texts in the input text pair into a representation vector;
step B2, calculating the similarity of the two representation vectors by dot product to obtain a similarity matrix;
step B3, performing interactive attention encoding on the two representation vectors and using the similarity matrix to generate similarity-weighted representation vectors, specifically: taking each word in one representation vector as a query and the other representation vector as the keys and values, then weighting with the similarity matrix of the two representation vectors as the attention weights, so as to obtain two new representation vectors;
and step B4, performing feature enhancement on the two representation vectors, namely splicing and combining them after addition and subtraction, and then passing the spliced vector through a fully connected layer to obtain the final similarity score, namely the second similarity output by the interactive similarity calculation model.
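Steps B1-B4 can be sketched on already-embedded token matrices as follows. The softmax normalisation of the attention weights, the mean pooling before the final layer, and the sigmoid output are assumptions not fixed by the claim; the splice `[a, b, a+b, a-b]` is one common reading of "splicing after adding and subtracting":

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def second_similarity(A, B, w_fc):
    """A: (m, d) token vectors of one text, B: (n, d) of the other;
    w_fc: weights of the final fully connected layer (shape (4*d,))."""
    S = A @ B.T                                  # B2: dot-product similarity matrix
    A_att = softmax(S, axis=1) @ B               # B3: each word of A attends to B
    B_att = softmax(S.T, axis=1) @ A             #     and each word of B attends to A
    a, b = A_att.mean(axis=0), B_att.mean(axis=0)   # pool to single vectors (assumed)
    feat = np.concatenate([a, b, a + b, a - b])  # B4: splice after add/subtract
    return float(1.0 / (1.0 + np.exp(-(feat @ w_fc))))  # FC layer + sigmoid score
```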
CN202110603755.0A 2021-05-31 2021-05-31 Entity disambiguation method in complex Chinese text Expired - Fee Related CN113283236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110603755.0A CN113283236B (en) 2021-05-31 2021-05-31 Entity disambiguation method in complex Chinese text


Publications (2)

Publication Number Publication Date
CN113283236A CN113283236A (en) 2021-08-20
CN113283236B true CN113283236B (en) 2022-07-19

Family

ID=77282713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110603755.0A Expired - Fee Related CN113283236B (en) 2021-05-31 2021-05-31 Entity disambiguation method in complex Chinese text

Country Status (1)

Country Link
CN (1) CN113283236B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947087B (en) * 2021-12-20 2022-04-15 太极计算机股份有限公司 Label-based relation construction method and device, electronic equipment and storage medium
CN114818736B (en) * 2022-05-31 2023-06-09 北京百度网讯科技有限公司 Text processing method, chain finger method and device for short text and storage medium
CN115600603B (en) * 2022-12-15 2023-04-07 南京邮电大学 Named entity disambiguation method for Chinese coronary heart disease diagnosis report
CN116306504B (en) * 2023-05-23 2023-08-08 匀熵智能科技(无锡)有限公司 Candidate entity generation method and device, storage medium and electronic equipment
CN118468854A (en) * 2024-07-09 2024-08-09 南京信息工程大学 Task demand analysis method, system and storage medium based on entity link

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN111581973A (en) * 2020-04-24 2020-08-25 中国科学院空天信息创新研究院 Entity disambiguation method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810375B2 (en) * 2018-07-08 2020-10-20 International Business Machines Corporation Automated entity disambiguation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Named Entity Disambiguation Combining Entity Linking and Entity Clustering; Tan Yongmei et al.; Journal of Beijing University of Posts and Telecommunications; 2014-10-31; full text *

Also Published As

Publication number Publication date
CN113283236A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN113283236B (en) Entity disambiguation method in complex Chinese text
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN112270193A (en) Chinese named entity identification method based on BERT-FLAT
CN107798140A (en) A kind of conversational system construction method, semantic controlled answer method and device
CN107818164A (en) A kind of intelligent answer method and its system
CN111931506A (en) Entity relationship extraction method based on graph information enhancement
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN112784878B (en) Intelligent correction method and system for Chinese treatises
CN113158671B (en) Open domain information extraction method combined with named entity identification
CN115310551A (en) Text analysis model training method and device, electronic equipment and storage medium
CN111666764A (en) XLNET-based automatic summarization method and device
CN116910272B (en) Academic knowledge graph completion method based on pre-training model T5
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN116662500A (en) Method for constructing question-answering system based on BERT model and external knowledge graph
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN114443846B (en) Classification method and device based on multi-level text different composition and electronic equipment
CN114020871B (en) Multi-mode social media emotion analysis method based on feature fusion
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN115510841A (en) Text matching method based on data enhancement and graph matching network
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN114972907A (en) Image semantic understanding and text generation based on reinforcement learning and contrast learning
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220719
