CN113283236A - Entity disambiguation method in complex Chinese text

Entity disambiguation method in complex Chinese text

Info

Publication number
CN113283236A
CN113283236A (application CN202110603755.0A)
Authority
CN
China
Prior art keywords
entity
similarity
disambiguated
layer
text
Prior art date
Legal status
Granted
Application number
CN202110603755.0A
Other languages
Chinese (zh)
Other versions
CN113283236B (en)
Inventor
王玉龙
王闯
刘同存
王纯
张乐剑
王晶
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202110603755.0A
Publication of CN113283236A
Application granted
Publication of CN113283236B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 40/279: Handling natural language data; natural language analysis; recognition of textual entities
    • G06F 40/216: Natural language analysis; parsing using statistical methods
    • G06F 40/30: Handling natural language data; semantic analysis
    • G06N 3/044: Neural network architectures; recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/084: Neural network learning methods; backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method of entity disambiguation in complex Chinese text, comprising: extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated; for each entity mention, selecting a plurality of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, then calculating a first similarity between the mention and each pre-candidate entity, and selecting a plurality of entities as candidate entities according to the first similarity; and calculating the disambiguation similarity between each entity mention and each of its candidate entities, and judging whether the maximum disambiguation similarity over all candidates exceeds a disambiguation-similarity threshold: if so, the mention is a linkable entity and is linked to the candidate entity in the knowledge base with the maximum disambiguation similarity. The invention belongs to the field of information technology; it effectively resolves entity ambiguity in complex Chinese text and improves entity recall and entity-linking accuracy.

Description

Entity disambiguation method in complex Chinese text
Technical Field
The invention relates to an entity disambiguation method for complex Chinese text, and belongs to the field of information technology.
Background
Natural language exhibits widespread ambiguity and non-standard usage: abbreviations, shorthand, and habitual usage mean that, in different contexts, the same word can carry different meanings and different words can carry the same meaning. Chinese is especially rich in semantics and forms of expression; texts of literary works such as novels in particular feature large casts of characters, many scenes, and intricate organizational structures. The person, scene, and organization entities in these novels therefore present many ambiguity problems, which pose a great challenge to downstream natural-language-processing tasks over novel text.
Although entity disambiguation has been studied extensively in the general domain, in complex Chinese texts such as novels the ambiguity problem is still resolved by inefficient manual processing, and a systematic solution is lacking. Conventional entity disambiguation comprises clustering-based methods and knowledge-base-based entity-linking methods. Clustering-based methods generally exploit only shallow semantic information, so entity linking is now the usual route to disambiguation. Entity linking consists of three parts: entity-mention recognition, candidate-entity generation, and candidate-entity ranking. It is constrained by the scale and update speed of the knowledge base, and traditional entity linking relies on hand-built alias tables when generating candidate entities, so entity recall is low in complex settings such as novel text. Meanwhile, entities from the novel domain in general-purpose knowledge bases usually lack effective, usable alias information and statistical information. In addition, niche novels or novels still being updated contain many unlinkable entities, which entity linking alone cannot disambiguate well.
Therefore, how to effectively resolve entity ambiguity in complex Chinese texts such as novels, while improving entity recall and entity-linking accuracy, has become a technical problem that practitioners urgently need to solve.
Disclosure of Invention
In view of the above, the present invention provides an entity disambiguation method for complex Chinese text that can effectively resolve entity ambiguity in this domain and improve entity recall and entity-linking accuracy.
To achieve this object, the present invention provides a method for disambiguating entities in complex Chinese text, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a plurality of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming that mention's pre-candidate entity set; then calculating a first similarity between each entity mention and each pre-candidate entity in its set, and selecting a plurality of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming that mention's candidate entity set; the entity knowledge base stores two pieces of information for each entity, an entity name and a description text, and the first similarity between an entity mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum of the disambiguation similarities between the mention and all its candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the entity knowledge base corresponding to the maximum disambiguation similarity, after which the next mention is judged in the same way; if not, the mention is an unlinkable entity, and the next mention is judged in the same way; once all mentions have been judged, proceed to step four; the disambiguation similarity describes how similar an entity mention is to a candidate entity;
and step four, clustering all unlinkable entities into a plurality of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number.
Compared with the prior art, the invention has the following beneficial effects: it applies an entity retrieval technique and a candidate-entity generation method based on semantic-similarity calculation to entity disambiguation in Chinese text, so that candidate generation no longer depends on manually built alias tables, which effectively improves entity recall in complex Chinese text environments; candidate generation and ranking use only the entity names and description texts in the constructed entity knowledge base, which effectively improves linking accuracy in the ranking stage; by adopting multiple similarity calculation models, namely a two-tower model and an interactive model, the method makes effective use of mention information and entity information even in the absence of external statistical knowledge; through knowledge distillation, the category information output by the mention-recognition model guides the ranking of candidate entities, enabling accurate linking of entity mentions; and the invention also achieves a degree of disambiguation for unlinkable entities in Chinese text.
Drawings
FIG. 1 is a flow chart of a method of entity disambiguation in complex Chinese text in accordance with the present invention.
FIG. 2 is a flowchart illustrating an embodiment of step two in FIG. 1.
FIG. 3 is a detailed calculation flowchart of the two-tower similarity calculation model.
FIG. 4 is a flowchart of step three in FIG. 1, illustrating the calculation of the disambiguation similarity between entity mention A to be disambiguated and candidate entity B in its candidate set.
FIG. 5 is a detailed calculation flowchart of the interactive similarity calculation model.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides a method for entity disambiguation in complex Chinese texts, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a plurality of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming that mention's pre-candidate entity set; then calculating a first similarity between each entity mention and each pre-candidate entity in its set, and selecting a plurality of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming that mention's candidate entity set; the entity knowledge base stores two pieces of information for each entity, an entity name and a description text, and the first similarity between an entity mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum of the disambiguation similarities between the mention and all its candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the entity knowledge base corresponding to the maximum disambiguation similarity, after which the next mention is judged in the same way; if not, the mention is an unlinkable entity, and the next mention is judged in the same way; once all mentions have been judged, proceed to step four; the disambiguation similarity describes how similar an entity mention is to a candidate entity, and the disambiguation-similarity threshold can be set according to actual business needs;
and step four, clustering all unlinkable entities into a plurality of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number (a minimal clustering sketch follows).
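The patent leaves the clustering algorithm of step four unspecified; as one possible realization, here is a minimal sketch that groups unlinkable mentions by the cosine distance of their context-text vectors. The encoder producing vecs, the linkage choice, and the distance threshold are all assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2
                                                     # (older versions name the
                                                     # `metric` parameter `affinity`)

def group_unlinkable(mentions, vecs, distance_threshold=0.5):
    """Cluster unlinkable mentions and label each with its group number."""
    labels = AgglomerativeClustering(
        n_clusters=None,                  # let the threshold pick the group count
        metric="cosine", linkage="average",
        distance_threshold=distance_threshold,
    ).fit_predict(np.asarray(vecs))
    return [(mention, int(group)) for mention, group in zip(mentions, labels)]
```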
Step one may further comprise:
constructing and training an entity-mention recognition model, which stacks a BiLSTM-CRF (Bidirectional Long Short-Term Memory - Conditional Random Field) structure on top of a public Chinese pre-trained language model; the model's input is a Chinese text and its output is the entity mentions recognized in that text, whose types may include person, scene, organization, and so on;
thus, the Chinese text to be disambiguated is input into the trained mention-recognition model, and the model's output is all the entity mentions to be disambiguated extracted from that text. A minimal sketch of such a model follows.
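As one possible realization, a sketch assuming the HuggingFace transformers package for the public Chinese pre-trained model and the pytorch-crf package for the CRF layer; the checkpoint name and hyper-parameters are illustrative, not specified by the patent.

```python
import torch.nn as nn
from transformers import BertModel   # assumed: HuggingFace transformers
from torchcrf import CRF             # assumed: pytorch-crf package

class MentionRecognizer(nn.Module):
    """Chinese pre-trained encoder -> BiLSTM -> CRF tagger over BIO-style
    tags for person / scene / organization mentions."""
    def __init__(self, num_tags, lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.lstm(h)
        emissions = self.emit(h)
        if tags is not None:  # training: CRF negative log-likelihood
            return -self.crf(emissions, tags, mask=attention_mask.bool())
        return self.crf.decode(emissions, mask=attention_mask.bool())
```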
The present invention can use a published Chinese knowledge base as the entity knowledge base.
As shown in FIG. 2, step two in FIG. 1 may further include the following steps (a minimal end-to-end sketch follows the list):
step 21, searching the entity knowledge base with an entity retrieval technique, such as a k-nearest-neighbor algorithm, and selecting a plurality of pre-candidate entities for each entity mention to be disambiguated, forming that mention's pre-candidate entity set;
step 22, obtaining the context text of each entity mention from the Chinese text to be disambiguated, where the context text may consist of the mention itself, the m words to its left and right, and the Chinese text to be disambiguated; and simultaneously obtaining the knowledge text of each pre-candidate entity of each mention, where the knowledge text may consist of the pre-candidate entity's name and its description text in the entity knowledge base; the value of m can be set according to actual business needs;
for example, the specific format of the context text of an entity mention may be: [CLS] entity mention [SEP] m words to the left of the mention [SEP] m words to the right of the mention [SEP] Chinese text to be disambiguated, and the specific format of the knowledge text of a pre-candidate entity may be: [CLS] pre-candidate entity name [SEP] description text, where [CLS] is the sentence-start token and [SEP] is the mid-sentence and sentence-end separator;
step 23, taking the context text of each entity mention and the knowledge text of each of its pre-candidate entities as a text pair and feeding each pair into a two-tower similarity calculation model to obtain its first similarity; the two-tower model computes a representation vector for each of the two input texts and then the cosine similarity between them, which is the first similarity of the input pair;
and step 24, sorting the first similarities between each mention's context text and the knowledge texts of all its pre-candidate entities in descending order, then selecting the several top-ranked entities as that mention's candidate entities, all candidate entities forming the mention's candidate entity set.
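A minimal sketch of steps 21 to 24 under stated assumptions: encode(text) is a hypothetical sentence encoder returning a NumPy vector, the trained two-tower model of step 23 is abstracted here as cosine similarity over those encoder vectors, and kb is a list of (entity name, description text) records whose name vectors have been precomputed into kb_name_vecs.

```python
import numpy as np

def generate_candidates(mention, left_ctx, right_ctx, doc, kb, kb_name_vecs,
                        encode, k_pre=50, k_cand=5):
    # step 21: k-nearest-neighbour retrieval of pre-candidates over the
    # entity-name vectors of the knowledge base
    q = encode(mention)
    sims = kb_name_vecs @ q / (
        np.linalg.norm(kb_name_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    pre = np.argsort(-sims)[:k_pre]
    # step 22: context text of the mention in the format described above
    ctx = f"[CLS] {mention} [SEP] {left_ctx} [SEP] {right_ctx} [SEP] {doc}"
    v_ctx = encode(ctx)
    # steps 23-24: first similarity for each (context, knowledge text) pair,
    # then keep the top-ranked pre-candidates as candidates
    scored = []
    for i in pre:
        name, desc = kb[i]
        v_know = encode(f"[CLS] {name} [SEP] {desc}")
        cos = float(v_ctx @ v_know /
                    (np.linalg.norm(v_ctx) * np.linalg.norm(v_know) + 1e-9))
        scored.append((cos, int(i)))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k_cand]]   # indices into kb
```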
The two-tower similarity calculation model comprises two identical tower networks and a similarity calculation network; each tower network comprises an embedding layer and multiple layers of blocks, and each block in turn comprises an encoding layer and a combination layer. As shown in FIG. 3, the two-tower model computes as follows:
step A1, the two texts of the input pair are fed into the embedding layers of the two tower networks respectively;
step A2, each tower network processes its input text as follows: the embedding layer converts the input text into a representation vector using a Transformer-based deep language model such as BERT (Bidirectional Encoder Representations from Transformers); this vector, recorded as the embedding-layer representation vector, is passed to the encoding layer of the first block. The encoding layer of the first block receives the embedding-layer representation vector, while the encoding layers of the other blocks receive the vector passed up from the previous block; each block's encoding layer computes an output vector from its input, concatenates that output with the embedding-layer representation vector into an encoding-layer representation vector, and passes it to the block's combination layer. The combination layer computes an output vector from the encoding-layer representation vector it receives, concatenates that output with the embedding-layer representation vector and the encoding-layer representation vector into a combination-layer representation vector, and passes it to the encoding layer of the next block; the combination-layer representation vector of the last block is output to the similarity calculation network;
step A3, the similarity calculation network receives the combination-layer representation vectors from the two tower networks, maps them to the same dimension through an average-pooling layer, and computes the cosine similarity between the two vectors; this cosine similarity is the first similarity output by the two-tower model.
The encoding layer and the combination layer of each block of the two-tower similarity calculation model may each consist of a multi-layer feed-forward network (FFN), where:
each FFN layer of the encoding layer receives the vector passed from the previous FFN layer and passes its computed output to the next FFN layer; the first FFN layer receives the embedding-layer representation vector from the embedding layer, and the last FFN layer concatenates its output with the embedding-layer representation vector into the encoding-layer representation vector, which is output to the block's combination layer; and
each FFN layer of the combination layer likewise receives the vector from the previous FFN layer and passes its output to the next; the first FFN layer receives the encoding-layer representation vector from the block's encoding layer, and the last FFN layer concatenates its output with the received embedding-layer representation vector and encoding-layer representation vector into the combination-layer representation vector, which is output to the encoding layer of the next block. A sketch of the whole model follows.
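A minimal PyTorch sketch of the two-tower model as described: each block concatenates its FFN outputs with the embedding-layer representation exactly as above, and the towers take the embedding-layer vector (e.g. a BERT sentence vector computed upstream) as input. The dimensions and the number of blocks are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """One block: an encoding layer and a combination layer, each a small FFN."""
    def __init__(self, in_dim, emb_dim, hidden):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden))
        # the combination layer consumes the encoding-layer representation
        self.combine = nn.Sequential(nn.Linear(hidden + emb_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        # output: [combine_out ; emb ; encoding-layer representation]
        self.out_dim = hidden + emb_dim + (hidden + emb_dim)

    def forward(self, x, emb):
        enc_repr = torch.cat([self.encode(x), emb], dim=-1)  # encoding-layer repr
        comb_out = self.combine(enc_repr)
        return torch.cat([comb_out, emb, enc_repr], dim=-1)  # combination-layer repr

class Tower(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, n_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_dim = emb_dim
        for _ in range(n_blocks):
            blk = Block(in_dim, emb_dim, hidden)
            self.blocks.append(blk)
            in_dim = blk.out_dim

    def forward(self, emb):                  # emb: (batch, emb_dim) text vector
        x = emb
        for blk in self.blocks:
            x = blk(x, emb)
        return x

class TwoTowerSimilarity(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, n_blocks=2, pooled_dim=128):
        super().__init__()
        self.tower_a = Tower(emb_dim, hidden, n_blocks)
        self.tower_b = Tower(emb_dim, hidden, n_blocks)
        self.pooled_dim = pooled_dim

    def forward(self, emb_a, emb_b):
        ra, rb = self.tower_a(emb_a), self.tower_b(emb_b)
        # average pooling maps both representations to the same dimension
        pa = F.adaptive_avg_pool1d(ra.unsqueeze(1), self.pooled_dim).squeeze(1)
        pb = F.adaptive_avg_pool1d(rb.unsqueeze(1), self.pooled_dim).squeeze(1)
        return F.cosine_similarity(pa, pb, dim=-1)   # the first similarity
```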
As shown in FIG. 4, taking entity mention A to be disambiguated and candidate entity B in its candidate set as an example, the calculation in step three of FIG. 1 of the disambiguation similarity between each entity mention and each candidate entity in its candidate set may further comprise the following steps (a sketch follows the list):
step 31, constructing a disambiguation-entity text set for mention A, which may include: mention A itself, and the context text of mention A;
step 32, constructing a candidate-entity text set for candidate entity B, which may include: the entity name of B and the description text of B in the entity knowledge base;
step 33, constructing a similarity set for mention A; then selecting one text from the disambiguation-entity text set of mention A and one from the candidate-entity text set of candidate entity B to form several text pairs, feeding each pair into both the two-tower similarity calculation model and the interactive similarity calculation model to obtain the first similarity and the second similarity of each pair, and finally writing the first and second similarities of all pairs into the similarity set of mention A; the interactive model first computes the representation vectors of the two input texts and the similarity matrix between them, then regenerates new representation vectors weighted by the similarity matrix, and finally concatenates the two new vectors and passes them through a fully connected layer to obtain a final similarity score, which is the second similarity of the input pair;
for example, the disambiguation-entity text set of mention A, {mention, context}, and the candidate-entity text set of candidate entity B, {entity name, description text}, form 4 text pairs: (mention, entity name), (mention, description text), (context, entity name), (context, description text);
and step 34, performing weighted voting over all similarity values in the similarity set of mention A to obtain a final similarity value, which is the disambiguation similarity between mention A and candidate entity B.
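A sketch of steps 31 to 34 together with the linking decision of step three, assuming callables first_sim and second_sim for the two trained models; the voting weights w1 and w2 and the threshold value are illustrative, since the patent leaves both to business needs.

```python
def disambiguation_similarity(mention_A, context_A, name_B, desc_B,
                              first_sim, second_sim, w1=0.5, w2=0.5):
    """Steps 31-34: score the four text pairs with both models and combine
    the resulting similarity set by a weighted vote (here a weighted mean)."""
    pairs = [(mention_A, name_B), (mention_A, desc_B),
             (context_A, name_B), (context_A, desc_B)]
    votes = [w1 * first_sim(a, b) + w2 * second_sim(a, b) for a, b in pairs]
    return sum(votes) / len(votes)

def link_mention(mention_A, context_A, candidates, first_sim, second_sim,
                 threshold=0.7):
    """Step three: link to the best-scoring candidate if it clears the
    threshold, otherwise report the mention as unlinkable (None)."""
    scored = [(disambiguation_similarity(mention_A, context_A, c["name"],
                                         c["desc"], first_sim, second_sim), c)
              for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > threshold else None
```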
As shown in FIG. 5, the interactive similarity calculation model computes as follows (a sketch follows the list):
step B1, the two texts of the input pair are converted into representation vectors;
step B2, the similarity of the two representation vectors is computed as a dot product, yielding a similarity matrix;
step B3, interactive attention encoding is applied to the two representation vectors, and similarity-weighted representation vectors are generated using the similarity matrix; specifically, each word of one representation serves as the query, the other representation serves as the keys and values, and the similarity matrix of the two representations serves as the attention weight, producing two new representation vectors (query, key, and value being the standard roles in attention calculation);
and step B4, feature enhancement is applied to the two representation vectors: they are added and subtracted and the results are concatenated, since addition and subtraction focus attention on the similar and dissimilar parts of the two inputs; the concatenated vector is then passed through a fully connected layer to obtain a final similarity score, which is the second similarity output by the interactive similarity calculation model.
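A minimal sketch of steps B1 to B4 over token-level representation vectors of shape (batch, length, dimension); mean-pooling the attended token vectors into a single vector before the fully connected layer is an assumption, as the patent does not specify the pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractiveSimilarity(nn.Module):
    """Steps B1-B4 over token-level representations of the two texts."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        # fully connected layer over [u + v ; u - v]
        self.fc = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, 1))

    def forward(self, a, b):                  # a: (B, La, D), b: (B, Lb, D)
        # B2: dot-product similarity matrix of the two representations
        sim = torch.bmm(a, b.transpose(1, 2))                         # (B, La, Lb)
        # B3: interactive attention -- each word of one text queries the
        # other, with the similarity matrix as the attention weight
        a_new = torch.bmm(F.softmax(sim, dim=-1), b)                  # (B, La, D)
        b_new = torch.bmm(F.softmax(sim.transpose(1, 2), dim=-1), a)  # (B, Lb, D)
        u, v = a_new.mean(dim=1), b_new.mean(dim=1)   # pool tokens to vectors
        # B4: feature enhancement -- add and subtract, then concatenate
        feat = torch.cat([u + v, u - v], dim=-1)
        return torch.sigmoid(self.fc(feat)).squeeze(-1)  # the second similarity
```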
During training, the idea of knowledge distillation can be used, with the entity-mention recognition model as the teacher network and the entity-linking model (comprising the two-tower and interactive similarity calculation models) as the student network. The output of the mention-recognition model includes the category of each mention, which helps locate the category of candidate entities and can guide the linking result of the entity-linking model. The loss function of the final entity-linking model is therefore the original loss of the entity-linking model (generally a cross-entropy loss) plus a knowledge-distillation loss, where the distillation loss combines the loss of the mention-recognition model (generally a cross-entropy loss) with a divergence measure between the mention-recognition model and the entity-linking model. A sketch of one such combined loss follows.
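One possible formulation of this combined loss, assuming KL divergence as the divergence measure between the teacher's (mention-recognition) and the student's (entity-linking) category outputs; alpha and temperature are illustrative hyper-parameters not fixed by the patent.

```python
import torch.nn.functional as F

def entity_link_loss(link_logits, link_labels,
                     teacher_cat_logits, student_cat_logits,
                     alpha=0.5, temperature=2.0):
    """Original entity-linking loss (cross-entropy) plus a distillation term
    pulling the student's category distribution toward the teacher's."""
    ce = F.cross_entropy(link_logits, link_labels)
    t = temperature
    kd = F.kl_div(F.log_softmax(student_cat_logits / t, dim=-1),
                  F.softmax(teacher_cat_logits / t, dim=-1),
                  reduction="batchmean") * (t * t)
    return ce + alpha * kd
```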
The above describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (7)

1. A method for disambiguating entities in complex Chinese text, comprising:
step one, extracting all entity mentions to be disambiguated from the Chinese text to be disambiguated;
step two, for each entity mention to be disambiguated, selecting a plurality of entities from an entity knowledge base as pre-candidate entities using an entity retrieval technique, all pre-candidate entities forming that mention's pre-candidate entity set; then calculating a first similarity between each entity mention and each pre-candidate entity in its set, and selecting a plurality of pre-candidate entities as candidate entities according to the first similarity, all candidate entities forming that mention's candidate entity set; the entity knowledge base stores two pieces of information for each entity, an entity name and a description text, and the first similarity between an entity mention and a pre-candidate entity is the cosine similarity between their representation vectors;
step three, calculating the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, and judging whether the maximum of the disambiguation similarities between the mention and all its candidates exceeds the disambiguation-similarity threshold; if so, the mention is a linkable entity and is linked to the candidate entity in the entity knowledge base corresponding to the maximum disambiguation similarity, after which the next mention is judged in the same way; if not, the mention is an unlinkable entity, and the next mention is judged in the same way; once all mentions have been judged, proceed to step four; the disambiguation similarity describes how similar an entity mention is to a candidate entity;
and step four, clustering all unlinkable entities into a plurality of groups, assigning each group a serial number, and labeling each unlinkable entity with its group number.
2. The method of claim 1, wherein step one further comprises:
an entity-mention recognition model is constructed and trained, the model consisting of a Chinese pre-trained model with a BiLSTM-CRF structure; its input is a Chinese text and its output is the entity mentions recognized from the input text, the mention types including person, scene, and organization,
such that the Chinese text to be disambiguated is input into the trained mention-recognition model, whose output is all the entity mentions to be disambiguated extracted from that text.
3. The method of claim 1, wherein step two further comprises:
step 21, searching the entity knowledge base with an entity retrieval technique and selecting a plurality of pre-candidate entities for each entity mention to be disambiguated, forming that mention's pre-candidate entity set;
step 22, obtaining the context text of each entity mention from the Chinese text to be disambiguated, the context text comprising the mention itself, the m words to its left and right, and the Chinese text to be disambiguated; and simultaneously obtaining the knowledge text of each pre-candidate entity of each mention, the knowledge text comprising the pre-candidate entity's name and its description text in the entity knowledge base;
step 23, taking the context text of each entity mention and the knowledge text of each of its pre-candidate entities as a text pair and feeding each pair into a two-tower similarity calculation model to obtain its first similarity; the two-tower model computes a representation vector for each of the two input texts and then the cosine similarity between them, which is the first similarity of the input pair;
and step 24, sorting the first similarities between each mention's context text and the knowledge texts of all its pre-candidate entities in descending order, then selecting the several top-ranked entities as that mention's candidate entities, all candidate entities forming the mention's candidate entity set.
4. The method of claim 1, wherein in step three, the calculation of the disambiguation similarity between each entity mention to be disambiguated and each candidate entity in its candidate set, taking mention A and candidate entity B in its candidate set as an example, further comprises:
step 31, constructing a disambiguation-entity text set for mention A, the disambiguation-entity text set comprising: mention A itself, and the context text of mention A;
step 32, constructing a candidate-entity text set for candidate entity B, the candidate-entity text set comprising: the entity name of B and the description text of B in the entity knowledge base;
step 33, constructing a similarity set for mention A; then selecting one text from the disambiguation-entity text set of mention A and one from the candidate-entity text set of candidate entity B to form several text pairs, feeding each pair into both the two-tower similarity calculation model and the interactive similarity calculation model to obtain the first similarity and the second similarity of each pair, and finally writing the first and second similarities of all pairs into the similarity set of mention A; wherein the two-tower model first computes the representation vectors of the two input texts and then the cosine similarity between the two representation vectors, that cosine similarity being the first similarity of the input pair, and the interactive model first computes the representation vectors of the two input texts and the similarity matrix of the two representation vectors, then regenerates new similarity-weighted representation vectors from the similarity matrix, and finally concatenates the two new vectors and passes them through a fully connected layer to obtain a final similarity score, that score being the second similarity of the input pair;
and step 34, performing weighted voting over all similarity values in the similarity set of mention A to obtain a final similarity value, which is the disambiguation similarity between mention A and candidate entity B.
5. The method according to claim 3 or 4, wherein the two-tower similarity calculation model comprises two identical tower networks and a similarity calculation network, each tower network comprising an embedding layer and multiple layers of blocks, and each block further comprising an encoding layer and a combination layer; the two-tower model computes as follows:
step A1, the two texts of the input pair are fed into the embedding layers of the two tower networks respectively;
step A2, each tower network processes its input text as follows: the embedding layer converts the input text into a representation vector, recorded as the embedding-layer representation vector, and passes it to the encoding layer of the first block; the encoding layer of the first block receives the embedding-layer representation vector, while the encoding layers of the other blocks receive the vector passed up from the previous block; each block's encoding layer computes an output vector from its input, concatenates that output with the embedding-layer representation vector into an encoding-layer representation vector, and passes it to the block's combination layer; the combination layer computes an output vector from the encoding-layer representation vector it receives, concatenates that output with the embedding-layer representation vector and the encoding-layer representation vector into a combination-layer representation vector, and passes it to the encoding layer of the next block; the combination-layer representation vector of the last block is output to the similarity calculation network;
step A3, the similarity calculation network receives the combination-layer representation vectors from the two tower networks, maps them to the same dimension through an average-pooling layer, and computes the cosine similarity between the two vectors; this cosine similarity is the first similarity output by the two-tower model.
6. The method of claim 5, wherein the encoding layer and the combination layer of each block of the two-tower similarity calculation model each consist of a multi-layer feed-forward network (FFN), wherein:
each FFN layer of the encoding layer receives the vector passed from the previous FFN layer and passes its computed output to the next FFN layer; the first FFN layer receives the embedding-layer representation vector from the embedding layer, and the last FFN layer concatenates its output with the embedding-layer representation vector into the encoding-layer representation vector, which is output to the block's combination layer,
and each FFN layer of the combination layer likewise receives the vector from the previous FFN layer and passes its output to the next; the first FFN layer receives the encoding-layer representation vector from the block's encoding layer, and the last FFN layer concatenates its output with the received embedding-layer representation vector and encoding-layer representation vector into the combination-layer representation vector, which is output to the encoding layer of the next block.
7. The method of claim 4, wherein the interactive similarity calculation model computes as follows:
step B1, the two texts of the input pair are converted into representation vectors;
step B2, the similarity of the two representation vectors is computed as a dot product, yielding a similarity matrix;
step B3, interactive attention encoding is applied to the two representation vectors, and similarity-weighted representation vectors are generated using the similarity matrix, specifically: each word of one representation serves as the query, the other representation serves as the keys and values, and the similarity matrix of the two representations serves as the attention weight, producing two new representation vectors;
and step B4, feature enhancement is applied to the two representation vectors: they are added and subtracted and the results are concatenated, and the concatenated vector is then passed through a fully connected layer to obtain a final similarity score, which is the second similarity output by the interactive similarity calculation model.
CN202110603755.0A, priority date 2021-05-31, filing date 2021-05-31: Entity disambiguation method in complex Chinese text. Granted as CN113283236B (active).

Priority Applications (1)

Application Number: CN202110603755.0A (granted as CN113283236B); Priority Date: 2021-05-31; Filing Date: 2021-05-31; Title: Entity disambiguation method in complex Chinese text

Applications Claiming Priority (1)

Application Number: CN202110603755.0A (granted as CN113283236B); Priority Date: 2021-05-31; Filing Date: 2021-05-31; Title: Entity disambiguation method in complex Chinese text

Publications (2)

CN113283236A, published 2021-08-20
CN113283236B, published 2022-07-19

Family

Family ID: 77282713

Family Applications (1)

Application Number: CN202110603755.0A (granted as CN113283236B, active); Priority Date: 2021-05-31; Filing Date: 2021-05-31; Title: Entity disambiguation method in complex Chinese text

Country Status (1)

Country Link
CN (1) CN113283236B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107102989A (en) * 2017-05-24 2017-08-29 南京大学 A kind of entity disambiguation method based on term vector, convolutional neural networks
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
US20200012719A1 (en) * 2018-07-08 2020-01-09 International Business Machines Corporation Automated entity disambiguation
CN111581973A (en) * 2020-04-24 2020-08-25 中国科学院空天信息创新研究院 Entity disambiguation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭咏梅 et al., "Named entity disambiguation combining entity linking and entity clustering", Journal of Beijing University of Posts and Telecommunications (《北京邮电大学学报》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947087A (en) * 2021-12-20 2022-01-18 太极计算机股份有限公司 Label-based relation construction method and device, electronic equipment and storage medium
CN114818736A (en) * 2022-05-31 2022-07-29 北京百度网讯科技有限公司 Text processing method, chain finger method and device for short text and storage medium
CN115600603A (en) * 2022-12-15 2023-01-13 南京邮电大学(Cn) Named entity disambiguation method for Chinese coronary heart disease diagnosis report
CN115600603B (en) * 2022-12-15 2023-04-07 南京邮电大学 Named entity disambiguation method for Chinese coronary heart disease diagnosis report
CN116306504A (en) * 2023-05-23 2023-06-23 匀熵智能科技(无锡)有限公司 Candidate entity generation method and device, storage medium and electronic equipment
CN116306504B (en) * 2023-05-23 2023-08-08 匀熵智能科技(无锡)有限公司 Candidate entity generation method and device, storage medium and electronic equipment

Also Published As

CN113283236B (en), published 2022-07-19

Similar Documents

Publication Publication Date Title
CN113283236B (en) Entity disambiguation method in complex Chinese text
CN109918666B (en) Chinese punctuation mark adding method based on neural network
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN111209401A (en) System and method for classifying and processing sentiment polarity of online public opinion text information
CN107818164A (en) A kind of intelligent answer method and its system
CN107798140A (en) A kind of conversational system construction method, semantic controlled answer method and device
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111931506A (en) Entity relationship extraction method based on graph information enhancement
CN110532395B (en) Semantic embedding-based word vector improvement model establishing method
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN113158671B (en) Open domain information extraction method combined with named entity identification
CN111666764A (en) XLNET-based automatic summarization method and device
CN111061951A (en) Recommendation model based on double-layer self-attention comment modeling
CN112163607A (en) Network social media emotion classification method based on multi-dimension and multi-level combined modeling
CN113012822A (en) Medical question-answering system based on generating type dialogue technology
CN113704434A (en) Knowledge base question and answer method, electronic equipment and readable storage medium
CN112328773A (en) Knowledge graph-based question and answer implementation method and system
CN114064901B (en) Book comment text classification method based on knowledge graph word meaning disambiguation
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN115310551A (en) Text analysis model training method and device, electronic equipment and storage medium
CN113326367B (en) Task type dialogue method and system based on end-to-end text generation
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN116910272B (en) Academic knowledge graph completion method based on pre-training model T5

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant