CN113626613B - Entity linking method based on knowledge graph sub-graph information and entity information - Google Patents

Info

Publication number
CN113626613B
CN113626613B (application CN202110951011.8A)
Authority
CN
China
Prior art keywords
entity
candidate
information
graph
mentioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110951011.8A
Other languages
Chinese (zh)
Other versions
CN113626613A (en)
Inventor
唐蕾
徐有恒
何展鹏
毛婷婷
张武军
任斌
邓蓉蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Pharmaceutical Information Technology Co ltd
First Affiliated Hospital of Sun Yat Sen University
Original Assignee
Guangzhou Pharmaceutical Information Technology Co ltd
First Affiliated Hospital of Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Pharmaceutical Information Technology Co ltd, First Affiliated Hospital of Sun Yat Sen University filed Critical Guangzhou Pharmaceutical Information Technology Co ltd
Priority to CN202110951011.8A priority Critical patent/CN113626613B/en
Publication of CN113626613A publication Critical patent/CN113626613A/en
Application granted granted Critical
Publication of CN113626613B publication Critical patent/CN113626613B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an entity linking method that fuses knowledge graph subgraph information and entity information, comprising the following steps: obtaining a question text, feeding it into a trained mention recognition model, and recognizing the predicted mention; recalling the predicted mention to obtain a more accurate target mention; obtaining from a knowledge base all candidate entities corresponding to the target mention; retrieving the knowledge graph subgraph information of each candidate entity; fusing the subgraph information into the candidate entity to obtain a candidate entity text, introducing entity boundary labels into the question text and the candidate entity text, and constructing an entity disambiguation dataset; and feeding the dataset into a trained entity disambiguation model, pooling to obtain the knowledge base entity corresponding to the mention, and completing the entity linking. The method effectively improves the recall of mention recognition, improves the ranking accuracy of candidate entities, reduces error propagation, and effectively handles questions that require reasoning as well as entities absent from the training set.

Description

Entity linking method based on knowledge graph sub-graph information and entity information
Technical Field
The invention relates to the field of natural language processing within artificial intelligence, and in particular to an entity linking method based on knowledge graph subgraph information and entity information.
Background
Knowledge Base Question Answering (KBQA) takes a natural language question, performs semantic understanding and analysis on it, and then queries and reasons over a knowledge base to obtain the answer. Existing approaches fall into two categories: semantic parsing (SP) based methods and information extraction (IE) based methods. Information extraction methods have a relatively fixed model framework and comparatively good generalization. They generally adopt a pipeline structure: first, (1) entity linking, i.e., linking the mention in the question to an entity in the knowledge base; then, (2) relation recognition, i.e., mapping the intent of the question to a relation of that entity in the knowledge base. The predicted entity and relation are then used to search the knowledge base for the answer to the question.
However, a common problem of the entity linking task is that a question provides very little context while the boundaries of Chinese named entities are unclear, so the accuracy of linking question mentions to knowledge base entities is low.
Disclosure of Invention
In view of the above, to address the limited context of question text and the fuzzy boundaries of Chinese named entities, the invention provides an entity linking method based on knowledge graph subgraph information and entity information, which solves the problems noted in the background by improving the performance of the KBQA entity linking task and thereby further improving the performance of the whole knowledge base question answering task.
The invention is realized by adopting the following technical scheme:
an entity linking method based on knowledge graph sub-graph information and entity information, comprising the following steps:
obtaining a question text, feeding it into a trained mention recognition model, and recognizing the predicted mention in the question text;
recalling the predicted mention to obtain the target mention according to a recall algorithm, obtaining all candidate entities of the target mention from the mention-to-entity mapping file (mention2id) of the knowledge base, and retrieving each candidate entity from the knowledge base to obtain the knowledge graph subgraph information of all candidate entities;
fusing the knowledge graph subgraph information into the candidate entity to obtain a candidate entity text, and introducing label pairs, i.e., entity boundary information, into the question text containing the mention and the candidate entity text containing the candidate entity, to construct an entity disambiguation dataset;
and feeding the entity disambiguation dataset into a trained entity disambiguation model, pooling to obtain the scores of the different candidate entities corresponding to the mention, and taking the highest-scoring candidate as the target entity of the mention to complete the entity linking.
As a further aspect of the present invention, the method for training the mention recognition model includes:
constructing a mention recognition pre-training model;
acquiring training data, sequence-labeling the texts in the training data to mark the named entities, and obtaining a training set;
and inputting the labeled training set into the mention recognition pre-training model for training to obtain the corresponding named entities, thereby obtaining the mention recognition model.
As a further aspect of the invention, the pre-training model is a BERT model with a CRF layer added on top; the method for labeling the texts in the training data comprises:
obtaining the sentences of the texts in the training data;
labeling each Chinese character in the sentence;
assigning different labels to the first character of a named entity in the sentence, the other characters inside the named entity, and the non-named-entity characters of the sentence, where the named entity is the mention.
As a further aspect of the present invention, the method for recalling a mention comprises:
acquiring the predicted mention identified by the mention recognition model and the mapping file between mentions and knowledge base entities;
obtaining the list of all mentions of the knowledge graph from the mapping file, traversing the items of that list, and extracting the items contained in the predicted mention to form a cut list;
taking the longest item in the cut list as the recalled mention, and building a recall list from the recalled mention and the predicted mention;
traversing each item in the recall list, retrieving the items of the full mention list that contain an item of the recall list, extracting them to build an extended list, and merging the extended list with the recall list to obtain the final entity recall list.
As a further aspect of the present invention, the method for constructing the candidate entity text includes:
retrieving the knowledge graph, which is stored as triples of head node, relation, and tail node;
according to the triple form of the knowledge graph, taking out all edges and tail nodes connected with the head node as knowledge graph subgraph information, obtaining all the knowledge graph subgraph information of the candidate entity;
and splicing the knowledge graph subgraph information into the context of the candidate entity to obtain the candidate entity text.
Further, the label pairs identify the boundary information of the mention and the candidate entity for the entity disambiguation model; label pairs are inserted before and after the mention in the question text and before and after the candidate entity in the candidate entity text, respectively.
As a further aspect of the invention, each example of the entity disambiguation dataset is constructed as: [CLS] + question text + [SEP] + candidate entity text + [SEP] + sentence-pair classification label.
As a further aspect of the present invention, the method of training the entity disambiguation model comprises:
building a BERT-based pre-training model for entity disambiguation as a BERT sentence-pair classification task;
and inputting the entity disambiguation dataset into the entity disambiguation pre-training model for training to obtain the corresponding knowledge base entities, thereby obtaining the entity disambiguation model.
As a further aspect of the present invention, the pooling of the entity disambiguation model employs an entity-level max pooling strategy, and the knowledge base entity corresponding to the mention is obtained by pooling as follows:
obtaining, according to the entity-level max pooling strategy, the max-pooled representation vector of the mention segment and the max-pooled representation vector of the candidate entity segment respectively;
concatenating the representation vector of [CLS], the max-pooled representation vector of the mention segment of the question text, and the max-pooled representation vector of the candidate entity segment of the candidate text, feeding the concatenation into a feed-forward neural network, and computing the score that the mention in the question text points to the knowledge base entity;
and comparing the computed scores, taking the candidate entity with the highest score as the target entity, thus obtaining the knowledge base entity corresponding to the mention.
Further, the score that the mention in the question text points to the knowledge base entity is computed with a Softmax function.
In the entity linking method based on knowledge graph subgraph information and entity information, modeling is based on the Chinese pre-trained language model BERT: a mention recognition model is built as a BERT+CRF sequence labeling task, and an entity disambiguation model is built as a BERT sentence-pair classification task. By recognizing the mentions in the question text and adding a recall step, the recall of mention recognition is effectively improved, the ranking accuracy of candidate entities is improved, and error propagation is reduced. Fusing knowledge graph information and entity boundary information into the entity disambiguation task effectively handles questions that require reasoning as well as the heavy character overlap between candidate entities and mentions, while the knowledge noise problem is mitigated to a certain extent by a dedicated pooling strategy. Since the whole entity linking pipeline is built on a pre-trained language model, fusing in knowledge graph knowledge effectively handles reasoning cases, and the model's linguistic knowledge alleviates, to a certain extent, the case where an entity is absent from the training set.
Drawings
Fig. 1 is a schematic flow diagram of the entity linking method based on fused knowledge graph subgraph information and entity information provided in an embodiment of the present invention.
Fig. 2 is a schematic block diagram of the process of training the mention recognition model in the entity linking method provided in an embodiment of the present invention.
Fig. 3 is a schematic block diagram of the text labeling process in the entity linking method provided in an embodiment of the present invention.
Fig. 4 is a schematic block diagram of the mention recall process in the entity linking method provided in an embodiment of the present invention.
Fig. 5 is a schematic block diagram of the process of constructing candidate entity texts in the entity linking method provided in an embodiment of the present invention.
Fig. 6 is a schematic block diagram of the process of training the entity disambiguation model in the entity linking method provided in an embodiment of the present invention.
Fig. 7 is a schematic block diagram of the process of obtaining the knowledge base entity by pooling in the entity linking method provided in an embodiment of the present invention.
Fig. 8 illustrates the BERT sentence-pair classification task (a) and the BERT+CRF sequence labeling task (b) used in the entity linking method provided in an embodiment of the present invention.
Fig. 9 is an overall framework diagram of the entity linking method based on fused knowledge graph subgraph information and entity information provided in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1 and 9, the entity linking method based on the knowledge graph sub-graph information and the entity information includes the following steps:
and S1, obtaining a question text, feeding the trained reference recognition model, and recognizing the prediction reference words in the question text.
In this step, a trained mention recognition model is required; referring to fig. 2, the training method of the mention recognition model includes:
S11, constructing a mention recognition pre-training model;
In one embodiment of the invention, modeling is based on the Chinese pre-trained language model BERT, and a CRF layer is added to the BERT model, forming a BERT+CRF sequence labeling task; a schematic diagram of the sequence labeling task of the mention recognition pre-training model is shown in fig. 8(b);
S12, acquiring training data, sequence-labeling the texts in the training data to mark the named entities, and obtaining a training set;
In one embodiment of the present invention, 80% of all raw data is used as training data to train the model and the remaining 20% as test data, so that the effect of the model can be judged directly on the test data and the model improved before it enters the production environment. In mention recognition, positions of the training data containing Chinese book-title marks (《》) are specially processed: the book-title marks are labeled as part of the named entity, which improves the prediction accuracy for mentions that contain them.
In an embodiment of the present invention, referring to fig. 3, the method for labeling the texts in the training data includes:
S111, obtaining the sentences of the texts in the training data;
S112, labeling each Chinese character in the sentence;
In one embodiment of the present invention, different labels are assigned to the first character of a named entity in the sentence, the other characters inside the named entity, and the non-named-entity characters of the sentence, where the named entity is the mention. Concretely, the natural language to be recognized (mainly sentences of various kinds) is labeled as training data according to the BIO scheme to obtain the training set: for each sentence in the training data, every Chinese character is labeled. The first character of a named entity is given the label B(-entity); the other characters of the named entity are given the label I(-entity), representing the inside of the named entity; and the characters of the sentence belonging to no named entity are given the label O.
Illustratively, the labeling works as follows (Table 1 of the original drawings gives data annotation cases for the mention recognition model).
For example, in labeling the question "In December 2013, where did the daily passenger throughput of Yongning Station rank?", every Chinese character receives a tag: the first character of 永宁站 (Yongning Station) is tagged B, its remaining characters are tagged I, and all other characters of the sentence are tagged O. Here "Yongning Station" is the mention, also called the named entity.
As another example, for the sentence "Do materials such as the DDO-3055 tablet clinical study protocol and the investigator's brochure need to be provided?", the first character of the mention "DDO-3055 tablet" is tagged B, its remaining characters are tagged I, and all other characters are tagged O. Here "DDO-3055 tablet" is the mention, also called the named entity.
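To make the BIO scheme concrete, the following is a minimal sketch of character-level BIO tagging; the helper name bio_label, the assumption that mentions are given as substrings of the sentence, and the reconstructed Chinese example sentence are ours, not from the patent.

```python
def bio_label(sentence: str, mentions: list[str]) -> list[tuple[str, str]]:
    """Tag every character of `sentence` with B/I/O given its mention strings."""
    tags = ["O"] * len(sentence)                # non-named-entity characters get O
    for mention in mentions:
        start = sentence.find(mention)          # first occurrence of the mention
        if start == -1:
            continue
        tags[start] = "B"                       # first character of the named entity
        for i in range(start + 1, start + len(mention)):
            tags[i] = "I"                       # characters inside the named entity
    return list(zip(sentence, tags))

# The mention 永宁站 (Yongning Station) inside a question:
print(bio_label("2013年12月永宁站日进出人次排行第几？", ["永宁站"]))
```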
S13, inputting the labeled training set into the mention recognition pre-training model for training to obtain the corresponding named entities, thereby obtaining the mention recognition model.
The training data labeled with the BIO scheme is trained on the off-the-shelf pre-trained BERT+CRF model to obtain the mention recognition model.
The trained model performs a three-way classification for each Chinese character of every input sentence, i.e., it predicts whether each character is the beginning of a named entity (B-entity), inside a named entity (I-entity), or outside any named entity (O). A sketch of such a model follows.
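As a sketch only: the BERT+CRF sequence labeling model described above could be assembled as follows, assuming the HuggingFace transformers package and the third-party pytorch-crf package (torchcrf), neither of which is named in the patent.

```python
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class MentionRecognizer(nn.Module):
    """BERT encoder with a CRF layer for B/I/O character tagging."""
    def __init__(self, num_tags: int = 3):                   # tags: B, I, O
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)                   # per-character tag scores
        if tags is not None:                                  # training: CRF negative log-likelihood
            return -self.crf(emissions, tags, mask=attention_mask.bool())
        return self.crf.decode(emissions, mask=attention_mask.bool())  # best tag paths
```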
S2, recalling the predicted mention, obtaining the target mention according to the recall algorithm, obtaining all candidate entities of the target mention from the mapping file (mention2id) supplied with the knowledge base, and retrieving each candidate entity from the knowledge base to obtain the knowledge graph subgraph information of all candidate entities.
After any question is fed into the trained mention recognition model, a predicted mention is obtained; the predicted mention is then fed into the mention recall module and recalled to obtain a more accurate target mention.
For example, for the question "What order does the animal vernonia belong to?", the named entity predicted by the BERT+CRF model is "animal vernonia", while the target named entity is "vernonia".
After mention recognition, for mispredicted named entities that nevertheless share overlapping characters with the target named entity, a mention recall module is designed so that more target named entities are recalled and the overall recognition accuracy is improved. In this step, referring to fig. 4, the method for recalling the mention includes:
S21, acquiring the predicted mention identified by the mention recognition model and the mapping file between mentions and knowledge base entities;
In this step, specifically, the inputs are the predicted mention m_p of the BERT+CRF model and the mention-to-entity mapping file mention2id, with entries of the form "Li Na (actor), Li Na (basketball player), Li Na (volleyball player)"; the mention2id file is supplied with the official data.
S22, obtaining the list of all mentions of the knowledge graph from the mapping file, traversing the items of that list, and extracting the items contained in the predicted mention to form a cut list;
In this step, specifically, the list L_gold of all mentions of the whole knowledge graph is obtained from the official mention2id file. The items of L_gold are traversed, and the substrings of the predicted mention m_p that are items of L_gold are placed into the list L_cut.
S23, taking the longest item in the cut list as the recalled mention, and building a recall list from the recalled mention and the predicted mention;
In this step, specifically, the longest mention in L_cut is taken as the recalled mention, denoted m_c; m_p and m_c are placed in the list L_recall.
In this way, the longest target named entity contained in the predicted named entity is obtained: "animal vernonia" can be cut down to the target named entity "vernonia".
The reverse case also occurs: in the question "Do you know about the Zhao Wenzhuo–Zhen Zidan incident?", the named entity predicted by the BERT+CRF model is "Zhao Wenzhuo Zhen Zidan", while the target named entity is "Zhao Wenzhuo–Zhen Zidan incident".
S24, traversing each item in the recall list, retrieving the items of the full mention list that contain an item of the recall list, extracting them to build an extended list, and merging the extended list with the recall list to obtain the final entity recall list.
In this step, specifically, each item in the list L_recall is traversed to retrieve which items of L_gold contain it. These L_gold items are taken out and placed into the list L_extend; L_extend and L_recall are merged to obtain the final entity recall list.
Here, searching every item (element) of L_gold for all target entities containing the predicted entity "Zhao Wenzhuo Zhen Zidan" recalls the target entity "Zhao Wenzhuo–Zhen Zidan incident".
This mention recall method improves the recall of mention recognition and at the same time effectively mitigates error propagation under the pipeline framework: for a given question, if the first subtask, mention recognition, predicts incorrectly, the second subtask, entity disambiguation, is bound to go wrong as well.
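A minimal sketch of this recall algorithm in plain Python follows; the function name recall_mentions is ours, and the L_gold / L_cut / L_recall / L_extend names mirror the lists above.

```python
def recall_mentions(m_p: str, l_gold: list[str]) -> list[str]:
    """Recall target mentions for the predicted mention m_p (steps S21-S24)."""
    # S22: items of the full mention list that are substrings of the prediction
    l_cut = [m for m in l_gold if m in m_p]
    # S23: the longest such item is the recalled mention m_c
    l_recall = [m_p] + ([max(l_cut, key=len)] if l_cut else [])
    # S24: items of the full mention list that contain an item of the recall list
    l_extend = [g for g in l_gold if any(m in g for m in l_recall)]
    # merge, keeping order and dropping duplicates
    return list(dict.fromkeys(l_recall + l_extend))

# E.g. a predicted mention "animal vernonia" yields the cut-down mention
# "vernonia", which in turn recalls longer entries that contain it.
```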
S3, fusing the knowledge graph subgraph information into the candidate entity to obtain a candidate entity text, and introducing label pairs, i.e., entity boundary information, into the question text containing the mention and the candidate entity text containing the candidate entity, to construct the entity disambiguation dataset.
In this step, referring to fig. 5, the method for constructing the candidate entity text includes:
S31, retrieving the knowledge graph, which is stored as triples of head node, relation, and tail node.
In this step, the knowledge graph (knowledge base) exists as <subject, relation, object> triples, denoted head node, relation, and tail node, such as <Yongning station (New Taipei City), alias, Yongning station>.
S32, according to the triple form of the knowledge graph, taking out all edges and tail nodes connected with the head node as knowledge graph subgraph information, obtaining all the knowledge graph subgraph information of the candidate entity.
In this step, specifically, all edges and tail nodes connected to the head node are taken out; this is called the one-hop subgraph information of the head node, i.e., all knowledge graph subgraph information of the candidate entity is obtained.
S33, splicing the knowledge graph subgraph information into the context of the candidate entity to obtain the candidate entity text.
In this step, specifically, the one-hop subgraph information is spliced into the context of the candidate entity as sentence 2, i.e., the candidate entity text.
In one embodiment of the invention, all knowledge is stored in the knowledge base as <head entity, relation, tail node> triples, so a candidate entity A may have multiple triples in the knowledge graph, corresponding to relation 1–attribute value 1, relation 2–attribute value 2, and so on, such as <Yao Ming, gender, male> and <Yao Ming, occupation, basketball player>. Specifically, in the training format "candidate entity + relation 1 + attribute value 1 + relation 2 + attribute value 2 + ...", the subgraph information is spliced onto the candidate entity to obtain sentence 2.
For example, the station entity becomes: "Yongning station (New Taipei City)" + "alias: Yongning station, station code: BL37, operator: Taipei Metro, line: Line 5 (Tucheng line), location: B1, No. 105, Section 3, Central Road, Tucheng District, New Taipei City, station structure: underground station, platform type: island platform, number of exits: 4, opening date: May 31, 2006, daily passenger throughput: 29,806 [1], ranked 49th (December 2013)".
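As a sketch under the assumptions above (the helper name and the comma-separated "relation: value" formatting are ours), the splicing of one-hop triples into a candidate entity text could look like this:

```python
def build_candidate_text(entity: str, triples: list[tuple[str, str, str]]) -> str:
    """Splice the one-hop subgraph of `entity` into its context (sentence 2)."""
    one_hop = [(rel, tail) for head, rel, tail in triples if head == entity]
    subgraph_info = ", ".join(f"{rel}: {tail}" for rel, tail in one_hop)
    return f"{entity}, {subgraph_info}"      # candidate entity + subgraph information

triples = [                                  # <head node, relation, tail node>
    ("Yongning station (New Taipei City)", "alias", "Yongning station"),
    ("Yongning station (New Taipei City)", "station code", "BL37"),
    ("Yongning station (New Taipei City)", "line", "Line 5 (Tucheng line)"),
]
print(build_candidate_text("Yongning station (New Taipei City)", triples))
```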
The invention also fuses in entity boundary information. The question text containing the mention serves as sentence 1, and sentence 2 contains the candidate entity. So that the model can learn the boundaries of the mention and of the candidate entity in context, special label pairs are introduced before and after the mention and the candidate entity respectively, namely <e1></e1> and <e2></e2>. The label pairs identify the boundary information of the mention and the candidate entity for the entity disambiguation model.
Each example of the entity disambiguation dataset is constructed as: [CLS] + question text + [SEP] + candidate entity text + [SEP] + sentence-pair classification label. In one embodiment of the invention, based on BERT's sentence-pair matching task, the example is constructed as [CLS] + sentence 1 + [SEP] + sentence 2 + [SEP] + the correct sentence-pair classification, yielding the dataset, as sketched below.
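A minimal sketch of assembling one such example follows (the concrete data format produced is shown after the sketch); the helper name build_example is ours, and the [CLS]/[SEP] markers are written literally here although in practice a BERT tokenizer inserts them itself.

```python
def build_example(question: str, mention: str,
                  cand_entity: str, cand_text: str, label: int) -> str:
    """One entity disambiguation example with <e1>/<e2> boundary label pairs."""
    sent1 = question.replace(mention, f"<e1>{mention}</e1>", 1)
    sent2 = cand_text.replace(cand_entity, f"<e2>{cand_entity}</e2>", 1)
    return f"[CLS] {sent1} [SEP] {sent2} [SEP] {label}"

print(build_example(
    "In December 2013, where did the daily passenger throughput of "
    "Yongning station rank?",
    "Yongning station",
    "Yongning station (New Taipei City)",
    "Yongning station (New Taipei City), alias: Yongning station, "
    "station code: BL37",
    1,   # 1 = this candidate is the entity the mention points to
))
```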
The data format finally fed into the model is as follows:
"[CLS] In December 2013, where did the daily passenger throughput of <e1>Yongning station</e1> rank? [SEP] <e2>Yongning station (New Taipei City)</e2>, alias: Yongning station, station code: BL37, operator: Taipei Metro, line: Line 5 (Tucheng line), location: B1, No. 105, Section 3, Central Road, Tucheng District, New Taipei City, station structure: underground station, platform type: island platform, number of exits: 4, opening date: May 31, 2006, daily passenger throughput: 29,806 [1], ranked 49th (December 2013), first train: 06:00, last train: last train from Yongning station on the Tucheng line toward Nangang Exhibition Center: 00:00 [SEP] 1".
S4, feeding the entity disambiguation dataset into the trained entity disambiguation model, pooling to obtain the scores of the different candidate entities corresponding to the mention, and taking the highest-scoring candidate as the target entity of the mention to complete the entity linking.
In this step, referring to fig. 6, the method for training the entity disambiguation model includes:
S41, building a BERT-based pre-training model for entity disambiguation as a BERT sentence-pair classification task; a schematic diagram of the BERT sentence-pair classification task modeled as the basic model for entity disambiguation is shown in fig. 8(a);
The entity disambiguation model is modeled as a BERT-based sentence-pair classification model, with data labels as shown in Table 2 below:
Table 2. Data annotation cases for the entity disambiguation model
That is, the relation between two sentences is predicted. For example, if sentence A (content A) is a question and sentence B (content B) is the answer to that question, then sentences A and B have a question-answer relation, with label 1; if sentence C is not the answer to the question, then sentences A and C have no question-answer relation. The sentence-pair classification task is to predict whether a given relation holds between the two input sentences.
S42, inputting the entity disambiguation dataset into the entity disambiguation pre-training model for training to obtain the corresponding knowledge base entities, thereby obtaining the entity disambiguation model.
The entity disambiguation model is pooled with an entity-level max pooling strategy, as shown in fig. 7, and the knowledge base entity corresponding to the mention is obtained by pooling as follows:
S401, according to the entity-level max pooling strategy, obtaining the max-pooled representation vector of the mention segment and the max-pooled representation vector of the candidate entity segment respectively;
S402, concatenating the representation vector of [CLS], the max-pooled representation vector of the mention segment of the question text, and the max-pooled representation vector of the candidate entity segment of the candidate text, feeding the concatenation into a feed-forward neural network, and computing the score that the mention in the question text points to the knowledge base entity;
S403, comparing the computed scores and taking the candidate entity with the highest score as the target entity, obtaining the knowledge base entity corresponding to the mention.
Specifically, in step S402, the Softmax function is used to compute the score that the mention in the question text points to the knowledge base entity.
In this embodiment, because the text with the spliced-in subgraph is long compared with the short question itself, knowledge noise is introduced, which affects the vector representation of the whole sentence. Different pooling strategies were therefore tried to refine the context representation.
Five pooling strategies at two levels were compared: max pooling, average pooling, and attention pooling at the sequence level (i.e., over every token of the sentence pair), and max pooling and average pooling at the entity level (i.e., over the tokens of the candidate entity and the mention in the sentence pair). The conclusion is that entity-level max pooling works best. Finally, the enhanced representation (the entity-level max-pooled vectors) is concatenated with the sequence representation (the vector representation of the special token [CLS]).
Therefore, after the text representation is obtained from the entity disambiguation model, the designed pooling layer puts together the representation vector of [CLS], the max-pooled representation vector of the mention segment of the question text, and the max-pooled representation vector of the candidate entity segment of the candidate text; the concatenation is fed into a fully connected layer (feed-forward neural network), and a score is obtained through the Softmax function. The score represents how strongly the selected entity is the knowledge base entity the mention in the question text points to, i.e., the probability of being the target entity of the mention. The candidate entity with the highest score is taken as the target entity, as sketched below.
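A minimal sketch of this pooling head follows; the class name, the binary output, and the span-index inputs are our assumptions, and `hidden` stands for the BERT output of one sentence-pair batch.

```python
import torch
import torch.nn as nn

class DisambiguationHead(nn.Module):
    """Entity-level max pooling head: [CLS] + mention pool + entity pool -> score."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.ffn = nn.Linear(3 * hidden_size, 2)   # feed-forward scoring layer

    def forward(self, hidden, mention_span, entity_span):
        # hidden: (batch, seq_len, hidden_size); spans: (start, end) token indices
        cls_vec = hidden[:, 0]                                         # [CLS] vector
        m_vec = hidden[:, mention_span[0]:mention_span[1]].max(dim=1).values
        e_vec = hidden[:, entity_span[0]:entity_span[1]].max(dim=1).values
        logits = self.ffn(torch.cat([cls_vec, m_vec, e_vec], dim=-1))
        return torch.softmax(logits, dim=-1)[:, 1]   # probability of the candidate
```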
With the proposed entity linking solution, which combines a rule-based mention recall algorithm with an entity disambiguation algorithm based on a pre-trained deep neural network, the present invention achieves the best performance to date on the Chinese dataset NLPCC2016 CKBQA.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (7)

1. An entity linking method based on fused knowledge graph subgraph information and entity information, characterized by comprising the following steps: obtaining a question text, feeding it into a trained mention recognition model, and recognizing the predicted mention in the question text;
recalling the predicted mention, obtaining the target mention according to a recall algorithm, obtaining all candidate entities of the target mention from the mapping file supplied with the knowledge base, and retrieving each candidate entity from the knowledge base to obtain the knowledge graph subgraph information of all candidate entities;
fusing the knowledge graph subgraph information into the candidate entity to obtain a candidate entity text, and introducing label pairs, i.e., entity boundary information, into the question text containing the mention and the candidate entity text containing the candidate entity, to construct an entity disambiguation dataset;
feeding the entity disambiguation dataset into a trained entity disambiguation model, pooling to obtain the scores of the different candidate entities corresponding to the mention, and taking the highest-scoring candidate as the target entity of the mention to complete the entity linking;
the method for training the mention recognition model comprising: constructing a mention recognition pre-training model;
acquiring training data, sequence-labeling the texts in the training data to mark the named entities, and obtaining a training set;
inputting the labeled training set into the mention recognition pre-training model for training to obtain the corresponding named entities, thereby obtaining the mention recognition model;
the pre-training model being a BERT model with a CRF layer added on top; the method for labeling the texts in the training data comprising: obtaining the sentences of the texts in the training data;
labeling each Chinese character in the sentence;
assigning different labels to the first character of a named entity in the sentence, the other characters inside the named entity, and the non-named-entity characters of the sentence, where the named entity is the mention;
the method for recalling the mention comprising: acquiring the predicted mention identified by the mention recognition model and the mapping file between mentions and knowledge base entities;
obtaining the list of all mentions of the knowledge graph from the mapping file, traversing the items of that list, and extracting the items contained in the predicted mention to form a cut list;
taking the longest item in the cut list as the recalled mention, and building a recall list from the recalled mention and the predicted mention;
traversing each item in the recall list, retrieving the items of the full mention list that contain an item of the recall list, extracting them to build an extended list, and merging the extended list with the recall list to obtain the final entity recall list.
2. The entity linking method based on fused knowledge graph subgraph information and entity information according to claim 1, wherein the method for constructing the candidate entity text comprises: retrieving the knowledge graph, which is stored as triples of head node, relation, and tail node;
according to the triple form of the knowledge graph, taking out all edges and tail nodes connected with the head node as knowledge graph subgraph information, obtaining all the knowledge graph subgraph information of the candidate entity;
and splicing the knowledge graph subgraph information into the context of the candidate entity to obtain the candidate entity text.
3. The entity linking method based on fused knowledge graph subgraph information and entity information according to claim 2, wherein the label pairs identify the boundary information of the mention and the candidate entity for the entity disambiguation model, and the label pairs are inserted before and after the mention in the question text and before and after the candidate entity in the candidate entity text, respectively.
4. The entity linking method based on fused knowledge graph subgraph information and entity information according to claim 1, wherein each example of the entity disambiguation dataset is constructed as: [CLS] + question text + [SEP] + candidate entity text + [SEP] + sentence-pair classification label.
5. The entity linking method based on fused knowledge graph subgraph information and entity information according to claim 4, wherein the method for training the entity disambiguation model comprises: building a BERT-based pre-training model for entity disambiguation as a BERT sentence-pair classification task;
and inputting the entity disambiguation dataset into the entity disambiguation pre-training model for training to obtain the corresponding knowledge base entities, thereby obtaining the entity disambiguation model.
6. The entity linking method based on fused knowledge graph subgraph information and entity information according to claim 4, wherein the entity disambiguation model is pooled with an entity-level max pooling strategy, and the knowledge base entity corresponding to the mention is obtained by pooling as follows: obtaining, according to the entity-level max pooling strategy, the max-pooled representation vector of the mention segment and the max-pooled representation vector of the candidate entity segment respectively;
concatenating the representation vector of [CLS], the max-pooled representation vector of the mention segment of the question text, and the max-pooled representation vector of the candidate entity segment of the candidate text, feeding the concatenation into a feed-forward neural network, and computing the score that the mention in the question text points to the knowledge base entity;
and comparing the computed scores, taking the candidate entity with the highest score as the target entity, thus obtaining the knowledge base entity corresponding to the mention.
7. The entity linking method based on fused knowledge graph subgraph information and entity information according to claim 6, wherein a Softmax function is adopted to compute the score that the mention in the question text points to the knowledge base entity.
CN202110951011.8A 2021-08-18 2021-08-18 Entity linking method based on knowledge graph sub-graph information and entity information Active CN113626613B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110951011.8A CN113626613B (en) 2021-08-18 2021-08-18 Entity linking method based on knowledge graph sub-graph information and entity information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110951011.8A CN113626613B (en) 2021-08-18 2021-08-18 Entity linking method based on knowledge graph sub-graph information and entity information

Publications (2)

Publication Number Publication Date
CN113626613A CN113626613A (en) 2021-11-09
CN113626613B true CN113626613B (en) 2022-07-05

Family

ID=78386484

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110951011.8A Active CN113626613B (en) 2021-08-18 2021-08-18 Entity linking method based on knowledge graph sub-graph information and entity information

Country Status (1)

Country Link
CN (1) CN113626613B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416976A (en) * 2021-12-23 2022-04-29 北京百度网讯科技有限公司 Text labeling method and device and electronic equipment
CN114491001B (en) * 2022-04-18 2022-08-30 南京星耀智能科技有限公司 Entity searching method in military field
CN114841164A (en) * 2022-04-29 2022-08-02 浪潮电子信息产业股份有限公司 Entity linking method, device, equipment and storage medium
CN117113993B (en) * 2023-10-23 2024-01-09 腾讯科技(深圳)有限公司 Entity linking method, device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784604A (en) * 2021-02-08 2021-05-11 哈尔滨工业大学 Entity linking method based on entity boundary network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045826A (en) * 2015-06-29 2015-11-11 华东师范大学 Entity linkage algorithm based on graph model
CN108959461B (en) * 2018-06-15 2021-07-27 东南大学 Entity linking method based on graph model
CN110866399B (en) * 2019-10-24 2023-05-02 同济大学 Chinese short text entity recognition and disambiguation method based on enhanced character vector
CN111259653B (en) * 2020-01-15 2022-06-24 重庆邮电大学 Knowledge graph question-answering method, system and terminal based on entity relationship disambiguation
CN111523326B (en) * 2020-04-23 2023-03-17 北京百度网讯科技有限公司 Entity chain finger method, device, equipment and storage medium
CN111563149B (en) * 2020-04-24 2023-01-31 西北工业大学 Entity linking method for Chinese knowledge map question-answering system

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784604A (en) * 2021-02-08 2021-05-11 哈尔滨工业大学 Entity linking method based on entity boundary network

Also Published As

Publication number Publication date
CN113626613A (en) 2021-11-09


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant