CN112925918A

CN112925918A - Question-answer matching system based on disease field knowledge graph

Info

Publication number: CN112925918A
Application number: CN202110213829.XA
Authority: CN
Inventors: 倪妙玲; 孙庆华; 王聪
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2021-06-08
Anticipated expiration: 2041-02-26
Also published as: CN112925918B

Abstract

The invention discloses a question-answer matching system based on a knowledge graph in a disease field, which comprises a preprocessing module, a named entity identification module, an entity linking module and a relation matching module.

Description

Question-answer matching system based on disease field knowledge graph

Technical Field

The invention relates to the field of natural language processing, in particular to a question-answer matching system based on a knowledge graph in the disease field.

Background

A question-answering system is a common application in natural language processing and a common form of information retrieval system. By analyzing the question entered by a user, and with the help of algorithms such as deep learning, it quickly extracts accurate and relevant information from a large body of information and returns it to the user, answering the question and meeting the user's need to obtain accurate information quickly.

A knowledge graph organizes and understands data in a form close to human cognition, and provides an excellent solution for managing and using the massive, heterogeneous and dynamic data on the Internet. Knowledge graphs integrate technologies and methods from many areas, including knowledge representation, knowledge extraction (named entity recognition, relation extraction, and the like), knowledge fusion, knowledge storage, knowledge inference, graph computation, visualization, semantic search, knowledge question answering, and knowledge crowdsourcing. A knowledge graph generally follows the RDF triple data structure, i.e. the form (s, p, o), corresponding to the main entity (subject), the relation (predicate), and the tail entity (object). A knowledge graph can contain tens or hundreds of millions of entities and billions of facts (i.e. attribute values and relationships to other entities), organized in thousands of semantically expressed conceptual structures of the objective world. Fig. 1 illustrates a portion of a coronary heart disease domain knowledge graph. Arrows represent relations and point from the main entity to the tail entity. Circles represent entities or attributes; entities are abstractions of objective individuals, such as body parts, diseases, and symptoms. Attribute values describe entities and are divided into text and numerical types, such as "incurable" and "80%".

The question-answering system based on the medical knowledge graph can help a user to screen out accurate answers of questions from massive medical data and return the answers to the user, and meanwhile, a certain explanation can be provided for the answers by means of the structured knowledge graph.

The defects in the prior art are as follows:

1. Word-vector representation models based on the general domain lack domain knowledge. The reason is that BERT is pre-trained on general corpora, whereas the disease domain has many technical terms and more complex entities. For example, "rheumatic mitral stenosis" is a complete entity that also nests another disease entity, "mitral stenosis". Because "rheumatic mitral stenosis" is rare in general training corpora, its semantic vector is learned poorly, and only "mitral stenosis" is recognized.

2. Pipeline-based entity recognition and entity linking accumulate errors. In a pipeline, entity recognition and entity linking run serially, and entity linking is decided on the basis of the entity recognition result, so errors propagate. When the entity recognition module outputs a wrong entity mention, the mention cannot be linked to the correct entity word in the knowledge graph.

Disclosure of Invention

In order to overcome two defects in the prior art, the invention provides a question-answer matching system based on a knowledge graph in the disease field.

The invention adopts the following technical scheme:

a question-answer matching system based on a disease domain knowledge graph, comprising:

a preprocessing module for preprocessing the question input by the user, the output of the preprocessing module being denoted Q;

a named entity recognition module for determining the start and end positions of the entity mention: it predicts, for each character in the output Q, the probability of being the start position and the end position of the entity mention in the sentence, and determines the start and end positions of the entity mention from these probabilities; the named entity recognition module is obtained by fine-tuning BERT_domain;

an entity linking module for linking the entity mention to entity words in the disease knowledge graph and retrieving, through the entity words, all corresponding relations in the disease knowledge graph;

a relation matching module for matching the relations corresponding to the entity word against the output Q of the preprocessing module and judging whether each relation is consistent with the user's question, and if so, outputting it; the relation matching module is likewise obtained by fine-tuning BERT_domain.

Further, the preprocessing includes using regular expressions to remove spaces and punctuation marks from the text question entered by the user, and converting all letters to lower case.
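The preprocessing step can be illustrated with a minimal Python sketch. The function name preprocess and the exact character class are assumptions made for illustration; the text only states that spaces and punctuation are removed and letters are lower-cased.

```python
import re

# Matches any run of whitespace, punctuation, or underscores.
# \W covers non-word characters (including whitespace and full-width
# punctuation); "_" is listed separately because \w treats it as a word character.
_NON_TEXT = re.compile(r"[\W_]+")


def preprocess(question: str) -> str:
    """Return Q: the question with spaces and punctuation removed, lower-cased."""
    return _NON_TEXT.sub("", question).lower()


print(preprocess("How to treat Rheumatic Mitral Stenosis?"))
```

Because Q contains no spaces, every later module in this sketch series operates on a plain character sequence.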

Further, predicting the probability of each character in the output Q being the start position and the end position of the entity mention in the sentence, and determining the entity mention, is specifically:

the BERT_domain model, pre-trained on the disease domain, encodes the output Q of the preprocessing module and outputs feature vectors; from the feature vectors, two Softmax classification layers predict the probability of each character being the entity-mention start position and the entity-mention end position; and the character string between the start position and the end position is the entity mention.

Further, the pre-training process for obtaining the disease-domain BERT_domain model is as follows:

firstly, the disease names in clinical diagnosis and treatment documents and electronic medical records are stored as a dictionary file;

then, the text content of the clinical diagnosis and treatment documents and electronic medical records is split into sentences at punctuation marks;

for each resulting sentence, a maximum dictionary matching method is used to check whether the sentence contains a disease word from the dictionary file; if so, the disease word is masked, i.e. replaced with MASK; for sentences that contain no disease word, a randomly chosen character is masked; in this way an MLM task data set for the disease domain is constructed;

finally, the BERT model is pre-trained on the MLM task data set. The MLM task aims to predict the words replaced by MASK from the context of the sentence, so that the model learns bidirectional information about the sentence. The BERT_domain model and the character vectors E_char are obtained through the MLM task.
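The data-set construction described in these steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the punctuation set, the use of forward maximum matching, and the choice of one [MASK] token per masked character are all assumptions.

```python
import random
import re

MASK = "[MASK]"


def split_sentences(text):
    """Split text into sentences at common Chinese/English punctuation (assumed set)."""
    return [s for s in re.split(r"[。！？；，.!?;,]", text) if s.strip()]


def find_disease_word(sentence, dictionary):
    """Forward maximum matching: at each position, try the longest dictionary word first."""
    max_len = max((len(w) for w in dictionary), default=0)
    for start in range(len(sentence)):
        for length in range(min(max_len, len(sentence) - start), 0, -1):
            if sentence[start:start + length] in dictionary:
                return start, start + length
    return None


def mask_sentence(sentence, dictionary, rng):
    """Return (masked character list, list of original characters that were masked)."""
    chars = list(sentence)
    span = find_disease_word(sentence, dictionary)
    if span is None:
        # No disease word found: mask one randomly chosen character instead.
        i = rng.randrange(len(chars))
        span = (i, i + 1)
    start, end = span
    masked = chars[:start] + [MASK] * (end - start) + chars[end:]
    return masked, chars[start:end]


# Hypothetical dictionary and sentence, for illustration only.
dictionary = {"mitral stenosis", "rheumatic mitral stenosis"}
rng = random.Random(0)
masked, target = mask_sentence("patient has rheumatic mitral stenosis", dictionary, rng)
print("".join(target))
```

Trying the longest match first is what keeps a nested entity such as "mitral stenosis" from being masked on its own when the longer disease word is present in the dictionary.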

Further, the named entity recognition module is obtained by fine-tuning BERT_domain: task-specific network parameters for the downstream task are added on top of BERT_domain and the parameters of the whole network are fine-tuned. The named entity recognition module adds two softmax classification layers on top of BERT_domain, which respectively predict the probability of each character in the output Q being the start position and the end position of the entity mention. In addition, in the fine-tuning stage the segment encoding in the original input of BERT_domain is replaced with the word-segmentation encoding. The network parameters of BERT_domain in the fine-tuning stage are initialized from the model parameters saved during the pre-training stage.

Further, the probabilities of the entity-mention start position and end position in the named entity recognition module are calculated by the following formula:

[formula not reproduced in the source text]

wherein L is the number of characters in Q, the feature encoding of the k-th character c_k in Q is output by BERT_domain, and h_k is a network parameter of the Softmax classification layer.

Further, the word-segmentation encoding is specifically: the output Q is segmented by a word segmentation tool, and each character is given a label according to the segmentation result. The label set is {B, M, E, S}, where B indicates that the character is the first character of a word, M that it is a middle character of a word, E that it is the last character of a word, and S that it is a single-character word. The corresponding word-segmentation encodings are E_B, E_M, E_E and E_S respectively.
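The labelling rule can be sketched in a few lines of Python. The function name bmes_labels is hypothetical, and the input is assumed to be the list of words already produced by whatever segmentation tool is used.

```python
def bmes_labels(words):
    """Map a segmented sentence (list of words) to one B/M/E/S label per character."""
    labels = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


# Each label would then select one of the learned encodings E_B, E_M, E_E, E_S.
print(bmes_labels(["how", "to", "a"]))
```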

Further, the entity linking module is implemented by retrieval and is divided into two stages, an offline stage and an online search stage:

offline stage: the entity words and entity aliases in the disease knowledge graph are split into characters, and TF-IDF is computed;

a character index is built for the entity words, recording, for each character, all entity words that contain it and the position of the character within each entity word;

online search stage: the question input by the user is preprocessed to obtain Q, the entity mention is obtained through the named entity recognition module, the entity mention is split into characters, and the corresponding character indexes are looked up. The TF-IDF values are accumulated per entity word, the entity words are sorted by score in descending order, and the 2 entity words with the highest scores are selected as candidate entities;

the DICE distance d₁ between each candidate entity and the entity mention and the DICE distance d₂ between the candidate entity and Q are calculated; if their weighted sum is greater than the threshold, the candidate entity is retained; otherwise the candidate entity is removed.

Further, the DICE distance is:

[formula not reproduced in the source text]

wherein X, Y and S denote the entity word, the entity mention and the user input respectively, alpha and beta are parameters, a ∩ b denotes the characters common to a and b, and |X| denotes the length in characters.
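As a hedged illustration of the weighted DICE score, the sketch below uses the standard Dice coefficient 2|a ∩ b| / (|a| + |b|) with a multiset count of common characters, and combines the two distances as alpha * d1 + beta * d2. The exact formula, the multiset interpretation, and the weights 0.4 and 0.6 are assumptions; the English strings stand in for the example question with spaces removed.

```python
from collections import Counter


def common_chars(a, b):
    """Number of characters shared by a and b, counted as a multiset intersection."""
    return sum((Counter(a) & Counter(b)).values())


def dice(a, b):
    """Standard Dice coefficient on characters: 2|a ∩ b| / (|a| + |b|)."""
    if not a and not b:
        return 0.0
    return 2 * common_chars(a, b) / (len(a) + len(b))


def link_score(entity, mention, question, alpha, beta):
    """Weighted sum of d1 (entity vs. mention) and d2 (entity vs. question Q)."""
    return alpha * dice(entity, mention) + beta * dice(entity, question)


question = "howtotreatrheumaticmitralstenosis"
mention = "mitralstenosis"  # boundary error: "rheumatic" was missed
for entity in ("mitralstenosis", "rheumaticmitralstenosis"):
    print(entity, round(link_score(entity, mention, question, 0.4, 0.6), 3))
```

With these illustrative weights the longer entity word scores higher even though the mention matches the shorter one exactly, which is the effect the second distance is meant to provide.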

Further, the network parameters of the two Softmax classification layers are different.

The invention has the beneficial effects that:

(1) The invention uses richer and more specialized character representations for the disease domain. Existing pre-trained models are trained on general corpora, whereas the disease domain has many technical terms, nested entities and similar difficulties. By pre-training on clinical diagnosis and treatment documents and electronic medical records from the disease domain, the invention obtains disease-domain character vector representations and a domain pre-trained model, BERT_domain;

(2) Entity boundary recognition is more accurate. Entities in the knowledge graph are used to assist word segmentation, word-segmentation encodings are added according to the segmentation result, and the segmentation result thereby assists entity boundary recognition;

(3) The invention reduces the coupling between entity recognition and entity linking. Entity linking uses a retrieval-based method. Because the entity recognition result directly affects the entity linking result, the DICE distance between the entity word and the sentence is added to the DICE distance calculation, which prevents an entity-boundary recognition error from making the DICE distance too small.

Drawings

FIG. 1 is a partial schematic view of a coronary heart disease domain knowledge graph showing a schematic view of the principal entities, relationships, attributes and tail entities in the prior art;

FIG. 2 is a flow chart of the operation of the present invention;

FIG. 3 is a schematic diagram of a named entity recognition module of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.

Examples

As shown in FIGS. 1 to 3, a question-answer matching system based on a disease domain knowledge graph analyzes the questions posed by a user on the basis of a constructed disease knowledge graph and provides relevant answers. The system includes the following modules.

The preprocessing module uses regular expressions to remove spaces, punctuation marks and the like from the question input by the user and converts all letters to lower case; the output of the preprocessing module is denoted Q.

The named entity recognition module is used for determining the start and end positions of the entity mention. Through two Softmax classification layers it predicts, for each character in Q, the probability of being the start position and the end position of the entity mention in the sentence, and the character string determined by the start and end positions is the entity mention. The named entity recognition module is obtained by fine-tuning BERT_domain.

The probability calculation formula is:

[formula not reproduced in the source text]

wherein L is the number of characters in the output Q of the preprocessing module, the feature encoding of the k-th character c_k is output by BERT_domain, and h_k is a network parameter of a classification layer. Only the classification-layer parameters used for predicting the entity start and end positions are independent of each other, i.e. the h_k differ, while BERT_domain is shared.

The feature encoding is obtained from BERT_domain, the model produced by domain pre-training. The inputs of the original BERT are the character encoding, the position encoding and the segment encoding. The named entity recognition module fine-tunes BERT_domain, whose input is divided into 3 parts: character encoding, position encoding and word-segmentation encoding. The character encoding is the character representation E_char pre-trained with BERT_domain. The position encoding is the position representation of each character; it lets the BERT model determine sequence order and is the same as in the original BERT. The original segment encoding is removed and replaced by the word-segmentation encoding: because named entity recognition is a single-sentence task, no segment encoding is needed to indicate which sentence each character belongs to.

The word-segmentation encoding is specifically as follows. The output Q is segmented by a word segmentation tool, and each character is given a label according to the segmentation result. The label set is {B, M, E, S}, where B indicates that the character is the first character of a word, M that it is a middle character of a word, E that it is the last character of a word, and S that it is a single-character word. If a character's segmentation label is B, its word-segmentation boundary encoding is E_B, and similarly for the other labels. E_B, E_M, E_E and E_S are all network parameters. Introducing the word-segmentation encoding provides prior knowledge for recognizing entity boundaries. For example, if the user inputs "how to treat rheumatic mitral stenosis", then after segmentation "stenosis" is a single connected word. Using the segmentation result as a feature of the named entity recognition module reduces entity-boundary errors such as an entity mention that includes only part of the word "stenosis" rather than the whole word.

The entity linking module is used for linking the entity mention output by the entity recognition module to the entity words s in the knowledge graph, and is implemented with a retrieval-based entity linking method. The specific implementation is as follows. All entity words and entity aliases in the knowledge graph (hereinafter "entity words" refers to both) are split into characters, and TF-IDF (term frequency-inverse document frequency) is computed. TF is the term frequency, which indicates how often a word appears in an article. IDF is the inverse document frequency, the ratio of the total number of articles to the number of articles containing the word, usually taken as a logarithm. The main idea of TF-IDF is that if a word or phrase appears frequently in one article (high TF) and rarely in other articles, it is considered to discriminate well between articles. For the entity linking module, TF is the frequency of a character within an entity word, and IDF is the number of all entity words divided by the number of entity words containing that character. After these statistics are computed, an inverted index is built over the entity words, recording for each character all entity words that contain it and the position of the character within each entity word. When an entity mention is input, it is split into characters, the index is looked up for each character to obtain every entity word containing that character together with its TF-IDF, the TF-IDF values are accumulated per entity word over the characters that hit, the entity words are sorted by score in descending order, and the 2 entity words with the highest scores are selected as candidate entity words.
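The offline index and the online top-2 recall can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, IDF is taken as the natural logarithm of (number of entity words / number of entity words containing the character), each distinct mention character is counted once, and the three entity words are toy data.

```python
import math
from collections import Counter, defaultdict


def build_index(entity_words):
    """Offline stage: map each character to (entity word, TF-IDF weight, positions)."""
    n = len(entity_words)
    # Document frequency: number of entity words containing each character.
    df = Counter(ch for word in entity_words for ch in set(word))
    index = defaultdict(list)
    for word in entity_words:
        for ch, count in Counter(word).items():
            tf = count / len(word)
            idf = math.log(n / df[ch])
            positions = [i for i, c in enumerate(word) if c == ch]
            index[ch].append((word, tf * idf, positions))
    return index


def top_candidates(mention, index, k=2):
    """Online stage: accumulate TF-IDF over hit characters and keep the k best entity words."""
    scores = defaultdict(float)
    for ch in set(mention):
        for word, weight, _positions in index.get(ch, []):
            scores[word] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _score in ranked[:k]]


entities = ["mitralstenosis", "rheumaticmitralstenosis", "coronaryheartdisease"]
index = build_index(entities)
print(top_candidates("mitralstenosis", index))
```

Characters that occur in every entity word receive an IDF of zero and so contribute nothing, which is why the unrelated third entity word is not recalled ahead of the other two.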

Whether each of the top-2 candidate entity words matches is judged by the following distance formula:

[formula not reproduced in the source text]

wherein X, Y and S denote the candidate entity word, the entity mention and the output Q of the preprocessing module respectively, alpha and beta are parameters, a ∩ b denotes the characters common to a and b, and |x| denotes the length in characters.

The formula gives the weighted sum of the DICE distance d₁ between the candidate entity word and the entity mention and the DICE distance d₂ between the candidate entity word and the sentence. If the weighted sum is greater than the threshold, the candidate entity word is retained; otherwise it is removed.

In addition to the DICE distance d₁ between the entity mention and the entity word, the DICE distance d₂ between the entity word and the sentence is also considered. This reduces, to some extent, the cases in which a partially recognized mention boundary makes d₁ small so that the mention cannot be linked to the correct entity word. Suppose the input is "how to treat rheumatic mitral stenosis" and the recognized entity mention covers only "mitral stenosis"; the recalled top-1 entity is "mitral stenosis" and the top-2 entity is "rheumatic mitral stenosis". On d₁ the top-1 entity scores higher than the top-2 entity, whereas on d₂ the top-1 entity scores lower than the top-2 entity. The weighted sum of d₁ and d₂ therefore reduces mistaken links to "mitral stenosis" caused by an entity-boundary recognition error.

The entity linking module thus uses the character index and TF-IDF to obtain two candidate entity words and computes the weighted sum of the two DICE distances for each. Candidates that satisfy the threshold are called entity words, and their number can be 0, 1 or 2.

The relation matching module is used for judging whether a relation p matches the intention of the user's input; if it does, the o in the triple (s, p, o) of the knowledge graph is the answer to be output. The relation matching module is implemented as follows. The relations of the knowledge graph are retrieved according to the entity words obtained by entity linking. Each entity word is concatenated with a relation to form sentence 1, and the user input is sentence 2; since an entity word may have several relations, there are several candidates for sentence 1. The sentence-pair input is consistent with the NSP task in BERT pre-training. The representation of the [CLS] character is used as the feature vector of the sentence pair, and a Softmax classification determines whether sentence 1 (the entity word concatenated with the relation) and sentence 2 (the preprocessed user input Q) are semantically similar. In relation matching, the sentence formed from the entity word and the relation is matched against Q, rather than matching the relation alone against the part of Q that remains after the entity mention is removed. There are two reasons. First, because entity linking is based on character matching, it is necessary to consider jointly whether both the entity and the relation match the entity and the intention in the user's input. Second, if only the non-mention part were used, an error in the recognized mention boundary would alter the semantics of that part and lower its matching score with the relation. The relation with the highest score whose probability is greater than the threshold is selected as p, and the corresponding o in (s, p, o) is returned as the answer to the user's question.
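The selection logic around the classifier can be sketched as follows. The classifier itself (the fine-tuned BERT_domain with a softmax layer on [CLS]) is replaced here by a pluggable score_fn; the function names, the threshold value, the toy triples and the keyword-based stand-in scorer are all hypothetical and serve only to make the sketch runnable.

```python
def select_answer(entity_word, triples, question, score_fn, threshold=0.5):
    """Return (p, o) for the best-matching relation of entity_word, or None.

    triples: iterable of (s, p, o) from the knowledge graph.
    score_fn(sentence1, sentence2): probability that the pair is semantically similar.
    """
    best = None
    for s, p, o in triples:
        if s != entity_word:
            continue
        sentence1 = s + p  # entity word concatenated with the relation
        prob = score_fn(sentence1, question)
        if prob > threshold and (best is None or prob > best[0]):
            best = (prob, p, o)
    return None if best is None else (best[1], best[2])


def toy_score(sentence1, sentence2):
    """Stand-in for the sentence-pair classifier; NOT the method of the source."""
    return 0.9 if "treat" in sentence1 and "treat" in sentence2 else 0.2


triples = [
    ("mitralstenosis", "treatment", "<treatment text>"),
    ("mitralstenosis", "symptom", "<symptom text>"),
]
print(select_answer("mitralstenosis", triples, "howtotreatmitralstenosis", toy_score))
```

Returning None when no relation clears the threshold lets the caller fall through to the next linked entity word, since entity linking may yield 0, 1 or 2 of them.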

The invention uses the BERT model and has two stages:

Stage 1: pre-training. Pre-training is performed with a masked language model (MLM) task. Because the original BERT training corpus is generic and not targeted at the disease domain, the character representations of some disease words are poor. Therefore, an MLM task for the disease domain is designed to obtain the disease-domain model BERT_domain. The input encoding of the MLM task is the same as in general-domain BERT pre-training: the sum of the character vector, the segment encoding and the position vector, all of which are learnable parameters. After pre-training is finished, the trained input character vectors E_char and the model parameters are saved to a file. The output of BERT_domain in the pre-training stage is called the character feature vector.

Stage 2: fine-tuning. The fine-tuning stage adds task-specific network parameters for the downstream task on top of BERT_domain and fine-tunes the parameters of the whole network. In this stage the parameters of BERT_domain are initialized from the model parameters of Stage 1 and use a relatively small learning rate, typically 10^-5, while the task-specific network parameters are initialized randomly. A new loss function is designed for each downstream task and the model parameters are fine-tuned against it. Because the learning rate is small, the pre-trained parameters are adjusted only slightly, and the network layers farther from the output layer are adjusted least.

For the question-answering system, the downstream tasks include two tasks: entity identification and relationship matching.

Entity recognition: for this task, the segment encoding in the input of BERT_domain is replaced by the word-segmentation encoding. Two softmax classification layers are added on top of BERT_domain, which respectively predict the probability of each character in Q being the start position and the end position of the entity mention. The network parameters of BERT_domain are initialized from the model parameter file saved in the pre-training stage.

Relation matching: for this task, the input of BERT_domain is unchanged and is the sum of the character vector, the segment encoding and the position vector. One softmax classification layer is added on top of BERT_domain, which predicts whether the output Q is consistent with the entity word plus relation. BERT_domain is initialized from the model parameter file saved in the pre-training stage.

The difference between the invention and the prior art is that:

(1) The named entity recognition module and the relation matching module use a BERT model pre-trained on the disease domain and disease-domain character vector representations. Pre-training is carried out through an MLM task, and the pre-training corpus comes from clinical diagnosis and treatment documents and electronic medical records in the disease domain.

(2) Named entity recognition is a single-sentence task and therefore does not require segment encoding. Entities in the knowledge graph are used to assist word segmentation, and word-segmentation encodings are obtained from the segmentation result. The segment encoding in the original BERT model input is replaced by the word-segmentation encoding, so that the segmentation result assists entity boundary recognition.

(3) The entity linking module is implemented using a retrieval-based entity linking method. Because the entity recognition result directly affects the entity linking result, the DICE distance between the entity word and the sentence is added to the DICE distance calculation, which reduces the cases in which a partially recognized entity boundary pushes the DICE distance below the threshold.

The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims

1. A question-answer matching system based on a disease domain knowledge graph, comprising:

a named entity recognition module for determining the start and end positions of the entity mention, which predicts, for each character in the output Q, the probability of being the start position and the end position of the entity mention in the sentence, and determines the start and end positions of the entity mention from these probabilities, the named entity recognition module being obtained by fine-tuning BERT_domain;

a relation matching module for matching the relations corresponding to the entity word against the question input by the user and judging whether each relation is consistent with the user's question, and if so, outputting the result, the relation matching module being obtained by fine-tuning BERT_domain.

2. The question-answer matching system according to claim 1, wherein the preprocessing includes using regular expressions to remove spaces and punctuation marks from the text question entered by the user and converting all letters to lower case.

3. The question-answer matching system according to claim 1, characterized in that predicting the probability of each character in the output Q being the start position and the end position of the entity mention in the sentence, and determining the entity mention, is specifically:

the BERT_domain model, pre-trained on the disease domain, encodes the output Q of the preprocessing module and outputs feature vectors; from the feature vectors, two Softmax classification layers predict the probability of each character being the entity-mention start position and the entity-mention end position; and the character string between the start position and the end position is the entity mention.

4. The question-answer matching system according to any one of claims 1 to 3, characterized in that the pre-training process for obtaining the disease-domain BERT_domain model is as follows:

5. The question-answer matching system according to claim 3, characterized in that the named entity recognition module is obtained by fine-tuning BERT_domain, in which task-specific network parameters for the downstream task are added on top of BERT_domain and the parameters of the whole network are fine-tuned; the named entity recognition module adds two softmax classification layers on top of BERT_domain, which respectively predict the probability of each character in the output Q being the start position and the end position of the entity mention; in addition, in the fine-tuning stage the segment encoding in the original input of BERT_domain is replaced by the word-segmentation encoding, and the character encoding in the input of BERT_domain is initialized with the character vectors E_char saved during the pre-training stage.

6. The question-answer matching system according to claim 4, wherein the probabilities of the entity-mention start position and end position in the named entity recognition module are calculated using the following formula:

wherein L is the number of characters of Q,

7. The question-answer matching system according to claim 5, wherein the word-segmentation encoding is specifically: the output Q is segmented by a word segmentation tool, and each character is given a label according to the segmentation result; the label set is {B, M, E, S}, where B indicates that the character is the first character of a word, M that it is a middle character of a word, E that it is the last character of a word, and S that it is a single-character word; and the corresponding word-segmentation encodings are E_B, E_M, E_E and E_S respectively.

8. The question-answer matching system according to claim 1, characterized in that the entity linking module is implemented by retrieval and is divided into two stages, an offline stage and an online search stage:

online search stage: the question input by the user is preprocessed to obtain Q, the entity mention is obtained through the named entity recognition module, the entity mention is split into characters, the corresponding character indexes are looked up, the TF-IDF values are accumulated per entity word, the entity words are sorted by score in descending order, and the 2 entity words with the highest scores are selected as candidate entities;

the DICE distance d₁ between each candidate entity and the entity mention and the DICE distance d₂ between the candidate entity and Q are calculated; if their weighted sum is greater than the threshold, the candidate entity is retained; otherwise the candidate entity is removed.

9. The question-answer matching system according to claim 8, characterized in that the DICE distance is:

wherein X, Y and S denote the candidate entity word, the entity mention and the output Q of the preprocessing module respectively, alpha and beta are parameters, a ∩ b denotes the characters common to a and b, and |x| denotes the length in characters.

10. The question-answer matching system according to claim 3, characterized in that the network parameters of the two Softmax classification layers are different.