CN113076411B

CN113076411B - Medical query expansion method based on knowledge graph

Info

Publication number: CN113076411B
Application number: CN202110454713.5A
Authority: CN
Inventors: 方钰; 崔雪; 翟鹏珺
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2021-04-26
Filing date: 2021-04-26
Publication date: 2022-06-03
Anticipated expiration: 2041-04-26
Also published as: CN113076411A

Abstract

A medical query expansion method based on a knowledge graph. Query expansion in an automatic question-answering system reduces the semantic gap between questions and answers by supplementing the question with expansion information, thereby improving the accuracy of the system. In the field of medical question answering, existing query expansion methods do not fully combine the co-occurrence association and the reasoning association among medical terms under different query intentions, so the expansion words they obtain are not accurate enough. The invention uses a medical knowledge graph as the knowledge source of expansion words, obtains candidate expansion words by using the reasoning association of medical terms under different query intentions, and screens out the final expansion words by combining negative medical term recognition with mutual information, thereby improving the accuracy of the medical question-answering system.

Description

Medical query expansion method based on knowledge graph

Technical Field

The invention relates to the field of natural language processing, in particular to query processing in a question-answering system. Query expansion is an important link and key technology in an automatic question and answer system.

Background

With the rapid development of the Internet, more and more patients seek medical help through online health communities. However, the sharply increasing number of questions places a tremendous burden on the physicians who must reply. To relieve the workload of doctors and meet users' demand for quick answers, a large number of researchers have turned to the field of medical question answering. In a medical question-answering system, two key factors affect accuracy: word mismatch caused by differences in expression between questions and answers, and semantic deviation caused by differences in the amount of information they carry. For this reason, researchers have introduced query expansion, which supplements the query with related expansion words to reduce the deviation between questions and answers and thereby improve system performance.

In the current medical question-answering field, query expansion methods are mainly either keyword-based or semantics-based. Keyword-based query expansion picks keywords only at the statistical level and ignores the semantic information of the query, so it may add many irrelevant medical entities that introduce "noise" into the original query and degrade answer selection. Semantics-based query expansion uses a medical ontology library or a medical semantic dictionary to mine the latent semantics beyond the surface words of the query. At present, however, in the candidate acquisition stage, semantics-based research selects candidate expansion words based on the concept of a medical entity, and ignores the important role that the reasoning association of medical entities between questions and answers plays in guiding candidate acquisition. In the expansion word screening stage, some researchers use mutual information to screen candidate words, but they neglect the interference of negative medical entities on the mutual information values between entities.

Disclosure of Invention

In view of the defects of the prior art, the invention provides a semantic query expansion method for medical question answering based on entity association. The method combines the query intention with the reasoning association between entities to acquire candidate expansion words from a medical knowledge graph, and screens the expansion words with a strategy that combines negative medical entity recognition and mutual information.

Query expansion is an important link in automatic question-answering systems: it helps the question-answering model pick the correct answer by processing the original question. At present, most query expansion in the medical question-answering field obtains expansion words through pseudo-relevance feedback, through statistical relations among medical terms, or through semantic similarity among terms. The expansion words obtained in these ways may be irrelevant to the query intention, may not fit the medical scenario of the query, or may be only weakly related to the query, which introduces considerable noise into the question-answering system and reduces its accuracy.

To address these problems, the invention expands the user query as follows: it uses an SVM classifier to obtain the user's query intention, then obtains query-related candidate expansion words from a medical knowledge graph based on the reasoning association of medical terms under different query intentions, and finally screens them with negative term recognition and mutual information to obtain the final expansion words.

In order to achieve the purpose, the technical scheme provided by the invention is as follows:

the invention provides a medical query expansion method based on a knowledge graph, which comprises the following steps:

step 1, preprocessing a data set of medical question and answer pairs;

step 2, training an SVM classifier to predict the query intention of the question;

step 3, combining the query intention obtained in the step 2 to obtain candidate expansion words related to query from the medical knowledge graph;

step 4, screening the candidate expansion words obtained in step 3 by using negative medical term recognition and mutual information, so as to obtain the final expansion words.

Advantageous effects

The invention addresses the following problems of existing query expansion techniques in the medical question-answering field: they cannot accurately generate expansion words relevant to the medical scenario of the query; they do not fully combine the co-occurrence association and the reasoning association among medical terms under different query intentions; and they do not consider the influence of negative medical terms on the co-occurrence relations among terms. To this end, the invention realizes a medical query expansion method based on a knowledge graph. It uses a semi-supervised SVM classifier to obtain the user's query intention, uses the reasoning association among medical terms under different intentions to obtain candidate expansion words from a medical knowledge graph, and finally uses negative medical term recognition and mutual information to screen out the expansion words closely related to the query.

The method was experimentally verified on a data set of medical question-answer pairs. The resulting expansion words were observed to better fit the medical scenario of the query and to be more closely related to the query, and an improvement in answer selection performance was observed using the evaluation tool of the TREC conference. In the smart community scenario, the method is of great significance for providing residents with convenient and timely online medical services and for relieving the workload of doctors.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic flow chart of a query expansion method;

FIG. 2 is a flowchart of the query intent classification of question in step two;

FIG. 3 is a diagram of selecting candidate expansion words from the knowledge graph in step three;

FIG. 4 is a diagram of screening the expansion words by using negative medical term recognition and mutual information in step four.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, a detailed description of the embodiments of the present invention will be given below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.

The specific implementation process of the invention is shown in fig. 1, and comprises the following 4 aspects:

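step 1, preprocessing a data set of medical question and answer pairs;

step 2, training an SVM classifier to predict the query intention of the question;

step 3, combining the query intention obtained in step 2 to obtain candidate expansion words related to the query from the medical knowledge graph;

step 4, screening the candidate expansion words obtained in step 3 by using negative medical term recognition and mutual information, so as to obtain the final expansion words.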

Each step is described in detail below.

The first step: preprocessing the data set of Chinese medical question-answer pairs.

1.1 integrating question-answer pairs datasets

To keep the data set balanced and to facilitate the subsequent classification, invalid question-answer pairs are deleted, namely those that are unclearly expressed, lack an answer or a question, or whose question is contained in a picture. Question-answer pairs outside the four categories of disease diagnosis, disease symptom, disease treatment and disease cause are also deleted. The integrated data set is provided to step 1.2;

1.2 removing stop words

Stop words are removed from the questions and answers in the data set by using a stop word list. Stop words are mainly high-frequency words that carry no actual meaning, such as modal particles and polite expressions. The result after stop word removal is provided to step 1.4;

1.3 integrating domain dictionaries

Because no public and relatively complete Chinese medical knowledge base is currently available in China, ICD-9-CM, ICD-10, the 39 Health Network, the Sogou medical lexicon (by way of example and without limitation) and small-scale medical entity dictionaries published on the Internet are integrated to obtain a medical domain dictionary covering four categories: diseases, symptoms, medicines and examinations.

1.4 adding the domain dictionary into the dictionary of the jieba word segmenter, and segmenting the question in the data set by using the jieba word segmenter;

After word segmentation, the preprocessing of the question-answer data set in step 1 is complete. The question sentences in the preprocessed data set are provided to step 2, step 3 and step 4, and the domain dictionary is provided to step 3 and step 4.
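The preprocessing in steps 1.2 to 1.4 can be sketched roughly as follows. The text uses the jieba segmenter with the domain dictionary added to it; this sketch instead substitutes a small forward-maximum-matching segmenter so that it runs with the standard library alone. The dictionary entries, the stop word list and the choice to filter stop words after segmentation are all illustrative assumptions.

```python
# Stand-in for jieba: forward maximum matching against a word list.
def segment(text, dictionary, max_len=6):
    """At each position take the longest dictionary word, else one character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(question, domain_dict, stop_words):
    """Segment with the domain dictionary, then drop stop words."""
    tokens = segment(question, domain_dict | stop_words)
    return [t for t in tokens if t not in stop_words]


# Hypothetical domain dictionary and stop word list.
DOMAIN_DICT = {"发烧", "咳嗽", "肺炎"}
STOP_WORDS = {"请问", "是", "的", "了", "吗", "，", "？"}

tokens = preprocess("请问发烧咳嗽是肺炎吗？", DOMAIN_DICT, STOP_WORDS)
print(tokens)
```

With jieba itself, the domain dictionary would be registered with the segmenter before segmentation instead of being passed to a custom function.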

The second step: training the SVM classifier to predict the query intention of the question, as shown in FIG. 2.

2.1 labeling question classification labels

The intention category is labeled for a portion of the question sentences obtained in step 1. A question is labeled 0 if its query intention belongs to the disease diagnosis category, 1 if it belongs to the disease treatment category, 2 if it belongs to the disease symptom category, 3 if it belongs to the diagnosis and treatment category, and 4 if it belongs to the disease cause category. The labeled results are provided to step 2.2.

2.2 semi-supervised training SVM intention classifier

Since the questions in the data set do not come with intention categories, a self-training semi-supervised approach is used to train the intention classifier. The statistics from step 2.1 show that the data set is imbalanced, so the initial classifier uses a Support Vector Machine (SVM) algorithm suited to imbalanced samples. Training the classifier requires two kinds of features: (1) the TF-IDF features of the question; (2) the interrogative feature words of the question.

(1) TF-IDF is a commonly used feature vectorization method in text classification. It reflects the importance of a word in the whole corpus through term frequency (TF) and inverse document frequency (IDF). The calculation formula is as follows:

where t represents the word frequency of a given word, N represents the total word count of the document, x represents the total number of documents, and w represents the number of documents in which the word occurs.
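The formula itself is not reproduced in this text. A standard TF-IDF form consistent with the symbol definitions above, assuming no smoothing term in the denominator, would be:

```latex
\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF} = \frac{t}{N} \times \log\frac{x}{w}
```

For instance, under this assumed form a word that appears 2 times in a 10-word question and occurs in 10 of 1000 documents would score (2/10) × log(1000/10) = 0.2 × log 100.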

(2) The interrogative feature words of the four categories of question sentences are obtained from data set statistics. The question sentences are then processed with discrete feature encoding, which records whether a question contains the interrogative feature words of a given category (value 0 or 1).
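The discrete encoding in (2) can be illustrated with a short sketch. The category names and feature words below are hypothetical examples, not the word lists actually derived from the data set.

```python
# Hypothetical interrogative feature words per question category.
INTENT_FEATURE_WORDS = {
    "disease_diagnosis": ["什么病", "怎么回事"],
    "disease_treatment": ["怎么治", "吃什么药"],
    "disease_symptom": ["什么症状", "有哪些表现"],
    "disease_cause": ["什么原因", "为什么"],
}
CATEGORIES = list(INTENT_FEATURE_WORDS)


def encode_question(question):
    """Return one 0/1 flag per category: 1 if any of its feature words occurs."""
    return [
        int(any(word in question for word in INTENT_FEATURE_WORDS[category]))
        for category in CATEGORIES
    ]


vec = encode_question("宝宝咳嗽吃什么药")
print(dict(zip(CATEGORIES, vec)))
```

The resulting 0/1 vector would be concatenated with the TF-IDF vector of the question before being passed to the classifier.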

The trained intent classifier is provided to step 2.3.
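The self-training procedure of step 2.2 can be sketched as a loop that repeatedly fits a classifier on the labeled questions, pseudo-labels the unlabeled questions it is confident about, and adds them to the labeled set. To stay dependency-free, this sketch uses a toy nearest-centroid classifier in place of the SVM; the classifier, the confidence measure, the threshold and the sample vectors are all assumptions made for illustration.

```python
from collections import defaultdict


class CentroidClassifier:
    """Toy stand-in for the SVM; needs at least two classes."""

    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for vec, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, v in enumerate(vec):
                acc[i] += v
            counts[label] += 1
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict_with_confidence(self, vec):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(vec, cen)) ** 0.5, label)
            for label, cen in self.centroids.items()
        )
        (d1, best), (d2, _) = dists[0], dists[1]
        # Confidence is 0.5 when the two nearest centroids are equally far.
        confidence = 1.0 if d1 + d2 == 0 else d2 / (d1 + d2)
        return best, confidence


def self_train(model, X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Grow the labeled set with confident pseudo-labels until nothing is added."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        model.fit(X_lab, y_lab)
        remaining, added = [], 0
        for vec in pool:
            label, conf = model.predict_with_confidence(vec)
            if conf >= threshold:
                X_lab.append(vec)
                y_lab.append(label)
                added += 1
            else:
                remaining.append(vec)
        pool = remaining
        if added == 0 or not pool:
            break
    model.fit(X_lab, y_lab)
    return model, y_lab, pool


model, labels, unlabeled_left = self_train(
    CentroidClassifier(),
    X_lab=[[0.0, 0.0], [1.0, 1.0]],
    y_lab=[0, 1],
    X_unlab=[[0.1, 0.0], [0.9, 1.0], [0.5, 0.5]],
)
print(labels, unlabeled_left)
```

The ambiguous sample stays unlabeled because its confidence never reaches the threshold, which is the behavior that keeps low-quality pseudo-labels out of the training set.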

2.3 inputting the question to be classified into the trained SVM classifier, and providing the classification result (namely the query intention of the question) to the step 3.

The third step: candidate expansion words relevant to the query are obtained from the medical knowledge-graph, as shown in fig. 3.

3.1 medical knowledge graph acquisition

Triples labeled with the pediatrics department are extracted from a published general-purpose Chinese medical knowledge graph and combined with pediatric medical entity relations collected from health websites to build an integrated Chinese pediatric knowledge graph. The graph is provided to step 3.4.
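As a rough illustration of step 3.1, the sketch below filters a general graph by department and merges the result with separately collected triples. The row layout (head, relation, tail, department), the relation names and the sample entries are hypothetical.

```python
# Hypothetical rows of a general medical graph: (head, relation, tail, department).
GENERAL_KG = [
    ("pneumonia", "has_symptom", "cough", "pediatrics"),
    ("pneumonia", "has_symptom", "fever", "pediatrics"),
    ("gout", "has_symptom", "joint pain", "rheumatology"),
]

# Hypothetical pediatric relations collected from a health website.
WEBSITE_TRIPLES = [
    ("pneumonia", "has_symptom", "cough"),  # duplicates a general-graph triple
    ("cold", "has_symptom", "runny nose"),
]


def build_pediatric_kg(general_rows, extra_triples, department="pediatrics"):
    """Keep triples of one department, merge with extra triples, drop duplicates."""
    kept = {(h, r, t) for h, r, t, dept in general_rows if dept == department}
    return sorted(kept | set(extra_triples))


pediatric_kg = build_pediatric_kg(GENERAL_KG, WEBSITE_TRIPLES)
for triple in pediatric_kg:
    print(triple)
```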

3.2 counting the negative feature words and the termination feature words in the data set. The results are provided to step 3.3 and step 4.1.

3.3 query keyword acquisition

The initial query keywords of a sentence are screened according to the question intention category label provided in step 2.3, in combination with the domain dictionary obtained in step 1.3. The screening basis is as follows: symptom entities are selected as the initial query keywords for disease diagnosis questions, disease entities for disease treatment questions, disease entities for disease symptom questions, and symptom entities for diagnosis and treatment questions. Negative medical terms are then removed from the initial query keywords by using negative terms and terminating terms, giving the final query keywords. Specifically, a negative window is determined with a negative term and a terminating term as its boundaries, and every medical term inside the negative window is marked as a negative medical term. The negative terms are the negative feature words obtained in step 3.2, and the terminating terms comprise the termination feature words obtained in step 3.2 together with commas, periods and semicolons. The obtained query keywords are provided to step 3.4.
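The negative-window idea can be sketched as a single pass over the segmented question: a negative term opens the window, a terminating term closes it, and medical terms seen while the window is open are marked as negative. The word lists below are hypothetical, and treating the end of the sentence as an implicit terminator is an assumption.

```python
# Hypothetical negative feature words and terminating terms.
NEGATION_WORDS = {"没有", "无", "不"}
TERMINATORS = {"但是", "不过", "，", "。", "；"}


def split_negated(tokens, medical_terms,
                  negation_words=NEGATION_WORDS, terminators=TERMINATORS):
    """Return (affirmed, negated) medical terms in order of appearance."""
    affirmed, negated, in_window = [], [], False
    for tok in tokens:
        if tok in terminators:
            in_window = False          # a terminating term closes the window
        elif tok in negation_words:
            in_window = True           # a negative term opens the window
        elif tok in medical_terms:
            (negated if in_window else affirmed).append(tok)
    return affirmed, negated


tokens = ["宝宝", "咳嗽", "，", "没有", "发烧", "和", "腹泻", "，", "但是", "流鼻涕"]
affirmed, negated = split_negated(tokens, {"咳嗽", "发烧", "腹泻", "流鼻涕"})
print(affirmed, negated)
```

Only the affirmed terms would be kept as final query keywords; the same routine can be reused for the labeling in step 4.1.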

3.4 candidate expanded word acquisition

Combining the query keywords of step 3.3 with the query intentions obtained in step 2.3, the types of medical terms that may be present in the answers can be deduced based on the following reasoning formula.

[rule: (Q belongsTo C), (Q hasEntity M) → (A hasEntity N)]

In the formula, Q represents a question, A represents an answer, C represents a query intention, M represents a medical term type screened in the query, and N represents a corresponding medical term type in the answer.

For disease diagnosis questions, the disease entities that may correspond to each query keyword are obtained from the knowledge graph, and the intersection of the disease entity sets obtained for each symptom in the query is taken as the final candidate expansion words.

For disease treatment questions and disease symptom questions, the drug entities corresponding to the query keywords and the corresponding typical symptoms, respectively, are selected from the knowledge graph as candidate expansion words.

For compound diagnosis and treatment questions, the disease entities are first queried in the same way as for disease diagnosis questions, the commonly used drug entities are then queried from those disease entities in the same way as for disease treatment questions, and finally both the disease entities and the drug entities are output as candidate expansion words.

For disease cause questions, it is difficult to summarize a cause with a few individual expansion words, so this type of question is not processed for the time being, to avoid introducing a large amount of noise.

The resulting list of candidate expansion words is provided to step 4.2.
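The intent-dependent lookup of step 3.4 can be sketched as a small dispatch over the query intention. The intent labels follow step 2.1; the graph contents, the relation names `has_symptom` and `treated_by`, and the placeholder drug names are hypothetical.

```python
DIAGNOSIS, TREATMENT, SYMPTOM, DIAGNOSIS_AND_TREATMENT, CAUSE = range(5)

# Hypothetical toy graph: (disease, relation) -> set of related entities.
KG = {
    ("pneumonia", "has_symptom"): {"cough", "fever", "shortness of breath"},
    ("cold", "has_symptom"): {"cough", "fever", "runny nose"},
    ("bronchitis", "has_symptom"): {"cough", "wheezing"},
    ("pneumonia", "treated_by"): {"drug_A"},
    ("cold", "treated_by"): {"drug_B"},
}


def diseases_for_symptoms(symptoms):
    """Intersect the disease sets reachable from each symptom."""
    per_symptom = [
        {d for (d, rel), tails in KG.items() if rel == "has_symptom" and s in tails}
        for s in symptoms
    ]
    return set.intersection(*per_symptom) if per_symptom else set()


def related(entities, relation):
    """Union of the tails reachable from the given heads via one relation."""
    out = set()
    for entity in entities:
        out |= KG.get((entity, relation), set())
    return out


def candidate_expansions(intent, keywords):
    if intent == DIAGNOSIS:
        return diseases_for_symptoms(keywords)
    if intent == TREATMENT:
        return related(keywords, "treated_by")
    if intent == SYMPTOM:
        return related(keywords, "has_symptom")
    if intent == DIAGNOSIS_AND_TREATMENT:
        diseases = diseases_for_symptoms(keywords)
        return diseases | related(diseases, "treated_by")
    return set()  # disease cause questions are not expanded


print(candidate_expansions(DIAGNOSIS, ["cough", "fever"]))
```

Taking the intersection for diagnosis questions narrows the candidates to diseases consistent with every symptom mentioned, which is why adding a second symptom can only shrink the candidate set.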

The fourth step: screening all candidate expansion words by using a negative medical term recognition technology and a mutual information technology, as shown in fig. 4.

4.1 labeling all negative medical terms in the questions and answers of the data set, using the same labeling method as described in step 3.3. The labeling result is provided to step 4.2.

4.2 calculating the normalized mutual information value of the expansion word and the whole query, and screening to obtain the final expansion word

The mutual information between each candidate expansion word obtained in step 3.4 and the whole query is calculated, and the candidate expansion words whose normalized mutual information is smaller than the expansion threshold are selected as the final expansion words of the query. The mutual information of two words is calculated as follows:

The co-occurrence window is the range of one question-answer pair. c(w1, w2) represents the number of co-occurrence windows in which the term w1 appears in the question and the term w2 appears in the answer, c(w1) represents the number of times the medical term w1 appears in the corpus, c(w2) represents the number of times the medical term w2 appears in the corpus, and N represents the number of all medical terms in the corpus. When the mutual information matrix is computed, occurrences of the negative medical terms labeled in step 4.1 are not counted, so that negative medical terms do not distort the measured correlation between medical terms in the corpus.
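The two-word formula is not reproduced in this text. Assuming the usual pointwise mutual information with probabilities estimated from the counts defined above, it would take the form:

```latex
\begin{aligned}
I(w_1, w_2) &= \log \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)} \\
            &= \log \frac{c(w_1, w_2)/N}{\bigl(c(w_1)/N\bigr)\bigl(c(w_2)/N\bigr)}
             = \log \frac{N \cdot c(w_1, w_2)}{c(w_1)\,c(w_2)}
\end{aligned}
```

Under this assumed form, a pair with c(w1, w2) = 2, c(w1) = c(w2) = 3 and N = 13 gives I = log(26/9).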

Assuming that each key medical term qi in the initial query Q is independent, a calculation formula of mutual information values between the expansion words and the whole query sentence is as follows.

M(Q) = ∑_{qi∈Q} I(qi, w)

To make it convenient to set a screening threshold, the obtained mutual information value is normalized as follows, where Mmax and Mmin respectively represent the maximum value and the minimum value of M(Q).

NM(Q) = (Mmax - M(Q)) / (Mmax - Mmin)

The candidate expansion words whose normalized mutual information value NM(Q) with the whole query is smaller than the expansion threshold become the final expansion words.
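The screening in step 4.2 can be sketched end to end as follows. Several points are assumptions rather than details given in the text: the two-word mutual information is taken to be pointwise mutual information, N is taken to be the total number of term occurrences, pairs that never co-occur contribute 0, Mmax and Mmin are taken over the candidates of one query, and the toy corpus is assumed to have had negative medical terms removed already.

```python
import math
from collections import Counter


def build_counts(qa_pairs):
    """Count term occurrences and question/answer co-occurrences per QA pair."""
    pair_counts, term_counts = Counter(), Counter()
    for q_terms, a_terms in qa_pairs:
        term_counts.update(q_terms)
        term_counts.update(a_terms)
        for w1 in set(q_terms):
            for w2 in set(a_terms):
                pair_counts[(w1, w2)] += 1
    return pair_counts, term_counts


def pmi(w1, w2, pair_counts, term_counts):
    n = sum(term_counts.values())
    joint = pair_counts[(w1, w2)]
    if joint == 0:
        return 0.0  # assumption: no co-occurrence contributes nothing
    return math.log(joint * n / (term_counts[w1] * term_counts[w2]))


def select_expansions(query_terms, candidates, pair_counts, term_counts, threshold):
    """Keep candidates whose normalized score NM is below the threshold."""
    if not candidates:
        return []
    m = {w: sum(pmi(q, w, pair_counts, term_counts) for q in query_terms)
         for w in candidates}
    m_max, m_min = max(m.values()), min(m.values())
    if m_max == m_min:
        return sorted(m)
    nm = {w: (m_max - score) / (m_max - m_min) for w, score in m.items()}
    return sorted(w for w, value in nm.items() if value < threshold)


# Hypothetical corpus of (question terms, answer terms).
QA_PAIRS = [
    (["cough", "fever"], ["pneumonia"]),
    (["cough", "fever"], ["pneumonia"]),
    (["cough"], ["cold"]),
    (["fever"], ["cold", "pneumonia"]),
    (["rash"], ["eczema"]),
]
pair_counts, term_counts = build_counts(QA_PAIRS)
selected = select_expansions(["cough", "fever"], ["pneumonia", "cold", "eczema"],
                             pair_counts, term_counts, threshold=0.5)
print(selected)
```

Because NM(Q) = (Mmax - M(Q)) / (Mmax - Mmin), the candidate with the highest mutual information gets NM(Q) = 0 and the one with the lowest gets 1, so keeping values below the threshold retains the candidates most strongly associated with the query.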

Innovation point

The invention provides a medical query expansion method based on a knowledge graph, which differs from the query expansion methods currently used in the field of medical question answering. The method determines the user's query intention with a classifier, then acquires candidate expansion words from a knowledge graph by combining the reasoning association of medical terms under different query intentions, and finally screens them by combining negative medical term recognition with mutual information to obtain the final expansion words. Compared with the synonym-based query expansion commonly used in the field of medical question answering, the method obtains more accurate expansion words.

The method provided by the invention has good performance on the data set of Chinese medical question and answer pairs, and improves the accuracy of the Chinese medical question and answer system.

Claims

1. A medical query expansion method based on knowledge graph is characterized by comprising the following steps:

step 1, preprocessing a data set of medical question and answer;

1.1 integrating question-answer pairs datasets

Deleting invalid question-answer pairs which are unclearly expressed, lack an answer or a question, or whose question is contained in a picture, and deleting question-answer pairs outside the four categories of disease diagnosis, disease symptom, disease treatment and disease cause, in order to ensure the balance of the data set and facilitate subsequent classification; providing the integrated data set to step 1.2;

1.2 removing stop words

Removing stop words of question and answer in the data set by using a stop word vocabulary, wherein the stop words comprise words with high use frequency and no actual meanings; the result after the stop word is removed is provided to step 1.4;

1.3 integrating domain dictionaries

Constructing a medical field dictionary by integrating various existing medical entity dictionaries, wherein the medical field dictionary comprises four categories of diseases, symptoms, medicines and examinations;

after word segmentation, the preprocessing of the question-answer data set in step 1 is completed, the question sentences in the preprocessed data set are provided to step 2, step 3 and step 4, and the domain dictionary is provided to step 3 and step 4;

2.1 labeling question classification labels

Marking the intention type of the partial question sentences obtained in the step 1, and marking the inquiry intention of the question sentences as 0 if the inquiry intention belongs to the disease diagnosis type; if the query intention of the question belongs to the disease treatment category, the label is 1; if the query intention of the question belongs to the disease symptom class, marking as 2; if the query intention of the question belongs to the diagnosis and treatment category, the query intention is marked as 3; if the query intention of the question belongs to the disease cause class, marking as 4; the annotated result is provided to step 2.2;

2.2 semi-supervised training SVM intention classifier

An intention classifier is trained with a self-training semi-supervised method, the initial classifier using a Support Vector Machine (SVM) algorithm for imbalanced samples; training the classifier requires two kinds of features: (1) the TF-IDF features of the question; (2) the interrogative feature words of the question:

(1) TF-IDF is a commonly used feature vectorization method in text classification, which reflects the importance of a word in the whole corpus through term frequency (TF) and inverse document frequency (IDF), and the calculation formula is as follows:

wherein t represents the word frequency of a given word, N represents the total word count of the document, x represents the total number of documents, and w represents the number of documents in which the word occurs;

(2) the interrogative feature words of the four categories of question sentences are obtained from data set statistics, the question sentences are processed with discrete feature encoding, and whether a question contains the interrogative feature words of a given category is recorded as a value of 0 or 1;

providing the trained intention classifier to the step 2.3;

2.3 inputting the question to be classified into the trained SVM classifier, and providing the classification result, namely the query intention of the question, to the step 3;

step 3, combining the query intention obtained in the step 2 to obtain candidate expansion words related to query from the medical knowledge graph:

3.1 medical knowledge graph acquisition

Extracting triples labeled with the pediatrics department from a published general-purpose Chinese medical knowledge graph, and acquiring pediatric medical entity relations from a pediatric question-answer corpus crawled from the 39 Health Network by using a BERT-based relation extraction method, so that the triples and the extracted relations are integrated into a Chinese pediatric knowledge graph; the graph is provided to step 3.4;

3.2 counting negative characteristic words and termination characteristic words in the data set; providing to step 3.3 and step 4.1;

3.3 query keyword acquisition

Screening the initial query keywords of the sentence according to the question intention category labels provided in the step 2.3 and by combining the domain dictionary obtained in the step 1.3; the screening basis is that a symptom entity is selected as an initial query keyword for a disease diagnosis question, a disease entity is selected as an initial query keyword for a disease treatment question, a disease entity is selected as an initial query keyword for a question of symptom type, and a symptom entity is selected as an initial query keyword for a diagnosis and treatment question; then, removing negative medical terms in the initial query keywords by using the negative terms and the termination terms to obtain final query keywords; the specific idea is to determine a negative window by taking a negative term and a terminating term as boundaries, wherein medical terms in the negative window are all marked as negative medical terms, the negative term is a negative characteristic word obtained in step 3.2, and the terminating term comprises the terminating characteristic word obtained in step 3.2 and commas, periods and semicolons; providing the obtained query key words to step 3.4;

3.4 candidate expanded word acquisition

Combining the query keywords of step 3.3 with the query intentions obtained in step 2.3, the types of medical terms that may be present in the answers can be deduced based on the following reasoning formula;

[rule:(Q belongsTo C),(Q hasEntity M)→(A hasEntity N)]

in the formula, Q represents a question, A represents an answer, C represents a query intention, M represents a medical term type screened in the query, and N represents a corresponding medical term type in the answer;

for the sentences of disease diagnosis, acquiring disease entities possibly corresponding to the query keywords from the knowledge graph, and taking intersection of the disease entities obtained by each symptom in the query as final candidate expansion words;

for sentences of a disease treatment class and a symptom inquiry class, respectively selecting a drug entity corresponding to a query keyword and a corresponding typical symptom from a knowledge graph as candidate expansion words;

for the diagnosis and treatment compound question sentence, firstly inquiring a disease entity according to the processing method of a disease diagnosis sentence, then inquiring a commonly used medicine entity according to the disease entity according to the processing method of a disease treatment sentence, and finally, taking the disease entity and the medicine entity as candidate expansion words to be output;

for the disease reason sentences, the question sentences of the type are not processed for the time being;

the obtained candidate expansion word list is provided for the step 4;

step 4, screening the candidate expansion words obtained in step 3 by using negative medical term recognition and mutual information to obtain the final expansion words:

4.1 labeling all negative medical terms in the questions and answers of the data set, the labeling method being the same as that described in step 3.3; the labeling result is provided to step 4.2;

Calculating the mutual information between each candidate expansion word obtained in step 3.4 and the whole query, and selecting the candidate expansion words whose normalized mutual information is smaller than the expansion threshold as the final expansion words of the query; the mutual information of two words is calculated as follows:

the co-occurrence window is the range of one question-answer pair, wherein c(w1, w2) represents the number of co-occurrence windows in which the term w1 appears in the question and the term w2 appears in the answer, c(w1) represents the number of times the medical term w1 appears in the corpus, c(w2) represents the number of times the medical term w2 appears in the corpus, and N represents the number of all medical terms in the corpus; when the mutual information matrix is computed, occurrences of the negative medical terms labeled in step 4.1 are not counted;

assuming that each key medical term qi in the initial query Q is independent, the calculation formula of the mutual information value between the expansion word and the whole query statement is as follows:

M(Q) = ∑_{qi∈Q} I(qi, w)

to make it convenient to set a screening threshold, the obtained mutual information value is normalized as follows, wherein Mmax and Mmin respectively represent the maximum value and the minimum value of M(Q);

NM(Q) = (Mmax - M(Q)) / (Mmax - Mmin)

the candidate expansion words whose normalized mutual information value NM(Q) with the whole query is smaller than the expansion threshold become the final expansion words.