CN113254609A

CN113254609A - Question-answering model integration method based on negative sample diversity

Info

Publication number: CN113254609A
Application number: CN202110516176.2A
Authority: CN
Inventors: 方钰; 翟鹏珺; 崔雪
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2021-08-13
Anticipated expiration: 2041-05-12
Also published as: CN113254609B

Abstract

A question-answering model integration method based on negative sample diversity is disclosed. In the question-answer matching stage of an automatic question-answering system, an integrated model captures multi-angle information in the corpus, which improves the accuracy and stability of the question-answering system. In the field of Chinese medical question answering, most existing question-answering model integration methods obtain negative samples by random sampling or by segmented sampling on a single similarity distance; they ignore the diversity of the negative samples, so the diversity of the base models is insufficient and the effect of the integrated model suffers. According to the invention, the negative samples are ranked and sampled in segments according to several similarity distances between the positive and negative samples, forming a plurality of training sample sets; a plurality of base models are trained on these training sample sets and finally integrated, which remedies the insufficient diversity of the base models and improves the stability and accuracy of the question-answering model.

Description

Question-answering model integration method based on negative sample diversity

Technical Field

The invention relates to the field of natural language processing, in particular to processing of model integration in a question-answering system.

Model integration is an important method and key technology for improving the performance of a question-answering model in an automatic question-answering system.

Background

The medical question-answering model is an application branch of the automatic question-answering model and has become a focus of research and application as natural language processing technology has improved. At the same time, more and more patients seek medical assistance through online health communities. However, the sharply increasing number of questions places a heavy burden on the physicians who must answer them. To reduce doctors' workload and meet users' demand for quick answers, many researchers have turned to the field of medical question answering. In a medical question-answering system, ensuring the accuracy and robustness of the model is a technical difficulty, so some scholars use integrated (ensemble) learning to capture more information from the data and thereby improve the performance of the question-answering system.

At present, model integration methods in the Chinese medical field generally obtain the negative samples in the training data either by random sampling or by segmented sampling based on a single similarity distance.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a question-answering model integration method based on negative sample diversity. The method performs segmented sampling of negative samples under multiple similarity distances to construct multiple training sets, and trains multiple base models on these training sets. The aim is to secure the diversity of the base models through the diversity of the negative samples and thereby improve the accuracy and robustness of the resulting integrated model.

As a service platform that provides medical and health consultation to users, a medical question-answering system requires high accuracy and stability. In the question-answer matching stage of a question-answering system, an integrated model is often more accurate and robust than a single learner, so integrated learning has also been introduced into question-answering research. Different negative samples allow a model to learn different linguistic expression information, but current research on integrated models often gives insufficient consideration to the diversity of negative samples in the model training stage, which limits the diversity of the base models and affects the prediction performance of the final integrated model.

To address these problems, the invention provides a model integration method based on multi-similarity segmented sampling of negative samples, aiming to improve the stability and robustness of a Chinese medical question-answering model. The method ranks the negative samples and samples them in segments according to several similarity distances between the positive and negative samples, forming a plurality of training sample sets; it then trains a plurality of base models on these training sample sets and finally integrates the base models.

In order to achieve the purpose, the technical scheme provided by the invention is as follows:

the invention provides a question-answering model integration method based on negative sample diversity, which comprises the following steps:

step 1, preprocessing a data set of medical question and answer pairs;

step 2, sorting the similarity of the negative samples;

step 3, combining the negative sample sequencing result obtained in the step 2, carrying out segmented sampling on the negative samples, constructing a plurality of training sets and training a base model;

step 4, integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model.

Advantageous effects

To address the problem that existing model integration methods for improving the performance of Chinese medical question-answering models give insufficient consideration to the diversity of negative samples, the invention designs a model integration method based on multi-similarity segmented sampling of negative samples. The method ranks the negative samples and samples them in segments according to several similarity distances between the positive and negative samples to obtain a plurality of training sample sets, trains a plurality of base models on these sets, and finally performs model integration. By fully exploiting the diversity of the negative samples to obtain diverse base models, the method improves the accuracy of the final integrated model. In smart-community scenarios, the method is of great significance for providing residents with convenient and timely online medical services and for reducing doctors' workload.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic flow diagram of a model integration method;

FIG. 2 shows the result of the preliminary experiment to determine the vocabulary weights for the morphological similarity in step two;

FIG. 3 is an exemplary diagram of segmented sampling in step three;

FIG. 4 shows the result of the preliminary experiment to determine the optimal number of segments in step three.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, a detailed description of the embodiments of the present invention will be given below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.

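The specific implementation process of the invention is shown in FIG. 1 and comprises the following 4 steps:

step 1, preprocessing a data set of medical question and answer pairs;

step 2, sorting the similarity of the negative samples;

step 3, combining the negative sample ranking results obtained in step 2, carrying out segmented sampling on the negative samples, constructing a plurality of training sets and training base models;

step 4, integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model.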

Each step is described in detail below.

The first step: preprocessing the Chinese medical question-answer data set.

1.1 integrating question-answer pairs datasets

Invalid question-answer pairs are deleted, namely those that contain no answer, are ambiguously expressed, or contain pictures in the question or answer sentence. In order to ensure the balance of the data set, question-answer sentences outside the four categories of disease diagnosis, disease treatment, disease symptoms and disease causes are also deleted. The integrated data set is provided to step 1.2;

1.2 removing stop words

Stop words are removed from the questions and answers in the data set by using a stop-word list. The stop words are mainly words that occur with high frequency but carry no actual meaning, such as modal particles and polite expressions. The result after stop-word removal is provided to step 1.3 and step 1.4;
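Stop-word removal as described in step 1.2 can be sketched as follows. This is a minimal illustration that assumes the sentences have already been segmented into words; the stop-word list shown is a tiny hypothetical sample, not the list used by the invention.

```python
# Hypothetical stop-word sample; a real system would load a full Chinese stop-word list.
STOP_WORDS = {"的", "了", "吗", "呢", "请问", "谢谢"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t not in stop_words]

# A pre-segmented question (segmentation itself is assumed to happen earlier).
question = ["请问", "感冒", "了", "吃", "什么", "药", "呢"]
cleaned = remove_stop_words(question)
```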

1.3 annotating question-answer positive samples

The correct answer corresponding to each question sentence in the data set provided in step 1.2 is marked, thereby obtaining the positive samples of question-answer pairs, and the marking result is provided to step 1.4.

1.4 randomly initializing question-answer negative samples

Based on the question-answer positive samples marked in step 1.3, answers are randomly assigned to each question sentence from all answers provided in step 1.2, with the constraint that an assigned answer cannot be the same as the answer in the positive sample; the resulting question-answer pairs are marked as negative samples, which completes the random initialization of the question-answer negative samples. After labeling, the preprocessing of the question-answer data set in step 1 is complete, and the question-answer sentences in the preprocessed data set are provided to step 2, step 3 and step 4.
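The random initialization of negative samples in step 1.4 can be sketched as follows. The function name `init_negatives`, the number of negatives per question and the fixed seed are assumptions made for illustration; the description only requires that a negative answer differ from the positive one.

```python
import random

def init_negatives(positives, all_answers, num_neg, seed=0):
    """positives maps each question to its correct answer.

    Returns a dict mapping each question to randomly chosen wrong answers.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    negatives = {}
    for question, gold in positives.items():
        # Exclude the positive answer so it can never be drawn as a negative.
        pool = [a for a in all_answers if a != gold]
        negatives[question] = rng.sample(pool, min(num_neg, len(pool)))
    return negatives

positives = {"q1": "a1", "q2": "a2", "q3": "a3"}
all_answers = ["a1", "a2", "a3", "a4", "a5"]
negatives = init_negatives(positives, all_answers, num_neg=3)
```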

The second step: ranking the negative samples by similarity.

2.1 calculating morphological similarity of positive and negative samples

For the answers in the question-answer positive and negative samples obtained in step 1, the distance between the answers is calculated with the TF-IDF algorithm, a statistical method that measures the importance of words in a text, and the result is provided to step 2.2.
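One way to compare a positive answer with negative answers using TF-IDF is sketched below. The description does not state the exact TF-IDF variant or distance, so the smoothed inverse document frequency and the cosine similarity used here are assumptions, and the toy answers are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed variant: relative term frequency times a smoothed IDF.
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

positive = ["感冒", "多", "喝水", "休息"]
neg_close = ["感冒", "多", "休息"]   # shares words with the positive answer
neg_far = ["骨折", "需要", "手术"]    # shares no words with the positive answer
vecs = tfidf_vectors([positive, neg_close, neg_far])
sim_close = cosine(vecs[0], vecs[1])
sim_far = cosine(vecs[0], vecs[2])
```

In this toy case the negative that shares words with the positive answer receives the higher similarity, which is the property the later ranking relies on.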

2.2 computing lexical weights

The question-answer corpus provided in step 1 belongs to the medical field, where domain vocabulary is often more discriminative and important than common vocabulary. Therefore, on the basis of step 2.1, the medical domain vocabulary is given a higher weight to highlight its importance; that is, a domain-word-weighted TF-IDF algorithm is adopted to optimize the calculation of the morphological similarity distance between positive and negative samples.

The value of the weight directly influences the performance of the similarity algorithm, so a preliminary experiment is designed to determine the weight ratio of domain vocabulary to common vocabulary in the question-answer corpus provided in step 1. As shown in FIG. 2, the preliminary experiment uses ACC@1 as the evaluation index and compares the performance of the initial integrated model as the weight ratio of common vocabulary to domain vocabulary is adjusted. Here, the initial integrated model combines 6 BIGRU_CNN models obtained by segmented sampling of negative samples based on morphological similarity.

The results of the preliminary experiment show that the initial integrated model works best when the weight ratio of common vocabulary to domain vocabulary is 0.6. Therefore, in the domain-word-weighted TF-IDF algorithm, the weights of domain words and common words are given by formulas (1) and (2), where ω1 is a domain word, c1 is a common word, W′ is the original weight based on term frequency and inverse document frequency, W(ω1) is the weighted domain-word weight, and W(c1) is the weighted common-word weight.

W(ω1) = 1 × W′(ω1)    (1)

W(c1) = 0.6 × W′(c1)    (2)

W(ω1) and W(c1) are substituted into the TF-IDF algorithm, the resulting morphological similarity values are sorted in descending order, and the sorted results are provided to step 3.
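Formulas (1) and (2) and the descending sort can be sketched as follows. The function names, the example weights and the negative sample identifiers are hypothetical; only the factors 1 and 0.6 come from the description.

```python
DOMAIN_WEIGHT = 1.0   # factor in formula (1)
COMMON_WEIGHT = 0.6   # factor in formula (2)

def reweight(vector, domain_terms):
    """Scale each original TF-IDF weight W' by 1.0 (domain word) or 0.6 (common word)."""
    return {
        term: (DOMAIN_WEIGHT if term in domain_terms else COMMON_WEIGHT) * w
        for term, w in vector.items()
    }

def rank_descending(scores):
    """scores maps a negative sample id to its similarity; most similar first."""
    return sorted(scores, key=scores.get, reverse=True)

original = {"感冒": 0.5, "休息": 0.5}             # hypothetical W' values
weighted = reweight(original, domain_terms={"感冒"})
order = rank_descending({"n1": 0.2, "n2": 0.9, "n3": 0.5})
```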

2.3 calculating the similarity between the field vocabularies in the positive and negative samples

Because the tree structure of CMeSH (Chinese Medical Subject Headings) clearly shows the semantic relationships among medical domain words, the invention uses CMeSH to calculate the similarity between the medical domain words contained in the answers of the positive and negative samples provided in step 1, and provides the similarity result to step 2.4. Specifically, the semantic similarity Sim(ω1, ω2) between medical domain words is calculated from the semantic distance between the words ω1 and ω2, as shown in formula (3), where Dist(ω1, ω2) denotes the semantic distance between the domain words.
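A tree-based semantic distance of the kind described here can be sketched as follows. Several things are assumptions: that each domain word is identified by a dotted tree number, that Dist counts the edges on the path through the deepest common ancestor, and that distance is mapped to similarity as alpha / (alpha + Dist). The last mapping is a common choice and is not claimed to be formula (3); the tree numbers are made up.

```python
def tree_distance(code_a, code_b):
    """Number of edges between two nodes identified by dotted tree numbers."""
    a, b = code_a.split("."), code_b.split(".")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Steps up from each node to the deepest shared ancestor.
    return (len(a) - common) + (len(b) - common)

def similarity(code_a, code_b, alpha=1.0):
    """Assumed distance-to-similarity mapping; NOT necessarily formula (3)."""
    return alpha / (alpha + tree_distance(code_a, code_b))

# Hypothetical tree numbers for two sibling concepts and one distant concept.
sibling_sim = similarity("C01.100.200", "C01.100.300")
distant_sim = similarity("C01.100.200", "C05.400")
```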

2.4 calculating semantic similarity of Positive and negative samples

The semantic similarity between the answers of the positive and negative samples is calculated according to formula (4) from the domain vocabulary similarity provided in step 2.3, and the results are sorted in descending order and provided to step 3. Here M and N are the vocabulary sets of the two sentences, N₁, N₂, …, N_n are the words in the set N, and the maximum similarity maxValue(ω, N) between a medical domain word ω and the words in the sentence is given by formula (5).

maxValue(ω, N) = max(Sim(ω, N₁), Sim(ω, N₂), …, Sim(ω, N_n))    (5)
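Formula (5) can be sketched directly, as below. The sentence-level aggregation in `sentence_similarity` is an assumption (a symmetric mean of the maxValue terms) and may differ from formula (4); the word similarity table `TOY_SIM` is invented for illustration.

```python
def max_value(word, sentence_words, sim):
    """Formula (5): the largest similarity between `word` and any word of the sentence."""
    return max((sim(word, other) for other in sentence_words), default=0.0)

def sentence_similarity(m_words, n_words, sim):
    """Assumed aggregation over both word sets; formula (4) may differ."""
    if not m_words or not n_words:
        return 0.0
    forward = sum(max_value(w, n_words, sim) for w in m_words)
    backward = sum(max_value(w, m_words, sim) for w in n_words)
    return (forward + backward) / (len(m_words) + len(n_words))

# Hypothetical word-to-word similarities standing in for the CMeSH-based Sim.
TOY_SIM = {frozenset(("感冒", "流感")): 0.8, frozenset(("感冒", "骨折")): 0.1}

def toy_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)

best = max_value("感冒", ["流感", "骨折"], toy_sim)
score = sentence_similarity(["感冒"], ["流感", "骨折"], toy_sim)
```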

The third step: combining the negative sample ranking results obtained in step 2.2 and step 2.4, carrying out segmented sampling on the negative samples, constructing a plurality of training sets and training the base models.

3.1 segmented sampling

Based on the negative sample ranking results obtained in step 2.2 and step 2.4, the negative samples are divided into uniform segments on each of the two similarity rankings, morphological and semantic, and samples are drawn within different segments to form different training sets. For the i-th question, the negative sample set collected in segment l is restricted to the negatives that fall in the l-th segment of the ranking, with the negatives indexed by j ∈ [1, k-1], where k-1 is the total number of negative samples and L is the number of segments. Because the negative sample sequence is sorted in descending order of similarity, the sets drawn from earlier segments contain negatives with higher similarity to the positive sample, and the sets drawn from later segments contain negatives with lower similarity.

FIG. 3 illustrates segmented sampling of negative samples based on morphological similarity, taking the number of segments L = 3 as an example; the distance between samples in the figure reflects their morphological similarity. S is the positive sample and S1 to S9 are negative samples. The negative samples in the first sample set are drawn from S1, S2 and S3 in the first segment, those in the second sample set are drawn from S4, S5 and S6 in the second segment, and so on, so that three training sample sets are finally generated.
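The S1 to S9 example with L = 3 can be sketched as follows. The function names, the number of negatives drawn per segment and the fixed seed are assumptions for illustration.

```python
import random

def split_into_segments(ranked_negatives, num_segments):
    """Split a ranked list into contiguous segments of near-equal size."""
    n = len(ranked_negatives)
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [ranked_negatives[bounds[i]:bounds[i + 1]] for i in range(num_segments)]

def sample_per_segment(ranked_negatives, num_segments, per_segment, seed=0):
    """Draw negatives independently inside each segment, one set per segment."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [
        rng.sample(segment, min(per_segment, len(segment)))
        for segment in split_into_segments(ranked_negatives, num_segments)
    ]

# Negatives already sorted from most to least similar to the positive sample S.
ranked = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
segments = split_into_segments(ranked, 3)
training_negatives = sample_per_segment(ranked, 3, per_segment=2)
```

Running the same procedure on the morphological ranking and on the semantic ranking would give 2L negative sample sets, one per base model.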

3.2 determining the number of segments

The number of segments directly determines the number of base models and the learning granularity of the base models, so the invention determines a suitable number of segments through a preliminary experiment. The preliminary experiment uses ACC@1 as the evaluation index and is based on the BIGRU_CNN model structure; the experimental result is shown in FIG. 4. The results show that ACC@1 is best when the number of segments is 3, that is, when the 3 base models obtained by segmented sampling based on the semantic similarity of the negative samples are integrated with the 3 base models obtained by segmented sampling based on the morphological similarity of the negative samples. When the number of segments is too small, the advantage of multi-granularity learning cannot be exploited, which limits the performance of the model. When the number of segments is 4 or 5, however, the discrimination between the base models decreases, so ACC@1 drops slightly compared with 3 segments, and the larger number of base models lengthens the computation time. The number of segments is therefore set to 3. The negative sample sets collected with 3 segments are each combined with the positive samples to form training sets, the models are trained, and the resulting base models M_i are provided to step 4.
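The ACC@1 index used in the preliminary experiments can be sketched as follows, assuming its usual meaning: the fraction of questions for which the highest-scoring candidate answer is a correct one. The input layout and the scores are hypothetical.

```python
def acc_at_1(scored_candidates):
    """scored_candidates holds, per question, a list of (score, is_correct) pairs."""
    hits = 0
    for candidates in scored_candidates:
        # Pick the candidate the model ranks first for this question.
        best = max(candidates, key=lambda c: c[0])
        if best[1]:
            hits += 1
    return hits / len(scored_candidates)

scored = [
    [(0.9, True), (0.4, False), (0.1, False)],  # correct answer ranked first
    [(0.2, True), (0.7, False)],                # a wrong answer ranked first
    [(0.3, False), (0.8, True)],                # correct answer ranked first
]
acc = acc_at_1(scored)
```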

The fourth step: integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model.

All base models M_i (i ∈ [1, 2L]) provided in step 3 are integrated by weighted averaging. The weight w_i of each base model depends on its accuracy p_i on the validation set, so that a base model with higher accuracy receives a larger share of the weight in the integrated model. The prediction probability H(x) of the final integrated model is given by formulas (6) and (7), where T is the total number of base models, h_i(x) is the prediction result of each base model, and w_i is the weight of each base model.
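The weighted-average integration can be sketched as follows. The normalization w_i = p_i / Σ p_j is an assumption that merely satisfies the stated property that more accurate base models receive larger weights; it is not claimed to be formula (7). The accuracies and predictions are hypothetical values for T = 2L = 6 base models.

```python
def accuracy_weights(accuracies):
    """Assumed normalization: each weight is the accuracy divided by the sum of accuracies."""
    total = sum(accuracies)
    return [p / total for p in accuracies]

def ensemble_predict(base_predictions, weights):
    """Weighted average H(x) of the base model outputs h_i(x) for one input x."""
    return sum(w * h for w, h in zip(weights, base_predictions))

accuracies = [0.80, 0.70, 0.90, 0.60, 0.75, 0.85]   # hypothetical p_i on a validation set
weights = accuracy_weights(accuracies)
base_predictions = [0.9, 0.4, 0.8, 0.3, 0.6, 0.7]   # hypothetical h_i(x) for one candidate
combined = ensemble_predict(base_predictions, weights)
```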

Innovation point

The invention provides a question-answering model integration method based on negative sample diversity. Unlike current model integration methods in the field of medical question answering, the method ranks the negative samples under multiple similarity distances. It then acquires different negative sample sets by segmented sampling, uses each of them in a training data set to train a different base model, and finally integrates the base models through weighted averaging.

The model integration method provided by the invention has better performance on the data set of the Chinese medical question and answer, and improves the accuracy of the Chinese medical question and answer system.

Claims

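1. A question-answer model integration method based on negative sample diversity, characterized by comprising:

step 1, preprocessing a data set of medical question and answer pairs;

step 2, sorting the similarity of the negative samples;

step 3, combining the negative sample ranking results obtained in step 2, carrying out segmented sampling on the negative samples, constructing a plurality of training sets and training base models;

step 4, integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model.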

2. The negative sample diversity-based question-answer model integration method according to claim 1, characterized in that the first step is: preprocessing the Chinese medical question-answer data set;

1.1 integrating question-answer pairs datasets

Deleting invalid question-answer pairs which contain no answer, are ambiguously expressed, or contain pictures in the question sentences or answer sentences; in order to ensure the balance of the data set, deleting question-answer sentences outside the four categories of disease diagnosis, disease treatment, disease symptoms and disease causes; providing the integrated data set to step 1.2;

1.2 removing stop words

Removing stop words from the questions and answers in the data set by using a stop-word list, wherein the stop words are mainly words that occur with high frequency but carry no actual meaning, such as modal particles and polite expressions; the result after stop-word removal is provided to step 1.3 and step 1.4;

1.3 annotating question-answer positive samples

Marking the correct answer corresponding to each question sentence in the data set provided in the step 1.2, thereby obtaining a positive sample of question-answer pairs, and providing a marking result to the step 1.4;

1.4 randomly initializing question-answer negative samples

On the basis of the question-answer positive sample labeled in the step 1.3, matching answers are randomly given to the question sentences from all answers provided in the step 1.2, the answer cannot be the same as the answer in the positive sample, and then the question-answer pairs are labeled as negative samples, so that the random initialization of the question-answer pair negative samples is completed; after labeling, the preprocessing work of the step 1 on the data set of the question and answer is completed, and the question sentences in the preprocessed data set are provided for the step 2, the step 3 and the step 4.

3. The negative sample diversity-based question-answer model integration method according to claim 1, characterized in that the second step is: ranking the negative samples by similarity;

2.1 calculating morphological similarity of positive and negative samples

For the answers in the question-answer positive and negative samples obtained in step 1, calculating the distance between the answers with the TF-IDF algorithm, a statistical method that measures the importance of words in a text, and providing the result to step 2.2;

2.2 computing lexical weights

The question-answer corpus provided in step 1 belongs to the medical field, where domain vocabulary is often more discriminative and important than common vocabulary, so that, on the basis of step 2.1, the medical domain vocabulary is given a higher weight to highlight its importance, that is, a domain-word-weighted TF-IDF algorithm is adopted to optimize the calculation of the morphological similarity distance between positive and negative samples;

the value of the weight directly influences the performance of the similarity algorithm, so a preliminary experiment is designed to determine the weight ratio of domain vocabulary to common vocabulary in the question-answer corpus provided in step 1; the preliminary experiment uses ACC@1 as the evaluation index and compares the performance of the initial integrated model as the weight ratio of common vocabulary to domain vocabulary is adjusted; here, the initial integrated model combines 6 BIGRU_CNN models obtained by segmented sampling of negative samples based on morphological similarity;

the initial integrated model works best when the weight ratio of common vocabulary to domain vocabulary is 0.6, so that in the domain-word-weighted TF-IDF algorithm the weights of domain words and common words are given by formulas (1) and (2); wherein ω1 is a domain word, c1 is a common word, W′ is the original weight based on term frequency and inverse document frequency, W(ω1) is the weighted domain-word weight, and W(c1) is the weighted common-word weight;

W(ω1) = 1 × W′(ω1)    (1)

W(c1) = 0.6 × W′(c1)    (2)

substituting W(ω1) and W(c1) into the TF-IDF algorithm, sorting the resulting morphological similarity values in descending order, and providing the sorted results to step 3;

2.3 calculating the similarity between the domain words in the positive and negative samples

Because the tree structure of CMeSH (Chinese Medical Subject Headings) clearly shows the semantic relationships among medical domain words, the invention uses CMeSH to calculate the similarity between the medical domain words contained in the answers of the positive and negative samples provided in step 1 and provides the similarity result to step 2.4; specifically, the semantic similarity Sim(ω1, ω2) between medical domain words is calculated from the semantic distance between the words ω1 and ω2, as shown in formula (3), where Dist(ω1, ω2) denotes the semantic distance between the domain words:

2.4 calculating semantic similarity of Positive and negative samples

Calculating the semantic similarity between the answers of the positive and negative samples according to formula (4) from the domain vocabulary similarity provided in step 2.3, sorting the results in descending order and providing them to step 3; wherein M and N are the vocabulary sets of the two sentences, N₁, N₂, …, N_n are the words in the set N, and the maximum similarity maxValue(ω, N) between a medical domain word ω and the words in the sentence is given by formula (5):

maxValue(ω, N) = max(Sim(ω, N₁), Sim(ω, N₂), …, Sim(ω, N_n))    (5).

4. The negative sample diversity-based question-answer model integration method according to claim 1, characterized in that the third step is: combining the negative sample ranking results obtained in step 2.2 and step 2.4, carrying out segmented sampling on the negative samples, constructing a plurality of training sets and training the base models;

3.1 segmented sampling

Based on the negative sample ranking results obtained in step 2.2 and step 2.4, dividing the negative samples into uniform segments on each of the morphological and semantic similarity rankings, and sampling within different segments to form different training sets; for the i-th question, the negative sample set collected in segment l is restricted to the negatives that fall in the l-th segment of the ranking, with the negatives indexed by j ∈ [1, k-1], where k-1 is the total number of negative samples and L is the number of segments; because the negative sample sequence is sorted in descending order of similarity, the sets drawn from earlier segments contain negatives with higher similarity to the positive sample, and the sets drawn from later segments contain negatives with lower similarity;

3.2 determining the number of segments

The number of segments directly determines the number of base models and the learning granularity of the base models, so the invention determines a suitable number of segments through a preliminary experiment; the preliminary experiment uses ACC@1 as the evaluation index and is based on the BIGRU_CNN model structure; the negative sample sets collected with 3 segments are each combined with the positive samples to form training sets, the models are trained, and the resulting base models M_i are provided to step 4.

5. The negative sample diversity-based question-answering model integration method according to claim 1, characterized in that the fourth step is: integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model;

integrating all base models M_i (i ∈ [1, 2L]) provided in step 3 by weighted averaging, wherein the weight w_i of each base model depends on its accuracy p_i on the validation set, so that a base model with higher accuracy receives a larger share of the weight in the integrated model; the prediction probability H(x) of the final integrated model is given by formulas (6) and (7), where T is the total number of base models, h_i(x) is the prediction result of each base model, and w_i is the weight of each base model.