CN113254609A - Question-answering model integration method based on negative sample diversity - Google Patents

Question-answering model integration method based on negative sample diversity

Info

Publication number
CN113254609A
Authority
CN
China
Prior art keywords
question
model
answer
similarity
negative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110516176.2A
Other languages
Chinese (zh)
Other versions
CN113254609B (en)
Inventor
方钰
翟鹏珺
崔雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202110516176.2A priority Critical patent/CN113254609B/en
Publication of CN113254609A publication Critical patent/CN113254609A/en
Application granted granted Critical
Publication of CN113254609B publication Critical patent/CN113254609B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G06F40/35 Discourse or dialogue representation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H80/00 ICT specially adapted for facilitating communication between medical practitioners or patients, e.g. for collaborative diagnosis, therapy or health monitoring
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A question-answering model integration method based on negative sample diversity is disclosed. In the question-answer matching stage of an automatic question-answering system, an integrated model captures information in the corpus from multiple angles, which improves the accuracy and stability of the system. In the field of Chinese medical question answering, most existing question-answering model integration methods obtain negative samples by random sampling or by segmented sampling under a single similarity distance. These methods ignore the diversity of the negative samples, so the base models are insufficiently diverse and the integrated model performs worse. In the invention, the negative samples are ranked and sampled by segment under several similarity distances between the positive and negative samples, which yields multiple training sample sets. Multiple base models are trained on these sets and then integrated, which remedies the lack of diversity among the base models and improves the stability and accuracy of the question-answering model.

Description

Question-answering model integration method based on negative sample diversity
Technical Field
The invention relates to the field of natural language processing, in particular to processing of model integration in a question-answering system.
Model integration is an important method and key technology for improving the performance of a question-answering model in an automatic question-answering system.
Background
The medical question-answering model is an application branch of the automatic question-answering model and, with advances in natural language processing technology, has become a focus of research and application. More and more patients seek medical assistance through online health communities, but the sharply increasing number of questions places a tremendous burden on the physicians who answer them. To reduce doctors' workload and meet users' demand for quick answers, many researchers have turned to the field of medical question answering. Ensuring the accuracy and robustness of the model is a technical difficulty in a medical question-answering system, so some scholars use ensemble learning to capture more information from the data and thereby improve the performance of the question-answering system.
At present, model integration methods in the Chinese medical field generally obtain the negative samples of the training data by random sampling or by segmented sampling based on a single similarity distance.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a question-answering model integration method based on negative sample diversity. The method samples negative samples by segment under multiple similarity distances to construct multiple training sets, and trains one base model on each training set. The aim is to secure the diversity of the base models through the diversity of the negative samples and ultimately to improve the accuracy and robustness of the resulting integrated model.
As a service platform that provides medical and health consultation to users, a medical question-answering system needs high accuracy and stability. In the question-answer matching stage, an integrated model is often more accurate and robust than a single learner, so ensemble learning has been introduced into question-answering research. From different negative samples a model can learn different linguistic information, but current research on integrated models often gives too little consideration to the diversity of negative samples in the training stage. This limits the diversity of the base models and degrades the prediction performance of the final integrated model.
To address these problems, the invention provides a model integration method based on multi-similarity segmented sampling of negative samples, with the aim of improving the stability and robustness of a Chinese medical question-answering model. The method ranks the negative samples under several similarity distances between the positive and negative samples, samples them by segment to form multiple training sample sets, trains multiple base models on these sets, and finally integrates the base models.
In order to achieve the purpose, the technical scheme provided by the invention is as follows:
The invention provides a question-answering model integration method based on negative sample diversity, which comprises the following steps:
Step 1, preprocessing the data set of medical question-answer pairs;
Step 2, ranking the negative samples by similarity;
Step 3, using the negative sample rankings obtained in step 2, sampling the negative samples by segment, constructing multiple training sets, and training the base models;
Step 4, integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model.
Advantageous effects
Existing model integration methods for improving the performance of Chinese medical question-answering models give too little consideration to the diversity of negative samples. To address this problem, the invention designs a model integration method based on multi-similarity segmented sampling of negative samples. The method ranks the negative samples under several similarity distances between the positive and negative samples, samples them by segment to obtain multiple training sample sets, trains multiple base models on these sets, and finally integrates the models. By fully exploiting the diversity of the negative samples, the method obtains diverse base models and thereby improves the accuracy of the final integrated model. In smart-community scenarios, the method is of great significance for providing residents with convenient and timely online medical services and for reducing doctors' workload.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow diagram of the model integration method;
FIG. 2 shows the results of the pre-experiment that determines the vocabulary weights for the morphological similarity in step 2;
FIG. 3 is an example diagram of the segmented sampling in step 3;
FIG. 4 shows the results of the pre-experiment that determines the optimal number of segments in step 3.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, a detailed description of the embodiments of the present invention will be given below with reference to the accompanying drawings. It is to be understood that the specific embodiments described herein are merely illustrative and explanatory of the invention and are not restrictive thereof.
The specific implementation process of the invention is shown in FIG. 1 and comprises the following 4 steps:
Step 1, preprocessing the data set of medical question-answer pairs;
Step 2, ranking the negative samples by similarity;
Step 3, using the negative sample rankings obtained in step 2, sampling the negative samples by segment, constructing multiple training sets, and training the base models;
Step 4, integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model.
Each step is described in detail below.
Step 1: Preprocessing the Chinese medical question-answer data set.
1.1 Integrating the question-answer pair data set
Invalid question-answer pairs are deleted, namely those that contain no answer, are ambiguously worded, or contain pictures in the question or the answer. To keep the data set balanced, question-answer pairs in the few remaining categories outside the four categories of disease diagnosis, disease treatment, disease symptoms, and disease causes are also deleted. The integrated data set is provided to step 1.2.
1.2 Removing stop words
Stop words are removed from the questions and answers in the data set using a stop-word list. Stop words are mainly frequently used words with no substantive meaning, such as modal particles and polite expressions. The result after stop-word removal is provided to steps 1.3 and 1.4.
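The stop-word removal described above can be sketched as follows. This is a minimal illustration, assuming the sentences have already been segmented into tokens; the function name `remove_stop_words` and the sample stop-word set are hypothetical and do not come from the patent.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word list, preserving order."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical stop-word list: a polite expression and a modal particle.
STOP_WORDS = {"请问", "呢"}

question = ["请问", "高血压", "怎么", "治疗", "呢"]
cleaned = remove_stop_words(question, STOP_WORDS)
```

A set is used for the stop-word list so that each membership check takes constant time on average.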
1.3 Annotating positive question-answer samples
The correct answer to each question in the data set provided by step 1.2 is annotated, which yields the positive question-answer samples. The annotation result is provided to step 1.4.
1.4 Randomly initializing negative question-answer samples
Based on the positive samples annotated in step 1.3, each question is randomly matched with answers drawn from all the answers provided by step 1.2, excluding the answer in its positive sample, and the resulting question-answer pairs are labeled as negative samples. This completes the random initialization of the negative question-answer samples. After labeling, the preprocessing of the question-answer data set in step 1 is complete, and the question-answer pairs in the preprocessed data set are provided to steps 2, 3, and 4.
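The random initialization of negative samples in step 1.4 might look like the sketch below. The function name `init_random_negatives`, the parameter `k` (number of negatives per question), and the fixed seed are assumptions made for illustration; the patent does not specify how many negatives are drawn.

```python
import random


def init_random_negatives(positives, all_answers, k, seed=0):
    """Build labeled samples: one positive and up to k random negatives per question.

    positives: dict mapping each question to its correct answer.
    all_answers: list of every answer in the data set.
    Returns a list of (question, answer, label) with label 1 = positive, 0 = negative.
    """
    rng = random.Random(seed)
    samples = []
    for question, gold in positives.items():
        samples.append((question, gold, 1))
        # A negative answer must differ from the answer in the positive sample.
        candidates = [answer for answer in all_answers if answer != gold]
        for answer in rng.sample(candidates, min(k, len(candidates))):
            samples.append((question, answer, 0))
    return samples
```

Sampling without replacement keeps the negatives for one question distinct from each other.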
Step 2: Ranking the negative samples by similarity.
2.1 Calculating the morphological similarity of positive and negative samples
For the answers in the positive and negative question-answer samples obtained in step 1, the distance between answers is calculated with the TF-IDF algorithm, a statistical method that measures how important each word is in a text. The result is provided to step 2.2.
2.2 Computing vocabulary weights
The question-answer corpus provided by step 1 belongs to the medical field, where domain vocabulary is usually more discriminative and more important than common vocabulary. On the basis of step 2.1, medical domain words are therefore given a higher weight to emphasize their importance; that is, a domain-word-weighted TF-IDF algorithm is used to refine the morphological similarity distance between positive and negative samples.
The weight values directly affect the performance of the similarity algorithm, so a pre-experiment is designed to determine the weight ratio between domain vocabulary and common vocabulary in the question-answer corpus provided by step 1. As shown in FIG. 2, the pre-experiment uses ACC@1 as the evaluation index and compares the performance of an initial integrated model as the weight ratio of common vocabulary to domain vocabulary is varied. Here, the initial integrated model is a BIGRU_CNN model obtained by combining the results of 6 segmented samplings based on the morphological similarity of the negative samples.
The pre-experiment results show that the initial integrated model performs best when the weight ratio of common vocabulary to domain vocabulary is 0.6. In the domain-word-weighted TF-IDF algorithm, the weights of domain words and common words are therefore given by formulas (1) and (2), where ω1 is a domain word, c1 is a common word, W′ is the original weight based on term frequency and inverse document frequency, W(ω1) is the weighted domain word weight, and W(c1) is the weighted common word weight.
W(ω1)=1*W′(ω1) (1)
W(c1)=0.6*W′(c1) (2)
W(ω1) and W(c1) are substituted into the TF-IDF algorithm, the resulting morphological similarity values are sorted in descending order, and the sorted result is provided to step 3.
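The domain-word-weighted TF-IDF ranking of steps 2.1 and 2.2 can be sketched as follows. The weights 1 and 0.6 come from formulas (1) and (2). Everything else is an assumption for illustration: the smoothed IDF variant, the use of cosine similarity as the morphological similarity measure, and the function names `build_idf`, `weighted_tfidf`, `cosine`, and `rank_negatives`.

```python
import math
from collections import Counter

DOMAIN_WEIGHT = 1.0  # formula (1): W(w1) = 1 * W'(w1)
COMMON_WEIGHT = 0.6  # formula (2): W(c1) = 0.6 * W'(c1)


def build_idf(docs):
    """Smoothed inverse document frequency over tokenized documents (assumed variant)."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log((1 + n) / (1 + count)) + 1.0 for word, count in df.items()}


def weighted_tfidf(tokens, idf, domain_vocab):
    """TF-IDF vector in which common words are scaled down relative to domain words."""
    counts = Counter(tokens)
    vector = {}
    for word, count in counts.items():
        original = (count / len(tokens)) * idf.get(word, 0.0)  # W'
        scale = DOMAIN_WEIGHT if word in domain_vocab else COMMON_WEIGHT
        vector[word] = scale * original
    return vector


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_negatives(positive, negatives, idf, domain_vocab):
    """Sort negative answers by similarity to the positive answer, largest first."""
    pos_vec = weighted_tfidf(positive, idf, domain_vocab)
    scored = [(cosine(pos_vec, weighted_tfidf(neg, idf, domain_vocab)), neg)
              for neg in negatives]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [neg for _, neg in scored]
```

Scaling common words by 0.6 before computing the similarity means that an overlap in medical terms contributes more to the score than an overlap in everyday words.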
2.3 Calculating the similarity between domain words in the positive and negative samples
Because the tree structure of CMeSH (Chinese Medical Subject Headings) clearly shows the semantic relationships among medical domain words, the invention uses CMeSH to calculate the similarity between the medical domain words contained in the answers of the positive and negative samples provided by step 1, and provides the similarity result to step 2.4. Specifically, the semantic similarity Sim(ω1, ω2) between medical domain words ω1 and ω2 is calculated from their semantic distance according to formula (3), where Dist(ω1, ω2) denotes the semantic distance between the domain words.
[Formula (3): image in the original publication, Sim(ω1, ω2) as a function of Dist(ω1, ω2)]
2.4 Calculating the semantic similarity of positive and negative samples
Based on the domain word similarity provided by step 2.3, the semantic similarity between the answers of the positive and negative samples is calculated according to formula (4); the results are sorted in descending order and provided to step 3. Here M and N are the word sets of the two sentences, N_1, N_2, …, N_n are the words in set N, and the maximum similarity maxValue(ω, N) between a medical domain word ω and the words in a sentence is given by formula (5).
[Formula (4): image in the original publication, sentence-level semantic similarity of word sets M and N]
maxValue(ω, N) = max(Sim(ω, N_1), Sim(ω, N_2), …, Sim(ω, N_n)) (5)
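The following sketch shows how formula (5) could feed a sentence-level score. Only `max_value` follows the text directly. Formula (4) is not recoverable from the extracted text, so the symmetric averaging in `sentence_similarity` is an assumption, as is the `word_sim` callable, which stands in for the CMeSH-based Sim of formula (3).

```python
def max_value(word, others, word_sim):
    """Formula (5): the largest similarity between `word` and any word in `others`."""
    return max((word_sim(word, other) for other in others), default=0.0)


def sentence_similarity(m_words, n_words, word_sim):
    """Assumed aggregate: average of the best matches in both directions.

    This is NOT the patent's formula (4), which is unavailable; it is one
    common way to combine per-word maxima into a sentence-level score.
    """
    if not m_words or not n_words:
        return 0.0
    total = sum(max_value(word, n_words, word_sim) for word in m_words)
    total += sum(max_value(word, m_words, word_sim) for word in n_words)
    return total / (len(m_words) + len(n_words))


def toy_sim(a, b):
    """Hypothetical word similarity: 1.0 for identical words, 0.5 otherwise."""
    return 1.0 if a == b else 0.5
```

Passing the word similarity in as a function keeps the aggregation independent of how the CMeSH semantic distance is actually computed.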
Step 3: Using the negative sample rankings obtained in steps 2.2 and 2.4, the negative samples are sampled by segment, multiple training sets are constructed, and the base models are trained.
3.1 Segmented sampling
For the negative sample rankings obtained in steps 2.2 and 2.4, the negative samples are divided into uniform segments along each of the two similarity sequences, morphological and semantic, and samples are drawn from the different segments to form different training sets. Here, the negative sample set collected for the i-th question in each segment l, [symbol: image in the original publication], satisfies, for any j ∈ [1, k-1], [formula: image in the original publication], where k-1 is the total number of negative samples and [symbol: image in the original publication] denotes the set of negative samples in the l-th segment. Because the negative samples are sorted in descending order of similarity, the negative samples in [symbol: image in the original publication] have higher semantic similarity to the positive sample and those in [symbol: image in the original publication] have lower semantic similarity to it. L is the number of segments.
FIG. 3 illustrates the case where the number of segments L equals 3 and the negative samples are sampled by segment according to morphological similarity; the distance between samples in the figure reflects their morphological similarity distance. S is the positive sample and S1 to S9 are negative samples. The negative samples of the first sample set are drawn from S1, S2, and S3 in the first segment, those of the second sample set from S4, S5, and S6 in the second segment, and so on, so that three training sample sets are finally generated.
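The segmented sampling of the FIG. 3 example can be sketched as follows. The function names `split_segments` and `build_training_sets`, the `per_segment` parameter, and the fixed seed are assumptions for illustration; the ranked list is taken to be already sorted from most to least similar.

```python
import random


def split_segments(ranked_negatives, num_segments):
    """Split a ranked list into contiguous segments of near-equal size."""
    n = len(ranked_negatives)
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [ranked_negatives[bounds[i]:bounds[i + 1]] for i in range(num_segments)]


def build_training_sets(positive, ranked_negatives, num_segments, per_segment, seed=0):
    """Create one training set per segment: the positive plus negatives from that segment."""
    rng = random.Random(seed)
    training_sets = []
    for segment in split_segments(ranked_negatives, num_segments):
        picked = rng.sample(segment, min(per_segment, len(segment)))
        training_sets.append([(positive, 1)] + [(neg, 0) for neg in picked])
    return training_sets


negatives = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8", "S9"]
segments = split_segments(negatives, 3)
sets = build_training_sets("S", negatives, num_segments=3, per_segment=2)
```

Running the same routine once on the morphological ranking and once on the semantic ranking yields 2L training sets in total.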
3.2 Determining the number of segments
The number of segments directly determines the number of base models and the learning granularity of the base models, so the invention determines a suitable number of segments by a pre-experiment. The pre-experiment uses ACC@1 as the evaluation index and is based on the BIGRU_CNN model structure; the results are shown in FIG. 4. They show that ACC@1 is best when the number of segments is 3, that is, when 3 base models obtained by segmented sampling on the semantic similarity of the negative samples are integrated with 3 base models obtained by segmented sampling on their morphological similarity. When the number of segments is too small, the model cannot benefit from multi-granularity learning, which limits its performance. When the number of segments is 4 or 5, the base models become less distinct from one another, so ACC@1 drops slightly compared with 3 segments, and the larger number of base models lengthens the computation time. The number of segments is therefore set to 3. Each negative sample set collected with 3 segments is combined with the positive samples to form a training set, a model is trained on each training set, and the resulting base models M_i are provided to step 4.
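The ACC@1 index used in the pre-experiments is not defined in the text. The sketch below assumes the usual meaning: the fraction of questions for which the highest-scoring candidate answer is the correct one. The function name `acc_at_1` and the input layout are hypothetical.

```python
def acc_at_1(scores_per_question, gold_indices):
    """Fraction of questions whose top-scored candidate is the correct answer.

    scores_per_question: one list of candidate scores per question.
    gold_indices: index of the correct candidate for each question.
    """
    hits = 0
    for scores, gold in zip(scores_per_question, gold_indices):
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == gold)
    return hits / len(gold_indices)


scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
gold = [0, 1, 1]
accuracy = acc_at_1(scores, gold)
```

In this toy input the first two questions are ranked correctly and the third is not, so the index equals two thirds.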
Step 4: Integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model.
All base models M_i (i = 1, …, 2L) provided by step 3 are integrated by weighted averaging. The weight w_i (i = 1, …, 2L) depends on the accuracy p_i of the base model on the validation set, so that a base model with higher accuracy receives a larger weight in the overall model. The prediction probability H(x) of the final integrated model is given by formulas (6) and (7), where T is the total number of base models, h_i(x) is the prediction of each base model, and w_i is the weight of each base model.
[Formula (6): image in the original publication]
[Formula (7): image in the original publication]
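The weighted-average integration of step 4 can be sketched as follows. Formulas (6) and (7) are not recoverable from the extracted text, so the normalization used here, each weight equal to the model's validation accuracy divided by the sum of all accuracies, is an assumption that merely satisfies the stated property that more accurate base models receive larger weights. The function names are hypothetical.

```python
def accuracy_weights(accuracies):
    """Assumed weighting: w_i = p_i / sum of all p_j, so the weights sum to 1."""
    total = sum(accuracies)
    return [p / total for p in accuracies]


def ensemble_predict(base_predictions, weights):
    """Weighted average H(x) of the base model predictions h_i(x)."""
    return sum(w * h for w, h in zip(weights, base_predictions))


# Three hypothetical base models with validation accuracies p_i.
weights = accuracy_weights([0.8, 0.6, 0.6])
# Their predicted matching probabilities h_i(x) for one question-answer pair.
h_x = ensemble_predict([1.0, 0.0, 0.5], weights)
```

With L = 3 segments and two similarity measures, the same functions would be called with T = 6 accuracies and 6 predictions.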
Innovation point
The invention provides a question-answering model integration method based on negative sample diversity. Unlike current model integration methods in the field of medical question answering, the method first ranks the negative samples under multiple similarity distances. It then obtains different negative sample sets by segmented sampling, trains on each of them as a training data set to generate different base models, and finally integrates the base models by weighted averaging.
The proposed model integration method achieves better performance on a Chinese medical question-answer data set and improves the accuracy of the Chinese medical question-answering system.

Claims (5)

1. A question-answering model integration method based on negative sample diversity, characterized by comprising:
Step 1, preprocessing the data set of medical question-answer pairs;
Step 2, ranking the negative samples by similarity;
Step 3, using the negative sample rankings obtained in step 2, sampling the negative samples by segment, constructing multiple training sets, and training the base models;
Step 4, integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model.
2. The question-answering model integration method based on negative sample diversity according to claim 1, characterized in that step 1, preprocessing the Chinese medical question-answer data set, comprises:
1.1 Integrating the question-answer pair data set
Deleting invalid question-answer pairs that contain no answer, are ambiguously worded, or contain pictures in the question or the answer; to keep the data set balanced, deleting question-answer pairs in the few remaining categories outside the four categories of disease diagnosis, disease treatment, disease symptoms, and disease causes; providing the integrated data set to step 1.2;
1.2 Removing stop words
Removing stop words from the questions and answers in the data set using a stop-word list, the stop words being mainly frequently used words with no substantive meaning, such as modal particles and polite expressions; providing the result after stop-word removal to steps 1.3 and 1.4;
1.3 Annotating positive question-answer samples
Annotating the correct answer to each question in the data set provided by step 1.2, thereby obtaining the positive question-answer samples, and providing the annotation result to step 1.4;
1.4 Randomly initializing negative question-answer samples
Based on the positive samples annotated in step 1.3, randomly matching each question with answers drawn from all the answers provided by step 1.2, excluding the answer in its positive sample, and labeling the resulting question-answer pairs as negative samples, thereby completing the random initialization of the negative question-answer samples; after labeling, the preprocessing of the question-answer data set in step 1 is complete, and the question-answer pairs in the preprocessed data set are provided to steps 2, 3, and 4.
3. The question-answering model integration method based on negative sample diversity according to claim 1, characterized in that step 2, ranking the negative samples by similarity, comprises:
2.1 Calculating the morphological similarity of positive and negative samples
For the answers in the positive and negative question-answer samples obtained in step 1, calculating the distance between answers with the TF-IDF algorithm, a statistical method that measures how important each word is in a text, and providing the result to step 2.2;
2.2 Computing vocabulary weights
The question-answer corpus provided by step 1 belongs to the medical field, where domain vocabulary is usually more discriminative and more important than common vocabulary, so on the basis of step 2.1 medical domain words are given a higher weight to emphasize their importance, that is, a domain-word-weighted TF-IDF algorithm is used to refine the morphological similarity distance between positive and negative samples;
the weight values directly affect the performance of the similarity algorithm, so a pre-experiment is designed to determine the weight ratio between domain vocabulary and common vocabulary in the question-answer corpus provided by step 1; the pre-experiment uses ACC@1 as the evaluation index and compares the performance of an initial integrated model as the weight ratio of common vocabulary to domain vocabulary is varied; here, the initial integrated model is a BIGRU_CNN model obtained by combining the results of 6 segmented samplings based on the morphological similarity of the negative samples;
the initial integrated model performs best when the weight ratio of common vocabulary to domain vocabulary is 0.6, so in the domain-word-weighted TF-IDF algorithm the weights of domain words and common words are given by formulas (1) and (2), where ω1 is a domain word, c1 is a common word, W′ is the original weight based on term frequency and inverse document frequency, W(ω1) is the weighted domain word weight, and W(c1) is the weighted common word weight;
W(ω1)=1*W′(ω1) (1)
W(c1)=0.6*W′(c1) (2)
substituting W(ω1) and W(c1) into the TF-IDF algorithm, sorting the resulting morphological similarity values in descending order, and providing the sorted result to step 3;
2.3 Calculating the similarity between domain words in the positive and negative samples
Because the tree structure of CMeSH (Chinese Medical Subject Headings) clearly shows the semantic relationships among medical domain words, the invention uses CMeSH to calculate the similarity between the medical domain words contained in the answers of the positive and negative samples provided by step 1 and provides the similarity result to step 2.4; specifically, the semantic similarity Sim(ω1, ω2) between medical domain words ω1 and ω2 is calculated from their semantic distance according to formula (3), where Dist(ω1, ω2) denotes the semantic distance between the domain words:
[Formula (3): image in the original publication, Sim(ω1, ω2) as a function of Dist(ω1, ω2)]
2.4 Calculating the semantic similarity of positive and negative samples
Based on the domain word similarity provided by step 2.3, calculating the semantic similarity between the answers of the positive and negative samples according to formula (4), sorting the results in descending order, and providing them to step 3; here M and N are the word sets of the two sentences, N_1, N_2, …, N_n are the words in set N, and the maximum similarity maxValue(ω, N) between a medical domain word ω and the words in a sentence is given by formula (5):
[Formula (4): image in the original publication, sentence-level semantic similarity of word sets M and N]
maxValue(ω, N) = max(Sim(ω, N_1), Sim(ω, N_2), …, Sim(ω, N_n)) (5).
4. The question-answering model integration method based on negative sample diversity according to claim 1, characterized in that step 3, using the negative sample rankings obtained in steps 2.2 and 2.4 to sample the negative samples by segment, construct multiple training sets, and train the base models, comprises:
3.1 Segmented sampling
For the negative sample rankings obtained in steps 2.2 and 2.4, dividing the negative samples into uniform segments along each of the two similarity sequences, morphological and semantic, and drawing samples from the different segments to form different training sets; here, the negative sample set collected for the i-th question in each segment l, [symbol: image in the original publication], satisfies, for any j ∈ [1, k-1], [formula: image in the original publication], where k-1 is the total number of negative samples and [symbol: image in the original publication] denotes the set of negative samples in the l-th segment; because the negative samples are sorted in descending order of similarity, the negative samples in [symbol: image in the original publication] have higher semantic similarity to the positive sample and those in [symbol: image in the original publication] have lower semantic similarity to it, and L is the number of segments;
3.2 Determining the number of segments
The number of segments directly determines the number of base models and the learning granularity of the base models, so the invention determines a suitable number of segments by a pre-experiment; the pre-experiment uses ACC@1 as the evaluation index and is based on the BIGRU_CNN model structure; each negative sample set collected with 3 segments is combined with the positive samples to form a training set, a model is trained on each training set, and the resulting base models M_i are provided to step 4.
5. The negative sample diversity-based question-answering model integration method according to claim 1, characterized in that the fourth step is: integrating the base models obtained in step 3 by weighted averaging to obtain the final question-answering model;
For all base models Mi (i ∈ 2L) provided by step 3, the base models are integrated by weighted averaging; the weight wi (i ∈ 2L) depends on the accuracy pi of the base model on the validation set, so that a base model with higher accuracy receives a larger weight in the integrated model; the prediction probability H(x) of the final integrated model is given by formulas (6) and (7), where T is the total number of base models, hi(x) is the prediction of the i-th base model, and wi is the weight of that base model:
Figure FDA0003062268770000041
Figure FDA0003062268770000042
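Formulas (6) and (7) survive here only as figure references, so the following is a sketch under an explicit assumption rather than a reproduction of them: it takes each weight wi to be the validation accuracy pi normalized by the sum of all accuracies, and H(x) to be the weighted sum of the base model predictions hi(x). The function names are hypothetical.

```python
from typing import List, Sequence


def accuracy_weights(accuracies: Sequence[float]) -> List[float]:
    """Assumed weighting: w_i = p_i / sum_j p_j, so higher accuracy means larger weight."""
    total = sum(accuracies)
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    return [p / total for p in accuracies]


def ensemble_predict(predictions: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Assumed combination: H(x) = sum_i w_i * h_i(x) over the T base models."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per base model")
    return sum(w * h for w, h in zip(weights, predictions))
```

With this normalization the weights sum to 1, so H(x) stays in the same range as the individual predicted probabilities.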
CN202110516176.2A 2021-05-12 2021-05-12 Question-answering model integration method based on negative sample diversity Active CN113254609B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110516176.2A CN113254609B (en) 2021-05-12 2021-05-12 Question-answering model integration method based on negative sample diversity

Publications (2)

Publication Number Publication Date
CN113254609A true CN113254609A (en) 2021-08-13
CN113254609B CN113254609B (en) 2022-08-09

Family

ID=77222953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110516176.2A Active CN113254609B (en) 2021-05-12 2021-05-12 Question-answering model integration method based on negative sample diversity

Country Status (1)

Country Link
CN (1) CN113254609B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050027717A1 (en) * 2003-04-21 2005-02-03 Nikolaos Koudas Text joins for data cleansing and integration in a relational database management system
CN110046240A (en) * 2019-04-16 2019-07-23 浙江爱闻格环保科技有限公司 In conjunction with the target domain question and answer method for pushing of keyword retrieval and twin neural network
CN110543558A (en) * 2019-09-06 2019-12-06 北京百度网讯科技有限公司 question matching method, device, equipment and medium
CN111581354A (en) * 2020-05-12 2020-08-25 金蝶软件(中国)有限公司 FAQ question similarity calculation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王丰 et al.: "An iteration-based schema matching method from relational models to ontology models", 《软件学报》 (Journal of Software) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444623A (en) * 2022-04-11 2022-05-06 智昌科技集团股份有限公司 Industrial robot-oriented anomaly detection and analysis method and system
CN114444623B (en) * 2022-04-11 2022-08-12 智昌科技集团股份有限公司 Industrial robot-oriented anomaly detection and analysis method and system
CN115759027A (en) * 2022-11-25 2023-03-07 上海苍阙信息科技有限公司 Text data processing system and method
CN115759027B (en) * 2022-11-25 2024-03-26 上海苍阙信息科技有限公司 Text data processing system and method

Also Published As

Publication number Publication date
CN113254609B (en) 2022-08-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant