CN111259127B - Long text answer selection method based on transfer learning sentence vector

Info

Publication number: CN111259127B
Application number: CN202010043764.4A
Authority: CN (China)
Legal status: Active
Prior art keywords: answer, network, layer, pool, question
Other languages: Chinese (zh)
Other versions: CN111259127A
Inventors: 张引, 王炜
Current and original assignee: Zhejiang University (ZJU)
Application filed by Zhejiang University; priority to CN202010043764.4A
Publication of application CN111259127A; application granted; publication of grant CN111259127B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/35 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning


Abstract

The invention discloses a long text answer selection method based on transfer learning sentence vectors. A two-stage approach builds a transfer learning sentence vector network and a training prediction network: the former comprises a twin network structure, an attention aggregation structure and a classification layer; the latter comprises a twin network structure and a distance metric layer. First, the method needs no word segmentation of the data set text sequences and takes complete question and answer sentences directly as input, avoiding the error propagation introduced by word segmentation tools. Second, the second-stage training prediction network has a simple structure and high computational efficiency. Finally, a transfer learning method combined with the twin network structure and an attention mechanism yields sentence vector model weights that are semantically closer, providing sentence-level semantic vectors for the second-stage training prediction network. The method outperforms both traditional methods and common deep learning networks, with especially pronounced gains on long text data.

Description

Long text answer selection method based on transfer learning sentence vector
Technical Field
The invention relates to pre-trained language models and attention mechanisms in natural language processing and deep learning, and in particular to a long text answer selection method based on transfer learning sentence vectors.
Background
The internet has developed rapidly over the years, and information platforms of every kind have appeared in "blowout" fashion. According to incomplete statistics from the Hootsuite and We Are Social websites, the number of netizens worldwide had broken through 3.5 billion by 2019, and 45% of the world population are users of social media. The data show that network users increased by 439 million from 2018 to 2019, and social media users increased by 348 million within the year. These vast figures show that the global network has reached a highly developed stage, bringing countless pieces of internet knowledge and information with it. A large number of websites carrying network information flood the internet environment, and how to retrieve and use that information effectively is a real problem, which makes the existence of search engines very important. Computer storage and computing speed have entered a golden age: computing power and storage capacity are no longer the stumbling block to search engine development, and with the arrival of high-performance computing and high-performance storage, how to retrieve the most relevant results efficiently and accurately has become the research focus of search engines.
To address this research focus, the difficult problem of accurately retrieving the most relevant information from massive documents must be overcome. Looking back at the history of search, the first generation search engine, Archie, was mainly used to find files distributed across hosts. When the World Wide Web appeared, EINet Galaxy (Tradewave Galaxy) followed and functioned as the earliest web portal. Through successive generations of search engine technology, and under the competition dominated by large internet companies such as Baidu, Google and Microsoft with their Baidu, Google and Bing search engines, how to search accurately remains a continuing research hotspot. With the rise of the artificial intelligence wave, machine learning and deep learning methods have brought new solutions to image recognition, natural language processing, speech recognition and other fields. Facing the reality that the results recalled by search engines are often unsatisfactory and many of them require secondary screening and filtering by the searcher, automatic question answering technology has emerged.
Answer selection is an important step in automatic question answering and is widely applied in daily life; for example, Xiaomi's XiaoAI, Apple's Siri and Microsoft's XiaoIce, among others, are commercial products of automatic question answering technology. In task-oriented automatic question answering, a robot assistant built on this technology can greatly free both hands: a series of tasks can be controlled and completed by voice commands alone. In chit-chat automatic question answering, a chat robot can brighten a dull day. In modern medicine, automatic question answering can establish a more convenient and efficient communication channel between doctors and patients. Improving the accuracy of question answering is therefore essential in this field, and the answer selection technology that is central to retrieval-based automatic question answering also plays a very important role in the search engines described above.
Existing answer selection methods generally use a twin network structure to model the question text and the answer text separately, and finally judge whether a question matches an answer through a similarity measure such as the cosine distance. However, traditional methods focus mainly on short text matching tasks, lack research on long text scenarios, and struggle with the "semantic migration" and "semantic gap" problems of long text applications. Moreover, because question-answer data in the medical field generally have short questions and long answers, the matching quality and recall precision of existing answer selection methods cannot meet online requirements. To better perform answer selection on long text data, the main technical difficulties are:
1. how to design a model that can represent long text sequences;
2. how to exploit external knowledge and introduce a transfer learning method to improve recall precision;
3. how to design evaluation indices that quantify model performance.
Disclosure of Invention
In order to solve the above problems, the invention provides a long text answer selection method based on transfer learning sentence vectors, in which BERT serves as the feature extraction layer to model long text data, and a two-stage scheme of transfer learning followed by training prediction is adopted. First, the question and answer text sequences are taken as input and handled in BERT's input format without additional word segmentation, avoiding the error propagation that word segmentation introduces. Second, a transfer learning method, assisted by a twin network structure and an attention aggregation structure, makes the question and answer sentence vectors obtained by transfer learning semantically closer. Finally, sentence vectors of the texts are obtained during training prediction by initializing from the transfer-learned model weight parameters, the semantic similarity of question and answer sentence vectors is computed simply with a distance metric, and the simplified training prediction network structure yields higher recall efficiency and a lower GPU memory footprint.
In order to achieve the purpose, the invention adopts the following technical scheme:
a long text answer selection method based on a transfer learning sentence vector comprises the following steps:
1) crawlers designed with XPath are used to crawl doctor-patient question and answer data from the consultation forum, and the data are cleaned; answers in the doctor-patient question-answer data are taken as positive samples; for questions in the doctor-patient question-answer data, related answers are retrieved and recalled with a Lucene index tool and taken as negative samples; a pointwise answer selection data set is constructed from the obtained positive and negative samples and divided into a transfer learning data set and a training prediction data set at a ratio between 27:1 and 8:1;
2) establishing a transfer learning sentence vector network comprising a twin network structure, an attention aggregation structure and a classification layer, wherein the twin network structure comprises paired input, feature extraction and pooling layers, and the attention aggregation structure comprises an attention layer and an aggregation network layer; the feature extraction layer adopts a BERT model initialized by loading whole-word-masking BERT weights; after feature extraction, the pooled output takes the mean value, and the features are aggregated sequentially through the attention layer and the aggregation network layer; the aggregation output vector and the BERT pooling output vector are spliced and input into the classification layer for binary classification output;
training the transfer learning sentence vector network with the transfer learning data set obtained in step 1), matching the binary values of whether the question and the answer match against the real labels using the MRR (mean reciprocal rank) and Precision@K evaluation index method, and selecting the network parameters corresponding to the model with the highest matching score to obtain the BertAttTL transfer learning sentence vector model;
3) establishing a training prediction network comprising a twin network structure and a distance metric layer, wherein the twin network structure comprises paired input, feature extraction and pooling layers; the feature extraction layer adopts a BERT model, and the BERT model and pooling layer parameters in the training prediction network are initialized with the weight parameters of the BertAttTL transfer learning sentence vector model obtained in step 2); question and answer sentence vectors are output through the pooling layer and input into the distance metric layer to obtain their semantic similarity, which is split at a threshold into a binary similar/dissimilar value output as the prediction content; the training prediction network is trained with the training prediction data set obtained in step 1), the finally obtained binary values are matched against the real labels using the MRR (mean reciprocal rank) and Precision@K evaluation index method, and the network parameters corresponding to the model with the highest matching score are selected to obtain the trained training prediction network;
4) inputting the question to be processed and the answer texts into the training prediction network obtained in step 3), and outputting the binary classification values of all candidate answers to obtain the final answer to the question to be processed.
Further, the MRR and Precision@K evaluation index method is specifically as follows:
the output of the transfer learning sentence vector network or the training prediction network is expressed as pred = [p1, p2, ..., pn], where pi is the predicted value, 0 or 1, of the i-th candidate answer, 0 denoting dissimilar and 1 similar, and n is the number of test samples in the sample set; the real label data are expressed as label = [t1, t2, ..., tn], where ti is the real label, 0 or 1, of the i-th candidate answer, with the same coding; for all candidate answers to a question, the binary classification values obtained through the transfer learning sentence vector network or the training prediction network are sorted, giving the rank rank_i of the correct answer for the i-th question.
The MRR calculation formula is as follows:
MRR = (1/|Q|) * Σ_{i=1}^{|Q|} 1/rank_i
wherein Q is the question set, and |Q| represents the number of all questions;
precision @ K is calculated as:
Precision@K = Num(true answers) / Sum(related K answers)
wherein Precision represents the precision, K represents the number of answers considered in the index (taking the values 1, 2 and 3 in the present invention), Num(true answers) represents the number of correct answers, and Sum(related K answers) represents the total number of recalled related answers.
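For concreteness, the two indices can be sketched in Python as follows; the function names and the assumption that candidate answers arrive pre-sorted by model score are illustrative, not part of the claims.

```python
from typing import List, Sequence

def rank_of_correct(sorted_labels: Sequence[int]) -> int:
    """1-based rank of the correct answer once candidates are sorted by score."""
    return list(sorted_labels).index(1) + 1

def mrr(ranks: List[int]) -> float:
    """MRR = (1/|Q|) * sum of 1/rank_i over the |Q| questions."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def precision_at_k(sorted_labels: Sequence[int], k: int) -> float:
    """Num(true answers) among the top K recalled answers / answers considered."""
    top = list(sorted_labels)[:k]
    return sum(top) / len(top)
```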
Furthermore, the transfer learning sentence vector network comprises a twin network structure, an attention aggregation structure and a classification layer; the twin network structure comprises paired input, feature extraction and pooling layers, and the attention aggregation structure comprises an attention layer and an aggregation network layer. The attention layer adds an attention mechanism to the twin network structure: the context of the question enriches the semantic representation of the answer text, the context of the answer enriches the semantic representation of the question text, and this question-answer semantic interaction effectively improves the matching quality. After the attention mechanism, the aggregation network layer further deepens the model's representation of the similar and dissimilar parts of the question and the answer through its comparison layer and aggregation layer, effectively improving matching on top of the attention mechanism. The feature extraction layer is modeled with BERT, initialized with whole-word-masking BERT weights;
Paired samples are input into the twin network structure, with the paired input layers corresponding to the two text sequences Question and Answer; the question and answer texts are processed according to BERT's input formats [CLS] + Question + [SEP] and [CLS] + Answer + [SEP]. After BERT feature modeling, the pooling layer averages the outputs of the 12 encoder layers to obtain pooled outputs of unified dimensionality: the question pooling output Q pool and the answer pooling output A pool, each of dimension length 768;
The question pooling output Q pool and the answer pooling output A pool are input into the attention layer, and the attention mechanism yields the question semantic alignment vector Z2 and the answer semantic alignment vector Z2'. Q pool, A pool, Z2 and Z2' are input to the aggregation network layer. For the question, Q pool and Z2 are transformed as [Q pool, Z2], [Q pool, Q pool - Z2], [Q pool, Q pool * Z2], spliced through one layer of linear transformation into the vector [O1, O2, O3]; the spliced vector passes through a further linear transformation with a Dropout mechanism to give the question attention aggregation output Fused_Q. Similarly, for the answer, A pool and Z2' pass through the aggregation network layer to give the answer attention aggregation output Fused_A.
Fused_Q, Fused_A, Q pool and A pool are further spliced into [Q pool, A pool, |Q pool - A pool|, Q pool * A pool, Fused_Q, Fused_A]; the spliced vector is input into the classification layer, and Softmax classification yields the prediction output pred = [p1, p2, ..., pn], where pi is the predicted value, 0 or 1, of the i-th candidate answer, 0 denoting dissimilar and 1 similar, and n is the number of test samples in the sample set.
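A minimal PyTorch sketch of one plausible reading of this structure follows; the patent does not pin down the attention formula, so the token-level dot-product cross attention, the shared projection weights, and the layer sizes (768-dimensional hidden states, three parallel linear layers, one fusion layer with Dropout) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAggregation(nn.Module):
    """Sketch of the attention layer + aggregation network layer."""
    def __init__(self, hidden: int = 768, dropout: float = 0.1):
        super().__init__()
        # one linear layer per transformed pair -> O1, O2, O3
        self.proj = nn.ModuleList([nn.Linear(2 * hidden, hidden) for _ in range(3)])
        self.fuse = nn.Linear(3 * hidden, hidden)  # splices [O1, O2, O3]
        self.drop = nn.Dropout(dropout)

    def align(self, hx, hy):
        """Each token of x attends over the tokens of y, then mean-pool."""
        scores = torch.bmm(hx, hy.transpose(1, 2))      # [B, Lx, Ly]
        z = torch.bmm(F.softmax(scores, dim=-1), hy)    # [B, Lx, H]
        return z.mean(dim=1)                            # [B, H]

    def aggregate(self, pool, z):
        """[pool, z], [pool, pool - z], [pool, pool * z] -> linear -> splice -> Fused."""
        pairs = [torch.cat([pool, z], -1),
                 torch.cat([pool, pool - z], -1),
                 torch.cat([pool, pool * z], -1)]
        o = torch.cat([p(x) for p, x in zip(self.proj, pairs)], -1)  # [O1, O2, O3]
        return self.drop(self.fuse(o))

    def forward(self, hq, ha, q_pool, a_pool):
        z2 = self.align(hq, ha)    # question semantic alignment vector Z2
        z2p = self.align(ha, hq)   # answer semantic alignment vector Z2'
        fused_q = self.aggregate(q_pool, z2)
        fused_a = self.aggregate(a_pool, z2p)
        # final splice fed to the two-way Softmax classification layer
        return torch.cat([q_pool, a_pool, (q_pool - a_pool).abs(),
                          q_pool * a_pool, fused_q, fused_a], dim=-1)
```

The resulting 6 x 768 splice would then feed a `nn.Linear(6 * 768, 2)` plus Softmax as the classification layer.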
Further, the semantic similarity calculation method in step 3) adopts any one of the cosine distance, the Manhattan distance, the Euclidean metric and the dot product metric.
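The four candidate metrics admit a compact sketch (the sign convention of negating the two distances so that larger always means more similar, and the threshold value, are assumptions):

```python
import torch
import torch.nn.functional as F

def similarity(q: torch.Tensor, a: torch.Tensor, metric: str = "cosine") -> torch.Tensor:
    """Semantic similarity of pooled sentence vectors q, a of shape [batch, 768]."""
    if metric == "cosine":
        return F.cosine_similarity(q, a, dim=-1)
    if metric == "manhattan":
        return -(q - a).abs().sum(dim=-1)   # negated distance: larger = more similar
    if metric == "euclidean":
        return -(q - a).norm(dim=-1)
    if metric == "dot":
        return (q * a).sum(dim=-1)
    raise ValueError(f"unknown metric: {metric}")

def predict_similar(q, a, metric="cosine", threshold=0.5):
    """Split the similarity at a threshold into the binary similar/dissimilar value."""
    return (similarity(q, a, metric) > threshold).long()
```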
The invention has the following beneficial effects:
(1) the word representations of the long text data are obtained with the pre-trained language model BERT from natural language processing; no additional word segmentation stage is needed, which avoids inaccurate segmentation by word segmentation tools and hence the semantic error propagation that inaccurate segmentation causes;
(2) a two-stage method is designed: the first stage uses a transfer learning method to exploit large-scale parallel corpus knowledge, the second stage uses a simple training prediction network with high model inference efficiency, and integrating the two-stage tasks yields high answer selection recall precision;
(3) for large-batch answer retrieval scenarios, directly obtaining the sentence vectors of all text sequences effectively avoids the time-consuming pairwise computation of a pre-trained language model and is therefore more efficient (see the sketch after this list). For example, when a pre-trained language model computes the matching scores of one question against m answers, the question must be paired with each answer and fed into the model for every computation, so the question is repeatedly encoded m times and the question and answers are encoded 2 * m times in total; in a large-scale retrieval scenario the value of m is very large, and the extra time overhead is considerable. The present method only needs the sentence vectors of the question and of all the answers, i.e. the question is encoded once and the answers m times, m + 1 encodings in total; compared with the 2 * m encodings this cuts nearly half of the encoding time, so the efficiency is higher;
(4) the invention adopts the pre-trained language model BERT as the feature extractor, can effectively model the semantics of long text data, and avoids the "semantic migration" and "semantic gap" phenomena that existing answer selection methods exhibit on long text data.
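The m + 1 encoding pattern mentioned in point (3) can be sketched as below, with `encode` standing in for the trained twin-network BERT plus mean pooling and `metric` for the distance measurement layer (both assumed callables):

```python
import torch

def score_all(encode, question, answers, metric):
    """Score one question against m answers with m + 1 encoder passes
    (1 for the question, m for the answers) instead of the 2 * m passes
    a pair-wise cross-encoder would need."""
    q_vec = encode(question)                            # encoded once
    a_vecs = torch.stack([encode(a) for a in answers])  # encoded m times
    return metric(q_vec.unsqueeze(0).expand_as(a_vecs), a_vecs)
```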
Drawings
FIG. 1 is a diagram of a transfer learning model architecture for a long text answer selection method based on transfer learning sentence vectors;
FIG. 2 is a diagram of the training prediction model structure of the long text answer selection method based on transfer learning sentence vectors.
Detailed Description
The present invention is described in detail below with reference to specific examples.
Because question-answer data in the medical field generally have short questions and long answers, the matching quality and recall precision of existing answer selection methods cannot meet online requirements; experimental verification shows that the long text answer selection method based on transfer learning sentence vectors provided by the invention can effectively handle the long text answer selection problem.
As shown in FIG. 1, the transfer learning sentence vector network of the long text answer selection method provided by the present invention includes an input layer, a feature extraction layer, an attention aggregation network layer and a classification layer, where the feature extraction layer adopts BERT for modeling, initialized with whole-word-masking BERT weights;
the input layer corresponds to two text sequences of questions and answers, and the two texts are processed according to the input format [ CLS ] + Question + [ SEP ], [ CLS ] + Answer + [ SEP ] of BERT. After the BERT characteristic modeling, averaging the outputs of 12 layers of pooling layers to obtain pooling outputs with uniform dimensionality, wherein the dimensionality length is 768 dimensions; the attention aggregation network layer obtains semantic alignment output by two text sequences through an attention mechanism, an alignment vector Z2 and a pooling output Z1 are transformed through [ Z1, Z2], [ Z1, Z1-Z2], [ Z1, Z1 x Z2], and are spliced through one layer of linear transformation to obtain [ O1, O2 and O3], the spliced vector is subjected to one layer of linear transformation and uses a Dropout mechanism to obtain question attention aggregation output FusedQ and answer attention aggregation output FusedA, the two are spliced with the pooling output to obtain [ Q pool, A pool, | Q pool-A pool |, Q pool a pool, Fused Q and Fused A ], the predicted output is obtained through Softmax classification, and the semantic transfer learning sentence network training is obtained.
As shown in FIG. 2, the training prediction network adopted by the long text answer selection method provided by the present invention includes an input layer, a feature extraction layer and a distance measurement layer; the feature extraction layer adopts BERT and is initialized with the transfer learning weight parameters trained in step three;
the input layer corresponds to two text sequences of questions and answers, and the two texts are processed according to the input format [ CLS ] + Question + [ SEP ], [ CLS ] + Answer + [ SEP ] of BERT. After the BERT characteristic modeling, averaging the outputs of 12 layers of pooling layers to obtain pooling outputs with uniform dimensionality, wherein the dimensionality length is 768 dimensions; initializing by using the transfer learning weight parameters trained in the step 3) to obtain sentence vectors with more similar semantics, calculating the similarity of the two sentence vectors by adopting cosine distance, Manhattan distance, Euler measurement and point multiplication measurement, and segmenting the similarity by using a threshold to obtain two classification values whether the similarity is similar or not.
In an embodiment of the present invention, answer selection is performed on long text question-and-answer data by using the above transfer learning sentence vector network and training prediction network, and the steps are as follows:
step one, a crawler frame is constructed through Python and XPATH, doctor-patient question and answer data are captured for medical inquiry platforms such as a Sanjiu health network, webpage labels outside texts, such as < div > and the like, are removed through a certain rule method, duplication of the data is removed, about 575 thousands of pieces of doctor-patient question and answer data are finally obtained through processing, and the doctor-patient question and answer data are stored in a warehouse according to a (question, disease description and disease answer) triple form.
Step two: related answers are recalled for each question with the Lucene tool, producing a relevance-sorted set of 500 negative-sample answers; one negative sample is drawn from ranks 1 to 5, one from ranks 5 to 50, one from ranks 50 to 100, and one from ranks 100 to 500. For questions with fewer than 100 recalled related answers, the draw from the 100th to 500th band is dropped when constructing the candidate answer set. From the 4,354,417 labeled samples, a small-sample data set is drawn by topic category as the training prediction data set, comprising a training set of 120,000, a validation set of 20,000 and a test set of 20,000; the labeled data are split at 8:1 of the total to form the transfer learning data set, which has no overlap with the training prediction data set.
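The banded negative sampling can be sketched as follows (band boundaries follow the text; the uniform draw within each band is an assumption):

```python
import random

BANDS = [(0, 5), (5, 50), (50, 100), (100, 500)]  # rank bands of the Lucene recall

def sample_negatives(recalled):
    """Draw one negative answer per rank band of the relevance-sorted recall;
    the deepest band contributes nothing when fewer answers were recalled."""
    negatives = []
    for lo, hi in BANDS:
        band = recalled[lo:min(hi, len(recalled))]
        if band:
            negatives.append(random.choice(band))
    return negatives
```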
In one embodiment of the present invention, the corpus format is as follows:
(Corpus sample omitted: each record contains a Question field and an Answer field.)
wherein Question represents a Question text and Answer represents an Answer text.
Step three: a transfer learning sentence vector network is built with PyTorch and initialized with whole-word-masking BERT weights; the network comprises an input layer, a feature extraction layer, an attention aggregation network layer and a classification layer. Training and prediction are carried out on the transfer learning data set obtained in step two, finally yielding a sentence vector model weight file whose semantic vectors are closer.
The loss function of the transfer learning sentence vector network training adopts cross entropy loss:
loss = -y * log y'
where y represents the true label of whether the answer to the question matches, and y' is the model prediction vector of whether the sample data matches.
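In PyTorch this corresponds to nn.CrossEntropyLoss applied to the two-way logits of the classification layer (a sketch with random stand-in logits):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()               # implements -y * log y'
logits = torch.randn(8, 2, requires_grad=True)  # stand-in classification-layer outputs
labels = torch.randint(0, 2, (8,))              # 1 = question/answer match, 0 = no match
loss = criterion(logits, labels)
loss.backward()
```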
In the test set, consider one question q with 3 answers [a1, a2, a3], prediction vector pred = [0.71, 0.68, 0.35] and real label = [0, 1, 0]. With |Q| = 1 and a threshold of 0.5 splitting the predictions, pred = [1, 1, 0]; against the real label, the correct answer is predicted correctly. Sorting the answers by prediction probability, the correct (second) answer has the second-highest probability and is ranked second, i.e. rank_i = 2, so MRR = 1/2 = 0.5. According to the Precision@K formula with K = 1, 2 and 3: when K = 1, Num(true answers) = 0, so Precision@1 = 0; when K = 2, Num(true answers) = 1 and Sum(related K answers) = 2, so Precision@2 = 0.5; when K = 3, Num(true answers) = 1 and Sum(related K answers) = 3, so Precision@3 = 1/3 ≈ 0.33. This example covers only one question with several answers; the test set contains many questions, and the final result indices are averaged over the number of questions.
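Running the index sketch given earlier (rank_of_correct, mrr, precision_at_k) on this example reproduces these numbers:

```python
scores = [0.71, 0.68, 0.35]   # predicted probabilities for a1, a2, a3
labels = [0, 1, 0]            # a2 is the correct answer
order = sorted(range(len(scores)), key=lambda i: -scores[i])
sorted_labels = [labels[i] for i in order]                 # -> [0, 1, 0]

print(mrr([rank_of_correct(sorted_labels)]))               # 0.5
print([precision_at_k(sorted_labels, k) for k in (1, 2, 3)])
# [0.0, 0.5, 0.3333...]
```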
Step four: a training prediction network is built with PyTorch and initialized with the transfer learning sentence vector network weights from step three; the network comprises an input layer, a feature extraction layer and a distance measurement layer, and training and prediction are carried out on the small-sample training prediction data set obtained in step two.
The loss function of the training prediction network adopts the mean square error loss:

loss = (y - y')²
where y represents the true label of whether the answer to the question matches, and y' is the model prediction vector of whether the sample data matches.
After the question sentence vector and the answer sentence vector are obtained, a cosine similarity classifier calculates the semantic similarity of the two sentence vectors:

cos(q, a) = (q · a) / (||q|| · ||a||)

For example, for a question sentence vector [1, 1, 0, 0, 1] and an answer sentence vector [0, 1, 1, 0], the similarity is computed by this formula.
A pred prediction result is obtained for all samples in the test set and compared with the real labels, and the indices on the test set are obtained according to the MRR and Precision@K (K = 1, 2, 3) calculation formulas.
Step five: the model trained in step four performs inference on the test set data, and the obtained predicted values are finally split at a threshold to determine whether the answers to the questions are semantically similar.
Compared with the prior art: first, the invention needs no word segmentation of the data set text sequences and takes complete question and answer sentences directly as input, avoiding the error propagation introduced by word segmentation tools. Second, the second-stage training prediction network has a simple structure and high computational efficiency. Finally, a transfer learning method combined with a twin network structure and an attention mechanism yields sentence vector model weights that are semantically closer, providing sentence-level semantic vectors for the second-stage training prediction network; the method outperforms both traditional methods and common deep learning networks, with especially pronounced gains on long text data. To evaluate the performance of the model objectively, it is compared with other models, including Siamese RNN, QACNN, DEATT, Cam, Seq Match Seq and ESIM. The evaluation indices adopted in this embodiment are MRR, Precision@1, Precision@2 and Precision@3, which measure the similarity between the question and the recalled answers; larger values mean better performance. As shown in Table 1, the invention integrates the two-stage tasks, has higher answer selection recall precision, and outperforms all comparison models. As shown in Table 2, compared with the pre-trained language model BERT, the inference stage of the method takes only 0.5 seconds, so its efficiency is high.
TABLE 1. Recall precision results of the comparative experiments

Model | MRR | Precision@1 | Precision@2 | Precision@3
Siamese RNN | 0.571769 | 0.311137 | 0.580483 | 0.833433
QACNN | 0.612844 | 0.363327 | 0.650470 | 0.873225
DEATT | 0.525945 | 0.258348 | 0.508098 | 0.745051
Cam | 0.636339 | 0.415917 | 0.656469 | 0.827634
Seq Match Seq | 0.631340 | 0.407518 | 0.651070 | 0.828834
ESIM | 0.523529 | 0.254749 | 0.505299 | 0.743251
The invention | 0.739136 | 0.543491 | 0.818636 | 0.971406
TABLE 2. Inference time of the invention compared with the pre-trained language model

Model | Inference time (number of answers m = 4)
Pre-trained language model BERT | 4.5 seconds
The invention | 0.5 seconds
The above example shows only one embodiment of the present invention, and while its description is specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (2)

1. A long text answer selection method based on a transfer learning sentence vector is characterized by comprising the following steps:
1) obtaining authoritative doctor-patient question-answer data, and taking the answers in the doctor-patient question-answer data as positive samples; for the questions in the doctor-patient question-answer data, retrieving and recalling related answers with a Lucene index tool and taking them as negative samples; constructing an answer selection data set from the obtained positive and negative samples, and dividing it into a transfer learning data set and a training prediction data set at a ratio between 27:1 and 8:1;
2) establishing a transfer learning sentence vector network comprising a twin network structure, an attention aggregation structure and a classification layer, wherein the twin network structure comprises paired input, feature extraction and pooling layers, and the attention aggregation structure comprises an attention layer and an aggregation network layer; the feature extraction layer adopts a BERT model initialized by loading whole-word-masking BERT weights; after feature extraction, the pooled output takes the mean value, and the features are aggregated sequentially through the attention layer and the aggregation network layer; the aggregation output vector and the BERT pooling output vector are spliced and input into the classification layer for binary classification output;
training the transfer learning sentence vector network with the transfer learning data set obtained in step 1): paired samples are input into the twin network structure, with the paired input layers corresponding to the two text sequences Question and Answer; the question and answer texts are processed according to BERT's input formats [CLS] + Question + [SEP] and [CLS] + Answer + [SEP]; after BERT feature modeling, the outputs of the 12 encoder layers are averaged to obtain pooled outputs of unified dimensionality: the question pooling output Q pool and the answer pooling output A pool, each of dimension length 768;
the question pooling output Q pool and the answer pooling output A pool are input into the attention layer, and the attention mechanism yields the question semantic alignment vector Z2 and the answer semantic alignment vector Z2'; Q pool, A pool, Z2 and Z2' are input to the aggregation network layer; for the question, Q pool and Z2 are transformed as [Q pool, Z2], [Q pool, Q pool - Z2], [Q pool, Q pool * Z2], spliced through one layer of linear transformation into the vector [O1, O2, O3]; the spliced vector passes through a further linear transformation with a Dropout mechanism to give the question attention aggregation output Fused_Q; similarly, for the answer, A pool and Z2' pass through the aggregation network layer to give the answer attention aggregation output Fused_A;
Fused_Q, Fused_A, Q pool and A pool are further spliced into [Q pool, A pool, |Q pool - A pool|, Q pool * A pool, Fused_Q, Fused_A]; the spliced vector is input into the classification layer, and Softmax classification yields the prediction output pred = [p1, p2, ..., pn], where pi is the predicted value, 0 or 1, of the i-th candidate answer, 0 denoting dissimilar and 1 similar, and n is the number of test samples in the sample set;
matching the binary values of whether the question and the answer match against the real labels using the MRR (mean reciprocal rank) and Precision@K evaluation index method, and selecting the network parameters corresponding to the model with the highest matching score to obtain the BertAttTL transfer learning sentence vector model; the MRR and Precision@K evaluation index method is specifically as follows:
expressing the output of the transfer learning sentence vector network or the training prediction network as pred = [p1, p2, ..., pn], wherein pi is the predicted value, 0 or 1, of the i-th candidate answer, 0 denoting dissimilar, 1 similar, and n is the number of test samples in the sample set; expressing the real label data as label = [t1, t2, ..., tn], wherein ti is the real label, 0 or 1, of the i-th candidate answer, 0 denoting dissimilar, 1 similar; for all candidate answers to a question, sorting the binary classification values obtained through the transfer learning sentence vector network or the training prediction network, giving the rank rank_i of the correct answer for the i-th question;
The MRR calculation formula is as follows:
MRR = (1/|Q|) * Σ_{i=1}^{|Q|} 1/rank_i
wherein Q is the question set, and |Q| represents the number of all questions;
precision @ K is calculated as:
Precision@K = Num(true answers) / Sum(related K answers)
wherein Precision represents Precision, K represents the number of answers considered in the index, the values are 1,2 and 3, num (true answers) represents the number of correct answers, and Sum (related K answers) represents the total number of recalled related answers;
3) establishing a training prediction network comprising a twin network structure and a distance measurement layer, wherein the twin network structure comprises paired input, feature extraction and pooling layers; the feature extraction layer adopts a BERT model, and the BERT model and pooling layer parameters in the training prediction network are initialized with the weight parameters of the BertAttTL transfer learning sentence vector model obtained in step 2); question and answer sentence vectors are output through the pooling layer and input into the distance measurement layer to obtain their semantic similarity, which is split at a threshold into a binary similar/dissimilar value output as the prediction content; the training prediction network is trained with the training prediction data set obtained in step 1), the finally obtained binary values are matched against the real labels using the MRR (mean reciprocal rank) and Precision@K evaluation index method, and the network parameters corresponding to the model with the highest matching score are selected to obtain the trained training prediction network;
4) inputting the question to be processed and the answer texts into the training prediction network obtained in step 3), and outputting the binary classification values of all candidate answers to obtain the final answer to the question to be processed.
2. The method as claimed in claim 1, wherein the semantic similarity calculation in step 3) adopts any one of the cosine distance, the Manhattan distance, the Euclidean metric and the dot product metric.
CN202010043764.4A 2020-01-15 2020-01-15 Long text answer selection method based on transfer learning sentence vector Active CN111259127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010043764.4A CN111259127B (en) 2020-01-15 2020-01-15 Long text answer selection method based on transfer learning sentence vector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010043764.4A CN111259127B (en) 2020-01-15 2020-01-15 Long text answer selection method based on transfer learning sentence vector

Publications (2)

Publication Number Publication Date
CN111259127A CN111259127A (en) 2020-06-09
CN111259127B true CN111259127B (en) 2022-05-31

Family

ID=70946960

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010043764.4A Active CN111259127B (en) 2020-01-15 2020-01-15 Long text answer selection method based on transfer learning sentence vector

Country Status (1)

Country Link
CN (1) CN111259127B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831789B (en) * 2020-06-17 2023-10-24 广东工业大学 Question-answering text matching method based on multi-layer semantic feature extraction structure
CN112966518B (en) * 2020-12-22 2023-12-19 西安交通大学 High-quality answer identification method for large-scale online learning platform
CN114691815A (en) * 2020-12-25 2022-07-01 科沃斯商用机器人有限公司 Model training method and device, electronic equipment and storage medium
CN112800196B (en) * 2021-01-18 2024-03-01 南京明略科技有限公司 FAQ question-answering library matching method and system based on twin network
CN112784130B (en) * 2021-01-27 2022-05-27 杭州网易云音乐科技有限公司 Twin network model training and measuring method, device, medium and equipment
CN112667799B (en) * 2021-03-15 2021-06-01 四川大学 Medical question-answering system construction method based on language model and entity matching
CN113221530B (en) * 2021-04-19 2024-02-13 杭州火石数智科技有限公司 Text similarity matching method and device, computer equipment and storage medium
CN113159187B (en) * 2021-04-23 2024-06-14 北京金山数字娱乐科技有限公司 Classification model training method and device and target text determining method and device
CN113987156B (en) * 2021-12-21 2022-03-22 飞诺门阵(北京)科技有限公司 Long text generation method and device and electronic equipment
CN114693396A (en) * 2022-02-28 2022-07-01 广州华多网络科技有限公司 Address information matching method and device, equipment, medium and product thereof
CN114757208B (en) * 2022-06-10 2022-10-21 荣耀终端有限公司 Question and answer matching method and device
CN116720503A (en) * 2023-03-13 2023-09-08 吉林省元启科技有限公司 On-line learning system answer discrimination method based on tree analysis coding
CN116522165B (en) * 2023-06-27 2024-04-02 武汉爱科软件技术股份有限公司 Public opinion text matching system and method based on twin structure

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846126A (en) * 2018-06-29 2018-11-20 北京百度网讯科技有限公司 Generation, question and answer mode polymerization, device and the equipment of related question polymerization model
CN110532397A (en) * 2019-07-19 2019-12-03 平安科技(深圳)有限公司 Answering method, device, computer equipment and storage medium based on artificial intelligence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10678830B2 (en) * 2018-05-31 2020-06-09 Fmr Llc Automated computer text classification and routing using artificial intelligence transfer learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846126A (en) * 2018-06-29 2018-11-20 北京百度网讯科技有限公司 Generation, question and answer mode polymerization, device and the equipment of related question polymerization model
CN110532397A (en) * 2019-07-19 2019-12-03 平安科技(深圳)有限公司 Answering method, device, computer equipment and storage medium based on artificial intelligence

Also Published As

Publication number Publication date
CN111259127A (en) 2020-06-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant