CN114357120A - Non-supervision type retrieval method, system and medium based on FAQ - Google Patents

Non-supervision type retrieval method, system and medium based on FAQ

Info

Publication number
CN114357120A
CN114357120A (application CN202210032823.7A)
Authority
CN
China
Prior art keywords
question
answer pair
answer
similarity
training
Prior art date
Legal status
Pending
Application number
CN202210032823.7A
Other languages
Chinese (zh)
Inventor
吴育人
杨翰章
庄伯金
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210032823.7A
Publication of CN114357120A


Abstract

The application provides an unsupervised FAQ-based retrieval method, system and medium. The similarity between the user query information and each question-answer pair is calculated with the BM25 algorithm to obtain a first candidate question-answer pair sequence; the similarity between the user query information and each question-answer pair document is calculated with a max-passage algorithm to obtain a second candidate question-answer pair sequence; the first candidate sequence is input into a first pre-trained BERT model to obtain a third candidate sequence and corresponding similarity scores, and into a second pre-trained BERT model to obtain a fourth candidate sequence and corresponding similarity scores; the similarity scores corresponding to the second, third and fourth candidate sequences are then fused to obtain a similarity fusion score, by which the final candidate question-answer pairs are ranked. The method requires no matching labels between user queries and question-answer pairs, and fusing the three rankings with a linear-sum combination algorithm and a set-rearrangement algorithm further improves the overall effect.

Description

Non-supervision type retrieval method, system and medium based on FAQ
Technical Field
The present application belongs to the field of information retrieval technologies, and in particular, to an unsupervised retrieval method, system and medium based on FAQ.
Background
An existing FAQ (Frequently Asked Questions)-based QA retrieval system stores question-answer tuples (q, a), each consisting of a question (q) and its answer (a). Such a system retrieves mainly in two directions: first, matching the user query against the answer (a) of an existing question-answer tuple (q, a); second, matching the user query against the question (q) of a tuple. In both directions, current matching methods fall into two specific types: (i) supervised model training with deep neural networks, convolutional neural networks and the like, followed by retrieval matching through the trained model; (ii) supervised matching of the user query against the existing questions and answers based on the BERT model.
When supervised model training is performed, a large amount of matched user query (q) and question-answer tuple (q, a) data is needed, but such data is difficult to acquire: it is obtained either by manually labeling a large number of user queries against question-answer tuples, or by text mining from large volumes of user logs. Supervised training models are therefore limited to some extent by data acquisition.
When the corresponding labeled data is missing, only unsupervised training models can be used.
However, current unsupervised FAQ-based models mainly use traditional retrieval-matching techniques such as semantic matching and query expansion; they cannot exploit emerging natural language processing technology, and their retrieval is inaccurate and their effect poor.
Disclosure of Invention
In the FAQ-based unsupervised retrieval method, system and medium of the present application, the retrieval model needs no matching labels between user queries and question-answer tuples and is trained without supervision on question-answer tuple data alone; this breaks the limitation imposed by data-acquisition difficulty while ensuring wide applicability, and reduces the dependence on data labeled by service personnel.
According to a first aspect of the embodiments of the present application, there is provided an unsupervised search method based on FAQ, specifically including the following steps:
calculating the similarity between the user query information and each question-answer pair with the BM25 algorithm, and pre-ranking all question-answer pairs from high to low by the obtained BM25 similarity to obtain a first candidate question-answer pair sequence;
calculating the similarity between the user query information and each question-answer pair with a max-passage (maxpsg) algorithm, and re-ranking the first candidate question-answer pair sequence from high to low by the obtained maxpsg similarity to obtain a second candidate question-answer pair sequence;
inputting the first candidate question-answer pair sequence into a first pre-trained BERT model to obtain a third candidate question-answer pair sequence and corresponding similarity scores, the training data of the first pre-trained BERT model being each question-answer pair and its corresponding wrong-answer question-answer pair;
inputting the first candidate question-answer pair sequence into a second pre-trained BERT model to obtain a fourth candidate question-answer pair sequence and corresponding similarity scores, the training data of the second pre-trained BERT model being similar question pairs formed by user query samples and their matching questions, and dissimilar question pairs formed by user query samples and mismatched questions;
and fusing the similarity scores corresponding to the second, third and fourth candidate question-answer pair sequences to obtain a similarity fusion score, and ranking the final candidate question-answer pair sequence by the similarity fusion score.
In some embodiments of the present application, calculating the similarity between the user query information and each question-answer pair through the BM25 algorithm specifically includes:
merging the questions and answers of each question-answer pair to obtain a first merged question-answer pair document;
The BM25 similarity between the user query information and each first merged question-answer pair document is calculated with the BM25 algorithm; the similarity BM25(Q, d) is computed as:

$$\mathrm{BM25}(Q,d)=\sum_{i}\mathrm{IDF}(q_i)\cdot\frac{freq(q_i,d)\cdot(k+1)}{freq(q_i,d)+k\cdot\left(1-b+b\cdot\frac{|d|}{|d|_{avg}}\right)}$$

where Q is the user query information; $q_i$ is each single-character token of Q; d is a question-answer pair document; $|d|$ is the document length; $|d|_{avg}$ is the average length of all documents; k and b are tunable parameters; and $freq(q_i, d)$ is the frequency of token $q_i$ in document d.

$\mathrm{IDF}(q_i)$ is the weight of each token $q_i$ computed with the IDF function:

$$\mathrm{IDF}(q_i)=\log\frac{N-n(q_i)+0.5}{n(q_i)+0.5}$$

where N is the total number of documents and $n(q_i)$ is the number of documents containing $q_i$.
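The BM25 scoring described above can be sketched in a few lines. This is an illustrative implementation, not the patent's code: tokens are single characters as the description specifies, and the default parameter values k = 1.5 and b = 0.75 are common choices assumed here, not values given by the patent.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k=1.5, b=0.75):
    """Score every document against the query with BM25.

    query: string of single-character tokens q_i.
    docs:  list of merged question-answer pair documents.
    """
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    # n(q_i): number of documents containing each query character
    df = {q: sum(1 for d in docs if q in d) for q in set(query)}
    scores = []
    for d in docs:
        freq = Counter(d)  # freq(q_i, d) per character
        s = 0.0
        for q in query:
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))
            tf = freq[q]
            s += idf * tf * (k + 1) / (tf + k * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return scores
```

Documents sharing more query characters receive higher scores; a document sharing none scores zero.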
In some embodiments of the present application, calculating the similarity between the user query information and each question-answer pair with the max-passage algorithm, and re-ranking the first candidate question-answer pair sequence from high to low by the similarity score to obtain a second candidate question-answer pair sequence, specifically includes:
merging the questions and answers of each question-answer pair to obtain a second merged question-answer pair document;
cutting sliding-window fragments from the second merged question-answer pair document to obtain question-answer pair document fragments;
calculating the maxpsg similarity between the user query information and each second merged question-answer pair document with the max-passage algorithm; the similarity maxpsg(Q, d) is computed as:

$$\mathrm{maxpsg}(Q,d)=\sum_i w_i\cdot\max_{d_l}\{R(q_i,d_l)\}$$

where Q is the user query information; $q_i$ is each single-character token of Q; d is a question-answer pair document; $d_l$ is a question-answer pair document fragment; and $w_i$ is the weight of token $q_i$.

$R(q_i, d_l)$ is the relevance score between each document fragment $d_l$ and query token $q_i$:

$$R(q_i,d_l)=\frac{freq(q_i,d_l)\cdot(k+1)}{freq(q_i,d_l)+k\cdot\left(1-b+b\cdot\frac{|d_l|}{|d|_{avg}}\right)}$$

where $freq(q_i, d_l)$ is the frequency of token $q_i$ in fragment $d_l$; $|d_l|$ is the fragment length; $|d|_{avg}$ is the average length of all documents; and k and b are tunable parameters.

$\max\{R(q_i,d_l)\}$ is the maximum of the relevance scores between the fragments of a question-answer pair document and query token $q_i$.
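The max-passage computation can be sketched as follows. This is an illustrative implementation under stated assumptions: the window size (50) and stride (25), the parameter values k = 1.5 and b = 0.75, and the uniform default token weights are all placeholders, not values from the patent.

```python
def sliding_windows(doc, size=50, stride=25):
    """Cut a merged question-answer document into overlapping fragments d_l."""
    if len(doc) <= size:
        return [doc]
    return [doc[i:i + size] for i in range(0, len(doc) - size + 1, stride)]

def maxpsg_score(query, doc, avg_len, k=1.5, b=0.75, weights=None):
    """max-passage scoring: for each query character q_i take the best
    fragment-level relevance R(q_i, d_l), then sum with weights w_i."""
    fragments = sliding_windows(doc)
    score = 0.0
    for i, q in enumerate(query):
        w = 1.0 if weights is None else weights[i]
        best = 0.0
        for frag in fragments:
            tf = frag.count(q)  # freq(q_i, d_l)
            r = tf * (k + 1) / (tf + k * (1 - b + b * len(frag) / avg_len))
            best = max(best, r)
        score += w * best
    return score
```

Because only the best fragment counts per token, a long document with one highly relevant passage is not penalized by its unrelated remainder.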
In some embodiments of the present application, before the first candidate question-answer pair sequence is input into the first pre-trained BERT model to obtain the third candidate question-answer pair sequence and corresponding similarity scores, the method further includes training the first pre-trained BERT model, with the following specific steps:
inputting the first unlabeled question-answer pair corpus into a BERT model and jointly training it on the masked language model task and the next-sentence prediction task to obtain a first rough-training BERT model;
for each question-answer pair, randomly selecting, guided by the similarity between the user query sample and each question-answer pair, a wrong answer from the answer set to replace the answer of that pair, obtaining the wrong-answer question-answer pair corresponding to each question-answer pair;
and inputting each question-answer pair and its corresponding wrong-answer question-answer pair into the first rough-training BERT model for training, updating the model parameters by minimizing the loss function to obtain the first pre-trained BERT model.
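Building the wrong-answer training pairs can be sketched as below. This is a simplified illustration: it samples the wrong answer uniformly at random from the other pairs' answers, whereas the patent guides the selection by query-pair similarity; label 1 marks the true pair and 0 the wrong-answer pair.

```python
import random

def build_negatives(qa_pairs, seed=0):
    """For each (q, a) pair, emit the positive (q, a, 1) and one
    wrong-answer negative (q, a_wrong, 0) drawn from other pairs."""
    rng = random.Random(seed)
    answers = [a for _, a in qa_pairs]
    examples = []
    for i, (q, a) in enumerate(qa_pairs):
        examples.append((q, a, 1))
        # replace the answer with one belonging to a different pair
        wrong = rng.choice([ans for j, ans in enumerate(answers) if j != i])
        examples.append((q, wrong, 0))
    return examples
```

The resulting labeled pairs are what the rough-training BERT model is fine-tuned on.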
In some embodiments of the present application, before the first candidate question-answer pair sequence is input into the second pre-trained BERT model to obtain the fourth candidate question-answer pair sequence and corresponding similarity scores, the method further includes training the second pre-trained BERT model, with the following specific steps:
pre-training a generative two-stage language model, which is used to generate a set of user query samples;
inputting the second unlabeled question-answer pair corpus into a BERT model and jointly training it on the masked language model task and the next-sentence prediction task to obtain a second rough-training BERT model;
inputting similar question pairs, each formed by a user query sample from the generated set and its matching question, and dissimilar question pairs, each formed by a user query sample and a mismatched question, into the second rough-training BERT model for training to obtain the second pre-trained BERT model; the matching question is the question among the question-answer pairs most relevant to the user query sample, and the mismatched question is a question randomly drawn from the question-answer pairs.
In some embodiments of the present application, the pre-training of the generative two-stage language model specifically includes the following steps:
splitting the user query sample into single-character tokens and, through word-vector and position-vector conversion, obtaining an encoding matrix that combines the word vectors and position vectors;
transforming the encoding matrix several times through a multi-head self-attention mechanism and fully connected layers to obtain input matrix data;
inputting the input matrix data into a language model for training, and updating the model parameters through a maximum likelihood function to obtain the first-stage language model;
merging the questions and answers of each question-answer pair to obtain a third merged question-answer pair document, and extracting runs of consecutive character tokens of a certain length from the document with a sliding window to obtain fine-tuning input data;
inputting the fine tuning input data into the first-stage language model for training, and updating model parameters through a maximum likelihood function to obtain a generative two-stage language model.
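The first step above, combining a word vector and a position vector per token, can be sketched as follows. This is a minimal illustration only: the embedding tables are randomly initialised, and the dimensions (d_model = 16, max_len = 128) are placeholder values, not the patent's configuration.

```python
import numpy as np

def encode(tokens, vocab, d_model=16, max_len=128, seed=0):
    """Build the encoding matrix: token embedding + position embedding.

    vocab maps each single-character token to an integer id; one row of
    the result is tok_emb[id] + pos_emb[position].
    """
    rng = np.random.default_rng(seed)
    tok_emb = rng.normal(size=(len(vocab), d_model))   # word-vector table
    pos_emb = rng.normal(size=(max_len, d_model))      # position-vector table
    ids = [vocab[t] for t in tokens]
    return np.stack([tok_emb[i] + pos_emb[p] for p, i in enumerate(ids)])
```

The resulting matrix (sequence length × d_model) is what the multi-head self-attention layers then transform.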
In some embodiments of the present application, fusing the similarity scores corresponding to the second, third and fourth candidate question-answer pair sequences to obtain a similarity fusion score and a final candidate question-answer pair sequence ordered by it specifically includes:
normalizing the similarity scores corresponding to the second, third and fourth candidate question-answer pair sequences, then obtaining, through a linear-sum combination algorithm, a first fusion similarity score and a first fused candidate question-answer pair sequence ordered by it;
selecting a preset number of question-answer pairs from the first fused candidate question-answer pair sequence, from high to low by the first fusion similarity score, to obtain a first fused question-answer pair set; and from this set, obtaining through a set-rearrangement algorithm a second fusion similarity score and the final candidate question-answer pair sequence ordered by it.
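The normalization and linear-sum combination step can be sketched as follows. This is an illustrative implementation assuming min-max normalization and equal ranker weights; the patent does not fix either choice.

```python
def minmax(scores):
    """Scale one ranker's scores to [0, 1] (constant lists map to 0)."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]

def fuse(score_lists, weights=(1.0, 1.0, 1.0), top_k=10):
    """Normalize each ranker's scores, combine them by weighted linear sum,
    and return (top_k candidate indices by fused score, all fused scores)."""
    normed = [minmax(s) for s in score_lists]
    fused = [sum(w * n[i] for w, n in zip(weights, normed))
             for i in range(len(normed[0]))]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:top_k], fused
```

Normalizing first matters because the three rankers (maxpsg and the two BERT models) produce scores on different scales.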
According to a second aspect of the embodiments of the present application, there is provided an unsupervised search system based on FAQ, specifically including:
a first candidate question-answer pair module: used for calculating the similarity between the user query information and each question-answer pair with the BM25 algorithm, and pre-ranking all question-answer pairs from high to low by the obtained BM25 similarity to obtain a first candidate question-answer pair sequence;
a second candidate question-answer pair module: used for calculating the similarity between the user query information and each question-answer pair with the max-passage algorithm, and re-ranking the first candidate question-answer pair sequence from high to low by the obtained maxpsg similarity to obtain a second candidate question-answer pair sequence;
a third candidate question-answer pair module: used for inputting the first candidate question-answer pair sequence into the first pre-trained BERT model to obtain a third candidate question-answer pair sequence and corresponding similarity scores, the training data of the first pre-trained BERT model being each question-answer pair and its corresponding wrong-answer question-answer pair;
a fourth candidate question-answer pair module: used for inputting the first candidate question-answer pair sequence into the second pre-trained BERT model to obtain a fourth candidate question-answer pair sequence and corresponding similarity scores, the training data of the second pre-trained BERT model being similar question pairs formed by user query samples and their matching questions, and dissimilar question pairs formed by user query samples and mismatched questions;
a final candidate question-answer pair module: used for fusing the similarity scores corresponding to the second, third and fourth candidate question-answer pair sequences to obtain a similarity fusion score and a final candidate question-answer pair sequence ordered by it.
According to a third aspect of the embodiments of the present application, there is provided an unsupervised FAQ-based search apparatus, including:
a memory: for storing executable instructions; and
and a processor, connected with the memory to execute the executable instructions and thereby complete the FAQ-based unsupervised retrieval method.
According to a fourth aspect of embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon; the computer program is executed by the processor to implement the FAQ-based unsupervised retrieval method.
By adopting the FAQ-based unsupervised retrieval method, system and medium, the similarity between the user query information and each question-answer pair is calculated with the BM25 algorithm, and all question-answer pairs are pre-ranked from high to low by this similarity to obtain a first candidate question-answer pair sequence; the similarity between the user query information and each question-answer pair document is calculated with the max-passage algorithm, and the first candidate sequence is re-ranked from high to low by that similarity score to obtain a second candidate question-answer pair sequence; the first candidate sequence is input into a first pre-trained BERT model to obtain a third candidate question-answer pair sequence and corresponding similarity scores, the training data of the first pre-trained BERT model being each question-answer pair and its corresponding wrong-answer question-answer pair; the first candidate sequence is input into a second pre-trained BERT model to obtain a fourth candidate question-answer pair sequence and corresponding similarity scores, the training data of the second pre-trained BERT model being similar question pairs formed by user queries and matching questions, and dissimilar question pairs formed by user queries and mismatched questions; and the similarity scores corresponding to the second, third and fourth candidate question-answer pair sequences are fused to obtain a similarity fusion score and a final candidate question-answer pair sequence ordered by it.
In the present application, unsupervised training uses only question-answer pair data, with no matching labels needed between user queries and question-answer pairs. On top of BM25 pre-ranking, the max-passage algorithm and two BERT-based ranking algorithms are applied separately, the similarity scores of the three rankings are fused, and the question-answer pairs are re-ranked, so that the retrieval accuracy of the FAQ system is further improved while the data-labeling workload of service personnel is reduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic diagram illustrating steps of an unsupervised FAQ-based retrieval method according to an embodiment of the present application;
fig. 2 is a schematic flow diagram of an FAQ-based unsupervised retrieval method according to the present application;
fig. 3 is a schematic flow diagram of fine-tuning the BERT model according to the present application;
fig. 4 is a schematic structural diagram of an FAQ-based unsupervised retrieval system according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an unsupervised FAQ-based search device according to an embodiment of the present application.
Detailed Description
In the process of realizing the present application, the inventors found that supervised training models are limited to some extent by data-acquisition difficulty, and that when the corresponding labeled data is missing only unsupervised training models can be used; however, current unsupervised FAQ-based models mainly use traditional retrieval-matching techniques such as semantic matching and query expansion, cannot exploit emerging natural language processing technology, and perform poorly in application.
The retrieval model used by the present application is trained without supervision on question-answer tuple data alone and needs no matching labels between user queries and question-answer tuples; it breaks the limitation imposed by data-acquisition difficulty while remaining widely applicable, and reduces the dependence on data labeled by service personnel.
In particular, the method proceeds as follows:
the similarity between the user query information and each question-answer pair is calculated with the BM25 algorithm, and all question-answer pairs are pre-ranked from high to low by this similarity to obtain a first candidate question-answer pair sequence; the similarity between the user query information and each question-answer pair document is calculated with the max-passage algorithm, and the first candidate sequence is re-ranked from high to low by that similarity score to obtain a second candidate question-answer pair sequence; the first candidate sequence is input into a first pre-trained BERT model to obtain a third candidate question-answer pair sequence and corresponding similarity scores, the training data of the first pre-trained BERT model being each question-answer pair and its corresponding wrong-answer question-answer pair; the first candidate sequence is input into a second pre-trained BERT model to obtain a fourth candidate question-answer pair sequence and corresponding similarity scores, the training data of the second pre-trained BERT model being similar question pairs formed by user queries and matching questions, and dissimilar question pairs formed by user queries and mismatched questions; and the similarity scores of the second, third and fourth candidate question-answer pair sequences are fused to obtain a similarity fusion score and a final candidate question-answer pair sequence ordered by it.
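The overall flow above can be sketched as a pipeline. This is a schematic only: the five scoring callables stand in for the components described in this document, and the candidate-pool size of 100 is an assumed value, not one given by the patent.

```python
def retrieve(query, qa_pairs, bm25, maxpsg, bert_qa, bert_qq, fuse, pool=100):
    """FAQ retrieval flow: BM25 pre-ranking, three parallel re-scorings
    of the candidate pool, then score fusion and a final re-ranking.

    bm25, maxpsg, bert_qa, bert_qq: callables (query, pair) -> score.
    fuse: callable taking the three score lists -> one fused score list.
    """
    # Step 1: BM25 pre-ranking yields the first candidate sequence.
    first = sorted(qa_pairs, key=lambda p: bm25(query, p), reverse=True)[:pool]
    # Steps 2-4: re-score the same pool three ways.
    s2 = [maxpsg(query, p) for p in first]
    s3 = [bert_qa(query, p) for p in first]
    s4 = [bert_qq(query, p) for p in first]
    # Step 5: fuse and re-rank.
    fused = fuse([s2, s3, s4])
    order = sorted(range(len(first)), key=lambda i: fused[i], reverse=True)
    return [first[i] for i in order]
```

Note that the three re-rankers all score the BM25 pool, so only the pre-ranking ever touches the full question-answer collection.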
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and are not exhaustive of all embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Example 1
Fig. 1 is a schematic diagram illustrating steps of an unsupervised FAQ-based search method according to an embodiment of the present application.
As shown in fig. 1, the unsupervised search method based on FAQ in the embodiment of the present application specifically includes the following steps:
s101: and calculating the similarity between the user query information and each question-answer pair through a BM25 algorithm, and pre-sequencing all question-answer pairs from high to low according to the similarity to obtain a first candidate question-answer pair sequence.
The method specifically comprises the following steps:
firstly, combining the question and the answer of each question-answer pair to obtain a combined question-answer pair document.
Then, the BM25 similarity between the user query information and each question-answer pair document is calculated with the BM25 algorithm; the similarity BM25(Q, d) is computed as:

$$\mathrm{BM25}(Q,d)=\sum_{i}\mathrm{IDF}(q_i)\cdot\frac{freq(q_i,d)\cdot(k+1)}{freq(q_i,d)+k\cdot\left(1-b+b\cdot\frac{|d|}{|d|_{avg}}\right)}$$

where Q is the user query information; $q_i$ is each single-character token of Q; d is a question-answer pair document; $|d|$ is the document length; $|d|_{avg}$ is the average length of all documents; k and b are tunable parameters; and $freq(q_i, d)$ is the frequency of token $q_i$ in document d.

$\mathrm{IDF}(q_i)$ is the weight of each token $q_i$ computed with the IDF function:

$$\mathrm{IDF}(q_i)=\log\frac{N-n(q_i)+0.5}{n(q_i)+0.5}$$

where N is the total number of documents and $n(q_i)$ is the number of documents containing $q_i$.
S102: calculating the similarity between the user query information and each question-answer pair document with the max-passage algorithm, and re-ranking the first candidate question-answer pair sequence from high to low by the similarity score to obtain a second candidate question-answer pair sequence.
The method specifically comprises the following steps:
firstly, combining the question and the answer of each question-answer pair to obtain a combined question-answer pair document.
Then, sliding-window fragments are cut from the merged question-answer pair document to obtain question-answer pair document fragments.
Finally, the maxpsg similarity between the user query information and each question-answer pair document is calculated with the max-passage algorithm; the similarity maxpsg(Q, d) is computed as:

$$\mathrm{maxpsg}(Q,d)=\sum_i w_i\cdot\max_{d_l}\{R(q_i,d_l)\}$$

where Q is the user query information; $q_i$ is each single-character token of Q; d is a question-answer pair document; $d_l$ is a question-answer pair document fragment; and $w_i$ is the weight of token $q_i$.

$R(q_i, d_l)$ is the relevance score between each document fragment $d_l$ and query token $q_i$:

$$R(q_i,d_l)=\frac{freq(q_i,d_l)\cdot(k+1)}{freq(q_i,d_l)+k\cdot\left(1-b+b\cdot\frac{|d_l|}{|d|_{avg}}\right)}$$

where $freq(q_i, d_l)$ is the frequency of token $q_i$ in fragment $d_l$; $|d_l|$ is the fragment length; $|d|_{avg}$ is the average length of all documents; and k and b are tunable parameters.
S103: inputting the first candidate question-answer pair sequence into a first pre-trained BERT model to obtain a third candidate question-answer pair sequence and corresponding similarity scores; the training data of the first pre-trained BERT model are each question-answer pair and its corresponding wrong-answer question-answer pair.
Before the first candidate question-answer pair sequence is input into the first pre-trained BERT model to obtain the third candidate question-answer pair sequence and corresponding similarity scores, the method further includes training the first pre-trained BERT model, with the following specific steps:
Firstly, the unlabeled question-answer pair corpus is input into a BERT model and jointly trained on the masked language model task and the next-sentence prediction task to obtain a first rough-training BERT model.
Then, guided by the similarity between the user query information and each question-answer pair, a wrong answer is randomly selected from the answer set to replace the answer of the corresponding question-answer pair, obtaining the wrong-answer question-answer pair for each question-answer pair.
Finally, each question-answer pair and its corresponding wrong-answer question-answer pair are input into the first rough-training BERT model for training, and the model parameters are updated by minimizing the loss function to obtain the first pre-trained BERT model.
S104: inputting the first candidate question-answer pair sequence into a second pre-trained BERT model to obtain a fourth candidate question-answer pair sequence and corresponding similarity scores; the training data of this pre-trained BERT model are similar question pairs formed by user queries and matching questions, and dissimilar question pairs formed by user queries and mismatched questions.
Before the first candidate question-answer pair sequence is input into the second pre-trained BERT model to obtain the fourth candidate question-answer pair sequence and corresponding similarity scores, the method further includes training the second pre-trained BERT model, with the following specific steps:
Firstly, a generative two-stage language model is trained in advance and used to generate a set of user query information.
Then, the unlabeled question-answer pair corpus is input into a BERT model and jointly trained on the masked language model task and the next-sentence prediction task to obtain a second rough-training BERT model.
Finally, similar question pairs, each formed by a user query from the generated set and its matching question, and dissimilar question pairs, each formed by a user query and a mismatched question, are input into the second rough-training BERT model for training to obtain the second pre-trained BERT model; the matching question is the question among the question-answer pairs most relevant to the user query, and the mismatched question is a question randomly drawn from the question-answer pairs.
The pre-training generation type two-stage language model specifically comprises the following steps:
firstly, each semantic segmentation single character of user query information is converted through a word vector and a position vector to obtain a coding matrix combining the word vector and the position vector.
And then, converting the coding matrix for multiple times through a multi-head self-attention mechanism and a full connection layer to obtain input matrix data.
Secondly, inputting the input matrix data into a language model for training, and updating model parameters through a maximum likelihood function to obtain a first-stage language model.
Then, combining the question and the answer of each question-answer pair to obtain a merged question-answer pair document; continuous tokens of a certain length are taken out of the merged document in a sliding window manner to obtain the fine tuning input data.
And finally, inputting the fine tuning input data into the first-stage language model for training, and updating model parameters through a maximum likelihood function to obtain a generative two-stage language model.
S105: and fusing the similarity scores corresponding to the second candidate question-answer pair sequence, the third candidate question-answer pair sequence and the fourth candidate question-answer pair sequence to obtain a similarity fusion score and a final candidate question-answer pair sequence which is correspondingly ordered.
The method specifically comprises the following steps:
firstly, after the similarity scores corresponding to the second candidate question-answer pair sequence, the third candidate question-answer pair sequence and the fourth candidate question-answer pair sequence are normalized, a first fusion similarity score and a first fused candidate question-answer pair sequence which is correspondingly ordered are obtained through a linear summation merging algorithm.
Secondly, selecting a certain number of question-answer pairs from the first fused candidate question-answer pair sequence according to the first fused similarity score from high to low to obtain a first fused question-answer pair set; and according to the first fused question-answer pair set, obtaining a second fused similarity score and a final candidate question-answer pair sequence which is correspondingly ordered by a set rearrangement algorithm.
To further illustrate the examples of the present application, an unsupervised FAQ-based search scheme for specific implementations is given below.
A flow diagram of an unsupervised FAQ-based retrieval method according to the present application is shown in fig. 2.
The problem solved by the present application includes: given a user query (Q), sorting the FAQ question-answer pairs according to their relevance to the query. As shown in the flowchart of fig. 2, the flow of the FAQ-based unsupervised retrieval method of the present application is summarized as follows:
procedure (one) pre-ranks all FAQ question-answer pairs: according to a given user query Q, similarity between the user query Q and each existing FAQ question-answer pair is calculated by using a BM25 algorithm and sequenced, and the top k FAQ question-answer pairs with the highest similarity are selected as candidate FAQ question-answer pairs, namely a first candidate question-answer pair sequence. The question-answer pair of the present application is a question-answer tuple (q, a).
The process (two) uses three sub-ranking processes to rank the candidate FAQ question-answer pairs: sub-ranking flow 1 uses the maximum-passage algorithm to rank according to the similarity between Q and each question-answer tuple (q, a), yielding the second candidate question-answer pair sequence; sub-ranking flow 2 fine-tunes the BERT model with the existing question-answer tuples to obtain a ranking, yielding the third candidate question-answer pair sequence; sub-ranking flow 3 uses the generative two-stage language model to generate related questions (q') and fine-tunes the BERT model with the question pairs (q, q') to obtain a ranking, yielding the fourth candidate question-answer pair sequence.
And (3) fusing sequencing results of the flow (III): and normalizing and summing the three matching scores by using a linear summation merging algorithm and a set rearrangement algorithm to obtain a final FAQ question-answer pair sorting score according to the sorting results of the three sorting submodels, namely the second candidate question-answer pair sequence, the third candidate question-answer pair sequence and the fourth candidate question-answer pair sequence.
Regarding the procedure (I):
the BM25 algorithm is a ranking function in information retrieval; it weights and sums the relevance between each segmented token of the user query Q and the document d to obtain a final ranking score that can be used to rank the relevance of the user query Q to the document d. In this application, document d represents a merged question-answer tuple.
The similarity BM25(Q, d), i.e. the ranking score, is calculated by the following formula:

BM25(Q, d) = Σ_i IDF(q_i) · R(q_i, d)

where

IDF(q_i) = log( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) )

represents the weight of each segmented token q_i calculated using the IDF function; N is the number of all documents; n(q_i) is the number of documents containing q_i.

where

R(q_i, d) = ( freq(q_i, d) · (k + 1) ) / ( freq(q_i, d) + k · (1 - b + b · |d| / |d|_avg) )

is the relevance of token q_i to document d (referred to as the BM25 similarity).

freq(q_i, d) is the frequency of token q_i in document d, |d| is the length of the document, |d|_avg is the average length of all documents, and k and b are adjustable parameters.
In the pre-sorting step, the BM25 algorithm calculates a relevance sorting score between each question-answer tuple (Q, a) and the user query Q, and selects the top k with the highest relevance as candidate question-answer tuples according to the score. And finally, obtaining a first candidate question-answer pair sequence.
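To make the pre-ranking step concrete, a minimal Python sketch of BM25 scoring over merged question-answer documents is given below; the tokenization, the parameter values k = 1.5 and b = 0.75, and the +1 inside the logarithm (a common variant that keeps the IDF weight non-negative) are illustrative assumptions, not the patent's implementation.

```python
import math
from collections import Counter

def bm25_rank(query_tokens, docs, k=1.5, b=0.75, top_k=10):
    """Score each merged question-answer document against the query with
    BM25 and return the indices of the top_k documents, i.e. the
    first candidate question-answer pair sequence."""
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    # document frequency n(q_i) of each query token
    df = {t: sum(1 for d in docs if t in d) for t in set(query_tokens)}

    def idf(t):
        # +1 keeps the weight non-negative when a token is very frequent
        return math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))

    scores = []
    for d in docs:
        freq = Counter(d)
        s = 0.0
        for t in query_tokens:
            f = freq[t]
            s += idf(t) * f * (k + 1) / (f + k * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return sorted(range(N), key=lambda i: scores[i], reverse=True)[:top_k]
```

Returning only the top k indices mirrors the pre-ranking step's selection of candidate question-answer tuples.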
Regarding the procedure (ii):
Sub-ranking flow 1: firstly, for each merged question-answer tuple document d, the maximum-passage algorithm intercepts sliding-window segments d_l and calculates the relevance score between each segment d_l and each query token q_i, taking the maximum as the relevance score between document d and query token q_i; then the relevance scores between document d and each query token q_i are weighted and summed to obtain the relevance between the user query Q and document d.

Therefore, the similarity maxpsg(Q, d) between the user query information Q and each question-answer pair document d is specifically calculated as:

maxpsg(Q, d) = Σ_i w_i · max_{d_l ⊆ d} R(q_i, d_l)

where w_i represents the token weight and R(q_i, d_l) represents the relevance of token q_i to segment d_l. The present application uses the aforementioned IDF function to calculate the token weight, and R(q_i, d_l) adopts the BM25 similarity calculation described above.
The first candidate question-answer pair sequence pre-ranked in the process (I) of the application obtains a new ranking through the sub-ranking model, namely a second candidate question-answer pair sequence.
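A corresponding Python sketch of the maximum-passage similarity follows; the sliding-window length and the reuse of the BM25-style term relevance over segments follow the description above, while the concrete parameter values are assumptions of this sketch.

```python
import math
from collections import Counter

def max_passage_score(query_tokens, doc, all_docs, window=5, k=1.5, b=0.75):
    """maxpsg(Q, d): for every query token take the best BM25-style
    relevance over sliding-window segments d_l of the document, weight
    it by the token's IDF, and sum over the query tokens."""
    N = len(all_docs)
    avg_len = sum(len(d) for d in all_docs) / N

    def idf(t):
        df = sum(1 for d in all_docs if t in d)
        return math.log(1 + (N - df + 0.5) / (df + 0.5))

    # sliding-window segments d_l of the merged question-answer document
    segments = [doc[i:i + window] for i in range(max(1, len(doc) - window + 1))]

    def r(t, seg):
        f = Counter(seg)[t]
        return f * (k + 1) / (f + k * (1 - b + b * len(seg) / avg_len))

    return sum(idf(t) * max(r(t, seg) for seg in segments) for t in query_tokens)
```

Scoring every candidate of the first sequence with this function and sorting descending yields the second candidate question-answer pair sequence.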
Sub-sequencing flow 2: the BERT model was trained by q-a matching pairs.
The framework of the BERT model is mainly divided into two steps: in the pre-training step, the BERT model uses large-scale label-free linguistic data, joint training is carried out through a covering type language model task and a next sentence prediction task to obtain a first rough training BERT model, and semantic representation of a text can be obtained through the first rough training BERT model; in the fine tuning step, the BERT model is fine tuned on a specific language processing task using the textual semantic representation obtained in the previous step.
A schematic flow diagram for fine tuning the BERT model according to the present application is shown in fig. 3.
As shown in fig. 3, the correct question-answer tuple (q, a) and the incorrect question-answer tuple (q, a') are passed through the model to obtain the text pair representations BERT(q, a) and BERT(q, a'), which are multiplied with the parameter vector Θ to obtain the similarity scores score1 = BERT(q, a) · Θ and score2 = BERT(q, a') · Θ. The model is trained by minimizing the loss function Loss = max(0, score2 - score1 + m), where m is an adjustable parameter. This yields the fine-tuned BERT model, i.e., the first pre-trained BERT model.
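The margin loss used here can be sketched in a few lines of Python; `encode` stands in for the text-pair score BERT(·,·)·Θ and is a hypothetical placeholder, not the patent's model.

```python
def hinge_loss(score_pos, score_neg, margin=0.5):
    """Loss = max(0, score2 - score1 + m): the correct pair (q, a) must
    score at least `margin` above the wrong pair (q, a')."""
    return max(0.0, score_neg - score_pos + margin)

def train_step(encode, q, a_pos, a_neg, margin=0.5):
    """One illustrative step: score both tuples and return the loss that
    a fine-tuning loop would minimize."""
    score1 = encode(q, a_pos)   # correct question-answer tuple (q, a)
    score2 = encode(q, a_neg)   # wrong tuple (q, a')
    return hinge_loss(score1, score2, margin)
```

When the correct tuple already outscores the wrong one by the margin, the loss is zero and no gradient would flow.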
The first candidate question-answer pair sequence pre-ranked in the process (I) of the application obtains a new ranking through the sub-ranking model, namely a third candidate question-answer pair sequence.
Sub-ordering flow 3: the BERT model was trained by q-q matching pairs.
Training of this model requires (Q, q, q') triple data, where Q is the user query, q is the question matching Q, and q' is a mismatching question randomly drawn from the question set. Since only one-question-one-answer tuple data exists in the application, similar questions need to be generated for the existing question set.
The method and the device adopt a generative two-stage language model to generate the user query information Q required by training.
The generative two-stage language model mainly comprises a pre-training step and a fine tuning step: the pre-training step trains a large-capacity language model on the basis of a large text library, and the fine tuning step performs supervised training in a specific language scene.
Then, on the basis of generating the user query Q required for training, we perform matching ranking by finely tuning the BERT model.
The method for generating the two-stage language model specifically comprises the following steps:
(1) the language model is first trained in advance.
The input data of the pre-training step is the word sequence U = [u_1, u_2, …, u_l] of the tokens u of the user query information. This step goes through the following process:

(i) converting each token into a word vector and a position vector

Each token of the word sequence U = [u_1, u_2, …, u_l] is embedded into a word vector [e_1, e_2, …, e_512] of a high-dimensional space to obtain the word vector matrix W_e; in order to simultaneously express the position information of each token, a position vector [p_1, p_2, …, p_512] is used to represent the position of each token in the sequence, giving the position vector matrix W_p; adding the two encodings gives the final word vector coding matrix h_0 = W_e + W_p.
(ii) Multi-head self-attention mechanism and full-connection layer
In order to capture the relevance between tokens, this application uses a self-attention mechanism to further encode the input word vectors and obtain the hidden states.
For an input word vector, it is converted into a query vector Q', a key vector K and a value vector V, and the weight of each token relative to the other tokens is calculated, with the specific formula:

Attention(Q', K, V) = softmax(Q'K^T / √d_k) · V

where d_k is the dimension of the key vectors.
in order to extract more complete information from the characteristics of different dimensions of the word vector, the application adopts a multi-head self-attention mechanism on the basis of the self-attention mechanism, namely the following formula:
MultiHead(Q', K, V) = [head_1, head_2, …, head_h] W^O

where head_i = Attention(Q'W_i^Q, KW_i^K, VW_i^V); W_i^Q, W_i^K, W_i^V and W^O are learnable projection matrices.
and then, obtaining new output by the output result of the multi-head self-attention mechanism through a full connection layer.
Finally, the process of normalization - multi-head self-attention - normalization - fully connected layer is defined as a transformation process T, and the neural network input data h_0 obtained from the previous step undergoes the transformation process T n times, i.e.:

h_k = T(h_{k-1}), k = 1, …, n.
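The core of the transformation process T can be sketched numerically; the minimal NumPy code below shows the scaled dot-product attention and the multi-head combination described above (layer normalization and the fully connected layer are omitted, and all weight matrices are random stand-ins for learned parameters).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Qp, K, V):
    """Scaled dot-product attention: softmax(Q'K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Qp @ K.T / np.sqrt(d_k)) @ V

def multi_head(x, Wq, Wk, Wv, Wo, h):
    """MultiHead(Q', K, V): project the input, split into h heads,
    attend per head, concatenate, and mix with W^O."""
    d = x.shape[-1]
    dh = d // h
    Qp, K, V = x @ Wq, x @ Wk, x @ Wv
    heads = [attention(Qp[:, i * dh:(i + 1) * dh],
                       K[:, i * dh:(i + 1) * dh],
                       V[:, i * dh:(i + 1) * dh]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo
```

Each head attends over a d/h-dimensional slice of the projections, so different heads can pick up features of different dimensions, as the text describes.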
(iii) likelihood function output

The h_n obtained in the previous step is passed through a softmax function to obtain the conditional probability that the next token is u given the preceding l tokens [u_1, u_2, …, u_l]:

P(u | U) = softmax(h_n W_e^T)

From this we obtain the likelihood function L_1 = Σ_i log P(u_i | U).

In the pre-training step, the neural network parameters are updated by maximizing the likelihood function L_1 to obtain the pre-trained language model, i.e., the first-stage language model.
(2) The first stage language model is fine tuned.
In the fine tuning step, on the basis of the pre-trained first-stage language model, the existing question-answer tuple data is converted into text of the form:

"a_1[SEP]q_1[EOS]a_2[SEP]q_2[EOS]…a_n[SEP]q_n[EOS]"

where [SEP] and [EOS] are special tokens.

The question and answer of each question-answer pair are merged to obtain the merged question-answer pair document; a sliding window of length m takes out m continuous tokens [x_1, x_2, …, x_m] from the merged text, with the next token y as the label, to obtain the input data.
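The sliding-window construction of the fine-tuning input data can be sketched as follows; character-level tokenization is assumed for illustration.

```python
def build_finetune_samples(qa_pairs, m=8, sep="[SEP]", eos="[EOS]"):
    """Merge each (question, answer) pair into the token stream
    a1[SEP]q1[EOS]a2[SEP]q2[EOS]... and slide a window of length m over
    it, yielding (m-token context, next-token label) samples."""
    stream = []
    for q, a in qa_pairs:
        stream.extend(list(a) + [sep] + list(q) + [eos])
    return [(stream[i:i + m], stream[i + m]) for i in range(len(stream) - m)]
```

Each sample pairs an m-token context with the next token y, matching the conditional probability the fine-tuning step maximizes.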
The conditional probability is calculated from this data by the following formula:

P(y | x_1, x_2, …, x_m) = softmax(h_n W_y);

where h_n is the activation result obtained after passing the input data through the n transformation processes T of the pre-training step, and W_y are the parameters for fine tuning the language model.

From this the likelihood function L_2 = Σ_{(x,y)} log P(y | x_1, x_2, …, x_m) is obtained; the language model is fine-tuned by maximizing the likelihood function L_2 and updating the neural network parameters to obtain the final model, i.e., the generative two-stage language model.
Finally, after training, the fine-tuned generative two-stage language model is obtained; the answer a of each question-answer tuple and the special token [SEP] form the input "a[SEP]", which is fed into the model to generate the candidate set {Q} of user queries Q required for training.
For the candidate set { Q }, the BM25 algorithm is used to calculate the correlation degree between each generated Q and the corresponding question-answer tuple (Q, a), and the Q with the highest correlation degree is selected as the user query matched with the question-answer tuple (Q, a) by taking the correlation degree as the standard.
Based on the mutually matched (Q, q) data and the randomly extracted mismatching questions q', the (Q, q, q') triple data can be composed.
Referring to the foregoing fine tuning method of the BERT model, as shown in fig. 3, the similar question pair (Q, q) and the dissimilar question pair (Q, q') are passed through the second rough training BERT model to obtain the text pair representations BERT(Q, q) and BERT(Q, q'), which are multiplied with the parameter vector Θ to obtain the similarity scores score1 = BERT(Q, q) · Θ and score2 = BERT(Q, q') · Θ. The model is trained by minimizing the loss function Loss = max(0, score2 - score1 + m) to obtain the fine-tuned BERT model, i.e., the second pre-trained BERT model.
The first candidate question-answer pair sequence pre-ranked in the process (I) of the application obtains a new ranking through the sub-ranking model, namely a fourth candidate question-answer pair sequence.
Regarding the procedure (three): fusing the ranking results.
Three different groups of question-answer tuple ranking scores are obtained through the three sub-ranking processes. The ranking fusion step needs to integrate the second candidate question-answer pair sequence, the third candidate question-answer pair sequence and the fourth candidate question-answer pair sequence into a final output ranking score; two algorithms are adopted for merging the ranking scores.
1. Linear summation and combination algorithm.
The linear summation merging algorithm normalizes the ranking scores of the three sub-ranking flows and then sums them to obtain the final ranking score Score_c:

Score_c = Σ_{i ∈ Schemes} NormalizedScore_i

where the normalization adopts the min-max method, i.e., for each sub-ranking flow:

NormalizedScore_i = (Score_i - Score_min) / (Score_max - Score_min)

In this step, the final scores obtained by fusing according to the linear summation merging method yield the final output question-answer tuple ranking, i.e., the first fused candidate question-answer pair sequence.
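A minimal sketch of the min-max normalization and linear summation over the three sub-ranking score lists:

```python
def linear_sum_merge(score_lists):
    """Score_c = sum_i NormalizedScore_i: min-max normalize each
    sub-ranking's scores, then sum them per candidate."""
    def minmax(scores):
        lo, hi = min(scores), max(scores)
        if hi == lo:                # degenerate case: all candidates tie
            return [0.0] * len(scores)
        return [(s - lo) / (hi - lo) for s in scores]

    normalized = [minmax(s) for s in score_lists]
    return [sum(col) for col in zip(*normalized)]
```

Normalizing first keeps a sub-ranking with a larger raw score range from dominating the sum.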
2. Set rearrangement algorithm
The set rearrangement algorithm, on the basis of the ranking scores of the linear summation merging algorithm, selects the top k question-answer tuples with the highest scores to form the question-answer tuple set F, and calculates the relevance of each token w to the user query Q:

P(w | R) = Σ_{d ∈ F} P(w | d) · P(Q | d) / Σ_{d' ∈ F} P(Q | d')

where R represents the relevance model;

P(w | d) = c(w, d) / |d|

is the frequency of token w in document d, i.e., a question-answer tuple; c(w, d) represents the number of occurrences of token w in document d;

P(Q | d) / Σ_{d' ∈ F} P(Q | d')

is the normalized query likelihood.

The query likelihood P(Q | d) represents the probability that the whole query Q occurs in the document, and is obtained by multiplying the occurrence probability in document d of each token of Q, with the specific formula:

P(Q | d) = Π_i P(q_i | d)

The preliminary set rearrangement score Score_p is then calculated using the negative cross entropy:

Score_p(d) = Σ_w P(w | R) · log P_μ(w | d)

where

P_μ(w | d) = ( c(w, d) + μ · P(w | C) ) / ( |d| + μ )

is the Dirichlet-smoothed frequency of token w in document d; μ is an adjustable hyper-parameter; C represents the text corpus of the entire data set.
On the basis of the preliminary set rearrangement score Score_p, in order to prevent the problem of query drift, the preliminary set rearrangement score is multiplied by the linear summation merging score to obtain the adjusted set rearrangement score Score_p', i.e.:

Score_p' = Score_p * Score_c
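The whole set rearrangement step can be sketched as follows; for brevity the collection statistics C are taken from the fused set itself rather than the whole data set, and Dirichlet smoothing is also applied inside the query likelihood to avoid zero probabilities. Both are assumptions of this sketch, not the patent's exact procedure.

```python
import math
from collections import Counter

def set_rerank(query_tokens, fused, mu=10.0):
    """fused: list of (doc_tokens, score_c) for the top-k candidates of
    the linear merge. Builds the relevance model P(w|R), scores each
    document by negative cross entropy with Dirichlet-smoothed
    P_mu(w|d), and multiplies by Score_c against query drift."""
    docs = [d for d, _ in fused]
    coll = Counter(t for d in docs for t in d)        # stand-in for corpus C
    coll_len = sum(coll.values())
    counts = [Counter(d) for d in docs]

    def p_mu(w, d, cnt):
        return (cnt[w] + mu * coll[w] / coll_len) / (len(d) + mu)

    # query likelihoods P(Q|d) and their normalizer over the set F
    ql = [math.prod(p_mu(t, d, c) for t in query_tokens)
          for d, c in zip(docs, counts)]
    z = sum(ql)
    # relevance model P(w|R) = sum_d P(w|d) * normalized query likelihood
    p_rel = {w: sum((c[w] / len(d)) * (q / z)
                    for d, c, q in zip(docs, counts, ql)) for w in coll}
    # Score_p(d) = sum_w P(w|R) log P_mu(w|d); Score_p' = Score_p * Score_c
    return [sum(p * math.log(p_mu(w, d, c)) for w, p in p_rel.items()) * sc
            for (d, sc), c in zip(fused, counts)]
```

Documents whose vocabulary matches the relevance model get a less negative cross-entropy term and therefore rank higher.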
according to the first fused question-answer pair set, a second fused similarity score and a final candidate question-answer pair sequence which is correspondingly ordered can be obtained through a set rearrangement algorithm.
According to the FAQ-based unsupervised retrieval method, the similarity between the user query information and each question-answer pair is calculated through a BM25 algorithm, and all question-answer pairs are pre-ordered according to the similarity from high to low to obtain a first candidate question-answer pair sequence; calculating the similarity between the user query information and each question-answer pair document through a maximum-pass algorithm, and sequencing the first candidate question-answer pair sequence from high to low according to the score of the similarity to obtain a second candidate question-answer pair sequence; inputting the first candidate question-answer pair sequence into a first pre-training BERT model to obtain a third candidate question-answer pair sequence and a corresponding similarity score; the training data of the first pre-training BERT model is each question-answer pair and the corresponding wrong answer question-answer pair; inputting the first candidate question-answer pair sequence into a second pre-training BERT model to obtain a fourth candidate question-answer pair sequence and a corresponding similarity score; training data of the pre-training BERT model are similar problem pairs consisting of user query information and matching problems and dissimilar problem pairs consisting of user query information and mismatching problems; and fusing the similarity scores corresponding to the second candidate question-answer pair sequence, the third candidate question-answer pair sequence and the fourth candidate question-answer pair sequence to obtain a similarity fusion score and a final candidate question-answer pair sequence which is correspondingly ordered.
The present application only uses question-answer data for unsupervised training, and no matching labels between user queries and question-answer pairs are needed. On the basis of pre-ranking with the BM25 algorithm, the maximum-passage algorithm and two BERT-based ranking algorithms are respectively adopted for ranking, and the three ranking results are fused using the linear summation merging algorithm and the set rearrangement algorithm, further improving the overall effect.
The specific beneficial effects further include:
firstly, a novel unsupervised training model is provided, most of the QA retrieval systems in the prior art need supervised model training, but the model of the application does not need labeled question-answer tuple data, breaks through the limitation caused by the difficulty of data acquisition, and lightens the dependence on labeled data of business personnel.
Secondly, on the basis of text matching by applying different sorting models, two fusion models are used, so that the limitation of a single model is avoided, the stability of the model is improved, and the accuracy of the matching degree sorting of the query and question answering of the final user is improved.
Then, the present application uses the generative two-stage language model to generate similar questions, which solves the lack of labeled similar-question data and realizes weakly supervised model training; meanwhile, the model can be fine-tuned on the basis of the first-stage training completed in advance, saving time in the text generation stage and speeding up the whole process.
Finally, the data set to be sorted is reduced by adopting a pre-screening mode for multiple times, the memory occupation is reduced, and the retrieval operation speed is increased.
Example 2
For details not disclosed in the FAQ-based unsupervised search system of this embodiment, please refer to specific implementation contents of the FAQ-based unsupervised search method in other embodiments.
A schematic structural diagram of an FAQ-based unsupervised retrieval system according to an embodiment of the present application is shown in fig. 4.
As shown in fig. 4, the FAQ-based unsupervised search system according to the embodiment of the present application specifically includes a first candidate question-answer pair module 10, a second candidate question-answer pair module 20, a third candidate question-answer pair module 30, a fourth candidate question-answer pair module 40, and a final candidate question-answer pair module 50.
In particular, the method comprises the following steps of,
the first candidate question-answer pair module 10: the BM25 algorithm is used for calculating the similarity between the user query information and each question-answer pair, and pre-ranking all question-answer pairs according to the similarity from high to low to obtain a first candidate question-answer pair.
The method specifically comprises the following steps:
firstly, combining the question and the answer of each question-answer pair to obtain a combined question-answer pair document.
Then, the BM25 algorithm is used to calculate the BM25 similarity between the user query information and each question-answer pair document; the similarity BM25(Q, d) is specifically calculated as:

BM25(Q, d) = Σ_i IDF(q_i) · ( freq(q_i, d) · (k + 1) ) / ( freq(q_i, d) + k · (1 - b + b · |d| / |d|_avg) )

where Q is the user query information; q_i is each segmented token of the user query information Q; d is a question-answer pair document; |d| is the length of the document; |d|_avg is the average length of all documents; k and b are adjustable parameters;

where freq(q_i, d) is the frequency of token q_i in document d;

where IDF(q_i) is the weight of each token q_i calculated using the IDF function, with the specific calculation formula:

IDF(q_i) = log( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) )

where N is the number of all documents and n(q_i) is the number of documents containing q_i.
The second candidate question-answer pair module 20: and the method is used for calculating the similarity between the user query information and each question-answer pair document through a maximum-pass algorithm, and ranking the first candidate question-answer pairs from high to low according to the similarity score to obtain second candidate question-answer pairs.
The method specifically comprises the following steps:
firstly, combining the question and the answer of each question-answer pair to obtain a combined question-answer pair document.
And then, intercepting the question-answer pair document by a sliding window to obtain a question-answer pair document fragment.
Finally, the maxpsg similarity between the user query information and each question-answer pair document is calculated through the maximum-passage algorithm; the specific calculation formula of the similarity maxpsg(Q, d) is:

maxpsg(Q, d) = Σ_i w_i · max_{d_l ⊆ d} R(q_i, d_l)

where Q is the user query information; q_i is each segmented token of the user query information Q; d is a question-answer pair document; d_l is a question-answer pair document segment; w_i is the weight of token q_i;

where R(q_i, d_l) is the relevance score between each question-answer pair document segment d_l and the query token q_i, with the specific calculation formula:

R(q_i, d_l) = ( freq(q_i, d_l) · (k + 1) ) / ( freq(q_i, d_l) + k · (1 - b + b · |d_l| / |d|_avg) )

where freq(q_i, d_l) is the frequency of token q_i in segment d_l; |d_l| is the length of the segment; |d|_avg is the average length of all documents; k and b are adjustable parameters.
The third candidate question-answer pair module 30: the first pre-training BERT model is used for inputting the first candidate question-answer pair to obtain a third candidate question-answer pair and a corresponding similarity score; the training data of the first pre-trained BERT model is for each question-answer pair and the corresponding wrong-answer question-answer pair.
Before inputting the first candidate question-answer pair into the first pre-training BERT model to obtain the third candidate question-answer pair and the corresponding similarity score, the method also comprises the step of training the first pre-training BERT model, and the specific training steps are as follows:
firstly, inputting a label-free question-answer pair corpus into a BERT model, and carrying out joint training through a covering language model task and a next sentence prediction task to obtain a first rough training BERT model.
And then, according to the similarity between the user query information and each question-answer pair, randomly selecting a wrong answer in the answer set to replace the answer in the corresponding question-answer pair to obtain the wrong answer-question-answer pair corresponding to each question-answer pair.
And finally, inputting each question-answer pair and the corresponding wrong answer-question-answer pair into a first rough training BERT model for training, and updating model parameters through a minimum loss function to obtain a first pre-training BERT model.
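The construction of the wrong-answer training pairs can be sketched as follows; the uniform sampling policy and the fixed seed are illustrative assumptions of this sketch.

```python
import random

def make_negative_pairs(qa_pairs, seed=0):
    """For each (q, a), randomly draw a wrong answer a' from the other
    answers to form the (q, a, a') triples used to fine-tune the first
    pre-training BERT model."""
    rng = random.Random(seed)
    answers = [a for _, a in qa_pairs]
    triples = []
    for q, a in qa_pairs:
        wrong = rng.choice([x for x in answers if x != a])
        triples.append((q, a, wrong))
    return triples
```

Each triple supplies the positive tuple (q, a) and the negative tuple (q, a') that the margin loss compares.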
The fourth candidate question-answer pair module 40: used for inputting the first candidate question-answer pair into the second pre-training BERT model to obtain the fourth candidate question-answer pair and the corresponding similarity score; the training data of the second pre-training BERT model are similar question pairs consisting of user query information and matching questions, and dissimilar question pairs consisting of user query information and mismatching questions.
Before inputting the first candidate question-answer pair into the second pre-training BERT model to obtain the fourth candidate question-answer pair and the corresponding similarity score, the method also comprises the step of training the second pre-training BERT model, and the specific training steps are as follows:
firstly, a generative two-stage language model is trained in advance, and the generative two-stage language model is used for generating a user query information set.
And then, inputting the label-free question-answer pair corpus into a BERT model, and performing joint training through a covering language model task and a next sentence prediction task to obtain a second rough training BERT model.
Finally, the similar question pairs formed by the user query information of the user query information set and the matching questions, and the dissimilar question pairs formed by the user query information and the mismatching questions, are input into the second rough training BERT model for training to obtain the second pre-training BERT model; a matching question is the question in the question-answer pairs with the highest relevance to the user query information, and a mismatching question is randomly extracted from the question-answer pairs.
Final candidate question-answer pair module 50: and the similarity fusion score and the final candidate question-answer pair in the corresponding sequence are obtained by fusing the similarity scores corresponding to the second candidate question-answer pair, the third candidate question-answer pair and the fourth candidate question-answer pair.
Example 3
For details not disclosed in the FAQ-based unsupervised retrieving apparatus of this embodiment, please refer to specific implementation contents of the FAQ-based unsupervised retrieving method or system in other embodiments.
A schematic structural diagram of an unsupervised search 400 according to an embodiment of the application is shown in fig. 5.
As shown in fig. 5, unsupervised search 400 includes:
the memory 402: for storing executable instructions; and
a processor 401 is coupled to the memory 402 to execute the executable instructions so as to perform the FAQ-based unsupervised retrieval method.
Those skilled in the art will appreciate that fig. 5 is merely an example of the unsupervised search 400 and does not constitute a limitation of the unsupervised search 400; it may include more or fewer components than those shown, or combine certain components, or different components; for example, the unsupervised search 400 may also include input-output devices, network access devices, buses, etc.
The Processor 401 (CPU) may be other general-purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor or the processor 401 may be any conventional processor or the like, and the processor 401 is the control center of the unsupervised search 400 and connects the various parts of the entire unsupervised search 400 using various interfaces and lines.
Memory 402 may be used to store computer readable instructions and processor 401 may implement the various functions of unsupervised retrieval 400 by executing or executing computer readable instructions or modules stored within memory 402 and invoking data stored within memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the stored data area may store data created according to the unsupervised retrieval 400, and the like. In addition, the Memory 402 may include a hard disk, a Memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Memory Card (Flash Card), at least one disk storage device, a Flash Memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), or other non-volatile/volatile storage devices.
The modules integrated in the unsupervised retrieval device 400, if implemented in the form of software functional modules and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods of the embodiments of the present invention may also be implemented by computer-readable instructions instructing the related hardware; the computer-readable instructions may be stored in a computer-readable storage medium, and when executed by a processor, the steps of the method embodiments may be implemented.
Example 4
The present embodiment provides a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a processor, the FAQ-based unsupervised retrieval method of the other embodiments is implemented.
According to the FAQ-based unsupervised retrieval device and storage medium, the similarity between the user query information and each question-answer pair is calculated through the BM25 algorithm, and all question-answer pairs are pre-sorted from high to low by similarity to obtain the first candidate question-answer pairs; the similarity between the user query information and each question-answer pair document is calculated through the maximum-passage algorithm, and the first candidate question-answer pairs are sorted from high to low by similarity score to obtain the second candidate question-answer pairs; the first candidate question-answer pairs are input into the first pre-trained BERT model to obtain the third candidate question-answer pairs and the corresponding similarity scores, the training data of the first pre-trained BERT model being the question-answer pairs and their corresponding wrong-answer question-answer pairs; the first candidate question-answer pairs are input into the second pre-trained BERT model to obtain the fourth candidate question-answer pairs and the corresponding similarity scores, the training data of the second pre-trained BERT model being similar question pairs formed by user query information samples and matching questions, and dissimilar question pairs formed by user query information samples and mismatched questions; and the similarity scores corresponding to the second, third and fourth candidate question-answer pairs are fused to obtain a similarity fusion score and the correspondingly sorted final candidate question-answer pairs.
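By way of illustration only (and not as part of the claimed embodiment), the retrieval flow summarised above can be sketched with hypothetical interfaces. All function names and ranking conventions below are assumptions standing in for the described components (BM25 pre-ranking, the maximum-passage ranker, the two pre-trained BERT rerankers, and score fusion):

```python
def faq_retrieve(query, qa_pairs, bm25_rank, maxpsg_score,
                 bert_qa_score, bert_qq_score, fuse):
    """Sketch of the flow: BM25 pre-ranking, three further scorings of the
    first candidate sequence, and fusion into the final ranking. The five
    callables are hypothetical interfaces, not names from the patent."""
    # First candidate sequence: BM25 pre-ranking over all question-answer pairs
    first = bm25_rank(query, qa_pairs)
    # Score the first sequence with the maximum-passage ranker and both BERT models
    maxpsg = {p: maxpsg_score(query, p) for p in first}
    qa = {p: bert_qa_score(query, p) for p in first}
    qq = {p: bert_qq_score(query, p) for p in first}
    # Fuse the three similarity scores into the final candidate ranking
    return fuse(maxpsg, qa, qq)
```

The sketch treats the three rerankers as interchangeable scoring functions, which mirrors how the embodiment combines their outputs in a single fusion step.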
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. An unsupervised retrieval method based on FAQ is characterized by comprising the following steps:
calculating the BM25 similarity between the user query information and each question-answer pair through a BM25 algorithm, and pre-sorting all question-answer pairs from high to low according to the BM25 similarity to obtain a first candidate question-answer pair sequence;
calculating the maxpsg similarity between the user query information and each question-answer pair through a maximum-passage algorithm, and sorting the first candidate question-answer pair sequence from high to low according to the maxpsg similarity to obtain a second candidate question-answer pair sequence;
inputting the first candidate question-answer pair sequence into a first pre-trained BERT model to obtain a third candidate question-answer pair sequence and corresponding similarity scores; the training data of the first pre-trained BERT model being the question-answer pairs and their corresponding wrong-answer question-answer pairs;
inputting the first candidate question-answer pair sequence into a second pre-trained BERT model to obtain a fourth candidate question-answer pair sequence and corresponding similarity scores; the training data of the second pre-trained BERT model being similar question pairs formed by user query information samples and matching questions, and dissimilar question pairs formed by user query information samples and mismatched questions;
and fusing the similarity scores corresponding to the second, third and fourth candidate question-answer pair sequences to obtain a similarity fusion score, and sorting the final candidate question-answer pair sequence according to the similarity fusion score.
2. The FAQ-based unsupervised retrieval method according to claim 1, wherein the calculating the similarity between the user query information and each question-answer pair through BM25 algorithm specifically comprises:
merging the questions and answers of each question-answer pair to obtain a first merged question-answer pair document;
calculating the BM25 similarity between the user query information and each first merged question-answer pair document through the BM25 algorithm, wherein the similarity BM25(Q, d) is calculated as:

BM25(Q, d) = Σ_i IDF(q_i) · freq(q_i, d) · (k + 1) / [ freq(q_i, d) + k · (1 − b + b · |d| / |d|_avg) ]

wherein Q is the user query information; q_i is each word-segmented single character of the user query information Q; d is a question-answer pair document; |d| is the length of the document; |d|_avg is the average length of all documents; k and b are adjustable parameters;

wherein freq(q_i, d) is the frequency with which the single character q_i occurs in the document d;

wherein IDF(q_i) is the weight of each single character q_i calculated using the IDF function, as:

IDF(q_i) = log [ (N − n(q_i) + 0.5) / (n(q_i) + 0.5) ]

wherein N is the number of all documents and n(q_i) is the number of documents containing q_i.
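For illustration, the BM25 scoring described in claim 2 can be sketched as follows. The tokenization used in the test, the k = 1.2 / b = 0.75 defaults, and the classic Robertson IDF variant are assumptions; the claim only requires k and b to be adjustable parameters:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k=1.2, b=0.75):
    """Score each merged question-answer document against the query tokens.

    query_tokens: list of word-segmented tokens of the query Q.
    docs_tokens:  list of token lists, one per merged question-answer document.
    """
    N = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / N
    df = Counter()                        # n(q_i): documents containing q_i
    for d in docs_tokens:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs_tokens:
        freq = Counter(d)
        s = 0.0
        for q in query_tokens:
            if freq[q] == 0:
                continue
            # Classic Robertson IDF; may go negative for very common tokens
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * freq[q] * (k + 1) / (freq[q] + k * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return scores
```

Sorting the documents by these scores from high to low gives the pre-sorted first candidate sequence of claim 1.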
3. The FAQ-based unsupervised retrieval method according to claim 1, wherein the calculating the similarity between the user query information and each question-answer pair through the maximum-passage algorithm and sorting the first candidate question-answer pair sequence from high to low according to the similarity score to obtain the second candidate question-answer pair sequence specifically comprises:
merging the questions and answers of each question-answer pair to obtain a second merged question-answer pair document;
intercepting a sliding window segment of the second merged question-answer pair document in a sliding window mode to obtain a question-answer pair document segment;
calculating the maxpsg similarity between the user query information and each second merged question-answer pair document through the maximum-passage algorithm, wherein the similarity maxpsg(Q, d) is calculated as:

maxpsg(Q, d) = Σ_i w_i · max_l { R(q_i, d_l) }

wherein Q is the user query information; q_i is each word-segmented single character of the user query information Q; d is a question-answer pair document; d_l is a question-answer pair document fragment; w_i is the weight of the single character q_i;

wherein R(q_i, d_l) is the relevancy score of each question-answer pair document fragment d_l and the single character q_i of the user query information, calculated as:

R(q_i, d_l) = freq(q_i, d_l) · (k + 1) / [ freq(q_i, d_l) + k · (1 − b + b · |d_l| / |d|_avg) ]

wherein freq(q_i, d_l) is the frequency with which the single character q_i occurs in the fragment d_l; |d_l| is the length of the fragment; |d|_avg is the average length of all documents; k and b are adjustable parameters;

wherein max_l{R(q_i, d_l)} is the maximum of the relevancy scores between the document fragments of the question-answer pair and the single character q_i of the user query information.
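A minimal sketch of the maximum-passage scoring in claim 3. The window and stride sizes, the uniform token weights w_i, and the k/b defaults are assumptions not specified by the claim:

```python
from collections import Counter

def maxpsg_score(query_tokens, doc_tokens, window=50, stride=25,
                 k=1.2, b=0.75, weights=None):
    """Slide a window over the merged question-answer document, score each
    fragment per query token with the BM25-style R(), and sum the per-token
    maxima weighted by w_i (uniform weights assumed here)."""
    # Cut the document into sliding-window fragments d_l
    fragments = [doc_tokens[i:i + window]
                 for i in range(0, max(1, len(doc_tokens) - window + 1), stride)]
    avg_len = sum(len(f) for f in fragments) / len(fragments)
    if weights is None:
        weights = {q: 1.0 for q in query_tokens}
    score = 0.0
    for q in query_tokens:
        best = 0.0
        for frag in fragments:
            f = Counter(frag)[q]
            if f:
                r = f * (k + 1) / (f + k * (1 - b + b * len(frag) / avg_len))
                best = max(best, r)       # max_l { R(q_i, d_l) }
        score += weights[q] * best
    return score
```

Re-sorting the first candidate sequence by this score from high to low yields the second candidate sequence.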
4. The FAQ-based unsupervised retrieval method of claim 1, wherein before inputting the first candidate question-answer pair sequence into a first pre-trained BERT model to obtain a third candidate question-answer pair sequence and a corresponding similarity score, the method further comprises training the first pre-trained BERT model, and the specific training steps are as follows:
inputting a first unlabeled question-answer pair corpus into a BERT model, and performing joint training on a masked language model task and a next-sentence prediction task to obtain a first coarsely trained BERT model;
randomly selecting a wrong answer from the answer set to replace the answer in the corresponding question-answer pair according to the similarity between the user query information sample and each question-answer pair, to obtain the wrong-answer question-answer pair corresponding to each question-answer pair;
and inputting each question-answer pair and its corresponding wrong-answer question-answer pair into the first coarsely trained BERT model for training, and updating the model parameters by minimizing a loss function to obtain the first pre-trained BERT model.
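The wrong-answer pair construction in claim 4 can be sketched as follows. Uniform random sampling over the whole answer set is an assumption (the claim conditions the choice on similarity to the user query sample, which is omitted here), and the `seed` parameter is an illustrative addition:

```python
import random

def make_wrong_answer_pairs(qa_pairs, seed=0):
    """For each (question, answer), replace the answer with a randomly chosen
    different answer from the answer set, yielding the wrong-answer
    question-answer pairs used as training negatives.
    Requires at least two distinct answers in qa_pairs."""
    rng = random.Random(seed)
    answers = [a for _, a in qa_pairs]
    negatives = []
    for q, a in qa_pairs:
        wrong = a
        while wrong == a:                  # resample until the answer differs
            wrong = rng.choice(answers)
        negatives.append((q, wrong))
    return negatives
```

Each original pair then serves as a positive example and its wrong-answer counterpart as a negative example when fine-tuning the coarsely trained BERT model.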
5. The FAQ-based unsupervised retrieval method of claim 1, wherein before inputting the first candidate question-answer pair sequence into a second pre-trained BERT model to obtain a fourth candidate question-answer pair sequence and a corresponding similarity score, further comprising training the second pre-trained BERT model, the specific training steps are as follows:
pre-training a generative two-stage language model, wherein the generative two-stage language model is used for generating a user query information set;
inputting a second unlabeled question-answer pair corpus into a BERT model, and performing joint training on a masked language model task and a next-sentence prediction task to obtain a second coarsely trained BERT model;
inputting the training data into the second coarsely trained BERT model for training to obtain the second pre-trained BERT model, wherein the training data comprises similar question pairs formed by user query information samples of the user query information set and matching questions, and dissimilar question pairs formed by user query information samples of the user query information set and mismatched questions; the matching questions are the questions in the question-answer pairs with the highest relevance to the user query information samples, and the mismatched questions are questions randomly extracted from the question-answer pairs.
6. The FAQ-based unsupervised retrieval method according to claim 5, wherein the pre-training of the generative two-stage language model specifically comprises the steps of:
obtaining a coding matrix combining word vectors and position vectors by performing word vector conversion and position vector conversion on each word-segmented single character of the user query information sample;
transforming the coding matrix multiple times through a multi-head self-attention mechanism and a fully connected layer to obtain input matrix data;
inputting the input matrix data into a language model for training, and updating the model parameters through a maximum likelihood function to obtain a first-stage language model;
merging the question and answer of each question-answer pair to obtain a third merged question-answer pair document, and extracting consecutive word-segmented single characters of a certain length from the third merged question-answer pair document in a sliding-window manner to obtain fine-tuning input data;
inputting the fine-tuning input data into the first-stage language model for training, and updating the model parameters through a maximum likelihood function to obtain the generative two-stage language model.
7. The FAQ-based unsupervised search method according to claim 1, wherein the fusing the similarity scores corresponding to the second candidate question-answer pair sequence, the third candidate question-answer pair sequence and the fourth candidate question-answer pair sequence to obtain a similarity fusion score and a final candidate question-answer pair sequence sorted according to the similarity fusion score specifically comprises:
normalizing the similarity scores corresponding to the second, third and fourth candidate question-answer pair sequences, and then obtaining, through a linear summation merging algorithm, a first fused similarity score and a first fused candidate question-answer pair sequence sorted according to that score;
selecting a preset number of question-answer pairs from the first fused candidate question-answer pair sequence from high to low according to the first fused similarity score to obtain a first fused question-answer pair set; and obtaining, through a set rearrangement algorithm applied to the first fused question-answer pair set, a second fused similarity score and the final candidate question-answer pairs sorted according to the second fused similarity score.
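The linear summation merging step of claim 7 can be sketched as follows. Min-max normalisation and equal weights are assumptions (the claim specifies normalisation and a linear combination, but not the exact variants), and the subsequent set-rearrangement step is omitted:

```python
def fuse_scores(score_lists, weights=None, top_n=10):
    """Min-max normalise each ranker's scores, combine them with a weighted
    linear sum, and return the top-n candidates by fused score, highest first.

    score_lists: list of {candidate_id: score} dicts, one per ranker
    (here: the maximum-passage scores and the two BERT rerankers' scores)."""
    if weights is None:
        weights = [1.0] * len(score_lists)   # equal weights assumed

    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo or 1.0                # guard against constant scores
        return {c: (s - lo) / span for c, s in scores.items()}

    normed = [minmax(s) for s in score_lists]
    candidates = set().union(*(s.keys() for s in score_lists))
    fused = {c: sum(w * n.get(c, 0.0) for w, n in zip(weights, normed))
             for c in candidates}
    return sorted(fused, key=fused.get, reverse=True)[:top_n]
```

The returned top-n set corresponds to the first fused question-answer pair set, which the claim then passes to the set rearrangement algorithm.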
8. An unsupervised search system based on FAQ is characterized by comprising:
a first candidate question-answer pair module: used for calculating the similarity between the user query information and each question-answer pair through a BM25 algorithm, and pre-sorting all question-answer pairs from high to low according to the obtained BM25 similarity to obtain a first candidate question-answer pair sequence;
a second candidate question-answer pair module: used for calculating the similarity between the user query information and each question-answer pair through a maximum-passage algorithm, and sorting the first candidate question-answer pair sequence from high to low according to the obtained maxpsg similarity to obtain a second candidate question-answer pair sequence;
a third candidate question-answer pair module: used for inputting the first candidate question-answer pair sequence into a first pre-trained BERT model to obtain a third candidate question-answer pair sequence and corresponding similarity scores; the training data of the first pre-trained BERT model being the question-answer pairs and their corresponding wrong-answer question-answer pairs;
a fourth candidate question-answer pair module: used for inputting the first candidate question-answer pair sequence into a second pre-trained BERT model to obtain a fourth candidate question-answer pair sequence and corresponding similarity scores; the training data of the second pre-trained BERT model being similar question pairs formed by user query information samples and matching questions, and dissimilar question pairs formed by user query information samples and mismatched questions;
and a final candidate question-answer pair module: used for fusing the similarity scores corresponding to the second, third and fourth candidate question-answer pair sequences to obtain a similarity fusion score and the final candidate question-answer pair sequence sorted according to the similarity fusion score.
9. An unsupervised FAQ-based search device, comprising:
a memory: for storing executable instructions; and
a processor, used for interfacing with the memory and executing the executable instructions to perform the FAQ-based unsupervised retrieval method of any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon; the computer program, when executed by a processor, implements the FAQ-based unsupervised retrieval method of any one of claims 1-7.
CN202210032823.7A 2022-01-12 2022-01-12 Non-supervision type retrieval method, system and medium based on FAQ Pending CN114357120A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210032823.7A CN114357120A (en) 2022-01-12 2022-01-12 Non-supervision type retrieval method, system and medium based on FAQ

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210032823.7A CN114357120A (en) 2022-01-12 2022-01-12 Non-supervision type retrieval method, system and medium based on FAQ

Publications (1)

Publication Number Publication Date
CN114357120A true CN114357120A (en) 2022-04-15

Family

ID=81109038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210032823.7A Pending CN114357120A (en) 2022-01-12 2022-01-12 Non-supervision type retrieval method, system and medium based on FAQ

Country Status (1)

Country Link
CN (1) CN114357120A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996424A (en) * 2022-06-01 2022-09-02 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN114996424B (en) * 2022-06-01 2023-05-09 吴艳 Weak supervision cross-domain question-answer pair generation method based on deep learning
CN115344668A (en) * 2022-07-05 2022-11-15 北京邮电大学 Multi-field and multi-disciplinary science and technology policy resource retrieval method and device
CN115687676A (en) * 2022-12-29 2023-02-03 浙江大华技术股份有限公司 Information retrieval method, terminal and computer-readable storage medium
CN115687676B (en) * 2022-12-29 2023-03-31 浙江大华技术股份有限公司 Information retrieval method, terminal and computer-readable storage medium
CN117371404A (en) * 2023-12-08 2024-01-09 城云科技(中国)有限公司 Text question-answer data pair generation method and device
CN117371404B (en) * 2023-12-08 2024-02-27 城云科技(中国)有限公司 Text question-answer data pair generation method and device

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111611361B (en) Intelligent reading, understanding, question answering system of extraction type machine
CN114357120A (en) Non-supervision type retrieval method, system and medium based on FAQ
WO2020224097A1 (en) Intelligent semantic document recommendation method and device, and computer-readable storage medium
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
CN112100351A (en) Method and equipment for constructing intelligent question-answering system through question generation data set
CN111581354A (en) FAQ question similarity calculation method and system
US20220237230A1 (en) System and method for automated file reporting
CN111324728A (en) Text event abstract generation method and device, electronic equipment and storage medium
CN112614538A (en) Antibacterial peptide prediction method and device based on protein pre-training characterization learning
CN111291188B (en) Intelligent information extraction method and system
CN112214593A (en) Question and answer processing method and device, electronic equipment and storage medium
CN112989005A (en) Knowledge graph common sense question-answering method and system based on staged query
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN112417097A (en) Multi-modal data feature extraction and association method for public opinion analysis
CN112131876A (en) Method and system for determining standard problem based on similarity
CN111966810A (en) Question-answer pair ordering method for question-answer system
CN113282711A (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN112905736A (en) Unsupervised text emotion analysis method based on quantum theory
CN114611491A (en) Intelligent government affair public opinion analysis research method based on text mining technology
CN115470338A (en) Multi-scene intelligent question and answer method and system based on multi-way recall
CN116304020A (en) Industrial text entity extraction method based on semantic source analysis and span characteristics
CN114332519A (en) Image description generation method based on external triple and abstract relation
JPH11328317A (en) Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded
CN113569018A (en) Question and answer pair mining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination