CN116089592A - Method, device and storage medium for realizing open-domain multi-answer question and answer

Info

Publication number
CN116089592A
CN116089592A (application CN202310277276.3A)
Authority
CN
China
Prior art keywords
answer
paragraphs
question
paragraph
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310277276.3A
Other languages
Chinese (zh)
Inventor
程龚
赵悦
黄子贤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310277276.3A priority Critical patent/CN116089592A/en
Publication of CN116089592A publication Critical patent/CN116089592A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, device and storage medium for realizing open-domain multi-answer question answering, comprising three stages: a dense retrieval stage, an in-domain pre-training stage and a supervised multi-answer generation stage. Relevant paragraphs are found in an encyclopedia corpus by the dense retrieval module, the relevant paragraphs are then encoded by a multi-paragraph reader, and the answer generator is trained with the optimal generation order of the multi-answer set to produce multiple answers to the question. Because the labeling cost of multi-answer datasets is too high, such datasets are generally small; the invention therefore proposes in-domain pre-training to improve multi-answer generation performance, and the optimal-generation-order strategy for the multi-answer set alleviates the error bias caused by the one-to-many generation paradigm forcibly prescribing the order in which answers are generated. The invention achieves better results on open-domain multi-answer question-answering datasets.

Description

Method, device and storage medium for realizing open-domain multi-answer question and answer
Technical Field
The invention belongs to the technical field of computers, relates to intelligent question-answering technology in natural language understanding, and discloses a method, device and storage medium for realizing open-domain multi-answer question answering.
Background
The open-domain question-answering task requires a question-answering system to retrieve question-relevant paragraphs from a knowledge document base and to input the question and the paragraphs into a reading-comprehension model to predict the answer. In the open-domain single-answer generation task, each question has a single answer, and the model is expected to predict that answer. Because questions are complex and diverse, the common architecture retrieves question-relevant paragraphs from an open corpus and predicts the answer from those paragraphs. Open-domain multi-answer questions, by contrast, correspond to multiple answers that may be distributed across multiple paragraphs, and are therefore more difficult and challenging than single-answer generation. The representative dataset for the multi-answer generation task is AmbigQA. AmbigQA was constructed on top of the open-domain single-answer question-answering dataset Natural Questions, whose questions all come from real user queries in Google Search, with each answer drawn from Wikipedia. Related work found that a large number of questions in Natural Questions are ambiguous with respect to entity reference, missing time constraints, and so on, and thus may correspond to multiple answers. AmbigQA focuses on the ambiguous questions in Natural Questions and annotates, for each question, as many answers as can be found in Wikipedia. For example, when asking when a movie was released, one may want to find all possible answers from the relevant paragraphs, such as the release dates in different regions.
Retrieving relevant paragraphs depends on the retriever. Retrievers fall into sparse and dense retrieval architectures. Sparse retrieval, such as TF-IDF or BM25, uses an inverted index to match keywords efficiently and can be viewed as representing questions and contexts with high-dimensional, weighted sparse vectors. Dense retrieval is complementary to sparse representations: synonyms composed of entirely different characters can still be mapped to nearby points in the vector space, and the dense encoding can be learned by adjusting the embedding function, which offers great flexibility for task-specific representations. Dense retrieval usually adopts a dual-tower architecture, typically built on a pre-trained language model such as BERT or RoBERTa. In the dual-tower architecture, a twin network feeds the question and the paragraph independently into an encoder, the respective outputs are aggregated into dense vector representations of the query and the paragraph, and the similarity between them is then computed, e.g., with cosine similarity.
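As an illustrative sketch of the dual-tower encoding described above (assuming the Hugging Face transformers library; the bert-base-uncased checkpoint, [CLS] pooling and dot-product scoring are common DPR-style conventions assumed here, not a verbatim reproduction of any particular system):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
question_encoder = BertModel.from_pretrained("bert-base-uncased")
paragraph_encoder = BertModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    # Tokenize a batch of texts and take the [CLS] vector as the dense representation.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # [batch, hidden]: [CLS] pooling

question_vec = encode(question_encoder, ["who won big brother 20"])
paragraph_vecs = encode(paragraph_encoder, [
    "Kaycee Clark was crowned the winner of Big Brother ...",
    "Sarah Harding was announced as the winner of the series ...",
])
# Relevance is the dot product between question and paragraph vectors.
scores = question_vec @ paragraph_vecs.T        # [1, num_paragraphs]
```

In a trained retriever the two towers share the architecture but hold separate weights, exactly as in the twin-network setup described above.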
In the open-domain question-answering (OpenQA) task, the answer can generally be obtained directly from a supporting paragraph, i.e., the answer is a contiguous span of a supporting paragraph. Two architectures are typically employed: extractive and generative. The extractive architecture feeds the question and paragraph into an encoder such as BERT and then predicts the start and end positions at which the answer appears. Some pre-trained models are designed specifically for answer extraction, such as SpanBERT and Splinter. However, the extractive model faces two major challenges in open-domain multi-answer question answering. First, open-domain question answering is usually based on multiple supporting paragraphs, and aggregating and combining evidence from multiple paragraphs is not straightforward for an extractive model, requiring additional machinery. Second, the extractive architecture currently predicts only one start and one end position, and is therefore suited only to single-answer tasks. The generative architecture is based on a sequence-to-sequence model: the question and paragraphs are fed to the encoder, and the answer is produced by the decoder. Generative pre-trained language models mainly include T5 and BART; T5 explores a unified framework, the text-to-text architecture, for various downstream natural language processing (NLP) tasks including summarization, question answering and text classification. T5 is pre-trained by corrupting spans of the input text and restoring them at the output. The encoder of a generative pre-trained model usually limits the maximum input length and cannot read a large number of paragraphs. The FiD (Fusion-in-Decoder) architecture, built on the T5 model, reads multiple paragraphs by encoding each paragraph separately and aggregating the paragraph encodings on the decoder side. FiD achieves very good results in open-domain single-answer generation.
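To make the Fusion-in-Decoder idea concrete, the following is a minimal sketch on top of a T5 checkpoint (assuming the Hugging Face transformers API; real FiD implementations perform this reshaping inside the model's forward pass, and the exact generate() signature may vary across library versions):

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

question = "when did the movie come out"
passages = ["title: Film A context: It premiered in the United States in 2019 ...",
            "title: Film A context: It was released across Europe in 2020 ..."]

# 1) Encode each (question, paragraph) pair independently with the shared encoder.
inputs = tokenizer([f"question: {question} {p}" for p in passages],
                   padding=True, truncation=True, max_length=300,
                   return_tensors="pt")
encoder_outputs = model.encoder(input_ids=inputs.input_ids,
                                attention_mask=inputs.attention_mask)

# 2) Concatenate the per-paragraph token states along the sequence axis so the
#    decoder can attend over all paragraphs jointly.
hidden = encoder_outputs.last_hidden_state            # [K, seq, hidden]
fused = hidden.reshape(1, -1, hidden.size(-1))        # [1, K*seq, hidden]
mask = inputs.attention_mask.reshape(1, -1)           # [1, K*seq]

# 3) Decode answers against the fused representation.
generated = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused),
    attention_mask=mask, max_length=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```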
At present, two schemes are mainly used to migrate from open-domain single-answer question answering to multi-answer question answering. The first models it as a one-to-one task: in the one-to-one paradigm, a question and its multiple answers are split into multiple sample pairs, each pair corresponding to only one answer to the question. Since only one answer at a time supervises training, multiple output sequences must be sampled with beam search if multiple answers are needed at test time. This paradigm has two major problems. The first is answer dependency: there is no semantic dependency among the answers produced by beam search, so semantically duplicate answers are very likely to be generated. The second is the dynamic answer count: the beam-search sample size is fixed in advance, so a different number of answers cannot be generated for different questions. The other scheme models a one-to-many task, i.e., the decoder outputs multiple answers as one fixed sequence, roughly migrating single-answer generation to multi-answer generation. Because previously generated answers are visible during decoding, the problems of semantic dependency and duplicate generation are resolved, and the architecture relies on the end token predicted by the decoder to generate a dynamic number of answers. However, this architecture rigidly fixes the order dependency among answers; related work has found that different orderings have a fairly large influence on the result, a randomly fixed answer generation order is not necessarily the best choice, and how to learn the optimal order has not yet been fully studied.
Because constructing a multi-answer generation dataset requires significant human labeling cost, such datasets are generally small, and the resulting models struggle to reach good performance with insufficient training samples. Related work shows that continuing to train an open-domain pre-trained language model on unlabeled in-domain text, and then fine-tuning on the downstream dataset, can improve downstream performance. However, that line of work mainly addresses the corpus gap between the general domain and the specialized domain, without considering the gap between the pre-training task and the downstream task. Take the pre-training task of the pre-trained language model T5 versus the open-domain multi-answer generation task as an example: T5's pre-training corrupts spans of the input text and restores them at the output, so its input is corrupted text and its prediction target is the restored text; the open-domain multi-answer generation task takes a question and relevant paragraphs as input, with the answers as the prediction target. The two tasks differ greatly, so a transition stage suited to this task gap is also very necessary.
Regarding open-domain multi-answer question-answering research, Document 1, "Answering Open-Domain Multi-Answer Questions via a Recall-then-Verify Framework" (Shao & Huang, ACL 2022), proposes a recall-then-verify scheme that explicitly decomposes the one-to-many problem into multiple one-to-one tasks: for the same question, different answers are generated from different paragraph combinations (the recall stage), and answers whose scores exceed a certain threshold are then retained (the verify stage). The advantage is a reduced memory footprint while still reading many paragraphs, but the following disadvantages remain:
1. The number of answers generated for each sample is determined by comparing the predicted answer probabilities with a threshold; but because questions vary in difficulty, the probability distributions of predicted answers differ widely, and a single fixed threshold cannot correctly screen the predicted answer sets across all samples.
2. This approach cannot model the dependencies between multiple answers. In an open-domain multi-answer question-answering task there may be correlations between different answers, e.g., some answers may be synonyms or paraphrases, or related entities or concepts. The model needs to identify and exploit these correlations to generate better answers; however, the method produces multiple answers in parallel and does not use the dependencies between them.
Disclosure of Invention
The invention aims to solve the following problems: the error bias introduced in existing open-domain multi-answer generation architectures by a fixed answer generation order; and, because the samples available for training the reader are much smaller in scale than open-domain data, how to design a reader training scheme that improves answering performance on a small multi-answer dataset, where the multi-answer dataset refers to the in-domain training dataset used to train the reader.
The technical scheme of the invention is as follows. The implementation method for open-domain multi-answer question answering comprises a dense retrieval stage, an in-domain pre-training stage and a supervised multi-answer generation stage. For a given question q, the Top-K most relevant paragraphs D are retrieved from an encyclopedia corpus C by a dense retriever; a multi-paragraph reader is then built and pre-trained in-domain on the in-domain corpus; finally, the multi-paragraph reader is fine-tuned on a supervised open-domain multi-answer question-answering dataset, the optimal generation order of the multi-answer set is defined to reduce the influence of order on multi-answer generation, and multiple answers are obtained from the given question and the retrieved paragraphs. The method is implemented as follows:
1) Training a dense retriever, wherein the dense retriever measures the relevance between paragraphs and questions by computing the dot product of the semantic vectors encoded respectively by the paragraph encoder and the question encoder, and retrieves a set of most relevant paragraphs for a given question for the subsequent reading stage;
2) The in-domain pre-training stage realizes the transition between the general-domain pre-training stage and the multi-answer generation stage by synthesizing multi-paragraph multi-answer question-answer data and training the multi-paragraph reader on it; it comprises the construction of self-supervised data and the pre-training of the multi-paragraph reader, where each self-supervised sample consists of three parts: the question q̃, the relevant paragraphs D̃, and the answer ã. The construction flow of the self-supervised data is as follows:
2.1) Select a set of related retrieved paragraphs D_rel from the in-domain corpus; the related paragraphs have similar topics and contain common entities, and the degree of topic similarity can be set freely;
2.2) Randomly select k_0 paragraphs from D_rel as the answer-extraction source A_source, and regard the remaining paragraphs as the relevant paragraphs D̃;
2.3) Identify the entities in A_source with the spaCy tool, and filter out the entities that do not appear in D̃;
2.4) Divide the filtered entities into groups according to the entity types identified by spaCy, then construct one question-answer sample per group: the entities of a group are concatenated to form the answer ã; the sentences of A_source in which these entities occur are concatenated, with each entity replaced by a [MASK] token, to synthesize the question q̃; the corresponding relevant paragraphs are D̃.
After the self-supervised data is constructed, a generative multi-paragraph reader is modeled with the FiD architecture and optimization training is performed;
3) In the supervised multi-answer generation stage, fine-tuning is performed on the multi-answer dataset: for a given question q, the multi-paragraph reader predicts the answer set corresponding to q from the Top-K paragraphs D obtained by the dense retriever; the answer set contains several distinct answers separated by the separator [SEP]; considering that the multiple answers admit different generation orders, all permutations of the answer set are enumerated, the cross-entropy loss between each permutation and the multi-paragraph reader's prediction is computed, and the permutation with the minimum loss is the optimal answer sequence, which is taken as the final multi-answer result.
The in-domain pre-training stage of the invention comprises two subtasks: self-supervised data construction and training of the multi-paragraph reader. Constructing the self-supervised data is very challenging, i.e., how to create question-answer pairs and relevant paragraphs from unlabeled data. The invention exploits a set of correlated paragraphs: several paragraphs are randomly selected for answer extraction and question construction, and the remaining paragraphs serve as the relevant paragraphs. Using a set of correlated paragraphs makes the association between the constructed question-answer pairs and the relevant paragraphs tighter and closer to the downstream scenario. Such a set may consist of paragraphs from the same Wikipedia article, or of paragraphs paired with the same question in an open-domain supervised question-answering dataset; the invention adopts the latter because it can be built from a single-answer dataset and is larger in scale.
The invention also provides an electronic device, comprising a storage medium and a processor, the storage medium being used to store a computer program and the processor being used to execute the computer program; when the computer program is executed, the above method for realizing open-domain multi-answer question answering is implemented.
The invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed, the above method for realizing open-domain multi-answer question answering is implemented.
Compared with Document 1, the invention models multi-answer question answering as a one-to-many task, i.e., a question is input and multiple answers are predicted directly. To reduce the influence of the answer generation order on training, the generation order with the least loss is adopted as the optimal order. The invention therefore has the following advantages: 1) the dependencies between answers can be modeled, since during decoding an answer generated later can take the answers generated earlier as reference; 2) the number of answers generated for each sample is determined by the model itself, without additional threshold parameters.
Current readers are generally based on an open-domain pre-trained language model such as BERT, T5 or BART, but a gap remains between such general-domain pre-training and the downstream task. On top of the pre-trained language model, the invention introduces in-domain pre-training, i.e., a second pre-training stage carried out on the in-domain corpus with a pre-training objective closer to the downstream reading task, followed by fine-tuning on the downstream supervised task on the basis of the in-domain pre-trained model. The corresponding advantages are: 1) the corpus is closer to the downstream corpus; 2) the task is closer to the downstream task. In-domain pre-training thus forms a transition stage between the open-domain pre-trained model and fine-tuning on the downstream supervised task, building a bridge between general-domain pre-training and the downstream supervised task.
The beneficial effects of the invention include two aspects. On the one hand, for the unordered multi-answer set, the invention defines an optimal order, namely the order with the minimum prediction loss, so that the model can concentrate on answer prediction itself and ignore the training error bias caused by answer order. On the other hand, for the open-domain multi-answer generation task, the invention proposes an in-domain pre-training scheme suited to this task, which reduces the gap between a model pre-trained on general tasks in the general domain and the downstream task, improving the model's performance on the downstream task.
Drawings
Fig. 1 is an overall process flow diagram of the present invention.
Fig. 2 shows the construction process of the self-supervised data in the in-domain pre-training of the invention.
Detailed Description
The invention provides an open-domain multi-answer question-answering method comprising three stages: a dense retrieval stage, an in-domain pre-training stage and a supervised multi-answer generation stage; as software, these correspond to a retrieval module, an in-domain pre-training module, and a multi-paragraph reading and multi-answer generation module trained with the optimal order. For a given question q, the Top-K most relevant paragraphs D are first retrieved from an encyclopedia corpus C such as Wikipedia by the retrieval module. Then, following the in-domain pre-training scheme designed by the invention, a pre-training objective closer to the downstream reading task is applied to the generative model on the in-domain corpus, yielding a multi-paragraph reader that predicts multiple answers from the question over multiple paragraphs. Finally, the multi-paragraph reader is fine-tuned on a supervised open-domain multi-answer question-answering dataset, and the optimal generation order of the multi-answer set is defined to reduce the influence of order on multi-answer generation, producing the final multi-answer result. The three stages are as follows.
Stage 1: the dense retrieval stage. A dense retriever is trained which measures the relevance between paragraphs and questions by computing the dot product of the semantic vectors encoded by the paragraph encoder and the question encoder, and retrieves a set of most relevant paragraphs for a given question for the subsequent reading stage.
stage 2: pre-training phase in the field. In view of the large corpus difference and task difference between the pre-training stage and the supervised multi-answer generation stage in the third stage in the general field, the method disclosed by the invention introduces the pre-training stage in the field to promote the performance of the model in the third stage, namely the supervised multi-answer generation stage. Taking the pre-training language model T5 based on the invention as an example, in the aspect of corpus, the T5 pre-training corpus is based on C4 (Colossal Clean Crawled Corpus) from text data crawled on the Internet, and the encyclopedic corpus based on the invention is from Wikipedic, and the coverage contents of the two are different to a certain extent; in terms of tasks, the pre-training task mainly adopted by T5 is to make a small segment of the input text to replace, i.e. randomly select a span, replace it with a special symbol such as MASK, and predict the span replaced by the special symbol at the output. If the initial text is "thank you invite me to your party around you," the replaced text is "thank you [ MASK1] i to your party [ MASK2], then the predictive label is" [ MASK1] invite [ MASK2] around ". The main differences between the open-domain multi-answer generation task and the pre-training task of T5 are: (1) T5 fill [ MASK ] text relies primarily on local context only, without modeling multi-paragraph or long distance context; (2) The multi-answer generating task comprises a question and a plurality of paragraphs, wherein the multi-paragraph encoder is used for encoding a plurality of sections and calculating a plurality of answers through the interaction of the question and the paragraphs; (3) The labels generated by multiple answers are often meaningful entities, such as location and time, and the pre-trained labels are randomly selected from the input text and may not have practical significance. Based on the method, the research of the invention proposes that a pre-training stage in the introduction field is used as a bridge between a T5 pre-training stage and a supervised multi-answer generation stage, and the corpus difference and task difference between the T5 pre-training stage and the supervised multi-answer generation stage are reduced.
To stay close to the third stage's supervised multi-answer generation, the self-supervised data constructed in the in-domain pre-training stage likewise consists of three parts, namely the question q̃, the relevant paragraphs D̃, and the answer ã. The construction flow of the self-supervised data is as follows:
(1) Select a set of related retrieved paragraphs D_rel from the in-domain corpus. Because related paragraphs have similar topics, more common entities are shared among them, making it easier to synthesize question-answer data; similar topics and common entities form a progressive relation, since paragraphs with similar topics tend to mention the same entities, and the degree of topic similarity can be set freely.
(2) Randomly select k_0 paragraphs from D_rel as the answer-extraction source A_source, and regard the remaining paragraphs as the relevant paragraphs D̃.
(3) Identify the entities in A_source with the spaCy tool, and filter out the entities that do not appear in D̃: the relevant paragraphs carry no information about those entities, so they would be hard to recover after being replaced by [MASK] in the subsequent question-answer construction.
(4) Divide the filtered entities into groups according to the entity types identified by spaCy, and construct one question-answer sample per group. Specifically, the entities of a group are concatenated to synthesize the answer ã; the sentences of A_source in which these entities occur are concatenated, with each entity replaced by a [MASK] token, to synthesize the question q̃; the relevant paragraphs corresponding to the question q̃ are D̃. Since the constructed question q̃ and the relevant paragraphs D̃ both come from D_rel, the question q̃ and the relevant paragraphs D̃ can be considered correlated.
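A simplified sketch of steps (1)-(4) is given below (assuming spaCy's en_core_web_sm model; the substring test for entity occurrence and the [SEP] join in the answer are illustrative simplifications):

```python
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def build_samples(source_paragraphs, related_paragraphs):
    related_text = " ".join(related_paragraphs)
    # Step (3): identify entities in A_source and keep only those that also
    # appear in the related paragraphs D-tilde (here: a simple substring test).
    groups = defaultdict(list)   # entity type -> [(entity text, containing sentence)]
    for paragraph in source_paragraphs:
        for ent in nlp(paragraph).ents:
            if ent.text in related_text:
                groups[ent.label_].append((ent.text, ent.sent.text))
    samples = []
    # Step (4): one question-answer sample per entity-type group.
    for label, entities in groups.items():
        answer = " [SEP] ".join(text for text, _ in entities)
        question = " ".join(sentence.replace(text, f"[MASK{i + 1}]")
                            for i, (text, sentence) in enumerate(entities))
        samples.append({"question": question,
                        "paragraphs": related_paragraphs,
                        "answer": answer})
    return samples
```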
After the self-supervised data is constructed, the invention models the reader with FiD, the classic generative architecture for multi-paragraph reading, and performs optimization training.
Stage 3: the supervised multi-answer generation stage. The challenge of this stage is how to define a generation order for the unordered multi-answer set. The invention takes the optimal generation order to be the order that minimizes the loss of the reader's predicted answers, because this minimizes the influence of order and avoids false penalties.
The practice of the invention is specifically described below.
The dense retrieval stage. The retriever uses dense retrieval and is based on the classic retrieval architecture DPR. Both the question and the paragraphs of the corpus are encoded with dense encoders, called the paragraph encoder and the question encoder: the paragraph encoder E_P(·) maps the paragraphs of the corpus to d-dimensional real-valued vectors, the question encoder E_Q(·) maps the question to a d-dimensional vector, and the K paragraph vectors closest to the question vector are retrieved. The invention defines the similarity between a question and a paragraph as the dot product of their vectors, i.e.

sim(q, p) = E_Q(q)^T · E_P(p)

where q denotes a question and p a paragraph. The dense encoders use two independent BERT networks and take the representation of the [CLS] token as the output vector. In the training stage, let

T = {<q_i, p_i^+, p_{i,1}^-, ..., p_{i,n}^->}, i = 1, ..., m

be the training data consisting of m instances, each containing a question q_i, one relevant paragraph p_i^+ and n irrelevant paragraphs p_{i,j}^-, j = 1, ..., n. The loss function is InfoNCE, i.e.

L(q_i, p_i^+, p_{i,1}^-, ..., p_{i,n}^-) = -log [ exp(sim(q_i, p_i^+)) / ( exp(sim(q_i, p_i^+)) + Σ_{j=1}^{n} exp(sim(q_i, p_{i,j}^-)) ) ]

The paragraphs of this embodiment come from Wikipedia, which contains several million paragraphs, each consisting of a title and a body.
In the training data of the dense retriever, relevant paragraphs are taken directly from the information provided by the dataset, while irrelevant paragraphs are constructed in two different ways: (1) using BM25 as the retrieval algorithm to search the open domain with the question, and taking top-ranked returned paragraphs that do not contain the correct answer as irrelevant paragraphs; such paragraphs share many tokens with the question and thus score high under BM25 yet contain no answer, so they are hard-to-distinguish negatives that greatly help encoder training; (2) relevant paragraphs paired with other questions in the training set. Constructing negatives with BM25 alone risks popularity sampling bias, i.e., the sampled paragraphs form a fixed subset of the corpus and a large number of potential negatives are never sampled. The invention therefore constructs random negative paragraphs via in-batch negatives: the positive paragraphs paired with the other questions in the same batch serve as the negative paragraphs for the current question. In this way, the computed encodings can be reused, reducing the computation cost.
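The in-batch variant of the InfoNCE loss above can be sketched as follows (a minimal sketch; BM25 hard negatives would simply be appended as extra score columns):

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(question_vecs, paragraph_vecs):
    # question_vecs: [B, d]; paragraph_vecs: [B, d], where row i is the
    # positive paragraph paired with question i; all other rows act as
    # in-batch negatives for question i.
    scores = question_vecs @ paragraph_vecs.T    # [B, B] dot-product similarities
    labels = torch.arange(scores.size(0))        # positives lie on the diagonal
    # Row-wise cross-entropy reproduces the InfoNCE objective above.
    return F.cross_entropy(scores, labels)
```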
The in-domain pre-training stage. This stage follows general-domain pre-training and realizes the transition between the general-domain pre-training stage and the third-stage multi-answer generation stage by synthesizing multi-paragraph multi-answer question-answer data and training the multi-paragraph reader on it. The stage comprises the construction of self-supervised data and the pre-training of the multi-paragraph reader; to stay close to the third stage's supervised multi-answer generation, the self-supervised data constructed here likewise consists of three parts, namely the question q̃, the relevant paragraphs D̃, and the answer ã. The construction flow of the self-supervised data is as described above.
After the self-supervised data is constructed, the invention models the reader with the classic generative multi-paragraph reading architecture FiD to obtain the multi-paragraph reader, and performs optimization training on the self-supervised data. To solve this task, the model must first locate the context close to the question within the relevant paragraphs and then recover the entity, thereby learning the relation between the question, the relevant paragraphs and the answer. This task is very similar to the third stage's supervised multi-answer generation task, and can therefore serve as a bridge between the general-domain pre-training stage and the third stage.
Answer generation is trained with the cross-entropy loss, i.e.

L_gen(Y, X_cat; θ) = -Σ_{t=1}^{N} log P_θ(Y_t | Y_{<t}, X_cat)

X_cat = [X_1; X_2; ...; X_K]

where L_gen denotes the generation loss, K is the number of retrieved paragraphs, X_cat is the concatenation of the paragraph encodings, [;] denotes the concatenation operation, X_j is the encoding vector of the j-th retrieved paragraph, Y denotes the answer sequence, N the length of the answer sequence, and θ the parameters of the reader.
The final stage is the supervised multi-answer generation stage. The FiD architecture reads multiple paragraphs by encoding each paragraph separately, and performs well when trained for the open-domain single-answer generation task; but when moving from the single-answer task to the multi-answer task, the output order of the answers is passively and forcibly fixed even though the multiple answers are inherently unordered. This introduces a large bias into the training stage, severely penalizing any change of order between answers. To keep the answer-order bias out of model training, the invention proposes a training method that is independent of the order of the supervision labels. Since an optimal order among the multiple answers cannot be obtained in advance, the optimal order is taken to be the order that minimizes the loss of the sample. Precisely, given the supervision label set A corresponding to a sample, with size |A|, the set of all possible serialized texts of A is A_seq, with size |A|!. Each sequence Y_i of A_seq is enumerated to obtain the corresponding loss L_gen(X_cat, θ_multi, Y_i), where θ_multi denotes the parameters of the multi-paragraph reader; combining the cross-entropy loss of the multi-paragraph reader, the optimal answer sequence Y_best is defined by:

Y_best = argmin_{Y_i ∈ A_seq} L_gen(X_cat, θ_multi, Y_i)

and the loss is then defined as

L = L_gen(X_cat, θ_multi, Y_best)

where X_cat is the concatenation of the paragraph encodings and θ_multi denotes the reader's parameters.
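A sketch of this order-selection objective follows (the seq2seq forward call mirrors the Hugging Face API and is an assumption; note the |A|! enumeration is only practical for small answer sets):

```python
import torch
from itertools import permutations

def best_order_loss(model, encoder_outputs, answers, tokenizer):
    # Enumerate all |A|! serializations of the unordered answer set A,
    # each joined with the [SEP] separator.
    losses = []
    for perm in permutations(answers):
        target = " [SEP] ".join(perm)
        labels = tokenizer(target, return_tensors="pt").input_ids
        out = model(encoder_outputs=encoder_outputs, labels=labels)  # teacher-forced CE
        losses.append(out.loss)
    # Y_best is the minimum-loss permutation; only it drives the gradient.
    return torch.stack(losses).min()
```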
The multi-paragraph reader of the invention adopts the FiD architecture, built on the sequence-to-sequence network T5 pre-trained on unsupervised data. The FiD architecture takes the question and the Top-K retrieved paragraphs D from the dense retriever as input and outputs multiple answers as a single fixed sequence. Each paragraph consists of a title and a body; each retrieved paragraph and its title are concatenated with the question and processed by the multi-paragraph reader's encoder independently of the other paragraphs, and the special markers "question:", "title:" and "context:" are added before the question, the title and the body of each paragraph, which helps the model extract and understand the input. Adding the marker "question:" before the question helps the model identify which part of the input text is the question, and likewise for the paragraphs. Finally, the multi-paragraph reader's decoder attends, based on the attention mechanism, over the concatenation of the resulting representations of all retrieved paragraphs to learn the relevant information and complete multi-paragraph reading comprehension. Since the FiD architecture fuses the retrieved paragraphs only in the decoder, interactions between paragraphs are reduced, computation is reduced, and scaling to a large number of contexts is possible.
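The per-paragraph input formatting described above can be sketched as follows (the paragraph records and the example title are illustrative stand-ins):

```python
retrieved_paragraphs = [                     # Top-K results from the dense retriever
    {"title": "Big Brother 20 (American season)",
     "body": "the September 26, 2018 finale saw Kaycee Clark crowned the winner ..."},
]

def format_input(question, title, body):
    # Special markers tell the encoder which span plays which role.
    return f"question: {question} title: {title} context: {body}"

encoder_inputs = [format_input("who won the final hoh big brother 20",
                               p["title"], p["body"])
                  for p in retrieved_paragraphs]
```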
The practice of the invention is described below with a specific embodiment. The question comes from AmbigQA: "who won the final hoh big brother 20", i.e., who ultimately won Big Brother 20.
Step 101: train the open-domain retriever. The retriever is based on the dense retrieval architecture and is fine-tuned from bert-base-uncased (https://huggingface.co/docs/transformers/model_doc/bert). The training stage uses the dual-tower architecture, encoding questions and paragraphs separately so as to maximize the similarity between a question and its positive paragraphs while minimizing the similarity between the question and its negative paragraphs. The negative paragraphs of each question are sampled from the positive paragraphs of other samples and from the set of top-ranked paragraphs retrieved by BM25 that do not contain the correct answer. Given the sample "who eventually won Big Brother 20?", the corresponding positive paragraphs include "... After 99 days in the Big Brother House, the September 26, 2018 finale saw Kaycee Clark crowned the winner of Big Brother in a 5-4 vote over Tyler Crispen" and "On 25 August 2017, Sarah Harding was announced as the winner of the series having received 35.33% of the final vote, with Amelia Lily as the runner-up after receiving 29.92%". The negative paragraphs include "First evictee, Jasmine Lennard, later appeared as a guest for a two-day stint on 'Big Brother 16'. Coleen Nolan returned to the house for 'Celebrity Big Brother 19' as an All-Star representing this series. She won this series of games" and so on. After training, all paragraphs in Wikipedia are encoded with the paragraph encoder and an index is built with FAISS. In the inference stage, a nearest-neighbor search algorithm retrieves the top paragraphs relevant to the question.
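The indexing and inference steps of this stage might look as follows (assuming the faiss Python package; random matrices stand in for the encoded paragraphs and question, and the exact index type is illustrative):

```python
import faiss
import numpy as np

d = 768                                        # BERT-base hidden size
paragraph_matrix = np.random.rand(1000, d).astype("float32")  # stand-in for encoded Wikipedia paragraphs
question_vec = np.random.rand(1, d).astype("float32")         # stand-in for the encoded question

index = faiss.IndexFlatIP(d)                   # exact inner-product (dot-product) search
index.add(paragraph_matrix)

scores, ids = index.search(question_vec, 100)  # nearest-neighbor search for the Top-K paragraphs
```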
Step 102: perform in-domain pre-training. Self-supervised data for the open-domain multi-paragraph multi-answer question-answering task is constructed from sets of correlated paragraphs. Specifically, construction is based on the paragraphs paired with the same question in an open-domain supervised single-answer question-answering dataset; the invention adopts the Natural Questions dataset because of its large scale, and the construction is then very simple. Fig. 2 illustrates an example of self-supervised data construction. The left part of Fig. 2 is divided into an upper and a lower half, namely the answer-extraction source A_source and the remaining relevant paragraphs D̃. These paragraphs all come from the paragraph set paired with one question in the Natural Questions dataset; it can be seen that their topics all relate to the polar circle, and many common entities are shared among the paragraphs. (1) The entities in the answer-extraction source of Fig. 2, including Honningsvåg, the town of Honningsvåg, Norway, Nordland county and others, are first identified with the spaCy tool. (2) The entities Honningsvåg and the town of Honningsvåg, which do not appear in the relevant paragraphs, are then filtered out, because the relevant paragraphs carry no information about them and they would be hard to recover after [MASK] replacement. (3) Next, the entities identified by spaCy are grouped by type; here the entities are all of the location type and therefore form one group. (4) The sentences containing the entities are extracted, and each entity is replaced with a different [MASK] token, constructing the question q̃ in the upper-right part of Fig. 2; for example, Norway is replaced by [MASK1], the next entity by [MASK2], and so on. (5) The question q̃ in the upper right of Fig. 2 and the relevant paragraphs D̃ are fed into the multi-paragraph reader FiD, and the prediction target is the entities replaced by [MASK]. To solve this task, the model must first locate the context close to the question within the relevant paragraphs and then recover the entities; the arrows indicate that, to restore each [MASK] token, the model can draw on the information referenced in the relevant paragraphs. The [MASK] recovery process therefore depends on the interaction between the question and the relevant paragraphs: the model must find the question-relevant information in the relevant paragraphs. This task is very similar to the third stage's supervised multi-answer generation task, so the third stage of the invention can be fine-tuned on the model trained in the second stage.
Step 103: fine-tune on the supervised multi-paragraph multi-answer question-answering dataset, and define the optimal generation order of the multi-answer set. The multi-paragraph reader is fine-tuned from the model obtained in step 102. For the answer set {Kaycee Clark, Sarah Harding}, the size is 2, so the set of all possible serialized texts has size 2! = 2, i.e., two answer orders: "Kaycee Clark [SEP] Sarah Harding" and "Sarah Harding [SEP] Kaycee Clark". For the model's predicted answers, each sequence in the set is enumerated and its loss obtained; the best answer sequence is the one with the minimum loss.
In this embodiment, the maximum input length of the reading model's encoder is set to 300; over-length inputs are truncated and shorter inputs are padded with <pad>. The maximum decoding length is 40 and the beam size is set to 1. The learning rate is 0.00005, dropout is 0.1, the number of training epochs is 20, and the Adam optimizer uses default parameters. The dataset is AmbigQA and the evaluation metric is F1. Compared with existing generative models, the invention achieves better answer metrics. Table 1 compares, on the same T5-base backbone, the invention against a baseline combining only DPR retrieval with a FiD reader, evaluated on the AmbigQA dataset; the invention shows a clear improvement over the model with the same number of parameters.
TABLE 1

                   F1-all    F1-multi
DPR+FiD            40.8      27.5
Ours (invention)   41.9      27.8

Claims (7)

1. An implementation method for open-domain multi-answer question answering, characterized by comprising a dense retrieval stage, an in-domain pre-training stage and a supervised multi-answer generation stage, wherein for a given question q, the Top-K most relevant paragraphs D are retrieved from an encyclopedia corpus by a dense retriever, a multi-paragraph reader is then built and pre-trained in-domain on the in-domain corpus, and finally the multi-paragraph reader is fine-tuned on a supervised open-domain multi-answer question-answering dataset, the optimal generation order of the multi-answer set being defined to reduce the influence of order on multi-answer generation, so that multiple answers are obtained from the given question and the retrieved paragraphs, implemented as follows:
1) training a dense retriever, wherein the dense retriever measures the relevance between paragraphs and questions by computing the dot product of the semantic vectors encoded respectively by the paragraph encoder and the question encoder, and retrieves a set of most relevant paragraphs for a given question for the subsequent reading stage;
2) the in-domain pre-training stage realizes the transition between the general-domain pre-training stage and the multi-answer generation stage by synthesizing multi-paragraph multi-answer question-answer data and training the multi-paragraph reader on it, comprising the construction of self-supervised data and the pre-training of the multi-paragraph reader, where each self-supervised sample consists of three parts: the question q̃, the relevant paragraphs D̃, and the answer ã; the construction flow of the self-supervised data is as follows:
2.1) selecting a set of related retrieved paragraphs D_rel from the in-domain corpus, the related paragraphs having similar topics and containing common entities, the degree of topic similarity being set freely;
2.2) randomly selecting k_0 paragraphs from D_rel as the answer-extraction source A_source, the remaining paragraphs being regarded as the relevant paragraphs D̃;
2.3) identifying the entities in A_source with the spaCy tool, and filtering out the entities that do not appear in D̃;
2.4) dividing the filtered entities into groups according to the entity types identified by spaCy, then constructing one question-answer sample per group: the entities of a group are concatenated to form the answer ã; the sentences of A_source in which these entities occur are concatenated, with each entity replaced by a [MASK] token, to synthesize the question q̃; the corresponding relevant paragraphs are D̃;
after the self-supervised data is constructed, a generative multi-paragraph reader is modeled with the FiD architecture and optimization training is performed;
3) performing multi-answer dataset fine-tuning in the supervised multi-answer generation stage: for a given question q, the multi-paragraph reader predicts the answer set corresponding to q from the Top-K paragraphs D obtained by the dense retriever, the answer set comprising several distinct answers separated by the separator [SEP]; considering that the multiple answers admit different generation orders, all permutations of the answer set are enumerated, the cross-entropy loss between each permutation and the multi-paragraph reader's prediction is computed, and the permutation with the minimum loss is the optimal answer sequence, which is taken as the final multi-answer result.
2. The implementation method for open-domain multi-answer question answering according to claim 1, wherein the dense retriever is based on the retrieval architecture DPR, using the paragraph encoder E_P(·) to map the paragraphs of the corpus to d-dimensional real-valued vectors and the question encoder E_Q(·) to map the question to a d-dimensional vector, the similarity between a question and a paragraph being defined as the dot product of the vectors, i.e.

sim(q, p) = E_Q(q)^T · E_P(p)

where q denotes a question and p a paragraph, the K paragraph vectors closest to the question vector being obtained by similarity retrieval;

when training the dense retriever, let

T = {<q_i, p_i^+, p_{i,1}^-, ..., p_{i,n}^->}, i = 1, ..., m

be the training data consisting of m instances, each instance containing a question q_i, one relevant paragraph p_i^+ and n irrelevant paragraphs p_{i,j}^-; the loss function is InfoNCE, namely:

L(q_i, p_i^+, p_{i,1}^-, ..., p_{i,n}^-) = -log [ exp(sim(q_i, p_i^+)) / ( exp(sim(q_i, p_i^+)) + Σ_{j=1}^{n} exp(sim(q_i, p_{i,j}^-)) ) ]
3. The method according to claim 2, wherein in the training data of the dense retriever the relevant paragraphs are taken directly from the information provided by the dataset, and the irrelevant paragraphs consider two different construction modes: (1) searching the open domain with the question using BM25 as the retrieval algorithm, and taking top-ranked returned paragraphs that do not contain the correct answer as irrelevant paragraphs; (2) constructing random negative paragraphs for the question from the relevant paragraphs paired with other questions in the training data via in-batch negatives, i.e., the positive paragraphs paired with the other questions in the same batch serve as the negative paragraphs corresponding to the current question.
4. The method according to claim 1, wherein the multi-paragraph reader adopts the FiD architecture, taking the question and the Top-K retrieved paragraphs D of the dense retriever as input and outputting multiple answers as a single fixed sequence; each paragraph comprises a title and a body, wherein each retrieved paragraph and its title are concatenated with the question and processed by the multi-paragraph reader's encoder independently of the other paragraphs, the special markers "question:", "title:" and "context:" being added before the question, the title and the body of each paragraph; finally, the multi-paragraph reader's decoder attends, based on the attention mechanism, over the concatenation of the resulting representations of all retrieved paragraphs to learn the relevant information and complete multi-paragraph reading comprehension.
5. The implementation method for open-domain multi-answer question answering according to claim 1, wherein the determination of the optimal answer sequence is specifically: the optimal order is the order that minimizes the answer-generation loss; given the supervision label set A corresponding to a sample, with size |A|, the set of all possible serialized texts of A is A_seq, with size |A|!; enumerating each sequence Y_i of A_seq yields the corresponding loss L_gen(X_cat, θ_multi, Y_i), and the optimal answer sequence Y_best is obtained by:

Y_best = argmin_{Y_i ∈ A_seq} L_gen(X_cat, θ_multi, Y_i)

X_cat = [X_1; X_2; ...; X_K]

at which point the loss is defined as

L = L_gen(X_cat, θ_multi, Y_best)

where L_gen denotes the generation loss, K is the number of retrieved paragraphs, X_cat is the concatenation of the paragraph encodings, [;] denotes the concatenation operation, X_j is the encoding vector of the j-th retrieved paragraph, and θ_multi denotes the parameters of the multi-paragraph reader.
6. An electronic device, comprising a storage medium and a processor, the storage medium being used to store a computer program and the processor being used to execute the computer program, wherein the computer program, when executed, implements the method for open-domain multi-answer question answering according to any one of claims 1-5.
7. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed, implements the implementation method for open-domain multi-answer question answering according to any one of claims 1-5.
CN202310277276.3A 2023-03-21 2023-03-21 Method, device and storage medium for realizing open-domain multi-answer question and answer Pending CN116089592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310277276.3A CN116089592A (en) 2023-03-21 2023-03-21 Method, device and storage medium for realizing open-domain multi-answer question and answer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310277276.3A CN116089592A (en) 2023-03-21 2023-03-21 Method, device and storage medium for realizing open-domain multi-answer question and answer

Publications (1)

Publication Number Publication Date
CN116089592A 2023-05-09

Family

ID=86187143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310277276.3A Pending CN116089592A (en) 2023-03-21 2023-03-21 Method, device and storage medium for realizing open-domain multi-answer question and answer

Country Status (1)

Country Link
CN (1) CN116089592A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field
CN116501859B (en) * 2023-06-26 2023-09-01 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination