CN116108128A

CN116108128A - Open domain question-answering system and answer prediction method

Info

Publication number: CN116108128A
Application number: CN202310389053.6A
Authority: CN
Inventors: 张准; 苏俊杰; 马琼雄; 王一辰; 黄俊鹏
Original assignee: South China Normal University
Current assignee: South China Normal University
Priority date: 2023-04-13
Filing date: 2023-04-13
Publication date: 2023-05-12
Anticipated expiration: 2043-04-13
Also published as: CN116108128B

Abstract

The invention relates to a novel open domain question-answering system which comprises a vector converter, a retriever, a paragraph index library, a supporting document indexer and an answer generator, wherein the retriever comprises a feature extraction module, a question linear layer, a question hash layer, a paragraph linear layer and a paragraph hash layer. The paragraphs of the knowledge base documents are processed by the retriever to obtain paragraph binary codes stored in the paragraph index library, the questions are processed by the retriever to obtain question continuous vectors and question binary codes, the support documents are screened out from the paragraph index library by the support document indexer, and after the question vectors and the support document vectors are spliced, predictive answers are generated by the answer generator. The question-answering system can efficiently search the support paragraph with the largest correlation degree with the problem, compresses the continuous vector output by the linear layer into a binary code by adopting the hash layer, and reduces the storage space of the index memory; the two-stage paragraph indexing method of binary code and continuous vector is adopted to effectively reduce the time consumption of retrieval.

Description

Open domain question-answering system and answer prediction method

Technical Field

The invention relates to the technical field of question-answering systems, in particular to an open domain question-answering system and an answer prediction method.

Background

In 2017, the facebook company designs a DrQA open domain question-answering system, and proposes a two-section type retriever-reader framework, namely, in order to answer various questions, documents relevant to the questions need to be retrieved in a large-scale network resource, the semantics of the documents are understood through a reader, and finally, the answers are extracted, so that the two-section type retriever-reader framework becomes a dominant paradigm of the open domain question-answering system.

The initial DrQA system and the retrievers of the other most systems employed sparse word matching classical "Information Retrieval (IR) systems" and "elastic search" provided a convenient way to index documents, so nearest neighbor searches could be performed by using the BM25 similarity function (word-dependent TF-IDF weighting). This word matching-based approach has obvious limitations, such as not considering synonyms and grammatical variations, and does not allow for a better understanding of the semantics of the entire sentence.

The latest retriever construction scheme mainly comprises dense vector representation and a deep learning model, and the dense vector representation and the deep learning model have some problems when obtaining good effects, for example, the dense vector representation needs to calculate large-scale paragraphs or articles, so that the index memory occupies large space and the retrieval is slow; the supervised retriever pre-training method has high requirements on the data set, and specific paragraphs and answers of the document resources need to be marked by the data set, so that the marking labor cost is huge. The reader construction scheme mainly adopts a deep learning model, and the principle is that answers are intercepted in the document paragraphs which are already searched from the LSTM model to the Bert model, but when the answers cannot be intercepted from the document simply, the reader model based on extraction is invalid.

Disclosure of Invention

Based on the above, the invention aims to provide a novel answer prediction method for open domain questions and answers.

An answer prediction method of open domain questions and answers, comprising the following steps:

S-1A converts an input problem into a word vector, adds paragraph features and position information to the word vector, and obtains a sentence matrix fusing the word vector, the paragraph features and the position information;

S-2A performs semantic feature extraction on an input sentence matrix to obtain a semantic feature matrix, and divides the semantic feature matrix into problem feature matrices according to sentence head vectors of the semantic feature matrix;

S-3A carries out linear transformation on the problem feature matrix to obtain a problem continuous vector;

S-4A performs binary code conversion on the problem continuous vector to obtain a problem binary code;

s-5, sequentially screening K paragraphs with the largest inner product value from a paragraph index library through the problem binary code and the problem continuous vector to obtain a problem support document; wherein, the paragraph index library stores paragraph binary codes of documents of the prior knowledge base;

s-6, converting the input questions into question vectors, converting the question support documents into question support document vectors, splicing the question vectors with the question support document vectors to obtain question document splicing vectors, and generating predictive answers for the question document splicing vectors by adopting a generate function and a greedy decoding algorithm.

Further, the step S-5 specifically comprises the following steps:

s-51, calculating the hamming distance between the problem binary code and the fall binary code in a fall index library, and screening out m paragraphs with the closest hamming distance to the problem binary code;

s-52, carrying out inner product operation on the problem continuous vector and the m paragraphs, screening out K paragraphs with the largest inner product value as supporting documents of the problem, and obtaining the supporting documents of the problem.

The invention also provides an open domain question-answering system, which comprises:

the vector converter is used for converting the input problem into a word vector, adding paragraph features and position information to the word vector, and obtaining a sentence matrix fusing the word vector, the paragraph features and the position information;

the retriever comprises a feature extraction module, a problem linear layer and a problem hash layer, wherein the feature extraction module is used for extracting semantic features of an input sentence matrix to obtain a semantic feature matrix, and dividing the semantic feature matrix into problem feature matrices according to sentence head vectors of the semantic feature matrix; the problem linear layer is used for carrying out linear transformation on the problem feature matrix to obtain a problem continuous vector; the problem hash layer is used for performing binary code conversion on the problem continuous vector to obtain a problem binary code;

a paragraph index library storing paragraph binary codes of documents of an existing knowledge base;

the supporting document indexer is used for screening K paragraphs with the largest inner product value from a paragraph index library through the problem binary codes and the problem continuous vectors in sequence to serve as supporting documents of the problems, and obtaining problem supporting documents;

the answer generator is used for converting the input questions into question vectors, converting the question support documents into question support document vectors, splicing the question vectors with the question support document vectors to obtain question document splicing vectors, and generating predictive answers for the question document splicing vectors by adopting a generate function and a greedy decoding algorithm.

Further, the retriever further comprises:

the paragraph linear layer is used for carrying out linear transformation on the paragraph characteristic matrix to obtain paragraph continuous vectors;

the paragraph hash layer is used for performing binary code conversion on the paragraph continuous vector to obtain a paragraph binary code;

the paragraph binary codes of the documents of the knowledge base stored in the paragraph index library are as follows: the method comprises the steps of intercepting a document obtained from a large-scale knowledge base in the vertical field into paragraphs with fixed word numbers, and sequentially processing the paragraphs by a vector converter, a feature extraction module, a paragraph linear layer and a paragraph hash layer.

Compared with the prior art, the question-answering system and the answer prediction method adopt the characteristic extraction model of the rear wiring layer and the hash layer as the retriever, can efficiently retrieve the support section with high correlation with the problem, and reduce the model parameters of the retriever; compressing continuous vectors output by the linear layer into binary codes by adopting a hash layer, and reducing the storage space of an index memory; the method for searching paragraphs by two stages of binary code indexing of the first problem and continuous vector indexing of the second problem effectively reduces the time consumption of searching, and the searched supporting paragraphs and the problems reach the maximum relativity, so that answers with smaller confusion degree are generated.

For a better understanding and implementation, the present invention is described in detail below with reference to the drawings.

Drawings

FIG. 1 is a schematic block diagram of an open domain question-answering system of the present invention;

FIG. 2 is a flowchart of the operation of the open domain question-answering system of FIG. 1;

FIG. 3 is a schematic diagram of a training method of the retriever of FIG. 1;

fig. 4 is a flowchart of answer prediction in the open domain question-answering system of the present invention.

Detailed Description

Firstly, converting paragraph binary codes of documents of an existing knowledge base through a retriever and storing the converted paragraph binary codes in a paragraph index base; then, the input questions are converted into question continuous vectors and question binary codes through the retriever; calculating the hamming distance between the problem binary code and the paragraph binary code in the paragraph index library in a support document indexer, screening out m paragraphs with the closest hamming distance, calculating the inner products of the problem continuous vector and the m paragraphs, and screening out K paragraphs with the largest inner product value from the m paragraphs as support documents; and finally, splicing the question vector and the support document vector through an answer generator, and generating a predicted answer by adopting a greedy decoding algorithm.

The open domain question-answering system of the present invention, including the construction, optimization and operation of the system, will be described in detail below.

Open domain question-answering system

Referring to fig. 1 and 2, fig. 1 is a schematic block diagram of an open domain question-answering system according to the present invention, and fig. 2 is a flowchart of the open domain question-answering system shown in fig. 1. The open domain question-answering system of the present invention includes a vector translator, a retriever, a paragraph index library, a supporting document indexer, and an answer generator.

Specifically, the vector converter is configured to perform step S-1: and converting the input sentence into a word vector, and adding paragraph features and position information to the word vector to obtain a sentence matrix integrating the word vector, the paragraph features and the position information. When the input sentence is a question, this step is labeled S-1A; when the sentence entered is an answer or paragraph, this step is labeled S-1B.

The retriever comprises a feature extraction module, a problem linear layer, a problem hash layer, a paragraph linear layer and a paragraph hash layer.

The feature extraction module is used for executing the step S-2: extracting semantic features of an input sentence matrix to obtain a semantic feature matrix, and dividing the semantic feature matrix into a problem feature matrix or a paragraph feature matrix according to sentence head vectors of the semantic feature matrix. When the semantic feature matrix is divided into problem feature matrices, the step is marked as S-2A; when the semantic feature matrix is divided into paragraph feature matrices, this step is labeled S-2B.

The problem linear layer is used for executing the step S-3A: and carrying out linear transformation on the problem feature matrix to obtain a problem continuous vector, and storing the problem continuous vector.

The problem hash layer is used for executing the step S-4A: and performing binary code conversion on the problem continuous vector to obtain a problem binary code, and storing the problem binary code.

The paragraph linear layer is used for executing the step S-3B: and linearly transforming the paragraph feature matrix to obtain paragraph continuous vectors and storing the paragraph continuous vectors.

The paragraph Ha Xiceng is for performing step S-4B: and performing binary code conversion on the paragraph continuous vector to obtain paragraph binary codes, and storing the paragraph binary codes.

When the input sentence is a problem, the feature extraction module, the problem linear layer and the problem hash layer form a problem encoder; when the input sentence is an answer or a paragraph, the feature extraction module, the paragraph linear layer, and the paragraph hash layer constitute a paragraph encoder. In this embodiment, the feature extraction module uses a Bert model or an Albert model that is followed by a [ CLS ] output. Further, the output dimension of the problem linear layer and the paragraph linear layer is 128 dimensions; and the problem hash layer and the paragraph hash layer adopt a tanh function, and the expression is as follows:

in the formula ,

indicating a manual adjustment value->

Representing the continuous vector of question linearity layer and answer/linearity layer outputs.

The paragraph index stores paragraph binary codes of knowledge base documents. Specifically, a document is firstly obtained from a large-scale knowledge base in the vertical field, the document is intercepted into paragraphs with fixed word numbers, the paragraphs are input into a paragraph encoder of the retriever to execute steps S-2, S-3B and S-4B, and all paragraph binary codes are obtained

And storing the mapping relation between the paragraph index library and the paragraph document according to the line.

The supporting document indexer is configured to perform step S-5: and sequentially using the problem binary codes and the problem continuous vectors to screen K paragraphs with the largest inner product value from the paragraph index library as a problem supporting document. Specifically, the maximum inner product search function of the fasss library is adopted, and the questions are calculated firstQuestion binary code and all paragraph binary codes in paragraph index library

Screening m paragraphs closest to the Hamming distance of the binary code of the problem

The method comprises the steps of carrying out a first treatment on the surface of the And then carrying out inner product operation on the m paragraphs and the problem continuous vector, and screening out K paragraphs with the largest inner product value from the m paragraphs to form a problem supporting document. Further, the format of the supporting document composed of K paragraphs is "">

Paragraph 1->

Paragraph 2 … ->

Paragraph K ", wherein->

To separate symbols.

The answer generator is configured to perform step S-6: and converting the input problem into a problem vector, splicing the problem vector with the problem support document vector to obtain a problem document splicing vector, and generating a prediction answer for the problem document splicing vector by adopting a generate function and a greedy decoding algorithm. In this embodiment, the answer generator is a BART model followed by a linear layer with a language header. Further, the splice format of the problem document splice vector is "problem: { } support documents: {}".

Optimization of (II) open domain question-answering system

After the open domain question-answering system is built, the question-answering system needs to be trained so as to be evolved from an initial system to an optimization system. The retriever is trained first, and the answer generator is trained and/or fine-tuned based on the trained retriever to obtain the optimized open domain question-answer system.

Meanwhile, the paragraph binary code data stored in the paragraph index library is also obtained based on a trained retriever. Specifically, the paragraphs of the knowledge base document are input into a trained retriever, and the paragraph binary codes are formed after being processed by a paragraph encoder of the retriever.

The training method of the open domain question-answering system will be specifically described below.

1. Training retriever

1.1 Creating a first data set for training a retriever

The search training dataset needs to be built before training the retriever.

Question-answer pair data is obtained from the open domain, and all question-answer pairs are made into a (question, answer) mapping dataset. Specifically, the data content includes question Q and answer A, and defines question answer pairs

For positive sample, ++>

Is a negative example.

Further, the data set batch size is 1024, as shown in Table 1,

is a positive sample, others->

，......，/>

Is a negative example.

TABLE 1A batch of question answer pairs

1.2 Training

Please refer to fig. 3, which is a diagram illustrating a training method of the retriever. The training process of the retriever is specifically as follows:

S-1C converts a certain batch of questions and answers in the input first data set into word vectors respectively, and adds paragraph features and position information to the word vectors to obtain sentence matrixes integrating the word vectors, the paragraph features and the position information.

S-2, extracting semantic features of the input sentence matrix to obtain a semantic feature matrix, and dividing the semantic feature matrix into a question feature matrix or an answer feature matrix according to sentence head vectors of the semantic feature matrix.

S-3A performs linear transformation on the problem feature matrix to obtain a problem continuous vector

And stores the question continuation vector +.>

。

S-4A is continuous to the problem vector

Performing binary code conversion to obtain problematic binary code +.>

And stores the question binary code +.>

。

S-3B performing linear transformation on the answer characteristic matrix to obtain paragraph continuous vectors

The paragraph continuous vector->

Includes paragraph positive sample continuous vector->

And question-independent paragraph negative sample continuous vector +.>

Store paragraph continuous vector +.>

Paragraph positive sample continuous vector->

And paragraph negative sample continuous vector +.>

。

S-4B performs binary code conversion on the paragraph continuous vector to obtain paragraph binary codes

The paragraph binary code->

Includes paragraph positive sample binary code->

And paragraph negative sample binary code->

Store paragraph binary codes

Paragraph positive sample binary code->

And paragraph negative sample binary code->

。

SA-5 will input problem continuous vector

Question binary code->

Paragraph positive sample continuous vector->

Paragraph negative sample continuous vector->

Paragraph positive sample binary code->

And paragraph negative sample binary code->

And calculating according to the task loss function set for the retriever through a forward propagation algorithm to obtain the loss value of each task, and calculating the final loss value through the loss value of each task.

Specifically, 4 task loss functions are set for the retriever, and the 4 task loss functions are negative log likelihood functions of the positive sample of the minimized paragraph, and the expression of each loss function is as follows:

(1) In the (4) mode,

successive vectors representing the current question +.>

Binary code representing the current question->

Representing a positive sample continuation vector of a paragraph associated with the current question,/->

Representing a positive sample binary code of a paragraph associated with the current problem,

paragraph negative sample continuous vector representing irrelevant to the current question, +.>

Representing a paragraph negative sample binary code that is not related to the current question.

Final loss value

The loss values calculated by the formulas (1) - (4) are weighted to meet the following relation: />

(5) in the formula ,

a weighting factor of 4 loss values.

SA-6 updates parameters of the feature extraction module, the problem linear layer and the paragraph linear layer through a back propagation algorithm according to the final loss value.

Repeating the steps S-1C to SA-6 until the loss value tends to be stable or reaches the iteration number or is smaller than the iteration threshold value, obtaining a trained retriever, and storing the retriever.

In the present embodiment, the super-parameters of the retriever training are set to learning rate learning_rate=2×e ^-4 Maximum input max_length=256, and the optimizer is Adam optimizer.

2. Training answer generator

2.1 Creating a second data set for training the answer generator

Based on the trained retriever, an answer training dataset needs to be built before training the answer generator, specifically by the following way.

(1) And acquiring a corpus knowledge base in the vertical field, and intercepting sentences with fixed length from the corpus knowledge base by using a fixed probability as a question of an answer generator training data set.

(2) Inputting the questions into a question encoder of a trained retriever to obtain a question continuous vector

And question binary code->

The method comprises the steps of carrying out a first treatment on the surface of the Continuous vector of questions/>

And question binary code->

Input to supporting document indexer by calculating problem binary code +.>

Obtaining m paragraphs from the hamming distance of the paragraph binary codes in the paragraph index library by calculating the problem continuous vector +.>

And screening K paragraphs with the largest inner product value from the m paragraphs to form a question support document, taking the question support document as an answer, and taking the answer as a second label.

Repeating the steps (1) and (2) to obtain second labels corresponding to all the problems, and manufacturing a data set of the problems and the mapping of the second labels.

2.2 Training

SC-1 converts a batch of questions of the second data set and questions in the mapping of the second label into vectors, resulting in a question vector.

And the SC-2 performs feature extraction on the input question vector, and generates a second prediction answer by adopting a generate function and a greedy decoding algorithm.

And SC-3 calculates the cross entropy between the second predicted answer and the second label, takes the cross entropy as a second loss value, and updates the parameters of an answer generator through a back propagation algorithm, specifically updates the parameters of a BART model and a linear layer with a language head.

Repeating the steps SC-1 to SC-3 until the loss value tends to be stable or reaches the iteration number or less than the iteration threshold value, obtaining a trained answer generator, and storing the answer generator.

In this embodiment, the BART model of the answer generator is initialized by using the BART-large-Chinese weight; the trained super-parameters were set to learn rate learning_rate=2×e ^-4 Maximum input max_length=1024, and the optimizer is Adam optimizer.

3. Fine tuning answer generator

3.1 Creating a third data set for fine tuning the answer generator

Based on the trained answer generator, a fine tuning data set needs to be established before fine tuning the answer generator, specifically by the following method.

(1) And obtaining question-answer pair data in the vertical field.

And question binary code->

The method comprises the steps of carrying out a first treatment on the surface of the Continuous vector of questions->

And question binary code->

Input to supporting document indexer by calculating problem binary code +.>

Obtaining m paragraphs from the Hamming distance between the paragraph index library and the paragraph binary code library in the paragraph index library, and calculating the problem continuous vector +.>

And (3) filtering out K paragraphs with the largest inner product value from the m paragraphs with the inner product values of the m paragraphs to form a problem supporting document.

Repeating the steps (1) and (2) to obtain a question support document of all questions, and making a data set of the mapping of the questions, the question support document and a third label, wherein the third label is an answer paragraph in the vertical field answer pair data.

3.2 Fine tuning

SD-1 converts a certain batch of questions of the third data set, questions and question support documents in the mapping of the question support documents and the third labels into vectors to obtain question vectors and support document vectors, and performs splicing processing on the question vectors and the support document vectors to obtain question document splicing vectors.

And SD-2 performs feature extraction on the input question document splicing vector, and generates a third prediction answer by adopting a generate function and a greedy decoding algorithm.

And SD-3 calculates the cross entropy between the third predicted answer and the third label, takes the cross entropy as a third loss value, updates the parameters of the trained answer generator through a back propagation algorithm, and particularly updates the parameters of the BART model and the linear layer with the language head.

Repeating the steps SD-1 to SD-3 until the loss value tends to be stable or reaches the iteration number or is smaller than the iteration threshold value, obtaining a finely tuned answer generator, and storing the answer generator.

In the present embodiment, the fine-tuned super-parameters are set to the learning rate learning_rate=2×e ^-5 Maximum input max_length=1024, and the optimizer is Adam optimizer.

The initial question-answering system constructed by the invention is optimized through the steps, so that the optimized question-answering system is obtained.

(III) operation of question-answering System

Referring to fig. 4, the present invention generates a predicted answer to a question by running the following steps on an optimized open domain question-answering system.

S-1A converts the input problem into a word vector, and adds paragraph features and position information to the word vector to obtain a sentence matrix integrating the word vector, the paragraph features and the position information.

S-2A performs semantic feature extraction on an input sentence matrix to obtain a semantic feature matrix, and divides the semantic feature matrix into problem feature matrices according to sentence head vectors of the semantic feature matrix.

S-3A carries out linear transformation on the problem feature matrix to obtain a problem continuous vector.

S-4A performs binary code conversion on the problem continuous vector to obtain a problem binary code.

S-5, sequentially screening K paragraphs with the largest inner product value from the paragraph index library through the question binary code and the question continuous vector to serve as supporting documents of the questions, and obtaining the question supporting documents.

According to the invention, only a single feature extraction model is adopted in the feature extraction module, and the retriever formed by the structures of the problem linear layer-problem hash layer and the paragraph linear layer-paragraph Ha Xiceng is combined at the classification output end of the feature extraction model, so that model parameters of the retriever can be greatly reduced, continuous vectors are compressed into binary codes, the storage space of an index memory is reduced, and the continuous vectors and the binary codes are used for calculating a loss function of the model, so that the model can be effectively trained; the method for searching the supported documents by using the continuous vectors and the binary codes to screen paragraphs through the field vector searching tool in the paragraph index library and then further screening the paragraphs through the continuous vectors can search the supported documents with the largest correlation degree with the questions, and reduces the time consumption of searching high-quality answers.

Meanwhile, the unsupervised training method of the answer generator of the open domain question-answering system provided by the invention enables the output predictive label to be closer to the style of the answer, and simultaneously, a knowledge-enhanced encoder is used, namely, the questions and the supporting documents are spliced together and input into the encoder, and compared with the method for inputting the questions only, the method can output the answers with smaller confusion.

Based on the same inventive concept, the present application also provides an electronic device, which may be a terminal device such as a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet computer, a netbook, etc.). The device comprises one or more processors and a memory, wherein the processors are used for executing a program to realize an answer prediction method of open domain questions and answers; the memory is used for storing a computer program executable by the processor.

Based on the same inventive concept, the present application further provides a computer readable storage medium, corresponding to the foregoing embodiment of the answer prediction method of open domain questions and answers, having stored thereon a computer program, which when executed by a processor, implements the steps of the aspect word emotion analysis method described in any of the foregoing embodiments.

The present application may take the form of a computer program product embodied on one or more storage media (including, but not limited to, magnetic disk storage, CD-ROM, optical storage, etc.) having program code embodied therein. Computer-usable storage media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to: phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, may be used to store information that may be accessed by the computing device.

The above examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that modifications and improvements can be made by those skilled in the art without departing from the spirit of the invention, and the invention is intended to encompass such modifications and improvements.

Claims

1. An answer prediction method for open domain questions and answers is characterized by comprising the following steps:

2. The answer prediction method of open domain questions and answers according to claim 1, wherein the paragraph binary codes of the documents of the knowledge base stored in the paragraph index library are obtained by:

s-0: acquiring a document from a large-scale knowledge base in the vertical field, and intercepting the document into paragraphs with fixed word numbers;

S-1B: converting the input paragraph into a word vector, and adding paragraph features and position information to the word vector to obtain a sentence matrix integrating the word vector, the paragraph features and the position information;

S-2B: extracting semantic features of the sentence matrix to obtain a semantic feature matrix, and dividing the semantic feature matrix into paragraph feature matrices according to sentence head vectors of the semantic feature matrix;

S-3B: performing linear transformation on the paragraph feature matrix to obtain paragraph continuous vectors;

S-4B: and performing binary code conversion on the paragraph continuous vector to obtain paragraph binary codes.

3. The answer prediction method of open domain questions and answers as claimed in claim 2, wherein said step S-5 comprises the steps of:

4. The answer prediction method of open domain questions and answers according to any one of claims 1-3, wherein the parameters of semantic feature extraction in the step S-2A and/or the step S-2B and the parameters of linear transformation in the step S-3A and/or the step S-3B are obtained by:

S-1C, respectively converting a certain batch of input questions and answers into word vectors, and adding paragraph features and position information to the word vectors to obtain sentence matrixes integrating the word vectors, the paragraph features and the position information;

s-2, extracting semantic features of an input sentence matrix to obtain a semantic feature matrix, and dividing the semantic feature matrix into a problem feature matrix or a paragraph feature matrix according to sentence head vectors of the semantic feature matrix;

S-3B carries out linear transformation on the paragraph feature matrix to obtain paragraph continuous vectors, wherein the paragraph continuous vectors comprise paragraph positive sample continuous vectors and paragraph negative sample continuous vectors;

S-4B performs binary code conversion on the paragraph continuous vector to obtain paragraph binary codes, wherein the paragraph binary codes comprise paragraph positive sample binary codes and paragraph negative sample binary codes;

SA-5 carries out matrix operation on the input problem continuous vector, the paragraph positive sample continuous vector, the paragraph negative sample continuous vector, the problem binary code, the paragraph positive sample binary code and the paragraph negative sample binary code through forward propagation according to 4 task loss functions set for a retriever to obtain loss values of 4 tasks, and calculates a final loss value through the loss values of the 4 tasks;

the 4 task loss functions are all negative log likelihood functions of the positive samples of the minimized paragraphs, and each loss function satisfies the following relation:

wherein ,

successive vectors representing the current question +.>

Binary code representing the current question->

Representing a paragraph positive sample binary code associated with the current question, -/->

A paragraph negative sample binary code representing a paragraph not related to the current question;

the final loss value satisfies the following relation:

wherein ,

weight coefficients for 4 loss values;

SA-6 updates parameters of feature extraction and linear transformation through a back propagation algorithm according to the final loss value;

repeating the steps S-1C to SA-6 until the loss value tends to be stable or reaches the iteration number or is smaller than the iteration threshold.

5. The answer prediction method of open domain questions and answers as claimed in claim 4, wherein the parameters for generating the predicted answers in the step S-6 are obtained by:

SC-1 converts a certain batch of input questions and questions in the mapping of the second label into vectors to obtain question vectors;

SC-2 performs feature extraction on the input question vector, and generates a second prediction answer by adopting a generate function and a greedy decoding algorithm;

SC-3 calculates the cross entropy between the second predicted answer and the second label, takes the cross entropy as a second loss value, and updates the parameters of an answer generator through a back propagation algorithm;

repeating the steps SC-1 to SC-3 until the loss value tends to be stable or reaches the iteration number or is smaller than an iteration threshold;

wherein the mapping of the question and the second label is obtained by:

SE-1 acquires a corpus knowledge base in the vertical field, and intercepts sentences with fixed length in the corpus knowledge base as problems and problems in the mapping of a second label by using a fixed probability random;

SE-2 executes steps S-3A and S-4A on the problem to obtain a problem continuous vector and a problem binary code;

step S-5 is executed on the problem continuous vector and the problem binary code by SE-3 to obtain a problem supporting document, and the problem supporting document is used as a second label;

and repeating SE-1 to SE-3 to obtain second labels corresponding to all the problems, and manufacturing a data set of the problems and the mapping of the second labels.

6. An open domain question-answering system, comprising:

7. The open-domain question-answering system according to claim 6, wherein the retriever further comprises:

8. The open domain question-answering system according to claim 6 or 7, wherein the supporting document indexer is a fasss library for calculating hamming distances between the problem binary code and the segment binary codes in the segment index library, and screening out m segments nearest to the problem binary code hamming distance; and performing inner product operation on the problem continuous vector and the m paragraphs, screening out K paragraphs with the largest inner product value as supporting documents of the problem, and obtaining the problem supporting documents.

9. The open domain question-answering system according to claim 8, wherein the parameters of the retriever are obtained by:

wherein ,

successive vectors representing the current question +.>

Binary code representing the current question->

the final loss value satisfies the following relation:

wherein ,

weight coefficients for 4 loss values;

10. The open domain question-answering system according to claim 9, wherein the parameters of the answer generator are obtained by:

wherein the mapping of the question and the second label is obtained by: