CN116415594A - Question-answer pair generation method and electronic device

Question-answer pair generation method and electronic device

Info

Publication number
CN116415594A
Authority
CN
China
Prior art keywords
text
paragraph
picture
model
question
Prior art date
Legal status
Pending
Application number
CN202111631090.0A
Other languages
Chinese (zh)
Inventor
徐传飞
李一同
彭超
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority application: CN202111631090.0A
PCT filing: PCT/CN2022/141700 (published as WO2023125335A1)
Publication: CN116415594A
Legal status: Pending

Classifications

    • G06F40/35 Semantic analysis; Discourse or dialogue representation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/25 Fusion techniques
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/295 Named entity recognition
    • G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a question-answer pair generation method, in which a product document is parsed to obtain the paragraph texts of the product document, the pictures associated with each paragraph text, and the keywords corresponding to each paragraph text. The paragraph text, its associated picture, and the keywords extracted from the paragraph text are input into a multimodal question generation model to obtain a plurality of text questions and a plurality of picture questions corresponding to the paragraph text and its associated picture. The paragraph text and its associated picture are taken as the answer, and a plurality of pre-selected question-answer pairs are formed from the text questions and picture questions generated for them. Finally, the similarity between the question and the answer in each pre-selected question-answer pair is calculated; the pre-selected question-answer pairs whose similarity is smaller than a similarity threshold are deleted, and the questions and answers whose similarity is greater than the similarity threshold are taken as the final question-answer pairs.

Description

Question-answer pair generation method and electronic device
Technical Field
The application relates to the field of terminals, and in particular to a question-answer pair generation method and an electronic device.
Background
With the rapid development of artificial intelligence technology, user demand for intelligent question-answering systems is increasing. An intelligent question-answering system works in a one-question-one-answer form: it accurately locates the knowledge the user is asking about and, through interaction with the user, provides personalized information services.
At present, in order to better fit the questions users ask in real life, multimodal intelligent question-answering systems have received great attention. A multimodal intelligent question-answering system can receive multimodal questions from the user (for example, questions containing pictures and text), search a multimodal question-answer pair library for multimodal answers matching the user's question, and feed the answers back to the user. This meets the user's need to ask questions in multiple modalities and enriches the information the user can query.
However, in current multimodal intelligent question-answering systems, the multimodal question-answer pair library is generally built and maintained manually; for example, business personnel extract questions and answers that have a question-answer relationship from a document and write the question-answer pairs. Such manual writing involves a heavy workload and consumes a great deal of manpower.
Disclosure of Invention
The application provides a question-answer pair generation method and an electronic device, by means of which a large number of multimodal question-answer pairs can be generated automatically.
In a first aspect, an embodiment of the present application provides a question-answer pair generation method. The method includes: an electronic device obtains a target document; the electronic device parses the target document to obtain data of a plurality of paragraphs of the target document, including data of a first paragraph; the electronic device inputs the data of the first paragraph into a multimodal question generation model to obtain a plurality of multimodal questions corresponding to the first paragraph, where the multimodal questions corresponding to the first paragraph include text questions corresponding to the first paragraph and picture questions corresponding to the first paragraph; and the electronic device generates a plurality of pre-selected question-answer pairs corresponding to the first paragraph based on the data of the first paragraph and the plurality of multimodal questions corresponding to the first paragraph.
It should be noted that in the embodiments of the present application the modalities include a picture modality and a text modality; this is not limiting, and the modalities may also include, for example, an audio modality.
By implementing the method provided in the first aspect, the electronic device can obtain a target document and, based on it, automatically generate a large number of multimodal question-answer pairs containing multimodal questions and multimodal answers. A question-answering system therefore no longer needs large numbers of manually written question-answer pairs, which saves labor.
In some implementations, the data of the first paragraph includes the first paragraph text, a picture associated with the first paragraph text, and a keyword set corresponding to the first paragraph.
In this implementation, the paragraph texts in the target document, the pictures associated with the paragraph texts, and the keyword sets corresponding to the paragraph texts can be extracted by processing the target document. In this way, the text information, the picture information, and the association information between them in the target document can be fully utilized, so that the generated questions are more accurate. In addition, extracting the keyword information from the paragraph text constrains the generated questions to the required range, which also makes them more accurate.
With reference to the method provided in the first aspect, in some implementations, the electronic device parsing the target document to obtain the data of the plurality of paragraphs of the target document includes: the electronic device performs paragraph division and picture extraction on the target document based on the structure of the target document to obtain a plurality of paragraph texts and a plurality of pictures corresponding to the target document;
the electronic device associates the plurality of pictures with the plurality of paragraph texts to obtain the pictures corresponding to the plurality of paragraph texts; and the electronic device performs keyword extraction on the plurality of paragraph texts to obtain the keyword sets corresponding to the plurality of paragraph texts.
With reference to the method provided in the first aspect, in some implementations, a pre-selected question-answer pair corresponding to the first paragraph includes an answer corresponding to the first paragraph and a question corresponding to the first paragraph; the answer corresponding to the first paragraph includes the first paragraph text and/or the picture associated with the first paragraph text, and the question corresponding to the first paragraph includes a picture question corresponding to the first paragraph and/or a text question corresponding to the first paragraph.
With reference to the method provided in the first aspect, in some implementations, after the plurality of pre-selected question-answer pairs corresponding to the first paragraph are generated, the method further includes: the electronic device calculates the similarity between the question and the answer in each of the plurality of pre-selected question-answer pairs corresponding to the first paragraph;
the electronic device selects, from the pre-selected question-answer pairs corresponding to the first paragraph and based on the similarities between their questions and answers, the pre-selected question-answer pairs whose similarity meets a preset similarity threshold.
In this implementation, similarity calculation is performed on the generated pre-selected question-answer pairs, so that the pre-selected question-answer pairs that do not meet the requirement are removed and the quality of the question-answer pairs is improved.
With reference to the method provided in the first aspect, in some implementations, the electronic device calculating the similarity between the questions and the answers in the plurality of pre-selected question-answer pairs corresponding to the first paragraph includes: the electronic device inputs the picture question corresponding to the first paragraph into a picture coding model and outputs a first picture question sequence vector, and inputs the text question corresponding to the first paragraph into a text coding model and outputs a first text question sequence vector;
the electronic device inputs the first picture question sequence vector and the first text question sequence vector into a cross-modal coding model for fusion coding to obtain a second picture question sequence vector and a second text question sequence vector;
the electronic device inputs the picture corresponding to the first paragraph into the picture coding model and outputs a first picture answer sequence vector, and inputs the text corresponding to the first paragraph into the text coding model and outputs a first text answer sequence vector;
the electronic device inputs the first picture answer sequence vector and the first text answer sequence vector into the cross-modal coding model for fusion coding to obtain a second picture answer sequence vector and a second text answer sequence vector;
the electronic device then calculates the similarities among the second picture question sequence vector, the second text question sequence vector, the second picture answer sequence vector, and the second text answer sequence vector.
In this implementation, the picture question and the picture answer in a pre-selected question-answer pair are input into the picture coding model, and the text question and the text answer are input into the text coding model: the picture question and the picture answer are mapped into a picture vector space to obtain the corresponding picture vector sequences, and the text question and the text answer are mapped into a text vector space to obtain the corresponding text vector sequences. Then, through the cross-modal coding model, the vector sequences corresponding to the picture question, the picture answer, the text question, and the text answer are mapped into a fusion space, and the similarity between the question and the answer is calculated in that fusion space. In this way, the calculated similarity between the question and the answer of a question-answer pair is more accurate.
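As an illustration of this filtering step, the sketch below assumes the question and the answer have already been encoded into fixed-length picture and text vectors by the single-modality coding models; the module structure, the dimensions, and the cosine-similarity scoring are assumptions made for this example and are not the application's specified implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFuser(nn.Module):
    """Toy stand-in for the cross-modal coding model: maps a picture vector and a
    text vector into a shared fusion space (dimensions are arbitrary)."""
    def __init__(self, dim=128):
        super().__init__()
        self.pic_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)
        self.mix = nn.Linear(2 * dim, 2 * dim)

    def forward(self, pic_vec, txt_vec):
        fused = self.mix(torch.cat([self.pic_proj(pic_vec), self.txt_proj(txt_vec)], dim=-1))
        pic_fused, txt_fused = fused.chunk(2, dim=-1)   # second picture / text sequence vectors
        return pic_fused, txt_fused

def qa_similarity(fuser, q_pic, q_txt, a_pic, a_txt):
    """Fuse the question and answer vectors, then score them by cosine similarity."""
    q_pic_f, q_txt_f = fuser(q_pic, q_txt)
    a_pic_f, a_txt_f = fuser(a_pic, a_txt)
    question = torch.cat([q_pic_f, q_txt_f], dim=-1)
    answer = torch.cat([a_pic_f, a_txt_f], dim=-1)
    return F.cosine_similarity(question, answer, dim=-1)

# Toy usage with random tensors standing in for the single-modality encoder outputs:
fuser = CrossModalFuser()
sim = qa_similarity(fuser, torch.randn(1, 128), torch.randn(1, 128),
                    torch.randn(1, 128), torch.randn(1, 128))
keep = sim.item() >= 0.5   # discard the pre-selected pair if below the similarity threshold
```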
With reference to the method provided in the first aspect, in some implementations, the multimodal question generation model includes a text coding model, a picture coding model, a cross-modal coding model, a text decoding model, and a picture decoding model;
the electronic device inputting the data of the first paragraph into the multimodal question generation model to obtain the plurality of multimodal questions corresponding to the first paragraph specifically includes the following steps:
the electronic device inputs the first paragraph text into the text coding model to obtain a first text feature representation, inputs the keyword set corresponding to the first paragraph text into the text coding model to obtain a second text feature representation, and inputs the picture corresponding to the first paragraph text into the picture coding model to obtain a first picture feature representation;
the electronic device inputs the first text feature representation, the second text feature representation, and the first picture feature representation into the cross-modal coding model to obtain a first text fusion feature representation and a first picture fusion feature representation; the first text fusion feature representation comprises the first text feature representation, the second text feature representation, and the first picture feature representation, and the first picture fusion feature representation likewise comprises the first text feature representation, the second text feature representation, and the first picture feature representation;
the electronic device inputs the first text fusion feature representation into the text decoding model to obtain a plurality of text questions corresponding to the first paragraph, and inputs the first picture fusion feature representation into the picture decoding model to obtain a plurality of picture questions corresponding to the first paragraph.
In this implementation, the paragraph text and the keywords corresponding to the paragraph text are input into the text coding model to extract text features and keyword features, the picture associated with the paragraph text is input into the picture coding model to extract picture features, and the extracted text features, keyword features, and picture features are then input into the cross-modal coding model for fusion coding. In this way, not only are the text features, keyword features, and picture features extracted, but the cross-modal association features between the text and the picture are extracted as well, so that the generated questions are more accurate. In addition, because the keyword features are added during fusion coding, the generated questions are constrained to the required range, which also makes them more accurate.
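A structural sketch of such a question generation model is given below in PyTorch. The layer types (GRU encoders/decoders, a Transformer-based cross-modal encoder), the dimensions, and the treatment of pictures as pre-extracted feature vectors are all illustrative assumptions; only the overall structure (two single-modality encoders, one cross-modal fusion encoder, two decoders) follows the description above.

```python
import torch
import torch.nn as nn

class MultimodalQuestionGenerator(nn.Module):
    """Structural sketch: single-modality encoders, a cross-modal encoder, and
    two decoders. Layer types and sizes are illustrative assumptions."""
    def __init__(self, vocab=5000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.text_encoder = nn.GRU(dim, dim, batch_first=True)        # paragraph text and keywords
        self.pic_encoder = nn.Linear(2048, dim)                       # pre-extracted picture features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.cross_modal = nn.TransformerEncoder(layer, num_layers=2) # fusion coding
        self.text_decoder = nn.GRU(dim, dim, batch_first=True)
        self.pic_decoder = nn.GRU(dim, dim, batch_first=True)
        self.text_out = nn.Linear(dim, vocab)                         # text question token logits
        self.pic_out = nn.Linear(dim, vocab)                          # picture questions simplified to token logits

    def forward(self, paragraph_ids, keyword_ids, picture_feats):
        txt, _ = self.text_encoder(self.embed(paragraph_ids))         # first text feature representation
        kw, _ = self.text_encoder(self.embed(keyword_ids))            # second text feature representation
        pic = self.pic_encoder(picture_feats)                         # picture feature representation
        fused = self.cross_modal(torch.cat([txt, kw, pic], dim=1))    # cross-modal fusion coding
        n_text = txt.size(1) + kw.size(1)
        fused_text, fused_pic = fused[:, :n_text], fused[:, n_text:]  # text / picture fusion features
        txt_q, _ = self.text_decoder(fused_text)
        pic_q, _ = self.pic_decoder(fused_pic)
        return self.text_out(txt_q), self.pic_out(pic_q)

# Toy forward pass with random stand-in inputs:
model = MultimodalQuestionGenerator()
text_logits, pic_logits = model(torch.randint(0, 5000, (1, 20)),   # paragraph token ids
                                torch.randint(0, 5000, (1, 4)),    # keyword token ids
                                torch.randn(1, 3, 2048))           # 3 picture region feature vectors
```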
In some implementations, before the electronic device inputs the data of the first paragraph into the multimodal question generation model, the method further includes:
the electronic device acquires multimodal training data, where the multimodal training data includes multimodal answers, multimodal questions, and the keyword sets corresponding to the multimodal answers;
the electronic device inputs the multimodal answers and the keyword sets corresponding to the multimodal answers into a multimodal pre-training model and outputs predicted multimodal questions, where the multimodal pre-training model includes a text pre-training model, a picture pre-training model, and a first cross-modal coding model;
the electronic device determines a prediction error based on the predicted multimodal questions and the multimodal questions;
and the electronic device adjusts the multimodal pre-training model based on the prediction error until the prediction error meets a training stop condition, to obtain the multimodal question generation model.
With reference to the method provided in the first aspect, in some implementations, before the electronic device inputs the multimodal answers and the keyword sets corresponding to the multimodal answers into the multimodal pre-training model, the method further includes:
the electronic device acquires pre-training data and a pre-training model, where the pre-training data includes pre-training text data and pre-training picture data;
the electronic device pre-trains the pre-training model with the pre-training text data to obtain the text pre-training model, and pre-trains the pre-training model with the pre-training picture data to obtain the picture pre-training model.
In this implementation, the multimodal question generation model is obtained by first pre-training and then, on that basis, performing fine-tuning training with multimodal training data that have a question-answer relationship. A general feature extraction capability is thus learned with the pre-training model, and when the question generation task is subsequently obtained through fine-tuning, only a small amount of manually annotated data is needed for the fine-tuning training, which saves manual effort.
In a second aspect, embodiments of the present application provide an electronic device comprising one or more processors and one or more memories; wherein the one or more memories are coupled to the one or more processors, the one or more memories being operable to store computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described in the first aspect and any possible implementation of the first aspect.
In a third aspect, embodiments of the present application provide a chip system for application to an electronic device, the chip system comprising one or more processors for invoking computer instructions to cause the electronic device to perform a method as described in the first aspect and any possible implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform a method as described in the first aspect and any possible implementation of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer-readable storage medium comprising instructions that, when executed on an electronic device, cause the electronic device to perform a method as described in the first aspect and any possible implementation of the first aspect.
It will be appreciated that the electronic device provided in the second aspect, the chip system provided in the third aspect, the computer program product provided in the fourth aspect, and the computer storage medium provided in the fifth aspect are all configured to perform the method provided by the embodiments of the present application. Therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding method, and details are not described here again.
Drawings
FIG. 1 is an interface schematic diagram of an intelligent customer service answering system provided in an embodiment of the present application;
FIG. 2 is a flow chart of a process for training a multimodal question generation model provided by an embodiment of the present application;
FIG. 3 is a flow chart of pre-training provided by an embodiment of the present application;
FIG. 4 is a flow chart of a fine tuning training process provided by an embodiment of the present application;
FIG. 5 is a flow chart of a method for generating question-answer pairs according to an embodiment of the present application;
FIG. 6 is a partial screenshot of a bracelet user manual provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a process of parsing a target document to obtain a paragraph text, a picture associated with the paragraph text, and a keyword corresponding to the paragraph text according to the embodiment of the present application;
FIG. 8 is a schematic diagram of a process for generating multiple questions based on a bracelet user manual and a multimodal question generation model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a multimodal question-answer pair provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of a process for calculating similarity between questions and answers in question-answer pairs according to an embodiment of the present application;
fig. 11 is a schematic hardware structure of an electronic device 100 according to an embodiment of the present application;
fig. 12 is a schematic software architecture of the electronic device 100 according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and thoroughly below with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. The term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the three cases where A exists alone, A and B exist together, and B exists alone. In addition, in the description of the embodiments of the present application, "a plurality of" means two or more.
The terms "first," "second," and the like, are used below for descriptive purposes only and are not to be construed as implying or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature, and in the description of embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
The term "User Interface (UI)" in the following embodiments of the present application is a media interface for interaction and information exchange between an application program or an operating system and a user, which enables conversion between an internal form of information and an acceptable form of the user. The user interface is a source code written in a specific computer language such as java, extensible markup language (extensible markup language, XML) and the like, and the interface source code is analyzed and rendered on the electronic equipment to finally be presented as content which can be identified by a user. A commonly used presentation form of the user interface is a graphical user interface (graphic user interface, GUI), which refers to a user interface related to computer operations that is displayed in a graphical manner. It may be a visual interface element of text, icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, widgets, etc., displayed in a display of the electronic device.
First, a scenario related to the embodiments of the present application will be described.
Referring to FIG. 1, FIG. 1 is an interface schematic diagram of an intelligent customer service answering system according to an embodiment of the present application. As shown in FIG. 1, the customer service answering system answers multimodal picture-and-text questions entered by a user. The user can input text questions and picture questions to the intelligent customer service and obtain the multimodal answers it outputs. For example, when the user inputs a picture of a bracelet together with a question about the bracelet such as "How does it vibrate?", the intelligent customer service analyzes the bracelet picture and the question, searches the multimodal question-answer pair library for the text answer and picture answer corresponding to the bracelet picture and the bracelet question, and outputs to the user the text answer and picture answer related to the bracelet's vibration.
The multimodal question-answer pair library is written manually in advance; that is, business personnel are required to annotate, according to a product manual or specification, the multimodal questions and multimodal answers that have a question-answer relationship, thereby forming the multimodal question-answer pair library. Such manual writing involves a heavy workload and consumes a great deal of manpower. In addition, different business personnel understand questions and answers by inconsistent standards, so the annotated multimodal question-answer pairs vary in quality and the degree of matching between questions and answers is not high.
Therefore, an embodiment of the application provides a question-answer pair generation method, in which a product document is parsed to obtain the paragraph texts of the product document, the pictures associated with each paragraph text, and the keywords corresponding to each paragraph text. The paragraph text, its associated picture, and the keywords extracted from the paragraph text are input into a multimodal question generation model to obtain a plurality of text questions and a plurality of picture questions corresponding to the paragraph text and its associated picture. The paragraph text and its associated picture are taken as the answer, and a plurality of pre-selected question-answer pairs are formed from the text questions and picture questions generated for them. Finally, the similarity between the question and the answer in each pre-selected question-answer pair is calculated; the pre-selected question-answer pairs whose similarity is smaller than a similarity threshold are deleted, and the questions and answers whose similarity is greater than the similarity threshold are taken as the final question-answer pairs.
In this way, on the one hand, a plurality of question-answer pairs containing multimodal questions and multimodal answers are generated automatically from the product document, forming a multimodal question-answer pair library, so the question-answering system does not need a large number of manually written question-answer pairs, which saves labor. On the other hand, by performing similarity calculation on the generated pre-selected question-answer pairs, the pre-selected question-answer pairs that do not meet the requirement are removed, which improves the quality of the question-answer pairs.
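To summarize the overall flow, a minimal sketch of the pipeline follows. The data structure and the two callables (question generation and similarity scoring) are placeholders assumed for illustration; they stand in for the multimodal question generation model and the similarity calculation described later, and are not part of the disclosed implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Paragraph:
    text: str
    pictures: List[str] = field(default_factory=list)   # e.g. file names of associated pictures
    keywords: List[str] = field(default_factory=list)

def build_qa_pairs(paragraphs: List[Paragraph],
                   generate_questions: Callable[[Paragraph], List[str]],
                   similarity: Callable[[str, Paragraph], float],
                   threshold: float = 0.5):
    """Form pre-selected question-answer pairs per paragraph and keep those above the threshold."""
    qa_pairs = []
    for p in paragraphs:
        for q in generate_questions(p):          # text questions and picture questions
            if similarity(q, p) >= threshold:    # question/answer similarity filter
                qa_pairs.append((q, p))          # the paragraph data serves as the answer
    return qa_pairs

# Toy usage with stand-in callables:
paras = [Paragraph("Swipe up and tap the side button to turn off the alarm.",
                   ["fig_alarm.png"], ["alarm", "band"])]
qas = build_qa_pairs(paras,
                     generate_questions=lambda p: [f"How do I handle the {k}?" for k in p.keywords],
                     similarity=lambda q, p: 1.0 if any(k in q for k in p.keywords) else 0.0)
print(qas)
```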
In the embodiments of the application, the question-answer pair generation method can be divided into two parts: the first part is the training stage of the multimodal question generation model, and the second part generates question-answer pairs based on the multimodal question generation model.
Part one: the training stage of the multimodal question generation model.
Next, the training of the multimodal question generation model according to the present application will be described. Referring to FIG. 2, FIG. 2 illustrates the process of training the multimodal question generation model. As shown in FIG. 2, the training process of the multimodal question generation model includes steps S101-S105.
In the embodiments of the present application, the training process of the multimodal question generation model may be divided into two stages: the first stage, steps S101-S103, is the pre-training stage; the second stage, steps S104-S105, is the fine-tuning training stage.
Steps S101-S102 introduce the process of pre-training a pre-training model with pre-training data to obtain a text pre-training model and a picture pre-training model; step S103 introduces the process of obtaining a multimodal pre-training model based on the text pre-training model and the picture pre-training model; and steps S104-S105 introduce the process of fine-tuning the multimodal pre-training model with manually annotated training data to obtain the multimodal question generation model. These are described separately below.
Step S101-S103, pre-training stage.
S101: The electronic device acquires pre-training data, where the pre-training data includes pre-training text data and pre-training picture data.
Specifically, the pre-training data may be obtained from the network through a web crawler or may be written manually. For example, a web crawler is used to obtain documents from Wikipedia, Baidu Baike, Baidu Zhidao, blogs, forums, and the like as pre-training data. The documents may be documents of a certain product, for example a service manual or user instructions for the product, or may be articles, etc., which is not limited in this application.
S102: The electronic device pre-trains a pre-training model on the pre-training text data to obtain a trained text pre-training model, and pre-trains a pre-training model on the pre-training picture data to obtain a trained picture pre-training model.
Pre-training refers to training, on a large-scale unlabeled data set, a generic model that is not tied to any specific task; such a model may be called a pre-training model, and it then only needs to be fine-tuned to handle various downstream tasks, such as image recognition, image generation, text generation, visual question answering, and classification. That is, pre-training trains the pre-training model on a large amount of unlabeled data so that the pre-training model acquires the ability to extract features.
It is understood that the pre-training model may be a natural language processing model or a picture processing model, which is not limited in this application. In some embodiments, the pre-training model may include an encoding (encoder) model and a decoding (decoder) model, which correspond to two recurrent neural networks (Recurrent Neural Network, RNN) handling the input sequence and the output sequence, respectively. The basic idea of the common encoder-decoder structure is to use two RNNs, one as the encoder and the other as the decoder. The encoder is responsible for compressing the input sequence into a vector of specified length, which can be regarded as the semantics of the sequence; this process is called encoding. The decoder is responsible for generating the specified sequence from the semantic vector; this process is called decoding. In some embodiments, the coding model is also referred to as an encoder and the decoding model as a decoder, which is not limited in this application. The coding model and the decoding model may be unidirectional or bidirectional neural network models, and the embodiments of the present application do not limit the structure of the pre-training model; for example, the coding model and the decoding model may be a mainstream Transformer model, a convolutional neural network (Convolutional Neural Networks, CNN) model, a Long Short-Term Memory (LSTM) model, and the like.
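For reference, a minimal encoder-decoder sketch of the kind described above is shown below in PyTorch, using GRUs; the vocabulary size, dimensions, and layer choice are arbitrary illustrative values and not the model structure specified by the application.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: the encoder compresses the input sequence
    into a fixed-length semantic vector; the decoder generates the output
    sequence from that vector."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src_ids, tgt_ids):
        _, semantic = self.encoder(self.embed(src_ids))            # encoding: sequence -> semantic vector
        dec_out, _ = self.decoder(self.embed(tgt_ids), semantic)   # decoding: vector -> output sequence
        return self.out(dec_out)

logits = Seq2Seq()(torch.randint(0, 1000, (1, 12)), torch.randint(0, 1000, (1, 8)))
```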
In the pre-training stage, the pre-training data of the two single modalities are used to pre-train the corresponding single-modality pre-training models. That is, the pre-training picture data are used to pre-train the picture pre-training model, which learns the capability of extracting picture features, yielding a trained picture pre-training model; and the pre-training text data are used to pre-train the text pre-training model, which learns the capability of extracting text features, yielding a trained text pre-training model.
For example, referring to FIG. 3, FIG. 3 illustrates the pre-training process. As shown in (a) of FIG. 3, for text data, the text coding model 110 in the pre-training model is used to encode the pre-training text data and extract text features to obtain a feature representation of the pre-training text data, and the text decoding model 120 in the pre-training model is then used to decode the feature representation of the pre-training text to obtain predicted text data. The predicted text data are compared with the pre-training text data to calculate a prediction error, and the weight parameters in the text coding model 110 and the text decoding model 120 are adjusted through back-propagation based on the prediction error until the pre-training model converges, so that a trained text pre-training model is obtained. In this way, the pre-training model learns the ability to extract text features. Model convergence may mean that the prediction error is smaller than a preset value, for example 0.001 or 0.0001, or that the number of training iterations reaches a preset number, for example 10000; this is not limited in this application.
Correspondingly, as shown in (b) of FIG. 3, for picture data, the picture coding model 210 in the pre-training model is used to encode the pre-training picture data and extract picture features to obtain a feature representation of the pre-training picture data, and the picture decoding model 220 in the pre-training model is then used to decode the feature representation of the pre-training picture to obtain predicted picture data. The predicted picture data are compared with the pre-training picture data to calculate a prediction error, and the weight parameters in the picture coding model 210 and the picture decoding model 220 are adjusted through back-propagation based on the prediction error until the pre-training model converges, so that a trained picture pre-training model is obtained. In this way, the pre-training model learns the ability to extract picture features.
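The reconstruction-style pre-training loop described above can be sketched as follows. The tiny linear autoencoder and the random tensors stand in for the real encoder/decoder models and the unlabeled text or picture data; only the training pattern (encode, decode, compare to the input, back-propagate until a convergence condition is met) follows the description.

```python
import torch
import torch.nn as nn

# Illustrative reconstruction-style pre-training loop; models and data are stand-ins.
encoder = nn.Linear(32, 8)
decoder = nn.Linear(8, 32)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
criterion = nn.MSELoss()

data = torch.randn(256, 32)                 # stand-in for unlabeled pre-training data
for step in range(10000):                   # preset maximum number of iterations
    predicted = decoder(encoder(data))      # encode features, then decode a prediction
    loss = criterion(predicted, data)       # prediction error vs. the original input
    optimizer.zero_grad()
    loss.backward()                         # back-propagate to adjust encoder/decoder weights
    optimizer.step()
    if loss.item() < 1e-3:                  # convergence condition from the description
        break
```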
S103: The electronic device obtains a multimodal pre-training model based on the trained text pre-training model and the trained picture pre-training model.
Specifically, a cross-modal coding model is added on top of the text pre-training model and the picture pre-training model trained in step S102 to obtain the multimodal pre-training model. That is, during pre-training, the text feature extraction capability of the text pre-training model and the picture feature extraction capability of the picture pre-training model are learned, that is, the converged weight parameters of the text pre-training model and of the picture pre-training model are obtained. The weight parameters at convergence of the text pre-training model are then taken as the initial parameters of the text coding model and the text decoding model in the multimodal pre-training model, and the weight parameters at convergence of the picture pre-training model are taken as the initial parameters of the picture coding model and the picture decoding model in the multimodal pre-training model.
Illustratively, referring to FIG. 4, FIG. 4 shows the fine-tuning training process of the multimodal pre-training model. As shown in FIG. 4, the multimodal pre-training model includes a single-modality coding layer, a cross-modal coding layer, and a single-modality decoding layer. The single-modality coding layer includes the trained text coding model 110 and the trained picture coding model 210; correspondingly, the single-modality decoding layer includes the trained text decoding model 120 and the trained picture decoding model 220.
The single-modality coding layer performs feature coding on the input of a single modality. For example, the picture coding model 210 performs feature coding on an input picture to obtain a picture feature representation, and the text coding model 110 performs feature coding on input text to obtain a text feature representation. The cross-modal coding model 310 performs cross-modal relevance coding on top of the single-modality coding models. For example, the picture feature representation and the text feature representation are input into the cross-modal coding model for fusion coding: feature fusion is performed, the association features between the text features and the picture features are learned, and the fusion features of the picture and of the text are output. The picture fusion features include both features of the picture and features of the text, and the text fusion features include both features of the text and features of the picture. The single-modality decoding layer performs feature decoding for a single modality to obtain the questions of that modality. For example, the picture fusion features are input into the picture decoding model 220 to obtain picture questions, and the text fusion features are input into the text decoding model 120 to obtain text questions.
Steps S104-S105: the fine-tuning training stage.
S104: The electronic device acquires multimodal training data.
Multimodal training data refers to training data that contains multiple modalities. In the embodiments of the present application, multimodal training data of the text modality and the picture modality is taken as an example for description; in practical applications, the multimodal data may also include other modalities such as audio and video, which is not limited in this application.
The multimodal training data includes multimodal question-answer pairs, each consisting of a multimodal question and the multimodal answer corresponding to it, for example multimodal question-answer pairs written manually based on a product document. The document may be a product instruction manual, a service manual, or the like. The paragraph texts of the document, the pictures associated with the paragraph texts, and the keywords in the paragraph texts may be extracted manually to serve as the multimodal answers, and a plurality of text questions and picture questions are then written based on them to form the multimodal question-answer pairs.
In some possible embodiments, the multimodal training data may also be multimodal question-answer pairs, each including a multimodal question and a multimodal answer, acquired by the electronic device from the network using a web crawler; for example, multimodal questions and multimodal answers may be acquired from public question-and-answer community websites such as Zhihu and Baidu Zhidao.
S105: The electronic device performs fine-tuning training on the multimodal pre-training model based on the multimodal training data to obtain the multimodal question generation model.
Specifically, the manually annotated multimodal question-answer pairs are used to perform fine-tuning training on the multimodal pre-training model to obtain the multimodal question generation model. Fine-tuning means further training the multimodal pre-training model with a small amount of manually annotated data; by adjusting the weight parameters acquired from the text pre-training model and the picture pre-training model, the final model becomes better suited to the current question generation task. In this way, a general feature extraction capability is learned with the pre-training model, and when the question generation task is subsequently obtained through fine-tuning, only a small amount of manually annotated data is needed for the fine-tuning training, which saves manual effort.
The fine-tuning training process of the multimodal question generation model is described in detail below with reference to FIG. 4.
Referring to FIG. 4, FIG. 4 illustrates the fine-tuning training process of the multimodal question generation model. As shown in FIG. 4, in the fine-tuning training stage, the manually annotated multimodal question-answer pairs are used to train the multimodal pre-training model.
In some implementations, first, the components of a manually written multimodal question-answer pair are each input into the corresponding single-modality coding model for coding to obtain single-modality feature representations: the keyword set corresponding to the paragraph text is input into the text coding model 110 for coding, and keyword features are extracted to obtain a keyword feature representation; the paragraph text is input into the text coding model 110 for coding, and text features are extracted to obtain a text feature representation; and the picture associated with the paragraph text is input into the picture coding model 210, and picture features are extracted to obtain a picture feature representation.
Next, the obtained single-modality feature representations are input into the cross-modal coding model 310, and fusion feature coding is performed on them to obtain fusion feature representations. That is, the keyword feature representation, the text feature representation, and the picture feature representation are input into the cross-modal encoder for fusion coding, and the correlations among the picture feature representation, the text feature representation, and the keyword feature representation are extracted and encoded, thereby obtaining a fusion representation of the picture and a fusion representation of the text.
It will be appreciated that the fused representation of the picture includes not only the features of the picture itself, but also keyword features and text features. Accordingly, the fused representation of the text not only includes the characteristics of the text itself, but also includes the keyword characteristics and the picture characteristics.
Then, the fusion representations obtained from the cross-modal coding model are input into their respective decoding models. That is, the fusion representation of the picture is input into the picture decoding model 220 for decoding, generating a plurality of picture questions, and the fusion representation of the text is input into the text decoding model 120 for decoding, generating a plurality of text questions.
Finally, a loss function is calculated over the generated picture questions, the generated text questions, the manually annotated text question labels, and the manually annotated picture question labels, and the weight parameters of each coding model and each decoding model in the question generation model are adjusted according to the loss value until the multimodal pre-training model converges. Model convergence may mean that the loss value is smaller than a preset value, for example 0.001 or 0.0001, or that the number of training iterations reaches a preset number, for example 10000; this is not limited in this application.
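A minimal sketch of this fine-tuning loop is shown below. The stand-in model, the vocabulary size, and the random token ids are assumptions for illustration; in the described method the model would be the multimodal pre-training model and the labels would be the manually annotated question tokens.

```python
import torch
import torch.nn as nn

# Fine-tuning loop sketch: compare generated question tokens with manually
# annotated question labels and adjust the weight parameters accordingly.
vocab = 5000
model = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab))   # trivial stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

answer_ids = torch.randint(0, vocab, (8, 20))        # stand-in multimodal answers (token ids)
question_labels = torch.randint(0, vocab, (8, 20))   # manually written question labels

for step in range(2000):                              # up to a preset number of iterations
    logits = model(answer_ids)                        # predicted question token logits
    loss = criterion(logits.reshape(-1, vocab), question_labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                   # adjust coding/decoding weight parameters
    optimizer.step()
    if loss.item() < 1e-3:                            # stop once the convergence condition is met
        break
```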
It should be noted that at the end of its training, each single-modality pre-training model converges, corresponding to a set of converged parameters. In the multimodal pre-training model, the initial parameters of the text coding model 110, the picture coding model 210, the text decoding model 120, and the picture decoding model 220 adopt the converged parameters of the single-modality pre-training models trained above, while the cross-modal coding model may have its initial parameters configured by random initialization.
The purpose of pre-training in the training stage of the above embodiment is to save multimodal training data, that is, a trained multimodal question generation model can be obtained by training with only a small amount of multimodal training data; this should not be construed as limiting the embodiments of the application. In some alternative embodiments, pre-training may be omitted from the training stage; for example, an initial multimodal question generation model may be trained directly with multimodal training data that have a question-answer relationship until the multimodal question generation model converges, to obtain the trained multimodal question generation model.
Part two: the application stage of the multimodal question generation model, in which multimodal question-answer pairs are generated.
Referring to FIG. 5, FIG. 5 is a schematic flowchart of a question-answer pair generation method according to an embodiment of the present application. As shown in FIG. 5, the question-answer pair generation method includes all or some of the following steps:
S201: The electronic device acquires a target document.
The target document can be the description content of the target product, wherein the description content comprises texts and pictures. The description of the product may include the model number of the product, instructions for use of the product, notes on use of the product, maintenance procedures, and the like.
It should be noted that the product in this embodiment may be a physical product, such as a mobile phone, a notebook computer, or a wearable product; a product in a virtual network, such as an online game; or a service-experience product, such as an amusement attraction.
In some embodiments, the target document may be a semi-structured document. The target document includes chapter structure information, where the chapter structure information refers to the multi-level table of contents, titles, paragraphs, abstracts, picture blocks, table blocks, typesetting, indentation format, and the like of the target document.
The target document may be, for example, a bracelet user manual. Referring to FIG. 6, FIG. 6 shows a partial screenshot of a bracelet user manual. As shown in FIG. 6, the bracelet user manual includes multi-level titles, for example the subtitles "Turn off the alarm" and "Delayed vibration" under the title "Smart alarm". The manual also includes the paragraphs corresponding to each title; for example, under the subtitle "Turn off the alarm" are the paragraph text and pictures about turning off the alarm. As shown in FIG. 6, the paragraph text about turning off the alarm includes a description of the specific operations for turning off the bracelet alarm, and the corresponding picture shows a bracelet, the alarm displayed on the bracelet, the operation of swiping up on the bracelet, and the operation of pressing the bracelet's side button. In addition, the manual may include indentation, line breaks, and the like; for example, the beginning of each paragraph is indented by two characters, and each paragraph ends with a line break.
It should be understood that the contents of fig. 6 are only a part of the specification of the bracelet, and the product specification shown in fig. 6 is only an example and should not be construed as limiting the application. In practical applications, the product specification may also be in other forms, and the embodiments of the present application are not limited in any way.
S202: The electronic device parses the target document to obtain paragraph texts, the pictures associated with the paragraph texts, and the keywords corresponding to the paragraph texts.
Specifically, the target document includes at least one paragraph, and by parsing the target document the electronic device can obtain the data of one or more of its paragraphs. That is, the electronic device may segment the target document to obtain a plurality of paragraph texts. The electronic device may also extract a plurality of pictures from the target document and associate each paragraph text with the pictures corresponding to it. Then, the electronic device extracts keywords from the paragraph texts to obtain the keyword set corresponding to each paragraph text.
The data of the one or more paragraphs of the target document includes the data of a first paragraph, and the data of the first paragraph includes the first paragraph text, the picture associated with the first paragraph text, and the keyword set corresponding to the first paragraph text. That is, the electronic device performs paragraph division and picture extraction on the target document to obtain a plurality of paragraph texts and a plurality of pictures, among which are the first paragraph text and the picture corresponding to it; the electronic device then associates the first paragraph text with its corresponding picture to obtain the picture associated with the first paragraph text, and extracts keywords from the first paragraph text to obtain the keyword set corresponding to the first paragraph text.
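As a concrete illustration of this parsing step, the sketch below assumes a preceding layout-analysis step has already turned the document into a flat list of typed blocks; the block format and the association rule (a picture is attached to the paragraph above it) are assumptions for this example, not the application's specified parsing technique.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParagraphData:
    text: str
    pictures: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)

def parse_document(blocks):
    """Group layout blocks into paragraph data.

    `blocks` is a list of (kind, content) tuples assumed to come from a layout
    analysis, e.g. ("heading", "4.3 Smart alarm"), ("text", ...), ("picture", "fig1.png")."""
    paragraphs = []
    current = None
    for kind, content in blocks:
        if kind in ("heading", "text"):
            if kind == "heading" or current is None:
                current = ParagraphData(text="")      # a heading starts a new paragraph
                paragraphs.append(current)
            current.text = (current.text + " " + content).strip()
        elif kind == "picture" and current is not None:
            current.pictures.append(content)           # associate the picture with the paragraph above it
    return paragraphs

paras = parse_document([("heading", "Turning off the alarm"),
                        ("text", "Swipe up on the band screen and tap the side button."),
                        ("picture", "alarm_off.png")])
```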
In some embodiments, the electronic device segments the target document according to the descriptive content of its sections. For example, the electronic device identifies the layout structure of the target document based on a generic document parsing technique, identifying the titles, the subtitles, and the paragraphs and text under the subtitles in the target document.
For example, as shown in FIG. 7, the electronic device may identify in the bracelet user manual the title "Smart alarm", the subtitle "Turn off the alarm", the description of the operations for turning off the bracelet alarm under that subtitle, and the schematic picture corresponding to that description. For example, the electronic device first locates "4.3 Smart alarm", then determines paragraph A according to the heading style and the indentation format of the following text, and divides the manual into paragraph text A and paragraph text B shown in FIG. 7, where the picture below paragraph text A is the picture associated with paragraph text A.
After dividing the target document into a plurality of paragraphs, the electronic device extracts the keywords of each obtained paragraph. The keywords may be the title of the paragraph or the title of its chapter in the target document, or the subject words, entity words, etc. of the paragraph.
In some embodiments, the electronic device may extract titles in the target document as keywords. For example, as shown in FIG. 7, the electronic device recognizes the title "Smart alarm" and the subtitles "Turn off the alarm", "Delayed vibration", etc. in the bracelet user manual as keywords.
The electronic device may also extract subject words in the target document as keywords. For example, the electronic device may count the word frequencies in a paragraph text and extract the high-frequency words as keywords. As shown in FIG. 7, the high-frequency words in paragraph text A are "bracelet" and "turn off the alarm", and the high-frequency words in paragraph text B are "delayed vibration", "bracelet", etc., so the electronic device can use these words as keywords.
The electronic device may also extract entity words in the target document as keywords, recognizing the entity words in the paragraph text by means of entity recognition techniques. The entities include named entities such as person names, place names, company names, and organization names, as well as numeric-type entities such as amounts, dates, and ages. For example, as shown in FIG. 7, the entity word "bracelet" and the numeric values "3 times", "10 minutes", and the like can be extracted from paragraph text B.
The above examples are merely intended to explain the present application and should not be construed as limiting; the embodiments of the present application may also parse paragraphs from the target document, extract keywords from the paragraphs, and so on, in other manners, for example as sketched below.
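A small sketch that combines the three keyword sources discussed above (titles, high-frequency words, and simple entity-like patterns) follows; the stop-word list, the patterns, and the tokenization are illustrative stand-ins for a real keyword extractor and named-entity recognizer.

```python
from collections import Counter
import re

def extract_keywords(paragraph_text, titles, entity_patterns=None, top_n=5):
    """Combine titles, high-frequency words, and simple pattern-based entity
    matches into one keyword set (illustrative only)."""
    keywords = list(titles)                                   # titles / subtitles as keywords
    words = re.findall(r"\w+", paragraph_text.lower())
    stop = {"the", "a", "to", "and", "of", "on", "is", "by"}
    frequent = [w for w, _ in Counter(w for w in words if w not in stop).most_common(top_n)]
    keywords.extend(frequent)                                 # high-frequency subject words
    for pattern in (entity_patterns or [r"\d+\s*(?:times|minutes)"]):
        keywords.extend(re.findall(pattern, paragraph_text))  # numeric-type entities, e.g. "3 times"
    return list(dict.fromkeys(keywords))                      # de-duplicate while keeping order

kws = extract_keywords("Press the side button 3 times to delay the vibration by 10 minutes.",
                       titles=["Smart alarm", "Delayed vibration"])
```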
S203: The electronic device inputs the paragraph text, the picture associated with the paragraph text, and the keywords corresponding to the paragraph text into the multimodal question generation model to obtain a plurality of text questions and a plurality of picture questions corresponding to the paragraph text and its associated picture.
Specifically, referring to FIG. 8, the multimodal question generation model includes a single-modality coding layer, a cross-modal coding layer, and a decoding layer. The electronic device inputs the paragraph text and the keywords corresponding to the paragraph text into the text coding model 110 to obtain text features, and inputs the picture corresponding to the paragraph text into the picture coding model 210 to obtain picture features. The text features and the picture features are then input into the cross-modal coding model 310, which performs cross-modal fusion coding on the text and the picture to obtain the fusion features corresponding to the text and to the picture. The fusion features corresponding to the text reflect both the features of the text modality and the cross-modal relevance features between the text and the picture; correspondingly, the fusion features corresponding to the picture reflect both the features of the picture modality and the cross-modal relevance features between the picture and the text. Finally, the fusion features corresponding to the text and to the picture are input into their respective decoding models to obtain one or more corresponding text questions and one or more picture questions.
It should be noted that before inputting the paragraph text and the keywords into the multimodal question generation model, the electronic device needs to map the paragraph text into word-vector form. That is, the words in the paragraph text are mapped into word vectors through word embedding, and the word vectors are then input into the coding model one by one.
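As a small illustration of this word-embedding step (with a toy vocabulary and an untrained embedding table, both assumed purely for the example):

```python
import torch
import torch.nn as nn

# Mapping paragraph words to word vectors before encoding (illustrative stand-ins).
vocab = {"<unk>": 0, "bracelet": 1, "alarm": 2, "off": 3, "turn": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

words = "turn off the alarm".split()
ids = torch.tensor([[vocab.get(w, 0) for w in words]])   # word -> index ("the" falls back to <unk>)
word_vectors = embedding(ids)                            # shape (1, 4, 8), fed to the text coding model
```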
The following schematically illustrates the process of generating questions based on a target document, taking the case where the target document is the usage instructions of a bracelet as an example.
For example, the electronic device may obtain a plurality of paragraph texts, pictures associated with the plurality of paragraph texts, and keyword sets corresponding to the plurality of paragraph texts based on the target document. For the first paragraph text among the plurality of paragraphs, the electronic device inputs the first paragraph text, the picture associated with the first paragraph text and the keyword set corresponding to the first paragraph text into the multi-modal question generation model to obtain the questions corresponding to the first paragraph text. The other paragraph texts among the plurality of paragraphs are then input into the multi-modal question generation model in turn to obtain their corresponding questions.
For example, as can be seen from fig. 7, by parsing the usage instructions of the bracelet, the electronic device may obtain a paragraph text A, a paragraph text B, a picture C corresponding to the paragraph text A, a picture D corresponding to the paragraph text B, a keyword set E obtained from the paragraph text A, and a keyword set F obtained from the paragraph text B. The electronic device inputs the paragraph text A, the picture C corresponding to the paragraph text A and the keyword set E corresponding to the paragraph text A into the multi-modal question generation model, and then inputs the paragraph text B, the picture D corresponding to the paragraph text B and the keyword set F corresponding to the paragraph text B into the multi-modal question generation model.
Referring to fig. 8, the electronic device inputs the paragraph text A and the keyword set E corresponding to the paragraph text A into the text encoding model 110 to obtain the text feature representation X1 and the text feature representation X2, and inputs the picture C corresponding to the paragraph text A into the picture encoding model 210 to obtain the picture feature representation Y. Then, the text feature representation X1, the text feature representation X2 and the picture feature representation Y are input into the cross-modal encoding model 310, which outputs the fusion feature representation Z1 corresponding to the text and the fusion feature representation Z2 corresponding to the picture. The fusion feature representation Z1 corresponding to the text not only includes the text feature representation X2 of the text but also fuses the text feature representation X1 and the picture feature representation Y, and the fusion feature representation Z2 corresponding to the picture not only includes the picture feature representation Y of the picture but also fuses the text feature representation X1 and the text feature representation X2. In this way, the cross-modal encoding model 310 performs fusion encoding on the text features and the picture features, so that the fusion features not only contain but also fuse the features of both modalities, which improves the accuracy of the questions generated by decoding. In addition, the text features corresponding to the keywords are fused in the cross-modal encoding layer, so that the generated questions are constrained to the required range and are therefore more accurate.
Then, the fusion feature representation Z1 corresponding to the text and the fusion feature representation Z2 corresponding to the picture are input into the corresponding decoding layers to generate the corresponding text questions and picture questions. That is, the fusion feature representation Z1 corresponding to the text is input into the text decoding model 120 to generate a text question T1 and a text question T2; for example, the text question T1 may be "does the bracelet keep vibrating?" and the text question T2 may be "how to turn off the bracelet?". The fusion feature representation Z2 corresponding to the picture is input into the picture decoding model 220 to generate a picture question P1 and a picture question P2. It should be understood that the number of questions above is merely an example; in practical applications a plurality of questions, for example 3, 5 or 10, may be obtained, or a single question may be obtained, which is not limited in this application.
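For readers who prefer code, the encode, cross-modal fuse and decode flow of fig. 8 could be sketched roughly as below in a PyTorch-style pseudo-implementation; all layer types, dimensions and names are assumptions for illustration and do not reproduce the actual models 110, 210, 310, 120 and 220.

```python
import torch
import torch.nn as nn

# Minimal sketch of the encode -> cross-modal fuse -> decode flow of fig. 8.
# All layer choices and dimensions are assumptions for illustration only.
class MultimodalQuestionGenerator(nn.Module):
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab_size, dim)            # stands in for model 110
        self.image_encoder = nn.Linear(2048, dim)                     # stands in for model 210
        fusion_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.cross_modal_encoder = nn.TransformerEncoder(fusion_layer, num_layers=2)  # model 310
        self.text_decoder = nn.Linear(dim, vocab_size)                # stands in for model 120
        self.image_decoder = nn.Linear(dim, vocab_size)               # stands in for model 220

    def forward(self, text_ids, keyword_ids, image_feats):
        x1 = self.text_encoder(keyword_ids)        # keyword features (X1)
        x2 = self.text_encoder(text_ids)           # paragraph-text features (X2)
        y = self.image_encoder(image_feats)        # picture features (Y)
        fused = self.cross_modal_encoder(torch.cat([x1, x2, y], dim=1))
        z1 = fused[:, : x1.size(1) + x2.size(1)]   # text fusion features (Z1)
        z2 = fused[:, x1.size(1) + x2.size(1):]    # picture fusion features (Z2)
        text_question_logits = self.text_decoder(z1)
        picture_question_logits = self.image_decoder(z2)
        return text_question_logits, picture_question_logits
```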
In this embodiment of the present application, the paragraph text A and the paragraph text B may be referred to as a first paragraph text, the text feature representation X1 may be referred to as a second text feature representation, the text feature representation X2 may be referred to as a first text feature representation, the picture feature representation Y may be referred to as a first picture feature representation, the fusion feature representation Z1 corresponding to the text may be referred to as a first text fusion feature representation, and the fusion feature representation Z2 corresponding to the picture may be referred to as a first picture fusion feature representation.
S204, the electronic device obtains a plurality of pre-selected question-answer pairs based on the paragraph text, the picture associated with the paragraph text, and a plurality of picture questions and text questions generated based on the paragraph text and the picture associated with the paragraph text.
Specifically, the paragraph text of the target document and the picture corresponding to the paragraph text obtained in step S202 are used as multi-modal answers, the one or more text questions and one or more picture questions generated based on the paragraph text and the picture associated with the paragraph text are used as multi-modal questions, and the questions and the answers are combined to form pre-selected question-answer pairs, so as to obtain one or more pre-selected question-answer pairs.
Illustratively, as shown in fig. 9 (a), the paragraph text A, the picture C corresponding to the paragraph text A, the text question T2 and the picture question P2 form a pre-selected question-answer pair.
The pre-selected question-answer pair may also be, but is not limited to, a single-modal question and a multi-modal answer, i.e., a picture question together with the multi-modal answer, or a text question together with the multi-modal answer, may form a pre-selected question-answer pair. For example, as shown in fig. 9 (b), the text question T1, the paragraph text A and the picture C corresponding to the paragraph text A form a pre-selected question-answer pair. The picture question P1, the paragraph text A and the picture C corresponding to the paragraph text A form a pre-selected question-answer pair.
The pre-selected question-answer pair may also be a multi-modal question and a single-modal answer, i.e., the picture question and the text question together with a text answer may form a pre-selected question-answer pair, and the picture question and the text question together with a picture answer may form a pre-selected question-answer pair. For example, the text question T1, the picture question P1 and the paragraph text A form a pre-selected question-answer pair; the text question T1, the picture question P1 and the picture C form a pre-selected question-answer pair.
The pre-selected question-answer pair may also be a single-modal question and a single-modal answer, i.e., a text question and a text answer, a text question and a picture answer, a picture question and a text answer, or a picture question and a picture answer may each form a pre-selected question-answer pair. For example, as shown in fig. 9 (c), the paragraph text A and the text question T1 may form a pre-selected question-answer pair, the paragraph text A and the text question T2 may form a pre-selected question-answer pair, the picture C and the picture question P1 may form a pre-selected question-answer pair, and the picture C and the picture question P2 may form a pre-selected question-answer pair.
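The combinations above can be summarized with a small sketch; the variable names and the exact set of combinations enumerated here are illustrative assumptions only.

```python
from itertools import product

# Illustrative sketch of assembling pre-selected question-answer pairs from the
# questions and answers of one paragraph; all names are assumptions.
questions = {"text": ["T1", "T2"], "picture": ["P1", "P2"]}
answers = {"text": "paragraph_text_A", "picture": "picture_C"}

preselected_pairs = []

# Single-modal question + single-modal answer (fig. 9 (c)).
for q_mod, a_mod in product(questions, answers):
    for q in questions[q_mod]:
        preselected_pairs.append(({q_mod: q}, {a_mod: answers[a_mod]}))

# Single-modal question + multi-modal answer (fig. 9 (b)).
for q_mod in questions:
    for q in questions[q_mod]:
        preselected_pairs.append(({q_mod: q}, dict(answers)))

# Multi-modal question + multi-modal answer (fig. 9 (a)), one example pairing.
preselected_pairs.append(({"text": "T2", "picture": "P2"}, dict(answers)))

print(len(preselected_pairs))   # number of candidate pairs before quality filtering
```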
In this embodiment of the present application, the paragraph text a may be referred to as a first paragraph text, the picture C corresponding to the paragraph text a may be referred to as a picture associated with the first paragraph text, the text question T1 and the text question T2 may be referred to as text questions corresponding to the first paragraph text, and the picture question P2 may be referred to as picture questions corresponding to the first paragraph text.
S205, the electronic equipment calculates the similarity between the questions and the answers in the plurality of pre-selected question-answer pairs, and deletes the pre-selected question-answer pairs with the similarity smaller than the similarity threshold value to obtain a plurality of question-answer pairs.
Specifically, according to the above steps S201 to S204, a large number of pre-selected question-answer pairs can be obtained. A quality check is required for these pre-selected question-answer pairs, and the pre-selected question-answer pairs that do not conform to the rule are deleted. That is, the electronic device calculates the similarity between the question and the answer in each of the pre-selected question-answer pairs of a paragraph; if the similarity is smaller than the similarity threshold, the pre-selected question-answer pair is deleted, and if the similarity is greater than or equal to the similarity threshold, the pre-selected question-answer pair is retained. For example, the electronic device calculates the similarity between the picture question and the picture answer, the similarity between the picture question and the text answer, the similarity between the text question and the picture answer, and the similarity between the text question and the text answer, and then deletes the pre-selected question-answer pairs whose similarity is smaller than the similarity threshold.
In some implementations, for the questions in the pre-selected question-answer pairs corresponding to a paragraph, the electronic device inputs the picture question into the picture encoding model 210 and outputs a picture question feature representation, i.e., a picture question sequence vector, and inputs the text question into the text encoding model 110 and outputs a text question feature representation, i.e., a text question sequence vector. Then, the picture question feature representation and the text question feature representation are input into the cross-modal encoding model 310 for feature fusion encoding, and the picture question sequence vector and the text question sequence vector are mapped into the same fusion feature space to obtain the fusion feature representation of the picture question and the fusion feature representation of the text question, namely the fusion vector of the picture question and the fusion vector of the text question.
Accordingly, for the answers in the pre-selected question-answer pairs corresponding to a paragraph, the electronic device inputs the picture answer into the picture encoding model 210 and outputs a picture answer feature representation, i.e., a picture answer sequence vector, and inputs the text answer into the text encoding model 110 and outputs a text answer feature representation, i.e., a text answer sequence vector. The picture answer feature representation and the text answer feature representation are input into the cross-modal encoding model 310 for feature fusion encoding, and the picture answer sequence vector and the text answer sequence vector are mapped into the same fusion feature space to obtain the fusion feature representation of the picture answer and the fusion feature representation of the text answer, namely the fusion vector of the picture answer and the fusion vector of the text answer.
Then, the similarity between the question and the answer in the preselected question-answer pair is calculated respectively, namely, the similarity between the fusion vector of the picture question and the fusion vector of the picture answer, the similarity between the fusion vector of the picture question and the fusion vector of the text answer, the similarity between the fusion vector of the text question and the fusion vector of the picture answer, and the similarity between the fusion vector of the text question and the fusion vector of the text answer are calculated. If the similarity is smaller than the similarity threshold, deleting the pre-selected question-answer pair, and if the similarity is larger than or equal to the similarity threshold, reserving the pre-selected question-answer pair. Finally, for a plurality of paragraphs in the target document, according to the method, the pre-selected question-answer pairs which do not accord with the rule are removed, and the final one or more question-answer pairs corresponding to the target document can be obtained.
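A hedged sketch of this quality check is given below; encode_to_fusion_space() is a hypothetical stand-in for the encoding models 110/210 together with the cross-modal encoding model 310, and the threshold value is an assumption.

```python
import numpy as np

# Sketch of the quality-check step: encode questions and answers into a shared
# fusion space, compute cosine similarity, and drop low-scoring pairs.
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def filter_pairs(preselected_pairs, encode_to_fusion_space, threshold=0.8):
    kept = []
    for question, answer in preselected_pairs:
        q_vec = encode_to_fusion_space(question)
        a_vec = encode_to_fusion_space(answer)
        score = cosine(q_vec, a_vec)
        if score >= threshold:              # keep only sufficiently related pairs
            kept.append((question, answer, score))
    # Optionally rank by similarity and keep the top pairs, as in some embodiments.
    kept.sort(key=lambda item: item[2], reverse=True)
    return kept
```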
In some alternative embodiments, the electronic device may also rank the pre-selected question-answer pairs by similarity, e.g., sort them in descending order of similarity, and take the pre-selected question-answer pairs ranked at the top as the final question-answer pairs.
In the following, referring to fig. 10, the process of similarity calculation is specifically described by taking the paragraph text A and the picture C as the answer in a question-answer pair and taking the text question T1 and the picture question P1 as the question.
Referring to fig. 10, fig. 10 illustrates the process in which the electronic device calculates the similarity between the question and the answer in a question-answer pair. As shown in fig. 10, for the multi-modal answer, the electronic device inputs the paragraph text A into the text encoding model 110 and outputs a text answer feature representation x2; the electronic device inputs the picture C into the picture encoding model 210 and outputs a picture answer feature representation y1. When the electronic device inputs the paragraph text A into the text encoding model 110, it encodes the paragraph text A, in fact mapping it into a text vector space, and the obtained text answer feature representation x2 is a text answer sequence vector, i.e., a first text answer sequence vector. When the electronic device inputs the picture C into the picture encoding model 210, it encodes the picture C, in fact mapping it into a picture vector space, and the obtained picture answer feature representation y1 is a picture answer sequence vector, i.e., a first picture answer sequence vector.
Accordingly, for the multi-modal question, the electronic device inputs the text question T1 into the text encoding model 110 and outputs a text question feature representation x3; the picture question P1 is input into the picture encoding model 210 and a picture question feature representation y2 is output. When the electronic device inputs the text question T1 into the text encoding model 110, the text question T1 is in fact mapped into a text vector space, and the obtained text question feature representation x3 is a text question sequence vector, i.e., a first text question sequence vector. When the electronic device inputs the picture question P1 into the picture encoding model 210, it encodes the picture question P1, in fact mapping it into a picture vector space, and the obtained picture question feature representation y2 is a picture question sequence vector, i.e., a first picture question sequence vector.
It should be noted that, in natural language processing, each word in natural language is mapped by encoding into a vector of fixed length, and all the vectors put together form a vector space.
Illustratively, as shown in fig. 10, the text answer feature representation x2, the picture answer feature representation y1, the text question feature representation x3 and the picture question feature representation y2 correspond to two vector spaces, namely a text vector space and a picture vector space, so the questions and the answers need to be mapped into a common space for similarity calculation. As shown in fig. 10, the electronic device inputs the text answer feature representation x2 and the picture answer feature representation y1 into the cross-modal encoding model 310 for feature fusion encoding, and maps the picture answer sequence vector and the text answer sequence vector into the same fusion feature space to obtain a fusion feature representation z2 of the picture answer and a fusion feature representation z1 of the text answer, namely a fusion vector of the picture answer and a fusion vector of the text answer. Accordingly, the electronic device inputs the text question feature representation x3 and the picture question feature representation y2 into the cross-modal encoding model 310 for feature fusion encoding, and maps the picture question sequence vector and the text question sequence vector into the same fusion feature space to obtain a fusion feature representation z4 of the picture question and a fusion feature representation z3 of the text question, namely a fusion vector of the picture question and a fusion vector of the text question.
In this embodiment of the present application, the above-mentioned fusion vector of picture answers may be referred to as a second picture answer sequence vector, and the above-mentioned fusion vector of text answers may be referred to as a second text answer sequence vector. The fusion vector of the picture questions may be referred to as a second picture question sequence vector and the fusion vector of the text questions may be referred to as a second text question sequence vector.
The fusion vector of the picture question, the fusion vector of the text question, the fusion vector of the picture answer and the fusion vector of the text answer are four points in the fusion feature space. A distance is introduced into the fusion feature space, so that the (semantic) similarity between vectors can be judged according to the distance between them. For example, as shown in fig. 10, the similarity between the fusion vector z1 of the text answer and the fusion vector z3 of the text question may be calculated as s1, the similarity between the fusion vector z1 of the text answer and the fusion vector z4 of the picture question is s2, the similarity between the fusion vector z2 of the picture answer and the fusion vector z3 of the text question is s3, and the similarity between the fusion vector z2 of the picture answer and the fusion vector z4 of the picture question is s4.
In some alternative embodiments, the similarity between a question sequence vector and an answer sequence vector may be calculated as the cosine similarity between the two vectors. Cosine similarity uses the cosine value of the angle between two vectors in a vector space to measure the similarity between the two vectors.
Taking the similarity between the fusion vector z1 of the text answer and the fusion vector z3 of the text question as an example, the calculation formula may be as follows:
$$\cos\left(\vec{z}_1, \vec{z}_3\right) = \frac{\vec{z}_1 \cdot \vec{z}_3}{\lVert \vec{z}_1 \rVert\,\lVert \vec{z}_3 \rVert}$$

where $\cos$ is the cosine similarity, $\vec{z}_1$ is the fusion vector of the text answer, and $\vec{z}_3$ is the fusion vector of the text question.
In practical applications, there are various methods for calculating the similarity value between a question and an answer; for example, a semantic dictionary method, a part-of-speech word sequence combination method, a dependency tree method or an edit distance method may be used, which is not limited in this application.
After the similarity between the question and the answer in each pre-selected question-answer pair is calculated, whether the similarity is smaller than the similarity threshold is judged; the pre-selected question-answer pair is deleted when the similarity is smaller than the similarity threshold, and retained when the similarity is greater than or equal to the similarity threshold. The similarity threshold may be determined according to the practical application and may be, for example, 0.9, 0.8, etc., which is not limited in this application.
It should be noted that, for simplicity of description, the above method embodiments are all described as a series of combinations of actions. However, those skilled in the art should understand that the present invention is not limited by the order of the actions described; furthermore, the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the present invention.
Fig. 11 shows a hardware configuration diagram of the electronic device 100.
The electronic device 100 may be a cell phone, tablet, desktop, laptop, handheld, notebook, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook, as well as a cellular telephone, personal digital assistant (personal digital assistant, PDA), augmented reality (augmented reality, AR) device, virtual reality (virtual reality, VR) device, artificial intelligence (artificial intelligence, AI) device, wearable device, vehicle-mounted device, smart home device, and/or smart city device, with the specific types of such electronic devices not being particularly limited in the embodiments of the present application.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, a geomagnetic sensor 180N, and the like.
It should be understood that the illustrated structure of the embodiment of the present invention does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
In some embodiments, the processor 110 may be configured to parse the target document to obtain the paragraph text, the picture associated with the paragraph text, and the keyword corresponding to the paragraph text in the target document, and to obtain, through the multi-modal question generation model, a plurality of questions corresponding to the paragraph text and the picture associated with the paragraph text in the target document, where the questions, the paragraph text and the picture associated with the paragraph text form a plurality of question-answer pairs. For a description of how the multi-modal question generation model generates a large number of question-answer pairs, refer to the above, and details are not repeated herein.
The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data.
The charge management module 140 is configured to receive a charge input from a charger.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., as applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, demodulates and filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques. The wireless communication techniques may include the Global System for Mobile communications (global system for mobile communications, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), time division code division multiple access (time-division code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include a global satellite positioning system (global positioning system, GPS), a global navigation satellite system (global navigation satellite system, GLONASS), a beidou satellite navigation system (beidou navigation satellite system, BDS), a quasi zenith satellite system (quasi-zenith satellite system, QZSS) and/or a satellite based augmentation system (satellite based augmentation systems, SBAS).
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD). The display panel may also be manufactured using an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light emitting diode, AMOLED), a flexible light-emitting diode (flexible light-emitting diode, FLED), a Miniled, a MicroLed, a Micro-oLed, a quantum dot light-emitting diode (quantum dot light emitting diodes, QLED), or the like.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals.
Video codecs are used to compress or decompress digital video.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning.
The internal memory 121 may include one or more random access memories (random access memory, RAM) and one or more non-volatile memories (NVM).
In some embodiments, the internal memory 121 may contain a training database and a model database for storing the generated multi-modal question generation model. For descriptions of the training database and the model database, refer to the above, and details are not repeated here.
The internal memory 121 may be used to store the generated question-answer pairs.
The external memory interface 120 may be used to connect external non-volatile memory to enable expansion of the memory capabilities of the electronic device 100.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The speaker 170A, also referred to as a "horn," is used to convert audio electrical signals into sound signals. The receiver 170B, also referred to as an "earpiece", is used to convert the audio electrical signal into a sound signal. The microphone 170C, also referred to as a "mike" or "mic", is used to convert sound signals into electrical signals. The earphone interface 170D is used to connect a wired earphone.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal.
The gyro sensor 180B may be used to determine a motion gesture of the electronic device 100. In some embodiments, the angular velocity of the electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by the gyro sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance to be compensated by the lens module according to the angle, and makes the lens counteract the shake of the electronic device 100 through reverse motion, so as to realize anti-shake. The gyro sensor 180B may also be used for navigation and somatosensory game scenes.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude from barometric pressure values measured by barometric pressure sensor 180C, aiding in positioning and navigation.
The magnetic sensor 180D includes a hall sensor. The electronic device 100 may detect the opening and closing of the flip cover using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip according to the magnetic sensor 180D.
The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. The acceleration sensor 180E may also be used to recognize the posture of the electronic device, and is applied to applications such as landscape/portrait screen switching and pedometers.
A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector, such as a photodiode. The light-emitting diode may be an infrared light-emitting diode. The electronic device 100 emits infrared light outward through the light-emitting diode. The electronic device 100 detects infrared reflected light from nearby objects using the photodiode. The ambient light sensor 180L is used to sense the ambient light level. The fingerprint sensor 180H is used to collect a fingerprint. The temperature sensor 180J is used to detect temperature. The touch sensor 180K is also referred to as a "touch device". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, also called a "touch screen". The touch sensor 180K is used to detect a touch operation acting on or near it. The bone conduction sensor 180M may acquire a vibration signal.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The electronic device 100 may receive key inputs, generating key signal inputs related to user settings and function controls of the electronic device 100. The motor 191 may generate a vibration cue. The indicator 192 may be an indicator light, may be used to indicate a state of charge, a change in charge, a message indicating a missed call, a notification, etc. The SIM card interface 195 is used to connect a SIM card.
Fig. 12 shows a software structure schematic diagram of the electronic device 100 according to the embodiment of the present application.
As shown in fig. 12, the software structure of the electronic device 100 may include: a training database, a model training module, an analysis module, a question-answer pair generation module, a verification module and a model database. The electronic device 100 provided in the embodiments of the present application may be used to implement all the functions of question-answer pair generation.
The training database is used for storing pre-training data and multi-modal training data. The pre-training data may be documents obtained from the network through a web crawler, or may be manually written articles and the like. For example, a web crawler is used to obtain documents from Wikipedia, Baidu Baike, Baidu Zhidao, blogs, forums, post bars, etc. as pre-training data. The multi-modal training data comprises multi-modal question-answer pairs consisting of multi-modal questions and the multi-modal answers corresponding to them. For example, as shown in the bracelet specification in fig. 4, the paragraph text A and the keyword set E corresponding to the paragraph text A are taken as the answer, and "how to turn off the bracelet alarm clock" is taken as the question, forming a question-answer pair.
The model training module is used for acquiring the pre-training data from the training database, training the pre-training model by using the text pre-training data in the pre-training data to obtain a text pre-training model, and training the pre-training model by using the picture pre-training data in the pre-training data to obtain a picture pre-training model. For details regarding the pre-training process, see the relevant description in the embodiment of fig. 3, which are not repeated here. The model training module is also used for acquiring the multi-modal training data from the training database and performing fine-tuning training on the multi-modal pre-training model by using the multi-modal training data to obtain a trained multi-modal question generation model. The trained multi-modal question generation model is stored in the model database for subsequent use. For details regarding the fine-tuning training process, refer to the related description in the embodiment of fig. 4, which are not repeated here.
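As an illustrative sketch only, the fine-tuning step performed by the model training module could look roughly as follows; the model interface, data fields, optimizer and loss are assumptions and are not prescribed by the specification.

```python
import torch

# Rough sketch of the fine-tuning loop described above: the multi-modal
# pre-trained model is trained on (multi-modal answer, keywords, question)
# triples until the prediction error is low enough. It assumes the model
# returns one logit per target-question token (teacher forcing).
def fine_tune(model, train_loader, epochs=3, lr=1e-4):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for batch in train_loader:
            # batch: answer text ids, keyword ids, picture features, target question ids
            text_logits, _ = model(batch["answer_ids"], batch["keyword_ids"],
                                   batch["picture_feats"])
            loss = loss_fn(text_logits.flatten(0, 1), batch["question_ids"].flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```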
The model database is used for storing a pre-training model and a trained multi-modal problem generating model.
The analysis module is used for parsing the target document to obtain a plurality of paragraph texts in the target document, pictures associated with the paragraph texts and keyword sets corresponding to the paragraph texts. Specifically, the analysis module receives a target document input by a user and performs paragraph division on the target document to obtain a plurality of paragraph texts corresponding to the target document; it extracts a plurality of pictures from the target document and associates the plurality of pictures with the plurality of paragraph texts. The analysis module extracts keywords for each paragraph text in the plurality of paragraph texts to obtain the keywords corresponding to the plurality of paragraph texts. For the parsing of the target document, see the related description in the embodiment of fig. 5, which is not repeated here.
The question-answer pair generation module is used for acquiring the multi-modal question generation model from the model database and acquiring the plurality of paragraph texts, the pictures associated with the paragraph texts and the keyword sets corresponding to the paragraph texts obtained by the analysis module. The question-answer pair generation module inputs the paragraph texts, the pictures associated with the paragraph texts and the keyword sets corresponding to the paragraph texts into the multi-modal question generation model to obtain a plurality of picture questions and a plurality of text questions corresponding to the paragraph texts. The questions, the paragraph texts and the pictures corresponding to the paragraph texts form a plurality of pre-selected question-answer pairs.
The verification module is used for obtaining the plurality of pre-selected question-answer pairs generated by the question-answer pair generation module, mapping the questions and the answers in the pre-selected question-answer pairs into a fusion feature vector space, and calculating the similarity between the questions and the answers. The pre-selected question-answer pairs whose similarity between the question and the answer is smaller than the similarity threshold are deleted, and the pre-selected question-answer pairs whose similarity is greater than or equal to the similarity threshold are retained to obtain the final question-answer pairs.
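To tie the modules together, a hedged sketch of the overall flow is given below; parse_document(), build_preselected_pairs() and the model methods are hypothetical stand-ins for the analysis, generation and verification modules described above, not interfaces defined by the specification.

```python
# Hedged sketch of how the software modules in fig. 12 could be chained.
def generate_question_answer_pairs(target_document, model, threshold=0.8):
    # Analysis module: paragraphs, associated pictures, keyword sets.
    paragraphs = parse_document(target_document)
    preselected = []
    for para in paragraphs:
        # Question-answer pair generation module.
        text_qs, picture_qs = model.generate_questions(
            para.text, para.keywords, para.picture)
        preselected.extend(
            build_preselected_pairs(para, text_qs, picture_qs))
    # Verification module: similarity filtering in the fusion feature space.
    return filter_pairs(preselected, model.encode_to_fusion_space, threshold)
```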
As used in the above embodiments, the term "when …" may be interpreted to mean "if …" or "after …" or "in response to determination …" or "in response to detection …" depending on the context. Similarly, the phrase "at the time of determination …" or "if detected (a stated condition or event)" may be interpreted to mean "if determined …" or "in response to determination …" or "at the time of detection (a stated condition or event)" or "in response to detection (a stated condition or event)" depending on the context.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk), or the like.
Those of ordinary skill in the art will appreciate that implementing all or part of the above-described method embodiments may be accomplished by a computer program to instruct related hardware, the program may be stored in a computer readable storage medium, and the program may include the above-described method embodiments when executed. And the aforementioned storage medium includes: ROM or random access memory RAM, magnetic or optical disk, etc.

Claims (13)

1. A method of question-answer pair generation, the method comprising:
the electronic equipment acquires a target document;
the electronic equipment analyzes the target document to obtain data of a plurality of paragraphs of the target document, wherein the data comprise data of a first paragraph;
the electronic equipment inputs the data of the first paragraph into a multi-modal problem generating model to obtain a plurality of multi-modal problems corresponding to the first paragraph; the multi-modal questions corresponding to the first paragraph comprise text questions corresponding to the first paragraph and picture questions corresponding to the first paragraph;
the electronic device generates a plurality of pre-selected question-answer pairs corresponding to the first paragraph based on the data of the first paragraph and the plurality of multi-modal questions corresponding to the first paragraph.
2. The method of claim 1, wherein the data for the first paragraph comprises a first paragraph text, a picture associated with the first paragraph text, and a set of keywords corresponding to the first paragraph.
3. The method of claim 2, wherein the electronic device parses the target document to obtain data of a plurality of paragraphs of the target document, wherein the data includes data of a first paragraph, and specifically includes:
the electronic equipment performs paragraph division and picture extraction on the target document based on the structure of the target document to obtain a plurality of paragraph texts and a plurality of pictures corresponding to the target document;
the electronic equipment associates the plurality of pictures with the plurality of paragraph texts to obtain pictures corresponding to the first paragraph text;
and the electronic equipment extracts keywords based on the first paragraph text to obtain a keyword set corresponding to the first paragraph text.
4. A method according to any of claims 1-3, wherein the pre-selected question-answer pair corresponding to the first paragraph comprises an answer corresponding to the first paragraph comprising the first paragraph text and/or a picture associated with the first paragraph text and a question corresponding to the first paragraph comprising a picture question corresponding to the first paragraph and/or a text question corresponding to the first paragraph.
5. The method of any one of claims 1-4, wherein after the generating the plurality of pre-selected question-answer pairs corresponding to the first paragraph, the method further comprises:
the electronic equipment calculates the similarity of the questions and the answers in the plurality of pre-selected question-answer pairs corresponding to the first paragraph;
and the electronic equipment selects a preselected question-answer pair with the similarity meeting a preset similarity threshold from a plurality of preselected question-answer pairs corresponding to the first paragraph based on the similarity of the questions and the answers in the plurality of preselected question-answer pairs corresponding to the first paragraph, and takes the preselected question-answer pair with the similarity meeting the preset similarity threshold as the question-answer pair of the first paragraph.
6. The method of claim 5, wherein the electronic device calculates similarities between questions and answers in a plurality of pre-selected question-answer pairs corresponding to the first paragraph, specifically comprising:
the electronic equipment inputs the picture problem corresponding to the first paragraph into a picture coding model, outputs a first picture problem sequence vector, inputs the text problem corresponding to the first paragraph into a text coding model, and outputs a first text problem sequence vector;
the electronic equipment inputs the first picture problem sequence vector and the first text problem sequence vector into a cross-mode coding model to perform fusion coding so as to obtain a second picture problem sequence vector and a second text problem sequence vector;
The electronic equipment inputs the picture corresponding to the first paragraph into the picture coding model, outputs a first picture answer sequence vector, inputs the text corresponding to the first paragraph into the text coding model, and outputs a first text answer sequence vector;
the electronic equipment inputs the first picture answer sequence vector and the first text answer sequence vector into the cross-modal coding model to perform fusion coding so as to obtain a second picture answer sequence vector and a second text answer sequence vector;
and the electronic equipment calculates the similarity among the second picture question sequence vector, the second text question sequence vector, the second picture answer sequence vector and the second text answer sequence vector respectively.
7. The method of any of claims 1-6, wherein the multimodal problem generating model includes the text encoding model, the picture encoding model, the cross-modality encoding model, the text decoding model, and the picture decoding model;
the electronic device inputs the data of the first paragraph into a multi-modal problem generating model to obtain a plurality of multi-modal problems corresponding to the first paragraph, and the method specifically comprises the following steps:
The electronic equipment inputs the first paragraph text into the text coding model to obtain a first text feature representation, inputs a keyword set corresponding to the first paragraph text into the first text coding model to obtain a second text feature representation, and inputs a picture corresponding to the first paragraph text into the picture coding model to obtain a first picture feature representation;
the electronic equipment inputs the first text feature representation, the second text feature representation and the first picture feature representation into the cross-modal coding model to obtain a first text fusion feature representation and a first picture fusion feature representation; wherein the first text fusion feature representation comprises the first text feature representation, the second text feature representation, and the first picture feature representation, the first picture fusion feature representation comprising the first text feature representation, the second text feature representation, and the first picture feature representation;
and the electronic equipment inputs the first text fusion characteristic representation into the text decoding model to obtain a plurality of text questions corresponding to the first paragraph, and inputs the first picture fusion characteristic representation into the picture decoding model to obtain a plurality of picture questions corresponding to the first paragraph.
8. The method of any of claims 1-7, wherein before the electronic device inputs the data of the first paragraph into the multi-modal problem generating model, the method further comprises:
the electronic equipment acquires multi-modal training data, wherein the multi-modal training data comprises multi-modal answers, multi-modal questions and keyword sets corresponding to the multi-modal answers;
the electronic equipment inputs the multi-modal answers and keyword sets corresponding to the multi-modal answers into the multi-modal pre-training model, and outputs predicted multi-modal questions; the multi-modal pre-training model comprises a text pre-training model, a picture pre-training model and the first cross-modal coding model;
the electronic device determining a prediction error based on the predicted multimodal problem and the multimodal problem;
and the electronic equipment adjusts the multi-modal problem generation model based on the prediction error until the prediction error meets the training stop condition to obtain the multi-modal problem generation model.
9. The method of claim 8, wherein before the electronic device inputs the multimodal answer and the set of keywords corresponding to the multimodal answer into the multimodal pre-training model, the method further comprises:
The electronic equipment acquires pre-training data and the pre-training model, wherein the pre-training data comprises pre-training text data and pre-training picture data;
the electronic equipment uses the pre-training text data to pre-train the pre-training model to obtain a text pre-training model, and uses the pre-training picture data to pre-train the pre-training model to obtain a picture pre-training model.
10. An electronic device comprising one or more processors and one or more memories; wherein the one or more memories are coupled to the one or more processors, the one or more memories for storing computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-9.
11. A chip system for application to an electronic device, the chip system comprising one or more processors to invoke computer instructions to cause the electronic device to perform the method of any of claims 1-9.
12. A computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-9.
13. A computer readable storage medium comprising instructions which, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-9.
CN202111631090.0A 2021-12-28 2021-12-28 Question-answer pair generation method and electronic equipment Pending CN116415594A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111631090.0A CN116415594A (en) 2021-12-28 2021-12-28 Question-answer pair generation method and electronic equipment
PCT/CN2022/141700 WO2023125335A1 (en) 2021-12-28 2022-12-24 Question and answer pair generation method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111631090.0A CN116415594A (en) 2021-12-28 2021-12-28 Question-answer pair generation method and electronic equipment

Publications (1)

Publication Number Publication Date
CN116415594A true CN116415594A (en) 2023-07-11

Family

ID=86997853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111631090.0A Pending CN116415594A (en) 2021-12-28 2021-12-28 Question-answer pair generation method and electronic equipment

Country Status (2)

Country Link
CN (1) CN116415594A (en)
WO (1) WO2023125335A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033609A (en) * 2023-10-09 2023-11-10 腾讯科技(深圳)有限公司 Text visual question-answering method, device, computer equipment and storage medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843030B (en) * 2023-09-01 2024-01-19 浪潮电子信息产业股份有限公司 Causal image generation method, device and equipment based on pre-training language model
CN116932731B (en) * 2023-09-18 2024-01-30 上海帜讯信息技术股份有限公司 Multi-mode knowledge question-answering method and system for 5G message
CN117371404B (en) * 2023-12-08 2024-02-27 城云科技(中国)有限公司 Text question-answer data pair generation method and device
CN117609477B (en) * 2024-01-22 2024-05-07 亚信科技(中国)有限公司 Large model question-answering method and device based on domain knowledge
CN117609479B (en) * 2024-01-24 2024-05-03 腾讯科技(深圳)有限公司 Model processing method, device, equipment, medium and product
CN117669512B (en) * 2024-02-01 2024-05-14 腾讯科技(深圳)有限公司 Answer generation method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959559B (en) * 2018-06-29 2021-02-26 北京百度网讯科技有限公司 Question and answer pair generation method and device
CN110532369B (en) * 2019-09-04 2022-02-01 腾讯科技(深圳)有限公司 Question and answer pair generation method and device and server
US11481418B2 (en) * 2020-01-02 2022-10-25 International Business Machines Corporation Natural question generation via reinforcement learning based graph-to-sequence model
CN111858883A (en) * 2020-06-24 2020-10-30 北京百度网讯科技有限公司 Method and device for generating triple sample, electronic equipment and storage medium
CN111897934B (en) * 2020-07-28 2024-03-29 腾讯科技(深圳)有限公司 Question-answer pair generation method and device
CN112487139B (en) * 2020-11-27 2023-07-14 平安科技(深圳)有限公司 Text-based automatic question setting method and device and computer equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033609A (en) * 2023-10-09 2023-11-10 腾讯科技(深圳)有限公司 Text visual question-answering method, device, computer equipment and storage medium
CN117033609B (en) * 2023-10-09 2024-02-02 腾讯科技(深圳)有限公司 Text visual question-answering method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
WO2023125335A1 (en) 2023-07-06

Similar Documents

Publication Publication Date Title
CN110111787B (en) Semantic parsing method and server
CN116415594A (en) Question-answer pair generation method and electronic equipment
CN112926339B (en) Text similarity determination method, system, storage medium and electronic equipment
CN110209784B (en) Message interaction method, computer device and storage medium
CN113010740B (en) Word weight generation method, device, equipment and medium
CN112347795A (en) Machine translation quality evaluation method, device, equipment and medium
US11636852B2 (en) Human-computer interaction method and electronic device
CN111985240A (en) Training method of named entity recognition model, named entity recognition method and device
CN112269853B (en) Retrieval processing method, device and storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN111881315A (en) Image information input method, electronic device, and computer-readable storage medium
CN111611490A (en) Resource searching method, device, equipment and storage medium
CN113806473A (en) Intention recognition method and electronic equipment
CN112052333A (en) Text classification method and device, storage medium and electronic equipment
CN110555102A (en) media title recognition method, device and storage medium
CN113297843A (en) Reference resolution method and device and electronic equipment
CN112256868A (en) Zero-reference resolution method, method for training zero-reference resolution model and electronic equipment
CN111460231A (en) Electronic device, search method for electronic device, and medium
CN114691839A (en) Intention slot position identification method
CN112287070A (en) Method and device for determining upper and lower position relation of words, computer equipment and medium
CN112988984B (en) Feature acquisition method and device, computer equipment and storage medium
CN113032560B (en) Sentence classification model training method, sentence processing method and equipment
CN111222011B (en) Video vector determining method and device
CN111597823B (en) Method, device, equipment and storage medium for extracting center word
CN116861066A (en) Application recommendation method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination