CN117972044A - Visual question-answering method and platform based on knowledge enhancement - Google Patents


Info

Publication number
CN117972044A
Authority
CN
China
Prior art keywords
text
answer
knowledge
sample
input text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311868499.3A
Other languages
Chinese (zh)
Inventor
刘静 (Liu Jing)
汪群博 (Wang Qunbo)
郭龙腾 (Guo Longteng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202311868499.3A priority Critical patent/CN117972044A/en
Publication of CN117972044A publication Critical patent/CN117972044A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a visual question-answering method and platform based on knowledge enhancement, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring an image and a question text input by a user and processing them to obtain an input text; inputting the input text to a retriever of the visual question-answering model to obtain a plurality of pieces of relevant external knowledge; splicing each piece of relevant external knowledge with the input text to obtain a plurality of spliced input texts; inputting the spliced input texts into an answer generator of the visual question-answering model to obtain an answer text corresponding to each spliced input text; inputting the input text itself into the answer generator to obtain an answer text corresponding to the input text; and determining a final target answer text corresponding to the input text. The visual question-answering model is obtained by jointly training an initial answer generator and an initial retriever of an initial visual question-answering model. The invention can reasonably utilize both the retrieved external knowledge and the internal knowledge implicit in the visual question-answering model.

Description

Visual question-answering method and platform based on knowledge enhancement
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a visual question-answering method and platform based on knowledge enhancement.
Background
With the rapid development of internet technology, visual question answering can provide users with an image-oriented question-and-answer service: a user inputs an image and poses a question about it to be answered. Visual question-answering services offer users a multimodal interactive experience and have therefore gained widespread attention from researchers. In recent years, more and more work has focused on how to combine retrieval methods with pre-trained models. At present, however, traditional knowledge enhancement methods for pre-trained models cannot reasonably utilize both the retrieved external knowledge and the internal knowledge implicit in the model, which affects the accuracy of visual question answering.
Disclosure of Invention
The invention provides a knowledge-enhancement-based visual question-answering method and platform, which address the defect that prior-art knowledge enhancement methods for pre-trained models cannot reasonably utilize both the retrieved external knowledge and the internal knowledge implicit in the model, a defect that affects the accuracy of visual question answering.
In a first aspect, the present invention provides a visual question-answering method based on knowledge enhancement, which is applied to a visual question-answering platform based on knowledge enhancement, and includes:
Acquiring an image and a question text input by a user, and processing the image and the question text to obtain an input text;
inputting the input text to a retriever of a pre-constructed visual question-answer model to obtain a plurality of relevant external knowledge corresponding to the input text output by the retriever;
Splicing the multiple related external knowledge with the input text respectively to obtain multiple spliced input texts, inputting the multiple spliced input texts into an answer generator of the visual question-answer model to obtain an answer text corresponding to each spliced input text output by the answer generator, and inputting the input texts into the answer generator to obtain an answer text corresponding to the input text output by the answer generator;
Determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text;
The visual question-answering model is obtained by taking a sample input text as a training sample, taking an answer text label corresponding to the sample input text as a sample label, and performing combined training on an initial answer generator and an initial retriever of the initial visual question-answering model.
In some embodiments, the retriever includes a knowledge encoding layer, a query encoding layer, and a matching layer;
correspondingly, the inputting of the input text to the retriever of the pre-constructed visual question-answering model to obtain a plurality of pieces of relevant external knowledge corresponding to the input text output by the retriever comprises the following steps:
Inputting each knowledge in the pre-constructed external knowledge base into the knowledge coding layer in advance to obtain a feature vector of each knowledge output by the knowledge coding layer;
Inputting the input text to the query coding layer to obtain a feature vector of the input text output by the query coding layer;
And inputting the feature vector of the input text and the feature vector of each knowledge to the matching layer to obtain a plurality of relevant external knowledge corresponding to the input text output by the matching layer.
In some embodiments, the determining of the external knowledge base includes:
Acquiring a visual question-answer open source data set, and splicing a question text and an answer text of each data sample in the data set to obtain a query text corresponding to each data sample;
Inputting each query text into a search engine to obtain a plurality of query results corresponding to each query text;
Calculating the relevance between each query result and each query text, and screening the plurality of query results to obtain screened query results;
text extraction is carried out on the screened query result to obtain a plurality of knowledge text fragments, and the knowledge text fragments are processed to obtain processed knowledge text fragments;
And obtaining the external knowledge base based on the processed knowledge text segment.
In some embodiments, the determining of a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text includes:
calculating the uncertainty degree of the answer text corresponding to each spliced input text and the answer text corresponding to the input text based on the visual question-answering model;
and determining a final target answer text corresponding to the input text from each answer text according to the uncertainty degree of each answer text.
In some embodiments, the processing the image and the question text to obtain the input text includes:
processing the image to obtain an image context corresponding to the image, wherein the image context comprises an image description text, an identification text, object information and attribute information;
And splicing the image context and the problem text to obtain the input text.
In some embodiments, the determining of the visual question-answering model includes:
Acquiring a sample image and a sample question text, processing the sample image and the sample question text to obtain a sample input text, and determining an answer text label corresponding to the sample input text;
Inputting the sample input text to an initial retriever of the initial visual question-answering model to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial retriever;
Splicing the plurality of sample related external knowledge with the sample input text respectively to obtain a plurality of spliced sample input texts, inputting the plurality of spliced sample input texts to an initial answer generator of the initial visual question-answer model to obtain an answer text prediction result corresponding to each spliced sample input text output by the initial answer generator, inputting the sample input text to the initial answer generator to obtain an answer text prediction result corresponding to the sample input text output by the initial answer generator;
Determining a final target answer text corresponding to the sample input text from answer text prediction results corresponding to each spliced sample input text and answer text prediction results corresponding to the sample input text;
calculating an optimization objective function value based on the answer text prediction result corresponding to each spliced sample input text, the answer text prediction result corresponding to the sample input text and the answer text label corresponding to the sample input text;
Training the initial visual question-answering model based on the optimized objective function value, and carrying out parameter optimization iteration on the initial visual question-answering model to obtain the visual question-answering model;
wherein, the determining process of the retriever comprises:
comparing the answer text prediction results corresponding to the plurality of spliced sample input texts with the answer text prediction results corresponding to the sample input texts based on the answer text labels corresponding to the sample input texts to obtain comparison results;
And based on the comparison result, obtaining a supervision training signal, training the initial retriever, and obtaining the retriever after the initial retriever is trained.
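The comparison-based supervision described above can be sketched as follows. This is an illustrative reading of the scheme, not the patent's exact rule, and all names are hypothetical: a retrieved knowledge piece whose spliced prediction matches the answer label while the knowledge-free prediction does not is treated as a positive training signal for the retriever, and the reverse case as a negative signal.

```python
def derive_retrieval_signals(preds_with_knowledge, pred_without, label):
    """Compare each knowledge-augmented prediction against the answer
    label and the knowledge-free prediction to derive weak supervision
    signals for the retriever (illustrative sketch)."""
    positives, negatives = [], []
    for idx, pred in enumerate(preds_with_knowledge):
        if pred == label and pred_without != label:
            positives.append(idx)   # the knowledge enabled a correct answer
        elif pred != label and pred_without == label:
            negatives.append(idx)   # the knowledge broke a correct answer
        # otherwise the comparison yields no signal in this sketch
    return positives, negatives
```

Under this sketch, knowledge that neither helps nor hurts contributes no supervision, which matches the patent's aim of not forcing the generator to depend on retrieved knowledge.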
In some embodiments, the initial retriever comprises a knowledge coding layer, an initial query coding layer, and an initial matching layer;
correspondingly, the step of inputting the sample input text to the initial retriever of the initial visual question-answer model to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial retriever comprises the following steps:
Inputting each knowledge in the pre-constructed external knowledge base into the knowledge coding layer in advance to obtain a feature vector of each knowledge output by the knowledge coding layer;
inputting the sample input text to the initial query coding layer to obtain a feature vector of the sample input text output by the initial query coding layer;
And inputting the feature vector of the sample input text and the feature vector of each knowledge to the initial matching layer to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial matching layer.
In some embodiments, the determining of the final target answer text corresponding to the sample input text from the answer text prediction result corresponding to each spliced sample input text and the answer text prediction result corresponding to the sample input text includes:
Calculating an answer text prediction result corresponding to each spliced sample input text and the uncertainty degree of the answer text prediction result corresponding to the sample input text based on the initial visual question-answer model;
and determining a final target answer text corresponding to the sample input text from each answer text prediction result according to the uncertainty degree of each answer text prediction result.
In some embodiments, the optimized objective function value is calculated as:
P(A | x; θ, φ) = Σ_{s_j ∈ R_P} P_θ(a_j | x, s_j) · P_φ(s_j | x) − Σ_{s_j ∈ R_N} P_θ(a_j | x, s_j) · P_φ(s_j | x)
wherein P represents the optimized objective function value, x represents the sample input text, A represents the answer text label set corresponding to the sample input text, θ represents the parameters of the initial answer generator, φ represents the parameters of the initial retriever, s_j represents the j-th piece of sample-related external knowledge, a_j represents the answer text label corresponding to the j-th piece of sample-related external knowledge in the answer text label set, R_P represents the set of sample-related external knowledge corresponding to positive supervision training signals, and R_N represents the set of sample-related external knowledge corresponding to negative supervision training signals.
In a second aspect, an embodiment of the present invention further provides a visual question-answering platform based on knowledge enhancement, including:
The acquisition module is used for acquiring an image and a problem text input by a user, and processing the image and the problem text to obtain an input text;
the retrieval module is used for inputting the input text to a retriever of a pre-constructed visual question-answer model to obtain a plurality of relevant external knowledge corresponding to the input text output by the retriever;
The answer generation module is used for respectively splicing the plurality of related external knowledge with the input text to obtain a plurality of spliced input texts, inputting the plurality of spliced input texts into an answer generator of the visual question-answer model to obtain an answer text corresponding to each spliced input text output by the answer generator, and inputting the input text into the answer generator to obtain an answer text corresponding to the input text output by the answer generator;
The determining module is used for determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text;
The visual question-answering model is obtained by taking a sample input text as a training sample, taking an answer text label corresponding to the sample input text as a sample label, and performing combined training on an initial answer generator and an initial retriever of the initial visual question-answering model.
According to the visual question-answering method and platform based on knowledge enhancement, an image and a question text input by a user are acquired and processed to obtain an input text; the input text is input to a retriever of a pre-constructed visual question-answering model to obtain a plurality of pieces of relevant external knowledge; the pieces of relevant external knowledge are respectively spliced with the input text to obtain a plurality of spliced input texts; the spliced input texts and the input text are respectively input to an answer generator of the visual question-answering model to obtain a plurality of answer texts; and a target answer text is determined. The visual question-answering model is obtained by jointly training an initial retriever and an initial answer generator based on training samples and sample labels.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a knowledge-based enhanced visual question-answering method provided by an embodiment of the present invention;
FIG. 2 is a flow chart of a determination process of a visual question-answering model provided by an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a training method of a visual question-answering model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a knowledge-based enhanced visual question-answering model provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a visual question-answering platform based on knowledge enhancement provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "first," "second," and the like in the description of the present invention, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the invention are capable of operation in sequences other than those illustrated or otherwise described herein, and that the "first" and "second" distinguishing between objects generally are not limited in number to the extent that the first object may, for example, be one or more.
In recent years, pre-training techniques have developed rapidly, and visual question-answering services based on pre-trained models can generate higher-quality answers. In addition, to enhance the implicit knowledge contained in pre-trained models, some researchers have designed knowledge-based pre-training tasks to better integrate common-sense and factual knowledge into the parameters of the model. While these methods can enhance the model's learning of knowledge entities, the model must be retrained whenever newly emerging knowledge needs to be handled. Recently, large-scale pre-trained models such as Generative Pre-trained Transformer 3 (GPT-3) have been shown to learn massive amounts of knowledge into their parameters through self-supervised learning on massive data, and such models can achieve good results on knowledge question answering and be applied to visual question-answering tasks. In practice, however, large-scale pre-trained models suffer from the hallucination problem, i.e. producing plausible but factually incorrect predictions, and related studies have found that this problem cannot be solved merely by increasing model size. In addition, memorizing knowledge in parameters in this way cannot cope with changes to, or additions of, knowledge.
In practical applications, visual question answering is usually open-ended: for some questions, the answer is difficult to obtain from the information in the image alone and requires the help of relevant external knowledge. Model enhancement methods based on knowledge retrieval have therefore been studied and used in visual question-answering models, i.e., a retriever obtains relevant external knowledge fragments that are fed to the model to help it predict. Because supervisory signals for retrieved knowledge cannot be obtained directly, current retrieval-based methods generally train the retriever with weak supervisory signals: a retrieved knowledge piece receives a positive signal if it contains the target answer. Such methods require the output to depend on the retrieved external knowledge even when that knowledge is erroneous, which greatly hinders their practical application. Moreover, if none of the retrieved relevant knowledge can contribute positively to a correct prediction, adding it will instead interfere with the model's prediction, while some questions can be answered correctly using the model's internal implicit knowledge.
Therefore, the embodiment of the invention provides a visual question-answering method and platform based on knowledge enhancement. An image and a question text input by a user are acquired and processed to obtain an input text; the input text is input to a retriever of a pre-constructed visual question-answering model to obtain a plurality of pieces of relevant external knowledge; the pieces of relevant external knowledge are respectively spliced with the input text to obtain a plurality of spliced input texts; the spliced input texts and the input text are respectively input to an answer generator of the visual question-answering model to obtain a plurality of answer texts; and a target answer text is determined. The visual question-answering model is obtained by jointly training an initial retriever and an initial answer generator based on training samples and sample labels. The invention can reasonably utilize both the retrieved external knowledge and the internal knowledge implicit in the visual question-answering model, thereby improving the accuracy of visual question answering.
Fig. 1 is a schematic flow chart of a visual question-answering method based on knowledge enhancement according to an embodiment of the present invention. As shown in fig. 1, a visual question-answering method based on knowledge enhancement is provided, which is applied to a visual question-answering platform based on knowledge enhancement, and comprises the following steps: step 110, step 120, step 130 and step 140. The method flow steps are only one possible implementation of the invention.
Step 110, acquiring an image and a question text input by a user, and processing the image and the question text to obtain an input text.
Wherein the image correlates with the question text; e.g., if a train is shown in the image, the question text may be "who invented this for the first time".
Optionally, the image uploaded by the user may be directly received, or a website link of the image provided by the user may be obtained, and the corresponding image may be obtained according to the website link of the image.
Alternatively, the question text input by the user may be received directly, or the voice data input by the user may be received, and the voice data may be identified to obtain the question text.
Optionally, the image and the text data of the questions input by the user can be collected through a terminal device, and the terminal device can be a mobile phone, a personal computer, a tablet computer and the like.
Alternatively, computer vision techniques may be used to extract features from the image and natural language processing techniques may be used to process the question text.
It should be noted that, for the visual question-answering task, the image and the question text input by the user can be further converted into the input text which can be understood by the machine by processing the image and the question text.
In some embodiments, step 110 processes the image and the question text to obtain input text, including:
step 111, processing the image to obtain an image context corresponding to the image, wherein the image context comprises an image description text, an identification text, object information and attribute information;
and step 112, splicing the image context and the problem text to obtain an input text.
Alternatively, the image may be input to the image description model, so as to obtain an image description text corresponding to the image output by the image description model.
Alternatively, the image may be input to the target recognition model, so as to obtain object information and attribute information corresponding to the image output by the target recognition model.
Alternatively, the image may be input to an optical character recognition (Optical Character Recognition, OCR) model, resulting in recognition text corresponding to the image output by the OCR model.
For example, the image context may be represented as c_i = (caption_i, object_i, ocr_i), where i denotes the i-th user input, caption_i represents the image description text corresponding to the i-th image, object_i represents the object information and attribute information corresponding to the i-th image, and ocr_i represents the recognition text corresponding to the i-th image.
For example, for the i-th user input x_i, the corresponding image context c_i and question text q_i are spliced to obtain the input text (c_i, q_i).
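The splicing of image context and question text described above can be sketched as follows; this is a minimal illustration, and the function name, separator tokens, and field layout are assumptions rather than anything fixed by the patent:

```python
def build_input_text(caption, objects, ocr, question):
    """Concatenate the image context c_i = (caption_i, object_i, ocr_i)
    with the question q_i to form the model input text (illustrative
    sketch; empty context fields are simply skipped)."""
    context = " ".join(part for part in (caption, objects, ocr) if part)
    return f"context: {context} question: {question}"
```

For instance, with a caption, detected object attributes, no OCR text, and the question from the example above, the call produces a single flat string suitable for a text-to-text answer generator.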
And 120, inputting the input text into a retriever of a pre-constructed visual question-answer model to obtain a plurality of relevant external knowledge corresponding to the input text output by the retriever.
It can be appreciated that inputting the input text to the retriever of the pre-constructed visual question-answering model and obtaining the relevant external knowledge output by the retriever can provide more context information and background knowledge to the visual question-answering model, thereby helping the visual question-answering model to more accurately understand and answer questions.
In some embodiments, the retriever includes a knowledge encoding layer, a query encoding layer, and a matching layer;
Correspondingly, step 120 inputs the input text to a retriever of the pre-constructed visual question-answer model, and obtains a plurality of relevant external knowledge corresponding to the input text output by the retriever, including:
Step 121, inputting each knowledge in the pre-constructed external knowledge base into a knowledge coding layer in advance to obtain a feature vector of each knowledge output by the knowledge coding layer;
step 122, inputting the input text to the query coding layer to obtain the feature vector of the input text output by the query coding layer;
and step 123, inputting the feature vector of the input text and the feature vector of each knowledge to a matching layer to obtain a plurality of related external knowledge corresponding to the input text output by the matching layer.
It should be noted that, because the knowledge coding layer is kept fixed, all knowledge in the external knowledge base can be encoded once in advance to obtain the feature vector of each knowledge entry, and these vectors can be stored and indexed in an open-source index engine, reducing computation cost. With the input text as the query, the input text can be fed to the query coding layer in real time, which dynamically generates the feature vector of the input text (i.e., the feature vector of the query).
Alternatively, the semantic similarity between the feature vector of the input text and the feature vector of each knowledge can be calculated based on the matching layer in an inner product manner, and a plurality of relevant external knowledge corresponding to the input text can be determined according to the semantic similarity.
For example, the corresponding plurality of external knowledge may be ranked in order of high-to-low semantic similarity, and the top k-1 relevant external knowledge may be selected.
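The inner-product matching described above can be sketched with a brute-force search; this is illustrative only (a deployed system would query the pre-built index rather than scan all vectors), and the function name is an assumption:

```python
import numpy as np

def retrieve(query_vec, knowledge_vecs, k):
    """Matching-layer sketch: semantic similarity is the inner product
    between the query feature vector and each pre-encoded knowledge
    feature vector; indices of the k most similar entries are returned
    in descending order of similarity."""
    scores = knowledge_vecs @ query_vec      # one inner product per knowledge entry
    top = np.argsort(-scores)[:k]            # highest similarity first
    return top.tolist()
```

Selecting the top k-1 entries here, plus the knowledge-free input itself, yields the k generation candidates used later in the method.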
In some embodiments, the determination of the external knowledge base includes:
acquiring a visual question-answer open source data set, and splicing a question text and an answer text of each data sample in the data set to obtain a query text corresponding to each data sample;
inputting each query text into a search engine to obtain a plurality of query results corresponding to each query text;
Calculating the relevance between each query result and each query text, and screening a plurality of query results to obtain screened query results;
text extraction is carried out on the screened query results to obtain a plurality of knowledge text fragments, and the knowledge text fragments are processed to obtain processed knowledge text fragments;
and obtaining an external knowledge base based on the processed knowledge text fragments.
Optionally, the query results may be ranked by relevance from high to low, and the query results whose relevance exceeds a preset threshold may be retained.
Alternatively, the filtered query results may be the top 10 query results that are most relevant.
Optionally, a plurality of knowledge text segments may be screened, incomplete and repeated knowledge text segments may be deleted, so as to obtain a screened knowledge text segment, and knowledge identifiers may be allocated to the screened knowledge text segment, so as to obtain a processed knowledge text segment.
Optionally, the processed knowledge text segments are stored in an external knowledge base.
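The fragment post-processing described above (deleting incomplete and repeated knowledge text fragments, then assigning knowledge identifiers) could look roughly like the sketch below; the `min_len` completeness threshold and the identifier format are assumptions, not values given by the patent:

```python
def clean_fragments(fragments, min_len=20):
    """Drop incomplete (here: too short) and duplicate knowledge text
    fragments, then assign each surviving fragment a knowledge
    identifier (illustrative sketch)."""
    seen, cleaned = set(), []
    for frag in fragments:
        frag = frag.strip()
        if len(frag) < min_len or frag in seen:
            continue                        # incomplete or repeated fragment
        seen.add(frag)
        cleaned.append({"id": f"k{len(cleaned)}", "text": frag})
    return cleaned
```

The resulting records, each pairing an identifier with a fragment, are what would be encoded by the knowledge coding layer and stored in the external knowledge base.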
Step 130, respectively splicing a plurality of related external knowledge with the input text to obtain a plurality of spliced input texts, inputting the plurality of spliced input texts into an answer generator of the visual question-answer model to obtain an answer text corresponding to each spliced input text output by the answer generator, and inputting the input text into the answer generator to obtain an answer text corresponding to the input text output by the answer generator;
It can be understood that the multiple relevant external knowledge is spliced with the input text respectively to obtain multiple spliced input texts, and the multiple spliced input texts and the multiple input texts are input to the answer generator of the visual question-answer model respectively, so that the visual question-answer model can be helped to generate multiple candidate answer texts.
And 140, determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text.
The visual question-answering model is obtained by taking a sample input text as a training sample, taking an answer text label corresponding to the sample input text as a sample label and carrying out combined training on an initial answer generator and an initial retriever of the initial visual question-answering model.
In some embodiments, determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text includes:
Based on the visual question-answering model, calculating an answer text corresponding to each spliced input text and an uncertainty degree of the answer text corresponding to the input text;
And determining a final target answer text corresponding to the input text from the answer texts according to the uncertainty degree of the answer texts.
Optionally, the answer text with the lowest uncertainty is selected from the answer texts as the final target answer text.
For example, k-1 pieces of relevant external knowledge are each spliced with the input text to obtain k-1 spliced input texts; the 1st answer text is obtained based on the 1st spliced input text, the 2nd answer text based on the 2nd spliced input text, ..., the (k-1)-th answer text based on the (k-1)-th spliced input text, and the k-th answer text is obtained based on the internal knowledge implied by the visual question-answer model; the uncertainty of the k prediction results is calculated, and the answer text with the minimum uncertainty is selected as the final target answer text.
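The candidate generation and uncertainty-based selection in this example can be sketched as follows; the answer generator is abstracted as a callable returning an (answer, loss) pair, with the loss standing in for the model's uncertainty about that prediction. This interface is an assumption made for illustration.

```python
from typing import Callable

def answer_with_knowledge(input_text: str,
                          knowledge: list[str],
                          generator: Callable[[str], tuple[str, float]]) -> str:
    # Splice each of the k-1 retrieved knowledge pieces with the input
    # text, and also keep the bare input text (internal knowledge only).
    inputs = [f"{s} {input_text}" for s in knowledge] + [input_text]
    # One candidate per input: (answer_text, loss) pairs.
    candidates = [generator(t) for t in inputs]
    # The candidate with the smallest loss (least uncertainty) wins.
    best_answer, _ = min(candidates, key=lambda c: c[1])
    return best_answer
```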
In the embodiment of the invention, an image and a question text input by a user are acquired and processed to obtain an input text; the input text is input to the retriever of a pre-constructed visual question-answer model to obtain multiple pieces of relevant external knowledge; the relevant external knowledge is spliced with the input text to obtain multiple spliced input texts; the spliced input texts and the input text are input to the answer generator of the visual question-answer model to obtain multiple answer texts, from which the target answer text is determined; the visual question-answer model is obtained by jointly training the initial retriever and the initial answer generator based on training samples and sample labels.
Fig. 2 is a flow chart illustrating a determination process of a visual question-answer model according to an embodiment of the present invention. As shown in fig. 2, the determination process of the visual question-answer model includes the following steps: step 210, step 220, step 230, step 240, step 250 and step 260.
Step 210, a sample image and a sample question text are obtained, the sample image and the sample question text are processed to obtain a sample input text, and an answer text label corresponding to the sample input text is determined.
Optionally, the sample image is processed to obtain a sample image context corresponding to the sample image, wherein the sample image context comprises sample image description text, sample identification text, sample object information and sample attribute information.
Optionally, the sample image context and the sample question text are spliced to obtain a sample input text.
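A minimal sketch of building the sample input text by splicing the image context with the question text; the field labels and their ordering below are illustrative assumptions, since the method specifies only which components the context contains.

```python
def build_input_text(caption: str, ocr_text: str,
                     objects: list[str], attributes: list[str],
                     question: str) -> str:
    # The image context gathers the image description text, recognized
    # text, object information, and attribute information.
    context = (f"caption: {caption} ocr: {ocr_text} "
               f"objects: {', '.join(objects)} "
               f"attributes: {', '.join(attributes)}")
    # Splice the image context with the question text to form the input text.
    return f"{context} question: {question}"
```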
Step 220, inputting the sample input text to an initial retriever of the initial visual question-answering model to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial retriever.
In some embodiments, the initial retriever comprises a knowledge encoding layer, an initial query encoding layer, and an initial matching layer;
Correspondingly, the sample input text is input to an initial retriever of the initial visual question-answer model, a plurality of sample related external knowledge corresponding to the sample input text output by the initial retriever is obtained, and the method comprises the following steps:
Inputting each knowledge in the pre-constructed external knowledge base into a knowledge coding layer in advance to obtain a feature vector of each knowledge output by the knowledge coding layer;
Inputting the sample input text to an initial query coding layer to obtain a feature vector of the sample input text output by the initial query coding layer;
and inputting the feature vector of the sample input text and the feature vector of each knowledge to an initial matching layer to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial matching layer.
Optionally, the determining of the external knowledge base includes:
acquiring a visual question-answer open source data set, and splicing a question text and an answer text of each data sample in the data set to obtain a query text corresponding to each data sample;
inputting each query text into a search engine to obtain a plurality of query results corresponding to each query text;
Calculating the relevance between each query result and each query text, and screening a plurality of query results to obtain screened query results;
text extraction is carried out on the screened query results to obtain a plurality of knowledge text fragments, and the knowledge text fragments are processed to obtain processed knowledge text fragments;
and obtaining an external knowledge base based on the processed knowledge text fragments.
It should be noted that the knowledge coding layer remains fixed, so all knowledge in the external knowledge base can be encoded once in advance, and only the initial query coding layer is trained, which reduces the training cost.
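Because the knowledge coding layer is frozen, retrieval reduces to scoring the query vector against a pre-computed index of knowledge vectors. The sketch below assumes plain dot-product matching; the actual encoders and scoring function of the model are not prescribed here.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec: list[float],
             knowledge_vecs: dict[str, list[float]],
             k_minus_1: int) -> list[str]:
    # knowledge_vecs maps a knowledge identifier to its feature vector,
    # encoded once in advance by the frozen knowledge coding layer.
    scored = sorted(knowledge_vecs.items(),
                    key=lambda kv: dot(query_vec, kv[1]), reverse=True)
    # Return the identifiers of the k-1 most relevant knowledge pieces.
    return [kid for kid, _ in scored[:k_minus_1]]
```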
Step 230, respectively splicing the plurality of sample related external knowledge with the sample input text to obtain a plurality of spliced sample input texts, inputting the plurality of spliced sample input texts to an initial answer generator of the initial visual question-answer model to obtain an answer text prediction result corresponding to each spliced sample input text output by the initial answer generator, inputting the sample input text to the initial answer generator to obtain an answer text prediction result corresponding to the sample input text output by the initial answer generator.
Step 240, determining a final target answer text corresponding to the sample input text from the answer text prediction result corresponding to each spliced sample input text and the answer text prediction result corresponding to the sample input text.
In some embodiments, determining a final target answer text corresponding to the sample input text from the answer text prediction result corresponding to each spliced sample input text and the answer text prediction result corresponding to the sample input text includes:
based on the initial visual question-answering model, calculating an answer text prediction result corresponding to each spliced sample input text and an uncertainty degree of the answer text prediction result corresponding to the sample input text;
and determining a final target answer text corresponding to the sample input text from each answer text prediction result according to the uncertainty degree of each answer text prediction result.
Step 250, calculating an optimization objective function value based on the answer text prediction result corresponding to each spliced sample input text, the answer text prediction result corresponding to the sample input text, and the answer text label corresponding to the sample input text.
Step 260, training an initial visual question-answer model based on the optimized objective function value, and carrying out parameter optimization iteration on the initial visual question-answer model to obtain the visual question-answer model.
Wherein, the determining process of the retriever comprises:
Based on answer text labels corresponding to the sample input texts, comparing answer text prediction results corresponding to the plurality of spliced sample input texts with answer text prediction results corresponding to the sample input texts to obtain comparison results;
Based on the comparison result, a supervision training signal is obtained, the initial retriever is trained, and the retriever is obtained after the initial retriever is trained.
It should be noted that, given a training data set D = {v_i, q_i, a_i}, each sample contains a sample image v_i ∈ V, a sample question text q_i ∈ Q, and an answer text label a_i ∈ A. The model is trained to learn the mapping function f: V × Q → A.
It should be noted that in practical applications, some questions may be answered directly using implicit knowledge contained in the pre-trained model parameters, while some questions require additional explicit knowledge to assist in answering by retrieving relevant explicit knowledge from an external knowledge source, so the optimization objective of the visual question-answering model can be expressed as follows:
argmax p(a_i | v_i, q_i; θ, K)
Where θ represents the implicit knowledge contained in the visual question-answer model, K represents the sample-related external knowledge set, K = {s_j}, where s_j represents the j-th piece of sample-related external knowledge, j ∈ [1, k-1], and the visual question-answer model needs to be able to actively select the more appropriate knowledge for prediction.
In some embodiments, the calculation formula for optimizing the objective function value is:
P = Σ_j log p_θ(ā_j | x, s_j) + Σ_{s_j ∈ R_P} log p_φ(s_j | x) − Σ_{s_j ∈ R_N} log p_φ(s_j | x)

Wherein P represents the optimization objective function value, x represents the sample input text, A represents the answer text label set corresponding to the sample input text, θ represents the parameters of the initial answer generator, φ represents the parameters of the initial retriever, s_j represents the j-th piece of sample-related external knowledge, ā_j represents the answer text label corresponding to the j-th piece of sample-related external knowledge, R_P represents the set of sample-related external knowledge assigned a positive supervisory training signal, and R_N represents the set of sample-related external knowledge assigned a negative supervisory training signal.
It should be noted that, in the optimization objective function, the first term improves the answer generation quality of the answer generator, the second term encourages the retriever to retrieve correct knowledge, and the third term discourages the retriever from retrieving incorrect knowledge.
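Since the published formula appears only as an image in the source, the sketch below is a plausible three-term objective of the described shape, not the patent's exact formula; in particular, the log(1 − p) form of the negative-knowledge term is an assumption.

```python
import math

def joint_loss(p_gen: list[float],        # generator likelihoods p(a_j | x, s_j)
               p_ret: dict[str, float],   # retriever score per knowledge id
               r_pos: set[str],           # knowledge with positive signals (R_P)
               r_neg: set[str]) -> float: # knowledge with negative signals (R_N)
    gen_term = sum(math.log(p) for p in p_gen)               # answer quality
    pos_term = sum(math.log(p_ret[j]) for j in r_pos)        # reward good knowledge
    neg_term = sum(math.log(1.0 - p_ret[j]) for j in r_neg)  # penalize noisy knowledge
    # Maximizing the objective is implemented as minimizing its negation.
    return -(gen_term + pos_term + neg_term)
```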
Optionally, in the training data set, each question may have 10 manually annotated answer text labels, represented by the answer text label set A; to obtain a better training effect, a separate training target is determined for each spliced input text.
Optionally, for the spliced sample input text obtained by splicing the j-th sample-related external knowledge with the sample input text, an answer appearing in the j-th sample-related external knowledge is selected from the answer label set A as the answer generation target for that input; if the j-th sample-related external knowledge contains no answer from A, the most frequent answer label in A is taken as the answer generation target corresponding to that spliced sample input text.
It should be noted that the 1st prediction result is obtained based on the spliced sample input text corresponding to the 1st piece of sample-related external knowledge, the 2nd prediction result based on the spliced sample input text corresponding to the 2nd piece, ..., the (k-1)-th prediction result based on the spliced sample input text corresponding to the (k-1)-th piece, and the k-th prediction result is obtained based on the internal knowledge implied by the visual question-answer model.
It should be noted that valuable external knowledge should be able to help the model generate correct answers when it is unable to generate correct answers, and in embodiments of the present invention, the kth prediction output may be used to determine which knowledge is truly valuable to the model predictions because the kth prediction output does not use external knowledge to help generate answers. For example, when the prediction result obtained using a certain knowledge is correct but the kth prediction obtained using model implicit knowledge is incorrect, then the knowledge can be determined to be valuable and assigned a positive supervisory signal; conversely, when the prediction result obtained by using a certain external knowledge is wrong but the kth prediction obtained by using the model implicit knowledge is correct, the external knowledge can be judged to be noisy, and a negative supervisory signal can be allocated to the external knowledge. The training signal set of the initial retriever is constructed as follows:
Wherein H(s_j, A) = 1 indicates that the sample-related external knowledge s_j contains one answer in the manually labeled answer set, and H(s_j, A) = 0 indicates that s_j contains no answer in that set; s_j is assigned to R_P when a correct answer can be generated with its help but cannot be generated using only the implicit internal knowledge of the visual question-answer model, and to R_N in the converse case; R_I represents the knowledge set belonging to neither R_P nor R_N, and the knowledge in R_I does not contribute to the loss during training.
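The signal assignment can be sketched as below; for brevity this omits the H(s_j, A) answer-containment check and represents each prediction only by whether it is correct, both simplifying assumptions.

```python
def assign_signals(with_knowledge_correct: list[bool],
                   internal_only_correct: bool):
    # R_P: knowledge whose prediction is correct while the internal-only
    # (k-th) prediction is wrong; R_N: the converse; R_I: everything else,
    # ignored when computing the loss.
    r_pos, r_neg, r_ignore = set(), set(), set()
    for j, correct in enumerate(with_knowledge_correct):
        if correct and not internal_only_correct:
            r_pos.add(j)
        elif not correct and internal_only_correct:
            r_neg.add(j)
        else:
            r_ignore.add(j)
    return r_pos, r_neg, r_ignore
```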
It should be noted that the first k-1 pieces of sample-related external knowledge are each spliced with the sample input text to generate k-1 prediction results {y_i1, y_i2, ..., y_i(k-1)}, the k-th prediction result is obtained based on the internal knowledge, and the final prediction result is determined based on the uncertainty of each prediction result; the goal of retriever training is to decrease the scores of the knowledge in the R_N set and increase the scores of the knowledge in the R_P set.
Optionally, a loss function value corresponding to each prediction result may be calculated; the loss function value represents the uncertainty of the visual question-answer model about that prediction result (the greater the loss function value, the greater the uncertainty), and the prediction result with the minimum loss function value is selected as the final prediction result.
Fig. 3 is a flowchart of a training method of a visual question-answering model according to an embodiment of the present invention. As shown in fig. 3, the training method of the visual question-answering model includes the following steps:
S301, starting;
S302, constructing a knowledge base;
optionally, the knowledge base may be constructed based on the visual question-answer open source data set and a search engine, resulting in the external knowledge base.
S303, inputting a sample image and a sample question text;
optionally, training data is acquired before the sample image and the sample question text are input, the training data including the sample image and the sample question text, and corresponding answer text labels.
S304, searching k-1 pieces of external knowledge in a knowledge base based on a sample input text;
wherein the sample input text includes a sample image and a sample question text.
S305, the visual question-answering model predicts based on each external knowledge to obtain the first k-1 prediction results.
S306, the visual question-answering model does not predict based on external knowledge, and a kth prediction result is obtained.
It should be noted that, the kth prediction result is obtained by predicting the visual question-answer model based on the implicit internal knowledge of the model.
S307, determining a final prediction result from k prediction results based on uncertainty;
s308, comparing a predicted result based on the external knowledge with a kth predicted result to determine the value of the external knowledge;
S309, performing joint training on the visual question-answer model based on the predicted result loss and the external knowledge loss;
S310, performing parameter optimization iteration on the visual question-answer model by adopting a gradient back propagation method;
S311, judging whether model training is converged or not;
If the training has converged, step S312 is executed; if the training has not converged, the process returns to step S303.
S312, ending.
Fig. 4 is a schematic diagram of a knowledge-based enhanced visual question-answering model provided by an embodiment of the present invention. As shown in fig. 4, the visual question-answer model includes a retriever and an answer generator, the retriever including a knowledge encoder and a query encoder; acquiring an image and a question text input by a user, obtaining an image context based on the image, wherein the image context comprises image description, objects in the image and characters in the image, and splicing the question text and the image context to obtain an input text; inputting an input text to a retriever, wherein the retriever encodes the input text through a query encoder to obtain a feature vector of the input text, and encodes all knowledge in an external knowledge base through a knowledge encoder in advance to obtain a feature vector of each knowledge, and calculates the correlation between the feature vector of the input text and the feature vector of each knowledge according to a correlation calculation function to obtain k-1 most relevant external knowledge; and splicing the retrieved plurality of external knowledge and the input text to obtain a plurality of spliced input texts, respectively inputting the plurality of spliced input texts and the input text into an answer generator, obtaining k-1 prediction results based on the k-1 spliced input texts, obtaining a kth prediction result based on the internal knowledge, and determining a final prediction result based on the uncertainty degree of each prediction result.
It should be noted that, the training mechanism of the visual question-answer model is to perform joint training on the retriever and the answer generator to obtain positive and negative of k prediction results, determine the training signal of each external knowledge, and improve the quality of the external knowledge retrieved by the retriever, thereby improving the accuracy of the visual question-answer model prediction.
In the embodiment of the invention, whether the external knowledge can actually provide the prediction performance gain is judged to serve as a supervision signal for model training, and the initial retriever and the initial answer generator of the initial visual question-answering model are jointly trained, so that the retriever can be trained based on the strong supervision signal, the quality of the external knowledge retrieved by the retriever can be improved, the visual question-answering model can reasonably utilize the external knowledge and the internal knowledge, the reliability of answer prediction is improved, and the visual question-answering model is prevented from being influenced by the external noise knowledge due to excessive dependence on the external knowledge in the prediction process.
The knowledge-based enhanced visual question-answering platform provided by the embodiment of the invention is described below, and the knowledge-based enhanced visual question-answering platform described below and the knowledge-based enhanced visual question-answering method described above can be referred to correspondingly with each other.
Fig. 5 is a schematic structural diagram of a visual question-answering platform based on knowledge enhancement according to an embodiment of the present invention, and as shown in fig. 5, the visual question-answering platform 500 based on knowledge enhancement includes:
the obtaining module 510 is configured to obtain an image and a question text input by a user, and process the image and the question text to obtain an input text;
the retrieval module 520 is configured to input the input text to a retriever of a pre-constructed visual question-answer model, and obtain a plurality of relevant external knowledge corresponding to the input text output by the retriever;
the answer generation module 530 is configured to splice a plurality of relevant external knowledge with the input text respectively to obtain a plurality of spliced input texts, input the plurality of spliced input texts to an answer generator of the visual question-answer model to obtain answer texts corresponding to each spliced input text output by the answer generator, and input the input texts to the answer generator to obtain answer texts corresponding to the input texts output by the answer generator;
a determining module 540, configured to determine a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text;
The visual question-answering model is obtained by taking a sample input text as a training sample, taking an answer text label corresponding to the sample input text as a sample label and carrying out combined training on an initial answer generator and an initial retriever of the initial visual question-answering model.
Optionally, the retriever comprises a knowledge coding layer, a query coding layer and a matching layer;
Correspondingly, the input text is input to a retriever of a pre-constructed visual question-answer model, a plurality of relevant external knowledge corresponding to the input text output by the retriever is obtained, and the method comprises the following steps:
Inputting each knowledge in the pre-constructed external knowledge base into a knowledge coding layer in advance to obtain a feature vector of each knowledge output by the knowledge coding layer;
Inputting the input text to a query coding layer to obtain a feature vector of the input text output by the query coding layer;
And inputting the feature vector of the input text and the feature vector of each knowledge to a matching layer to obtain a plurality of related external knowledge corresponding to the input text output by the matching layer.
Optionally, the determining of the external knowledge base includes:
acquiring a visual question-answer open source data set, and splicing a question text and an answer text of each data sample in the data set to obtain a query text corresponding to each data sample;
inputting each query text into a search engine to obtain a plurality of query results corresponding to each query text;
Calculating the relevance between each query result and each query text, and screening a plurality of query results to obtain screened query results;
text extraction is carried out on the screened query results to obtain a plurality of knowledge text fragments, and the knowledge text fragments are processed to obtain processed knowledge text fragments;
and obtaining an external knowledge base based on the processed knowledge text fragments.
Optionally, determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text includes:
Based on the visual question-answering model, calculating an answer text corresponding to each spliced input text and an uncertainty degree of the answer text corresponding to the input text;
And determining a final target answer text corresponding to the input text from the answer texts according to the uncertainty degree of the answer texts.
Optionally, processing the image and the question text to obtain an input text, including:
Processing the image to obtain an image context corresponding to the image, wherein the image context comprises an image description text, an identification text, object information and attribute information;
And splicing the image context and the problem text to obtain an input text.
Optionally, the determining process of the visual question-answer model includes:
acquiring a sample image and a sample question text, processing the sample image and the sample question text to obtain a sample input text, and determining an answer text label corresponding to the sample input text;
Inputting the sample input text to an initial retriever of an initial visual question-answering model to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial retriever;
Respectively splicing a plurality of sample related external knowledge with sample input texts to obtain a plurality of spliced sample input texts, inputting the plurality of spliced sample input texts into an initial answer generator of an initial visual question-answer model to obtain answer text prediction results corresponding to each spliced sample input text output by the initial answer generator, inputting the sample input texts into the initial answer generator to obtain answer text prediction results corresponding to the sample input texts output by the initial answer generator;
Determining a final target answer text corresponding to the sample input text from answer text prediction results corresponding to each spliced sample input text and answer text prediction results corresponding to the sample input text;
calculating an optimized objective function value based on the answer text prediction result corresponding to each spliced sample input text, the answer text prediction result corresponding to the sample input text and the answer text label corresponding to the sample input text;
Training an initial visual question-answer model based on the optimized objective function value, and carrying out parameter optimization iteration on the initial visual question-answer model to obtain a visual question-answer model;
Wherein, the determining process of the retriever comprises:
Based on answer text labels corresponding to the sample input texts, comparing answer text prediction results corresponding to the plurality of spliced sample input texts with answer text prediction results corresponding to the sample input texts to obtain comparison results;
Based on the comparison result, a supervision training signal is obtained, the initial retriever is trained, and the retriever is obtained after the initial retriever is trained.
Optionally, the initial retriever comprises a knowledge coding layer, an initial query coding layer and an initial matching layer;
Correspondingly, the sample input text is input to an initial retriever of the initial visual question-answer model, a plurality of sample related external knowledge corresponding to the sample input text output by the initial retriever is obtained, and the method comprises the following steps:
Inputting each knowledge in the pre-constructed external knowledge base into a knowledge coding layer in advance to obtain a feature vector of each knowledge output by the knowledge coding layer;
Inputting the sample input text to an initial query coding layer to obtain a feature vector of the sample input text output by the initial query coding layer;
and inputting the feature vector of the sample input text and the feature vector of each knowledge to an initial matching layer to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial matching layer.
Optionally, determining a final target answer text corresponding to the sample input text from the answer text prediction result corresponding to each spliced sample input text and the answer text prediction result corresponding to the sample input text includes:
based on the initial visual question-answering model, calculating an answer text prediction result corresponding to each spliced sample input text and an uncertainty degree of the answer text prediction result corresponding to the sample input text;
and determining a final target answer text corresponding to the sample input text from each answer text prediction result according to the uncertainty degree of each answer text prediction result.
Optionally, the calculation formula of the optimization objective function value is:

P = Σ_j log p_θ(ā_j | x, s_j) + Σ_{s_j ∈ R_P} log p_φ(s_j | x) − Σ_{s_j ∈ R_N} log p_φ(s_j | x)

Wherein P represents the optimization objective function value, x represents the sample input text, A represents the answer text label set corresponding to the sample input text, θ represents the parameters of the initial answer generator, φ represents the parameters of the initial retriever, s_j represents the j-th piece of sample-related external knowledge, ā_j represents the answer text label in the answer text label set corresponding to the j-th piece of sample-related external knowledge, R_P represents the set of sample-related external knowledge assigned a positive supervision training signal, and R_N represents the set of sample-related external knowledge assigned a negative supervision training signal.
Optionally, the knowledge-based enhanced visual question and answer platform 500 further includes an interactive interface for capturing images and question text entered by the user, displaying the target answer text, or displaying the target answer text and related external knowledge.
It should be noted that, the visual question-answering platform based on knowledge enhancement provided by the embodiment of the present invention can implement all the method steps implemented by the visual question-answering method embodiment based on knowledge enhancement, and can achieve the same technical effects, and the same parts and beneficial effects as those of the method embodiment in the embodiment are not specifically described herein.
Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, and memory 630 communicate with each other via communication bus 640. Processor 610 may invoke logic instructions in memory 630 to perform a knowledge-based enhanced visual question-answering method that includes: acquiring an image and a question text input by a user, and processing the image and the question text to obtain an input text; inputting the input text to a retriever of a pre-constructed visual question-answer model to obtain a plurality of relevant external knowledge corresponding to the input text output by the retriever; respectively splicing a plurality of related external knowledge with the input text to obtain a plurality of spliced input texts, inputting the plurality of spliced input texts into an answer generator of the visual question-answer model to obtain answer texts corresponding to each spliced input text output by the answer generator, and inputting the input texts into the answer generator to obtain answer texts corresponding to the input texts output by the answer generator; determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text; the visual question-answering model is obtained by taking a sample input text as a training sample, taking an answer text label corresponding to the sample input text as a sample label and carrying out combined training on an initial answer generator and an initial retriever of the initial visual question-answering model.
Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A knowledge-based enhanced visual question-answering method, which is applied to a knowledge-based enhanced visual question-answering platform, comprising:
Acquiring an image and a question text input by a user, and processing the image and the question text to obtain an input text;
inputting the input text to a retriever of a pre-constructed visual question-answer model to obtain a plurality of relevant external knowledge corresponding to the input text output by the retriever;
Splicing the multiple related external knowledge with the input text respectively to obtain multiple spliced input texts, inputting the multiple spliced input texts into an answer generator of the visual question-answer model to obtain an answer text corresponding to each spliced input text output by the answer generator, and inputting the input texts into the answer generator to obtain an answer text corresponding to the input text output by the answer generator;
Determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text;
The visual question-answering model is obtained by taking a sample input text as a training sample, taking an answer text label corresponding to the sample input text as a sample label, and performing combined training on an initial answer generator and an initial retriever of the initial visual question-answering model.
2. The knowledge-based enhanced visual question-answering method according to claim 1, wherein the retriever comprises a knowledge coding layer, a query coding layer, and a matching layer;
correspondingly, the inputting the input text to a retriever of a pre-constructed visual question-answer model to obtain a plurality of relevant external knowledge corresponding to the input text output by the retriever comprises the following steps:
Inputting each knowledge in the pre-constructed external knowledge base into the knowledge coding layer in advance to obtain a feature vector of each knowledge output by the knowledge coding layer;
Inputting the input text to the query coding layer to obtain a feature vector of the input text output by the query coding layer;
And inputting the feature vector of the input text and the feature vector of each knowledge to the matching layer to obtain a plurality of relevant external knowledge corresponding to the input text output by the matching layer.
3. The knowledge-based enhanced visual question-answering method according to claim 2, wherein the determination of the external knowledge base comprises:
Acquiring a visual question-answer open source data set, and splicing a question text and an answer text of each data sample in the data set to obtain a query text corresponding to each data sample;
Inputting each query text into a search engine to obtain a plurality of query results corresponding to each query text;
Calculating the relevance between each query result and each query text, and screening the plurality of query results to obtain screened query results;
text extraction is carried out on the screened query result to obtain a plurality of knowledge text fragments, and the knowledge text fragments are processed to obtain processed knowledge text fragments;
And obtaining the external knowledge base based on the processed knowledge text segment.
4. The knowledge-based enhanced visual question-answering method according to claim 1, wherein the determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text comprises:
calculating the uncertainty degree of the answer text corresponding to each spliced input text and the answer text corresponding to the input text based on the visual question-answering model;
and determining a final target answer text corresponding to the input text from each answer text according to the uncertainty degree of each answer text.
5. The knowledge-based enhanced visual question-answering method according to any one of claims 1-4, wherein the processing of the image and question text to obtain input text comprises:
processing the image to obtain an image context corresponding to the image, wherein the image context comprises an image description text, an identification text, object information and attribute information;
And splicing the image context and the problem text to obtain the input text.
6. The knowledge-based enhanced visual question-answering method according to claim 1, wherein the determination of the visual question-answering model includes:
Acquiring a sample image and a sample question text, processing the sample image and the sample question text to obtain a sample input text, and determining an answer text label corresponding to the sample input text;
Inputting the sample input text to an initial retriever of the initial visual question-answering model to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial retriever;
Splicing the plurality of sample related external knowledge with the sample input text respectively to obtain a plurality of spliced sample input texts, inputting the plurality of spliced sample input texts to an initial answer generator of the initial visual question-answer model to obtain an answer text prediction result corresponding to each spliced sample input text output by the initial answer generator, and inputting the sample input text to the initial answer generator to obtain an answer text prediction result corresponding to the sample input text output by the initial answer generator;
Determining a final target answer text corresponding to the sample input text from answer text prediction results corresponding to each spliced sample input text and answer text prediction results corresponding to the sample input text;
calculating an optimization objective function value based on the answer text prediction result corresponding to each spliced sample input text, the answer text prediction result corresponding to the sample input text and the answer text label corresponding to the sample input text;
Training the initial visual question-answering model based on the optimized objective function value, and carrying out parameter optimization iteration on the initial visual question-answering model to obtain the visual question-answering model;
wherein, the determining process of the retriever comprises:
comparing the answer text prediction results corresponding to the plurality of spliced sample input texts with the answer text prediction results corresponding to the sample input texts based on the answer text labels corresponding to the sample input texts to obtain comparison results;
And based on the comparison result, obtaining a supervision training signal, training the initial retriever, and obtaining the retriever after the initial retriever is trained.
7. The knowledge-based enhanced visual question-answering method according to claim 6, wherein the initial retriever comprises a knowledge coding layer, an initial query coding layer, and an initial matching layer;
correspondingly, the step of inputting the sample input text to the initial retriever of the initial visual question-answer model to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial retriever comprises the following steps:
Inputting each knowledge in the pre-constructed external knowledge base into the knowledge coding layer in advance to obtain a feature vector of each knowledge output by the knowledge coding layer;
inputting the sample input text to the initial query coding layer to obtain a feature vector of the sample input text output by the initial query coding layer;
And inputting the feature vector of the sample input text and the feature vector of each knowledge to the initial matching layer to obtain a plurality of sample related external knowledge corresponding to the sample input text output by the initial matching layer.
8. The knowledge-based enhanced visual question-answering method according to claim 6, wherein the determining a final target answer text corresponding to the sample input text from the answer text prediction result corresponding to each spliced sample input text and the answer text prediction result corresponding to the sample input text comprises:
Calculating an answer text prediction result corresponding to each spliced sample input text and the uncertainty degree of the answer text prediction result corresponding to the sample input text based on the initial visual question-answer model;
and determining a final target answer text corresponding to the sample input text from each answer text prediction result according to the uncertainty degree of each answer text prediction result.
9. The knowledge-based enhanced visual question-answering method according to claim 6, wherein the calculation formula of the optimized objective function value is:
wherein P represents the optimized objective function value, x represents the sample input text, A represents the set of answer text labels corresponding to the sample input text, θ represents the parameters of the initial answer generator, φ represents the parameters of the initial retriever, s_j represents the j-th piece of sample-related external knowledge, a_{s_j} represents the answer text label in the set A corresponding to the j-th piece of sample-related external knowledge, R_P represents the set of sample-related external knowledge corresponding to the positive supervision training signal, and R_N represents the set of sample-related external knowledge corresponding to the negative supervision training signal.
10. A knowledge-based enhanced visual question-answering platform, comprising:
The acquisition module is used for acquiring an image and a problem text input by a user, and processing the image and the problem text to obtain an input text;
the retrieval module is used for inputting the input text to a retriever of a pre-constructed visual question-answer model to obtain a plurality of relevant external knowledge corresponding to the input text output by the retriever;
The answer generation module is used for respectively splicing the plurality of related external knowledge with the input text to obtain a plurality of spliced input texts, inputting the plurality of spliced input texts into an answer generator of the visual question-answer model to obtain an answer text corresponding to each spliced input text output by the answer generator, and inputting the input text into the answer generator to obtain an answer text corresponding to the input text output by the answer generator;
The determining module is used for determining a final target answer text corresponding to the input text from the answer text corresponding to each spliced input text and the answer text corresponding to the input text;
The visual question-answering model is obtained by taking a sample input text as a training sample, taking an answer text label corresponding to the sample input text as a sample label, and performing combined training on an initial answer generator and an initial retriever of the initial visual question-answering model.
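As a rough illustration of the retriever structure recited in claims 2 and 7 (a knowledge coding layer, a query coding layer, and a matching layer), the following sketch pre-encodes every knowledge entry once, encodes the query at retrieval time, and matches by inner product. The encoder functions and knowledge base here are invented placeholders, not the patented implementation.

```python
# Hypothetical dual-encoder retriever sketch. `knowledge_encoder` and
# `query_encoder` stand in for the knowledge coding layer and the query
# coding layer; the matching layer is modeled as an inner-product score.
import numpy as np


class DualEncoderRetriever:
    def __init__(self, knowledge_encoder, query_encoder, knowledge_base):
        self.query_encoder = query_encoder
        self.knowledge_base = knowledge_base
        # Knowledge coding layer: pre-compute one feature vector per entry,
        # so the external knowledge base only needs to be encoded once.
        self.knowledge_vectors = np.stack(
            [knowledge_encoder(k) for k in knowledge_base]
        )

    def retrieve(self, input_text, top_k=3):
        # Query coding layer: embed the input text.
        query_vec = self.query_encoder(input_text)
        # Matching layer: inner-product scores against all knowledge vectors.
        scores = self.knowledge_vectors @ query_vec
        top = np.argsort(-scores)[:top_k]
        return [self.knowledge_base[i] for i in top]
```

Pre-encoding the knowledge base offline is what makes the "input each knowledge in advance" step of claims 2 and 7 cheap at inference time: only the query is encoded per request.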
CN202311868499.3A 2023-12-29 2023-12-29 Visual question-answering method and platform based on knowledge enhancement Pending CN117972044A (en)


Publications (1)

Publication Number Publication Date
CN117972044A true CN117972044A (en) 2024-05-03



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination