CN116662591A - Robust visual question-answering model training method based on contrast learning

Robust visual question-answering model training method based on contrast learning

Info

Publication number
CN116662591A
CN116662591A
Authority
CN
China
Prior art keywords
attention
question
image
model
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310646697.9A
Other languages
Chinese (zh)
Inventor
鉴萍
张钧贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310646697.9A priority Critical patent/CN116662591A/en
Publication of CN116662591A publication Critical patent/CN116662591A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

A robust visual question-answering model training method based on contrastive learning, belonging to the technical field at the intersection of natural language processing and computer vision. For image enhancement, an augmentation method based on visual-context perturbation is used: visual context that is only weakly related to the question is screened out through the attention distribution over objects in the image, and perturbation is added to this visual context to construct a new image representation, so that the model learns an image representation that does not depend on the visual context. For general (yes/no) questions, text enhancement is performed by deleting the interrogative auxiliary verb; for other question types, text enhancement is performed with a paraphrasing strategy. Positive samples are constructed through these data-enhancement methods, and contrastive learning is then used for optimization, so that an unbiased multi-modal representation of the input information is learned. The invention is applicable to fields such as artificial intelligence and natural language processing, enhances the robustness of the model, and improves its question-answering accuracy on different scenes or differently distributed data.

Description

Robust visual question-answering model training method based on contrast learning
Technical Field
The invention relates to a visual question-answering model training method, in particular to a robust visual question-answering model training method based on contrastive learning, and belongs to the technical field at the intersection of natural language processing and computer vision.
Background
In a visual question-answering (VQA) task, given a picture and a natural-language question about that picture, the computer is expected to predict the correct answer. The computer must understand not only the natural-language question but also the semantics of the image, and reason over both sources of information to predict the answer. The prior art has achieved good results on visual question-answering tasks.
However, with the rapid development of the field, researchers have found that visual question-answering models tend to rely on single-modality bias in the input information and answer questions through shortcuts, which makes them difficult to generalize to other datasets or real scenes and leaves them lacking robustness to changes in data distribution.
Among the many existing attempts to address bias in visual question-answering tasks, a common limitation is that these methods tend to focus only on the language bias between question-category information and the answer, while ignoring the potential bias between the image information and the answer, so the problem remains unresolved.
Meanwhile, with the development and wide application of visual question answering, researchers are increasingly aware that language bias is not the only bias, and that bias needs to be removed from more aspects in order to enhance the robustness of the model and reduce its dependence on bias.
Disclosure of Invention
To address the tendency of current visual question-answering models to depend on single-modality bias in the input information, the main purpose of the invention is to provide a robust visual question-answering model training method based on contrastive learning: positive samples are constructed with data-enhancement methods on both the text and image modalities, and contrastive learning is then used for optimization, so that an unbiased multi-modal representation of the input information is learned for subsequent prediction. This enhances the robustness of the visual question-answering model, reduces its dependence on bias, and improves its question-answering accuracy on different scenes or differently distributed data.
The aim of the invention is achieved by the following technical solution.
In the robust visual question-answering model training method based on contrastive learning, for image enhancement an augmentation method based on visual-context perturbation is used: visual context that is only weakly correlated with the question is screened out through the attention distribution over objects in the image, and perturbation is added to this visual context to construct a new image representation, so that the model learns an image representation that does not depend on the visual context, finds it harder to rely on bias information present in the image, and becomes more robust to image changes. For general (yes/no) questions, text enhancement is performed by deleting the interrogative auxiliary verb, which cuts off the potential co-occurrence pattern between the auxiliary verb and the answer, reduces the model's dependence on question bias, and improves question-answering accuracy on different scenes or differently distributed data.
The invention discloses a robust visual question-answering model training method based on contrastive learning, which comprises the following steps:
Step 1: paraphrase the input question of the visual question-answering sample to obtain a rewritten, enhanced question.
Step 1.1: fine-tune a T5 (Text-to-Text Transfer Transformer) model on multiple paraphrase datasets.
Step 1.2: for a general (yes/no) question, delete the question-category prefix given in the dataset annotation from the original question, thereby removing the corresponding interrogative auxiliary verb and constructing a new enhanced question; for other questions, feed the question into the fine-tuned T5 model and output the corresponding paraphrased question.
Step 2: input the question and image of the visual question-answering sample to obtain question features and image features.
Step 2.1: use the GloVe word-vector model to extract text representations of the question and the paraphrased question.
First, the input question is tokenized and the natural-language text is converted into an integer form the computer can process; a maximum word count is set and the question is truncated accordingly. Each word in the question is then converted into a text representation vector Word_Embed, whose first dimension is n, the number of words.
Step 2.2: input the text representation vector obtained in step 2.1 into an LSTM text encoder to extract question features.
The Word_Embed vector is fed into a single-layer LSTM network to obtain the question feature Y:
Y=LSTM(Word_Embed) (1)
Similarly, the same processing is applied to the paraphrased question to obtain the enhanced question feature Y_pos.
Step 2.3: extract object-based image features using the Faster R-CNN model.
First, object detection is performed on the input image, and object-based image features are extracted for each image using a Faster R-CNN model built on ResNet-101:
X=Faster R-CNN(Input_Image) (2)
where Input_Image denotes the input image and X denotes the image feature, whose first dimension m is the number of detected objects.
Step 3: input the question features and image features obtained in step 2 into a deep co-attention learning module to obtain attended features after interaction between the two modalities.
Step 3.1: pass the question features obtained in step 2 through L cascaded self-attention units to extract attended question features.
Step 3.2: pass the image features obtained in step 2 through L cascaded self-attention and guided-attention units to extract attended image features, where the guided attention is conditioned on the attended question features obtained in step 3.1.
Step 4: obtain the attended features of the enhanced image using an image-enhancement method based on visual-context perturbation.
Step 4.1: take the attention weight matrix from the l-th layer self-attention operation in step 3.2 and compute the mean of each column vector of the matrix as the saliency score of the corresponding object.
Objects with higher attention weights are more strongly correlated with the question and are essential for the model to answer it correctly; they are regarded as key objects of the question. Objects with lower attention weights have weak or no relevance to the question and do not help the model answer it; they are treated as the visual context of the question.
In the deep self-attention operations performed within the image modality, the attention distribution gradually stabilizes as the self-attention computation is repeated, so this distribution is used as the basis for screening the visual context.
The attention weight matrix obtained from the l-th layer self-attention computation is A_l ∈ R^(m×m), where each column a_i ∈ R^m is a column vector containing the attention weights between the current object and the m objects. To select the salient objects or regions in the image, each column vector a_i of the attention weight matrix is averaged to obtain the mean attention weight of each object, which serves as its saliency score; the calculation is shown in formula (3):
score_i = (1/m) · Σ_{j=1}^{m} a_i[j]   (3)
where the saliency score of the i-th object is the mean of its corresponding column vector, and a_i[j] denotes the j-th element of the column vector a_i.
Step 4.2: sort the saliency scores of the objects from step 4.1 and mask the r objects with the lowest scores to obtain the attended features of the enhanced image.
For the r objects with the smallest saliency scores, the attention scores corresponding to these objects are set to a minimum value so that their attention weights become 0 after the Softmax computation and their features are masked. In this way data enhancement is performed at the image-representation level, yielding an enhanced image representation.
Step 5: perform multi-modal fusion separately on the question and image representations from step 3, and on the enhanced question representation from step 3 together with the enhanced image representation from step 4, to obtain multi-modal representations of the original sample and of the positive sample respectively.
Step 5.1: feed the attended features of the two modalities from step 3 each through an attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the original sample.
Step 5.2: feed the attended features of the enhanced question from step 3 and the attended features of the enhanced image from step 4 through the attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the positive sample.
Step 6: optimize the multi-modal representation with a contrastive-learning loss function and optimize the predictive ability of the robust visual question-answering model with a cross-entropy loss function, obtaining a trained robust visual question-answering model with which highly robust visual question answering is realized.
Step 6.1: the multi-modal representation of the original sample is optimized using the InfoNCE loss function to be close to the multi-modal representation of the positive sample and far from the multi-modal representations of the other samples in the same batch.
Step 6.2: feed the multi-modal representation from step 5.1 into a single-layer fully connected classifier to predict the answer, while training with a binary cross-entropy loss function.
The method further comprises step 7: compared with a model trained without data enhancement and contrastive learning, the robust visual question-answering model obtained in step 6 is more robust to changes in data distribution, depends less on question-type bias, achieves higher question-answering accuracy on different scenes or differently distributed data, and improves human-computer interaction performance.
Advantageous effects
1. The robust visual question-answering training method based on contrastive learning disclosed by the invention performs data enhancement on multiple modalities. Using the visual-context-based image enhancement method on the image modality effectively suppresses the bias present in image information, so the method is applicable to various scenes with bias in either text or images, allows the model to perform well in more scenes, improves the question-answering accuracy of the robust visual question-answering model on different scenes or differently distributed data, and improves human-computer interaction performance.
2. The method builds on contrastive learning and exploits its flexibility, so that it can be combined with any visual question-answering model independently of the model structure, improving robustness and performance on visual question-answering tasks on top of a variety of model architectures.
3. To address the tendency of models to rely on bias for general (yes/no) questions, the method adopts an enhancement strategy that deletes the interrogative auxiliary verb, effectively suppressing the model's dependence on bias for such questions.
Drawings
FIG. 1 is a flow chart of the robust visual question-answering model training method based on contrastive learning disclosed by the invention;
FIG. 2 is a schematic diagram of the contrastive-learning-based visual question-answering model according to an embodiment of the invention.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings and an embodiment. The technical problems solved and the beneficial effects of the technical solution are also described; the described embodiment is intended only to facilitate understanding of the invention and has no limiting effect.
Suppose a visual question-answering model is to be trained as a daily-life assistant for visually impaired users, which requires high robustness and accuracy. As shown in FIG. 1, the robust visual question-answering model training method based on contrastive learning comprises the following steps:
step 1: and (3) rewriting the input questions of the visual questions and answers to obtain rewritten enhanced questions, so that the model can adapt to different language expression modes.
Step 1.1: trimming the T5 (Transfer Text-to-Text transducer) model over a plurality of rewritten datasets;
specifically, the T5 model was trimmed on the Quora Question Pairs, paraphrase Adversaries from Word Scrambling and Microsoft Research Paraphrase Corpus open source datasets.
Step 1.2: aiming at general questioning sentences, adopting a strategy for deleting the questioning auxiliary verbs to construct an enhancement problem; for other problems, a rewrite problem is generated for each problem in the dataset using the trimmed T5 model.
For a general question sentence, deleting the category of the question in the data set label on the basis of the original question, so as to delete the corresponding query auxiliary verb of the question to construct a new enhanced question; for other problems, the problem is input to the T5 model after trimming, and the rewrite problem corresponding to the problem is output.
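For ease of understanding, a minimal Python sketch of this question-enhancement step follows; the auxiliary-verb list, the "paraphrase:" prompt and the use of the public t5-base checkpoint in place of the fine-tuned model are illustrative assumptions, not part of the claimed method.

```python
# Sketch of Step 1 (question enhancement). The auxiliary-verb list, the prompt and
# the t5-base checkpoint are assumptions standing in for the fine-tuned T5 model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

AUX_VERBS = {"is", "are", "was", "were", "do", "does", "did", "can", "could", "will"}

tokenizer = T5Tokenizer.from_pretrained("t5-base")
paraphraser = T5ForConditionalGeneration.from_pretrained("t5-base")

def enhance_question(question: str, is_yes_no: bool) -> str:
    """Build the enhanced (positive-sample) question."""
    if is_yes_no:
        # General (yes/no) question: drop the leading interrogative auxiliary verb.
        words = question.split()
        if words and words[0].lower() in AUX_VERBS:
            words = words[1:]
        return " ".join(words)
    # Other question types: generate a paraphrase with the (fine-tuned) T5 model.
    inputs = tokenizer("paraphrase: " + question, return_tensors="pt")
    outputs = paraphraser.generate(**inputs, max_length=32, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```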
Step 2: input the question and image of the visual question-answering sample to obtain question features and image features.
Step 2.1: use the GloVe word-vector model to extract text representations of the question and the paraphrased question.
First, the input question is tokenized and the natural-language text is converted into integer indices the computer can process; the maximum number of words is set to 14 and the question is truncated accordingly. Each word in the question is then converted into an n×300-dimensional text representation vector Word_Embed, where n ∈ [1, 14] is the number of words in the question.
Step 2.2: input the text representation obtained in step 2.1 into an LSTM text encoder to extract question features.
The Word_Embed vector is fed into a single-layer LSTM network, and the question feature Y is obtained by formula (1).
Similarly, the same processing is applied to the paraphrased question to obtain the enhanced question feature Y_pos.
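A minimal PyTorch sketch of the question encoder in steps 2.1-2.2 follows; the LSTM hidden size of 512 is an assumption not specified in the text.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """GloVe embedding (300-d) followed by a single-layer LSTM, as in steps 2.1-2.2."""

    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 512):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pre-trained GloVe vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm = nn.LSTM(input_size=300, hidden_size=hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, n) integer indices, n <= 14 after truncation/padding.
        word_embed = self.embed(token_ids)   # (batch, n, 300) = Word_Embed
        y, _ = self.lstm(word_embed)         # Y = LSTM(Word_Embed), formula (1)
        return y                             # (batch, n, hidden_dim)
```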
Step 2.3: extract object-based image features using the Faster R-CNN model.
First, object detection is performed on the input image, and object-based image features are extracted for each image using a Faster R-CNN model built on ResNet-101, as shown in formula (2).
where Input_Image denotes the input image and X denotes the m×2048-dimensional image features, with m the number of detected objects, generally set to 36.
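In practice the region features of formula (2) are often pre-extracted offline; the sketch below assumes such pre-computed ResNet-101 Faster R-CNN features stored one NumPy file per image, which is an implementation assumption rather than part of the method.

```python
import numpy as np
import torch

def load_image_features(feature_path: str, m: int = 36) -> torch.Tensor:
    """Load pre-extracted Faster R-CNN (ResNet-101) region features for one image.

    Assumes the m x 2048 object features of formula (2) were computed offline and
    saved as a NumPy array of shape (m, 2048); the file format is an assumption.
    """
    x = torch.from_numpy(np.load(feature_path)).float()  # X, shape (m, 2048)
    assert x.shape == (m, 2048)
    return x
```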
Step 3: input the question features and image features obtained in step 2 into a deep co-attention learning module to obtain attended features after interaction between the two modalities.
Step 3.1: pass the question features obtained in step 2 through L cascaded self-attention units to extract attended question features.
On the question (text) side, the question feature Y is input to the L-layer cascade of multi-head self-attention units. With h attention heads, the l-th layer self-attention unit maps the question feature into a query Q, a key K and a value V as follows:
Q = W_Q · Y_l   (4)
K = W_K · Y_l   (5)
V = W_V · Y_l   (6)
where W_Q, W_K and W_V are the mapping matrix parameters of the query Q, key K and value V respectively, and Y_l is the question feature input to the l-th layer.
Y_{l+1} = Attention(Q, K, V) = W_attention · V   (7)
W_attention = Softmax(Q · K^T / √d)   (8)
where Q · K^T can be seen as a dot-product similarity between the two sets of vectors, the Softmax function yields the normalized attention weight matrix W_attention, and d is the key dimension; Y_{l+1} is the question feature output by the l-th layer, and the question feature output by the last layer is taken as the attended question feature Y'.
The same processing is applied to the paraphrased question to obtain the attended enhanced question feature Y'_pos.
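A compact PyTorch sketch of one such self-attention layer follows; it uses the built-in nn.MultiheadAttention in place of explicit W_Q, W_K and W_V matrices, and the 512-dimensional features, 8 heads and residual LayerNorm are assumptions.

```python
import torch.nn as nn

class SelfAttentionUnit(nn.Module):
    """One layer of the cascaded multi-head self-attention stack, formulas (4)-(8).

    Minimal sketch: the feed-forward sub-layer of a full co-attention block is omitted.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, y):
        # Q, K and V are all projections of the same input feature Y_l.
        out, attn_weights = self.attn(y, y, y)   # weights = Softmax(Q·K^T / sqrt(d))
        return self.norm(y + out), attn_weights  # Y_{l+1} and W_attention
```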
Step 3.2: pass the image features obtained in step 2 through L cascaded self-attention and guided-attention units to extract attended image features, where the guided attention is conditioned on the attended question features obtained in step 3.1.
On the image side, the image feature X is input to the L-layer cascade of multi-head self-attention and multi-head guided-attention units. With h attention heads, the l-th layer self-attention unit is computed in the same way as on the text side and is followed by a guided-attention unit, computed as follows:
Q = W_Q · X_l   (9)
K = W_K · Y'   (10)
V = W_V · Y'   (11)
where W_Q, W_K and W_V are the mapping matrix parameters of the query Q, key K and value V respectively; X_l is the image feature input to the l-th layer, and Y' is the attended question feature.
The computation then proceeds according to formulas (7) and (8) in step 3.1, and the image feature output by the last layer is taken as the attended image feature X'.
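A corresponding sketch of the guided-attention sub-layer follows, reusing the assumptions of the self-attention sketch above (512-dimensional features, 8 heads).

```python
import torch.nn as nn

class GuidedAttentionUnit(nn.Module):
    """Guided-attention sub-layer of the image branch, formulas (9)-(11): the image
    feature supplies the query, the attended question feature Y' supplies key/value."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, y_prime):
        out, _ = self.attn(x, y_prime, y_prime)  # Q from X_l, K and V from Y'
        return self.norm(x + out)                # X_{l+1}
```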
Step 4: obtain the attended features of the enhanced image using an image-enhancement method based on visual-context perturbation, as shown in FIG. 2.
Step 4.1: take the attention weight matrix from the l-th layer self-attention operation in step 3.2 and compute the mean of each column vector as the saliency score of the corresponding object.
Objects with higher attention weights are more strongly correlated with the question and are essential for the model to answer it correctly; they are regarded as key objects of the question. Objects with lower attention weights have weak or no relevance to the question and do not help the model answer it; they are treated as the visual context of the question.
In the deep self-attention operations performed within the image modality, the attention distribution gradually stabilizes as the self-attention computation is repeated, which makes it suitable as the basis for screening the visual context.
The attention weight matrix obtained from the l-th layer self-attention computation is A_l ∈ R^(m×m), where each column a_i ∈ R^m is a column vector containing the attention weights between the current object and the m objects. To select the salient objects or regions in the image, each column vector a_i of the attention weight matrix is averaged to obtain the mean attention weight of each object, which serves as its saliency score; the calculation is shown in formula (3):
score_i = (1/m) · Σ_{j=1}^{m} a_i[j]   (3)
where the saliency score of the i-th object is the mean of its corresponding column vector, and a_i[j] denotes the j-th element of the column vector a_i.
Step 4.2: sort the saliency scores of the objects from step 4.1 and mask the r objects with the lowest scores to obtain the attended features of the enhanced image.
For the r objects with the smallest saliency scores, the attention scores corresponding to these objects are set to a minimum value so that their attention weights become 0 after the Softmax computation and their features are masked. This performs data enhancement at the image-representation level and yields the enhanced image representation X'_pos.
Step 5: perform multi-modal fusion separately on the question and image representations from step 3, and on the enhanced question representation from step 3 together with the enhanced image representation from step 4, to obtain multi-modal representations of the original sample and of the positive sample respectively.
Step 5.1: feed the attended features of the two modalities from step 3 each through an attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the original sample.
Specifically, after the deep co-attention learning stage, the output question feature Y' and image feature X' already contain rich attention-weight information over the question words and image regions. A multilayer perceptron (MLP) with two fully connected layers is therefore employed as the attention-reduction model, where the first fully connected layer uses ReLU as the activation function and adds Dropout.
A Softmax function is applied to the reduced attention features to compute new attention weights α, from which a new attended feature X̂ is computed:
α = Softmax(MLP(X'))
X̂ = Σ_{i=1}^{m} α_i · X'_i
where X' is the input feature and α = [α_1, α_2, ..., α_m] are the learned attention weights.
For the output attended features X̂ and Ŷ, the following linear multi-modal fusion function is used:
z = LayerNorm(W_x^T · X̂ + W_y^T · Ŷ)
where W_x and W_y are two linear mapping matrices, d_z denotes the dimension of the fused feature, and layer normalization is applied after fusion to stabilize training, yielding the multi-modal representation z.
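A PyTorch sketch of the attention-reduction and fusion heads of step 5 follows; the hidden size, Dropout rate and fused dimension d_z = 1024 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionReduce(nn.Module):
    """Two-layer MLP attention-reduction head (step 5.1): ReLU + Dropout on the first
    layer, one scalar score per position, Softmax over positions, weighted sum."""

    def __init__(self, dim: int = 512, hidden: int = 512, p: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(p),
                                 nn.Linear(hidden, 1))

    def forward(self, feats):                      # feats: (batch, m, dim), e.g. X'
        alpha = F.softmax(self.mlp(feats), dim=1)  # attention weights alpha, (batch, m, 1)
        return (alpha * feats).sum(dim=1)          # attended feature, (batch, dim)

class LinearFusion(nn.Module):
    """Linear multi-modal fusion: z = LayerNorm(W_x^T X_hat + W_y^T Y_hat)."""

    def __init__(self, dim: int = 512, d_z: int = 1024):
        super().__init__()
        self.wx = nn.Linear(dim, d_z)
        self.wy = nn.Linear(dim, d_z)
        self.norm = nn.LayerNorm(d_z)

    def forward(self, x_hat, y_hat):
        return self.norm(self.wx(x_hat) + self.wy(y_hat))  # multi-modal representation z
```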
Step 5.2: feed the attended features of the enhanced question from step 3 and the attended features of the enhanced image from step 4 through the attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the positive sample.
Specifically, following the same procedure as step 5.1, the enhanced question feature Y'_pos and the enhanced image feature X'_pos are fused to obtain the multi-modal representation z_pos, which serves as the positive sample.
Step 6: optimize the multi-modal representation with a contrastive-learning loss function, enhancing its robustness to language and image changes so that the model generalizes better in real scenes and better helps visually impaired users; optimize the predictive ability of the model with a cross-entropy loss function.
Step 6.1: optimizing the multi-modal representation of the original sample using the InfoNCE loss function to bring it closer to the multi-modal representation of the positive sample and further away from the multi-modal representations of other samples in the same batch;
the current sample and its corresponding positive sample are represented as M in multiple modes i ,Respectively correspond to the z, z obtained in the previous step pos As shown in fig. 2, the optimization is performed using the InfoNCE loss function so that the multi-modal representation of the original sample is as close as possible to the multi-modal representation of the positive sample, and as far as possible from the multi-modal representations of the other negative samples of the same batch, the calculation process is as follows:
where N is the number of samples, τ is the temperature coefficient, I [j≠i] E {0,1} is an indicator function, representing that 1 is taken when j+.i, or 0 is taken otherwise,represents M i ,/>The similarity between the two vectors is calculated as follows:
and introducing language and image changes to construct a positive sample, and optimizing the multi-modal representation learned by the visual question-answer model in a contrast learning mode to enhance the robustness of the model.
Step 6.2: feed the multi-modal representation from step 5.1 into a single-layer fully connected classifier to predict the answer, while training with a binary cross-entropy loss function.
Specifically, as shown in FIG. 2, the multi-modal representation z is mapped by a fully connected layer FCN and a Sigmoid function to a vector s ∈ R^G over the answer vocabulary, where G is the number of candidate answers.
s=Sigmoid(FCN(z)) (16)
Finally, the answer classifier is trained using binary cross-entropy as the loss function:
L_bce = -Σ_{i=1}^{N} Σ_{k=1}^{G} [ target_{i,k} · log(s_{i,k}) + (1 - target_{i,k}) · log(1 - s_{i,k}) ]   (17)
where target_{i,k} denotes the target score of the k-th candidate answer for the i-th sample.
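A sketch of the classifier head and the joint training objective follows; the answer-vocabulary size of 3129 and the equal weighting of the two losses are assumptions, since the text only states that both losses are used during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerClassifier(nn.Module):
    """Single fully connected layer over the fused representation z, followed by a
    Sigmoid, giving one score per candidate answer (formula (16))."""

    def __init__(self, d_z: int = 1024, num_answers: int = 3129):  # 3129 is an assumed G
        super().__init__()
        self.fcn = nn.Linear(d_z, num_answers)

    def forward(self, z):
        return torch.sigmoid(self.fcn(z))  # s = Sigmoid(FCN(z))

def total_loss(scores, targets, z, z_pos, lam: float = 1.0):
    """Joint objective: binary cross-entropy (formula (17)) plus the contrastive term
    (info_nce_loss from the sketch above); the weighted sum and lam are assumptions."""
    bce = F.binary_cross_entropy(scores, targets)  # targets: soft answer scores in [0, 1]
    return bce + lam * info_nce_loss(z, z_pos)
```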
Finally, the robust visual question-answering model trained by the invention can serve better as an assistant that helps visually impaired users perceive the world. When a user asks "Is the traffic light red now?", the model does not blindly answer "Yes" merely because of the language bias that most answers to such questions in the training set are "Yes"; instead it actually observes the color of the traffic light in the image and gives the correct answer, which increases the feasibility and safety of deploying the technology and further expands its application value.
The foregoing detailed description has set forth the objects, technical solution and advantages of the invention in further detail. It should be understood that the foregoing is only illustrative of the invention and is not intended to limit its scope, which is defined by the appended claims.

Claims (8)

1. A robust visual question-answering model training method based on contrastive learning, characterized by comprising the following steps:
step 1: paraphrase the input question of the visual question-answering sample to obtain a rewritten, enhanced question;
step 2: input the question and image of the visual question-answering sample to obtain question features and image features;
step 3: input the question features and image features obtained in step 2 into a deep co-attention learning module to obtain attended features after interaction between the two modalities;
step 4: obtain the attended features of the enhanced image using an image-enhancement method based on visual-context perturbation;
step 5: perform multi-modal fusion separately on the question and image representations from step 3 and on the enhanced question and enhanced image representations from steps 3 and 4, to obtain multi-modal representations of the original sample and of the positive sample respectively;
step 6: optimize the multi-modal representation with a contrastive-learning loss function and optimize the predictive ability of the robust visual question-answering model with a cross-entropy loss function, obtaining a trained robust visual question-answering model with which highly robust visual question answering is realized.
2. The robust visual question-answering model training method based on contrastive learning as claimed in claim 1, further comprising step 7: compared with a model trained without data enhancement and contrastive learning, the robust visual question-answering model obtained in step 6 is more robust to changes in data distribution, depends less on question-type bias, achieves higher question-answering accuracy on different scenes or differently distributed data, and improves human-computer interaction performance.
3. The robust visual question-answering model training method based on contrastive learning as claimed in claim 1, wherein step 1 is implemented as follows:
step 1: paraphrase the input question of the visual question-answering sample to obtain a rewritten, enhanced question;
step 1.1: fine-tune a T5 (Text-to-Text Transfer Transformer) model on a plurality of paraphrase datasets;
step 1.2: for a general (yes/no) question, delete the question-category prefix given in the dataset annotation from the original question, thereby removing the corresponding interrogative auxiliary verb and constructing a new enhanced question; for other questions, feed the question into the fine-tuned T5 model and output the corresponding paraphrased question.
4. The robust visual question-answering model training method based on contrastive learning as claimed in claim 3, wherein step 2 is implemented as follows:
step 2.1: use the GloVe word-vector model to extract text representations of the question and the paraphrased question;
first, the input question is tokenized and the natural-language text is converted into an integer form the computer can process; a maximum word count is set and the question is truncated accordingly; each word in the question is then converted into a text representation vector Word_Embed, whose first dimension is n, the number of words;
step 2.2: input the text representation vector obtained in step 2.1 into an LSTM text encoder to extract question features;
the Word_Embed vector is fed into a single-layer LSTM network to obtain the question feature Y:
Y = LSTM(Word_Embed)   (1)
similarly, the same processing is applied to the paraphrased question to obtain the enhanced question feature Y_pos;
step 2.3: extract object-based image features using the Faster R-CNN model;
first, object detection is performed on the input image, and object-based image features are extracted for each image using a Faster R-CNN model built on ResNet-101:
X = Faster R-CNN(Input_Image)   (2)
where Input_Image denotes the input image and X denotes the image feature, whose first dimension m is the number of detected objects.
5. The robust visual question-answering model training method based on contrastive learning as claimed in claim 4, wherein step 3 is implemented as follows:
step 3.1: pass the question features obtained in step 2 through L cascaded self-attention units to extract attended question features;
step 3.2: pass the image features obtained in step 2 through L cascaded self-attention and guided-attention units to extract attended image features, where the guided attention is conditioned on the attended question features obtained in step 3.1.
6. The robust visual question-answering model training method based on contrastive learning as claimed in claim 5, wherein step 4 is implemented as follows:
step 4.1: take the attention weight matrix from the l-th layer self-attention operation in step 3.2 and compute the mean of each column vector as the saliency score of the corresponding object;
objects with higher attention weights are more strongly correlated with the question, are essential for the model to answer it correctly, and are regarded as key objects of the question; objects with lower attention weights have weak or no relevance to the question, do not help the model answer it, and are treated as the visual context of the question;
in the deep self-attention operations performed within the image modality, the attention distribution gradually stabilizes as the self-attention computation is repeated, and is used as the basis for screening the visual context;
the attention weight matrix obtained from the l-th layer self-attention computation is A_l ∈ R^(m×m), where each column a_i ∈ R^m is a column vector containing the attention weights between the current object and the m objects; to select the salient objects or regions in the image, each column vector a_i of the attention weight matrix is averaged to obtain the mean attention weight of each object as its saliency score, as shown in formula (3);
score_i = (1/m) · Σ_{j=1}^{m} a_i[j]   (3)
where the saliency score of the i-th object is the mean of its corresponding column vector, and a_i[j] denotes the j-th element of the column vector a_i;
step 4.2: sort the saliency scores of the objects from step 4.1 and mask the r objects with the lowest scores to obtain the attended features of the enhanced image;
for the r objects with the smallest saliency scores, the attention scores corresponding to these objects are set to a minimum value so that their attention weights become 0 after the Softmax computation and their features are masked; data enhancement is thus performed at the image-representation level, resulting in an enhanced image representation.
7. The robust visual question-answering model training method based on contrastive learning as claimed in claim 6, wherein step 5 is implemented as follows:
step 5.1: feed the attended features of the two modalities from step 3 each through an attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the original sample;
step 5.2: feed the attended features of the enhanced question from step 3 and the attended features of the enhanced image from step 4 through the attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the positive sample.
8. The robust visual question-answering model training method based on contrastive learning as claimed in claim 7, wherein step 6 is implemented as follows:
step 6.1: optimize the multi-modal representation of the original sample using the InfoNCE loss function so that it is close to the multi-modal representation of the positive sample and far from the multi-modal representations of other samples in the same batch;
step 6.2: feed the multi-modal representation from step 5.1 into a single-layer fully connected classifier to predict the answer, and train with a binary cross-entropy loss function to obtain the trained robust visual question-answering model.
CN202310646697.9A 2023-06-02 2023-06-02 Robust visual question-answering model training method based on contrast learning Pending CN116662591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310646697.9A CN116662591A (en) 2023-06-02 2023-06-02 Robust visual question-answering model training method based on contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310646697.9A CN116662591A (en) 2023-06-02 2023-06-02 Robust visual question-answering model training method based on contrast learning

Publications (1)

Publication Number Publication Date
CN116662591A true CN116662591A (en) 2023-08-29

Family

ID=87725724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310646697.9A Pending CN116662591A (en) 2023-06-02 2023-06-02 Robust visual question-answering model training method based on contrast learning

Country Status (1)

Country Link
CN (1) CN116662591A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Medical image problem vision solving method based on fine granularity cross attention



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination