CN114973041A - Language prior method for overcoming visual question and answer based on self-contrast learning - Google Patents
- Publication number
- CN114973041A (application number CN202111557673.3A)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- self
- contrast
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a language prior method for overcoming visual question answering based on self-contrast learning. First, image features are extracted with a pre-trained model, and the question words are embedded and fed into a GRU (gated recurrent unit) to generate a question feature. Then an attention mechanism is applied to the question and the image: the question feature and the image feature are fused into a joint representation, while the weights learned by the attention mechanism are fed to an anti-attention layer, where the question feature and the anti-attended image feature are fused into a second joint representation, and the two representations are contrasted. The correlation between the question and the image is increased by optimizing the base VQA classification loss together with a self-contrast learning loss Lscl; the two losses are combined into a joint loss for training. The method is built on the LMH model and, without using auxiliary tasks, achieves state-of-the-art performance of 59.00% on the most common benchmark VQA-CP v2, an absolute improvement of 6.51%.
Description
Technical Field
The invention relates to a language prior method for overcoming visual question answering based on self-contrast learning, and belongs to the technical fields of natural language processing and computer vision.
Background
Visual Question Answering (VQA) aims to automatically answer natural language questions based on visual content, and is one of the benchmark tasks for multimodal learning (e.g., language and images). It requires visual analysis, language understanding, multimodal information fusion, and reasoning. In recent years VQA has attracted considerable interest, and various benchmark datasets have been published. With the introduction of a large body of work, the VQA task has made significant progress. Many works attempt to understand both images and questions, but recent studies find that they are largely driven by superficial language correlations (i.e., language priors) in the training QA pairs and ignore image content. For example, such models tend to give the most frequent training answer to "How many ...?" questions and to answer "tennis" to "What sport ...?" questions, instead of reasoning over the image content. To help address these biases, Agrawal et al. created the diagnostic benchmark VQA-CP (VQA under Changing Priors) in 2018 by reorganizing the training and validation splits of the respective VQA datasets. Most prior work designs various attention mechanisms to learn relationships between the modalities, which work well on standard VQA benchmarks (e.g., VQA v2). However, the performance of these works drops significantly on VQA-CP because of language priors. To alleviate language priors, existing work focuses on reducing the statistical priors of the question and increasing image dependency and interpretability, and can be roughly divided into learning with auxiliary tasks and learning without auxiliary tasks. Learning without auxiliary tasks uses an auxiliary QA branch, or a specific complex learning strategy, to regularize the training of the target VQA model; these methods add the auxiliary branch to capture the language prior and diminish its effect.
Learning with auxiliary tasks introduces additional manual supervision and auxiliary tasks (visual grounding, image captioning, etc.) to increase image dependency and interpretability. These methods can better understand the image content under the guidance of the auxiliary tasks and thereby achieve better performance. However, the inherent data bias remains severe and leads to superficial language correlations. It is therefore important to reduce the native language prior, without introducing additional annotations, while relying on the relevant visual regions for decision making.
Disclosure of Invention
The invention provides a language prior method for overcoming visual question answering based on self-contrast learning. It overcomes the VQA language prior problem with a novel self-contrast learning scheme that concentrates on the relevant regions to predict the correct answer to a given question about the input image, improving the reasoning ability and robustness of the VQA model.
The technical scheme of the invention is as follows: a language prior method for overcoming visual question answering based on self-contrast learning comprises the following specific steps:
step1, firstly, taking the questions, the images and answer options as experimental data, secondly, preprocessing the images to extract a feature map, and preprocessing the questions to generate question feature vectors;
step2, using the attention layer to learn to identify image regions relevant to the question; after the preprocessing of Step1, the attention mechanism uses the question to compute attention weights over image regions, thereby locating the image regions associated with the question, and the resulting question feature q and the weighted image feature are fused into a joint representation r;
step3, using the anti-attention layer to identify image regions that are currently irrelevant or weakly relevant; with the attention weights obtained in Step2, the question feature q and the weighted anti-attended image feature are fused into a joint representation r0, focusing the question on irrelevant regions and ignoring the relevant regions of the image so as to form a contrast;
step4, post-processing: with the joint representation r0 from Step3 and the joint representation r from Step2, the proposed network is trained to optimize the joint loss of the self-contrast loss Lscl and the base VQA classification loss Lvqa, so that the model can focus on the relevant regions to predict the correct answer to a given question about the input image.
Further, the Step1 includes the following steps:
step1.1, firstly extracting a series of visual object features from the image using the pre-trained model Faster R-CNN;
step1.2, performing word embedding on the question and passing it to a single-layer GRU to generate the question feature;
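The question encoding of Step1.2 can be illustrated with a minimal single-layer GRU sketch in NumPy. This is not the patent's implementation; the embedding dimension (300), hidden size (512), question length (14), and random parameters are illustrative assumptions. It only shows how embedded question words are folded step by step into a single question feature q:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step: x is a word embedding, h the previous hidden state.
    W, U, b stack the update/reset/candidate parameters along axis 0."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(x @ Wz + h @ Uz + bz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)   # candidate state
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d_emb, d_hid, n_words = 300, 512, 14                # assumed sizes
W = rng.normal(scale=0.1, size=(3, d_emb, d_hid))
U = rng.normal(scale=0.1, size=(3, d_hid, d_hid))
b = np.zeros((3, d_hid))

h = np.zeros(d_hid)
for x in rng.normal(size=(n_words, d_emb)):         # stand-ins for embedded words
    h = gru_step(x, h, W, U, b)
q = h                                               # question feature q
```

In practice the embeddings would come from a learned lookup table and the GRU parameters from training; here random values stand in for both.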
further, the specific steps of Step2 are as follows:
step2.1, after the image and question features are extracted, they are passed to the attention layer, which transforms the image features and the question features into a space of the same dimension;
step2.2, computing the attention weights and generating a normalized attention weight for each feature map; the final image feature is the weighted sum of all input features;
step2.3, fusing the weighted image feature with the question feature obtained in Step1.2 into a joint feature representation r, and further computing the probability distribution over each answer a in the candidate answer set A.
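Steps 2.1-2.3 can be sketched in NumPy as follows. The dimensions (36 Faster R-CNN regions of size 2048, a 512-dimensional common space), the tanh projections, and the elementwise-product fusion are assumptions, since the exact projection and fusion functions are not spelled out in this text:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(1)
K, d_v, d_q, d = 36, 2048, 512, 512                 # assumed dimensions
V = rng.normal(size=(K, d_v))                       # image region features
q = rng.normal(size=d_q)                            # question feature from the GRU
Wv = rng.normal(scale=d_v ** -0.5, size=(d_v, d))
Wq = rng.normal(scale=d_q ** -0.5, size=(d_q, d))

# Step2.1: project both modalities into a common d-dimensional space
Vp = np.tanh(V @ Wv)                                # (K, d)
qp = np.tanh(q @ Wq)                                # (d,)

# Step2.2: one normalized attention weight per region; weighted sum of features
alpha = softmax(Vp @ qp)                            # (K,) attention weights
v_hat = alpha @ V                                   # attended image feature

# Step2.3: fuse into the joint representation r (elementwise product assumed)
r = qp * np.tanh(v_hat @ Wv)                        # (d,)
```

A classifier over r (e.g. a softmax over the candidate answer set A) would then yield the answer distribution of Step2.3.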
Further, the specific steps of Step3 are as follows:
step3.1, the attention mechanism uses the question to compute attention weights over image regions to locate the regions associated with the question; the anti-attention mechanism does the opposite: it helps the VQA model overcome language priors by focusing the question on irrelevant regions and ignoring the relevant regions of the image, thereby forming a contrast.
step3.2, using the attention weight α obtained from the attention layer in Step2, a normalized anti-attention weight α' is computed by applying a negation operation, opponent(α) = -α or opponent(α) = e^(-α), which makes larger weights smaller and smaller weights larger, so that the attention weights output by the softmax function focus on irrelevant regions;
step3.3, after the anti-attention weights are learned, the weighted anti-attended image feature is generated; then, similarly to the attention layer, the weighted anti-attended image feature is fused with the question feature q obtained in Step1.2 into a joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is further computed.
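The negation operation of Step3.2 can be checked numerically. The example below uses hypothetical attention weights over 5 regions (region 2 being the most question-relevant) and shows that both candidate operators, opponent(α) = -α and opponent(α) = e^(-α), reverse the ranking, so the most relevant region receives the least anti-attention weight:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Hypothetical normalized attention weights from the attention layer
alpha = np.array([0.04, 0.10, 0.60, 0.20, 0.06])

# The two negation operators given in the description
alpha_neg = softmax(-alpha)             # opponent(a) = -a
alpha_exp = softmax(np.exp(-alpha))     # opponent(a) = e^(-a)

# Both are valid distributions, and both invert the ranking of the regions.
```

Because softmax is monotone and both operators are strictly decreasing in α, the region ordering is exactly reversed, which is what lets the anti-attention branch focus on the regions the attention branch ignored.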
Further, the specific steps of Step4 are as follows:
step4.1, the loss layer contains two branches; the first aims to use the probability distribution of the base VQA model, which is optimized by minimizing a binary cross-entropy loss, the loss function being defined as Lvqa;
step4.2, the other branch is the self-contrast layer, which aims to increase the correlation and dependency between the question and the image using the answer distributions it predicts. An objective function similar to QICE [36] is first considered; building on it, a definite relation is assumed between the answers predicted for a question from the relevant and the irrelevant regions of the same image, namely that the predicted answers should be mutually exclusive. The answers ruled out by the self-contrast layer are therefore excluded, and the self-contrast learning loss Lscl is then proposed to increase the correlation between the question and the image;
step4.3, training the proposed network to optimize the joint loss of the self-contrast loss Lscl of step4.2 and the base VQA classification loss Lvqa of step4.1; in this way the model can focus on the relevant regions to predict the correct answer to a given question about the input image.
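Since the exact form of Lscl is not given in this text, the sketch below uses one plausible mutual-exclusion penalty: the summed pointwise minimum of the two answer distributions, which is 1 when they coincide and 0 when they are fully mutually exclusive. It is combined with a binary cross-entropy Lvqa into the joint objective of Step4.3. The balancing weight lam, the 4-answer vocabulary, and the toy distributions are all assumptions:

```python
import numpy as np

EPS = 1e-12

def bce(p, y):
    """Binary cross-entropy over the answer vocabulary (base VQA loss Lvqa)."""
    return float(-(y * np.log(p + EPS) + (1 - y) * np.log(1 - p + EPS)).mean())

def scl(p, p0):
    """Overlap between the answer distribution from relevant regions (p) and
    from irrelevant regions (p0); minimizing it pushes the two predictions
    toward mutual exclusion, as Step4.2 requires."""
    return float(np.minimum(p, p0).sum())

# Toy 4-answer example
p  = np.array([0.9, 0.05, 0.03, 0.02])   # prediction from attended regions
p0 = np.array([0.1, 0.2, 0.4, 0.3])      # prediction from anti-attended regions
y  = np.array([1.0, 0.0, 0.0, 0.0])      # ground-truth answer

lam = 1.0                                 # assumed balancing weight
loss = bce(p, y) + lam * scl(p, p0)       # joint training objective of Step4.3
```

Gradient descent on this joint loss rewards the attended branch for matching the label while forcing the anti-attended branch to predict something different, which is the self-contrast signal.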
Further, the model of the anti-attention layer is similar to the attention layer. First, using the attention weight α obtained by the attention layer, the normalized anti-attention weight α' is calculated as α' = softmax(opponent(α)). After the anti-attention weights are learned, the anti-attended image feature v̂0 is generated as the weighted sum v̂0 = Σi α'i · vi. The anti-attended image feature v̂0 is then fused with the question feature q obtained in Step1.2 into the joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is calculated as: q'0 = f_q(q), v'0 = f_v(v̂0), r0 = q'0 ∘ v'0, ŝ0 = softmax(w0 · f_0(r0)), where f_v, f_q, f_0 are transformation functions and w0 is a weight matrix to be learned.
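The anti-attention branch above can be sketched end to end in NumPy, under the same assumed dimensions as before; the tanh projections, the elementwise fusion, and the 1000-answer vocabulary are illustrative assumptions rather than the patent's exact parameterization:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K, d_v, d = 36, 2048, 512
V = rng.normal(size=(K, d_v))                     # region features v_i
q = rng.normal(size=d)                            # question feature (already projected)
Wv = rng.normal(scale=d_v ** -0.5, size=(d_v, d))
w0 = rng.normal(scale=d ** -0.5, size=(d, 1000))  # assumed 1000 candidate answers

alpha = softmax(np.tanh(V @ Wv) @ np.tanh(q))     # weights from the attention layer
alpha_p = softmax(-alpha)                         # alpha' = softmax(opponent(alpha))
v0 = alpha_p @ V                                  # anti-attended image feature v^0
r0 = np.tanh(q) * np.tanh(v0 @ Wv)                # joint representation r0 (fusion)
s0 = softmax(r0 @ w0)                             # answer distribution s^0
```

Contrasting s0 with the distribution from the attended branch is what feeds the self-contrast loss of Step4.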
The invention has the beneficial effects that:
1. The present invention solves the VQA language prior problem with a novel self-contrast learning method, which overcomes language priors by comparing the answers generated for question-relevant and question-irrelevant regions of an image.
2. After self-contrast learning training, the model is forced to learn more information from the relevant image regions, which effectively increases the semantic dependency on the image and the interpretability. In this way, image features and question context no longer exist in isolation during modeling.
3. Extensive experiments were performed on the popular benchmarks VQA-CP v1 and VQA-CP v2. The results show that our method significantly improves performance on the benchmark datasets without using additional annotations. In particular, built on top of the LMH model, we achieve state-of-the-art performance of 59.00% on VQA-CP v2, an absolute improvement of 6.51%.
Drawings
FIG. 1 is a block diagram of a language prior method for overcoming visual question-answering based on self-contrast learning;
FIG. 2 is a comparison of the present invention with several variations of the overcome language prior VQA model;
FIG. 3 is an example of the self-contrast learning of the present invention;
FIG. 4 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1-4, a language prior method for overcoming visual question answering based on self-contrast learning is characterized in that: the method comprises the following specific steps:
step1, firstly, taking the questions, the images and answer options as experimental data, secondly, preprocessing the images to extract a feature map, and preprocessing the questions to generate question feature vectors;
step2, using the attention layer to learn to identify image regions relevant to the question; after the preprocessing of Step1, the attention mechanism uses the question to compute attention weights over image regions, thereby locating the image regions associated with the question, and the resulting question feature q and the weighted image feature are fused into a joint representation r;
step3, using the anti-attention layer to identify image regions that are currently irrelevant or weakly relevant; with the attention weights obtained in Step2, the question feature q and the weighted anti-attended image feature are fused into a joint representation r0, focusing the question on irrelevant regions and ignoring the relevant regions of the image so as to form a contrast;
step4, post-processing: with the joint representation r0 from Step3 and the joint representation r from Step2, the proposed network is trained to optimize the joint loss of the self-contrast loss Lscl and the base VQA classification loss Lvqa, so that the model can focus on the relevant regions to predict the correct answer to a given question about the input image.
Further, the Step1 includes the following steps:
step1.1, firstly extracting a series of visual object features from the image using the pre-trained model Faster R-CNN;
step1.2, performing word embedding on the question and passing it to a single-layer GRU to generate the question feature;
further, the specific steps of Step2 are as follows:
step2.1, after the image and question features are extracted, they are passed to the attention layer, which transforms the image features and the question features into a space of the same dimension;
step2.2, computing the attention weights and generating a normalized attention weight for each feature map; the final image feature is the weighted sum of all input features;
step2.3, fusing the weighted image feature with the question feature obtained in Step1.2 into a joint feature representation r, and further computing the probability distribution over each answer a in the candidate answer set A.
Further, the specific steps of Step3 are as follows:
step3.1, the attention mechanism uses the question to compute attention weights over image regions to locate the regions associated with the question; the anti-attention mechanism does the opposite: it helps the VQA model overcome language priors by focusing the question on irrelevant regions and ignoring the relevant regions of the image, thereby forming a contrast.
step3.2, using the attention weight α obtained from the attention layer in Step2, a normalized anti-attention weight α' is computed by applying a negation operation, opponent(α) = -α or opponent(α) = e^(-α), which makes larger weights smaller and smaller weights larger, so that the attention weights output by the softmax function focus on irrelevant regions;
step3.3, after the anti-attention weights are learned, the weighted anti-attended image feature is generated; then, similarly to the attention layer, the weighted anti-attended image feature is fused with the question feature q obtained in Step1.2 into a joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is further computed.
Further, the specific steps of Step4 are as follows:
step4.1, the loss layer contains two branches; the first aims to use the probability distribution of the base VQA model, which is optimized by minimizing a binary cross-entropy loss, the loss function being defined as Lvqa;
step4.2, the other branch is the self-contrast layer, which aims to increase the correlation and dependency between the question and the image using the answer distributions it predicts. An objective function similar to QICE [36] is first considered; building on it, a definite relation is assumed between the answers predicted for a question from the relevant and the irrelevant regions of the same image, namely that the predicted answers should be mutually exclusive. The answers ruled out by the self-contrast layer are therefore excluded, and the self-contrast learning loss Lscl is then proposed to increase the correlation between the question and the image;
step4.3, training the proposed network to optimize the joint loss of the self-contrast loss Lscl of step4.2 and the base VQA classification loss Lvqa of step4.1; in this way the model can focus on the relevant regions to predict the correct answer to a given question about the input image.
Further, the model of the anti-attention layer is similar to the attention layer. First, using the attention weight α obtained by the attention layer, the normalized anti-attention weight α' is calculated as α' = softmax(opponent(α)). After the anti-attention weights are learned, the anti-attended image feature v̂0 is generated as the weighted sum v̂0 = Σi α'i · vi. The anti-attended image feature v̂0 is then fused with the question feature q obtained in Step1.2 into the joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is calculated as: q'0 = f_q(q), v'0 = f_v(v̂0), r0 = q'0 ∘ v'0, ŝ0 = softmax(w0 · f_0(r0)), where f_v, f_q, f_0 are transformation functions and w0 is a weight matrix to be learned.
Extensive experiments were performed on the popular benchmarks VQA-CP v1 and VQA-CP v2. The results show that the method of the present invention significantly improves performance on the benchmark datasets without using additional annotations. In particular, built on top of the LMH model, state-of-the-art performance of 59.00% is achieved on VQA-CP v2, an absolute improvement of 6.51%; the results are shown in Tables 1 and 2.
Table 1 shows the experimental results of the present invention on VQA-CP v2.
Table 2 shows the experimental results of the present invention on VQA-CP v1.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (6)
1. A language prior method for overcoming visual question answering based on self-contrast learning is characterized in that: the method comprises the following specific steps:
step1, firstly, taking the questions, the images and answer options as experimental data, secondly, preprocessing the images to extract a feature map, and preprocessing the questions to generate question feature vectors;
step2, using the attention layer to learn to identify image regions relevant to the question; after the preprocessing of Step1, the attention mechanism uses the question to compute attention weights over image regions, thereby locating the image regions associated with the question, and the resulting question feature q and the weighted image feature are fused into a joint representation r;
step3, using the anti-attention layer to identify image regions that are currently irrelevant or weakly relevant; with the attention weights obtained in Step2, the question feature q and the weighted anti-attended image feature are fused into a joint representation r0, focusing the question on irrelevant regions and ignoring the relevant regions of the image so as to form a contrast;
step4, post-processing: with the joint representation r0 from Step3 and the joint representation r from Step2, training the network to optimize the joint loss of the self-contrast loss Lscl and the base VQA classification loss Lvqa, so as to focus on the relevant regions to predict the correct answer to a given question about the input image.
2. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the Step1 comprises the following steps:
step1.1, firstly extracting a series of visual object features from the image using the pre-trained model Faster R-CNN;
step1.2, performing word embedding on the question and passing it to a single-layer GRU to generate the question feature.
3. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the specific steps of Step2 are as follows:
step2.1, after the image and question features are extracted, they are passed to the attention layer, which transforms the image features and the question features into a space of the same dimension;
step2.2, computing the attention weights and generating a normalized attention weight for each feature map, wherein the final image feature is the weighted sum of all input features;
4. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the specific steps of Step3 are as follows:
step3.1, the attention mechanism uses the question to compute attention weights over image regions to locate the regions associated with the question; the anti-attention mechanism does the opposite, helping the VQA model overcome language priors by focusing the question on irrelevant regions and ignoring the relevant regions of the image to form a contrast;
step3.2, using the attention weight α obtained from the attention layer in Step2, a normalized anti-attention weight α' is computed by applying a negation operation, opponent(α) = -α or opponent(α) = e^(-α), which makes large weights small and small weights large, so that the attention weights output by the softmax function focus on irrelevant regions;
step3.3, after the anti-attention weights are learned, the weighted anti-attended image feature is generated; then, similarly to the attention layer, the weighted anti-attended image feature is fused with the question feature q obtained in Step1 into a joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is further computed.
5. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the specific steps of Step4 are as follows:
step4.1, the loss layer contains two branches; the first aims to use the probability distribution of the base VQA model, which is optimized by minimizing a binary cross-entropy loss, the loss function being defined as Lvqa;
step4.2, the other branch is the self-contrast layer, which aims to increase the correlation and dependency between the question and the image using the answer distributions it predicts; an objective function similar to QICE [36] is first considered, and building on it, a definite relation is assumed between the answers predicted for a question from the relevant and the irrelevant regions of the same image, namely that the predicted answers should be mutually exclusive; the answers ruled out by the self-contrast layer are therefore excluded, and the self-contrast learning loss Lscl is then proposed to increase the correlation between the question and the image;
step4.3, training the proposed network to optimize the joint loss of the self-contrast loss Lscl of step4.2 and the base VQA classification loss Lvqa of step4.1, so as to focus on the relevant regions to predict the correct answer to a given question about the input image.
6. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the model of the anti-attention layer is similar to the attention layer; first, using the attention weight α obtained by the attention layer, the normalized anti-attention weight α' is calculated as α' = softmax(opponent(α)); after the anti-attention weights are learned, the anti-attended image feature v̂0 is generated as the weighted sum v̂0 = Σi α'i · vi; the anti-attended image feature v̂0 is fused with the obtained question feature q into the joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is calculated as: q'0 = f_q(q), v'0 = f_v(v̂0), r0 = q'0 ∘ v'0, ŝ0 = softmax(w0 · f_0(r0)), where f_v, f_q, f_0 are transformation functions and w0 is a weight matrix to be learned.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111557673.3A CN114973041A (en) | 2021-12-20 | 2021-12-20 | Language prior method for overcoming visual question and answer based on self-contrast learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111557673.3A CN114973041A (en) | 2021-12-20 | 2021-12-20 | Language prior method for overcoming visual question and answer based on self-contrast learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114973041A true CN114973041A (en) | 2022-08-30 |
Family
ID=82974506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111557673.3A Pending CN114973041A (en) | 2021-12-20 | 2021-12-20 | Language prior method for overcoming visual question and answer based on self-contrast learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114973041A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117079142A (en) * | 2023-10-13 | 2023-11-17 | 昆明理工大学 | Anti-attention generation countermeasure road center line extraction method for automatic inspection of unmanned aerial vehicle |
CN117079142B (en) * | 2023-10-13 | 2024-01-26 | 昆明理工大学 | Anti-attention generation countermeasure road center line extraction method for automatic inspection of unmanned aerial vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||