CN114913403B - Visual question-answering method based on metric learning

Visual question-answering method based on metric learning

Info

Publication number
CN114913403B
Authority
CN
China
Prior art keywords
visual
features
question
natural language
feature
Prior art date
Legal status
Active
Application number
CN202210839762.5A
Other languages
Chinese (zh)
Other versions
CN114913403A (en)
Inventor
舒昕垚
陆振宇
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202210839762.5A
Publication of CN114913403A
Application granted
Publication of CN114913403B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a visual question-answering method based on metric learning. A self-attention encoder and a cross-attention module are used to align and map natural language question features and visual image features in a high-dimensional feature space; a self-supervised multi-modal metric learning method measures the similarity between the natural language question features and the visual image features and divides the visual image features into positive visual features and negative visual features. The positive visual features and the original visual features, each fused with the natural language question features, are required to yield the correct answer, while the negative visual features fused with the natural language question features must not yield the correct answer. The invention thus measures the similarity of multi-modal features in a high-dimensional feature space and trains the model contrastively on the separated positive and negative visual features, which alleviates the semantic gap and semantic bias problems in visual question answering and improves the performance and robustness of the visual question-answering model.

Description

Visual question-answering method based on metric learning
Technical Field
The invention relates to a visual question-answering method, in particular to a visual question-answering method based on metric learning.
Background
Vision and language are the most important forms of human communication, and visual question answering is an important multi-modal task combining the two. A visual question-answering system answers questions by exploring the content of visual images, which requires a deep understanding of both the visual image and the natural language question. The gap between how a computer understands the image and how it understands the question creates a semantic gap between the two modalities: they cannot be well correlated, which degrades the performance of the visual question-answering model.
Visual question answering is a task that requires answering questions by observing images. However, most current visual question-answering models answer questions by capturing superficial correlations between questions and answers, so many questions can be answered correctly without looking at the image at all; this is the language bias (language prior problem) of visual question-answering models. For example, if all fire hydrants in the training set are red, the model learns to answer "red" whenever it is asked "what color is the fire hydrant"; because the training set only contains red fire hydrants, the model simply ignores the visual information and answers "red" directly. When it is then given a picture containing a green fire hydrant and asked "what color is the fire hydrant", the model will still answer "red", which is clearly unreasonable. Such language bias also means that a visual question-answering model trained on the training set cannot generalize to a test set with a different distribution.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a visual question-answering method based on metric learning that reduces the language bias (language prior problem) of the visual question-answering model and improves its accuracy and robustness.
The technical scheme is as follows: the visual question answering method comprises the following steps:
S1, collecting a data set, and selecting an image and a natural language question related to the image as the input of the visual question-answering model;
S2, preprocessing the visual image and the natural language question: extracting regional features of the visual image through an object detection algorithm, including object features and bounding-box features, and extracting features of the natural language question through a language representation algorithm;
S3, forming multi-modal feature pairs from the visual image features and the natural language question features obtained in step S2, and performing feature fusion and alignment with an encoder module; the encoder module comprises a self-attention module and a cross-attention module, wherein the self-attention module uses single-modality encoders and the cross-attention module uses multi-modal cross encoders;
S4, calculating a correlation index between the fused visual image features and the natural language question features with an attention mechanism, and dividing the fused visual image features into positive visual features and negative visual features according to the correlation index;
S5, forming triplets from the positive visual features, the negative visual features and the natural language question features, computing the relationship between the natural language question features and the visual image features with a multi-modal triplet loss function, and screening out the visual image features related to the natural language question;
S6, fusing the original visual features, the positive visual features and the negative visual features with the natural language question features, respectively; the feature fusion uses a cross-attention encoder module and finally yields an original fused feature, a positive fused feature and a negative fused feature;
S7, inputting the original fused feature, the positive fused feature and the negative fused feature into an answer prediction module to predict answers, and using a multi-label cross-entropy loss function to compute the loss between the answers obtained from the original and positive fused features and the standard labels, and the loss between the answer obtained from the negative fused feature and a false label;
S8, training the visual question-answering model with the multi-modal triplet loss function and the multi-label cross-entropy loss function, and obtaining the final model parameters once the training conditions are met.
Further, in step S3, two single-modality encoders are established in the self-attention module: a visual object encoder and a natural language question encoder; both consist of a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the two layers;
in the cross-attention module, two multi-modal cross encoders are established: a visual object cross encoder and a natural language question cross encoder; the visual object cross encoder consists of a cross-attention layer, a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the three layers.
Further, in step S4, the attention mechanism calculates a correlation index between the visual image feature and the natural language question feature by using a dot product similarity:
$$S = \cos(Q, V)$$

where $Q$ denotes the 16 aligned and mapped natural language question features, $V$ denotes the 36 aligned and mapped visual image features, and $\cos(\cdot)$ is the cosine function.
Further, in step S5, the multi-modal triplet loss function is:
$$L = \max\big(d(q,\,pos) - d(q,\,neg) + margin,\ 0\big)$$

where $d(\cdot,\cdot)$ is the Euclidean distance, $q$ is the natural language question feature, $pos$ and $neg$ are the positive visual features and negative visual features of the visual image features respectively, $margin$ is a hyper-parameter representing the distance between features, and $\max(\cdot)$ selects the maximum value.
Further, in step S7, the original fused features and the positive fused features can both obtain correct answers through the answer prediction module, and the negative fused features cannot obtain correct answers through the answer prediction module.
Further, in step S8, the visual question-answering model is trained according to the loss values, and training is stopped with an early-stopping method when the accuracy on the validation set drops significantly, yielding the parameters of the final model.
Compared with the prior art, the invention has the following remarkable effects:
1. The visual question-answering model is improved with a metric learning method, which reduces the language bias of the model and improves its accuracy and robustness;
2. The invention designs a self-supervised training scheme in which the visual image features relevant to the natural language question are screened out; by training the visual question-answering model through metric learning, the model can distinguish which objects in the image are related to the question and which are not when producing a correct answer;
3. The invention designs a contrastive learning scheme: the positive visual features separated by metric learning, as well as the original visual features, are fused with the natural language question features and must yield the correct answer, while the negative visual features fused with the natural language question features must not yield the correct answer. This strengthens the attention of the visual question-answering model, forces it to infer the correct answer from the correct visual image regions, increases its reliance on the visual image when making decisions, and reduces the language bias in the model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a self-attention model of the present invention;
FIG. 3(a) a schematic view of a visual object encoder;
FIG. 3(b) a schematic diagram of a natural language question encoder;
FIG. 4 is a schematic diagram of a multi-modal cross coder of the present invention;
FIG. 5 is a schematic diagram of a feature classification method according to the present invention;
FIG. 6 is a schematic diagram of a multi-modal metric learning method;
FIG. 7 is a schematic diagram of a visual question-answering model training method of the present invention;
FIG. 8 is a schematic diagram of the visual question-answering model of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
As shown in FIG. 1, the visual question-answering method of the invention first collects visual image information and natural language question information; the visual image is analyzed by an object detection algorithm to extract regional features of the image, and the natural language question is encoded with word-embedding-based algorithms. Next, the regional features of the visual image and the natural language question features are fused and mapped. The fused visual image features and natural language question features are then extracted, an attention mechanism computes a correlation index between them, and the fused visual image features are divided into positive visual features and negative visual features according to this index. A self-supervision scheme then forms triplets from these two groups of visual image features and the natural language question features, and the loss is computed with an improved multi-modal triplet loss function. The original, positive and negative visual features are each fused with the natural language features and passed through an answer prediction module to predict answers. Finally, the visual question-answering model is trained and its parameters are optimized.
The specific implementation steps are as follows:
step one, collecting a data set
The VQA-CP v2 (VQA, Visual Question Answering) dataset is used. VQA-CP v2 changes the prior distribution of answers in the VQA v2.0 training and test splits so that, for each question category (65 categories according to the question prefix), the answers have different distributions in the training set and the test set; this prevents the visual question-answering model from giving the most popular answer for certain question types without understanding the image content. VQA-CP v2 contains 265,016 pictures, including COCO (Common Objects in Context image recognition dataset) pictures and abstract scenes; each picture has at least 3 questions, each question has 10 ground-truth answers, and each question has 3 plausible answers.
Step two, extracting the characteristics of the visual image and the natural language problem
The Faster R-CNN algorithm is used to extract the regional features of the visual image, with 36 regional features fixedly selected for each picture, each feature containing the coordinates of its bounding box. Each regional feature corresponds to one object; regional features are more precise than traditional grid features, which makes it easier to observe how similar each object is to the question when performing metric learning. The BERT algorithm is used to extract the features of the natural language question.
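A minimal sketch of this feature-extraction step is shown below. It assumes Python with PyTorch, the HuggingFace transformers BERT implementation, and region features that were pre-extracted offline by an object detector; the exact extraction pipeline, file format and the helper `load_region_features` are assumptions for illustration, not the patent's implementation.

```python
import torch
from transformers import BertTokenizer, BertModel

def encode_question(question: str, device: str = "cpu") -> torch.Tensor:
    """Encode a natural language question into token-level BERT features."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
    # Pad/truncate to 16 tokens, matching the 16 question features used later.
    inputs = tokenizer(question, padding="max_length", truncation=True,
                       max_length=16, return_tensors="pt").to(device)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state  # shape: (1, 16, 768)

def load_region_features(path: str):
    """Load 36 pre-extracted Faster R-CNN region features and bounding boxes.

    `path` is a hypothetical .pt file holding {"features": (36, 2048),
    "boxes": (36, 4)} produced offline by the object detector.
    """
    data = torch.load(path)
    return data["features"], data["boxes"]
```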
Step three, data fusion and alignment
The visual image features and the natural language question features obtained above form multi-modal feature pairs, and an encoder is used to fuse and align the features. The encoder module comprises a self-attention module and a cross-attention module: the self-attention module is a single-modality encoder (as shown in fig. 3(a) and fig. 3(b)) composed of a self-attention layer and a fully connected layer, and the cross-attention module is a multi-modal cross encoder (as shown in fig. 4) composed of a cross-attention layer, a self-attention layer and a fully connected layer.
Self-attention is the core mechanism for data fusion and alignment. Self-attention uses three feature vectors, Query, Key and Value, which are obtained from the same input feature through three different fully connected layers. As shown in fig. 2, self-attention can be roughly divided into three stages: the first stage computes the similarity between Query and Key with a dot-product similarity, the second stage normalizes the similarity coefficients with softmax, and the third stage computes a weighted sum over Value using the normalized weights from the second stage. The formula for self-attention is:
$$\text{Attention}(Q, K, V) = \text{softmax}\big(QK^{T}\big)\,V, \qquad \text{softmax}(a_i) = \frac{e^{a_i}}{\sum_{n} e^{a_n}}$$

where $Q$, $K$ and $V$ denote Query, Key and Value respectively, $T$ denotes the matrix transpose, $e$ is the natural constant, $a_i$ is a similarity coefficient, and $n$ indexes the input similarity coefficients.
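A minimal self-attention sketch following the three stages above is given below, assuming 768-dimensional features; the $1/\sqrt{d_k}$ scaling commonly used in Transformer attention is included as an implementation detail and is an assumption, not stated in the text.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # Query, Key and Value come from the same input through three
        # different fully connected layers.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5  # assumed scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Stage 1: dot-product similarity between Query and Key.
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        # Stage 2: normalize the similarity coefficients with softmax.
        weights = torch.softmax(scores, dim=-1)
        # Stage 3: weighted sum over Value.
        return torch.matmul(weights, v)
```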
In the self-attention module, two single-modality encoders are established, namely a visual object encoder (as shown in fig. 3(a)) and a natural language question encoder (as shown in fig. 3(b)). Both encoders consist of a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the two layers.
As shown in fig. 4, two multi-modal cross encoders are established in the cross-attention module: a visual object cross encoder and a natural language question cross encoder. Each cross encoder mainly consists of a cross-attention layer, a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the three layers. Multi-modal data fusion and alignment are mainly completed by the cross encoders; the cross-attention layer uses a co-attention scheme so that the visual image features and the natural language question features attend to each other for feature fusion.
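A sketch of one such cross-encoder block (cross-attention, then self-attention, then feed-forward, each with a residual connection) is given below, assuming 768-dimensional features and PyTorch's nn.MultiheadAttention; the number of heads and the placement of layer normalization are assumptions.

```python
import torch
import torch.nn as nn

class CrossEncoderBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Cross-attention: this modality queries the other modality (co-attention).
        attn_out, _ = self.cross_attn(x, other, other)
        x = self.norm1(x + attn_out)          # residual connection
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm2(x + attn_out)          # residual connection
        return self.norm3(x + self.ffn(x))    # residual connection
```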
Step four, feature classification based on metric learning
An attention mechanism computes a correlation index between the fused visual image features and the natural language question features, and the fused visual image features are divided into positive visual features and negative visual features according to this index. The attention mechanism is a dot-product similarity computation, and the resulting correlation index between visual image features and natural language question features is used to distinguish positive visual features from negative ones. The dot-product similarity of the visual image features and the natural language question features is:
$$S = \cos(Q, V)$$

where $Q$ denotes the 16 aligned and mapped natural language question features, $V$ denotes the 36 aligned and mapped visual image features, and $\cos(\cdot)$ is the cosine function.
As shown in fig. 5, after the correlation indices are obtained they are sorted by magnitude, and the 36 visual image features are divided into 20 features related to the natural language question features and 16 features unrelated to them. The visual image features related to the natural language question features are called positive visual features, and the unrelated ones are called negative visual features.
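A sketch of this split is given below, assuming fused question features q of shape (16, 768) and fused visual features v of shape (36, 768); scoring each visual region by its mean cosine similarity to the 16 question features is an assumption about how the per-token scores are aggregated.

```python
import torch
import torch.nn.functional as F

def split_visual_features(q: torch.Tensor, v: torch.Tensor, num_pos: int = 20):
    # Cosine similarity between every (question token, visual region) pair.
    sim = F.cosine_similarity(q.unsqueeze(1), v.unsqueeze(0), dim=-1)  # (16, 36)
    score = sim.mean(dim=0)                       # correlation index per region
    order = score.argsort(descending=True)
    pos_idx, neg_idx = order[:num_pos], order[num_pos:]
    return v[pos_idx], v[neg_idx]                 # 20 positive, 16 negative
```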
To screen out the visual image features related to the natural language question more accurately, a self-supervision mechanism is added when training the visual question-answering model, and metric learning is used to learn the relationship between the natural language question features and the positive and negative visual features. As shown in fig. 6, the mapped natural language question features form triplets with the positive visual features and the negative visual features of the visual image, and the relationship among them is computed with a multi-modal triplet loss function for training. The multi-modal triplet loss function is:
$$L_{triplet} = \max\big(d(q,\,pos) - d(q,\,neg) + margin,\ 0\big)$$

where $d(\cdot,\cdot)$ is the Euclidean distance, $q$ is the natural language question feature, $pos$ and $neg$ are the positive and negative visual features of the visual image features respectively, $margin$ is a hyper-parameter representing the distance between features (typically set to 0.8), and $\max(\cdot)$ selects the maximum value.
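A minimal sketch of this multi-modal triplet loss is shown below. It assumes the question features and the positive/negative visual feature sets are each mean-pooled to a single vector before computing Euclidean distances; the pooling choice is an assumption, not stated in the text.

```python
import torch

def multimodal_triplet_loss(q: torch.Tensor, pos: torch.Tensor,
                            neg: torch.Tensor, margin: float = 0.8) -> torch.Tensor:
    anchor = q.mean(dim=0, keepdim=True)        # (1, 768)
    positive = pos.mean(dim=0, keepdim=True)
    negative = neg.mean(dim=0, keepdim=True)
    d_pos = torch.cdist(anchor, positive, p=2)  # Euclidean distance d(q, pos)
    d_neg = torch.cdist(anchor, negative, p=2)  # Euclidean distance d(q, neg)
    # max(d(q, pos) - d(q, neg) + margin, 0)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```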
Step five, an answer prediction module
The answer prediction module consists of a two-layer fully connected neural network with ReLU activation. Weight normalization is also added to the answer prediction module to speed up training of the neural network. In the answer prediction module, the natural language question features fused with the visual image features (or with the extracted positive or negative visual features) are first extracted and then pooled to obtain the final fused feature. The pooling layer is a single neural network layer with an output dimensionality of 768 and a Tanh activation function. The final fused feature obtained through pooling is fed into the answer prediction module, which outputs a 2274-dimensional answer vector.
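A sketch of this module is given below: a 768-dimensional pooling layer with Tanh activation, followed by a two-layer fully connected network with ReLU and weight normalization, outputting 2274 answer scores. The hidden width of the classifier and the CLS-style pooling of the first token are assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class AnswerPredictor(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 1024, num_answers: int = 2274):
        super().__init__()
        # 768-dimensional pooling layer with Tanh activation.
        self.pooler = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        # Two-layer fully connected classifier with ReLU and weight normalization.
        self.classifier = nn.Sequential(
            weight_norm(nn.Linear(dim, hidden)), nn.ReLU(),
            weight_norm(nn.Linear(hidden, num_answers)))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, seq_len, dim) question features fused with visual features.
        pooled = self.pooler(fused[:, 0])       # pool the first token (assumption)
        return self.classifier(pooled)          # (batch, 2274) answer scores
```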
Step six, calculating a loss function
Two losses are computed when training the visual question-answering model: the first is the multi-modal triplet loss, which models the relationship between the natural language question features and the visual image features; the second is the multi-label cross-entropy loss between the predicted answers and the labels.
Metric learning is a method of spatial mapping, and can learn a feature space in which the distance between features of similar samples is smaller and the distance between features of dissimilar samples is larger, so as to distinguish them. The multi-modal triplet loss function is thus:
$$L_{triplet} = \max\big(d(q,\,pos) - d(q,\,neg) + margin,\ 0\big)$$

where $q$ is the natural language question feature, $pos$ is the positive visual feature, $neg$ is the negative visual feature, $d(\cdot,\cdot)$ is the Euclidean distance, and $margin$ is the hyper-parameter (typically set to 0.8).
The multi-label cross-entropy loss function is:

$$L_{ce} = -\sum_{i=1}^{k}\Big[y_i \log \sigma(x_i) + (1 - y_i)\log\big(1 - \sigma(x_i)\big)\Big]$$

where $x_i$ is the predicted score for the $i$-th answer, $y_i$ is the standard label for the $i$-th answer, $k$ is the total number of answers, $i = 1, 2, \dots, k$, and $\sigma(\cdot)$ is the sigmoid function.
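A minimal sketch of this multi-label cross-entropy loss is shown below, using PyTorch's binary cross-entropy with logits, which applies the sigmoid internally; normalizing by the batch size is an assumption.

```python
import torch
import torch.nn.functional as F

def multilabel_ce_loss(pred_logits: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
    # pred_logits: (batch, 2274) raw answer scores x
    # soft_labels: (batch, 2274) standard soft labels y in [0, 1]
    return F.binary_cross_entropy_with_logits(
        pred_logits, soft_labels, reduction="sum") / pred_logits.size(0)
```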
step seven, a comparison learning module
As shown in fig. 7, the original visual features, the positive visual features and the negative visual features are each fused with the natural language question features to obtain the original fused feature, the positive fused feature and the negative fused feature. The original fused feature and the positive fused feature should both yield the correct answer through the answer prediction module, while the negative fused feature should not. The losses between the answers predicted from the original and positive fused features and the standard labels are computed, and the loss between the answer predicted from the negative fused feature and a false label is computed; training then proceeds in a contrastive manner, using the positive and negative visual features together with the original features.
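A sketch of how the three fused branches could be trained contrastively is given below; using an all-zeros label vector as the "false label" and simply summing the loss terms are both assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(pred_orig, pred_pos, pred_neg, labels, triplet_loss):
    bce = F.binary_cross_entropy_with_logits
    false_labels = torch.zeros_like(labels)     # hypothetical "false label"
    return (bce(pred_orig, labels)              # original fusion -> correct answer
            + bce(pred_pos, labels)             # positive fusion -> correct answer
            + bce(pred_neg, false_labels)       # negative fusion -> false label
            + triplet_loss)                     # metric-learning term
```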
Step eight, training a visual question-answer model
Finally, the losses are computed and the visual question-answering model (as shown in fig. 8) is trained according to the loss values, optimizing its parameters. The maximum number of training epochs is set to 30; training is stopped with an early-stopping method when the accuracy on the validation set drops significantly, and the model parameters with the highest validation accuracy are taken as the parameters of the final model.
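A sketch of such a training loop with early stopping is given below; the helpers train_one_epoch and evaluate are hypothetical, and the patience used to decide that validation accuracy has dropped significantly is an assumption.

```python
import copy

def train(model, train_loader, val_loader, optimizer,
          max_epochs: int = 30, patience: int = 3):
    best_acc, best_state, bad_epochs = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
        acc = evaluate(model, val_loader)                 # hypothetical helper
        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                    # early stopping
                break
    model.load_state_dict(best_state)   # keep the best-validation parameters
    return model
```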
In order to show the beneficial effects of the invention, the invention adopts the standard visual question-answer evaluation index to evaluate the visual question-answer model:
$$\text{Accuracy}(a) = \min\!\left(\frac{\#\{\text{annotators who answered } a\}}{3},\ 1\right)$$

where the numerator counts the human annotators whose answer matches the candidate answer $a$; if at least 3 annotators provided the candidate answer, the prediction is scored as 100%.
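A minimal sketch of this standard VQA accuracy metric:

```python
def vqa_accuracy(predicted: str, human_answers: list) -> float:
    """An answer is fully credited once at least 3 human annotators gave it."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: 10 annotators, 4 of whom answered "red" -> accuracy 1.0
# vqa_accuracy("red", ["red"] * 4 + ["orange"] * 6)
```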
The experimental results, evaluated on the VQA-CP v2 validation set, are shown in Table 1.
Table 1. Results of the metric learning-based visual question-answering method
As shown in Table 1, after adding the metric learning method the overall accuracy of the visual question-answering model, as well as the accuracy on Yes/No questions, improves substantially; the semantic gap and semantic bias problems in visual question answering are effectively alleviated, and the performance and robustness of the visual question-answering model are improved.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only illustrative of the present invention, and are not intended to limit the scope of the present invention, and any person skilled in the art should understand that equivalent changes and modifications made without departing from the concept and principle of the present invention should fall within the protection scope of the present invention.

Claims (6)

1. A visual question-answering method based on metric learning is characterized by comprising the following steps:
S1, collecting a data set, and selecting an image and a natural language question related to the image as the input of the visual question-answering model;
S2, preprocessing the visual image and the natural language question: extracting regional features of the visual image through an object detection algorithm, including object features and bounding-box features, and extracting features of the natural language question through a language representation algorithm;
S3, forming multi-modal feature pairs from the visual image features and the natural language question features obtained in step S2, and performing feature fusion and alignment with an encoder module; the encoder module comprises a self-attention module and a cross-attention module, wherein the self-attention module uses single-modality encoders and the cross-attention module uses multi-modal cross encoders;
S4, calculating a correlation index between the fused visual image features and the natural language question features with an attention mechanism, and dividing the fused visual image features into positive visual features and negative visual features according to the correlation index;
S5, forming triplets from the positive visual features, the negative visual features and the natural language question features, computing the relationship between the natural language question features and the visual image features with a multi-modal triplet loss function, and screening out the visual image features related to the natural language question;
S6, fusing the original visual features, the positive visual features and the negative visual features with the natural language question features, respectively; the feature fusion uses a cross-attention encoder module and finally yields an original fused feature, a positive fused feature and a negative fused feature;
S7, inputting the original fused feature, the positive fused feature and the negative fused feature into an answer prediction module to predict answers, and using a multi-label cross-entropy loss function to compute the loss between the answers obtained from the original and positive fused features and the standard labels, and the loss between the answer obtained from the negative fused feature and a false label;
S8, training the visual question-answering model with the multi-modal triplet loss function and the multi-label cross-entropy loss function, and obtaining the final model parameters once the training conditions are met.
2. The visual question-answering method based on metric learning of claim 1, wherein in step S3, two single-modality encoders are established in the self-attention module: a visual object encoder and a natural language question encoder; both consist of a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the two layers;
in the cross-attention module, two multi-modal cross encoders are established: a visual object cross encoder and a natural language question cross encoder; the visual object cross encoder consists of a cross-attention layer, a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the three layers.
3. The visual question-answering method based on metric learning of claim 1, wherein in the step S4, the attention mechanism calculates a correlation index between visual image features and natural language question features by using a dot product similarity:
$$S = \cos(Q, V)$$

wherein $Q$ denotes the 16 aligned and mapped natural language question features, $V$ denotes the 36 aligned and mapped visual image features, and $\cos(\cdot)$ is the cosine function.
4. The visual question-answering method based on metric learning of claim 1, wherein in the step S5, the multi-modal triplet loss function is:
$$L = \max\big(d(q,\,pos) - d(q,\,neg) + margin,\ 0\big)$$

wherein $d(\cdot,\cdot)$ is the Euclidean distance, $q$ is the natural language question feature, $pos$ and $neg$ are the positive visual features and negative visual features of the visual image features respectively, $margin$ is a hyper-parameter representing the distance between features, and $\max(\cdot)$ selects the maximum value.
5. The visual question-answering method based on metric learning of claim 1, wherein in step S7, the original fused feature and the positive fused feature can both obtain correct answers through the answer prediction module, and the negative fused feature cannot obtain correct answers through the answer prediction module.
6. The visual question-answering method based on metric learning of claim 1, wherein in step S8, the visual question-answering model is trained according to the loss values, and training is stopped with an early-stopping method when the accuracy on the validation set drops significantly, to obtain the parameters of the final model.
CN202210839762.5A 2022-07-18 2022-07-18 Visual question-answering method based on metric learning Active CN114913403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210839762.5A CN114913403B (en) 2022-07-18 2022-07-18 Visual question-answering method based on metric learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210839762.5A CN114913403B (en) 2022-07-18 2022-07-18 Visual question-answering method based on metric learning

Publications (2)

Publication Number Publication Date
CN114913403A (en) 2022-08-16
CN114913403B (en) 2022-09-20

Family

ID=82771773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210839762.5A Active CN114913403B (en) 2022-07-18 2022-07-18 Visual question-answering method based on metric learning

Country Status (1)

Country Link
CN (1) CN114913403B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797655B (en) * 2022-12-13 2023-11-07 南京恩博科技有限公司 Character interaction detection model, method, system and device
CN115936073B (en) * 2023-02-16 2023-05-16 江西省科学院能源研究所 Language-oriented convolutional neural network and visual question-answering method
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Medical image problem vision solving method based on fine granularity cross attention

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11417235B2 (en) * 2017-05-25 2022-08-16 Baidu Usa Llc Listen, interact, and talk: learning to speak via interaction
US11074829B2 (en) * 2018-04-12 2021-07-27 Baidu Usa Llc Systems and methods for interactive language acquisition with one-shot visual concept learning through a conversational game
CN110147457B (en) * 2019-02-28 2023-07-25 腾讯科技(深圳)有限公司 Image-text matching method, device, storage medium and equipment
CN110516530A (en) * 2019-07-09 2019-11-29 杭州电子科技大学 A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature

Also Published As

Publication number Publication date
CN114913403A (en) 2022-08-16

Similar Documents

Publication Publication Date Title
CN114913403B (en) Visual question-answering method based on metric learning
CN111709409B (en) Face living body detection method, device, equipment and medium
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
WO2023065545A1 (en) Risk prediction method and apparatus, and device and storage medium
US11106951B2 (en) Method of bidirectional image-text retrieval based on multi-view joint embedding space
CN108171209B (en) Face age estimation method for metric learning based on convolutional neural network
EP2063393B1 (en) Color classifying method, color recognizing method, color classifying device, color recognizing device, color recognizing system, computer program, and recording medium
CN110135459B (en) Zero sample classification method based on double-triple depth measurement learning network
CN111325115A (en) Countermeasures cross-modal pedestrian re-identification method and system with triple constraint loss
CN109753571B (en) Scene map low-dimensional space embedding method based on secondary theme space projection
CN111611367B (en) Visual question-answering method introducing external knowledge
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN115860152B (en) Cross-modal joint learning method for character military knowledge discovery
CN115050064A (en) Face living body detection method, device, equipment and medium
CN111666852A (en) Micro-expression double-flow network identification method based on convolutional neural network
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion
CN116012922A (en) Face image gender identification method suitable for mask wearing state
CN115471885A (en) Action unit correlation learning method and device, electronic device and storage medium
TW202125323A (en) Processing method of learning face recognition by artificial intelligence module
CN112269892B (en) Based on multi-mode is unified at many levels Interactive phrase positioning and identifying method
CN109886206B (en) Three-dimensional object identification method and equipment
WO2023160666A1 (en) Target detection method and apparatus, and target detection model training method and apparatus
CN116468043A (en) Nested entity identification method, device, equipment and storage medium
CN115050044B (en) Cross-modal pedestrian re-identification method based on MLP-Mixer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant