CN110717431B - Fine-grained visual question and answer method combined with multi-view attention mechanism - Google Patents

Fine-grained visual question and answer method combined with multi-view attention mechanism

Info

Publication number
CN110717431B
CN110717431B
Authority
CN
China
Prior art keywords
attention
image
question
weight
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910927585.4A
Other languages
Chinese (zh)
Other versions
CN110717431A (en)
Inventor
彭淑娟
李磊
柳欣
范文涛
钟必能
杜吉祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN201910927585.4A priority Critical patent/CN110717431B/en
Publication of CN110717431A publication Critical patent/CN110717431A/en
Application granted granted Critical
Publication of CN110717431B publication Critical patent/CN110717431B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]

Abstract

The invention relates to a fine-grained visual question-answering method combined with a multi-view attention mechanism. The method fully considers the guiding effect of the specific semantics of a question and provides a multi-view attention model that can effectively select multiple salient target regions related to the current task target (the question). It learns region information related to the answer in the image and the question text from multiple views and extracts regional saliency features from the image under the guidance of question semantics, yielding finer-grained feature expression and a stronger ability to describe cases in which an image contains several important semantic regions. This increases the effectiveness and comprehensiveness of the multi-view attention model, effectively strengthens the semantic relevance between the salient image-region features and the question features, and improves the accuracy and comprehensiveness of semantic understanding in visual question answering. The method performs the visual question-answering task with simple steps, high efficiency, and high accuracy; it is ready for commercial application and has good market prospects.

Description

Fine-grained visual question and answer method combined with multi-view attention mechanism
Technical Field
The invention relates to the technical field of computer vision and natural language processing, and in particular to a fine-grained visual question-answering method combined with a multi-view attention mechanism.
Background
With the rapid development of computer vision and natural language processing, visual question answering has become an increasingly popular research area in artificial intelligence. Visual question answering is an emerging topic whose task combines the two fields of computer vision and natural language processing: given an image and a natural-language question about that image as input, it generates a natural-language answer as output. Visual question answering is a key application direction in artificial intelligence and, by simulating real-world scenes, can help visually impaired users engage in real-time human-computer interaction.
In essence, a visual question-answering system is treated as a classification task: the common practice is to extract image and question features from the given image and question, fuse the two kinds of features, and classify the fused representation to obtain the answer. In recent years, visual question answering has attracted a great deal of attention in the fields of computer vision and natural language processing. Because visual question answering is relatively complex and requires both image and text processing, some existing methods lack accuracy and face major challenges.
In practical applications, visual question-answering systems often face high-dimensional images and noise effects that can interfere with an algorithm's answer prediction. An effective visual question-answering model must therefore mine the structural features and semantically correlated parts of the image that are consistent with the question semantics in order to make fine-grained predictions.
The visual attention model is a computer simulation of the human visual attention mechanism, used to obtain the most noticeable part of an image, namely its salient region. In visual question answering, most methods that use a single attention mechanism ignore differences in the structural semantics of the image and perform poorly when the image contains several important regions, so the attention they produce inevitably limits the accuracy of visual question answering.
Research shows that most existing visual question-answering methods predict the semantic answer from the question and the whole picture without considering the guiding effect of the question's specific semantics, so the image-region features these models learn are only weakly related to the question features in semantic space.
In summary, in the prior art, effective visual question answering methods still need to be improved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a fine-grained visual question-answering method combined with a multi-view attention mechanism, which can effectively improve the accuracy and comprehensiveness of visual semantic information extraction and reduce the influence of redundant data and noise data, thereby improving the fine-grained identification capability and the judgment of complex problems of a visual question-answering system and improving the accuracy and the interpretability of a model of the visual question-answering system to a certain extent.
The technical scheme of the invention is as follows:
a fine-grained visual question-answering method combined with a multi-view attention mechanism comprises the following steps:
1) Inputting an image and extracting image features; inputting a question text and extracting question features;
2) Inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features of step 1) with the attention weight to obtain fine-grained image features;
3) Fusing the fine-grained image features with the question features to obtain fused features;
4) Inputting the fused features into a classifier and predicting the answer.
Preferably, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight reflecting that different target regions in the image receive different amounts of attention.
Preferably, the method for obtaining the single attention weight is as follows:

Input the image features and question features into the upper-layer attention model, project each into a common dimensional space with a fully connected layer, and normalize the vectors with the ReLU activation function; then fuse them with the Hadamard product and pass the result through two fully connected layers in turn to obtain the learned unnormalized weight

$$\hat{a}_u = W_{u_2}\,\mathrm{ReLU}\big(W_{u_1}\,(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$$

Finally, normalize the weight with the softmax function to obtain the single attention weight

$$a_u = \mathrm{softmax}(\hat{a}_u)$$

where $V_i \in \mathbb{R}^{d \times K}$ is the image feature, $q_i$ is the question feature, and $W_v$, $W_q$, $W_{u_1}$, $W_{u_2}$ are the weight parameters to be learned by the upper-layer attention model; K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set by the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).
Preferably, the method for obtaining the saliency attention weight is as follows:

Input the image features and question features into the lower-layer attention model, project each into a common dimensional space with a fully connected layer, and then compute the correlation matrix

$$C_i = \mathrm{ReLU}(q_i^{\top} W_b V_i)$$

where $W_b$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting correlation matrix.

Multiply the correlation matrix, treated as a feature, by the question features and fuse the result with the input image features through learned parameters $W_{b_1}$, $W_{b_2}$, giving the fused unnormalized weight $\hat{a}_b$. Finally, normalize the weight with the softmax function and output the saliency attention weight

$$a_b = \mathrm{softmax}(\hat{a}_b)$$

where $W_b$, $W_{b_1}$, and $W_{b_2}$ are the weight parameters to be learned by the lower-layer attention model.
Preferably, the attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

where $\beta_1$ and $\beta_2$ are hyper-parameters giving the weight ratio of the upper-layer and lower-layer attention models.
Preferably, in step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the vectors are normalized with the ReLU activation function; the two are then fused with the Hadamard product to obtain the fused feature

$$h_i = f_v(\hat{v}_i) \circ f_q(q_i)$$
Preferably, in step 4), the fused feature is passed through a nonlinear layer $f_o$, in which the vector is normalized with the ReLU activation function; a linear mapping $w_o$ then predicts the candidate scores of the answers

$$\hat{s} = \sigma(w_o\, f_o(h_i))$$

and finally the highest-scoring answer is selected as the output, where $\sigma$ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.
Preferably, the sigmoid activation function normalizes the final scores into the interval (0, 1), so that the last stage acts as a logistic regression predicting the correctness of each candidate answer, with objective function

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$$

where the indices z and k range over the M training questions and N candidate answers respectively, and $s_{zk}$ is the ground-truth answer score for the question.
Preferably, in step 1), a Faster R-CNN standard model is used to extract features from the input image $I_i$, obtaining the deeply expressed image features $V_i = \mathrm{FasterRCNN}(I_i)$.
Preferably, in step 1), after the question text $Q_i$ is input, the text is first split into words using spaces and punctuation marks and initialized with the pre-trained GloVe word-embedding method to obtain the encoded form of the i-th question

$$Q_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$$

where $x_t^{(i)}$ denotes the index of the t-th word in the vocabulary. $Q_i$ is then fed into an LSTM network, and the last-layer output $q_i$ is taken as the representation of $Q_i$, giving the question feature $q_i$.
The invention has the following beneficial effects:
the invention provides a multi-view attention model by combining a fine-grained vision question-answering method of a multi-view attention mechanism, which can effectively select a plurality of significant target areas related to a current task target (question), extract area significance characteristics in an image under the guidance of question semantics, has fine-grained characteristic expression, expresses the condition that a plurality of important semantic expression areas exist in the image, and has strong depicting capability.
The method fully considers the guiding effect of the specific semantics of the question and learns region information related to the answer from the image and the question text from multiple views, improving the effectiveness and comprehensiveness of the multi-view attention model. This effectively strengthens the semantic relevance between the salient image-region features and the question features and improves the accuracy and comprehensiveness of semantic understanding in visual question answering.
The method performs the visual question-answering task with simple steps, high efficiency, and high accuracy; it is ready for commercial application and has good market prospects.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a multi-view attention model;
FIG. 3 is an attention-weight visualization heat map (a simple attention task);
FIG. 4 is an attention-weight visualization heat map (a task requiring attention to multiple locations in the image);
FIG. 5 is a graph comparing the results of the multi-view attention model of the invention with current state-of-the-art methods;
FIG. 6 is the loss curve from training the final model;
FIG. 7 is the training/validation score curve from training the final model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The invention provides a fine-grained visual question-answering method combined with a multi-view attention mechanism to overcome the defects of the prior art. Visual question answering can be regarded as a multi-task classification problem, with each answer treated as a classification category. In a typical visual question-answering system, answers are encoded with the One-Hot method to obtain a One-Hot vector for each answer, forming an answer vector table. One-Hot encoding represents categorical variables as binary vectors: each categorical value is first mapped to an integer, and each integer is then represented as a binary vector that is all zeros except for a 1 at the index of that integer.
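As a toy illustration of this encoding (the four-answer vocabulary below is an invented stand-in, not the answer table actually used by the system), a minimal sketch in Python:

```python
# Minimal sketch of One-Hot answer encoding as described above.
# The answer vocabulary is an illustrative assumption.
import numpy as np

answers = ["yes", "no", "2", "red"]              # toy answer vocabulary
answer_to_idx = {a: i for i, a in enumerate(answers)}

def one_hot(answer: str, num_classes: int) -> np.ndarray:
    """Map an answer to a binary vector: all zeros except a 1 at its index."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[answer_to_idx[answer]] = 1.0
    return vec

print(one_hot("no", len(answers)))               # [0. 1. 0. 0.]
```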
As shown in fig. 1, the fine-grained visual question-answering method combined with a multi-view attention mechanism according to the present invention generally comprises the following steps:
1) Inputting an image and extracting image features; inputting a question text and extracting question features;
2) Inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features of step 1) with the attention weight to obtain fine-grained image features;
3) Fusing the fine-grained image features with the question features to obtain fused features;
4) Inputting the fused features into a classifier and predicting the answer.
In this embodiment, in step 1), the Faster R-CNN standard model is used to extract features from the input image $I_i$, obtaining the image features $V_i = \mathrm{FasterRCNN}(I_i)$. Let K be the number of spatial regions of the image features; the image features can then be further expressed as

$$V_i = \{v_1^{(i)}, v_2^{(i)}, \ldots, v_K^{(i)}\}, \quad v_k^{(i)} \in \mathbb{R}^d$$

where $v_k^{(i)}$ is the k-th regional feature extracted by Faster R-CNN and d is the number of hidden neurons in the network layer, which also represents the output dimension.
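A hedged sketch of this extraction step is shown below, using torchvision's off-the-shelf Faster R-CNN rather than the patent's exact detector configuration; the backbone choice, the cap of K = 36 regions, the pooled feature dimension d = 256, and the use of one FPN level are all illustrative assumptions:

```python
# Sketch: pool K regional features V_i from an image with a Faster R-CNN,
# assuming torchvision's pretrained detector and a single FPN level.
import torch
import torchvision
from torchvision.ops import roi_align

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)                  # dummy RGB image in [0, 1]
with torch.no_grad():
    detections = model([image])[0]               # detected boxes = salient regions
    boxes = detections["boxes"][:36]             # keep up to K = 36 regions
    feats = model.backbone(image.unsqueeze(0))["0"]   # stride-4 feature map
    # Pool one fixed-size feature vector per region box.
    regions = roi_align(feats, [boxes], output_size=(1, 1), spatial_scale=0.25)
    V = regions.flatten(1)                       # (K, d) regional features
print(V.shape)
```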
In step 1), after the question text $Q_i$ is input, the text is first split into words using spaces and punctuation marks and initialized with the pre-trained GloVe word-embedding method (Global Vectors for Word Representation) to obtain the encoded form of the i-th question

$$Q_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$$

where $x_t^{(i)}$ denotes the index of the t-th word in the vocabulary. $Q_i$ is then fed into an LSTM network; specifically, a standard LSTM network with 1280 hidden units is used, and the last-layer output $q_i$ is taken as the representation of $Q_i$, giving the question feature $q_i$.
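The question-encoding pipeline can be sketched as follows. The 1280 hidden units follow the text; the toy vocabulary, the randomly initialized embedding table (standing in for real GloVe vectors), and the 300-dimensional embeddings are assumptions:

```python
# Sketch of the question encoder: tokenize on spaces/punctuation, look up
# word embeddings (a stand-in for pre-trained GloVe), run an LSTM, and keep
# the last output as the question feature q_i.
import re
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "what": 1, "color": 2, "is": 3, "the": 4, "dog": 5}
embed = nn.Embedding(len(vocab), 300)            # placeholder for GloVe vectors
lstm = nn.LSTM(input_size=300, hidden_size=1280, batch_first=True)

def encode_question(text: str) -> torch.Tensor:
    tokens = re.findall(r"\w+", text.lower())    # split on spaces/punctuation
    ids = torch.tensor([[vocab.get(t, 0) for t in tokens]])
    out, _ = lstm(embed(ids))                    # (1, T, 1280)
    return out[:, -1]                            # last output as q_i

q = encode_question("What color is the dog?")
print(q.shape)                                   # torch.Size([1, 1280])
```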
Then the acquired image features $V_i$ and the encoded question features $q_i$ are both input into the multi-view attention model, and the attention weight of the image is calculated.
The visual attention mechanism essentially selects, from the image, the target regions most critical to the current task target, so that more attention resources are devoted to those regions to acquire detailed information about the targets of interest while other useless information is suppressed. In the visual question-answering task, semantic expressions are diverse; in particular, some questions require the model to understand semantic expressions among multiple target objects in an image. A single visual attention model therefore cannot effectively mine the relevance between the different semantic objects in the image and the question semantics.
in order to solve the problem, the invention provides a multi-view attention model, which uses two different attention mechanisms to jointly learn important area parts with different semantics, which can be focused on in the problem, so as to obtain a fine-grained attention feature map of an image. The attention model of the multi-view angle is used for paying attention to the image to obtain the attention weight of the image, the weight is used for carrying out image feature weighting to obtain an accumulated vector as a final image feature representation, namely, fine-grained features of the image can be well associated with problem semantics.
As shown in fig. 2, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight reflecting that different target regions in the image receive different amounts of attention.
Specifically, in the upper-layer attention model, the method for obtaining the single attention weight is as follows:

Input the image features and question features into the upper-layer attention model, project each into a common dimensional space with a fully connected layer, and normalize the vectors with the ReLU activation function; then fuse them with the Hadamard product and pass the result through two fully connected layers in turn to obtain the learned unnormalized weight

$$\hat{a}_u = W_{u_2}\,\mathrm{ReLU}\big(W_{u_1}\,(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$$

where $V_i \in \mathbb{R}^{d \times K}$ is the image feature, $q_i$ is the question feature, and $W_v$, $W_q$, $W_{u_1}$, $W_{u_2}$ are the weight parameters to be learned by the upper-layer attention model; K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set by the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).

Finally, the weight is normalized with the softmax function to obtain the single attention weight

$$a_u = \mathrm{softmax}(\hat{a}_u)$$
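A minimal PyTorch sketch of this upper-layer attention is given below. Only the overall structure (projection, ReLU, Hadamard fusion, two fully connected layers, softmax) follows the text; the layer sizes d = 2048, h = 1024 and the K = 36 regions are illustrative assumptions:

```python
# Sketch of the upper-layer (single) attention over K image regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleAttention(nn.Module):
    def __init__(self, d_img=2048, d_q=1280, h=1024):
        super().__init__()
        self.proj_v = nn.Linear(d_img, h)    # project image features
        self.proj_q = nn.Linear(d_q, h)      # project question feature
        self.fc1 = nn.Linear(h, h)           # first of two FC layers
        self.fc2 = nn.Linear(h, 1)           # second FC layer -> one score/region

    def forward(self, V, q):                 # V: (K, d_img), q: (d_q,)
        Vp = F.relu(self.proj_v(V))          # (K, h)
        qp = F.relu(self.proj_q(q))          # (h,)
        fused = Vp * qp                      # Hadamard fusion, q broadcast over K
        scores = self.fc2(F.relu(self.fc1(fused)))    # (K, 1)
        return F.softmax(scores.squeeze(-1), dim=0)   # single attention a_u

a_u = SingleAttention()(torch.rand(36, 2048), torch.rand(1280))
print(a_u.sum())                             # ~1.0 after softmax
```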
With the single attention weight $a_u$, the softmax tends to make one weight large while the remaining weights stay small. Since an image often contains several different semantics, and these semantics are usually expressed visually in different regions, the single attention weight $a_u$ often ignores regions carrying important semantics. To supplement the attention information missed by the upper-layer attention model, the invention further provides the lower-layer attention model. The lower-layer attention model simultaneously accounts for the relevance between the image and the question semantics, realizing a question-guided multi-view attention learning mechanism and increasing the capacity for fine-grained feature mining.
Specifically, in the lower-layer attention model, the method for obtaining the saliency attention weight is as follows:

Input the image features and question features into the lower-layer attention model, project each into a common dimensional space with a fully connected layer, and then compute the correlation matrix

$$C_i = \mathrm{ReLU}(q_i^{\top} W_b V_i)$$

where $W_b$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting correlation matrix.

Multiply the correlation matrix, treated as a feature, by the question features and fuse the result with the input image features through learned parameters $W_{b_1}$, $W_{b_2}$, giving the fused unnormalized weight $\hat{a}_b$. Finally, normalize the weight with the softmax function and output the saliency attention weight

$$a_b = \mathrm{softmax}(\hat{a}_b)$$

where $W_b$, $W_{b_1}$, and $W_{b_2}$ are the weight parameters to be learned by the lower-layer attention model; the parameter dimensions are set consistently with the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension of the layer, and ReLU is an activation function in the neural network.
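A sketch of the lower-layer attention follows, under the same caveats. The exact fusion of the correlation matrix with the question and image features is not fully specified above, so the version below (a question context vector per region, Hadamard fusion with the image features, one scoring layer) is only one plausible reading:

```python
# Sketch of the lower-layer (saliency) attention: correlation matrix
# C = ReLU(Q W_b V^T), question context per region, fusion, softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyAttention(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.W_b = nn.Parameter(torch.empty(d, d))   # correlation weights
        nn.init.xavier_uniform_(self.W_b)
        self.fc = nn.Linear(d, 1)                    # scoring layer

    def forward(self, V, Q):                 # V: (K, d) regions, Q: (T, d) words
        C = F.relu(Q @ self.W_b @ V.T)       # correlation matrix (T, K)
        ctx = C.T @ Q                        # question context per region (K, d)
        fused = ctx * V                      # fuse with image features (Hadamard)
        scores = self.fc(fused).squeeze(-1)  # (K,)
        return F.softmax(scores, dim=0)      # saliency attention a_b

a_b = SaliencyAttention()(torch.rand(36, 1024), torch.rand(14, 1024))
print(a_b.shape)                             # torch.Size([36])
```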
The attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

where $\beta_1$ and $\beta_2$ are the weight-ratio parameters of the upper-layer and lower-layer attention models. In practice, the weights of the two models can be balanced by tuning these parameters to achieve better results.

The image features $V_i$ can further be expressed as the set of K spatial region features $V_i = \{v_1^{(i)}, \ldots, v_K^{(i)}\}$. The attention weight $a_i$ is then multiplied with the feature of each spatial region to obtain the fine-grained image feature

$$\hat{v}_i = \sum_{k=1}^{K} a_{ik}\, v_k^{(i)}$$
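Combining the two views and weighting the regional features then takes only a few lines; $\beta_1$ and $\beta_2$ take the values reported in the experiment section below, and the random vectors stand in for real attention outputs:

```python
# Sketch: a = beta1 * a_u + beta2 * a_b, then v_hat = sum_k a_k * v_k.
import torch

beta1, beta2 = 0.7, 0.3                      # weight ratio from the experiments
a_u, a_b = torch.rand(36), torch.rand(36)
a_u, a_b = a_u / a_u.sum(), a_b / a_b.sum()  # stand-ins for softmax outputs

a = beta1 * a_u + beta2 * a_b                # combined attention weight
V = torch.rand(36, 2048)                     # K regional features
v_hat = (a.unsqueeze(-1) * V).sum(dim=0)     # fine-grained image feature
print(v_hat.shape)                           # torch.Size([2048])
```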
In step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the vectors are normalized with the ReLU activation function; the two are then fused with the Hadamard product to obtain the fused feature

$$h_i = f_v(\hat{v}_i) \circ f_q(q_i)$$
Visual question answering is a multi-label classification problem. Further, in step 4), the fused feature is passed through a nonlinear layer $f_o$, in which the vector is normalized with the ReLU activation function; a linear mapping $w_o$ then predicts the candidate scores of the answers

$$\hat{s} = \sigma(w_o\, f_o(h_i))$$

and finally the highest-scoring answer is selected as the output, where $\sigma$ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.
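Steps 3) and 4) together can be sketched as follows; the dimensions and the single-linear-layer realization of $f_v$, $f_q$, $f_o$ are illustrative assumptions:

```python
# Sketch of feature fusion and answer scoring: nonlinear layers f_v, f_q,
# Hadamard fusion, nonlinear layer f_o, linear map w_o, sigmoid scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_v, d_q, h, N = 2048, 1280, 1024, 3000
f_v = nn.Linear(d_v, h)                      # nonlinear layer for image feature
f_q = nn.Linear(d_q, h)                      # nonlinear layer for question
f_o = nn.Linear(h, h)                        # nonlinear layer before scoring
w_o = nn.Linear(h, N)                        # linear map over N candidate answers

v_hat, q = torch.rand(d_v), torch.rand(d_q)
fused = F.relu(f_v(v_hat)) * F.relu(f_q(q))  # Hadamard fusion h_i
scores = torch.sigmoid(w_o(F.relu(f_o(fused))))   # candidate answer scores
print(scores.argmax().item())                # pick the highest-scoring answer
```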
Preferably, the sigmoid activation function normalizes the final scores into the interval (0, 1), so that the last stage acts as a logistic regression predicting the correctness of each candidate answer, with objective function

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$$

where the indices z and k range over the M training questions and N candidate answers respectively, and $s_{zk}$ is the ground-truth answer score for the question.
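This objective is ordinary binary cross-entropy with soft targets; a minimal sketch (the target value 0.9 and the batch shape are illustrative):

```python
# Sketch of the soft-target logistic-regression loss over M questions
# and N candidate answers.
import torch
import torch.nn.functional as F

M, N = 4, 3000
pred = torch.rand(M, N)                      # sigmoid-normalized scores
target = torch.zeros(M, N)
target[:, 7] = 0.9                           # soft ground-truth scores s_zk
loss = F.binary_cross_entropy(pred, target, reduction="sum")
print(loss.item())
```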
Compared with the softmax classifier commonly used in other visual question-answering systems, the logistic-regression classification used here is more effective. The sigmoid function uses soft scores (soft targets) as the target results, providing a richer training signal that effectively captures the occasional uncertainty in ground-truth answers.
To better observe how the attention model attends to the salient regions of the image, after obtaining the attention maps $(a_u, a_b)$ of the single attention weight and the saliency attention weight, the attention maps are visualized as matrix heat maps using the matplotlib plotting library in Python, as shown in figs. 3 and 4.
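A minimal sketch of such a visualization is given below; the random matrices stand in for the real attention maps, and since matplotlib itself has no heatmap function, `plt.imshow` is used (seaborn.heatmap would be an alternative):

```python
# Sketch: render the attention maps (a_u, a_b) as matrix heat maps.
import numpy as np
import matplotlib.pyplot as plt

a_u = np.random.rand(6, 6)                   # attention reshaped to a grid
a_b = np.random.rand(6, 6)
fig, axes = plt.subplots(1, 2)
for ax, a, title in zip(axes, [a_u, a_b], ["attention1", "attention2"]):
    ax.imshow(a, cmap="hot")                 # matrix heat map
    ax.set_title(title)
plt.savefig("attention_heatmaps.png")
```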
Figs. 3 and 4 show the upper-layer and lower-layer attention models of the multi-view attention model on two different task images, where attention1 is the attention visualization of the upper-layer model and attention2 that of the lower-layer model. The heat maps show that the added lower-layer attention model is able to learn different important regions of the input image. As seen in fig. 3, for a simple attention task, both the upper and lower attention models find the correct position in the image. In fig. 4, however, when the task requires attention to several locations in the image, the lower-layer attention model focuses on different parts than the upper-layer model, improving the accuracy of the multi-view attention model; this is an advantage over prior-art models.
Introduction to the test data set: the VQA v2 dataset (Antol S, Agrawal A, Lu J, et al. VQA: Visual Question Answering [C]. Proceedings of the IEEE International Conference on Computer Vision, 2015: 2425-2433.) is a large-scale visual question-answering dataset in which all questions and answers are manually annotated. The dataset contains 443,757 training questions, 214,354 validation questions, and 447,793 test questions. Each image is associated with three questions, and ten answers are provided for each question by annotators. Following the standard visual question-answering task, the questions in this dataset are classified as: yes/no, number, and other.
Further, to verify the effectiveness of the invention, it was compared with the results of the 2017 VQA Challenge winner (Anderson P, He X, Buehler C, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. arXiv preprint arXiv:1707.07998, 2017.). As shown in fig. 5, the original single attention model in the reproduced paper code was replaced with the multi-view attention model; the multi-view attention model of the invention finally scored 64.35%, about 1.2% higher in accuracy than the paper's reported evaluation.
The basic parameters in the experiments were set as follows: the base learning rate was α = 0.0007, the random-deactivation rate after each LSTM layer was dropout = 0.3, and answer screening used N = 3000. The fully connected layers used num_hid = 1024 hidden neurons, and the training batch size was batch_size = 512. The weights of the single attention weight and the saliency attention weight were set to $\beta_1 = 0.7$ and $\beta_2 = 0.3$.
As shown in fig. 6, the loss value of the model decreases as the training epochs increase; fig. 7 shows the model accuracy on the training and validation sets, respectively, as the training epochs increase.
Table 1 compares the invention with representative methods on the VQA task over the public standard dataset VQA v2 in the test-dev setting.
TABLE 1
[Table 1: answer accuracy (%) of each compared method on VQA v2 test-dev, by question type (Yes/No, Number, Other) and overall]
Specifically, the data were evaluated in three categories according to question type, after which the overall evaluation result was computed. The question types are yes/no questions, number questions, and other open-ended questions. The scores in the table are the accuracies of the models on the different question types; larger values indicate higher accuracy. As the table shows, the multi-view attention model of the invention achieves better results across the different tasks.
In particular, the multi-view attention model strengthens fine-grained feature expression and improves object detection and recognition, giving a clear improvement on the Number evaluation compared with previous methods. The overall accuracy of the model is better than that of most existing methods.
The above examples are provided only for illustrating the present invention and are not intended to limit the present invention. Changes, modifications, etc. to the above-described embodiments are intended to fall within the scope of the claims of the present invention as long as they are in accordance with the technical spirit of the present invention.

Claims (7)

1. A fine-grained visual question-answering method combined with a multi-view attention mechanism is characterized by comprising the following steps:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features of step 1) with the attention weight to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier and predicting the answer;
the multi-view attention model comprises an upper layer attention model and a lower layer attention model, wherein a single attention weight is obtained through the upper layer attention model, a significant attention weight is obtained through the lower layer attention model, and the significant attention weight reflects different attention resources corresponding to different target areas in an image;
the method of obtaining a single attention weight is as follows:
inputting image characteristics and problem characteristics to an upper layer attention model, respectively projecting data of the image characteristics and the problem characteristics to a same dimensional space by using a layer of full connection layer, and normalizing vectors by using an activation function ReLu; then, the Hadamard product is used for fusion, and then two full-connection layers are sequentially input for processing learning parameters, and the learned parameters are processed
Figure FDA0003980779750000011
Finally, the weight is normalized by using a softmax function to obtain a single attention weight
Figure FDA0003980779750000012
Wherein the content of the first and second substances,
Figure FDA0003980779750000013
in order to be a feature of the image,
Figure FDA0003980779750000014
in order to be a characteristic of the problem,
Figure FDA0003980779750000015
Figure FDA0003980779750000016
the weight parameters to be learned for the upper layer attention model, K is the number of spatial regions of image features, T is the length of selected problem features, d is the number of hidden neurons in a network layer, h is the output dimension set by the layer, reLu is an activation function in a neural network, and the specific form of the ReLu can be expressed as f (x) = max (0, x);
the method for obtaining the significance attention weight is as follows:
inputting image characteristics and problem characteristics to a lower-layer attention model, respectively projecting data of the image characteristics and the problem characteristics to the same dimensional space by using a full-connection layer, and calculating a correlation matrix C i =ReLu(q i T W b V i ) (ii) a Wherein the content of the first and second substances,
Figure FDA0003980779750000017
the weight parameters to be learned for the underlying attention model,
Figure FDA0003980779750000018
obtaining an incidence matrix;
multiplying the incidence matrix as a characteristic by the problem characteristic, and fusing the incidence matrix with the input image characteristic, wherein the fused parameter is
Figure FDA0003980779750000021
Finally, the weight is normalized by using a softmax function, and the significance attention weight is output
Figure FDA0003980779750000022
Wherein the content of the first and second substances,
Figure FDA0003980779750000023
weight parameters to be learned for the underlying attention model.
2. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein the attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

wherein $\beta_1$ and $\beta_2$ are hyper-parameters giving the weight ratio of the upper-layer and lower-layer attention models.
3. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 2, wherein in step 3) the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the vectors are normalized with the ReLU activation function, and are then fused with the Hadamard product to obtain the fused feature

$$h_i = f_v(\hat{v}_i) \circ f_q(q_i)$$
4. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 3, wherein in step 4) the fused feature is passed through a nonlinear layer $f_o$, in which the vector is normalized with the ReLU activation function; a linear mapping $w_o$ then predicts the candidate scores of the answers

$$\hat{s} = \sigma(w_o\, f_o(h_i))$$

and finally the highest-scoring answer is selected as the output, wherein $\sigma$ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.
5. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 4, wherein the sigmoid activation function normalizes the final scores into the interval (0, 1), the last stage acting as a logistic regression predicting the correctness of each candidate answer, with objective function

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$$

wherein the indices z and k range over the M training questions and N candidate answers respectively, and $s_{zk}$ is the ground-truth answer score for the question.
6. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1) a Faster R-CNN standard model is used to extract features from the input image $I_i$, obtaining the deeply expressed image features $V_i = \mathrm{FasterRCNN}(I_i)$.
7. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1), after the question text $Q_i$ is input, the text is first split into words using spaces and punctuation marks and initialized with the pre-trained GloVe word-embedding method to obtain the encoded form of the i-th question

$$Q_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$$

wherein $x_t^{(i)}$ denotes the index of the t-th word in the vocabulary; $Q_i$ is then fed into an LSTM network, and the last-layer output $q_i$ is taken as the representation of $Q_i$, giving the question feature $q_i$.
CN201910927585.4A 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism Active CN110717431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910927585.4A CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910927585.4A CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Publications (2)

Publication Number Publication Date
CN110717431A CN110717431A (en) 2020-01-21
CN110717431B true CN110717431B (en) 2023-03-24

Family

ID=69211080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910927585.4A Active CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Country Status (1)

Country Link
CN (1) CN110717431B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325000B (en) * 2020-01-23 2021-01-26 北京百度网讯科技有限公司 Language generation method and device and electronic equipment
CN111325243B (en) * 2020-02-03 2023-06-16 天津大学 Visual relationship detection method based on regional attention learning mechanism
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN111860653A (en) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 Visual question answering method and device, electronic equipment and storage medium
CN111984772B (en) * 2020-07-23 2024-04-02 中山大学 Medical image question-answering method and system based on deep learning
CN112100346B (en) * 2020-08-28 2021-07-20 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112163608B (en) * 2020-09-21 2023-02-03 天津大学 Visual relation detection method based on multi-granularity semantic fusion
CN112488111B (en) * 2020-12-18 2022-06-14 贵州大学 Indication expression understanding method based on multi-level expression guide attention network
CN112732879B (en) * 2020-12-23 2022-05-10 重庆理工大学 Downstream task processing method and model of question-answering task
CN112905819B (en) * 2021-01-06 2022-09-23 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113223018A (en) * 2021-05-21 2021-08-06 信雅达科技股份有限公司 Fine-grained image analysis processing method
CN113407794B (en) * 2021-06-01 2023-10-31 中国科学院计算技术研究所 Visual question-answering method and system for inhibiting language deviation
CN113436094B (en) * 2021-06-24 2022-05-31 湖南大学 Gray level image automatic coloring method based on multi-view attention mechanism
CN113408511B (en) * 2021-08-23 2021-11-12 南开大学 Method, system, equipment and storage medium for determining gazing target
CN113792617B (en) * 2021-08-26 2023-04-18 电子科技大学 Image interpretation method combining image information and text information
CN113779298B (en) * 2021-09-16 2023-10-31 哈尔滨工程大学 Medical vision question-answering method based on composite loss
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114417044B (en) * 2022-01-19 2023-05-26 中国科学院空天信息创新研究院 Image question and answer method and device
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Answer Selection Methods Based on Attention Mechanisms; Xiong Xue et al.; Intelligent Computer and Applications; 2018-11-05 (No. 06); full text *

Also Published As

Publication number Publication date
CN110717431A (en) 2020-01-21

Similar Documents

Publication Publication Date Title
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111554268B (en) Language identification method based on language model, text classification method and device
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN110837846B (en) Image recognition model construction method, image recognition method and device
CN113657425B (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN111209384A (en) Question and answer data processing method and device based on artificial intelligence and electronic equipment
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
Wang et al. Spatial–temporal pooling for action recognition in videos
AU2019101138A4 (en) Voice interaction system for race games
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN114239585A (en) Biomedical nested named entity recognition method
KR20200010672A (en) Smart merchandise searching method and system using deep learning
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN110705490A (en) Visual emotion recognition method
CN112131345A (en) Text quality identification method, device, equipment and storage medium
Xia et al. Evaluation of saccadic scanpath prediction: Subjective assessment database and recurrent neural network based metric
Yan Computational Methods for Deep Learning: Theory, Algorithms, and Implementations
CN113378919B (en) Image description generation method for fusing visual sense and enhancing multilayer global features
Chen et al. STRAN: Student expression recognition based on spatio-temporal residual attention network in classroom teaching videos
Gong et al. Human interaction recognition based on deep learning and HMM
Ling et al. A facial expression recognition system for smart learning based on YOLO and vision transformer
Vijayaraju Image retrieval using image captioning
Tamaazousti On the universality of visual and multimodal representations
CN113239678A (en) Multi-angle attention feature matching method and system for answer selection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant