CN112926655A - Image content understanding and visual question and answer VQA method, storage medium and terminal - Google Patents

Image content understanding and visual question and answer VQA method, storage medium and terminal

Info

Publication number
CN112926655A
Authority
CN
China
Prior art keywords
image
fusion
loss
attention module
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110211935.4A
Other languages
Chinese (zh)
Other versions
CN112926655B (en)
Inventor
匡平
张婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110211935.4A priority Critical patent/CN112926655B/en
Publication of CN112926655A publication Critical patent/CN112926655A/en
Application granted granted Critical
Publication of CN112926655B publication Critical patent/CN112926655B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for image content understanding and visual question answering (VQA), a storage medium and a terminal. The method comprises the following steps: inputting the image and the question to be answered into a trained prediction module for answering, wherein the prediction module comprises a fusion attention module, a bilinear model and a classifier which are connected in sequence, and the classifier outputs the answer. The invention completes the tasks of image content understanding and visual question answering (VQA) according to the following idea: represent the image and the question as features, represent the image and the declarative sentence as features, fuse the feature matrices, learn the image features according to the question features, learn the image features according to the correct declarative sentence, guide the model with the correct declarative sentence, and obtain the result. The method therefore fuses the dense interaction between the image and the question keywords, and can learn the dense interaction between image and text so as to infer the relationship between the image and the question keywords.

Description

Image content understanding and visual question and answer VQA method, storage medium and terminal
Technical Field
The invention relates to the technical field of computers, and in particular to a method for image content understanding and visual question answering (VQA), a storage medium and a terminal.
Background
In recent years, image content understanding and Visual Question and Answer (VQA) have attracted increasing interest. Multi-modal fusion of global features is the most straightforward VQA solution. The general processing idea is to express images and questions as global features and then perform probability prediction of answers by using a multi-modal fusion model.
In addition to understanding the visual content of the image, VQA also requires a full understanding of the semantics of the natural-language question. It is therefore necessary to learn both textual attention over the question and visual attention over the image. At present the question is mainly represented with an LSTM, and multi-modal fusion is mainly implemented with residual networks. The drawback of such fusion is that the global feature representation of an image may lose key information about the local image regions the question refers to; most solutions address this with an attention mechanism. The co-attention networks adopted at present learn the attention distribution of each modality separately and then perform fusion.
Since the current network architectures for the VQA problem learn the attention distribution of each modality separately and then perform fusion, they have several drawbacks: (1) the network only learns coarse interactions among the modalities and neglects the dense interaction between image and text, so the current co-attention is not sufficient to infer the relationship between the image and the question keywords; (2) the accuracy on the visual question answering (VQA) task is limited.
Disclosure of Invention
It is an object of the present invention to overcome the disadvantages of the prior art and to provide a method, a storage medium and a terminal for image content understanding and visual question answering VQA.
The purpose of the invention is realized by the following technical scheme:
in a first aspect of the present invention, a method for image content understanding and visual question and answer VQA is provided, comprising the steps of:
inputting the images and the questions to be answered into a trained prediction module for answering; the prediction module comprises a fusion attention module, a bilinear model and a classifier which are connected in sequence, and the classifier outputs answers; the training of the prediction module comprises the sub-steps of:
respectively extracting features from the image and the question, inputting the extracted features into a fusion attention module, and splicing the obtained image fusion feature I(f) and question fusion feature Q(f) to obtain a first splicing result;
respectively extracting features from the image and from the correctness statement of the question, inputting the extracted features into a fusion attention module, and splicing the obtained image fusion feature I(t) and statement fusion feature S(t) to obtain a second splicing result;
performing loss calculation on the first splicing result and the second splicing result to obtain a result loss (f);
inputting the image fusion feature I(f) and the question fusion feature Q(f) into a bilinear model for encoding to obtain the fused feature Z, and obtaining a classification result through a classifier;
converting the correct answer to the question and the classification result into a first vector a (t) and a second vector a (f), respectively;
performing loss calculation on the first vector A (t) and the second vector A (f) to obtain a result loss (c);
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss;
optimizing the fusion attention module, the bilinear model and the classifier by using the final result Loss; the fusion attention module is optimized using Loss(f).
Further, the image feature extraction specifically includes:
performing feature representation on the input image in a bottom-up manner by using a Faster R-CNN trained on Visual Genome data; for each detected target, performing average pooling on its convolutional-layer features to obtain a feature, denoted Xi; the features in the image are finally represented as an image feature matrix X.
Further, extracting features from the questions specifically comprises:
segmenting the input question into words, converting each word into a vector by using a word embedding method, inputting the vectors into a single-layer recurrent neural network, and finally outputting the question feature matrix Y(f);
extracting features from the correctness statement of the question specifically comprises:
segmenting the input declarative statement into words, converting each word into a vector by using a word embedding method, inputting the vectors into a single-layer recurrent neural network, and finally outputting the statement feature matrix Y(t).
Further, the fused attention module comprises a first self-attention module, a second self-attention module, and a scoring attention module;
the first self-attention module receives image features, the second self-attention module receives question features or statement features, the results of the first self-attention module and the results of the second self-attention module are output to the scoring attention module, and image fusion features I (f) and question fusion features Q (f) or image fusion features I (t) and statement fusion features S (t) are output.
Further, the first and second self-attention modules each include:
inputting image characteristics, problem characteristics or statement characteristics, converting the input image characteristics, the problem characteristics or the statement characteristics into matrixes through embedding transformation, and performing dot multiplication on the matrixes and the three matrixes Wq, Wk and Wv to obtain three weight matrixes Qi, Ki and Vi; wherein Wq, Wk, Wv are three trainable weight matrices using uniform distribution;
performing point multiplication on the matrix Qi and the matrix Ki to obtain a Score Score (i), and performing point multiplication on the matrix Qi and the matrices K (i +1), K (i +2), … and K (i + n) respectively to obtain scores Score (i +1), Score (i +2), … and Score (i + n);
SoftMax is carried out on [ Score (i), Score (i +1), … and Score (i + n) ] to obtain proportions [ Ratio (i), Ratio (i +1), … and Ratio (i + n) ];
multiplying the score proportion [ Ratio (i), Ratio (i +1), …, Ratio (i + n) ] with [ Vi, V (i +1), …, V (i + n) ] to obtain weighted values, and adding the weighted values to obtain Ti, namely an n × n attention mechanism diagram, wherein each word corresponds to a weight E (i, j), and the weighted diagram is output by the first self-attention module and the second self-attention module.
Further, the scoring attention module comprises:
the feature of the image feature output by the first self-attention module is ISM, the feature of the question feature or statement feature output by the second self-attention module is QM, both ISM and QM are n x n matrixes, and each position stores the ratio of a certain row to a certain column;
dot multiplication is carried out on i rows and j columns of the ISM and QM to obtain a matrix V:
V(ISM,QM)=(ISM(i,j)*QM(i,j))(i≤n,j≤n)
normalizing the matrix V to obtain V(i,j)′;
multiplying V(i,j)′ by ISM, multiplying the resulting matrix by the weight a to obtain a final matrix S, and normalizing S to obtain the weight matrix M:
Mi=softmax(a*V(i,j)′*ISM(i,j))(i≤n,j≤n)
the image feature IM is finally obtained from the weight matrix M, where m represents the number of network cycles in a particular training session.
Further, the result Loss(f) is obtained by performing a loss calculation on the first splicing result and the second splicing result, wherein F(f) is the first splicing result, F(t) is the second splicing result, and N is the number of samples;
the obtaining of the result loss (c) by performing the loss calculation on the first vector a (t) and the second vector a (f) includes:
Loss(c)=cross-entropy(A(t),A(f))
wherein cross-entropy represents cross-entropy;
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss, wherein the mathematical operation comprises the following steps:
Loss=Loss(c)+λLoss(f)
in the formula, λ is an adjustment parameter.
Further, the inputting the image fusion feature i (f) and the problem fusion feature q (f) into a bilinear model to obtain a fused feature Z, and obtaining a classification result through a classifier, includes:
inputting the image fusion characteristics I (f) and the problem fusion characteristics Q (f) into a pre-trained encoder to obtain a fusion vector Z:
Z=LayerNorm(Wx*IM+Wy*QY)
in the formula, LayerNorm represents normalization over the hidden-state dimension, Wx represents the linear projection matrix of the image features, IM represents the image features, namely I(f), Wy represents the linear projection matrix of the question features, and QY represents the question fusion features, namely Q(f);
mapping the obtained fusion vector Z, namely the feature Z, into a vector S, and calculating the classification result over the N candidate answers by using binary cross-entropy:
S=sigmoid(Z)
in the formula, sigmoid represents the activation function of the neural network.
In a second aspect of the present invention, a storage medium is provided having stored thereon computer instructions which, when executed, perform the steps of one of the image content understanding and visual question and answer VQA methods.
In a third aspect of the present invention, a terminal is provided, comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, the processor executing the computer instructions to perform the steps of the method for image content understanding and visual question and answer VQA.
The invention has the beneficial effects that:
(1) In an exemplary embodiment of the present invention, a feature matrix is extracted from the image, the input question (sentence) is segmented into words (in an exemplary embodiment, the maximum length is 10 words), and the question features are extracted through a neural network; then, by learning the image features according to the question features, and by using the correct declarative sentence to guide the model toward the correct answer, the tasks of image content understanding and visual question answering (VQA) are completed according to the following idea: represent the image and the question as features, represent the image and the declarative sentence as features, fuse the feature matrices, learn the image features according to the question features, learn the image features according to the correct declarative sentence, guide the model with the correct declarative sentence, and obtain the result. The method therefore fuses the dense interaction between the image and the question keywords, and can learn the dense interaction between image and text so as to infer the relationship between the image and the question keywords.
(2) In another exemplary embodiment of the present invention, specific implementation of each step is disclosed, which can improve the accuracy of VQA to 77.34%.
Drawings
FIG. 1 is a flow chart of a method in an exemplary embodiment of the invention;
FIG. 2 is a diagram of a model corresponding to a method in an exemplary embodiment of the invention;
fig. 3 is a schematic structural diagram of a fused attention module in an exemplary embodiment of the invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In a first aspect of the present invention, a method for image content understanding and visual question answering VQA is provided, as shown in fig. 1 and 2, comprising the steps of:
inputting the images and the questions to be answered into a trained prediction module for answering; the prediction module comprises a fusion attention module, a bilinear model and a classifier which are connected in sequence, and the classifier outputs answers; the training of the prediction module comprises the sub-steps of:
respectively extracting features (an image feature matrix X and a problem feature matrix Y (f)) from an image and a problem, inputting the features into a fusion attention module, and splicing the obtained image fusion feature I (f) and the problem fusion feature Q (f) to obtain a first splicing result;
respectively extracting features (an image feature matrix X and a statement feature matrix Y (t)) from the image and the correctness statement of the problem, inputting the features into a fusion attention module, and splicing the obtained image fusion feature I (t) and the statement fusion feature S (t) to obtain a second splicing result;
performing loss calculation on the first splicing result and the second splicing result to obtain a result loss (f);
inputting the image fusion feature I(f) and the question fusion feature Q(f) into a bilinear model for encoding to obtain the fused feature Z, and obtaining a classification result through a classifier;
converting the correct answer to the question and the classification result into a first vector a (t) and a second vector a (f), respectively; (the conversion mode can adopt one-hot coding)
Performing loss calculation on the first vector A (t) and the second vector A (f) to obtain a result loss (c);
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss;
optimizing the fusion attention module, the bilinear model and the classifier by using the final result Loss; the fusion attention module is optimized using Loss(f).
Specifically, in this exemplary embodiment, a feature matrix is first extracted from the image, the input question (sentence) is segmented into words (in an exemplary embodiment, the maximum length is 10 words), and the question features are extracted through a neural network; then, by learning the image features according to the question features, and by using the correct declarative sentence to guide the model toward the correct answer, the tasks of image content understanding and visual question answering (VQA) are completed according to the following idea: represent the image and the question as features, represent the image and the declarative sentence as features, fuse the feature matrices, learn the image features according to the question features, learn the image features according to the correct declarative sentence, guide the model with the correct declarative sentence, and obtain the result. The method therefore fuses the dense interaction between the image and the question keywords, and can learn the dense interaction between image and text so as to infer the relationship between the image and the question keywords.
In addition, as shown in fig. 2, in the training process there is, besides the prediction module, also a labeling module, and a fusion attention module is likewise located in the labeling module.
The image feature matrix X and the question feature matrix Y(f) are input into the fusion attention module, so that the network learns the target question feature Y(f); a "scoring" mechanism is added so that the network learns the image feature X by using the question feature Y(f), and finally the image fusion feature I(f) and the question fusion feature Q(f) are output. Likewise, the image feature matrix X and the statement feature matrix Y(t) are input into the fusion attention module, so that the network learns the feature Y(t) of the correctness statement for the image; the network learns the image feature X by using the statement feature Y(t), and finally the image fusion feature I(t) and the statement fusion feature S(t) are output.
A loss calculation is performed on the first splicing result and the second splicing result to obtain the result Loss(f); the purpose of this step is to make the splicing features of the prediction module approximate the splicing features in the labeling module more and more closely, so that accuracy is improved.
For the loss calculated from A(t) and A(f), as shown in fig. 2, the goal of this step is to make the training result of the prediction module closer and closer to the correct answer of the question, so as to improve accuracy.
Finally, the two losses are combined by a mathematical operation, that is, Loss = Loss(c) + λLoss(f); λ is preset to 0.01 in this embodiment so that the two losses are balanced, which prevents special situations such as one loss becoming too large or the other too small.
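For illustration, a minimal PyTorch-style sketch of one training step with the two branches described above is given below. The callables fusion_attention, bilinear_model and classifier, the mean-squared-error form of Loss(f), and the tensor shapes are assumptions made for this sketch, not the claimed implementation.

```python
import torch
import torch.nn.functional as F

def training_step(X, Y_f, Y_t, A_t, fusion_attention, bilinear_model, classifier, lam=0.01):
    # Prediction branch: image features X together with question features Y(f)
    I_f, Q_f = fusion_attention(X, Y_f)
    F_f = torch.cat([I_f, Q_f], dim=-1)           # first splicing result F(f)

    # Labeling branch: image features X together with correct-statement features Y(t)
    I_t, S_t = fusion_attention(X, Y_t)
    F_t = torch.cat([I_t, S_t], dim=-1)           # second splicing result F(t)

    # Loss(f): pull the prediction-branch splicing toward the labeling-branch splicing
    # (a mean-squared-error form is assumed here)
    loss_f = F.mse_loss(F_f, F_t)

    # Bilinear fusion of I(f) and Q(f), then classification
    Z = bilinear_model(I_f, Q_f)
    A_f = classifier(Z)                            # predicted answer scores A(f)

    # Loss(c): cross-entropy between prediction A(f) and the one-hot correct answer A(t)
    loss_c = F.binary_cross_entropy_with_logits(A_f, A_t)

    # Final loss: Loss = Loss(c) + λ·Loss(f), with λ preset to 0.01
    return loss_c + lam * loss_f
```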
Preferably, in an exemplary embodiment, the extracting features from the image specifically includes:
performing feature representation on the input image in a bottom-up manner by using a Faster R-CNN (with a ResNet-101 backbone) trained on Visual Genome data; for each detected target, performing average pooling on its convolutional-layer features to obtain a feature, denoted Xi; the features in the image are finally represented as an image feature matrix X.
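As a sketch of this step (assuming the per-target region feature maps have already been produced by the bottom-up detector; the function and variable names are illustrative):

```python
import torch

def build_image_feature_matrix(region_feature_maps):
    """region_feature_maps: list of per-target conv feature maps of shape (C, H, W),
    as produced by a bottom-up Faster R-CNN detector."""
    features = []
    for fmap in region_feature_maps:
        # Average pooling over the spatial dimensions gives one feature X_i per target
        features.append(fmap.mean(dim=(1, 2)))
    # Stack all X_i into the image feature matrix X of shape (num_targets, C)
    return torch.stack(features, dim=0)
```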
Preferably, in an exemplary embodiment, the extracting the problem features specifically includes:
segmenting the input question into words (at most 14 words in an exemplary embodiment), converting each word into a vector by using a word embedding method (pre-trained on a large-scale corpus), inputting the vectors into a single-layer recurrent neural network, and finally outputting the question feature matrix Y(f);
extracting features from the correctness statement of the question specifically comprises:
segmenting the input declarative statement into words (at most 14 words in an exemplary embodiment), converting each word into a vector by using a word embedding method (pre-trained on a large-scale corpus), inputting the vectors into a single-layer recurrent neural network, and finally outputting the statement feature matrix Y(t).
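A minimal sketch of this text encoder is shown below; the choice of an LSTM for the single-layer recurrent network, the embedding dimensions, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes a segmented question or declarative statement into a feature matrix Y."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # word embedding
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                           batch_first=True)                          # single-layer RNN

    def forward(self, token_ids):
        # token_ids: (batch, num_words) indices of the segmented sentence (at most 14 words)
        word_vectors = self.embedding(token_ids)   # each word becomes a vector
        Y, _ = self.rnn(word_vectors)              # feature matrix Y(f) or Y(t)
        return Y
```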
More preferably, in an exemplary embodiment, as shown in FIG. 3, the fused attention module includes a first self-attention module, a second self-attention module, and a scored attention module;
the first self-attention module receives image features, the second self-attention module receives question features or statement features, the results of the first self-attention module and the results of the second self-attention module are output to the scoring attention module, and image fusion features I (f) and question fusion features Q (f) or image fusion features I (t) and statement fusion features S (t) are output.
Fig. 3 shows the fused attention module. To finally output the image fusion feature I(f) and the question fusion feature Q(f), or the image fusion feature I(t) and the statement fusion feature S(t), the fusion module needs to be repeated n times.
More preferably, in an exemplary embodiment, the first self-attention module and the second self-attention module each include:
inputting image characteristics, problem characteristics or statement characteristics, converting the input image characteristics, the problem characteristics or the statement characteristics into matrixes through embedding transformation, and performing dot multiplication on the matrixes and the three matrixes Wq, Wk and Wv to obtain three weight matrixes Qi, Ki and Vi; wherein Wq, Wk, Wv are three trainable weight matrices using uniform distribution;
performing point multiplication on the matrix Qi and the matrix Ki to obtain a Score Score (i), and performing point multiplication on the matrix Qi and the matrices K (i +1), K (i +2), … and K (i + n) respectively to obtain scores Score (i +1), Score (i +2), … and Score (i + n);
SoftMax is carried out on [ Score (i), Score (i +1), … and Score (i + n) ] to obtain proportions [ Ratio (i), Ratio (i +1), … and Ratio (i + n) ];
multiplying the score proportion [ Ratio (i), Ratio (i +1), …, Ratio (i + n) ] with [ Vi, V (i +1), …, V (i + n) ] to obtain weighted values, and adding the weighted values to obtain Ti, namely an n × n attention mechanism diagram, wherein each word corresponds to a weight E (i, j), and the weighted diagram is output by the first self-attention module and the second self-attention module.
It should be noted that Wq, Wk, and Wv are preset as three trainable weight matrices initialized by using uniform distribution at first, and then through training, the three matrices are optimized by using back propagation to converge.
In addition, each sentence contains a plurality of words, and each word corresponds to one Q, one K and one V. Assuming that the sentence is I am a man, the meaning here is that the matrix Q of "I" and the matrix K of "I", the matrix K of "am", the matrix K of "a", and the matrix K of "man" are multiplied respectively to obtain score.
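The following is a minimal sketch of such a self-attention module following the steps above (dot-product scores, SoftMax ratios, and a ratio-weighted sum of V); the uniform initialization range and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Wq, Wk, Wv: trainable weight matrices initialized from a uniform distribution
        self.Wq = nn.Linear(dim, dim, bias=False)
        self.Wk = nn.Linear(dim, dim, bias=False)
        self.Wv = nn.Linear(dim, dim, bias=False)
        for layer in (self.Wq, self.Wk, self.Wv):
            nn.init.uniform_(layer.weight, -0.1, 0.1)

    def forward(self, features):
        # features: (n, dim) image, question or statement features
        Q, K, V = self.Wq(features), self.Wk(features), self.Wv(features)
        scores = Q @ K.t()                        # Score(i, j) = Q_i · K_j
        ratios = torch.softmax(scores, dim=-1)    # SoftMax over the scores
        T = ratios @ V                            # ratio-weighted sum of the V vectors
        return T, ratios                          # ratios is the n x n attention map
```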
More preferably, in an exemplary embodiment, the scoring attention module comprises:
the feature of the image feature output by the first self-attention module is ISM, the feature of the question feature or statement feature output by the second self-attention module is QM, both ISM and QM are n x n matrixes, and each position stores the ratio of a certain row to a certain column;
dot multiplication is carried out on i rows and j columns of the ISM and QM to obtain a matrix V:
V(ISM,QM)=(ISM(i,j)*QM(i,j))(i≤n,j≤n)
normalizing the matrix V to obtain V(i,j)′;
multiplying V(i,j)′ by ISM, multiplying the resulting matrix by the weight a to obtain a final matrix S, and normalizing S to obtain the weight matrix M:
Mi=softmax(a*V(i,j)′*ISM(i,j))(i≤n,j≤n)
the image feature IM is finally obtained from the weight matrix M, where m represents the number of network cycles in a particular training session.
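A minimal sketch of the scoring attention step is given below. Because the normalization of V and the final combination into IM are only reproduced as formula images in the publication, a sum-normalization of V and an element-wise weighting of ISM by M are assumed here; the weight a is kept as a scalar parameter.

```python
import torch

def scoring_attention(ISM, QM, a=1.0, eps=1e-8):
    """ISM, QM: n x n attention maps output by the two self-attention modules."""
    V = ISM * QM                                        # V(i, j) = ISM(i, j) * QM(i, j)
    V_norm = V / (V.sum() + eps)                        # assumed normalization giving V'
    S = a * V_norm * ISM                                # final matrix S weighted by a
    M = torch.softmax(S.flatten(), dim=0).view_as(S)    # weight matrix M via softmax
    IM = M * ISM                                        # assumed weighted image feature IM
    return IM
```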
Preferably, in an exemplary embodiment, the result Loss(f) is obtained by calculating the loss between the first splicing result and the second splicing result, wherein F(f) is the first splicing result, F(t) is the second splicing result, and N is the number of samples;
the obtaining of the result loss (c) by performing the loss calculation on the first vector a (t) and the second vector a (f) includes:
Loss(c)=cross-entropy(A(t),A(f))
in the formula, cross-entropy characterizes the difficulty of expressing a probability distribution p by means of a probability distribution q, where p is the correct answer and q is the predicted value; that is, the smaller the cross-entropy value, the closer the two probability distributions are.
Performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss, wherein the mathematical operation comprises the following steps:
Loss=Loss(c)+λLoss(f)
where λ is an adjustment parameter; in an exemplary embodiment, λ is preset to 0.01 in order to balance the two losses and prevent special situations such as one loss being too large or the other too small.
Preferably, in an exemplary embodiment, the inputting the image fusion features i (f) and the problem fusion features q (f) into a bilinear model to encode to obtain fused features Z, and obtaining a classification result through a classifier, includes:
inputting the image fusion characteristics I (f) and the problem fusion characteristics Q (f) into a pre-trained encoder to obtain a fusion vector Z:
Z=LayerNorm(Wx*IM+Wy*QY)
in the formula, LayerNorm represents normalization over the hidden-state dimension, Wx represents the linear projection matrix of the image features, IM represents the image features, namely I(f), Wy represents the linear projection matrix of the question features, and QY represents the question fusion features, namely Q(f);
mapping the obtained fusion vector Z, namely the feature Z, into a vector S, and calculating the classification result over the N candidate answers by using binary cross-entropy:
S=sigmoid(Z)
in the formula, the sigmoid function is the activation function of the neural network and maps variables to values between 0 and 1.
It should be noted that IM here denotes the image feature, which corresponds to the image fusion feature I(f), and QY corresponds to the question fusion feature Q(f). The question feature matrix Y(f) is not itself the quantity computed by the fusion attention module; it only assists the image feature matrix X, while the quantity computed by the fusion attention module from Y(f) is QY, i.e. the question fusion feature Q(f). The statement feature matrix Y(t) is similar and is not described in detail here.
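A minimal sketch of this fusion-and-classification step is given below. It assumes IM and QY have already been pooled into vectors, assumes the fusion form Z = LayerNorm(Wx·IM + Wy·QY) implied by the symbol definitions above, and uses an illustrative linear layer to map Z to the N candidate answers. During training, the sigmoid scores S would be compared with the one-hot correct answer using binary cross-entropy, as described above.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fuses IM (= I(f)) and QY (= Q(f)) into Z and scores the N candidate answers."""
    def __init__(self, img_dim, q_dim, fuse_dim, num_answers):
        super().__init__()
        self.Wx = nn.Linear(img_dim, fuse_dim, bias=False)   # projection of image features
        self.Wy = nn.Linear(q_dim, fuse_dim, bias=False)     # projection of question features
        self.norm = nn.LayerNorm(fuse_dim)                   # LayerNorm over the hidden dimension
        self.out = nn.Linear(fuse_dim, num_answers)          # map Z to the N answers

    def forward(self, IM, QY):
        Z = self.norm(self.Wx(IM) + self.Wy(QY))   # Z = LayerNorm(Wx·IM + Wy·QY)
        S = torch.sigmoid(self.out(Z))             # sigmoid scores S over the N answers
        return S
```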
By adopting the method of the exemplary embodiment, the accuracy of VQA can be improved to 77.34%.
In a further exemplary embodiment of the present invention, based on any one of the above exemplary embodiments, a storage medium is provided having stored thereon computer instructions which, when executed, perform the steps of one of the image content understanding and visual question-and-answer VQA methods.
In another exemplary embodiment of the present invention based on any one of the above exemplary embodiments, there is provided a terminal comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, the processor executing the computer instructions to perform the steps of one of the image content understanding and visual question and answer VQA methods.
Based on such understanding, the technical solutions of the present embodiments may be essentially implemented or make a contribution to the prior art, or may be implemented in the form of a software product stored in a storage medium and including several instructions for causing an apparatus to execute all or part of the steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is to be understood that the above-described embodiments are illustrative only and not restrictive of the broad invention, and that various other modifications and changes in light thereof will be suggested to persons skilled in the art based upon the above teachings. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications of the invention may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A method of image content understanding and visual question answering VQA, characterized in that the method comprises the following steps:
inputting the images and the questions to be answered into a trained prediction module for answering; the prediction module comprises a fusion attention module, a bilinear model and a classifier which are connected in sequence, and the classifier outputs answers; the training of the prediction module comprises the sub-steps of:
respectively extracting features from the image and the question, inputting the extracted features into a fusion attention module, and splicing the obtained image fusion feature I(f) and question fusion feature Q(f) to obtain a first splicing result;
respectively extracting features from the image and from the correctness statement of the question, inputting the extracted features into a fusion attention module, and splicing the obtained image fusion feature I(t) and statement fusion feature S(t) to obtain a second splicing result;
performing loss calculation on the first splicing result and the second splicing result to obtain a result loss (f);
inputting the image fusion feature I(f) and the question fusion feature Q(f) into a bilinear model for encoding to obtain the fused feature Z, and obtaining a classification result through a classifier;
converting the correct answer to the question and the classification result into a first vector a (t) and a second vector a (f), respectively;
performing loss calculation on the first vector A (t) and the second vector A (f) to obtain a result loss (c);
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss;
optimizing the fusion attention module, the bilinear model and the classifier by using the final result Loss; the fusion attention module is optimized using Loss(f).
2. The method of claim 1, wherein the image content understanding and visual question answering VQA is characterized in that: extracting features from the image, specifically comprising:
performing feature representation on the input image in a bottom-up manner by using a Faster R-CNN trained on Visual Genome data; for each detected target, performing average pooling on its convolutional-layer features to obtain a feature, denoted Xi; the features in the image are finally represented as an image feature matrix X.
3. The method of claim 1, wherein the image content understanding and visual question answering VQA is characterized in that: extracting features from the question specifically comprises the following steps:
segmenting the input question into words, converting each word into a vector by using a word embedding method, inputting the vectors into a single-layer recurrent neural network, and finally outputting the question feature matrix Y(f);
extracting features from the correctness statement of the question specifically comprises:
segmenting the input declarative statement into words, converting each word into a vector by using a word embedding method, inputting the vectors into a single-layer recurrent neural network, and finally outputting the statement feature matrix Y(t).
4. The method of claim 1, wherein the image content understanding and visual question answering VQA is characterized in that: the fused attention module comprises a first self-attention module, a second self-attention module and a scoring attention module;
the first self-attention module receives image features, the second self-attention module receives question features or statement features, the results of the first self-attention module and the results of the second self-attention module are output to the scoring attention module, and image fusion features I (f) and question fusion features Q (f) or image fusion features I (t) and statement fusion features S (t) are output.
5. The method of claim 4, wherein the image content understanding and visual question answering VQA is characterized in that: the first and second self-attention modules each include:
inputting image characteristics, problem characteristics or statement characteristics, converting the input image characteristics, the problem characteristics or the statement characteristics into matrixes through embedding transformation, and performing dot multiplication on the matrixes and the three matrixes Wq, Wk and Wv to obtain three weight matrixes Qi, Ki and Vi; wherein Wq, Wk, Wv are three trainable weight matrices using uniform distribution;
performing point multiplication on the matrix Qi and the matrix Ki to obtain a Score Score(i), and performing point multiplication on the matrix Qi and the matrices K(i+1), K(i+2), …, K(i+n) respectively to obtain scores Score(i+1), Score(i+2), …, Score(i+n);
SoftMax is carried out on [ Score (i), Score (i +1), … and Score (i + n) ] to obtain proportions [ Ratio (i), Ratio (i +1), … and Ratio (i + n) ];
multiplying the score proportion [ Ratio (i), Ratio (i +1), …, Ratio (i + n) ] with [ Vi, V (i +1), …, V (i + n) ] to obtain weighted values, and adding the weighted values to obtain Ti, namely an n × n attention mechanism diagram, wherein each word corresponds to a weight E (i, j), and the weighted diagram is output by the first self-attention module and the second self-attention module.
6. The method of claim 5, wherein the image content understanding and visual question answering VQA is characterized in that: the scoring attention module comprises:
the feature of the image feature output by the first self-attention module is ISM, the feature of the question feature or statement feature output by the second self-attention module is QM, both ISM and QM are n x n matrixes, and each position stores the ratio of a certain row to a certain column;
dot multiplication is carried out on i rows and j columns of the ISM and QM to obtain a matrix V:
V(ISM,QM)=(ISM(i,j)*QM(i,j))(i≤n,j≤n)
normalizing the matrix V to obtain V(i,j)′;
multiplying V(i,j)′ by ISM, multiplying the resulting matrix by the weight a to obtain a final matrix S, and normalizing S to obtain the weight matrix M:
Mi=softmax(a*V(i,j)′*ISM(i,j))(i≤n,j≤n)
the image feature IM is finally obtained from the weight matrix M, where m represents the number of network cycles in specific training.
7. The method of claim 1, wherein the image content understanding and visual question answering VQA is characterized in that: the result Loss(f) is obtained by performing a loss calculation on the first splicing result and the second splicing result, wherein F(f) is the first splicing result, F(t) is the second splicing result, and N is the number of samples;
the obtaining of the result loss (c) by performing the loss calculation on the first vector a (t) and the second vector a (f) includes:
Loss(c)=cross-entropy(A(t),A(f))
wherein cross-entropy represents cross-entropy;
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss, wherein the mathematical operation comprises the following steps:
Loss=Loss(c)+λLoss(f)
in the formula, λ is an adjustment parameter.
8. The method of claim 6, wherein the image content understanding and visual question answering VQA is characterized in that: inputting the image fusion characteristics I (f) and the problem fusion characteristics Q (f) into a bilinear model to be coded to obtain fused characteristics Z, and obtaining a classification result through a classifier, wherein the method comprises the following steps:
inputting the image fusion characteristics I (f) and the problem fusion characteristics Q (f) into a pre-trained encoder to obtain a fusion vector Z:
Z=LayerNorm(Wx*IM+Wy*QY)
wherein LayerNorm denotes normalization over the hidden-state dimension, Wx denotes the linear projection matrix of the image features, IM denotes the image features, namely I(f), Wy denotes the linear projection matrix of the question features, and QY denotes the question fusion features, namely Q(f);
mapping the obtained fusion vector Z, namely the feature Z, into a vector S, and calculating the classification result over the N candidate answers by using binary cross-entropy:
S=sigmoid(Z)
wherein sigmoid denotes the activation function of the neural network.
9. A storage medium having stored thereon computer instructions, characterized in that: the computer instructions when executed perform the steps of a method for image content understanding and visual question answering VQA according to any one of claims 1 to 8.
10. A terminal comprising a memory and a processor, said memory having stored thereon computer instructions executable on said processor, wherein said processor when executing said computer instructions performs the steps of a method for image content understanding and visual question and answer VQA according to any one of claims 1 to 8.
CN202110211935.4A 2021-02-25 2021-02-25 Image content understanding and visual question and answer VQA method, storage medium and terminal Active CN112926655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110211935.4A CN112926655B (en) 2021-02-25 2021-02-25 Image content understanding and visual question and answer VQA method, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110211935.4A CN112926655B (en) 2021-02-25 2021-02-25 Image content understanding and visual question and answer VQA method, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN112926655A true CN112926655A (en) 2021-06-08
CN112926655B CN112926655B (en) 2022-05-17

Family

ID=76171844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110211935.4A Active CN112926655B (en) 2021-02-25 2021-02-25 Image content understanding and visual question and answer VQA method, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN112926655B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480206A (en) * 2017-07-25 2017-12-15 杭州电子科技大学 A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN107679582A (en) * 2017-10-20 2018-02-09 深圳市唯特视科技有限公司 A kind of method that visual question and answer are carried out based on multi-modal decomposition model
US20190042900A1 (en) * 2017-12-28 2019-02-07 Ned M. Smith Automated semantic inference of visual features and scenes
US20190370587A1 (en) * 2018-05-29 2019-12-05 Sri International Attention-based explanations for artificial intelligence behavior
CN109086892A (en) * 2018-06-15 2018-12-25 中山大学 It is a kind of based on the visual problem inference pattern and system that typically rely on tree
CN108920587A (en) * 2018-06-26 2018-11-30 清华大学 Merge the open field vision answering method and device of external knowledge
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110851760A (en) * 2019-11-12 2020-02-28 电子科技大学 Human-computer interaction system for integrating visual question answering in web3D environment
CN111553371A (en) * 2020-04-17 2020-08-18 中国矿业大学 Image semantic description method and system based on multi-feature extraction
CN111639692A (en) * 2020-05-25 2020-09-08 南京邮电大学 Shadow detection method based on attention mechanism
CN111858849A (en) * 2020-06-10 2020-10-30 南京邮电大学 VQA method based on intensive attention module

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZHOU YU et al.: "Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering", 2017 IEEE International Conference on Computer Vision (ICCV)
ZHANG TING: "Research on Image Content Understanding and Visual Reasoning Algorithms Based on the Attention Mechanism", China Masters' Theses Full-text Database, Information Science and Technology
ZENG PENGPENG: "A Visual Question Answering System Based on Deep Convolutional Networks and a Region Attention Mechanism", China Masters' Theses Full-text Database, Information Science and Technology
LI LEI: "Research on Visual Question Answering Combining Co-attention and Correlation Deep Networks", China Masters' Theses Full-text Database, Information Science and Technology

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360621A (en) * 2021-06-22 2021-09-07 辽宁工程技术大学 Scene text visual question-answering method based on modal inference graph neural network
CN113722458A (en) * 2021-08-27 2021-11-30 海信电子科技(武汉)有限公司 Visual question answering processing method, device, computer readable medium and program product

Also Published As

Publication number Publication date
CN112926655B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN110188358B (en) Training method and device for natural language processing model
CN113656570B (en) Visual question-answering method and device based on deep learning model, medium and equipment
WO2020143130A1 (en) Autonomous evolution intelligent dialogue method, system and device based on physical environment game
WO2020088330A1 (en) Latent space and text-based generative adversarial networks (latext-gans) for text generation
CN109670168B (en) Short answer automatic scoring method, system and storage medium based on feature learning
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN112417894B (en) Conversation intention identification method and system based on multi-task learning
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN112926655B (en) Image content understanding and visual question and answer VQA method, storage medium and terminal
KR102361616B1 (en) Method and apparatus for recognizing named entity considering context
CN111401084A (en) Method and device for machine translation and computer readable storage medium
Yu et al. Training an adaptive dialogue policy for interactive learning of visually grounded word meanings
CN111460176A (en) Multi-document machine reading understanding method based on Hash learning
CN111160000B (en) Composition automatic scoring method, device terminal equipment and storage medium
CN111027292B (en) Method and system for generating limited sampling text sequence
CN111597341A (en) Document level relation extraction method, device, equipment and storage medium
CN111046178A (en) Text sequence generation method and system
CN112131367A (en) Self-auditing man-machine conversation method, system and readable storage medium
CN111597815A (en) Multi-embedded named entity identification method, device, equipment and storage medium
CN112528168B (en) Social network text emotion analysis method based on deformable self-attention mechanism
CN110795535A (en) Reading understanding method for depth separable convolution residual block
CN114332565A (en) Method for generating image by generating confrontation network text based on distribution estimation condition
US20220138425A1 (en) Acronym definition network
CN111666375B (en) Text similarity matching method, electronic device and computer readable medium
CN111832699A (en) Computationally efficient expressive output layer for neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant