CN113886626B - Visual question-answering method of dynamic memory network model based on multi-attention mechanism - Google Patents

Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Info

Publication number
CN113886626B
CN113886626B (granted from application CN202111083704.6A)
Authority
CN
China
Prior art keywords: model, features, question, answer, word
Prior art date: 2021-09-14
Legal status: Active
Application number
CN202111083704.6A
Other languages
Chinese (zh)
Other versions
CN113886626A (en)
Inventor
缪亚林 (Miao Yalin)
童萌 (Tong Meng)
程文芳 (Cheng Wenfang)
李臻 (Li Zhen)
Current Assignee
Xi'an University of Technology
Original Assignee
Xi'an University of Technology
Priority date: 2021-09-14
Filing date: 2021-09-14
Application filed by Xi'an University of Technology
Priority to CN202111083704.6A
Publication of CN113886626A: 2022-01-04
Application granted; publication of CN113886626B: 2024-02-02


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/50 — Information retrieval of still image data
    • G06F 16/53 — Querying
    • G06F 16/532 — Query formulation, e.g. graphical querying
    • G06F 16/55 — Clustering; classification
    • G06F 16/58 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 — Retrieval using metadata automatically derived from the content
    • G06F 16/5846 — Retrieval using metadata derived from the content using extracted text
    • G06F 18/00 — Pattern recognition
    • G06F 18/24 — Classification techniques
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/048 — Activation functions
    • G06N 3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual question-answering method of a dynamic memory network model based on a multi-attention mechanism, which comprises the following steps: step 1, preprocessing an input image and text; step 2, extracting features of the question input in step 1, dividing the question into individual words according to its punctuation and spaces; step 3, sending the picture input in step 1 into a feature extraction network to obtain region target features composed of the K regions with the highest confidence; step 4, iteratively updating the memory over the question features and picture features obtained in steps 2 and 3 using the multi-attention mechanism to generate the context vector required for answering the question; and step 5, sending the question features of step 2 and the new image features generated in step 4 to a feature fusion device to jointly infer the answer, wherein the answer is the candidate answer with the highest probability given by the classifier. The method improves the accuracy of the visual question-answering model.

Description

Visual question-answering method of dynamic memory network model based on multi-attention mechanism
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and particularly relates to a visual question-answering method of a dynamic memory network model based on a multi-attention mechanism.
Background
Attention mechanisms are widely used in tasks such as visual question answering, image captioning, and machine translation; visual question-answering attention models generate attention distributions over picture features based on question features so as to answer accurately. At present, visual question-answering attention mechanisms generally perform weighted pooling only on the last convolutional layer of the image: different spatial regions receive different weights, but different channels share the same weight, so spatial information of the feature map is inevitably lost, which conflicts with the coexistence of spatial and channel information in convolutional neural network feature maps. Worse, the attention mechanism is applied only to the last convolutional layer, where receptive fields are quite large and the differences between receptive fields are limited, so spatial attention becomes less discriminative. Researchers have therefore proposed combining channel attention with spatial attention as the "left and right arms" of the neural network.
Some questions in visual question answering involve multi-hop relationships between objects, such as "what is in the basket of the bicycle?" The model needs to first find the bicycle in the picture, locate the basket according to the bicycle's position, and then identify the objects contained in the basket. Visual question-answer prediction therefore requires matching, step by step according to the question, the picture regions best suited to answering it. Hence, besides using an attention mechanism to extract the key information required for answering, a visual question-answering model also needs a certain memory capacity to search, infer and store related information according to different questions. Because RNN, LSTM, GRU and other neural networks with memory functions have a short memory span, they cannot meet the visual question-answering task's requirement for long-term memory and storage of effective information. To mitigate the loss of effective information, dynamic memory networks are used herein to iteratively find the visual information related to the question.
Disclosure of Invention
The invention aims to provide a visual question-answering method of a dynamic memory network model based on a multi-attention mechanism, which solves complex questions requiring multi-step reasoning in visual question answering and improves the accuracy of the visual question-answering model.
The technical scheme adopted by the invention is that the visual question-answering method of the dynamic memory network model based on the multi-attention mechanism comprises the following steps:
step 1, preprocessing an input image and text and sending them into the input module of the model, which is responsible for extracting image and text features, wherein object-level features are obtained;
step 2, extracting features of the question input in step 1, dividing the question into individual words according to its punctuation and spaces; then vectorizing the words with a pre-trained word model, and inputting the word vector representations into a recurrent neural network to obtain the hidden state of the last time step as the question feature;
step 3, in order to obtain the picture features, the picture input in step 1 is sent to a feature extraction network to obtain region target features composed of the K region features with the highest confidence;
step 4, iteratively updating the memory over the question features and picture features obtained in steps 2 and 3 using a multi-attention mechanism to generate the context vector required for answering the question;
and step 5, sending the question features of step 2 and the new image features generated in step 4 to a feature fusion device to jointly infer the answer, wherein the answer is the candidate answer with the highest probability given by the classifier.
The present invention is also characterized in that,
the specific implementation mode of the step 2 is as follows:
step 2.1: first, the input question text is processed into a form acceptable to the model; the input question q is then expressed as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is the i-th word;
step 2.2: secondly, mapping the words into the same vector space by using a word vector model to obtain a word embedding representation of each word; the word vector sequence h of the question is expressed as:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of word q_i, and each word vector is P-dimensional (h_i ∈ R^P); the processed word vectors are input into the GRU network, a process represented by the following equation:
s_i = GRU(h_i, s_(i-1))
wherein: s is the sentence feature of the input text, taken as the hidden state of the last time step, and h_i is the word vector of the input text;
step 2.3: finally, the word vectors are input into a recurrent neural network to extract the sentence feature, namely the question feature.
The question feature in step 2 is obtained using a GloVe word vector model pre-trained on a corpus to produce a word vector representation for each word.
The step 3 is specifically implemented according to the following steps:
after the input picture is received, since not all elements in the picture are related to the question, an attention mechanism is applied to the picture representation in order to lock onto the target more accurately and find the key regions for answering the question; a top-down attention model is used, and the object detection network Faster R-CNN with high-level semantics is adopted to extract the picture features; firstly, an image feature map is extracted using a VGG or ResNet backbone network; then a region proposal network generates proposals, which are pooled into fixed-size proposal feature maps; classification and regression then yield accurate image features; finally, the top K candidate regions with the highest confidence are taken as the image features, with the extraction process:
V = [v_1, v_2, ..., v_K]
wherein: v_k represents any one candidate object, V represents the set of candidates selected by confidence, and each candidate object is D-dimensional (v_k ∈ R^D).
Step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features obtained in step 2 and the picture features obtained in step 3 are fused;
step 4.2: secondly, the object feature map first passes through channel attention to obtain a channel feature map closely related to the question; a spatial attention mechanism is then applied to the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key context information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t [m_(t-1); V_t^s; Q] + b)
wherein: [·;·] represents the feature concatenation operation, W_t represents the parameter update matrix, b represents the bias, V_t^s represents the new image features, t represents the iteration step, m_(t-1) represents the context memory, and Q represents the question vector.
Step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q undergo feature fusion in the BLOCK multi-modal fusion manner to obtain the fused feature J; after the joint feature representation J is obtained, classification is performed using two fully connected layers; answer prediction is then performed using the Sigmoid function in the DMN-MA model, which allows multiple correct answers for each question, with the score of each candidate answer ranging over (0, 1); finally the candidate answer with the maximum probability value is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameters of the fully connected layer, b_j represents the bias term, and y represents the final answer; cross entropy is used as the loss function during training.
The beneficial effects of the invention are as follows:
1. The invention is based on a dynamic memory network model with a multi-attention mechanism. Unlike previous attention models, the model uses not only a spatial attention mechanism but also a channel attention mechanism, so that the visual question-answering model applies different weights to different channel feature maps, and the spatial attention mechanism becomes an effective complement to the channel attention mechanism. In addition, the input module and the scene memory module of the dynamic memory network model are studied in depth: the input module uses Faster R-CNN to obtain object-level features; in the scene memory module, the multi-attention mechanism continuously updates and stores the memory according to the question, iterative reasoning obtains the visual vector most relevant to answering the question, and context information is effectively used for answer reasoning. Finally, the network's final memory and the question representation are fused to infer the correct answer.
2. The method is scientific and reasonable in design, can use a multiple memory mechanism to continuously update and store memory according to the questions, and can obtain the most relevant visual vector of answers to the questions by iterative reasoning, and answer reasoning is effectively carried out by using context information. And the memory network further improves the accuracy of the visual question-answering model.
3. The method of the invention provides a dynamic memory network model (DMN-MA) based on a multiple attention mechanism on the basis of the dynamic memory network. Unlike the previous model, the multi-attention mechanism based on problem guidance is applied when the input image features are read, so that the spatial region of the image is focused, and the image is focused on different convolution channels, and the three-dimensional characteristics of the feature map, which are compatible with the feature map, are more consistent. The DMN-MA model iteratively inquires visual information related to the questions when searching the image features, and continuously updates the memory content to obtain key memories for answering the questions, so that the complex problems requiring multiple reasoning in visual question answering are solved.
Drawings
FIG. 1 is a schematic diagram of a scene memory module iterated twice in the method of the present invention;
FIG. 2 is a diagram of the overall framework of a dynamic memory network model based on a multi-attention mechanism in the method of the present invention;
FIG. 3 is a schematic diagram of an original picture before memory visualization in the simulation of the present invention;
FIG. 4 is a schematic diagram of the picture after memory visualization in the simulation of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific embodiments.
The invention discloses a visual question-answering method of a dynamic memory network model based on a multi-attention mechanism, which comprises the following steps:
step 1, preprocessing an input image and text and sending them into the input module of the model, which is responsible for extracting image and text features, wherein object-level features are obtained;
step 2, extracting features of the question input in step 1, dividing the question into individual words according to its punctuation and spaces; then vectorizing the words with a pre-trained word model, and inputting the word vector representations into a recurrent neural network to obtain the hidden state of the last time step as the question feature;
the specific implementation mode of the step 2 is as follows:
step 2.1: firstly, the input question text is processed into a form acceptable to the model, i.e., the question text is divided into individual words according to punctuation and spaces; the input question q is then expressed as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is the i-th word;
step 2.2: secondly, mapping the words into the same vector space using a word vector model to obtain a word embedding representation of each word; word embedding converts the words of a text into real-valued vectors, which makes them convenient to compute with; the word vector sequence h is expressed as:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of word q_i, and each word vector is P-dimensional (h_i ∈ R^P); the pre-trained GloVe word vector model is used here to obtain the word vector representation of each word; since the question text in the visual question-answer datasets used herein typically does not exceed 20 words, the processed word vectors are input into the GRU network according to the following equation:
s_i = GRU(h_i, s_(i-1))
wherein: s is the sentence feature of the input text, taken as the hidden state of the last time step, and h_i is the word vector of the input text.
Step 2.3: finally, the word vector is input into a cyclic neural network to extract the characteristics of sentences, namely the problem characteristics.
The question feature in step 2 is obtained using a GloVe word vector model pre-trained on a large corpus to produce a word vector representation for each word.
Step 3, in order to obtain the picture characteristics, the picture input in the step 1 is sent to a characteristic extraction network to obtain regional target characteristics composed of K regional characteristics with highest confidence; the feature extraction network used herein is the fast R-CNN network.
The step 3 is specifically implemented according to the following steps:
after the input picture is received, since not all elements in the picture are related to the question, an attention mechanism is applied to the picture representation in order to lock onto the target more accurately and find the key regions for answering the question. A top-down attention model is used, and the object detection network Faster R-CNN with high-level semantics is adopted to extract the picture features; firstly, an image feature map is extracted using a VGG or ResNet backbone network; then a region proposal network generates proposals, which are pooled into fixed-size proposal feature maps; classification and regression then yield accurate image features; finally, the top K candidate regions with the highest confidence are taken as the image features, with the extraction process:
V = [v_1, v_2, ..., v_K]
wherein: v_k represents any one candidate object, V represents the set of candidates selected by confidence, and each candidate object is D-dimensional (v_k ∈ R^D).
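The top-K selection at the end of step 3 is simple once the detector exposes per-region pooled features and confidence scores; the helper below is a sketch under that assumption (the top_k_region_features name and the zero-padding behaviour are illustrative, not from the patent):

    import torch

    def top_k_region_features(pooled_feats, scores, k=100):
        """Keep the K highest-confidence region features.

        pooled_feats: (R, D) per-region features from the Faster R-CNN head
        scores:       (R,)  confidence of each candidate region
        returns V:    (K, D) region target features, zero-padded if fewer
                      than K regions were detected
        """
        k_eff = min(k, scores.numel())
        top = scores.topk(k_eff).indices
        V = pooled_feats[top]
        if k_eff < k:  # pad to a fixed K so batches have a uniform shape
            V = torch.cat([V, V.new_zeros(k - k_eff, V.size(1))])
        return V

    # usage: 37 detected regions, D = 2048, padded up to K = 100
    V = top_k_region_features(torch.randn(37, 2048), torch.rand(37))  # -> (100, 2048)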
Step 4, iteratively updating the memory of the question features and the picture features obtained in the step 2 and the step 3 by using a multi-attention mechanism to generate context vectors required by answering the questions; adopting a mode of combining channel attention and space attention to update the memory of the answered questions once and once;
step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features obtained in step 2 and the picture features obtained in step 3 are fused;
step 4.2: secondly, as shown in the image channel feature diagram of FIG. 1, the object feature map first passes through channel attention to obtain a channel feature map closely related to the question; a spatial attention mechanism is then applied to the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key context information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t [m_(t-1); V_t^s; Q] + b)
wherein: [·;·] represents the feature concatenation operation, W_t represents the parameter update matrix, b represents the bias, V_t^s represents the new image features, t represents the iteration step, m_(t-1) represents the context memory, and Q represents the question vector. Channel attention mainly focuses on what the object is, computed through correlation with the question, while spatial attention locates the best object regions for answering the question, giving different weights to different object regions rather than treating every region equally. Each pass through the channel attention module and the spatial attention module generates a new image feature vector with which the scene memory is updated. Following previous visual question-answering work, the memory is updated using the ReLU activation function.
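One memory-update iteration can be sketched as follows; the sigmoid channel gate, the softmax spatial weighting and the layer names are plausible assumptions about the attention modules rather than the patent's exact design, while the final ReLU line implements the m_t equation above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EpisodicMemory(nn.Module):
        """Sketch of one scene-memory iteration over region features
        V (B, K, D), question Q (B, D) and previous memory m_prev (B, D)."""
        def __init__(self, d=2048):
            super().__init__()
            self.chan_att = nn.Linear(2 * d, d)  # question-guided channel attention
            self.spat_att = nn.Linear(2 * d, 1)  # question-guided spatial attention
            self.update = nn.Linear(3 * d, d)    # W_t in the memory-update equation

        def forward(self, V, Q, m_prev):
            B, K, D = V.shape
            q = Q.unsqueeze(1).expand(-1, K, -1)
            # channel attention: gate each feature channel conditioned on the question
            Vc = V * torch.sigmoid(self.chan_att(torch.cat([V, q], dim=-1)))
            # spatial attention: weight the K regions on the channel-attended map
            a = F.softmax(self.spat_att(torch.cat([Vc, q], dim=-1)), dim=1)
            Vs = (a * Vc).sum(dim=1)             # attended visual feature V_t^s
            # m_t = ReLU(W_t [m_{t-1}; V_t^s; Q] + b)
            return F.relu(self.update(torch.cat([m_prev, Vs, Q], dim=-1)))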
And 5, sending the question features in the step 2 and the new graph features generated in the step 4 to a feature fusion device to jointly infer an answer, wherein the answer is selected from candidate answers with highest probability given by a classifier.
Step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q undergo feature fusion in the BLOCK multi-modal fusion manner to obtain the fused feature J. After the joint feature representation J is obtained, classification is performed using two fully connected layers. Answer prediction is then performed using the Sigmoid function in the DMN-MA model, which allows multiple correct answers for each question, with the score of each candidate answer ranging over (0, 1). Finally the candidate answer with the maximum probability value is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameters of the fully connected layer, b_j represents the bias term, and y represents the final answer; cross entropy is used as the loss function during training.
The specific process of the invention is shown in FIG. 2. Firstly, region target features are extracted from the input image and processed into a set of fact vectors, and the input question is encoded at the same time; a dynamic memory network model based on the multi-attention mechanism is then established, the question and image features are input and iterated several times, and the context memory is updated after each iteration until the answer with the highest probability is obtained; the fused features then interact with the question to produce new image features, and finally the new image features and the question jointly infer the answer, as sketched below. Compared with traditional methods that use holistic image features, or other graph-network visual question-answering methods that neglect the importance of relationships, the technical scheme of the invention effectively improves the performance of the visual question-answering model.
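Putting the pieces together, the overall inference flow reduces to a short loop; this sketch reuses the hypothetical modules from the earlier snippets and assumes the memory is initialized from the question vector:

    def answer(question_tokens, region_feats, encoder, memory, classifier, T=3):
        """End-to-end sketch: encode the question, attend over region features
        for T memory iterations, then score the candidate answers."""
        Q = encoder(question_tokens)      # (B, 2048) question feature
        m = Q                             # initial memory m_0 (an assumption)
        for _ in range(T):                # T = 3 iterations performed best (Table 1)
            m = memory(region_feats, Q, m)
        return classifier(m, Q)           # per-candidate-answer probabilities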
Simulation experiments and analysis of experimental results
1. Data set
The model was run on two public visual question-answering datasets, COCO-QA and VQA 2.0. The COCO-QA dataset pictures come from MS-COCO: 123,587 in total, of which 72,783 are used for training and 38,948 for testing; importantly, the answer distribution of the dataset's questions is relatively uniform. The VQA 2.0 dataset contains 204,721 pictures from MS-COCO, with 123,287 pictures for the training and validation sets (about 80,000 for training) and 81,434 pictures for the test set. The dataset has 614,163 questions in total, three questions per picture and ten answers per question, given by ten different annotators.
2. Experimental environment
The development framework is PyTorch 1.1.0 with the Python 3.6 development language. Specifically, the image input module uses K = 100, each object feature vector is 2048-dimensional, and image feature extraction uses ResNet-152 as the backbone network. The question module processes each question to a fixed length: excess words are discarded and short questions are padded with 0. The COCO-QA question length is fixed at 20 and the VQA 2.0 question length at 14. The word vector dimension is 300, the GRU hidden layer dimension is 2048, and the resulting question vector dimension is 2048. In the answer prediction stage, the COCO-QA dataset provides 430 answers; on the VQA 2.0 dataset, an answer is added to the candidate answer set if it appears more than 8 times in the training set, giving 3129 candidate answers in total.
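The VQA 2.0 candidate-answer vocabulary described above (answers occurring more than 8 times in the training set) can be built with a few lines; build_candidate_answers is an illustrative helper, not code from the patent:

    from collections import Counter

    def build_candidate_answers(train_answers, min_count=8):
        """Keep answers appearing more than min_count times in the training
        set; on VQA 2.0 this yields the 3129 candidate answers used here."""
        counts = Counter(train_answers)
        return sorted(a for a, c in counts.items() if c > min_count)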
All activation functions in the experiments use ReLU, and dropout with p = 0.5 is applied at the input and output layers to prevent overfitting. During training, all training samples are randomly shuffled; the batch size is set to 32 and the number of epochs to 20. The Adam stochastic gradient descent algorithm is used with an initial learning rate of 0.001; after training for 5 epochs, the DMN-MA model reduces the learning rate to 1/10 of its previous value every 3 epochs.
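This optimization schedule maps directly onto standard PyTorch components; the sketch below uses a placeholder model and expresses "divide the learning rate by 10 every 3 epochs after epoch 5" as a MultiStepLR schedule:

    import torch.nn as nn
    from torch.optim import Adam
    from torch.optim.lr_scheduler import MultiStepLR

    model = nn.Linear(2048, 3129)  # placeholder for the full DMN-MA network
    optimizer = Adam(model.parameters(), lr=1e-3)
    # decay to 1/10 after epoch 5, then every 3 epochs thereafter
    scheduler = MultiStepLR(optimizer, milestones=[5, 8, 11, 14, 17], gamma=0.1)

    for epoch in range(20):
        # ... one pass over the shuffled training data, batch size 32 ...
        scheduler.step()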
3. Experimental results and analysis
Because the number of iterations of the DMN-MA model's scene memory module is not fixed, different iteration counts were first tried on the COCO-QA and VQA 2.0 datasets to find the model's optimal configuration. The experimental results for overall accuracy versus iteration count on the two datasets are shown in Table 1.
Table 1 comparison of iteration number accuracy of scene memory module
As shown in Table 1, model accuracy rises as the number of iterations increases; at 3 iterations the overall accuracy on both datasets is highest, and increasing the iteration count further causes accuracy to drop sharply. Overall, the multi-attention mechanism achieves its highest accuracy at 3 iterations, so the experiments set the number of iterations to 3.
Next, to verify the effectiveness of the proposed model, table 2 lists the experimental results of the model and other mainstream methods on the COCO-QA test set.
TABLE 2 comparison of overall accuracy, WUPS index and other methods on COCO-QA dataset
As can be seen from Table 2, the overall accuracy of the proposed DMN-MA model reaches 64.57%, an improvement of 11.26% over the conventional VIS+LSTM method. In particular, compared with the classical visual question-answering attention method SAN, the overall accuracy improves by about 3%, and compared with the QPU model the accuracy improves by 2.07%. In addition, the model also performs remarkably well on WUPS 0.9 and WUPS 0.0. This shows that in visual question-answering research it is not sufficient to use only spatial attention for iterative reasoning; question-guided channel attention is equally important.
As shown in Table 3, the overall performance of the proposed DMN-MA model is 12.96% higher than the baseline model CNN+LSTM, 4.91% higher than the MCB model, and 2.54% higher than the ReasonNet model; in addition, the model's overall accuracy leads the classical top-down-attention visual question-answering system model by 1.51%. Notably, the DMN-MA model and the top-down-attention model use the same data preprocessing: Faster R-CNN extracts the visual features of the images and GloVe+GRU extracts the question features; the difference is that the top-down-attention model performs answer prediction with a spatial attention mechanism only, which fully demonstrates the effectiveness of the proposed model.
TABLE 3 accuracy comparison of question types on COCO-QA
In summary, comparing the DMN-MA model with various mainstream methods on the COCO-QA and VQA 2.0 datasets shows that the DMN-MA model combines the advantages of the multi-attention mechanism and the memory network, conforms better to the three-dimensional characteristics of the convolutional feature map, reduces the loss of context information during answer prediction, and achieves better performance.
4. Attention visualization
Several pictures and questions from the datasets were randomly chosen for an attention-visualization presentation of the proposed model, as shown in FIGS. 3-4. The question appears above FIG. 3; FIG. 3 is the original picture and FIG. 4 is the picture after attention visualization by the model; Ground Truth denotes the dataset answer and Prediction denotes the model's predicted answer.

Claims (3)

1. The visual question-answering method of the dynamic memory network model based on the multi-attention mechanism is characterized by comprising the following steps:
step 1, preprocessing an input image and text and sending them into the input module of the model, which is responsible for extracting image and text features, wherein object-level features are obtained;
step 2, extracting features of the question input in step 1, dividing the question into individual words according to its punctuation and spaces; then vectorizing the words with a pre-trained word model, and inputting the word vector representations into a recurrent neural network to obtain the hidden state of the last time step as the question feature;
the specific implementation mode of the step 2 is as follows:
step 2.1: first, the input question text is processed into a form acceptable to the model; the input question q is then expressed as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is the i-th word;
step 2.2: secondly, mapping the words into the same vector space by using a word vector model to obtain a word embedding representation of each word; the word vector sequence h of the question is expressed as:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of word q_i, and each word vector is P-dimensional (h_i ∈ R^P); the processed word vectors are input into the GRU network, a process represented by the following equation:
s_i = GRU(h_i, s_(i-1))
wherein: s is the sentence feature of the input text, taken as the hidden state of the last time step, and h_i is the word vector of the input text;
step 2.3: finally, inputting the word vectors into a recurrent neural network to extract the sentence feature, namely the question feature;
step 3, sending the picture input in step 1 into a feature extraction network to obtain region target features composed of the K regions with the highest confidence;
the step 3 is specifically implemented according to the following steps:
after the input picture is received, since not all elements in the picture are related to the question, an attention mechanism is applied to the picture representation in order to lock onto the target more accurately and find the key regions for answering the question; a top-down attention model is used, and the object detection network Faster R-CNN with high-level semantics is adopted to extract the picture features; firstly, an image feature map is extracted using a VGG or ResNet backbone network; then a region proposal network generates proposals, which are pooled into fixed-size proposal feature maps; classification and regression then yield accurate image features; finally, the top K candidate regions with the highest confidence are taken as the image features, with the extraction process:
V = [v_1, v_2, ..., v_K]
wherein: v_k represents any one candidate object, V represents the set of candidates selected by confidence, and each candidate object is D-dimensional (v_k ∈ R^D);
step 4, iteratively updating the memory over the question features and picture features obtained in steps 2 and 3 using a multi-attention mechanism to generate the context vector required for answering the question;
step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features obtained in step 2 and the picture features obtained in step 3 are fused;
step 4.2: secondly, the object feature map first passes through channel attention to obtain a channel feature map closely related to the question; a spatial attention mechanism is then applied to the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key context information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t [m_(t-1); V_t^s; Q] + b)
wherein: [·;·] represents the feature concatenation operation, W_t represents the parameter update matrix, b represents the bias, V_t^s represents the new image features, t represents the iteration step, m_(t-1) represents the context memory, and Q represents the question vector;
and step 5, sending the question features of step 2 and the new image features generated in step 4 to a feature fusion device to jointly infer the answer, wherein the answer is the candidate answer with the highest probability given by the classifier.
2. The visual question-answering method based on a dynamic memory network model with a multi-attention mechanism according to claim 1, wherein the question feature in step 2 is obtained using a GloVe word vector model pre-trained on a corpus to produce a word vector representation for each word.
3. The visual question-answering method of the dynamic memory network model based on the multi-attention mechanism according to claim 1, wherein the step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q undergo feature fusion in the BLOCK multi-modal fusion manner to obtain the fused feature J; after the joint feature representation J is obtained, classification is performed using two fully connected layers; answer prediction is then performed using the Sigmoid function in the DMN-MA model, which allows multiple correct answers for each question, with the score of each candidate answer ranging over (0, 1); finally the candidate answer with the maximum probability value is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameters of the fully connected layer, b_j represents the bias term, and y represents the final answer; cross entropy is used as the loss function during training.
CN202111083704.6A, filed 2021-09-14 (priority date 2021-09-14) — Visual question-answering method of dynamic memory network model based on multi-attention mechanism — Active — granted as CN113886626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111083704.6A CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111083704.6A CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Publications (2)

Publication Number Publication Date
CN113886626A CN113886626A (en) 2022-01-04
CN113886626B (en) 2024-02-02

Family

ID=79009636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111083704.6A Active CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Country Status (1)

Country Link
CN (1) CN113886626B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022082238A (en) * 2020-11-20 2022-06-01 富士通株式会社 Machine learning program, machine learning method, and output device
CN114417044B (en) * 2022-01-19 2023-05-26 中国科学院空天信息创新研究院 Image question and answer method and device
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN114416914B (en) * 2022-03-30 2022-07-08 中建电子商务有限责任公司 Processing method based on picture question and answer

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11568315B2 (en) * 2019-03-22 2023-01-31 Royal Bank Of Canada Systems and methods for learning user representations for open vocabulary data sets

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
闫茹玉, 刘学亮. 结合自底向上注意力机制和记忆网络的视觉问答模型 [A visual question-answering model combining a bottom-up attention mechanism and a memory network]. 中国图象图形学报 (Journal of Image and Graphics), 2020(05). *

Also Published As

Publication number Publication date
CN113886626A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN110188358B (en) Training method and device for natural language processing model
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
Zhang et al. Rich visual knowledge-based augmentation network for visual question answering
CN108829719A (en) The non-true class quiz answers selection method of one kind and system
CN111191078A (en) Video information processing method and device based on video information processing model
CN111046179B (en) Text classification method for open network question in specific field
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN110826338B (en) Fine-grained semantic similarity recognition method for single-selection gate and inter-class measurement
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN107480132A (en) A kind of classic poetry generation method of image content-based
CN111695052A (en) Label classification method, data processing device and readable storage medium
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN112733602B (en) Relation-guided pedestrian attribute identification method
CN114201592A (en) Visual question-answering method for medical image diagnosis
Puscasiu et al. Automated image captioning
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN115861995B (en) Visual question-answering method and device, electronic equipment and storage medium
CN114168769B (en) Visual question-answering method based on GAT relation reasoning
CN113609330B (en) Video question-answering system, method, computer and storage medium based on text attention and fine-grained information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant