CN113886626B - Visual question-answering method of dynamic memory network model based on multi-attention mechanism - Google Patents

Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Info

Publication number
CN113886626B
CN113886626B (granted from application CN202111083704.6A)
Authority
CN
China
Prior art keywords: model, features, question, answer, word
Prior art date: 2021-09-14
Legal status: Active
Application number
CN202111083704.6A
Other languages
Chinese (zh)
Other versions
CN113886626A (en)
Inventor
缪亚林 (Miao Yalin)
童萌 (Tong Meng)
程文芳 (Cheng Wenfang)
李臻 (Li Zhen)
Current Assignee
Xi'an University of Technology
Original Assignee
Xi'an University of Technology
Priority date: 2021-09-14
Filing date: 2021-09-14
Application filed by Xi'an University of Technology
Priority to CN202111083704.6A
Publication of CN113886626A: 2022-01-04
Application granted; publication of CN113886626B: 2024-02-02


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/50 — Information retrieval of still image data
    • G06F 16/53 — Querying
    • G06F 16/532 — Query formulation, e.g. graphical querying
    • G06F 16/55 — Clustering; classification
    • G06F 16/58 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 — Retrieval using metadata automatically derived from the content
    • G06F 16/5846 — Retrieval using metadata derived from the content using extracted text
    • G06F 18/00 — Pattern recognition
    • G06F 18/24 — Classification techniques
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/048 — Activation functions
    • G06N 3/08 — Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual question-answering method of a dynamic memory network model based on a multi-attention mechanism, which comprises the following steps: step 1, preprocessing an input image and text; step 2, extracting features of the question input in step 1, dividing the question into individual words according to its punctuation and spaces; step 3, sending the picture input in step 1 into a feature extraction network to obtain region target features composed of the K regions with the highest confidence; step 4, iteratively updating the memory over the question features and picture features obtained in steps 2 and 3 using the multi-attention mechanism to generate the context vector required for answering the question; and step 5, sending the question features of step 2 and the new image features generated in step 4 to a feature fusion device to jointly infer the answer, wherein the answer is the candidate answer with the highest probability given by the classifier. The method improves the accuracy of the visual question-answering model.

Description

Visual question-answering method of dynamic memory network model based on multi-attention mechanism
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and particularly relates to a visual question-answering method of a dynamic memory network model based on a multi-attention mechanism.
Background
Attention mechanisms are widely used in tasks such as visual question answering, image captioning, and machine translation; visual question-answering attention models generate attention distributions over picture features based on question features so as to answer accurately. At present, visual question-answering attention mechanisms generally perform weighted pooling only on the last convolutional layer of the image: different spatial regions receive different weights, but different channels share the same weight, so spatial information of the feature map is inevitably lost, which conflicts with the coexistence of spatial and channel information in convolutional neural network feature maps. Worse, the attention mechanism is applied only to the last convolutional layer, where receptive fields are quite large and the differences between receptive fields are limited, so spatial attention becomes less discriminative. Researchers have therefore proposed combining channel attention with spatial attention as the "left and right arms" of the neural network.
Some questions in visual question answering involve multi-hop relationships between objects, such as "what is in the basket of the bicycle?" The model needs to first find the bicycle in the picture, locate the basket according to the bicycle's position, and then identify the objects contained in the basket. Visual question-answer prediction therefore requires matching, step by step according to the question, the picture regions best suited to answering it. Hence, besides using an attention mechanism to extract the key information required for answering, a visual question-answering model also needs a certain memory capacity to search, infer and store related information according to different questions. Because RNN, LSTM, GRU and other neural networks with memory functions have a short memory span, they cannot meet the visual question-answering task's requirement for long-term memory and storage of effective information. To mitigate the loss of effective information, dynamic memory networks are used herein to iteratively find the visual information related to the question.
Disclosure of Invention
The invention aims to provide a visual question-answering method of a dynamic memory network model based on a multi-attention mechanism, which solves complex questions requiring multi-step reasoning in visual question answering and improves the accuracy of the visual question-answering model.
The technical scheme adopted by the invention is that the visual question-answering method of the dynamic memory network model based on the multi-attention mechanism comprises the following steps:
step 1, preprocessing an input image and text and sending them into the input module of the model, which is responsible for extracting image and text features, wherein object-level features are obtained;
step 2, extracting features of the question input in step 1, dividing the question into individual words according to its punctuation and spaces; then vectorizing the words with a pre-trained word model, and inputting the word vector representations into a recurrent neural network to obtain the hidden state of the last time step as the question feature;
step 3, in order to obtain the picture features, the picture input in step 1 is sent to a feature extraction network to obtain region target features composed of the K region features with the highest confidence;
step 4, iteratively updating the memory over the question features and picture features obtained in steps 2 and 3 using a multi-attention mechanism to generate the context vector required for answering the question;
and step 5, sending the question features of step 2 and the new image features generated in step 4 to a feature fusion device to jointly infer the answer, wherein the answer is the candidate answer with the highest probability given by the classifier.
The present invention is also characterized in that,
the specific implementation mode of the step 2 is as follows:
step 2.1: first, the input question text is processed into a form acceptable to the model; the input question q is then expressed as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is the i-th word;
step 2.2: secondly, mapping the words into the same vector space by using a word vector model to obtain a word embedding representation of each word; the word vector sequence h of the question is expressed as:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of word q_i, and each word vector is P-dimensional (h_i ∈ R^P); the processed word vectors are input into the GRU network, a process represented by the following equation:
s_i = GRU(h_i, s_(i-1))
wherein: s is the sentence feature of the input text, taken as the hidden state of the last time step, and h_i is the word vector of the input text;
step 2.3: finally, the word vectors are input into a recurrent neural network to extract the sentence feature, namely the question feature.
The question feature in step 2 is obtained using a GloVe word vector model pre-trained on a corpus to produce a word vector representation for each word.
The step 3 is specifically implemented according to the following steps:
after the input picture is received, since not all elements in the picture are related to the question, an attention mechanism is applied to the picture representation in order to lock onto the target more accurately and find the key regions for answering the question; a top-down attention model is used, and the object detection network Faster R-CNN with high-level semantics is adopted to extract the picture features; firstly, an image feature map is extracted using a VGG or ResNet backbone network; then a region proposal network generates proposals, which are pooled into fixed-size proposal feature maps; classification and regression then yield accurate image features; finally, the top K candidate regions with the highest confidence are taken as the image features, with the extraction process:
V = [v_1, v_2, ..., v_K]
wherein: v_k represents any one candidate object, V represents the set of candidates selected by confidence, and each candidate object is D-dimensional (v_k ∈ R^D).
Step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features obtained in step 2 and the picture features obtained in step 3 are fused;
step 4.2: secondly, the object feature map first passes through channel attention to obtain a channel feature map closely related to the question; a spatial attention mechanism is then applied to the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key context information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t [m_(t-1); V_t^s; Q] + b)
wherein: [·;·] represents the feature concatenation operation, W_t represents the parameter update matrix, b represents the bias, V_t^s represents the new image features, t represents the iteration step, m_(t-1) represents the context memory, and Q represents the question vector.
Step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q undergo feature fusion in the BLOCK multi-modal fusion manner to obtain the fused feature J; after the joint feature representation J is obtained, classification is performed using two fully connected layers; answer prediction is then performed using the Sigmoid function in the DMN-MA model, which allows multiple correct answers for each question, with the score of each candidate answer ranging over (0, 1); finally the candidate answer with the maximum probability value is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameters of the fully connected layer, b_j represents the bias term, and y represents the final answer; cross entropy is used as the loss function during training.
The beneficial effects of the invention are as follows:
1. The invention is based on a dynamic memory network model with a multi-attention mechanism. Unlike previous attention models, the model uses not only a spatial attention mechanism but also a channel attention mechanism, so that the visual question-answering model applies different weights to different channel feature maps, and the spatial attention mechanism becomes an effective complement to the channel attention mechanism. In addition, the input module and the scene memory module of the dynamic memory network model are studied in depth: the input module uses Faster R-CNN to obtain object-level features; in the scene memory module, the multi-attention mechanism continuously updates and stores the memory according to the question, iterative reasoning obtains the visual vector most relevant to answering the question, and context information is effectively used for answer reasoning. Finally, the network's final memory and the question representation are fused to infer the correct answer.
2. The method is scientific and reasonable in design, can use a multiple memory mechanism to continuously update and store memory according to the questions, and can obtain the most relevant visual vector of answers to the questions by iterative reasoning, and answer reasoning is effectively carried out by using context information. And the memory network further improves the accuracy of the visual question-answering model.
3. The method of the invention provides a dynamic memory network model (DMN-MA) based on a multiple attention mechanism on the basis of the dynamic memory network. Unlike the previous model, the multi-attention mechanism based on problem guidance is applied when the input image features are read, so that the spatial region of the image is focused, and the image is focused on different convolution channels, and the three-dimensional characteristics of the feature map, which are compatible with the feature map, are more consistent. The DMN-MA model iteratively inquires visual information related to the questions when searching the image features, and continuously updates the memory content to obtain key memories for answering the questions, so that the complex problems requiring multiple reasoning in visual question answering are solved.
Drawings
FIG. 1 is a schematic diagram of a scene memory module iterated twice in the method of the present invention;
FIG. 2 is a diagram of the overall framework of a dynamic memory network model based on a multi-attention mechanism in the method of the present invention;
FIG. 3 is a schematic diagram of an original picture before memory visualization in the simulation of the present invention;
FIG. 4 is a schematic diagram of the picture after memory visualization in the simulation of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific embodiments.
The invention discloses a visual question-answering method of a dynamic memory network model based on a multi-attention mechanism, which comprises the following steps:
step 1, preprocessing an input image and text and sending them into the input module of the model, which is responsible for extracting image and text features, wherein object-level features are obtained;
step 2, extracting features of the question input in step 1, dividing the question into individual words according to its punctuation and spaces; then vectorizing the words with a pre-trained word model, and inputting the word vector representations into a recurrent neural network to obtain the hidden state of the last time step as the question feature;
the specific implementation mode of the step 2 is as follows:
step 2.1: firstly, the input question text is processed into a form acceptable to the model, i.e., the question text is divided into individual words according to punctuation and spaces; the input question q is then expressed as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is the i-th word;
step 2.2: secondly, mapping the words into the same vector space using a word vector model to obtain a word embedding representation of each word; word embedding converts the words of a text into real-valued vectors, which makes them convenient to compute with; the word vector sequence h is expressed as:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of word q_i, and each word vector is P-dimensional (h_i ∈ R^P); the pre-trained GloVe word vector model is used here to obtain the word vector representation of each word; since the question text in the visual question-answer datasets used herein typically does not exceed 20 words, the processed word vectors are input into the GRU network according to the following equation:
s_i = GRU(h_i, s_(i-1))
wherein: s is the sentence feature of the input text, taken as the hidden state of the last time step, and h_i is the word vector of the input text.
Step 2.3: finally, the word vector is input into a cyclic neural network to extract the characteristics of sentences, namely the problem characteristics.
The question feature in step 2 is obtained using a GloVe word vector model pre-trained on a large corpus to produce a word vector representation for each word.
Step 3, in order to obtain the picture characteristics, the picture input in the step 1 is sent to a characteristic extraction network to obtain regional target characteristics composed of K regional characteristics with highest confidence; the feature extraction network used herein is the fast R-CNN network.
The step 3 is specifically implemented according to the following steps:
after the input picture is received, since not all elements in the picture are related to the question, an attention mechanism is applied to the picture representation in order to lock onto the target more accurately and find the key regions for answering the question. A top-down attention model is used, and the object detection network Faster R-CNN with high-level semantics is adopted to extract the picture features; firstly, an image feature map is extracted using a VGG or ResNet backbone network; then a region proposal network generates proposals, which are pooled into fixed-size proposal feature maps; classification and regression then yield accurate image features; finally, the top K candidate regions with the highest confidence are taken as the image features, with the extraction process:
V = [v_1, v_2, ..., v_K]
wherein: v_k represents any one candidate object, V represents the set of candidates selected by confidence, and each candidate object is D-dimensional (v_k ∈ R^D).
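The top-K selection at the end of step 3 is simple once the detector exposes per-region pooled features and confidence scores; the helper below is a sketch under that assumption (the top_k_region_features name and the zero-padding behaviour are illustrative, not from the patent):

    import torch

    def top_k_region_features(pooled_feats, scores, k=100):
        """Keep the K highest-confidence region features.

        pooled_feats: (R, D) per-region features from the Faster R-CNN head
        scores:       (R,)  confidence of each candidate region
        returns V:    (K, D) region target features, zero-padded if fewer
                      than K regions were detected
        """
        k_eff = min(k, scores.numel())
        top = scores.topk(k_eff).indices
        V = pooled_feats[top]
        if k_eff < k:  # pad to a fixed K so batches have a uniform shape
            V = torch.cat([V, V.new_zeros(k - k_eff, V.size(1))])
        return V

    # usage: 37 detected regions, D = 2048, padded up to K = 100
    V = top_k_region_features(torch.randn(37, 2048), torch.rand(37))  # -> (100, 2048)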
Step 4, iteratively updating the memory of the question features and the picture features obtained in the step 2 and the step 3 by using a multi-attention mechanism to generate context vectors required by answering the questions; adopting a mode of combining channel attention and space attention to update the memory of the answered questions once and once;
step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features obtained in step 2 and the picture features obtained in step 3 are fused;
step 4.2: secondly, as shown in the image channel feature diagram of FIG. 1, the object feature map first passes through channel attention to obtain a channel feature map closely related to the question; a spatial attention mechanism is then applied to the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key context information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t [m_(t-1); V_t^s; Q] + b)
wherein: [·;·] represents the feature concatenation operation, W_t represents the parameter update matrix, b represents the bias, V_t^s represents the new image features, t represents the iteration step, m_(t-1) represents the context memory, and Q represents the question vector. Channel attention mainly focuses on what the object is, computed through correlation with the question, while spatial attention locates the best object regions for answering the question, giving different weights to different object regions rather than treating every region equally. Each pass through the channel attention module and the spatial attention module generates a new image feature vector with which the scene memory is updated. Following previous visual question-answering work, the memory is updated using the ReLU activation function.
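One memory-update iteration can be sketched as follows; the sigmoid channel gate, the softmax spatial weighting and the layer names are plausible assumptions about the attention modules rather than the patent's exact design, while the final ReLU line implements the m_t equation above:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EpisodicMemory(nn.Module):
        """Sketch of one scene-memory iteration over region features
        V (B, K, D), question Q (B, D) and previous memory m_prev (B, D)."""
        def __init__(self, d=2048):
            super().__init__()
            self.chan_att = nn.Linear(2 * d, d)  # question-guided channel attention
            self.spat_att = nn.Linear(2 * d, 1)  # question-guided spatial attention
            self.update = nn.Linear(3 * d, d)    # W_t in the memory-update equation

        def forward(self, V, Q, m_prev):
            B, K, D = V.shape
            q = Q.unsqueeze(1).expand(-1, K, -1)
            # channel attention: gate each feature channel conditioned on the question
            Vc = V * torch.sigmoid(self.chan_att(torch.cat([V, q], dim=-1)))
            # spatial attention: weight the K regions on the channel-attended map
            a = F.softmax(self.spat_att(torch.cat([Vc, q], dim=-1)), dim=1)
            Vs = (a * Vc).sum(dim=1)             # attended visual feature V_t^s
            # m_t = ReLU(W_t [m_{t-1}; V_t^s; Q] + b)
            return F.relu(self.update(torch.cat([m_prev, Vs, Q], dim=-1)))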
And 5, sending the question features in the step 2 and the new graph features generated in the step 4 to a feature fusion device to jointly infer an answer, wherein the answer is selected from candidate answers with highest probability given by a classifier.
Step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q undergo feature fusion in the BLOCK multi-modal fusion manner to obtain the fused feature J. After the joint feature representation J is obtained, classification is performed using two fully connected layers. Answer prediction is then performed using the Sigmoid function in the DMN-MA model, which allows multiple correct answers for each question, with the score of each candidate answer ranging over (0, 1). Finally the candidate answer with the maximum probability value is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameters of the fully connected layer, b_j represents the bias term, and y represents the final answer; cross entropy is used as the loss function during training.
The specific process of the invention is shown in FIG. 2. Firstly, region target features are extracted from the input image and processed into a set of fact vectors, and the input question is encoded at the same time; a dynamic memory network model based on the multi-attention mechanism is then established, the question and image features are input and iterated several times, and the context memory is updated after each iteration until the answer with the highest probability is obtained; the fused features then interact with the question to produce new image features, and finally the new image features and the question jointly infer the answer, as sketched below. Compared with traditional methods that use holistic image features, or other graph-network visual question-answering methods that neglect the importance of relationships, the technical scheme of the invention effectively improves the performance of the visual question-answering model.
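Putting the pieces together, the overall inference flow reduces to a short loop; this sketch reuses the hypothetical modules from the earlier snippets and assumes the memory is initialized from the question vector:

    def answer(question_tokens, region_feats, encoder, memory, classifier, T=3):
        """End-to-end sketch: encode the question, attend over region features
        for T memory iterations, then score the candidate answers."""
        Q = encoder(question_tokens)      # (B, 2048) question feature
        m = Q                             # initial memory m_0 (an assumption)
        for _ in range(T):                # T = 3 iterations performed best (Table 1)
            m = memory(region_feats, Q, m)
        return classifier(m, Q)           # per-candidate-answer probabilities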
Simulation experiments and analysis of experimental results
1. Data set
The model was run on two public visual question-answering datasets, COCO-QA and VQA 2.0. The COCO-QA dataset pictures come from MS-COCO: 123,587 in total, of which 72,783 are used for training and 38,948 for testing; importantly, the answer distribution of the dataset's questions is relatively uniform. The VQA 2.0 dataset contains 204,721 pictures from MS-COCO, with 123,287 pictures for the training and validation sets (about 80,000 for training) and 81,434 pictures for the test set. The dataset has 614,163 questions in total, three questions per picture and ten answers per question, given by ten different annotators.
2. Experimental environment
The development framework is PyTorch 1.1.0 with the Python 3.6 development language. Specifically, the image input module uses K = 100, each object feature vector is 2048-dimensional, and image feature extraction uses ResNet-152 as the backbone network. The question module processes each question to a fixed length: excess words are discarded and short questions are padded with 0. The COCO-QA question length is fixed at 20 and the VQA 2.0 question length at 14. The word vector dimension is 300, the GRU hidden layer dimension is 2048, and the resulting question vector dimension is 2048. In the answer prediction stage, the COCO-QA dataset provides 430 answers; on the VQA 2.0 dataset, an answer is added to the candidate answer set if it appears more than 8 times in the training set, giving 3129 candidate answers in total.
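The VQA 2.0 candidate-answer vocabulary described above (answers occurring more than 8 times in the training set) can be built with a few lines; build_candidate_answers is an illustrative helper, not code from the patent:

    from collections import Counter

    def build_candidate_answers(train_answers, min_count=8):
        """Keep answers appearing more than min_count times in the training
        set; on VQA 2.0 this yields the 3129 candidate answers used here."""
        counts = Counter(train_answers)
        return sorted(a for a, c in counts.items() if c > min_count)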
All activation functions in the experiments use ReLU, and dropout with p = 0.5 is applied at the input and output layers to prevent overfitting. During training, all training samples are randomly shuffled; the batch size is set to 32 and the number of epochs to 20. The Adam stochastic gradient descent algorithm is used with an initial learning rate of 0.001; after training for 5 epochs, the DMN-MA model reduces the learning rate to 1/10 of its previous value every 3 epochs.
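This optimization schedule maps directly onto standard PyTorch components; the sketch below uses a placeholder model and expresses "divide the learning rate by 10 every 3 epochs after epoch 5" as a MultiStepLR schedule:

    import torch.nn as nn
    from torch.optim import Adam
    from torch.optim.lr_scheduler import MultiStepLR

    model = nn.Linear(2048, 3129)  # placeholder for the full DMN-MA network
    optimizer = Adam(model.parameters(), lr=1e-3)
    # decay to 1/10 after epoch 5, then every 3 epochs thereafter
    scheduler = MultiStepLR(optimizer, milestones=[5, 8, 11, 14, 17], gamma=0.1)

    for epoch in range(20):
        # ... one pass over the shuffled training data, batch size 32 ...
        scheduler.step()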
3. Experimental results and analysis
Because the number of iterations of the DMN-MA model's scene memory module is not fixed, different iteration counts were first tried on the COCO-QA and VQA 2.0 datasets to find the model's optimal configuration. The experimental results for overall accuracy versus iteration count on the two datasets are shown in Table 1.
Table 1 comparison of iteration number accuracy of scene memory module
As shown in Table 1, model accuracy rises as the number of iterations increases; at 3 iterations the overall accuracy on both datasets is highest, and increasing the iteration count further causes accuracy to drop sharply. Overall, the multi-attention mechanism achieves its highest accuracy at 3 iterations, so the experiments set the number of iterations to 3.
Next, to verify the effectiveness of the proposed model, table 2 lists the experimental results of the model and other mainstream methods on the COCO-QA test set.
TABLE 2 comparison of overall accuracy, WUPS index and other methods on COCO-QA dataset
As can be seen from Table 2, the overall accuracy of the proposed DMN-MA model reaches 64.57%, an improvement of 11.26% over the conventional VIS+LSTM method. In particular, compared with the classical visual question-answering attention method SAN, the overall accuracy improves by about 3%, and compared with the QPU model the accuracy improves by 2.07%. In addition, the model also performs remarkably well on WUPS 0.9 and WUPS 0.0. This shows that in visual question-answering research it is not sufficient to use only spatial attention for iterative reasoning; question-guided channel attention is equally important.
As shown in Table 3, the overall performance of the proposed DMN-MA model is 12.96% higher than the baseline model CNN+LSTM, 4.91% higher than the MCB model, and 2.54% higher than the ReasonNet model; in addition, the model's overall accuracy leads the classical top-down-attention visual question-answering system model by 1.51%. Notably, the DMN-MA model and the top-down-attention model use the same data preprocessing: Faster R-CNN extracts the visual features of the images and GloVe+GRU extracts the question features; the difference is that the top-down-attention model performs answer prediction with a spatial attention mechanism only, which fully demonstrates the effectiveness of the proposed model.
TABLE 3 accuracy comparison of question types on COCO-QA
In summary, comparing the DMN-MA model with various mainstream methods on the COCO-QA and VQA 2.0 datasets shows that the DMN-MA model combines the advantages of the multi-attention mechanism and the memory network, conforms better to the three-dimensional characteristics of the convolutional feature map, reduces the loss of context information during answer prediction, and achieves better performance.
4. Attention visualization
Several pictures and questions from the datasets were randomly chosen for an attention-visualization presentation of the proposed model, as shown in FIGS. 3-4. The question appears above FIG. 3; FIG. 3 is the original picture and FIG. 4 is the picture after attention visualization by the model; Ground Truth denotes the dataset answer and Prediction denotes the model's predicted answer.

Claims (3)

1. The visual question-answering method of the dynamic memory network model based on the multi-attention mechanism is characterized by comprising the following steps:
step 1, preprocessing an input image and text and sending them into the input module of the model, which is responsible for extracting image and text features, wherein object-level features are obtained;
step 2, extracting features of the question input in step 1, dividing the question into individual words according to its punctuation and spaces; then vectorizing the words with a pre-trained word model, and inputting the word vector representations into a recurrent neural network to obtain the hidden state of the last time step as the question feature;
the specific implementation mode of the step 2 is as follows:
step 2.1: first, the input question text is processed into a form acceptable to the model; the input question q is then expressed as:
q = [q_1, q_2, ..., q_N]
wherein: N is the sentence length and q_i is the i-th word;
step 2.2: secondly, mapping the words into the same vector space by using a word vector model to obtain a word embedding representation of each word; the word vector sequence h of the question is expressed as:
h = [h_1, h_2, ..., h_N]
wherein: h_i is the trained word vector of word q_i, and each word vector is P-dimensional (h_i ∈ R^P); the processed word vectors are input into the GRU network, a process represented by the following equation:
s_i = GRU(h_i, s_(i-1))
wherein: s is the sentence feature of the input text, taken as the hidden state of the last time step, and h_i is the word vector of the input text;
step 2.3: finally, inputting the word vectors into a recurrent neural network to extract the sentence feature, namely the question feature;
step 3, sending the picture input in step 1 into a feature extraction network to obtain region target features composed of the K regions with the highest confidence;
the step 3 is specifically implemented according to the following steps:
after the input picture is received, since not all elements in the picture are related to the question, an attention mechanism is applied to the picture representation in order to lock onto the target more accurately and find the key regions for answering the question; a top-down attention model is used, and the object detection network Faster R-CNN with high-level semantics is adopted to extract the picture features; firstly, an image feature map is extracted using a VGG or ResNet backbone network; then a region proposal network generates proposals, which are pooled into fixed-size proposal feature maps; classification and regression then yield accurate image features; finally, the top K candidate regions with the highest confidence are taken as the image features, with the extraction process:
V = [v_1, v_2, ..., v_K]
wherein: v_k represents any one candidate object, V represents the set of candidates selected by confidence, and each candidate object is D-dimensional (v_k ∈ R^D);
step 4, iteratively updating the memory over the question features and picture features obtained in steps 2 and 3 using a multi-attention mechanism to generate the context vector required for answering the question;
step 4 is specifically implemented according to the following steps:
step 4.1: firstly, the question features obtained in step 2 and the picture features obtained in step 3 are fused;
step 4.2: secondly, the object feature map first passes through channel attention to obtain a channel feature map closely related to the question; a spatial attention mechanism is then applied to the channel-attended feature map to obtain the object spatial regions closely related to the question; the model memory is updated on this basis, and the process is iterated to obtain the key context information for answering the question; the updated model memory m_t is:
m_t = ReLU(W_t [m_(t-1); V_t^s; Q] + b)
wherein: [·;·] represents the feature concatenation operation, W_t represents the parameter update matrix, b represents the bias, V_t^s represents the new image features, t represents the iteration step, m_(t-1) represents the context memory, and Q represents the question vector;
and step 5, sending the question features of step 2 and the new image features generated in step 4 to a feature fusion device to jointly infer the answer, wherein the answer is the candidate answer with the highest probability given by the classifier.
2. The visual question-answering method based on a dynamic memory network model with a multi-attention mechanism according to claim 1, wherein the question feature in step 2 is obtained using a GloVe word vector model pre-trained on a corpus to produce a word vector representation for each word.
3. The visual question-answering method of the dynamic memory network model based on the multi-attention mechanism according to claim 1, wherein the step 5 is specifically implemented according to the following steps:
first, the updated model memory m_t and the question vector Q undergo feature fusion in the BLOCK multi-modal fusion manner to obtain the fused feature J; after the joint feature representation J is obtained, classification is performed using two fully connected layers; answer prediction is then performed using the Sigmoid function in the DMN-MA model, which allows multiple correct answers for each question, with the score of each candidate answer ranging over (0, 1); finally the candidate answer with the maximum probability value is selected as the final answer of the model:
y = Sigmoid(W_j J + b_j)
wherein: W_j represents the parameters of the fully connected layer, b_j represents the bias term, and y represents the final answer; cross entropy is used as the loss function during training.
CN202111083704.6A, filed 2021-09-14 (priority date 2021-09-14) — Visual question-answering method of dynamic memory network model based on multi-attention mechanism — Active — granted as CN113886626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111083704.6A CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111083704.6A CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Publications (2)

Publication Number Publication Date
CN113886626A CN113886626A (en) 2022-01-04
CN113886626B (en) 2024-02-02

Family

ID=79009636

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111083704.6A Active CN113886626B (en) 2021-09-14 2021-09-14 Visual question-answering method of dynamic memory network model based on multi-attention mechanism

Country Status (1)

Country Link
CN (1) CN113886626B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022082238A (en) * 2020-11-20 2022-06-01 富士通株式会社 Machine learning program, machine learning method, and output device
CN114417044B (en) * 2022-01-19 2023-05-26 中国科学院空天信息创新研究院 Image question and answer method and device
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
CN114416914B (en) * 2022-03-30 2022-07-08 中建电子商务有限责任公司 Processing method based on picture question and answer

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11568315B2 (en) * 2019-03-22 2023-01-31 Royal Bank Of Canada Systems and methods for learning user representations for open vocabulary data sets

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
闫茹玉, 刘学亮. 结合自底向上注意力机制和记忆网络的视觉问答模型 [A visual question-answering model combining a bottom-up attention mechanism and a memory network]. 中国图象图形学报 (Journal of Image and Graphics), 2020(05). *

Also Published As

Publication number Publication date
CN113886626A (en) 2022-01-04

Similar Documents

Publication Publication Date Title
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN110188358B (en) Training method and device for natural language processing model
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
Zhang et al. Rich visual knowledge-based augmentation network for visual question answering
CN108829719A (en) The non-true class quiz answers selection method of one kind and system
CN111191078A (en) Video information processing method and device based on video information processing model
CN111046179B (en) Text classification method for open network question in specific field
CN114565104A (en) Language model pre-training method, result recommendation method and related device
CN110826338B (en) Fine-grained semantic similarity recognition method for single-selection gate and inter-class measurement
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN107480132A (en) A kind of classic poetry generation method of image content-based
CN111695052A (en) Label classification method, data processing device and readable storage medium
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN112733602B (en) Relation-guided pedestrian attribute identification method
CN114201592A (en) Visual question-answering method for medical image diagnosis
Puscasiu et al. Automated image captioning
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
CN115861995B (en) Visual question-answering method and device, electronic equipment and storage medium
CN114168769B (en) Visual question-answering method based on GAT relation reasoning
CN113609330B (en) Video question-answering system, method, computer and storage medium based on text attention and fine-grained information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant