CN110717431B - Fine-grained visual question and answer method combined with multi-view attention mechanism - Google Patents

Fine-grained visual question and answer method combined with multi-view attention mechanism

Info

Publication number
CN110717431B
CN110717431B
Authority
CN
China
Prior art keywords
attention
image
question
weight
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910927585.4A
Other languages
Chinese (zh)
Other versions
CN110717431A (en)
Inventor
彭淑娟
李磊
柳欣
范文涛
钟必能
杜吉祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN201910927585.4A priority Critical patent/CN110717431B/en
Publication of CN110717431A publication Critical patent/CN110717431A/en
Application granted granted Critical
Publication of CN110717431B publication Critical patent/CN110717431B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]

Abstract

The invention relates to a fine-grained visual question-answering method combined with a multi-view attention mechanism. The method fully considers the guiding effect of the specific semantics of a question and provides a multi-view attention model that can effectively select multiple salient target regions related to the current task target (the question). It learns region information related to the answer in the image and the question text from multiple views and extracts regional saliency features from the image under the guidance of question semantics, yielding finer-grained feature expression and a stronger ability to describe cases in which an image contains several important semantic regions. This increases the effectiveness and comprehensiveness of the multi-view attention model, effectively strengthens the semantic relevance between the salient image-region features and the question features, and improves the accuracy and comprehensiveness of semantic understanding in visual question answering. The method performs the visual question-answering task with simple steps, high efficiency, and high accuracy; it is ready for commercial application and has good market prospects.

Description

Fine-grained visual question and answer method combined with multi-view attention mechanism
Technical Field
The invention relates to the technical field of computer vision and natural language processing, and in particular to a fine-grained visual question-answering method combined with a multi-view attention mechanism.
Background
With the rapid development of computer vision and natural language processing, visual question answering has become an increasingly popular research area in artificial intelligence. Visual question answering is an emerging topic whose task combines the two fields of computer vision and natural language processing: given an image and a natural-language question about that image as input, it generates a natural-language answer as output. Visual question answering is a key application direction in artificial intelligence and, by simulating real-world scenes, can help visually impaired users engage in real-time human-computer interaction.
In essence, a visual question-answering system is treated as a classification task: the common practice is to extract image and question features from the given image and question, fuse the two kinds of features, and classify the fused representation to obtain the answer. In recent years, visual question answering has attracted a great deal of attention in the fields of computer vision and natural language processing. Because visual question answering is relatively complex and requires both image and text processing, some existing methods lack accuracy and face major challenges.
In practical applications, visual question-answering systems often face high-dimensional images and noise effects that can interfere with an algorithm's answer prediction. An effective visual question-answering model must therefore mine the structural features and semantically correlated parts of the image that are consistent with the question semantics in order to make fine-grained predictions.
The visual attention model is a computer simulation of the human visual attention mechanism, used to obtain the most noticeable part of an image, namely its salient region. In visual question answering, most methods that use a single attention mechanism ignore differences in the structural semantics of the image and perform poorly when the image contains several important regions, so the attention they produce inevitably limits the accuracy of visual question answering.
Research shows that most existing visual question-answering methods predict the semantic answer from the question and the whole picture without considering the guiding effect of the question's specific semantics, so the image-region features these models learn are only weakly related to the question features in semantic space.
In summary, in the prior art, effective visual question answering methods still need to be improved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a fine-grained visual question-answering method combined with a multi-view attention mechanism, which can effectively improve the accuracy and comprehensiveness of visual semantic information extraction and reduce the influence of redundant data and noise data, thereby improving the fine-grained identification capability and the judgment of complex problems of a visual question-answering system and improving the accuracy and the interpretability of a model of the visual question-answering system to a certain extent.
The technical scheme of the invention is as follows:
a fine-grained visual question-answering method combined with a multi-view attention mechanism comprises the following steps:
1) Inputting an image and extracting image features; inputting a question text and extracting question features;
2) Inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features of step 1) with the attention weight to obtain fine-grained image features;
3) Fusing the fine-grained image features with the question features to obtain fused features;
4) Inputting the fused features into a classifier and predicting the answer.
Preferably, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight reflecting that different target regions in the image receive different amounts of attention.
Preferably, the method for obtaining the single attention weight is as follows:

Input the image features and question features into the upper-layer attention model, project each into a common dimensional space with a fully connected layer, and normalize the vectors with the ReLU activation function; then fuse them with the Hadamard product and pass the result through two fully connected layers in turn to obtain the learned unnormalized weight

$$\hat{a}_u = W_{u_2}\,\mathrm{ReLU}\big(W_{u_1}\,(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$$

Finally, normalize the weight with the softmax function to obtain the single attention weight

$$a_u = \mathrm{softmax}(\hat{a}_u)$$

where $V_i \in \mathbb{R}^{d \times K}$ is the image feature, $q_i$ is the question feature, and $W_v$, $W_q$, $W_{u_1}$, $W_{u_2}$ are the weight parameters to be learned by the upper-layer attention model; K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set by the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).
Preferably, the method for obtaining the saliency attention weight is as follows:

Input the image features and question features into the lower-layer attention model, project each into a common dimensional space with a fully connected layer, and then compute the correlation matrix

$$C_i = \mathrm{ReLU}(q_i^{\top} W_b V_i)$$

where $W_b$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting correlation matrix.

Multiply the correlation matrix, treated as a feature, by the question features and fuse the result with the input image features through learned parameters $W_{b_1}$, $W_{b_2}$, giving the fused unnormalized weight $\hat{a}_b$. Finally, normalize the weight with the softmax function and output the saliency attention weight

$$a_b = \mathrm{softmax}(\hat{a}_b)$$

where $W_b$, $W_{b_1}$, and $W_{b_2}$ are the weight parameters to be learned by the lower-layer attention model.
Preferably, the attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

where $\beta_1$ and $\beta_2$ are hyper-parameters giving the weight ratio of the upper-layer and lower-layer attention models.
Preferably, in step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the vectors are normalized with the ReLU activation function; the two are then fused with the Hadamard product to obtain the fused feature

$$h_i = f_v(\hat{v}_i) \circ f_q(q_i)$$
Preferably, in step 4), the fused feature is passed through a nonlinear layer $f_o$, in which the vector is normalized with the ReLU activation function; a linear mapping $w_o$ then predicts the candidate scores of the answers

$$\hat{s} = \sigma(w_o\, f_o(h_i))$$

and finally the highest-scoring answer is selected as the output, where $\sigma$ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.
Preferably, the sigmoid activation function normalizes the final scores into the interval (0, 1), so that the last stage acts as a logistic regression predicting the correctness of each candidate answer, with objective function

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$$

where the indices z and k range over the M training questions and N candidate answers respectively, and $s_{zk}$ is the ground-truth answer score for the question.
Preferably, in step 1), a Faster R-CNN standard model is used to extract features from the input image $I_i$, obtaining the deeply expressed image features $V_i = \mathrm{FasterRCNN}(I_i)$.
Preferably, in step 1), after the question text $Q_i$ is input, the text is first split into words using spaces and punctuation marks and initialized with the pre-trained GloVe word-embedding method to obtain the encoded form of the i-th question

$$Q_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$$

where $x_t^{(i)}$ denotes the index of the t-th word in the vocabulary. $Q_i$ is then fed into an LSTM network, and the last-layer output $q_i$ is taken as the representation of $Q_i$, giving the question feature $q_i$.
The invention has the following beneficial effects:
the invention provides a multi-view attention model by combining a fine-grained vision question-answering method of a multi-view attention mechanism, which can effectively select a plurality of significant target areas related to a current task target (question), extract area significance characteristics in an image under the guidance of question semantics, has fine-grained characteristic expression, expresses the condition that a plurality of important semantic expression areas exist in the image, and has strong depicting capability.
The method fully considers the guiding effect of the specific semantics of the question and learns region information related to the answer from the image and the question text from multiple views, improving the effectiveness and comprehensiveness of the multi-view attention model. This effectively strengthens the semantic relevance between the salient image-region features and the question features and improves the accuracy and comprehensiveness of semantic understanding in visual question answering.
The method performs the visual question-answering task with simple steps, high efficiency, and high accuracy; it is ready for commercial application and has good market prospects.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a multi-view attention model;
FIG. 3 is an attention-weight visualization heat map (a simple attention task);
FIG. 4 is an attention-weight visualization heat map (a task requiring attention to multiple locations in the image);
FIG. 5 is a graph comparing the results of the multi-view attention model of the invention with current state-of-the-art methods;
FIG. 6 is the loss curve from training the final model;
FIG. 7 is the training/validation score curve from training the final model.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The invention provides a fine-grained visual question-answering method combined with a multi-view attention mechanism to overcome the defects of the prior art. Visual question answering can be regarded as a multi-task classification problem, with each answer treated as a classification category. In a typical visual question-answering system, answers are encoded with the One-Hot method to obtain a One-Hot vector for each answer, forming an answer vector table. One-Hot encoding represents categorical variables as binary vectors: each categorical value is first mapped to an integer, and each integer is then represented as a binary vector that is all zeros except for a 1 at the index of that integer.
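As a toy illustration of this encoding (the four-answer vocabulary below is an invented stand-in, not the answer table actually used by the system), a minimal sketch in Python:

```python
# Minimal sketch of One-Hot answer encoding as described above.
# The answer vocabulary is an illustrative assumption.
import numpy as np

answers = ["yes", "no", "2", "red"]              # toy answer vocabulary
answer_to_idx = {a: i for i, a in enumerate(answers)}

def one_hot(answer: str, num_classes: int) -> np.ndarray:
    """Map an answer to a binary vector: all zeros except a 1 at its index."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[answer_to_idx[answer]] = 1.0
    return vec

print(one_hot("no", len(answers)))               # [0. 1. 0. 0.]
```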
As shown in fig. 1, the fine-grained visual question-answering method combined with a multi-view attention mechanism according to the present invention generally comprises the following steps:
1) Inputting an image and extracting image features; inputting a question text and extracting question features;
2) Inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features of step 1) with the attention weight to obtain fine-grained image features;
3) Fusing the fine-grained image features with the question features to obtain fused features;
4) Inputting the fused features into a classifier and predicting the answer.
In this embodiment, in step 1), the Faster R-CNN standard model is used to extract features from the input image $I_i$, obtaining the image features $V_i = \mathrm{FasterRCNN}(I_i)$. Let K be the number of spatial regions of the image features; the image features can then be further expressed as

$$V_i = \{v_1^{(i)}, v_2^{(i)}, \ldots, v_K^{(i)}\}, \quad v_k^{(i)} \in \mathbb{R}^d$$

where $v_k^{(i)}$ is the k-th regional feature extracted by Faster R-CNN and d is the number of hidden neurons in the network layer, which also represents the output dimension.
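A hedged sketch of this extraction step is shown below, using torchvision's off-the-shelf Faster R-CNN rather than the patent's exact detector configuration; the backbone choice, the cap of K = 36 regions, the pooled feature dimension d = 256, and the use of one FPN level are all illustrative assumptions:

```python
# Sketch: pool K regional features V_i from an image with a Faster R-CNN,
# assuming torchvision's pretrained detector and a single FPN level.
import torch
import torchvision
from torchvision.ops import roi_align

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)                  # dummy RGB image in [0, 1]
with torch.no_grad():
    detections = model([image])[0]               # detected boxes = salient regions
    boxes = detections["boxes"][:36]             # keep up to K = 36 regions
    feats = model.backbone(image.unsqueeze(0))["0"]   # stride-4 feature map
    # Pool one fixed-size feature vector per region box.
    regions = roi_align(feats, [boxes], output_size=(1, 1), spatial_scale=0.25)
    V = regions.flatten(1)                       # (K, d) regional features
print(V.shape)
```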
In step 1), after the question text $Q_i$ is input, the text is first split into words using spaces and punctuation marks and initialized with the pre-trained GloVe word-embedding method (Global Vectors for Word Representation) to obtain the encoded form of the i-th question

$$Q_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$$

where $x_t^{(i)}$ denotes the index of the t-th word in the vocabulary. $Q_i$ is then fed into an LSTM network; specifically, a standard LSTM network with 1280 hidden units is used, and the last-layer output $q_i$ is taken as the representation of $Q_i$, giving the question feature $q_i$.
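The question-encoding pipeline can be sketched as follows. The 1280 hidden units follow the text; the toy vocabulary, the randomly initialized embedding table (standing in for real GloVe vectors), and the 300-dimensional embeddings are assumptions:

```python
# Sketch of the question encoder: tokenize on spaces/punctuation, look up
# word embeddings (a stand-in for pre-trained GloVe), run an LSTM, and keep
# the last output as the question feature q_i.
import re
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "what": 1, "color": 2, "is": 3, "the": 4, "dog": 5}
embed = nn.Embedding(len(vocab), 300)            # placeholder for GloVe vectors
lstm = nn.LSTM(input_size=300, hidden_size=1280, batch_first=True)

def encode_question(text: str) -> torch.Tensor:
    tokens = re.findall(r"\w+", text.lower())    # split on spaces/punctuation
    ids = torch.tensor([[vocab.get(t, 0) for t in tokens]])
    out, _ = lstm(embed(ids))                    # (1, T, 1280)
    return out[:, -1]                            # last output as q_i

q = encode_question("What color is the dog?")
print(q.shape)                                   # torch.Size([1, 1280])
```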
Then the acquired image features $V_i$ and the encoded question features $q_i$ are both input into the multi-view attention model, and the attention weight of the image is calculated.
The visual attention mechanism essentially selects, from the image, the target regions most critical to the current task target, so that more attention resources are devoted to those regions to acquire detailed information about the targets of interest while other useless information is suppressed. In the visual question-answering task, semantic expressions are diverse; in particular, some questions require the model to understand semantic expressions among multiple target objects in an image. A single visual attention model therefore cannot effectively mine the relevance between the different semantic objects in the image and the question semantics.
in order to solve the problem, the invention provides a multi-view attention model, which uses two different attention mechanisms to jointly learn important area parts with different semantics, which can be focused on in the problem, so as to obtain a fine-grained attention feature map of an image. The attention model of the multi-view angle is used for paying attention to the image to obtain the attention weight of the image, the weight is used for carrying out image feature weighting to obtain an accumulated vector as a final image feature representation, namely, fine-grained features of the image can be well associated with problem semantics.
As shown in fig. 2, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight reflecting that different target regions in the image receive different amounts of attention.
Specifically, in the upper-layer attention model, the method for obtaining the single attention weight is as follows:

Input the image features and question features into the upper-layer attention model, project each into a common dimensional space with a fully connected layer, and normalize the vectors with the ReLU activation function; then fuse them with the Hadamard product and pass the result through two fully connected layers in turn to obtain the learned unnormalized weight

$$\hat{a}_u = W_{u_2}\,\mathrm{ReLU}\big(W_{u_1}\,(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$$

where $V_i \in \mathbb{R}^{d \times K}$ is the image feature, $q_i$ is the question feature, and $W_v$, $W_q$, $W_{u_1}$, $W_{u_2}$ are the weight parameters to be learned by the upper-layer attention model; K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set by the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).

Finally, the weight is normalized with the softmax function to obtain the single attention weight

$$a_u = \mathrm{softmax}(\hat{a}_u)$$
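A minimal PyTorch sketch of this upper-layer attention is given below. Only the overall structure (projection, ReLU, Hadamard fusion, two fully connected layers, softmax) follows the text; the layer sizes d = 2048, h = 1024 and the K = 36 regions are illustrative assumptions:

```python
# Sketch of the upper-layer (single) attention over K image regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleAttention(nn.Module):
    def __init__(self, d_img=2048, d_q=1280, h=1024):
        super().__init__()
        self.proj_v = nn.Linear(d_img, h)    # project image features
        self.proj_q = nn.Linear(d_q, h)      # project question feature
        self.fc1 = nn.Linear(h, h)           # first of two FC layers
        self.fc2 = nn.Linear(h, 1)           # second FC layer -> one score/region

    def forward(self, V, q):                 # V: (K, d_img), q: (d_q,)
        Vp = F.relu(self.proj_v(V))          # (K, h)
        qp = F.relu(self.proj_q(q))          # (h,)
        fused = Vp * qp                      # Hadamard fusion, q broadcast over K
        scores = self.fc2(F.relu(self.fc1(fused)))    # (K, 1)
        return F.softmax(scores.squeeze(-1), dim=0)   # single attention a_u

a_u = SingleAttention()(torch.rand(36, 2048), torch.rand(1280))
print(a_u.sum())                             # ~1.0 after softmax
```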
With the single attention weight $a_u$, the softmax tends to make one weight large while the remaining weights stay small. Since an image often contains several different semantics, and these semantics are usually expressed visually in different regions, the single attention weight $a_u$ often ignores regions carrying important semantics. To supplement the attention information missed by the upper-layer attention model, the invention further provides the lower-layer attention model. The lower-layer attention model simultaneously accounts for the relevance between the image and the question semantics, realizing a question-guided multi-view attention learning mechanism and increasing the capacity for fine-grained feature mining.
Specifically, in the lower-layer attention model, the method for obtaining the saliency attention weight is as follows:

Input the image features and question features into the lower-layer attention model, project each into a common dimensional space with a fully connected layer, and then compute the correlation matrix

$$C_i = \mathrm{ReLU}(q_i^{\top} W_b V_i)$$

where $W_b$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting correlation matrix.

Multiply the correlation matrix, treated as a feature, by the question features and fuse the result with the input image features through learned parameters $W_{b_1}$, $W_{b_2}$, giving the fused unnormalized weight $\hat{a}_b$. Finally, normalize the weight with the softmax function and output the saliency attention weight

$$a_b = \mathrm{softmax}(\hat{a}_b)$$

where $W_b$, $W_{b_1}$, and $W_{b_2}$ are the weight parameters to be learned by the lower-layer attention model; the parameter dimensions are set consistently with the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension of the layer, and ReLU is an activation function in the neural network.
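A sketch of the lower-layer attention follows, under the same caveats. The exact fusion of the correlation matrix with the question and image features is not fully specified above, so the version below (a question context vector per region, Hadamard fusion with the image features, one scoring layer) is only one plausible reading:

```python
# Sketch of the lower-layer (saliency) attention: correlation matrix
# C = ReLU(Q W_b V^T), question context per region, fusion, softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyAttention(nn.Module):
    def __init__(self, d=1024):
        super().__init__()
        self.W_b = nn.Parameter(torch.empty(d, d))   # correlation weights
        nn.init.xavier_uniform_(self.W_b)
        self.fc = nn.Linear(d, 1)                    # scoring layer

    def forward(self, V, Q):                 # V: (K, d) regions, Q: (T, d) words
        C = F.relu(Q @ self.W_b @ V.T)       # correlation matrix (T, K)
        ctx = C.T @ Q                        # question context per region (K, d)
        fused = ctx * V                      # fuse with image features (Hadamard)
        scores = self.fc(fused).squeeze(-1)  # (K,)
        return F.softmax(scores, dim=0)      # saliency attention a_b

a_b = SaliencyAttention()(torch.rand(36, 1024), torch.rand(14, 1024))
print(a_b.shape)                             # torch.Size([36])
```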
The attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

where $\beta_1$ and $\beta_2$ are the weight-ratio parameters of the upper-layer and lower-layer attention models. In practice, the weights of the two models can be balanced by tuning these parameters to achieve better results.

The image features $V_i$ can further be expressed as the set of K spatial region features $V_i = \{v_1^{(i)}, \ldots, v_K^{(i)}\}$. The attention weight $a_i$ is then multiplied with the feature of each spatial region to obtain the fine-grained image feature

$$\hat{v}_i = \sum_{k=1}^{K} a_{ik}\, v_k^{(i)}$$
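Combining the two views and weighting the regional features then takes only a few lines; $\beta_1$ and $\beta_2$ take the values reported in the experiment section below, and the random vectors stand in for real attention outputs:

```python
# Sketch: a = beta1 * a_u + beta2 * a_b, then v_hat = sum_k a_k * v_k.
import torch

beta1, beta2 = 0.7, 0.3                      # weight ratio from the experiments
a_u, a_b = torch.rand(36), torch.rand(36)
a_u, a_b = a_u / a_u.sum(), a_b / a_b.sum()  # stand-ins for softmax outputs

a = beta1 * a_u + beta2 * a_b                # combined attention weight
V = torch.rand(36, 2048)                     # K regional features
v_hat = (a.unsqueeze(-1) * V).sum(dim=0)     # fine-grained image feature
print(v_hat.shape)                           # torch.Size([2048])
```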
In step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the vectors are normalized with the ReLU activation function; the two are then fused with the Hadamard product to obtain the fused feature

$$h_i = f_v(\hat{v}_i) \circ f_q(q_i)$$
Visual question answering is a multi-label classification problem. Further, in step 4), the fused feature is passed through a nonlinear layer $f_o$, in which the vector is normalized with the ReLU activation function; a linear mapping $w_o$ then predicts the candidate scores of the answers

$$\hat{s} = \sigma(w_o\, f_o(h_i))$$

and finally the highest-scoring answer is selected as the output, where $\sigma$ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.
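Steps 3) and 4) together can be sketched as follows; the dimensions and the single-linear-layer realization of $f_v$, $f_q$, $f_o$ are illustrative assumptions:

```python
# Sketch of feature fusion and answer scoring: nonlinear layers f_v, f_q,
# Hadamard fusion, nonlinear layer f_o, linear map w_o, sigmoid scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_v, d_q, h, N = 2048, 1280, 1024, 3000
f_v = nn.Linear(d_v, h)                      # nonlinear layer for image feature
f_q = nn.Linear(d_q, h)                      # nonlinear layer for question
f_o = nn.Linear(h, h)                        # nonlinear layer before scoring
w_o = nn.Linear(h, N)                        # linear map over N candidate answers

v_hat, q = torch.rand(d_v), torch.rand(d_q)
fused = F.relu(f_v(v_hat)) * F.relu(f_q(q))  # Hadamard fusion h_i
scores = torch.sigmoid(w_o(F.relu(f_o(fused))))   # candidate answer scores
print(scores.argmax().item())                # pick the highest-scoring answer
```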
Preferably, the sigmoid activation function normalizes the final scores into the interval (0, 1), so that the last stage acts as a logistic regression predicting the correctness of each candidate answer, with objective function

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$$

where the indices z and k range over the M training questions and N candidate answers respectively, and $s_{zk}$ is the ground-truth answer score for the question.
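This objective is ordinary binary cross-entropy with soft targets; a minimal sketch (the target value 0.9 and the batch shape are illustrative):

```python
# Sketch of the soft-target logistic-regression loss over M questions
# and N candidate answers.
import torch
import torch.nn.functional as F

M, N = 4, 3000
pred = torch.rand(M, N)                      # sigmoid-normalized scores
target = torch.zeros(M, N)
target[:, 7] = 0.9                           # soft ground-truth scores s_zk
loss = F.binary_cross_entropy(pred, target, reduction="sum")
print(loss.item())
```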
Compared with the softmax classifier commonly used in other visual question-answering systems, the logistic-regression classification used here is more effective. The sigmoid function uses soft scores (soft targets) as the target results, providing a richer training signal that effectively captures the occasional uncertainty in ground-truth answers.
To better observe how the attention model attends to the salient regions of the image, after obtaining the attention maps $(a_u, a_b)$ of the single attention weight and the saliency attention weight, the attention maps are visualized as matrix heat maps using the matplotlib plotting library in Python, as shown in figs. 3 and 4.
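A minimal sketch of such a visualization is given below; the random matrices stand in for the real attention maps, and since matplotlib itself has no heatmap function, `plt.imshow` is used (seaborn.heatmap would be an alternative):

```python
# Sketch: render the attention maps (a_u, a_b) as matrix heat maps.
import numpy as np
import matplotlib.pyplot as plt

a_u = np.random.rand(6, 6)                   # attention reshaped to a grid
a_b = np.random.rand(6, 6)
fig, axes = plt.subplots(1, 2)
for ax, a, title in zip(axes, [a_u, a_b], ["attention1", "attention2"]):
    ax.imshow(a, cmap="hot")                 # matrix heat map
    ax.set_title(title)
plt.savefig("attention_heatmaps.png")
```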
Figs. 3 and 4 show the upper-layer and lower-layer attention models of the multi-view attention model on two different task images, where attention1 is the attention visualization of the upper-layer model and attention2 that of the lower-layer model. The heat maps show that the added lower-layer attention model is able to learn different important regions of the input image. As seen in fig. 3, for a simple attention task, both the upper and lower attention models find the correct position in the image. In fig. 4, however, when the task requires attention to several locations in the image, the lower-layer attention model focuses on different parts than the upper-layer model, improving the accuracy of the multi-view attention model; this is an advantage over prior-art models.
Introduction to the test data set: the VQA v2 dataset (Antol S, Agrawal A, Lu J, et al. VQA: Visual Question Answering [C]. Proceedings of the IEEE International Conference on Computer Vision, 2015: 2425-2433.) is a large-scale visual question-answering dataset in which all questions and answers are manually annotated. The dataset contains 443,757 training questions, 214,354 validation questions, and 447,793 test questions. Each image is associated with three questions, and ten answers are provided for each question by annotators. Following the standard visual question-answering task, the questions in this dataset are classified as: yes/no, number, and other.
Further, to verify the effectiveness of the invention, it was compared with the results of the 2017 VQA Challenge winner (Anderson P, He X, Buehler C, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. arXiv preprint arXiv:1707.07998, 2017.). As shown in fig. 5, the original single attention model in the reproduced paper code was replaced with the multi-view attention model; the multi-view attention model of the invention finally scored 64.35%, about 1.2% higher in accuracy than the paper's reported evaluation.
The basic parameters in the experiments were set as follows: the base learning rate was α = 0.0007, the random-deactivation rate after each LSTM layer was dropout = 0.3, and answer screening used N = 3000. The fully connected layers used num_hid = 1024 hidden neurons, and the training batch size was batch_size = 512. The weights of the single attention weight and the saliency attention weight were set to $\beta_1 = 0.7$ and $\beta_2 = 0.3$.
As shown in fig. 6, the loss value of the model decreases as the training epochs increase; fig. 7 shows the model accuracy on the training and validation sets, respectively, as the training epochs increase.
Table 1 compares the invention with representative methods on the VQA task over the public standard dataset VQA v2 in the test-dev setting.
TABLE 1
[Table 1: answer accuracy (%) of each compared method on VQA v2 test-dev, by question type (Yes/No, Number, Other) and overall]
Specifically, the data were evaluated in three categories according to question type, after which the overall evaluation result was computed. The question types are yes/no questions, number questions, and other open-ended questions. The scores in the table are the accuracies of the models on the different question types; larger values indicate higher accuracy. As the table shows, the multi-view attention model of the invention achieves better results across the different tasks.
In particular, the multi-view attention model strengthens fine-grained feature expression and improves object detection and recognition, giving a clear improvement on the Number evaluation compared with previous methods. The overall accuracy of the model is better than that of most existing methods.
The above examples are provided only for illustrating the present invention and are not intended to limit the present invention. Changes, modifications, etc. to the above-described embodiments are intended to fall within the scope of the claims of the present invention as long as they are in accordance with the technical spirit of the present invention.

Claims (7)

1. A fine-grained visual question-answering method combined with a multi-view attention mechanism is characterized by comprising the following steps:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features of step 1) with the attention weight to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier and predicting the answer;
the multi-view attention model comprises an upper layer attention model and a lower layer attention model, wherein a single attention weight is obtained through the upper layer attention model, a significant attention weight is obtained through the lower layer attention model, and the significant attention weight reflects different attention resources corresponding to different target areas in an image;
the method of obtaining a single attention weight is as follows:
inputting image characteristics and problem characteristics to an upper layer attention model, respectively projecting data of the image characteristics and the problem characteristics to a same dimensional space by using a layer of full connection layer, and normalizing vectors by using an activation function ReLu; then, the Hadamard product is used for fusion, and then two full-connection layers are sequentially input for processing learning parameters, and the learned parameters are processed
Figure FDA0003980779750000011
Finally, the weight is normalized by using a softmax function to obtain a single attention weight
Figure FDA0003980779750000012
Wherein the content of the first and second substances,
Figure FDA0003980779750000013
in order to be a feature of the image,
Figure FDA0003980779750000014
in order to be a characteristic of the problem,
Figure FDA0003980779750000015
Figure FDA0003980779750000016
the weight parameters to be learned for the upper layer attention model, K is the number of spatial regions of image features, T is the length of selected problem features, d is the number of hidden neurons in a network layer, h is the output dimension set by the layer, reLu is an activation function in a neural network, and the specific form of the ReLu can be expressed as f (x) = max (0, x);
the method for obtaining the significance attention weight is as follows:
inputting image characteristics and problem characteristics to a lower-layer attention model, respectively projecting data of the image characteristics and the problem characteristics to the same dimensional space by using a full-connection layer, and calculating a correlation matrix C i =ReLu(q i T W b V i ) (ii) a Wherein the content of the first and second substances,
Figure FDA0003980779750000017
the weight parameters to be learned for the underlying attention model,
Figure FDA0003980779750000018
obtaining an incidence matrix;
multiplying the incidence matrix as a characteristic by the problem characteristic, and fusing the incidence matrix with the input image characteristic, wherein the fused parameter is
Figure FDA0003980779750000021
Finally, the weight is normalized by using a softmax function, and the significance attention weight is output
Figure FDA0003980779750000022
Wherein the content of the first and second substances,
Figure FDA0003980779750000023
weight parameters to be learned for the underlying attention model.
2. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein the attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

wherein $\beta_1$ and $\beta_2$ are hyper-parameters giving the weight ratio of the upper-layer and lower-layer attention models.
3. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 2, wherein in step 3) the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the vectors are normalized with the ReLU activation function, and are then fused with the Hadamard product to obtain the fused feature

$$h_i = f_v(\hat{v}_i) \circ f_q(q_i)$$
4. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 3, wherein in step 4) the fused feature is passed through a nonlinear layer $f_o$, in which the vector is normalized with the ReLU activation function; a linear mapping $w_o$ then predicts the candidate scores of the answers

$$\hat{s} = \sigma(w_o\, f_o(h_i))$$

and finally the highest-scoring answer is selected as the output, wherein $\sigma$ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.
5. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 4, wherein the sigmoid activation function normalizes the final scores into the interval (0, 1), the last stage acting as a logistic regression predicting the correctness of each candidate answer, with objective function

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$$

wherein the indices z and k range over the M training questions and N candidate answers respectively, and $s_{zk}$ is the ground-truth answer score for the question.
6. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1) a Faster R-CNN standard model is used to extract features from the input image $I_i$, obtaining the deeply expressed image features $V_i = \mathrm{FasterRCNN}(I_i)$.
7. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1), after the question text $Q_i$ is input, the text is first split into words using spaces and punctuation marks and initialized with the pre-trained GloVe word-embedding method to obtain the encoded form of the i-th question

$$Q_i = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_T^{(i)}\}$$

wherein $x_t^{(i)}$ denotes the index of the t-th word in the vocabulary; $Q_i$ is then fed into an LSTM network, and the last-layer output $q_i$ is taken as the representation of $Q_i$, giving the question feature $q_i$.
CN201910927585.4A 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism Active CN110717431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910927585.4A CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910927585.4A CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Publications (2)

Publication Number Publication Date
CN110717431A CN110717431A (en) 2020-01-21
CN110717431B true CN110717431B (en) 2023-03-24

Family

ID=69211080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910927585.4A Active CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Country Status (1)

Country Link
CN (1) CN110717431B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325000B (en) * 2020-01-23 2021-01-26 北京百度网讯科技有限公司 Language generation method and device and electronic equipment
CN111325243B (en) * 2020-02-03 2023-06-16 天津大学 Visual relationship detection method based on regional attention learning mechanism
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN111860653A (en) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 Visual question answering method and device, electronic equipment and storage medium
CN111984772B (en) * 2020-07-23 2024-04-02 中山大学 Medical image question-answering method and system based on deep learning
CN112100346B (en) * 2020-08-28 2021-07-20 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112163608B (en) * 2020-09-21 2023-02-03 天津大学 Visual relation detection method based on multi-granularity semantic fusion
CN112488111B (en) * 2020-12-18 2022-06-14 贵州大学 Indication expression understanding method based on multi-level expression guide attention network
CN112732879B (en) * 2020-12-23 2022-05-10 重庆理工大学 Downstream task processing method and model of question-answering task
CN112905819B (en) * 2021-01-06 2022-09-23 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113223018A (en) * 2021-05-21 2021-08-06 信雅达科技股份有限公司 Fine-grained image analysis processing method
CN113407794B (en) * 2021-06-01 2023-10-31 中国科学院计算技术研究所 Visual question-answering method and system for inhibiting language deviation
CN113436094B (en) * 2021-06-24 2022-05-31 湖南大学 Gray level image automatic coloring method based on multi-view attention mechanism
CN113408511B (en) * 2021-08-23 2021-11-12 南开大学 Method, system, equipment and storage medium for determining gazing target
CN113792617B (en) * 2021-08-26 2023-04-18 电子科技大学 Image interpretation method combining image information and text information
CN113779298B (en) * 2021-09-16 2023-10-31 哈尔滨工程大学 Medical vision question-answering method based on composite loss
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114417044B (en) * 2022-01-19 2023-05-26 中国科学院空天信息创新研究院 Image question and answer method and device
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Answer Selection Methods Based on Attention Mechanisms; Xiong Xue et al.; Intelligent Computer and Applications; 2018-11-05 (No. 06); full text *

Also Published As

Publication number Publication date
CN110717431A (en) 2020-01-21

Similar Documents

Publication Publication Date Title
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111554268B (en) Language identification method based on language model, text classification method and device
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN110837846B (en) Image recognition model construction method, image recognition method and device
CN113657425B (en) Multi-label image classification method based on multi-scale and cross-modal attention mechanism
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN111209384A (en) Question and answer data processing method and device based on artificial intelligence and electronic equipment
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
Wang et al. Spatial–temporal pooling for action recognition in videos
AU2019101138A4 (en) Voice interaction system for race games
CN113011357A (en) Depth fake face video positioning method based on space-time fusion
CN114239585A (en) Biomedical nested named entity recognition method
KR20200010672A (en) Smart merchandise searching method and system using deep learning
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
CN110705490A (en) Visual emotion recognition method
CN112131345A (en) Text quality identification method, device, equipment and storage medium
Xia et al. Evaluation of saccadic scanpath prediction: Subjective assessment database and recurrent neural network based metric
Yan Computational Methods for Deep Learning: Theory, Algorithms, and Implementations
CN113378919B (en) Image description generation method for fusing visual sense and enhancing multilayer global features
Chen et al. STRAN: Student expression recognition based on spatio-temporal residual attention network in classroom teaching videos
Gong et al. Human interaction recognition based on deep learning and HMM
Ling et al. A facial expression recognition system for smart learning based on YOLO and vision transformer
Vijayaraju Image retrieval using image captioning
Tamaazousti On the universality of visual and multimodal representations
CN113239678A (en) Multi-angle attention feature matching method and system for answer selection

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant