CN110717431A - Fine-grained visual question and answer method combined with multi-view attention mechanism - Google Patents
- Publication number
- CN110717431A (application number CN201910927585.4A)
- Authority
- CN
- China
- Prior art keywords
- attention
- image
- question
- layer
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V 20/10 — Scenes; scene-specific elements: terrestrial scenes
- G06F 16/3329 — Information retrieval: natural language query formulation or dialogue systems
- G06F 16/583 — Image retrieval using metadata automatically derived from the content
- G06F 16/5866 — Image retrieval using manually generated metadata, e.g. tags, keywords, comments
- G06N 3/045 — Neural networks: combinations of networks
- G06N 3/048 — Neural networks: activation functions
- G06N 3/08 — Neural networks: learning methods
- G06V 10/25 — Image preprocessing: determination of region of interest [ROI] or volume of interest [VOI]
Abstract
The invention relates to a fine-grained visual question-answering method combined with a multi-view attention mechanism. The method fully exploits the guiding effect of the specific semantics of a question and provides a multi-view attention model that can effectively select multiple salient target regions relevant to the current task target (the question). It learns the region information related to the answer from the image and the question text across multiple views, and extracts regional saliency features from the image under the guidance of the question semantics. The resulting feature expression is finer-grained and better depicts images that contain several semantically important regions, which increases the effectiveness and comprehensiveness of the multi-view attention model, effectively strengthens the semantic relevance between the salient image-region features and the question features, and improves the accuracy and comprehensiveness of semantic understanding in visual question answering. The method performs the visual question-answering task with simple steps, high efficiency and high accuracy, and is well suited to commercial application with good market prospects.
Description
Technical Field
The invention relates to the technical field of computer vision and natural language processing, and in particular to a fine-grained visual question-answering method combined with a multi-view attention mechanism.
Background
With the rapid development of computer vision and natural language processing, visual question answering has become one of the most popular research fields in artificial intelligence. Visual question answering is an emerging topic that combines these two disciplines: given an image and a natural-language question about that image as input, the task is to generate a natural-language answer as output. As a key application direction of artificial intelligence, visual question answering can, by modeling real-world scenes, help visually impaired users interact with their environment in real time.
In essence, a visual question-answering system is treated as a classification task: the common practice is to extract image features and question features from the given image and question, fuse the two, and classify the fused representation to obtain the answer. In recent years, visual question answering has attracted considerable attention in both computer vision and natural language processing. Because the task is relatively complex and requires joint image and text processing, existing methods still lack accuracy and face major challenges.
In practical applications, visual question-answering systems often face high-dimensional images and noise that degrade the algorithm's answer prediction. An effective visual question-answering model should therefore mine the structural features and semantically relevant parts of the image that are consistent with the question semantics, enabling fine-grained prediction.
The visual attention model is a computational model of the human visual attention mechanism that identifies the most noticeable part of an image, namely its salient region. In visual question answering, most methods that use a single attention mechanism ignore differences in the structural semantics of the image and fall short when an image contains several important regions, so the attention they produce inevitably limits answer accuracy.
Research shows that most existing visual question-answering methods predict the answer from the question and the whole picture, without considering the guiding effect of the question's specific semantics; as a result, the image-region features learned by such models are only weakly related to the question features in semantic space.
In summary, the effective visual question answering methods in the prior art still need to be improved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a fine-grained visual question-answering method combined with a multi-view attention mechanism. The method effectively improves the accuracy and comprehensiveness of visual semantic information extraction and reduces the influence of redundant and noisy data, thereby improving the fine-grained recognition ability of the visual question-answering system and its judgment of complex questions, and improving, to a certain extent, the accuracy and interpretability of the model.
The technical scheme of the invention is as follows:
a fine-grained visual question-answering method combined with a multi-view attention mechanism comprises the following steps:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) by the attention weight to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier, and predicting the answer.
Preferably, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model. A single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model; the saliency attention weight expresses that different target regions in the image receive different attention resources.
Preferably, the single attention weight is obtained as follows:
the image features and the question features are input to the upper-layer attention model; a fully connected layer projects each of them into the same dimensional space and the activation function ReLU normalizes the vectors; the two projections are then fused by the Hadamard product and passed in turn through two fully connected layers; finally, the softmax function normalizes the result into the single attention weight

$$a_u = \mathrm{softmax}\big(W_{u2}\,\mathrm{ReLU}\big(W_{u1}\,(\mathrm{ReLU}(W_v V_i)\circ\mathrm{ReLU}(W_q q_i))\big)\big)$$

wherein $V_i$ is the image feature, $q_i$ is the question feature, $W_v, W_q, W_{u1}, W_{u2}$ are the weight parameters to be learned by the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set by the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).
Preferably, the saliency attention weight is obtained as follows:
the image features and the question features are input to the lower-layer attention model; a fully connected layer projects each of them into the same dimensional space, and the correlation matrix is computed as

$$C_i = \mathrm{ReLU}\big(q_i^{\top} W_b V_i\big)$$

wherein $W_b$ holds the weight parameters to be learned by the lower-layer attention model and $C_i$ is the resulting correlation matrix;
the correlation matrix is then used as a feature: it is multiplied with the projected question features and fused with the projected input image features, and finally the softmax function normalizes the fused result into the saliency attention weight

$$a_b = \mathrm{softmax}\Big(W_{b2}\big((W_q' q_i)\,C_i \circ (W_v' V_i)\big)\Big)$$

wherein $W_q', W_v', W_{b2}$ are the weight parameters to be learned by the lower-layer attention model.
Preferably, the attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

wherein the hyper-parameters $\beta_1$ and $\beta_2$ are the weight ratios of the upper-layer and lower-layer attention models.
Preferably, in step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$, respectively, in which the activation function ReLU normalizes the vectors; the two outputs are then fused by the Hadamard product to obtain the fused features $h = f_v(\hat{v}_i) \circ f_q(q_i)$.
Preferably, in step 4), the fused features are passed through a nonlinear layer $f_o$, in which the activation function ReLU normalizes the vector; the linear mapping $w_o$ then predicts the candidate score of each answer,

$$\hat{s} = \sigma\big(w_o\, f_o(h)\big),$$

and finally the answer with the highest score is selected as output, where $\sigma$ is the sigmoid activation function and $w_o$ holds the weight parameters to be learned.
Preferably, the sigmoid activation function normalizes the final score into the interval (0, 1), so the last stage acts as a logistic regression that predicts the correctness of each candidate answer; the objective function is

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\Big[s_{zk}\log\hat{s}_{zk} + (1-s_{zk})\log(1-\hat{s}_{zk})\Big]$$

wherein the indices z and k range over the M training questions and the N candidate answers, respectively, and $s_{zk}$ is the ground-truth score of answer k for question z.
Preferably, in step 1), feature extraction is performed on the input image $I_i$ with the standard Faster R-CNN model to obtain the deep image features $V_i = \mathrm{FasterRCNN}(I_i)$.
Preferably, in step 1), the input question text $Q_i$ is first split into words at spaces and punctuation and initialized with a pre-trained GloVe word embedding method to obtain the encoded form of the i-th question sentence, $X^{(i)} = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, wherein $x_t^{(i)}$ denotes the embedding of the t-th word in the vocabulary;
then $X^{(i)}$ is input into an LSTM network, and the output $q_i$ of the last layer is taken as the question feature.
The invention has the following beneficial effects:
the invention provides a multi-view attention model by combining a fine-grained vision question-answering method of a multi-view attention mechanism, which can effectively select a plurality of significant target areas related to a current task target (question), extract area significance characteristics in an image under the guidance of question semantics, has fine-grained characteristic expression, expresses the condition that a plurality of important semantic expression areas exist in the image, and has strong depicting capability.
The method fully considers the guiding effect of the specific semantics of the question and learns, from multiple views, the region information related to the answer in the image and the question text. This improves the effectiveness and comprehensiveness of the multi-view attention model, effectively strengthens the semantic relevance between salient image-region features and question features, and improves the accuracy and comprehensiveness of semantic understanding in visual question answering.
The method performs the visual question-answering task with simple steps, high efficiency and high accuracy, and is ready for commercial application with good market prospects.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a multi-view attention model;
FIG. 3 is an attention weight visualization thermodynamic diagram (simple attention task);
FIG. 4 is an attention weight visualization thermodynamic diagram (task needs to be highly focused on multiple locations in the image);
FIG. 5 is a comparison of the results obtained by the multi-view attention model of the present invention with state-of-the-art methods;
FIG. 6 is the loss-function curve of the final model training;
FIG. 7 is the training/validation score curve of the final model training.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The invention provides a fine-grained visual question-answering method combined with a multi-view attention mechanism to overcome the defects of the prior art. Visual question answering can be regarded as a multi-label classification problem in which each answer is one category. In a typical visual question-answering system, the answers are encoded with the One-Hot method to obtain a One-Hot vector for each answer, forming an answer vector table. One-Hot encoding represents a categorical variable as a binary vector: each category value is first mapped to an integer, and each integer is then represented as a vector of zeros with a 1 at the index of that integer.
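As a minimal illustration (the candidate answers below are hypothetical, not taken from the dataset), One-Hot encoding of an answer table can be sketched as:

```python
import numpy as np

answers = ["yes", "no", "two", "red"]          # hypothetical candidate answers
index = {a: i for i, a in enumerate(answers)}  # map each answer to an integer

def one_hot(answer: str) -> np.ndarray:
    """Binary vector: all zeros except a 1 at the answer's index."""
    v = np.zeros(len(answers), dtype=np.float32)
    v[index[answer]] = 1.0
    return v

print(one_hot("two"))  # [0. 0. 1. 0.]
```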
As shown in fig. 1, the fine-grained visual question-answering method combined with a multi-view attention mechanism according to the present invention generally comprises the following steps:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) by the attention weight to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier, and predicting the answer.
In this embodiment, in step 1), the standard Faster R-CNN model performs feature extraction on the input image $I_i$ to obtain the image features $V_i = \mathrm{FasterRCNN}(I_i)$. Let K be the number of spatial regions of the image features; the image features can then be written as $V_i = \{v_{i1}, \dots, v_{iK}\}$ with $v_{ik} \in \mathbb{R}^d$, wherein $v_{ik}$ is the k-th region feature extracted by Faster R-CNN and d, the number of hidden neurons in the network layer, also denotes the output dimension.
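A minimal PyTorch sketch of this step is given below; `extract_region_features` is a hypothetical stand-in, since torchvision's Faster R-CNN returns boxes and scores, and obtaining the K pooled region features in practice requires hooking its ROI-pooling stage:

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()

def extract_region_features(image: torch.Tensor, K: int = 36, d: int = 2048) -> torch.Tensor:
    """Hypothetical helper: detect the top-K regions and return their
    pooled features as a (K, d) tensor (placeholder values here)."""
    with torch.no_grad():
        detections = detector([image])[0]   # dict with boxes, labels, scores
    # A real pipeline would ROI-pool backbone activations for the top-K boxes.
    return torch.zeros(K, d)

image = torch.rand(3, 448, 448)             # dummy RGB image in [0, 1]
V_i = extract_region_features(image)        # image features V_i, shape (K, d)
```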
In step 1), the input question text $Q_i$ is split into words at spaces and punctuation and initialized with the pre-trained GloVe word embedding method (Global Vectors for Word Representation) to obtain the encoded form of the i-th question, $X^{(i)} = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, wherein $x_t^{(i)}$ denotes the embedding of the t-th word;
then $X^{(i)}$ is input into an LSTM network, specifically a standard LSTM with 1280 hidden units, and the output of the last layer is taken as the question feature $q_i$.
Then, the extracted image features $V_i$ and the encoded question features $q_i$ are both input into the multi-view attention model, which calculates the attention weight of the image.
In essence, the visual attention mechanism selects, from the image, the target regions that are most critical to the current task, so that more attention resources are invested in those regions to acquire detailed information about the targets of interest while suppressing useless information. In the visual question-answering task, semantic expressions are diverse; in particular, some questions require the model to understand semantic expressions spanning several target objects in the image. A single visual attention model therefore cannot effectively mine the relevance between the different semantic objects in the image and the question semantics.
in order to solve the problem, the invention provides a multi-view attention model, which uses two different attention mechanisms to jointly learn important area parts with different semantics, which can be focused on in the problem, so as to obtain a fine-grained attention feature map of an image. The attention model of the multi-view angle is used for paying attention to the image to obtain the attention weight of the image, the weight is used for carrying out image feature weighting to obtain an accumulated vector as a final image feature representation, namely, fine-grained features of the image can be well associated with problem semantics.
As shown in fig. 2, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model. A single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model; the saliency attention weight expresses that different target regions in the image receive different attention resources.
Specifically, in the upper-layer attention model, the single attention weight is obtained as follows:
the image features and the question features are input to the upper-layer attention model; a fully connected layer projects each of them into the same dimensional space and the activation function ReLU normalizes the vectors; the two projections are then fused by the Hadamard product and passed in turn through two fully connected layers; finally, the softmax function normalizes the result into the single attention weight

$$a_u = \mathrm{softmax}\big(W_{u2}\,\mathrm{ReLU}\big(W_{u1}\,(\mathrm{ReLU}(W_v V_i)\circ\mathrm{ReLU}(W_q q_i))\big)\big)$$

wherein $V_i$ is the image feature, $q_i$ is the question feature, $W_v, W_q, W_{u1}, W_{u2}$ are the weight parameters to be learned by the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set by the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).
In a single attention weight, the softmax assigns a large value to one region and small values to the rest. Since an image often contains several different semantics, each expressed visually in a different region, a single attention weight tends to ignore regions that carry important semantics. To supplement the attention information missed by the upper-layer attention model, the invention further provides the lower-layer attention model, which jointly considers the relevance of the image and the question semantics, realizes a question-guided learning mechanism for the multi-view attention model, and increases the ability to mine fine-grained features.
Specifically, in the lower-layer attention model, the saliency attention weight is obtained as follows:
the image features and the question features are input to the lower-layer attention model; a fully connected layer projects each of them into the same dimensional space, and the correlation matrix is computed as

$$C_i = \mathrm{ReLU}\big(q_i^{\top} W_b V_i\big)$$

wherein $W_b$ holds the weight parameters to be learned by the lower-layer attention model and $C_i$ is the resulting correlation matrix;
the correlation matrix is then used as a feature: it is multiplied with the projected question features and fused with the projected input image features, and finally the softmax function normalizes the fused result into the saliency attention weight

$$a_b = \mathrm{softmax}\Big(W_{b2}\big((W_q' q_i)\,C_i \circ (W_v' V_i)\big)\Big)$$

wherein $W_q', W_v', W_{b2}$ are the weight parameters to be learned by the lower-layer attention model, with parameter dimensions set consistently with the upper-layer attention model; K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension of the layer, and ReLU is an activation function in the neural network.
The attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

wherein the hyper-parameters $\beta_1$ and $\beta_2$ are the weight ratios of the upper-layer and lower-layer attention models. In practical applications, the weights of the two models can be allocated by tuning these parameters to achieve a better effect.
The image features $V_i$ can further be written as the collection of K spatial-region features, $V_i = \{v_{i1}, \dots, v_{iK}\}$. The attention weight $a_i$ then multiplies and weights the image feature of each spatial region to obtain the fine-grained image features $\hat{v}_i = \sum_{k=1}^{K} a_{ik}\, v_{ik}$.
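Combining the two views and pooling the regions can then be sketched as:

```python
import torch

beta1, beta2 = 0.7, 0.3             # weight ratios used in the experiments below

def fine_grained_feature(V: torch.Tensor, a_u: torch.Tensor, a_b: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the K region features under the combined attention."""
    a = beta1 * a_u + beta2 * a_b               # attention weight of the image, (K,)
    return (a.unsqueeze(-1) * V).sum(dim=0)     # fine-grained feature, (d,)
```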
In step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$, respectively, in which the activation function ReLU normalizes the vectors; the two outputs are then fused by the Hadamard product to obtain the fused features $h = f_v(\hat{v}_i) \circ f_q(q_i)$.
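A minimal sketch of the fusion step (the layer dimensions are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, q_dim, h_dim = 2048, 1280, 1024   # assumed feature dimensions
f_v = nn.Linear(d, h_dim)            # nonlinear layer for the image feature
f_q = nn.Linear(q_dim, h_dim)        # nonlinear layer for the question feature

def fuse(v_hat: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Hadamard-product fusion of fine-grained image and question features."""
    return F.relu(f_v(v_hat)) * F.relu(f_q(q))
```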
Further, the visual question-answering problem is a multi-label classification problem. In step 4), the fused features are passed through a nonlinear layer $f_o$, in which the activation function ReLU normalizes the vector; the linear mapping $w_o$ then predicts the candidate score of each answer,

$$\hat{s} = \sigma\big(w_o\, f_o(h)\big),$$

and finally the answer with the highest score is selected as output, where $\sigma$ is the sigmoid activation function and $w_o$ holds the weight parameters to be learned.
Preferably, the sigmoid activation function normalizes the final score into the interval (0, 1), so the last stage acts as a logistic regression that predicts the correctness of each candidate answer; the objective function is

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\Big[s_{zk}\log\hat{s}_{zk} + (1-s_{zk})\log(1-\hat{s}_{zk})\Big]$$

wherein the indices z and k range over the M training questions and the N candidate answers, respectively, and $s_{zk}$ is the ground-truth score of answer k for question z.
Compared with the softmax classifier commonly used in other visual question-answering work, the logistic-regression classification used by the method is more effective: the sigmoid function uses soft scores (soft targets) as the target result, which provides richer training signals and effectively captures the occasional uncertainty in the real answers.
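A sketch of the classifier and the soft-target objective; PyTorch's `BCEWithLogitsLoss` applies the sigmoid and the binary cross-entropy of the objective function above (the batch and targets here are dummies):

```python
import torch
import torch.nn as nn

N_ANSWERS, h_dim = 3000, 1024                   # N follows the experiments below

classifier = nn.Sequential(
    nn.Linear(h_dim, h_dim), nn.ReLU(),         # nonlinear layer f_o
    nn.Linear(h_dim, N_ANSWERS),                # linear mapping w_o (answer logits)
)
criterion = nn.BCEWithLogitsLoss()              # sigmoid + soft-target BCE

fused = torch.rand(8, h_dim)                    # dummy batch of fused features
soft_targets = torch.rand(8, N_ANSWERS)         # soft ground-truth scores s_zk in [0, 1]
loss = criterion(classifier(fused), soft_targets)
answer = classifier(fused).argmax(dim=1)        # highest-scoring candidate answer
```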
In order to better observe how the attention model focuses on the salient region parts of the image, the attention maps of the single attention weight and the saliency attention weight $(a_u, a_b)$ are obtained and then visualized as matrix heatmaps using the matplotlib drawing library in python, as shown in figs. 3 and 4.
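A minimal sketch of this visualization (the attention maps here are random dummies reshaped to a grid):

```python
import matplotlib.pyplot as plt
import numpy as np

a_u = np.random.rand(6, 6)           # dummy upper-layer attention map
a_b = np.random.rand(6, 6)           # dummy lower-layer attention map

fig, axes = plt.subplots(1, 2)
for ax, a, title in zip(axes, (a_u, a_b), ("attention1", "attention2")):
    im = ax.imshow(a, cmap="hot")    # render the weight matrix as a heatmap
    ax.set_title(title)
fig.colorbar(im, ax=axes.tolist())
plt.show()
```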
Figs. 3 and 4 show the behaviour of the upper-layer and lower-layer attention models of the multi-view attention model on two different task images, where attention1 is the attention visualization of the upper-layer attention model and attention2 that of the lower-layer attention model. The attention heatmaps show that the added lower-layer attention model is able to learn different important regions of the input image. As fig. 3 shows, for a simple attention task the upper-layer and lower-layer attention models both find the correct position in the image. In fig. 4, however, when the task requires high attention to several locations in the image, the lower-layer attention model focuses on a different portion than the upper-layer attention model, which improves the accuracy of the multi-view attention model and is an advantage over prior-art models.
Introduction of the test data set: the VQA v2 dataset (Antol S, Agrawal A, Lu J, et al. VQA: Visual question answering [C]. Proceedings of the IEEE International Conference on Computer Vision. 2015: 2425-2433.) is a large-scale visual question-answering dataset in which all questions and answers are manually annotated. The dataset contains 443,757 training questions, 214,354 validation questions and 447,793 test questions. Each image is associated with three questions, and the annotators provide ten answers for each question. In the standard visual question-answering task, the questions in this dataset are classified as: Yes/No, Number and Other.
Further, to verify the effectiveness of the invention, it was compared with the results of the 2017 VQA Challenge champion (Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint arXiv:1707.07998, 2017). As shown in fig. 5, the invention replaced the original single attention model with the multi-view attention model on top of the reproduced paper code; the final score of the multi-view attention model was 64.35%, an accuracy evaluation about 1.2% higher than the paper's.
Some basic parameters were set in the experiments as follows: the base learning rate is α = 0.0007; the random dropout rate after each LSTM layer is dropout = 0.3; answer screening keeps N = 3000 candidates; the fully connected layers use num_hid = 1024 hidden neurons; and the training batch size is batch_size = 512. The weights of the single attention weight and the saliency attention weight are allocated as β1 = 0.7, β2 = 0.3.
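For reference, the settings above can be collected in a single configuration (the names are illustrative, not from the original code):

```python
config = {
    "lr": 0.0007,         # base learning rate alpha
    "dropout": 0.3,       # applied after each LSTM layer
    "num_answers": 3000,  # answer screening N
    "num_hid": 1024,      # hidden neurons of the fully connected layers
    "batch_size": 512,
    "beta1": 0.7,         # weight of the single attention weight
    "beta2": 0.3,         # weight of the saliency attention weight
}
```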
As shown in fig. 6, the loss function value (loss) of the model decreases as the training period increases; fig. 7 shows the model accuracy on the training set and the test set, respectively, as the training period increases.
A comparison of the invention with representative methods on the VQA task, on the public standard dataset VQA v2 under the test-dev setting, is shown in Table 1.
TABLE 1
Specifically, the data were evaluated in three categories according to question type — Yes/No questions, Number questions and other open-ended questions — and the overall evaluation result was then calculated. The scores in the table are the accuracy of the models on the different question types; larger values mean higher accuracy. As the table shows, the multi-view attention model of the invention achieves better results on the different tasks.
In particular, the multi-view attention model strengthens the fine-grained feature expression and improves object detection and recognition, showing a clear improvement on the Number evaluation compared with prior methods. The overall accuracy evaluation of the model is better than that of most existing methods.
The above examples are provided only for illustrating the present invention and are not intended to limit it. Changes, modifications, etc. to the above-described embodiments fall within the scope of the claims of the present invention as long as they accord with its technical spirit.
Claims (10)
1. A fine-grained visual question-answering method combined with a multi-view attention mechanism is characterized by comprising the following steps:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) by the attention weight to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier, and predicting the answer.
2. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein the multi-view attention model comprises an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight expressing that different target regions in the image receive different attention resources.
3. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 2, wherein the single attention weight is obtained as follows:
the image features and the question features are input to the upper-layer attention model; a fully connected layer projects each of them into the same dimensional space and the activation function ReLU normalizes the vectors; the two projections are then fused by the Hadamard product and passed in turn through two fully connected layers; finally, the softmax function normalizes the result into the single attention weight

$$a_u = \mathrm{softmax}\big(W_{u2}\,\mathrm{ReLU}\big(W_{u1}\,(\mathrm{ReLU}(W_v V_i)\circ\mathrm{ReLU}(W_q q_i))\big)\big)$$

wherein $V_i$ is the image feature, $q_i$ is the question feature, $W_v, W_q, W_{u1}, W_{u2}$ are the weight parameters to be learned by the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set by the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).
4. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 2, wherein the saliency attention weight is obtained as follows:
the image features and the question features are input to the lower-layer attention model; a fully connected layer projects each of them into the same dimensional space, and the correlation matrix is computed as

$$C_i = \mathrm{ReLU}\big(q_i^{\top} W_b V_i\big)$$

wherein $W_b$ holds the weight parameters to be learned by the lower-layer attention model and $C_i$ is the resulting correlation matrix;
the correlation matrix is then used as a feature: it is multiplied with the projected question features and fused with the projected input image features, and finally the softmax function normalizes the fused result into the saliency attention weight

$$a_b = \mathrm{softmax}\Big(W_{b2}\big((W_q' q_i)\,C_i \circ (W_v' V_i)\big)\Big)$$

wherein $W_q', W_v', W_{b2}$ are the weight parameters to be learned by the lower-layer attention model.
5. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to any one of claims 2, 3 and 4, wherein the attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

wherein the hyper-parameters $\beta_1$ and $\beta_2$ are the weight ratios of the upper-layer and lower-layer attention models.
6. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 5, wherein in step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$, respectively, in which the activation function ReLU normalizes the vectors; the two outputs are then fused by the Hadamard product to obtain the fused features $h = f_v(\hat{v}_i) \circ f_q(q_i)$.
7. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 6, wherein in step 4), the fused features are passed through a nonlinear layer $f_o$, in which the activation function ReLU normalizes the vector; the linear mapping $w_o$ then predicts the candidate score of each answer,

$$\hat{s} = \sigma\big(w_o\, f_o(h)\big),$$

and finally the answer with the highest score is selected as output, where $\sigma$ is the sigmoid activation function and $w_o$ holds the weight parameters to be learned.
8. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 7, wherein the sigmoid activation function normalizes the final score into the interval (0, 1), so the last stage acts as a logistic regression that predicts the correctness of each candidate answer; the objective function is

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\Big[s_{zk}\log\hat{s}_{zk} + (1-s_{zk})\log(1-\hat{s}_{zk})\Big]$$

wherein the indices z and k range over the M training questions and the N candidate answers, respectively, and $s_{zk}$ is the ground-truth score of answer k for question z.
9. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1), feature extraction is performed on the input image $I_i$ with the standard Faster R-CNN model to obtain the deep image features $V_i = \mathrm{FasterRCNN}(I_i)$.
10. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1), the input question text $Q_i$ is first split into words at spaces and punctuation and initialized with a pre-trained GloVe word embedding method to obtain the encoded form of the i-th question sentence, $X^{(i)} = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, wherein $x_t^{(i)}$ denotes the embedding of the t-th word in the vocabulary; then $X^{(i)}$ is input into an LSTM network, and the output $q_i$ of the last layer is taken as the question feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910927585.4A CN110717431B (en) | 2019-09-27 | 2019-09-27 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910927585.4A CN110717431B (en) | 2019-09-27 | 2019-09-27 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110717431A true CN110717431A (en) | 2020-01-21 |
CN110717431B CN110717431B (en) | 2023-03-24 |
Family
ID=69211080
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910927585.4A Active CN110717431B (en) | 2019-09-27 | 2019-09-27 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110717431B (en) |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325000A (en) * | 2020-01-23 | 2020-06-23 | 北京百度网讯科技有限公司 | Language generation method and device and electronic equipment |
CN111325243A (en) * | 2020-02-03 | 2020-06-23 | 天津大学 | Visual relation detection method based on regional attention learning mechanism |
CN111860653A (en) * | 2020-07-22 | 2020-10-30 | 苏州浪潮智能科技有限公司 | Visual question answering method and device, electronic equipment and storage medium |
CN111984772A (en) * | 2020-07-23 | 2020-11-24 | 中山大学 | Medical image question-answering method and system based on deep learning |
CN112100346A (en) * | 2020-08-28 | 2020-12-18 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN112163608A (en) * | 2020-09-21 | 2021-01-01 | 天津大学 | Visual relation detection method based on multi-granularity semantic fusion |
CN112488111A (en) * | 2020-12-18 | 2021-03-12 | 贵州大学 | Instruction expression understanding method based on multi-level expression guide attention network |
CN112732879A (en) * | 2020-12-23 | 2021-04-30 | 重庆理工大学 | Downstream task processing method and model of question-answering task |
CN112905819A (en) * | 2021-01-06 | 2021-06-04 | 中国石油大学(华东) | Visual question-answering method of original feature injection network based on composite attention |
CN113223018A (en) * | 2021-05-21 | 2021-08-06 | 信雅达科技股份有限公司 | Fine-grained image analysis processing method |
CN113392288A (en) * | 2020-03-11 | 2021-09-14 | 阿里巴巴集团控股有限公司 | Visual question answering and model training method, device, equipment and storage medium thereof |
CN113408511A (en) * | 2021-08-23 | 2021-09-17 | 南开大学 | Method, system, equipment and storage medium for determining gazing target |
CN113436094A (en) * | 2021-06-24 | 2021-09-24 | 湖南大学 | Gray level image automatic coloring method based on multi-view attention mechanism |
CN113761153A (en) * | 2021-05-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Question and answer processing method and device based on picture, readable medium and electronic equipment |
CN113779298A (en) * | 2021-09-16 | 2021-12-10 | 哈尔滨工程大学 | Medical vision question-answering method based on composite loss |
CN113792617A (en) * | 2021-08-26 | 2021-12-14 | 电子科技大学 | Image interpretation method combining image information and text information |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN114092783A (en) * | 2020-08-06 | 2022-02-25 | 清华大学 | Dangerous goods detection method based on attention mechanism continuous visual angle |
CN114117159A (en) * | 2021-12-08 | 2022-03-01 | 东北大学 | Image question-answering method for multi-order image feature and question interaction |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114417044A (en) * | 2022-01-19 | 2022-04-29 | 中国科学院空天信息创新研究院 | Image question and answer method and device |
CN114661874A (en) * | 2022-03-07 | 2022-06-24 | 浙江理工大学 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels |
CN113407794B (en) * | 2021-06-01 | 2023-10-31 | 中国科学院计算技术研究所 | Visual question-answering method and system for inhibiting language deviation |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN110111399A (en) * | 2019-04-24 | 2019-08-09 | 上海理工大学 | A kind of image text generation method of view-based access control model attention |
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
-
2019
- 2019-09-27 CN CN201910927585.4A patent/CN110717431B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN110111399A (en) * | 2019-04-24 | 2019-08-09 | 上海理工大学 | A kind of image text generation method of view-based access control model attention |
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
Non-Patent Citations (1)
Title |
---|
XIONG Xue et al.: "Research on answer selection methods based on an attention mechanism", Intelligent Computer and Applications *
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325000B (en) * | 2020-01-23 | 2021-01-26 | 北京百度网讯科技有限公司 | Language generation method and device and electronic equipment |
US11562150B2 (en) | 2020-01-23 | 2023-01-24 | Beijing Baidu Netcom Science Technology Co., Ltd. | Language generation method and apparatus, electronic device and storage medium |
CN111325000A (en) * | 2020-01-23 | 2020-06-23 | 北京百度网讯科技有限公司 | Language generation method and device and electronic equipment |
CN111325243A (en) * | 2020-02-03 | 2020-06-23 | 天津大学 | Visual relation detection method based on regional attention learning mechanism |
CN113392288A (en) * | 2020-03-11 | 2021-09-14 | 阿里巴巴集团控股有限公司 | Visual question answering and model training method, device, equipment and storage medium thereof |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN113837212B (en) * | 2020-06-24 | 2023-09-26 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN111860653A (en) * | 2020-07-22 | 2020-10-30 | 苏州浪潮智能科技有限公司 | Visual question answering method and device, electronic equipment and storage medium |
CN111984772B (en) * | 2020-07-23 | 2024-04-02 | 中山大学 | Medical image question-answering method and system based on deep learning |
CN111984772A (en) * | 2020-07-23 | 2020-11-24 | 中山大学 | Medical image question-answering method and system based on deep learning |
CN114092783A (en) * | 2020-08-06 | 2022-02-25 | 清华大学 | Dangerous goods detection method based on attention mechanism continuous visual angle |
CN112100346A (en) * | 2020-08-28 | 2020-12-18 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN112100346B (en) * | 2020-08-28 | 2021-07-20 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN112163608B (en) * | 2020-09-21 | 2023-02-03 | 天津大学 | Visual relation detection method based on multi-granularity semantic fusion |
CN112163608A (en) * | 2020-09-21 | 2021-01-01 | 天津大学 | Visual relation detection method based on multi-granularity semantic fusion |
CN112488111A (en) * | 2020-12-18 | 2021-03-12 | 贵州大学 | Instruction expression understanding method based on multi-level expression guide attention network |
CN112488111B (en) * | 2020-12-18 | 2022-06-14 | 贵州大学 | Indication expression understanding method based on multi-level expression guide attention network |
CN112732879A (en) * | 2020-12-23 | 2021-04-30 | 重庆理工大学 | Downstream task processing method and model of question-answering task |
CN112905819A (en) * | 2021-01-06 | 2021-06-04 | 中国石油大学(华东) | Visual question-answering method of original feature injection network based on composite attention |
CN112905819B (en) * | 2021-01-06 | 2022-09-23 | 中国石油大学(华东) | Visual question-answering method of original feature injection network based on composite attention |
CN113761153B (en) * | 2021-05-19 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Picture-based question-answering processing method and device, readable medium and electronic equipment |
CN113761153A (en) * | 2021-05-19 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Question and answer processing method and device based on picture, readable medium and electronic equipment |
CN113223018A (en) * | 2021-05-21 | 2021-08-06 | 信雅达科技股份有限公司 | Fine-grained image analysis processing method |
CN113407794B (en) * | 2021-06-01 | 2023-10-31 | 中国科学院计算技术研究所 | Visual question-answering method and system for inhibiting language deviation |
CN113436094A (en) * | 2021-06-24 | 2021-09-24 | 湖南大学 | Gray level image automatic coloring method based on multi-view attention mechanism |
CN113408511B (en) * | 2021-08-23 | 2021-11-12 | 南开大学 | Method, system, equipment and storage medium for determining gazing target |
CN113408511A (en) * | 2021-08-23 | 2021-09-17 | 南开大学 | Method, system, equipment and storage medium for determining gazing target |
CN113792617A (en) * | 2021-08-26 | 2021-12-14 | 电子科技大学 | Image interpretation method combining image information and text information |
CN113792617B (en) * | 2021-08-26 | 2023-04-18 | 电子科技大学 | Image interpretation method combining image information and text information |
CN113779298B (en) * | 2021-09-16 | 2023-10-31 | 哈尔滨工程大学 | Medical vision question-answering method based on composite loss |
CN113779298A (en) * | 2021-09-16 | 2021-12-10 | 哈尔滨工程大学 | Medical vision question-answering method based on composite loss |
CN114117159A (en) * | 2021-12-08 | 2022-03-01 | 东北大学 | Image question-answering method for multi-order image feature and question interaction |
CN114117159B (en) * | 2021-12-08 | 2024-07-12 | 东北大学 | Image question-answering method for multi-order image feature and question interaction |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114417044A (en) * | 2022-01-19 | 2022-04-29 | 中国科学院空天信息创新研究院 | Image question and answer method and device |
CN114661874A (en) * | 2022-03-07 | 2022-06-24 | 浙江理工大学 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels |
CN114661874B (en) * | 2022-03-07 | 2024-04-30 | 浙江理工大学 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Also Published As
Publication number | Publication date |
---|---|
CN110717431B (en) | 2023-03-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110717431B (en) | Fine-grained visual question and answer method combined with multi-view attention mechanism | |
CN111554268B (en) | Language identification method based on language model, text classification method and device | |
Yan | Computational methods for deep learning | |
Zhang et al. | Multilabel image classification with regional latent semantic dependencies | |
CN110737801B (en) | Content classification method, apparatus, computer device, and storage medium | |
CN108804530B (en) | Subtitling areas of an image | |
CN112732916B (en) | BERT-based multi-feature fusion fuzzy text classification system | |
CN111209384A (en) | Question and answer data processing method and device based on artificial intelligence and electronic equipment | |
CN109783666A (en) | A kind of image scene map generation method based on iteration fining | |
Wang et al. | Spatial–temporal pooling for action recognition in videos | |
CN106803098A (en) | A kind of three mode emotion identification methods based on voice, expression and attitude | |
Islam et al. | A review on video classification with methods, findings, performance, challenges, limitations and future work | |
CN112749274A (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
AU2019101138A4 (en) | Voice interaction system for race games | |
CN113987187A (en) | Multi-label embedding-based public opinion text classification method, system, terminal and medium | |
KR20200010672A (en) | Smart merchandise searching method and system using deep learning | |
CN110705490A (en) | Visual emotion recognition method | |
Yan | Computational methods for deep learning: theory, algorithms, and implementations | |
Xia et al. | Evaluation of saccadic scanpath prediction: Subjective assessment database and recurrent neural network based metric | |
CN111898704A (en) | Method and device for clustering content samples | |
Chen et al. | STRAN: Student expression recognition based on spatio-temporal residual attention network in classroom teaching videos | |
Ling et al. | A facial expression recognition system for smart learning based on YOLO and vision transformer | |
CN114840649A (en) | Student cognitive diagnosis method based on cross-modal mutual attention neural network | |
Gong et al. | Human interaction recognition based on deep learning and HMM | |
Li et al. | Supervised classification of plant image based on attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |