CN110717431B - Fine-grained visual question and answer method combined with multi-view attention mechanism - Google Patents
- Publication number
- CN110717431B (application CN201910927585.4A)
- Authority
- CN
- China
- Prior art keywords
- attention
- image
- question
- weight
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V20/10 — Scenes; scene-specific elements; terrestrial scenes
- G06F16/3329 — Natural language query formulation or dialogue systems
- G06F16/583 — Retrieval characterised by using metadata automatically derived from the content
- G06F16/5866 — Retrieval characterised by using manually generated metadata, e.g. tags, keywords, comments
- G06N3/045 — Neural network architectures; combinations of networks
- G06N3/048 — Activation functions
- G06N3/08 — Learning methods
- G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
Abstract
The invention relates to a fine-grained visual question-answering method combined with a multi-view attention mechanism. The method fully considers the guiding effect of the specific semantics of a question and provides a multi-view attention model that can effectively select multiple salient target regions related to the current task target (the question). It learns region information related to the answer in the image and the question text from multiple views, and extracts regionally salient features from the image under the guidance of the question semantics, giving a finer-grained feature expression with stronger descriptive capability when an image contains multiple important semantic-expression regions. This increases the effectiveness and comprehensiveness of the multi-view attention model, effectively strengthens the semantic relevance between the salient features of the image regions and the question features, and improves the accuracy and comprehensiveness of the semantic understanding of visual question answering. The method performs the visual question-answering task with simple steps, high efficiency, and high accuracy, and has complete commercial applicability and good market prospects.
Description
Technical Field
The invention relates to the technical field of computer vision and natural language processing, in particular to a fine-grained vision question-answering method combined with a multi-view attention mechanism.
Background
With the rapid development of computer vision and natural language processing, visual question answering has become an increasingly popular research field in artificial intelligence. Visual question answering is an emerging topic whose task combines the two subject fields of computer vision and natural language processing: given an image and a natural-language question related to that image as input, generate a natural-language answer as output. Visual question answering is a key application direction in the field of artificial intelligence and, by simulating real-world scenes, can help visually impaired users perform real-time human-computer interaction.
In essence, the visual question-answering system is treated as a classification task: the common practice is to extract image and question features from the given image and question, fuse the two kinds of features, and classify the fused features to obtain the answer. In recent years, visual question answering has attracted a great deal of attention in the fields of computer vision and natural language processing. Because visual question answering is relatively complex and requires both image and text processing, some existing methods lack accuracy and face major challenges.
In practical applications, visual question-answering systems often face high-dimensional images and noise effects that can disturb an algorithm's answer prediction. An effective visual question-answering model should therefore mine the structural features and semantically relevant parts of the image that are consistent with the question semantics, so as to make fine-grained predictions.
The visual attention model is a computer simulation of the human visual attention mechanism, used to obtain the most noticeable part of an image, namely its salient region. In visual question answering, most methods that use a single attention mechanism ignore differences in the structural semantics of the image and fall short when an image contains several important regions, so the attention they produce inevitably limits the accuracy of the visual question answering.
Research shows that most existing visual question-answering methods predict the semantic answer from the question and the whole picture, without considering the guiding effect of the specific semantics of the question; as a result, the image region features learned by such models are only weakly related to the question features in the semantic space.
In summary, in the prior art, effective visual question answering methods still need to be improved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a fine-grained visual question-answering method combined with a multi-view attention mechanism. The method can effectively improve the accuracy and comprehensiveness of visual semantic information extraction and reduce the influence of redundant and noisy data, thereby improving the fine-grained recognition capability and complex-question judgment of a visual question-answering system and, to a certain extent, the accuracy and interpretability of its model.
The technical scheme of the invention is as follows:
a fine-grained visual question-answering method combined with a multi-view attention mechanism comprises the following steps:
1) Inputting an image and extracting image features; inputting a question text and extracting question features;
2) Inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) with the attention weight to obtain fine-grained features of the image;
3) Fusing the fine-grained features of the image with the question features to obtain fused features;
4) Inputting the fused features into a classifier and predicting the answer.
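The four steps above can be sketched end to end. The following is a minimal NumPy sketch, not the patented implementation: all shapes (K = 36 regions, d = 512, 3000 candidate answers), the random features standing in for Faster R-CNN and LSTM outputs, and the plain dot-product attention used in place of the multi-view attention model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

K, d, n_answers = 36, 512, 3000
V = rng.standard_normal((K, d))   # step 1: region features (stand-in for Faster R-CNN output)
q = rng.standard_normal(d)        # step 1: question feature (stand-in for an LSTM output)

# Step 2: attention weights over regions, then a weighted sum gives the
# fine-grained image feature (dot-product attention stands in for the
# multi-view attention model here).
a = softmax(V @ q)                # (K,)
v_att = a @ V                     # (d,)

# Step 3: fuse with the Hadamard product; step 4: score candidate answers.
h = v_att * q
W_o = rng.standard_normal((n_answers, d)) * 0.01
scores = 1 / (1 + np.exp(-(W_o @ h)))   # sigmoid score per candidate answer
answer = int(scores.argmax())           # highest-scoring answer index
```
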
Preferably, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model, and a saliency attention weight is obtained through the lower-layer attention model, the saliency attention weight reflecting that different target regions in the image are allocated different attention resources.
Preferably, the method of obtaining a single attention weight is as follows:
inputting the image features and the question features into the upper-layer attention model, projecting each into the same dimensional space with one fully connected layer, and normalizing the vectors with the activation function ReLU; then fusing with the Hadamard product, passing the result in turn through two fully connected layers to process the learned parameters, and finally normalizing the weights with the softmax function to obtain the single attention weight:

$$a_u = \mathrm{softmax}\big(W_2\,\mathrm{ReLU}\big(W_1\big(\mathrm{ReLU}(W_v V_i)\circ \mathrm{ReLU}(W_q q_i)\big)\big)\big)$$

wherein $V_i \in \mathbb{R}^{d \times K}$ is the image feature, $q_i$ is the question feature, $W_v, W_q, W_1, W_2$ are the weight parameters to be learned by the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set for the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).
Preferably, the method of obtaining the saliency attention weight is as follows:
inputting the image features and the question features into the lower-layer attention model, projecting each into the same dimensional space with a fully connected layer, and then calculating the association matrix

$$C_i = \mathrm{ReLU}\big(q_i^{\mathsf T} W_b V_i\big)$$

wherein $W_b \in \mathbb{R}^{d \times d}$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting association matrix;

multiplying the association matrix, as a feature, with the question features and fusing the result with the input image features; finally normalizing the weights with the softmax function and outputting the saliency attention weight:

$$a_b = \mathrm{softmax}\big(w_h^{\mathsf T}\,\mathrm{ReLU}\big(W_v' V_i + (W_q' q_i)\,C_i\big)\big)$$

wherein $W_v'$, $W_q'$ and $w_h$ are weight parameters to be learned by the lower-layer attention model.
Preferably, the attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

wherein $\beta_1$ and $\beta_2$, the weight ratios of the upper-layer and lower-layer attention models, are hyper-parameters.
Preferably, in step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, each of which normalizes its vector with the activation function ReLU; the outputs are then fused with the Hadamard product to obtain the fused feature

$$h = f_v(\hat v_i) \circ f_q(q_i)$$

wherein $\hat v_i$ denotes the fine-grained image feature.
Preferably, in step 4), the fused feature is passed through a nonlinear layer $f_o$, which normalizes the vector with the activation function ReLU; the linear mapping $w_o$ is then used to predict the candidate score of each answer,

$$\hat s = \sigma\big(w_o\, f_o(h)\big)$$

and the answers with the highest scores are selected as output; where $\sigma$ is the sigmoid activation function and $w_o$ is the weight parameter to be learned.
Preferably, the sigmoid activation function normalizes the final scores into the interval (0, 1), so that the last stage acts as a logistic regression predicting the correctness of each candidate answer; the objective function is

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\Big[s_{zk}\log \hat s_{zk} + (1-s_{zk})\log(1-\hat s_{zk})\Big]$$

wherein the indices z and k range over the M training questions and the N candidate answers respectively, and $s_{zk}$ is the soft score of the true answer to the question.
Preferably, in step 1), features are extracted from the input image $I_i$ with the standard Faster R-CNN model, giving the deep image features $V_i = \mathrm{FasterRCNN}(I_i)$.
Preferably, in step 1), after the question text $Q_i$ is input, it is first split into words using spaces and punctuation as delimiters, and the words are initialized with the pre-trained GloVe word-embedding method to obtain the encoded form of the i-th specified question, $X_i = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, wherein $x_t^{(i)}$ denotes the embedding of the t-th word from the vocabulary;

$X_i$ is then fed into an LSTM network, and the output of the last layer, $q_i = \mathrm{LSTM}(X_i)$, is taken as the question feature.
The invention has the following beneficial effects:
the invention provides a multi-view attention model by combining a fine-grained vision question-answering method of a multi-view attention mechanism, which can effectively select a plurality of significant target areas related to a current task target (question), extract area significance characteristics in an image under the guidance of question semantics, has fine-grained characteristic expression, expresses the condition that a plurality of important semantic expression areas exist in the image, and has strong depicting capability.
According to the method, the guiding effect of the specific semantics of the question is fully considered, the regional information related to the answer in the image and the question text is learned and obtained from multiple visual angles, and the effectiveness and the comprehensiveness of the multi-visual-angle attention model are improved, so that the semantic relevance of the salient features and the question features of the image region is effectively enhanced, and the accuracy and the comprehensiveness of the semantic understanding of the visual question-answering are improved.
The method performs the visual question-answering task with simple steps, high efficiency, and high accuracy, and has complete commercial applicability and good market prospects.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a multi-view attention model;
FIG. 3 is an attention weight visualization thermodynamic diagram (simple attention task);
FIG. 4 is an attention weight visualization thermodynamic diagram (task needs to be highly focused on multiple locations in the image);
FIG. 5 is a graph comparing the results obtained by the multi-view attention model of the present invention with the current more advanced method;
FIG. 6 is a graph of a loss function for final model performance training;
FIG. 7 is a graph of training verification scores for final model performance training.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The invention provides a fine-grained visual question-answering method combining a multi-view attention mechanism to remedy the defects of the prior art. Visual question answering may be regarded as a multi-class classification problem in which each answer is a classification category. In a typical visual question-answering system, answers are encoded with the One-Hot method to obtain a One-Hot vector for each answer, forming an answer vector table. One-Hot encoding represents categorical variables as binary vectors: the categorical values are first mapped to integer values, and each integer value is then represented as a binary vector that is all zeros except at the index of that integer, which is set to 1.
As shown in fig. 1, the fine-grained visual question-answering method combined with a multi-view attention mechanism according to the present invention generally comprises the following steps:
1) Inputting an image and extracting image features; inputting a question text and extracting question features;
2) Inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) with the attention weight to obtain fine-grained features of the image;
3) Fusing the fine-grained features of the image with the question features to obtain fused features;
4) Inputting the fused features into a classifier and predicting the answer.
In this embodiment, in step 1), features are extracted from the input image $I_i$ with the standard Faster R-CNN model, giving $V_i = \mathrm{FasterRCNN}(I_i)$. Let K be the number of spatial regions of the image features; the image feature can then be further expressed as $V_i = \{v_1, \dots, v_K\}$, $v_k \in \mathbb{R}^{d}$, wherein $v_k$ is the k-th region feature extracted by Faster R-CNN and d is the number of hidden neurons in the network layer, which is also the output dimension.
In step 1), after the question text $Q_i$ is input, it is first split into words using spaces and punctuation as delimiters, and the words are initialized with the pre-trained GloVe (Global Vectors for Word Representation) word-embedding method to obtain the encoded form of the i-th specified question, $X_i = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, wherein $x_t^{(i)}$ denotes the embedding of the t-th word from the vocabulary;

$X_i$ is then fed into an LSTM network, specifically a standard LSTM with 1280 hidden units, and the output of the last layer, $q_i = \mathrm{LSTM}(X_i)$, is taken as the question feature.
Then, the acquired image features $V_i$ and the encoded question feature $q_i$ are both input into the multi-view attention model, which calculates the attention weight of the image.

In essence, the visual attention mechanism selects from the image the target regions that are most critical to the current task target, so that more attention resources are devoted to those regions to acquire more detailed information about the targets that need to be focused on, while other useless information is suppressed. In the visual question-answering task, semantic expression is diverse; in particular, some questions require the model to understand the semantic expression among multiple target objects in an image. A single visual attention model therefore cannot effectively mine the relevance between different semantic objects in the image and the question semantics.
in order to solve the problem, the invention provides a multi-view attention model, which uses two different attention mechanisms to jointly learn important area parts with different semantics, which can be focused on in the problem, so as to obtain a fine-grained attention feature map of an image. The attention model of the multi-view angle is used for paying attention to the image to obtain the attention weight of the image, the weight is used for carrying out image feature weighting to obtain an accumulated vector as a final image feature representation, namely, fine-grained features of the image can be well associated with problem semantics.
As shown in fig. 2, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight reflecting that different target regions in the image are allocated different attention resources.
Specifically, in the upper-level attention model, the method for obtaining the single attention weight is as follows:
inputting the image features and the question features into the upper-layer attention model, projecting each into the same dimensional space with one fully connected layer, and normalizing the vectors with the activation function ReLU; the results are then fused with the Hadamard product, passed in turn through two fully connected layers to process the learned parameters, and finally normalized with the softmax function to obtain the single attention weight:

$$a_u = \mathrm{softmax}\big(W_2\,\mathrm{ReLU}\big(W_1\big(\mathrm{ReLU}(W_v V_i)\circ \mathrm{ReLU}(W_q q_i)\big)\big)\big)$$

wherein $V_i \in \mathbb{R}^{d \times K}$ is the image feature, $q_i$ is the question feature, $W_v, W_q, W_1, W_2$ are the weight parameters to be learned by the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set for the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x);
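Under assumed shapes (K = 36 regions, d = 512, h = 1024, a single d-dimensional question vector broadcast over the regions), the upper-layer computation (FC projections with ReLU, Hadamard fusion, two further FC layers, softmax over the regions) can be sketched in NumPy; the random weights stand in for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K, d, h = 36, 512, 1024                 # regions, feature dim, hidden dim (assumed)
V = rng.standard_normal((K, d))         # image features V_i
q = rng.standard_normal(d)              # question feature q_i

# Project both inputs into a shared h-dimensional space (one FC layer each, ReLU).
W_v = rng.standard_normal((d, h)) * 0.01
W_q = rng.standard_normal((d, h)) * 0.01
v_proj = relu(V @ W_v)                  # (K, h)
q_proj = relu(q @ W_q)                  # (h,), broadcast over the K regions below

# Hadamard-product fusion, two FC layers, softmax over the K regions.
fused = v_proj * q_proj
W_1 = rng.standard_normal((h, h)) * 0.01
w_2 = rng.standard_normal(h) * 0.01
a_u = softmax(relu(fused @ W_1) @ w_2)  # single attention weight a_u, shape (K,)
```
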
With a single attention weight $a_u$, softmax tends to assign one region a large weight and leave the remaining weights small. An image, however, often contains several different semantics, and these semantics are often expressed visually in different regions, so the single attention weight $a_u$ tends to ignore some regions with important semantics. To supplement the attention information missed by the upper-layer attention model, the invention further provides the lower-layer attention model. The lower-layer attention model simultaneously takes into account the relevance of the image and the question semantics, realizes a question-guided multi-view attention learning mechanism, and increases the fine-grained feature mining capability.
Specifically, in the lower attention model, the method for obtaining the significance attention weight is as follows:
inputting the image features and the question features into the lower-layer attention model, projecting each into the same dimensional space with a fully connected layer, and then calculating the association matrix

$$C_i = \mathrm{ReLU}\big(q_i^{\mathsf T} W_b V_i\big)$$

wherein $W_b \in \mathbb{R}^{d \times d}$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting association matrix;

multiplying the association matrix, as a feature, with the question features and fusing the result with the input image features; finally normalizing the weights with the softmax function and outputting the saliency attention weight:

$$a_b = \mathrm{softmax}\big(w_h^{\mathsf T}\,\mathrm{ReLU}\big(W_v' V_i + (W_q' q_i)\,C_i\big)\big)$$

wherein $W_v'$, $W_q'$ and $w_h$ are weight parameters to be learned by the lower-layer attention model, with parameter dimensions set consistently with the upper-layer attention model; K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension of the layer, and ReLU is an activation function in the neural network.
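A NumPy sketch of the lower-layer computation follows. The word-level question matrix Q, the shapes, and the exact fusion form (adding the projected image features to the association-matrix-weighted question features before the softmax) are a plausible reading of the text above, not the patent's definitive formula:

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K, T, d = 36, 14, 512                   # regions, question length, feature dim (assumed)
V = rng.standard_normal((K, d))         # image features V_i (K regions)
Q = rng.standard_normal((T, d))         # word-level question features q_i

# Association matrix C_i = ReLU(q_i^T W_b V_i), shape (T, K).
W_b = rng.standard_normal((d, d)) * 0.01
C = relu(Q @ W_b @ V.T)

# Treat C as a feature: weight the projected question by it, fuse with the
# projected image features, then softmax over the K regions.
W_v = rng.standard_normal((d, d)) * 0.01
W_q = rng.standard_normal((d, d)) * 0.01
H = relu(V @ W_v + C.T @ (Q @ W_q))     # (K, d)
w_h = rng.standard_normal(d) * 0.01
a_b = softmax(H @ w_h)                  # saliency attention weight a_b, shape (K,)
```
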
The attention weight of the image is calculated from the single attention weight and the saliency attention weight as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

wherein $\beta_1$ and $\beta_2$ are the weight-ratio parameters of the upper-layer and lower-layer attention models. In practical applications, the weights between the two models can be allocated by tuning these parameters to achieve a better effect.

The image feature $V_i$ can further be expressed as the set of K spatial-region features, $V_i = \{v_1, \dots, v_K\}$. The attention weight $a_i$ is then multiplied with the feature of each spatial region, and the weighted features are summed to obtain the fine-grained image feature

$$\hat v_i = \sum_{k=1}^{K} a_{i,k}\, v_k$$
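Combining the two weights and pooling the region features reduces to a weighted sum. In the sketch below the uniform stand-in attention vectors are placeholders, while β₁ = 0.7 and β₂ = 0.3 follow the patent's experimental setting:

```python
import numpy as np

rng = np.random.default_rng(4)
K, d = 36, 512
V = rng.standard_normal((K, d))         # region features v_1..v_K

# Hypothetical attention weights from the upper and lower models (each sums to 1).
a_u = np.full(K, 1.0 / K)
a_b = np.full(K, 1.0 / K)

beta1, beta2 = 0.7, 0.3                 # weight ratios from the experiments
a = beta1 * a_u + beta2 * a_b           # combined attention weight a_i

v_fine = a @ V                          # fine-grained image feature: weighted sum over regions
```
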
In step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, each of which normalizes its vector with the activation function ReLU; the outputs are then fused with the Hadamard product to obtain the fused feature

$$h = f_v(\hat v_i) \circ f_q(q_i)$$
Further, visual question answering is a multi-label classification problem; therefore, in step 4), the fused feature is passed through a nonlinear layer $f_o$, which normalizes the vector with the activation function ReLU; the linear mapping $w_o$ is then used to predict the candidate score of each answer,

$$\hat s = \sigma\big(w_o\, f_o(h)\big)$$

and the answers with the highest scores are selected as output; where $\sigma$ is the sigmoid activation function and $w_o$ is the weight parameter to be learned.
Preferably, the sigmoid activation function normalizes the final scores into the interval (0, 1), so that the last stage acts as a logistic regression predicting the correctness of each candidate answer; the objective function is

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\Big[s_{zk}\log \hat s_{zk} + (1-s_{zk})\log(1-\hat s_{zk})\Big]$$

wherein the indices z and k range over the M training questions and the N candidate answers respectively, and $s_{zk}$ is the soft score of the true answer to the question.
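The soft-score binary cross-entropy objective can be written directly; the toy score matrices below (2 questions, 3 candidate answers) are illustrative:

```python
import numpy as np

def vqa_bce_loss(s_hat, s):
    """Soft-score binary cross-entropy summed over M questions x N candidate answers."""
    eps = 1e-9                          # numerical guard against log(0)
    return -np.sum(s * np.log(s_hat + eps) + (1 - s) * np.log(1 - s_hat + eps))

# Predicted sigmoid scores and soft ground-truth scores in [0, 1].
s_hat = np.array([[0.9, 0.1, 0.2], [0.3, 0.8, 0.1]])
s     = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
loss = vqa_bce_loss(s_hat, s)
```
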
Compared with other common visual question-answering methods that use a softmax classifier, the logistic-regression classification used by this method is more effective. The sigmoid function uses soft scores (soft targets) as the target results, providing a richer training signal that can effectively capture the occasional uncertainty in real answers.
To better observe how the attention model attends to the salient regions of the image, after obtaining the attention maps of the single attention weight and the saliency attention weight $(a_u, a_b)$, the attention maps are visualized as matrix heatmaps using the heatmap functionality of the matplotlib drawing library in Python, as shown in figs. 3 and 4.
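A matplotlib sketch in the spirit of figs. 3 and 4 follows; the random 6x6 attention grids, panel titles, and output filename are assumptions (the patent does not specify the grid size):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                   # headless rendering, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
a_u = rng.random((6, 6)); a_u /= a_u.sum()   # stand-in upper-model attention map
a_b = rng.random((6, 6)); a_b /= a_b.sum()   # stand-in lower-model attention map

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, a, title in zip(axes, (a_u, a_b), ("attention1 (upper)", "attention2 (lower)")):
    im = ax.imshow(a, cmap="hot")       # matrix heatmap of the attention weights
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
fig.savefig("attention_heatmaps.png")
```
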
Figs. 3 and 4 show the upper-layer and lower-layer attention models of the multi-view attention model on two different task images, where attention1 is the attention visualization of the upper-layer model and attention2 that of the lower-layer model. The heat maps show that the added lower-layer attention model is able to learn different important regions of the input image. As seen in fig. 3, for a simple attention task both the upper and lower attention models find the correct position in the image. In fig. 4, however, when a task requires high attention to several locations in the image, the lower-layer attention model focuses on a different portion than the upper-layer model, thereby improving the accuracy of the multi-view attention model, an advantage over prior-art models.
Introduction of the test data set: the VQA v2 dataset (Antol S, Agrawal A, Lu J, et al. VQA: Visual question answering [C]. Proceedings of the IEEE International Conference on Computer Vision. 2015: 2425-2433.) is a large-scale visual question-answering dataset in which all questions and answers are manually annotated. The dataset contains 443,757 training questions, 214,354 validation questions, and 447,793 test questions. Each image is associated with three questions, and ten answers are provided for each question by annotators. As in the standard visual question-answering task, the questions in this dataset are classified as: yes/no, number, and other.
Further, to verify the effectiveness of the invention, its results were compared with those of the 2017 VQA Challenge champion (Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint arXiv:1707.07998, 2017.). As shown in fig. 5, the original simple attention model in the reproduced paper code was replaced with the multi-view attention model; the multi-view attention model of the invention finally scored 64.35%, an accuracy evaluation about 1.2% higher than that reported in the paper.
Some basic parameters were set in the experiment as follows: the base learning rate $\alpha = 0.0007$; the random deactivation rate after each LSTM layer, dropout $= 0.3$; the answer-screening setting $N = 3000$; the number of hidden neurons of the fully connected layer, num_hid $= 1024$; and the training batch size, batch_size $= 512$. The weights of the single attention weight and the saliency attention weight were allocated as $\beta_1 = 0.7$, $\beta_2 = 0.3$.
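The experimental settings above can be collected in a config sketch; the key names are assumptions:

```python
# Hyper-parameters as reported in the experiments (key names are hypothetical).
config = {
    "learning_rate": 7e-4,      # base learning rate alpha
    "dropout": 0.3,             # after each LSTM layer
    "num_candidate_answers": 3000,
    "num_hidden": 1024,         # fully connected layer width
    "batch_size": 512,
    "beta1": 0.7,               # upper-layer attention weight ratio
    "beta2": 0.3,               # lower-layer attention weight ratio
}
```
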
As shown in fig. 6, the loss function value (loss) of the model decreases with the increase of the training period; as shown in fig. 7, the model accuracy rate is represented on the training set and the test set respectively as the training period increases.
Comparison of the representative method of the present invention in the test-dev case with the VQA task on the public standard data set VQA v2 is shown in Table 1.
TABLE 1
Specifically, the data were evaluated in 3 categories according to question type, after which the overall evaluation result was calculated. The question types are Yes/No questions, Number questions, and other open-ended questions. The scores in the table are the accuracies of the model on the answers to the different question types; larger values indicate higher accuracy. As the table shows, the multi-view attention model of the invention achieves better results across the different tasks.
In particular, the multi-view attention model strengthens the fine-grained feature expression and improves object detection and recognition, showing a clear improvement on the Number evaluation compared with prior methods. The overall accuracy of the model is better than that of most existing methods.
The above examples are provided only for illustrating the present invention and are not intended to limit the present invention. Changes, modifications, etc. to the above-described embodiments are intended to fall within the scope of the claims of the present invention as long as they are in accordance with the technical spirit of the present invention.
Claims (7)
1. A fine-grained visual question-answering method combined with a multi-view attention mechanism is characterized by comprising the following steps:
1) Inputting an image and extracting image features; inputting a question text and extracting question features;
2) Inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) with the attention weight to obtain fine-grained features of the image;
3) Fusing the fine-grained features of the image with the question features to obtain fused features;
4) Inputting the fused features into a classifier and predicting the answer;
the multi-view attention model comprises an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model, and a significance attention weight is obtained through the lower-layer attention model, the significance attention weight reflecting the different amounts of attention allocated to different target regions in the image;
the single attention weight is obtained as follows:
inputting the image features and the question features into the upper-layer attention model; projecting the image features and the question features into the same dimensional space, each through one fully connected layer, and normalizing the vectors with the ReLU activation function; fusing the two with a Hadamard product; passing the fused result sequentially through two fully connected layers with learnable parameters; and finally normalizing with a softmax function to obtain the single attention weight;
wherein V_i is the image feature, q_i is the question feature, and the remaining matrices are the weight parameters to be learned by the upper-layer attention model; K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in a network layer, h is the output dimension set for the layer, and ReLU is an activation function in the neural network, which can be expressed as f(x) = max(0, x);
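The claim's formulas are rendered as images and do not survive extraction; the following is a minimal NumPy sketch of the described upper-layer (single) attention path, under the assumption that the question is mean-pooled to one vector, with illustrative sizes and random matrices standing in for the learned parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: K regions, T question words, d feature dim, h hidden dim
K, T, d, h = 4, 6, 8, 5
rng = np.random.default_rng(0)
V = rng.normal(size=(K, d))          # image features V_i
q = rng.normal(size=(T, d))          # question features q_i

# One fully connected layer per modality projects into a shared space, then ReLU
W_v, W_q = rng.normal(size=(d, h)), rng.normal(size=(d, h))
v_proj = relu(V @ W_v)               # (K, h)
q_proj = relu(q.mean(axis=0) @ W_q)  # question pooled to one vector, (h,)

# Hadamard-product fusion, two further FC layers, softmax over the K regions
fused = v_proj * q_proj              # (K, h), broadcast Hadamard product
W1, w2 = rng.normal(size=(h, h)), rng.normal(size=(h,))
alpha_single = softmax(relu(fused @ W1) @ w2)  # single attention weight, (K,)
```

The softmax guarantees the K region weights are non-negative and sum to one.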
the method for obtaining the significance attention weight is as follows:
inputting image characteristics and problem characteristics to a lower-layer attention model, respectively projecting data of the image characteristics and the problem characteristics to the same dimensional space by using a full-connection layer, and calculating a correlation matrix C i =ReLu(q i T W b V i ) (ii) a Wherein the content of the first and second substances,the weight parameters to be learned for the underlying attention model,obtaining an incidence matrix;
multiplying the incidence matrix as a characteristic by the problem characteristic, and fusing the incidence matrix with the input image characteristic, wherein the fused parameter isFinally, the weight is normalized by using a softmax function, and the significance attention weight is output
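With the claim's remaining formulas likewise lost to extraction, a sketch of the lower-layer path under assumed shapes: the explicit correlation matrix C_i = ReLU(q_i^T W_b V_i) relates each question word to each image region; the fusion step here (per-region Hadamard product plus a learned projection) is an assumption consistent with the prose:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K, T, d = 4, 6, 8                  # regions, question length, feature dim
rng = np.random.default_rng(1)
V = rng.normal(size=(K, d))        # image features V_i
q = rng.normal(size=(T, d))        # question features q_i

# Correlation matrix: one entry per (question word, image region) pair
W_b = rng.normal(size=(d, d))
C = relu(q @ W_b @ V.T)            # C_i, shape (T, K)

# Use C as a feature: aggregate question evidence per region, fuse with image
q_attended = C.T @ q               # (K, d)
W_f = rng.normal(size=(d,))        # stand-in for the learned fusion parameters
scores = (q_attended * V) @ W_f    # one score per region
alpha_sal = softmax(scores)        # significance attention weight, (K,)
```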
2. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein the attention weight of the image is calculated from the single attention weight and the significance attention weight as their weighted combination:
wherein β1 and β2 are the weight ratios of the upper-layer and lower-layer attention models, respectively, and are hyper-parameters.
3. The fine-grained visual question-answering method combined with the multi-view attention mechanism according to claim 2, wherein in step 3), the fine-grained image features and the question features are passed through nonlinear layers f_v and f_q, respectively, in which the vectors are normalized with the ReLU activation function; the two are then fused with a Hadamard product to obtain the fused features.
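A sketch of the claim-3 fusion, with one fully connected layer plus ReLU standing in for each nonlinear layer f_v, f_q (an assumption; the claim does not fix their internals):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

d, h = 8, 5
rng = np.random.default_rng(3)
v_fine = rng.normal(size=d)     # fine-grained image feature from step 2)
q_vec = rng.normal(size=d)      # question feature

# Nonlinear layers f_v and f_q, then Hadamard-product fusion
W_v, W_q = rng.normal(size=(d, h)), rng.normal(size=(d, h))
f_v = relu(v_fine @ W_v)
f_q = relu(q_vec @ W_q)
joint = f_v * f_q               # fused feature, (h,)
```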
4. The fine-grained visual question-answering method combined with the multi-view attention mechanism according to claim 3, wherein in step 4), the fused features are passed through a nonlinear layer f_o, in which the vector is normalized with the ReLU activation function; a linear mapping w_o then predicts the candidate answer scores; finally, the answer with the highest score is selected as the output;
wherein σ is the sigmoid activation function and w_o is the weight parameter to be learned.
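A sketch of the claim-4 answer scoring with random stand-in weights; the number of candidate answers N is an assumed illustrative value:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h, N = 5, 3                      # fused-feature dim, candidate answers
rng = np.random.default_rng(4)
joint = rng.normal(size=h)       # fused feature from the previous step

W_o = rng.normal(size=(h, h))
f_o = relu(joint @ W_o)          # nonlinear layer f_o
w_o = rng.normal(size=(h, N))    # linear mapping w_o
scores = sigmoid(f_o @ w_o)      # per-candidate scores in (0, 1)
answer = int(np.argmax(scores))  # select the highest-scoring candidate
```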
5. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 4, wherein the sigmoid activation function normalizes the final scores into the interval (0, 1), the last stage is treated as a logistic regression predicting the correctness of each candidate answer, and the objective function is
wherein z and k range over the M training questions and the N candidate answers, respectively, and s_zk is the ground-truth answer score for the question.
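The objective described in claim 5 (its formula was an image in the source) matches a summed binary cross-entropy over questions z and candidates k; a sketch with toy soft targets s_zk:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

M, N = 2, 3                      # training questions, candidate answers
rng = np.random.default_rng(5)
logits = rng.normal(size=(M, N))               # raw candidate scores
s = np.array([[1.0, 0.0, 0.0],                 # ground-truth targets s_zk
              [0.0, 0.5, 1.0]])                # (soft scores are allowed)

p = sigmoid(logits)
# Binary cross-entropy, summed over z = 1..M and k = 1..N
L = -np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))
```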
6. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1), a standard Faster R-CNN model performs feature extraction on the input image I_i to obtain deep image features V_i = FasterRCNN(I_i).
7. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1), for an input question text Q_i, the question text Q_i is first split into words using spaces and punctuation, and the words are initialized with a pre-trained GloVe word embedding to obtain the encoded i-th specified question, wherein x_t^(i) denotes the t-th word's entry in the vocabulary;
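A sketch of the claim-7 preprocessing: splitting on spaces and punctuation, then looking up each word's vector. The regex and the random table standing in for pre-trained GloVe vectors are illustrative assumptions:

```python
import re
import numpy as np

def tokenize(question):
    # Split on spaces and punctuation, keeping in-word apostrophes
    return [w.lower() for w in re.findall(r"[a-zA-Z']+", question)]

toks = tokenize("What color is the cat's hat?")

# Toy embedding table standing in for pre-trained GloVe vectors (assumed d=4)
vocab = {w: i for i, w in enumerate(sorted(set(toks)))}
E = np.random.default_rng(6).normal(size=(len(vocab), 4))
Q = np.stack([E[vocab[w]] for w in toks])   # encoded question x_1 .. x_T
```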
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910927585.4A CN110717431B (en) | 2019-09-27 | 2019-09-27 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110717431A CN110717431A (en) | 2020-01-21 |
CN110717431B true CN110717431B (en) | 2023-03-24 |
Family
ID=69211080
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325000B (en) * | 2020-01-23 | 2021-01-26 | 北京百度网讯科技有限公司 | Language generation method and device and electronic equipment |
CN111325243B (en) * | 2020-02-03 | 2023-06-16 | 天津大学 | Visual relationship detection method based on regional attention learning mechanism |
CN113837212B (en) * | 2020-06-24 | 2023-09-26 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN111860653A (en) * | 2020-07-22 | 2020-10-30 | 苏州浪潮智能科技有限公司 | Visual question answering method and device, electronic equipment and storage medium |
CN111984772B (en) * | 2020-07-23 | 2024-04-02 | 中山大学 | Medical image question-answering method and system based on deep learning |
CN112100346B (en) * | 2020-08-28 | 2021-07-20 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN112163608B (en) * | 2020-09-21 | 2023-02-03 | 天津大学 | Visual relation detection method based on multi-granularity semantic fusion |
CN112488111B (en) * | 2020-12-18 | 2022-06-14 | 贵州大学 | Indication expression understanding method based on multi-level expression guide attention network |
CN112732879B (en) * | 2020-12-23 | 2022-05-10 | 重庆理工大学 | Downstream task processing method and model of question-answering task |
CN112905819B (en) * | 2021-01-06 | 2022-09-23 | 中国石油大学(华东) | Visual question-answering method of original feature injection network based on composite attention |
CN113761153B (en) * | 2021-05-19 | 2023-10-24 | 腾讯科技(深圳)有限公司 | Picture-based question-answering processing method and device, readable medium and electronic equipment |
CN113223018A (en) * | 2021-05-21 | 2021-08-06 | 信雅达科技股份有限公司 | Fine-grained image analysis processing method |
CN113407794B (en) * | 2021-06-01 | 2023-10-31 | 中国科学院计算技术研究所 | Visual question-answering method and system for inhibiting language deviation |
CN113436094B (en) * | 2021-06-24 | 2022-05-31 | 湖南大学 | Gray level image automatic coloring method based on multi-view attention mechanism |
CN113408511B (en) * | 2021-08-23 | 2021-11-12 | 南开大学 | Method, system, equipment and storage medium for determining gazing target |
CN113792617B (en) * | 2021-08-26 | 2023-04-18 | 电子科技大学 | Image interpretation method combining image information and text information |
CN113779298B (en) * | 2021-09-16 | 2023-10-31 | 哈尔滨工程大学 | Medical vision question-answering method based on composite loss |
CN114398471A (en) * | 2021-12-24 | 2022-04-26 | 哈尔滨工程大学 | Visual question-answering method based on deep reasoning attention mechanism |
CN114417044B (en) * | 2022-01-19 | 2023-05-26 | 中国科学院空天信息创新研究院 | Image question and answer method and device |
CN114661874B (en) * | 2022-03-07 | 2024-04-30 | 浙江理工大学 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110111399A (en) * | 2019-04-24 | 2019-08-09 | 上海理工大学 | A kind of image text generation method of view-based access control model attention |
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965705B2 (en) * | 2015-11-03 | 2018-05-08 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering |
Non-Patent Citations (1)
Title |
---|
Research on answer selection methods based on an attention mechanism; Xiong Xue et al.; Intelligent Computer and Applications; 2018-11-05 (No. 06); full text * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||