CN110717431A - Fine-grained visual question and answer method combined with multi-view attention mechanism - Google Patents


Info

Publication number
CN110717431A
Authority
CN
China
Prior art keywords
attention
image
question
layer
weight
Prior art date
Legal status (assumed; not a legal conclusion)
Granted
Application number
CN201910927585.4A
Other languages
Chinese (zh)
Other versions
CN110717431B (en)
Inventor
彭淑娟 (Peng Shujuan)
李磊 (Li Lei)
柳欣 (Liu Xin)
范文涛 (Fan Wentao)
钟必能 (Zhong Bineng)
杜吉祥 (Du Jixiang)
Current Assignee (as listed; may be inaccurate)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Huaqiao University
Priority to CN201910927585.4A
Publication of CN110717431A
Application granted
Publication of CN110717431B
Legal status: Active


Classifications

    • G06V 20/10: Scenes; scene-specific elements; terrestrial scenes
    • G06F 16/3329: Information retrieval of unstructured textual data; natural language query formulation or dialogue systems
    • G06F 16/583: Information retrieval of still image data; retrieval using metadata automatically derived from the content
    • G06F 16/5866: Information retrieval of still image data; retrieval using manually generated metadata, e.g. tags, keywords, comments
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/048: Neural networks; activation functions
    • G06N 3/08: Neural networks; learning methods
    • G06V 10/25: Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a fine-grained visual question-answering method combined with a multi-view attention mechanism. The method fully considers the guiding effect of the specific semantics of the question and provides a multi-view attention model that can effectively select multiple salient target regions related to the current task target (the question). It learns the region information related to the answer in the image and the question text from multiple views, and extracts regional saliency features from the image under the guidance of the question semantics; the resulting feature expression is finer-grained and has stronger descriptive power for the situation in which an image contains several important semantic expression regions. This increases the effectiveness and comprehensiveness of the multi-view attention model, effectively strengthens the semantic relevance between the salient image-region features and the question features, and improves the accuracy and comprehensiveness of semantic understanding in visual question answering. The method carries out the visual question-answering task with simple steps, high efficiency and high accuracy, is ready for commercial application, and has good market prospects.

Description

Fine-grained visual question and answer method combined with multi-view attention mechanism
Technical Field
The invention relates to the technical field of computer vision and natural language processing, and in particular to a fine-grained visual question-answering method combined with a multi-view attention mechanism.
Background
With the rapid development of computer vision and natural language processing, visual question answering has become an increasingly popular research field in artificial intelligence. Visual question answering is an emerging topic whose task combines the two subject fields of computer vision and natural language processing: given an image and a natural-language question related to the image as input, it generates a natural-language answer as output. Visual question answering is a key application direction in the field of artificial intelligence and, by simulating real-world scenes, can help visually impaired users perform real-time human-computer interaction.
In essence, a visual question-answering system is treated as a classification task: the common practice is to extract image and question features from the given image and question, fuse the two kinds of features, and classify the fused features to obtain the answer. In recent years, visual question answering has attracted a great deal of attention in the fields of computer vision and natural language processing. Because visual question answering is relatively complex and requires both image and text processing, some existing methods lack accuracy to a certain extent and face major challenges.
In practical applications, visual question-answering systems often face the high dimensionality of images and noise effects that disturb the algorithm's answer prediction. An effective visual question-answering model should therefore mine the structural features and semantically correlated parts of the image that are consistent with the question semantics, so as to make fine-grained predictions.
The visual attention model is a computer simulation of the human visual attention mechanism that locates the most noticeable part of an image, namely its salient region. In visual question answering, most methods that use a single attention mechanism ignore differences in the structural semantics of the image and handle poorly the situation in which an image contains several important regions, so the attention they produce inevitably harms the accuracy of the visual question answering.
Research shows that most existing visual question-answering methods predict the semantic answer from the question and the whole picture, without considering the guiding effect of the specific semantics of the question; as a result, the image-region features learned by such models are only weakly related to the question features in semantic space.
In summary, visual question-answering methods in the prior art still need to be improved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a fine-grained visual question-answering method combined with a multi-view attention mechanism. The method can effectively improve the accuracy and comprehensiveness of visual semantic information extraction and reduce the influence of redundant and noisy data, thereby improving the fine-grained recognition capability of a visual question-answering system and its judgment of complex questions, and improving the accuracy and interpretability of the model to a certain extent.
The technical scheme of the invention is as follows:
a fine-grained visual question-answering method combined with a multi-view attention mechanism comprises the following steps:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) with the attention weight to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier and predicting the answer.
Preferably, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight representing that different target regions in the image receive different attention resources.
Preferably, the single attention weight is obtained as follows:

the image features and the question features are input to the upper-layer attention model; one fully connected layer projects each of them into the same dimensional space, and the vectors are normalized with the activation function ReLU; the projections are then fused with the Hadamard product and passed in turn through two fully connected layers with learnable parameters, giving

$a_i^u = W_2\,\mathrm{ReLU}\big(W_1(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$

finally, the weights are normalized with a softmax function to obtain the single attention weight $\alpha_i^u = \mathrm{softmax}(a_i^u)$;

where $V_i$ is the image feature, $q_i$ is the question feature, $W_v, W_q, W_1, W_2$ are the weight parameters to be learned by the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set for the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).
Preferably, the saliency attention weight is obtained as follows:

the image features and the question features are input to the lower-layer attention model; a fully connected layer projects each of them into the same dimensional space, and the correlation matrix $C_i = \mathrm{ReLU}(q_i^{T} W_b V_i)$ is calculated, where $W_b$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting correlation matrix;

the correlation matrix is multiplied, as a feature, with the question features and fused with the input image features, giving

$a_i^b = W'\big(\mathrm{ReLU}(W'_v V_i) \circ \mathrm{ReLU}(W'_q\, q_i C_i)\big)$

finally, the weights are normalized with a softmax function and the saliency attention weight $\alpha_i^b = \mathrm{softmax}(a_i^b)$ is output, where $W'_v, W'_q, W'$ are weight parameters to be learned by the lower-layer attention model.
Preferably, the attention weight of the image is calculated based on the single attention weight and the saliency attention weight, specifically as follows:

$\alpha_i = \beta_1 \alpha_i^u + \beta_2 \alpha_i^b$

where $\beta_1$ and $\beta_2$, the weight ratios of the upper-layer and lower-layer attention models, are hyper-parameters.
Preferably, in step 3), the fine-grained image features and the question features are respectively passed through nonlinear layers $f_v$ and $f_q$, in which the vectors are normalized with the activation function ReLU; the Hadamard product is then used for fusion, giving the fused features $h_i = f_v(\hat{v}_i) \circ f_q(q_i)$, where $\hat{v}_i$ denotes the fine-grained image features.
Preferably, in step 4), the fused features are passed through a nonlinear layer $f_o$, in which the vector is normalized with the activation function ReLU; a linear mapping $w_o$ is then used to predict the candidate answer scores

$\hat{s} = \sigma(w_o f_o(h_i))$

and finally the higher-scoring answers are selected as output;

where σ is the sigmoid activation function and $w_o$ is the weight parameter to be learned.
Preferably, the sigmoid activation function normalizes the final scores into the (0, 1) interval; the last stage acts as a logistic regression predicting the correctness of each candidate answer, with the objective function

$L = -\sum_{z}\sum_{k}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$

where z and k respectively cover the M training questions and the N candidate answers, and $s_{zk}$ is the soft score of the true answers to the question.
Preferably, in step 1), a Faster R-CNN standard model performs feature extraction on the input image $I_i$ to obtain the deeply expressed image features $V_i = \mathrm{FasterRCNN}(I_i)$.

Preferably, in step 1), after a question text $Q_i$ is input, the question text $Q_i$ is first divided into words using spaces and punctuation, and the words are initialized with the pre-trained GloVe word-embedding method to obtain the encoded form of the i-th specified question $Q_i = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, where $x_t^{(i)}$ denotes the t-th word of the question in the vocabulary;

then $Q_i$ is input into an LSTM network and the output $q_i$ of the last layer is taken as the encoding of $Q_i$, giving the question feature $q_i$.
The invention has the following beneficial effects:
the invention provides a multi-view attention model by combining a fine-grained vision question-answering method of a multi-view attention mechanism, which can effectively select a plurality of significant target areas related to a current task target (question), extract area significance characteristics in an image under the guidance of question semantics, has fine-grained characteristic expression, expresses the condition that a plurality of important semantic expression areas exist in the image, and has strong depicting capability.
The method fully considers the guiding effect of the specific semantics of the question and learns the region information related to the answer in the image and the question text from multiple views, which increases the effectiveness and comprehensiveness of the multi-view attention model, effectively strengthens the semantic relevance between the salient image-region features and the question features, and improves the accuracy and comprehensiveness of semantic understanding in visual question answering.
The method carries out the visual question-answering task with simple steps, high efficiency and high accuracy, is ready for commercial application, and has good market prospects.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of a multi-view attention model;
FIG. 3 is an attention-weight visualization heatmap (a simple attention task);
FIG. 4 is an attention-weight visualization heatmap (a task requiring strong attention to multiple locations in the image);
FIG. 5 compares the results of the multi-view attention model of the invention with more advanced methods;
FIG. 6 is the loss-function curve of the final model performance training;
FIG. 7 is the training/validation score curve of the final model performance training.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The invention provides a fine-grained visual question-answering method combining a multi-view attention mechanism to remedy the defects of the prior art. Visual question answering can be regarded as a multi-task classification problem, and each answer can be regarded as a classification category. In a typical visual question-answering system, the answers are encoded with the One-Hot method to obtain a One-Hot vector for each answer, forming an answer vector table. One-Hot encoding represents categorical variables as binary vectors: the categorical values are first mapped to integer values, and each integer value is then represented as a binary vector that is zero everywhere except at the index of the integer, which is marked 1.
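As a concrete illustration, the following minimal Python sketch shows this One-Hot encoding of an answer table (the helper names and the toy answer list are illustrative assumptions, not from the patent):

```python
import numpy as np

def build_answer_table(answers):
    """Map each distinct answer string to an integer index."""
    return {ans: idx for idx, ans in enumerate(sorted(set(answers)))}

def one_hot(answer, answer_table):
    """Return the One-Hot vector of an answer: all zeros except a 1 at its index."""
    vec = np.zeros(len(answer_table), dtype=np.float32)
    vec[answer_table[answer]] = 1.0
    return vec

table = build_answer_table(["yes", "no", "2", "red"])
print(one_hot("red", table))  # a length-4 binary vector with a single 1
```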
As shown in fig. 1, the fine-grained visual question-answering method combined with a multi-view attention mechanism according to the invention generally comprises the following steps:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) with the attention weight to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier and predicting the answer.
In this embodiment, in step 1), a Faster R-CNN standard model performs feature extraction on the input image $I_i$, giving the image features $V_i = \mathrm{FasterRCNN}(I_i)$. Let K be the number of spatial regions of the image features; the image features can then be further expressed as $V_i = \{v_1^{(i)}, \dots, v_K^{(i)}\}$, where $v_k^{(i)} \in \mathbb{R}^d$ is the k-th region feature extracted by Faster R-CNN and d is the number of hidden neurons in the network layer, which also represents the output dimension.
In step 1), after a question text $Q_i$ is input, the question text $Q_i$ is divided into words using spaces and punctuation, and the words are initialized with the pre-trained GloVe word-embedding method (Global Vectors for Word Representation) to obtain the encoded form of the i-th specified question $Q_i = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, where $x_t^{(i)}$ denotes the t-th word of the question in the vocabulary;

then $Q_i$ is input into an LSTM network (specifically, a standard LSTM network containing 1280 hidden units), and the output $q_i$ of the last layer is taken as the encoding of $Q_i$, giving the question feature $q_i$.
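A minimal PyTorch sketch of step 1) follows. It assumes the Faster R-CNN region features are precomputed (common practice), and the class and parameter names are ours rather than the patent's; only the 1280-unit LSTM size comes from the text above:

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """GloVe-initialized embedding followed by an LSTM; the last hidden
    state is taken as the question feature q_i."""
    def __init__(self, glove_weights, hidden=1280):
        super().__init__()
        # glove_weights: (vocab_size, 300) tensor of pre-trained GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm = nn.LSTM(glove_weights.size(1), hidden, batch_first=True)

    def forward(self, token_ids):      # token_ids: (B, T) word indices
        emb = self.embed(token_ids)    # (B, T, 300)
        _, (h_n, _) = self.lstm(emb)   # h_n: (1, B, hidden)
        return h_n.squeeze(0)          # q_i: (B, hidden)

# Image features V_i are assumed precomputed by Faster R-CNN as a
# (B, K, d) tensor, e.g. K = 36 regions with d = 2048 per region.
```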
Then, the acquired image features $V_i$ and the encoded question features $q_i$ are both input into the multi-view attention model to calculate the attention weight of the image.
In essence, the visual attention mechanism selects from the image the target regions that are more critical to the current task target, so that more attention resources are invested in those regions to acquire more detailed information about the targets of interest while other useless information is suppressed. In the visual question-answering task, semantic expressions are diverse; in particular, some questions require the model to understand semantic expressions between multiple target objects in an image. A single visual attention model therefore cannot effectively mine the relevance between the different semantic objects in the image and the question semantics.

To solve this problem, the invention provides a multi-view attention model that uses two different attention mechanisms to jointly learn the important regions with different semantics on which the question may focus, thereby obtaining a fine-grained attention feature map of the image. The multi-view attention model attends to the image to obtain its attention weight, and this weight is used to weight the image features; the accumulated vector serves as the final image feature representation, so that the fine-grained image features are well associated with the question semantics.
As shown in fig. 2, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight representing that different target regions in the image receive different attention resources.
Specifically, in the upper-layer attention model, the single attention weight is obtained as follows:

the image features and the question features are input to the upper-layer attention model; one fully connected layer projects each of them into the same dimensional space, and the vectors are normalized with the activation function ReLU; the projections are then fused with the Hadamard product and passed in turn through two fully connected layers with learnable parameters, giving

$a_i^u = W_2\,\mathrm{ReLU}\big(W_1(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$

where $V_i$ is the image feature, $q_i$ is the question feature, $W_v, W_q, W_1, W_2$ are the weight parameters to be learned by the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set for the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x);

finally, the weights are normalized with a softmax function to obtain the single attention weight $\alpha_i^u = \mathrm{softmax}(a_i^u)$.
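The PyTorch sketch below shows one plausible wiring of this upper-layer attention; since the patent's formulas are published as images, the exact layer arrangement is an assumption consistent with the description (FC projections with ReLU, Hadamard fusion, two FC layers, softmax over the K regions):

```python
import torch.nn as nn
import torch.nn.functional as F

class UpperAttention(nn.Module):
    """Single attention weight: project image and question features to a
    common space, fuse with the Hadamard product, score each region."""
    def __init__(self, d_img, d_q, h):
        super().__init__()
        self.proj_v = nn.Linear(d_img, h)
        self.proj_q = nn.Linear(d_q, h)
        self.fc1 = nn.Linear(h, h)
        self.fc2 = nn.Linear(h, 1)

    def forward(self, V, q):                       # V: (B, K, d_img), q: (B, d_q)
        v_p = F.relu(self.proj_v(V))               # (B, K, h)
        q_p = F.relu(self.proj_q(q)).unsqueeze(1)  # (B, 1, h), broadcast over K
        fused = v_p * q_p                          # Hadamard-product fusion
        logits = self.fc2(F.relu(self.fc1(fused))).squeeze(-1)  # (B, K)
        return F.softmax(logits, dim=1)            # single attention weight
```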
With a single attention weight $\alpha_i^u$, the softmax makes one weight large while the remaining weights become small. Since an image often contains several different semantics, and these semantics are often expressed visually in different regions, the single attention weight $\alpha_i^u$ tends to ignore some region information with important semantics. To supplement the attention information missed by the upper-layer attention model, the invention further provides the lower-layer attention model. The lower-layer attention model simultaneously takes into account the relevance of the image and the question semantics, achieving a question-guided learning mechanism for the multi-view attention model and increasing the fine-grained feature-mining capability.
Specifically, in the lower-layer attention model, the saliency attention weight is obtained as follows:

the image features and the question features are input to the lower-layer attention model; a fully connected layer projects each of them into the same dimensional space, and the correlation matrix $C_i = \mathrm{ReLU}(q_i^{T} W_b V_i)$ is calculated, where $W_b$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting correlation matrix;

the correlation matrix is multiplied, as a feature, with the question features and fused with the input image features, giving

$a_i^b = W'\big(\mathrm{ReLU}(W'_v V_i) \circ \mathrm{ReLU}(W'_q\, q_i C_i)\big)$

finally, the weights are normalized with a softmax function and the saliency attention weight $\alpha_i^b = \mathrm{softmax}(a_i^b)$ is output;

where $W'_v, W'_q, W'$ are the weight parameters to be learned by the lower-layer attention model, whose dimensions are set consistently with those of the upper-layer attention model; K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension of the layer, and ReLU is an activation function in the neural network.
The attention weight of the image is calculated based on the single attention weight and the saliency attention weight, specifically as follows:

$\alpha_i = \beta_1 \alpha_i^u + \beta_2 \alpha_i^b$

where $\beta_1$ and $\beta_2$ are the weight-ratio parameters of the upper-layer and lower-layer attention models. In practical applications, the weights between the upper-layer and lower-layer attention models can be allocated by tuning these parameters so as to achieve a better effect.

The image features $V_i$ can further be expressed as the collection of K spatial region features, $V_i = \{v_1^{(i)}, \dots, v_K^{(i)}\}$; the attention weight $\alpha_i$ is then multiplied with the image features of each spatial region and the results are summed, giving the fine-grained image features $\hat{v}_i = \sum_{k=1}^{K} \alpha_{ik}\, v_k^{(i)}$.
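In code, combining the two views and pooling the regions could look like this ($\beta_1 = 0.7$ and $\beta_2 = 0.3$ are the ratios reported in the experiments below):

```python
import torch

def fine_grained_features(V, a_u, a_b, beta1=0.7, beta2=0.3):
    """Weighted sum of the K region features under the combined attention."""
    a = beta1 * a_u + beta2 * a_b                   # (B, K) combined weight
    return torch.bmm(a.unsqueeze(1), V).squeeze(1)  # (B, d) fine-grained feature
```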
In step 3), the fine-grained image features and the question features are respectively passed through nonlinear layers $f_v$ and $f_q$, in which the vectors are normalized with the activation function ReLU; the Hadamard product is then used for fusion, giving the fused features $h_i = f_v(\hat{v}_i) \circ f_q(q_i)$.
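A minimal sketch of this fusion step, with assumed layer names:

```python
import torch.nn as nn
import torch.nn.functional as F

class HadamardFusion(nn.Module):
    """Nonlinear layers f_v and f_q with ReLU, then Hadamard-product fusion."""
    def __init__(self, d_img, d_q, num_hid):
        super().__init__()
        self.f_v = nn.Linear(d_img, num_hid)
        self.f_q = nn.Linear(d_q, num_hid)

    def forward(self, v_hat, q):       # v_hat: (B, d_img), q: (B, d_q)
        return F.relu(self.f_v(v_hat)) * F.relu(self.f_q(q))  # (B, num_hid)
```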
Furthermore, the visual question-answering problem is a multi-label classification problem. In step 4), the fused features are passed through a nonlinear layer $f_o$, in which the vector is normalized with the activation function ReLU; a linear mapping $w_o$ is then used to predict the candidate answer scores $\hat{s} = \sigma(w_o f_o(h_i))$, and finally the higher-scoring answers are selected as output; where σ is the sigmoid activation function and $w_o$ is the weight parameter to be learned.
Preferably, the sigmoid activation function normalizes the final scores into the (0, 1) interval; the last stage acts as a logistic regression predicting the correctness of each candidate answer, with the objective function

$L = -\sum_{z}\sum_{k}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$

where the indices z and k respectively cover the M training questions and the N candidate answers, and $s_{zk}$ is the soft score of the true answers to the question.
Compared with the softmax classifier commonly used in other visual question-answering work, the logistic-regression classification used by the method is more effective. The sigmoid function uses soft scores (soft targets) as the target results, which provides richer training signals and effectively captures the occasional uncertainty in the true answers.
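A hedged sketch of the classifier head and the soft-target logistic-regression objective; PyTorch's binary cross-entropy with logits implements the loss above, with the sigmoid folded in for numerical stability:

```python
import torch.nn as nn
import torch.nn.functional as F

class AnswerHead(nn.Module):
    """Nonlinear layer f_o with ReLU, then the linear mapping w_o."""
    def __init__(self, num_hid, num_answers=3000):
        super().__init__()
        self.f_o = nn.Linear(num_hid, num_hid)
        self.w_o = nn.Linear(num_hid, num_answers)

    def forward(self, fused):          # fused: (B, num_hid)
        return self.w_o(F.relu(self.f_o(fused)))  # candidate answer logits

def vqa_loss(logits, soft_scores):
    """Soft-target binary cross-entropy summed over questions and answers;
    soft_scores holds the ground-truth scores s_zk in [0, 1]."""
    return F.binary_cross_entropy_with_logits(logits, soft_scores, reduction="sum")
```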
To better observe how the attention model focuses on the salient regions of the image, after the attention maps $(\alpha^u, \alpha^b)$ of the single attention weight and the saliency attention weight are obtained, the attention maps are visualized as matrix heatmaps using the heatmap function of the matplotlib plotting library in Python, as shown in figs. 3 and 4.
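A minimal visualization sketch; it assumes for display that the K attention weights lie on a rectangular grid (with Faster R-CNN region proposals one would instead draw the weighted boxes on the image):

```python
import matplotlib.pyplot as plt

def show_attention_heatmap(att, n_rows, n_cols, title):
    """Render a (K,) attention weight vector as a matrix heatmap."""
    grid = att.reshape(n_rows, n_cols)   # assumes K == n_rows * n_cols
    plt.imshow(grid, cmap="hot")
    plt.colorbar()
    plt.title(title)
    plt.show()
```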
Figs. 3 and 4 show the upper-layer and lower-layer attention models of the multi-view attention model on 2 different task images, where attention1 is the attention visualization of the upper-layer attention model and attention2 is that of the lower-layer attention model. The attention heatmaps show that the added lower-layer attention model is able to learn different important regions of the input image. As fig. 3 shows, for a simple attention task the upper-layer and lower-layer attention models both find the correct position in the image. In fig. 4, however, when the task requires strong attention to several locations in the image, the lower-layer attention model focuses on different parts than the upper-layer attention model, thereby improving the accuracy of the multi-view attention model; this is an advantage over prior-art models.
Introduction of the test dataset: the VQA v2 dataset (Antol S, Agrawal A, Lu J, et al. VQA: Visual question answering. Proceedings of the IEEE International Conference on Computer Vision, 2015: 2425-2433) is a large-scale visual question-answering dataset in which all questions and answers are manually annotated. The dataset contains 443,757 training questions, 214,354 validation questions and 447,793 test questions. Each image is associated with three questions, and ten answers are provided for each question by the annotators. As in standard visual question-answering tasks, the questions in this dataset are classified as: yes/no, Number and other.
Further, to verify the effectiveness of the invention, it was compared with the results of the 2017 VQA challenge champion (Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint arXiv:1707.07998, 2017). As shown in fig. 5, starting from the reproduced code of that paper, the original simple attention model was replaced with the multi-view attention model; the final score of the multi-view attention model is 64.35%, about 1.2% higher in evaluated accuracy than the paper.
Some basic parameters were set in the experiments as follows: the base learning rate is α = 0.0007; the random deactivation rate is dropout = 0.3 after each LSTM layer; the answer-screening setting is N = 3000. The hidden neurons of the fully connected layers are set to num_hid = 1024, and the batch size is set to batch_size = 512. The weights of the single attention weight and the saliency attention weight are allocated as $\beta_1 = 0.7$, $\beta_2 = 0.3$.
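Gathered into a hypothetical configuration dictionary, these reported settings read:

```python
config = dict(
    lr=0.0007,            # base learning rate alpha
    dropout=0.3,          # random deactivation rate after each LSTM layer
    num_answers=3000,     # answer screening N
    num_hid=1024,         # hidden neurons of the fully connected layers
    batch_size=512,       # batch training size
    beta1=0.7, beta2=0.3, # single / saliency attention weight ratio
)
```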
As shown in fig. 6, the loss function value (loss) of the model decreases as the training epochs increase; as shown in fig. 7, the accuracy of the model on the training set and the test set, respectively, is plotted as the training epochs increase.
A comparison of the method of the invention with representative VQA methods on the public standard dataset VQA v2 under the test-dev setting is shown in Table 1.

TABLE 1 (the table body is rendered as an image in the original publication)
Specifically, the results were evaluated in 3 categories according to the question type, and the overall evaluation result was then calculated. The question types are yes/no questions, Number questions and other open-ended questions. The scores in the table are the accuracy of the model on the different types of question answers; a larger value means higher accuracy. As the table shows, the multi-view attention model of the invention achieves better results on the different tasks.
In particular, the multi-view attention model strengthens fine-grained feature expression and improves the detection and recognition of objects; compared with prior methods it shows a clear improvement on the Number evaluation. The overall accuracy of the model is better than that of most existing methods.
The above examples are provided only to illustrate the invention and are not intended to limit it. Changes, modifications and the like to the above embodiments fall within the scope of the claims of the invention as long as they accord with the technical spirit of the invention.

Claims (10)

1. A fine-grained visual question-answering method combined with a multi-view attention mechanism, characterized by comprising the following steps:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, calculating the attention weight of the image, and weighting the image features from step 1) with the attention weight to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier and predicting the answer.
2. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein the multi-view attention model comprises an upper-layer attention model and a lower-layer attention model; a single attention weight is obtained through the upper-layer attention model and a saliency attention weight through the lower-layer attention model, the saliency attention weight representing that different target regions in the image receive different attention resources.
3. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 2, wherein the single attention weight is obtained as follows:

the image features and the question features are input to the upper-layer attention model; one fully connected layer projects each of them into the same dimensional space, and the vectors are normalized with the activation function ReLU; the projections are then fused with the Hadamard product and passed in turn through two fully connected layers with learnable parameters, giving

$a_i^u = W_2\,\mathrm{ReLU}\big(W_1(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$

finally, the weights are normalized with a softmax function to obtain the single attention weight $\alpha_i^u = \mathrm{softmax}(a_i^u)$;

where $V_i$ is the image feature, $q_i$ is the question feature, $W_v, W_q, W_1, W_2$ are the weight parameters to be learned by the upper-layer attention model, K is the number of spatial regions of the image features, T is the length of the selected question features, d is the number of hidden neurons in the network layer, h is the output dimension set for the layer, and ReLU is an activation function in the neural network whose specific form can be expressed as f(x) = max(0, x).
4. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 2, wherein the saliency attention weight is obtained as follows:

the image features and the question features are input to the lower-layer attention model; a fully connected layer projects each of them into the same dimensional space, and the correlation matrix $C_i = \mathrm{ReLU}(q_i^{T} W_b V_i)$ is calculated, where $W_b$ is a weight parameter to be learned by the lower-layer attention model and $C_i \in \mathbb{R}^{T \times K}$ is the resulting correlation matrix;

the correlation matrix is multiplied, as a feature, with the question features and fused with the input image features, giving

$a_i^b = W'\big(\mathrm{ReLU}(W'_v V_i) \circ \mathrm{ReLU}(W'_q\, q_i C_i)\big)$

finally, the weights are normalized with a softmax function and the saliency attention weight $\alpha_i^b = \mathrm{softmax}(a_i^b)$ is output;

where $W'_v, W'_q, W'$ are the weight parameters to be learned by the lower-layer attention model.
5. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to any one of claims 2, 3 and 4, wherein the attention weight of the image is calculated based on the single attention weight and the saliency attention weight, specifically as follows:

$\alpha_i = \beta_1 \alpha_i^u + \beta_2 \alpha_i^b$

where $\beta_1$ and $\beta_2$, the weight ratios of the upper-layer and lower-layer attention models, are hyper-parameters.
6. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 5, wherein in step 3) the fine-grained image features and the question features are respectively passed through nonlinear layers $f_v$ and $f_q$, in which the vectors are normalized with the activation function ReLU; the Hadamard product is then used for fusion, giving the fused features $h_i = f_v(\hat{v}_i) \circ f_q(q_i)$.
7. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 6, wherein in step 4) the fused features are passed through a nonlinear layer $f_o$, in which the vector is normalized with the activation function ReLU; a linear mapping $w_o$ is then used to predict the candidate answer scores

$\hat{s} = \sigma(w_o f_o(h_i))$

and finally the higher-scoring answers are selected as output;

where σ is the sigmoid activation function and $w_o$ is the weight parameter to be learned.
8. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 7, wherein the sigmoid activation function normalizes the final scores into the (0, 1) interval; the last stage acts as a logistic regression predicting the correctness of each candidate answer, with the objective function

$L = -\sum_{z}\sum_{k}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$

where z and k respectively cover the M training questions and the N candidate answers, and $s_{zk}$ is the soft score of the true answers to the question.
9. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1) a Faster R-CNN standard model performs feature extraction on the input image $I_i$ to obtain the deeply expressed image features $V_i = \mathrm{FasterRCNN}(I_i)$.
10. The fine-grained visual question-answering method combined with a multi-view attention mechanism according to claim 1, wherein in step 1), after a question text $Q_i$ is input, the question text $Q_i$ is first divided into words using spaces and punctuation, and the words are initialized with the pre-trained GloVe word-embedding method to obtain the encoded form of the i-th specified question $Q_i = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, where $x_t^{(i)}$ denotes the t-th word of the question in the vocabulary; then $Q_i$ is input into an LSTM network and the output $q_i$ of the last layer is taken as the encoding of $Q_i$, giving the question feature $q_i$.
CN201910927585.4A 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism Active CN110717431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910927585.4A CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910927585.4A CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Publications (2)

Publication Number Publication Date
CN110717431A true CN110717431A (en) 2020-01-21
CN110717431B CN110717431B (en) 2023-03-24

Family

ID=69211080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910927585.4A Active CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Country Status (1)

Country Link
CN (1) CN110717431B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIONG XUE (熊雪) et al.: "Research on Answer Selection Method Based on Attention Mechanism" (基于注意力机制的答案选择方法研究), 《智能计算机与应用》 (Intelligent Computer and Applications) *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325000B (en) * 2020-01-23 2021-01-26 北京百度网讯科技有限公司 Language generation method and device and electronic equipment
US11562150B2 (en) 2020-01-23 2023-01-24 Beijing Baidu Netcom Science Technology Co., Ltd. Language generation method and apparatus, electronic device and storage medium
CN111325000A (en) * 2020-01-23 2020-06-23 北京百度网讯科技有限公司 Language generation method and device and electronic equipment
CN111325243A (en) * 2020-02-03 2020-06-23 天津大学 Visual relation detection method based on regional attention learning mechanism
CN113392288A (en) * 2020-03-11 2021-09-14 阿里巴巴集团控股有限公司 Visual question answering and model training method, device, equipment and storage medium thereof
CN113837212A (en) * 2020-06-24 2021-12-24 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN111860653A (en) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 Visual question answering method and device, electronic equipment and storage medium
CN111984772B (en) * 2020-07-23 2024-04-02 中山大学 Medical image question-answering method and system based on deep learning
CN111984772A (en) * 2020-07-23 2020-11-24 中山大学 Medical image question-answering method and system based on deep learning
CN114092783A (en) * 2020-08-06 2022-02-25 清华大学 Dangerous goods detection method based on attention mechanism continuous visual angle
CN112100346A (en) * 2020-08-28 2020-12-18 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112100346B (en) * 2020-08-28 2021-07-20 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112163608B (en) * 2020-09-21 2023-02-03 天津大学 Visual relation detection method based on multi-granularity semantic fusion
CN112163608A (en) * 2020-09-21 2021-01-01 天津大学 Visual relation detection method based on multi-granularity semantic fusion
CN112488111A (en) * 2020-12-18 2021-03-12 贵州大学 Instruction expression understanding method based on multi-level expression guide attention network
CN112488111B (en) * 2020-12-18 2022-06-14 贵州大学 Indication expression understanding method based on multi-level expression guide attention network
CN112732879A (en) * 2020-12-23 2021-04-30 重庆理工大学 Downstream task processing method and model of question-answering task
CN112905819A (en) * 2021-01-06 2021-06-04 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention
CN112905819B (en) * 2021-01-06 2022-09-23 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113761153A (en) * 2021-05-19 2021-12-07 腾讯科技(深圳)有限公司 Question and answer processing method and device based on picture, readable medium and electronic equipment
CN113223018A (en) * 2021-05-21 2021-08-06 信雅达科技股份有限公司 Fine-grained image analysis processing method
CN113407794B (en) * 2021-06-01 2023-10-31 中国科学院计算技术研究所 Visual question-answering method and system for inhibiting language deviation
CN113436094A (en) * 2021-06-24 2021-09-24 湖南大学 Gray level image automatic coloring method based on multi-view attention mechanism
CN113408511B (en) * 2021-08-23 2021-11-12 南开大学 Method, system, equipment and storage medium for determining gazing target
CN113408511A (en) * 2021-08-23 2021-09-17 南开大学 Method, system, equipment and storage medium for determining gazing target
CN113792617A (en) * 2021-08-26 2021-12-14 电子科技大学 Image interpretation method combining image information and text information
CN113792617B (en) * 2021-08-26 2023-04-18 电子科技大学 Image interpretation method combining image information and text information
CN113779298B (en) * 2021-09-16 2023-10-31 哈尔滨工程大学 Medical vision question-answering method based on composite loss
CN113779298A (en) * 2021-09-16 2021-12-10 哈尔滨工程大学 Medical vision question-answering method based on composite loss
CN114117159A (en) * 2021-12-08 2022-03-01 东北大学 Image question-answering method for multi-order image feature and question interaction
CN114117159B (en) * 2021-12-08 2024-07-12 东北大学 Image question-answering method for multi-order image feature and question interaction
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question and answer method and device
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Also Published As

Publication number Publication date
CN110717431B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111554268B (en) Language identification method based on language model, text classification method and device
Yan Computational methods for deep learning
Zhang et al. Multilabel image classification with regional latent semantic dependencies
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
CN108804530B (en) Subtitling areas of an image
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN111209384A (en) Question and answer data processing method and device based on artificial intelligence and electronic equipment
CN109783666A (en) A kind of image scene map generation method based on iteration fining
Wang et al. Spatial–temporal pooling for action recognition in videos
CN106803098A (en) A kind of three mode emotion identification methods based on voice, expression and attitude
Islam et al. A review on video classification with methods, findings, performance, challenges, limitations and future work
CN112749274A (en) Chinese text classification method based on attention mechanism and interference word deletion
AU2019101138A4 (en) Voice interaction system for race games
CN113987187A (en) Multi-label embedding-based public opinion text classification method, system, terminal and medium
KR20200010672A (en) Smart merchandise searching method and system using deep learning
CN110705490A (en) Visual emotion recognition method
Yan Computational methods for deep learning: theory, algorithms, and implementations
Xia et al. Evaluation of saccadic scanpath prediction: Subjective assessment database and recurrent neural network based metric
CN111898704A (en) Method and device for clustering content samples
Chen et al. STRAN: Student expression recognition based on spatio-temporal residual attention network in classroom teaching videos
Ling et al. A facial expression recognition system for smart learning based on YOLO and vision transformer
CN114840649A (en) Student cognitive diagnosis method based on cross-modal mutual attention neural network
Gong et al. Human interaction recognition based on deep learning and HMM
Li et al. Supervised classification of plant image based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant