CN110717431A

CN110717431A - A fine-grained visual question answering method combined with multi-view attention mechanism

Info

Publication number: CN110717431A
Application number: CN201910927585.4A
Authority: CN
Inventors: 彭淑娟; 李磊; 柳欣; 范文涛; 钟必能; 杜吉祥
Original assignee: Huaqiao University
Current assignee: Huaqiao University
Priority date: 2019-09-27
Filing date: 2019-09-27
Publication date: 2020-01-21
Anticipated expiration: 2039-09-27
Also published as: CN110717431B

Abstract

The invention relates to a fine-grained visual question answering method combined with a multi-perspective attention mechanism, fully considers the guiding role of the specific semantics of the question, and proposes a multi-perspective attention model, which can effectively select the current task target (problem). Multiple salient target regions, learn from multiple perspectives to obtain the region information related to the answer in the image and the question text, extract the regional salient features in the image under the guidance of the question semantics, have more fine-grained feature expression, and analyze the image. There are multiple important semantic expression regions in the image, which has strong characterization ability, which increases the effectiveness and comprehensiveness of the multi-view attention model. Accuracy and Comprehensiveness of Semantic Understanding for Visual Question Answering. Using the method of the present invention to perform a visual question and answer task has the advantages of simple steps, high efficiency and high accuracy, and can be fully used in business and has a good market prospect.

Description

A fine-grained visual question answering method combined with multi-view attention mechanism

技术领域technical field

本发明涉及计算机视觉与自然语言处理技术领域，更具体地说，涉及一种结合多视角注意力机制的细粒度视觉问答方法。The invention relates to the technical field of computer vision and natural language processing, and more particularly, to a fine-grained visual question answering method combined with a multi-view attention mechanism.

背景技术Background technique

随着计算机视觉和自然语言处理的快速发展，视觉问答系统成为人工智能越来越热门的研究领域之一。视觉问答技术是一项新兴课题，其任务是结合计算机视觉和自然语言处理两个学科领域，把给定的图像和与图像相关的自然语言问题作为输入，生成一个自然语言答案作为输出。视觉问答是人工智能领域重点的应用方向，通过模拟真实世界的情景，视觉问答可以帮助存在视觉障碍的用户进行实时的人机交互。With the rapid development of computer vision and natural language processing, visual question answering system has become one of the more and more popular research fields of artificial intelligence. Visual question answering technology is an emerging topic, and its task is to combine the two disciplines of computer vision and natural language processing, taking a given image and a natural language question related to the image as input, and generating a natural language answer as output. Visual question answering is a key application direction in the field of artificial intelligence. By simulating real-world scenarios, visual question answering can help users with visual impairments to perform real-time human-computer interaction.

本质上，视觉问答系统被视为是一个分类任务，常用做法是根据已知的图片和问题抽取图片和问题特征，然后通过融合图片特征和问题特征来进行分类获取问答结果。近年来，视觉问答在计算机视觉和自然语言处理领域引起了广泛的关注。由于视觉问答的相对复杂性以及对图像和文本处理的需求，现有的一些方法在准确率上面还有一定的欠缺，还面临着较大的挑战。In essence, the visual question answering system is regarded as a classification task. The common practice is to extract picture and question features based on known pictures and questions, and then classify and obtain question and answer results by fusing picture features and question features. In recent years, visual question answering has attracted extensive attention in the fields of computer vision and natural language processing. Due to the relative complexity of visual question answering and the need for image and text processing, some existing methods still have a certain lack of accuracy and face greater challenges.

在实际应用中，视觉问答系统常常面临着图像的高维度和噪声影响，这些噪声会影响算法对答案的预测。因此，有效的视觉问答模型能需要挖掘到图像中与问题语义一致的结构化特征及语义相关性部分进行细粒度预测。In practical applications, visual question answering systems often face the high dimensionality of images and the effects of noise, which can affect the algorithm's prediction of answers. Therefore, an effective visual question answering model can mine the structural features and semantically related parts of the image that are semantically consistent with the question for fine-grained prediction.

视觉注意力模型是利用计算机模拟人类视觉注意力机制得到一幅图像中最容易引起人们注意的部分，即图像的显著区域。在视觉问答中，大多数使用单一注意力机制模型的方法常常忽略图像结构化语义的区别，并对图像中存在多个重要区域的情况表现有些不足，从而这类方法带来的注意力机制不可避免的会影响视觉问答准确性。The visual attention model is to use the computer to simulate the human visual attention mechanism to obtain the part of an image that is most likely to attract people's attention, that is, the salient area of the image. In visual question answering, most of the methods using a single attention mechanism model often ignore the difference of the structural semantics of the image, and are somewhat insufficient for the situation where there are multiple important regions in the image, so the attention mechanism brought by such methods cannot be used. Avoided will affect visual question answering accuracy.

研究发现，当前已有大多数视觉问答方法都是通过问题和整张图片去预测问题语义答案，却没有考虑到问题具体语义的导向作用，因此，这些模型学习到的图像区域特征与问题特征在语义空间上的关联性较弱。The study found that most of the existing visual question answering methods predict the semantic answer of the question through the question and the whole picture, but do not take into account the guiding role of the specific semantics of the question. Therefore, the image region features and question features learned by these models are in the The correlation in the semantic space is weak.

综上，现有技术中，有效的视觉问答方法仍然有待改善。To sum up, in the prior art, effective visual question answering methods still need to be improved.

发明内容SUMMARY OF THE INVENTION

本发明的目的在于克服现有技术的不足，提供一种结合多视角注意力机制的细粒度视觉问答方法，可以有效提高视觉语义信息提取的准确度和全面性，并降低冗余数据和噪音数据的影响，从而提升视觉问答系统的细粒度识别能力与对复杂问题的判断，并一定程度上提升视觉问答系统的准确率与模型的可解释性。The purpose of the present invention is to overcome the deficiencies of the prior art and provide a fine-grained visual question answering method combined with a multi-view attention mechanism, which can effectively improve the accuracy and comprehensiveness of visual semantic information extraction, and reduce redundant data and noise data. Therefore, the fine-grained recognition ability of the visual question answering system and the judgment of complex problems are improved, and the accuracy of the visual question answering system and the interpretability of the model are improved to a certain extent.

本发明的技术方案如下：The technical scheme of the present invention is as follows:

一种结合多视角注意力机制的细粒度视觉问答方法，步骤如下：A fine-grained visual question answering method combined with multi-view attention mechanism, the steps are as follows:

1)输入图像，提取图像特征；输入问题文本，提取问题特征；1) Input image, extract image features; input question text, extract question features;

2)将图像特征、问题特征输入多视角注意力模型，计算图像的注意力权重，通过注意力权重对步骤1)的图像特征进行加权运算，得到图像细粒度特征；2) Input the image features and question features into the multi-view attention model, calculate the attention weight of the image, and perform a weighted operation on the image features of step 1) through the attention weight to obtain fine-grained image features;

3)将图像细粒度特征与问题特征进行融合，得到融合特征；3) Fusion of image fine-grained features and problem features to obtain fusion features;

4)将融合特征输入分类器，预测得到答案。4) Input the fused features into the classifier, and predict the answer.

作为优选，所述的多视角注意力模型包括上层注意力模型、下层注意力模型，通过上层注意力模型获得单一注意力权重，通过下层注意力模型获得显著性注意力权重，显著性注意力权重体现图像中的不同目标区域对应不同的注意力资源。Preferably, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model, the single-attention weight is obtained through the upper-layer attention model, the salient attention weight is obtained through the lower-layer attention model, and the salient attention weight is obtained through the lower-layer attention model. It reflects that different target regions in the image correspond to different attention resources.

作为优选，获得单一注意力权重的方法如下：As a preference, the method for obtaining a single attention weight is as follows:

输入图像特征、问题特征至上层注意力模型，分别使用一层全连接层将图像特征、问题特征的数据投影到相同维度空间，利用激活函数ReLu规范化向量；然后利用哈达码乘积融合，再依次输入两层全连接层进行处理学习参数，处理学习后的参数

最后使用softmax函数归一化权值，得到单一注意力权重 Input image features and question features to the upper attention model, use a fully connected layer to project the data of image features and question features to the same dimensional space, and use the activation function ReLu to normalize the vector; then use Hada code product fusion, and then input in turn The two-layer fully connected layer processes the learned parameters and processes the learned parameters

Finally, use the softmax function to normalize the weights to get a single attention weight

其中，

为图像特征，

为问题特征，

为上层注意力模型待学习的权重参数，K为图像特征的空间区域个数，T为选取的问题特征长度，d为网络层中隐藏神经元的个数，h为该层设置的输出维度，ReLu是神经网络中的激活函数，其具体形式可以表达为f(x)＝max(0,x)。in,

is the image feature,

is the problem feature,

is the weight parameter to be learned by the upper attention model, K is the number of spatial regions of image features, T is the length of the selected problem feature, d is the number of hidden neurons in the network layer, h is the output dimension set by this layer, ReLu is an activation function in a neural network, and its specific form can be expressed as f(x)=max(0,x).

作为优选，获得显著性注意力权重的方法如下：Preferably, the method for obtaining salient attention weights is as follows:

输入图像特征、问题特征至下层注意力模型，分别使用一层全连接层将图像特征、问题特征的数据投影到相同维度空间，再计算关联矩阵C_i＝ReLu(q_i ^TW_bV_i)；其中，

为下层注意力模型待学习的权重参数，

为获取的关联矩阵；Input image features and question features to the lower attention model, use a fully connected layer to project the data of image features and question features to the same dimensional space, and then calculate the correlation matrix C _i =ReLu(q _i ^T W _b V _i ) ;in,

is the weight parameter to be learned by the lower attention model,

is the obtained correlation matrix;

将关联矩阵作为特征与问题特征相乘，并与输入的图像特征融合，融合后的参数为

最后使用softmax函数归一化权值，输出显著性注意力权重

Multiply the correlation matrix as a feature with the problem feature, and fuse it with the input image feature. The fused parameter is

Finally, use the softmax function to normalize the weights and output the saliency attention weights

其中，

为下层注意力模型待学习的权重参数。in,

Weight parameters to be learned for the lower attention model.

作为优选，基于单一注意力权重、显著性注意力权重计算图像的注意力权重，具体如下：Preferably, the attention weight of the image is calculated based on the single attention weight and the salient attention weight, as follows:

其中，β₁和β₂为上层注意力模型、下层注意力模型的权重比超参数。Among them, β ₁ and β ₂ are the weight ratio hyperparameters of the upper-layer attention model and the lower-layer attention model.

作为优选，步骤3)中，将图像细粒度特征与问题特征分别通过非线性层f_v、f_q，在非线性层f_v、f_q中利用激活函数ReLu规范化向量；然后利用哈达码乘积融合，得到融合特征

Preferably, in step 3), the fine-grained features of the image and the problem features are passed through the nonlinear layers f _v and f _q respectively, and the activation function ReLu is used in the nonlinear layers f _v and f _q to normalize the vectors; then the Hada code product fusion is used , get the fusion feature

作为优选，步骤4)中，融合特征通过非线性层f_o，在通过非线性层f_o中利用激活函数ReLu规范化向量；然后使用线性映射w_o来预测答案的候选得分

最后，选取得分更高的输出；Preferably, in step 4), the fusion feature passes through the nonlinear layer f _o , and the activation function ReLu is used to normalize the vector in the pass through the nonlinear layer f _o ; then the linear mapping w _o is used to predict the candidate score of the answer

Finally, select the output with a higher score;

其中，σ是sigmoid激活函数，w_o是待学习的权重参数。where σ is the sigmoid activation function and _wo is the weight parameter to be learned.

作为优选，sigmoid激活函数将最终得分规范化为(0-1)区间，最后一个阶段作为预测每个候选答案正确性的逻辑回归，其目标函数为

As an option, the sigmoid activation function normalizes the final score to the (0-1) interval, and the last stage acts as a logistic regression that predicts the correctness of each candidate answer with an objective function of

其中，z和k分别覆盖M个训练问题的N个候选答案，s_zk是问题的真实答案。where z and k cover N candidate answers for M training questions, respectively, and s _zk is the real answer to the question.

作为优选，步骤1)中，使用Faster-RCNN标准模型对输入的图像I_i进行特征提取，得到深度表达的图像特征V_i＝FasterRCNN(I_i)。Preferably, in step 1), the Faster-RCNN standard model is used to perform feature extraction on the input image I _i to obtain a deeply expressed image feature V _i =FasterRCNN(I _i ).

作为优选，步骤1)中，输入问题文本Q_i，先使用空格和标点将问题文本Q_i分成单词，再通过预训练的GloVe词嵌入方法进行初始化，得到编码后的第i个指定问题句子

其中，x_t ⁽ⁱ⁾表示每个单词在词汇表中的位第t个单词；Preferably, in step 1), input the question text Q _i , first use spaces and punctuation to divide the question text Q _i into words, and then initialize by the pre-trained GloVe word embedding method to obtain the ith specified question sentence after encoding

Among them, x _t ⁽ⁱ⁾ represents the t-th word of each word in the vocabulary;

然后，将

输入到LSTM网络中，取出最后一层的输出q_i作为

的特征表达，得到问题特征q_i。followed by

Input into the _LSTM network, take out the output qi of the last layer as

The feature expression of , the problem feature _qi is obtained.

本发明的有益效果如下：The beneficial effects of the present invention are as follows:

本发明所述的结合多视角注意力机制的细粒度视觉问答方法，提出一种多视角注意力模型，能够有效选择出与当前任务目标(问题)相关的多个显著目标区域，提取出问题语义引导下的图像中的区域显著性特征，具有更细粒度的特征表达，并对图像中存在多个重要语义表达区域的情况表现，具有较强的刻画能力。The fine-grained visual question answering method combined with the multi-view attention mechanism described in the present invention proposes a multi-view attention model, which can effectively select multiple salient target areas related to the current task target (question), and extract the question semantics. The regional saliency feature in the guided image has more fine-grained feature expression, and has a strong ability to describe the situation that there are multiple important semantic expression regions in the image.

本发明充分考虑到问题具体语义的导向作用，从多个视角来学习获取图像和问题文本中与答案有关的区域信息，增加了多视角注意力模型的有效性和全面性，从而有效加强图像区域显著特征和问题特征的语义关联性，以提升视觉问答的语义理解的准确性和全面性。The invention fully considers the guiding role of the specific semantics of the question, learns and obtains the area information related to the answer in the image and the question text from multiple perspectives, increases the effectiveness and comprehensiveness of the multi-perspective attention model, and effectively strengthens the image area. Semantic correlation of salient features and question features to improve the accuracy and comprehensiveness of semantic understanding for visual question answering.

采用本发明所述的方法进行视觉问答任务，步骤简单、效率高、准确率高，完全可以用于商业，市场前景较佳。Using the method of the present invention to perform a visual question and answer task has the advantages of simple steps, high efficiency and high accuracy, and can be fully used in business and has a good market prospect.

附图说明Description of drawings

图1是本发明的流程示意图；Fig. 1 is the schematic flow sheet of the present invention;

图2是多视角注意力模型的示意图；Figure 2 is a schematic diagram of a multi-view attention model;

图3是注意力权重可视化热力图(简单的注意任务)；Figure 3 is a heatmap for visualizing attention weights (a simple attention task);

图4是注意力权重可视化热力图(任务需要高度集中于图像中的多个位置)；Figure 4 is a heatmap for visualizing attention weights (tasks need to be highly focused on multiple locations in the image);

图5是本发明的多视角注意力模型取得的结果与目前较为先进的方法的对比曲线图；Fig. 5 is the comparative graph of the result obtained by the multi-view attention model of the present invention and the more advanced method at present;

图6是最终模型性能培训的损失函数的曲线图；Figure 6 is a graph of the loss function for final model performance training;

图7是最终模型性能培训的培训验证分数的曲线图。Figure 7 is a graph of training validation scores for final model performance training.

具体实施方式Detailed ways

以下结合附图及实施例对本发明进行进一步的详细说明。The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

本发明为了解决现有技术存在的不足，提供一种结合多视角注意力机制的细粒度视觉问答方法。视觉问答可以被视为一种多任务分类问题，每个答案就可以看作为一种分类种类。一般视觉问答系统中，使用One-Hot方法对答案进行编码，来获得每个答案对应的One-Hot向量，构成答案向量表。One-Hot编码是分类变量作为二进制向量的表示。这首先要求将分类值映射到整数值，然后每个整数值被表示为二进制向量，除了整数的索引之外，它都是零值，它被标记为1。In order to solve the shortcomings of the prior art, the present invention provides a fine-grained visual question answering method combined with a multi-view attention mechanism. Visual question answering can be regarded as a multi-task classification problem, and each answer can be regarded as a classification category. In general visual question answering systems, the One-Hot method is used to encode the answer to obtain the One-Hot vector corresponding to each answer, forming an answer vector table. One-Hot encoding is the representation of categorical variables as binary vectors. This first requires mapping categorical values to integer values, then each integer value is represented as a binary vector, which is zero except for the integer's index, which is marked as 1.

如图1所示，本发明所述的结合多视角注意力机制的细粒度视觉问答方法，大致步骤如下：As shown in Figure 1, the fine-grained visual question answering method combined with the multi-view attention mechanism described in the present invention has the following steps:

本实施例中，步骤1)中，使用Faster-RCNN标准模型对输入的图像I_i进行特征提取，得到图像特征V_i＝FasterRCNN(I_i)。令K为图像特征的空间区域个数，则图像特征

可以进一步表示成其中，

是Faster-RCNN提取的第k个区域特征，d为网络层中隐藏神经元的个数并同时表示输出维度。In this embodiment, in step 1), the Faster-RCNN standard model is used to perform feature extraction on the input image I _i to obtain image features V _i =FasterRCNN(I _i ). Let K be the number of spatial regions of the image feature, then the image feature

can be further expressed as in,

is the k-th regional feature extracted by Faster-RCNN, and d is the number of hidden neurons in the network layer and represents the output dimension at the same time.

步骤1)中，输入问题文本Q_i后，先使用空格和标点将问题文本Q_i分成单词，再通过预训练的GloVe词嵌入方法(Global Vectors for Word Representation)进行初始化，得到第i个指定问题句子

的编码形式，其中，x_t ⁽ⁱ⁾表示每个单词在词汇表中的位第t个单词；In step 1), after entering the question text Q _i , first use spaces and punctuation to divide the question text Q _i into words, and then initialize it through the pre-trained GloVe word embedding method (Global Vectors for Word Representation) to obtain the i-th specified question. sentence

The encoded form of , where x _t ⁽ⁱ⁾ represents the t-th word of each word in the vocabulary;

然后，将

输入到LSTM网络中，具体地，使用含有1280个隐藏单元的标准LSTM网络，取出最后一层的输出q_i作为

的特征表达，得到问题特征q_i。followed by

Input into the LSTM network, specifically, use a standard LSTM network with 1280 hidden units, and take out the output _qi of the last layer as

The feature expression of , the problem feature _qi is obtained.

然后，针对获取的图像特征V_i以及编码的问题特征q_i，将两种特征输入多视角注意力模型，计算图像的注意力权重。Then, for the acquired image feature V _i and the encoded question feature qi _, the two kinds of features are input into the multi-view attention model, and the attention weight of the image is calculated.

视觉注意力机制从本质上可以从图像中选择出对当前任务目标更关键的目标区域，从而对这一区域投入更多注意力资源，以获取更多所需要关注目标的细节信息，抑制其他无用信息。在视觉问答任务中，语义表达具有多样性。特别地，有一些问题需要模型理解图像中多个目标对象之间的语义表达。因此，单一视觉注意力模型不能够有效挖掘图像中不同语义对象与问题语义之间的关联性；The visual attention mechanism can essentially select the target area that is more critical to the current task goal from the image, so as to invest more attention resources in this area to obtain more detailed information of the target that needs attention, and suppress other useless information. information. In visual question answering tasks, semantic representations are diverse. In particular, there are some problems that require the model to understand the semantic representation between multiple target objects in an image. Therefore, the single visual attention model cannot effectively mine the correlation between different semantic objects and question semantics in the image;

为了解决这个问题，本发明提供了一种多视角注意力模型，使用两种不同的注意力机制进行联合学习问题中可以关注到的不同语义的重要区域部分，以获得图像的细粒度注意力特征图。使用多视角注意力模型关注图像得到的图像注意力权重，并利用该权重进行图像特征加权得到累积向量作为最终的图像特征表示，即图像细粒度特征，能够较好地与问题语义进行关联。In order to solve this problem, the present invention provides a multi-view attention model, which uses two different attention mechanisms to jointly learn important regions with different semantics that can be paid attention to in the problem, so as to obtain fine-grained attention features of images picture. Using the multi-view attention model to pay attention to the image attention weight obtained from the image, and using the weight to weight the image features to obtain the cumulative vector as the final image feature representation, that is, the fine-grained image feature, which can be better associated with the semantics of the question.

如图2所示，所述的多视角注意力模型包括上层注意力模型、下层注意力模型，通过上层注意力模型获得单一注意力权重，通过下层注意力模型获得显著性注意力权重，显著性注意力权重体现图像中的不同目标区域对应不同的注意力资源。As shown in Figure 2, the multi-view attention model includes an upper-layer attention model and a lower-layer attention model. The single attention weight is obtained through the upper-layer attention model, and the salient attention weight is obtained through the lower-layer attention model. The attention weight reflects that different target regions in the image correspond to different attention resources.

具体地，上层注意力模型中，获得单一注意力权重的方法如下：Specifically, in the upper-layer attention model, the method for obtaining a single attention weight is as follows:

输入图像特征、问题特征至上层注意力模型，分别使用一层全连接层将图像特征、问题特征的数据投影到相同维度空间，利用激活函数ReLu规范化向量；然后利用哈达码乘积(Hardmard product)融合，再依次输入两层全连接层进行处理学习参数，处理学习后的参数其中，

为图像特征，

为问题特征，

为上层注意力模型待学习的权重参数，K为图像特征的空间区域个数，T为选取的问题特征长度，d为网络层中隐藏神经元的个数，h为该层设置的输出维度，ReLu是神经网络中的激活函数，其具体形式可以表达为f(x)＝max(0,x)；Input image features and question features to the upper attention model, use a fully connected layer to project the data of image features and question features to the same dimensional space, and use the activation function ReLu to normalize the vector; then use the Hardmard product (Hardmard product) fusion , and then input the two fully connected layers in turn to process the learning parameters, and process the learned parameters in,

is the image feature,

is the problem feature,

is the weight parameter to be learned by the upper attention model, K is the number of spatial regions of image features, T is the length of the selected problem feature, d is the number of hidden neurons in the network layer, h is the output dimension set by this layer, ReLu is the activation function in the neural network, and its specific form can be expressed as f(x)=max(0,x);

最后使用softmax函数归一化权值，得到单一注意力权重

考虑单一注意力权重

为softmax权值，如果部分权值数值较大，其余部分势必权值较小。由于一幅图像常常含有多个不同语义，并且这些语义常常在不同区域进行视觉语义表达。单一注意力权重

常常会忽略一些具有重要语义的区域信息。为补充上层注意力模型的注意信息的缺失部分，本发明进一步提出了下层注意力模型。下层注意力模型同时兼顾图像与问题语义的关联性，达到问题引导多视角注意力模型的学习机制，增加特征细粒度挖掘能力，本发明通过计算图像与问题特征在语义的相似性来引导图像区域的注意力的学习。Consider a single attention weight

For softmax weights, if some of the weights have larger values, the rest will inevitably have smaller weights. Since an image often contains multiple different semantics, and these semantics are often expressed in different regions of the visual semantics. single attention weight

Some region information with important semantics is often ignored. To supplement the missing part of the attention information of the upper-layer attention model, the present invention further proposes a lower-layer attention model. The lower-level attention model also takes into account the relevance of the image and the question semantics, achieves the learning mechanism of the question-guided multi-view attention model, and increases the feature fine-grained mining ability. The present invention guides the image area by calculating the semantic similarity between the image and the question feature. attention learning.

具体地，下层注意力模型中，获得显著性注意力权重的方法如下：Specifically, in the lower-layer attention model, the method of obtaining the salient attention weight is as follows:

为下层注意力模型待学习的权重参数，

is the weight parameter to be learned by the lower attention model,

is the obtained correlation matrix;

最后使用softmax函数归一化权值，输出显著性注意力权重

其中，

为下层注意力模型待学习的权重参数，参数维度设置通上层注意力模型一致，K为图像特征的空间区域个数，T为选取的问题特征长度，d为网络层中隐藏神经元的个数，h为该层的输出维度，ReLu是神经网络中的激活函数。Multiply the correlation matrix as a feature with the problem feature, and fuse it with the input image feature. The fused parameter is

in,

is the weight parameter to be learned by the lower-layer attention model. The parameter dimension is set to be consistent with the upper-layer attention model. K is the number of spatial regions of image features, T is the length of the selected problem feature, and d is the number of hidden neurons in the network layer. , h is the output dimension of this layer, and ReLu is the activation function in the neural network.

基于单一注意力权重、显著性注意力权重计算图像的注意力权重，具体如下：The attention weight of the image is calculated based on the single attention weight and the salient attention weight, as follows:

其中，β₁和β₂为上层注意力模型、下层注意力模型的权重比参数。实际应用中，可以通过调试参数来分配上层注意力模型、下层注意力模型之间的权重，以达到更好的效果。Among them, β ₁ and β ₂ are the weight ratio parameters of the upper-layer attention model and the lower-layer attention model. In practical applications, the weights between the upper-layer attention model and the lower-layer attention model can be allocated by debugging parameters to achieve better results.

图像特征V_i可以进一步表示K个图像空间区域特征的集合形式进一步，将注意力权重a_i与每一个空间区域图像特征相乘加权，从而得到图像细粒度特征

The image feature V _i can further represent the set form of K image space region features Further, the attention weight a _i is multiplied and weighted by the image features of each spatial region to obtain fine-grained features of the image.

步骤3)中，将图像细粒度特征与问题特征分别通过非线性层f_v、f_q，在非线性层f_v、f_q中利用激活函数ReLu规范化向量；然后利用哈达码乘积融合，得到融合特征

In step 3), the fine-grained features of the image and the problem features are passed through the nonlinear layers f _v and f _q respectively, and the activation function ReLu is used to normalize the vectors in the nonlinear layers f _v and f _q ; feature

进一步地，视觉问答问题是为一个多标签分类问题，进而，步骤4)中，融合特征通过非线性层f_o，在通过非线性层f_o中利用激活函数ReLu规范化向量；然后使用线性映射w_o来预测答案的候选得分

最后，选取得分更高的输出；Further, the visual question answering problem is a multi-label classification problem, and further, in step 4), the fusion features pass through the nonlinear layer f _o , and the activation function ReLu is used to normalize the vector in the passing through the nonlinear layer f _o ; and then use the linear mapping w _o to predict the candidate score of the answer

Finally, select the output with a higher score;

其中，指数z和k分别覆盖M个训练问题的N个候选答案，s_zk是问题的真实答案。where the indices z and k cover the N candidate answers for the M training questions, respectively, and s _zk is the true answer to the question.

与其他常用的视觉问答利用softmax分类器相比，本发明利用的逻辑回归分类更加有效。Sigmoid函数使用软分数(soft target)作为目标结果，提供了更加丰富的训练信号，可以有效捕捉真实答案中偶尔出现的不确定性。Compared with other common visual question answering using softmax classifier, the logistic regression classification used in the present invention is more effective. The sigmoid function uses a soft score (soft target) as the target result, which provides a richer training signal and can effectively capture the occasional uncertainty in the real answer.

为了更好的观察注意力模型如何关注到图像的显著区域部分，在获得单一注意力权重、显著性注意力权重的attention map(a^u，a^b)后，使用python中的matplotlib画图库的heatmap函数将注意力map可视化为矩阵热图(matrix heatmap)，如图3、图4所示。In order to better observe how the attention model pays attention to the salient area of the image, after obtaining the attention map (a ^u , a ^b ) of the single attention weight and the salient attention weight, use matplotlib in python to draw the heatmap of the library The function visualizes the attention map as a matrix heatmap, as shown in Figure 3 and Figure 4.

图3、图4是多视角注意力模型分别对2张不同任务图像的上层注意力模型、下层注意力模型的表现图，attention1是上层注意力模型的注意力可视图，attention2是下层注意力模型的注意力可视图。从注意力的热度图可以看出，添加的下层注意力模型能够学习输入图像的不同重要区域。从图3可以看出，对于一个简单的注意任务，上层注意力模型、下层注意力模型都能够在图像中找到正确的位置。然而，在图4中，可以看到，当任务需要高度集中于图像中的多个位置时，下层注意力模型关注到了与上层注意力模型不同的部分，从而提高了多视角注意力模型的准确性，本发明的多视角注意力模型比现有技术的模型具有优势。Figure 3 and Figure 4 are the performance diagrams of the upper-layer attention model and the lower-layer attention model of the multi-view attention model for 2 different task images respectively, attention1 is the attention visual view of the upper-layer attention model, and attention2 is the lower-layer attention model. attention visuals. From the heat map of attention, it can be seen that the added lower-level attention model is able to learn different important regions of the input image. As can be seen from Figure 3, for a simple attention task, both the upper-layer attention model and the lower-layer attention model can find the correct position in the image. However, in Figure 4, it can be seen that when the task needs to be highly focused on multiple locations in the image, the lower-layer attention model pays attention to different parts from the upper-layer attention model, thereby improving the accuracy of the multi-view attention model Therefore, the multi-view attention model of the present invention has advantages over the prior art models.

测试数据集介绍：VQA v2数据集(Antol S,Agrawal A,Lu J,et al.Vqa:Visualquestion answering[C].Proceedings of the IEEE International Conference onComputer Vision.2015:2425-2433.)是一个大规模的视觉问答数据集，这个数据集中所有的问题和答案都由人为注释。在数据集中有443,757个训练问题，214,354个验证问题和447,793个测试问题。每张图像和三个问题相关，对于每个问题，注释者提供了十个答案。在标准视觉问答任务重，这个数据集中的问题常常分为：Yes/no,Number和other三种类型的问题。Test dataset introduction: The VQA v2 dataset (Antol S, Agrawal A, Lu J, et al. Vqa:Visualquestion answering[C].Proceedings of the IEEE International Conference onComputer Vision.2015:2425-2433.) is a large-scale The visual question answering dataset in which all questions and answers are annotated by humans. There are 443,757 training questions, 214,354 validation questions and 447,793 test questions in the dataset. Each image is associated with three questions, for each question the annotators provide ten answers. In the standard visual question answering task, the questions in this dataset are often divided into three types: Yes/no, Number and other.

进一步，为验证本发明的有效性，将本发明与2017VQA挑战赛冠军(Anderson P,HeX,Buehler C,et al.Bottom-up and top-down attention for image captioning andvisual question answering.arXiv preprint arXiv:1707.07998,2017.)的结果做了对比，如图5所示，本发明在复现的论文代码基础上，将本发明的多视角注意力模型替换原来的简单注意力模型，本发明的多视角注意力模型最终评分为64.35％，在准确率的评价与所述论文相比，具有约1.2％的提升。Further, in order to verify the validity of the present invention, the present invention was compared with the champion of the 2017 VQA Challenge (Anderson P, HeX, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint arXiv:1707.07998 , 2017.) The results are compared, as shown in Figure 5, the present invention replaces the original simple attention model with the multi-perspective attention model of the present invention on the basis of the reproduced thesis code, and the multi-perspective attention model of the present invention The final score of the force model is 64.35%, which is an improvement of about 1.2% compared with the paper in the evaluation of accuracy.

在实验中一些基本参数设计如下，基本学习率设置为α＝0.0007，在每个LSTM层后随机失活率设置为dropout＝0.3，答案筛选设置N＝3000。全连接层的隐藏神经元设置num_hid＝1024，批训练的数量设置batch_size＝512。单一注意力权重、显著性注意力权重的权重分配为β₁＝0.7，β₂＝0.3。Some basic parameters in the experiment are designed as follows, the basic learning rate is set to α=0.0007, the random dropout rate after each LSTM layer is set to dropout=0.3, and the answer screening is set to N=3000. The hidden neurons of the fully connected layer are set to num_hid=1024, and the number of batch training is set to batch_size=512. The weight assignments of the single attention weight and the salient attention weight are β ₁ =0.7, β ₂ =0.3.

如图6所示，模型的损失函数值(loss)随着训练周期的增加不断减小收敛的过程；如图7所示，模型准确率随着训练周期的增长分别在训练集与测试集上的表现。As shown in Figure 6, the loss function value (loss) of the model decreases with the increase of the training period and the convergence process; Performance.

本发明在test-dev情况下与VQA任务中比较有代表性的方法在公开标准数据集VQA v2上的对比，如表1所示。Table 1 shows the comparison of the representative methods of the present invention in the test-dev case and the VQA task on the open standard data set VQA v2.

表1Table 1

特别的，将数据以问题类型为标准分为3个种类进行评估，然后计算出总的评估结果。问题种类分别为Y/N是否问题，Number数量问题，Others其他开放性问题。表中的分数是模型针对不同类型的问题回答结果的准确率，数值越大准确率越高。从表中可以看出，本发明的多视角注意力模型对不同任务都达到了较好的效果。In particular, the data is divided into three categories according to the question type for evaluation, and then the total evaluation result is calculated. The types of problems are whether Y/N is a problem, the number of Number problems, and Others other open problems. The scores in the table are the accuracy of the model's answering results for different types of questions. The larger the value, the higher the accuracy. It can be seen from the table that the multi-view attention model of the present invention achieves good effects on different tasks.

特别的，由于本发明的多视角注意力模型加强细粒度的特征表达，对物体的检测，识别能力提高，在Number这个一类型的评价上较之前的方法有着不错的提升。模型总的准确率评价结果更是优于大部分现存方法的结果。In particular, since the multi-view attention model of the present invention strengthens the fine-grained feature expression, the detection and recognition ability of objects is improved, and the evaluation of the type of Number has a good improvement compared with the previous method. The overall accuracy evaluation results of the model are even better than the results of most existing methods.

上述实施例仅是用来说明本发明，而并非用作对本发明的限定。只要是依据本发明的技术实质，对上述实施例进行变化、变型等都将落在本发明的权利要求的范围内。The above-mentioned embodiments are only used to illustrate the present invention, but not to limit the present invention. As long as it is in accordance with the technical essence of the present invention, changes, modifications, etc. to the above-described embodiments will fall within the scope of the claims of the present invention.

Claims

1. A fine-grained visual question answering method in conjunction with a multi-view attention mechanism, characterized in that the steps are as follows:

1) Input image, extract image features; input question text, extract question features;

2) Input the image features and question features into the multi-view attention model, calculate the attention weight of the image, and perform a weighted operation on the image features of step 1) through the attention weight to obtain fine-grained image features;

3) Fusion of image fine-grained features and problem features to obtain fusion features;

4) Input the fused features into the classifier, and predict the answer.

2. The fine-grained visual question answering method combined with the multi-view attention mechanism according to claim 1, wherein the multi-view attention model comprises an upper-layer attention model and a lower-layer attention model, and the upper-layer attention model is The single attention weight is obtained, and the salient attention weight is obtained through the lower attention model, and the salient attention weight reflects that different target areas in the image correspond to different attention resources.

3. The fine-grained visual question answering method combining multi-view attention mechanism according to claim 2, is characterized in that, the method for obtaining single attention weight is as follows:

Input image features and question features to the upper attention model, use a fully connected layer to project the data of image features and question features to the same dimensional space, and use the activation function ReLu to normalize the vector; then use Hada code product fusion, and then input in turn The two-layer fully connected layer processes the learned parameters and processes the learned parameters

in,

is the image feature, is the problem feature,

4. The fine-grained visual question answering method combining multi-view attention mechanism according to claim 2, is characterized in that, the method for obtaining salient attention weight is as follows:

Input image features and question features to the lower attention model, use a fully connected layer to project the data of image features and question features to the same dimensional space, and then calculate the correlation matrix C _i =ReLu(q _i ^T W _b V _i ) ;in,

is the weight parameter to be learned by the lower attention model, is the obtained correlation matrix;

in,

Weight parameters to be learned for the lower attention model.

5. The fine-grained visual question answering method combined with multi-view attention mechanism according to any one of claims 2, 3, and 4, wherein the attention weight of the image is calculated based on a single attention weight and a salient attention weight ,details as follows:

Among them, β ₁ and β ₂ are the weight ratio hyperparameters of the upper-layer attention model and the lower-layer attention model.

6. The fine-grained visual question answering method combined with multi-view attention mechanism according to claim 5, is characterized in that, in step 3), the fine-grained feature of the image and the question feature are respectively passed through the nonlinear layers f _v , f _q , The activation function ReLu is used to normalize the vector in the nonlinear layers f _v , f _q ; then the Hada code product fusion is used to obtain the fusion features

7. The fine-grained visual question answering method combined with multi-view attention mechanism according to claim 6, characterized in that, in step 4), the fusion feature passes through the nonlinear layer f _o , and uses the activation in the through nonlinear layer f _o The function ReLu normalizes the vector; the linear map w _o is then used to predict the candidate score of the answer

Finally, select the output with a higher score;

where σ is the sigmoid activation function and _wo is the weight parameter to be learned.

8. The fine-grained visual question answering method combined with multi-view attention mechanism according to claim 7, characterized in that, the sigmoid activation function normalizes the final score to a (0-1) interval, and the last stage is used to predict each candidate answer Logistic regression for correctness with an objective function of

where z and k cover N candidate answers for M training questions, respectively, and s _zk is the real answer to the question.

9. the fine-grained visual question answering method in conjunction with multi-view attention mechanism according to claim 1, is characterized in that, in step 1), use Faster-RCNN standard model to carry out feature extraction to the image I _i of input, obtains deep expression The image features of _Vi = FasterRCNN(I _i ).

10. The fine-grained visual question answering method combined with multi-view attention mechanism according to claim 1, is characterized in that, in step 1), input question text Q _i , first use spaces and punctuation to divide question text Q _i into words, Then initialize it through the pre-trained GloVe word embedding method to obtain the encoded i-th specified question sentence

followed by Input into the _LSTM network, take out the output qi of the last layer as

The feature expression of , the problem feature _qi is obtained.