CN110717431A - A fine-grained visual question answering method combined with multi-view attention mechanism - Google Patents

A fine-grained visual question answering method combined with multi-view attention mechanism

Info

Publication number
CN110717431A
CN110717431A
Authority
CN
China
Prior art keywords
attention
image
features
question
fine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910927585.4A
Other languages
Chinese (zh)
Other versions
CN110717431B (en)
Inventor
彭淑娟
李磊
柳欣
范文涛
钟必能
杜吉祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaqiao University
Original Assignee
Huaqiao University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaqiao University filed Critical Huaqiao University
Priority to CN201910927585.4A priority Critical patent/CN110717431B/en
Publication of CN110717431A publication Critical patent/CN110717431A/en
Application granted granted Critical
Publication of CN110717431B publication Critical patent/CN110717431B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a fine-grained visual question answering method combining a multi-view attention mechanism. Fully accounting for the guiding role of the question's specific semantics, it proposes a multi-view attention model that can effectively select multiple salient target regions relevant to the current task objective (the question), learn from multiple views the answer-related region information in the image and the question text, and extract region-saliency features of the image under the guidance of the question semantics. This yields a finer-grained feature representation and a strong ability to characterize images containing multiple semantically important regions, increasing the effectiveness and comprehensiveness of the multi-view attention model. It thereby strengthens the semantic association between salient image-region features and question features, improving the accuracy and comprehensiveness of semantic understanding in visual question answering. Performing visual question answering with the method of the invention is simple in its steps, efficient, and accurate; it is fully usable commercially and has good market prospects.

Description

A fine-grained visual question answering method combining a multi-view attention mechanism

Technical Field

The invention relates to the technical field of computer vision and natural language processing, and more particularly to a fine-grained visual question answering method combining a multi-view attention mechanism.

Background

With the rapid development of computer vision and natural language processing, visual question answering (VQA) systems have become one of the most popular research areas in artificial intelligence. Visual question answering is an emerging topic that combines the two disciplines of computer vision and natural language processing: given an image and a natural language question about that image as input, the task is to generate a natural language answer as output. Visual question answering is a key application direction in artificial intelligence; by simulating real-world scenarios, it can help visually impaired users carry out real-time human-computer interaction.

In essence, a visual question answering system is treated as a classification task. The common practice is to extract image and question features from the given image and question, and then fuse the two kinds of features for classification to obtain the answer. In recent years, visual question answering has attracted extensive attention in the fields of computer vision and natural language processing. Owing to the relative complexity of the task and its demands on both image and text processing, existing methods still fall short in accuracy and face considerable challenges.

In practical applications, visual question answering systems often face high-dimensional images and noise, and this noise can affect the algorithm's prediction of the answer. An effective visual question answering model therefore needs to mine the structured features of the image that are semantically consistent with the question, together with their semantically related parts, for fine-grained prediction.

A visual attention model uses a computer to simulate the human visual attention mechanism and find the parts of an image most likely to attract attention, i.e., the salient regions of the image. In visual question answering, most methods that use a single attention mechanism ignore differences in the structured semantics of the image and handle poorly the case where the image contains several important regions; the attention produced by such methods therefore inevitably harms answering accuracy.

Research shows that most existing visual question answering methods predict the semantic answer from the question and the whole image, without taking into account the guiding role of the question's specific semantics. Consequently, the image-region features and question features learned by these models are only weakly associated in the semantic space.

In summary, effective visual question answering methods in the prior art still leave room for improvement.

Summary of the Invention

The purpose of the present invention is to overcome the deficiencies of the prior art by providing a fine-grained visual question answering method combining a multi-view attention mechanism. The method can effectively improve the accuracy and comprehensiveness of visual semantic information extraction and reduce the influence of redundant and noisy data, thereby improving the fine-grained recognition ability of the visual question answering system and its judgment of complex questions, and to a certain extent improving the system's accuracy and the model's interpretability.

The technical solution of the present invention is as follows:

A fine-grained visual question answering method combining a multi-view attention mechanism, with the following steps:

1) Input an image and extract image features; input a question text and extract question features;

2) Input the image features and question features into a multi-view attention model, compute the attention weight of the image, and apply the attention weight to the image features of step 1) in a weighted operation to obtain fine-grained image features;

3) Fuse the fine-grained image features with the question features to obtain fused features;

4) Input the fused features into a classifier and predict the answer.

Preferably, the multi-view attention model comprises an upper attention model and a lower attention model. A single attention weight is obtained through the upper attention model, and a saliency attention weight is obtained through the lower attention model; the saliency attention weight reflects that different target regions of the image correspond to different attention resources.

Preferably, the single attention weight is obtained as follows:

The image features and question features are input to the upper attention model. A fully connected layer projects each of them into a common dimensional space, and the activation function ReLU normalizes the vectors. The two projections are then fused by the Hadamard product and passed through two further fully connected layers, which process the inputs and yield the learned parameter

$$a_u' = W_2\,\mathrm{ReLU}\!\big(W_1\,(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$$

Finally, the softmax function normalizes the weights, giving the single attention weight

$$a_u = \mathrm{softmax}(a_u')$$

where $V_i \in \mathbb{R}^{K \times d}$ are the image features, $q_i \in \mathbb{R}^{T \times d}$ are the question features, $W_v, W_q, W_1, W_2$ are the weight parameters to be learned by the upper attention model, K is the number of spatial regions of the image features, T is the selected question feature length, d is the number of hidden neurons in the network layer, h is the output dimension set for the layer, and ReLU is the neural-network activation function, whose concrete form is f(x) = max(0, x).

Preferably, the saliency attention weight is obtained as follows:

The image features and question features are input to the lower attention model. A fully connected layer projects each of them into a common dimensional space, and the correlation matrix is computed as

$$C_i = \mathrm{ReLU}(q_i^{T} W_b V_i)$$

where $W_b$ is a weight parameter to be learned by the lower attention model and $C_i$ is the resulting correlation matrix.

The correlation matrix is then multiplied, as a feature, with the question features, and the product is fused with the input image features, giving the fused parameter $a_b'$. Finally, the softmax function normalizes the weights and outputs the saliency attention weight

$$a_b = \mathrm{softmax}(a_b')$$

where the projection matrices involved are weight parameters to be learned by the lower attention model.

Preferably, the attention weight of the image is computed from the single attention weight and the saliency attention weight, as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

where $\beta_1$ and $\beta_2$ are the weight-ratio hyperparameters of the upper and lower attention models.

Preferably, in step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the activation function ReLU normalizes the vectors; the outputs are then fused by the Hadamard product, giving the fused feature

$$h = f_v(\hat{v}_i) \circ f_q(q_i)$$

Preferably, in step 4), the fused feature is passed through a nonlinear layer $f_o$, in which the activation function ReLU normalizes the vector; a linear mapping $w_o$ then predicts the candidate score of each answer,

$$\hat{s} = \sigma(w_o\, f_o(h))$$

Finally, the output with the higher score is selected;

where σ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.

Preferably, the sigmoid activation function normalizes the final score into the interval (0, 1), and the last stage acts as a logistic regression predicting the correctness of each candidate answer, with objective function

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$$

where the indices z and k range over the M training questions and their N candidate answers respectively, and $s_{zk}$ is the true answer score of the question.

Preferably, in step 1), the Faster-RCNN standard model performs feature extraction on the input image $I_i$, giving the deeply expressed image features $V_i = \mathrm{FasterRCNN}(I_i)$.

Preferably, in step 1), the question text $Q_i$ is input; it is first split into words using spaces and punctuation and then initialized with the pretrained GloVe word-embedding method, giving the encoded i-th question sentence $X_i = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, where $x_t^{(i)}$ denotes the t-th word of the sentence in the vocabulary;

$X_i$ is then fed into an LSTM network, and the output $q_i$ of the last layer is taken as the feature representation of $X_i$, giving the question feature $q_i$.

The beneficial effects of the present invention are as follows:

The fine-grained visual question answering method combining a multi-view attention mechanism according to the present invention proposes a multi-view attention model that can effectively select multiple salient target regions relevant to the current task objective (the question) and extract region-saliency features of the image under the guidance of the question semantics. It yields a finer-grained feature representation and has a strong ability to characterize images containing multiple semantically important regions.

The invention fully accounts for the guiding role of the question's specific semantics and learns from multiple views the answer-related region information in the image and the question text, increasing the effectiveness and comprehensiveness of the multi-view attention model. This effectively strengthens the semantic association between salient image-region features and question features, improving the accuracy and comprehensiveness of semantic understanding in visual question answering.

Performing visual question answering with the method of the present invention is simple in its steps, efficient, and accurate; it is fully usable commercially and has good market prospects.

Brief Description of the Drawings

Fig. 1 is a schematic flow chart of the present invention;

Fig. 2 is a schematic diagram of the multi-view attention model;

Fig. 3 is a heat map visualizing attention weights (a simple attention task);

Fig. 4 is a heat map visualizing attention weights (a task requiring high concentration on multiple locations in the image);

Fig. 5 is a graph comparing the results of the multi-view attention model of the present invention with a state-of-the-art method;

Fig. 6 is a graph of the loss function during final model training;

Fig. 7 is a graph of the training and validation scores during final model training.

Detailed Description

The present invention is further described in detail below with reference to the accompanying drawings and embodiments.

To remedy the deficiencies of the prior art, the present invention provides a fine-grained visual question answering method combining a multi-view attention mechanism. Visual question answering can be viewed as a multi-task classification problem in which each answer is treated as a class. In a typical visual question answering system, the answers are encoded with the One-Hot method to obtain the One-Hot vector corresponding to each answer, forming an answer vector table. One-Hot encoding represents a categorical variable as a binary vector: the categorical values are first mapped to integer values, and each integer value is then represented as a binary vector that is zero everywhere except at the integer's index, which is set to 1.
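As a minimal, hedged illustration of this encoding (the answer strings and the helper name are hypothetical, not taken from the patent), a numpy sketch:

```python
import numpy as np

def one_hot_encode(answers):
    """Map each distinct answer string to an integer index, then represent
    each answer as a binary vector that is 1 at its index and 0 elsewhere."""
    vocab = {a: i for i, a in enumerate(sorted(set(answers)))}
    table = np.zeros((len(answers), len(vocab)), dtype=np.float32)
    for row, answer in enumerate(answers):
        table[row, vocab[answer]] = 1.0
    return table, vocab

# e.g. three candidate answers -> three 3-dimensional one-hot vectors
table, vocab = one_hot_encode(["yes", "no", "2"])
print(vocab)    # {'2': 0, 'no': 1, 'yes': 2}
print(table)    # one 1 per row, at the row's answer index
```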

As shown in Fig. 1, the fine-grained visual question answering method combining a multi-view attention mechanism according to the present invention proceeds roughly as follows:

1) Input an image and extract image features; input a question text and extract question features;

2) Input the image features and question features into a multi-view attention model, compute the attention weight of the image, and apply the attention weight to the image features of step 1) in a weighted operation to obtain fine-grained image features;

3) Fuse the fine-grained image features with the question features to obtain fused features;

4) Input the fused features into a classifier and predict the answer.
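A minimal end-to-end sketch of these four steps in PyTorch follows; the module names (`attention.upper`, `attention.lower`, `fuse`, `classifier`) and the β values are illustrative assumptions, not the patent's reference implementation.

```python
import torch

def vqa_forward(region_feats, question_feat, attention, fuse, classifier,
                beta1=0.7, beta2=0.3):
    """region_feats: (K, d) Faster-RCNN region features (step 1, image side).
    question_feat: LSTM encoding of the question (step 1, text side)."""
    # Step 2: multi-view attention -> one weight per image region
    a_u = attention.upper(region_feats, question_feat)   # (K,)
    a_b = attention.lower(region_feats, question_feat)   # (K,)
    a = beta1 * a_u + beta2 * a_b
    v_hat = (a.unsqueeze(1) * region_feats).sum(dim=0)   # fine-grained feature
    # Step 3: fuse image and question features
    h = fuse(v_hat, question_feat)
    # Step 4: sigmoid scores over the candidate answers
    return torch.sigmoid(classifier(h))
```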

In this embodiment, in step 1), the Faster-RCNN standard model performs feature extraction on the input image $I_i$, giving the image features $V_i = \mathrm{FasterRCNN}(I_i)$. Let K be the number of spatial regions of the image features; the image features $V_i \in \mathbb{R}^{K \times d}$ can then be further written as $V_i = \{v_1, \dots, v_K\}$, where $v_k \in \mathbb{R}^d$ is the k-th region feature extracted by Faster-RCNN and d is the number of hidden neurons in the network layer, which also serves as the output dimension.
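A hedged sketch of the region-feature interface is below; torchvision's detector stands in for the patent's Faster-RCNN, and K = 36 with pooled d-dimensional vectors are assumptions borrowed from the bottom-up-attention literature rather than values stated here.

```python
import torch
import torchvision

# Stand-in detector; the patent only specifies "the Faster-RCNN standard model".
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)             # placeholder RGB image I_i in [0, 1]
with torch.no_grad():
    detections = model([image])[0]           # dict with boxes / labels / scores

# Keep the top-K regions; in the VQA setting each kept box is additionally
# associated with a pooled d-dimensional feature v_k, so V_i = {v_1, ..., v_K}
# has shape (K, d).
K = 36
boxes = detections["boxes"][:K]
```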

In step 1), after the question text $Q_i$ is input, it is first split into words using spaces and punctuation and then initialized with the pretrained GloVe word-embedding method (Global Vectors for Word Representation), giving the encoded form of the i-th question sentence $X_i = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, where $x_t^{(i)}$ denotes the t-th word of the sentence in the vocabulary;

$X_i$ is then fed into an LSTM network; specifically, a standard LSTM network with 1280 hidden units is used, and the output $q_i$ of the last layer is taken as the feature representation of $X_i$, giving the question feature $q_i$.
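A minimal sketch of this question branch, assuming a pre-built GloVe matrix (`glove_matrix`, shape vocab × embed_dim, an assumption here); the 1280 hidden units follow the text.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """GloVe-initialized embedding followed by an LSTM; the last hidden
    state is taken as the question feature q_i."""
    def __init__(self, glove_matrix, hidden=1280):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(
            torch.as_tensor(glove_matrix, dtype=torch.float32), freeze=False)
        self.lstm = nn.LSTM(self.embed.embedding_dim, hidden, batch_first=True)

    def forward(self, token_ids):            # (batch, T) word indices
        x = self.embed(token_ids)             # (batch, T, embed_dim)
        _, (h_n, _) = self.lstm(x)             # h_n: (num_layers, batch, hidden)
        return h_n[-1]                         # q_i: (batch, hidden)
```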

Then, the acquired image features $V_i$ and the encoded question features $q_i$ are both input into the multi-view attention model, and the attention weight of the image is computed.

A visual attention mechanism essentially selects from the image the target regions most critical to the current task objective, so that more attention resources can be devoted to those regions to obtain more detail about the targets that need attention while suppressing other useless information. In visual question answering tasks, semantic expression is diverse; in particular, some questions require the model to understand the semantic expression among multiple target objects in the image. A single visual attention model therefore cannot effectively mine the association between the different semantic objects in the image and the semantics of the question.

To solve this problem, the present invention provides a multi-view attention model that uses two different attention mechanisms to jointly learn the important region parts, with different semantics, that the question can attend to, yielding a fine-grained attention feature map of the image. The image attention weights obtained by attending to the image with the multi-view attention model are used to weight the image features, and the accumulated vector serves as the final image feature representation, i.e., the fine-grained image features, which associate well with the question semantics.

As shown in Fig. 2, the multi-view attention model comprises an upper attention model and a lower attention model. A single attention weight is obtained through the upper attention model, and a saliency attention weight is obtained through the lower attention model; the saliency attention weight reflects that different target regions of the image correspond to different attention resources.

Specifically, in the upper attention model, the single attention weight is obtained as follows:

The image features and question features are input to the upper attention model. A fully connected layer projects each of them into a common dimensional space, and the activation function ReLU normalizes the vectors. The two projections are then fused by the Hadamard product and passed through two further fully connected layers, which process the inputs and yield the learned parameter

$$a_u' = W_2\,\mathrm{ReLU}\!\big(W_1\,(\mathrm{ReLU}(W_v V_i) \circ \mathrm{ReLU}(W_q q_i))\big)$$

where $V_i \in \mathbb{R}^{K \times d}$ are the image features, $q_i \in \mathbb{R}^{T \times d}$ are the question features, $W_v, W_q, W_1, W_2$ are the weight parameters to be learned by the upper attention model, K is the number of spatial regions of the image features, T is the selected question feature length, d is the number of hidden neurons in the network layer, h is the output dimension set for the layer, and ReLU is the neural-network activation function, whose concrete form is f(x) = max(0, x);

Finally, the softmax function normalizes the weights, giving the single attention weight

$$a_u = \mathrm{softmax}(a_u')$$
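A minimal PyTorch sketch of this upper branch follows; the layer widths (h) and the choice of a scalar score per region are assumptions consistent with, but not dictated by, the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpperAttention(nn.Module):
    """Project image regions and the question to a common space with ReLU,
    fuse by Hadamard product, score each region through two further fully
    connected layers, and normalize with softmax over the K regions."""
    def __init__(self, d_img, d_q, h=1024):
        super().__init__()
        self.proj_v = nn.Linear(d_img, h)
        self.proj_q = nn.Linear(d_q, h)
        self.fc1 = nn.Linear(h, h)
        self.fc2 = nn.Linear(h, 1)

    def forward(self, V, q):                   # V: (K, d_img), q: (d_q,)
        v = F.relu(self.proj_v(V))             # (K, h)
        u = F.relu(self.proj_q(q))             # (h,), broadcast over regions
        fused = v * u                          # Hadamard-product fusion
        scores = self.fc2(F.relu(self.fc1(fused))).squeeze(-1)  # a_u'
        return F.softmax(scores, dim=0)        # a_u: one weight per region
```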

Consider that the single attention weight $a_u$ is a softmax weight: if some of its entries take large values, the remaining entries are necessarily small. An image, however, often carries several distinct semantics, and these are often expressed visually in different regions, so the single attention weight $a_u$ tends to overlook region information with important semantics. To supplement the missing attention information of the upper attention model, the present invention further proposes the lower attention model. The lower attention model simultaneously accounts for the association between the image and the question semantics, realizing a question-guided learning mechanism for the multi-view attention model and increasing its fine-grained feature mining ability; the present invention guides the learning of image-region attention by computing the semantic similarity between the image and question features.

Specifically, in the lower attention model, the saliency attention weight is obtained as follows:

The image features and question features are input to the lower attention model. A fully connected layer projects each of them into a common dimensional space, and the correlation matrix is computed as

$$C_i = \mathrm{ReLU}(q_i^{T} W_b V_i)$$

where $W_b$ is a weight parameter to be learned by the lower attention model and $C_i$ is the resulting correlation matrix.

The correlation matrix is then multiplied, as a feature, with the question features, and the product is fused with the input image features, giving the fused parameter $a_b'$. Finally, the softmax function normalizes the weights and outputs the saliency attention weight

$$a_b = \mathrm{softmax}(a_b')$$

where the projection matrices involved are weight parameters to be learned by the lower attention model, with parameter dimensions set consistently with the upper attention model; K is the number of spatial regions of the image features, T is the selected question feature length, d is the number of hidden neurons in the network layer, h is the output dimension of the layer, and ReLU is the neural-network activation function.
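The sketch below implements the correlation matrix and one plausible reading of the fusion step (how $C_i$ is folded back into the image features is abbreviated in the text, so the `fused` line and the layer sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowerAttention(nn.Module):
    """Saliency branch: a correlation matrix couples the T question
    positions with the K image regions and guides region scoring."""
    def __init__(self, d_img, d_q, h=1024):
        super().__init__()
        self.proj_v = nn.Linear(d_img, h)
        self.proj_q = nn.Linear(d_q, h)
        self.score = nn.Linear(h, 1)

    def forward(self, V, Q):                   # V: (K, d_img), Q: (T, d_q)
        v = F.relu(self.proj_v(V))             # (K, h)
        q = F.relu(self.proj_q(Q))             # (T, h)
        C = F.relu(q @ v.t())                  # correlation matrix: (T, K)
        # fold the correlation back onto the regions and fuse with V
        fused = v * (C.t() @ q)                # assumed fusion, shape (K, h)
        scores = self.score(fused).squeeze(-1)     # a_b'
        return F.softmax(scores, dim=0)        # a_b
```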

The attention weight of the image is computed from the single attention weight and the saliency attention weight, as follows:

$$a_i = \beta_1 a_u + \beta_2 a_b$$

where $\beta_1$ and $\beta_2$ are the weight-ratio parameters of the upper and lower attention models. In practical applications, the weights between the two models can be tuned to achieve better results.

The image features $V_i$ can further be written as the set of K spatial region features, $V_i = \{v_1, \dots, v_K\}$. The attention weight $a_i$ is then multiplied with each spatial region feature and the results accumulated, giving the fine-grained image feature

$$\hat{v}_i = \sum_{k=1}^{K} a_{ik}\, v_k$$
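These two operations reduce to a few lines; a sketch (the β values follow the weight split reported in the experiments below):

```python
import torch

def fine_grained_feature(V, a_u, a_b, beta1=0.7, beta2=0.3):
    """V: (K, d) region features; a_u, a_b: (K,) attention weights from the
    upper and lower branches. Returns the pooled fine-grained feature v_hat."""
    a = beta1 * a_u + beta2 * a_b              # combined weight a_i
    return (a.unsqueeze(1) * V).sum(dim=0)     # weighted sum over the K regions
```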

In step 3), the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the activation function ReLU normalizes the vectors; the outputs are then fused by the Hadamard product, giving the fused feature

$$h = f_v(\hat{v}_i) \circ f_q(q_i)$$

Further, the visual question answering problem is a multi-label classification problem. Accordingly, in step 4), the fused feature is passed through a nonlinear layer $f_o$, in which the activation function ReLU normalizes the vector; a linear mapping $w_o$ then predicts the candidate score of each answer,

$$\hat{s} = \sigma(w_o\, f_o(h))$$

Finally, the output with the higher score is selected;

where σ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.

Preferably, the sigmoid activation function normalizes the final score into the interval (0, 1), and the last stage acts as a logistic regression predicting the correctness of each candidate answer, with objective function

$$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$$

where the indices z and k range over the M training questions and their N candidate answers respectively, and $s_{zk}$ is the true answer score of the question.

Compared with the softmax classifier used in other common visual question answering methods, the logistic-regression classification used by the present invention is more effective. The sigmoid function uses soft scores (soft targets) as target results, providing a richer training signal that effectively captures the uncertainty occasionally present in the ground-truth answers.
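A hedged sketch of steps 3) and 4) together with this soft-score objective follows; the layer widths follow num_hid = 1024 and N = 3000 from the experiments, while the exact depths of $f_v$, $f_q$, $f_o$ are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerHead(nn.Module):
    """Nonlinear layers f_v / f_q, Hadamard fusion, nonlinear layer f_o,
    then a linear map w_o scoring the N candidate answers."""
    def __init__(self, d_img, d_q, num_hid=1024, num_answers=3000):
        super().__init__()
        self.f_v = nn.Linear(d_img, num_hid)
        self.f_q = nn.Linear(d_q, num_hid)
        self.f_o = nn.Linear(num_hid, num_hid)
        self.w_o = nn.Linear(num_hid, num_answers)

    def forward(self, v_hat, q):
        h = F.relu(self.f_v(v_hat)) * F.relu(self.f_q(q))     # fused feature
        return torch.sigmoid(self.w_o(F.relu(self.f_o(h))))  # scores in (0, 1)

def vqa_loss(pred_scores, soft_targets):
    """Soft-score binary cross-entropy: the targets s_zk are the soft scores
    derived from the ten human answers per question."""
    return F.binary_cross_entropy(pred_scores, soft_targets, reduction="sum")
```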

To better observe how the attention model attends to the salient regions of the image, after the attention maps ($a_u$, $a_b$) of the single attention weight and the saliency attention weight are obtained, they are visualized as matrix heat maps with Python's matplotlib plotting library, as shown in Figs. 3 and 4.
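A minimal matplotlib sketch of this visualization (the 6×6 grids are random placeholders; in practice the K region weights would be mapped back onto the image):

```python
import numpy as np
import matplotlib.pyplot as plt

a_u = np.random.rand(6, 6)    # placeholder for the upper attention map
a_b = np.random.rand(6, 6)    # placeholder for the lower attention map

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, attn, title in zip(axes, (a_u, a_b), ("attention1", "attention2")):
    im = ax.imshow(attn, cmap="hot")           # matrix heat map
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.show()
```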

Figs. 3 and 4 show the behavior of the upper and lower attention models of the multi-view attention model on two images with different tasks; attention1 is the attention visualization of the upper attention model, and attention2 that of the lower attention model. The attention heat maps show that the added lower attention model can learn different important regions of the input image. As Fig. 3 shows, for a simple attention task both the upper and the lower attention model find the correct location in the image. In Fig. 4, however, when the task requires high concentration on multiple locations in the image, the lower attention model attends to parts different from those of the upper attention model, improving the accuracy of the multi-view attention model; the multi-view attention model of the present invention thus has an advantage over prior-art models.

Test dataset: the VQA v2 dataset (Antol S, Agrawal A, Lu J, et al. VQA: Visual question answering[C]. Proceedings of the IEEE International Conference on Computer Vision. 2015: 2425-2433.) is a large-scale visual question answering dataset in which all questions and answers are annotated by humans. The dataset contains 443,757 training questions, 214,354 validation questions, and 447,793 test questions. Each image is associated with three questions, and for each question the annotators provided ten answers. In the standard visual question answering task, the questions in this dataset are commonly divided into three types: Yes/no, Number, and Other.

Further, to verify the effectiveness of the present invention, its results were compared with those of the 2017 VQA Challenge winner (Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering. arXiv preprint arXiv:1707.07998, 2017.). As shown in Fig. 5, on the basis of the reproduced code of that paper, the multi-view attention model of the present invention replaced the original simple attention model; the final score of the multi-view attention model is 64.35%, an improvement in accuracy of about 1.2% over the paper.

Some basic parameters in the experiments were set as follows: the base learning rate was α = 0.0007, the random dropout rate after each LSTM layer was dropout = 0.3, and answer filtering was set to N = 3000. The hidden neurons of the fully connected layers were set to num_hid = 1024, and the batch size to batch_size = 512. The weights of the single attention weight and the saliency attention weight were β1 = 0.7 and β2 = 0.3.
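Collected into one illustrative configuration dict (the key names are ours; the values are the ones reported above):

```python
config = {
    "learning_rate": 0.0007,   # base learning rate alpha
    "dropout": 0.3,            # applied after each LSTM layer
    "num_answers": 3000,       # N, answers kept after filtering
    "num_hid": 1024,           # hidden units of the fully connected layers
    "batch_size": 512,
    "beta1": 0.7,              # weight of the single attention view
    "beta2": 0.3,              # weight of the saliency attention view
}
```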

As shown in Fig. 6, the loss value of the model keeps decreasing and converges as the number of training epochs grows; Fig. 7 shows the accuracy of the model on the training and validation sets as training proceeds.

Table 1 compares the present invention, in the test-dev setting, with representative VQA methods on the public standard dataset VQA v2.

Table 1

Specifically, the data are divided by question type into three categories for evaluation, and the overall evaluation result is then computed. The question types are Y/N (yes/no) questions, Number (counting) questions, and Others (other open-ended questions). The scores in the table are the accuracy of the model's answers to the different question types; the larger the value, the higher the accuracy. The table shows that the multi-view attention model of the present invention achieves good results across the different tasks.

In particular, because the multi-view attention model of the present invention strengthens fine-grained feature expression and improves object detection and recognition, it achieves a solid improvement over previous methods on the Number category, and its overall accuracy surpasses the results of most existing methods.

The above embodiments merely illustrate the present invention and are not intended to limit it. Any changes or modifications to the above embodiments that accord with the technical essence of the present invention fall within the scope of the claims of the present invention.

Claims (10)

1. A fine-grained visual question answering method combining a multi-view attention mechanism, characterized in that the steps are as follows:
1) inputting an image and extracting image features; inputting a question text and extracting question features;
2) inputting the image features and the question features into a multi-view attention model, computing the attention weight of the image, and applying the attention weight to the image features of step 1) in a weighted operation to obtain fine-grained image features;
3) fusing the fine-grained image features with the question features to obtain fused features;
4) inputting the fused features into a classifier and predicting the answer.

2. The fine-grained visual question answering method combining a multi-view attention mechanism according to claim 1, characterized in that the multi-view attention model comprises an upper attention model and a lower attention model; a single attention weight is obtained through the upper attention model, and a saliency attention weight is obtained through the lower attention model, the saliency attention weight reflecting that different target regions of the image correspond to different attention resources.

3. The fine-grained visual question answering method combining a multi-view attention mechanism according to claim 2, characterized in that the single attention weight is obtained as follows:
the image features and the question features are input to the upper attention model; a fully connected layer projects each of them into a common dimensional space, and the activation function ReLU normalizes the vectors; the projections are fused by the Hadamard product and passed through two further fully connected layers, yielding the learned parameter $a_u'$; finally, the softmax function normalizes the weights, giving the single attention weight $a_u = \mathrm{softmax}(a_u')$;
where $V_i \in \mathbb{R}^{K \times d}$ are the image features, $q_i$ are the question features, $W_v, W_q, W_1, W_2$ are the weight parameters to be learned by the upper attention model, K is the number of spatial regions of the image features, T is the selected question feature length, d is the number of hidden neurons in the network layer, h is the output dimension set for the layer, and ReLU is the neural-network activation function, whose concrete form is f(x) = max(0, x).

4. The fine-grained visual question answering method combining a multi-view attention mechanism according to claim 2, characterized in that the saliency attention weight is obtained as follows:
the image features and the question features are input to the lower attention model; a fully connected layer projects each of them into a common dimensional space, and the correlation matrix $C_i = \mathrm{ReLU}(q_i^{T} W_b V_i)$ is computed, where $W_b$ is a weight parameter to be learned by the lower attention model and $C_i$ is the resulting correlation matrix;
the correlation matrix is multiplied, as a feature, with the question features, and the product is fused with the input image features, giving the fused parameter $a_b'$; finally, the softmax function normalizes the weights and outputs the saliency attention weight $a_b = \mathrm{softmax}(a_b')$;
where the projection matrices involved are weight parameters to be learned by the lower attention model.

5. The fine-grained visual question answering method combining a multi-view attention mechanism according to any one of claims 2 to 4, characterized in that the attention weight of the image is computed from the single attention weight and the saliency attention weight, as follows:
$a_i = \beta_1 a_u + \beta_2 a_b$
where $\beta_1$ and $\beta_2$ are the weight-ratio hyperparameters of the upper and lower attention models.

6. The fine-grained visual question answering method combining a multi-view attention mechanism according to claim 5, characterized in that in step 3) the fine-grained image features and the question features are passed through nonlinear layers $f_v$ and $f_q$ respectively, in which the activation function ReLU normalizes the vectors; the outputs are then fused by the Hadamard product, giving the fused feature $h = f_v(\hat{v}_i) \circ f_q(q_i)$.

7. The fine-grained visual question answering method combining a multi-view attention mechanism according to claim 6, characterized in that in step 4) the fused feature is passed through a nonlinear layer $f_o$, in which the activation function ReLU normalizes the vector; a linear mapping $w_o$ then predicts the candidate score of each answer, $\hat{s} = \sigma(w_o\, f_o(h))$; finally, the output with the higher score is selected;
where σ is the sigmoid activation function and $w_o$ is a weight parameter to be learned.

8. The fine-grained visual question answering method combining a multi-view attention mechanism according to claim 7, characterized in that the sigmoid activation function normalizes the final score into the interval (0, 1), and the last stage acts as a logistic regression predicting the correctness of each candidate answer, with objective function
$L = -\sum_{z=1}^{M}\sum_{k=1}^{N}\big[s_{zk}\log \hat{s}_{zk} + (1 - s_{zk})\log(1 - \hat{s}_{zk})\big]$
where the indices z and k range over the M training questions and their N candidate answers respectively, and $s_{zk}$ is the true answer score of the question.

9. The fine-grained visual question answering method combining a multi-view attention mechanism according to claim 1, characterized in that in step 1) the Faster-RCNN standard model performs feature extraction on the input image $I_i$, giving the deeply expressed image features $V_i = \mathrm{FasterRCNN}(I_i)$.

10. The fine-grained visual question answering method combining a multi-view attention mechanism according to claim 1, characterized in that in step 1) the question text $Q_i$ is input; it is first split into words using spaces and punctuation and then initialized with the pretrained GloVe word-embedding method, giving the encoded i-th question sentence $X_i = \{x_1^{(i)}, \dots, x_T^{(i)}\}$, where $x_t^{(i)}$ denotes the t-th word of the sentence in the vocabulary;
$X_i$ is then fed into an LSTM network, and the output $q_i$ of the last layer is taken as the feature representation of $X_i$, giving the question feature $q_i$.
CN201910927585.4A 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism Active CN110717431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910927585.4A CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910927585.4A CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Publications (2)

Publication Number Publication Date
CN110717431A true CN110717431A (en) 2020-01-21
CN110717431B CN110717431B (en) 2023-03-24

Family

ID=69211080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910927585.4A Active CN110717431B (en) 2019-09-27 2019-09-27 Fine-grained visual question and answer method combined with multi-view attention mechanism

Country Status (1)

Country Link
CN (1) CN110717431B (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325000A (en) * 2020-01-23 2020-06-23 北京百度网讯科技有限公司 Language generation method and device and electronic equipment
CN111325243A (en) * 2020-02-03 2020-06-23 天津大学 Visual relation detection method based on regional attention learning mechanism
CN111860653A (en) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 Visual question answering method and device, electronic equipment and storage medium
CN111984772A (en) * 2020-07-23 2020-11-24 中山大学 Medical image question-answering method and system based on deep learning
CN112100346A (en) * 2020-08-28 2020-12-18 西北工业大学 A visual question answering method based on the fusion of fine-grained image features and external knowledge
CN112163608A (en) * 2020-09-21 2021-01-01 天津大学 Visual relation detection method based on multi-granularity semantic fusion
CN112488111A (en) * 2020-12-18 2021-03-12 贵州大学 Instruction expression understanding method based on multi-level expression guide attention network
CN112732879A (en) * 2020-12-23 2021-04-30 重庆理工大学 Downstream task processing method and model of question-answering task
CN112905819A (en) * 2021-01-06 2021-06-04 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention
CN113223018A (en) * 2021-05-21 2021-08-06 信雅达科技股份有限公司 Fine-grained image analysis processing method
CN113392288A (en) * 2020-03-11 2021-09-14 阿里巴巴集团控股有限公司 Visual question answering and model training method, device, equipment and storage medium thereof
CN113408511A (en) * 2021-08-23 2021-09-17 南开大学 Method, system, equipment and storage medium for determining gazing target
CN113436094A (en) * 2021-06-24 2021-09-24 湖南大学 Gray level image automatic coloring method based on multi-view attention mechanism
CN113761153A (en) * 2021-05-19 2021-12-07 腾讯科技(深圳)有限公司 Question and answer processing method and device based on picture, readable medium and electronic equipment
CN113779298A (en) * 2021-09-16 2021-12-10 哈尔滨工程大学 A compound loss-based method for medical visual question answering
CN113792617A (en) * 2021-08-26 2021-12-14 电子科技大学 An Image Interpretation Method Combining Image Information and Text Information
CN113837212A (en) * 2020-06-24 2021-12-24 四川大学 A visual question answering method based on multimodal bidirectional directed attention
CN114092783A (en) * 2020-08-06 2022-02-25 清华大学 Dangerous goods detection method based on attention mechanism continuous visual angle
CN114117159A (en) * 2021-12-08 2022-03-01 东北大学 An Image Question Answering Method Based on Interaction between Multi-Order Image Features and Questions
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 A Visual Question Answering Method Based on Deep Inference Attention Mechanism
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question answering method and device
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 A visual question answering method based on multi-angle semantic understanding and adaptive dual-channel
CN113407794B (en) * 2021-06-01 2023-10-31 中国科学院计算技术研究所 Visual question-answering method and system for inhibiting language deviation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIONG Xue et al.: "Research on answer selection method based on attention mechanism", Intelligent Computer and Applications *

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325000B (en) * 2020-01-23 2021-01-26 北京百度网讯科技有限公司 Language generation method and device and electronic equipment
US11562150B2 (en) 2020-01-23 2023-01-24 Beijing Baidu Netcom Science Technology Co., Ltd. Language generation method and apparatus, electronic device and storage medium
CN111325000A (en) * 2020-01-23 2020-06-23 北京百度网讯科技有限公司 Language generation method and device and electronic equipment
CN111325243A (en) * 2020-02-03 2020-06-23 天津大学 Visual relation detection method based on regional attention learning mechanism
CN113392288A (en) * 2020-03-11 2021-09-14 阿里巴巴集团控股有限公司 Visual question answering and model training method, device, equipment and storage medium thereof
CN113837212A (en) * 2020-06-24 2021-12-24 四川大学 A visual question answering method based on multimodal bidirectional directed attention
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question answering method based on multimodal bidirectional guided attention
CN111860653A (en) * 2020-07-22 2020-10-30 苏州浪潮智能科技有限公司 Visual question answering method and device, electronic equipment and storage medium
CN111984772B (en) * 2020-07-23 2024-04-02 中山大学 Medical image question-answering method and system based on deep learning
CN111984772A (en) * 2020-07-23 2020-11-24 中山大学 Medical image question-answering method and system based on deep learning
CN114092783A (en) * 2020-08-06 2022-02-25 清华大学 Dangerous goods detection method based on attention mechanism continuous visual angle
CN112100346A (en) * 2020-08-28 2020-12-18 西北工业大学 A visual question answering method based on the fusion of fine-grained image features and external knowledge
CN112100346B (en) * 2020-08-28 2021-07-20 西北工业大学 A visual question answering method based on the fusion of fine-grained image features and external knowledge
CN112163608B (en) * 2020-09-21 2023-02-03 天津大学 Visual relation detection method based on multi-granularity semantic fusion
CN112163608A (en) * 2020-09-21 2021-01-01 天津大学 Visual relation detection method based on multi-granularity semantic fusion
CN112488111A (en) * 2020-12-18 2021-03-12 贵州大学 Referring expression comprehension method based on multi-level expression-guided attention network
CN112488111B (en) * 2020-12-18 2022-06-14 贵州大学 Referring expression comprehension method based on multi-level expression-guided attention network
CN112732879A (en) * 2020-12-23 2021-04-30 重庆理工大学 Downstream task processing method and model of question-answering task
CN112905819A (en) * 2021-01-06 2021-06-04 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention
CN112905819B (en) * 2021-01-06 2022-09-23 中国石油大学(华东) Visual question-answering method of original feature injection network based on composite attention
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113761153A (en) * 2021-05-19 2021-12-07 腾讯科技(深圳)有限公司 Question and answer processing method and device based on picture, readable medium and electronic equipment
CN113223018A (en) * 2021-05-21 2021-08-06 信雅达科技股份有限公司 Fine-grained image analysis processing method
CN113407794B (en) * 2021-06-01 2023-10-31 中国科学院计算技术研究所 Visual question answering method and system for suppressing language bias
CN113436094A (en) * 2021-06-24 2021-09-24 湖南大学 Gray level image automatic coloring method based on multi-view attention mechanism
CN113408511B (en) * 2021-08-23 2021-11-12 南开大学 A method, system, device and storage medium for determining gaze target
CN113408511A (en) * 2021-08-23 2021-09-17 南开大学 Method, system, equipment and storage medium for determining gazing target
CN113792617A (en) * 2021-08-26 2021-12-14 电子科技大学 An Image Interpretation Method Combining Image Information and Text Information
CN113792617B (en) * 2021-08-26 2023-04-18 电子科技大学 Image interpretation method combining image information and text information
CN113779298B (en) * 2021-09-16 2023-10-31 哈尔滨工程大学 Medical vision question-answering method based on composite loss
CN113779298A (en) * 2021-09-16 2021-12-10 哈尔滨工程大学 A compound loss-based method for medical visual question answering
CN114117159A (en) * 2021-12-08 2022-03-01 东北大学 An Image Question Answering Method Based on Interaction between Multi-Order Image Features and Questions
CN114117159B (en) * 2021-12-08 2024-07-12 东北大学 Image question-answering method for multi-order image feature and question interaction
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 A Visual Question Answering Method Based on Deep Inference Attention Mechanism
CN114417044A (en) * 2022-01-19 2022-04-29 中国科学院空天信息创新研究院 Image question answering method and device
CN114661874A (en) * 2022-03-07 2022-06-24 浙江理工大学 A visual question answering method based on multi-angle semantic understanding and adaptive dual-channel
CN114661874B (en) * 2022-03-07 2024-04-30 浙江理工大学 Visual question answering method based on multi-angle semantic understanding and adaptive dual channels

Also Published As

Publication number Publication date
CN110717431B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN110717431B (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
Sadr et al. Multi-view deep network: a deep model based on learning features from heterogeneous neural networks for sentiment analysis
Yang et al. Video captioning by adversarial LSTM
CN111554268B (en) Language identification method based on language model, text classification method and device
Do et al. Deep neural network-based fusion model for emotion recognition using visual data
CN111488739A (en) An Implicit Discourse Recognition Approach Based on Multi-granularity Generated Image Enhancement Representation
Islam et al. A review on video classification with methods, findings, performance, challenges, limitations and future work
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN110163117A (en) A kind of pedestrian's recognition methods again based on autoexcitation identification feature learning
Liu et al. Fact-based visual question answering via dual-process system
Yeh et al. Reactive multi-stage feature fusion for multimodal dialogue modeling
Liu et al. A multimodal approach for multiple-relation extraction in videos
Han et al. NSNP-DFER: a nonlinear spiking neural P network for dynamic facial expression recognition
CN113657272B (en) A micro-video classification method and system based on missing data completion
Ling et al. A facial expression recognition system for smart learning based on yolo and vision transformer
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
Ke et al. Spatial, structural and temporal feature learning for human interaction prediction
Wang et al. Generalised zero-shot learning for entailment-based text classification with external knowledge
Shultana et al. CvTSRR: A Convolutional Vision Transformer Based Method for Social Relation Recognition
Guo et al. Double-layer affective visual question answering network
Yuvaraj et al. An Adaptive Deep Belief Feature Learning Model for Cognitive Emotion Recognition
Aishwarya et al. Stacked attention based textbook visual question answering with BERT
Shinde et al. Image caption generation methodologies
Pu et al. Differential residual learning for facial expression recognition
Tanuja Patgar Convolution neural network based emotion classification cognitive model for facial expression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant