WO2020119631A1 - Lightweight visual question-answering system and method - Google Patents
- Publication number
- WO2020119631A1 (PCT/CN2019/124008)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- feature
- layer
- processing module
- question answering
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
Abstract
Disclosed are a lightweight visual question-answering system and method. The system comprises an image processing module (1), a text processing module (2), a feature fuser (3) and a classifier (4), wherein the image processing module (1) uses a convolutional neural network to extract an image feature and converts it into an image feature vector; the text processing module (2) extracts a text feature to form a text feature vector; and the image feature vector and the text feature vector are both sent to the feature fuser (3) for fusion, and the fusion result is sent to the classifier (4) to form a final answer. The method reduces model complexity in both image feature extraction and question text feature extraction, which makes it easier to port the question-answering system to a mobile terminal.
Description
The invention relates to the field of computer vision, and in particular to visual question answering.
Deep learning, with its powerful feature-learning ability, is widely used in computer vision (CV) and natural language processing (NLP). Convolutional neural networks (CNNs) can extract and compress image information and are used mostly in image processing, while recurrent neural networks (RNNs) have been very successful in natural language processing, especially in speech recognition, machine translation, language modeling, and text generation.
Visual question answering is one of the most challenging problems in computer vision. The task is to have a computer automatically analyze an image and a question about it, and then answer the question. Because the task spans both computer vision and natural language processing, a natural solution is to build a combined model from the convolutional and recurrent neural networks that have been so successful in those two fields. The most commonly used convolutional neural networks are Res-net and VGG-net, and the most commonly used recurrent neural networks are LSTM and GRU. However, because visual question answering must process an image and a question at the same time, computation is often slow; when computing power is limited, as on a mobile terminal, producing an answer takes longer.
For fusing image information with text information, Hedi Ben-younes et al. proposed the MUTAN fusion model in the paper MUTAN: Multimodal Tucker Fusion for Visual Question Answering, shown in Figure 1. It is based on Tucker tensor decomposition, which factors a tensor into three factor matrices and a core tensor. Constraining the core tensor further controls the number of model parameters, which helps prevent overfitting during training and allows input/output prediction to be adjusted more flexibly. The present invention builds on the MUTAN model, using shuffle-net to process the image and the convolutional neural network TextCNN to process the question sentence. This effectively reduces model complexity and makes it easier to port the question answering system to mobile terminals.
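To illustrate why factoring the fusion tensor keeps the parameter count manageable, the following sketch compares an unfactored three-way tensor with its Tucker form (three factor matrices plus a core tensor). All dimension values are hypothetical and chosen only for illustration; they do not come from the patent.

```python
def full_tensor_params(d_q: int, d_v: int, n_answers: int) -> int:
    """Parameter count of an unfactored tensor of shape d_q x d_v x n_answers."""
    return d_q * d_v * n_answers


def tucker_params(d_q: int, d_v: int, n_answers: int,
                  t_q: int, t_v: int, t_o: int) -> int:
    """Parameter count of the Tucker form: W_q, W_v, W_o and the core tensor."""
    w_q = d_q * t_q            # factor matrix for the question mode
    w_v = d_v * t_v            # factor matrix for the image mode
    w_o = n_answers * t_o      # factor matrix for the output mode
    core = t_q * t_v * t_o     # core tensor
    return w_q + w_v + w_o + core


# Hypothetical sizes (assumptions for illustration only).
D_Q, D_V, N_ANSWERS = 1024, 1024, 2000
T_Q, T_V, T_O = 128, 128, 256

full = full_tensor_params(D_Q, D_V, N_ANSWERS)
tucker = tucker_params(D_Q, D_V, N_ANSWERS, T_Q, T_V, T_O)
print(f"full: {full:,}  tucker: {tucker:,}  ratio: {full / tucker:.0f}x")
```

With these assumed sizes the core tensor dominates the Tucker count, so shrinking the core dimensions is the most direct way to reduce the model further.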
Summary of the invention
The purpose of the present invention is to provide a question answering system and method that require little computing power and are easy to port to mobile terminals. The technical solution adopted is as follows:
A lightweight visual question answering system comprises an image processing module 1, a text processing module 2, a feature fuser 3, and a classifier 4. The image processing module 1 uses a convolutional neural network to extract image features and converts them into an image feature vector; the text processing module 2 extracts text features to form a text feature vector; both the image feature vector and the text feature vector are sent to the feature fuser 3 for fusion, and the fusion result is sent to the classifier 4 to form the final answer.
Further, the image processing module 1 uses a shuffle-net model to extract image features.
Further, the text processing module 2 uses TextCNN to extract text features.
Further, the text processing module 2 includes an input layer 21, a convolutional layer 22, a pooling layer 23, and a fully connected layer 24. The input layer 21 stacks the pre-trained word vectors of the words in the sentence to obtain an n*k matrix, where n is the preset sentence length (shorter sentences are padded with zeros) and k is the length of a word vector. The input layer 21 is connected to the convolutional layer 22, which applies convolutional neural network processing to the input matrix and comprises multiple layers. The convolutional layer 22 is connected to the pooling layer 23, the pooling layer 23 is connected to the fully connected layer 24, and the fully connected layer 24 finally yields the text features.
Further, the feature fuser 3 uses the MUTAN model to perform Tucker decomposition and fuses the components to obtain the fusion result.
Further, the classifier 4 is a SoftMax classifier, and the loss function used is the cross-entropy loss.
Further, the system is embedded in a mobile terminal for use.
A lightweight visual question answering method uses a pre-trained shuffle-net model to extract image features, uses TextCNN to extract text features, and then uses the MUTAN model to fuse the image features with the text features to obtain the answer.
Further, the fusion method is as follows: the vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused to obtain a tensor $T$; Tucker decomposition of $T$ yields the parameter core tensor $\tau_c$ and three factor matrices $W_q$, $W_v$, and $W_o$; and the fused feature $y$ is computed as:
$$y = \left(\left(\tau_c \times_1 (q^\top W_q)\right) \times_2 (v^\top W_v)\right) \times_3 W_o$$
where $\times_i$ denotes multiplication with the tensor along its i-th dimension, and the final answer is obtained by sending $y$ into the classifier.
Further, the method is applied in a mobile terminal.
The advantage of the lightweight visual question answering system and method of the present invention is that model complexity is reduced in both image feature extraction and question text feature extraction, which makes it easier to port the question answering system to mobile terminals.
Figure 1 is the architecture diagram of the MUTAN fusion model.
Figure 2 is a block diagram of the lightweight visual question answering system.
Figure 3 is a structural diagram of the text processing module.
As shown in Figure 2, the lightweight visual question answering system of the present invention comprises an image processing module 1, a text processing module 2, a feature fuser 3, and a classifier 4. The image to be examined enters the image processing module 1, which uses a convolutional neural network to extract image features and converts them into an image feature vector. The question text enters the text processing module 2, which extracts its features to form a text feature vector. Both the image feature vector and the text feature vector are sent to the feature fuser 3 for fusion, and the fusion result is sent to the classifier 4 to form the final answer.
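The data flow of Figure 2 can be sketched as a composition of four stages. This is only a wiring sketch: the class name, field names, and the toy stages in the usage lines are hypothetical stand-ins, not the patent's actual shuffle-net, TextCNN, MUTAN, or SoftMax components.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class LightweightVQA:
    image_module: Callable[[object], Vector]     # module 1: image -> feature vector v
    text_module: Callable[[str], Vector]         # module 2: question -> feature vector q
    fuser: Callable[[Vector, Vector], Vector]    # fuser 3: (q, v) -> fused feature y
    classifier: Callable[[Vector], int]          # classifier 4: y -> answer index
    answers: Sequence[str]                       # candidate answer vocabulary

    def answer(self, image: object, question: str) -> str:
        v = self.image_module(image)
        q = self.text_module(question)
        y = self.fuser(q, v)
        return self.answers[self.classifier(y)]


# Toy stages, purely to show how the modules connect.
toy = LightweightVQA(
    image_module=lambda image: [1.0, 0.0],
    text_module=lambda question: [float(len(question.split())), 1.0],
    fuser=lambda q, v: [qi * vi for qi, vi in zip(q, v)],
    classifier=lambda y: max(range(len(y)), key=y.__getitem__),
    answers=["yes", "no"],
)
result = toy.answer(image=None, question="is it red")
print(result)
```

Keeping each stage behind a plain callable makes it straightforward to swap a heavier extractor for a lighter one without touching the rest of the pipeline.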
In the image processing module 1, a pre-trained shuffle-net model is selected to extract features; the features of the last convolutional layer of the shuffle-net are sent to the feature fuser.
The text processing module 2 uses TextCNN to process the question text; its structure is shown in Figure 3. In the input layer 21, the pre-trained word vectors corresponding to the words in the sentence are stacked to obtain an n*k matrix, where n is the preset sentence length (shorter sentences are padded with zeros) and k is the length of a word vector. The matrix is then processed by the convolutional neural network: the input layer 21 is connected to the convolutional layer 22, and features are extracted in the multiple convolutional layers 22. The convolutional layer 22 is connected to the pooling layer 23, which pools the features using max pooling. The pooling layer 23 is connected to the fully connected layer 24, which finally yields the text features.
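The text path described above (zero-padded n*k input, convolution, max pooling, fully connected layer) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, sentences longer than n are truncated, unknown words map to zero vectors, a ReLU follows each convolution, and the "multiple convolutional layers" are modeled as parallel kernels of different heights, as is common in TextCNN. None of these details are specified in the source.

```python
from typing import Dict, List

Matrix = List[List[float]]


def build_input(words: List[str], embeddings: Dict[str, List[float]],
                n: int, k: int) -> Matrix:
    """Stack word vectors into an n x k matrix, zero-padding short sentences."""
    rows = [embeddings.get(w, [0.0] * k) for w in words[:n]]  # assumed: truncate, zeros for unknowns
    rows += [[0.0] * k for _ in range(n - len(rows))]         # pad with zero rows up to n
    return rows


def conv_feature_map(x: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """Slide an h x k kernel down the rows of x, then apply ReLU (assumed)."""
    h = len(kernel)
    out = []
    for start in range(len(x) - h + 1):
        s = bias
        for i in range(h):
            s += sum(a * b for a, b in zip(x[start + i], kernel[i]))
        out.append(max(0.0, s))
    return out


def text_features(x: Matrix, kernels: List[Matrix], fc_weights: Matrix) -> List[float]:
    """Max-pool each feature map to one value, then apply a fully connected layer."""
    pooled = [max(conv_feature_map(x, kern)) for kern in kernels]
    return [sum(w * p for w, p in zip(row, pooled)) for row in fc_weights]


# Toy usage with hypothetical 2-dimensional embeddings.
embeddings = {"what": [1.0, 0.0], "color": [0.0, 1.0]}
x = build_input(["what", "color"], embeddings, n=4, k=2)
kernels = [[[1.0, 1.0], [1.0, 1.0]],   # window of 2 words
           [[1.0, 0.0]]]               # window of 1 word
fc_weights = [[1.0, 1.0], [0.5, -1.0]]
features = text_features(x, kernels, fc_weights)
print(x, features)
```

Because each kernel spans the full embedding width k, the convolution slides only along the word axis, and max pooling reduces every feature map to a single number regardless of sentence length.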
In the feature fuser 3, the MUTAN model is used to perform Tucker decomposition, and the components are fused to obtain the fusion result. The MUTAN fusion model was proposed by Hedi Ben-younes et al. in the paper MUTAN: Multimodal Tucker Fusion for Visual Question Answering, and its flow is shown in Figure 1.
The vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused to obtain the tensor $T$; Tucker decomposition of $T$ yields the parameter core tensor $\tau_c$ and three factor matrices $W_q$, $W_v$, and $W_o$; and the fused feature $y$ is computed as:
$$y = \left(\left(\tau_c \times_1 (q^\top W_q)\right) \times_2 (v^\top W_v)\right) \times_3 W_o$$
where $\times_i$ denotes multiplication with the tensor along its i-th dimension, and the final answer is obtained by sending $y$ into the classifier.
The Tucker tensor decomposition is:
$$T = \left(\left(\tau_c \times_1 W_q\right) \times_2 W_v\right) \times_3 W_o$$
where $T$ is obtained by fusing the text feature vector $q$ and the image feature vector $v$.
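The fusion step can be sketched in plain Python as follows. The sketch assumes the shape convention of the cited MUTAN paper, with W_q of shape d_q x t_q, W_v of shape d_v x t_v, W_o of shape |A| x t_o, and a core tensor of shape t_q x t_v x t_o; the function names and the toy values are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]
Tensor3 = List[List[List[float]]]


def vec_mat(x: Vector, w: Matrix) -> Vector:
    """Row vector times matrix: x^T W."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def mat_vec(w: Matrix, x: Vector) -> Vector:
    """Matrix times column vector: W x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mutan_fuse(q: Vector, v: Vector, core: Tensor3,
               w_q: Matrix, w_v: Matrix, w_o: Matrix) -> Vector:
    """Compute y = ((core x_1 (q^T W_q)) x_2 (v^T W_v)) x_3 W_o."""
    q_t = vec_mat(q, w_q)                       # project question to length t_q
    v_t = vec_mat(v, w_v)                       # project image to length t_v
    t_o = len(core[0][0])
    # Contract the core tensor with q_t along mode 1 and v_t along mode 2.
    z = [sum(core[a][b][c] * q_t[a] * v_t[b]
             for a in range(len(q_t)) for b in range(len(v_t)))
         for c in range(t_o)]
    return mat_vec(w_o, z)                      # mode 3: map to the answer space


# Toy values (assumptions for illustration only).
q = [1.0, 2.0]
v = [1.0, 0.0, 1.0]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
core = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0]]]
y = mutan_fuse(q, v, core, w_q, w_v, w_o)
print(y)
```

Projecting q and v before contracting with the core avoids ever materializing the full tensor T, which is where the savings in computation and memory come from.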
The classifier 4 is a SoftMax layer, and the loss function selected for training is the cross-entropy loss, expressed in its standard form as:
$$L = -\sum_{i=1}^{|A|} y_i \log \hat{y}_i$$
where $y_i$ represents the true answer index, $\hat{y}_i$ is the predicted answer index, $i = 1, \ldots, |A|$, and $|A|$ is the number of distinct answers.
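A minimal sketch of the classifier stage, assuming the true answer is encoded as a one-hot vector over the |A| candidate answers and that a small epsilon guards the logarithm. The function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true: List[float], y_pred: List[float], eps: float = 1e-12) -> float:
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


# Toy usage: two candidate answers with equal scores.
probs = softmax([0.0, 0.0])
loss = cross_entropy([1.0, 0.0], probs)
print(probs, loss)
```

At inference time only the softmax (or simply the index of the largest score) is needed; the loss is used during training alone.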
A lightweight visual question answering method uses a pre-trained shuffle-net model to extract image features, uses TextCNN to extract text features, and then uses the MUTAN model to fuse the image features with the text features to obtain the answer.
Experiments confirm that using shuffle-net as the image feature extractor and TextCNN as the text feature extractor for visual question answering effectively reduces model complexity and makes it easier to port the question answering system to mobile terminals.
Claims (10)
- A lightweight visual question answering system, characterized in that it comprises an image processing module (1), a text processing module (2), a feature fuser (3), and a classifier (4), wherein the image processing module (1) uses a convolutional neural network to extract image features and converts them into an image feature vector; the text processing module (2) extracts text features to form a text feature vector; and both the image feature vector and the text feature vector are sent to the feature fuser (3) for fusion, and the fusion result is sent to the classifier (4) to form the final answer.
- The lightweight visual question answering system according to claim 1, characterized in that the image processing module (1) uses a shuffle-net model to extract image features.
- The lightweight visual question answering system according to claim 1 or 2, characterized in that the text processing module (2) uses TextCNN to extract text features.
- The lightweight visual question answering system according to claim 3, characterized in that the text processing module (2) includes an input layer (21), a convolutional layer (22), a pooling layer (23), and a fully connected layer (24); the input layer (21) stacks the pre-trained word vectors of the words in the sentence to obtain an n*k matrix, where n is the preset sentence length, padded with zeros when the sentence is shorter, and k is the length of a word vector; the input layer (21) is connected to the convolutional layer (22), which applies convolutional neural network processing to the input matrix and comprises multiple layers; the convolutional layer (22) is connected to the pooling layer (23), the pooling layer (23) is connected to the fully connected layer (24), and the fully connected layer (24) finally yields the text features.
- The lightweight visual question answering system according to any one of claims 1 to 4, characterized in that the feature fuser (3) uses the MUTAN model to perform Tucker decomposition and fuses the components to obtain the fusion result.
- The lightweight visual question answering system according to any one of claims 1 to 5, characterized in that the classifier (4) is a SoftMax classifier and the loss function used is the cross-entropy loss.
- The lightweight visual question answering system according to any one of claims 1 to 6, characterized in that the system is embedded in a mobile terminal for use.
- A lightweight visual question answering method, characterized in that a pre-trained shuffle-net model is used to extract image features, TextCNN is used to extract text features, and the MUTAN model is then used to fuse the image features with the text features to obtain the answer.
- The lightweight visual question answering method according to claim 8, characterized in that the fusion method is as follows: the vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused to obtain a tensor $T$; Tucker decomposition of $T$ yields the parameter core tensor $\tau_c$ and three factor matrices $W_q$, $W_v$, and $W_o$; and the fused feature $y$ is computed as $y = \left(\left(\tau_c \times_1 (q^\top W_q)\right) \times_2 (v^\top W_v)\right) \times_3 W_o$, where $\times_i$ denotes multiplication with the tensor along its i-th dimension, and the final answer is obtained by sending $y$ into the classifier.
- The lightweight visual question answering method according to claim 8, characterized in that the method is applied in a mobile terminal.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811518735.8 | 2018-12-12 | ||
CN201811518735.8A CN109784163A (en) | 2018-12-12 | 2018-12-12 | A kind of light weight vision question answering system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020119631A1 (en) | 2020-06-18 |
Family
ID=66496867
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/124008 WO2020119631A1 (en) | 2018-12-12 | 2019-12-09 | Lightweight visual question-answering system and method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109784163A (en) |
WO (1) | WO2020119631A1 (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113157889A (en) * | 2021-04-21 | 2021-07-23 | 韶鼎人工智能科技有限公司 | Visual question-answering model construction method based on theme loss |
CN113792703A (en) * | 2021-09-29 | 2021-12-14 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention deep modular network |
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN113918679A (en) * | 2021-09-22 | 2022-01-11 | 三一汽车制造有限公司 | Knowledge question and answer method and device and engineering machinery |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109784163A (en) * | 2018-12-12 | 2019-05-21 | 中国科学院深圳先进技术研究院 | A kind of light weight vision question answering system and method |
CN110298338B (en) * | 2019-06-20 | 2021-08-24 | 北京易道博识科技有限公司 | Document image classification method and device |
CN110348535B (en) * | 2019-07-17 | 2022-05-31 | 北京金山数字娱乐科技有限公司 | Visual question-answering model training method and device |
CN111967487B (en) * | 2020-03-23 | 2022-09-20 | 同济大学 | Incremental data enhancement method for visual question-answer model training and application |
CN111814843B (en) * | 2020-03-23 | 2024-02-27 | 同济大学 | End-to-end training method and application of image feature module in visual question-answering system |
CN112100346B (en) * | 2020-08-28 | 2021-07-20 | 西北工业大学 | Visual question-answering method based on fusion of fine-grained image features and external knowledge |
CN112148891A (en) * | 2020-09-25 | 2020-12-29 | 天津大学 | Knowledge graph completion method based on graph perception tensor decomposition |
CN112925904B (en) * | 2021-01-27 | 2022-11-29 | 天津大学 | Lightweight text classification method based on Tucker decomposition |
CN113128415B (en) * | 2021-04-22 | 2023-09-29 | 合肥工业大学 | Environment distinguishing method, system, equipment and storage medium |
CN113919344B (en) * | 2021-09-26 | 2022-09-23 | 腾讯科技(深圳)有限公司 | Text processing method and device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN106777185A (en) * | 2016-12-23 | 2017-05-31 | 浙江大学 | A kind of across media Chinese herbal medicine image search methods based on deep learning |
CN107066583A (en) * | 2017-04-14 | 2017-08-18 | 华侨大学 | A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity |
CN107679582A (en) * | 2017-10-20 | 2018-02-09 | 深圳市唯特视科技有限公司 | A kind of method that visual question and answer are carried out based on multi-modal decomposition model |
CN108256549A (en) * | 2017-12-13 | 2018-07-06 | 北京达佳互联信息技术有限公司 | Image classification method, device and terminal |
CN108763325A (en) * | 2018-05-04 | 2018-11-06 | 北京达佳互联信息技术有限公司 | Network object processing method and device |
CN109784163A (en) * | 2018-12-12 | 2019-05-21 | 中国科学院深圳先进技术研究院 | Lightweight visual question-answering system and method |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105138993B (en) * | 2015-08-31 | 2018-07-27 | 小米科技有限责任公司 | Establish the method and device of human face recognition model |
CN105956608A (en) * | 2016-04-21 | 2016-09-21 | 恩泊泰(天津)科技有限公司 | Objective positioning and classifying algorithm based on deep learning |
CN107368770B (en) * | 2016-05-12 | 2021-05-11 | 江苏安纳泰克能源服务有限公司 | Method and system for automatically identifying returning passenger |
CN106055576B (en) * | 2016-05-20 | 2018-04-10 | 大连理工大学 | Fast and effective image retrieval method for large-scale data |
CN106250918B (en) * | 2016-07-26 | 2019-08-13 | 大连理工大学 | Gaussian mixture model matching method based on an improved earth mover's distance |
CN106372581B (en) * | 2016-08-25 | 2020-09-04 | 中国传媒大学 | Method for constructing and training face recognition feature extraction network |
US10282462B2 (en) * | 2016-10-31 | 2019-05-07 | Walmart Apollo, Llc | Systems, method, and non-transitory computer-readable storage media for multi-modal product classification |
CN108509519B (en) * | 2018-03-09 | 2021-03-09 | 北京邮电大学 | General knowledge graph enhanced question-answer interaction system and method based on deep learning |
CN108564588B (en) * | 2018-03-21 | 2020-07-10 | 华中科技大学 | Built-up area automatic extraction method based on depth features and graph segmentation method |
CN108875648A (en) * | 2018-06-22 | 2018-11-23 | 深源恒际科技有限公司 | A method of real-time vehicle damage and component detection based on mobile video stream |
- 2018-12-12: CN application CN201811518735.8A, published as CN109784163A, status: Pending
- 2019-12-09: WO application PCT/CN2019/124008, published as WO2020119631A1, status: Application Filing
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113837212A (en) * | 2020-06-24 | 2021-12-24 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN113837212B (en) * | 2020-06-24 | 2023-09-26 | 四川大学 | Visual question-answering method based on multi-mode bidirectional guiding attention |
CN113157889A (en) * | 2021-04-21 | 2021-07-23 | 韶鼎人工智能科技有限公司 | Visual question-answering model construction method based on theme loss |
CN113918679A (en) * | 2021-09-22 | 2022-01-11 | 三一汽车制造有限公司 | Knowledge question and answer method and device and engineering machinery |
CN113792703A (en) * | 2021-09-29 | 2021-12-14 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention deep modular network |
CN113792703B (en) * | 2021-09-29 | 2024-02-02 | 山东新一代信息产业技术研究院有限公司 | Image question-answering method and device based on Co-Attention depth modular network |
Also Published As
Publication number | Publication date |
---|---|
CN109784163A (en) | 2019-05-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020119631A1 (en) | Lightweight visual question-answering system and method | |
CN111554268B (en) | Language identification method based on language model, text classification method and device | |
CN111340814B (en) | RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution | |
WO2021134277A1 (en) | Emotion recognition method, intelligent device, and computer-readable storage medium | |
CN107563498A (en) | Image description method and system based on a strategy combining visual and semantic attention | |
CN110866184A (en) | Short video data label recommendation method and device, computer equipment and storage medium | |
CN113537024B (en) | Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism | |
CN113722458B (en) | Visual question-answering processing method, device, computer readable medium, and program product | |
US20220318946A1 (en) | Method for image shape transformation based on generative adversarial network | |
Mazaheri et al. | Video fill in the blank with merging lstms | |
CN110795549A (en) | Short text conversation method, device, equipment and storage medium | |
CN117121015A (en) | Multimodal few-shot learning using frozen language models | |
CN112749556A (en) | Multi-language model training method and device, storage medium and electronic equipment | |
Sevli et al. | Turkish sign language digits classification with CNN using different optimizers | |
CN117789099B (en) | Video feature extraction method and device, storage medium and electronic equipment | |
Mazaheri et al. | Video fill in the blank using lr/rl lstms with spatial-temporal attentions | |
Thakar et al. | Sign Language to Text Conversion in Real Time using Transfer Learning | |
CN109564633A (en) | Artificial neural network | |
Chaikaew | An applied holistic landmark with deep learning for Thai sign language recognition | |
CN117494762A (en) | Training method of student model, material processing method, device and electronic equipment | |
CN115017900B (en) | Conversation emotion recognition method based on multi-mode multi-prejudice | |
Sreemathy et al. | Indian Sign Language interpretation using convolutional neural networks | |
Takayama et al. | Masked batch normalization to improve tracking-based sign language recognition using graph convolutional networks | |
CN116311493A (en) | Two-stage human-object interaction detection method based on coding and decoding architecture | |
CN115392232A (en) | Topic and multi-mode fused emergency emotion analysis method |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19894915; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 02.11.2021) |
122 | Ep: pct application non-entry in european phase | Ref document number: 19894915; Country of ref document: EP; Kind code of ref document: A1 |