WO2020119631A1 - Lightweight visual question-answering system and method - Google Patents

Info

Publication number
WO2020119631A1
WO2020119631A1 PCT/CN2019/124008 CN2019124008W WO2020119631A1 WO 2020119631 A1 WO2020119631 A1 WO 2020119631A1 CN 2019124008 W CN2019124008 W CN 2019124008W WO 2020119631 A1 WO2020119631 A1 WO 2020119631A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
feature
layer
processing module
question answering
Prior art date
Application number
PCT/CN2019/124008
Other languages
French (fr)
Chinese (zh)
Inventor
王磊
赖坤耀
程俊
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2020119631A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition

Definitions

  • The invention relates to the field of computer vision, and in particular to visual question answering technology.
  • CNN: convolutional neural network
  • RNN: recurrent neural network
  • Visual question answering is one of the most challenging problems in computer vision.
  • The task of visual question answering is to have a computer automatically analyze an image and a question, and then answer the question. Because visual question answering draws on both computer vision and natural language processing, a natural solution is to combine the convolutional and recurrent neural networks that have been so successful in those two fields into a joint model.
  • The most commonly used convolutional neural networks are Res-net and VGG-net, and the most commonly used recurrent neural networks are LSTM and GRU.
  • Because visual question answering must process an image and a question at the same time, computation is often slow. When computing power is limited, for example on a mobile terminal, it takes longer to produce an answer.
  • Hedi Ben-younes et al. proposed the MUTAN fusion model in the paper MUTAN: Multimodal Tucker Fusion for Visual Question Answering.
  • MUTAN is based on a Tucker tensor decomposition into three factor matrices and a core tensor. Constraining the core tensor further controls the number of model parameters, which helps prevent overfitting during training and allows input/output predictions to be adjusted more flexibly.
  • The present invention is based on the MUTAN model. It uses shuffle-net to process images and the convolutional neural network TextCNN to process question sentences, which effectively reduces model complexity and makes the question answering system easier to port to mobile terminals.
  • The purpose of the present invention is to provide a question answering system and method that require little computing power and are easy to port to mobile terminals.
  • The technical solution adopted is as follows:
  • A lightweight visual question answering system includes an image processing module 1, a text processing module 2, a feature fuser 3, and a classifier 4. The image processing module 1 uses a convolutional neural network to extract image features and converts them into an image feature vector; the text processing module 2 extracts text features to form a text feature vector; the image feature vector and the text feature vector are both sent to the feature fuser 3 for fusion, and the fusion result is sent to the classifier 4 to form the final answer.
  • The image processing module 1 uses a shuffle-net model to extract image features.
  • The text processing module 2 uses TextCNN to extract text features.
  • The text processing module 2 includes an input layer 21, a convolutional layer 22, a pooling layer 23, and a fully connected layer 24.
  • The input layer 21 stacks the pre-trained word vectors of the words in the sentence to obtain an n*k matrix, where n is the preset sentence length (shorter sentences are padded with 0) and k is the length of a word vector. The input layer 21 is connected to the convolutional layer 22, which applies convolutional neural network processing to the input matrix.
  • The convolutional layer 22 includes multiple layers. The convolutional layer 22 is connected to the pooling layer 23, and the pooling layer 23 is connected to the fully connected layer 24.
  • The fully connected layer 24 produces the text features.
  • The feature fuser 3 uses the MUTAN model to perform Tucker decomposition and fuses the components to obtain the fusion result.
  • The classifier 4 is a SoftMax classifier, and the loss function used is the cross-entropy loss.
  • The system is embedded in a mobile terminal for use.
  • A lightweight visual question answering method uses a pre-trained shuffle-net model to extract image features, uses TextCNN to extract text features, and then uses the MUTAN model to fuse the image features with the text features to obtain the answer.
  • In the fusion method, the vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused to obtain a tensor $T$, and Tucker decomposition is performed on $T$ to obtain the core tensor $\tau_c$ and three factor matrices $W_q$, $W_v$, $W_o$, from which the fused feature $y$ is calculated:
  • Here $\times_i$ denotes multiplying a vector with the tensor along the i-th dimension; the final answer is obtained by feeding $y$ into the classifier.
  • The method is applied in a mobile terminal.
  • The advantage of the lightweight visual question answering system and method of the present invention is that model complexity is reduced in both image feature extraction and question text feature extraction, which makes it easier to port the question answering system to mobile terminals.
  • Figure 1 is the architecture diagram of the MUTAN fusion model.
  • Figure 2 is a block diagram of the lightweight visual question answering system.
  • Figure 3 is a structure diagram of the text processing module.
  • As shown in Figure 2, the lightweight visual question answering system of the present invention includes an image processing module 1, a text processing module 2, a feature fuser 3, and a classifier 4. The image to be examined enters the image processing module 1, which uses a convolutional neural network to extract image features and converts them into an image feature vector; the question text enters the text processing module 2, which extracts the text features to form a text feature vector; the image feature vector and the text feature vector are both sent to the feature fuser 3 for fusion, and the fusion result is sent to the classifier 4 to form the final answer.
  • In the image processing module 1, a pre-trained shuffle-net model is selected to extract features; the features of the last convolutional layer of shuffle-net are sent to the feature fuser.
  • The text processing module 2 uses TextCNN to process the question text. Its structure is shown in Figure 3.
  • In the input layer 21, the pre-trained word vectors corresponding to the words of the sentence are stacked to obtain an n*k matrix.
  • n is the preset sentence length; shorter sentences are padded with 0.
  • k is the length of a word vector.
  • The matrix is then processed by the convolutional neural network: the input layer 21 is connected to the convolutional layer 22, and features are extracted in the multiple convolutional layers 22.
  • The convolutional layer 22 is connected to the pooling layer 23, which pools the features using max pooling.
  • The pooling layer 23 is connected to the fully connected layer 24, which finally produces the text features.
  • In the feature fuser 3, the MUTAN model is used to perform Tucker decomposition, and the components are fused to obtain the fusion result.
  • The MUTAN fusion model was proposed by Hedi Ben-younes et al. in the paper MUTAN: Multimodal Tucker Fusion for Visual Question Answering, and its flow is shown in Figure 1.
  • The vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused to obtain a tensor $T$, and Tucker decomposition is performed on $T$ to obtain the core tensor $\tau_c$ and three factor matrices $W_q$, $W_v$, $W_o$, from which the fused feature $y$ is calculated:
  • Here $\times_i$ denotes multiplying a vector with the tensor along the i-th dimension; the final answer is obtained by feeding $y$ into the classifier.
  • The classifier 4 is a SoftMax layer, and the loss function selected for training is the cross-entropy loss, expressed as:
  • A lightweight visual question answering method uses a pre-trained shuffle-net model to extract image features, uses TextCNN to extract text features, and then uses the MUTAN model to fuse the image features with the text features to obtain the answer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a lightweight visual question-answering system and method. The system comprises an image processing module (1), a text processing module (2), a feature fuser (3) and a classifier (4), wherein the image processing module (1) uses a convolutional neural network to extract an image feature and converts same into an image feature vector; the text processing module (2) extracts a text feature to form a text feature vector; and the image feature vector and the text feature vector are both sent to the feature fuser (3) for fusion, and a fusion result is sent to the classifier (4) to form a final answer. The method can reduce the complexity of a model in two aspects comprising image feature extraction and question text feature extraction, so as to transplant a question-answering system to a mobile terminal.

Description

Lightweight visual question answering system and method
Technical field
The invention relates to the field of computer vision, and in particular to visual question answering technology.
Background
Deep learning, with its powerful feature-learning ability, is widely used in computer vision (CV) and natural language processing (NLP). Convolutional neural networks (CNNs) can extract and compress image information and are mostly used in image processing, while recurrent neural networks (RNNs) have been very successful in natural language processing, especially in speech recognition, machine translation, language modeling, and text generation.
Visual question answering is one of the most challenging problems in computer vision. The task is to have a computer automatically analyze an image and a question, and then answer the question. Because visual question answering draws on both computer vision and natural language processing, a natural solution is to combine the convolutional and recurrent neural networks that have been so successful in those two fields into a joint model. The most commonly used convolutional neural networks are Res-net and VGG-net, and the most commonly used recurrent neural networks are LSTM and GRU. However, because visual question answering must process an image and a question at the same time, computation is often slow; when computing power is limited, for example on a mobile terminal, it takes longer to produce an answer.
For fusing image information with text information, Hedi Ben-younes et al. proposed the MUTAN fusion model in the paper MUTAN: Multimodal Tucker Fusion for Visual Question Answering. As shown in Figure 1, the model is based on a Tucker tensor decomposition into three factor matrices and a core tensor. Constraining the core tensor further controls the number of model parameters, which helps prevent overfitting during training and allows input/output predictions to be adjusted more flexibly. The present invention is based on the MUTAN model. It uses shuffle-net to process images and the convolutional neural network TextCNN to process question sentences, which effectively reduces model complexity and makes the question answering system easier to port to mobile terminals.
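To make the parameter-count argument concrete, the following sketch compares an unfactorized fusion tensor with its Tucker factorization. All dimensions are hypothetical values chosen for illustration; none of them come from this application.

```python
def full_tensor_params(d_q: int, d_v: int, n_out: int) -> int:
    """Parameters in an unfactorized fusion tensor of shape d_q x d_v x n_out."""
    return d_q * d_v * n_out


def tucker_params(d_q: int, d_v: int, n_out: int,
                  t_q: int, t_v: int, t_o: int) -> int:
    """Parameters after Tucker decomposition: three factor matrices plus the core tensor."""
    factor_matrices = d_q * t_q + d_v * t_v + n_out * t_o
    core_tensor = t_q * t_v * t_o
    return factor_matrices + core_tensor


# Hypothetical sizes: text vector, image vector, answer vocabulary, projected sizes.
full = full_tensor_params(d_q=2400, d_v=2048, n_out=2000)
compact = tucker_params(d_q=2400, d_v=2048, n_out=2000, t_q=160, t_v=160, t_o=160)
print(f"full tensor: {full:,} parameters; Tucker form: {compact:,} parameters")
```

Shrinking the core tensor dimensions (`t_q`, `t_v`, `t_o` here) is the lever that keeps the parameter count small.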
Summary of the invention
The purpose of the present invention is to provide a question answering system and method that require little computing power and are easy to port to mobile terminals. The technical solution adopted is as follows:
A lightweight visual question answering system includes an image processing module 1, a text processing module 2, a feature fuser 3, and a classifier 4. The image processing module 1 uses a convolutional neural network to extract image features and converts them into an image feature vector; the text processing module 2 extracts text features to form a text feature vector; the image feature vector and the text feature vector are both sent to the feature fuser 3 for fusion, and the fusion result is sent to the classifier 4 to form the final answer.
Further, the image processing module 1 uses a shuffle-net model to extract image features.
Further, the text processing module 2 uses TextCNN to extract text features.
Further, the text processing module 2 includes an input layer 21, a convolutional layer 22, a pooling layer 23, and a fully connected layer 24. The input layer 21 stacks the pre-trained word vectors of the words in the sentence to obtain an n*k matrix, where n is the preset sentence length (shorter sentences are padded with 0) and k is the length of a word vector. The input layer 21 is connected to the convolutional layer 22, which applies convolutional neural network processing to the input matrix; the convolutional layer includes multiple layers. The convolutional layer 22 is connected to the pooling layer 23, the pooling layer 23 is connected to the fully connected layer 24, and the fully connected layer 24 finally produces the text features.
Further, the feature fuser 3 uses the MUTAN model to perform Tucker decomposition and fuses the components to obtain the fusion result.
Further, the classifier 4 is a SoftMax classifier, and the loss function used is the cross-entropy loss.
Further, the system is embedded in a mobile terminal for use.
A lightweight visual question answering method uses a pre-trained shuffle-net model to extract image features, uses TextCNN to extract text features, and then uses the MUTAN model to fuse the image features with the text features to obtain the answer.
Further, in the fusion method, the vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused to obtain a tensor $T$, and Tucker decomposition is performed on $T$ to obtain the core tensor $\tau_c$ and three factor matrices $W_q$, $W_v$, $W_o$, from which the fused feature $y$ is calculated:
Figure PCTCN2019124008-appb-000001
where $\times_i$ denotes multiplying a vector with the tensor along the i-th dimension; the final answer is obtained by feeding $y$ into the classifier.
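The equation itself survives only as an image reference in this text. As a hedged reconstruction, the cited MUTAN paper writes the fused feature in the form below, which is consistent with the Tucker decomposition of $T$ described in this application. The projected vectors $\tilde{q}$ and $\tilde{v}$ are notation introduced here for readability, and the exact typesetting of the original figure is an assumption.

```latex
\tilde{q} = q^{\top} W_q, \qquad
\tilde{v} = v^{\top} W_v, \qquad
y = \bigl( (\tau_c \times_1 \tilde{q}) \times_2 \tilde{v} \bigr) \times_3 W_o
```

Read left to right: both inputs are first projected to a lower dimension, the core tensor models their interaction, and $W_o$ maps the result to the answer space.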
Further, the method is applied in a mobile terminal.
The advantage of the lightweight visual question answering system and method of the present invention is that model complexity is reduced in both image feature extraction and question text feature extraction, which makes it easier to port the question answering system to mobile terminals.
Brief description of the drawings
Figure 1 is the architecture diagram of the MUTAN fusion model.
Figure 2 is a block diagram of the lightweight visual question answering system.
Figure 3 is a structure diagram of the text processing module.
Detailed description
As shown in Figure 2, the lightweight visual question answering system of the present invention includes an image processing module 1, a text processing module 2, a feature fuser 3, and a classifier 4. The image to be examined enters the image processing module 1, which uses a convolutional neural network to extract image features and converts them into an image feature vector. The question text enters the text processing module 2, which extracts the text features to form a text feature vector. The image feature vector and the text feature vector are both sent to the feature fuser 3 for fusion, and the fusion result is sent to the classifier 4 to form the final answer.
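The data flow between the four modules can be sketched as a plain function composition. This is only an illustration of the wiring: the function name `answer_question` and the toy stand-in modules are hypothetical and do not implement shuffle-net, TextCNN, or MUTAN.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def answer_question(
    image: object,
    question: str,
    image_module: Callable[[object], Vector],           # module 1
    text_module: Callable[[str], Vector],               # module 2
    feature_fuser: Callable[[Vector, Vector], Vector],  # module 3, called as fuser(q, v)
    classifier: Callable[[Vector], int],                # module 4, returns an answer index
    answers: Sequence[str],
) -> str:
    v = image_module(image)        # image feature vector
    q = text_module(question)      # text feature vector
    y = feature_fuser(q, v)        # fused feature
    return answers[classifier(y)]  # final answer


# Toy stand-ins (hypothetical) that only exercise the wiring.
def toy_image_module(image: object) -> Vector:
    return [1.0, 0.0]


def toy_text_module(question: str) -> Vector:
    return [float(len(question.split()))]


def toy_fuser(q: Vector, v: Vector) -> Vector:
    return [q[0] * v[0], q[0] * v[1]]


def toy_classifier(y: Vector) -> int:
    return max(range(len(y)), key=lambda i: y[i])


result = answer_question(None, "is it red", toy_image_module, toy_text_module,
                         toy_fuser, toy_classifier, ["yes", "no"])
```

Keeping each module behind a simple callable makes it easy to swap a heavier feature extractor for a lighter one, which is the design point of the system.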
In the image processing module 1, a pre-trained shuffle-net model is selected to extract features; the features of the last convolutional layer of shuffle-net are sent to the feature fuser.
The text processing module 2 uses TextCNN to process the question text; its structure is shown in Figure 3. In the input layer 21, the pre-trained word vectors corresponding to the words of the sentence are stacked to obtain an n*k matrix, where n is the preset sentence length (shorter sentences are padded with 0) and k is the length of a word vector. The matrix is then processed by the convolutional neural network: the input layer 21 is connected to the convolutional layer 22, and features are extracted in the multiple convolutional layers 22. The convolutional layer 22 is connected to the pooling layer 23, which pools the features using max pooling. The pooling layer 23 is connected to the fully connected layer 24, which finally produces the text features.
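The input, convolution, and pooling steps can be sketched in pure Python as follows. The sketch uses a single convolution kernel rather than multiple convolutional layers and omits the fully connected layer. Truncating sentences longer than n, mapping unknown words to zero vectors, and the ReLU activation are assumptions made for this illustration, not details from the application.

```python
from typing import Dict, List

Matrix = List[List[float]]


def build_input_matrix(words: List[str], embeddings: Dict[str, List[float]],
                       n: int, k: int) -> Matrix:
    """Stack word vectors into an n x k matrix, zero-padding short sentences."""
    rows = [list(embeddings.get(w, [0.0] * k)) for w in words[:n]]
    rows += [[0.0] * k for _ in range(n - len(rows))]
    return rows


def conv_over_words(matrix: Matrix, kernel: Matrix, bias: float = 0.0) -> List[float]:
    """Slide an h x k kernel down the sentence and apply ReLU to each response."""
    h = len(kernel)
    feature_map = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for i in range(h):
            total += sum(a * b for a, b in zip(matrix[start + i], kernel[i]))
        feature_map.append(max(0.0, total))
    return feature_map


def max_pool(feature_map: List[float]) -> float:
    """Max pooling over the whole feature map (one value per kernel)."""
    return max(feature_map)


# Toy 2-dimensional embeddings (hypothetical values).
toy_embeddings = {"what": [1.0, 0.0], "color": [0.0, 1.0]}
sentence_matrix = build_input_matrix(["what", "color"], toy_embeddings, n=4, k=2)
toy_kernel = [[1.0, 0.0], [0.0, 1.0]]  # covers two consecutive words
feature_map = conv_over_words(sentence_matrix, toy_kernel)
pooled = max_pool(feature_map)
```

In a full TextCNN, several kernels of different heights would each yield one pooled value, and the concatenated values would feed the fully connected layer.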
In the feature fuser 3, the MUTAN model is used to perform Tucker decomposition, and the components are fused to obtain the fusion result. The MUTAN fusion model was proposed by Hedi Ben-younes et al. in the paper MUTAN: Multimodal Tucker Fusion for Visual Question Answering, and its flow is shown in Figure 1.
The vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused to obtain a tensor $T$, and Tucker decomposition is performed on $T$ to obtain the core tensor $\tau_c$ and three factor matrices $W_q$, $W_v$, $W_o$, from which the fused feature $y$ is calculated:
Figure PCTCN2019124008-appb-000002
where $\times_i$ denotes multiplying a vector with the tensor along the i-th dimension; the final answer is obtained by feeding $y$ into the classifier.
The Tucker tensor decomposition takes the form $T = ((\tau_c \times_1 W_q) \times_2 W_v) \times_3 W_o$, where $T$ is obtained by fusing the text feature vector $q$ and the image feature vector $v$.
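A minimal pure-Python sketch of how a fused feature could be computed from these factors is given below. It follows the form used in the cited MUTAN paper: project q and v, contract both with the core tensor, then project to the answer space. The shape conventions (`w_q` as d_q x t_q, `w_v` as d_v x t_v, `core` as t_q x t_v x t_o, `w_o` as t_o x number of answers) and all function names are assumptions for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]
Tensor3 = List[List[List[float]]]


def vec_mat(x: Vector, w: Matrix) -> Vector:
    """Row vector times matrix: x (length d) times w (d x t) gives length t."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def tucker_fuse(q: Vector, v: Vector, w_q: Matrix, w_v: Matrix,
                w_o: Matrix, core: Tensor3) -> Vector:
    q_proj = vec_mat(q, w_q)  # projected text feature, length t_q
    v_proj = vec_mat(v, w_v)  # projected image feature, length t_v
    t_o = len(core[0][0])
    # Contract the core tensor with q_proj on mode 1 and v_proj on mode 2.
    z = [
        sum(core[a][b][c] * q_proj[a] * v_proj[b]
            for a in range(len(q_proj)) for b in range(len(v_proj)))
        for c in range(t_o)
    ]
    return vec_mat(z, w_o)    # mode-3 projection into the answer space


# Tiny hypothetical example: identity projections and a core that copies q_proj.
identity2 = [[1.0, 0.0], [0.0, 1.0]]
toy_core = [[[1.0, 0.0]], [[0.0, 1.0]]]  # shape 2 x 1 x 2
y = tucker_fuse(q=[1.0, 2.0], v=[3.0], w_q=identity2, w_v=[[1.0]],
                w_o=identity2, core=toy_core)
```

The full tensor $T$ is never materialized here; only the small core and the three factor matrices are stored, which is where the parameter savings come from.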
The classifier 4 is a SoftMax layer, and the loss function selected for training is the cross-entropy loss, expressed as:
Figure PCTCN2019124008-appb-000003
where $y_i$ denotes the true answer index, the symbol shown in Figure PCTCN2019124008-appb-000004 denotes the predicted answer index, $i = 1, \dots, |A|$, and $|A|$ is the number of distinct answers.
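The loss equation is only available as an image reference here. The sketch below therefore assumes the standard softmax cross-entropy, $L = -\sum_i y_i \log \hat{y}_i$ with a one-hot target, which reduces to the negative log-probability of the true answer. The function names are illustrative.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw classifier scores into probabilities (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_index: int) -> float:
    """Cross-entropy against a one-hot target: -log of the true answer's probability."""
    return -math.log(probs[true_index])


# Two equally scored answers give probability 0.5 each.
uniform_probs = softmax([0.0, 0.0])
uniform_loss = cross_entropy(uniform_probs, true_index=0)
```

At inference time only the index of the largest softmax output is needed, so the loss is evaluated during training only.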
A lightweight visual question answering method uses a pre-trained shuffle-net model to extract image features, uses TextCNN to extract text features, and then uses the MUTAN model to fuse the image features with the text features to obtain the answer.
Experiments show that using shuffle-net as the image feature extractor and TextCNN as the text feature extractor for visual question answering effectively reduces model complexity and makes it easier to port the question answering system to mobile terminals.

Claims (10)

  1. A lightweight visual question answering system, characterized in that it comprises an image processing module (1), a text processing module (2), a feature fuser (3), and a classifier (4), wherein the image processing module (1) uses a convolutional neural network to extract image features and converts them into an image feature vector; the text processing module (2) extracts text features to form a text feature vector; and the image feature vector and the text feature vector are both sent to the feature fuser (3) for fusion, and the fusion result is sent to the classifier (4) to form the final answer.
  2. The lightweight visual question answering system according to claim 1, characterized in that the image processing module (1) uses a shuffle-net model to extract image features.
  3. The lightweight visual question answering system according to claim 1 or 2, characterized in that the text processing module (2) uses TextCNN to extract text features.
  4. The lightweight visual question answering system according to claim 3, characterized in that the text processing module (2) comprises an input layer (21), a convolutional layer (22), a pooling layer (23), and a fully connected layer (24); the input layer (21) stacks the pre-trained word vectors of the words in the sentence to obtain an n*k matrix, where n is the preset sentence length, padded with 0 when the sentence is shorter, and k is the length of a word vector; the input layer (21) is connected to the convolutional layer (22), and the convolutional layer (22) applies convolutional neural network processing to the input matrix, the convolutional layer comprising multiple layers; the convolutional layer (22) is connected to the pooling layer (23), the pooling layer (23) is connected to the fully connected layer (24), and the fully connected layer (24) finally produces the text features.
  5. The lightweight visual question answering system according to any one of claims 1 to 4, characterized in that the feature fuser (3) uses the MUTAN model to perform Tucker decomposition and fuses the components to obtain the fusion result.
  6. The lightweight visual question answering system according to any one of claims 1 to 5, characterized in that the classifier (4) is a SoftMax classifier, and the loss function used is the cross-entropy loss.
  7. The lightweight visual question answering system according to any one of claims 1 to 6, characterized in that the system is embedded in a mobile terminal for use.
  8. A lightweight visual question answering method, characterized in that a pre-trained shuffle-net model is used to extract image features, TextCNN is used to extract text features, and the MUTAN model is then used to fuse the image features with the text features to obtain the answer.
  9. The lightweight visual question answering method according to claim 8, characterized in that, in the fusion method, the vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused to obtain a tensor $T$, and Tucker decomposition is performed on $T$ to obtain the core tensor $\tau_c$ and three factor matrices $W_q$, $W_v$, $W_o$, from which the fused feature $y$ is calculated:
    Figure PCTCN2019124008-appb-100001
    where $\times_i$ denotes multiplying a vector with the tensor along the i-th dimension; the final answer is obtained by feeding $y$ into the classifier.
  10. The lightweight visual question answering method according to claim 8, characterized in that the method is applied in a mobile terminal.
PCT/CN2019/124008 2018-12-12 2019-12-09 Lightweight visual question-answering system and method WO2020119631A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811518735.8 2018-12-12
CN201811518735.8A CN109784163A (en) 2018-12-12 2018-12-12 A kind of light weight vision question answering system and method

Publications (1)

Publication Number Publication Date
WO2020119631A1 true WO2020119631A1 (en) 2020-06-18

Family

ID=66496867

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/124008 WO2020119631A1 (en) 2018-12-12 2019-12-09 Lightweight visual question-answering system and method

Country Status (2)

Country Link
CN (1) CN109784163A (en)
WO (1) WO2020119631A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157889A (en) * 2021-04-21 2021-07-23 韶鼎人工智能科技有限公司 Visual question-answering model construction method based on theme loss
CN113792703A (en) * 2021-09-29 2021-12-14 山东新一代信息产业技术研究院有限公司 Image question-answering method and device based on Co-Attention deep modular network
CN113837212A (en) * 2020-06-24 2021-12-24 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN113918679A (en) * 2021-09-22 2022-01-11 三一汽车制造有限公司 Knowledge question and answer method and device and engineering machinery

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109784163A (en) * 2018-12-12 2019-05-21 中国科学院深圳先进技术研究院 A kind of light weight vision question answering system and method
CN110298338B (en) * 2019-06-20 2021-08-24 北京易道博识科技有限公司 Document image classification method and device
CN110348535B (en) * 2019-07-17 2022-05-31 北京金山数字娱乐科技有限公司 Visual question-answering model training method and device
CN111967487B (en) * 2020-03-23 2022-09-20 同济大学 Incremental data enhancement method for visual question-answer model training and application
CN111814843B (en) * 2020-03-23 2024-02-27 同济大学 End-to-end training method and application of image feature module in visual question-answering system
CN112100346B (en) * 2020-08-28 2021-07-20 西北工业大学 Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN112148891A (en) * 2020-09-25 2020-12-29 天津大学 Knowledge graph completion method based on graph perception tensor decomposition
CN112925904B (en) * 2021-01-27 2022-11-29 天津大学 Lightweight text classification method based on Tucker decomposition
CN113128415B (en) * 2021-04-22 2023-09-29 合肥工业大学 Environment distinguishing method, system, equipment and storage medium
CN113919344B (en) * 2021-09-26 2022-09-23 腾讯科技(深圳)有限公司 Text processing method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN106777185A (en) * 2016-12-23 2017-05-31 浙江大学 A kind of across media Chinese herbal medicine image search methods based on deep learning
CN107066583A (en) * 2017-04-14 2017-08-18 华侨大学 A kind of picture and text cross-module state sensibility classification method merged based on compact bilinearity
CN107679582A (en) * 2017-10-20 2018-02-09 深圳市唯特视科技有限公司 A kind of method that visual question and answer are carried out based on multi-modal decomposition model
CN108256549A (en) * 2017-12-13 2018-07-06 北京达佳互联信息技术有限公司 Image classification method, device and terminal
CN108763325A (en) * 2018-05-04 2018-11-06 北京达佳互联信息技术有限公司 A kind of network object processing method and processing device
CN109784163A (en) * 2018-12-12 2019-05-21 中国科学院深圳先进技术研究院 A kind of light weight vision question answering system and method

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138993B (en) * 2015-08-31 2018-07-27 小米科技有限责任公司 Establish the method and device of human face recognition model
CN105956608A (en) * 2016-04-21 2016-09-21 恩泊泰(天津)科技有限公司 Objective positioning and classifying algorithm based on deep learning
CN107368770B (en) * 2016-05-12 2021-05-11 江苏安纳泰克能源服务有限公司 Method and system for automatically identifying returning passenger
CN106055576B (en) * 2016-05-20 2018-04-10 大连理工大学 Fast and effective image retrieval method for large-scale data
CN106250918B (en) * 2016-07-26 2019-08-13 大连理工大学 Gaussian mixture model matching method based on improved earth mover's distance
CN106372581B (en) * 2016-08-25 2020-09-04 中国传媒大学 Method for constructing and training face recognition feature extraction network
US10282462B2 (en) * 2016-10-31 2019-05-07 Walmart Apollo, Llc Systems, method, and non-transitory computer-readable storage media for multi-modal product classification
CN108509519B (en) * 2018-03-09 2021-03-09 北京邮电大学 General knowledge graph enhanced question-answer interaction system and method based on deep learning
CN108564588B (en) * 2018-03-21 2020-07-10 华中科技大学 Built-up area automatic extraction method based on depth features and graph segmentation method
CN108875648A (en) * 2018-06-22 2018-11-23 深源恒际科技有限公司 A method of real-time vehicle damage and component detection based on mobile video stream

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837212A (en) * 2020-06-24 2021-12-24 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN113157889A (en) * 2021-04-21 2021-07-23 韶鼎人工智能科技有限公司 Visual question-answering model construction method based on theme loss
CN113918679A (en) * 2021-09-22 2022-01-11 三一汽车制造有限公司 Knowledge question and answer method and device and engineering machinery
CN113792703A (en) * 2021-09-29 2021-12-14 山东新一代信息产业技术研究院有限公司 Image question-answering method and device based on Co-Attention deep modular network
CN113792703B (en) * 2021-09-29 2024-02-02 山东新一代信息产业技术研究院有限公司 Image question-answering method and device based on Co-Attention depth modular network

Also Published As

Publication number Publication date
CN109784163A (en) 2019-05-21

Similar Documents

Publication Publication Date Title
WO2020119631A1 (en) Lightweight visual question-answering system and method
CN111554268B (en) Language identification method based on language model, text classification method and device
CN111340814B (en) RGB-D image semantic segmentation method based on multi-mode self-adaptive convolution
WO2021134277A1 (en) Emotion recognition method, intelligent device, and computer-readable storage medium
CN107563498A (en) Image description method and system based on a combined visual and semantic attention strategy
CN110866184A (en) Short video data label recommendation method and device, computer equipment and storage medium
CN113537024B (en) Weak supervision neural network sign language recognition method of multilayer time sequence attention fusion mechanism
CN113722458B (en) Visual question-answering processing method, device, computer readable medium, and program product
US20220318946A1 (en) Method for image shape transformation based on generative adversarial network
Mazaheri et al. Video fill in the blank with merging lstms
CN110795549A (en) Short text conversation method, device, equipment and storage medium
CN117121015A (en) Multimodal few-shot learning using frozen language models
CN112749556A (en) Multi-language model training method and device, storage medium and electronic equipment
Sevli et al. Turkish sign language digits classification with CNN using different optimizers
CN117789099B (en) Video feature extraction method and device, storage medium and electronic equipment
Mazaheri et al. Video fill in the blank using lr/rl lstms with spatial-temporal attentions
Thakar et al. Sign Language to Text Conversion in Real Time using Transfer Learning
CN109564633A (en) Artificial neural network
Chaikaew An applied holistic landmark with deep learning for Thai sign language recognition
CN117494762A (en) Training method of student model, material processing method, device and electronic equipment
CN115017900B (en) Conversation emotion recognition method based on multi-mode multi-prejudice
Sreemathy et al. Indian Sign Language interpretation using convolutional neural networks
Takayama et al. Masked batch normalization to improve tracking-based sign language recognition using graph convolutional networks
CN116311493A (en) Two-stage human-object interaction detection method based on coding and decoding architecture
CN115392232A (en) Topic and multi-mode fused emergency emotion analysis method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19894915

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 02.11.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19894915

Country of ref document: EP

Kind code of ref document: A1