WO2020119631A1

WO2020119631A1 - Lightweight visual question-answering system and method

Info

Publication number: WO2020119631A1
Application number: PCT/CN2019/124008
Authority: WO
Inventors: 王磊; 赖坤耀; 程俊
Original assignee: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
Priority date: 2018-12-12
Filing date: 2019-12-09
Publication date: 2020-06-18
Also published as: CN109784163A

Abstract

Disclosed are a lightweight visual question-answering system and method. The system comprises an image processing module (1), a text processing module (2), a feature fuser (3) and a classifier (4). The image processing module (1) uses a convolutional neural network to extract image features and converts them into an image feature vector; the text processing module (2) extracts text features to form a text feature vector; the image feature vector and the text feature vector are both sent to the feature fuser (3) for fusion, and the fusion result is sent to the classifier (4) to form the final answer. The method reduces model complexity in two respects, image feature extraction and question text feature extraction, so that the question-answering system can be ported to a mobile terminal.

Description

Lightweight visual question answering system and method

Technical field

The invention relates to the field of computer vision, in particular to the field of visual question answering technology.

Background art

Deep learning is widely used in computer vision (CV) and natural language processing (NLP) because of its powerful feature-learning ability. Convolutional neural networks (CNNs) can extract and compress image information and are mostly used in image processing, while recurrent neural networks (RNNs) have achieved great success in natural language processing, especially in speech recognition, machine translation, language modeling and text generation.

Visual question answering is one of the most challenging problems in the field of computer vision. The task is to have a computer automatically analyze an image and a question about it and give an answer to the question. Because visual question answering involves both computer vision and natural language processing, a natural solution is to combine the convolutional neural networks and recurrent neural networks that have been very successful in those two fields into a single model. The most commonly used convolutional neural networks are Res-net and VGG-net, and the most commonly used recurrent neural networks are LSTM and GRU. However, because visual question answering must process an image and a question at the same time, computation is often slow. When computing power is limited, as on a mobile terminal, it takes longer to obtain an answer.

For fusing image information with text information, Hedi Ben-younes et al. proposed the MUTAN fusion model in the paper "MUTAN: Multimodal Tucker Fusion for Visual Question Answering". The model uses Tucker decomposition to factor the fusion tensor into three factor matrices and a core tensor. Constraining the core tensor further controls the number of model parameters, which helps prevent overfitting during training and allows the input and output projections to be adjusted more flexibly. The present invention builds on the MUTAN model, uses shuffle-net to process images, and uses the convolutional neural network TextCNN to process question sentences, which effectively reduces the complexity of the model and makes it easier to port question answering systems to mobile terminals.
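The effect of the factorization on model size can be illustrated with a small parameter count. The sketch below compares an unfactored fusion tensor with its Tucker-factored form. The function names and all dimensions (`d_q`, `d_v`, `d_o`, `t_q`, `t_v`, `t_o`) are hypothetical values chosen for illustration and are not taken from the patent.

```python
def full_tensor_params(d_q: int, d_v: int, d_o: int) -> int:
    """Parameters in an unfactored fusion tensor T of shape (d_q, d_v, d_o)."""
    return d_q * d_v * d_o


def tucker_params(d_q: int, d_v: int, d_o: int,
                  t_q: int, t_v: int, t_o: int) -> int:
    """Parameters after Tucker decomposition: three factor matrices plus the core tensor."""
    factor_matrices = d_q * t_q + d_v * t_v + t_o * d_o
    core_tensor = t_q * t_v * t_o
    return factor_matrices + core_tensor


# Hypothetical sizes, for illustration only (not values from the patent).
full = full_tensor_params(d_q=1024, d_v=1024, d_o=1000)
factored = tucker_params(d_q=1024, d_v=1024, d_o=1000, t_q=64, t_v=64, t_o=64)
print(full, factored)
```

With these assumed sizes the factored form needs several orders of magnitude fewer parameters than the full tensor, and the size of the core tensor (`t_q * t_v * t_o`) is the term that the constraint acts on.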

Summary of the invention

The purpose of the present invention is to propose a question answering system and method that have low computing power requirements and are easy to port to a mobile terminal. The technical solution adopted is as follows:

A lightweight visual question answering system includes an image processing module 1, a text processing module 2, a feature fuser 3, and a classifier 4, wherein the image processing module 1 uses a convolutional neural network to extract image features and convert them into an image feature vector; the text processing module 2 extracts text features to form a text feature vector; both the image feature vector and the text feature vector are sent to the feature fuser 3 for fusion, and the fusion result is sent to the classifier 4 to form the final answer.

Further, the image processing module 1 uses a shuffle-net model to extract image features.

Further, the text processing module 2 uses TextCNN to extract text features.

Further, the text processing module 2 includes an input layer 21, a convolutional layer 22, a pooling layer 23, and a fully connected layer 24. The input layer 21 arranges the pre-trained word vectors of the words in the sentence together to obtain an n*k matrix, where n is the preset sentence length (padded with 0 when the sentence is shorter) and k is the length of the word vector. The input layer 21 is connected to the convolutional layer 22, which applies convolutional neural network processing to the input matrix and includes multiple layers. The convolutional layer 22 is connected to the pooling layer 23, and the pooling layer 23 is connected to the fully connected layer 24, which finally yields the text features.

Further, the feature fuser 3 adopts the MUTAN model to perform Tucker decomposition, fuses the components, and obtains the fusion result.

Further, the classifier 4 is a SoftMax classifier, and the loss function used is a cross-entropy loss function.

Further, the system is embedded and used in mobile terminals.

A lightweight visual question answering method that uses a pre-trained shuffle-net model to extract image features, uses TextCNN to extract text features, and then uses the MUTAN model to fuse the image features with the text features to obtain answers.

Further, the fusion method is as follows: the vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused through a tensor $T$; Tucker decomposition of $T$ yields the core tensor $\tau_c$ and the three internal model matrices $W_q$, $W_v$, $W_o$, and the fusion feature $y$ is calculated as:

$$y = \left(\left(\tau_c \times_1 (q^{\top} W_q)\right) \times_2 (v^{\top} W_v)\right) \times_3 W_o$$

where $\times_i$ denotes multiplication of the tensor by a vector or matrix along its $i$-th dimension. The final answer is obtained by feeding $y$ into the classifier.

Further, the method is applied on a mobile terminal.

The advantage of the lightweight visual question answering system and method of the present invention is that model complexity is reduced in two respects, image feature extraction and question text feature extraction, which makes it easier to port the question answering system to a mobile terminal.

Brief description of the drawings

Figure 1 is the architecture diagram of MUTAN fusion model.

Figure 2 is a block diagram of a lightweight visual question answering system.

Figure 3 shows the structure of the text processing module.

Detailed description

As shown in FIG. 2, the lightweight visual question answering system of the present invention includes an image processing module 1, a text processing module 2, a feature fuser 3, and a classifier 4. The image to be analyzed enters the image processing module 1, which uses a convolutional neural network to extract image features and convert them into an image feature vector. The question text enters the text processing module 2, where the text features are extracted to form a text feature vector. Both the image feature vector and the text feature vector are sent to the feature fuser 3 for fusion, and the fusion result is sent to the classifier 4 to form the final answer.

In the image processing module 1, a pre-trained shuffle-net model is selected to extract features, and the features of the last convolutional layer of the shuffle-net are sent to the feature fuser 3.
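The patent does not describe the internals of shuffle-net or how the last convolutional feature map becomes a vector. As a hedged illustration only, the sketch below shows the channel shuffle operation from which the ShuffleNet design takes its name, followed by global average pooling. The pooling step is an assumption of this sketch, and the function names `channel_shuffle` and `to_feature_vector` are hypothetical.

```python
import numpy as np


def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Interleave channels across groups for a feature map of shape (C, H, W)."""
    c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    # Split channels into (groups, C // groups), swap those two axes, flatten back to C.
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)


def to_feature_vector(feature_map: np.ndarray) -> np.ndarray:
    """Global average pooling, (C, H, W) -> (C,). Assumed conversion, not from the patent."""
    return feature_map.mean(axis=(1, 2))


# Toy feature map with 6 channels whose values equal the channel index.
demo_map = np.arange(6, dtype=float).reshape(6, 1, 1) * np.ones((6, 2, 2))
shuffled = channel_shuffle(demo_map, groups=2)
image_vector = to_feature_vector(shuffled)
print(image_vector)
```

In a real deployment the feature map would come from a pre-trained network rather than a toy array; the sketch only shows the data movement.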

The text processing module 2 uses TextCNN to process the question text. Its structure is shown in FIG. 3. In the input layer 21, the pre-trained word vectors corresponding to the words in the sentence are arranged together to obtain an n*k matrix, where n is the preset sentence length (padded with 0 when the sentence is shorter) and k is the length of the word vector. The matrix is then processed as in a convolutional neural network: the input layer 21 is connected to the convolutional layer 22, and features are extracted in the multiple convolutional layers 22. The convolutional layer 22 is connected to the pooling layer 23, which pools the features using max pooling. The pooling layer 23 is connected to the fully connected layer 24, which finally yields the text features.
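The data flow through the four layers can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the "multiple convolutional layers" are read here as parallel filter banks with different window heights, a ReLU is applied before max pooling, unknown words are mapped to zero rows, and all weights are random. Every name and size in the code is hypothetical.

```python
import numpy as np


def build_input_matrix(tokens, embeddings, n, k):
    """Input layer: stack word vectors into an n*k matrix, zero-padding short sentences."""
    matrix = np.zeros((n, k))
    for row, token in enumerate(tokens[:n]):
        # Unknown words stay as zero rows (an assumption of this sketch).
        matrix[row] = embeddings.get(token, np.zeros(k))
    return matrix


def conv_max_pool(matrix, filters):
    """Convolution over word windows, ReLU, then max pooling over positions.

    filters has shape (num_filters, h, k), where h is the window height in words.
    """
    num_filters, h, k = filters.shape
    n = matrix.shape[0]
    windows = np.stack([matrix[i:i + h] for i in range(n - h + 1)])  # (n-h+1, h, k)
    activations = np.einsum("phk,fhk->pf", windows, filters)         # (n-h+1, num_filters)
    return np.maximum(activations, 0.0).max(axis=0)                  # (num_filters,)


def text_features(tokens, embeddings, filter_banks, w_fc, b_fc, n, k):
    """Input layer -> convolution -> max pooling -> fully connected layer."""
    matrix = build_input_matrix(tokens, embeddings, n, k)
    pooled = np.concatenate([conv_max_pool(matrix, f) for f in filter_banks])
    return pooled @ w_fc + b_fc


# Toy usage with random stand-ins for pre-trained vectors and trained weights.
rng = np.random.default_rng(0)
n, k = 10, 8
tokens = ["what", "color", "is", "the", "cat"]
embeddings = {word: rng.normal(size=k) for word in tokens}
filter_banks = [rng.normal(size=(4, h, k)) for h in (2, 3)]
w_fc, b_fc = rng.normal(size=(8, 16)), np.zeros(16)

matrix_demo = build_input_matrix(tokens, embeddings, n, k)
q = text_features(tokens, embeddings, filter_banks, w_fc, b_fc, n, k)
print(matrix_demo.shape, q.shape)
```

The five-word question fills the first five rows of the matrix and the remaining rows are zero padding, which matches the "supplemented by 0" rule for sentences shorter than n.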

In the feature fuser 3, the MUTAN model is used to perform Tucker decomposition, and the components are fused to obtain the fusion result. The MUTAN fusion model was proposed by Hedi Ben-younes et al. in the paper "MUTAN: Multimodal Tucker Fusion for Visual Question Answering", and its flow is shown in Figure 1.

The vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused through a tensor $T$. Tucker decomposition of $T$ yields the core tensor $\tau_c$ and the three internal model matrices $W_q$, $W_v$, $W_o$, from which the fusion feature $y$ is calculated:

$$y = \left(\left(\tau_c \times_1 (q^{\top} W_q)\right) \times_2 (v^{\top} W_v)\right) \times_3 W_o$$

The Tucker decomposition of the tensor is:

$$T = \left(\left(\tau_c \times_1 W_q\right) \times_2 W_v\right) \times_3 W_o$$

where $T$ is the tensor that fuses the text feature vector $q$ with the image feature vector $v$.
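The relationship between the factored computation and the full tensor T can be checked numerically. The sketch below assumes the fusion feature takes the form given in the MUTAN paper, y = ((τ_c ×₁ (qᵀW_q)) ×₂ (vᵀW_v)) ×₃ W_o, with W_q of shape (d_q, t_q), W_v of shape (d_v, t_v) and W_o of shape (t_o, d_o). The function names, dimensions and random weights are hypothetical.

```python
import numpy as np


def mutan_fuse(q, v, w_q, w_v, w_o, core):
    """Factored fusion: project q and v, contract with the core tensor, project to the output."""
    q_proj = q @ w_q                                    # (t_q,)
    v_proj = v @ w_v                                    # (t_v,)
    z = np.einsum("abc,a,b->c", core, q_proj, v_proj)   # (t_o,)
    return z @ w_o                                      # (d_o,)


def full_tensor(w_q, w_v, w_o, core):
    """T = ((core x1 W_q) x2 W_v) x3 W_o, with shape (d_q, d_v, d_o)."""
    return np.einsum("abc,ia,jb,ck->ijk", core, w_q, w_v, w_o)


# Hypothetical small sizes so the full tensor can be materialized for comparison.
rng = np.random.default_rng(0)
d_q, d_v, d_o = 6, 5, 4
t_q, t_v, t_o = 3, 2, 3
q, v = rng.normal(size=d_q), rng.normal(size=d_v)
w_q = rng.normal(size=(d_q, t_q))
w_v = rng.normal(size=(d_v, t_v))
w_o = rng.normal(size=(t_o, d_o))
core = rng.normal(size=(t_q, t_v, t_o))

y = mutan_fuse(q, v, w_q, w_v, w_o, core)
# Reference: contract q and v directly with the full tensor T.
y_ref = np.einsum("ijk,i,j->k", full_tensor(w_q, w_v, w_o, core), q, v)
print(np.allclose(y, y_ref))
```

The factored path never builds T, which is the point of the decomposition: only the three matrices and the core tensor need to be stored and multiplied.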

The classifier 4 is a SoftMax layer, and the loss function selected for training is the cross-entropy loss, expressed as:

$$L = -\sum_{i=1}^{|A|} y_i \log \hat{y}_i$$

where $y_i$ represents the actual answer and $\hat{y}_i$ the predicted answer at index $i$, with $i = 1, \dots, |A|$, and $|A|$ is the number of different answers.
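A minimal sketch of the classification step follows. It assumes a one-hot encoding of the actual answer, in which case the cross-entropy sum over the |A| answers reduces to the negative log-probability assigned to the actual answer. The function names and the toy scores are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn classifier scores over the |A| candidate answers into probabilities."""
    shifted = logits - logits.max()   # subtract the maximum for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(probs: np.ndarray, answer_index: int) -> float:
    """Cross-entropy against a one-hot target: -log of the probability of the actual answer."""
    return float(-np.log(probs[answer_index]))


# Toy scores for |A| = 4 candidate answers; the actual answer is index 2.
scores = np.array([1.0, 0.5, 3.0, -1.0])
probs = softmax(scores)
loss = cross_entropy(probs, answer_index=2)
predicted_index = int(np.argmax(probs))
print(predicted_index, loss)
```

At inference time only the `argmax` over the probabilities is needed to select the answer; the loss is used during training.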

Experiments show that using shuffle-net as the image feature extractor and TextCNN as the text feature extractor for visual question answering effectively reduces the complexity of the model and makes it easier to port the question answering system to a mobile terminal.

Claims

1. A lightweight visual question answering system, characterized in that it includes an image processing module (1), a text processing module (2), a feature fuser (3), and a classifier (4), wherein the image processing module (1) uses a convolutional neural network to extract image features and convert them into an image feature vector; the text processing module (2) extracts text features to form a text feature vector; both the image feature vector and the text feature vector are sent to the feature fuser (3) for fusion, and the fusion result is sent to the classifier (4) to form the final answer.
2. The lightweight visual question answering system according to claim 1, wherein the image processing module (1) uses a shuffle-net model to extract image features.
3. The lightweight visual question answering system according to claim 1 or 2, wherein the text processing module (2) uses TextCNN to extract text features.
4. The lightweight visual question answering system according to claim 3, wherein the text processing module (2) includes an input layer (21), a convolutional layer (22), a pooling layer (23), and a fully connected layer (24); the input layer (21) arranges the pre-trained word vectors of the words in the sentence together to obtain an n*k matrix, where n is a preset sentence length, padded with 0 when the sentence is shorter, and k is the length of the word vector; the input layer (21) is connected to the convolutional layer (22), the convolutional layer (22) performs convolutional neural network processing on the input matrix, and the convolutional layer includes multiple layers; the convolutional layer (22) is connected to the pooling layer (23), the pooling layer (23) is connected to the fully connected layer (24), and the text features are finally obtained from the fully connected layer (24).
5. The lightweight visual question answering system according to any one of claims 1 to 4, characterized in that the feature fuser (3) uses the MUTAN model to perform Tucker decomposition, fuses the components, and obtains the fusion result.
6. The lightweight visual question answering system according to any one of claims 1 to 5, wherein the classifier (4) is a SoftMax classifier, and the loss function used is a cross-entropy loss function.
7. The lightweight visual question answering system according to any one of claims 1 to 6, wherein the system is embedded in a mobile terminal and used.
8. A lightweight visual question answering method, characterized in that a pre-trained shuffle-net model is used to extract image features, TextCNN is used to extract text features, and the MUTAN model is then used to fuse the image features with the text features to obtain answers.
9. The lightweight visual question answering method according to claim 8, wherein the fusion method is as follows: the vector $q$ obtained from the text feature extractor and the vector $v$ obtained from the image feature extractor are fused through a tensor $T$; Tucker decomposition of $T$ yields the core tensor $\tau_c$ and the three internal model matrices $W_q$, $W_v$, $W_o$, and the fusion feature $y$ is calculated as:

$$y = \left(\left(\tau_c \times_1 (q^{\top} W_q)\right) \times_2 (v^{\top} W_v)\right) \times_3 W_o$$

where $\times_i$ denotes multiplication of the tensor by a vector or matrix along its $i$-th dimension, and the final answer is obtained by feeding $y$ into the classifier.
10. The lightweight visual question answering method according to claim 8, wherein the method is applied in a mobile terminal.