CN111008293A - Visual question-answering method based on structured semantic representation - Google Patents

Visual question-answering method based on structured semantic representation

Info

Publication number
CN111008293A
CN111008293A (application CN201811164612.9A)
Authority
CN
China
Prior art keywords: image, vector, input, word, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811164612.9A
Other languages
Chinese (zh)
Inventor
Hongkai Xiong
Dongchen Yu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201811164612.9A
Publication of CN111008293A
Legal status: Pending

Classifications

    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06N 3/045: Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/424: Image or video recognition or understanding; extraction of features; syntactic representation, e.g. by using alphabets or grammars


Abstract

The invention provides a visual question-answering method based on structured semantic representation, which extracts image features of an input image through a convolutional neural network; extracts a word vector for each word of the input question related to the input image through a pre-trained word embedding model; weights the image features and the word vectors to obtain a weighted image feature vector and a weighted text feature vector; converts the word vectors into a structured semantic representation vector through a Tree-LSTM network; fuses the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector; and takes the fused feature vector as the input of a prediction model, which outputs the answer corresponding to the input question. The method extracts richer semantic information from the question and improves the performance of the prediction model through multi-layer training optimization, thereby improving the accuracy of the answers.

Description

Visual question-answering method based on structured semantic representation
Technical Field
The invention relates to the technical field of computer vision, in particular to a visual question answering method based on structured semantic representation.
Background
In the field of computer vision, visual question answering is a frontier and challenging problem: given a natural image, any question related to the image content may be asked. To predict the answer accurately, the visual question-answering model must fully capture the data and represent it robustly. The training scheme of the model is also critical, as the classifier boundary must be located accurately. Since language itself is compositional, different questions often share similar substructures, which means the reasoning process in visual question answering must likewise be compositional in order to extract useful information.
A search of the prior art shows that Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng and Alex Smola, in "Stacked attention networks for image question answering" (IEEE Conference on Computer Vision and Pattern Recognition, 2016), proposed a multi-layer attention mechanism in which image feature weights are computed from the text features multiple times, so that more accurate weighting information is obtained through repeated weighting. Jiasen Lu, Jianwei Yang, Dhruv Batra and Devi Parikh, in "Hierarchical question-image co-attention for visual question answering" (Advances in Neural Information Processing Systems, 2016), proposed a hierarchical co-attention approach that performs multi-layer feature extraction on the text to obtain three levels of feature vectors and weights the image and text feature vectors jointly.
Both methods use a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) network to extract text features; each word is treated relatively independently, and only the relation between the current word and the preceding words is considered. This chained structure, however, ignores the semantic structure information in the text, and this overlooked structural information is important in the visual question-answering process.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a visual question-answering method based on structured semantic representation.
The embodiment of the invention provides a visual question-answering method based on structured semantic representation, which comprises the following steps:
extracting image features of an input image through a convolutional neural network;
extracting a word vector for each word of the input question related to the input image through a pre-trained word embedding model;
weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors;
converting the word vector into a structured semantic representation vector through a Tree-LSTM network;
fusing the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector;
and taking the fused feature vector as the input of a prediction model, wherein the prediction model outputs the answer corresponding to the input question.
Optionally, extracting the image feature of the input image by a convolutional neural network includes:
cropping the input image into image blocks of a preset size;
and inputting the image block into a convolutional neural network, and taking the output of the last pooling layer of the convolutional neural network as the image feature of the input image.
Optionally, performing weighting processing on the image feature and the word vector to obtain a weighted image feature vector and a weighted text feature vector, including:
weighting the image features and the word vectors through a preset attention function to obtain the weighted image feature vector and the weighted text feature vector. The attention operation is defined as

$$\hat{x} = \mathcal{A}(X; g)$$

where $X$ is the unweighted image feature or word vector matrix, $g$ is the corresponding image or text attention guidance, representing the degree of matching between the image and the text, and $\hat{x}$ is the weighted feature vector: if $X$ is the image features, the weighted image feature vector is obtained; if $X$ is the word vectors, the weighted text feature vector is obtained. The attention function is calculated as:

$$H = \tanh\left(W_x X + (W_g g)\mathbf{1}^T\right)$$

$$a^x = \operatorname{softmax}\left(w_{hx}^T H\right)$$

$$\hat{x} = \sum_{i=1}^{N} a_i^x x_i$$

wherein $H$ represents an intermediate state of the attention computation; $\tanh$ is the hyperbolic tangent nonlinear activation function; $W_x$, $W_g$ and $w_{hx}$ are the three parameters to be learned; $\mathbf{1}$ is a vector with all elements equal to 1; $a^x$ represents the importance probability of each image region calculated from the text; $\operatorname{softmax}$ is the multi-class normalized exponential function; $a_i^x$ represents the weighted probability of the $i$-th image region; $x_i$ represents the $i$-th image region; $\hat{x}$ represents the weighted overall image feature vector; $i \in [1, N]$; and $N$ denotes the total number of image regions.
Optionally, converting the word vector into a structured semantic representation vector through a Tree-LSTM network, comprising:
obtaining a tree structure of each input question;
calculating a state vector corresponding to each node position on the tree structure, wherein each node position corresponds to one LSTM unit; the LSTM cell includes: an input gate, an output gate, and a memory cell;
and taking the state vector and the word vector corresponding to each node position as the input of the LSTM unit, and outputting the corresponding structured semantic representation vector by the LSTM unit.
Optionally, the fusing the image feature vector, the text feature vector, and the structured semantic representation vector includes:
and fusing the image feature vector, the text feature vector and the structured semantic representation vector by element-wise addition.
Optionally, the method further comprises:
establishing an initial visual question-answering model comprising the convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model;
taking a training image and an answer corresponding to the training image as the input of the initial visual question-answering model, training a preliminary model according to the existing label information, and outputting a predicted candidate answer by the initial visual question-answering model;
taking the training image, the predicted candidate answer and a supplementary image as the input of the initial visual question-answering model, and outputting a judgment result of whether the predicted candidate answer is correct or not by the initial visual question-answering model;
if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample;
and retraining the initial visual question-answering model with the positive and negative samples, setting a binary classification loss function on the discrimination result to adjust the parameters of the model during training, thereby obtaining the optimized visual question-answering model.
Compared with the prior art, the invention has the following beneficial effects:
according to the invention, the structural information in the language is applied to the process of visual question answering, richer semantic information can be obtained by means of the Tree-LSTM network, the feature dimension is enriched, and the performance of model answer prediction is improved. In the training process, a dual-channel network structure is provided, and more information in new data can be obtained. The performance of model prediction can be further improved by a multi-layer training optimization mode. Therefore, the accuracy of the answers of the visual question answering method is finally improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a visual question-answering model of the present invention;
FIG. 2 is a schematic diagram of a two-channel network architecture input;
FIG. 3 is a flowchart of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention;
fig. 4 is a flowchart of model training of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Fig. 1 is a schematic diagram of the visual question-answering model of the present invention. As shown in Fig. 1, image features are first extracted with a convolutional neural network; to preserve the position information of the image, the output of a pooling layer is usually selected as the image features. A word vector for each word is obtained with the trained word embedding model. The Tree-LSTM network then extracts the structural information of the text to obtain the structured semantic representation feature vector. A joint attention mechanism weights the word vectors and the image feature vectors by importance, computing the importance of the different image regions and of the different words in the sentence. The weighted image and text feature vectors and the structured semantic representation vector are fused as multi-modal information, and the fused feature vector is used to train the classifier. Fig. 2 is a schematic input diagram of the two-channel network architecture, which, as shown in Fig. 2, is used to further optimize the model parameters.
Optionally, in the feature extraction process, multi-layer convolution may be adopted to obtain multi-layer text feature vectors, so that more information is available to the question-answering model.
Optionally, in the feature alignment process, a multi-layer bidirectional attention mechanism may be adopted, so that important image regions and words are found more accurately and irrelevant information is filtered out, which can improve the performance of the visual question-answering model.
Optionally, the feature fusion method directly affects the final feature vector. Besides basic element-wise addition and multiplication, more elaborate fusion methods can be adopted, which have a more direct effect on classifier training.
Fig. 3 is a flowchart of a visual question-answering method based on structured semantic representation according to an embodiment of the present invention, and as shown in fig. 3, the method in this embodiment may include:
and S101, extracting image features of the input image through a convolutional neural network.
In this embodiment, the input image is cropped into image blocks of a preset size; the image blocks are input into a convolutional neural network, and the output of the last pooling layer of the convolutional neural network is taken as the image features of the input image. Specifically, the input image is cropped to 224 × 224 and input into the pre-trained VGG-19 network, and the pooling-layer output is taken as the image features, with dimensions 512 × 14 × 14.
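The following is a minimal PyTorch/torchvision sketch of this step (the file name and the use of a recent torchvision weights API are assumptions; truncating VGG-19 before its final max-pool is one way to obtain the 512 × 14 × 14 map described above from a 224 × 224 input):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# VGG-19 pretrained on ImageNet; dropping the final max-pool leaves a
# 512 x 14 x 14 feature map for a 224 x 224 input, as stated in the text.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
feature_extractor = torch.nn.Sequential(*list(vgg.features.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    feat = feature_extractor(img)       # (1, 512, 14, 14)
feat = feat.flatten(2).transpose(1, 2)  # (1, 196, 512): N = 196 image regions
```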
S102, extracting a word vector for each word of the input question related to the input image through a pre-trained word embedding model.
In this embodiment, a word vector of each word is obtained by using a trained word embedding model, and the dimension is 512.
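As an illustration only, the sketch below uses a randomly initialized embedding table as a stand-in for the pre-trained word embedding model; the toy vocabulary and lookup interface are hypothetical, and only the 512-dimensional output matches the text:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; a real system would load the pre-trained
# embedding weights and their accompanying vocabulary.
vocab = {"what": 0, "color": 1, "is": 2, "the": 3, "cat": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)

question = "what color is the cat"
ids = torch.tensor([vocab[w] for w in question.lower().split()])
word_vectors = embedding(ids)  # (5, 512): one 512-d word vector per word
```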
S103, weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors.
In this embodiment, the image features and the word vectors are weighted through a preset attention function to obtain the weighted image feature vector and the weighted text feature vector. The attention operation is defined as

$$\hat{x} = \mathcal{A}(X; g)$$

where $X$ is the unweighted image feature or word vector matrix, $g$ is the corresponding image or text attention guidance, representing the degree of matching between the image and the text, and $\hat{x}$ is the weighted feature vector: if $X$ is the image features, the weighted image feature vector is obtained; if $X$ is the word vectors, the weighted text feature vector is obtained. The attention function is calculated as:

$$H = \tanh\left(W_x X + (W_g g)\mathbf{1}^T\right)$$

$$a^x = \operatorname{softmax}\left(w_{hx}^T H\right)$$

$$\hat{x} = \sum_{i=1}^{N} a_i^x x_i$$

wherein $H$ represents an intermediate state of the attention computation; $\tanh$ is the hyperbolic tangent nonlinear activation function; $W_x$, $W_g$ and $w_{hx}$ are the three parameters to be learned; $\mathbf{1}$ is a vector with all elements equal to 1; $a^x$ represents the importance probability of each image region calculated from the text; $\operatorname{softmax}$ is the multi-class normalized exponential function; $a_i^x$ represents the weighted probability of the $i$-th image region; $x_i$ represents the $i$-th image region; $\hat{x}$ represents the weighted overall image feature vector; $i \in [1, N]$; and $N$ denotes the total number of image regions.
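The attention function above can be rendered as the following sketch (a non-authoritative PyTorch version; the hidden size k, the batch layout and the module names are assumptions not fixed by the text):

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, d=512, k=256):
        super().__init__()
        self.W_x = nn.Linear(d, k, bias=False)   # W_x
        self.W_g = nn.Linear(d, k, bias=False)   # W_g
        self.w_hx = nn.Linear(k, 1, bias=False)  # w_hx

    def forward(self, X, g):
        # X: (B, N, d) unweighted regions or words; g: (B, d) guidance vector.
        H = torch.tanh(self.W_x(X) + self.W_g(g).unsqueeze(1))  # broadcast = (W_g g) 1^T
        a = torch.softmax(self.w_hx(H).squeeze(-1), dim=1)      # (B, N) importance probs
        return (a.unsqueeze(-1) * X).sum(dim=1)                 # x_hat = sum_i a_i x_i

att = Attention()
x_hat = att(torch.randn(2, 196, 512), torch.randn(2, 512))     # (2, 512)
```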
And S104, converting the word vector into a structural semantic expression vector through a Tree-LSTM network.
In this embodiment, the tree structure of each input question is obtained; a state vector is calculated for each node position on the tree structure, each node position corresponding to one LSTM unit comprising an input gate, an output gate and a memory cell; the state vector and the word vector corresponding to each node position are taken as the input of the LSTM unit, which outputs the corresponding structured semantic representation vector.
Specifically, the syntax tree structure of the question sentence is obtained first; each word corresponds to a tree node in the syntax tree, and each node corresponds to an LSTM unit comprising an input gate $i_j$, an output gate $o_j$, a hidden-layer state value $h_j$ and a memory cell $c_j$. The state update of each node is determined by its child nodes:

$$\tilde{h}_j = \sum_{k \in C(j)} h_k$$

$$i_j = \sigma\left(W_i x_j + U_i \tilde{h}_j + b_i\right)$$

$$f_{jk} = \sigma\left(W_f x_j + U_f h_k + b_f\right)$$

$$o_j = \sigma\left(W_o x_j + U_o \tilde{h}_j + b_o\right)$$

$$u_j = \tanh\left(W_u x_j + U_u \tilde{h}_j + b_u\right)$$

$$c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k$$

$$h_j = o_j \odot \tanh(c_j)$$

wherein $C(j)$ denotes the set of all child nodes of the $j$-th node; $W_i$, $U_i$, $b_i$, $W_o$, $U_o$, $b_o$, $W_f$, $U_f$, $b_f$ are the parameters to be learned by the input, output and forget units respectively (with $W_u$, $U_u$, $b_u$ for the candidate update); $i_j$, $f_{jk}$, $o_j$ respectively represent the input control quantity, forget quantity and output quantity; $\sigma$ and $\tanh$ respectively denote the sigmoid and hyperbolic tangent activation functions; $c_j$ represents the state value of the memory cell at the $j$-th node position; $\odot$ is the element-wise product; and $x_j$ is the word vector at the $j$-th node. The input to each node consists of the states of its child nodes and the word vector of the current node. The state value at the final root node position is used in the visual question-answer modeling process as the representation of the entire sequence.
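One possible node update implementing the equations above is sketched below (parsing the question into its syntax tree and traversing it bottom-up are assumed to happen elsewhere; module and variable names are illustrative):

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, x_dim, h_dim):
        super().__init__()
        self.W_iou = nn.Linear(x_dim, 3 * h_dim)          # W_i, W_o, W_u stacked
        self.U_iou = nn.Linear(h_dim, 3 * h_dim, bias=False)
        self.W_f = nn.Linear(x_dim, h_dim)                # forget gate, one per child
        self.U_f = nn.Linear(h_dim, h_dim, bias=False)

    def forward(self, x_j, child_h, child_c):
        # x_j: (x_dim,) word vector at node j;
        # child_h, child_c: (K, h_dim) states of the K children (K may be 0).
        h_tilde = child_h.sum(dim=0)                      # sum of child hidden states
        i, o, u = (self.W_iou(x_j) + self.U_iou(h_tilde)).chunk(3)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
        f = torch.sigmoid(self.W_f(x_j) + self.U_f(child_h))   # f_jk for each child k
        c_j = i * u + (f * child_c).sum(dim=0)            # c_j = i (.) u + sum f_jk (.) c_k
        h_j = o * torch.tanh(c_j)                         # h_j = o (.) tanh(c_j)
        return h_j, c_j

cell = ChildSumTreeLSTMCell(512, 512)
# Leaf node: no children, so empty (0, 512) child-state tensors.
h_leaf, c_leaf = cell(torch.randn(512), torch.zeros(0, 512), torch.zeros(0, 512))
```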
And S105, carrying out fusion processing on the image feature vector, the text feature vector and the structured semantic expression vector to obtain a corresponding fusion feature vector.
In this embodiment, the image feature vector, the text feature vector and the structured semantic representation vector are fused by basic element-wise addition.
And S106, taking the fusion feature vector as the input of the prediction model, and outputting the answer corresponding to the input question by the prediction model.
In this embodiment, the fused feature vector is used as an input of the prediction model, and the prediction model outputs an answer corresponding to the input question.
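Steps S105 and S106 might look as follows in code (a sketch; the classifier architecture and the size of the candidate-answer set are assumptions, as the text specifies only element-wise addition followed by a prediction model):

```python
import torch
import torch.nn as nn

num_answers = 1000   # assumed size of the candidate-answer set

# Stand-ins for the three 512-d vectors produced by S103 and S104.
v_img  = torch.randn(1, 512)   # weighted image feature vector
v_txt  = torch.randn(1, 512)   # weighted text feature vector
v_tree = torch.randn(1, 512)   # structured semantic representation vector

fused = v_img + v_txt + v_tree           # S105: element-wise addition

classifier = nn.Sequential(              # S106: prediction model (assumed MLP)
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, num_answers),
)
answer_id = classifier(fused).argmax(dim=-1)  # index of the predicted answer
```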
Fig. 4 is a flowchart of model training of a visual question-answering method based on a structured semantic representation according to an embodiment of the present invention, and as shown in fig. 4, before executing S101, the method further includes:
s201, establishing an initial visual question-answer model comprising a convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model.
S202, taking the training images and the answers corresponding to the training images as the input of the initial visual question-answer model, and outputting predicted candidate answers by the initial visual question-answer model.
S203, the training image, the predicted candidate answer and the supplementary image are used as the input of the initial visual question-answering model, and the initial visual question-answering model outputs the judgment result of whether the predicted candidate answer is correct or not.
S204, if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample.
S205, retraining the initial visual question-answering model with the positive and negative samples, and adjusting the parameters of the model during training by setting a loss function on the judgment result, to obtain the optimized visual question-answering model.
In this embodiment, the training images and the answers corresponding to the training images are taken as the input of the initial visual question-answering model to predict candidate answers; this is a multi-class classification problem whose loss function is the cross-entropy (mutual information entropy) function, and minimizing it completes the first layer of training optimization. The training image, the predicted candidate answer and the supplementary image are then taken as the input of the initial visual question-answering model to judge whether the predicted candidate answer is correct; this is a binary classification problem whose loss function is also the cross-entropy function, and minimizing it completes the second layer of training optimization.
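The two layers of training optimization can be sketched as two cross-entropy objectives (all module names, shapes and the random stand-in inputs below are illustrative, not the patent's interfaces):

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()   # cross-entropy ("mutual information entropy") loss
num_answers, batch = 1000, 8

# Stand-ins for the fused features of the two input channels of Fig. 2.
fused_qa  = torch.randn(batch, 512)   # channel 1 features (ground-truth answers supervise loss1)
fused_jdg = torch.randn(batch, 512)   # channel 2 features (image, candidate answer, supplementary image)

answer_head = nn.Linear(512, num_answers)   # layer 1: multi-class answer prediction
judge_head  = nn.Linear(512, 2)             # layer 2: binary correct/incorrect judgment

# First-layer training optimization: minimize cross-entropy over candidate answers.
loss1 = ce(answer_head(fused_qa), torch.randint(num_answers, (batch,)))
loss1.backward()

# Second-layer training optimization: minimize the binary cross-entropy of the
# judgment over the positive/negative samples constructed in S204.
loss2 = ce(judge_head(fused_jdg), torch.randint(2, (batch,)))
loss2.backward()
```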
Following the above steps, experiments were carried out using the method of the Summary of the Invention. The data come from the public VQA 2.0 dataset, comprising 82783 training images, 40504 validation images and 81434 test images, with 443757 questions for training and 447793 test questions. The visual question-answering model based on structured semantic representation was compared with the hierarchical co-attention model of Jiasen Lu et al. ("Hierarchical question-image co-attention for visual question answering") and with a deep LSTM model.
Results: with the structured semantic representation, the answer accuracy improves by 1.4% and 4% over the hierarchical co-attention model and the deep LSTM model, respectively. Training with the two-channel network improves the answer accuracy of the hierarchical co-attention model by 1.01% and that of the proposed structured-semantic-representation model by 0.7%. The method therefore effectively improves the performance of visual question-answering models and has a certain generality.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (6)

1. A visual question-answering method based on structured semantic representation is characterized by comprising the following steps:
extracting image features of an input image through a convolutional neural network;
extracting a word vector for each word of the input question related to the input image through a pre-trained word embedding model;
weighting the image features and the word vectors to obtain weighted image feature vectors and weighted text feature vectors;
converting the word vector into a structured semantic representation vector through a Tree-LSTM network;
fusing the image feature vector, the text feature vector and the structured semantic representation vector to obtain a corresponding fused feature vector;
and taking the fused feature vector as the input of a prediction model, wherein the prediction model outputs the answer corresponding to the input question.
2. The visual question-answering method based on the structured semantic representation according to claim 1, wherein the extracting of the image features of the input image through a convolutional neural network comprises:
cropping the input image into image blocks of a preset size;
and inputting the image block into a convolutional neural network, and taking the output of the last pooling layer of the convolutional neural network as the image feature of the input image.
3. The visual question-answering method based on the structured semantic representation according to claim 1, wherein weighting the image features and the word vectors to obtain weighted image feature vectors and text feature vectors comprises:
weighting the image features and the word vectors through a preset attention function to obtain the weighted image feature vector and the weighted text feature vector. The attention operation is defined as

$$\hat{x} = \mathcal{A}(X; g)$$

where $X$ is the unweighted image feature or word vector matrix, $g$ is the corresponding image or text attention guidance, representing the degree of matching between the image and the text, and $\hat{x}$ is the weighted feature vector: if $X$ is the image features, the weighted image feature vector is obtained; if $X$ is the word vectors, the weighted text feature vector is obtained. The attention function is calculated as:

$$H = \tanh\left(W_x X + (W_g g)\mathbf{1}^T\right)$$

$$a^x = \operatorname{softmax}\left(w_{hx}^T H\right)$$

$$\hat{x} = \sum_{i=1}^{N} a_i^x x_i$$

wherein $H$ represents an intermediate state of the attention computation; $\tanh$ is the hyperbolic tangent nonlinear activation function; $W_x$, $W_g$ and $w_{hx}$ are the three parameters to be learned; $\mathbf{1}$ is a vector with all elements equal to 1; $a^x$ represents the importance probability of each image region calculated from the text; $\operatorname{softmax}$ is the multi-class normalized exponential function; $a_i^x$ represents the weighted probability of the $i$-th image region; $x_i$ represents the $i$-th image region; $\hat{x}$ represents the weighted overall image feature vector; $i \in [1, N]$; and $N$ denotes the total number of image regions.
4. The visual question-answering method based on structured semantic representation according to claim 1, wherein the word vector is converted into a structured semantic representation vector through a Tree-LSTM network, comprising:
obtaining a tree structure of each input question;
calculating a state vector corresponding to each node position on the tree structure, wherein each node position corresponds to one LSTM unit; the LSTM cell includes: an input gate, an output gate, and a memory cell;
and taking the state vector and the word vector corresponding to each node position as the input of the LSTM unit, and outputting the corresponding structured semantic representation vector by the LSTM unit.
5. The visual question-answering method based on the structured semantic representation according to claim 1, wherein the fusion processing of the image feature vector, the text feature vector and the structured semantic representation vector comprises:
and fusing the image feature vector, the text feature vector and the structured semantic representation vector by element-wise addition.
6. The visual question-answering method based on structured semantic representation according to any one of claims 1 to 5, characterized by further comprising:
establishing an initial visual question-answering model comprising the convolutional neural network, a pre-trained word embedding model, a Tree-LSTM network and a prediction model;
taking a training image and an answer corresponding to the training image as the input of the initial visual question-answering model, training a preliminary model according to the existing label information, and outputting a predicted candidate answer by the initial visual question-answering model;
taking the training image, the predicted candidate answer and a supplementary image as the input of the initial visual question-answering model, and outputting a judgment result of whether the predicted candidate answer is correct or not by the initial visual question-answering model;
if the judgment result is correct, the training image, the predicted candidate answer and the supplementary image form a positive sample; if the judgment result is wrong, the training image, the predicted candidate answer and the supplementary image form a negative sample;
and retraining the initial visual question-answering model with the positive and negative samples, setting a binary classification loss function on the discrimination result to adjust the parameters of the model during training, thereby obtaining the optimized visual question-answering model.
CN201811164612.9A 2018-10-06 2018-10-06 Visual question-answering method based on structured semantic representation Pending CN111008293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811164612.9A CN111008293A (en) 2018-10-06 2018-10-06 Visual question-answering method based on structured semantic representation


Publications (1)

Publication Number Publication Date
CN111008293A true CN111008293A (en) 2020-04-14

Family

ID=70110598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811164612.9A Pending CN111008293A (en) 2018-10-06 2018-10-06 Visual question-answering method based on structured semantic representation

Country Status (1)

Country Link
CN (1) CN111008293A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination