CN112100346B - Visual question-answering method based on fusion of fine-grained image features and external knowledge - Google Patents


Info

Publication number
CN112100346B
Authority
CN
China
Prior art keywords
vector
image
feature
knowledge
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010883275.XA
Other languages
Chinese (zh)
Other versions
CN112100346A (en)
Inventor
宋凌云
李建鳌
尚学群
俞梦真
彭杨柳
李伟
李战怀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010883275.XA priority Critical patent/CN112100346B/en
Publication of CN112100346A publication Critical patent/CN112100346A/en
Application granted granted Critical
Publication of CN112100346B publication Critical patent/CN112100346B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/367: Ontology
    • G06F 18/2411: Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/253: Fusion techniques of extracted features
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V 10/751: Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual question-answering method based on the fusion of fine-grained image features and external knowledge, which comprises four stages: fine-grained image feature extraction; text processing and feature extraction; question knowledge retrieval based on an external knowledge base; and multi-modal feature fusion with answer prediction. Fine-grained image feature extraction obtains the regional visual features of the image. Text processing and feature extraction processes the question sentence of the visual question and produces its overall feature. Question knowledge retrieval introduces the Freebase knowledge graph as the model's external knowledge base, supplying the common-sense or domain-specific knowledge needed to predict the answers to visual questions. Multi-modal feature fusion and answer prediction fuses the multi-modal features with a similarity-based fusion method and uses the fused visual-question feature to predict the answer. The method performs well and achieves high prediction accuracy on visual question answers.

Description

Visual question-answering method based on fusion of fine-grained image features and external knowledge
Technical Field
The invention belongs to the field of intelligent information processing, and particularly relates to a visual question answering method.
Background
Visual Question Answering (VQA) is an interdisciplinary research direction combining computer vision and natural language processing, whose goal is to enable computers to predict answers to visual questions. Specifically, an image and an open question related to that image are input to the computer; the visual question-answering system must first understand the semantics of the question text and then combine them with the visual information of the related image to predict the answer. The task requires the computer to deeply understand both the content of the image and the semantics of the question, and answering some questions further requires mastery of related common-sense or domain-specific knowledge. Visual question-answering research therefore involves many artificial intelligence techniques, including fine-grained recognition, object recognition, behavior recognition and natural language processing, and it places higher demands on image semantic understanding, and poses greater challenges, than traditional computer vision research.
Prior research on visual question answering exists, but it relies on global image features, so fine-grained visual features highly correlated with the question text cannot be obtained, and applicability to fine-grained visual questions is poor. Most methods also focus only on the content of the visual question itself, which greatly limits their application scenarios; they answer fine-grained image questions poorly and cannot reason beyond the visual question itself.
Disclosure of Invention
To overcome these defects of the prior art, the invention provides a visual question-answering method based on the fusion of fine-grained image features and external knowledge, comprising four stages: fine-grained image feature extraction; text processing and feature extraction; question knowledge retrieval based on an external knowledge base; and multi-modal feature fusion with answer prediction. Fine-grained image feature extraction obtains the regional visual features of the image. Text processing and feature extraction processes the question sentence of the visual question and produces its overall feature. Question knowledge retrieval introduces the Freebase knowledge graph as the model's external knowledge base, supplying the common-sense or domain-specific knowledge needed to predict the answers to visual questions. Multi-modal feature fusion and answer prediction fuses the multi-modal features with a similarity-based fusion method and uses the fused visual-question feature to predict the answer. The method performs well and, measured by the standard visual question-answering accuracy index, achieves high prediction accuracy on visual question answers.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
Step 1: extracting fine-grained image features;
Step 1-1: take the original image as input and segment it with an unsupervised image segmentation algorithm, marking each segmented region with a different RGB color value; then resize the image to d₁×d₁×3;
Step 1-2: take a pre-trained VGG-16 network with its fully connected layers and Softmax layer removed as the image feature extractor; input the original image into the extractor and take the output of its last convolutional layer as the feature map of the original image, whose size is d₂×d₂×512;
Step 1-3: map the segmented regions of the original image from step 1-1 onto the feature map with the ROI projection method, partition the feature map according to the mapping result, and establish a one-to-one correspondence between the segmented regions of the original image and those of the feature map; max-pool the feature map within each region to obtain a 512-dimensional image feature vector per segmented region, where each dimension takes the maximum value of the corresponding feature-map channel inside that region;
Step 2: text processing and feature extraction;
Step 2-1: tokenize the question sentence of the visual question with the NLTK toolkit and convert each token into a one-hot word vector;
Step 2-2: embed the one-hot word vectors into a word vector space with the word embedding technique GloVe;
Step 2-3: encode the word vectors with an LSTM network and take the hidden state vector of the LSTM unit at the last time step as the question text feature vector q;
Step 3: question knowledge retrieval based on an external knowledge base;
Step 3-1: tokenize the question sentence and perform part-of-speech tagging with the NLTK part-of-speech tagger, marking the nouns and verbs in the question; then lemmatize the marked words, unifying singular and plural nouns into the singular form and tense variants of verbs into the verb's base form;
Step 3-2: use the nouns and verbs marked in step 3-1 as keywords to search the knowledge graph serving as the external knowledge base, matching corresponding knowledge triples to the question sentence of the visual question;
Step 3-3: encode the entities and relations of the knowledge graph with the TransE algorithm to obtain an encoding vector for each entity and relation contained in every matched knowledge triple, each of dimension H₁; splice each matched knowledge triple in the order head entity vector, relation vector, tail entity vector to obtain an H₂-dimensional knowledge triple feature vector k;
Step 4: multi-modal feature fusion and answer prediction;
Step 4-1: compute the cosine similarity between the image feature vector of each segmented region of the original image and the question text feature vector q;
Step 4-2: use each region's cosine similarity as a weighting coefficient and multiply it by that region's image feature vector; sum the weighted vectors over all regions to obtain the overall image feature vector v;
Step 4-3: fuse the overall image feature vector v, the question text feature vector q and the knowledge triple feature vector k with the bilinear pooling model MLB; the fusion process is expressed as:
f₁ = MLBFusion(k, q)
f₂ = MLBFusion(k, v)
f = MLBFusion(f₁, f₂)
where f is the final fused feature vector and MLBFusion(·) denotes the feature fusion process;
Step 4-4: construct a multi-class classifier from a multilayer perceptron and a Sigmoid function and input the vector f into the classifier; the classifier output is:
α = Sigmoid(MLP(f))
where α is the classifier's output vector, each dimension of which is the probability that the corresponding answer in the candidate answer set is the correct answer; the answer with the highest probability is output as the predicted answer to the question.
Preferably, the value of d₁ is 224 and the value of d₂ is 14.
Preferably, the value of H₁ is 100 and the value of H₂ is 300.
Preferably, the ROI projection method is as follows: divide the horizontal and vertical coordinates of each pixel of the original image by the convolution stride of the image feature extractor and round the results up; after merging duplicate results, the obtained coordinates are the coordinates of the corresponding points in the feature map; the mapping between original-image pixels and feature-map points is established through this coordinate correspondence.
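As an illustration of this projection rule, the following sketch maps the pixel coordinates of one segmented region onto the feature-map grid. The function name, the boolean-mask input format and the stride of 16 (consistent with VGG-16's cumulative stride, 224/14 = 16) are assumptions of this sketch rather than details fixed by the method.

```python
import numpy as np

def roi_projection(region_mask, total_stride=16):
    """Project one segmented region of the original image onto the feature map:
    divide each pixel coordinate by the convolution stride of the feature
    extractor, round up, and merge duplicate coordinates, per the rule above.
    region_mask: (H, W) boolean array marking the region's pixels."""
    ys, xs = np.nonzero(region_mask)
    fy = np.ceil(ys / total_stride).astype(int)
    fx = np.ceil(xs / total_stride).astype(int)
    # Clip to the 14x14 feature-map grid, then merge repeated coordinates.
    fy = np.clip(fy, 0, 13)
    fx = np.clip(fx, 0, 13)
    return set(zip(fy.tolist(), fx.tolist()))
```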
Preferably, the knowledge graph is Freebase.
The visual question-answering method based on the fusion of fine-grained image features and external knowledge has the following beneficial effects:
1. Compared with the global image features used by traditional methods, the proposed fine-grained image feature extraction obtains visual features more highly correlated with the question text and improves applicability to fine-grained visual questions.
2. Knowledge related to the visual question is retrieved from an external knowledge base and introduced into answer prediction; fusing this external knowledge with the feature vectors of the visual question improves the ability to answer questions that require common-sense or domain-specific knowledge.
3. Most conventional methods focus only on the content of the visual question itself, which greatly limits their application scenarios, whereas the proposed method does not. Experimental results on a standard dataset show that it achieves higher accuracy than existing methods.
Drawings
FIG. 1 is a system framework diagram of the present invention.
Fig. 2 is a schematic diagram of fine-grained image feature extraction according to the present invention.
FIG. 3 is a schematic diagram of text processing and feature extraction according to the present invention.
FIG. 4 is a diagram of problem knowledge retrieval based on an external knowledge base according to the present invention.
FIG. 5 is a schematic diagram of multi-modal feature fusion and answer prediction according to the present invention.
FIG. 6 is a graph of the loss value as a function of the number of iterations during training according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
As shown in fig. 1, a visual question-answering method based on fusion of fine-grained image features and external knowledge includes the following steps:
Step 1: extracting fine-grained image features;
Step 1-1: take the original image as input and segment it with an unsupervised image segmentation algorithm, marking each segmented region with a different RGB color value; then resize the image to d₁×d₁×3;
Step 1-2: take a pre-trained VGG-16 network with its fully connected layers and Softmax layer removed as the image feature extractor; input the original image into the extractor and take the output of its last convolutional layer as the feature map of the original image, whose size is d₂×d₂×512;
Step 1-3: map the segmented regions of the original image from step 1-1 onto the feature map with the ROI projection method, partition the feature map according to the mapping result, and establish a one-to-one correspondence between the segmented regions of the original image and those of the feature map; max-pool the feature map within each region to obtain a 512-dimensional image feature vector per segmented region, where each dimension takes the maximum value of the corresponding feature-map channel inside that region;
Step 2: text processing and feature extraction;
Step 2-1: tokenize the question sentence of the visual question with the NLTK toolkit and convert each token into a one-hot word vector;
Step 2-2: embed the one-hot word vectors into a word vector space with the word embedding technique GloVe;
Step 2-3: encode the word vectors with an LSTM network and take the hidden state vector of the LSTM unit at the last time step as the question text feature vector q;
Step 3: question knowledge retrieval based on an external knowledge base;
Step 3-1: tokenize the question sentence and perform part-of-speech tagging with the NLTK part-of-speech tagger, marking the nouns and verbs in the question; then lemmatize the marked words, unifying singular and plural nouns into the singular form and tense variants of verbs into the verb's base form;
Step 3-2: use the nouns and verbs marked in step 3-1 as keywords to search the knowledge graph serving as the external knowledge base, matching corresponding knowledge triples to the question sentence of the visual question;
Step 3-3: encode the entities and relations of the knowledge graph with the TransE algorithm to obtain an encoding vector for each entity and relation contained in every matched knowledge triple, each of dimension H₁; splice each matched knowledge triple in the order head entity vector, relation vector, tail entity vector to obtain an H₂-dimensional knowledge triple feature vector k;
Step 4: multi-modal feature fusion and answer prediction;
Step 4-1: compute the cosine similarity between the image feature vector of each segmented region of the original image and the question text feature vector q;
Step 4-2: use each region's cosine similarity as a weighting coefficient and multiply it by that region's image feature vector; sum the weighted vectors over all regions to obtain the overall image feature vector v;
Step 4-3: fuse the overall image feature vector v, the question text feature vector q and the knowledge triple feature vector k with the bilinear pooling model MLB; the fusion process is expressed as:
f₁ = MLBFusion(k, q)
f₂ = MLBFusion(k, v)
f = MLBFusion(f₁, f₂)
where f is the final fused feature vector and MLBFusion(·) denotes the feature fusion process;
Step 4-4: construct a multi-class classifier from a multilayer perceptron and a Sigmoid function and input the vector f into the classifier; the classifier output is:
α = Sigmoid(MLP(f))
where α is the classifier's output vector, each dimension of which is the probability that the corresponding answer in the candidate answer set is the correct answer; the answer with the highest probability is output as the predicted answer to the question.
The specific embodiment is as follows:
1. Tokenize the visual question sentences in the sample library with the NLTK toolkit and build a dictionary from the tokenization results, where each word in the dictionary corresponds to a unique index.
2. As shown in fig. 2, an unsupervised image segmentation algorithm first performs region segmentation on the original image, outputting an image in which each segmented region is marked with a different RGB color value. The pixel coordinates of each region are obtained from these color values and determine the part of the image feature map corresponding to each segmented region of the original image. The segmented images are then uniformly resized to 224×224×3.
A VGG-16 network pre-trained on ImageNet, with its fully connected layers and Softmax layer removed, serves as the image feature extractor. The original image corresponding to the visual question is forward-propagated through the network, and the output of the last convolutional layer is taken as the extracted image feature; the output feature map has size 14×14×512.
According to the result of the unsupervised image segmentation, the ROI-to-feature-map mapping method of SPP-Net (from object detection) partitions the feature map output by the image feature extractor. Following the receptive-field and coordinate-transformation principles of convolutional neural networks, the ROI mapping works as follows: the horizontal and vertical coordinates of every pixel of the original image are divided by the convolution stride of the feature extractor and rounded up, giving the corresponding coordinates of that pixel on the feature map. The segmented regions on the feature map are obtained through this mapping, max pooling is performed within each region, and the maximum value of each feature-map channel inside a region becomes the value of that region's feature vector in that dimension, yielding a 512-dimensional image feature vector for each segmented region of the original image.
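A minimal sketch of this per-region max pooling step, under the same assumptions as the projection sketch above (NumPy arrays, one coordinate set per region):

```python
import numpy as np

def region_features(feature_map, region_coords):
    """Max-pool the feature map over each projected region, as described above.
    feature_map: (14, 14, 512) output of VGG-16's last conv layer.
    region_coords: one coordinate set per region, e.g. from roi_projection().
    Returns a (num_regions, 512) array; each dimension of a region's vector is
    the maximum activation of that channel inside the region."""
    feats = []
    for coords in region_coords:
        idx = np.array(sorted(coords))               # (n, 2) (row, col) pairs
        vals = feature_map[idx[:, 0], idx[:, 1], :]  # (n, 512) activations
        feats.append(vals.max(axis=0))               # channel-wise max pooling
    return np.stack(feats)
```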
3. Part-of-speech tagging determines the part of speech of each English word. The visual question sentence is tokenized with the NLTK toolkit and the tokens are tagged, which lets the knowledge-retrieval step search the external knowledge base for knowledge triples related to the nouns and verbs in the question. After tagging, the words are lemmatized: singular and plural nouns are unified into the singular form, and tense variants of verbs are unified into the verb's base form.
As shown in fig. 3, the input question sentence is tokenized on punctuation marks and spaces according to the characteristics of English. Each token is converted into a one-hot word vector whose value is 1 in the dimension corresponding to the word and 0 elsewhere. The word embedding technique GloVe (Global Vectors for Word Representation) then embeds the one-hot word vectors into the word vector space. Finally, the embedded sequence is input to an LSTM network that encodes the question text, and the hidden state vector of the LSTM unit at the last time step is taken as the question text feature vector.
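The text branch can be sketched as follows in PyTorch. The hidden dimension, the trainability of the embedding layer and the class name are assumptions of this sketch; the GloVe-embedding-plus-LSTM structure and the use of the last hidden state follow the description above.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """GloVe embedding followed by an LSTM; the final hidden state is the
    question text feature vector q."""
    def __init__(self, glove_weights, hidden_dim=512):
        super().__init__()
        # glove_weights: (vocab_size, 300) tensor of pre-trained GloVe vectors,
        # indexed by the dictionary built from the sample library.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm = nn.LSTM(glove_weights.size(1), hidden_dim, batch_first=True)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        emb = self.embed(token_ids)          # (batch, seq_len, 300)
        _, (h_n, _) = self.lstm(emb)
        return h_n[-1]                       # hidden state at the last step = q

# Usage sketch: tokenize with NLTK, map tokens to dictionary indices, encode.
# ids = torch.tensor([[vocab[w] for w in nltk.word_tokenize(question.lower())]])
# q = QuestionEncoder(glove_weights)(ids)
```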
4. As shown in fig. 4, the main task of question knowledge retrieval based on the external knowledge base is to find related knowledge triples; specifically, the verbs and nouns obtained by part-of-speech tagging of the question sentence are looked up in the external knowledge graph Freebase. According to the search results, the feature vectors of the knowledge triples matched to each visual question are stored locally and are fused with the image and text features in the multi-modal feature fusion step.
Specifically, the knowledge representation learning method TransE learns vector representations of the entities and relations in the knowledge graph. TransE treats the relation in a triple-form knowledge item as a translation from one entity to another, representing and learning entities and relations in the same semantic space so that the sum of the head entity vector and the relation vector is as close as possible to the tail entity vector, which can be expressed as:
head entity + relationship = tail entity
where head entity denotes the head entity vector, relationship the relation vector, and tail entity the tail entity vector.
The algorithm parameters are set so that the vectors of both entities and relations are 100-dimensional; splicing the head entity vector, relation vector and tail entity vector of each triple in order yields a 300-dimensional knowledge triple feature vector.
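A sketch of how a matched triple becomes the 300-dimensional feature k; the lookup-table names are assumptions, and transe_score only illustrates the translation constraint stated above.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE treats the relation as a translation: h + r should be close to t,
    so a smaller ||h + r - t|| means a more plausible triple."""
    return -np.linalg.norm(h + r - t)

def triple_feature(head, relation, tail, entity_emb, relation_emb):
    """Splice the 100-d TransE vectors of a matched triple in the order head
    entity, relation, tail entity, giving the 300-d feature vector k described
    above. entity_emb / relation_emb are dict-like lookup tables learned with
    TransE; their names are assumptions of this sketch."""
    return np.concatenate(
        [entity_emb[head], relation_emb[relation], entity_emb[tail]]
    )  # shape (300,) when each component is 100-d
```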
5. Fine-grained image feature extraction yields an image feature vector for each segmented region of the visual question's image. For a specific visual question, different segmented regions of the original image are correlated with the question to different degrees. Therefore, before the image features are fused, the similarity between the visual question and the n feature vectors of the image's n segmented regions is computed, and each region's features are weighted by this similarity, so that regions highly relevant to the question receive more attention.
As shown in fig. 5, cosine similarity measures the similarity between the image features of the different regions and the visual question. The closer the cosine value is to 1, the more similar the two vectors, i.e. the more similar the image feature and the visual question. The cosine similarity is computed as:
cos(A, B) = ( Σ_{i=1}^{O} A_i·B_i ) / ( √(Σ_{i=1}^{O} A_i²) · √(Σ_{i=1}^{O} B_i²) )
where A and B are the two vectors participating in the calculation and O is their dimension. Computing the cosine similarity determines how similar each segmented region's image feature vector is to the question text feature vector. Each region's cosine similarity is used as a weighting coefficient and multiplied by that region's image feature vector; summing the weighted vectors over all regions gives the overall image feature vector.
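The similarity weighting can be sketched as follows. Projecting q to the regions' 512 dimensions beforehand is an assumption needed for the cosine to be defined; the document does not state q's dimension explicitly.

```python
import torch
import torch.nn.functional as F

def fuse_regions(region_feats, q):
    """Weight each region's 512-d feature by its cosine similarity to the
    question feature q and sum, yielding the overall image vector v.
    region_feats: (num_regions, 512); q: (512,), assumed already projected."""
    sims = F.cosine_similarity(region_feats, q.unsqueeze(0), dim=1)  # (num_regions,)
    v = (sims.unsqueeze(1) * region_feats).sum(dim=0)                # (512,)
    return v
```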
the method comprises the steps of performing feature fusion on an image overall feature vector, a problem text feature vector and a knowledge triple feature vector through a feature fusion method based on a Bilinear pooling Model (MLB), wherein the MLB performs feature fusion on the two feature vectors based on a Hadamard product and matrix decomposition, in the specific implementation process, firstly, the features of two different modes are projected to the same dimension by using a multilayer perceptron, then, the projected vectors are multiplied by the Hadamard product, and the calculation result is the fusion result. In the knowledge-based visual question-answering method, the specific fusion steps are that firstly, MLB is adopted to fuse the image overall characteristic vector and the knowledge triple characteristic vector to obtain an intermediate result image-knowledge characteristic, then MLB is adopted to fuse the question text characteristic vector and the knowledge triple characteristic vector to obtain an intermediate result text-knowledge characteristic, finally MLB is adopted to fuse two intermediate results to obtain a characteristic representation of a visual question, and finally a fusion characteristic vector f is obtained, wherein the vector can be used for predicting the answer of the visual question.
The fused feature vector f serves as the input of the answer prediction step, from which the answer to the visual question is predicted. Following mainstream visual question-answering research, which models the task as a classification problem, this embodiment constructs a multi-class classifier: the vector f is input to the classifier, each dimension of the output vector is the probability that the corresponding answer in the candidate answer set is correct, and according to this probability distribution the answer with the highest probability is the predicted answer to the question.
The classifier consists of a multilayer perceptron MLP followed by a Sigmoid function, refining the multi-class problem into many binary classification problems. This places no restriction on the type of visual question: it handles not only dataset questions with exactly one correct answer in the candidate set, but also questions with several answers and questions whose answer is absent from the candidate answer set. The output of the classifier is:
α=Sigmoid(MLP(f))
the loss function of the classifier adopts a cross entropy loss function, and comprises the following steps:
Loss = −(1/N) Σ_{i=1}^{N} [ y_i·log p(y_i) + (1 − y_i)·log(1 − p(y_i)) ]
where y_i is the soft label produced by the answer-labeling mechanism, which marks each candidate answer with a fractional value in [0, 1]; p(y_i) is the predicted probability; N is the number of training samples; and i indexes the i-th training sample.
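The classifier and its loss can be sketched as follows; the hidden width and the candidate-answer count are assumptions, while the MLP+Sigmoid structure and the soft-label cross-entropy match the formulas above.

```python
import torch
import torch.nn as nn

class AnswerClassifier(nn.Module):
    """MLP + Sigmoid head over the fused feature f. Each output dimension is
    the probability that the corresponding candidate answer is correct; the
    soft targets y_i in [0, 1] make this a set of per-answer binary problems."""
    def __init__(self, joint_dim=1200, num_answers=3000):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(joint_dim, 1024), nn.ReLU(),
            nn.Linear(1024, num_answers),
        )

    def forward(self, f):
        return torch.sigmoid(self.mlp(f))

# Cross-entropy training loss over the soft labels, matching the formula above:
# loss = nn.BCELoss()(classifier(f), soft_labels)
```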
Before training, fine-grained image feature extraction and knowledge representation learning over the knowledge base are completed, and the image features and knowledge item features corresponding to each visual question are stored locally in HDF file format.
Parameters are randomly initialized; a random seed is set so that the random numbers used for initialization are fixed across training runs. During training, the model is validated on the validation set after every epoch; if the current epoch's validation result is better than the previous one's, the parameters from the current epoch are saved, replacing the previous parameters. AdaMax is adopted as the optimization algorithm, with all of its parameters at their default values.
To prevent overfitting during training, Dropout layers are added to the network, and gradients are clipped during backpropagation to prevent vanishing or exploding gradients.
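A sketch of one training pass under these settings; `model` (the fusion network plus classifier), `loader` and the clipping norm of 0.25 are assumptions of this sketch (the document does not state a clipping threshold), while AdaMax with default parameters follows the text.

```python
import torch

def train_one_epoch(model, loader, optimizer):
    """One training pass: BCE loss over soft labels, with gradient clipping
    applied in backpropagation as described above."""
    criterion = torch.nn.BCELoss()
    for fused_features, soft_labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(fused_features), soft_labels)
        loss.backward()  # clipping below guards against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
        optimizer.step()

# optimizer = torch.optim.Adamax(model.parameters())  # AdaMax, default parameters
```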
FIG. 6 plots the loss value against the number of training iterations. Training begins to converge around the 48th epoch, and the loss function finally converges to 1.95.
The prediction quality is evaluated with the following accuracy measure: if the predicted answer to a given question is the same as more than two of the ten labeled answers provided by the dataset, the prediction is considered correct. This evaluation index, called answer accuracy, is calculated as:
Accuracy = min(T / 3, 1)
where T is the number of the ten labeled answers that are the same as the model's predicted answer.
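This metric reduces to the standard VQA accuracy min(T/3, 1), sketched below: a prediction matching at least three of the ten labeled answers scores 1.

```python
def vqa_accuracy(predicted, labeled_answers):
    """Answer accuracy as defined above: a prediction matching more than two
    of the ten human-labeled answers counts as fully correct."""
    t = sum(1 for a in labeled_answers if a == predicted)
    return min(t / 3.0, 1.0)
```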
Table 1 shows the evaluation results of the knowledge-graph-based visual question-answering model. The model's accuracy on the three question types "yes/no", "number" and "other" is 73.35%, 47.87% and 34.01% respectively, and its overall accuracy is 52.14%.
Table 1: model accuracy
Question type:  yes/no   number   other    overall
Accuracy:       73.35%   47.87%   34.01%   52.14%
Table 2 compares the final accuracy of the proposed method with methods from other papers. CcVMS, CcVMS+Clustering and LcVMS+Clustering are the experimental results of different methods on the same dataset, and Ours denotes the proposed method. The proposed method shows a considerable performance improvement over the other methods.
Table 2: comparison of accuracy
(The original table is provided as an image, listing the accuracy of CcVMS, CcVMS+Clustering, LcVMS+Clustering and Ours on the same dataset.)

Claims (3)

1. A visual question-answering method based on fusion of fine-grained image features and external knowledge is characterized by comprising the following steps:
Step 1: extracting fine-grained image features;
Step 1-1: taking an original image as input and segmenting it with an unsupervised image segmentation algorithm, each segmented region being marked with a different RGB color value; then resizing the image to 224×224×3;
Step 1-2: taking a pre-trained VGG-16 network with its fully connected layers and Softmax layer removed as an image feature extractor; inputting the original image into the image feature extractor and taking the output of its last convolutional layer as the feature map of the original image, the output feature map having size 14×14×512;
Step 1-3: mapping the segmented regions of the original image from step 1-1 onto the feature map with an ROI projection method, partitioning the feature map according to the mapping result, and establishing a one-to-one correspondence between the segmented regions of the original image and those of the feature map; performing max pooling on the feature map to obtain a 512-dimensional image feature vector for each segmented region, each dimension taking the maximum value of the corresponding feature-map channel inside that region;
the ROI projection method being: dividing the horizontal and vertical coordinates of each pixel of the original image by the convolution stride of the image feature extractor, rounding the results up and merging duplicates, the obtained coordinates being the coordinates of the corresponding points in the feature map; and establishing the mapping between original-image pixels and feature-map points through this coordinate correspondence;
Step 2: text processing and feature extraction;
Step 2-1: tokenizing the question sentence of the visual question with the NLTK toolkit and converting each token into a one-hot word vector;
Step 2-2: embedding the one-hot word vectors into a word vector space with the word embedding technique GloVe;
Step 2-3: encoding the word vectors with an LSTM network and taking the hidden state vector of the LSTM unit at the last time step as the question text feature vector q;
Step 3: question knowledge retrieval based on an external knowledge base;
Step 3-1: tokenizing the question sentence and performing part-of-speech tagging with the NLTK part-of-speech tagger, marking the nouns and verbs in the question; then lemmatizing the marked words, unifying singular and plural nouns into the singular form and tense variants of verbs into the verb's base form;
Step 3-2: using the nouns and verbs marked in step 3-1 as keywords to search the knowledge graph serving as the external knowledge base, matching corresponding knowledge triples to the question sentence of the visual question;
Step 3-3: encoding the entities and relations of the knowledge graph with the TransE algorithm to obtain an encoding vector for each entity and relation contained in every matched knowledge triple, each encoding vector having dimension H₁; and splicing each matched knowledge triple in the order head entity vector, relation vector, tail entity vector to obtain an H₂-dimensional knowledge triple feature vector k;
Step 4: multi-modal feature fusion and answer prediction;
Step 4-1: computing the cosine similarity between the image feature vector of each segmented region of the original image and the question text feature vector q;
Step 4-2: using each region's cosine similarity as a weighting coefficient and multiplying it by that region's image feature vector; summing the weighted vectors over all regions to obtain the overall image feature vector v;
Step 4-3: fusing the overall image feature vector v, the question text feature vector q and the knowledge triple feature vector k with the bilinear pooling model MLB, the fusion process being expressed as:
f₁ = MLBFusion(k, q)
f₂ = MLBFusion(k, v)
f = MLBFusion(f₁, f₂)
where f is the final fused feature vector and MLBFusion(·) denotes the feature fusion process;
Step 4-4: constructing a multi-class classifier from a multilayer perceptron and a Sigmoid function and inputting the vector f into the classifier, the classifier output being:
α = Sigmoid(MLP(f))
where α is the classifier's output vector, each dimension of which is the probability that the corresponding answer in the candidate answer set is the correct answer; the answer with the highest probability is output as the predicted answer to the question.
2. The visual question-answering method based on fusion of fine-grained image features and external knowledge according to claim 1, wherein the value of H₁ is 100 and the value of H₂ is 300.
3. The visual question-answering method based on fusion of fine-grained image features and external knowledge according to claim 1, wherein the knowledge graph is Freebase.
CN202010883275.XA 2020-08-28 2020-08-28 Visual question-answering method based on fusion of fine-grained image features and external knowledge Active CN112100346B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010883275.XA CN112100346B (en) 2020-08-28 2020-08-28 Visual question-answering method based on fusion of fine-grained image features and external knowledge

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010883275.XA CN112100346B (en) 2020-08-28 2020-08-28 Visual question-answering method based on fusion of fine-grained image features and external knowledge

Publications (2)

Publication Number Publication Date
CN112100346A CN112100346A (en) 2020-12-18
CN112100346B true CN112100346B (en) 2021-07-20

Family

ID=73758142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010883275.XA Active CN112100346B (en) 2020-08-28 2020-08-28 Visual question-answering method based on fusion of fine-grained image features and external knowledge

Country Status (1)

Country Link
CN (1) CN112100346B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800191B (en) * 2020-12-31 2023-01-17 科大讯飞股份有限公司 Question and answer method and device based on picture and computer readable storage medium
CN112714032B (en) * 2021-03-29 2021-07-02 网络通信与安全紫金山实验室 Wireless network protocol knowledge graph construction analysis method, system, equipment and medium
CN113240046B (en) * 2021-06-02 2023-01-03 哈尔滨工程大学 Knowledge-based multi-mode information fusion method under visual question-answering task
CN113392253B (en) * 2021-06-28 2023-09-29 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium
CN113420833B (en) * 2021-07-21 2023-12-26 南京大学 Visual question answering method and device based on semantic mapping of questions
CN113626662A (en) * 2021-07-29 2021-11-09 山东新一代信息产业技术研究院有限公司 Method for realizing post-disaster image visual question answering
CN114282531A (en) * 2021-08-24 2022-04-05 腾讯科技(深圳)有限公司 Question detection method and device, electronic equipment and storage medium
CN114842368B (en) * 2022-05-07 2023-10-03 中国电信股份有限公司 Scene-based visual auxiliary information determination method, system, equipment and storage medium
CN117271818B (en) * 2023-11-22 2024-03-01 鹏城实验室 Visual question-answering method, system, electronic equipment and storage medium
CN117648429B (en) * 2024-01-30 2024-04-30 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10817749B2 (en) * 2018-01-18 2020-10-27 Accenture Global Solutions Limited Dynamically identifying object attributes via image analysis

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108154235A (en) * 2017-12-04 2018-06-12 盈盛资讯科技有限公司 A kind of image question and answer inference method, system and device
CN109784163A (en) * 2018-12-12 2019-05-21 中国科学院深圳先进技术研究院 A kind of light weight vision question answering system and method
CN110134774A (en) * 2019-04-29 2019-08-16 华中科技大学 It is a kind of based on the image vision Question-Answering Model of attention decision, method and system
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network
CN110532433A (en) * 2019-09-03 2019-12-03 北京百度网讯科技有限公司 Entity recognition method, device, electronic equipment and the medium of video scene
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN111291556A (en) * 2019-12-17 2020-06-16 东华大学 Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
CN111506722A (en) * 2020-06-16 2020-08-07 平安科技(深圳)有限公司 Knowledge graph question-answering method, device and equipment based on deep learning technology

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An improved TransE-based knowledge graph representation method; Chen Wenjie et al.; Computer Engineering; 2020-05-31; vol. 46, no. 5; pp. 63-69, 77 *
Fine-grained knowledge graph question answering based on BiLSTM_CRF; Zhang Chuting et al.; Computer Engineering; 2020-02-29; vol. 46, no. 2; pp. 41-47 *
Fine-grained image classification with a deep region network method; Weng Yuchen et al.; Journal of Image and Graphics; 2017-11-30; vol. 22, no. 11; pp. 1521-1530 *

Also Published As

Publication number Publication date
CN112100346A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN108804530B (en) Subtitling areas of an image
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN107679580B (en) Heterogeneous migration image emotion polarity analysis method based on multi-mode depth potential correlation
CN110334705B (en) Language identification method of scene text image combining global and local information
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110837836B (en) Semi-supervised semantic segmentation method based on maximized confidence
Sumbul et al. SD-RSIC: Summarization-driven deep remote sensing image captioning
CN108009148B (en) Text emotion classification representation method based on deep learning
CN112100351A (en) Method and equipment for constructing intelligent question-answering system through question generation data set
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN110866140A (en) Image feature extraction model training method, image searching method and computer equipment
Peng et al. Research on image feature extraction and retrieval algorithms based on convolutional neural network
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN112241468A (en) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN110969020A (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN111695052A (en) Label classification method, data processing device and readable storage medium
CN113886626B (en) Visual question-answering method of dynamic memory network model based on multi-attention mechanism
CN110598022B (en) Image retrieval system and method based on robust deep hash network
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN112800292A (en) Cross-modal retrieval method based on modal specificity and shared feature learning
Oluwasammi et al. Features to text: a comprehensive survey of deep learning on semantic segmentation and image captioning
CN115115969A (en) Video detection method, apparatus, device, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant