CN110263912B - Image question-answering method based on multi-target association depth reasoning - Google Patents

Image question-answering method based on multi-target association depth reasoning

Info

Publication number
CN110263912B
CN110263912B CN201910398140.1A CN201910398140A
Authority
CN
China
Prior art keywords
image
question
feature
vector
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910398140.1A
Other languages
Chinese (zh)
Other versions
CN110263912A (en)
Inventor
余宙
俞俊
汪亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201910398140.1A priority Critical patent/CN110263912B/en
Publication of CN110263912A publication Critical patent/CN110263912A/en
Application granted granted Critical
Publication of CN110263912B publication Critical patent/CN110263912B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image question-answering method based on multi-target association depth reasoning. The method comprises the following steps: 1 and 2, preprocessing the image and the natural-language question text that accompanies it, and reordering the attention over the targets with an adaptive attention module (AAM) enhanced by the geometric features of the candidate boxes; 3, constructing a neural network structure based on the AAM model; 4, model training, in which the neural network parameters are trained with the back-propagation algorithm. The invention provides a deep neural network for image question answering, in particular a method that models image-question text data in a unified way, reasons over the features of every target in the image, and reorders the attention assigned to each target so that questions are answered more accurately, achieving better results in the field of image question answering.

Description

Image question-answering method based on multi-target association depth reasoning
Technical Field
The invention relates to a deep neural network structure for the image question answering (Visual Question Answering) task, and in particular to a method that models image-question data in a unified way, explores the interactions between the entity features in an image and the geometric features of their spatial positions, and adaptively adjusts the attention weights by modeling the positional relations between them.
Background
Image question answering is an emerging task at the intersection of computer vision and natural language processing. The task is to let a machine automatically answer a question posed about a given image. Image question answering is undoubtedly more complex than image description, another task that crosses computer vision and natural language processing, because it requires the machine to understand both the image and the question and to reason its way to the correct answer. A question such as "What color are her glasses?" contains rich semantic information: to answer it, the machine needs to locate the region of the woman's eyes in the image and then answer according to the keyword "color". For another question such as "What is the beard made of?", the machine cannot directly find the beard; it has to estimate the region where the beard should be from the position of the face, attend to that region, and then answer according to the keyword "made".
With the rapid development of deep learning in recent years, end-to-end modeling with deep Convolutional Neural Networks (CNN) or deep Recurrent Neural Networks (RNN) has become the mainstream research direction in computer vision and natural language processing. In image question answering research, introducing the end-to-end modeling idea, modeling the image end to end with a suitable network structure, and letting the computer answer automatically from the input question and image is therefore a research problem worth exploring in depth.
It has long been recognized in computer vision that contextual information, or the associations between objects, helps improve models. However, most methods that exploit such information predate the popularity of deep learning. In the current deep learning era, little progress has been made in exploiting relational information between objects, particularly in image question answering, and most methods still attend to each entity independently. Because objects in an image vary in two-dimensional position, scale and aspect ratio, an image question-answering model needs to reason about the question by relying on the interrelations between entities. The positional information of objects, that is, their geometric features in a general sense, therefore plays a complex and important role in image question-answering models.
In terms of practical applications, image question-answering algorithms have broad application scenarios. With the rapid development of wearable smart hardware (such as Google Glass and Microsoft HoloLens) and augmented reality technology, automatic question answering about image content based on visual perception may become an important mode of human-computer interaction in the near future. The technology can help people, especially the visually impaired, to better perceive and understand the world.
In conclusion, image question answering based on end-to-end modeling is a direction worthy of intensive research. This work approaches the task from several of its key difficulties, addresses problems in current methods, and finally forms a complete image question-answering system.
Because image content in natural scenes is complex and its subjects are diverse, and because natural language descriptions have a high degree of freedom, describing image content faces huge challenges. Specifically, there are two main difficulties:
(1) Feature extraction. This is a classic and fundamental problem in cross-media representation research. Commonly used methods include image processing feature extractors such as the Histogram of Oriented Gradients (HOG), the Local Binary Pattern (LBP) and Haar features. In addition, features extracted by deep learning models such as ResNet, GoogLeNet and Faster R-CNN have achieved excellent results in many fields, such as fine-grained image classification, natural language processing and recommendation systems. Selecting a proper strategy for cross-media feature extraction, and improving the expressiveness of the features while keeping the computation efficient, is therefore a direction worth studying in depth.
(2) How to reason about the question by relying on the interrelations between the entities in the image. The input to an image question-answering algorithm is an image, which may contain multiple target entities, together with a question. The algorithm must not only extract the features of each target entity and understand each target correctly, but also infer the relations between the targets from their geometric and visual features. How to let the algorithm automatically learn the relations among the targets in an image and form a more accurate cross-media representation is therefore a difficult problem in image question answering and a crucial link that affects the performance of the algorithm.
Disclosure of Invention
The invention provides an image question-answering method based on multi-target association depth reasoning. The invention is a deep neural network architecture for the image question answering (Visual Question Answering) task and mainly contains two points: 1. adopting image features with stronger expressiveness together with geometric information; 2. reasoning about the relations between the targets in the image from the target features.
The technical scheme adopted by the invention for solving the technical problem comprises the following steps:
step (1), data preprocessing, and feature extraction of image and text data
Firstly, preprocessing an image:
target entities contained in the images are detected using a fast-RCNN deep neural network structure. And extracting the visual features V and the geometric features G containing the target size and coordinate information in the image.
Preprocessing the text data:
counting sentence length of a given question text sets the maximum length of the question text according to the statistical information. And constructing a problem text vocabulary dictionary, replacing the words of the problem with index values in the description vocabulary dictionary, and then passing through the LSTM, thereby converting the problem text into a vector q.
Step (2): attention module enhanced with the candidate-box geometric features
The structure is shown in Fig. 1. The inputs are three features: the geometric features G of the candidate-box positions, the visual features V, and the attention weight vector m.
First the attention weight vector m is encoded by rank: the targets are ordered by weight, the ranks are converted into vectors, mapped to a high dimension and added to the visual features V mapped to the same dimension, and the output is processed by Layer Normalization to obtain V_A.
Then the geometric features G are mapped through a linear layer followed by a ReLU activation to obtain G_R. V_A and G_R are fed into a candidate-box Relation Module for reasoning, yielding O_relation. O_relation is passed through a linear layer and a sigmoid function and multiplied with the original attention weight vector m to obtain the new attention vector m̂.
Step (3): constructing the deep neural network
The structure is shown in Fig. 2. First the question text is converted into a vector of index values according to the vocabulary dictionary. This vector is mapped to a high dimension and fed into a Long Short-Term Memory network (LSTM); the output vector q is fused with the visual features V obtained from Faster R-CNN by a Hadamard product, and an attention module produces the attention weight m of each entity feature. The attention weight m, the visual features V and the geometric features G are fed into the Adaptive Attention Module (AAM) enhanced with the candidate-box geometric features, which reasons over the visual features and the geometric features of the candidate-box positions and reorders the attention weights to obtain the new attention vector m̂. The attention vector m̂ is fused with the visual features V by an element-wise product followed by a weighted average, giving the new visual feature v̂. The visual feature v̂ is fused with the question text vector q through a Hadamard product, and a softmax function generates the probabilities, which are output as the predicted values of the network.
Step (4), model training
The model parameters of the neural network of step (3) are trained with the back-propagation algorithm, according to the difference between the generated predictions and the ground-truth answers for the image, until the whole network model converges.
The step (1) is specifically realized as follows:
1-1. The features of an image i are extracted with the existing deep neural network Faster R-CNN. The extracted features comprise the visual features V and the geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]. The visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, with g_i ∈ R^4, where x, y, w and h are the position parameters of the geometric feature and represent, respectively, the abscissa, the ordinate, the width and the height of the candidate box in which the entity is located in the image;
1-2. For a given question text, the distinct words appearing in the question texts of the data set are first counted and recorded in a dictionary. The words of a question are converted into index values according to this word dictionary, so that the question text is converted into an index vector of fixed length. The specific formula is as follows:
Q = {w_1^idx, w_2^idx, ..., w_l^idx}  (formula 1)
where w_k^idx is the index value of the word w_k in the dictionary, and l represents the length of the question text.
The deep reasoning network of the adaptive attention module enhanced with the candidate-box geometric features in step (2) is specified as follows:
2-1. The input attention weight vector m is processed first. The attention weights m = {m_1, m_2, ..., m_k} of the targets are sorted by value, and the rank pos of each target is encoded into a vector PE_pos ∈ R^d. The specific formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))  (formula 2)
where d is the encoding dimension, i ∈ [0, 1, ..., d/2], pos ∈ [1, 2, ..., k]. This yields the matrix PE ∈ R^(k×d) based on the attention weights m.
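As an illustration of step 2-1, the following is a minimal sketch in Python, assuming PyTorch; the function name rank_positional_encoding and the default dimension d = 128 are illustrative assumptions rather than values fixed by the patent text.

    import torch

    def rank_positional_encoding(m, d=128):
        # m: attention weights of the k targets, shape (k,)
        k = m.shape[0]
        # Rank the targets by attention value (descending); ranks start at 1.
        order = torch.argsort(m, descending=True)
        pos = torch.empty(k, dtype=torch.float)
        pos[order] = torch.arange(1, k + 1, dtype=torch.float)
        # Sinusoidal encoding of the rank, following formula (2).
        even = torch.arange(0, d, 2, dtype=torch.float)        # the indices 2i
        div = torch.pow(10000.0, even / d)                     # 10000^(2i/d)
        pe = torch.zeros(k, d)
        pe[:, 0::2] = torch.sin(pos.unsqueeze(1) / div)
        pe[:, 1::2] = torch.cos(pos.unsqueeze(1) / div)
        return pe                                              # PE, shape (k, d)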
2-2. The matrix PE and the visual features V are each passed through a different linear layer and added, and the output is processed by layer normalization to obtain V_A. The specific formula is as follows:
V_A = LayerNorm(W_PE · PE^T + W_V · V^T)  (formula 3)
where W_PE and W_V are the learnable parameters of the two linear layers.
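A minimal sketch of formula (3) in step 2-2, again assuming PyTorch; the projection dimension 128 follows the detailed description below, while the class and parameter names are illustrative.

    import torch.nn as nn

    class AttendedVisualEncoder(nn.Module):
        # Fuses the rank encoding PE with the visual features V (formula (3)).
        def __init__(self, d_pe=128, d_v=2048, d_a=128):
            super().__init__()
            self.w_pe = nn.Linear(d_pe, d_a)   # W_PE
            self.w_v = nn.Linear(d_v, d_a)     # W_V
            self.norm = nn.LayerNorm(d_a)

        def forward(self, pe, v):
            # pe: (k, d_pe) rank encoding; v: (k, d_v) visual features
            return self.norm(self.w_pe(pe) + self.w_v(v))   # V_A, shape (k, d_a)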
2-3. A correlation computation is performed on the geometric features G, which are passed through a linear layer to obtain G_R. The specific formulas are as follows:
G_R = W_G · Ω(G)^T  (formula 4)
Ω(G)_{m,n} = ε_G(g_m, g_n)  (formula 5)
where m, n ∈ [1, 2, ..., k], ε_G encodes the relative geometric relation between candidate boxes m and n with the sinusoidal encoding of formula (2), so that Ω(G) ∈ R^(k×k×d_g), and W_G maps the last dimension to a single value, giving G_R ∈ R^(k×k).
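The following sketch of step 2-3 assumes PyTorch. Formula (5) only states that the pairwise box geometry is encoded with the sinusoid of formula (2); the concrete log-ratio terms used here are a common box-relation formulation and are therefore an assumption, and the dimension d_g = 64 follows the 100x100x64 matrix mentioned in the detailed description below.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def pairwise_box_geometry(g, eps=1e-3):
        # g: (k, 4) candidate boxes given as {x, y, w, h}.
        x, y, w, h = g[:, 0], g[:, 1], g[:, 2], g[:, 3]
        dx = torch.log((x[:, None] - x[None, :]).abs().clamp(min=eps) / w[:, None])
        dy = torch.log((y[:, None] - y[None, :]).abs().clamp(min=eps) / h[:, None])
        dw = torch.log(w[None, :] / w[:, None])
        dh = torch.log(h[None, :] / h[:, None])
        return torch.stack([dx, dy, dw, dh], dim=-1)           # (k, k, 4)

    def sinusoid_embed(x, d_g=64):
        # Embeds each of the 4 relative-geometry terms with d_g/8 sine and
        # d_g/8 cosine frequencies, in the spirit of formula (2).
        n, c = x.shape
        d = d_g // (2 * c)
        freq = torch.pow(1000.0, torch.arange(d, dtype=torch.float) / d)
        angles = x.unsqueeze(-1) / freq                        # (n, 4, d)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).view(n, -1)

    class GeometricRelation(nn.Module):
        # Maps the encoded pairwise geometry to one value per box pair and
        # applies ReLU, giving G_R of shape (k, k) (formulas (4)-(5)).
        def __init__(self, d_g=64):
            super().__init__()
            self.d_g = d_g
            self.w_g = nn.Linear(d_g, 1)                       # W_G

        def forward(self, g):
            rel = pairwise_box_geometry(g)                     # (k, k, 4)
            k = rel.shape[0]
            enc = sinusoid_embed(rel.view(k * k, 4), self.d_g) # (k*k, d_g)
            return F.relu(self.w_g(enc)).view(k, k)            # G_R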
2-4. V_A and G_R are fed into the relation module for reasoning to obtain O_relation. The specific formulas are as follows:
V_R = (W_K V_A)^T · (W_Q V_A)  (formula 6)
O_relation = softmax(log(G_R) + V_R) · (W_O V_A + b_O)  (formula 7)
where W_K, W_Q and W_O are learnable linear mappings, b_O is a bias term, and V_R ∈ R^(k×k) contains the pairwise similarities of the target features.
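A sketch of the relation reasoning of step 2-4 (formulas (6) and (7)), assuming PyTorch; the key/query projections and the small constant added before the logarithm are implementation assumptions.

    import torch
    import torch.nn as nn

    class RelationModule(nn.Module):
        # Combines appearance similarity V_R with the geometric relations G_R
        # and aggregates the fused target features V_A (formulas (6)-(7)).
        def __init__(self, d_a=128, d_r=128, eps=1e-6):
            super().__init__()
            self.w_q = nn.Linear(d_a, d_r)    # W_Q
            self.w_k = nn.Linear(d_a, d_r)    # W_K
            self.w_o = nn.Linear(d_a, d_a)    # W_O (with bias b_O)
            self.eps = eps

        def forward(self, v_a, g_r):
            # v_a: (k, d_a) fused target features; g_r: (k, k) geometric relations
            v_r = self.w_q(v_a) @ self.w_k(v_a).t()            # formula (6), (k, k)
            w = torch.softmax(torch.log(g_r + self.eps) + v_r, dim=-1)
            return w @ self.w_o(v_a)                           # O_relation, (k, d_a)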
2-5. O_relation is passed through a fully connected layer and a sigmoid function and then multiplied with the original attention weights m to obtain the new attention vector m̂. The specific formula is as follows:
m̂ = m ⊙ σ(W_F · O_relation + b_F)  (formula 8)
where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and W_F and b_F are the parameters of the fully connected layer.
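A sketch of the reweighting of step 2-5 (formula (8)), assuming PyTorch; the fully connected layer producing one scalar per target is an assumption consistent with the 100-dimensional output described in the detailed embodiment.

    import torch
    import torch.nn as nn

    class AttentionReweighting(nn.Module):
        # Gates the original attention weights m with a score derived from
        # O_relation, producing the reordered attention m_hat (formula (8)).
        def __init__(self, d_a=128):
            super().__init__()
            self.fc = nn.Linear(d_a, 1)        # W_F, b_F

        def forward(self, m, o_relation):
            # m: (k,) original attention weights; o_relation: (k, d_a)
            gate = torch.sigmoid(self.fc(o_relation)).squeeze(-1)
            return m * gate                    # m_hat, shape (k,)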
Constructing a deep neural network in the step (3), which comprises the following specific steps:
3-1. The question text vector q and the visual features V are mapped to a common space by the linear transformations of fully connected layers and fused with a Hadamard product. F_fusion denotes the fused feature in the common space; W_r and W_q denote the fully connected layer parameters that linearly transform the visual feature V and the current state information q, respectively; the symbol ⊙ denotes the Hadamard product of the two matrices; W_m denotes the fully connected layer parameters that reduce the dimension of the fused feature and produce the attention weight distribution, m ∈ R^k being the initial attention weight vector; j denotes the index of the region whose attention weight is currently computed. The specific formulas are as follows:
F_fusion^j = (W_r v_j) ⊙ (W_q q)  (formula 9)
m = softmax(W_m F_fusion + b_m)  (formula 10)
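A sketch of formulas (9) and (10) in step 3-1, assuming PyTorch; the common-space dimension 1024 follows the detailed description, while the class and layer names are illustrative.

    import torch
    import torch.nn as nn

    class QuestionGuidedAttention(nn.Module):
        # Fuses the question vector q with every region feature v_j by a Hadamard
        # product in a common space (formula (9)) and turns the fused features
        # into an attention distribution over the k regions (formula (10)).
        def __init__(self, d_v=2048, d_q=1024, d_c=1024):
            super().__init__()
            self.w_r = nn.Linear(d_v, d_c)     # W_r
            self.w_q = nn.Linear(d_q, d_c)     # W_q
            self.w_m = nn.Linear(d_c, 1)       # W_m, b_m

        def forward(self, v, q):
            # v: (k, d_v) region features; q: (d_q,) question vector
            f_fusion = self.w_r(v) * self.w_q(q)                         # (k, d_c)
            return torch.softmax(self.w_m(f_fusion).squeeze(-1), dim=0)  # m, (k,)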
3-2. According to step (2), m, V and G are fed into the adaptive attention module enhanced with the candidate-box geometric features, which reasons over the features V and G and reorders m to obtain the new attention feature m̂.
3-3. The new visual feature v̂ is obtained as the weighted average of the features V under the attention weights m̂. The specific formula is as follows:
v̂ = Σ_{j=1}^{k} m̂_j · v_j  (formula 11)
the training model in the step (4) is as follows:
the question-answer pairs in the VQA-v2.0 dataset are answered by multiple people, so that the same question may have different correct answers. Previous image question-answering models treated the highest ticket number as the only correct answer and one-hot encoding (one-hot encoding) it. Because the correct answers have a plurality of elements, all answers to the same question are voted, and the weight of the correct answer in all correct answers is determined according to the number of votes. And using a Kullback-Leibler divergence loss function if N represents the length of the answer vocabulary. Presect represents the predicted value distribution, and GT represents the true value. Then the definition is as shown:
Figure GDA0002820731840000073
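A sketch of the soft-label Kullback-Leibler divergence loss of formula (12), assuming PyTorch; building the target distribution GT by normalizing the vote counts is an assumption consistent with the description above.

    import torch

    def soft_answer_kld(predict_logits, answer_votes, eps=1e-9):
        # predict_logits: (N,) raw scores over the answer vocabulary
        # answer_votes:   (N,) votes each answer received for this question
        predict = torch.softmax(predict_logits, dim=-1)
        gt = answer_votes / answer_votes.sum().clamp(min=1.0)   # soft target GT
        # formula (12): sum_i GT_i * log(GT_i / Predict_i); terms with GT_i = 0 vanish.
        mask = gt > 0
        return torch.sum(gt[mask] * torch.log(gt[mask] / (predict[mask] + eps)))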
the invention has the following beneficial effects:
the invention relates to a method for uniformly modeling image-description data, reasoning on characteristics of each target in an image, and reordering attention mechanisms of each target so as to more accurately describe the image. The invention introduces the implicit geometric characteristics in the image for the first time and structures the image, so that the image and the solid characteristics in the image are subjected to cooperative reasoning, and the accuracy of the visual question-answering model can be effectively improved after the existing visual question-answering technology is combined.
The invention has a small number of parameters and is lightweight and efficient, which facilitates more efficient distributed training and deployment on dedicated hardware with limited memory.
Drawings
FIG. 1: the adaptive attention module enhanced with the candidate-box geometric features;
FIG. 2: the image question-answering neural network architecture built around the adaptive attention module enhanced with the candidate-box geometric features.
Detailed Description
The detailed parameters of the present invention are described below.
The invention provides a deep neural network framework for the image question answering (Visual Question Answering) task.
The data preprocessing and the feature extraction of the image and the text in the step (1) are specifically as follows:
1-1. For the feature extraction of the image data, we use the MS-COCO dataset as the training and test data and extract its visual features with the existing Faster R-CNN model. Specifically, the image data is input into the Faster R-CNN network, which detects the 10 to 100 targets in the image and frames each of them; a 2048-dimensional visual feature is extracted from the image region of each target, and the coordinates and size {x, y, w, h} of each target's box are recorded as its geometric feature, so that V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100].
1-2. For the question texts, the distinct words appearing in the question texts of the data set are first counted, and the 9847 words whose frequency is higher than 5 are recorded in the dictionary.
1-3. Only the first 16 words of each question sentence are kept; if a question has fewer than 16 words, it is padded with null characters. Each word is then replaced by its index value in the word dictionary built in 1-2, converting the string into numeric values, so that each question is translated into a 16-dimensional vector of word indices.
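A minimal sketch of the question preprocessing of steps 1-2 and 1-3, in plain Python; the use of index 0 for the null/padding character and the simple whitespace tokenization are illustrative assumptions.

    from collections import Counter

    def build_dictionary(questions, min_freq=5):
        # Keep the words whose frequency is higher than min_freq (9847 words here);
        # index 0 is reserved for the null/padding character.
        counts = Counter(w for q in questions for w in q.lower().split())
        kept = [w for w, c in counts.items() if c > min_freq]
        return {w: i + 1 for i, w in enumerate(kept)}

    def question_to_indices(question, dictionary, max_len=16):
        # Keep only the first 16 words and pad shorter questions with nulls.
        idx = [dictionary.get(w, 0) for w in question.lower().split()][:max_len]
        return idx + [0] * (max_len - len(idx))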
In step (2), the Adaptive Attention Module (AAM) enhanced with the candidate-box geometric features learns the associations between the target features V and the geometric features G of the image so as to reorder the input original attention information m, specifically as follows:
2-1. The input attention weight vector m is processed first: the attention values {m_1, m_2, ..., m_k} of the targets in m are sorted, the rank pos of each target is encoded, and the matrix PE based on the attention information m is obtained.
2-2. PE is mapped to 128 dimensions and added to V mapped to 128 dimensions, and the output is processed by layer normalization to obtain the matrix V_A of size 100x128.
2-3. The correlation computation on the features G first encodes them with formula (2) to obtain a matrix of size 100x100x64; the last dimension of this matrix is mapped to a single value and passed through a ReLU activation, giving the 100x100 matrix G_R.
2-4. V_A and G_R are fed into the relation module for reasoning: each target feature in V_A is first mapped to 128 dimensions, and the mapped target features are dot-multiplied with one another to obtain the 100x100 matrix V_R. The weights computed jointly from V_R and G_R form a 100x100 matrix, and the weighted average of the targets in V_A yields the 100x128 matrix O_relation.
2-5. O_relation is passed through a fully connected layer and a sigmoid, and the result is multiplied with the original m to obtain the new 100-dimensional attention vector m̂.
Constructing a deep neural network in the step (3), which comprises the following specific steps:
3-1. For the question text features, the input is the 16-dimensional index vector generated in step (1). A word embedding technique converts each word index into a corresponding word vector; the word vector size used is 1024, so each question text becomes a matrix of size 16x1024. The word vector of each time step is then fed into an LSTM, which is a recurrent neural network structure, and the LSTM output is set to a 1024-dimensional vector q. The input visual features are zero-padded into a matrix of 100x2048 and mapped through a linear layer into a matrix of 100x1024.
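A sketch of the question encoder of step 3-1, assuming PyTorch; the embedding and hidden sizes of 1024 and the vocabulary size of 9847 follow the text, while batching details are simplified.

    import torch
    import torch.nn as nn

    class QuestionEncoder(nn.Module):
        # Embeds the 16 word indices into 1024-d word vectors and encodes them
        # with an LSTM whose final hidden state is the 1024-d question vector q.
        def __init__(self, vocab_size=9847, d_emb=1024, d_hid=1024):
            super().__init__()
            self.embed = nn.Embedding(vocab_size + 1, d_emb, padding_idx=0)
            self.lstm = nn.LSTM(d_emb, d_hid, batch_first=True)

        def forward(self, word_idx):
            # word_idx: (batch, 16) integer word indices
            emb = self.embed(word_idx)          # (batch, 16, 1024)
            _, (h, _) = self.lstm(emb)          # h: (1, batch, 1024)
            return h.squeeze(0)                 # q: (batch, 1024)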
3-2. The LSTM output vector q is fed, together with the visual features, into the attention module to obtain a preliminary 100-dimensional attention feature m, which completes the extraction of the image attention information (Attention).
3-3. According to step (2), m, V and G are fed into the Adaptive Attention Module (AAM) enhanced with the candidate-box geometric features, which reasons over the features V and G and reorders m to obtain the new 100-dimensional attention feature m̂.
At this point, the operations of reasoning about the associations between the targets in the image and reordering the attention (Attention) are complete.
3-4. The 100-dimensional vector m̂ is used to take a weighted average of the 100x1024 features V, giving the 1024-dimensional attended visual feature v̂.
3-5. The reordered visual feature v̂, which carries the attention information, is fused with the LSTM output vector q, and an FC layer (a fully connected neural network layer) and a softmax operation are applied in turn, finally outputting a 9487-dimensional prediction vector in which each element represents the probability that the answer indexed by that element is the answer to the given question.
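A sketch of the fusion and prediction head of steps 3-4 and 3-5, assuming PyTorch; the 9487-dimensional answer vocabulary follows the text, and the single fully connected output layer is an assumption.

    import torch
    import torch.nn as nn

    class AnswerHead(nn.Module):
        # Fuses the attended visual feature v_hat with the question vector q by
        # a Hadamard product and predicts a distribution over the answers.
        def __init__(self, d=1024, n_answers=9487):
            super().__init__()
            self.fc = nn.Linear(d, n_answers)

        def forward(self, m_hat, v, q):
            # m_hat: (k,) reordered attention; v: (k, d) projected visual features
            v_hat = torch.sum(m_hat.unsqueeze(-1) * v, dim=0)  # weighted average, formula (11)
            fused = v_hat * q                                  # Hadamard fusion with q, shape (d,)
            return torch.softmax(self.fc(fused), dim=-1)       # 9487-d probabilities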
The training model in the step (4) is as follows:
and (3) comparing the predicted 9487 dimensional vector generated in the step (3) with a correct answer of the question, calculating the difference between a predicted value and an actual correct value through a loss function defined by the user to form a loss value, and adjusting the parameter value of the whole network by using a BP algorithm according to the loss value so as to gradually reduce the difference between the predicted value and the actual value generated by the network until the network converges.

Claims (5)

1. An image question-answering method based on multi-target association depth reasoning is characterized by comprising the following steps:
step (1), data preprocessing and feature extraction for the image and text data
image preprocessing:
detecting the target entities contained in the image with a Faster R-CNN deep neural network structure; extracting the visual features V and the geometric features G containing the size and coordinate information of each target in the image;
text preprocessing:
counting the sentence lengths of the given question texts, and setting the maximum length of the question text according to these statistics; constructing a question text vocabulary dictionary, replacing the words of a question with their index values in the vocabulary dictionary, and then converting the question text into a vector q through an LSTM;
step (2), attention module enhanced with the candidate-box geometric features
the inputs are three features: the geometric features G of the candidate-box positions, the visual features V, and the attention weight vector m;
first encoding the attention weight vector m by rank: ordering the targets by weight, converting the ranks into vectors, mapping them to a high dimension and adding the visual features V mapped to the same dimension, and processing the output by layer normalization to obtain V_A;
then mapping the geometric features G through a linear layer followed by a ReLU activation to obtain G_R; feeding V_A and G_R into a candidate-box relation component for reasoning to obtain O_relation; passing O_relation through a linear layer and a sigmoid function and multiplying it with the original attention weight vector m to obtain a new attention weight vector m̂;
Step (3) constructing a deep neural network
Firstly, converting a problem text into an index value vector according to a vocabulary dictionary; then the vector is transmitted into a Long Short Term Memory network (LSTM) through high-dimensional mapping, the output vector q and the visual feature V obtained by using fast R-CNN are fused in a Hadamard product (Hadamard product) mode, and an attention weight vector m of each entity feature is obtained through an attention module; inputting the attention weight vector m, the visual feature V and the geometric feature G into an adaptive attention module based on the geometric feature enhancement of the candidate frame, reasoning by using the visual feature and the geometric feature of the position of the candidate frame, reordering the attention weight vector to obtain a new attention weight vector
Figure FDA0002820731830000021
Attention weight vector
Figure FDA0002820731830000022
Fusing the product with the visual feature V and then carrying out weighted average to obtain new visual features
Figure FDA0002820731830000023
Characterizing visual features
Figure FDA0002820731830000024
Generating probability through a softmax function by fusing the problem text vector q with a Hadamard product, and outputting the probability as an output predicted value of the network;
step (4), model training
training the model parameters of the neural network of step (3) with a back-propagation algorithm, according to the difference between the generated predictions and the ground-truth answers for the image, until the whole network model converges.
2. The image question-answering method based on multi-target association depth reasoning according to claim 1, characterized in that the step (1) is implemented as follows:
1-1. extracting the features of an image i with the existing deep neural network Faster R-CNN, the extracted features comprising the visual features V and the geometric features G of the k targets contained in the image, where V = {v_1, v_2, ..., v_k}, G = {g_1, g_2, ..., g_k}, k ∈ [10, 100]; the visual vector of a single target is v_i ∈ R^2048, and the geometric feature of a single target is g_i = {x, y, w, h}, with g_i ∈ R^4, where x, y, w and h are the position parameters of the geometric feature and represent, respectively, the abscissa, the ordinate, the width and the height of the candidate box in which the entity is located in the image;
1-2. for a given question text, first counting the distinct words appearing in the question texts of the data set and recording them in a dictionary; converting the words of a question into index values according to this word dictionary, so that the question text is converted into an index vector of fixed length, the specific formula being as follows:
Q = {w_1^idx, w_2^idx, ..., w_l^idx}  (formula 1)
where w_k^idx is the index value of the word w_k in the dictionary, and l represents the length of the question text.
3. The image question-answering method based on multi-objective association depth reasoning according to claim 2, wherein the adaptive attention module depth reasoning network based on candidate box geometric feature enhancement in step (2) is specifically as follows:
2-1. first processing the input attention weight vector m; sorting the attention weights m = {m_1, m_2, ..., m_k} of the targets by value and encoding the rank pos of each target into a vector PE_pos ∈ R^d, the specific formula being as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))  (formula 2)
where d is the encoding dimension, i ∈ [0, 1, ..., d/2], pos ∈ [1, 2, ..., k], yielding the matrix PE ∈ R^(k×d) based on the attention weight vector m;
2-2. passing the matrix PE and the visual features V each through a different linear layer, adding them, and processing the output by layer normalization to obtain V_A, the specific formula being as follows:
V_A = LayerNorm(W_PE · PE^T + W_V · V^T)  (formula 3)
where W_PE and W_V are the learnable parameters of the two linear layers;
2-3. performing a correlation computation on the geometric features G and passing them through a linear layer to obtain G_R, the specific formulas being as follows:
G_R = W_G · Ω(G)^T  (formula 4)
Ω(G)_{m,n} = ε_G(g_m, g_n)  (formula 5)
where m, n ∈ [1, 2, ..., k], ε_G encodes the relative geometric relation between candidate boxes m and n with the sinusoidal encoding of formula (2), so that Ω(G) ∈ R^(k×k×d_g), and W_G maps the last dimension to a single value, giving G_R ∈ R^(k×k);
2-4. feeding V_A and G_R into the relation module for reasoning to obtain O_relation, the specific formulas being as follows:
V_R = (W_K V_A)^T · (W_Q V_A)  (formula 6)
O_relation = softmax(log(G_R) + V_R) · (W_O V_A + b_O)  (formula 7)
where W_K, W_Q and W_O are learnable linear mappings, b_O is a bias term, and V_R ∈ R^(k×k) contains the pairwise similarities of the target features;
2-5. passing O_relation through a fully connected layer and a sigmoid function and multiplying the result with the original attention weight vector m to obtain a new attention weight vector m̂, the specific formula being as follows:
m̂ = m ⊙ σ(W_F · O_relation + b_F)  (formula 8)
where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and W_F and b_F are the parameters of the fully connected layer.
4. The image question-answering method based on multi-objective association depth reasoning according to claim 3, wherein the deep neural network is constructed in the step (3), and specifically comprises the following steps:
3-1. mapping the question text vector q and the visual features V to a common space by the linear transformations of fully connected layers and fusing them with a Hadamard product, F_fusion denoting the fused feature in the common space; W_r and W_q denoting the fully connected layer parameters that linearly transform the visual feature V and the current state information q, respectively; the symbol ⊙ denoting the Hadamard product of the two matrices; W_m denoting the fully connected layer parameters that reduce the dimension of the fused feature and produce the attention weight vector distribution, m ∈ R^k being the initial attention weight vector; j denoting the index of the region whose attention weight is currently computed; the specific formulas being as follows:
F_fusion^j = (W_r v_j) ⊙ (W_q q)  (formula 9)
m = softmax(W_m F_fusion + b_m)  (formula 10)
3-2. according to step (2), feeding m, V and G into the adaptive attention module enhanced with the candidate-box geometric features, reasoning over the features V and G, and reordering m to obtain a new attention feature m̂;
3-3. obtaining the new visual feature v̂ as the weighted average of the features V under the attention weights m̂, the specific formula being as follows:
v̂ = Σ_{j=1}^{k} m̂_j · v_j  (formula 11)
5. the image question-answering method based on multi-target association depth reasoning according to claim 4, wherein the model training in the step (4) is as follows:
the question-answer pairs in the VQA-v2.0 dataset are answered by multiple annotators, so the same question may have several different correct answers; previous image question-answering models treated the answer with the most votes as the only correct answer and applied one-hot encoding to it; because the correct answers are diverse, all answers to the same question are counted as votes, and the weight of each correct answer among all correct answers is determined by its number of votes; a Kullback-Leibler divergence loss function is used, with N denoting the length of the answer vocabulary, Predict denoting the predicted distribution, and GT denoting the ground-truth distribution; the loss is then defined as:
Loss = Σ_{i=1}^{N} GT_i · log(GT_i / Predict_i)  (formula 12)
CN201910398140.1A 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning Active CN110263912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910398140.1A CN110263912B (en) 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910398140.1A CN110263912B (en) 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning

Publications (2)

Publication Number Publication Date
CN110263912A CN110263912A (en) 2019-09-20
CN110263912B true CN110263912B (en) 2021-02-26

Family

ID=67914695

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910398140.1A Active CN110263912B (en) 2019-05-14 2019-05-14 Image question-answering method based on multi-target association depth reasoning

Country Status (1)

Country Link
CN (1) CN110263912B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879844B (en) * 2019-10-25 2022-10-14 北京大学 Cross-media reasoning method and system based on heterogeneous interactive learning
CN110889505B (en) * 2019-11-18 2023-05-02 北京大学 Cross-media comprehensive reasoning method and system for image-text sequence matching
CN111598118B (en) * 2019-12-10 2023-07-07 中山大学 Visual question-answering task implementation method and system
CN111553372B (en) * 2020-04-24 2023-08-08 北京搜狗科技发展有限公司 Training image recognition network, image recognition searching method and related device
CN111737458A (en) * 2020-05-21 2020-10-02 平安国际智慧城市科技股份有限公司 Intention identification method, device and equipment based on attention mechanism and storage medium
CN111611367B (en) * 2020-05-21 2023-04-28 拾音智能科技有限公司 Visual question-answering method introducing external knowledge
CN113837212B (en) * 2020-06-24 2023-09-26 四川大学 Visual question-answering method based on multi-mode bidirectional guiding attention
CN111897939B (en) * 2020-08-12 2024-02-02 腾讯科技(深圳)有限公司 Visual dialogue method, training method, device and equipment for visual dialogue model
CN112016493A (en) * 2020-09-03 2020-12-01 科大讯飞股份有限公司 Image description method and device, electronic equipment and storage medium
CN112309528B (en) * 2020-10-27 2023-04-07 上海交通大学 Medical image report generation method based on visual question-answering method
CN113010712B (en) * 2021-03-04 2022-12-02 天津大学 Visual question answering method based on multi-graph fusion
CN113094484A (en) * 2021-04-07 2021-07-09 西北工业大学 Text visual question-answering implementation method based on heterogeneous graph neural network
CN113326933B (en) * 2021-05-08 2022-08-09 清华大学 Attention mechanism-based object operation instruction following learning method and device
CN113761153B (en) * 2021-05-19 2023-10-24 腾讯科技(深圳)有限公司 Picture-based question-answering processing method and device, readable medium and electronic equipment
CN113220859A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Image-based question and answer method and device, computer equipment and storage medium
CN113392253B (en) * 2021-06-28 2023-09-29 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium
CN113515615A (en) * 2021-07-09 2021-10-19 天津大学 Visual question-answering method based on capsule self-guide cooperative attention mechanism
CN113792703B (en) * 2021-09-29 2024-02-02 山东新一代信息产业技术研究院有限公司 Image question-answering method and device based on Co-Attention depth modular network
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
CN114564958B (en) * 2022-01-11 2023-08-04 平安科技(深圳)有限公司 Text recognition method, device, equipment and medium
CN117274616B (en) * 2023-09-26 2024-03-29 南京信息工程大学 Multi-feature fusion deep learning service QoS prediction system and prediction method


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366166B2 (en) * 2017-09-07 2019-07-30 Baidu Usa Llc Deep compositional frameworks for human-like language acquisition in virtual environments
CN109829049B (en) * 2019-01-28 2021-06-01 杭州一知智能科技有限公司 Method for solving video question-answering task by using knowledge base progressive space-time attention network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160342895A1 (en) * 2015-05-21 2016-11-24 Baidu Usa Llc Multilingual image question answering
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN107766447A (en) * 2017-09-25 2018-03-06 浙江大学 It is a kind of to solve the method for video question and answer using multilayer notice network mechanism
CN108228703A (en) * 2017-10-31 2018-06-29 北京市商汤科技开发有限公司 Image answering method, device, system and storage medium
CN108108771A (en) * 2018-01-03 2018-06-01 华南理工大学 Image answering method based on multiple dimensioned deep learning
CN108829677A (en) * 2018-06-05 2018-11-16 大连理工大学 A kind of image header automatic generation method based on multi-modal attention
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN109712108A (en) * 2018-11-05 2019-05-03 杭州电子科技大学 It is a kind of that vision positioning method is directed to based on various distinctive candidate frame generation network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Kan Chen, et al., "ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering", arXiv:1511.05960v2, 2016-04-03, full text *
Ashish Vaswani, et al., "Attention Is All You Need", arXiv:1706.03762v5, 2017-12-06, full text *
Li Qing, "Research on Image Question Answering Based on Deep Neural Networks and Attention Mechanism", China Master's Theses Full-text Database, Information Science and Technology, No. 1, 2019-01-15, full text *
Yu Jun, et al., "Research on Visual Question Answering Techniques", Journal of Computer Research and Development, Vol. 55, No. 9, 2018-12-31, full text *

Also Published As

Publication number Publication date
CN110263912A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN110334705B (en) Language identification method of scene text image combining global and local information
CN107480206B (en) Multi-mode low-rank bilinear pooling-based image content question-answering method
CN112100346B (en) Visual question-answering method based on fusion of fine-grained image features and external knowledge
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN113297364B (en) Natural language understanding method and device in dialogue-oriented system
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN112464004A (en) Multi-view depth generation image clustering method
CN113204633B (en) Semantic matching distillation method and device
CN113761153A (en) Question and answer processing method and device based on picture, readable medium and electronic equipment
CN115222998B (en) Image classification method
CN114970517A (en) Visual question and answer oriented method based on multi-modal interaction context perception
CN115563327A (en) Zero sample cross-modal retrieval method based on Transformer network selective distillation
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN114241191A (en) Cross-modal self-attention-based non-candidate-box expression understanding method
CN116580440A (en) Lightweight lip language identification method based on visual transducer
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN114781503A (en) Click rate estimation method based on depth feature fusion
CN113837290A (en) Unsupervised unpaired image translation method based on attention generator network
Jiang et al. Cross-level reinforced attention network for person re-identification
CN116167014A (en) Multi-mode associated emotion recognition method and system based on vision and voice
CN114638408A (en) Pedestrian trajectory prediction method based on spatiotemporal information
Miao et al. Chinese font migration combining local and global features learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant