CN114691847A - Relational attention network visual question-answering method based on depth perception and semantic guidance

Relational attention network visual question-answering method based on depth perception and semantic guidance

Info

Publication number
CN114691847A
CN114691847A (application CN202210231121.1A)
Authority
CN
China
Prior art keywords
image
attention
correlation
visual
namely
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210231121.1A
Other languages
Chinese (zh)
Other versions
CN114691847B (en)
Inventor
魏巍
刘宇航
彭道万
刘逸帆
潘为燃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN202210231121.1A priority Critical patent/CN114691847B/en
Publication of CN114691847A publication Critical patent/CN114691847A/en
Application granted granted Critical
Publication of CN114691847B publication Critical patent/CN114691847B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/60 Analysis of geometric attributes
    • G06T 7/62 Analysis of geometric attributes of area, perimeter, diameter or volume
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Geometry (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a relational attention network visual question-answering method based on depth perception and semantic guidance, which comprises the following steps: 1) construct the three-dimensional spatial relationship between image targets; 2) from this three-dimensional spatial relationship, obtain the correlation score of image targets i and j in the spatial dimension; 3) obtain the correlation between image targets i and j by combining implicit attention and explicit attention; 4) following the Transformer framework, replace the conventional self-attention layer with the improved attention mechanism to obtain the visual question-answering model. The invention introduces three-dimensional spatial correlation into the conventional self-attention mechanism and improves the accuracy of visual question answering.

Description

Relational attention network visual question-answering method based on depth perception and semantic guidance
Technical Field
The invention relates to natural language processing technology, and in particular to a relational attention network visual question-answering method based on depth perception and semantic guidance.
Background
Conventional visual question-answering methods are generally based on deep feature-fusion models, such as bilinear block-diagonal fusion (BLOCK) and self-attention fusion, but these methods have difficulty answering complex questions that require spatial-relationship reasoning. With the progress of deep learning, many studies based on deep neural network models have sought to improve the visual question-answering task. These methods generally extract visual representations of image targets and word-vector representations from the image and the text respectively as input, achieve multi-modal entity alignment through end-to-end training, and then predict the answer with a multi-class classification strategy. Recently, much research has built models on attention networks; although such models show excellent performance on the visual question-answering task, they do not consider the spatial or semantic relationships between image targets, and are therefore limited on complex questions that involve visual reasoning.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects in the prior art, a relational attention network visual question-answering method based on depth perception and semantic guidance.
The technical solution adopted by the invention to solve this technical problem is as follows: a relational attention network visual question-answering method based on depth perception and semantic guidance, comprising the following steps:
1) Construction of the three-dimensional spatial relationship between image targets
The visual relationship in two-dimensional space is computed from the rectangular-box coordinates of the image targets. For image targets i and j this two-dimensional spatial relationship is denoted r_ij^2D; it is derived from the rectangular boxes of the two image targets by a fixed geometric calculation (formula given as an equation image in the original), where x_i, y_i, w_i and h_i denote the center-point coordinates, width and height of the rectangular box of image target i.
From the depth values dep_i and dep_j at the center points of the rectangular boxes of image targets i and j, the visual relationship in depth space, denoted r_ij^dep, is then computed (formula given as an equation image in the original), where S_ij denotes the area of the overlapping part of rectangular boxes i and j.
From the two-dimensional spatial relationship r_ij^2D and the depth spatial relationship r_ij^dep, the three-dimensional spatial relationship between the image targets, r_ij^3D, is obtained as
r_ij^3D = σ(W [r_ij^2D ; r_ij^dep]),
where W is a learnable weight parameter, d_s = 64 is the dimension of the explicit spatial-relationship representation, and σ is the ReLU activation function;
2) depth perception and semantic guidance attention mechanism
Using the explicitly modelled three-dimensional spatial relationship above, a correlation score in the spatial dimension between image targets i and j, denoted α_ij^spa, is computed by combining two terms. The term f_spa measures the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the i-th image target with the three-dimensional spatial-relationship representation r_ij^3D. The term f_sem measures the correlation between the spatial relationship of the two image targets and the text semantics; it uses a learnable weight parameter and the textual feature representation s of the question, taken from the last-layer feature of the [CLS] position of a BERT model;
3) combining implicit attention and explicit attention
The final correlation α_ij between image targets i and j is obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^spa;
4) attention mechanism incorporated into visual question-answering model
Following the Transformer framework, the conventional self-attention layer is replaced by the improved attention mechanism α_ij, with all α_ij collected into a matrix A. The improved Transformer is then computed layer by layer on this matrix (the layer update is given as equation images in the original), where L is the number of Transformer layers and FFN is a multilayer perceptron (MLP) consisting of two fully connected layers with a ReLU-activated hidden layer, i.e.
FFN(X) = W2 σ(W1 X + b1) + b2,
where W1, b1, W2 and b2 are learnable parameters.
The invention has the following beneficial effects:
1. The invention introduces three-dimensional spatial correlation into the conventional self-attention mechanism and flexibly extends it, realizing explicit modelling and computation of the three-dimensional spatial relationships between image targets;
2. By modelling the three-dimensional spatial relationships between image targets and, on this basis, designing a depth-perception and semantic-guidance attention mechanism, two attention-weight bias terms, the correlation weights of the spatial dimension and of the semantic dimension, are introduced so that a more accurate spatial-correlation computation is performed between the input image targets, which improves the accuracy of visual question answering.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic diagram of an overall model structure of the visual question answering method of the present invention;
FIG. 2 is a schematic structural diagram of a depth perception and semantic guidance relationship attention mechanism in the visual question-answering method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, a relational attention network visual question-answering method based on depth perception and semantic guidance comprises the following steps.
The invention provides a depth-perception and semantically guided relational attention network which, by explicitly modelling the three-dimensional spatial relationships between image targets, computes both a spatial correlation and a semantic correlation between the image targets. The method mainly comprises the following parts:
1) Construction of the three-dimensional spatial relationship between image targets
The visual relationship in two-dimensional space is computed from the rectangular-box coordinates of the image targets. For image targets i and j this two-dimensional spatial relationship is denoted r_ij^2D; it is derived from the rectangular boxes of the two image targets by a fixed geometric calculation (formula given as an equation image in the original), where x_i, y_i, w_i and h_i denote the center-point coordinates, width and height of the rectangular box of image target i.
From the depth values dep_i and dep_j at the center points of the rectangular boxes of image targets i and j, the visual relationship in depth space, denoted r_ij^dep, is then computed (formula given as an equation image in the original), where S_ij denotes the area of the overlapping part of rectangular boxes i and j.
From the two-dimensional spatial relationship r_ij^2D and the depth spatial relationship r_ij^dep, the three-dimensional spatial relationship between the image targets, r_ij^3D, is obtained as
r_ij^3D = σ(W [r_ij^2D ; r_ij^dep]),
where W is a learnable weight parameter, d_s = 64 is the dimension of the explicit spatial-relationship representation, and σ is the ReLU activation function.
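For illustration, the following minimal Python (NumPy) sketch shows how step 1) can be realized. The exact two-dimensional and depth formulas are given only as equation images in the original text, so a widely used relative-geometry encoding and a simple depth-difference/overlap feature are assumed here as stand-ins; only the fusion r_ij^3D = σ(W [r_ij^2D ; r_ij^dep]) with d_s = 64 and σ = ReLU follows the stated definition.

import numpy as np

rng = np.random.default_rng(0)
d_s = 64  # dimension of the explicit spatial-relationship representation (stated)

def relation_2d(box_i, box_j, eps=1e-6):
    # Boxes are (cx, cy, w, h).  The patent gives the 2-D formula only as an
    # image; the common relative-geometry encoding is assumed as a stand-in.
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    return np.array([np.log(abs(xi - xj) / wi + eps),
                     np.log(abs(yi - yj) / hi + eps),
                     np.log(wj / wi),
                     np.log(hj / hi)])

def relation_depth(dep_i, dep_j, box_i, box_j, eps=1e-6):
    # Depth relation from the center-point depths dep_i, dep_j and the overlap
    # area of the two boxes (assumed form; the original formula is an image).
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    ix = max(0.0, min(xi + wi / 2, xj + wj / 2) - max(xi - wi / 2, xj - wj / 2))
    iy = max(0.0, min(yi + hi / 2, yj + hj / 2) - max(yi - hi / 2, yj - hj / 2))
    overlap = ix * iy
    return np.array([dep_i - dep_j, overlap / (wi * hi + eps)])

def relation_3d(box_i, box_j, dep_i, dep_j, W):
    # r3d_ij = ReLU(W [r2d_ij ; rdep_ij]), as stated in step 1.3);
    # W is the learnable weight, here of shape (d_s, 6).
    r = np.concatenate([relation_2d(box_i, box_j),
                        relation_depth(dep_i, dep_j, box_i, box_j)])
    return np.maximum(W @ r, 0.0)

W = rng.normal(scale=0.1, size=(d_s, 6))             # placeholder for the learned weight
box_a, box_b = (50.0, 60.0, 40.0, 30.0), (80.0, 65.0, 20.0, 25.0)
r3d_ab = relation_3d(box_a, box_b, dep_i=3.2, dep_j=5.1, W=W)
print(r3d_ab.shape)                                   # (64,)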
2) Depth-perception and semantic-guidance attention mechanism
As shown in fig. 2, the explicitly modelled three-dimensional spatial relationship above is used to compute a correlation score in the spatial dimension between image targets i and j, denoted α_ij^spa, by combining two terms. The term f_spa measures the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the i-th image target with the three-dimensional spatial-relationship representation r_ij^3D. The term f_sem measures the correlation between the spatial relationship of the two image targets and the text semantics; it uses a learnable weight parameter and the textual feature representation s of the question, taken from the last-layer feature of the [CLS] position of a BERT model.
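A corresponding sketch of step 2) is given below. The source specifies that f_spa is a dot product between the visual feature q_i and r_ij^3D, and that f_sem relates r_ij^3D to the BERT [CLS] feature s of the question through a learnable weight; the projection matrices W_spa and W_sem and the additive combination of the two terms are assumptions, since the exact formulas appear only as equation images.

import numpy as np

rng = np.random.default_rng(1)
d_h, d_s = 512, 64            # feature dimension (assumed) and relation dimension (stated)

W_spa = rng.normal(scale=0.02, size=(d_h, d_s))   # assumed projection of r3d for f_spa
W_sem = rng.normal(scale=0.02, size=(d_h, d_s))   # assumed learnable weight for f_sem

def f_spa(q_i, r3d_ij):
    # correlation in the spatial dimension: dot product of the visual feature
    # q_i with (a projection of) the 3-D relation, as described in step 2).
    return float(q_i @ (W_spa @ r3d_ij))

def f_sem(s, r3d_ij):
    # semantic guidance: correlation between the question's BERT [CLS] feature s
    # and the 3-D relation; a projected dot product is assumed here.
    return float(s @ (W_sem @ r3d_ij))

def alpha_spa(q_i, s, r3d_ij):
    # explicit correlation score of image targets i and j; the combination
    # of the two terms is assumed to be additive.
    return f_spa(q_i, r3d_ij) + f_sem(s, r3d_ij)

q_i = rng.normal(size=d_h)      # RoI visual feature of image target i
s = rng.normal(size=d_h)        # last-layer [CLS] feature of the question
r3d_ij = rng.normal(size=d_s)   # output of step 1)
print(round(alpha_spa(q_i, s, r3d_ij), 4))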
3) Combining implicit attention and explicit attention
The final correlation α_ij between image targets i and j is obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^spa.
The original Transformer model uses an implicitly correlated self-attention mechanism to compute the correlation between input entries. Let the feature matrix formed by the RoI features of the input image targets be X ∈ R^(N×d_h), where N is the number of detected image targets and d_h is the feature dimension. To measure the implicit relationship between image targets, the invention first uses a scaled dot product f(·,·) to compute the implicit correlation between image targets i and j, and then applies a softmax function over all image-target neighbours to normalize it into the correlation score α_ij^im. Specifically, the input features X are first mapped into the query, key and value hidden spaces and then used to measure the implicit correlation between two image targets, namely:
q_i = W_q x_i, k_j = W_k x_j, v_j = W_v x_j,
f(q_i, k_j) = q_i · k_j / sqrt(d_h),
α_ij^im = exp(f(q_i, k_j)) / Σ_j' exp(f(q_i, k_j')),
where W_q, W_k and W_v are learnable fully connected layer parameters, x_i and x_j are the visual features of the i-th and j-th image targets, q_i, k_j and v_j are the visual features mapped into the hidden space, f(·,·) is the scaled dot-product function, and exp(·) is the exponential function with base e.
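The implicit branch follows the standard scaled dot-product self-attention described above; the sketch below reproduces it with NumPy and then mixes it with a stand-in explicit score matrix using the average weighting stated in claim 5.

import numpy as np

rng = np.random.default_rng(2)
N, d_h = 5, 512                  # number of detected image targets, feature dimension

X = rng.normal(size=(N, d_h))    # RoI feature matrix of the image targets
W_q = rng.normal(scale=0.02, size=(d_h, d_h))
W_k = rng.normal(scale=0.02, size=(d_h, d_h))

def implicit_attention(X):
    # scaled dot-product self-attention over the image targets:
    # q_i = W_q x_i, k_j = W_k x_j, then a softmax over all neighbours j.
    Q, K = X @ W_q.T, X @ W_k.T
    scores = Q @ K.T / np.sqrt(d_h)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(scores)
    return E / E.sum(axis=1, keepdims=True)          # alpha^im_ij

A_im = implicit_attention(X)

# step 3): combine with the explicit (depth- and semantics-aware) scores;
# a row-normalized random matrix stands in for alpha^spa, and the 0.5/0.5
# mix follows the average weighting stated in claim 5.
A_spa = rng.random((N, N))
A_spa = A_spa / A_spa.sum(axis=1, keepdims=True)
A = 0.5 * A_im + 0.5 * A_spa
print(A.sum(axis=1))             # each row still sums to 1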
By combining the implicit and explicit attention mechanisms in this way, the relevance between image targets is measured both in the feature dimension and in the spatial dimension. Compared with the original Transformer, which only considers the feature-level relevance of its input, the method also considers the relevance of the input in the spatial dimension, which improves its ability to answer complex questions involving visual reasoning.
4) Attention mechanism incorporated into visual question-answering model
Following the Transformer framework, the conventional self-attention layer is replaced by the improved attention mechanism α_ij, with all α_ij collected into a matrix A ∈ R^(N×N). The improved Transformer is then computed layer by layer on this matrix (the layer update is given as equation images in the original), where L is the number of Transformer layers and FFN is a multilayer perceptron (MLP) consisting of two fully connected layers with a ReLU-activated hidden layer, i.e.
FFN(X) = W2 σ(W1 X + b1) + b2,
where W1, b1, W2 and b2 are learnable parameters.
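Finally, a sketch of the modified Transformer layer of step 4): the learned correlation matrix A (all α_ij) replaces the self-attention weights, and the stated FFN(X) = W2 σ(W1 X + b1) + b2 is applied. The residual connections and the exact placement of the value projection are assumptions, since the layer update is given only as equation images.

import numpy as np

rng = np.random.default_rng(3)
N, d_h, d_ff, L = 5, 512, 1024, 4     # targets, hidden dim, FFN dim (assumed), layers

def ffn(X, W1, b1, W2, b2):
    # FFN(X) = W2 * ReLU(W1 X + b1) + b2, exactly as stated in step 4)
    return np.maximum(X @ W1.T + b1, 0.0) @ W2.T + b2

def improved_layer(X, A, params):
    # One improved Transformer layer: the self-attention weights are replaced
    # by the combined correlation matrix A (all alpha_ij).  The value
    # projection and residual connections are assumptions.
    W_v, W1, b1, W2, b2 = params
    H = A @ (X @ W_v.T) + X            # relation-weighted values + residual
    return ffn(H, W1, b1, W2, b2) + H  # position-wise FFN + residual

def init_params():
    return (rng.normal(scale=0.02, size=(d_h, d_h)),   # W_v
            rng.normal(scale=0.02, size=(d_ff, d_h)),  # W1
            np.zeros(d_ff),                            # b1
            rng.normal(scale=0.02, size=(d_h, d_ff)),  # W2
            np.zeros(d_h))                             # b2

X = rng.normal(size=(N, d_h))          # RoI features of the image targets
A = np.full((N, N), 1.0 / N)           # stand-in for the combined alpha matrix
for _ in range(L):                     # stack L layers as described
    X = improved_layer(X, A, init_params())
print(X.shape)                         # (5, 512)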
The invention thus provides a neural network architecture for the visual question-answering task that models both implicit and explicit relationships between image targets; by constructing the spatial and semantic relationships between image targets implicitly and explicitly, subsequent relational reasoning is better supported. The depth-perception and semantic-guidance relation attention module proposed by the invention is incorporated into the self-attention layer of the Transformer framework: a layer measuring the similarity between the spatial relationships of image targets and the text semantics is added, and the original self-attention weights are adjusted to obtain a new image-target correlation matrix, which reflects the correlation between image targets at the relational level.
Experiments show that, compared with existing mainstream methods, the method provided by the invention achieves better results. The experiments are evaluated on two benchmark visual question-answering datasets, Visual Question Answering v2 (VQA v2) and GQA. Detailed information about the datasets is shown in Table 1.
Table 1. Dataset information
The experimental section evaluates the effectiveness of the proposed visual question-answering model on the different datasets. Specifically, accuracy on the VQA v2 and GQA datasets is used as the evaluation metric, and the comparison results are given in Table 2 and Table 3, respectively.
Table 2. Comparison of experimental results on the VQA v2 dataset
Table 3. Comparison of experimental results on the GQA dataset
It is noteworthy that, as the two tables show, the proposed method consistently outperforms all of the reference models on the different visual question-answering tasks. Most of these models focus attention on the image-target entities themselves and ignore the modelling of the spatial and semantic relationships between image targets, so they lack the ability to reason across image targets. By explicitly modelling the three-dimensional spatial position features of the image targets and incorporating the proposed attention mechanism into the neural network structure, the method can explicitly model the relationships between image targets and thereby support relational reasoning between them.
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (5)

1. A relational attention network visual question-answering method based on depth perception and semantic guidance, characterized by comprising the following steps:
1) construction of the three-dimensional spatial relationship between image targets
1.1) using the rectangular-box coordinates of the image targets, compute the visual relationship in two-dimensional space; for image targets i and j, this two-dimensional spatial relationship is denoted r_ij^2D;
1.2) from the depth values dep_i and dep_j at the center points of the rectangular boxes of image targets i and j, compute the visual relationship in depth space, r_ij^dep;
1.3) from the two-dimensional spatial relationship r_ij^2D and the depth spatial relationship r_ij^dep between image targets i and j, obtain the three-dimensional spatial relationship between the image targets, r_ij^3D, namely:
r_ij^3D = σ(W [r_ij^2D ; r_ij^dep]),
wherein W is a learnable weight parameter, d_s is the dimension of the explicit spatial-relationship representation, and σ is the ReLU activation function;
2) depth-perception and semantic-guidance attention mechanism
from the three-dimensional spatial relationship between the image targets, obtain the correlation score α_ij^spa of image targets i and j in the spatial dimension, wherein f_spa computes the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the i-th image target with the three-dimensional spatial-relationship representation r_ij^3D, and f_sem computes the correlation between the spatial relationship of the two image targets and the text semantics, using a learnable weight parameter and the textual feature representation s of the question, taken from the last-layer feature of the [CLS] position of a BERT model;
3) combining implicit attention and explicit attention
the correlation α_ij between image targets i and j is obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^spa;
4) incorporating the attention mechanism into the visual question-answering model
following the Transformer framework, the conventional self-attention layer is replaced by the improved attention mechanism α_ij, with all α_ij expressed in matrix form as A; the improved Transformer is computed accordingly, wherein L is the number of Transformer layers and FFN is a multilayer perceptron consisting of two fully connected layers with a ReLU-activated hidden layer, namely:
FFN(X) = W2 σ(W1 X + b1) + b2,
wherein W1, b1, W2 and b2 are learnable parameters.
2. The relational attention network visual question-answering method based on depth perception and semantic guidance according to claim 1, characterized in that, in step 1.1), the two-dimensional spatial relationship r_ij^2D is computed from the rectangular boxes of the two image targets, wherein x_i, y_i, w_i and h_i respectively denote the center-point coordinates, width and height of the rectangular box of image target i, and x_j, y_j, w_j and h_j respectively denote the center-point coordinates, width and height of the rectangular box of image target j.
3. The relational attention network visual question-answering method based on depth perception and semantic guidance according to claim 1, characterized in that, in step 1.2), the visual relationship in depth space r_ij^dep is computed, wherein w_i and h_i are the width and height of the rectangular box of image target i, w_j and h_j are the width and height of the rectangular box of image target j, and S_ij is the area of the overlapping part of the rectangular box of image target i and the rectangular box of image target j.
4. The relational attention network visual question-answering method based on depth perception and semantic guidance according to claim 1, characterized in that, in step 3), the implicit correlation α_ij^im is computed with the self-attention mechanism of the Transformer model, specifically:
q_i = W_q x_i, k_j = W_k x_j, v_j = W_v x_j,
α_ij^im = exp(f(q_i, k_j)) / Σ_j' exp(f(q_i, k_j')),
wherein W_q, W_k and W_v are learnable fully connected layer parameters, x_i and x_j are the visual features of the i-th and j-th image targets, q_i, k_j and v_j are the visual features mapped into the hidden space, f(·,·) is the scaled dot-product function, and exp(·) is the exponential function with base e.
5. The relational attention network visual question-answering method based on depth perception and semantic guidance according to claim 1, characterized in that, in step 3), the correlation α_ij between image targets i and j is obtained by average weighting of the implicit correlation α_ij^im and the explicit correlation α_ij^spa, namely:
α_ij = (α_ij^im + α_ij^spa) / 2.
CN202210231121.1A 2022-03-10 2022-03-10 Relation attention network vision question-answering method based on depth perception and semantic guidance Active CN114691847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210231121.1A CN114691847B (en) 2022-03-10 2022-03-10 Relation attention network vision question-answering method based on depth perception and semantic guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210231121.1A CN114691847B (en) 2022-03-10 2022-03-10 Relation attention network vision question-answering method based on depth perception and semantic guidance

Publications (2)

Publication Number Publication Date
CN114691847A true CN114691847A (en) 2022-07-01
CN114691847B CN114691847B (en) 2024-04-26

Family

ID=82138315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210231121.1A Active CN114691847B (en) 2022-03-10 2022-03-10 Relation attention network vision question-answering method based on depth perception and semantic guidance

Country Status (1)

Country Link
CN (1) CN114691847B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2711670A1 (en) * 2012-09-21 2014-03-26 Technische Universität München Visual localisation
US20180322646A1 (en) * 2016-01-05 2018-11-08 California Institute Of Technology Gaussian mixture models for temporal depth fusion
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN111984772A (en) * 2020-07-23 2020-11-24 中山大学 Medical image question-answering method and system based on deep learning
US20210081728A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Contextual grounding of natural language phrases in images
EP3920048A1 (en) * 2020-06-02 2021-12-08 Siemens Aktiengesellschaft Method and system for automated visual question answering

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2711670A1 (en) * 2012-09-21 2014-03-26 Technische Universität München Visual localisation
US20180322646A1 (en) * 2016-01-05 2018-11-08 California Institute Of Technology Gaussian mixture models for temporal depth fusion
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
US20210081728A1 (en) * 2019-09-12 2021-03-18 Nec Laboratories America, Inc Contextual grounding of natural language phrases in images
EP3920048A1 (en) * 2020-06-02 2021-12-08 Siemens Aktiengesellschaft Method and system for automated visual question answering
CN111984772A (en) * 2020-07-23 2020-11-24 中山大学 Medical image question-answering method and system based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUASONG ZHONG ET AL.: "Self-adaptive neural module transformer for visual question answering", IEEE Transactions on Multimedia, vol. 23, 18 May 2020 (2020-05-18), pages 1264-1273, XP011851901, DOI: 10.1109/TMM.2020.2995278 *
QIU NAN ET AL.: "Research on visual question answering model based on composite image-text features" (in Chinese), Application Research of Computers, vol. 38, no. 08, 23 April 2021 (2021-04-23), pages 2293-2298 *

Also Published As

Publication number Publication date
CN114691847B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
Gan et al. Sparse attention based separable dilated convolutional neural network for targeted sentiment analysis
CN110263912B (en) Image question-answering method based on multi-target association depth reasoning
CN113610126B (en) Label-free knowledge distillation method based on multi-target detection model and storage medium
CN109766427B (en) Intelligent question-answering method based on collaborative attention for virtual learning environment
CN113191357B (en) Multilevel image-text matching method based on graph attention network
CN110309503A (en) A kind of subjective item Rating Model and methods of marking based on deep learning BERT--CNN
CN111897944B (en) Knowledge graph question-answering system based on semantic space sharing
CN111242197B (en) Image text matching method based on double-view semantic reasoning network
Jiang et al. An eight-layer convolutional neural network with stochastic pooling, batch normalization and dropout for fingerspelling recognition of Chinese sign language
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN111563149A (en) Entity linking method for Chinese knowledge map question-answering system
CN111611367B (en) Visual question-answering method introducing external knowledge
CN114973125A (en) Method and system for assisting navigation in intelligent navigation scene by using knowledge graph
CN112632250A (en) Question and answer method and system under multi-document scene
CN116611024A (en) Multi-mode trans mock detection method based on facts and emotion oppositivity
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
Yu et al. Question classification based on MAC-LSTM
CN114595306A (en) Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
Zhou et al. Stock prediction based on bidirectional gated recurrent unit with convolutional neural network and feature selection
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN110889340A (en) Visual question-answering model based on iterative attention mechanism
CN116303976B (en) Penetration test question-answering method, system and medium based on network security knowledge graph
CN114691847A (en) Relational attention network visual question-answering method based on deep perception and semantic guidance
Tian et al. Scene graph generation by multi-level semantic tasks
Xin et al. Knowledge-based intelligent education recommendation system with IoT networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant