CN114691847B - Relation attention network vision question-answering method based on depth perception and semantic guidance - Google Patents
- Publication number
- CN114691847B (application CN202210231121.1A)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- visual
- correlation
- depth
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
- G06N3/04—Neural networks; architecture, e.g. interconnection topology
- G06T7/50—Image analysis; depth or shape recovery
- G06T7/62—Analysis of geometric attributes of area, perimeter, diameter or volume
- G06T7/70—Determining position or orientation of objects or cameras
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a relational attention network visual question-answering method based on depth perception and semantic guidance, which comprises the following steps: 1) constructing the three-dimensional spatial relationships between image targets; 2) obtaining the correlation score between image targets i and j in the spatial dimension from their three-dimensional spatial relationship; 3) obtaining the correlation between image targets i and j by combining implicit and explicit attention; 4) following the framework of the Transformer, replacing the traditional self-attention layer with the improved attention mechanism to obtain the visual question-answering model. The invention introduces three-dimensional spatial correlation into the traditional self-attention mechanism and improves the accuracy of visual question answering.
Description
Technical Field
The invention relates to natural language processing technology, and in particular to a relational attention network visual question-answering method based on depth perception and semantic guidance.
Background
Conventional visual question-answering methods are typically based on deep feature-fusion models, such as bilinear block-diagonal fusion (BLOCK) and self-attention fusion, but these methods struggle to answer complex questions that require spatial-relationship reasoning. With the advance of deep learning, many studies based on deep neural network models have focused on improving visual question-answering tasks: they typically extract image-target visual representations and word-vector representations from the image and text respectively, achieve multi-modal entity alignment through end-to-end training, and then predict answers with a multi-class classification strategy. Recently, many research works have built models on attention networks. Although these models perform excellently on visual question-answering tasks, they do not take the spatial or semantic relationships between image targets into account, which limits them on complex questions involving visual reasoning. Moreover, previous attention mechanisms only consider the correlation between image targets and text entities, not the correlation between visual spatial or semantic relationships and the text information, so such models lack understanding and reasoning capability for visual relationships.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a relational attention network vision question-answering method based on depth perception and semantic guidance.
The technical solution adopted to solve the technical problem is as follows: a relational attention network visual question-answering method based on depth perception and semantic guidance, comprising the following steps:
1) Three-dimensional spatial relationship construction between image objects
Calculating the visual relationship in two-dimensional space using the rectangular-box coordinates of the image targets: the two-dimensional spatial relationship between image targets i and j, denoted r_ij^2d, is obtained by calculation from the rectangular boxes of the two image targets,
wherein (x_i, y_i), w_i and h_i are respectively the center-point coordinates, width and height of the rectangular box of image target i;
According to the depth distance values dep_i and dep_j of the rectangular-box center points of image targets i and j, the visual relationship r_ij^dep in depth space is then calculated,
wherein the overlap term is the area of the overlapping part of rectangular boxes i and j;
According to the two-dimensional spatial relationship r_ij^2d and the depth spatial relationship r_ij^dep between image targets i and j, the three-dimensional spatial relationship r_ij between the image targets can be obtained by projecting their combination with a learnable weight W_s,
wherein d_s = 64 is the dimension of the explicit spatial-relationship representation, and σ is the activation function ReLU;
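The construction in step 1) can be sketched in code. The patent's exact box-geometry formulas are not reproduced in this text, so the 2-D encoding below (log-ratio features, as commonly used in relation networks) and the depth term are illustrative assumptions; `spatial_relation_3d`, `W_s` and the feature layout are hypothetical names.

```python
import numpy as np

def spatial_relation_3d(box_i, box_j, dep_i, dep_j, W_s):
    """Sketch of the explicit 3-D spatial relation r_ij between two targets.

    box = (x, y, w, h): rectangle center coordinates, width, height.
    The exact 2-D and depth encodings are assumptions for illustration.
    """
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    # assumed 2-D geometry features between rectangular boxes i and j
    r2d = np.array([
        np.log(abs(xj - xi) / wi + 1e-6),
        np.log(abs(yj - yi) / hi + 1e-6),
        np.log(wj / wi),
        np.log(hj / hi),
    ])
    # assumed depth relation from the center-point depth values
    rdep = np.array([dep_j - dep_i])
    # project the combined relation to d_s = 64 dimensions with
    # the ReLU activation (sigma in the text)
    return np.maximum(W_s @ np.concatenate([r2d, rdep]), 0.0)

# usage: W_s maps the 5 raw geometry features to d_s = 64 dimensions
W_s = np.random.default_rng(0).normal(size=(64, 5))
r_ij = spatial_relation_3d((10, 10, 4, 4), (14, 12, 4, 4), 1.0, 2.0, W_s)
```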
2) Depth aware and semantic guided attention mechanism
Using the explicitly modeled three-dimensional spatial relationship above, a correlation score α_ij^ex between image targets i and j in the spatial dimension is calculated from two terms, f_spa and f_sem,
wherein f_spa calculates the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the input i-th image target and the three-dimensional spatial-relationship representation r_ij;
f_sem calculates the correlation of the spatial relationship of the two image targets with the text semantics,
wherein a learnable weight parameter is used, and the text feature representation of the question is derived from the last-layer feature at the [CLS] position of the BERT model;
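A minimal sketch of the two score terms, assuming f_spa is the dot product q_i · r_ij described above and f_sem a bilinear similarity between the question's [CLS] feature and r_ij; the bilinear form, the additive combination, the softmax normalization and all names here are assumptions, not the patent's exact formulas.

```python
import numpy as np

def explicit_scores(q_i, R_i, q_text, W_sem):
    """Explicit correlation of image target i with every target j.

    q_i    : (d,)   visual feature of target i
    R_i    : (N, d) 3-D spatial relations r_ij for all j
    q_text : (t,)   [CLS] text feature of the question (e.g. from BERT)
    W_sem  : (d, t) learnable weight of the semantic-guidance term
    """
    f_spa = R_i @ q_i                 # spatial relevance: q_i . r_ij
    f_sem = R_i @ (W_sem @ q_text)    # relevance of r_ij to the question text
    s = f_spa + f_sem                 # assumed additive combination
    e = np.exp(s - s.max())           # softmax over neighbors j
    return e / e.sum()                # alpha^ex_ij, sums to 1

rng = np.random.default_rng(1)
alpha_ex = explicit_scores(rng.normal(size=8),
                           rng.normal(size=(5, 8)),
                           rng.normal(size=16),
                           rng.normal(size=(8, 16)))
```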
3) Combining implicit and explicit attention
The correlation α_ij between image targets i and j is ultimately obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^ex;
4) Attention mechanisms are incorporated into visual question-answering models
According to the framework of the Transformer, all α_ij are expressed in matrix form, i.e. the traditional self-attention layer is replaced with the improved attention mechanism α_ij; the improved Transformer applies the improved attention followed by a feed-forward network at each layer,
wherein L is the number of Transformer layers, and FFN consists of two fully connected layers, i.e. a multi-layer perceptron (MLP) with a ReLU-activated hidden layer:
FFN(X)=W2σ(W1X+b1)+b2
wherein W1, W2, b1 and b2 are learnable parameters.
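The layer update in step 4) can be sketched as follows. Residual connections and layer normalization, whose placement is not detailed in this text, are omitted, and all parameter names (`improved_layer`, `W_v`, etc.) are illustrative assumptions.

```python
import numpy as np

def ffn(X, W1, b1, W2, b2):
    """FFN(X) = W2 * ReLU(W1 X + b1) + b2, as given in the text."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def improved_layer(X, A, W_v, W1, b1, W2, b2):
    """One improved Transformer layer: the self-attention weights are
    replaced by the combined correlation matrix A = (alpha_ij)."""
    H = A @ (X @ W_v)            # attend over targets with improved weights
    return ffn(H, W1, b1, W2, b2)

rng = np.random.default_rng(2)
N, d = 5, 8
X = rng.normal(size=(N, d))
A = np.full((N, N), 1.0 / N)     # a valid row-stochastic attention matrix
out = improved_layer(X, A,
                     rng.normal(size=(d, d)),                # W_v
                     rng.normal(size=(d, 16)), np.zeros(16), # W1, b1
                     rng.normal(size=(16, d)), np.zeros(d))  # W2, b2
```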
The invention has the beneficial effects that:
1. The invention introduces three-dimensional spatial correlation into the traditional self-attention mechanism and flexibly extends it to realize explicit modeling and calculation of the three-dimensional spatial relationships between image targets;
2. By modeling the three-dimensional spatial relationships between image targets and, on this basis, designing a depth-aware and semantic-guided attention mechanism, two different attention-weight bias terms (the correlation weights of the spatial dimension and of the semantic dimension) are introduced to perform more accurate spatial-correlation calculation between the input image targets, improving the accuracy of visual question answering.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic diagram of the overall model structure of the visual question-answering method of the present invention;
Fig. 2 is a schematic structural diagram of a depth perception and semantic guidance relationship attention mechanism in the visual question-answering method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, a relational attention network visual question-answering method based on depth perception and semantic guidance comprises the following steps:
The invention provides a depth-aware and semantically guided relational attention network, which computes the spatial correlation and the semantic relevance between image targets by explicitly modeling their three-dimensional spatial relationships. The method is mainly divided into the following parts:
1) Three-dimensional spatial relationship construction between image objects
Calculating the visual relationship in two-dimensional space using the rectangular-box coordinates of the image targets: the two-dimensional spatial relationship between image targets i and j, denoted r_ij^2d, is obtained by calculation from the rectangular boxes of the two image targets,
wherein (x_i, y_i), w_i and h_i are respectively the center-point coordinates, width and height of the rectangular box of image target i;
According to the depth distance values dep_i and dep_j of the rectangular-box center points of image targets i and j, the visual relationship r_ij^dep in depth space is then calculated,
wherein the overlap term is the area of the overlapping part of rectangular boxes i and j;
According to the two-dimensional spatial relationship r_ij^2d and the depth spatial relationship r_ij^dep between image targets i and j, the three-dimensional spatial relationship r_ij between the image targets can be obtained by projecting their combination with a learnable weight W_s,
wherein d_s = 64 is the dimension of the explicit spatial-relationship representation, and σ is the activation function ReLU;
2) Depth aware and semantic guided attention mechanism
As shown in FIG. 2, using the explicitly modeled three-dimensional spatial relationship above, a correlation score α_ij^ex between image targets i and j in the spatial dimension is calculated from two terms, f_spa and f_sem,
wherein f_spa calculates the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the input i-th image target and the three-dimensional spatial-relationship representation r_ij;
f_sem calculates the correlation of the spatial relationship of the two image targets with the text semantics,
wherein a learnable weight parameter is used, and the text feature representation of the question is derived from the last-layer feature at the [CLS] position of the BERT model;
3) Combining implicit and explicit attention
The correlation α_ij between image targets i and j is ultimately obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^ex;
The original Transformer model uses an implicit-correlation self-attention mechanism to compute the correlation between inputs. Let the feature matrix composed of the RoI features of the input image targets be X ∈ R^(N×d_h), where N is the number of detected image targets and d_h is the feature dimension. To measure the implicit relations between image targets, the invention first adopts a scaled dot product f(·,·) to compute the implicit correlation between image targets i and j, and then adopts a softmax function to normalize over all image-target neighbors, obtaining the correlation score α_ij^im. Specifically, the input feature X is first mapped into the query, key and value hidden spaces, which are then used to measure the implicit correlation between two image targets, namely:
qi=Wqxi
kj=Wkxj
vj=Wvxj
Wherein W_q, W_k and W_v are learnable fully connected layer parameters; x_i and x_j are the visual features of the i-th and j-th image targets; q_i, k_j and v_j are the visual features mapped into the hidden space; f(·,·) is the scaled dot-product function; and exp(·) is the exponential function with the natural number e as base.
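The implicit scaled dot-product attention above, and its combination with an explicit matrix (step 3), can be sketched as follows; the convex-combination weight `lam` is an assumption, since this text does not give the exact weighting scheme.

```python
import numpy as np

def combined_attention(X, W_q, W_k, alpha_ex, lam=0.5):
    """Implicit scaled dot-product attention alpha^im, softmax-normalized
    over neighbors j, then weighted with the explicit matrix alpha^ex."""
    N, d_h = X.shape
    Q, K = X @ W_q, X @ W_k
    scores = Q @ K.T / np.sqrt(d_h)                    # f(q_i, k_j)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha_im = e / e.sum(axis=1, keepdims=True)        # softmax over j
    return lam * alpha_im + (1 - lam) * alpha_ex       # combined alpha_ij

rng = np.random.default_rng(3)
N, d = 4, 8
X = rng.normal(size=(N, d))
alpha_ex = np.full((N, N), 1.0 / N)   # row-stochastic explicit matrix
A = combined_attention(X, rng.normal(size=(d, d)),
                       rng.normal(size=(d, d)), alpha_ex)
```

Since each row of `alpha_im` and of `alpha_ex` sums to 1, every row of the combined matrix also sums to 1.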
By combining the implicit and explicit attention mechanisms, the invention measures the correlation between image targets in both the feature dimension and the spatial dimension. Compared with the original Transformer, which only considers the correlation of the inputs at the feature level, the invention also considers their correlation in the spatial dimension, thereby improving the ability to answer complex questions involving visual reasoning.
4) Attention mechanisms are incorporated into visual question-answering models
According to the framework of the Transformer, all α_ij are expressed in matrix form, i.e. the traditional self-attention layer is replaced with the improved attention mechanism α_ij; the improved Transformer applies the improved attention followed by a feed-forward network at each layer,
wherein L is the number of Transformer layers, and FFN consists of two fully connected layers, i.e. a multi-layer perceptron (MLP) with a ReLU-activated hidden layer:
FFN(X)=W2σ(W1X+b1)+b2
wherein W1, W2, b1 and b2 are learnable parameters.
The invention provides a neural network architecture for visual question-answering tasks that models image-target relations both implicitly and explicitly; by constructing the spatial and semantic relationships between image targets implicitly and explicitly, subsequent relational reasoning is better realized. The depth-aware and semantic-guided relational attention module is integrated into the self-attention layer of the Transformer architecture, i.e. a layer measuring the similarity between image-target spatial relationships and text semantics is added, and a new image-target correlation matrix is obtained by adjusting the original self-attention weights; this correlation matrix reflects the correlations between image targets at the relational level.
Experiments show that, compared with existing mainstream methods, the method provided by the invention achieves better results. The experiments were evaluated on two benchmark visual question-answering datasets, Visual Question Answering v2 (VQA v2) and GQA. Details of the datasets are shown in Table 1.
Table 1 dataset information
The experimental section evaluates the effectiveness of the proposed visual question-answering model on different datasets. Specifically, we report accuracy on the VQA v2 and GQA datasets as the evaluation index of the model; experimental comparison results are given in Tables 2 and 3, respectively.
TABLE 2 VQA v2 dataset comparative experiment results
Table 3 GQA dataset comparative experiment results
From the two tables above, it can be observed that the proposed method consistently outperforms all of these baseline models on the different visual question-answering tasks. Most of these models focus attention on image-target entities while neglecting the modeling of the spatial and semantic relationships of image targets, so they lack the ability to reason between image targets. By explicitly modeling the three-dimensional spatial position features of image targets and incorporating them into the neural network structure through an attention mechanism, the proposed method explicitly models the relationships between image targets and thereby realizes relational reasoning between them.
It will be understood that modifications and variations will be apparent to those skilled in the art from the foregoing description, and it is intended that all such modifications and variations be included within the scope of the following claims.
Claims (4)
1. A relational attention network visual question-answering method based on depth perception and semantic guidance, characterized by comprising the following steps:
1) Three-dimensional spatial relationship construction between image objects
1.1) Calculating a visual relationship in two-dimensional space using the rectangular-box coordinates of image targets i and j, their two-dimensional spatial relationship being expressed as r_ij^2d;
1.2) Calculating the visual relationship r_ij^dep in depth space according to the depth distance values dep_i and dep_j of the rectangular-box center points of image targets i and j;
1.3) Obtaining the three-dimensional spatial relationship r_ij between the image targets according to the two-dimensional spatial relationship r_ij^2d and the depth spatial relationship r_ij^dep between image targets i and j,
wherein W_s is a learnable weight parameter, d_s is the dimension of the explicit spatial-relationship representation, and σ is the activation function ReLU;
2) Depth aware and semantic guided attention mechanism
Acquiring the correlation score α_ij^ex between image targets i and j in the spatial dimension according to the three-dimensional spatial relationship between the image targets,
wherein f_spa is used to calculate the correlation of the two image targets in the spatial dimension and is obtained as the dot product of the visual feature q_i of the input i-th image target and the three-dimensional spatial-relationship representation r_ij;
f_sem is used to calculate the correlation of the spatial relationship of the two image targets with the text semantics,
wherein a learnable weight parameter is used, and the text feature representation of the question is derived from the last-layer feature at the [CLS] position of the BERT model;
3) Combining implicit and explicit attention
The correlation α_ij between image targets i and j is obtained by weighting the implicit correlation α_ij^im and the explicit correlation α_ij^ex;
4) Attention mechanisms are incorporated into visual question-answering models
According to the framework of the Transformer, the improved attention mechanism α_ij is adopted to replace the traditional self-attention layer, and all α_ij are expressed in matrix form; the improved Transformer applies the improved attention followed by a feed-forward network at each layer,
wherein L is the number of Transformer layers, and FFN is a multi-layer perceptron with two fully connected layers and a ReLU-activated hidden layer,
wherein its weights and biases are learnable parameters.
2. The depth perception and semantic guidance based relational attention network visual question-answering method according to claim 1, wherein in step 1.1), the two-dimensional spatial relationship r_ij^2d is obtained by calculation from the rectangular boxes of the two image targets,
wherein (x_i, y_i), w_i and h_i are respectively the center-point coordinates, width and height of the rectangular box of image target i, and (x_j, y_j), w_j and h_j are respectively the center-point coordinates, width and height of the rectangular box of image target j.
3. The depth perception and semantic guidance based relational attention network visual question-answering method according to claim 1, wherein in step 1.2), the visual relationship r_ij^dep in depth space is calculated,
wherein w_i and h_i are the width and height of the rectangular box of image target i, w_j and h_j are the width and height of the rectangular box of image target j, and the overlap term is the area of the overlapping part of the rectangular boxes of image targets i and j.
4. The depth perception and semantic guidance based relational attention network visual question-answering method according to claim 1, wherein in step 3), the implicit correlation α_ij^im is obtained by a scaled dot product followed by softmax normalization,
wherein W_q and W_k are learnable fully connected layer parameters, x_i and x_j are respectively the visual features of the i-th and j-th image targets, q_i and k_j are the visual features mapped into the hidden space, f(·,·) is the scaled dot-product function, and exp(·) is the exponential function with the natural number e as base.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210231121.1A CN114691847B (en) | 2022-03-10 | 2022-03-10 | Relation attention network vision question-answering method based on depth perception and semantic guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114691847A CN114691847A (en) | 2022-07-01 |
CN114691847B true CN114691847B (en) | 2024-04-26 |
Family
ID=82138315
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210231121.1A Active CN114691847B (en) | 2022-03-10 | 2022-03-10 | Relation attention network vision question-answering method based on depth perception and semantic guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114691847B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2711670A1 (en) * | 2012-09-21 | 2014-03-26 | Technische Universität München | Visual localisation |
CN110377710A (en) * | 2019-06-17 | 2019-10-25 | 杭州电子科技大学 | A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion |
CN111984772A (en) * | 2020-07-23 | 2020-11-24 | 中山大学 | Medical image question-answering method and system based on deep learning |
EP3920048A1 (en) * | 2020-06-02 | 2021-12-08 | Siemens Aktiengesellschaft | Method and system for automated visual question answering |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11461912B2 (en) * | 2016-01-05 | 2022-10-04 | California Institute Of Technology | Gaussian mixture models for temporal depth fusion |
US11620814B2 (en) * | 2019-09-12 | 2023-04-04 | Nec Corporation | Contextual grounding of natural language phrases in images |
Non-Patent Citations (2)
Title |
---|
Self-adaptive neural module transformer for visual question answering; Huasong Zhong et al.; IEEE Transactions on Multimedia; 2020-05-18; Vol. 23; pp. 1264-1273 *
Research on a visual question-answering model based on composite image-text features; Qiu Nan et al.; Application Research of Computers; 2021-04-23; Vol. 38, No. 08; pp. 2293-2298 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Gan et al. | Sparse attention based separable dilated convolutional neural network for targeted sentiment analysis | |
CN110222770B (en) | Visual question-answering method based on combined relationship attention network | |
CN111242197B (en) | Image text matching method based on double-view semantic reasoning network | |
Jiang et al. | An eight-layer convolutional neural network with stochastic pooling, batch normalization and dropout for fingerspelling recognition of Chinese sign language | |
Zeng et al. | Fine-grained image retrieval via piecewise cross entropy loss | |
CN114936623B (en) | Aspect-level emotion analysis method integrating multi-mode data | |
CN105976397B (en) | A kind of method for tracking target | |
CN112949622A (en) | Bimodal character classification method and device fusing text and image | |
CN114612767B (en) | Scene graph-based image understanding and expressing method, system and storage medium | |
CN112784782A (en) | Three-dimensional object identification method based on multi-view double-attention network | |
CN116611024A (en) | Multi-mode trans mock detection method based on facts and emotion oppositivity | |
CN116484042A (en) | Visual question-answering method combining autocorrelation and interactive guided attention mechanism | |
CN106021402A (en) | Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval | |
CN116187349A (en) | Visual question-answering method based on scene graph relation information enhancement | |
CN110889340A (en) | Visual question-answering model based on iterative attention mechanism | |
Zheng et al. | BDLA: Bi-directional local alignment for few-shot learning | |
CN117393098A (en) | Medical image report generation method based on visual priori and cross-modal alignment network | |
CN114691847B (en) | Relation attention network vision question-answering method based on depth perception and semantic guidance | |
CN116958740A (en) | Zero sample target detection method based on semantic perception and self-adaptive contrast learning | |
CN116071410A (en) | Point cloud registration method, system, equipment and medium based on deep learning | |
Zhao et al. | Episode-based personalization network for gaze estimation without calibration | |
CN114332623A (en) | Method and system for generating countermeasure sample by utilizing spatial transformation | |
Yu et al. | Research on folk handicraft image recognition based on neural networks and visual saliency | |
Wenhui et al. | Lidar image classification based on convolutional neural networks | |
Liu | Evaluation Algorithm of Teaching Work Quality in Colleges and Universities Based on Deep Denoising Autoencoder Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||