CN114677580B

CN114677580B - Image description method based on self-adaptive enhanced self-attention network

Info

Publication number: CN114677580B
Application number: CN202210586762.9A
Authority: CN
Inventors: 毛震东; 张勇东; 李经宇
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2022-05-27
Filing date: 2022-05-27
Publication date: 2022-09-30
Anticipated expiration: 2042-05-27
Also published as: CN114677580A

Abstract

The invention relates to the technical field of image description, and discloses an image description method based on a self-adaptive enhanced self-attention network, which can be used for modeling the relationship on a geometric level and a semantic level together, and can be used for adaptively enhancing the visual relationship in a given image to realize high-precision credible image description generation when a definite geometric or semantic relationship exists between two objects in the image.

Description

Image description method based on self-adaptive enhanced self-attention network

Technical Field

The invention relates to the technical field of image description, in particular to an image description method based on a self-adaptive enhanced self-attention network.

Background

The image description aims to automatically generate a sentence description for a given image, can well combine vision and language, and is an important multi-modal task. The generated image description is to not only identify objects of interest in the image, but also to describe the relationships between the objects. For image description, a key challenge is how to accurately and efficiently model the relationships between identified objects, which is crucial to improve the quality of the generation. Recently, geometric information has been extensively studied to enhance features at the regional level, because geometric features, i.e., relative distance and relative size, contain a clear positional relationship between objects.

Visual relationships include geometric positions and semantic interactions, which indicate the correlation between objects in the region-level representation, but previous work only uses geometric positions to enhance the representation of the visual relationships, and only shallow position information cannot cover the semantic relationships with complex action. Therefore, the prior art is limited in that the image description model has difficulty generating descriptions with accurate semantic relationship prediction credible.

Disclosure of Invention

In order to solve the technical problem, the invention provides an image description method based on an adaptive enhanced self-attention network.

In order to solve the technical problem, the invention adopts the following technical scheme:

an image description method based on an adaptive enhanced self-attention network comprises the following steps:

the method comprises the following steps: for a given scene image, a semantic relation graph S is constructed by using a scene graph extractor, the regional characteristics of an interested target in the scene image are detected through a pre-trained Faster-RCNN model, a regional characteristic embedding vector H is obtained through calculation, and a geometric relation graph G is constructed through the relative geometric characteristics of a target boundary box;

step two: the method for enhancing the attention mechanism through the adaptive relationship comprises the following steps of:

enhancing attention by calculating relationships

；

Wherein,

in order to be the region feature weight,

in order to enhance the attention of the semantics,

to enhance attention for geometry;

and

respectively, using projection matrices based on regional characteristics

And

the generated query region feature vector and the key value region feature vector are as follows:

；

wherein,

；

and

is a projection matrix based on a semantic relationship,

for the projected query semantic relationship vector,

obtaining a key value semantic relation vector for projection;

wherein,

；

in order to be a projection matrix based on a geometric relationship,

keys derived for projectionA value geometric relationship vector;

step three: the semantic relation and the geometric relation are enhanced in a self-adaptive weight distribution mode to obtain self-adaptive relation enhanced attention

；

Wherein the gating value

，

、

And

respectively using projection matrices

、

And

generating a truth value region feature vector, a truth value semantic relation vector and a truth value geometric relation vector:

；

an output weight representing the semantic relationship is determined,

the output weights representing the geometric relationship,

is a learnable parameter;

step four: and inputting the characteristics of the last layer of the encoder into a transform decoder, and outputting the description of the scene image.

Specifically, when the scene graph extractor is used to construct the semantic relation graph S in the first step, the scene graph analyzer is used to extract semantic relation triples for each pair of targets of the scene image, then a word embedding layer is used to convert the semantic relation triples into semantic embedding vectors, and finally all the semantic embedding vectors are constructed into a high-dimensional directed semantic vector matrix, namely the semantic relation graph S.

Specifically, when the region feature embedding vector H is calculated in the step one, the visual information of the target of interest in the scene image is coded into the region feature by using the pre-trained Faster R-CNN model

Meanwhile, obtaining the category C of the target, and then adopting a semantic embedding layer to map the category C into a word embedding vector of the target category

Calculating region feature embedding vector for obtaining context information of region feature

Wherein

is a feed-forward network with a ReLU activation function]A feature stitching operation is represented.

Specifically, when the geometric relationship graph G is constructed in the first step, the relative geometric features between the bounding boxes of the object i and the object j are first calculated

：

，

Wherein,

for a four-dimensional vector containing relative distances and magnitudes,

respectively is the central abscissa, central ordinate, width and height of the target i bounding box,

respectively the central abscissa, central ordinate, width and height of the bounding box of the target j;

relative geometric features

And inputting a position coding function to obtain a high-dimensional geometric relationship matrix, namely the geometric relationship graph G.

In particular, a cross entropy loss function is employed

Training the image description method:

，

wherein,

to adaptively enhance the parameters of the adaptive attention network,

is the target true value sequence at the current time,

for previously generated word sequences;

then, reinforcement learning is utilized

Additional model training was performed:

；

wherein the prize is awarded

For CIDER-D scores, E for expected rewards, and the goal of additional model training is based on parameters

And generating word sequences at different training moments

Generating a reward

A CIDER-D score of (1), maximizing the expected reward E for generating a word;

in the training process, first several rounds of cross entropy loss training are used, and then reinforcement learning is used

And further optimizing until convergence.

Compared with the prior art, the invention has the beneficial technical effects that:

in order to solve the problem of inaccurate relation generation among objects in the image description task, the invention can jointly model the relation on the geometric level and the semantic level, enrich the characteristic representation of the visual relation and realize the more accurate and higher-quality image description generation effect. In particular, when a definite geometric or semantic relation exists between two objects in an image, the method can adaptively enhance the visual relation in the given image and realize high-precision credible image description generation.

Drawings

Fig. 1 is an overall schematic diagram of an adaptive enhanced adaptive attention network according to the present invention.

Detailed Description

A preferred embodiment of the present invention will be described in detail below with reference to the accompanying drawings.

Fig. 1 shows the overall network structure. For a given natural scene image, the invention obtains the final image description through the processing of three modules. First, (a) feature extraction: the scene graph extractor is used for constructing a semantic relation graph, and the pre-trained object detector fast-RCNN is used for detecting the interested region and the bounding box to construct a geometric relation graph. Second, (b) an encoder with adaptive relationship enhancement attention mechanism: 1) the direction sensitive semantic enhancement considers the bidirectional association of the region features to the semantic relations and the semantic relations to the region features, and the region features and the semantic relations are used for representing complete triple (subject-predicate-object) information; 2) the geometric relationship enhancement dynamically calculates the association between the regional features and the geometric features; 3) adaptive relationship weight assignment adaptively enhances different relationship characteristics. Finally, (c) language-generated predictions: a trusted image description with accurate relationships is generated using a translator decoder. The detailed description is as follows:

1. feature extraction

The present invention uses regional, semantic, and geometric relational graph features of objects to collectively represent complex relationships between objects. The target visual features have rich object detail information, which is important for image understanding; the semantic relation graph represents action behaviors among the objects; a geometric map reflecting the spatial pattern of a single object may supplement the visual information.

1.1 regional characteristics

The invention firstly utilizes a pre-training target detector Faster R-CNN model to encode visual information of a series of interested targets into regional characteristics

And meanwhile, obtaining the class C of the target. Then, the invention adopts a semantic embedding layer to map C into a word embedding vector of a target category

. Finally, to obtain context information for the regional characteristics, a counter is calculatedCalculating the region feature embedding vector H as follows:

，

wherein,

1.2 semantic relationship graph

In order to obtain accurate visual semantic association information, the invention uses the existing scene graph analyzer to extract semantic relationship triples (subject-predicate-object) for each pair of targets of a scene image, then uses a word embedding layer to convert the triples into semantic embedding vectors, and finally constructs all the semantic embedding vectors into a high-dimensional directed semantic vector matrix, namely a semantic relationship graph S; the matrix can be used as semantic relation priori knowledge in an explicit mode to guide an encoder to encode more accurate semantic relations.

1.3 geometric relational graph

Unlike the semantic graph, the geometric graph is an undirected graph. The invention first calculates the relative geometric features between the bounding boxes of object i and object j

,

Is a four-dimensional vector containing relative distances and magnitudes, of the form:

，

wherein,

,

are respectively eyesThe central abscissa, the central ordinate, the width and the height of the bounding box of the index i,

respectively the central abscissa, the central ordinate, the width and the height of the bounding box of the target j; to obtain the geometric relationship information, the relative geometric characteristics are input by using the existing position coding function

Outputting a high-dimensional geometric relationship matrix, i.e. the geometric relationship graph

. This explicit high-dimensional geometric relationship matrix will serve as geometric prior knowledge to guide the encoder to encode more accurate geometric relationships.

2. Adaptive relationship enhanced attention mechanism

The invention provides a novel adaptive relationship enhancement attention mechanism, which considers semantic relationship enhancement and geometric relationship enhancement at the same time, adaptively encodes arbitrary relationship representation vectors between targets in pairs, and the relationship representation vectors perform joint calculation on all inputs by using the adaptive relationship enhancement mechanism, as shown in figure 1. Specifically, the adaptive relationship-enhanced attention mechanism includes: 1) a relation enhancement attention mechanism, and 2) adaptive relation weight distribution, which are respectively described in detail below.

2.1 relationship-enhanced attention mechanism

In order to realize the relation enhancement attention mechanism, the invention simultaneously utilizes the prior knowledge of semantic relation and geometric relation to divide the attention mechanism into three parts, namely semantic enhancement attention, geometric enhancement attention and regional characteristic weight. Semantic enhancement attention means attention enhancement obtained a priori according to a semantic relationship; the geometric enhancement attention represents the attention enhancement obtained a priori according to the geometric relationship; the regional feature weights represent the original attention scores derived from regional visual features. In particular, enhanced attention scores for semantically enhanced attention, geometrically enhanced attention, and regional feature weights

The calculation is as follows:

，

；

wherein H is a region feature embedding vector,

and

respectively using projection matrices based on regional features

And

the generated query region feature vector and key-value region feature vector are used, so according to the basic calculation mode of the attention mechanism,

it represents the region feature weight calculated from the region feature embedding vector.

In order to enhance the attention of the semantics,

in order to enhance the attention geometrically,

in order to enhance the function for the semantic relationship,

is a geometric relationship enhancement function.The semantic relationship enhancement function and the geometric relationship enhancement function are described in detail below.

Compared with the prior art, the semantic relation enhancement function has the advantages that when a specific relation exists between two targets, attention scores are improved by utilizing semantic relation information of the two targets, and a novel direction-sensitive semantic enhancement attention algorithm is provided for the method. Specifically, the invention considers the concern of the regional characteristic information to the semantic relation and the concern of the semantic relation to the two directions of the regional characteristic information, and the semantic relation enhancement function

The calculation is as follows:

；

and

is a projection matrix based on a semantic relationship,

and

subscripts c and s represent region features (content) and semantic relationships (semantic) respectively for the projected query semantic relationship vector and key value semantic relationship vector. Thus in the basic way of calculation of the attention mechanism,

and

the attention of the regional characteristic information to the semantic relation and the attention of the semantic relation to the regional characteristic information in two directions are represented respectively.

For geometric enhancement functions, unlike semantic relationships, the geometric matrix is symmetric. The invention inquires the feature vector of the region

Using key-value geometric relation vectors

Dynamic geometric enhancement was performed:

，

wherein,

is a projection matrix with respect to the geometric relationship.

2.2 adaptive relationship weight assignment

In different scenarios, the image description generates an inconsistency in dependency on semantic and geometric information, which indicates when semantic and geometric features need to be activated. Therefore, the invention provides a weight distribution module for self-adaptive learning in attention aggregation operation to obtain semantic and geometric relationship information enhancement of different levels. Specifically, the patent introduces a gating fusion module and implements a sigmoid function

：

，

，

Wherein,

、

and

respectively using projection matrices

、

And

and generating a truth-value region feature vector, a truth-value semantic relation vector and a truth-value geometric relation vector. Gated value

Indicating which property information is more important to the current state,

an output weight representing the semantic relationship,

the output weights representing the geometric relationship,

are learnable parameters. Finally, the output of the adaptive relationship enhancement attention is calculated as:

。

in summary, the present invention considers direction sensitive semantic enhanced attention and geometric enhanced attention from different perspectives, the former focusing more on the provided prior semantic relationship, and the latter focusing more on the geometric relationship between objects, in contrast.

3. Language generation prediction

After encoding, the present invention uses a transform decoder (translator decoder) to generate the image description. The input to the decoder is the features from the last layer of encoders, which contain the alignment of the region features to the words, and the output is a descriptive sentence of the region features, as shown in fig. 1. The confidence of the word distribution output after the decoding layer is as follows:

；

wherein,

w is the learnable weight and b is the bias term for the output of the decoder layer.

4. Training loss function

The adaptive enhanced adaptive attention network of the present invention is trained in an end-to-end manner. According to a standard image description training mode, the adaptive enhanced self-attention network of the invention is firstly optimized by using cross entropy loss, and a training target loss function is as follows:

，

wherein,

in order to be the parameters of the model,

is the target true value sequence at the current time,

is a previously generated word sequence.

Then, additional training is performed with reinforcement learning:

，

wherein the prize is awarded

CIDER-D scores, CIDER-D will be given a lower weight score based on the more occurrences of visual information-independent words in all reference labels, E representing the desired reward. The goal of the additional training is to be based on the model parameters

And generating words at different training moments

Information, using generated rewards

The CIDER-D score of (1) maximizes the desired reward E for generating words.

In the training process, the model is first trained for several rounds using cross entropy loss, and then further optimized using CIDER-D awards until convergence.

The invention utilizes the semantic relation and the geometric relation to jointly cover the complex visual relation between the targets, and can adaptively enhance the semantic relation and the geometric relation, and a large number of experimental results show that the invention can obviously and effectively improve the precision of image description, generate more accurate semantic and geometric relation words and better embody the advantages of the invention on the task of image description.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein, and any reference signs in the claims are not intended to be construed as limiting the claim concerned.

Furthermore, it should be understood that although the present description refers to embodiments, not every embodiment may contain only a single embodiment, and such description is for clarity only, and those skilled in the art should integrate the description, and the embodiments may be combined as appropriate to form other embodiments understood by those skilled in the art.

Claims

1. An image description method based on an adaptive enhanced self-attention network comprises the following steps:

calculating an enhanced attention score

；

Wherein,

in order to be the region feature weight,

in order to enhance the attention of the semantics,

to enhance attention for geometry;

and

respectively, using projection matrices based on regional characteristics

And

；

wherein,

；

and

is a projection matrix based on the semantic relation,

for the projected query semantic relationship vector,

obtaining a key value semantic relation vector for projection;

wherein,

；

is a projection matrix based on a geometric relationship,

key value geometric relation vector obtained for projection；

；

Wherein the gating value

And σ () is a sigmoid function,

、

and

respectively using projection matrices

、

And

；

an output weight representing the semantic relationship,

output representing geometric relationshipsThe weight of the weight is calculated,

is a learnable parameter;

2. The image description method based on the adaptive augmented adaptive attention network of claim 1, characterized in that: when a scene graph extractor is used for constructing a semantic relation graph S in the first step, a scene graph analyzer is used for extracting semantic relation triples for each pair of targets of a scene image, then a word embedding layer is used for converting the semantic relation triples into semantic embedding vectors, and finally all the semantic embedding vectors are constructed into a high-dimensional directed semantic vector matrix, namely the semantic relation graph S.

3. The image description method based on the adaptive augmented adaptive attention network of claim 1, characterized in that: when the region feature embedding vector H is calculated in the first step, visual information of an interested target in a scene image is coded into region features by utilizing a pre-trained Faster R-CNN model

Wherein

for activation with ReLUA feedforward network of functions]A feature stitching operation is represented.

4. The image description method based on the adaptive augmented adaptive attention network of claim 1, characterized in that: when the geometric relational graph G is constructed in the first step, firstly, the relative geometric characteristics between the bounding boxes of the object i and the object j are calculated

：

，

Wherein,

for a four-dimensional vector containing the relative distances and magnitudes,

relative geometric features

5. The image description method based on the adaptive augmented adaptive attention network of claim 1, characterized in that: using cross entropy loss function