CN113095431B - Image description method, system and device based on attention mechanism - Google Patents

Image description method, system and device based on attention mechanism

Info

Publication number
CN113095431B
CN113095431B
Authority
CN
China
Prior art keywords
vector
image
attention
information
image description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110457256.5A
Other languages
Chinese (zh)
Other versions
CN113095431A (en)
Inventor
胡海峰
夏志武
吴永波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202110457256.5A
Publication of CN113095431A
Application granted
Publication of CN113095431B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application discloses an image description method, system and device based on an attention mechanism. The method comprises the following steps: processing image features with an encoder module to obtain encoded information; acquiring sequence vector information with a decoder module and decoding the encoded information to obtain a word probability distribution; and repeating the encoding and decoding steps until a preset number of iterations is reached, then outputting the image description. The system comprises an encoder module, a decoder module and a loop module. The device comprises a memory and a processor for performing the above image description method based on an attention mechanism. With the method and device, the latent semantic relations and latent spatial position relations between objects in an image can be fully mined, and a comprehensive and accurate image description can be generated. The image description method, system and device based on an attention mechanism can be widely applied in image description generation.

Description

Image description method, system and device based on attention mechanism
Technical Field
The present application relates to the field of image description generation, and in particular, to an image description method, system and device based on an attention mechanism.
Background
Image description generation is a challenging task in the field of artificial intelligence and is receiving increasing attention. It opens new development and application prospects for computers to quickly acquire information from images, and is closely related to technologies such as image semantic analysis, image annotation and high-level image semantic extraction. In image description generation, a computer automatically generates a complete and fluent descriptive sentence for an image. Against the background of big data, the technology has found wide application in commerce: a user enters keywords in shopping software and quickly finds matching goods; a user searches for pictures in a search engine. Other applications include multi-object recognition in video, automatic semantic annotation of medical images, target recognition in autonomous driving, image retrieval, intelligent guidance for the blind, human-computer interaction, and so on. However, conventional image description generation methods neither fully mine the latent semantic information of an image nor fully exploit its features, so the generated descriptions are not accurate and comprehensive.
Disclosure of Invention
In order to solve the above technical problems, the application aims to provide an image description method, system and device based on an attention mechanism that deeply mine the semantic relations between objects in an image and generate more flexible and more accurate textual descriptions.
The first technical solution adopted by the application is as follows: an image description method based on an attention mechanism, comprising the following steps:
S1, acquiring the image features X of an input image and performing a linear transformation on X to obtain the vector sets Q, K_1 and V_1;
S2, inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively, obtaining the vector sets K_2 and V_2;
S3, inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X);
S4, regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃;
S5, acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q;
S6, applying a linear transformation to the encoded information X̃ to obtain new vector sets K_2 and V_2;
S7, inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, then applying residual connection and regularization to update C;
S8, passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ;
S9, taking the encoded information X̃ as the new image features and the word probability distribution Ỹ as the new sequence vector representation, and returning to step S1 until the number of loops reaches four, whereupon the image description is output.
Further, the step of linearly transforming the image features X to obtain the vector sets Q, K_1 and V_1 comprises:
computing dot products of the preset-size weight matrices W_q, W_k and W_v with the image features X to obtain the vector sets Q, K_1 and V_1 representing the features.
Further, the self-attention module S consists of a basic scaled dot-product operation. In the step of inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X), the formulas are:
S(X) = Attention(W_qX, [W_kX, S_k], [W_vX, S_v])
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Attention(·) denotes the self-attention operator, softmax(·) compresses the elements of a matrix into the range (0, 1), and d_k denotes the dimension of the feature vectors.
Further, in the step of regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of a residual connection followed by layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes an activation function.
Further, in the step of inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, the formula is:
C = Attention_c(W_qY_q, K_2, V_2), where K_2 = W_kX̃ and V_2 = W_vX̃
and where Attention_c(·) denotes the cross-attention operator and W_k, W_v and W_q denote learnable weight parameters.
Further, in the step of passing C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ, [·,·] denotes matrix concatenation, Z denotes the result of applying the Sigmoid activation function, and F(Z) denotes the result of the subsequent forward-propagation processing.
The second technical solution adopted by the application is as follows: an image description system based on an attention mechanism, comprising:
an encoder module for performing the encoding step and processing the image features to obtain the encoded information;
a decoder module for performing the decoding step, acquiring the sequence vector information and decoding the encoded information to obtain the word probability distribution;
and a loop module for repeating the encoding and decoding steps until the preset number of iterations is reached and outputting the final image description.
The third technical solution adopted by the application is as follows: an image description device based on an attention mechanism, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement an attention mechanism based image description method as described above.
The beneficial effects of the method, system and device are as follows: the application adopts a novel attention mechanism, and in the encoding stage a multi-level encoder processes the input image features. This enables a precise measurement of the degree of association between objects in the image and a deep mining of the latent semantic associations hidden between them, improving description quality while reducing model complexity.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of the present application;
FIG. 2 is a schematic diagram of the insertion of an association vector into a vector set in accordance with an embodiment of the present application.
Detailed Description
The application will now be described in further detail with reference to the drawings and to specific examples. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
Referring to FIG. 1, the present application provides an image description method based on an attention mechanism, comprising the following steps:
S1, acquiring the image features X of an input image and performing a linear transformation on X to obtain the vector sets Q, K_1 and V_1;
S2, inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively, obtaining the vector sets K_2 and V_2. As shown in FIG. 2, the embedded word representation R is matrix-multiplied with the image features X to obtain the multi-modal semantic association information RX, and then two randomly initialized weight matrices W_1 and W_2 linearly transform RX into the vectors S_k and S_v respectively, thereby capturing the latent semantic information in the image features X. It should be noted that this latent semantic information is not present in the Q, K and V obtained by transforming the image features alone. A sketch of this step is given below.
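As an illustration of this step, the following NumPy sketch shows one way the semantic association vectors could be built; all shapes, the random initialisation, and the variable names are illustrative assumptions rather than values fixed by the patent.

```python
import numpy as np

n, d, m = 36, 512, 8              # regions, feature dim, association vectors (assumed sizes)
X = np.random.randn(n, d)         # image features X
R = np.random.randn(m, n)         # embedded word representation R (shape assumed)
RX = R @ X                        # multi-modal semantic association information RX, shape (m, d)
W_1 = np.random.randn(d, d)       # randomly initialised weight matrix W_1
W_2 = np.random.randn(d, d)       # randomly initialised weight matrix W_2
S_k, S_v = RX @ W_1, RX @ W_2     # semantic association vectors S_k and S_v, each (m, d)
```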
S3, inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X);
S4, regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃;
S5, acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q;
S6, applying a linear transformation to the encoded information X̃ to obtain new vector sets K_2 and V_2;
S7, inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, then applying residual connection and regularization to update C;
specifically, C is obtained by cross attention between the encoded information and the prediction result of the previous time step; it is the preliminary decoding result of the current time step and is further processed into the final decoding result of the current time step;
S8, passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ;
S9, taking the encoded information X̃ as the new image features and the word probability distribution Ỹ as the new sequence vector representation, and returning to step S1 until the number of loops reaches four, whereupon the image description is output.
Further, as a preferred embodiment of the method, the step of linearly transforming the image features X to obtain the vector sets Q, K_1 and V_1 comprises:
computing dot products of the preset-size weight matrices W_q, W_k and W_v with the image features X to obtain the vector sets Q, K_1 and V_1 representing the features.
Specifically, Q = W_qX, K_1 = W_kX, V_1 = W_vX.
Further, as a preferred embodiment of the method, in the step of inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively to obtain the vector sets K_2 and V_2, the formulas are:
K_2 = [W_kX, S_k]
V_2 = [W_vX, S_v]
Further, as a preferred embodiment of the method, the self-attention module S consists of a basic scaled dot-product operation. In the step of inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X), the formulas are:
S(X) = Attention(W_qX, [W_kX, S_k], [W_vX, S_v])
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Attention(·) denotes the self-attention operator, softmax(·) compresses the elements of a matrix into the range (0, 1), and d_k denotes the dimension of the feature vectors.
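A minimal NumPy sketch of this encoding step follows; it simply appends the semantic association vectors to the keys and values before a standard scaled dot-product attention. All sizes and the name S_X are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

n, d, m = 36, 512, 8                          # illustrative sizes
X = np.random.randn(n, d)                      # image features X
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
S_k, S_v = np.random.randn(m, d), np.random.randn(m, d)

K_2 = np.concatenate([X @ W_k, S_k], axis=0)   # [W_k X, S_k]
V_2 = np.concatenate([X @ W_v, S_v], axis=0)   # [W_v X, S_v]
S_X = attention(X @ W_q, K_2, V_2)             # feature representation S(X), shape (n, d)
```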
Further, as a preferred embodiment of the method, in the step of regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of a residual connection followed by layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes an activation function.
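The AddNorm and forward-propagation step might look like the sketch below; σ is taken to be ReLU and the final AddNorm around F(Z) is an assumption, since the patent's figure for this step did not survive extraction.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def add_norm(x, sub):
    # AddNorm: residual connection followed by layer normalisation
    return layer_norm(x + sub)

def feed_forward(Z, M, W, b, c):
    # F(Z)_i = M sigma(W Z_i + b) + c, applied row-wise; ReLU assumed for sigma
    return np.maximum(Z @ W.T + b, 0.0) @ M.T + c

n, d, d_ff = 36, 512, 2048                        # illustrative sizes
X_in = np.random.randn(n, d)                       # input to the self-attention module
S_X = np.random.randn(n, d)                        # stands in for S(X)
Z = add_norm(X_in, S_X)
M, W = np.random.randn(d, d_ff), np.random.randn(d_ff, d)
b, c = np.zeros(d_ff), np.zeros(d)
X_enc = add_norm(Z, feed_forward(Z, M, W, b, c))   # encoded information (final AddNorm assumed)
```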
Further, as a preferred embodiment of the method, in the step of acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q, the formulas are:
Y_q = AddNorm(Attention_m(Y))
Attention_m(Y) = softmax((W_qY)(W_kY)^T / √d_k + M)(W_vY)
where Attention_m(·) denotes the masked self-attention operator and M denotes the mask matrix.
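A sketch of the masked self-attention, assuming a standard causal mask M with −∞ above the diagonal; the projection matrices and sizes are illustrative, and the outer AddNorm is omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Y, W_q, W_k, W_v):
    # Causal mask M: -inf above the diagonal so position t only attends to positions <= t
    t = Y.shape[0]
    M = np.triu(np.full((t, t), -np.inf), k=1)
    d_k = W_k.shape[1]
    scores = (Y @ W_q) @ (Y @ W_k).T / np.sqrt(d_k) + M
    return softmax(scores) @ (Y @ W_v)

t, d = 10, 512                                 # illustrative sizes
Y = np.random.randn(t, d)                      # sequence vectors of the previous time step
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
Y_q = masked_self_attention(Y, W_q, W_k, W_v)
```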
Further, as a preferred embodiment of the method, in the step of inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, the formula is:
C = Attention_c(W_qY_q, K_2, V_2), where K_2 = W_kX̃ and V_2 = W_vX̃
and where Attention_c(·) denotes the cross-attention operator and W_k, W_v and W_q denote learnable weight parameters.
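The cross-attention step could be sketched as follows, with K_2 and V_2 derived from the encoded information and Y_q from the masked self-attention above; sizes are again illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Y_q, K_2, V_2):
    # Queries come from the sentence side, keys/values from the encoded image side
    return softmax(Y_q @ K_2.T / np.sqrt(K_2.shape[-1])) @ V_2

t, n, d = 10, 36, 512                          # illustrative sizes
Y_q = np.random.randn(t, d)                    # query vectors
X_enc = np.random.randn(n, d)                  # encoded information
W_k, W_v = np.random.randn(d, d), np.random.randn(d, d)
K_2, V_2 = X_enc @ W_k, X_enc @ W_v            # linear transforms of the encoded information
C = cross_attention(Y_q, K_2, V_2)             # decoding result C, later updated by AddNorm
```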
Further, as a preferred embodiment of the method, in the step of passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ, [·,·] denotes matrix concatenation, Z denotes the result of applying the Sigmoid activation function, and F(Z) denotes the result of the subsequent forward-propagation processing.
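Since the exact output formula did not survive extraction, the sketch below is only one plausible reading: gate the updated decoding result with a Sigmoid, apply the forward-propagation layer, and project to the vocabulary. Every size, the ReLU, and the projection matrix W_out are assumptions of this illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

t, d, vocab = 10, 512, 10000                   # illustrative sizes
C = np.random.randn(t, d)                      # updated decoding result C
Z = sigmoid(C)                                 # "Z": C after the Sigmoid activation
W_ff = np.random.randn(d, d)
F_Z = np.maximum(Z @ W_ff, 0.0)                # forward propagation F(Z), ReLU assumed
W_out = np.random.randn(d, vocab)              # hypothetical vocabulary projection
Y_prob = softmax(F_Z @ W_out)                  # word probability distribution per position
```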
As shown in the upper part of FIG. 1, the encoder and decoder are connected as follows: the output of each encoder layer serves as the input of the next encoder layer, the output of each decoder layer serves as the input of the next decoder layer, and in addition each encoder layer is connected to the decoder layer with the corresponding serial number. In this way every level of the encoder's multi-level features is fully decoded in the decoder, which avoids losing the initial image features and degrading the quality of the final image description; a structural sketch follows below. The sequential concatenation of four encoder layers enables an accurate representation of the degree of correlation between objects in the image and a deep mining of the semantic associations hidden between them. Of course, the scheme is not limited to four encoder-decoder connections.
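Structurally, the four-level loop described above might be wired as in this sketch, where each encoder level feeds both the next encoder level and the decoder level with the same serial number; the callables stand in for the modules defined earlier and are assumptions of this illustration.

```python
from typing import Callable, List
import numpy as np

Array = np.ndarray

def describe(X: Array, Y: Array,
             encoders: List[Callable[[Array], Array]],
             decoders: List[Callable[[Array, Array], Array]]) -> Array:
    # Four paired levels: encoder i's output becomes encoder i+1's input (the new
    # image features) and is also decoded by decoder i; decoder i's output (the word
    # probability distribution) becomes decoder i+1's sequence input.
    for enc, dec in zip(encoders, decoders):   # len == 4 in the patent's embodiment
        X = enc(X)                              # encoded information of this level
        Y = dec(Y, X)                           # decode against this level's features
    return Y                                    # final word probability distribution

# Usage with trivial stand-in modules:
encs = [lambda x: x for _ in range(4)]
decs = [lambda y, x: y for _ in range(4)]
out = describe(np.random.randn(36, 512), np.random.randn(10, 512), encs, decs)
```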
An attention mechanism based image description system comprising:
an encoder module for performing the encoding step and processing the image features to obtain the encoded information;
a decoder module for performing the decoding step, acquiring the sequence vector information and decoding the encoded information to obtain the word probability distribution;
and a loop module for repeating the encoding and decoding steps until the preset number of iterations is reached and outputting the final image description.
The content in the method embodiment is applicable to the system embodiment, the functions specifically realized by the system embodiment are the same as those of the method embodiment, and the achieved beneficial effects are the same as those of the method embodiment.
An image description device based on an attention mechanism, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement an attention mechanism based image description method as described above.
The content in the method embodiment is applicable to the embodiment of the device, and the functions specifically realized by the embodiment of the device are the same as those of the method embodiment, and the obtained beneficial effects are the same as those of the method embodiment.
While the preferred embodiment of the present application has been described in detail, the application is not limited to the embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (7)

1. An image description method based on an attention mechanism, which is characterized by comprising the following steps:
S1, acquiring the image features X of an input image and computing dot products of the preset-size weight matrices W_q, W_k and W_v with the image features X to obtain the vector sets Q, K_1 and V_1 representing the features;
S2, inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively, obtaining the vector sets K_2 and V_2;
S3, inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X);
S4, regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃;
S5, acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q;
S6, applying a linear transformation to the encoded information X̃ to obtain new vector sets K_2 and V_2;
S7, inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, then applying residual connection and regularization to update C;
S8, passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ;
S9, taking the encoded information X̃ as the new image features and the word probability distribution Ỹ as the new sequence vector representation, and returning to step S1 until the number of loops reaches four, whereupon the image description is output.
2. The image description method based on an attention mechanism according to claim 1, characterized in that the self-attention module S consists of a basic scaled dot-product operation, and in the step of inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X), the formulas are:
S(X) = Attention(W_qX, [W_kX, S_k], [W_vX, S_v])
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where Attention(·) denotes the self-attention operator, softmax(·) compresses the elements of a matrix into the range (0, 1), and d_k denotes the dimension of the feature vectors.
3. The image description method based on an attention mechanism according to claim 2, characterized in that in the step of regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of a residual connection followed by layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the i-th forward-propagation output vector, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes an activation function.
4. The image description method based on an attention mechanism according to claim 3, characterized in that in the step of inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, the formula is:
C = Attention_c(W_qY_q, K_2, V_2), where K_2 = W_kX̃ and V_2 = W_vX̃
and where Attention_c(·) denotes the cross-attention operator and W_k, W_v and W_q denote learnable weight parameters.
5. The image description method based on an attention mechanism according to claim 4, characterized in that in the step of passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ, [·,·] denotes matrix concatenation, Z denotes the result of applying the Sigmoid activation function, and F(Z) denotes the result of the subsequent forward-propagation processing.
6. An image description system based on an attention mechanism, characterized by comprising:
an encoder module for performing the encoding step: acquiring the image features X of an input image and computing dot products of the preset-size weight matrices W_q, W_k and W_v with the image features X to obtain the vector sets Q, K_1 and V_1 representing the features; inserting the semantic association vectors S_k and S_v into the vector sets K_1 and V_1 respectively, obtaining the vector sets K_2 and V_2; inputting the vector sets Q, K_2 and V_2 into the self-attention module S to obtain the feature representation S(X); and regularizing the feature representation S(X) through forward propagation and residual connection to obtain the encoded information X̃;
a decoder module for performing the decoding step: acquiring the sequence vector information Y of the previous time step and processing it with the masked self-attention module to obtain the query vector Y_q; applying a linear transformation to the encoded information X̃ to obtain new vector sets K_2 and V_2; inputting the query vector Y_q and the vector sets K_2 and V_2 into the cross-attention module to obtain the decoding result C, then applying residual connection and regularization to update C; and passing the decoding result C through a Sigmoid operator and forward propagation to obtain the word probability distribution Ỹ;
a loop module for taking the encoded information X̃ as the new image features and the word probability distribution Ỹ as the new sequence vector representation, and returning to the first step until the number of loops reaches four, whereupon the image description is output.
7. An attention mechanism based image description device, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the image description method based on an attention mechanism as claimed in any one of claims 1 to 5.
CN202110457256.5A 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism Active CN113095431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110457256.5A CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110457256.5A CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Publications (2)

Publication Number Publication Date
CN113095431A CN113095431A (en) 2021-07-09
CN113095431B (en) 2023-08-18

Family

ID=76680498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110457256.5A Active CN113095431B (en) 2021-04-27 2021-04-27 Image description method, system and device based on attention mechanism

Country Status (1)

Country Link
CN (1) CN113095431B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186568B (en) * 2021-12-16 2022-08-02 北京邮电大学 Image paragraph description method based on relational coding and hierarchical attention mechanism
CN114399646B (en) * 2021-12-21 2022-09-20 北京中科明彦科技有限公司 Image description method and device based on transform structure
CN114581543A (en) * 2022-03-28 2022-06-03 济南博观智能科技有限公司 Image description method, device, equipment and storage medium


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10552709B2 (en) * 2016-10-05 2020-02-04 Ecole Polytechnique Federale De Lausanne (Epfl) Method, system, and device for learned invariant feature transform for computer images
RU2021116658A (en) * 2017-05-23 2021-07-05 Google LLC Attention-based sequence transduction neural networks
US11176330B2 (en) * 2019-07-22 2021-11-16 Advanced New Technologies Co., Ltd. Generating recommendation information
US11615240B2 (en) * 2019-08-15 2023-03-28 Salesforce.Com, Inc Systems and methods for a transformer network with tree-based attention for natural language processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543820A (en) * 2018-11-23 2019-03-29 Sun Yat-sen University Image description generation method based on framework short sentence constrained vector and dual visual attention mechanism
WO2020190112A1 (en) * 2019-03-21 2020-09-24 Samsung Electronics Co., Ltd. Method, apparatus, device and medium for generating captioning information of multimedia data
CN111723937A (en) * 2019-03-21 2020-09-29 北京三星通信技术研究有限公司 Method, device, equipment and medium for generating description information of multimedia data
CN110427605A (en) * 2019-05-09 2019-11-08 苏州大学 The Ellipsis recovering method understood towards short text
CN110458282A (en) * 2019-08-06 2019-11-15 齐鲁工业大学 Multi-angle multi-mode fused image description generation method and system
CN112329794A (en) * 2020-11-06 2021-02-05 北京工业大学 Image description method based on double self-attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Image Description Generation Based on a Bidirectional Attention Mechanism; Zhang Jiashuo; Journal of Chinese Information Processing; 2020-09-30; Vol. 34, No. 9; pp. 53-61 *

Also Published As

Publication number Publication date
CN113095431A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN113095431B (en) Image description method, system and device based on attention mechanism
Zhang et al. Improved deep hashing with soft pairwise similarity for multi-label image retrieval
US11657230B2 (en) Referring image segmentation
CN115203380A (en) Text processing system and method based on multi-mode data fusion
CN109711463A (en) Important object detection method based on attention
CN112801280B (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN112507995B (en) Cross-model face feature vector conversion system and method
CN110516530A (en) A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
CN113837233B (en) Image description method of self-attention mechanism based on sample self-adaptive semantic guidance
CN110619124A (en) Named entity identification method and system combining attention mechanism and bidirectional LSTM
CN114065771A (en) Pre-training language processing method and device
CN114708436B (en) Training method of semantic segmentation model, semantic segmentation method, semantic segmentation device and semantic segmentation medium
CN115831105A (en) Speech recognition method and device based on improved Transformer model
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN115455226A (en) Text description driven pedestrian searching method
CN117496388A (en) Cross-modal video description model based on dynamic memory network
CN113762459A (en) Model training method, text generation method, device, medium and equipment
CN112199531A (en) Cross-modal retrieval method and device based on Hash algorithm and neighborhood map
CN116524407A (en) Short video event detection method and device based on multi-modal representation learning
CN114399646B (en) Image description method and device based on transform structure
WO2023168818A1 (en) Method and apparatus for determining similarity between video and text, electronic device, and storage medium
CN114550159A (en) Image subtitle generating method, device and equipment and readable storage medium
Ouenniche et al. Vision-text cross-modal fusion for accurate video captioning
Xie et al. Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant