CN113095431A - Image description method, system and device based on attention mechanism - Google Patents
Image description method, system and device based on attention mechanism
- Publication number
- CN113095431A (application number CN202110457256.5A)
- Authority
- CN
- China
- Prior art keywords
- attention
- image
- vector
- information
- image description
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Image Processing (AREA)
Abstract
The invention discloses an image description method, system and device based on an attention mechanism. The method comprises the following steps: processing image features with an encoder module to obtain encoded information; obtaining sequence vector information with a decoder module and decoding the encoded information to obtain a word probability distribution; and repeating the encoding and decoding steps until a preset number of iterations is reached, then outputting the image description. The system comprises an encoder module, a decoder module and a loop module. The apparatus comprises a memory and a processor for performing the above attention-mechanism-based image description method. With the method, system and device, the hidden internal semantic relations and spatial position relations between objects in an image can be fully mined, and a comprehensive and accurate image description is generated. The attention-mechanism-based image description method, system and device can be widely applied to image description generation.
Description
Technical Field
The invention relates to the field of image description generation, in particular to an image description method, system and device based on an attention mechanism.
Background
Image description generation is a challenging task in the field of artificial intelligence and is receiving increasing attention. It opens new development and application prospects for computers to rapidly acquire information from images, and it is closely related to image semantic analysis, image annotation, high-level semantic extraction and the like. In image description generation, a computer automatically produces a complete and fluent descriptive sentence for an image. In the context of big data, the technology has wide commercial applications: quickly retrieving matching goods when a user enters keywords in shopping software; searching pictures in a search engine; identifying targets across multiple events in video; professional automatic semantic annotation of medical images; recognizing target objects in automatic driving; image retrieval; intelligent guidance for the blind; human-computer interaction; and so on. However, conventional image description generation methods mine the semantic information implicit in an image insufficiently and under-utilize image features, so the generated descriptions are neither accurate nor comprehensive.
Disclosure of Invention
In order to solve the above technical problems, the object of the invention is to provide an image description method, system and device based on an attention mechanism, which deeply mine the semantic relations among objects in an image and generate more flexible and accurate text descriptions.
The first technical scheme adopted by the invention is as follows: an attention mechanism-based image description method, comprising the steps of:
acquiring image features X of an input image and linearly transforming X to obtain the vector sets Q, K1 and V1;
inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1, respectively, to obtain the vector sets K2 and V2;
inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information;
acquiring the sequence vector information Y of the previous time step and processing Y with the masked self-attention module to obtain the query vector Yq;
inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C, and further updating C with residual connection and regularization;
taking the encoded information as the new image features and the word probability distribution as the new sequence vector representation, returning to step S1 until the number of loops reaches four, and then outputting the image description.
Further, the step of linearly transforming the image feature X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of the preset weight matrices Wq, Wk and Wv with the image feature X to obtain the vector sets Q, K1 and V1 of the corresponding representation features.
Further, the self-attention module S consists of a basic scaled matrix dot-product operation. The step of inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X) is given by:
S(X) = Attention(WqX, [WkX, Sk], [WvX, Sv])
Attention(Q, K, V) = Softmax(QK^T / √dk)V
where Attention(·) denotes the self-attention operator, Softmax(·) compresses the elements of a matrix to the range (0, 1), and dk denotes the dimension of the feature vector.
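A minimal NumPy sketch of this self-attention with appended semantic-association vectors follows; all shapes, the random initial values and the helper names are illustrative assumptions rather than the patent's actual implementation:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_memory(X, Wq, Wk, Wv, Sk, Sv):
    """Scaled dot-product self-attention whose keys and values are
    extended with semantic-association vectors Sk, Sv (shapes assumed)."""
    Q = X @ Wq                                  # queries from image features
    K = np.concatenate([X @ Wk, Sk], axis=0)    # [WkX, Sk]
    V = np.concatenate([X @ Wv, Sv], axis=0)    # [WvX, Sv]
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))           # attention weights in (0, 1)
    return A @ V

# Toy dimensions: 5 image regions, feature dim 8, 3 memory slots.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
Sk = rng.standard_normal((3, 8))
Sv = rng.standard_normal((3, 8))
S_X = attention_with_memory(X, Wq, Wk, Wv, Sk, Sv)
print(S_X.shape)  # (5, 8)
```

Because Sk and Sv are appended to the keys and values only, the number of queries (one per image region) and hence the output shape are unchanged.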
Further, the characteristic representation S (X) is subjected to forward propagation and residual connection regularization to obtain coding informationThis step, the formula is:
Z=AddNorm(S(X))
F(Z)i=Mσ(WZi+b)+c
wherein AddForm (. cndot.) consists of residual concatenation and layer normalization, ZiThe ith vector representing the input, F (Z)iThe output vector representing the i-th forward propagation, M, W the learnable weight parameter, b, c the learnable bias term, σ (-) the activation function.
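The AddNorm and forward-propagation step can be sketched as follows; ReLU is assumed for the unspecified activation σ, and all shapes and initial values are illustrative:

```python
import numpy as np

def add_norm(x, sublayer_out, eps=1e-6):
    # Residual connection followed by layer normalization (AddNorm).
    z = x + sublayer_out
    mu = z.mean(axis=-1, keepdims=True)
    sd = z.std(axis=-1, keepdims=True)
    return (z - mu) / (sd + eps)

def feed_forward(Z, W, b, M, c):
    # Position-wise forward propagation F(Z)_i = M·sigma(W·Z_i + b) + c,
    # with ReLU standing in for the unspecified activation sigma.
    return np.maximum(Z @ W + b, 0.0) @ M + c

rng = np.random.default_rng(1)
X_in = rng.standard_normal((5, 8))   # input to the self-attention sublayer
S_X = rng.standard_normal((5, 8))    # output of the self-attention module
Z = add_norm(X_in, S_X)
W, M = rng.standard_normal((8, 16)), rng.standard_normal((16, 8))
b, c = np.zeros(16), np.zeros(8)
encoded = add_norm(Z, feed_forward(Z, W, b, M, c))
print(encoded.shape)  # (5, 8)
```

Layer normalization centers each row, so every row of `encoded` has (numerically) zero mean.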
Further, the step of inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain C is formulated as:
C = Attention_c(Yq, K2, V2)
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote learnable weight parameters.
Further, C is passed through the Sigmoid operator and forward propagation to obtain the word probability distribution, where [·,·] denotes matrix concatenation and F(Z) denotes the result of the forward propagation process.
The second technical scheme adopted by the invention is as follows: an attention-based image description system comprising:
the encoder module is used for executing the encoding step and processing the image characteristics to obtain encoding information;
the decoder module is used for executing the decoding step, acquiring the sequence vector information and decoding the coding information to obtain the word probability distribution;
and the loop module is used for executing the loop coding and decoding steps until the preset times are reached and outputting the final image description.
The third technical scheme adopted by the invention is as follows: an attention-based image description apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention-mechanism-based image description method described above.
The method, system and device of the invention have the following beneficial effects: a novel attention mechanism is adopted, and a multi-stage encoder processes the input image features in the encoding stage, so that the degree of association between objects in the image can be measured precisely and the semantic associations hidden between them can be deeply mined, which improves the description quality while reducing model complexity.
Drawings
FIG. 1 is a flow chart of steps of an embodiment of the present invention;
FIG. 2 is a diagram illustrating the insertion of an association vector into a vector set according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Referring to fig. 1, the present invention provides an attention-based image description method, comprising the steps of:
acquiring image features X of an input image and linearly transforming X to obtain the vector sets Q, K1 and V1;
inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1, respectively, to obtain the vector sets K2 and V2. Specifically, as shown in FIG. 2, the word-embedding representation R is matrix-multiplied with the image feature X to obtain the multi-modal semantic association information RX; weight matrices W1 and W2, whose initial values are set randomly, then linearly convert RX into the vector sets Sk and Sv, respectively, thereby capturing the semantic information implicit in the image feature X. It should be noted that this implicit semantic information is not contained in the Q, K and V obtained by transforming the image features alone;
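The construction of Sk and Sv from the word-embedding representation R could look like the following sketch; the shapes of R, W1 and W2, and the choice of one memory slot per word, are assumptions made only so the matrix products line up, not details given in the patent:

```python
import numpy as np

rng = np.random.default_rng(2)
n_words, n_regions, d = 6, 5, 8

R = rng.standard_normal((n_words, d))    # word-embedding representation R (assumed shape)
X = rng.standard_normal((n_regions, d))  # image features X

# Multi-modal semantic association information RX: word-to-region affinities.
RX = R @ X.T                             # (n_words, n_regions)

# Randomly initialised weight matrices W1, W2 linearly convert RX into the
# semantic-association vector sets Sk, Sv (one slot per word, dim d assumed).
W1 = rng.standard_normal((n_regions, d))
W2 = rng.standard_normal((n_regions, d))
Sk, Sv = RX @ W1, RX @ W2                # (n_words, d) each

# The slots are then appended to the projected keys/values: K2 = [WkX, Sk].
Wk = rng.standard_normal((d, d))
K2 = np.concatenate([X @ Wk, Sk], axis=0)
print(K2.shape)  # (11, 8)
```

The point of the construction is that Sk and Sv depend on both the vocabulary embeddings and the current image, so they carry cross-modal information that the purely image-derived Q, K and V do not.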
inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information;
acquiring the sequence vector information Y of the previous time step and processing Y with the masked self-attention module to obtain the query vector Yq;
inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C, and further updating C with residual connection and regularization;
Specifically, C is obtained by feeding the encoded information and the prediction result of the previous time step into cross attention. It is the preliminary decoding result of the current time step and is further processed to obtain the final decoding result of the current time step.
taking the encoded information as the new image features and the word probability distribution as the new sequence vector representation, returning to step S1 until the number of loops reaches four, and then outputting the image description.
Further, as a preferred embodiment of the method, the step of linearly transforming the image feature X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of the preset weight matrices Wq, Wk and Wv with the image feature X to obtain the vector sets Q, K1 and V1 of the corresponding representation features.
Specifically, Q = WqX, K1 = WkX, V1 = WvX.
Further, as a preferred embodiment of the method, the step of inserting the semantic association vectors Sk and Sv into the vector sets K1 and V1, respectively, to obtain the vector sets K2 and V2 is formulated as:
K2 = [WkX, Sk]
V2 = [WvX, Sv]
Further, as a preferred embodiment of the method, the self-attention module S consists of a basic scaled matrix dot-product operation, and the step of inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X) is given by:
S(X) = Attention(WqX, [WkX, Sk], [WvX, Sv])
Attention(Q, K, V) = Softmax(QK^T / √dk)V
where Attention(·) denotes the self-attention operator, Softmax(·) compresses the elements of a matrix to the range (0, 1), and dk denotes the dimension of the feature vector.
Further, as a preferred embodiment of the method, in the step of applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of residual connection and layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the output vector of the i-th forward propagation, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes the activation function.
As a preferred embodiment of the method, the sequence vector information Y of the previous time step is acquired and processed by the masked self-attention module to obtain the query vector Yq, formulated as:
Yq = AddNorm(Attention_m(Y))
where Attention_m(·) denotes the masked self-attention operator and M denotes the mask matrix.
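A masked self-attention over the sequence vectors Y can be sketched as follows, with the mask matrix M realised as an upper-triangular mask of -inf scores so that position i attends only to positions up to i; shapes and initial values are illustrative:

```python
import numpy as np

def masked_self_attention(Y, Wq, Wk, Wv):
    """Causal (masked) self-attention over the sequence vectors Y:
    position i may only attend to positions <= i."""
    Q, K, V = Y @ Wq, Y @ Wk, Y @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    M = np.triu(np.ones_like(scores), k=1)      # 1 above the diagonal
    scores = np.where(M == 1, -np.inf, scores)  # mask out future positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)       # masked softmax weights
    return A @ V

rng = np.random.default_rng(3)
Y = rng.standard_normal((4, 8))  # sequence vectors from previous time steps
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
Yq = masked_self_attention(Y, Wq, Wk, Wv)
print(Yq.shape)  # (4, 8)
```

A quick sanity check on the mask: the first output position can attend only to itself, so its output equals its own value projection.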
Further, as a preferred embodiment of the method, the step of inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain C is formulated as:
C = Attention_c(Yq, K2, V2)
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote learnable weight parameters.
Further, as a preferred embodiment of the method, C is passed through the Sigmoid operator and forward propagation to obtain the word probability distribution, where [·,·] denotes matrix concatenation and F(Z) denotes the result of the forward propagation process.
As shown in the upper part of FIG. 1, which depicts the connection of the encoder and decoder, the output of each encoder layer is the input of the next encoder layer, the output of each decoder layer is the input of the next decoder layer, and each encoder layer is additionally connected to the decoder layer with the same index. In this way every stage of the encoder's multi-level features is fully decoded in the decoder, avoiding the loss of the initial image features that would otherwise degrade the quality of the final image description. The sequential cascade of four encoder layers enables an accurate measurement of the degree of correlation between objects in the image and deep mining of the semantic associations hidden between them. Of course, the connection is not limited to four encoder-decoder layers.
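The layer cascade just described can be sketched with toy layers; here `toy_layer` is only a stand-in for a full attention-plus-FFN layer, and mean-pooling the encoder memory into the decoder is merely a placeholder for the real cross attention:

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    # Layer normalization over the feature dimension.
    return (z - z.mean(axis=-1, keepdims=True)) / (z.std(axis=-1, keepdims=True) + eps)

def toy_layer(x, W):
    # Stand-in for one encoder/decoder layer (attention and FFN elided).
    return layer_norm(x + np.maximum(x @ W, 0.0))

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 8))   # image features (5 regions, dim 8)
Y = rng.standard_normal((4, 8))   # sequence vectors (4 tokens, dim 8)
enc_W = [rng.standard_normal((8, 8)) for _ in range(4)]
dec_W = [rng.standard_normal((8, 8)) for _ in range(4)]

# Cascade the 4 encoder layers, keeping every intermediate output.
enc_outputs = []
h = X
for W in enc_W:
    h = toy_layer(h, W)
    enc_outputs.append(h)

# Decoder layer i consumes encoder layer i's output, so no stage of the
# multi-level features is lost (mean-pooled here as a cross-attention stand-in).
g = Y
for W, mem in zip(dec_W, enc_outputs):
    g = toy_layer(g + mem.mean(axis=0), W)
print(g.shape)  # (4, 8)
```

The key design point is that all four intermediate encoder outputs are retained and fed to the matching decoder layers, rather than only the last one.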
An attention-based image description system comprising:
the encoder module is used for executing the encoding step and processing the image characteristics to obtain encoding information;
the decoder module is used for executing the decoding step, acquiring the sequence vector information and decoding the coding information to obtain the word probability distribution;
and the loop module is used for executing the loop coding and decoding steps until the preset times are reached and outputting the final image description.
The contents in the above method embodiments are all applicable to the present system embodiment, the functions specifically implemented by the present system embodiment are the same as those in the above method embodiment, and the beneficial effects achieved by the present system embodiment are also the same as those achieved by the above method embodiment.
An attention-based image description apparatus:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention-mechanism-based image description method described above.
The contents in the above method embodiments are all applicable to the present apparatus embodiment, the functions specifically implemented by the present apparatus embodiment are the same as those in the above method embodiments, and the advantageous effects achieved by the present apparatus embodiment are also the same as those achieved by the above method embodiments.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (8)
1. An attention mechanism-based image description method is characterized by comprising the following steps:
acquiring image features X of an input image and linearly transforming X to obtain vector sets Q, K1 and V1;
inserting semantic association vectors Sk and Sv into the vector sets K1 and V1, respectively, to obtain vector sets K2 and V2;
inputting the vector sets Q, K2 and V2 into a self-attention module S to obtain a feature representation S(X);
applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain encoded information;
acquiring sequence vector information Y of a previous time step and processing Y with a masked self-attention module to obtain a query vector Yq;
inputting the query vector Yq and the vector sets K2 and V2 into a cross-attention module to obtain a decoding result C, and further updating C with residual connection and regularization;
passing the decoding result C through a Sigmoid operator and forward propagation to obtain a word probability distribution.
2. The attention-mechanism-based image description method as claimed in claim 1, wherein the step of linearly transforming the image feature X to obtain the vector sets Q, K1 and V1 specifically comprises:
taking the dot product of preset weight matrices Wq, Wk and Wv with the image feature X to obtain the vector sets Q, K1 and V1 of the corresponding representation features.
3. The attention-mechanism-based image description method as claimed in claim 2, wherein the self-attention module S consists of a basic scaled matrix dot-product operation, and the step of inputting the vector sets Q, K2 and V2 into the self-attention module S to obtain the feature representation S(X) is given by:
S(X) = Attention(WqX, [WkX, Sk], [WvX, Sv])
4. The attention-mechanism-based image description method as claimed in claim 3, wherein in the step of applying forward propagation and residual-connection regularization to the feature representation S(X) to obtain the encoded information, the formulas are:
Z = AddNorm(S(X))
F(Z)_i = Mσ(WZ_i + b) + c
where AddNorm(·) consists of residual connection and layer normalization, Z_i denotes the i-th input vector, F(Z)_i denotes the output vector of the i-th forward propagation, M and W denote learnable weight parameters, b and c denote learnable bias terms, and σ(·) denotes the activation function.
5. The attention-mechanism-based image description method as claimed in claim 4, wherein the step of inputting the query vector Yq and the vector sets K2 and V2 into the cross-attention module to obtain the decoding result C is formulated as:
C = Attention_c(Yq, K2, V2)
where Attention_c(·) denotes the cross-attention operator and Wk, Wv and Wq denote learnable weight parameters.
6. The attention-mechanism-based image description method as claimed in claim 5, wherein the decoding result C is processed by the Sigmoid operator and forward propagation to obtain the word probability distribution.
7. An attention-based image description system, comprising:
the encoder module is used for executing the encoding step and processing the image characteristics to obtain encoding information;
the decoder module is used for executing the decoding step, acquiring the sequence vector information and decoding the coding information to obtain the word probability distribution;
and the loop module is used for executing the loop coding and decoding steps until the preset times are reached and outputting the final image description.
8. An attention-based image description apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the attention-mechanism-based image description method according to any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110457256.5A CN113095431B (en) | 2021-04-27 | 2021-04-27 | Image description method, system and device based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110457256.5A CN113095431B (en) | 2021-04-27 | 2021-04-27 | Image description method, system and device based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113095431A true CN113095431A (en) | 2021-07-09 |
CN113095431B CN113095431B (en) | 2023-08-18 |
Family
ID=76680498
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110457256.5A Active CN113095431B (en) | 2021-04-27 | 2021-04-27 | Image description method, system and device based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113095431B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114186568A (en) * | 2021-12-16 | 2022-03-15 | 北京邮电大学 | Image paragraph description method based on relational coding and hierarchical attention mechanism |
CN114399646A (en) * | 2021-12-21 | 2022-04-26 | 北京中科明彦科技有限公司 | Image description method and device based on Transformer structure |
CN114581543A (en) * | 2022-03-28 | 2022-06-03 | 济南博观智能科技有限公司 | Image description method, device, equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180096224A1 (en) * | 2016-10-05 | 2018-04-05 | Ecole Polytechnique Federale De Lausanne (Epfl) | Method, System, and Device for Learned Invariant Feature Transform for Computer Images |
US20180341860A1 (en) * | 2017-05-23 | 2018-11-29 | Google Llc | Attention-based sequence transduction neural networks |
CN109543820A (en) * | 2018-11-23 | 2019-03-29 | 中山大学 | Iamge description generation method based on framework short sentence constrained vector and dual visual attention location mechanism |
CN110427605A (en) * | 2019-05-09 | 2019-11-08 | 苏州大学 | The Ellipsis recovering method understood towards short text |
CN110458282A (en) * | 2019-08-06 | 2019-11-15 | 齐鲁工业大学 | Multi-angle multi-mode fused image description generation method and system |
WO2020190112A1 (en) * | 2019-03-21 | 2020-09-24 | Samsung Electronics Co., Ltd. | Method, apparatus, device and medium for generating captioning information of multimedia data |
CN111723937A (en) * | 2019-03-21 | 2020-09-29 | 北京三星通信技术研究有限公司 | Method, device, equipment and medium for generating description information of multimedia data |
US20210027018A1 (en) * | 2019-07-22 | 2021-01-28 | Advanced New Technologies Co., Ltd. | Generating recommendation information |
CN112329794A (en) * | 2020-11-06 | 2021-02-05 | 北京工业大学 | Image description method based on double self-attention mechanism |
US20210049236A1 (en) * | 2019-08-15 | 2021-02-18 | Salesforce.Com, Inc. | Systems and methods for a transformer network with tree-based attention for natural language processing |
- 2021-04-27: application CN202110457256.5A filed; patent CN113095431B granted, status Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180096224A1 (en) * | 2016-10-05 | 2018-04-05 | Ecole Polytechnique Federale De Lausanne (Epfl) | Method, System, and Device for Learned Invariant Feature Transform for Computer Images |
US20180341860A1 (en) * | 2017-05-23 | 2018-11-29 | Google Llc | Attention-based sequence transduction neural networks |
CN109543820A (en) * | 2018-11-23 | 2019-03-29 | 中山大学 | Iamge description generation method based on framework short sentence constrained vector and dual visual attention location mechanism |
WO2020190112A1 (en) * | 2019-03-21 | 2020-09-24 | Samsung Electronics Co., Ltd. | Method, apparatus, device and medium for generating captioning information of multimedia data |
CN111723937A (en) * | 2019-03-21 | 2020-09-29 | 北京三星通信技术研究有限公司 | Method, device, equipment and medium for generating description information of multimedia data |
CN110427605A (en) * | 2019-05-09 | 2019-11-08 | 苏州大学 | The Ellipsis recovering method understood towards short text |
US20210027018A1 (en) * | 2019-07-22 | 2021-01-28 | Advanced New Technologies Co., Ltd. | Generating recommendation information |
CN110458282A (en) * | 2019-08-06 | 2019-11-15 | 齐鲁工业大学 | Multi-angle multi-mode fused image description generation method and system |
US20210049236A1 (en) * | 2019-08-15 | 2021-02-18 | Salesforce.Com, Inc. | Systems and methods for a transformer network with tree-based attention for natural language processing |
CN112329794A (en) * | 2020-11-06 | 2021-02-05 | 北京工业大学 | Image description method based on double self-attention mechanism |
Non-Patent Citations (1)
Title |
---|
Zhang Jiashuo (张家硕): "Image Description Generation Based on Bidirectional Attention Mechanism", Journal of Chinese Information Processing *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114186568A (en) * | 2021-12-16 | 2022-03-15 | 北京邮电大学 | Image paragraph description method based on relational coding and hierarchical attention mechanism |
CN114399646A (en) * | 2021-12-21 | 2022-04-26 | 北京中科明彦科技有限公司 | Image description method and device based on Transformer structure |
CN114581543A (en) * | 2022-03-28 | 2022-06-03 | 济南博观智能科技有限公司 | Image description method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113095431B (en) | 2023-08-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11657230B2 (en) | Referring image segmentation | |
CN113095431B (en) | Image description method, system and device based on attention mechanism | |
CN108846077B (en) | Semantic matching method, device, medium and electronic equipment for question and answer text | |
CN113312500A (en) | Method for constructing event map for safe operation of dam | |
CN111522936B (en) | Intelligent customer service dialogue reply generation method and device containing emotion and electronic equipment | |
CN116662582B (en) | Specific domain business knowledge retrieval method and retrieval device based on natural language | |
CN113837233B (en) | Image description method of self-attention mechanism based on sample self-adaptive semantic guidance | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
CN115203409A (en) | Video emotion classification method based on gating fusion and multitask learning | |
CN115994317A (en) | Incomplete multi-view multi-label classification method and system based on depth contrast learning | |
Hossain et al. | Bi-SAN-CAP: Bi-directional self-attention for image captioning | |
CN115831105A (en) | Speech recognition method and device based on improved Transformer model | |
Zhuang et al. | Improving remote sensing image captioning by combining grid features and transformer | |
CN117746078B (en) | Object detection method and system based on user-defined category | |
CN117635275B (en) | Intelligent electronic commerce operation commodity management platform and method based on big data | |
CN114120166A (en) | Video question and answer method and device, electronic equipment and storage medium | |
CN113869324A (en) | Video common-sense knowledge reasoning implementation method based on multi-mode fusion | |
CN117315249A (en) | Image segmentation model training and segmentation method, system, equipment and medium | |
CN117496388A (en) | Cross-modal video description model based on dynamic memory network | |
CN116524407A (en) | Short video event detection method and device based on multi-modal representation learning | |
CN115658856A (en) | Intelligent question-answering system and method based on polymorphic document views | |
CN116310984B (en) | Multi-mode video subtitle generating method based on Token sampling | |
CN111666395A (en) | Interpretable question answering method and device oriented to software defects, computer equipment and storage medium | |
Ouenniche et al. | Vision-text cross-modal fusion for accurate video captioning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |