CN116452688A - Image description generation method based on common attention mechanism - Google Patents

Image description generation method based on common attention mechanism

Info

Publication number
CN116452688A
CN116452688A (application CN202310334196.7A)
Authority
CN
China
Prior art keywords
image
description
attention
network
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310334196.7A
Other languages
Chinese (zh)
Inventor
贾海涛
李玉琳
李彧
张洋
张钰琪
贾宇明
任利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202310334196.7A
Publication of CN116452688A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0475 Generative networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/094 Adversarial learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image description generation method based on a common attention (co-attention) mechanism, which is effective for semantic alignment in image description. To address the problem that the generated description is not aligned with regions in the image, a prophet (look-ahead) attention mechanism is added to the encoder-decoder framework; it dynamically attends to image regions using information from future time steps. To address semantic consistency in image description, a co-attention mechanism is introduced into the discriminator and the idea of adversarial learning is adopted: a generator and a discriminator are trained so that the discriminator classifies generated image descriptions as human-written or machine-generated, thereby improving the semantic consistency of the generated descriptions. Built on a generative adversarial network, the image description model based on the co-attention mechanism can accurately generate descriptions that match the image content and produce linguistically diverse descriptions.

Description

Image description generation method based on common attention mechanism
Technical Field
The invention relates to the field of image description generation in deep learning and aims to solve the problem that the generated description is not semantically aligned with the image in image description generation.
Background
Image description algorithms integrate computer vision and natural language processing so that a machine can generate a natural language description from a given image. Applications include image search, automatic image annotation, intelligent robots, and other fields.
In practical scenarios, image description algorithms are already widely used. In social media, they help platforms automatically generate image descriptions so that users can better understand photo content, enhancing the user experience. In search engines, they help the engine better understand picture content, improving retrieval accuracy and providing better search results. In autonomous driving, the vehicle must perceive its environment through image recognition, and image description algorithms help it better understand and predict road conditions. Image description algorithms can also be applied to medical imaging, unmanned aerial vehicle monitoring, and many other fields, providing strong support for intelligent and automated systems.
Image description algorithms mainly use an attention-enhanced encoder-decoder framework, in which the attention mechanism guides the decoding process by attending to image regions at each time step. This technique has been very successful in advancing image description. However, the conventional attention mechanism attends to image regions based on the previous hidden state, which contains only information about words generated in the past. The attention model must therefore predict attention weights without knowing the word they should correspond to, so the attended image region matches the current input word more closely than the word being generated.
For the task of generating image descriptions with convolutional neural networks, reinforcement learning techniques based on policy gradient methods have been introduced to directly optimize N-gram matching metrics such as CIDEr, BLEU-4, or SPICE; for example, image description models are trained with CIDEr as the optimization objective. However, these metrics do not enforce semantic alignment between the image and the description, nor do they provide a way to promote the naturalness of language so that machine-generated text becomes indistinguishable from human-written text.
With the continuing progress of deep learning, its application to image description has broadened, and this invention applies it to the problem that images and generated descriptions are semantically mismatched. Building on a generative adversarial network, the invention uses an improved prophet attention mechanism in the network design and trains a co-attention discriminator to detect a misalignment signal between an image and a generated sentence. The generator can then use this signal to improve its text generation mechanism and better align the description with the given image.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an image description generation method based on a common attention (co-attention) mechanism, built on a generative adversarial network. The technique incorporates an improved prophet attention mechanism and trains a co-attention discriminator to detect a misalignment signal between the image and the generated sentence. Aiming at the mismatch between images and description semantics in image description algorithms (as shown in Fig. 1), the GAN-based image description algorithm is further improved.
The technical scheme adopted by the invention is as follows:
step 1: based on an image description algorithm for generating an countermeasure network, the network model is divided into a generator and a discriminator, the former is used for generating a description of the corresponding image; the latter is to evaluate the description accuracy of the text description to the image, and the whole framework is shown in fig. 2;
step 2: the encoder-decoder framework is employed by the generator in step 1. The structure is as follows: the encoder adopts a convolutional neural network, a first-known attention mechanism, the decoder adopts a cyclic neural network, an image I is given, and the generator G outputs an image description
Step 3: the encoder in the step 2 adopts the fast R-CNN to accept the image I and extract the image characteristic V= { V 1 ,...,v k }∈R d×N
Step 4: the generator decoder in step 2 consists of an initial layer and a layer of attention-precedent. The initial layer is of an LSTM structure, and the generation of the image description can be controlled through certain modification. The precedent Attention layer calculates the Attention weight using bi-directional LSTM and improves Self-Attention. The attention weight is divided into present and future two parts, wherein the attention weight of the future part is calculated by predicting the generation probability of the next word;
step 5: in step 1, the discriminator network is designed by adopting a common attention mechanism so as to discriminate the generated image description into manual generation or machine generation. This arbiter is composed of two parts: an image attention module and a text attention module. The two modules are respectively used for extracting the characteristics of the image and the description and generating a corresponding attention matrix. The two attention matrices are then combined by a dot product operation to generate a matrix that represents the degree of semantic matching between the image and the description. Finally, this matrix is used as an output of the arbiter to force semantic alignment between the image and the description;
step 6: training a network model by adopting reinforcement learning SCST, using rewards under a decoding algorithm as a base line, and normalizing by using an image description evaluation index CIDEr so as to enable the generated description to be close to a provided sample reference of an N-gram level;
step 7: the arbiter will alternate with the generator during training. The two modules are trained together, the description generated by the network can reach a balance, and finally a generator network for generating the description and realizing semantic alignment between the image and the description is obtained.
Compared with the prior art, the invention has the beneficial effects that:
(1) Higher accuracy is achieved on the problem of semantic alignment between the image and the description;
(2) Addressing the limited diversity of existing image description algorithms, richer and more varied descriptions can be generated.
Drawings
Fig. 1 is: an example of the sequence of image regions attended to for each word of the generated image description.
Fig. 2 is: the overall framework of the co-attention-based image description model.
Fig. 3 is: the Faster R-CNN framework diagram.
Fig. 4 is: the visual attention architecture diagram.
Fig. 5 is: the prophet attention mechanism architecture diagram.
Fig. 6 is: the co-attention discriminator architecture diagram.
Fig. 7 is: a schematic of training the generator with SCST.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
First, the encoder network in the generator uses Faster R-CNN; its structure is shown in Fig. 3. Faster R-CNN is an object detection model that identifies object instances belonging to certain classes and localizes them with bounding boxes.
The Faster R-CNN model consists mainly of two modules, the RPN candidate-box extraction module and the Fast R-CNN detection module, and can be subdivided into three parts: the convolutional layers, the Region Proposal Network (RPN), and RoI pooling. The convolutional layers comprise a series of convolution (Conv + ReLU) and pooling operations that extract image features; the classical VGG16 network is used, and the convolutional-layer weights are shared by the RPN and the Fast R-CNN detection head, which is key to speeding up training and improving the real-time performance of the model. The RPN generates region candidate boxes: based on multi-scale anchor boxes introduced by the network, it classifies each anchor box as target or background with a Softmax and applies bounding-box regression to predict the precise position of each candidate box for subsequent object recognition and detection. RoI pooling combines the convolutional features with the candidate boxes: it maps the candidate-box coordinates in the input image onto the last convolutional feature map (conv5-3), pools the corresponding regions of the feature map into a fixed-size (7 × 7) output, and connects to the following fully connected layers. The fully connected layers feed two sub-layers, a classification layer that determines the category of each candidate box and a regression layer that predicts its precise position through bounding-box regression. The output of Faster R-CNN is the set of feature vectors $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{d\times N}$ for the k image regions.
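As a concrete illustration of this feature extraction stage, the sketch below pulls region features from a pretrained Faster R-CNN using torchvision. It is a minimal sketch, not the patent's implementation: the torchvision ResNet-50-FPN detector stands in for the VGG16 backbone described above, and keeping the top k = 36 RPN proposals is an assumption.

```python
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

@torch.no_grad()
def region_features(image, k=36):
    """Return V = {v_1, ..., v_k}: one pooled descriptor per proposed region."""
    images, _ = detector.transform([image])        # resize + normalise the input
    feats = detector.backbone(images.tensors)      # FPN feature maps
    proposals, _ = detector.rpn(images, feats)     # RPN candidate boxes
    boxes = [proposals[0][:k]]                     # keep the top-k proposals
    pooled = detector.roi_heads.box_roi_pool(feats, boxes, images.image_sizes)
    return detector.roi_heads.box_head(pooled)     # (k, 1024) region features V
```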
The attention-enhanced image description decoder is shown in Fig. 4. At each decoding step t, the decoder concatenates the word embedding of the current input word $y_{t-1}$ with the averaged visual feature $\bar{v}$ as the input to the LSTM, as in equation (1):
$h_t = \mathrm{LSTM}([W_e y_{t-1};\, \bar{v}],\, h_{t-1})$ (1)
where $[\,;\,]$ denotes concatenation and $W_e$ is a learnable word embedding matrix. Next, the output $h_t$ of the LSTM is used as a query to attend to the relevant image regions in the visual feature set V and to produce an attended visual feature $c_t$, as in equations (2), (3):
$\alpha_t = \mathrm{softmax}\big(w_\alpha^{\top}\tanh(W_V V \oplus W_h h_t)\big)$ (2)
$c_t = V\alpha_t$ (3)
where $w_\alpha$, $W_h$ and $W_V$ are learnable parameters and $\oplus$ denotes matrix-vector addition, computed by adding the vector to each column of the matrix. Finally, $h_t$ and $c_t$ are passed to a linear layer to predict the next word, as in equation (4):
$y_t \sim p_t = \mathrm{softmax}(W_p[h_t; c_t] + b_p)$ (4)
where $W_p$ and $b_p$ are learnable parameters. Given a target reference sequence $y^*_{1:T}$ and a description model with parameters $\theta$, the training objective is to minimize the following cross-entropy loss, as in equation (5):
$L_{CE}(\theta) = -\sum_{t=1}^{T}\log p_\theta(y^*_t \mid y^*_{1:t-1})$ (5)
As these formulas show, at each time step t the attention model depends on $h_t$, which contains only the description words $y_{1:t-1}$ generated in the past, to compute the attention weight $\alpha_t$. This reliance on past information means that the attended visual features are not grounded in the word generated at the current time step, which compromises the accuracy of the description.
To let the attention model relate image regions to the words being generated without this bias, a prophet attention model is employed, as shown in Fig. 5. It uses information about future words to guide the conventional, widely used attention model, solving its semantic misalignment problem and selecting the correct image region for generating the corresponding word.
Specifically, the whole sentence $y_{1:T}$ is first generated with the conventional encoder-decoder framework. Then, for each time step t, the prophet attention takes the future information $y_{i:j}$ ($j \ge t$) as input and computes the attention weight $\hat\alpha_t$, which is naturally grounded in the generated words. In the implementation, as shown in Fig. 5, a bidirectional LSTM (BiLSTM) encodes $y_{1:T}$, so the information of $y_{i:j}$ is first converted into $h'_{i:j}$; the attention weight is then computed by equation (6):
$\hat\alpha_t = \mathrm{softmax}\big(w_\alpha^{\top}\tanh(W_V V \oplus W_h h'_{i:j})\big)$ (6)
where the attention models in equations (2), (3) and (6) share the same set of parameters. During training, an $\ell_1$ penalty between $\alpha_t$ and $\hat\alpha_t$ is used as a regularization loss, defined as equation (7):
$L_{Att}(\theta) = \sum_{t=1}^{T}\big\|\alpha_t - \hat\alpha_t\big\|_1$ (7)
where $\|\cdot\|_1$ denotes the $\ell_1$ norm. By minimizing the loss in equation (7), the "biased" attention weight $\alpha_t$, computed on the previously generated words $y_{1:t-1}$, is pulled toward the "ideal" attention weight $\hat\alpha_t$ computed on the future words $y_{i:j}$ ($j \ge t$).
Then, to train the prophet attention, $\hat\alpha_t$ is incorporated into the conventional encoder-decoder framework to regenerate the target reference $y^*_{1:T}$; mirroring equations (3), (4), (5) with $\hat\alpha_t$ in place of $\alpha_t$, this is defined as equations (8), (9), (10):
$\hat c_t = V\hat\alpha_t$ (8)
$\hat p_t = \mathrm{softmax}(W_p[h_t; \hat c_t] + b_p)$ (9)
$\hat L_{CE}(\theta) = -\sum_{t=1}^{T}\log \hat p_t(y^*_t)$ (10)
Combining the loss $L_{CE}(\theta)$ in equation (5), the loss $\hat L_{CE}(\theta)$ in equation (10) and the loss $L_{Att}(\theta)$ in equation (7), the complete training objective is defined as equation (11):
$L(\theta) = L_{CE}(\theta) + \hat L_{CE}(\theta) + \lambda L_{Att}(\theta)$ (11)
where λ is a hyper-parameter that controls the regularization. During training, the description model is first pre-trained with equation (5) for 25 epochs and then the complete model is trained with equation (11); in this way, suitable parameter weights are initialized for the prophet attention. In the test phase, the description decoder follows the same procedure as a conventional attention model, since future words are not visible at the current time step in the language generation task.
So that image regions can be attended dynamically on the basis of information from future time steps, a dynamic variant is used. In particular, for a noun phrase such as "a black shirt", all of its words should be treated as a complete phrase rather than as single words. Thus, for Dynamic Prophet Attention (DPA), if the currently output word $y_t$ belongs to a noun phrase (NP), the DPA uses all the words in that noun phrase to compute the attention weight $\hat\alpha_t$. When the word is a non-visual (NV) word, the prophet attention model is masked, i.e. the corresponding terms in the loss of equation (10) and the loss of equation (7) are dropped. For the remaining words, i = j = t is set directly; in image description these remaining words are typically verbs, which act as relational words connecting different noun phrases. In short, dynamic prophet attention is defined by equation (12):
$(i, j) = \begin{cases} (\mathrm{start}(\mathrm{NP}),\ \mathrm{end}(\mathrm{NP})) & y_t \in \mathrm{NP} \\ \text{masked} & y_t \in \{y_{NV}\} \\ (t,\ t) & \text{otherwise} \end{cases}$ (12)
where $\{y_{NV}\}$ denotes the set of all non-visual words. The attention model can thus learn to locate the image region for each output word $y_t$ without needing the reference samples of the training descriptions.
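A minimal sketch of the prophet attention computation in equations (6), (7) follows. The module names and the detaching of the ideal weights are assumptions; in the described method the attention parameters are shared with the decoder's attention of equations (2), (3), whereas this sketch keeps a separate copy for brevity, and the non-visual-word masking of equation (12) is supplied by the caller through visual_mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProphetAttentionRegularizer(nn.Module):
    """Compute 'ideal' attention weights from future words (BiLSTM-encoded) and
    pull the decoder's weights toward them with an L1 penalty (eqs. 6-7)."""
    def __init__(self, d=1024, m=512):
        super().__init__()
        self.bilstm = nn.LSTM(m, m // 2, bidirectional=True, batch_first=True)
        self.w_h = nn.Linear(m, m)
        self.w_v = nn.Linear(d, m)
        self.w_a = nn.Linear(m, 1)

    def ideal_weights(self, word_embs, V):
        # Encode the whole generated sentence so each step can see future words.
        h_prime, _ = self.bilstm(word_embs)              # (B, T, m)
        scores = self.w_a(torch.tanh(
            self.w_h(h_prime).unsqueeze(2) + self.w_v(V).unsqueeze(1)))
        return F.softmax(scores.squeeze(-1), dim=-1)     # (B, T, k)

    @staticmethod
    def l1_loss(alpha, alpha_hat, visual_mask):
        # Eq. (7): L1 distance, masked where the target word is non-visual.
        diff = (alpha - alpha_hat.detach()).abs().sum(dim=-1)   # (B, T)
        return (diff * visual_mask).sum() / visual_mask.sum().clamp(min=1)
```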
The task of the discriminator is to score the similarity between the image and the description. The image and the description are co-embedded at an early stage using a co-attention model, and the similarity is computed over the full set representations. The co-attention discriminator is shown in Fig. 6; the details of its construction are given below.
Given a sentence w composed of a word sequence $(w_1,\dots,w_T)$, the discriminator embeds each word with an LSTM (state dimension m = 512), giving $H=[h_1,\dots,h_T]^{\top}\in\mathbb{R}^{T\times m}$, where $h_t, c_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, w_t)$. For the image I, features $(I_1,\dots,I_C)$ are extracted, where $C = 14\times 14 = 196$, and embedded as $I=[W I_1,\dots,W I_C]^{\top}\in\mathbb{R}^{C\times m}$, where $d_I = 2048$ is the image feature size used here. A bilinear projection $Q\in\mathbb{R}^{m\times m}$ is used to compute the correlation $Y = \tanh(IQH^{\top})\in\mathbb{R}^{C\times T}$. The matrix Y is used to compute the co-attention weight of one modality on the other, as in equations (13), (14):
$\alpha = \mathrm{Softmax}\big(\mathrm{Linear}(\tanh(IW_I + YHW_{Ih}))\big) \in \mathbb{R}^{C}$ (13)
$\beta = \mathrm{Softmax}\big(\mathrm{Linear}(\tanh(HW_h + Y^{\top}IW_{hI}))\big) \in \mathbb{R}^{T}$ (14)
where all the new matrices are in $\mathbb{R}^{m\times m}$. The weights are then used to combine the word and image features into attended representations, with $U_I, V_S \in \mathbb{R}^{m\times m}$. Finally, the image-description score is computed from the attended image and sentence features (projected through $U_I$ and $V_S$) together with $E_I$ and $E_S$, where $E_I$ is the spatially averaged CNN feature and $E_S$ is the last state of the LSTM.
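The sketch below mirrors the co-attention computation of equations (13), (14) and a plausible scoring head. The final score (a bilinear match between the attended image and sentence features) is an assumption, and the $E_I$ and $E_S$ terms are omitted for brevity.

```python
import torch
import torch.nn as nn

class CoAttentionDiscriminator(nn.Module):
    """Correlate region features (C x m) with word states (T x m) through a
    bilinear map Q, attend over each modality, and score the pair (eqs. 13-14)."""
    def __init__(self, m=512, d_img=2048):
        super().__init__()
        self.embed_img = nn.Linear(d_img, m, bias=False)   # W in I = [W I_1, ...]
        self.Q = nn.Parameter(torch.randn(m, m) * 0.02)    # bilinear correlation
        self.w_I = nn.Linear(m, m)
        self.w_Ih = nn.Linear(m, m)
        self.w_h = nn.Linear(m, m)
        self.w_hI = nn.Linear(m, m)
        self.lin_img = nn.Linear(m, 1)
        self.lin_txt = nn.Linear(m, 1)
        self.U_I = nn.Linear(m, m)
        self.V_S = nn.Linear(m, m)

    def forward(self, img_feats, word_states):
        I = self.embed_img(img_feats)                      # (B, C, m)
        H = word_states                                    # (B, T, m)
        Y = torch.tanh(I @ self.Q @ H.transpose(1, 2))     # (B, C, T)
        # Eqs. (13)-(14): attention over each modality, guided by the other.
        alpha = torch.softmax(self.lin_img(torch.tanh(
            self.w_I(I) + Y @ self.w_Ih(H))).squeeze(-1), dim=-1)          # (B, C)
        beta = torch.softmax(self.lin_txt(torch.tanh(
            self.w_h(H) + Y.transpose(1, 2) @ self.w_hI(I))).squeeze(-1), dim=-1)
        i_att = (alpha.unsqueeze(-1) * I).sum(dim=1)       # attended image vector
        s_att = (beta.unsqueeze(-1) * H).sum(dim=1)        # attended sentence vector
        # Assumed scoring head: bilinear match between the two attended vectors.
        score = (self.U_I(i_att) * self.V_S(s_att)).sum(dim=-1)
        return torch.sigmoid(score)
```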
At training time the generator is optimized to solve $\max_\theta L_G(\theta)$, where $L_G(\theta) = \mathbb{E}_I\big[\log D_\eta(I, G_\theta(I))\big]$. The generator $G_\theta$ is trained with SCST, a variant of reinforcement learning that uses the reward obtained under the decoding algorithm as a baseline. In this work the decoding algorithm is greedy: at each step the most likely word is selected by $\arg\max p_\theta(\cdot \mid h_t)$. For a given image, a single sample $w^s$ from the generator is used to estimate the full-sequence reward, where $w^s \sim p_\theta(\cdot \mid I)$. Using SCST, the gradient is estimated as in equation (15):
$\nabla_\theta L_G(\theta) \approx \big(r(w^s) - r(\hat w)\big)\,\nabla_\theta \log p_\theta(w^s \mid I)$ (15)
where $r(w) = \log D_\eta(I, w)$ and $\hat w$ is obtained by greedy decoding, as shown in Fig. 7. Note that the baseline does not change the expected value of the gradient, but it reduces the variance of the estimate.
In addition, the GAN training can be regularized with an image description evaluation metric $r_{NLP}$ (CIDEr in this work), bringing the generated description close to the provided sample references at the N-gram level. The gradient is then as in equation (16), which is equation (15) with the reward augmented to $r(w) = \log D_\eta(I, w) + r_{NLP}(w)$:
$\nabla_\theta L_G(\theta) \approx \Big(\big(\log D_\eta(I, w^s) + r_{NLP}(w^s)\big) - \big(\log D_\eta(I, \hat w) + r_{NLP}(\hat w)\big)\Big)\,\nabla_\theta \log p_\theta(w^s \mid I)$ (16)
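A compact sketch of the self-critical gradient estimate in equations (15), (16) follows: the sampled caption's reward is baselined by the greedy caption's reward, and the reward mixes the discriminator score with CIDEr. The weighting lam is an assumption (1.0 matches the direct sum above).

```python
import torch

def mixed_reward(disc_score, cider_score, lam=1.0):
    """r(w) = log D(I, w) + lam * CIDEr(w); the weighting lam is assumed."""
    return torch.log(disc_score.clamp(min=1e-8)) + lam * cider_score

def scst_loss(log_probs_sampled, reward_sampled, reward_greedy):
    """Minimising this loss follows eq. (15)/(16): the advantage
    (r(w^s) - r(w_hat)) scales the log-probability of the sampled caption."""
    advantage = (reward_sampled - reward_greedy).detach()          # (B,)
    return -(advantage * log_probs_sampled.sum(dim=-1)).mean()
```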
distinguishing device D η Not only is training to distinguish between a real description and a fake description, but it is also possible to detect when an image is combined with a random uncorrelated real sentence, forcing it to check not only the composition of the sentence, but also the semantic relationship between the image and the description. To achieve this goal, this section solves the following optimization problem: max (max) η L D (eta) wherein L is lost D (eta) is formula (17):
wherein w is a true sentence, w s Is a slave generator G θ The resulting spurious description is sampled and w' is a true but randomly chosen description.
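Equation (17) translates directly into the following loss, written as a quantity to minimise, i.e. the negative of $L_D(\eta)$; the small eps for numerical stability is an implementation assumption.

```python
import torch

def discriminator_loss(d_real, d_fake, d_mismatch, eps=1e-8):
    """Eq. (17): reward the discriminator for accepting the true (image,
    description) pair and for rejecting both the generated description w^s
    and a real but randomly paired description w'."""
    return -(torch.log(d_real + eps)
             + torch.log(1.0 - d_fake + eps)
             + torch.log(1.0 - d_mismatch + eps)).mean()
```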

Claims (4)

1. An image description generation method based on a common attention mechanism, characterized by comprising the following steps:
Step 1: the image description method is based on a generative adversarial network; the network model is divided into a generator and a discriminator, the former generating a description for the corresponding image and the latter evaluating how accurately the text description matches the image;
Step 2: the generator in step 1 adopts an encoder-decoder framework: the encoder is a convolutional neural network, the decoder is a recurrent neural network, and a prophet attention mechanism is incorporated; given an image I, the generator G outputs an image description;
Step 3: the encoder in step 2 adopts Faster R-CNN, which takes the image I and extracts the image features $V=\{v_1,\dots,v_k\}\in\mathbb{R}^{d\times N}$;
Step 4: the generator decoder in step 2 consists of an initial layer and a prophet attention layer; the initial layer is an LSTM, modified so that generation of the image description can be controlled; the prophet attention layer computes attention weights with a bidirectional LSTM and improves on self-attention; the attention weights are split into a present part and a future part, where the future part is computed from the predicted generation probability of the next word;
Step 5: in step 1, the discriminator network adopts a co-attention mechanism to classify a generated image description as human-written or machine-generated; the discriminator consists of two parts, an image attention module and a text attention module, which extract features of the image and of the description, respectively, and produce corresponding attention matrices; the two attention matrices are then combined by a dot product to produce a matrix representing the degree of semantic match between the image and the description; finally, this matrix is used as the output of the discriminator to enforce semantic alignment between the image and the description;
Step 6: the network model is trained with self-critical sequence training (SCST), using the reward obtained under the decoding algorithm as a baseline and regularizing with the image description evaluation metric CIDEr so that the generated descriptions stay close to the provided reference samples at the N-gram level;
Step 7: the discriminator and the generator are trained alternately; trained together, the two modules reach an equilibrium, and the result is a generator network that generates descriptions and achieves semantic alignment between the image and the description.
2. The method of claim 1, wherein the attention mechanism in step 2 is a prophet attention mechanism.
3. The method of claim 1, wherein the discriminator network in step 5 is a co-attention discriminator.
4. The method of claim 1, wherein the model in step 6 is trained using the SCST reinforcement learning method.
CN202310334196.7A 2023-03-31 2023-03-31 Image description generation method based on common attention mechanism Pending CN116452688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310334196.7A CN116452688A (en) 2023-03-31 2023-03-31 Image description generation method based on common attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310334196.7A CN116452688A (en) 2023-03-31 2023-03-31 Image description generation method based on common attention mechanism

Publications (1)

Publication Number Publication Date
CN116452688A true CN116452688A (en) 2023-07-18

Family

ID=87129529

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310334196.7A Pending CN116452688A (en) 2023-03-31 2023-03-31 Image description generation method based on common attention mechanism

Country Status (1)

Country Link
CN (1) CN116452688A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332048A (en) * 2023-11-30 2024-01-02 运易通科技有限公司 Logistics information query method, device and system based on machine learning
CN117332048B (en) * 2023-11-30 2024-03-22 运易通科技有限公司 Logistics information query method, device and system based on machine learning
CN118094447A (en) * 2024-04-24 2024-05-28 贵州大学 Unmanned aerial vehicle flight data self-adaptive anomaly detection method based on encoding-decoding
CN118094447B (en) * 2024-04-24 2024-07-02 贵州大学 Unmanned aerial vehicle flight data self-adaptive anomaly detection method based on encoding-decoding

Similar Documents

Publication Publication Date Title
CN113158875B (en) Image-text emotion analysis method and system based on multi-mode interaction fusion network
CN109992686A (en) Based on multi-angle from the image-text retrieval system and method for attention mechanism
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN108765383B (en) Video description method based on deep migration learning
CN108563624A (en) A kind of spatial term method based on deep learning
CN116452688A (en) Image description generation method based on common attention mechanism
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN116955699B (en) Video cross-mode search model training method, searching method and device
CN113035311A (en) Medical image report automatic generation method based on multi-mode attention mechanism
CN113204675B (en) Cross-modal video time retrieval method based on cross-modal object inference network
CN114239585A (en) Biomedical nested named entity recognition method
CN116579345B (en) Named entity recognition model training method, named entity recognition method and named entity recognition device
CN117611576A (en) Image-text fusion-based contrast learning prediction method
CN114722798A (en) Ironic recognition model based on convolutional neural network and attention system
CN114626454A (en) Visual emotion recognition method integrating self-supervision learning and attention mechanism
CN117829243A (en) Model training method, target detection device, electronic equipment and medium
CN115758159B (en) Zero sample text position detection method based on mixed contrast learning and generation type data enhancement
CN116151226B (en) Machine learning-based deaf-mute sign language error correction method, equipment and medium
CN117115474A (en) End-to-end single target tracking method based on multi-stage feature extraction
CN116958740A (en) Zero sample target detection method based on semantic perception and self-adaptive contrast learning
Wu et al. Question-driven multiple attention (dqma) model for visual question answer
Ren et al. Improved image description via embedded object structure graph and semantic feature matching
CN114692615B (en) Small sample intention recognition method for small languages
Vakada et al. Descriptive and Coherent Paragraph Generation for Image Paragraph Captioning Using Vision Transformer and Post-processing
Sheng et al. Revolutionizing Image Captioning: Integrating Attention Mechanisms with Adaptive Fusion Gates.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination