CN110674850A - Image description generation method based on attention mechanism - Google Patents

Image description generation method based on attention mechanism

Info

Publication number
CN110674850A
Authority
CN
China
Prior art keywords
image
model
features
region
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910828522.3A
Other languages
Chinese (zh)
Inventor
肖春霞
赵坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910828522.3A priority Critical patent/CN110674850A/en
Publication of CN110674850A publication Critical patent/CN110674850A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an image description generation method based on an attention mechanism. The invention has the following advantages: fusing the relationship features with the object features enriches the image information; the two-layer language model generates finer-grained image descriptions; and further optimizing the trained model with reinforcement learning alleviates the exposure bias problem.

Description

Image description generation method based on attention mechanism
Technical Field
The invention belongs to the fields of computer vision and natural language processing, relates to a method for generating language descriptions of images, and particularly relates to an image description generation method based on an attention mechanism.
Background
In many everyday situations it is necessary to convert image content into a text description, for example automatically generating a text summary of an image in social software when the network connection is poor, or helping visually impaired people understand image content. Existing image description methods are mainly based on deep learning: a convolutional neural network is used as the image processing model to extract image features, and these features are fed into a recurrent neural network, used as the language generation model, to produce the description. However, such models usually rely on global or object-level image features, which makes it difficult to focus on the salient target objects in the image, loses much important information, and makes it hard to fully exploit the important visual semantic relationship information in the image. Moreover, most existing models are a single-pass forward process: when generating the next word, the model can only use the words it has already generated, so an erroneous word produced during generation causes error accumulation. On the other hand, existing models are trained by maximizing the joint probability of the generated sequence, i.e., by minimizing a cross-entropy loss; back-propagation maximizes the joint probability of the reference words so that the model learns the probability distribution of words in a sentence. Yet the automatic evaluation metrics usually used to judge the quality of the sentences generated by an image description model differ from this objective and are non-differentiable, so they cannot be used directly as the loss function; this inconsistency between the loss function and the evaluation metrics prevents the model from being sufficiently optimized.
Disclosure of Invention
The invention aims to overcome the defects of the existing method, and provides an image description generation method based on an attention mechanism.
The technical problem of the present invention is mainly solved by the following technical solution. An image description generation method based on an attention mechanism includes the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
step 2, adopting the ResNet101 model as the initial CNN model, pre-training the parameters of ResNet101, using the pre-trained ResNet101 to extract the global features of the image, then using the pre-trained ResNet101 to replace the CNN in the Faster R-CNN algorithm to extract a number of object region features for each image, and then pairing the object regions into relationship regions to extract relationship features;
step 3, performing feature fusion on the relationship features and the object region features to obtain object region features that contain the relationships between objects;
step 4, inputting the object region features containing the relationships between objects obtained in the previous step into a two-layer LSTM language model to obtain the output, namely the natural language description generated for the image;
and step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and this similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning.
Furthermore, the vocabulary table is constructed in step 1 by counting the occurrence frequency of each word in the text description of the MS COCO dataset, and only selecting words with the occurrence frequency greater than five to be listed in the vocabulary table, wherein the MS COCO dataset vocabulary table contains 9487 words.
Further, the Faster R-CNN algorithm is used in step 2 to extract the object region features of the image, and the loss function for one image during training is defined as:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
where the parameter λ balances the two normalization terms N_cls and N_reg; N_cls is set to the mini-batch size and N_reg to the total number of anchors; i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region; p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample; t_i is the vector of the 4 coordinate parameters of the generated bounding box, and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor;
L_cls is the classification loss of the object:
L_cls(p_i, p_i^*) = -log[p_i^* p_i + (1 - p_i^*)(1 - p_i)]
L_reg is the bounding-box regression loss:
L_reg(t_i, t_i^*) = R(t_i - t_i^*)
where R is the smooth L1 loss function:
R(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
Further, the specific method for fusing the relationship features with the object region features in step 3 is as follows:
For an input image I, the previous step produces a set of object regions {v_1, ..., v_i, ..., v_K} and a set of relationship regions {s_1, ..., s_k, ...}. Each object is contained in several different relationship regions, and each relationship has a different importance to the object. An attention weight p_i(s_k), denoting the weight of relationship s_k for object v_i, is computed with a softmax over the relationship regions containing v_i (the formula appears as an image in the original document).
The features of the relationship regions connected to object v_i are aggregated according to these attention weights into an overall relationship feature,
s̄_i = Σ_k p_i(s_k) · s_k,
and the aggregated relationship feature is then passed into the target object feature as additional information (formula given as an image in the original document), where upper-case S_k denotes the set of relationship regions, lower-case s_k a single relationship region, s̄_i the aggregated relationship feature vector, and v̂_i the fine-tuned v_i.
Further, step 4 is specifically implemented as follows.
Step 4.1, input the global image feature v_0 into the first layer of the two-layer LSTM language model;
Step 4.2, at each time t, compute an attention weight for each feature-fused object region (the attention formula appears as an image in the original document), where W_v1 and W_h1 are parameters to be learned in the language model, the computed weight is the attention weight assigned to region i at time t, h_t^1 is the output of the first-layer LSTM at time t, and tanh is the tanh activation function;
Step 4.3, the attention weight assigned to each region represents that region's contribution to the currently generated word; the attended visual feature is obtained by multiplying each v_i by its attention weight and summing (formulas given as images in the original document);
Step 4.4, the input of the second-layer LSTM of the language model is formed by combining the output of the first-layer language model with the attention-weighted image features;
Step 4.5, add a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is consulted when generating text (non-visual) words:
g_t = σ(W_x x_t + W_h h_{t-1})
s_t = g_t ⊙ tanh(c_t)
where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-state output of the second-layer LSTM at time t-1;
Step 4.6, the second-layer LSTM also uses an attention mechanism (formula given as an image in the original document), where W_v2 and W_h2 are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to rely more on the visual information v_i or more on the language information s_t;
Step 4.7, compute the weights that distribute between visual information and language information when generating each word (formula given as an image in the original document), using the attention vector obtained in step 4.6, where W_s and W_h3 are parameters to be learned in the model and the weights cover both the image region features and the language information vector; the context is obtained by multiplying each v_i by its corresponding weight β and summing, and since one language information vector is added the length becomes K+1;
Step 4.8, feed the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t (formula given as an image in the original document);
Step 4.9, finally, train the model with a cross-entropy loss function.
Further, step 5 is specifically implemented as follows.
Step 5.1, first compute the CIDEr score CIDEr(c_i, S_i) of the model, where c_i is the candidate sentence and S_i is the set of reference sentences;
Step 5.2, for a matched image-sentence pair (I_n, S_n), train a convolutional neural network to extract the global feature vector φ(I) of the input image I, and train a recurrent neural network to extract the feature of the sentence S; the features of the two different modalities are then mapped into the same space through two linear mapping layers;
the cosine similarity s(I, S) between the mapped image and sentence features is then computed (formula given as an image in the original document);
to train this mapping-space model, the parameter θ_s is optimized to minimize e(θ_s), the average error of the loss function L_e(I, S) over the training set {(I_n, S_n)}, where L_e(I, S) is defined with a bidirectional ranking loss:
L_e(I, S) = Σ_{S'} max(0, β - s(I, S) + s(I, S')) + Σ_{I'} max(0, β - s(I, S) + s(I', S))
where β is a margin, (I, S) is a matched reference image-sentence pair, I' and S' are images and sentences randomly sampled from the training set, and s denotes the cosine similarity;
Step 5.3, define the reward using the CIDEr score of the sentence Ŝ predicted by the model together with the cosine similarity between the input image I and Ŝ (the reward formula appears as an image in the original document);
Step 5.4, update the network parameters with the policy gradient of reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_RL(θ) with respect to the parameters θ is
∇_θ L_RL(θ) = -E_{Ŝ∼p_θ}[ R(Ŝ) ∇_θ log p_θ(Ŝ) ]
where log p_θ(Ŝ) is the score function and R(·) is the reward function;
to reduce the variance of the gradient estimate, a baseline function b is introduced;
Step 5.5, let b = R(S^*, I); the gradient is then computed as
∇_θ L_RL(θ) ≈ -(R(Ŝ, I) - R(S^*, I)) ∇_θ log p_θ(Ŝ)
where S^* is the description sentence corresponding to image I, i.e., the ground-truth of S.
Compared with the prior art, the invention has the following advantages:
1. in the invention, fusing the relationship features with the object features enriches the image information;
2. the two-layer language model generates finer-grained image descriptions;
3. the invention further optimizes the trained model with reinforcement learning, which alleviates the exposure bias problem.
Drawings
Fig. 1 is a general flow chart of the present invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
As shown in fig. 1, an attention-based image description generation method includes the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
the vocabulary table obtained in the step 1 is obtained by counting the occurrence frequency of each word in the text description of the MS COCO data set, and only selecting the words with the occurrence frequency more than five times to be listed in the vocabulary table, wherein the MS COCO data set vocabulary table comprises 9,487 words.
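As a concrete illustration of this vocabulary-building rule, the following minimal Python sketch counts word frequencies over the caption corpus and keeps only words occurring more than five times; the whitespace tokenizer and the special tokens are assumptions not specified in the patent.

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Keep only words that occur more than `min_count` times in the caption corpus."""
    counter = Counter()
    for caption in captions:
        counter.update(caption.lower().split())       # naive whitespace tokenizer (assumption)
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]    # special tokens (assumption)
    vocab += sorted(w for w, c in counter.items() if c > min_count)
    return {word: idx for idx, word in enumerate(vocab)}

# Toy usage; on the MS COCO captions this thresholding yields roughly 9487 words.
word2idx = build_vocab(["a dog runs on grass", "a dog plays with a ball"], min_count=0)
```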
Step 2, adopting the ResNet101 model as the initial CNN model, pre-training the parameters of ResNet101 on the ImageNet dataset, using the pre-trained ResNet101 to extract the global features of the image, then using the pre-trained ResNet101 to replace the CNN in the Faster R-CNN algorithm to extract 36 object region features for each image, and then pairing the object regions into relationship regions to extract relationship features;
In step 2, the global features of the image are extracted with the pre-trained ResNet101, the object region features of the image are extracted with the Faster R-CNN algorithm, and the loss function for one image during training is defined as:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
where the parameter λ balances the two normalization terms N_cls and N_reg. N_cls is set to the mini-batch size, N_reg to the total number of anchors, and λ to 10. i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region. p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample. t_i is the vector of the 4 coordinate parameters of the generated bounding box (upper-left, upper-right, lower-left and lower-right, respectively), and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor. L_cls is the classification loss of the object:
L_cls(p_i, p_i^*) = -log[p_i^* p_i + (1 - p_i^*)(1 - p_i)]
L_reg is the bounding-box regression loss:
L_reg(t_i, t_i^*) = R(t_i - t_i^*)
where R is the smooth L1 loss function:
R(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
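For clarity, here is a hedged PyTorch-style sketch of this loss under the definitions above; the tensor shapes, the clamp used for numerical stability, and the default values of N_cls, N_reg and λ are assumptions.

```python
import torch

def smooth_l1(x):
    """R(x) = 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = x.abs()
    return torch.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def detection_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
    """p, p_star: (A,) predicted / ground-truth objectness; t, t_star: (A, 4) box parameters."""
    # L_cls(p_i, p_i*) = -log[p_i* p_i + (1 - p_i*)(1 - p_i)]
    l_cls = -(p_star * p + (1 - p_star) * (1 - p)).clamp_min(1e-8).log()
    # L_reg only counts for positive anchors (p_i* = 1)
    l_reg = p_star.unsqueeze(-1) * smooth_l1(t - t_star)
    return l_cls.sum() / n_cls + lam * l_reg.sum() / n_reg
```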
Step 3, performing feature fusion on the relationship features and the object region features to obtain object region features that contain the relationships between objects. The specific method is as follows:
For an input image I, the previous step produces a set of object regions {v_1, ..., v_i, ..., v_K} and a set of relationship regions {s_1, ..., s_k, ...}. Each object is contained in several different relationship regions, and each relationship has a different importance to the object. An attention weight p_i(s_k), denoting the weight of relationship s_k for object v_i, is computed with a softmax over the relationship regions containing v_i (the formula appears as an image in the original document).
The features of the relationship regions connected to object v_i are aggregated according to these attention weights into an overall relationship feature,
s̄_i = Σ_k p_i(s_k) · s_k,
and the aggregated relationship feature is then passed into the target object feature as additional information (formula given as an image in the original document), where upper-case S_k denotes the set of relationship regions, lower-case s_k a single relationship region, s̄_i the aggregated relationship feature vector, and v̂_i the fine-tuned v_i.
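The following PyTorch-style sketch shows one plausible realisation of this fusion step; the bilinear attention score, the residual addition into v_i, and the mapping `rel_to_obj` from objects to their relationship regions are assumptions, since the patent gives the exact formulas only as images.

```python
import torch
import torch.nn.functional as F

def fuse_relation_features(obj_feats, rel_feats, rel_to_obj, W):
    """obj_feats: (K, d) object features; rel_feats: (M, d) relationship features;
    rel_to_obj[i]: indices of relationship regions containing object i; W: (d, d) learned matrix."""
    fused = []
    for i, v_i in enumerate(obj_feats):
        s = rel_feats[rel_to_obj[i]]               # (m_i, d) relations that contain object i
        p = F.softmax(s @ (W @ v_i), dim=0)        # attention weights p_i(s_k) (assumed bilinear score)
        s_bar = (p.unsqueeze(-1) * s).sum(dim=0)   # aggregated relationship feature s_bar_i
        fused.append(v_i + s_bar)                  # fine-tuned object feature (assumed residual add)
    return torch.stack(fused)
```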
Step 4, inputting the object region features containing the relationships between objects obtained in the previous step into a two-layer LSTM language model to obtain the output, namely the natural language description generated for the image. This step specifically comprises the following sub-steps:
Step 4.1, the input of the first-layer LSTM is the concatenation of the global image feature v_0, the output of the second-layer LSTM at time t-1, and the encoding of the word generated at time t, from which the initial description is generated. The features of the generated word sequence at the current time are concatenated with the output of the first-layer LSTM at the current time as the input to the next-layer language model.
Step 4.2, at each time t, compute an attention weight for each feature-fused object region (the attention formula appears as an image in the original document), where W_v1 and W_h1 are parameters to be learned in the language model, the computed weight is the attention weight assigned to region i at time t, h_t^1 is the output of the first-layer LSTM at time t, and tanh is the tanh activation function.
Step 4.3, the attention weight assigned to each region represents that region's contribution to the currently generated word; the attended visual feature is obtained by multiplying each v_i by its attention weight and summing (formulas given as images in the original document).
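A hedged PyTorch sketch of the soft attention described in steps 4.2-4.3; the projection dimensions and the extra scoring vector w_a are assumptions, since the exact formula appears only as an image in the original.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Soft attention over fused region features conditioned on the first-layer LSTM output."""
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, att_dim, bias=False)   # plays the role of W_v1
        self.w_h = nn.Linear(hid_dim, att_dim, bias=False)    # plays the role of W_h1
        self.w_a = nn.Linear(att_dim, 1, bias=False)          # scoring vector (assumption)

    def forward(self, v, h1):                  # v: (K, feat_dim), h1: (hid_dim,)
        a = self.w_a(torch.tanh(self.w_v(v) + self.w_h(h1))).squeeze(-1)  # (K,) scores
        alpha = torch.softmax(a, dim=0)                                   # attention weights
        return (alpha.unsqueeze(-1) * v).sum(dim=0), alpha                # attended feature, weights
```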
Step 4.4, the input of the second-layer LSTM of the language model is formed by combining the output of the first-layer language model with the attention-weighted image features.
Step 4.5, add a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is consulted when generating text (non-visual) words:
g_t = σ(W_x x_t + W_h h_{t-1})
s_t = g_t ⊙ tanh(c_t)
where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-state output of the second-layer LSTM at time t-1.
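A minimal sketch of the sentinel gate in step 4.5, assuming the adaptive-attention formulation g_t = σ(W_x x_t + W_h h_{t-1}) and s_t = g_t ⊙ tanh(c_t); the matrix shapes are assumptions.

```python
import torch

def visual_sentinel(x_t, h_prev, c_t, W_x, W_h):
    """x_t: input of the second-layer LSTM; h_prev: its hidden state at t-1; c_t: its cell state at t."""
    g_t = torch.sigmoid(W_x @ x_t + W_h @ h_prev)   # sentinel gate g_t
    s_t = g_t * torch.tanh(c_t)                     # language information vector s_t (element-wise product)
    return s_t
```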
Step 4.6, the second-layer LSTM also uses an attention mechanism (formula given as an image in the original document), where W_v2 and W_h2 are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to rely more on the visual information v_i or more on the language information s_t.
Step 4.7, compute the weights that distribute between visual information and language information when generating each word (formula given as an image in the original document), using the attention vector obtained in step 4.6, where W_s and W_h3 are parameters to be learned in the model and the weights cover both the image region features and the language information vector. The context is obtained by multiplying each v_i by its corresponding weight β and summing, and since one language information vector is added the length becomes K+1.
Step 4.8, feed the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t (formula given as an image in the original document).
Step 4.9, train the model with a cross-entropy loss function:
L(θ) = -Σ_t log p_θ(y_t^* | y_1^*, ..., y_{t-1}^*)
where y_1^*, ..., y_{t-1}^* are the ground-truth words up to time t-1 and y_t^* is the ground-truth word at time t; the loss maximizes the conditional probability of the word at time t given the ground-truth words from time 1 to t-1.
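A short sketch of this teacher-forcing objective for steps 4.8-4.9; the padding index and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def caption_xent_loss(logits, targets, pad_idx=0):
    """logits: (T, vocab_size) scores from the second-layer LSTM; targets: (T,) ground-truth word ids.
    Implements L(theta) = -sum_t log p_theta(y_t* | y_1*..y_{t-1}*)."""
    log_probs = F.log_softmax(logits, dim=-1)       # softmax layer of step 4.8
    return F.nll_loss(log_probs, targets, ignore_index=pad_idx, reduction="sum")
```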
Step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and this similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning. This step specifically comprises the following sub-steps:
Step 5.1, first compute the CIDEr score of the model. The number of times an n-gram w_k appears in a reference sentence s_ij is denoted h_k(s_ij), and the number of times it appears in the candidate sentence c_i is denoted h_k(c_i). The TF-IDF weight g_k(s_ij) of each n-gram w_k is computed as
g_k(s_ij) = [ h_k(s_ij) / Σ_{w_l∈Ω} h_l(s_ij) ] · log( |I| / Σ_{I_p∈I} min(1, Σ_q h_k(s_pq)) )
where Ω is the set of all n-grams and I is the set of all images in the dataset. For n-grams of length n, the CIDEr_n score is computed from the average cosine similarity between the candidate sentence c_i and the m reference sentences S_i:
CIDEr_n(c_i, S_i) = (1/m) Σ_j ⟨g^n(c_i), g^n(s_ij)⟩ / ( ‖g^n(c_i)‖ · ‖g^n(s_ij)‖ )
Finally, the total CIDEr score is
CIDEr(c_i, S_i) = Σ_{n=1}^N w_n CIDEr_n(c_i, S_i)
where w_n is the weight assigned to n-grams of length n.
the above procedure can be found in the literature Vedantam R, Lawrence Zitnick C, Parikh D.Cider: Consensuss-based image description evaluation [ C]//Proceedings of the IEEE conference on computer vision and patternrecognition.2015:4566-4575。
Step 5.2, for a matched image-sentence pair (I_n, S_n), train a convolutional neural network to extract the global feature vector φ(I) of the input image I, and train a recurrent neural network (RNN) to extract the feature of the sentence S.
The features of the two different modalities are then mapped into the same space through two linear mapping layers.
The cosine similarity s(I, S) between the mapped image and sentence features is then computed (formula given as an image in the original document).
To train this mapping-space model, the parameter θ_s is optimized to minimize e(θ_s), the average error of the loss function L_e(I, S) over the training set {(I_n, S_n)}.
Here L_e(I, S) is defined with a bidirectional ranking loss:
L_e(I, S) = Σ_{S'} max(0, β - s(I, S) + s(I, S')) + Σ_{I'} max(0, β - s(I, S) + s(I', S))
where β is a margin, (I, S) is a matched reference image-sentence pair, and I' and S' are images and sentences randomly sampled from the training set.
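A hedged batch-wise sketch of this bidirectional ranking loss, using all other items in the mini-batch as the randomly sampled negatives I' and S'; the margin value and the batch-negative scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, sent_emb, margin=0.2):
    """img_emb, sent_emb: (B, d) mapped features; row i of each is a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    sim = img_emb @ sent_emb.t()                    # cosine similarities, (B, B)
    pos = sim.diag().unsqueeze(1)                   # matched-pair similarities s(I, S)
    cost_s = (margin + sim - pos).clamp(min=0)      # image-to-sentence direction
    cost_i = (margin + sim - pos.t()).clamp(min=0)  # sentence-to-image direction
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_i.masked_fill(mask, 0).sum()
```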
Step 5.3, define the reward using the CIDEr score of the sentence Ŝ predicted by the model together with the cosine similarity between the input image I and Ŝ (the reward formula appears as an image in the original document).
Step 5.4, update the network parameters with the policy gradient of reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_RL(θ) with respect to the parameters θ is
∇_θ L_RL(θ) = -E_{Ŝ∼p_θ}[ R(Ŝ) ∇_θ log p_θ(Ŝ) ]
where log p_θ(Ŝ) is the score function and R(·) is the reward function.
The above steps can be found in: R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057-1063, 2000.
In order to reduce the variance of the gradient estimate, a baseline function b is introduced,
Figure BDA00021898893000001010
step 5.5, let b ═ R (S)*I), then the gradient is calculated as
Figure BDA00021898893000001011
And S is a descriptive sentence corresponding to the image I, which is equivalent to the true value of S, and the true value is a known quantity.
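A minimal sketch of this self-critical-style update with the ground-truth reward as baseline; in practice the reward R would be the CIDEr score plus the mapping-space similarity described above, and the single-sample approximation of the expectation is an assumption.

```python
def rl_caption_loss(log_prob_sampled, reward_sampled, reward_reference):
    """log_prob_sampled: sum_t log p_theta(w_t) of the sampled caption (scalar tensor with grad);
    reward_sampled: R(S_hat, I) as a plain float; reward_reference: baseline b = R(S*, I)."""
    advantage = reward_sampled - reward_reference   # R(S_hat, I) - R(S*, I)
    return -advantage * log_prob_sampled            # gradient matches -(R - b) * grad log p_theta
```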
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (6)

1. An attention mechanism-based image description generation method is characterized by comprising the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
step 2, adopting a ResNet101 model as a CNN initial model, performing parameter pre-training of ResNet101, using the pre-trained ResNet101 to independently extract global features of the image, then using the pre-trained ResNet101 to replace the CNN in a Faster R-CNN algorithm to extract a plurality of object region features of each image, and then forming the object regions into relationship regions in pairs to extract relationship features;
step 3, performing feature fusion on the relationship features and the object region features to obtain object region features containing the relationship between the objects;
step 4, inputting the object region features containing the relationships between objects obtained in the previous step into a two-layer LSTM language model to obtain the output, namely the natural language description generated for the image;
and step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and this similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning.
2. The method according to claim 1, characterized in that the vocabulary is constructed in step 1 by counting the occurrence frequency of each word in the text descriptions of the MS COCO dataset and including only words that occur more than five times, the MS COCO vocabulary containing 9487 words.
3. The method according to claim 1, characterized in that in step 2 the Faster R-CNN algorithm is used to extract the object region features of the image, and the loss function for one image during training is defined as:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
where the parameter λ balances the two normalization terms N_cls and N_reg; N_cls is set to the mini-batch size and N_reg to the total number of anchors; i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region; p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample; t_i is the vector of the 4 coordinate parameters of the generated bounding box, and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor;
L_cls is the classification loss of the object:
L_cls(p_i, p_i^*) = -log[p_i^* p_i + (1 - p_i^*)(1 - p_i)]
L_reg is the bounding-box regression loss:
L_reg(t_i, t_i^*) = R(t_i - t_i^*)
where R is the smooth L1 loss function:
R(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
4. The method according to claim 1, characterized in that the specific method for fusing the relationship features with the object region features in step 3 is as follows:
for an input image I, the previous step produces a set of object regions {v_1, ..., v_i, ..., v_K} and a set of relationship regions {s_1, ..., s_k, ...}; each object is contained in several different relationship regions, and each relationship has a different importance to the object; an attention weight p_i(s_k), denoting the weight of relationship s_k for object v_i, is computed with a softmax over the relationship regions containing v_i (the formula appears as an image in the original document);
the features of the relationship regions connected to object v_i are aggregated according to these attention weights into an overall relationship feature,
s̄_i = Σ_k p_i(s_k) · s_k,
and the aggregated relationship feature is then passed into the target object feature as additional information (formula given as an image in the original document), where upper-case S_k denotes the set of relationship regions, lower-case s_k a single relationship region, s̄_i the aggregated relationship feature vector, and v̂_i the fine-tuned v_i.
5. The method according to claim 1, characterized in that step 4 is specifically implemented as follows:
step 4.1, input the global image feature v_0 into the first layer of the two-layer LSTM language model;
step 4.2, at each time t, compute an attention weight for each feature-fused object region (the attention formula appears as an image in the original document), where W_v1 and W_h1 are parameters to be learned in the language model, the computed weight is the attention weight assigned to region i at time t, h_t^1 is the output of the first-layer LSTM at time t, and tanh is the tanh activation function;
step 4.3, the attention weight assigned to each region represents that region's contribution to the currently generated word; the attended visual feature is obtained by multiplying each v_i by its attention weight and summing (formulas given as images in the original document);
step 4.4, the input of the second-layer LSTM of the language model is formed by combining the output of the first-layer language model with the attention-weighted image features;
step 4.5, add a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is consulted when generating text (non-visual) words:
g_t = σ(W_x x_t + W_h h_{t-1})
s_t = g_t ⊙ tanh(c_t)
where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-state output of the second-layer LSTM at time t-1;
step 4.6, the second-layer LSTM also uses an attention mechanism (formula given as an image in the original document), where W_v2 and W_h2 are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to rely more on the visual information v_i or more on the language information s_t;
step 4.7, compute the weights that distribute between visual information and language information when generating each word (formula given as an image in the original document), using the attention vector obtained in step 4.6, where W_s and W_h3 are parameters to be learned in the model and the weights cover both the image region features and the language information vector; the context is obtained by multiplying each v_i by its corresponding weight β and summing, and since one language information vector is added the length becomes K+1;
step 4.8, feed the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t (formula given as an image in the original document);
and step 4.9, finally, train the model with a cross-entropy loss function.
6. The method according to claim 1, characterized in that step 5 is specifically implemented as follows:
step 5.1, first compute the CIDEr score CIDEr(c_i, S_i) of the model, where c_i is the candidate sentence and S_i is the set of reference sentences;
step 5.2, for a matched image-sentence pair (I_n, S_n), train a convolutional neural network to extract the global feature vector φ(I) of the input image I, and train a recurrent neural network to extract the feature of the sentence S; the features of the two different modalities are then mapped into the same space through two linear mapping layers;
the cosine similarity s(I, S) between the mapped image and sentence features is then computed (formula given as an image in the original document);
to train this mapping-space model, the parameter θ_s is optimized to minimize e(θ_s), the average error of the loss function L_e(I, S) over the training set {(I_n, S_n)}, where L_e(I, S) is defined with a bidirectional ranking loss:
L_e(I, S) = Σ_{S'} max(0, β - s(I, S) + s(I, S')) + Σ_{I'} max(0, β - s(I, S) + s(I', S))
where β is a margin, (I, S) is a matched reference image-sentence pair, and I' and S' are images and sentences randomly sampled from the training set;
step 5.3, define the reward using the CIDEr score of the sentence Ŝ predicted by the model together with the cosine similarity between the input image I and Ŝ (the reward formula appears as an image in the original document);
step 5.4, update the network parameters with the policy gradient of reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_RL(θ) with respect to the parameters θ is
∇_θ L_RL(θ) = -E_{Ŝ∼p_θ}[ R(Ŝ) ∇_θ log p_θ(Ŝ) ]
where log p_θ(Ŝ) is the score function and R(·) is the reward function;
to reduce the variance of the gradient estimate, a baseline function b is introduced, giving
∇_θ L_RL(θ) = -E_{Ŝ∼p_θ}[ (R(Ŝ) - b) ∇_θ log p_θ(Ŝ) ]
step 5.5, let b = R(S^*, I); the gradient is then computed as
∇_θ L_RL(θ) ≈ -(R(Ŝ, I) - R(S^*, I)) ∇_θ log p_θ(Ŝ)
where S^* is the description sentence corresponding to image I, i.e., the ground-truth of S.
CN201910828522.3A 2019-09-03 2019-09-03 Image description generation method based on attention mechanism Pending CN110674850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910828522.3A CN110674850A (en) 2019-09-03 2019-09-03 Image description generation method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910828522.3A CN110674850A (en) 2019-09-03 2019-09-03 Image description generation method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN110674850A true CN110674850A (en) 2020-01-10

Family

ID=69076245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910828522.3A Pending CN110674850A (en) 2019-09-03 2019-09-03 Image description generation method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN110674850A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325323A (en) * 2020-02-19 2020-06-23 山东大学 Power transmission and transformation scene description automatic generation method fusing global information and local information
CN111414962A (en) * 2020-03-19 2020-07-14 创新奇智(重庆)科技有限公司 Image classification method introducing object relationship
CN111612103A (en) * 2020-06-23 2020-09-01 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation
CN111753825A (en) * 2020-03-27 2020-10-09 北京京东尚科信息技术有限公司 Image description generation method, device, system, medium and electronic equipment
CN111783852A (en) * 2020-06-16 2020-10-16 北京工业大学 Self-adaptive image description generation method based on deep reinforcement learning
CN111814946A (en) * 2020-03-17 2020-10-23 同济大学 Image description automatic generation method based on multi-body evolution
CN111916050A (en) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection
CN112069841A (en) * 2020-07-24 2020-12-11 华南理工大学 Novel X-ray contraband parcel tracking method and device
CN112200268A (en) * 2020-11-04 2021-01-08 福州大学 Image description method based on encoder-decoder framework
CN112256904A (en) * 2020-09-21 2021-01-22 天津大学 Image retrieval method based on visual description sentences
CN112528989A (en) * 2020-12-01 2021-03-19 重庆邮电大学 Description generation method for semantic fine granularity of image
CN113378919A (en) * 2021-06-09 2021-09-10 重庆师范大学 Image description generation method for fusing visual sense and enhancing multilayer global features
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113469143A (en) * 2021-08-16 2021-10-01 西南科技大学 Finger vein image identification method based on neural network learning
CN113837230A (en) * 2021-08-30 2021-12-24 厦门大学 Image description generation method based on adaptive attention mechanism
CN114693790A (en) * 2022-04-02 2022-07-01 江西财经大学 Automatic image description method and system based on mixed attention mechanism
CN114882488A (en) * 2022-05-18 2022-08-09 北京理工大学 Multi-source remote sensing image information processing method based on deep learning and attention mechanism
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN108520273A (en) * 2018-03-26 2018-09-11 天津大学 A kind of quick detection recognition method of dense small item based on target detection
CN109146786A (en) * 2018-08-07 2019-01-04 北京市商汤科技开发有限公司 Scene chart generation method and device, electronic equipment and storage medium
CN109726696A (en) * 2019-01-03 2019-05-07 电子科技大学 System and method is generated based on the iamge description for weighing attention mechanism
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A kind of generation method of image, semantic description

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN108520273A (en) * 2018-03-26 2018-09-11 天津大学 A kind of quick detection recognition method of dense small item based on target detection
CN109146786A (en) * 2018-08-07 2019-01-04 北京市商汤科技开发有限公司 Scene chart generation method and device, electronic equipment and storage medium
CN109726696A (en) * 2019-01-03 2019-05-07 电子科技大学 System and method is generated based on the iamge description for weighing attention mechanism
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A kind of generation method of image, semantic description

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PREKSHA NEMA et al.: "Generating Descriptions from Structured Data Using a Bifocal Attention Mechanism and Gated Orthogonalization", arXiv *
JIN Huazhong et al.: "An image description generation model combining global and local features", Journal of Applied Sciences *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325323B (en) * 2020-02-19 2023-07-14 山东大学 Automatic power transmission and transformation scene description generation method integrating global information and local information
CN111325323A (en) * 2020-02-19 2020-06-23 山东大学 Power transmission and transformation scene description automatic generation method fusing global information and local information
CN111814946A (en) * 2020-03-17 2020-10-23 同济大学 Image description automatic generation method based on multi-body evolution
CN111814946B (en) * 2020-03-17 2022-11-15 同济大学 Multi-body evolution-based automatic image description generation method
CN111414962A (en) * 2020-03-19 2020-07-14 创新奇智(重庆)科技有限公司 Image classification method introducing object relationship
WO2021190257A1 (en) * 2020-03-27 2021-09-30 北京京东尚科信息技术有限公司 Image description generation method, apparatus and system, and medium and electronic device
CN111753825A (en) * 2020-03-27 2020-10-09 北京京东尚科信息技术有限公司 Image description generation method, device, system, medium and electronic equipment
CN111783852B (en) * 2020-06-16 2024-03-12 北京工业大学 Method for adaptively generating image description based on deep reinforcement learning
CN111783852A (en) * 2020-06-16 2020-10-16 北京工业大学 Self-adaptive image description generation method based on deep reinforcement learning
CN111612103A (en) * 2020-06-23 2020-09-01 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation
CN111612103B (en) * 2020-06-23 2023-07-11 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation
CN112069841A (en) * 2020-07-24 2020-12-11 华南理工大学 Novel X-ray contraband parcel tracking method and device
CN111916050A (en) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection
CN112256904A (en) * 2020-09-21 2021-01-22 天津大学 Image retrieval method based on visual description sentences
CN112200268A (en) * 2020-11-04 2021-01-08 福州大学 Image description method based on encoder-decoder framework
CN112528989A (en) * 2020-12-01 2021-03-19 重庆邮电大学 Description generation method for semantic fine granularity of image
CN112528989B (en) * 2020-12-01 2022-10-18 重庆邮电大学 Description generation method for semantic fine granularity of image
CN113378919B (en) * 2021-06-09 2022-06-14 重庆师范大学 Image description generation method for fusing visual sense and enhancing multilayer global features
CN113378919A (en) * 2021-06-09 2021-09-10 重庆师范大学 Image description generation method for fusing visual sense and enhancing multilayer global features
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113469143A (en) * 2021-08-16 2021-10-01 西南科技大学 Finger vein image identification method based on neural network learning
CN113837230A (en) * 2021-08-30 2021-12-24 厦门大学 Image description generation method based on adaptive attention mechanism
CN114693790A (en) * 2022-04-02 2022-07-01 江西财经大学 Automatic image description method and system based on mixed attention mechanism
CN114693790B (en) * 2022-04-02 2022-11-18 江西财经大学 Automatic image description method and system based on mixed attention mechanism
CN114882488A (en) * 2022-05-18 2022-08-09 北京理工大学 Multi-source remote sensing image information processing method based on deep learning and attention mechanism
CN114882488B (en) * 2022-05-18 2024-06-28 北京理工大学 Multisource remote sensing image information processing method based on deep learning and attention mechanism
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN116580283B (en) * 2023-07-13 2023-09-26 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110674850A (en) Image description generation method based on attention mechanism
CN110807154B (en) Recommendation method and system based on hybrid deep learning model
CN112784092B (en) Cross-modal image text retrieval method of hybrid fusion model
CN109299396B (en) Convolutional neural network collaborative filtering recommendation method and system fusing attention model
CN108363753B (en) Comment text emotion classification model training and emotion classification method, device and equipment
CN107273438B (en) Recommendation method, device, equipment and storage medium
CN109389151B (en) Knowledge graph processing method and device based on semi-supervised embedded representation model
CN110175628A (en) A kind of compression algorithm based on automatic search with the neural networks pruning of knowledge distillation
CN112733027B (en) Hybrid recommendation method based on local and global representation model joint learning
CN112800344B (en) Deep neural network-based movie recommendation method
CN111753044A (en) Regularization-based language model for removing social bias and application
CN112597302B (en) False comment detection method based on multi-dimensional comment representation
CN115269847A (en) Knowledge-enhanced syntactic heteromorphic graph-based aspect-level emotion classification method
CN112256866A (en) Text fine-grained emotion analysis method based on deep learning
CN112529071B (en) Text classification method, system, computer equipment and storage medium
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN113326384A (en) Construction method of interpretable recommendation model based on knowledge graph
CN112100439B (en) Recommendation method based on dependency embedding and neural attention network
CN114332519A (en) Image description generation method based on external triple and abstract relation
CN114372475A (en) Network public opinion emotion analysis method and system based on RoBERTA model
CN114036298B (en) Node classification method based on graph convolution neural network and word vector
CN110874392B (en) Text network information fusion embedding method based on depth bidirectional attention mechanism
CN116881689A (en) Knowledge-enhanced user multi-mode online comment quality evaluation method and system
CN117216381A (en) Event prediction method, event prediction device, computer device, storage medium, and program product
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination