CN112116685A - Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism - Google Patents

Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism

Info

Publication number
CN112116685A
Authority
CN
China
Prior art keywords
reward
network
image
word
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010974467.1A
Other languages
Chinese (zh)
Inventor
王雷全
袁韶祖
段海龙
吴杰
路静
吴春雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202010974467.1A
Publication of CN112116685A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an image caption generation method using a multi-attention fusion network based on a multi-granularity reward mechanism, which addresses the problem that, in image caption generation methods based on a reinforcement learning reward mechanism, each generated word has a different importance. The invention proposes, for the first time, a multi-attention fusion network based on a multi-granularity reward mechanism for generating image captions, comprising a multi-attention fusion model, a word importance re-evaluation network and a label retrieval network. The multi-attention fusion model serves as the baseline of the reinforcement-learning-based image captioning method; the word importance re-evaluation network re-evaluates the reward by estimating the different importance of each word in the generated caption; and the label retrieval network retrieves the corresponding ground-truth label from a batch of captions as a retrieval reward. Better captions are then generated by training the network to maximize the reward. Extensive experiments on the MSCOCO dataset yield highly competitive evaluation results.

Description

Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism
Technical Field
The invention relates to an automatic generation method for image captions, and belongs to the technical fields of computer vision and natural language processing.
Background
The goal of image captioning is to automatically generate a natural-language description of a given image. The task remains highly challenging: on the one hand, the computer must fully understand the image content from multi-level visual features; on the other hand, the caption generation algorithm must gradually refine rough semantic concepts into a description that resembles natural human language. In recent years, advances in related deep learning technologies (including attention mechanisms and reinforcement learning) have significantly improved the quality of generated captions, and the encoder-decoder framework has become the mainstream approach to image caption generation. Vinyals et al. generate captions from spatially pooled CNN feature maps, compressing the entire image into a static representation; caption quality is further improved by learning to adaptively attend to image regions with an attention mechanism, but a single LSTM then serves as both the visual information processor and the language generator, and these two roles weaken each other. Peter Anderson et al. propose a top-down architecture with two independent LSTM layers: the first LSTM layer acts as a top-down visual attention model and the second acts as a language generator. All of the image captioning methods mentioned above use only the high-level visual features of the last CNN convolutional layer as the image encoding and ignore the low-level visual features, which are in fact also useful for understanding the image. Because multi-level features are complementary, multi-level feature fusion can improve image captioning; however, early fusion methods perform poorly, and how to fuse multi-level visual features into the captioning model is a problem worth studying.
In general, the image caption model is trained by maximizing the likelihood under a cross-entropy (XE) objective, which makes the model sensitive to abnormal captions rather than optimizing it toward the human consensus on proper captions, so the output is less stable. Furthermore, caption models are typically evaluated by computing metrics such as BLEU, ROUGE, METEOR and CIDEr on a test set. The mismatch between the training objective and the evaluation metrics adversely affects the caption model; this problem can be addressed with reinforcement learning (RL) methods such as Policy Gradient and Actor-Critic, which can directly optimize the non-differentiable sequence-level evaluation metrics. Using a Policy Gradient method, the authors of SCST apply CIDEr as the reward to generate captions that better match the human language consensus.
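For reference, the SCST-style training objective described above can be sketched as follows. This is a minimal sketch assuming PyTorch; the cider_score callable stands in for an external CIDEr scorer and is not part of the original disclosure.

```python
import torch

def scst_loss(log_probs, sampled_caption, greedy_caption, references, cider_score):
    """Minimal SCST-style policy-gradient loss.

    log_probs: 1-D tensor of log p(w_t) for the words of the sampled caption.
    sampled_caption, greedy_caption: captions decoded by sampling and by greedy search.
    references: the ground-truth captions of the image.
    cider_score: callable (caption, references) -> float, a placeholder CIDEr scorer.
    """
    reward = cider_score(sampled_caption, references)    # R for the sampled caption
    baseline = cider_score(greedy_caption, references)   # b from the greedy (test-time) caption
    advantage = reward - baseline
    # In SCST, every word of the sampled caption receives the same advantage.
    return -advantage * log_probs.sum()
```

Because the greedy caption of the current model serves as the baseline b, only sampled captions that beat the test-time output receive a positive learning signal.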
In SCST, every word is given the same reward as its gradient weight. However, not all words in a sentence should receive an equal reward; different words may have different importance. In SeqGAN, Yu et al. use Monte Carlo rollouts to estimate the importance of each word, but this requires generating many sampled sentences and is therefore computationally expensive. Based on the Actor-Critic strategy, Dzmitry Bahdanau et al. employ a value network to evaluate words, but the evaluation metrics (e.g., CIDEr, BLEU) cannot be optimized directly. The present invention proposes to optimize the RL-trained image caption model with word-level rewards, aiming to address the different importance of each generated word.
Using an evaluation metric (e.g., CIDEr, BLEU) as the reward signal is an intuitive way to generate more human-like captions in RL training. However, these evaluation metrics are not the only criteria for judging the quality of a generated caption; quality can also be assessed by whether the caption can retrieve its corresponding label in a retrieval system. From the perspective of information utilization, the traditional CIDEr reward makes full use of the matched label information, while a retrieval reward additionally benefits from the extra (unmatched) label information, so the retrieval loss can also be used as a reward signal.
The present invention proposes a Hierarchical Attention Fusion (HAF) model for image captioning that integrates the multi-level feature maps of ResNet with hierarchical attention and serves as the baseline for the RL-based captioning method. In addition, a multi-granularity reward is introduced in the RL stage to improve the proposed HAF. Specifically, a word importance re-evaluation network (REN) re-evaluates the reward by estimating the different importance of each word in the generated caption: the re-evaluated reward is obtained by weighting the CIDEr score with weights computed by the REN, and can be regarded as a word-level reward. To benefit from the additional labels, a label retrieval network (RN) retrieves the corresponding labels from a batch of captions as a retrieval reward, which can be regarded as a sentence-level reward.
Disclosure of Invention
The invention aims to address the fact that, in image caption generation based on a reinforcement learning reward mechanism, each generated word has a different importance, so as to generate sentences that better match the human language consensus: not all words in a sentence should be rewarded equally, because different words may have different importance.
The technical scheme adopted by the invention for solving the technical problems is as follows:
s1, constructing a multi-attention fusion model.
S2, constructing a word importance re-evaluation network based on the reinforcement learning reward mechanism.
S3, constructing a label retrieval network combined with the reinforcement learning reward mechanism.
S4, combining the model of S1, the word importance re-evaluation network of S2 and the label retrieval network of S3 to construct the multi-attention fusion network architecture based on the multi-granularity reward mechanism.
S5, training the multi-attention fusion network based on the multi-granularity reward mechanism and generating captions.
The multi-attention fusion model (HAF) serves as the baseline for RL training of the image caption model. It attends to the hierarchical visual features of the CNN and makes full use of multi-level visual information: in addition to the last convolutional representation of the image and a single attention model that focuses on a specific image region at each time step, the attention models for captioning are fused and the attended image features are fed into the cell nodes of the language LSTM. A classical network structure is adopted: at each time step t, a normalized attention weight α_t is generated from the LSTM hidden state h_t, and α_t weights the different spatial image features Att to form the final representation Â of the image:
a_t = w_a^T tanh(W_a Att + U_a h_t)    (1)
α_t = softmax(a_t)    (2)
Â_t = Σ_i α_{t,i} Att_i    (3)
where W_a, U_a and w_a are learned parameters.
(Equation (4) appears only as an image in the original and is not reproduced here.)
Here h_t^2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The generation of h_t^2 can be given by:
x_t^1 = [h_{t-1}^2; Ā; E X_t]    (5)
h_t^1 = LSTM_1(x_t^1, h_{t-1}^1)    (6)
x_t^2 = [Â_t; h_t^1]    (7)
h_t^2 = LSTM_2(x_t^2, h_{t-1}^2)    (8)
where Ā is the average of the conv4 and conv5 features, X_t is the one-hot encoding of the input word and E is the word embedding matrix of the vocabulary.
Finally, the probability of the output word is given by the nonlinear softmax function:
p(w_t | w_{1:t-1}) = softmax(W_p h_t^2)    (9)
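A minimal PyTorch-style sketch of the attention of equations (1)-(3) and one step of the two-layer decoder of equations (5)-(9) is given below. The class names, the default dimensions, and the concatenation used to fuse the conv4 and conv5 attention results are illustrative assumptions, not the exact structure disclosed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention of Eqs. (1)-(3) over a set of spatial features Att."""
    def __init__(self, feat_dim, hidden_dim, att_dim):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, att_dim)
        self.U_a = nn.Linear(hidden_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)

    def forward(self, feats, h_t):
        # feats: (B, N, feat_dim), h_t: (B, hidden_dim)
        a_t = self.w_a(torch.tanh(self.W_a(feats) + self.U_a(h_t).unsqueeze(1))).squeeze(-1)  # Eq. (1)
        alpha_t = F.softmax(a_t, dim=-1)                                                       # Eq. (2)
        return torch.bmm(alpha_t.unsqueeze(1), feats).squeeze(1)                               # Eq. (3)

class HAFDecoderStep(nn.Module):
    """One decoding step of the two-layer decoder (Eqs. (5)-(9)) with separate attention
    over the conv4 and conv5 features; concatenation as the fusion is an assumption."""
    def __init__(self, feat_dim=1024, hidden_dim=512, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.att4 = SoftAttention(feat_dim, hidden_dim, hidden_dim)
        self.att5 = SoftAttention(feat_dim, hidden_dim, hidden_dim)
        self.lstm1 = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)  # attention LSTM
        self.lstm2 = nn.LSTMCell(2 * feat_dim + hidden_dim, hidden_dim)          # language LSTM
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, feats4, feats5, state1, state2):
        # word_emb: (B, embed_dim); feats4, feats5: (B, N, feat_dim)
        h1, c1 = state1
        h2, c2 = state2
        mean_feat = 0.5 * (feats4.mean(dim=1) + feats5.mean(dim=1))                    # mean feature
        h1, c1 = self.lstm1(torch.cat([h2, mean_feat, word_emb], dim=-1), (h1, c1))    # Eqs. (5)-(6)
        att = torch.cat([self.att4(feats4, h1), self.att5(feats5, h1)], dim=-1)        # hierarchical attention
        h2, c2 = self.lstm2(torch.cat([att, h1], dim=-1), (h2, c2))                    # Eqs. (7)-(8)
        return F.log_softmax(self.fc(h2), dim=-1), (h1, c1), (h2, c2)                  # Eq. (9)
```

The two attention modules share the same form but have separate parameters, so each convolutional level can be attended independently at every time step.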
the word importance re-evaluation network is constructed based on a reinforcement learning reward mechanism, and the reward based on indexes is re-evaluated by automatically evaluating the importance of different words in the generated caption. Firstly, REN takes the generated sentence S as input, then the sentence is processed by RNN with attention network and average pooling layer, the word embedding vector is formed by connecting the sentence embedding vector with attention and the sentence embedding vector after pooling as comprehensive representation for generating caption, then two full connection layers and sigmoid transformation are applied to obtain the weight W of different wordst. In particular, the caption model pre-trained by the CIDER reward mechanism (rl-model) serves as baseline (b), significantly reducing the variance without changing the expected gradient. We will award Wr the word-level prizetConstructed as 16, therefore, only samples from the model are given positive weight over the current test model (rl-model), while bad samples are suppressed. Mathematically, the loss function can be formulated as equation (11):
Wr_t = R·W_t + R − b    (10)
L(θ) = −Σ_t Wr_t log p_θ(w_t^s | w_{1:t−1}^s)    (11)
where W_t is the output weight of the REN, θ denotes the parameters of the image caption network, and w_t^s denotes the words of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R − b. Since R − b is small and leads to a weak gradient for the REN, a hyper-parameter γ is introduced to strengthen the gradient; the REN can then be updated by the reinforcement learning algorithm with the following loss function:
(Equation (12) appears only as an image in the original and is not reproduced here.)
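A minimal sketch of the REN and of the word-level reward of equation (10) plugged into the re-weighted policy-gradient loss of equation (11) is given below (PyTorch assumed). The GRU, the ReLU between the two fully connected layers, and the way the sentence representation is broadcast back to each word position are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class REN(nn.Module):
    """Sketch of the word importance re-evaluation network: an RNN over the generated
    sentence, an attended and a mean-pooled sentence embedding concatenated as a
    comprehensive representation, then two fully connected layers and a sigmoid
    producing a per-word weight W_t."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # GRU is an assumption
        self.att = nn.Linear(hidden_dim, 1)
        self.fc1 = nn.Linear(3 * hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, word_ids):
        # word_ids: (B, T) token indices of the generated caption
        h, _ = self.rnn(self.embed(word_ids))                        # (B, T, H)
        alpha = torch.softmax(self.att(h), dim=1)                    # attention over time steps
        att_emb = (alpha * h).sum(dim=1)                             # attended sentence embedding
        mean_emb = h.mean(dim=1)                                     # average-pooled embedding
        sent = torch.cat([att_emb, mean_emb], dim=-1)                # comprehensive representation
        # Broadcast the sentence representation back to every word position (assumption).
        feats = torch.cat([h, sent.unsqueeze(1).expand(-1, h.size(1), -1)], dim=-1)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(feats)))).squeeze(-1)  # (B, T) weights W_t

def word_level_reward_loss(log_probs, word_weights, cider_reward, baseline_reward):
    """Word-level reward Wr_t = R * W_t + R - b (Eq. (10)) inside the weighted
    policy-gradient loss of Eq. (11).

    log_probs:       (T,) log p_theta(w_t) of the sampled caption.
    word_weights:    (T,) weights W_t produced by the REN.
    cider_reward:    CIDEr score R of the sampled caption.
    baseline_reward: score b of the pre-trained rl-model baseline.
    """
    word_rewards = cider_reward * word_weights + (cider_reward - baseline_reward)  # Eq. (10)
    # The weights are detached here because the REN is updated by its own RL objective.
    return -(word_rewards.detach() * log_probs).sum()                              # Eq. (11)
```

The weights are detached in the caption loss because the REN is updated separately, with the γ-scaled reward R − b described above.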
the label Retrieval Network (RN) is also constructed based on a reinforcement learning reward mechanism, and is introduced to enhance the reward (CIDER) based on indexes and utilize labels and other unmatched labels, so that the generated subtitles can be matched with the corresponding labels. According to the method called cross-media retrieval proposed by farmashfaghri et al, we reconstruct a sentence retrieval model with two LSTM networks, first, RN is pre-trained to converge from different labels of the images, since each image has five different labels, we encode the labels and generate the subtitles for the features in the same embedding space of RN:
s_i = LSTM(C_i)    (13)
g_j = LSTM(G_j)    (14)
where C and G denote the generated captions and the labels, and s_i and g_j denote their respective embedded features. The similarity between s and g is measured by the cosine similarity:
S(s_i, g_j) = (s_i · g_j) / (‖s_i‖ ‖g_j‖)    (15)
the score of the designated matching word pair is higher than the score of any unmatched word pair, and the penalty of the RN is calculated by the hinge penalty:
Figure BDA0002685289120000045
wherein
Figure BDA0002685289120000046
Is the correct word pair, and
Figure BDA0002685289120000047
is incorrect. Hinge loss of CIDER acts as a sentence-level reward in RL training, which encourages the generation of captions for the caption model to best match a given tag.
L(θ) = −(CIDEr(C^s) − β L_r(C^s) − b) Σ_t log p_θ(w_t^s | w_{1:t−1}^s)    (17)
Equation (17) is the loss function for optimizing the caption model with the sentence-level reward, where β is a hyper-parameter balancing the hinge loss and CIDEr. Note that the retrieval process is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
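A minimal sketch of the retrieval similarity, the hinge loss, and a combined sentence-level reward is given below (PyTorch assumed). Placing the matching label at index 0 of the mini-batch and combining CIDEr and the hinge loss additively are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def retrieval_hinge_loss(caption_emb, label_embs, margin=0.2):
    """Hinge (triplet-style) retrieval loss of the RN.

    caption_emb: (D,)   embedding s_i of the generated caption.
    label_embs:  (K, D) embeddings g_j of the labels in the mini-batch;
                 index 0 is assumed to be the matching label.
    margin:      margin alpha (0.2 in the description above).
    """
    sims = F.cosine_similarity(caption_emb.unsqueeze(0), label_embs, dim=-1)  # cosine similarity, Eq. (15)
    pos, neg = sims[0], sims[1:]
    return torch.clamp(margin - pos + neg, min=0.0).sum()                     # hinge loss, Eq. (16)

def sentence_level_reward(cider_reward, hinge_loss, beta=1.0):
    """Sentence-level reward balancing CIDEr and the retrieval loss with beta;
    the additive form and the default beta are assumptions."""
    return cider_reward - beta * hinge_loss
```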
The multi-attention fusion network based on the multi-granularity reward mechanism comprises the multi-attention fusion model (HAF), the word importance re-evaluation network (REN) and the label retrieval network (RN).
Finally, the training method of the multi-attention fusion network based on the multi-granularity reward mechanism comprises the following steps:
all models were pre-trained with cross-entropy loss and then trained to maximize different RL rewards. The encoder uses the pre-trained Resnet-101 to obtain a representation of the images, and for each image we extract the outputs of the conv4 and conv5 convolutional layers from Resnet, which map to a vector of dimension 1024 as the input to the HAF. For the HAF, the image feature embedding dimension, LSTM hidden state and word embedding dimension are all set to 512. The baseline model was trained using an ADAM optimizer at XE goal with an initial learning rate of 10-4. At each iteration cycle, we evaluate the model and select the best CIDER as the baseline score. The reinforced training starts from the 30 th iteration cycle to optimize the CIDER measurement, and the learning rate is 10-5
In the word-level reward training phase, the image caption model is first trained with the CIDEr reward for 20 epochs and then with the word-level reward for 10 epochs. In sentence-level reward training, the RN is pre-trained for 10 epochs with the different labels of each image. The word embedding and LSTM hidden sizes are set to 512, the joint embedding size is set to 1024, and the margin hyper-parameter α is set to 0.2. In addition, the caption model used as the baseline b is trained with cross entropy for 30 epochs, and the number of epochs for sentence-level reward training is set to 30.
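For reference, the hyper-parameters and schedule described above can be collected into a single illustrative configuration; the key names are assumptions and do not come from the original disclosure.

```python
# Illustrative summary of the training schedule described above; key names are assumptions.
TRAIN_CONFIG = {
    "encoder": "resnet101",
    "feature_layers": ["conv4", "conv5"],
    "feature_dim": 1024,
    "embed_dim": 512,
    "hidden_dim": 512,
    "xe_pretrain": {"optimizer": "adam", "lr": 1e-4, "epochs": 30},
    "cider_rl": {"lr": 1e-5, "start_epoch": 30, "epochs": 20},
    "word_level_reward": {"epochs": 10},
    "retrieval_network": {
        "pretrain_epochs": 10,
        "word_embed_dim": 512,
        "lstm_hidden_dim": 512,
        "joint_embed_dim": 1024,
        "margin": 0.2,
    },
    "sentence_level_reward": {"epochs": 30, "baseline_xe_epochs": 30},
}
```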
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a hierarchical attention fusion (HAF) model as the baseline for RL training of image captions. The HAF attends to the hierarchical visual features of the CNN multiple times and can make full use of multi-level visual information.
2. The present invention proposes a word importance re-evaluation network (REN) to facilitate the re-evaluated reward calculation, which automatically assigns different importance to the words generated in a sentence during the RL training phase.
3. The present invention proposes a label retrieval network (RN) to obtain a sentence-level retrieval reward. The RN drives the generated captions to match their corresponding labels rather than other sentences.
Drawings
Fig. 1 is a schematic diagram of the structure of the multi-attention fusion network based on the multi-granularity reward mechanism.
FIG. 2 is a schematic diagram of a Hierarchical Attention Fusion (HAF) model.
Fig. 3 is a schematic diagram of a word importance re-evaluation network (REN) structure.
Fig. 4 is a schematic diagram of the label retrieval network (RN).
Fig. 5 is a comparison of captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism, captions generated by the top-down method, captions generated by the hierarchical attention fusion model alone, and the ground-truth captions.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Fig. 1 is a schematic diagram of the structure of the multi-attention fusion network based on the multi-granularity reward mechanism. As shown in Fig. 1, the word-level reward and the sentence-level reward are produced on the two sides of the network: the word-level reward is generated by adaptively re-evaluating the importance of each word, and the sentence-level reward is formed from the retrieval loss computed from the retrieval similarity S.
FIG. 2 is a schematic diagram of the hierarchical attention fusion (HAF) model. As shown in Fig. 2, Ā denotes the average of the conv4 and conv5 features, X is the one-hot encoding of the input word, and E is the word embedding matrix of the vocabulary. A classical network structure is adopted: at each time step t, a normalized attention weight α_t is generated from the LSTM hidden state h_t, and α_t weights the different spatial image features Att to form the final representation Â of the image:
a_t = w_a^T tanh(W_a Att + U_a h_t)    (1)
α_t = softmax(a_t)    (2)
Â_t = Σ_i α_{t,i} Att_i    (3)
where W_a, U_a and w_a are learned parameters.
(Equation (4) appears only as an image in the original and is not reproduced here.)
Here h_t^2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The generation of h_t^2 can be given by:
x_t^1 = [h_{t-1}^2; Ā; E X_t]    (5)
h_t^1 = LSTM_1(x_t^1, h_{t-1}^1)    (6)
x_t^2 = [Â_t; h_t^1]    (7)
h_t^2 = LSTM_2(x_t^2, h_{t-1}^2)    (8)
Finally, the probability of the output word is given by the nonlinear softmax function:
p(w_t | w_{1:t-1}) = softmax(W_p h_t^2)    (9)
fig. 3 is a schematic diagram of a word importance re-evaluation network (REN) structure. As shown in FIG. 3, the word importance re-evaluation network embeds the generated sentences and provides the rewarding weight W, S is sigmoid, and rl-model is a caption model pre-trained by CIDER. Firstly, REN takes the generated sentence S as input, then the sentence is processed by RNN with attention network and average pooling layer, the word embedding vector is formed by connecting the sentence embedding vector with attention and the sentence embedding vector after pooling as comprehensive representation for generating caption, then two full connection layers and sigmoid transformation are applied to obtain the weight W of different wordst. Mathematically, the loss function can be formulated as 11:
Wr_t = R·W_t + R − b    (10)
L(θ) = −Σ_t Wr_t log p_θ(w_t^s | w_{1:t−1}^s)    (11)
where W_t is the output weight of the REN, θ denotes the parameters of the image caption network, and w_t^s denotes the words of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R − b. Since R − b is small and leads to a weak gradient for the REN, a hyper-parameter γ is introduced to strengthen the gradient. Similarly, the REN can be updated by the reinforcement learning algorithm with the following loss function:
(Equation (12) appears only as an image in the original and is not reproduced here.)
fig. 4 is a schematic diagram of a tag Retrieval Network (RN). As shown in fig. 4, with text-to-text retrieval, using tags and unmatched tags to construct the sentence-level reward for RL training, we encode the tags and generate subtitles for features in the same embedding space of RN:
s_i = LSTM(C_i)    (13)
g_j = LSTM(G_j)    (14)
where C and G denote the generated captions and the labels, and s_i and g_j denote their respective embedded features. The similarity between s and g is measured by the cosine similarity:
S(s_i, g_j) = (s_i · g_j) / (‖s_i‖ ‖g_j‖)    (15)
A matching caption-label pair is required to score higher than any unmatched pair, and the loss of the RN is computed with a hinge loss:
L_r = Σ_{j≠i} max(0, α − S(s_i, g_i) + S(s_i, g_j))    (16)
where (s_i, g_i) is a correct (matched) pair and (s_i, g_j) is an incorrect (unmatched) pair. Together with CIDEr, the hinge loss acts as the sentence-level reward in RL training, which encourages the caption model to generate captions that best match the given labels.
L(θ) = −(CIDEr(C^s) − β L_r(C^s) − b) Σ_t log p_θ(w_t^s | w_{1:t−1}^s)    (17)
Equation (17) is the loss function for optimizing the caption model with the sentence-level reward, where β is a hyper-parameter balancing the hinge loss and CIDEr. Note that the retrieval process is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
Fig. 5 compares captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism, captions generated by the top-down method, captions generated by the hierarchical attention fusion model alone, and the ground-truth captions. As shown in Fig. 5, the sentences generated by the multi-attention fusion network based on the multi-granularity reward mechanism are more accurate and more human-like than those of the other models in the figure.
The invention provides a word importance re-evaluation network and a label retrieval network based on the reinforcement learning reward mechanism, and provides an image caption generation method using a multi-attention fusion network based on a multi-granularity reward mechanism. The hierarchical attention fusion (HAF) model serves as the baseline for RL training of image captions: the HAF attends to the hierarchical visual features of the CNN multiple times and can make full use of multi-level visual information. The word importance re-evaluation network (REN) facilitates the re-evaluated reward calculation, which automatically assigns different importance to the words generated in a sentence during the RL training phase. The label retrieval network (RN) encourages the generated captions to match their corresponding labels rather than other sentences. After training, the generated image captions are accurate and fluent and reflect the content of the images well.
Finally, the above-described examples merely illustrate the present invention; for those skilled in the art, any modification, improvement or replacement of the above examples shall fall within the protection scope of the claims of the present invention.

Claims (6)

1. The method for generating the image captions of the multi-attention fusion network based on the multi-granularity reward mechanism is characterized by comprising the following steps:
s1, constructing a multi-attention fusion model.
S2, constructing a word importance re-evaluation network based on the reinforcement learning reward mechanism.
S3, constructing a label retrieval network combined with the reinforcement learning reward mechanism.
S4, combining the model of S1, the word importance re-evaluation network of S2 and the label retrieval network of S3 to construct the multi-attention fusion network architecture based on the multi-granularity reward mechanism.
S5, training the multi-attention fusion network based on the multi-granularity reward mechanism and generating captions.
2. The method for generating image captions with a multi-attention fusion network based on a multi-granularity reward mechanism according to claim 1, wherein the specific process of S1 is as follows:
A classical network structure is adopted: at each time step t, a normalized attention weight α_t is generated from the LSTM hidden state h_t, and α_t weights the different spatial image features Att to form the final representation Â of the image:
a_t = w_a^T tanh(W_a Att + U_a h_t)    (1)
α_t = softmax(a_t)    (2)
Â_t = Σ_i α_{t,i} Att_i    (3)
where W_a, U_a and w_a are learned parameters.
(Equation (4) appears only as an image in the original and is not reproduced here.)
Here h_t^2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The generation of h_t^2 can be given by:
x_t^1 = [h_{t-1}^2; Ā; E X_t]    (5)
h_t^1 = LSTM_1(x_t^1, h_{t-1}^1)    (6)
x_t^2 = [Â_t; h_t^1]    (7)
h_t^2 = LSTM_2(x_t^2, h_{t-1}^2)    (8)
Finally, the probability of the output word is given by the nonlinear softmax function:
p(w_t | w_{1:t-1}) = softmax(W_p h_t^2)    (9)
3. the method for generating image captions based on a multi-granular incentive mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S2 is as follows:
The REN takes the generated sentence S as input; the sentence is processed by an RNN with an attention network and an average pooling layer, and the attended sentence embedding is concatenated with the pooled sentence embedding as a comprehensive representation of the generated caption. Two fully connected layers and a sigmoid transformation are then applied to obtain the weight W_t of each word. Mathematically, the loss function can be formulated as equation (11):
Wr_t = R·W_t + R − b    (10)
L(θ) = −Σ_t Wr_t log p_θ(w_t^s | w_{1:t−1}^s)    (11)
where W_t is the output weight of the REN, θ denotes the parameters of the image caption network, and w_t^s denotes the words of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R − b. Since R − b is small and leads to a weak gradient for the REN, a hyper-parameter γ is introduced to strengthen the gradient. Similarly, the REN can be updated by the reinforcement learning algorithm with the following loss function:
(Equation (12) appears only as an image in the original and is not reproduced here.)
4. the method for generating image captions based on a multi-granular incentive mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S3 is as follows:
the RN is pre-trained to converge with different labels for the images because each image has five different labels. We encode the tags and generate subtitles for the features in the same embedding space of the RN:
s_i = LSTM(C_i)    (13)
g_j = LSTM(G_j)    (14)
where C and G denote the generated captions and the labels, and s_i and g_j denote their respective embedded features. The similarity between s and g is measured by the cosine similarity:
S(s_i, g_j) = (s_i · g_j) / (‖s_i‖ ‖g_j‖)    (15)
A matching caption-label pair is required to score higher than any unmatched pair, and the loss of the RN is computed with a hinge loss:
L_r = Σ_{j≠i} max(0, α − S(s_i, g_i) + S(s_i, g_j))    (16)
where (s_i, g_i) is a correct (matched) pair and (s_i, g_j) is an incorrect (unmatched) pair. Together with CIDEr, the hinge loss acts as the sentence-level reward in RL training, which encourages the caption model to generate captions that best match the given labels.
L(θ) = −(CIDEr(C^s) − β L_r(C^s) − b) Σ_t log p_θ(w_t^s | w_{1:t−1}^s)    (17)
Equation (17) is the loss function for optimizing the caption model with the sentence-level reward, where β is a hyper-parameter balancing the hinge loss and CIDEr. Note that the retrieval process is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
5. The method for generating image captions with a multi-attention fusion network based on a multi-granularity reward mechanism according to claim 1, wherein the specific process of S4 is as follows:
the multi-attention fusion network based on the multi-granularity reward mechanism comprises a multi-attention fusion model (HAF), a reevaluation network (REN) and a Retrieval Network (RN).
6. The method for generating image captions with a multi-attention fusion network based on a multi-granularity reward mechanism according to claim 1, wherein the specific process of S5 is as follows:
the training and training method of the multi-attention fusion network based on the multi-granularity reward mechanism comprises the following steps:
all models were pre-trained with cross-entropy loss and then trained to maximize different RL rewards. The encoder uses the pre-trained Resnet-101 to obtain a representation of the images, and for each image we extract the outputs of the conv4 and conv5 convolutional layers from Resnet, which map to a vector of dimension 1024 as the input to the HAF. For the HAF, the image feature embedding dimension, LSTM hidden state and word embedding dimension are all set to 512. The baseline model was trained using an ADAM optimizer at XE goal with an initial learning rate of 10-4. At each iteration cycle, we evaluate the model and select the best CIDER as the baseline score. The reinforced training starts from the 30 th iteration cycle to optimize the CIDER measurement, and the learning rate is 10-5
In the word-level reward training phase, the image subtitle model trains the CIDER reward of 20 iteration cycles and the reward level reward of 10 iteration cycles in advance. In sentence-level reward training, RN trains 10 iteration cycles in advance with different real labels per img, where word embedding and LSTM hidden sizes are set to 512 and joint embedding size is set to 1024, and the hyper-parameter edge α is set to 0.2. In addition, the subtitle model for baseline (b) was trained using cross entropy for 30 epochs, with the iteration period for sentence-level reward training set to 30.
CN202010974467.1A 2020-09-16 2020-09-16 Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism Pending CN112116685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010974467.1A CN112116685A (en) 2020-09-16 2020-09-16 Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010974467.1A CN112116685A (en) 2020-09-16 2020-09-16 Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism

Publications (1)

Publication Number Publication Date
CN112116685A true CN112116685A (en) 2020-12-22

Family

ID=73803138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010974467.1A Pending CN112116685A (en) 2020-09-16 2020-09-16 Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism

Country Status (1)

Country Link
CN (1) CN112116685A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113918754A (en) * 2021-11-01 2022-01-11 中国石油大学(华东) Image subtitle generating method based on scene graph updating and feature splicing
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098153A1 (en) * 2015-10-02 2017-04-06 Baidu Usa Llc Intelligent image captioning
US20170127016A1 (en) * 2015-10-29 2017-05-04 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Methods based on the building of stratification Attributed Relational Graps
CN110135567A (en) * 2019-05-27 2019-08-16 中国石油大学(华东) The image method for generating captions of confrontation network is generated based on more attentions
CN110347860A (en) * 2019-07-01 2019-10-18 南京航空航天大学 Depth image based on convolutional neural networks describes method
US10467274B1 (en) * 2016-11-10 2019-11-05 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
CN110473267A (en) * 2019-07-12 2019-11-19 北京邮电大学 Social networks image based on attention feature extraction network describes generation method
CN111046966A (en) * 2019-12-18 2020-04-21 江南大学 Image subtitle generating method based on measurement attention mechanism
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning
KR20200104663A (en) * 2019-02-27 2020-09-04 한국전력공사 System and method for automatic generation of image caption
KR20200106115A (en) * 2019-02-27 2020-09-11 한국전력공사 Apparatus and method for automatically generating explainable image caption

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098153A1 (en) * 2015-10-02 2017-04-06 Baidu Usa Llc Intelligent image captioning
US20170127016A1 (en) * 2015-10-29 2017-05-04 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
US10467274B1 (en) * 2016-11-10 2019-11-05 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Methods based on the building of stratification Attributed Relational Graps
KR20200104663A (en) * 2019-02-27 2020-09-04 한국전력공사 System and method for automatic generation of image caption
KR20200106115A (en) * 2019-02-27 2020-09-11 한국전력공사 Apparatus and method for automatically generating explainable image caption
CN110135567A (en) * 2019-05-27 2019-08-16 中国石油大学(华东) The image method for generating captions of confrontation network is generated based on more attentions
CN110347860A (en) * 2019-07-01 2019-10-18 南京航空航天大学 Depth image based on convolutional neural networks describes method
CN110473267A (en) * 2019-07-12 2019-11-19 北京邮电大学 Social networks image based on attention feature extraction network describes generation method
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning
CN111046966A (en) * 2019-12-18 2020-04-21 江南大学 Image subtitle generating method based on measurement attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHUNLEI WU ET AL: "Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards", IEEE Access, pages 57943-57951 *
杜海骏 et al. (DU Haijun et al.): "Image caption generation method incorporating constraint learning" (融合约束学习的图像字幕生成方法), Journal of Image and Graphics (中国图象图形学报), pages 0333-0342 *
袁韶祖 et al. (YUAN Shaozu et al.): "Video scene recognition based on multi-granularity video information and attention mechanism" (基于多粒度视频信息和注意力机制的视频场景识别), Computer Systems & Applications (计算机系统应用), pages 252-256 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113408430B (en) * 2021-06-22 2022-09-09 哈尔滨理工大学 Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN113918754A (en) * 2021-11-01 2022-01-11 中国石油大学(华东) Image subtitle generating method based on scene graph updating and feature splicing
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field
CN116501859B (en) * 2023-06-26 2023-09-01 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field

Similar Documents

Publication Publication Date Title
CN107133211B (en) Composition scoring method based on attention mechanism
CN109992779B (en) Emotion analysis method, device, equipment and storage medium based on CNN
CN112116685A (en) Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism
CN109753571B (en) Scene map low-dimensional space embedding method based on secondary theme space projection
CN112527966B (en) Network text emotion analysis method based on Bi-GRU neural network and self-attention mechanism
CN116431793B (en) Visual question-answering method, device and storage medium based on knowledge generation
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN115510814B (en) Chapter-level complex problem generation method based on dual planning
CN114611492B (en) Text smoothing method, system and computer equipment
CN114090815A (en) Training method and training device for image description model
CN114898121A (en) Concrete dam defect image description automatic generation method based on graph attention network
CN114969278A (en) Knowledge enhancement graph neural network-based text question-answering model
CN116779091B (en) Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report
CN111538838B (en) Problem generating method based on article
Fu et al. Contrastive transformer based domain adaptation for multi-source cross-domain sentiment classification
CN116992042A (en) Construction method of scientific and technological innovation service knowledge graph system based on novel research and development institutions
CN114429143A (en) Cross-language attribute level emotion classification method based on enhanced distillation
CN116168401A (en) Training method of text image translation model based on multi-mode codebook
CN113627424B (en) Collaborative gating circulation fusion LSTM image labeling method
CN111783852B (en) Method for adaptively generating image description based on deep reinforcement learning
CN115564049B (en) Knowledge graph embedding method for bidirectional coding
CN112015760A (en) Automatic question-answering method and device based on candidate answer set reordering and storage medium
CN116484868A (en) Cross-domain named entity recognition method and device based on diffusion model generation
Liao et al. Question generation through transfer learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201222