CN112116685A - Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism - Google Patents
- Publication number
- CN112116685A
- Authority
- CN
- China
- Prior art keywords
- reward
- network
- image
- word
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/60—Editing figures and text; Combining figures or text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
- H04N5/278—Subtitling
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an image caption generating method using a multi-attention fusion network based on a multi-granularity reward mechanism. It addresses the problem that, in image captioning methods based on a reinforcement learning reward mechanism, each generated word has a different importance. The invention proposes, for the first time, a multi-attention fusion network based on a multi-granularity reward mechanism for generating image captions, comprising a multi-attention fusion model, a word importance re-evaluation network and a label retrieval network. The multi-attention fusion model serves as the baseline of the reinforcement-learning-based image captioning method; the word importance re-evaluation network re-evaluates the reward by estimating the importance of each word in the generated caption; the label retrieval network retrieves the corresponding ground-truth labels from a batch of captions as a retrieval reward. Better captions are then generated by training the network to maximize these rewards. Extensive experiments on the MSCOCO dataset yield highly competitive evaluation results.
Description
Technical Field
The invention relates to a method for automatically generating image captions, in the technical field of computer vision and natural language processing.
Background
The goal of image captioning is to automatically generate a natural language description of a given image. The task remains highly challenging: on the one hand, the computer must fully understand the image content from multi-level visual features; on the other hand, the caption generation algorithm must gradually refine rough semantic concepts into a human-like natural language description. In recent years, advances in deep learning (including attention mechanisms and reinforcement learning) have significantly improved the quality of caption generation, and the encoder-decoder framework has become the mainstream approach. Vinyals et al. generate captions from spatially pooled CNN feature maps, which compresses the entire image into a static representation. Attention mechanisms improve captioning performance by learning to focus adaptively on image regions, but in these models a single LSTM serves as both the visual information processor and the language generator, so the language generator is weakened by the simultaneous visual processing. Peter Anderson et al. propose a top-down architecture with two independent LSTM layers: the first LSTM layer acts as a top-down visual attention model and the second LSTM layer acts as a language generator. All of the image captioning methods mentioned above use the high-level visual features of the last CNN convolutional layer as the image encoding and ignore the low-level visual features, which are in fact also useful for understanding the image. Because multi-layer features are complementary, image captioning can be improved by multi-layer feature fusion. However, early fusion methods do not perform well, and how to fuse multi-level visual features into the image caption model remains an open problem.
In general, an image caption model is trained by minimizing the cross-entropy (XE) loss, which makes the model more sensitive to abnormal captions rather than optimizing toward the human consensus on proper captions for stable output. Furthermore, caption models are typically evaluated by computing different metrics on the test set, such as BLEU, ROUGE, METEOR and CIDEr. The mismatch between the training objective and the evaluation metrics adversely affects the image caption model. This problem can be addressed with reinforcement learning (RL) methods such as Policy Gradient and Actor-Critic, which can optimize non-differentiable, sequence-level evaluation metrics. Using the Policy Gradient method, the authors of SCST apply CIDEr as the reward to generate captions that better conform to human language consensus.
In SCST, every word receives the same reward as its gradient weight. However, not all words in a sentence should be rewarded equally, since different words may have different importance. Yu et al. (SeqGAN) use Monte Carlo rollouts to estimate the importance of each word, but this requires generating many sentences, which is expensive in time. Based on the Actor-Critic strategy, Dzmitry Bahdanau et al. employ a value network to evaluate words, but evaluation metrics (e.g., CIDEr, BLEU) cannot then be optimized directly. Here, word-level rewards are proposed to optimize the RL-trained image caption model, addressing the differing importance of each generated word.
Using an evaluation metric (e.g., CIDEr, BLEU) as the reward signal is an intuitive way to generate more human-like captions in RL training. However, these metrics are not the only criteria for judging the quality of a generated caption: a caption can also be evaluated by whether it retrieves its corresponding labels (ground-truth captions) in a retrieval system. From an information-utilization perspective, the traditional CIDEr reward makes full use of the matched label information, while a retrieval reward additionally benefits from the extra label information, so the retrieval loss can also serve as a reward.
Here, a Hierarchical Attention Fusion (HAF) model for image captioning is proposed, which integrates the multi-level feature maps of ResNet with hierarchical attention and serves as the baseline for RL-based image captioning. In addition, a multi-granularity reward is introduced at the RL stage to refine the proposed HAF. Specifically, a word importance re-evaluation network (REN) re-evaluates the reward by estimating the importance of each word in the generated caption: the re-evaluated reward is obtained by weighting the CIDEr score with weights computed by the REN, and can be regarded as a word-level reward. To benefit from the additional labels, a label retrieval network (RN) retrieves the corresponding labels from a batch of captions as a retrieval reward, which can be regarded as a sentence-level reward.
Disclosure of Invention
The invention aims to solve the problem that, in image caption generating methods based on a reinforcement learning reward mechanism, each generated word has a different importance: not all words in a sentence should be rewarded equally. Solving it yields sentences that are more consistent with human language consensus.
The technical scheme adopted by the invention to solve this technical problem is as follows:
S1. Construct a multi-attention fusion model.
S2. Construct a word importance re-evaluation network based on a reinforcement learning reward mechanism.
S3. Construct a label retrieval network in combination with a reinforcement learning reward mechanism.
S4. Combine the model of S1, the word importance re-evaluation network of S2 and the label retrieval network of S3 to construct a multi-attention fusion network architecture based on a multi-granularity reward mechanism.
S5. Train the multi-attention fusion network based on the multi-granularity reward mechanism and generate captions.
The multi-attention fusion model (HAF) is used as the baseline for RL training of image captioning. It attends to the hierarchical visual features of the CNN so that multi-level visual information is fully utilized. Rather than using only the last convolutional layer of the image and a single attention model that focuses on a specific image region at each time step, we fuse multiple attention models for captioning and feed the attention-derived image features to the cell of the language LSTM. We use a classical network structure that, at each time step $t$, generates a normalized attention weight $\alpha_t$ from the LSTM hidden state $h_t$. $\alpha_t$ is used to attend to the different spatial locations of the image features, giving the attended feature Att as the final representation of the image:
$\alpha_t = \mathrm{softmax}(a_t)$ (2)
where $h_2$ is the output of the second LSTM, which combines the image information of the convolutional layers with the content of the sequence generated so far. The process of generating $h_2$ is given by:
Finally, the probability of each output word is given by a nonlinear softmax function.
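As a minimal sketch of the attention step in equation (2), the following Python snippet normalizes a set of attention scores with softmax and uses the resulting weights to form a weighted sum of region features. The function names `softmax` and `attend`, the toy features, and the assumption that the scores $a_t$ have already been computed from the hidden state are illustrative and not taken from the patent.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(region_features, scores):
    """Return (alpha, attended), where alpha = softmax(scores) and
    attended is the alpha-weighted sum of the region feature vectors."""
    alpha = softmax(scores)
    dim = len(region_features[0])
    attended = [
        sum(a * feat[d] for a, feat in zip(alpha, region_features))
        for d in range(dim)
    ]
    return alpha, attended

# Toy example (hypothetical data): three image regions, 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform attention
alpha, attended = attend(regions, scores)
```

In the HAF, one such attention step would be applied to each feature level (for example conv4 and conv5) before the results are fused; how the fusion is done is not shown here.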
The word importance re-evaluation network (REN) is constructed based on a reinforcement learning reward mechanism and re-evaluates the metric-based reward by automatically estimating the importance of the different words in the generated caption. First, the REN takes the generated sentence S as input, and the sentence is processed by an RNN followed by an attention network and an average pooling layer. The attention-weighted sentence embedding and the average-pooled sentence embedding are concatenated to form a comprehensive representation of the generated caption. Two fully connected layers and a sigmoid transformation are then applied to obtain the weights $W_t$ of the different words. The caption model pre-trained with the CIDEr reward (rl-model) provides the baseline $b$, which significantly reduces the variance without changing the expected gradient. The word-level reward $Wr_t$ is constructed as in equation (10), so that only samples that outperform the current test-time model (rl-model) receive positive weight, while inferior samples are suppressed. Mathematically, the loss function can be formulated as equation (11):
$Wr_t = R\,W_t + R - b$ (10)
where $W_t$ is the output weight of the REN, $\theta$ denotes the parameters of the image caption network, and the remaining symbols denote the individual words of the generated sentence.
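The word-level reward of equation (10) can be illustrated with a short Python sketch. The function `word_level_rewards` follows equation (10) directly. The function `policy_gradient_loss` is an assumption: it uses the standard REINFORCE form (negative sum of reward times log-probability), which may differ from the patent's equation (11). All numeric values are hypothetical.

```python
def word_level_rewards(R, b, weights):
    """Equation (10): Wr_t = R * W_t + R - b for each word weight W_t."""
    return [R * w + R - b for w in weights]

def policy_gradient_loss(log_probs, rewards):
    """Assumed REINFORCE-style loss: -sum_t Wr_t * log p(w_t).
    This form is an illustration, not the patent's equation (11)."""
    return -sum(r * lp for r, lp in zip(rewards, log_probs))

# Hypothetical values: sentence reward R (e.g. CIDEr of the sampled caption),
# baseline b (e.g. CIDEr of the rl-model caption), and REN weights in [0, 1].
R = 1.2
b = 1.0
weights = [0.5, 0.0, 1.0]
log_probs = [-0.1, -0.2, -0.3]  # log-probabilities of the sampled words

rewards = word_level_rewards(R, b, weights)
loss = policy_gradient_loss(log_probs, rewards)
```

With these inputs, a word with weight 0 still receives the plain advantage R - b, and words the REN considers important receive an additional share of R on top of it.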
To take advantage of the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, we define the update of the REN as another RL process with reward $R - b$. We observe that $R - b$ is small, which yields a weak gradient for the REN, so a hyperparameter $\gamma$ is introduced to amplify the gradient. The REN can then be updated similarly by a reinforcement learning algorithm with the following loss function:
The label retrieval network (RN) is also constructed based on a reinforcement learning reward mechanism. It is introduced to complement the metric-based reward (CIDEr) by using both the matching labels and the other, non-matching labels, so that the generated captions match their corresponding labels. Following the cross-media retrieval method proposed by Fartash Faghri et al., we rebuild a sentence retrieval model with two LSTM networks. First, the RN is pre-trained to convergence on the different labels of the images, since each image has five different labels. We then encode the labels and the generated captions as features in the same embedding space of the RN:
$s_i = \mathrm{LSTM}(C_i)$ (13)
$g_j = \mathrm{LSTM}(G_j)$ (14)
where $C$ and $G$ denote the generated captions and the labels, and $s_i$ and $g_j$ denote their respective embedding features. The similarity between $s$ and $g$ is computed as the cosine similarity:
The score of a matching pair is required to be higher than the score of any non-matching pair, and the loss of the RN is computed as a hinge loss:
where the matched caption-label pair is the correct pair and any other pairing is incorrect. The hinge loss, together with CIDEr, acts as a sentence-level reward in RL training, which encourages the caption model to generate captions that best match the given labels.
Equation (17) is the loss function for optimizing the caption model with the sentence-level reward, where $\beta$ is a hyperparameter that balances the hinge loss and CIDEr. Note that retrieval is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
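The in-batch retrieval step can be sketched in Python as follows. The code assumes a sum-of-hinges loss over all non-matching pairs in both retrieval directions, as is common in cross-media retrieval; the patent's own hinge formula is not reproduced here, so this form, the function names, and the toy embeddings are assumptions. The margin default of 0.2 follows the value stated in the training settings. Embeddings are assumed to be non-zero vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hinge_retrieval_loss(caption_emb, label_emb, margin=0.2):
    """Assumed sum-of-hinges loss within one mini-batch.
    caption_emb[i] and label_emb[i] form the matching pair."""
    n = len(caption_emb)
    sim = [[cosine(caption_emb[i], label_emb[j]) for j in range(n)]
           for i in range(n)]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # caption i scored against a non-matching label j
            loss += max(0.0, margin - sim[i][i] + sim[i][j])
            # label i scored against a non-matching caption j
            loss += max(0.0, margin - sim[i][i] + sim[j][i])
    return loss

# Hypothetical 2-dimensional embeddings for a batch of two caption/label pairs.
captions = [[1.0, 0.0], [0.0, 1.0]]
labels_matched = [[1.0, 0.0], [0.0, 1.0]]
labels_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = hinge_retrieval_loss(captions, labels_matched)
loss_swapped = hinge_retrieval_loss(captions, labels_swapped)
```

The loss is zero when every caption is closest to its own label by at least the margin, and grows when captions resemble the labels of other images in the batch.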
The multi-attention fusion network based on the multi-granularity reward mechanism comprises the multi-attention fusion model (HAF), the word importance re-evaluation network (REN) and the label retrieval network (RN).
Finally, the multi-attention fusion network based on the multi-granularity reward mechanism is trained as follows:
All models are pre-trained with the cross-entropy loss and then trained to maximize the different RL rewards. The encoder uses a pre-trained ResNet-101 to obtain the image representation: for each image we extract the outputs of the conv4 and conv5 convolutional layers of ResNet, which are mapped to vectors of dimension 1024 as the input to the HAF. For the HAF, the image feature embedding dimension, the LSTM hidden state size and the word embedding dimension are all set to 512. The baseline model is trained with the Adam optimizer under the XE objective with an initial learning rate of $10^{-4}$. After each epoch, we evaluate the model and select the best CIDEr score as the baseline score. Reinforcement training starts from the 30th epoch to optimize the CIDEr metric, with a learning rate of $10^{-5}$.
In the word-level reward training phase, the image caption model is first trained with the CIDEr reward for 20 epochs and then with the word-level reward for 10 epochs. In sentence-level reward training, the RN is pre-trained for 10 epochs with the different labels of each image; the word embedding and LSTM hidden sizes are set to 512, the joint embedding size is set to 1024, and the margin hyperparameter $\alpha$ is set to 0.2. In addition, the caption model for the baseline (b) is trained with cross entropy for 30 epochs, and the number of epochs for sentence-level reward training is set to 30.
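One possible reading of the word-level training schedule is sketched below as a small Python lookup: 30 epochs of cross-entropy training, then 20 epochs of CIDEr reward, then 10 epochs of word-level reward. The phase names, the 0-based epoch indexing, the strict ordering of the phases, and the learning rate of the word-level phase (assumed here to stay at 1e-5) are assumptions; the text only states the epoch counts and the two learning rates.

```python
# (phase name, number of epochs, learning rate); names are hypothetical.
WORD_LEVEL_SCHEDULE = [
    ("xe", 30, 1e-4),          # cross-entropy pre-training
    ("cider", 20, 1e-5),       # RL training with the CIDEr reward
    ("word_level", 10, 1e-5),  # RL fine-tuning with the word-level reward (lr assumed)
]

def phase_for_epoch(epoch, schedule=WORD_LEVEL_SCHEDULE):
    """Return (phase_name, learning_rate) for a 0-based epoch index."""
    start = 0
    for name, n_epochs, lr in schedule:
        if epoch < start + n_epochs:
            return name, lr
        start += n_epochs
    raise ValueError("epoch is beyond the end of the schedule")

total_epochs = sum(n for _, n, _ in WORD_LEVEL_SCHEDULE)
```

A training loop could call `phase_for_epoch(epoch)` at the start of each epoch to choose the loss function and the optimizer learning rate.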
Compared with the prior art, the invention has the beneficial effects that:
1. The invention proposes a Hierarchical Attention Fusion (HAF) model as the baseline for RL training of image captioning. The HAF attends to the hierarchical visual features of the CNN multiple times and can make full use of multi-level visual information.
2. The invention proposes a word importance re-evaluation network (REN) for computing the re-evaluated reward, which automatically assigns different importance to the words generated in a sentence during the RL training phase.
3. The invention proposes a label retrieval network (RN) to obtain sentence-level retrieval rewards. The RN drives the generated captions to match their corresponding labels rather than other sentences.
Drawings
Fig. 1 is a schematic diagram of the structure of the multi-attention fusion network based on a multi-granularity reward mechanism.
FIG. 2 is a schematic diagram of a Hierarchical Attention Fusion (HAF) model.
Fig. 3 is a schematic diagram of a word importance re-evaluation network (REN) structure.
Fig. 4 is a schematic diagram of the label retrieval network (RN).
Fig. 5 compares captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism, captions generated by the top-down method, captions generated by the hierarchical attention fusion model alone, and the ground-truth captions.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Fig. 1 is a schematic diagram of the structure of the multi-attention fusion network based on a multi-granularity reward mechanism. As shown in Fig. 1, the two sides of the network produce the word-level and sentence-level rewards: the word-level reward is generated by adaptively re-evaluating the importance of words, and the sentence-level reward is formed from the retrieval loss calculated from the retrieval similarity S.
Fig. 2 is a schematic diagram of the Hierarchical Attention Fusion (HAF) model. Fig. 2 includes a symbol for the average features of conv4 and conv5; X is the one-hot encoding of the input word and E is the word embedding matrix of the vocabulary. We use a classical network structure that, at each time step $t$, generates a normalized attention weight $\alpha_t$ from the LSTM hidden state $h_t$. $\alpha_t$ is used to attend to the different spatial locations of the image features, giving the attended feature Att as the final representation of the image:
$\alpha_t = \mathrm{softmax}(a_t)$ (2)
where $h_2$ is the output of the second LSTM, which combines the image information of the convolutional layers with the content of the sequence generated so far. The process of generating $h_2$ is given by:
Finally, the probability of each output word is given by a nonlinear softmax function:
Fig. 3 is a schematic diagram of the structure of the word importance re-evaluation network (REN). As shown in Fig. 3, the REN embeds the generated sentence and outputs the reward weights W; in the figure, S denotes the sigmoid and rl-model is the caption model pre-trained with CIDEr. First, the REN takes the generated sentence S as input, and the sentence is processed by an RNN followed by an attention network and an average pooling layer. The attention-weighted sentence embedding and the average-pooled sentence embedding are concatenated to form a comprehensive representation of the generated caption. Two fully connected layers and a sigmoid transformation are then applied to obtain the weights $W_t$ of the different words. Mathematically, the loss function can be formulated as equation (11):
$Wr_t = R\,W_t + R - b$ (10)
where $W_t$ is the output weight of the REN, $\theta$ denotes the parameters of the image caption network, and the remaining symbols denote the individual words of the generated sentence.
To take advantage of the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, we define the update of the REN as another RL process with reward $R - b$. We observe that $R - b$ is small, which yields a weak gradient for the REN, so a hyperparameter $\gamma$ is introduced to amplify the gradient. The REN can then be updated similarly by a reinforcement learning algorithm with the following loss function:
Fig. 4 is a schematic diagram of the label retrieval network (RN). As shown in Fig. 4, text-to-text retrieval over the matching and non-matching labels is used to construct the sentence-level reward for RL training. We encode the labels and the generated captions as features in the same embedding space of the RN:
$s_i = \mathrm{LSTM}(C_i)$ (13)
$g_j = \mathrm{LSTM}(G_j)$ (14)
where $C$ and $G$ denote the generated captions and the labels, and $s_i$ and $g_j$ denote their respective embedding features. The similarity between $s$ and $g$ is computed as the cosine similarity:
The score of a matching pair is required to be higher than the score of any non-matching pair, and the loss of the RN is computed as a hinge loss:
where the matched caption-label pair is the correct pair and any other pairing is an incorrect pair. The hinge loss, together with CIDEr, acts as a sentence-level reward in RL training, which encourages the caption model to generate captions that best match the given labels.
Equation (17) is the loss function for optimizing the caption model with the sentence-level reward, where $\beta$ is a hyperparameter that balances the hinge loss and CIDEr. Note that retrieval is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
Fig. 5 compares captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism, captions generated by the top-down method, captions generated by the hierarchical attention fusion model alone, and the ground-truth captions. As shown in Fig. 5, the sentences generated by the multi-attention fusion network based on the multi-granularity reward mechanism are more accurate and more human-like than those of the other models in the figure.
The invention provides a word importance re-evaluation network and a label retrieval network based on a reinforcement learning reward mechanism, and an image caption generating method using a multi-attention fusion network based on a multi-granularity reward mechanism. The invention proposes a Hierarchical Attention Fusion (HAF) model as the baseline for RL training of image captioning; the HAF attends to the hierarchical visual features of the CNN multiple times and can make full use of multi-level visual information. The word importance re-evaluation network (REN) computes the re-evaluated reward, automatically assigning different importance to the words generated in a sentence during the RL training phase. The label retrieval network (RN) encourages the generated captions to match their corresponding labels rather than other sentences. After training, the generated image captions are accurate and fluent and reflect the content of the images well.
Finally, the above examples merely illustrate the present invention; any modification, improvement or replacement of the above examples made by those skilled in the art shall fall within the scope of the claims of the present invention.
Claims (6)
1. An image caption generating method using a multi-attention fusion network based on a multi-granularity reward mechanism, characterized by comprising the following steps:
S1. Construct a multi-attention fusion model.
S2. Construct a word importance re-evaluation network based on a reinforcement learning reward mechanism.
S3. Construct a label retrieval network in combination with a reinforcement learning reward mechanism.
S4. Combine the model of S1, the word importance re-evaluation network of S2 and the label retrieval network of S3 to construct a multi-attention fusion network architecture based on a multi-granularity reward mechanism.
S5. Train the multi-attention fusion network based on the multi-granularity reward mechanism and generate captions.
2. The image caption generating method using a multi-attention fusion network based on a multi-granularity reward mechanism according to claim 1, wherein the specific process of S1 is as follows:
A classical network structure is adopted that, at each time step $t$, generates a normalized attention weight $\alpha_t$ from the LSTM hidden state $h_t$. $\alpha_t$ is used to attend to the different spatial locations of the image features, giving the attended feature Att as the final representation of the image:
$\alpha_t = \mathrm{softmax}(a_t)$ (2)
where $h_2$ is the output of the second LSTM, which combines the image information of the convolutional layers with the content of the sequence generated so far. The process of generating $h_2$ is given by:
Finally, the probability of each output word is given by a nonlinear softmax function:
3. The image caption generating method using a multi-attention fusion network based on a multi-granularity reward mechanism according to claim 1, wherein the specific process of S2 is as follows:
The REN takes the generated sentence S as input, and the sentence is processed by an RNN followed by an attention network and an average pooling layer. The attention-weighted sentence embedding and the average-pooled sentence embedding are concatenated to form a comprehensive representation of the generated caption. Two fully connected layers and a sigmoid transformation are then applied to obtain the weights $W_t$ of the different words. Mathematically, the loss function can be formulated as equation (11):
$Wr_t = R\,W_t + R - b$ (10)
where $W_t$ is the output weight of the REN, $\theta$ denotes the parameters of the image caption network, and the remaining symbols denote the individual words of the generated sentence.
To take advantage of the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, we define the update of the REN as another RL process with reward $R - b$. We observe that $R - b$ is small, which yields a weak gradient for the REN, so a hyperparameter $\gamma$ is introduced to amplify the gradient. The REN can then be updated similarly by a reinforcement learning algorithm with the following loss function:
4. the method for generating image captions based on a multi-granular incentive mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S3 is as follows:
the RN is pre-trained to converge with different labels for the images because each image has five different labels. We encode the tags and generate subtitles for the features in the same embedding space of the RN:
si=LSTM(Ci) (13)
gj=LSTM(Gj) (14)
where C and G denote the generated captions and the labels, and $s_i$ and $g_j$ denote their respective embedded features. The similarity between s and g is computed as the cosine similarity:
The score of a matched pair should be higher than the score of any unmatched pair, and the loss of the RN is computed as a hinge loss:
where the symbols in the loss denote the correct (matched) pair and an incorrect (unmatched) pair, respectively. The hinge loss, together with CIDEr, acts as the sentence-level reward in RL training, which encourages the caption model to generate captions that best match the given labels.
Equation (17) is the loss function for optimizing the caption model with sentence-level rewards, where β is a hyper-parameter that balances the hinge loss and CIDEr. Note that retrieval is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
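The in-batch retrieval described above can be illustrated with a short NumPy sketch. The exact form of the hinge loss is not recoverable from this text, so the code assumes a common bidirectional sum-of-hinges formulation over a mini-batch, in which row i of `s` and row i of `g` form the matched pair and every other row in the batch serves as an unmatched one. The function names and the `eps` term are hypothetical; the margin default of 0.2 follows the value of α stated in the training details.

```python
import numpy as np

def cosine_similarity_matrix(s, g, eps=1e-8):
    """Pairwise cosine similarity between rows of s (B, d) and rows of g (B, d)."""
    s_n = s / (np.linalg.norm(s, axis=1, keepdims=True) + eps)
    g_n = g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)
    return s_n @ g_n.T

def in_batch_hinge_loss(s, g, margin=0.2):
    """Assumed bidirectional hinge loss over one mini-batch."""
    sim = cosine_similarity_matrix(s, g)
    pos = np.diag(sim)  # scores of the matched pairs
    # Caption i against every label j: unmatched labels must trail by `margin`.
    cost_caption = np.maximum(0.0, margin + sim - pos[:, None])
    # Label j against every caption i: unmatched captions must trail by `margin`.
    cost_label = np.maximum(0.0, margin + sim - pos[None, :])
    # Exclude the matched pairs themselves from the penalty.
    off_diagonal = 1.0 - np.eye(sim.shape[0])
    return float(((cost_caption + cost_label) * off_diagonal).sum())
```

Restricting the negatives to the current mini-batch keeps the cost at one B x B similarity matrix per step, which matches the motivation given in the text for avoiding retrieval over the whole dataset.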
5. The method for generating image captions based on a multi-granular incentive mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S4 is as follows:
The multi-attention fusion network based on the multi-granularity reward mechanism comprises a multi-attention fusion model (HAF), a re-evaluation network (REN) and a retrieval network (RN).
6. The method for generating image captions based on a multi-granular incentive mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S5 is as follows:
The training method of the multi-attention fusion network based on the multi-granularity reward mechanism comprises the following steps:
All models are pre-trained with cross-entropy loss and then trained to maximize different RL rewards. The encoder uses a pre-trained ResNet-101 to obtain image representations; for each image, we extract the outputs of the conv4 and conv5 convolutional layers of ResNet, which are mapped to 1024-dimensional vectors as the input to the HAF. For the HAF, the image feature embedding dimension, the LSTM hidden state size and the word embedding dimension are all set to 512. The baseline model is trained with the Adam optimizer on the cross-entropy (XE) objective with an initial learning rate of $10^{-4}$. At each epoch, we evaluate the model and select the best CIDEr as the baseline score. Reinforcement training starts from the 30th epoch to optimize the CIDEr metric, with a learning rate of $10^{-5}$.
In the word-level reward training phase, the image caption model is first trained with the CIDEr reward for 20 epochs and then with the word-level reward for 10 epochs. In sentence-level reward training, the RN is pre-trained for 10 epochs on the different ground-truth labels of each image, where the word embedding and LSTM hidden sizes are set to 512, the joint embedding size is set to 1024, and the margin hyper-parameter α is set to 0.2. In addition, the caption model for the baseline (b) is trained with cross-entropy for 30 epochs, and the number of epochs for sentence-level reward training is set to 30.
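The hyper-parameters and phase order stated above can be collected into a small configuration sketch. The field names, the 0-based epoch indexing, the assumption that the phases run back to back after the 30 cross-entropy epochs, and the reuse of the $10^{-5}$ learning rate for the word-level reward phase are all illustrative choices rather than details confirmed by the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    feat_dim: int = 1024             # projected conv4/conv5 feature size
    embed_dim: int = 512             # image embedding, LSTM hidden, word embedding
    xe_lr: float = 1e-4              # initial learning rate for cross-entropy
    rl_lr: float = 1e-5              # learning rate for reinforcement training
    xe_epochs: int = 30              # cross-entropy pre-training
    word_cider_epochs: int = 20      # CIDEr reward before the word-level reward
    word_reward_epochs: int = 10     # word-level reward fine-tuning
    rn_pretrain_epochs: int = 10     # retrieval network pre-training
    rn_joint_dim: int = 1024         # joint embedding size of the RN
    rn_margin: float = 0.2           # hinge margin alpha
    sentence_reward_epochs: int = 30

def word_level_phase(epoch, cfg=TrainConfig()):
    """Return (objective, learning_rate) for a 0-based epoch of the word-level run."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < cfg.xe_epochs:
        return ("xe", cfg.xe_lr)
    if epoch < cfg.xe_epochs + cfg.word_cider_epochs:
        return ("cider", cfg.rl_lr)
    if epoch < cfg.xe_epochs + cfg.word_cider_epochs + cfg.word_reward_epochs:
        return ("word_reward", cfg.rl_lr)
    raise ValueError("epoch is beyond the configured schedule")
```

Keeping the schedule in one frozen dataclass makes it easy to check that the epoch counts sum to the intended total (60 under these assumptions) before launching a run.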
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010974467.1A CN112116685A (en) | 2020-09-16 | 2020-09-16 | Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112116685A (en) | 2020-12-22 |
Family
ID=73803138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010974467.1A Pending CN112116685A (en) | 2020-09-16 | 2020-09-16 | Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112116685A (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170098153A1 (en) * | 2015-10-02 | 2017-04-06 | Baidu Usa Llc | Intelligent image captioning |
US20170127016A1 (en) * | 2015-10-29 | 2017-05-04 | Baidu Usa Llc | Systems and methods for video paragraph captioning using hierarchical recurrent neural networks |
CN107608943A (en) * | 2017-09-08 | 2018-01-19 | 中国石油大学(华东) | Merge visual attention and the image method for generating captions and system of semantic notice |
US20180143966A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial Attention Model for Image Captioning |
CN109711464A (en) * | 2018-12-25 | 2019-05-03 | 中山大学 | Image Description Methods based on the building of stratification Attributed Relational Graps |
CN110135567A (en) * | 2019-05-27 | 2019-08-16 | 中国石油大学(华东) | The image method for generating captions of confrontation network is generated based on more attentions |
CN110347860A (en) * | 2019-07-01 | 2019-10-18 | 南京航空航天大学 | Depth image based on convolutional neural networks describes method |
US10467274B1 (en) * | 2016-11-10 | 2019-11-05 | Snap Inc. | Deep reinforcement learning-based captioning with embedding reward |
CN110473267A (en) * | 2019-07-12 | 2019-11-19 | 北京邮电大学 | Social networks image based on attention feature extraction network describes generation method |
CN111046966A (en) * | 2019-12-18 | 2020-04-21 | 江南大学 | Image subtitle generating method based on measurement attention mechanism |
US10699129B1 (en) * | 2019-11-15 | 2020-06-30 | Fudan University | System and method for video captioning |
KR20200104663A (en) * | 2019-02-27 | 2020-09-04 | 한국전력공사 | System and method for automatic generation of image caption |
KR20200106115A (en) * | 2019-02-27 | 2020-09-11 | 한국전력공사 | Apparatus and method for automatically generating explainable image caption |
Non-Patent Citations (3)
Title |
---|
CHUNLEI WU ET AL: "Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards", 《IEEE ACCESS》, pages 57943 - 57951 * |
杜海骏等: "融合约束学习的图像字幕生成方法", 《中国图象图形学报》, pages 0333 - 0342 * |
袁韶祖等: "基于多粒度视频信息和注意力机制的视频 场景识别", 《计算机系统应用》, pages 252 - 256 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113408430A (en) * | 2021-06-22 | 2021-09-17 | 哈尔滨理工大学 | Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework |
CN113408430B (en) * | 2021-06-22 | 2022-09-09 | 哈尔滨理工大学 | Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework |
CN113918754A (en) * | 2021-11-01 | 2022-01-11 | 中国石油大学(华东) | Image subtitle generating method based on scene graph updating and feature splicing |
CN116501859A (en) * | 2023-06-26 | 2023-07-28 | 中国海洋大学 | Paragraph retrieval method, equipment and medium based on refrigerator field |
CN116501859B (en) * | 2023-06-26 | 2023-09-01 | 中国海洋大学 | Paragraph retrieval method, equipment and medium based on refrigerator field |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107133211B (en) | Composition scoring method based on attention mechanism | |
CN109992779B (en) | Emotion analysis method, device, equipment and storage medium based on CNN | |
CN112116685A (en) | Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism | |
CN109753571B (en) | Scene map low-dimensional space embedding method based on secondary theme space projection | |
CN112527966B (en) | Network text emotion analysis method based on Bi-GRU neural network and self-attention mechanism | |
CN116431793B (en) | Visual question-answering method, device and storage medium based on knowledge generation | |
CN113408430B (en) | Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework | |
CN112527993B (en) | Cross-media hierarchical deep video question-answer reasoning framework | |
CN115510814B (en) | Chapter-level complex problem generation method based on dual planning | |
CN114611492B (en) | Text smoothing method, system and computer equipment | |
CN114090815A (en) | Training method and training device for image description model | |
CN114898121A (en) | Concrete dam defect image description automatic generation method based on graph attention network | |
CN114969278A (en) | Knowledge enhancement graph neural network-based text question-answering model | |
CN116779091B (en) | Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report | |
CN111538838B (en) | Problem generating method based on article | |
Fu et al. | Contrastive transformer based domain adaptation for multi-source cross-domain sentiment classification | |
CN116992042A (en) | Construction method of scientific and technological innovation service knowledge graph system based on novel research and development institutions | |
CN114429143A (en) | Cross-language attribute level emotion classification method based on enhanced distillation | |
CN116168401A (en) | Training method of text image translation model based on multi-mode codebook | |
CN113627424B (en) | Collaborative gating circulation fusion LSTM image labeling method | |
CN111783852B (en) | Method for adaptively generating image description based on deep reinforcement learning | |
CN115564049B (en) | Knowledge graph embedding method for bidirectional coding | |
CN112015760A (en) | Automatic question-answering method and device based on candidate answer set reordering and storage medium | |
CN116484868A (en) | Cross-domain named entity recognition method and device based on diffusion model generation | |
Liao et al. | Question generation through transfer learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20201222 |