CN112116685A - Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism - Google Patents

Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism

Info

Publication number
CN112116685A
Authority
CN
China
Prior art keywords
reward
network
image
word
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010974467.1A
Other languages
Chinese (zh)
Inventor
王雷全
袁韶祖
段海龙
吴杰
路静
吴春雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN202010974467.1A
Publication of CN112116685A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47 End-user applications
    • H04N21/488 Data services, e.g. news ticker
    • H04N21/4884 Data services, e.g. news ticker for displaying subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an image caption generation method using a multi-attention fusion network based on a multi-granularity reward mechanism, which addresses the problem that, in image caption generation methods based on a reinforcement learning reward mechanism, each generated word has a different importance. The invention proposes, for the first time, a multi-attention fusion network based on a multi-granularity reward mechanism for generating image captions, comprising a multi-attention fusion model, a word importance re-evaluation network and a label retrieval network. The multi-attention fusion model serves as the baseline of the reinforcement-learning-based image captioning method; the word importance re-evaluation network re-evaluates the reward by estimating the different importance of each word in the generated caption; and the label retrieval network retrieves the corresponding ground-truth label from a batch of captions as a retrieval reward. Better captions are then generated by training the network to maximize the reward. Extensive experiments on the MSCOCO dataset yield highly competitive evaluation results.

Description

Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism
Technical Field
The invention relates to an automatic generation method for image captions, and belongs to the technical fields of computer vision and natural language processing.
Background
The goal of image captioning is to automatically generate a natural-language description of a given image. The task remains highly challenging: on the one hand, the computer must fully understand the image content from multi-level visual features; on the other hand, the caption generation algorithm must gradually refine rough semantic concepts into a description that resembles natural human language. In recent years, advances in related deep learning technologies (including attention mechanisms and reinforcement learning) have significantly improved the quality of generated captions, and the encoder-decoder framework has become the mainstream approach to image caption generation. Vinyals et al. generate captions from spatially pooled CNN feature maps, compressing the entire image into a static representation; caption quality is further improved by learning to adaptively attend to image regions with an attention mechanism, but a single LSTM then serves as both the visual information processor and the language generator, and these two roles weaken each other. Peter Anderson et al. propose a top-down architecture with two independent LSTM layers: the first LSTM layer acts as a top-down visual attention model and the second acts as a language generator. All of the image captioning methods mentioned above use only the high-level visual features of the last CNN convolutional layer as the image encoding and ignore the low-level visual features, which are in fact also useful for understanding the image. Because multi-level features are complementary, multi-level feature fusion can improve image captioning; however, early fusion methods perform poorly, and how to fuse multi-level visual features into the captioning model is a problem worth studying.
In general, the image caption model is trained by maximizing the likelihood under a cross-entropy (XE) objective, which makes the model sensitive to abnormal captions rather than optimizing it toward the human consensus on proper captions, so the output is less stable. Furthermore, caption models are typically evaluated by computing metrics such as BLEU, ROUGE, METEOR and CIDEr on a test set. The mismatch between the training objective and the evaluation metrics adversely affects the caption model; this problem can be addressed with reinforcement learning (RL) methods such as Policy Gradient and Actor-Critic, which can directly optimize the non-differentiable sequence-level evaluation metrics. Using a Policy Gradient method, the authors of SCST apply CIDEr as the reward to generate captions that better match the human language consensus.
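For reference, the SCST-style training objective described above can be sketched as follows. This is a minimal sketch assuming PyTorch; the cider_score callable stands in for an external CIDEr scorer and is not part of the original disclosure.

```python
import torch

def scst_loss(log_probs, sampled_caption, greedy_caption, references, cider_score):
    """Minimal SCST-style policy-gradient loss.

    log_probs: 1-D tensor of log p(w_t) for the words of the sampled caption.
    sampled_caption, greedy_caption: captions decoded by sampling and by greedy search.
    references: the ground-truth captions of the image.
    cider_score: callable (caption, references) -> float, a placeholder CIDEr scorer.
    """
    reward = cider_score(sampled_caption, references)    # R for the sampled caption
    baseline = cider_score(greedy_caption, references)   # b from the greedy (test-time) caption
    advantage = reward - baseline
    # In SCST, every word of the sampled caption receives the same advantage.
    return -advantage * log_probs.sum()
```

Because the greedy caption of the current model serves as the baseline b, only sampled captions that beat the test-time output receive a positive learning signal.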
In SCST, every word is given the same reward as its gradient weight. However, not all words in a sentence should receive an equal reward; different words may have different importance. In SeqGAN, Yu et al. use Monte Carlo rollouts to estimate the importance of each word, but this requires generating many sampled sentences and is therefore computationally expensive. Based on the Actor-Critic strategy, Dzmitry Bahdanau et al. employ a value network to evaluate words, but the evaluation metrics (e.g., CIDEr, BLEU) cannot be optimized directly. The present invention proposes to optimize the RL-trained image caption model with word-level rewards, aiming to address the different importance of each generated word.
Using an evaluation metric (e.g., CIDEr, BLEU) as the reward signal is an intuitive way to generate more human-like captions in RL training. However, these evaluation metrics are not the only criteria for judging the quality of a generated caption; quality can also be assessed by whether the caption can retrieve its corresponding label in a retrieval system. From the perspective of information utilization, the traditional CIDEr reward makes full use of the matched label information, while a retrieval reward additionally benefits from the extra (unmatched) label information, so the retrieval loss can also be used as a reward signal.
The present invention proposes a Hierarchical Attention Fusion (HAF) model for image captioning that integrates the multi-level feature maps of ResNet with hierarchical attention and serves as the baseline for the RL-based captioning method. In addition, a multi-granularity reward is introduced in the RL stage to improve the proposed HAF. Specifically, a word importance re-evaluation network (REN) re-evaluates the reward by estimating the different importance of each word in the generated caption: the re-evaluated reward is obtained by weighting the CIDEr score with weights computed by the REN, and can be regarded as a word-level reward. To benefit from the additional labels, a label retrieval network (RN) retrieves the corresponding labels from a batch of captions as a retrieval reward, which can be regarded as a sentence-level reward.
Disclosure of Invention
The invention aims to address the fact that, in image caption generation based on a reinforcement learning reward mechanism, each generated word has a different importance, so as to generate sentences that better match the human language consensus: not all words in a sentence should be rewarded equally, because different words may have different importance.
The technical scheme adopted by the invention for solving the technical problems is as follows:
s1, constructing a multi-attention fusion model.
S2, constructing a word importance re-evaluation network based on the reinforcement learning reward mechanism.
S3, constructing a label retrieval network combined with the reinforcement learning reward mechanism.
S4, combining the model of S1, the word importance re-evaluation network of S2 and the label retrieval network of S3 to construct the multi-attention fusion network architecture based on the multi-granularity reward mechanism.
S5, training the multi-attention fusion network based on the multi-granularity reward mechanism and generating captions.
The multi-attention fusion model (HAF) serves as the baseline for RL training of the image caption model. It attends to the hierarchical visual features of the CNN and makes full use of multi-level visual information: in addition to the last convolutional representation of the image and a single attention model that focuses on a specific image region at each time step, the attention models for captioning are fused and the attended image features are fed into the cell nodes of the language LSTM. A classical network structure is adopted: at each time step t, a normalized attention weight α_t is generated from the LSTM hidden state h_t, and α_t weights the different spatial image features Att to form the final representation Â of the image:
a_t = w_a^T tanh(W_a Att + U_a h_t)    (1)
α_t = softmax(a_t)    (2)
Â_t = Σ_i α_{t,i} Att_i    (3)
where W_a, U_a and w_a are learned parameters.
(Equation (4) appears only as an image in the original and is not reproduced here.)
Here h_t^2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The generation of h_t^2 can be given by:
x_t^1 = [h_{t-1}^2; Ā; E X_t]    (5)
h_t^1 = LSTM_1(x_t^1, h_{t-1}^1)    (6)
x_t^2 = [Â_t; h_t^1]    (7)
h_t^2 = LSTM_2(x_t^2, h_{t-1}^2)    (8)
where Ā is the average of the conv4 and conv5 features, X_t is the one-hot encoding of the input word and E is the word embedding matrix of the vocabulary.
Finally, the probability of the output word is given by the nonlinear softmax function:
p(w_t | w_{1:t-1}) = softmax(W_p h_t^2)    (9)
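A minimal PyTorch-style sketch of the attention of equations (1)-(3) and one step of the two-layer decoder of equations (5)-(9) is given below. The class names, the default dimensions, and the concatenation used to fuse the conv4 and conv5 attention results are illustrative assumptions, not the exact structure disclosed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention of Eqs. (1)-(3) over a set of spatial features Att."""
    def __init__(self, feat_dim, hidden_dim, att_dim):
        super().__init__()
        self.W_a = nn.Linear(feat_dim, att_dim)
        self.U_a = nn.Linear(hidden_dim, att_dim)
        self.w_a = nn.Linear(att_dim, 1)

    def forward(self, feats, h_t):
        # feats: (B, N, feat_dim), h_t: (B, hidden_dim)
        a_t = self.w_a(torch.tanh(self.W_a(feats) + self.U_a(h_t).unsqueeze(1))).squeeze(-1)  # Eq. (1)
        alpha_t = F.softmax(a_t, dim=-1)                                                       # Eq. (2)
        return torch.bmm(alpha_t.unsqueeze(1), feats).squeeze(1)                               # Eq. (3)

class HAFDecoderStep(nn.Module):
    """One decoding step of the two-layer decoder (Eqs. (5)-(9)) with separate attention
    over the conv4 and conv5 features; concatenation as the fusion is an assumption."""
    def __init__(self, feat_dim=1024, hidden_dim=512, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.att4 = SoftAttention(feat_dim, hidden_dim, hidden_dim)
        self.att5 = SoftAttention(feat_dim, hidden_dim, hidden_dim)
        self.lstm1 = nn.LSTMCell(hidden_dim + feat_dim + embed_dim, hidden_dim)  # attention LSTM
        self.lstm2 = nn.LSTMCell(2 * feat_dim + hidden_dim, hidden_dim)          # language LSTM
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, feats4, feats5, state1, state2):
        # word_emb: (B, embed_dim); feats4, feats5: (B, N, feat_dim)
        h1, c1 = state1
        h2, c2 = state2
        mean_feat = 0.5 * (feats4.mean(dim=1) + feats5.mean(dim=1))                    # mean feature
        h1, c1 = self.lstm1(torch.cat([h2, mean_feat, word_emb], dim=-1), (h1, c1))    # Eqs. (5)-(6)
        att = torch.cat([self.att4(feats4, h1), self.att5(feats5, h1)], dim=-1)        # hierarchical attention
        h2, c2 = self.lstm2(torch.cat([att, h1], dim=-1), (h2, c2))                    # Eqs. (7)-(8)
        return F.log_softmax(self.fc(h2), dim=-1), (h1, c1), (h2, c2)                  # Eq. (9)
```

The two attention modules share the same form but have separate parameters, so each convolutional level can be attended independently at every time step.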
the word importance re-evaluation network is constructed based on a reinforcement learning reward mechanism, and the reward based on indexes is re-evaluated by automatically evaluating the importance of different words in the generated caption. Firstly, REN takes the generated sentence S as input, then the sentence is processed by RNN with attention network and average pooling layer, the word embedding vector is formed by connecting the sentence embedding vector with attention and the sentence embedding vector after pooling as comprehensive representation for generating caption, then two full connection layers and sigmoid transformation are applied to obtain the weight W of different wordst. In particular, the caption model pre-trained by the CIDER reward mechanism (rl-model) serves as baseline (b), significantly reducing the variance without changing the expected gradient. We will award Wr the word-level prizetConstructed as 16, therefore, only samples from the model are given positive weight over the current test model (rl-model), while bad samples are suppressed. Mathematically, the loss function can be formulated as equation (11):
Wr_t = R·W_t + R − b    (10)
L(θ) = −Σ_t Wr_t log p_θ(w_t^s | w_{1:t−1}^s)    (11)
where W_t is the output weight of the REN, θ denotes the parameters of the image caption network, and w_t^s denotes the words of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R − b. Since R − b is small and leads to a weak gradient for the REN, a hyper-parameter γ is introduced to strengthen the gradient; the REN can then be updated by the reinforcement learning algorithm with the following loss function:
(Equation (12) appears only as an image in the original and is not reproduced here.)
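A minimal sketch of the REN and of the word-level reward of equation (10) plugged into the re-weighted policy-gradient loss of equation (11) is given below (PyTorch assumed). The GRU, the ReLU between the two fully connected layers, and the way the sentence representation is broadcast back to each word position are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class REN(nn.Module):
    """Sketch of the word importance re-evaluation network: an RNN over the generated
    sentence, an attended and a mean-pooled sentence embedding concatenated as a
    comprehensive representation, then two fully connected layers and a sigmoid
    producing a per-word weight W_t."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)   # GRU is an assumption
        self.att = nn.Linear(hidden_dim, 1)
        self.fc1 = nn.Linear(3 * hidden_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, word_ids):
        # word_ids: (B, T) token indices of the generated caption
        h, _ = self.rnn(self.embed(word_ids))                        # (B, T, H)
        alpha = torch.softmax(self.att(h), dim=1)                    # attention over time steps
        att_emb = (alpha * h).sum(dim=1)                             # attended sentence embedding
        mean_emb = h.mean(dim=1)                                     # average-pooled embedding
        sent = torch.cat([att_emb, mean_emb], dim=-1)                # comprehensive representation
        # Broadcast the sentence representation back to every word position (assumption).
        feats = torch.cat([h, sent.unsqueeze(1).expand(-1, h.size(1), -1)], dim=-1)
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(feats)))).squeeze(-1)  # (B, T) weights W_t

def word_level_reward_loss(log_probs, word_weights, cider_reward, baseline_reward):
    """Word-level reward Wr_t = R * W_t + R - b (Eq. (10)) inside the weighted
    policy-gradient loss of Eq. (11).

    log_probs:       (T,) log p_theta(w_t) of the sampled caption.
    word_weights:    (T,) weights W_t produced by the REN.
    cider_reward:    CIDEr score R of the sampled caption.
    baseline_reward: score b of the pre-trained rl-model baseline.
    """
    word_rewards = cider_reward * word_weights + (cider_reward - baseline_reward)  # Eq. (10)
    # The weights are detached here because the REN is updated by its own RL objective.
    return -(word_rewards.detach() * log_probs).sum()                              # Eq. (11)
```

The weights are detached in the caption loss because the REN is updated separately, with the γ-scaled reward R − b described above.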
the label Retrieval Network (RN) is also constructed based on a reinforcement learning reward mechanism, and is introduced to enhance the reward (CIDER) based on indexes and utilize labels and other unmatched labels, so that the generated subtitles can be matched with the corresponding labels. According to the method called cross-media retrieval proposed by farmashfaghri et al, we reconstruct a sentence retrieval model with two LSTM networks, first, RN is pre-trained to converge from different labels of the images, since each image has five different labels, we encode the labels and generate the subtitles for the features in the same embedding space of RN:
s_i = LSTM(C_i)    (13)
g_j = LSTM(G_j)    (14)
where C and G denote the generated captions and the labels, and s_i and g_j denote their respective embedded features. The similarity between s and g is measured by the cosine similarity:
S(s_i, g_j) = (s_i · g_j) / (‖s_i‖ ‖g_j‖)    (15)
the score of the designated matching word pair is higher than the score of any unmatched word pair, and the penalty of the RN is calculated by the hinge penalty:
Figure BDA0002685289120000045
wherein
Figure BDA0002685289120000046
Is the correct word pair, and
Figure BDA0002685289120000047
is incorrect. Hinge loss of CIDER acts as a sentence-level reward in RL training, which encourages the generation of captions for the caption model to best match a given tag.
L(θ) = −(CIDEr(C^s) − β L_r(C^s) − b) Σ_t log p_θ(w_t^s | w_{1:t−1}^s)    (17)
Equation (17) is the loss function for optimizing the caption model with the sentence-level reward, where β is a hyper-parameter balancing the hinge loss and CIDEr. Note that the retrieval process is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
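A minimal sketch of the retrieval similarity, the hinge loss, and a combined sentence-level reward is given below (PyTorch assumed). Placing the matching label at index 0 of the mini-batch and combining CIDEr and the hinge loss additively are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def retrieval_hinge_loss(caption_emb, label_embs, margin=0.2):
    """Hinge (triplet-style) retrieval loss of the RN.

    caption_emb: (D,)   embedding s_i of the generated caption.
    label_embs:  (K, D) embeddings g_j of the labels in the mini-batch;
                 index 0 is assumed to be the matching label.
    margin:      margin alpha (0.2 in the description above).
    """
    sims = F.cosine_similarity(caption_emb.unsqueeze(0), label_embs, dim=-1)  # cosine similarity, Eq. (15)
    pos, neg = sims[0], sims[1:]
    return torch.clamp(margin - pos + neg, min=0.0).sum()                     # hinge loss, Eq. (16)

def sentence_level_reward(cider_reward, hinge_loss, beta=1.0):
    """Sentence-level reward balancing CIDEr and the retrieval loss with beta;
    the additive form and the default beta are assumptions."""
    return cider_reward - beta * hinge_loss
```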
The multi-attention fusion network based on the multi-granularity reward mechanism comprises the multi-attention fusion model (HAF), the word importance re-evaluation network (REN) and the label retrieval network (RN).
Finally, the training method of the multi-attention fusion network based on the multi-granularity reward mechanism comprises the following steps:
all models were pre-trained with cross-entropy loss and then trained to maximize different RL rewards. The encoder uses the pre-trained Resnet-101 to obtain a representation of the images, and for each image we extract the outputs of the conv4 and conv5 convolutional layers from Resnet, which map to a vector of dimension 1024 as the input to the HAF. For the HAF, the image feature embedding dimension, LSTM hidden state and word embedding dimension are all set to 512. The baseline model was trained using an ADAM optimizer at XE goal with an initial learning rate of 10-4. At each iteration cycle, we evaluate the model and select the best CIDER as the baseline score. The reinforced training starts from the 30 th iteration cycle to optimize the CIDER measurement, and the learning rate is 10-5
In the word-level reward training phase, the image caption model is first trained with the CIDEr reward for 20 epochs and then with the word-level reward for 10 epochs. In sentence-level reward training, the RN is pre-trained for 10 epochs with the different labels of each image. The word embedding and LSTM hidden sizes are set to 512, the joint embedding size is set to 1024, and the margin hyper-parameter α is set to 0.2. In addition, the caption model used as the baseline b is trained with cross entropy for 30 epochs, and the number of epochs for sentence-level reward training is set to 30.
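For reference, the hyper-parameters and schedule described above can be collected into a single illustrative configuration; the key names are assumptions and do not come from the original disclosure.

```python
# Illustrative summary of the training schedule described above; key names are assumptions.
TRAIN_CONFIG = {
    "encoder": "resnet101",
    "feature_layers": ["conv4", "conv5"],
    "feature_dim": 1024,
    "embed_dim": 512,
    "hidden_dim": 512,
    "xe_pretrain": {"optimizer": "adam", "lr": 1e-4, "epochs": 30},
    "cider_rl": {"lr": 1e-5, "start_epoch": 30, "epochs": 20},
    "word_level_reward": {"epochs": 10},
    "retrieval_network": {
        "pretrain_epochs": 10,
        "word_embed_dim": 512,
        "lstm_hidden_dim": 512,
        "joint_embed_dim": 1024,
        "margin": 0.2,
    },
    "sentence_level_reward": {"epochs": 30, "baseline_xe_epochs": 30},
}
```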
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a hierarchical attention fusion (HAF) model as the baseline for RL training of image captions. The HAF attends to the hierarchical visual features of the CNN multiple times and can make full use of multi-level visual information.
2. The present invention proposes a word importance re-evaluation network (REN) to facilitate the re-evaluated reward calculation, which automatically assigns different importance to the words generated in a sentence during the RL training phase.
3. The present invention proposes a label retrieval network (RN) to obtain a sentence-level retrieval reward. The RN drives the generated captions to match their corresponding labels rather than other sentences.
Drawings
Fig. 1 is a schematic diagram of the structure of the multi-attention fusion network based on the multi-granularity reward mechanism.
FIG. 2 is a schematic diagram of a Hierarchical Attention Fusion (HAF) model.
Fig. 3 is a schematic diagram of a word importance re-evaluation network (REN) structure.
Fig. 4 is a schematic diagram of the label retrieval network (RN).
Fig. 5 is a comparison of captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism, captions generated by the top-down method, captions generated by the hierarchical attention fusion model alone, and the ground-truth captions.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
The invention is further illustrated below with reference to the figures and examples.
Fig. 1 is a schematic diagram of the structure of the multi-attention fusion network based on the multi-granularity reward mechanism. As shown in Fig. 1, the word-level reward and the sentence-level reward are produced on the two sides of the network: the word-level reward is generated by adaptively re-evaluating the importance of each word, and the sentence-level reward is formed from the retrieval loss computed from the retrieval similarity S.
FIG. 2 is a schematic diagram of the hierarchical attention fusion (HAF) model. As shown in Fig. 2, Ā denotes the average of the conv4 and conv5 features, X is the one-hot encoding of the input word, and E is the word embedding matrix of the vocabulary. A classical network structure is adopted: at each time step t, a normalized attention weight α_t is generated from the LSTM hidden state h_t, and α_t weights the different spatial image features Att to form the final representation Â of the image:
a_t = w_a^T tanh(W_a Att + U_a h_t)    (1)
α_t = softmax(a_t)    (2)
Â_t = Σ_i α_{t,i} Att_i    (3)
where W_a, U_a and w_a are learned parameters.
(Equation (4) appears only as an image in the original and is not reproduced here.)
Here h_t^2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The generation of h_t^2 can be given by:
x_t^1 = [h_{t-1}^2; Ā; E X_t]    (5)
h_t^1 = LSTM_1(x_t^1, h_{t-1}^1)    (6)
x_t^2 = [Â_t; h_t^1]    (7)
h_t^2 = LSTM_2(x_t^2, h_{t-1}^2)    (8)
Finally, the probability of the output word is given by the nonlinear softmax function:
p(w_t | w_{1:t-1}) = softmax(W_p h_t^2)    (9)
fig. 3 is a schematic diagram of a word importance re-evaluation network (REN) structure. As shown in FIG. 3, the word importance re-evaluation network embeds the generated sentences and provides the rewarding weight W, S is sigmoid, and rl-model is a caption model pre-trained by CIDER. Firstly, REN takes the generated sentence S as input, then the sentence is processed by RNN with attention network and average pooling layer, the word embedding vector is formed by connecting the sentence embedding vector with attention and the sentence embedding vector after pooling as comprehensive representation for generating caption, then two full connection layers and sigmoid transformation are applied to obtain the weight W of different wordst. Mathematically, the loss function can be formulated as 11:
Wr_t = R·W_t + R − b    (10)
L(θ) = −Σ_t Wr_t log p_θ(w_t^s | w_{1:t−1}^s)    (11)
where W_t is the output weight of the REN, θ denotes the parameters of the image caption network, and w_t^s denotes the words of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R − b. Since R − b is small and leads to a weak gradient for the REN, a hyper-parameter γ is introduced to strengthen the gradient. Similarly, the REN can be updated by the reinforcement learning algorithm with the following loss function:
(Equation (12) appears only as an image in the original and is not reproduced here.)
fig. 4 is a schematic diagram of a tag Retrieval Network (RN). As shown in fig. 4, with text-to-text retrieval, using tags and unmatched tags to construct the sentence-level reward for RL training, we encode the tags and generate subtitles for features in the same embedding space of RN:
s_i = LSTM(C_i)    (13)
g_j = LSTM(G_j)    (14)
where C and G denote the generated captions and the labels, and s_i and g_j denote their respective embedded features. The similarity between s and g is measured by the cosine similarity:
S(s_i, g_j) = (s_i · g_j) / (‖s_i‖ ‖g_j‖)    (15)
A matching caption-label pair is required to score higher than any unmatched pair, and the loss of the RN is computed with a hinge loss:
L_r = Σ_{j≠i} max(0, α − S(s_i, g_i) + S(s_i, g_j))    (16)
where (s_i, g_i) is a correct (matched) pair and (s_i, g_j) is an incorrect (unmatched) pair. Together with CIDEr, the hinge loss acts as the sentence-level reward in RL training, which encourages the caption model to generate captions that best match the given labels.
L(θ) = −(CIDEr(C^s) − β L_r(C^s) − b) Σ_t log p_θ(w_t^s | w_{1:t−1}^s)    (17)
Equation (17) is the loss function for optimizing the caption model with the sentence-level reward, where β is a hyper-parameter balancing the hinge loss and CIDEr. Note that the retrieval process is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
Fig. 5 compares captions generated by the multi-attention fusion network based on the multi-granularity reward mechanism, captions generated by the top-down method, captions generated by the hierarchical attention fusion model alone, and the ground-truth captions. As shown in Fig. 5, the sentences generated by the multi-attention fusion network based on the multi-granularity reward mechanism are more accurate and more human-like than those of the other models in the figure.
The invention provides a word importance re-evaluation network and a label retrieval network based on the reinforcement learning reward mechanism, and provides an image caption generation method using a multi-attention fusion network based on a multi-granularity reward mechanism. The hierarchical attention fusion (HAF) model serves as the baseline for RL training of image captions: the HAF attends to the hierarchical visual features of the CNN multiple times and can make full use of multi-level visual information. The word importance re-evaluation network (REN) facilitates the re-evaluated reward calculation, which automatically assigns different importance to the words generated in a sentence during the RL training phase. The label retrieval network (RN) encourages the generated captions to match their corresponding labels rather than other sentences. After training, the generated image captions are accurate and fluent and reflect the content of the images well.
Finally, the above-described examples merely illustrate the present invention; for those skilled in the art, any modification, improvement or replacement of the above examples shall fall within the protection scope of the claims of the present invention.

Claims (6)

1. The method for generating the image captions of the multi-attention fusion network based on the multi-granularity reward mechanism is characterized by comprising the following steps:
s1, constructing a multi-attention fusion model.
S2, constructing a word importance re-evaluation network based on the reinforcement learning reward mechanism.
S3, constructing a label retrieval network combined with the reinforcement learning reward mechanism.
S4, combining the model of S1, the word importance re-evaluation network of S2 and the label retrieval network of S3 to construct the multi-attention fusion network architecture based on the multi-granularity reward mechanism.
S5, training the multi-attention fusion network based on the multi-granularity reward mechanism and generating captions.
2. The method for generating image captions with a multi-attention fusion network based on a multi-granularity reward mechanism according to claim 1, wherein the specific process of S1 is as follows:
A classical network structure is adopted: at each time step t, a normalized attention weight α_t is generated from the LSTM hidden state h_t, and α_t weights the different spatial image features Att to form the final representation Â of the image:
a_t = w_a^T tanh(W_a Att + U_a h_t)    (1)
α_t = softmax(a_t)    (2)
Â_t = Σ_i α_{t,i} Att_i    (3)
where W_a, U_a and w_a are learned parameters.
(Equation (4) appears only as an image in the original and is not reproduced here.)
Here h_t^2 is the output of the second LSTM, which combines the image information from the convolutional layers with the content of the generated sequence. The generation of h_t^2 can be given by:
x_t^1 = [h_{t-1}^2; Ā; E X_t]    (5)
h_t^1 = LSTM_1(x_t^1, h_{t-1}^1)    (6)
x_t^2 = [Â_t; h_t^1]    (7)
h_t^2 = LSTM_2(x_t^2, h_{t-1}^2)    (8)
Finally, the probability of the output word is given by the nonlinear softmax function:
p(w_t | w_{1:t-1}) = softmax(W_p h_t^2)    (9)
3. the method for generating image captions based on a multi-granular incentive mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S2 is as follows:
The REN takes the generated sentence S as input; the sentence is processed by an RNN with an attention network and an average pooling layer, and the attended sentence embedding is concatenated with the pooled sentence embedding as a comprehensive representation of the generated caption. Two fully connected layers and a sigmoid transformation are then applied to obtain the weight W_t of each word. Mathematically, the loss function can be formulated as equation (11):
Wr_t = R·W_t + R − b    (10)
L(θ) = −Σ_t Wr_t log p_θ(w_t^s | w_{1:t−1}^s)    (11)
where W_t is the output weight of the REN, θ denotes the parameters of the image caption network, and w_t^s denotes the words of the generated sentence.
To exploit the metric-based reward (CIDEr) and constrain the sentence space, the word-level reward is used to fine-tune the caption network after CIDEr optimization. Furthermore, to optimize the REN at the same time, the update of the REN is defined as another RL process with reward R − b. Since R − b is small and leads to a weak gradient for the REN, a hyper-parameter γ is introduced to strengthen the gradient. Similarly, the REN can be updated by the reinforcement learning algorithm with the following loss function:
(Equation (12) appears only as an image in the original and is not reproduced here.)
4. the method for generating image captions based on a multi-granular incentive mechanism and a multi-attention fusion network according to claim 1, wherein the specific process of S3 is as follows:
the RN is pre-trained to converge with different labels for the images because each image has five different labels. We encode the tags and generate subtitles for the features in the same embedding space of the RN:
s_i = LSTM(C_i)    (13)
g_j = LSTM(G_j)    (14)
where C and G denote the generated captions and the labels, and s_i and g_j denote their respective embedded features. The similarity between s and g is measured by the cosine similarity:
S(s_i, g_j) = (s_i · g_j) / (‖s_i‖ ‖g_j‖)    (15)
A matching caption-label pair is required to score higher than any unmatched pair, and the loss of the RN is computed with a hinge loss:
L_r = Σ_{j≠i} max(0, α − S(s_i, g_i) + S(s_i, g_j))    (16)
where (s_i, g_i) is a correct (matched) pair and (s_i, g_j) is an incorrect (unmatched) pair. Together with CIDEr, the hinge loss acts as the sentence-level reward in RL training, which encourages the caption model to generate captions that best match the given labels.
L(θ) = −(CIDEr(C^s) − β L_r(C^s) − b) Σ_t log p_θ(w_t^s | w_{1:t−1}^s)    (17)
Equation (17) is the loss function for optimizing the caption model with the sentence-level reward, where β is a hyper-parameter balancing the hinge loss and CIDEr. Note that the retrieval process is performed within each mini-batch, since retrieval over the entire dataset is time-consuming.
5. The method for generating image captions with a multi-attention fusion network based on a multi-granularity reward mechanism according to claim 1, wherein the specific process of S4 is as follows:
the multi-attention fusion network based on the multi-granularity reward mechanism comprises a multi-attention fusion model (HAF), a reevaluation network (REN) and a Retrieval Network (RN).
6. The method for generating image captions with a multi-attention fusion network based on a multi-granularity reward mechanism according to claim 1, wherein the specific process of S5 is as follows:
the training and training method of the multi-attention fusion network based on the multi-granularity reward mechanism comprises the following steps:
all models were pre-trained with cross-entropy loss and then trained to maximize different RL rewards. The encoder uses the pre-trained Resnet-101 to obtain a representation of the images, and for each image we extract the outputs of the conv4 and conv5 convolutional layers from Resnet, which map to a vector of dimension 1024 as the input to the HAF. For the HAF, the image feature embedding dimension, LSTM hidden state and word embedding dimension are all set to 512. The baseline model was trained using an ADAM optimizer at XE goal with an initial learning rate of 10-4. At each iteration cycle, we evaluate the model and select the best CIDER as the baseline score. The reinforced training starts from the 30 th iteration cycle to optimize the CIDER measurement, and the learning rate is 10-5
In the word-level reward training phase, the image subtitle model trains the CIDER reward of 20 iteration cycles and the reward level reward of 10 iteration cycles in advance. In sentence-level reward training, RN trains 10 iteration cycles in advance with different real labels per img, where word embedding and LSTM hidden sizes are set to 512 and joint embedding size is set to 1024, and the hyper-parameter edge α is set to 0.2. In addition, the subtitle model for baseline (b) was trained using cross entropy for 30 epochs, with the iteration period for sentence-level reward training set to 30.
CN202010974467.1A 2020-09-16 2020-09-16 Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism Pending CN112116685A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010974467.1A CN112116685A (en) 2020-09-16 2020-09-16 Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010974467.1A CN112116685A (en) 2020-09-16 2020-09-16 Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism

Publications (1)

Publication Number Publication Date
CN112116685A true CN112116685A (en) 2020-12-22

Family

ID=73803138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010974467.1A Pending CN112116685A (en) 2020-09-16 2020-09-16 Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism

Country Status (1)

Country Link
CN (1) CN112116685A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113918754A (en) * 2021-11-01 2022-01-11 中国石油大学(华东) Image subtitle generating method based on scene graph updating and feature splicing
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098153A1 (en) * 2015-10-02 2017-04-06 Baidu Usa Llc Intelligent image captioning
US20170127016A1 (en) * 2015-10-29 2017-05-04 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Methods based on the building of stratification Attributed Relational Graps
CN110135567A (en) * 2019-05-27 2019-08-16 中国石油大学(华东) The image method for generating captions of confrontation network is generated based on more attentions
CN110347860A (en) * 2019-07-01 2019-10-18 南京航空航天大学 Depth image based on convolutional neural networks describes method
US10467274B1 (en) * 2016-11-10 2019-11-05 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
CN110473267A (en) * 2019-07-12 2019-11-19 北京邮电大学 Social networks image based on attention feature extraction network describes generation method
CN111046966A (en) * 2019-12-18 2020-04-21 江南大学 Image subtitle generating method based on measurement attention mechanism
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning
KR20200104663A (en) * 2019-02-27 2020-09-04 한국전력공사 System and method for automatic generation of image caption
KR20200106115A (en) * 2019-02-27 2020-09-11 한국전력공사 Apparatus and method for automatically generating explainable image caption

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098153A1 (en) * 2015-10-02 2017-04-06 Baidu Usa Llc Intelligent image captioning
US20170127016A1 (en) * 2015-10-29 2017-05-04 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
US10467274B1 (en) * 2016-11-10 2019-11-05 Snap Inc. Deep reinforcement learning-based captioning with embedding reward
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning
CN107608943A (en) * 2017-09-08 2018-01-19 中国石油大学(华东) Merge visual attention and the image method for generating captions and system of semantic notice
CN109711464A (en) * 2018-12-25 2019-05-03 中山大学 Image Description Methods based on the building of stratification Attributed Relational Graps
KR20200104663A (en) * 2019-02-27 2020-09-04 한국전력공사 System and method for automatic generation of image caption
KR20200106115A (en) * 2019-02-27 2020-09-11 한국전력공사 Apparatus and method for automatically generating explainable image caption
CN110135567A (en) * 2019-05-27 2019-08-16 中国石油大学(华东) The image method for generating captions of confrontation network is generated based on more attentions
CN110347860A (en) * 2019-07-01 2019-10-18 南京航空航天大学 Depth image based on convolutional neural networks describes method
CN110473267A (en) * 2019-07-12 2019-11-19 北京邮电大学 Social networks image based on attention feature extraction network describes generation method
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning
CN111046966A (en) * 2019-12-18 2020-04-21 江南大学 Image subtitle generating method based on measurement attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHUNLEI WU ET AL: "Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards", IEEE Access, pages 57943-57951 *
杜海骏 et al. (DU Haijun et al.): "Image caption generation method incorporating constraint learning" (融合约束学习的图像字幕生成方法), Journal of Image and Graphics (中国图象图形学报), pages 0333-0342 *
袁韶祖 et al. (YUAN Shaozu et al.): "Video scene recognition based on multi-granularity video information and attention mechanism" (基于多粒度视频信息和注意力机制的视频场景识别), Computer Systems & Applications (计算机系统应用), pages 252-256 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113408430B (en) * 2021-06-22 2022-09-09 哈尔滨理工大学 Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN113918754A (en) * 2021-11-01 2022-01-11 中国石油大学(华东) Image subtitle generating method based on scene graph updating and feature splicing
CN116501859A (en) * 2023-06-26 2023-07-28 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field
CN116501859B (en) * 2023-06-26 2023-09-01 中国海洋大学 Paragraph retrieval method, equipment and medium based on refrigerator field

Similar Documents

Publication Publication Date Title
CN107133211B (en) Composition scoring method based on attention mechanism
CN109992779B (en) Emotion analysis method, device, equipment and storage medium based on CNN
CN112116685A (en) Multi-attention fusion network image subtitle generating method based on multi-granularity reward mechanism
CN109753571B (en) Scene map low-dimensional space embedding method based on secondary theme space projection
CN112527966B (en) Network text emotion analysis method based on Bi-GRU neural network and self-attention mechanism
CN116431793B (en) Visual question-answering method, device and storage medium based on knowledge generation
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN115510814B (en) Chapter-level complex problem generation method based on dual planning
CN114611492B (en) Text smoothing method, system and computer equipment
CN114090815A (en) Training method and training device for image description model
CN114898121A (en) Concrete dam defect image description automatic generation method based on graph attention network
CN114969278A (en) Knowledge enhancement graph neural network-based text question-answering model
CN116779091B (en) Automatic generation method of multi-mode network interconnection and fusion chest image diagnosis report
CN111538838B (en) Problem generating method based on article
Fu et al. Contrastive transformer based domain adaptation for multi-source cross-domain sentiment classification
CN116992042A (en) Construction method of scientific and technological innovation service knowledge graph system based on novel research and development institutions
CN114429143A (en) Cross-language attribute level emotion classification method based on enhanced distillation
CN116168401A (en) Training method of text image translation model based on multi-mode codebook
CN113627424B (en) Collaborative gating circulation fusion LSTM image labeling method
CN111783852B (en) Method for adaptively generating image description based on deep reinforcement learning
CN115564049B (en) Knowledge graph embedding method for bidirectional coding
CN112015760A (en) Automatic question-answering method and device based on candidate answer set reordering and storage medium
CN116484868A (en) Cross-domain named entity recognition method and device based on diffusion model generation
Liao et al. Question generation through transfer learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201222