CN110674850A - Image description generation method based on attention mechanism - Google Patents

Image description generation method based on attention mechanism

Info

Publication number
CN110674850A
Authority
CN
China
Prior art keywords
image
model
features
region
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910828522.3A
Other languages
Chinese (zh)
Inventor
肖春霞
赵坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910828522.3A priority Critical patent/CN110674850A/en
Publication of CN110674850A publication Critical patent/CN110674850A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an image description generation method based on an attention mechanism. The invention has the following advantages: fusing the relationship features with the object features enriches the image information; the two-layer language model generates finer-grained image descriptions; and further optimizing the trained model with reinforcement learning alleviates the exposure bias problem.

Description

Image description generation method based on attention mechanism
Technical Field
The invention belongs to the fields of computer vision and natural language processing, relates to a method for generating language descriptions of images, and particularly relates to an image description generation method based on an attention mechanism.
Background
In many everyday situations it is necessary to convert image content into a text description, for example automatically generating a text summary of an image in social software when the network connection is poor, or helping visually impaired people understand image content. Existing image description methods are mainly based on deep learning: a convolutional neural network is used as the image processing model to extract image features, and these features are fed into a recurrent neural network, used as the language generation model, to produce the description. However, such models usually rely on global or object-level image features, which makes it difficult to focus on the salient target objects in the image, loses much important information, and makes it hard to fully exploit the important visual semantic relationship information in the image. Moreover, most existing models are a single-pass forward process: when generating the next word, the model can only use the words it has already generated, so an erroneous word produced during generation causes error accumulation. On the other hand, existing models are trained by maximizing the joint probability of the generated sequence, i.e., by minimizing a cross-entropy loss; back-propagation maximizes the joint probability of the reference words so that the model learns the probability distribution of words in a sentence. Yet the automatic evaluation metrics usually used to judge the quality of the sentences generated by an image description model differ from this objective and are non-differentiable, so they cannot be used directly as the loss function; this inconsistency between the loss function and the evaluation metrics prevents the model from being sufficiently optimized.
Disclosure of Invention
The invention aims to overcome the defects of the existing method, and provides an image description generation method based on an attention mechanism.
The technical problem of the present invention is mainly solved by the following technical solution. An image description generation method based on an attention mechanism includes the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
step 2, adopting the ResNet101 model as the initial CNN model, pre-training the parameters of ResNet101, using the pre-trained ResNet101 to extract the global features of the image, then using the pre-trained ResNet101 to replace the CNN in the Faster R-CNN algorithm to extract a number of object region features for each image, and then pairing the object regions into relationship regions to extract relationship features;
step 3, performing feature fusion on the relationship features and the object region features to obtain object region features that contain the relationships between objects;
step 4, inputting the object region features containing the relationships between objects obtained in the previous step into a two-layer LSTM language model to obtain the output, namely the natural language description generated for the image;
and step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and this similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning.
Furthermore, the vocabulary table is constructed in step 1 by counting the occurrence frequency of each word in the text description of the MS COCO dataset, and only selecting words with the occurrence frequency greater than five to be listed in the vocabulary table, wherein the MS COCO dataset vocabulary table contains 9487 words.
Further, the Faster R-CNN algorithm is used in step 2 to extract the object region features of the image, and the loss function for one image during training is defined as:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
where the parameter λ balances the two normalization terms N_cls and N_reg; N_cls is set to the mini-batch size and N_reg to the total number of anchors; i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region; p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample; t_i is the vector of the 4 coordinate parameters of the generated bounding box, and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor;
L_cls is the classification loss of the object:
L_cls(p_i, p_i^*) = -log[p_i^* p_i + (1 - p_i^*)(1 - p_i)]
L_reg is the bounding-box regression loss:
L_reg(t_i, t_i^*) = R(t_i - t_i^*)
where R is the smooth L1 loss function:
R(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
Further, the specific method for fusing the relationship features with the object region features in step 3 is as follows:
For an input image I, the previous step produces a set of object regions {v_1, ..., v_i, ..., v_K} and a set of relationship regions {s_1, ..., s_k, ...}. Each object is contained in several different relationship regions, and each relationship has a different importance to the object. An attention weight p_i(s_k), denoting the weight of relationship s_k for object v_i, is computed with a softmax over the relationship regions containing v_i (the formula appears as an image in the original document).
The features of the relationship regions connected to object v_i are aggregated according to these attention weights into an overall relationship feature,
s̄_i = Σ_k p_i(s_k) · s_k,
and the aggregated relationship feature is then passed into the target object feature as additional information (formula given as an image in the original document), where upper-case S_k denotes the set of relationship regions, lower-case s_k a single relationship region, s̄_i the aggregated relationship feature vector, and v̂_i the fine-tuned v_i.
Further, step 4 is specifically implemented as follows.
Step 4.1, input the global image feature v_0 into the first layer of the two-layer LSTM language model;
Step 4.2, at each time t, compute an attention weight for each feature-fused object region (the attention formula appears as an image in the original document), where W_v1 and W_h1 are parameters to be learned in the language model, the computed weight is the attention weight assigned to region i at time t, h_t^1 is the output of the first-layer LSTM at time t, and tanh is the tanh activation function;
Step 4.3, the attention weight assigned to each region represents that region's contribution to the currently generated word; the attended visual feature is obtained by multiplying each v_i by its attention weight and summing (formulas given as images in the original document);
Step 4.4, the input of the second-layer LSTM of the language model is formed by combining the output of the first-layer language model with the attention-weighted image features;
Step 4.5, add a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is consulted when generating text (non-visual) words:
g_t = σ(W_x x_t + W_h h_{t-1})
s_t = g_t ⊙ tanh(c_t)
where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-state output of the second-layer LSTM at time t-1;
Step 4.6, the second-layer LSTM also uses an attention mechanism (formula given as an image in the original document), where W_v2 and W_h2 are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to rely more on the visual information v_i or more on the language information s_t;
Step 4.7, compute the weights that distribute between visual information and language information when generating each word (formula given as an image in the original document), using the attention vector obtained in step 4.6, where W_s and W_h3 are parameters to be learned in the model and the weights cover both the image region features and the language information vector; the context is obtained by multiplying each v_i by its corresponding weight β and summing, and since one language information vector is added the length becomes K+1;
Step 4.8, feed the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t (formula given as an image in the original document);
Step 4.9, finally, train the model with a cross-entropy loss function.
Further, step 5 is specifically implemented as follows.
Step 5.1, first compute the CIDEr score CIDEr(c_i, S_i) of the model, where c_i is the candidate sentence and S_i is the set of reference sentences;
Step 5.2, for a matched image-sentence pair (I_n, S_n), train a convolutional neural network to extract the global feature vector φ(I) of the input image I, and train a recurrent neural network to extract the feature of the sentence S; the features of the two different modalities are then mapped into the same space through two linear mapping layers;
the cosine similarity s(I, S) between the mapped image and sentence features is then computed (formula given as an image in the original document);
to train this mapping-space model, the parameter θ_s is optimized to minimize e(θ_s), the average error of the loss function L_e(I, S) over the training set {(I_n, S_n)}, where L_e(I, S) is defined with a bidirectional ranking loss:
L_e(I, S) = Σ_{S'} max(0, β - s(I, S) + s(I, S')) + Σ_{I'} max(0, β - s(I, S) + s(I', S))
where β is a margin, (I, S) is a matched reference image-sentence pair, I' and S' are images and sentences randomly sampled from the training set, and s denotes the cosine similarity;
Step 5.3, define the reward using the CIDEr score of the sentence Ŝ predicted by the model together with the cosine similarity between the input image I and Ŝ (the reward formula appears as an image in the original document);
Step 5.4, update the network parameters with the policy gradient of reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_RL(θ) with respect to the parameters θ is
∇_θ L_RL(θ) = -E_{Ŝ∼p_θ}[ R(Ŝ) ∇_θ log p_θ(Ŝ) ]
where log p_θ(Ŝ) is the score function and R(·) is the reward function;
to reduce the variance of the gradient estimate, a baseline function b is introduced;
Step 5.5, let b = R(S^*, I); the gradient is then computed as
∇_θ L_RL(θ) ≈ -(R(Ŝ, I) - R(S^*, I)) ∇_θ log p_θ(Ŝ)
where S^* is the description sentence corresponding to image I, i.e., the ground-truth of S.
Compared with the prior art, the invention has the following advantages:
1. in the invention, fusing the relationship features with the object features enriches the image information;
2. the two-layer language model generates finer-grained image descriptions;
3. the invention further optimizes the trained model with reinforcement learning, which alleviates the exposure bias problem.
Drawings
Fig. 1 is a general flow chart of the present invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
As shown in fig. 1, an attention-based image description generation method includes the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
the vocabulary table obtained in the step 1 is obtained by counting the occurrence frequency of each word in the text description of the MS COCO data set, and only selecting the words with the occurrence frequency more than five times to be listed in the vocabulary table, wherein the MS COCO data set vocabulary table comprises 9,487 words.
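As a concrete illustration of this vocabulary-building rule, the following minimal Python sketch counts word frequencies over the caption corpus and keeps only words occurring more than five times; the whitespace tokenizer and the special tokens are assumptions not specified in the patent.

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Keep only words that occur more than `min_count` times in the caption corpus."""
    counter = Counter()
    for caption in captions:
        counter.update(caption.lower().split())       # naive whitespace tokenizer (assumption)
    vocab = ["<pad>", "<start>", "<end>", "<unk>"]    # special tokens (assumption)
    vocab += sorted(w for w, c in counter.items() if c > min_count)
    return {word: idx for idx, word in enumerate(vocab)}

# Toy usage; on the MS COCO captions this thresholding yields roughly 9487 words.
word2idx = build_vocab(["a dog runs on grass", "a dog plays with a ball"], min_count=0)
```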
Step 2, adopting the ResNet101 model as the initial CNN model, pre-training the parameters of ResNet101 on the ImageNet dataset, using the pre-trained ResNet101 to extract the global features of the image, then using the pre-trained ResNet101 to replace the CNN in the Faster R-CNN algorithm to extract 36 object region features for each image, and then pairing the object regions into relationship regions to extract relationship features;
In step 2, the global features of the image are extracted with the pre-trained ResNet101, the object region features of the image are extracted with the Faster R-CNN algorithm, and the loss function for one image during training is defined as:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
where the parameter λ balances the two normalization terms N_cls and N_reg. N_cls is set to the mini-batch size, N_reg to the total number of anchors, and λ to 10. i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region. p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample. t_i is the vector of the 4 coordinate parameters of the generated bounding box (upper-left, upper-right, lower-left and lower-right, respectively), and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor. L_cls is the classification loss of the object:
L_cls(p_i, p_i^*) = -log[p_i^* p_i + (1 - p_i^*)(1 - p_i)]
L_reg is the bounding-box regression loss:
L_reg(t_i, t_i^*) = R(t_i - t_i^*)
where R is the smooth L1 loss function:
R(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
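For clarity, here is a hedged PyTorch-style sketch of this loss under the definitions above; the tensor shapes, the clamp used for numerical stability, and the default values of N_cls, N_reg and λ are assumptions.

```python
import torch

def smooth_l1(x):
    """R(x) = 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = x.abs()
    return torch.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)

def detection_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
    """p, p_star: (A,) predicted / ground-truth objectness; t, t_star: (A, 4) box parameters."""
    # L_cls(p_i, p_i*) = -log[p_i* p_i + (1 - p_i*)(1 - p_i)]
    l_cls = -(p_star * p + (1 - p_star) * (1 - p)).clamp_min(1e-8).log()
    # L_reg only counts for positive anchors (p_i* = 1)
    l_reg = p_star.unsqueeze(-1) * smooth_l1(t - t_star)
    return l_cls.sum() / n_cls + lam * l_reg.sum() / n_reg
```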
Step 3, performing feature fusion on the relationship features and the object region features to obtain object region features that contain the relationships between objects. The specific method is as follows:
For an input image I, the previous step produces a set of object regions {v_1, ..., v_i, ..., v_K} and a set of relationship regions {s_1, ..., s_k, ...}. Each object is contained in several different relationship regions, and each relationship has a different importance to the object. An attention weight p_i(s_k), denoting the weight of relationship s_k for object v_i, is computed with a softmax over the relationship regions containing v_i (the formula appears as an image in the original document).
The features of the relationship regions connected to object v_i are aggregated according to these attention weights into an overall relationship feature,
s̄_i = Σ_k p_i(s_k) · s_k,
and the aggregated relationship feature is then passed into the target object feature as additional information (formula given as an image in the original document), where upper-case S_k denotes the set of relationship regions, lower-case s_k a single relationship region, s̄_i the aggregated relationship feature vector, and v̂_i the fine-tuned v_i.
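The following PyTorch-style sketch shows one plausible realisation of this fusion step; the bilinear attention score, the residual addition into v_i, and the mapping `rel_to_obj` from objects to their relationship regions are assumptions, since the patent gives the exact formulas only as images.

```python
import torch
import torch.nn.functional as F

def fuse_relation_features(obj_feats, rel_feats, rel_to_obj, W):
    """obj_feats: (K, d) object features; rel_feats: (M, d) relationship features;
    rel_to_obj[i]: indices of relationship regions containing object i; W: (d, d) learned matrix."""
    fused = []
    for i, v_i in enumerate(obj_feats):
        s = rel_feats[rel_to_obj[i]]               # (m_i, d) relations that contain object i
        p = F.softmax(s @ (W @ v_i), dim=0)        # attention weights p_i(s_k) (assumed bilinear score)
        s_bar = (p.unsqueeze(-1) * s).sum(dim=0)   # aggregated relationship feature s_bar_i
        fused.append(v_i + s_bar)                  # fine-tuned object feature (assumed residual add)
    return torch.stack(fused)
```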
Step 4, inputting the object region features containing the relationships between objects obtained in the previous step into a two-layer LSTM language model to obtain the output, namely the natural language description generated for the image. This step specifically comprises the following sub-steps:
Step 4.1, the input of the first-layer LSTM is the concatenation of the global image feature v_0, the output of the second-layer LSTM at time t-1, and the encoding of the word generated at time t, from which the initial description is generated. The features of the generated word sequence at the current time are concatenated with the output of the first-layer LSTM at the current time as the input to the next-layer language model.
Step 4.2, at each time t, compute an attention weight for each feature-fused object region (the attention formula appears as an image in the original document), where W_v1 and W_h1 are parameters to be learned in the language model, the computed weight is the attention weight assigned to region i at time t, h_t^1 is the output of the first-layer LSTM at time t, and tanh is the tanh activation function.
Step 4.3, the attention weight assigned to each region represents that region's contribution to the currently generated word; the attended visual feature is obtained by multiplying each v_i by its attention weight and summing (formulas given as images in the original document).
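A hedged PyTorch sketch of the soft attention described in steps 4.2-4.3; the projection dimensions and the extra scoring vector w_a are assumptions, since the exact formula appears only as an image in the original.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Soft attention over fused region features conditioned on the first-layer LSTM output."""
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, att_dim, bias=False)   # plays the role of W_v1
        self.w_h = nn.Linear(hid_dim, att_dim, bias=False)    # plays the role of W_h1
        self.w_a = nn.Linear(att_dim, 1, bias=False)          # scoring vector (assumption)

    def forward(self, v, h1):                  # v: (K, feat_dim), h1: (hid_dim,)
        a = self.w_a(torch.tanh(self.w_v(v) + self.w_h(h1))).squeeze(-1)  # (K,) scores
        alpha = torch.softmax(a, dim=0)                                   # attention weights
        return (alpha.unsqueeze(-1) * v).sum(dim=0), alpha                # attended feature, weights
```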
Step 4.4, the input of the second-layer LSTM of the language model is formed by combining the output of the first-layer language model with the attention-weighted image features.
Step 4.5, add a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is consulted when generating text (non-visual) words:
g_t = σ(W_x x_t + W_h h_{t-1})
s_t = g_t ⊙ tanh(c_t)
where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-state output of the second-layer LSTM at time t-1.
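A minimal sketch of the sentinel gate in step 4.5, assuming the adaptive-attention formulation g_t = σ(W_x x_t + W_h h_{t-1}) and s_t = g_t ⊙ tanh(c_t); the matrix shapes are assumptions.

```python
import torch

def visual_sentinel(x_t, h_prev, c_t, W_x, W_h):
    """x_t: input of the second-layer LSTM; h_prev: its hidden state at t-1; c_t: its cell state at t."""
    g_t = torch.sigmoid(W_x @ x_t + W_h @ h_prev)   # sentinel gate g_t
    s_t = g_t * torch.tanh(c_t)                     # language information vector s_t (element-wise product)
    return s_t
```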
Step 4.6, the second-layer LSTM also uses an attention mechanism (formula given as an image in the original document), where W_v2 and W_h2 are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to rely more on the visual information v_i or more on the language information s_t.
Step 4.7, compute the weights that distribute between visual information and language information when generating each word (formula given as an image in the original document), using the attention vector obtained in step 4.6, where W_s and W_h3 are parameters to be learned in the model and the weights cover both the image region features and the language information vector. The context is obtained by multiplying each v_i by its corresponding weight β and summing, and since one language information vector is added the length becomes K+1.
Step 4.8, feed the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t (formula given as an image in the original document).
Step 4.9, train the model with a cross-entropy loss function:
L(θ) = -Σ_t log p_θ(y_t^* | y_1^*, ..., y_{t-1}^*)
where y_1^*, ..., y_{t-1}^* are the ground-truth words up to time t-1 and y_t^* is the ground-truth word at time t; the loss maximizes the conditional probability of the word at time t given the ground-truth words from time 1 to t-1.
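A short sketch of this teacher-forcing objective for steps 4.8-4.9; the padding index and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def caption_xent_loss(logits, targets, pad_idx=0):
    """logits: (T, vocab_size) scores from the second-layer LSTM; targets: (T,) ground-truth word ids.
    Implements L(theta) = -sum_t log p_theta(y_t* | y_1*..y_{t-1}*)."""
    log_probs = F.log_softmax(logits, dim=-1)       # softmax layer of step 4.8
    return F.nll_loss(log_probs, targets, ignore_index=pad_idx, reduction="sum")
```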
Step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and this similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning. This step specifically comprises the following sub-steps:
Step 5.1, first compute the CIDEr score of the model. The number of times an n-gram w_k appears in a reference sentence s_ij is denoted h_k(s_ij), and the number of times it appears in the candidate sentence c_i is denoted h_k(c_i). The TF-IDF weight g_k(s_ij) of each n-gram w_k is computed as
g_k(s_ij) = [ h_k(s_ij) / Σ_{w_l∈Ω} h_l(s_ij) ] · log( |I| / Σ_{I_p∈I} min(1, Σ_q h_k(s_pq)) )
where Ω is the set of all n-grams and I is the set of all images in the dataset. For n-grams of length n, the CIDEr_n score is computed from the average cosine similarity between the candidate sentence c_i and the m reference sentences S_i:
CIDEr_n(c_i, S_i) = (1/m) Σ_j ⟨g^n(c_i), g^n(s_ij)⟩ / ( ‖g^n(c_i)‖ · ‖g^n(s_ij)‖ )
Finally, the total CIDEr score is
CIDEr(c_i, S_i) = Σ_{n=1}^N w_n CIDEr_n(c_i, S_i)
where w_n is the weight assigned to n-grams of length n.
the above procedure can be found in the literature Vedantam R, Lawrence Zitnick C, Parikh D.Cider: Consensuss-based image description evaluation [ C]//Proceedings of the IEEE conference on computer vision and patternrecognition.2015:4566-4575。
Step 5.2, for a matched image-sentence pair (I_n, S_n), train a convolutional neural network to extract the global feature vector φ(I) of the input image I, and train a recurrent neural network (RNN) to extract the feature of the sentence S.
The features of the two different modalities are then mapped into the same space through two linear mapping layers.
The cosine similarity s(I, S) between the mapped image and sentence features is then computed (formula given as an image in the original document).
To train this mapping-space model, the parameter θ_s is optimized to minimize e(θ_s), the average error of the loss function L_e(I, S) over the training set {(I_n, S_n)}.
Here L_e(I, S) is defined with a bidirectional ranking loss:
L_e(I, S) = Σ_{S'} max(0, β - s(I, S) + s(I, S')) + Σ_{I'} max(0, β - s(I, S) + s(I', S))
where β is a margin, (I, S) is a matched reference image-sentence pair, and I' and S' are images and sentences randomly sampled from the training set.
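A hedged batch-wise sketch of this bidirectional ranking loss, using all other items in the mini-batch as the randomly sampled negatives I' and S'; the margin value and the batch-negative scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, sent_emb, margin=0.2):
    """img_emb, sent_emb: (B, d) mapped features; row i of each is a matched pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    sim = img_emb @ sent_emb.t()                    # cosine similarities, (B, B)
    pos = sim.diag().unsqueeze(1)                   # matched-pair similarities s(I, S)
    cost_s = (margin + sim - pos).clamp(min=0)      # image-to-sentence direction
    cost_i = (margin + sim - pos.t()).clamp(min=0)  # sentence-to-image direction
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_s.masked_fill(mask, 0).sum() + cost_i.masked_fill(mask, 0).sum()
```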
Step 5.3, define the reward using the CIDEr score of the sentence Ŝ predicted by the model together with the cosine similarity between the input image I and Ŝ (the reward formula appears as an image in the original document).
Step 5.4, update the network parameters with the policy gradient of reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_RL(θ) with respect to the parameters θ is
∇_θ L_RL(θ) = -E_{Ŝ∼p_θ}[ R(Ŝ) ∇_θ log p_θ(Ŝ) ]
where log p_θ(Ŝ) is the score function and R(·) is the reward function.
The above steps can be found in: R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057-1063, 2000.
In order to reduce the variance of the gradient estimate, a baseline function b is introduced,
Figure BDA00021898893000001010
step 5.5, let b ═ R (S)*I), then the gradient is calculated as
Figure BDA00021898893000001011
And S is a descriptive sentence corresponding to the image I, which is equivalent to the true value of S, and the true value is a known quantity.
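A minimal sketch of this self-critical-style update with the ground-truth reward as baseline; in practice the reward R would be the CIDEr score plus the mapping-space similarity described above, and the single-sample approximation of the expectation is an assumption.

```python
def rl_caption_loss(log_prob_sampled, reward_sampled, reward_reference):
    """log_prob_sampled: sum_t log p_theta(w_t) of the sampled caption (scalar tensor with grad);
    reward_sampled: R(S_hat, I) as a plain float; reward_reference: baseline b = R(S*, I)."""
    advantage = reward_sampled - reward_reference   # R(S_hat, I) - R(S*, I)
    return -advantage * log_prob_sampled            # gradient matches -(R - b) * grad log p_theta
```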
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (6)

1. An attention mechanism-based image description generation method is characterized by comprising the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
step 2, adopting a ResNet101 model as a CNN initial model, performing parameter pre-training of ResNet101, using the pre-trained ResNet101 to independently extract global features of the image, then using the pre-trained ResNet101 to replace the CNN in a Faster R-CNN algorithm to extract a plurality of object region features of each image, and then forming the object regions into relationship regions in pairs to extract relationship features;
step 3, performing feature fusion on the relationship features and the object region features to obtain object region features containing the relationship between the objects;
step 4, inputting the object region features containing the relationships between objects obtained in the previous step into a two-layer LSTM language model to obtain the output, namely the natural language description generated for the image;
and step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and this similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning.
2. The method according to claim 1, characterized in that the vocabulary is constructed in step 1 by counting the occurrence frequency of each word in the text descriptions of the MS COCO dataset and including only words that occur more than five times, the MS COCO vocabulary containing 9487 words.
3. The method according to claim 1, characterized in that in step 2 the Faster R-CNN algorithm is used to extract the object region features of the image, and the loss function for one image during training is defined as:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i^*) + λ (1/N_reg) Σ_i p_i^* L_reg(t_i, t_i^*)
where the parameter λ balances the two normalization terms N_cls and N_reg; N_cls is set to the mini-batch size and N_reg to the total number of anchors; i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region; p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample; t_i is the vector of the 4 coordinate parameters of the generated bounding box, and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor;
L_cls is the classification loss of the object:
L_cls(p_i, p_i^*) = -log[p_i^* p_i + (1 - p_i^*)(1 - p_i)]
L_reg is the bounding-box regression loss:
L_reg(t_i, t_i^*) = R(t_i - t_i^*)
where R is the smooth L1 loss function:
R(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise.
4. The method according to claim 1, characterized in that the specific method for fusing the relationship features with the object region features in step 3 is as follows:
for an input image I, the previous step produces a set of object regions {v_1, ..., v_i, ..., v_K} and a set of relationship regions {s_1, ..., s_k, ...}; each object is contained in several different relationship regions, and each relationship has a different importance to the object; an attention weight p_i(s_k), denoting the weight of relationship s_k for object v_i, is computed with a softmax over the relationship regions containing v_i (the formula appears as an image in the original document);
the features of the relationship regions connected to object v_i are aggregated according to these attention weights into an overall relationship feature,
s̄_i = Σ_k p_i(s_k) · s_k,
and the aggregated relationship feature is then passed into the target object feature as additional information (formula given as an image in the original document), where upper-case S_k denotes the set of relationship regions, lower-case s_k a single relationship region, s̄_i the aggregated relationship feature vector, and v̂_i the fine-tuned v_i.
5. The method according to claim 1, characterized in that step 4 is specifically implemented as follows:
step 4.1, input the global image feature v_0 into the first layer of the two-layer LSTM language model;
step 4.2, at each time t, compute an attention weight for each feature-fused object region (the attention formula appears as an image in the original document), where W_v1 and W_h1 are parameters to be learned in the language model, the computed weight is the attention weight assigned to region i at time t, h_t^1 is the output of the first-layer LSTM at time t, and tanh is the tanh activation function;
step 4.3, the attention weight assigned to each region represents that region's contribution to the currently generated word; the attended visual feature is obtained by multiplying each v_i by its attention weight and summing (formulas given as images in the original document);
step 4.4, the input of the second-layer LSTM of the language model is formed by combining the output of the first-layer language model with the attention-weighted image features;
step 4.5, add a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is consulted when generating text (non-visual) words:
g_t = σ(W_x x_t + W_h h_{t-1})
s_t = g_t ⊙ tanh(c_t)
where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-state output of the second-layer LSTM at time t-1;
step 4.6, the second-layer LSTM also uses an attention mechanism (formula given as an image in the original document), where W_v2 and W_h2 are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to rely more on the visual information v_i or more on the language information s_t;
step 4.7, compute the weights that distribute between visual information and language information when generating each word (formula given as an image in the original document), using the attention vector obtained in step 4.6, where W_s and W_h3 are parameters to be learned in the model and the weights cover both the image region features and the language information vector; the context is obtained by multiplying each v_i by its corresponding weight β and summing, and since one language information vector is added the length becomes K+1;
step 4.8, feed the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t (formula given as an image in the original document);
and step 4.9, finally, train the model with a cross-entropy loss function.
6. The method according to claim 1, characterized in that step 5 is specifically implemented as follows:
step 5.1, first compute the CIDEr score CIDEr(c_i, S_i) of the model, where c_i is the candidate sentence and S_i is the set of reference sentences;
step 5.2, for a matched image-sentence pair (I_n, S_n), train a convolutional neural network to extract the global feature vector φ(I) of the input image I, and train a recurrent neural network to extract the feature of the sentence S; the features of the two different modalities are then mapped into the same space through two linear mapping layers;
the cosine similarity s(I, S) between the mapped image and sentence features is then computed (formula given as an image in the original document);
to train this mapping-space model, the parameter θ_s is optimized to minimize e(θ_s), the average error of the loss function L_e(I, S) over the training set {(I_n, S_n)}, where L_e(I, S) is defined with a bidirectional ranking loss:
L_e(I, S) = Σ_{S'} max(0, β - s(I, S) + s(I, S')) + Σ_{I'} max(0, β - s(I, S) + s(I', S))
where β is a margin, (I, S) is a matched reference image-sentence pair, and I' and S' are images and sentences randomly sampled from the training set;
step 5.3, define the reward using the CIDEr score of the sentence Ŝ predicted by the model together with the cosine similarity between the input image I and Ŝ (the reward formula appears as an image in the original document);
step 5.4, update the network parameters with the policy gradient of reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_RL(θ) with respect to the parameters θ is
∇_θ L_RL(θ) = -E_{Ŝ∼p_θ}[ R(Ŝ) ∇_θ log p_θ(Ŝ) ]
where log p_θ(Ŝ) is the score function and R(·) is the reward function;
to reduce the variance of the gradient estimate, a baseline function b is introduced, giving
∇_θ L_RL(θ) = -E_{Ŝ∼p_θ}[ (R(Ŝ) - b) ∇_θ log p_θ(Ŝ) ]
step 5.5, let b = R(S^*, I); the gradient is then computed as
∇_θ L_RL(θ) ≈ -(R(Ŝ, I) - R(S^*, I)) ∇_θ log p_θ(Ŝ)
where S^* is the description sentence corresponding to image I, i.e., the ground-truth of S.
CN201910828522.3A 2019-09-03 2019-09-03 Image description generation method based on attention mechanism Pending CN110674850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910828522.3A CN110674850A (en) 2019-09-03 2019-09-03 Image description generation method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910828522.3A CN110674850A (en) 2019-09-03 2019-09-03 Image description generation method based on attention mechanism

Publications (1)

Publication Number Publication Date
CN110674850A true CN110674850A (en) 2020-01-10

Family

ID=69076245

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910828522.3A Pending CN110674850A (en) 2019-09-03 2019-09-03 Image description generation method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN110674850A (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325323A (en) * 2020-02-19 2020-06-23 山东大学 Power transmission and transformation scene description automatic generation method fusing global information and local information
CN111414962A (en) * 2020-03-19 2020-07-14 创新奇智(重庆)科技有限公司 Image classification method introducing object relationship
CN111612103A (en) * 2020-06-23 2020-09-01 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation
CN111753825A (en) * 2020-03-27 2020-10-09 北京京东尚科信息技术有限公司 Image description generation method, device, system, medium and electronic equipment
CN111783852A (en) * 2020-06-16 2020-10-16 北京工业大学 Self-adaptive image description generation method based on deep reinforcement learning
CN111814946A (en) * 2020-03-17 2020-10-23 同济大学 Image description automatic generation method based on multi-body evolution
CN111916050A (en) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection
CN112069841A (en) * 2020-07-24 2020-12-11 华南理工大学 Novel X-ray contraband parcel tracking method and device
CN112200268A (en) * 2020-11-04 2021-01-08 福州大学 Image description method based on encoder-decoder framework
CN112256904A (en) * 2020-09-21 2021-01-22 天津大学 Image retrieval method based on visual description sentences
CN112528989A (en) * 2020-12-01 2021-03-19 重庆邮电大学 Description generation method for semantic fine granularity of image
CN113378919A (en) * 2021-06-09 2021-09-10 重庆师范大学 Image description generation method for fusing visual sense and enhancing multilayer global features
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113469143A (en) * 2021-08-16 2021-10-01 西南科技大学 Finger vein image identification method based on neural network learning
CN113837230A (en) * 2021-08-30 2021-12-24 厦门大学 Image description generation method based on adaptive attention mechanism
CN114693790A (en) * 2022-04-02 2022-07-01 江西财经大学 Automatic image description method and system based on mixed attention mechanism
CN114882488A (en) * 2022-05-18 2022-08-09 北京理工大学 Multi-source remote sensing image information processing method based on deep learning and attention mechanism
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN108520273A (en) * 2018-03-26 2018-09-11 天津大学 A kind of quick detection recognition method of dense small item based on target detection
CN109146786A (en) * 2018-08-07 2019-01-04 北京市商汤科技开发有限公司 Scene chart generation method and device, electronic equipment and storage medium
CN109726696A (en) * 2019-01-03 2019-05-07 电子科技大学 System and method is generated based on the iamge description for weighing attention mechanism
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A kind of generation method of image, semantic description

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN108520273A (en) * 2018-03-26 2018-09-11 天津大学 A kind of quick detection recognition method of dense small item based on target detection
CN109146786A (en) * 2018-08-07 2019-01-04 北京市商汤科技开发有限公司 Scene chart generation method and device, electronic equipment and storage medium
CN109726696A (en) * 2019-01-03 2019-05-07 电子科技大学 System and method is generated based on the iamge description for weighing attention mechanism
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN110188779A (en) * 2019-06-03 2019-08-30 中国矿业大学 A kind of generation method of image, semantic description

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PREKSHA NEMA et al.: "Generating Descriptions from Structured Data Using a Bifocal Attention Mechanism and Gated Orthogonalization", arXiv *
JIN Huazhong et al.: "An image description generation model combining global and local features", Journal of Applied Sciences *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325323B (en) * 2020-02-19 2023-07-14 山东大学 Automatic power transmission and transformation scene description generation method integrating global information and local information
CN111325323A (en) * 2020-02-19 2020-06-23 山东大学 Power transmission and transformation scene description automatic generation method fusing global information and local information
CN111814946A (en) * 2020-03-17 2020-10-23 同济大学 Image description automatic generation method based on multi-body evolution
CN111814946B (en) * 2020-03-17 2022-11-15 同济大学 Multi-body evolution-based automatic image description generation method
CN111414962A (en) * 2020-03-19 2020-07-14 创新奇智(重庆)科技有限公司 Image classification method introducing object relationship
WO2021190257A1 (en) * 2020-03-27 2021-09-30 北京京东尚科信息技术有限公司 Image description generation method, apparatus and system, and medium and electronic device
CN111753825A (en) * 2020-03-27 2020-10-09 北京京东尚科信息技术有限公司 Image description generation method, device, system, medium and electronic equipment
CN111783852B (en) * 2020-06-16 2024-03-12 北京工业大学 Method for adaptively generating image description based on deep reinforcement learning
CN111783852A (en) * 2020-06-16 2020-10-16 北京工业大学 Self-adaptive image description generation method based on deep reinforcement learning
CN111612103A (en) * 2020-06-23 2020-09-01 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation
CN111612103B (en) * 2020-06-23 2023-07-11 中国人民解放军国防科技大学 Image description generation method, system and medium combined with abstract semantic representation
CN112069841A (en) * 2020-07-24 2020-12-11 华南理工大学 Novel X-ray contraband parcel tracking method and device
CN111916050A (en) * 2020-08-03 2020-11-10 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112037239A (en) * 2020-08-28 2020-12-04 大连理工大学 Text guidance image segmentation method based on multi-level explicit relation selection
CN112256904A (en) * 2020-09-21 2021-01-22 天津大学 Image retrieval method based on visual description sentences
CN112200268A (en) * 2020-11-04 2021-01-08 福州大学 Image description method based on encoder-decoder framework
CN112528989A (en) * 2020-12-01 2021-03-19 重庆邮电大学 Description generation method for semantic fine granularity of image
CN112528989B (en) * 2020-12-01 2022-10-18 重庆邮电大学 Description generation method for semantic fine granularity of image
CN113378919B (en) * 2021-06-09 2022-06-14 重庆师范大学 Image description generation method for fusing visual sense and enhancing multilayer global features
CN113378919A (en) * 2021-06-09 2021-09-10 重庆师范大学 Image description generation method for fusing visual sense and enhancing multilayer global features
CN113408430A (en) * 2021-06-22 2021-09-17 哈尔滨理工大学 Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework
CN113469143A (en) * 2021-08-16 2021-10-01 西南科技大学 Finger vein image identification method based on neural network learning
CN113837230A (en) * 2021-08-30 2021-12-24 厦门大学 Image description generation method based on adaptive attention mechanism
CN114693790A (en) * 2022-04-02 2022-07-01 江西财经大学 Automatic image description method and system based on mixed attention mechanism
CN114693790B (en) * 2022-04-02 2022-11-18 江西财经大学 Automatic image description method and system based on mixed attention mechanism
CN114882488A (en) * 2022-05-18 2022-08-09 北京理工大学 Multi-source remote sensing image information processing method based on deep learning and attention mechanism
CN114882488B (en) * 2022-05-18 2024-06-28 北京理工大学 Multisource remote sensing image information processing method based on deep learning and attention mechanism
CN116580283A (en) * 2023-07-13 2023-08-11 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium
CN116580283B (en) * 2023-07-13 2023-09-26 平安银行股份有限公司 Image prompt word generation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110674850A (en) Image description generation method based on attention mechanism
CN110807154B (en) Recommendation method and system based on hybrid deep learning model
CN112784092B (en) Cross-modal image text retrieval method of hybrid fusion model
CN109299396B (en) Convolutional neural network collaborative filtering recommendation method and system fusing attention model
CN108363753B (en) Comment text emotion classification model training and emotion classification method, device and equipment
CN107273438B (en) Recommendation method, device, equipment and storage medium
CN109389151B (en) Knowledge graph processing method and device based on semi-supervised embedded representation model
CN110175628A (en) A kind of compression algorithm based on automatic search with the neural networks pruning of knowledge distillation
CN112733027B (en) Hybrid recommendation method based on local and global representation model joint learning
CN112800344B (en) Deep neural network-based movie recommendation method
CN111753044A (en) Regularization-based language model for removing social bias and application
CN112597302B (en) False comment detection method based on multi-dimensional comment representation
CN115269847A (en) Knowledge-enhanced syntactic heteromorphic graph-based aspect-level emotion classification method
CN112256866A (en) Text fine-grained emotion analysis method based on deep learning
CN112529071B (en) Text classification method, system, computer equipment and storage medium
CN116610778A (en) Bidirectional image-text matching method based on cross-modal global and local attention mechanism
CN113326384A (en) Construction method of interpretable recommendation model based on knowledge graph
CN112100439B (en) Recommendation method based on dependency embedding and neural attention network
CN114332519A (en) Image description generation method based on external triple and abstract relation
CN114372475A (en) Network public opinion emotion analysis method and system based on RoBERTA model
CN114036298B (en) Node classification method based on graph convolution neural network and word vector
CN110874392B (en) Text network information fusion embedding method based on depth bidirectional attention mechanism
CN116881689A (en) Knowledge-enhanced user multi-mode online comment quality evaluation method and system
CN117216381A (en) Event prediction method, event prediction device, computer device, storage medium, and program product
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination