CN110674850A - Image description generation method based on attention mechanism - Google Patents
- Publication number
- CN110674850A CN110674850A CN201910828522.3A CN201910828522A CN110674850A CN 110674850 A CN110674850 A CN 110674850A CN 201910828522 A CN201910828522 A CN 201910828522A CN 110674850 A CN110674850 A CN 110674850A
- Authority
- CN
- China
- Prior art keywords
- image
- model
- features
- region
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention provides an image description generation method based on an attention mechanism. The invention has the following advantages: fusing relationship features with object features enriches the image information; the two-layer language model generates finer-grained image descriptions; and further optimizing the trained model with reinforcement learning alleviates the exposure bias problem.
Description
Technical Field
The invention belongs to the field of computer vision and natural language processing, relates to an image language description generation method, and particularly relates to an image description generation method based on an attention mechanism.
Background
In many situations it is necessary to convert image content into a text description: for example, social software can automatically generate a text summary of an image when the network connection is poor, and such descriptions help people with visual impairment understand image content. Existing image description methods are mainly based on deep learning: a convolutional neural network serves as the image processing model to extract image features, and these features are fed into a recurrent neural network serving as the language generation model to produce the description. However, such models usually use global or object-level image features, so it is difficult to attend to the salient target objects in the image without losing much important information, and the important visual semantic relationship information in the image is hard to exploit fully. Moreover, most existing models are a single-step forward process: when generating the next word, the model can only use the words generated before it, so one erroneous word during generation causes error accumulation. On the other hand, existing models are trained by maximizing the joint probability of the generated sequence, i.e. by minimizing a cross-entropy loss; back-propagation raises the joint probability of the reference words so that the model learns the probability distribution of words in a sentence. Yet this objective differs from the automatic evaluation metrics normally used to judge the quality of sentences generated by an image description model, and those metrics are non-differentiable and therefore cannot be used directly as the loss function; this inconsistency between the loss function and the evaluation metrics prevents the model from being optimized sufficiently.
Disclosure of Invention
The invention aims to overcome the defects of the existing method, and provides an image description generation method based on an attention mechanism.
The technical problem of the present invention is mainly solved by the following technical solutions, and an image description generation method based on an attention mechanism includes the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
step 2, adopting a ResNet101 model as the initial CNN model, pre-training the parameters of ResNet101, using the pre-trained ResNet101 to extract the global features of the image on its own, then using the pre-trained ResNet101 to replace the CNN in the Faster R-CNN algorithm to extract a plurality of object region features for each image, and then pairing the object regions into relationship regions to extract relationship features;
step 3, performing feature fusion on the relationship features and the object region features to obtain object region features containing the relationship between the objects;
step 4, inputting the object region characteristics containing the relationship between the objects obtained in the previous step into a double-layer LSTM language model to obtain an output result, namely natural language description generated on the image;
and step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and the similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning.
Furthermore, the vocabulary table is constructed in step 1 by counting the occurrence frequency of each word in the text description of the MS COCO dataset, and only selecting words with the occurrence frequency greater than five to be listed in the vocabulary table, wherein the MS COCO dataset vocabulary table contains 9487 words.
Further, in step 2 the Faster R-CNN algorithm is used to extract the object region features of the image, and the loss function for one image during training is defined as follows:

L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i,p_i^*) + \lambda\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i,t_i^*)

where the parameter λ balances the two normalization terms N_{cls} and N_{reg}; N_{cls} is set to the mini-batch size and N_{reg} to the total number of anchors; i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region; p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample; t_i is a vector of the 4 coordinate parameters of the generated bounding box, and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor;

L_{cls} is the object classification loss:

L_{cls}(p_i,p_i^*) = -\log[p_i^* p_i + (1-p_i^*)(1-p_i)]

L_{reg} is the bounding-box regression loss:

L_{reg}(t_i,t_i^*) = R(t_i - t_i^*)

where R is the smooth L1 loss function:

R(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}
Further, the specific method for fusing the relationship features with the object region features in step 3 is as follows:

for an input image I, the previous step yields a series of object regions \{v_1, \dots, v_i, \dots, v_k\} and relationship regions \{s_1, \dots, s_i, \dots, s_k\}; each object is contained in several different relationship regions, and each relationship matters to the object to a different degree; p_i(s_k), the attention weight of relationship s_k for object v_i, is computed by the following formula:

the relationship-region features connected to object v_i are then aggregated, according to these attention weights, into one overall aggregated relationship feature, which is passed to the target object feature as information, using the following formula:

where uppercase S_k denotes a set of relationship regions and lowercase s_k a single relationship region; the results are the final aggregated relationship feature vector and the fine-tuned v_i.
Further, the specific implementation manner of step 4 is as follows,
step 4.1, inputting the global image feature v_0 to the first layer of the two-layer LSTM language model;
step 4.2, respectively calculating an attention weight for each object area subjected to feature fusion at each time t:
where W_{v1} and W_{h1} are parameters to be learned in the language model, the attention weight assigned to region i at time t is computed from the output of the first-layer LSTM at time t, and tanh is the hyperbolic-tangent activation function;
step 4.3, the attention weight assigned to each region represents the contribution degree of the region to the currently generated word:
where the weighted feature denotes, for each v_i, the result after attention weighting;
step 4.4, the input of the LSTM in the second layer of the language model is formed by combining the output of the first-stage language model and the attention-weighted image characteristics;
step 4.5, adding a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is referenced when generating text words:

g_t = \sigma(W_x x_t + W_h h_{t-1}), \quad s_t = g_t \odot \tanh(c_t)

where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-layer output of the second-layer LSTM at time t-1;
step 4.6, the second layer LSTM also uses the attention mechanism:
where W_{v2} and W_{h2} are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to depend more on the visual information v_i or more on the language information s_t;
Step 4.7, calculating the distribution weight of the visual information and the language information when each word is generated:
where the attention vector obtained in step 4.6 and the parameters W_s and W_{h3} to be learned in the model yield the distribution weights of the image region features and of the language information vector;

each v_i is multiplied by its corresponding β and the results are summed; because one language information vector is added, the length becomes K + 1;
step 4.8, inputting the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t;

and step 4.9, finally, training the model with a cross-entropy loss function.
Further, the specific implementation manner of step 5 is as follows,
step 5.1, first computing the model's CIDEr score CIDEr(c_i, S_i), where c_i is the candidate sentence and S_i the reference sentences;
step 5.2, for a matched image-sentence pair (I_n, S_n), training a convolutional neural network to extract a global feature vector φ(I) from the input image I, and training a recurrent neural network to extract the features of the sentence S; the features of the two different modalities are then mapped into the same space through two linear mapping layers;

the cosine similarity between the image and the sentence is then computed as follows:

to train this mapping-space model, a parameter θ_s is defined and the average error e(θ_s) of the loss function L_e(I, S) over the training set is minimized, where L_e(I, S) is defined with a two-way ranking loss:

L_e(I,S) = \sum_{S'} \max(0, \beta - s(I,S) + s(I,S')) + \sum_{I'} \max(0, \beta - s(I,S) + s(I',S))

where β denotes the margin, (I, S) denotes a matched reference image-sentence pair, I' and S' denote image-sentence pairs randomly drawn from the training set as mismatched samples, and s denotes cosine similarity;
step 5.3, defining the reward from the predicted sentence's CIDEr score together with the cosine similarity between the input image and the predicted sentence: ĉ denotes the sentence predicted by the model, CIDEr(ĉ) its CIDEr score, and s(I, ĉ) the cosine similarity between the input image I and ĉ;
step 5.4, updating the network parameters with the policy gradient in reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_{RL}(θ) with respect to the parameter θ is computed as follows,

\nabla_\theta L_{RL}(\theta) = -\mathbb{E}\big[R(\hat{c})\,\nabla_\theta \log p_\theta(\hat{c})\big]
in order to reduce the variance of the gradient estimate, a baseline function b is introduced,
step 5.5, setting b = R(S^*, I), the gradient formula becomes:

\nabla_\theta L_{RL}(\theta) = -\big(R(\hat{c}) - R(S^*, I)\big)\,\nabla_\theta \log p_\theta(\hat{c})

where S^* is the description sentence corresponding to image I, i.e. the ground-truth value of S.
Compared with the prior art, the invention has the following advantages:
1. fusing the relationship features with the object features enriches the image information;
2. the two-layer language model generates finer-grained image descriptions;
3. the invention further optimizes the trained model with reinforcement learning to alleviate the exposure bias problem.
Drawings
Fig. 1 is a general flow chart of the present invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
As shown in fig. 1, an attention-based image description generation method includes the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
the vocabulary table obtained in the step 1 is obtained by counting the occurrence frequency of each word in the text description of the MS COCO data set, and only selecting the words with the occurrence frequency more than five times to be listed in the vocabulary table, wherein the MS COCO data set vocabulary table comprises 9,487 words.
Step 2, adopting a ResNet101 model as a CNN initial model, adopting an ImageNet data set to pre-train parameters of ResNet101, using the pre-trained ResNet101 to independently extract global features of the image, then using the pre-trained ResNet101 to replace CNN in a Faster R-CNN algorithm to extract 36 object region features of each image, and then forming every two object regions into a relationship region to extract relationship features;
In step 2, the global features of the image are extracted with the pre-trained ResNet101, the object region features of the image are extracted with the Faster R-CNN algorithm, and the loss function for one image during training is defined as follows:

L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i,p_i^*) + \lambda\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i,t_i^*)

where the parameter λ balances the two normalization terms N_{cls} and N_{reg}; N_{cls} is set to the mini-batch size, N_{reg} to the total number of anchors, and λ to 10. i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region. p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample. t_i is a vector of the 4 coordinate parameters of the generated bounding box (upper-left, upper-right, lower-left, and lower-right), and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor. L_{cls} is the object classification loss:

L_{cls}(p_i,p_i^*) = -\log[p_i^* p_i + (1-p_i^*)(1-p_i)]

L_{reg} is the bounding-box regression loss:

L_{reg}(t_i,t_i^*) = R(t_i - t_i^*)

where R is the smooth L1 loss function:

R(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}
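The loss terms above can be sketched numerically. A minimal, unoptimized Python rendering of the smooth L1 loss, the classification loss, and the combined objective (the `preds` tuple layout and function names are illustrative, not from the patent):

```python
import math

def smooth_l1(x):
    """Smooth L1 loss R: 0.5*x^2 when |x| < 1, otherwise |x| - 0.5."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def l_cls(p, p_star):
    """Classification loss: -log[p*·p + (1 - p*)(1 - p)]."""
    return -math.log(p_star * p + (1 - p_star) * (1 - p))

def faster_rcnn_loss(preds, n_cls, n_reg, lam=10.0):
    """Combined loss (1/N_cls)·Σ L_cls + λ·(1/N_reg)·Σ p*·L_reg.
    `preds` is a list of (p, p_star, t, t_star) tuples, one per anchor,
    where t and t_star are 4-component coordinate vectors."""
    cls_term = sum(l_cls(p, ps) for p, ps, _, _ in preds) / n_cls
    reg_term = sum(ps * sum(smooth_l1(a - b) for a, b in zip(t, ts))
                   for _, ps, t, ts in preds) / n_reg
    return cls_term + lam * reg_term
```

Only positive anchors (p* = 1) contribute to the regression term, matching the p_i^* factor in the formula.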
In step 3, feature fusion of the relationship features and the object region features yields object region features that contain the relationships between objects; the specific method is as follows:

for an input image I, the previous step yields a series of object regions \{v_1, \dots, v_i, \dots, v_k\} and relationship regions \{s_1, \dots, s_i, \dots, s_k\}; each object is contained in several different relationship regions, and each relationship matters to the object to a different degree; p_i(s_k), the attention weight of relationship s_k for object v_i, is computed by the following formula:

the relationship-region features connected to object v_i are then aggregated, according to these attention weights, into one overall aggregated relationship feature, which is passed to the target object feature as information, using the following formula:

where uppercase S_k denotes a set of relationship regions and lowercase s_k a single relationship region; the results are the final aggregated relationship feature vector and the fine-tuned v_i.
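A small sketch of the step 3 fusion, assuming the attention weights p_i(s_k) come from a softmax over per-relation scores and that the aggregated relationship feature is simply added onto the object feature (both the scoring function and the combination rule are assumptions for illustration):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fuse_relations(v_i, relation_feats, scores):
    """Aggregate the relationship-region features connected to object v_i
    with attention weights p_i(s_k), then add the aggregate back onto
    v_i to produce the fine-tuned object feature of step 3."""
    weights = softmax(scores)
    dim = len(v_i)
    agg = [sum(w * r[d] for w, r in zip(weights, relation_feats))
           for d in range(dim)]
    return [v + a for v, a in zip(v_i, agg)]
```

With equal scores the aggregate is the plain average of the connected relation features.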
Step 4, inputting the object region characteristics containing the relationship between the objects obtained in the previous step into a double-layer LSTM language model to obtain an output result, namely, natural language description generated for the image, and specifically comprising the following substeps:
Step 4.1, the first-layer LSTM takes as input the global image feature v_0 concatenated with the output of the second-layer LSTM at time t-1 and the encoding of the previously generated word, from which the initial description is generated. The features of the word sequence generated at the current time are concatenated with the output of the first-layer LSTM at the current time as the input of the next-layer language model.
Step 4.2, respectively calculating an attention weight for each object area subjected to feature fusion at each time t:
where W_{v1} and W_{h1} are parameters to be learned in the language model, the attention weight assigned to region i at time t is computed from the output of the first-layer LSTM at time t, and tanh is the hyperbolic-tangent activation function.
step 4.3, the attention weight assigned to each region represents the contribution degree of the region to the currently generated word:
where the weighted feature denotes, for each v_i, the result after attention weighting.
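Steps 4.2-4.3 amount to scoring each fused region feature against the first-layer LSTM state, softmax-normalizing the scores, and taking a weighted sum. A toy Python sketch with scalar stand-ins for the learned matrices W_{v1} and W_{h1} (a simplification of the real learned projections):

```python
import math

def attention(h1, regions, W_v, W_h):
    """One attention step: score each region with tanh(W_v·v_i + W_h·h1),
    softmax the scores into weights, and return (weights, context),
    where context is the attention-weighted sum of region features."""
    scores = [math.tanh(W_v * sum(v) + W_h * sum(h1)) for v in regions]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(regions[0])
    ctx = [sum(a * v[d] for a, v in zip(alphas, regions)) for d in range(dim)]
    return alphas, ctx
```

Identical regions receive identical weights, and the weights always sum to one.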
Step 4.4, the input of the LSTM in the second layer of the language model is formed by combining the output of the first-stage language model and the attention-weighted image characteristics;
Step 4.5, adding a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is referenced when generating text words:

g_t = \sigma(W_x x_t + W_h h_{t-1}), \quad s_t = g_t \odot \tanh(c_t)

where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-layer output of the second-layer LSTM at time t-1.
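The sentinel gate of step 4.5 can be sketched element-wise as follows (scalar weights replace the learned matrices W_x and W_h for brevity; this is an illustrative reduction, not the trained model):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sentinel(x_t, h_prev, c_t, W_x, W_h):
    """Visual-sentinel gate: g_t = sigmoid(W_x·x_t + W_h·h_{t-1}),
    then s_t = g_t ⊙ tanh(c_t), computed element-wise."""
    g = [sigmoid(W_x * x + W_h * h) for x, h in zip(x_t, h_prev)]
    return [gi * math.tanh(ci) for gi, ci in zip(g, c_t)]
```

With a zero cell state the language vector is zero; as the cell state saturates, s_t approaches the gate value g_t.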
Step 4.6, the second layer LSTM also uses the attention mechanism:
where W_{v2} and W_{h2} are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to depend more on the visual information v_i or more on the language information s_t.
Step 4.7, calculating the distribution weight of the visual information and the language information when each word is generated:
where the attention vector obtained in step 4.6 and the parameters W_s and W_{h3} to be learned in the model yield the distribution weights of the image region features and of the language information vector.

Each v_i is multiplied by its corresponding β and the results are summed; because one language information vector is added, the length becomes K + 1.
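Step 4.7's mixing of the K attention-weighted region features with the language vector s_t into K + 1 candidates can be sketched as follows (the `betas` are assumed to be already-normalized distribution weights, and the function name is illustrative):

```python
def mix_visual_language(weighted_regions, s_t, betas):
    """The K attention-weighted region features plus the language vector
    s_t form K + 1 candidates; combine them with the distribution
    weights beta computed in step 4.7."""
    candidates = weighted_regions + [s_t]
    assert len(betas) == len(candidates)  # one weight per candidate (K + 1)
    dim = len(s_t)
    return [sum(b * c[d] for b, c in zip(betas, candidates))
            for d in range(dim)]
```

A β concentrated on the last slot makes the model rely on language information; β mass on the first K slots makes it rely on visual information.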
Step 4.8, inputting the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t:
Step 4.9, training the model with the cross-entropy loss function:

L(\theta) = -\sum_{t} \log p_\theta(y_t^* \mid y_1^*, \dots, y_{t-1}^*)

where y_{t-1}^* denotes the ground-truth word at the previous time t-1, y_t^* the ground-truth word at time t, and L(θ) accumulates the conditional probability of the value at each time t given the values from time 1 to t-1.
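Per sentence, the cross-entropy objective of step 4.9 reduces to the negative sum of the log-probabilities the model assigned to the ground-truth words; a two-line Python sketch (`step_probs` holds the model's probability for the correct word at each time step, an illustrative input format):

```python
import math

def xent_loss(step_probs):
    """Sequence cross-entropy: -Σ_t log p(y*_t | y*_1..t-1), where each
    entry of step_probs is the probability assigned to the true word."""
    return -sum(math.log(p) for p in step_probs)
```

A model that puts probability 1 on every reference word incurs zero loss; lower probabilities grow the loss without bound.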
Step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and the similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning, which specifically comprises the following substeps:
Step 5.1, first compute the model's CIDEr score. Let h_k(s_{ij}) denote the number of times the n-gram w_k appears in reference sentence s_{ij}, and h_k(c_i) the number of times it appears in candidate sentence c_i. The TF-IDF weight g_k of each n-gram w_k is computed as:

g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{w_l \in \Omega} h_l(s_{ij})} \log\!\left(\frac{|I|}{\sum_{I_p \in I} \min(1, \sum_q h_k(s_{pq}))}\right)

where Ω is the set of all n-grams and I is the set of all images in the dataset. For n-grams of length n, the CIDEr_n score is computed from the average cosine similarity between the candidate sentence c_i and the reference sentences S_i:

\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m}\sum_j \frac{g^n(c_i) \cdot g^n(s_{ij})}{\|g^n(c_i)\|\,\|g^n(s_{ij})\|}

Finally, the total CIDEr score is computed, with w_n weighting each n-gram length:

\mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n\, \mathrm{CIDEr}_n(c_i, S_i)

The above procedure can be found in: Vedantam R., Lawrence Zitnick C., Parikh D. CIDEr: Consensus-based image description evaluation // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 4566-4575.
Step 5.2, for a matched image-sentence pair (I_n, S_n), train a convolutional neural network to extract a global feature vector φ(I) from the input image I, and train a recurrent neural network (RNN) to extract the features of the sentence S. The features of the two different modalities are then mapped into the same space through two linear mapping layers.

The cosine similarity between the image and the sentence is then computed as follows:

To train this mapping-space model, a parameter θ_s is defined and the average error e(θ_s) of the loss function L_e(I, S) over the training set is minimized.

Here L_e(I, S) is defined with a two-way ranking loss:

L_e(I,S) = \sum_{S'} \max(0, \beta - s(I,S) + s(I,S')) + \sum_{I'} \max(0, \beta - s(I,S) + s(I',S))

where β denotes the margin, (I, S) denotes a matched reference image-sentence pair, and I', S' denote image-sentence pairs randomly drawn from the training set as mismatched samples.
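The two-way ranking loss of step 5.2 can be sketched directly from precomputed similarities (function and argument names are illustrative; `s_pos` is the matched pair's similarity, and the two lists hold similarities of mismatched sentences and mismatched images):

```python
def ranking_loss(s_pos, s_neg_sents, s_neg_imgs, margin=0.2):
    """Two-way (bidirectional) ranking loss: hinge on every mismatched
    similarity, pushing the matched pair above it by `margin`."""
    loss = sum(max(0.0, margin - s_pos + s) for s in s_neg_sents)
    loss += sum(max(0.0, margin - s_pos + s) for s in s_neg_imgs)
    return loss
```

Once every mismatched similarity sits at least `margin` below the matched one, the loss is exactly zero and gradients vanish.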
Step 5.3, define the reward from the predicted sentence's CIDEr score together with the cosine similarity between the input image and the predicted sentence: ĉ denotes the sentence predicted by the model, CIDEr(ĉ) its CIDEr score, and s(I, ĉ) the cosine similarity between the input image I and ĉ.
Step 5.4, update the network parameters with the policy gradient in reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_{RL}(θ) with respect to the parameter θ is computed as follows,

\nabla_\theta L_{RL}(\theta) = -\mathbb{E}\big[R(\hat{c})\,\nabla_\theta \log p_\theta(\hat{c})\big]
The above step can be found in: R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057-1063, 2000.
In order to reduce the variance of the gradient estimate, a baseline function b is introduced,
Step 5.5, setting b = R(S^*, I), the gradient formula becomes:

\nabla_\theta L_{RL}(\theta) = -\big(R(\hat{c}) - R(S^*, I)\big)\,\nabla_\theta \log p_\theta(\hat{c})

where S^* is the description sentence corresponding to image I, i.e. the ground-truth value of S, which is a known quantity.
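Steps 5.4-5.5 can be condensed into a per-sample surrogate loss: the advantage (sampled-caption reward minus the ground-truth-caption baseline b = R(S*, I)) scales the sampled caption's total log-probability. A minimal sketch with illustrative names:

```python
def reinforce_loss(logprobs_sampled, reward_sampled, reward_baseline):
    """REINFORCE surrogate loss with a baseline. Minimizing this loss
    raises the probability of sampled captions whose reward beats the
    baseline, and lowers it for captions that fall short."""
    advantage = reward_sampled - reward_baseline
    return -advantage * sum(logprobs_sampled)
```

When the sampled caption's reward equals the baseline, the advantage is zero and no update is made, which is exactly how the baseline reduces gradient variance without biasing the estimate.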
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Claims (6)
1. An attention mechanism-based image description generation method is characterized by comprising the following steps:
step 1, extracting words from the labeled sentences of the data set to construct a vocabulary;
step 2, adopting a ResNet101 model as a CNN initial model, performing parameter pre-training of ResNet101, using the pre-trained ResNet101 to independently extract global features of the image, then using the pre-trained ResNet101 to replace the CNN in a Faster R-CNN algorithm to extract a plurality of object region features of each image, and then forming the object regions into relationship regions in pairs to extract relationship features;
step 3, performing feature fusion on the relationship features and the object region features to obtain object region features containing the relationship between the objects;
step 4, inputting the object region characteristics containing the relationship between the objects obtained in the previous step into a double-layer LSTM language model to obtain an output result, namely natural language description generated on the image;
and step 5, training a mapping-space model to measure the similarity between the image and the description sentence, using the CIDEr score and the similarity as reward terms, and further optimizing the two-layer LSTM language model with reinforcement learning.
2. The method of claim 1, wherein the method comprises: the vocabulary table is constructed in the step 1 by counting the occurrence frequency of each word in the text description of the MS COCO data set, and only selecting the words with the occurrence frequency more than five times to be listed in the vocabulary table, wherein the MS COCO data set vocabulary table comprises 9487 words.
3. The method of claim 1, wherein in step 2 the Faster R-CNN algorithm is used to extract the object region features of the image, and the loss function for one image during training is defined as follows:

L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i,p_i^*) + \lambda\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i,t_i^*)

where the parameter λ balances the two normalization terms N_{cls} and N_{reg}; N_{cls} is set to the mini-batch size and N_{reg} to the total number of anchors; i denotes the index of an anchor within a mini-batch during training, and p_i is the predicted probability that anchor i is an object region; p_i^* equals 1 if the anchor is a positive sample and 0 if it is a negative sample; t_i is a vector of the 4 coordinate parameters of the generated bounding box, and t_i^* is the coordinate vector of the ground-truth bounding box corresponding to a positive anchor;

L_{cls} is the object classification loss:

L_{cls}(p_i,p_i^*) = -\log[p_i^* p_i + (1-p_i^*)(1-p_i)]

L_{reg} is the bounding-box regression loss:

L_{reg}(t_i,t_i^*) = R(t_i - t_i^*)

where R is the smooth L1 loss function:

R(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}
4. The method of claim 1, wherein the specific method for fusing the relationship features with the object region features in step 3 is as follows:

for an input image I, the previous step yields a series of object regions \{v_1, \dots, v_i, \dots, v_k\} and relationship regions \{s_1, \dots, s_i, \dots, s_k\}; each object is contained in several different relationship regions, and each relationship matters to the object to a different degree; p_i(s_k), the attention weight of relationship s_k for object v_i, is computed by the following formula:

the relationship-region features connected to object v_i are then aggregated, according to these attention weights, into one overall aggregated relationship feature, which is passed to the target object feature as information, using the following formula:
5. The method of claim 1, wherein the method comprises: the specific implementation of step 4 is as follows,
step 4.1, inputting the global image feature v_0 to the first layer of the two-layer LSTM language model;
step 4.2, respectively calculating an attention weight for each object area subjected to feature fusion at each time t:
where W_{v1} and W_{h1} are parameters to be learned in the language model, the attention weight assigned to region i at time t is computed from the output of the first-layer LSTM at time t, and tanh is the hyperbolic-tangent activation function;
step 4.3, the attention weight assigned to each region represents the contribution degree of the region to the currently generated word:
where the weighted feature denotes, for each v_i, the result after attention weighting;
step 4.4, the input of the LSTM in the second layer of the language model is formed by combining the output of the first-stage language model and the attention-weighted image characteristics;
step 4.5, adding a sentinel gate g_t on top of the second-layer LSTM to compute the language information vector s_t that is referenced when generating text words:

g_t = \sigma(W_x x_t + W_h h_{t-1}), \quad s_t = g_t \odot \tanh(c_t)

where σ is the sigmoid activation function, W_x and W_h are parameters to be trained, c_t denotes the cell state of the LSTM at time t, ⊙ denotes element-wise multiplication, x_t is the input of the second-layer LSTM, and h_{t-1} is the hidden-layer output of the second-layer LSTM at time t-1;
step 4.6, the second layer LSTM also uses the attention mechanism:
where W_{v2} and W_{h2} are parameters to be learned in the model; the difference is that visual words and text words are distinguished here, so that when generating each word the network automatically chooses whether to depend more on the visual information v_i or more on the language information s_t;
Step 4.7, calculating the distribution weight of the visual information and the language information when each word is generated:
where the attention vector obtained in step 4.6 and the parameters W_s and W_{h3} to be learned in the model yield the distribution weights of the image region features and of the language information vector;

each v_i is multiplied by its corresponding β and the results are summed; because one language information vector is added, the length becomes K + 1;
step 4.8, inputting the output of the second-layer language model into the softmax layer to compute the probability distribution of the word generated at time t;

and step 4.9, finally, training the model with a cross-entropy loss function.
6. The method of claim 1, wherein the method comprises: the specific implementation of step 5 is as follows,
step 5.1, first computing the model's CIDEr score CIDEr(c_i, S_i), where c_i is the candidate sentence and S_i the reference sentences;
step 5.2, for a matching image-sentence pair (I)n,Sn) Training a convolution neural network to extract global feature vector phi (I) for input image I, and training a convolution neural network to extract features of sentence SThen, the features of the two different modes are mapped to the same space through two linear mapping layers;
the cosine similarity is then calculated to represent the cosine similarity between the image and the sentence as follows:
To train this mapping-space model, a parameter set θ_s is defined and the loss e(θ_s) is minimized over the training set of matched image-sentence pairs; e(θ_s) denotes the average error of the loss function L_e(I, S), where L_e(I, S) is defined using a bidirectional ranking loss:
where β denotes the margin, (I, S) denotes the set of matched reference image-sentence pairs, and I′ and S′ denote images and sentences randomly sampled from the training set as mismatched negatives;
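The bidirectional ranking loss can be sketched for a single triplet of similarity scores; the margin value and similarities below are illustrative:

```python
def bidirectional_ranking_loss(sim_pos, sim_neg_sent, sim_neg_img, margin=0.2):
    """Hinge loss pushing the matched pair's similarity above mismatched
    pairs by `margin`, in both directions: image vs. a wrong sentence,
    and sentence vs. a wrong image."""
    l_sent = max(0.0, margin - sim_pos + sim_neg_sent)
    l_img = max(0.0, margin - sim_pos + sim_neg_img)
    return l_sent + l_img

loss = bidirectional_ranking_loss(sim_pos=0.9, sim_neg_sent=0.3, sim_neg_img=0.5)
# Matched pair already exceeds both negatives by the margin -> zero loss.
```

When the matched similarity fails to clear a negative by the margin, the violation contributes linearly, so gradient descent pulls matched pairs together and pushes mismatched pairs apart.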
Step 5.3, define the reward from the model's predicted sentence: it combines the CIDEr score of the prediction with the cosine similarity between the input image I and the prediction;
Step 5.4, update the network parameters using the policy gradient from reinforcement learning; according to the REINFORCE algorithm, the gradient of the loss function L_RL(θ) with respect to the parameters θ is ∇_θ L_RL(θ) = -E[R(S^s) ∇_θ log p_θ(S^s)], where S^s is a sentence sampled from the model's distribution p_θ;
to reduce the variance of this gradient estimate, a baseline function b is introduced, giving ∇_θ L_RL(θ) = -E[(R(S^s) - b) ∇_θ log p_θ(S^s)] for a sampled sentence S^s;
step 5.5, let b ═ R (S)*I), then the gradient calculation formula is:
and S is a descriptive sentence corresponding to the image I, and is equivalent to the true value of S.
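Steps 5.4–5.5 can be sketched as a REINFORCE update with a baseline. The per-sample log-probability gradients and rewards below are random stand-ins, and the mean reward plays the role of the baseline R(S*, I):

```python
import numpy as np

def reinforce_gradient(log_prob_grads, rewards, baseline):
    """Policy-gradient estimate: grad L = -E[(R - b) * grad log p(S^s)].
    `log_prob_grads` holds d(log p)/d(theta) for each sampled sentence,
    `rewards` the corresponding reward R, `baseline` the value b."""
    advantage = rewards - baseline                         # (R - b)
    return -np.mean(advantage[:, None] * log_prob_grads, axis=0)

rng = np.random.default_rng(4)
n_samples, n_params = 6, 3
log_prob_grads = rng.normal(size=(n_samples, n_params))    # stand-in gradients
rewards = rng.uniform(size=n_samples)      # e.g. CIDEr + similarity reward
baseline = rewards.mean()                  # stand-in for b = R(S*, I)
grad = reinforce_gradient(log_prob_grads, rewards, baseline)
```

Subtracting the baseline leaves the gradient's expectation unchanged while shrinking its variance: samples that beat the ground-truth reward are reinforced, worse ones suppressed.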
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910828522.3A CN110674850A (en) | 2019-09-03 | 2019-09-03 | Image description generation method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110674850A true CN110674850A (en) | 2020-01-10 |
Family
ID=69076245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910828522.3A Pending CN110674850A (en) | 2019-09-03 | 2019-09-03 | Image description generation method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110674850A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325323A (en) * | 2020-02-19 | 2020-06-23 | 山东大学 | Power transmission and transformation scene description automatic generation method fusing global information and local information |
CN111414962A (en) * | 2020-03-19 | 2020-07-14 | 创新奇智(重庆)科技有限公司 | Image classification method introducing object relationship |
CN111612103A (en) * | 2020-06-23 | 2020-09-01 | 中国人民解放军国防科技大学 | Image description generation method, system and medium combined with abstract semantic representation |
CN111753825A (en) * | 2020-03-27 | 2020-10-09 | 北京京东尚科信息技术有限公司 | Image description generation method, device, system, medium and electronic equipment |
CN111783852A (en) * | 2020-06-16 | 2020-10-16 | 北京工业大学 | Self-adaptive image description generation method based on deep reinforcement learning |
CN111814946A (en) * | 2020-03-17 | 2020-10-23 | 同济大学 | Image description automatic generation method based on multi-body evolution |
CN111916050A (en) * | 2020-08-03 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112037239A (en) * | 2020-08-28 | 2020-12-04 | 大连理工大学 | Text guidance image segmentation method based on multi-level explicit relation selection |
CN112069841A (en) * | 2020-07-24 | 2020-12-11 | 华南理工大学 | Novel X-ray contraband parcel tracking method and device |
CN112200268A (en) * | 2020-11-04 | 2021-01-08 | 福州大学 | Image description method based on encoder-decoder framework |
CN112256904A (en) * | 2020-09-21 | 2021-01-22 | 天津大学 | Image retrieval method based on visual description sentences |
CN112528989A (en) * | 2020-12-01 | 2021-03-19 | 重庆邮电大学 | Description generation method for semantic fine granularity of image |
CN113378919A (en) * | 2021-06-09 | 2021-09-10 | 重庆师范大学 | Image description generation method for fusing visual sense and enhancing multilayer global features |
CN113408430A (en) * | 2021-06-22 | 2021-09-17 | 哈尔滨理工大学 | Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework |
CN113469143A (en) * | 2021-08-16 | 2021-10-01 | 西南科技大学 | Finger vein image identification method based on neural network learning |
CN113837230A (en) * | 2021-08-30 | 2021-12-24 | 厦门大学 | Image description generation method based on adaptive attention mechanism |
CN114693790A (en) * | 2022-04-02 | 2022-07-01 | 江西财经大学 | Automatic image description method and system based on mixed attention mechanism |
CN114882488A (en) * | 2022-05-18 | 2022-08-09 | 北京理工大学 | Multi-source remote sensing image information processing method based on deep learning and attention mechanism |
CN116580283A (en) * | 2023-07-13 | 2023-08-11 | 平安银行股份有限公司 | Image prompt word generation method and device, electronic equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107918782A (en) * | 2016-12-29 | 2018-04-17 | 中国科学院计算技术研究所 | A kind of method and system for the natural language for generating description picture material |
CN108520273A (en) * | 2018-03-26 | 2018-09-11 | 天津大学 | A kind of quick detection recognition method of dense small item based on target detection |
CN109146786A (en) * | 2018-08-07 | 2019-01-04 | 北京市商汤科技开发有限公司 | Scene chart generation method and device, electronic equipment and storage medium |
CN109726696A (en) * | 2019-01-03 | 2019-05-07 | 电子科技大学 | System and method is generated based on the iamge description for weighing attention mechanism |
CN110111399A (en) * | 2019-04-24 | 2019-08-09 | 上海理工大学 | A kind of image text generation method of view-based access control model attention |
CN110188779A (en) * | 2019-06-03 | 2019-08-30 | 中国矿业大学 | A kind of generation method of image, semantic description |
Non-Patent Citations (2)
Title |
---|
PREKSHA NEMA et al.: "Generating Descriptions from Structured Data Using a Bifocal Attention Mechanism and Gated Orthogonalization", 《ARXIV》 * |
JIN Huazhong et al.: "An image description generation model combining global and local features", 《Journal of Applied Sciences (应用科学学报)》 * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111325323B (en) * | 2020-02-19 | 2023-07-14 | 山东大学 | Automatic power transmission and transformation scene description generation method integrating global information and local information |
CN111325323A (en) * | 2020-02-19 | 2020-06-23 | 山东大学 | Power transmission and transformation scene description automatic generation method fusing global information and local information |
CN111814946A (en) * | 2020-03-17 | 2020-10-23 | 同济大学 | Image description automatic generation method based on multi-body evolution |
CN111814946B (en) * | 2020-03-17 | 2022-11-15 | 同济大学 | Multi-body evolution-based automatic image description generation method |
CN111414962A (en) * | 2020-03-19 | 2020-07-14 | 创新奇智(重庆)科技有限公司 | Image classification method introducing object relationship |
WO2021190257A1 (en) * | 2020-03-27 | 2021-09-30 | 北京京东尚科信息技术有限公司 | Image description generation method, apparatus and system, and medium and electronic device |
CN111753825A (en) * | 2020-03-27 | 2020-10-09 | 北京京东尚科信息技术有限公司 | Image description generation method, device, system, medium and electronic equipment |
CN111783852B (en) * | 2020-06-16 | 2024-03-12 | 北京工业大学 | Method for adaptively generating image description based on deep reinforcement learning |
CN111783852A (en) * | 2020-06-16 | 2020-10-16 | 北京工业大学 | Self-adaptive image description generation method based on deep reinforcement learning |
CN111612103A (en) * | 2020-06-23 | 2020-09-01 | 中国人民解放军国防科技大学 | Image description generation method, system and medium combined with abstract semantic representation |
CN111612103B (en) * | 2020-06-23 | 2023-07-11 | 中国人民解放军国防科技大学 | Image description generation method, system and medium combined with abstract semantic representation |
CN112069841A (en) * | 2020-07-24 | 2020-12-11 | 华南理工大学 | Novel X-ray contraband parcel tracking method and device |
CN111916050A (en) * | 2020-08-03 | 2020-11-10 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112037239A (en) * | 2020-08-28 | 2020-12-04 | 大连理工大学 | Text guidance image segmentation method based on multi-level explicit relation selection |
CN112256904A (en) * | 2020-09-21 | 2021-01-22 | 天津大学 | Image retrieval method based on visual description sentences |
CN112200268A (en) * | 2020-11-04 | 2021-01-08 | 福州大学 | Image description method based on encoder-decoder framework |
CN112528989A (en) * | 2020-12-01 | 2021-03-19 | 重庆邮电大学 | Description generation method for semantic fine granularity of image |
CN112528989B (en) * | 2020-12-01 | 2022-10-18 | 重庆邮电大学 | Description generation method for semantic fine granularity of image |
CN113378919B (en) * | 2021-06-09 | 2022-06-14 | 重庆师范大学 | Image description generation method for fusing visual sense and enhancing multilayer global features |
CN113378919A (en) * | 2021-06-09 | 2021-09-10 | 重庆师范大学 | Image description generation method for fusing visual sense and enhancing multilayer global features |
CN113408430A (en) * | 2021-06-22 | 2021-09-17 | 哈尔滨理工大学 | Image Chinese description system and method based on multistage strategy and deep reinforcement learning framework |
CN113469143A (en) * | 2021-08-16 | 2021-10-01 | 西南科技大学 | Finger vein image identification method based on neural network learning |
CN113837230A (en) * | 2021-08-30 | 2021-12-24 | 厦门大学 | Image description generation method based on adaptive attention mechanism |
CN114693790A (en) * | 2022-04-02 | 2022-07-01 | 江西财经大学 | Automatic image description method and system based on mixed attention mechanism |
CN114693790B (en) * | 2022-04-02 | 2022-11-18 | 江西财经大学 | Automatic image description method and system based on mixed attention mechanism |
CN114882488A (en) * | 2022-05-18 | 2022-08-09 | 北京理工大学 | Multi-source remote sensing image information processing method based on deep learning and attention mechanism |
CN114882488B (en) * | 2022-05-18 | 2024-06-28 | 北京理工大学 | Multisource remote sensing image information processing method based on deep learning and attention mechanism |
CN116580283A (en) * | 2023-07-13 | 2023-08-11 | 平安银行股份有限公司 | Image prompt word generation method and device, electronic equipment and storage medium |
CN116580283B (en) * | 2023-07-13 | 2023-09-26 | 平安银行股份有限公司 | Image prompt word generation method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110674850A (en) | Image description generation method based on attention mechanism | |
CN110807154B (en) | Recommendation method and system based on hybrid deep learning model | |
CN112784092B (en) | Cross-modal image text retrieval method of hybrid fusion model | |
CN109299396B (en) | Convolutional neural network collaborative filtering recommendation method and system fusing attention model | |
CN108363753B (en) | Comment text emotion classification model training and emotion classification method, device and equipment | |
CN107273438B (en) | Recommendation method, device, equipment and storage medium | |
CN109389151B (en) | Knowledge graph processing method and device based on semi-supervised embedded representation model | |
CN110175628A (en) | A kind of compression algorithm based on automatic search with the neural networks pruning of knowledge distillation | |
CN112733027B (en) | Hybrid recommendation method based on local and global representation model joint learning | |
CN112800344B (en) | Deep neural network-based movie recommendation method | |
CN111753044A (en) | Regularization-based language model for removing social bias and application | |
CN112597302B (en) | False comment detection method based on multi-dimensional comment representation | |
CN115269847A (en) | Knowledge-enhanced syntactic heteromorphic graph-based aspect-level emotion classification method | |
CN112256866A (en) | Text fine-grained emotion analysis method based on deep learning | |
CN112529071B (en) | Text classification method, system, computer equipment and storage medium | |
CN116610778A (en) | Bidirectional image-text matching method based on cross-modal global and local attention mechanism | |
CN113326384A (en) | Construction method of interpretable recommendation model based on knowledge graph | |
CN112100439B (en) | Recommendation method based on dependency embedding and neural attention network | |
CN114332519A (en) | Image description generation method based on external triple and abstract relation | |
CN114372475A (en) | Network public opinion emotion analysis method and system based on RoBERTA model | |
CN114036298B (en) | Node classification method based on graph convolution neural network and word vector | |
CN110874392B (en) | Text network information fusion embedding method based on depth bidirectional attention mechanism | |
CN116881689A (en) | Knowledge-enhanced user multi-mode online comment quality evaluation method and system | |
CN117216381A (en) | Event prediction method, event prediction device, computer device, storage medium, and program product | |
CN115422369B (en) | Knowledge graph completion method and device based on improved TextRank |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||