CN110288029B - Tri-LSTMs model-based image description method - Google Patents
- Publication number
- CN110288029B CN201910565977.0A CN201910565977A
- Authority
- CN
- China
- Prior art keywords
- lstm
- image
- network
- convolutional neural
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image description method based on a Tri-LSTMs model, which comprises the following steps: generating a training set and mapping word vectors; building and training an RPN (region proposal network) convolutional neural network and a Faster-RCNN convolutional neural network; extracting fully connected layer features of the image; constructing and training the Tri-LSTMs model; and generating the image description. The invention combines several long short-term memory (LSTM) networks and simultaneously uses the fully connected layer features of the image and the 300-dimensional GLOVE word vectors of the words, thereby effectively improving the diversity of the generated captions and producing more accurate image descriptions.
Description
Technical Field
The invention belongs to the technical field of image processing, and further relates to an image description method based on a Tri-LSTMs model within the field of image description. The invention can be used to generate accurate and diverse sentences that describe the content of a given image. Tri-LSTMs denotes a model consisting of a semantic LSTM module, a visual LSTM module and a language LSTM module.
Background
Image description is the task of generating a sentence that describes the content of a given image. The generated sentence should not only be fluent, but should also accurately describe the objects in the image together with their attributes, positions and mutual relationships. A generated image description can be used to retrieve images that match the description, which facilitates image retrieval. In addition, a generated image description can be converted into braille, helping blind users understand the image content.
Shenzhen University, in its patent "A bag-of-words model-based image description method and system" (patent application No. 201410491596X, grant publication No. CN 104299010B), proposes an image description method based on a bag-of-words model. The patent mainly addresses the information loss and low accuracy of traditional methods. Its implementation steps are: (1) extracting feature points from the image to be described; (2) computing the set of distances between the feature points and the visual words in a codebook, and obtaining a membership set between the feature points and the visual words from the distance set through a Gaussian membership function; (3) using the membership set to count the membership degree of the visual words describing each feature point, forming a histogram vector that describes the image to be described. Although this improves on traditional image description techniques and yields higher description accuracy, the method still has drawbacks: feature points must be extracted manually, different extraction methods strongly affect the result, the extraction process is cumbersome, and the finally generated image descriptions lack diversity.
Tianjin University, in its patent "A generation method from structured text to image description" (patent application No. 2016108541692, grant publication No. CN 106503055B), proposes a method for generating image descriptions from structured text. The patent mainly addresses the low accuracy and insufficient diversity of image descriptions generated by prior techniques. Its implementation steps are: (1) downloading pictures from the Internet to form a picture training set; (2) performing lexical analysis on the descriptions corresponding to the training images to construct structured text; (3) extracting convolutional neural network features of the training images with an existing neural network model and constructing a multi-task recognition model with <image features, structured text> as input; (4) taking the structured text and the corresponding descriptions extracted from the training set as input to a recurrent neural network and training to obtain the parameters of the recurrent neural network model; (5) inputting the convolutional neural network features of an image to be described and obtaining a predicted structured text through the multi-task recognition model; (6) inputting the predicted structured text and obtaining the image description through the recurrent neural network model. Although this patent improves the diversity of the generated image descriptions, it still has the drawback that only image features are used; no other effective information guides the decoding process, which limits the accuracy of the finally generated image descriptions.
Oriol Vinyals et al., in their paper "Show and Tell: A Neural Image Caption Generator" (CVPR 2015), propose an image description method based on an encoder-decoder model. The method first uses a convolutional neural network (CNN) to extract image features and then feeds these features into a long short-term memory (LSTM) network to generate the description corresponding to the image. It was the first to address the image description problem with an encoder-decoder structure, but its model structure is overly simple and the generated image descriptions are inaccurate.
Kelvin Xu et al., in their paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" (ICML 2015), propose an image description method that combines a long short-term memory (LSTM) network with an attention mechanism. During decoding, the method assigns different weights to different positions of the image, thereby paying different amounts of attention to objects at different locations. It generates more accurate image descriptions and demonstrates the effectiveness of combining an LSTM with an attention mechanism. However, a single-layer LSTM simultaneously carries multiple responsibilities, such as sentence generation and image weight assignment, and this confusion of responsibilities means the generated image descriptions are still not accurate enough.
The article "Image capturing with Semantic attribute" (cvpr 2016 conference article) published by the equal to quanzing young et al proposes an Image description method combining Semantic attributes, image features and Attention mechanism at the same time. The method comprises the steps of firstly selecting 1000 words with the highest occurrence frequency in a vocabulary library as semantic attributes, and then introducing the weighted semantic attributes into an input layer and an output layer of a decoder. The method proves the effectiveness of combining semantic attributes, image features and attention mechanism at the same time. However, the method still has the disadvantages that the difference between corresponding descriptions of different images is too small, and the generated description is rigid and templated.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an image description method based on a Tri-LSTMs model. The invention can effectively improve the accuracy and diversity of image description.
The technical idea of the invention is as follows: first, build and train an RPN convolutional neural network model and a Faster-RCNN network model; then build and train the Tri-LSTMs model; finally, extract image regions with the pre-trained Faster-RCNN network model, input them into the Tri-LSTMs model, and generate an image description for the image.
The specific steps for realizing the purpose of the invention are as follows:
(1) Generating a training set and mapping word vectors:
(1a) Selecting at least 80000 samples from an image data set with image descriptions to form a training set, wherein each selected sample is an image-description pair, and each image-description pair comprises one image and five corresponding image descriptions;
(1b) The image description of each sample in the training set consists of several English words; counting the frequency of occurrence of these words over all image descriptions of all samples, sorting the words in descending order of frequency, selecting the first 1000 words, mapping each selected word to its corresponding 300-dimensional GLOVE word vector, and storing the vectors in a computer;
(2) Constructing an RPN convolutional neural network model and a Faster-RCNN network model:
(2a) Constructing an RPN convolutional neural network model consisting of eight convolutional layers and a Softmax layer, and setting parameters of each layer;
(2b) Building a Faster-RCNN network model consisting of five convolutional layers, one ROIpooling layer, four fully connected layers and one Softmax layer, and setting parameters of each layer;
(3) Training the RPN convolutional neural network and the Faster-RCNN convolutional neural network:
performing alternating training on the RPN convolutional neural network and the Faster-RCNN convolutional neural network by adopting an alternating training method, to obtain a trained RPN convolutional neural network and a trained Faster-RCNN convolutional neural network;
(4) Extracting the full-connection layer characteristics of each sample image in the training set:
(4a) Sequentially inputting each sample image in the training set into a trained RPN convolutional neural network, and outputting the positions of all target rough selection frames in each sample image and the types of targets in the frames;
(4b) Respectively inputting the image area in each target rough selection frame into a resnet101 network trained on an ImageNet database, and storing all full-connection layer characteristics output by the last full-connection layer of the network into a computer;
(5) Constructing a Tri-LSTMs model:
(5a) Sequentially forming a long short-term memory network LSTM and an attention network into a semantic LSTM module, wherein the long short-term memory network LSTM comprises 1024 neurons;
(5b) Sequentially forming a long short-term memory network LSTM and an attention network into a visual LSTM module, wherein the long short-term memory network LSTM comprises 1024 neurons;
(5c) Sequentially forming a language LSTM module by a long-short term memory network LSTM and a full connection layer, wherein the long-short term memory network LSTM comprises 1024 neurons, and the number of the neurons of the full connection layer is set as the total number of words contained in all image descriptions in a training set;
(5d) Sequentially combining a semantic LSTM module, a visual LSTM module and a language LSTM module into a Tri-LSTMs model;
(6) Training Tri-LSTMs model:
(6a) At different moments, taking words at different positions in the image description of the training sample as input, and training a Tri-LSTMs model from zero moment;
(6b) Reading all full-connection layer characteristics output by the last full-connection layer of the resnet101 network stored in the computer in the step (4 b), and taking the average value of all full-connection layer characteristics as a characteristic vector;
(6c) Adding the feature vector and the word vector mapped by the word at the current moment in the image description, inputting the added feature vector into a long-short term memory network (LSTM) in a semantic LSTM module, and outputting a hidden state by forward conduction of the long-short term memory network (LSTM);
(6d) Reading 1000 300-dimensional GLOVE word vectors stored in the computer in the step (1), inputting the GLOVE word vectors into an attention network of a semantic LSTM module, and outputting the weighted GLOVE word vectors after the attention network conducts forwards;
(6e) Adding the hidden state of the semantic LSTM module at the current moment with the output of the attention network in the semantic LSTM module, and taking the obtained sum vector as the output of the semantic LSTM module;
(6f) Inputting the sum vector output by the semantic LSTM module into a long-short term memory network LSTM in the visual LSTM module, and conducting and outputting a hidden state in the forward direction of the long-short term memory network LSTM;
(6g) Reading all full-connection layer characteristics output by the last full-connection layer of the resnet101 network stored in the computer in the step (4 b), inputting the characteristics into an attention network of the visual LSTM module, conducting the attention network forwards, and outputting weighted full-connection layer characteristic vectors;
(6h) Adding the hidden state of the visual LSTM module at the current moment to the output of the attention network in the visual LSTM module, and taking the obtained sum vector as the output of the visual LSTM module;
(6i) Inputting the sum vector output by the visual LSTM module into a long short-term memory network LSTM in the language LSTM module, outputting a hidden state by forward conduction of the long short-term memory network LSTM, inputting the hidden state into a full-connection layer, and outputting a probability vector of the word at the next moment;
(6j) Judging whether a word exists in the image description at the next moment, if so, calculating the cross entropy loss between the word probability vector and the word vector at the next moment of the image description, and then executing the step (6 b), otherwise, executing the step (6 k);
(6k) Adding the cross entropy losses at all times to obtain a total loss, optimizing all parameters in the model by using a BP algorithm to minimize the total loss, and stopping training when the total loss is converged to obtain a trained Tri-LSTMs model;
(7) Generating an image description:
(7a) Inputting a natural image into the pre-trained Faster-RCNN, and outputting a target rough selection frame;
(7b) Inputting the image area in the target rough selection frame into a trained resnet101 network, and outputting the image characteristics of the full connection layer;
(7c) And inputting the image characteristics of the full connection layer into a Tri-LSTMs model to generate image description.
Compared with the prior art, the invention has the following advantages:
First, the Tri-LSTMs model constructed by the invention combines three long short-term memory networks (LSTM). This overcomes the defect of the prior art in which a single LSTM with an overly simple model structure generates the image description and therefore cannot generate sufficiently accurate descriptions. The invention can thus combine several LSTMs, effectively improving the accuracy of the image description, and offers strong generalization capability.
Second, the invention uses both the fully connected layer features of the image and the 300-dimensional GLOVE word vectors of the words as input to the Tri-LSTMs model. This overcomes the problem of the prior art in which only the fully connected layer features of the image are used as model input, so that the effective information available is too limited and the generated image descriptions lack diversity. The image descriptions generated by the invention are therefore more diverse.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a structural diagram of the Tri-LSTMs model constructed in the present invention.
FIG. 3 is a simulation diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The steps implemented by the present invention are further described with reference to fig. 1.
Step 1, generating a training set and mapping word vectors.
At least 80000 samples are selected from an image data set with image descriptions to form a training set; each selected sample is an image-description pair, and each image-description pair comprises one image and five corresponding image descriptions. An image description refers to the attributes and positions of the objects in the image and the relationships between them.
The image description of each sample in the training set consists of several English words. The frequency of occurrence of these words over all image descriptions of all samples is counted, the words are sorted in descending order of frequency, the first 1000 words are selected, each selected word is mapped to its corresponding 300-dimensional GLOVE word vector, and the vectors are stored in a computer.
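The vocabulary construction and word-vector mapping of step 1 can be sketched as follows; the GloVe file name, the zero-vector fall-back for out-of-vocabulary words and the caption format are illustrative assumptions, not details fixed by the method above.

```python
# Sketch of step 1: count word frequencies over all captions, keep the 1000
# most frequent words, and map each to its 300-dimensional GloVe vector.
import collections
import numpy as np

def build_semantic_vocab(captions, glove_path="glove.42B.300d.txt", top_k=1000):
    counts = collections.Counter()
    for caption in captions:                      # five captions per training image
        counts.update(caption.lower().split())
    vocab = [w for w, _ in counts.most_common(top_k)]   # descending frequency

    glove = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    # Words missing from the GloVe file fall back to a zero vector (assumption).
    vectors = np.stack([glove.get(w, np.zeros(300, np.float32)) for w in vocab])
    return vocab, vectors                         # vectors has shape (1000, 300)
```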
Step 2, building an RPN convolutional neural network model and a Faster-RCNN network model.
An RPN convolutional neural network model consisting of eight convolutional layers and a Softmax layer is constructed, and the parameters of each layer are set; the convolution kernels of all layers are of size 3×3.
A Faster-RCNN network model consisting of five convolutional layers, one ROIpooling layer, four fully connected layers and one Softmax layer is constructed, and the parameters of each layer are set; the convolution kernels of each layer are of size 3×3.
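The structure described in this step can be sketched with tf.keras as below. The patent only fixes the layer counts and the 3×3 kernels; the channel widths, the number of anchors, the omission of the box-regression branch and the use of the modern tf.keras API (rather than the TensorFlow 1.2 platform of the simulation) are assumptions.

```python
# Structural sketch of the RPN head of step 2: eight 3x3 convolutional layers
# followed by a Softmax layer scoring object / background for each anchor.
import tensorflow as tf

def build_rpn_head(num_anchors=9, in_channels=512):
    feature_map = tf.keras.Input(shape=(None, None, in_channels))
    x = feature_map
    for _ in range(7):                                     # first seven 3x3 conv layers
        x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    # Eighth 3x3 conv layer produces object/background logits per anchor.
    logits = tf.keras.layers.Conv2D(2 * num_anchors, 3, padding="same")(x)
    # Softmax layer over the two classes of each anchor position.
    scores = tf.keras.layers.Softmax()(tf.keras.layers.Reshape((-1, 2))(logits))
    return tf.keras.Model(feature_map, scores)
```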
Step 3, training the RPN convolutional neural network and the Faster-RCNN convolutional neural network.
The RPN convolutional neural network and the Faster-RCNN convolutional neural network are trained alternately using an alternating training method, yielding a trained RPN convolutional neural network and a trained Faster-RCNN convolutional neural network.
The alternating training method comprises the following steps; a code outline summarizing them is given after step 8:
step 1, selecting a random value for each parameter of the RPN convolutional neural network, and carrying out random initialization.
Step 2, inputting the training sample image into the initialized RPN convolutional neural network, training the network by using the back propagation BP algorithm, and adjusting the parameters of the RPN convolutional neural network until all the parameters converge, obtaining the initially trained RPN convolutional neural network.
Step 3, inputting the training sample image into the trained RPN convolutional neural network, and outputting the target rough selection frames on the training sample image.
Step 4, selecting a random value for each parameter of the Faster-RCNN convolutional neural network, and performing random initialization.
Step 5, inputting the training sample image and the target rough selection frames obtained in step 3 into the initialized Faster-RCNN convolutional neural network, training the network by using the back propagation BP algorithm, and adjusting the parameters of the Faster-RCNN convolutional neural network until all the parameters converge, obtaining the initially trained Faster-RCNN convolutional neural network.
Step 6, fixing the parameters of the first five convolutional layers of the RPN convolutional neural network trained in step 2 and the parameters of the Faster-RCNN convolutional neural network trained in step 5, inputting the training sample image into the trained RPN convolutional neural network, and fine-tuning the unfixed parameters of the RPN convolutional neural network by using the back propagation BP algorithm until they converge, obtaining the finally trained RPN convolutional neural network model.
Step 7, inputting the training sample image into the RPN convolutional neural network finally trained in step 6, and obtaining the target rough selection frames on the sample image again.
Step 8, fixing the parameters of the first five convolutional layers of the Faster-RCNN convolutional neural network trained in step 5 and the parameters of the RPN convolutional neural network finally trained in step 6, inputting the training sample image and the target rough selection frames obtained again in step 7 into the Faster-RCNN convolutional neural network, and fine-tuning the unfixed parameters of the Faster-RCNN convolutional neural network by using the back propagation BP algorithm until they converge, obtaining the finally trained Faster-RCNN convolutional neural network.
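The eight steps above can be summarized by the following outline; every helper function called here (train_with_bp, propose_boxes, and so on) is a hypothetical placeholder standing for the operation described in the corresponding step, and only the ordering of the stages follows the text above.

```python
# High-level outline of the alternating training schedule of step 3.
def alternating_training(train_images):
    rpn = init_rpn_randomly()                     # step 1: random initialization
    train_with_bp(rpn, train_images)              # step 2: BP until convergence
    boxes = propose_boxes(rpn, train_images)      # step 3: coarse target boxes

    frcnn = init_faster_rcnn_randomly()           # step 4: random initialization
    train_with_bp(frcnn, train_images, boxes)     # step 5: BP until convergence

    fix_first_five_conv_layers(rpn, frcnn)        # step 6: freeze shared layers
    finetune_with_bp(rpn, train_images)           #         fine-tune the rest
    boxes = propose_boxes(rpn, train_images)      # step 7: regenerate boxes

    finetune_with_bp(frcnn, train_images, boxes)  # step 8: fine-tune Faster-RCNN
    return rpn, frcnn
```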
Step 4, extracting the fully connected layer features of the images.
The sample images in the training set are sequentially input into the trained RPN convolutional neural network, which outputs the positions of all target rough selection frames in each sample image and the categories of the targets in the frames.
The image region in each target rough selection frame is input into a resnet101 network trained on the ImageNet database, and all fully connected layer features output by the last fully connected layer of the network are stored in a computer.
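A minimal sketch of this feature-extraction step is given below. Using tf.keras.applications.ResNet101 with average pooling stands in for "the last fully connected layer of resnet101", and boxes given as pixel coordinates are an assumption.

```python
# Sketch of step 4: crop each target rough selection frame, resize it, and
# store one ImageNet-pretrained ResNet feature vector per region.
import numpy as np
import tensorflow as tf

resnet = tf.keras.applications.ResNet101(weights="imagenet",
                                         include_top=False, pooling="avg")

def region_features(image, boxes):
    """image: HxWx3 uint8 array; boxes: list of (x1, y1, x2, y2) in pixels."""
    feats = []
    for x1, y1, x2, y2 in boxes:
        crop = tf.image.resize(image[y1:y2, x1:x2].astype(np.float32), (224, 224))
        crop = tf.keras.applications.resnet.preprocess_input(crop)[tf.newaxis]
        feats.append(resnet(crop, training=False)[0].numpy())
    return np.stack(feats)        # shape: (number of regions, 2048)
```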
Step 5, constructing the Tri-LSTMs model.
A long short-term memory network LSTM and an attention network are sequentially combined into a semantic LSTM module, and the long short-term memory network LSTM comprises 1024 neurons.
A long-short term memory network LSTM and an attention network are sequentially combined into a visual LSTM module, and the long-short term memory network LSTM comprises 1024 neurons.
A long short-term memory network LSTM and a fully connected layer are sequentially combined into a language LSTM module; the LSTM comprises 1024 neurons, and the number of neurons of the fully connected layer is set to the total number of words contained in all image descriptions in the training set.
The Tri-LSTMs model is formed by the semantic LSTM module, the visual LSTM module and the language LSTM module in sequence as shown in figure 2.
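One decoding step of the model in figure 2 can be sketched as follows. The 1024-unit LSTMs, the attention over GLOVE word vectors and region features, and the final fully connected layer follow the text; the Dense projections that bring the 300-dimensional word vectors and 2048-dimensional region features to a common size are implementation assumptions, since the patent only specifies that the corresponding vectors are added.

```python
# Sketch of one time step of the Tri-LSTMs model: semantic LSTM + attention
# over GloVe word vectors, visual LSTM + attention over region features,
# language LSTM + fully connected layer producing next-word logits.
import tensorflow as tf

class Attention(tf.keras.layers.Layer):
    """a_i = w^T tanh(W_v v_i + W_h h); output = sum_i softmax(a)_i * v_i."""
    def __init__(self, units=512):
        super().__init__()
        self.w_v = tf.keras.layers.Dense(units)
        self.w_h = tf.keras.layers.Dense(units)
        self.w_a = tf.keras.layers.Dense(1)

    def call(self, values, hidden):
        scores = self.w_a(tf.tanh(self.w_v(values) + self.w_h(hidden)[:, None, :]))
        weights = tf.nn.softmax(scores, axis=1)           # (batch, K, 1)
        return tf.reduce_sum(weights * values, axis=1)    # weighted sum of values

class TriLSTMsStep(tf.keras.Model):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed_word = tf.keras.layers.Dense(1024)     # 300-d GloVe -> 1024-d
        self.embed_feat = tf.keras.layers.Dense(1024)     # 2048-d feature -> 1024-d
        self.sem_lstm = tf.keras.layers.LSTMCell(1024)
        self.vis_lstm = tf.keras.layers.LSTMCell(1024)
        self.lang_lstm = tf.keras.layers.LSTMCell(1024)
        self.word_att, self.region_att = Attention(), Attention()
        self.proj_word_att = tf.keras.layers.Dense(1024)
        self.proj_region_att = tf.keras.layers.Dense(1024)
        self.fc = tf.keras.layers.Dense(vocab_size)       # next-word logits

    def call(self, word_vec, mean_feat, glove_vecs, region_feats, states):
        sem_s, vis_s, lang_s = states
        # Semantic module: (image feature + word vector) -> LSTM, then add the
        # attention-weighted GloVe word vectors.
        h_sem, sem_s = self.sem_lstm(self.embed_word(word_vec) +
                                     self.embed_feat(mean_feat), sem_s)
        sem_out = h_sem + self.proj_word_att(self.word_att(glove_vecs, h_sem))
        # Visual module: semantic output -> LSTM, then add weighted region features.
        h_vis, vis_s = self.vis_lstm(sem_out, vis_s)
        vis_out = h_vis + self.proj_region_att(self.region_att(region_feats, h_vis))
        # Language module: LSTM -> fully connected layer -> next-word logits.
        h_lang, lang_s = self.lang_lstm(vis_out, lang_s)
        return self.fc(h_lang), (sem_s, vis_s, lang_s)
```

During training this step is unrolled over the caption, feeding the ground-truth word of the current moment; during testing the previously predicted word is fed back in, as sketched in step 7 below.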
Step 6, training the Tri-LSTMs model.
Step 1, at different moments, taking the words at the corresponding positions in the image description of the training sample as input, and training the Tri-LSTMs model starting from time zero.
Step 2, reading all the fully connected layer features output by the last fully connected layer of the resnet101 network stored in the computer in step 4, and taking the average of all the fully connected layer features as the feature vector.
Step 3, adding the feature vector to the word vector of the word at the current moment in the image description, inputting the sum into the long short-term memory network LSTM in the semantic LSTM module, and conducting the LSTM forward to output a hidden state.
The forward conduction of the long short-term memory network LSTM is realized according to the following formulas:

i_t = sigmoid(W_{ix} x_t + W_{ih} h_{t-1})
f_t = sigmoid(W_{fx} x_t + W_{fh} h_{t-1})
o_t = sigmoid(W_{ox} x_t + W_{oh} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{cx} x_t + W_{ch} h_{t-1})
h_t = o_t ⊙ tanh(c_t)

where i_t denotes the input gate of the long short-term memory network LSTM at time t; sigmoid denotes the activation function sigmoid(x) = 1/(1 + e^{-x}), with e the natural constant; W_{ix} denotes the weight matrix of the input gate acting on the input; x_t denotes the input of the LSTM at time t; W_{ih} denotes the weight matrix of the input gate acting on the hidden state; h_{t-1} denotes the hidden state of the LSTM at time t-1; f_t denotes the forget gate of the LSTM at time t, with weight matrices W_{fx} and W_{fh}; o_t denotes the output gate of the LSTM at time t, with weight matrices W_{ox} and W_{oh}; c_t denotes the state cell of the LSTM at time t; ⊙ denotes the element-wise product; c_{t-1} denotes the state cell of the LSTM at time t-1; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}); W_{cx} and W_{ch} denote the weight matrices of the state cell; and h_t denotes the hidden state of the LSTM at time t.
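The five formulas above correspond line for line to the following NumPy sketch; the weight shapes are assumptions, and biases are omitted because the formulas are written without bias terms.

```python
# NumPy sketch of one LSTM forward step, mirroring the five formulas above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """W is a dict holding W_ix, W_ih, W_fx, W_fh, W_ox, W_oh, W_cx, W_ch."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)                       # input gate
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)                       # forget gate
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)                       # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev)  # state cell
    h_t = o_t * np.tanh(c_t)                                              # hidden state
    return h_t, c_t
```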
Step 4, reading the 1000 300-dimensional GLOVE word vectors stored in the computer in step 1, inputting them into the attention network of the semantic LSTM module, and conducting the attention network forward to output the weighted GLOVE word vector.
The forward conduction of the attention network is realized according to the following formulas:

a_{i,t} = tanh(W_s s_i + W_h h_t)
ŝ_t = Σ_{i=1}^{K} [exp(a_{i,t}) / Σ_{j=1}^{K} exp(a_{j,t})] s_i

where a_{i,t} denotes the attention weight of the i-th of the 1000 300-dimensional GLOVE word vectors at time t; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}), with e the natural constant; W_s denotes the weight matrix applied to the 300-dimensional GLOVE word vectors; s_i denotes the i-th of the 1000 input GLOVE word vectors; W_h denotes the weight matrix applied to the hidden state output by the long short-term memory network LSTM in the semantic LSTM module; h_t denotes that hidden state at time t; ŝ_t denotes the feature vector output by the attention network in the semantic LSTM module at time t; K denotes the total number of 300-dimensional GLOVE word vectors; Σ denotes summation; and i indexes the word vectors.
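A sketch of this attention step (and, with the fully connected layer features v_i in place of the word vectors s_i, of the visual attention in step 7 below) is given next. Treating W_s and W_h as row vectors that yield a scalar score per candidate, and the softmax normalization, are assumptions about the step not shown explicitly above.

```python
# NumPy sketch of the attention network: score each candidate vector against
# the current hidden state, normalize the scores, and return the weighted sum.
import numpy as np

def attend(candidates, h_t, W_s, W_h):
    """candidates: (K, d) GloVe word vectors (or fully connected layer features)."""
    scores = np.tanh(candidates @ W_s + h_t @ W_h)    # a_{i,t}, shape (K,)
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the K candidates
    return weights @ candidates                       # weighted sum, shape (d,)
```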
Step 5, adding the hidden state of the semantic LSTM module at the current moment to the output of the attention network in the semantic LSTM module, and taking the resulting sum vector as the output of the semantic LSTM module.
Step 6, inputting the sum vector output by the semantic LSTM module into the long short-term memory network LSTM in the visual LSTM module, and conducting the LSTM forward to output a hidden state.
The forward conduction of the long short-term memory network LSTM is realized according to the same formulas and symbol definitions as given in step 3 above.
Step 7, reading all the fully connected layer features output by the last fully connected layer of the resnet101 network stored in the computer in step 4, inputting them into the attention network of the visual LSTM module, and conducting the attention network forward to output the weighted fully connected layer feature vector.
The forward conduction of the attention network is realized according to the following formulas:

a_{i,t} = tanh(W_v v_i + W_h h_t)
v̂_t = Σ_{i=1}^{K} [exp(a_{i,t}) / Σ_{j=1}^{K} exp(a_{j,t})] v_i

where a_{i,t} denotes the attention weight of the i-th of all the fully connected layer features at time t; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}), with e the natural constant; W_v denotes the weight matrix applied to the fully connected layer features; v_i denotes the i-th of all the fully connected layer features; W_h denotes the weight matrix applied to the hidden state of the long short-term memory network LSTM in the visual LSTM module; h_t denotes the hidden state output by that LSTM at time t; v̂_t denotes the output of the attention network in the visual LSTM module at time t; K denotes the total number of fully connected layer feature vectors; Σ denotes summation; and i indexes the feature vectors.
Step 8, adding the hidden state of the visual LSTM module at the current moment to the output of the attention network in the visual LSTM module, and taking the resulting sum vector as the output of the visual LSTM module.
Step 9, inputting the sum vector output by the visual LSTM module into the long short-term memory network LSTM in the language LSTM module, conducting the LSTM forward to output a hidden state, inputting the hidden state into the fully connected layer, and outputting the probability vector of the word at the next moment.
The forward conduction of the long short-term memory network LSTM is realized according to the same formulas and symbol definitions as given in step 3 above.
Step 10, judging whether the image description contains a word at the next moment; if so, calculating the cross-entropy loss between the predicted word probability vector and the word vector at the next moment of the image description and then returning to step 2; otherwise, executing step 11.
The cross-entropy loss between the word probability vector and the word vector at the next moment of the image description is calculated according to the following formula:

loss = − Σ_{t=1}^{N} log P(s_t | I; θ)

where loss denotes the cross-entropy loss between the word probability vectors and the word vectors of the training-set image description; N denotes the total number of words in the image description of the training set; Σ denotes summation; t indexes the words in the image description of the training set; log denotes the logarithm with the natural constant e as base; P(s_t | I; θ) denotes the probability of the word s_t at time t output by inputting the average of all fully connected layer features of the training-set image into the Tri-LSTMs model; I denotes the average of all fully connected layer features of the training-set image; and θ denotes all the parameters of the Tri-LSTMs model.
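A sketch of this loss summed over one caption is shown below; the per-step softmax outputs of the language LSTM module and the ground-truth word indices are assumed to be given as arrays.

```python
# Sketch of the cross-entropy loss of steps 10 and 11 for one caption:
# loss = -sum_t log P(s_t | I; theta), summed over all time steps.
import numpy as np

def caption_loss(probs, target_ids, eps=1e-12):
    """probs: (N, vocab_size) predicted word distributions; target_ids: (N,) indices of s_t."""
    picked = probs[np.arange(len(target_ids)), target_ids]   # P(s_t | I; theta)
    return -np.sum(np.log(picked + eps))
```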
Step 11, adding the cross-entropy losses at all moments to obtain the total loss, optimizing all parameters of the model with the BP algorithm so as to minimize the total loss, and stopping training when the total loss converges, obtaining the trained Tri-LSTMs model.
Step 7, generating an image description.
A natural image is input into the pre-trained Faster-RCNN, which outputs the target rough selection frames.
The image region in each target rough selection frame is input into the trained resnet101 network, which outputs the fully connected layer image features.
The fully connected layer image features are input into the Tri-LSTMs model, which generates the image description.
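At test time the description can be generated step by step, feeding each predicted word back into the model. The sketch below assumes greedy selection of the most probable word, start/end tokens, and a step function with the signature of the TriLSTMsStep sketch in step 5; all of these handles (step_fn, word_vectors, start_id, end_id) are hypothetical, since the patent does not fix the decoding strategy.

```python
# Sketch of caption generation: run the Tri-LSTMs step repeatedly, feed the
# most probable word back in, and stop at an end token or a length cap.
import numpy as np

def generate_caption(step_fn, states, mean_feat, glove_vecs, region_feats,
                     word_vectors, vocab, start_id, end_id, max_len=20):
    word_id, words = start_id, []
    for _ in range(max_len):
        logits, states = step_fn(word_vectors[word_id], mean_feat,
                                 glove_vecs, region_feats, states)
        word_id = int(np.argmax(logits))          # greedy choice of the next word
        if word_id == end_id:
            break
        words.append(vocab[word_id])
    return " ".join(words)
```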
The effect of the present invention will be further described with reference to the simulation.
1. Simulation experiment conditions are as follows:
the hardware platform of the simulation experiment of the invention is as follows: an Intel (R) Core5 processor of a Daire computer, a main frequency of 3.20GHz and a memory of 64GB;
the software platform of the simulation experiment of the invention is as follows: python3.5, tensorflow1.2 platform.
The data set used in the simulation experiment of the invention is a COCO data set, the data set is obtained by Microsoft team and can be used for image description generation, the construction time of the data set is 2014 years, and the training set and the testing set of the data set respectively comprise 123287 and 40,775 images.
2. Simulation content and result analysis:
the simulation experiment of the invention adopts the invention and two prior arts (self-adaptive attention mechanism method, scst method), 123287 training set samples of COCO data set are respectively input into the respectively constructed model for training, 40,775 images of the test set are respectively input into the trained model, and three methods are generated to describe the images of each test set image.
The two prior-art methods adopted in the simulation experiment are:
The adaptive attention mechanism method refers to the image description generation method proposed by Jiasen Lu et al. in the paper "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning" (CVPR 2017), abbreviated as the adaptive attention mechanism method.
The scst method refers to the image description generation method proposed by Steven J. Rennie et al. in the paper "Self-Critical Sequence Training for Image Captioning" (CVPR 2017), abbreviated as the scst method.
In order to compare the quality of the image descriptions generated by the three methods, four evaluation metrics (BLEU-4, METEOR, ROUGE-L and CIDEr) are used to evaluate the descriptions generated by the three methods on the COCO test-set images. The results are listed in Table 1, where Net-1 denotes the image description method based on the Tri-LSTMs model, Net-2 denotes the adaptive attention mechanism method, and Net-3 denotes the scst method.
TABLE 1. Quantitative comparison of the results of the present invention and the two prior-art methods in the simulation experiment
As can be seen from Table 1, compared with the adaptive attention mechanism method and the scst method, the network of the present invention obtains higher scores on all evaluation metrics; it therefore performs better and can generate more accurate image descriptions.
In order to illustrate the effect of the present invention more intuitively, two images are randomly selected from the simulation results of the present invention on the COCO test set, as shown in fig. 3, where fig. 3(a) and fig. 3(b) each show a natural image from the COCO test set together with the image description generated for it.
As can be seen from the simulation diagram of fig. 3, the image descriptions generated by the present invention describe the content of the images accurately and specifically.
Claims (7)
1. An image description method based on a Tri-LSTMs model, characterized in that a Tri-LSTMs model composed of a semantic LSTM module, a visual LSTM module and a language LSTM module is built, and a sentence describing the image content is generated for any natural image, the method comprising the following steps:
(1) Generating a training set and mapping word vectors:
(1a) Selecting at least 80000 samples from an image data set with image descriptions to form a training set, wherein each selected sample is an image-description pair, and each image-description pair comprises one image and five corresponding image descriptions;
(1b) The image description of each sample in the training set consists of several English words; counting the frequency of occurrence of these words over all image descriptions of all samples, sorting the words in descending order of frequency, selecting the first 1000 words, mapping each selected word to its corresponding 300-dimensional GLOVE word vector, and storing the vectors in a computer;
(2) Constructing an RPN convolutional neural network model and a Faster-RCNN network model:
(2a) Constructing an RPN convolutional neural network model consisting of eight convolutional layers and a Softmax layer, and setting parameters of each layer;
(2b) Building a Faster-RCNN network model consisting of five convolutional layers, one ROIpooling layer, four fully connected layers and one Softmax layer, and setting parameters of each layer;
(3) Training the RPN convolutional neural network and the Faster-RCNN convolutional neural network:
performing alternating training on the RPN convolutional neural network and the Faster-RCNN convolutional neural network by adopting an alternating training method, to obtain a trained RPN convolutional neural network and a trained Faster-RCNN convolutional neural network;
(4) Extracting the full connection layer characteristics of each sample image in the training set:
(4a) Sequentially inputting each sample image in the training set into a trained RPN convolutional neural network, and outputting the positions of all target rough selection frames in each sample image and the types of targets in the frames;
(4b) Respectively inputting the image area in each target rough selection frame into a resnet101 network trained on an ImageNet database, and storing all full-connection layer characteristics output by the last full-connection layer of the network into a computer;
(5) Constructing a Tri-LSTMs model:
(5a) Sequentially forming a long short-term memory network LSTM and an attention network into a semantic LSTM module, wherein the long short-term memory network LSTM comprises 1024 neurons;
(5b) Sequentially forming a long short-term memory network LSTM and an attention network into a visual LSTM module, wherein the long short-term memory network LSTM comprises 1024 neurons;
(5c) Sequentially forming a language LSTM module by a long-short term memory network LSTM and a full connection layer, wherein the long-short term memory network LSTM comprises 1024 neurons, and the number of the neurons of the full connection layer is set as the total number of words contained in all image descriptions in a training set;
(5d) Sequentially combining a semantic LSTM module, a visual LSTM module and a language LSTM module into a Tri-LSTMs model;
(6) Training Tri-LSTMs model:
(6a) At different moments, taking words at different positions in the image description of the training sample as input, and training a Tri-LSTMs model from zero moment;
(6b) Reading all full-connection layer characteristics output by the last full-connection layer of the resnet101 network stored in the computer in the step (4 b), and taking the average value of all full-connection layer characteristics as a characteristic vector;
(6c) Adding the feature vector and the word vector mapped by the word at the current moment in the image description, inputting the added feature vector into a long-short term memory network LSTM in a semantic LSTM module, and outputting a hidden state in a forward conduction manner by the long-short term memory network LSTM;
(6d) Reading 1000 300-dimensional GLOVE word vectors stored in the computer in the step (1), inputting the GLOVE word vectors into an attention network of a semantic LSTM module, and outputting the weighted GLOVE word vectors after the attention network conducts forwards;
(6e) Adding the hidden state of the semantic LSTM module at the current moment with the output of the attention network in the semantic LSTM module, and taking the obtained sum vector as the output of the semantic LSTM module;
(6f) Inputting the sum vector output by the semantic LSTM module into a long-short term memory network LSTM in the visual LSTM module, and conducting and outputting a hidden state in the forward direction of the long-short term memory network LSTM;
(6g) Reading all full-connection layer characteristics output by the last full-connection layer of the resnet101 network stored in the computer in the step (4 b), inputting the characteristics into an attention network of the visual LSTM module, conducting the attention network forwards, and outputting weighted full-connection layer characteristic vectors;
(6h) Adding the hidden state of the visual LSTM module at the current moment to the output of the attention network in the visual LSTM module, and taking the obtained sum vector as the output of the visual LSTM module;
(6i) Inputting the sum vector output by the visual LSTM module into a long short-term memory network LSTM in the language LSTM module, outputting a hidden state by forward conduction of the long short-term memory network LSTM, inputting the hidden state into a full-connection layer, and outputting a probability vector of the word at the next moment;
(6j) Judging whether a word exists in the image description at the next moment, if so, calculating the cross entropy loss between the word probability vector and the word vector at the next moment of the image description, and then executing the step (6 b), otherwise, executing the step (6 k);
(6k) Adding the cross entropy losses at all times to obtain a total loss, optimizing all parameters in the model by using a BP algorithm to minimize the total loss, and stopping training when the total loss is converged to obtain a trained Tri-LSTMs model;
(7) Generating an image description:
(7a) Inputting a natural image into the pre-trained Faster-RCNN, and outputting a target rough selection frame;
(7b) Inputting the image area in the target rough selection frame into a trained resnet101 network, and outputting the image characteristics of the full connection layer;
(7c) And inputting the image characteristics of the full connection layer into a Tri-LSTMs model to generate image description.
2. The Tri-LSTMs model-based image description method of claim 1, wherein the image description in step (1 a) refers to the attributes, positions and relationships of objects in the image.
3. The image description method based on the Tri-LSTMs model of claim 1, wherein the step of the alternative training method in step (3) is as follows:
firstly, selecting a random value for each parameter of the RPN convolutional neural network, and carrying out random initialization;
secondly, inputting a training sample image into the initialized RPN convolutional neural network, training the network by using a back propagation BP algorithm, and adjusting parameters of the RPN convolutional neural network until all the parameters are converged to obtain the initially trained RPN convolutional neural network;
thirdly, inputting the training sample image into the trained RPN convolutional neural network, and outputting a target rough selection frame on the training sample image;
fourthly, selecting a random value for each parameter of the Faster-RCNN convolutional neural network, and carrying out random initialization;
fifthly, inputting the training sample image and the target rough selection box obtained in the third step into the initialized Faster-RCNN convolutional neural network, training the network by using a back propagation BP algorithm, and adjusting the parameters of the Faster-RCNN convolutional neural network until all the parameters are converged to obtain the initially trained Faster-RCNN convolutional neural network;
sixthly, fixing the parameters of the first five convolutional layers of the RPN convolutional neural network trained in the second step and the parameters of the Faster-RCNN convolutional neural network trained in the fifth step, inputting the training sample image into the trained RPN convolutional neural network, and finely adjusting the unfixed parameters of the RPN convolutional neural network by using a back propagation BP algorithm until the unfixed parameters are converged to obtain a finally trained RPN convolutional neural network model;
seventhly, inputting the training sample image into the RPN convolutional neural network finally trained in the sixth step, and obtaining a target rough selection frame on the sample image again;
and eighthly, fixing the parameters of the first five convolutional layers of the Faster-RCNN convolutional neural network trained in the fifth step and the parameters of the RPN convolutional neural network finally trained in the sixth step, inputting the training sample image and the target rough selection box obtained again in the seventh step into the Faster-RCNN convolutional neural network, and finely adjusting the unfixed parameters of the Faster-RCNN convolutional neural network by using a back propagation BP algorithm until the unfixed parameters are converged, obtaining the finally trained Faster-RCNN convolutional neural network.
4. The image description method based on the Tri-LSTMs model of claim 1, wherein the forward conduction of the long short-term memory network LSTM in step (6c), step (6f) and step (6i) is realized according to the following formulas:

i_t = sigmoid(W_{ix} x_t + W_{ih} h_{t-1})
f_t = sigmoid(W_{fx} x_t + W_{fh} h_{t-1})
o_t = sigmoid(W_{ox} x_t + W_{oh} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{cx} x_t + W_{ch} h_{t-1})
h_t = o_t ⊙ tanh(c_t)

where i_t denotes the input gate of the long short-term memory network LSTM at time t; sigmoid denotes the activation function sigmoid(x) = 1/(1 + e^{-x}), with e the natural constant; W_{ix} denotes the weight matrix of the input gate acting on the input; x_t denotes the input of the LSTM at time t; W_{ih} denotes the weight matrix of the input gate acting on the hidden state; h_{t-1} denotes the hidden state of the LSTM at time t-1; f_t denotes the forget gate of the LSTM at time t, with weight matrices W_{fx} and W_{fh}; o_t denotes the output gate of the LSTM at time t, with weight matrices W_{ox} and W_{oh}; c_t denotes the state cell of the LSTM at time t; ⊙ denotes the element-wise product; c_{t-1} denotes the state cell of the LSTM at time t-1; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}); W_{cx} and W_{ch} denote the weight matrices of the state cell; and h_t denotes the hidden state of the LSTM at time t.
5. The Tri-LSTMs model-based image description method of claim 1, wherein the forward conduction of the attention network in step (6d) is realized according to the following formulas:

a_{i,t} = tanh(W_s s_i + W_h h_t)
ŝ_t = Σ_{i=1}^{K} [exp(a_{i,t}) / Σ_{j=1}^{K} exp(a_{j,t})] s_i

where a_{i,t} denotes the attention weight of the i-th of the 1000 300-dimensional GLOVE word vectors at time t; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}), with e the natural constant; W_s denotes the weight matrix applied to the 300-dimensional GLOVE word vectors; s_i denotes the i-th of the 1000 input GLOVE word vectors; W_h denotes the weight matrix applied to the hidden state output by the long short-term memory network LSTM in the semantic LSTM module; h_t denotes that hidden state at time t; ŝ_t denotes the feature vector output by the attention network in the semantic LSTM module at time t; K denotes the total number of 300-dimensional GLOVE word vectors; Σ denotes summation; and i indexes the word vectors.
6. The Tri-LSTMs model-based image description method of claim 1, wherein the forward conduction of the attention network in step (6g) is realized according to the following formulas:

a_{i,t} = tanh(W_v v_i + W_h h_t)
v̂_t = Σ_{i=1}^{K} [exp(a_{i,t}) / Σ_{j=1}^{K} exp(a_{j,t})] v_i

where a_{i,t} denotes the attention weight of the i-th of all the fully connected layer features at time t; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}), with e the natural constant; W_v denotes the weight matrix applied to the fully connected layer features; v_i denotes the i-th of all the fully connected layer features; W_h denotes the weight matrix applied to the hidden state of the long short-term memory network LSTM in the visual LSTM module; h_t denotes the hidden state output by that LSTM at time t; v̂_t denotes the output of the attention network in the visual LSTM module at time t; K denotes the total number of fully connected layer feature vectors; Σ denotes summation; and i indexes the feature vectors.
7. The Tri-LSTMs model-based image description method of claim 1, wherein the cross-entropy loss between the word probability vector in step (6j) and the word vector at the next moment of the image description is calculated according to the following formula:

loss = − Σ_{t=1}^{N} log P(s_t | I; θ)

where loss denotes the cross-entropy loss between the word probability vectors and the word vectors of the training-set image description; N denotes the total number of words in the image description of the training set; Σ denotes summation; t indexes the words in the image description of the training set; log denotes the logarithm with the natural constant e as base; P(s_t | I; θ) denotes the probability of the word s_t at time t output by inputting the average of all fully connected layer features of the training-set image into the Tri-LSTMs model; I denotes the average of all fully connected layer features of the training-set image; and θ denotes all the parameters of the Tri-LSTMs model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910565977.0A CN110288029B (en) | 2019-06-27 | 2019-06-27 | Tri-LSTMs model-based image description method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910565977.0A CN110288029B (en) | 2019-06-27 | 2019-06-27 | Tri-LSTMs model-based image description method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110288029A CN110288029A (en) | 2019-09-27 |
CN110288029B true CN110288029B (en) | 2022-12-06 |
Family
ID=68007639
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910565977.0A Active CN110288029B (en) | 2019-06-27 | 2019-06-27 | Tri-LSTMs model-based image description method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110288029B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112580658B (en) * | 2019-09-29 | 2024-03-12 | 中国移动通信集团辽宁有限公司 | Image semantic description method, device, computing equipment and computer storage medium |
CN110968725B (en) * | 2019-12-03 | 2023-04-28 | 咪咕动漫有限公司 | Image content description information generation method, electronic device and storage medium |
CN111144553B (en) * | 2019-12-28 | 2023-06-23 | 北京工业大学 | Image description method based on space-time memory attention |
CN111159454A (en) * | 2019-12-30 | 2020-05-15 | 浙江大学 | Picture description generation method and system based on Actor-Critic generation type countermeasure network |
CN111275780B (en) * | 2020-01-09 | 2023-10-17 | 北京搜狐新媒体信息技术有限公司 | Character image generation method and device |
CN111242059B (en) * | 2020-01-16 | 2022-03-15 | 合肥工业大学 | Method for generating unsupervised image description model based on recursive memory network |
CN113836985A (en) * | 2020-06-24 | 2021-12-24 | 富士通株式会社 | Image processing apparatus, image processing method, and computer-readable storage medium |
CN116543289B (en) * | 2023-05-10 | 2023-11-21 | 南通大学 | Image description method based on encoder-decoder and Bi-LSTM attention model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3040165A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial attention model for image captioning |
CN108875807A (en) * | 2018-05-31 | 2018-11-23 | 陕西师范大学 | A kind of Image Description Methods multiple dimensioned based on more attentions |
CN109711465A (en) * | 2018-12-26 | 2019-05-03 | 西安电子科技大学 | Image method for generating captions based on MLL and ASCA-FR |
-
2019
- 2019-06-27 CN CN201910565977.0A patent/CN110288029B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3040165A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial attention model for image captioning |
CN108875807A (en) * | 2018-05-31 | 2018-11-23 | 陕西师范大学 | A kind of Image Description Methods multiple dimensioned based on more attentions |
CN109711465A (en) * | 2018-12-26 | 2019-05-03 | 西安电子科技大学 | Image method for generating captions based on MLL and ASCA-FR |
Also Published As
Publication number | Publication date |
---|---|
CN110288029A (en) | 2019-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110288029B (en) | Tri-LSTMs model-based image description method | |
CN110083705B (en) | Multi-hop attention depth model, method, storage medium and terminal for target emotion classification | |
Aneja et al. | Convolutional image captioning | |
Zhang et al. | More is better: Precise and detailed image captioning using online positive recall and missing concepts mining | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
US11580975B2 (en) | Systems and methods for response selection in multi-party conversations with dynamic topic tracking | |
Yao et al. | Describing videos by exploiting temporal structure | |
CN111985369A (en) | Course field multi-modal document classification method based on cross-modal attention convolution neural network | |
Dong et al. | Word2visualvec: Image and video to sentence matching by visual feature prediction | |
CN108830287A (en) | The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method | |
CN111460132B (en) | Generation type conference abstract method based on graph convolution neural network | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN112232087B (en) | Specific aspect emotion analysis method of multi-granularity attention model based on Transformer | |
CN111598183B (en) | Multi-feature fusion image description method | |
CN110347831A (en) | Based on the sensibility classification method from attention mechanism | |
US20240177506A1 (en) | Method and Apparatus for Generating Captioning Device, and Method and Apparatus for Outputting Caption | |
CN111597341A (en) | Document level relation extraction method, device, equipment and storage medium | |
Chen et al. | Movie fill in the blank by joint learning from video and text with adaptive temporal attention | |
US20230385558A1 (en) | Text classifier for answer identification, background knowledge representation generator and training device therefor, and computer program | |
CN113297387B (en) | News detection method for image-text mismatching based on NKD-GNN | |
CN115422388B (en) | Visual dialogue method and system | |
CN110930469A (en) | Text image generation method and system based on transition space mapping | |
Xu et al. | Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network | |
Yang et al. | Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system | |
Liu et al. | Multimodal cross-guided attention networks for visual question answering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||