CN110288029B - Tri-LSTMs model-based image description method - Google Patents
- Publication number
- CN110288029B CN201910565977.0A CN201910565977A
- Authority
- CN
- China
- Prior art keywords
- lstm
- image
- network
- convolutional neural
- representing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image description method based on a Tri-LSTMs model, which comprises the following steps: generating a training set and mapping word vectors; building and training an RPN (region proposal network) convolutional neural network and a Faster-RCNN convolutional neural network; extracting fully connected layer features of the image; constructing and training the Tri-LSTMs model; and generating the image description. The invention combines several long short-term memory (LSTM) networks and simultaneously uses the fully connected layer features of the image and the 300-dimensional GLOVE word vectors of the words, thereby effectively improving the diversity of the generated captions and producing more accurate image descriptions.
Description
Technical Field
The invention belongs to the technical field of image processing, and further relates to an image description method based on a Tri-LSTMs model within the field of image description. The invention can be used to generate accurate and diverse sentences that describe the content of a given image. Tri-LSTMs denotes a model consisting of a semantic LSTM module, a visual LSTM module and a language LSTM module.
Background
Image description is the task of generating a sentence that describes the content of a given image. The generated sentence should not only be fluent, but should also accurately describe the objects in the image together with their attributes, positions and mutual relationships. A generated image description can be used to retrieve images that match the description, which facilitates image retrieval. In addition, a generated image description can be converted into braille, helping blind users understand the image content.
Shenzhen University, in its patent "A bag-of-words model-based image description method and system" (patent application No. 201410491596X, grant publication No. CN 104299010B), proposes an image description method based on a bag-of-words model. The patent mainly addresses the information loss and low accuracy of traditional methods. Its implementation steps are: (1) extracting feature points from the image to be described; (2) computing the set of distances between the feature points and the visual words in a codebook, and obtaining a membership set between the feature points and the visual words from the distance set through a Gaussian membership function; (3) using the membership set to count the membership degree of the visual words describing each feature point, forming a histogram vector that describes the image to be described. Although this improves on traditional image description techniques and yields higher description accuracy, the method still has drawbacks: feature points must be extracted manually, different extraction methods strongly affect the result, the extraction process is cumbersome, and the finally generated image descriptions lack diversity.
Tianjin University, in its patent "A generation method from structured text to image description" (patent application No. 2016108541692, grant publication No. CN 106503055B), proposes a method for generating image descriptions from structured text. The patent mainly addresses the low accuracy and insufficient diversity of image descriptions generated by prior techniques. Its implementation steps are: (1) downloading pictures from the Internet to form a picture training set; (2) performing lexical analysis on the descriptions corresponding to the training images to construct structured text; (3) extracting convolutional neural network features of the training images with an existing neural network model and constructing a multi-task recognition model with <image features, structured text> as input; (4) taking the structured text and the corresponding descriptions extracted from the training set as input to a recurrent neural network and training to obtain the parameters of the recurrent neural network model; (5) inputting the convolutional neural network features of an image to be described and obtaining a predicted structured text through the multi-task recognition model; (6) inputting the predicted structured text and obtaining the image description through the recurrent neural network model. Although this patent improves the diversity of the generated image descriptions, it still has the drawback that only image features are used; no other effective information guides the decoding process, which limits the accuracy of the finally generated image descriptions.
Oriol Vinyals et al., in their paper "Show and Tell: A Neural Image Caption Generator" (CVPR 2015), propose an image description method based on an encoder-decoder model. The method first uses a convolutional neural network (CNN) to extract image features and then feeds these features into a long short-term memory (LSTM) network to generate the description corresponding to the image. It was the first to address the image description problem with an encoder-decoder structure, but its model structure is overly simple and the generated image descriptions are inaccurate.
Kelvin Xu et al., in their paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" (ICML 2015), propose an image description method that combines a long short-term memory (LSTM) network with an attention mechanism. During decoding, the method assigns different weights to different positions of the image, thereby paying different amounts of attention to objects at different locations. It generates more accurate image descriptions and demonstrates the effectiveness of combining an LSTM with an attention mechanism. However, a single-layer LSTM simultaneously carries multiple responsibilities, such as sentence generation and image weight assignment, and this confusion of responsibilities means the generated image descriptions are still not accurate enough.
The article "Image capturing with Semantic attribute" (cvpr 2016 conference article) published by the equal to quanzing young et al proposes an Image description method combining Semantic attributes, image features and Attention mechanism at the same time. The method comprises the steps of firstly selecting 1000 words with the highest occurrence frequency in a vocabulary library as semantic attributes, and then introducing the weighted semantic attributes into an input layer and an output layer of a decoder. The method proves the effectiveness of combining semantic attributes, image features and attention mechanism at the same time. However, the method still has the disadvantages that the difference between corresponding descriptions of different images is too small, and the generated description is rigid and templated.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an image description method based on a Tri-LSTMs model. The invention can effectively improve the accuracy and diversity of image description.
The technical idea of the invention is as follows: first, build and train an RPN convolutional neural network model and a Faster-RCNN network model; then build and train the Tri-LSTMs model; finally, extract image regions with the pre-trained Faster-RCNN network model, input them into the Tri-LSTMs model, and generate an image description for the image.
The specific steps for realizing the purpose of the invention are as follows:
(1) Generating a training set and mapping word vectors:
(1a) Selecting at least 80000 samples from an image data set with image descriptions to form a training set, wherein each selected sample is an image-description pair, and each image-description pair comprises one image and five corresponding image descriptions;
(1b) The image description of each sample in the training set consists of several English words; counting the frequency of occurrence of these words over all image descriptions of all samples, sorting the words in descending order of frequency, selecting the first 1000 words, mapping each selected word to its corresponding 300-dimensional GLOVE word vector, and storing the vectors in a computer;
(2) Constructing an RPN convolutional neural network model and a Faster-RCNN network model:
(2a) Constructing an RPN convolutional neural network model consisting of eight convolutional layers and a Softmax layer, and setting parameters of each layer;
(2b) Building a Faster-RCNN network model consisting of five convolutional layers, one ROIpooling layer, four fully connected layers and one Softmax layer, and setting parameters of each layer;
(3) Training the RPN convolutional neural network and the Faster-RCNN convolutional neural network:
performing alternating training on the RPN convolutional neural network and the Faster-RCNN convolutional neural network by adopting an alternating training method, to obtain a trained RPN convolutional neural network and a trained Faster-RCNN convolutional neural network;
(4) Extracting the full-connection layer characteristics of each sample image in the training set:
(4a) Sequentially inputting each sample image in the training set into a trained RPN convolutional neural network, and outputting the positions of all target rough selection frames in each sample image and the types of targets in the frames;
(4b) Respectively inputting the image area in each target rough selection frame into a resnet101 network trained on an ImageNet database, and storing all full-connection layer characteristics output by the last full-connection layer of the network into a computer;
(5) Constructing a Tri-LSTMs model:
(5a) Sequentially forming a long short-term memory network LSTM and an attention network into a semantic LSTM module, wherein the long short-term memory network LSTM comprises 1024 neurons;
(5b) Sequentially forming a long short-term memory network LSTM and an attention network into a visual LSTM module, wherein the long short-term memory network LSTM comprises 1024 neurons;
(5c) Sequentially forming a language LSTM module by a long-short term memory network LSTM and a full connection layer, wherein the long-short term memory network LSTM comprises 1024 neurons, and the number of the neurons of the full connection layer is set as the total number of words contained in all image descriptions in a training set;
(5d) Sequentially combining a semantic LSTM module, a visual LSTM module and a language LSTM module into a Tri-LSTMs model;
(6) Training Tri-LSTMs model:
(6a) At different moments, taking words at different positions in the image description of the training sample as input, and training a Tri-LSTMs model from zero moment;
(6b) Reading all full-connection layer characteristics output by the last full-connection layer of the resnet101 network stored in the computer in the step (4 b), and taking the average value of all full-connection layer characteristics as a characteristic vector;
(6c) Adding the feature vector and the word vector mapped by the word at the current moment in the image description, inputting the added feature vector into a long-short term memory network (LSTM) in a semantic LSTM module, and outputting a hidden state by forward conduction of the long-short term memory network (LSTM);
(6d) Reading 1000 300-dimensional GLOVE word vectors stored in the computer in the step (1), inputting the GLOVE word vectors into an attention network of a semantic LSTM module, and outputting the weighted GLOVE word vectors after the attention network conducts forwards;
(6e) Adding the hidden state of the semantic LSTM module at the current moment with the output of the attention network in the semantic LSTM module, and taking the obtained sum vector as the output of the semantic LSTM module;
(6f) Inputting the sum vector output by the semantic LSTM module into a long-short term memory network LSTM in the visual LSTM module, and conducting and outputting a hidden state in the forward direction of the long-short term memory network LSTM;
(6g) Reading all full-connection layer characteristics output by the last full-connection layer of the resnet101 network stored in the computer in the step (4 b), inputting the characteristics into an attention network of the visual LSTM module, conducting the attention network forwards, and outputting weighted full-connection layer characteristic vectors;
(6h) Adding the hidden state of the visual LSTM module at the current moment to the output of the attention network in the visual LSTM module, and taking the obtained sum vector as the output of the visual LSTM module;
(6i) Inputting the sum vector output by the visual LSTM module into a long short-term memory network LSTM in the language LSTM module, outputting a hidden state by forward conduction of the long short-term memory network LSTM, inputting the hidden state into a full-connection layer, and outputting a probability vector of the word at the next moment;
(6j) Judging whether a word exists in the image description at the next moment, if so, calculating the cross entropy loss between the word probability vector and the word vector at the next moment of the image description, and then executing the step (6 b), otherwise, executing the step (6 k);
(6k) Adding the cross entropy losses at all times to obtain a total loss, optimizing all parameters in the model by using a BP algorithm to minimize the total loss, and stopping training when the total loss is converged to obtain a trained Tri-LSTMs model;
(7) Generating an image description:
(7a) Inputting a natural image into the pre-trained Faster-RCNN, and outputting a target rough selection frame;
(7b) Inputting the image area in the target rough selection frame into a trained resnet101 network, and outputting the image characteristics of the full connection layer;
(7c) And inputting the image characteristics of the full connection layer into a Tri-LSTMs model to generate image description.
Compared with the prior art, the invention has the following advantages:
First, the Tri-LSTMs model constructed by the invention combines three long short-term memory networks (LSTM). This overcomes the defect of the prior art in which a single LSTM with an overly simple model structure generates the image description and therefore cannot generate sufficiently accurate descriptions. The invention can thus combine several LSTMs, effectively improving the accuracy of the image description, and offers strong generalization capability.
Second, the invention uses both the fully connected layer features of the image and the 300-dimensional GLOVE word vectors of the words as input to the Tri-LSTMs model. This overcomes the problem of the prior art in which only the fully connected layer features of the image are used as model input, so that the effective information available is too limited and the generated image descriptions lack diversity. The image descriptions generated by the invention are therefore more diverse.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a structural diagram of the Tri-LSTMs model constructed in the present invention.
FIG. 3 is a simulation diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The steps implemented by the present invention are further described with reference to fig. 1.
Step 1, generating a training set and mapping word vectors.
At least 80000 samples are selected from an image data set with image descriptions to form a training set; each selected sample is an image-description pair, and each image-description pair comprises one image and five corresponding image descriptions. An image description refers to the attributes and positions of the objects in the image and the relationships between them.
The image description of each sample in the training set consists of several English words. The frequency of occurrence of these words over all image descriptions of all samples is counted, the words are sorted in descending order of frequency, the first 1000 words are selected, each selected word is mapped to its corresponding 300-dimensional GLOVE word vector, and the vectors are stored in a computer.
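The vocabulary construction and word-vector mapping of step 1 can be sketched as follows; the GloVe file name, the zero-vector fall-back for out-of-vocabulary words and the caption format are illustrative assumptions, not details fixed by the method above.

```python
# Sketch of step 1: count word frequencies over all captions, keep the 1000
# most frequent words, and map each to its 300-dimensional GloVe vector.
import collections
import numpy as np

def build_semantic_vocab(captions, glove_path="glove.42B.300d.txt", top_k=1000):
    counts = collections.Counter()
    for caption in captions:                      # five captions per training image
        counts.update(caption.lower().split())
    vocab = [w for w, _ in counts.most_common(top_k)]   # descending frequency

    glove = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    # Words missing from the GloVe file fall back to a zero vector (assumption).
    vectors = np.stack([glove.get(w, np.zeros(300, np.float32)) for w in vocab])
    return vocab, vectors                         # vectors has shape (1000, 300)
```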
Step 2, building an RPN convolutional neural network model and a Faster-RCNN network model.
An RPN convolutional neural network model consisting of eight convolutional layers and a Softmax layer is constructed, and the parameters of each layer are set; the convolution kernels of all layers are of size 3×3.
A Faster-RCNN network model consisting of five convolutional layers, one ROIpooling layer, four fully connected layers and one Softmax layer is constructed, and the parameters of each layer are set; the convolution kernels of each layer are of size 3×3.
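The structure described in this step can be sketched with tf.keras as below. The patent only fixes the layer counts and the 3×3 kernels; the channel widths, the number of anchors, the omission of the box-regression branch and the use of the modern tf.keras API (rather than the TensorFlow 1.2 platform of the simulation) are assumptions.

```python
# Structural sketch of the RPN head of step 2: eight 3x3 convolutional layers
# followed by a Softmax layer scoring object / background for each anchor.
import tensorflow as tf

def build_rpn_head(num_anchors=9, in_channels=512):
    feature_map = tf.keras.Input(shape=(None, None, in_channels))
    x = feature_map
    for _ in range(7):                                     # first seven 3x3 conv layers
        x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    # Eighth 3x3 conv layer produces object/background logits per anchor.
    logits = tf.keras.layers.Conv2D(2 * num_anchors, 3, padding="same")(x)
    # Softmax layer over the two classes of each anchor position.
    scores = tf.keras.layers.Softmax()(tf.keras.layers.Reshape((-1, 2))(logits))
    return tf.keras.Model(feature_map, scores)
```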
Step 3, training the RPN convolutional neural network and the Faster-RCNN convolutional neural network.
The RPN convolutional neural network and the Faster-RCNN convolutional neural network are trained alternately using an alternating training method, yielding a trained RPN convolutional neural network and a trained Faster-RCNN convolutional neural network.
The alternating training method comprises the following steps; a code outline summarizing them is given after step 8:
step 1, selecting a random value for each parameter of the RPN convolutional neural network, and carrying out random initialization.
Step 2, inputting the training sample image into the initialized RPN convolutional neural network, training the network by using the back propagation BP algorithm, and adjusting the parameters of the RPN convolutional neural network until all the parameters converge, obtaining the initially trained RPN convolutional neural network.
Step 3, inputting the training sample image into the trained RPN convolutional neural network, and outputting the target rough selection frames on the training sample image.
Step 4, selecting a random value for each parameter of the Faster-RCNN convolutional neural network, and performing random initialization.
Step 5, inputting the training sample image and the target rough selection frames obtained in step 3 into the initialized Faster-RCNN convolutional neural network, training the network by using the back propagation BP algorithm, and adjusting the parameters of the Faster-RCNN convolutional neural network until all the parameters converge, obtaining the initially trained Faster-RCNN convolutional neural network.
Step 6, fixing the parameters of the first five convolutional layers of the RPN convolutional neural network trained in step 2 and the parameters of the Faster-RCNN convolutional neural network trained in step 5, inputting the training sample image into the trained RPN convolutional neural network, and fine-tuning the unfixed parameters of the RPN convolutional neural network by using the back propagation BP algorithm until they converge, obtaining the finally trained RPN convolutional neural network model.
Step 7, inputting the training sample image into the RPN convolutional neural network finally trained in step 6, and obtaining the target rough selection frames on the sample image again.
Step 8, fixing the parameters of the first five convolutional layers of the Faster-RCNN convolutional neural network trained in step 5 and the parameters of the RPN convolutional neural network finally trained in step 6, inputting the training sample image and the target rough selection frames obtained again in step 7 into the Faster-RCNN convolutional neural network, and fine-tuning the unfixed parameters of the Faster-RCNN convolutional neural network by using the back propagation BP algorithm until they converge, obtaining the finally trained Faster-RCNN convolutional neural network.
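The eight steps above can be summarized by the following outline; every helper function called here (train_with_bp, propose_boxes, and so on) is a hypothetical placeholder standing for the operation described in the corresponding step, and only the ordering of the stages follows the text above.

```python
# High-level outline of the alternating training schedule of step 3.
def alternating_training(train_images):
    rpn = init_rpn_randomly()                     # step 1: random initialization
    train_with_bp(rpn, train_images)              # step 2: BP until convergence
    boxes = propose_boxes(rpn, train_images)      # step 3: coarse target boxes

    frcnn = init_faster_rcnn_randomly()           # step 4: random initialization
    train_with_bp(frcnn, train_images, boxes)     # step 5: BP until convergence

    fix_first_five_conv_layers(rpn, frcnn)        # step 6: freeze shared layers
    finetune_with_bp(rpn, train_images)           #         fine-tune the rest
    boxes = propose_boxes(rpn, train_images)      # step 7: regenerate boxes

    finetune_with_bp(frcnn, train_images, boxes)  # step 8: fine-tune Faster-RCNN
    return rpn, frcnn
```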
Step 4, extracting the fully connected layer features of the images.
The sample images in the training set are sequentially input into the trained RPN convolutional neural network, which outputs the positions of all target rough selection frames in each sample image and the categories of the targets in the frames.
The image region in each target rough selection frame is input into a resnet101 network trained on the ImageNet database, and all fully connected layer features output by the last fully connected layer of the network are stored in a computer.
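A minimal sketch of this feature-extraction step is given below. Using tf.keras.applications.ResNet101 with average pooling stands in for "the last fully connected layer of resnet101", and boxes given as pixel coordinates are an assumption.

```python
# Sketch of step 4: crop each target rough selection frame, resize it, and
# store one ImageNet-pretrained ResNet feature vector per region.
import numpy as np
import tensorflow as tf

resnet = tf.keras.applications.ResNet101(weights="imagenet",
                                         include_top=False, pooling="avg")

def region_features(image, boxes):
    """image: HxWx3 uint8 array; boxes: list of (x1, y1, x2, y2) in pixels."""
    feats = []
    for x1, y1, x2, y2 in boxes:
        crop = tf.image.resize(image[y1:y2, x1:x2].astype(np.float32), (224, 224))
        crop = tf.keras.applications.resnet.preprocess_input(crop)[tf.newaxis]
        feats.append(resnet(crop, training=False)[0].numpy())
    return np.stack(feats)        # shape: (number of regions, 2048)
```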
Step 5, constructing the Tri-LSTMs model.
A long short-term memory network LSTM and an attention network are sequentially combined into a semantic LSTM module, and the long short-term memory network LSTM comprises 1024 neurons.
A long-short term memory network LSTM and an attention network are sequentially combined into a visual LSTM module, and the long-short term memory network LSTM comprises 1024 neurons.
A long short-term memory network LSTM and a fully connected layer are sequentially combined into a language LSTM module; the LSTM comprises 1024 neurons, and the number of neurons of the fully connected layer is set to the total number of words contained in all image descriptions in the training set.
The Tri-LSTMs model is formed by the semantic LSTM module, the visual LSTM module and the language LSTM module in sequence as shown in figure 2.
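One decoding step of the model in figure 2 can be sketched as follows. The 1024-unit LSTMs, the attention over GLOVE word vectors and region features, and the final fully connected layer follow the text; the Dense projections that bring the 300-dimensional word vectors and 2048-dimensional region features to a common size are implementation assumptions, since the patent only specifies that the corresponding vectors are added.

```python
# Sketch of one time step of the Tri-LSTMs model: semantic LSTM + attention
# over GloVe word vectors, visual LSTM + attention over region features,
# language LSTM + fully connected layer producing next-word logits.
import tensorflow as tf

class Attention(tf.keras.layers.Layer):
    """a_i = w^T tanh(W_v v_i + W_h h); output = sum_i softmax(a)_i * v_i."""
    def __init__(self, units=512):
        super().__init__()
        self.w_v = tf.keras.layers.Dense(units)
        self.w_h = tf.keras.layers.Dense(units)
        self.w_a = tf.keras.layers.Dense(1)

    def call(self, values, hidden):
        scores = self.w_a(tf.tanh(self.w_v(values) + self.w_h(hidden)[:, None, :]))
        weights = tf.nn.softmax(scores, axis=1)           # (batch, K, 1)
        return tf.reduce_sum(weights * values, axis=1)    # weighted sum of values

class TriLSTMsStep(tf.keras.Model):
    def __init__(self, vocab_size):
        super().__init__()
        self.embed_word = tf.keras.layers.Dense(1024)     # 300-d GloVe -> 1024-d
        self.embed_feat = tf.keras.layers.Dense(1024)     # 2048-d feature -> 1024-d
        self.sem_lstm = tf.keras.layers.LSTMCell(1024)
        self.vis_lstm = tf.keras.layers.LSTMCell(1024)
        self.lang_lstm = tf.keras.layers.LSTMCell(1024)
        self.word_att, self.region_att = Attention(), Attention()
        self.proj_word_att = tf.keras.layers.Dense(1024)
        self.proj_region_att = tf.keras.layers.Dense(1024)
        self.fc = tf.keras.layers.Dense(vocab_size)       # next-word logits

    def call(self, word_vec, mean_feat, glove_vecs, region_feats, states):
        sem_s, vis_s, lang_s = states
        # Semantic module: (image feature + word vector) -> LSTM, then add the
        # attention-weighted GloVe word vectors.
        h_sem, sem_s = self.sem_lstm(self.embed_word(word_vec) +
                                     self.embed_feat(mean_feat), sem_s)
        sem_out = h_sem + self.proj_word_att(self.word_att(glove_vecs, h_sem))
        # Visual module: semantic output -> LSTM, then add weighted region features.
        h_vis, vis_s = self.vis_lstm(sem_out, vis_s)
        vis_out = h_vis + self.proj_region_att(self.region_att(region_feats, h_vis))
        # Language module: LSTM -> fully connected layer -> next-word logits.
        h_lang, lang_s = self.lang_lstm(vis_out, lang_s)
        return self.fc(h_lang), (sem_s, vis_s, lang_s)
```

During training this step is unrolled over the caption, feeding the ground-truth word of the current moment; during testing the previously predicted word is fed back in, as sketched in step 7 below.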
Step 6, training the Tri-LSTMs model.
Step 1, at different moments, taking the words at the corresponding positions in the image description of the training sample as input, and training the Tri-LSTMs model starting from time zero.
Step 2, reading all the fully connected layer features output by the last fully connected layer of the resnet101 network stored in the computer in step 4, and taking the average of all the fully connected layer features as the feature vector.
Step 3, adding the feature vector to the word vector of the word at the current moment in the image description, inputting the sum into the long short-term memory network LSTM in the semantic LSTM module, and conducting the LSTM forward to output a hidden state.
The forward conduction of the long short-term memory network LSTM is realized according to the following formulas:

i_t = sigmoid(W_{ix} x_t + W_{ih} h_{t-1})
f_t = sigmoid(W_{fx} x_t + W_{fh} h_{t-1})
o_t = sigmoid(W_{ox} x_t + W_{oh} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{cx} x_t + W_{ch} h_{t-1})
h_t = o_t ⊙ tanh(c_t)

where i_t denotes the input gate of the long short-term memory network LSTM at time t; sigmoid denotes the activation function sigmoid(x) = 1/(1 + e^{-x}), with e the natural constant; W_{ix} denotes the weight matrix of the input gate acting on the input; x_t denotes the input of the LSTM at time t; W_{ih} denotes the weight matrix of the input gate acting on the hidden state; h_{t-1} denotes the hidden state of the LSTM at time t-1; f_t denotes the forget gate of the LSTM at time t, with weight matrices W_{fx} and W_{fh}; o_t denotes the output gate of the LSTM at time t, with weight matrices W_{ox} and W_{oh}; c_t denotes the state cell of the LSTM at time t; ⊙ denotes the element-wise product; c_{t-1} denotes the state cell of the LSTM at time t-1; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}); W_{cx} and W_{ch} denote the weight matrices of the state cell; and h_t denotes the hidden state of the LSTM at time t.
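The five formulas above correspond line for line to the following NumPy sketch; the weight shapes are assumptions, and biases are omitted because the formulas are written without bias terms.

```python
# NumPy sketch of one LSTM forward step, mirroring the five formulas above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """W is a dict holding W_ix, W_ih, W_fx, W_fh, W_ox, W_oh, W_cx, W_ch."""
    i_t = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev)                       # input gate
    f_t = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev)                       # forget gate
    o_t = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev)                       # output gate
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev)  # state cell
    h_t = o_t * np.tanh(c_t)                                              # hidden state
    return h_t, c_t
```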
Step 4, reading the 1000 300-dimensional GLOVE word vectors stored in the computer in step 1, inputting them into the attention network of the semantic LSTM module, and conducting the attention network forward to output the weighted GLOVE word vector.
The forward conduction of the attention network is realized according to the following formulas:

a_{i,t} = tanh(W_s s_i + W_h h_t)
ŝ_t = Σ_{i=1}^{K} [exp(a_{i,t}) / Σ_{j=1}^{K} exp(a_{j,t})] s_i

where a_{i,t} denotes the attention weight of the i-th of the 1000 300-dimensional GLOVE word vectors at time t; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}), with e the natural constant; W_s denotes the weight matrix applied to the 300-dimensional GLOVE word vectors; s_i denotes the i-th of the 1000 input GLOVE word vectors; W_h denotes the weight matrix applied to the hidden state output by the long short-term memory network LSTM in the semantic LSTM module; h_t denotes that hidden state at time t; ŝ_t denotes the feature vector output by the attention network in the semantic LSTM module at time t; K denotes the total number of 300-dimensional GLOVE word vectors; Σ denotes summation; and i indexes the word vectors.
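A sketch of this attention step (and, with the fully connected layer features v_i in place of the word vectors s_i, of the visual attention in step 7 below) is given next. Treating W_s and W_h as row vectors that yield a scalar score per candidate, and the softmax normalization, are assumptions about the step not shown explicitly above.

```python
# NumPy sketch of the attention network: score each candidate vector against
# the current hidden state, normalize the scores, and return the weighted sum.
import numpy as np

def attend(candidates, h_t, W_s, W_h):
    """candidates: (K, d) GloVe word vectors (or fully connected layer features)."""
    scores = np.tanh(candidates @ W_s + h_t @ W_h)    # a_{i,t}, shape (K,)
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the K candidates
    return weights @ candidates                       # weighted sum, shape (d,)
```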
Step 5, adding the hidden state of the semantic LSTM module at the current moment to the output of the attention network in the semantic LSTM module, and taking the resulting sum vector as the output of the semantic LSTM module.
Step 6, inputting the sum vector output by the semantic LSTM module into the long short-term memory network LSTM in the visual LSTM module, and conducting the LSTM forward to output a hidden state.
The forward conduction of the long short-term memory network LSTM is realized according to the same formulas and symbol definitions as given in step 3 above.
Step 7, reading all the fully connected layer features output by the last fully connected layer of the resnet101 network stored in the computer in step 4, inputting them into the attention network of the visual LSTM module, and conducting the attention network forward to output the weighted fully connected layer feature vector.
The forward conduction of the attention network is realized according to the following formulas:

a_{i,t} = tanh(W_v v_i + W_h h_t)
v̂_t = Σ_{i=1}^{K} [exp(a_{i,t}) / Σ_{j=1}^{K} exp(a_{j,t})] v_i

where a_{i,t} denotes the attention weight of the i-th of all the fully connected layer features at time t; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}), with e the natural constant; W_v denotes the weight matrix applied to the fully connected layer features; v_i denotes the i-th of all the fully connected layer features; W_h denotes the weight matrix applied to the hidden state of the long short-term memory network LSTM in the visual LSTM module; h_t denotes the hidden state output by that LSTM at time t; v̂_t denotes the output of the attention network in the visual LSTM module at time t; K denotes the total number of fully connected layer feature vectors; Σ denotes summation; and i indexes the feature vectors.
Step 8, adding the hidden state of the visual LSTM module at the current moment to the output of the attention network in the visual LSTM module, and taking the resulting sum vector as the output of the visual LSTM module.
Step 9, inputting the sum vector output by the visual LSTM module into the long short-term memory network LSTM in the language LSTM module, conducting the LSTM forward to output a hidden state, inputting the hidden state into the fully connected layer, and outputting the probability vector of the word at the next moment.
The forward conduction of the long short-term memory network LSTM is realized according to the same formulas and symbol definitions as given in step 3 above.
Step 10, judging whether the image description contains a word at the next moment; if so, calculating the cross-entropy loss between the predicted word probability vector and the word vector at the next moment of the image description and then returning to step 2; otherwise, executing step 11.
The cross-entropy loss between the word probability vector and the word vector at the next moment of the image description is calculated according to the following formula:

loss = − Σ_{t=1}^{N} log P(s_t | I; θ)

where loss denotes the cross-entropy loss between the word probability vectors and the word vectors of the training-set image description; N denotes the total number of words in the image description of the training set; Σ denotes summation; t indexes the words in the image description of the training set; log denotes the logarithm with the natural constant e as base; P(s_t | I; θ) denotes the probability of the word s_t at time t output by inputting the average of all fully connected layer features of the training-set image into the Tri-LSTMs model; I denotes the average of all fully connected layer features of the training-set image; and θ denotes all the parameters of the Tri-LSTMs model.
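A sketch of this loss summed over one caption is shown below; the per-step softmax outputs of the language LSTM module and the ground-truth word indices are assumed to be given as arrays.

```python
# Sketch of the cross-entropy loss of steps 10 and 11 for one caption:
# loss = -sum_t log P(s_t | I; theta), summed over all time steps.
import numpy as np

def caption_loss(probs, target_ids, eps=1e-12):
    """probs: (N, vocab_size) predicted word distributions; target_ids: (N,) indices of s_t."""
    picked = probs[np.arange(len(target_ids)), target_ids]   # P(s_t | I; theta)
    return -np.sum(np.log(picked + eps))
```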
Step 11, adding the cross-entropy losses at all moments to obtain the total loss, optimizing all parameters of the model with the BP algorithm so as to minimize the total loss, and stopping training when the total loss converges, obtaining the trained Tri-LSTMs model.
Step 7, generating an image description.
A natural image is input into the pre-trained Faster-RCNN, which outputs the target rough selection frames.
The image region in each target rough selection frame is input into the trained resnet101 network, which outputs the fully connected layer image features.
The fully connected layer image features are input into the Tri-LSTMs model, which generates the image description.
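At test time the description can be generated step by step, feeding each predicted word back into the model. The sketch below assumes greedy selection of the most probable word, start/end tokens, and a step function with the signature of the TriLSTMsStep sketch in step 5; all of these handles (step_fn, word_vectors, start_id, end_id) are hypothetical, since the patent does not fix the decoding strategy.

```python
# Sketch of caption generation: run the Tri-LSTMs step repeatedly, feed the
# most probable word back in, and stop at an end token or a length cap.
import numpy as np

def generate_caption(step_fn, states, mean_feat, glove_vecs, region_feats,
                     word_vectors, vocab, start_id, end_id, max_len=20):
    word_id, words = start_id, []
    for _ in range(max_len):
        logits, states = step_fn(word_vectors[word_id], mean_feat,
                                 glove_vecs, region_feats, states)
        word_id = int(np.argmax(logits))          # greedy choice of the next word
        if word_id == end_id:
            break
        words.append(vocab[word_id])
    return " ".join(words)
```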
The effect of the present invention will be further described with reference to the simulation.
1. Simulation experiment conditions are as follows:
the hardware platform of the simulation experiment of the invention is as follows: an Intel (R) Core5 processor of a Daire computer, a main frequency of 3.20GHz and a memory of 64GB;
the software platform of the simulation experiment of the invention is as follows: python3.5, tensorflow1.2 platform.
The data set used in the simulation experiment of the invention is a COCO data set, the data set is obtained by Microsoft team and can be used for image description generation, the construction time of the data set is 2014 years, and the training set and the testing set of the data set respectively comprise 123287 and 40,775 images.
2. Simulation content and result analysis:
the simulation experiment of the invention adopts the invention and two prior arts (self-adaptive attention mechanism method, scst method), 123287 training set samples of COCO data set are respectively input into the respectively constructed model for training, 40,775 images of the test set are respectively input into the trained model, and three methods are generated to describe the images of each test set image.
The two prior-art methods adopted in the simulation experiment are:
The adaptive attention mechanism method refers to the image description generation method proposed by Jiasen Lu et al. in the paper "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning" (CVPR 2017), abbreviated as the adaptive attention mechanism method.
The scst method refers to the image description generation method proposed by Steven J. Rennie et al. in the paper "Self-Critical Sequence Training for Image Captioning" (CVPR 2017), abbreviated as the scst method.
In order to compare the quality of the image descriptions generated by the three methods, four evaluation metrics (BLEU-4, METEOR, ROUGE-L and CIDEr) are used to evaluate the descriptions generated by the three methods on the COCO test-set images. The results are listed in Table 1, where Net-1 denotes the image description method based on the Tri-LSTMs model, Net-2 denotes the adaptive attention mechanism method, and Net-3 denotes the scst method.
TABLE 1. Quantitative comparison of the results of the present invention and the two prior-art methods in the simulation experiment
As can be seen from Table 1, compared with the adaptive attention mechanism method and the scst method, the network of the present invention obtains higher scores on all evaluation metrics; it therefore performs better and can generate more accurate image descriptions.
In order to illustrate the effect of the present invention more intuitively, two images are randomly selected from the simulation results of the present invention on the COCO test set, as shown in fig. 3, where fig. 3(a) and fig. 3(b) each show a natural image from the COCO test set together with the image description generated for it.
As can be seen from the simulation diagram of fig. 3, the image descriptions generated by the present invention describe the content of the images accurately and specifically.
Claims (7)
1. An image description method based on a Tri-LSTMs model, characterized in that a Tri-LSTMs model composed of a semantic LSTM module, a visual LSTM module and a language LSTM module is built, and a sentence describing the image content is generated for any natural image, the method comprising the following steps:
(1) Generating a training set and mapping word vectors:
(1a) Selecting at least 80000 samples from an image data set with image descriptions to form a training set, wherein each selected sample is an image-description pair, and each image-description pair comprises one image and five corresponding image descriptions;
(1b) The image description of each sample in the training set consists of several English words; counting the frequency of occurrence of these words over all image descriptions of all samples, sorting the words in descending order of frequency, selecting the first 1000 words, mapping each selected word to its corresponding 300-dimensional GLOVE word vector, and storing the vectors in a computer;
(2) Constructing an RPN convolutional neural network model and a Faster-RCNN network model:
(2a) Constructing an RPN convolutional neural network model consisting of eight convolutional layers and a Softmax layer, and setting parameters of each layer;
(2b) Building a Faster-RCNN network model consisting of five convolutional layers, one ROIpooling layer, four fully connected layers and one Softmax layer, and setting parameters of each layer;
(3) Training the RPN convolutional neural network and the Faster-RCNN convolutional neural network:
performing alternating training on the RPN convolutional neural network and the Faster-RCNN convolutional neural network by adopting an alternating training method, to obtain a trained RPN convolutional neural network and a trained Faster-RCNN convolutional neural network;
(4) Extracting the full connection layer characteristics of each sample image in the training set:
(4a) Sequentially inputting each sample image in the training set into a trained RPN convolutional neural network, and outputting the positions of all target rough selection frames in each sample image and the types of targets in the frames;
(4b) Respectively inputting the image area in each target rough selection frame into a resnet101 network trained on an ImageNet database, and storing all full-connection layer characteristics output by the last full-connection layer of the network into a computer;
(5) Constructing a Tri-LSTMs model:
(5a) Sequentially forming a long short-term memory network LSTM and an attention network into a semantic LSTM module, wherein the long short-term memory network LSTM comprises 1024 neurons;
(5b) Sequentially forming a long short-term memory network LSTM and an attention network into a visual LSTM module, wherein the long short-term memory network LSTM comprises 1024 neurons;
(5c) Sequentially forming a language LSTM module by a long-short term memory network LSTM and a full connection layer, wherein the long-short term memory network LSTM comprises 1024 neurons, and the number of the neurons of the full connection layer is set as the total number of words contained in all image descriptions in a training set;
(5d) Sequentially combining a semantic LSTM module, a visual LSTM module and a language LSTM module into a Tri-LSTMs model;
(6) Training Tri-LSTMs model:
(6a) At different moments, taking words at different positions in the image description of the training sample as input, and training a Tri-LSTMs model from zero moment;
(6b) Reading all full-connection layer characteristics output by the last full-connection layer of the resnet101 network stored in the computer in the step (4 b), and taking the average value of all full-connection layer characteristics as a characteristic vector;
(6c) Adding the feature vector and the word vector mapped by the word at the current moment in the image description, inputting the added feature vector into a long-short term memory network LSTM in a semantic LSTM module, and outputting a hidden state in a forward conduction manner by the long-short term memory network LSTM;
(6d) Reading 1000 300-dimensional GLOVE word vectors stored in the computer in the step (1), inputting the GLOVE word vectors into an attention network of a semantic LSTM module, and outputting the weighted GLOVE word vectors after the attention network conducts forwards;
(6e) Adding the hidden state of the semantic LSTM module at the current moment with the output of the attention network in the semantic LSTM module, and taking the obtained sum vector as the output of the semantic LSTM module;
(6f) Inputting the sum vector output by the semantic LSTM module into a long-short term memory network LSTM in the visual LSTM module, and conducting and outputting a hidden state in the forward direction of the long-short term memory network LSTM;
(6g) Reading all full-connection layer characteristics output by the last full-connection layer of the resnet101 network stored in the computer in the step (4 b), inputting the characteristics into an attention network of the visual LSTM module, conducting the attention network forwards, and outputting weighted full-connection layer characteristic vectors;
(6h) Adding the hidden state of the visual LSTM module at the current moment to the output of the attention network in the visual LSTM module, and taking the obtained sum vector as the output of the visual LSTM module;
(6i) Inputting the sum vector output by the visual LSTM module into a long short-term memory network LSTM in the language LSTM module, outputting a hidden state by forward conduction of the long short-term memory network LSTM, inputting the hidden state into a full-connection layer, and outputting a probability vector of the word at the next moment;
(6j) Judging whether a word exists in the image description at the next moment, if so, calculating the cross entropy loss between the word probability vector and the word vector at the next moment of the image description, and then executing the step (6 b), otherwise, executing the step (6 k);
(6k) Adding the cross entropy losses at all times to obtain a total loss, optimizing all parameters in the model by using a BP algorithm to minimize the total loss, and stopping training when the total loss is converged to obtain a trained Tri-LSTMs model;
(7) Generating an image description:
(7a) Inputting a natural image into the pre-trained Faster-RCNN, and outputting a target rough selection frame;
(7b) Inputting the image area in the target rough selection frame into a trained resnet101 network, and outputting the image characteristics of the full connection layer;
(7c) And inputting the image characteristics of the full connection layer into a Tri-LSTMs model to generate image description.
2. The Tri-LSTMs model-based image description method of claim 1, wherein the image description in step (1 a) refers to the attributes, positions and relationships of objects in the image.
3. The image description method based on the Tri-LSTMs model of claim 1, wherein the step of the alternative training method in step (3) is as follows:
firstly, selecting a random value for each parameter of the RPN convolutional neural network, and carrying out random initialization;
secondly, inputting a training sample image into the initialized RPN convolutional neural network, training the network by using a back propagation BP algorithm, and adjusting parameters of the RPN convolutional neural network until all the parameters are converged to obtain the initially trained RPN convolutional neural network;
thirdly, inputting the training sample image into the trained RPN convolutional neural network, and outputting a target rough selection frame on the training sample image;
fourthly, selecting a random value for each parameter of the Faster-RCNN convolutional neural network, and carrying out random initialization;
fifthly, inputting the training sample image and the target rough selection box obtained in the third step into the initialized Faster-RCNN convolutional neural network, training the network by using a back propagation BP algorithm, and adjusting the parameters of the Faster-RCNN convolutional neural network until all the parameters are converged to obtain the initially trained Faster-RCNN convolutional neural network;
sixthly, fixing the parameters of the first five convolutional layers of the RPN convolutional neural network trained in the second step and the parameters of the Faster-RCNN convolutional neural network trained in the fifth step, inputting the training sample image into the trained RPN convolutional neural network, and finely adjusting the unfixed parameters of the RPN convolutional neural network by using a back propagation BP algorithm until the unfixed parameters are converged to obtain a finally trained RPN convolutional neural network model;
seventhly, inputting the training sample image into the RPN convolutional neural network finally trained in the sixth step, and obtaining a target rough selection frame on the sample image again;
and eighthly, fixing the parameters of the first five convolutional layers of the Faster-RCNN convolutional neural network trained in the fifth step and the parameters of the RPN convolutional neural network finally trained in the sixth step, inputting the training sample image and the target rough selection box obtained again in the seventh step into the Faster-RCNN convolutional neural network, and finely adjusting the unfixed parameters of the Faster-RCNN convolutional neural network by using a back propagation BP algorithm until the unfixed parameters are converged, obtaining the finally trained Faster-RCNN convolutional neural network.
4. The image description method based on the Tri-LSTMs model of claim 1, wherein the forward conduction of the long short-term memory network LSTM in step (6c), step (6f) and step (6i) is realized according to the following formulas:

i_t = sigmoid(W_{ix} x_t + W_{ih} h_{t-1})
f_t = sigmoid(W_{fx} x_t + W_{fh} h_{t-1})
o_t = sigmoid(W_{ox} x_t + W_{oh} h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{cx} x_t + W_{ch} h_{t-1})
h_t = o_t ⊙ tanh(c_t)

where i_t denotes the input gate of the long short-term memory network LSTM at time t; sigmoid denotes the activation function sigmoid(x) = 1/(1 + e^{-x}), with e the natural constant; W_{ix} denotes the weight matrix of the input gate acting on the input; x_t denotes the input of the LSTM at time t; W_{ih} denotes the weight matrix of the input gate acting on the hidden state; h_{t-1} denotes the hidden state of the LSTM at time t-1; f_t denotes the forget gate of the LSTM at time t, with weight matrices W_{fx} and W_{fh}; o_t denotes the output gate of the LSTM at time t, with weight matrices W_{ox} and W_{oh}; c_t denotes the state cell of the LSTM at time t; ⊙ denotes the element-wise product; c_{t-1} denotes the state cell of the LSTM at time t-1; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}); W_{cx} and W_{ch} denote the weight matrices of the state cell; and h_t denotes the hidden state of the LSTM at time t.
5. The Tri-LSTMs model-based image description method of claim 1, wherein the forward conduction of the attention network in step (6d) is realized according to the following formulas:

a_{i,t} = tanh(W_s s_i + W_h h_t)
ŝ_t = Σ_{i=1}^{K} [exp(a_{i,t}) / Σ_{j=1}^{K} exp(a_{j,t})] s_i

where a_{i,t} denotes the attention weight of the i-th of the 1000 300-dimensional GLOVE word vectors at time t; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}), with e the natural constant; W_s denotes the weight matrix applied to the 300-dimensional GLOVE word vectors; s_i denotes the i-th of the 1000 input GLOVE word vectors; W_h denotes the weight matrix applied to the hidden state output by the long short-term memory network LSTM in the semantic LSTM module; h_t denotes that hidden state at time t; ŝ_t denotes the feature vector output by the attention network in the semantic LSTM module at time t; K denotes the total number of 300-dimensional GLOVE word vectors; Σ denotes summation; and i indexes the word vectors.
6. The Tri-LSTMs model-based image description method of claim 1, wherein the forward conduction of the attention network in step (6g) is realized according to the following formulas:

a_{i,t} = tanh(W_v v_i + W_h h_t)
v̂_t = Σ_{i=1}^{K} [exp(a_{i,t}) / Σ_{j=1}^{K} exp(a_{j,t})] v_i

where a_{i,t} denotes the attention weight of the i-th of all the fully connected layer features at time t; tanh denotes the activation function tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}), with e the natural constant; W_v denotes the weight matrix applied to the fully connected layer features; v_i denotes the i-th of all the fully connected layer features; W_h denotes the weight matrix applied to the hidden state of the long short-term memory network LSTM in the visual LSTM module; h_t denotes the hidden state output by that LSTM at time t; v̂_t denotes the output of the attention network in the visual LSTM module at time t; K denotes the total number of fully connected layer feature vectors; Σ denotes summation; and i indexes the feature vectors.
7. The Tri-LSTMs model-based image description method of claim 1, wherein the cross-entropy loss between the word probability vector in step (6j) and the word vector at the next moment of the image description is calculated according to the following formula:

loss = − Σ_{t=1}^{N} log P(s_t | I; θ)

where loss denotes the cross-entropy loss between the word probability vectors and the word vectors of the training-set image description; N denotes the total number of words in the image description of the training set; Σ denotes summation; t indexes the words in the image description of the training set; log denotes the logarithm with the natural constant e as base; P(s_t | I; θ) denotes the probability of the word s_t at time t output by inputting the average of all fully connected layer features of the training-set image into the Tri-LSTMs model; I denotes the average of all fully connected layer features of the training-set image; and θ denotes all the parameters of the Tri-LSTMs model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910565977.0A CN110288029B (en) | 2019-06-27 | 2019-06-27 | Tri-LSTMs model-based image description method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910565977.0A CN110288029B (en) | 2019-06-27 | 2019-06-27 | Tri-LSTMs model-based image description method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110288029A CN110288029A (en) | 2019-09-27 |
CN110288029B true CN110288029B (en) | 2022-12-06 |
Family
ID=68007639
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910565977.0A Active CN110288029B (en) | 2019-06-27 | 2019-06-27 | Tri-LSTMs model-based image description method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110288029B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112580658B (en) * | 2019-09-29 | 2024-03-12 | 中国移动通信集团辽宁有限公司 | Image semantic description method, device, computing equipment and computer storage medium |
CN110968725B (en) * | 2019-12-03 | 2023-04-28 | 咪咕动漫有限公司 | Image content description information generation method, electronic device and storage medium |
CN111144553B (en) * | 2019-12-28 | 2023-06-23 | 北京工业大学 | Image description method based on space-time memory attention |
CN111159454A (en) * | 2019-12-30 | 2020-05-15 | 浙江大学 | Picture description generation method and system based on Actor-Critic generation type countermeasure network |
CN111275780B (en) * | 2020-01-09 | 2023-10-17 | 北京搜狐新媒体信息技术有限公司 | Character image generation method and device |
CN111242059B (en) * | 2020-01-16 | 2022-03-15 | 合肥工业大学 | Method for generating unsupervised image description model based on recursive memory network |
CN113836985A (en) * | 2020-06-24 | 2021-12-24 | 富士通株式会社 | Image processing apparatus, image processing method, and computer-readable storage medium |
CN116543289B (en) * | 2023-05-10 | 2023-11-21 | 南通大学 | Image description method based on encoder-decoder and Bi-LSTM attention model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3040165A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial attention model for image captioning |
CN108875807A (en) * | 2018-05-31 | 2018-11-23 | 陕西师范大学 | A kind of Image Description Methods multiple dimensioned based on more attentions |
CN109711465A (en) * | 2018-12-26 | 2019-05-03 | 西安电子科技大学 | Image method for generating captions based on MLL and ASCA-FR |
-
2019
- 2019-06-27 CN CN201910565977.0A patent/CN110288029B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3040165A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial attention model for image captioning |
CN108875807A (en) * | 2018-05-31 | 2018-11-23 | 陕西师范大学 | A kind of Image Description Methods multiple dimensioned based on more attentions |
CN109711465A (en) * | 2018-12-26 | 2019-05-03 | 西安电子科技大学 | Image method for generating captions based on MLL and ASCA-FR |
Also Published As
Publication number | Publication date |
---|---|
CN110288029A (en) | 2019-09-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110288029B (en) | Tri-LSTMs model-based image description method | |
CN110083705B (en) | Multi-hop attention depth model, method, storage medium and terminal for target emotion classification | |
Aneja et al. | Convolutional image captioning | |
Zhang et al. | More is better: Precise and detailed image captioning using online positive recall and missing concepts mining | |
CN108875807B (en) | Image description method based on multiple attention and multiple scales | |
US11580975B2 (en) | Systems and methods for response selection in multi-party conversations with dynamic topic tracking | |
Yao et al. | Describing videos by exploiting temporal structure | |
CN111985369A (en) | Course field multi-modal document classification method based on cross-modal attention convolution neural network | |
Dong et al. | Word2visualvec: Image and video to sentence matching by visual feature prediction | |
CN108830287A (en) | The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method | |
CN111460132B (en) | Generation type conference abstract method based on graph convolution neural network | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN112232087B (en) | Specific aspect emotion analysis method of multi-granularity attention model based on Transformer | |
CN111598183B (en) | Multi-feature fusion image description method | |
CN110347831A (en) | Based on the sensibility classification method from attention mechanism | |
US20240177506A1 (en) | Method and Apparatus for Generating Captioning Device, and Method and Apparatus for Outputting Caption | |
CN111597341A (en) | Document level relation extraction method, device, equipment and storage medium | |
Chen et al. | Movie fill in the blank by joint learning from video and text with adaptive temporal attention | |
US20230385558A1 (en) | Text classifier for answer identification, background knowledge representation generator and training device therefor, and computer program | |
CN113297387B (en) | News detection method for image-text mismatching based on NKD-GNN | |
CN115422388B (en) | Visual dialogue method and system | |
CN110930469A (en) | Text image generation method and system based on transition space mapping | |
Xu et al. | Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network | |
Yang et al. | Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system | |
Liu et al. | Multimodal cross-guided attention networks for visual question answering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||