CN110750669A

CN110750669A - Method and system for generating image captions

Info

Publication number: CN110750669A
Application number: CN201910885349.0A
Authority: CN
Inventors: 杨志明
Original assignee: Reflections On Artificial Intelligence Robot Technology (beijing) Co Ltd
Current assignee: Reflections On Artificial Intelligence Robot Technology (beijing) Co Ltd
Priority date: 2019-09-19
Filing date: 2019-09-19
Publication date: 2020-02-04
Anticipated expiration: 2039-09-19
Also published as: CN110750669B

Abstract

The invention discloses a method and a system for generating image captions, wherein a neural network model in the embodiment of the invention adopts an encoder-decoder (encoder-decoder) recursive bidirectional cyclic neural network architecture, and the model comprises two main parts: an image feature extraction section and an image caption generation section. For the image feature extraction part, a convolutional neural network is arranged to convert an image into feature vector information of the image; and for the image caption generating part, a bidirectional cyclic neural network is arranged, the feature vector information of the image is input, the bidirectional cyclic neural network extracts the depth semantic information in the text, and finally the image caption information is obtained. The embodiment of the invention simply and automatically generates the image captions and improves the accuracy of generating the image captions.

Description

Method and system for generating image captions

Technical Field

The present invention relates to computer image processing technology and natural language processing technology, and is especially image subtitle generating method and system.

Background

Image captioning is the generation of readable text description information, also known as automatic image annotation or image annotation, for a given image, such as a picture of an object or scene. The image caption generation technology is a research direction in another new computer vision field after image classification, target detection and image segmentation. The image caption generation technology needs to describe objects in an image and relations between the objects by using natural language sentences in a correct form, which is a very challenging task, and in order to realize the image caption generation, the relevant knowledge combining computer vision and natural language processing is needed, namely, the computer vision technology is needed to explain the content of the image, and the natural language processing technology is needed to generate text description information. However, the image subtitle generating technology may have a great influence, for example, to help visually impaired people to better understand image contents in the internet.

How to generate image captions is a challenging problem in the field of artificial intelligence, combining computer vision and natural language processing, and fast browsing images can point out and describe a great deal of detail about visual scenes, which is a relatively simple problem for humans, but is very challenging for computers, as it relates to both how to understand the content of images and how to translate this understanding into natural language.

At present, two methods are mainly adopted for realizing the image subtitle generation process, namely a template-based method and a nearest neighbor-based method. When the method is based on a template mode, a title template is preset, and the title template fills the set title template according to the results of object detection and attribute discovery in the image; when based on the nearest neighbor approach, images that resemble subtitles are retrieved from a large database, and these retrieved subtitles are then modified to fit the current query. However, these two methods are complicated in the process of generating image subtitles and the accuracy of the generated subtitles is not high.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method for generating image subtitles, which is capable of simply and automatically generating image subtitles and improving the accuracy of generating image subtitles.

The embodiment of the invention provides a system for generating image captions, which can simply and automatically generate the image captions and improve the accuracy of generating the image captions.

The embodiment of the invention is realized as follows:

a method of image subtitle generation, the method comprising:

training to obtain an encoder-decoder recursive bidirectional recurrent neural network as an image subtitle generation model, wherein the image subtitle generation model comprises a feature extraction model and a language model;

inputting the acquired image into a feature extraction model to perform image feature extraction processing to obtain feature vector information of the image;

and inputting the feature vector information of the image into a language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.

The feature extraction model is formed by a convolutional neural network, and the language model is formed by a bidirectional cyclic neural network model.

The feature extraction model is formed by a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of full-link layers and a classifier.

The obtaining of the image captions includes:

and performing circulation processing of a BilSTM network set by the feature vector information of the image and the subtitle features generated based on the feature vector information of the image, connecting the image subtitles obtained by the circulation processing with a full-connection network, and processing by a Softmax classifier to obtain the image subtitles.

The feature extraction model is also provided with a full-connection network, and the feature vector information of the image is provided for the language model through the full-connection network.

The method further comprises the following steps:

and evaluating the image subtitle generating model by adopting a BLEU tool.

A system for image subtitle generation, comprising: a feature extraction model module and a language model module, wherein,

the feature extraction model module is used for training to obtain a feature extraction model, inputting the obtained image into the feature extraction model for image feature extraction processing to obtain feature vector information of the image and outputting the feature vector information to the language model module;

and the language model module is used for training to obtain a language model, inputting the feature vector information of the image into the language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.

The feature extraction model is formed by deep convolutional neural network training and comprises a plurality of convolutional layers, a plurality of full-link layers and a classifier;

the language model is formed by adopting bidirectional cyclic neural network training and is used for performing cyclic processing on the BiLSTM network which is used for setting the feature vector information of the image and the caption features generated based on the feature vector information of the image, then performing full-connection network connection on the image captions obtained by the cyclic processing, and obtaining the image captions after the processing of a Softmax classifier.

The feature extraction model module is also used for carrying out full connection processing on the obtained feature vector information of the image through the set full connection network and then outputting the feature vector information to the language model module.

The BLEU model estimation module is used for estimating the set feature extraction model and the image subtitle generation model constructed by the set language model.

As seen above, the neural network model in the embodiment of the present invention employs an encoder-decoder (encoder-decoder) recursive bi-directional cyclic neural network architecture, which includes two main parts: an image feature extraction section and an image caption generation section. For the image feature extraction part, a convolutional neural network is arranged to convert an image into feature vector information of the image; and for the image caption generating part, a bidirectional cyclic neural network is arranged, the feature vector information of the image is input, the bidirectional cyclic neural network extracts the depth semantic information in the text, and finally the image caption information is obtained. Thus, the embodiment of the invention simply and automatically generates the image captions and improves the accuracy of the image caption generation.

Drawings

Fig. 1 is a flowchart of a method for generating image subtitles according to an embodiment of the present invention;

FIG. 2 is a simplified structural diagram of a feature extraction model according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating an example of a method for extracting image features by using a feature extraction model according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating a process for executing a language model according to an embodiment of the present invention;

fig. 5 is a schematic diagram of a model architecture for generating a subtitle for an entire image according to an embodiment of the present invention;

fig. 6 is a schematic structural diagram of a system for generating image subtitles according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.

In order to simply and automatically generate the image captions and improve the accuracy of image caption generation, the embodiment of the invention introduces the neural network model into the realization of generating the image captions. With the rapid development of the neural network model, the neural network model can be applied to the generation process of the image captions. The neural network model in the embodiment of the invention combines the latest progress of computer vision and machine translation (natural language processing) and the idea of transfer learning, and can be used for generating natural language sentences describing images. The neural network model in embodiments of the invention maximizes the likelihood that a target of a given training image describes a sentence.

Specifically, the neural network model in the embodiment of the present invention adopts an encoder-decoder (encoder-decoder) recursive bi-directional cyclic neural network architecture, and the model includes two main parts: an image feature extraction section and an image caption generation section. For the image feature extraction part, a convolutional neural network is arranged to convert an image into feature vector information of the image; and for the image caption generating part, a bidirectional cyclic neural network is arranged, the feature vector information of the image is input, the bidirectional cyclic neural network extracts the depth semantic information in the text, and finally the image caption information is obtained.

Therefore, the embodiment of the invention can simply and automatically generate the image captions and improve the accuracy of generating the image captions.

Fig. 1 is a flowchart of a method for generating image subtitles according to an embodiment of the present invention, which includes the following specific steps:

step 101, setting an encoder-decoder recursive bidirectional recurrent neural network as an image subtitle generating model, wherein the model comprises a feature extraction model and a language model;

here, the feature extraction model is realized by a convolutional neural network, and the language model is realized by a bidirectional cyclic neural network;

step 102, inputting the acquired image into a feature extraction model for image feature extraction processing to obtain feature vector information of the image;

and 103, inputting the feature vector information of the image into a language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.

In the embodiment of the present invention, the image feature extraction refers to that a computer converts a red, green, blue (RGB) image into a feature matrix or a feature vector through a series of operations, and usually the feature matrix or the feature vector is represented by a vector with a fixed length, and the vector with the fixed length spatially represents the image and becomes feature vector information of the image. The feature extraction model for extracting the graphic features can adopt a deep convolutional neural network, the network can be obtained by training on the images in the image caption set, and the obtained images can also be processed by using the network obtained by training. The feature extraction model is structured by VGG, AlexNet, GoogleNet or ResNet, etc.

Fig. 2 is a simplified structural diagram of a feature extraction model according to an embodiment of the present invention, and as shown in the drawing, the feature extraction model is formed by using a deep convolutional neural network, where the deep convolutional neural network includes a plurality of convolutional layers, fully-connected layers, and a classifier, where an acquired image is input into the convolutional layers to be convolutional-processed, and then fully-connected into the fully-connected layers, and finally classified by the classifier, so as to obtain feature vector information of the image.

Fig. 3 is a flowchart of an example of a method for extracting image features by using a feature extraction model according to an embodiment of the present invention, where the selecting a deep convolutional neural network ResNet model for feature extraction includes:

step 301, inputting an image;

step 302, inputting the image into a feature extraction model;

and step 303, carrying out image feature extraction on the image by using the feature extraction model to obtain a feature code with fixed dimensionality.

In an embodiment of the invention, the language model is the probability of predicting the next word in the sequence given a word already present in the sequence. For image captioning, the language model is a neural network that predicts a single sequence in a description given the feature vector information of the image and builds the description conditional on the words that have been generated. In the embodiment of the present invention, a bidirectional recurrent neural network is used as a language model, a new Word is generated in a sequence at each output time point, then each generated Word is encoded by using Word embedding (such as Word2Vec), and the encoded Word is transmitted as an input to a decoder in the language model for generating a subsequent Word, fig. 4 is a schematic diagram of an execution process of the language model provided by the embodiment of the present invention, and the specific steps are as follows:

step 401, inputting feature codes with fixed dimensions;

step 402, inputting feature codes with fixed dimensions into a language model;

and 403, extracting semantic information by the language model according to the feature codes with fixed dimensionality, and outputting sentences.

It can be seen that the image subtitle generating model of the embodiment of the present invention is mainly divided into two parts, namely, image feature extraction and subtitle generation, and the two parts can be connected by using a full-connection network. Namely, the feature extraction model and the language model are connected by adopting a full-connection network. Assume that an input image of the image caption generation model is a caption description S corresponding to I, and the caption sequence length is n, i.e., S ═ S1, S2. In the implementation of the invention, for image feature extraction, a pre-trained convolutional neural network ResNet is adopted to extract image features, the extracted feature vector information of an image is input into a set full connection layer, and the full connection layer connecting the convolutional neural network and the bidirectional cyclic neural network has the function of converting the feature vector information of the image into proper dimensionality in an affine transformation mode for subsequent input; for the language model, the BilSTM network can be used for receiving the feature vector information of the image and the generated caption features, the BilSTM network is used for processing, the full connection layer in the language model is used for processing, and finally, the corresponding sequence is output through the Softmax classifier. The model architecture for the entire image subtitle generation is shown in fig. 5.

When training the image caption generating model, inputting an image will output the caption of the image, and the caption generating process is to generate a word at a time, and the previously generated word is used as an input for the generation of the subsequent word. Therefore, it is necessary to set an initial word to indicate that the generation process is started, an end word to indicate an end flag, and a startseq and an endseq to indicate start and end flags of the sequence at the time of processing. The image subtitle generating model in the embodiment of the invention receives a picture and an initial word, generates a next word, and provides the first two words as input to the model to generate the next word. This is the training process of the image caption generating model or the process of outputting the final caption by the trained image caption generating model. For example, for the input sequence "Two peoples clinmbing up a snowymoutain" to be divided into 8 pairs of input and output for training of the image caption generating model, the constructed model input and output pairs are as shown in table 1.

TABLE 1

Thus, an image subtitle generating model is obtained through training.

In the embodiment of the invention, how to evaluate the image subtitle generating model obtained by training is an important problem. Generally, there are two main ways for evaluating an image subtitle generation model, namely, manual and machine. However, this manual evaluation is slow and costly, and is subjective and requires expertise and experience. The embodiment of the invention mainly adopts a machine evaluation mode, namely a Bilingual evaluation complementation (BLEU) tool is set to evaluate the model. The tool is an index for measuring the similarity degree of a machine translation text and a set reference text, the value range is between [0 and 1], and the tool can be used for evaluating texts considered to be generated by a series of natural language processing, such as tasks of language generation, picture title generation, text summarization or voice recognition and the like. The BLEU tool has the following advantages: 1) the calculation speed is high, and the consumed resources are less; 2) is easy to understand; 3) language independent; 4) highly correlated with human evaluation; 5) is widely adopted.

When training the trained image caption generation model using the BLEU tool, work by comparing the candidate caption text with a count of matching n-grams in the reference text set at where 1-gram or unigram compares every word, and bigram compares every word pair, regardless of the order of the words. The greater the number of matches, the better the quality of the identified candidate subtitles. The calculation formula of the evaluation system of the BLEU tool is shown below.

Wherein the content of the first and second substances,

r is the number of words of the reference text, c is the number of words in the candidate subtitles, and BP represents a short penalty value of sentences in the subtitles.

The numerator represents the minimum number of times that the n-gram appears in the caption and the reference text, and the denominator identification takes the number of times that the n-gram appears in the caption.

Fig. 6 is a schematic structural diagram of a system for generating image subtitles according to an embodiment of the present invention, including: a feature extraction model module and a language model module, wherein,

In the system, the feature extraction model is formed by convolutional neural network training and comprises a plurality of convolutional layers, a plurality of full-link layers and a classifier.

In the system, the language model is formed by bidirectional recurrent neural network training, the circulation processing of the BilSTM network is carried out on the feature vector information of the image and the caption features generated based on the feature vector information of the image, the image captions obtained by the circulation processing are connected with a full-connection network, and the image captions are finally obtained after the processing of a Softmax classifier.

In the system, the feature extraction model module is further configured to perform full-connection processing on the feature vector information of the obtained image through the set full-connection network, and then output the feature vector information to the language model module.

The system also comprises a BLEU tool evaluation module which is used for evaluating the set feature extraction model and the image subtitle generation model constructed by the set language model.

It can be seen that a subtitle generation model is constructed in the embodiment of the invention, and the related technologies of computer vision and natural language processing are fused. Firstly, when image features are extracted, a deep convolutional neural network, such as a trained ResNet50 deep convolutional neural network, is adopted to obtain feature vector information of the image; and secondly, a text sequence processing technology is used, a bidirectional cyclic neural network is selected for extracting semantic information in the feature vector information of the image, a transfer learning idea is used in the process, a Word2ved trained language model is fused to serve as the initialization of a Word vector for extracting the semantic information, then a decoder is used for fusing the feature vector information of the image, which is output by image feature extraction and text sequence processing, and the fused feature vector information is used for final prediction through a full connection layer. Thus, higher accuracy and better fluency of the generated image captions are obtained.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method of image subtitle generation, the method comprising:

2. The method of claim 1, wherein the feature extraction model is constructed using a convolutional neural network and the language model is constructed using a bi-directional cyclic neural network model.

3. The method of claim 2, wherein the feature extraction model is a deep convolutional neural network comprising a plurality of convolutional layers, a plurality of fully-connected layers, and a classifier.

4. The method of claim 2, wherein the obtaining the image caption comprises:

5. The method of claim 1, wherein a fully connected network is further provided in the feature extraction model, and the feature vector information of the image is provided to the language model through the fully connected network.

6. The method of claim 1, further comprising:

and evaluating the image subtitle generating model by adopting a BLEU tool.

7. A system for image subtitle generation, comprising: a feature extraction model module and a language model module, wherein,

8. The system of claim 7, wherein the feature extraction model is constructed using deep convolutional neural network training, comprising a plurality of convolutional layers, a plurality of fully-connected layers, and a classifier;

9. The system according to claim 7, wherein the feature extraction model module is further configured to perform full-join processing on the feature vector information of the obtained image through the set full-join network, and output the processed feature vector information to the language model module.

10. The system of claim 7, further comprising a BLEU tool evaluation module for evaluating an image subtitle generation model constructed from the set feature extraction model and the set language model.