CN110750669B - Method and system for generating image captions - Google Patents

Method and system for generating image captions

Info

Publication number
CN110750669B
CN110750669B (application CN201910885349.0A)
Authority
CN
China
Prior art keywords
image
model
feature extraction
vector information
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910885349.0A
Other languages
Chinese (zh)
Other versions
CN110750669A (en)
Inventor
杨志明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Original Assignee
Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd filed Critical Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Priority to CN201910885349.0A priority Critical patent/CN110750669B/en
Publication of CN110750669A publication Critical patent/CN110750669A/en
Application granted granted Critical
Publication of CN110750669B publication Critical patent/CN110750669B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5866 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for generating image captions. The neural network model in the embodiments of the invention adopts an encoder-decoder recursive bidirectional recurrent neural network architecture comprising two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert the image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set to receive the feature vector information of the image, extract deep semantic information of the text, and finally obtain the image caption information. The embodiments of the invention generate image captions simply and automatically and improve the accuracy of image caption generation.

Description

Method and system for generating image captions
Technical Field
The present invention relates to computer image processing technology and natural language processing technology, and in particular to a method and a system for generating image captions.
Background
Image caption generation, also known as automatic image annotation or image tagging, is the generation of readable text describing a given image, for example a picture of an object or a scene. Image caption generation is a new research direction in the field of computer vision, following image classification, object detection and image segmentation. It requires describing the objects in an image and the relationships between them with well-formed natural language sentences, which is a very challenging task: it combines knowledge of computer vision and natural language processing, using computer vision techniques to interpret the content of the image and natural language processing techniques to generate the textual description. Image caption generation can nonetheless have great impact, for example by helping visually impaired people better understand image content on the internet.
Generating image captions is thus a challenging problem at the intersection of computer vision and natural language processing in the field of artificial intelligence. A quick glance at an image lets a human point out and describe a large amount of detail in the visual scene; this is relatively easy for humans but very hard for computers, because it involves both understanding the content of the image and translating that understanding into natural language.
At present, two approaches are mainly used to generate image captions: the template-based approach and the nearest-neighbour-based approach. In the template-based approach, a caption template is preset and filled in according to the results of object detection and attribute discovery in the image. In the nearest-neighbour-based approach, images with similar captions are retrieved from a large database and the retrieved captions are modified to fit the current query. Both approaches are cumbersome when generating captions, and the accuracy of the generated captions is low.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method for generating image captions, which can generate image captions simply and automatically and improve the accuracy of image caption generation.
The embodiments of the present invention also provide a system for generating image captions, which can generate image captions simply and automatically and improve the accuracy of image caption generation.
The embodiments of the invention are realized as follows:
A method for generating image captions, the method comprising:
training an encoder-decoder recursive bidirectional recurrent neural network to serve as the image caption generation model, wherein the encoder-decoder recursive bidirectional recurrent neural network comprises a feature extraction model and a language model;
inputting the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image;
inputting the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption.
The feature extraction model is composed of a convolutional neural network, and the language model is composed of a bidirectional recurrent neural network.
The feature extraction model is composed of a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier.
Obtaining the image caption comprises the following steps:
performing recurrent processing on the feature vector information of the image and the caption features generated from it in a BiLSTM network, passing the result of the recurrent processing through a fully connected network, and then through a Softmax classifier to obtain the image caption.
The feature extraction model is further provided with a fully connected network, through which the feature vector information of the image is provided to the language model.
The method further comprises:
evaluating the image caption generation model with a BLEU tool.
A system for generating image captions, comprising: a feature extraction model module and a language model module, wherein,
the feature extraction model module is configured to train and obtain a feature extraction model, input the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image, and output it to the language model module;
the language model module is configured to train and obtain a language model, input the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption.
The feature extraction model is trained from a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier;
the language model is trained from a bidirectional recurrent neural network and is configured to perform recurrent processing on the feature vector information of the image and the caption features generated from it in a BiLSTM network, pass the result of the recurrent processing through a fully connected network, and then through a Softmax classifier to obtain the image caption.
The feature extraction model module is further configured to pass the obtained feature vector information of the image through a set fully connected network before outputting it to the language model module.
The system further comprises a BLEU tool evaluation module configured to evaluate the image caption generation model constructed from the set feature extraction model and the set language model.
As can be seen from the above, the neural network model in the embodiments of the present invention employs an encoder-decoder recursive bidirectional recurrent neural network architecture comprising two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert the image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set to receive the feature vector information of the image, extract deep semantic information of the text, and finally obtain the image caption information. Thus, the embodiments of the present invention generate image captions simply and automatically and improve the accuracy of image caption generation.
Drawings
Fig. 1 is a flowchart of a method for generating an image caption according to an embodiment of the present invention;
FIG. 2 is a simplified diagram of a feature extraction model according to an embodiment of the present invention;
FIG. 3 is a flowchart of an example of extracting image features with the feature extraction model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the execution process of the language model according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the model architecture for the entire image caption generation according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the structure of a system for generating image captions according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below by referring to the accompanying drawings and examples.
To generate image captions simply and automatically and to improve the accuracy of image caption generation, the embodiments of the present invention introduce a neural network model into the generation of image captions. With the rapid development of neural network models, they can be applied to the image caption generation process. The neural network model in the embodiments of the present invention combines recent progress in computer vision and machine translation (natural language processing) with the idea of transfer learning, and can be used to generate natural language sentences describing images. The neural network model in the embodiments of the present invention maximizes the likelihood of the target descriptive sentence for a given training image.
Specifically, the neural network model in the embodiments of the present invention adopts an encoder-decoder recursive bidirectional recurrent neural network architecture comprising two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert the image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set to receive the feature vector information of the image, extract deep semantic information of the text, and finally obtain the image caption information.
Therefore, the embodiments of the present invention can generate image captions simply and automatically and improve the accuracy of image caption generation.
Fig. 1 is a flowchart of a method for generating an image caption according to an embodiment of the present invention, which specifically includes the following steps:
Step 101, setting an encoder-decoder recursive bidirectional recurrent neural network as the image caption generation model, wherein the model comprises a feature extraction model and a language model;
the feature extraction model is implemented with a convolutional neural network, and the language model is implemented with a bidirectional recurrent neural network;
Step 102, inputting the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image;
Step 103, inputting the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption.
In the embodiments of the present invention, image feature extraction means that the computer converts a red-green-blue (RGB) image through a series of operations into a feature matrix or feature vector, usually represented by a fixed-length vector; this fixed-length vector represents the image in the feature space and becomes the feature vector information of the image. The feature extraction model for image feature extraction may adopt a deep convolutional neural network, which can be trained on the images of an image caption dataset and then used to process newly acquired images. The feature extraction model architecture may be VGG, AlexNet, GoogLeNet, ResNet, etc.
Fig. 2 is a simplified diagram of a feature extraction model according to an embodiment of the present invention. The feature extraction model is built with a deep convolutional neural network comprising a plurality of convolutional layers, a fully connected layer and a classifier: the acquired image is input into the convolutional layers for convolution, then into the fully connected layer, and finally classified by the classifier to obtain the feature vector information of the image.
Fig. 3 is a flowchart of an example of extracting image features with the feature extraction model according to an embodiment of the present invention, where feature extraction is performed with a deep convolutional neural network ResNet model, including:
Step 301, inputting an image;
Step 302, inputting the image into the feature extraction model;
Step 303, the feature extraction model extracts image features from the image to obtain a feature code with a fixed dimension.
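For illustration only, the fixed-dimension feature code of step 303 can be sketched with a pretrained ResNet whose classification layer is removed, as in the following minimal PyTorch example; the 2048-dimensional output and the file name example.jpg are assumptions tied to ResNet-50 and are not part of the original disclosure.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # Minimal sketch: a pretrained ResNet-50 with its final classification layer
    # removed serves as the feature extraction model; its output is a fixed-length
    # feature vector (2048 dimensions for ResNet-50).
    resnet = models.resnet50(pretrained=True)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier
    encoder.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feature = encoder(image).flatten(1)  # shape (1, 2048): the fixed-dimension feature code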
In the embodiments of the invention, the language model predicts the probability of the next word in a sequence given the words already present in the sequence. For image captioning, the language model is a neural network that, given the feature vector information of the image, predicts the word sequence of the description and builds the description conditioned on the words already generated. The embodiments of the invention use a bidirectional recurrent neural network as the language model: at each output time step a new word of the sequence is generated, each generated word is then encoded with a word embedding (such as Word2Vec) and passed as input to the decoder in the language model for generating the subsequent word. Fig. 4 is a schematic diagram of the execution process of the language model provided by an embodiment of the present invention, with the following specific steps:
Step 401, inputting the feature code with a fixed dimension;
Step 402, inputting the feature code with a fixed dimension into the language model;
Step 403, the language model extracts semantic information from the feature code with a fixed dimension and outputs a sentence.
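As a hedged illustration of the process of FIG. 4, the word-by-word generation can be written as a greedy decoding loop; caption_model, word_to_id and id_to_word below are hypothetical placeholders for a trained language model and its vocabulary mappings, not names defined in the original text.

    # Sketch: feed the fixed-dimension feature code and the words generated so far
    # into the language model, take the most probable next word, and stop at the
    # end marker. The model and vocabulary objects are assumed to exist.
    def generate_caption(caption_model, image_feature, word_to_id, id_to_word, max_len=30):
        words = ["startseq"]                                   # start marker of the sequence
        for _ in range(max_len):
            prefix_ids = [word_to_id[w] for w in words]
            scores = caption_model(image_feature, prefix_ids)  # scores over the vocabulary
            next_word = id_to_word[int(scores.argmax())]       # greedy choice of the next word
            if next_word == "endseq":                          # end marker: the caption is complete
                break
            words.append(next_word)
        return " ".join(words[1:])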
It can be seen that the image caption generation model according to the embodiments of the present invention is mainly divided into two parts, image feature extraction and caption generation, and the two parts are connected by a fully connected network; that is, the feature extraction model and the language model are connected by a fully connected network. Let the caption description corresponding to the input image I of the image caption generation model be S, and let the caption sequence length be n, i.e., S = {S1, S2, ..., Sn}. In the implementation of the invention, for image feature extraction, a pretrained convolutional neural network ResNet is adopted to extract the image features, and the extracted feature vector information of the image is input into a set fully connected layer; the fully connected layer connecting the convolutional neural network and the bidirectional recurrent neural network converts the feature vector information of the image into a suitable dimension through an affine transformation for the subsequent input. For the language model, a BiLSTM network is used to receive the feature vector information of the image and the caption features generated so far; after the BiLSTM processing, the fully connected layer in the language model is applied, and finally the corresponding sequence is output through a Softmax classifier. The model architecture for the entire image caption generation is shown in Fig. 5.
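A minimal PyTorch sketch of the architecture just described (pretrained ResNet features passed through an affine fully connected layer, a BiLSTM over the caption prefix, a fully connected layer and a Softmax classifier) is given below; the layer sizes and the concatenation used to fuse the two branches are illustrative assumptions rather than the exact configuration of FIG. 5.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class CaptionGenerationModel(nn.Module):
        # Encoder-decoder sketch: ResNet encoder, fully connected projection of the
        # image feature to the decoder dimension, BiLSTM over the caption prefix,
        # and a fully connected output layer feeding a Softmax classifier.
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
            super().__init__()
            resnet = models.resnet50(pretrained=True)
            self.encoder = nn.Sequential(*list(resnet.children())[:-1])
            for p in self.encoder.parameters():        # transfer learning: freeze the encoder
                p.requires_grad = False
            self.img_fc = nn.Linear(2048, hidden_dim)  # affine transform to a suitable dimension
            self.embed = nn.Embedding(vocab_size, embed_dim)  # word embedding (may be Word2Vec-initialised)
            self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.out_fc = nn.Linear(hidden_dim + 2 * hidden_dim, vocab_size)

        def forward(self, images, caption_prefix_ids):
            feats = self.img_fc(self.encoder(images).flatten(1))     # (B, hidden_dim)
            states, _ = self.bilstm(self.embed(caption_prefix_ids))  # (B, T, 2*hidden_dim)
            fused = torch.cat([feats, states[:, -1, :]], dim=1)      # fuse image and text features
            return self.out_fc(fused)  # logits; the Softmax classifier is applied on top

During training, these logits would be compared with the next word of the caption through a Softmax cross-entropy loss.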
When training the image caption generation model, an image is input and the caption of the image is output; the caption is generated one word at a time, and the previously generated words are used as input for generating the subsequent word. It is therefore necessary to set an initial word marking the start of the generation process and an end word marking its end; startseq and endseq are used as the start and end markers of the sequence during processing. The image caption generation model in the embodiments of the present invention receives a picture and the initial word and generates the next word, then the first two words of the description are provided as input to the model to generate the next word, and so on. This is the training process of the image caption generation model, and also the process by which the trained model outputs the final caption. For example, the input sequence "Two people climbing up a snowy mountain" is split into 8 pairs of inputs and outputs for training the image caption generation model; the constructed input and output pairs are shown in Table 1.
Input (image + caption prefix)                              Output (next word)
image + startseq                                            Two
image + startseq Two                                        people
image + startseq Two people                                 climbing
image + startseq Two people climbing                        up
image + startseq Two people climbing up                     a
image + startseq Two people climbing up a                   snowy
image + startseq Two people climbing up a snowy             mountain
image + startseq Two people climbing up a snowy mountain    endseq
TABLE 1
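As an illustrative sketch of how one caption is split into the input/output pairs of Table 1 (the function name and whitespace tokenisation are assumptions):

    # Split one training caption into (prefix, next word) pairs as in Table 1;
    # "Two people climbing up a snowy mountain" yields exactly 8 pairs.
    def build_training_pairs(caption):
        tokens = ["startseq"] + caption.split() + ["endseq"]
        pairs = []
        for i in range(1, len(tokens)):
            prefix = " ".join(tokens[:i])  # words already generated (fed to the model with the image)
            target = tokens[i]             # next word the model must learn to predict
            pairs.append((prefix, target))
        return pairs

    for prefix, target in build_training_pairs("Two people climbing up a snowy mountain"):
        print(f"{prefix:58} -> {target}")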
In this way, the image caption generation model is trained.
In the embodiments of the invention, how to evaluate the quality of the trained image caption generation model is an important question. There are generally two main ways of evaluating an image caption generation model: manual evaluation and machine evaluation. Manual evaluation is slow and costly, relatively subjective, and relies on expertise and experience. The embodiments of the invention therefore mainly adopt machine evaluation, i.e., a Bilingual Evaluation Understudy (BLEU) tool is set to evaluate the model. This tool is a metric measuring how similar a machine-generated text is to a set of reference texts; its value ranges between 0 and 1, and it can evaluate texts produced by a range of natural language processing tasks such as language generation, picture title generation, text summarization or speech recognition. The BLEU tool has the following advantages: 1) it is fast to compute and consumes few resources; 2) it is easy to understand; 3) it is language independent; 4) it correlates highly with human evaluation; 5) it is widely used.
When evaluating the trained image caption generation model with the BLEU tool, the evaluation works by counting the number of matching n-grams between the candidate caption text and the set of reference texts, where a 1-gram (unigram) comparison matches individual words and a bigram comparison matches word pairs; the comparison does not take word order into account. The greater the number of matches, the better the quality of the candidate caption. The calculation formulas of the BLEU evaluation are shown below.

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)

wherein,

BP = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases}

r is the number of words in the reference text, c is the number of words in the candidate caption, and BP is the brevity penalty applied when the candidate caption is shorter than the reference.

p_n = \frac{ \sum_{\text{n-gram} \in \text{candidate}} \min\left( \text{Count}_{\text{candidate}}(\text{n-gram}),\ \text{Count}_{\text{reference}}(\text{n-gram}) \right) }{ \sum_{\text{n-gram} \in \text{candidate}} \text{Count}_{\text{candidate}}(\text{n-gram}) }

The numerator is the number of occurrences of each n-gram in the candidate caption, clipped by its number of occurrences in the reference text (the minimum of the two counts), and the denominator is the total number of occurrences of n-grams in the candidate caption; w_n is the weight assigned to n-grams of order n.
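A small self-contained sketch of the BLEU computation described above (unigram and bigram precisions with uniform weights and a single reference) is shown below; it is illustrative only, and an established implementation such as NLTK's sentence_bleu would normally be used in practice.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, reference, max_n=2):
        # Modified n-gram precisions p_n with clipped counts, then the brevity penalty BP.
        cand, ref = candidate.split(), reference.split()
        precisions = []
        for n in range(1, max_n + 1):
            cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
            clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())  # numerator
            total = max(sum(cand_counts.values()), 1)                             # denominator
            precisions.append(max(clipped, 1e-9) / total)  # small floor avoids log(0) in this toy example
        c, r = len(cand), len(ref)
        bp = 1.0 if c > r else math.exp(1 - r / c)         # brevity penalty for short candidates
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    print(bleu("two people climbing a snowy mountain",
               "two people climbing up a snowy mountain"))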
Fig. 6 is a schematic diagram of the structure of a system for generating image captions according to an embodiment of the present invention, comprising: a feature extraction model module and a language model module, wherein,
the feature extraction model module is configured to train and obtain a feature extraction model, input the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image, and output it to the language model module;
the language model module is configured to train and obtain a language model, input the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption.
In the system, the feature extraction model is trained from a convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier.
In the system, the language model is trained from a bidirectional recurrent neural network: the feature vector information of the image and the caption features generated from it are processed recurrently by a BiLSTM network, the result of the recurrent processing is passed through a fully connected network, and the image caption is finally obtained after processing by a Softmax classifier.
In the system, the feature extraction model module is further configured to pass the obtained feature vector information of the image through a set fully connected network before outputting it to the language model module.
The system further comprises a BLEU tool evaluation module configured to evaluate the image caption generation model constructed from the set feature extraction model and the set language model.
It can be seen that the embodiments of the present invention construct a caption generation model that integrates techniques from computer vision and natural language processing. First, for image feature extraction, a deep convolutional neural network, such as a trained ResNet50, is adopted to obtain the feature vector information of the image. Second, text sequence processing is used: a bidirectional recurrent neural network is selected to extract semantic information from the feature vector information of the image; the idea of transfer learning is used in this process, and a language model trained with Word2Vec is fused in as the initialization of the word vectors for semantic information extraction. The decoder fuses the feature vector information of the image output by the image feature extraction with the text sequence processing, and the fused information is passed through a fully connected layer for the final prediction. Higher accuracy and better fluency of the generated image captions are thereby achieved.
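As a hedged sketch of the Word2Vec initialisation of the word vectors mentioned above (the gensim usage, the function name and the file name are assumptions for illustration, not part of the original disclosure):

    import numpy as np
    import torch
    from gensim.models import KeyedVectors

    # Initialise the embedding matrix of the language model from pretrained
    # Word2Vec vectors (transfer learning); words absent from the Word2Vec
    # vocabulary keep a random initialisation. Assumes dim == kv.vector_size.
    def build_embedding(vocab, dim, w2v_path="word2vec.bin"):
        kv = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
        weights = np.random.normal(0, 0.1, (len(vocab), dim)).astype("float32")
        for i, word in enumerate(vocab):
            if word in kv:
                weights[i] = kv[word]
        return torch.nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)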
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. A method for generating an image caption, the image caption describing objects in an image and the relationships between the objects using natural language sentences, the method comprising:
training an encoder-decoder recursive bidirectional recurrent neural network to serve as the image caption generation model, wherein the encoder-decoder recursive bidirectional recurrent neural network comprises a feature extraction model and a language model;
inputting the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image;
inputting the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption;
wherein the feature extraction model is formed by a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier;
obtaining the image caption comprises the following steps:
performing recurrent processing on the feature vector information of the image and the caption features generated from it in a BiLSTM network, passing the result of the recurrent processing through a fully connected network, and then through a Softmax classifier to obtain the image caption;
and the feature extraction model is further provided with a fully connected network, through which the feature vector information of the image is provided to the language model.
2. The method of claim 1, wherein the feature extraction model is constructed with a convolutional neural network and the language model is constructed with a bidirectional recurrent neural network model.
3. The method of claim 1, wherein the method further comprises:
evaluating the image caption generation model with a BLEU tool.
4. A system for generating image captions, wherein the image captions describe objects in an image and the relationships between the objects using natural language sentences, comprising: a feature extraction model module and a language model module, wherein,
the feature extraction model module is configured to train and obtain a feature extraction model, input the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image, and output it to the language model module;
the language model module is configured to train and obtain a language model, input the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption;
the feature extraction model is trained from a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier;
the language model is trained from a bidirectional recurrent neural network and is configured to perform recurrent processing on the feature vector information of the image and the caption features generated from it in a BiLSTM network, pass the result of the recurrent processing through a fully connected network, and then through a Softmax classifier to obtain the image caption;
the feature extraction model module is further configured to pass the obtained feature vector information of the image through a set fully connected network before outputting it to the language model module.
5. The system of claim 4, further comprising a BLEU tool evaluation module configured to evaluate the image caption generation model constructed from the set feature extraction model and the set language model.
CN201910885349.0A 2019-09-19 2019-09-19 Method and system for generating image captions Active CN110750669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910885349.0A CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910885349.0A CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Publications (2)

Publication Number Publication Date
CN110750669A CN110750669A (en) 2020-02-04
CN110750669B true CN110750669B (en) 2023-05-23

Family

ID=69276733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910885349.0A Active CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Country Status (1)

Country Link
CN (1) CN110750669B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414505B (en) * 2020-03-11 2023-10-20 上海爱数信息技术股份有限公司 Quick image abstract generation method based on sequence generation model
CN113449564B (en) * 2020-03-26 2022-09-06 上海交通大学 Behavior image classification method based on human body local semantic knowledge

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546360A (en) * 2016-01-13 2017-07-19 Adobe Systems Inc Image captioning with weak supervision

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11113599B2 (en) * 2017-06-22 2021-09-07 Adobe Inc. Image captioning utilizing semantic text modeling and adversarial learning
CN107729987A (en) * 2017-09-19 2018-02-23 东华大学 The automatic describing method of night vision image based on depth convolution loop neutral net
CN107909115B (en) * 2017-12-04 2022-02-15 上海师范大学 Image Chinese subtitle generating method
CN109902750A (en) * 2019-03-04 2019-06-18 山西大学 Method is described based on two-way single attention mechanism image
CN109919221B (en) * 2019-03-04 2022-07-19 山西大学 Image description method based on bidirectional double-attention machine

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546360A (en) * 2016-01-13 2017-07-19 Adobe Systems Inc Image captioning with weak supervision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sow, Daouda et al., "A Sequential Guiding Network with Attention for Image Captioning," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 3802-3806. *
杨楠; 南琳; 张丁一; 库涛. Research on image description based on deep learning. Infrared and Laser Engineering, 2018, (02), pp. 1-8. *

Also Published As

Publication number Publication date
CN110750669A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110750959B (en) Text information processing method, model training method and related device
CN108009154B (en) Image Chinese description method based on deep learning model
CN107256221B (en) Video description method based on multi-feature fusion
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN111737511B (en) Image description method based on self-adaptive local concept embedding
CN110705206B (en) Text information processing method and related device
WO2020199904A1 (en) Video description information generation method, video processing method, and corresponding devices
CN111160031A (en) Social media named entity identification method based on affix perception
Shi et al. Watch it twice: Video captioning with a refocused video encoder
CN110750669B (en) Method and system for generating image captions
KR20190065665A (en) Apparatus and method for recognizing Korean named entity using deep-learning
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN110659392B (en) Retrieval method and device, and storage medium
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN112528989B (en) Description generation method for semantic fine granularity of image
KR20210035721A (en) Machine translation method using multi-language corpus and system implementing using the same
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN115186683B (en) Attribute-level multi-modal emotion classification method based on cross-modal translation
CN115359323A (en) Image text information generation method and deep learning model training method
Islam et al. Bengali Caption Generation for Images Using Deep Learning
CN113139378B (en) Image description method based on visual embedding and condition normalization
Zhong et al. Improving Chinese medical named entity recognition using glyph and lexicon
KR102510645B1 (en) Method for handling out-of-vocabulary problem in hangeul word embeddings, recording medium and system for performing the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant