CN107909115B

CN107909115B - Image Chinese subtitle generating method

Info

Publication number: CN107909115B
Application number: CN201711260141.7A
Authority: CN
Inventors: 王斌; 王剑锋; 周小平; 张倩; 黄继风
Original assignee: Shanghai Normal University
Current assignee: Shanghai Normal University
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2022-02-15
Anticipated expiration: 2037-12-04
Also published as: CN107909115A

Abstract

The invention discloses a method for generating Chinese image subtitles, which comprises the following steps: step one, constructing a training set: collecting images and adding Chinese descriptions with similar meanings to the images manually; training a convolutional neural network to extract image features, and after the convolutional neural network is trained, performing forward propagation operation on the image collected in the step one to obtain semantic features of the image; thirdly, segmenting the Chinese description of each sentence according to the meaning, and constructing a Chinese dictionary; training a cyclic neural network to generate Chinese subtitles; and fifthly, generating image captions, and finishing the image caption generating task of the image captions by sequentially passing the image of the captions to be generated through the convolutional neural network and the cyclic neural network at the testing or using stage.

Description

Image Chinese subtitle generating method

Technical Field

The invention relates to the field of computer vision, machine learning and artificial neural networks, in particular to a method for generating Chinese image subtitles.

Background

In the field of artificial intelligence, a long-standing goal of people is to develop a machine that not only can perceive and understand the rich visual world around us, but also can communicate with us using natural language. In the field of computer vision, a number of different tasks have been implemented, such as image recognition, image localization, image segmentation, etc. The main steps of these task implementations include image feature extraction and training of classifiers. The image feature extraction method mainly comprises the following steps: color Histogram features, image texture features, Histogram of Oriented (HOG) features, Local Binary Pattern (LBP) features, etc., where color Histogram and image texture features are global features of an image, and HOG and LBP are Local features of an image. The classifiers generally include SOFTMAX classifier, SVM classifier, neural network classifier, and ensemble classifier. The implementation of these tasks has greatly facilitated the development of artificial intelligence, but these tasks have been to classify images or parts of images into pre-specified categories or discrete labels.

Image Caption Generation (Image Caption Generation) is to give an Image and let a machine automatically generate a natural language to describe the content of the Image, which is essentially a Visual-to-language (Visual-to-language) problem, and it is simply expected that a computer can give a sentence describing the content of the Image according to the content of the Image. The task of generating image captions requires not only a computer to understand the objects contained in the image, but also to express the relationships between these objects in the correct natural language.

Therefore, those skilled in the art are dedicated to develop a method for generating chinese subtitles from images, which utilizes the local feature information of the images in the initial stage, and establishes the position relationship between the image contents, and associates the semantic information of each word with the local feature of the images; and modeling by using a neural network model with an attention mechanism, wherein the established sequence model generates an attention variable distribution at each moment according to the image characteristic information and the word semantic information, and the variable distribution represents the position information of the image noticed by the model at the moment.

Disclosure of Invention

The invention aims to provide a Chinese image subtitle generating method based on a neural network aiming at the defect that most of the existing computer vision tasks divide images into discrete labels, thereby overcoming the obstacle from the images to the language.

In order to achieve the purpose, the invention provides a method for generating Chinese image subtitles, which comprises the following steps:

step one, constructing a training set: collecting images and adding Chinese descriptions with similar meanings to the images manually;

training a convolutional neural network to extract image features, and after the convolutional neural network is trained, performing forward propagation operation on the image collected in the step one to obtain semantic features of the image;

thirdly, segmenting the Chinese description of each sentence according to the meaning, and constructing a Chinese dictionary;

training a cyclic neural network to generate Chinese subtitles;

and fifthly, generating image captions, and finishing the image caption generating task of the image captions by sequentially passing the image of the captions to be generated through the convolutional neural network and the cyclic neural network at the testing or using stage.

Further, step one selects the Flickr8k image subtitle data set.

Further, the second step adopts 16-layer neural network, including convolution, pooling, activation and other operations, and extracts image features by using the convolutional neural network to obtain features with semantic information.

Further, the 16-layer neural network comprises 13 convolutional layers and 3 fully-connected layers, the activation function of each layer is a Relu function, and a Dropout layer is added after the last three layers.

Further, the second step comprises a data set, wherein the data set adopts an ImageNet data set, an Adadelta gradient descent algorithm is adopted as a training algorithm of the second step, and the network parameters are updated according to the following formula:

w_t+1＝w_t+Δw_t (4)

wherein, w_tParameter values representing the t-th iteration, g representing the gradient, E [ g ]²]Representing the moving average of the square of the gradient g, α is the coefficient for calculating the moving average and is typically taken to be 0.99, η is the learning rate 0.0001, and ε is taken to be a small number to prevent the denominator from being 0.

Furthermore, step four adopts an LSTM network added with Dropout, and adopts a method of randomly setting 0 in different time allowable periods, thereby improving the generalization capability of the model.

Further, step four pairs of conditional probabilities P (S)_t|I,S₀,S₁,…,S_t-1(ii) a Theta) modeling using a fixed length of hidden unit output h in the model_tTo express the conditional probability value of the t time and the hidden unit h of the previous time_t-1And input x at that time_tIn relation to, and therefore, the hidden unit output h_t＝f(h_t-1,x_t)，

Wherein f is a tan h nonlinear function; for an initial value h_-1Then the feature extraction is carried out on the input image I through the convolution neural network of the step three, x_tIt represents a certain vocabulary vector corresponding to each time t. For the problem that each image faces the inequality of Chinese description length, the invention adopts the method of complementing 0 at the tail of the sequence number vector. When the network is trained as well.

Further, when the recurrent neural network is trained, the image features obtained in the second step and the subtitle sequence number vectors generated in the third step are selected through input of each iteration, an Adadelta gradient descent algorithm is adopted in the network weight updating method, and the learning rate is set to be 0.0001.

The technical effects are as follows:

in the initial stage, the local characteristic information of the image is utilized, the position relation among the image contents is also established, and the semantic information of each word is associated with the local characteristic of the image; and modeling by using a neural network model with an attention mechanism, wherein the established sequence model generates an attention variable distribution at each moment according to the image characteristic information and the word semantic information, and the variable distribution represents the position information of the image noticed by the model at the moment.

Drawings

Fig. 1 is a flowchart of a method for generating chinese subtitles from images according to the present invention.

Fig. 2 is an example of image chinese subtitle data of an image chinese subtitle generating method of the present invention.

Fig. 3 is an example of chinese subtitle word segmentation for the image chinese subtitle generating method of the present invention.

FIG. 4 is a comparison between the test image Chinese caption generation result and the actual result of the image Chinese caption generation method of the present invention.

FIG. 5 is a comparison between the test image Chinese caption generation result and the actual result of the image Chinese caption generation method of the present invention.

FIG. 6 is a comparison of CIDER learning curves on Flickr8K CN for the present invention and the conventional method.

FIG. 7 is a comparison of CIDER learning curves on Flickr8K for the present invention and the conventional method.

Table 1 shows a comparison of the results of the experiments conducted in the Flickr8k CN data set according to the present invention and the conventional method.

Detailed Description

The specific embodiment of the invention is a standard dataset Flickr8K and its Chinese version Flickr8K CN. The invention provides a method for generating Chinese image subtitles, which is realized by the following scheme. Firstly, a training set is constructed according to actual requirements in a training stage, images as many as possible are collected, and proper Chinese subtitles are added to each image manually, wherein the data set is used for training a machine to learn how to automatically add the Chinese subtitles to the images according to samples. Next, feature extraction is performed on the images of the training set by training a multi-layer convolutional neural network. Then, semantically segmenting the Chinese subtitles of each image, and constructing a dictionary according to the occurrence frequency of the vocabularies. And finally, modeling the Chinese subtitles by training a cyclic neural network, and learning how to generate the Chinese subtitles according to the image characteristics. In the testing or using stage, for the input image, the convolutional neural network obtained in the training stage is used for extracting the characteristics, and the characteristics are input into the cyclic neural network to obtain the Chinese caption. This model is a discriminant model, i.e. it maximizes the probability of obtaining a correct description sequence S given a certain picture I. The process can be expressed in a formal manner as,

wherein: θ is the parameter to be learned by the model; the first summation is for all pictures I in the training set and their correct description sequence S; the second summation is for each word S in the correct description sequence S_t. According to the bayesian formula, the second summation result represents the logarithm joint probability value of the whole description sequence S under the condition of the given picture I.

As shown in fig. 1, a preferred embodiment of the present invention provides a method for generating chinese subtitles of an image, which includes the following steps:

step one, constructing a training set

According to actual requirements, a plurality of images are collected and a plurality of sentences of Chinese descriptions are added to the images manually. Because of the limitations of the model, the chinese description added to the image requires the selection of words that are as simple as possible and that can directly express the meaning of the image.

The embodiment selects a Flickr8k image caption data set that is closer to daily life, which has about 8000 images in total, most of which show the situation of human and animal participating in a certain activity, as shown in FIG. 2. In order to implement the image Chinese subtitle generation of the present invention, 5 sentences of simple Chinese subtitle descriptions are added to each image, as shown in fig. 2, thereby forming a data set.

Step two, training the convolutional neural network to extract image characteristics

The invention realizes the extraction of the semantic features of the image by utilizing the convolutional neural network. The network needs to be pre-trained on a larger data set before feature extraction can be performed on the data set. The convolutional neural network comprises a series of operations of convolution, pooling, activation and the like, and the convolutional neural network is used for extracting image features, so that the features which have semantic information more than the traditional LBP, HOG and color histogram features can be obtained. The present embodiment trains a 16-layer neural network using the ImageNet database, with top 13The layers are convolutional layers, and the last 3 layers are fully-connected layers, wherein each convolutional layer comprises operations of convolution, activation, pooling and the like. The number of convolution kernels in each three layers is 16, 32, 64, 128 and 256 respectively, and the initialization weight values adopt the mean value of 0 and the variance of 0

Where input _ size represents the dimension of the layer of input data. The last layer of the network is the SOFTMAX classifier, which is used to calculate the probability of each training image for each class. The activation function of each layer is a Relu function, and a Dropout layer is added after the last three layers. The dataset to train the convolutional neural network uses an ImageNet dataset that includes 1000 classes, each class containing perhaps thousands of images. Through experiments, the training method adopts an Adadelta gradient descent algorithm to update network parameters according to the following formula:

w_t+1＝w_t+Δw_t (4)

wherein, w_tRepresenting the value of the parameter for the t-th iteration, g representing its gradient, E [ g²]Representing the moving average of the square of the gradient g, a is the coefficient for calculating the moving average and is typically taken to be 0.99, η is the learning rate 0.0001, and ε is a small number here to prevent the denominator from being 0. During training, the training is stopped when the loss function of the model is not changed much, and the model parameters are kept unchanged in the later steps. Finally, 4096-dimensional output of the second full-connection layer of the model is used for generating the follow-up caption by using the features extracted by the convolutional neural network. Experiments show that the learning rate of each update is 0.0001, and the result of randomly selecting 128 images is better

Step three, dividing words according to meaning for each sentence of Chinese description, and constructing Chinese dictionary

And (3) segmenting words of the Chinese character curbs collected in the step one according to semantics, wherein the words can be segmented by adopting an artificial word segmentation method or word segmentation software, and the result of the artificial word segmentation is more accurate. An example of a correct word segmentation can be shown in fig. 3, where the original sentence is: "a dog plays on the lawn", the word segmentation result is: "one/dog/on/grass/on/play". And finally, after all Chinese description words are segmented, counting all the appeared words and sequencing according to the appearance frequency of the words, wherein the first 2000 words and an unknown word marker < UNK > are taken as dictionaries. Thus, for each sentence, a sequence number vector can be used, which represents the Chinese description in its dimensional space.

Step four, training the recurrent neural network to generate Chinese captions

In the traditional Recurrent Neural Networks (RNNs), in the training process, due to the phenomena of gradient explosion, disappearance and the like, the weight of the sequence unit at the tail end is updated faster, while the weight of the sequence unit at the front end is often not updated effectively, so that the RNN network has poor effect when processing some longer sequences. The Long Short-Term Memory (LSTM) network solves the problems of gradient disappearance, gradient explosion and the like caused by overlong time sequence by adding a Memory unit and a plurality of different gate structures, and obtains better effect on processing the problem of Long-Term dependence. The invention adds a Dropout layer on the traditional LSTM network, which is different from the traditional method in that the Dropout layer is invariable in each time sequence period, and the random 0 setting method is adopted in different time sequence periods as the traditional method, thereby improving the generalization capability of the model. The Cell structure of LSTM has a Cell State (Cell State) that is transferred between time sequences, and several different gate (Gates) structures to control the input, output, and Cell State. These door structures include: input door i_tAnd an output gate o_tForgetting door f_tAnd an input modulation structure g_tAt each time t, the cellularity c of the LSTM network_tAnd hidden layer output h_tCan be found by the following formula:

i_t＝σ(W_ixx_t+W_ihh_t-1+b_i) (6)

f_t＝σ(W_fxx_t+W_fhh_t-1+b_f) (7)

o_t＝σ(W_oxx_t+W_ohh_t-1+b_o) (8)

g_t＝tanh(W_gxx_t+W_ghh_t-1+b_g) (9)

c_t＝f_t⊙c_t-1+i_t⊙g_t (10)

h_t＝o_t⊙tanh(c_t) (11)

wherein x is_tFor input at time t, h_t-1For the output of the hidden layer unit at the previous time, σ (x) is 1/(1+ e)^-x) Is a sigmoid function, and tanh (x) ═ (e ^ x-e ^ (-x))/(e ^ x + e ^ (-x)) is a hyperbolic tangent function, W_ix、W_fx、W_ox、W_gx、W_ih、W_fh、W_oh、W_ghAnd b_i、b_f、b_o、b_gFor the parameters to be learned for the model, they do not change as time t changes, and the symbol &representsthe multiplication of the corresponding elements of the matrix. Then, a Drop layer is added behind each hidden layer to construct a Drop-LSTM network, namely, the hidden layer is output h at each time t_tMultiply by the same 0-1 random matrix that is the same shape:

h_t＝h_t⊙m_h

wherein m is_hRepresenting a random matrix, which can be generated by subjecting each element thereof to a binary distribution of 0-1 with a probability p, typically 0.5, m_hIt does not change with the time t, and is a constant value in the same time sequence. Finally, the features extracted by the convolutional neural network and the corresponding Chinese description sequence number matrix are used as input, and the convolutional neural network is trained according to the method for training the convolutional neural networkThe network is trained to learn how to automatically generate subtitles.

Step five, generating Chinese image subtitles by using the model

And based on the trained image Chinese caption generation model, for each image to be subjected to caption generation, sequentially extracting the characteristics of the image through a convolutional neural network, inputting the characteristics into a cyclic neural network, and automatically generating the corresponding Chinese caption by the cyclic neural network by utilizing the vocabulary in the dictionary constructed in the step three. In order to verify the effectiveness of the method of the invention, verification was performed on specific examples.

As shown in fig. 4, the test image chinese subtitle generating result of the image chinese subtitle generating method of the present invention is compared with the real result. The images are selected from a test set of Flickr8k data sets, and Chinese and English reference captions are provided by the respective data sets. As shown in fig. 4, both chinese subtitles and english subtitles generated for the test image describe the subject of the image and its motion well. FIGS. 5 and 6 are comparative plots of CIDER learning curves on Flickr8K and Flickr8K CN for the method of the present invention and the conventional method. The CIDER is an evaluation index of the image subtitle generating task. As shown in fig. 5 and 6, the model of the present invention generates chinese subtitles and english subtitles with a significantly higher effect than the conventional non-attentive basic model.

Table 1 comparison of experimental results of the present invention and the conventional method on the Flickr8k CN data set

Table 1 shows the comparison of the results of the experiments of the model of the present invention and the two conventional models in the database Flickr8k CN. Wherein, Baseline and CS-NIC are two common caption generating traditional models; BLEU, ROUGE-L and CIDER are three evaluation indexes of the image Chinese subtitle generating task, and the higher the values of the three indexes are, the better the subtitle generating task effect is. As seen from the table, the attention model of the present invention is higher than both the Baseline reference model and the CS-NIC conventional model in all indexes.

The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims

1. A method for generating Chinese image subtitles is characterized by comprising the following steps:

step one, constructing a training set: collecting images and adding Chinese descriptions with similar meanings to the images manually; selecting a Flickr8k image subtitle data set, wherein each image in the original data set is correspondingly marked with 5 English sentences, and 5 simple Chinese subtitle descriptions are added to each image;

training a convolutional neural network to extract image features, wherein the network needs to be pre-trained on a larger data set before feature extraction is carried out on the data set; training a 16-layer neural network by using an ImageNet database, wherein the first 13 layers are convolutional layers, and the last 3 layers are fully-connected layers, wherein each convolutional layer comprises convolution, activation and pooling operations; the number of convolution kernels in each three layers is 16, 32, 64, 128 and 256 respectively, and the initialization weight values adopt the mean value of 0 and the variance of 0

Initializing the gaussian distribution of (1), wherein input _ size represents the dimension of the input data of the layer; the last layer of the network is a SOFTMAX classifier used for calculating the probability of each training image corresponding to each category; the activation function of each layer is a Relu function, and a Dropout layer is added after the last three layers; the data set for training the convolutional neural network adopts an ImageNet data set, the data set comprises 1000 categories, and each category comprises thousands of images; the training method adopts an Adadelta gradient descent algorithm and updates network parameters according to the following formula:

w_t+1＝w_t+Δw_t (3)

wherein, w_tRepresenting the value of the parameter for the t-th iteration, g representing its gradient, E [ g²]Represents the moving average of the g square of the gradient, E [ g²]_tRepresents the moving average of the square of the gradient g of the t iteration, alpha is the coefficient for calculating the moving average, and is taken as 0.99, delta w_tRepresenting the parameter variation value of the t iteration, eta is the learning rate and takes 0.0001, and epsilon is a very small number and prevents the denominator from being 0; g_tThe model is a modulation structure and represents the gradient of the t iteration, the training is stopped when the loss function of the model is not changed greatly during the training, and the model parameters are kept unchanged in the following steps; finally, 4096-dimensional output of the second full-connection layer of the model is used for generating follow-up subtitles according to the characteristics extracted by the convolutional neural network;

after the convolutional neural network is trained, carrying out forward propagation operation on the image collected in the first step to obtain semantic features of the image;

thirdly, segmenting the Chinese description of each sentence according to semantics and constructing a Chinese dictionary; after all Chinese descriptions are segmented, counting all appeared vocabularies, sequencing according to the occurrence frequency of the vocabularies, and taking the first 2000 vocabularies and an unknown vocabulary marker < UNK > as a dictionary;

training a cyclic neural network to generate Chinese subtitles; on the traditional LSTM network, a Dropout layer is added, which is different from the traditional method that the Dropout layer is invariable in each time sequence period and adopts a method of randomly setting 0 in different time sequence periods in the same way as the traditional method, thereby improving the generalization capability of the model; the unit structure of LSTM has a cellStates are transferred between time sequences, and several different gate Gates structures control input, output, and cell states; these door structures include: input door i_tAnd an output gate o_tForgetting door f_tAnd an input modulation structure g_tAt the t-th iteration, the cell state c of the LSTM network_tAnd hidden layer output h_tThe following formula was used:

i_t＝σ(W_ixx_t+W_ihh_t-1+b_i) (4)

f_t＝σ(W_fxx_t+W_fhh_t-1+b_f) (5)

o_t＝σ(W_oxx_t+W_ohh_t-1+b_o) (6)

g_t＝tanh(W_gxx_t+W_ghh_t-1+b_g) (7)

c_t＝f_t⊙c_t-1+i_t⊙g_t (8)

h_t＝o_t⊙tanh(c_t) (9)

wherein x is_tAs input for the t-th iteration, h_t-1For the output of the hidden layer unit of t-1 iterations, σ (x) is 1/(1+ e)^-x) Is a sigmoid function, and tanh (x) ═ (e ^ x-e ^ (-x))/(e ^ x + e ^ (-x)) is a hyperbolic tangent function, W_ix、W_fx、W_ox、W_gx、W_ih、W_fh、W_oh、W_ghAnd b_i、b_f、b_o、b_gFor the parameters to be learned by the model, the parameters do not change along with the change of the iteration number t, and a symbol [ ] represents the multiplication of corresponding elements of the matrix; then, adding a Drop layer behind each hidden layer to construct a Drop-LSTM network, namely outputting h from the hidden layer every iteration t_tMultiply by the same 0-1 random matrix that is the same shape:

h_t＝h_t⊙m_h

wherein m is_hRepresenting a random matrix, the matrix being generated by letting each of themThe elements are generated according to a 0-1 binary distribution with probability p, p being 0.5, m_hThe time sequence does not change along with the change of the iteration times t, and is a constant value in the same time sequence; finally, the features extracted by the convolutional neural network and the corresponding Chinese description sequence number matrix are used as input, and the network is trained according to the method for training the convolutional neural network, so that the network learns how to automatically generate the subtitles;

2. The method for generating Chinese subtitles in images according to claim 1, wherein the step of four pairs of conditional probabilities P (S)_t|I，S₀，S₁，...，S_t-1(ii) a Theta) is modeled, where theta is all parameters to be learned by the model, and h is output by the hidden layer unit in the model_tTo express the conditional probability, h, of the t-th iteration_tIs fixed; hidden unit h with last time_t-1And input x at that time_tIn relation to, and therefore, the hidden unit output h_t＝f(h_t-1，x_t)，

Wherein f is a tan h nonlinear function; for an initial value h_-1Then the feature extraction is carried out on the input image I through the convolution neural network of the second step_tIt represents a certain vocabulary vector corresponding to each time t.

3. The method as claimed in claim 2, wherein when the recurrent neural network is trained, the input of each iteration selects the image feature obtained in the second step and the subtitle sequence number vector generated in the third step, the network weight updating method adopts an adapelta gradient descent algorithm, and the learning rate is set to 0.0001.