CN107909115B - Image Chinese subtitle generating method - Google Patents

Image Chinese subtitle generating method

Info

Publication number
CN107909115B
Authority
CN
China
Prior art keywords
image
neural network
chinese
training
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711260141.7A
Other languages
Chinese (zh)
Other versions
CN107909115A (en)
Inventor
王斌
王剑锋
周小平
张倩
黄继风
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Normal University
Original Assignee
Shanghai Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Normal University filed Critical Shanghai Normal University
Priority to CN201711260141.7A priority Critical patent/CN107909115B/en
Publication of CN107909115A publication Critical patent/CN107909115A/en
Application granted granted Critical
Publication of CN107909115B publication Critical patent/CN107909115B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/28: Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for generating Chinese image subtitles, which comprises the following steps. Step one, construct a training set: collect images and manually add Chinese descriptions with similar meanings to them. Step two, train a convolutional neural network to extract image features; after the convolutional neural network is trained, perform a forward-propagation operation on the images collected in step one to obtain their semantic features. Step three, segment each sentence of Chinese description into words according to meaning, and construct a Chinese dictionary. Step four, train a recurrent neural network to generate Chinese subtitles. Step five, generate image subtitles: at the testing or usage stage, pass each image whose subtitle is to be generated through the convolutional neural network and the recurrent neural network in sequence to complete the subtitle-generation task.

Description

Image Chinese subtitle generating method
Technical Field
The invention relates to the fields of computer vision, machine learning, and artificial neural networks, and in particular to a method for generating Chinese image subtitles.
Background
In the field of artificial intelligence, a long-standing goal has been to build machines that not only perceive and understand the rich visual world around us but can also communicate with us in natural language. In computer vision, a number of different tasks have been realized, such as image recognition, image localization, and image segmentation. Implementing these tasks mainly involves image feature extraction and classifier training. Common image feature extraction methods include color histogram features, image texture features, Histogram of Oriented Gradients (HOG) features, and Local Binary Pattern (LBP) features; color histograms and texture features are global features of an image, while HOG and LBP are local features. Commonly used classifiers include the SOFTMAX classifier, the SVM classifier, neural network classifiers, and ensemble classifiers. These tasks have greatly advanced artificial intelligence, but all of them reduce images, or parts of images, to pre-specified categories or discrete labels.
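For reference, the two local descriptors named above can be computed with scikit-image; the following sketch is illustrative, with a stock sample image and parameter values that are common defaults rather than ones prescribed here.

```python
import numpy as np
from skimage import color, data
from skimage.feature import hog, local_binary_pattern

img = color.rgb2gray(data.astronaut())  # any grayscale test image

# HOG: histograms of gradient orientations over local cells (a local feature)
hog_vec = hog(img, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2))

# LBP: per-pixel binary pattern codes, summarized here as a global histogram
lbp = local_binary_pattern(img, P=8, R=1.0, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=np.arange(11), density=True)

print(hog_vec.shape, lbp_hist.shape)
```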
Image Caption Generation gives a machine an image and asks it to automatically generate natural language describing the image's content; it is essentially a visual-to-language problem. Put simply, the computer is expected to produce a sentence that describes the content of a given image. The task requires the computer not only to understand the objects contained in the image but also to express the relationships between those objects in correct natural language.
Therefore, those skilled in the art are dedicated to developing a method for generating Chinese subtitles from images that, from the initial stage, exploits the local feature information of the images, establishes the positional relationships between image contents, and associates the semantic information of each word with local image features. The method models with a neural network that carries an attention mechanism: at each moment, the sequence model generates an attention variable distribution from the image feature information and the word semantic information, and this distribution represents the image positions the model attends to at that moment.
Disclosure of Invention
The invention aims to provide a neural-network-based method for generating Chinese image subtitles, addressing the shortcoming that most existing computer vision tasks reduce images to discrete labels, and thereby overcoming the obstacle between images and language.
In order to achieve the purpose, the invention provides a method for generating Chinese image subtitles, which comprises the following steps:
step one, constructing a training set: collecting images and adding Chinese descriptions with similar meanings to the images manually;
step two, training a convolutional neural network to extract image features; after the convolutional neural network is trained, performing a forward-propagation operation on the images collected in step one to obtain their semantic features;
step three, segmenting each sentence of Chinese description into words according to meaning, and constructing a Chinese dictionary;
step four, training a recurrent neural network to generate Chinese subtitles;
and step five, generating image subtitles: at the testing or usage stage, passing each image whose subtitle is to be generated through the convolutional neural network and the recurrent neural network in sequence to complete the subtitle-generation task.
Further, step one selects the Flickr8k image subtitle data set.
Further, step two adopts a 16-layer neural network comprising convolution, pooling, activation, and other operations, and uses this convolutional neural network to extract image features, obtaining features that carry semantic information.
Further, the 16-layer neural network comprises 13 convolutional layers and 3 fully-connected layers; the activation function of each layer is the ReLU function, and a Dropout layer is added after each of the last three layers.
Further, the training data set of step two is the ImageNet data set; the training algorithm of step two is the Adadelta gradient descent algorithm, and the network parameters are updated according to the following formulas:
E[g²]_t = α·E[g²]_{t-1} + (1 - α)·g_t²   (2)

Δw_t = -(η / √(E[g²]_t + ε))·g_t   (3)

w_{t+1} = w_t + Δw_t   (4)

where w_t denotes the parameter values at the t-th iteration, g_t denotes the gradient, E[g²] denotes the moving average of the squared gradient, α is the coefficient of the moving average and is typically taken as 0.99, η is the learning rate and takes 0.0001, and ε is a small number that prevents the denominator from being 0.
Furthermore, step four adopts an LSTM network with Dropout added, in which the Dropout mask is held fixed across the time steps of a sequence instead of being randomly re-sampled at each step, thereby improving the generalization capability of the model.
Further, in step four the conditional probability P(S_t | I, S_0, S_1, …, S_{t-1}; θ) is modeled; a hidden-unit output h_t of fixed length in the model expresses the conditional probability value at time t. h_t is related to the hidden unit h_{t-1} of the previous time and to the input x_t at the current time, and therefore the hidden-unit output is h_t = f(h_{t-1}, x_t),
where f is a tanh nonlinear function; the initial value h_{-1} is obtained by extracting features from the input image I with the convolutional neural network of step two, and x_t denotes the vocabulary vector corresponding to each time t. Because the Chinese descriptions of different images have unequal lengths, the invention pads the tail of each sequence-number vector with zeros; the same padding is used when the network is trained.
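A minimal sketch of this tail zero-padding, with made-up lengths and word indices:

```python
import numpy as np

def pad_captions(seqs, max_len):
    """Pad caption sequence-number vectors with trailing zeros."""
    out = np.zeros((len(seqs), max_len), dtype=np.int64)
    for i, s in enumerate(seqs):
        n = min(len(s), max_len)
        out[i, :n] = s[:n]
    return out

# two captions of unequal length become one aligned 2 x 6 matrix
print(pad_captions([[5, 12, 7], [3, 9, 14, 2, 8]], max_len=6))
```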
Further, when the recurrent neural network is trained, the input of each iteration selects the image features obtained in step two and the subtitle sequence-number vectors generated in step three; the network weights are updated with the Adadelta gradient descent algorithm, and the learning rate is set to 0.0001.
The technical effects are as follows:
From the initial stage, the method exploits the local feature information of the image, establishes the positional relationships between image contents, and associates the semantic information of each word with local image features. It models with a neural network that carries an attention mechanism: at each moment, the sequence model generates an attention variable distribution from the image feature information and the word semantic information, and this distribution represents the image positions the model attends to at that moment.
Drawings
Fig. 1 is a flowchart of the image Chinese subtitle generating method of the present invention.
Fig. 2 is an example of image Chinese subtitle data for the image Chinese subtitle generating method of the present invention.
Fig. 3 is an example of Chinese subtitle word segmentation for the image Chinese subtitle generating method of the present invention.
FIG. 4 is a comparison between the test image Chinese caption generation result and the actual result of the image Chinese caption generation method of the present invention.
FIG. 5 is a comparison between the test image Chinese caption generation result and the actual result of the image Chinese caption generation method of the present invention.
FIG. 6 is a comparison of CIDEr learning curves on Flickr8K CN between the present invention and the conventional method.
FIG. 7 is a comparison of CIDEr learning curves on Flickr8K between the present invention and the conventional method.
Table 1 shows a comparison of the experimental results of the present invention and the conventional method on the Flickr8k CN data set.
Detailed Description
The specific embodiment of the invention uses the standard dataset Flickr8K and its Chinese version Flickr8K CN. The method for generating Chinese image subtitles provided by the invention is realized by the following scheme. First, in the training stage, a training set is constructed according to actual requirements: as many images as possible are collected and suitable Chinese subtitles are added to each image manually; this data set is used to teach the machine, by example, how to add Chinese subtitles to images automatically. Next, feature extraction is performed on the training-set images by training a multi-layer convolutional neural network. Then, the Chinese subtitles of each image are segmented semantically into words, and a dictionary is constructed according to word frequency. Finally, a recurrent neural network is trained to model the Chinese subtitles and to learn how to generate them from image features. In the testing or usage stage, features are extracted from each input image with the convolutional neural network obtained in training and fed into the recurrent neural network to obtain the Chinese caption. The model is discriminative: it maximizes the probability of the correct description sequence S given a picture I. The process can be expressed formally as
θ* = argmax_θ Σ_{(I,S)} Σ_t log P(S_t | I, S_0, S_1, …, S_{t-1}; θ)   (1)
where θ is the parameter set to be learned by the model; the first summation is over all pictures I in the training set together with their correct description sequences S; the second summation is over each word S_t in the correct description sequence S. By the Bayes formula, the second summation represents the log joint probability of the whole description sequence S conditioned on the given picture I.
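To make the objective concrete, the sketch below evaluates the inner summation, the log joint probability of one description sequence given its image, from per-step softmax outputs; the array names and the dummy uniform model are illustrative.

```python
import numpy as np

def caption_log_prob(step_probs, word_ids):
    """Log joint probability of one caption given its image.

    step_probs: (T, V) array whose row t is the model's softmax over the
    dictionary at step t, conditioned on I and S_0..S_{t-1}.
    word_ids: the T correct word indices S_t.
    """
    return float(sum(np.log(step_probs[t, w]) for t, w in enumerate(word_ids)))

# training then seeks the theta maximizing the sum of this over all (I, S)
T, V = 4, 2001                       # 2000 dictionary words plus <UNK>
probs = np.full((T, V), 1.0 / V)     # a uniform dummy model
print(caption_log_prob(probs, [5, 12, 7, 3]))
```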
As shown in FIG. 1, a preferred embodiment of the present invention provides a method for generating Chinese subtitles of an image, which comprises the following steps:
step one, constructing a training set
According to actual requirements, a number of images are collected and several sentences of Chinese description are added to each image manually. Because of the limitations of the model, the Chinese descriptions should use words that are as simple as possible and that directly express the meaning of the image.
This embodiment selects the Flickr8k image caption data set, which is close to daily life: it has about 8000 images in total, most of which show people or animals taking part in some activity, as shown in FIG. 2. To implement the image Chinese subtitle generation of the present invention, 5 simple Chinese subtitle descriptions are added to each image, as shown in FIG. 2, thereby forming the data set.
Step two, training the convolutional neural network to extract image characteristics
The invention uses a convolutional neural network to extract the semantic features of images. The network must be pre-trained on a larger data set before it can perform feature extraction on the caption data set. A convolutional neural network comprises a series of operations such as convolution, pooling, and activation; extracting image features with it yields features that carry more semantic information than the traditional LBP, HOG, and color-histogram features. This embodiment trains a 16-layer neural network on the ImageNet database: the first 13 layers are convolutional layers and the last 3 layers are fully-connected layers, where each convolutional layer comprises convolution, activation, and pooling operations. The numbers of convolution kernels, changing every three layers, are 16, 32, 64, 128, and 256 respectively, and the weights are initialized from a Gaussian distribution with mean 0 and variance

σ² = 2 / input_size   (5)

where input_size denotes the dimension of the layer's input data. The last layer of the network is a SOFTMAX classifier, used to compute the probability of each training image belonging to each class. The activation function of each layer is the ReLU function, and a Dropout layer is added after each of the last three layers. The data set for training the convolutional neural network is the ImageNet data set, which includes 1000 classes, each containing thousands of images. Based on experiments, the training method adopts the Adadelta gradient descent algorithm and updates the network parameters according to the following formulas:
E[g²]_t = α·E[g²]_{t-1} + (1 - α)·g_t²   (2)

Δw_t = -(η / √(E[g²]_t + ε))·g_t   (3)

w_{t+1} = w_t + Δw_t   (4)

where w_t denotes the parameter values at the t-th iteration, g_t denotes the gradient, E[g²] denotes the moving average of the squared gradient, α is the coefficient of the moving average and is typically taken as 0.99, η is the learning rate and takes 0.0001, and ε is a small number that prevents the denominator from being 0. During training, training stops once the loss function of the model no longer changes much, and the model parameters are then kept fixed in the later steps. Finally, the 4096-dimensional output of the model's second fully-connected layer serves as the feature extracted by the convolutional neural network for subsequent caption generation. Experiments show that a learning rate of 0.0001 per update, with 128 randomly selected images per iteration, gives better results.
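A minimal NumPy sketch of the update in formulas (2) to (4); note that with a fixed learning rate η the rule, as written, behaves like an RMSProp-style variant of Adadelta, and the ε value below is an assumed choice.

```python
import numpy as np

def adadelta_style_update(w, g, Eg2, alpha=0.99, eta=1e-4, eps=1e-8):
    """One parameter update following formulas (2)-(4)."""
    Eg2 = alpha * Eg2 + (1.0 - alpha) * g ** 2   # (2) running average E[g^2]
    dw = -eta * g / np.sqrt(Eg2 + eps)           # (3) scaled step
    return w + dw, Eg2                           # (4) updated parameters

# one iteration over a dummy 10-parameter layer; in the method each
# iteration would use the gradient of a randomly chosen 128-image batch
w, Eg2 = np.zeros(10), np.zeros(10)
w, Eg2 = adadelta_style_update(w, np.random.randn(10), Eg2)
print(w[:3])
```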
Step three, segmenting each sentence of Chinese description into words by meaning, and constructing the Chinese dictionary
The Chinese captions collected in step one are segmented into words according to semantics; either manual word segmentation or word-segmentation software can be used, with manual segmentation giving the more accurate result. An example of correct word segmentation is shown in FIG. 3: the original sentence is "a dog plays on the lawn", and the segmentation result is "one / dog / on / grass / on / play". Finally, after all Chinese descriptions have been segmented, all words that appear are counted and sorted by frequency of occurrence, and the first 2000 words plus an unknown-word marker <UNK> are taken as the dictionary. Each sentence can thus be represented by a sequence-number vector, which expresses the Chinese description in the dictionary's index space.
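The dictionary construction of this step can be sketched as follows; reserving index 0 for the zero padding, and the particular Chinese tokens standing in for the FIG. 3 example, are assumptions of the sketch.

```python
from collections import Counter

def build_dictionary(segmented_captions, vocab_size=2000):
    """Top vocab_size words by frequency, plus the <UNK> marker."""
    counts = Counter(w for cap in segmented_captions for w in cap)
    words = [w for w, _ in counts.most_common(vocab_size)] + ["<UNK>"]
    return {w: i + 1 for i, w in enumerate(words)}  # index 0 kept for padding

def to_sequence_numbers(caption, word_to_id):
    """Map a segmented caption to its sequence-number vector."""
    unk = word_to_id["<UNK>"]
    return [word_to_id.get(w, unk) for w in caption]

# "one / dog / on / grass / on / play", per the FIG. 3 example
caps = [["一只", "狗", "在", "草地", "上", "玩耍"]]
d = build_dictionary(caps)
print(to_sequence_numbers(caps[0], d))
```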
Step four, training the recurrent neural network to generate Chinese captions
In traditional Recurrent Neural Networks (RNNs), gradient explosion and gradient vanishing during training cause the weights of units at the end of a sequence to be updated faster, while the weights of units at the front often fail to be updated effectively, so RNNs perform poorly on longer sequences. The Long Short-Term Memory (LSTM) network solves the gradient-vanishing and gradient-explosion problems caused by overly long sequences by adding a memory unit and several different gate structures, and achieves better results on long-term dependence problems. The invention adds a Dropout layer to the traditional LSTM network; differing from the traditional method, the Dropout mask is kept unchanged within each time sequence instead of being randomly re-zeroed at every time step as in the traditional method, which improves the generalization capability of the model. The LSTM cell structure has a Cell State that is passed along the time sequence, and several different gate (Gates) structures that control the input, the output, and the cell state. These gate structures include: the input gate i_t, the output gate o_t, the forget gate f_t, and the input modulation structure g_t. At each time t, the cell state c_t and the hidden-layer output h_t of the LSTM network can be found by the following formulas:
i_t = σ(W_ix·x_t + W_ih·h_{t-1} + b_i)   (6)
f_t = σ(W_fx·x_t + W_fh·h_{t-1} + b_f)   (7)
o_t = σ(W_ox·x_t + W_oh·h_{t-1} + b_o)   (8)
g_t = tanh(W_gx·x_t + W_gh·h_{t-1} + b_g)   (9)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (10)
h_t = o_t ⊙ tanh(c_t)   (11)
where x_t is the input at time t, h_{t-1} is the output of the hidden-layer unit at the previous time, σ(x) = 1/(1 + e^(-x)) is the sigmoid function, tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)) is the hyperbolic tangent function, and W_ix, W_fx, W_ox, W_gx, W_ih, W_fh, W_oh, W_gh and b_i, b_f, b_o, b_g are the parameters to be learned by the model; they do not change as time t changes. The symbol ⊙ represents the multiplication of corresponding elements of the matrices. Then, a Dropout layer is added after each hidden layer to construct the Drop-LSTM network; that is, at each time t the hidden-layer output h_t is multiplied by the same 0-1 random matrix of the same shape:

h_t = h_t ⊙ m_h

where m_h represents the random matrix, generated by drawing each of its elements from a 0-1 binary distribution with probability p, typically 0.5; m_h does not change with time t and is a constant within the same time sequence. Finally, the features extracted by the convolutional neural network and the corresponding Chinese-description sequence-number matrices are taken as input, and the network is trained following the same procedure used to train the convolutional neural network, so that it learns how to automatically generate subtitles.
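A self-contained NumPy sketch of one Drop-LSTM step, implementing formulas (6) to (11) followed by the fixed-mask product; the hidden size, input size, and weight container are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def drop_lstm_step(x_t, h_prev, c_prev, P, m_h):
    """One Drop-LSTM step: formulas (6)-(11), then the fixed-mask product."""
    i = sigmoid(P["Wix"] @ x_t + P["Wih"] @ h_prev + P["bi"])  # input gate  (6)
    f = sigmoid(P["Wfx"] @ x_t + P["Wfh"] @ h_prev + P["bf"])  # forget gate (7)
    o = sigmoid(P["Wox"] @ x_t + P["Woh"] @ h_prev + P["bo"])  # output gate (8)
    g = np.tanh(P["Wgx"] @ x_t + P["Wgh"] @ h_prev + P["bg"])  # modulation  (9)
    c = f * c_prev + i * g                                     # cell state (10)
    h = o * np.tanh(c)                                         # hidden out (11)
    return h * m_h, c                                          # h_t = h_t . m_h

H, D = 256, 128                          # hypothetical hidden and input sizes
rng = np.random.default_rng(0)
P = {}
for gate in "ifog":
    P[f"W{gate}x"] = 0.01 * rng.standard_normal((H, D))
    P[f"W{gate}h"] = 0.01 * rng.standard_normal((H, H))
    P[f"b{gate}"] = np.zeros(H)

m_h = (rng.random(H) < 0.5).astype(float)     # mask sampled once per sequence
h, c = np.zeros(H), np.zeros(H)
h, c = drop_lstm_step(rng.standard_normal(D), h, c, P, m_h)
```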
Step five, generating Chinese image subtitles by using the model
Based on the trained image Chinese caption generation model, each image whose caption is to be generated first has its features extracted by the convolutional neural network; the features are then input into the recurrent neural network, which automatically generates the corresponding Chinese caption using the vocabulary of the dictionary constructed in step three. To verify the effectiveness of the method of the invention, verification was performed on specific examples.
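A hedged sketch of this test-time pipeline, reusing drop_lstm_step from the sketch above and greedily taking the most probable dictionary word at each step; every identifier here (W_img, start_id, end_id, and so on) is hypothetical, since the patent does not name them, and the start/end tokens themselves are an assumption of the sketch.

```python
import numpy as np

def generate_caption(img_feat, P, W_img, embed, W_out, id_to_word,
                     start_id, end_id, max_len=20):
    """Greedy decoding: the 4096-d CNN feature seeds h_{-1}; at each step
    the most probable dictionary word is taken until <END> or max_len.

    Shapes assumed: W_img (H, 4096), embed (V, D), W_out (V, H).
    """
    h = np.tanh(W_img @ img_feat)         # h_{-1} from the image feature
    c = np.zeros_like(h)
    m_h = np.ones_like(h)                 # Dropout is disabled at test time
    x, words = embed[start_id], []
    for _ in range(max_len):
        h, c = drop_lstm_step(x, h, c, P, m_h)   # cell from the sketch above
        w = int(np.argmax(W_out @ h))            # greedy word choice
        if w == end_id:
            break
        words.append(id_to_word[w])
        x = embed[w]
    return "".join(words)
```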
As shown in FIG. 4, the Chinese subtitle generation results of the present method on test images are compared with the real results. The images are selected from the test set of the Flickr8k data set, and the Chinese and English reference captions are provided by the respective data sets. As shown in FIG. 4, both the Chinese subtitles and the English subtitles generated for the test images describe the subject of the image and its motion well. FIGS. 6 and 7 are comparative plots of the CIDEr learning curves of the method of the present invention and the conventional method on Flickr8K CN and Flickr8K, respectively. CIDEr is an evaluation index for the image caption generation task. As shown in FIGS. 6 and 7, the model of the present invention generates Chinese and English subtitles with markedly better results than the conventional attention-free baseline model.
Table 1 comparison of experimental results of the present invention and the conventional method on the Flickr8k CN data set
Table 1 compares the experimental results of the model of the present invention and two conventional models on the Flickr8k CN database. Baseline and CS-NIC are two common conventional caption-generation models; BLEU, ROUGE-L, and CIDEr are three evaluation indexes of the image Chinese subtitle generation task, and higher values of these three indexes indicate better caption generation. As seen from the table, the attention model of the present invention is higher than both the Baseline reference model and the conventional CS-NIC model on all indexes.
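For reproducing this kind of comparison, BLEU can be computed with NLTK as below (the tokens are illustrative); ROUGE-L and CIDEr need their own scorers, for example those bundled in the pycocoevalcap toolkit.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# the five human captions per image would serve as references; one model output
references = [["一只", "狗", "在", "草地", "上", "玩耍"]]  # illustrative
candidate = ["一只", "狗", "在", "草地", "上", "奔跑"]

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```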
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (3)

1. A method for generating Chinese image subtitles is characterized by comprising the following steps:
step one, constructing a training set: collecting images and adding Chinese descriptions with similar meanings to the images manually; selecting a Flickr8k image subtitle data set, wherein each image in the original data set is correspondingly marked with 5 English sentences, and 5 simple Chinese subtitle descriptions are added to each image;
training a convolutional neural network to extract image features, wherein the network needs to be pre-trained on a larger data set before feature extraction is carried out on the caption data set; a 16-layer neural network is trained using the ImageNet database, wherein the first 13 layers are convolutional layers and the last 3 layers are fully-connected layers, each convolutional layer comprising convolution, activation, and pooling operations; the numbers of convolution kernels, changing every three layers, are 16, 32, 64, 128, and 256 respectively, and the weights are initialized from a Gaussian distribution with mean 0 and variance

σ² = 2 / input_size

wherein input_size represents the dimension of the layer's input data; the last layer of the network is a SOFTMAX classifier used to calculate the probability of each training image corresponding to each category; the activation function of each layer is the ReLU function, and a Dropout layer is added after each of the last three layers; the data set for training the convolutional neural network is the ImageNet data set, which includes 1000 categories, each containing thousands of images; the training method adopts the Adadelta gradient descent algorithm and updates the network parameters according to the following formulas:
E[g²]_t = α·E[g²]_{t-1} + (1 - α)·g_t²   (1)

Δw_t = -(η / √(E[g²]_t + ε))·g_t   (2)

w_{t+1} = w_t + Δw_t   (3)

wherein w_t represents the parameter values of the t-th iteration, g represents the gradient, E[g²] represents the moving average of the squared gradient, E[g²]_t represents that moving average at the t-th iteration, α is the coefficient for calculating the moving average and is taken as 0.99, Δw_t represents the parameter change of the t-th iteration, η is the learning rate and takes 0.0001, ε is a very small number that prevents the denominator from being 0, and g_t represents the gradient of the t-th iteration; during training, the training is stopped when the loss function of the model no longer changes much, and the model parameters are kept unchanged in the following steps; finally, the 4096-dimensional output of the model's second fully-connected layer is used as the feature extracted by the convolutional neural network for subsequent subtitle generation;
after the convolutional neural network is trained, carrying out a forward-propagation operation on the images collected in step one to obtain the semantic features of the images;
step three, segmenting each sentence of Chinese description into words according to semantics and constructing a Chinese dictionary; after all Chinese descriptions are segmented, counting all words that appear, sorting them by frequency of occurrence, and taking the first 2000 words plus an unknown-word marker <UNK> as the dictionary;
training a recurrent neural network to generate Chinese subtitles; a Dropout layer is added to the traditional LSTM network, differing from the traditional method in that the Dropout mask is kept unchanged within each time sequence instead of being randomly re-zeroed at every time step as in the traditional method, thereby improving the generalization capability of the model; the LSTM cell structure has a cell state that is passed along the time sequence, and several different gate (Gates) structures that control the input, the output, and the cell state; these gate structures include: the input gate i_t, the output gate o_t, the forget gate f_t, and the input modulation structure g_t; at the t-th iteration, the cell state c_t and hidden-layer output h_t of the LSTM network are given by the following formulas:
i_t = σ(W_ix·x_t + W_ih·h_{t-1} + b_i)   (4)
f_t = σ(W_fx·x_t + W_fh·h_{t-1} + b_f)   (5)
o_t = σ(W_ox·x_t + W_oh·h_{t-1} + b_o)   (6)
g_t = tanh(W_gx·x_t + W_gh·h_{t-1} + b_g)   (7)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t   (8)
h_t = o_t ⊙ tanh(c_t)   (9)
wherein x_t is the input of the t-th iteration, h_{t-1} is the output of the hidden-layer unit at iteration t-1, σ(x) = 1/(1 + e^(-x)) is the sigmoid function, tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)) is the hyperbolic tangent function, and W_ix, W_fx, W_ox, W_gx, W_ih, W_fh, W_oh, W_gh and b_i, b_f, b_o, b_g are the parameters to be learned by the model, which do not change as the iteration number t changes; the symbol ⊙ represents the multiplication of corresponding elements of the matrices; then, a Dropout layer is added after each hidden layer to construct the Drop-LSTM network, that is, at each iteration t the hidden-layer output h_t is multiplied by the same 0-1 random matrix of the same shape:

h_t = h_t ⊙ m_h

wherein m_h represents a random matrix generated by drawing each of its elements from a 0-1 binary distribution with probability p, p being 0.5; m_h does not change with the iteration number t and is a constant within the same time sequence; finally, the features extracted by the convolutional neural network and the corresponding Chinese-description sequence-number matrices are taken as input, and the network is trained according to the method for training the convolutional neural network, so that it learns how to automatically generate subtitles;
and step five, generating image subtitles: at the testing or usage stage, passing each image whose subtitle is to be generated through the convolutional neural network and the recurrent neural network in sequence to complete the subtitle-generation task.
2. The method for generating Chinese image subtitles according to claim 1, wherein in step four the conditional probability P(S_t | I, S_0, S_1, ..., S_{t-1}; θ) is modeled, where θ denotes all parameters to be learned by the model; a hidden-unit output h_t of fixed length in the model expresses the conditional probability of the t-th iteration; h_t is related to the hidden unit h_{t-1} of the previous time and to the input x_t at the current time, and therefore the hidden-unit output is h_t = f(h_{t-1}, x_t),
wherein f is a tanh nonlinear function; the initial value h_{-1} is obtained by extracting features from the input image I through the convolutional neural network of step two; and x_t represents the vocabulary vector corresponding to each time t.
3. The method as claimed in claim 2, wherein when the recurrent neural network is trained, the input of each iteration selects the image features obtained in step two and the subtitle sequence-number vectors generated in step three; the network weight updating method adopts the Adadelta gradient descent algorithm, and the learning rate is set to 0.0001.
CN201711260141.7A 2017-12-04 2017-12-04 Image Chinese subtitle generating method Active CN107909115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711260141.7A CN107909115B (en) 2017-12-04 2017-12-04 Image Chinese subtitle generating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711260141.7A CN107909115B (en) 2017-12-04 2017-12-04 Image Chinese subtitle generating method

Publications (2)

Publication Number Publication Date
CN107909115A CN107909115A (en) 2018-04-13
CN107909115B true CN107909115B (en) 2022-02-15

Family

ID=61854300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711260141.7A Active CN107909115B (en) 2017-12-04 2017-12-04 Image Chinese subtitle generating method

Country Status (1)

Country Link
CN (1) CN107909115B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764299B (en) * 2018-05-04 2020-10-23 北京物灵智能科技有限公司 Story model training and generating method and system, robot and storage device
CN109033321B (en) * 2018-07-18 2021-12-17 成都快眼科技有限公司 Image and natural language feature extraction and keyword-based language indication image segmentation method
CN109242090B (en) * 2018-08-28 2020-06-26 电子科技大学 Video description and description consistency judgment method based on GAN network
US10980030B2 (en) * 2019-03-29 2021-04-13 Huawei Technologies Co., Ltd. Method and apparatus for wireless communication using polarization-based signal space mapping
CN110110770A (en) * 2019-04-24 2019-08-09 佛山科学技术学院 Garment image shopping guide character generating method and device neural network based
CN112183513B (en) * 2019-07-03 2023-09-05 杭州海康威视数字技术股份有限公司 Method and device for recognizing characters in image, electronic equipment and storage medium
CN110750669B (en) * 2019-09-19 2023-05-23 深思考人工智能机器人科技(北京)有限公司 Method and system for generating image captions
US11252004B2 (en) 2020-03-30 2022-02-15 Huawei Technologies Co., Ltd. Multiple access wireless communications using a non-gaussian manifold
CN112347764B (en) * 2020-11-05 2024-05-07 中国平安人寿保险股份有限公司 Method and device for generating barrage cloud and computer equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105894475A (en) * 2016-04-21 2016-08-24 上海师范大学 International phonetic symbol image character refining method
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 A kind of generation method from structured text to image description
CN106649542A (en) * 2015-11-03 2017-05-10 百度(美国)有限责任公司 Systems and methods for visual question answering
CN107391709A (en) * 2017-07-28 2017-11-24 深圳市唯特视科技有限公司 A kind of method that image captions generation is carried out based on new attention model

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9858524B2 (en) * 2014-11-14 2018-01-02 Google Inc. Generating natural language descriptions of images
US10395118B2 (en) * 2015-10-29 2019-08-27 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
US9807473B2 (en) * 2015-11-20 2017-10-31 Microsoft Technology Licensing, Llc Jointly modeling embedding and translation to bridge video and language
CN105894043A (en) * 2016-04-27 2016-08-24 上海高智科技发展有限公司 Method and system for generating video description sentences
CN106354701B (en) * 2016-08-30 2019-06-21 腾讯科技(深圳)有限公司 Chinese character processing method and device
CN106844442A (en) * 2016-12-16 2017-06-13 广东顺德中山大学卡内基梅隆大学国际联合研究院 Multi-modal Recognition with Recurrent Neural Network Image Description Methods based on FCN feature extractions
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649542A (en) * 2015-11-03 2017-05-10 百度(美国)有限责任公司 Systems and methods for visual question answering
CN105894475A (en) * 2016-04-21 2016-08-24 上海师范大学 International phonetic symbol image character refining method
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 A kind of generation method from structured text to image description
CN107391709A (en) * 2017-07-28 2017-11-24 深圳市唯特视科技有限公司 A kind of method that image captions generation is carried out based on new attention model

Also Published As

Publication number Publication date
CN107909115A (en) 2018-04-13

Similar Documents

Publication Publication Date Title
CN107909115B (en) Image Chinese subtitle generating method
Murphy Probabilistic machine learning: an introduction
CN107526785B (en) Text classification method and device
Chaturvedi et al. Learning word dependencies in text by means of a deep recurrent belief network
Mansimov et al. Generating images from captions with attention
Kottur et al. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes
Adams et al. A survey of feature selection methods for Gaussian mixture models and hidden Markov models
Karpathy Connecting images and natural language
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN110222163A (en) A kind of intelligent answer method and system merging CNN and two-way LSTM
CN111291556A (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN110263174B (en) Topic category analysis method based on focus attention
CN112949647A (en) Three-dimensional scene description method and device, electronic equipment and storage medium
Huang et al. C-Rnn: a fine-grained language model for image captioning
Yan Computational Methods for Deep Learning: Theory, Algorithms, and Implementations
Tekir et al. Deep learning: Exemplar studies in natural language processing and computer vision
Newatia et al. Convolutional neural network for ASR
Zemmari et al. Deep Learning in Mining of Visual Content
Glick et al. Insect classification with heirarchical deep convolutional neural networks
Kanungo Analysis of Image Classification Deep Learning Algorithm
Xie et al. Chinese alt text writing based on deep learning
CN114116974A (en) Emotional cause extraction method based on attention mechanism
Stamp Alphabet soup of deep learning topics
Li et al. Supervised classification of plant image based on attention mechanism
Lee et al. A two-level recurrent neural network language model based on the continuous Bag-of-Words model for sentence classification

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant