CN110750669A - Method and system for generating image captions - Google Patents

Method and system for generating image captions

Info

Publication number
CN110750669A
Authority
CN
China
Prior art keywords
image
model
vector information
feature extraction
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910885349.0A
Other languages
Chinese (zh)
Other versions
CN110750669B (en)
Inventor
杨志明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Reflections On Artificial Intelligence Robot Technology (beijing) Co Ltd
Original Assignee
Reflections On Artificial Intelligence Robot Technology (beijing) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Reflections On Artificial Intelligence Robot Technology (beijing) Co Ltd
Priority to CN201910885349.0A
Publication of CN110750669A
Application granted
Publication of CN110750669B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for generating image captions. The neural network model in the embodiment of the invention adopts an encoder-decoder recursive bidirectional recurrent neural network architecture, and the model comprises two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert an image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set, the feature vector information of the image is input, the bidirectional recurrent neural network extracts the deep semantic information in the text, and finally the image caption information is obtained. The embodiment of the invention generates image captions simply and automatically and improves the accuracy of image caption generation.

Description

Method and system for generating image captions
Technical Field
The present invention relates to computer image processing technology and natural language processing technology, and in particular to a method and system for generating image captions.
Background
Image captioning, also known as automatic image annotation, is the generation of readable text descriptions for a given image, such as a picture of an object or scene. Image caption generation is a new research direction in computer vision that follows image classification, object detection and image segmentation. It requires describing the objects in an image, and the relations between them, in correctly formed natural language sentences, which is a very challenging task: it combines knowledge from computer vision and natural language processing, in that computer vision techniques are needed to interpret the content of the image, and natural language processing techniques are needed to generate the text description. Image caption generation can nevertheless have a great impact, for example by helping visually impaired people better understand image content on the Internet.
Generating image captions is a challenging problem in artificial intelligence that combines computer vision and natural language processing. A person quickly scanning an image can point out and describe a great deal of detail about the visual scene; this is relatively simple for humans but very challenging for computers, because it involves both understanding the content of the image and translating that understanding into natural language.
At present, two approaches are mainly used to generate image captions: template-based methods and nearest neighbor-based methods. In the template-based approach, a caption template is preset and filled in according to the results of object detection and attribute discovery in the image; in the nearest neighbor approach, images with similar captions are retrieved from a large database, and the retrieved captions are then modified to fit the current query. However, both methods make the caption generation process complicated, and the accuracy of the generated captions is not high.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method for generating image subtitles, which is capable of simply and automatically generating image subtitles and improving the accuracy of generating image subtitles.
The embodiment of the invention provides a system for generating image captions, which can simply and automatically generate the image captions and improve the accuracy of generating the image captions.
The embodiment of the invention is realized as follows:
a method of image subtitle generation, the method comprising:
training to obtain an encoder-decoder recursive bidirectional recurrent neural network as an image subtitle generation model, wherein the image subtitle generation model comprises a feature extraction model and a language model;
inputting the acquired image into a feature extraction model to perform image feature extraction processing to obtain feature vector information of the image;
and inputting the feature vector information of the image into a language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.
The feature extraction model is formed by a convolutional neural network, and the language model is formed by a bidirectional recurrent neural network model.
The feature extraction model is formed by a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier.
The obtaining of the image captions includes:
and performing recurrent processing, in a set BiLSTM network, of the feature vector information of the image and the caption features generated based on the feature vector information of the image, passing the result of the recurrent processing through a fully connected network, and obtaining the image caption after processing by a Softmax classifier.
The feature extraction model is also provided with a fully connected network, and the feature vector information of the image is provided to the language model through the fully connected network.
The method further comprises the following steps:
and evaluating the image subtitle generating model by adopting a BLEU tool.
A system for image subtitle generation, comprising: a feature extraction model module and a language model module, wherein,
the feature extraction model module is used for training to obtain a feature extraction model, inputting the obtained image into the feature extraction model for image feature extraction processing to obtain feature vector information of the image and outputting the feature vector information to the language model module;
and the language model module is used for training to obtain a language model, inputting the feature vector information of the image into the language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.
The feature extraction model is formed by deep convolutional neural network training and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier;
the language model is formed by bidirectional recurrent neural network training and is used for performing recurrent processing, in a set BiLSTM network, on the feature vector information of the image and the caption features generated based on the feature vector information of the image, then passing the result of the recurrent processing through a fully connected network, and obtaining the image caption after processing by a Softmax classifier.
The feature extraction model module is also used for performing fully connected processing on the obtained feature vector information of the image through the set fully connected network and then outputting it to the language model module.
A BLEU tool evaluation module is used for evaluating the image subtitle generation model constructed from the set feature extraction model and the set language model.
As seen above, the neural network model in the embodiment of the present invention employs an encoder-decoder recursive bidirectional recurrent neural network architecture, which includes two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert an image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set, the feature vector information of the image is input, the bidirectional recurrent neural network extracts the deep semantic information in the text, and finally the image caption information is obtained. Thus, the embodiment of the invention generates image captions simply and automatically and improves the accuracy of image caption generation.
Drawings
Fig. 1 is a flowchart of a method for generating image subtitles according to an embodiment of the present invention;
FIG. 2 is a simplified structural diagram of a feature extraction model according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an example of a method for extracting image features by using a feature extraction model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a process for executing a language model according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a model architecture for generating a subtitle for an entire image according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a system for generating image subtitles according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
In order to generate image captions simply and automatically and improve the accuracy of image caption generation, the embodiment of the invention introduces a neural network model into the generation of image captions. With the rapid development of neural network models, such a model can be applied to the image caption generation process. The neural network model in the embodiment of the invention combines the latest progress in computer vision and machine translation (natural language processing) with the idea of transfer learning, and can be used to generate natural language sentences describing images. The neural network model in the embodiment of the invention maximizes the likelihood of the target description sentence given the training image.
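As a clarifying restatement rather than wording from the original disclosure, this training objective is commonly written as

\theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta) = \arg\max_{\theta} \sum_{(I,S)} \sum_{t=1}^{n} \log p(S_t \mid I, S_1, \ldots, S_{t-1}; \theta)

where I is a training image, S = {S_1, ..., S_n} is its reference caption, and \theta denotes the model parameters.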
Specifically, the neural network model in the embodiment of the present invention adopts an encoder-decoder recursive bidirectional recurrent neural network architecture, and the model includes two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert an image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set, the feature vector information of the image is input, the bidirectional recurrent neural network extracts the deep semantic information in the text, and finally the image caption information is obtained.
Therefore, the embodiment of the invention can simply and automatically generate the image captions and improve the accuracy of generating the image captions.
Fig. 1 is a flowchart of a method for generating image subtitles according to an embodiment of the present invention, which includes the following specific steps:
step 101, setting an encoder-decoder recursive bidirectional recurrent neural network as an image subtitle generating model, wherein the model comprises a feature extraction model and a language model;
here, the feature extraction model is realized by a convolutional neural network, and the language model is realized by a bidirectional recurrent neural network;
step 102, inputting the acquired image into a feature extraction model for image feature extraction processing to obtain feature vector information of the image;
and 103, inputting the feature vector information of the image into a language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.
In the embodiment of the present invention, image feature extraction means that the computer converts a red, green, blue (RGB) image into a feature matrix or feature vector through a series of operations; this is usually represented by a fixed-length vector, which spatially represents the image and serves as the feature vector information of the image. The feature extraction model used for extracting image features can adopt a deep convolutional neural network; this network can be trained on the images in an image caption dataset, and the trained network can then be used to process newly acquired images. The feature extraction model can be structured as VGG, AlexNet, GoogLeNet or ResNet, etc.
Fig. 2 is a simplified structural diagram of a feature extraction model according to an embodiment of the present invention. As shown in the figure, the feature extraction model is formed by a deep convolutional neural network that includes a plurality of convolutional layers, fully connected layers and a classifier: the acquired image is input into the convolutional layers for convolution processing, then passed through the fully connected layers, and finally classified by the classifier to obtain the feature vector information of the image.
Fig. 3 is a flowchart of an example of a method for extracting image features with a feature extraction model according to an embodiment of the present invention, where a deep convolutional neural network ResNet model is selected for feature extraction; the steps are as follows (an illustrative code sketch is given after the steps):
step 301, inputting an image;
step 302, inputting the image into a feature extraction model;
and step 303, carrying out image feature extraction on the image by using the feature extraction model to obtain a feature code with fixed dimensionality.
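As an illustration only, the following sketch shows one way such a fixed-dimension feature code could be extracted with a pretrained ResNet in PyTorch; the framework, the ResNet-50 variant, the preprocessing values and the 2048-dimension output are assumptions for demonstration, not details prescribed by the embodiment.

```python
# Minimal sketch (assumptions: PyTorch/torchvision, ResNet-50 with ImageNet weights):
# map an RGB image to a fixed-length feature vector by dropping the classifier head.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def extract_image_features(image_path: str) -> torch.Tensor:
    cnn = models.resnet50(pretrained=True)        # pretrained weights assumed for illustration
    cnn.fc = torch.nn.Identity()                  # remove the classifier, keep pooled features
    cnn.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        features = cnn(image)                     # shape (1, 2048): fixed-dimension feature code
    return features
```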
In an embodiment of the invention, the language model predicts the probability of the next word in the sequence given the words already present in the sequence. For image captioning, the language model is a neural network that, given the feature vector information of the image, predicts the word sequence of a description, building the description conditioned on the words that have already been generated. In the embodiment of the present invention, a bidirectional recurrent neural network is used as the language model: a new word is generated in the sequence at each output time step, each generated word is then encoded using a word embedding (such as Word2Vec), and the encoded word is passed as input to the decoder in the language model for generating the subsequent word. Fig. 4 is a schematic diagram of the execution process of the language model provided by the embodiment of the present invention; the specific steps are as follows (an illustrative code sketch is given after the steps):
step 401, inputting feature codes with fixed dimensions;
step 402, inputting feature codes with fixed dimensions into a language model;
and 403, extracting semantic information by the language model according to the feature codes with fixed dimensionality, and outputting sentences.
It can be seen that the image subtitle generating model of the embodiment of the present invention is mainly divided into two parts, namely image feature extraction and caption generation, and the two parts can be connected by a fully connected network; that is, the feature extraction model and the language model are connected by a fully connected network. Assume that the input image of the image caption generation model is I, with corresponding caption description S of sequence length n, i.e., S = {S_1, S_2, ..., S_n}. In the implementation of the invention, for image feature extraction, a pre-trained convolutional neural network ResNet is adopted to extract image features, and the extracted feature vector information of the image is input into a set fully connected layer; this fully connected layer, which connects the convolutional neural network and the bidirectional recurrent neural network, converts the feature vector information of the image into a suitable dimensionality by means of an affine transformation for the subsequent input. For the language model, a BiLSTM network can be used to receive the feature vector information of the image and the already generated caption features; after processing by the BiLSTM network and the fully connected layer in the language model, the corresponding sequence is finally output through the Softmax classifier. The model architecture for the entire image subtitle generation is shown in fig. 5, and a minimal decoding sketch is given below.
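Tying the two sketches together, a greedy word-by-word decoding loop might look as follows; extract_image_features and CaptionLanguageModel are the illustrative helpers sketched above, and the startseq/endseq markers and the maximum length anticipate the sequence markers described in the next paragraph and are assumptions for demonstration.

```python
# Greedy decoding sketch: encode the image once, then repeatedly predict the
# most probable next word until the end marker appears or the length limit hits.
import torch

def generate_caption(image_path, model, word2idx, idx2word, max_len=20):
    features = extract_image_features(image_path)        # (1, 2048) image feature code
    tokens = [word2idx["startseq"]]                      # begin with the start marker
    with torch.no_grad():
        for _ in range(max_len):
            prefix = torch.tensor([tokens])              # (1, current prefix length)
            logits = model(features, prefix)
            next_id = int(logits.argmax(dim=-1))         # greedy choice of the next word
            if idx2word[next_id] == "endseq":            # stop at the end marker
                break
            tokens.append(next_id)
    return " ".join(idx2word[i] for i in tokens[1:])     # drop the start marker
```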
When training the image caption generating model, inputting an image will output the caption of the image. The caption generating process generates one word at a time, and the previously generated words are used as input for the generation of the subsequent word. Therefore, it is necessary to set an initial word that marks the start of the generation process and an end word that marks its end; during processing, startseq and endseq are used as the start and end markers of the sequence. The image subtitle generating model in the embodiment of the invention receives a picture and the initial word and generates the next word, then the first two words are provided as input to the model to generate the next word, and so on. This is both the training procedure of the image caption generating model and the procedure by which the trained model outputs the final caption. For example, the input sequence "Two people climbing up a snowy mountain" is divided into 8 input/output pairs for training the image caption generating model; the constructed model input and output pairs are shown in Table 1.
TABLE 1
Input (image + text prefix)                                   Output (next word)
image + startseq                                              Two
image + startseq Two                                          people
image + startseq Two people                                   climbing
image + startseq Two people climbing                          up
image + startseq Two people climbing up                       a
image + startseq Two people climbing up a                     snowy
image + startseq Two people climbing up a snowy               mountain
image + startseq Two people climbing up a snowy mountain      endseq
Thus, an image subtitle generating model is obtained through training.
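The expansion of one caption into the input/output pairs of Table 1 can be illustrated with a few lines of Python; the whitespace tokenisation and the startseq/endseq marker names are assumptions used only for demonstration.

```python
# Build (prefix so far, next word) training pairs from a single caption.
def build_training_pairs(caption: str):
    words = ["startseq"] + caption.split() + ["endseq"]
    pairs = []
    for i in range(1, len(words)):
        pairs.append((" ".join(words[:i]), words[i]))   # each prefix predicts the next word
    return pairs

pairs = build_training_pairs("Two people climbing up a snowy mountain")
# 8 pairs: ("startseq", "Two"), ("startseq Two", "people"), ...,
# ("startseq Two people climbing up a snowy mountain", "endseq")
```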
In the embodiment of the invention, how to evaluate the trained image subtitle generating model is an important problem. Generally, there are two main ways of evaluating an image subtitle generation model: manual and machine evaluation. Manual evaluation, however, is slow, costly and subjective, and requires expertise and experience. The embodiment of the invention mainly adopts machine evaluation, namely a Bilingual Evaluation Understudy (BLEU) tool is set to evaluate the model. BLEU is a metric that measures the similarity between a machine-generated text and a set of reference texts; its value lies in the range [0, 1], and it can be used to evaluate text generated by a range of natural language processing tasks, such as language generation, image caption generation, text summarization or speech recognition. The BLEU tool has the following advantages: 1) it is fast to compute and consumes few resources; 2) it is easy to understand; 3) it is language independent; 4) it correlates highly with human evaluation; 5) it is widely adopted.
When evaluating the trained image caption generation model with the BLEU tool, the candidate caption text is compared with the set of reference texts by counting matching n-grams, where a 1-gram (unigram) comparison matches individual words and a 2-gram (bigram) comparison matches word pairs, without regard to word order. The greater the number of matches, the better the quality of the candidate caption. The score computed by the BLEU evaluation can be written as

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)

where the brevity penalty BP is

BP = \begin{cases} 1, & c > r \\ e^{\,1 - r/c}, & c \le r \end{cases}

r is the number of words in the reference text, c is the number of words in the candidate caption, and BP penalizes candidate captions that are shorter than the reference.

The modified n-gram precision p_n is

p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \min\!\left( \mathrm{Count}_{\text{candidate}}(\text{n-gram}),\ \mathrm{Count}_{\text{reference}}(\text{n-gram}) \right)}{\sum_{\text{n-gram} \in \text{candidate}} \mathrm{Count}_{\text{candidate}}(\text{n-gram})}

where the numerator takes, for each n-gram, the minimum of the number of times it appears in the candidate caption and in the reference text, and the denominator is the total number of times the n-gram appears in the candidate caption.
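For machine evaluation, one readily available implementation of this formula is NLTK's sentence_bleu; the sketch below is only an illustration, and the example reference and candidate sentences are made up rather than taken from any dataset used by the embodiment.

```python
# BLEU scoring of a candidate caption against a reference caption using NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["two", "people", "climbing", "up", "a", "snowy", "mountain"]]
candidate = ["two", "people", "climb", "a", "snowy", "mountain"]

score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),            # uniform weights over 1- to 4-grams
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```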
Fig. 6 is a schematic structural diagram of a system for generating image subtitles according to an embodiment of the present invention, including: a feature extraction model module and a language model module, wherein,
the feature extraction model module is used for training to obtain a feature extraction model, inputting the obtained image into the feature extraction model for image feature extraction processing to obtain feature vector information of the image and outputting the feature vector information to the language model module;
and the language model module is used for training to obtain a language model, inputting the feature vector information of the image into the language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.
In the system, the feature extraction model is formed by convolutional neural network training and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier.
In the system, the language model is formed by bidirectional recurrent neural network training: recurrent processing in a BiLSTM network is performed on the feature vector information of the image and the caption features generated based on the feature vector information of the image, the result of the recurrent processing is passed through a fully connected network, and the image caption is finally obtained after processing by a Softmax classifier.
In the system, the feature extraction model module is further configured to perform fully connected processing on the feature vector information of the obtained image through the set fully connected network, and then output it to the language model module.
The system also comprises a BLEU tool evaluation module which is used for evaluating the set feature extraction model and the image subtitle generation model constructed by the set language model.
It can be seen that a caption generation model is constructed in the embodiment of the invention, fusing the related technologies of computer vision and natural language processing. First, when extracting image features, a deep convolutional neural network, such as a trained ResNet50 deep convolutional neural network, is adopted to obtain the feature vector information of the image. Second, text sequence processing is used: a bidirectional recurrent neural network is selected to extract semantic information from the feature vector information of the image, and the idea of transfer learning is applied in this process, in that a trained Word2Vec language model is fused in as the initialization of the word vectors used for extracting the semantic information. Then a decoder fuses the feature vector information output by the image feature extraction with the text sequence processing, and the fused feature vector information is used for the final prediction through a fully connected layer. In this way, higher accuracy and better fluency of the generated image captions are obtained.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method of image subtitle generation, the method comprising:
training to obtain an encoder-decoder recursive bidirectional recurrent neural network as an image subtitle generation model, wherein the image subtitle generation model comprises a feature extraction model and a language model;
inputting the acquired image into a feature extraction model to perform image feature extraction processing to obtain feature vector information of the image;
and inputting the feature vector information of the image into a language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.
2. The method of claim 1, wherein the feature extraction model is constructed using a convolutional neural network and the language model is constructed using a bidirectional recurrent neural network model.
3. The method of claim 2, wherein the feature extraction model is a deep convolutional neural network comprising a plurality of convolutional layers, a plurality of fully-connected layers, and a classifier.
4. The method of claim 2, wherein the obtaining the image caption comprises:
and performing recurrent processing, in a set BiLSTM network, of the feature vector information of the image and the caption features generated based on the feature vector information of the image, passing the result of the recurrent processing through a fully connected network, and obtaining the image caption after processing by a Softmax classifier.
5. The method of claim 1, wherein a fully connected network is further provided in the feature extraction model, and the feature vector information of the image is provided to the language model through the fully connected network.
6. The method of claim 1, further comprising:
and evaluating the image subtitle generating model by adopting a BLEU tool.
7. A system for image subtitle generation, comprising: a feature extraction model module and a language model module, wherein,
the feature extraction model module is used for training to obtain a feature extraction model, inputting the obtained image into the feature extraction model for image feature extraction processing to obtain feature vector information of the image and outputting the feature vector information to the language model module;
and the language model module is used for training to obtain a language model, inputting the feature vector information of the image into the language model, and extracting semantic information by the language model according to the feature vector information of the image to obtain an image subtitle.
8. The system of claim 7, wherein the feature extraction model is constructed using deep convolutional neural network training, comprising a plurality of convolutional layers, a plurality of fully-connected layers, and a classifier;
the language model is formed by bidirectional recurrent neural network training and is used for performing recurrent processing, in a set BiLSTM network, on the feature vector information of the image and the caption features generated based on the feature vector information of the image, then passing the result of the recurrent processing through a fully connected network, and obtaining the image caption after processing by a Softmax classifier.
9. The system according to claim 7, wherein the feature extraction model module is further configured to perform fully connected processing on the feature vector information of the obtained image through the set fully connected network, and output the processed feature vector information to the language model module.
10. The system of claim 7, further comprising a BLEU tool evaluation module for evaluating an image subtitle generation model constructed from the set feature extraction model and the set language model.
CN201910885349.0A 2019-09-19 2019-09-19 Method and system for generating image captions Active CN110750669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910885349.0A CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910885349.0A CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Publications (2)

Publication Number Publication Date
CN110750669A true CN110750669A (en) 2020-02-04
CN110750669B CN110750669B (en) 2023-05-23

Family

ID=69276733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910885349.0A Active CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Country Status (1)

Country Link
CN (1) CN110750669B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414505A (en) * 2020-03-11 2020-07-14 上海爱数信息技术股份有限公司 Rapid image abstract generation method based on sequence generation model
CN113449564A (en) * 2020-03-26 2021-09-28 上海交通大学 Behavior image classification method based on human body local semantic knowledge

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546360A (en) * 2016-01-13 2017-07-19 Adobe Systems Inc Image captioning with weak supervision
CN107729987A (en) * 2017-09-19 2018-02-23 东华大学 (Donghua University) Automatic description method for night vision images based on a deep convolutional recurrent neural network
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 (Shanghai Normal University) Image Chinese caption generation method
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
CN109902750A (en) * 2019-03-04 2019-06-18 山西大学 (Shanxi University) Image description method based on bidirectional single attention mechanism
CN109919221A (en) * 2019-03-04 2019-06-21 山西大学 (Shanxi University) Image description method based on bidirectional double attention mechanism

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546360A (en) * 2016-01-13 2017-07-19 Adobe Systems Inc Image captioning with weak supervision
US20180373979A1 (en) * 2017-06-22 2018-12-27 Adobe Systems Incorporated Image captioning utilizing semantic text modeling and adversarial learning
CN107729987A (en) * 2017-09-19 2018-02-23 东华大学 (Donghua University) Automatic description method for night vision images based on a deep convolutional recurrent neural network
CN107909115A (en) * 2017-12-04 2018-04-13 上海师范大学 (Shanghai Normal University) Image Chinese caption generation method
CN109902750A (en) * 2019-03-04 2019-06-18 山西大学 (Shanxi University) Image description method based on bidirectional single attention mechanism
CN109919221A (en) * 2019-03-04 2019-06-21 山西大学 (Shanxi University) Image description method based on bidirectional double attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SOW, DAOUDA ET AL.: "A SEQUENTIAL GUIDING NETWORK WITH ATTENTION FOR IMAGE CAPTIONING" *
杨楠; 南琳; 张丁一; 库涛: "基于深度学习的图像描述研究" (Research on image description based on deep learning) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414505A (en) * 2020-03-11 2020-07-14 上海爱数信息技术股份有限公司 Rapid image abstract generation method based on sequence generation model
CN111414505B (en) * 2020-03-11 2023-10-20 上海爱数信息技术股份有限公司 Quick image abstract generation method based on sequence generation model
CN113449564A (en) * 2020-03-26 2021-09-28 上海交通大学 Behavior image classification method based on human body local semantic knowledge
CN113449564B (en) * 2020-03-26 2022-09-06 上海交通大学 Behavior image classification method based on human body local semantic knowledge

Also Published As

Publication number Publication date
CN110750669B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN110119786B (en) Text topic classification method and device
CN109933801B (en) Bidirectional LSTM named entity identification method based on predicted position attention
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
CN110866399B (en) Chinese short text entity recognition and disambiguation method based on enhanced character vector
CN111737511B (en) Image description method based on self-adaptive local concept embedding
CN110175246B (en) Method for extracting concept words from video subtitles
Dilawari et al. ASoVS: abstractive summarization of video sequences
WO2017177809A1 (en) Word segmentation method and system for language text
CN113298151A (en) Remote sensing image semantic description method based on multi-level feature fusion
CN114153971B (en) Error correction recognition and classification equipment for Chinese text containing errors
Vinnarasu et al. Speech to text conversion and summarization for effective understanding and documentation
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN114339450A (en) Video comment generation method, system, device and storage medium
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN115129934A (en) Multi-mode video understanding method
CN110750669B (en) Method and system for generating image captions
CN110659392B (en) Retrieval method and device, and storage medium
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
Toshevska et al. Exploration into deep learning text generation architectures for dense image captioning
CN115186683B (en) Attribute-level multi-modal emotion classification method based on cross-modal translation
CN116152118B (en) Image description method based on contour feature enhancement
CN113139378B (en) Image description method based on visual embedding and condition normalization
US20240127812A1 (en) Method and system for auto-correction of an ongoing speech command
KR20230080849A (en) Content specific captioning method and system for real time online professional lectures
Prakash Image Caption Generation for Low Light Images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant