CN110750669B - Method and system for generating image captions - Google Patents

Method and system for generating image captions

Info

Publication number
CN110750669B
CN110750669B (application CN201910885349.0A)
Authority
CN
China
Prior art keywords
image
model
feature extraction
vector information
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910885349.0A
Other languages
Chinese (zh)
Other versions
CN110750669A (en)
Inventor
杨志明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Original Assignee
Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd filed Critical Ideepwise Artificial Intelligence Robot Technology Beijing Co ltd
Priority to CN201910885349.0A priority Critical patent/CN110750669B/en
Publication of CN110750669A publication Critical patent/CN110750669A/en
Application granted granted Critical
Publication of CN110750669B publication Critical patent/CN110750669B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/5866 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for generating image captions. The neural network model in the embodiments of the invention adopts an encoder-decoder recursive bidirectional recurrent neural network architecture comprising two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert the image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set to receive the feature vector information of the image, extract deep semantic information of the text, and finally obtain the image caption information. The embodiments of the invention generate image captions simply and automatically and improve the accuracy of image caption generation.

Description

Method and system for generating image captions
Technical Field
The present invention relates to computer image processing technology and natural language processing technology, and in particular to a method and a system for generating image captions.
Background
Image caption generation, also known as automatic image annotation or image tagging, is the generation of readable text describing a given image, for example a picture of an object or a scene. Image caption generation is a new research direction in the field of computer vision, following image classification, object detection and image segmentation. It requires describing the objects in an image and the relationships between them with well-formed natural language sentences, which is a very challenging task: it combines knowledge of computer vision and natural language processing, using computer vision techniques to interpret the content of the image and natural language processing techniques to generate the textual description. Image caption generation can nonetheless have great impact, for example by helping visually impaired people better understand image content on the internet.
Generating image captions is thus a challenging problem at the intersection of computer vision and natural language processing in the field of artificial intelligence. A quick glance at an image lets a human point out and describe a large amount of detail in the visual scene; this is relatively easy for humans but very hard for computers, because it involves both understanding the content of the image and translating that understanding into natural language.
At present, two approaches are mainly used to generate image captions: the template-based approach and the nearest-neighbour-based approach. In the template-based approach, a caption template is preset and filled in according to the results of object detection and attribute discovery in the image. In the nearest-neighbour-based approach, images with similar captions are retrieved from a large database and the retrieved captions are modified to fit the current query. Both approaches are cumbersome when generating captions, and the accuracy of the generated captions is low.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method for generating image captions, which can generate image captions simply and automatically and improve the accuracy of image caption generation.
The embodiments of the present invention also provide a system for generating image captions, which can generate image captions simply and automatically and improve the accuracy of image caption generation.
The embodiments of the invention are realized as follows:
A method for generating image captions, the method comprising:
training an encoder-decoder recursive bidirectional recurrent neural network to serve as the image caption generation model, wherein the encoder-decoder recursive bidirectional recurrent neural network comprises a feature extraction model and a language model;
inputting the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image;
inputting the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption.
The feature extraction model is composed of a convolutional neural network, and the language model is composed of a bidirectional recurrent neural network.
The feature extraction model is composed of a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier.
Obtaining the image caption comprises the following steps:
performing recurrent processing on the feature vector information of the image and the caption features generated from it in a BiLSTM network, passing the result of the recurrent processing through a fully connected network, and then through a Softmax classifier to obtain the image caption.
The feature extraction model is further provided with a fully connected network, through which the feature vector information of the image is provided to the language model.
The method further comprises:
evaluating the image caption generation model with a BLEU tool.
A system for generating image captions, comprising: a feature extraction model module and a language model module, wherein,
the feature extraction model module is configured to train and obtain a feature extraction model, input the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image, and output it to the language model module;
the language model module is configured to train and obtain a language model, input the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption.
The feature extraction model is trained from a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier;
the language model is trained from a bidirectional recurrent neural network and is configured to perform recurrent processing on the feature vector information of the image and the caption features generated from it in a BiLSTM network, pass the result of the recurrent processing through a fully connected network, and then through a Softmax classifier to obtain the image caption.
The feature extraction model module is further configured to pass the obtained feature vector information of the image through a set fully connected network before outputting it to the language model module.
The system further comprises a BLEU tool evaluation module configured to evaluate the image caption generation model constructed from the set feature extraction model and the set language model.
As can be seen from the above, the neural network model in the embodiments of the present invention employs an encoder-decoder recursive bidirectional recurrent neural network architecture comprising two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert the image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set to receive the feature vector information of the image, extract deep semantic information of the text, and finally obtain the image caption information. Thus, the embodiments of the present invention generate image captions simply and automatically and improve the accuracy of image caption generation.
Drawings
Fig. 1 is a flowchart of a method for generating an image caption according to an embodiment of the present invention;
FIG. 2 is a simplified diagram of a feature extraction model according to an embodiment of the present invention;
FIG. 3 is a flowchart of an example of extracting image features with the feature extraction model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the execution process of the language model according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the model architecture for the entire image caption generation according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the structure of a system for generating image captions according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below by referring to the accompanying drawings and examples.
To generate image captions simply and automatically and to improve the accuracy of image caption generation, the embodiments of the present invention introduce a neural network model into the generation of image captions. With the rapid development of neural network models, they can be applied to the image caption generation process. The neural network model in the embodiments of the present invention combines recent progress in computer vision and machine translation (natural language processing) with the idea of transfer learning, and can be used to generate natural language sentences describing images. The neural network model in the embodiments of the present invention maximizes the likelihood of the target descriptive sentence for a given training image.
Specifically, the neural network model in the embodiments of the present invention adopts an encoder-decoder recursive bidirectional recurrent neural network architecture comprising two main parts: an image feature extraction part and an image caption generation part. For the image feature extraction part, a convolutional neural network is set to convert the image into feature vector information of the image; for the image caption generation part, a bidirectional recurrent neural network is set to receive the feature vector information of the image, extract deep semantic information of the text, and finally obtain the image caption information.
Therefore, the embodiments of the present invention can generate image captions simply and automatically and improve the accuracy of image caption generation.
Fig. 1 is a flowchart of a method for generating an image caption according to an embodiment of the present invention, which specifically includes the following steps:
Step 101, setting an encoder-decoder recursive bidirectional recurrent neural network as the image caption generation model, wherein the model comprises a feature extraction model and a language model;
the feature extraction model is implemented with a convolutional neural network, and the language model is implemented with a bidirectional recurrent neural network;
Step 102, inputting the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image;
Step 103, inputting the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption.
In the embodiments of the present invention, image feature extraction means that the computer converts a red-green-blue (RGB) image through a series of operations into a feature matrix or feature vector, usually represented by a fixed-length vector; this fixed-length vector represents the image in the feature space and becomes the feature vector information of the image. The feature extraction model for image feature extraction may adopt a deep convolutional neural network, which can be trained on the images of an image caption dataset and then used to process newly acquired images. The feature extraction model architecture may be VGG, AlexNet, GoogLeNet, ResNet, etc.
Fig. 2 is a simplified diagram of a feature extraction model according to an embodiment of the present invention. The feature extraction model is built with a deep convolutional neural network comprising a plurality of convolutional layers, a fully connected layer and a classifier: the acquired image is input into the convolutional layers for convolution, then into the fully connected layer, and finally classified by the classifier to obtain the feature vector information of the image.
Fig. 3 is a flowchart of an example of extracting image features with the feature extraction model according to an embodiment of the present invention, where feature extraction is performed with a deep convolutional neural network ResNet model, including:
Step 301, inputting an image;
Step 302, inputting the image into the feature extraction model;
Step 303, the feature extraction model extracts image features from the image to obtain a feature code with a fixed dimension.
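For illustration only, the fixed-dimension feature code of step 303 can be sketched with a pretrained ResNet whose classification layer is removed, as in the following minimal PyTorch example; the 2048-dimensional output and the file name example.jpg are assumptions tied to ResNet-50 and are not part of the original disclosure.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # Minimal sketch: a pretrained ResNet-50 with its final classification layer
    # removed serves as the feature extraction model; its output is a fixed-length
    # feature vector (2048 dimensions for ResNet-50).
    resnet = models.resnet50(pretrained=True)
    encoder = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier
    encoder.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feature = encoder(image).flatten(1)  # shape (1, 2048): the fixed-dimension feature code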
In the embodiments of the invention, the language model predicts the probability of the next word in a sequence given the words already present in the sequence. For image captioning, the language model is a neural network that, given the feature vector information of the image, predicts the word sequence of the description and builds the description conditioned on the words already generated. The embodiments of the invention use a bidirectional recurrent neural network as the language model: at each output time step a new word of the sequence is generated, each generated word is then encoded with a word embedding (such as Word2Vec) and passed as input to the decoder in the language model for generating the subsequent word. Fig. 4 is a schematic diagram of the execution process of the language model provided by an embodiment of the present invention, with the following specific steps:
Step 401, inputting the feature code with a fixed dimension;
Step 402, inputting the feature code with a fixed dimension into the language model;
Step 403, the language model extracts semantic information from the feature code with a fixed dimension and outputs a sentence.
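As a hedged illustration of the process of FIG. 4, the word-by-word generation can be written as a greedy decoding loop; caption_model, word_to_id and id_to_word below are hypothetical placeholders for a trained language model and its vocabulary mappings, not names defined in the original text.

    # Sketch: feed the fixed-dimension feature code and the words generated so far
    # into the language model, take the most probable next word, and stop at the
    # end marker. The model and vocabulary objects are assumed to exist.
    def generate_caption(caption_model, image_feature, word_to_id, id_to_word, max_len=30):
        words = ["startseq"]                                   # start marker of the sequence
        for _ in range(max_len):
            prefix_ids = [word_to_id[w] for w in words]
            scores = caption_model(image_feature, prefix_ids)  # scores over the vocabulary
            next_word = id_to_word[int(scores.argmax())]       # greedy choice of the next word
            if next_word == "endseq":                          # end marker: the caption is complete
                break
            words.append(next_word)
        return " ".join(words[1:])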
It can be seen that the image caption generation model according to the embodiments of the present invention is mainly divided into two parts, image feature extraction and caption generation, and the two parts are connected by a fully connected network; that is, the feature extraction model and the language model are connected by a fully connected network. Let the caption description corresponding to the input image I of the image caption generation model be S, and let the caption sequence length be n, i.e., S = {S1, S2, ..., Sn}. In the implementation of the invention, for image feature extraction, a pretrained convolutional neural network ResNet is adopted to extract the image features, and the extracted feature vector information of the image is input into a set fully connected layer; the fully connected layer connecting the convolutional neural network and the bidirectional recurrent neural network converts the feature vector information of the image into a suitable dimension through an affine transformation for the subsequent input. For the language model, a BiLSTM network is used to receive the feature vector information of the image and the caption features generated so far; after the BiLSTM processing, the fully connected layer in the language model is applied, and finally the corresponding sequence is output through a Softmax classifier. The model architecture for the entire image caption generation is shown in Fig. 5.
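A minimal PyTorch sketch of the architecture just described (pretrained ResNet features passed through an affine fully connected layer, a BiLSTM over the caption prefix, a fully connected layer and a Softmax classifier) is given below; the layer sizes and the concatenation used to fuse the two branches are illustrative assumptions rather than the exact configuration of FIG. 5.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class CaptionGenerationModel(nn.Module):
        # Encoder-decoder sketch: ResNet encoder, fully connected projection of the
        # image feature to the decoder dimension, BiLSTM over the caption prefix,
        # and a fully connected output layer feeding a Softmax classifier.
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
            super().__init__()
            resnet = models.resnet50(pretrained=True)
            self.encoder = nn.Sequential(*list(resnet.children())[:-1])
            for p in self.encoder.parameters():        # transfer learning: freeze the encoder
                p.requires_grad = False
            self.img_fc = nn.Linear(2048, hidden_dim)  # affine transform to a suitable dimension
            self.embed = nn.Embedding(vocab_size, embed_dim)  # word embedding (may be Word2Vec-initialised)
            self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.out_fc = nn.Linear(hidden_dim + 2 * hidden_dim, vocab_size)

        def forward(self, images, caption_prefix_ids):
            feats = self.img_fc(self.encoder(images).flatten(1))     # (B, hidden_dim)
            states, _ = self.bilstm(self.embed(caption_prefix_ids))  # (B, T, 2*hidden_dim)
            fused = torch.cat([feats, states[:, -1, :]], dim=1)      # fuse image and text features
            return self.out_fc(fused)  # logits; the Softmax classifier is applied on top

During training, these logits would be compared with the next word of the caption through a Softmax cross-entropy loss.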
When training the image caption generation model, an image is input and the caption of the image is output; the caption is generated one word at a time, and the previously generated words are used as input for generating the subsequent word. It is therefore necessary to set an initial word marking the start of the generation process and an end word marking its end; startseq and endseq are used as the start and end markers of the sequence during processing. The image caption generation model in the embodiments of the present invention receives a picture and the initial word and generates the next word, then the first two words of the description are provided as input to the model to generate the next word, and so on. This is the training process of the image caption generation model, and also the process by which the trained model outputs the final caption. For example, the input sequence "Two people climbing up a snowy mountain" is split into 8 pairs of inputs and outputs for training the image caption generation model; the constructed input and output pairs are shown in Table 1.
Input (image + caption prefix)                              Output (next word)
image + startseq                                            Two
image + startseq Two                                        people
image + startseq Two people                                 climbing
image + startseq Two people climbing                        up
image + startseq Two people climbing up                     a
image + startseq Two people climbing up a                   snowy
image + startseq Two people climbing up a snowy             mountain
image + startseq Two people climbing up a snowy mountain    endseq
TABLE 1
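As an illustrative sketch of how one caption is split into the input/output pairs of Table 1 (the function name and whitespace tokenisation are assumptions):

    # Split one training caption into (prefix, next word) pairs as in Table 1;
    # "Two people climbing up a snowy mountain" yields exactly 8 pairs.
    def build_training_pairs(caption):
        tokens = ["startseq"] + caption.split() + ["endseq"]
        pairs = []
        for i in range(1, len(tokens)):
            prefix = " ".join(tokens[:i])  # words already generated (fed to the model with the image)
            target = tokens[i]             # next word the model must learn to predict
            pairs.append((prefix, target))
        return pairs

    for prefix, target in build_training_pairs("Two people climbing up a snowy mountain"):
        print(f"{prefix:58} -> {target}")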
In this way, the image caption generation model is trained.
In the embodiments of the invention, how to evaluate the quality of the trained image caption generation model is an important question. There are generally two main ways of evaluating an image caption generation model: manual evaluation and machine evaluation. Manual evaluation is slow and costly, relatively subjective, and relies on expertise and experience. The embodiments of the invention therefore mainly adopt machine evaluation, i.e., a Bilingual Evaluation Understudy (BLEU) tool is set to evaluate the model. This tool is a metric measuring how similar a machine-generated text is to a set of reference texts; its value ranges between 0 and 1, and it can evaluate texts produced by a range of natural language processing tasks such as language generation, picture title generation, text summarization or speech recognition. The BLEU tool has the following advantages: 1) it is fast to compute and consumes few resources; 2) it is easy to understand; 3) it is language independent; 4) it correlates highly with human evaluation; 5) it is widely used.
When evaluating the trained image caption generation model with the BLEU tool, the evaluation works by counting the number of matching n-grams between the candidate caption text and the set of reference texts, where a 1-gram (unigram) comparison matches individual words and a bigram comparison matches word pairs; the comparison does not take word order into account. The greater the number of matches, the better the quality of the candidate caption. The calculation formulas of the BLEU evaluation are shown below.

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)

wherein,

BP = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases}

r is the number of words in the reference text, c is the number of words in the candidate caption, and BP is the brevity penalty applied when the candidate caption is shorter than the reference.

p_n = \frac{ \sum_{\text{n-gram} \in \text{candidate}} \min\left( \text{Count}_{\text{candidate}}(\text{n-gram}),\ \text{Count}_{\text{reference}}(\text{n-gram}) \right) }{ \sum_{\text{n-gram} \in \text{candidate}} \text{Count}_{\text{candidate}}(\text{n-gram}) }

The numerator is the number of occurrences of each n-gram in the candidate caption, clipped by its number of occurrences in the reference text (the minimum of the two counts), and the denominator is the total number of occurrences of n-grams in the candidate caption; w_n is the weight assigned to n-grams of order n.
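A small self-contained sketch of the BLEU computation described above (unigram and bigram precisions with uniform weights and a single reference) is shown below; it is illustrative only, and an established implementation such as NLTK's sentence_bleu would normally be used in practice.

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def bleu(candidate, reference, max_n=2):
        # Modified n-gram precisions p_n with clipped counts, then the brevity penalty BP.
        cand, ref = candidate.split(), reference.split()
        precisions = []
        for n in range(1, max_n + 1):
            cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
            clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())  # numerator
            total = max(sum(cand_counts.values()), 1)                             # denominator
            precisions.append(max(clipped, 1e-9) / total)  # small floor avoids log(0) in this toy example
        c, r = len(cand), len(ref)
        bp = 1.0 if c > r else math.exp(1 - r / c)         # brevity penalty for short candidates
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    print(bleu("two people climbing a snowy mountain",
               "two people climbing up a snowy mountain"))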
Fig. 6 is a schematic diagram of the structure of a system for generating image captions according to an embodiment of the present invention, comprising: a feature extraction model module and a language model module, wherein,
the feature extraction model module is configured to train and obtain a feature extraction model, input the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image, and output it to the language model module;
the language model module is configured to train and obtain a language model, input the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption.
In the system, the feature extraction model is trained from a convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier.
In the system, the language model is trained from a bidirectional recurrent neural network: the feature vector information of the image and the caption features generated from it are processed recurrently by a BiLSTM network, the result of the recurrent processing is passed through a fully connected network, and the image caption is finally obtained after processing by a Softmax classifier.
In the system, the feature extraction model module is further configured to pass the obtained feature vector information of the image through a set fully connected network before outputting it to the language model module.
The system further comprises a BLEU tool evaluation module configured to evaluate the image caption generation model constructed from the set feature extraction model and the set language model.
It can be seen that the embodiments of the present invention construct a caption generation model that integrates techniques from computer vision and natural language processing. First, for image feature extraction, a deep convolutional neural network, such as a trained ResNet50, is adopted to obtain the feature vector information of the image. Second, text sequence processing is used: a bidirectional recurrent neural network is selected to extract semantic information from the feature vector information of the image; the idea of transfer learning is used in this process, and a language model trained with Word2Vec is fused in as the initialization of the word vectors for semantic information extraction. The decoder fuses the feature vector information of the image output by the image feature extraction with the text sequence processing, and the fused information is passed through a fully connected layer for the final prediction. Higher accuracy and better fluency of the generated image captions are thereby achieved.
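As a hedged sketch of the Word2Vec initialisation of the word vectors mentioned above (the gensim usage, the function name and the file name are assumptions for illustration, not part of the original disclosure):

    import numpy as np
    import torch
    from gensim.models import KeyedVectors

    # Initialise the embedding matrix of the language model from pretrained
    # Word2Vec vectors (transfer learning); words absent from the Word2Vec
    # vocabulary keep a random initialisation. Assumes dim == kv.vector_size.
    def build_embedding(vocab, dim, w2v_path="word2vec.bin"):
        kv = KeyedVectors.load_word2vec_format(w2v_path, binary=True)
        weights = np.random.normal(0, 0.1, (len(vocab), dim)).astype("float32")
        for i, word in enumerate(vocab):
            if word in kv:
                weights[i] = kv[word]
        return torch.nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)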
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (5)

1. A method for generating an image caption, the image caption describing objects in an image and the relationships between the objects using natural language sentences, the method comprising:
training an encoder-decoder recursive bidirectional recurrent neural network to serve as the image caption generation model, wherein the encoder-decoder recursive bidirectional recurrent neural network comprises a feature extraction model and a language model;
inputting the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image;
inputting the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption;
wherein the feature extraction model is formed by a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier;
obtaining the image caption comprises the following steps:
performing recurrent processing on the feature vector information of the image and the caption features generated from it in a BiLSTM network, passing the result of the recurrent processing through a fully connected network, and then through a Softmax classifier to obtain the image caption;
and the feature extraction model is further provided with a fully connected network, through which the feature vector information of the image is provided to the language model.
2. The method of claim 1, wherein the feature extraction model is constructed with a convolutional neural network and the language model is constructed with a bidirectional recurrent neural network model.
3. The method of claim 1, wherein the method further comprises:
evaluating the image caption generation model with a BLEU tool.
4. A system for generating image captions, wherein the image captions describe objects in an image and the relationships between the objects using natural language sentences, comprising: a feature extraction model module and a language model module, wherein,
the feature extraction model module is configured to train and obtain a feature extraction model, input the acquired image into the feature extraction model for image feature extraction to obtain feature vector information of the image, and output it to the language model module;
the language model module is configured to train and obtain a language model, input the feature vector information of the image into the language model, the language model extracting semantic information from the feature vector information of the image to obtain the image caption;
the feature extraction model is trained from a deep convolutional neural network and comprises a plurality of convolutional layers, a plurality of fully connected layers and a classifier;
the language model is trained from a bidirectional recurrent neural network and is configured to perform recurrent processing on the feature vector information of the image and the caption features generated from it in a BiLSTM network, pass the result of the recurrent processing through a fully connected network, and then through a Softmax classifier to obtain the image caption;
the feature extraction model module is further configured to pass the obtained feature vector information of the image through a set fully connected network before outputting it to the language model module.
5. The system of claim 4, further comprising a BLEU tool evaluation module configured to evaluate the image caption generation model constructed from the set feature extraction model and the set language model.
CN201910885349.0A 2019-09-19 2019-09-19 Method and system for generating image captions Active CN110750669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910885349.0A CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910885349.0A CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Publications (2)

Publication Number Publication Date
CN110750669A CN110750669A (en) 2020-02-04
CN110750669B true CN110750669B (en) 2023-05-23

Family

ID=69276733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910885349.0A Active CN110750669B (en) 2019-09-19 2019-09-19 Method and system for generating image captions

Country Status (1)

Country Link
CN (1) CN110750669B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414505B (en) * 2020-03-11 2023-10-20 上海爱数信息技术股份有限公司 Quick image abstract generation method based on sequence generation model
CN113449564B (en) * 2020-03-26 2022-09-06 上海交通大学 Behavior image classification method based on human body local semantic knowledge

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546360A (en) * 2016-01-13 2017-07-19 Adobe Systems Inc Image captioning with weak supervision

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11113599B2 (en) * 2017-06-22 2021-09-07 Adobe Inc. Image captioning utilizing semantic text modeling and adversarial learning
CN107729987A (en) * 2017-09-19 2018-02-23 东华大学 The automatic describing method of night vision image based on depth convolution loop neutral net
CN107909115B (en) * 2017-12-04 2022-02-15 上海师范大学 Image Chinese subtitle generating method
CN109902750A (en) * 2019-03-04 2019-06-18 山西大学 Method is described based on two-way single attention mechanism image
CN109919221B (en) * 2019-03-04 2022-07-19 山西大学 Image description method based on bidirectional double-attention machine

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2546360A (en) * 2016-01-13 2017-07-19 Adobe Systems Inc Image captioning with weak supervision

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sow, Daouda et al., "A Sequential Guiding Network with Attention for Image Captioning," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 3802-3806. *
杨楠; 南琳; 张丁一; 库涛. Research on image description based on deep learning. Infrared and Laser Engineering, 2018, (02), pp. 1-8. *

Also Published As

Publication number Publication date
CN110750669A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110750959B (en) Text information processing method, model training method and related device
CN108009154B (en) Image Chinese description method based on deep learning model
CN107256221B (en) Video description method based on multi-feature fusion
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
CN111738251B (en) Optical character recognition method and device fused with language model and electronic equipment
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN111737511B (en) Image description method based on self-adaptive local concept embedding
CN110705206B (en) Text information processing method and related device
WO2020199904A1 (en) Video description information generation method, video processing method, and corresponding devices
CN111160031A (en) Social media named entity identification method based on affix perception
Shi et al. Watch it twice: Video captioning with a refocused video encoder
CN110750669B (en) Method and system for generating image captions
KR20190065665A (en) Apparatus and method for recognizing Korean named entity using deep-learning
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN110659392B (en) Retrieval method and device, and storage medium
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN112528989B (en) Description generation method for semantic fine granularity of image
KR20210035721A (en) Machine translation method using multi-language corpus and system implementing using the same
CN112084788A (en) Automatic marking method and system for implicit emotional tendency of image captions
CN115186683B (en) Attribute-level multi-modal emotion classification method based on cross-modal translation
CN115359323A (en) Image text information generation method and deep learning model training method
Islam et al. Bengali Caption Generation for Images Using Deep Learning
CN113139378B (en) Image description method based on visual embedding and condition normalization
Zhong et al. Improving Chinese medical named entity recognition using glyph and lexicon
KR102510645B1 (en) Method for handling out-of-vocabulary problem in hangeul word embeddings, recording medium and system for performing the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant