CN108898639A - A kind of Image Description Methods and system - Google Patents
- Publication number
- CN108898639A (application CN201810537627.9A)
- Authority
- CN
- China
- Prior art keywords
- training
- text
- picture
- test
- trained
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image description method and system. The method first extracts image features with the convolutional neural network VGGNet provided by Google, thereby encoding the image. An attention mechanism then assigns weights to the feature sub-maps to find the key feature sub-maps that matter most to the text description. Finally, an LSTM is used, with key text obtained from the LSTM hidden layer serving as weights, to decode the combination of the text and the attention-processed image, thereby generating a description of the image. The invention effectively improves the accuracy of image description.
Description
Technical field
The present invention relates to the field of image description, and in particular to an image description method and system.
Background art
With the development of deep learning, image description technology based on deep learning has matured. Image description is the task of describing the content of an image with a correctly formed sentence. Image description methods integrate several specialized techniques, including deep learning, pattern recognition, digital image processing, and natural language processing. Image description hinges on two key points: (1) extracting the image features; (2) generating natural language. Deep learning automates image feature extraction and recognition, greatly improving the accuracy of object and scene recognition, and its language models make the predicted sentences more fluent and correct. The design of the deep-learning network used in an image description method directly affects the quality of the resulting descriptions, so designing a suitable network architecture is one of the key tasks in improving description accuracy. Traditional image understanding methods widely use static image features extracted directly from high-level convolutional layers, but this approach has a latent defect: it easily loses rich and important image information, which ultimately reduces description accuracy.
Summary of the invention
In view of the above problems, the present invention provides an image description method and system.
To achieve this object, the present invention provides the following schemes:
An image description method, the method comprising:
obtaining a training set, the training set comprising a set of training pictures and a training text describing each training picture in the set;
obtaining a feature-picture training set from the training pictures;
determining the attention weights of an attention mechanism model;
obtaining a key-feature-picture training set from the feature-picture training set using the attention weights;
taking the key-feature-picture training set and the training text as the input of a long short-term memory (LSTM) model, and obtaining the output of the LSTM model, the output being the key training text;
training a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model;
obtaining a test set, the test set comprising a set of test pictures and a test text;
obtaining a feature-picture test set from the test pictures;
obtaining a key-feature-picture test set from the feature-picture test set using the attention weights;
obtaining a key test text from the key-feature-picture test set, the test text, and the LSTM model;
obtaining, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
Optionally, obtaining the feature-picture training set from the training pictures specifically comprises:
training a first convolutional neural network model with the training pictures to obtain a trained first convolutional neural network model;
obtaining the output of the trained first convolutional neural network model, which is the initial-feature-picture training set;
training a second convolutional neural network model with the initial-feature-picture training set to obtain a trained second convolutional neural network model;
obtaining the output of the trained second convolutional neural network model, which is the feature-picture training set.
Optionally, obtaining the output of the trained first convolutional neural network model specifically comprises:
cropping each training picture to obtain the cropped training pictures;
extracting the initial features of each training picture with the convolutional neural network model to obtain the initial-feature-picture training set.
Optionally, obtaining the output of the trained second convolutional neural network model specifically comprises:
performing a convolution operation on each feature picture in the initial-feature-picture training set with the convolutional layer of the second convolutional neural network model to obtain a convolution-feature-picture training set;
resizing each convolution feature picture in the convolution-feature-picture training set to the size of the corresponding training picture;
concatenating the resized convolution feature pictures with the training pictures to obtain the feature-picture training set.
Optionally, determining the attention weights of the attention mechanism model specifically comprises:
using the initial output and, during iteration, the output of the LSTM model as the weights of the feature training pictures, and thereby determining the attention weights, so that each word in the description corresponds to the required key feature sub-map.
Optionally, obtaining the key test text from the key-feature-picture test set, the test text, and the LSTM model specifically comprises:
passing the output of the LSTM model through a fully connected operation and a scaling transform to serve as the weights of the test text, thereby obtaining the key test text.
Optionally, training the neural network model from the key training text and the key-feature-picture training set to obtain the decoding model specifically comprises:
superimposing the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
computing the loss between the training superimposed text and the training text;
adjusting, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
An image description system, the system comprising:
a first obtaining module, configured to obtain a training set comprising a set of training pictures and a training text describing each training picture in the set;
a feature-picture training set determining module, configured to obtain a feature-picture training set from the training pictures;
an attention weight determining module, configured to determine the attention weights of an attention mechanism model;
a key-feature-picture training set determining module, configured to obtain a key-feature-picture training set from the feature-picture training set using the attention weights;
an output obtaining module, configured to take the key-feature-picture training set and the training text as the input of an LSTM model and obtain its output, the output being the key training text;
a training module, configured to train a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model;
a second obtaining module, configured to obtain a test set comprising a set of test pictures and a test text;
a feature-picture test set determining module, configured to obtain a feature-picture test set from the test pictures;
a key-feature-picture test set determining module, configured to obtain a key-feature-picture test set from the feature-picture test set using the attention weights;
a key test text obtaining module, configured to obtain a key test text from the key-feature-picture test set, the test text, and the LSTM model;
a text description determining module, configured to obtain, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
Optionally, the training module comprises:
a superimposing unit, configured to superimpose the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
a loss obtaining unit, configured to compute the loss between the training superimposed text and the training text;
a training unit, configured to adjust, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
Compared with the prior art, the present invention has the following technical effects. The invention first extracts image features with the convolutional neural network VGGNet provided by Google, i.e., encodes the image; it then uses an attention mechanism to assign weights to the feature sub-maps and find the key feature sub-maps that matter most to the text description; finally, it uses an LSTM, with the key text obtained from the LSTM hidden layer serving as weights, to decode the combination of the text and the attention-processed image, thereby generating a description of the image. The invention effectively improves the accuracy of image description.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the image description method of an embodiment of the present invention;
Fig. 2 is a structural block diagram of the image description system of an embodiment of the present invention.
Specific embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the protection scope of the invention.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of the image description method of an embodiment of the present invention. As shown in Fig. 1, the image description method comprises the following steps:
Step 101: Obtain a training set. The training set comprises a set of training pictures and a training text describing each training picture in the set.
Step 102: Obtain a feature-picture training set from the training pictures.
Specifically, train a first convolutional neural network model with the training pictures to obtain a trained first convolutional neural network model, and obtain its output, which is the initial-feature-picture training set: crop each training picture, then extract the initial features of each cropped picture with the convolutional neural network model to obtain the initial-feature-picture training set.
Then train a second convolutional neural network model with the initial-feature-picture training set to obtain a trained second convolutional neural network model, and obtain its output, which is the feature-picture training set: perform a convolution operation on each feature picture in the initial-feature-picture training set with the convolutional layer of the second convolutional neural network model to obtain a convolution-feature-picture training set; resize each convolution feature picture to the size of the corresponding training picture; and concatenate the resized convolution feature pictures with the training pictures to obtain the feature-picture training set.
Step 103: Determine the attention weights of the attention mechanism model. Use the initial output and, during iteration, the output of the LSTM model as the weights of the feature training pictures, and thereby determine the attention weights, so that each word in the description corresponds to the required key feature sub-map.
Step 104: Obtain a key-feature-picture training set from the feature-picture training set using the attention weights.
Step 105: Take the key-feature-picture training set and the training text as the input of the LSTM model and obtain its output, which is the key training text.
Step 106: Train a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model.
Specifically, superimpose the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text; compute the loss between the training superimposed text and the training text; and adjust, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
Step 107: Obtain a test set. The test set comprises a set of test pictures and a test text.
Step 108: Obtain a feature-picture test set from the test pictures.
Step 109: Obtain a key-feature-picture test set from the feature-picture test set using the attention weights.
Step 110: Obtain a key test text from the key-feature-picture test set, the test text, and the LSTM model: pass the output of the LSTM model through a fully connected operation and a scaling transform to serve as the weights of the test text, thereby obtaining the key test text.
Step 111: Obtain, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects. The invention first extracts image features with the convolutional neural network VGGNet provided by Google, i.e., encodes the image; it then uses an attention mechanism to assign weights to the annotation text and find the key text that is important to the description of the features; finally, it uses an LSTM to decode the attention-processed image, thereby generating the description of the image. After extracting the image features with VGGNet, the invention performs a convolution operation on the features, resizes the convolved features back to the size of the original features, concatenates them with the original features, and then performs another convolution and normalization. Concatenating the original features with the convolved features extracts deeper semantic features without losing important image information, so the invention effectively improves the accuracy of image description.
Specific implementation
I. Preparation and preprocessing:
Step 1: Prepare the COCO training set and COCO test set files produced by Microsoft and place them under the project directory. COCO is a large dataset for image recognition, image segmentation, and image description. The invention uses the pre-trained deep convolutional neural network VGGNet parameter model provided by Google. The VGGNet structure comprises five convolution-pooling blocks: the first and second blocks each contain two convolutional layers followed by a pooling layer, the third and fourth blocks each contain four convolutional layers followed by a pooling layer, and the last block contains three convolutional layers.
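As a concrete illustration, a minimal sketch of this feature extraction in TensorFlow/Keras is shown below. It uses the Keras VGG19 weights as a stand-in for the patent's VGGNet parameter model, and assumes the layer name block5_conv3 as the third convolutional layer of the fifth block; these names and the example input are illustrative, not taken from the patent.

```python
import tensorflow as tf

# Load a pre-trained VGG network as a stand-in for the patent's VGGNet model.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")

# Expose the third convolutional layer of the fifth block as the feature output
# (the feature sub-maps the patent later calls "features", before normalization).
feature_extractor = tf.keras.Model(
    inputs=vgg.input,
    outputs=vgg.get_layer("block5_conv3").output)

# images: a batch of cropped 224x224 RGB pictures, shape (batch, 224, 224, 3).
images = tf.random.uniform((4, 224, 224, 3))
features = feature_extractor(tf.keras.applications.vgg19.preprocess_input(images))
print(features.shape)  # (4, 14, 14, 512) feature sub-maps
```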
Step 2: Preprocess the training set and test set to prepare for subsequent operations. The input is the training set pictures with their descriptions and the test set pictures.
Step 2.1: Uniformly crop the training set pictures and test set pictures to a size of 224*224.
Step 2.2: The training set contains pictures and descriptions of those pictures. Split every description sentence into words, drop duplicate words, and assign a serial number to every word to obtain the "dictionary" word_to_idx.
Step 2.3: Output the preprocessed pictures and the "dictionary" word_to_idx.
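A minimal sketch of step 2.2, assuming plain whitespace tokenization and two special tokens that the patent does not mention; the example captions are illustrative.

```python
def build_vocab(captions):
    """Build the word_to_idx 'dictionary' from a list of caption strings."""
    word_to_idx = {"<START>": 0, "<END>": 1}   # special tokens are an assumption
    for caption in captions:
        for word in caption.lower().split():   # split each sentence into words
            if word not in word_to_idx:        # drop duplicate words
                word_to_idx[word] = len(word_to_idx)  # assign a serial number
    return word_to_idx

captions = ["a dog runs on the grass", "a cat sits on the sofa"]
word_to_idx = build_vocab(captions)
print(word_to_idx["dog"], len(word_to_idx))
```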
II. Training steps:
Step 3: Feed the cropped training set pictures into the existing VGGNet as input, and output the feature sub-maps of the third convolutional layer in the fifth block of VGGNet. VGGNet pre-extracts the image features and pre-encodes the image, i.e., transforms the image into the required vector space.
Step 4: Apply batch normalization to the VGGNet output. The normalized image features are denoted features.
Step 5: The user initializes the parameters of the network structure. The initialization parameters include: the training samples are input in batches, with the number of samples per batch denoted batch_size; the number of epochs needed to train all samples, epoch; the picture feature dimension dim_features; the word embedding dimension dim_embedding; the number of LSTM iterations n_time_step; the LSTM hidden state dimension dim_hidden; the doubly stochastic regularization coefficient alpha_c; the learning rate learning_rate; the network optimizer update_rule; the number of iterations between printed results print_every; the training set path image_path; the model save path model_path; and the test model load path test_path.
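A sketch of such an initialization; the patent names these parameters but not their values, so the values and paths below are illustrative assumptions.

```python
config = {
    "batch_size": 128,          # samples per training batch
    "epoch": 20,                # passes over all training samples
    "dim_features": 512,        # picture feature (channel) dimension
    "dim_embedding": 512,       # word embedding dimension
    "n_time_step": 16,          # number of LSTM iterations
    "dim_hidden": 1024,         # LSTM hidden state dimension
    "alpha_c": 1.0,             # doubly stochastic regularization coefficient
    "learning_rate": 0.001,
    "update_rule": "adam",      # network optimizer
    "print_every": 100,         # iterations between printed results
    "image_path": "./data/train/",
    "model_path": "./model/",
    "test_path": "./model/",
}
```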
Step 6: The whole network composed of VGGNet, the attention mechanism, and the LSTM is trained with the following loss function:

Loss = -log P(y | x) + λ Σ_{i=1..L} (1 - Σ_{t=1..C} α_{t,i})²

where -log(P(y | x)) is the negative log-likelihood loss, computed with cross entropy; x represents the true description and y represents the predicted description, i.e., out_logits. The second term is the penalty term expressed with the soft weight vectors α, where λ is the control parameter of the penalty term, C is the number of α vectors in one description, and L is the length of each α vector.
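A sketch of this loss in TensorFlow, assuming logits of shape (batch, n_time_step, vocab_size), integer targets, and attention weights alphas of shape (batch, C, L) collected over the C decoding steps; alpha_c plays the role of λ. The shapes and reduction are assumptions consistent with the doubly stochastic penalty described above.

```python
import tensorflow as tf

def captioning_loss(logits, targets, alphas, alpha_c=1.0):
    # Negative log-likelihood of the true description, computed with cross entropy.
    nll = tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=targets, logits=logits))
    # Doubly stochastic penalty: over the C steps of one description, the
    # attention paid to each of the L feature positions should sum to about 1.
    attended_per_position = tf.reduce_sum(alphas, axis=1)        # (batch, L)
    penalty = alpha_c * tf.reduce_sum((1.0 - attended_per_position) ** 2)
    return nll + penalty
```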
Step 7: In the newly constructed network structure, take features as input. Feed features into one convolutional layer to obtain the output conv, which captures deeper image semantic information. Upsample (resize) conv back to the original size to obtain image_resize, so that the shallower features can be fused with the deeper features; the resize operation adds no training parameters. Then concatenate features and image_resize, feed the result into another convolutional layer to convert it back to the original feature shape, and finally apply batch normalization. The feature sub-maps output by this network are denoted features_concat.
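A minimal sketch of this fusion sub-network in TensorFlow/Keras. The kernel sizes, stride, and channel counts are illustrative assumptions; the patent only specifies the sequence convolution, resize, concatenation, convolution, batch normalization.

```python
import tensorflow as tf

def fuse_features(features):
    """features: VGG feature sub-maps, shape (batch, H, W, C)."""
    h, w, c = features.shape[1], features.shape[2], features.shape[3]
    # One convolutional layer for deeper semantic information (conv).
    conv = tf.keras.layers.Conv2D(c, 3, strides=2, padding="same",
                                  activation="relu")(features)
    # Upsample back to the original size; resizing adds no trainable parameters.
    image_resize = tf.image.resize(conv, (h, w))
    # Concatenate shallow and deep features, then convert back to the
    # original feature shape with another convolution.
    concat = tf.concat([features, image_resize], axis=-1)
    fused = tf.keras.layers.Conv2D(c, 1, padding="same")(concat)
    # Final batch normalization; the result is features_concat.
    return tf.keras.layers.BatchNormalization()(fused)
```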
Step 8: Provide initial values for the LSTM network. Sum and average all pixels of each feature map in features, then assign the averaged feature sub-maps to the initial cell state c0 and the initial hidden state h0 of the LSTM cell, respectively.
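A sketch of this initialization. The patent assigns the averaged features directly; the dense projections below are an assumption, included only in case the feature channel count differs from dim_hidden.

```python
import tensorflow as tf

def init_lstm_state(features, dim_hidden):
    """features: (batch, H, W, C). Returns (c0, h0) for the LSTM cell."""
    # Sum and average all pixels of each feature map.
    mean_feature = tf.reduce_mean(features, axis=[1, 2])   # (batch, C)
    # Assign the averaged features to the initial cell and hidden states.
    c0 = tf.keras.layers.Dense(dim_hidden)(mean_feature)
    h0 = tf.keras.layers.Dense(dim_hidden)(mean_feature)
    return c0, h0
```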
Step 9: Transform the word serial numbers in the "dictionary" into the corresponding vector space. The input is the "dictionary" word_to_idx. Apply a word embedding operation, word by word, to the picture descriptions in the training set, and output the word vectors xt of all words in each description (where t indexes the words of one description).
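A sketch of this embedding step, assuming a Keras Embedding layer and the word_to_idx dictionary from step 2.2; the embedding dimension 512 is the illustrative dim_embedding from step 5.

```python
import tensorflow as tf

# Embedding table: one dim_embedding-sized vector per dictionary entry.
embedding = tf.keras.layers.Embedding(input_dim=len(word_to_idx),
                                      output_dim=512)  # dim_embedding

# A description, converted word by word to serial numbers, then to vectors.
caption = "a dog runs on the grass".split()
ids = tf.constant([[word_to_idx[w] for w in caption]])
xt = embedding(ids)            # word vectors, shape (1, t, 512)
```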
Step 10: Feed features_concat and the LSTM hidden state ht of each iteration into the attention mechanism, so as to extract image features dynamically.
Step 10.1: Feed the channels of features_concat into one fully connected layer; the output is denoted features_proj, and its channel count matches that of features_concat.
Step 10.2: Pass ht through one fully connected layer to obtain the vector ht' whose channel count matches features_proj, and expand it by one dimension to obtain ht''. Pass ht'' through the tanh activation function and multiply it element-wise with features_concat to obtain features_h.
Step 10.3: Average features_h over its last channel dimension to obtain the two-dimensional soft weight vector, denoted alpha. Apply softmax to alpha, multiply it with each feature sub-map of features_concat, and sum all pixels of each sub-map into one pixel to obtain the attention target position of the picture. Pass ht through a fully connected layer to obtain a one-dimensional weight and multiply it with the attention target position of the picture to obtain the context vector, denoted context.
Step 10.4: The output of the attention mechanism is context.
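A minimal sketch of steps 10.1-10.4 in TensorFlow. The description is ambiguous about whether step 10.2's product uses features_concat or the projection from step 10.1; this sketch routes features_proj into the product so that step 10.1's output participates, which is one plausible reading. All shapes are assumptions.

```python
import tensorflow as tf

def attention(features_concat, ht):
    """features_concat: (batch, L, C) flattened feature sub-maps; ht: (batch, H)."""
    C = features_concat.shape[-1]
    # 10.1: fully connected layer over the channels, same channel count.
    features_proj = tf.keras.layers.Dense(C)(features_concat)   # (batch, L, C)
    # 10.2: project ht to the same channel count, expand one dimension,
    # apply tanh, and multiply element-wise with the projected features.
    ht_p = tf.keras.layers.Dense(C)(ht)                          # ht'
    ht_pp = tf.expand_dims(ht_p, 1)                              # ht'' (batch, 1, C)
    features_h = tf.nn.tanh(ht_pp) * features_proj               # (batch, L, C)
    # 10.3: average over the channel dimension -> soft weights alpha; softmax,
    # weight each sub-map, and sum all pixels into one attended vector.
    alpha = tf.nn.softmax(tf.reduce_mean(features_h, axis=-1))   # (batch, L)
    attended = tf.reduce_sum(
        features_concat * tf.expand_dims(alpha, -1), axis=1)     # (batch, C)
    # Scale by a one-dimensional weight derived from ht -> context (10.4).
    scale = tf.keras.layers.Dense(1)(ht)                         # (batch, 1)
    context = scale * attended
    return context, alpha
```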
Step 11: Using the LSTM model, concatenate the context vector context obtained at each iteration with the corresponding word vector xt of the description for that iteration, feed the concatenation into the LSTM cell, and output the LSTM hidden state ht+1. Through its remember and forget operations, the LSTM cell memorizes and forgets the currently available information and the previously recalled information.
Step 12: Perform the final decoding operation on the LSTM output. The inputs are ht+1, context, and xt. Apply dropout to the ht+1 from step 11, then pass it through one fully connected layer to obtain ht+1' with the same length as a word vector. Pass context through one fully connected layer to obtain context' with the same length as a word vector. Pass ht through a fully connected layer to obtain a one-dimensional weight and multiply it with each dimension of the xt from step 9 to obtain the new xt'. Add ht+1', context', and xt' together, apply the tanh activation function, apply dropout again, and finally apply one more fully connected layer to obtain an output of dictionary-vector length, denoted out_logits.
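A sketch of this decoding step; the dropout rate and dimensions are assumptions, and the layers are created inline for brevity (a real implementation would create them once and reuse them across time steps).

```python
import tensorflow as tf

def decode_step(ht_next, ht, context, xt, dim_embedding, vocab_size):
    """ht_next: (batch, H) new hidden state; xt: (batch, E) word vector."""
    # Dropout on the new hidden state, then project to word-vector length.
    h_drop = tf.keras.layers.Dropout(0.5)(ht_next)
    h_proj = tf.keras.layers.Dense(dim_embedding)(h_drop)       # ht+1'
    # Project the context vector to word-vector length.
    ctx_proj = tf.keras.layers.Dense(dim_embedding)(context)    # context'
    # A one-dimensional weight from ht scales every dimension of xt.
    w = tf.keras.layers.Dense(1)(ht)                            # (batch, 1)
    xt_new = w * xt                                             # xt'
    # Sum, tanh, dropout, and a final fully connected layer of dictionary size.
    out = tf.nn.tanh(h_proj + ctx_proj + xt_new)
    out = tf.keras.layers.Dropout(0.5)(out)
    out_logits = tf.keras.layers.Dense(vocab_size)(out)
    return out_logits
```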
Step 13: Save all training parameters after training finishes.
III. Testing steps:
Step 14: Feed the cropped test set pictures into the existing VGGNet as input, and output the feature sub-maps of the third convolutional layer in the fifth block of VGGNet. VGGNet pre-extracts the image features and pre-encodes the image, transforming it into the required vector space.
Step 15: Apply batch normalization to the VGGNet output. The normalized image features are denoted features.
Step 16: Load the previously stored trained network parameters.
Step 17: In the newly constructed network structure, take features as input. Feed features into one convolutional layer to obtain the output conv, which captures deeper image semantic information. Upsample (resize) conv back to the original size to obtain image_resize, so that the shallower features can be fused with the deeper features; the resize operation adds no training parameters. Then concatenate features and image_resize, feed the result into another convolutional layer to convert it back to the original feature shape, and finally apply batch normalization. The feature sub-maps output by this network are denoted features_concat.
Step 18: Provide initial values for the LSTM network. Sum and average all pixels of each feature map in features, then assign the averaged feature sub-maps to the initial cell state c0 and the initial hidden state h0 of the LSTM cell, respectively.
Step 19: Take the word sampled_word predicted in the previous iteration as input, apply the word embedding operation, and denote the output xt.
Step 20: Feed features_concat and the LSTM hidden state ht of each iteration into the attention mechanism, so as to extract image features dynamically.
Step 20.1: Feed the channels of features_concat into one fully connected layer; the output is denoted features_proj, and its channel count matches that of features_concat.
Step 20.2: Pass ht through one fully connected layer to obtain the vector ht' whose channel count matches features_proj, and expand it by one dimension to obtain ht''. Pass ht'' through the tanh activation function and multiply it element-wise with features_concat to obtain features_h.
Step 20.3: Average features_h over its last channel dimension to obtain the two-dimensional soft weight vector, denoted alpha. Apply softmax to alpha, multiply it with each feature sub-map of features_concat, and sum all pixels of each sub-map into one pixel to obtain the attention target position of the picture. Pass ht through a fully connected layer to obtain a one-dimensional weight and multiply it with the attention target position of the picture to obtain the context vector, denoted context.
Step 20.4: The output of the attention mechanism is context.
Step 21: Using the LSTM model, concatenate the context vector context obtained at each iteration with the corresponding word vector xt for that iteration, feed the concatenation into the LSTM cell, and output the LSTM hidden state ht+1. Through its remember and forget operations, the LSTM cell memorizes and forgets the currently available information and the previously recalled information.
Step 22: Perform the final decoding operation on the LSTM output. The inputs are ht+1, context, and xt. Apply dropout to the ht+1 from step 21, then pass it through one fully connected layer to obtain ht+1' with the same length as a word vector. Pass context through one fully connected layer to obtain context' with the same length as a word vector. Pass ht through a fully connected layer to obtain a one-dimensional weight and multiply it with each dimension of the xt from step 19 to obtain the new xt'. Add ht+1', context', and xt' together, apply the tanh activation function, apply dropout again, and finally apply one more fully connected layer to obtain an output of dictionary-vector length, denoted out_logits.
Step 23: Apply softmax to out_logits and select the word serial number with the highest probability, denoted sampled_word; then translate it into the corresponding English word with the dictionary built earlier.
Step 24: After all iterations are complete, print the words to the display in order.
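A sketch of the greedy decoding loop of steps 19-24, assembled from the sketches above. It assumes a Keras LSTMCell as lstm_cell, the attention and decode_step functions sketched earlier, an Embedding layer, and an inverse dictionary idx_to_word; the start token and batch handling are assumptions, and layers are again created inline rather than reused.

```python
import tensorflow as tf

def generate_caption(features_concat, c0, h0, embedding, lstm_cell,
                     idx_to_word, n_time_step, start_idx=0):
    """Greedily decode a caption for one batch of feature sub-maps."""
    c, h = c0, h0
    sampled_word = tf.fill([features_concat.shape[0]], start_idx)
    words = []
    for _ in range(n_time_step):
        xt = embedding(sampled_word)                    # step 19: embed last word
        context, _ = attention(features_concat, h)      # step 20: attend
        inputs = tf.concat([context, xt], axis=-1)      # step 21: LSTM step
        _, (h_next, c) = lstm_cell(inputs, [h, c])
        out_logits = decode_step(h_next, h, context, xt,
                                 xt.shape[-1], len(idx_to_word))  # step 22
        sampled_word = tf.argmax(out_logits, axis=-1)   # step 23: most probable
        words.append(idx_to_word[int(sampled_word[0])])
        h = h_next
    return " ".join(words)                              # step 24: print in order
```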
Fig. 2 is a structural block diagram of the image description system of an embodiment of the present invention. As shown in Fig. 2, the image description system comprises:
a first obtaining module 201, configured to obtain a training set comprising a set of training pictures and a training text describing each training picture in the set;
a feature-picture training set determining module 202, configured to obtain a feature-picture training set from the training pictures;
an attention weight determining module 203, configured to determine the attention weights of an attention mechanism model;
a key-feature-picture training set determining module 204, configured to obtain a key-feature-picture training set from the feature-picture training set using the attention weights;
an output obtaining module 205, configured to take the key-feature-picture training set and the training text as the input of an LSTM model and obtain its output, the output being the key training text;
a training module 206, configured to train a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model.
The training module 206 specifically comprises:
a superimposing unit, configured to superimpose the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
a loss obtaining unit, configured to compute the loss between the training superimposed text and the training text;
a training unit, configured to adjust, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
The system further comprises:
a second obtaining module 207, configured to obtain a test set comprising a set of test pictures and a test text;
a feature-picture test set determining module 208, configured to obtain a feature-picture test set from the test pictures;
a key-feature-picture test set determining module 209, configured to obtain a key-feature-picture test set from the feature-picture test set using the attention weights;
a key test text obtaining module 210, configured to obtain a key test text from the key-feature-picture test set, the test text, and the LSTM model;
a text description determining module 211, configured to obtain, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to each other. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts can be found in the description of the method.
Specific examples are used herein to explain the principle and implementation of the invention. The above embodiments are only intended to help understand the method of the invention and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementation and application scope in accordance with the idea of the invention. In conclusion, the content of this specification shall not be construed as limiting the invention.
Claims (9)
1. An image description method, characterized in that the method comprises:
obtaining a training set, the training set comprising a set of training pictures and a training text describing each training picture in the set;
obtaining a feature-picture training set from the training pictures;
determining the attention weights of an attention mechanism model;
obtaining a key-feature-picture training set from the feature-picture training set using the attention weights;
taking the key-feature-picture training set and the training text as the input of a long short-term memory (LSTM) model, and obtaining the output of the LSTM model, the output being the key training text;
training a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model;
obtaining a test set, the test set comprising a set of test pictures and a test text;
obtaining a feature-picture test set from the test pictures;
obtaining a key-feature-picture test set from the feature-picture test set using the attention weights;
obtaining a key test text from the key-feature-picture test set, the test text, and the LSTM model;
obtaining, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
2. The method according to claim 1, characterized in that obtaining the feature-picture training set from the training pictures specifically comprises:
training a first convolutional neural network model with the training pictures to obtain a trained first convolutional neural network model;
obtaining the output of the trained first convolutional neural network model, which is the initial-feature-picture training set;
training a second convolutional neural network model with the initial-feature-picture training set to obtain a trained second convolutional neural network model;
obtaining the output of the trained second convolutional neural network model, which is the feature-picture training set.
3. The method according to claim 2, characterized in that obtaining the output of the trained first convolutional neural network model specifically comprises:
cropping each training picture to obtain the cropped training pictures;
extracting the initial features of each training picture with the convolutional neural network model to obtain the initial-feature-picture training set.
4. The method according to claim 2, characterized in that obtaining the output of the trained second convolutional neural network model specifically comprises:
performing a convolution operation on each feature picture in the initial-feature-picture training set with the convolutional layer of the second convolutional neural network model to obtain a convolution-feature-picture training set;
resizing each convolution feature picture in the convolution-feature-picture training set to the size of the corresponding training picture;
concatenating the resized convolution feature pictures with the training pictures to obtain the feature-picture training set.
5. The method according to claim 1, characterized in that determining the attention weights of the attention mechanism model specifically comprises:
using the initial output and, during iteration, the output of the LSTM model as the weights of the feature training pictures, and thereby determining the attention weights, so that each word in the description corresponds to the required key feature sub-map.
6. The method according to claim 1, characterized in that obtaining the key test text from the key-feature-picture test set, the test text, and the LSTM model specifically comprises:
passing the output of the LSTM model through a fully connected operation and a scaling transform to serve as the weights of the test text, thereby obtaining the key test text.
7. The method according to claim 4, characterized in that training the neural network model from the key training text and the key-feature-picture training set to obtain the decoding model specifically comprises:
superimposing the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
computing the loss between the training superimposed text and the training text;
adjusting, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
8. An image description system, characterized in that the system comprises:
a first obtaining module, configured to obtain a training set comprising a set of training pictures and a training text describing each training picture in the set;
a feature-picture training set determining module, configured to obtain a feature-picture training set from the training pictures;
an attention weight determining module, configured to determine the attention weights of an attention mechanism model;
a key-feature-picture training set determining module, configured to obtain a key-feature-picture training set from the feature-picture training set using the attention weights;
an output obtaining module, configured to take the key-feature-picture training set and the training text as the input of an LSTM model and obtain its output, the output being the key training text;
a training module, configured to train a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model;
a second obtaining module, configured to obtain a test set comprising a set of test pictures and a test text;
a feature-picture test set determining module, configured to obtain a feature-picture test set from the test pictures;
a key-feature-picture test set determining module, configured to obtain a key-feature-picture test set from the feature-picture test set using the attention weights;
a key test text obtaining module, configured to obtain a key test text from the key-feature-picture test set, the test text, and the LSTM model;
a text description determining module, configured to obtain, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
9. The system according to claim 8, characterized in that the training module comprises:
a superimposing unit, configured to superimpose the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
a loss obtaining unit, configured to compute the loss between the training superimposed text and the training text;
a training unit, configured to adjust, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810537627.9A CN108898639A (en) | 2018-05-30 | 2018-05-30 | A kind of Image Description Methods and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810537627.9A CN108898639A (en) | 2018-05-30 | 2018-05-30 | A kind of Image Description Methods and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108898639A true CN108898639A (en) | 2018-11-27 |
Family
ID=64344019
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810537627.9A Pending CN108898639A (en) | 2018-05-30 | 2018-05-30 | A kind of Image Description Methods and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108898639A (en) |
- 2018-05-30: Application CN201810537627.9A filed in China; published as patent CN108898639A (status: Pending)
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120033874A1 (en) * | 2010-08-05 | 2012-02-09 | Xerox Corporation | Learning weights of fonts for typed samples in handwritten keyword spotting |
US20120163707A1 (en) * | 2010-12-28 | 2012-06-28 | Microsoft Corporation | Matching text to images |
WO2017151757A1 (en) * | 2016-03-01 | 2017-09-08 | The United States Of America, As Represented By The Secretary, Department Of Health And Human Services | Recurrent neural feedback model for automated image annotation |
CN105938485A (en) * | 2016-04-14 | 2016-09-14 | 北京工业大学 | Image description method based on convolution cyclic hybrid model |
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to iamge description |
WO2018094296A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Sentinel long short-term memory |
US20180143966A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial Attention Model for Image Captioning |
CN107918782A (en) * | 2016-12-29 | 2018-04-17 | 中国科学院计算技术研究所 | A kind of method and system for the natural language for generating description picture material |
CN108052512A (en) * | 2017-11-03 | 2018-05-18 | 同济大学 | A kind of iamge description generation method based on depth attention mechanism |
Non-Patent Citations (9)
Title |
---|
THÉODORE BLUCHE: "Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention", 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) *
ZHANG Junyang et al.: "A Review of Research on Deep Learning", Application Research of Computers *
ZHANG Linlin et al.: "An Image Classification Method Based on Convolutional Neural Networks", Fujian Computer *
YANG Nan et al.: "Research on Image Description Based on Deep Learning", Infrared and Laser Engineering *
LIN Jie et al.: "Image Recognition Processing Based on Deep Learning", Network Security Technology & Application *
LIANG Rui et al.: "A Deep Video Natural Language Description Method Based on Multi-Feature Fusion", Journal of Computer Applications *
TANG Pengjie et al.: "Image Captioning with Layer-Wise Multi-Objective Optimization and Multi-Layer Probability Fusion of LSTM", Acta Automatica Sinica *
CHEN Hongjun et al.: "Image Description Method and Optimization Based on CNN-RNN Deep Learning", Natural Science Journal of Xiangtan University *
MA Longlong et al.: "A Survey of Image Captioning Methods", Journal of Chinese Information Processing *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111597326A (en) * | 2019-02-21 | 2020-08-28 | 北京京东尚科信息技术有限公司 | Method and device for generating commodity description text |
CN111597326B (en) * | 2019-02-21 | 2024-03-05 | 北京汇钧科技有限公司 | Method and device for generating commodity description text |
WO2020186484A1 (en) * | 2019-03-20 | 2020-09-24 | 深圳大学 | Automatic image description generation method and system, electronic device, and storage medium |
CN109947526A (en) * | 2019-03-29 | 2019-06-28 | 北京百度网讯科技有限公司 | Method and apparatus for output information |
CN109947526B (en) * | 2019-03-29 | 2023-04-11 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
CN110288665A (en) * | 2019-05-13 | 2019-09-27 | 中国科学院西安光学精密机械研究所 | Image Description Methods, computer readable storage medium based on convolutional neural networks, electronic equipment |
CN110288535A (en) * | 2019-05-14 | 2019-09-27 | 北京邮电大学 | A kind of image rain removing method and device |
WO2021008145A1 (en) * | 2019-07-12 | 2021-01-21 | 北京京东尚科信息技术有限公司 | Image paragraph description generating method and apparatus, medium and electronic device |
CN110929587A (en) * | 2019-10-30 | 2020-03-27 | 杭州电子科技大学 | Bidirectional reconstruction network video description method based on hierarchical attention mechanism |
CN116091363A (en) * | 2023-04-03 | 2023-05-09 | 南京信息工程大学 | Handwriting Chinese character image restoration method and system |
CN116453120A (en) * | 2023-04-19 | 2023-07-18 | 浪潮智慧科技有限公司 | Image description method, device and medium based on time sequence scene graph attention mechanism |
CN116453120B (en) * | 2023-04-19 | 2024-04-05 | 浪潮智慧科技有限公司 | Image description method, device and medium based on time sequence scene graph attention mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108898639A (en) | A kind of Image Description Methods and system | |
CN113674140B (en) | Physical countermeasure sample generation method and system | |
CN113343705B (en) | Text semantic based detail preservation image generation method and system | |
CN108228686A (en) | It is used to implement the matched method, apparatus of picture and text and electronic equipment | |
CN109087258A (en) | A kind of image rain removing method and device based on deep learning | |
CN107729987A (en) | The automatic describing method of night vision image based on depth convolution loop neutral net | |
CN109255772A (en) | License plate image generation method, device, equipment and medium based on Style Transfer | |
CN109977199A (en) | A kind of reading understanding method based on attention pond mechanism | |
CN110544218A (en) | Image processing method, device and storage medium | |
CN114511576B (en) | Image segmentation method and system of scale self-adaptive feature enhanced deep neural network | |
CN114596566B (en) | Text recognition method and related device | |
KR20230152741A (en) | Multi-modal few-shot learning using fixed language models | |
CN116704079B (en) | Image generation method, device, equipment and storage medium | |
Jiang et al. | Language-guided global image editing via cross-modal cyclic mechanism | |
CN109300128A (en) | The transfer learning image processing method of structure is implied based on convolutional Neural net | |
CN117522697A (en) | Face image generation method, face image generation system and model training method | |
CN117576264B (en) | Image generation method, device, equipment and medium | |
CN110969137A (en) | Household image description generation method, device and system and storage medium | |
CN113962192B (en) | Method and device for generating Chinese character font generation model and Chinese character font generation method and device | |
CN116958766B (en) | Image processing method and computer readable storage medium | |
CN117392293A (en) | Image processing method, device, electronic equipment and storage medium | |
CN117034951A (en) | Digital person with specific language style based on large language model | |
CN110097615B (en) | Stylized and de-stylized artistic word editing method and system | |
CN110866866A (en) | Image color-matching processing method and device, electronic device and storage medium | |
CN116402067A (en) | Cross-language self-supervision generation method for multi-language character style retention |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| AD01 | Patent right deemed abandoned | Effective date of abandoning: 20230915 |