CN108898639A - An image description method and system - Google Patents

An image description method and system

Info

Publication number
CN108898639A
CN108898639A
Authority
CN
China
Prior art keywords
training
text
picture
test
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810537627.9A
Other languages
Chinese (zh)
Inventor
王紫嫣
刘罡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Priority to CN201810537627.9A priority Critical patent/CN108898639A/en
Publication of CN108898639A publication Critical patent/CN108898639A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image description method and system. The method first uses the VggNet convolutional neural network provided by Google to extract image features, i.e., to encode the image; it then uses an attention mechanism to assign weights to the feature sub-maps and find the key feature sub-maps that are important to the text description; it then uses an LSTM, with the key text obtained from the LSTM hidden layer serving as weights, to decode the combination of the text and the attention-weighted image, thereby generating the description of the corresponding image. The invention more effectively improves the precision of image description.

Description

An image description method and system
Technical field
The present invention relates to the field of image description, and more particularly to an image description method and system.
Background technique
With the development of deep learning technology, image description techniques based on deep learning have matured. Image description is a technique that describes the content of an image using correctly formed sentences. Image description methods integrate multiple specialized techniques such as deep learning, pattern recognition, digital image processing, and natural language processing. Image description has two key aspects: (1) extraction of image features; (2) synthesis of natural language. Deep learning automates image feature extraction and recognition, greatly improving the accuracy of object and scene recognition, and it makes natural-language prediction possible so that the generated sentences are more fluent and correct. The design of the deep-learning network structure used in an image description method often directly determines the quality of the description; designing a suitable network structure is therefore one of the vital tasks in improving image description precision. Directly using high-level convolution features as a simple static representation of the image is common in traditional image understanding methods, but this approach has a potential defect: it easily loses rich and important image information, which ultimately reduces the description precision.
Summary of the invention
In view of the above problems, the present invention provides an image description method and system.
To achieve the above object, the present invention provides the following scheme:
An image description method, the method comprising:
obtaining a training set, the training set comprising a training picture set and training text describing each training picture in the training picture set;
obtaining a feature picture training set from the training picture set;
determining the attention weights of an attention mechanism model;
obtaining a key feature picture training set from the feature picture training set using the attention weights;
taking the key feature picture training set and the training text as the input of a long short-term memory (LSTM) model and obtaining the output of the LSTM model, the output of the LSTM model being key training text;
training a neural network model according to the key training text and the key feature picture training set to obtain a decoding model;
obtaining a test set, the test set comprising a test picture set and test text;
obtaining a feature picture test set from the test picture set;
obtaining a key feature picture test set from the feature picture test set using the attention weights;
obtaining key test text from the key feature picture test set, the test text, and the LSTM model;
obtaining, by the decoding model, the text description of each test picture in the test picture set according to the key test text and the key feature picture training set.
Optionally, obtaining the feature picture training set from the training picture set specifically comprises:
training a first convolutional neural network model with the training picture set to obtain a trained first convolutional neural network model;
obtaining the output of the trained first convolutional neural network model, the output being an initial feature picture training set;
training a second convolutional neural network model with the initial feature picture training set to obtain a trained second convolutional neural network model;
obtaining the output of the trained second convolutional neural network model, the output being the feature picture training set.
Optionally, obtaining the output of the trained first convolutional neural network model specifically comprises:
cropping each training picture to obtain cropped training pictures;
extracting the initial features of each training picture with the convolutional neural network model to obtain the initial feature picture training set.
Optionally, obtaining the output of the trained second convolutional neural network model specifically comprises:
performing a convolution operation on each feature picture in the initial feature picture training set with the convolutional layer of the second convolutional neural network model to obtain a convolution feature picture training set;
resizing each convolution feature picture in the convolution feature picture training set to the size of the corresponding training picture;
concatenating the resized convolution feature picture set with the training picture set to obtain the feature picture training set.
Optionally, determining the attention weights of the attention mechanism model specifically comprises:
taking the initial output and the outputs of the LSTM model during iteration as the weights of the feature training maps, thereby determining the attention weights and thus the trained key feature sub-maps required for each corresponding word in the description.
Optionally, obtaining the key test text from the key feature picture test set, the test text, and the LSTM model specifically comprises:
passing the output of the LSTM model through a fully connected operation and scaling it to serve as the weight of the test text, thereby obtaining the key test text.
Optionally, training the neural network model according to the key training text and the key feature picture training set to obtain the decoding model specifically comprises:
superimposing the key feature picture training set, the key training text, and the output of the neural network model to obtain training superimposed text;
obtaining the loss value between the training superimposed text and the training text;
adjusting the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model according to the loss value so that the error between the training superimposed text and the training text falls within an error threshold range, thereby obtaining the decoding model.
An image description system, the system comprising:
a first obtaining module for obtaining a training set, the training set comprising a training picture set and training text describing each training picture in the training picture set;
a feature picture training set determining module for obtaining a feature picture training set from the training picture set;
an attention weight determining module for determining the attention weights of an attention mechanism model;
a key feature picture training set determining module for obtaining a key feature picture training set from the feature picture training set using the attention weights;
an output obtaining module for taking the key feature picture training set and the training text as the input of a long short-term memory (LSTM) model and obtaining the output of the LSTM model, the output being key training text;
a training module for training a neural network model according to the key training text and the key feature picture training set to obtain a decoding model;
a second obtaining module for obtaining a test set, the test set comprising a test picture set and test text;
a feature picture test set determining module for obtaining a feature picture test set from the test picture set;
a key feature picture test set determining module for obtaining a key feature picture test set from the feature picture test set using the attention weights;
a key test text obtaining module for obtaining key test text from the key feature picture test set, the test text, and the LSTM model;
a text description determining module for obtaining, by the decoding model, the text description of each test picture in the test picture set according to the key test text and the key feature picture training set.
Optionally, the training module comprises:
a superimposing unit for superimposing the key feature picture training set, the key training text, and the output of the neural network model to obtain training superimposed text;
a loss value obtaining unit for obtaining the loss value between the training superimposed text and the training text;
a training unit for adjusting the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model according to the loss value so that the error between the training superimposed text and the training text falls within an error threshold range, thereby obtaining the decoding model.
Compared with the prior art, the present invention has the following technical effects. The invention first uses the VggNet convolutional neural network provided by Google to extract image features, i.e., to encode the image; it then uses an attention mechanism to assign weights to the feature sub-maps and find the key feature sub-maps that are important to the text description; it then uses an LSTM, with the key text obtained from the LSTM hidden layer serving as weights, to decode the combination of the text and the attention-weighted image, thereby generating the description of the corresponding image. The invention more effectively improves the precision of image description.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without any creative labor.
Fig. 1 is a flow chart of the image description method according to an embodiment of the present invention;
Fig. 2 is a structural block diagram of the image description system according to an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the invention.
In order to make the above objects, features, and advantages of the invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of the image description method according to an embodiment of the present invention. As shown in Fig. 1, the image description method comprises the following steps:
Step 101: Obtain a training set, the training set comprising a training picture set and training text describing each training picture in the training picture set.
Step 102: Obtain a feature picture training set from the training picture set.
Specifically, a first convolutional neural network model is trained with the training picture set to obtain a trained first convolutional neural network model. The output of the trained first convolutional neural network model is an initial feature picture training set: each training picture is cropped, and the initial features of each cropped training picture are extracted by the convolutional neural network model to obtain the initial feature picture training set.
A second convolutional neural network model is then trained with the initial feature picture training set to obtain a trained second convolutional neural network model. The output of the trained second convolutional neural network model is the feature picture training set: a convolution operation is performed on each feature picture in the initial feature picture training set by the convolutional layer of the second convolutional neural network model, giving a convolution feature picture training set; each convolution feature picture is resized to the size of the corresponding training picture; and the resized convolution feature picture set is concatenated with the training picture set to obtain the feature picture training set.
Step 103: Determine the attention weights of the attention mechanism model. The initial output and the outputs of the LSTM model during iteration serve as the weights of the feature training maps, which determines the attention weights and thus the trained key feature sub-maps required for each corresponding word in the description.
Step 104: Obtain a key feature picture training set from the feature picture training set using the attention weights.
Step 105: Take the key feature picture training set and the training text as the input of the LSTM model and obtain the output of the LSTM model, the output being key training text.
Step 106: Train a neural network model according to the key training text and the key feature picture training set to obtain a decoding model.
Specifically, the key feature picture training set, the key training text, and the output of the neural network model are superimposed to obtain training superimposed text; the loss value between the training superimposed text and the training text is obtained; and the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model are adjusted according to the loss value so that the error between the training superimposed text and the training text falls within an error threshold range, giving the decoding model.
Step 107: Obtain a test set, the test set comprising a test picture set and test text.
Step 108: Obtain a feature picture test set from the test picture set.
Step 109: Obtain a key feature picture test set from the feature picture test set using the attention weights.
Step 110: Obtain key test text from the key feature picture test set, the test text, and the LSTM model: the output of the LSTM model is passed through a fully connected operation and scaled to serve as the weight of the test text, yielding the key test text.
Step 111: According to the key test text and the key feature picture training set, obtain the text description of each test picture in the test picture set by the decoding model.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects. The invention first uses the VggNet convolutional neural network provided by Google to extract image features, i.e., to encode the image; it then uses an attention mechanism to assign weights to the annotation text and find the key text that is important to the feature description; it then uses an LSTM to decode the attention-weighted image, generating the description of the corresponding image. After extracting image features with VggNet, the invention performs a convolution operation on the image features, resizes the convolved features back to the original feature size, concatenates them with the original features, and then performs another convolution operation and normalization after concatenation. Concatenating the original image features with the convolved image features extracts deeper semantic features without losing important image information, so the invention more effectively improves the precision of image description.
Specific implementation
1. Preparation and preprocessing steps:
Step 1: Prepare the COCO training set and COCO test set files produced by the Microsoft team and place them under the project directory. COCO is a large dataset for image recognition, image segmentation, and image description. The invention uses the deep convolutional neural network VGGNet parameter model provided by Google; the VGGNet structure contains five convolution-pooling blocks, where the first and second blocks each consist of two convolutional layers followed by a pooling layer, the third and fourth blocks each consist of four convolutional layers followed by a pooling layer, and the last block consists of three convolutional layers.
Step 2: Preprocess the training set and test set in preparation for subsequent operations. The inputs are the training set pictures with their descriptions and the test set pictures.
Step 2.1: Uniformly crop the training set and test set pictures to a size of 224*224.
Step 2.2: The training set contains pictures and descriptions of the pictures. Split the description sentences of all pictures into words one by one, discard repeated words, and then assign a serial number to every word, obtaining the "dictionary" word_to_idx.
Step 2.3: Output the processed pictures and the "dictionary" word_to_idx.
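As an illustration of step 2.2, the following is a minimal Python sketch of building the "dictionary" word_to_idx; the tokenization rule and the reserved padding index are assumptions, since the patent does not fix them:

import re

def build_word_to_idx(captions):
    """Build the 'dictionary' word_to_idx from a list of caption strings.

    Splits each sentence into words, drops repeated words, and assigns
    every unique word a serial number. Index 0 is reserved here for a
    padding token (an assumption; the patent only requires unique
    serial numbers).
    """
    word_to_idx = {'<pad>': 0}
    for caption in captions:
        for word in re.findall(r"[a-z']+", caption.lower()):
            if word not in word_to_idx:        # discard repeated words
                word_to_idx[word] = len(word_to_idx)
    return word_to_idx

captions = ["A man rides a horse", "A dog runs on the grass"]
print(build_word_to_idx(captions))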
2. Training steps:
Step 3: Feed the cropped training set pictures as input into the existing VGGNet and output the feature sub-maps of the third convolutional layer in the fifth block of VGGNet. Using VGGNet, image features are pre-extracted and the image is pre-encoded, i.e., transformed into the required vector space.
Step 4: Apply a batch normalization operation to the VGGNet output. The image features after normalization are denoted features.
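Steps 3 and 4 might look as follows in PyTorch; the use of torchvision's pretrained VGG19 is an assumption (the patent only names a VGGNet parameter model), with the slice ending at the third convolutional layer of the fifth block:

import torch
import torch.nn as nn
from torchvision import models

# Assumption: torchvision's pretrained VGG19 stands in for the "VGGNet
# parameter model"; vgg.features[:34] ends at the ReLU after conv5_3,
# i.e. the third convolutional layer of block 5.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
conv5_3 = nn.Sequential(*list(vgg.features.children())[:34]).eval()

bn = nn.BatchNorm2d(512)                 # step 4: batch normalization

images = torch.randn(2, 3, 224, 224)     # cropped 224*224 pictures
with torch.no_grad():
    features = bn(conv5_3(images))       # (2, 512, 14, 14) feature sub-maps
print(features.shape)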
Step 5: The user initializes the parameters of the network structure. The initialization parameters include: the training samples are fed in batches, with the number of samples per batch denoted batch_size; the number of epochs over all samples, epoch; the picture feature dimension dim_features; the word embedding dimension dim_embedding; the LSTM iteration count n_time_step; the LSTM hidden state dimension dim_hidden; the doubly stochastic regularization coefficient alpha_c; the learning rate learning_rate; the network optimizer update_rule; the number of iterations between printed results, print_every; the training set path image_path; the model save path model_path; and the test model read path test_path.
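The initialization of step 5 can be pictured as a configuration dictionary; every concrete value below is an illustrative placeholder, not a value taken from the patent:

# Step 5 as a config dict; all values are illustrative placeholders.
config = {
    'batch_size': 128,          # training samples per batch
    'epoch': 20,                # passes over all samples
    'dim_features': 512,        # picture feature dimension
    'dim_embedding': 512,       # word embedding dimension
    'n_time_step': 16,          # LSTM iteration count
    'dim_hidden': 1024,         # LSTM hidden state dimension
    'alpha_c': 1.0,             # doubly stochastic regularization coefficient
    'learning_rate': 0.001,
    'update_rule': 'adam',      # network optimizer
    'print_every': 100,         # iterations between printed results
    'image_path': './data/train/',
    'model_path': './model/',
    'test_path': './model/',    # path the test model is read from
}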
Step 6: The whole network composed of VGGNet, the attention mechanism, and the LSTM uses the following loss function:

$\mathcal{L} = -\log\big(P(y \mid x)\big) + \lambda \sum_{i=1}^{L} \Big(1 - \sum_{t=1}^{C} \alpha_{t,i}\Big)^{2}$

where $-\log(P(y \mid x))$ is the negative log-likelihood loss, computed with cross entropy; x represents the true description and y represents the predicted description, i.e., out_logits. The second term is the penalty term expressed with the soft weight vectors α, where λ is the control parameter of the penalty term, C is the number of α vectors in one description, and L is the length of each α vector.
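A sketch of this loss in PyTorch, assuming out_logits of shape (batch, C, vocab), integer targets, and attention weights alphas of shape (batch, C, L):

import torch
import torch.nn.functional as F

def captioning_loss(out_logits, targets, alphas, lam=1.0):
    """Cross entropy (negative log-likelihood) plus the doubly stochastic
    attention penalty lambda * sum_i (1 - sum_t alpha_{t,i})^2.

    out_logits: (batch, C, vocab) predicted scores per time step
    targets:    (batch, C) ground-truth word indices
    alphas:     (batch, C, L) soft attention weights, C steps over L positions
    """
    nll = F.cross_entropy(out_logits.reshape(-1, out_logits.size(-1)),
                          targets.reshape(-1))
    penalty = ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1).mean()
    return nll + lam * penalty

# toy shapes: batch 2, C=5 steps, L=196 positions, vocab 1000
loss = captioning_loss(torch.randn(2, 5, 1000),
                       torch.randint(0, 1000, (2, 5)),
                       torch.softmax(torch.randn(2, 5, 196), dim=-1))
print(loss.item())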
Step 7: In the newly constructed network structure, features is taken as input. features is fed into one convolutional layer to obtain the output conv, which extracts deeper image semantic information. conv is then upsampled (resized) back to the original size, giving image_resize, so that the shallower features can be concatenated with the deeper features; the resize operation adds no training parameters. features and image_resize are then concatenated and fed into another convolutional layer, converting back to the original feature shape, and finally a batch normalization operation is applied. The feature sub-maps output by this network are denoted features_concat.
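Step 7 might be sketched as the following PyTorch module; the kernel sizes and the stride of the deeper convolution are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Step 7 sketch: convolve features for deeper semantics, resize back
    to the original size (adding no parameters), concatenate with the
    shallow features, convolve back to the original shape, batch-normalize."""
    def __init__(self, channels=512):
        super().__init__()
        self.deep = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # restore shape
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, features):
        conv = self.deep(features)                        # deeper semantics
        image_resize = F.interpolate(conv, size=features.shape[-2:],
                                     mode='bilinear', align_corners=False)
        concat = torch.cat([features, image_resize], dim=1)
        return self.bn(self.fuse(concat))                 # features_concat

features = torch.randn(2, 512, 14, 14)
print(FeatureFusion()(features).shape)                    # (2, 512, 14, 14)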
Step 8: Provide initial values for the LSTM network. All pixels of each features map are summed and averaged, and the resulting averaged feature sub-map is assigned to the initial cell state c0 of the LSTM cell and the initial hidden state h0 of the LSTM cell.
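A sketch of step 8; the two linear projections are an assumption, used here to map the feature channels onto the LSTM state dimension dim_hidden:

import torch
import torch.nn as nn

# Step 8 sketch: the spatial mean of the feature maps initializes the
# LSTM states; the Linear projections are an assumption (the patent only
# says the averaged sub-maps are assigned to c0 and h0).
dim_hidden = 1024
init_c = nn.Linear(512, dim_hidden)
init_h = nn.Linear(512, dim_hidden)

features_concat = torch.randn(2, 512, 14, 14)
mean_feat = features_concat.mean(dim=(2, 3))   # average all pixels: (2, 512)
c0, h0 = init_c(mean_feat), init_h(mean_feat)  # initial cell/hidden state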
Step 9: Transform the word serial numbers in the "dictionary" into the corresponding vector space. The input is the "dictionary" word_to_idx. The picture descriptions in the training set undergo a word-by-word embedding operation, outputting the word vectors xt of all words in the description (where t indicates the number of words in one description).
Step 10: Feed features_concat and the LSTM hidden state ht of each iteration as input into the attention mechanism, achieving dynamic extraction of image features.
Step 10.1: Feed the channels of features_concat into one fully connected layer; the fully connected output is denoted features_proj, and its channel number is consistent with that of features_concat.
Step 10.2: Pass ht through one fully connected layer to obtain a vector ht' whose channel number is consistent with features_proj, and expand it by one dimension to obtain ht''. After the tanh activation function, ht'' is multiplied element-wise with features_concat to obtain features_h.
Step 10.3: Average features_h over its last channel dimension to obtain a two-dimensional soft weight vector, denoted alpha. After softmax, alpha is multiplied with each feature sub-map of features_concat, and all pixels of each sub-map are summed into one pixel, giving the attention target positions of the picture. ht is passed through a fully connected layer to obtain a one-dimensional weight, which is multiplied with the attention target positions of the picture to obtain the context vector, denoted context.
Step 10.4: The output of the attention mechanism is context.
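A sketch in the spirit of steps 10.1-10.4; the patent's tensor layout is ambiguous, so two assumptions are made: feature maps are laid out as (batch, positions, channels), and the multiplication in step 10.2 is applied to the projected features features_proj:

import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Steps 10.1-10.4 sketch: project features and hidden state, combine
    with tanh, average the last channel into a per-position weight alpha,
    softmax, pool the features with it, and scale by a 1-d weight from ht."""
    def __init__(self, dim_feat=512, dim_hidden=1024):
        super().__init__()
        self.w_feat = nn.Linear(dim_feat, dim_feat)   # 10.1: features_proj
        self.w_h = nn.Linear(dim_hidden, dim_feat)    # 10.2: ht -> ht'
        self.w_ctx = nn.Linear(dim_hidden, 1)         # 10.3: 1-d weight from ht

    def forward(self, features_concat, ht):
        # features_concat: (B, L, D) with L spatial positions; ht: (B, H)
        features_proj = self.w_feat(features_concat)            # 10.1
        ht2 = torch.tanh(self.w_h(ht)).unsqueeze(1)             # 10.2: expand one dim
        features_h = ht2 * features_proj                        # element-wise multiply
        alpha = torch.softmax(features_h.mean(dim=-1), dim=1)   # 10.3: average last channel
        attended = (alpha.unsqueeze(-1) * features_concat).sum(dim=1)  # sum all pixels
        context = self.w_ctx(ht) * attended                     # scale by 1-d weight
        return context, alpha

attn = SoftAttention()
context, alpha = attn(torch.randn(2, 196, 512), torch.randn(2, 1024))
print(context.shape, alpha.shape)   # (2, 512) (2, 196)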
Step 11: Using the LSTM model, the context vector context obtained at each iteration and the word vector xt of the corresponding iteration in the description are concatenated as input and fed into the LSTM cell, which outputs the LSTM hidden state ht+1. Through its memory and forget operations, the LSTM cell remembers and forgets the currently available information and the previously recalled information.
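Step 11 corresponds to one step of an LSTM cell over the concatenation of context and xt; a minimal sketch under the shape assumptions used above:

import torch
import torch.nn as nn

# Step 11 sketch: one LSTM cell step on [context; xt].
dim_embed, dim_feat, dim_hidden = 512, 512, 1024
cell = nn.LSTMCell(dim_feat + dim_embed, dim_hidden)

context = torch.randn(2, dim_feat)
xt = torch.randn(2, dim_embed)
h, c = torch.zeros(2, dim_hidden), torch.zeros(2, dim_hidden)
h, c = cell(torch.cat([context, xt], dim=1), (h, c))   # ht+1 and new cell state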
Step 12: Perform the final decoding operation on the LSTM output. The inputs are ht+1, context, and xt. The ht+1 from step 11 undergoes a dropout operation followed by one fully connected layer, giving ht+1' with the same length as the word vector. context is passed through one fully connected layer to obtain context' with the same length as the word vector. ht is passed through a fully connected layer to obtain a one-dimensional weight, which is multiplied with every dimension of the xt from step 9 to obtain a new xt'. ht+1', context', and xt' are added together; the sum goes through a tanh activation, then another dropout operation, and finally one fully connected layer, giving an output of dictionary-vector length, denoted out_logits.
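Step 12 might be sketched as follows; the dropout rate and layer sizes are assumptions:

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Step 12 sketch: project ht+1 and context to word-vector length,
    reweight xt with a 1-d weight from ht, add the three, tanh, dropout,
    and map to dictionary-length logits."""
    def __init__(self, dim_hidden=1024, dim_feat=512, dim_embed=512, vocab=10000):
        super().__init__()
        self.drop = nn.Dropout(0.5)
        self.w_h = nn.Linear(dim_hidden, dim_embed)    # ht+1 -> ht+1'
        self.w_ctx = nn.Linear(dim_feat, dim_embed)    # context -> context'
        self.w_x = nn.Linear(dim_hidden, 1)            # 1-d weight from ht
        self.w_out = nn.Linear(dim_embed, vocab)       # dictionary-length output

    def forward(self, ht_next, context, xt, ht):
        h_proj = self.w_h(self.drop(ht_next))          # dropout then full connection
        ctx_proj = self.w_ctx(context)
        x_new = self.w_x(ht) * xt                      # xt' = weight * xt
        out = torch.tanh(h_proj + ctx_proj + x_new)
        return self.w_out(self.drop(out))              # out_logits

dec = Decoder()
logits = dec(torch.randn(2, 1024), torch.randn(2, 512),
             torch.randn(2, 512), torch.randn(2, 1024))
print(logits.shape)    # (2, 10000)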
Step 13: Save all training parameters after training is complete.
3. Testing steps:
Step 14: Feed the cropped test set pictures as input into the existing VGGNet and output the feature sub-maps of the third convolutional layer in the fifth block of VGGNet. Using VGGNet, image features are pre-extracted and the image is pre-encoded, i.e., transformed into the required vector space.
Step 15: Apply a batch normalization operation to the VGGNet output. The image features after normalization are denoted features.
Step 16: Load the previously saved trained network parameters.
Step 17: In the newly constructed network structure, features is taken as input. features is fed into one convolutional layer to obtain the output conv, which extracts deeper image semantic information. conv is then upsampled (resized) back to the original size, giving image_resize, so that the shallower features can be concatenated with the deeper features; the resize operation adds no training parameters. features and image_resize are then concatenated and fed into another convolutional layer, converting back to the original feature shape, and finally a batch normalization operation is applied. The feature sub-maps output by this network are denoted features_concat.
Step 18: Provide initial values for the LSTM network. All pixels of each features map are summed and averaged, and the resulting averaged feature sub-map is assigned to the initial cell state c0 of the LSTM cell and the initial hidden state h0 of the LSTM cell.
Step 19: Take the word sampled_word predicted in the previous iteration as input for the word embedding operation; the output is denoted xt.
Step 20: Feed features_concat and the LSTM hidden state ht of each iteration as input into the attention mechanism, achieving dynamic extraction of image features.
Step 20.1: Feed the channels of features_concat into one fully connected layer; the fully connected output is denoted features_proj, and its channel number is consistent with that of features_concat.
Step 20.2: Pass ht through one fully connected layer to obtain a vector ht' whose channel number is consistent with features_proj, and expand it by one dimension to obtain ht''. After the tanh activation function, ht'' is multiplied element-wise with features_concat to obtain features_h.
Step 20.3: Average features_h over its last channel dimension to obtain a two-dimensional soft weight vector, denoted alpha. After softmax, alpha is multiplied with each feature sub-map of features_concat, and all pixels of each sub-map are summed into one pixel, giving the attention target positions of the picture. ht is passed through a fully connected layer to obtain a one-dimensional weight, which is multiplied with the attention target positions of the picture to obtain the context vector, denoted context.
Step 20.4: The output of the attention mechanism is context.
Step 21: Using the LSTM model, the context vector context obtained at each iteration and the word vector xt of the corresponding iteration are concatenated as input and fed into the LSTM cell, which outputs the LSTM hidden state ht+1. Through its memory and forget operations, the LSTM cell remembers and forgets the currently available information and the previously recalled information.
Step 22: Perform the final decoding operation on the LSTM output. The inputs are ht+1, context, and xt. The ht+1 from step 21 undergoes a dropout operation followed by one fully connected layer, giving ht+1' with the same length as the word vector. context is passed through one fully connected layer to obtain context' with the same length as the word vector. ht is passed through a fully connected layer to obtain a one-dimensional weight, which is multiplied with every dimension of the xt from step 19 to obtain a new xt'. ht+1', context', and xt' are added together; the sum goes through a tanh activation, then another dropout operation, and finally one fully connected layer, giving an output of dictionary-vector length, denoted out_logits.
Step 23: Apply a softmax operation to out_logits and select the word index with the maximum probability, denoted sampled_word. Using the previously built dictionary, translate it into the corresponding English word.
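Steps 23 and the lookup used by step 24 in a minimal sketch; the toy dictionary and logits are placeholders:

import torch

# Step 23 sketch: softmax over out_logits, take the maximum-probability
# word index, and translate it back through the inverted dictionary.
word_to_idx = {'<pad>': 0, 'a': 1, 'dog': 2, 'runs': 3}
idx_to_word = {i: w for w, i in word_to_idx.items()}   # invert the dictionary

out_logits = torch.randn(2, len(word_to_idx))          # one step, batch of 2
sampled_word = torch.softmax(out_logits, dim=-1).argmax(dim=-1)
print([idx_to_word[int(i)] for i in sampled_word])     # words for step 24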
Step 24: After all iterations are complete, print the words to the display in sequence.
Fig. 2 is a structural block diagram of the image description system according to an embodiment of the present invention. As shown in Fig. 2, the image description system comprises:
a first obtaining module 201 for obtaining a training set, the training set comprising a training picture set and training text describing each training picture in the training picture set;
a feature picture training set determining module 202 for obtaining a feature picture training set from the training picture set;
an attention weight determining module 203 for determining the attention weights of an attention mechanism model;
a key feature picture training set determining module 204 for obtaining a key feature picture training set from the feature picture training set using the attention weights;
an output obtaining module 205 for taking the key feature picture training set and the training text as the input of the LSTM model and obtaining the output of the LSTM model, the output being key training text;
a training module 206 for training a neural network model according to the key training text and the key feature picture training set to obtain a decoding model.
The training module 206 specifically comprises:
a superimposing unit for superimposing the key feature picture training set, the key training text, and the output of the neural network model to obtain training superimposed text;
a loss value obtaining unit for obtaining the loss value between the training superimposed text and the training text;
a training unit for adjusting the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model according to the loss value so that the error between the training superimposed text and the training text falls within an error threshold range, thereby obtaining the decoding model.
The system further comprises:
a second obtaining module 207 for obtaining a test set, the test set comprising a test picture set and test text;
a feature picture test set determining module 208 for obtaining a feature picture test set from the test picture set;
a key feature picture test set determining module 209 for obtaining a key feature picture test set from the feature picture test set using the attention weights;
a key test text obtaining module 210 for obtaining key test text from the key feature picture test set, the test text, and the LSTM model;
a text description determining module 211 for obtaining, by the decoding model, the text description of each test picture in the test picture set according to the key test text and the key feature picture training set.
The embodiments in this specification are described in a progressive manner; each embodiment highlights its differences from the others, and the same or similar parts among the embodiments may refer to each other. Since the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to illustrate the principles and implementation of the invention. The above description of the embodiments is only intended to help understand the method of the invention and its core idea; meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope in accordance with the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (9)

1. An image description method, characterized in that the method comprises:
obtaining a training set, the training set comprising a training picture set and training text describing each training picture in the training picture set;
obtaining a feature picture training set from the training picture set;
determining the attention weights of an attention mechanism model;
obtaining a key feature picture training set from the feature picture training set using the attention weights;
taking the key feature picture training set and the training text as the input of a long short-term memory (LSTM) model and obtaining the output of the LSTM model, the output of the LSTM model being key training text;
training a neural network model according to the key training text and the key feature picture training set to obtain a decoding model;
obtaining a test set, the test set comprising a test picture set and test text;
obtaining a feature picture test set from the test picture set;
obtaining a key feature picture test set from the feature picture test set using the attention weights;
obtaining key test text from the key feature picture test set, the test text, and the LSTM model;
obtaining, by the decoding model, the text description of each test picture in the test picture set according to the key test text and the key feature picture training set.
2. The method according to claim 1, characterized in that obtaining the feature picture training set from the training picture set specifically comprises:
training a first convolutional neural network model with the training picture set to obtain a trained first convolutional neural network model;
obtaining the output of the trained first convolutional neural network model, the output being an initial feature picture training set;
training a second convolutional neural network model with the initial feature picture training set to obtain a trained second convolutional neural network model;
obtaining the output of the trained second convolutional neural network model, the output being the feature picture training set.
3. The method according to claim 2, characterized in that obtaining the output of the trained first convolutional neural network model specifically comprises:
cropping each training picture to obtain cropped training pictures;
extracting the initial features of each training picture with the convolutional neural network model to obtain the initial feature picture training set.
4. The method according to claim 2, characterized in that obtaining the output of the trained second convolutional neural network model specifically comprises:
performing a convolution operation on each feature picture in the initial feature picture training set with the convolutional layer of the second convolutional neural network model to obtain a convolution feature picture training set;
resizing each convolution feature picture in the convolution feature picture training set to the size of the corresponding training picture;
concatenating the resized convolution feature picture set with the training picture set to obtain the feature picture training set.
5. The method according to claim 1, characterized in that determining the attention weights of the attention mechanism model specifically comprises:
taking the initial output and the outputs of the LSTM model during iteration as the weights of the feature training maps, thereby determining the attention weights and thus the trained key feature sub-maps required for each corresponding word in the description.
6. The method according to claim 1, characterized in that obtaining the key test text from the key feature picture test set, the test text, and the LSTM model specifically comprises:
passing the output of the LSTM model through a fully connected operation and scaling it to serve as the weight of the test text, thereby obtaining the key test text.
7. The method according to claim 4, characterized in that training the neural network model according to the key training text and the key feature picture training set to obtain the decoding model specifically comprises:
superimposing the key feature picture training set, the key training text, and the output of the neural network model to obtain training superimposed text;
obtaining the loss value between the training superimposed text and the training text;
adjusting the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model according to the loss value so that the error between the training superimposed text and the training text falls within an error threshold range, thereby obtaining the decoding model.
8. An image description system, characterized in that the system comprises:
a first obtaining module for obtaining a training set, the training set comprising a training picture set and training text describing each training picture in the training picture set;
a feature picture training set determining module for obtaining a feature picture training set from the training picture set;
an attention weight determining module for determining the attention weights of an attention mechanism model;
a key feature picture training set determining module for obtaining a key feature picture training set from the feature picture training set using the attention weights;
an output obtaining module for taking the key feature picture training set and the training text as the input of a long short-term memory (LSTM) model and obtaining the output of the LSTM model, the output being key training text;
a training module for training a neural network model according to the key training text and the key feature picture training set to obtain a decoding model;
a second obtaining module for obtaining a test set, the test set comprising a test picture set and test text;
a feature picture test set determining module for obtaining a feature picture test set from the test picture set;
a key feature picture test set determining module for obtaining a key feature picture test set from the feature picture test set using the attention weights;
a key test text obtaining module for obtaining key test text from the key feature picture test set, the test text, and the LSTM model;
a text description determining module for obtaining, by the decoding model, the text description of each test picture in the test picture set according to the key test text and the key feature picture training set.
9. The system according to claim 8, characterized in that the training module comprises:
a superimposing unit for superimposing the key feature picture training set, the key training text, and the output of the neural network model to obtain training superimposed text;
a loss value obtaining unit for obtaining the loss value between the training superimposed text and the training text;
a training unit for adjusting the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model according to the loss value so that the error between the training superimposed text and the training text falls within an error threshold range, thereby obtaining the decoding model.
CN201810537627.9A 2018-05-30 2018-05-30 An image description method and system Pending CN108898639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810537627.9A CN108898639A (en) 2018-05-30 2018-05-30 An image description method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810537627.9A CN108898639A (en) 2018-05-30 2018-05-30 An image description method and system

Publications (1)

Publication Number Publication Date
CN108898639A true CN108898639A (en) 2018-11-27

Family

ID=64344019

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810537627.9A Pending CN108898639A (en) 2018-05-30 2018-05-30 An image description method and system

Country Status (1)

Country Link
CN (1) CN108898639A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947526A (en) * 2019-03-29 2019-06-28 北京百度网讯科技有限公司 Method and apparatus for output information
CN110288535A (en) * 2019-05-14 2019-09-27 北京邮电大学 A kind of image rain removing method and device
CN110288665A (en) * 2019-05-13 2019-09-27 中国科学院西安光学精密机械研究所 Image Description Methods, computer readable storage medium based on convolutional neural networks, electronic equipment
CN110929587A (en) * 2019-10-30 2020-03-27 杭州电子科技大学 Bidirectional reconstruction network video description method based on hierarchical attention mechanism
CN111597326A (en) * 2019-02-21 2020-08-28 北京京东尚科信息技术有限公司 Method and device for generating commodity description text
WO2020186484A1 (en) * 2019-03-20 2020-09-24 深圳大学 Automatic image description generation method and system, electronic device, and storage medium
WO2021008145A1 (en) * 2019-07-12 2021-01-21 北京京东尚科信息技术有限公司 Image paragraph description generating method and apparatus, medium and electronic device
CN116091363A (en) * 2023-04-03 2023-05-09 南京信息工程大学 Handwriting Chinese character image restoration method and system
CN116453120A (en) * 2023-04-19 2023-07-18 浪潮智慧科技有限公司 Image description method, device and medium based on time sequence scene graph attention mechanism

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120033874A1 (en) * 2010-08-05 2012-02-09 Xerox Corporation Learning weights of fonts for typed samples in handwritten keyword spotting
US20120163707A1 (en) * 2010-12-28 2012-06-28 Microsoft Corporation Matching text to images
CN105938485A (en) * 2016-04-14 2016-09-14 北京工业大学 Image description method based on convolution cyclic hybrid model
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 A kind of generation method from structured text to iamge description
WO2017151757A1 (en) * 2016-03-01 2017-09-08 The United States Of America, As Represented By The Secretary, Department Of Health And Human Services Recurrent neural feedback model for automated image annotation
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN108052512A (en) * 2017-11-03 2018-05-18 同济大学 A kind of iamge description generation method based on depth attention mechanism
WO2018094296A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Sentinel long short-term memory
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120033874A1 (en) * 2010-08-05 2012-02-09 Xerox Corporation Learning weights of fonts for typed samples in handwritten keyword spotting
US20120163707A1 (en) * 2010-12-28 2012-06-28 Microsoft Corporation Matching text to images
WO2017151757A1 (en) * 2016-03-01 2017-09-08 The United States Of America, As Represented By The Secretary, Department Of Health And Human Services Recurrent neural feedback model for automated image annotation
CN105938485A (en) * 2016-04-14 2016-09-14 北京工业大学 Image description method based on convolution cyclic hybrid model
CN106503055A (en) * 2016-09-27 2017-03-15 天津大学 A kind of generation method from structured text to iamge description
WO2018094296A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Sentinel long short-term memory
US20180143966A1 (en) * 2016-11-18 2018-05-24 Salesforce.Com, Inc. Spatial Attention Model for Image Captioning
CN107918782A (en) * 2016-12-29 2018-04-17 中国科学院计算技术研究所 A kind of method and system for the natural language for generating description picture material
CN108052512A (en) * 2017-11-03 2018-05-18 同济大学 A kind of iamge description generation method based on depth attention mechanism

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
THÉODORE BLUCHE: "Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention", 《2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR)》 *
ZHANG JUNYANG ET AL.: "A Survey of Research on Deep Learning", Application Research of Computers *
ZHANG LINLIN ET AL.: "An Image Classification Method Based on Convolutional Neural Networks", Fujian Computer *
YANG NAN ET AL.: "Research on Image Description Based on Deep Learning", Infrared and Laser Engineering *
LIN JIE ET AL.: "Image Recognition Processing Based on Deep Learning", Network Security Technology & Application *
LIANG RUI ET AL.: "A Deep Video Natural Language Description Method Based on Multi-Feature Fusion", Journal of Computer Applications *
TANG PENGJIE ET AL.: "Image Description with Layer-wise Multi-objective Optimization of LSTM and Multi-layer Probability Fusion", Acta Automatica Sinica *
CHEN HONGJUN ET AL.: "Image Description Method and Optimization Based on CNN-RNN Deep Learning", Natural Science Journal of Xiangtan University *
MA LONGLONG ET AL.: "A Survey of Text Description Methods for Images", Journal of Chinese Information Processing *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597326A (en) * 2019-02-21 2020-08-28 北京京东尚科信息技术有限公司 Method and device for generating commodity description text
CN111597326B (en) * 2019-02-21 2024-03-05 北京汇钧科技有限公司 Method and device for generating commodity description text
WO2020186484A1 (en) * 2019-03-20 2020-09-24 深圳大学 Automatic image description generation method and system, electronic device, and storage medium
CN109947526A (en) * 2019-03-29 2019-06-28 北京百度网讯科技有限公司 Method and apparatus for output information
CN109947526B (en) * 2019-03-29 2023-04-11 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN110288665A (en) * 2019-05-13 2019-09-27 中国科学院西安光学精密机械研究所 Image Description Methods, computer readable storage medium based on convolutional neural networks, electronic equipment
CN110288535A (en) * 2019-05-14 2019-09-27 北京邮电大学 A kind of image rain removing method and device
WO2021008145A1 (en) * 2019-07-12 2021-01-21 北京京东尚科信息技术有限公司 Image paragraph description generating method and apparatus, medium and electronic device
CN110929587A (en) * 2019-10-30 2020-03-27 杭州电子科技大学 Bidirectional reconstruction network video description method based on hierarchical attention mechanism
CN116091363A (en) * 2023-04-03 2023-05-09 南京信息工程大学 Handwriting Chinese character image restoration method and system
CN116453120A (en) * 2023-04-19 2023-07-18 浪潮智慧科技有限公司 Image description method, device and medium based on time sequence scene graph attention mechanism
CN116453120B (en) * 2023-04-19 2024-04-05 浪潮智慧科技有限公司 Image description method, device and medium based on time sequence scene graph attention mechanism

Similar Documents

Publication Publication Date Title
CN108898639A (en) An image description method and system
CN113674140B (en) Physical countermeasure sample generation method and system
CN113343705B (en) Text semantic based detail preservation image generation method and system
CN108228686A (en) It is used to implement the matched method, apparatus of picture and text and electronic equipment
CN109087258A (en) A kind of image rain removing method and device based on deep learning
CN107729987A (en) Automatic description method for night-vision images based on deep convolutional recurrent neural networks
CN109255772A (en) License plate image generation method, device, equipment and medium based on Style Transfer
CN109977199A (en) A reading comprehension method based on an attention pooling mechanism
CN110544218A (en) Image processing method, device and storage medium
CN114511576B (en) Image segmentation method and system of scale self-adaptive feature enhanced deep neural network
CN114596566B (en) Text recognition method and related device
KR20230152741A (en) Multi-modal few-shot learning using fixed language models
CN116704079B (en) Image generation method, device, equipment and storage medium
Jiang et al. Language-guided global image editing via cross-modal cyclic mechanism
CN109300128A (en) The transfer learning image processing method of structure is implied based on convolutional Neural net
CN117522697A (en) Face image generation method, face image generation system and model training method
CN117576264B (en) Image generation method, device, equipment and medium
CN110969137A (en) Household image description generation method, device and system and storage medium
CN113962192B (en) Method and device for generating Chinese character font generation model and Chinese character font generation method and device
CN116958766B (en) Image processing method and computer readable storage medium
CN117392293A (en) Image processing method, device, electronic equipment and storage medium
CN117034951A (en) Digital person with specific language style based on large language model
CN110097615B (en) Stylized and de-stylized artistic word editing method and system
CN110866866A (en) Image color-matching processing method and device, electronic device and storage medium
CN116402067A (en) Cross-language self-supervision generation method for multi-language character style retention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20230915