CN108898639A - A kind of Image Description Methods and system - Google Patents
- Publication number
- CN108898639A (application CN201810537627.9A)
- Authority
- CN
- China
- Prior art keywords
- training
- text
- picture
- test
- trained
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an image description method and system. The method first extracts image features with the convolutional neural network VGGNet provided by Google, thereby encoding the image. An attention mechanism then assigns weights to the feature sub-maps to find the key feature sub-maps that matter most to the text description. Finally, an LSTM is used, with key text obtained from the LSTM hidden layer serving as weights, to decode the combination of the text and the attention-processed image, thereby generating a description of the image. The invention effectively improves the accuracy of image description.
Description
Technical field
The present invention relates to the field of image description, and in particular to an image description method and system.
Background art
With the development of deep learning, image description technology based on deep learning has matured. Image description is the task of describing the content of an image with a correctly formed sentence. Image description methods integrate several specialized techniques, including deep learning, pattern recognition, digital image processing, and natural language processing. Image description hinges on two key points: (1) extracting the image features; (2) generating natural language. Deep learning automates image feature extraction and recognition, greatly improving the accuracy of object and scene recognition, and its language models make the predicted sentences more fluent and correct. The design of the deep-learning network used in an image description method directly affects the quality of the resulting descriptions, so designing a suitable network architecture is one of the key tasks in improving description accuracy. Traditional image understanding methods widely use static image features extracted directly from high-level convolutional layers, but this approach has a latent defect: it easily loses rich and important image information, which ultimately reduces description accuracy.
Summary of the invention
In view of the above problems, the present invention provides an image description method and system.
To achieve this object, the present invention provides the following schemes:
An image description method, the method comprising:
obtaining a training set, the training set comprising a set of training pictures and a training text describing each training picture in the set;
obtaining a feature-picture training set from the training pictures;
determining the attention weights of an attention mechanism model;
obtaining a key-feature-picture training set from the feature-picture training set using the attention weights;
taking the key-feature-picture training set and the training text as the input of a long short-term memory (LSTM) model, and obtaining the output of the LSTM model, the output being the key training text;
training a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model;
obtaining a test set, the test set comprising a set of test pictures and a test text;
obtaining a feature-picture test set from the test pictures;
obtaining a key-feature-picture test set from the feature-picture test set using the attention weights;
obtaining a key test text from the key-feature-picture test set, the test text, and the LSTM model;
obtaining, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
Optionally, obtaining the feature-picture training set from the training pictures specifically comprises:
training a first convolutional neural network model with the training pictures to obtain a trained first convolutional neural network model;
obtaining the output of the trained first convolutional neural network model, which is the initial-feature-picture training set;
training a second convolutional neural network model with the initial-feature-picture training set to obtain a trained second convolutional neural network model;
obtaining the output of the trained second convolutional neural network model, which is the feature-picture training set.
Optionally, obtaining the output of the trained first convolutional neural network model specifically comprises:
cropping each training picture to obtain the cropped training pictures;
extracting the initial features of each training picture with the convolutional neural network model to obtain the initial-feature-picture training set.
Optionally, obtaining the output of the trained second convolutional neural network model specifically comprises:
performing a convolution operation on each feature picture in the initial-feature-picture training set with the convolutional layer of the second convolutional neural network model to obtain a convolution-feature-picture training set;
resizing each convolution feature picture in the convolution-feature-picture training set to the size of the corresponding training picture;
concatenating the resized convolution feature pictures with the training pictures to obtain the feature-picture training set.
Optionally, determining the attention weights of the attention mechanism model specifically comprises:
using the initial output and, during iteration, the output of the LSTM model as the weights of the feature training pictures, and thereby determining the attention weights, so that each word in the description corresponds to the required key feature sub-map.
Optionally, obtaining the key test text from the key-feature-picture test set, the test text, and the LSTM model specifically comprises:
passing the output of the LSTM model through a fully connected operation and a scaling transform to serve as the weights of the test text, thereby obtaining the key test text.
Optionally, training the neural network model from the key training text and the key-feature-picture training set to obtain the decoding model specifically comprises:
superimposing the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
computing the loss between the training superimposed text and the training text;
adjusting, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
An image description system, the system comprising:
a first obtaining module, configured to obtain a training set comprising a set of training pictures and a training text describing each training picture in the set;
a feature-picture training set determining module, configured to obtain a feature-picture training set from the training pictures;
an attention weight determining module, configured to determine the attention weights of an attention mechanism model;
a key-feature-picture training set determining module, configured to obtain a key-feature-picture training set from the feature-picture training set using the attention weights;
an output obtaining module, configured to take the key-feature-picture training set and the training text as the input of an LSTM model and obtain its output, the output being the key training text;
a training module, configured to train a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model;
a second obtaining module, configured to obtain a test set comprising a set of test pictures and a test text;
a feature-picture test set determining module, configured to obtain a feature-picture test set from the test pictures;
a key-feature-picture test set determining module, configured to obtain a key-feature-picture test set from the feature-picture test set using the attention weights;
a key test text obtaining module, configured to obtain a key test text from the key-feature-picture test set, the test text, and the LSTM model;
a text description determining module, configured to obtain, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
Optionally, the training module comprises:
a superimposing unit, configured to superimpose the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
a loss obtaining unit, configured to compute the loss between the training superimposed text and the training text;
a training unit, configured to adjust, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
Compared with the prior art, the present invention has the following technical effects. The invention first extracts image features with the convolutional neural network VGGNet provided by Google, i.e., encodes the image; it then uses an attention mechanism to assign weights to the feature sub-maps and find the key feature sub-maps that matter most to the text description; finally, it uses an LSTM, with the key text obtained from the LSTM hidden layer serving as weights, to decode the combination of the text and the attention-processed image, thereby generating a description of the image. The invention effectively improves the accuracy of image description.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the image description method of an embodiment of the present invention;
Fig. 2 is a structural block diagram of the image description system of an embodiment of the present invention.
Specific embodiments
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the protection scope of the invention.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of the image description method of an embodiment of the present invention. As shown in Fig. 1, the image description method comprises the following steps:
Step 101: Obtain a training set. The training set comprises a set of training pictures and a training text describing each training picture in the set.
Step 102: Obtain a feature-picture training set from the training pictures.
Specifically, train a first convolutional neural network model with the training pictures to obtain a trained first convolutional neural network model, and obtain its output, which is the initial-feature-picture training set: crop each training picture, then extract the initial features of each cropped picture with the convolutional neural network model to obtain the initial-feature-picture training set.
Then train a second convolutional neural network model with the initial-feature-picture training set to obtain a trained second convolutional neural network model, and obtain its output, which is the feature-picture training set: perform a convolution operation on each feature picture in the initial-feature-picture training set with the convolutional layer of the second convolutional neural network model to obtain a convolution-feature-picture training set; resize each convolution feature picture to the size of the corresponding training picture; and concatenate the resized convolution feature pictures with the training pictures to obtain the feature-picture training set.
Step 103: Determine the attention weights of the attention mechanism model. Use the initial output and, during iteration, the output of the LSTM model as the weights of the feature training pictures, and thereby determine the attention weights, so that each word in the description corresponds to the required key feature sub-map.
Step 104: Obtain a key-feature-picture training set from the feature-picture training set using the attention weights.
Step 105: Take the key-feature-picture training set and the training text as the input of the LSTM model and obtain its output, which is the key training text.
Step 106: Train a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model.
Specifically, superimpose the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text; compute the loss between the training superimposed text and the training text; and adjust, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
Step 107: Obtain a test set. The test set comprises a set of test pictures and a test text.
Step 108: Obtain a feature-picture test set from the test pictures.
Step 109: Obtain a key-feature-picture test set from the feature-picture test set using the attention weights.
Step 110: Obtain a key test text from the key-feature-picture test set, the test text, and the LSTM model: pass the output of the LSTM model through a fully connected operation and a scaling transform to serve as the weights of the test text, thereby obtaining the key test text.
Step 111: Obtain, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects. The invention first extracts image features with the convolutional neural network VGGNet provided by Google, i.e., encodes the image; it then uses an attention mechanism to assign weights to the annotation text and find the key text that is important to the description of the features; finally, it uses an LSTM to decode the attention-processed image, thereby generating the description of the image. After extracting the image features with VGGNet, the invention performs a convolution operation on the features, resizes the convolved features back to the size of the original features, concatenates them with the original features, and then performs another convolution and normalization. Concatenating the original features with the convolved features extracts deeper semantic features without losing important image information, so the invention effectively improves the accuracy of image description.
Specific implementation
I. Preparation and preprocessing:
Step 1: Prepare the COCO training set and COCO test set files produced by Microsoft and place them under the project directory. COCO is a large dataset for image recognition, image segmentation, and image description. The invention uses the pre-trained deep convolutional neural network VGGNet parameter model provided by Google. The VGGNet structure comprises five convolution-pooling blocks: the first and second blocks each contain two convolutional layers followed by a pooling layer, the third and fourth blocks each contain four convolutional layers followed by a pooling layer, and the last block contains three convolutional layers.
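As a concrete illustration, a minimal sketch of this feature extraction in TensorFlow/Keras is shown below. It uses the Keras VGG19 weights as a stand-in for the patent's VGGNet parameter model, and assumes the layer name block5_conv3 as the third convolutional layer of the fifth block; these names and the example input are illustrative, not taken from the patent.

```python
import tensorflow as tf

# Load a pre-trained VGG network as a stand-in for the patent's VGGNet model.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")

# Expose the third convolutional layer of the fifth block as the feature output
# (the feature sub-maps the patent later calls "features", before normalization).
feature_extractor = tf.keras.Model(
    inputs=vgg.input,
    outputs=vgg.get_layer("block5_conv3").output)

# images: a batch of cropped 224x224 RGB pictures, shape (batch, 224, 224, 3).
images = tf.random.uniform((4, 224, 224, 3))
features = feature_extractor(tf.keras.applications.vgg19.preprocess_input(images))
print(features.shape)  # (4, 14, 14, 512) feature sub-maps
```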
Step 2: Preprocess the training set and test set to prepare for subsequent operations. The input is the training set pictures with their descriptions and the test set pictures.
Step 2.1: Uniformly crop the training set pictures and test set pictures to a size of 224*224.
Step 2.2: The training set contains pictures and descriptions of those pictures. Split every description sentence into words, drop duplicate words, and assign a serial number to every word to obtain the "dictionary" word_to_idx.
Step 2.3: Output the preprocessed pictures and the "dictionary" word_to_idx.
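A minimal sketch of step 2.2, assuming plain whitespace tokenization and two special tokens that the patent does not mention; the example captions are illustrative.

```python
def build_vocab(captions):
    """Build the word_to_idx 'dictionary' from a list of caption strings."""
    word_to_idx = {"<START>": 0, "<END>": 1}   # special tokens are an assumption
    for caption in captions:
        for word in caption.lower().split():   # split each sentence into words
            if word not in word_to_idx:        # drop duplicate words
                word_to_idx[word] = len(word_to_idx)  # assign a serial number
    return word_to_idx

captions = ["a dog runs on the grass", "a cat sits on the sofa"]
word_to_idx = build_vocab(captions)
print(word_to_idx["dog"], len(word_to_idx))
```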
II. Training steps:
Step 3: Feed the cropped training set pictures into the existing VGGNet as input, and output the feature sub-maps of the third convolutional layer in the fifth block of VGGNet. VGGNet pre-extracts the image features and pre-encodes the image, i.e., transforms the image into the required vector space.
Step 4: Apply batch normalization to the VGGNet output. The normalized image features are denoted features.
Step 5: The user initializes the parameters of the network structure. The initialization parameters include: the training samples are input in batches, with the number of samples per batch denoted batch_size; the number of epochs needed to train all samples, epoch; the picture feature dimension dim_features; the word embedding dimension dim_embedding; the number of LSTM iterations n_time_step; the LSTM hidden state dimension dim_hidden; the doubly stochastic regularization coefficient alpha_c; the learning rate learning_rate; the network optimizer update_rule; the number of iterations between printed results print_every; the training set path image_path; the model save path model_path; and the test model load path test_path.
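A sketch of such an initialization; the patent names these parameters but not their values, so the values and paths below are illustrative assumptions.

```python
config = {
    "batch_size": 128,          # samples per training batch
    "epoch": 20,                # passes over all training samples
    "dim_features": 512,        # picture feature (channel) dimension
    "dim_embedding": 512,       # word embedding dimension
    "n_time_step": 16,          # number of LSTM iterations
    "dim_hidden": 1024,         # LSTM hidden state dimension
    "alpha_c": 1.0,             # doubly stochastic regularization coefficient
    "learning_rate": 0.001,
    "update_rule": "adam",      # network optimizer
    "print_every": 100,         # iterations between printed results
    "image_path": "./data/train/",
    "model_path": "./model/",
    "test_path": "./model/",
}
```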
Step 6: The whole network composed of VGGNet, the attention mechanism, and the LSTM is trained with the following loss function:

Loss = -log P(y | x) + λ Σ_{i=1..L} (1 - Σ_{t=1..C} α_{t,i})²

where -log(P(y | x)) is the negative log-likelihood loss, computed with cross entropy; x represents the true description and y represents the predicted description, i.e., out_logits. The second term is the penalty term expressed with the soft weight vectors α, where λ is the control parameter of the penalty term, C is the number of α vectors in one description, and L is the length of each α vector.
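A sketch of this loss in TensorFlow, assuming logits of shape (batch, n_time_step, vocab_size), integer targets, and attention weights alphas of shape (batch, C, L) collected over the C decoding steps; alpha_c plays the role of λ. The shapes and reduction are assumptions consistent with the doubly stochastic penalty described above.

```python
import tensorflow as tf

def captioning_loss(logits, targets, alphas, alpha_c=1.0):
    # Negative log-likelihood of the true description, computed with cross entropy.
    nll = tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=targets, logits=logits))
    # Doubly stochastic penalty: over the C steps of one description, the
    # attention paid to each of the L feature positions should sum to about 1.
    attended_per_position = tf.reduce_sum(alphas, axis=1)        # (batch, L)
    penalty = alpha_c * tf.reduce_sum((1.0 - attended_per_position) ** 2)
    return nll + penalty
```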
Step 7: In the newly constructed network structure, take features as input. Feed features into one convolutional layer to obtain the output conv, which captures deeper image semantic information. Upsample (resize) conv back to the original size to obtain image_resize, so that the shallower features can be fused with the deeper features; the resize operation adds no training parameters. Then concatenate features and image_resize, feed the result into another convolutional layer to convert it back to the original feature shape, and finally apply batch normalization. The feature sub-maps output by this network are denoted features_concat.
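A minimal sketch of this fusion sub-network in TensorFlow/Keras. The kernel sizes, stride, and channel counts are illustrative assumptions; the patent only specifies the sequence convolution, resize, concatenation, convolution, batch normalization.

```python
import tensorflow as tf

def fuse_features(features):
    """features: VGG feature sub-maps, shape (batch, H, W, C)."""
    h, w, c = features.shape[1], features.shape[2], features.shape[3]
    # One convolutional layer for deeper semantic information (conv).
    conv = tf.keras.layers.Conv2D(c, 3, strides=2, padding="same",
                                  activation="relu")(features)
    # Upsample back to the original size; resizing adds no trainable parameters.
    image_resize = tf.image.resize(conv, (h, w))
    # Concatenate shallow and deep features, then convert back to the
    # original feature shape with another convolution.
    concat = tf.concat([features, image_resize], axis=-1)
    fused = tf.keras.layers.Conv2D(c, 1, padding="same")(concat)
    # Final batch normalization; the result is features_concat.
    return tf.keras.layers.BatchNormalization()(fused)
```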
Step 8: Provide initial values for the LSTM network. Sum and average all pixels of each feature map in features, then assign the averaged feature sub-maps to the initial cell state c0 and the initial hidden state h0 of the LSTM cell, respectively.
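A sketch of this initialization. The patent assigns the averaged features directly; the dense projections below are an assumption, included only in case the feature channel count differs from dim_hidden.

```python
import tensorflow as tf

def init_lstm_state(features, dim_hidden):
    """features: (batch, H, W, C). Returns (c0, h0) for the LSTM cell."""
    # Sum and average all pixels of each feature map.
    mean_feature = tf.reduce_mean(features, axis=[1, 2])   # (batch, C)
    # Assign the averaged features to the initial cell and hidden states.
    c0 = tf.keras.layers.Dense(dim_hidden)(mean_feature)
    h0 = tf.keras.layers.Dense(dim_hidden)(mean_feature)
    return c0, h0
```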
Step 9: Transform the word serial numbers in the "dictionary" into the corresponding vector space. The input is the "dictionary" word_to_idx. Apply a word embedding operation, word by word, to the picture descriptions in the training set, and output the word vectors xt of all words in each description (where t indexes the words of one description).
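A sketch of this embedding step, assuming a Keras Embedding layer and the word_to_idx dictionary from step 2.2; the embedding dimension 512 is the illustrative dim_embedding from step 5.

```python
import tensorflow as tf

# Embedding table: one dim_embedding-sized vector per dictionary entry.
embedding = tf.keras.layers.Embedding(input_dim=len(word_to_idx),
                                      output_dim=512)  # dim_embedding

# A description, converted word by word to serial numbers, then to vectors.
caption = "a dog runs on the grass".split()
ids = tf.constant([[word_to_idx[w] for w in caption]])
xt = embedding(ids)            # word vectors, shape (1, t, 512)
```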
Step 10: Feed features_concat and the LSTM hidden state ht of each iteration into the attention mechanism, so as to extract image features dynamically.
Step 10.1: Feed the channels of features_concat into one fully connected layer; the output is denoted features_proj, and its channel count matches that of features_concat.
Step 10.2: Pass ht through one fully connected layer to obtain the vector ht' whose channel count matches features_proj, and expand it by one dimension to obtain ht''. Pass ht'' through the tanh activation function and multiply it element-wise with features_concat to obtain features_h.
Step 10.3: Average features_h over its last channel dimension to obtain the two-dimensional soft weight vector, denoted alpha. Apply softmax to alpha, multiply it with each feature sub-map of features_concat, and sum all pixels of each sub-map into one pixel to obtain the attention target position of the picture. Pass ht through a fully connected layer to obtain a one-dimensional weight and multiply it with the attention target position of the picture to obtain the context vector, denoted context.
Step 10.4: The output of the attention mechanism is context.
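A minimal sketch of steps 10.1-10.4 in TensorFlow. The description is ambiguous about whether step 10.2's product uses features_concat or the projection from step 10.1; this sketch routes features_proj into the product so that step 10.1's output participates, which is one plausible reading. All shapes are assumptions.

```python
import tensorflow as tf

def attention(features_concat, ht):
    """features_concat: (batch, L, C) flattened feature sub-maps; ht: (batch, H)."""
    C = features_concat.shape[-1]
    # 10.1: fully connected layer over the channels, same channel count.
    features_proj = tf.keras.layers.Dense(C)(features_concat)   # (batch, L, C)
    # 10.2: project ht to the same channel count, expand one dimension,
    # apply tanh, and multiply element-wise with the projected features.
    ht_p = tf.keras.layers.Dense(C)(ht)                          # ht'
    ht_pp = tf.expand_dims(ht_p, 1)                              # ht'' (batch, 1, C)
    features_h = tf.nn.tanh(ht_pp) * features_proj               # (batch, L, C)
    # 10.3: average over the channel dimension -> soft weights alpha; softmax,
    # weight each sub-map, and sum all pixels into one attended vector.
    alpha = tf.nn.softmax(tf.reduce_mean(features_h, axis=-1))   # (batch, L)
    attended = tf.reduce_sum(
        features_concat * tf.expand_dims(alpha, -1), axis=1)     # (batch, C)
    # Scale by a one-dimensional weight derived from ht -> context (10.4).
    scale = tf.keras.layers.Dense(1)(ht)                         # (batch, 1)
    context = scale * attended
    return context, alpha
```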
Step 11: Using the LSTM model, concatenate the context vector context obtained at each iteration with the corresponding word vector xt of the description for that iteration, feed the concatenation into the LSTM cell, and output the LSTM hidden state ht+1. Through its remember and forget operations, the LSTM cell memorizes and forgets the currently available information and the previously recalled information.
Step 12: Perform the final decoding operation on the LSTM output. The inputs are ht+1, context, and xt. Apply dropout to the ht+1 from step 11, then pass it through one fully connected layer to obtain ht+1' with the same length as a word vector. Pass context through one fully connected layer to obtain context' with the same length as a word vector. Pass ht through a fully connected layer to obtain a one-dimensional weight and multiply it with each dimension of the xt from step 9 to obtain the new xt'. Add ht+1', context', and xt' together, apply the tanh activation function, apply dropout again, and finally apply one more fully connected layer to obtain an output of dictionary-vector length, denoted out_logits.
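A sketch of this decoding step; the dropout rate and dimensions are assumptions, and the layers are created inline for brevity (a real implementation would create them once and reuse them across time steps).

```python
import tensorflow as tf

def decode_step(ht_next, ht, context, xt, dim_embedding, vocab_size):
    """ht_next: (batch, H) new hidden state; xt: (batch, E) word vector."""
    # Dropout on the new hidden state, then project to word-vector length.
    h_drop = tf.keras.layers.Dropout(0.5)(ht_next)
    h_proj = tf.keras.layers.Dense(dim_embedding)(h_drop)       # ht+1'
    # Project the context vector to word-vector length.
    ctx_proj = tf.keras.layers.Dense(dim_embedding)(context)    # context'
    # A one-dimensional weight from ht scales every dimension of xt.
    w = tf.keras.layers.Dense(1)(ht)                            # (batch, 1)
    xt_new = w * xt                                             # xt'
    # Sum, tanh, dropout, and a final fully connected layer of dictionary size.
    out = tf.nn.tanh(h_proj + ctx_proj + xt_new)
    out = tf.keras.layers.Dropout(0.5)(out)
    out_logits = tf.keras.layers.Dense(vocab_size)(out)
    return out_logits
```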
Step 13: Save all training parameters after training finishes.
III. Testing steps:
Step 14: Feed the cropped test set pictures into the existing VGGNet as input, and output the feature sub-maps of the third convolutional layer in the fifth block of VGGNet. VGGNet pre-extracts the image features and pre-encodes the image, transforming it into the required vector space.
Step 15: Apply batch normalization to the VGGNet output. The normalized image features are denoted features.
Step 16: Load the previously stored trained network parameters.
Step 17: In the newly constructed network structure, take features as input. Feed features into one convolutional layer to obtain the output conv, which captures deeper image semantic information. Upsample (resize) conv back to the original size to obtain image_resize, so that the shallower features can be fused with the deeper features; the resize operation adds no training parameters. Then concatenate features and image_resize, feed the result into another convolutional layer to convert it back to the original feature shape, and finally apply batch normalization. The feature sub-maps output by this network are denoted features_concat.
Step 18: Provide initial values for the LSTM network. Sum and average all pixels of each feature map in features, then assign the averaged feature sub-maps to the initial cell state c0 and the initial hidden state h0 of the LSTM cell, respectively.
Step 19: Take the word sampled_word predicted in the previous iteration as input, apply the word embedding operation, and denote the output xt.
Step 20: Feed features_concat and the LSTM hidden state ht of each iteration into the attention mechanism, so as to extract image features dynamically.
Step 20.1: Feed the channels of features_concat into one fully connected layer; the output is denoted features_proj, and its channel count matches that of features_concat.
Step 20.2: Pass ht through one fully connected layer to obtain the vector ht' whose channel count matches features_proj, and expand it by one dimension to obtain ht''. Pass ht'' through the tanh activation function and multiply it element-wise with features_concat to obtain features_h.
Step 20.3: Average features_h over its last channel dimension to obtain the two-dimensional soft weight vector, denoted alpha. Apply softmax to alpha, multiply it with each feature sub-map of features_concat, and sum all pixels of each sub-map into one pixel to obtain the attention target position of the picture. Pass ht through a fully connected layer to obtain a one-dimensional weight and multiply it with the attention target position of the picture to obtain the context vector, denoted context.
Step 20.4: The output of the attention mechanism is context.
Step 21: Using the LSTM model, concatenate the context vector context obtained at each iteration with the corresponding word vector xt for that iteration, feed the concatenation into the LSTM cell, and output the LSTM hidden state ht+1. Through its remember and forget operations, the LSTM cell memorizes and forgets the currently available information and the previously recalled information.
Step 22: Perform the final decoding operation on the LSTM output. The inputs are ht+1, context, and xt. Apply dropout to the ht+1 from step 21, then pass it through one fully connected layer to obtain ht+1' with the same length as a word vector. Pass context through one fully connected layer to obtain context' with the same length as a word vector. Pass ht through a fully connected layer to obtain a one-dimensional weight and multiply it with each dimension of the xt from step 19 to obtain the new xt'. Add ht+1', context', and xt' together, apply the tanh activation function, apply dropout again, and finally apply one more fully connected layer to obtain an output of dictionary-vector length, denoted out_logits.
Step 23: Apply softmax to out_logits and select the word serial number with the highest probability, denoted sampled_word; then translate it into the corresponding English word with the dictionary built earlier.
Step 24: After all iterations are complete, print the words to the display in order.
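A sketch of the greedy decoding loop of steps 19-24, assembled from the sketches above. It assumes a Keras LSTMCell as lstm_cell, the attention and decode_step functions sketched earlier, an Embedding layer, and an inverse dictionary idx_to_word; the start token and batch handling are assumptions, and layers are again created inline rather than reused.

```python
import tensorflow as tf

def generate_caption(features_concat, c0, h0, embedding, lstm_cell,
                     idx_to_word, n_time_step, start_idx=0):
    """Greedily decode a caption for one batch of feature sub-maps."""
    c, h = c0, h0
    sampled_word = tf.fill([features_concat.shape[0]], start_idx)
    words = []
    for _ in range(n_time_step):
        xt = embedding(sampled_word)                    # step 19: embed last word
        context, _ = attention(features_concat, h)      # step 20: attend
        inputs = tf.concat([context, xt], axis=-1)      # step 21: LSTM step
        _, (h_next, c) = lstm_cell(inputs, [h, c])
        out_logits = decode_step(h_next, h, context, xt,
                                 xt.shape[-1], len(idx_to_word))  # step 22
        sampled_word = tf.argmax(out_logits, axis=-1)   # step 23: most probable
        words.append(idx_to_word[int(sampled_word[0])])
        h = h_next
    return " ".join(words)                              # step 24: print in order
```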
Fig. 2 is a structural block diagram of the image description system of an embodiment of the present invention. As shown in Fig. 2, the image description system comprises:
a first obtaining module 201, configured to obtain a training set comprising a set of training pictures and a training text describing each training picture in the set;
a feature-picture training set determining module 202, configured to obtain a feature-picture training set from the training pictures;
an attention weight determining module 203, configured to determine the attention weights of an attention mechanism model;
a key-feature-picture training set determining module 204, configured to obtain a key-feature-picture training set from the feature-picture training set using the attention weights;
an output obtaining module 205, configured to take the key-feature-picture training set and the training text as the input of an LSTM model and obtain its output, the output being the key training text;
a training module 206, configured to train a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model.
The training module 206 specifically comprises:
a superimposing unit, configured to superimpose the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
a loss obtaining unit, configured to compute the loss between the training superimposed text and the training text;
a training unit, configured to adjust, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
The system further comprises:
a second obtaining module 207, configured to obtain a test set comprising a set of test pictures and a test text;
a feature-picture test set determining module 208, configured to obtain a feature-picture test set from the test pictures;
a key-feature-picture test set determining module 209, configured to obtain a key-feature-picture test set from the feature-picture test set using the attention weights;
a key test text obtaining module 210, configured to obtain a key test text from the key-feature-picture test set, the test text, and the LSTM model;
a text description determining module 211, configured to obtain, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to each other. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts can be found in the description of the method.
Specific examples are used herein to explain the principle and implementation of the invention. The above embodiments are only intended to help understand the method of the invention and its core idea. Meanwhile, those skilled in the art may make changes to the specific implementation and application scope in accordance with the idea of the invention. In conclusion, the content of this specification shall not be construed as limiting the invention.
Claims (9)
1. An image description method, characterized in that the method comprises:
obtaining a training set, the training set comprising a set of training pictures and a training text describing each training picture in the set;
obtaining a feature-picture training set from the training pictures;
determining the attention weights of an attention mechanism model;
obtaining a key-feature-picture training set from the feature-picture training set using the attention weights;
taking the key-feature-picture training set and the training text as the input of a long short-term memory (LSTM) model, and obtaining the output of the LSTM model, the output being the key training text;
training a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model;
obtaining a test set, the test set comprising a set of test pictures and a test text;
obtaining a feature-picture test set from the test pictures;
obtaining a key-feature-picture test set from the feature-picture test set using the attention weights;
obtaining a key test text from the key-feature-picture test set, the test text, and the LSTM model;
obtaining, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
2. The method according to claim 1, characterized in that obtaining the feature-picture training set from the training pictures specifically comprises:
training a first convolutional neural network model with the training pictures to obtain a trained first convolutional neural network model;
obtaining the output of the trained first convolutional neural network model, which is the initial-feature-picture training set;
training a second convolutional neural network model with the initial-feature-picture training set to obtain a trained second convolutional neural network model;
obtaining the output of the trained second convolutional neural network model, which is the feature-picture training set.
3. The method according to claim 2, characterized in that obtaining the output of the trained first convolutional neural network model specifically comprises:
cropping each training picture to obtain the cropped training pictures;
extracting the initial features of each training picture with the convolutional neural network model to obtain the initial-feature-picture training set.
4. The method according to claim 2, characterized in that obtaining the output of the trained second convolutional neural network model specifically comprises:
performing a convolution operation on each feature picture in the initial-feature-picture training set with the convolutional layer of the second convolutional neural network model to obtain a convolution-feature-picture training set;
resizing each convolution feature picture in the convolution-feature-picture training set to the size of the corresponding training picture;
concatenating the resized convolution feature pictures with the training pictures to obtain the feature-picture training set.
5. The method according to claim 1, characterized in that determining the attention weights of the attention mechanism model specifically comprises:
using the initial output and, during iteration, the output of the LSTM model as the weights of the feature training pictures, and thereby determining the attention weights, so that each word in the description corresponds to the required key feature sub-map.
6. The method according to claim 1, characterized in that obtaining the key test text from the key-feature-picture test set, the test text, and the LSTM model specifically comprises:
passing the output of the LSTM model through a fully connected operation and a scaling transform to serve as the weights of the test text, thereby obtaining the key test text.
7. The method according to claim 4, characterized in that training the neural network model from the key training text and the key-feature-picture training set to obtain the decoding model specifically comprises:
superimposing the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
computing the loss between the training superimposed text and the training text;
adjusting, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
8. An image description system, characterized in that the system comprises:
a first obtaining module, configured to obtain a training set comprising a set of training pictures and a training text describing each training picture in the set;
a feature-picture training set determining module, configured to obtain a feature-picture training set from the training pictures;
an attention weight determining module, configured to determine the attention weights of an attention mechanism model;
a key-feature-picture training set determining module, configured to obtain a key-feature-picture training set from the feature-picture training set using the attention weights;
an output obtaining module, configured to take the key-feature-picture training set and the training text as the input of an LSTM model and obtain its output, the output being the key training text;
a training module, configured to train a neural network model from the key training text and the key-feature-picture training set to obtain a decoding model;
a second obtaining module, configured to obtain a test set comprising a set of test pictures and a test text;
a feature-picture test set determining module, configured to obtain a feature-picture test set from the test pictures;
a key-feature-picture test set determining module, configured to obtain a key-feature-picture test set from the feature-picture test set using the attention weights;
a key test text obtaining module, configured to obtain a key test text from the key-feature-picture test set, the test text, and the LSTM model;
a text description determining module, configured to obtain, from the key test text and the key-feature-picture training set, through the decoding model, a text description of each test picture in the test picture set.
9. The system according to claim 8, characterized in that the training module comprises:
a superimposing unit, configured to superimpose the key-feature-picture training set and the key training text with the output of the neural network model to obtain a training superimposed text;
a loss obtaining unit, configured to compute the loss between the training superimposed text and the training text;
a training unit, configured to adjust, according to the loss, the parameters of the second convolutional neural network model, the attention mechanism model, the LSTM model, and the neural network model until the error between the training superimposed text and the training text falls within an error threshold, thereby obtaining the decoding model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810537627.9A CN108898639A (en) | 2018-05-30 | 2018-05-30 | A kind of Image Description Methods and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810537627.9A CN108898639A (en) | 2018-05-30 | 2018-05-30 | A kind of Image Description Methods and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108898639A true CN108898639A (en) | 2018-11-27 |
Family
ID=64344019
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810537627.9A Pending CN108898639A (en) | 2018-05-30 | 2018-05-30 | A kind of Image Description Methods and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108898639A (en) |
- 2018-05-30: Application CN201810537627.9A filed in China; published as patent CN108898639A (status: Pending)
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120033874A1 (en) * | 2010-08-05 | 2012-02-09 | Xerox Corporation | Learning weights of fonts for typed samples in handwritten keyword spotting |
US20120163707A1 (en) * | 2010-12-28 | 2012-06-28 | Microsoft Corporation | Matching text to images |
WO2017151757A1 (en) * | 2016-03-01 | 2017-09-08 | The United States Of America, As Represented By The Secretary, Department Of Health And Human Services | Recurrent neural feedback model for automated image annotation |
CN105938485A (en) * | 2016-04-14 | 2016-09-14 | 北京工业大学 | Image description method based on convolution cyclic hybrid model |
CN106503055A (en) * | 2016-09-27 | 2017-03-15 | 天津大学 | A kind of generation method from structured text to iamge description |
WO2018094296A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Sentinel long short-term memory |
US20180143966A1 (en) * | 2016-11-18 | 2018-05-24 | Salesforce.Com, Inc. | Spatial Attention Model for Image Captioning |
CN107918782A (en) * | 2016-12-29 | 2018-04-17 | 中国科学院计算技术研究所 | A kind of method and system for the natural language for generating description picture material |
CN108052512A (en) * | 2017-11-03 | 2018-05-18 | 同济大学 | A kind of iamge description generation method based on depth attention mechanism |
Non-Patent Citations (9)
Title |
---|
THÉODORE BLUCHE: "Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention", 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) *
ZHANG Junyang et al.: "A Review of Research on Deep Learning", Application Research of Computers *
ZHANG Linlin et al.: "An Image Classification Method Based on Convolutional Neural Networks", Fujian Computer *
YANG Nan et al.: "Research on Image Description Based on Deep Learning", Infrared and Laser Engineering *
LIN Jie et al.: "Image Recognition Processing Based on Deep Learning", Network Security Technology & Application *
LIANG Rui et al.: "A Deep Video Natural Language Description Method Based on Multi-Feature Fusion", Journal of Computer Applications *
TANG Pengjie et al.: "Image Captioning with Layer-Wise Multi-Objective Optimization and Multi-Layer Probability Fusion of LSTM", Acta Automatica Sinica *
CHEN Hongjun et al.: "Image Description Method and Optimization Based on CNN-RNN Deep Learning", Natural Science Journal of Xiangtan University *
MA Longlong et al.: "A Survey of Image Captioning Methods", Journal of Chinese Information Processing *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111597326A (en) * | 2019-02-21 | 2020-08-28 | 北京京东尚科信息技术有限公司 | Method and device for generating commodity description text |
CN111597326B (en) * | 2019-02-21 | 2024-03-05 | 北京汇钧科技有限公司 | Method and device for generating commodity description text |
WO2020186484A1 (en) * | 2019-03-20 | 2020-09-24 | 深圳大学 | Automatic image description generation method and system, electronic device, and storage medium |
CN109947526A (en) * | 2019-03-29 | 2019-06-28 | 北京百度网讯科技有限公司 | Method and apparatus for output information |
CN109947526B (en) * | 2019-03-29 | 2023-04-11 | 北京百度网讯科技有限公司 | Method and apparatus for outputting information |
CN110288665A (en) * | 2019-05-13 | 2019-09-27 | 中国科学院西安光学精密机械研究所 | Image Description Methods, computer readable storage medium based on convolutional neural networks, electronic equipment |
CN110288535A (en) * | 2019-05-14 | 2019-09-27 | 北京邮电大学 | A kind of image rain removing method and device |
WO2021008145A1 (en) * | 2019-07-12 | 2021-01-21 | 北京京东尚科信息技术有限公司 | Image paragraph description generating method and apparatus, medium and electronic device |
CN110929587A (en) * | 2019-10-30 | 2020-03-27 | 杭州电子科技大学 | Bidirectional reconstruction network video description method based on hierarchical attention mechanism |
CN116091363A (en) * | 2023-04-03 | 2023-05-09 | 南京信息工程大学 | Handwriting Chinese character image restoration method and system |
CN116453120A (en) * | 2023-04-19 | 2023-07-18 | 浪潮智慧科技有限公司 | Image description method, device and medium based on time sequence scene graph attention mechanism |
CN116453120B (en) * | 2023-04-19 | 2024-04-05 | 浪潮智慧科技有限公司 | Image description method, device and medium based on time sequence scene graph attention mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108898639A (en) | A kind of Image Description Methods and system | |
CN113674140B (en) | Physical countermeasure sample generation method and system | |
CN113343705B (en) | Text semantic based detail preservation image generation method and system | |
CN108228686A (en) | It is used to implement the matched method, apparatus of picture and text and electronic equipment | |
CN109087258A (en) | A kind of image rain removing method and device based on deep learning | |
CN107729987A (en) | The automatic describing method of night vision image based on depth convolution loop neutral net | |
CN109255772A (en) | License plate image generation method, device, equipment and medium based on Style Transfer | |
CN109977199A (en) | A kind of reading understanding method based on attention pond mechanism | |
CN110544218A (en) | Image processing method, device and storage medium | |
CN114511576B (en) | Image segmentation method and system of scale self-adaptive feature enhanced deep neural network | |
CN114596566B (en) | Text recognition method and related device | |
KR20230152741A (en) | Multi-modal few-shot learning using fixed language models | |
CN116704079B (en) | Image generation method, device, equipment and storage medium | |
Jiang et al. | Language-guided global image editing via cross-modal cyclic mechanism | |
CN109300128A (en) | The transfer learning image processing method of structure is implied based on convolutional Neural net | |
CN117522697A (en) | Face image generation method, face image generation system and model training method | |
CN117576264B (en) | Image generation method, device, equipment and medium | |
CN110969137A (en) | Household image description generation method, device and system and storage medium | |
CN113962192B (en) | Method and device for generating Chinese character font generation model and Chinese character font generation method and device | |
CN116958766B (en) | Image processing method and computer readable storage medium | |
CN117392293A (en) | Image processing method, device, electronic equipment and storage medium | |
CN117034951A (en) | Digital person with specific language style based on large language model | |
CN110097615B (en) | Stylized and de-stylized artistic word editing method and system | |
CN110866866A (en) | Image color-matching processing method and device, electronic device and storage medium | |
CN116402067A (en) | Cross-language self-supervision generation method for multi-language character style retention |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| AD01 | Patent right deemed abandoned | Effective date of abandoning: 20230915 |