CN110287357A - Image description generation method based on a conditional generative adversarial network - Google Patents

Image description generation method based on a conditional generative adversarial network

Info

Publication number
CN110287357A
Authority
CN
China
Prior art keywords
model
sentence
training
picture
description
Prior art date
Legal status
Granted
Application number
CN201910467500.9A
Other languages
Chinese (zh)
Other versions
CN110287357B (en)
Inventor
白琮
黄远
李宏凯
陈胜勇
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201910467500.9A priority Critical patent/CN110287357B/en
Publication of CN110287357A publication Critical patent/CN110287357A/en
Application granted
Publication of CN110287357B publication Critical patent/CN110287357B/en
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval of still image data
    • G06F 16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583: Retrieval using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Abstract

An image description generation method based on a conditional generative adversarial network, comprising the following steps. Step 1: network construction; the conditional generative adversarial network framework consists of two parts, a generative model and a discriminative model, which have a similar structure but whose parameters are trained and updated independently. Step 2: dataset preprocessing. Step 3: network training, as follows. Step 3.1: initialize the generative-model and discriminative-model parameters with random weights. Step 3.2: train the generative model. Step 3.3: train the discriminative model. Step 3.4: minimize the loss function with the RMSprop descent algorithm. Step 4: accuracy testing; through the operation of the above steps, descriptions of test pictures can be generated. The present invention provides an image description generation method based on conditional generative adversarial training that is more robust and places lower requirements on the training data.

Description

Image description generation method based on a conditional generative adversarial network
Technical field
The present invention relates to multimedia big-data processing and analysis in the field of computer vision, and in particular to a picture description generation method based on a conditional generative adversarial network, spanning the two fields of computer vision and natural language processing.
Background technique
With the development of network sharing technology, more and more pictures can be shared and received in real time on the Internet. How to make a machine understand the content represented by an image and output it as a coherent, semantically correct sentence has become a major research problem. In recent years, with the rapid development of deep learning methods, and benefiting from the accurate representation of picture content by deep features, major progress has been made in automatically generating descriptions by machine. However, these methods suffer from vanishing gradients during training and from loss of picture features within the network, so the generated descriptions still have defects in semantic richness and content accuracy, and fail to achieve good results.
Summary of the invention
In order to overcome the deficiencies of existing picture description generation techniques, namely high training-data requirements, monotonous generated descriptions, and descriptions that are not realistic enough, the present invention provides an image description generation method based on conditional generative adversarial training that is more robust and places lower requirements on the training data.
The technical solution adopted by the present invention to solve this problem is as follows:
An image description generation method based on a conditional generative adversarial network, the method comprising the following steps:
Step 1: network construction, as follows:
Step 1.1: the conditional generative adversarial network framework consists of two parts, a generative model and a discriminative model; the two models have a similar structure, but their parameters are trained and updated independently;
Step 1.2: the first layer of the generative model is an embedding layer that outputs a three-dimensional feature tensor;
Step 1.3: the embedding layer of the generative model is followed by a fully connected layer;
Step 1.4: the third layer of the generative model is a fully connected layer;
Step 1.5: the third, fully connected layer of the generative model is followed by a ReLU activation function;
Step 1.6: the generative model then contains a GLU module comprising three convolutional layers;
Step 1.7: after the three convolutional layers, the generative model passes through a fully connected layer;
Step 1.8: the generative model expands the dimension with a final fully connected layer;
Step 1.9: the output of the generative model serves as the input of the discriminative model;
Step 1.10: the first layer of the discriminative model is an embedding layer, and the input dimension is expanded;
Step 1.11: the embedding layer of the discriminative model is followed by a fully connected layer that outputs a three-dimensional feature tensor;
Step 1.12: the third layer of the discriminative model is a fully connected layer;
Step 1.13: the third, fully connected layer of the discriminative model is followed by a ReLU activation function;
Step 1.14: the discriminative model then contains a GLU module comprising three convolutional layers;
Step 1.15: after the three convolutional layers, the discriminative model passes through a fully connected layer;
Step 1.16: the discriminative model changes the output dimension with a final fully connected layer;
Step 1.17: the discriminative model feeds the computed description-sentence similarity score back into the generative model;
Step 2: dataset preprocessing;
Step 3: network training, as follows:
Step 3.1: initialize the generative-model and discriminative-model parameters with random weights;
Step 3.2: train the generative model;
Step 3.3: train the discriminative model;
Step 3.4: minimize the loss function with the RMSprop descent algorithm;
Step 4: accuracy testing, as follows:
Step 4.1: feed the preprocessed test dataset into the optimal generator model;
Step 4.2: for each given picture, the generator produces the corresponding descriptive sentence through the generative model;
Step 4.3: compare the relevance of each query picture's true descriptive sentence to the descriptive sentence returned by the generator, and score the descriptive sentences generated for all query pictures according to the evaluation criteria used in image description;
Step 4.4: verify on the test data and generate descriptive sentences for the test pictures;
Through the operation of the above steps, descriptions of the test pictures can be generated.
Further, in Step 2, the data preprocessing proceeds as follows:
Step 2.1: the dataset contains two parts, training pictures and their description sentences; features are extracted from the pictures and input into the network;
Step 2.2: picture features are extracted with a VGG model pretrained on ImageNet;
Step 2.3: the pictures are input into the adversarial network in the form of feature vectors.
Further, in Step 3.2, training the generative model proceeds as follows:
Step 3.2.1: no true sentence description is input to the generative model; instead, a random noise vector of the same dimension as a true description is input as the sentence description;
Step 3.2.2: the picture features extracted by the VGG network are fed into the generative model together with the random vector;
Step 3.2.3: the generative model passes the input sentence description through the embedding layer and a fully connected layer to obtain a sentence feature vector;
Step 3.2.4: the extracted picture features pass through a fully connected layer and an embedding layer and are converted to the same dimension as the sentence feature vector;
Step 3.2.5: the sentence features and picture features are concatenated and input jointly into the GLU module for training;
Step 3.2.6: the resulting vector passes through two fully connected layers to obtain the generated-sentence description feature vector;
Step 3.2.7: a normalized exponential function converts the similarities into word-selection probabilities, producing the descriptive sentence (a minimal softmax illustration follows this list).
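The "normalized exponential function" of Step 3.2.7 is the softmax. A minimal illustration with made-up similarity scores over a three-word vocabulary (the values are purely illustrative, not from the patent):

```python
import torch

similarities = torch.tensor([2.0, 0.5, -1.0])  # made-up scores for three candidate words
probs = torch.softmax(similarities, dim=0)     # approx. [0.786, 0.175, 0.039]
word_index = int(torch.argmax(probs))          # the most probable word (index 0) is selected
```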
Further, in Step 3.3, training the discriminative model proceeds as follows:
Step 3.3.1: the description sentence produced by the generator, together with the picture features, serves as the discriminator's input;
Step 3.3.2: the discriminative model passes the input sentence description through the embedding layer and a fully connected layer to obtain a sentence feature vector;
Step 3.3.3: the sentence features and picture features are concatenated and input jointly into the GLU module;
Step 3.3.4: the resulting vector passes through two fully connected layers to obtain the generated-sentence description feature vector;
Step 3.3.5: a normalized exponential function converts the similarities into word-selection probabilities, producing the descriptive sentence;
Step 3.3.6: the authenticity of the generated sentence description is compared with the true sentence description to obtain a sentence score, which is fed back to the generative model and guides the next round of description generation.
The beneficial effects of the present invention are mainly as follows. The present invention proposes an image description generation method based on a conditional generative adversarial network. Given only an input picture, without its description, the generative model and the discriminative model improve each other's performance through minimax adversarial training: the generative model learns to produce a description sentence for the picture, while the discriminative model judges, as accurately as possible, whether the description sentence output by the generator is similar to the true sentence description. This method alleviates the need for large amounts of annotated information during deep learning training, and is also a successful realization of conditional generative adversarial networks on the image description generation task.
Detailed description of the invention
Fig. 1 is a schematic diagram of the framework of the image description generation model based on a generative adversarial network used in the present invention.
Specific embodiment
The invention is further described below with reference to the accompanying drawing.
Referring to Fig. 1, an image description generation method based on a conditional generative adversarial network includes four processes: construction of the conditional generative adversarial training network, dataset preprocessing, network training, and evaluation-index testing.
The pictures in this embodiment come from the MSCOCO dataset, which comprises a training set, a validation set, and a test set. The model is trained on the training set, and the training results are verified on the test set and the validation set. The framework of the image description generation method based on a conditional generative adversarial network is shown in Fig. 1; its operation comprises four processes: network construction, dataset preprocessing, network training, and accuracy testing.
The image description generation method based on a conditional generative adversarial network comprises the following steps:
Step 1: network construction, as follows (a code sketch of this layer stack follows the list):
Step 1.1: the conditional generative adversarial network framework consists of two parts, a generative model and a discriminative model; the two models have a similar structure, but their parameters are updated independently;
Step 1.2: the first layer of the generative model is an embedding layer with 512 neurons; its weight W_1 is defined as a floating-point variable, with no bias;
Step 1.3: the second layer of the generative model is a fully connected layer with 512 neurons; its weight W_2 is defined as a floating-point variable, with no bias;
Step 1.4: the third layer of the generative model is a fully connected layer with 512 neurons; its weight W_3 is defined as a floating-point variable, with no bias; it is followed by a ReLU activation function;
Step 1.5: the third layer of the generative model is followed by a GLU module containing three convolutional layers with 512 neurons each and weight W_4; the convolution kernel size is 5, the stride is 1, and the zero padding is 2; the weights are defined as floating-point variables, with no bias;
Step 1.6: the convolutional layers of the generative model are followed by a fully connected layer with 256 neurons; its weight W_5 is defined as a floating-point variable, with no bias;
Step 1.7: the last layer of the generative model is a fully connected layer with 9221 neurons; its weight W_6 is defined as a floating-point variable, with no bias;
Step 1.8: the first layer of the discriminative model is an embedding layer with 512 neurons; its weight W_7 is defined as a floating-point variable, with no bias;
Step 1.9: the second layer of the discriminative model is a fully connected layer with 512 neurons; its weight W_8 is defined as a floating-point variable, with no bias;
Step 1.10: the third layer of the discriminative model is a fully connected layer with 512 neurons; its weight W_9 is defined as a floating-point variable, with no bias; it is followed by a ReLU activation function;
Step 1.11: the third layer of the discriminative model is followed by a GLU module containing three convolutional layers with 512 neurons each and weight W_10; the convolution kernel size is 5, the stride is 1, and the zero padding is 2; the weights are defined as floating-point variables, with no bias;
Step 1.12: the convolutional layers of the discriminative model are followed by a fully connected layer with 256 neurons; its weight W_11 is defined as a floating-point variable, with no bias;
Step 1.13: the last layer of the discriminative model is a fully connected layer with 9221 neurons; its weight W_12 is defined as a floating-point variable, with no bias;
Step 1.14: the discriminative model feeds the computed similarity score between the generated description sentence and the true sentence back into the generative model in the form of loss-function weights;
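To make the layer stack of Steps 1.2-1.13 concrete, the following PyTorch sketch assembles the described skeleton. It is a minimal illustration under stated assumptions: the patent discloses layer sizes but neither code nor the exact wiring between layers, so the module names, tensor shapes, and the shared CaptionNet class are illustrative choices, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUBlock(nn.Module):
    """Three gated convolutional layers: each convolution doubles the channel
    count and F.glu gates it back down (Steps 1.5/1.11: kernel 5, stride 1,
    zero padding 2, no bias)."""
    def __init__(self, channels=512, n_layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, 2 * channels, kernel_size=5, stride=1,
                      padding=2, bias=False)
            for _ in range(n_layers)
        )

    def forward(self, x):                      # x: (batch, channels, length)
        for conv in self.convs:
            x = F.glu(conv(x), dim=1)
        return x

class CaptionNet(nn.Module):
    """Shared skeleton of the generative and discriminative models; per
    Step 1.1 the two instances are trained with independent parameters."""
    def __init__(self, vocab_size=9221, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)          # first layer: 512-unit embedding
        self.fc1 = nn.Linear(dim, dim, bias=False)          # second layer: 512-unit FC
        self.fc2 = nn.Linear(dim, dim, bias=False)          # third layer: 512-unit FC + ReLU
        self.glu = GLUBlock(dim)                            # GLU module with three conv layers
        self.fc3 = nn.Linear(dim, 256, bias=False)          # post-convolution 256-unit FC
        self.out = nn.Linear(256, vocab_size, bias=False)   # final 9221-unit FC

    def forward(self, sent, img_feat):
        # sent: word indices (batch, seq) or precomputed sentence features
        # (batch, seq, dim); img_feat: image features assumed already projected
        # to (batch, 1, dim) as described in Steps 3.2.3-3.2.4
        s = self.embed(sent) if sent.dtype == torch.long else sent
        s = torch.relu(self.fc2(self.fc1(s)))
        x = torch.cat([img_feat, s], dim=1)                 # splice image and sentence features
        x = self.glu(x.transpose(1, 2)).transpose(1, 2)
        return self.out(self.fc3(x))                        # per-position vocabulary logits
```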
Step 2: dataset preprocessing, as follows (a feature-extraction sketch follows the list):
Step 2.1: the data are divided into three parts: a training set of 113,287 pictures, a test set of 5,000 pictures, and a validation set of 5,000 pictures; the data are preprocessed by converting all sentence words in the dataset to lowercase and discarding non-alphanumeric characters;
Step 2.2: a VGG model pretrained on ImageNet is used, with the output picture-feature dimension set to 4096;
Step 2.3: the feature vectors corresponding to the training set pictures are extracted with the VGG network model and fine-tuned, and the feature values are passed through a fully connected layer and an embedding layer;
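The feature extraction of Steps 2.2-2.3 can be sketched as follows. torchvision's vgg16 stands in for the unnamed VGG variant, and the 512-dimensional projection target is an assumption chosen to match the sentence-feature size of Step 1; the fine-tuning of Step 2.3 is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# ImageNet-pretrained VGG, truncated after the second 4096-unit FC layer
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-2])
vgg.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# FC layer mapping the 4096-dim picture feature into the 512-dim sentence space
project = nn.Linear(4096, 512, bias=False)

@torch.no_grad()
def image_feature(pil_image):
    x = preprocess(pil_image).unsqueeze(0)   # (1, 3, 224, 224)
    return project(vgg(x)).unsqueeze(1)      # (1, 1, 512), ready to concatenate
```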
Step 3: network training, as follows:
Step 3.1: initialize the parameters of the generative and discriminative models with random weights; the model is trained for 30 epochs, and the model parameters are saved after each epoch;
Step 3.2: train the generative model (a generator-update sketch follows this list):
Step 3.2.1: the learning rate is set to 0.00005 and reduced to 10% of its value after 15 epochs;
Step 3.2.2: a random vector with the same dimensions as a picture-description sentence is sent into the network as the sentence input of the generative model;
Step 3.2.3: the extracted picture features have their dimension changed by a fully connected layer and an embedding layer, and are sent into the network as the picture input of the generative model;
Step 3.2.4: the dimension of the picture feature vector is made identical to that of the sentence feature vector after the first embedding layer, and the two are concatenated and input jointly into the network for training;
Step 3.2.5: the generative model trains on the input picture and sentence with the three-layer GLU module; for each picture, the generative model produces a descriptive sentence: the word similarities are converted into word-selection probabilities by the normalized exponential function, and the word with the highest probability is selected from the vocabulary as the generator's output;
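A hedged sketch of one generator update (Steps 3.2.1-3.2.5), reusing the CaptionNet skeleton sketched under Step 1. The straight-through Gumbel-softmax that keeps the discrete word choice differentiable, the use of the mean discriminator logit as a scalar score, and the sign of the loss are all assumptions; the patent states only that a similarity score is fed back in the form of loss-function weights.

```python
import torch
import torch.nn.functional as F

gen, disc = CaptionNet(), CaptionNet()  # shared skeleton, independent parameters (Step 1.1)
g_optim = torch.optim.RMSprop(gen.parameters(), lr=5e-5)                     # Step 3.2.1
g_sched = torch.optim.lr_scheduler.StepLR(g_optim, step_size=15, gamma=0.1)  # 10% after 15 epochs

def generator_step(img_feat, seq_len=16, dim=512):
    # Step 3.2.2: a random noise vector replaces the (withheld) true sentence
    noise = torch.randn(img_feat.size(0), seq_len, dim)
    logits = gen(noise, img_feat)                        # (batch, seq+1, vocab)
    # Step 3.2.5: softmax turns similarities into word probabilities; the hard
    # Gumbel-softmax variant used here is an assumption that keeps gradients flowing
    onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)
    sent_feat = onehot @ disc.embed.weight               # soft lookup in the discriminator embedding
    score = disc(sent_feat, img_feat).mean()             # stand-in scalar similarity score
    loss = -score                                        # the generator tries to raise the score
    g_optim.zero_grad()
    loss.backward()
    g_optim.step()
    return logits.argmax(-1), loss.item()                # generated word indices, loss value
```

g_sched is stepped once per epoch in the surrounding training loop, matching the learning-rate schedule of Step 3.2.1.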
Step 3.3: train the discriminative model (a discriminator-update sketch follows at the end of Step 3):
Step 3.3.1: the learning rate is set to 0.00005 and reduced to 10% of its value after 15 epochs;
Step 3.3.2: the description sentence returned by the generator serves as the discriminator's input, and the picture that this description sentence represents is input into the discriminator along with it;
Step 3.3.3: the description sentence has its dimension changed by an embedding layer, and is then concatenated with the picture features and input jointly into the discriminative model;
Step 3.3.4: the discriminative model trains on the input picture and sentence with the three-layer GLU module; for each picture, a descriptive sentence is produced: the word similarities are converted into word-selection probabilities by the normalized exponential function, and the word with the highest probability is selected from the vocabulary;
Step 3.3.5: the generated descriptive sentence is evaluated for relevance against the true descriptive sentence and a sentence score is computed; the similarity score computed by the discriminator is fed back into the generator, acting directly, in the form of loss-function weights, on the optimization of the generator's weights;
Step 3.4: the optimal generator model is saved as the output of training;
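A matching sketch of one discriminator update (Steps 3.3.1-3.3.5) together with the RMSprop minimization named in Step 3.4 of the claims, continuing the names from the generator sketch. The hinge loss that pushes the true description's score above the generated one's is an assumed objective; the patent specifies only that a similarity score is computed and fed back.

```python
import torch

d_optim = torch.optim.RMSprop(disc.parameters(), lr=5e-5)                     # Step 3.3.1
d_sched = torch.optim.lr_scheduler.StepLR(d_optim, step_size=15, gamma=0.1)   # 10% after 15 epochs

def discriminator_step(img_feat, real_tokens, fake_tokens):
    # real_tokens / fake_tokens: word-index tensors of shape (batch, seq)
    real_score = disc(real_tokens, img_feat).mean()      # score of the true description
    fake_score = disc(fake_tokens, img_feat).mean()      # score of the generated description
    # hinge loss (assumed): reward separating real from generated descriptions
    loss = torch.relu(1.0 - real_score) + torch.relu(1.0 + fake_score)
    d_optim.zero_grad()
    loss.backward()
    d_optim.step()
    return loss.item()
```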
Step 4: accuracy testing, as follows (an evaluation sketch follows the list):
Step 4.1: the preprocessed test-set pictures are fed into the optimal generator model;
Step 4.2: for each given query picture, the generator produces a descriptive sentence for that picture;
Step 4.3: the relevance of each query picture's true descriptive sentence to the descriptive sentence returned by the generator is compared, and the descriptive sentences generated for all query pictures are scored according to the evaluation criteria used in image description;
Step 4.4: verification is performed on the test data, and descriptive sentences are generated for the test pictures;
Through the operation of the above steps, descriptive sentences can be generated for pictures.
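The accuracy test of Steps 4.1-4.3 can be sketched as follows, continuing the names from the training sketches: run the saved optimal generator on the preprocessed test pictures and score its descriptions against the references with a standard image description metric. BLEU via NLTK is shown as one such criterion; the patent does not name a specific metric, and test_loader and id_to_word are assumed helpers (a dataset iterator and an index-to-word vocabulary mapping).

```python
import torch
from nltk.translate.bleu_score import corpus_bleu

gen.load_state_dict(torch.load("best_generator.pt"))   # Step 4.1: the saved optimal generator
gen.eval()

references, hypotheses = [], []
with torch.no_grad():
    for img_feat, ref_sentences in test_loader:        # assumed (features, reference lists) pairs
        noise = torch.randn(img_feat.size(0), 16, 512)
        words = gen(noise, img_feat).argmax(-1)         # Step 4.2: generate descriptions
        for refs, hyp in zip(ref_sentences, words):
            references.append([r.split() for r in refs])
            hypotheses.append([id_to_word[int(i)] for i in hyp])

print("BLEU-4:", corpus_bleu(references, hypotheses))  # Step 4.3: relevance score
```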
The specific description above further elaborates the purpose, technical solution, and beneficial effects of the invention. It should be understood that the above is only a specific embodiment of the present invention, used to explain the present invention, and is not intended to limit the scope of protection of the present invention; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (4)

1. An image description generation method based on a conditional generative adversarial network, characterized in that the method comprises the following steps:
Step 1: network construction, as follows:
Step 1.1: the conditional generative adversarial network framework consists of two parts, a generative model and a discriminative model; the two models have a similar structure, but their parameters are trained and updated independently;
Step 1.2: the first layer of the generative model is an embedding layer that outputs a three-dimensional feature tensor;
Step 1.3: the embedding layer of the generative model is followed by a fully connected layer;
Step 1.4: the third layer of the generative model is a fully connected layer;
Step 1.5: the third, fully connected layer of the generative model is followed by a ReLU activation function;
Step 1.6: the generative model then contains a GLU module comprising three convolutional layers;
Step 1.7: after the three convolutional layers, the generative model passes through a fully connected layer;
Step 1.8: the generative model expands the dimension with a final fully connected layer;
Step 1.9: the output of the generative model serves as the input of the discriminative model;
Step 1.10: the first layer of the discriminative model is an embedding layer, and the input dimension is expanded;
Step 1.11: the embedding layer of the discriminative model is followed by a fully connected layer that outputs a three-dimensional feature tensor;
Step 1.12: the third layer of the discriminative model is a fully connected layer;
Step 1.13: the third, fully connected layer of the discriminative model is followed by a ReLU activation function;
Step 1.14: the discriminative model then contains a GLU module comprising three convolutional layers;
Step 1.15: after the three convolutional layers, the discriminative model passes through a fully connected layer;
Step 1.16: the discriminative model changes the output dimension with a final fully connected layer;
Step 1.17: the discriminative model feeds the computed description-sentence similarity score back into the generative model;
Step 2: dataset preprocessing;
Step 3: network training, as follows:
Step 3.1: initialize the generative-model and discriminative-model parameters with random weights;
Step 3.2: train the generative model;
Step 3.3: train the discriminative model;
Step 3.4: minimize the loss function with the RMSprop descent algorithm;
Step 4: accuracy testing, as follows:
Step 4.1: feed the preprocessed test dataset into the optimal generator model;
Step 4.2: for each given picture, the generator produces the corresponding descriptive sentence through the generative model;
Step 4.3: compare the relevance of each query picture's true descriptive sentence to the descriptive sentence returned by the generator, and score the descriptive sentences generated for all query pictures according to the evaluation criteria used in image description;
Step 4.4: verify on the test data and generate descriptive sentences for the test pictures.
2. The image description generation method based on a conditional generative adversarial network according to claim 1, characterized in that in said Step 2 the data preprocessing proceeds as follows:
Step 2.1: the dataset contains two parts, training pictures and their description sentences; features are extracted from the pictures and input into the network;
Step 2.2: picture features are extracted with a VGG model pretrained on ImageNet;
Step 2.3: the pictures are input into the adversarial network in the form of feature vectors.
3. The image description generation method based on a conditional generative adversarial network according to claim 1 or 2, characterized in that in said Step 3.2 training the generative model proceeds as follows:
Step 3.2.1: no true sentence description is input to the generative model; instead, a random noise vector of the same dimension as a true description is input as the sentence description;
Step 3.2.2: the picture features extracted by the VGG network are fed into the generative model together with the random vector;
Step 3.2.3: the generative model passes the input sentence description through the embedding layer and a fully connected layer to obtain a sentence feature vector;
Step 3.2.4: the extracted picture features pass through a fully connected layer and an embedding layer and are converted to the same dimension as the sentence feature vector;
Step 3.2.5: the sentence features and picture features are concatenated and input jointly into the GLU module for training;
Step 3.2.6: the resulting vector passes through two fully connected layers to obtain the generated-sentence description feature vector;
Step 3.2.7: a normalized exponential function converts the similarities into word-selection probabilities, producing the descriptive sentence.
4. The image description generation method based on a conditional generative adversarial network according to claim 3, characterized in that in said Step 3.3 training the discriminative model proceeds as follows:
Step 3.3.1: the description sentence produced by the generator, together with the picture features, serves as the discriminator's input;
Step 3.3.2: the discriminative model passes the input sentence description through the embedding layer and a fully connected layer to obtain a sentence feature vector;
Step 3.3.3: the sentence features and picture features are concatenated and input jointly into the GLU module;
Step 3.3.4: the resulting vector passes through two fully connected layers to obtain the generated-sentence description feature vector;
Step 3.3.5: a normalized exponential function converts the similarities into word-selection probabilities, producing the descriptive sentence;
Step 3.3.6: the authenticity of the generated sentence description is compared with the true sentence description to obtain a sentence score, which is fed back to the generative model and guides the next round of description generation.
CN201910467500.9A 2019-05-31 2019-05-31 Image description generation method based on a conditional generative adversarial network Active CN110287357B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910467500.9A CN110287357B (en) 2019-05-31 2019-05-31 Image description generation method based on a conditional generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910467500.9A CN110287357B (en) 2019-05-31 2019-05-31 Image description generation method based on a conditional generative adversarial network

Publications (2)

Publication Number Publication Date
CN110287357A (en) 2019-09-27
CN110287357B (en) 2021-05-18

Family

ID=68003238

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910467500.9A Active CN110287357B (en) 2019-05-31 2019-05-31 Image description generation method based on a conditional generative adversarial network

Country Status (1)

Country Link
CN (1) CN110287357B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105013A (en) * 2019-11-05 2020-05-05 中国科学院深圳先进技术研究院 Optimization method for an adversarial network architecture, and image description generation method and system
CN111143617A (en) * 2019-12-12 2020-05-12 浙江大学 Automatic generation method and system for picture or video text descriptions
CN111159454A (en) * 2019-12-30 2020-05-15 浙江大学 Picture description generation method and system based on an Actor-Critic generative adversarial network
CN113220891A (en) * 2021-06-15 2021-08-06 北京邮电大学 Image description algorithm based on an unsupervised concept-to-sentence generative adversarial network
CN113673349A (en) * 2021-07-20 2021-11-19 广东技术师范大学 Method, system and device for generating Chinese text from images based on a feedback mechanism

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563493A (en) * 2017-07-17 2018-01-09 华南理工大学 Adversarial network algorithm for multi-generator convolutional image synthesis
CN108446334A (en) * 2018-02-23 2018-08-24 浙江工业大学 Content-based image retrieval method with unsupervised adversarial training
US10275473B2 (en) * 2017-04-27 2019-04-30 Sk Telecom Co., Ltd. Method for learning cross-domain relations based on generative adversarial networks
CN109711442A (en) * 2018-12-15 2019-05-03 中国人民解放军陆军工程大学 Unsupervised layer-by-layer generative adversarial feature representation learning method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275473B2 (en) * 2017-04-27 2019-04-30 Sk Telecom Co., Ltd. Method for learning cross-domain relations based on generative adversarial networks
CN107563493A (en) * 2017-07-17 2018-01-09 华南理工大学 Adversarial network algorithm for multi-generator convolutional image synthesis
CN108446334A (en) * 2018-02-23 2018-08-24 浙江工业大学 Content-based image retrieval method with unsupervised adversarial training
CN109711442A (en) * 2018-12-15 2019-05-03 中国人民解放军陆军工程大学 Unsupervised layer-by-layer generative adversarial feature representation learning method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HJIMCE: "Deep Learning (50): Image translation based on conditional adversarial networks", CSDN Blog *
唐贤伦 et al.: "Image recognition method based on conditional deep convolutional generative adversarial networks", Acta Automatica Sinica *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111105013A (en) * 2019-11-05 2020-05-05 中国科学院深圳先进技术研究院 Optimization method for an adversarial network architecture, and image description generation method and system
WO2021088935A1 (en) * 2019-11-05 2021-05-14 中国科学院深圳先进技术研究院 Adversarial network architecture optimization method and system, and image description generation method and system
CN111105013B (en) * 2019-11-05 2023-08-11 中国科学院深圳先进技术研究院 Optimization method for an adversarial network architecture, and image description generation method and system
CN111143617A (en) * 2019-12-12 2020-05-12 浙江大学 Automatic generation method and system for picture or video text descriptions
CN111159454A (en) * 2019-12-30 2020-05-15 浙江大学 Picture description generation method and system based on an Actor-Critic generative adversarial network
CN113220891A (en) * 2021-06-15 2021-08-06 北京邮电大学 Image description algorithm based on an unsupervised concept-to-sentence generative adversarial network
CN113220891B (en) * 2021-06-15 2022-10-18 北京邮电大学 Method for generating adversarial-network image descriptions based on unsupervised concept-to-sentence
CN113673349A (en) * 2021-07-20 2021-11-19 广东技术师范大学 Method, system and device for generating Chinese text from images based on a feedback mechanism

Also Published As

Publication number Publication date
CN110287357B (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN110287357A (en) A kind of iamge description generation method generating confrontation network based on condition
Alonso et al. Adversarial generation of handwritten text images conditioned on sequences
Lin et al. Adversarial ranking for language generation
CN111476294A (en) Zero sample image identification method and system based on generation countermeasure network
CN105975573B (en) A kind of file classification method based on KNN
CN111444343B (en) Cross-border national culture text classification method based on knowledge representation
CN103778227B (en) The method screening useful image from retrieval image
CN109472024A (en) A kind of file classification method based on bidirectional circulating attention neural network
CN107992542A (en) A kind of similar article based on topic model recommends method
CN111061861B (en) Text abstract automatic generation method based on XLNet
CN111930887B (en) Multi-document multi-answer machine reading and understanding system based on joint training mode
CN108460019A (en) A kind of emerging much-talked-about topic detecting system based on attention mechanism
CN110413791A (en) File classification method based on CNN-SVM-KNN built-up pattern
CN109977199A (en) A kind of reading understanding method based on attention pond mechanism
CN108446334A (en) A kind of content-based image retrieval method of unsupervised dual training
CN110008309A (en) A kind of short phrase picking method and device
CN110059220A (en) A kind of film recommended method based on deep learning Yu Bayesian probability matrix decomposition
CN107967497A (en) Manuscripted Characters Identification Method based on convolutional neural networks and extreme learning machine
CN109948825A (en) Favorable Reservoir development area prediction technique based on improvement PSO in conjunction with Adaboost
CN114841173B (en) Academic text semantic feature extraction method and system based on pre-training model and storage medium
CN109992703A (en) A kind of credibility evaluation method of the differentiation feature mining based on multi-task learning
CN111079374A (en) Font generation method, device and storage medium
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN117236330B (en) Mutual information and antagonistic neural network based method for enhancing theme diversity
CN112182155B (en) Search result diversification method based on generated type countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant