CN110751698B - Text-to-image generation method based on hybrid network model - Google Patents


Info

Publication number
CN110751698B
CN110751698B (application CN201910923354.6A)
Authority
CN
China
Prior art keywords
image
text
title
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910923354.6A
Other languages
Chinese (zh)
Other versions
CN110751698A (en)
Inventor
张玲
李钢
黄晓琪
杨子固
刘剑超
王莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology filed Critical Taiyuan University of Technology
Priority to CN201910923354.6A priority Critical patent/CN110751698B/en
Publication of CN110751698A publication Critical patent/CN110751698A/en
Application granted granted Critical
Publication of CN110751698B publication Critical patent/CN110751698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06T 11/001 - Texturing; Colouring; Generation of texture or colour
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/0002 - Inspection of images, e.g. flaw detection
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30168 - Image quality inspection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a text-to-image generation method based on a hybrid network model. The method forward trains the text-to-image mapping, forward trains the generator and the discriminator of the text-to-image generation model, trains the discriminator on three types of input, performs forward test training of the text-to-image generative adversarial model, trains the model to obtain its loss information, and evaluates the quality of the generated images with an image evaluation module. The quality of the generated images is markedly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater practical value. It overcomes the shortcomings of existing research on text-to-image generation models based on generative adversarial networks and is better suited to text-to-image generation. The method can output high-quality, sharp images and achieves good generalization from a small amount of training data.

Description

Text-to-image generation method based on hybrid network model
Technical Field
The invention belongs to the technical field of image processing, and relates to a text-to-image generation method based on a hybrid network model.
Background
With the rapid development of artificial intelligence, generating images from text has attracted a great deal of interest. In recent years, recurrent neural network architectures have been used to learn text feature representations, and deep convolutional generative adversarial networks can generate high-quality, sharp images of specific categories, such as faces and rooms.
In the conventional text-conditioned model based on a generative adversarial network, a convolutional network is used in the discriminator to extract image features and a recurrent neural network is used to extract features of the sentence sequence. A conventional convolutional network, however, needs a large number of images for training, which is a significant limitation, whereas a capsule network can generalize from far less training data. A conventional convolutional network also copes poorly with image ambiguity, while a capsule network handles it well. In addition, in the conventional text-to-image model based on a generative adversarial network, the convolutional layers used for image feature extraction are followed by a fully connected layer, which accounts for a large share of the parameters in the network; as a result, model training is relatively slow and overfitting is relatively severe.
Based on the above, a brand-new image feature extraction method is needed to solve the problems that arise when a conventional convolutional network is used to extract image features.
Disclosure of Invention
The invention provides a text-to-image generation method based on a hybrid network model, which outputs high-quality, sharp images and achieves generalization with a small amount of training data.
The technical solution adopted by the invention is a text-to-image generation method based on a hybrid network model, comprising the following steps:
Step S1: loading the data required by the text-image adversarial model, which is based on a generative adversarial network;
Step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
Step S3: in the text-image adversarial model, forward training the mapping between text and image;
Step S4: forward training the generator in the text-image adversarial model: encoding a correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
Step S5: feeding three types of input to the discriminator of the text-image adversarial model (a matching-aware discriminator is used here, an improvement on the discriminator of the standard text-conditional DCGAN framework: besides judging whether a generated image is real, it also distinguishes whether an image that fails to pass is unrealistic or merely mismatched with its text): the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training the discriminator on these inputs yields a trained discriminator;
Step S6: forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test whether the generator outputs the expected result;
Step S7: parameter definition, specifically the learning rate, the learning-rate decay, and the optimizers of the generator and the discriminator;
Step S8: training the text-image adversarial model: loading the latest checkpoint; obtaining the seeds, the noise and the sentence subscripts; obtaining the matched text, the real images, the wrong titles, the wrong images and the noise; updating the text-image mapping; updating the discriminator and the generator; and finally obtaining the round number and the loss-function information;
Step S9: saving the trained text-image adversarial model and evaluating the quality of the images it generates.
Further, in step S1, loading the data of the text-image adversarial model comprises: loading the title set and storing the processed titles in the corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly inspecting the list of title subscripts; loading the related images and resizing them; obtaining the number of images in the image training and test sets and the number of titles in the title training and test sets; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
Further, in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically comprise the name, type and size of the true and false images, and the name, type and size of the true and false titles.
Further, in the forward training of the text-image mapping, the real and forged images are encoded by the capsule network as follows (a minimal sketch is given after this list):
(1) the real or forged image is first fed, as a group of vectors, into the input layer of the capsule network;
(2) after simple processing in the input layer, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch-normalization layer; feature extraction and normalization are then repeated through a further capsule layer and batch-normalization layer, and once more through an identical capsule layer and batch-normalization layer; a network layer then compresses the image tensor into a vector; finally, the vector passes through a fully connected layer, each node of which is connected to all nodes of the previous layer, integrating the extracted features and outputting the final overall feature.
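For illustration only, a minimal NumPy sketch of the encoding path just described, using a highly simplified capsule layer (no dynamic routing); the layer sizes, weight shapes and function names are assumptions, not taken from the patent.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    # Capsule squashing: keeps the vector's orientation, maps its length into [0, 1)
    sq_norm = np.sum(v ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * v / np.sqrt(sq_norm + eps)

def capsule_layer(x, W):
    # Simplified capsule layer: one linear vote per capsule, then squash.
    # x: (batch, n_caps, in_dim); W: (n_caps, in_dim, out_dim)
    votes = np.einsum('bid,ido->bio', x, W)
    return squash(votes)

def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encode_image(img_vectors, weights):
    # img_vectors: the image reshaped into a group of vectors (batch, n_caps, dim)
    h = img_vectors
    for W in weights:                       # capsule layer + batch normalization, repeated
        h = batch_norm(capsule_layer(h, W))
    return h.reshape(h.shape[0], -1)        # compress the tensor into one feature vector
                                            # (a final fully connected layer would follow)

rng = np.random.default_rng(0)
imgs = rng.normal(size=(2, 36, 8))                                    # 2 images as 36 eight-dim vectors
ws = [0.1 * rng.normal(size=(36, 8, 16)), 0.1 * rng.normal(size=(36, 16, 16))]
print(encode_image(imgs, ws).shape)                                   # (2, 576)
```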
Further, in step S3, in the forward training of the text-image mapping, the recurrent neural network extracts a feature vector from the title in the following steps (an illustrative sketch follows the list):
(1) the title sequence is fed into an Embedding layer (the Embedding layer maps words from the semantic space to a vector space while preserving, as far as possible, their relations in the semantic space: words with similar semantics are mapped to nearby vectors, and vice versa); after processing, a three-dimensional tensor containing the batch size and embedding-dimension information is output;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network, which processes it to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
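An illustrative TensorFlow 1.x-style sketch of this title encoder (Embedding layer followed by a dynamic RNN); the vocabulary size, dimensions and variable names are assumed, not taken from the patent.

```python
import tensorflow as tf  # TensorFlow 1.x API, consistent with the ConfigProto usage in step S8

vocab_size, embed_dim, rnn_units = 8000, 256, 128      # illustrative sizes

title_ids = tf.placeholder(tf.int32, [None, None], name="title_ids")   # (batch, seq_len)

# Embedding layer: word index -> dense vector, preserving semantic proximity
embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
embedded = tf.nn.embedding_lookup(embedding, title_ids)                # (batch, seq_len, embed_dim)

# Dynamic RNN layer: processes the variable-length title sequence
cell = tf.nn.rnn_cell.LSTMCell(rnn_units)
outputs, state = tf.nn.dynamic_rnn(cell, embedded, dtype=tf.float32)

title_vector = state.h                     # final hidden state used as the title feature vector
```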
further, in step S4, the forward training of the generator is as follows:
(1) encoding the correct header using a recurrent neural network;
(2) after the correct title is coded, adding corresponding noise, and processing by a generator to obtain a generated forged image; the generator updates the formula as follows:
LG←log(Sf) (1)
G←G-αδLG/δG (2)
wherein S isfThe probability of discrimination between the forged image and the real text vector is shown, and alpha is a constant factor.
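A minimal NumPy sketch of the update in equations (1) and (2); the parameter and gradient lists and the value of α are hypothetical, and in practice the optimizer defined in step S7 would perform this step.

```python
import numpy as np

def generator_update(G_params, grads, s_f, alpha=2e-4):
    # Eq. (1): L_G = log(S_f), with S_f = D(forged image, real title vector)
    # Eq. (2): G <- G - alpha * dL_G/dG, applied per parameter array
    loss_g = float(np.log(s_f + 1e-8))
    new_params = [g - alpha * dg for g, dg in zip(G_params, grads)]
    return loss_g, new_params

# toy usage with one 2x2 parameter matrix and its (hypothetical) gradient
params = [np.zeros((2, 2))]
grads = [np.ones((2, 2))]
print(generator_update(params, grads, s_f=0.3))
```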
Further, in step S5, the discriminator is trained as follows:
(1) three types of input are fed to the discriminator: the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training on these yields a trained discriminator;
(2) when updating the discriminator, the relevant formulas are:
L_D ← log(S_τ) + (log(1 − S_w) + log(1 − S_f)) / 2    (3)
D ← D − α · ∂L_D/∂D    (4)
where S_τ, S_w and S_f are, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text.
In the discriminator, when matching the image vector against the title vector, the similarity of two probability distributions is measured with the KL-divergence formula (a sketch is given below):
D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
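A short NumPy sketch of the KL-divergence computation described above; the small epsilon added for numerical stability is an implementation assumption.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); non-negative and asymmetric
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q), kl_divergence(q, p))   # two different values: the divergence is asymmetric
```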
Further, in step S6, the forward test training of the text-to-image generative adversarial model proceeds as follows: the real title is encoded and noise is added for training, yielding a trained generator; the loss functions of the generator and the discriminator are then obtained with the cross-entropy formula (a sketch is given below):
H(y, ŷ) = −Σ_i y_i · log(ŷ_i)
where y takes 3 values, namely the log probability of the discrimination of the forged image with the real-title vector, of the real image with the real title, and of the real image with the false title, and ŷ is the corresponding normalized log probability.
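One common realization of these losses, sketched in NumPy with binary cross-entropy on the three (image, title) pairings; the target labels (1 for the matching real pair, 0 for the other two) are an assumption consistent with equation (3), not stated explicitly in the patent.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # H(y, y_hat) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# The three discriminator outputs (probabilities) with their assumed targets:
s_r = 0.9   # real image + real title   -> target 1
s_w = 0.2   # real image + wrong title  -> target 0
s_f = 0.3   # forged image + real title -> target 0

d_loss = binary_cross_entropy(1, s_r) + (binary_cross_entropy(0, s_w)
                                         + binary_cross_entropy(0, s_f)) / 2
g_loss = binary_cross_entropy(1, s_f)   # the generator wants the forged pair judged as matching
print(d_loss, g_loss)
```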
Further, in step S7, the parameters are defined as follows (sketched below): define the learning rate, the learning-rate decay, the number of decay rounds and the bias; obtain the names of the related variables; define the optimizers of the generator and the discriminator, and the optimizer of the recurrent neural network.
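An illustrative TensorFlow 1.x sketch of these parameter definitions; the numeric values (learning rate, decay, Adam beta1) are assumptions.

```python
import tensorflow as tf  # TensorFlow 1.x style

lr, decay_rate, decay_rounds = 2e-4, 0.5, 100      # assumed values
beta1 = 0.5                                        # assumed Adam bias term

global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(lr, global_step,
                                            decay_rounds, decay_rate, staircase=True)

g_optimizer = tf.train.AdamOptimizer(learning_rate, beta1=beta1)     # generator optimizer
d_optimizer = tf.train.AdamOptimizer(learning_rate, beta1=beta1)     # discriminator optimizer
rnn_optimizer = tf.train.AdamOptimizer(learning_rate, beta1=beta1)   # recurrent text-encoder optimizer
```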
Further, in step S8, training of the model begins; the preparation before training is as follows (sketched below): configure the session parameters with the ConfigProto function in TensorFlow and initialize the session with the global initialization function; load the latest checkpoint; obtain the title description sentences to be fed to the model and their subscripts; obtain the training batch size and the sample sentences and preprocess the sample sentences; update the learning rate; obtain the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise.
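A sketch of this preparation step in TensorFlow 1.x, assuming the model graph (and therefore its variables) has already been built; the checkpoint directory is a hypothetical path.

```python
import tensorflow as tf  # TensorFlow 1.x style

# Session configuration via ConfigProto, then global initialization
config = tf.ConfigProto(allow_soft_placement=True)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

# Load the latest checkpoint, if one exists
saver = tf.train.Saver()                              # requires the model variables to exist
ckpt = tf.train.latest_checkpoint("./checkpoint")     # hypothetical directory
if ckpt is not None:
    saver.restore(sess, ckpt)
```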
Further, in step S8, the training itself proceeds as follows (a schematic outline is given below): the text-to-image mapping is updated; while the number of training rounds is below 50, a dictionary containing the true and false images and the true and false titles is obtained, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from round 50 onward the error is set to 0 and the discriminator and generator are updated; the time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is obtained, and the outputs of the generator network and of the recurrent neural network are combined into a vector group to obtain the generated image and the recurrent-network output; a layer of attention is added during image generation so that, when a region of the image is generated, it is associated with the corresponding sentence with a certain probability, improving generation quality; the generated images are saved to a specified directory; the model is saved, with an updated checkpoint every 10 rounds and the latest checkpoint saved and its name updated in the final round (round 100).
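A schematic outline of this staged training; every function below is a placeholder stub standing in for the real update, sampling and saving routines, so the structure (50 warm-up rounds for the text-image mapping, checkpoints every 10 rounds, 100 rounds total) is the only part taken from the description.

```python
import random
import time

# Placeholder stubs for the real routines (all hypothetical):
iterate_batches      = lambda: [(None, None, None, None, None)]
update_rnn_mapping   = lambda *a: random.random()
update_discriminator = lambda *a: random.random()
update_generator     = lambda *a: random.random()
save_sample_images   = lambda epoch: None
save_checkpoint      = lambda epoch: None

N_ROUNDS, RNN_WARMUP, SAVE_EVERY = 100, 50, 10

for epoch in range(1, N_ROUNDS + 1):
    for real_img, real_cap, wrong_img, wrong_cap, noise in iterate_batches():
        if epoch < RNN_WARMUP:
            # update the text-image mapping (recurrent encoder) against true/false images and titles
            rnn_err = update_rnn_mapping(real_img, real_cap, wrong_img, wrong_cap)
        else:
            rnn_err = 0.0                    # mapping error fixed to 0 from round 50 onward
        d_loss = update_discriminator(real_img, real_cap, wrong_img, wrong_cap, noise)
        g_loss = update_generator(real_cap, noise)

    print(f"round {epoch}  rnn_err={rnn_err:.4f}  d_loss={d_loss:.4f}  "
          f"g_loss={g_loss:.4f}  time={time.strftime('%H:%M:%S')}")
    save_sample_images(epoch)                # generated samples go to a specified directory
    if epoch % SAVE_EVERY == 0 or epoch == N_ROUNDS:
        save_checkpoint(epoch)               # latest checkpoint kept at the final round
```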
Further, in step S9, the quality of the images generated by the trained text-image adversarial model is evaluated as follows (a sketch is given below): a corresponding scoring module is built, using one of the image quality evaluation methods, the FID (Fréchet Inception Distance) score: the generated images are embedded into the feature space given by a specific layer of the Inception Net, the space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the measure of image quality.
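A sketch of the FID computation on already-extracted Inception-Net features; it assumes SciPy for the matrix square root, and the random toy data at the end merely stands in for real feature matrices.

```python
import numpy as np
from scipy import linalg

def fid_score(feats_real, feats_gen):
    # Frechet Inception Distance between two Gaussians fitted to Inception-layer features
    # feats_*: (n_samples, feature_dim) activations from the chosen Inception Net layer
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
print(fid_score(rng.normal(size=(64, 16)), rng.normal(loc=0.5, size=(64, 16))))
```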
Unlike the prior art, the text-to-image generation method based on a hybrid network model forward trains the text-to-image mapping, forward trains the generator and the discriminator of the text-to-image generation model, trains the discriminator on three types of input, performs forward test training of the text-to-image generative adversarial model, trains the model to obtain its loss information, and evaluates image quality with an image evaluation module. The quality of the generated images is markedly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater practical value. It overcomes the shortcomings of existing research on text-to-image generation models based on generative adversarial networks and is better suited to text-to-image generation. The method can output high-quality, sharp images and achieves good generalization from a small amount of training data.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a text-to-image generation method based on a hybrid network model according to the present invention.
Fig. 2 shows the structure and workflow of the capsule network used in the text-to-image generation method based on a hybrid network model provided by the invention, taking handwritten digits as an example.
Fig. 3 shows the workflow of the attention-mechanism module used during image generation in the text-to-image generation method based on a hybrid network model provided by the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to those embodiments. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
As shown in fig. 1, the present invention provides a text-to-image generation method based on a hybrid network model, which specifically comprises the following steps:
Step S1: loading the data required by the text-image adversarial model, which is based on a generative adversarial network;
Step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
Step S3: in the text-image adversarial model, forward training the mapping between text and image;
Step S4: forward training the generator in the text-image adversarial model: encoding a correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
Step S5: feeding three types of input to the discriminator of the text-image adversarial model (a matching-aware discriminator is used here, an improvement on the discriminator of the standard text-conditional DCGAN framework: besides judging whether a generated image is real, it also distinguishes whether an image that fails to pass is unrealistic or merely mismatched with its text): the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training the discriminator on these inputs yields a trained discriminator;
Step S6: forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test whether the generator outputs the expected result;
Step S7: parameter definition, specifically the learning rate, the learning-rate decay, and the optimizers of the generator and the discriminator;
Step S8: training the text-image adversarial model: loading the latest checkpoint; obtaining the seeds, the noise and the sentence subscripts; obtaining the matched text, the real images, the wrong titles, the wrong images and the noise; updating the text-image mapping; updating the discriminator and the generator; and finally obtaining the round number and the loss-function information;
Step S9: saving the trained text-image adversarial model and evaluating the quality of the images it generates.
Further, in step S1, loading the data of the text-image adversarial model comprises: loading the title set and storing the processed titles in the corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly inspecting the list of title subscripts; loading the related images and resizing them; obtaining the number of images in the image training and test sets and the number of titles in the title training and test sets; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
Further, in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically comprise the name, type and size of the true and false images, and the name, type and size of the true and false titles.
Further, in the forward training of the text-image mapping, the real and forged images are encoded by the capsule network; the structure and workflow of the capsule network, taking handwritten digits as an example, are shown in fig. 2. The specific process is as follows:
(1) the real or forged image is first fed, as a group of vectors, into the input layer of the capsule network;
(2) after simple processing in the input layer, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch-normalization layer; feature extraction and normalization are then repeated through a further capsule layer and batch-normalization layer, and once more through an identical capsule layer and batch-normalization layer; a network layer then compresses the image tensor into a vector; finally, the vector passes through a fully connected layer, each node of which is connected to all nodes of the previous layer, integrating the extracted features and outputting the final overall feature.
Further, in step S3, in the forward training of the text-image mapping, the recurrent neural network extracts a feature vector from the title; the workflow of the attention-mechanism module in image generation is shown in fig. 3. The steps are as follows:
(1) the title sequence is fed into an Embedding layer (the Embedding layer maps words from the semantic space to a vector space while preserving, as far as possible, their relations in the semantic space: words with similar semantics are mapped to nearby vectors, and vice versa); after processing, a three-dimensional tensor containing the batch size and embedding-dimension information is output;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network, which processes it to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
further, in step S4, the forward training of the generator is as follows:
(1) encoding the correct header using a recurrent neural network;
(2) after the correct title is coded, adding corresponding noise, and processing by a generator to obtain a generated forged image; the generator updates the formula as follows:
LG←log(Sf) (1)
G←G-αδLG/δG (2)
wherein S isfThe probability of discrimination between the forged image and the real text vector is shown, and alpha is a constant factor.
Further, in step S5, the discriminator is trained as follows:
(1) three types of input are fed to the discriminator: the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training on these yields a trained discriminator;
(2) when updating the discriminator, the relevant formulas are:
L_D ← log(S_τ) + (log(1 − S_w) + log(1 − S_f)) / 2    (3)
D ← D − α · ∂L_D/∂D    (4)
where S_τ, S_w and S_f are, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text.
In the discriminator, when matching the image vector against the title vector, the similarity of two probability distributions is measured with the KL-divergence formula:
D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
Further, in step S6, the forward test training of the text-to-image generative adversarial model proceeds as follows: the real title is encoded and noise is added for training, yielding a trained generator; the loss functions of the generator and the discriminator are then obtained with the cross-entropy formula:
H(y, ŷ) = −Σ_i y_i · log(ŷ_i)
where y takes 3 values, namely the log probability of the discrimination of the forged image with the real-title vector, of the real image with the real title, and of the real image with the false title, and ŷ is the corresponding normalized log probability.
Further, in step S7, the parameters are defined as follows: define the learning rate, the learning-rate decay, the number of decay rounds and the bias; obtain the names of the related variables; define the optimizers of the generator and the discriminator, and the optimizer of the recurrent neural network.
Further, in step S8, training of the model begins; the preparation before training is as follows: configure the session parameters with the ConfigProto function in TensorFlow and initialize the session with the global initialization function; load the latest checkpoint; obtain the title description sentences to be fed to the model and their subscripts; obtain the training batch size and the sample sentences and preprocess the sample sentences; update the learning rate; obtain the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise.
Further, in step S8, the training itself proceeds as follows: the text-to-image mapping is updated; while the number of training rounds is below 50, a dictionary containing the true and false images and the true and false titles is obtained, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from round 50 onward the error is set to 0; the discriminator and the generator are updated; the time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is obtained, and the outputs of the generator network and of the recurrent neural network are combined into a vector group to obtain the generated image and the recurrent-network output; a layer of attention is added during image generation so that, when a region of the image is generated, it is associated with the corresponding sentence with a certain probability, improving generation quality (a sketch of this attention step is given below); the generated images are saved to a specified directory; the model is saved, with an updated checkpoint every 10 rounds and the latest checkpoint saved and its name updated in the final round (round 100).
Further, in step S9, the quality of the images generated by the trained text-image adversarial model is evaluated as follows: a corresponding scoring module is built, using one of the image quality evaluation methods, the FID (Fréchet Inception Distance) score: the generated images are embedded into the feature space given by a specific layer of the Inception Net, the space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the measure of image quality.
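A minimal NumPy sketch of the word-to-region attention described above (cf. fig. 3): each generated image region attends over the word vectors of the title; the dot-product scoring and the dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_region_attention(region_feats, word_feats):
    # region_feats: (n_regions, d) image-region features; word_feats: (n_words, d) title word vectors.
    # Returns, per region, the probability of corresponding to each word and the word context vector.
    scores = region_feats @ word_feats.T            # (n_regions, n_words) similarities
    attn = softmax(scores, axis=1)                  # per-region distribution over the sentence's words
    context = attn @ word_feats                     # (n_regions, d) word context for each region
    return context, attn

regions = np.random.rand(4, 8)                      # e.g. 4 image regions with 8-dim features
words = np.random.rand(6, 8)                        # 6 word vectors from the title encoder
ctx, attn = word_region_attention(regions, words)
print(attn.shape, ctx.shape)                        # (4, 6) attention weights, (4, 8) contexts
```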
Unlike the prior art, the text-to-image generation method based on a hybrid network model forward trains the text-to-image mapping, forward trains the generator and the discriminator of the text-to-image generation model, trains the discriminator on three types of input, performs forward test training of the text-to-image generative adversarial model, trains the model to obtain its loss information, and evaluates image quality with an image evaluation module. The quality of the generated images is markedly higher than that of images generated by a conventional text-conditioned GAN, so the method has greater practical value. It overcomes the shortcomings of existing research on text-to-image generation models based on generative adversarial networks and is better suited to text-to-image generation. The method can output high-quality, sharp images and achieves good generalization from a small amount of training data.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A text-to-image generation method based on a hybrid network model, comprising the following steps:
Step S1: loading the data required by the text-image adversarial model, which is based on a generative adversarial network;
Step S2: defining the text-image adversarial model, including defining the real image, the wrong image, the real title, the wrong title and the noise variable;
Step S3: in the text-image adversarial model, forward training the mapping between text and image; in this forward training, the real and forged images are encoded by the capsule network as follows:
(1) the real or forged image is first fed, as a group of vectors, into the input layer of the capsule network;
(2) after simple processing in the input layer, the image passes in turn through two capsule layers, which extract its high-order features, and is then normalized by a batch-normalization layer; feature extraction and normalization are then repeated through a further capsule layer and batch-normalization layer, and once more through an identical capsule layer and batch-normalization layer; a network layer then compresses the image tensor into a vector; finally, the vector passes through a fully connected layer, each node of which is connected to all nodes of the previous layer, integrating the extracted features and outputting the final overall feature;
Step S4: forward training the generator in the text-image adversarial model: encoding a correct title with a recurrent neural network and adding noise to the encoded vector to train the generator, obtaining a trained generator and a forged image;
Step S5: feeding three types of input to the discriminator of the text-image adversarial model, using a matching-aware discriminator that improves on the discriminator of the standard text-conditional DCGAN framework and, besides judging whether an output image is real, also distinguishes whether a generated image that fails is unrealistic or merely mismatched with its text; the three inputs are the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector, and training the discriminator on them yields a trained discriminator;
Step S6: forward test training of the text-image adversarial model: encoding a real title with the recurrent neural network and adding random noise to the encoded vector to test whether the generator outputs the expected result;
Step S7: parameter definition, specifically the learning rate, the learning-rate decay, and the optimizers of the generator and the discriminator;
Step S8: training the text-image adversarial model: loading the latest checkpoint; obtaining the seeds, the noise and the sentence subscripts; obtaining the matched text, the real images, the wrong titles, the wrong images and the noise; updating the text-image mapping; updating the discriminator and the generator; and finally obtaining the round number and the loss-function information;
the training of the model proceeds as follows: the text-to-image mapping is updated; while the number of training rounds is below 50, a dictionary containing the true and false images and the true and false titles is obtained, and the error of the recurrent neural network is obtained from the vector formed by its loss function and optimization function; from round 50 onward the error is set to 0; the discriminator and the generator are updated; the time is printed every fixed number of rounds; a dictionary of sample sentences and sample seeds is obtained, and the outputs of the generator network and of the recurrent neural network are combined into a vector group to obtain the generated image; a layer of attention is added during image generation so that, when a region of the image is generated, it is associated with the corresponding sentence with a certain probability, improving generation quality; the generated images are saved to a specified directory; the model is saved, with a checkpoint every 10 rounds and the latest checkpoint saved and its name updated in the final round (round 100);
Step S9: saving the trained text-image adversarial model and evaluating the quality of the images it generates, as follows: a corresponding scoring module is built, using one of the image quality evaluation methods, the FID score: the generated images are embedded into the feature space given by a specific layer of the Inception Net, the space is treated as a continuous multivariate Gaussian distribution, the mean and covariance of the generated data and of the real data are computed, and the resulting distance is returned as the measure of image quality.
2. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S1, loading the data of the text-image adversarial model comprises: loading the title set and storing the processed titles in the corresponding dictionary; building the related vocabulary and recording the number of words; storing the subscripts of the associated titles in a list; randomly inspecting the list of title subscripts; loading the related images and resizing them; obtaining the number of images in the image training and test sets and the number of titles in the title training and test sets; and storing, in binary form, the vocabulary, the image training set, the image test set, the number of training titles, the number of test titles, the number of titles per image, the number of test images, the number of training images, the training subscript set and the test subscript set.
3. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S2, the definitions of the correct image, the wrong image, the correct title and the wrong title specifically comprise the name, type and size of the true and false images, and the name, type and size of the true and false titles.
4. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S3, during the forward training of the text-image mapping, the recurrent neural network extracts a feature vector from the title in the following steps:
(1) the title sequence is fed into an Embedding layer for processing, finally outputting a three-dimensional tensor containing the batch size and embedding-dimension information;
(2) the output of the Embedding layer is fed into the dynamic recurrent layer (Dynamic RNN layer) of the recurrent neural network, which processes it to obtain the final network output tensor;
(3) the loss function of the recurrent neural network is calculated.
5. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S4, the forward training of the generator proceeds as follows:
(1) the correct title is encoded with the recurrent neural network;
(2) after the correct title has been encoded, the corresponding noise is added and the generator produces a forged image; the generator update formulas are:
L_G ← log(S_f)    (1)
G ← G − α · ∂L_G/∂G    (2)
where S_f is the discrimination probability of the forged image paired with the real text vector, and α is a constant factor.
6. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S5, the discriminator is trained as follows:
(1) three types of input are fed to the discriminator: the forged-image vector with the real-title vector, the real-image vector with the forged-title vector, and the real-image vector with the real-title vector; training on these yields a trained discriminator;
(2) when updating the discriminator, the relevant formulas are:
L_D ← log(S_τ) + (log(1 − S_w) + log(1 − S_f)) / 2    (3)
D ← D − α · ∂L_D/∂D    (4)
where S_τ, S_w and S_f are, respectively, the discrimination probability of the real image with the real title, of the real image with the wrong text, and of the forged image with the real text;
in the discriminator, when matching the image vector against the title vector, the similarity of two probability distributions is measured with the KL-divergence formula:
D_KL(P ‖ Q) = Σ_x P(x) · log( P(x) / Q(x) )
where P and Q are the two probability distributions and P(x) and Q(x) are the corresponding probability densities; the KL divergence is non-negative and asymmetric.
7. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S6, the forward test training of the text-to-image generative adversarial model proceeds as follows: the real title is encoded and noise is added for training, yielding a trained generator; the loss functions of the generator and the discriminator are then obtained with the cross-entropy formula:
H(y, ŷ) = −Σ_i y_i · log(ŷ_i)
where y takes 3 values, namely the log probability of the discrimination of the forged image with the real-title vector, of the real image with the real title, and of the real image with the false title, and ŷ is the corresponding normalized log probability.
8. The text-to-image generation method based on a hybrid network model according to claim 1, wherein in step S7, the parameters are defined as follows: define the learning rate, the learning-rate decay, the number of decay rounds and the bias; obtain the names of the related variables; define the optimizers of the generator and the discriminator, and the optimizer of the recurrent neural network; and in step S8, the preparation before training is as follows: configure the session parameters with the ConfigProto function in TensorFlow and initialize the session with the global initialization function; load the latest checkpoint; obtain the title description sentences to be fed to the model and their subscripts; obtain the training batch size and the sample sentences and preprocess the sample sentences; update the learning rate; obtain the matched text, the correct images, the wrong titles, the wrong images and the corresponding noise.
CN201910923354.6A 2019-09-27 2019-09-27 Text-to-image generation method based on hybrid network model Active CN110751698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910923354.6A CN110751698B (en) 2019-09-27 2019-09-27 Text-to-image generation method based on hybrid network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910923354.6A CN110751698B (en) 2019-09-27 2019-09-27 Text-to-image generation method based on hybrid network model

Publications (2)

Publication Number Publication Date
CN110751698A CN110751698A (en) 2020-02-04
CN110751698B true CN110751698B (en) 2022-05-17

Family

ID=69277252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910923354.6A Active CN110751698B (en) 2019-09-27 2019-09-27 Text-to-image generation method based on hybrid network model

Country Status (1)

Country Link
CN (1) CN110751698B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339734B (en) * 2020-02-20 2023-06-30 青岛联合创智科技有限公司 Method for generating image based on text
CN111402365B (en) * 2020-03-17 2023-02-10 湖南大学 Method for generating picture from characters based on bidirectional architecture confrontation generation network
CN111860507B (en) * 2020-07-20 2022-09-20 中国科学院重庆绿色智能技术研究院 Compound image molecular structural formula extraction method based on counterstudy
CN111968193B (en) * 2020-07-28 2023-11-21 西安工程大学 Text image generation method based on StackGAN (secure gas network)
CN112215868B (en) * 2020-09-10 2023-12-26 湖北医药学院 Method for removing gesture image background based on generation of countermeasure network
CN114359423B (en) * 2020-10-13 2023-09-12 四川大学 Text generation face method based on deep countermeasure generation network
WO2022145525A1 (en) * 2020-12-29 2022-07-07 주식회사 디자이노블 Method and apparatus for generating design based on learned condition
CN112765316B (en) * 2021-01-19 2024-08-02 东南大学 Method and device for generating image by text introduced into capsule network
CN113052784B (en) * 2021-03-22 2024-03-08 大连理工大学 Image generation method based on multiple auxiliary information
CN113298895B (en) * 2021-06-18 2023-05-12 上海交通大学 Automatic encoding method and system for unsupervised bidirectional generation oriented to convergence guarantee
CN114021558B (en) * 2021-11-10 2022-05-10 北京航空航天大学杭州创新研究院 Intelligent evaluation method for consistency of graph and text meaning based on layering
CN114648681B (en) * 2022-05-20 2022-10-28 浪潮电子信息产业股份有限公司 Image generation method, device, equipment and medium
CN115018954B (en) * 2022-08-08 2022-10-28 中国科学院自动化研究所 Image generation method, device, electronic equipment and medium
CN115546848B (en) * 2022-10-26 2024-02-02 南京航空航天大学 Challenge generation network training method, cross-equipment palmprint recognition method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862377A (en) * 2017-11-14 2018-03-30 华南理工大学 A kind of packet convolution method that confrontation network model is generated based on text image
CN109003678A (en) * 2018-06-12 2018-12-14 清华大学 A kind of generation method and system emulating text case history
CN109543200A (en) * 2018-11-30 2019-03-29 腾讯科技(深圳)有限公司 A kind of text interpretation method and device
CN109584337A (en) * 2018-11-09 2019-04-05 暨南大学 A kind of image generating method generating confrontation network based on condition capsule
CN109871888A (en) * 2019-01-30 2019-06-11 中国地质大学(武汉) A kind of image generating method and system based on capsule network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862377A (en) * 2017-11-14 2018-03-30 华南理工大学 A kind of packet convolution method that confrontation network model is generated based on text image
CN109003678A (en) * 2018-06-12 2018-12-14 清华大学 A kind of generation method and system emulating text case history
CN109584337A (en) * 2018-11-09 2019-04-05 暨南大学 A kind of image generating method generating confrontation network based on condition capsule
CN109543200A (en) * 2018-11-30 2019-03-29 腾讯科技(深圳)有限公司 A kind of text interpretation method and device
CN109871888A (en) * 2019-01-30 2019-06-11 中国地质大学(武汉) A kind of image generating method and system based on capsule network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of deep learning applications in image recognition (深度学习在图像识别中的应用研究综述); Zheng Yuanpan et al.; Computer Engineering and Applications (计算机工程与应用); 2019-04-19; Vol. 55, No. 12; pp. 20-36 *

Also Published As

Publication number Publication date
CN110751698A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110751698B (en) Text-to-image generation method based on hybrid network model
CN108875807B (en) Image description method based on multiple attention and multiple scales
CN111061843B (en) Knowledge-graph-guided false news detection method
CN110427461B (en) Intelligent question and answer information processing method, electronic equipment and computer readable storage medium
CN108830287A (en) The Chinese image, semantic of Inception network integration multilayer GRU based on residual error connection describes method
CN106650789A (en) Image description generation method based on depth LSTM network
CN109413028A (en) SQL injection detection method based on convolutional neural networks algorithm
CN114120041B (en) Small sample classification method based on double-countermeasure variable self-encoder
CN111160452A (en) Multi-modal network rumor detection method based on pre-training language model
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN109711465A (en) Image method for generating captions based on MLL and ASCA-FR
CN111966812A (en) Automatic question answering method based on dynamic word vector and storage medium
CN113642621A (en) Zero sample image classification method based on generation countermeasure network
CN112784031B (en) Method and system for classifying customer service conversation texts based on small sample learning
CN111506709A (en) Entity linking method and device, electronic equipment and storage medium
CN110909174B (en) Knowledge graph-based method for improving entity link in simple question answering
CN109766918A (en) Conspicuousness object detecting method based on the fusion of multi-level contextual information
CN109101984B (en) Image identification method and device based on convolutional neural network
CN114332565A (en) Method for generating image by generating confrontation network text based on distribution estimation condition
CN117094325B (en) Named entity identification method in rice pest field
CN116822633B (en) Model reasoning method and device based on self-cognition and electronic equipment
CN108829675A (en) document representing method and device
CN115588487B (en) Medical image data set manufacturing method based on federal learning and antagonism network generation
CN113901820A (en) Chinese triplet extraction method based on BERT model
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20200204

Assignee: Shanxi Shiji Beibo Information Technology Co.,Ltd.

Assignor: Taiyuan University of Technology

Contract record no.: X2023140000006

Denomination of invention: A method of text-to-image generation based on hybrid network model

Granted publication date: 20220517

License type: Common License

Record date: 20230110

EE01 Entry into force of recordation of patent licensing contract