CN112765317A - Method and device for generating image by introducing text of class information - Google Patents
Method and device for generating image by introducing text of class information
- Publication number
- CN112765317A (application number CN202110071013.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- class information
- class
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4053—Super resolution, i.e. output image resolution higher than sensor resolution
Abstract
The invention discloses a method and a device for generating an image from a text with introduced class information. The method comprises a training stage and a testing stage. The training stage is based on a generative adversarial network and trains a generator and a discriminator using natural language texts describing images, the class labels of the texts, and the corresponding real images; in the testing stage, a text and its class label are fed to the generator to produce the corresponding image. The advantages of the invention are as follows: from the text encoding and the class-information encoding, text-level semantic image features and class-level image features are generated by transcoding respectively; the two levels of image features are then fused and decoded to generate an image. Introducing the corresponding class information during image generation strengthens the correlation between the generated image and the text, while the multi-stage generation process during training produces images of progressively higher resolution, reducing the difficulty of directly generating high-resolution images.
Description
Technical Field
The invention relates to the technical field of deep learning generative models, in particular to a method and a device for generating an image from a text with introduced class information.
Background
Text-to-image generation is an important problem with wide applications, such as computer-aided medicine and news photo generation.
Research on text-to-image generation is mainly based on two generative models: the Conditional Variational Auto-Encoder (CVAE) and the Conditional Generative Adversarial Network (CGAN). Pictures generated by CVAE methods often suffer from blurring, so the current mainstream methods are all based on CGAN models.
Because of the instability of GAN training, it is very difficult to generate high-resolution images directly from text descriptions. The stacked generative adversarial network (StackGAN) therefore proposed the strategy of first generating a low-resolution image from the text and then progressively generating higher-resolution images from it, which has been widely adopted in later work.
The problem of generating images from text can be divided into two sub-problems: 1) how to capture a semantic representation of the text — generally, a text encoder extracts the text semantic embedding and the visually relevant information in the text description; 2) how the generator uses the semantic representation from 1) to generate a vivid image that fits the text, such that humans could mistake it for a real related picture.
Conventional text-to-image networks use only the text information and ignore the class information of the text itself. Yet class information also helps generation: objects of the same class often share certain similarities, so introducing the class information of the text can help remedy the one-sidedness of a single text description and, in addition, tighten the correlation between the generated image and the text. For text-to-image methods, the Inception Score (IS) is widely used for evaluation. The IS evaluates the quality of the generated images by measuring the correlation between the generated image distribution and the real image distribution; the higher the IS value, the clearer and more easily recognizable the entities contained in the generated images.
Disclosure of Invention
Aiming at the defects of existing image generation methods, the invention provides a method and a device for generating an image from a text with introduced class information, which introduce the class information to which the text belongs during image generation, use the class information to constrain the relevance among pictures generated from texts of the same class, and address the problem that a single text description is not comprehensive.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A method for generating an image from a text with introduced class information, characterized in that the method comprises the following steps:
step 1, encoding a natural language text for describing an image to obtain text semantic embedded representation;
step 2, encoding the class label of the text to obtain class information semantic embedded representation;
step 3, mixing the text semantic embedded representation obtained in step 1 with random noise, reading the mixture with a recurrent neural network, and outputting the latent code of the text;
step 4, mixing the class-information semantic embedded representation obtained in step 2 with noise, and obtaining the latent code of the class information through variational inference;
step 5, respectively decoding the text hidden codes and the class information hidden codes obtained in the step 3 and the step 4 to obtain image characteristics of a text level and image characteristics of a class level;
step 6, decoding the image characteristics of the fusion text level and the image characteristics of the class level obtained in the step 5 to generate an image;
step 7, performing countermeasure training on the generated image obtained in the step 6 and the corresponding real image;
step 8, respectively up-sampling the image characteristics obtained in the step 5 to obtain image characteristics with different dimensions, and repeating the steps 6-7 to gradually generate images with higher resolution;
and 9, inputting the text and the class labels thereof in the testing stage, repeating the steps 1-6, and generating a high-resolution image in the image generator through multiple stages.
Further, in step 1, the method for encoding the natural language text describing the image is as follows: the text is segmented to obtain a word sequence p = (w1, w2, …, wd) of length d, where each word wi, i = 1…d, is represented by a pre-trained word vector, and the text is encoded using the obtained word vectors.
further, in the step 2, if each text-image data only belongs to one class, the class information is encoded in a one-bit efficient coding (one-hot) manner; if the text-image data term is multiple classes, the class information is encoded using a multi-bit-efficient coding (multi-hot) approach.
Further, in step 3, the recurrent neural network adopts a long short-term memory network.
Further, in step 3, the text semantic embedded representation and the noise are mixed by direct concatenation; the noise is Gaussian, z ~ N(0, I), and the result of mixing the text semantic embedding s with z is z_s = (s, z).
Furthermore, the mixing of class-information semantic embedding and noise in step 4 is variational inference, that is, a variational encoder infers the latent attribute distribution q(z_c | c, z) of the class information given the noise z and the class information c, and the class-information semantic embedding z_c is sampled from this distribution.
Further, in step 5, an up-sampling operation decodes the text latent code and the class-information latent code to obtain the image features.
Further, in step 6, the image feature h_c generated from the text and the image feature h_r generated from the class information are fused by point-wise multiplication, and the fused image feature can be denoted h1 = h_c ⊙ h_r; a convolutional neural network decodes the fused image feature to generate an image.
Further, the adversarial training method in step 7 is as follows: latent image representations of the generated image and the real image are obtained through a convolutional neural network; the corresponding text and class information are input at the same time; and scores for image realism, image-text matching, and image-class-information matching are output.
Further, in step 8, a staged image generation method generates progressively higher-resolution pictures. Taking two-stage generation as an example, the first stage generates a low-resolution picture from the fused image features; the second stage up-samples the text image feature h_c and the class-information image feature h_r obtained in the first stage to obtain higher-dimensional image and text features, and then generates a higher-resolution picture.
Further, in the two-stage image generation network, the resolution of the image generated in the first stage is 64 × 64 and that of the second stage is 128 × 128; the model may be stacked further.
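The stage-to-stage resolution doubling can be illustrated with a minimal numpy sketch (nearest-neighbour up-sampling is used here purely as a stand-in for the learned up-sampling network; the channel count of 32 is invented):

```python
import numpy as np

# stage-1 fused feature map: (channels, 64, 64); values are placeholders
h1 = np.random.default_rng(0).standard_normal((32, 64, 64))

# nearest-neighbour up-sampling doubles the spatial resolution for stage 2
h2 = h1.repeat(2, axis=1).repeat(2, axis=2)   # -> (32, 128, 128)
```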
Further, the input of the test stage in step 9 is text and its class labels, and the high-resolution image is generated in stages through the generator model obtained in the training stage.
A device for generating an image from a text with introduced class information, characterized in that the device comprises:
the text encoder is used for encoding the text describing the image to obtain text semantic embedded representation;
the class information encoder is used for encoding class information of a text describing an image to obtain semantic embedded representation of the class information;
the generator comprises a recurrent neural network transcoder, a variational inference transcoder, an image feature fusion device and an image decoder; the recurrent neural network transcoder reads the text semantic embedding and the transcoder's previous hidden state and outputs the corresponding text image features; the variational inference transcoder reads the input class-information semantic embedding and outputs the corresponding class-information image features; the image feature fusion device fuses the text image features and the class-information image features generated by the two transcoders; and the image decoder decodes the input fused image features to generate an image;
the discriminator comprises an image semantic discriminator, a text semantic discriminator and a class-information discriminator; the image semantic discriminator judges the correlation between the generated image and the real image, the text semantic discriminator judges the correlation between the generated image and the corresponding text, and the class-information discriminator judges the correlation between the generated image and the class information.
Compared with the prior art, the invention has the following beneficial effects:
the invention introduces extra class information in the process of generating the image by the text, such as class labels of birds and different object labels contained in the picture, the class information obtains hidden codes by a variation inference method, entity information behind the class labels can be fully mined, image features generated by the text and image features generated by the class information are fused in an image space, and the training difficulty is reduced.
Drawings
FIG. 1 is a flow chart of a generator method of the present invention.
FIG. 2 is a flow chart of a method of the discriminator of the present invention.
Detailed Description
The invention will be further elucidated with reference to the drawings and specific embodiments, it being understood that these examples are intended only to illustrate the invention and not to limit its scope. Various equivalent modifications of the invention falling within the scope of the appended claims will occur to persons skilled in the art upon reading this disclosure. A method for generating an image from a text with introduced class information, based on a conditional generative adversarial network, comprises the following steps, as shown in FIGS. 1-2:
step 1, constructing a text encoding unit comprising a text encoder and a recurrent neural network transcoder, as described in S1 in fig. 1. A natural language text is input into a text coder and an embedded representation of the text is output. The natural language text is an English text, a word sequence with the length d is obtained after stop words are removed, and each word is represented by a pre-trained word vector.
For example, for the input natural language text "This little bird is mostly blue with black primaries and secondaries", removing stop words yields the final word sequence [little, bird, mostly, blue, black, primaries, secondaries], d = 7. We set the maximum value of d to 18; shorter sequences are padded and longer ones are truncated.
The goal of the recurrent neural network transcoder is to extract high-level semantic features of the natural language text; this role is served by a pre-trained bidirectional long short-term memory network (Bi-LSTM). Given the input text sequence, the hidden state h_i output for each word serves as that word's word-level feature, and the time average of the output hidden states over all steps serves as the text semantic embedding, namely s = (h_1 + h_2 + … + h_d)/d.
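The time-averaging step can be sketched as follows (random matrices stand in for the real Bi-LSTM hidden states; the 256-dimensional hidden size is an invented example value):

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden_dim = 7, 256                  # 7 words; assumed hidden size per word

# h_1..h_d: word-level features (random stand-ins for Bi-LSTM hidden states)
H = rng.standard_normal((d, hidden_dim))

# sentence embedding s: time average of the hidden states over all steps
s = H.mean(axis=0)
```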
This is only one preferred mode of the text encoding unit; other reasonable encoding modes may also be adopted.
Step 2, construct a class-information encoding unit comprising a class-information encoder and a variational inference encoder, as described in S2 in FIG. 1. The class-information encoder takes the class information corresponding to the text as input and outputs an embedded representation of the class information. The class information of a text has two cases, single-class and multi-class. If each text has only one class label, the class information is encoded in one-hot form: for example, "This little bird is mostly blue with black primaries and secondaries" has only one class label, representing the type of bird the text describes; with 20 different bird classes in the dataset, the class information is encoded as a 20-dimensional one-hot vector [1, 0, …, 0, 0]. If the text has multiple class attributes, it is encoded in multi-hot form; for example, the encoding [1, 0, 0, 1, 0] indicates that the text carries the class labels of the first and fourth classes.
After the class information is encoded into a class vector c, it is converted into the class-information embedding by variational inference. Conditioned on the class vector c and Gaussian noise z, the encoder performs posterior inference of the latent variable z_c. We assume the posterior distribution q(z_c | c, z) of the latent variable to be a multivariate diagonal Gaussian and use a three-layer linear neural network for inference, where the mean and variance of the latent distribution are learned by the encoder; more complex encoding schemes based on the distribution of classes in the data may also be used.
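A minimal numpy sketch of this variational encoder follows; the weights are random and untrained (the real encoder's weights are learned), and all layer sizes except the class count are invented for the example. The reparameterisation trick produces a sample z_c from the diagonal Gaussian q(z_c | c, z):

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, noise_dim, latent_dim, hid = 20, 100, 128, 128

c = np.zeros(num_classes, dtype=np.float64); c[4] = 1.0  # one-hot class vector
z = rng.standard_normal(noise_dim)                        # Gaussian noise

def relu(x):
    return np.maximum(x, 0.0)

# three-layer linear network with ReLU between layers (random weights here)
x = np.concatenate([c, z])
W1 = rng.standard_normal((x.size, hid)) * 0.05
W2 = rng.standard_normal((hid, hid)) * 0.05
W3 = rng.standard_normal((hid, 2 * latent_dim)) * 0.05
out = relu(relu(x @ W1) @ W2) @ W3

mu, logvar = out[:latent_dim], out[latent_dim:]  # mean and log-variance

# reparameterisation: sample z_c ~ q(z_c | c, z) = N(mu, diag(sigma^2))
eps = rng.standard_normal(latent_dim)
z_c = mu + np.exp(0.5 * logvar) * eps

# KL(q || N(0, I)) regulariser, later added to the generator loss
kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
```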
Step 3, construct the text-information and class-information fusion module, as described in S3 in FIG. 1. The module fuses the image features generated from the text with those generated from the class information; it consists of up-sampling networks that extract image features, together with the combination of the text image features and the class-information image features. For the text information, the up-sampling network takes the joint vector z_s of the text semantic embedding s and the noise as input and up-samples it to the image feature h_s of the corresponding dimensionality; for the class information, the up-sampling network takes z_c, sampled from the posterior latent distribution predicted by the class-information encoder, and obtains the class-information image feature h_c of the same dimensionality; finally, point-wise multiplication of the text image features and the class-information image features yields the fused image feature h.
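The point-wise fusion of the two feature maps can be sketched as follows (random arrays stand in for the up-sampled features h_s and h_c; the feature-map shape is an invented example):

```python
import numpy as np

rng = np.random.default_rng(0)
h_s = rng.standard_normal((32, 4, 4))  # text image features (channels, H, W)
h_c = rng.standard_normal((32, 4, 4))  # class-information image features

# point-wise (Hadamard) fusion: h = h_s ⊙ h_c
h = h_s * h_c
```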
Step 4, construct the conditional generative adversarial network, consisting of a generator and a discriminator, as described in S4 in FIGS. 1 and 2. The generator is composed of convolutional neural networks; the discriminator is composed of an image discriminator, a text semantic discriminator and a class-information discriminator. The generator decodes with scale-invariant convolution layers, converting the fused image features into the final generated image; the image discriminator scores the realism of the generated image, the text semantic discriminator evaluates the relation between the generated image and the original text, and the class-information discriminator scores the match between the generated image and its class information.
Step 5, respectively inputting the natural language text describing the image and the corresponding class information into a text encoder and a class information encoder to obtain text semantic embedded representation and class information embedded representation;
Step 6, feed the generated text semantic embedded representation and class-information embedded representation into the text-information and class-information fusion module to obtain the image features fusing the two kinds of information;
Step 7, input the fused image features into the image generator to generate a lower-resolution picture, the set resolution being 64 × 64; input the corresponding real pictures, natural language texts and class information into the discriminator for adversarial training. The loss functions of the generator and the discriminator during adversarial training are as follows:
L_D = -(E_{x~P}[log D(x)_r] + E_{x~Q}[log(1 - D(x)_r)])
      -(E_{x~P}[log D(x)_c] + E_{x~Q}[log(1 - D(x)_c)])
      -(E_{x~P}[log D(x, s)] + E_{x~Q}[log(1 - D(x, s))])
where P is the real data distribution, Q is the generated data distribution, D(x)_r denotes the probability that the image x is real, D(x)_c denotes the probability that the generated image belongs to the correct class label, and D(x, s) denotes the probability of a match between the generated image and the descriptive text.
Two KL divergence terms are added to the generator loss as regularization constraining the two latent variables z_c and z_s. During training, the discriminator D is first optimized by the loss L_D with the generator fixed; then, with the discriminator fixed, the generator G is optimized by the loss L_G. The two steps are trained alternately by mini-batch stochastic gradient descent.
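The structure of the three-headed discriminator loss and the KL regularizer can be sketched in numpy; the discriminator scores below are invented toy values (in a real run they come from the three discriminator heads), so only the shape of the computation is meaningful:

```python
import numpy as np

def adv_term(d_real, d_fake):
    # -(E_{x~P}[log D(x)] + E_{x~Q}[log(1 - D(x))])
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())

# toy discriminator outputs on a real batch (P) and a generated batch (Q)
real_r, fake_r = np.array([0.9, 0.8]), np.array([0.2, 0.1])    # realism head
real_c, fake_c = np.array([0.7, 0.9]), np.array([0.3, 0.2])    # class head
real_s, fake_s = np.array([0.8, 0.7]), np.array([0.25, 0.15])  # text head

# L_D sums the three adversarial terms, mirroring the formula above
L_D = (adv_term(real_r, fake_r)
       + adv_term(real_c, fake_c)
       + adv_term(real_s, fake_s))

def kl_to_standard_normal(mu, logvar):
    # KL(N(mu, sigma^2) || N(0, I)) regulariser on the latents z_c and z_s
    return 0.5 * np.sum(mu ** 2 + np.exp(logvar) - logvar - 1.0)
```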
Step 8, up-sample the text image features and the class-information image features generated in the first stage to 128 × 128 dimensionality respectively, and repeat the adversarial training process of step 7 at the higher dimension to generate higher-resolution images.
When the network is trained, Normalization techniques such as Batch Normalization and Spectral Normalization can be added into the generator and the discriminator to stabilize the training, and the generation quality is further improved.
In summary, compared with previous methods, the method for generating an image from a text with introduced class information disclosed by the invention adds modules for encoding the class information and for fusing the class information with the text information. The method introduces the class label of the text, constrains the class of the generated image in the discriminator, and improves the correlation between the generated image and the text by introducing class information.
In the experiments, the baseline model is based on StackGAN; the dimensions of the latent variable and the noise variable are both set to 128; in each iteration of adversarial training, the discriminator is trained once and the generator is trained once; the network is trained using the Adam solver, with β1 = 0.5, β2 = 0.999, and learning rate α = 0.0002.
The IS improves from 3.35 ± 0.02 to 3.74 ± 0.03 on the CUB dataset and from 7.34 ± 0.17 to 7.46 ± 0.30 on the COCO dataset; both the image generation quality and the clarity of entities in the generated images exceed those of the baseline model.
The above examples are only preferred embodiments of the present invention, and the embodiments of the invention are not limited thereto. Any modification, alteration, combination, simplification or equivalent substitution that does not depart from the spirit of the invention falls within the scope of protection of the invention.
Claims (13)
1. A method for generating an image from a text with introduced class information, characterized in that the method comprises the following steps:
step 1, encoding a natural language text for describing an image to obtain text semantic embedded representation;
step 2, encoding the class label of the text to obtain class information semantic embedded representation;
step 3, mixing the text semantic embedded representation obtained in step 1 with random noise, reading the mixture with a recurrent neural network, and outputting the latent code of the text;
step 4, mixing the class-information semantic embedded representation obtained in step 2 with noise, and obtaining the latent code of the class information through variational inference;
step 5, respectively decoding the text hidden codes and the class information hidden codes obtained in the step 3 and the step 4 to obtain image characteristics of a text level and image characteristics of a class level;
step 6, decoding the image characteristics of the fusion text level and the image characteristics of the class level obtained in the step 5 to generate an image;
step 7, performing countermeasure training on the generated image obtained in the step 6 and the corresponding real image;
step 8, respectively up-sampling the image characteristics obtained in the step 5 to obtain image characteristics with different dimensions, and repeating the steps 6-7 to gradually generate images with higher resolution;
and 9, inputting the text and the class labels thereof in the testing stage, repeating the steps 1-6, and generating a high-resolution image in the image generator through multiple stages.
2. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 1, the natural language text describing the image is encoded as follows: the text is segmented to obtain a word sequence p = (w1, w2, …, wd) of length d, where each word wi, i = 1…d, is represented by a pre-trained word vector, and the text is encoded using the obtained word vectors.
3. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 2, if each text-image datum belongs to only one class, the class information is encoded in one-hot form; if it belongs to multiple classes, the class information is encoded in multi-hot form.
4. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 3, the recurrent neural network adopts a bidirectional long short-term memory network.
5. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 3, the text semantic embedded representation and the noise are mixed by direct concatenation; the noise is Gaussian, z ~ N(0, I), and the result of mixing the text semantic embedding s with z is z_s = (s, z).
6. The method for generating an image from a text with introduced class information according to claim 1, wherein: the mixing of class-information semantic embedding and noise in step 4 is variational inference, that is, the variational encoder infers the latent attribute distribution q(z_c | c, z) of the class information given the noise z and the class information c, and the class-information semantic embedding z_c is sampled from this distribution.
7. The method for generating an image from a text with introduced class information according to claim 1, wherein: in step 5, an up-sampling operation decodes the text latent code and the class-information latent code to obtain the image features.
8. The method for generating an image from a text with introduced class information according to claim 1, wherein, in step 6, the image feature h_c generated from the text and the image feature h_r generated from the class information are fused by point-wise multiplication, the fused image feature being expressible as h1 = h_c ⊙ h_r; a convolutional neural network decodes the fused image feature to generate an image.
9. The method for generating an image from text with introduced class information as claimed in claim 1, wherein the adversarial training method in step 7 is: latent image representations of the generated image and of the real image are obtained through a convolutional neural network; the corresponding text and class information are input at the same time, and scores for image realism, image-text matching degree, and image-class-information matching degree are output.
10. The method for generating an image from text with introduced class information as claimed in claim 1, wherein step 8 employs a staged image generation method to generate higher-resolution pictures step by step. For example, in two-stage image generation, the first stage generates a lower-resolution picture from the fused image features; the second stage upsamples the text image feature h_s and the class-information image feature h_c obtained in the first stage to obtain higher-dimensional image and text features, and then generates a higher-resolution picture.
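The two-stage flow can be sketched as follows (illustrative only, not part of the claims; a channel mean stands in for the convolutional decoders, and the 64×64 / 128×128 sizes follow claim 11):

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def two_stage_generate(h_fused):
    """Stage 1 decodes the fused features at low resolution; stage 2
    upsamples the features and decodes again at double the resolution.
    (A channel mean stands in for the convolutional decoders.)"""
    low_res = h_fused.mean(axis=0)               # first stage, e.g. 64x64
    high_res = upsample2x(h_fused).mean(axis=0)  # second stage, e.g. 128x128
    return low_res, high_res
```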
11. The method and apparatus as claimed in claim 10, wherein the model is stacked into a two-stage image generation network, the resolution of the image generated in the first stage being 64×64 and that of the image generated in the second stage being 128×128.
12. The method for generating an image from text with introduced class information as claimed in claim 1, wherein the input of the test stage in step 9 is a text and its class label, and a high-resolution image is generated in stages by the generator model obtained in the training stage.
13. A device for generating an image from text with introduced class information, characterized in that the device comprises:
the text encoder is used for encoding the text describing the image to obtain text semantic embedded representation;
the class information encoder is used for encoding class information of a text describing an image to obtain semantic embedded representation of the class information;
the generator comprises a recurrent neural network transcoder, a variational inference transcoder, an image feature fuser, and an image decoder; the recurrent neural network transcoder reads the text semantic embedding and the transcoder's hidden state from the previous step, and outputs the corresponding text image feature; the variational inference transcoder reads the input class-information semantic embedding and outputs the corresponding class-information image feature; the image feature fuser fuses the text image features and class-information image features generated by the recurrent neural network transcoder and the variational inference transcoder; the image decoder decodes the input fused image features to generate an image;
the discriminator comprises an image semantic discriminator, a text semantic discriminator and a class information discriminator, and the image semantic discriminator judges the correlation between the generated image and the real image; the text semantic discriminator judges the correlation between the generated image and the corresponding text; the class information discriminator judges the correlation between the generated image and the class information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110071013.8A CN112765317A (en) | 2021-01-19 | 2021-01-19 | Method and device for generating image by introducing text of class information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112765317A true CN112765317A (en) | 2021-05-07 |
Family
ID=75703285
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110071013.8A Pending CN112765317A (en) | 2021-01-19 | 2021-01-19 | Method and device for generating image by introducing text of class information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112765317A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543159A (en) * | 2018-11-12 | 2019-03-29 | 南京德磐信息科技有限公司 | A kind of text generation image method and device |
US20190380657A1 (en) * | 2015-10-23 | 2019-12-19 | Siemens Medical Solutions Usa, Inc. | Generating natural language representations of mental content from functional brain images |
CN111968193A (en) * | 2020-07-28 | 2020-11-20 | 西安工程大学 | Text image generation method based on StackGAN network |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254694A (en) * | 2021-05-21 | 2021-08-13 | 中国科学技术大学 | Text-to-image method and device |
CN113254694B (en) * | 2021-05-21 | 2022-07-15 | 中国科学技术大学 | Text-to-image method and device |
WO2023030348A1 (en) * | 2021-08-31 | 2023-03-09 | 北京字跳网络技术有限公司 | Image generation method and apparatus, and device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543159B (en) | Text image generation method and device | |
CN110795556B (en) | Abstract generation method based on fine-grained plug-in decoding | |
JP5128629B2 (en) | Part-of-speech tagging system, part-of-speech tagging model training apparatus and method | |
CN111897908A (en) | Event extraction method and system fusing dependency information and pre-training language model | |
CN112765316A (en) | Method and device for generating image by introducing text of capsule network | |
CN111444367B (en) | Image title generation method based on global and local attention mechanism | |
CN111078866B (en) | Chinese text abstract generation method based on sequence-to-sequence model | |
CN113283244B (en) | Pre-training model-based bidding data named entity identification method | |
CN110032638B (en) | Encoder-decoder-based generative abstract extraction method | |
CN112765317A (en) | Method and device for generating image by introducing text of class information | |
CN110390049B (en) | Automatic answer generation method for software development questions | |
CN113140020B (en) | Method for generating image based on text of countermeasure network generated by accompanying supervision | |
CN111402365B (en) | Method for generating picture from characters based on bidirectional architecture confrontation generation network | |
CN113961736A (en) | Method and device for generating image by text, computer equipment and storage medium | |
CN107463928A (en) | Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM | |
CN114529903A (en) | Text refinement network | |
CN113140023A (en) | Text-to-image generation method and system based on space attention | |
US20220300708A1 (en) | Method and device for presenting prompt information and storage medium | |
Bie et al. | RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model | |
CN110750669B (en) | Method and system for generating image captions | |
CN117034951A (en) | Digital person with specific language style based on large language model | |
CN117093864A (en) | Text generation model training method and device | |
CN116521857A (en) | Method and device for abstracting multi-text answer abstract of question driven abstraction based on graphic enhancement | |
CN115496134A (en) | Traffic scene video description generation method and device based on multi-modal feature fusion | |
CN115331073A (en) | Image self-supervision learning method based on TransUnnet architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||