CN112765316A - Method and device for text-to-image generation introducing a capsule network - Google Patents

Method and device for text-to-image generation introducing a capsule network

Info

Publication number
CN112765316A
CN112765316A (application CN202110069525.0A)
Authority
CN
China
Prior art keywords
text
image
class information
capsule
hidden
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110069525.0A
Other languages
Chinese (zh)
Inventor
周德宇 (Zhou Deyu)
孙凯 (Sun Kai)
胡名起 (Hu Mingqi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110069525.0A priority Critical patent/CN112765316A/en
Publication of CN112765316A publication Critical patent/CN112765316A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/5866Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a method and a device for text-to-image generation introducing a capsule network, comprising a training stage and a testing stage. In the training stage, an image generation model introducing a capsule network is trained on text descriptions, class labels and real images; the model comprises a multi-stage image generator and an image discriminator that scores the generated images. In the testing stage, a text and its class label are input and the image generator produces the corresponding image. The capsule network is introduced into the text-to-image generation process so that the entity information in the natural language text and in the corresponding class labels is learned jointly, strengthening the correlation between the generated image and the text. The beneficial effects of the invention are: the capsule network enhances the learning of entity information; the text information and the class information are fused in a lower-dimensional hidden space, reducing the number of training parameters and strengthening their interaction; and the multi-stage generation process reduces the difficulty of directly generating a high-resolution image during training.

Description

Method and device for text-to-image generation introducing a capsule network
Technical Field
The invention relates to the technical field of deep learning generative models, in particular to a method and a device for text-to-image generation introducing a capsule network.
Background
Text-to-image generation is an important problem with wide applications, such as computer-aided design and illustration generation.
Research on text-to-image generation is mainly based on two generative models: the Conditional Variational Auto-Encoder (CVAE) and the Conditional Generative Adversarial Network (CGAN). Pictures generated by CVAE-based methods often suffer from blurring, so the current mainstream methods are based on CGAN models.
Because of the instability of GAN training, it is very difficult to directly generate high-resolution images from text descriptions. The stacked generative adversarial network (StackGAN) therefore proposed the strategy of first generating a low-resolution image from the text and then progressively generating higher-resolution images from it, a strategy widely adopted in later work.
A plain sentence-level text embedding carries only sentence-level information and lacks the fine-grained detail needed to correspond to image regions. The attentional generative adversarial network (AttnGAN) maps specific words in the text to sub-regions of the generated picture through an attention mechanism, improving generation quality.
In conventional CGAN architectures, the initially input condition vector is typically mapped to an elongated three-dimensional initial image feature representation through a fully connected layer. However, the semantic space generally has a low dimension while the image feature space has a high dimension, and performing this dimension conversion directly with a fully connected layer may lose information.
The capsule network (Capsule Network) is built from capsules, a new type of neuron whose input and output are vectors: the activation vector represents the instantiation parameters of a specific type of entity, and its norm can represent the probability that the entity is present.
Conventional text-to-image networks use only the text information and ignore the class information attached to the text itself. Yet class information also helps generation: objects of the same class often share certain similarities, so introducing the class information of the text can compensate for the one-sidedness of a single text description while tightening the correlation between the generated image and the text.
For text-to-image generation, the Inception Score (IS) is a widely used evaluation index. The IS evaluates the quality of the generated images by relating the generated image distribution to the real image distribution; the higher the IS, the clearer and easier to identify the entities contained in the generated images.
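For reference, the IS is commonly computed from the label distribution p(y|x) that a pre-trained classifier assigns to each generated image x, as in the following standard formula (stated here for clarity; the patent text itself only describes the score qualitatively):

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\Big)$$

where p_g is the distribution of generated images and p(y) is the marginal label distribution over them; sharper and more recognizable entities raise the KL term and hence the score.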
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method and a device for text-to-image generation introducing a capsule network, which can incorporate the class information of the text during image generation, use that class information to constrain images generated from texts of the same class to be correlated, and thereby address the possible one-sidedness of a single text description.
In order to solve the above technical problem, the invention adopts the following technical scheme: a text-to-image generation method and device introducing a capsule network, comprising the following steps:
step 1, encoding a natural language text for describing an image to obtain text semantic embedded representation;
step 2, encoding the class label of the text to obtain class information semantic embedded representation;
step 3, mixing the text semantic embedded representation obtained in step 1 with random noise, reading the mixture with a recurrent neural network, and outputting an overall hidden code of the text and a hidden code for each word in the text;
step 4, mixing the class information semantic embedded representation obtained in step 2 with noise, and obtaining an overall hidden code of the class information through variational inference;
step 5, fusing the text hidden codes and the class information hidden codes obtained in the step 3 and the step 4 to obtain fused hidden codes containing the text information and the class information;
step 6, transcoding the fused hidden codes obtained in the step 5 by utilizing a capsule network to obtain image characteristics;
step 7, decoding the image characteristics obtained in the step 6 and outputting an image with a target size;
step 8, carrying out adversarial training between the generated image and the corresponding real image;
step 9, fusing the image features obtained in step 6 with the hidden code of each word in the text obtained in step 3 using an attention model, taking the fused image features as the input of the next stage, and repeating steps 6-8 to progressively generate higher-resolution images;
Step 10, inputting the natural language text and the labels thereof in the testing stage, and generating corresponding images in stages according to the steps 1-7.
Further, the capsule network in step 6 is composed of capsule neurons whose inputs and outputs are vectors, and the norm of the activation vector can represent the probability that an entity is present. Exploiting the capsule network's ability to represent entities, the joint coding of the text and class information is transcoded in the generator, an image decoder then produces an image of the corresponding dimension, and the class information is evaluated in the discriminator, strengthening the discriminator's recognition of entity information in the generated image.
Further, the method for encoding the natural language text describing the image in step 1 is as follows: the natural language text is segmented into a word sequence of length d, p = (w_1, w_2, ..., w_d), wherein each word w_i, i = 1, ..., d, is represented by a pre-trained word vector, and the text is encoded with the obtained word vectors;
further, if each text-image data only belongs to one class in step 2, the class information is encoded in a one-bit efficient coding (one-hot) manner; if the text-image data belongs to multiple classes, the class information is encoded using a multi-bit efficient coding (multi-hot) scheme.
Further, the recurrent neural network in step 3 is a bidirectional long short-term memory network, which reads the text semantic embedding and the hidden state of the previous step and outputs a hidden code at each step; the hidden code of each step serves as the word-level feature s_i of the corresponding word, and the hidden code s of the last step serves as the sentence-level feature, i.e., the overall hidden code of the text.
Further, the text semantic embedded representation and the noise in step 3 are mixed by direct concatenation, the noise being Gaussian noise z ~ N(0, I); the mixture of the text semantic embedding s and z is z_s = [s; z]. The mixing of the class information semantic embedding with noise in step 4 is variational inference: given noise ε ~ N(0, I) and class information c, a variational encoder estimates the hidden attribute distribution q(z_c | c) of the class information, and the class information embedded representation z_c is sampled from this distribution.
Further, in step 5 the text hidden code and the class information hidden code are fused by direct concatenation to obtain the fused hidden code z = (z_s, z_c).
Further, in the step 6, the capsule network is used for transcoding the fused hidden codes containing the class information into the initial image representation, and then the upsampling network is used for obtaining the image features of the corresponding dimension.
Further, the adversarial training method in step 8 is as follows: implicit image representations of the generated image and of the real image are obtained through a convolutional neural network; the corresponding text and class information are input at the same time, and scores are output for the realism of the image, the matching degree between image and text, and the matching degree between image and class information.
Furthermore, the class information is judged by a classification capsule layer constructed from the capsule network, and the norm of the output vector is normalized for judging the class information, which can improve the classification of entities in the generated image.
Further, in step 9 a staged image generation method progressively produces higher-resolution pictures. Taking two-stage generation as an example: in the first stage, a low-resolution picture is generated from the fused hidden code of the text and class information; in the second stage, an attention mechanism computes attention scores between each sub-region of the image feature h obtained in the first stage and each word of the word-level text coding obtained in step 3, the results are fused into a high-dimensional text-image mixed feature that is input to the high-resolution image generator as the fused hidden code, and steps 7-8 are repeated. This multi-stage scheme generates the high-resolution image while reducing the difficulty of generation.
Further, in the staged image generation network, the resolution of the images generated in the first stage is 64×64, that of the second stage is 128×128, and that of the third stage is 256×256; the model may continue to be stacked in this way.
Further, the input of the testing stage in the step 10 is text and its class labels, and the high-resolution image is generated in stages through the generator model obtained in the training stage.
A text-to-image generation device introducing a capsule network, characterized in that the device comprises:
the text encoder is used for encoding the text describing the image to obtain text semantic embedded representation;
the class information encoder is used for encoding class information of a text describing an image to obtain semantic embedded representation of the class information;
the generator, comprising a recurrent neural network transcoder, a variational inference transcoder, a capsule network transcoder and an image decoder, wherein the recurrent neural network transcoder reads the semantic embedding of the text and the transcoder's previous hidden state and outputs the corresponding text hidden code; the variational inference transcoder reads the semantic embedding of the input class information and outputs the corresponding class information hidden code; the capsule network transcoder fuses the text information and class information hidden codes and transcodes them into image features; and the image decoder decodes the image features to generate an image;
the discriminator, comprising an image semantic discriminator, a text semantic discriminator and a class information discriminator introducing a capsule network, wherein the image semantic discriminator judges the correlation between the generated image and the real image; the text semantic discriminator judges the correlation between the generated image and the corresponding text; and the class information discriminator replaces the fully connected layer of the original discriminator with a capsule network, better judging the correlation between the generated image and the class information.
The invention constructs a CGAN-based text-to-image model introducing class information. The generation process adopts a recurrent neural network, a capsule network, variational inference and an attention mechanism: information is extracted from the text and from the class label respectively to produce a text semantic embedding and a class information semantic embedding, and the two hidden representations are fused in the semantic space to obtain the fused hidden code used as input. In the generator, the fused hidden code is converted into image features by the capsule network and an image decoder then generates the image; the generated image and the real image undergo adversarial training in the discriminator.
Compared with the prior art, the invention has the following beneficial effects:
the method introduces additional class information in the process of generating the image by the text, constrains the similar image in training, enhances the correlation between the generated image and the text, obtains the hidden code of the class information by a variation inference method, fully excavates the entity information behind the class label, fuses the class information and the text information in a semantic space with lower dimensionality, enhances the interaction between the class information and the text information, reduces the parameter amount of the training, transcodes the fused hidden code through a capsule network, better learns the entity information in the text and the class label, scores the class information by using the capsule network in a discriminator, enhances the identification of the entity information in the generated image, gradually generates the high-resolution image by a staged generation method, and reduces the difficulty of generating the high-resolution image.
Drawings
Fig. 1 is a flow chart of a generator method of the first stage of the present invention (generating a low resolution image).
FIG. 2 is a flow chart of a method of the discriminator of the present invention.
Fig. 3 is a flow chart of a generator method (generating a high resolution image) at a subsequent stage in the invention.
Detailed Description
The invention will be further elucidated with reference to the drawings and specific embodiments, it being understood that these examples are intended only to illustrate the invention, not to limit its scope; various equivalent modifications of the invention falling within the scope of the appended claims will occur to persons skilled in the art upon reading this disclosure. A text-to-image generation method introducing class information, based on conditional generative adversarial networks, comprises the following steps, as shown in fig. 1:
Step 1, constructing a text encoder that takes a natural language text as input and outputs an embedded representation of the text. The natural language text here is English; after stop words are removed, a word sequence of length d is obtained, and each word is represented by a pre-trained word vector.
For example, for the input natural language text "Colorful dishes holding meat, vegetables, fruit, and bread", removing stop words yields the word sequence [colorful, dishes, holding, meat, vegetables, fruit, bread], so d = 7. We set the maximum value of d to 18; shorter sequences are padded and longer ones are truncated.
The goal of the text encoder is to extract the high-level semantic features of the natural language text; this role is served by a pre-trained bidirectional long short-term memory network (Bi-LSTM). For the input text sequence, the hidden state h_i output at each word serves as that word's word-level feature s_i, and the temporal average of the hidden states over all steps is output as the semantic embedding of the text, i.e. s = (1/d) Σ_i h_i. This is only a preferred way of encoding the text; other reasonable encoding methods can also be used.
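As an illustration, a minimal PyTorch sketch of such an encoder is given below; the vocabulary size and dimensions are assumptions rather than values mandated by the patent, and a real system would copy pre-trained word vectors into the embedding table.

```python
# A minimal PyTorch sketch of the Bi-LSTM text encoder described above.
# Vocabulary size and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # one direction gets hidden_dim // 2 units so each s_i has hidden_dim total
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, tokens):            # tokens: (batch, d) word indices
        e = self.embed(tokens)            # (batch, d, emb_dim)
        word_feats, _ = self.bilstm(e)    # word-level features s_i
        sent_feat = word_feats.mean(1)    # temporal average -> sentence feature s
        return word_feats, sent_feat
```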
Step 2, constructing a class information encoder that takes the class information corresponding to the text as input and outputs an embedded representation of the class information. The class information of a text has two cases, single-class and multi-class. If each text has only one class label, the class information is encoded in one-hot form; for example, a sentence describing a bird carries a single class label indicating the bird's species, and with 20 different classes of birds in the data set the class information is encoded as a 20-dimensional one-hot vector [1,0,…,0,0]. If the text itself has multiple class attributes, the class information is encoded in multi-hot form; for example, the sentence "Colorful dishes holding meat, vegetables, fruit, and bread" carries the class labels Dishes, Meat, Vegetables, Fruit and Bread and is encoded as [1,1,1,1,1,0,…,0,0], indicating that the text carries the first to fifth class labels.
After the class information is encoded into a class vector c, it is converted into the class information embedding by variational inference. The encoder takes the class vector c and the noise data ε ~ N(0, I) as conditions and performs posterior inference of the hidden variable z_c. We assume the posterior distribution q(z_c | c) of the hidden variable follows a multivariate diagonal Gaussian whose mean and variance are learned by the encoder; a three-layer linear neural network is used here for the inference, although more complex encoding schemes based on the class distribution of the data could also be adopted.
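A hedged sketch of such a variational class encoder follows: a small MLP predicts the mean and log-variance of q(z_c | c) and z_c is drawn with the reparameterization trick. The layer sizes are illustrative assumptions.

```python
# A hedged sketch of the variational class-information encoder: a three-layer
# MLP (as in the embodiment) predicts the mean and log-variance of q(z_c | c),
# and z_c is sampled with the reparameterization trick. Sizes are assumptions.
import torch
import torch.nn as nn

class ClassEncoder(nn.Module):
    def __init__(self, num_classes=20, hidden=128, latent=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(num_classes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, latent)       # mean of q(z_c | c)
        self.logvar = nn.Linear(hidden, latent)   # log-variance of q(z_c | c)

    def forward(self, c):                 # c: (batch, num_classes) one/multi-hot
        h = self.trunk(c)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)        # epsilon ~ N(0, I)
        z_c = mu + eps * (0.5 * logvar).exp()   # reparameterized sample
        return z_c, mu, logvar
```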
Step 3, constructing the text information and class information fusion and transcoding module. The semantic embedding of the whole text obtained in step 1 serves as the text information code z_s, and the sample drawn from the class information distribution in step 2 serves as the class information code z_c; the two codes are fused in the hidden space by direct concatenation, giving the fused hidden code z = (z_s, z_c).
Step 4, constructing the conditional generative adversarial network. The generator consists of a capsule network and a convolutional neural network; the discriminator consists of an image encoder introducing a capsule network, an image discriminator, a text semantic discriminator and a class information discriminator. The generator replaces the usual fully connected layer of a CGAN model with a capsule network: the capsule is a new type of neuron whose activation vector has a norm that can represent the probability of an entity appearing in the picture, and class information is often related to entity information. The inputs and outputs of the capsule network are vectors; a capsule layer with 1024 capsule units is used, taking 16 input vectors of length 8 and outputting 1024 vectors of length 16, which are reshaped into a 1024×4×4 three-dimensional image feature h fed into the convolutional neural network. In the convolutional neural network, convolutional layers that keep the spatial scale unchanged decode the image, converting the fused image features into the final generated image. In the discriminator, the image encoder introducing a capsule network encodes the generated image and the real image, and the codes are input to three subsequent discriminator branches: the image discriminator scores the realism of the generated image, the text semantic discriminator evaluates the correlation between the generated image and the original text, and the class information discriminator builds a classification capsule layer from the capsule network and scores the class information matching degree of the generated image after normalizing the output vectors.
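The following sketch shows one plausible form of such a capsule transcoding layer, with squashing and dynamic routing by agreement in the style of Sabour et al.; the dimensions mirror this embodiment, but the routing details are an assumption, not the patent's prescribed algorithm.

```python
# One plausible form of the capsule transcoding layer: 16 input capsules of
# length 8 are routed to 1024 output capsules of length 16, whose values are
# reshaped into a 1024 x 4 x 4 feature map. Routing follows Sabour et al.'s
# dynamic routing by agreement (an assumption, not the patent's algorithm).
import torch
import torch.nn as nn

def squash(s, dim=-1, eps=1e-8):
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / (n2.sqrt() + eps)  # keeps norm in [0, 1)

class CapsuleTranscoder(nn.Module):
    def __init__(self, in_caps=16, in_dim=8, out_caps=1024, out_dim=16, iters=3):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(in_caps, out_caps, in_dim, out_dim))
        self.iters = iters

    def forward(self, u):                          # u: (batch, in_caps, in_dim)
        u_hat = torch.einsum('bip,iopq->bioq', u, self.W)  # predictions u_hat[j|i]
        b = torch.zeros(u.size(0), u.size(1), self.W.size(1), device=u.device)
        for _ in range(self.iters):                # routing by agreement
            c = b.softmax(dim=2)                   # coupling coefficients
            v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))  # (batch, out_caps, out_dim)
            b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)      # agreement update
        return v.reshape(u.size(0), 1024, 4, 4)    # 1024*16 values as a 4x4 map
```

With a 128-dimensional fused hidden code z, the call would be roughly caps(z.view(-1, 16, 8)), since 16 × 8 = 128.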
Step 5, respectively inputting the natural language text describing the image and the corresponding class information into a text encoder and a class information encoder to obtain text semantic embedded representation and class information embedded representation;
Step 6, feeding the text semantic embedded representation and the class information embedded representation into the text information and class information fusion module to obtain the fused hidden code containing both kinds of information;
Step 7, inputting the fused hidden code into the image generator to generate a lower-resolution picture, with the resolution set to 64×64, and inputting the corresponding real pictures, natural language texts and class information into the discriminator for adversarial training. The loss functions of the discriminator and the generator during adversarial training are respectively:
L_D = -(E_{x~P}[log D(x)_r] + E_{x~Q}[log(1 - D(x)_r)])
    - (E_{x~P}[log D(x)_c] + E_{x~Q}[log(1 - D(x)_c)])
    - (E_{x~P}[log D(x, s)] + E_{x~Q}[log(1 - D(x, s))])

L_G = -(E_{x~Q}[log D(x)_r] + E_{x~Q}[log D(x)_c] + E_{x~Q}[log D(x, s)]) + KL(q(z_s) || p(z_s)) + KL(q(z_c) || p(z_c))

In the formulas, P is the real data distribution and Q the generated data distribution; D(x)_r denotes the probability that the image x is real, D(x)_c the probability that the generated image belongs to the correct class label, and D(x, s) the probability that the generated image matches the descriptive text s.
Two KL divergence terms are added to the loss function of the generator as regularization losses constraining the two hidden variables z_c and z_s. During training, the discriminator D is first optimized by the loss L_D with the generator fixed, and the generator G is then optimized by the loss L_G with the discriminator fixed; the two steps alternate under mini-batch stochastic gradient descent. In each iteration of the adversarial training, the discriminator is trained once and the generator once. The network is trained with the Adam solver, where β1 = 0.5, β2 = 0.999 and the learning rate α = 0.0002.
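A compact sketch of this alternating schedule is shown below; G, D, loader, d_loss and g_loss are placeholders for the models, the data pipeline and the L_D / adversarial L_G terms above, none of which the patent names explicitly, and the generator is assumed to return the image together with the statistics used for the KL regularizer.

```python
# A compact sketch of the alternating schedule (one D step, then one G step,
# per iteration; Adam with beta1 = 0.5, beta2 = 0.999, lr = 0.0002). G, D,
# loader, d_loss and g_loss are assumed placeholders, not names from the patent.
import torch

def train(G, D, loader, d_loss, g_loss):
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
    for real_img, text_emb, class_vec in loader:
        # generator assumed to return the image plus the statistics of
        # q(z_c | c) needed for the KL regularizer
        fake_img, mu_c, logvar_c = G(text_emb, class_vec)

        # discriminator step (generator fixed): minimize L_D
        ld = d_loss(D, real_img, fake_img.detach(), text_emb, class_vec)
        opt_d.zero_grad(); ld.backward(); opt_d.step()

        # generator step (discriminator fixed): adversarial terms plus KL
        kl = -0.5 * (1 + logvar_c - mu_c.pow(2) - logvar_c.exp()).sum(1).mean()
        lg = g_loss(D, fake_img, text_emb, class_vec) + kl
        opt_g.zero_grad(); lg.backward(); opt_g.step()
```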
Step 8, splicing the image features h generated in the first stage with the word-level semantic features s_i obtained in step 3 through an attention network. The attention mechanism obtains image-text features by computing the correlation between each word and each sub-region of the image features; the text-image feature of the j-th sub-region is computed as

sen-img_j = Σ_i β_{j,i} s'_i

where s'_i is the feature representation obtained from s_i through a new activation function (a fine-grained fusion of the image features and the word-level text features), and β_{j,i}, given by the attention model, is the relevance of the j-th sub-region to the i-th word. We upsample the image-text features sen-img to 128×128 and repeat the adversarial training process of step 7 at this higher dimension.
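A sketch of this word-to-sub-region attention is given below; the projection proj that maps word features into the image channel space is an assumed component, and all dimensions are illustrative.

```python
# A sketch of the word-to-sub-region attention: beta[j, i] is the softmax-
# normalized affinity between sub-region j and word i, and each sub-region
# receives a weighted sum of projected word features. proj and all dimensions
# are illustrative assumptions.
import torch
import torch.nn as nn

def word_region_attention(h, s, proj):
    # h: (batch, C, H, W) stage-one image features; s: (batch, d, Ds) word features
    b, C, H, W = h.shape
    q = h.reshape(b, C, H * W).transpose(1, 2)          # sub-regions: (batch, HW, C)
    s_p = proj(s)                                       # s'_i: (batch, d, C)
    beta = torch.softmax(q @ s_p.transpose(1, 2), -1)   # (batch, HW, d), over words
    sen_img = beta @ s_p                                # text-image feature per region
    return sen_img.transpose(1, 2).reshape(b, C, H, W)

# e.g. proj = nn.Linear(256, 128) for 256-d word features and 128 image channels
```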
When training the network, normalization techniques such as Batch Normalization and Spectral Normalization can be added to the generator and the discriminator to stabilize training and further improve generation quality.
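For instance, a discriminator downsampling block might combine these normalizations as follows (an illustrative pattern, not the patent's specified architecture):

```python
# e.g. a discriminator downsampling block combining the two normalizations
# (an illustrative pattern, not the patent's specified architecture)
import torch.nn as nn

block = nn.Sequential(
    nn.utils.spectral_norm(nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1)),
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2, inplace=True),
)
```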
In the experiments, our model builds on a reference model based on AttnGAN; the dimensions of the hidden variable and the noise variable are set to 128. In the adversarial training the discriminator is trained once and the generator once per iteration; the network is trained with the Adam solver, where β1 = 0.5, β2 = 0.999 and the learning rate α = 0.0002.
On the CUB data set the IS improves from 4.36 ± 0.02 to 4.67 ± 0.05; both the image generation quality and the clarity of the entities in the generated images surpass the reference model. In the ablation experiment we compared introducing the capsule network in the generator and in the discriminator separately: the IS is 4.53 ± 0.05 with the capsule network only in the generator and 4.46 ± 0.02 with it only in the discriminator, demonstrating the effectiveness of the capsule network.
In summary, compared with previous methods, the text-to-image method introducing a capsule network disclosed by the invention adds a module that encodes the class information and fuses it with the text information: the class label of the text is introduced and the class of the generated image is constrained in the discriminator, improving the correlation between the generated image and the text. Text information and class information are fused in a low-dimensional hidden space, reducing the number of training parameters. The generator transcodes the hidden code containing class information with a capsule network, learning the entity information in the input better; the discriminator scores the class information of the generated image with a capsule network, strengthening the recognition of entity information in the generated image. The overall architecture adopts a multi-stage generation scheme with an attention mechanism, which reduces the difficulty of generating high-resolution images and strengthens the supervision of the text over the image generation process.
The above examples are only preferred embodiments of the present invention, and the embodiments of the invention are not limited to them; any modification, alteration, combination or simplification made without departing from the spirit of the invention is an equivalent substitution and falls within the scope of protection of the invention.

Claims (14)

1. A text-to-image generation method and device introducing a capsule network, characterized in that the method comprises the following steps:
step 1, encoding a natural language text for describing an image to obtain text semantic embedded representation;
step 2, encoding the class label of the text to obtain class information semantic embedded representation;
step 3, mixing the text semantic embedded representation obtained in step 1 with random noise, reading the mixture with a recurrent neural network, and outputting an overall hidden code of the text and a hidden code for each word in the text;
step 4, mixing the class information semantic embedded representation obtained in step 2 with noise, and obtaining an overall hidden code of the class information through variational inference;
step 5, fusing the text hidden codes and the class information hidden codes obtained in the step 3 and the step 4 to obtain fused hidden codes containing the text information and the class information;
step 6, transcoding the fused hidden codes obtained in the step 5 by utilizing a capsule network to obtain image characteristics;
step 7, decoding the image characteristics obtained in the step 6 and outputting an image with a target size;
step 8, carrying out adversarial training between the generated image and the corresponding real image;
step 9, fusing the image features obtained in step 6 with the hidden code of each word in the text obtained in step 3 using an attention model, taking the fused image features as the input of the next stage, and repeating steps 6-8 to progressively generate higher-resolution images;
Step 10, inputting the natural language text and the labels thereof in the testing stage, and generating corresponding images in stages according to the steps 1-7.
2. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: the capsule network in step 6 is composed of capsule neurons whose inputs and outputs are vectors, and the norm of the activation vector can represent the probability that an entity is present; exploiting the capsule network's ability to represent entities, the joint coding of the text and class information is transcoded in the generator, an image decoder then produces an image of the corresponding dimension, and the class information is evaluated in the discriminator, strengthening the discriminator's recognition of entity information in the generated image.
3. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: the method for encoding the natural language text describing the image in step 1 is as follows: the natural language text is segmented into a word sequence of length d, p = (w_1, w_2, ..., w_d), wherein each word w_i, i = 1, ..., d, is represented by a pre-trained word vector, and the text is encoded with the obtained word vectors.
4. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: if in step 2 each text-image data item belongs to only one class, the class information is encoded in a one-hot manner; if the text-image data belongs to multiple classes, the class information is encoded in a multi-hot manner.
5. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: the recurrent neural network in step 3 is a bidirectional long short-term memory network, which reads the text semantic embedding and the hidden state of the previous step and outputs a hidden code at each step; the hidden code of each step serves as the word-level feature s_i of the corresponding word, and the hidden code s of the last step serves as the sentence-level feature, i.e., the overall hidden code of the text.
6. The method and apparatus for generating images of text introduced into capsule networks as claimed in claim 1, wherein: in the step 3, the text semantic embedded expression and the noise are mixed in a direct connection mode, and the adopted noise is Gaussian noise
Figure FDA0002905266880000022
The mixed result of text semantic embedding s and z is
Figure FDA0002905266880000021
Semantic embedding of class information and noise in the step 4The mixing being a variational inference, i.e. the variational coder giving noise
Figure FDA0002905266880000023
And class information c, estimating the hidden attribute distribution of the class information
Figure FDA0002905266880000024
Semantically embedding a class information representation z sampled from the distributionc
7. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: in step 5 the text hidden code and the class information hidden code are fused by direct concatenation to obtain the fused hidden code z = (z_s, z_c).
8. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: in step 6 the capsule network transcodes the fused hidden code containing the class information into an initial image representation, and an upsampling network then produces image features of the corresponding dimension.
9. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: the adversarial training method in step 8 is as follows: implicit image representations of the generated image and of the real image are obtained through a convolutional neural network; the corresponding text and class information are input at the same time, and scores are output for the realism of the image, the matching degree between image and text, and the matching degree between image and class information.
10. The text-to-image generation method and device introducing a capsule network as claimed in claim 9, wherein: the class information is judged by a classification capsule layer constructed from the capsule network, and the norm of the output vector is normalized for judging the class information, which can improve the classification of entities in the generated image.
11. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: in step 9 a staged image generation method progressively produces higher-resolution pictures; taking two-stage generation as an example, in the first stage a low-resolution picture is generated from the fused hidden code of the text and class information; in the second stage an attention mechanism computes attention scores between each sub-region of the image feature h obtained in the first stage and each word of the word-level text coding obtained in step 3, the results are fused into a high-dimensional text-image mixed feature that is input to the high-resolution image generator as the fused hidden code, and steps 7-8 are repeated; this multi-stage scheme generates the high-resolution image while reducing the difficulty of generation.
12. The text-to-image generation method and device introducing a capsule network as claimed in claim 11, wherein: in the staged image generation network, the resolution of the images generated in the first stage is 64×64, that of the second stage is 128×128, and that of the third stage is 256×256; the model may continue to be stacked in this way.
13. The text-to-image generation method and device introducing a capsule network as claimed in claim 1, wherein: in step 10 the input of the testing stage is a text and its class labels, and the high-resolution image is generated in stages by the generator model obtained in the training stage.
14. A text-to-image generation device introducing a capsule network, characterized in that the device comprises:
the text encoder is used for encoding the text describing the image to obtain text semantic embedded representation;
the class information encoder is used for encoding class information of a text describing an image to obtain semantic embedded representation of the class information;
the generator, comprising a recurrent neural network transcoder, a variational inference transcoder, a capsule network transcoder and an image decoder, wherein the recurrent neural network transcoder reads the semantic embedding of the text and the transcoder's previous hidden state and outputs the corresponding text hidden code; the variational inference transcoder reads the semantic embedding of the input class information and outputs the corresponding class information hidden code; the capsule network transcoder fuses the text information and class information hidden codes and transcodes them into image features; and the image decoder decodes the image features to generate an image;
the discriminator, comprising an image semantic discriminator, a text semantic discriminator and a class information discriminator introducing a capsule network, wherein the image semantic discriminator judges the correlation between the generated image and the real image; the text semantic discriminator judges the correlation between the generated image and the text; and the class information discriminator replaces the fully connected layer of the original discriminator with a capsule network, better judging the correlation between the generated image and the class information.
CN202110069525.0A 2021-01-19 2021-01-19 Method and device for text-to-image generation introducing a capsule network Pending CN112765316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110069525.0A CN112765316A (en) 2021-01-19 2021-01-19 Method and device for text-to-image generation introducing a capsule network


Publications (1)

Publication Number Publication Date
CN112765316A true CN112765316A (en) 2021-05-07

Family

ID=75703187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110069525.0A Pending CN112765316A (en) Method and device for text-to-image generation introducing a capsule network

Country Status (1)

Country Link
CN (1) CN112765316A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830334A (en) * 2018-06-25 2018-11-16 江西师范大学 A kind of fine granularity target-recognition method based on confrontation type transfer learning
CN109299216A (en) * 2018-10-29 2019-02-01 山东师范大学 A kind of cross-module state Hash search method and system merging supervision message
CN109543159A (en) * 2018-11-12 2019-03-29 南京德磐信息科技有限公司 A kind of text generation image method and device
CN109671125A (en) * 2018-12-17 2019-04-23 电子科技大学 A kind of GAN network model that height merges and the method for realizing text generation image
CN110751698A (en) * 2019-09-27 2020-02-04 太原理工大学 Text-to-image generation method based on hybrid network model

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240115A (en) * 2021-06-08 2021-08-10 深圳数联天下智能科技有限公司 Training method for generating face change image model and related device
CN113240115B (en) * 2021-06-08 2023-06-06 深圳数联天下智能科技有限公司 Training method for generating face change image model and related device
CN113537487A (en) * 2021-06-25 2021-10-22 北京百度网讯科技有限公司 Model training method, picture generating method and device
CN113537487B (en) * 2021-06-25 2023-08-04 北京百度网讯科技有限公司 Model training method, picture generating method and device
CN113434918A (en) * 2021-06-28 2021-09-24 北京理工大学 Text-based three-dimensional voxel model generation method
CN113434918B (en) * 2021-06-28 2022-12-02 北京理工大学 Text-based three-dimensional voxel model generation method
WO2023030348A1 (en) * 2021-08-31 2023-03-09 北京字跳网络技术有限公司 Image generation method and apparatus, and device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination