CN109671125B - Highly-integrated GAN network device and method for realizing text image generation - Google Patents

Highly-integrated GAN network device and method for realizing text image generation

Info

Publication number
CN109671125B
Authority
CN
China
Prior art keywords
unit
convolution
network
generator
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811542578.4A
Other languages
Chinese (zh)
Other versions
CN109671125A (en)
Inventor
宋井宽 (Song Jingkuan)
陈岱渊 (Chen Daiyuan)
高联丽 (Gao Lianli)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201811542578.4A
Publication of CN109671125A
Application granted
Publication of CN109671125B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • G06T9/001 Model-based coding, e.g. wire frame
    • G06T9/002 Image coding using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the field of deep learning and discloses a highly integrated GAN network device and method for text image generation. It addresses the problems of prior-art approaches, namely small generated-image size, low image quality, and an unstable network training process, and effectively generates clear, high-quality semantic images from input text. The highly integrated GAN network device of the invention comprises: a text encoder, a condition augmentation module, a generator, and three independent discriminators. Based on this device, high-quality RGB images matching the text semantic information can be generated with only one generator and three independent discriminators. To further optimize the generator's network structure, the feature maps of different sizes produced by the intermediate network layers are fully exploited: in addition to the residual generation blocks of a residual network, the generator adopts a pyramid network structure to build semantically rich high-resolution 256×256 features from low-resolution 64×64 features.

Description

Highly-integrated GAN network device and method for realizing text image generation
Technical Field
The invention relates to the field of deep learning, in particular to a highly integrated GAN network device and method for text image generation.
Background
Although generating pictures from text has many real-life applications, such as image editing and cross-modal data generation, only a few studies have addressed this task so far. Early text-to-image methods used a single GAN as the basic network structure, so the generated images were small and of low quality; for example, GAN-INT-CLS [1] can only generate 64×64 images. To increase the image size, later methods train multiple GAN networks in stages, but these networks usually have complex structures and high computing-hardware requirements, making the training process complicated and time-consuming. For example, StackGAN [2], StackGAN++ [3], and AttnGAN [4] train two deep networks separately in two stages; they are not end-to-end networks, which increases complexity and makes the whole training process very unstable.
Reference:
[1] Reed, Scott, et al. 2016. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
[2] Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Huang, X.; Wang, X.; and Metaxas, D. 2017a. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242.
[3] Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. 2017b. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916.
[4] Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; and He, X. 2017. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint arXiv:1711.10485.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a highly integrated GAN network device and a method for text image generation that overcome the prior art's problems of small generated-image size, low image quality, and an unstable network training process, and that effectively generate clear, high-quality semantic images from input text.
The technical scheme adopted by the invention to solve this problem is as follows:
a highly converged GAN network device, comprising: the system comprises a text compiler, a condition adding module, a generator and three independent discriminators;
the text compiler is used for outputting compiled feature expressions to the texts input into the text compiler;
the condition increasing module is used for sampling a condition characteristic expression with a certain dimensionality from the compiled characteristic expression output by the text compiler, splicing the condition characteristic expression with noise in a channel dimensionality and inputting the spliced condition characteristic expression into a generator network;
the generator comprises a full-connection layer, seven sequentially-connected residual generation blocks connected with the full-connection layer, and three accumulated generation blocks which are sequentially connected and are correspondingly connected with the last three residual generation blocks one by one;
the full connection layer is used for performing feature dimension increasing on the features output by the condition increasing module and converting the shapes of the features into 4-dimensional features;
the residual generation block is used for generating features of different sizes;
the accumulation generating block is used for fusing the features with different sizes by adopting a pyramid network structure so as to generate RGB images with different sizes;
the three independent discriminators are connected with the three accumulation generating blocks of the generator in a one-to-one correspondence manner and are used for discriminating the quality of the RGB images with different sizes output by the generator and transmitting discrimination results back to the generator; the way of returning the discrimination result to the generator is as follows: the generated RGB images with different sizes are respectively input into corresponding independent discriminators, the quality of the images is discriminated through a loss function limited in the independent discriminators, the gradient of the images is calculated and then is transmitted to the whole generator network through backward propagation, and parameters of the independent discriminators and the whole generator network are updated.
As a further optimization, a perceptual loss function is imposed on the generator to improve the semantic consistency and diversity of the generated images.
As a further optimization, each discriminator is provided with a matching-pair loss function, which judges whether the generated image semantically matches the input text, and a local image loss function, which judges whether the generated image is locally realistic; the last discriminator is additionally provided with a class information loss function for classifying the generated images.
As a further optimization, the residual generation block comprises an upsampling block, two 3×3 convolution units, and an accumulator; the input of the upsampling block is connected to the output of the preceding residual generation block; the output of the upsampling block is connected to one input of the accumulator directly, and to the other input of the accumulator after passing through the two 3×3 convolution units in sequence.
As a further optimization, the accumulation generation block comprises a 1×1 convolution unit, an upsampling block, two 3×3 convolution units, and an accumulator; the input of the 1×1 convolution unit is connected to the output of the corresponding residual generation block; the output of the 1×1 convolution unit is connected to the input of the upsampling block; the output of the upsampling block and the output of the preceding accumulation generation block are connected to the two inputs of the accumulator; the output of the accumulator passes through one 3×3 convolution unit, which outputs a higher-dimensional feature, and that feature passes through the other 3×3 convolution unit, which outputs an RGB image.
As a further optimization, the three independent discriminators comprise: a first discriminator, a second discriminator, and a third discriminator.
As a further optimization, the first discriminator and the second discriminator each comprise: a multilayer convolution network unit, a first 4×4 convolution unit, a second 4×4 convolution unit, a first fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1×1 convolution unit; the input of the multilayer convolution network unit is connected to the RGB image output by the corresponding accumulation generation block; the output of the multilayer convolution network unit is connected to one input of the channel concatenation unit and, through the first 4×4 convolution unit, to the local image loss function; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes through the 1×1 convolution unit and the second 4×4 convolution unit in sequence and is connected to the matching-pair loss function.
As a further optimization, the third discriminator comprises: a multilayer convolution network unit, a first 4×4 convolution unit, a second 4×4 convolution unit, a third 4×4 convolution unit, a first fully connected layer, a second fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1×1 convolution unit; the input of the multilayer convolution network unit is connected to the RGB image output by the corresponding accumulation generation block; the output of the multilayer convolution network unit is connected to one input of the channel concatenation unit and, through the first 4×4 convolution unit, to the local image loss function; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes through the 1×1 convolution unit and the second 4×4 convolution unit in sequence and is connected to the matching-pair loss function; and the output of the 1×1 convolution unit is connected to the class information loss function through the third 4×4 convolution unit and the second fully connected layer.
In addition, the invention provides a method for generating text images based on the highly integrated GAN network device, comprising the following steps:
inputting the text into the trained text encoder and outputting the encoded feature representation;
using the condition augmentation module to sample a condition feature representation of a given dimensionality, splicing it with noise along the channel dimension, and inputting the result into the generator network;
in the generator network, raising the feature dimension through the fully connected layer, reshaping the feature into a 4-dimensional tensor, and inputting it into the seven consecutive residual generation blocks; inputting the features of different sizes output by the last three residual generation blocks into the corresponding accumulation generation blocks, which output RGB images of different sizes through their convolution operations;
inputting the generated RGB images of different sizes into the corresponding independent discriminators, judging the image quality through the loss functions imposed on the independent discriminators, computing the gradients and back-propagating them through the whole generator network, and updating the parameters of the independent discriminators and the generator network.
As a further optimization, judging the image quality through the loss functions imposed on the independent discriminators specifically comprises: judging whether the generated image semantically matches the input text through the matching-pair loss function imposed on each of the three independent discriminators, and judging whether the generated image is locally realistic through the local image loss function; in addition, for the last of the three independent discriminators, classifying the generated images through the imposed class information loss function.
The invention has the beneficial effects that:
1) Borrowing the feature-fusing pyramid network structure, the intermediate features generated inside the deep network are effectively utilized to produce high-quality image features that better match the text semantics.
2) The perceptual loss function is effectively utilized to optimize the structure of the GAN generator and enrich the semantic information of the image features.
3) Multiple discriminator loss functions are effectively utilized, namely the matching-pair loss function, the local image loss function, and the classification loss function, which optimize the structure of the GAN discriminators, improve their discrimination ability, and further improve the quality of the generated images.
4) The GAN network device structure provided by the invention stabilizes the training process and reduces the training time.
Drawings
FIG. 1 is a schematic diagram of the highly integrated GAN network structure in an embodiment of the present invention;
FIG. 2 is a schematic diagram of the structure of a residual generation block;
FIG. 3 is a schematic diagram of the structure of an accumulation generation block;
FIG. 4 is a schematic structural diagram of the discriminators.
Detailed Description
The invention aims to provide a highly integrated GAN network device and a method for text image generation that solve the prior art's problems of small generated-image size, low image quality, and an unstable network training process, and that effectively generate clear, high-quality semantic images from input text.
The core idea of the invention is as follows: to reduce the training cost as much as possible, a highly unified and structured GAN network device is designed that can still generate high-quality RGB images matching the text semantic information with only one generator and three independent discriminators. To further optimize the generator's network structure, the feature maps of different sizes generated by the intermediate network layers are fully utilized: in addition to the residual generation blocks of a residual network, the generator adopts a pyramid network structure to progressively generate semantically rich high-resolution 256×256 features from low-resolution 64×64 features.
Embodiment:
As shown in fig. 1, the highly integrated GAN network device in this embodiment comprises: a text encoder, a condition augmentation module, a generator, and three independent discriminators;
the text encoder is used for encoding the input text and outputting the encoded feature representation;
the condition augmentation module is used for sampling a condition feature representation of a given dimensionality from the encoded feature representation output by the text encoder, splicing it with noise along the channel dimension, and inputting the result into the generator network;
the generator comprises a fully connected layer, seven sequentially connected residual generation blocks following the fully connected layer, and three sequentially connected accumulation generation blocks connected one-to-one with the last three residual generation blocks;
the fully connected layer is used for raising the feature dimension of the output of the condition augmentation module and reshaping it into a 4-dimensional feature tensor;
the residual generation blocks are used for generating features of different sizes;
the accumulation generation blocks are used for fusing the features of different sizes with a pyramid network structure to generate RGB images of different sizes;
the three independent discriminators are connected one-to-one with the three accumulation generation blocks of the generator and are used for judging the quality of the RGB images of different sizes output by the generator and returning the judgment results to the generator.
In a specific implementation, the features of each size are first generated by the residual generation blocks. As shown in fig. 2, a residual generation block comprises an upsampling block, two 3×3 convolution units, and an accumulator; the input of the upsampling block is connected to the output of the preceding residual generation block; the output of the upsampling block is connected to one input of the accumulator directly, and to the other input of the accumulator after passing through the two 3×3 convolution units in sequence.
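To make the wiring concrete, the following is a minimal PyTorch sketch of a residual generation block. It is an illustrative sketch, not the patented implementation: the constant channel width, nearest-neighbor upsampling, and the BatchNorm/ReLU layers between the two 3×3 convolutions are assumptions the patent does not specify.

```python
import torch
import torch.nn as nn

class ResidualGenerationBlock(nn.Module):
    """Upsampling block feeding an accumulator directly and via two 3x3 convs."""
    def __init__(self, channels):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.convs = nn.Sequential(                      # the two 3x3 convolution units
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),                    # assumed, not specified
            nn.ReLU(inplace=True),                       # assumed, not specified
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        up = self.upsample(x)        # input comes from the preceding residual block
        return up + self.convs(up)   # the accumulator sums the two paths
```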
To enrich the feature expression at each size, the invention provides the accumulation generation block, which fuses features of different sizes with a pyramid network structure. As shown in fig. 3, an accumulation generation block comprises one 1×1 convolution unit, one upsampling block, two 3×3 convolution units, and one accumulator; the input of the 1×1 convolution unit is connected to the output of the corresponding residual generation block; the output of the 1×1 convolution unit is connected to the input of the upsampling block; the output of the upsampling block and the output of the preceding accumulation generation block are connected to the two inputs of the accumulator; the output of the accumulator passes through one 3×3 convolution unit, which outputs a higher-dimensional feature, and that feature passes through the other 3×3 convolution unit, which outputs an RGB image.
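A corresponding PyTorch sketch of an accumulation generation block follows. One reading note: for the two accumulator inputs to have matching spatial sizes at every stage, this sketch upsamples the previous stage's accumulated feature (standard pyramid fusion) rather than the 1×1-projected residual feature; this placement of the upsampling block, the channel widths, and the tanh output activation are all assumptions.

```python
import torch
import torch.nn as nn

class AccumulationGenerationBlock(nn.Module):
    """Fuses a residual-block feature with the previous accumulated feature
    and emits both a fused feature and an RGB image at this scale."""
    def __init__(self, res_channels, acc_channels):
        super().__init__()
        self.lateral = nn.Conv2d(res_channels, acc_channels, kernel_size=1)  # 1x1 unit
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Conv2d(acc_channels, acc_channels, kernel_size=3, padding=1)  # first 3x3 unit
        self.to_rgb = nn.Conv2d(acc_channels, 3, kernel_size=3, padding=1)           # second 3x3 unit

    def forward(self, res_feat, prev_feat=None):
        x = self.lateral(res_feat)             # project the residual-block feature
        if prev_feat is not None:              # the first stage has no predecessor
            x = x + self.upsample(prev_feat)   # accumulator: fuse with previous stage
        feat = self.fuse(x)                    # higher-dimensional fused feature
        rgb = torch.tanh(self.to_rgb(feat))    # RGB image in [-1, 1] (assumed range)
        return feat, rgb
```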
The structure of the discriminators is shown in fig. 4, where the dashed box marks the part unique to the third discriminator and the remaining parts are common to all three discriminators. Each discriminator comprises: a multilayer convolution network unit, a first 4×4 convolution unit, a second 4×4 convolution unit, a first fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1×1 convolution unit; the input of the multilayer convolution network unit is connected to the RGB image output by the corresponding accumulation generation block; the output of the multilayer convolution network unit is connected to one input of the channel concatenation unit and, through the first 4×4 convolution unit, to the local image loss function; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes through the 1×1 convolution unit and the second 4×4 convolution unit in sequence and is connected to the matching-pair loss function.
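The shared part of the three discriminators can be sketched in PyTorch as below. Only the wiring (backbone feature map to the local head; text feature through a fully connected layer, spatial replication, channel concatenation, then 1×1 and 4×4 convolutions to the matching head) follows the description; the strides, channel widths, LeakyReLU activations, and the 4×4 size of the final feature map are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, img_size=64, text_dim=1024, cond_dim=128, base_ch=64):
        super().__init__()
        layers, ch, size = [], 3, img_size
        while size > 4:  # multilayer convolution unit: downsample to a 4x4 map
            layers += [nn.Conv2d(ch, base_ch, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch, base_ch, size = base_ch, base_ch * 2, size // 2
        self.backbone = nn.Sequential(*layers)
        self.local_head = nn.Conv2d(ch, 1, kernel_size=4)  # first 4x4 unit -> local image loss
        self.text_fc = nn.Linear(text_dim, cond_dim)       # first fully connected layer
        self.joint = nn.Sequential(
            nn.Conv2d(ch + cond_dim, ch, kernel_size=1),   # 1x1 unit after concatenation
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.match_head = nn.Conv2d(ch, 1, kernel_size=4)  # second 4x4 unit -> matching-pair loss

    def forward(self, img, text_emb):
        feat = self.backbone(img)                          # B x C x 4 x 4 feature map
        local_logit = self.local_head(feat)                # local-realism score
        t = self.text_fc(text_emb)[:, :, None, None]
        t = t.expand(-1, -1, feat.size(2), feat.size(3))   # spatial replication unit
        joint = self.joint(torch.cat([feat, t], dim=1))    # channel concatenation unit
        match_logit = self.match_head(joint)               # text-image matching score
        return local_logit, match_logit
```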
During training we observed that the generated images varied little across different objects. To increase this variation, in this embodiment we impose the classification information loss function only on the discriminator that receives the large 256×256 images (the third discriminator). Accordingly, in addition to the structure above, the third discriminator has: a second fully connected layer and a third 4×4 convolution unit; the output of the 1×1 convolution unit is connected to the class information loss function through the third 4×4 convolution unit and the second fully connected layer.
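Extending the Discriminator sketch above, the third discriminator's extra classification branch might look as follows; num_classes is a hypothetical parameter (e.g., the number of object categories in the training set):

```python
import torch
import torch.nn as nn

class ClassAwareDiscriminator(Discriminator):
    """Adds the dashed-box branch of fig. 4: a third 4x4 conv and a second FC."""
    def __init__(self, num_classes, **kwargs):
        super().__init__(img_size=256, **kwargs)
        ch = self.match_head.in_channels
        self.class_conv = nn.Conv2d(ch, ch, kernel_size=4)  # third 4x4 unit
        self.class_fc = nn.Linear(ch, num_classes)          # second fully connected layer

    def forward(self, img, text_emb):
        feat = self.backbone(img)
        local_logit = self.local_head(feat)
        t = self.text_fc(text_emb)[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        joint = self.joint(torch.cat([feat, t], dim=1))     # output of the 1x1 unit
        match_logit = self.match_head(joint)
        class_logit = self.class_fc(self.class_conv(joint).flatten(1))  # class-information head
        return local_logit, match_logit, class_logit
```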
In addition, we impose a perceptual loss function on the generator to improve the semantic consistency and diversity of the generated images.
Based on the highly integrated GAN network device of this embodiment, the invention also provides a method for generating images with the device, implemented in the following steps:
step 1: inputting the text into a trained text compiler and outputting the compiled feature expression. But the feature dimension is higher at this time, which is not beneficial to network learning to accurate mapping, a condition increasing module is adopted to sample out condition feature expression with proper dimension, and then the condition feature expression is spliced with noise in channel dimension and input into a generator network. In which the condition addition module is constructed based on the Variational Auto-Encoder (VAE) theory, in order to make the random distribution constructed by the condition variables close enough to the standard gaussian distribution, we limit the KL divergence loss function to the condition addition module.
Step 2: raise the feature dimension through the fully connected layer, reshape the feature into a 4-dimensional tensor, and input it into the 7 consecutive residual generation blocks. To fully strengthen the feature expression at each size, the 64×64, 128×128, and 256×256 features are input into the corresponding accumulation generation blocks, which output RGB images of the corresponding sizes through their convolution operations.
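Reusing the ResidualGenerationBlock and AccumulationGenerationBlock sketches above, the generator of this step can be wired as below. The 2×2 grid produced by the fully connected layer (so that seven ×2 upsamplings place the last three blocks at 64×64, 128×128, and 256×256) and the constant channel width are assumptions:

```python
import torch
import torch.nn as nn

class PyramidGenerator(nn.Module):
    def __init__(self, z_dim=228, ch=64):  # 228 = 128-d condition + 100-d noise (assumed)
        super().__init__()
        self.ch = ch
        self.fc = nn.Linear(z_dim, ch * 2 * 2)  # feature up-dimensioning
        self.res_blocks = nn.ModuleList([ResidualGenerationBlock(ch) for _ in range(7)])
        self.acc_blocks = nn.ModuleList([AccumulationGenerationBlock(ch, ch) for _ in range(3)])

    def forward(self, z):
        x = self.fc(z).view(z.size(0), self.ch, 2, 2)  # reshape to a 4-D feature tensor
        feats = []
        for block in self.res_blocks:                  # 4 -> 8 -> ... -> 256
            x = block(x)
            feats.append(x)
        images, prev = [], None
        for block, feat in zip(self.acc_blocks, feats[-3:]):  # last three scales
            prev, rgb = block(feat, prev)
            images.append(rgb)
        return images                                  # 64x64, 128x128, 256x256 RGB images
```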
A perceptual loss function is imposed on the 256×256 images; its gradient is computed and back-propagated to update the entire generator network.
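The patent does not name the backbone behind its perceptual loss; a common realization, assumed here, compares frozen VGG16 features of the generated and real 256×256 images:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# frozen feature extractor up to relu3_3 (an assumed layer choice)
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(fake_256, real_256):
    """Feature-space distance; its gradient flows back into the generator only."""
    return F.mse_loss(vgg(fake_256), vgg(real_256))
```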
Step 3: to ensure image quality at every size, an independent discriminator follows each size. All sizes are constrained by the matching-pair loss function and the local image loss function; the 256×256 images are additionally constrained by the classification information loss function. In the forward pass, the invention generates RGB images of three different sizes at once and feeds them to the corresponding independent discriminators, which judge whether each generated image semantically matches the input text based on the matching-pair loss function, judge whether it is locally realistic based on the local image loss function, and classify the generated images based on the classification information loss function. In the backward pass, the three discriminators compute their gradients and propagate them back through the whole generator, updating the parameters of the three independent discriminators and the entire generator network.
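Putting the pieces together, one training iteration of this step might look like the following sketch, which reuses the modules and the perceptual_loss function sketched above. The binary cross-entropy form of the matching-pair and local losses, the equal loss weights, and the optimizer handling are assumptions:

```python
import torch
import torch.nn.functional as F

def train_step(generator, cond_aug, discriminators, opt_g, opt_ds,
               text_feat, real_imgs, labels, noise_dim=100):
    # forward pass: three RGB scales at once
    c, kl_loss = cond_aug(text_feat)
    z = torch.cat([c, torch.randn(c.size(0), noise_dim, device=c.device)], dim=1)
    fakes = generator(z)  # [64x64, 128x128, 256x256]; real_imgs matches this layout

    # update each independent discriminator on its own scale
    for fake, real, d, opt_d in zip(fakes, real_imgs, discriminators, opt_ds):
        outs_r = d(real, text_feat)           # (local, match[, class]) logits
        outs_f = d(fake.detach(), text_feat)
        d_loss = sum(F.binary_cross_entropy_with_logits(o, torch.ones_like(o))
                     for o in outs_r[:2])
        d_loss = d_loss + sum(F.binary_cross_entropy_with_logits(o, torch.zeros_like(o))
                              for o in outs_f[:2])
        if len(outs_r) == 3:                  # class information loss, 256x256 only
            d_loss = d_loss + F.cross_entropy(outs_r[2], labels)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # update the generator through all three discriminators' feedback
    g_loss = kl_loss + perceptual_loss(fakes[-1], real_imgs[-1])  # 256x256 only
    for fake, d in zip(fakes, discriminators):
        outs_f = d(fake, text_feat)
        g_loss = g_loss + sum(F.binary_cross_entropy_with_logits(o, torch.ones_like(o))
                              for o in outs_f[:2])
        if len(outs_f) == 3:
            g_loss = g_loss + F.cross_entropy(outs_f[2], labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```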

Claims (7)

1. A highly integrated GAN network device, comprising: a text encoder, a condition augmentation module, a generator, and three independent discriminators;
the text encoder is used for encoding the input text and outputting the encoded feature representation;
the condition augmentation module is used for sampling a condition feature representation of a given dimensionality from the encoded feature representation output by the text encoder, splicing it with noise along the channel dimension, and inputting the result into the generator network;
the generator comprises a fully connected layer, seven sequentially connected residual generation blocks following the fully connected layer, and three sequentially connected accumulation generation blocks connected one-to-one with the last three residual generation blocks;
the fully connected layer is used for raising the feature dimension of the output of the condition augmentation module and reshaping it into a 4-dimensional feature tensor;
the residual generation blocks are used for generating features of different sizes; each residual generation block comprises an upsampling block, two 3×3 convolution units, and an accumulator; the input of the upsampling block is connected to the output of the preceding residual generation block; the output of the upsampling block is connected to one input of the accumulator directly, and to the other input of the accumulator after passing through the two 3×3 convolution units in sequence;
the accumulation generation blocks are used for fusing the features of different sizes with a pyramid network structure to generate RGB images of different sizes; each accumulation generation block comprises a 1×1 convolution unit, an upsampling block, two 3×3 convolution units, and an accumulator; the input of the 1×1 convolution unit is connected to the output of the corresponding residual generation block; the output of the 1×1 convolution unit is connected to the input of the upsampling block; the output of the upsampling block and the output of the preceding accumulation generation block are connected to the two inputs of the accumulator; the output of the accumulator passes through one 3×3 convolution unit, which outputs a higher-dimensional feature, and that feature passes through the other 3×3 convolution unit, which outputs an RGB image;
the three independent discriminators are connected one-to-one with the three accumulation generation blocks of the generator and are used for judging the quality of the RGB images of different sizes output by the generator and returning the judgment results to the generator; each discriminator is provided with a matching-pair loss function, which judges whether the generated image semantically matches the input text, and a local image loss function, which judges whether the generated image is locally realistic; the last discriminator is additionally provided with a class information loss function for classifying the generated images;
the judgment results are returned to the generator as follows: the generated RGB images of different sizes are input into the corresponding independent discriminators, the image quality is judged by the loss functions imposed on the independent discriminators, and the gradients are computed and back-propagated through the whole generator network, updating the parameters of the independent discriminators and the generator network.
2. The highly integrated GAN network device of claim 1, wherein the generator is provided with a perceptual loss function for improving the semantic consistency and diversity of the generated images.
3. The highly integrated GAN network device of claim 1, wherein the three independent discriminators comprise: a first discriminator, a second discriminator, and a third discriminator.
4. The highly integrated GAN network device of claim 3, wherein the first discriminator and the second discriminator each comprise: a multilayer convolution network unit, a first 4×4 convolution unit, a second 4×4 convolution unit, a first fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1×1 convolution unit; the input of the multilayer convolution network unit is connected to the RGB image output by the corresponding accumulation generation block; the output of the multilayer convolution network unit is connected to one input of the channel concatenation unit and, through the first 4×4 convolution unit, to the local image loss function; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes through the 1×1 convolution unit and the second 4×4 convolution unit in sequence and is connected to the matching-pair loss function.
5. The highly integrated GAN network device of claim 3, wherein the third discriminator comprises: a multilayer convolution network unit, a first 4×4 convolution unit, a second 4×4 convolution unit, a third 4×4 convolution unit, a first fully connected layer, a second fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1×1 convolution unit; the input of the multilayer convolution network unit is connected to the RGB image output by the corresponding accumulation generation block; the output of the multilayer convolution network unit is connected to one input of the channel concatenation unit and, through the first 4×4 convolution unit, to the local image loss function; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes through the 1×1 convolution unit and the second 4×4 convolution unit in sequence and is connected to the matching-pair loss function; and the output of the 1×1 convolution unit is connected to the class information loss function through the third 4×4 convolution unit and the second fully connected layer.
6. A method for generating an image from text, wherein the text is processed by the highly integrated GAN network device of any one of claims 1 to 5, comprising the following steps:
inputting the text into the trained text encoder and outputting the encoded feature representation;
using the condition augmentation module to sample a condition feature representation of a given dimensionality, splicing it with noise along the channel dimension, and inputting the result into the generator network;
in the generator network, raising the feature dimension through the fully connected layer, reshaping the feature into a 4-dimensional tensor, and inputting it into the seven consecutive residual generation blocks; inputting the features of different sizes output by the last three residual generation blocks into the corresponding accumulation generation blocks, which output RGB images of different sizes through their convolution operations;
inputting the generated RGB images of different sizes into the corresponding independent discriminators, judging the image quality through the loss functions imposed on the independent discriminators, computing the gradients and back-propagating them through the whole generator network, and updating the parameters of the independent discriminators and the generator network.
7. The method of claim 6, wherein judging the image quality through the loss functions imposed on the independent discriminators specifically comprises: judging whether the generated image semantically matches the input text through the matching-pair loss function imposed on each of the three independent discriminators, and judging whether the generated image is locally realistic through the local image loss function; in addition, for the last of the three independent discriminators, classifying the generated images through the imposed class information loss function.
CN201811542578.4A 2018-12-17 2018-12-17 Highly-integrated GAN network device and method for realizing text image generation Active CN109671125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811542578.4A CN109671125B (en) 2018-12-17 2018-12-17 Highly-integrated GAN network device and method for realizing text image generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811542578.4A CN109671125B (en) 2018-12-17 2018-12-17 Highly-integrated GAN network device and method for realizing text image generation

Publications (2)

Publication Number Publication Date
CN109671125A CN109671125A (en) 2019-04-23
CN109671125B true CN109671125B (en) 2023-04-07

Family

ID=66144473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811542578.4A Active CN109671125B (en) 2018-12-17 2018-12-17 Highly-integrated GAN network device and method for realizing text image generation

Country Status (1)

Country Link
CN (1) CN109671125B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163267A (en) * 2019-05-09 2019-08-23 厦门美图之家科技有限公司 A kind of method that image generates the training method of model and generates image
CN110335212B (en) * 2019-06-28 2021-01-15 西安理工大学 Defect ancient book Chinese character repairing method based on condition confrontation network
CN110572696B (en) * 2019-08-12 2021-04-20 浙江大学 Variational self-encoder and video generation method combining generation countermeasure network
CN110909181A (en) * 2019-09-30 2020-03-24 中国海洋大学 Cross-modal retrieval method and system for multi-type ocean data
CN110930469B (en) * 2019-10-25 2021-11-16 北京大学 Text image generation method and system based on transition space mapping
CN110717555B (en) * 2019-12-12 2020-08-25 江苏联著实业股份有限公司 Picture generation system and device based on natural language and generation countermeasure network
CN111858882B (en) * 2020-06-24 2022-08-09 贵州大学 Text visual question-answering system and method based on concept interaction and associated semantics
CN111898461B (en) * 2020-07-08 2022-08-30 贵州大学 Time sequence behavior segment generation method
CN113140019B (en) * 2021-05-13 2022-05-31 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416752A (en) * 2018-03-12 2018-08-17 中山大学 A method of image is carried out based on production confrontation network and removes motion blur
CN108765319A (en) * 2018-05-09 2018-11-06 大连理工大学 A kind of image de-noising method based on generation confrontation network

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10353463B2 (en) * 2016-03-16 2019-07-16 RaayonNova LLC Smart contact lens with eye driven control system and method
US9971958B2 (en) * 2016-06-01 2018-05-15 Mitsubishi Electric Research Laboratories, Inc. Method and system for generating multimodal digital images
US10600185B2 (en) * 2017-03-08 2020-03-24 Siemens Healthcare Gmbh Automatic liver segmentation using adversarial image-to-image network
CN107392973B (en) * 2017-06-06 2020-01-10 中国科学院自动化研究所 Pixel-level handwritten Chinese character automatic generation method, storage device and processing device
AU2017101166A4 (en) * 2017-08-25 2017-11-02 Lai, Haodong MR A Method For Real-Time Image Style Transfer Based On Conditional Generative Adversarial Networks
CN107862377A (en) * 2017-11-14 2018-03-30 华南理工大学 A kind of packet convolution method that confrontation network model is generated based on text image
CN108537742B (en) * 2018-03-09 2021-07-09 天津大学 Remote sensing image panchromatic sharpening method based on generation countermeasure network
CN108460717A (en) * 2018-03-14 2018-08-28 儒安科技有限公司 A kind of image generating method of the generation confrontation network based on double arbiters
CN108510532B (en) * 2018-03-30 2022-07-15 西安电子科技大学 Optical and SAR image registration method based on deep convolution GAN
CN108460812B (en) * 2018-04-04 2022-04-29 北京红云智胜科技有限公司 System and method for generating emoticons based on deep learning
CN108596265B (en) * 2018-05-02 2022-04-08 中山大学 Video generation model based on text description information and generation countermeasure network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416752A (en) * 2018-03-12 2018-08-17 中山大学 A method of image is carried out based on production confrontation network and removes motion blur
CN108765319A (en) * 2018-05-09 2018-11-06 大连理工大学 A kind of image de-noising method based on generation confrontation network

Also Published As

Publication number Publication date
CN109671125A (en) 2019-04-23

Similar Documents

Publication Publication Date Title
CN109671125B (en) Highly-integrated GAN network device and method for realizing text image generation
CN107480206B (en) Multi-mode low-rank bilinear pooling-based image content question-answering method
WO2023280064A1 (en) Audiovisual secondary haptic signal reconstruction method based on cloud-edge collaboration
CN105678292A (en) Complex optical text sequence identification system based on convolution and recurrent neural network
CN109377532B (en) Image processing method and device based on neural network
CN112131347A (en) False news detection method based on multi-mode fusion
CN110852295B (en) Video behavior recognition method based on multitasking supervised learning
Yang et al. Open domain dialogue generation with latent images
CN114443899A (en) Video classification method, device, equipment and medium
US20230177384A1 (en) Attention Bottlenecks for Multimodal Fusion
CN109413068B (en) Wireless signal encryption method based on dual GAN
CN114663678A (en) ECO-GAN-based image enhancement system and method
CN116309913B (en) Method for generating image based on ASG-GAN text description of generation countermeasure network
CN115953582A (en) Image semantic segmentation method and system
CN115661904A (en) Data labeling and domain adaptation model training method, device, equipment and medium
CN115631504A (en) Emotion identification method based on bimodal graph network information bottleneck
CN115858728A (en) Multi-mode data based emotion analysis method
CN115546907A (en) In-vivo detection method and system for multi-scale feature aggregation
KR102217414B1 (en) 4D Movie Effect Generator
Majumder et al. Variational fusion for multimodal sentiment analysis
Bílková et al. Perceptual license plate super-resolution with CTC loss
KR20210035535A (en) Method of learning brain connectivity and system threrfor
CN116452906B (en) Railway wagon fault picture generation method based on text description
CN112598764B (en) Character image generation method for transferring scene style
KR20220109219A (en) Device and method for deciding whether to concatenate between character strings

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant