CN109671125A - A highly fused GAN network model and a method for realizing text-to-image generation - Google Patents

A highly fused GAN network model and a method for realizing text-to-image generation

Info

Publication number
CN109671125A
CN109671125A
Authority
CN
China
Prior art keywords
unit
image
text
convolution
block
Prior art date
Legal status
Granted
Application number
CN201811542578.4A
Other languages
Chinese (zh)
Other versions
CN109671125B (en)
Inventor
宋井宽
陈岱渊
高联丽
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201811542578.4A
Publication of CN109671125A
Application granted
Publication of CN109671125B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00: Image coding
    • G06T 9/001: Model-based coding, e.g. wire frame
    • G06T 9/002: Image coding using neural networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present invention relates to the field of deep learning. It discloses a highly fused GAN network model and a method for realizing text-to-image generation, which address the problems of small generated image size, low image quality, and unstable network training in traditional techniques, and effectively realize the generation of clear, high-quality semantic images from input text. The highly fused GAN network model of the present invention comprises: a text encoder, a condition augmentation module, a generator, and three independent discriminators. Based on this highly fused GAN network model, high-quality RGB images matching the text semantics are produced with only one generator and three independent discriminators. To further optimize the generator network structure and make full use of the feature maps of different sizes produced by the intermediate layers of the network, the generator not only uses the residual generation blocks of a residual network, but also uses a pyramid network structure to progressively grow low-dimensional 64*64 features into semantically rich high-dimensional 256*256 features.

Description

A highly fused GAN network model and a method for realizing text-to-image generation
Technical field
The present invention relates to the field of deep learning, and in particular to a highly fused GAN network model and a method for realizing text-to-image generation.
Background art
Although text-to-image generation has many application scenarios in real life, such as image editing and cross-modal data generation, there is relatively little research on this task at present. Early text-to-image methods used a GAN network as the basic network structure, and the generated images were small and of low quality; for example, GAN-INT-CLS [1] can only generate 64*64 images. To increase the image size, later methods train multiple GAN networks stage by stage, but these networks usually have complex structures and high demands on computing hardware, which makes the network training process complicated and time-consuming. For example, StackGAN [2], StackGAN++ [3], and AttnGAN [4] split the task into two steps and train two deep networks separately; they are not end-to-end networks, which increases complexity and makes the whole training process highly unstable.
Bibliography:
[1] Reed, Scott, et al. 2016. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396.
[2] Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Huang, X.; Wang, X.; and Metaxas, D. 2017a. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint.
[3] Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. 2017b. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv:1710.10916.
[4] Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; and He, X. 2017. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. arXiv preprint.
Summary of the invention
The technical problem to be solved by the present invention is to provide a highly fused GAN network model and a method for realizing text-to-image generation, which address the problems of small generated image size, low image quality, and unstable network training in traditional techniques, and effectively realize the generation of clear, high-quality semantic images from input text.
The technical solution adopted by the present invention to solve the above technical problem is as follows:
A highly fused GAN network model, comprising: a text encoder, a condition augmentation module, a generator, and three independent discriminators;
the text encoder is configured to output an encoded feature representation for the text input to the text encoder;
the condition augmentation module is configured to sample a condition feature representation of a certain dimension from the encoded feature representation output by the text encoder, concatenate it with noise along the channel dimension, and input the result into the generator network;
the generator comprises a fully connected layer, seven sequentially connected residual generation blocks connected to the fully connected layer, and three sequentially connected accumulation generation blocks connected in one-to-one correspondence with the last three residual generation blocks;
the fully connected layer is configured to up-dimension the feature output by the condition augmentation module and reshape it into a 4-dimensional feature;
the residual generation blocks are configured to generate features of different sizes;
the accumulation generation blocks are configured to fuse features of different sizes using a pyramid network structure, thereby generating RGB images of different sizes;
the three independent discriminators are connected in one-to-one correspondence with the three accumulation generation blocks of the generator, and are configured to judge the quality of the RGB images of different sizes output by the generator and return the judgment results to the generator.
As a further optimization, a perceptual loss function is imposed on the generator to improve the semantic consistency and diversity of the generated images.
As a further optimization, each discriminator is provided with a matching loss function for judging whether the generated image semantically matches the input text and a local image loss function for judging whether the generated image is locally realistic, and the last discriminator is additionally provided with a classification information loss function for classifying the generated image.
As a further optimization, the residual generation block comprises an up-sampling block, two 3*3 convolution units, and an accumulator; the input of the up-sampling block is connected to the output of the previous residual generation block; the output of the up-sampling block is connected to one input of the accumulator, and it is also passed sequentially through the convolution operations of the two 3*3 convolution units before being connected to the other input of the accumulator.
As a further optimization, the accumulation generation block comprises a 1*1 convolution unit, an up-sampling block, two 3*3 convolution units, and an accumulator; the input of the 1*1 convolution unit is connected to the output of the residual generation block; the output of the 1*1 convolution unit is connected to the input of the up-sampling block; the output of the up-sampling block and the output of the previous accumulation generation block are connected to the two inputs of the accumulator; the output of the accumulator passes through one 3*3 convolution unit to output a higher-level feature, and the output of that 3*3 convolution unit is connected to the input of the other 3*3 convolution unit, which outputs the RGB image.
As a further optimization, the three independent discriminators include a first discriminator, a second discriminator, and a third discriminator.
As a further optimization, the first discriminator and the second discriminator each comprise: a multilayer convolutional network unit, a first 4*4 convolution unit, a second 4*4 convolution unit, a first fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1*1 convolution unit; the input of the multilayer convolutional network unit is connected to the RGB image output by the accumulation generation block; the output of the multilayer convolutional network unit is connected to one input of the channel concatenation unit and is also connected to the local image loss function through the first 4*4 convolution unit; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes sequentially through the convolution operations of the 1*1 convolution unit and the second 4*4 convolution unit and is connected to the matching loss function.
As a further optimization, the third discriminator comprises: a multilayer convolutional network unit, a first 4*4 convolution unit, a second 4*4 convolution unit, a third 4*4 convolution unit, a first fully connected layer, a second fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1*1 convolution unit; the input of the multilayer convolutional network unit is connected to the RGB image output by the accumulation generation block; the output of the multilayer convolutional network unit is connected to one input of the channel concatenation unit and is also connected to the local image loss function through the first 4*4 convolution unit; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes sequentially through the convolution operations of the 1*1 convolution unit and the second 4*4 convolution unit and is connected to the matching loss function; and the output of the 1*1 convolution unit is connected to the classification information loss function through the third 4*4 convolution unit and the second fully connected layer.
In addition, the present invention also provides a method for realizing text-to-image generation based on the above highly fused GAN network model, comprising the following steps:
inputting text into the trained text encoder and outputting an encoded feature representation;
sampling a condition feature representation of a certain dimension using the condition augmentation module, then concatenating it with noise along the channel dimension and inputting the result into the generator network;
in the generator network, up-dimensioning the feature through the fully connected layer and reshaping it into a 4-dimensional feature, then inputting it into the seven consecutive residual generation blocks; further inputting the features of different dimensions output by the last three residual generation blocks into the corresponding accumulation generation blocks, and outputting RGB images of different sizes through the convolution operations of the accumulation generation blocks;
inputting the generated RGB images of different sizes into the corresponding independent discriminators, judging the quality of the images through the loss functions imposed on the independent discriminators, computing the image gradients, back-propagating them into the whole generator network, and updating the parameters of the independent discriminators and of the whole generator network.
As a further optimization, judging the quality of the images through the loss functions imposed on the independent discriminators specifically includes: judging whether the generated image semantically matches the input text through the matching loss functions imposed on the three independent discriminators, and judging whether the generated image is locally realistic through the imposed local image loss functions; in addition, for the last of the three independent discriminators, the generated image is also classified through the imposed classification information loss function.
The beneficial effects of the present invention are:
1) By borrowing the pyramid network structure for feature fusion, the intermediate features generated inside the deep network are effectively utilized to produce high-quality image features that better match the text semantics.
2) The perceptual loss function is effectively utilized to optimize the generator structure of the GAN network and enrich the semantic information of the image features.
3) Multiple discrimination loss functions, such as the matching loss function, the local image loss function, and the classification loss function, are effectively utilized to optimize the discriminator structure of the GAN network and improve its discrimination ability, further improving the quality of the generated images.
4) Using the GAN network architecture proposed by the present invention, the training process can be stabilized and the training time reduced.
Brief description of the drawings
Fig. 1 is a schematic diagram of the structure of the highly fused GAN network in the embodiment of the present invention;
Fig. 2 is a schematic diagram of the structure of a residual generation block;
Fig. 3 is a schematic diagram of the structure of an accumulation generation block;
Fig. 4 is a schematic diagram of the structure of a discriminator.
Specific embodiment
The present invention aims to provide a highly fused GAN network model and a method for realizing text-to-image generation, which address the problems of small generated image size, low image quality, and unstable network training in traditional techniques, and effectively realize the generation of clear, high-quality semantic images from input text.
The core idea of the present invention is: to reduce the training cost as much as possible, a highly unified, structured GAN network model is designed that can still produce high-quality RGB images matching the text semantics with only one generator and three independent discriminators. To further optimize the generator network structure and make full use of the feature maps of different sizes produced by the intermediate layers of the network, the generator not only uses the residual generation blocks of a residual network, but also uses a pyramid network structure to progressively grow low-dimensional 64*64 features into semantically rich high-dimensional 256*256 features.
Embodiment:
As shown in Fig. 1, the highly fused GAN network model in this embodiment comprises: a text encoder, a condition augmentation module, a generator, and three independent discriminators;
the text encoder is configured to output an encoded feature representation for the text input to the text encoder;
the condition augmentation module is configured to sample a condition feature representation of a certain dimension from the encoded feature representation output by the text encoder, concatenate it with noise along the channel dimension, and input the result into the generator network;
the generator comprises a fully connected layer, seven sequentially connected residual generation blocks connected to the fully connected layer, and three sequentially connected accumulation generation blocks connected in one-to-one correspondence with the last three residual generation blocks;
the fully connected layer is configured to up-dimension the feature output by the condition augmentation module and reshape it into a 4-dimensional feature;
the residual generation blocks are configured to generate features of different sizes;
the accumulation generation blocks are configured to fuse features of different sizes using a pyramid network structure, thereby generating RGB images of different sizes;
the three independent discriminators are connected in one-to-one correspondence with the three accumulation generation blocks of the generator, and are configured to judge the quality of the RGB images of different sizes output by the generator and return the judgment results to the generator.
In specific implementation, the features of each size are first generated by the residual generation blocks. As shown in Fig. 2, a residual generation block comprises an up-sampling block, two 3*3 convolution units, and an accumulator. The input of the up-sampling block is connected to the output of the previous residual generation block; the output of the up-sampling block is connected to one input of the accumulator, and it is also passed sequentially through the convolution operations of the two 3*3 convolution units before being connected to the other input of the accumulator.
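For illustration, a minimal PyTorch sketch of the residual generation block just described is given below. The block topology (the up-sampling block feeding both the accumulator and the two 3*3 convolution units) follows the description; the channel counts, normalization layers, and activations are assumptions not specified by the patent.

import torch
import torch.nn as nn

class ResidualGenerationBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # "up-sampling block": nearest-neighbour upsample plus a channel projection
        # (the 1*1 projection is an assumption made to keep channel counts consistent)
        self.upsample = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        # the two 3*3 convolution units on the residual branch
        self.residual = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        up = self.upsample(x)           # fed directly to the accumulator...
        return up + self.residual(up)   # ...and added to the two-convolution branch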
To enrich the feature representation at each size, the present invention proposes an accumulation generation block, which uses a pyramid network structure to fuse features of different sizes. As shown in Fig. 3, an accumulation generation block comprises a 1*1 convolution unit, an up-sampling block, two 3*3 convolution units, and an accumulator. The input of the 1*1 convolution unit is connected to the output of the residual generation block; the output of the 1*1 convolution unit is connected to the input of the up-sampling block; the output of the up-sampling block and the output of the previous accumulation generation block are connected to the two inputs of the accumulator; the output of the accumulator passes through one 3*3 convolution unit to output a higher-level feature, and the output of that 3*3 convolution unit is connected to the input of the other 3*3 convolution unit, which outputs the RGB image.
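A minimal PyTorch sketch of the accumulation generation block follows. The translated description is ambiguous about which branch passes through the up-sampling block; in this sketch it is the previous level's accumulated feature that is upsampled, so that the three RGB outputs come out at the 64*64, 128*128, and 256*256 resolutions stated in the embodiment. That reading, the handling of the first block in the chain (which has no previous accumulated feature), and the channel counts are all assumptions.

import torch
import torch.nn as nn

class AccumulationGenerationBlock(nn.Module):
    def __init__(self, in_ch, fuse_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, fuse_ch, kernel_size=1)           # 1*1 convolution unit
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")       # up-sampling block
        self.refine = nn.Conv2d(fuse_ch, fuse_ch, kernel_size=3, padding=1)  # first 3*3 conv: higher-level feature
        self.to_rgb = nn.Conv2d(fuse_ch, 3, kernel_size=3, padding=1)         # second 3*3 conv: RGB image

    def forward(self, residual_feat, prev_accum=None):
        x = self.reduce(residual_feat)
        if prev_accum is not None:                 # accumulator fusing the lower pyramid level
            x = x + self.upsample(prev_accum)
        fused = torch.relu(self.refine(x))
        rgb = torch.tanh(self.to_rgb(fused))
        return fused, rgb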
The structure of the discriminators in the present invention is shown in Fig. 4, where the dashed box is the part exclusive to the third discriminator and the remaining parts are shared by all three discriminators. Each discriminator comprises: a multilayer convolutional network unit, a first 4*4 convolution unit, a second 4*4 convolution unit, a first fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1*1 convolution unit. The input of the multilayer convolutional network unit is connected to the RGB image output by the accumulation generation block; the output of the multilayer convolutional network unit is connected to one input of the channel concatenation unit and is also connected to the local image loss function through the first 4*4 convolution unit; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes sequentially through the convolution operations of the 1*1 convolution unit and the second 4*4 convolution unit and is connected to the matching loss function.
During training, it was observed that the generated images showed little distinction between objects of different classes. To further improve the degree of discrimination, in this embodiment a classification information loss function is imposed only on the discriminator for the large 256*256 images (the third discriminator). The structure of the third discriminator therefore further includes, on top of the above structure, a second fully connected layer and a third 4*4 convolution unit; the output of the 1*1 convolution unit is connected to the classification information loss function through the third 4*4 convolution unit and the second fully connected layer.
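The following PyTorch sketch mirrors this wiring: a multilayer convolutional backbone, a first 4*4 convolution for the local image score, spatial replication and channel concatenation of the text feature, a 1*1 convolution followed by a second 4*4 convolution for the matching score, and an optional class branch (third 4*4 convolution and second fully connected layer) carried only by the third discriminator. The backbone depth, channel schedule, and the 4*4 bottleneck size are assumptions for illustration.

import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, img_size, text_dim, base_ch=64, num_classes=None):
        super().__init__()
        layers, ch, size = [], 3, img_size
        while size > 4:                                   # multilayer convolutional network unit
            layers += [nn.Conv2d(ch, base_ch, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch, base_ch, size = base_ch, min(base_ch * 2, 512), size // 2
        self.backbone = nn.Sequential(*layers)
        self.local_head = nn.Conv2d(ch, 1, 4)             # first 4*4 conv -> local image loss
        self.text_fc = nn.Linear(text_dim, 128)           # first fully connected layer
        self.joint = nn.Sequential(
            nn.Conv2d(ch + 128, ch, 1),                   # 1*1 convolution unit
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.match_head = nn.Conv2d(ch, 1, 4)             # second 4*4 conv -> matching loss
        self.class_head = None
        if num_classes is not None:                       # third discriminator only
            self.class_head = nn.Sequential(
                nn.Conv2d(ch, ch, 4),                     # third 4*4 convolution unit
                nn.Flatten(),
                nn.Linear(ch, num_classes),               # second fully connected layer
            )

    def forward(self, img, text_feat):
        f = self.backbone(img)                            # B x ch x 4 x 4
        local_score = self.local_head(f)                  # local realism score
        t = self.text_fc(text_feat)
        t = t.view(t.size(0), -1, 1, 1).expand(-1, -1, f.size(2), f.size(3))  # spatial replication unit
        j = self.joint(torch.cat([f, t], dim=1))          # channel concatenation unit
        match_score = self.match_head(j)                  # text-image matching score
        class_logits = self.class_head(j) if self.class_head is not None else None
        return local_score, match_score, class_logits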
In addition, a perceptual loss function is imposed on the generator to improve the semantic consistency and diversity of the generated images.
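The patent does not specify how the perceptual loss is computed. A common instantiation, assumed here purely for illustration, compares intermediate features of a pretrained VGG-16 network between the generated image and the real image:

import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features[:16]   # up to relu3_3 (an assumption)
        for p in vgg.parameters():
            p.requires_grad = False                          # the feature extractor is frozen
        self.vgg = vgg.eval()
        self.criterion = nn.L1Loss()

    def forward(self, fake, real):
        # normalization to ImageNet statistics is omitted for brevity
        return self.criterion(self.vgg(fake), self.vgg(real))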
Based on the highly fused GAN network model of the above embodiment, the present invention also provides a method for realizing text-to-image generation using this model, comprising the following implementation steps:
Step 1: Input text into the trained text encoder and output an encoded feature representation. Because the dimension of this feature is high, it is not conducive to the network learning an accurate mapping, so the condition augmentation module is used to sample a condition feature representation of an appropriate dimension, which is then concatenated with noise along the channel dimension and input into the generator network. The condition augmentation module is built on the theory of the Variational Auto-Encoder (VAE); in order to make the random distribution constructed from the condition variable sufficiently close to a standard Gaussian distribution, a KL divergence loss function is imposed on the condition augmentation module.
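A minimal sketch of the condition augmentation step is given below: the encoded text feature is mapped to a mean and log-variance, a lower-dimensional condition vector is sampled with the reparameterization trick, and a KL divergence term keeps the constructed distribution close to a standard Gaussian. The text feature, condition, and noise dimensions are assumptions.

import torch
import torch.nn as nn

class ConditionAugmentation(nn.Module):
    def __init__(self, text_dim=1024, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)

    def forward(self, text_feat):
        mu, logvar = self.fc(text_feat).chunk(2, dim=1)
        cond = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)       # reparameterization
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # KL divergence loss
        return cond, kl

# Usage: the sampled condition is concatenated with noise along the channel dimension.
# ca = ConditionAugmentation()
# cond, kl_loss = ca(text_feature)                       # text_feature: B x 1024
# z = torch.cat([cond, torch.randn(cond.size(0), 100)], dim=1)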
Step 2: Up-dimension the feature through the fully connected layer and reshape it into a 4-dimensional feature, then input it into the seven consecutive residual generation blocks. To fully enhance the feature representations of different dimensions, the present invention further inputs the features of the three different dimensions 64*64, 128*128, and 256*256 into the accumulation generation blocks, and then outputs RGB images of different sizes through convolution operations.
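The sketch below ties the pieces of this step together, reusing the ResidualGenerationBlock and AccumulationGenerationBlock classes sketched earlier. The starting 2*2 spatial size and the channel schedule are assumptions; the patent fixes only the counts (one fully connected layer, seven residual generation blocks, three accumulation generation blocks) and the 64*64/128*128/256*256 output resolutions.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, in_dim=228):
        super().__init__()
        chans = [512, 512, 256, 256, 128, 128, 64, 64]          # assumed channel schedule
        self.fc = nn.Linear(in_dim, chans[0] * 2 * 2)            # feature up-dimensioning
        self.res_blocks = nn.ModuleList(
            [ResidualGenerationBlock(chans[i], chans[i + 1]) for i in range(7)]
        )
        # accumulation generation blocks attached to the last three residual blocks
        self.acc_blocks = nn.ModuleList(
            [AccumulationGenerationBlock(c, 64) for c in chans[-3:]]
        )

    def forward(self, cond_noise):
        x = self.fc(cond_noise).view(cond_noise.size(0), -1, 2, 2)   # reshape into a 4-D feature
        feats = []
        for block in self.res_blocks:
            x = block(x)
            feats.append(x)
        images, prev = [], None
        for feat, acc in zip(feats[-3:], self.acc_blocks):       # 64, 128, 256 features
            prev, rgb = acc(feat, prev)
            images.append(rgb)                                   # 64*64, 128*128, 256*256 RGB
        return images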
A perceptual loss function is imposed on the 256*256 image; its gradient is computed and the whole generator network is updated through back-propagation.
Step 3: To guarantee the quality of the image at each size, each size is followed by an independent discriminator. A matching loss function and a local image loss function are imposed for all sizes, and a classification information loss function is additionally imposed for the 256*256 image. During forward propagation, the present invention generates the three RGB images of different sizes in a single pass and outputs them to the corresponding independent discriminators, which judge whether the generated image semantically matches the input text based on the matching loss function, judge whether the generated image is locally realistic based on the local image loss function, and classify the generated image based on the classification information loss function. During back-propagation, the three discriminators compute their gradients, which are propagated back into the whole generator, and the parameters of the three independent discriminators and of the whole generator network are updated.
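Finally, a compressed sketch of one training iteration in this step, reusing the sketches above. The adversarial criterion (binary cross-entropy here), the loss weights, and the optimizers are assumptions; the patent specifies only which loss terms are imposed on which discriminator and that the classification term applies to the 256*256 discriminator alone. The generator optimizer opt_g is assumed to cover both the generator and the condition augmentation parameters.

import torch
import torch.nn.functional as F

def train_step(text_feat, real_imgs, labels, ca, G, Ds, perc, opt_g, opt_ds):
    # real_imgs: real images at 64, 128, 256; Ds: the three independent discriminators
    cond, kl = ca(text_feat)
    z = torch.cat([cond, torch.randn(cond.size(0), 100)], dim=1)
    fakes = G(z)                                          # three RGB images in a single pass

    # update the three independent discriminators
    for D, opt_d, real, fake in zip(Ds, opt_ds, real_imgs, fakes):
        local_r, match_r, cls_r = D(real, text_feat)
        local_f, match_f, cls_f = D(fake.detach(), text_feat)
        d_loss = (F.binary_cross_entropy_with_logits(match_r, torch.ones_like(match_r)) +
                  F.binary_cross_entropy_with_logits(match_f, torch.zeros_like(match_f)) +
                  F.binary_cross_entropy_with_logits(local_r, torch.ones_like(local_r)) +
                  F.binary_cross_entropy_with_logits(local_f, torch.zeros_like(local_f)))
        if cls_r is not None:                             # classification loss, third discriminator only
            d_loss = d_loss + F.cross_entropy(cls_r, labels)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # update the generator through all three discriminators
    g_loss = kl                                           # KL divergence term from step 1
    for D, fake in zip(Ds, fakes):
        local_f, match_f, cls_f = D(fake, text_feat)
        g_loss = g_loss + F.binary_cross_entropy_with_logits(match_f, torch.ones_like(match_f))
        g_loss = g_loss + F.binary_cross_entropy_with_logits(local_f, torch.ones_like(local_f))
        if cls_f is not None:
            g_loss = g_loss + F.cross_entropy(cls_f, labels)
    g_loss = g_loss + perc(fakes[-1], real_imgs[-1])      # perceptual loss on the 256*256 image
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()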

Claims (10)

1. A highly fused GAN network model, characterized by comprising: a text encoder, a condition augmentation module, a generator, and three independent discriminators;
the text encoder is configured to output an encoded feature representation for the text input to the text encoder;
the condition augmentation module is configured to sample a condition feature representation of a certain dimension from the encoded feature representation output by the text encoder, concatenate it with noise along the channel dimension, and input the result into the generator network;
the generator comprises a fully connected layer, seven sequentially connected residual generation blocks connected to the fully connected layer, and three sequentially connected accumulation generation blocks connected in one-to-one correspondence with the last three residual generation blocks;
the fully connected layer is configured to up-dimension the feature output by the condition augmentation module and reshape it into a 4-dimensional feature;
the residual generation blocks are configured to generate features of different sizes;
the accumulation generation blocks are configured to fuse features of different sizes using a pyramid network structure, thereby generating RGB images of different sizes;
the three independent discriminators are connected in one-to-one correspondence with the three accumulation generation blocks of the generator, and are configured to judge the quality of the RGB images of different sizes output by the generator and return the judgment results to the generator.
2. The highly fused GAN network model of claim 1, characterized in that a perceptual loss function is imposed on the generator to improve the semantic consistency and diversity of the generated images.
3. The highly fused GAN network model of claim 1, characterized in that each discriminator is provided with a matching loss function for judging whether the generated image semantically matches the input text and a local image loss function for judging whether the generated image is locally realistic, and the last discriminator is additionally provided with a classification information loss function for classifying the generated image.
4. The highly fused GAN network model of claim 1, characterized in that the residual generation block comprises an up-sampling block, two 3*3 convolution units, and an accumulator; the input of the up-sampling block is connected to the output of the previous residual generation block; the output of the up-sampling block is connected to one input of the accumulator, and it is also passed sequentially through the convolution operations of the two 3*3 convolution units before being connected to the other input of the accumulator.
5. The highly fused GAN network model of claim 1, characterized in that the accumulation generation block comprises a 1*1 convolution unit, an up-sampling block, two 3*3 convolution units, and an accumulator; the input of the 1*1 convolution unit is connected to the output of the residual generation block; the output of the 1*1 convolution unit is connected to the input of the up-sampling block; the output of the up-sampling block and the output of the previous accumulation generation block are connected to the two inputs of the accumulator; the output of the accumulator passes through one 3*3 convolution unit to output a higher-level feature, and the output of that 3*3 convolution unit is connected to the input of the other 3*3 convolution unit, which outputs the RGB image.
6. The highly fused GAN network model of claim 1, characterized in that the three independent discriminators include a first discriminator, a second discriminator, and a third discriminator.
7. The highly fused GAN network model of claim 6, characterized in that the first discriminator and the second discriminator each comprise: a multilayer convolutional network unit, a first 4*4 convolution unit, a second 4*4 convolution unit, a first fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1*1 convolution unit; the input of the multilayer convolutional network unit is connected to the RGB image output by the accumulation generation block; the output of the multilayer convolutional network unit is connected to one input of the channel concatenation unit and is also connected to the local image loss function through the first 4*4 convolution unit; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes sequentially through the convolution operations of the 1*1 convolution unit and the second 4*4 convolution unit and is connected to the matching loss function.
8. The highly fused GAN network model of claim 6, characterized in that the third discriminator comprises: a multilayer convolutional network unit, a first 4*4 convolution unit, a second 4*4 convolution unit, a third 4*4 convolution unit, a first fully connected layer, a second fully connected layer, a spatial replication unit, a channel concatenation unit, and a 1*1 convolution unit; the input of the multilayer convolutional network unit is connected to the RGB image output by the accumulation generation block; the output of the multilayer convolutional network unit is connected to one input of the channel concatenation unit and is also connected to the local image loss function through the first 4*4 convolution unit; the input of the first fully connected layer is connected to the text feature representation output by the text encoder; the output of the first fully connected layer is spatially replicated by the spatial replication unit and then connected to the other input of the channel concatenation unit; the output of the channel concatenation unit passes sequentially through the convolution operations of the 1*1 convolution unit and the second 4*4 convolution unit and is connected to the matching loss function; and the output of the 1*1 convolution unit is connected to the classification information loss function through the third 4*4 convolution unit and the second fully connected layer.
9. A method for realizing text-to-image generation, characterized in that text is processed using the highly fused GAN network model of any one of claims 1-8, comprising the following steps:
inputting text into the trained text encoder and outputting an encoded feature representation;
sampling a condition feature representation of a certain dimension using the condition augmentation module, then concatenating it with noise along the channel dimension and inputting the result into the generator network;
in the generator network, up-dimensioning the feature through the fully connected layer and reshaping it into a 4-dimensional feature, then inputting it into the seven consecutive residual generation blocks; further inputting the features of different dimensions output by the last three residual generation blocks into the corresponding accumulation generation blocks, and outputting RGB images of different sizes through the convolution operations of the accumulation generation blocks;
inputting the generated RGB images of different sizes into the corresponding independent discriminators, judging the quality of the images through the loss functions imposed on the independent discriminators, computing the image gradients, back-propagating them into the whole generator network, and updating the parameters of the independent discriminators and of the whole generator network.
10. The method for realizing text-to-image generation of claim 9, characterized in that judging the quality of the images through the loss functions imposed on the independent discriminators specifically includes: judging whether the generated image semantically matches the input text through the matching loss functions imposed on the three independent discriminators, and judging whether the generated image is locally realistic through the imposed local image loss functions; in addition, for the last of the three independent discriminators, the generated image is also classified through the imposed classification information loss function.
CN201811542578.4A 2018-12-17 2018-12-17 Highly-integrated GAN network device and method for realizing text image generation Active CN109671125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811542578.4A CN109671125B (en) 2018-12-17 2018-12-17 Highly-integrated GAN network device and method for realizing text image generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811542578.4A CN109671125B (en) 2018-12-17 2018-12-17 Highly-integrated GAN network device and method for realizing text image generation

Publications (2)

Publication Number Publication Date
CN109671125A true CN109671125A (en) 2019-04-23
CN109671125B CN109671125B (en) 2023-04-07

Family

ID=66144473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811542578.4A Active CN109671125B (en) 2018-12-17 2018-12-17 Highly-integrated GAN network device and method for realizing text image generation

Country Status (1)

Country Link
CN (1) CN109671125B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163267A (en) * 2019-05-09 2019-08-23 厦门美图之家科技有限公司 A kind of method that image generates the training method of model and generates image
CN110335212A (en) * 2019-06-28 2019-10-15 西安理工大学 Defect ancient books Chinese character restorative procedure based on condition confrontation network
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 variational self-encoder and video generation method combining generation countermeasure network
CN110717555A (en) * 2019-12-12 2020-01-21 江苏联著实业股份有限公司 Picture generation system and device based on natural language and generation countermeasure network
CN110909181A (en) * 2019-09-30 2020-03-24 中国海洋大学 Cross-modal retrieval method and system for multi-type ocean data
CN110930469A (en) * 2019-10-25 2020-03-27 北京大学 Text image generation method and system based on transition space mapping
CN111858882A (en) * 2020-06-24 2020-10-30 贵州大学 Text visual question-answering system and method based on concept interaction and associated semantics
CN111898461A (en) * 2020-07-08 2020-11-06 贵州大学 Time sequence behavior segment generation method
CN113140019A (en) * 2021-05-13 2021-07-20 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2017101166A4 (en) * 2017-08-25 2017-11-02 Lai, Haodong MR A Method For Real-Time Image Style Transfer Based On Conditional Generative Adversarial Networks
CN107392973A (en) * 2017-06-06 2017-11-24 中国科学院自动化研究所 Pixel-level handwritten Chinese character automatic generation method, storage device, processing unit
US20170351935A1 (en) * 2016-06-01 2017-12-07 Mitsubishi Electric Research Laboratories, Inc Method and System for Generating Multimodal Digital Images
CN107862377A (en) * 2017-11-14 2018-03-30 华南理工大学 A kind of packet convolution method that confrontation network model is generated based on text image
CN108416752A (en) * 2018-03-12 2018-08-17 中山大学 A method of image is carried out based on production confrontation network and removes motion blur
CN108460812A (en) * 2018-04-04 2018-08-28 北京红云智胜科技有限公司 A kind of expression packet generation system and method based on deep learning
CN108460717A (en) * 2018-03-14 2018-08-28 儒安科技有限公司 A kind of image generating method of the generation confrontation network based on double arbiters
CN108508629A (en) * 2017-02-26 2018-09-07 瑞云诺伐有限公司 Intelligent contact eyeglass and method with eyes driving control system
CN108510532A (en) * 2018-03-30 2018-09-07 西安电子科技大学 Optics and SAR image registration method based on depth convolution GAN
US20180260957A1 (en) * 2017-03-08 2018-09-13 Siemens Healthcare Gmbh Automatic Liver Segmentation Using Adversarial Image-to-Image Network
CN108537742A (en) * 2018-03-09 2018-09-14 天津大学 A kind of panchromatic sharpening method of remote sensing images based on generation confrontation network
CN108596265A (en) * 2018-05-02 2018-09-28 中山大学 Model is generated based on text description information and the video for generating confrontation network
CN108765319A (en) * 2018-05-09 2018-11-06 大连理工大学 A kind of image de-noising method based on generation confrontation network

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170351935A1 (en) * 2016-06-01 2017-12-07 Mitsubishi Electric Research Laboratories, Inc Method and System for Generating Multimodal Digital Images
CN108508629A (en) * 2017-02-26 2018-09-07 瑞云诺伐有限公司 Intelligent contact eyeglass and method with eyes driving control system
US20180260957A1 (en) * 2017-03-08 2018-09-13 Siemens Healthcare Gmbh Automatic Liver Segmentation Using Adversarial Image-to-Image Network
CN107392973A (en) * 2017-06-06 2017-11-24 中国科学院自动化研究所 Pixel-level handwritten Chinese character automatic generation method, storage device, processing unit
AU2017101166A4 (en) * 2017-08-25 2017-11-02 Lai, Haodong MR A Method For Real-Time Image Style Transfer Based On Conditional Generative Adversarial Networks
CN107862377A (en) * 2017-11-14 2018-03-30 华南理工大学 A kind of packet convolution method that confrontation network model is generated based on text image
CN108537742A (en) * 2018-03-09 2018-09-14 天津大学 A kind of panchromatic sharpening method of remote sensing images based on generation confrontation network
CN108416752A (en) * 2018-03-12 2018-08-17 中山大学 A method of image is carried out based on production confrontation network and removes motion blur
CN108460717A (en) * 2018-03-14 2018-08-28 儒安科技有限公司 A kind of image generating method of the generation confrontation network based on double arbiters
CN108510532A (en) * 2018-03-30 2018-09-07 西安电子科技大学 Optics and SAR image registration method based on depth convolution GAN
CN108460812A (en) * 2018-04-04 2018-08-28 北京红云智胜科技有限公司 A kind of expression packet generation system and method based on deep learning
CN108596265A (en) * 2018-05-02 2018-09-28 中山大学 Model is generated based on text description information and the video for generating confrontation network
CN108765319A (en) * 2018-05-09 2018-11-06 大连理工大学 A kind of image de-noising method based on generation confrontation network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HUANG, Hualong, et al.: "Stereo Matching Using Conditional Adversarial Networks", Lecture Notes in Computer Science *
HENSMAN, Paulina, et al.: "cGAN-Based Manga Colorization Using a Single Training Image", 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) *
ZHANG, Yuqing, et al.: "Current Status, Trends and Prospects of Deep Learning Applied to Cyberspace Security", Journal of Computer Research and Development *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163267A (en) * 2019-05-09 2019-08-23 厦门美图之家科技有限公司 A kind of method that image generates the training method of model and generates image
CN110335212B (en) * 2019-06-28 2021-01-15 西安理工大学 Defect ancient book Chinese character repairing method based on condition confrontation network
CN110335212A (en) * 2019-06-28 2019-10-15 西安理工大学 Defect ancient books Chinese character restorative procedure based on condition confrontation network
CN110572696A (en) * 2019-08-12 2019-12-13 浙江大学 variational self-encoder and video generation method combining generation countermeasure network
CN110572696B (en) * 2019-08-12 2021-04-20 浙江大学 Variational self-encoder and video generation method combining generation countermeasure network
CN110909181A (en) * 2019-09-30 2020-03-24 中国海洋大学 Cross-modal retrieval method and system for multi-type ocean data
CN110930469A (en) * 2019-10-25 2020-03-27 北京大学 Text image generation method and system based on transition space mapping
CN110930469B (en) * 2019-10-25 2021-11-16 北京大学 Text image generation method and system based on transition space mapping
CN110717555A (en) * 2019-12-12 2020-01-21 江苏联著实业股份有限公司 Picture generation system and device based on natural language and generation countermeasure network
CN111858882A (en) * 2020-06-24 2020-10-30 贵州大学 Text visual question-answering system and method based on concept interaction and associated semantics
CN111858882B (en) * 2020-06-24 2022-08-09 贵州大学 Text visual question-answering system and method based on concept interaction and associated semantics
CN111898461A (en) * 2020-07-08 2020-11-06 贵州大学 Time sequence behavior segment generation method
CN113140019A (en) * 2021-05-13 2021-07-20 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation
CN113140019B (en) * 2021-05-13 2022-05-31 电子科技大学 Method for generating text-generated image of confrontation network based on fusion compensation

Also Published As

Publication number Publication date
CN109671125B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN109671125A (en) A kind of GAN network model that height merges and the method for realizing text generation image
CN107480206B (en) Multi-mode low-rank bilinear pooling-based image content question-answering method
CN110097550B (en) Medical image segmentation method and system based on deep learning
CN111260740B (en) Text-to-image generation method based on generation countermeasure network
CN109903223A (en) A kind of image super-resolution method based on dense connection network and production confrontation network
CN110377686A (en) A kind of address information Feature Extraction Method based on deep neural network model
CN109543502A (en) A kind of semantic segmentation method based on the multiple dimensioned neural network of depth
CN107679491A (en) A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data
CN108710906B (en) Real-time point cloud model classification method based on lightweight network LightPointNet
CN110070107A (en) Object identification method and device
CN108765512B (en) Confrontation image generation method based on multi-level features
CN109978021B (en) Double-flow video generation method based on different feature spaces of text
CN113255813B (en) Multi-style image generation method based on feature fusion
CN110020681A (en) Point cloud feature extracting method based on spatial attention mechanism
CN109508400A (en) Picture and text abstraction generating method
CN110443173A (en) A kind of instance of video dividing method and system based on inter-frame relation
CN111797814A (en) Unsupervised cross-domain action recognition method based on channel fusion and classifier confrontation
CN108664885A (en) Human body critical point detection method based on multiple dimensioned Cascade H ourGlass networks
CN107145514A (en) Chinese sentence pattern sorting technique based on decision tree and SVM mixed models
CN114118012A (en) Method for generating personalized fonts based on cycleGAN
CN108154156A (en) Image Ensemble classifier method and device based on neural topic model
CN110516724A (en) Visualize the high-performance multilayer dictionary learning characteristic image processing method of operation scene
Liang Intelligent emotion evaluation method of classroom teaching based on expression recognition
CN110210419A (en) The scene Recognition system and model generating method of high-resolution remote sensing image
CN108932484A (en) A kind of facial expression recognizing method based on Capsule Net

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant