CN114078172B - Text-to-image generation method based on a resolution-progressive generative adversarial network - Google Patents
- Publication number
- CN114078172B (application CN202010836037.3A)
- Authority
- CN
- China
- Prior art keywords
- mask
- resolution
- generated
- text
- feature vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
Abstract
Aiming at the instability of generative networks, the invention provides a text-to-image generation method based on a resolution-progressive generative adversarial network. In the field of text-to-image generation, generative networks can already produce high-resolution pictures with clear details. In the method of the invention, a semantic separation-fusion generation module at the low-resolution layer separates the text features into three feature vectors, generates a corresponding feature map for each with its own generator, and fuses the feature maps under the guidance of a mask map to obtain the low-resolution image; the mask picture serves as a semantic constraint that improves the stability of the low-resolution generator. Meanwhile, a resolution-progressive residual structure at the high-resolution layers further improves the quality of the generated pictures. The method offers new ideas and broad application prospects in the field of text-to-image generation.
Description
Technical Field
The invention relates to a text-to-image generation method based on a resolution-progressive generative adversarial network, and belongs to the technical fields of deep learning and computer vision.
Background
Text-to-Image Synthesis (Text2Image) is a relatively new frontier in the field of computer vision. Text-to-image generation aims to produce a natural image corresponding to an input descriptive sentence. It lies at the intersection of computer vision and natural language processing, helps mine the latent relations between text and images, and contributes to forming a visual-semantic mechanism for computers.
The task of generating images from text was first proposed in 2016: for each input sentence, an image corresponding to the text description must be generated automatically. Reed et al. built GAN-INT-CLS and related networks on top of the conditional adversarial generative network to address this problem. Although these networks can generate images that are related to the description and have a certain clarity, the generated images are of low resolution, and the semantic consistency between the text and the generated image is largely ignored.
Text-to-image generation is a very challenging problem with two main goals: (1) generating realistic images; and (2) matching the generated image to the input text description. Most current text-to-image frameworks adopt the conditional generative adversarial network (cGAN): a pretrained text encoder encodes the input descriptive sentence into a semantic vector, which is concatenated with a noise vector drawn from a normal distribution and fed to the cGAN as its condition to generate a natural image. To produce clear high-resolution pictures, multi-scale outputs and multi-scale discriminators are adopted to improve picture quality. For semantic consistency, the high-resolution maps are typically fine-tuned with attention mechanisms and similar techniques.
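As a minimal illustration of the cGAN conditioning scheme described above (all sizes are assumptions for the sketch, not taken from the patent):

```python
import numpy as np

# Sketch of cGAN-style conditioning: a pretrained text encoder is assumed
# to yield a sentence embedding c, which is concatenated with Gaussian
# noise z to form the generator input. All dimensions are illustrative.
rng = np.random.default_rng(0)
c = rng.standard_normal(256)   # text semantic feature vector (assumed size)
z = rng.standard_normal(100)   # noise drawn from a normal distribution
s = np.concatenate([c, z])     # conditional input to the generator
print(s.shape)                 # (356,)
```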
Because of the instability of generative adversarial networks, most text-to-image networks are prone to producing semantically unreasonable pictures. Taking bird pictures as an example, when the structure of the generated target is not constrained, some generated pictures are unrealistic: two-headed birds, missing parts, disconnected target regions, or fuzzy boundaries caused by blurring of foreground and background, all of which degrade the generated result. At present, most research on text-based picture generation focuses on improving the high-resolution generator, correcting and fine-tuning the generated pictures with attention mechanisms and the like. To generate clear high-resolution natural pictures, generative networks often cascade several generators, progressively refining low-resolution pictures into high-resolution ones. Meanwhile, studies have shown that low-resolution generators determine structure and layout while high-resolution generators handle detail and random variation; if generation fails on the spatial structure of the picture, no amount of detail correction can save it.
Therefore, the picture initially produced by the low-resolution generator has the greatest impact on the spatial semantic structure of the final result. A better low-resolution generator can ensure the semantic plausibility of the low-resolution picture and thereby improve the stability of the whole generative network.
Disclosure of Invention
The invention provides a text-to-image generation method based on a resolution-progressive generative adversarial network, aiming to improve the stability of image generation. At the low-resolution layer, a semantic separation-fusion generation module separates the text features into three feature vectors under the guidance of a self-attention mechanism; corresponding feature maps are generated by dedicated generators and fused into a low-resolution map, with the mask picture serving as a semantic constraint that improves the stability of the low-resolution generator. At the high-resolution layers, a resolution-progressive residual structure, combined with a word-attention mechanism and pixel shuffling, further improves the quality of the generated picture. The method reduces structural errors in the generated target to a certain extent and further improves the quality of the generated image.
The invention realizes the purpose through the following technical scheme:
The method comprises the following steps:
Step one: encode the input descriptive sentence into a text semantic feature vector c through the Text-Encoder and concatenate it with noise z drawn from a normal distribution to obtain a new feature vector s;
Step two: using a semantic separation module, compute the attention weights of the feature vector output by the encoding end with a self-attention module, and multiply the attention weights by the original semantic feature vector to obtain the separated foreground feature vector s_fore, background feature vector s_back, and mask feature vector s_mask;
Step three: through three different first-stage generators G_fore, G_back, G_mask, generate feature maps R_fore, R_back, R_mask of size 64×64; compute the generated binary mask image I_mask from R_mask; the first-stage generator outputs a feature map R_0 and a first-stage generated picture I_0;
Step four: pass the first-stage feature map through the second- and third-stage generators G_1, G_2, combined with the resolution-progressive residual structure, to obtain generated pictures I_1 and I_2 of sizes 128×128 and 256×256, respectively;
Step five: each generation stage has a corresponding discriminator D_0, D_1, D_2; the mask picture generated in the first stage also has a corresponding discriminator D_mask to constrain the generated result;
Step six: compute the DAMSM loss using the 256×256 image generated by the last generator.
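Steps one through four can be sketched end-to-end as follows. The generators, separation weights, and binarisation are trivial stand-ins, and the fusion rule is an assumption consistent with the mask-guided fusion described here, not the patent's exact operator:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 64  # feature-vector length and first-stage resolution (illustrative)

def toy_G(s, seed, hw):
    """Stand-in for a generator: maps a feature vector to a hw x hw map."""
    g = np.random.default_rng(seed)
    return (g.standard_normal((hw * hw, s.size)) @ s).reshape(hw, hw)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling, standing in for one residual stage."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

s = rng.standard_normal(D)                           # step 1: c ++ z, abstracted
s_fore, s_back, s_mask = 0.5 * s, 0.3 * s, 0.2 * s   # step 2: toy attention split
R_fore = toy_G(s_fore, 1, H)                         # step 3: three generators
R_back = toy_G(s_back, 2, H)
R_mask = toy_G(s_mask, 3, H)
I_mask = (R_mask > 0).astype(float)                  # toy binarisation of the mask
R_0 = I_mask * R_fore + (1 - I_mask) * R_back        # assumed mask-guided fusion
I_1 = upsample2x(R_0)                                # step 4: 64 -> 128
I_2 = upsample2x(I_1)                                # 128 -> 256
print(R_0.shape, I_1.shape, I_2.shape)               # (64, 64) (128, 128) (256, 256)
```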
It should be noted that:
In step two, the attention weight of the i-th semantic feature vector in the semantic attention separation module is calculated as follows:
α_{i,j} = exp(W_i s^T s) / Σ_j exp(W_j s^T s)
where W_i is the weight of a linear transformation;
The procedure in step three, in which the generated binary mask image I_mask is computed from R_mask and the first-stage generator outputs the feature map R_0 and the first-stage generated picture I_0, is as follows:
(1) pass R_mask through a convolution layer and an activation layer to obtain the single-channel binary mask image I_mask;
(2) compute the first-stage feature map R_0 by the formula:
R_0 = I_mask ⊙ R_fore + (1 − I_mask) ⊙ R_back
(3) pass R_0 through a convolution layer and an activation layer to finally obtain the first-stage generated picture I_0.
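A numerical check of the attention-weight formula in the note above, under one assumed reading (the softmax runs over the branch index j of the three separated vectors; the patent's exact indexing is ambiguous in this text, and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8                                   # illustrative feature length
s = rng.standard_normal(D)              # encoded sentence vector
W = rng.standard_normal((3, D, D))      # one linear weight W_j per branch

logits = np.array([s @ Wj @ s for Wj in W])
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()                    # softmax normalisation over branches

# Each separated vector is its attention weight times the original vector.
s_fore, s_back, s_mask = (a * s for a in alpha)
print(round(float(alpha.sum()), 6))     # 1.0
```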
The invention provides a text-to-image generation method based on a resolution-progressive generative adversarial network. A semantic feature separation-fusion module in the low-resolution generation layer improves the stability of image structure generation, a resolution-progressive residual structure in the high-resolution generation layers improves the quality of the generated images, and the effectiveness of the proposed network is verified on the public datasets CUB and Oxford-102.
Drawings
Fig. 1 is a diagram of the network architecture of the present invention.
FIG. 2 is a diagram of the self-attention separation architecture of the present invention.
Fig. 3 is a high resolution residual network architecture of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
FIG. 1 is the network architecture diagram of the text-to-image generation method based on a resolution-progressive generative adversarial network.
Text encoding end: the text encoding end of the generator consists of a pretrained text encoder (Text-Encoder). The input descriptive sentence is encoded by the Text-Encoder into a text semantic feature vector c, which is concatenated with noise z drawn from a normal distribution to form a new feature vector used as the input of the image decoding end of the generator. The Text-Encoder is also responsible for encoding the words of the text description for computing attention maps, which serve as one of the inputs to the last two stages of the image decoding end (64×64 to 128×128 and 128×128 to 256×256).
Image decoding end: the encoded semantic feature vector is passed through a conditioning augmentation module to obtain the condition vector. At the low-resolution layer, the feature vector is passed through the self-attention separation module to obtain three semantic feature vectors with different attention weights. Three different generators generate three different semantic feature maps, and the generated low-resolution map is obtained by feature fusion. At the high-resolution layers, a residual structure combined with an attention mechanism fine-tunes the high-resolution maps, realizing generation from low resolution to high resolution and finally yielding a high-quality picture.
Discriminators: each generation stage has a corresponding discriminator, D_0, D_1, D_2 respectively. In the final generation stage, the generated 256×256 image is also used to compute the DAMSM loss.
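The per-stage discriminator arrangement can be sketched as below; the scoring function is a trivial stand-in for a real convolutional discriminator, and the mask discriminator D_mask from step five is included. All inputs are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

def toy_D(img):
    """Stand-in discriminator: image -> scalar realness score in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-img.mean()))

# One generated image per stage (D0: 64x64, D1: 128x128, D2: 256x256).
stages = {64: rng.standard_normal((64, 64)),
          128: rng.standard_normal((128, 128)),
          256: rng.standard_normal((256, 256))}
scores = {res: toy_D(img) for res, img in stages.items()}

# The first-stage binary mask gets its own discriminator (D_mask).
mask_score = toy_D((rng.standard_normal((64, 64)) > 0).astype(float))
print(sorted(scores))                   # [64, 128, 256]
```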
FIG. 2 is a diagram of the self-attention separation architecture of the present invention. The semantic separation module adopts a self-attention mechanism: the feature vector output by the encoding end is passed through the self-attention module to compute the corresponding attention weights, which are then multiplied by the original semantic feature vector to obtain the separated foreground, background, and mask feature vectors.
FIG. 3 is the high-resolution residual network architecture of the present invention. In the residual network, an attention map is first obtained under the guidance of the word vectors: the attention weights between the previous feature map and the word vectors are computed, and multiplying these weights by the feature map yields the attention map. The attention map is then concatenated with the previous feature map as the input of the generator. At the same time, the previous feature map is upsampled by a factor of two, the output of the generator is added to the upsampled result, and the picture of the corresponding scale for this stage is obtained through an activation layer.
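The residual stage of FIG. 3 can be sketched as follows under assumed shapes: word vectors attend over the spatial positions of the previous feature map, the attended map is concatenated with that map as generator input, and the generator output is added to a 2x-upsampled skip path. The toy generator and all sizes are illustrative, not the patent's:

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, T = 8, 4, 5                            # channels, spatial size, word count

F_prev = rng.standard_normal((C, H, H))      # previous-stage feature map
words = rng.standard_normal((T, C))          # word vectors (assumed projected to C)

# Word attention: softmax over the T words at every spatial position.
flat = F_prev.reshape(C, H * H)              # C x N with N = H*H positions
logits = words @ flat                        # T x N similarity scores
logits = logits - logits.max(axis=0)
attn = np.exp(logits) / np.exp(logits).sum(axis=0)
F_attn = (words.T @ attn).reshape(C, H, H)   # attention map, same shape as F_prev

gen_in = np.concatenate([F_attn, F_prev])    # concatenated generator input (2C ch.)

def toy_generator(x):
    """Stand-in generator: mix the two halves back to C channels, upsample 2x."""
    y = x[:C] + x[C:]
    return y.repeat(2, axis=1).repeat(2, axis=2)

skip = F_prev.repeat(2, axis=1).repeat(2, axis=2)  # 2x-upsampled skip path
out = skip + toy_generator(gen_in)                 # residual addition
print(out.shape)                                   # (8, 8, 8)
```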
Claims (1)
1. A text-to-image generation method based on a resolution-progressive generative adversarial network, characterized in that the method comprises the following steps:
Step one: encode an input descriptive sentence into a text semantic feature vector c through a Text-Encoder and concatenate it with noise z drawn from a normal distribution to obtain a new feature vector s;
Step two: using a semantic separation module, compute the attention weights of the feature vector output by the encoding end with a self-attention module, and multiply the attention weights by the original semantic feature vector to obtain the separated foreground feature vector s_fore, background feature vector s_back, and mask feature vector s_mask; in the semantic attention separation module, the attention weight of the i-th semantic feature vector is calculated as follows:
α_{i,j} = exp(W_i s^T s) / Σ_j exp(W_j s^T s)
where W_i is the weight of a linear transformation;
step three: by a first stage of three different generators G fore ,G back ,G mask Respectively generating feature maps R with the size of 64 multiplied by 64 fore ,R back ,R mask Through R mask Calculating to obtain a generated binary mask image I mask The first stage generator outputs a feature map R 0 And first level generating picture I 0 The method comprises the following steps:
(1) R is to be mask Obtaining a single-channel binary mask image I through the convolution layer and the activation layer mask ;
(2) compute the first-stage feature map R_0 by the formula:
R_0 = I_mask ⊙ R_fore + (1 − I_mask) ⊙ R_back
(3) pass R_0 through a convolution layer and an activation layer to finally obtain the first-stage generated picture I_0;
Step four: pass the first-stage feature map through the second- and third-stage generators G_1, G_2, combined with the resolution-progressive residual structure, to obtain generated pictures I_1 and I_2 of sizes 128×128 and 256×256, respectively;
Step five: each generation stage has a corresponding discriminator D_0, D_1, D_2; the mask picture generated in the first stage also has a corresponding discriminator D_mask to constrain the generated result;
Step six: compute the DAMSM loss using the 256×256 image generated by the last generator.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836037.3A CN114078172B (en) | 2020-08-19 | 2020-08-19 | Text-to-image generation method based on a resolution-progressive generative adversarial network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114078172A CN114078172A (en) | 2022-02-22 |
CN114078172B (en) | 2023-04-07 |
Family
ID=80282441
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010836037.3A Active CN114078172B (en) | 2020-08-19 | 2020-08-19 | Text image generation method for progressively generating confrontation network based on resolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114078172B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115713680B (en) * | 2022-11-18 | 2023-07-25 | 山东省人工智能研究院 | Semantic guidance-based face image identity synthesis method |
CN116246331B (en) * | 2022-12-05 | 2024-08-16 | 苏州大学 | Automatic keratoconus grading method, device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109495741A (en) * | 2018-11-29 | 2019-03-19 | 四川大学 | Method for compressing image based on adaptive down-sampling and deep learning |
CN110443863A (en) * | 2019-07-23 | 2019-11-12 | 中国科学院深圳先进技术研究院 | Method, electronic equipment and the storage medium of text generation image |
CN110706302A (en) * | 2019-10-11 | 2020-01-17 | 中山市易嘀科技有限公司 | System and method for text synthesis image |
CN111260740A (en) * | 2020-01-16 | 2020-06-09 | 华南理工大学 | Text-to-image generation method based on generation countermeasure network |
CN111340907A (en) * | 2020-03-03 | 2020-06-26 | 曲阜师范大学 | Text-to-image generation method of self-adaptive attribute and instance mask embedded graph |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11074495B2 (en) * | 2013-02-28 | 2021-07-27 | Z Advanced Computing, Inc. (Zac) | System and method for extremely efficient image and pattern recognition and artificial intelligence platform |
Non-Patent Citations (3)
Title |
---|
Rintaro Yanagi et al. Scene Retrieval from Multiple Resolution Generated Images Based on Text-to-Image GAN. 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019, pp. 1-5. *
Xu Heyao. Research on Image Translation Based on Cycle Generative Adversarial Networks and Text Information. CNKI Outstanding Master's Theses Full-text Database, Information Science and Technology, 2020, No. 02, pp. I138-1742. *
Xu Yining et al. Text-to-image generation method based on multi-level resolution-progressive generative adversarial networks. Journal of Computer Applications, 2020, No. 12, pp. 3612-3617. *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113139907B (en) | Generation method, system, device and storage medium for visual resolution enhancement | |
CN111179167B (en) | Image super-resolution method based on multi-stage attention enhancement network | |
CN113096017B (en) | Image super-resolution reconstruction method based on depth coordinate attention network model | |
CN114078172B (en) | Text image generation method for progressively generating confrontation network based on resolution | |
CN113361250A (en) | Bidirectional text image generation method and system based on semantic consistency | |
CN111402365B (en) | Method for generating picture from characters based on bidirectional architecture confrontation generation network | |
CN113140020B (en) | Method for generating image based on text of countermeasure network generated by accompanying supervision | |
CN112016604A (en) | Zero-resource machine translation method applying visual information | |
CN109034198B (en) | Scene segmentation method and system based on feature map recovery | |
CN111768354A (en) | Face image restoration system based on multi-scale face part feature dictionary | |
CN112381716A (en) | Image enhancement method based on generation type countermeasure network | |
Cheng et al. | Hybrid transformer and cnn attention network for stereo image super-resolution | |
CN113850718A (en) | Video synchronization space-time super-resolution method based on inter-frame feature alignment | |
CN117173219A (en) | Video target tracking method based on hintable segmentation model | |
CN112396554A (en) | Image super-resolution algorithm based on generation countermeasure network | |
CN117689592A (en) | Underwater image enhancement method based on cascade self-adaptive network | |
CN116823610A (en) | Deep learning-based underwater image super-resolution generation method and system | |
WO2023010981A1 (en) | Encoding and decoding methods and apparatus | |
CN114881858A (en) | Lightweight binocular image super-resolution method based on multi-attention machine system fusion | |
CN109657589B (en) | Human interaction action-based experiencer action generation method | |
Yang et al. | Depth map super-resolution via multilevel recursive guidance and progressive supervision | |
CN118037898B (en) | Text generation video method based on image guided video editing | |
US20240161250A1 (en) | Techniques for denoising diffusion using an ensemble of expert denoisers | |
CN113628108B (en) | Image super-resolution method and system based on discrete representation learning and terminal | |
Chen et al. | Pyramid attention dense network for image super-resolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||