CN115775284A - Network architecture method for staged multi-path text-to-image generation - Google Patents

Network architecture method for staged multi-path text-to-image generation

Info

Publication number
CN115775284A
CN115775284A (application CN202211505806.7A)
Authority
CN
China
Prior art keywords
image
stage
generation
feature
path
Prior art date
Legal status: Pending
Application number
CN202211505806.7A
Other languages
Chinese (zh)
Inventor
俞俊
沈铭
丁佳骏
刘贝利
范梦婷
杨苏杭
赵天宁
陈盛款
Current Assignee
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202211505806.7A
Publication of CN115775284A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a novel staged multi-path network architecture for text-to-image generation based on residual learning and multi-scale learning, which improves the extraction of image features at different scales, generates images with fine-grained detail, and improves the quality of this cross-modal generation task. The invention provides a new and improved generative adversarial network architecture to increase the fidelity of generated images. A staged residual connection passes the feature map formed from the adjacent previous stage's information and the text information directly to the end of the current stage, where it participates in the current stage's image generation; this avoids the need for long-term storage and improves the generation quality of the current stage. Multi-scale learning extracts features from the input image along several parallel paths with different convolution-kernel sizes and appropriately integrates the feature maps from the different spaces to obtain higher-quality features and fine-grained textual details.

Description

Network architecture method for staged multi-path text-to-image generation
Technical Field
The invention relates to the fields of residual learning and multi-scale learning, and in particular to a staged multi-path network architecture method for text-to-image synthesis (T2I).
Background
Text-to-image synthesis (T2I) refers to generating an image that correctly reflects the semantics of a text description, a challenging task that links vision and language. Owing to its great potential in application fields such as interactive artistic creation and computer-aided painting, it has become one of the most popular topics in multi-modal learning, drawing considerable attention, and it has important applications in areas such as image editing, computer-aided design, and video games.
Most existing methods solve this problem with generative adversarial networks (GANs), which have strong image-generation capability. Because the images generated in the T2I task must also match the text description, these methods use a conditional generative adversarial network (cGAN), which is conditioned on both the natural-language description and the noise rather than starting from noise alone.
The original T2I model, GAN-INT-CLS, could only generate images at 64 × 64 resolution. To generate higher-resolution images, StackGAN was proposed and successfully produced 256 × 256 images; its successor StackGAN-v2, improved on the basis of StackGAN, showed more stable training behavior and further raised the quality of the generated images. Model architectures and performance have improved over the last few years, and the quality and resolution of generated images have risen markedly. For high-resolution generation, AttnGAN and DM-GAN build attention-based correlations between sub-regions of the generated low-resolution image and words in the text description. RiFe-GAN and RiFeGAN2 select and refine compatible candidate texts from prior knowledge using an attention-based text-matching model. The OP-GAN architecture focuses on individual objects while generating a background that fits the overall image description. Bridge-GAN creates a transitional space with interpretable representations as a bridge connecting text and images.
Since He et al. proposed ResNet, the residual connection has been a basic structure in deep networks. Residual learning converts the rain-streak-layer modeling task in image deraining into a residual-estimation task, easing optimization and reconstructing better background images. DnCNN successfully recovers the latent clean image implicitly in its hidden layers through residual learning. With a deep residual learning (DRL) network, the nonlinear mapping from an input blurred image to the output deblurred image can be estimated directly.
Multi-scale learning has shown its effectiveness in computer vision. The basic idea of the strategy is to fuse features at different resolutions and enhance the representation capability of the neural network. The multi-scale residual network (MSRN) builds multi-scale residual blocks from convolution kernels of different sizes, strengthening its reconstruction capability. The multi-scale dense cross-connected network (MDCN) further exploits the features of earlier layers. The multi-path residual network (AMPRN) aggregates features from different paths, enhancing information flow and gradients through a large number of skip connections.
In the T2I task of high-resolution image generation, multi-stage frameworks are widely adopted. The StackGAN model first defined a two-stage model of two cascaded GANs (as shown in fig. 1) and successfully generated realistic high-resolution (256 × 256) images. Its successor StackGAN-v2 further improved the architecture with a tree-like multi-stage structure. On this multi-stage basis, the symmetric distillation network (SDN) passes hierarchical knowledge unimpeded, and DA-GAN translates each word into a sub-region of the image. AttnGAN introduced a word-level visual attention mechanism that captures fine-grained image-text correlations in the later stages and refines the image to high resolution, so that each word in the input sentence contributes its own level of information about the image content. DM-GAN later introduced a dynamic memory component, so that a high-quality image can be generated even when the initial image is poor, successfully producing more vivid images.
However, existing models still have limitations and drawbacks. They typically first generate a low-resolution image with the rough shape and color, and then generate a realistic high-resolution image in a later stage. Because of similar loss constraints and the inheritance of feature information, the low-resolution image generated in the first stage and the high-resolution images generated in the subsequent stages always share homogeneous features in this coarse-to-fine scheme. The subsequent stages must re-construct the entire image, including the rough shape and color of the object already produced by the preceding adjacent stage. In this situation, the later stages of the model must retain most of the feature-information details of their input.
Disclosure of Invention
The invention provides a staged multi-path network architecture for text-to-image synthesis (T2I) based on residual learning and multi-scale learning, which improves the extraction of image features at different scales, generates images with finer details, and improves the quality of this cross-modal generation task. The invention provides a new and improved generative adversarial network architecture to increase the fidelity of generated images. A staged residual connection passes the feature map formed from the adjacent stage's information and the text information directly to the end of the current stage, so that the feature-information details of the image generated in the previous stage are preserved and participate in the current stage's image generation; this avoids the need for long-term storage, concentrates the current layers on modifying and supplementing the details of the generated image, and improves the generation efficiency of the current stage. Multi-scale learning extracts features from the input using several parallel paths with different convolution-kernel sizes and appropriately integrates the feature maps from different spaces to obtain higher-quality features with fine-grained details. Experiments on several models and data sets show that the multi-path architecture composed of staged residual connections and multi-scale modules effectively improves the performance of text-to-image generation and the quality of the generated images.
A network architecture method for staged multi-path text-to-image generation, comprising:
adding a multi-stage residual learning mechanism to a generative adversarial network with a multi-stage framework;
the multi-stage residual learning mechanism is expressed as:

h_i = Upsample_i( F_i(H_{i-1}) + H_{i-1} ),   H_{i-1} = F_op( h_{i-1}, f_wo(h_{i-1}, w) )

the image h_{i-1} generated in the previous stage i-1, together with its refined word feature f_wo(h_{i-1}, w), moves directly to the end of stage i and is fused with the features f_i = F_i(H_{i-1}) learned by the feature-extraction module, participating in the image generation of stage i;
replacing the convolution filter in each stage with a multi-scale module;
the mathematical expression of the multiscale module is as follows:
Figure BDA0003968991620000032
wherein
Figure BDA0003968991620000033
Is the p-th path module of the extracted feature,
Figure BDA0003968991620000034
respectively, are the outputs of the corresponding paths,
Figure BDA0003968991620000035
representing the feature map generated for each path, stitched, and then passed through a feature fusion block
Figure BDA0003968991620000036
Selecting proper characteristics in each path and carrying out self-adaptive fusion; use of
Figure BDA0003968991620000037
And fusing the feature maps.
Preferably, the multi-scale module comprises: a multi-scale path and feature fusion block FFB;
the multi-scale path includes: three parallel paths;
the feature fusion block uses a 3 × 3 convolution layer, and the mathematical expression is as follows:
Figure BDA0003968991620000041
wherein
Figure BDA0003968991620000042
Finger ECA model, fuse i One filter number is C i Convolution layer with convolution kernel size of 3.
After the features of the method of the invention are added, the overall flow of image generation is as follows:
inputting a Text information of natural language for describing image, and obtaining a sentence feature vector (sense feature) f through a pre-trained Text Encoder (Text Encoder) s And word feature matrices (word features).
The sentence feature f_s is concatenated with a noise vector z drawn from the N(0, 1) normal distribution, passed through a fully connected layer (FC with reshape) and then through the module F_0, consisting of four upsampling layers, to obtain h_0 = F_0(z, f_s(s)). The first generator G_0 yields an initial image G_0(h_0) at 64 × 64 × 3 resolution. The following modules use the proposed multi-path structure within the multi-stage framework to complete the detail refinement and resolution enhancement of the image.
Before h_i (i = 0, 1) is passed to the next stage's module, the word feature matrix is first processed by F_wo (word operation); the resulting f_wo is concatenated with h_i (i = 0, 1) into a vector H_i (i = 0, 1), which is divided among the residual blocks of three scale paths (1 × 1, 3 × 3, 5 × 5), M^p (p = 1, 2, 3), synthesizing hidden-layer feature maps of different receptive fields. The feature maps of different sizes are merged by the fusion block (FFB), residually connected with H_i (i = 0, 1), and passed through a 3 × 3 convolution and upsampling layer to yield h_i (i = 1, 2); the generators G_i (i = 1, 2) then produce higher-resolution images of 128 × 128 × 3 and 256 × 256 × 3.
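The staged flow above can be illustrated with a condensed PyTorch sketch. The module internals below (a mean-pooled word operation, plain convolutional stage bodies, the ToyStagedT2I name) are simplified stand-ins and assumptions, not the patent's reference implementation of the multi-path structure of fig. 3:

```python
import torch
import torch.nn as nn

def up_block(ch):
    # nearest-neighbour upsampling followed by a 3x3 convolution
    return nn.Sequential(nn.Upsample(scale_factor=2),
                         nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

class ToyStagedT2I(nn.Module):
    def __init__(self, z_dim=100, s_dim=256, w_dim=256, ch=32):
        super().__init__()
        self.ch = ch
        self.fc = nn.Linear(z_dim + s_dim, ch * 4 * 4)              # FC with reshape
        self.f0 = nn.Sequential(*[up_block(ch) for _ in range(4)])  # F_0: 4x4 -> 64x64
        self.word_proj = nn.Linear(w_dim, ch)                       # crude stand-in for F_wo
        self.stages = nn.ModuleList([                               # stand-ins for the
            nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1),      # multi-path stages
                          nn.ReLU(), up_block(ch))
            for _ in range(2)])
        self.to_img = nn.ModuleList([nn.Conv2d(ch, 3, 3, padding=1) # G_0, G_1, G_2
                                     for _ in range(3)])

    def forward(self, z, f_s, words):                 # words: (B, T, w_dim)
        h = self.fc(torch.cat([z, f_s], dim=1)).view(-1, self.ch, 4, 4)
        h = self.f0(h)                                # h_0: 64x64 feature map
        images = [torch.tanh(self.to_img[0](h))]      # 64x64x3 initial image
        for stage, g in zip(self.stages, self.to_img[1:]):
            f_wo = self.word_proj(words.mean(dim=1))  # word operation (simplified)
            f_wo = f_wo[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
            H = torch.cat([h, f_wo], dim=1)           # H_i: image + text features
            h = stage(H)                              # refine and upsample
            images.append(torch.tanh(g(h)))           # 128x128x3, then 256x256x3
        return images

imgs = ToyStagedT2I()(torch.randn(2, 100), torch.randn(2, 256),
                      torch.randn(2, 25, 256))
print([tuple(im.shape) for im in imgs])  # 64, 128, 256 resolutions
```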
The invention has the following beneficial effects. To let the subsequent stages concentrate on rich fine-grained details and improve the quality of the final image, the invention provides a novel staged multi-path adversarial generation network architecture from text to image. The framework uses staged residual connections to preserve the feature-information details of the image generated in the previous stage and pass them into the current stage's image-generation process, avoiding the need for long-term storage. The other part of the structure is the multi-scale module, which uses three parallel paths with different convolution kernels to extract input features and generate images with finer details. Experimental results show that the structure lets the network focus on refining the details of the generated image, so that higher-quality cues are obtained for fine-grained detail reasoning. Compared with the baseline models, the staged multi-path text-to-image generation architecture achieves higher performance.
Drawings
FIG. 1 is a multi-stage network architecture that is widely used in the task of text generation images.
Fig. 2 is an improved structure of T2I residual learning and multi-scale module proposed by the present invention.
Fig. 3 is an overall framework of the staged multipath structure T2I model proposed by the present invention.
Detailed Description
The technical solution of the present invention is further specifically described below by way of specific examples in conjunction with the accompanying drawings.
The invention aims to concentrate the subsequent generation stages on enriching fine-grained details, improve the quality of the final image, and provide a new multi-path structure for multi-stage frameworks. Fig. 3 shows the framework of a T2I model using the proposed multi-path structure, which consists mainly of two parts. The staged residual connection preserves the feature-information details of the image generated in the preceding adjacent stage and passes them forward, avoiding the need for long-term memory and improving the generation efficiency of the current stage. The multi-scale module extracts input features along three parallel paths with different convolution-kernel sizes, obtaining higher-quality features and generating better fine-grained details.
Residual learning and multi-scale mechanisms are common strategies in low-level image-processing tasks such as image super-resolution, deraining, and dehazing, but such effective mechanisms are rarely employed in GAN models. Experimental results show that, by adopting residual learning and multi-scale learning, the text-to-image task obtains a considerable performance improvement. Images generated by the multi-path structural model are more vivid than those of the corresponding baseline, both because rich and diverse feature maps are extracted through the different paths and because the proposed staged residual connection reduces the generation burden of the later stages.
A network architecture for staged multi-path text generation of images, comprising two improved structures (shown in fig. 2):
structure one, multi-stage residual join
For a generative adversarial network with a multi-stage framework, assume it has m stages:

{ (F_i, Upsample_i, G_i) | i = 0, 1, ..., m-1 }

which take the hidden states (h_0, h_1, ..., h_{m-1}) as input and generate images of increasing size (x̂_0, x̂_1, ..., x̂_{m-1}).
Mathematically, its forward propagation can be expressed as:

h_0 = F_0( z, f_s(s) )
h_i = Upsample_i( F_i( F_op( h_{i-1}, f_wo(h_{i-1}, w) ) ) ),   i = 1, ..., m-1
x̂_i = G_i( h_i ),   i = 0, 1, ..., m-1

Here z is a noise vector, usually sampled from a Gaussian distribution, s is the global sentence vector, and w is the word-vector matrix from a text encoder such as an LSTM or DAMSM. The function f_s(·) denotes a vector operation on the sentence, e.g. conditioning augmentation; f_wo(·) denotes operations on the words, such as the attention model of AttnGAN or the dynamic memory operation in DM-GAN. F_op(·) denotes the operation on the input feature map and text features of the previous adjacent stage, such as the joint convolution in StackGAN-v2 or the concatenation in AttnGAN. F_i(·), Upsample_i(·), and G_i(·) are modeled as neural networks.
The first stage has already outlined the general shape and color of the object; the goal of the subsequent stages is to progressively correct and enrich the fine-grained features and then generate a high-resolution realistic image. In the coarse-to-fine generation architecture, the low-resolution image generated in the first stage and the high-resolution images generated in the subsequent stages share similar information to a great extent. To improve the generation efficiency of the subsequent stages, this scheme introduces a multi-stage residual learning mechanism, expressed mathematically as:

h_i = Upsample_i( F_i(H_{i-1}) + H_{i-1} ),   H_{i-1} = F_op( h_{i-1}, f_wo(h_{i-1}, w) ),   i ≥ 1

The image h_{i-1} generated in the previous stage, together with its refined word feature f_wo(h_{i-1}, w), moves directly to the end of stage i and is fused with the features f_i = F_i(H_{i-1}) learned by the feature-extraction module, participating in the image generation of stage i. This multi-stage residual learning connection avoids the need for long-term memory, so the layers of stage i can focus on modifying and supplementing the details of the generated image. The input to the first stage is highly abstract text semantics whose semantic feature mapping is largely inconsistent with the image modality; the multi-stage residual learning mechanism is therefore not applied to the first stage's image generation.
Structure two: multi-scale module
In typical T2I synthesis there is only a single path, always using convolution filters with a fixed 3 × 3 kernel to extract features, so the network cannot fully exploit information from different aspects. To further exploit the current stage's input, which includes the previously generated image and the text features, we add a multi-scale module.
Its mathematical expression can be written as:

f_i = FFB_i( [ M_i^1(H_{i-1}), M_i^2(H_{i-1}), M_i^3(H_{i-1}) ] )

where M_i^p is the p-th feature-extraction path module, M_i^p(H_{i-1}) are the outputs of the corresponding paths, and [·] denotes that the feature maps generated by each path are concatenated before being passed through the feature fusion block FFB_i, which selects suitable features from each path and fuses them adaptively.

The multi-scale module consists mainly of the multi-scale paths and a feature fusion block (FFB), and is explored to exploit information from different aspects. Specifically, this part consists of three parallel paths, each with two ResBlocks. Compared with the original module, this method uses larger (5 × 5) and smaller (1 × 1) filters to extract features from different spatial perspectives: the path with the larger initial convolution kernel extracts global structural features, while the path with the smaller kernel captures local detail information. The feature fusion block first integrates all features from the multi-scale paths in a concatenated manner. Then Efficient Channel Attention (ECA), a very lightweight attention module, computes weights that redistribute the importance of each channel's feature map, selecting the appropriate information. Finally, a 3 × 3 convolution layer adaptively fuses the feature maps of each path.
The mathematical representation is as follows:

FFB_i(X) = Fuse_i( ECA(X) )

where ECA(·) denotes the ECA model and Fuse_i is a convolution layer with C_i filters and kernel size 3.
By applying these mechanisms, the model will benefit from rich features of different scales, and can obtain high-quality image features of finer granularity.
Example 1
The experimental method of the present invention, together with its parameters and implementation details, is further described below.
(1) Data sets
The data sets CUB-200 and Oxford-102 are used. CUB-200 contains 11,788 images of 200 bird categories; 150 categories (8,855 images) are used for training and the remaining 50 (2,933 images) for testing. Oxford-102 contains 8,189 flower images from 102 categories, of which 7,034 are used for training and 1,155 for testing. Each image in both the CUB-200 and Oxford-102 data sets has 10 text descriptions.
(2) Text sentence feature and word feature extraction
Features are extracted from the natural-language text descriptions in the data set using a pre-trained bidirectional long short-term memory network (BiLSTM). In a BiLSTM, each word corresponds to two hidden states, one per direction, which are concatenated as the word's semantic information. This yields a word feature matrix e ∈ R^{D×T}, where the i-th column vector e_i represents the feature of the i-th word, D = 256 is the word-feature dimension, and T = 25 is the number of words. The hidden states of the last layer of the BiLSTM are concatenated as the global sentence feature f_s ∈ R^D.
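A minimal sketch of such a BiLSTM encoder follows; the vocabulary size and embedding dimension are illustrative assumptions (only D = 256 and T = 25 come from the description above):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab=5000, emb=300, D=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # hidden size D/2 per direction so concatenated features have dim D
        self.lstm = nn.LSTM(emb, D // 2, batch_first=True, bidirectional=True)

    def forward(self, tokens):                # tokens: (B, T) word indices
        out, (h_n, _) = self.lstm(self.embed(tokens))
        e = out.transpose(1, 2)               # word features e in R^{D x T}
        f_s = torch.cat([h_n[0], h_n[1]], 1)  # global sentence feature, dim D
        return e, f_s

e, f_s = TextEncoder()(torch.randint(0, 5000, (2, 25)))
print(e.shape, f_s.shape)                     # (2, 256, 25) (2, 256)
```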
(3) Building improved networks for staged multi-path text-generated images
AttnGAN is adopted as the baseline model; its multi-stage stacked network raises the image resolution by stacking generators and discriminators and produces images with richer detail. For the generator of the model, given random noise z ~ N(0, 1) and a condition variable of dimensions 100 and 256 respectively, the improved text-to-image neural network architecture is built as in fig. 3, generating images at 64 × 64, 128 × 128, and 256 × 256 resolution across the stages.
(4) Establishment of loss function
The adversarial network is trained jointly with conditional and unconditional generation, so the model's objective function comprises two terms: an unconditional loss and a conditional loss. The loss of the i-th stage discriminator D_i is defined as:

L_{D_i} = -1/2 E_{x_i ~ p_data}[ log D_i(x_i) ] - 1/2 E_{s_i ~ p_{G_i}}[ log(1 - D_i(s_i)) ]
          - 1/2 E_{x_i ~ p_data}[ log D_i(x_i, s) ] - 1/2 E_{s_i ~ p_{G_i}}[ log(1 - D_i(s_i, s)) ]

The loss of the corresponding i-th stage generator G_i likewise consists of two parts:

L_{G_i} = -1/2 E_{s_i ~ p_{G_i}}[ log D_i(s_i) ] - 1/2 E_{s_i ~ p_{G_i}}[ log D_i(s_i, s) ]

where x_i is a real image from the data set matching the text description, s_i is the fake image generated by G_i, and s is the sentence condition.
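These two-part losses can be sketched as follows; the discriminator is assumed to expose an unconditional `logits(img)` head and a conditional `logits(img, sent)` head, which are illustrative names rather than the patent's API:

```python
import torch
import torch.nn.functional as F

def d_loss(D, x_real, x_fake, sent):
    ones = torch.ones(x_real.size(0), 1)
    zeros = torch.zeros_like(ones)
    # unconditional: real vs. fake images only
    uncond = (F.binary_cross_entropy_with_logits(D.logits(x_real), ones) +
              F.binary_cross_entropy_with_logits(D.logits(x_fake), zeros))
    # conditional: real/fake judged jointly with the sentence condition
    cond = (F.binary_cross_entropy_with_logits(D.logits(x_real, sent), ones) +
            F.binary_cross_entropy_with_logits(D.logits(x_fake, sent), zeros))
    return 0.5 * (uncond + cond)   # the -1/2 factors of each expectation term

def g_loss(D, x_fake, sent):
    ones = torch.ones(x_fake.size(0), 1)
    return 0.5 * (F.binary_cross_entropy_with_logits(D.logits(x_fake), ones) +
                  F.binary_cross_entropy_with_logits(D.logits(x_fake, sent), ones))
```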
(5) Model training
The discriminators and generators are trained alternately according to the loss functions above. The relevant training parameters are set as follows: 800 training epochs, batch size 20, the Adam optimizer, and an initial learning rate of 2e-4 for both discriminator and generator.
While the discriminator is trained, the generator model is fixed and gradient information propagates only within the discriminator; while the generator is trained, gradient information flows from the discriminator to the generator, but the discriminator receives no gradient update and only the generator network's parameters are optimized. Finally, model parameters are updated by the back-propagation (BP) algorithm until the model converges.
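The alternating scheme can be sketched as follows, reusing the `G`, `D`, `d_loss`, and `g_loss` assumed in the earlier sketches (with `G` returning the list of staged images). The Adam betas are a common GAN choice and an assumption here, since the description fixes only the optimizer and the 2e-4 learning rate:

```python
import torch

def train(G, D, loader, d_loss, g_loss, epochs=800, z_dim=100, lr=2e-4):
    # betas=(0.5, 0.999) is a conventional GAN setting, assumed here
    opt_d = torch.optim.Adam(D.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_g = torch.optim.Adam(G.parameters(), lr=lr, betas=(0.5, 0.999))
    for _ in range(epochs):
        for x_real, sent, words in loader:
            z = torch.randn(x_real.size(0), z_dim)
            # discriminator step: .detach() freezes the generator, so
            # gradient information stays within the discriminator
            x_fake = G(z, sent, words)[-1].detach()
            opt_d.zero_grad()
            d_loss(D, x_real, x_fake, sent).backward()
            opt_d.step()
            # generator step: gradients flow back through D, but only
            # the generator's parameters are updated
            opt_g.zero_grad()
            g_loss(D, G(z, sent, words)[-1], sent).backward()
            opt_g.step()
```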
After training, the stored generator model can generate the corresponding high-resolution image from a given text description, and the evaluation metrics FID and IS can be computed (FID from the mean and covariance of features of the generated images) to quantify the performance of the model.
(6) Evaluation metrics of the model
The Inception Score (IS) and Fréchet Inception Distance (FID) are used to quantify the performance of the invention. Each model generates 30,000 images for the CUB-200 data set and 11,550 images for the Oxford-102 data set, conditioned on text descriptions from the unseen test set. The IS is defined by the KL divergence between the conditional class distribution and the marginal class distribution, computed with a pre-trained Inception v3 network. A large IS means that the generative model outputs highly diverse images across all classes and that each image clearly belongs to a specific class; the higher the IS, the better the quality of the generated images. The FID computes the Fréchet distance between the synthetic and real images from features extracted by a pre-trained Inception v3 network. A lower FID means that the distribution of generated images is closer to the real image distribution; the lower the FID, the better the generated image quality.
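FID itself reduces to the Fréchet distance between two Gaussians fitted to Inception v3 features, FID = ||μ_r − μ_g||² + Tr(C_r + C_g − 2(C_r C_g)^{1/2}). A sketch, assuming the feature extraction is done elsewhere:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """feats_*: (N, 2048) Inception-v3 pooled features."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c_r @ c_g)     # matrix square root of C_r C_g
    if np.iscomplexobj(covmean):          # numerical noise can introduce
        covmean = covmean.real            # tiny imaginary parts; drop them
    diff = mu_r - mu_g
    return diff @ diff + np.trace(c_r + c_g - 2.0 * covmean)
```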

Claims (2)

1. A network architecture method for staged multi-path text-to-image generation, characterized by comprising:
adding a multi-stage residual learning mechanism to a generative adversarial network with a multi-stage framework;
the multi-stage residual learning mechanism being expressed as:

h_i = Upsample_i( F_i(H_{i-1}) + H_{i-1} ),   H_{i-1} = F_op( h_{i-1}, f_wo(h_{i-1}, w) )

wherein the image h_{i-1} generated in the previous stage i-1, together with its refined word feature f_wo(h_{i-1}, w), moves directly to the end of stage i and is fused with the features f_i = F_i(H_{i-1}) learned by the feature-extraction module, participating in the image generation of stage i;
replacing the convolution filter in each stage with a multi-scale module;
the mathematical expression of the multi-scale module being as follows:

f_i = FFB_i( [ M_i^1(H_{i-1}), M_i^2(H_{i-1}), M_i^3(H_{i-1}) ] )

wherein M_i^p is the p-th feature-extraction path module, M_i^p(H_{i-1}) are the outputs of the corresponding paths, and [·] denotes that the feature maps generated by each path are concatenated and then passed through a feature fusion block FFB_i, which selects suitable features from each path and performs adaptive fusion; FFB_i is used to fuse the feature maps.
2. The method of claim 1, wherein the multi-scale module comprises: a multi-scale path and feature fusion block FFB;
the multi-scale path includes: three parallel paths;
the feature fusion block uses a 3 × 3 convolution layer, with the mathematical expression:

FFB_i(X) = Fuse_i( ECA(X) )

wherein ECA(·) denotes the ECA model and Fuse_i is a convolution layer with C_i filters and kernel size 3.
CN202211505806.7A 2022-11-29 2022-11-29 Network architecture method for staged multi-path text-to-image generation Pending CN115775284A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211505806.7A CN115775284A (en) Network architecture method for staged multi-path text-to-image generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211505806.7A CN115775284A (en) Network architecture method for staged multi-path text-to-image generation

Publications (1)

Publication Number Publication Date
CN115775284A 2023-03-10

Family

ID=85390723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211505806.7A Pending CN115775284A (en) Network architecture method for staged multi-path text-to-image generation

Country Status (1)

Country Link
CN (1) CN115775284A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116863032A (en) * 2023-06-27 2023-10-10 河海大学 Flood disaster scene generation method based on generation countermeasure network
CN116863032B (en) * 2023-06-27 2024-04-09 河海大学 Flood disaster scene generation method based on generation countermeasure network

Similar Documents

Publication Publication Date Title
Han et al. A survey on visual transformer
Robert et al. Hybridnet: Classification and reconstruction cooperation for semi-supervised learning
Zhang et al. Accurate and fast image denoising via attention guided scaling
CN108615036A (en) A kind of natural scene text recognition method based on convolution attention network
CN113361250A (en) Bidirectional text image generation method and system based on semantic consistency
Ghorbani et al. Probabilistic character motion synthesis using a hierarchical deep latent variable model
CN117521672A (en) Method for generating continuous pictures by long text based on diffusion model
Gendy et al. Lightweight image super-resolution based on deep learning: State-of-the-art and future directions
CN113486890A (en) Text detection method based on attention feature fusion and cavity residual error feature enhancement
CN116721334B (en) Training method, device, equipment and storage medium of image generation model
CN113140020A (en) Method for generating image based on text of countermeasure network generated by accompanying supervision
Cui et al. Representation and correlation enhanced encoder-decoder framework for scene text recognition
CN115775284A (en) Network architecture method for generating image by multi-path text in stages
Sun et al. Second-order encoding networks for semantic segmentation
CN115330620A (en) Image defogging method based on cyclic generation countermeasure network
CN117788629B (en) Image generation method, device and storage medium with style personalization
Qiao et al. Tell me where i am: Object-level scene context prediction
Liu et al. Survey on gan‐based face hallucination with its model development
CN117058673A (en) Text generation image model training method and system and text generation image method and system
CN114332565A (en) Method for generating image by generating confrontation network text based on distribution estimation condition
Selva Castelló A comprehensive survey on deep future frame video prediction
Chen et al. Y-Net: Dual-branch joint network for semantic segmentation
Chen et al. SCPA‐Net: Self‐calibrated pyramid aggregation for image dehazing
Luhman et al. High fidelity image synthesis with deep vaes in latent space
CN112465929A (en) Image generation method based on improved graph convolution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination