CN112163605A - Multi-domain image translation method based on attention network generation - Google Patents
- Publication number
- CN112163605A (application CN202010976851.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- discriminator
- attention
- domain
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06T9/00 — Image data processing; image coding
Abstract
Image translation maps images from one domain to another. The task currently faces three challenging problems: 1) insufficient flexibility in handling multi-domain translation; 2) inability to focus only on the regions to be converted while leaving other, irrelevant attributes unchanged; 3) a tendency to produce blurred image artifacts. The invention addresses these limitations with a novel multi-domain image translation method. For problem 2), the invention embeds an attention module in both the generator and the discriminator, so that the model can apply larger weight coefficients to the regions most important to the translation, according to the attention map obtained from an auxiliary classifier. For problem 3), the invention abandons the traditional discriminator structure in favor of a Patch discriminator, which attends more closely to the detailed parts of the image and thereby improves the quality of the generated image.
Description
Technical Field
The invention relates to deep learning style transfer and image translation. It is implemented with the PyTorch deep learning framework; the main development environment is PyTorch 1.1, Python 3.5, and CUDA 10.0.
Background
Before generative adversarial networks were proposed, deep learning focused mainly on rich hierarchical models for representing the probability distributions of data encountered in applications, such as natural images, audio waveforms containing speech, and the symbols of natural-language corpora. The generative adversarial network (GAN), proposed by Ian Goodfellow in 2014, changed this situation: once introduced, it drew enormous attention and became one of the most active models in deep learning. A GAN can be understood in two parts. "Generative" means the model learns from data such as pictures or language, somewhat like a brain, and then automatically generates similar data; for example, after learning from pictures of cats, the model can generate new cat pictures by itself. "Adversarial", as the name implies, describes a contest between two parties, so the model necessarily involves two networks. The two networks in a GAN are the generator network (G) and the discriminator network (D), with the following roles:
g is a generator network that receives a random noise (i.e. randomly generated data) and generates pictures based on this random noise (maximally learning its data distribution), and the generated pictures are denoted as G (z).
D is a discriminator network which is used for discriminating whether a picture is a real picture or a generated picture, receives a picture x and outputs D (x), if x is the real picture, D (x) is 1, and if x is the generated false picture, D (x) is 0.
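The two roles above can be sketched in a few lines of PyTorch (the framework named in the Technical Field section). All layer sizes here are hypothetical illustrations, not the patent's architecture:

```python
import torch
import torch.nn as nn

# Generator G: random noise z in, fake picture G(z) out (sizes hypothetical).
class Generator(nn.Module):
    def __init__(self, z_dim=64, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(),
            nn.Linear(128, img_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Discriminator D: picture x in, score D(x) in (0, 1) out (1 = real, 0 = fake).
class Discriminator(nn.Module):
    def __init__(self, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

z = torch.randn(4, 64)         # a batch of random noise
fake = Generator()(z)          # G(z): generated pictures
score = Discriminator()(fake)  # D(G(z)): real/fake scores
```

In training, D is updated to push `score` toward 0 for fakes and 1 for real pictures, while G is updated to push D's score on `fake` toward 1.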
The advent of GAN revolutionized the field of image generation, and many variants of GAN have since appeared. The following are some representative derivatives.
(1) CGAN. The advantages of generative adversarial networks are undoubted: gradients are obtained purely through backpropagation, no Markov chain is needed, no complex inference is required during learning, and various factors and their relationships can easily be incorporated into the model. However, such a generative model is unconstrained, so there is no way to control what data it generates. CGAN constrains the model with some additional conditioning information, so that the data-generation process can be guided. This additional information may be a class label, hints for image inpainting, or information from other modalities. Compared with the original GAN, CGAN adds the conditioning constraint to both the discriminator and the generator, so that picture generation is no longer unsupervised and aimless.
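The conditioning mechanism can be illustrated minimally: the extra label information c is concatenated to the generator's noise input and to the discriminator's image input. The sizes below (64-dim noise, 10 classes, 784-pixel images) are hypothetical:

```python
import torch
import torch.nn.functional as F

z = torch.randn(4, 64)                                             # generator noise
c = F.one_hot(torch.tensor([0, 1, 2, 3]), num_classes=10).float()  # label condition

g_in = torch.cat([z, c], dim=1)   # conditioned generator input
x = torch.rand(4, 784)            # a batch of (real or generated) images
d_in = torch.cat([x, c], dim=1)   # conditioned discriminator input
```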
(2) DCGAN. Through a set of architectural constraints, DCGAN brought the achievements of convolutional networks in supervised learning to unsupervised learning, and its strong performance made it a leading candidate for unsupervised representation learning. Given large collections of unlabeled data, learning reusable feature representations has long been an active research area: one can learn from a virtually unlimited number of unlabeled images and videos and, once a good intermediate representation is obtained, use it in different supervised learning studies or tasks. DCGAN follows this idea: it trains a generative adversarial network, then reuses parts of the generator and discriminator networks for feature extraction in different supervised tasks, and it proposes a set of constraints on GAN topology that keep training stable under most settings and prevent the generator's output from becoming meaningless.
(3) InfoGAN. InfoGAN maximizes the mutual information between a subset of the latent variables and the observed data. Concretely, InfoGAN successfully separated writing style from digit shape on the MNIST dataset, pose from other factors in 3D-rendered images, and the background digits from the center digit on the SVHN dataset. It also discovered visual concepts such as hairstyle, the presence of glasses, and facial emotion on the CelebA face dataset. In the original GAN, the generator's input is a single unstructured continuous noise vector: it is uninterpretable, and there is no way to control a particular dimension so as to generate specific image content. Consider the MNIST dataset: a digit image can be decomposed into several factors, such as digit identity, stroke thickness, and slant, but in the original GAN no single dimension can be varied to make the generator produce an image with a specific factor. InfoGAN improves on this by decomposing the single continuous input noise into two parts: the original noise z and a set of latent codes whose different dimensions represent different factors.
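A minimal sketch of this input decomposition, loosely following the MNIST example above (all dimension sizes are hypothetical): the generator input is the concatenation of unstructured noise z with a categorical code (e.g. digit identity) and continuous codes (e.g. stroke thickness and slant).

```python
import torch
import torch.nn.functional as F

z = torch.randn(4, 62)                                                 # incompressible noise
c_cat = F.one_hot(torch.tensor([3, 1, 4, 1]), num_classes=10).float()  # categorical code
c_cont = torch.rand(4, 2)                                              # continuous codes
g_input = torch.cat([z, c_cat, c_cont], dim=1)                         # generator input
```

Training then adds a mutual-information term that lets the codes be recovered from the generated image, so varying one code dimension varies one interpretable factor.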
Disclosure of Invention
A multi-domain image translation method based on a generative attention network adds an attention module to an adversarial generative network, building on existing work, to achieve multi-domain image translation. Because existing image translation methods cannot make the network attend to the regions most important to the translation process, the invention embeds an attention module in both the generator and the discriminator. The attention module integrates an auxiliary classifier that produces an attention map, so the model can apply larger weight coefficients to the regions most important to the translation according to the attention map obtained from the auxiliary classifier. To generate clearer and more natural translated images, the invention abandons the traditional discriminator structure in favor of a Patch discriminator. A traditional discriminator takes a whole image as input and therefore inevitably ignores the detailed parts of the image; by dividing an image into several patches of the same scale, the discriminator can attend more closely to details, improving the quality of the generated image. The method mainly comprises the following steps:
step (1), randomly sampling a batch from the dataset, and preprocessing the original-domain labels to obtain target-domain labels;
step (2), concatenating the image from step (1) with its target-domain label at the channel level, then inputting the result into an encoder to extract features;
step (3), inputting the feature maps extracted in step (2) into an auxiliary classifier to obtain the corresponding importance weights, and multiplying each feature map by its attention weight to obtain the attention map;
step (4), inputting the attention map obtained in step (3) into a decoder for decoding, finally generating the output image;
step (5), inputting the generated output image into the discriminator. The discriminator has two functions: it judges whether the input image is real or fake according to its distribution, and it classifies the image according to its features and outputs a classification label, which should match the image's target class label; the discriminator is optimized according to formulas (1), (2), and (3) below.
And (6) inputting the fake target-domain image generated in step (5), together with its original-domain label, into the same generator to reconstruct the image; the reconstructed image should share the image characteristics of the original input, giving formula (4) below.
And (7) performing iterative optimization on the network according to the loss function.
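Steps (2) through (4) above can be sketched as follows. The modules are hypothetical stand-ins, since the patent specifies no layer configuration; only the data flow (channel-level concat, encoding, auxiliary-classifier weighting, decoding) follows the description:

```python
import torch
import torch.nn as nn

B, C, H, W, n_domains = 2, 3, 64, 64, 5       # hypothetical sizes
img = torch.rand(B, C, H, W)
label = torch.zeros(B, n_domains, H, W)
label[:, 1] = 1.0                              # target-domain one-hot, tiled spatially

# Step (2): channel-level concat, then encode.
x = torch.cat([img, label], dim=1)
encoder = nn.Conv2d(C + n_domains, 32, 3, padding=1)
feat = encoder(x)

# Step (3): auxiliary classifier -> importance weights -> attention map.
aux = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(32, 32), nn.Sigmoid())
w = aux(feat)                                  # one importance weight per channel
attn = feat * w.view(B, 32, 1, 1)              # weighted feature maps = attention map

# Step (4): decode the attention map into the output image.
decoder = nn.Conv2d(32, C, 3, padding=1)
out = torch.tanh(decoder(attn))
```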
Drawings
FIG. 1 is a schematic diagram of the architecture of the invention. The whole network comprises two generators, G_t and G_r, and one discriminator D. The conversion generator G_t translates the original-domain image according to the target-domain label, and the reconstruction generator G_r reconstructs the converted image using the original-domain label. Using two different generators to handle the different tasks allows many different network architecture designs.
A block diagram of the generator is shown in FIG. 2. As the figure shows, the generator consists of three parts: an encoding part, an attention part, and a decoding part. The generator adopts a UNet structure, which ensures that other, irrelevant attributes are preserved to the greatest extent during conversion. The encoder receives the original picture and the target-domain label as input and extracts features; the attention module feeds the feature maps extracted by the encoder into a classifier to obtain the corresponding importance weights, and finally multiplies the feature maps by these weights to obtain the final attention map. The decoder is responsible for decoding the attention map to generate the final output image.
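The two-generator scheme of FIG. 1 can be sketched as below, with 1x1 convolutions standing in for the real UNet generators. All shapes and the L1 reconstruction loss are illustrative assumptions; the patent's formula (4) is not reproduced here.

```python
import torch
import torch.nn as nn

def with_label(img, domain, n_domains=5):
    """Channel-level concat of an image with a spatially tiled domain label."""
    B, _, H, W = img.shape
    lab = torch.zeros(B, n_domains, H, W)
    lab[:, domain] = 1.0
    return torch.cat([img, lab], dim=1)

# 1x1 convolutions stand in for the real UNet generators (hypothetical).
G_t = nn.Conv2d(3 + 5, 3, 1)  # conversion generator
G_r = nn.Conv2d(3 + 5, 3, 1)  # reconstruction generator

x = torch.rand(2, 3, 32, 32)                 # original-domain image
fake = G_t(with_label(x, domain=3))          # translate using the target label
rec = G_r(with_label(fake, domain=0))        # reconstruct using the original label
cycle_loss = (rec - x).abs().mean()          # illustrative L1 reconstruction loss
```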
Detailed Description
A multi-domain image translation method based on a generative attention network adds an attention module to an adversarial generative network, building on existing work, to achieve multi-domain image translation. Because existing image translation methods cannot make the network attend to the regions most important to the translation process, the invention embeds an attention module in both the generator and the discriminator. The attention module integrates an auxiliary classifier that produces an attention map, so the model can apply larger weight coefficients to the regions most important to the translation according to the attention map obtained from the auxiliary classifier. To generate clearer and more natural translated images, the invention abandons the traditional discriminator structure in favor of a Patch discriminator. A traditional discriminator takes a whole image as input and therefore inevitably ignores the detailed parts of the image; by dividing an image into several patches of the same scale, the discriminator can attend more closely to details, improving the quality of the generated image. The method mainly comprises the following steps:
step (1), randomly sampling a batch from the dataset, and preprocessing the original-domain labels to obtain target-domain labels;
step (2), concatenating the image from step (1) with its target-domain label at the channel level, then inputting the result into an encoder to extract features;
step (3), inputting the feature maps extracted in step (2) into an auxiliary classifier to obtain the corresponding importance weights, and multiplying each feature map by its attention weight to obtain the attention map;
step (4), inputting the attention map obtained in step (3) into a decoder for decoding, finally generating the output image;
step (5), inputting the generated output image into the discriminator. The discriminator has two functions: it judges whether the input image is real or fake according to its distribution, and it classifies the image according to its features and outputs a classification label, which should match the image's target class label; the discriminator is optimized according to formulas (1), (2), and (3) below.
And (6) inputting the fake target-domain image generated in step (5), together with its original-domain label, into the same generator to reconstruct the image; the reconstructed image should share the image characteristics of the original input, giving formula (4) below.
And (7) performing iterative optimization on the network according to the loss function.
And (7) after the face segmentation is finished, applying histogram matching to each part, and then obtaining the loss functions of the three parts, which can be expressed as formula (2).
And (8) extracting the content of the image using a VGG16 network pre-trained on the ImageNet dataset, as shown in formula (4).
And (9) performing iterative optimization on the generator according to the loss function.
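The Patch discriminator described above can be sketched as a fully convolutional network that outputs a grid of real/fake scores, one per local patch, instead of a single image-level score. The layer sizes are hypothetical, in the style of PatchGAN:

```python
import torch
import torch.nn as nn

# Fully convolutional discriminator: no flatten, no final scalar head.
patch_d = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per local patch
)

img = torch.rand(1, 3, 64, 64)
scores = patch_d(img)  # a grid of patch logits, not one image-level score
```

Each output logit has a limited receptive field, so the adversarial loss is applied patch by patch, pushing the generator to get local details right.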
Claims (1)
1. A multi-domain image translation method based on a generative attention network, characterized in that an attention module is added to an adversarial generative network, building on existing work, to achieve multi-domain image translation. Because existing image translation methods cannot make the network attend to the regions most important to the translation process, the invention embeds an attention module in both the generator and the discriminator; the attention module integrates an auxiliary classifier that produces an attention map, so that the model can apply larger weight coefficients to the regions most important to the translation according to the attention map obtained from the auxiliary classifier. To generate clearer and more natural translated images, the invention abandons the traditional discriminator structure and adopts a Patch discriminator. A traditional discriminator takes a whole image as input and therefore inevitably ignores the detailed parts of the image; by dividing an image into several patches of the same scale, the discriminator can attend more closely to details, improving the quality of the generated image.
The method mainly comprises the following steps:
step (1), randomly sampling a batch from the dataset, and preprocessing the original-domain labels to obtain target-domain labels;
step (2), concatenating the image from step (1) with its target-domain label at the channel level, then inputting the result into an encoder to extract features;
step (3), inputting the feature maps extracted in step (2) into an auxiliary classifier to obtain the corresponding importance weights, and multiplying each feature map by its attention weight to obtain the attention map;
step (4), inputting the attention map obtained in step (3) into a decoder for decoding, finally generating the output image;
step (5), inputting the generated output image into the discriminator. The discriminator has two functions: it judges whether the input image is real or fake according to its distribution, and it classifies the image according to its features and outputs a classification label, which should match the image's target class label; the discriminator is optimized according to formulas (1), (2), and (3) below.
And (6) inputting the fake target-domain image generated in step (5), together with its original-domain label, into the same generator to reconstruct the image; the reconstructed image should share the image characteristics of the original input, giving formula (4) below.
And (7) performing iterative optimization on the network according to the loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010976851.5A CN112163605A (en) | 2020-09-17 | 2020-09-17 | Multi-domain image translation method based on attention network generation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010976851.5A CN112163605A (en) | 2020-09-17 | 2020-09-17 | Multi-domain image translation method based on attention network generation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112163605A true CN112163605A (en) | 2021-01-01 |
Family
ID=73859158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010976851.5A Pending CN112163605A (en) | 2020-09-17 | 2020-09-17 | Multi-domain image translation method based on attention network generation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112163605A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115841589A (en) * | 2022-11-08 | 2023-03-24 | 河南大学 | Unsupervised image translation method based on generation type self-attention mechanism |
CN116958468A (en) * | 2023-07-05 | 2023-10-27 | 中国科学院地理科学与资源研究所 | Mountain snow environment simulation method and system based on SCycleGAN |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210101 |