CN112163605A - Multi-domain image translation method based on attention network generation - Google Patents
- Publication number
- CN112163605A (application CN202010976851.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- discriminator
- attention
- domain
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
- G06T9/00 — Image data processing; image coding
Abstract
Image translation maps images from one domain to another. The task currently faces three challenging problems: 1) insufficient flexibility in handling multi-domain translation; 2) inability to focus only on the regions to be converted while leaving other, irrelevant attributes unchanged; 3) a tendency to produce blurred image artifacts. The invention addresses these limitations with a novel multi-domain image translation method. For problem 2), the invention embeds an attention module in both the generator and the discriminator, so that the model can apply larger weight coefficients to the regions most important to the translation, according to the attention map obtained from an auxiliary classifier. For problem 3), the invention abandons the traditional discriminator structure in favor of a Patch discriminator, which attends more closely to the detailed parts of the image and thereby improves the quality of the generated image.
Description
Technical Field
The invention relates to deep learning style transfer and image translation. It is implemented with the PyTorch deep learning framework; the main development environment is PyTorch 1.1, Python 3.5, and CUDA 10.0.
Background
Before generative adversarial networks were proposed, deep learning focused mainly on rich hierarchical models for representing the probability distributions of data encountered in applications, such as natural images, audio waveforms containing speech, and the symbols of natural-language corpora. The generative adversarial network (GAN), proposed by Ian Goodfellow in 2014, changed this situation: once introduced, it drew enormous attention and became one of the most active models in deep learning. A GAN can be understood in two parts. "Generative" means the model learns from data such as pictures or language, somewhat like a brain, and then automatically generates similar data; for example, after learning from pictures of cats, the model can generate new cat pictures by itself. "Adversarial", as the name implies, describes a contest between two parties, so the model necessarily involves two networks. The two networks in a GAN are the generator network (G) and the discriminator network (D), with the following roles:
g is a generator network that receives a random noise (i.e. randomly generated data) and generates pictures based on this random noise (maximally learning its data distribution), and the generated pictures are denoted as G (z).
D is a discriminator network which is used for discriminating whether a picture is a real picture or a generated picture, receives a picture x and outputs D (x), if x is the real picture, D (x) is 1, and if x is the generated false picture, D (x) is 0.
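The two roles above can be sketched in a few lines of PyTorch (the framework named in the Technical Field section). All layer sizes here are hypothetical illustrations, not the patent's architecture:

```python
import torch
import torch.nn as nn

# Generator G: random noise z in, fake picture G(z) out (sizes hypothetical).
class Generator(nn.Module):
    def __init__(self, z_dim=64, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 128), nn.ReLU(),
            nn.Linear(128, img_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Discriminator D: picture x in, score D(x) in (0, 1) out (1 = real, 0 = fake).
class Discriminator(nn.Module):
    def __init__(self, img_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, 128), nn.LeakyReLU(0.2),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

z = torch.randn(4, 64)         # a batch of random noise
fake = Generator()(z)          # G(z): generated pictures
score = Discriminator()(fake)  # D(G(z)): real/fake scores
```

In training, D is updated to push `score` toward 0 for fakes and 1 for real pictures, while G is updated to push D's score on `fake` toward 1.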
The advent of GAN revolutionized the field of image generation, and many variants of GAN have since appeared. The following are some representative derivatives.
(1) CGAN. The advantages of generative adversarial networks are undoubted: gradients are obtained purely through backpropagation, no Markov chain is needed, no complex inference is required during learning, and various factors and their relationships can easily be incorporated into the model. However, such a generative model is unconstrained, so there is no way to control what data it generates. CGAN constrains the model with some additional conditioning information, so that the data-generation process can be guided. This additional information may be a class label, hints for image inpainting, or information from other modalities. Compared with the original GAN, CGAN adds the conditioning constraint to both the discriminator and the generator, so that picture generation is no longer unsupervised and aimless.
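The conditioning mechanism can be illustrated minimally: the extra label information c is concatenated to the generator's noise input and to the discriminator's image input. The sizes below (64-dim noise, 10 classes, 784-pixel images) are hypothetical:

```python
import torch
import torch.nn.functional as F

z = torch.randn(4, 64)                                             # generator noise
c = F.one_hot(torch.tensor([0, 1, 2, 3]), num_classes=10).float()  # label condition

g_in = torch.cat([z, c], dim=1)   # conditioned generator input
x = torch.rand(4, 784)            # a batch of (real or generated) images
d_in = torch.cat([x, c], dim=1)   # conditioned discriminator input
```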
(2) DCGAN. Through a set of architectural constraints, DCGAN brought the achievements of convolutional networks in supervised learning to unsupervised learning, and its strong performance made it a leading candidate for unsupervised representation learning. Given large collections of unlabeled data, learning reusable feature representations has long been an active research area: one can learn from a virtually unlimited number of unlabeled images and videos and, once a good intermediate representation is obtained, use it in different supervised learning studies or tasks. DCGAN follows this idea: it trains a generative adversarial network, then reuses parts of the generator and discriminator networks for feature extraction in different supervised tasks, and it proposes a set of constraints on GAN topology that keep training stable under most settings and prevent the generator's output from becoming meaningless.
(3) InfoGAN. InfoGAN maximizes the mutual information between a subset of the latent variables and the observed data. Concretely, InfoGAN successfully separated writing style from digit shape on the MNIST dataset, pose from other factors in 3D-rendered images, and the background digits from the center digit on the SVHN dataset. It also discovered visual concepts such as hairstyle, the presence of glasses, and facial emotion on the CelebA face dataset. In the original GAN, the generator's input is a single unstructured continuous noise vector: it is uninterpretable, and there is no way to control a particular dimension so as to generate specific image content. Consider the MNIST dataset: a digit image can be decomposed into several factors, such as digit identity, stroke thickness, and slant, but in the original GAN no single dimension can be varied to make the generator produce an image with a specific factor. InfoGAN improves on this by decomposing the single continuous input noise into two parts: the original noise z and a set of latent codes whose different dimensions represent different factors.
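A minimal sketch of this input decomposition, loosely following the MNIST example above (all dimension sizes are hypothetical): the generator input is the concatenation of unstructured noise z with a categorical code (e.g. digit identity) and continuous codes (e.g. stroke thickness and slant).

```python
import torch
import torch.nn.functional as F

z = torch.randn(4, 62)                                                 # incompressible noise
c_cat = F.one_hot(torch.tensor([3, 1, 4, 1]), num_classes=10).float()  # categorical code
c_cont = torch.rand(4, 2)                                              # continuous codes
g_input = torch.cat([z, c_cat, c_cont], dim=1)                         # generator input
```

Training then adds a mutual-information term that lets the codes be recovered from the generated image, so varying one code dimension varies one interpretable factor.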
Disclosure of Invention
A multi-domain image translation method based on a generative attention network adds an attention module to an adversarial generative network, building on existing work, to achieve multi-domain image translation. Because existing image translation methods cannot make the network attend to the regions most important to the translation process, the invention embeds an attention module in both the generator and the discriminator. The attention module integrates an auxiliary classifier that produces an attention map, so the model can apply larger weight coefficients to the regions most important to the translation according to the attention map obtained from the auxiliary classifier. To generate clearer and more natural translated images, the invention abandons the traditional discriminator structure in favor of a Patch discriminator. A traditional discriminator takes a whole image as input and therefore inevitably ignores the detailed parts of the image; by dividing an image into several patches of the same scale, the discriminator can attend more closely to details, improving the quality of the generated image. The method mainly comprises the following steps:
step (1), randomly sampling a batch from the dataset, and preprocessing the original-domain labels to obtain target-domain labels;
step (2), concatenating the image from step (1) with its target-domain label at the channel level, then inputting the result into an encoder to extract features;
step (3), inputting the feature maps extracted in step (2) into an auxiliary classifier to obtain the corresponding importance weights, and multiplying each feature map by its attention weight to obtain the attention map;
step (4), inputting the attention map obtained in step (3) into a decoder for decoding, finally generating the output image;
step (5), inputting the generated output image into the discriminator. The discriminator has two functions: it judges whether the input image is real or fake according to its distribution, and it classifies the image according to its features and outputs a classification label, which should match the image's target class label; the discriminator is optimized according to formulas (1), (2), and (3) below.
And (6) inputting the fake target-domain image generated in step (5), together with its original-domain label, into the same generator to reconstruct the image; the reconstructed image should share the image characteristics of the original input, giving formula (4) below.
And (7) performing iterative optimization on the network according to the loss function.
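Steps (2) through (4) above can be sketched as follows. The modules are hypothetical stand-ins, since the patent specifies no layer configuration; only the data flow (channel-level concat, encoding, auxiliary-classifier weighting, decoding) follows the description:

```python
import torch
import torch.nn as nn

B, C, H, W, n_domains = 2, 3, 64, 64, 5       # hypothetical sizes
img = torch.rand(B, C, H, W)
label = torch.zeros(B, n_domains, H, W)
label[:, 1] = 1.0                              # target-domain one-hot, tiled spatially

# Step (2): channel-level concat, then encode.
x = torch.cat([img, label], dim=1)
encoder = nn.Conv2d(C + n_domains, 32, 3, padding=1)
feat = encoder(x)

# Step (3): auxiliary classifier -> importance weights -> attention map.
aux = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(32, 32), nn.Sigmoid())
w = aux(feat)                                  # one importance weight per channel
attn = feat * w.view(B, 32, 1, 1)              # weighted feature maps = attention map

# Step (4): decode the attention map into the output image.
decoder = nn.Conv2d(32, C, 3, padding=1)
out = torch.tanh(decoder(attn))
```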
Drawings
FIG. 1 is a schematic diagram of the architecture of the invention. The whole network comprises two generators, G_t and G_r, and one discriminator D. The conversion generator G_t translates the original-domain image according to the target-domain label, and the reconstruction generator G_r reconstructs the converted image using the original-domain label. Using two different generators to handle the different tasks allows many different network architecture designs.
A block diagram of the generator is shown in FIG. 2. As the figure shows, the generator consists of three parts: an encoding part, an attention part, and a decoding part. The generator adopts a UNet structure, which ensures that other, irrelevant attributes are preserved to the greatest extent during conversion. The encoder receives the original picture and the target-domain label as input and extracts features; the attention module feeds the feature maps extracted by the encoder into a classifier to obtain the corresponding importance weights, and finally multiplies the feature maps by these weights to obtain the final attention map. The decoder is responsible for decoding the attention map to generate the final output image.
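The two-generator scheme of FIG. 1 can be sketched as below, with 1x1 convolutions standing in for the real UNet generators. All shapes and the L1 reconstruction loss are illustrative assumptions; the patent's formula (4) is not reproduced here.

```python
import torch
import torch.nn as nn

def with_label(img, domain, n_domains=5):
    """Channel-level concat of an image with a spatially tiled domain label."""
    B, _, H, W = img.shape
    lab = torch.zeros(B, n_domains, H, W)
    lab[:, domain] = 1.0
    return torch.cat([img, lab], dim=1)

# 1x1 convolutions stand in for the real UNet generators (hypothetical).
G_t = nn.Conv2d(3 + 5, 3, 1)  # conversion generator
G_r = nn.Conv2d(3 + 5, 3, 1)  # reconstruction generator

x = torch.rand(2, 3, 32, 32)                 # original-domain image
fake = G_t(with_label(x, domain=3))          # translate using the target label
rec = G_r(with_label(fake, domain=0))        # reconstruct using the original label
cycle_loss = (rec - x).abs().mean()          # illustrative L1 reconstruction loss
```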
Detailed Description
A multi-domain image translation method based on a generative attention network adds an attention module to an adversarial generative network, building on existing work, to achieve multi-domain image translation. Because existing image translation methods cannot make the network attend to the regions most important to the translation process, the invention embeds an attention module in both the generator and the discriminator. The attention module integrates an auxiliary classifier that produces an attention map, so the model can apply larger weight coefficients to the regions most important to the translation according to the attention map obtained from the auxiliary classifier. To generate clearer and more natural translated images, the invention abandons the traditional discriminator structure in favor of a Patch discriminator. A traditional discriminator takes a whole image as input and therefore inevitably ignores the detailed parts of the image; by dividing an image into several patches of the same scale, the discriminator can attend more closely to details, improving the quality of the generated image. The method mainly comprises the following steps:
step (1), randomly sampling a batch from the dataset, and preprocessing the original-domain labels to obtain target-domain labels;
step (2), concatenating the image from step (1) with its target-domain label at the channel level, then inputting the result into an encoder to extract features;
step (3), inputting the feature maps extracted in step (2) into an auxiliary classifier to obtain the corresponding importance weights, and multiplying each feature map by its attention weight to obtain the attention map;
step (4), inputting the attention map obtained in step (3) into a decoder for decoding, finally generating the output image;
step (5), inputting the generated output image into the discriminator. The discriminator has two functions: it judges whether the input image is real or fake according to its distribution, and it classifies the image according to its features and outputs a classification label, which should match the image's target class label; the discriminator is optimized according to formulas (1), (2), and (3) below.
And (6) inputting the fake target-domain image generated in step (5), together with its original-domain label, into the same generator to reconstruct the image; the reconstructed image should share the image characteristics of the original input, giving formula (4) below.
And (7) performing iterative optimization on the network according to the loss function.
And (7) after the face segmentation is finished, applying histogram matching to each part, and then obtaining the loss functions of the three parts, which can be expressed as formula (2).
And (8) extracting the content of the image using a VGG16 network pre-trained on the ImageNet dataset, as shown in formula (4).
And (9) performing iterative optimization on the generator according to the loss function.
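The Patch discriminator described above can be sketched as a fully convolutional network that outputs a grid of real/fake scores, one per local patch, instead of a single image-level score. The layer sizes are hypothetical, in the style of PatchGAN:

```python
import torch
import torch.nn as nn

# Fully convolutional discriminator: no flatten, no final scalar head.
patch_d = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, 4, stride=1, padding=1),  # one logit per local patch
)

img = torch.rand(1, 3, 64, 64)
scores = patch_d(img)  # a grid of patch logits, not one image-level score
```

Each output logit has a limited receptive field, so the adversarial loss is applied patch by patch, pushing the generator to get local details right.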
Claims (1)
1. A multi-domain image translation method based on a generative attention network, characterized in that an attention module is added to an adversarial generative network, building on existing work, to achieve multi-domain image translation. Because existing image translation methods cannot make the network attend to the regions most important to the translation process, the invention embeds an attention module in both the generator and the discriminator; the attention module integrates an auxiliary classifier that produces an attention map, so that the model can apply larger weight coefficients to the regions most important to the translation according to the attention map obtained from the auxiliary classifier. To generate clearer and more natural translated images, the invention abandons the traditional discriminator structure and adopts a Patch discriminator. A traditional discriminator takes a whole image as input and therefore inevitably ignores the detailed parts of the image; by dividing an image into several patches of the same scale, the discriminator can attend more closely to details, improving the quality of the generated image.
The method mainly comprises the following steps:
step (1), randomly sampling a batch from the dataset, and preprocessing the original-domain labels to obtain target-domain labels;
step (2), concatenating the image from step (1) with its target-domain label at the channel level, then inputting the result into an encoder to extract features;
step (3), inputting the feature maps extracted in step (2) into an auxiliary classifier to obtain the corresponding importance weights, and multiplying each feature map by its attention weight to obtain the attention map;
step (4), inputting the attention map obtained in step (3) into a decoder for decoding, finally generating the output image;
step (5), inputting the generated output image into the discriminator. The discriminator has two functions: it judges whether the input image is real or fake according to its distribution, and it classifies the image according to its features and outputs a classification label, which should match the image's target class label; the discriminator is optimized according to formulas (1), (2), and (3) below.
And (6) inputting the fake target-domain image generated in step (5), together with its original-domain label, into the same generator to reconstruct the image; the reconstructed image should share the image characteristics of the original input, giving formula (4) below.
And (7) performing iterative optimization on the network according to the loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010976851.5A CN112163605A (en) | 2020-09-17 | 2020-09-17 | Multi-domain image translation method based on attention network generation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010976851.5A CN112163605A (en) | 2020-09-17 | 2020-09-17 | Multi-domain image translation method based on attention network generation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112163605A true CN112163605A (en) | 2021-01-01 |
Family
ID=73859158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010976851.5A Pending CN112163605A (en) | 2020-09-17 | 2020-09-17 | Multi-domain image translation method based on attention network generation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112163605A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115841589A (en) * | 2022-11-08 | 2023-03-24 | 河南大学 | Unsupervised image translation method based on generation type self-attention mechanism |
CN116958468A (en) * | 2023-07-05 | 2023-10-27 | 中国科学院地理科学与资源研究所 | Mountain snow environment simulation method and system based on SCycleGAN |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210101 |