CN112102303B - Semantic image analogy method based on a single-image generative adversarial network

Semantic image analogy method based on a single-image generative adversarial network

Info

Publication number
CN112102303B
CN112102303B (application CN202011001562.XA)
Authority
CN
China
Prior art keywords
image
semantic
source
target
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011001562.XA
Other languages
Chinese (zh)
Other versions
CN112102303A (en)
Inventor
熊志伟
李家丞
刘东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202011001562.XA priority Critical patent/CN112102303B/en
Publication of CN112102303A publication Critical patent/CN112102303A/en
Application granted granted Critical
Publication of CN112102303B publication Critical patent/CN112102303B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Abstract

The invention discloses a semantic image analogy method based on a generative adversarial network trained on a single image. It can be seen from the technical scheme provided by the invention that, given an arbitrary image and its semantic segmentation map, a generative model dedicated to that image can be trained; the model can recombine the source image according to different desired semantic layouts to generate an image conforming to the target semantic layout, achieving the effect of semantic image analogy. The results generated by the method are optimal in both visual quality and accuracy of conformity to the target layout.

Description

Semantic image analogy method based on a single-image generative adversarial network
Technical Field
The invention relates to the technical field of image processing, and in particular to a semantic image analogy method based on a generative adversarial network trained on a single image.
Background
Generative models such as Variational Auto-Encoders (VAEs) and Generative Adversarial Networks (GANs) have advanced significantly in modeling natural images in a generative manner. By taking additional signals such as class labels, text, edges or segmentation maps as input, conditional generative models can generate photo-realistic samples in a controllable manner, which is useful in many multimedia applications such as interactive design and artistic style transfer.
In particular, segmentation maps provide dense pixel-level guidance for generative models and enable users to spatially control desired instances, which is much more flexible than image-level guidance such as class labels or styles.
Isola et al. proposed the Pix2Pix model, which demonstrates the ability of conditional GANs to generate controlled images given dense conditional signals, including sketches and segmentation maps (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5967-5976). Wang et al. extended this framework with a coarse-to-fine generator and a multi-scale discriminator to generate images with high-resolution details (Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8798-8807). Park et al. proposed a spatially adaptive normalization technique (SPADE) that uses the semantic map to predict affine transformation parameters that modulate the activations in normalization layers (Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2337-2346). Typically, these methods require a large training dataset to map segmentation class labels to the image patch appearances of the entire dataset. However, the appearance of an instance of a given label in the generated image is then limited to the appearance of that label in the training dataset, which limits the generalization ability of these models on arbitrary natural images.
On the other hand, recent studies on single-image GANs have shown that it is possible to learn a generative model from the internal patch distribution of a single image. InGAN defines resizing transformations and trains a generative model to capture the internal patch statistics for retargeting (Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. 2019. InGAN: Capturing and Retargeting the "DNA" of a Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4491-4500). SinGAN generates unconditional images using a multi-stage training scheme and can generate images of arbitrary size from noise (Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. 2019. SinGAN: Learning a Generative Model from a Single Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4569-4579). KernelGAN uses a deep linear generator and constrains it to learn an image-specific degradation kernel for blind super-resolution (Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. 2019. Blind Super-Resolution Kernel Estimation using an Internal-GAN. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS). 284-293). Although these image-specific GANs are independent of any dataset and produce favorable results, the semantic meaning of patches within a single image remains barely explored.
Disclosure of Invention
The invention aims to provide a semantic image analogy method based on a single-image generative adversarial network, whose generated results are optimal in both visual quality and accuracy of conformity to the target layout.
The purpose of the invention is realized by the following technical scheme:
A semantic image analogy method based on a single-image generative adversarial network is realized by a network model consisting of an encoder, a generator, an auxiliary classifier and a discriminator; wherein:
A training stage: during each training iteration, the same random augmentation operation is applied to a given source image and the corresponding source semantic segmentation map to obtain a corresponding enhanced image and an enhanced semantic segmentation map; the feature tensors of the source semantic segmentation map and the enhanced semantic segmentation map are extracted by the same encoder, and a semantic feature transformation module in the generator predicts transformation parameters in the image domain based on the two feature tensors, so that a target image is generated from the source image under the guidance of the transformation parameters; the target image is fed into the discriminator and the auxiliary classifier, which respectively predict score maps of the target image and the enhanced image, and the target semantic segmentation map corresponding to the target image; a total loss function is constructed for training from the appearance similarity loss between the target image and the source image, the feature matching loss between the target image and the enhanced image obtained based on the score maps, and the semantic alignment loss between the target semantic segmentation map and the enhanced semantic segmentation map.
An inference stage: the source image, the corresponding source semantic segmentation map and a specified semantic segmentation map are input into the semantic image analogy network, and an image with the same semantic layout as the specified semantic segmentation map is output.
It can be seen from the technical scheme provided by the invention that, given an arbitrary image and its semantic segmentation map, a generative model dedicated to that image can be trained; the model can recombine the source image according to different desired semantic layouts to generate an image conforming to the target semantic layout, thereby achieving the effect of semantic image analogy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a conceptual diagram of semantic image analogy provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the semantic image analogy method based on a single-image generative adversarial network provided by an embodiment of the present invention;
FIG. 3 is a flowchart of the computation in the SFT module provided by an embodiment of the present invention;
FIG. 4 is a comparison of the visual results of the image generation of the present invention with those of existing image analogy methods, provided by an embodiment of the present invention;
FIG. 5 is a comparison of the visual results of the image generation of the present invention with those of existing single-image GAN methods, provided by an embodiment of the present invention;
FIG. 6 is a comparison of the visual results of the image generation of the present invention with those of existing semantic image translation methods, provided by an embodiment of the present invention;
FIG. 7 shows visual results of the present invention on the semantic image analogy task, provided by an embodiment of the present invention;
FIG. 8 shows visual results of the present invention on the image object removal task, provided by an embodiment of the present invention;
FIG. 9 shows visual results of the present invention on the face editing task, provided by an embodiment of the present invention;
FIG. 10 shows visual results of the present invention on the edge-to-image translation task, provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a semantic image analogy method based on a generative adversarial network trained on a single image. This task is named "semantic image analogy", as a variation of "image analogies" (Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image Analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 327-340), and is defined as follows.
Given a source image I and its corresponding semantic segmentation map P, together with another semantic segmentation map P', a new target image I' is synthesized such that:

P : P' :: I : I'

In the above formula, "::" denotes the analogy relation, i.e., P is to P' as I is to I'.
As shown in fig. 1, the target image I' (the four images in the dashed box) should match both the appearance of the source image I and the layout of the target segmentation map P' (the four segmentation maps in the dashed box). The task setting aims to find a transformation from I to I' that is analogous to the transformation from P to P'. In addition, two metrics are used to evaluate the quality of images generated by a semantic image analogy model: the patch-level (image-block-level) distance and the semantic alignment score. The former constrains the original image I to be the only source of patches for the generated image I', while the latter forces the generated image I' to have a semantic layout aligned with the target segmentation map P'.
In practice, the source segmentation map P may be edited, or another image with a similar context may be used, to obtain the target segmentation map P'. The generator then generates a semantically aligned target image I' from the source image I, in a manner analogous to how P' is obtained from P. Comparison with existing methods shows that the method provided by the invention has advantages in both quantitative and qualitative evaluation. Due to the flexible task setting, the proposed method can easily be extended to various applications, including object removal, face editing and sketch-to-image generation on natural images.
Fig. 2 shows the main principle of the semantic image analogy method based on a single-image generative adversarial network provided by the present invention, which is implemented by a network model composed of an encoder, a generator, an auxiliary classifier and a discriminator; wherein:
A training stage: a self-supervised framework is designed for training the conditional GAN from a single image. During each training iteration, the same random augmentation operation is applied to a given source image I_source and the corresponding source semantic segmentation map P_source to obtain a corresponding enhanced image I_aug and an enhanced semantic segmentation map P_aug. The feature tensors of the source semantic segmentation map and the enhanced semantic segmentation map are extracted by the same encoder, and a semantic feature transformation module in the generator predicts transformation parameters in the image domain based on the two feature tensors, so that a target image I_target is generated from the source image under the guidance of the transformation parameters. The target image is fed into the discriminator and the auxiliary classifier, which respectively predict score maps of the target image and the enhanced image, and the target semantic segmentation map corresponding to the target image. A total loss function is constructed for training from the appearance similarity loss between the target image and the source image, the feature matching loss between the target image and the enhanced image obtained based on the score maps, and the semantic alignment loss between the target semantic segmentation map and the enhanced semantic segmentation map.
During training, the randomness of the augmentation is gradually increased. Since the generator is homomorphic, when P_target and P_source are identical the source image can be reconstructed well. Here P_target is a general notation; in practice P_target = P_aug during training, so the two can be used interchangeably in the following description of the training process.
In the embodiment of the invention, the network is trained by alternating between a sampling mode and a reconstruction mode. In the sampling mode, i.e. the procedure described above, the generator takes the enhanced semantic segmentation map as guidance and generates a target image whose appearance matches the enhanced image I_aug and whose semantic layout matches the enhanced semantic segmentation map P_aug. The reconstruction mode follows the same procedure as the sampling mode, except that the given source image and the corresponding source semantic segmentation map are input directly, and the source image is reconstructed using the source semantic segmentation map.
An inference stage: the source image, the corresponding source semantic segmentation map and a specified semantic segmentation map are input into the semantic image analogy network, and an image with the same semantic layout as the specified semantic segmentation map is output.
After training is completed, given a semantic segmentation map with an arbitrary shape layout, the network model can generate a target image matching that segmentation map, preserving the content information of the source image while matching the target semantic layout. As shown in fig. 1, the trained network model can change the shape of the horse in the source image according to a given shape.
For the purposes of promoting an understanding, reference will now be made in detail to the principles and procedures of the present invention.
The technical principle of the invention is a generative adversarial network trained on a single image. For a single image, a generative adversarial network (i.e., a generative model) conditioned on its semantic segmentation map is trained, mainly comprising the above generator, auxiliary classifier and discriminator. A series of novel designs are adopted to establish the semantic association between the semantic segmentation map and image pixels, and this association is then used to recombine the image through the semantic segmentation map.
The semantic image analogy task is converted into a patch-level layout matching problem, with transformation guidance provided in the semantic segmentation domain. To this end, three main challenges need to be addressed: obtaining paired data for training the generative model from a single image, a conditioning method that transfers guidance from the segmentation domain to the image domain, and appropriate supervision of the generated samples (i.e., the output of the generator).
To accomplish this task, a novel method is proposed that integrates the following three basic components:
1) A self-supervised training framework with a progressive data augmentation strategy is designed. By alternating optimization with the augmented segmentation maps and the original segmentation map, the conditional GAN is successfully trained from a single image and generalizes well to unseen transformations.
2) A semantic feature transformation module is designed, which transfers the transformation parameters from the segmentation domain to the image domain.
3) A semantically-aware patch coherence loss is designed, which encourages the transformed image to contain only patches from the source image. Together with the semantic alignment constraint, it allows the generator to produce a realistic image with the target semantic layout.
As shown in fig. 1, the training phase mainly comprises the following steps:
step 1, giving a source image I source And corresponding source semantic segmentation image P source First, random expansion is performed to obtain an enhanced image I aug And enhancing semantically segmented images P aug Then segmenting the source semantic image P source And enhancing semantically segmented images P aug Input the same encoder E (i.e., E in FIG. 1) seg ) To extract features separately.
In an embodiment of the present invention, the random expansion operation includes one or more of the following operations: random turning, size adjustment, rotation and cutting. This progressive strategy may help the encoder learn the appearance of the source image in early iterations of training as the training step linearly increases the randomness of these operations.
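The following PyTorch sketch illustrates one way such a progressive paired augmentation could be implemented. It is only an illustrative sketch: the probability schedule, the scale and angle ranges, and the function name progressive_augment are assumptions rather than values specified by the patent. Images and segmentation maps are assumed to be float tensors with trailing spatial dimensions (..., H, W), the segmentation map holding class indices.

```python
# A minimal sketch of progressive paired augmentation (assumptions noted above).
import random
from torchvision.transforms import InterpolationMode
import torchvision.transforms.functional as TF

def progressive_augment(image, seg_map, step, total_steps):
    """Apply the same random flip / resize / rotate / crop to an image and its
    segmentation map, with the randomness growing linearly over training."""
    strength = min(1.0, step / float(total_steps))        # linearly increasing randomness
    h, w = image.shape[-2:]

    if random.random() < 0.5 * strength:                  # random horizontal flip
        image, seg_map = TF.hflip(image), TF.hflip(seg_map)

    scale = 1.0 + random.uniform(-0.3, 0.3) * strength    # random resize
    new_h, new_w = max(8, int(h * scale)), max(8, int(w * scale))
    image = TF.resize(image, [new_h, new_w])
    seg_map = TF.resize(seg_map, [new_h, new_w], interpolation=InterpolationMode.NEAREST)

    angle = random.uniform(-15.0, 15.0) * strength        # random rotation
    image, seg_map = TF.rotate(image, angle), TF.rotate(seg_map, angle)

    top = random.randint(0, max(0, new_h - h))            # random crop, then restore size
    left = random.randint(0, max(0, new_w - w))
    crop_h, crop_w = min(h, new_h), min(w, new_w)
    image = TF.resize(TF.crop(image, top, left, crop_h, crop_w), [h, w])
    seg_map = TF.resize(TF.crop(seg_map, top, left, crop_h, crop_w), [h, w],
                        interpolation=InterpolationMode.NEAREST)
    return image, seg_map
```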
Step 2: a Semantic Feature Transformation (SFT) module is designed to predict the transformation parameters in the image domain from the feature tensors.
The transformation parameters are explicitly transferred from the segmentation domain to the image domain by the SFT module, as shown in fig. 3. The transformation from the source semantic segmentation map P_source to the enhanced semantic segmentation map P_aug is modeled as a linear transformation at the feature level. Therefore, element-wise division and subtraction are performed on the feature tensor F_source of the source semantic segmentation map and the feature tensor F_aug of the enhanced semantic segmentation map to obtain a feature scaling tensor F_scale and a feature shift tensor F_shift for the subsequent down-sampling stages. For the l-th down-sampling stage, the following are computed:

F_scale^(l) = F_aug^(l) / F_source^(l)

F_shift^(l) = F_aug^(l) - F_source^(l)

where F_aug^(l) and F_source^(l) are the feature tensors extracted from F_aug and F_source at the l-th down-sampling stage; for example, if the number of down-sampling stages is K, the two feature tensors are divided into K parts, and each down-sampling stage takes out its corresponding part and performs the above computation.

The feature scaling tensor F_scale^(l) and the feature shift tensor F_shift^(l) are used to approximate the scaling factor γ_seg^(l) and the shift factor β_seg^(l) of the segmentation-map transformation. As shown in fig. 3, two SFT units are used to model the transfer from the segmentation domain to the image domain: they process F_scale^(l) and F_shift^(l) separately to obtain the scaling factor and shift factor of the image domain, (γ_img^(l), β_img^(l)). The parameters of the SFT units are learned through the training process.
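For illustration, a minimal PyTorch sketch of this computation is given below. The use of two small convolutional stacks as the learned SFT units and the chosen layer sizes are assumptions for the example; the patent only specifies that the units map the segmentation-domain factors to image-domain factors.

```python
# SFT module sketch (assumptions noted above).
import torch.nn as nn

class SFTUnit(nn.Module):
    """A small learned mapping from a segmentation-domain factor to the image domain."""
    def __init__(self, channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class SFTModule(nn.Module):
    def __init__(self, channels, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale_unit = SFTUnit(channels)               # processes F_scale^(l)
        self.shift_unit = SFTUnit(channels)               # processes F_shift^(l)

    def forward(self, f_source, f_aug):
        # Element-wise ratio and difference between the two segmentation feature tensors
        f_scale = f_aug / (f_source + self.eps)           # approximates gamma_seg^(l)
        f_shift = f_aug - f_source                        # approximates beta_seg^(l)
        gamma_img = self.scale_unit(f_scale)              # gamma_img^(l)
        beta_img = self.shift_unit(f_shift)               # beta_img^(l)
        return gamma_img, beta_img
```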
Step 3: the scaling factors and shift factors (γ_img, β_img) of the image domain obtained from the SFT module guide the encoder-decoder part of the generator G in mapping the source image I_source to the target image.

For the (l+1)-th down-sampling stage in the generator, the output feature tensor is given by:

F_img^(l+1) = γ_img^(l) ⊙ ( DS(F_img^(l)) - mean(DS(F_img^(l))) ) / std(DS(F_img^(l))) + β_img^(l)

where DS denotes the down-sampling module (i.e., the encoder), ⊙ denotes element-wise multiplication, and mean and std denote the mean and standard deviation, respectively.

The up-sampling module (i.e., the decoder) of the generator then maps the image feature tensor output by the down-sampling stages to the image domain, thereby generating the target image. In an embodiment of the present invention, the generator is an encoder-decoder structure with K down-sampling blocks and K up-sampling blocks; each block contains a 3×3 convolutional layer with a stride of 3 and a 4×4 convolutional or transposed convolutional layer with a stride of 2 for down-sampling or up-sampling, and each block also uses spectral normalization, batch normalization and leaky ReLU activation operations. Illustratively, the initial channel number is 32 and is doubled during down-sampling.

For example, K may be set to 3. Each of the three down-sampling blocks receives F_img^(l) and outputs F_img^(l+1) according to the above formula; the up-sampling blocks then take the resulting feature tensor as input, and the output is the target image I_target.
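The sketch below illustrates one possible form of a single modulated down-sampling step consistent with the formula above: the features are down-sampled, normalized by their own mean and standard deviation, and then scaled and shifted by the image-domain factors from the SFT module. The kernel size, channel handling and the class name ModulatedDownBlock are assumptions for the example, and only the 4×4 stride-2 down-sampling convolution of each block is shown.

```python
# One modulated down-sampling step of the generator (assumptions noted above).
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ModulatedDownBlock(nn.Module):
    """Down-samples the image features and modulates them with the image-domain
    factors from the SFT module. gamma_img / beta_img are expected to have the
    same spatial size as the down-sampled features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1))
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, f_img, gamma_img, beta_img, eps=1e-6):
        x = self.act(self.down(f_img))                    # DS(F_img^(l))
        mean = x.mean(dim=(2, 3), keepdim=True)           # per-channel mean
        std = x.std(dim=(2, 3), keepdim=True) + eps       # per-channel std
        return gamma_img * (x - mean) / std + beta_img    # F_img^(l+1)
```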
Step 4: the discriminator D takes the enhanced image I_aug as a real sample and the generated target image I_target as a fake sample. At the same time, the generated image is also fed into the auxiliary classifier S to predict its segmentation map.
In an embodiment of the present invention, the discriminator is a fully convolutional PatchGAN (Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5967-5976).
In an embodiment of the present invention, semantic segmentation is performed in the auxiliary classifier (the Segmentation Network in fig. 2) using a simplified version of the DeepLabV3+ architecture (Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Vol. 11211. 833-851).
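As an illustration of such a fully convolutional, patch-level discriminator, the following sketch outputs a score map rather than a single scalar. The depth, channel widths and the use of spectral normalization on the intermediate layers are assumptions for the example, not details taken from the cited PatchGAN paper or from the patent.

```python
# Fully convolutional patch-level discriminator sketch (assumptions noted above).
import torch.nn as nn
from torch.nn.utils import spectral_norm

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3, base_ch=64, n_layers=3):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base_ch, 4, stride=2, padding=1),
                  nn.LeakyReLU(0.2, inplace=True)]
        ch = base_ch
        for _ in range(n_layers - 1):
            layers += [spectral_norm(nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1)),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]   # one realness score per patch
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)   # score map of shape (N, 1, H', W')
```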
Step 5: construct the total loss function for training the designed self-supervised network.
According to the task setting of semantic image analogy, the generated image should meet the following requirements: 1) it is consistent with the content of the source image; 2) its semantic layout is aligned with the target segmentation map. Therefore, a patch coherence loss is proposed to measure the appearance similarity between the generated image and the source image, and a semantic alignment loss is proposed to measure the consistency between the segmentation map predicted from the target image by the auxiliary classifier and the target segmentation map. Specifically:
1) The appearance similarity between the generated image and the source image is measured by the patch (image-block) coherence loss: if the generator produces an image block that cannot be found in the source image, this constraint penalizes the generator G. It is defined as the average of the lower bound of the image-block distances between the source image and the target image:

L_patch = (1 / N_target) · Σ_{U ⊂ I_target} min_{V ⊂ I_source, V_class = U_class} d(U, V)

where N_target is the number of image blocks in the target image I_target, I_source denotes the source image, G(I_source) = I_target, U_class and V_class denote the segmentation labels of image blocks U and V, and d(·) is a distance metric function. This loss relaxes the positional dependence of the pixel distance; instead, the image is treated as a bag of visual feature words. For each image block in the target image, a nearest-neighbor search is run to find the most similar image block with the same class label in the source image, and the average of these distances is then taken. It was found empirically that features from a pre-trained VGG network (Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR)) produce good results, although other feature descriptors are also applicable.
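The sketch below illustrates one way this loss could be computed. For brevity it treats individual positions of a VGG feature map as the "image blocks", assumes a batch size of 1, uses squared L2 distance as d(·), and takes the early VGG-19 slice features[:9]; all of these choices are assumptions for the example rather than specifics of the patent.

```python
# Patch coherence loss sketch (assumptions noted above).
import torch
import torch.nn.functional as F
import torchvision

# Early VGG-19 layers as the feature descriptor (an assumption; any descriptor works).
# Proper ImageNet normalization of the inputs is omitted for brevity.
vgg_features = torchvision.models.vgg19(weights="DEFAULT").features[:9].eval()

def patch_coherence_loss(target_img, source_img, target_seg, source_seg):
    """target_seg / source_seg: (1, H, W) integer class maps; batch size 1 assumed."""
    with torch.no_grad():
        f_src = vgg_features(source_img)                  # fixed reference features
    f_tgt = vgg_features(target_img)                      # gradients flow to the generator
    h, w = f_tgt.shape[-2:]
    seg_t = F.interpolate(target_seg[None].float(), (h, w), mode="nearest")[0, 0].long()
    seg_s = F.interpolate(source_seg[None].float(), (h, w), mode="nearest")[0, 0].long()

    losses = []
    for cls in torch.unique(seg_t):
        t = f_tgt[0, :, seg_t == cls]                     # (C, Nt) target "blocks" of this class
        s = f_src[0, :, seg_s == cls]                     # (C, Ns) source "blocks" of this class
        if t.numel() == 0 or s.numel() == 0:
            continue
        dists = torch.cdist(t.t()[None], s.t()[None])[0] ** 2   # (Nt, Ns) squared L2 distances
        losses.append(dists.min(dim=1).values.mean())           # nearest-neighbour average
    return torch.stack(losses).mean() if losses else f_tgt.sum() * 0.0
```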
2) The auxiliary classifier is used to predict the segmentation map of the target image (i.e., the target semantic segmentation map), and the cross-entropy (CE) loss between the predicted segmentation map and the enhanced segmentation map is computed. The semantic alignment loss of the generator is defined as:

L_seg = CE(P_predict, P_aug)

where CE denotes the cross-entropy loss and P_predict = S(G(I_source)) = S(I_target) is the target semantic segmentation map, i.e., the output of the auxiliary classifier S.
3) The least-squares GAN loss L_GAN is used as the adversarial constraint, and features are extracted from the discriminator to compute the feature matching loss L_fm between the enhanced image and the generated image.
The total loss function is:

L_total = L_patch + λ_seg · L_seg + λ_GAN · L_GAN + λ_fm · L_fm

where L_patch denotes the appearance similarity (patch coherence) loss, L_seg denotes the semantic alignment loss, L_GAN denotes the adversarial loss and L_fm denotes the feature matching loss; λ_seg, λ_GAN and λ_fm are hyper-parameters, all set to 1.0 in the experiments.
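As an illustration of how the four terms could be combined for the generator update, the sketch below assembles them with the default weights of 1.0. Here patch_coherence_loss is the sketch given earlier, while discriminator_feats (assumed to return a score map plus a list of intermediate features) and aux_classifier are hypothetical placeholders rather than components specified by the patent.

```python
# Combining the loss terms for the generator update (assumptions noted above).
import torch
import torch.nn.functional as F

def generator_loss(i_target, i_aug, i_source, p_aug, p_source,
                   discriminator_feats, aux_classifier,
                   lambda_seg=1.0, lambda_gan=1.0, lambda_fm=1.0):
    """p_aug / p_source: (1, H, W) integer class maps (batch size 1 assumed)."""
    # Appearance similarity: patch coherence between target and source image
    l_patch = patch_coherence_loss(i_target, i_source, p_aug, p_source)

    # Semantic alignment: cross entropy between predicted and enhanced segmentation
    p_predict = aux_classifier(i_target)                  # (1, num_classes, H, W) logits
    l_seg = F.cross_entropy(p_predict, p_aug)

    # Least-squares adversarial loss on the fake score map
    fake_scores, fake_feats = discriminator_feats(i_target)
    l_gan = ((fake_scores - 1.0) ** 2).mean()

    # Feature matching between the enhanced (real) image and the generated image
    with torch.no_grad():
        _, real_feats = discriminator_feats(i_aug)
    l_fm = sum(F.l1_loss(f, r) for f, r in zip(fake_feats, real_feats)) / len(fake_feats)

    return l_patch + lambda_seg * l_seg + lambda_gan * l_gan + lambda_fm * l_fm
```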
In the embodiment of the present invention, the network is trained by alternating between a sampling mode and a reconstruction mode.
In the sampling mode, i.e. steps 1 to 5 described above, the generator takes the enhanced semantic segmentation map as guidance and generates a target image whose appearance matches the enhanced image I_aug and whose semantic layout matches the enhanced semantic segmentation map P_aug.
The reconstruction mode follows the same procedure as the sampling mode, except that the random augmentation in step 1 is not performed: the given source image and the corresponding source semantic segmentation map are input directly, and the source image is reconstructed using the source semantic segmentation map. The total loss function also differs slightly: the appearance similarity loss is replaced by an L1 reconstruction loss between the output reconstructed image and the source image, and the feature matching loss and the semantic alignment loss are computed between the target image and the source image, and between the target semantic segmentation map and the source semantic segmentation map, respectively.
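The following sketch outlines the alternating sampling / reconstruction scheme. The components encoder, generator, discriminator and aux_classifier are the placeholders assumed in the earlier sketches, images are (1, 3, H, W) float tensors, segmentation maps are (1, H, W) float tensors holding class indices, and the iteration count and strict 1:1 alternation are illustrative assumptions; the patch coherence and feature matching terms of the full objective are omitted here for brevity.

```python
# Alternating sampling / reconstruction training loop (assumptions noted above).
import torch
import torch.nn.functional as F

def train_single_image(i_source, p_source, encoder, generator, discriminator,
                       aux_classifier, opt_g, opt_d, total_steps=20000):
    for step in range(total_steps):
        if step % 2 == 0:
            # Sampling mode: progressive paired augmentation provides the target layout
            i_ref, p_target = progressive_augment(i_source, p_source, step, total_steps)
        else:
            # Reconstruction mode: reconstruct the source image from its own layout
            i_ref, p_target = i_source, p_source

        # The placeholder encoder is assumed to embed the raw class maps internally
        i_out = generator(i_source, encoder(p_source), encoder(p_target))

        # Discriminator update: least-squares GAN on real / fake score maps
        opt_d.zero_grad()
        d_loss = ((discriminator(i_ref) - 1.0) ** 2).mean() + (discriminator(i_out.detach()) ** 2).mean()
        d_loss.backward()
        opt_d.step()

        # Generator update: adversarial + semantic alignment (+ L1 in reconstruction mode)
        opt_g.zero_grad()
        g_loss = ((discriminator(i_out) - 1.0) ** 2).mean() \
               + F.cross_entropy(aux_classifier(i_out), p_target.long())
        if step % 2 == 1:
            g_loss = g_loss + F.l1_loss(i_out, i_source)
        g_loss.backward()
        opt_g.step()
```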
During the inference phase, the network model parameters are fixed. Given a source image I_source, the corresponding source semantic segmentation map P_source, and a specified semantic segmentation map (which can be obtained by editing P_source or by other means), the two semantic segmentation maps are input into the encoder E, and then steps 2 to 3 are executed; the obtained image matches the specified semantic segmentation map while retaining the content information of the source image.
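A minimal sketch of this inference step, using the placeholder components assumed above: the frozen encoder and generator recombine the source image so that it follows the layout of the edited segmentation map.

```python
# Inference with the trained, frozen model (placeholder components from above).
import torch

@torch.no_grad()
def semantic_image_analogy(i_source, p_source, p_edited, encoder, generator):
    """Recombine i_source so that it follows the layout of p_edited."""
    encoder.eval()
    generator.eval()
    return generator(i_source, encoder(p_source), encoder(p_edited))
```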
To verify the effectiveness of the present invention, the performance of the above-described method of the present invention was evaluated in terms of numerical indicators and visual effects, respectively.
The semantic image analogy task was applied to images from different datasets, including COCO-Stuff, ADE20K, CelebAMask-HQ, and the web (i.e., natural pictures randomly selected from the web). The results of the above method of the invention and the comparison methods were evaluated in two respects: 1) appearance similarity between the source image and the target image; 2) semantic consistency between the target image and the target segmentation map.
To evaluate the appearance similarity of the generated image to the source image, a user study was conducted as follows. Ten pairs of images with the same class labels were randomly selected from the COCO-Stuff dataset. For each pair, one image is used as the source image and the other is used to provide the target layout. The image analogies method (Aaron Hertzmann, Charles E. Jacobs, Nuria Oliver, Brian Curless, and David Salesin. 2001. Image Analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. 327-340; IA) and the deep image analogy method (Jing Liao, Yuan Yao, Lu Yuan, Gang Hua, and Sing Bing Kang. 2017. Visual Attribute Transfer Through Deep Image Analogy. ACM Trans. Graph. 36, 4 (2017), 120:1-120:15; DIA) were used to transfer the source image into the layout of the other image. IA and DIA are the two works most closely related to the above method of the invention. DIA requires a pair of pictures as source and target, whereas the above method of the invention and IA require only one source picture and two segmentation maps. The results are displayed in random order, and 20 users were asked to rank their appearance similarity to the source image. The average ranking (Avg. User Ranking) of each method over all images and users was then calculated. Table 1 shows the superiority of the above method of the invention (Ours) over the two competitors.
TABLE 1. Performance under the semantic alignment metrics and the user subjective evaluation
To evaluate the semantic consistency of the generated image with the target segmentation map, the segmentation map of the generated image was predicted using the panoptic segmentation model of Detectron2, and then the pixel-wise accuracy (Pixel Accuracy) and mean intersection-over-union (mIoU) against the target segmentation were calculated. The images used for evaluation are the same as those in the user study. As shown in Table 1, the method achieves the highest accuracy.
In FIGS. 4, 5 and 6, the above method of the invention is compared with the currently optimal image analogy algorithms IA and DIA, with the single-image generative adversarial network model SinGAN (Tamar Rott Shaham, Tali Dekel, and Tomer Michaeli. 2019. SinGAN: Learning a Generative Model from a Single Natural Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 4569-4579), and with the segmentation-map-to-image translation model SPADE (Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic Image Synthesis with Spatially-Adaptive Normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2337-2346). IA and DIA, which rely on low-level image features and on matching pre-trained deep features respectively, tend to produce repetitive textures or unrealistic results. Without considering the semantic structure, SinGAN editing often alters the unedited regions and produces undesirable textures, or simply blurs the pasted objects, which results in outputs very similar to naive editing. Although SPADE is semantically consistent with the target layout, its content is limited to the training dataset and the appearance of the source image is lost. The method of the invention produces an image that is faithful in appearance to the source image and semantically consistent with the target layout.
The method can semantically manipulate an image through its segmentation map. Instances can be moved, resized or deleted in the source semantic segmentation map to obtain the target layout. As shown in fig. 7, the above method of the invention produces high-quality results under arbitrary semantic modifications while well preserving the local appearance of the modified instances.
The flexible semantic image analogy task setting enables various applications. Thanks to the dense conditional input, pixel-level control can be used to recombine image blocks in an image. In figs. 8, 9 and 10, three applications of the above method of the invention are illustrated, including: 1) object removal, where unwanted objects can be easily removed by changing their class labels in the semantic segmentation map to a background class; 2) face editing, where a face image can be edited by altering the shape of the face in the segmentation map; and 3) edge-to-image generation, where other spatial conditions (e.g., edge maps) can be used as the conditional input.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A semantic image analogy method based on a single-image generative adversarial network, characterized in that it is realized by a network model formed by an encoder, a generator G, an auxiliary classifier S and a discriminator D; wherein:
a training stage: during each training iteration, the same random augmentation operation is applied to a given source image and the corresponding source semantic segmentation map to obtain a corresponding enhanced image and an enhanced semantic segmentation map; the feature tensors of the source semantic segmentation map and the enhanced semantic segmentation map are extracted by the same encoder, and a semantic feature transformation module in the generator predicts transformation parameters in the image domain based on the two feature tensors, so that a target image is generated from the source image under the guidance of the transformation parameters; the target image is fed into the discriminator and the auxiliary classifier, which respectively predict score maps of the target image and the enhanced image, and the target semantic segmentation map corresponding to the target image; a total loss function is constructed for training from the appearance similarity loss between the target image and the source image, the feature matching loss between the target image and the enhanced image obtained based on the score maps, and the semantic alignment loss between the target semantic segmentation map and the enhanced semantic segmentation map;
an inference stage: the source image, the corresponding source semantic segmentation map and a specified semantic segmentation map are input into the semantic image analogy network, and an image with the same semantic layout as the specified semantic segmentation map is output;
wherein the predicting, by the semantic feature transformation module in the generator, of the transformation parameters in the image domain based on the two feature tensors comprises: performing element-wise division and subtraction on the feature tensor F_source of the source semantic segmentation map and the feature tensor F_aug of the enhanced semantic segmentation map to obtain a feature scaling tensor F_scale and a feature shift tensor F_shift for the subsequent down-sampling stages; for the l-th down-sampling stage, computing:

F_scale^(l) = F_aug^(l) / F_source^(l)

F_shift^(l) = F_aug^(l) - F_source^(l)

where F_aug^(l) and F_source^(l) are the feature tensors extracted from F_aug and F_source at the l-th down-sampling stage;

using the feature scaling tensor F_scale^(l) and the feature shift tensor F_shift^(l) as the scaling factor γ_seg^(l) and the shift factor β_seg^(l) of the segmentation-map transformation; and modeling the transfer process from the segmentation domain to the image domain with two semantic feature transformation modules, which process F_scale^(l) and F_shift^(l) respectively to obtain the scaling factor and shift factor of the image domain, (γ_img^(l), β_img^(l)).
2. The method of claim 1, wherein the random augmentation operation comprises one or more of the following operations: random flipping, resizing, rotation and cropping.
3. The method of claim 1, wherein in the (l+1)-th down-sampling stage of the generator, the output feature tensor is obtained by the following formula:

F_img^(l+1) = γ_img^(l) ⊙ ( DS(F_img^(l)) - mean(DS(F_img^(l))) ) / std(DS(F_img^(l))) + β_img^(l)

where DS denotes the down-sampling module, ⊙ denotes element-wise multiplication, and mean and std denote the mean and standard deviation, respectively;

an up-sampling module of the generator maps the image feature tensor output by the down-sampling stages to the image domain, thereby generating the target image;

the generator is an encoder-decoder structure having K down-sampling blocks and K up-sampling blocks; each block contains a 3×3 convolutional layer with a stride of 3 and a 4×4 convolutional or transposed convolutional layer with a stride of 2 for down-sampling or up-sampling, and each block also uses spectral normalization, batch normalization and leaky ReLU activation operations.
4. The method of claim 1, wherein the appearance similarity between the generated image and the source image is measured by the image-block coherence loss, defined as the average of the lower bound of the image-block distances between the source image and the target image:

L_patch = (1 / N_target) · Σ_{U ⊂ I_target} min_{V ⊂ I_source, V_class = U_class} d(U, V)

where N_target is the number of image blocks in the target image I_target, I_source denotes the source image, G(I_source) = I_target, U_class and V_class denote the segmentation labels of image blocks U and V, and d(·) is a distance metric function.
5. The method of claim 1, wherein the semantic alignment loss is expressed as:

L_seg = CE(S(G(I_source)), P_aug)

where CE denotes the cross-entropy loss, S denotes the auxiliary classifier, I_source denotes the source image and P_aug denotes the enhanced semantic segmentation map.
6. The method of claim 1, wherein the total loss function is expressed as:

L_total = L_patch + λ_seg · L_seg + λ_GAN · L_GAN + λ_fm · L_fm

where L_patch denotes the appearance similarity loss, L_seg denotes the semantic alignment loss, L_fm denotes the feature matching loss, and L_GAN denotes the least-squares GAN loss used as the adversarial constraint; λ_seg, λ_GAN and λ_fm are all hyper-parameters.
7. The semantic image analogy method based on a single-image generative adversarial network according to claim 1, characterized in that the network is trained by alternating between a sampling mode and a reconstruction mode;
in the sampling mode, the generator takes the enhanced semantic segmentation map as guidance and generates a target image whose appearance matches the enhanced image I_aug and whose semantic layout matches the enhanced semantic segmentation map P_aug;
the reconstruction mode follows the same procedure as the sampling mode: the given source image and the corresponding source semantic segmentation map are input directly, and the source image is reconstructed using the source semantic segmentation map; in the total loss function, the appearance similarity loss is replaced by an L1 reconstruction loss between the output reconstructed image and the source image, and the feature matching loss and the semantic alignment loss are computed between the target image and the source image, and between the target semantic segmentation map and the source semantic segmentation map, respectively.
CN202011001562.XA 2020-09-22 2020-09-22 Semantic image analogy method based on a single-image generative adversarial network Active CN112102303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011001562.XA CN112102303B (en) Semantic image analogy method based on a single-image generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011001562.XA CN112102303B (en) Semantic image analogy method based on a single-image generative adversarial network

Publications (2)

Publication Number Publication Date
CN112102303A CN112102303A (en) 2020-12-18
CN112102303B true CN112102303B (en) 2022-09-06

Family

ID=73755788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011001562.XA Active CN112102303B (en) Semantic image analogy method based on a single-image generative adversarial network

Country Status (1)

Country Link
CN (1) CN112102303B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818997A (en) * 2021-01-29 2021-05-18 北京迈格威科技有限公司 Image synthesis method and device, electronic equipment and computer-readable storage medium
US11593945B2 (en) 2021-03-15 2023-02-28 Huawei Cloud Computing Technologies Co., Ltd. Methods and systems for semantic augmentation of images
CN113011429B (en) * 2021-03-19 2023-07-25 厦门大学 Real-time street view image semantic segmentation method based on staged feature semantic alignment
CN113313147B (en) * 2021-05-12 2023-10-20 北京大学 Image matching method based on depth semantic alignment network model
CN113610704B (en) * 2021-09-30 2022-02-08 北京奇艺世纪科技有限公司 Image generation method, device, equipment and readable storage medium
CN114596379A (en) * 2022-05-07 2022-06-07 中国科学技术大学 Image reconstruction method based on depth image prior, electronic device and storage medium
US20230394811A1 (en) * 2022-06-02 2023-12-07 Hon Hai Precision Industry Co., Ltd. Training method and electronic device
CN115761239B (en) * 2023-01-09 2023-04-28 深圳思谋信息科技有限公司 Semantic segmentation method and related device
CN117765372A (en) * 2024-02-22 2024-03-26 广州市易鸿智能装备股份有限公司 Industrial defect sample image generation method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377537A (en) * 2018-10-18 2019-02-22 云南大学 Style transfer method for heavy color painting
JP2019046269A (en) * 2017-09-04 2019-03-22 株式会社Soat Machine learning training data generation
CN110197226A (en) * 2019-05-30 2019-09-03 厦门大学 A kind of unsupervised image interpretation method and system
EP3686848A1 (en) * 2019-01-25 2020-07-29 Nvidia Corporation Semantic image synthesis for generating substantially photorealistic images using neural networks

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8289326B2 (en) * 2007-08-16 2012-10-16 Southwest Research Institute Image analogy filters for terrain modeling

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019046269A (en) * 2017-09-04 2019-03-22 株式会社Soat Machine learning training data generation
CN109377537A (en) * 2018-10-18 2019-02-22 云南大学 Style transfer method for heavy color painting
EP3686848A1 (en) * 2019-01-25 2020-07-29 Nvidia Corporation Semantic image synthesis for generating substantially photorealistic images using neural networks
CN111489412A (en) * 2019-01-25 2020-08-04 辉达公司 Semantic image synthesis for generating substantially realistic images using neural networks
CN110197226A (en) * 2019-05-30 2019-09-03 厦门大学 A kind of unsupervised image interpretation method and system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
SinGAN: Learning a Generative Model From a Single Natural Image;Tamar Rott Shaham et al;《2019 IEEE/CVF International Conference on Computer Vision (ICCV)》;20200227;第4569-4579页 *
The Conditional Analogy GAN: Swapping Fashion Articles on People Images;Nikolay Jetchev et al;《2017 IEEE International Conference on Computer Vision Workshops (ICCVW)》;20180123;第2287-2292页 *
Visual attribute transfer through deep image analogy;Jing Liao et al;《ACM Transactions on Graphics (TOG)》;20170831;第36卷(第4期);第1-15页 *
平行图像：图像生成的一个新型理论框架 [Parallel Images: A New Theoretical Framework for Image Generation]; 王坤峰 et al.; 《模式识别与人工智能》 [Pattern Recognition and Artificial Intelligence]; 2017-07-15 (No. 07); pp. 3-13 *
生成对抗网络及其应用研究综述 [A Survey of Generative Adversarial Networks and Their Applications]; 淦艳 et al.; 《小型微型计算机系统》 [Journal of Chinese Computer Systems]; 2020-05-29 (No. 06); pp. 15-21 *
融合语义标签和噪声先验的图像生成 [Image Generation Fusing Semantic Labels and Noise Priors]; 张素素 et al.; 《计算机应用》 [Journal of Computer Applications]; 2020-01-09 (No. 05); pp. 195-203 *

Also Published As

Publication number Publication date
CN112102303A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112102303B (en) Semantic image analogy method based on a single-image generative adversarial network
Gao et al. Get3d: A generative model of high quality 3d textured shapes learned from images
CN110222588A (en) A kind of human face sketch image aging synthetic method, device and storage medium
Ahmadi et al. Context-aware saliency detection for image retargeting using convolutional neural networks
Zhu et al. Learning deep patch representation for probabilistic graphical model-based face sketch synthesis
Shen et al. Clipgen: A deep generative model for clipart vectorization and synthesis
Zheng et al. Semantic layout manipulation with high-resolution sparse attention
Bende et al. VISMA: A Machine Learning Approach to Image Manipulation
Xu et al. Generative image completion with image-to-image translation
Dey Python image processing cookbook: over 60 recipes to help you perform complex image processing and computer vision tasks with ease
Wang et al. Generative image inpainting with enhanced gated convolution and Transformers
Galatolo et al. Tetim-eval: A novel curated evaluation data set for comparing text-to-image models
Wang et al. Diverse image inpainting with normalizing flow
Ueno et al. Continuous and Gradual Style Changes of Graphic Designs with Generative Model
Hwang et al. WeatherGAN: Unsupervised multi-weather image-to-image translation via single content-preserving UResNet generator
Saint et al. 3dbooster: 3d body shape and texture recovery
Ren et al. Example-based image synthesis via randomized patch-matching
Chen et al. Deep3DSketch+: rapid 3D modeling from single free-hand sketches
Li et al. Scraping Textures from Natural Images for Synthesis and Editing
Zhou et al. Neural Texture Synthesis with Guided Correspondence
Jiang et al. Image inpainting based on cross-hierarchy global and local aware network
Luhman et al. High fidelity image synthesis with deep vaes in latent space
Cao et al. An improved defocusing adaptive style transfer method based on a stroke pyramid
Xu et al. Draw2Edit: Mask-Free Sketch-Guided Image Manipulation
Shi et al. Intelligent layout generation based on deep generative models: A comprehensive survey

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant