CN115424109A - Deformable instance-level image translation method


Info

Publication number: CN115424109A
Application number: CN202210987590.6A
Authority: CN (China)
Prior art keywords: mask, domain, image, target domain, instance
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 俞再亮, 苏思桐, 李海燕, 靖伟, 刘玉, 宋井宽
Current Assignee: University of Electronic Science and Technology of China; Zhejiang Lab
Original Assignee: University of Electronic Science and Technology of China; Zhejiang Lab
Application filed by: University of Electronic Science and Technology of China; Zhejiang Lab
Priority to: CN202210987590.6A
Publication of: CN115424109A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image processing, and in particular to a deformable instance-level image translation method. It addresses the prior-art problems that an instance is difficult to deform and cannot be kept consistent with the mask information when the difference between domains is too large. The edge information of the foreground, the background mask information, and the feature code of the target-domain label information are fused to obtain a hybrid mask. The background feature and the hybrid mask are then input into a generator; the decoding network of the generator decodes the input background feature while extracting additional information from the hybrid mask, applies the extracted additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output with this additional information, thereby obtaining fusion information comprising the foreground information corresponding to the target domain mask and the position information indicating the foreground position. Finally, the generated foreground information is fused with the source domain background image using the position information, and a target domain picture retaining the source domain background is output.

Description

Deformable instance-level image translation method
Technical Field
The invention relates to the field of image processing, in particular to a deformable instance-level image translation method.
Background
In recent years, with the rapid development of deep neural networks, image processing techniques based on neural networks have gradually replaced the time-consuming and labor-intensive traditional image processing methods, making it possible to edit high-level image semantics, for example through the image translation task. The image translation task aims to convert a source domain image into a target domain image through a designed model, that is, to learn the mapping between a source domain and a target domain, where a domain is a set of pictures sharing certain characteristics. Since generative adversarial networks were proposed, a significant portion of visual tasks can be formulated as image translation tasks, such as style transfer, super-resolution, label-guided image generation, and image inpainting.
The instance-level image translation task mainly acts on specific foreground instances in an image, and is generally divided into two types:
1. the background information is not constrained while the foreground instance is translated;
2. the original background is preserved while a particular foreground instance is translated.
For the first task, a common design paradigm is that different image translation processes are performed on the foreground and the background on the premise that the model learns to distinguish the foreground instance and the background in the input image. However, this approach can sometimes erroneously distinguish between foreground and background, thereby generating unintended results.
Therefore, the second task has broader application prospects in real life; for example, virtual try-on and post-processing of video images often require replacing a specific instance while preserving the original background. However, early models only achieve translation of low-level characteristics such as texture and pattern, and cannot give reasonable results when facing changes in high-level characteristics such as foreground shape.
To meet this demand, the unsupervised deformable instance image translation task has been defined in recent years. This task aims to translate the foreground to the target domain while preserving the background information, accompanied by apparent changes in foreground shape.
From this point of view, recent work has attempted to introduce an external mask as guidance information and to guide the model to learn the cross-domain mapping of shape by setting the target domain mask as a learning target. However, experimental verification shows that current models have several problems. First, unreasonable target domain masks are very likely to appear in the generated results, indicating that the models' ability to generate target domain masks is insufficient. Although these methods guide cross-domain mask generation by constructing mask data pairs in the training stage, the mask information is introduced only by concatenating it with the image information, so the mask information is not fully exploited. Second, the translated foreground image often fails to match the generated mask, which reflects that the target domain mask does not provide reasonable guidance and constraint for the generated image. In addition, when facing multiple foreground instances in a single image, current methods adopt a serial translation mode, which greatly increases the training time overhead.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a deformable instance-level image translation method that solves the problems that an instance is difficult to deform and cannot be kept consistent with the mask information when the difference between domains is too large.
The technical solution adopted by the invention to solve this problem is as follows:
A deformable instance-level image translation method comprises the following steps:
C1, inputting an image-mask pair of a source domain and the label information and instance masks of a target domain into an image translation model, wherein the image-mask pair comprises a group of instance masks and the corresponding image; the image translation model comprises a pre-trained image completion model and an instance generation network;
C2, based on the source-domain image-mask pair input in the step C1, first aggregating all instance masks of the source domain to obtain a source domain mask, then removing the foreground of the corresponding source domain image according to the source domain mask to obtain a residual image with the masked part removed, and completing the residual image with the image completion model to obtain the background image of the source domain;
C3, based on the target-domain instance masks input in the step C1, first aggregating all instance masks of the target domain to obtain a target domain mask, and then inputting the target domain mask and the source domain background image B_S obtained in the step C2 into the instance generation network; the instance generation network comprises an encoder and a generator and processes them as follows:
extracting the background feature of the source domain background image through the encoder, based on the input source domain background image;
based on the input target domain mask, obtaining the edge information of the foreground through an edge extraction algorithm; obtaining the background mask information by inverting the target domain mask; feature-encoding the label information of the target domain; and then fusing the foreground edge information, the background mask information, and the feature code of the target-domain label information to obtain a hybrid mask;
inputting the background feature and the hybrid mask into the generator, which comprises a decoding network; the decoding network decodes the input background feature while extracting additional information from the hybrid mask, applies the extracted additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output with this additional information, thereby obtaining fusion information comprising the foreground information corresponding to the target domain mask and the position information indicating the foreground position;
finally, fusing the generated foreground information with the source domain background image using the position information, and outputting a target domain image retaining the source domain background.
Specifically, the image completion model is the HiFill model; the encoder of the instance generation network is a multi-layer residual neural network; the edge extraction algorithm is the Canny edge detection algorithm; the label information of the target domain is feature-encoded by one-hot encoding; and the foreground edge information, the background mask information, and the feature code of the target-domain label information are fused through matrix multiplication.
Specifically, the generator of the instance generation network is an APADE-ResNet network, a neural network formed by adding an APADE block between each convolution layer and ReLU layer on the basis of a ResNet network; that is, the decoding network of the generator is a ResNet, and additional information is extracted from the hybrid mask through the APADE blocks.
The APADE block has two inputs: one is the output of the convolution layer, and the other is the hybrid mask.
The hybrid mask is scaled to the input feature dimensions of the APADE block and input to the APADE block, where it first passes through a convolution layer Conv_s; the output of Conv_s is then fed into two independent convolution layers Conv_1 and Conv_2 to generate two embedding vectors γ and β.
After entering the APADE block, the output of the convolution layer first passes through a batch normalization layer, and the output of the batch normalization layer is then combined with the embedding vectors γ and β according to the following formula to give the output of the APADE block:
F_out = γ·F_in + β
wherein F_in denotes the output of the batch normalization layer and F_out denotes the output of the APADE block.
Further, the APADE-ResNet network is a neural network formed by adding APADE blocks on the basis of two-layer ResNet blocks, that is:
each layer of the APADE-ResNet network comprises a forward neural network branch and a shortcut branch; the forward branch comprises a first convolution layer, a first APADE block, a first ReLU layer, a second convolution layer, a second APADE block, and a second ReLU layer connected in series, and the shortcut branch builds a skip connection between the input and the output of the second APADE block, adding the input to the output of the second APADE block to form the input of the second ReLU layer.
Further, the APADE-ResNet network is a multi-layer APADE-ResNet network; the input of the first layer is the background feature and the hybrid mask, and the input of each layer after the first is the output of the previous layer together with the hybrid mask; the layers are enlarged by upsampling so that the final output has the same size as the source domain image.
Further, the instance generation network is trained according to the following steps:
D1, training sample data preparation:
acquiring data from a data set, defining domains according to the categories of scenes, and constructing an image-mask pair for each domain, wherein the image-mask pair comprises a group of instance masks and the corresponding images;
D2, inputting sample data and training the instance generation network, wherein the training comprises the following steps:
D21, inputting the image-mask pairs of at least one domain, including the target domain specified by the task to be translated;
D22, processing the image-mask pair of each input domain according to the following steps:
taking the image-mask pair of the input domain as the source-domain image-mask pair input to the image translation model; taking the label information and instance masks of the input domain as the target-domain label information and instance masks input to the image translation model; generating an image I'_i from the input by the image translation model, wherein i denotes the i-th input sample data corresponding to the generated image;
taking the image of the input domain's image-mask pair as the real image I_i, and forming a positive and negative sample pair from the real image I_i and the generated image I'_i;
D23, inputting the positive and negative sample pairs obtained in the step D22 into a discriminator, and performing adversarial training on the instance generation network;
D24, the training ends when the set number of iterations is reached or the instance generation network converges; otherwise, return to the step D22.
Preferably, in the step D23, the loss function of the adversarial training is:
L^G = L^G_adv + λ_fmap·L_fmap
L^D = L^D_adv
wherein L^G is the loss function of the generator, L^D is the loss function of the discriminator, L^G_adv is the adversarial loss function of the generator, L^D_adv is the adversarial loss function of the discriminator, L_fmap is the fusion-map loss function, and λ_fmap is a hyper-parameter used to adjust the balance of the loss functions;
the adversarial loss function of the generator L^G_adv and the adversarial loss function of the discriminator L^D_adv are computed as: [formulas given as images in the original], wherein D_img denotes the discriminator used for adversarial training of the instance generation network, I'_i denotes the generated image produced from the image-mask pair of the i-th input domain, I_i denotes the image of the image-mask pair of the i-th input domain, and P is the number of sample data input in the step D21;
the fusion-map loss function L_fmap is computed as: [formula given as an image in the original], wherein M_i denotes the input-domain mask formed by aggregating the instance masks of the image-mask pair of the i-th domain, and α'_i denotes the position information indicating the foreground position in the fusion information generated by the generator based on the image-mask pair of the i-th domain.
In order to solve the problem that target domain masks are unavailable in some scenarios, the invention also provides a deformable instance-level image translation method comprising two stages, mask deformation and image generation, with the following steps:
A. Mask deformation
A1, inputting the instance masks of a source domain and a target domain label into a pre-trained mask deformation network; the mask deformation network comprises an encoder and a generator;
A2, the mask deformation network deforms the masks according to the following steps:
A21, aggregating all instance masks of the source domain to obtain a source domain mask; extracting features from the source domain mask through the encoder to obtain the overall feature F_img of the source domain mask; fusing each instance mask of the source domain with the overall feature F_img respectively to obtain the instance mask feature F_mask(i) corresponding to each instance; then feature-encoding the label information of the target domain, and embedding the feature code of the target-domain label information into each instance mask feature F_mask(i);
A22, inputting each instance mask feature fused with the label-information feature code into the generator, and taking the target-domain generated mask finally output by the generator as the instance mask of the corresponding target domain.
B. Image generation
B, taking the image-mask pair of the source domain, the label information of the target domain, and the target-domain instance masks obtained in the step A as input, and generating a target domain image retaining the source domain background according to the deformable instance-level image translation method of any one of claims 1 to 7.
Specifically, the encoder of the mask deformation network is a multi-layer convolutional neural network, and the instance masks of the source domain are fused with the overall feature F_img by matrix dot multiplication;
the label information of the target domain is feature-encoded by one-hot encoding, and the instance mask feature F_mask(i) is fused with the feature code of the target-domain label information by matrix multiplication; alternatively, the label information of the target domain is encoded by a convolutional neural network, and the instance mask feature F_mask(i) is then concatenated with the feature code of the target-domain label information.
Specifically, the generator of the mask deformation network comprises a multi-layer residual neural network and a multi-layer convolutional neural network. The multi-layer residual neural network first scales the input instance mask feature F_mask(i) fused with the label information to match the input dimensionality of the multi-layer convolutional neural network, and the multi-layer convolutional neural network then decodes it to generate the target-domain generated mask; the layers of the multi-layer convolutional neural network are enlarged by upsampling so that the final output has the same size as the source domain image.
Further, the mask deformation network is trained according to the following steps:
B1, training sample data preparation:
acquiring masks from a data set, defining domains according to the categories to which the foreground belongs, and constructing sample pairs by pairwise combination of the constructed domains, wherein each sample pair comprises two domains, one serving as the source domain and the other as the target domain, and across all sample pairs each constructed domain serves as the target domain at least once;
B2, training the mask deformation network:
B21, inputting at least one sample pair comprising the source domain and the target domain specified by the task to be translated;
B22, for each input sample pair, the mask deformation network processes it according to the following steps:
randomly sampling a set number of instance masks from the instance masks of the source domain and of the target domain of the sample pair respectively; pairing the sampled source-domain instance masks and target-domain instance masks pairwise, i.e., the source-domain instance mask M^S_(i,j) corresponds to the target-domain instance mask M^T_(i,j), wherein the subscript i denotes the i-th sample pair and ranges from 1 to P, P being the number of input sample pairs; j denotes the j-th mask of the corresponding domain and ranges from 1 to Q, Q being the set sampling number; the superscript T denotes the target domain and S denotes the source domain;
inputting the target domain label and the sampled instance masks of the source domain into the mask deformation network, and generating the target-domain generated masks M'^T_(i,j) corresponding to the instance masks of the source domain respectively; based on the correspondence between the source-domain instance mask M^S_(i,j) and the target-domain instance mask M^T_(i,j), and between the target-domain generated mask M'^T_(i,j) and the source-domain instance mask M^S_(i,j), constructing triplets each composed of the corresponding source-domain instance mask M^S_(i,j), target-domain instance mask M^T_(i,j), and target-domain generated mask M'^T_(i,j);
B23, for each triplet obtained in the step B22, scaling its target-domain instance mask M^T_(i,j) so that its size matches the corresponding target-domain generated mask M'^T_(i,j), and using the scaled mask as the target-domain real mask; forming a positive and negative sample pair from the corresponding target-domain generated mask and target-domain real mask;
B24, inputting the positive and negative sample pairs obtained in the step B23 into a discriminator, and performing adversarial training on the mask deformation network;
B25, the training ends when the set number of iterations is reached or the mask deformation network converges; otherwise, return to the step B22.
Further, in the step B22, after pairwise matching the sampled instance masks of the source domain and the instance masks of the target domain, center positions of the pairwise matched instance masks are aligned.
Further, the generator of the mask deformation network is a multi-layer network; in the step B22, for each input instance mask feature fused with the label-information feature code, when the multi-layer network of the generator decodes, the last K layers of the generator output, layer by layer, target-domain generated masks M'^T_(i,j,n) of different sizes, that is, M'^T_(i,j) is a target-domain generated mask sequence composed of M'^T_(i,j,1), ..., M'^T_(i,j,K), where K is the number of network layers selected for layer-by-layer output in the multi-layer network;
in the step B23, for each triplet, its target-domain instance mask M^T_(i,j) is scaled to obtain target-domain real masks M^T_(i,j,n) whose sizes match the respective target-domain generated masks M'^T_(i,j,n) of the mask sequence, and the corresponding target-domain generated mask M'^T_(i,j,n) and target-domain real mask M^T_(i,j,n) constitute an adversarial sample pair, where n is the index within the mask sequence.
Further, in the step B24, the loss function of the adversarial training is:
L^G = L^G_adv + λ_pc·L_pc + λ_const·L_const + λ_reg·L_reg
L^D = L^D_adv
wherein L^G is the loss function of the generator, L^D is the loss function of the discriminator, L^G_adv is the adversarial loss function of the generator, L^D_adv is the adversarial loss function of the discriminator, L_pc is the mask pseudo-closed-loop loss function, L_const is the mask consistency loss function, L_reg is the mask regularization function, and the λ terms are hyper-parameters used to adjust the balance of the loss functions;
the adversarial loss function of the generator L^G_adv and the adversarial loss function of the discriminator L^D_adv are computed as: [formulas given as images in the original], wherein D_mask denotes the discriminator used for adversarial training of the mask deformation network, M'^T_(i,j,n) denotes the target-domain generated mask output by the n-th output layer corresponding to the j-th instance mask of the source domain in the i-th sample pair, and M^T_(i,j,n) denotes the n-th target-domain real mask obtained by scaling the target-domain instance mask corresponding to the j-th instance mask of the source domain in the i-th sample pair;
the mask pseudo-closed-loop loss function L_pc is computed as: [formula given as an image in the original], wherein M^S_(i,:) denotes the sequence of source-domain instance masks in the i-th sample pair, y^S_i denotes the source-domain label information of the i-th sample pair, and G_mask(M^S_(i,:), y^S_i) denotes the source-domain instance masks reconstructed by the generator G_mask of the mask deformation network based on the source-domain instance masks and label information in the i-th sample pair;
the mask consistency loss function L_const is computed as: [formula given as an image in the original], wherein d(·) denotes a down-sampling function;
the mask regularization function L_reg is computed as: [formula given as an image in the original], wherein M'^T_(i,j,K) denotes the K-th target-domain generated mask corresponding to the j-th instance mask of the source domain of the i-th sample pair, i.e., the final output of the generator, and sum(·) denotes the numerical summation over all pixels of the input mask.
The beneficial effects of the invention are as follows:
In the deformable instance-level translation method, target domain mask information is introduced during translation; in the decoding stage the generator extracts additional information from the hybrid mask, applies this additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output with it, thereby obtaining fusion information comprising the foreground information corresponding to the target domain mask and the position information indicating the foreground position. The target domain mask information can therefore serve as guidance for generating the corresponding instance, which greatly alleviates the situations in which the translated result is inconsistent with the corresponding mask and the instance is difficult to deform successfully.
In the foreground replacement task, when a matched target-domain instance image and mask are available as references, the first method of the invention allows the foreground to easily complete the cross-domain conversion of shape and appearance.
In more practical situations, however, only the label of the target domain is available, that is, only the kind of target-domain image to convert to is known, and mask guidance is missing. Although the prior art offers other ways of obtaining a target domain mask that could be adapted to the first method of the invention, the invention provides another translation method to broaden its range of application: a mask deformation stage is added on top of the first method to translate the source domain mask into a target domain mask, so that the translation task can be completed effectively when only the target-domain label information is known.
Meanwhile, the invention designs a training scheme with effective supervision information for the two unsupervised training stages; the supervision information is obtained by mining the data's own knowledge without requiring additional data, which ensures stable training and provides an effective and efficient training paradigm for unsupervised deformable instance translation.
Drawings
FIG. 1 is an inference data-flow diagram of the two-stage deformable instance-level image translation method of the present invention;
FIG. 2 is a network architecture diagram of the multi-layer APADE-ResNet network of the deformable instance-level image translation method of the present invention;
FIG. 3 is a comparison diagram of the simulation experiments of the deformable instance-level image translation method of the present invention.
Detailed Description
The invention aims to provide a deformable instance-level image translation method in which target domain mask information is introduced in the translation process and used as guidance information for generating the corresponding instances, so as to solve the problems that the translated result is inconsistent with the corresponding mask and the instance is difficult to deform successfully.
Of the two deformable instance-level image translation methods provided by the invention, one comprises only an image generation stage, while the other comprises a mask deformation stage and an image generation stage; the image generation stages of the two methods use the same model. To simplify the description, the invention is therefore further described below with reference to the drawings and the specific embodiment of the two-stage deformable instance-level image translation method.
Embodiment:
As shown in fig. 1, a deformable instance-level image translation method includes the following steps:
S1, mask deformation
S11, inputting the instance masks of a source domain and a target domain label into a pre-trained mask deformation network; the mask deformation network comprises an encoder and a generator. An instance mask is the mask corresponding to one foreground instance in the image; in essence, it is a segmentation mask obtained by splitting the foreground mask of the image by instance, and the mask corresponding to the image is formed by aggregating all instance masks of that image. For example, if there are two persons in an image, each person corresponds to one instance mask, and aggregating the two persons' instance masks forms the mask of the image.
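For illustration only, a minimal Python sketch of this aggregation step (not part of the patent; the function name and the toy example are hypothetical):

    import numpy as np

    def aggregate_instance_masks(instance_masks):
        # Aggregate binary instance masks (each H x W, values in {0, 1}) into one
        # image-level mask by taking the pixel-wise union.
        return np.stack(instance_masks, axis=0).max(axis=0)

    # Toy example: two "person" instance masks combine into the mask of the image.
    m1 = np.zeros((4, 4), dtype=np.uint8); m1[:2, :2] = 1
    m2 = np.zeros((4, 4), dtype=np.uint8); m2[2:, 2:] = 1
    image_mask = aggregate_instance_masks([m1, m2])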
S12, the mask is deformed by the mask deformation network according to the following steps:
S121, aggregating all instance masks of the source domain to obtain a source domain mask; extracting features from the source domain mask through the encoder to obtain the overall feature F_img of the source domain mask; fusing each instance mask of the source domain with the overall feature F_img respectively to obtain the mask feature F_mask(i) corresponding to each mask; then feature-encoding the label information of the target domain, and embedding the feature code of the target-domain label information into each instance mask feature F_mask(i).
The encoder can be any existing encoding network for image feature extraction. In this embodiment, specifically, a multi-layer convolutional neural network is used as the encoder of the mask deformation network, and each instance mask of the source domain is fused with the overall feature F_img by matrix dot multiplication.
Similarly, the label information of the target domain may be encoded in any manner, for example by one-hot encoding (e.g., sheep encoded as [0,1] and giraffe as [1,0]), with fusion performed by matrix multiplication. In this embodiment, specifically, the label information of the target domain is encoded by a convolutional neural network, and the instance mask feature F_mask(i) is then concatenated with the feature code of the target-domain label information.
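A minimal PyTorch sketch of these fusion options; the module names, channel sizes, and broadcasting layout are assumptions for illustration, not the patent's exact implementation:

    import torch
    import torch.nn as nn

    def fuse_mask_with_feature(F_img, inst_mask):
        # Matrix dot (element-wise) multiplication: keep the overall feature inside the instance.
        return F_img * inst_mask.unsqueeze(0)                       # F_mask(i), shape (C, H, W)

    def one_hot_label(domain_index, num_domains=2):
        # e.g. sheep -> [0, 1], giraffe -> [1, 0], as in the embodiment.
        return torch.eye(num_domains)[domain_index]

    def fuse_label_by_product(F_mask_i, label_onehot):
        # One plausible reading of "fusion by matrix multiplication": replicate the
        # feature once per label channel and scale it by the one-hot entries.
        return torch.einsum('l,chw->lchw', label_onehot, F_mask_i).reshape(-1, *F_mask_i.shape[1:])

    class LabelConcatFusion(nn.Module):
        # Alternative used in the embodiment: encode the label with a small conv net,
        # then concatenate the label code with the instance mask feature.
        def __init__(self, num_domains, label_channels=8):
            super().__init__()
            self.label_encoder = nn.Sequential(
                nn.Conv2d(num_domains, label_channels, kernel_size=1), nn.ReLU())
        def forward(self, F_mask_i, label_onehot):
            c, h, w = F_mask_i.shape
            label_map = label_onehot.view(-1, 1, 1).expand(-1, h, w).unsqueeze(0)
            label_code = self.label_encoder(label_map).squeeze(0)   # (label_channels, H, W)
            return torch.cat([F_mask_i, label_code], dim=0)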
S122, inputting each instance mask feature fused with the label-information feature code into the generator, and taking the target-domain generated mask finally output by the generator as the instance mask of the corresponding target domain.
The above generator may likewise be any existing decoding network for image generation. In this embodiment, specifically, the generator of the mask deformation network comprises a multi-layer residual neural network and a multi-layer convolutional neural network. The multi-layer residual neural network first scales the input instance mask feature F_mask(i) fused with the label information so that it matches the input dimensionality of the multi-layer convolutional neural network; the multi-layer convolutional neural network then decodes it to generate the target-domain generated mask. The layers of the multi-layer convolutional neural network are enlarged by upsampling so that the final output has the same size as the source domain image.
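A compact sketch of such a generator (residual refinement followed by an upsampling convolutional decoder); the layer counts and channel widths below are assumptions, not the patent's configuration:

    import torch.nn as nn
    import torch.nn.functional as F

    class ResBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1))
        def forward(self, x):
            return F.relu(x + self.body(x))

    class MaskGenerator(nn.Module):
        # Residual layers refine the fused instance-mask feature; an upsampling
        # convolutional decoder then produces the target-domain mask at image resolution.
        def __init__(self, in_ch=72, mid_ch=64, num_res=4, num_up=3):
            super().__init__()
            self.entry = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
            self.res = nn.Sequential(*[ResBlock(mid_ch) for _ in range(num_res)])
            ups = []
            for _ in range(num_up):                      # each step doubles the resolution
                ups += [nn.Upsample(scale_factor=2, mode='nearest'),
                        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU()]
            self.up = nn.Sequential(*ups)
            self.head = nn.Sequential(nn.Conv2d(mid_ch, 1, 3, padding=1), nn.Sigmoid())
        def forward(self, f_mask_with_label):            # (B, in_ch, h, w)
            x = self.res(self.entry(f_mask_with_label))
            return self.head(self.up(x))                 # (B, 1, h * 2**num_up, w * 2**num_up)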
S2, image generation
S21, taking the image-mask pair of the source domain, the label information of the target domain, and the target-domain instance masks obtained in the step S1 as input to the image translation model, wherein the image-mask pair comprises a group of instance masks and the corresponding image; the image translation model comprises a pre-trained image completion model and an instance generation network.
S22, based on the source-domain image-mask pair input in the step S21, first aggregating all instance masks of the source domain to obtain a source domain mask, then removing the foreground of the corresponding source domain image according to the source domain mask to obtain a residual image with the masked part removed, and completing the residual image with the image completion model to obtain the background image of the source domain.
In the present embodiment, the image completion model is the HiFill model, published in the CVPR 2020 article "Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting". Of course, other existing image completion models may also be used, pre-trained in the existing manner on a large picture dataset.
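A sketch of this background-extraction step, assuming the pre-trained completion model is wrapped behind a simple inpaint_fn(image, hole_mask) callable (this interface is hypothetical, not HiFill's actual API):

    import numpy as np

    def extract_background(src_image, src_instance_masks, inpaint_fn):
        # src_image: (H, W, 3) uint8 array; src_instance_masks: list of (H, W) binary masks.
        # inpaint_fn(image, hole_mask) -> completed image (e.g. a wrapper around HiFill).
        src_mask = np.stack(src_instance_masks, axis=0).max(axis=0)   # aggregate to the domain mask
        remaining = src_image.copy()
        remaining[src_mask.astype(bool)] = 0                          # remove the foreground
        background = inpaint_fn(remaining, src_mask)                  # fill the removed region
        return background, src_mask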
S23, based on the target-domain instance masks input in the step S21, first aggregating all instance masks of the target domain to obtain a target domain mask, and then inputting the target domain mask and the source domain background image B_S obtained in the step S22 into the instance generation network; the instance generation network comprises an encoder and a generator and processes them as follows:
extracting the background feature of the source domain background image through the encoder, based on the input source domain background image;
based on the input target domain mask, obtaining the edge information of the foreground through an edge extraction algorithm; obtaining the background mask information by inverting the target domain mask; feature-encoding the label information of the target domain; and then fusing the foreground edge information, the background mask information, and the feature code of the target-domain label information to obtain a hybrid mask;
inputting the background feature and the hybrid mask into the generator, which comprises a decoding network; the decoding network decodes the input background feature while the generator extracts additional information from the hybrid mask, applies the extracted additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output with this additional information, thereby obtaining fusion information comprising the foreground information corresponding to the target domain mask and the position information indicating the foreground position.
The core of the invention is: introducing target domain mask information in the translation process and using it as guidance information for generating the corresponding instances. This is embodied as follows: designing a hybrid mask, extracting additional information from the hybrid mask through the generator, applying the extracted additional information to the normalized decoded output, and performing an affine transformation on the normalized decoded output with this additional information. Therefore, the generator may be any decoder, as in the prior art, that can extract the additional information from the hybrid mask and apply it to the normalized decoded output for image generation.
Specifically, in this embodiment, the encoder of the instance generation network is a multi-layer residual neural network; the edge extraction algorithm is the Canny edge detection algorithm; the label information of the target domain is feature-encoded by one-hot encoding; and the foreground edge information, the background mask information, and the feature code of the target-domain label information are fused through matrix multiplication.
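A sketch of how such a hybrid mask could be assembled (Canny edges of the target domain mask, the inverted mask as background information, and a one-hot label code fused by broadcast multiplication); OpenCV's Canny is used for illustration, and the thresholds and fusion layout are assumptions:

    import numpy as np
    import cv2

    def build_hybrid_mask(target_mask, label_onehot):
        # target_mask: (H, W) binary target-domain mask; label_onehot: (L,) one-hot label code.
        edges = cv2.Canny((target_mask * 255).astype(np.uint8), 100, 200) / 255.0  # foreground edges
        background = 1.0 - target_mask.astype(np.float32)             # inverted mask = background info
        base = np.stack([edges, background], axis=0)                  # (2, H, W)
        # Fuse the label code multiplicatively, broadcasting it over the spatial maps.
        label_planes = label_onehot.astype(np.float32)[:, None, None, None]
        fused = label_planes * base[None, :, :, :]                    # (L, 2, H, W)
        return fused.reshape(-1, *target_mask.shape)                  # (L*2, H, W) hybrid mask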
As shown in fig. 2, the generator of the instance generation network is an APADE-ResNet network, a neural network formed by adding an APADE block between each convolution layer and ReLU layer on the basis of a ResNet (Residual Neural Network); that is, the decoding network of the generator is a ResNet, and additional information is extracted from the hybrid mask through the APADE blocks. The APADE block has two inputs: one is the output of the convolution layer, and the other is the hybrid mask. The hybrid mask is scaled to the input feature dimensions of the APADE block and input to the APADE block, where it first passes through a convolution layer Conv_s; the output of Conv_s is then fed into two independent convolution layers Conv_1 and Conv_2 to generate two embedding vectors γ and β. After entering the APADE block, the output of the convolution layer first passes through a batch normalization layer, and the output of the batch normalization layer is then combined with the embedding vectors γ and β according to the following formula to give the output of the APADE block:
F_out = γ·F_in + β
wherein F_in denotes the output of the batch normalization layer and F_out denotes the output of the APADE block.
Further, in this embodiment, the APADE-ResNet network is a neural network formed by adding APADE blocks on the basis of two-layer ResNet blocks, that is: each layer of the APADE-ResNet network comprises a forward neural network branch and a shortcut branch; the forward branch comprises a first convolution layer, a first APADE block, a first ReLU layer, a second convolution layer, a second APADE block, and a second ReLU layer connected in series, and the shortcut branch builds a skip connection between the input and the output of the second APADE block, adding the input to the output of the second APADE block to form the input of the second ReLU layer. The APADE-ResNet network is a multi-layer APADE-ResNet network; the input of the first layer is the background feature and the hybrid mask, and the input of each layer after the first is the output of the previous layer together with the hybrid mask; the layers are enlarged by upsampling so that the final output has the same size as the source domain image.
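A minimal PyTorch sketch of an APADE block and one APADE-ResNet layer as described above; the channel sizes and the hidden width of Conv_s are assumptions, and the modulation follows F_out = γ·F_in + β:

    import torch.nn as nn
    import torch.nn.functional as F

    class APADEBlock(nn.Module):
        # Batch-normalize the convolution output, then modulate it with gamma and beta
        # predicted from the (resized) hybrid mask: F_out = gamma * F_in + beta.
        def __init__(self, feat_ch, mask_ch, hidden_ch=64):
            super().__init__()
            self.bn = nn.BatchNorm2d(feat_ch, affine=False)
            self.conv_s = nn.Sequential(nn.Conv2d(mask_ch, hidden_ch, 3, padding=1), nn.ReLU())
            self.conv_gamma = nn.Conv2d(hidden_ch, feat_ch, 3, padding=1)
            self.conv_beta = nn.Conv2d(hidden_ch, feat_ch, 3, padding=1)
        def forward(self, feat, hybrid_mask):
            f_in = self.bn(feat)
            m = F.interpolate(hybrid_mask, size=feat.shape[-2:], mode='nearest')  # scale mask to feature size
            h = self.conv_s(m)
            return self.conv_gamma(h) * f_in + self.conv_beta(h)

    class APADEResBlock(nn.Module):
        # Conv -> APADE -> ReLU -> Conv -> APADE on the forward branch, with a shortcut
        # added to the output of the second APADE block before the second ReLU.
        def __init__(self, ch, mask_ch):
            super().__init__()
            self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
            self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
            self.apade1 = APADEBlock(ch, mask_ch)
            self.apade2 = APADEBlock(ch, mask_ch)
        def forward(self, x, hybrid_mask):
            y = F.relu(self.apade1(self.conv1(x), hybrid_mask))
            y = self.apade2(self.conv2(y), hybrid_mask)
            return F.relu(x + y)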
Finally, the generated foreground information I'_fg is fused with the source domain background image B using the position information α', and a target domain image I' retaining the source domain background is output. This process is formulated as follows:
I'_fg, α' = G_img(B, M')
I' = I'_fg · α' + B · (1 − α')
wherein G_img denotes the generator.
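The compositing step itself reduces to a single alpha-blending expression; a one-line illustration, assuming α' is a soft map in [0, 1]:

    # I_fg: generated foreground, alpha: position map, B_bg: source-domain background image.
    def composite(I_fg, alpha, B_bg):
        return I_fg * alpha + B_bg * (1.0 - alpha)   # I' = I'_fg * alpha' + B * (1 - alpha')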
In terms of performance, a network with a larger number of layers generally achieves better results, but the training overhead is increased. Therefore, in order to balance performance and overhead, the multi-layer network in the above embodiment adopts a structure of 4-6 layers.
The mask deformation network and the image translation model are essentially generative networks that translate part of the semantic information of an image, so their training can follow the conventional GAN (generative adversarial network) training paradigm.
However, in order to train more effectively and efficiently, the model is trained as follows:
First, training sample data preparation:
Data is collected from a data set, for example the MS COCO dataset, which provides picture-foreground-mask data pairs for multiple domains. Domains are defined according to the classes to which the foreground belongs, and an image-mask pair is constructed for each domain, wherein the image-mask pair comprises a group of instance masks and the corresponding images; sample pairs are then constructed by pairwise combination of the constructed domains, wherein each sample pair comprises two domains, one serving as the source domain and the other as the target domain, and across all sample pairs each constructed domain serves as the target domain at least once.
Then, the model is trained, where training the mask deformation network comprises the following steps:
S31, inputting at least one sample pair comprising the source domain and the target domain specified by the task to be translated;
S32, for each input sample pair, the mask deformation network processes it according to the following steps:
randomly sampling a set number of instance masks from the instance masks of the source domain and of the target domain of the sample pair respectively; pairing the sampled source-domain instance masks and target-domain instance masks pairwise, i.e., the source-domain instance mask M^S_(i,j) corresponds to the target-domain instance mask M^T_(i,j), wherein the subscript i denotes the i-th sample pair and ranges from 1 to P, P being the number of input sample pairs; j denotes the j-th mask of the corresponding domain and ranges from 1 to Q, Q being the set sampling number; the superscript T denotes the target domain and S denotes the source domain. Meanwhile, after the sampled source-domain and target-domain instance masks are paired, the center positions of each pair of instance masks are aligned to achieve better spatial matching.
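A sketch of the sampling, pairing and center alignment just described, assuming non-empty binary NumPy masks; shifting the source-domain mask onto the target-domain mask's centroid is one possible way to realize the alignment:

    import numpy as np

    def center_align(mask, ref_mask):
        # Shift `mask` so that its centroid coincides with that of `ref_mask`.
        ys, xs = np.nonzero(mask)
        ry, rx = np.nonzero(ref_mask)
        dy, dx = int(ry.mean() - ys.mean()), int(rx.mean() - xs.mean())
        return np.roll(np.roll(mask, dy, axis=0), dx, axis=1)

    def sample_and_pair(src_masks, tgt_masks, Q, rng=np.random):
        # Randomly sample Q masks per domain, pair them one-to-one, and align centers.
        src_idx = rng.choice(len(src_masks), Q, replace=len(src_masks) < Q)
        tgt_idx = rng.choice(len(tgt_masks), Q, replace=len(tgt_masks) < Q)
        return [(center_align(src_masks[s], tgt_masks[t]), tgt_masks[t])
                for s, t in zip(src_idx, tgt_idx)]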
Inputting the target domain label and the sampled instance masks of the source domain into the mask deformation network, and generating the target-domain generated masks M'^T_(i,j) corresponding to the instance masks of the source domain respectively; based on the correspondence between the source-domain instance mask M^S_(i,j) and the target-domain instance mask M^T_(i,j), and between the target-domain generated mask M'^T_(i,j) and the source-domain instance mask M^S_(i,j), constructing triplets each composed of the corresponding source-domain instance mask M^S_(i,j), target-domain instance mask M^T_(i,j), and target-domain generated mask M'^T_(i,j).
In this embodiment, the generator of the mask deformation network is a multi-layer network. To achieve a better training effect, each layer of the multi-layer network can be constrained by a loss, ensuring that the adversarial training proceeds smoothly at multiple scales. Therefore, for each input instance mask feature fused with the label-information feature code, when the multi-layer network of the generator decodes, the last K layers of the generator output, layer by layer, target-domain generated masks M'^T_(i,j,n) of different sizes; that is, M'^T_(i,j) is a target-domain generated mask sequence composed of M'^T_(i,j,1), ..., M'^T_(i,j,K), where K is the number of network layers selected for layer-by-layer output in the multi-layer network.
S33, for each triplet obtained in the step S32, scaling its target-domain instance mask M^T_(i,j) so that its size matches the corresponding target-domain generated mask M'^T_(i,j), and using the scaled mask as the target-domain real mask; the corresponding target-domain generated mask and target-domain real mask form a positive and negative sample pair.
Since M'^T_(i,j) is a target-domain generated mask sequence composed of M'^T_(i,j,1), ..., M'^T_(i,j,K), in this step the target-domain instance mask M^T_(i,j) of each triplet is adaptively scaled to obtain target-domain real masks M^T_(i,j,n) whose sizes match the respective target-domain generated masks M'^T_(i,j,n) of the mask sequence, and the corresponding target-domain generated mask M'^T_(i,j,n) and target-domain real mask M^T_(i,j,n) constitute an adversarial sample pair, where n is the index within the mask sequence.
S34, inputting the positive and negative sample pairs obtained in the step S33 into the discriminator, and performing adversarial training on the mask deformation network.
The adversarial loss function of the generator L^G_adv and the adversarial loss function of the discriminator L^D_adv are computed as: [formulas given as images in the original], wherein D_mask denotes the discriminator used for adversarial training of the mask deformation network, M'^T_(i,j,n) denotes the target-domain generated mask output by the n-th output layer corresponding to the j-th instance mask of the source domain in the i-th sample pair, and M^T_(i,j,n) denotes the n-th target-domain real mask obtained by scaling the target-domain instance mask corresponding to the j-th instance mask of the source domain in the i-th sample pair.
Meanwhile, a mask pseudo-closed-loop loss function is constructed by reconstructing the source domain mask, which supervises the mask generation process. The mask pseudo-closed-loop loss function L_pc is computed as: [formula given as an image in the original], wherein M^S_(i,:) denotes the sequence of source-domain instance masks in the i-th sample pair, y^S_i denotes the source-domain label information in the i-th sample pair, and G_mask(M^S_(i,:), y^S_i) denotes the source-domain instance masks reconstructed by the generator G_mask of the mask deformation network based on the source-domain instance masks and label information in the i-th sample pair.
To stabilize training, the semantic consistency of the masks generated at different scales needs to be evaluated; therefore, all masks except the one output by the first layer are down-sampled to the size of the first layer's mask, and the mask consistency loss function L_const is constructed. It is computed as: [formula given as an image in the original], wherein d(·) denotes the down-sampling function.
Because the mask is binary, the activation function introduces noise during generation; to suppress this noise, the mask regularization function L_reg is constructed. It is computed as: [formula given as an image in the original], wherein M'^T_(i,j,K) denotes the K-th target-domain generated mask corresponding to the j-th instance mask of the source domain of the i-th sample pair, which is the final output of the generator, and sum(·) denotes the numerical summation over all pixels of the input mask.
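The exact formulas for these losses appear only as images in the source; the sketch below shows one plausible reading consistent with the descriptions above (the L1 distances, mean reductions and normalization are assumptions):

    import torch.nn.functional as F

    def pseudo_closed_loop_loss(G_mask, src_masks, src_label):
        # Reconstruct the source-domain masks from themselves plus the source label
        # and penalize the deviation from the originals (pseudo closed loop).
        recon = G_mask(src_masks, src_label)
        return F.l1_loss(recon, src_masks)

    def mask_consistency_loss(mask_pyramid):
        # mask_pyramid: list of generated masks (B, 1, H_n, W_n), coarsest first.
        base = mask_pyramid[0]
        down = [F.interpolate(m, size=base.shape[-2:], mode='bilinear', align_corners=False)
                for m in mask_pyramid[1:]]               # d(.): down-sample to the first layer's size
        return sum(F.l1_loss(m, base) for m in down) / max(len(down), 1)

    def mask_regularization(final_mask):
        # Suppress activation noise by penalizing the total mass of the final mask (sum over pixels).
        return final_mask.sum() / final_mask.numel()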
In summary, the loss function of the adversarial training is:
L^G = L^G_adv + λ_pc·L_pc + λ_const·L_const + λ_reg·L_reg
L^D = L^D_adv
wherein L^G is the loss function of the generator, L^D is the loss function of the discriminator, L^G_adv is the adversarial loss function of the generator, L^D_adv is the adversarial loss function of the discriminator, L_pc is the mask pseudo-closed-loop loss function, L_const is the mask consistency loss function, L_reg is the mask regularization function, and the λ terms are hyper-parameters used to adjust the balance of the loss functions.
S35, the training ends when the set number of iterations is reached or the mask deformation network converges; otherwise, return to the step S32.
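A generic alternating-update sketch of this adversarial training procedure (the optimizer settings and the concrete form of the adversarial terms are not specified in the text and are assumptions here; g_loss_fn and d_loss_fn stand for the losses defined above):

    import torch

    def train_mask_network(G, D, data_loader, g_loss_fn, d_loss_fn, max_iters=100000):
        # Alternate discriminator and generator updates until convergence or max_iters (step S35).
        opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
        opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
        step = 0
        while step < max_iters:
            for batch in data_loader:                     # each batch holds sample pairs from step S31
                fake = G(batch)                           # target-domain generated masks (step S32)
                real = batch['target_real_masks']         # scaled target-domain real masks (step S33)
                opt_d.zero_grad()
                d_loss_fn(D, real, fake.detach()).backward()   # discriminator step (step S34)
                opt_d.step()
                opt_g.zero_grad()
                g_loss_fn(D, G, batch, fake).backward()   # generator step, incl. L_pc, L_const, L_reg
                opt_g.step()
                step += 1
                if step >= max_iters:
                    break
        return G, D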
Training the instance generation network comprises the following steps:
S41, inputting the image-mask pairs of at least one domain, including the target domain specified by the task to be translated;
S42, processing the image-mask pair of each input domain according to the following steps:
taking the image-mask pair of the input domain as the source-domain image-mask pair input to the image translation model; taking the label information and instance masks of the input domain as the target-domain label information and instance masks input to the image translation model; generating an image I'_i from the input by the image translation model, wherein i denotes the i-th input sample data corresponding to the generated image.
Taking the image of the input domain's image-mask pair as the real image I_i, a positive and negative sample pair is formed from the real image I_i and the generated image I'_i.
S43, inputting the positive and negative sample pairs obtained in the step S42 into a discriminator, and performing adversarial training on the instance generation network.
The loss function of the adversarial training is:
L^G = L^G_adv + λ_fmap·L_fmap
L^D = L^D_adv
wherein L^G is the loss function of the generator, L^D is the loss function of the discriminator, L^G_adv is the adversarial loss function of the generator, L^D_adv is the adversarial loss function of the discriminator, L_fmap is the fusion-map loss function, and λ_fmap is a hyper-parameter used to adjust the balance of the loss functions;
the adversarial loss function of the generator L^G_adv and the adversarial loss function of the discriminator L^D_adv are computed as: [formulas given as images in the original], wherein D_img denotes the discriminator used for adversarial training of the instance generation network, I'_i denotes the generated image produced from the image-mask pair of the i-th input domain, I_i denotes the image of the image-mask pair of the i-th input domain, and P is the number of sample data input in the step S41;
the fusion-map loss function L_fmap is computed as: [formula given as an image in the original], wherein M_i denotes the input-domain mask formed by aggregating the instance masks of the image-mask pair of the i-th domain, and α'_i denotes the position information indicating the foreground position in the fusion information generated by the generator based on the image-mask pair of the i-th domain.
The above fusion-map loss function L_fmap better guides the position information to change toward the correct mask.
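As with the other losses, the exact formula for L_fmap is given only as an image; one plausible form consistent with the description, penalizing the distance between the position map α'_i and the aggregated input-domain mask M_i, is (the L1 distance is an assumption):

    import torch.nn.functional as F

    def fusion_map_loss(alpha_maps, domain_masks):
        # alpha_maps: position maps from the generator; domain_masks: aggregated masks M_i.
        return sum(F.l1_loss(a, m) for a, m in zip(alpha_maps, domain_masks)) / len(alpha_maps)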
S44, the training ends when the set number of iterations is reached or the instance generation network converges; otherwise, return to the step S42.
Although the present invention has been described herein with reference to the preferred embodiments thereof, the embodiments described above are intended to be illustrative only and not to be limiting of the invention, it being understood that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.
The effect of the present invention is explained below by using the model of the above embodiment and combining with simulation experiments:
simulation experiment I:
the test conditions are as follows:
the system comprises: ubuntu 20.04, software: python 3.6, processor: intel (R) Xeon (R) CPU E5-2678 v3@2.50GHz × 2, memory: 256GB.
The experimental contents are as follows:
comparing the existing unsupervised image translation scheme with the technical scheme of the invention, and taking the source domain data, the image and the mask as well as the target domain label as input, on the premise of keeping background information, generating an image containing the foreground of the target domain.
A total of four dataset pairs were tested: sheep & giraffes, bottles & cups, oranges & bananas, and trousers & skirts; the results are shown in figure 3.
Analysis of experimental results:
As can be seen from fig. 3, compared with the previous scheme, InstaGAN, the method of the present invention introduces the mask information of the target domain as guidance in the image generation stage, so that the generated foreground instances better conform to the shape constraint of the mask, have a more reasonable visual effect, and complete the conversion from the source domain to the target domain well in terms of both shape and texture. The mask deformation network also achieves more reasonable cross-domain mask translation through the matching self-supervised learning method.
Simulation experiment II:
The effects of the invention are further illustrated by comparison with the prior-art method through the following simulation experiment:
the test conditions are as follows:
the system comprises: ubuntu 20.04, software: python 3.6, processor: intel (R) Xeon (R) CPU E5-2678 v3@2.50GHz x 2, memory: 256GB;
description of the test: the sheep and the giraffes are used as data sets, and the data sets used in the experiment are all in the form of image mask pairs, namely one picture corresponds to a plurality of foreground example masks. Due to the particularity of the task, the training data is sent into the network for training in the form of image mask pairs of two different domains. Specifically, the invention and the comparison algorithm are trained on training sets in the data set in turn. After training, the invention and the comparison algorithm are respectively used for carrying out generation test on the data set test set to obtain a generated picture result. The comparison algorithm is instaGAN.
In the experiment, the test set was randomly divided into batches, each batch comprising 2 image mask pairs drawn from the two domains respectively.
1) Examine the classification accuracy of the generated picture/foreground instance:
Testing is carried out on the test set to generate a target domain picture-mask set, and statistics are computed on the generated pictures in two ways: (1) with whole images as the unit, classifying the images using a pre-trained image classification model and counting the number of images correctly classified into the target domain; (2) with foreground instances as the unit, counting the number of foreground instances correctly classified into the target domain using a pre-trained instance classification model. From these two counts, an image classification score CS and an instance classification score CS (Masked) are calculated.
2) Examining the accuracy of correctly detected and identified generated foreground instances:
Predicted labels, scores and masks are obtained from each generated picture using a pre-trained Mask R-CNN as the detector. Three statistical analyses are then used to evaluate the quality of the generated foreground. (1) The number of masks detected in the generated picture set is counted and its ratio to the number of masks produced by the model is computed, giving the average matching ratio (MMR), which estimates, in terms of count, the probability that a generated mask is correctly recognized. (2) The predicted score represents the confidence of being classified into a particular domain, so the average object detection score (MODS) is obtained by averaging the scores of detections classified into the target domain. (3) The intersection-over-union between the predicted mask and the generated mask is computed to obtain the average valid IoU score (MVIS), which evaluates, from the perspective of how well the generated mask shape fits the predicted mask shape, whether the shape has been successfully translated to the target domain.
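As a rough illustration only (not part of the patent text), the three detection-based statistics described above could be computed from detector outputs roughly as follows; the function and variable names are illustrative assumptions, and the matching strategy is a simplification:

import numpy as np

def mask_iou(pred_mask, gen_mask):
    # Intersection-over-union of two binary masks of the same size.
    pred, gen = pred_mask.astype(bool), gen_mask.astype(bool)
    union = np.logical_or(pred, gen).sum()
    return np.logical_and(pred, gen).sum() / union if union > 0 else 0.0

def detection_metrics(detections, generated_masks, target_label):
    # detections: per-image lists of dicts {'label', 'score', 'mask'} from a
    # pre-trained detector such as Mask R-CNN; generated_masks: per-image lists
    # of binary masks produced by the translation model.
    n_detected, n_generated, scores, ious = 0, 0, [], []
    for dets, gens in zip(detections, generated_masks):
        n_generated += len(gens)
        target_dets = [d for d in dets if d['label'] == target_label]
        n_detected += len(target_dets)
        scores += [d['score'] for d in target_dets]
        for g in gens:
            if target_dets:
                # Match each generated mask to its best-overlapping detection.
                ious.append(max(mask_iou(d['mask'], g) for d in target_dets))
    mmr = n_detected / max(n_generated, 1)            # average matching ratio
    mods = float(np.mean(scores)) if scores else 0.0  # average object detection score
    mvis = float(np.mean(ious)) if ious else 0.0      # average valid IoU score
    return mmr, mods, mvis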
The results of the above experiments are shown in Tables 1 and 2. Analysis and comparison of Tables 1 and 2 show that the pictures generated by the present method are of higher quality, which verifies the effectiveness of the translation method and of the corresponding supervision data construction method.
TABLE 1
(Table 1: image classification score CS and instance classification score CS (Masked) for the compared methods; the table is reproduced only as an image in the publication.)
TABLE 2
(Table 2: average matching ratio MMR, average object detection score MODS and average valid IoU score MVIS for the compared methods; the table is reproduced only as an image in the publication.)

Claims (14)

1. A deformable instance-level image translation method is characterized by comprising the following steps:
c1, inputting an image mask pair of a source domain, label information and an example mask of a target domain into an image translation model, wherein the image mask pair comprises a group of example masks and corresponding images; the image translation model comprises a pre-trained image completion model and an example generation network;
C2, based on the source domain image mask pair input in step C1, first aggregating all instance masks of the source domain to obtain a source domain mask, then removing the foreground of the corresponding source domain image according to the source domain mask to obtain a residual image with the masked region removed, and completing the residual image using the image completion model to obtain a background image of the source domain;
C3, based on the target domain instance masks input in step C1, first aggregating all the instance masks of the target domain to obtain a target domain mask, and then inputting the target domain mask together with the source domain background image B_S obtained in step C2 into the instance generation network; the instance generation network includes an encoder and a generator and processes them as follows:
extracting background features of the source domain background image through an encoder based on the input source domain background image;
based on the input target domain mask, obtaining the edge information of the foreground through an edge extraction algorithm; obtaining background mask information by negating the target domain mask; performing feature encoding on the label information of the target domain; then fusing the edge information of the foreground, the background mask information and the feature encoding of the target domain label information to obtain a mixed mask;
inputting the background features and the mixed mask into the generator, wherein the generator comprises a decoding network that decodes the input background features; meanwhile, the generator extracts additional information from the mixed mask, applies the extracted additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output using this additional information, thereby obtaining fusion information comprising foreground information corresponding to the target domain mask and position information indicating the foreground position;
and finally, fusing the generated foreground information and the source domain background image by using the position information, and outputting a target domain image retaining the source domain background.
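For readability only (this is not part of the claims), the overall flow of claim 1 can be sketched in Python; every helper and model interface named here is an illustrative assumption, and the mixed mask is built by simple channel stacking rather than the matrix-multiplication fusion of claim 2:

import numpy as np
import cv2  # assumed available for Canny edge extraction

def aggregate(instance_masks):
    # Union of all binary instance masks into a single domain mask (H, W).
    return np.clip(np.sum(instance_masks, axis=0), 0, 1)

def translate(source_image, source_instance_masks, target_label_onehot,
              target_instance_masks, completion_model, encoder, generator):
    # C2: aggregate the source masks, erase the foreground, complete the background.
    source_mask = aggregate(source_instance_masks)
    background = completion_model(source_image * (1 - source_mask[..., None]), source_mask)

    # C3: aggregate the target masks and build the mixed mask from foreground
    # edges, the inverted (background) mask, and the target-domain label planes.
    target_mask = aggregate(target_instance_masks)
    edges = cv2.Canny((target_mask * 255).astype(np.uint8), 100, 200) / 255.0
    label_planes = np.tile(target_label_onehot[:, None, None], (1,) + target_mask.shape)
    mixed_mask = np.concatenate([edges[None], (1 - target_mask)[None], label_planes], axis=0)

    # Instance generation network: encode the background, then decode under
    # mixed-mask conditioning to get foreground content and a position map.
    features = encoder(background)
    foreground, position = generator(features, mixed_mask)

    # Fuse the generated foreground onto the source background using the position map.
    return position[..., None] * foreground + (1 - position[..., None]) * background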
2. A method for deformable instance-level image translation as claimed in claim 1, characterized by:
the image completion model is a HiFill model; the encoder of the instance generation network is a multilayer residual neural network; the edge extraction algorithm is the Canny edge detection algorithm; the label information of the target domain is feature-encoded by one-hot encoding; and the edge information of the foreground, the background mask information and the feature encoding of the target domain label information are fused through matrix multiplication.
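One possible reading of the matrix-multiplication fusion in claim 2 is an outer product that broadcasts the one-hot label over the spatial maps; the sketch below follows that reading and is an assumption, not the claimed implementation:

import numpy as np

def one_hot(label_index, num_domains):
    v = np.zeros(num_domains, dtype=np.float32)
    v[label_index] = 1.0
    return v

def mixed_mask(foreground_edges, background_mask, label_index, num_domains):
    # foreground_edges, background_mask: (H, W) arrays in [0, 1].
    label = one_hot(label_index, num_domains)                        # (C,)
    spatial = np.stack([foreground_edges, background_mask], axis=0)  # (2, H, W)
    # Outer product of the label vector with each spatial map: every map is
    # replicated into label-specific channels (zero for non-target domains).
    fused = np.einsum('c,khw->ckhw', label, spatial)
    return fused.reshape(-1, *foreground_edges.shape)                # (2C, H, W)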
3. A method for deformable instance-level image translation as claimed in claim 1, characterized by:
the generator of the instance generation network is an APADE-ResNet network, which is a neural network formed by adding an APADE block between each convolution layer and the ReLU layer on the basis of a ResNet network; that is, the decoding network of the generator is the ResNet network, and the additional information is extracted from the mixed mask through the APADE blocks;
the APADE block comprises two paths of inputs, wherein one path is the output of the convolutional layer, and the other path is a mixed mask;
the mixed mask is scaled to the input feature dimension of the APADE block and input into the APADE block, where it passes through a convolution layer Conv_s; the output of Conv_s is then fed into two independent convolution layers Conv_1 and Conv_2, generating two embedding vectors γ and β;
after the output of the convolution layer is input into the APADE block, it first passes through a batch normalization layer, and the output of the batch normalization layer is then combined with the embedding vectors γ and β according to the following formula to give the output of the APADE block:
F_out = γ · F_in + β
wherein F_in denotes the output of the batch normalization layer and F_out denotes the output of the APADE block.
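A minimal PyTorch sketch of an APADE-style block matching the description above; the interpolation mode and the hidden width of Conv_s are free choices assumed here, not taken from the patent:

import torch
import torch.nn as nn
import torch.nn.functional as F

class APADEBlock(nn.Module):
    def __init__(self, feature_channels, mask_channels, hidden=128):
        super().__init__()
        # Parameter-free batch normalization of the incoming convolution output.
        self.bn = nn.BatchNorm2d(feature_channels, affine=False)
        # Conv_s processes the resized mixed mask; Conv_1 and Conv_2 produce gamma and beta.
        self.conv_s = nn.Conv2d(mask_channels, hidden, kernel_size=3, padding=1)
        self.conv_gamma = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)
        self.conv_beta = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, features, mixed_mask):
        # Scale the mixed mask to the spatial size of the incoming features.
        mask = F.interpolate(mixed_mask, size=features.shape[2:], mode='nearest')
        shared = F.relu(self.conv_s(mask))
        gamma, beta = self.conv_gamma(shared), self.conv_beta(shared)
        # F_out = gamma * F_in + beta: affine modulation of the normalized features.
        return gamma * self.bn(features) + beta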
4. A method for deformable instance-level image translation as claimed in claim 3, characterized by:
the APADE-ResNet network is a neural network formed by adding APADE blocks on the basis of a two-layer ResNet network, namely:
each layer of the APADE-ResNet network comprises a forward neural network branch and a shortcut branch, wherein the forward neural network branch comprises a first convolution layer, a first APADE block, a first ReLU layer, a second convolution layer, a second APADE block and a second ReLU layer connected in series in that order, and the shortcut branch builds a skip connection from the input to the point after the second APADE block, adding the input to the output of the second APADE block to form the input of the second ReLU layer.
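Continuing the sketch above, one APADE-ResNet layer following claim 4 might look like this; the kernel sizes and the assumption of equal input/output channel counts are illustrative choices:

class APADEResNetLayer(nn.Module):
    def __init__(self, channels, mask_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.apade1 = APADEBlock(channels, mask_channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.apade2 = APADEBlock(channels, mask_channels)

    def forward(self, x, mixed_mask):
        # Forward branch: Conv -> APADE -> ReLU -> Conv -> APADE.
        h = F.relu(self.apade1(self.conv1(x), mixed_mask))
        h = self.apade2(self.conv2(h), mixed_mask)
        # Shortcut branch: add the layer input before the second ReLU.
        return F.relu(h + x)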
5. A method for deformable instance-level image translation, according to any of claims 3 or 4, characterized by:
the APADE-ResNet network is a multilayer APADE-ResNet network; the input of the first layer is the background features and the mixed mask, and the input of each layer after the first is the output of the previous layer together with the mixed mask; the layers are enlarged by upsampling so that the final output is the same size as the source domain image.
6. A method for deformable instance-level image translation according to any of claims 3 or 4, characterized in that said instance generation network is trained according to the following steps:
D1, training sample data preparation:
acquiring data from a data set, defining domains according to the types of scenes, and constructing an image mask pair of each domain, wherein the image mask pair comprises a group of example masks and corresponding images;
D2, inputting sample data and training the instance generation network, wherein the training comprises the following steps:
D21, inputting an image mask pair of at least one domain, including the target domain specified by the task to be translated;
D22, processing the image mask pair of each input domain according to the following steps:
taking the image mask pair of the input domain as the source domain image mask pair input to the image translation model; taking the label information and the instance masks of the input domain as the target domain label information and instance masks input to the image translation model; the image translation model generates an image I'_i from this input, wherein the subscript i indicates that the generated image corresponds to the i-th input sample data;
taking the image in the input domain's image mask pair as the real image I_i, and forming a positive and negative sample pair from the real image I_i and the generated image I'_i;
D23, inputting the positive and negative sample pairs obtained in step D22 into a discriminator, and performing adversarial training on the instance generation network;
D24, ending the training when the set number of iterations is reached or the instance generation network converges; otherwise, returning to step D22.
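A schematic adversarial training loop for the instance generation network following steps D21 to D24; the optimizer settings, the least-squares GAN loss, and the L1 form of the fusion graph loss are all assumptions made for illustration:

import torch

def train_instance_generation(model, discriminator, dataloader, iterations, lam=1.0):
    g_opt = torch.optim.Adam(model.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    for step, (real, masks, label) in zip(range(iterations), dataloader):
        # D22: each input domain's image mask pair serves both as the source pair
        # and as the target-domain supervision; the model generates I'_i from it.
        fake, position = model(real, masks, label)      # position: (B, H, W) map

        # D23: discriminator update on the real / generated sample pair.
        d_loss = ((discriminator(real) - 1) ** 2).mean() + (discriminator(fake.detach()) ** 2).mean()
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator update: adversarial term plus the fusion graph loss L_fmap.
        aggregated_mask = masks.sum(dim=1).clamp(0, 1)  # (B, H, W) input-domain mask M_i
        g_loss = ((discriminator(fake) - 1) ** 2).mean() + lam * (position - aggregated_mask).abs().mean()
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()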
7. A method for deformable instance-level image translation as claimed in claim 6, characterized in that in said step D23, the loss function for the adversarial training is:
L_G = L_adv^G + λ · L_fmap
L_D = L_adv^D
wherein L_G is the loss function of the generator, L_D is the loss function of the discriminator, L_adv^G is the adversarial loss function of the generator, L_adv^D is the adversarial loss function of the discriminator, L_fmap is the fusion graph loss function, and λ is a hyper-parameter used to adjust the balance between the loss terms;
the adversarial loss function L_adv^G of the generator is calculated as follows:
L_adv^G = (1/P) · Σ_{i=1..P} ( D_img(I'_i) - 1 )²
the adversarial loss function L_adv^D of the discriminator is calculated as follows:
L_adv^D = (1/P) · Σ_{i=1..P} [ ( D_img(I_i) - 1 )² + ( D_img(I'_i) )² ]
wherein D_img denotes the discriminator used for the adversarial training of the instance generation network; I'_i denotes the generated image produced from the input image mask pair of the i-th domain; I_i denotes the image of the input image mask pair of the i-th domain; and P is the number of sample data input in step D21;
the fusion graph loss function L_fmap is calculated as follows:
L_fmap = (1/P) · Σ_{i=1..P} || M_i - α'_i ||_1
wherein M_i denotes the input-domain mask formed by aggregating the instance masks of the image mask pair of the i-th domain, and α'_i denotes the position information indicating the foreground position in the fusion information generated by the generator from the image mask pair of the i-th domain.
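The claim 7 losses, under the same least-squares reading used in the training sketch above (an assumption, since the publication gives the exact formulas only as images), can be written as standalone functions:

import torch

def generator_loss(d_img, fake_images, positions, input_masks, lam):
    # Adversarial term: generated images should be scored as real by D_img.
    adv = ((d_img(fake_images) - 1) ** 2).mean()
    # Fusion graph term L_fmap: the position map should match the aggregated input-domain mask.
    fmap = (positions - input_masks).abs().mean()
    return adv + lam * fmap

def discriminator_loss(d_img, real_images, fake_images):
    return ((d_img(real_images) - 1) ** 2).mean() + (d_img(fake_images.detach()) ** 2).mean()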
8. A deformable instance-level image translation method is characterized by comprising the following steps of:
A. Mask deformation
A1, inputting the instance masks of a source domain and a target domain label into a pre-trained mask deformation network; the mask deformation network includes an encoder and a generator;
A2, the mask deformation network deforms the masks according to the following steps:
A21, aggregating all the instance masks of the source domain to obtain a source domain mask; performing feature extraction on the source domain mask through the encoder to obtain the overall feature F_img of the source domain mask; fusing each instance mask m^S_(i) of the source domain with the overall feature F_img respectively to obtain the instance mask feature F_mask(i) corresponding to each instance mask; then performing feature encoding on the label information of the target domain, and embedding the feature encoding of the target domain label information into each instance mask feature F_mask(i) respectively;
A22, inputting each instance mask feature fused with the label information feature encoding into the generator respectively, and taking the target domain generation mask m'^T_(i) finally output by the generator as the instance mask of the corresponding target domain.
B. Image generation
generating a target domain image retaining the source domain background according to the deformable instance-level image translation method of any one of claims 1 to 7, taking as input the image mask pair of the source domain, the label information of the target domain, and the target domain instance masks obtained in step A.
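An illustrative forward pass for the mask deformation stage of step A; the fusion follows the element-wise (point multiplication) reading of claim 9, the label embedding uses claim 9's convolutional-encoding alternative, and all interfaces are assumptions:

import torch
import torch.nn.functional as F

def deform_masks(instance_masks, target_label_onehot, encoder, label_encoder, generator):
    # instance_masks: (N, 1, H, W) binary source-domain instance masks;
    # target_label_onehot: (C,) one-hot target-domain label.
    # A21: aggregate the instance masks and extract the overall feature F_img.
    source_mask = instance_masks.sum(dim=0, keepdim=True).clamp(0, 1)   # (1, 1, H, W)
    f_img = encoder(source_mask)                                        # (1, C_f, h, w)

    # Encode the target-domain label once (convolutional alternative of claim 9).
    label_code = label_encoder(target_label_onehot.view(1, -1, 1, 1))
    label_code = label_code.expand(-1, -1, *f_img.shape[2:])

    deformed = []
    for m in instance_masks:
        # Fuse each instance mask with F_img by element-wise (point) multiplication.
        m_small = F.interpolate(m.unsqueeze(0), size=f_img.shape[2:], mode='nearest')
        f_mask = f_img * m_small
        # Splice the label encoding onto the fused instance feature and decode.
        deformed.append(generator(torch.cat([f_mask, label_code], dim=1)))
    # A22: the generator's final outputs serve as target-domain instance masks.
    return torch.cat(deformed, dim=0)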
9. A method for deformable instance-level image translation as claimed in claim 8, characterized by:
the encoder of the mask deformation network is a multilayer convolutional neural network; the instance masks m^S_(i) of the source domain are fused with the overall feature F_img by matrix point multiplication;
the label information of the target domain is feature-encoded by one-hot encoding, and the instance mask feature F_mask(i) is fused with the feature encoding of the target domain label information by matrix multiplication; alternatively, the label information of the target domain is encoded by a convolutional neural network, and the instance mask feature F_mask(i) is then concatenated with the feature encoding of the target domain label information.
10. A method for deformable instance-level image translation as claimed in claim 8, characterized by:
the generator of the mask deformation network comprises a multilayer residual neural network and a multilayer convolutional neural network; the multilayer residual neural network first scales the input instance mask feature F_mask(i) fused with the label information so that it matches the input dimensionality of the multilayer convolutional neural network, and the multilayer convolutional neural network then decodes it to generate the target domain generation mask; the layers of the multilayer convolutional neural network are enlarged by upsampling so that the final output is the same size as the source domain image.
11. A method for deformable instance-level image translation according to any of claims 8, 9 or 10, characterized in that said mask deformation network is trained by the following steps:
B1, training sample data preparation:
acquiring masks from a dataset, defining domains according to the categories to which the foreground belongs, and constructing sample pairs from the constructed domains by pairwise combination, wherein each sample pair comprises two domains, one serving as the source domain and the other as the target domain, and over all the sample pairs each constructed domain serves as the target domain at least once;
B2, training the mask deformation network:
B21, inputting at least one sample pair comprising the source domain and target domain specified by the task to be translated;
B22, for each input sample pair, processing by the mask deformation network according to the following steps:
randomly sampling a set number of instance masks from the instance masks of the source domain and from the instance masks of the target domain of the sample pair respectively; pairing the sampled instance masks of the source domain with the sampled instance masks of the target domain one to one, i.e., the sampled source domain instance mask m_{i,j}^S corresponds to the target domain instance mask m_{i,j}^T, where the subscript i denotes the i-th sample pair and ranges from 1 to P, P being the number of input sample pairs; j denotes the j-th mask of the corresponding domain and ranges from 1 to Q, Q being the set sampling number; the superscript T denotes the target domain and S denotes the source domain;
inputting the target domain label and the sampled instance masks of the source domain into the mask deformation network, and generating the target domain generation masks m'_{i,j}^T corresponding to the respective instance masks of the source domain; based on the source domain instance masks m_{i,j}^S, the target domain instance masks m_{i,j}^T and the target domain generation masks m'_{i,j}^T, constructing triplets (m_{i,j}^S, m_{i,j}^T, m'_{i,j}^T), each consisting of the corresponding source domain instance mask, target domain instance mask and target domain generation mask;
B23, for each triplet obtained in step B22, scaling its target domain instance mask m_{i,j}^T so that its size matches the corresponding target domain generation mask m'_{i,j}^T, and using the result as the target domain real mask; forming a positive and negative sample pair from the corresponding target domain generation mask and the target domain real mask;
B24, inputting the positive and negative sample pairs obtained in step B23 into a discriminator, and performing adversarial training on the mask deformation network;
and B25, finishing the training when the set iteration number or the mask deformation network convergence is reached, otherwise, returning to the step B22.
12. A method for deformable instance-level image translation as claimed in claim 11, characterized by:
in step B22, after pairing the sampled instance masks of the source domain with the instance masks of the target domain, the centre positions of the paired instance masks are aligned.
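A sketch of the triplet construction of steps B22 and B23, including the centre alignment of claim 12; the random pairing and the centroid-shift alignment used here are assumptions:

import random
import torch
import torch.nn.functional as F

def center_align(mask_a, mask_b):
    # Shift mask_b so that its foreground centroid coincides with that of mask_a.
    ya, xa = torch.nonzero(mask_a, as_tuple=True)
    yb, xb = torch.nonzero(mask_b, as_tuple=True)
    dy = int(ya.float().mean() - yb.float().mean())
    dx = int(xa.float().mean() - xb.float().mean())
    return torch.roll(mask_b, shifts=(dy, dx), dims=(0, 1))

def build_triplets(source_masks, target_masks, target_label, deform_net, q):
    src = random.sample(source_masks, q)   # sampled source-domain instance masks
    tgt = random.sample(target_masks, q)   # sampled target-domain instance masks
    triplets = []
    for m_s, m_t in zip(src, tgt):
        m_t = center_align(m_s, m_t)                    # claim 12: align centre positions
        m_gen = deform_net(m_s, target_label)           # target-domain generation mask
        # B23: scale the real target mask to the generated mask's size.
        m_real = F.interpolate(m_t[None, None].float(), size=m_gen.shape[-2:], mode='nearest')[0, 0]
        triplets.append((m_s, m_real, m_gen))
    return triplets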
13. A method for deformable instance-level image translation as claimed in claim 11, characterized by:
the generator of the mask deformation network is a multilayer network; in step B22, for each input instance mask feature fused with the label information feature encoding, when the multilayer network of the generator decodes, the final K layers of the generator output layer by layer the target domain generation masks m'_{i,j}^{T(n)} of different sizes; that is, the target domain generation mask is given by the target domain generation mask sequence {m'_{i,j}^{T(1)}, ..., m'_{i,j}^{T(K)}} constructed from these outputs, where K is the number of network layers selected for layer-by-layer output in the multilayer network;
in step B23, for each triplet, its target domain instance mask m_{i,j}^T is scaled to obtain target domain real masks m_{i,j}^{T(n)} whose sizes match those of the target domain generation mask sequence m'_{i,j}^{T(n)}, and the corresponding target domain generation mask m'_{i,j}^{T(n)} and target domain real mask m_{i,j}^{T(n)} constitute an adversarial sample pair, wherein n is the index within the mask sequence.
14. A method for deformable instance-level image translation as claimed in claim 13, characterized by:
in step B24, the loss function of the adversarial training is:
L_G = L_adv^G + λ_1 · L_pc + λ_2 · L_const + λ_3 · L_reg
L_D = L_adv^D
wherein L_G is the loss function of the generator, L_D is the loss function of the discriminator, L_adv^G is the adversarial loss function of the generator, L_adv^D is the adversarial loss function of the discriminator, L_pc is the mask pseudo closed-loop loss function, L_const is the mask consistency loss function, L_reg is the mask regularization function, and λ_1, λ_2, λ_3 are hyper-parameters used to adjust the balance between the loss terms;
the adversarial loss function L_adv^G of the generator is calculated as follows:
L_adv^G = (1/(P·Q·K)) · Σ_{i=1..P} Σ_{j=1..Q} Σ_{n=1..K} ( D_mask(m'_{i,j}^{T(n)}) - 1 )²
the adversarial loss function L_adv^D of the discriminator is calculated as follows:
L_adv^D = (1/(P·Q·K)) · Σ_{i=1..P} Σ_{j=1..Q} Σ_{n=1..K} [ ( D_mask(m_{i,j}^{T(n)}) - 1 )² + ( D_mask(m'_{i,j}^{T(n)}) )² ]
wherein D_mask denotes the discriminator used for the adversarial training of the mask deformation network; m'_{i,j}^{T(n)} denotes the target domain generation mask output by the n-th output layer for the j-th instance mask of the source domain in the i-th sample pair; and m_{i,j}^{T(n)} denotes the n-th target domain real mask obtained by scaling the target domain instance mask corresponding to the j-th instance mask of the source domain in the i-th sample pair;
the mask pseudo closed-loop loss function L_pc is calculated as follows:
L_pc = (1/(P·Q)) · Σ_{i=1..P} Σ_{j=1..Q} || m_{i,j}^S - G_mask(m_{i,j}^S, a_i^S) ||_1
wherein m_{i,j}^S denotes the source domain instance masks of the i-th sample pair, a_i^S denotes the label information of the source domain for the i-th sample pair, and G_mask(m_{i,j}^S, a_i^S) denotes the source domain instance mask reconstructed by the generator G_mask of the mask deformation network from the source domain instance masks and the source domain label information of the i-th sample pair;
the mask consistency loss function L_const is calculated as follows:
L_const = (1/(P·Q)) · Σ_{i=1..P} Σ_{j=1..Q} Σ_{n=1..K-1} || d( m'_{i,j}^{T(n+1)} ) - m'_{i,j}^{T(n)} ||_1
wherein d(·) denotes a down-sampling function;
the mask regularization function L_reg is defined on the final target domain generation mask m'_{i,j}^{T(K)}, wherein m'_{i,j}^{T(K)} denotes the K-th target domain generation mask corresponding to the j-th instance mask of the source domain of the i-th sample pair, the K-th output being the final output of the generator, and sum(·) denotes the numerical summation over all pixel points of the input mask.
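Finally, a hedged sketch of the three auxiliary losses of claim 14. The pseudo closed-loop loss is read here as a same-domain identity reconstruction, the consistency loss as agreement between neighbouring scales, and the regularization as a penalty on degenerate (empty) masks; all three readings are assumptions about formulas that the publication reproduces only as images:

import torch
import torch.nn.functional as F

def pseudo_closed_loop_loss(g_mask, source_masks, source_label):
    # Translating a source mask to its own domain should reproduce it.
    recon = g_mask(source_masks, source_label)
    return (recon - source_masks).abs().mean()

def consistency_loss(mask_sequence):
    # Down-sampled larger-scale outputs should agree with the smaller-scale outputs.
    loss = 0.0
    for small, large in zip(mask_sequence[:-1], mask_sequence[1:]):
        resized = F.interpolate(large, size=small.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + (resized - small).abs().mean()
    return loss

def regularization_loss(final_masks, eps=1e-6):
    # Discourage generated masks whose pixel sum collapses toward zero.
    return (1.0 / (final_masks.flatten(1).sum(dim=1) + eps)).mean()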
CN202210987590.6A 2022-08-17 2022-08-17 Deformable instance-level image translation method Pending CN115424109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210987590.6A CN115424109A (en) 2022-08-17 2022-08-17 Deformable instance-level image translation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210987590.6A CN115424109A (en) 2022-08-17 2022-08-17 Deformable instance-level image translation method

Publications (1)

Publication Number Publication Date
CN115424109A true CN115424109A (en) 2022-12-02

Family

ID=84197582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210987590.6A Pending CN115424109A (en) 2022-08-17 2022-08-17 Deformable instance-level image translation method

Country Status (1)

Country Link
CN (1) CN115424109A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024160178A1 (en) * 2023-01-30 2024-08-08 厦门美图之家科技有限公司 Image translation model training method, image translation method, device, and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination