CN115424109A - Deformable instance-level image translation method


Info

Publication number: CN115424109A
Application number: CN202210987590.6A
Authority: CN (China)
Prior art keywords: mask, domain, image, target domain, instance
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 俞再亮, 苏思桐, 李海燕, 靖伟, 刘玉, 宋井宽
Current Assignee: University of Electronic Science and Technology of China; Zhejiang Lab
Original Assignee: University of Electronic Science and Technology of China; Zhejiang Lab
Application filed by: University of Electronic Science and Technology of China; Zhejiang Lab
Priority to: CN202210987590.6A
Publication of: CN115424109A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of image processing, and in particular to a deformable instance-level image translation method. It addresses the prior-art problems that an instance is difficult to deform and cannot be kept consistent with the mask information when the difference between domains is too large. The edge information of the foreground, the background mask information, and the feature code of the target-domain label information are fused to obtain a hybrid mask. The background feature and the hybrid mask are then input into a generator; the decoding network of the generator decodes the input background feature while extracting additional information from the hybrid mask, applies the extracted additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output with this additional information, thereby obtaining fusion information comprising the foreground information corresponding to the target domain mask and the position information indicating the foreground position. Finally, the generated foreground information is fused with the source domain background image using the position information, and a target domain picture retaining the source domain background is output.

Description

Deformable instance-level image translation method
Technical Field
The invention relates to the field of image processing, in particular to a deformable instance-level image translation method.
Background
In recent years, with the rapid development of deep neural networks, image processing techniques based on neural networks have gradually replaced the time-consuming and labor-intensive traditional image processing methods, making it possible to edit high-level image semantics, for example through the image translation task. The image translation task aims to convert a source domain image into a target domain image through a designed model, that is, to learn the mapping between a source domain and a target domain, where a domain is a set of pictures sharing certain characteristics. Since generative adversarial networks were proposed, a significant portion of visual tasks can be formulated as image translation tasks, such as style transfer, super-resolution, label-guided image generation, and image inpainting.
The instance-level image translation task mainly acts on specific foreground instances in an image, and is generally divided into two types:
1. the background information is not constrained while the foreground instance is translated;
2. the original background is preserved while a particular foreground instance is translated.
For the first task, a common design paradigm is that different image translation processes are performed on the foreground and the background on the premise that the model learns to distinguish the foreground instance and the background in the input image. However, this approach can sometimes erroneously distinguish between foreground and background, thereby generating unintended results.
Therefore, the second task has broader application prospects in real life; for example, virtual try-on and post-processing of video images often require replacing a specific instance while preserving the original background. However, early models only achieve translation of low-level characteristics such as texture and pattern, and cannot give reasonable results when facing changes in high-level characteristics such as foreground shape.
To meet this demand, the unsupervised deformable instance image translation task has been defined in recent years. This task aims to translate the foreground to the target domain while preserving the background information, accompanied by apparent changes in foreground shape.
From this point of view, recent work has attempted to introduce an external mask as guidance information and to guide the model to learn the cross-domain mapping of shape by setting the target domain mask as a learning target. However, experimental verification shows that current models have several problems. First, unreasonable target domain masks are very likely to appear in the generated results, indicating that the models' ability to generate target domain masks is insufficient. Although these methods guide cross-domain mask generation by constructing mask data pairs in the training stage, the mask information is introduced only by concatenating it with the image information, so the mask information is not fully exploited. Second, the translated foreground image often fails to match the generated mask, which reflects that the target domain mask does not provide reasonable guidance and constraint for the generated image. In addition, when facing multiple foreground instances in a single image, current methods adopt a serial translation mode, which greatly increases the training time overhead.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a deformable instance-level image translation method that solves the problems that an instance is difficult to deform and cannot be kept consistent with the mask information when the difference between domains is too large.
The technical solution adopted by the invention to solve this problem is as follows:
A deformable instance-level image translation method comprises the following steps:
C1, inputting an image-mask pair of a source domain and the label information and instance masks of a target domain into an image translation model, wherein the image-mask pair comprises a group of instance masks and the corresponding image; the image translation model comprises a pre-trained image completion model and an instance generation network;
C2, based on the source-domain image-mask pair input in the step C1, first aggregating all instance masks of the source domain to obtain a source domain mask, then removing the foreground of the corresponding source domain image according to the source domain mask to obtain a residual image with the masked part removed, and completing the residual image with the image completion model to obtain the background image of the source domain;
C3, based on the target-domain instance masks input in the step C1, first aggregating all instance masks of the target domain to obtain a target domain mask, and then inputting the target domain mask and the source domain background image B_S obtained in the step C2 into the instance generation network; the instance generation network comprises an encoder and a generator and processes them as follows:
extracting the background feature of the source domain background image through the encoder, based on the input source domain background image;
based on the input target domain mask, obtaining the edge information of the foreground through an edge extraction algorithm; obtaining the background mask information by inverting the target domain mask; feature-encoding the label information of the target domain; and then fusing the foreground edge information, the background mask information, and the feature code of the target-domain label information to obtain a hybrid mask;
inputting the background feature and the hybrid mask into the generator, which comprises a decoding network; the decoding network decodes the input background feature while extracting additional information from the hybrid mask, applies the extracted additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output with this additional information, thereby obtaining fusion information comprising the foreground information corresponding to the target domain mask and the position information indicating the foreground position;
finally, fusing the generated foreground information with the source domain background image using the position information, and outputting a target domain image retaining the source domain background.
Specifically, the image completion model is the HiFill model; the encoder of the instance generation network is a multi-layer residual neural network; the edge extraction algorithm is the Canny edge detection algorithm; the label information of the target domain is feature-encoded by one-hot encoding; and the foreground edge information, the background mask information, and the feature code of the target-domain label information are fused through matrix multiplication.
Specifically, the generator of the instance generation network is an APADE-ResNet network, a neural network formed by adding an APADE block between each convolution layer and ReLU layer on the basis of a ResNet network; that is, the decoding network of the generator is a ResNet, and additional information is extracted from the hybrid mask through the APADE blocks.
The APADE block has two inputs: one is the output of the convolution layer, and the other is the hybrid mask.
The hybrid mask is scaled to the input feature dimensions of the APADE block and input to the APADE block, where it first passes through a convolution layer Conv_s; the output of Conv_s is then fed into two independent convolution layers Conv_1 and Conv_2 to generate two embedding vectors γ and β.
After entering the APADE block, the output of the convolution layer first passes through a batch normalization layer, and the output of the batch normalization layer is then combined with the embedding vectors γ and β according to the following formula to give the output of the APADE block:
F_out = γ·F_in + β
wherein F_in denotes the output of the batch normalization layer and F_out denotes the output of the APADE block.
Further, the APADE-ResNet network is a neural network formed by adding APADE blocks on the basis of two-layer ResNet blocks, that is:
each layer of the APADE-ResNet network comprises a forward neural network branch and a shortcut branch; the forward branch comprises a first convolution layer, a first APADE block, a first ReLU layer, a second convolution layer, a second APADE block, and a second ReLU layer connected in series, and the shortcut branch builds a skip connection between the input and the output of the second APADE block, adding the input to the output of the second APADE block to form the input of the second ReLU layer.
Further, the APADE-ResNet network is a multi-layer APADE-ResNet network; the input of the first layer is the background feature and the hybrid mask, and the input of each layer after the first is the output of the previous layer together with the hybrid mask; the layers are enlarged by upsampling so that the final output has the same size as the source domain image.
Further, the instance generation network is trained according to the following steps:
D1, training sample data preparation:
acquiring data from a data set, defining domains according to the categories of scenes, and constructing an image-mask pair for each domain, wherein the image-mask pair comprises a group of instance masks and the corresponding images;
D2, inputting sample data and training the instance generation network, wherein the training comprises the following steps:
D21, inputting the image-mask pairs of at least one domain, including the target domain specified by the task to be translated;
D22, processing the image-mask pair of each input domain according to the following steps:
taking the image-mask pair of the input domain as the source-domain image-mask pair input to the image translation model; taking the label information and instance masks of the input domain as the target-domain label information and instance masks input to the image translation model; generating an image I'_i from the input by the image translation model, wherein i denotes the i-th input sample data corresponding to the generated image;
taking the image of the input domain's image-mask pair as the real image I_i, and forming a positive and negative sample pair from the real image I_i and the generated image I'_i;
D23, inputting the positive and negative sample pairs obtained in the step D22 into a discriminator, and performing adversarial training on the instance generation network;
D24, the training ends when the set number of iterations is reached or the instance generation network converges; otherwise, return to the step D22.
Preferably, in the step D23, the loss function of the adversarial training is:
L^G = L^G_adv + λ_fmap·L_fmap
L^D = L^D_adv
wherein L^G is the loss function of the generator, L^D is the loss function of the discriminator, L^G_adv is the adversarial loss function of the generator, L^D_adv is the adversarial loss function of the discriminator, L_fmap is the fusion-map loss function, and λ_fmap is a hyper-parameter used to adjust the balance of the loss functions;
the adversarial loss function of the generator L^G_adv and the adversarial loss function of the discriminator L^D_adv are computed as: [formulas given as images in the original], wherein D_img denotes the discriminator used for adversarial training of the instance generation network, I'_i denotes the generated image produced from the image-mask pair of the i-th input domain, I_i denotes the image of the image-mask pair of the i-th input domain, and P is the number of sample data input in the step D21;
the fusion-map loss function L_fmap is computed as: [formula given as an image in the original], wherein M_i denotes the input-domain mask formed by aggregating the instance masks of the image-mask pair of the i-th domain, and α'_i denotes the position information indicating the foreground position in the fusion information generated by the generator based on the image-mask pair of the i-th domain.
In order to solve the problem that target domain masks are unavailable in some scenarios, the invention also provides a deformable instance-level image translation method comprising two stages, mask deformation and image generation, with the following steps:
A. Mask deformation
A1, inputting the instance masks of a source domain and a target domain label into a pre-trained mask deformation network; the mask deformation network comprises an encoder and a generator;
A2, the mask deformation network deforms the masks according to the following steps:
A21, aggregating all instance masks of the source domain to obtain a source domain mask; extracting features from the source domain mask through the encoder to obtain the overall feature F_img of the source domain mask; fusing each instance mask of the source domain with the overall feature F_img respectively to obtain the instance mask feature F_mask(i) corresponding to each instance; then feature-encoding the label information of the target domain, and embedding the feature code of the target-domain label information into each instance mask feature F_mask(i);
A22, inputting each instance mask feature fused with the label-information feature code into the generator, and taking the target-domain generated mask finally output by the generator as the instance mask of the corresponding target domain.
B. Image generation
B, taking the image-mask pair of the source domain, the label information of the target domain, and the target-domain instance masks obtained in the step A as input, and generating a target domain image retaining the source domain background according to the deformable instance-level image translation method of any one of claims 1 to 7.
Specifically, the encoder of the mask deformation network is a multi-layer convolutional neural network, and the instance masks of the source domain are fused with the overall feature F_img by matrix dot multiplication;
the label information of the target domain is feature-encoded by one-hot encoding, and the instance mask feature F_mask(i) is fused with the feature code of the target-domain label information by matrix multiplication; alternatively, the label information of the target domain is encoded by a convolutional neural network, and the instance mask feature F_mask(i) is then concatenated with the feature code of the target-domain label information.
Specifically, the generator of the mask deformation network comprises a multi-layer residual neural network and a multi-layer convolutional neural network. The multi-layer residual neural network first scales the input instance mask feature F_mask(i) fused with the label information to match the input dimensionality of the multi-layer convolutional neural network, and the multi-layer convolutional neural network then decodes it to generate the target-domain generated mask; the layers of the multi-layer convolutional neural network are enlarged by upsampling so that the final output has the same size as the source domain image.
Further, the mask deformation network is trained according to the following steps:
B1, training sample data preparation:
acquiring masks from a data set, defining domains according to the categories to which the foreground belongs, and constructing sample pairs by pairwise combination of the constructed domains, wherein each sample pair comprises two domains, one serving as the source domain and the other as the target domain, and across all sample pairs each constructed domain serves as the target domain at least once;
B2, training the mask deformation network:
B21, inputting at least one sample pair comprising the source domain and the target domain specified by the task to be translated;
B22, for each input sample pair, the mask deformation network processes it according to the following steps:
randomly sampling a set number of instance masks from the instance masks of the source domain and of the target domain of the sample pair respectively; pairing the sampled source-domain instance masks and target-domain instance masks pairwise, i.e., the source-domain instance mask M^S_(i,j) corresponds to the target-domain instance mask M^T_(i,j), wherein the subscript i denotes the i-th sample pair and ranges from 1 to P, P being the number of input sample pairs; j denotes the j-th mask of the corresponding domain and ranges from 1 to Q, Q being the set sampling number; the superscript T denotes the target domain and S denotes the source domain;
inputting the target domain label and the sampled instance masks of the source domain into the mask deformation network, and generating the target-domain generated masks M'^T_(i,j) corresponding to the instance masks of the source domain respectively; based on the correspondence between the source-domain instance mask M^S_(i,j) and the target-domain instance mask M^T_(i,j), and between the target-domain generated mask M'^T_(i,j) and the source-domain instance mask M^S_(i,j), constructing triplets each composed of the corresponding source-domain instance mask M^S_(i,j), target-domain instance mask M^T_(i,j), and target-domain generated mask M'^T_(i,j);
B23, for each triplet obtained in the step B22, scaling its target-domain instance mask M^T_(i,j) so that its size matches the corresponding target-domain generated mask M'^T_(i,j), and using the scaled mask as the target-domain real mask; forming a positive and negative sample pair from the corresponding target-domain generated mask and target-domain real mask;
B24, inputting the positive and negative sample pairs obtained in the step B23 into a discriminator, and performing adversarial training on the mask deformation network;
B25, the training ends when the set number of iterations is reached or the mask deformation network converges; otherwise, return to the step B22.
Further, in the step B22, after pairwise matching the sampled instance masks of the source domain and the instance masks of the target domain, center positions of the pairwise matched instance masks are aligned.
Further, the generator of the mask deformation network is a multi-layer network; in the step B22, for each input instance mask feature fused with the label-information feature code, when the multi-layer network of the generator decodes, the last K layers of the generator output, layer by layer, target-domain generated masks M'^T_(i,j,n) of different sizes, that is, M'^T_(i,j) is a target-domain generated mask sequence composed of M'^T_(i,j,1), ..., M'^T_(i,j,K), where K is the number of network layers selected for layer-by-layer output in the multi-layer network;
in the step B23, for each triplet, its target-domain instance mask M^T_(i,j) is scaled to obtain target-domain real masks M^T_(i,j,n) whose sizes match the respective target-domain generated masks M'^T_(i,j,n) of the mask sequence, and the corresponding target-domain generated mask M'^T_(i,j,n) and target-domain real mask M^T_(i,j,n) constitute an adversarial sample pair, where n is the index within the mask sequence.
Further, in the step B24, the loss function of the adversarial training is:
L^G = L^G_adv + λ_pc·L_pc + λ_const·L_const + λ_reg·L_reg
L^D = L^D_adv
wherein L^G is the loss function of the generator, L^D is the loss function of the discriminator, L^G_adv is the adversarial loss function of the generator, L^D_adv is the adversarial loss function of the discriminator, L_pc is the mask pseudo-closed-loop loss function, L_const is the mask consistency loss function, L_reg is the mask regularization function, and the λ terms are hyper-parameters used to adjust the balance of the loss functions;
the adversarial loss function of the generator L^G_adv and the adversarial loss function of the discriminator L^D_adv are computed as: [formulas given as images in the original], wherein D_mask denotes the discriminator used for adversarial training of the mask deformation network, M'^T_(i,j,n) denotes the target-domain generated mask output by the n-th output layer corresponding to the j-th instance mask of the source domain in the i-th sample pair, and M^T_(i,j,n) denotes the n-th target-domain real mask obtained by scaling the target-domain instance mask corresponding to the j-th instance mask of the source domain in the i-th sample pair;
the mask pseudo-closed-loop loss function L_pc is computed as: [formula given as an image in the original], wherein M^S_(i,:) denotes the sequence of source-domain instance masks in the i-th sample pair, y^S_i denotes the source-domain label information of the i-th sample pair, and G_mask(M^S_(i,:), y^S_i) denotes the source-domain instance masks reconstructed by the generator G_mask of the mask deformation network based on the source-domain instance masks and label information in the i-th sample pair;
the mask consistency loss function L_const is computed as: [formula given as an image in the original], wherein d(·) denotes a down-sampling function;
the mask regularization function L_reg is computed as: [formula given as an image in the original], wherein M'^T_(i,j,K) denotes the K-th target-domain generated mask corresponding to the j-th instance mask of the source domain of the i-th sample pair, i.e., the final output of the generator, and sum(·) denotes the numerical summation over all pixels of the input mask.
The beneficial effects of the invention are as follows:
In the deformable instance-level translation method, target domain mask information is introduced during translation; in the decoding stage the generator extracts additional information from the hybrid mask, applies this additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output with it, thereby obtaining fusion information comprising the foreground information corresponding to the target domain mask and the position information indicating the foreground position. The target domain mask information can therefore serve as guidance for generating the corresponding instance, which greatly alleviates the situations in which the translated result is inconsistent with the corresponding mask and the instance is difficult to deform successfully.
In the foreground replacement task, when a matched target-domain instance image and mask are available as references, the first method of the invention allows the foreground to easily complete the cross-domain conversion of shape and appearance.
In more practical situations, however, only the label of the target domain is available, that is, only the kind of target-domain image to convert to is known, and mask guidance is missing. Although the prior art offers other ways of obtaining a target domain mask that could be adapted to the first method of the invention, the invention provides another translation method to broaden its range of application: a mask deformation stage is added on top of the first method to translate the source domain mask into a target domain mask, so that the translation task can be completed effectively when only the target-domain label information is known.
Meanwhile, the invention designs a training scheme with effective supervision information for the two unsupervised training stages; the supervision information is obtained by mining the data's own knowledge without requiring additional data, which ensures stable training and provides an effective and efficient training paradigm for unsupervised deformable instance translation.
Drawings
FIG. 1 is an inference data-flow diagram of the two-stage deformable instance-level image translation method of the present invention;
FIG. 2 is a network architecture diagram of the multi-layer APADE-ResNet network of the deformable instance-level image translation method of the present invention;
FIG. 3 is a comparison diagram of the simulation experiments of the deformable instance-level image translation method of the present invention.
Detailed Description
The invention aims to provide a deformable instance-level image translation method in which target domain mask information is introduced in the translation process and used as guidance information for generating the corresponding instances, so as to solve the problems that the translated result is inconsistent with the corresponding mask and the instance is difficult to deform successfully.
Of the two deformable instance-level image translation methods provided by the invention, one comprises only an image generation stage, while the other comprises a mask deformation stage and an image generation stage; the image generation stages of the two methods use the same model. To simplify the description, the invention is therefore further described below with reference to the drawings and the specific embodiment of the two-stage deformable instance-level image translation method.
Embodiment:
As shown in fig. 1, a deformable instance-level image translation method includes the following steps:
S1, mask deformation
S11, inputting the instance masks of a source domain and a target domain label into a pre-trained mask deformation network; the mask deformation network comprises an encoder and a generator. An instance mask is the mask corresponding to one foreground instance in the image; in essence, it is a segmentation mask obtained by splitting the foreground mask of the image by instance, and the mask corresponding to the image is formed by aggregating all instance masks of that image. For example, if there are two persons in an image, each person corresponds to one instance mask, and aggregating the two persons' instance masks forms the mask of the image.
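For illustration only, a minimal Python sketch of this aggregation step (not part of the patent; the function name and the toy example are hypothetical):

    import numpy as np

    def aggregate_instance_masks(instance_masks):
        # Aggregate binary instance masks (each H x W, values in {0, 1}) into one
        # image-level mask by taking the pixel-wise union.
        return np.stack(instance_masks, axis=0).max(axis=0)

    # Toy example: two "person" instance masks combine into the mask of the image.
    m1 = np.zeros((4, 4), dtype=np.uint8); m1[:2, :2] = 1
    m2 = np.zeros((4, 4), dtype=np.uint8); m2[2:, 2:] = 1
    image_mask = aggregate_instance_masks([m1, m2])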
S12, the mask is deformed by the mask deformation network according to the following steps:
S121, aggregating all instance masks of the source domain to obtain a source domain mask; extracting features from the source domain mask through the encoder to obtain the overall feature F_img of the source domain mask; fusing each instance mask of the source domain with the overall feature F_img respectively to obtain the mask feature F_mask(i) corresponding to each mask; then feature-encoding the label information of the target domain, and embedding the feature code of the target-domain label information into each instance mask feature F_mask(i).
The encoder can be any existing encoding network for image feature extraction. In this embodiment, specifically, a multi-layer convolutional neural network is used as the encoder of the mask deformation network, and each instance mask of the source domain is fused with the overall feature F_img by matrix dot multiplication.
Similarly, the label information of the target domain may be encoded in any manner, for example by one-hot encoding (e.g., sheep encoded as [0,1] and giraffe as [1,0]), with fusion performed by matrix multiplication. In this embodiment, specifically, the label information of the target domain is encoded by a convolutional neural network, and the instance mask feature F_mask(i) is then concatenated with the feature code of the target-domain label information.
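A minimal PyTorch sketch of these fusion options; the module names, channel sizes, and broadcasting layout are assumptions for illustration, not the patent's exact implementation:

    import torch
    import torch.nn as nn

    def fuse_mask_with_feature(F_img, inst_mask):
        # Matrix dot (element-wise) multiplication: keep the overall feature inside the instance.
        return F_img * inst_mask.unsqueeze(0)                       # F_mask(i), shape (C, H, W)

    def one_hot_label(domain_index, num_domains=2):
        # e.g. sheep -> [0, 1], giraffe -> [1, 0], as in the embodiment.
        return torch.eye(num_domains)[domain_index]

    def fuse_label_by_product(F_mask_i, label_onehot):
        # One plausible reading of "fusion by matrix multiplication": replicate the
        # feature once per label channel and scale it by the one-hot entries.
        return torch.einsum('l,chw->lchw', label_onehot, F_mask_i).reshape(-1, *F_mask_i.shape[1:])

    class LabelConcatFusion(nn.Module):
        # Alternative used in the embodiment: encode the label with a small conv net,
        # then concatenate the label code with the instance mask feature.
        def __init__(self, num_domains, label_channels=8):
            super().__init__()
            self.label_encoder = nn.Sequential(
                nn.Conv2d(num_domains, label_channels, kernel_size=1), nn.ReLU())
        def forward(self, F_mask_i, label_onehot):
            c, h, w = F_mask_i.shape
            label_map = label_onehot.view(-1, 1, 1).expand(-1, h, w).unsqueeze(0)
            label_code = self.label_encoder(label_map).squeeze(0)   # (label_channels, H, W)
            return torch.cat([F_mask_i, label_code], dim=0)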
S122, inputting each instance mask feature fused with the label-information feature code into the generator, and taking the target-domain generated mask finally output by the generator as the instance mask of the corresponding target domain.
The above generator may likewise be any existing decoding network for image generation. In this embodiment, specifically, the generator of the mask deformation network comprises a multi-layer residual neural network and a multi-layer convolutional neural network. The multi-layer residual neural network first scales the input instance mask feature F_mask(i) fused with the label information so that it matches the input dimensionality of the multi-layer convolutional neural network; the multi-layer convolutional neural network then decodes it to generate the target-domain generated mask. The layers of the multi-layer convolutional neural network are enlarged by upsampling so that the final output has the same size as the source domain image.
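A compact sketch of such a generator (residual refinement followed by an upsampling convolutional decoder); the layer counts and channel widths below are assumptions, not the patent's configuration:

    import torch.nn as nn
    import torch.nn.functional as F

    class ResBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, padding=1))
        def forward(self, x):
            return F.relu(x + self.body(x))

    class MaskGenerator(nn.Module):
        # Residual layers refine the fused instance-mask feature; an upsampling
        # convolutional decoder then produces the target-domain mask at image resolution.
        def __init__(self, in_ch=72, mid_ch=64, num_res=4, num_up=3):
            super().__init__()
            self.entry = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
            self.res = nn.Sequential(*[ResBlock(mid_ch) for _ in range(num_res)])
            ups = []
            for _ in range(num_up):                      # each step doubles the resolution
                ups += [nn.Upsample(scale_factor=2, mode='nearest'),
                        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU()]
            self.up = nn.Sequential(*ups)
            self.head = nn.Sequential(nn.Conv2d(mid_ch, 1, 3, padding=1), nn.Sigmoid())
        def forward(self, f_mask_with_label):            # (B, in_ch, h, w)
            x = self.res(self.entry(f_mask_with_label))
            return self.head(self.up(x))                 # (B, 1, h * 2**num_up, w * 2**num_up)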
S2, image generation
S21, taking the image-mask pair of the source domain, the label information of the target domain, and the target-domain instance masks obtained in the step S1 as input to the image translation model, wherein the image-mask pair comprises a group of instance masks and the corresponding image; the image translation model comprises a pre-trained image completion model and an instance generation network.
S22, based on the source-domain image-mask pair input in the step S21, first aggregating all instance masks of the source domain to obtain a source domain mask, then removing the foreground of the corresponding source domain image according to the source domain mask to obtain a residual image with the masked part removed, and completing the residual image with the image completion model to obtain the background image of the source domain.
In the present embodiment, the image completion model is the HiFill model, published in the CVPR 2020 article "Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting". Of course, other existing image completion models may also be used, pre-trained in the existing manner on a large picture dataset.
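A sketch of this background-extraction step, assuming the pre-trained completion model is wrapped behind a simple inpaint_fn(image, hole_mask) callable (this interface is hypothetical, not HiFill's actual API):

    import numpy as np

    def extract_background(src_image, src_instance_masks, inpaint_fn):
        # src_image: (H, W, 3) uint8 array; src_instance_masks: list of (H, W) binary masks.
        # inpaint_fn(image, hole_mask) -> completed image (e.g. a wrapper around HiFill).
        src_mask = np.stack(src_instance_masks, axis=0).max(axis=0)   # aggregate to the domain mask
        remaining = src_image.copy()
        remaining[src_mask.astype(bool)] = 0                          # remove the foreground
        background = inpaint_fn(remaining, src_mask)                  # fill the removed region
        return background, src_mask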
S23, based on the target-domain instance masks input in the step S21, first aggregating all instance masks of the target domain to obtain a target domain mask, and then inputting the target domain mask and the source domain background image B_S obtained in the step S22 into the instance generation network; the instance generation network comprises an encoder and a generator and processes them as follows:
extracting the background feature of the source domain background image through the encoder, based on the input source domain background image;
based on the input target domain mask, obtaining the edge information of the foreground through an edge extraction algorithm; obtaining the background mask information by inverting the target domain mask; feature-encoding the label information of the target domain; and then fusing the foreground edge information, the background mask information, and the feature code of the target-domain label information to obtain a hybrid mask;
inputting the background feature and the hybrid mask into the generator, which comprises a decoding network; the decoding network decodes the input background feature while the generator extracts additional information from the hybrid mask, applies the extracted additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output with this additional information, thereby obtaining fusion information comprising the foreground information corresponding to the target domain mask and the position information indicating the foreground position.
The core of the invention is: introducing target domain mask information in the translation process and using it as guidance information for generating the corresponding instances. This is embodied as follows: designing a hybrid mask, extracting additional information from the hybrid mask through the generator, applying the extracted additional information to the normalized decoded output, and performing an affine transformation on the normalized decoded output with this additional information. Therefore, the generator may be any decoder, as in the prior art, that can extract the additional information from the hybrid mask and apply it to the normalized decoded output for image generation.
Specifically, in this embodiment, the encoder of the instance generation network is a multi-layer residual neural network; the edge extraction algorithm is the Canny edge detection algorithm; the label information of the target domain is feature-encoded by one-hot encoding; and the foreground edge information, the background mask information, and the feature code of the target-domain label information are fused through matrix multiplication.
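A sketch of how such a hybrid mask could be assembled (Canny edges of the target domain mask, the inverted mask as background information, and a one-hot label code fused by broadcast multiplication); OpenCV's Canny is used for illustration, and the thresholds and fusion layout are assumptions:

    import numpy as np
    import cv2

    def build_hybrid_mask(target_mask, label_onehot):
        # target_mask: (H, W) binary target-domain mask; label_onehot: (L,) one-hot label code.
        edges = cv2.Canny((target_mask * 255).astype(np.uint8), 100, 200) / 255.0  # foreground edges
        background = 1.0 - target_mask.astype(np.float32)             # inverted mask = background info
        base = np.stack([edges, background], axis=0)                  # (2, H, W)
        # Fuse the label code multiplicatively, broadcasting it over the spatial maps.
        label_planes = label_onehot.astype(np.float32)[:, None, None, None]
        fused = label_planes * base[None, :, :, :]                    # (L, 2, H, W)
        return fused.reshape(-1, *target_mask.shape)                  # (L*2, H, W) hybrid mask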
As shown in fig. 2, the generator of the instance generation network is an APADE-ResNet network, a neural network formed by adding an APADE block between each convolution layer and ReLU layer on the basis of a ResNet (Residual Neural Network); that is, the decoding network of the generator is a ResNet, and additional information is extracted from the hybrid mask through the APADE blocks. The APADE block has two inputs: one is the output of the convolution layer, and the other is the hybrid mask. The hybrid mask is scaled to the input feature dimensions of the APADE block and input to the APADE block, where it first passes through a convolution layer Conv_s; the output of Conv_s is then fed into two independent convolution layers Conv_1 and Conv_2 to generate two embedding vectors γ and β. After entering the APADE block, the output of the convolution layer first passes through a batch normalization layer, and the output of the batch normalization layer is then combined with the embedding vectors γ and β according to the following formula to give the output of the APADE block:
F_out = γ·F_in + β
wherein F_in denotes the output of the batch normalization layer and F_out denotes the output of the APADE block.
Further, in this embodiment, the APADE-ResNet network is a neural network formed by adding APADE blocks on the basis of two-layer ResNet blocks, that is: each layer of the APADE-ResNet network comprises a forward neural network branch and a shortcut branch; the forward branch comprises a first convolution layer, a first APADE block, a first ReLU layer, a second convolution layer, a second APADE block, and a second ReLU layer connected in series, and the shortcut branch builds a skip connection between the input and the output of the second APADE block, adding the input to the output of the second APADE block to form the input of the second ReLU layer. The APADE-ResNet network is a multi-layer APADE-ResNet network; the input of the first layer is the background feature and the hybrid mask, and the input of each layer after the first is the output of the previous layer together with the hybrid mask; the layers are enlarged by upsampling so that the final output has the same size as the source domain image.
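A minimal PyTorch sketch of an APADE block and one APADE-ResNet layer as described above; the channel sizes and the hidden width of Conv_s are assumptions, and the modulation follows F_out = γ·F_in + β:

    import torch.nn as nn
    import torch.nn.functional as F

    class APADEBlock(nn.Module):
        # Batch-normalize the convolution output, then modulate it with gamma and beta
        # predicted from the (resized) hybrid mask: F_out = gamma * F_in + beta.
        def __init__(self, feat_ch, mask_ch, hidden_ch=64):
            super().__init__()
            self.bn = nn.BatchNorm2d(feat_ch, affine=False)
            self.conv_s = nn.Sequential(nn.Conv2d(mask_ch, hidden_ch, 3, padding=1), nn.ReLU())
            self.conv_gamma = nn.Conv2d(hidden_ch, feat_ch, 3, padding=1)
            self.conv_beta = nn.Conv2d(hidden_ch, feat_ch, 3, padding=1)
        def forward(self, feat, hybrid_mask):
            f_in = self.bn(feat)
            m = F.interpolate(hybrid_mask, size=feat.shape[-2:], mode='nearest')  # scale mask to feature size
            h = self.conv_s(m)
            return self.conv_gamma(h) * f_in + self.conv_beta(h)

    class APADEResBlock(nn.Module):
        # Conv -> APADE -> ReLU -> Conv -> APADE on the forward branch, with a shortcut
        # added to the output of the second APADE block before the second ReLU.
        def __init__(self, ch, mask_ch):
            super().__init__()
            self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
            self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
            self.apade1 = APADEBlock(ch, mask_ch)
            self.apade2 = APADEBlock(ch, mask_ch)
        def forward(self, x, hybrid_mask):
            y = F.relu(self.apade1(self.conv1(x), hybrid_mask))
            y = self.apade2(self.conv2(y), hybrid_mask)
            return F.relu(x + y)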
Finally, the generated foreground information I'_fg is fused with the source domain background image B using the position information α', and a target domain image I' retaining the source domain background is output. This process is formulated as follows:
I'_fg, α' = G_img(B, M')
I' = I'_fg · α' + B · (1 − α')
wherein G_img denotes the generator.
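The compositing step itself reduces to a single alpha-blending expression; a one-line illustration, assuming α' is a soft map in [0, 1]:

    # I_fg: generated foreground, alpha: position map, B_bg: source-domain background image.
    def composite(I_fg, alpha, B_bg):
        return I_fg * alpha + B_bg * (1.0 - alpha)   # I' = I'_fg * alpha' + B * (1 - alpha')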
In terms of performance, a network with a larger number of layers generally achieves better results, but the training overhead is increased. Therefore, in order to balance performance and overhead, the multi-layer network in the above embodiment adopts a structure of 4-6 layers.
The mask deformation network and the image translation model are essentially generative networks that translate part of the semantic information of an image, so their training can follow the conventional GAN (generative adversarial network) training paradigm.
However, in order to train more effectively and efficiently, the model is trained as follows:
First, training sample data preparation:
Data is collected from a data set, for example the MS COCO dataset, which provides picture-foreground-mask data pairs for multiple domains. Domains are defined according to the classes to which the foreground belongs, and an image-mask pair is constructed for each domain, wherein the image-mask pair comprises a group of instance masks and the corresponding images; sample pairs are then constructed by pairwise combination of the constructed domains, wherein each sample pair comprises two domains, one serving as the source domain and the other as the target domain, and across all sample pairs each constructed domain serves as the target domain at least once.
Then, the model is trained, where training the mask deformation network comprises the following steps:
S31, inputting at least one sample pair comprising the source domain and the target domain specified by the task to be translated;
S32, for each input sample pair, the mask deformation network processes it according to the following steps:
randomly sampling a set number of instance masks from the instance masks of the source domain and of the target domain of the sample pair respectively; pairing the sampled source-domain instance masks and target-domain instance masks pairwise, i.e., the source-domain instance mask M^S_(i,j) corresponds to the target-domain instance mask M^T_(i,j), wherein the subscript i denotes the i-th sample pair and ranges from 1 to P, P being the number of input sample pairs; j denotes the j-th mask of the corresponding domain and ranges from 1 to Q, Q being the set sampling number; the superscript T denotes the target domain and S denotes the source domain. Meanwhile, after the sampled source-domain and target-domain instance masks are paired, the center positions of each pair of instance masks are aligned to achieve better spatial matching.
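A sketch of the sampling, pairing and center alignment just described, assuming non-empty binary NumPy masks; shifting the source-domain mask onto the target-domain mask's centroid is one possible way to realize the alignment:

    import numpy as np

    def center_align(mask, ref_mask):
        # Shift `mask` so that its centroid coincides with that of `ref_mask`.
        ys, xs = np.nonzero(mask)
        ry, rx = np.nonzero(ref_mask)
        dy, dx = int(ry.mean() - ys.mean()), int(rx.mean() - xs.mean())
        return np.roll(np.roll(mask, dy, axis=0), dx, axis=1)

    def sample_and_pair(src_masks, tgt_masks, Q, rng=np.random):
        # Randomly sample Q masks per domain, pair them one-to-one, and align centers.
        src_idx = rng.choice(len(src_masks), Q, replace=len(src_masks) < Q)
        tgt_idx = rng.choice(len(tgt_masks), Q, replace=len(tgt_masks) < Q)
        return [(center_align(src_masks[s], tgt_masks[t]), tgt_masks[t])
                for s, t in zip(src_idx, tgt_idx)]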
Inputting the target domain label and the sampled instance masks of the source domain into the mask deformation network, and generating the target-domain generated masks M'^T_(i,j) corresponding to the instance masks of the source domain respectively; based on the correspondence between the source-domain instance mask M^S_(i,j) and the target-domain instance mask M^T_(i,j), and between the target-domain generated mask M'^T_(i,j) and the source-domain instance mask M^S_(i,j), constructing triplets each composed of the corresponding source-domain instance mask M^S_(i,j), target-domain instance mask M^T_(i,j), and target-domain generated mask M'^T_(i,j).
In this embodiment, the generator of the mask deformation network is a multi-layer network. To achieve a better training effect, each layer of the multi-layer network can be constrained by a loss, ensuring that the adversarial training proceeds smoothly at multiple scales. Therefore, for each input instance mask feature fused with the label-information feature code, when the multi-layer network of the generator decodes, the last K layers of the generator output, layer by layer, target-domain generated masks M'^T_(i,j,n) of different sizes; that is, M'^T_(i,j) is a target-domain generated mask sequence composed of M'^T_(i,j,1), ..., M'^T_(i,j,K), where K is the number of network layers selected for layer-by-layer output in the multi-layer network.
S33, for each triplet obtained in the step S32, scaling its target-domain instance mask M^T_(i,j) so that its size matches the corresponding target-domain generated mask M'^T_(i,j), and using the scaled mask as the target-domain real mask; the corresponding target-domain generated mask and target-domain real mask form a positive and negative sample pair.
Since M'^T_(i,j) is a target-domain generated mask sequence composed of M'^T_(i,j,1), ..., M'^T_(i,j,K), in this step the target-domain instance mask M^T_(i,j) of each triplet is adaptively scaled to obtain target-domain real masks M^T_(i,j,n) whose sizes match the respective target-domain generated masks M'^T_(i,j,n) of the mask sequence, and the corresponding target-domain generated mask M'^T_(i,j,n) and target-domain real mask M^T_(i,j,n) constitute an adversarial sample pair, where n is the index within the mask sequence.
S34, inputting the positive and negative sample pairs obtained in the step S33 into the discriminator, and performing adversarial training on the mask deformation network.
The adversarial loss function of the generator L^G_adv and the adversarial loss function of the discriminator L^D_adv are computed as: [formulas given as images in the original], wherein D_mask denotes the discriminator used for adversarial training of the mask deformation network, M'^T_(i,j,n) denotes the target-domain generated mask output by the n-th output layer corresponding to the j-th instance mask of the source domain in the i-th sample pair, and M^T_(i,j,n) denotes the n-th target-domain real mask obtained by scaling the target-domain instance mask corresponding to the j-th instance mask of the source domain in the i-th sample pair.
Meanwhile, a mask pseudo-closed-loop loss function is constructed by reconstructing the source domain mask, which supervises the mask generation process. The mask pseudo-closed-loop loss function L_pc is computed as: [formula given as an image in the original], wherein M^S_(i,:) denotes the sequence of source-domain instance masks in the i-th sample pair, y^S_i denotes the source-domain label information in the i-th sample pair, and G_mask(M^S_(i,:), y^S_i) denotes the source-domain instance masks reconstructed by the generator G_mask of the mask deformation network based on the source-domain instance masks and label information in the i-th sample pair.
To stabilize training, the semantic consistency of the masks generated at different scales needs to be evaluated; therefore, all masks except the one output by the first layer are down-sampled to the size of the first layer's mask, and the mask consistency loss function L_const is constructed. It is computed as: [formula given as an image in the original], wherein d(·) denotes the down-sampling function.
Because the mask is binary, the activation function introduces noise during generation; to suppress this noise, the mask regularization function L_reg is constructed. It is computed as: [formula given as an image in the original], wherein M'^T_(i,j,K) denotes the K-th target-domain generated mask corresponding to the j-th instance mask of the source domain of the i-th sample pair, which is the final output of the generator, and sum(·) denotes the numerical summation over all pixels of the input mask.
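The exact formulas for these losses appear only as images in the source; the sketch below shows one plausible reading consistent with the descriptions above (the L1 distances, mean reductions and normalization are assumptions):

    import torch.nn.functional as F

    def pseudo_closed_loop_loss(G_mask, src_masks, src_label):
        # Reconstruct the source-domain masks from themselves plus the source label
        # and penalize the deviation from the originals (pseudo closed loop).
        recon = G_mask(src_masks, src_label)
        return F.l1_loss(recon, src_masks)

    def mask_consistency_loss(mask_pyramid):
        # mask_pyramid: list of generated masks (B, 1, H_n, W_n), coarsest first.
        base = mask_pyramid[0]
        down = [F.interpolate(m, size=base.shape[-2:], mode='bilinear', align_corners=False)
                for m in mask_pyramid[1:]]               # d(.): down-sample to the first layer's size
        return sum(F.l1_loss(m, base) for m in down) / max(len(down), 1)

    def mask_regularization(final_mask):
        # Suppress activation noise by penalizing the total mass of the final mask (sum over pixels).
        return final_mask.sum() / final_mask.numel()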
In summary, the loss function of the adversarial training is:
L^G = L^G_adv + λ_pc·L_pc + λ_const·L_const + λ_reg·L_reg
L^D = L^D_adv
wherein L^G is the loss function of the generator, L^D is the loss function of the discriminator, L^G_adv is the adversarial loss function of the generator, L^D_adv is the adversarial loss function of the discriminator, L_pc is the mask pseudo-closed-loop loss function, L_const is the mask consistency loss function, L_reg is the mask regularization function, and the λ terms are hyper-parameters used to adjust the balance of the loss functions.
S35, the training ends when the set number of iterations is reached or the mask deformation network converges; otherwise, return to the step S32.
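A generic alternating-update sketch of this adversarial training procedure (the optimizer settings and the concrete form of the adversarial terms are not specified in the text and are assumptions here; g_loss_fn and d_loss_fn stand for the losses defined above):

    import torch

    def train_mask_network(G, D, data_loader, g_loss_fn, d_loss_fn, max_iters=100000):
        # Alternate discriminator and generator updates until convergence or max_iters (step S35).
        opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
        opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
        step = 0
        while step < max_iters:
            for batch in data_loader:                     # each batch holds sample pairs from step S31
                fake = G(batch)                           # target-domain generated masks (step S32)
                real = batch['target_real_masks']         # scaled target-domain real masks (step S33)
                opt_d.zero_grad()
                d_loss_fn(D, real, fake.detach()).backward()   # discriminator step (step S34)
                opt_d.step()
                opt_g.zero_grad()
                g_loss_fn(D, G, batch, fake).backward()   # generator step, incl. L_pc, L_const, L_reg
                opt_g.step()
                step += 1
                if step >= max_iters:
                    break
        return G, D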
Training the instance generation network comprises the following steps:
S41, inputting the image-mask pairs of at least one domain, including the target domain specified by the task to be translated;
S42, processing the image-mask pair of each input domain according to the following steps:
taking the image-mask pair of the input domain as the source-domain image-mask pair input to the image translation model; taking the label information and instance masks of the input domain as the target-domain label information and instance masks input to the image translation model; generating an image I'_i from the input by the image translation model, wherein i denotes the i-th input sample data corresponding to the generated image.
Taking the image of the input domain's image-mask pair as the real image I_i, a positive and negative sample pair is formed from the real image I_i and the generated image I'_i.
S43, inputting the positive and negative sample pairs obtained in the step S42 into a discriminator, and performing adversarial training on the instance generation network.
The loss function of the adversarial training is:
L^G = L^G_adv + λ_fmap·L_fmap
L^D = L^D_adv
wherein L^G is the loss function of the generator, L^D is the loss function of the discriminator, L^G_adv is the adversarial loss function of the generator, L^D_adv is the adversarial loss function of the discriminator, L_fmap is the fusion-map loss function, and λ_fmap is a hyper-parameter used to adjust the balance of the loss functions;
the adversarial loss function of the generator L^G_adv and the adversarial loss function of the discriminator L^D_adv are computed as: [formulas given as images in the original], wherein D_img denotes the discriminator used for adversarial training of the instance generation network, I'_i denotes the generated image produced from the image-mask pair of the i-th input domain, I_i denotes the image of the image-mask pair of the i-th input domain, and P is the number of sample data input in the step S41;
the fusion-map loss function L_fmap is computed as: [formula given as an image in the original], wherein M_i denotes the input-domain mask formed by aggregating the instance masks of the image-mask pair of the i-th domain, and α'_i denotes the position information indicating the foreground position in the fusion information generated by the generator based on the image-mask pair of the i-th domain.
The above fusion-map loss function L_fmap better guides the position information to change toward the correct mask.
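As with the other losses, the exact formula for L_fmap is given only as an image; one plausible form consistent with the description, penalizing the distance between the position map α'_i and the aggregated input-domain mask M_i, is (the L1 distance is an assumption):

    import torch.nn.functional as F

    def fusion_map_loss(alpha_maps, domain_masks):
        # alpha_maps: position maps from the generator; domain_masks: aggregated masks M_i.
        return sum(F.l1_loss(a, m) for a, m in zip(alpha_maps, domain_masks)) / len(alpha_maps)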
S44, the training ends when the set number of iterations is reached or the instance generation network converges; otherwise, return to the step S42.
Although the present invention has been described herein with reference to the preferred embodiments thereof, the embodiments described above are intended to be illustrative only and not to be limiting of the invention, it being understood that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.
The effect of the present invention is explained below by using the model of the above embodiment and combining with simulation experiments:
simulation experiment I:
the test conditions are as follows:
the system comprises: ubuntu 20.04, software: python 3.6, processor: intel (R) Xeon (R) CPU E5-2678 v3@2.50GHz × 2, memory: 256GB.
The experimental contents are as follows:
comparing the existing unsupervised image translation scheme with the technical scheme of the invention, and taking the source domain data, the image and the mask as well as the target domain label as input, on the premise of keeping background information, generating an image containing the foreground of the target domain.
A total of four dataset pairs were tested: sheep & giraffes, bottles & cups, oranges & bananas, and trousers & skirts; the results are shown in figure 3.
Analysis of experimental results:
As can be seen from fig. 3, compared with the previous scheme, InstaGAN, the method of the present invention introduces the mask information of the target domain as guidance in the image generation stage, so that the generated foreground instances better conform to the shape constraint of the mask, have a more reasonable visual effect, and complete the conversion from the source domain to the target domain well in terms of both shape and texture. The mask deformation network also achieves more reasonable cross-domain mask translation through the matching self-supervised learning method.
Simulation experiment II:
The effects of the invention are further illustrated by comparison with the prior-art method through the following simulation experiment:
the test conditions are as follows:
the system comprises: ubuntu 20.04, software: python 3.6, processor: intel (R) Xeon (R) CPU E5-2678 v3@2.50GHz x 2, memory: 256GB;
description of the test: the sheep and the giraffes are used as data sets, and the data sets used in the experiment are all in the form of image mask pairs, namely one picture corresponds to a plurality of foreground example masks. Due to the particularity of the task, the training data is sent into the network for training in the form of image mask pairs of two different domains. Specifically, the invention and the comparison algorithm are trained on training sets in the data set in turn. After training, the invention and the comparison algorithm are respectively used for carrying out generation test on the data set test set to obtain a generated picture result. The comparison algorithm is instaGAN.
In the experiment, the test set was randomly divided into batches, each batch comprising 2 image mask pairs drawn from the two domains respectively.
1) Examine the classification accuracy of the generated picture/foreground instance:
Testing is carried out on the test set to generate a target domain picture-mask set, and statistics are computed on the generated pictures in two ways: (1) with whole images as the unit, classifying the images using a pre-trained image classification model and counting the number of images correctly classified into the target domain; (2) with foreground instances as the unit, counting the number of foreground instances correctly classified into the target domain using a pre-trained instance classification model. From these two counts, an image classification score CS and an instance classification score CS (Masked) are calculated.
2) Examining the accuracy of correctly detected and identified generated foreground instances:
Predicted labels, scores and masks are obtained from each generated picture using a pre-trained Mask R-CNN as the detector. Three statistical analyses are then used to evaluate the quality of the generated foreground. (1) The number of masks detected in the generated picture set is counted and its ratio to the number of masks produced by the model is computed, giving the average matching ratio (MMR), which estimates, in terms of count, the probability that a generated mask is correctly recognized. (2) The predicted score represents the confidence of being classified into a particular domain, so the average object detection score (MODS) is obtained by averaging the scores of detections classified into the target domain. (3) The intersection-over-union between the predicted mask and the generated mask is computed to obtain the average valid IoU score (MVIS), which evaluates, from the perspective of how well the generated mask shape fits the predicted mask shape, whether the shape has been successfully translated to the target domain.
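As a rough illustration only (not part of the patent text), the three detection-based statistics described above could be computed from detector outputs roughly as follows; the function and variable names are illustrative assumptions, and the matching strategy is a simplification:

import numpy as np

def mask_iou(pred_mask, gen_mask):
    # Intersection-over-union of two binary masks of the same size.
    pred, gen = pred_mask.astype(bool), gen_mask.astype(bool)
    union = np.logical_or(pred, gen).sum()
    return np.logical_and(pred, gen).sum() / union if union > 0 else 0.0

def detection_metrics(detections, generated_masks, target_label):
    # detections: per-image lists of dicts {'label', 'score', 'mask'} from a
    # pre-trained detector such as Mask R-CNN; generated_masks: per-image lists
    # of binary masks produced by the translation model.
    n_detected, n_generated, scores, ious = 0, 0, [], []
    for dets, gens in zip(detections, generated_masks):
        n_generated += len(gens)
        target_dets = [d for d in dets if d['label'] == target_label]
        n_detected += len(target_dets)
        scores += [d['score'] for d in target_dets]
        for g in gens:
            if target_dets:
                # Match each generated mask to its best-overlapping detection.
                ious.append(max(mask_iou(d['mask'], g) for d in target_dets))
    mmr = n_detected / max(n_generated, 1)            # average matching ratio
    mods = float(np.mean(scores)) if scores else 0.0  # average object detection score
    mvis = float(np.mean(ious)) if ious else 0.0      # average valid IoU score
    return mmr, mods, mvis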
The results of the above experiments are shown in Tables 1 and 2. Analysis and comparison of Tables 1 and 2 show that the pictures generated by the present method are of higher quality, which verifies the effectiveness of the translation method and of the corresponding supervision data construction method.
TABLE 1
(Table 1: image classification score CS and instance classification score CS (Masked) for the compared methods; the table is reproduced only as an image in the publication.)
TABLE 2
(Table 2: average matching ratio MMR, average object detection score MODS and average valid IoU score MVIS for the compared methods; the table is reproduced only as an image in the publication.)

Claims (14)

1. A deformable instance-level image translation method is characterized by comprising the following steps:
c1, inputting an image mask pair of a source domain, label information and an example mask of a target domain into an image translation model, wherein the image mask pair comprises a group of example masks and corresponding images; the image translation model comprises a pre-trained image completion model and an example generation network;
C2, based on the source domain image mask pair input in step C1, first aggregating all instance masks of the source domain to obtain a source domain mask, then removing the foreground of the corresponding source domain image according to the source domain mask to obtain a residual image with the masked region removed, and completing the residual image using the image completion model to obtain a background image of the source domain;
C3, based on the target domain instance masks input in step C1, first aggregating all the instance masks of the target domain to obtain a target domain mask, and then inputting the target domain mask together with the source domain background image B_S obtained in step C2 into the instance generation network; the instance generation network includes an encoder and a generator and processes them as follows:
extracting background features of the source domain background image through an encoder based on the input source domain background image;
based on the input target domain mask, obtaining the edge information of the foreground through an edge extraction algorithm; obtaining background mask information by negating the target domain mask; performing feature encoding on the label information of the target domain; then fusing the edge information of the foreground, the background mask information and the feature encoding of the target domain label information to obtain a mixed mask;
inputting the background features and the mixed mask into the generator, wherein the generator comprises a decoding network that decodes the input background features; meanwhile, the generator extracts additional information from the mixed mask, applies the extracted additional information to the normalized decoded output, and performs an affine transformation on the normalized decoded output using this additional information, thereby obtaining fusion information comprising foreground information corresponding to the target domain mask and position information indicating the foreground position;
and finally, fusing the generated foreground information and the source domain background image by using the position information, and outputting a target domain image retaining the source domain background.
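For readability only (this is not part of the claims), the overall flow of claim 1 can be sketched in Python; every helper and model interface named here is an illustrative assumption, and the mixed mask is built by simple channel stacking rather than the matrix-multiplication fusion of claim 2:

import numpy as np
import cv2  # assumed available for Canny edge extraction

def aggregate(instance_masks):
    # Union of all binary instance masks into a single domain mask (H, W).
    return np.clip(np.sum(instance_masks, axis=0), 0, 1)

def translate(source_image, source_instance_masks, target_label_onehot,
              target_instance_masks, completion_model, encoder, generator):
    # C2: aggregate the source masks, erase the foreground, complete the background.
    source_mask = aggregate(source_instance_masks)
    background = completion_model(source_image * (1 - source_mask[..., None]), source_mask)

    # C3: aggregate the target masks and build the mixed mask from foreground
    # edges, the inverted (background) mask, and the target-domain label planes.
    target_mask = aggregate(target_instance_masks)
    edges = cv2.Canny((target_mask * 255).astype(np.uint8), 100, 200) / 255.0
    label_planes = np.tile(target_label_onehot[:, None, None], (1,) + target_mask.shape)
    mixed_mask = np.concatenate([edges[None], (1 - target_mask)[None], label_planes], axis=0)

    # Instance generation network: encode the background, then decode under
    # mixed-mask conditioning to get foreground content and a position map.
    features = encoder(background)
    foreground, position = generator(features, mixed_mask)

    # Fuse the generated foreground onto the source background using the position map.
    return position[..., None] * foreground + (1 - position[..., None]) * background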
2. A method for deformable instance-level image translation as claimed in claim 1, characterized by:
the image completion model is a HiFill model; the encoder of the instance generation network is a multilayer residual neural network; the edge extraction algorithm is the Canny edge detection algorithm; the label information of the target domain is feature-encoded by one-hot encoding; and the edge information of the foreground, the background mask information and the feature encoding of the target domain label information are fused through matrix multiplication.
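One possible reading of the matrix-multiplication fusion in claim 2 is an outer product that broadcasts the one-hot label over the spatial maps; the sketch below follows that reading and is an assumption, not the claimed implementation:

import numpy as np

def one_hot(label_index, num_domains):
    v = np.zeros(num_domains, dtype=np.float32)
    v[label_index] = 1.0
    return v

def mixed_mask(foreground_edges, background_mask, label_index, num_domains):
    # foreground_edges, background_mask: (H, W) arrays in [0, 1].
    label = one_hot(label_index, num_domains)                        # (C,)
    spatial = np.stack([foreground_edges, background_mask], axis=0)  # (2, H, W)
    # Outer product of the label vector with each spatial map: every map is
    # replicated into label-specific channels (zero for non-target domains).
    fused = np.einsum('c,khw->ckhw', label, spatial)
    return fused.reshape(-1, *foreground_edges.shape)                # (2C, H, W)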
3. A method for deformable instance-level image translation as claimed in claim 1, characterized by:
the generator of the instance generation network is an APADE-ResNet network, which is a neural network formed by adding an APADE block between each convolution layer and the ReLU layer on the basis of a ResNet network; that is, the decoding network of the generator is the ResNet network, and the additional information is extracted from the mixed mask through the APADE blocks;
the APADE block comprises two paths of inputs, wherein one path is the output of the convolutional layer, and the other path is a mixed mask;
the mixed mask is scaled to the input feature dimension of the APADE block and input into the APADE block, where it passes through a convolution layer Conv_s; the output of Conv_s is then fed into two independent convolution layers Conv_1 and Conv_2, generating two embedding vectors γ and β;
after the output of the convolution layer is input into the APADE block, it first passes through a batch normalization layer, and the output of the batch normalization layer is then combined with the embedding vectors γ and β according to the following formula to give the output of the APADE block:
F_out = γ · F_in + β
wherein F_in denotes the output of the batch normalization layer and F_out denotes the output of the APADE block.
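A minimal PyTorch sketch of an APADE-style block matching the description above; the interpolation mode and the hidden width of Conv_s are free choices assumed here, not taken from the patent:

import torch
import torch.nn as nn
import torch.nn.functional as F

class APADEBlock(nn.Module):
    def __init__(self, feature_channels, mask_channels, hidden=128):
        super().__init__()
        # Parameter-free batch normalization of the incoming convolution output.
        self.bn = nn.BatchNorm2d(feature_channels, affine=False)
        # Conv_s processes the resized mixed mask; Conv_1 and Conv_2 produce gamma and beta.
        self.conv_s = nn.Conv2d(mask_channels, hidden, kernel_size=3, padding=1)
        self.conv_gamma = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)
        self.conv_beta = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, features, mixed_mask):
        # Scale the mixed mask to the spatial size of the incoming features.
        mask = F.interpolate(mixed_mask, size=features.shape[2:], mode='nearest')
        shared = F.relu(self.conv_s(mask))
        gamma, beta = self.conv_gamma(shared), self.conv_beta(shared)
        # F_out = gamma * F_in + beta: affine modulation of the normalized features.
        return gamma * self.bn(features) + beta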
4. A method for deformable instance-level image translation as claimed in claim 3, characterized by:
the APADE-ResNet network is a neural network formed by adding APADE blocks on the basis of a two-layer ResNet network, namely:
each layer of the APADE-ResNet network comprises a forward neural network branch and a shortcut branch, wherein the forward neural network branch comprises a first convolution layer, a first APADE block, a first ReLU layer, a second convolution layer, a second APADE block and a second ReLU layer connected in series in that order, and the shortcut branch builds a skip connection from the input to the point after the second APADE block, adding the input to the output of the second APADE block to form the input of the second ReLU layer.
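Continuing the sketch above, one APADE-ResNet layer following claim 4 might look like this; the kernel sizes and the assumption of equal input/output channel counts are illustrative choices:

class APADEResNetLayer(nn.Module):
    def __init__(self, channels, mask_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.apade1 = APADEBlock(channels, mask_channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.apade2 = APADEBlock(channels, mask_channels)

    def forward(self, x, mixed_mask):
        # Forward branch: Conv -> APADE -> ReLU -> Conv -> APADE.
        h = F.relu(self.apade1(self.conv1(x), mixed_mask))
        h = self.apade2(self.conv2(h), mixed_mask)
        # Shortcut branch: add the layer input before the second ReLU.
        return F.relu(h + x)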
5. A method for deformable instance-level image translation, according to any of claims 3 or 4, characterized by:
the APADE-ResNet network is a multilayer APADE-ResNet network; the input of the first layer is the background features and the mixed mask, and the input of each layer after the first is the output of the previous layer together with the mixed mask; the layers are enlarged by upsampling so that the final output is the same size as the source domain image.
6. A method for deformable instance-level image translation according to any of claims 3 or 4, characterized in that said instance generation network is trained according to the following steps:
D1, training sample data preparation:
acquiring data from a data set, defining domains according to the types of scenes, and constructing an image mask pair of each domain, wherein the image mask pair comprises a group of example masks and corresponding images;
D2, inputting sample data and training the instance generation network, wherein the training comprises the following steps:
D21, inputting an image mask pair of at least one domain, including the target domain specified by the task to be translated;
D22, processing the image mask pair of each input domain according to the following steps:
taking the image mask pair of the input domain as the source domain image mask pair input to the image translation model; taking the label information and the instance masks of the input domain as the target domain label information and instance masks input to the image translation model; the image translation model generates an image I'_i from this input, wherein the subscript i indicates that the generated image corresponds to the i-th input sample data;
taking the image in the input domain's image mask pair as the real image I_i, and forming a positive and negative sample pair from the real image I_i and the generated image I'_i;
D23, inputting the positive and negative sample pairs obtained in step D22 into a discriminator, and performing adversarial training on the instance generation network;
D24, ending the training when the set number of iterations is reached or the instance generation network converges; otherwise, returning to step D22.
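A schematic adversarial training loop for the instance generation network following steps D21 to D24; the optimizer settings, the least-squares GAN loss, and the L1 form of the fusion graph loss are all assumptions made for illustration:

import torch

def train_instance_generation(model, discriminator, dataloader, iterations, lam=1.0):
    g_opt = torch.optim.Adam(model.parameters(), lr=2e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    for step, (real, masks, label) in zip(range(iterations), dataloader):
        # D22: each input domain's image mask pair serves both as the source pair
        # and as the target-domain supervision; the model generates I'_i from it.
        fake, position = model(real, masks, label)      # position: (B, H, W) map

        # D23: discriminator update on the real / generated sample pair.
        d_loss = ((discriminator(real) - 1) ** 2).mean() + (discriminator(fake.detach()) ** 2).mean()
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator update: adversarial term plus the fusion graph loss L_fmap.
        aggregated_mask = masks.sum(dim=1).clamp(0, 1)  # (B, H, W) input-domain mask M_i
        g_loss = ((discriminator(fake) - 1) ** 2).mean() + lam * (position - aggregated_mask).abs().mean()
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()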
7. A method for deformable instance-level image translation as claimed in claim 6, characterized in that in said step D23, the loss function for the adversarial training is:
L_G = L_adv^G + λ · L_fmap
L_D = L_adv^D
wherein L_G is the loss function of the generator, L_D is the loss function of the discriminator, L_adv^G is the adversarial loss function of the generator, L_adv^D is the adversarial loss function of the discriminator, L_fmap is the fusion graph loss function, and λ is a hyper-parameter used to adjust the balance between the loss terms;
the adversarial loss function L_adv^G of the generator is calculated as follows:
L_adv^G = (1/P) · Σ_{i=1..P} ( D_img(I'_i) - 1 )²
the adversarial loss function L_adv^D of the discriminator is calculated as follows:
L_adv^D = (1/P) · Σ_{i=1..P} [ ( D_img(I_i) - 1 )² + ( D_img(I'_i) )² ]
wherein D_img denotes the discriminator used for the adversarial training of the instance generation network; I'_i denotes the generated image produced from the input image mask pair of the i-th domain; I_i denotes the image of the input image mask pair of the i-th domain; and P is the number of sample data input in step D21;
the fusion graph loss function L_fmap is calculated as follows:
L_fmap = (1/P) · Σ_{i=1..P} || M_i - α'_i ||_1
wherein M_i denotes the input-domain mask formed by aggregating the instance masks of the image mask pair of the i-th domain, and α'_i denotes the position information indicating the foreground position in the fusion information generated by the generator from the image mask pair of the i-th domain.
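The claim 7 losses, under the same least-squares reading used in the training sketch above (an assumption, since the publication gives the exact formulas only as images), can be written as standalone functions:

import torch

def generator_loss(d_img, fake_images, positions, input_masks, lam):
    # Adversarial term: generated images should be scored as real by D_img.
    adv = ((d_img(fake_images) - 1) ** 2).mean()
    # Fusion graph term L_fmap: the position map should match the aggregated input-domain mask.
    fmap = (positions - input_masks).abs().mean()
    return adv + lam * fmap

def discriminator_loss(d_img, real_images, fake_images):
    return ((d_img(real_images) - 1) ** 2).mean() + (d_img(fake_images.detach()) ** 2).mean()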
8. A deformable instance-level image translation method is characterized by comprising the following steps of:
A. Mask deformation
A1, inputting the instance masks of a source domain and a target domain label into a pre-trained mask deformation network; the mask deformation network includes an encoder and a generator;
A2, the mask deformation network deforms the masks according to the following steps:
A21, aggregating all the instance masks of the source domain to obtain a source domain mask; performing feature extraction on the source domain mask through the encoder to obtain the overall feature F_img of the source domain mask; fusing each instance mask m^S_(i) of the source domain with the overall feature F_img respectively to obtain the instance mask feature F_mask(i) corresponding to each instance mask; then performing feature encoding on the label information of the target domain, and embedding the feature encoding of the target domain label information into each instance mask feature F_mask(i) respectively;
A22, inputting each instance mask feature fused with the label information feature encoding into the generator respectively, and taking the target domain generation mask m'^T_(i) finally output by the generator as the instance mask of the corresponding target domain.
B. Image generation
generating a target domain image retaining the source domain background according to the deformable instance-level image translation method of any one of claims 1 to 7, taking as input the image mask pair of the source domain, the label information of the target domain, and the target domain instance masks obtained in step A.
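An illustrative forward pass for the mask deformation stage of step A; the fusion follows the element-wise (point multiplication) reading of claim 9, the label embedding uses claim 9's convolutional-encoding alternative, and all interfaces are assumptions:

import torch
import torch.nn.functional as F

def deform_masks(instance_masks, target_label_onehot, encoder, label_encoder, generator):
    # instance_masks: (N, 1, H, W) binary source-domain instance masks;
    # target_label_onehot: (C,) one-hot target-domain label.
    # A21: aggregate the instance masks and extract the overall feature F_img.
    source_mask = instance_masks.sum(dim=0, keepdim=True).clamp(0, 1)   # (1, 1, H, W)
    f_img = encoder(source_mask)                                        # (1, C_f, h, w)

    # Encode the target-domain label once (convolutional alternative of claim 9).
    label_code = label_encoder(target_label_onehot.view(1, -1, 1, 1))
    label_code = label_code.expand(-1, -1, *f_img.shape[2:])

    deformed = []
    for m in instance_masks:
        # Fuse each instance mask with F_img by element-wise (point) multiplication.
        m_small = F.interpolate(m.unsqueeze(0), size=f_img.shape[2:], mode='nearest')
        f_mask = f_img * m_small
        # Splice the label encoding onto the fused instance feature and decode.
        deformed.append(generator(torch.cat([f_mask, label_code], dim=1)))
    # A22: the generator's final outputs serve as target-domain instance masks.
    return torch.cat(deformed, dim=0)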
9. A method for deformable instance-level image translation as claimed in claim 8, characterized by:
the encoder of the mask deformation network is a multilayer convolutional neural network; the instance masks m^S_(i) of the source domain are fused with the overall feature F_img by matrix point multiplication;
the label information of the target domain is feature-encoded by one-hot encoding, and the instance mask feature F_mask(i) is fused with the feature encoding of the target domain label information by matrix multiplication; alternatively, the label information of the target domain is encoded by a convolutional neural network, and the instance mask feature F_mask(i) is then concatenated with the feature encoding of the target domain label information.
10. A method for deformable instance-level image translation as claimed in claim 8, characterized by:
the generator of the mask deformation network comprises a multilayer residual neural network and a multilayer convolutional neural network; the multilayer residual neural network first scales the input instance mask feature F_mask(i) fused with the label information so that it matches the input dimensionality of the multilayer convolutional neural network, and the multilayer convolutional neural network then decodes it to generate the target domain generation mask; the layers of the multilayer convolutional neural network are enlarged by upsampling so that the final output is the same size as the source domain image.
11. A method for deformable instance-level image translation according to any of claims 8, 9 or 10, characterized in that said mask deformation network is trained by the following steps:
B1, training sample data preparation:
acquiring masks from a dataset, defining domains according to the categories to which the foreground belongs, and constructing sample pairs from the constructed domains by pairwise combination, wherein each sample pair comprises two domains, one serving as the source domain and the other as the target domain, and over all the sample pairs each constructed domain serves as the target domain at least once;
B2, training the mask deformation network:
B21, inputting at least one sample pair comprising the source domain and target domain specified by the task to be translated;
B22, for each input sample pair, processing by the mask deformation network according to the following steps:
randomly sampling a set number of instance masks from the instance masks of the source domain and from the instance masks of the target domain of the sample pair respectively; pairing the sampled instance masks of the source domain with the sampled instance masks of the target domain one to one, i.e., the sampled source domain instance mask m_{i,j}^S corresponds to the target domain instance mask m_{i,j}^T, where the subscript i denotes the i-th sample pair and ranges from 1 to P, P being the number of input sample pairs; j denotes the j-th mask of the corresponding domain and ranges from 1 to Q, Q being the set sampling number; the superscript T denotes the target domain and S denotes the source domain;
inputting the target domain label and the sampled instance masks of the source domain into the mask deformation network, and generating the target domain generation masks m'_{i,j}^T corresponding to the respective instance masks of the source domain; based on the source domain instance masks m_{i,j}^S, the target domain instance masks m_{i,j}^T and the target domain generation masks m'_{i,j}^T, constructing triplets (m_{i,j}^S, m_{i,j}^T, m'_{i,j}^T), each consisting of the corresponding source domain instance mask, target domain instance mask and target domain generation mask;
B23, for each triplet obtained in step B22, scaling its target domain instance mask m_{i,j}^T so that its size matches the corresponding target domain generation mask m'_{i,j}^T, and using the result as the target domain real mask; forming a positive and negative sample pair from the corresponding target domain generation mask and the target domain real mask;
B24, inputting the positive and negative sample pairs obtained in step B23 into a discriminator, and performing adversarial training on the mask deformation network;
and B25, finishing the training when the set iteration number or the mask deformation network convergence is reached, otherwise, returning to the step B22.
12. A method for deformable instance-level image translation as claimed in claim 11, characterized by:
in step B22, after pairing the sampled instance masks of the source domain with the instance masks of the target domain, the centre positions of the paired instance masks are aligned.
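A sketch of the triplet construction of steps B22 and B23, including the centre alignment of claim 12; the random pairing and the centroid-shift alignment used here are assumptions:

import random
import torch
import torch.nn.functional as F

def center_align(mask_a, mask_b):
    # Shift mask_b so that its foreground centroid coincides with that of mask_a.
    ya, xa = torch.nonzero(mask_a, as_tuple=True)
    yb, xb = torch.nonzero(mask_b, as_tuple=True)
    dy = int(ya.float().mean() - yb.float().mean())
    dx = int(xa.float().mean() - xb.float().mean())
    return torch.roll(mask_b, shifts=(dy, dx), dims=(0, 1))

def build_triplets(source_masks, target_masks, target_label, deform_net, q):
    src = random.sample(source_masks, q)   # sampled source-domain instance masks
    tgt = random.sample(target_masks, q)   # sampled target-domain instance masks
    triplets = []
    for m_s, m_t in zip(src, tgt):
        m_t = center_align(m_s, m_t)                    # claim 12: align centre positions
        m_gen = deform_net(m_s, target_label)           # target-domain generation mask
        # B23: scale the real target mask to the generated mask's size.
        m_real = F.interpolate(m_t[None, None].float(), size=m_gen.shape[-2:], mode='nearest')[0, 0]
        triplets.append((m_s, m_real, m_gen))
    return triplets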
13. A method for deformable instance-level image translation as claimed in claim 11, characterized by:
the generator of the mask deformation network is a multilayer network; in step B22, for each input instance mask feature fused with the label information feature encoding, when the multilayer network of the generator decodes, the final K layers of the generator output layer by layer the target domain generation masks m'_{i,j}^{T(n)} of different sizes; that is, the target domain generation mask is given by the target domain generation mask sequence {m'_{i,j}^{T(1)}, ..., m'_{i,j}^{T(K)}} constructed from these outputs, where K is the number of network layers selected for layer-by-layer output in the multilayer network;
in step B23, for each triplet, its target domain instance mask m_{i,j}^T is scaled to obtain target domain real masks m_{i,j}^{T(n)} whose sizes match those of the target domain generation mask sequence m'_{i,j}^{T(n)}, and the corresponding target domain generation mask m'_{i,j}^{T(n)} and target domain real mask m_{i,j}^{T(n)} constitute an adversarial sample pair, wherein n is the index within the mask sequence.
14. A method for deformable instance-level image translation as claimed in claim 13, characterized by:
in step B24, the loss function of the adversarial training is:
L_G = L_adv^G + λ_1 · L_pc + λ_2 · L_const + λ_3 · L_reg
L_D = L_adv^D
wherein L_G is the loss function of the generator, L_D is the loss function of the discriminator, L_adv^G is the adversarial loss function of the generator, L_adv^D is the adversarial loss function of the discriminator, L_pc is the mask pseudo closed-loop loss function, L_const is the mask consistency loss function, L_reg is the mask regularization function, and λ_1, λ_2, λ_3 are hyper-parameters used to adjust the balance between the loss terms;
the adversarial loss function L_adv^G of the generator is calculated as follows:
L_adv^G = (1/(P·Q·K)) · Σ_{i=1..P} Σ_{j=1..Q} Σ_{n=1..K} ( D_mask(m'_{i,j}^{T(n)}) - 1 )²
the adversarial loss function L_adv^D of the discriminator is calculated as follows:
L_adv^D = (1/(P·Q·K)) · Σ_{i=1..P} Σ_{j=1..Q} Σ_{n=1..K} [ ( D_mask(m_{i,j}^{T(n)}) - 1 )² + ( D_mask(m'_{i,j}^{T(n)}) )² ]
wherein D_mask denotes the discriminator used for the adversarial training of the mask deformation network; m'_{i,j}^{T(n)} denotes the target domain generation mask output by the n-th output layer for the j-th instance mask of the source domain in the i-th sample pair; and m_{i,j}^{T(n)} denotes the n-th target domain real mask obtained by scaling the target domain instance mask corresponding to the j-th instance mask of the source domain in the i-th sample pair;
the mask pseudo closed-loop loss function L_pc is calculated as follows:
L_pc = (1/(P·Q)) · Σ_{i=1..P} Σ_{j=1..Q} || m_{i,j}^S - G_mask(m_{i,j}^S, a_i^S) ||_1
wherein m_{i,j}^S denotes the source domain instance masks of the i-th sample pair, a_i^S denotes the label information of the source domain for the i-th sample pair, and G_mask(m_{i,j}^S, a_i^S) denotes the source domain instance mask reconstructed by the generator G_mask of the mask deformation network from the source domain instance masks and the source domain label information of the i-th sample pair;
the mask consistency loss function L_const is calculated as follows:
L_const = (1/(P·Q)) · Σ_{i=1..P} Σ_{j=1..Q} Σ_{n=1..K-1} || d( m'_{i,j}^{T(n+1)} ) - m'_{i,j}^{T(n)} ||_1
wherein d(·) denotes a down-sampling function;
the mask regularization function L_reg is defined on the final target domain generation mask m'_{i,j}^{T(K)}, wherein m'_{i,j}^{T(K)} denotes the K-th target domain generation mask corresponding to the j-th instance mask of the source domain of the i-th sample pair, the K-th output being the final output of the generator, and sum(·) denotes the numerical summation over all pixel points of the input mask.
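Finally, a hedged sketch of the three auxiliary losses of claim 14. The pseudo closed-loop loss is read here as a same-domain identity reconstruction, the consistency loss as agreement between neighbouring scales, and the regularization as a penalty on degenerate (empty) masks; all three readings are assumptions about formulas that the publication reproduces only as images:

import torch
import torch.nn.functional as F

def pseudo_closed_loop_loss(g_mask, source_masks, source_label):
    # Translating a source mask to its own domain should reproduce it.
    recon = g_mask(source_masks, source_label)
    return (recon - source_masks).abs().mean()

def consistency_loss(mask_sequence):
    # Down-sampled larger-scale outputs should agree with the smaller-scale outputs.
    loss = 0.0
    for small, large in zip(mask_sequence[:-1], mask_sequence[1:]):
        resized = F.interpolate(large, size=small.shape[-2:], mode='bilinear', align_corners=False)
        loss = loss + (resized - small).abs().mean()
    return loss

def regularization_loss(final_masks, eps=1e-6):
    # Discourage generated masks whose pixel sum collapses toward zero.
    return (1.0 / (final_masks.flatten(1).sum(dim=1) + eps)).mean()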
CN202210987590.6A 2022-08-17 2022-08-17 Deformable instance-level image translation method Pending CN115424109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210987590.6A CN115424109A (en) 2022-08-17 2022-08-17 Deformable instance-level image translation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210987590.6A CN115424109A (en) 2022-08-17 2022-08-17 Deformable instance-level image translation method

Publications (1)

Publication Number Publication Date
CN115424109A true CN115424109A (en) 2022-12-02

Family

ID=84197582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210987590.6A Pending CN115424109A (en) 2022-08-17 2022-08-17 Deformable instance-level image translation method

Country Status (1)

Country Link
CN (1) CN115424109A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024160178A1 (en) * 2023-01-30 2024-08-08 厦门美图之家科技有限公司 Image translation model training method, image translation method, device, and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination