CN116524290A - Image synthesis method based on countermeasure generation network - Google Patents
Image synthesis method based on countermeasure generation network
- Publication number: CN116524290A
- Application number: CN202310236028.4A
- Authority: CN (China)
- Prior art keywords: size, feature, map, foreground, multiplied
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an image synthesis method based on a countermeasure generation network (generative adversarial network), which comprises the following steps: S1, acquiring a first data set and a second data set, wherein each sample in the first data set comprises a first splicing unit, a second splicing unit and a third splicing unit, each sample in the second data set comprises a fourth splicing unit, a fifth splicing unit and transformation parameters corresponding to the fifth splicing unit, and all splicing units have the same size; S2, building and training a countermeasure generation network model; S3, inputting the first splicing unit and the second splicing unit to be synthesized into the trained countermeasure generation network model, and taking the first synthesis unit correspondingly output on the unsupervised path as the image synthesis prediction result. The method improves the rationality and diversity of the synthesized image, can obtain a more realistic and natural composite image, and can adapt to complex application scenarios.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image synthesis method based on a countermeasure generation network (generative adversarial network).
Background
Image synthesis includes four research directions: object placement, image blending, image harmonization, and shadow generation. For the object placement problem, typical geometric inconsistencies include, but are not limited to: 1) the foreground object is too large or too small; 2) the foreground object has no physical support, such as being suspended in mid-air; 3) the foreground object appears where it is semantically inappropriate, such as a ship appearing inland; 4) there is an unreasonable occlusion relationship between the foreground object and surrounding objects; 5) the perspective angles of the foreground and the background are inconsistent. In summary, the size, position and shape of the foreground object are unreasonable. Object placement and spatial transformation aim to find a reasonable size, position and shape for the foreground, avoiding the unreasonable factors mentioned above. Object placement generally involves only translation and scaling of the foreground object, whereas spatial transformation involves relatively complex geometric deformations, such as affine or perspective transformations. For convenience of description, object placement is used below to refer to any geometric transformation.
Learning to place foreground objects on a background scene often arises in applications such as image editing and scene parsing: a model takes an initial composite image and a foreground mask as input and outputs an adjusted, more realistic and natural composite image. To date, most related studies suffer from several problems, including insufficient use of the interactive feature relationship between foreground objects and the scene and little prior knowledge involved in model training, resulting in unreasonable foreground object positions in the synthesized result. For example, most works only consider pasting a single foreground object onto another background picture and assume that the foreground object is complete; however, in real-world applications it is often necessary to synthesize multiple foreground objects onto the same background picture, and a foreground object may be incomplete. In addition, the prior art generally outputs only a single location for the same input. Yet for a given foreground object and background image, the foreground object usually has many reasonable positions on the background image; for example, a vase can be placed on a table in a background image at countless reasonable sizes and positions. Therefore, in order to obtain more reasonable and diversified synthesis results, the image synthesis algorithm needs to be improved so that it can adapt to complex application scenarios.
Disclosure of Invention
In view of the above problems, the invention provides an image synthesis method based on a countermeasure generation network, which improves the rationality and diversity of the synthesized image, can obtain a more realistic and natural composite image, and can adapt to complex application scenarios.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the invention provides an image synthesis method based on a countermeasure generation network, which comprises the following steps:
S1, acquiring a first data set and a second data set, wherein each sample in the first data set comprises a first splicing unit [I_fg, M_fg], a second splicing unit [I_bg, M_bg] and a third splicing unit [I_gt, M_gt], each sample in the second data set comprises a fourth splicing unit [I_c^-, M_c^-], a fifth splicing unit [I_c^+, M_c^+] and the transformation parameters t_gt = (t_r^gt, t_x^gt, t_y^gt) corresponding to the fifth splicing unit, and all splicing units have the same size, wherein I_fg is the foreground map, M_fg is the foreground map mask, I_bg is the background map, M_bg is the background map mask, I_gt is the positive label map, M_gt is the positive label map mask, I_c^- denotes the negative label composite map, M_c^- denotes the negative label composite map mask, I_c^+ denotes the positive label composite map, M_c^+ denotes the positive label composite map mask, t_r^gt denotes the scaling rate of the foreground object corresponding to the fifth splicing unit, t_x^gt denotes the x-axis coordinate of the corresponding foreground position on the background map, and t_y^gt denotes the y-axis coordinate of the corresponding foreground position on the background map;
s2, building and training a countermeasure generation network model, wherein the countermeasure generation network model comprises a generator, a discriminator and a priori knowledge extractor, the generator comprises a preliminary feature extractor, a multi-scale feature aggregation module, a joint attention module, a Concat function and a regression block, the multi-scale feature aggregation module comprises two parallel first feature extraction units, the first feature extraction units comprise a multi-scale encoder and a feature aggregator which are sequentially connected, the priori knowledge extractor comprises a global feature extractor and an automatic encoder which are sequentially connected, and the training process is as follows:
S21, inputting the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] and the third splicing unit [I_gt, M_gt] of each sample in the first data set into the preliminary feature extractor respectively, correspondingly obtaining a first basic feature map F_fg, a second basic feature map F_bg and a third basic feature map F_gt;
S22, inputting the first basic feature map F_fg and the second basic feature map F_bg one-to-one into the first feature extraction units of the multi-scale feature aggregation module, correspondingly obtaining a first multi-scale feature map P_fg and a second multi-scale feature map P_bg, and inputting the first multi-scale feature map P_fg and the second multi-scale feature map P_bg into the joint attention module to obtain the global interaction feature Z;
S23, inputting the third basic feature map F_gt into the global feature extractor of the priori knowledge extractor to obtain a first extracted feature, and encoding the first extracted feature into a prior vector Z_p through the automatic encoder;
S24, fusing a random variable Z_u and the prior vector Z_p respectively with the global interaction feature Z through the Concat function, correspondingly obtaining a first splicing vector Z_i, which forms an unsupervised path, and a second splicing vector Z_j, which forms a self-supervised path;
S25, inputting the first splicing vector Z_i and the second splicing vector Z_j respectively into the regression block, correspondingly predicting a first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and a second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s), wherein t_r^u denotes the scaling rate of the foreground object under the unsupervised path, t_x^u denotes the x-axis coordinate of the foreground position on the background map under the unsupervised path, t_y^u denotes the y-axis coordinate of the foreground position on the background map under the unsupervised path, t_r^s denotes the scaling rate of the foreground object under the self-supervised path, t_x^s denotes the x-axis coordinate of the foreground position on the background map under the self-supervised path, and t_y^s denotes the y-axis coordinate of the foreground position on the background map under the self-supervised path;
S26, performing affine transformation on the input foreground map and foreground map mask according to the first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and the second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s) respectively, correspondingly obtaining a first affine transformation result and a second affine transformation result, wherein the first affine transformation result comprises the affine-transformed foreground map I_fg^u and foreground map mask M_fg^u under the unsupervised path, and the second affine transformation result comprises the affine-transformed foreground map I_fg^s and foreground map mask M_fg^s under the self-supervised path;
S27, synthesizing the first affine transformation result and the second affine transformation result with the background map respectively, correspondingly obtaining a first composite map I_c^u and a second composite map I_c^s by pasting the transformed foreground onto the background with the transformed mask, namely I_c^u = I_fg^u ⊙ M_fg^u + I_bg ⊙ (1 − M_fg^u) and I_c^s = I_fg^s ⊙ M_fg^s + I_bg ⊙ (1 − M_fg^s), where ⊙ denotes element-wise multiplication;
S28, inputting the first synthesis unit [I_c^u, M_fg^u] and the second synthesis unit [I_c^s, M_fg^s] into the discriminator, and also training the discriminator through the second data set;
s29, calculating joint loss to update network parameters, and obtaining a trained countermeasure generation network model;
S3, inputting the first splicing unit [I_fg, M_fg] and the second splicing unit [I_bg, M_bg] to be synthesized into the trained countermeasure generation network model, and taking the first synthesis unit [I_c^u, M_fg^u] correspondingly output on the unsupervised path as the image synthesis prediction result.
Preferably, the preliminary feature extractor employs a VGG16 network model.
Preferably, the preliminary feature extractor performs the following operations:
F_1 = MaxPool(Conv64(Conv64(Input1))), with size H/2 × W/2 × 64;
F_2 = MaxPool(Conv128(Conv128(F_1))), with size H/4 × W/4 × 128;
F_3 = MaxPool(Conv256(Conv256(Conv256(F_2)))), with size H/8 × W/8 × 256;
F_fg = MaxPool(Conv512(Conv512(Conv512(F_3)))), with size H/16 × W/16 × 512;
wherein MaxPool denotes the max pooling operation, ConvX denotes a convolution operation with X output channels, and Input1 denotes the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] or the third splicing unit [I_gt, M_gt]; each splicing unit has height H, width W and 4 channels.
Preferably, the multi-scale encoder performs the following operations:
P_1 = ReLU(BatchNorm(Conv512(Input2))), with size H_1 × W_1 × 512;
P_2 = ReLU(BatchNorm(Conv512(P_1))), with size H_1 × W_1 × 512;
P_3 = ReLU(BatchNorm(ConvC(P_2))), with size H_1 × W_1 × C;
P_S1 = AdaptiveAvgPool{S_1×S_1}(P_3), with size S_1 × S_1 × C;
P_S2 = AdaptiveAvgPool{S_2×S_2}(P_3), with size S_2 × S_2 × C;
P_S3 = AdaptiveAvgPool{S_3×S_3}(P_3), with size S_3 × S_3 × C;
wherein ReLU denotes the ReLU activation function, BatchNorm denotes the normalization operation, ConvX denotes a convolution operation with X output channels, AdaptiveAvgPool{n×n} adaptively pools the height × width of the corresponding input feature map to n × n, S_1, S_2 and S_3 are a first, a second and a third preset size in turn, and Input2 denotes the first basic feature map F_fg or the second basic feature map F_bg; each basic feature map has height H_1, width W_1 and C channels, with H/16 = H_1 and W/16 = W_1, where H and W are the height and width of each splicing unit;
the feature aggregator performs the following operations:
P_S1 = Reshape(P_S1), with size 1 × (S_1·S_1) × C;
P_S2 = Reshape(P_S2), with size 1 × (S_2·S_2) × C;
P_S3 = Reshape(P_S3), with size 1 × (S_3·S_3) × C;
P_g = Concat(Concat(P_S1, P_S2), P_S3), with size 1 × (S_1·S_1 + S_2·S_2 + S_3·S_3) × C;
wherein Reshape denotes the Reshape function (applied here to each pooled map P_S1, P_S2, P_S3 rather than to P_3, so that the stated sizes hold), Concat denotes the Concat function, and P_g denotes the output feature of the first feature extraction unit corresponding to Input2, namely the first multi-scale feature map P_fg or the second multi-scale feature map P_bg.
Preferably, the global feature extractor performs the following operations:
Z_1 = ReLU(BatchNorm(Conv512(F_gt))), with size H_1 × W_1 × 512;
Z_2 = ReLU(BatchNorm(Conv512(Z_1))), with size H_1 × W_1 × 512;
Z_3 = ReLU(BatchNorm(ConvC(Z_2))), with size H_1 × W_1 × C;
Z_4 = AdaptiveAvgPool{1×1}(Z_3), with size 1 × 1 × C;
the automatic encoder performs the following operations:
h = ReLU(FC1024(Z_4)), with size 1 × 1 × 1024;
mu = FC512(h), with size 1 × 1 × 512;
logvar = FC512(h), with size 1 × 1 × 512;
Z_p = mu + ε · e^(logvar/2), with size 1 × 1 × 512, where ε is a standard normal random sample (the usual VAE reparameterization);
wherein FCY denotes a fully connected layer that maps the channel number of the corresponding input to Y.
Preferably, the joint attention module performs the following operations:
Q_fg = ConvC/8(P_fg), with size H × W × C/8;
K_fg = ConvC/4(P_fg), with size H × W × C/4;
V_fg = ConvC(P_fg), with size H × W × C;
Q_bg = ConvC/8(P_bg), with size H × W × C/8;
K_bg = ConvC/4(P_bg), with size H × W × C/4;
V_bg = ConvC(P_bg), with size H × W × C;
the first and second dimensions of Q_fg, K_fg, V_fg, Q_bg, K_bg and V_bg are respectively merged through the Reshape function, correspondingly obtaining Q'_fg, K'_fg, V'_fg, Q'_bg, K'_bg and V'_bg in turn, with the following sizes:
Q'_fg, with size HW × C/8;
K'_fg, with size HW × C/4;
V'_fg, with size HW × C;
Q'_bg, with size HW × C/8;
K'_bg, with size HW × C/4;
V'_bg, with size HW × C;
Q'_fg and Q'_bg are concatenated along the third dimension as follows:
Q_cat = Concat(Q'_fg, Q'_bg), with size HW × C/4;
Q_cat is used for attention computation to obtain X_fg and X_bg, expressed as follows:
X_fg = Softmax(Q_cat · K'_fg^T) · V'_fg + P_fg, with size HW × C;
X_bg = Softmax(Q_cat · K'_bg^T) · V'_bg + P_bg, with size HW × C;
Z = AdaptiveAvgPool{1×1}(Conv512(Concat(X_fg, X_bg))), with size 1 × 1 × C;
where ConvX denotes a convolution operation with X output channels (so that V_fg and V_bg keep C channels, matching the stated sizes).
Preferably, the regression block performs the following operations:
t_u = FC3(FC1024(ReLU(FC1024(Z_i))));
t_s = FC3(FC1024(ReLU(FC1024(Z_j))));
wherein FCY is a fully connected layer that maps the number of channels of the corresponding input image to Y, and ReLU represents a ReLU activation function.
Preferably, the affine transformation uses a Spatial Transformer Network (STN).
Preferably, the discriminator performs the following operations:
R = Sigmoid(Conv1(LeakyReLU(Conv512(LeakyReLU(Conv256(LeakyReLU(Conv128(LeakyReLU(Conv64(Input))))))))))
wherein Sigmoid denotes the Sigmoid activation function, ConvX denotes a convolution operation with X output channels, LeakyReLU denotes the LeakyReLU activation function, R denotes the output feature of the discriminator, and Input denotes the input feature of the discriminator.
Preferably, the joint loss combines the adversarial generation loss on the unsupervised path, the adversarial generation loss on the self-supervised path, the KL divergence loss, the reconstruction loss and the cross-entropy loss of the discriminator,
wherein θ_G denotes the learnable parameters of the generator G, θ_D denotes the learnable parameters of the discriminator D, the adversarial generation loss functions on the unsupervised and self-supervised paths are denoted L_adv^u(G, D) and L_adv^s(G, D) respectively, L_kld(G) is the KL divergence loss function, L_rec(G) is the reconstruction loss function, L_bce(D) is the cross-entropy loss function, mu denotes the mean of the distribution of the prior vector Z_p, e^logvar denotes the variance of the distribution of the prior vector Z_p, D_KL denotes computing the KL divergence, and N(a_1, b_1) denotes a distribution with mean a_1 and variance b_1; if a_1 = 0 and b_1 = 1, it is the standard normal distribution.
Compared with the prior art, the invention has the beneficial effects that:
the application provides a brand new end-to-end framework based on foreground images, background images and masks thereof, namely an antagonism network based on joint attention generation, specifically, a multi-scale feature aggregation module is designed in a generator to extract multi-scale information from the background and the foreground, after multi-scale features of the foreground images and the background images are extracted, a joint attention module is used to extract global feature interaction information of foreground objects and the background images, affine transformation parameters of the foreground images are predicted based on the feature information, affine transformation is carried out on the foreground images and the masks thereof according to the predicted parameters, and the affine transformation is placed at corresponding positions of the background images, so that synthesis of the input foreground images and the background images is completed. In addition, a self-supervision route is added in the training process, priori knowledge is learned from the positive label synthetic graph, so that a generator is further guided to find the credible position of a foreground target in the background graph, the method has advantages in terms of rationality and diversity of the position in the result compared with other existing methods, a more real and natural synthetic graph can be obtained, and the method can adapt to complex application scenes.
Drawings
- FIG. 1 is a flow chart of the image synthesis method based on a countermeasure generation network of the present invention;
FIG. 2 is a schematic diagram of the architecture of the challenge-generating network model of the present invention;
FIG. 3 is a schematic diagram of a combined attention module according to the present invention;
FIG. 4 is a graph comparing the effects of the composite image of the method of the present invention with those of the prior art.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It will be understood that when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in figs. 1 to 4, an image synthesis method based on a countermeasure generation network includes the following steps:
s1, acquiring a first data set and a second data set, wherein each sample in the first data set comprises a first splicing unit [ I ] fg ,M fg ]Second splicing unit [ I ] bg ,M bg ]And a third splicing unit [ I ] gt ,M gt ]Each sample in the second data set comprises a fourth splicing unitFifth splicing unit->Transformation parameter t corresponding to fifth splicing unit gt =(t r gt ,t x gt ,t y gt ) Each splice unit has the same size, wherein I fg As a foreground map, M fg Mask for foreground map, I bg As background picture, M bg Mask for background image, I gt Is a positive label graph, M gt Mask for positive tag map, ">Representing negative label composite map, ">Representing negative label composite map mask, ">Representing a positive label composite map, ">Representing a positive label composite map mask, ">Representing the scaling of the foreground object corresponding to the fifth stitching unit,/for>Representing the corresponding foreground of the fifth splicing unitX-axis coordinates of the position on the background map, < >>And representing the y-axis coordinate of the foreground position corresponding to the fifth splicing unit on the background map.
In this embodiment, the countermeasure generation network model includes a generator, a discriminator and a priori knowledge extractor. In the training phase, the model training architecture is as shown in FIG. 2; the inputs of the countermeasure generation network model are the splice of the foreground map and its foreground map mask [I_fg, M_fg] (the first splicing unit), the splice of the background map and its background map mask [I_bg, M_bg] (the second splicing unit), and the splice of the positive label map and its mask [I_gt, M_gt] (the third splicing unit), wherein I_fg, I_bg and I_gt have size H × W × 3, M_fg, M_bg and M_gt have size H × W × 1, and the spliced [I_fg, M_fg], [I_bg, M_bg] and [I_gt, M_gt] all have size H × W × 4. In the training phase, the generator consists of an unsupervised path and a self-supervised path (the two paths are independent of each other); during training, the label result map (namely the second data set) is used to compute the loss function for the result generated on the self-supervised path, while no label is used on the unsupervised path. The generator outputs two groups of lists each consisting of 3 transformation parameters; the input foreground map and its foreground map mask are affine-transformed according to the transformation parameters predicted by the model and then composited into the background map, thereby generating two final synthesis units [I_c^u, M_fg^u] and [I_c^s, M_fg^s], wherein I_c^u and I_c^s have size H × W × 3 and M_fg^u and M_fg^s have size H × W × 1. Finally, the synthesis units generated by the two paths, each of size H × W × 4, are input into the discriminator, and the generator and the discriminator are trained according to the countermeasure generation training mode.
During the test phase, the model architecture used is the unsupervised path portion of the generator in FIG. 2; the inputs are [I_fg, M_fg] and [I_bg, M_bg], the output is one list consisting of 3 transformation parameters, and the input foreground map and its mask are affine-transformed according to the transformation parameters predicted by the model and composited into the background map to generate the final synthesis unit [I_c^u, M_fg^u]. Since the modules used in the test phase are part of the training-phase model, the following mainly describes the training process of the countermeasure generation network model.
S2, building and training a countermeasure generation network model, wherein the countermeasure generation network model comprises a generator, a discriminator and a priori knowledge extractor, the generator comprises a preliminary feature extractor, a multi-scale feature aggregation module, a joint attention module, a Concat function and a regression block, the multi-scale feature aggregation module comprises two parallel first feature extraction units, the first feature extraction units comprise a multi-scale encoder and a feature aggregator which are sequentially connected, the priori knowledge extractor comprises a global feature extractor and an automatic encoder which are sequentially connected, and the training process is as follows:
S21, inputting the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] and the third splicing unit [I_gt, M_gt] of each sample in the first data set into the preliminary feature extractor respectively, correspondingly obtaining a first basic feature map F_fg, a second basic feature map F_bg and a third basic feature map F_gt.
In one embodiment, the preliminary feature extractor employs a VGG16 network model.
In one embodiment, the preliminary feature extractor performs the following operations:
F_1 = MaxPool(Conv64(Conv64(Input1))), with size H/2 × W/2 × 64;
F_2 = MaxPool(Conv128(Conv128(F_1))), with size H/4 × W/4 × 128;
F_3 = MaxPool(Conv256(Conv256(Conv256(F_2)))), with size H/8 × W/8 × 256;
F_fg = MaxPool(Conv512(Conv512(Conv512(F_3)))), with size H/16 × W/16 × 512;
wherein MaxPool denotes the max pooling operation, ConvX denotes a convolution operation with X output channels, and Input1 denotes the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] or the third splicing unit [I_gt, M_gt]; each splicing unit has height H, width W and 4 channels.
Specifically, the preliminary feature extractor may follow a VGG architecture, such as the VGG16 network model. Taking [I_fg, M_fg] as an example, the input [I_fg, M_fg] has size H × W × 4 and the corresponding first basic feature map F_fg has size H_1 × W_1 × C, with H/16 = H_1 and W/16 = W_1; the second basic feature map F_bg and the third basic feature map F_gt likewise have size H_1 × W_1 × C. The convolution layers of the preliminary feature extractor all use 3×3 convolution kernels, and X in ConvX denotes the number of output channels. Each convolution layer has sliding stride 1 and padding 1; MaxPool denotes the max pooling operation, and the VGG16 network model uses 2×2 max pooling. Padding refers to filling n rings of values around the matrix; padding = 1 means, for example, that a 5×5 matrix becomes 7×7 after one ring of padding.
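For illustration, the following is a minimal PyTorch sketch of such a VGG16-style preliminary feature extractor operating on a 4-channel spliced input; the class name is illustrative, and the ReLU activations are included as in the standard VGG16 even though the operations listed above do not state them explicitly.

```python
import torch
import torch.nn as nn

class PreliminaryFeatureExtractor(nn.Module):
    """VGG16-style backbone for a 4-channel [image, mask] splice (sketch)."""

    def __init__(self):
        super().__init__()

        def block(c_in, c_out, n_convs):
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out,
                                     kernel_size=3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            return nn.Sequential(*layers)

        self.stage1 = block(4, 64, 2)     # -> H/2  x W/2  x 64
        self.stage2 = block(64, 128, 2)   # -> H/4  x W/4  x 128
        self.stage3 = block(128, 256, 3)  # -> H/8  x W/8  x 256
        self.stage4 = block(256, 512, 3)  # -> H/16 x W/16 x 512

    def forward(self, x):                 # x: N x 4 x H x W
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        return self.stage4(f3)            # basic feature map

# Example: a 256x256 spliced unit [I_fg, M_fg] yields a 16x16x512 basic feature map.
extractor = PreliminaryFeatureExtractor()
f_fg = extractor(torch.randn(1, 4, 256, 256))
print(f_fg.shape)  # torch.Size([1, 512, 16, 16])
```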
S22, inputting the first basic feature map F_fg and the second basic feature map F_bg one-to-one into the first feature extraction units of the multi-scale feature aggregation module, correspondingly obtaining a first multi-scale feature map P_fg and a second multi-scale feature map P_bg, and inputting the first multi-scale feature map P_fg and the second multi-scale feature map P_bg into the joint attention module to obtain the global interaction feature Z.
In one embodiment, a multi-scale encoder performs the following operations:
P_1 = ReLU(BatchNorm(Conv512(Input2))), with size H_1 × W_1 × 512;
P_2 = ReLU(BatchNorm(Conv512(P_1))), with size H_1 × W_1 × 512;
P_3 = ReLU(BatchNorm(ConvC(P_2))), with size H_1 × W_1 × C;
P_S1 = AdaptiveAvgPool{S_1×S_1}(P_3), with size S_1 × S_1 × C;
P_S2 = AdaptiveAvgPool{S_2×S_2}(P_3), with size S_2 × S_2 × C;
P_S3 = AdaptiveAvgPool{S_3×S_3}(P_3), with size S_3 × S_3 × C;
wherein ReLU denotes the ReLU activation function, BatchNorm denotes the normalization operation, ConvX denotes a convolution operation with X output channels, AdaptiveAvgPool{n×n} adaptively pools the height × width of the corresponding input feature map to n × n, S_1, S_2 and S_3 are a first, a second and a third preset size in turn, and Input2 denotes the first basic feature map F_fg or the second basic feature map F_bg; each basic feature map has height H_1, width W_1 and C channels, with H/16 = H_1 and W/16 = W_1, where H and W are the height and width of each splicing unit;
the feature aggregator performs the following operations:
P_S1 = Reshape(P_S1), with size 1 × (S_1·S_1) × C;
P_S2 = Reshape(P_S2), with size 1 × (S_2·S_2) × C;
P_S3 = Reshape(P_S3), with size 1 × (S_3·S_3) × C;
P_g = Concat(Concat(P_S1, P_S2), P_S3), with size 1 × (S_1·S_1 + S_2·S_2 + S_3·S_3) × C;
wherein Reshape denotes the Reshape function (applied here to each pooled map P_S1, P_S2, P_S3 rather than to P_3, so that the stated sizes hold), Concat denotes the Concat function, and P_g denotes the output feature of the first feature extraction unit corresponding to Input2, namely the first multi-scale feature map P_fg or the second multi-scale feature map P_bg.
For the obtained first basic feature map F_fg and second basic feature map F_bg, both of size H_1 × W_1 × C, each is input into the multi-scale feature aggregation module to obtain the first multi-scale feature map P_fg and the second multi-scale feature map P_bg, each aggregating information at multiple scales. Taking the first basic feature map F_fg as an example, the multi-scale encoder presets several sizes S_1, S_2 and S_3; it first further extracts features from F_fg to obtain P_3, and then uses AdaptiveAvgPool to pool P_3 to the preset scales. In order to aggregate the information in the multi-scale feature maps obtained in the previous step, so that global feature interaction between background and foreground objects can subsequently be performed at multiple scales, a feature aggregator is designed: the feature map at each scale is further reshaped, and the maps are concatenated and aggregated along the second dimension, yielding the first multi-scale feature map P_fg; the second multi-scale feature map P_bg is obtained in the same way.
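A hedged PyTorch sketch of this multi-scale feature aggregation is shown below; the 3×3 kernel size of the encoder convolutions and the preset scales S_1 = 1, S_2 = 3, S_3 = 5 are assumptions chosen only to make the example concrete.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureAggregation(nn.Module):
    """Multi-scale encoder + feature aggregator (sketch).

    Pools the encoded feature map to several preset scales and concatenates
    the flattened results along the token dimension.
    """

    def __init__(self, channels=512, scales=(1, 3, 5)):   # scale values are assumptions
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in scales])

    def forward(self, f):                      # f: N x 512 x H1 x W1
        p3 = self.encoder(f)                   # N x C x H1 x W1
        tokens = []
        for pool in self.pools:
            p_s = pool(p3)                     # N x C x Si x Si
            # reshape to N x (Si*Si) x C so the scales can be concatenated
            tokens.append(p_s.flatten(2).transpose(1, 2))
        return torch.cat(tokens, dim=1)        # N x (S1^2 + S2^2 + S3^2) x C

agg = MultiScaleFeatureAggregation()
p_fg = agg(torch.randn(1, 512, 16, 16))
print(p_fg.shape)  # torch.Size([1, 35, 512]) for scales (1, 3, 5)
```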
In one embodiment, the joint attention module performs the following operations:
Q_fg = ConvC/8(P_fg), with size H × W × C/8;
K_fg = ConvC/4(P_fg), with size H × W × C/4;
V_fg = ConvC(P_fg), with size H × W × C;
Q_bg = ConvC/8(P_bg), with size H × W × C/8;
K_bg = ConvC/4(P_bg), with size H × W × C/4;
V_bg = ConvC(P_bg), with size H × W × C;
the first and second dimensions of Q_fg, K_fg, V_fg, Q_bg, K_bg and V_bg are respectively merged through the Reshape function, correspondingly obtaining Q'_fg, K'_fg, V'_fg, Q'_bg, K'_bg and V'_bg in turn, with the following sizes:
Q'_fg, with size HW × C/8;
K'_fg, with size HW × C/4;
V'_fg, with size HW × C;
Q'_bg, with size HW × C/8;
K'_bg, with size HW × C/4;
V'_bg, with size HW × C;
Q'_fg and Q'_bg are concatenated along the third dimension as follows:
Q_cat = Concat(Q'_fg, Q'_bg), with size HW × C/4;
Q_cat is used for attention computation to obtain X_fg and X_bg, expressed as follows:
X_fg = Softmax(Q_cat · K'_fg^T) · V'_fg + P_fg, with size HW × C;
X_bg = Softmax(Q_cat · K'_bg^T) · V'_bg + P_bg, with size HW × C;
Z = AdaptiveAvgPool{1×1}(Conv512(Concat(X_fg, X_bg))), with size 1 × 1 × C;
where ConvX denotes a convolution operation with X output channels (so that V_fg and V_bg keep C channels, matching the stated sizes).
For the obtained first multi-scale feature map P_fg and second multi-scale feature map P_bg, the joint attention module extracts global feature interaction information between the foreground map and the background map at multiple scales, which greatly facilitates predicting a reasonable position of the foreground object in the background map. As shown in FIG. 3, the parameters of each convolution layer differ after training of the joint attention module, so their outputs differ; all convolution layers of the joint attention module use 1×1 convolution kernels, with sliding stride 1 and padding 0. Q_fg, K_fg, V_fg, Q_bg, K_bg and V_bg are adjusted in dimension by the Reshape function, merging the first and second dimensions, to obtain Q'_fg, K'_fg, V'_fg, Q'_bg, K'_bg and V'_bg in turn; Q'_fg and Q'_bg are then concatenated at the channel level, i.e. the third dimension, and Q_cat performs attention computation with K'_fg, V'_fg of the foreground map and K'_bg, V'_bg of the background map respectively, obtaining X_fg and X_bg of size HW × C. Finally, the feature maps obtained by joint attention from the two sides, i.e. X_fg and X_bg, are concatenated, and convolution and pooling operations are applied to obtain the global interaction feature Z of size 1 × 1 × C.
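The following is a minimal sketch of such a joint attention computation in PyTorch. Because the aggregated features are token sequences of shape 1 × L × C rather than H × W maps, the 1×1 convolutions are written here as Conv1d layers; this representation choice, like the class and variable names, is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Joint attention over foreground and background multi-scale features (sketch).

    Queries from both streams are concatenated along the channel axis and attend
    to each stream's keys/values; the two attended maps are fused into a single
    global interaction vector.
    """

    def __init__(self, c=512):
        super().__init__()
        self.q_fg, self.k_fg, self.v_fg = nn.Conv1d(c, c // 8, 1), nn.Conv1d(c, c // 4, 1), nn.Conv1d(c, c, 1)
        self.q_bg, self.k_bg, self.v_bg = nn.Conv1d(c, c // 8, 1), nn.Conv1d(c, c // 4, 1), nn.Conv1d(c, c, 1)
        self.fuse = nn.Sequential(nn.Conv1d(2 * c, c, 1), nn.AdaptiveAvgPool1d(1))

    def _attend(self, q_cat, k, v, residual):
        attn = torch.softmax(q_cat @ k.transpose(1, 2), dim=-1)   # N x L x L
        return attn @ v + residual                                # N x L x C

    def forward(self, p_fg, p_bg):             # p_fg, p_bg: N x L x C token sequences
        fg, bg = p_fg.transpose(1, 2), p_bg.transpose(1, 2)       # N x C x L for the convs
        q_cat = torch.cat([self.q_fg(fg), self.q_bg(bg)], dim=1).transpose(1, 2)  # N x L x C/4
        x_fg = self._attend(q_cat, self.k_fg(fg).transpose(1, 2),
                            self.v_fg(fg).transpose(1, 2), p_fg)
        x_bg = self._attend(q_cat, self.k_bg(bg).transpose(1, 2),
                            self.v_bg(bg).transpose(1, 2), p_bg)
        fused = self.fuse(torch.cat([x_fg, x_bg], dim=-1).transpose(1, 2))  # N x C x 1
        return fused.squeeze(-1)               # global interaction feature Z, N x C

att = JointAttention()
z = att(torch.randn(1, 35, 512), torch.randn(1, 35, 512))
print(z.shape)  # torch.Size([1, 512])
```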
S23, inputting the third basic feature map F_gt into the global feature extractor of the priori knowledge extractor to obtain a first extracted feature, and encoding the first extracted feature into a prior vector Z_p through the automatic encoder.
In one embodiment, the global feature extractor performs the following operations:
Z_1 = ReLU(BatchNorm(Conv512(F_gt))), with size H_1 × W_1 × 512;
Z_2 = ReLU(BatchNorm(Conv512(Z_1))), with size H_1 × W_1 × 512;
Z_3 = ReLU(BatchNorm(ConvC(Z_2))), with size H_1 × W_1 × C;
Z_4 = AdaptiveAvgPool{1×1}(Z_3), with size 1 × 1 × C;
the automatic encoder performs the following operations:
h = ReLU(FC1024(Z_4)), with size 1 × 1 × 1024;
mu = FC512(h), with size 1 × 1 × 512;
logvar = FC512(h), with size 1 × 1 × 512;
Z_p = mu + ε · e^(logvar/2), with size 1 × 1 × 512, where ε is a standard normal random sample (the usual VAE reparameterization);
wherein FCY denotes a fully connected layer that maps the channel number of the corresponding input to Y.
By extracting reasonable foreground-object position information from the positive label map during training and encoding it into the prior vector Z_p, which is incorporated into the regression process of the generator, prior knowledge guidance is provided for training the generator. For the obtained third basic feature map F_gt, global feature extraction is first performed to obtain Z_4, and then a VAE-based automatic encoder encodes Z_4 into the prior vector Z_p, where FC denotes a fully connected layer; for example, FC1024(Z_4) means mapping the number of channels of Z_4 from the original C to 1024, and the FC layers used in different places correspond to different fully connected layers.
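A minimal PyTorch sketch of the prior knowledge extractor is given below, assuming 3×3 convolutions in the global feature extractor and the usual VAE reparameterization for sampling the prior vector; these details are assumptions where the text above does not fix them.

```python
import torch
import torch.nn as nn

class PriorKnowledgeExtractor(nn.Module):
    """Global feature extractor + VAE-style encoder for the prior vector (sketch)."""

    def __init__(self, c=512):
        super().__init__()
        self.global_extractor = nn.Sequential(
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc_h = nn.Linear(c, 1024)
        self.fc_mu = nn.Linear(1024, 512)
        self.fc_logvar = nn.Linear(1024, 512)

    def forward(self, f_gt):                             # f_gt: N x 512 x H1 x W1
        z4 = self.global_extractor(f_gt).flatten(1)      # N x C
        h = torch.relu(self.fc_h(z4))                    # N x 1024
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # reparameterisation: Z_p = mu + eps * exp(logvar / 2)
        z_p = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z_p, mu, logvar                           # prior vector and its statistics

pke = PriorKnowledgeExtractor()
z_p, mu, logvar = pke(torch.randn(1, 512, 16, 16))
print(z_p.shape)  # torch.Size([1, 512])
```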
S24, fusing a random variable Z_u and the prior vector Z_p respectively with the global interaction feature Z through the Concat function, correspondingly obtaining a first splicing vector Z_i, which forms an unsupervised path, and a second splicing vector Z_j, which forms a self-supervised path.
S25, inputting the first splicing vector Z_i and the second splicing vector Z_j respectively into the regression block, correspondingly predicting a first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and a second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s), wherein t_r^u denotes the scaling rate of the foreground object under the unsupervised path, t_x^u denotes the x-axis coordinate of the foreground position on the background map under the unsupervised path, t_y^u denotes the y-axis coordinate of the foreground position on the background map under the unsupervised path, t_r^s denotes the scaling rate of the foreground object under the self-supervised path, t_x^s denotes the x-axis coordinate of the foreground position on the background map under the self-supervised path, and t_y^s denotes the y-axis coordinate of the foreground position on the background map under the self-supervised path.
In one embodiment, the regression block performs the following:
t_u = FC3(FC1024(ReLU(FC1024(Z_i))));
t_s = FC3(FC1024(ReLU(FC1024(Z_j))));
wherein FCY is a fully connected layer that maps the number of channels of the corresponding input image to Y, and ReLU represents a ReLU activation function.
Specifically, for the unsupervised path in the generator, a random variable Z_u (of size 1 × 1 × 512) is introduced into the global interaction feature Z before regression, so that the synthesis results of the model are diversified; for the self-supervised path in the generator, the prior vector Z_p (of size 1 × 1 × 512) is introduced into the global interaction feature Z before regression, to guide the generator to make reasonable foreground object placement predictions, where:
the first splicing vector Z_i is expressed as:
Z_i = Concat(Z, Z_u), with size 1 × 1 × (C + 512);
the second splicing vector Z_j is expressed as:
Z_j = Concat(Z, Z_p), with size 1 × 1 × (C + 512);
the first splicing vector Z_i is then input into the regression block to predict the first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u), and the second splicing vector Z_j is input into the regression block to predict the second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s).
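A small sketch of the regression block and the two splicing vectors follows; the final sigmoid that keeps (t_r, t_x, t_y) in [0, 1] is an assumption consistent with the parameter ranges described below, not something the text above states.

```python
import torch
import torch.nn as nn

class RegressionBlock(nn.Module):
    """Predicts (t_r, t_x, t_y) from a spliced [Z, Z_u] or [Z, Z_p] vector (sketch)."""

    def __init__(self, c=512, cond_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c + cond_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, 1024),
            nn.Linear(1024, 3),               # t = (t_r, t_x, t_y)
        )

    def forward(self, z, cond):               # z: N x C, cond: N x cond_dim (Z_u or Z_p)
        t = self.mlp(torch.cat([z, cond], dim=1))
        return torch.sigmoid(t)               # keep the parameters in [0, 1] (assumption)

reg = RegressionBlock()
z = torch.randn(1, 512)
t_u = reg(z, torch.randn(1, 512))             # unsupervised path: random variable Z_u
t_s = reg(z, torch.randn(1, 512))             # self-supervised path: prior vector Z_p
print(t_u.shape, t_s.shape)                   # torch.Size([1, 3]) each
```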
S26, performing affine transformation on the input foreground map and foreground map mask according to the first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and the second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s) respectively, correspondingly obtaining a first affine transformation result and a second affine transformation result, wherein the first affine transformation result comprises the affine-transformed foreground map I_fg^u and foreground map mask M_fg^u under the unsupervised path, and the second affine transformation result comprises the affine-transformed foreground map I_fg^s and foreground map mask M_fg^s under the self-supervised path.
In an embodiment, the affine transformation is based on a Spatial Transformer Network (STN): affine transformation is performed on the input foreground map and its foreground map mask according to the transformation parameters predicted by the generator.
Taking the affine transformation of an input image according to t_s as an example, with t_r^s ∈ [0, 1], define the scaled height h = t_r^s · H and width w = t_r^s · W; assuming the top-left vertex coordinates of the foreground object placed on the background map are (x, y), define t_x^s = x / (W − w) and t_y^s = y / (H − h). The foreground map I_fg and its foreground map mask M_fg are then affine-transformed; taking the affine transformation of the foreground map as an example, the coordinates follow
x'_fg = t_r^s · x_fg + t_x^s · (W − w), y'_fg = t_r^s · y_fg + t_y^s · (H − h),
wherein (x_fg, y_fg) denotes a coordinate point on the foreground map and (x'_fg, y'_fg) is the corresponding coordinate point after affine transformation. After the foreground maps and foreground map masks of the unsupervised path and the self-supervised path are affine-transformed, the first affine transformation result and the second affine transformation result are correspondingly obtained, wherein the first affine transformation result is the affine-transformed foreground map I_fg^u and foreground map mask M_fg^u under the unsupervised path, and the second affine transformation result is the affine-transformed foreground map I_fg^s and foreground map mask M_fg^s under the self-supervised path.
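The sketch below shows one way to realize the STN-style placement with torch.nn.functional.affine_grid and grid_sample, together with the mask-based composition used in S27; the exact grid convention (inverse warp, object centred at the normalized position) is an assumption of this sketch rather than the patent's stated convention.

```python
import torch
import torch.nn.functional as F

def place_foreground(i_fg, m_fg, i_bg, t):
    """Scale/translate the foreground with an STN-style affine warp and paste it
    onto the background (sketch)."""
    n = i_fg.shape[0]
    t_r, t_x, t_y = t[:, 0], t[:, 1], t[:, 2]          # scale and normalised position

    # affine_grid builds the sampling grid of the output image, so 1/t_r shrinks
    # the foreground; t_x, t_y in [0, 1] move it across the canvas.
    theta = torch.zeros(n, 2, 3, device=i_fg.device)
    theta[:, 0, 0] = 1.0 / t_r
    theta[:, 1, 1] = 1.0 / t_r
    theta[:, 0, 2] = -(2.0 * t_x - 1.0) / t_r          # map [0,1] position to [-1,1]
    theta[:, 1, 2] = -(2.0 * t_y - 1.0) / t_r

    grid = F.affine_grid(theta, i_fg.shape, align_corners=False)
    i_fg_t = F.grid_sample(i_fg, grid, align_corners=False)
    m_fg_t = F.grid_sample(m_fg, grid, align_corners=False)

    # mask-based composition of the transformed foreground onto the background
    i_c = i_fg_t * m_fg_t + i_bg * (1.0 - m_fg_t)
    return i_c, m_fg_t

i_fg, m_fg = torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256)
i_bg = torch.rand(1, 3, 256, 256)
t = torch.tensor([[0.5, 0.3, 0.7]])                    # (t_r, t_x, t_y)
i_c, m_c = place_foreground(i_fg, m_fg, i_bg, t)
print(i_c.shape, m_c.shape)
```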
S27, synthesizing the first affine transformation result and the second affine transformation result with the background map respectively, correspondingly obtaining a first composite map I_c^u and a second composite map I_c^s by pasting the transformed foreground onto the background with the transformed mask, namely I_c^u = I_fg^u ⊙ M_fg^u + I_bg ⊙ (1 − M_fg^u) and I_c^s = I_fg^s ⊙ M_fg^s + I_bg ⊙ (1 − M_fg^s), where ⊙ denotes element-wise multiplication.
S28, inputting the first synthesis unit [I_c^u, M_fg^u] and the second synthesis unit [I_c^s, M_fg^s] into the discriminator, and also training the discriminator through the second data set.
Based on the countermeasure generation network training mode, all composite maps generated by the generator are input into the discriminator for discrimination training. The input of the discriminator is [I_c^u, M_fg^u] or [I_c^s, M_fg^s], and the output is a value R ∈ {0, 1}; 1 indicates that the discriminator considers the foreground object position of the composite map reasonable, and 0 indicates that it is unreasonable. In addition to being trained with the composite maps generated by the generator, the discriminator is also trained with the second data set, taking [I_c^-, M_c^-] or [I_c^+, M_c^+] as input, with the output again a value R ∈ {0, 1} interpreted in the same way.
In one embodiment, the discriminator performs the following operations:
R = Sigmoid(Conv1(LeakyReLU(Conv512(LeakyReLU(Conv256(LeakyReLU(Conv128(LeakyReLU(Conv64(Input))))))))))
wherein Sigmoid denotes the Sigmoid activation function, ConvX denotes a convolution operation with X output channels, LeakyReLU denotes the LeakyReLU activation function, R denotes the output feature of the discriminator, and Input denotes the input feature of the discriminator.
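For illustration, a compact PyTorch sketch of such a discriminator over the 4-channel synthesis unit is shown below; the text above specifies only the channel widths and activations, so the kernel sizes, strides and the final pooling to a single score are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Discriminator over the 4-channel synthesis unit [composite map, mask] (sketch)."""

    def __init__(self):
        super().__init__()

        def down(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                                 nn.LeakyReLU(0.2, inplace=True))

        self.features = nn.Sequential(down(4, 64), down(64, 128),
                                      down(128, 256), down(256, 512))
        self.head = nn.Sequential(nn.Conv2d(512, 1, 3, padding=1),
                                  nn.AdaptiveAvgPool2d(1),
                                  nn.Sigmoid())

    def forward(self, synthesis_unit):            # N x 4 x H x W
        return self.head(self.features(synthesis_unit)).flatten(1)  # N x 1 score in [0, 1]

d = Discriminator()
r = d(torch.rand(1, 4, 256, 256))
print(r.shape, float(r))
```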
S29, calculating joint loss to update network parameters, and obtaining a trained countermeasure generation network model.
In one embodiment, the joint loss combines the adversarial generation loss on the unsupervised path, the adversarial generation loss on the self-supervised path, the KL divergence loss, the reconstruction loss and the cross-entropy loss of the discriminator,
wherein θ_G denotes the learnable parameters of the generator G, θ_D denotes the learnable parameters of the discriminator D, the adversarial generation loss functions on the unsupervised and self-supervised paths are denoted L_adv^u(G, D) and L_adv^s(G, D) respectively, L_kld(G) is the KL divergence loss function, L_rec(G) is the reconstruction loss function, L_bce(D) is the cross-entropy loss function, mu denotes the mean of the distribution of the prior vector Z_p, e^logvar denotes the variance of the distribution of the prior vector Z_p, D_KL denotes computing the KL divergence, and N(a_1, b_1) denotes a distribution with mean a_1 and variance b_1; if a_1 = 0 and b_1 = 1, it is the standard normal distribution.
When training, the countermeasure generation network model updates the parameters of each network module by calculating the joint loss. The adversarial generation loss function L_adv^u(G, D) on the unsupervised path trains the generator so that the composite map it generates on the unsupervised path is considered reasonable by the discriminator. The adversarial generation loss function L_adv^s(G, D) on the self-supervised path trains the generator so that the composite map it generates on the self-supervised path is considered reasonable by the discriminator. The KL divergence loss function L_kld(G) makes the distribution of the prior vector learned from the positive label composite map tend towards a Gaussian distribution. The reconstruction loss function L_rec(G) makes the transformation parameters predicted on the self-supervised path tend towards the transformation parameters of the positive label. The cross-entropy loss L_bce(D) trains the discriminator with the positive label composite map and the negative label composite map, improving the discrimination capability of the discriminator.
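Since the exact loss expressions and weighting are not reproduced here, the following is only a hedged sketch of how these five terms could be evaluated in one training step, assuming binary-cross-entropy adversarial terms, an MSE reconstruction term and unit loss weights.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_u, d_fake_s, mu, logvar, t_pred_s, t_gt):
    """Sketch of the generator-side joint loss terms (weights assumed to be 1)."""
    l_adv_u = F.binary_cross_entropy(d_fake_u, torch.ones_like(d_fake_u))  # fool D, unsupervised path
    l_adv_s = F.binary_cross_entropy(d_fake_s, torch.ones_like(d_fake_s))  # fool D, self-supervised path
    # KL divergence pushing the prior-vector distribution towards N(0, 1)
    l_kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # reconstruction: parameters predicted on the self-supervised path match the labels
    l_rec = F.mse_loss(t_pred_s, t_gt)
    return l_adv_u + l_adv_s + l_kld + l_rec

def discriminator_loss(d_real_pos, d_fake_neg):
    """Cross-entropy on positive-label vs. negative-label / generated composites."""
    l_real = F.binary_cross_entropy(d_real_pos, torch.ones_like(d_real_pos))
    l_fake = F.binary_cross_entropy(d_fake_neg, torch.zeros_like(d_fake_neg))
    return l_real + l_fake

# toy shapes only, to show the call pattern
score = lambda: torch.rand(2, 1)
print(generator_loss(score(), score(), torch.randn(2, 512), torch.randn(2, 512),
                     torch.rand(2, 3), torch.rand(2, 3)))
print(discriminator_loss(score(), score()))
```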
S3, inputting the first splicing unit [I_fg, M_fg] and the second splicing unit [I_bg, M_bg] to be synthesized into the trained countermeasure generation network model, and taking the first synthesis unit [I_c^u, M_fg^u] correspondingly output on the unsupervised path as the image synthesis prediction result.
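As a usage illustration of the inference step S3, the sketch below wires stand-in callables together; generator_unsup and place_fn are hypothetical names for the trained unsupervised-path generator and the affine-placement routine (e.g. the place_foreground sketch above).

```python
import torch

@torch.no_grad()
def synthesize(generator_unsup, i_fg, m_fg, i_bg, m_bg, place_fn):
    """Inference on the unsupervised path only: splice the inputs, predict one
    (t_r, t_x, t_y) triple, warp the foreground and paste it onto the background."""
    fg_unit = torch.cat([i_fg, m_fg], dim=1)          # N x 4 x H x W
    bg_unit = torch.cat([i_bg, m_bg], dim=1)          # N x 4 x H x W
    t_u = generator_unsup(fg_unit, bg_unit)           # N x 3 transformation parameters
    i_c, m_c = place_fn(i_fg, m_fg, i_bg, t_u)        # first synthesis unit
    return torch.cat([i_c, m_c], dim=1)               # N x 4 x H x W prediction

# usage with stand-in callables, just to show the shapes involved
fake_gen = lambda fg, bg: torch.tensor([[0.5, 0.4, 0.6]])
fake_place = lambda i_fg, m_fg, i_bg, t: (i_bg, m_fg)
out = synthesize(fake_gen, torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256),
                 torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256), fake_place)
print(out.shape)  # torch.Size([1, 4, 256, 256])
```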
The final test effect of the technical scheme on the OPA data set is shown in table 1:
TABLE 1
| Technical scheme | FID | LPIPS |
| --- | --- | --- |
| TERSE | 46.88 | 0 |
| PlaceNet | 37.01 | 0.161 |
| GracoNet | 28.10 | 0.207 |
| The present application | 23.21 | 0.270 |
The FID index measures the difference between the composite maps generated by the present technical scheme on the test set and the positive label composite maps. The LPIPS index is computed by generating 10 composite maps for the same input with the present technical scheme and measuring the degree of difference among these 10 composite maps, i.e. it measures the diversity of the generated composite maps. FIG. 4 compares the composite maps of the technical scheme of the present application with those of other technical schemes; it can be seen that the synthesis effect of the present application is superior to the other prior-art schemes, including TERSE (reference: S. Tripathi, S. Chandra, A. Agrawal, A. Tyagi, J. M. Rehg, and V. Chari, "Learning to generate synthetic data via compositing," in CVPR, pp. 461-470, 2019), PlaceNet (reference: L. Zhang, T. Wen, J. Min, J. Wang, D. Han, and J. Shi, "Learning object placement by inpainting for compositional data augmentation," in ECCV, pp. 566-581, 2020) and GracoNet (reference: S. Zhou, L. Liu, L. Niu, and L. Zhang, "Learning object placement via dual-path graph completion," in ECCV, pp. 373-389, 2022).
The technical features of the above embodiments may be combined arbitrarily, and specific steps are not limited herein, and may be sequentially adjusted according to actual needs by those skilled in the art. And in order to simplify the description, all possible combinations of the technical features in the above embodiments are not described, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope described in the present specification.
The above-described embodiments are merely representative of the more specific and detailed embodiments described herein and are not to be construed as limiting the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.
Claims (10)
1. An image synthesis method based on a countermeasure generation network, characterized in that the image synthesis method based on the countermeasure generation network comprises the following steps:
S1, acquiring a first data set and a second data set, wherein each sample in the first data set comprises a first splicing unit [I_fg, M_fg], a second splicing unit [I_bg, M_bg] and a third splicing unit [I_gt, M_gt], each sample in the second data set comprises a fourth splicing unit [I_c^-, M_c^-], a fifth splicing unit [I_c^+, M_c^+] and the transformation parameters t_gt = (t_r^gt, t_x^gt, t_y^gt) corresponding to the fifth splicing unit, and all splicing units have the same size, wherein I_fg is the foreground map, M_fg is the foreground map mask, I_bg is the background map, M_bg is the background map mask, I_gt is the positive label map, M_gt is the positive label map mask, I_c^- denotes the negative label composite map, M_c^- denotes the negative label composite map mask, I_c^+ denotes the positive label composite map, M_c^+ denotes the positive label composite map mask, t_r^gt denotes the scaling rate of the foreground object corresponding to the fifth splicing unit, t_x^gt denotes the x-axis coordinate of the corresponding foreground position on the background map, and t_y^gt denotes the y-axis coordinate of the corresponding foreground position on the background map;
S2, building and training a countermeasure generation network model, wherein the countermeasure generation network model comprises a generator, a discriminator and a priori knowledge extractor, the generator comprises a preliminary feature extractor, a multi-scale feature aggregation module, a joint attention module, a Concat function and a regression block, the multi-scale feature aggregation module comprises two parallel first feature extraction units, each first feature extraction unit comprises a multi-scale encoder and a feature aggregator which are connected in sequence, the priori knowledge extractor comprises a global feature extractor and an automatic encoder which are connected in sequence, and the training process is as follows:
S21, inputting the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] and the third splicing unit [I_gt, M_gt] of each sample in the first data set into the preliminary feature extractor respectively, correspondingly obtaining a first basic feature map F_fg, a second basic feature map F_bg and a third basic feature map F_gt;
S22, inputting the first basic feature map F_fg and the second basic feature map F_bg one-to-one into the first feature extraction units of the multi-scale feature aggregation module, correspondingly obtaining a first multi-scale feature map P_fg and a second multi-scale feature map P_bg, and inputting the first multi-scale feature map P_fg and the second multi-scale feature map P_bg into the joint attention module to obtain the global interaction feature Z;
S23, inputting the third basic feature map F_gt into the global feature extractor of the priori knowledge extractor to obtain a first extracted feature, and encoding the first extracted feature into a prior vector Z_p through the automatic encoder;
S24, fusing a random variable Z_u and the prior vector Z_p respectively with the global interaction feature Z through the Concat function, correspondingly obtaining a first splicing vector Z_i, which forms an unsupervised path, and a second splicing vector Z_j, which forms a self-supervised path;
S25, inputting the first splicing vector Z_i and the second splicing vector Z_j respectively into the regression block, correspondingly predicting a first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and a second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s), wherein t_r^u denotes the scaling rate of the foreground object under the unsupervised path, t_x^u denotes the x-axis coordinate of the foreground position on the background map under the unsupervised path, t_y^u denotes the y-axis coordinate of the foreground position on the background map under the unsupervised path, t_r^s denotes the scaling rate of the foreground object under the self-supervised path, t_x^s denotes the x-axis coordinate of the foreground position on the background map under the self-supervised path, and t_y^s denotes the y-axis coordinate of the foreground position on the background map under the self-supervised path;
S26, performing affine transformation on the input foreground map and foreground map mask according to the first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and the second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s) respectively, correspondingly obtaining a first affine transformation result and a second affine transformation result, wherein the first affine transformation result comprises the affine-transformed foreground map I_fg^u and foreground map mask M_fg^u under the unsupervised path, and the second affine transformation result comprises the affine-transformed foreground map I_fg^s and foreground map mask M_fg^s under the self-supervised path;
S27, synthesizing the first affine transformation result and the second affine transformation result with the background map respectively, correspondingly obtaining a first composite map under the unsupervised path and a second composite map under the self-supervised path (see the sketch after this claim);
S28, inputting the first composite map and the second composite map into the discriminator, and training the discriminator with the second data set;
S29, calculating the joint loss to update the network parameters, obtaining the trained countermeasure generation network model;
S3, inputting the first splice unit [I_fg, M_fg] and the second splice unit [I_bg, M_bg] to be synthesized into the trained countermeasure generation network model, and outputting the first composite map produced under the unsupervised path as the image synthesis prediction result.
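The compositing operation in step S27 can be sketched as a standard mask-based blend. The claim gives the exact expression as a formula that is not reproduced here, so the blend below, together with the function and variable names (composite, fg_warped, mask_warped, bg), is an assumption for illustration only.

```python
import torch

def composite(fg_warped: torch.Tensor, mask_warped: torch.Tensor, bg: torch.Tensor) -> torch.Tensor:
    """Blend an affine-transformed foreground onto a background map.

    Assumes the usual composite I_c = M * I_fg + (1 - M) * I_bg, with image tensors of
    shape (N, 3, H, W) and a mask of shape (N, 1, H, W) whose values lie in [0, 1].
    """
    return mask_warped * fg_warped + (1.0 - mask_warped) * bg
```

Under this reading, the unsupervised-path composite uses the foreground warped by t^u and the self-supervised-path composite uses the foreground warped by t^s.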
2. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the preliminary feature extractor employs a VGG16 network model.
3. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the preliminary feature extractor performs the following operations:
F_1 = MaxPool(Conv64(Conv64(Input1))), size H/2 × W/2 × 64;
F_2 = MaxPool(Conv128(Conv128(F_1))), size H/4 × W/4 × 128;
F_3 = MaxPool(Conv256(Conv256(Conv256(F_2)))), size H/8 × W/8 × 256;
F_fg = MaxPool(Conv512(Conv512(Conv512(F_3)))), size H/16 × W/16 × 512;
wherein MaxPool denotes the max pooling operation, ConvX denotes a convolution operation with X output channels, and Input1 denotes the first splice unit [I_fg, M_fg], the second splice unit [I_bg, M_bg] or the third splice unit [I_gt, M_gt]; each splice unit has height H, width W and 4 channels.
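A minimal PyTorch sketch of the operations in claim 3, assuming 3×3 convolutions with ReLU activations as in VGG16; the kernel size, the activation between convolutions and the class name PreliminaryFeatureExtractor are assumptions, only the channel progression and pooling follow the claim.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    """n_convs 3x3 convolutions (VGG-style) followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class PreliminaryFeatureExtractor(nn.Module):
    """4-channel splice unit [image, mask] -> H/16 x W/16 x 512 basic feature map."""
    def __init__(self) -> None:
        super().__init__()
        self.block1 = conv_block(4, 64, 2)     # F_1: H/2  x W/2  x 64
        self.block2 = conv_block(64, 128, 2)   # F_2: H/4  x W/4  x 128
        self.block3 = conv_block(128, 256, 3)  # F_3: H/8  x W/8  x 256
        self.block4 = conv_block(256, 512, 3)  # F_fg / F_bg / F_gt: H/16 x W/16 x 512

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block4(self.block3(self.block2(self.block1(x))))
```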
4. The image synthesis method based on a countermeasure generation network of claim 1, wherein:
the multi-scale encoder performs the following operations:
P_1 = ReLU(BatchNorm(Conv512(Input2))), size H_1 × W_1 × 512;
P_2 = ReLU(BatchNorm(Conv512(P_1))), size H_1 × W_1 × 512;
P_3 = ReLU(BatchNorm(ConvC(P_2))), size H_1 × W_1 × C;
P_S1 = AdaptiveAvgPool{S_1 × S_1}(P_3), size S_1 × S_1 × C;
P_S2 = AdaptiveAvgPool{S_2 × S_2}(P_3), size S_2 × S_2 × C;
P_S3 = AdaptiveAvgPool{S_3 × S_3}(P_3), size S_3 × S_3 × C;
wherein ReLU denotes the ReLU activation function, BatchNorm denotes the normalization operation, ConvX denotes a convolution operation with X output channels, AdaptiveAvgPool{n × n} adaptively pools the height × width of the corresponding input to n × n, S_1, S_2 and S_3 are in sequence the first, second and third preset sizes, and Input2 denotes the first basic feature map F_fg or the second basic feature map F_bg; each basic feature map has height H_1, width W_1 and C channels, with H/16 = H_1 and W/16 = W_1, where H and W are the height and width of each splice unit;
the feature aggregator performs the following operations:
P_S1 = Reshape(P_S1), size 1 × (S_1*S_1) × C;
P_S2 = Reshape(P_S2), size 1 × (S_2*S_2) × C;
P_S3 = Reshape(P_S3), size 1 × (S_3*S_3) × C;
P_g = Concat(Concat(P_S1, P_S2), P_S3), size 1 × (S_1*S_1 + S_2*S_2 + S_3*S_3) × C;
wherein Reshape denotes the Reshape function, Concat denotes the Concat function, and P_g denotes the output feature of the first feature extraction unit corresponding to Input2, namely the first multi-scale feature map P_fg or the second multi-scale feature map P_bg.
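A sketch of the multi-scale encoder and feature aggregator of claim 4, under stated assumptions: 3×3 convolutions and preset pool sizes (S_1, S_2, S_3) = (1, 3, 6), neither of which is fixed by the claim beyond the channel counts and the pooled spatial sizes.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureUnit(nn.Module):
    """Multi-scale encoder (Conv -> BN -> ReLU, then adaptive pooling) plus feature aggregator."""
    def __init__(self, in_ch: int = 512, c: int = 256, sizes=(1, 3, 6)):
        super().__init__()
        def cbr(i: int, o: int) -> nn.Sequential:
            return nn.Sequential(nn.Conv2d(i, o, kernel_size=3, padding=1),
                                 nn.BatchNorm2d(o), nn.ReLU(inplace=True))
        self.encoder = nn.Sequential(cbr(in_ch, 512), cbr(512, 512), cbr(512, c))
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in sizes])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p3 = self.encoder(x)                                # (N, C, H1, W1)
        tokens = []
        for pool in self.pools:
            p = pool(p3)                                    # (N, C, Si, Si)
            tokens.append(p.flatten(2).transpose(1, 2))     # (N, Si*Si, C)
        return torch.cat(tokens, dim=1)                     # P_fg or P_bg: (N, S1^2+S2^2+S3^2, C)
```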
5. The image synthesis method based on a countermeasure generation network of claim 4, wherein:
the global feature extractor performs the following operations:
Z_1 = ReLU(BatchNorm(Conv512(F_gt))), size H_1 × W_1 × 512;
Z_2 = ReLU(BatchNorm(Conv512(Z_1))), size H_1 × W_1 × 512;
Z_3 = ReLU(BatchNorm(ConvC(Z_2))), size H_1 × W_1 × C;
Z_4 = AdaptiveAvgPool{1 × 1}(Z_3), size 1 × 1 × C;
the automatic encoder performs the following operations:
h = ReLU(FC1024(Z_4)), size 1 × 1 × 1024;
mu = FC512(h), size 1 × 1 × 512;
logvar = FC512(h), size 1 × 1 × 512;
z_p = mu + e^(logvar/2), size 1 × 1 × 512;
wherein FCY denotes a fully connected layer that maps the number of channels of the corresponding input to Y.
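A sketch of the automatic encoder of claim 5. The claim writes z_p = mu + e^(logvar/2); a conventional VAE-style reparameterisation would additionally multiply e^(logvar/2) by Gaussian noise, so that variant is offered behind a flag as an assumption rather than as the claimed formula.

```python
import torch
import torch.nn as nn

class PriorEncoder(nn.Module):
    """Map the pooled global feature Z_4 (length C) to the prior vector z_p (length 512)."""
    def __init__(self, c: int = 256, latent: int = 512):
        super().__init__()
        self.fc_h = nn.Linear(c, 1024)            # FC1024
        self.fc_mu = nn.Linear(1024, latent)      # FC512
        self.fc_logvar = nn.Linear(1024, latent)  # FC512

    def forward(self, z4: torch.Tensor, sample_noise: bool = False):
        h = torch.relu(self.fc_h(z4))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        std = torch.exp(0.5 * logvar)                        # e^(logvar/2)
        eps = torch.randn_like(std) if sample_noise else 1.0
        z_p = mu + eps * std                                 # claimed formula when sample_noise=False
        return z_p, mu, logvar
```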
6. The image synthesis method based on a countermeasure generation network of claim 4, wherein: the joint attention module performs the following operations:
Q_fg = ConvC/8(P_fg), size H × W × C/8;
K_fg = ConvC/4(P_fg), size H × W × C/4;
V_fg = ConvC(P_fg), size H × W × C;
Q_bg = ConvC/8(P_bg), size H × W × C/8;
K_bg = ConvC/4(P_bg), size H × W × C/4;
V_bg = ConvC(P_bg), size H × W × C;
flattening the first and second dimensions of Q_fg, K_fg, V_fg, Q_bg, K_bg and V_bg respectively through the Reshape function, correspondingly obtaining Q'_fg, K'_fg, V'_fg, Q'_bg, K'_bg and V'_bg in sequence, with the following sizes:
Q'_fg, size HW × C/8;
K'_fg, size HW × C/4;
V'_fg, size HW × C;
Q'_bg, size HW × C/8;
K'_bg, size HW × C/4;
V'_bg, size HW × C;
splicing Q'_fg and Q'_bg along the third dimension as follows:
Q_cat = Concat(Q'_fg, Q'_bg), size HW × C/4;
performing attention calculation with Q_cat to obtain X_fg and X_bg, with the following expressions:
X_fg = Softmax(Q_cat * K'_fg^T) * V'_fg + P_fg, size HW × C;
X_bg = Softmax(Q_cat * K'_bg^T) * V'_bg + P_bg, size HW × C;
Z = AdaptiveAvgPool{1 × 1}(Conv512(Concat(X_fg, X_bg))), size 1 × 1 × C;
where ConvX represents a convolution operation with a channel number X.
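A sketch of the joint attention module of claim 6, operating on feature maps already flattened to (N, L, C) tokens. Linear layers stand in for the ConvC/8, ConvC/4 and ConvC projections, and a linear layer plus mean pooling stands in for the final Conv512 and AdaptiveAvgPool; those substitutions, and the output width, are assumptions beyond the channel counts stated in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Shared-query attention between flattened foreground and background features."""
    def __init__(self, c: int):
        super().__init__()
        self.q_fg, self.q_bg = nn.Linear(c, c // 8), nn.Linear(c, c // 8)
        self.k_fg, self.k_bg = nn.Linear(c, c // 4), nn.Linear(c, c // 4)
        self.v_fg, self.v_bg = nn.Linear(c, c), nn.Linear(c, c)
        self.fuse = nn.Linear(2 * c, 512)

    def forward(self, p_fg: torch.Tensor, p_bg: torch.Tensor) -> torch.Tensor:
        # Q_cat concatenates the two C/8 queries into a single C/4 query shared by both branches.
        q_cat = torch.cat([self.q_fg(p_fg), self.q_bg(p_bg)], dim=-1)                 # (N, L, C/4)
        x_fg = F.softmax(q_cat @ self.k_fg(p_fg).transpose(1, 2), dim=-1) @ self.v_fg(p_fg) + p_fg
        x_bg = F.softmax(q_cat @ self.k_bg(p_bg).transpose(1, 2), dim=-1) @ self.v_bg(p_bg) + p_bg
        z = self.fuse(torch.cat([x_fg, x_bg], dim=-1)).mean(dim=1)                    # global feature Z
        return z
```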
7. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the regression block performs the following operations:
t^u = FC3(FC1024(ReLU(FC1024(Z_i))));
t^s = FC3(FC1024(ReLU(FC1024(Z_j))));
wherein FCY denotes a fully connected layer that maps the number of channels of the corresponding input to Y, and ReLU denotes the ReLU activation function.
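A sketch of the regression block of claim 7; the width of the input splice vector (here 1024) is an assumption, only the FC1024 -> ReLU -> FC1024 -> FC3 layout follows the claim.

```python
import torch.nn as nn

# Predicts (t_r, t_x, t_y) from a splice vector Z_i or Z_j.
regression_block = nn.Sequential(
    nn.Linear(1024, 1024),   # FC1024
    nn.ReLU(inplace=True),
    nn.Linear(1024, 1024),   # FC1024
    nn.Linear(1024, 3),      # FC3 -> (t_r, t_x, t_y)
)
```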
8. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the affine transformation is implemented with a Spatial Transformer Network.
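A sketch of the affine transformation of claim 8 using PyTorch's spatial-transformer primitives (affine_grid / grid_sample). The mapping of (t_r, t_x, t_y) onto the 2x3 matrix, i.e. an isotropic scale plus a translation in normalised coordinates, is an assumption, as the claim does not give it.

```python
import torch
import torch.nn.functional as F

def stn_affine(fg: torch.Tensor, mask: torch.Tensor, t: torch.Tensor):
    """Warp a foreground map and its mask with parameters t = (t_r, t_x, t_y) per sample."""
    n = t.shape[0]
    theta = torch.zeros(n, 2, 3, device=t.device, dtype=t.dtype)
    theta[:, 0, 0] = t[:, 0]   # scale along x
    theta[:, 1, 1] = t[:, 0]   # scale along y
    theta[:, 0, 2] = t[:, 1]   # x offset in normalised coordinates
    theta[:, 1, 2] = t[:, 2]   # y offset in normalised coordinates
    grid = F.affine_grid(theta, list(fg.shape), align_corners=False)
    fg_warped = F.grid_sample(fg, grid, align_corners=False)
    mask_warped = F.grid_sample(mask, grid, align_corners=False)
    return fg_warped, mask_warped
```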
9. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the discriminator performs the following operations:
R = Sigmoid(Conv1(
    LeakyReLU(Conv512(
    LeakyReLU(Conv256(
    LeakyReLU(Conv128(
    LeakyReLU(Conv64(Input))))))))))
wherein Sigmoid denotes the Sigmoid activation function, ConvX denotes a convolution operation with X output channels, LeakyReLU denotes the LeakyReLU activation function, R denotes the output feature of the discriminator, and Input denotes the input feature of the discriminator.
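A sketch of the discriminator of claim 9; the 4x4 kernels, stride 2, LeakyReLU slope of 0.2 and the input channel count are assumptions, only the 64-128-256-512-1 channel progression and the activations follow the claim.

```python
import torch.nn as nn

def make_discriminator(in_ch: int = 3) -> nn.Sequential:
    """Conv64 -> Conv128 -> Conv256 -> Conv512 -> Conv1 with LeakyReLU between and Sigmoid output."""
    def block(i: int, o: int):
        return [nn.Conv2d(i, o, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(
        *block(in_ch, 64),
        *block(64, 128),
        *block(128, 256),
        *block(256, 512),
        nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
        nn.Sigmoid(),
    )
```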
10. The image synthesis method based on a countermeasure generation network of claim 1, wherein the joint loss is defined in terms of the following quantities:
θ_G denotes the learnable parameters of the generator G, θ_D denotes the learnable parameters of the discriminator D, the loss includes an adversarial generation loss on the unsupervised path and an adversarial generation loss on the self-supervised path, L_kld(G) is the KL-divergence loss function, L_rec(G) is the reconstruction loss function, L_bce(D) is the cross-entropy loss function, the mean and variance of the distribution of the prior vector z_p enter the KL term, D_KL denotes the KL-divergence calculation, and N(a_1, b_1) denotes a distribution with mean a_1 and variance b_1, which is the standard normal distribution when a_1 = 0 and b_1 = 1.
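The KL term L_kld described in claim 10 compares the distribution N(mu, sigma^2) of the prior vector z_p with the standard normal N(0, 1); a sketch of the standard closed-form expression is given below, while the weighting against the adversarial, reconstruction and cross-entropy terms is not reproduced here.

```python
import torch

def kld_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form D_KL(N(mu, exp(logvar)) || N(0, 1)), summed over latent dims, averaged over the batch."""
    return torch.mean(-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
```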
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310236028.4A CN116524290A (en) | 2023-03-09 | 2023-03-09 | Image synthesis method based on countermeasure generation network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524290A (en) | 2023-08-01 |
Family
ID=87398335
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310236028.4A Withdrawn CN116524290A (en) | 2023-03-09 | 2023-03-09 | Image synthesis method based on countermeasure generation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524290A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118314052A (en) * | 2024-06-07 | 2024-07-09 | 北京数慧时空信息技术有限公司 | Method for removing thin cloud of remote sensing image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20230801 |