CN116524290A - Image synthesis method based on countermeasure generation network - Google Patents
Image synthesis method based on countermeasure generation network
- Publication number: CN116524290A
- Application number: CN202310236028.4A
- Authority: CN (China)
- Prior art keywords: size, feature, map, foreground, multiplied
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an image synthesis method based on a countermeasure generation network (generative adversarial network), which comprises the following steps: S1, acquiring a first data set and a second data set, wherein each sample in the first data set comprises a first splicing unit, a second splicing unit and a third splicing unit, each sample in the second data set comprises a fourth splicing unit, a fifth splicing unit and transformation parameters corresponding to the fifth splicing unit, and all splicing units have the same size; S2, building and training a countermeasure generation network model; S3, inputting the first splicing unit and the second splicing unit to be synthesized into the trained countermeasure generation network model, and taking the first synthesis unit correspondingly output on the unsupervised path as the image synthesis prediction result. The method improves the rationality and diversity of the synthesized image, can obtain a more realistic and natural composite image, and can adapt to complex application scenarios.
Description
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an image synthesis method based on a countermeasure generation network (generative adversarial network).
Background
Image synthesis includes four research directions: object placement, image blending, image harmonization, and shadow generation. For the object placement problem, typical geometric inconsistencies include, but are not limited to: 1) the foreground object is too large or too small; 2) the foreground object has no physical support, such as being suspended in mid-air; 3) the foreground object appears where it is semantically inappropriate, such as a ship appearing inland; 4) there is an unreasonable occlusion relationship between the foreground object and surrounding objects; 5) the perspective angles of the foreground and the background are inconsistent. In summary, the size, position and shape of the foreground object are unreasonable. Object placement and spatial transformation aim to find a reasonable size, position and shape for the foreground, avoiding the unreasonable factors mentioned above. Object placement generally involves only translation and scaling of the foreground object, whereas spatial transformation involves relatively complex geometric deformations, such as affine or perspective transformations. For convenience of description, object placement is used below to refer to any geometric transformation.
Learning to place foreground objects on a background scene often arises in applications such as image editing and scene parsing: a model takes an initial composite image and a foreground mask as input and outputs an adjusted, more realistic and natural composite image. To date, most related studies suffer from several problems, including insufficient use of the interactive feature relationship between foreground objects and the scene and little prior knowledge involved in model training, resulting in unreasonable foreground object positions in the synthesized result. For example, most works only consider pasting a single foreground object onto another background picture and assume that the foreground object is complete; however, in real-world applications it is often necessary to synthesize multiple foreground objects onto the same background picture, and a foreground object may be incomplete. In addition, the prior art generally outputs only a single location for the same input. Yet for a given foreground object and background image, the foreground object usually has many reasonable positions on the background image; for example, a vase can be placed on a table in a background image at countless reasonable sizes and positions. Therefore, in order to obtain more reasonable and diversified synthesis results, the image synthesis algorithm needs to be improved so that it can adapt to complex application scenarios.
Disclosure of Invention
In view of the above problems, the invention provides an image synthesis method based on a countermeasure generation network, which improves the rationality and diversity of the synthesized image, can obtain a more realistic and natural composite image, and can adapt to complex application scenarios.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the invention provides an image synthesis method based on a countermeasure generation network, which comprises the following steps:
S1, acquiring a first data set and a second data set, wherein each sample in the first data set comprises a first splicing unit [I_fg, M_fg], a second splicing unit [I_bg, M_bg] and a third splicing unit [I_gt, M_gt], each sample in the second data set comprises a fourth splicing unit [I_c^-, M_c^-], a fifth splicing unit [I_c^+, M_c^+] and the transformation parameters t_gt = (t_r^gt, t_x^gt, t_y^gt) corresponding to the fifth splicing unit, and all splicing units have the same size, wherein I_fg is the foreground map, M_fg is the foreground map mask, I_bg is the background map, M_bg is the background map mask, I_gt is the positive label map, M_gt is the positive label map mask, I_c^- denotes the negative label composite map, M_c^- denotes the negative label composite map mask, I_c^+ denotes the positive label composite map, M_c^+ denotes the positive label composite map mask, t_r^gt denotes the scaling rate of the foreground object corresponding to the fifth splicing unit, t_x^gt denotes the x-axis coordinate of the corresponding foreground position on the background map, and t_y^gt denotes the y-axis coordinate of the corresponding foreground position on the background map;
s2, building and training a countermeasure generation network model, wherein the countermeasure generation network model comprises a generator, a discriminator and a priori knowledge extractor, the generator comprises a preliminary feature extractor, a multi-scale feature aggregation module, a joint attention module, a Concat function and a regression block, the multi-scale feature aggregation module comprises two parallel first feature extraction units, the first feature extraction units comprise a multi-scale encoder and a feature aggregator which are sequentially connected, the priori knowledge extractor comprises a global feature extractor and an automatic encoder which are sequentially connected, and the training process is as follows:
S21, inputting the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] and the third splicing unit [I_gt, M_gt] of each sample in the first data set into the preliminary feature extractor respectively, correspondingly obtaining a first basic feature map F_fg, a second basic feature map F_bg and a third basic feature map F_gt;
S22, inputting the first basic feature map F_fg and the second basic feature map F_bg one-to-one into the first feature extraction units of the multi-scale feature aggregation module, correspondingly obtaining a first multi-scale feature map P_fg and a second multi-scale feature map P_bg, and inputting the first multi-scale feature map P_fg and the second multi-scale feature map P_bg into the joint attention module to obtain the global interaction feature Z;
S23, inputting the third basic feature map F_gt into the global feature extractor of the priori knowledge extractor to obtain a first extracted feature, and encoding the first extracted feature into a prior vector Z_p through the automatic encoder;
S24, fusing a random variable Z_u and the prior vector Z_p respectively with the global interaction feature Z through the Concat function, correspondingly obtaining a first splicing vector Z_i, which forms an unsupervised path, and a second splicing vector Z_j, which forms a self-supervised path;
S25, inputting the first splicing vector Z_i and the second splicing vector Z_j respectively into the regression block, correspondingly predicting a first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and a second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s), wherein t_r^u denotes the scaling rate of the foreground object under the unsupervised path, t_x^u denotes the x-axis coordinate of the foreground position on the background map under the unsupervised path, t_y^u denotes the y-axis coordinate of the foreground position on the background map under the unsupervised path, t_r^s denotes the scaling rate of the foreground object under the self-supervised path, t_x^s denotes the x-axis coordinate of the foreground position on the background map under the self-supervised path, and t_y^s denotes the y-axis coordinate of the foreground position on the background map under the self-supervised path;
S26, performing affine transformation on the input foreground map and foreground map mask according to the first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and the second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s) respectively, correspondingly obtaining a first affine transformation result and a second affine transformation result, wherein the first affine transformation result comprises the affine-transformed foreground map I_fg^u and foreground map mask M_fg^u under the unsupervised path, and the second affine transformation result comprises the affine-transformed foreground map I_fg^s and foreground map mask M_fg^s under the self-supervised path;
S27, synthesizing the first affine transformation result and the second affine transformation result with the background map respectively, correspondingly obtaining a first composite map I_c^u and a second composite map I_c^s by pasting the transformed foreground onto the background with the transformed mask, namely I_c^u = I_fg^u ⊙ M_fg^u + I_bg ⊙ (1 − M_fg^u) and I_c^s = I_fg^s ⊙ M_fg^s + I_bg ⊙ (1 − M_fg^s), where ⊙ denotes element-wise multiplication;
S28, inputting the first synthesis unit [I_c^u, M_fg^u] and the second synthesis unit [I_c^s, M_fg^s] into the discriminator, and also training the discriminator through the second data set;
s29, calculating joint loss to update network parameters, and obtaining a trained countermeasure generation network model;
S3, inputting the first splicing unit [I_fg, M_fg] and the second splicing unit [I_bg, M_bg] to be synthesized into the trained countermeasure generation network model, and taking the first synthesis unit [I_c^u, M_fg^u] correspondingly output on the unsupervised path as the image synthesis prediction result.
Preferably, the preliminary feature extractor employs a VGG16 network model.
Preferably, the preliminary feature extractor performs the following operations:
F_1 = MaxPool(Conv64(Conv64(Input1))), with size H/2 × W/2 × 64;
F_2 = MaxPool(Conv128(Conv128(F_1))), with size H/4 × W/4 × 128;
F_3 = MaxPool(Conv256(Conv256(Conv256(F_2)))), with size H/8 × W/8 × 256;
F_fg = MaxPool(Conv512(Conv512(Conv512(F_3)))), with size H/16 × W/16 × 512;
wherein MaxPool denotes the max pooling operation, ConvX denotes a convolution operation with X output channels, and Input1 denotes the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] or the third splicing unit [I_gt, M_gt]; each splicing unit has height H, width W and 4 channels.
Preferably, the multi-scale encoder performs the following operations:
P_1 = ReLU(BatchNorm(Conv512(Input2))), with size H_1 × W_1 × 512;
P_2 = ReLU(BatchNorm(Conv512(P_1))), with size H_1 × W_1 × 512;
P_3 = ReLU(BatchNorm(ConvC(P_2))), with size H_1 × W_1 × C;
P_S1 = AdaptiveAvgPool{S_1×S_1}(P_3), with size S_1 × S_1 × C;
P_S2 = AdaptiveAvgPool{S_2×S_2}(P_3), with size S_2 × S_2 × C;
P_S3 = AdaptiveAvgPool{S_3×S_3}(P_3), with size S_3 × S_3 × C;
wherein ReLU denotes the ReLU activation function, BatchNorm denotes the normalization operation, ConvX denotes a convolution operation with X output channels, AdaptiveAvgPool{n×n} adaptively pools the height × width of the corresponding input feature map to n × n, S_1, S_2 and S_3 are a first, a second and a third preset size in turn, and Input2 denotes the first basic feature map F_fg or the second basic feature map F_bg; each basic feature map has height H_1, width W_1 and C channels, with H/16 = H_1 and W/16 = W_1, where H and W are the height and width of each splicing unit;
the feature aggregator performs the following operations:
P_S1 = Reshape(P_S1), with size 1 × (S_1·S_1) × C;
P_S2 = Reshape(P_S2), with size 1 × (S_2·S_2) × C;
P_S3 = Reshape(P_S3), with size 1 × (S_3·S_3) × C;
P_g = Concat(Concat(P_S1, P_S2), P_S3), with size 1 × (S_1·S_1 + S_2·S_2 + S_3·S_3) × C;
wherein Reshape denotes the Reshape function (applied here to each pooled map P_S1, P_S2, P_S3 rather than to P_3, so that the stated sizes hold), Concat denotes the Concat function, and P_g denotes the output feature of the first feature extraction unit corresponding to Input2, namely the first multi-scale feature map P_fg or the second multi-scale feature map P_bg.
Preferably, the global feature extractor performs the following operations:
Z_1 = ReLU(BatchNorm(Conv512(F_gt))), with size H_1 × W_1 × 512;
Z_2 = ReLU(BatchNorm(Conv512(Z_1))), with size H_1 × W_1 × 512;
Z_3 = ReLU(BatchNorm(ConvC(Z_2))), with size H_1 × W_1 × C;
Z_4 = AdaptiveAvgPool{1×1}(Z_3), with size 1 × 1 × C;
the automatic encoder performs the following operations:
h = ReLU(FC1024(Z_4)), with size 1 × 1 × 1024;
mu = FC512(h), with size 1 × 1 × 512;
logvar = FC512(h), with size 1 × 1 × 512;
Z_p = mu + ε · e^(logvar/2), with size 1 × 1 × 512, where ε is a standard normal random sample (the usual VAE reparameterization);
wherein FCY denotes a fully connected layer that maps the channel number of the corresponding input to Y.
Preferably, the joint attention module performs the following operations:
Q_fg = ConvC/8(P_fg), with size H × W × C/8;
K_fg = ConvC/4(P_fg), with size H × W × C/4;
V_fg = ConvC(P_fg), with size H × W × C;
Q_bg = ConvC/8(P_bg), with size H × W × C/8;
K_bg = ConvC/4(P_bg), with size H × W × C/4;
V_bg = ConvC(P_bg), with size H × W × C;
the first and second dimensions of Q_fg, K_fg, V_fg, Q_bg, K_bg and V_bg are respectively merged through the Reshape function, correspondingly obtaining Q'_fg, K'_fg, V'_fg, Q'_bg, K'_bg and V'_bg in turn, with the following sizes:
Q'_fg, with size HW × C/8;
K'_fg, with size HW × C/4;
V'_fg, with size HW × C;
Q'_bg, with size HW × C/8;
K'_bg, with size HW × C/4;
V'_bg, with size HW × C;
Q'_fg and Q'_bg are concatenated along the third dimension as follows:
Q_cat = Concat(Q'_fg, Q'_bg), with size HW × C/4;
Q_cat is used for attention computation to obtain X_fg and X_bg, expressed as follows:
X_fg = Softmax(Q_cat · K'_fg^T) · V'_fg + P_fg, with size HW × C;
X_bg = Softmax(Q_cat · K'_bg^T) · V'_bg + P_bg, with size HW × C;
Z = AdaptiveAvgPool{1×1}(Conv512(Concat(X_fg, X_bg))), with size 1 × 1 × C;
where ConvX denotes a convolution operation with X output channels (so that V_fg and V_bg keep C channels, matching the stated sizes).
Preferably, the regression block performs the following operations:
t_u = FC3(FC1024(ReLU(FC1024(Z_i))));
t_s = FC3(FC1024(ReLU(FC1024(Z_j))));
wherein FCY is a fully connected layer that maps the number of channels of the corresponding input image to Y, and ReLU represents a ReLU activation function.
Preferably, the affine transformation uses a Spatial Transformer Network (STN).
Preferably, the discriminator performs the following operations:
R = Sigmoid(Conv1(LeakyReLU(Conv512(LeakyReLU(Conv256(LeakyReLU(Conv128(LeakyReLU(Conv64(Input))))))))))
wherein Sigmoid denotes the Sigmoid activation function, ConvX denotes a convolution operation with X output channels, LeakyReLU denotes the LeakyReLU activation function, R denotes the output feature of the discriminator, and Input denotes the input feature of the discriminator.
Preferably, the joint loss combines the adversarial generation loss on the unsupervised path, the adversarial generation loss on the self-supervised path, the KL divergence loss, the reconstruction loss and the cross-entropy loss of the discriminator,
wherein θ_G denotes the learnable parameters of the generator G, θ_D denotes the learnable parameters of the discriminator D, the adversarial generation loss functions on the unsupervised and self-supervised paths are denoted L_adv^u(G, D) and L_adv^s(G, D) respectively, L_kld(G) is the KL divergence loss function, L_rec(G) is the reconstruction loss function, L_bce(D) is the cross-entropy loss function, mu denotes the mean of the distribution of the prior vector Z_p, e^logvar denotes the variance of the distribution of the prior vector Z_p, D_KL denotes computing the KL divergence, and N(a_1, b_1) denotes a distribution with mean a_1 and variance b_1; if a_1 = 0 and b_1 = 1, it is the standard normal distribution.
Compared with the prior art, the invention has the beneficial effects that:
the application provides a brand new end-to-end framework based on foreground images, background images and masks thereof, namely an antagonism network based on joint attention generation, specifically, a multi-scale feature aggregation module is designed in a generator to extract multi-scale information from the background and the foreground, after multi-scale features of the foreground images and the background images are extracted, a joint attention module is used to extract global feature interaction information of foreground objects and the background images, affine transformation parameters of the foreground images are predicted based on the feature information, affine transformation is carried out on the foreground images and the masks thereof according to the predicted parameters, and the affine transformation is placed at corresponding positions of the background images, so that synthesis of the input foreground images and the background images is completed. In addition, a self-supervision route is added in the training process, priori knowledge is learned from the positive label synthetic graph, so that a generator is further guided to find the credible position of a foreground target in the background graph, the method has advantages in terms of rationality and diversity of the position in the result compared with other existing methods, a more real and natural synthetic graph can be obtained, and the method can adapt to complex application scenes.
Drawings
- FIG. 1 is a flow chart of the image synthesis method based on a countermeasure generation network of the present invention;
FIG. 2 is a schematic diagram of the architecture of the challenge-generating network model of the present invention;
FIG. 3 is a schematic diagram of a combined attention module according to the present invention;
FIG. 4 is a graph comparing the effects of the composite image of the method of the present invention with those of the prior art.
Detailed Description
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It will be understood that when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in figs. 1 to 4, an image synthesis method based on a countermeasure generation network includes the following steps:
s1, acquiring a first data set and a second data set, wherein each sample in the first data set comprises a first splicing unit [ I ] fg ,M fg ]Second splicing unit [ I ] bg ,M bg ]And a third splicing unit [ I ] gt ,M gt ]Each sample in the second data set comprises a fourth splicing unitFifth splicing unit->Transformation parameter t corresponding to fifth splicing unit gt =(t r gt ,t x gt ,t y gt ) Each splice unit has the same size, wherein I fg As a foreground map, M fg Mask for foreground map, I bg As background picture, M bg Mask for background image, I gt Is a positive label graph, M gt Mask for positive tag map, ">Representing negative label composite map, ">Representing negative label composite map mask, ">Representing a positive label composite map, ">Representing a positive label composite map mask, ">Representing the scaling of the foreground object corresponding to the fifth stitching unit,/for>Representing the corresponding foreground of the fifth splicing unitX-axis coordinates of the position on the background map, < >>And representing the y-axis coordinate of the foreground position corresponding to the fifth splicing unit on the background map.
In this embodiment, the countermeasure generation network model includes a generator, a discriminator and a priori knowledge extractor. In the training phase, the model training architecture is as shown in FIG. 2; the inputs of the countermeasure generation network model are the splice of the foreground map and its foreground map mask [I_fg, M_fg] (the first splicing unit), the splice of the background map and its background map mask [I_bg, M_bg] (the second splicing unit), and the splice of the positive label map and its mask [I_gt, M_gt] (the third splicing unit), wherein I_fg, I_bg and I_gt have size H × W × 3, M_fg, M_bg and M_gt have size H × W × 1, and the spliced [I_fg, M_fg], [I_bg, M_bg] and [I_gt, M_gt] all have size H × W × 4. In the training phase, the generator consists of an unsupervised path and a self-supervised path (the two paths are independent of each other); during training, the label result map (namely the second data set) is used to compute the loss function for the result generated on the self-supervised path, while no label is used on the unsupervised path. The generator outputs two groups of lists each consisting of 3 transformation parameters; the input foreground map and its foreground map mask are affine-transformed according to the transformation parameters predicted by the model and then composited into the background map, thereby generating two final synthesis units [I_c^u, M_fg^u] and [I_c^s, M_fg^s], wherein I_c^u and I_c^s have size H × W × 3 and M_fg^u and M_fg^s have size H × W × 1. Finally, the synthesis units generated by the two paths, each of size H × W × 4, are input into the discriminator, and the generator and the discriminator are trained according to the countermeasure generation training mode.
During the test phase, the model architecture used is the unsupervised path portion of the generator in FIG. 2; the inputs are [I_fg, M_fg] and [I_bg, M_bg], the output is one list consisting of 3 transformation parameters, and the input foreground map and its mask are affine-transformed according to the transformation parameters predicted by the model and composited into the background map to generate the final synthesis unit [I_c^u, M_fg^u]. Since the modules used in the test phase are part of the training-phase model, the following mainly describes the training process of the countermeasure generation network model.
S2, building and training a countermeasure generation network model, wherein the countermeasure generation network model comprises a generator, a discriminator and a priori knowledge extractor, the generator comprises a preliminary feature extractor, a multi-scale feature aggregation module, a joint attention module, a Concat function and a regression block, the multi-scale feature aggregation module comprises two parallel first feature extraction units, the first feature extraction units comprise a multi-scale encoder and a feature aggregator which are sequentially connected, the priori knowledge extractor comprises a global feature extractor and an automatic encoder which are sequentially connected, and the training process is as follows:
S21, inputting the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] and the third splicing unit [I_gt, M_gt] of each sample in the first data set into the preliminary feature extractor respectively, correspondingly obtaining a first basic feature map F_fg, a second basic feature map F_bg and a third basic feature map F_gt.
In one embodiment, the preliminary feature extractor employs a VGG16 network model.
In one embodiment, the preliminary feature extractor performs the following operations:
F_1 = MaxPool(Conv64(Conv64(Input1))), with size H/2 × W/2 × 64;
F_2 = MaxPool(Conv128(Conv128(F_1))), with size H/4 × W/4 × 128;
F_3 = MaxPool(Conv256(Conv256(Conv256(F_2)))), with size H/8 × W/8 × 256;
F_fg = MaxPool(Conv512(Conv512(Conv512(F_3)))), with size H/16 × W/16 × 512;
wherein MaxPool denotes the max pooling operation, ConvX denotes a convolution operation with X output channels, and Input1 denotes the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] or the third splicing unit [I_gt, M_gt]; each splicing unit has height H, width W and 4 channels.
Specifically, the preliminary feature extractor may follow a VGG architecture, such as the VGG16 network model. Taking [I_fg, M_fg] as an example, the input [I_fg, M_fg] has size H × W × 4 and the corresponding first basic feature map F_fg has size H_1 × W_1 × C, with H/16 = H_1 and W/16 = W_1; the second basic feature map F_bg and the third basic feature map F_gt likewise have size H_1 × W_1 × C. The convolution layers of the preliminary feature extractor all use 3×3 convolution kernels, and X in ConvX denotes the number of output channels. Each convolution layer has sliding stride 1 and padding 1; MaxPool denotes the max pooling operation, and the VGG16 network model uses 2×2 max pooling. Padding refers to filling n rings of values around the matrix; padding = 1 means, for example, that a 5×5 matrix becomes 7×7 after one ring of padding.
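For illustration, the following is a minimal PyTorch sketch of such a VGG16-style preliminary feature extractor operating on a 4-channel spliced input; the class name is illustrative, and the ReLU activations are included as in the standard VGG16 even though the operations listed above do not state them explicitly.

```python
import torch
import torch.nn as nn

class PreliminaryFeatureExtractor(nn.Module):
    """VGG16-style backbone for a 4-channel [image, mask] splice (sketch)."""

    def __init__(self):
        super().__init__()

        def block(c_in, c_out, n_convs):
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out,
                                     kernel_size=3, stride=1, padding=1),
                           nn.ReLU(inplace=True)]
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            return nn.Sequential(*layers)

        self.stage1 = block(4, 64, 2)     # -> H/2  x W/2  x 64
        self.stage2 = block(64, 128, 2)   # -> H/4  x W/4  x 128
        self.stage3 = block(128, 256, 3)  # -> H/8  x W/8  x 256
        self.stage4 = block(256, 512, 3)  # -> H/16 x W/16 x 512

    def forward(self, x):                 # x: N x 4 x H x W
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        return self.stage4(f3)            # basic feature map

# Example: a 256x256 spliced unit [I_fg, M_fg] yields a 16x16x512 basic feature map.
extractor = PreliminaryFeatureExtractor()
f_fg = extractor(torch.randn(1, 4, 256, 256))
print(f_fg.shape)  # torch.Size([1, 512, 16, 16])
```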
S22, inputting the first basic feature map F_fg and the second basic feature map F_bg one-to-one into the first feature extraction units of the multi-scale feature aggregation module, correspondingly obtaining a first multi-scale feature map P_fg and a second multi-scale feature map P_bg, and inputting the first multi-scale feature map P_fg and the second multi-scale feature map P_bg into the joint attention module to obtain the global interaction feature Z.
In one embodiment, a multi-scale encoder performs the following operations:
P_1 = ReLU(BatchNorm(Conv512(Input2))), with size H_1 × W_1 × 512;
P_2 = ReLU(BatchNorm(Conv512(P_1))), with size H_1 × W_1 × 512;
P_3 = ReLU(BatchNorm(ConvC(P_2))), with size H_1 × W_1 × C;
P_S1 = AdaptiveAvgPool{S_1×S_1}(P_3), with size S_1 × S_1 × C;
P_S2 = AdaptiveAvgPool{S_2×S_2}(P_3), with size S_2 × S_2 × C;
P_S3 = AdaptiveAvgPool{S_3×S_3}(P_3), with size S_3 × S_3 × C;
wherein ReLU denotes the ReLU activation function, BatchNorm denotes the normalization operation, ConvX denotes a convolution operation with X output channels, AdaptiveAvgPool{n×n} adaptively pools the height × width of the corresponding input feature map to n × n, S_1, S_2 and S_3 are a first, a second and a third preset size in turn, and Input2 denotes the first basic feature map F_fg or the second basic feature map F_bg; each basic feature map has height H_1, width W_1 and C channels, with H/16 = H_1 and W/16 = W_1, where H and W are the height and width of each splicing unit;
the feature aggregator performs the following operations:
P_S1 = Reshape(P_S1), with size 1 × (S_1·S_1) × C;
P_S2 = Reshape(P_S2), with size 1 × (S_2·S_2) × C;
P_S3 = Reshape(P_S3), with size 1 × (S_3·S_3) × C;
P_g = Concat(Concat(P_S1, P_S2), P_S3), with size 1 × (S_1·S_1 + S_2·S_2 + S_3·S_3) × C;
wherein Reshape denotes the Reshape function (applied here to each pooled map P_S1, P_S2, P_S3 rather than to P_3, so that the stated sizes hold), Concat denotes the Concat function, and P_g denotes the output feature of the first feature extraction unit corresponding to Input2, namely the first multi-scale feature map P_fg or the second multi-scale feature map P_bg.
For the obtained first basic feature map F_fg and second basic feature map F_bg, both of size H_1 × W_1 × C, each is input into the multi-scale feature aggregation module to obtain the first multi-scale feature map P_fg and the second multi-scale feature map P_bg, each aggregating information at multiple scales. Taking the first basic feature map F_fg as an example, the multi-scale encoder presets several sizes S_1, S_2 and S_3; it first further extracts features from F_fg to obtain P_3, and then uses AdaptiveAvgPool to pool P_3 to the preset scales. In order to aggregate the information in the multi-scale feature maps obtained in the previous step, so that global feature interaction between background and foreground objects can subsequently be performed at multiple scales, a feature aggregator is designed: the feature map at each scale is further reshaped, and the maps are concatenated and aggregated along the second dimension, yielding the first multi-scale feature map P_fg; the second multi-scale feature map P_bg is obtained in the same way.
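A hedged PyTorch sketch of this multi-scale feature aggregation is shown below; the 3×3 kernel size of the encoder convolutions and the preset scales S_1 = 1, S_2 = 3, S_3 = 5 are assumptions chosen only to make the example concrete.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureAggregation(nn.Module):
    """Multi-scale encoder + feature aggregator (sketch).

    Pools the encoded feature map to several preset scales and concatenates
    the flattened results along the token dimension.
    """

    def __init__(self, channels=512, scales=(1, 3, 5)):   # scale values are assumptions
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in scales])

    def forward(self, f):                      # f: N x 512 x H1 x W1
        p3 = self.encoder(f)                   # N x C x H1 x W1
        tokens = []
        for pool in self.pools:
            p_s = pool(p3)                     # N x C x Si x Si
            # reshape to N x (Si*Si) x C so the scales can be concatenated
            tokens.append(p_s.flatten(2).transpose(1, 2))
        return torch.cat(tokens, dim=1)        # N x (S1^2 + S2^2 + S3^2) x C

agg = MultiScaleFeatureAggregation()
p_fg = agg(torch.randn(1, 512, 16, 16))
print(p_fg.shape)  # torch.Size([1, 35, 512]) for scales (1, 3, 5)
```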
In one embodiment, the joint attention module performs the following operations:
Q_fg = ConvC/8(P_fg), with size H × W × C/8;
K_fg = ConvC/4(P_fg), with size H × W × C/4;
V_fg = ConvC(P_fg), with size H × W × C;
Q_bg = ConvC/8(P_bg), with size H × W × C/8;
K_bg = ConvC/4(P_bg), with size H × W × C/4;
V_bg = ConvC(P_bg), with size H × W × C;
the first and second dimensions of Q_fg, K_fg, V_fg, Q_bg, K_bg and V_bg are respectively merged through the Reshape function, correspondingly obtaining Q'_fg, K'_fg, V'_fg, Q'_bg, K'_bg and V'_bg in turn, with the following sizes:
Q'_fg, with size HW × C/8;
K'_fg, with size HW × C/4;
V'_fg, with size HW × C;
Q'_bg, with size HW × C/8;
K'_bg, with size HW × C/4;
V'_bg, with size HW × C;
Q'_fg and Q'_bg are concatenated along the third dimension as follows:
Q_cat = Concat(Q'_fg, Q'_bg), with size HW × C/4;
Q_cat is used for attention computation to obtain X_fg and X_bg, expressed as follows:
X_fg = Softmax(Q_cat · K'_fg^T) · V'_fg + P_fg, with size HW × C;
X_bg = Softmax(Q_cat · K'_bg^T) · V'_bg + P_bg, with size HW × C;
Z = AdaptiveAvgPool{1×1}(Conv512(Concat(X_fg, X_bg))), with size 1 × 1 × C;
where ConvX denotes a convolution operation with X output channels (so that V_fg and V_bg keep C channels, matching the stated sizes).
For the obtained first multi-scale feature map P_fg and second multi-scale feature map P_bg, the joint attention module extracts global feature interaction information between the foreground map and the background map at multiple scales, which greatly facilitates predicting a reasonable position of the foreground object in the background map. As shown in FIG. 3, the parameters of each convolution layer differ after training of the joint attention module, so their outputs differ; all convolution layers of the joint attention module use 1×1 convolution kernels, with sliding stride 1 and padding 0. Q_fg, K_fg, V_fg, Q_bg, K_bg and V_bg are adjusted in dimension by the Reshape function, merging the first and second dimensions, to obtain Q'_fg, K'_fg, V'_fg, Q'_bg, K'_bg and V'_bg in turn; Q'_fg and Q'_bg are then concatenated at the channel level, i.e. the third dimension, and Q_cat performs attention computation with K'_fg, V'_fg of the foreground map and K'_bg, V'_bg of the background map respectively, obtaining X_fg and X_bg of size HW × C. Finally, the feature maps obtained by joint attention from the two sides, i.e. X_fg and X_bg, are concatenated, and convolution and pooling operations are applied to obtain the global interaction feature Z of size 1 × 1 × C.
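The following is a minimal sketch of such a joint attention computation in PyTorch. Because the aggregated features are token sequences of shape 1 × L × C rather than H × W maps, the 1×1 convolutions are written here as Conv1d layers; this representation choice, like the class and variable names, is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    """Joint attention over foreground and background multi-scale features (sketch).

    Queries from both streams are concatenated along the channel axis and attend
    to each stream's keys/values; the two attended maps are fused into a single
    global interaction vector.
    """

    def __init__(self, c=512):
        super().__init__()
        self.q_fg, self.k_fg, self.v_fg = nn.Conv1d(c, c // 8, 1), nn.Conv1d(c, c // 4, 1), nn.Conv1d(c, c, 1)
        self.q_bg, self.k_bg, self.v_bg = nn.Conv1d(c, c // 8, 1), nn.Conv1d(c, c // 4, 1), nn.Conv1d(c, c, 1)
        self.fuse = nn.Sequential(nn.Conv1d(2 * c, c, 1), nn.AdaptiveAvgPool1d(1))

    def _attend(self, q_cat, k, v, residual):
        attn = torch.softmax(q_cat @ k.transpose(1, 2), dim=-1)   # N x L x L
        return attn @ v + residual                                # N x L x C

    def forward(self, p_fg, p_bg):             # p_fg, p_bg: N x L x C token sequences
        fg, bg = p_fg.transpose(1, 2), p_bg.transpose(1, 2)       # N x C x L for the convs
        q_cat = torch.cat([self.q_fg(fg), self.q_bg(bg)], dim=1).transpose(1, 2)  # N x L x C/4
        x_fg = self._attend(q_cat, self.k_fg(fg).transpose(1, 2),
                            self.v_fg(fg).transpose(1, 2), p_fg)
        x_bg = self._attend(q_cat, self.k_bg(bg).transpose(1, 2),
                            self.v_bg(bg).transpose(1, 2), p_bg)
        fused = self.fuse(torch.cat([x_fg, x_bg], dim=-1).transpose(1, 2))  # N x C x 1
        return fused.squeeze(-1)               # global interaction feature Z, N x C

att = JointAttention()
z = att(torch.randn(1, 35, 512), torch.randn(1, 35, 512))
print(z.shape)  # torch.Size([1, 512])
```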
S23, inputting the third basic feature map F_gt into the global feature extractor of the priori knowledge extractor to obtain a first extracted feature, and encoding the first extracted feature into a prior vector Z_p through the automatic encoder.
In one embodiment, the global feature extractor performs the following operations:
Z_1 = ReLU(BatchNorm(Conv512(F_gt))), with size H_1 × W_1 × 512;
Z_2 = ReLU(BatchNorm(Conv512(Z_1))), with size H_1 × W_1 × 512;
Z_3 = ReLU(BatchNorm(ConvC(Z_2))), with size H_1 × W_1 × C;
Z_4 = AdaptiveAvgPool{1×1}(Z_3), with size 1 × 1 × C;
the automatic encoder performs the following operations:
h = ReLU(FC1024(Z_4)), with size 1 × 1 × 1024;
mu = FC512(h), with size 1 × 1 × 512;
logvar = FC512(h), with size 1 × 1 × 512;
Z_p = mu + ε · e^(logvar/2), with size 1 × 1 × 512, where ε is a standard normal random sample (the usual VAE reparameterization);
wherein FCY denotes a fully connected layer that maps the channel number of the corresponding input to Y.
By extracting reasonable foreground-object position information from the positive label map during training and encoding it into the prior vector Z_p, which is incorporated into the regression process of the generator, prior knowledge guidance is provided for training the generator. For the obtained third basic feature map F_gt, global feature extraction is first performed to obtain Z_4, and then a VAE-based automatic encoder encodes Z_4 into the prior vector Z_p, where FC denotes a fully connected layer; for example, FC1024(Z_4) means mapping the number of channels of Z_4 from the original C to 1024, and the FC layers used in different places correspond to different fully connected layers.
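A minimal PyTorch sketch of the prior knowledge extractor is given below, assuming 3×3 convolutions in the global feature extractor and the usual VAE reparameterization for sampling the prior vector; these details are assumptions where the text above does not fix them.

```python
import torch
import torch.nn as nn

class PriorKnowledgeExtractor(nn.Module):
    """Global feature extractor + VAE-style encoder for the prior vector (sketch)."""

    def __init__(self, c=512):
        super().__init__()
        self.global_extractor = nn.Sequential(
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
            nn.Conv2d(512, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc_h = nn.Linear(c, 1024)
        self.fc_mu = nn.Linear(1024, 512)
        self.fc_logvar = nn.Linear(1024, 512)

    def forward(self, f_gt):                             # f_gt: N x 512 x H1 x W1
        z4 = self.global_extractor(f_gt).flatten(1)      # N x C
        h = torch.relu(self.fc_h(z4))                    # N x 1024
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # reparameterisation: Z_p = mu + eps * exp(logvar / 2)
        z_p = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z_p, mu, logvar                           # prior vector and its statistics

pke = PriorKnowledgeExtractor()
z_p, mu, logvar = pke(torch.randn(1, 512, 16, 16))
print(z_p.shape)  # torch.Size([1, 512])
```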
S24, fusing a random variable Z_u and the prior vector Z_p respectively with the global interaction feature Z through the Concat function, correspondingly obtaining a first splicing vector Z_i, which forms an unsupervised path, and a second splicing vector Z_j, which forms a self-supervised path.
S25, inputting the first splicing vector Z_i and the second splicing vector Z_j respectively into the regression block, correspondingly predicting a first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and a second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s), wherein t_r^u denotes the scaling rate of the foreground object under the unsupervised path, t_x^u denotes the x-axis coordinate of the foreground position on the background map under the unsupervised path, t_y^u denotes the y-axis coordinate of the foreground position on the background map under the unsupervised path, t_r^s denotes the scaling rate of the foreground object under the self-supervised path, t_x^s denotes the x-axis coordinate of the foreground position on the background map under the self-supervised path, and t_y^s denotes the y-axis coordinate of the foreground position on the background map under the self-supervised path.
In one embodiment, the regression block performs the following:
t_u = FC3(FC1024(ReLU(FC1024(Z_i))));
t_s = FC3(FC1024(ReLU(FC1024(Z_j))));
wherein FCY is a fully connected layer that maps the number of channels of the corresponding input image to Y, and ReLU represents a ReLU activation function.
Specifically, for the unsupervised path in the generator, a random variable Z_u (of size 1 × 1 × 512) is introduced into the global interaction feature Z before regression, so that the synthesis results of the model are diversified; for the self-supervised path in the generator, the prior vector Z_p (of size 1 × 1 × 512) is introduced into the global interaction feature Z before regression, to guide the generator to make reasonable foreground object placement predictions, where:
the first splicing vector Z_i is expressed as:
Z_i = Concat(Z, Z_u), with size 1 × 1 × (C + 512);
the second splicing vector Z_j is expressed as:
Z_j = Concat(Z, Z_p), with size 1 × 1 × (C + 512);
the first splicing vector Z_i is then input into the regression block to predict the first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u), and the second splicing vector Z_j is input into the regression block to predict the second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s).
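A small sketch of the regression block and the two splicing vectors follows; the final sigmoid that keeps (t_r, t_x, t_y) in [0, 1] is an assumption consistent with the parameter ranges described below, not something the text above states.

```python
import torch
import torch.nn as nn

class RegressionBlock(nn.Module):
    """Predicts (t_r, t_x, t_y) from a spliced [Z, Z_u] or [Z, Z_p] vector (sketch)."""

    def __init__(self, c=512, cond_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c + cond_dim, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, 1024),
            nn.Linear(1024, 3),               # t = (t_r, t_x, t_y)
        )

    def forward(self, z, cond):               # z: N x C, cond: N x cond_dim (Z_u or Z_p)
        t = self.mlp(torch.cat([z, cond], dim=1))
        return torch.sigmoid(t)               # keep the parameters in [0, 1] (assumption)

reg = RegressionBlock()
z = torch.randn(1, 512)
t_u = reg(z, torch.randn(1, 512))             # unsupervised path: random variable Z_u
t_s = reg(z, torch.randn(1, 512))             # self-supervised path: prior vector Z_p
print(t_u.shape, t_s.shape)                   # torch.Size([1, 3]) each
```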
S26, performing affine transformation on the input foreground map and foreground map mask according to the first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and the second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s) respectively, correspondingly obtaining a first affine transformation result and a second affine transformation result, wherein the first affine transformation result comprises the affine-transformed foreground map I_fg^u and foreground map mask M_fg^u under the unsupervised path, and the second affine transformation result comprises the affine-transformed foreground map I_fg^s and foreground map mask M_fg^s under the self-supervised path.
In an embodiment, the affine transformation is based on a Spatial Transformer Network (STN): affine transformation is performed on the input foreground map and its foreground map mask according to the transformation parameters predicted by the generator.
Taking the affine transformation of an input image according to t_s as an example, with t_r^s ∈ [0, 1], define the scaled height h = t_r^s · H and width w = t_r^s · W; assuming the top-left vertex coordinates of the foreground object placed on the background map are (x, y), define t_x^s = x / (W − w) and t_y^s = y / (H − h). The foreground map I_fg and its foreground map mask M_fg are then affine-transformed; taking the affine transformation of the foreground map as an example, the coordinates follow
x'_fg = t_r^s · x_fg + t_x^s · (W − w), y'_fg = t_r^s · y_fg + t_y^s · (H − h),
wherein (x_fg, y_fg) denotes a coordinate point on the foreground map and (x'_fg, y'_fg) is the corresponding coordinate point after affine transformation. After the foreground maps and foreground map masks of the unsupervised path and the self-supervised path are affine-transformed, the first affine transformation result and the second affine transformation result are correspondingly obtained, wherein the first affine transformation result is the affine-transformed foreground map I_fg^u and foreground map mask M_fg^u under the unsupervised path, and the second affine transformation result is the affine-transformed foreground map I_fg^s and foreground map mask M_fg^s under the self-supervised path.
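The sketch below shows one way to realize the STN-style placement with torch.nn.functional.affine_grid and grid_sample, together with the mask-based composition used in S27; the exact grid convention (inverse warp, object centred at the normalized position) is an assumption of this sketch rather than the patent's stated convention.

```python
import torch
import torch.nn.functional as F

def place_foreground(i_fg, m_fg, i_bg, t):
    """Scale/translate the foreground with an STN-style affine warp and paste it
    onto the background (sketch)."""
    n = i_fg.shape[0]
    t_r, t_x, t_y = t[:, 0], t[:, 1], t[:, 2]          # scale and normalised position

    # affine_grid builds the sampling grid of the output image, so 1/t_r shrinks
    # the foreground; t_x, t_y in [0, 1] move it across the canvas.
    theta = torch.zeros(n, 2, 3, device=i_fg.device)
    theta[:, 0, 0] = 1.0 / t_r
    theta[:, 1, 1] = 1.0 / t_r
    theta[:, 0, 2] = -(2.0 * t_x - 1.0) / t_r          # map [0,1] position to [-1,1]
    theta[:, 1, 2] = -(2.0 * t_y - 1.0) / t_r

    grid = F.affine_grid(theta, i_fg.shape, align_corners=False)
    i_fg_t = F.grid_sample(i_fg, grid, align_corners=False)
    m_fg_t = F.grid_sample(m_fg, grid, align_corners=False)

    # mask-based composition of the transformed foreground onto the background
    i_c = i_fg_t * m_fg_t + i_bg * (1.0 - m_fg_t)
    return i_c, m_fg_t

i_fg, m_fg = torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256)
i_bg = torch.rand(1, 3, 256, 256)
t = torch.tensor([[0.5, 0.3, 0.7]])                    # (t_r, t_x, t_y)
i_c, m_c = place_foreground(i_fg, m_fg, i_bg, t)
print(i_c.shape, m_c.shape)
```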
S27, synthesizing the first affine transformation result and the second affine transformation result with the background map respectively, correspondingly obtaining a first composite map I_c^u and a second composite map I_c^s by pasting the transformed foreground onto the background with the transformed mask, namely I_c^u = I_fg^u ⊙ M_fg^u + I_bg ⊙ (1 − M_fg^u) and I_c^s = I_fg^s ⊙ M_fg^s + I_bg ⊙ (1 − M_fg^s), where ⊙ denotes element-wise multiplication.
S28, inputting the first synthesis unit [I_c^u, M_fg^u] and the second synthesis unit [I_c^s, M_fg^s] into the discriminator, and also training the discriminator through the second data set.
Based on the countermeasure generation network training mode, all composite maps generated by the generator are input into the discriminator for discrimination training. The input of the discriminator is [I_c^u, M_fg^u] or [I_c^s, M_fg^s], and the output is a value R ∈ {0, 1}; 1 indicates that the discriminator considers the foreground object position of the composite map reasonable, and 0 indicates that it is unreasonable. In addition to being trained with the composite maps generated by the generator, the discriminator is also trained with the second data set, taking [I_c^-, M_c^-] or [I_c^+, M_c^+] as input, with the output again a value R ∈ {0, 1} interpreted in the same way.
In one embodiment, the discriminator performs the following operations:
R = Sigmoid(Conv1(LeakyReLU(Conv512(LeakyReLU(Conv256(LeakyReLU(Conv128(LeakyReLU(Conv64(Input))))))))))
wherein Sigmoid denotes the Sigmoid activation function, ConvX denotes a convolution operation with X output channels, LeakyReLU denotes the LeakyReLU activation function, R denotes the output feature of the discriminator, and Input denotes the input feature of the discriminator.
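For illustration, a compact PyTorch sketch of such a discriminator over the 4-channel synthesis unit is shown below; the text above specifies only the channel widths and activations, so the kernel sizes, strides and the final pooling to a single score are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Discriminator over the 4-channel synthesis unit [composite map, mask] (sketch)."""

    def __init__(self):
        super().__init__()

        def down(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                                 nn.LeakyReLU(0.2, inplace=True))

        self.features = nn.Sequential(down(4, 64), down(64, 128),
                                      down(128, 256), down(256, 512))
        self.head = nn.Sequential(nn.Conv2d(512, 1, 3, padding=1),
                                  nn.AdaptiveAvgPool2d(1),
                                  nn.Sigmoid())

    def forward(self, synthesis_unit):            # N x 4 x H x W
        return self.head(self.features(synthesis_unit)).flatten(1)  # N x 1 score in [0, 1]

d = Discriminator()
r = d(torch.rand(1, 4, 256, 256))
print(r.shape, float(r))
```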
S29, calculating joint loss to update network parameters, and obtaining a trained countermeasure generation network model.
In one embodiment, the joint loss combines the adversarial generation loss on the unsupervised path, the adversarial generation loss on the self-supervised path, the KL divergence loss, the reconstruction loss and the cross-entropy loss of the discriminator,
wherein θ_G denotes the learnable parameters of the generator G, θ_D denotes the learnable parameters of the discriminator D, the adversarial generation loss functions on the unsupervised and self-supervised paths are denoted L_adv^u(G, D) and L_adv^s(G, D) respectively, L_kld(G) is the KL divergence loss function, L_rec(G) is the reconstruction loss function, L_bce(D) is the cross-entropy loss function, mu denotes the mean of the distribution of the prior vector Z_p, e^logvar denotes the variance of the distribution of the prior vector Z_p, D_KL denotes computing the KL divergence, and N(a_1, b_1) denotes a distribution with mean a_1 and variance b_1; if a_1 = 0 and b_1 = 1, it is the standard normal distribution.
When training, the countermeasure generation network model updates the parameters of each network module by calculating the joint loss. The adversarial generation loss function L_adv^u(G, D) on the unsupervised path trains the generator so that the composite map it generates on the unsupervised path is considered reasonable by the discriminator. The adversarial generation loss function L_adv^s(G, D) on the self-supervised path trains the generator so that the composite map it generates on the self-supervised path is considered reasonable by the discriminator. The KL divergence loss function L_kld(G) makes the distribution of the prior vector learned from the positive label composite map tend towards a Gaussian distribution. The reconstruction loss function L_rec(G) makes the transformation parameters predicted on the self-supervised path tend towards the transformation parameters of the positive label. The cross-entropy loss L_bce(D) trains the discriminator with the positive label composite map and the negative label composite map, improving the discrimination capability of the discriminator.
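Since the exact loss expressions and weighting are not reproduced here, the following is only a hedged sketch of how these five terms could be evaluated in one training step, assuming binary-cross-entropy adversarial terms, an MSE reconstruction term and unit loss weights.

```python
import torch
import torch.nn.functional as F

def generator_loss(d_fake_u, d_fake_s, mu, logvar, t_pred_s, t_gt):
    """Sketch of the generator-side joint loss terms (weights assumed to be 1)."""
    l_adv_u = F.binary_cross_entropy(d_fake_u, torch.ones_like(d_fake_u))  # fool D, unsupervised path
    l_adv_s = F.binary_cross_entropy(d_fake_s, torch.ones_like(d_fake_s))  # fool D, self-supervised path
    # KL divergence pushing the prior-vector distribution towards N(0, 1)
    l_kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # reconstruction: parameters predicted on the self-supervised path match the labels
    l_rec = F.mse_loss(t_pred_s, t_gt)
    return l_adv_u + l_adv_s + l_kld + l_rec

def discriminator_loss(d_real_pos, d_fake_neg):
    """Cross-entropy on positive-label vs. negative-label / generated composites."""
    l_real = F.binary_cross_entropy(d_real_pos, torch.ones_like(d_real_pos))
    l_fake = F.binary_cross_entropy(d_fake_neg, torch.zeros_like(d_fake_neg))
    return l_real + l_fake

# toy shapes only, to show the call pattern
score = lambda: torch.rand(2, 1)
print(generator_loss(score(), score(), torch.randn(2, 512), torch.randn(2, 512),
                     torch.rand(2, 3), torch.rand(2, 3)))
print(discriminator_loss(score(), score()))
```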
S3, inputting the first splicing unit [I_fg, M_fg] and the second splicing unit [I_bg, M_bg] to be synthesized into the trained countermeasure generation network model, and taking the first synthesis unit [I_c^u, M_fg^u] correspondingly output on the unsupervised path as the image synthesis prediction result.
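As a usage illustration of the inference step S3, the sketch below wires stand-in callables together; generator_unsup and place_fn are hypothetical names for the trained unsupervised-path generator and the affine-placement routine (e.g. the place_foreground sketch above).

```python
import torch

@torch.no_grad()
def synthesize(generator_unsup, i_fg, m_fg, i_bg, m_bg, place_fn):
    """Inference on the unsupervised path only: splice the inputs, predict one
    (t_r, t_x, t_y) triple, warp the foreground and paste it onto the background."""
    fg_unit = torch.cat([i_fg, m_fg], dim=1)          # N x 4 x H x W
    bg_unit = torch.cat([i_bg, m_bg], dim=1)          # N x 4 x H x W
    t_u = generator_unsup(fg_unit, bg_unit)           # N x 3 transformation parameters
    i_c, m_c = place_fn(i_fg, m_fg, i_bg, t_u)        # first synthesis unit
    return torch.cat([i_c, m_c], dim=1)               # N x 4 x H x W prediction

# usage with stand-in callables, just to show the shapes involved
fake_gen = lambda fg, bg: torch.tensor([[0.5, 0.4, 0.6]])
fake_place = lambda i_fg, m_fg, i_bg, t: (i_bg, m_fg)
out = synthesize(fake_gen, torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256),
                 torch.rand(1, 3, 256, 256), torch.rand(1, 1, 256, 256), fake_place)
print(out.shape)  # torch.Size([1, 4, 256, 256])
```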
The final test effect of the technical scheme on the OPA data set is shown in table 1:
TABLE 1
| Technical scheme | FID | LPIPS |
| --- | --- | --- |
| TERSE | 46.88 | 0 |
| PlaceNet | 37.01 | 0.161 |
| GracoNet | 28.10 | 0.207 |
| The present application | 23.21 | 0.270 |
The FID index measures the difference between the composite maps generated by the present technical scheme on the test set and the positive label composite maps. The LPIPS index is computed by generating 10 composite maps for the same input with the present technical scheme and measuring the degree of difference among these 10 composite maps, i.e. it measures the diversity of the generated composite maps. FIG. 4 compares the composite maps of the technical scheme of the present application with those of other technical schemes; it can be seen that the synthesis effect of the present application is superior to the other prior-art schemes, including TERSE (reference: S. Tripathi, S. Chandra, A. Agrawal, A. Tyagi, J. M. Rehg, and V. Chari, "Learning to generate synthetic data via compositing," in CVPR, pp. 461-470, 2019), PlaceNet (reference: L. Zhang, T. Wen, J. Min, J. Wang, D. Han, and J. Shi, "Learning object placement by inpainting for compositional data augmentation," in ECCV, pp. 566-581, 2020) and GracoNet (reference: S. Zhou, L. Liu, L. Niu, and L. Zhang, "Learning object placement via dual-path graph completion," in ECCV, pp. 373-389, 2022).
The technical features of the above embodiments may be combined arbitrarily, and specific steps are not limited herein, and may be sequentially adjusted according to actual needs by those skilled in the art. And in order to simplify the description, all possible combinations of the technical features in the above embodiments are not described, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope described in the present specification.
The above-described embodiments are merely representative of the more specific and detailed embodiments described herein and are not to be construed as limiting the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.
Claims (10)
1. An image synthesis method based on a countermeasure generation network, characterized in that the image synthesis method based on the countermeasure generation network comprises the following steps:
S1, acquiring a first data set and a second data set, wherein each sample in the first data set comprises a first splicing unit [I_fg, M_fg], a second splicing unit [I_bg, M_bg] and a third splicing unit [I_gt, M_gt], each sample in the second data set comprises a fourth splicing unit [I_c^-, M_c^-], a fifth splicing unit [I_c^+, M_c^+] and the transformation parameters t_gt = (t_r^gt, t_x^gt, t_y^gt) corresponding to the fifth splicing unit, and all splicing units have the same size, wherein I_fg is the foreground map, M_fg is the foreground map mask, I_bg is the background map, M_bg is the background map mask, I_gt is the positive label map, M_gt is the positive label map mask, I_c^- denotes the negative label composite map, M_c^- denotes the negative label composite map mask, I_c^+ denotes the positive label composite map, M_c^+ denotes the positive label composite map mask, t_r^gt denotes the scaling rate of the foreground object corresponding to the fifth splicing unit, t_x^gt denotes the x-axis coordinate of the corresponding foreground position on the background map, and t_y^gt denotes the y-axis coordinate of the corresponding foreground position on the background map;
S2, building and training a countermeasure generation network model, wherein the countermeasure generation network model comprises a generator, a discriminator and a priori knowledge extractor, the generator comprises a preliminary feature extractor, a multi-scale feature aggregation module, a joint attention module, a Concat function and a regression block, the multi-scale feature aggregation module comprises two parallel first feature extraction units, each first feature extraction unit comprises a multi-scale encoder and a feature aggregator which are connected in sequence, the priori knowledge extractor comprises a global feature extractor and an automatic encoder which are connected in sequence, and the training process is as follows:
S21, inputting the first splicing unit [I_fg, M_fg], the second splicing unit [I_bg, M_bg] and the third splicing unit [I_gt, M_gt] of each sample in the first data set into the preliminary feature extractor respectively, correspondingly obtaining a first basic feature map F_fg, a second basic feature map F_bg and a third basic feature map F_gt;
S22, inputting the first basic feature map F_fg and the second basic feature map F_bg one-to-one into the first feature extraction units of the multi-scale feature aggregation module, correspondingly obtaining a first multi-scale feature map P_fg and a second multi-scale feature map P_bg, and inputting the first multi-scale feature map P_fg and the second multi-scale feature map P_bg into the joint attention module to obtain the global interaction feature Z;
S23, inputting the third basic feature map F_gt into the global feature extractor of the priori knowledge extractor to obtain a first extracted feature, and encoding the first extracted feature into a prior vector Z_p through the automatic encoder;
S24, fusing a random variable Z_u and the prior vector Z_p respectively with the global interaction feature Z through the Concat function, correspondingly obtaining a first splicing vector Z_i, which forms an unsupervised path, and a second splicing vector Z_j, which forms a self-supervised path;
S25, inputting the first splicing vector Z_i and the second splicing vector Z_j respectively into the regression block, correspondingly predicting a first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and a second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s), wherein t_r^u denotes the scaling rate of the foreground object under the unsupervised path, t_x^u denotes the x-axis coordinate of the foreground position on the background map under the unsupervised path, t_y^u denotes the y-axis coordinate of the foreground position on the background map under the unsupervised path, t_r^s denotes the scaling rate of the foreground object under the self-supervised path, t_x^s denotes the x-axis coordinate of the foreground position on the background map under the self-supervised path, and t_y^s denotes the y-axis coordinate of the foreground position on the background map under the self-supervised path;
S26, performing affine transformation on the input foreground map and foreground map mask according to the first affine transformation parameter t_u = (t_r^u, t_x^u, t_y^u) and the second affine transformation parameter t_s = (t_r^s, t_x^s, t_y^s) respectively, correspondingly obtaining a first affine transformation result and a second affine transformation result, wherein the first affine transformation result comprises the affine-transformed foreground map I_fg^u and foreground map mask M_fg^u under the unsupervised path, and the second affine transformation result comprises the affine-transformed foreground map I_fg^s and foreground map mask M_fg^s under the self-supervised path;
S27, synthesizing the first affine transformation result and the second affine transformation result with the background map respectively, correspondingly obtaining a first composite map under the unsupervised path and a second composite map under the self-supervised path (see the sketch after this claim);
S28, inputting the first composite map and the second composite map into the discriminator, and training the discriminator with the second data set;
S29, calculating the joint loss to update the network parameters, obtaining the trained countermeasure generation network model;
S3, inputting the first splice unit [I_fg, M_fg] and the second splice unit [I_bg, M_bg] to be synthesized into the trained countermeasure generation network model, and outputting the first composite map produced under the unsupervised path as the image synthesis prediction result.
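The compositing operation in step S27 can be sketched as a standard mask-based blend. The claim gives the exact expression as a formula that is not reproduced here, so the blend below, together with the function and variable names (composite, fg_warped, mask_warped, bg), is an assumption for illustration only.

```python
import torch

def composite(fg_warped: torch.Tensor, mask_warped: torch.Tensor, bg: torch.Tensor) -> torch.Tensor:
    """Blend an affine-transformed foreground onto a background map.

    Assumes the usual composite I_c = M * I_fg + (1 - M) * I_bg, with image tensors of
    shape (N, 3, H, W) and a mask of shape (N, 1, H, W) whose values lie in [0, 1].
    """
    return mask_warped * fg_warped + (1.0 - mask_warped) * bg
```

Under this reading, the unsupervised-path composite uses the foreground warped by t^u and the self-supervised-path composite uses the foreground warped by t^s.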
2. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the preliminary feature extractor employs a VGG16 network model.
3. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the preliminary feature extractor performs the following operations:
F_1 = MaxPool(Conv64(Conv64(Input1))), size H/2 × W/2 × 64;
F_2 = MaxPool(Conv128(Conv128(F_1))), size H/4 × W/4 × 128;
F_3 = MaxPool(Conv256(Conv256(Conv256(F_2)))), size H/8 × W/8 × 256;
F_fg = MaxPool(Conv512(Conv512(Conv512(F_3)))), size H/16 × W/16 × 512;
wherein MaxPool denotes the max pooling operation, ConvX denotes a convolution operation with X output channels, and Input1 denotes the first splice unit [I_fg, M_fg], the second splice unit [I_bg, M_bg] or the third splice unit [I_gt, M_gt]; each splice unit has height H, width W and 4 channels.
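A minimal PyTorch sketch of the operations in claim 3, assuming 3×3 convolutions with ReLU activations as in VGG16; the kernel size, the activation between convolutions and the class name PreliminaryFeatureExtractor are assumptions, only the channel progression and pooling follow the claim.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, n_convs: int) -> nn.Sequential:
    """n_convs 3x3 convolutions (VGG-style) followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class PreliminaryFeatureExtractor(nn.Module):
    """4-channel splice unit [image, mask] -> H/16 x W/16 x 512 basic feature map."""
    def __init__(self) -> None:
        super().__init__()
        self.block1 = conv_block(4, 64, 2)     # F_1: H/2  x W/2  x 64
        self.block2 = conv_block(64, 128, 2)   # F_2: H/4  x W/4  x 128
        self.block3 = conv_block(128, 256, 3)  # F_3: H/8  x W/8  x 256
        self.block4 = conv_block(256, 512, 3)  # F_fg / F_bg / F_gt: H/16 x W/16 x 512

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block4(self.block3(self.block2(self.block1(x))))
```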
4. The image synthesis method based on a countermeasure generation network of claim 1, wherein:
the multi-scale encoder performs the following operations:
P_1 = ReLU(BatchNorm(Conv512(Input2))), size H_1 × W_1 × 512;
P_2 = ReLU(BatchNorm(Conv512(P_1))), size H_1 × W_1 × 512;
P_3 = ReLU(BatchNorm(ConvC(P_2))), size H_1 × W_1 × C;
P_S1 = AdaptiveAvgPool{S_1 × S_1}(P_3), size S_1 × S_1 × C;
P_S2 = AdaptiveAvgPool{S_2 × S_2}(P_3), size S_2 × S_2 × C;
P_S3 = AdaptiveAvgPool{S_3 × S_3}(P_3), size S_3 × S_3 × C;
wherein ReLU denotes the ReLU activation function, BatchNorm denotes the normalization operation, ConvX denotes a convolution operation with X output channels, AdaptiveAvgPool{n × n} adaptively pools the height × width of the corresponding input to n × n, S_1, S_2 and S_3 are in sequence the first, second and third preset sizes, and Input2 denotes the first basic feature map F_fg or the second basic feature map F_bg; each basic feature map has height H_1, width W_1 and C channels, with H/16 = H_1 and W/16 = W_1, where H and W are the height and width of each splice unit;
the feature aggregator performs the following operations:
P_S1 = Reshape(P_S1), size 1 × (S_1*S_1) × C;
P_S2 = Reshape(P_S2), size 1 × (S_2*S_2) × C;
P_S3 = Reshape(P_S3), size 1 × (S_3*S_3) × C;
P_g = Concat(Concat(P_S1, P_S2), P_S3), size 1 × (S_1*S_1 + S_2*S_2 + S_3*S_3) × C;
wherein Reshape denotes the Reshape function, Concat denotes the Concat function, and P_g denotes the output feature of the first feature extraction unit corresponding to Input2, namely the first multi-scale feature map P_fg or the second multi-scale feature map P_bg.
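A sketch of the multi-scale encoder and feature aggregator of claim 4, under stated assumptions: 3×3 convolutions and preset pool sizes (S_1, S_2, S_3) = (1, 3, 6), neither of which is fixed by the claim beyond the channel counts and the pooled spatial sizes.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureUnit(nn.Module):
    """Multi-scale encoder (Conv -> BN -> ReLU, then adaptive pooling) plus feature aggregator."""
    def __init__(self, in_ch: int = 512, c: int = 256, sizes=(1, 3, 6)):
        super().__init__()
        def cbr(i: int, o: int) -> nn.Sequential:
            return nn.Sequential(nn.Conv2d(i, o, kernel_size=3, padding=1),
                                 nn.BatchNorm2d(o), nn.ReLU(inplace=True))
        self.encoder = nn.Sequential(cbr(in_ch, 512), cbr(512, 512), cbr(512, c))
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in sizes])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p3 = self.encoder(x)                                # (N, C, H1, W1)
        tokens = []
        for pool in self.pools:
            p = pool(p3)                                    # (N, C, Si, Si)
            tokens.append(p.flatten(2).transpose(1, 2))     # (N, Si*Si, C)
        return torch.cat(tokens, dim=1)                     # P_fg or P_bg: (N, S1^2+S2^2+S3^2, C)
```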
5. The image synthesis method based on a countermeasure generation network of claim 4, wherein:
the global feature extractor performs the following operations:
Z_1 = ReLU(BatchNorm(Conv512(F_gt))), size H_1 × W_1 × 512;
Z_2 = ReLU(BatchNorm(Conv512(Z_1))), size H_1 × W_1 × 512;
Z_3 = ReLU(BatchNorm(ConvC(Z_2))), size H_1 × W_1 × C;
Z_4 = AdaptiveAvgPool{1 × 1}(Z_3), size 1 × 1 × C;
the automatic encoder performs the following operations:
h = ReLU(FC1024(Z_4)), size 1 × 1 × 1024;
mu = FC512(h), size 1 × 1 × 512;
logvar = FC512(h), size 1 × 1 × 512;
z_p = mu + e^(logvar/2), size 1 × 1 × 512;
wherein FCY denotes a fully connected layer that maps the number of channels of the corresponding input to Y.
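A sketch of the automatic encoder of claim 5. The claim writes z_p = mu + e^(logvar/2); a conventional VAE-style reparameterisation would additionally multiply e^(logvar/2) by Gaussian noise, so that variant is offered behind a flag as an assumption rather than as the claimed formula.

```python
import torch
import torch.nn as nn

class PriorEncoder(nn.Module):
    """Map the pooled global feature Z_4 (length C) to the prior vector z_p (length 512)."""
    def __init__(self, c: int = 256, latent: int = 512):
        super().__init__()
        self.fc_h = nn.Linear(c, 1024)            # FC1024
        self.fc_mu = nn.Linear(1024, latent)      # FC512
        self.fc_logvar = nn.Linear(1024, latent)  # FC512

    def forward(self, z4: torch.Tensor, sample_noise: bool = False):
        h = torch.relu(self.fc_h(z4))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        std = torch.exp(0.5 * logvar)                        # e^(logvar/2)
        eps = torch.randn_like(std) if sample_noise else 1.0
        z_p = mu + eps * std                                 # claimed formula when sample_noise=False
        return z_p, mu, logvar
```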
6. The image synthesis method based on a countermeasure generation network of claim 4, wherein: the joint attention module performs the following operations:
Q_fg = ConvC/8(P_fg), size H × W × C/8;
K_fg = ConvC/4(P_fg), size H × W × C/4;
V_fg = ConvC(P_fg), size H × W × C;
Q_bg = ConvC/8(P_bg), size H × W × C/8;
K_bg = ConvC/4(P_bg), size H × W × C/4;
V_bg = ConvC(P_bg), size H × W × C;
flattening the first and second dimensions of Q_fg, K_fg, V_fg, Q_bg, K_bg and V_bg respectively through the Reshape function, correspondingly obtaining Q'_fg, K'_fg, V'_fg, Q'_bg, K'_bg and V'_bg in sequence, with the following sizes:
Q'_fg, size HW × C/8;
K'_fg, size HW × C/4;
V'_fg, size HW × C;
Q'_bg, size HW × C/8;
K'_bg, size HW × C/4;
V'_bg, size HW × C;
splicing Q'_fg and Q'_bg along the third dimension as follows:
Q_cat = Concat(Q'_fg, Q'_bg), size HW × C/4;
performing attention calculation with Q_cat to obtain X_fg and X_bg, with the following expressions:
X_fg = Softmax(Q_cat * K'_fg^T) * V'_fg + P_fg, size HW × C;
X_bg = Softmax(Q_cat * K'_bg^T) * V'_bg + P_bg, size HW × C;
Z = AdaptiveAvgPool{1 × 1}(Conv512(Concat(X_fg, X_bg))), size 1 × 1 × C;
where ConvX represents a convolution operation with a channel number X.
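A sketch of the joint attention module of claim 6, operating on feature maps already flattened to (N, L, C) tokens. Linear layers stand in for the ConvC/8, ConvC/4 and ConvC projections, and a linear layer plus mean pooling stands in for the final Conv512 and AdaptiveAvgPool; those substitutions, and the output width, are assumptions beyond the channel counts stated in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Shared-query attention between flattened foreground and background features."""
    def __init__(self, c: int):
        super().__init__()
        self.q_fg, self.q_bg = nn.Linear(c, c // 8), nn.Linear(c, c // 8)
        self.k_fg, self.k_bg = nn.Linear(c, c // 4), nn.Linear(c, c // 4)
        self.v_fg, self.v_bg = nn.Linear(c, c), nn.Linear(c, c)
        self.fuse = nn.Linear(2 * c, 512)

    def forward(self, p_fg: torch.Tensor, p_bg: torch.Tensor) -> torch.Tensor:
        # Q_cat concatenates the two C/8 queries into a single C/4 query shared by both branches.
        q_cat = torch.cat([self.q_fg(p_fg), self.q_bg(p_bg)], dim=-1)                 # (N, L, C/4)
        x_fg = F.softmax(q_cat @ self.k_fg(p_fg).transpose(1, 2), dim=-1) @ self.v_fg(p_fg) + p_fg
        x_bg = F.softmax(q_cat @ self.k_bg(p_bg).transpose(1, 2), dim=-1) @ self.v_bg(p_bg) + p_bg
        z = self.fuse(torch.cat([x_fg, x_bg], dim=-1)).mean(dim=1)                    # global feature Z
        return z
```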
7. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the regression block performs the following operations:
t^u = FC3(FC1024(ReLU(FC1024(Z_i))));
t^s = FC3(FC1024(ReLU(FC1024(Z_j))));
wherein FCY denotes a fully connected layer that maps the number of channels of the corresponding input to Y, and ReLU denotes the ReLU activation function.
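A sketch of the regression block of claim 7; the width of the input splice vector (here 1024) is an assumption, only the FC1024 -> ReLU -> FC1024 -> FC3 layout follows the claim.

```python
import torch.nn as nn

# Predicts (t_r, t_x, t_y) from a splice vector Z_i or Z_j.
regression_block = nn.Sequential(
    nn.Linear(1024, 1024),   # FC1024
    nn.ReLU(inplace=True),
    nn.Linear(1024, 1024),   # FC1024
    nn.Linear(1024, 3),      # FC3 -> (t_r, t_x, t_y)
)
```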
8. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the affine transformation is implemented with a Spatial Transformer Network.
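A sketch of the affine transformation of claim 8 using PyTorch's spatial-transformer primitives (affine_grid / grid_sample). The mapping of (t_r, t_x, t_y) onto the 2x3 matrix, i.e. an isotropic scale plus a translation in normalised coordinates, is an assumption, as the claim does not give it.

```python
import torch
import torch.nn.functional as F

def stn_affine(fg: torch.Tensor, mask: torch.Tensor, t: torch.Tensor):
    """Warp a foreground map and its mask with parameters t = (t_r, t_x, t_y) per sample."""
    n = t.shape[0]
    theta = torch.zeros(n, 2, 3, device=t.device, dtype=t.dtype)
    theta[:, 0, 0] = t[:, 0]   # scale along x
    theta[:, 1, 1] = t[:, 0]   # scale along y
    theta[:, 0, 2] = t[:, 1]   # x offset in normalised coordinates
    theta[:, 1, 2] = t[:, 2]   # y offset in normalised coordinates
    grid = F.affine_grid(theta, list(fg.shape), align_corners=False)
    fg_warped = F.grid_sample(fg, grid, align_corners=False)
    mask_warped = F.grid_sample(mask, grid, align_corners=False)
    return fg_warped, mask_warped
```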
9. The image synthesis method based on a countermeasure generation network of claim 1, wherein: the discriminator performs the following operations:
R = Sigmoid(Conv1(
    LeakyReLU(Conv512(
    LeakyReLU(Conv256(
    LeakyReLU(Conv128(
    LeakyReLU(Conv64(Input))))))))))
wherein Sigmoid denotes the Sigmoid activation function, ConvX denotes a convolution operation with X output channels, LeakyReLU denotes the LeakyReLU activation function, R denotes the output feature of the discriminator, and Input denotes the input feature of the discriminator.
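A sketch of the discriminator of claim 9; the 4x4 kernels, stride 2, LeakyReLU slope of 0.2 and the input channel count are assumptions, only the 64-128-256-512-1 channel progression and the activations follow the claim.

```python
import torch.nn as nn

def make_discriminator(in_ch: int = 3) -> nn.Sequential:
    """Conv64 -> Conv128 -> Conv256 -> Conv512 -> Conv1 with LeakyReLU between and Sigmoid output."""
    def block(i: int, o: int):
        return [nn.Conv2d(i, o, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(
        *block(in_ch, 64),
        *block(64, 128),
        *block(128, 256),
        *block(256, 512),
        nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),
        nn.Sigmoid(),
    )
```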
10. The image synthesis method based on a countermeasure generation network of claim 1, wherein the joint loss is defined in terms of the following quantities:
θ_G denotes the learnable parameters of the generator G, θ_D denotes the learnable parameters of the discriminator D, the loss includes an adversarial generation loss on the unsupervised path and an adversarial generation loss on the self-supervised path, L_kld(G) is the KL-divergence loss function, L_rec(G) is the reconstruction loss function, L_bce(D) is the cross-entropy loss function, the mean and variance of the distribution of the prior vector z_p enter the KL term, D_KL denotes the KL-divergence calculation, and N(a_1, b_1) denotes a distribution with mean a_1 and variance b_1, which is the standard normal distribution when a_1 = 0 and b_1 = 1.
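The KL term L_kld described in claim 10 compares the distribution N(mu, sigma^2) of the prior vector z_p with the standard normal N(0, 1); a sketch of the standard closed-form expression is given below, while the weighting against the adversarial, reconstruction and cross-entropy terms is not reproduced here.

```python
import torch

def kld_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form D_KL(N(mu, exp(logvar)) || N(0, 1)), summed over latent dims, averaged over the batch."""
    return torch.mean(-0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
```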
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310236028.4A CN116524290A (en) | 2023-03-09 | 2023-03-09 | Image synthesis method based on countermeasure generation network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524290A (en) | 2023-08-01 |
Family
ID=87398335
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310236028.4A Withdrawn CN116524290A (en) | 2023-03-09 | 2023-03-09 | Image synthesis method based on countermeasure generation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524290A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118314052A (en) * | 2024-06-07 | 2024-07-09 | 北京数慧时空信息技术有限公司 | Method for removing thin cloud of remote sensing image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20230801 |