CN111696027A - Multi-modal image style migration method based on adaptive attention mechanism - Google Patents
Multi-modal image style migration method based on adaptive attention mechanism
- Publication number
- CN111696027A CN202010431594.7A CN202010431594A
- Authority
- CN
- China
- Prior art keywords
- network
- generator
- output
- picture
- discriminator
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 37
- 230000003044 adaptive effect Effects 0.000 title claims abstract description 27
- 230000007246 mechanism Effects 0.000 title claims abstract description 26
- 230000005012 migration Effects 0.000 title claims abstract description 24
- 238000013508 migration Methods 0.000 title claims abstract description 24
- 238000012549 training Methods 0.000 claims abstract description 14
- 230000006870 function Effects 0.000 claims description 44
- 239000013598 vector Substances 0.000 claims description 29
- 238000009826 distribution Methods 0.000 claims description 23
- 238000013528 artificial neural network Methods 0.000 claims description 22
- 238000005070 sampling Methods 0.000 claims description 18
- 238000010586 diagram Methods 0.000 claims description 10
- 238000013527 convolutional neural network Methods 0.000 claims description 8
- 239000011159 matrix material Substances 0.000 claims description 6
- 238000010276 construction Methods 0.000 claims description 5
- 239000000203 mixture Substances 0.000 claims description 5
- 238000012360 testing method Methods 0.000 claims description 5
- 230000004913 activation Effects 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 238000004364 calculation method Methods 0.000 claims description 3
- 238000007781 pre-processing Methods 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 238000004422 calculation algorithm Methods 0.000 abstract description 9
- 230000009466 transformation Effects 0.000 abstract description 4
- 230000008901 benefit Effects 0.000 abstract description 3
- 238000005457 optimization Methods 0.000 description 4
- 230000006872 improvement Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 230000009471 action Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000009795 derivation Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/04—Context-preserving transformations, e.g. by using an importance map
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a multi-modal image style migration method based on an adaptive attention mechanism, belonging to the field of computer vision. The method first adopts a generative adversarial network as the basic framework and, drawing on the ideas of the EM (expectation-maximization) attention algorithm and channel-wise scale transformation, improves the EM attention mechanism in the channel domain so that the network pays more attention to style features; the bases in the attention module are weighted with noise so that they change adaptively, which finally changes the style. Noise and pictures are then input into the network simultaneously, and the generative adversarial network is trained with an adversarial training algorithm. Once the network is trained, multi-modal style migration can be performed by changing the noise. The method makes full use of the advantages of the EM attention mechanism and generative adversarial networks, proposes an adaptive channel-domain EM attention module, and improves the post-migration image quality and image diversity of existing methods.
Description
Technical Field
The invention belongs to the field of computer vision and mainly relates to the multi-modal image style migration problem; it is mainly applied in the film and television entertainment industry, human-computer interaction, machine vision understanding, and similar areas.
Background
Image style migration refers to the technology of converting the style of a picture into other, different styles while keeping the picture content, after analyzing pictures of different styles by computer. Demand for image style migration keeps increasing in the film and television entertainment industry, human-computer interaction, machine vision understanding, and other fields. For example, a camera can convert a person's portrait into a cartoon-character portrait in real time, and in automatic driving, style migration can assist in converting a picture into a segmented picture. Existing image style migration methods are mainly divided into image-optimization-based and model-optimization-based methods.
The picture-optimization-based style migration method appeared earliest and is the more established of the two; its basic principle can be divided into three steps. The first step selects a neural network capable of extracting picture features; the second step uses that network to extract features from the original picture and the target picture and uses those features to design a loss function; the third step differentiates the loss function with respect to the original picture and iteratively optimizes it so that the style of the original picture approaches that of the target picture. This type of method does not require a large amount of data and is therefore simple and convenient to operate, but it has the disadvantage that the iteration takes too long to convert pictures in real time. Reference: L.A. Gatys, A.S. Ecker, M. Bethge. Image Style Transfer Using Convolutional Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414-2423.
The model-optimization-based method mainly trains a model on a large number of pictures of different styles so that the model learns a mapping function between one picture style and other picture styles; a style picture fed into the trained model then yields, at the output, pictures of different styles with consistent content. Its advantages are that no iteration steps are needed once the model is trained, so image style migration runs in real time, and that by feeding additional variables the model can take one class of picture as input and output several pictures of different styles at once. Its disadvantages are that style classes absent from training cannot be transferred well at test time, and that sample diversity in multi-modal style transfer is still insufficient. Reference: Y. Alharbi, N. Smith, P. Wonka. Latent Filter Scaling for Multimodal Unsupervised Image-to-Image Translation. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1458-1466.
In recent years, model-optimization-based methods have matured and the demand for multi-modal style migration has increased, yet the diversity and picture quality of current multi-modal methods remain insufficient. Multi-modal style migration means that, given one input picture, pictures of different styles can be output at the same time; in fig. 2, the first image is the input picture and the others are the multi-style pictures output simultaneously. Addressing this field and these shortcomings, the invention proposes a multi-modal image style migration method based on an adaptive attention mechanism and obtains excellent results.
Disclosure of Invention
The invention discloses a multi-modal style migration method with an adaptive channel-domain EM attention mechanism, which addresses the lack of style diversity in the prior art.
The method first adopts a generative adversarial network as the basic framework, normalizes the training pictures to 256 × 256 × 3, and samples the normal distribution to obtain noise. Drawing on the ideas of the EM (expectation-maximization) attention algorithm and channel-wise scale transformation, it improves the EM attention mechanism in the channel domain so that the network pays more attention to style features; the bases in the attention module are weighted with the noise so that they change adaptively, which finally changes the style. Noise and pictures are then input into the network simultaneously, and the generative adversarial network is trained with an adversarial training algorithm. Once the network is trained, multi-modal style migration can be performed by changing the noise. The method makes full use of the advantages of the EM attention mechanism and generative adversarial networks, proposes the adaptive channel-domain EM attention module, and improves the post-migration image quality and image diversity of existing methods. The general structure of the algorithm is shown schematically in fig. 1.
For convenience in describing the present disclosure, certain terms are first defined.
Definition 1: normal distribution. Also known as the Gaussian distribution, it is a probability distribution of great importance in mathematics, physics, and engineering, with significant influence on many branches of statistics. A random variable $x$ satisfies the normal distribution if its probability density function satisfies

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the mathematical expectation of the normal distribution and $\sigma^2$ is its variance; this is often written as $x \sim N(\mu, \sigma^2)$.
Definition 2: generative adversarial network. A generative adversarial network comprises two different neural networks, one called the generator $G$ and the other called the discriminator $D$, which oppose each other during the training process. The purpose of the discriminator is to distinguish the true data distribution $P_{data}$ from the generator distribution $P_G$; the purpose of the generator is to make the two distributions indistinguishable to the discriminator, so that the final result is $P_{data} = P_G$.
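For reference, the classical adversarial objective that this definition describes (a textbook formulation; the patent does not reproduce it here) can be written as:

$$\min_G \max_D \; \mathbb{E}_{x \sim P_{data}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$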
Definition 3: EM algorithm, i.e., the expectation-maximization algorithm. Given observed data X and unobservable latent data Z, the pair is collectively called the complete data D = (X, Z). The EM algorithm first initializes a model and its parameters and uses the model to estimate Z; this is called the E step. The model is then updated with the estimated Z; this is called the M step.
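To make the E/M alternation concrete, below is a minimal sketch for a spherical Gaussian mixture. This is an illustrative example only, not the patent's module (which applies the same pattern to attention bases in step 3), and all names are assumptions:

```python
import numpy as np

def em_gmm_means(X, K, iters=10):
    """Minimal EM loop for a spherical Gaussian mixture (means only).
    X: (n, d) data matrix. Illustrates the E/M alternation of Definition 3."""
    n, d = X.shape
    mu = X[np.random.choice(n, K, replace=False)]          # initialize means
    for _ in range(iters):
        # E step: responsibility z[c, k] of component k for sample c
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        z = np.exp(-0.5 * (dist - dist.min(axis=1, keepdims=True)))
        z /= z.sum(axis=1, keepdims=True)
        # M step: update each mean as a responsibility-weighted average
        mu = (z.T @ X) / z.T.sum(axis=1, keepdims=True)
    return mu, z
```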
Definition 4: generalized kernel function. A function describing the relationship between pairs of points, and equivalently a mapping between different spaces. Many choices are possible, such as the dot product between vectors.
Definition 5: attention mechanism. An attention mechanism typically comprises 3 modules: query, key, and value. The query and key first undergo a correlation operation, and the result is then used to weight the value. The core operator is

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$

where $f(\cdot,\cdot)$ represents a generalized kernel function, $x$ represents the input, $C(x)$ represents a normalization factor (the sum of $f(x_i, x_j)$ over $j$), and $g$ represents an arbitrary transformation. The structure is shown schematically in fig. 3.
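A minimal sketch of this core operator, assuming the exponential dot-product kernel $f(a, b) = \exp(a^\top b)$ and a linear $g$ (both are assumptions for illustration; the definition leaves the choices open):

```python
import torch

def attention_operator(x, g_weight):
    """y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j), with f(a, b) = exp(a^T b);
    the normalization C(x) then reduces to a row-wise softmax.
    x: (n, d) input vectors; g_weight: (d, d) linear map implementing g."""
    weights = torch.softmax(x @ x.t(), dim=1)   # (n, n) normalized affinities
    values = x @ g_weight                       # g(x_j)
    return weights @ values                     # (n, d) weighted aggregation

# usage: y = attention_operator(torch.randn(16, 64), torch.randn(64, 64))
```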
Definition 6: EM attention mechanism. The combination of the EM algorithm and the attention mechanism, obtained mainly by modifying the attention mechanism and adding loop-iteration steps.
Definition 7: adaptive channel-domain EM attention mechanism. The method proposed by the invention, an improvement of the EM attention mechanism formed by moving the domain on which attention acts to the channel domain and adding a new input. See step 3 for details.
Definition 8: softmax function, or normalized exponential function. It "compresses" a K-dimensional vector $x$ of arbitrary real numbers into another K-dimensional real vector $\mathrm{softmax}(x)$ such that each element lies in (0, 1) and all elements sum to 1. The formula can be expressed as:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$
Definition 9: the Relu function, or rectified linear unit, an activation function commonly used in artificial neural networks. It generally refers to the nonlinear ramp function and its variants, with expression $f(x) = \max(0, x)$.
Therefore, the technical scheme of the invention is a multi-modal image style migration method based on an adaptive attention mechanism, which comprises the following steps:
step 1: preprocessing the data set;
Acquire the edges2shoes dataset, which contains shoe outlines and real shoe pictures, 49825 picture pairs in total. Divide the dataset into two classes, shoe outlines in one class and real shoes in the other, and randomly shuffle the order. Finally, normalize the picture pixel values to the range [-1, 1];
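A minimal preprocessing sketch under these specifications (file handling and names are assumptions):

```python
import random
import numpy as np
from PIL import Image

def load_picture(path):
    """Resize a picture to 256x256 and scale pixel values to [-1, 1]."""
    img = Image.open(path).convert("RGB").resize((256, 256))
    return np.asarray(img, dtype=np.float32) / 127.5 - 1.0  # (256, 256, 3)

# outline_paths / shoe_paths are assumed lists of file names, one per class
# random.shuffle(outline_paths); random.shuffle(shoe_paths)
```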
step 2: construct a convolutional neural network and a fully-connected neural network;
1) Construct a convolutional neural network comprising two sub-networks, one a generator and the other a discriminator. The input and output of the generator are pictures, while the input of the discriminator is a picture and its output is a scalar. The first two layers of the generator network are 2 downsampling convolution blocks, followed by 9 residual network blocks and finally 2 upsampling convolution blocks. The discriminator network uses, in order, 4 downsampling convolution blocks and two standard convolution blocks. The standard, upsampling, downsampling, and residual network blocks are shown in fig. 5; a code sketch of both sub-networks follows item 2) below.
2) Construct a fully-connected network whose input is a vector $v \in \mathbb{R}^8$. Assuming the total number of channels across the generator in the constructed convolutional neural network is $L$, the output of the fully-connected network comprises two parts: the first part is a vector $d \in \mathbb{R}^L$, and the other part is a vector $S \in \mathbb{R}^K$, where $K$ is the number of bases in step 3. The network contains two 128-dimensional hidden layers; the hidden layers use the Relu function as the activation function, and the output layer uses the Tanh function as its activation function;
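The following PyTorch sketch assembles the two sub-networks and the fully-connected mapping network described above. It is a sketch under assumptions: channel widths, kernel sizes, normalization layers, and the discriminator's pooling to a scalar are not specified by the patent and are chosen here only for illustration.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # downsampling convolution block (internal layout assumed)
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

def up(cin, cout):
    # upsampling convolution block (internal layout assumed)
    return nn.Sequential(
        nn.ConvTranspose2d(cin, cout, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class ResBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.InstanceNorm2d(c),
            nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1), nn.InstanceNorm2d(c))
    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """2 downsampling blocks, 9 residual blocks, 2 upsampling blocks."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            down(3, 64), down(64, 128),
            *[ResBlock(128) for _ in range(9)],
            up(128, 64), up(64, 3))
    def forward(self, x):
        return torch.tanh(self.net(x))           # pictures in [-1, 1]

class Discriminator(nn.Module):
    """4 downsampling blocks followed by two standard convolution blocks."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            down(3, 64), down(64, 128), down(128, 256), down(256, 512),
            nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 1, 3, padding=1))
    def forward(self, x):
        return self.net(x).mean(dim=(1, 2, 3))   # scalar per picture (assumed pooling)

class MappingNet(nn.Module):
    """v in R^8 -> control codes d in R^L and S in R^K."""
    def __init__(self, L, K):
        super().__init__()
        self.L = L
        self.body = nn.Sequential(
            nn.Linear(8, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, L + K), nn.Tanh())
    def forward(self, v):
        out = self.body(v)
        return out[..., :self.L], out[..., self.L:]   # d, S
```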
Step 3: construct the adaptive channel-domain EM attention module (see fig. 4), which corresponds to the process in a Gaussian mixture model. After a picture is fed to the generator in the convolutional neural network, the feature map output by a convolution block in the generator is $X$, of size $C \times H \times W$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map. Let $x_c \in \mathbb{R}^N$ with $N = H \times W$ denote the $N$-dimensional vector of the $c$-th channel, so that $X \in \mathbb{R}^{C \times N}$. A group of $K$ basis vectors $\mu_k \in \mathbb{R}^N$, initialized by random sampling from the normal distribution, forms the matrix $M \in \mathbb{R}^{K \times N}$, where $K < N$. Step 3 comprises three sub-steps: the first estimates the hidden variables $Z \in \mathbb{R}^{C \times K}$; the second updates the basis matrix $M$ using the estimate from the first; the first and second steps iterate in a loop until $\mu$ and $Z$ converge; the third reconstructs $X$ from $M$ and $Z$, after multiplying $M$ by the $S$ obtained in step 2;
Step 4: the overall neural network;
Embed the adaptive channel-domain EM attention module of step 3 into the generator of step 2 at 3 different places in total: first, before the first residual network block, after the second downsampling convolution block; second, replacing the 5th residual network block; and third, embedded at the position of the first upsampling convolution block, after the last residual network block. The feature-map control code $d$ in the output of the fully-connected neural network is multiplied into the outputs of all convolution layers in the generator, and the base control code $S$ is multiplied into the bases $M$ of the adaptive channel-domain EM attention module obtained in step 3. The output of the generator serves as the input of the discriminator, and the output of the discriminator is the output of the overall neural network;
the overall network framework is shown in FIG. 1;
Step 5: design the loss function;
In the pictures acquired in step 1, denote a shoe-outline picture by $I_A$ and a real shoe picture by $I_B$. A vector $v$ is obtained by randomly sampling the normal distribution. The generator together with the fully-connected network of step 2 is denoted $G$, and the discriminator $D$. The generator input in $G$ is $I_A$ and the fully-connected network input is $v$; acting together, their output is denoted $G(I_A, v)$. The inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are denoted $D(I_B)$ and $D(G(I_A, v))$ respectively. The network losses can be described as:

$$\mathcal{L}_D = -\,\mathbb{E}_{I_B}\big[\log D(I_B)\big] - \mathbb{E}_{(I_A, v)}\big[\log\big(1 - D(G(I_A, v))\big)\big]$$

$$\mathcal{L}_G = -\,\mathbb{E}_{(I_A, v)}\big[\log D(G(I_A, v))\big]$$

where $\mathcal{L}_D$ is the loss function of the discriminator and $\mathcal{L}_G$ the loss function of the generator, and $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denote expectations taken over $(I_A, v)$ and $I_B$ respectively;
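A hedged sketch of these two losses using the standard binary cross-entropy adversarial formulation (assuming a combined callable G(I_A, v) and logit outputs from D; both interfaces are assumptions):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, I_A, I_B, v):
    """L_D: score real shoe pictures as 1 and generated pictures as 0."""
    real = D(I_B)
    fake = D(G(I_A, v).detach())                 # detach: G is fixed here
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
            + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

def generator_loss(D, G, I_A, v):
    """L_G: generated pictures should be scored as real by D."""
    fake = D(G(I_A, v))
    return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
```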
Step 6: train the overall neural network with the loss function constructed in step 5, fixing the parameters of D when updating G and fixing the parameters of G when updating D; the two updates alternate once per iteration;
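A minimal alternating-update loop matching step 6 (the data loader and the loss helpers from the previous sketch are assumed):

```python
import torch

# G, D, discriminator_loss, generator_loss as in the sketches above;
# `loader` (an assumed iterable of (I_A, I_B) batches) is not shown.
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
for I_A, I_B in loader:
    v = torch.randn(I_A.size(0), 8)              # noise sampled from N(0, I)
    opt_d.zero_grad()                            # update D with G fixed
    discriminator_loss(D, G, I_A, I_B, v).backward()
    opt_d.step()
    opt_g.zero_grad()                            # update G with D fixed
    generator_loss(D, G, I_A, v).backward()
    opt_g.step()
```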
Step 7: in the testing stage, take the model trained in step 6 and keep only the network G part. Given an input picture $I_A$, multiple output pictures of different styles are obtained by using different normal-distribution samples $v$.
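At test time only G is kept and varying v varies the style; a sketch (assuming the combined G(picture, noise) interface from the earlier sketches):

```python
import torch

G.eval()                       # keep only the trained G part for testing
with torch.no_grad():
    outputs = [G(I_A, torch.randn(1, 8)) for _ in range(5)]  # 5 styles
```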
Further, the specific method of step 3 is as follows:
Step 3.1: estimate the hidden variables $Z \in \mathbb{R}^{C \times K}$. This step computes the responsibility of each base for each channel, i.e., the likelihood that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the $k$-th base $\mu_k$ for the $c$-th channel $x_c$, with $1 \le k \le K$ and $1 \le c \le C$. The posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$$
The kernel function $\mathcal{K}(a, b)$ is chosen as $\exp(a^\top b)$. For the $t$-th iteration, the hidden variable $Z$ is then calculated in matrix form as:

$$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^\top\big)$$
Step 3.2: update the basis vectors $\mu$. This step maximizes the likelihood function of the complete data and corresponds to the Gaussian mixture model: the weights computed in the first step, i.e., the likelihood that each sample belongs to a given base, are used to take a weighted sum of the samples and update the value of that base. For the $t$-th iteration, the update of the basis vectors is the responsibility-weighted average of $X$:

$$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)}\, x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$$
Step 3.3: after steps 3.1 and 3.2 have been executed alternately $T$ times, proceed to step 3.3: reconstruct $X$ from $M$ and $Z$, multiplying $\mu$ by the $S$ obtained in step 2. The $S$ obtained in step 2 has length $K$, equal to the number of bases $\mu$. $X$ is finally reconstructed using the following formula:

$$\tilde{X} = Z\,\big(\mathrm{diag}(S)\, M\big)$$
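Putting steps 3.1-3.3 together, a compact sketch of the adaptive channel-domain EM attention module (shapes follow the definitions in step 3; the function name is an assumption):

```python
import torch
import torch.nn.functional as F

def adaptive_channel_em_attention(X, M, S, T=3):
    """Sketch of steps 3.1-3.3.
    X: (C, N) feature map, one row per channel (N = H * W)
    M: (K, N) basis matrix, rows are the bases mu_k
    S: (K,)  base control code from the fully-connected network
    """
    for _ in range(T):
        # Step 3.1 (E step): Z = softmax(X M^T), i.e. kernel exp(a^T b)
        Z = F.softmax(X @ M.t(), dim=1)                   # (C, K)
        # Step 3.2 (M step): bases as responsibility-weighted averages of X
        M = (Z.t() @ X) / Z.t().sum(dim=1, keepdim=True)  # (K, N)
    # Step 3.3: weight the bases with S, then reconstruct X from Z and M
    return Z @ (S.unsqueeze(1) * M)                       # (C, N)
```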
the innovation of the invention is that:
1) The spatial domain of the attention mechanism is converted into the channel domain. Spatial-domain attention takes pixels as the variables and computes the weight of each base for each pixel; converting to the channel domain means computing the weight of each base for each channel instead, as shown in fig. 6.
2) Adaptive weighting of the attention mechanism. Weighting the feature map can change the style of the output picture, but we replace weighting of the feature map with weighting of the bases in attention, as shown in fig. 7.
3) We introduced this approach into multi-modal style migration and achieved excellent results in the experiments.
The improvement in 1) makes the attention mechanism pay more attention to style, and the improvement in 2) makes the output style change more precisely; combined, the two finally improve the experimental results.
Drawings
FIG. 1 is a diagram of the main network structure of the method of the present invention.
FIG. 2 is a diagram illustrating multi-modal style migration results of the present invention.
FIG. 3 is a schematic view of the attention mechanism of the present invention.
FIG. 4 is a diagram of the adaptive channel-domain EM attention mechanism according to the present invention.
Fig. 5 is a diagram of a standard convolutional block, an upsampled convolutional block, a downsampled convolutional block, and a residual block according to the present invention.
FIG. 6 is a schematic view of transforming the spatial domain attention into the channel domain attention according to the present invention.
FIG. 7 is a diagram illustrating an adaptive weighting method according to the present invention.
Detailed Description
Step 1: preprocessing the data set;
Acquire the edges2shoes dataset (http://efrosgans.eecs.berkeley.edu/pix2pix/datasets/edges2shoes.tar.gz), which contains shoe outlines and real shoe pictures, 49825 picture pairs in total. Divide the dataset into two classes, shoe outlines in one class and real shoes in the other, and randomly shuffle the order. Finally, normalize the picture pixel values to the range [-1, 1].
Step 2: constructing a convolution neural network and a full-connection neural network;
1) The convolutional neural network constructed in this step comprises two sub-networks, one a generator and the other a discriminator. The input and output of the generator are pictures, while the input of the discriminator is a picture and its output is a scalar. The first two layers of the generator network are 2 downsampling convolution blocks, followed by 9 residual network blocks and finally 2 upsampling convolution blocks. The discriminator network uses, in order, 4 downsampling convolution blocks and two standard convolution blocks. The standard, upsampling, downsampling, and residual network blocks are shown in fig. 5.
2) The fully-connected network constructed in this step takes an 8-dimensional vector $v \in \mathbb{R}^8$ as input. Assuming the total number of convolution kernels across the generator in the constructed convolutional neural network is $L$, the output of the fully-connected network comprises two parts: the first part is a vector $d \in \mathbb{R}^L$, and the other part is a vector $S \in \mathbb{R}^K$, where $K$ is the number of bases in step 3. The network contains two 128-dimensional hidden layers; the hidden layers use the Relu function as the activation function, and the output layer uses the Tanh function as its activation function;
3) The $d$ in the output of the fully-connected neural network is multiplied into the outputs of all convolutions in the generator, while $S$ is multiplied into the bases $M$ of the adaptive channel-domain EM attention module (constructed in step 3).
Step 3, constructing an adaptive channel domain EM attention module, referring to fig. 4, corresponding to the process in the Gaussian mixture model, after a picture is sent to a generator in a convolutional neural network, a feature graph obtained through the output of a convolution block in the generator is X, the size of the feature graph is C × H × W, wherein C is the number of channels, and H and W are the height and the width of the feature graph respectively;an N-dimensional vector representing an ith channel; given aAnd initializing a group of K basis vectors by normal distribution random samplingComposed matrixWherein K is less than N; step 3 is divided into the following three sub-steps; the first step is to estimate the hidden variablesThe second step is to update the base vector matrix M by using the estimation result of the first step; the first step and the second step are iterated circularly until mu and Z converge; thirdly, reconstructing X by using M and Z, and multiplying M by using S obtained in the step 2;
Step 4: the overall neural network structure;
Embed the adaptive channel-domain EM attention module of step 3 into the generator of step 2 at 3 different places in total: first, before the first residual network block, after the second downsampling convolution block; second, replacing the 5th residual network block; and third, embedded at the position of the first upsampling convolution block, after the last residual network block. The overall network framework is shown in FIG. 1;
Step 5: design the loss function;
In the pictures acquired in step 1, denote a shoe-outline picture by $I_A$ and a real shoe picture by $I_B$. A vector $v$ is obtained by randomly sampling the normal distribution. The generator together with the fully-connected network of step 2 is denoted $G$, and the discriminator $D$. The generator input in $G$ is $I_A$ and the fully-connected network input is $v$; acting together, their output is denoted $G(I_A, v)$. The inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are denoted $D(I_B)$ and $D(G(I_A, v))$ respectively. The network losses can be described as:

$$\mathcal{L}_D = -\,\mathbb{E}_{I_B}\big[\log D(I_B)\big] - \mathbb{E}_{(I_A, v)}\big[\log\big(1 - D(G(I_A, v))\big)\big]$$

$$\mathcal{L}_G = -\,\mathbb{E}_{(I_A, v)}\big[\log D(G(I_A, v))\big]$$

where $\mathcal{L}_D$ is the loss function of the discriminator and $\mathcal{L}_G$ the loss function of the generator, and $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denote expectations taken over $(I_A, v)$ and $I_B$ respectively;
Step 6: train the network with the loss function constructed in step 5, fixing the parameters of D when updating G and fixing the parameters of G when updating D; the two updates alternate once per iteration. In actual training, 1000000 iterations are used;
Step 7: in the testing stage, take the model trained in step 6 and keep only the network G part. Given an input picture $I_A$ and different normal-distribution samples $v$, multiple output pictures of different styles are obtained, completing the tests of picture quality and picture diversity. According to the experimental results on the edges2shoes dataset, the picture quality score improves by 0.15 points over the previous 10.32, reaching 10.47, and the picture diversity score improves by 0.005 points over the previous 0.109, reaching 0.114.
Further, the specific method of step 3 is as follows:
Step 3.1: estimate the hidden variables $Z \in \mathbb{R}^{C \times K}$. This step computes the responsibility of each base for each channel, i.e., the probability that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the $k$-th base $\mu_k$ for the $c$-th channel $x_c$, with $1 \le k \le K$ and $1 \le c \le C$. The posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$$
The kernel function $\mathcal{K}(a, b)$ is chosen as $\exp(a^\top b)$. For the $t$-th iteration, the hidden variable $Z$ is then calculated in matrix form as:

$$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^\top\big)$$
Step 3.2: update the basis vectors $\mu$. This step maximizes the likelihood function of the complete data and corresponds to the Gaussian mixture model: the weights computed in the first step, i.e., the likelihood that each sample belongs to a given base, are used to take a weighted sum of the samples and update the value of that base. For the $t$-th iteration, the update of the basis vectors is the responsibility-weighted average of $X$:

$$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)}\, x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$$
Step 3.3: after steps 3.1 and 3.2 have been executed alternately $T$ times, proceed to step 3.3: reconstruct $X$ from $M$ and $Z$, multiplying $\mu$ by the $S$ obtained in step 2. The $S$ obtained in step 2 has length $K$, equal to the number of bases $\mu$. $X$ is finally reconstructed using the following formula:

$$\tilde{X} = Z\,\big(\mathrm{diag}(S)\, M\big)$$
the picture size is as follows: 256*256*3
Learning rate: 0.0002 and decreases linearly with the number of iterations
Training batch size: 1
Iteration times are as follows: 1000000
Iteration times T of an adaptive channel domain EM attention module: 3.
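Collected for reference, the hyperparameters above as a single configuration sketch (the names are illustrative, not from the patent):

```python
config = dict(
    picture_size=(256, 256, 3),
    learning_rate=2e-4,      # decreases linearly with the iteration count
    batch_size=1,
    iterations=1_000_000,
    em_iterations_T=3,       # loop count of the channel-domain EM module
    noise_dim=8,             # dimension of the sampled vector v
)
```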
Claims (2)
1. An adaptive attention mechanism-based multi-modal image style migration method, comprising:
step 1: preprocessing the data set;
Acquire the edges2shoes dataset, which contains shoe outlines and real shoe pictures, 49825 picture pairs in total. Divide the dataset into two classes, shoe outlines in one class and real shoes in the other, and randomly shuffle the order. Finally, normalize the picture pixel values to the range [-1, 1];
step 2: constructing a convolution neural network and a full-connection neural network;
1) Construct a convolutional neural network comprising two sub-networks, one a generator and the other a discriminator; the input and output of the generator are pictures, while the input of the discriminator is a picture and its output is a scalar; the first two layers of the generator network are 2 downsampling convolution blocks, followed by 9 residual network blocks and finally 2 upsampling convolution blocks; the discriminator network uses, in order, 4 downsampling convolution blocks and two standard convolution blocks;
2) Construct a fully-connected network whose input is a vector $v \in \mathbb{R}^8$. Assuming the total number of channels across the generator in the constructed convolutional neural network is $L$, the output of the fully-connected network comprises two parts: the first part is a vector $d \in \mathbb{R}^L$, and the other part is a vector $S \in \mathbb{R}^K$, where $K$ is the number of bases in step 3. The network contains two 128-dimensional hidden layers; the hidden layers use the Relu function as the activation function, and the output layer uses the Tanh function as its activation function;
Step 3: construct the adaptive channel-domain EM attention module, which corresponds to the process in a Gaussian mixture model. After a picture is fed to the generator in the convolutional neural network, the feature map output by a convolution block in the generator is $X$, of size $C \times H \times W$, where $C$ is the number of channels and $H$ and $W$ are the height and width of the feature map. Let $x_c \in \mathbb{R}^N$ with $N = H \times W$ denote the $N$-dimensional vector of the $c$-th channel, so that $X \in \mathbb{R}^{C \times N}$. A group of $K$ basis vectors $\mu_k \in \mathbb{R}^N$, initialized by random sampling from the normal distribution, forms the matrix $M \in \mathbb{R}^{K \times N}$, where $K < N$. Step 3 comprises three sub-steps: the first estimates the hidden variables $Z \in \mathbb{R}^{C \times K}$; the second updates the basis matrix $M$ using the estimate from the first; the first and second steps iterate in a loop until $\mu$ and $Z$ converge; the third reconstructs $X$ from $M$ and $Z$, after multiplying $M$ by the $S$ obtained in step 2;
Step 4: the overall neural network;
Embed the adaptive channel-domain EM attention module of step 3 into the generator of step 2 at 3 different places in total: first, before the first residual network block, after the second downsampling convolution block; second, replacing the 5th residual network block; and third, embedded at the position of the first upsampling convolution block, after the last residual network block. The feature-map control code $d$ in the output of the fully-connected neural network is multiplied into the outputs of all convolution layers in the generator, and the base control code $S$ is multiplied into the bases $M$ of the adaptive channel-domain EM attention module obtained in step 3. The output of the generator serves as the input of the discriminator, and the output of the discriminator is the output of the overall neural network;
Step 5: design the loss function;
In the pictures acquired in step 1, denote a shoe-outline picture by $I_A$ and a real shoe picture by $I_B$. A vector $v$ is obtained by randomly sampling the normal distribution. The generator together with the fully-connected network of step 2 is denoted $G$, and the discriminator $D$. The generator input in $G$ is $I_A$ and the fully-connected network input is $v$; acting together, their output is denoted $G(I_A, v)$. The inputs of the discriminator are $I_B$ and $G(I_A, v)$, and its outputs are denoted $D(I_B)$ and $D(G(I_A, v))$ respectively. The network losses can be described as:

$$\mathcal{L}_D = -\,\mathbb{E}_{I_B}\big[\log D(I_B)\big] - \mathbb{E}_{(I_A, v)}\big[\log\big(1 - D(G(I_A, v))\big)\big]$$

$$\mathcal{L}_G = -\,\mathbb{E}_{(I_A, v)}\big[\log D(G(I_A, v))\big]$$

where $\mathcal{L}_D$ is the loss function of the discriminator and $\mathcal{L}_G$ the loss function of the generator, and $\mathbb{E}_{(I_A, v)}$ and $\mathbb{E}_{I_B}$ denote expectations taken over $(I_A, v)$ and $I_B$ respectively;
Step 6: train the overall neural network with the loss function constructed in step 5, fixing the parameters of D when updating G and fixing the parameters of G when updating D; the two updates alternate once per iteration;
Step 7: in the testing stage, take the model trained in step 6 and keep only the network G part; given an input picture $I_A$, multiple output pictures of different styles are obtained by using different normal-distribution samples $v$.
2. The adaptive attention mechanism-based multi-modal image style migration method of claim 1, wherein the specific method of step 3 is:
Step 3.1: estimate the hidden variables $Z \in \mathbb{R}^{C \times K}$. This step computes the responsibility of each base for each channel, i.e., the probability that each channel belongs to each base; $z_{ck}$ denotes the responsibility of the $k$-th base $\mu_k$ for the $c$-th channel $x_c$, with $1 \le k \le K$ and $1 \le c \le C$. The posterior probability distribution of $x_c$ conditioned on $\mu_k$ is constructed as

$$p(x_c \mid \mu_k) \propto \mathcal{K}(x_c, \mu_k)$$

where $\mathcal{K}(\cdot,\cdot)$ denotes the generalized kernel function; $z_{ck}$ can then be calculated with the following formula:

$$z_{ck} = \frac{\mathcal{K}(x_c, \mu_k)}{\sum_{j=1}^{K} \mathcal{K}(x_c, \mu_j)}$$
The kernel function $\mathcal{K}(a, b)$ is chosen as $\exp(a^\top b)$. For the $t$-th iteration, the hidden variable $Z$ is then calculated in matrix form as:

$$Z^{(t)} = \mathrm{softmax}\big(X (M^{(t-1)})^\top\big)$$
Step 3.2: for the $t$-th iteration, the update of the basis vectors is the responsibility-weighted average of $X$:

$$\mu_k^{(t)} = \frac{\sum_{c=1}^{C} z_{ck}^{(t)}\, x_c}{\sum_{c=1}^{C} z_{ck}^{(t)}}$$
Step 3.3: after steps 3.1 and 3.2 have been executed alternately $T$ times, proceed to step 3.3: reconstruct $X$ from $M$ and $Z$, multiplying $\mu$ by the $S$ obtained in step 2. The $S$ obtained in step 2 has length $K$, equal to the number of bases $\mu$. $X$ is finally reconstructed using the following formula:

$$\tilde{X} = Z\,\big(\mathrm{diag}(S)\, M\big)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010431594.7A CN111696027B (en) | 2020-05-20 | 2020-05-20 | Multi-modal image style migration method based on adaptive attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010431594.7A CN111696027B (en) | 2020-05-20 | 2020-05-20 | Multi-modal image style migration method based on adaptive attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111696027A true CN111696027A (en) | 2020-09-22 |
CN111696027B CN111696027B (en) | 2023-04-07 |
Family
ID=72478084
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010431594.7A Expired - Fee Related CN111696027B (en) | 2020-05-20 | 2020-05-20 | Multi-modal image style migration method based on adaptive attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111696027B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112614047A (en) * | 2020-12-18 | 2021-04-06 | 西北大学 | Facial makeup image style migration method based on TuiGAN improvement |
CN112819692A (en) * | 2021-02-21 | 2021-05-18 | 北京工业大学 | Real-time arbitrary style migration method based on double attention modules |
CN113379655A (en) * | 2021-05-18 | 2021-09-10 | 电子科技大学 | Image synthesis method for generating antagonistic network based on dynamic self-attention |
CN113421318A (en) * | 2021-06-30 | 2021-09-21 | 合肥高维数据技术有限公司 | Font style migration method and system based on multitask generation countermeasure network |
CN113450313A (en) * | 2021-06-04 | 2021-09-28 | 电子科技大学 | Image significance visualization method based on regional contrast learning |
CN113538224A (en) * | 2021-09-14 | 2021-10-22 | 深圳市安软科技股份有限公司 | Image style migration method and device based on generation countermeasure network and related equipment |
CN114037770A (en) * | 2021-10-27 | 2022-02-11 | 电子科技大学长三角研究院(衢州) | Discrete Fourier transform-based attention mechanism image generation method |
CN114037600A (en) * | 2021-10-11 | 2022-02-11 | 长沙理工大学 | New cycleGAN style migration network based on new attention mechanism |
CN115375601A (en) * | 2022-10-25 | 2022-11-22 | 四川大学 | Decoupling expression traditional Chinese painting generation method based on attention mechanism |
CN117635418A (en) * | 2024-01-25 | 2024-03-01 | 南京信息工程大学 | Training method for generating countermeasure network, bidirectional image style conversion method and device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160169782A1 (en) * | 2014-12-10 | 2016-06-16 | Nikhilesh Chawla | Fixture for in situ electromigration testing during x-ray microtomography |
CN108564119A (en) * | 2018-04-04 | 2018-09-21 | 华中科技大学 | A kind of any attitude pedestrian Picture Generation Method |
CN110110745A (en) * | 2019-03-29 | 2019-08-09 | 上海海事大学 | Based on the semi-supervised x-ray image automatic marking for generating confrontation network |
CN110288609A (en) * | 2019-05-30 | 2019-09-27 | 南京师范大学 | A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance |
CN110322423A (en) * | 2019-04-29 | 2019-10-11 | 天津大学 | A kind of multi-modality images object detection method based on image co-registration |
CN110415184A (en) * | 2019-06-28 | 2019-11-05 | 南开大学 | A kind of multi-modality images Enhancement Method based on orthogonal first space |
CN110503598A (en) * | 2019-07-30 | 2019-11-26 | 西安理工大学 | The font style moving method of confrontation network is generated based on condition circulation consistency |
CN110580509A (en) * | 2019-09-12 | 2019-12-17 | 杭州海睿博研科技有限公司 | multimodal data processing system and method for generating countermeasure model based on hidden representation and depth |
CN111161200A (en) * | 2019-12-22 | 2020-05-15 | 天津大学 | Human body posture migration method based on attention mechanism |
CN111161272A (en) * | 2019-12-31 | 2020-05-15 | 北京理工大学 | Embryo tissue segmentation method based on generation of confrontation network |
-
2020
- 2020-05-20 CN CN202010431594.7A patent/CN111696027B/en not_active Expired - Fee Related
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160169782A1 (en) * | 2014-12-10 | 2016-06-16 | Nikhilesh Chawla | Fixture for in situ electromigration testing during x-ray microtomography |
CN108564119A (en) * | 2018-04-04 | 2018-09-21 | 华中科技大学 | A kind of any attitude pedestrian Picture Generation Method |
CN110110745A (en) * | 2019-03-29 | 2019-08-09 | 上海海事大学 | Based on the semi-supervised x-ray image automatic marking for generating confrontation network |
CN110322423A (en) * | 2019-04-29 | 2019-10-11 | 天津大学 | A kind of multi-modality images object detection method based on image co-registration |
CN110288609A (en) * | 2019-05-30 | 2019-09-27 | 南京师范大学 | A kind of multi-modal whole-heartedly dirty image partition method of attention mechanism guidance |
CN110415184A (en) * | 2019-06-28 | 2019-11-05 | 南开大学 | A kind of multi-modality images Enhancement Method based on orthogonal first space |
CN110503598A (en) * | 2019-07-30 | 2019-11-26 | 西安理工大学 | The font style moving method of confrontation network is generated based on condition circulation consistency |
CN110580509A (en) * | 2019-09-12 | 2019-12-17 | 杭州海睿博研科技有限公司 | multimodal data processing system and method for generating countermeasure model based on hidden representation and depth |
CN111161200A (en) * | 2019-12-22 | 2020-05-15 | 天津大学 | Human body posture migration method based on attention mechanism |
CN111161272A (en) * | 2019-12-31 | 2020-05-15 | 北京理工大学 | Embryo tissue segmentation method based on generation of confrontation network |
Non-Patent Citations (2)
Title |
---|
LILI PAN et al.: "Latent Dirichlet Allocation in Generative Adversarial Networks", MACHINE LEARNING *
LI Zetian et al.: "Image denoising based on local expectation-maximization attention", Chinese Journal of Liquid Crystals and Displays *
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112614047A (en) * | 2020-12-18 | 2021-04-06 | 西北大学 | Facial makeup image style migration method based on TuiGAN improvement |
CN112614047B (en) * | 2020-12-18 | 2023-07-28 | 西北大学 | TuiGAN-based improved facial makeup image style migration method |
CN112819692A (en) * | 2021-02-21 | 2021-05-18 | 北京工业大学 | Real-time arbitrary style migration method based on double attention modules |
CN112819692B (en) * | 2021-02-21 | 2023-10-31 | 北京工业大学 | Real-time arbitrary style migration method based on dual-attention module |
CN113379655B (en) * | 2021-05-18 | 2022-07-29 | 电子科技大学 | Image synthesis method for generating antagonistic network based on dynamic self-attention |
CN113379655A (en) * | 2021-05-18 | 2021-09-10 | 电子科技大学 | Image synthesis method for generating antagonistic network based on dynamic self-attention |
CN113450313A (en) * | 2021-06-04 | 2021-09-28 | 电子科技大学 | Image significance visualization method based on regional contrast learning |
CN113450313B (en) * | 2021-06-04 | 2022-03-15 | 电子科技大学 | Image significance visualization method based on regional contrast learning |
CN113421318A (en) * | 2021-06-30 | 2021-09-21 | 合肥高维数据技术有限公司 | Font style migration method and system based on multitask generation countermeasure network |
CN113538224B (en) * | 2021-09-14 | 2022-01-14 | 深圳市安软科技股份有限公司 | Image style migration method and device based on generation countermeasure network and related equipment |
CN113538224A (en) * | 2021-09-14 | 2021-10-22 | 深圳市安软科技股份有限公司 | Image style migration method and device based on generation countermeasure network and related equipment |
CN114037600A (en) * | 2021-10-11 | 2022-02-11 | 长沙理工大学 | New cycleGAN style migration network based on new attention mechanism |
CN114037770A (en) * | 2021-10-27 | 2022-02-11 | 电子科技大学长三角研究院(衢州) | Discrete Fourier transform-based attention mechanism image generation method |
CN114037770B (en) * | 2021-10-27 | 2024-08-16 | 电子科技大学长三角研究院(衢州) | Image generation method of attention mechanism based on discrete Fourier transform |
CN115375601A (en) * | 2022-10-25 | 2022-11-22 | 四川大学 | Decoupling expression traditional Chinese painting generation method based on attention mechanism |
CN117635418A (en) * | 2024-01-25 | 2024-03-01 | 南京信息工程大学 | Training method for generating countermeasure network, bidirectional image style conversion method and device |
CN117635418B (en) * | 2024-01-25 | 2024-05-14 | 南京信息工程大学 | Training method for generating countermeasure network, bidirectional image style conversion method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111696027B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111696027B (en) | Multi-modal image style migration method based on adaptive attention mechanism | |
Putzky et al. | Recurrent inference machines for solving inverse problems | |
EP3298576B1 (en) | Training a neural network | |
US20190087726A1 (en) | Hypercomplex deep learning methods, architectures, and apparatus for multimodal small, medium, and large-scale data representation, analysis, and applications | |
CN109711426B (en) | Pathological image classification device and method based on GAN and transfer learning | |
CN114187331B (en) | Unsupervised optical flow estimation method based on Transformer feature pyramid network | |
Sprechmann et al. | Supervised sparse analysis and synthesis operators | |
CN113379655B (en) | Image synthesis method for generating antagonistic network based on dynamic self-attention | |
EP4341914A1 (en) | Generating images using sequences of generative neural networks | |
Hu et al. | Image super-resolution with self-similarity prior guided network and sample-discriminating learning | |
CN114648787A (en) | Face image processing method and related equipment | |
CN111986085A (en) | Image super-resolution method based on depth feedback attention network system | |
CN116797456A (en) | Image super-resolution reconstruction method, system, device and storage medium | |
CN114037770B (en) | Image generation method of attention mechanism based on discrete Fourier transform | |
Huang et al. | Learning deep analysis dictionaries for image super-resolution | |
Fakhari et al. | A new restricted boltzmann machine training algorithm for image restoration | |
Gao et al. | Rank-one network: An effective framework for image restoration | |
Moeller et al. | Image denoising—old and new | |
Zhao et al. | Face super-resolution via triple-attention feature fusion network | |
Gangloff et al. | A general parametrization framework for pairwise Markov models: An application to unsupervised image segmentation | |
CN115601787A (en) | Rapid human body posture estimation method based on abbreviated representation | |
CN115410000A (en) | Object classification method and device | |
CN115601257A (en) | Image deblurring method based on local features and non-local features | |
Zandavi | Post-trained convolution networks for single image super-resolution | |
Basioti et al. | Image restoration from parametric transformations using generative models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20230407