CN112884636A - Style transfer method for automatically generating stylized video - Google Patents

Style transfer method for automatically generating stylized video

Info

Publication number
CN112884636A
Authority
CN
China
Prior art keywords
encoder
style
feature
video
lightweight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110117964.4A
Other languages
Chinese (zh)
Other versions
CN112884636B (en)
Inventor
霍静
孔美豪
李文斌
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110117964.4A priority Critical patent/CN112884636B/en
Publication of CN112884636A publication Critical patent/CN112884636A/en
Application granted granted Critical
Publication of CN112884636B publication Critical patent/CN112884636B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/04Context-preserving transformations, e.g. by using an importance map
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a style transfer method for automatically generating stylized videos. A style transfer model is built from a knowledge-distillation-based, highly compressed autoencoder and a semantic-alignment-based feature transfer module. The autoencoder consists of an encoder and a decoder: the encoder encodes the original video content frames and the style image into feature maps, and the feature transfer module fuses the content and style features produced by the encoder under semantic alignment, yielding semantically aligned fused transfer features; the decoder then produces the stylized video frames. The method guarantees the stability of the transferred video, can stylize any video with any style, and the style transfer process is fast enough to be highly practical.

Description

A style transfer method for automatically generating stylized videos

Technical Field

The invention belongs to the field of computer applications, and in particular relates to a style transfer method for automatically generating stylized videos.

Background

With the development and popularization of the Internet and the mobile Internet, more and more short-video platforms have emerged, and people's demand for artistic treatment of videos has grown accordingly. Having videos produced by professional artists or professional editors is both inconvenient and expensive. Automatically generating videos in an arbitrary artistic style by computer has therefore attracted wide attention.

Given a content image and a target style image, the goal of style transfer is to produce a stylized image that combines the structure of the content image with the texture of the style image. Style transfer for single images has been studied extensively, and attention has increasingly turned to video style transfer because of its broad application prospects (including artistic transformation of short videos). Compared with single-image style transfer, video style transfer is both more practical and more challenging.

Compared with traditional image style transfer, video style transfer is harder because stylization quality, temporal stability, and computational efficiency must all be balanced at once. Existing video style transfer methods fall roughly into two categories according to whether they use optical flow.

The first category uses optical flow: supervised by optical-flow constraints, these methods introduce a temporal-consistency loss to obtain stability between adjacent frames. Among them, optimization-based optical-flow methods produce stable transferred videos, but stylizing a single video frame takes nearly three minutes, a speed that is unacceptable in practice. Feed-forward video style transfer methods were proposed later, but because they still rely on optical-flow constraints in both the training and testing phases, they cannot achieve real-time performance on video transfer tasks. To address this, some methods use optical flow only during training and avoid it at test time; they are faster than methods that also use optical flow at test time, but the resulting transfer is far less stable.

The second category does not use optical flow. LST, for example, performs an affine transformation of features and thereby obtains stable stylized videos. Subsequent work combined an Avatar-Net-based decoration module with a compound normalization method to ensure video stability. However, all existing methods of this category encode content and style features with the original VGG network, which is very large: storing the VGG model requires substantial memory, which severely limits deployment on small terminal devices.

Summary of the Invention

Purpose of the invention: the present invention proposes a style transfer method for automatically generating stylized videos, which achieves real-time, stable style transfer of arbitrary videos.

Technical solution: the present invention proposes a style transfer method for automatically generating stylized videos, comprising the following steps:

(1) Build a video style transfer network model comprising a knowledge-distillation-based, highly compressed autoencoder module and a semantic-alignment-based feature transfer module; the autoencoder module contains a lightweight encoder and a lightweight decoder;

(2) Encode the content video frames and the style image with the encoder: distill the lightweight encoder from the VGG network so that, while remaining sufficiently lightweight, it learns the encoding ability of the teacher VGG encoder, and encode the original video content frames and the style image into feature maps;

(3) Semantic-alignment-based feature transfer module: fuse the content features and style features produced by the encoder to obtain semantically aligned fused transfer features;

(4) Distill the lightweight decoder from the VGG network: let the decoder learn the decoding ability of the teacher VGG decoder while remaining sufficiently lightweight, and decode the fused transfer features with the decoder to obtain stylized video frames, from which the video is finally synthesized.

Further, step (2) requires optimizing the following loss function:

$$\mathcal{L}_{\tilde{E}}=\lambda\sum_{k}\left\|E_{k}(I)-\tilde{E}_{k}(I)\right\|_{2}^{2}+\gamma\left\|I'-I\right\|_{2}^{2}$$

where $I$ is the original image, $E$ is the encoder of the VGG network, $\tilde{E}$ is the lightweight encoder, $I'$ is the reconstructed image, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, and $\lambda$ and $\gamma$ are hyperparameters.

Further, step (3) is implemented as follows:

The feature map output by the encoder for the content image is $F_c\in\mathbb{R}^{C\times(W\times H)}$, and the output for the style image is $F_s\in\mathbb{R}^{C\times(W\times H)}$, where $C$ is the number of feature-map channels and $W$ and $H$ are the width and height of the feature map. The goal of the semantic-alignment-based feature transfer module is to find a transformation under which the content maps of different video frames undergo semantically aligned feature transfer. Assuming the transformation can be parameterized as a projection matrix $P\in\mathbb{R}^{C\times C}$, the objective function to optimize is:

$$\min_{P}\sum_{i,j}A_{ij}\left\|PF_c^{(i)}-F_s^{(j)}\right\|_2^2$$

where $F_c^{(i)}$ denotes the operation of selecting the feature vector at the $i$-th position of $F_c$, and $A_{ij}$ denotes the $k$-nearest-neighbor affinity matrix between $F_c^{(i)}$ and $F_s^{(j)}$;

Solving for $P$ gives:

$$P=F_sA^{\top}F_c^{\top}\left(F_cUF_c^{\top}\right)^{-1}$$

where $A$ is the affinity matrix defined above, $U$ is a diagonal matrix with $U_{ii}=\sum_{j}A_{ij}$, and $T$ is a matrix with a feature-alignment function; the projection matrix is formalized as $P=g\!\left(f(F_c)\,f(F_s)^{\top}\right)$, where in the linear transformation $g(X)=MX$ and $f(X)=XT^{\top}$; the $f(\cdot)$ mapping is fitted with three convolutional layers, and the $g(\cdot)$ mapping with one fully connected layer.
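For intuition, the following is a minimal sketch of this closed-form computation, assuming a binary k-nearest-neighbor affinity and using torch.linalg directly; the helper names (knn_affinity, closed_form_P) and the regularization term eps are illustrative assumptions, not part of the patent, and the patent replaces the inversion with a learned module:

```python
import torch

def knn_affinity(Fc, Fs, k=5):
    """Binary k-nearest-neighbor affinity A_ij between content and style
    feature vectors; Fc, Fs are C x N matrices (columns are positions)."""
    d = torch.cdist(Fc.t(), Fs.t())                # N_c x N_s pairwise distances
    idx = d.topk(k, dim=1, largest=False).indices  # k nearest style vectors
    A = torch.zeros_like(d)
    A.scatter_(1, idx, 1.0)
    return A

def closed_form_P(Fc, Fs, k=5, eps=1e-5):
    """P = Fs A^T Fc^T (Fc U Fc^T)^(-1), with U_ii = sum_j A_ij;
    eps regularizes the inversion (an added assumption)."""
    A = knn_affinity(Fc, Fs, k)
    U = torch.diag(A.sum(dim=1))
    M = Fc @ U @ Fc.t() + eps * torch.eye(Fc.shape[0])
    return Fs @ A.t() @ Fc.t() @ torch.linalg.inv(M)
```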

Further, step (4) requires optimizing the following loss function:

$$\mathcal{L}_{\tilde{D}}=\left\|I''-I\right\|_2^2+\lambda\sum_{k}\left\|E_k(I)-\tilde{E}_k(I)\right\|_2^2$$

where $I$ is the original image, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, $I''$ is the image reconstructed by the lightweight decoder $\tilde{D}$, and $\lambda$ is a hyperparameter. The goal of this distillation is that $\tilde{E}$ retains the information of the original $E$ while $\tilde{D}$ reconstructs the image well from the output features produced by $\tilde{E}$.

Beneficial effects: compared with the prior art, the present invention offers the following benefits. 1. High stylization quality of video frames is achieved while preserving stability between adjacent frames, i.e., temporal consistency, so that the transferred video remains stable. 2. Stylization is richly diverse: any video can be stylized with any style. 3. Video style transfer runs in real time; the transfer process is very fast, and the whole model is kept lightweight for higher practicality.

Description of the Drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is a schematic diagram of the knowledge-distillation-based, highly compressed autoencoder module of the present invention;

Fig. 3 is a schematic diagram of the video style transfer network constructed by the present invention;

Fig. 4 shows examples of the video style transfer effect of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings.

The present invention proposes a style transfer method for automatically generating stylized videos. Video style transfer proceeds in three stages: in the first stage, the content video frames and the style image are encoded by the encoder; in the second stage, the encoded content and style features undergo feature-level style transfer and fusion; in the third stage, the fused transfer features are decoded by the decoder to obtain stylized video frames, from which the final video is synthesized. The sizes of the encoder and decoder largely determine whether the model is lightweight, while the design of the feature transfer part directly determines whether the transferred stylized video is stable, whether real-time style transfer is possible, and whether arbitrary styles can be handled. As shown in Fig. 1, the method comprises the following steps:

Step 1: Build the video style transfer network model shown in Fig. 3, comprising a knowledge-distillation-based, highly compressed autoencoder module and a semantic-alignment-based feature transfer module.

The autoencoder is divided into an encoder and a decoder. The encoder encodes the original video content frames and the style image into feature maps; the feature transfer module fuses the content and style features produced by the encoder under semantic alignment, yielding semantically aligned fused transfer features; the decoder then produces the stylized video frames.

The semantic-alignment-based feature transfer module (FTM) guarantees stability between adjacent frames during video style transfer. The video style transfer model is only 2.67 MB in size and reaches 166.67 fps when performing video style transfer.

Step 2: Encode the content video frames and the style image with the encoder: distill the lightweight encoder from the VGG network so that, while remaining sufficiently lightweight, it learns the encoding ability of the teacher VGG encoder, and encode the original video content frames and the style image into feature maps.

As shown in Fig. 2, the lightweight encoder and decoder network comprises four symmetric groups of down-sampling and up-sampling convolutional layers with max-pooling layers and ReLU activations; the lightweight encoder network encodes the features of the input video frames and an arbitrary style image. The VGG network is a structure widely used in style transfer; the lightweight encoder network is a student network obtained by knowledge distillation from the VGG teacher network, so that it realizes the image encoding process with as few parameters as possible. As shown by the $\tilde{E}$ part of the network structure in Fig. 2, the encoder network must be sufficiently lightweight while learning the encoding ability of the teacher VGG encoder. The loss function to optimize is:

$$\mathcal{L}_{\tilde{E}}=\lambda\sum_{k}\left\|E_{k}(I)-\tilde{E}_{k}(I)\right\|_{2}^{2}+\gamma\left\|I'-I\right\|_{2}^{2}$$

where $I$ is the original image, $E$ is the encoder of the original VGG-based network, $\tilde{E}$ is the lightweight encoder, $I'$ is the image reconstructed by the decoder, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, and $\lambda$ and $\gamma$ are hyperparameters.
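For illustration, a minimal PyTorch sketch of such a lightweight encoder and this distillation loss follows; the channel widths, the set of matched layers, and the names LightEncoder and encoder_distill_loss are assumptions for exposition rather than the patented architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightEncoder(nn.Module):
    """Lightweight student encoder: four down-sampling groups of
    conv + ReLU + max-pooling (channel widths are illustrative)."""
    def __init__(self):
        super().__init__()
        chans = [3, 16, 32, 64, 128]              # assumed widths
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.MaxPool2d(2))
            for c_in, c_out in zip(chans[:-1], chans[1:]))

    def forward(self, x):
        feats = []                                 # per-level maps E~_k(I)
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

def encoder_distill_loss(feats_t, feats_s, img, img_recon, lam=1.0, gamma=1.0):
    """L = lam * sum_k ||E_k(I) - E~_k(I)||^2 + gamma * ||I' - I||^2.
    Assumes the matched student layers have the teacher's shapes
    (e.g., via 1x1 adapter convolutions, omitted here)."""
    feat_loss = sum(F.mse_loss(fs, ft) for fs, ft in zip(feats_s, feats_t))
    return lam * feat_loss + gamma * F.mse_loss(img_recon, img)
```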

Step 3: Distill the lightweight decoder from the VGG network, so that the decoder learns the decoding ability of the teacher VGG decoder while remaining sufficiently lightweight.

As shown by the $\tilde{D}$ part of the network structure in Fig. 2, the lightweight decoder network decodes the transferred features, with the VGG network serving as the teacher network for knowledge distillation. The decoder network must be sufficiently lightweight while learning the decoding ability of the teacher VGG decoder. The loss function to optimize is:

$$\mathcal{L}_{\tilde{D}}=\left\|I''-I\right\|_2^2+\lambda\sum_{k}\left\|E_k(I)-\tilde{E}_k(I)\right\|_2^2$$

where $I''$ is the image reconstructed by the lightweight decoder $\tilde{D}$. The goal of this distillation is that $\tilde{E}$ retains the information of the original $E$ while $\tilde{D}$ reconstructs the image well from the output features produced by $\tilde{E}$.
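A matching sketch for the decoder side, under the same assumptions (the mirrored channel widths and the name LightDecoder are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class LightDecoder(nn.Module):
    """Mirror of LightEncoder: four up-sampling conv groups (illustrative)."""
    def __init__(self):
        super().__init__()
        chans = [128, 64, 32, 16, 3]               # assumed, mirroring the encoder
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Upsample(scale_factor=2, mode='nearest'),
                       nn.Conv2d(c_in, c_out, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*layers[:-1])     # no ReLU on the RGB output

    def forward(self, feat):
        return self.net(feat)

def decoder_distill_loss(img, img_recon, feats_t, feats_s, lam=1.0):
    """L = ||I'' - I||^2 + lam * sum_k ||E_k(I) - E~_k(I)||^2."""
    feat_loss = sum(F.mse_loss(fs, ft) for fs, ft in zip(feats_s, feats_t))
    return F.mse_loss(img_recon, img) + lam * feat_loss
```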

Step 4: The semantic-alignment-based feature transfer module fuses the content and style features produced by the encoder to obtain semantically aligned fused transfer features.

The semantic-alignment-based feature transfer module is the key to real-time, stable video style transfer: it must align feature semantics while completing style feature transfer efficiently. To realize this, the idea of manifold alignment is adopted. Suppose the feature map output by the encoder for the content image is $F_c\in\mathbb{R}^{C\times(W\times H)}$ and the output of the lightweight encoding network for the style image is $F_s\in\mathbb{R}^{C\times(W\times H)}$, where $C$ is the number of feature-map channels and $W$ and $H$ are the width and height of the feature map. The designed FTM module outputs the semantically aligned, transferred feature $F_{cs}$ and passes it to the decoder to obtain the transferred result. The goal of the FTM module is to find a transformation under which the content maps of different video frames undergo semantically aligned feature transfer. Assuming the transformation can be parameterized as a projection matrix $P\in\mathbb{R}^{C\times C}$, the objective function to optimize is:

$$\min_{P}\sum_{i,j}A_{ij}\left\|PF_c^{(i)}-F_s^{(j)}\right\|_2^2$$

where $F_c^{(i)}$ denotes the operation of selecting the feature vector at the $i$-th position of $F_c$, and $A_{ij}$ denotes the $k$-nearest-neighbor affinity matrix between $F_c^{(i)}$ and $F_s^{(j)}$. The objective therefore makes the transformed content features similar to their $k$-nearest-neighbor features in the style feature space. During video style transfer there may be moving objects and lighting changes, which would otherwise cause flicker after transfer; because the transformation above preserves affinities, adjacent frames remain stable and consistent, producing stable video style transfer results.

Solving the equation above amounts to computing its closed-form solution for $P$: taking the derivative with respect to $P$ and setting it to zero gives

$$P=F_sA^{\top}F_c^{\top}\left(F_cUF_c^{\top}\right)^{-1}$$

where $A$ is the affinity matrix defined above and $U$ is a diagonal matrix with $U_{ii}=\sum_{j}A_{ij}$. Since the resulting symmetric term can be decomposed as $T^{\top}T$, the projection matrix can be formalized as $P=g\!\left(f(F_c)\,f(F_s)^{\top}\right)$, where in the linear transformation $g(X)=MX$ and $f(X)=XT^{\top}$. Even though the closed-form solution for $P$ exists, matrix inversion is very time-consuming, so an FTM network module is designed to fit the solution process: the $f(\cdot)$ mapping is fitted with three convolutional layers, and the $g(\cdot)$ mapping with one fully connected layer.
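A minimal sketch of such an FTM block follows, with f(·) as three convolutional layers and g(·) as one fully connected layer; the channel widths, intermediate dimensions, and the name FTM are assumptions for exposition:

```python
import torch
import torch.nn as nn

class FTM(nn.Module):
    """Feature transfer module: fits P = g(f(Fc) f(Fs)^T) and applies it to
    the content features; widths and layer shapes are assumptions."""
    def __init__(self, c=128, c_mid=32):
        super().__init__()
        def f_branch():                            # f(.): three conv layers
            return nn.Sequential(
                nn.Conv2d(c, c_mid, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_mid, c_mid, 3, padding=1))
        self.f_c, self.f_s = f_branch(), f_branch()
        self.g = nn.Linear(c_mid * c_mid, c * c)   # g(.): one fully connected layer
        self.c = c

    def forward(self, Fc, Fs):
        b, c, h, w = Fc.shape
        fc = self.f_c(Fc).flatten(2)               # B x c_mid x (HW)
        fs = self.f_s(Fs).flatten(2)
        corr = torch.bmm(fc, fs.transpose(1, 2))   # f(Fc) f(Fs)^T
        P = self.g(corr.flatten(1)).view(b, self.c, self.c)
        return torch.bmm(P, Fc.flatten(2)).view(b, c, h, w)   # F_cs
```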

The content images used to train the autoencoder are preprocessed by uniformly resizing them to 256*256 pixels. The content images are fed into both the student autoencoder network and the teacher autoencoder network; each comprises an encoding part, which encodes the image, and a decoding part, which reconstructs the input image from the feature codes produced by the encoder. Using the feature-perception loss and the reconstruction loss, as shown in Fig. 2, the knowledge-distillation-based training ensures that the distilled lightweight autoencoder network can extract multi-level features and reconstruct images from features. The content image and the style image are then fed into the style transfer network with the semantic-alignment feature transfer module, as shown in Fig. 3, and the intermediate feature transfer module is trained (with the already distilled lightweight autoencoder network frozen) using the designed content loss Lc and style loss Ls.
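The second training stage described above can be sketched as follows, reusing the LightEncoder, LightDecoder, and FTM sketches above. The patent only names the content loss Lc and style loss Ls, so the concrete perceptual/mean-std forms below, the loss weighting, and the dummy data loader are all assumptions:

```python
import torch
import torch.nn.functional as F

def content_style_losses(feats_cs, feats_c, feats_s):
    """Illustrative forms only: Lc matches stylized and content features at
    the top layer; Ls matches channel-wise mean/std across layers."""
    Lc = F.mse_loss(feats_cs[-1], feats_c[-1])
    Ls = sum(F.mse_loss(a.mean(dim=(2, 3)), b.mean(dim=(2, 3))) +
             F.mse_loss(a.std(dim=(2, 3)), b.std(dim=(2, 3)))
             for a, b in zip(feats_cs, feats_s))
    return Lc, Ls

# Stage 2: train only the FTM; the distilled autoencoder stays frozen.
encoder, decoder, ftm = LightEncoder(), LightDecoder(), FTM(c=128)
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)
opt = torch.optim.Adam(ftm.parameters(), lr=1e-4)

loader = [(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))]  # dummy batch
for content, style in loader:
    Fcs = ftm(encoder(content)[-1], encoder(style)[-1])
    out = decoder(Fcs)
    Lc, Ls = content_style_losses(encoder(out), encoder(content), encoder(style))
    loss = Lc + 10.0 * Ls                          # weighting is an assumption
    opt.zero_grad()
    loss.backward()
    opt.step()
```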

In the testing phase, the video frames and the chosen style image are fed directly into the trained lightweight style transfer model, which automatically and efficiently outputs the stylized results, finally synthesizing a stable stylized video in real time. Fig. 4 shows the style transfer results for every tenth frame of a video; for both foreground and background, semantically aligned style transfer produces stable video frame results. A sketch of this per-frame inference loop appears below.
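The sketch reuses the LightEncoder, LightDecoder, and FTM sketches above and OpenCV for video I/O; the file paths and the 256x256 working resolution are placeholder assumptions:

```python
import cv2
import numpy as np
import torch

def stylize_video(src='input.mp4', style_path='style.jpg', dst='stylized.mp4'):
    """Per-frame inference with the trained model (paths are placeholders)."""
    def to_tensor(bgr):
        rgb = np.ascontiguousarray(bgr[:, :, ::-1])
        return torch.from_numpy(rgb).permute(2, 0, 1)[None].float() / 255.0

    style = cv2.resize(cv2.imread(style_path), (256, 256))
    cap = cv2.VideoCapture(src)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = None
    with torch.no_grad():
        Fs = encoder(to_tensor(style))[-1]         # style features, computed once
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            h, w = frame.shape[:2]
            x = to_tensor(cv2.resize(frame, (256, 256)))
            out = decoder(ftm(encoder(x)[-1], Fs)).clamp(0, 1)
            out = (out[0].permute(1, 2, 0).numpy() * 255).astype(np.uint8)
            out = cv2.resize(np.ascontiguousarray(out[:, :, ::-1]), (w, h))
            if writer is None:
                writer = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*'mp4v'),
                                         fps, (w, h))
            writer.write(out)
    cap.release()
    if writer is not None:
        writer.release()
```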

Claims (4)

1. A style transfer method for automatically generating stylized videos, characterized by comprising the following steps:
(1) constructing a video style transfer network model, wherein the model comprises a knowledge-distillation-based, highly compressed autoencoder module and a semantic-alignment-based feature transfer module; the autoencoder module comprises a lightweight encoder and a lightweight decoder;
(2) encoder encoding of content video frames and style images: performing knowledge distillation on the lightweight encoder based on the VGG network, so that the encoder learns the encoding ability of the teacher VGG encoder while remaining sufficiently lightweight, and encoding original video content frames and style images into feature maps;
(3) semantic-alignment-based feature transfer module: fusing the content features and style features obtained by the encoder to obtain semantically aligned fused transfer features;
(4) knowledge distillation of the lightweight decoder based on the VGG network: the decoder learns the decoding ability of the teacher VGG decoder while remaining sufficiently lightweight, and decodes the fused transfer features to obtain stylized video frames, from which the video is finally synthesized.
2. The style transfer method for automatically generating stylized videos according to claim 1, characterized in that step (2) is implemented by optimizing the following loss function:

$$\mathcal{L}_{\tilde{E}}=\lambda\sum_{k}\left\|E_{k}(I)-\tilde{E}_{k}(I)\right\|_{2}^{2}+\gamma\left\|I'-I\right\|_{2}^{2}$$

wherein $I$ is an original image, $E$ is the encoder of the VGG network, $\tilde{E}$ is the lightweight encoder, $I'$ is the reconstructed image, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, and $\lambda$ and $\gamma$ are hyperparameters.
3. The style transfer method for automatically generating stylized videos according to claim 1, characterized in that step (3) is implemented as follows:
the feature map output by the encoder for the content image is $F_c\in\mathbb{R}^{C\times(W\times H)}$ and the output obtained for the style image is $F_s\in\mathbb{R}^{C\times(W\times H)}$, wherein $C$ is the number of feature-map channels and $W$ and $H$ are the width and height of the feature map; the semantic-alignment-based feature transfer module aims to find a transformation under which the content maps of different video frames undergo semantically aligned feature transfer; assuming the transformation process can be parameterized as a projection matrix $P\in\mathbb{R}^{C\times C}$, the objective function to optimize is:

$$\min_{P}\sum_{i,j}A_{ij}\left\|PF_c^{(i)}-F_s^{(j)}\right\|_2^2$$

wherein $F_c^{(i)}$ denotes the operation of selecting the feature vector at the $i$-th position of $F_c$, and $A_{ij}$ denotes the $k$-nearest-neighbor affinity matrix between $F_c^{(i)}$ and $F_s^{(j)}$;
solving for $P$ gives:

$$P=F_sA^{\top}F_c^{\top}\left(F_cUF_c^{\top}\right)^{-1}$$

wherein $A$ is the affinity matrix defined above, $U$ is a diagonal matrix with $U_{ii}=\sum_{j}A_{ij}$, and $T$ is a matrix with a feature-alignment function; the projection matrix is formalized as $P=g\!\left(f(F_c)\,f(F_s)^{\top}\right)$, wherein in the linear transformation $g(X)=MX$ and $f(X)=XT^{\top}$; the $f(\cdot)$ mapping is fitted with three convolutional layers, and the $g(\cdot)$ mapping with one fully connected layer.
4. The style transfer method for automatically generating stylized videos according to claim 1, characterized in that step (4) is implemented by optimizing the following loss function:

$$\mathcal{L}_{\tilde{D}}=\left\|I''-I\right\|_2^2+\lambda\sum_{k}\left\|E_k(I)-\tilde{E}_k(I)\right\|_2^2$$

wherein $I$ is an original image, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, $I''$ is the reconstructed image decoded by the lightweight decoder $\tilde{D}$, and $\lambda$ is a hyperparameter; the goal of the distillation is that $\tilde{E}$ retains the information of the original $E$ while $\tilde{D}$ reconstructs the image well from the output features obtained by $\tilde{E}$.
CN202110117964.4A 2021-01-28 2021-01-28 A style transfer method to automatically generate stylized videos Active CN112884636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110117964.4A CN112884636B (en) 2021-01-28 2021-01-28 A style transfer method to automatically generate stylized videos

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110117964.4A CN112884636B (en) 2021-01-28 2021-01-28 A style transfer method to automatically generate stylized videos

Publications (2)

Publication Number Publication Date
CN112884636A true CN112884636A (en) 2021-06-01
CN112884636B CN112884636B (en) 2023-09-26

Family

ID=76052976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110117964.4A Active CN112884636B (en) 2021-01-28 2021-01-28 A style transfer method to automatically generate stylized videos

Country Status (1)

Country Link
CN (1) CN112884636B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989102A (en) * 2021-10-19 2022-01-28 复旦大学 Rapid style migration method with high shape-preserving property
CN114331827A (en) * 2022-03-07 2022-04-12 深圳市其域创新科技有限公司 Style migration method, device, equipment and storage medium
CN114885174A (en) * 2022-02-23 2022-08-09 中国科学院自动化研究所 Video processing method and device and electronic equipment
CN117036550A (en) * 2023-08-14 2023-11-10 北京信息科技大学 Human action style migration method, system and storable medium
CN118283201A (en) * 2024-06-03 2024-07-02 上海蜜度科技股份有限公司 Video synthesis method, system, storage medium and electronic equipment
WO2025021026A1 (en) * 2023-07-21 2025-01-30 北京字跳网络技术有限公司 Video processing method and apparatus, and electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236814A1 (en) * 2016-10-21 2019-08-01 Google Llc Stylizing input images
CN110175951A (en) * 2019-05-16 2019-08-27 西安电子科技大学 Video Style Transfer method based on time domain consistency constraint
CN110310221A (en) * 2019-06-14 2019-10-08 大连理工大学 A Multi-Domain Image Style Transfer Method Based on Generative Adversarial Networks
CN110706151A (en) * 2018-09-13 2020-01-17 南京大学 Video-oriented non-uniform style migration method
US20200151938A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering
US20200167658A1 (en) * 2018-11-24 2020-05-28 Jessica Du System of Portable Real Time Neurofeedback Training
CN111325681A (en) * 2020-01-20 2020-06-23 南京邮电大学 An Image Style Transfer Method Combining Meta-Learning Mechanism and Feature Fusion
CN111932445A (en) * 2020-07-27 2020-11-13 广州市百果园信息技术有限公司 Compression method for style migration network and style migration method, device and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236814A1 (en) * 2016-10-21 2019-08-01 Google Llc Stylizing input images
CN110706151A (en) * 2018-09-13 2020-01-17 南京大学 Video-oriented non-uniform style migration method
US20200151938A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering
US20200167658A1 (en) * 2018-11-24 2020-05-28 Jessica Du System of Portable Real Time Neurofeedback Training
CN110175951A (en) * 2019-05-16 2019-08-27 西安电子科技大学 Video Style Transfer method based on time domain consistency constraint
CN110310221A (en) * 2019-06-14 2019-10-08 大连理工大学 A Multi-Domain Image Style Transfer Method Based on Generative Adversarial Networks
CN111325681A (en) * 2020-01-20 2020-06-23 南京邮电大学 An Image Style Transfer Method Combining Meta-Learning Mechanism and Feature Fusion
CN111932445A (en) * 2020-07-27 2020-11-13 广州市百果园信息技术有限公司 Compression method for style migration network and style migration method, device and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BAO Yuxuan; LU Tianliang; DU Yanhui: "A Survey of Deepfake Video Detection Technology", Computer Science, no. 09, pages 289-298 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989102A (en) * 2021-10-19 2022-01-28 复旦大学 Rapid style migration method with high shape-preserving property
CN114885174A (en) * 2022-02-23 2022-08-09 中国科学院自动化研究所 Video processing method and device and electronic equipment
CN114331827A (en) * 2022-03-07 2022-04-12 深圳市其域创新科技有限公司 Style migration method, device, equipment and storage medium
WO2025021026A1 (en) * 2023-07-21 2025-01-30 北京字跳网络技术有限公司 Video processing method and apparatus, and electronic device and storage medium
CN117036550A (en) * 2023-08-14 2023-11-10 北京信息科技大学 Human action style migration method, system and storable medium
CN118283201A (en) * 2024-06-03 2024-07-02 上海蜜度科技股份有限公司 Video synthesis method, system, storage medium and electronic equipment
CN118283201B (en) * 2024-06-03 2024-10-15 上海蜜度科技股份有限公司 Video synthesis method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112884636B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN112884636B (en) A style transfer method to automatically generate stylized videos
CN106980641B (en) Unsupervised Hash quick picture retrieval system and unsupervised Hash quick picture retrieval method based on convolutional neural network
CN113642604B (en) Audio-video auxiliary touch signal reconstruction method based on cloud edge cooperation
CN113934890B (en) Method and system for automatically generating scene video by characters
CN110163796B (en) Unsupervised multi-modal countermeasures self-encoding image generation method and framework
CN112381716A (en) Image enhancement method based on generation type countermeasure network
CN115203409A (en) A video emotion classification method based on gated fusion and multi-task learning
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
CN116363560A (en) Video mask self-coding method and system
CN116109920A (en) A method for extracting buildings from remote sensing images based on Transformer
CN114841859A (en) Single-image super-resolution reconstruction method based on lightweight neural network and Transformer
CN114972851B (en) Ship target intelligent detection method based on remote sensing image
CN115797835A (en) Non-supervision video target segmentation algorithm based on heterogeneous Transformer
Gao et al. Generalized pyramid co-attention with learnable aggregation net for video question answering
CN115359294A (en) Cross-granularity small sample learning method based on similarity regularization intra-class mining
CN112837212B (en) Image arbitrary style migration method based on manifold alignment
CN113869396A (en) PC screen semantic segmentation method based on efficient attention mechanism
CN116052154B (en) Scene text recognition method based on semantic enhancement and graph reasoning
CN118195899A (en) A lightweight hybrid attention distillation network based image super-resolution model
CN117132500A (en) Weak light enhancement method based on sparse conversion network
CN113780209B (en) Attention mechanism-based human face attribute editing method
Wang et al. Hierarchical shared architecture search for real-time semantic segmentation of remote sensing images
Wang et al. Boosting light field image super resolution learnt from single-image prior
Lin Virtual reality and its application for producing TV programs
CN116311002B (en) An Unsupervised Video Object Segmentation Method Based on Optical Flow Information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant