CN112884636A - Style transfer method for automatically generating stylized video - Google Patents

Style transfer method for automatically generating stylized video

Info

Publication number
CN112884636A
Authority
CN
China
Prior art keywords
encoder
style
feature
video
lightweight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110117964.4A
Other languages
Chinese (zh)
Other versions
CN112884636B (en)
Inventor
霍静
孔美豪
李文斌
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110117964.4A priority Critical patent/CN112884636B/en
Publication of CN112884636A publication Critical patent/CN112884636A/en
Application granted granted Critical
Publication of CN112884636B publication Critical patent/CN112884636B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/04Context-preserving transformations, e.g. by using an importance map
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a style transfer method for automatically generating stylized videos. A style transfer model is built from a knowledge-distillation-based, highly compressed autoencoder and a semantic-alignment-based feature transfer module. The autoencoder consists of an encoder and a decoder: the encoder encodes the original video content frames and the style image into feature maps, and the feature transfer module fuses the content and style features produced by the encoder under semantic alignment, yielding semantically aligned fused transfer features; the decoder then produces the stylized video frames. The method guarantees the stability of the transferred video, can stylize any video with any style, and the style transfer process is fast enough to be highly practical.

Description

A style transfer method for automatically generating stylized videos

Technical Field

The invention belongs to the field of computer applications, and in particular relates to a style transfer method for automatically generating stylized videos.

Background

With the development and popularization of the Internet and the mobile Internet, more and more short-video platforms have emerged, and people's demand for artistic treatment of videos has grown accordingly. Having videos produced by professional artists or professional editors is both inconvenient and expensive. Automatically generating videos in an arbitrary artistic style by computer has therefore attracted wide attention.

Given a content image and a target style image, the goal of style transfer is to produce a stylized image that combines the structure of the content image with the texture of the style image. Style transfer for single images has been studied extensively, and attention has increasingly turned to video style transfer because of its broad application prospects (including artistic transformation of short videos). Compared with single-image style transfer, video style transfer is both more practical and more challenging.

Compared with traditional image style transfer, video style transfer is harder because stylization quality, temporal stability, and computational efficiency must all be balanced at once. Existing video style transfer methods fall roughly into two categories according to whether they use optical flow.

The first category uses optical flow: supervised by optical-flow constraints, these methods introduce a temporal-consistency loss to obtain stability between adjacent frames. Among them, optimization-based optical-flow methods produce stable transferred videos, but stylizing a single video frame takes nearly three minutes, a speed that is unacceptable in practice. Feed-forward video style transfer methods were proposed later, but because they still rely on optical-flow constraints in both the training and testing phases, they cannot achieve real-time performance on video transfer tasks. To address this, some methods use optical flow only during training and avoid it at test time; they are faster than methods that also use optical flow at test time, but the resulting transfer is far less stable.

The second category does not use optical flow. LST, for example, performs an affine transformation of features and thereby obtains stable stylized videos. Subsequent work combined an Avatar-Net-based decoration module with a compound normalization method to ensure video stability. However, all existing methods of this category encode content and style features with the original VGG network, which is very large: storing the VGG model requires substantial memory, which severely limits deployment on small terminal devices.

Summary of the Invention

Purpose of the invention: the present invention proposes a style transfer method for automatically generating stylized videos, which achieves real-time, stable style transfer of arbitrary videos.

Technical solution: the present invention proposes a style transfer method for automatically generating stylized videos, comprising the following steps:

(1) Build a video style transfer network model comprising a knowledge-distillation-based, highly compressed autoencoder module and a semantic-alignment-based feature transfer module; the autoencoder module contains a lightweight encoder and a lightweight decoder;

(2) Encode the content video frames and the style image with the encoder: distill the lightweight encoder from the VGG network so that, while remaining sufficiently lightweight, it learns the encoding ability of the teacher VGG encoder, and encode the original video content frames and the style image into feature maps;

(3) Semantic-alignment-based feature transfer module: fuse the content features and style features produced by the encoder to obtain semantically aligned fused transfer features;

(4) Distill the lightweight decoder from the VGG network: let the decoder learn the decoding ability of the teacher VGG decoder while remaining sufficiently lightweight, and decode the fused transfer features with the decoder to obtain stylized video frames, from which the video is finally synthesized.

Further, step (2) requires optimizing the following loss function:

$$\mathcal{L}_{\tilde{E}}=\lambda\sum_{k}\left\|E_{k}(I)-\tilde{E}_{k}(I)\right\|_{2}^{2}+\gamma\left\|I'-I\right\|_{2}^{2}$$

where $I$ is the original image, $E$ is the encoder of the VGG network, $\tilde{E}$ is the lightweight encoder, $I'$ is the reconstructed image, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, and $\lambda$ and $\gamma$ are hyperparameters.

Further, step (3) is implemented as follows:

The feature map output by the encoder for the content image is $F_c\in\mathbb{R}^{C\times(W\times H)}$, and the output for the style image is $F_s\in\mathbb{R}^{C\times(W\times H)}$, where $C$ is the number of feature-map channels and $W$ and $H$ are the width and height of the feature map. The goal of the semantic-alignment-based feature transfer module is to find a transformation under which the content maps of different video frames undergo semantically aligned feature transfer. Assuming the transformation can be parameterized as a projection matrix $P\in\mathbb{R}^{C\times C}$, the objective function to optimize is:

$$\min_{P}\sum_{i,j}A_{ij}\left\|PF_c^{(i)}-F_s^{(j)}\right\|_2^2$$

where $F_c^{(i)}$ denotes the operation of selecting the feature vector at the $i$-th position of $F_c$, and $A_{ij}$ denotes the $k$-nearest-neighbor affinity matrix between $F_c^{(i)}$ and $F_s^{(j)}$;

Solving for $P$ gives:

$$P=F_sA^{\top}F_c^{\top}\left(F_cUF_c^{\top}\right)^{-1}$$

where $A$ is the affinity matrix defined above, $U$ is a diagonal matrix with $U_{ii}=\sum_{j}A_{ij}$, and $T$ is a matrix with a feature-alignment function; the projection matrix is formalized as $P=g\!\left(f(F_c)\,f(F_s)^{\top}\right)$, where in the linear transformation $g(X)=MX$ and $f(X)=XT^{\top}$; the $f(\cdot)$ mapping is fitted with three convolutional layers, and the $g(\cdot)$ mapping with one fully connected layer.
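For intuition, the following is a minimal sketch of this closed-form computation, assuming a binary k-nearest-neighbor affinity and using torch.linalg directly; the helper names (knn_affinity, closed_form_P) and the regularization term eps are illustrative assumptions, not part of the patent, and the patent replaces the inversion with a learned module:

```python
import torch

def knn_affinity(Fc, Fs, k=5):
    """Binary k-nearest-neighbor affinity A_ij between content and style
    feature vectors; Fc, Fs are C x N matrices (columns are positions)."""
    d = torch.cdist(Fc.t(), Fs.t())                # N_c x N_s pairwise distances
    idx = d.topk(k, dim=1, largest=False).indices  # k nearest style vectors
    A = torch.zeros_like(d)
    A.scatter_(1, idx, 1.0)
    return A

def closed_form_P(Fc, Fs, k=5, eps=1e-5):
    """P = Fs A^T Fc^T (Fc U Fc^T)^(-1), with U_ii = sum_j A_ij;
    eps regularizes the inversion (an added assumption)."""
    A = knn_affinity(Fc, Fs, k)
    U = torch.diag(A.sum(dim=1))
    M = Fc @ U @ Fc.t() + eps * torch.eye(Fc.shape[0])
    return Fs @ A.t() @ Fc.t() @ torch.linalg.inv(M)
```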

Further, step (4) requires optimizing the following loss function:

$$\mathcal{L}_{\tilde{D}}=\left\|I''-I\right\|_2^2+\lambda\sum_{k}\left\|E_k(I)-\tilde{E}_k(I)\right\|_2^2$$

where $I$ is the original image, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, $I''$ is the image reconstructed by the lightweight decoder $\tilde{D}$, and $\lambda$ is a hyperparameter. The goal of this distillation is that $\tilde{E}$ retains the information of the original $E$ while $\tilde{D}$ reconstructs the image well from the output features produced by $\tilde{E}$.

Beneficial effects: compared with the prior art, the present invention offers the following benefits. 1. High stylization quality of video frames is achieved while preserving stability between adjacent frames, i.e., temporal consistency, so that the transferred video remains stable. 2. Stylization is richly diverse: any video can be stylized with any style. 3. Video style transfer runs in real time; the transfer process is very fast, and the whole model is kept lightweight for higher practicality.

Description of the Drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is a schematic diagram of the knowledge-distillation-based, highly compressed autoencoder module of the present invention;

Fig. 3 is a schematic diagram of the video style transfer network constructed by the present invention;

Fig. 4 shows examples of the video style transfer effect of the present invention.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings.

The present invention proposes a style transfer method for automatically generating stylized videos. Video style transfer proceeds in three stages: in the first stage, the content video frames and the style image are encoded by the encoder; in the second stage, the encoded content and style features undergo feature-level style transfer and fusion; in the third stage, the fused transfer features are decoded by the decoder to obtain stylized video frames, from which the final video is synthesized. The sizes of the encoder and decoder largely determine whether the model is lightweight, while the design of the feature transfer part directly determines whether the transferred stylized video is stable, whether real-time style transfer is possible, and whether arbitrary styles can be handled. As shown in Fig. 1, the method comprises the following steps:

Step 1: Build the video style transfer network model shown in Fig. 3, comprising a knowledge-distillation-based, highly compressed autoencoder module and a semantic-alignment-based feature transfer module.

The autoencoder is divided into an encoder and a decoder. The encoder encodes the original video content frames and the style image into feature maps; the feature transfer module fuses the content and style features produced by the encoder under semantic alignment, yielding semantically aligned fused transfer features; the decoder then produces the stylized video frames.

The semantic-alignment-based feature transfer module (FTM) guarantees stability between adjacent frames during video style transfer. The video style transfer model is only 2.67 MB in size and reaches 166.67 fps when performing video style transfer.

Step 2: Encode the content video frames and the style image with the encoder: distill the lightweight encoder from the VGG network so that, while remaining sufficiently lightweight, it learns the encoding ability of the teacher VGG encoder, and encode the original video content frames and the style image into feature maps.

As shown in Fig. 2, the lightweight encoder and decoder network comprises four symmetric groups of down-sampling and up-sampling convolutional layers with max-pooling layers and ReLU activations; the lightweight encoder network encodes the features of the input video frames and an arbitrary style image. The VGG network is a structure widely used in style transfer; the lightweight encoder network is a student network obtained by knowledge distillation from the VGG teacher network, so that it realizes the image encoding process with as few parameters as possible. As shown by the $\tilde{E}$ part of the network structure in Fig. 2, the encoder network must be sufficiently lightweight while learning the encoding ability of the teacher VGG encoder. The loss function to optimize is:

$$\mathcal{L}_{\tilde{E}}=\lambda\sum_{k}\left\|E_{k}(I)-\tilde{E}_{k}(I)\right\|_{2}^{2}+\gamma\left\|I'-I\right\|_{2}^{2}$$

where $I$ is the original image, $E$ is the encoder of the original VGG-based network, $\tilde{E}$ is the lightweight encoder, $I'$ is the image reconstructed by the decoder, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, and $\lambda$ and $\gamma$ are hyperparameters.
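For illustration, a minimal PyTorch sketch of such a lightweight encoder and this distillation loss follows; the channel widths, the set of matched layers, and the names LightEncoder and encoder_distill_loss are assumptions for exposition rather than the patented architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightEncoder(nn.Module):
    """Lightweight student encoder: four down-sampling groups of
    conv + ReLU + max-pooling (channel widths are illustrative)."""
    def __init__(self):
        super().__init__()
        chans = [3, 16, 32, 64, 128]              # assumed widths
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.MaxPool2d(2))
            for c_in, c_out in zip(chans[:-1], chans[1:]))

    def forward(self, x):
        feats = []                                 # per-level maps E~_k(I)
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

def encoder_distill_loss(feats_t, feats_s, img, img_recon, lam=1.0, gamma=1.0):
    """L = lam * sum_k ||E_k(I) - E~_k(I)||^2 + gamma * ||I' - I||^2.
    Assumes the matched student layers have the teacher's shapes
    (e.g., via 1x1 adapter convolutions, omitted here)."""
    feat_loss = sum(F.mse_loss(fs, ft) for fs, ft in zip(feats_s, feats_t))
    return lam * feat_loss + gamma * F.mse_loss(img_recon, img)
```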

Step 3: Distill the lightweight decoder from the VGG network, so that the decoder learns the decoding ability of the teacher VGG decoder while remaining sufficiently lightweight.

As shown by the $\tilde{D}$ part of the network structure in Fig. 2, the lightweight decoder network decodes the transferred features, with the VGG network serving as the teacher network for knowledge distillation. The decoder network must be sufficiently lightweight while learning the decoding ability of the teacher VGG decoder. The loss function to optimize is:

$$\mathcal{L}_{\tilde{D}}=\left\|I''-I\right\|_2^2+\lambda\sum_{k}\left\|E_k(I)-\tilde{E}_k(I)\right\|_2^2$$

where $I''$ is the image reconstructed by the lightweight decoder $\tilde{D}$. The goal of this distillation is that $\tilde{E}$ retains the information of the original $E$ while $\tilde{D}$ reconstructs the image well from the output features produced by $\tilde{E}$.
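A matching sketch for the decoder side, under the same assumptions (the mirrored channel widths and the name LightDecoder are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class LightDecoder(nn.Module):
    """Mirror of LightEncoder: four up-sampling conv groups (illustrative)."""
    def __init__(self):
        super().__init__()
        chans = [128, 64, 32, 16, 3]               # assumed, mirroring the encoder
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Upsample(scale_factor=2, mode='nearest'),
                       nn.Conv2d(c_in, c_out, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*layers[:-1])     # no ReLU on the RGB output

    def forward(self, feat):
        return self.net(feat)

def decoder_distill_loss(img, img_recon, feats_t, feats_s, lam=1.0):
    """L = ||I'' - I||^2 + lam * sum_k ||E_k(I) - E~_k(I)||^2."""
    feat_loss = sum(F.mse_loss(fs, ft) for fs, ft in zip(feats_s, feats_t))
    return F.mse_loss(img_recon, img) + lam * feat_loss
```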

Step 4: The semantic-alignment-based feature transfer module fuses the content and style features produced by the encoder to obtain semantically aligned fused transfer features.

The semantic-alignment-based feature transfer module is the key to real-time, stable video style transfer: it must align feature semantics while completing style feature transfer efficiently. To realize this, the idea of manifold alignment is adopted. Suppose the feature map output by the encoder for the content image is $F_c\in\mathbb{R}^{C\times(W\times H)}$ and the output of the lightweight encoding network for the style image is $F_s\in\mathbb{R}^{C\times(W\times H)}$, where $C$ is the number of feature-map channels and $W$ and $H$ are the width and height of the feature map. The designed FTM module outputs the semantically aligned, transferred feature $F_{cs}$ and passes it to the decoder to obtain the transferred result. The goal of the FTM module is to find a transformation under which the content maps of different video frames undergo semantically aligned feature transfer. Assuming the transformation can be parameterized as a projection matrix $P\in\mathbb{R}^{C\times C}$, the objective function to optimize is:

$$\min_{P}\sum_{i,j}A_{ij}\left\|PF_c^{(i)}-F_s^{(j)}\right\|_2^2$$

where $F_c^{(i)}$ denotes the operation of selecting the feature vector at the $i$-th position of $F_c$, and $A_{ij}$ denotes the $k$-nearest-neighbor affinity matrix between $F_c^{(i)}$ and $F_s^{(j)}$. The objective therefore makes the transformed content features similar to their $k$-nearest-neighbor features in the style feature space. During video style transfer there may be moving objects and lighting changes, which would otherwise cause flicker after transfer; because the transformation above preserves affinities, adjacent frames remain stable and consistent, producing stable video style transfer results.

Solving the equation above amounts to computing its closed-form solution for $P$: taking the derivative with respect to $P$ and setting it to zero gives

$$P=F_sA^{\top}F_c^{\top}\left(F_cUF_c^{\top}\right)^{-1}$$

where $A$ is the affinity matrix defined above and $U$ is a diagonal matrix with $U_{ii}=\sum_{j}A_{ij}$. Since the resulting symmetric term can be decomposed as $T^{\top}T$, the projection matrix can be formalized as $P=g\!\left(f(F_c)\,f(F_s)^{\top}\right)$, where in the linear transformation $g(X)=MX$ and $f(X)=XT^{\top}$. Even though the closed-form solution for $P$ exists, matrix inversion is very time-consuming, so an FTM network module is designed to fit the solution process: the $f(\cdot)$ mapping is fitted with three convolutional layers, and the $g(\cdot)$ mapping with one fully connected layer.
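A minimal sketch of such an FTM block follows, with f(·) as three convolutional layers and g(·) as one fully connected layer; the channel widths, intermediate dimensions, and the name FTM are assumptions for exposition:

```python
import torch
import torch.nn as nn

class FTM(nn.Module):
    """Feature transfer module: fits P = g(f(Fc) f(Fs)^T) and applies it to
    the content features; widths and layer shapes are assumptions."""
    def __init__(self, c=128, c_mid=32):
        super().__init__()
        def f_branch():                            # f(.): three conv layers
            return nn.Sequential(
                nn.Conv2d(c, c_mid, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_mid, c_mid, 3, padding=1))
        self.f_c, self.f_s = f_branch(), f_branch()
        self.g = nn.Linear(c_mid * c_mid, c * c)   # g(.): one fully connected layer
        self.c = c

    def forward(self, Fc, Fs):
        b, c, h, w = Fc.shape
        fc = self.f_c(Fc).flatten(2)               # B x c_mid x (HW)
        fs = self.f_s(Fs).flatten(2)
        corr = torch.bmm(fc, fs.transpose(1, 2))   # f(Fc) f(Fs)^T
        P = self.g(corr.flatten(1)).view(b, self.c, self.c)
        return torch.bmm(P, Fc.flatten(2)).view(b, c, h, w)   # F_cs
```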

The content images used to train the autoencoder are preprocessed by uniformly resizing them to 256*256 pixels. The content images are fed into both the student autoencoder network and the teacher autoencoder network; each comprises an encoding part, which encodes the image, and a decoding part, which reconstructs the input image from the feature codes produced by the encoder. Using the feature-perception loss and the reconstruction loss, as shown in Fig. 2, the knowledge-distillation-based training ensures that the distilled lightweight autoencoder network can extract multi-level features and reconstruct images from features. The content image and the style image are then fed into the style transfer network with the semantic-alignment feature transfer module, as shown in Fig. 3, and the intermediate feature transfer module is trained (with the already distilled lightweight autoencoder network frozen) using the designed content loss Lc and style loss Ls.
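The second training stage described above can be sketched as follows, reusing the LightEncoder, LightDecoder, and FTM sketches above. The patent only names the content loss Lc and style loss Ls, so the concrete perceptual/mean-std forms below, the loss weighting, and the dummy data loader are all assumptions:

```python
import torch
import torch.nn.functional as F

def content_style_losses(feats_cs, feats_c, feats_s):
    """Illustrative forms only: Lc matches stylized and content features at
    the top layer; Ls matches channel-wise mean/std across layers."""
    Lc = F.mse_loss(feats_cs[-1], feats_c[-1])
    Ls = sum(F.mse_loss(a.mean(dim=(2, 3)), b.mean(dim=(2, 3))) +
             F.mse_loss(a.std(dim=(2, 3)), b.std(dim=(2, 3)))
             for a, b in zip(feats_cs, feats_s))
    return Lc, Ls

# Stage 2: train only the FTM; the distilled autoencoder stays frozen.
encoder, decoder, ftm = LightEncoder(), LightDecoder(), FTM(c=128)
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)
opt = torch.optim.Adam(ftm.parameters(), lr=1e-4)

loader = [(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))]  # dummy batch
for content, style in loader:
    Fcs = ftm(encoder(content)[-1], encoder(style)[-1])
    out = decoder(Fcs)
    Lc, Ls = content_style_losses(encoder(out), encoder(content), encoder(style))
    loss = Lc + 10.0 * Ls                          # weighting is an assumption
    opt.zero_grad()
    loss.backward()
    opt.step()
```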

In the testing phase, the video frames and the chosen style image are fed directly into the trained lightweight style transfer model, which automatically and efficiently outputs the stylized results, finally synthesizing a stable stylized video in real time. Fig. 4 shows the style transfer results for every tenth frame of a video; for both foreground and background, semantically aligned style transfer produces stable video frame results. A sketch of this per-frame inference loop appears below.
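The sketch reuses the LightEncoder, LightDecoder, and FTM sketches above and OpenCV for video I/O; the file paths and the 256x256 working resolution are placeholder assumptions:

```python
import cv2
import numpy as np
import torch

def stylize_video(src='input.mp4', style_path='style.jpg', dst='stylized.mp4'):
    """Per-frame inference with the trained model (paths are placeholders)."""
    def to_tensor(bgr):
        rgb = np.ascontiguousarray(bgr[:, :, ::-1])
        return torch.from_numpy(rgb).permute(2, 0, 1)[None].float() / 255.0

    style = cv2.resize(cv2.imread(style_path), (256, 256))
    cap = cv2.VideoCapture(src)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = None
    with torch.no_grad():
        Fs = encoder(to_tensor(style))[-1]         # style features, computed once
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            h, w = frame.shape[:2]
            x = to_tensor(cv2.resize(frame, (256, 256)))
            out = decoder(ftm(encoder(x)[-1], Fs)).clamp(0, 1)
            out = (out[0].permute(1, 2, 0).numpy() * 255).astype(np.uint8)
            out = cv2.resize(np.ascontiguousarray(out[:, :, ::-1]), (w, h))
            if writer is None:
                writer = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*'mp4v'),
                                         fps, (w, h))
            writer.write(out)
    cap.release()
    if writer is not None:
        writer.release()
```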

Claims (4)

1. A style transfer method for automatically generating stylized videos, characterized by comprising the following steps:
(1) constructing a video style transfer network model, wherein the model comprises a knowledge-distillation-based, highly compressed autoencoder module and a semantic-alignment-based feature transfer module; the autoencoder module comprises a lightweight encoder and a lightweight decoder;
(2) encoder encoding of content video frames and style images: performing knowledge distillation on the lightweight encoder based on the VGG network, so that the encoder learns the encoding ability of the teacher VGG encoder while remaining sufficiently lightweight, and encoding original video content frames and style images into feature maps;
(3) semantic-alignment-based feature transfer module: fusing the content features and style features obtained by the encoder to obtain semantically aligned fused transfer features;
(4) knowledge distillation of the lightweight decoder based on the VGG network: the decoder learns the decoding ability of the teacher VGG decoder while remaining sufficiently lightweight, and decodes the fused transfer features to obtain stylized video frames, from which the video is finally synthesized.
2. The style transfer method for automatically generating stylized videos according to claim 1, characterized in that step (2) is implemented by optimizing the following loss function:

$$\mathcal{L}_{\tilde{E}}=\lambda\sum_{k}\left\|E_{k}(I)-\tilde{E}_{k}(I)\right\|_{2}^{2}+\gamma\left\|I'-I\right\|_{2}^{2}$$

wherein $I$ is an original image, $E$ is the encoder of the VGG network, $\tilde{E}$ is the lightweight encoder, $I'$ is the reconstructed image, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, and $\lambda$ and $\gamma$ are hyperparameters.
3. The style transfer method for automatically generating stylized videos according to claim 1, characterized in that step (3) is implemented as follows:
the feature map output by the encoder for the content image is $F_c\in\mathbb{R}^{C\times(W\times H)}$ and the output obtained for the style image is $F_s\in\mathbb{R}^{C\times(W\times H)}$, wherein $C$ is the number of feature-map channels and $W$ and $H$ are the width and height of the feature map; the semantic-alignment-based feature transfer module aims to find a transformation under which the content maps of different video frames undergo semantically aligned feature transfer; assuming the transformation process can be parameterized as a projection matrix $P\in\mathbb{R}^{C\times C}$, the objective function to optimize is:

$$\min_{P}\sum_{i,j}A_{ij}\left\|PF_c^{(i)}-F_s^{(j)}\right\|_2^2$$

wherein $F_c^{(i)}$ denotes the operation of selecting the feature vector at the $i$-th position of $F_c$, and $A_{ij}$ denotes the $k$-nearest-neighbor affinity matrix between $F_c^{(i)}$ and $F_s^{(j)}$;
solving for $P$ gives:

$$P=F_sA^{\top}F_c^{\top}\left(F_cUF_c^{\top}\right)^{-1}$$

wherein $A$ is the affinity matrix defined above, $U$ is a diagonal matrix with $U_{ii}=\sum_{j}A_{ij}$, and $T$ is a matrix with a feature-alignment function; the projection matrix is formalized as $P=g\!\left(f(F_c)\,f(F_s)^{\top}\right)$, wherein in the linear transformation $g(X)=MX$ and $f(X)=XT^{\top}$; the $f(\cdot)$ mapping is fitted with three convolutional layers, and the $g(\cdot)$ mapping with one fully connected layer.
4. The style transfer method for automatically generating stylized videos according to claim 1, characterized in that step (4) is implemented by optimizing the following loss function:

$$\mathcal{L}_{\tilde{D}}=\left\|I''-I\right\|_2^2+\lambda\sum_{k}\left\|E_k(I)-\tilde{E}_k(I)\right\|_2^2$$

wherein $I$ is an original image, $E_k(I)$ is the output feature map of the $k$-th layer of the original VGG encoder, $\tilde{E}_k(I)$ is the output feature map of the $k$-th layer of the lightweight encoder, $I''$ is the reconstructed image decoded by the lightweight decoder $\tilde{D}$, and $\lambda$ is a hyperparameter; the goal of the distillation is that $\tilde{E}$ retains the information of the original $E$ while $\tilde{D}$ reconstructs the image well from the output features obtained by $\tilde{E}$.
CN202110117964.4A 2021-01-28 2021-01-28 A style transfer method to automatically generate stylized videos Active CN112884636B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110117964.4A CN112884636B (en) 2021-01-28 2021-01-28 A style transfer method to automatically generate stylized videos

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110117964.4A CN112884636B (en) 2021-01-28 2021-01-28 A style transfer method to automatically generate stylized videos

Publications (2)

Publication Number Publication Date
CN112884636A true CN112884636A (en) 2021-06-01
CN112884636B CN112884636B (en) 2023-09-26

Family

ID=76052976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110117964.4A Active CN112884636B (en) 2021-01-28 2021-01-28 A style transfer method to automatically generate stylized videos

Country Status (1)

Country Link
CN (1) CN112884636B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989102A (en) * 2021-10-19 2022-01-28 复旦大学 Rapid style migration method with high shape-preserving property
CN114331827A (en) * 2022-03-07 2022-04-12 深圳市其域创新科技有限公司 Style migration method, device, equipment and storage medium
CN114885174A (en) * 2022-02-23 2022-08-09 中国科学院自动化研究所 Video processing method and device and electronic equipment
CN117036550A (en) * 2023-08-14 2023-11-10 北京信息科技大学 Human action style migration method, system and storable medium
CN118283201A (en) * 2024-06-03 2024-07-02 上海蜜度科技股份有限公司 Video synthesis method, system, storage medium and electronic equipment
WO2025021026A1 (en) * 2023-07-21 2025-01-30 北京字跳网络技术有限公司 Video processing method and apparatus, and electronic device and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236814A1 (en) * 2016-10-21 2019-08-01 Google Llc Stylizing input images
CN110175951A (en) * 2019-05-16 2019-08-27 西安电子科技大学 Video Style Transfer method based on time domain consistency constraint
CN110310221A (en) * 2019-06-14 2019-10-08 大连理工大学 A Multi-Domain Image Style Transfer Method Based on Generative Adversarial Networks
CN110706151A (en) * 2018-09-13 2020-01-17 南京大学 Video-oriented non-uniform style migration method
US20200151938A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering
US20200167658A1 (en) * 2018-11-24 2020-05-28 Jessica Du System of Portable Real Time Neurofeedback Training
CN111325681A (en) * 2020-01-20 2020-06-23 南京邮电大学 An Image Style Transfer Method Combining Meta-Learning Mechanism and Feature Fusion
CN111932445A (en) * 2020-07-27 2020-11-13 广州市百果园信息技术有限公司 Compression method for style migration network and style migration method, device and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190236814A1 (en) * 2016-10-21 2019-08-01 Google Llc Stylizing input images
CN110706151A (en) * 2018-09-13 2020-01-17 南京大学 Video-oriented non-uniform style migration method
US20200151938A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering
US20200167658A1 (en) * 2018-11-24 2020-05-28 Jessica Du System of Portable Real Time Neurofeedback Training
CN110175951A (en) * 2019-05-16 2019-08-27 西安电子科技大学 Video Style Transfer method based on time domain consistency constraint
CN110310221A (en) * 2019-06-14 2019-10-08 大连理工大学 A Multi-Domain Image Style Transfer Method Based on Generative Adversarial Networks
CN111325681A (en) * 2020-01-20 2020-06-23 南京邮电大学 An Image Style Transfer Method Combining Meta-Learning Mechanism and Feature Fusion
CN111932445A (en) * 2020-07-27 2020-11-13 广州市百果园信息技术有限公司 Compression method for style migration network and style migration method, device and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BAO Yuxuan; LU Tianliang; DU Yanhui: "A Survey of Deepfake Video Detection Technology", Computer Science, no. 09, pages 289-298 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989102A (en) * 2021-10-19 2022-01-28 复旦大学 Rapid style migration method with high shape-preserving property
CN114885174A (en) * 2022-02-23 2022-08-09 中国科学院自动化研究所 Video processing method and device and electronic equipment
CN114331827A (en) * 2022-03-07 2022-04-12 深圳市其域创新科技有限公司 Style migration method, device, equipment and storage medium
WO2025021026A1 (en) * 2023-07-21 2025-01-30 北京字跳网络技术有限公司 Video processing method and apparatus, and electronic device and storage medium
CN117036550A (en) * 2023-08-14 2023-11-10 北京信息科技大学 Human action style migration method, system and storable medium
CN118283201A (en) * 2024-06-03 2024-07-02 上海蜜度科技股份有限公司 Video synthesis method, system, storage medium and electronic equipment
CN118283201B (en) * 2024-06-03 2024-10-15 上海蜜度科技股份有限公司 Video synthesis method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112884636B (en) 2023-09-26

Similar Documents

Publication Publication Date Title
CN112884636B (en) A style transfer method to automatically generate stylized videos
CN106980641B (en) Unsupervised Hash quick picture retrieval system and unsupervised Hash quick picture retrieval method based on convolutional neural network
CN113642604B (en) Audio-video auxiliary touch signal reconstruction method based on cloud edge cooperation
CN113934890B (en) Method and system for automatically generating scene video by characters
CN110163796B (en) Unsupervised multi-modal countermeasures self-encoding image generation method and framework
CN112381716A (en) Image enhancement method based on generation type countermeasure network
CN115203409A (en) A video emotion classification method based on gated fusion and multi-task learning
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
CN116363560A (en) Video mask self-coding method and system
CN116109920A (en) A method for extracting buildings from remote sensing images based on Transformer
CN114841859A (en) Single-image super-resolution reconstruction method based on lightweight neural network and Transformer
CN114972851B (en) Ship target intelligent detection method based on remote sensing image
CN115797835A (en) Non-supervision video target segmentation algorithm based on heterogeneous Transformer
Gao et al. Generalized pyramid co-attention with learnable aggregation net for video question answering
CN115359294A (en) Cross-granularity small sample learning method based on similarity regularization intra-class mining
CN112837212B (en) Image arbitrary style migration method based on manifold alignment
CN113869396A (en) PC screen semantic segmentation method based on efficient attention mechanism
CN116052154B (en) Scene text recognition method based on semantic enhancement and graph reasoning
CN118195899A (en) A lightweight hybrid attention distillation network based image super-resolution model
CN117132500A (en) Weak light enhancement method based on sparse conversion network
CN113780209B (en) Attention mechanism-based human face attribute editing method
Wang et al. Hierarchical shared architecture search for real-time semantic segmentation of remote sensing images
Wang et al. Boosting light field image super resolution learnt from single-image prior
Lin Virtual reality and its application for producing TV programs
CN116311002B (en) An Unsupervised Video Object Segmentation Method Based on Optical Flow Information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant