CN110533579B - Video style conversion method based on self-coding structure and gradient order preservation


Info

Publication number
CN110533579B
Authority
CN
China
Prior art keywords
video
layer
stylized
network
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910680259.8A
Other languages
Chinese (zh)
Other versions
CN110533579A (en)
Inventor
牛毅
郭博嘉
李甫
李宜烜
石光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910680259.8A priority Critical patent/CN110533579B/en
Publication of CN110533579A publication Critical patent/CN110533579A/en
Application granted granted Critical
Publication of CN110533579B publication Critical patent/CN110533579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/04 - Context-preserving transformations, e.g. by using an importance map
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 9/00 - Image coding
    • G06T 9/002 - Image coding using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a video style conversion method based on a self-coding structure and gradient order preservation, which solves the technical problem of existing video style conversion methods that a halo is generated at the edges of foreground objects in the stylized video. The implementation steps are: 1) construct a training sample set and a test sample set; 2) construct a video stylized network model; 3) train the video stylized network model; 4) test the trained video stylized network model; 5) obtain the video style conversion result. By constructing a video stylized network model based on a self-coding structure and a gradient order-preserving loss function, and by redefining the temporal consistency constraint in a more reasonable way, the method effectively eliminates the halo generated at the edges of foreground objects in the stylized video, retains the texture detail information of the original video, and improves the visual experience; it can be used in the post-production of photographic and film and television works.

Description

Video style conversion method based on self-coding structure and gradient order preservation
Technical Field
The invention belongs to the technical field of digital image processing and relates to a video style conversion method, in particular to a video style conversion method based on a self-coding structure and gradient order preservation, which can be used in the post-production of photographic and film and television works.
Background
Image generation is an important branch of computer vision and includes image super-resolution, image colorization, image semantic segmentation, style conversion of images or videos, and the like. Style conversion of images or videos is generally treated as a texture synthesis problem: given a style image, texture is extracted from the source and transferred to the target to generate the corresponding style conversion result.
Image style conversion methods can be divided into two categories: traditional iterative methods and neural-network-based methods. Traditional iterative methods include stroke-based rendering, region-based rendering, and example-based rendering; although such methods can faithfully reproduce a specific style pattern without a CNN, they are limited in flexibility, style diversity, and effective extraction of image structure. Neural-network-based methods, by contrast, extract content features of the input image with a pre-trained convolutional neural network, extract texture features with a Gram matrix, and iteratively optimize the output image so that its feature distribution matches the expected feature distribution in the convolutional neural network.
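As an illustration of the Gram-matrix texture statistics mentioned above, the following minimal sketch (Python with NumPy; the names and shapes are chosen for illustration only) shows how a Gram matrix summarizes the channel correlations of a CNN feature map independently of its spatial layout:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a feature map with shape (H, W, C).

    The (i, j) entry is the inner product between channels i and j,
    which captures texture statistics independent of spatial layout.
    """
    h, w, c = features.shape
    f = features.reshape(h * w, c)   # flatten the spatial dimensions
    return f.T @ f / (h * w * c)     # normalized C x C Gram matrix
```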
After the image style conversion method achieves certain results, many scholars turn the attention to the video style conversion method. The method is characterized in that a video is formed by combining a plurality of frames of images, continuous stylized video frames are obtained by frame splitting and an image style conversion method is used for combining the frames of the videos into a stylized video, and therefore the video style conversion method is improved based on the image style conversion method. The video style conversion method can be applied to post-production processing and treatment of film and television works, and can generate corresponding video style conversion results under the condition of specifying target style images.
At present, the following methods are mainly used for typical video style conversion:
Manuel Ruder et al. published an article entitled "Artistic Style Transfer for Videos" in 2016, which discloses an iterative video style conversion method. It adds a temporal loss on top of image style conversion and introduces the concept of temporal consistency between adjacent video frames in order to penalize deviations between two frames. Temporal consistency between adjacent stylized video frames is thereby ensured and video flicker is effectively prevented. However, because the output is obtained by iterative optimization, video generation is very slow and its time cost is high.
Haozhi Huang et al. published an article entitled "Real-Time Neural Style Transfer for Videos" at Computer Vision and Pattern Recognition in 2017, which discloses a feed-forward video style conversion method that shortens conversion time by training a temporally consistent style-transfer feed-forward neural network. Although this method greatly improves the efficiency of video generation, it still has two drawbacks when converting video style: first, halos appear around foreground objects in the generated stylized video, degrading the visual experience; second, because optical flow estimation is not sufficiently accurate, the temporal consistency loss is computed in an unreasonable way, since optical flow detected on the original video frames is not suitable for constraining temporal consistency between stylized video frames; this introduces training errors and degrades the fluency and continuity of the stylized video.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a video style conversion method based on a self-coding structure and gradient order preservation, which eliminates the halo around foreground objects in the stylized video while maintaining the video style conversion speed, thereby improving the visual experience.
The technical idea of the invention is as follows: first, a video stylized network structure is constructed based on the idea of a self-coding structure to increase the generation speed of the stylized video; a new temporal consistency loss function is defined to suppress jitter and flicker in the stylized video; a reconstruction loss function is added so that the stylized video retains the detail information of the original video; and a gradient order-preserving loss function is added to eliminate the halo around foreground objects in the stylized video and improve the visual experience. The specific steps are as follows:
(1) Constructing a training sample set and a testing sample set:
(1a) Obtaining a target style image s and M_r video data of resolution N_r×N_r, splitting each video into frames at frame rate N_f to obtain M_r groups of original video frames x, and simultaneously extracting the optical flow data of each video to obtain M_r groups of optical flow data, where N_r ≥ 64, M_r ≥ 100, N_f ≥ 25;
(1b) Forming a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to these 4M_r/5 groups of original video frames, forming a training sample set from the target style image s and the training set, and forming a test sample set from the remaining M_r/5 groups of original video frames x;
(2) Constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
Constructing a video stylized network structure comprising an encoder network, a decoder network, and a loss network (here the stylized video frame is written x̂ and the reconstructed video frame x̃), wherein:
the encoder network, including an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, is used to generate the stylized video frame x̂;
the decoder network, including an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, is used to generate the reconstructed video frame x̃;
the loss network, including an input layer, a plurality of convolutional layers and a plurality of pooling layers, is used to extract the high-order features of the target style image s, the stylized video frame x̂ and the original video frame x;
(2b) Defining the total loss function L_total of the video stylized network structure:
Defining the total loss function L_total of the video stylized network structure, comprising a spatial structure loss function L_spatial, a temporal consistency loss function L_temporal, a gradient order-preserving loss function L_gradient and a reconstruction loss function L_reconstruction, as their weighted sum (the explicit formulas of the individual losses are presented as equation images in the patent publication):
L_total = α·L_spatial + β·L_temporal + λ·L_gradient + γ·L_reconstruction
wherein x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_spatial, L_temporal, L_gradient and L_reconstruction respectively; D = H×W×C, where H, W and C are respectively the height, width and number of channels of x_t, x̂_t and x̃_t; μ is the weight balancing the content loss function L_content and the style loss function L_style that make up L_spatial; F is an affine transformation (optical-flow warping) operation, and L_temporal is the mean squared error, normalized by D, between adjacent reconstructed video frames after one of them has been warped by F; L_reconstruction(x_t, x̃_t) penalizes the difference between the original video frame and its reconstruction; A_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x_t, and B_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x̂_t; G_T is a threshold function and M is a morphological dilation operation;
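For clarity, the weighted combination defined above can be written as a small helper. This is a minimal sketch; the default weight values are placeholders, since the patent only names α, β, λ and γ without fixing them:

```python
def total_loss(l_spatial, l_temporal, l_gradient, l_reconstruction,
               alpha=1.0, beta=1.0, lam=1.0, gamma=1.0):
    """L_total = alpha*L_spatial + beta*L_temporal + lambda*L_gradient + gamma*L_reconstruction."""
    return (alpha * l_spatial + beta * l_temporal
            + lam * l_gradient + gamma * l_reconstruction)
```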
(3) Training a video stylized network model:
the original video frame x of the target style image s and t in the training sample set t And original video frame x at time t +1 t+1 And x t And x t+1 Taking the optical flow data as the input of the video stylized network model, and performing K times of iterative training on the video stylized network model to obtain a trained video stylized network model, wherein K is more than or equal to 20000;
(4) Testing the trained video stylized network model:
Taking the test sample set as the input of the encoder network in the trained video stylized network model to obtain M_r/5 groups of stylized video frame sequences;
(5) Acquiring a video style conversion result:
Combining each group of stylized video frame sequences into a video at frame rate N_f in temporal order to obtain the style-converted video.
Compared with the prior art, the invention has the following advantages:
1. Based on the idea of a self-coding structure, the invention constructs a video stylized network model containing a gradient order-preserving loss function and a reconstruction loss function. During training, these two losses constrain the intermediate values and output values of the encoder and decoder networks in the self-coding structure, preventing the pixel values at the edges of foreground objects in the stylized video from becoming over-smoothed or gradient-reversed. This effectively eliminates the halo at the edges of foreground objects that appears in the prior art, provides sharp, halo-free foreground boundaries, retains the texture detail information of the original video in the stylized video, and effectively improves the visual experience.
2. Based on the idea of a self-coding structure, the invention constructs a video stylized network model with a redefined temporal consistency loss function, which is computed as the mean square error between the pixels of two adjacent reconstructed video frames. Because the reconstructed frames are essentially similar to the original frames in spatial structure, this avoids the prior-art computation on two adjacent stylized frames, reduces the error introduced by applying optical flow estimated on the original video to video style conversion, effectively suppresses flicker and jitter in the stylized video, improves its smoothness and continuity, and further improves the visual experience.
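The redefined temporal consistency loss described in advantage 2 can be sketched as follows. This is a minimal illustration assuming backward warping with OpenCV's remap; the flow convention and warping direction are assumptions rather than details fixed by the patent:

```python
import cv2
import numpy as np

def temporal_consistency_loss(recon_t, recon_t1, flow):
    """Sketch of the redefined temporal consistency loss.

    recon_t, recon_t1: reconstructed frames at times t and t+1, shape (H, W, C), float32.
    flow: optical flow between the ORIGINAL frames (e.g. from FlowNet2), shape (H, W, 2).
    One reconstructed frame is warped toward the other and the mean squared
    error is taken, which corresponds to normalizing by D = H*W*C.
    """
    h, w, _ = recon_t.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # backward warping: sample recon_t at positions displaced by the flow
    map_x = (grid_x - flow[..., 0]).astype(np.float32)
    map_y = (grid_y - flow[..., 1]).astype(np.float32)
    warped = cv2.remap(recon_t, map_x, map_y, cv2.INTER_LINEAR)
    return np.mean((recon_t1 - warped) ** 2)
```

Because both inputs are reconstructed frames, which share the spatial structure of the original frames, the optical flow estimated on the original video matches them much better than it matches the stylized frames.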
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a video stylized network architecture constructed in accordance with the present invention;
Fig. 3 is a comparison of the video style conversion results of the present invention and the prior art.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Referring to fig. 1, the implementation steps of the invention are as follows:
step 1) constructing a training sample set and a test sample set:
(1a) Obtaining a target style image s and M_r video data of resolution N_r×N_r, splitting each video into frames at frame rate N_f to obtain M_r groups of original video frames x, and simultaneously extracting the optical flow data of each video to obtain M_r groups of optical flow data, where N_r ≥ 64, M_r ≥ 100, N_f ≥ 25;
(1b) Forming a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to these 4M_r/5 groups of original video frames, forming a training sample set from the target style image s and the training set, and forming a test sample set from the remaining M_r/5 groups of original video frames x;
Most existing video datasets have rectangular frames, but the optical flow estimation used here automatically converts rectangular video frames to the square resolution N_r×N_r, and during training optical flow is only estimated and extracted for square video, so the original rectangular videos are adjusted to square videos for the training set. A down-sampling operation is performed during training, and videos with too small a resolution give unsatisfactory results after style conversion, so the resolution of the input video has a lower bound, N_r ≥ 64. In this embodiment, a target style image is obtained and the 124 videos contained in the Sintel and DAVIS video datasets are used: the 124 videos are split into frames and optical flow data are extracted with the FlowNet2 algorithm, giving 124 groups of original video frames with resolution 256×256 and the 124 corresponding groups of optical flow data; 102 groups of video frames and their corresponding optical flow data form the training set, the target style image and the training set form the training sample set, and the remaining 22 groups of video frames are used as the test sample set.
Step 2), constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
Referring to fig. 2, a video stylized network structure is constructed comprising an encoder network, a decoder network and a loss network, where the encoder network and the decoder network together form a self-coding structure that generates the intermediate values and output values on which the respective loss functions are imposed:
the encoder network includes an input layer, four convolutional layers, five residual layers and two deconvolution layers, and is used to generate the stylized video frame x̂:
input layer → first convolutional layer → second convolutional layer → third convolutional layer → first residual layer → second residual layer → third residual layer → fourth residual layer → fifth residual layer → first deconvolution layer → second deconvolution layer → fourth convolutional layer;
the decoder network includes an input layer, three convolutional layers, two residual layers and one deconvolution layer, and is used to generate the reconstructed video frame x̃:
input layer → first convolutional layer → second convolutional layer → first residual layer → second residual layer → first deconvolution layer → third convolutional layer;
the loss network adopts the first sixteen convolutional layers of a pre-trained VGG-19, including an input layer, sixteen convolutional layers and four pooling layers, and is used to extract the high-order features of the target style image s, the stylized video frame x̂ and the original video frame x:
input layer → first convolutional layer → second convolutional layer → first pooling layer → third convolutional layer → fourth convolutional layer → second pooling layer → fifth convolutional layer → sixth convolutional layer → seventh convolutional layer → eighth convolutional layer → third pooling layer → ninth convolutional layer → tenth convolutional layer → eleventh convolutional layer → twelfth convolutional layer → fourth pooling layer → thirteenth convolutional layer → fourteenth convolutional layer → fifteenth convolutional layer → sixteenth convolutional layer.
Wherein the parameter settings of each layer of the decoder network and the encoder network are as follows:
Figure BDA0002144558420000071
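Since the per-layer parameter table is only available as an image, the following Keras sketch of the encoder branch uses assumed filter counts, kernel sizes and strides; only the layer ordering (four convolutional layers, five residual layers, two deconvolution layers) follows the text above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=128):
    """Simple residual block; the filter count is an assumption."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.add([x, y])

def build_encoder(input_shape=(256, 256, 3)):
    """Sketch of the encoder (stylization) branch of the self-coding structure."""
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 9, strides=1, padding="same", activation="relu")(inp)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
    for _ in range(5):
        x = residual_block(x, 128)
    x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    out = layers.Conv2D(3, 9, strides=1, padding="same", activation="tanh")(x)  # stylized frame
    return tf.keras.Model(inp, out, name="encoder")
```

The decoder branch can be built the same way with three convolutional layers, two residual layers and one deconvolution layer, producing the reconstructed frame.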
(2b) Defining the total loss function L_total of the video stylized network structure:
Defining the total loss function L_total of the video stylized network structure, comprising a spatial structure loss function L_spatial, a temporal consistency loss function L_temporal, a gradient order-preserving loss function L_gradient and a reconstruction loss function L_reconstruction(x_t, x̃_t), as their weighted sum (the explicit formulas of the individual losses are presented as equation images in the patent publication):
L_total = α·L_spatial + β·L_temporal + λ·L_gradient + γ·L_reconstruction
wherein x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_spatial, L_temporal, L_gradient and L_reconstruction respectively; D = H×W×C, where H, W and C are respectively the height, width and number of channels of x_t, x̂_t and x̃_t; μ is the weight balancing the content loss function L_content and the style loss function L_style that make up L_spatial; F is an affine transformation (optical-flow warping) operation, and L_temporal is the mean squared error, normalized by D, between adjacent reconstructed video frames after one of them has been warped by F; L_reconstruction(x_t, x̃_t) penalizes the difference between the original video frame and its reconstruction; A_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x_t, and B_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x̂_t; G_T is a threshold function and M is a morphological dilation operation;
wherein the style loss function L_style, the content loss function L_content, the three-dimensional gradient matrices A_t and B_t and the threshold function G_T are defined as follows (their explicit formulas are given as equation images in the patent publication):
the content loss L_content compares the feature maps φ_l(x_t) and φ_l(x̂) extracted by the loss network, and the style loss L_style compares the Gram matrices G_l(x̂) and G_l(s); here x_t is the original video frame at time t, x̂ is the stylized video frame, x̃_t is the reconstructed video frame at time t, φ_l is the feature map extracted by the l-th convolutional layer of the loss network, C_l, H_l and W_l are respectively the number of channels, height and width of φ_l, and G_l is the Gram matrix of the feature map extracted by the l-th convolutional layer of the loss network;
A_t is the three-dimensional gradient matrix obtained by cascading 12 two-dimensional gradient matrices computed on x_t, and B_t is the three-dimensional gradient matrix obtained by cascading 12 two-dimensional gradient matrices computed on x̂; each two-dimensional gradient matrix is obtained by convolving one channel with one convolution kernel K_q, where x_t^p is the p-th channel of x_t, x̂^p is the p-th channel of x̂, p is the channel index, p ∈ {1,2,3}; every pixel of x_t^p and x̂^p is convolved with convolution kernels K_q in the four directions lower-left, lower, lower-right and right, q being the index of the directional kernel K_q, q ∈ {1,2,3,4}; K_q(m,n) is the value of K_q at position (m,n), m and n being the row and column indices of K_q; i and j are the row and column indices of the two-dimensional gradient matrices; k_r is the width of K_q and k_c is the length of K_q; in this embodiment the convolution operation is performed with four kernels K_q, one per direction, whose numerical values are given as images in the patent publication.
G_T, in brief, encodes the gradient order, i.e. the direction information of the gradient, taken in one direction at a time: for example, if the pixel value of the current pixel exceeds that of its lower-right neighbour by more than the threshold a, the corresponding position is assigned 1, so a value of 1 in G_T indicates that the pixel value at that point is greater than the pixel value of the point to its lower right. The value of the threshold a is derived empirically from experiments.
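The gradient order-preserving loss can be sketched as follows. Both the directional kernels and the way G_T, the dilation M and B_t are combined are assumptions made for illustration, since the patent gives the kernels and the loss formula only as images; the sketch only reproduces the described idea of penalizing reversed gradients at positions where the original frame has a clear gradient direction:

```python
import numpy as np
from scipy.ndimage import convolve, binary_dilation

# Hypothetical difference kernels for the four directions
# (lower-left, lower, lower-right, right); the patent's actual values are images.
KERNELS = [
    np.array([[0, 0, 0], [0, 1, 0], [-1, 0, 0]], dtype=np.float32),
    np.array([[0, 0, 0], [0, 1, 0], [0, -1, 0]], dtype=np.float32),
    np.array([[0, 0, 0], [0, 1, 0], [0, 0, -1]], dtype=np.float32),
    np.array([[0, 0, 0], [0, 1, -1], [0, 0, 0]], dtype=np.float32),
]

def directional_gradients(frame):
    """Stack the 12 two-dimensional gradient maps (4 directions x 3 channels)."""
    maps = [convolve(frame[..., p], k) for k in KERNELS for p in range(3)]
    return np.stack(maps, axis=-1)

def gradient_order_loss(original, stylized, a=0.02):
    """Assumed combination of G_T, dilation and the stylized gradients.

    Positions where the original frame has a clear gradient direction
    (difference > a) are dilated into a mask, and the stylized frame is
    penalized wherever its gradient at those positions is reversed,
    which is what produces halos.
    """
    A_t = directional_gradients(original)
    B_t = directional_gradients(stylized)
    mask = binary_dilation(A_t > a)                   # G_T followed by dilation M
    reversed_grad = np.maximum(-B_t, 0.0)             # magnitude of reversed gradients
    return np.sum(mask * reversed_grad) / A_t.size    # normalized by D_m
```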
Step 3) training the video stylized network model:
the original video frame x of the target style image at the time of s and t in the training sample set t And original video frame x at time t +1 t+1 And x t And x t+1 The optical flow data is used as the input of the video stylized network model, K times of iterative training are carried out on the video stylized network model to obtain the trained video stylized network model, wherein K is more than or equal to 20000, and the specific implementation steps are as follows:
(3a) Initializing parameters of an encoder network and a decoder network, loading the parameters of a trained loss network, setting the iteration frequency as T, setting the maximum iteration frequency as K, wherein K is more than or equal to 20000, and enabling T =0 and T =1;
(3b) Original video frame x t And x t+1 Simultaneously inputting the video frames into an encoder network of the video stylized network model to obtain stylized video frames output by the encoder network
Figure BDA0002144558420000101
And
Figure BDA0002144558420000102
(3c) According to gradient order-preserving loss function
Figure BDA0002144558420000103
Calculating the t time x t And
Figure BDA0002144558420000104
gradient order-preserving loss in between, and time x at t +1 t+1 And
Figure BDA0002144558420000105
gradient order preserving loss is obtained, and the sum of the gradient order preserving loss at the time t and the time t +1 is obtained to obtain the trained gradient order preserving loss; the method is used for preventing the pixel value at the edge of the foreground target in the stylized video from being too smooth or gradient reversal, and eliminating the halo at the edge of the foreground target in the stylized video in the prior art;
(3d) X is to be t And x t+1
Figure BDA0002144558420000106
And
Figure BDA0002144558420000107
and simultaneously inputting the target style image s into a loss network of the video stylized network model, and extracting x t 、x t+1
Figure BDA0002144558420000108
S higher order features of sum, and according to spatial structure loss function
Figure BDA0002144558420000109
Calculating the t time x t High-order characteristics of,
Figure BDA00021445584200001010
And the loss of spatial structure between the higher order features of s, and the time t +1 x t+1 High-order characteristics of,
Figure BDA00021445584200001011
The space structure loss between the high-order characteristic of s and the high-order characteristic of s, and the sum of the space structure losses at the time t and the time t +1 is solved to obtain the space structure loss after training; the system comprises a video processing unit, a storage unit and a processing unit, wherein the video processing unit is used for constraining content information, spatial features and texture features of a stylized video;
(3e) Formatting video frames
Figure BDA0002144558420000111
And
Figure BDA0002144558420000112
inputting the video frame into a decoder network of a video stylized network model to obtain a reconstructed video frame
Figure BDA0002144558420000113
And
Figure BDA0002144558420000114
(3f) According to x t And x t+1 Optical flow data between
Figure BDA0002144558420000115
Affine transformation
Figure BDA0002144558420000116
Predicted value at time t
Figure BDA0002144558420000117
And according to a time consistency loss function
Figure BDA0002144558420000118
Computing
Figure BDA0002144558420000119
And
Figure BDA00021445584200001110
loss of time consistency between, i.e. loss of time consistency after training; the method is used for inhibiting the flickering and shaking phenomena of the stylized video and improving the smoothness and smoothness of the stylized video;
(3g) According to a reconstruction loss function
Figure BDA00021445584200001111
Calculating the t time x t And
Figure BDA00021445584200001112
reconstruction loss in between, and t +1 time x t+1 And
Figure BDA00021445584200001113
the sum of reconstruction losses at the t moment and the t +1 moment is obtained to obtain the reconstruction loss after training; the texture detail information used for keeping the original video;
(3h) Substituting the results calculated in steps (3 c), (3 d), (3 f) and (3 g) into the total loss function L of the video stylized network structure total Calculating the total loss of the trained video stylized network structure, and updating the parameters of the encoder network and the decoder network through the total loss of the video stylized network structure by using a gradient descent algorithm;
(3i) Judging whether T is equal to K, if so, obtaining a trained video stylized network model; otherwise, let t = t +1, and perform step (3 b).
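Steps (3a) to (3i) can be organized as a single differentiable training step. The sketch below (TensorFlow, matching the framework named in the experiments) treats the four loss functions as user-supplied callables and uses illustrative weights, so everything beyond the overall structure is an assumption:

```python
import tensorflow as tf

def make_train_step(encoder, decoder, optimizer, losses, weights=(1.0, 1.0, 1.0, 1.0)):
    """Build one training iteration covering steps (3b)-(3h).

    `losses` is a dict of callables {"spatial", "temporal", "gradient",
    "reconstruction"}; their exact formulas follow the patent and are not
    reproduced here. The weight values are illustrative.
    """
    alpha, beta, lam, gamma = weights

    @tf.function
    def train_step(x_t, x_t1, flow, style):
        with tf.GradientTape() as tape:
            s_t, s_t1 = encoder(x_t), encoder(x_t1)        # (3b) stylized frames
            r_t, r_t1 = decoder(s_t), decoder(s_t1)        # (3e) reconstructed frames
            total = (alpha * (losses["spatial"](x_t, s_t, style) +
                              losses["spatial"](x_t1, s_t1, style)) +      # (3d)
                     beta * losses["temporal"](r_t, r_t1, flow) +          # (3f)
                     lam * (losses["gradient"](x_t, s_t) +
                            losses["gradient"](x_t1, s_t1)) +              # (3c)
                     gamma * (losses["reconstruction"](x_t, r_t) +
                              losses["reconstruction"](x_t1, r_t1)))       # (3g)
        variables = encoder.trainable_variables + decoder.trainable_variables
        grads = tape.gradient(total, variables)                            # (3h)
        optimizer.apply_gradients(zip(grads, variables))
        return total

    return train_step
```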
Step 4) testing the trained video stylized network model:
taking a test sample set as a trained video stylized meshThe input of the encoder network in the network model is obtained as 1M r A/5 set of stylized video frame sequences;
step 5) obtaining a video style conversion result:
at frame rate N for each group of stylized video frame sequences f And combining frames according to the time sequence to obtain the video with the converted style.
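Step 5 amounts to re-multiplexing the stylized frames at frame rate N_f. A minimal OpenCV sketch follows; the codec and output path are illustrative choices, not specified by the patent:

```python
import cv2

def frames_to_video(frames, out_path, fps=25):
    """Recombine a stylized frame sequence into a video at frame rate N_f.

    `frames` is a list of HxWx3 uint8 BGR images in temporal order.
    """
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
```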
The effects of the present invention can be further illustrated by the following practical experiments:
1. Conditions of the experiment
The hardware test platform used in the experiment is: an Intel Core i7 CPU with a 3.60 GHz main frequency and 8 GB of memory. The software simulation platform is: Ubuntu 16.04 64-bit operating system and the PyCharm development platform. The simulation language is Python, and the deep learning framework is TensorFlow.
2. Analysis of experimental content and results
The experimental content is as follows: the same video data is style-converted using the method of the present invention and the method of Haozhi Huang et al., and the resulting style conversion results are shown in FIG. 3.
Fig. 3 (a) is a target-style image.
Fig. 3 (b) is the 10th original video frame in the ambush_1 scene of the Sintel dataset.
Fig. 3 (c) is the 9th original video frame in the ambush_1 scene of the Sintel dataset.
Fig. 3 (d) and 3 (e) are stylized video frames after the style conversion of fig. 3 (b) and 3 (c) by the prior art.
Fig. 3 (f) and 3 (g) are stylized video frames after the present invention style-converts fig. 3 (b) and 3 (c).
Comparing the experimental results in fig. 3 (f) and fig. 3 (g) with those in fig. 3 (d) and fig. 3 (e), the regions inside the white boxes show that the video style conversion effect of the method of the present invention is better. The video stylization model based on the self-coding structure and gradient order preservation used in the experiment effectively eliminates the halo generated around foreground objects of the stylized video in the prior art and provides sharp, halo-free foreground boundaries. It can also be seen that the new temporal consistency algorithm provided by the present invention effectively suppresses jitter and flicker in the stylized video, retains the texture detail information of the original video, and effectively improves the visual experience.
The above description is only one specific example of the present invention and does not constitute any limitation of the present invention. It will be apparent to persons skilled in the relevant art that various modifications and changes in form and detail can be made therein without departing from the principles and arrangements of the invention, but these modifications and changes are still within the scope of the invention as defined in the appended claims.

Claims (4)

1. A video style conversion method based on self-coding structure and gradient order preservation is characterized by comprising the following steps:
(1) Constructing a training sample set and a testing sample set:
(1a) Obtaining a target style image s and M_r video data of resolution N_r×N_r, splitting each video into frames at frame rate N_f to obtain M_r groups of original video frames x, and simultaneously extracting the optical flow data of each video to obtain M_r groups of optical flow data, where N_r ≥ 64, M_r ≥ 100, N_f ≥ 25;
(1b) Forming a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to these 4M_r/5 groups of original video frames, forming a training sample set from the target style image s and the training set, and forming a test sample set from the remaining M_r/5 groups of original video frames x;
(2) Constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
constructing a video stylized network structure comprising an encoder network, a decoder network, and a loss network (the stylized video frame is written x̂ and the reconstructed video frame x̃), wherein:
the encoder network, comprising an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, is used to generate the stylized video frame x̂;
the decoder network, comprising an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, is used to generate the reconstructed video frame x̃;
the loss network, comprising an input layer, a plurality of convolutional layers and a plurality of pooling layers, is used to extract the high-order features of the target style image s, the stylized video frame x̂ and the original video frame x;
(2b) Defining the total loss function L_total of the video stylized network structure:
Defining the total loss function L_total of the video stylized network structure, comprising a spatial structure loss function L_spatial, a temporal consistency loss function L_temporal, a gradient order-preserving loss function L_gradient and a reconstruction loss function L_reconstruction, as their weighted sum (the explicit formulas of the individual losses are presented as equation images in the patent publication):
L_total = α·L_spatial + β·L_temporal + λ·L_gradient + γ·L_reconstruction
wherein x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_spatial, L_temporal, L_gradient and L_reconstruction respectively; D = H×W×C, where H, W and C are respectively the height, width and number of channels of x_t, x̂_t and x̃_t; μ is the weight balancing the content loss function L_content and the style loss function L_style that make up L_spatial; F is an affine transformation (optical-flow warping) operation, and L_temporal is the mean squared error, normalized by D, between adjacent reconstructed video frames after one of them has been warped by F; L_reconstruction(x_t, x̃_t) penalizes the difference between the original video frame and its reconstruction; A_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x_t, and B_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x̂_t; G_T is a threshold function and M is a morphological dilation operation;
(3) Training a video stylized network model:
taking the target style image s, the original video frame x_t at time t and the original video frame x_{t+1} at time t+1 in the training sample set, together with the optical flow data between x_t and x_{t+1}, as the input of the video stylized network model, and performing K iterations of training on the video stylized network model to obtain the trained video stylized network model, where K ≥ 20000;
(4) Testing the trained video stylized network model:
taking the test sample set as the input of the encoder network in the trained video stylized network model to obtain M_r/5 groups of stylized video frame sequences;
(5) Acquiring a video style conversion result:
combining each group of stylized video frame sequences into a video at frame rate N_f in temporal order to obtain the style-converted video.
2. The method for converting video style based on self-coding structure and gradient order preservation according to claim 1, wherein the specific structures of the encoder network, the decoder network and the loss network in step (2a) are respectively:
the encoder network includes an input layer, four convolutional layers, five residual layers, and two deconvolution layers:
input layer → first convolution layer → second convolution layer → third convolution layer → first residual layer → second residual layer → third residual layer → fourth residual layer → fifth residual layer → first deconvolution layer → second deconvolution layer → fourth convolution layer;
the decoder network includes an input layer, three convolutional layers, two residual layers, and one deconvolution layer:
input layer → first convolution layer → second convolution layer → first residual layer → second residual layer → first deconvolution layer → third convolution layer;
the loss network adopts a pre-trained loss network and comprises an input layer, sixteen convolutional layers and four pooling layers:
input layer → first convolutional layer → second convolutional layer → first pooling layer → third convolutional layer → fourth convolutional layer → second pooling layer → fifth convolutional layer → sixth convolutional layer → seventh convolutional layer → eighth convolutional layer → third pooling layer → ninth convolutional layer → tenth convolutional layer → eleventh convolutional layer → twelfth convolutional layer → fourth pooling layer → thirteenth convolutional layer → fourteenth convolutional layer → fifteenth convolutional layer → sixteenth convolutional layer.
3. The method for converting video style based on self-coding structure and gradient order preservation according to claim 2, wherein the training of the video stylization network model in step (3) is implemented by the steps of:
(3a) Initializing the parameters of the encoder network and the decoder network, loading the parameters of the pre-trained loss network, denoting the iteration count by T and the maximum number of iterations by K, where K ≥ 20000, and letting t = 0 and T = 1;
(3b) Inputting the original video frames x_t and x_{t+1} simultaneously into the encoder network of the video stylized network model to obtain the stylized video frames x̂_t and x̂_{t+1} output by the encoder network;
(3c) According to the gradient order-preserving loss function L_gradient, calculating the gradient order-preserving loss between x_t and x̂_t at time t and the gradient order-preserving loss between x_{t+1} and x̂_{t+1} at time t+1, and summing the losses at times t and t+1 to obtain the trained gradient order-preserving loss;
(3d) Inputting x_t, x_{t+1}, x̂_t, x̂_{t+1} and the target style image s simultaneously into the loss network of the video stylized network model, extracting the high-order features of x_t, x_{t+1}, x̂_t, x̂_{t+1} and s, and, according to the spatial structure loss function L_spatial, calculating the spatial structure loss between the high-order features of x_t, x̂_t and s at time t and the spatial structure loss between the high-order features of x_{t+1}, x̂_{t+1} and s at time t+1, and summing the losses at times t and t+1 to obtain the trained spatial structure loss;
(3e) Inputting the stylized video frames x̂_t and x̂_{t+1} into the decoder network of the video stylized network model to obtain the reconstructed video frames x̃_t and x̃_{t+1};
(3f) According to the optical flow data between x_t and x_{t+1}, warping the reconstructed video frame x̃_{t+1} with the affine transformation F to obtain the predicted value at time t, and, according to the temporal consistency loss function L_temporal, calculating the temporal consistency loss between this prediction and x̃_t, i.e. the trained temporal consistency loss;
(3g) According to the reconstruction loss function L_reconstruction, calculating the reconstruction loss between x_t and x̃_t at time t and the reconstruction loss between x_{t+1} and x̃_{t+1} at time t+1, and summing the losses at times t and t+1 to obtain the trained reconstruction loss;
(3h) Substituting the results calculated in steps (3c), (3d), (3f) and (3g) into the total loss function L_total of the video stylized network structure to calculate the total loss of the trained video stylized network structure, and updating the parameters of the encoder network and the decoder network from this total loss using a gradient descent algorithm;
(3i) Judging whether T is equal to K; if so, the trained video stylized network model is obtained; otherwise letting t = t + 1 and returning to step (3b).
4. The method according to claim 1, wherein the style loss function L_style, the content loss function L_content, the three-dimensional gradient matrices A_t and B_t and the threshold function G_T in step (2b) are defined as follows (their explicit formulas are given as equation images in the patent publication):
the content loss L_content compares the feature maps φ_l(x_t) and φ_l(x̂) extracted by the loss network, and the style loss L_style compares the Gram matrices G_l(x̂) and G_l(s); here x_t is the original video frame at time t, x̂ is the stylized video frame, x̃_t is the reconstructed video frame at time t, φ_l is the feature map extracted by the l-th convolutional layer of the loss network, C_l, H_l and W_l are respectively the number of channels, height and width of φ_l, and G_l is the Gram matrix of the feature map extracted by the l-th convolutional layer of the loss network;
A_t is the three-dimensional gradient matrix obtained by cascading 12 two-dimensional gradient matrices computed on x_t, and B_t is the three-dimensional gradient matrix obtained by cascading 12 two-dimensional gradient matrices computed on x̂; each two-dimensional gradient matrix is obtained by convolving one channel with one convolution kernel K_q, where x_t^p is the p-th channel of x_t, x̂^p is the p-th channel of x̂, p is the channel index, p ∈ {1,2,3}; every pixel of x_t^p and x̂^p is convolved with convolution kernels K_q in the four directions lower-left, lower, lower-right and right, q being the index of the directional kernel K_q, q ∈ {1,2,3,4}; K_q(m,n) is the value of K_q at position (m,n), m and n being the row and column indices of K_q; i and j are the row and column indices of the two-dimensional gradient matrices; k_r is the width of K_q and k_c is the length of K_q;
G_T is a threshold function and a is an empirically derived threshold.
CN201910680259.8A 2019-07-26 2019-07-26 Video style conversion method based on self-coding structure and gradient order preservation Active CN110533579B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910680259.8A CN110533579B (en) 2019-07-26 2019-07-26 Video style conversion method based on self-coding structure and gradient order preservation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910680259.8A CN110533579B (en) 2019-07-26 2019-07-26 Video style conversion method based on self-coding structure and gradient order preservation

Publications (2)

Publication Number Publication Date
CN110533579A CN110533579A (en) 2019-12-03
CN110533579B (en) 2022-12-02

Family

ID=68661805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910680259.8A Active CN110533579B (en) 2019-07-26 2019-07-26 Video style conversion method based on self-coding structure and gradient order preservation

Country Status (1)

Country Link
CN (1) CN110533579B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111263226B (en) * 2020-01-17 2021-10-22 中国科学技术大学 Video processing method, video processing device, electronic equipment and medium
CN111556244B (en) * 2020-04-23 2022-03-11 北京百度网讯科技有限公司 Video style migration method and device
CN112561864B (en) * 2020-12-04 2024-03-29 深圳格瑞健康科技有限公司 Training method, system and storage medium for caries image classification model
CN113128614B (en) * 2021-04-29 2023-06-16 西安微电子技术研究所 Convolution method based on image gradient, neural network based on direction convolution and classification method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10152768B2 (en) * 2017-04-14 2018-12-11 Facebook, Inc. Artifact reduction for image style transfer
US10318889B2 (en) * 2017-06-26 2019-06-11 Konica Minolta Laboratory U.S.A., Inc. Targeted data augmentation using neural style transfer
CN107481185A (en) * 2017-08-24 2017-12-15 深圳市唯特视科技有限公司 A kind of style conversion method based on video image optimization
CN108924528B (en) * 2018-06-06 2020-07-28 浙江大学 Binocular stylized real-time rendering method based on deep learning

Also Published As

Publication number Publication date
CN110533579A (en) 2019-12-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant