CN110533579B - Video style conversion method based on self-coding structure and gradient order preservation - Google Patents
- Publication number
- CN110533579B (application CN201910680259.8A)
- Authority
- CN
- China
- Prior art keywords
- video
- layer
- stylized
- network
- loss
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/04—Context-preserving transformations, e.g. by using an importance map
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
Abstract
The invention provides a video style conversion method based on a self-coding structure and gradient order preservation, aimed at the technical problem that existing video style conversion methods produce halos at the edges of foreground objects in the stylized video. The implementation steps are: 1) construct a training sample set and a test sample set; 2) construct a video stylized network model; 3) train the video stylized network model; 4) test the trained video stylized network model; 5) obtain the video style conversion result. By constructing a video stylized network model based on a self-coding structure and a gradient order-preserving loss function, and redefining the temporal consistency constraint in a more reasonable way, the method effectively eliminates the halos generated at the edges of foreground objects in the stylized video, retains the texture detail of the original video, and improves the viewer's visual experience. It can be used for post-production of photographic, film and television works.
Description
Technical Field
The invention belongs to the technical field of digital image processing and relates to a video style conversion method, in particular to a video style conversion method based on a self-coding structure and gradient order preservation, which can be used for post-production of photographic, film and television works.
Background
Image generation is an important branch of computer vision, covering image super-resolution, image colorization, image semantic segmentation, style conversion of images or videos, and the like. Style conversion of images or videos is commonly treated as a texture synthesis problem: given a style image, texture is extracted from the source and transferred to the target to generate the corresponding style conversion result.
Image style conversion methods fall into two categories: traditional iterative methods and neural-network-based methods. Traditional iterative methods include stroke-based rendering, region-based rendering and example-based rendering; although they can faithfully reproduce a specific style without a CNN, they are limited in flexibility, style diversity and effective extraction of image structure. Neural-network-based methods, for example, extract the content features of an input image with a pre-trained convolutional neural network, extract its texture features with a Gram matrix, and iteratively optimize the output image so that its feature distribution matches the desired feature distribution in the convolutional neural network.
After image style conversion achieved notable results, many researchers turned their attention to video style conversion. Since a video is a sequence of frames, a natural approach builds on image style conversion: split the video into frames, stylize the frames with an image style conversion method to obtain consecutive stylized video frames, and combine them back into a stylized video. Video style conversion can be applied to post-production of film and television works, generating the corresponding style conversion result for a specified target style image.
At present, the following methods are mainly used for typical video style conversion:
Manuel Ruder et al. published the article "Artistic style transfer for videos" in 2016, disclosing an iteration-based video style conversion method that adds a temporal loss on top of image style conversion and introduces the concept of temporal consistency between adjacent video frames to penalize deviations between two frames. It ensures temporal consistency between adjacent stylized video frames and effectively prevents video flicker. However, because of the iterative optimization, video generation is very slow, which causes a high time cost.
Haozhi Huang et al. published "Real-Time Neural Style Transfer for Videos" at Computer Vision and Pattern Recognition in 2017, disclosing a feed-forward video style conversion method that shortens conversion time by training a style conversion feed-forward neural network with temporal consistency. Although this method improves the efficiency of video generation, it still has two drawbacks: first, halos appear around foreground objects in the generated stylized video, degrading the visual experience; second, because optical flow estimation is insufficiently accurate, the temporal consistency loss is computed in an unreasonable way (optical flow detected on the original video frames is not suitable for constraining temporal consistency between stylized video frames), which produces training errors and affects the fluency and continuity of the stylized video.
Disclosure of Invention
The invention aims to overcome the above defects of the prior art and provides a video style conversion method based on a self-coding structure and gradient order preservation, which improves the viewer's visual experience by eliminating the halos around foreground objects in the stylized video while maintaining the video style conversion speed.
The technical idea of the invention is as follows: first, a video stylized network structure is built on the idea of a self-coding structure to increase the generation speed of the stylized video; a new temporal consistency loss function is defined to suppress jitter and flicker in the stylized video; a reconstruction loss function is added to retain the detail of the original video in the stylized video; and a gradient order-preserving loss function is added to eliminate the halos around foreground objects in the stylized video and improve the viewer's visual experience. The specific steps are as follows:
(1) Constructing a training sample set and a testing sample set:
(1a) Obtain a target style image s and M_r video sequences of resolution N_r×N_r; split each video into frames at frame rate N_f to obtain M_r groups of original video frames x, and at the same time extract the optical flow data of each video to obtain M_r groups of optical flow data, where N_r ≥ 64, M_r ≥ 100 and N_f ≥ 25;
(1b) Form a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to those 4M_r/5 groups; the target style image s and the training set together form the training sample set, and the remaining M_r/5 groups of original video frames x form the test sample set;
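The 4/5 : 1/5 partition described in steps (1a) and (1b) can be sketched as follows; the group identifiers are hypothetical placeholders used only for illustration.

```python
# Sketch of step (1b): partition M_r groups of (frames, optical flow) into
# a training set of 4*M_r/5 groups and a test set of M_r/5 groups.

def split_samples(groups, train_fraction=4 / 5):
    """Split a list of sample groups into train and test subsets."""
    n_train = int(len(groups) * train_fraction)
    return groups[:n_train], groups[n_train:]

groups = [f"video_{i:03d}" for i in range(100)]   # M_r = 100 groups
train_set, test_set = split_samples(groups)
print(len(train_set), len(test_set))              # 80 20
```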
(2) Constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
constructing a video stylized network structure comprising an encoder network, a decoder network, and a loss network, wherein:
an encoder network, comprising an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, for generating the stylized video frame x̂;
a decoder network, comprising an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, for generating the reconstructed video frame x̃;
a loss network, comprising an input layer, a plurality of convolutional layers and a plurality of pooling layers, for extracting high-order features of the target style image s, the stylized video frame x̂ and the original video frame x;
(2b) Defining the total loss function L_total of the video stylized network structure:
Define the total loss function L_total of the video stylized network structure, comprising the spatial structure loss function L_spatial, the temporal consistency loss function L_temporal, the gradient order-preserving loss function L_gradient and the reconstruction loss function L_reconstruction:

L_total = α·L_spatial + β·L_temporal + λ·L_gradient + γ·L_reconstruction

where x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_spatial, L_temporal, L_gradient and L_reconstruction respectively; D = H×W×C, where H, W and C are respectively the height, width and number of channels of x_t, x̂_t and x̃_t;
μ is the weight balancing the content loss function L_content and the style loss function L_style; F is an affine transformation operation; A_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x_t, and B_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x̂_t; G_T is a threshold function, and M is a morphological dilation operation;
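A minimal sketch of the weighted total loss, assuming the four terms combine linearly with the weights α, β, λ and γ as described; the numeric loss values and default weights are illustrative only, not taken from the patent.

```python
def total_loss(l_spatial, l_temporal, l_gradient, l_reconstruction,
               alpha=1.0, beta=1.0, lam=1.0, gamma=1.0):
    """L_total = alpha*L_spatial + beta*L_temporal
                 + lambda*L_gradient + gamma*L_reconstruction."""
    return (alpha * l_spatial + beta * l_temporal
            + lam * l_gradient + gamma * l_reconstruction)

print(total_loss(2.0, 1.0, 0.5, 0.25))  # 3.75
```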
(3) Training a video stylized network model:
The target style image s, the original video frame x_t at time t and the original video frame x_{t+1} at time t+1 in the training sample set, together with the optical flow data between x_t and x_{t+1}, are taken as the input of the video stylized network model, and K iterations of training are performed on the video stylized network model to obtain the trained video stylized network model, where K ≥ 20000;
(4) Testing the trained video stylized network model:
Take the test sample set as the input of the encoder network in the trained video stylized network model to obtain M_r/5 groups of stylized video frame sequences;
(5) Acquiring a video style conversion result:
Combine each group of stylized video frames at frame rate N_f in temporal order to obtain the style-converted video.
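At inference time, steps (4) and (5) amount to pushing each frame through the trained encoder and reassembling the outputs in temporal order; the sketch below uses a hypothetical stand-in for the trained encoder network.

```python
# Stylize a sequence of frames one by one and keep the temporal order, as
# in steps (4)-(5); `encoder` stands in for the trained encoder network.

def stylize_video(frames, encoder):
    return [encoder(frame) for frame in frames]

frames = ["frame_0", "frame_1", "frame_2"]
stylized = stylize_video(frames, lambda f: f + "_stylized")
print(stylized)  # ['frame_0_stylized', 'frame_1_stylized', 'frame_2_stylized']
```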
Compared with the prior art, the invention has the following advantages:
1. Based on the idea of a self-coding structure, the invention constructs a video stylized network model containing a gradient order-preserving loss function and a reconstruction loss function. During training, these two losses constrain the intermediate values and output values of the encoder and decoder networks in the self-coding structure, preventing the pixel values at the edges of foreground objects in the stylized video from becoming over-smoothed or gradient-reversed. This effectively eliminates the halos at the edges of foreground objects that occur in the prior art, yields sharp halo-free foreground boundaries, retains the texture detail of the original video in the stylized video, and noticeably improves the viewer's visual experience.
2. Based on the idea of a self-coding structure, the invention constructs a video stylized network model containing a redefined temporal consistency loss function, which computes the mean square error between the pixels of two adjacent reconstructed video frames. Because the reconstructed video frames are essentially similar to the original video frames in spatial structure, this avoids the prior-art computation on two adjacent stylized video frames, reduces the error introduced by applying optical flow estimated on the original video to video style conversion, effectively suppresses flicker and jitter in the stylized video, improves its smoothness and continuity, and further improves the viewer's visual experience.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of a video stylized network architecture constructed in accordance with the present invention;
FIG. 3 is a comparison of the video style conversion results of the present invention and the prior art.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Referring to fig. 1, the implementation steps of the invention are as follows:
step 1) constructing a training sample set and a test sample set:
(1a) Obtain a target style image s and M_r video sequences of resolution N_r×N_r; split each video into frames at frame rate N_f to obtain M_r groups of original video frames x, and at the same time extract the optical flow data of each video to obtain M_r groups of optical flow data, where N_r ≥ 64, M_r ≥ 100 and N_f ≥ 25;
(1b) Form a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to those 4M_r/5 groups; the target style image s and the training set together form the training sample set, and the remaining M_r/5 groups of original video frames x form the test sample set;
In existing video data sets, most videos have a rectangular resolution, but the optical-flow evaluation and extraction used here operates on square video frames of resolution N_r×N_r; the original rectangular videos are therefore adjusted to square videos for the training set. Since a down-sampling operation is performed during training and videos of too small a resolution give unsatisfactory results after style conversion, the input resolution has a lower limit, N_r ≥ 64. In this embodiment, a target style image is obtained and the 124 videos contained in the Sintel and DAVIS data sets are used: the 124 videos are split into frames and optical flow is extracted with the FlowNet2 algorithm, yielding 124 groups of original video frames at resolution 256×256 and 124 corresponding groups of optical flow data; 102 groups of video frames and their optical flow data form the training set, which together with the target style image forms the training sample set, and the remaining 22 groups of video frames form the test sample set.
Step 2), constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
Referring to FIG. 2, a video stylized network structure is constructed comprising an encoder network, a decoder network and a loss network, wherein the encoder network and the decoder network together form a self-coding structure that generates intermediate values and output values, providing the quantities constrained by the respective loss functions:
the encoder network comprises an input layer, four convolutional layers, five residual layers and two deconvolution layers, and is used to generate the stylized video frame x̂:
Input layer → first convolution layer → second convolution layer → third convolution layer → first residual layer → second residual layer → third residual layer → fourth residual layer → fifth residual layer → first deconvolution layer → second deconvolution layer → fourth convolution layer;
the decoder network comprises an input layer, three convolutional layers, two residual layers and one deconvolution layer, and is used to generate the reconstructed video frame x̃:
Input layer → first convolution layer → second convolution layer → first residual layer → second residual layer → first deconvolution layer → third convolution layer;
the loss network adopts the first sixteen convolutional layers of a pre-trained VGG-19, together with an input layer and four pooling layers, and is used to extract high-order features of the target style image s, the stylized video frame x̂ and the original video frame x:
input layer → first convolutional layer → second convolutional layer → first pooling layer → third convolutional layer → fourth convolutional layer → second pooling layer → fifth convolutional layer → sixth convolutional layer → seventh convolutional layer → eighth convolutional layer → third pooling layer → ninth convolutional layer → tenth convolutional layer → eleventh convolutional layer → twelfth convolutional layer → fourth pooling layer → thirteenth convolutional layer → fourteenth convolutional layer → fifteenth convolutional layer → sixteenth convolutional layer.
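The layer counts above (sixteen convolutional layers and four pooling layers drawn from VGG-19) can be checked with a short sketch; the conv{block}_{index} names follow the usual VGG-19 naming convention and are not taken from the patent.

```python
# Build the layer sequence of the loss network: VGG-19's convolutional
# blocks hold 2, 2, 4, 4 and 4 convolutions, with a pooling layer after
# each of the first four blocks.

def vgg19_loss_network_layers():
    layers = ["input"]
    for block, n_convs in enumerate([2, 2, 4, 4, 4], start=1):
        layers += [f"conv{block}_{i}" for i in range(1, n_convs + 1)]
        if block < 5:                  # only four pooling layers are used
            layers.append(f"pool{block}")
    return layers

layers = vgg19_loss_network_layers()
print(sum(name.startswith("conv") for name in layers),   # 16
      sum(name.startswith("pool") for name in layers))   # 4
```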
Wherein the parameter settings of each layer of the decoder network and the encoder network are as follows:
(2b) Defining the total loss function L_total of the video stylized network structure:
Define the total loss function L_total of the video stylized network structure, comprising the spatial structure loss function L_spatial, the temporal consistency loss function L_temporal, the gradient order-preserving loss function L_gradient and the reconstruction loss function L_reconstruction:

L_total = α·L_spatial + β·L_temporal + λ·L_gradient + γ·L_reconstruction

where x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_spatial, L_temporal, L_gradient and L_reconstruction respectively; D = H×W×C, where H, W and C are respectively the height, width and number of channels of x_t, x̂_t and x̃_t;
μ is the weight balancing the content loss function L_content and the style loss function L_style; F is an affine transformation operation; A_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x_t, and B_t is a three-dimensional gradient matrix with D_m elements obtained by convolution on x̂_t; G_T is a threshold function, and M is a morphological dilation operation;
wherein the style loss function L_style, the content loss function L_content, the three-dimensional gradient matrices A_t and B_t and the threshold function G_T are respectively defined as follows:
where x_t is the original video frame at time t, x̂_{t+1} is the stylized video frame at time t+1, and x̃_t is the reconstructed video frame at time t; φ_l is the feature map extracted by the l-th convolutional layer of the loss network; C_l, H_l and W_l are respectively the number of channels, height and width of φ_l; and G_l is the Gram matrix of the feature map extracted by the l-th convolutional layer of the loss network;
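As an illustration of the Gram matrix G_l used in the style loss, a minimal NumPy sketch; normalizing by C_l·H_l·W_l is a common convention consistent with the quantities defined above, not necessarily the patent's exact formula.

```python
import numpy as np

def gram_matrix(phi):
    """Gram matrix of a feature map phi with shape (C_l, H_l, W_l):
    channel-wise inner products of the flattened spatial dimensions,
    normalized by C_l * H_l * W_l."""
    c, h, w = phi.shape
    features = phi.reshape(c, h * w)
    return features @ features.T / (c * h * w)

phi = np.ones((2, 3, 4))        # toy feature map: C_l=2, H_l=3, W_l=4
g = gram_matrix(phi)
print(g.shape)                  # (2, 2); every entry is 12/24 = 0.5
```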
A_t is a three-dimensional gradient matrix obtained by concatenating 12 two-dimensional gradient matrices computed on x_t; B_t is the three-dimensional gradient matrix obtained in the same way on x̂_t. Each two-dimensional gradient matrix is obtained by convolving a convolution kernel K_q over one channel.
For all pixels on the p-th channel of x_t and of x̂_t, where p is the channel index and p ∈ {1, 2, 3}, convolution kernels K_q in the four directions lower-left, lower, lower-right and right are applied, where q ∈ {1, 2, 3, 4} indexes the direction; K_q(m, n) is the value at the (m, n)-th position of K_q, with m and n the row and column indices of K_q; i and j are the row and column indices of the channel; k_r is the width of K_q and k_c is its length. In this embodiment the convolution operation is performed with four direction kernels. G_T, in brief, encodes the gradient order, i.e. the direction information of the gradient, taken in one direction at a time: for example, if the pixel value of the current pixel exceeds that of its lower-right neighbour by the threshold a, the corresponding position in G_T is assigned 1, so that a G_T value of 1 at a point indicates that the pixel value there is greater than the pixel value of the point to its lower right. The value of the threshold a is determined empirically.
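A minimal sketch of the gradient-order idea: a directional gradient taken as a pixel difference in one direction (here, toward the right neighbour) and the threshold function G_T marking where that gradient exceeds a threshold a. The patent's exact kernels K_q are not reproduced; the simple difference and the threshold value are illustrative assumptions.

```python
import numpy as np

def directional_gradient_right(x):
    """Difference between each pixel and its right-hand neighbour."""
    return x[:, :-1] - x[:, 1:]

def threshold_gradient(grad, a=0.0):
    """G_T: 1 where the directional gradient exceeds the threshold a."""
    return (grad > a).astype(np.uint8)

x = np.array([[5.0, 3.0, 3.0],
              [1.0, 4.0, 2.0]])
grad = directional_gradient_right(x)   # [[ 2.  0.] [-3.  2.]]
print(threshold_gradient(grad))        # [[1 0] [0 1]]
```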
Step 3) training the video stylized network model:
The target style image s, the original video frame x_t at time t and the original video frame x_{t+1} at time t+1 in the training sample set, together with the optical flow data between x_t and x_{t+1}, are taken as the input of the video stylized network model, and K iterations of training are performed to obtain the trained video stylized network model, where K ≥ 20000. The specific implementation steps are:
(3a) Initialize the parameters of the encoder network and the decoder network, load the parameters of the pre-trained loss network, denote the iteration counter by T and the maximum number of iterations by K with K ≥ 20000, and let T = 0 and t = 1;
(3b) Input the original video frames x_t and x_{t+1} simultaneously into the encoder network of the video stylized network model to obtain the stylized video frames x̂_t and x̂_{t+1} output by the encoder network;
(3c) According to the gradient order-preserving loss function L_gradient, compute the gradient order-preserving loss between x_t and x̂_t at time t and between x_{t+1} and x̂_{t+1} at time t+1, and sum the losses at times t and t+1 to obtain the gradient order-preserving loss of this iteration; this loss prevents the pixel values at the edges of foreground objects in the stylized video from becoming over-smoothed or gradient-reversed, eliminating the halos at the edges of foreground objects in the prior art;
(3d) Input x_t, x_{t+1}, x̂_t, x̂_{t+1} and the target style image s simultaneously into the loss network of the video stylized network model, extract the high-order features of x_t, x_{t+1}, x̂_t, x̂_{t+1} and s, and, according to the spatial structure loss function L_spatial, compute the spatial structure loss between the high-order features of x_t, x̂_t and s at time t and between the high-order features of x_{t+1}, x̂_{t+1} and s at time t+1; sum the losses at times t and t+1 to obtain the spatial structure loss of this iteration; this loss constrains the content information, spatial features and texture features of the stylized video;
(3e) Input the stylized video frames x̂_t and x̂_{t+1} into the decoder network of the video stylized network model to obtain the reconstructed video frames x̃_t and x̃_{t+1};
(3f) According to the optical flow data between x_t and x_{t+1}, obtain the predicted value at time t through the affine transformation F, and, according to the temporal consistency loss function L_temporal, compute the temporal consistency loss between the prediction and the reconstructed video frame, i.e. the temporal consistency loss of this iteration; this loss suppresses flicker and jitter in the stylized video and improves its smoothness and continuity;
(3g) According to the reconstruction loss function L_reconstruction, compute the reconstruction loss between x_t and x̃_t at time t and between x_{t+1} and x̃_{t+1} at time t+1, and sum the losses at times t and t+1 to obtain the reconstruction loss of this iteration; this loss retains the texture detail of the original video;
(3h) Substitute the results computed in steps (3c), (3d), (3f) and (3g) into the total loss function L_total of the video stylized network structure to compute the total loss, and update the parameters of the encoder network and the decoder network from the total loss using a gradient descent algorithm;
(3i) Judge whether T equals K: if so, the trained video stylized network model is obtained; otherwise, let T = T + 1 and return to step (3b).
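The flow of one training iteration, steps (3b) through (3h), can be summarized with the skeleton below; the encoder, decoder and optical-flow warp are stand-in identity functions and the spatial and gradient losses are left as placeholders, so only the structure of the loss computation is illustrated, not the patent's actual networks.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def train_step(x_t, x_t1, encoder, decoder, warp, weights):
    """One iteration: stylize (3b), reconstruct (3e), combine losses (3h)."""
    alpha, beta, lam, gamma = weights
    y_t, y_t1 = encoder(x_t), encoder(x_t1)             # stylized frames (3b)
    r_t, r_t1 = decoder(y_t), decoder(y_t1)             # reconstructed frames (3e)
    l_temporal = mse(warp(r_t), r_t1)                   # (3f) warped vs next frame
    l_reconstruction = mse(x_t, r_t) + mse(x_t1, r_t1)  # (3g)
    l_spatial = 0.0                                     # (3d) placeholder
    l_gradient = 0.0                                    # (3c) placeholder
    return (alpha * l_spatial + beta * l_temporal
            + lam * l_gradient + gamma * l_reconstruction)

x_t, x_t1 = np.zeros((4, 4)), np.ones((4, 4))
identity = lambda z: z
loss = train_step(x_t, x_t1, identity, identity, identity, (1, 1, 1, 1))
print(loss)  # 1.0: temporal MSE of 1.0, zero reconstruction error
```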
Step 4) testing the trained video stylized network model:
Take the test sample set as the input of the encoder network in the trained video stylized network model to obtain M_r/5 groups of stylized video frame sequences;
step 5) obtaining a video style conversion result:
Combine each group of stylized video frames at frame rate N_f in temporal order to obtain the style-converted video.
The effects of the present invention can be further illustrated by the following practical experiments:
1. Experimental conditions
The hardware platform used in the experiment is an Intel Core i7 CPU with a 3.60 GHz main frequency and 8 GB of memory. The software platform is the Ubuntu 16.04 64-bit operating system with the PyCharm development environment; the implementation language is Python, and the deep learning framework is TensorFlow.
2. Analysis of experimental content and results
Experimental content: the same video data is style-converted with the method of the present invention and with the method of Haozhi Huang et al.; the resulting style conversion results are shown in FIG. 3.
FIG. 3(a) is the target style image.
FIG. 3(b) is the 10th original video frame of the ambush_1 scene of the Sintel data set.
FIG. 3(c) is the 9th original video frame of the ambush_1 scene of the Sintel data set.
FIG. 3(d) and FIG. 3(e) are the stylized video frames obtained by converting FIG. 3(b) and FIG. 3(c) with the prior art.
FIG. 3(f) and FIG. 3(g) are the stylized video frames obtained by converting FIG. 3(b) and FIG. 3(c) with the present invention.
Comparing the results in FIG. 3(f) and FIG. 3(g) with those in FIG. 3(d) and FIG. 3(e), the regions in the white boxes show that the video style conversion effect of the method of the present invention is better. The video stylization model based on the self-coding structure and gradient order preservation used in the experiment effectively eliminates the halos generated around foreground objects of the stylized video in the prior art and yields sharp, halo-free foreground boundaries. The new temporal consistency algorithm of the present invention also effectively suppresses jitter and flicker in the stylized video, retains the texture detail of the original video, and clearly improves the viewer's visual experience.
The above description is only one specific example of the present invention and does not constitute any limitation of the present invention. It will be apparent to persons skilled in the relevant art that various modifications and changes in form and detail can be made therein without departing from the principles and arrangements of the invention, but these modifications and changes are still within the scope of the invention as defined in the appended claims.
Claims (4)
1. A video style conversion method based on self-coding structure and gradient order preservation is characterized by comprising the following steps:
(1) Constructing a training sample set and a testing sample set:
(1a) Acquiring a target style image s and M_r pieces of video data with resolution N_r×N_r, splitting each video data into frames at a frame rate N_f to obtain M_r groups of original video frames x, and simultaneously extracting the optical flow data of each video data to obtain M_r groups of optical flow data, wherein N_r ≥ 64, M_r ≥ 100, N_f ≥ 25;
(1b) Forming a training set from 4M_r/5 groups of original video frames x and the optical flow data corresponding to these 4M_r/5 groups of original video frames; the target style image s and the training set form the training sample set, and the remaining M_r/5 groups of original video frames x form the test sample set;
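As an aside (not part of the claim), the 4:1 split of step (1b) can be sketched in a few lines of Python; the names here are illustrative only:

```python
def split_dataset(video_groups):
    """Split M_r groups of original video frames into a training part
    (4*M_r/5 groups, later paired with their optical flow data) and a
    test part (the remaining M_r/5 groups), as in step (1b)."""
    m_r = len(video_groups)
    n_train = 4 * m_r // 5
    return video_groups[:n_train], video_groups[n_train:]

groups = [f"group_{i}" for i in range(100)]  # M_r = 100, the claim's minimum
train, test = split_dataset(groups)
```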
(2) Constructing a video stylized network model:
(2a) Constructing a video stylized network structure:
constructing a video stylized network structure comprising an encoder network, a decoder network, and a loss network, wherein:
an encoder network, comprising an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, for generating the stylized video frames x̂;
a decoder network, comprising an input layer, a plurality of convolutional layers, a plurality of residual layers and a plurality of deconvolution layers, for generating the reconstructed video frames x̃;
a loss network, comprising an input layer, a plurality of convolutional layers and a plurality of pooling layers, for extracting the high-order features of the target style image s, the stylized video frame x̂ and the original video frame x;
(2b) Defining the total loss function L_total of the video stylized network structure as the weighted sum of a spatial structure loss function L_s, a time consistency loss function L_t, a gradient order-preserving loss function L_g and a reconstruction loss function L_r:

L_total = α·L_s + β·L_t + λ·L_g + γ·L_r

wherein x_t is the original video frame at time t, x̂_t is the stylized video frame at time t, and x̃_t is the reconstructed video frame at time t; α, β, λ and γ are the weights of L_s, L_t, L_g and L_r respectively; D = H×W×C, where H, W and C are the height, width and number of channels of x_t, x̂_t and x̃_t;
μ is the weight balancing the content loss function L_c and the style loss function L_style within the spatial structure loss; F is an affine transformation operation; A_t is the three-dimensional gradient matrix with D_m elements obtained by convolution over x_t, and B_t is the three-dimensional gradient matrix with D_m elements obtained by convolution over x̂_t; G_T is a threshold function and M is a morphological dilation operation;
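Not part of the claim, but as an illustration: assuming the total loss is the weighted sum implied by the listed weights α, β, λ and γ, it could be sketched as follows (the weight and loss values here are arbitrary placeholders, not the patent's):

```python
def total_loss(l_s, l_t, l_g, l_r, alpha, beta, lam, gamma):
    """Weighted sum of the four loss terms:
    L_total = alpha*L_s + beta*L_t + lam*L_g + gamma*L_r."""
    return alpha * l_s + beta * l_t + lam * l_g + gamma * l_r

# arbitrary placeholder values for the four losses and their weights
L = total_loss(2.0, 1.0, 0.5, 0.25, alpha=1.0, beta=10.0, lam=1.0, gamma=1.0)
```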
(3) Training a video stylized network model:
taking the target style image s, the original video frame x_t at time t and the original video frame x_{t+1} at time t+1 from the training sample set, together with the optical flow data between x_t and x_{t+1}, as the input of the video stylized network model, and performing K iterations of training on the video stylized network model to obtain the trained video stylized network model, wherein K ≥ 20000;
(4) Testing the trained video stylized network model:
taking the test sample set as the input of the encoder network in the trained video stylized network model to obtain M_r/5 groups of stylized video frame sequences;
(5) Acquiring a video style conversion result:
combining, for each group of stylized video frame sequences, the frames in time order at the frame rate N_f to obtain the style-converted video.
2. The method for transforming video style based on self-coding structure and gradient order preservation according to claim 1, wherein the specific structures of the encoder network, the decoder network and the loss network in step (2 a) are respectively:
the encoder network includes an input layer, four convolutional layers, five residual layers, and two deconvolution layers:
input layer → first convolution layer → second convolution layer → third convolution layer → first residual layer → second residual layer → third residual layer → fourth residual layer → fifth residual layer → first deconvolution layer → second deconvolution layer → fourth convolution layer;
the decoder network includes an input layer, three convolutional layers, two residual layers, and one deconvolution layer:
input layer → first convolution layer → second convolution layer → first residual layer → second residual layer → first deconvolution layer → third convolution layer;
the loss network adopts a trained loss network and comprises an input layer, sixteen convolutional layers and four pooling layers:
the input layer → the first convolutional layer → the second convolutional layer → the first pooling layer → the third convolutional layer → the fourth convolutional layer → the second pooling layer → the fifth convolutional layer → the sixth convolutional layer → the seventh convolutional layer → the eighth convolutional layer → the third pooling layer → the ninth convolutional layer → the tenth convolutional layer → the eleventh convolutional layer → the twelfth convolutional layer → the fourth pooling layer → the thirteenth convolutional layer → the fourteenth convolutional layer → the fifteenth convolutional layer → the sixteenth convolutional layer.
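As a structural sketch only (layer hyper-parameters such as kernel sizes, strides and channel counts are not specified in this claim), the three layer sequences can be written as Python lists and their counts checked:

```python
# Layer sequences of the three sub-networks as given in claim 2.
ENCODER = (["input"] + ["conv"] * 3 + ["residual"] * 5
           + ["deconv"] * 2 + ["conv"])        # 4 convs, 5 residuals, 2 deconvs
DECODER = (["input"] + ["conv"] * 2 + ["residual"] * 2
           + ["deconv"] + ["conv"])            # 3 convs, 2 residuals, 1 deconv
LOSS_NET = (["input"]
            + ["conv"] * 2 + ["pool"]          # convs 1-2,   pool 1
            + ["conv"] * 2 + ["pool"]          # convs 3-4,   pool 2
            + ["conv"] * 4 + ["pool"]          # convs 5-8,   pool 3
            + ["conv"] * 4 + ["pool"]          # convs 9-12,  pool 4
            + ["conv"] * 4)                    # convs 13-16, no final pool
```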
3. The method for converting video style based on self-coding structure and gradient order preservation according to claim 2, wherein the training of the video stylization network model in step (3) is implemented by the steps of:
(3a) Initializing the parameters of the encoder network and the decoder network, loading the parameters of the trained loss network, setting the iteration counter to T and the maximum number of iterations to K, wherein K ≥ 20000, and letting t = 0 and T = 1;
(3b) Inputting the original video frames x_t and x_{t+1} simultaneously into the encoder network of the video stylized network model to obtain the stylized video frames x̂_t and x̂_{t+1} output by the encoder network;
(3c) According to the gradient order-preserving loss function L_g, calculating the gradient order-preserving loss between x_t and x̂_t at time t and the gradient order-preserving loss between x_{t+1} and x̂_{t+1} at time t+1, and summing the losses at times t and t+1 to obtain the gradient order-preserving loss of this training step;
(3d) Inputting x_t, x_{t+1}, x̂_t, x̂_{t+1} and the target style image s simultaneously into the loss network of the video stylized network model, extracting the high-order features of x_t, x_{t+1}, x̂_t, x̂_{t+1} and s, and, according to the spatial structure loss function L_s, calculating the spatial structure loss among the high-order features of x_t, x̂_t and s at time t and the spatial structure loss among the high-order features of x_{t+1}, x̂_{t+1} and s at time t+1, then summing the losses at times t and t+1 to obtain the spatial structure loss of this training step;
(3e) Inputting the stylized video frames x̂_t and x̂_{t+1} into the decoder network of the video stylized network model to obtain the reconstructed video frames x̃_t and x̃_{t+1};
(3f) According to the optical flow data between x_t and x_{t+1}, affine-transforming x̂_{t+1} into the predicted value at time t, and, according to the time consistency loss function L_t, computing the time consistency loss between x̂_t and this predicted value, which is the time consistency loss of this training step;
(3g) According to the reconstruction loss function L_r, calculating the reconstruction loss between x_t and x̃_t at time t and the reconstruction loss between x_{t+1} and x̃_{t+1} at time t+1, and summing the losses at times t and t+1 to obtain the reconstruction loss of this training step;
(3h) Substituting the results calculated in steps (3c), (3d), (3f) and (3g) into the total loss function L_total of the video stylized network structure, calculating the total loss, and updating the parameters of the encoder network and the decoder network via the total loss using a gradient descent algorithm;
(3i) Judging whether T is equal to K; if so, the trained video stylized network model is obtained; otherwise, letting T = T + 1 and t = t + 1, and returning to step (3b).
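The control flow of steps (3a)-(3i) can be illustrated with a toy stand-in: a single scalar parameter replaces the encoder/decoder weights, and a simple squared error replaces L_total. This is only a sketch of the iterate / compute-loss / update loop, not the patent's actual losses or networks:

```python
import numpy as np

def train_stub(frames, K=200, lr=0.05):
    """Toy version of the training loop in steps (3a)-(3i):
    a single scalar `theta` stands in for the encoder/decoder parameters,
    and a squared error toward a fixed 'style gain' stands in for L_total."""
    theta = 0.0                                   # (3a) initialize parameters
    target_gain = 2.0                             # toy target: stylized = 2 * x
    for T in range(1, K + 1):                     # (3i) stop after T == K
        for x_t in frames:
            s_t = theta * x_t                     # (3b) 'stylize' the frame
            residual = s_t - target_gain * x_t    # (3c)-(3g) toy loss residual
            grad = np.mean(2.0 * residual * x_t)  # (3h) gradient of the loss
            theta -= lr * grad                    #      gradient descent update
    return theta

rng = np.random.default_rng(0)
frames = [rng.random((8, 8)) for _ in range(4)]
theta = train_stub(frames)                        # converges toward 2.0
```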
4. The method according to claim 1, wherein the style loss function L_style, the content loss function L_c, the three-dimensional gradient matrix A_t, the three-dimensional gradient matrix B_t and the threshold function G_T in step (2b) are respectively:

L_c = Σ_l (1/(C_l·H_l·W_l)) ‖φ_l(x̂_t) − φ_l(x_t)‖²₂,  L_style = Σ_l ‖G_l(x̂_t) − G_l(s)‖²_F

wherein x_t is the original video frame at time t, x̂_{t+1} is the stylized video frame at time t+1, and x̃_t is the reconstructed video frame at time t; φ_l is the feature map extracted by the l-th convolutional layer of the loss network; C_l, H_l and W_l are the number of channels, height and width of φ_l respectively; G_l is the Gram matrix of the feature map extracted by the l-th convolutional layer of the loss network;
A_t is the three-dimensional gradient matrix obtained by cascading the 12 two-dimensional gradient matrices ∇x_t^{p,q}; B_t is the three-dimensional gradient matrix obtained by cascading the 12 two-dimensional gradient matrices ∇x̂_t^{p,q}; ∇x_t^{p,q} is the two-dimensional gradient matrix obtained by convolving the kernel K_q over x_t^p, and ∇x̂_t^{p,q} is the two-dimensional gradient matrix obtained by convolving the kernel K_q over x̂_t^p, wherein ∇x_t^{p,q} and ∇x̂_t^{p,q} are respectively:

∇x_t^{p,q}(i,j) = Σ_{m=1}^{k_r} Σ_{n=1}^{k_c} K_q(m,n)·x_t^p(i+m, j+n),  ∇x̂_t^{p,q}(i,j) = Σ_{m=1}^{k_r} Σ_{n=1}^{k_c} K_q(m,n)·x̂_t^p(i+m, j+n)

x_t^p is the p-th channel of x_t and x̂_t^p is the p-th channel of x̂_t, p being the channel index, p ∈ {1,2,3}; for all pixels of x_t^p and x̂_t^p, convolution kernels K_q in four directions, including the lower-left, lower and lower-right directions, perform the convolution operation, q being the index of the direction of K_q, q ∈ {1,2,3,4}; K_q(m,n) is the value of K_q at the (m,n)-th position, m and n being the row and column indices of K_q; i and j are the row and column indices of ∇x_t^{p,q} and ∇x̂_t^{p,q}; k_r is the width of K_q and k_c is its length;
G_T is a threshold function and a is an empirically derived threshold.
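Assuming G_l is the usual Gram matrix of a feature map (channel-by-channel inner products over the flattened spatial positions, the standard definition in style transfer, not fully legible in the surviving claim text), it can be computed as:

```python
import numpy as np

def gram_matrix(phi):
    """Gram matrix of a feature map phi of shape (C_l, H_l, W_l):
    G[i, j] is the inner product of channels i and j over all
    spatial positions (no normalization applied here)."""
    c, h, w = phi.shape
    f = phi.reshape(c, h * w)
    return f @ f.T                            # shape (C_l, C_l)

phi = np.arange(24, dtype=float).reshape(2, 3, 4)
G = gram_matrix(phi)
```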
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910680259.8A CN110533579B (en) | 2019-07-26 | 2019-07-26 | Video style conversion method based on self-coding structure and gradient order preservation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110533579A CN110533579A (en) | 2019-12-03 |
CN110533579B true CN110533579B (en) | 2022-12-02 |
Family
ID=68661805
Legal Events

Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||