CN106686472A - High-frame-rate video generation method and system based on deep learning - Google Patents

High-frame-rate video generation method and system based on deep learning

Info

Publication number
CN106686472A
Authority
CN
China
Prior art keywords
frame
video
convolutional neural
rate
neural networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611241691.XA
Other languages
Chinese (zh)
Other versions
CN106686472B (en)
Inventor
王兴刚
罗浩
姜玉静
刘文予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201611241691.XA priority Critical patent/CN106686472B/en
Publication of CN106686472A publication Critical patent/CN106686472A/en
Application granted granted Critical
Publication of CN106686472B publication Critical patent/CN106686472B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N 19/587 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/01 Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
    • H04N 7/0127 Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-frame-rate video generation method based on deep learning. The method includes the following steps: one or more original high-frame-rate video segments are used to generate a training sample set; multiple video frame subsets in the training sample set are used to train a dual-channel convolutional neural network model to obtain an optimized dual-channel convolutional neural network, where the dual-channel convolutional neural network model is a convolutional neural network formed by fusing two convolutional channels; and the optimized dual-channel convolutional neural network generates, for any two adjacent video frames in a low-frame-rate video, a frame to be inserted between them, thereby producing a video whose frame rate is higher than that of the low-frame-rate video. The whole process is end to end, no post-processing of the video frames is needed, the frame-rate conversion effect is good, the synthesized video is highly fluent, and the method is robust to camera shake, scene changes, and similar problems arising during video shooting.

Description

High-frame-rate video generation method and system based on deep learning
Technical field
The invention belongs to the technical field of computer vision, and more particularly relates to a high-frame-rate video generation method and system based on deep learning.
Background technology
With the development of science and technology, the ways in which people obtain video have become more and more convenient. For hardware reasons, however, most video is captured by non-professional equipment, and its frame rate is typically only 24-30 fps. High-frame-rate video is smoother and provides a better visual experience. If people directly upload high-frame-rate video to the Internet, the increased traffic consumption also increases their costs. If low-frame-rate video is uploaded instead, frame loss inevitably occurs during transmission for network reasons, and the larger the video, the more likely this phenomenon becomes, so the video quality at the remote end cannot be effectively guaranteed, which greatly harms the user experience. It is therefore necessary to post-process the uploaded video at the remote end in a reasonable way, so that the video quality can meet people's needs and even further improve their experience.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the present invention provides a high-frame-rate video generation method based on deep learning, whose object is to convert low-frame-rate video into high-frame-rate video, thereby solving the technical problem that frame loss of low-frame-rate video during network transmission degrades video quality and harms the user experience.
To achieve the above object, according to one aspect of the present invention, there is provided a high-frame-rate video generation method based on deep learning, comprising the following steps:
(1) generating a training sample set from one or more original high-frame-rate video segments, the training sample set comprising multiple video frame subsets, each video frame subset containing two training frames and one reference frame, the two training frames being two video frames separated by one or more frames in a high-frame-rate video segment, and the reference frame being any one of the frames in the interval between the two training frames; the frame rate of the high-frame-rate video segments is higher than a set frame-rate threshold;
(2) training a dual-channel convolutional neural network model with the multiple video frame subsets in the training sample set to obtain an optimized dual-channel convolutional neural network; the dual-channel convolutional neural network model is a convolutional neural network formed by fusing two convolutional channels, the two convolutional channels respectively receive the two video frames in a video frame subset as input and separately convolve the input frames, the model fuses the convolution results of the two convolutional channels and outputs a predicted frame, and the model is trained by regression of the predicted frame against the reference frame in the video frame subset;
(3) using the optimized dual-channel convolutional neural network to generate, for any two adjacent video frames in a low-frame-rate video, a frame to be inserted between them, thereby generating a video whose frame rate is higher than that of the low-frame-rate video.
In one embodiment of the present invention, each convolutional channel in the dual-channel convolutional neural network model comprises k convolutional layers, where k > 0, and the mathematical description of each convolutional layer is:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input video frame is layer 0), * denotes the convolution operation, F_{i-1} denotes the output of layer i-1, Z_i(Y) denotes the output of the i-th convolution operation, W_i denotes the convolution kernel parameters of layer i, and B_i denotes the bias parameters of layer i.
In one embodiment of the present invention, a ReLU activation layer is connected after each of the first k-1 convolutional layers of each channel to keep the network sparse; its mathematical description is:
F_i(Y) = max(0, Z_i).
In one embodiment of the present invention, the feature response maps obtained for the two video frames after the last convolutional layer are fused by adding the values at corresponding positions.
In one embodiment of the present invention, the feature response map obtained by the fusion operation is followed by a Sigmoid activation layer to map the pixel values of the picture into the interval 0-1; its mathematical description is:
F_i(Y) = 1 / (1 + exp(-Z_i)).
In one embodiment of the present invention, the convolution kernel parameters are initialized with a Gaussian distribution with mean 0 and standard deviation 1, the biases are initialized to 0, and the base learning rate is initialized to 1e-6; after m iteration cycles the base learning rate is reduced by a factor of 10, where m is a preset value.
In one embodiment of the present invention, training the dual-channel convolutional neural network model by regression of the predicted frame against the reference frame in the video frame subset is specifically:
training the dual-channel convolutional neural network with the error backpropagation algorithm, using the error between the predicted frame and the reference frame; the least-squares error is adopted as the optimization objective, and its mathematical description is:
L = (1/n) * Σ_{i=1}^{n} || Ŷ_i - Y_i ||^2
where i denotes the i-th sample picture, n denotes the size of the training set, Y_i denotes the video frame predicted by the network, and Ŷ_i denotes the ground-truth value of the corresponding video frame.
In one embodiment of the present invention, k is 3; the first convolutional layer has 64 convolution kernels of size 9*9 with a stride of 1 pixel and a padding of 4, where the padding is the number of rings of zeros added around the feature map; the second convolutional layer has 32 convolution kernels of size 1*1 with a stride of 1 pixel and a padding of 0; and the third convolutional layer has 3 convolution kernels of size 5*5 with a stride of 1 and a padding of 2.
According to another aspect of the present invention, there is also provided a high-frame-rate video generation system based on deep learning, comprising a training sample set generation module, a dual-channel convolutional neural network optimization module, and a high-frame-rate video generation module, wherein:
the training sample set generation module is configured to generate a training sample set from one or more high-frame-rate video segments, the training sample set comprising multiple video frame subsets, each video frame subset containing two training frames and one reference frame, the two training frames being two video frames separated by one or more frames in a high-frame-rate video segment, and the reference frame being any one of the frames in the interval between the two training frames; the frame rate of the high-frame-rate video segments is higher than a set frame-rate threshold;
the dual-channel convolutional neural network optimization module is configured to train a dual-channel convolutional neural network model with the multiple video frame subsets in the training sample set to obtain an optimized dual-channel convolutional neural network; the dual-channel convolutional neural network model is a convolutional neural network formed by fusing two channels, the two channels respectively receive the two video frames in a video frame subset as input and separately convolve the input frames, the model fuses the convolution results of the two channels and outputs a predicted frame, and the model is trained by regression of the predicted frame against the reference frame in the video frame subset;
the high-frame-rate video generation module is configured to use the optimized dual-channel convolutional neural network to generate, for any two adjacent video frames in a low-frame-rate video, a frame to be inserted between them, thereby generating a video whose frame rate is higher than that of the low-frame-rate video.
In one embodiment of the present invention, each convolutional channel in the dual-channel convolutional neural network model comprises k convolutional layers, where k > 0, and the mathematical description of each convolutional layer is:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input video frame is layer 0), * denotes the convolution operation, F_{i-1} denotes the output of layer i-1, Z_i(Y) denotes the output of the i-th convolution operation, W_i denotes the convolution kernel parameters of layer i, and B_i denotes the bias parameters of layer i.
In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following technical effects:
(1) the feature extraction and frame prediction of the present invention are obtained by supervised learning on training samples, without manual intervention, and spatial difference information can be fitted well in large-scale-data scenarios;
(2) the whole process of the present invention is end to end; it uses the self-learning ability of convolutional neural networks to learn the model parameters, it is concise and efficient, and it overcomes the drawbacks of conventional frame-rate conversion techniques, which are time-consuming, labor-intensive, and of limited effect.
Description of the drawings
Fig. 1 is a flow chart of the deep-learning-based video frame rate conversion method of the present invention, where F_i denotes the output of layer i, Y_{t-1}, Y_t, and Y_{t+1} denote three consecutive video frames, Y_t is used as the ground truth for computing the error, and Prediction denotes the video frame predicted by the network.
Detailed description of the embodiments
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the invention described below can be combined with each other as long as they do not conflict.
The technical terms of the present invention are first explained and illustrated below:
Convolutional neural network (Convolutional Neural Network, CNN): a neural network that can be used for tasks such as image classification and regression. Its particularity is embodied in two aspects: on the one hand, the connections between neurons are not fully connected; on the other hand, the weights of connections between some neurons in the same layer are shared. The network generally consists of convolutional layers, pooling layers, and fully connected layers. The convolutional and pooling layers are responsible for extracting hierarchical features of the image, and the fully connected layers classify or regress on the extracted features. The parameters of the network include the convolution kernels and the parameters and biases of the fully connected layers, and they can be learned from data by the backpropagation algorithm.
Backpropagation algorithm (Backpropagation Algorithm, BP): a common method for training artificial neural networks, used in combination with an optimization method (such as gradient descent). The method computes the gradient of the loss function with respect to all the weights in the network, and this gradient is fed back to the optimization method to update the weights so as to minimize the loss function. The algorithm mainly includes two phases: forward propagation of the stimulus, and backpropagation of the error with the corresponding weight update.
With the arrival of the big-data era, the scale of video databases keeps growing, and solving this problem becomes more and more urgent. Deep neural networks can analyze data in a way that simulates the human brain, and in recent years deep learning has been applied successfully in every field of computer vision; the problem of video frame rate conversion, however, has not been studied much. Considering that traditional video frame rate conversion methods involve complex processing and high time and labor costs, the present invention proposes a video frame rate conversion method based on deep learning. The whole process of the method is end to end, simple and efficient, and it is robust to problems such as camera shake and scene changes in the video.
As shown in Fig. 1, the deep-learning-based video frame rate conversion method of the present invention may comprise the following steps:
(1) generating a training sample set from one or more original high-frame-rate video segments, the training sample set comprising multiple video frame subsets, each video frame subset containing two training frames and one reference frame, the two training frames being two video frames separated by one or more frames in a high-frame-rate video segment, and the reference frame being any one of the frames in the interval between the two training frames; the frame rate of the high-frame-rate video segments is higher than a set frame-rate threshold;
Specifically, video frame sets can be extracted from high-frame-rate video segments, and the training sample set is obtained from them in a certain proportion;
The training sample set is composed of multiple video frame subsets, each containing two training frames and one reference frame. The reference frame is chosen as the middle frame of the two training frames, or the frame closest to the middle. In the typical case, three consecutive frames are taken: the middle frame is the reference frame and the other two frames are the training frames. If the frame rate is high enough, two frames separated by several frames (how many depends on the frame rate; the separation cannot be too large) can also be taken as training frames, and any frame in the interval between them can be chosen as the reference frame. For example, if the frame rate of a training video is 60 and the video has N frames, then, sampling with an interval of one frame, a frame is taken at random from the 2nd to the (N-1)-th frame as the ground truth (reference frame), and the two frames adjacent to it are fed into the network as the training sample (the two training frames). In the same way, samples can also be drawn with an interval of several frames, which makes the model usable for videos of lower frame rate, i.e., for converting lower-frame-rate video into high-frame-rate video.
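As an illustration, the following is a minimal Python sketch of the sampling just described; OpenCV is used for frame reading, and the function name and the gap and num_samples parameters are our assumptions rather than part of the patent:

```python
import random
import cv2

def extract_triplets(video_path, gap=1, num_samples=1000):
    """Sample (previous training frame, ground-truth frame, next training
    frame) triplets from a high-frame-rate video. With gap=1 these are
    three consecutive frames; a larger gap emulates training for
    lower-frame-rate input, as described above."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    triplets = []
    for _ in range(num_samples):
        # t indexes the ground-truth frame; the two training frames sit
        # `gap` frames before and after it. For simplicity the middle
        # frame is always used as ground truth here, although the text
        # above allows any frame in the interval.
        t = random.randint(gap, len(frames) - gap - 1)
        triplets.append((frames[t - gap], frames[t], frames[t + gap]))
    return triplets
```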
(2) training a dual-channel convolutional neural network model with the multiple video frame subsets in the training sample set to obtain an optimized dual-channel convolutional neural network; the dual-channel convolutional neural network model is a convolutional neural network formed by fusing two convolutional channels, the two convolutional channels respectively receive the two video frames in a video frame subset as input and separately convolve the input frames, the model fuses the convolution results of the two convolutional channels and outputs a predicted frame, and the model is trained by regression of the predicted frame against the reference frame in the video frame subset;
First, a dual-channel convolutional neural network has to be designed and implemented, specifically:
The dual-channel convolutional neural network model that is set up is a convolutional neural network formed by fusing two convolutional channels; each channel contains k convolutional layers, k > 0, preferably 3, and separately convolves one of the two video frame pictures (training frames). The first convolutional layer has 64 convolution kernels of size 9*9 with a stride of 1 pixel and a padding of 4, where the padding is the number of rings of zeros added around the feature map. The second convolutional layer has 32 convolution kernels of size 1*1 with a stride of 1 pixel and a padding of 0. The third convolutional layer has 3 convolution kernels of size 5*5 with a stride of 1 and a padding of 2. The mathematical description of a convolutional layer is:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input picture is layer 0), * denotes the convolution operation, F_{i-1} denotes the output of layer i-1, Z_i(Y) denotes the output of the i-th convolution operation, W_i denotes the convolution kernel parameters of layer i, and B_i denotes the bias parameters of layer i;
Among the 3 convolutional layers, a ReLU activation layer is connected after the 1st and 2nd convolutional layers to keep the network sparse; its mathematical description is:
F_i(Y) = max(0, Z_i).
The feature response maps obtained for the two video frame pictures after the 3rd convolutional layer are fused by adding the values at corresponding positions;
After the fusion operation, the resulting feature response map is followed by a Sigmoid activation layer to map the pixel values of the picture into the interval 0-1; its mathematical description is:
F_i(Y) = 1 / (1 + exp(-Z_i)).
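The following is a minimal sketch of this dual-channel architecture in PyTorch, following the layer sizes given above (64 9*9 kernels with padding 4, 32 1*1 kernels with padding 0, 3 5*5 kernels with padding 2, ReLU after the first two layers, element-wise addition of the two channels' responses, then a sigmoid). The patent names no framework, and the class and variable names are ours:

```python
import torch
import torch.nn as nn

class Channel(nn.Module):
    """One convolutional channel; both channels share this structure."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, stride=1, padding=4),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1, stride=1, padding=0),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, kernel_size=5, stride=1, padding=2),
        )

    def forward(self, x):
        return self.body(x)

class DualChannelCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.channel_a = Channel()
        self.channel_b = Channel()

    def forward(self, frame_prev, frame_next):
        # Fuse the two channels' feature responses by adding the values
        # at corresponding positions, then map the result into (0, 1)
        # with a sigmoid, as described above.
        fused = self.channel_a(frame_prev) + self.channel_b(frame_next)
        return torch.sigmoid(fused)
```

With stride 1 and these paddings, every layer preserves the spatial size of the input, so the predicted frame has the same resolution as the input frames.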
Before training the dual-channel convolutional neural network, each pixel value in the video frames needs to be divided by 255 for normalization, so that the normalized pixel values lie between 0 and 1;
Also, before training the dual-channel convolutional neural network, the network parameters need to be initialized: the convolution kernel parameters are initialized with a Gaussian distribution with mean 0 and standard deviation 1, the biases are initialized to 0, and the base learning rate is initialized to 1e-6; after m iteration cycles the base learning rate is reduced by a factor of 10, where m is a preset value. For example, with m preferably 2, the learning rate is 1e-6 during the first m iteration cycles; after the m-th iteration cycle the learning rate is 1e-7 and remains constant.
Specifically, the error between the prediction of the network and the reference can be used to train the dual-channel convolutional neural network with the error backpropagation algorithm. The least-squares error is adopted as the optimization objective, and its mathematical description is:
L = (1/n) * Σ_{i=1}^{n} || Ŷ_i - Y_i ||^2
where i denotes the i-th sample picture, n denotes the size of the training set, Y_i denotes the video frame predicted by the network, and Ŷ_i denotes the ground-truth value of the corresponding video frame;
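A minimal training sketch under these hyperparameters follows, again in PyTorch; plain SGD and a loader yielding (previous frame, ground-truth frame, next frame) batches are assumptions, since the patent specifies only error backpropagation with a least-squares objective:

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Gaussian initialization (mean 0, std 1) for the convolution
    # kernels and zero biases, as described above.
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, mean=0.0, std=1.0)
        nn.init.zeros_(module.bias)

def train(model, loader, m=2, epochs=10):
    model.apply(init_weights)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)
    criterion = nn.MSELoss()  # least-squares error between prediction and ground truth
    for epoch in range(epochs):
        for frame_prev, frame_truth, frame_next in loader:
            pred = model(frame_prev, frame_next)
            loss = criterion(pred, frame_truth)
            optimizer.zero_grad()
            loss.backward()   # error backpropagation
            optimizer.step()
        if epoch + 1 == m:
            # Single 10x reduction of the base learning rate after m
            # cycles; it is then held constant.
            for group in optimizer.param_groups:
                group["lr"] *= 0.1
    return model
```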
(3) using the optimized dual-channel convolutional neural network to generate, for any two adjacent video frames in a low-frame-rate video, a frame to be inserted between them, thereby generating a video whose frame rate is higher than that of the low-frame-rate video.
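As an illustration, a minimal inference sketch that interleaves predicted frames with the original frames to roughly double the frame rate; the function name and tensor layout are our assumptions, and the frames are assumed normalized to [0, 1] as described earlier:

```python
import torch

@torch.no_grad()
def double_frame_rate(model, frames):
    """frames: list of (1, 3, H, W) tensors in temporal order."""
    model.eval()
    output = [frames[0]]
    for frame_prev, frame_next in zip(frames, frames[1:]):
        inserted = model(frame_prev, frame_next)  # predicted in-between frame
        output.extend([inserted, frame_next])
    return output
```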
It will be readily understood by those skilled in the art that the above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. A high-frame-rate video generation method based on deep learning, characterized in that the method comprises the following steps:
(1) generating a training sample set from one or more original high-frame-rate video segments, the training sample set comprising multiple video frame subsets, each video frame subset containing two training frames and one reference frame, the two training frames being two video frames separated by one or more frames in a high-frame-rate video segment, and the reference frame being any one of the frames in the interval between the two training frames, the frame rate of the high-frame-rate video segments being higher than a set frame-rate threshold;
(2) training a dual-channel convolutional neural network model with the multiple video frame subsets in the training sample set to obtain an optimized dual-channel convolutional neural network, wherein the dual-channel convolutional neural network model is a convolutional neural network formed by fusing two convolutional channels, the two convolutional channels respectively receive the two video frames in a video frame subset as input and separately convolve the input frames, the model fuses the convolution results of the two convolutional channels and outputs a predicted frame, and the model is trained by regression of the predicted frame against the reference frame in the video frame subset;
(3) using the optimized dual-channel convolutional neural network to generate, for any two adjacent video frames in a low-frame-rate video, a frame to be inserted between them, thereby generating a video whose frame rate is higher than that of the low-frame-rate video.
2. The high-frame-rate video generation method based on deep learning as claimed in claim 1, characterized in that each convolutional channel in the dual-channel convolutional neural network model comprises k convolutional layers, where k > 0, and the mathematical description of each convolutional layer is:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input video frame is layer 0), * denotes the convolution operation, F_{i-1} denotes the output of layer i-1, Z_i(Y) denotes the output of the i-th convolution operation, W_i denotes the convolution kernel parameters of layer i, and B_i denotes the bias parameters of layer i.
3. The high-frame-rate video generation method based on deep learning as claimed in claim 2, characterized in that, in each convolutional channel, a ReLU activation layer is connected after each of the first k-1 convolutional layers to keep the network sparse, and its mathematical description is:
F_i(Y) = max(0, Z_i).
4. The high-frame-rate video generation method based on deep learning as claimed in claim 1 or 2, characterized in that the feature response maps obtained for the two video frames after the last convolutional layer are fused by adding the values at corresponding positions.
5. The high-frame-rate video generation method based on deep learning as claimed in claim 1 or 2, characterized in that the feature response map obtained by the fusion operation is followed by a Sigmoid activation layer to map the pixel values of the picture into the interval 0-1, and its mathematical description is:
F_i(Y) = 1 / (1 + exp(-Z_i)).
6. The high-frame-rate video generation method based on deep learning as claimed in claim 2, characterized in that the convolution kernel parameters are initialized with a Gaussian distribution with mean 0 and standard deviation 1, the biases are initialized to 0, the base learning rate is initialized to 1e-6, and the base learning rate is reduced by a factor of 10 after m iteration cycles, where m is a preset value.
7. The high-frame-rate video generation method based on deep learning as claimed in claim 1 or 2, characterized in that training the dual-channel convolutional neural network model by regression of the predicted frame against the reference frame in the video frame subset is specifically:
training the dual-channel convolutional neural network with the error backpropagation algorithm, using the error between the predicted frame and the reference frame, where the least-squares error is adopted as the optimization objective and its mathematical description is:
L = (1/n) * Σ_{i=1}^{n} || Ŷ_i - Y_i ||^2
where i denotes the i-th sample picture, n denotes the size of the training set, Y_i denotes the video frame predicted by the network, and Ŷ_i denotes the ground-truth value of the corresponding video frame.
8. The high-frame-rate video generation method based on deep learning as claimed in claim 2, characterized in that k is 3; the first convolutional layer has 64 convolution kernels of size 9*9 with a stride of 1 pixel and a padding of 4, where the padding is the number of rings of zeros added around the feature map; the second convolutional layer has 32 convolution kernels of size 1*1 with a stride of 1 pixel and a padding of 0; and the third convolutional layer has 3 convolution kernels of size 5*5 with a stride of 1 and a padding of 2.
9. A high-frame-rate video generation system based on deep learning, characterized by comprising a training sample set generation module, a dual-channel convolutional neural network optimization module, and a high-frame-rate video generation module, wherein:
the training sample set generation module is configured to generate a training sample set from one or more high-frame-rate video segments, the training sample set comprising multiple video frame subsets, each video frame subset containing two training frames and one reference frame, the two training frames being two video frames separated by one or more frames in a high-frame-rate video segment, and the reference frame being any one of the frames in the interval between the two training frames, the frame rate of the high-frame-rate video segments being higher than a set frame-rate threshold;
the dual-channel convolutional neural network optimization module is configured to train a dual-channel convolutional neural network model with the multiple video frame subsets in the training sample set to obtain an optimized dual-channel convolutional neural network, wherein the dual-channel convolutional neural network model is a convolutional neural network formed by fusing two channels, the two channels respectively receive the two video frames in a video frame subset as input and separately convolve the input frames, the model fuses the convolution results of the two channels and outputs a predicted frame, and the model is trained by regression of the predicted frame against the reference frame in the video frame subset;
the high-frame-rate video generation module is configured to use the optimized dual-channel convolutional neural network to generate, for any two adjacent video frames in a low-frame-rate video, a frame to be inserted between them, thereby generating a video whose frame rate is higher than that of the low-frame-rate video.
10. The high-frame-rate video generation system based on deep learning as claimed in claim 9, characterized in that each convolutional channel in the dual-channel convolutional neural network model comprises k convolutional layers, where k > 0, and the mathematical description of each convolutional layer is:
Z_i(Y) = W_i * F_{i-1}(Y) + B_i
where i denotes the layer index (the input video frame is layer 0), * denotes the convolution operation, F_{i-1} denotes the output of layer i-1, Z_i(Y) denotes the output of the i-th convolution operation, W_i denotes the convolution kernel parameters of layer i, and B_i denotes the bias parameters of layer i.
CN201611241691.XA 2016-12-29 2016-12-29 High-frame-rate video generation method and system based on deep learning Active CN106686472B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611241691.XA CN106686472B (en) 2016-12-29 2016-12-29 High-frame-rate video generation method and system based on deep learning


Publications (2)

Publication Number Publication Date
CN106686472A 2017-05-17
CN106686472B (en) 2019-04-26

Family

ID=58872327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611241691.XA Active CN106686472B (en) High-frame-rate video generation method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN106686472B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN202285412U (en) * 2011-09-02 2012-06-27 深圳市华美特科技有限公司 Low frame rate transmission or motion image twinkling elimination system
CN104102919A (en) * 2014-07-14 2014-10-15 同济大学 Image classification method capable of effectively preventing convolutional neural network from being overfit
CN105787510A (en) * 2016-02-26 2016-07-20 华东理工大学 System and method for realizing subway scene classification based on deep learning
CN106022237A (en) * 2016-05-13 2016-10-12 电子科技大学 Pedestrian detection method based on end-to-end convolutional neural network
CN106228124A (en) * 2016-07-17 2016-12-14 西安电子科技大学 SAR image object detection method based on convolutional neural networks

Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10491941B2 (en) 2015-01-22 2019-11-26 Microsoft Technology Licensing, Llc Predictive server-side rendering of scenes
CN107481209A (en) * 2017-08-21 2017-12-15 北京航空航天大学 A kind of image or video quality Enhancement Method based on convolutional neural networks
CN107481209B (en) * 2017-08-21 2020-04-21 北京航空航天大学 Image or video quality enhancement method based on convolutional neural network
CN107613299A (en) * 2017-09-29 2018-01-19 杭州电子科技大学 A kind of method for improving conversion effect in frame rate using network is generated
CN107886081A (en) * 2017-11-23 2018-04-06 武汉理工大学 Two-way U Net deep neural network mine down-holes hazardous act is intelligently classified discrimination method
CN108111860A (en) * 2018-01-11 2018-06-01 安徽优思天成智能科技有限公司 Video sequence lost frames prediction restoration methods based on depth residual error network
CN108111860B (en) * 2018-01-11 2020-04-14 安徽优思天成智能科技有限公司 Video sequence lost frame prediction recovery method based on depth residual error network
CN108322685A (en) * 2018-01-12 2018-07-24 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108322685B (en) * 2018-01-12 2020-09-25 广州华多网络科技有限公司 Video frame insertion method, storage medium and terminal
WO2019137248A1 (en) * 2018-01-12 2019-07-18 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108600655A (en) * 2018-04-12 2018-09-28 视缘(上海)智能科技有限公司 A kind of video image synthetic method and device
CN108600762A (en) * 2018-04-23 2018-09-28 中国科学技术大学 In conjunction with the progressive video frame generating method of motion compensation and neural network algorithm
CN108600762B (en) * 2018-04-23 2020-05-15 中国科学技术大学 Progressive video frame generation method combining motion compensation and neural network algorithm
CN108830812A (en) * 2018-06-12 2018-11-16 福建帝视信息科技有限公司 A kind of high frame per second of video based on network deep learning remakes method
CN108830812B (en) * 2018-06-12 2021-08-31 福建帝视信息科技有限公司 Video high frame rate reproduction method based on grid structure deep learning
CN108810551A (en) * 2018-06-20 2018-11-13 Oppo(重庆)智能科技有限公司 A kind of video frame prediction technique, terminal and computer storage media
CN108961236B (en) * 2018-06-29 2021-02-26 国信优易数据股份有限公司 Circuit board defect detection method and device
CN108961236A (en) * 2018-06-29 2018-12-07 国信优易数据有限公司 Training method and device, the detection method and device of circuit board defect detection model
CN110780664A (en) * 2018-07-25 2020-02-11 格力电器(武汉)有限公司 Robot control method and device and sweeping robot
CN109068174A (en) * 2018-09-12 2018-12-21 上海交通大学 Video frame rate upconversion method and system based on cyclic convolution neural network
CN109068174B (en) * 2018-09-12 2019-12-27 上海交通大学 Video frame rate up-conversion method and system based on cyclic convolution neural network
CN109379550A (en) * 2018-09-12 2019-02-22 上海交通大学 Video frame rate upconversion method and system based on convolutional neural networks
CN109120936A (en) * 2018-09-27 2019-01-01 贺禄元 A kind of coding/decoding method and device of video image
US10924525B2 (en) 2018-10-01 2021-02-16 Microsoft Technology Licensing, Llc Inducing higher input latency in multiplayer programs
CN109360436A (en) * 2018-11-02 2019-02-19 Oppo广东移动通信有限公司 A kind of video generation method, terminal and storage medium
CN110163061A (en) * 2018-11-14 2019-08-23 腾讯科技(深圳)有限公司 For extracting the method, apparatus, equipment and computer-readable medium of video finger print
CN110163061B (en) * 2018-11-14 2023-04-07 腾讯科技(深圳)有限公司 Method, apparatus, device and computer readable medium for extracting video fingerprint
CN111371983A (en) * 2018-12-26 2020-07-03 清华大学 Video online stabilization method and system
CN113766313B (en) * 2019-02-26 2024-03-05 深圳市商汤科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN109922372A (en) * 2019-02-26 2019-06-21 深圳市商汤科技有限公司 Video data handling procedure and device, electronic equipment and storage medium
CN109922372B (en) * 2019-02-26 2021-10-12 深圳市商汤科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN113766313A (en) * 2019-02-26 2021-12-07 深圳市商汤科技有限公司 Video data processing method and device, electronic equipment and storage medium
CN113647064A (en) * 2019-04-01 2021-11-12 株式会社电装 Information processing apparatus
CN113647064B (en) * 2019-04-01 2022-12-27 株式会社电装 Information processing apparatus
CN110636221A (en) * 2019-09-23 2019-12-31 天津天地人和企业管理咨询有限公司 System and method for super frame rate of sensor based on FPGA
CN112584158A (en) * 2019-09-30 2021-03-30 复旦大学 Video quality enhancement method and system
WO2021104381A1 (en) * 2019-11-27 2021-06-03 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method and device for stylizing video and storage medium
CN113630621A (en) * 2020-05-08 2021-11-09 腾讯科技(深圳)有限公司 Video processing method, related device and storage medium
US11889227B2 (en) 2020-10-05 2024-01-30 Samsung Electronics Co., Ltd. Occlusion processing for frame rate conversion using deep learning
RU2747965C1 (en) * 2020-10-05 2021-05-18 Самсунг Электроникс Ко., Лтд. Frc occlusion processing with deep learning
CN113516050A (en) * 2021-05-19 2021-10-19 江苏奥易克斯汽车电子科技股份有限公司 Scene change detection method and device based on deep learning
CN113420771A (en) * 2021-06-30 2021-09-21 扬州明晟新能源科技有限公司 Colored glass detection method based on feature fusion
CN113420771B (en) * 2021-06-30 2024-04-19 扬州明晟新能源科技有限公司 Colored glass detection method based on feature fusion

Also Published As

Publication number Publication date
CN106686472B (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN106686472A (en) High-frame-rate video generation method and system based on depth learning
Kang et al. Task-oriented image transmission for scene classification in unmanned aerial systems
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN106503106B (en) A kind of image hash index construction method based on deep learning
CN108229338A (en) A kind of video behavior recognition methods based on depth convolution feature
CN105072373B (en) Video super-resolution method and system based on bidirectional circulating convolutional network
CN107066445B (en) The deep learning method of one attribute emotion word vector
CN110096950A (en) A kind of multiple features fusion Activity recognition method based on key frame
CN108830252A (en) A kind of convolutional neural networks human motion recognition method of amalgamation of global space-time characteristic
CN110634108A (en) Composite degraded live webcast video enhancement method based on element-cycle consistency countermeasure network
CN110135386B (en) Human body action recognition method and system based on deep learning
TWI226193B (en) Image segmentation method, image segmentation apparatus, image processing method, and image processing apparatus
CN104899921B (en) Single-view videos human body attitude restoration methods based on multi-modal own coding model
CN106952271A (en) A kind of image partition method handled based on super-pixel segmentation and EM/MPM
CN108111860B (en) Video sequence lost frame prediction recovery method based on depth residual error network
CN107590518A (en) A kind of confrontation network training method of multiple features study
CN108986166A (en) A kind of monocular vision mileage prediction technique and odometer based on semi-supervised learning
CN107886169A (en) A kind of multiple dimensioned convolution kernel method that confrontation network model is generated based on text image
CN113807318B (en) Action recognition method based on double-flow convolutional neural network and bidirectional GRU
CN106709933B (en) Motion estimation method based on unsupervised learning
CN111833590A (en) Traffic signal lamp control method and device and computer readable storage medium
CN112257846A (en) Neuron model, topology, information processing method, and retinal neuron
Kim et al. Dynamic motion estimation and evolution video prediction network
Zhang et al. Accurate and efficient event-based semantic segmentation using adaptive spiking encoder–decoder network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant