CN107133919A - Time dimension video super-resolution method based on deep learning - Google Patents


Info

Publication number
CN107133919A
Authority
CN
China
Prior art keywords
video image
layer
sampling
represent
image collection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710341864.3A
Other languages
Chinese (zh)
Inventor
董伟生 (Weisheng Dong)
巨丹 (Dan Ju)
石光明 (Guangming Shi)
谢雪梅 (Xuemei Xie)
吴金建 (Jinjian Wu)
李甫 (Fu Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201710341864.3A priority Critical patent/CN107133919A/en
Publication of CN107133919A publication Critical patent/CN107133919A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00: Geometric image transformation in the plane of the image
    • G06T3/40: Scaling the whole image or part thereof
    • G06T3/4053: Super resolution, i.e. output image resolution higher than sensor resolution
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/587: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal sub-sampling or interpolation, e.g. decimation or subsequent interpolation of pictures in a video sequence

Abstract

The invention discloses a time-dimension video super-resolution method based on deep learning, which mainly solves the prior art's problems of poor stability and low precision in frame-interpolation reconstruction of video images. Its key technique is to use neural network training to fit the nonlinear mapping between original video images and down-sampled video images, comprising: 1) obtaining an original video image set and a down-sampled video image set as training samples for the neural network; 2) building a neural network model and training the network parameters with the training samples; 3) taking any given video segment as a test sample and inputting it into the trained neural network model; the output of the network is the reconstructed video. The invention reduces the computational complexity of frame-interpolation reconstruction of video images and improves the stability and precision of the reconstruction, and can be used for scene interpolation, animation production, and temporal frame interpolation of low-frame-rate video.

Description

Time dimension video super-resolution method based on deep learning
Technical field
The invention belongs to the field of image processing and in particular relates to a time-dimension video super-resolution method, applicable to scene interpolation, animation production, and temporal frame interpolation of low-frame-rate video.
Background technology
Video not only contains spatial information about the observed target but also its motion over time, so it possesses a unified space-time character. Because video preserves the spatial and the temporal information of a scene together, it greatly enhances our ability to perceive the objective world, and it has proven to be of great practical value in remote sensing, military, agricultural, medical, biochemical, and other fields.
Acquiring high-precision video with imaging devices is very costly and is limited by sensor and optical manufacturing technology. To raise the resolution of imaging video, the video usually has to be compressed, sacrificing temporal resolution, which clearly cannot meet the demands of scientific research and large-scale practical applications. Reconstructing high-resolution video from compressed video with signal processing techniques has therefore become an important way to obtain video.
In "Dual Motion Estimation for Frame Rate Up-Conversion", Kang S. J. et al. propose an algorithm that reconstructs interpolated video frames by motion estimation and motion compensation. Frame-interpolation reconstruction is an ill-posed problem; their algorithm combines the temporal information of the video with its spatial information. However, because the algorithm does not fully exploit the strong structural similarity between adjacent frames, the stability and precision of the reconstructed video are difficult to bring up to the requirements of scientific research and large-scale practical applications.
The content of the invention
The object of the invention is to address the above deficiencies of the prior art by proposing a time-dimension video super-resolution method based on deep learning, so as to improve the stability and precision of reconstructed video images and meet the requirements of large-scale practical applications.
The technical scheme of the invention is as follows:
The video image set obtained by down-sampling and the original video image set serve as the input and output training samples of a neural network. Training the network fits the nonlinear mapping between down-sampled video images and original video images, and this mapping is then used to guide the frame-interpolation reconstruction of test samples, thereby achieving temporal frame interpolation of video with a neural network. The specific steps include the following:
(1) Convert the color video image set $S=\{S_1,S_2,\ldots,S_i,\ldots,S_N\}$ to a grayscale video image set, i.e. the original video image set $X=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, and apply direct down-sampling to the original video image set $X$ with the down-sampling matrix $F$ to obtain the down-sampled video image set $Y=\{Y_1,Y_2,\ldots,Y_i,\ldots,Y_N\}$, where $X_i\in\mathbb{R}^{M\times L_h}$ denotes the $i$-th original video image sample, $Y_i\in\mathbb{R}^{M\times L_l}$ denotes the $i$-th down-sampled video image sample, $1\le i\le N$, $N$ is the number of image samples in the original video image set, $M$ is the size of an original video image block, $L_h$ is the number of image blocks in each sample of the original video image set, $L_l$ is the number of image blocks in each sample of the down-sampled video image set, $L_h=r\times L_l$, and $r$ is the magnification factor of the original video image set relative to the down-sampled video image set;
(2) Build the neural network model and train the network parameters using the down-sampled video image set $Y$ and the original video image set $X$:
(2a) Determine the numbers of input-layer nodes, output-layer nodes, hidden layers, and hidden-layer nodes of the network; randomly initialize the connection weights $W^{(t)}$ and biases $b^{(t)}$ of each layer; set the learning rate $\eta$; the selected activation function is the hyperbolic tangent $f(g)=\frac{e^{g}-e^{-g}}{e^{g}+e^{-g}}$, where $g$ is the input value of a network node, $t=1,2,\ldots,n$, and $n$ is the total number of layers of the network;
(2b) Randomly input a down-sampled video image $Y_i$ from the down-sampled set as the input training sample and, at the same time, input the corresponding original video image $X_i$ as the output training sample; compute the activation value of each network layer with the selected activation function:
the activation of layer 1, the input layer, is $a^{(1)}=Y_i$;
the activations of layers $t'=2,3,\ldots,n$ are $a^{(t')}=f(W^{(t'-1)}*a^{(t'-1)}+b^{(t'-1)})$, where in the second, third, and fourth layers of the network ($t'=2,3,4$) three three-dimensional filters are designed to replace traditional two-dimensional filters in order to fully extract inter-frame correlation; $f(\cdot)$ denotes the tanh activation applied to $g=W^{(t'-1)}*a^{(t'-1)}+b^{(t'-1)}$, $W^{(t'-1)}$ and $b^{(t'-1)}$ are the weights and bias of layer $t'-1$, and $a^{(t'-1)}$ is the activation of layer $t'-1$;
(2c) Compute the learning error of each network layer:
the error of the output layer, layer $n$, is $\delta^{(n)}=X_i-a^{(n)}$;
the errors of layers $t''=n-1,n-2,\ldots,2$ are $\delta^{(t'')}=((W^{(t'')})^{T}\delta^{(t''+1)})\odot f'(W^{(t''-1)}*a^{(t''-1)}+b^{(t''-1)})$, where $\odot$ denotes the element-wise product, $W^{(t'')}$ is the weight of layer $t''$, $\delta^{(t''+1)}$ is the error of layer $t''+1$, $W^{(t''-1)}$ and $b^{(t''-1)}$ are the weights and bias of layer $t''-1$, $a^{(t''-1)}$ is the activation of layer $t''-1$, $f'(\cdot)$ is the derivative of $f(\cdot)$, and $(\cdot)^{T}$ denotes transposition;
(2d) Update the weights and biases of each network layer by gradient descent on the error:
the weights are updated as $W^{(t)}=W^{(t)}-\eta\,\delta^{(t+1)}(a^{(t)})^{T}$ and the biases as $b^{(t)}=b^{(t)}-\eta\,\delta^{(t+1)}$, where $\delta^{(t+1)}$ is the error of layer $t+1$ and $a^{(t)}$ is the activation of layer $t$;
(2e) Repeat steps (2b)-(2d) until the output-layer error of the network reaches the preset precision requirement or the number of training iterations reaches the maximum; then stop training and save the network structure and parameters, yielding the trained neural network model;
(3) Input any given video segment into the trained neural network model; the output of the network is the video after time-dimension super-resolution.
Compared with the prior art, the present invention has the following advantages:
1) Because the invention performs time-dimension video super-resolution reconstruction with a convolutional neural network, it reduces the computational complexity and improves the stability of time-dimension video super-resolution reconstruction;
2) Because the three-dimensional filters designed in the invention fully account for the correlation between adjacent video frames, they improve the precision of time-dimension video super-resolution reconstruction.
Brief description of the drawings
Fig. 1 is the implementation flowchart of the invention;
Fig. 2 is the structure of the neural network built by the invention;
Fig. 3 is the original image of the bus video used in the simulation experiments of the invention;
Fig. 4 shows the results of reconstructing the bus video images with the existing Kang's and Choi's methods and with the method of the invention.
Embodiment
Embodiments and effects of the invention are described in further detail below with reference to the drawings.
With reference to Fig. 1, the time-dimension video super-resolution method of the invention based on deep learning is implemented as follows:
Step 1: Obtain the color video image set S.
(1a) Choose the color video image set $S=\{S_1,S_2,\ldots,S_i,\ldots,S_{464814}\}$ with 464814 samples from the given database and convert S to a grayscale video image set, i.e. the original video image set $X=\{X_1,X_2,\ldots,X_i,\ldots,X_{464814}\}$, where $X_i\in\mathbb{R}^{M\times L_h}$ denotes the $i$-th original video image sample, $1\le i\le 464814$, $M=576$ is the size of an original video image block, and $L_h=6$ is the number of image blocks in each sample of the original video image set;
(1b) Using the down-sampling matrix $F$, apply direct down-sampling to the original video image set $X$ to obtain the down-sampled video image set $Y=FX$, which amounts to down-sampling each sample of $X=\{X_1,X_2,\ldots,X_{464814}\}$ to obtain $Y=\{Y_1,Y_2,\ldots,Y_i,\ldots,Y_{464814}\}$, where $Y_i=FX_i\in\mathbb{R}^{M\times L_l}$ denotes the $i$-th down-sampled video image sample, $1\le i\le 464814$, $M=576$ is the size of a down-sampled video image block, and $L_l=3$ is the number of image blocks in each sample of the down-sampled video image set. A minimal sketch of this pairing step is given below.
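The pairing step (1a)-(1b) can be illustrated with a minimal NumPy sketch. It assumes that each sample is a stack of 24×24 grayscale patches (so M = 576 once vectorized), that the color-to-gray conversion uses the standard BT.601 luma weights (the patent does not specify one), and that direct down-sampling keeps every r-th frame; `to_gray` and `make_training_pair` are illustrative helper names, not from the patent:

```python
import numpy as np

def to_gray(rgb):
    # Standard BT.601 luma weights; an assumption, since the patent
    # does not specify the color-to-grayscale conversion.
    return rgb @ np.array([0.299, 0.587, 0.114])

def make_training_pair(gray_patches, r=2):
    # gray_patches: (L_h, 24, 24) stack of consecutive grayscale patches,
    # so each vectorized block has M = 576 pixels.
    L_h = gray_patches.shape[0]
    X_i = gray_patches.reshape(L_h, -1).T   # (M, L_h): one column per block
    Y_i = X_i[:, ::r]                       # Y_i = F X_i: keep every r-th frame
    return X_i, Y_i

rgb_patches = np.random.rand(6, 24, 24, 3)  # stand-in for real video patches
X_i, Y_i = make_training_pair(to_gray(rgb_patches))
print(X_i.shape, Y_i.shape)                 # (576, 6) (576, 3)
```

Here the down-sampling matrix F is realized implicitly by column selection; with r = 2 it maps L_h = 6 original blocks to L_l = 3 down-sampled blocks, matching the embodiment.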
Step 2: Build the neural network model and train the network parameters using the down-sampled video image set Y and the original video image set X.
This step is implemented as follows:
(2a) Initialize the neural network parameters;
(2a1) Use the down-sampled video images in the down-sampled set as input training samples and the original video images in the original set as output training samples;
(2a2) Determine the number of input-layer nodes of the network according to the number of video frames in an input training sample. In this embodiment the number of input-layer nodes equals $L_l$, the number of image blocks in each sample of the down-sampled video image set, so it is set to 3;
(2a3) Determine the number of output-layer nodes of the network according to the number of video frames in an output training sample. In this embodiment the number of output-layer nodes equals $L_h$, the number of image blocks in each sample of the original video image set, so it is set to 6;
(2a4) Determine the number of hidden layers and of hidden-layer nodes:
Because the number of hidden layers and of hidden-layer nodes determines the scale of the network, the network should be kept as simple as possible while still being able to solve the problem. In this embodiment the number of hidden layers is fixed at 7, and the node count of each layer is tuned experimentally: the first hidden layer has 64 nodes, the second 32, the third 24, the fourth 12, the fifth 32, the sixth 32, and the seventh 6;
(2a5) Randomly initialize the connection weights $W^{(t)}$ and biases $b^{(t)}$ of each layer, $t=1,2,3,4,5,6,7,8$;
(2a6) Set the learning rate $\eta=0.0005$;
(2a7) Select the activation function $f(g)=\frac{e^{g}-e^{-g}}{e^{g}+e^{-g}}$, where $g$ is the weighted input of a network node, bias included;
(2b) Randomly input one input training sample $Y_i$ and compute the activation value of each network layer with the selected activation function:
the activation of layer 1, the input layer, is $a^{(1)}=Y_i$;
the activations of layers $t'=2,3,\ldots,n$ are $a^{(t')}=f(W^{(t'-1)}*a^{(t'-1)}+b^{(t'-1)})$, where in the second, third, and fourth layers of the network ($t'=2,3,4$) three three-dimensional filters are designed to replace traditional two-dimensional filters in order to fully extract inter-frame correlation; $f(\cdot)$ denotes the tanh activation applied to $g=W^{(t'-1)}*a^{(t'-1)}+b^{(t'-1)}$, $W^{(t'-1)}$ and $b^{(t'-1)}$ are the weights and bias of layer $t'-1$, and $a^{(t'-1)}$ is the activation of layer $t'-1$;
(2c) Input the corresponding output training sample $X_i$ and compute the learning error of each network layer:
the error of the output layer, layer $n$, is $\delta^{(n)}=X_i-a^{(n)}$;
the errors of layers $t''=n-1,\ldots,2$ are $\delta^{(t'')}=((W^{(t'')})^{T}\delta^{(t''+1)})\odot f'(W^{(t''-1)}*a^{(t''-1)}+b^{(t''-1)})$, where $\odot$ denotes the element-wise product, $W^{(t'')}$ is the weight of layer $t''$, $W^{(t''-1)}$ and $b^{(t''-1)}$ are the weights and bias of layer $t''-1$, $a^{(t''-1)}$ is the activation of layer $t''-1$, $f'(\cdot)$ is the derivative of $f(\cdot)$, and $(\cdot)^{T}$ denotes transposition;
(2d) Update the weights and biases of each network layer by gradient descent on the error:
the weights are updated as $W^{(t)}=W^{(t)}-\eta\,\delta^{(t+1)}(a^{(t)})^{T}$,
the biases as $b^{(t)}=b^{(t)}-\eta\,\delta^{(t+1)}$, where $\delta^{(t+1)}$ is the error of layer $t+1$ and $a^{(t)}$ is the activation of layer $t$;
(2e) Repeat steps (2b)-(2d) until the network's output-layer error reaches the preset precision requirement or the number of training iterations reaches the maximum; then stop training and save the network structure and parameters, yielding the trained neural network model. In this embodiment the maximum number of iterations is 500. A sketch of one training iteration is given below.
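As an illustration of steps (2b)-(2d), the following NumPy sketch runs one training iteration under a fully connected reading of the equations; the patent replaces the matrix products with convolutions (three-dimensional in layers 2-4), which is omitted here, and the layer sizes in the usage example are a shortened stand-in rather than the patent's full stack:

```python
import numpy as np

def dtanh(g):
    return 1.0 - np.tanh(g) ** 2                 # f'(g) for f = tanh

def train_step(Ws, bs, Y_i, X_i, eta=5e-4):
    n = len(Ws) + 1                              # total number of layers
    a, z = [Y_i], []                             # a^(1) = Y_i  (step 2b)
    for W, b in zip(Ws, bs):                     # forward pass, layers 2..n
        z.append(W @ a[-1] + b)
        a.append(np.tanh(z[-1]))
    delta = [None] * n
    delta[-1] = X_i - a[-1]                      # output-layer error (step 2c)
    for t in range(n - 2, 0, -1):                # errors of layers n-1, ..., 2
        delta[t] = (Ws[t].T @ delta[t + 1]) * dtanh(z[t - 1])
    for t in range(n - 1):                       # update (step 2d); with
        Ws[t] += eta * delta[t + 1] @ a[t].T     # delta = target - output,
        bs[t] += eta * delta[t + 1]              # descent moves along +delta
    return Ws, bs

# Toy usage with the embodiment's vector sizes (1728 -> ... -> 3456):
sizes = [1728, 64, 32, 3456]                     # shortened illustrative stack
Ws = [0.01 * np.random.randn(m, k) for k, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((m, 1)) for m in sizes[1:]]
Ws, bs = train_step(Ws, bs, np.random.rand(1728, 1), np.random.rand(3456, 1))
```

Note the sign: the patent writes the update as $W^{(t)}-\eta\,\delta^{(t+1)}(a^{(t)})^{T}$ while defining $\delta^{(n)}=X_i-a^{(n)}$; with that definition of the error, the direction that decreases the squared error is $+\eta\,\delta^{(t+1)}(a^{(t)})^{T}$, which is what the sketch uses.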
The neural network built in this Step 2 is shown in Fig. 2. It comprises 1 input layer, 3 three-dimensional convolutional layers, 3 two-dimensional convolutional layers, and 1 output layer; the input layer has 3 nodes, the 7 hidden layers have 64, 32, 24, 12, 32, 32, and 6 nodes respectively, and the output layer has 6 nodes. A sketch of this structure follows.
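One plausible reading of this structure, sketched in PyTorch as a modern stand-in (the original simulation used Matlab R2015a and Python tooling): the hidden widths follow the quoted node counts, while the kernel sizes, padding, and the folding of the temporal axis into channels are assumptions, and the 6-node seventh hidden layer doubles here as the 6-frame output:

```python
import torch
import torch.nn as nn

class TemporalSRNet(nn.Module):
    """Illustrative sketch of the Fig. 2 network: 3 input frames -> 6 output
    frames, hidden widths 64, 32, 24, 12, 32, 32, 6 as quoted in the text."""
    def __init__(self):
        super().__init__()
        # Three 3-D convolutions over (time, height, width), intended to
        # extract inter-frame correlation (layers 2-4 in the patent).
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv3d(64, 32, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv3d(32, 24, kernel_size=3, padding=1), nn.Tanh(),
        )
        # Three 2-D convolutions after folding the 3 temporal steps into
        # channels (24 features x 3 frames = 72 input channels), followed
        # by the 6-channel output, one channel per reconstructed frame.
        self.conv2d = nn.Sequential(
            nn.Conv2d(72, 12, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv2d(12, 32, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.Tanh(),
            nn.Conv2d(32, 6, kernel_size=3, padding=1),
        )

    def forward(self, x):                        # x: (batch, 1, 3, H, W)
        y = self.conv3d(x)
        b, c, t, h, w = y.shape
        return self.conv2d(y.reshape(b, c * t, h, w))

net = TemporalSRNet()
out = net(torch.randn(1, 1, 3, 24, 24))          # -> torch.Size([1, 6, 24, 24])
```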
Step 3: Carry out time-dimension super-resolution reconstruction of video images with the trained neural network model.
(3a) Take any given video segment as the test sample and pull each video image sample $Y_i$ in it into a column vector; the size of each vector is $1728\times 1$;
(3b) Feed these column vectors into the trained neural network model; for each input vector, the output of the network is a vector of increased dimension, of size $3456\times 1$;
(3c) Recombine these vectors: first reshape them into single frames, then assemble the single frames into a video, which yields the time-dimension super-resolved video, as in the sketch below.
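A minimal sketch of steps (3a)-(3c), with a random linear map standing in for the trained network (any map from a 1728×1 vector to a 3456×1 vector serves the demonstration); `temporal_super_resolve` is an illustrative name:

```python
import numpy as np

M, L_l, L_h = 576, 3, 6                          # values from the embodiment

# Hypothetical stand-in for the trained network.
W_demo = 0.01 * np.random.randn(M * L_h, M * L_l)
model = lambda v: W_demo @ v

def temporal_super_resolve(samples):
    # samples: (num_samples, 3, 24, 24) low-frame-rate test samples
    frames = []
    for s in samples:
        v = s.reshape(-1, 1)                     # (3a): pull into a 1728x1 vector
        y = model(v)                             # (3b): network output, 3456x1
        frames.append(y.reshape(L_h, 24, 24))    # (3c): back to 6 single frames
    return np.concatenate(frames)                # assemble the up-sampled video

video = temporal_super_resolve(np.random.rand(4, 3, 24, 24))
print(video.shape)                               # (24, 24, 24): 24 output frames
```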
The effect of the invention can be illustrated by the following simulation experiments.
1. Simulation conditions:
1) The direct down-sampling transformation matrix F used in the simulation is obtained with the function imresize;
2) The programming platforms used in the simulation are Matlab R2015a and PyCharm 2016;
3) The neural network structure built in the simulation is shown in Fig. 2;
4) The 14th frame of the bus video sequence used in the simulation is shown in Fig. 3;
5) The video images used in the simulation come from the Xiph database, 464814 training samples in total;
6) In the simulation, experimental results are evaluated with the peak signal-to-noise ratio (PSNR), defined as
$$\mathrm{PSNR}=\frac{1}{M}\sum_{j=1}^{M}10\,\log_{10}\frac{\mathrm{MAX}_j^{2}}{\mathrm{MSE}_j},$$
where $M$ is the number of frames of the reconstructed video, $\mathrm{MAX}_j$ is the maximum pixel value of the reconstructed $j$-th frame, and $\mathrm{MSE}_j$ is the mean squared error between the $j$-th frame of the reconstructed video and the $j$-th frame of the original video. A sketch of this computation follows.
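Under the definition above, the per-frame computation could look like the following sketch, where MAX_j is taken from the reconstructed frame as stated (`video_psnr` is an illustrative name):

```python
import numpy as np

def video_psnr(recon, ref):
    # recon, ref: sequences of same-sized frames (2-D arrays).
    psnrs = []
    for x_hat, x in zip(recon, ref):             # frame j of each video
        mse = np.mean((x_hat.astype(np.float64) - x) ** 2)
        max_j = float(x_hat.max())               # MAX_j of the reconstructed frame
        psnrs.append(10.0 * np.log10(max_j ** 2 / mse))
    return float(np.mean(psnrs))                 # average over the M frames
```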
2. Simulation content: Time-dimension video super-resolution reconstruction is carried out on the bus video images shown in Fig. 3 with the method of the invention; the reconstruction results are shown in Fig. 4, where:
Fig. 4(a) shows the 14th frame reconstructed with Kang's method,
Fig. 4(b) shows the 14th frame reconstructed with Choi's method,
Fig. 4(c) shows the 14th frame reconstructed with the method of the invention.
It can be seen from the reconstruction results in Fig. 4 that the image reconstructed by the invention is closer to the real image than the images reconstructed by Kang's and Choi's methods.
3. Peak signal-to-noise ratio (PSNR) comparison
The PSNR of time-dimension video super-resolution reconstruction of the bus video images is computed for the existing Kang's method, Choi's method, and the method of the invention; the results are shown in Table 1.
Table 1. PSNR of the reconstructed video images (unit: dB)
As can be seen from Table 1, the PSNR of the video images reconstructed by the method of the invention is 2.99 dB higher than that of the existing Kang's method and 2.38 dB higher than that of the existing Choi's method.

Claims (5)

1. A time-dimension video super-resolution method based on deep learning, comprising:
(1) converting a color video image set $S=\{S_1,S_2,\ldots,S_i,\ldots,S_N\}$ to a grayscale video image set, i.e. an original video image set $X=\{X_1,X_2,\ldots,X_i,\ldots,X_N\}$, and applying direct down-sampling to the original video image set $X$ with a down-sampling matrix $F$ to obtain a down-sampled video image set $Y=\{Y_1,Y_2,\ldots,Y_i,\ldots,Y_N\}$, wherein $X_i\in\mathbb{R}^{M\times L_h}$ denotes the $i$-th original video image sample, $Y_i\in\mathbb{R}^{M\times L_l}$ denotes the $i$-th down-sampled video image sample, $1\le i\le N$, $N$ is the number of image samples in the original video image set, $M$ is the size of an original video image block, $L_h$ is the number of image blocks in each sample of the original video image set, $L_l$ is the number of image blocks in each sample of the down-sampled video image set, $L_h=r\times L_l$, and $r$ is the magnification factor of the original video image set relative to the down-sampled video image set;
(2) building a neural network model and training the network parameters with the down-sampled video image set $Y$ and the original video image set $X$:
(2a) determining the numbers of input-layer nodes, output-layer nodes, hidden layers, and hidden-layer nodes of the network; randomly initializing the connection weights $W^{(t)}$ and biases $b^{(t)}$ of each layer; setting the learning rate $\eta$; the selected activation function being the hyperbolic tangent $f(g)=\frac{e^{g}-e^{-g}}{e^{g}+e^{-g}}$, wherein $g$ is the input value of a network node, $t=1,2,\ldots,n$, and $n$ is the total number of layers of the network;
(2b) randomly inputting a down-sampled video image $Y_i$ from the down-sampled set as the input training sample while inputting the corresponding original video image $X_i$ as the output training sample, and computing the activation value of each network layer with the selected activation function:
the activation of layer 1, the input layer, being $a^{(1)}=Y_i$;
the activations of layers $t'=2,3,\ldots,n$ being $a^{(t')}=f(W^{(t'-1)}*a^{(t'-1)}+b^{(t'-1)})$, wherein in the second, third, and fourth layers of the network ($t'=2,3,4$) three three-dimensional filters are designed to replace traditional two-dimensional filters in order to fully extract inter-frame correlation, $f(\cdot)$ denotes the tanh activation applied to $g=W^{(t'-1)}*a^{(t'-1)}+b^{(t'-1)}$, $W^{(t'-1)}$ and $b^{(t'-1)}$ are the weights and bias of layer $t'-1$, and $a^{(t'-1)}$ is the activation of layer $t'-1$;
(2c) computing the learning error of each network layer:
the error of the output layer, layer $n$, being $\delta^{(n)}=X_i-a^{(n)}$;
the errors of layers $t''=n-1,n-2,\ldots,2$ being $\delta^{(t'')}=((W^{(t'')})^{T}\delta^{(t''+1)})\odot f'(W^{(t''-1)}*a^{(t''-1)}+b^{(t''-1)})$, wherein $\odot$ denotes the element-wise product, $W^{(t'')}$ is the weight of layer $t''$, $\delta^{(t''+1)}$ is the error of layer $t''+1$, $W^{(t''-1)}$ and $b^{(t''-1)}$ are the weights and bias of layer $t''-1$, $a^{(t''-1)}$ is the activation of layer $t''-1$, $f'(\cdot)$ is the derivative of $f(\cdot)$, and $(\cdot)^{T}$ denotes transposition;
(2d) updating the weights and biases of each network layer by gradient descent on the error:
the weights being updated as $W^{(t)}=W^{(t)}-\eta\,\delta^{(t+1)}(a^{(t)})^{T}$ and the biases as $b^{(t)}=b^{(t)}-\eta\,\delta^{(t+1)}$, wherein $\delta^{(t+1)}$ is the error of layer $t+1$ and $a^{(t)}$ is the activation of layer $t$;
(2e) repeating steps (2b)-(2d) until the output-layer error of the network reaches a preset precision requirement or the number of training iterations reaches the maximum, then ending training and saving the network structure and parameters, yielding the trained neural network model;
(3) inputting any given video segment into the trained neural network model, the output of the network being the video after time-dimension super-resolution.
2. The method according to claim 1, wherein in step (1) the original video image set $X$ is converted to the down-sampled video image set $Y$ with the down-sampling matrix $F$ by multiplying the original video images by $F$, i.e.:
$Y=FX$,
wherein $Y_i=FX_i\in\mathbb{R}^{M\times L_l}$, $M$ is the size of an original video image block, $L_l$ is the number of image blocks in each sample of the down-sampled video image set, $L_h$ is the number of image blocks in each sample of the original video image set, $L_h=r\times L_l$, and $r$ is the magnification factor of the original video image set relative to the down-sampled video image set on the time dimension.
3. The method according to claim 1, wherein in step (2a) the number of input-layer nodes of the neural network is determined according to the number of video frames in an input training sample, i.e. the number of input-layer nodes equals $L_l$, the number of image blocks in each sample of the down-sampled video image set.
4. The method according to claim 1, wherein in step (2a) the number of output-layer nodes of the neural network is determined according to the number of video frames in an output training sample, i.e. the number of output-layer nodes equals $L_h$, the number of image blocks in each sample of the original video image set.
5. The method according to claim 1, wherein in step (2a) the number of hidden-layer nodes of the neural network is determined by experimental tuning.
CN201710341864.3A 2017-05-16 2017-05-16 Time dimension video super-resolution method based on deep learning Pending CN107133919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710341864.3A CN107133919A (en) 2017-05-16 2017-05-16 Time dimension video super-resolution method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710341864.3A CN107133919A (en) 2017-05-16 2017-05-16 Time dimension video super-resolution method based on deep learning

Publications (1)

Publication Number Publication Date
CN107133919A true CN107133919A (en) 2017-09-05

Family

ID=59731773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710341864.3A Pending CN107133919A (en) 2017-05-16 2017-05-16 Time dimension video super-resolution method based on deep learning

Country Status (1)

Country Link
CN (1) CN107133919A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104217404A (en) * 2014-08-27 2014-12-17 华南农业大学 Video image sharpness processing method in fog and haze day and device thereof
CN106485688A (en) * 2016-09-23 2017-03-08 西安电子科技大学 High spectrum image reconstructing method based on neutral net

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Jie, "Research on Action Recognition Based on Convolutional Neural Networks" (基于卷积神经网络的行为识别研究), China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784628B (en) * 2017-10-18 2021-03-19 南京大学 Super-resolution implementation method based on reconstruction optimization and deep neural network
CN107784628A (en) * 2017-10-18 2018-03-09 南京大学 A kind of super-resolution implementation method based on reconstruction optimization and deep neural network
CN108122197A (en) * 2017-10-27 2018-06-05 江西高创保安服务技术有限公司 A kind of image super-resolution rebuilding method based on deep learning
CN108122197B (en) * 2017-10-27 2021-05-04 江西高创保安服务技术有限公司 Image super-resolution reconstruction method based on deep learning
US11270187B2 (en) 2017-11-07 2022-03-08 Samsung Electronics Co., Ltd Method and apparatus for learning low-precision neural network that combines weight quantization and activation quantization
CN109862299B (en) * 2017-11-30 2021-08-27 北京大学 Resolution processing method and device
CN109862299A (en) * 2017-11-30 2019-06-07 北京大学 The processing method and processing device of resolution ratio
US11457273B2 (en) 2018-01-04 2022-09-27 Samsung Electronics Co., Ltd. Video playback device and control method thereof
CN111567056A (en) * 2018-01-04 2020-08-21 三星电子株式会社 Video playing device and control method thereof
CN108111860B (en) * 2018-01-11 2020-04-14 安徽优思天成智能科技有限公司 Video sequence lost frame prediction recovery method based on depth residual error network
CN108111860A (en) * 2018-01-11 2018-06-01 安徽优思天成智能科技有限公司 Video sequence lost frames prediction restoration methods based on depth residual error network
CN108322685B (en) * 2018-01-12 2020-09-25 广州华多网络科技有限公司 Video frame insertion method, storage medium and terminal
WO2019137248A1 (en) * 2018-01-12 2019-07-18 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108322685A (en) * 2018-01-12 2018-07-24 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN108805808A (en) * 2018-04-04 2018-11-13 东南大学 A method of improving video resolution using convolutional neural networks
CN108600762A (en) * 2018-04-23 2018-09-28 中国科学技术大学 In conjunction with the progressive video frame generating method of motion compensation and neural network algorithm
CN109191376B (en) * 2018-07-18 2022-11-25 电子科技大学 High-resolution terahertz image reconstruction method based on SRCNN improved model
CN109191376A (en) * 2018-07-18 2019-01-11 电子科技大学 High-resolution terahertz image reconstruction method based on SRCNN improved model
CN111147893B (en) * 2018-11-02 2021-10-22 华为技术有限公司 Video self-adaption method, related equipment and storage medium
CN111147893A (en) * 2018-11-02 2020-05-12 华为技术有限公司 Video self-adaption method, related equipment and storage medium
US11509860B2 (en) 2018-11-02 2022-11-22 Huawei Technologies Co., Ltd. Video adaptation method, related device, and storage medium
CN111383172A (en) * 2018-12-29 2020-07-07 Tcl集团股份有限公司 Training method and device of neural network model and intelligent terminal
CN110177282A (en) * 2019-05-10 2019-08-27 杭州电子科技大学 A kind of inter-frame prediction method based on SRCNN
CN110177282B (en) * 2019-05-10 2021-06-04 杭州电子科技大学 Interframe prediction method based on SRCNN
CN110166779B (en) * 2019-05-23 2021-06-08 西安电子科技大学 Video compression method based on super-resolution reconstruction
CN110166779A (en) * 2019-05-23 2019-08-23 西安电子科技大学 Video-frequency compression method based on super-resolution reconstruction
CN112188236A (en) * 2019-07-01 2021-01-05 北京新唐思创教育科技有限公司 Video interpolation frame model training method, video interpolation frame generation method and related device
US11140422B2 (en) * 2019-09-25 2021-10-05 Microsoft Technology Licensing, Llc Thin-cloud system for live streaming content
WO2021093393A1 (en) * 2019-11-13 2021-05-20 南京邮电大学 Video compressed sensing and reconstruction method and apparatus based on deep neural network
CN110996171B (en) * 2019-12-12 2021-11-26 北京金山云网络技术有限公司 Training data generation method and device for video tasks and server
CN110996171A (en) * 2019-12-12 2020-04-10 北京金山云网络技术有限公司 Training data generation method and device for video tasks and server
CN111126220B (en) * 2019-12-16 2023-10-17 北京瞭望神州科技有限公司 Real-time positioning method for video monitoring target
CN114598833A (en) * 2022-03-25 2022-06-07 西安电子科技大学 Video frame interpolation method based on spatio-temporal joint attention

Similar Documents

Publication Publication Date Title
CN107133919A (en) Time dimension video super-resolution method based on deep learning
CN106485688B (en) High spectrum image reconstructing method neural network based
CN110119780B (en) Hyper-spectral image super-resolution reconstruction method based on generation countermeasure network
CN111047515B (en) Attention mechanism-based cavity convolutional neural network image super-resolution reconstruction method
CN110310227A (en) A kind of image super-resolution rebuilding method decomposed based on high and low frequency information
CN110009590A (en) A kind of high-quality colour image demosaicing methods based on convolutional neural networks
CN109241491A (en) The structural missing fill method of tensor based on joint low-rank and rarefaction representation
CN110675321A (en) Super-resolution image reconstruction method based on progressive depth residual error network
CN107194912B (en) Brain CT/MR image fusion method based on sparse representation and improved coupled dictionary learning
CN113177882B (en) Single-frame image super-resolution processing method based on diffusion model
CN107301630B (en) CS-MRI image reconstruction method based on ordering structure group non-convex constraint
CN109636721A (en) Video super-resolution method based on confrontation study and attention mechanism
CN112818764B (en) Low-resolution image facial expression recognition method based on feature reconstruction model
CN109472743A (en) The super resolution ratio reconstruction method of remote sensing images
CN109685716A (en) A kind of image super-resolution rebuilding method of the generation confrontation network based on Gauss encoder feedback
CN105488759B (en) A kind of image super-resolution rebuilding method based on local regression model
CN108734675A (en) Image recovery method based on mixing sparse prior model
CN111870245A (en) Cross-contrast-guided ultra-fast nuclear magnetic resonance imaging deep learning method
CN104408697B (en) Image Super-resolution Reconstruction method based on genetic algorithm and canonical prior model
CN112767252B (en) Image super-resolution reconstruction method based on convolutional neural network
CN112651360B (en) Skeleton action recognition method under small sample
CN111462208A (en) Non-supervision depth prediction method based on binocular parallax and epipolar line constraint
CN111487573A (en) Enhanced residual error cascade network model for magnetic resonance undersampling imaging
CN114299185A (en) Magnetic resonance image generation method, magnetic resonance image generation device, computer equipment and storage medium
CN104248437A (en) Method and system for dynamic magnetic resonance imaging

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170905