CN110852944A - Multi-frame self-adaptive fusion video super-resolution method based on deep learning

Multi-frame self-adaptive fusion video super-resolution method based on deep learning

Info

Publication number
CN110852944A
CN110852944A
Authority
CN
China
Prior art keywords
resolution
network
frame
super
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910967482.0A
Other languages
Chinese (zh)
Other versions
CN110852944B (en)
Inventor
曾明
马金玉
吴雨璇
李祺
王湘晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201910967482.0A priority Critical patent/CN110852944B/en
Publication of CN110852944A publication Critical patent/CN110852944A/en
Application granted granted Critical
Publication of CN110852944B publication Critical patent/CN110852944B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformation in the plane of the image
    • G06T3/40 Scaling the whole image or part thereof
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G06T7/344 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides a multi-frame self-adaptive fusion video super-resolution method based on deep learning, and an electronic device implementing it, comprising the following steps: first, constructing the data set required to train the network of the invention; second, building a multi-frame self-adaptive fusion video super-resolution network with the deep learning framework TensorFlow, the network being divided into two parts: a multi-frame adaptive registration network, which warps the frames adjacent to the key frame requiring super-resolution so that their content approaches that of the key frame, providing the algorithm with more detail information, and a super-resolution network, which super-resolves the output of the multi-frame adaptive registration network into a high-resolution frame image; and third, training the network.

Description

Multi-frame self-adaptive fusion video super-resolution method based on deep learning
Technical Field
The invention relates to video super-resolution algorithms based on convolutional neural networks, and in particular to a multi-frame self-adaptive fusion video image registration algorithm.
Background
High-resolution video gives users a clearer and more comfortable viewing experience, so related technical research has attracted wide attention from scholars. In recent years, the rapidly developing video super-resolution technique, as a new way of obtaining high-definition images at low cost, has shown enormous commercial value in industries such as security, finance and modern logistics, and has become a frontier technology over which large companies compete. The basic task of super-resolution is to reconstruct the corresponding high-resolution (HR) image or video from an original low-resolution (LR) image or video, a typical ill-posed problem. A number of solutions have already been proposed.
Existing super-resolution algorithms mainly follow two approaches: 1) reconstruction-based methods, which add constraints to the reconstruction process using prior knowledge of the structure or content of the picture, for example exploiting image smoothness to achieve the super-resolution effect; 2) learning-based methods, which are currently the approach with the best reconstruction results and whose specific implementations include dictionary-learning, random-forest and neural-network strategies. Single-frame super-resolution refers to techniques whose input is a single image; multi-frame super-resolution refers to techniques that reconstruct a high-resolution video frame from several consecutive low-resolution video frames. Compared with single-frame techniques, multi-frame algorithms treat the information in neighboring frame images as complementary and can exploit this redundant information to improve the super-resolution result.
The core problem in designing a multi-frame super-resolution algorithm is finding an effective way to register consecutive video frames. Recent research shows that a convolutional neural network (CNN) combined with the principle of motion compensation can fuse the information of several neighboring low-resolution frames and thereby achieve image registration. Mainstream multi-frame super-resolution algorithms currently generate a single high-resolution frame from a group of consecutive low-resolution images of fixed size. Such fixed-frame-number algorithms suffer from two problems: 1) when the image content differs greatly between adjacent frames, choosing too many frames makes registration very difficult, and the fused video is prone to unpleasant flicker that harms the user experience; 2) when too few frames are chosen, the redundant information in adjacent frames cannot be fully exploited. Adaptively fusing the image information predicted from different numbers of frames is therefore very important.
Disclosure of Invention
Addressing the shortcomings of traditional fixed-frame-number multi-frame super-resolution algorithms in making effective use of multiple frames, the invention provides a multi-frame super-resolution algorithm that adaptively fuses the images predicted from different numbers of frames. The algorithm adapts better to fluctuations in the content difference between adjacent frames and therefore achieves a more stable and sharper super-resolution effect. The technical scheme is as follows:
a multi-frame self-adaptive fusion video super-resolution method based on deep learning comprises the following steps:
First, construct the data set required to train the network of the invention:
Read the videos in an existing video data set frame by frame into images and store them as a high-resolution image set Y_HR; then down-sample each image in Y_HR to obtain the corresponding low-resolution image set Y_LR;
Second, build a multi-frame self-adaptive fusion video super-resolution network with the deep learning framework TensorFlow:
The multi-frame self-adaptive fusion video super-resolution network is divided into two parts: a multi-frame adaptive registration network and a super-resolution network. The registration network warps the frames adjacent to the key frame requiring super-resolution so that their content tends toward that of the key frame, providing the algorithm with more detail information; the super-resolution network super-resolves the output of the registration network into a high-resolution frame image. The steps are as follows:
(1) The multi-frame adaptive registration network is divided into three sub-parts according to the number of input video frames: a key-frame direct-output part, a three-frame motion-registration part and a five-frame motion-registration part. The three-frame and five-frame motion-registration parts each consist of an eight-layer convolutional neural network, denoted FNet; a ReLU activation follows every convolutional layer, the first three layers down-sample the image through 2× max pooling, and the last three layers up-sample it through bicubic interpolation. Let the key frame requiring super-resolution be the n-th frame, denoted I_n. The mathematical model of the multi-frame adaptive registration network is:
F_out = [α · FNet(I_{n-2}, I_{n-1}, I_n, I_{n+1}, I_{n+2}) + β · FNet(I_{n-1}, I_n, I_{n+1}) + γ · FNet(I_n)]
where F_out denotes the output of the multi-frame adaptive registration network; α, β and γ are the weights of the five-frame motion-registration part, the three-frame motion-registration part and the key-frame direct-output part, respectively; and I_{n-2}, I_{n-1}, I_{n+1}, I_{n+2} denote the two frames before and the two frames after the key frame;
(2) The super-resolution network F_SR consists of several convolutional layers, each followed by a ReLU activation; the network up-samples the image through two final deconvolution layers, and its input and output are directly connected to prevent the vanishing-gradient problem. With Y_out denoting the output of F_SR, the mathematical model of the super-resolution network is:
Y_out = F_SR(F_out)
Third, train the designed network with the high-resolution image set Y_HR and the low-resolution image set Y_LR obtained in the first step, the loss of the network being defined as the L2 loss:
Loss = (Y_out - Y_HR)^2
where Y_out is the output of the super-resolution network; after training, the structure and parameters of the network are saved;
Fourth, let the low-resolution video requiring super-resolution be V; take V as the input of the network saved in the third step, and the corresponding output is the desired high-resolution video, completing the video super-resolution process.
Preferably, in the third step the network optimizer is set to Adam; one training batch is 128 images; the initial learning rate of the network is 0.01; whenever the loss does not clearly decrease for 100 consecutive epochs, the learning rate is divided by 10, with the final learning rate set to 10^-5; and the number of training epochs is set to 5000.
The present invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the above-mentioned method steps when executing the program.
Compared with traditional fixed-frame-number video super-resolution models, the deep-learning-based multi-frame self-adaptive fusion video super-resolution algorithm is more robust when the content difference between adjacent frames fluctuates strongly. It effectively resolves the two problems of traditional fixed-frame-number algorithms, namely increased image-registration difficulty and under-exploited redundant information between adjacent frames, and it effectively avoids flicker in the super-resolution result.
The model designed by the invention can be widely used for super-resolution processing of low-quality videos; it fully accounts for the content difference between adjacent video frames in order to select more appropriate network parameters for super-resolving the target video.
Drawings
FIG. 1 is an overall structure of a multi-frame adaptive fusion video super-resolution network
FIG. 2 is a structure of a multi-frame registration network
Fig. 3 shows super-resolution results of the algorithm of the invention for the same video frame; the four images are: the original low-resolution image, the result of bicubic up-sampling, the result of the VESPCN video super-resolution network, and the result of the invention
FIG. 4 is a flow chart of the algorithm of the present invention
TABLE 1 parameters for a multi-frame registration network
TABLE 2 parameters of the image super-resolution network
Detailed Description
The mathematical model and the specific implementation of the deep-learning-based multi-frame adaptive fusion video super-resolution algorithm of this patent are described in detail below with reference to the examples and the accompanying drawings; the overall flowchart is given in Fig. 4:
First, construct the data set required to train the network: read the videos in the Vimeo-90k video data set frame by frame into images and store them, recorded as the high-resolution image set Y_HR; then down-sample each image in Y_HR with matlab to obtain the corresponding low-resolution image set Y_LR.
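As a minimal sketch of this step (using Python with OpenCV in place of matlab, and assuming a 4× down-sampling factor, which the patent does not fix), the paired sets Y_HR and Y_LR can be built as follows:

```python
import cv2  # OpenCV stands in here for the matlab down-sampling step

def build_training_pairs(video_path, scale=4):
    """Read a video frame by frame into the high-resolution set Y_HR and
    bicubically down-sample each frame to build the matching set Y_LR.
    The 4x scale factor is an assumption, not fixed by the patent."""
    y_hr, y_lr = [], []
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    while ok:
        h, w = frame.shape[:2]
        y_hr.append(frame)
        y_lr.append(cv2.resize(frame, (w // scale, h // scale),
                               interpolation=cv2.INTER_CUBIC))
        ok, frame = cap.read()
    cap.release()
    return y_hr, y_lr
```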
Second, build the multi-frame self-adaptive fusion video super-resolution network with the deep learning framework TensorFlow. Fig. 1 shows the overall framework of the network. The multi-frame adaptive fusion video super-resolution network is divided into two parts: a multi-frame adaptive registration network and a super-resolution network. The registration network warps the frames adjacent to the key frame requiring super-resolution so that their content tends toward that of the key frame, providing the algorithm with more detail information. The super-resolution network super-resolves the output of the registration network into a high-resolution frame image, as follows:
(1) The multi-frame adaptive registration network is divided into three sub-parts according to the number of input video frames: a key-frame direct-output part, a three-frame motion-registration part and a five-frame motion-registration part. The three-frame and five-frame motion-registration parts each consist of an eight-layer convolutional neural network, denoted FNet. The structure of FNet is shown in Fig. 2 and its specific parameters are given in Table 1; a ReLU activation follows every convolutional layer, the first three layers down-sample the image through 2× max pooling, and the last three layers up-sample it through bicubic interpolation. Suppose the key frame requiring super-resolution is the n-th frame, denoted I_n. The mathematical model of the multi-frame adaptive registration network is:
F_out = [α · FNet(I_{n-2}, I_{n-1}, I_n, I_{n+1}, I_{n+2}) + β · FNet(I_{n-1}, I_n, I_{n+1}) + γ · FNet(I_n)]
where F_out denotes the output of the multi-frame adaptive registration network; α, β and γ are the weights of the five-frame motion-registration part, the three-frame motion-registration part and the key-frame direct-output part, respectively; and I_{n-2}, I_{n-1}, I_{n+1}, I_{n+2} denote the two frames before and the two frames after the key frame.
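By way of illustration, the following TensorFlow/Keras sketch shows one possible reading of this registration architecture. The filter widths and single-channel (luminance) inputs are assumptions on our part (the exact layer parameters are in Table 1, which is filed as an image), and the fusion weights α, β, γ are modeled here as trainable scalars, since the patent does not state how they are obtained:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_fnet(num_frames, name):
    """One possible FNet: eight conv layers with ReLU after each, 2x max
    pooling after the first three, bicubic 2x upsampling after the last
    three; filter widths are assumed."""
    inp = layers.Input(shape=(None, None, num_frames))  # frames stacked on the channel axis
    x = inp
    for width in (32, 64, 128):       # layers 1-3: conv + ReLU + 2x max pool
        x = layers.Conv2D(width, 3, padding='same', activation='relu')(x)
        x = layers.MaxPool2D(2)(x)
    for width in (128, 128):          # layers 4-5: plain conv + ReLU
        x = layers.Conv2D(width, 3, padding='same', activation='relu')(x)
    for width in (64, 32, 1):         # layers 6-8: conv + ReLU + bicubic 2x upsample
        x = layers.Conv2D(width, 3, padding='same', activation='relu')(x)
        x = layers.Lambda(lambda t: tf.image.resize(
            t, tf.shape(t)[1:3] * 2, method='bicubic'))(x)
    return tf.keras.Model(inp, x, name=name)

fnet5 = build_fnet(5, 'five_frame_registration')
fnet3 = build_fnet(3, 'three_frame_registration')
fnet1 = build_fnet(1, 'key_frame_output')
alpha, beta, gamma = (tf.Variable(1.0 / 3) for _ in range(3))  # assumed trainable

def registration_output(frames):
    """frames: [I_{n-2}, I_{n-1}, I_n, I_{n+1}, I_{n+2}], each (B, H, W, 1)."""
    return (alpha * fnet5(tf.concat(frames, axis=-1))
            + beta * fnet3(tf.concat(frames[1:4], axis=-1))
            + gamma * fnet1(frames[2]))
```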
(2) The structure of the super-resolution network F_SR is shown in the right half of Fig. 1 and its specific parameters are given in Table 2. It comprises 12 convolutional layers, each followed by a ReLU activation, and the network finally up-samples the image through two deconvolution layers. Structurally, the invention directly connects the input and the output of the network to prevent the vanishing-gradient problem during training. With Y_out denoting the output of F_SR, the mathematical model of the super-resolution network is:
Y_out = F_SR(F_out)
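A corresponding sketch of F_SR, again under stated assumptions: the filter width, the overall 4× scale implied by two stride-2 deconvolutions, and the bicubic up-sampling of the input before the global input-output connection are our reading (the exact parameters are in Table 2, filed as an image):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_sr_net(width=64, scale=4):
    """One possible F_SR: 12 conv+ReLU layers, two stride-2 transposed
    convolutions for up-sampling, and a global input-to-output connection."""
    inp = layers.Input(shape=(None, None, 1))  # F_out from the registration network
    x = inp
    for _ in range(12):
        x = layers.Conv2D(width, 3, padding='same', activation='relu')(x)
    x = layers.Conv2DTranspose(width, 4, strides=2, padding='same',
                               activation='relu')(x)
    x = layers.Conv2DTranspose(1, 4, strides=2, padding='same')(x)
    # "Input and output directly connected": since the output is scale x larger,
    # we assume the input is bicubically up-sampled before the addition.
    up = layers.Lambda(lambda t: tf.image.resize(
        t, tf.shape(t)[1:3] * scale, method='bicubic'))(inp)
    return tf.keras.Model(inp, layers.Add()([x, up]), name='f_sr')
```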
Third, train the designed network with the high-resolution image set Y_HR and the low-resolution image set Y_LR obtained in the first step, the loss of the network being defined as the L2 loss, specifically:
Loss = (Y_out - Y_HR)^2
where Y_out is the output of the super-resolution network. The network optimizer is set to Adam; one training batch is 128 images; the initial learning rate of the network is 0.01; whenever the loss does not clearly decrease for 100 consecutive epochs, the learning rate is divided by 10, with the final learning rate set to 10^-5; and the number of training epochs is set to 5000. After training, the structure and parameters of the network are saved.
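Assuming `model` is the end-to-end network composed from the two sketches above and `x_train`/`y_train` hold the Y_LR/Y_HR pairs from the first step, the stated recipe maps onto Keras roughly as follows (a hedged sketch, not the patent's exact code):

```python
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              loss='mse')                 # L2 loss: (Y_out - Y_HR)^2
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='loss', factor=0.1,           # divide the learning rate by 10 ...
    patience=100,                         # ... after 100 epochs without clear improvement
    min_lr=1e-5)                          # final learning rate 10^-5
model.fit(x_train, y_train, batch_size=128, epochs=5000, callbacks=[reduce_lr])
model.save('multi_frame_sr')              # save the structure and parameters
```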
Fourth, suppose the low-resolution video requiring super-resolution is V; simply take V as the input of the network saved in the third step, and the corresponding output is the desired high-resolution video, completing the video super-resolution process.
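Inference then reduces to running the saved network over each key frame of V. A short sketch, where `five_frame_windows` is a hypothetical helper (not from the patent) that pads the video ends by repeating the edge frames, `lr_frames` holds the frames of V extracted as in the first step (each shaped (1, H, W, 1)), and the composed model is assumed to take the five frames as a list of inputs:

```python
import tensorflow as tf

def five_frame_windows(frames):
    """Hypothetical helper: yield [I_{n-2}, ..., I_{n+2}] for every key frame,
    repeating the first and last frames at the video boundaries."""
    padded = [frames[0]] * 2 + list(frames) + [frames[-1]] * 2
    for n in range(len(frames)):
        yield padded[n:n + 5]

model = tf.keras.models.load_model('multi_frame_sr')  # network saved in step three
hr_frames = [model(window) for window in five_frame_windows(lr_frames)]
```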
Taking the same low-resolution frame image as the network input, the output is compared with other classical methods; the comparison is shown in Fig. 3. The results show that the proposed algorithm achieves better results than the other algorithms.
TABLE 1
[Table 1 is provided as an image in the original filing and is not reproduced here.]
TABLE 2
[Table 2 is provided as an image in the original filing and is not reproduced here.]

Claims (3)

1. A multi-frame self-adaptive fusion video super-resolution method based on deep learning comprises the following steps:
First, construct the data set required to train the network of the invention:
Read the videos in an existing video data set frame by frame into images and store them as a high-resolution image set Y_HR; then down-sample each image in Y_HR to obtain the corresponding low-resolution image set Y_LR;
Second, build a multi-frame self-adaptive fusion video super-resolution network with the deep learning framework TensorFlow:
The multi-frame self-adaptive fusion video super-resolution network is divided into two parts: a multi-frame adaptive registration network and a super-resolution network. The registration network warps the frames adjacent to the key frame requiring super-resolution so that their content tends toward that of the key frame, providing the algorithm with more detail information; the super-resolution network super-resolves the output of the registration network into a high-resolution frame image. The steps are as follows:
(1) The multi-frame adaptive registration network is divided into three sub-parts according to the number of input video frames: a key-frame direct-output part, a three-frame motion-registration part and a five-frame motion-registration part; the three-frame and five-frame motion-registration parts each consist of an eight-layer convolutional neural network, denoted FNet; a ReLU activation follows every convolutional layer, the first three layers down-sample the image through 2× max pooling, and the last three layers up-sample it through bicubic interpolation; let the key frame requiring super-resolution be the n-th frame, denoted I_n; the mathematical model of the multi-frame adaptive registration network is:
F_out = [α · FNet(I_{n-2}, I_{n-1}, I_n, I_{n+1}, I_{n+2}) + β · FNet(I_{n-1}, I_n, I_{n+1}) + γ · FNet(I_n)]
where F_out denotes the output of the multi-frame adaptive registration network; α, β and γ are the weights of the five-frame motion-registration part, the three-frame motion-registration part and the key-frame direct-output part, respectively; and I_{n-2}, I_{n-1}, I_{n+1}, I_{n+2} denote the two frames before and the two frames after the key frame;
(2) The super-resolution network F_SR consists of several convolutional layers, each followed by a ReLU activation; the network up-samples the image through two final deconvolution layers, and its input and output are directly connected to prevent the vanishing-gradient problem. With Y_out denoting the output of F_SR, the mathematical model of the super-resolution network is:
Y_out = F_SR(F_out)
Third, train the designed network with the high-resolution image set Y_HR and the low-resolution image set Y_LR obtained in the first step, the loss of the network being defined as the L2 loss:
Loss = (Y_out - Y_HR)^2
where Y_out is the output of the super-resolution network; after training, the structure and parameters of the network are saved;
Fourth, let the low-resolution video requiring super-resolution be V; take V as the input of the network saved in the third step, and the corresponding output is the desired high-resolution video, completing the video super-resolution process.
2. The method according to claim 1, characterized in that in the third step the network optimizer is set to Adam; one training batch is 128 images; the initial learning rate of the network is 0.01; whenever the loss does not clearly decrease for 100 consecutive epochs, the learning rate is divided by 10, with the final learning rate set to 10^-5; and the number of training epochs is set to 5000.
3. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1-2 are implemented when the program is executed by the processor.
CN201910967482.0A 2019-10-12 2019-10-12 Multi-frame self-adaptive fusion video super-resolution method based on deep learning Active CN110852944B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910967482.0A CN110852944B (en) 2019-10-12 2019-10-12 Multi-frame self-adaptive fusion video super-resolution method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910967482.0A CN110852944B (en) 2019-10-12 2019-10-12 Multi-frame self-adaptive fusion video super-resolution method based on deep learning

Publications (2)

Publication Number Publication Date
CN110852944A true CN110852944A (en) 2020-02-28
CN110852944B CN110852944B (en) 2023-11-21

Family

ID=69596494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910967482.0A Active CN110852944B (en) 2019-10-12 2019-10-12 Multi-frame self-adaptive fusion video super-resolution method based on deep learning

Country Status (1)

Country Link
CN (1) CN110852944B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN107480772A (en) * 2017-08-08 2017-12-15 浙江大学 A kind of car plate super-resolution processing method and system based on deep learning
CN108921786A (en) * 2018-06-14 2018-11-30 天津大学 Image super-resolution reconstructing method based on residual error convolutional neural networks
CN109102462A (en) * 2018-08-01 2018-12-28 中国计量大学 A kind of video super-resolution method for reconstructing based on deep learning
CN110120011A (en) * 2019-05-07 2019-08-13 电子科技大学 A kind of video super resolution based on convolutional neural networks and mixed-resolution

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Yang Xuefeng; Wang Gao; Cheng Yaoyu: "Multi-frame image super-resolution reconstruction algorithm based on radial basis functions", Journal of Computer Applications, vol. 34, no. 1, pp. 142-144 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111586412A (en) * 2020-05-06 2020-08-25 华为技术有限公司 High-definition video processing method, master device, slave device and chip system
CN111696049A (en) * 2020-05-07 2020-09-22 中国海洋大学 Deep learning-based underwater distorted image reconstruction method
CN112019861A (en) * 2020-07-20 2020-12-01 清华大学 Video compression method and device based on keyframe guidance super-resolution
CN112019861B (en) * 2020-07-20 2021-09-14 清华大学 Video compression method and device based on keyframe guidance super-resolution
US11954910B2 (en) 2020-12-26 2024-04-09 International Business Machines Corporation Dynamic multi-resolution processing for video classification
CN113610713A (en) * 2021-08-13 2021-11-05 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video super-resolution method and device
CN113610713B (en) * 2021-08-13 2023-11-28 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video super-resolution method and device
CN113592719A (en) * 2021-08-14 2021-11-02 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video processing method and corresponding equipment
CN113592719B (en) * 2021-08-14 2023-11-28 北京达佳互联信息技术有限公司 Training method of video super-resolution model, video processing method and corresponding equipment

Also Published As

Publication number Publication date
CN110852944B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN110852944B (en) Multi-frame self-adaptive fusion video super-resolution method based on deep learning
CN111242846B (en) Fine-grained scale image super-resolution method based on non-local enhancement network
CN110136062B (en) Super-resolution reconstruction method combining semantic segmentation
CN113837946B (en) Lightweight image super-resolution reconstruction method based on progressive distillation network
CN112837224A (en) Super-resolution image reconstruction method based on convolutional neural network
Luo et al. Lattice network for lightweight image restoration
CN110363068A (en) A kind of high-resolution pedestrian image generation method based on multiple dimensioned circulation production confrontation network
CN111932461A (en) Convolutional neural network-based self-learning image super-resolution reconstruction method and system
CN113793286B (en) Media image watermark removing method based on multi-order attention neural network
CN113781308A (en) Image super-resolution reconstruction method and device, storage medium and electronic equipment
CN116091313A (en) Image super-resolution network model and reconstruction method
CN114881856A (en) Human body image super-resolution reconstruction method, system, device and storage medium
CN114841859A (en) Single-image super-resolution reconstruction method based on lightweight neural network and Transformer
Li et al. High-resolution network for photorealistic style transfer
Li et al. D2c-sr: A divergence to convergence approach for real-world image super-resolution
Li et al. Image super-resolution reconstruction based on multi-scale dual-attention
CN112215140A (en) 3-dimensional signal processing method based on space-time countermeasure
Peng Super-resolution reconstruction using multiconnection deep residual network combined an improved loss function for single-frame image
CN114494022B (en) Model training method, super-resolution reconstruction method, device, equipment and medium
CN107247944B (en) Face detection speed optimization method and device based on deep learning
CN112016456B (en) Video super-resolution method and system based on adaptive back projection depth learning
CN110895790B (en) Scene image super-resolution method based on posterior degradation information estimation
CN111667401B (en) Multi-level gradient image style migration method and system
CN108259779A (en) A kind of method that quick processing large format video image is realized using part breadth data
CN114862679A (en) Single-image super-resolution reconstruction method based on residual error generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant