CN111583112A - Method, system, device and storage medium for video super-resolution - Google Patents
Method, system, device and storage medium for video super-resolution
- Publication number
- CN111583112A (application CN202010353851.XA)
- Authority
- CN
- China
- Prior art keywords
- resolution
- frame
- resolution video
- video
- video frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/40—Scaling of whole images or parts thereof, e.g. expanding or contracting
- G06T3/4053—Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Multimedia (AREA)
- Television Systems (AREA)
Abstract
The invention discloses a video super-resolution generation method, system, apparatus and storage medium. The method comprises: obtaining a low-resolution video frame to be processed, processing the low-resolution video frame through a video super-resolution model, and outputting a high-resolution video. Training the model comprises collecting training samples, each containing high-resolution video frame samples and low-resolution video frame samples, and establishing the video super-resolution model from the collected training samples based on a preset loss function and the high-resolution video frame samples. Through the selected video super-resolution model, the method realizes motion compensation and feature enhancement between low-resolution video frames and restores their high-frequency information, so that the output high-resolution video contains more image detail and higher definition, while avoiding the interference that optical-flow errors cause on the restoration of the final video frames in optical-flow-based video super-resolution methods. The method can be widely applied in the technical field of image processing.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method, a system, an apparatus, and a storage medium for video super-resolution.
Background
In recent years, with the growing demand for image and video quality, how to improve that quality has become an increasingly important issue. Video super-resolution aims to repair a low-resolution video so that it contains more detail information and its definition is improved. Video super-resolution technology has important practical significance. For example, in the field of video surveillance, when the resolution of a camera is limited or the camera is too far from the target being shot, the captured surveillance video suffers from low resolution and targets that are difficult to distinguish, which makes it hard to mine the required information from the video. Video super-resolution technology can restore such video to a certain extent and improve the quality of the surveillance video. In video entertainment, with the rapid development of high-resolution display devices, corresponding ultra-high-resolution film sources are in short supply, and network transmission of ultra-high-resolution video is also difficult. Video super-resolution technology can make up for the missing film sources and improve the visual experience of audiences; moreover, a low-resolution video can be transmitted and then restored by super-resolution after transmission is complete, which greatly reduces transmission cost and improves transmission efficiency.
Current video super-resolution methods can be divided into two major categories: super-resolution methods based on single-frame images and super-resolution methods based on multi-frame images. Using a single-frame image super-resolution method to complete the video super-resolution task ignores the motion correlation between video frames and cannot exploit the temporal information contained in multiple frames to obtain a higher-fidelity result, so it is a suboptimal option. As an extension of single-frame image super-resolution algorithms, multi-frame methods make better use of inter-frame complementary information and improve the quality of the super-resolution result.
In recent years, with the development of deep learning and convolutional neural networks, a video super-resolution technology based on multi-frame images has made a great breakthrough. However, in the case of complex motion or large-scale motion, how to maintain high-precision video super-resolution is still a difficult problem, and the performance of the algorithm still needs to be improved. At present, many video super-resolution algorithms based on convolutional neural network perform motion estimation on video frames through optical flow, and explicitly perform motion compensation processing so as to extract valuable information from aligned video frames. Due to the introduction of an additional optical flow estimation network, an end-to-end architecture cannot be realized, and meanwhile, optical flow errors can interfere with the recovery of a final video frame, so that an optimal super-resolution result cannot be generated. Therefore, a more accurate and efficient video super-resolution method is needed to further improve the recovery capability of the video super-resolution network, so that the video super-resolution network can cope with video super-resolution tasks in various complex scenes.
Disclosure of Invention
In order to solve the above technical problems, it is an object of the present invention to provide a method, system, apparatus and storage medium for generating video super-resolution.
The first technical scheme adopted by the invention is as follows:
the method for generating the video super-resolution comprises the following steps:
acquiring a low-resolution video frame to be processed;
processing the low-resolution video frame through a video super-resolution model, and outputting a high-resolution video;
the video super-resolution model training process comprises the following steps:
acquiring training samples, wherein the training samples comprise high-resolution video frame samples and low-resolution video frame samples;
and establishing a video super-resolution model based on a preset loss function and the high-resolution video frame sample according to the acquired training sample.
Optionally, the step of acquiring a training sample, where the training sample includes a high resolution video frame sample and a low resolution video frame sample, specifically includes the following steps:
collecting a high-resolution video sample, obtaining a high-resolution video frame sample by adopting a threshold shot segmentation algorithm, and backing up the high-resolution video frame sample;
adopting an image scaling algorithm to carry out down-sampling on the high-resolution video frame sample to generate a low-resolution video frame sample;
and acquiring a high-resolution video frame sample and a low-resolution video frame sample to establish a training sample.
Optionally, the step of establishing a video super-resolution model based on the preset loss function and the high-resolution video frame sample according to the acquired training sample specifically includes the following steps:
acquiring a set number of low-resolution video frame samples, and setting a reference frame and an adjacent frame;
extracting features of the reference frame and the adjacent frame based on a residual error network, and generating reference frame features and adjacent frame features;
aligning the adjacent frames by combining the deformable convolution network, the reference frame features and the adjacent features;
establishing the correlation degree of the adjacent frame and the reference frame by adopting a preset function and a relation matrix, carrying out first series connection on the aligned adjacent frame characteristics and the aligned reference frame characteristics, and outputting characteristic data fused with high-frequency information;
transmitting the feature data fused with the high-frequency information into the reference frame features by adopting residual error dense connection, and reconstructing a high-resolution video frame;
and reversely converging the reconstructed high-resolution video frame and the backed-up high-resolution video frame sample based on a preset loss function, and establishing a video super-resolution model.
Optionally, the deformable convolution network is provided with 5 variable convolution layers, a multi-level feature fusion structure formed by 8 cavity convolutions, and 2 convolution kernels, and the step of aligning the adjacent frames by combining the deformable convolution network, the reference frame features, and the adjacent frame features specifically includes the following steps:
before each variable convolution layer is input, performing second concatenation on the adjacent frame features and the reference frame features in the channel dimension;
after the series connection of adjacent frame features is compressed by each convolution kernel and is superposed by cavity convolution, the offset and the adjustment coefficient of the convolution kernel are output;
and each variable convolution layer carries out self-adaptive sampling on adjacent features according to the convolution kernel offset and the adjusting coefficient, and outputs the adjacent frame features after motion compensation.
Optionally, the step of establishing a correlation between the adjacent frame and the reference frame by using a preset function and a relationship matrix, performing a first concatenation on the aligned adjacent frame feature and the aligned reference frame feature, and outputting feature data fused with high-frequency information specifically includes the following steps:
determining the mapping relation between any pixel point in the adjacent frame and all pixel points of the reference frame by using the relation matrix;
determining the correlation degree of the adjacent frame and the reference frame by adopting a preset function according to the mapping relation;
aligning the regions with the unaligned adjacent frame features in a jumping connection mode according to the correlation;
and performing first series connection on the aligned adjacent frame features and the reference frame features, and outputting feature data fused with high-frequency information.
Optionally, the step of reconstructing the high-resolution video frame by using residual dense connection to transmit the feature data fused with the high-frequency information into the reference frame feature specifically includes the following steps:
globally jumping and accessing the reference frame characteristics based on the characteristic data which is connected and fused with high-frequency information by dense connection and residual error;
and rearranging the spatial dimension of the pixels of the reference frame by adopting a preset sub-pixel sampling layer and a convolution kernel to establish a high-resolution video frame.
Optionally, the step of processing the low-resolution video frame through the video super-resolution model and outputting the high-resolution video specifically includes the following steps:
inputting the low-resolution video frame into a video super-resolution model in a sliding window mode, and outputting a high-resolution video frame;
searching adjacent frames nearest to the video frame of the starting end or the tail end of the video frame sequence, complementing the number of the adjacent frames, and outputting a high-resolution starting end or tail end video frame;
and recombining the output high-resolution video frame and/or the high-resolution start end or tail end video frame based on the video frame sequence to output the high-resolution video.
The second technical scheme adopted by the invention is as follows:
a video super-resolution generation system, comprising:
the acquisition module is used for acquiring a low-resolution video frame to be processed;
the output module is used for processing the low-resolution video frame through a video super-resolution model and outputting a high-resolution video;
the training module comprises:
the sampling submodule is used for acquiring a training sample, and the training sample contains a high-resolution video frame sample and a low-resolution video frame sample;
and the model establishing submodule is used for establishing a video super-resolution model based on a preset loss function and a high-resolution video frame sample according to the collected training sample.
Optionally, the acquisition sub-module comprises:
the acquisition unit is used for acquiring a high-resolution video sample, and obtaining and backing up the high-resolution video frame sample by adopting a threshold shot segmentation algorithm;
the sampling unit is used for carrying out downsampling on the high-resolution video frame sample by adopting an image scaling algorithm to generate a low-resolution video frame sample;
and the sample establishing unit is used for acquiring the high-resolution video frame sample and the low-resolution video frame sample to establish a training sample.
Optionally, the model building submodule includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a set number of low-resolution video frame samples and setting a reference frame and an adjacent frame;
the generating unit is used for extracting features of the reference frame and the adjacent frame based on a residual error network and generating reference frame features and adjacent frame features;
the alignment unit is used for carrying out alignment processing on the adjacent frames by combining the deformable convolution network, the reference frame features and the adjacent features;
the first output unit is used for establishing the correlation degree of the adjacent frame and the reference frame by adopting a preset function and a relation matrix, performing first series connection on the aligned adjacent frame characteristics and the aligned reference frame characteristics, and outputting characteristic data fused with high-frequency information;
the reconstruction unit is used for transmitting the feature data fused with the high-frequency information into the reference frame features by adopting residual error dense connection and reconstructing a high-resolution video frame;
and the model establishing unit is used for reversely converging the reconstructed high-resolution video frame and the backed-up high-resolution video frame sample based on a preset loss function and establishing a video super-resolution model.
Optionally, the deformable convolutional network is provided with 5 variable convolutional layers, a multi-level feature fusion structure formed by 8 cavity convolutions, and 2 convolution kernels, and the alignment unit includes:
a second concatenation subunit, configured to perform a second concatenation of the adjacent frame feature and the reference frame feature in the channel dimension before inputting each of the variable convolutional layers;
the first output subunit is used for outputting convolution kernel offset and an adjusting coefficient after the characteristics of the adjacent frames after series connection are compressed by each convolution kernel and are superposed by cavity convolution;
and the second output subunit is used for performing self-adaptive sampling on the adjacent features by each variable convolution layer according to the convolution kernel offset and the adjusting coefficient and outputting the adjacent frame features after motion compensation.
Optionally, the first output unit includes:
the first determining subunit is used for determining the mapping relation between any pixel point in the adjacent frame and all pixel points of the reference frame by adopting the relation matrix;
the second determining subunit is configured to determine, according to the mapping relationship, a correlation degree between the adjacent frame and the reference frame by using a preset function;
the alignment subunit is used for performing alignment processing on the regions with the unaligned adjacent frame features in a jumping connection mode according to the correlation degree;
and the third output subunit is used for performing first series connection on the aligned adjacent frame features and the reference frame features and outputting feature data fused with high-frequency information.
Optionally, the reconstruction unit comprises:
the access subunit is used for globally jumping and accessing the reference frame characteristics based on the characteristic data which is formed by fusing the dense connection and the residual connection and has high-frequency information;
and the rearrangement subunit is used for rearranging the spatial dimension of the pixels of the reference frame by adopting a preset sub-pixel sampling layer and a convolution kernel to establish a high-resolution video frame.
Optionally, the output module includes:
the second output unit is used for inputting the low-resolution video frame into the video super-resolution model in a sliding window mode and outputting the high-resolution video frame;
a third output unit, configured to search for an adjacent frame that is closest to a start end or a tail end video frame of the sequence of video frames, complement the number of the adjacent frames, and output a high resolution start end or tail end video frame;
and the fourth output unit is used for recombining the output high-resolution video frame and/or the high-resolution start end or tail end video frame based on the video frame sequence and outputting the high-resolution video.
The third technical scheme adopted by the invention is as follows:
an apparatus comprising a memory and a processor, the memory being configured to store at least one program and the processor being configured to load the at least one program to perform the method described above.
The fourth technical scheme adopted by the invention is as follows:
a storage medium having stored therein a processor-executable program for performing the method as described above when executed by a processor.
The invention has the following beneficial effects: by processing the acquired low-resolution video frames with a video super-resolution model trained on samples containing both high-resolution and low-resolution video frame samples and established with a preset loss function, the low-resolution video frames can be accurately and efficiently restored to high-resolution video frames, and the interference that optical-flow errors cause on the recovery of the final video frame in optical-flow-based video super-resolution methods is avoided.
Drawings
FIG. 1 is a flow chart illustrating steps of a method for generating super-resolution video provided by the present invention;
FIG. 2 is a block diagram of a system for generating super-resolution video provided by the present invention;
FIG. 3 is a schematic flow chart of a video super-resolution model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating the operation of the deformable convolution layer in a deforming operation in accordance with an embodiment of the present invention;
FIG. 5 is a diagram illustrating a multi-level feature fusion structure in a deformable convolutional network according to an embodiment of the present invention;
FIG. 6 is a structural diagram illustrating the correlation between adjacent frames and reference frames according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of the residual dense connections in the reconstruction operation according to an embodiment of the present invention;
FIG. 8 is a comparison graph of the visualization results of the prior optimal solution and the solution of the present application using the Vid4 data set;
FIG. 9 is a comparison graph of the visualization results of the prior optimal solution and the solution of the present application using the SPMCS data set;
FIG. 10 is a graph comparing visualization results of the prior optimal solution using the Vimeo-90K-T data set with the solution of the present application.
Detailed Description
Example 1
As shown in fig. 1, the present embodiment provides a method for generating a video super-resolution, which includes the following steps:
s1, acquiring a low-resolution video frame to be processed, wherein the video frame comprises a complex motion scene;
s2, processing the low-resolution video frame through a video super-resolution model, and outputting a high-resolution video;
the video super-resolution model training process comprises the following steps:
s3, collecting training samples, wherein the training samples comprise high-resolution video frame samples and low-resolution video frame samples;
and S4, establishing a video super-resolution model based on the preset loss function and the high-resolution video frame sample according to the collected training sample.
Optionally, the step S2 includes:
s21, inputting the low-resolution video frame into a video super-resolution model in a sliding window mode, and outputting a high-resolution video frame;
s22, searching the adjacent frame nearest to the video frame at the start end or the tail end of the video frame sequence, complementing the number of the adjacent frames, and outputting the video frame at the start end or the tail end of high resolution;
and S23, recombining the output high-resolution video frame and/or the high-resolution start end or tail end video frame based on the video frame sequence, and outputting the high-resolution video.
Optionally, the step S3 includes:
s31, collecting a high-resolution video sample, obtaining the high-resolution video frame sample by adopting a threshold shot segmentation algorithm and backing up the high-resolution video frame sample;
s32, adopting an image scaling algorithm to carry out down-sampling on the high-resolution video frame sample to generate a low-resolution video frame sample;
and S33, acquiring the high-resolution video frame sample and the low-resolution video frame sample to establish a training sample.
Optionally, the step S4 includes:
s41, acquiring a set number of low-resolution video frame samples, and setting a reference frame and an adjacent frame;
s42, extracting features of the reference frame and the adjacent frame based on a residual error network, and generating reference frame features and adjacent frame features;
s43, aligning the adjacent frames by combining the deformable convolution network, the reference frame features and the adjacent features;
s44, establishing the correlation degree of the adjacent frame and the reference frame by adopting a preset function and a relation matrix, carrying out first series connection on the aligned adjacent frame characteristics and the reference frame characteristics, and outputting characteristic data fused with high-frequency information;
s45, transmitting the feature data fused with the high-frequency information into the reference frame features by adopting residual error dense connection, and reconstructing a high-resolution video frame;
and S46, reversely converging the reconstructed high-resolution video frame and the backed-up high-resolution video frame sample based on a preset loss function, and establishing a video super-resolution model.
Optionally, the deformable convolutional network is provided with 5 variable convolutional layers, a multi-level feature fusion structure formed by 8 cavity convolutions, and 2 convolutional kernels, and the step S43 includes:
s431, before inputting each variable convolution layer, performing second series connection on the adjacent frame features and the reference frame features in the channel dimension;
s432, after compressing each convolution kernel and performing cavity convolution and superposition on the adjacent frame features after series connection, outputting convolution kernel offset and an adjusting coefficient;
and S433, each variable convolution layer carries out self-adaptive sampling on adjacent features according to the convolution kernel offset and the adjusting coefficient, and outputs the adjacent frame features after motion compensation.
Optionally, the step S44 includes:
s441, determining the mapping relation between any pixel point in the adjacent frame and all pixel points of the reference frame by using the relation matrix;
s442, determining the correlation degree between the adjacent frame and the reference frame by adopting a preset function according to the mapping relation;
s443, aligning the regions with the unaligned adjacent frame features by adopting a jump connection mode according to the correlation;
and S444, carrying out first series connection on the aligned adjacent frame features and the reference frame features, and outputting feature data fused with high-frequency information.
Optionally, the step S45 includes:
s451, globally jumping access reference frame features of feature data fused with high-frequency information based on dense connection and residual connection;
and S452, rearranging the spatial dimension of the pixels of the reference frame by adopting a preset sub-pixel sampling layer and a convolution kernel to establish a high-resolution video frame.
Example 2
As shown in fig. 2, the present embodiment provides a system for generating a video super-resolution, the system including:
the acquisition module is used for acquiring a low-resolution video frame to be processed;
the output module is used for processing the low-resolution video frame through a video super-resolution model and outputting a high-resolution video;
the training module comprises:
the sampling submodule is used for acquiring a training sample, and the training sample contains a high-resolution video frame sample and a low-resolution video frame sample;
and the model establishing submodule is used for establishing a video super-resolution model based on a preset loss function and a high-resolution video frame sample according to the collected training sample.
Optionally, the acquisition sub-module comprises:
the acquisition unit is used for acquiring a high-resolution video sample, and obtaining and backing up the high-resolution video frame sample by adopting a threshold shot segmentation algorithm;
the sampling unit is used for carrying out downsampling on the high-resolution video frame sample by adopting an image scaling algorithm to generate a low-resolution video frame sample;
and the sample establishing unit is used for acquiring the high-resolution video frame sample and the low-resolution video frame sample to establish a training sample.
Optionally, the model building submodule includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a set number of low-resolution video frame samples and setting a reference frame and an adjacent frame;
the generating unit is used for extracting features of the reference frame and the adjacent frame based on a residual error network and generating reference frame features and adjacent frame features;
the alignment unit is used for carrying out alignment processing on the adjacent frames by combining the deformable convolution network, the reference frame features and the adjacent features;
the first output unit is used for establishing the correlation degree of the adjacent frame and the reference frame by adopting a preset function and a relation matrix, performing first series connection on the aligned adjacent frame characteristics and the aligned reference frame characteristics, and outputting characteristic data fused with high-frequency information;
the reconstruction unit is used for transmitting the feature data fused with the high-frequency information into the reference frame features by adopting residual error dense connection and reconstructing a high-resolution video frame;
and the model establishing unit is used for reversely converging the reconstructed high-resolution video frame and the backed-up high-resolution video frame sample based on a preset loss function and establishing a video super-resolution model.
Optionally, the deformable convolutional network is provided with 5 variable convolutional layers, a multi-level feature fusion structure formed by 8 cavity convolutions, and 2 convolution kernels, and the alignment unit includes:
a second concatenation subunit, configured to perform a second concatenation of the adjacent frame feature and the reference frame feature in the channel dimension before inputting each of the variable convolutional layers;
the first output subunit is used for outputting convolution kernel offset and an adjusting coefficient after the characteristics of the adjacent frames after series connection are compressed by each convolution kernel and are superposed by cavity convolution;
and the second output subunit is used for performing self-adaptive sampling on the adjacent features by each variable convolution layer according to the convolution kernel offset and the adjusting coefficient and outputting the adjacent frame features after motion compensation.
Optionally, the first output unit includes:
the first determining subunit is used for determining the mapping relation between any pixel point in the adjacent frame and all pixel points of the reference frame by adopting the relation matrix;
the second determining subunit is configured to determine, according to the mapping relationship, a correlation degree between the adjacent frame and the reference frame by using a preset function;
the alignment subunit is used for performing alignment processing on the regions with the unaligned adjacent frame features in a jumping connection mode according to the correlation degree;
and the third output subunit is used for performing first series connection on the aligned adjacent frame features and the reference frame features and outputting feature data fused with high-frequency information.
Optionally, the reconstruction unit comprises:
the access subunit is used for globally jumping and accessing the reference frame characteristics based on the characteristic data which is formed by fusing the dense connection and the residual connection and has high-frequency information;
and the rearrangement subunit is used for rearranging the spatial dimension of the pixels of the reference frame by adopting a preset sub-pixel sampling layer and a convolution kernel to establish a high-resolution video frame.
Optionally, the output module includes:
the second output unit is used for inputting the low-resolution video frame into the video super-resolution model in a sliding window mode and outputting the high-resolution video frame;
a third output unit, configured to search for an adjacent frame that is closest to a start end or a tail end video frame of the sequence of video frames, complement the number of the adjacent frames, and output a high resolution start end or tail end video frame;
and the fourth output unit is used for recombining the output high-resolution video frame and/or the high-resolution start end or tail end video frame based on the video frame sequence and outputting the high-resolution video.
Example 3
The present embodiments provide an apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
when the at least one program is executed by the at least one processor, the at least one program causes the at least one processor to implement the steps of a method for generating video super-resolution as described in embodiment 1 above.
Example 4
A storage medium having stored therein a program executable by a processor, the program being executed by the processor for performing the steps of a method for generating video super-resolution as described in embodiment 1.
Example 5
Referring to figs. 3 to 10, this embodiment provides a method for generating video super-resolution, which specifically includes the following steps:
A. acquiring training samples, wherein the training samples comprise high-resolution video frame samples and low-resolution video frame samples;
B. establishing a video super-resolution model according to the collected training samples;
C. acquiring a low-resolution video frame to be processed;
D. and processing the low-resolution video frame to be processed through the video super-resolution model, and outputting a high-resolution video.
Wherein, the specific implementation scheme of the step A is as follows:
a1, acquiring a public large-scale video data set Vimeo-90K as a training data set. The data set comprises a plurality of video frames with different motion scale ranges, so that the trained video resolution model has better generalization capability. The data set consisted of 64612 training samples, each sample containing 7 consecutive video frames of the same scene, of size 448 × 256.
A2, backing up the high-resolution video frame samples, and then down-sampling each high-resolution video frame sample by a factor of 4 with bicubic interpolation using an image scaling algorithm such as the imresize function of MATLAB, obtaining corresponding low-resolution video frame samples of size 112 × 64; each backed-up high-resolution video frame sample and the low-resolution video frame sample generated from it by sampling form a pair of training samples. Horizontal or vertical flipping, 90-degree rotation and random cropping of image blocks are adopted as data enhancement.
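For illustration only, the following Python sketch shows one way such a training pair could be assembled. It uses PIL's bicubic resize in place of MATLAB's imresize (an approximation, since the two implementations differ slightly), the 200 × 200 high-resolution crop size is inferred from the ×4 scale, and all function and variable names are placeholders rather than part of the patent.

```python
import random
import numpy as np
from PIL import Image

def make_training_pair(hr_frames, scale=4, lr_patch=50):
    """Build one (LR, HR) training pair from 7 consecutive HR frames.

    hr_frames: list of 7 HxWx3 uint8 arrays (e.g. 256x448 Vimeo-90K frames).
    PIL bicubic resizing stands in for MATLAB's imresize; augmentation
    follows the patent: horizontal/vertical flips, 90-degree rotation and
    random cropping of co-located image blocks.
    """
    h, w, _ = hr_frames[0].shape
    lr_frames = [np.asarray(Image.fromarray(f).resize((w // scale, h // scale),
                                                      Image.BICUBIC))
                 for f in hr_frames]

    # random co-located crop: 50x50 in LR corresponds to 200x200 in HR
    y = random.randint(0, h // scale - lr_patch)
    x = random.randint(0, w // scale - lr_patch)
    lr = np.stack([f[y:y + lr_patch, x:x + lr_patch] for f in lr_frames])
    hr = hr_frames[len(hr_frames) // 2][y * scale:(y + lr_patch) * scale,
                                        x * scale:(x + lr_patch) * scale]

    # data augmentation applied identically to the LR clip and the HR target
    if random.random() < 0.5:                      # horizontal flip
        lr, hr = lr[:, :, ::-1], hr[:, ::-1]
    if random.random() < 0.5:                      # vertical flip
        lr, hr = lr[:, ::-1], hr[::-1]
    if random.random() < 0.5:                      # 90-degree rotation
        lr, hr = np.rot90(lr, axes=(1, 2)), np.rot90(hr)
    return np.ascontiguousarray(lr), np.ascontiguousarray(hr)
```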
The specific embodiment of the step B is as follows:
b1, selecting 7 consecutive low-resolution video frame samples and randomly cropping 50 × 50 image blocks at the same position as input, wherein the middle frame serves as the reference frame to be restored, denoted I_t, and the other frames serve as adjacent frames that aid the recovery, denoted I_i, with i ∈ [t−3, t+3] and i ≠ t.
B2, using a residual network to carry out shallow feature extraction on the 7 low-resolution video frame samples composed of the reference frame I_t and the adjacent frames I_i.
It can be understood that the features produced by the feature extraction module H_fea contain 64 channels and have the same size as the input picture. The residual network consists of 5 cascaded residual blocks, and each residual block contains two 3 × 3 convolutional layers, a ReLU activation function and a skip connection.
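As an illustrative sketch only (the class names, the handling of the 7-frame batch and the defaults are assumptions, not part of the patent), the feature extraction module H_fea described above could be written in PyTorch as:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions, a ReLU and a skip connection (64 channels)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))

class FeatureExtractor(nn.Module):
    """H_fea: lifts each 3-channel LR frame to a 64-channel feature map of
    the same spatial size, then refines it with 5 cascaded residual blocks."""
    def __init__(self, in_channels=3, channels=64, num_blocks=5):
        super().__init__()
        self.head = nn.Conv2d(in_channels, channels, 3, padding=1)
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])

    def forward(self, frames):              # frames: (B, 7, 3, H, W)
        b, n, c, h, w = frames.shape
        feats = self.body(self.head(frames.view(b * n, c, h, w)))
        return feats.view(b, n, -1, h, w)   # (B, 7, 64, H, W)
```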
B3, among the features F_T of all video frames, the obtained reference frame feature is denoted F_t and the adjacent frame features are denoted F_i. A deformable convolutional network is used to align each adjacent frame feature F_i to the reference frame feature F_t, and the aligned adjacent frame feature is denoted F'_i.
The part consisting of the deformable convolutional network can be understood as the alignment module H_align. Referring to fig. 4 and fig. 5, the alignment module (deformable convolutional network) comprises 5 cascaded deformable convolutional layers, 2 3 × 3 convolutional layers, and a multi-level feature fusion structure formed by 8 3 × 3 cavity (dilated) convolutions with dilation rates of 1 to 8 respectively. The reference frame feature F_t and an adjacent frame feature F_i are first concatenated in the channel dimension, and the number of channels is compressed back to 64 through a 3 × 3 convolutional layer. The 8 3 × 3 cavity convolutions, each outputting 32 feature channels, are then used to effectively expand the receptive field; their results are superposed and summed one by one to obtain convolution results superposing 8 receptive fields from small to large, which are concatenated and compressed to 64 channels with a 1 × 1 convolution. Finally, a 3 × 3 convolutional layer generates the two parameters required by the deformable convolution kernels: the convolution kernel offsets ΔP_i and the adjustment coefficients ΔM_i.
This process can be expressed as:
ΔP_i, ΔM_i = f([F_i, F_t])
Through the spatial pyramid structure formed by the cavity convolutions, this feature fusion effectively enlarges the receptive field, and superposing convolution results with different dilation rates makes the gathered information richer, which greatly helps to capture the pixel-level motion relation between the adjacent frame features and the reference frame features and to generate more accurate deformable convolution parameters. After the deformable convolutional layer obtains the convolution kernel offsets ΔP_i and the adjustment coefficients ΔM_i, it can adaptively sample the adjacent frame feature F_i, realizing implicit motion compensation.
Taking F_{i,b-1} and F_{i,b} as the input and output of one of the deformable convolutional layers, the deformable convolution operation can be expressed as follows: F_{i,b}(p) = Σ_k ω_k · F_{i,b-1}(p + p_k + Δp_{i,k}) · Δm_{i,k}, k = 1, …, K
where p_k denotes the k-th sampling position of the convolution kernel and ω_k the corresponding kernel weight; for a 3 × 3 convolution kernel, K = 9 and p_k ∈ {(−1, −1), (−1, 0), …, (1, 1)}. The deformable convolution adds the extra kernel offsets Δp_{i,k} so that the sampling positions can be adjusted according to each center point p, while the adjustment coefficients Δm_{i,k} allow the corresponding kernel weights to change dynamically, where ΔP_i = {Δp_{i,k}} and ΔM_i = {Δm_{i,k}}. The whole sampling process is adaptive and can be trained end to end, thus realizing an excellent motion compensation effect.
After passing through the 5 deformable convolutional layers, the adjacent frame feature F_i undergoes a coarse-to-fine alignment process in which the alignment precision is gradually improved. The alignment module can be written as: F'_i = H_align(F_i, F_t)
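A simplified PyTorch sketch of one such alignment stage is given below, assuming torchvision's modulated deformable convolution (torchvision.ops.DeformConv2d). The exact wiring of the dilated-convolution pyramid and the sigmoid applied to the adjustment coefficients are interpretations of the description above, and all names and sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AlignmentStage(nn.Module):
    """One deformable-convolution alignment stage of H_align.

    [F_i, F_t] are concatenated, compressed to 64 channels, passed through a
    pyramid of 3x3 dilated (cavity) convolutions with dilation 1..8, and
    mapped to the kernel offsets dP_i and adjustment coefficients dM_i that
    drive a modulated deformable convolution over F_i.
    """
    def __init__(self, channels=64, num_dilations=8):
        super().__init__()
        self.compress = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.pyramid = nn.ModuleList(
            nn.Conv2d(channels, 32, 3, padding=d, dilation=d)
            for d in range(1, num_dilations + 1))
        self.fuse = nn.Conv2d(32 * num_dilations, channels, 1)
        # 3x3 kernel -> 9 sampling points: 18 offset + 9 modulation channels
        self.offset_mask = nn.Conv2d(channels, 27, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, f_i, f_t):
        x = self.compress(torch.cat([f_i, f_t], dim=1))
        # cumulative sums of the dilated responses give receptive fields from
        # small to large; they are concatenated and compressed by a 1x1 conv
        acc, levels = None, []
        for conv in self.pyramid:
            acc = conv(x) if acc is None else acc + conv(x)
            levels.append(acc)
        params = self.offset_mask(self.fuse(torch.cat(levels, dim=1)))
        offset, mask = params.split([18, 9], dim=1)
        # sigmoid keeps the modulation in [0, 1] (DCNv2 convention; assumption)
        return self.dcn(f_i, offset, torch.sigmoid(mask))
```

In the full alignment module, 5 such stages would be cascaded so that F_i is aligned to F_t from coarse to fine.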
b4, respectively processing the adjacent frame features and the reference frame features by adopting 1 × 1 convolution, performing matrix multiplication after dimensional transformation, and obtaining the correlation degree between the adjacent frame and the reference frame by using a softmax function, namely the correlation degree between a certain pixel point in the adjacent frame features and all pixel points in the reference frame features; and performing first series connection on the aligned adjacent frame characteristics and the reference frame characteristics, and outputting characteristic data fused with high-frequency information.
This part can be understood as the attention module H_nl, i.e. the regions of the adjacent frames that the alignment module failed to align in step B3 are emphasized again. The attention module is designed based on a non-local mechanism, and the calculation in the module can be expressed as:
x'_p = W_z · softmax((W_u x_p)^T W_v y_q) (W_g y_q) + x_p
where x_p and y_q respectively represent one pixel of the input aligned adjacent frame feature F'_i and one pixel of the reference frame feature F_t, x'_p represents the corresponding output pixel of the adjacent frame feature, W_u x_p, W_v y_q and W_g y_q represent the data obtained by transforming the input adjacent frame features and reference frame features with three 1 × 1 convolutions, and W_z denotes a further 1 × 1 convolution applied to the feature data obtained from the correlation calculation.
All attended adjacent frame features F'_i and the reference frame feature F_t are concatenated in the channel dimension, the number of channels is compressed with one 3 × 3 convolutional layer, and the feature data F_fusion fused with high-frequency information is output, where F_fusion can be expressed as: F_fusion = Conv_3×3([F'_{t−3}, …, F'_{t−1}, F_t, F'_{t+1}, …, F'_{t+3}])
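The attention module H_nl and the fusion step can be sketched in PyTorch as follows; this is a direct transcription of the formula above, with the inner channel size and all names chosen for illustration rather than taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalAttention(nn.Module):
    """H_nl: x'_p = W_z softmax((W_u x_p)^T W_v y_q)(W_g y_q) + x_p, computed
    for every pixel p of the aligned adjacent-frame feature x against every
    pixel q of the reference-frame feature y."""
    def __init__(self, channels=64, inner=32):
        super().__init__()
        self.w_u = nn.Conv2d(channels, inner, 1)
        self.w_v = nn.Conv2d(channels, inner, 1)
        self.w_g = nn.Conv2d(channels, inner, 1)
        self.w_z = nn.Conv2d(inner, channels, 1)

    def forward(self, x, y):                          # x: aligned F'_i, y: F_t
        b, c, h, w = x.shape
        q = self.w_u(x).flatten(2).transpose(1, 2)    # (B, HW, inner)
        k = self.w_v(y).flatten(2)                    # (B, inner, HW)
        g = self.w_g(y).flatten(2).transpose(1, 2)    # (B, HW, inner)
        relation = F.softmax(q @ k, dim=-1)           # relation matrix (B, HW, HW)
        out = (relation @ g).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.w_z(out)                      # skip connection keeps x_p

def fuse_features(attended_feats, f_t, conv3x3):
    """F_fusion: concatenate the attended adjacent-frame features with F_t
    along the channel dimension and compress them with one 3x3 convolution."""
    return conv3x3(torch.cat(attended_feats + [f_t], dim=1))
```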
b5, using residual dense connections to transmit the feature data F_fusion fused with high-frequency information into the reference frame feature F_t and reconstruct the high-resolution video frame I_t^SR.
The residual dense connections in this section can be understood as the reconstruction module. The reconstruction module (the residual dense connection part) comprises 23 cascaded residual dense blocks H_RRDBs and a global skip connection. As shown in fig. 7, each residual dense block H_RRDBs is built from densely connected blocks; each densely connected block consists of 5 convolutional layers, the channel growth within each densely connected block is set to 32, and the output of each convolutional layer is passed through several skip connections to the subsequent convolutional layers in the block as additional input. The residual dense block combines the advantages of dense connections and residual connections, and effectively extracts the high-frequency information contained in the features by exploiting multi-level features. The global skip connection feeds in the reference frame feature F_t. At the end of the network, a 3 × 3 convolutional layer is used to expand the number of channels to 64 × 16, a sub-pixel upsampling layer rearranges the pixels from the channel dimension to the spatial dimension to obtain a 4-times enlarged feature with 64 channels, and a final 3 × 3 convolutional layer outputs the 3-channel high-resolution reference frame I_t^SR. This operation is denoted H_rec. The reconstruction process can be expressed as follows: I_t^SR = H_rec(H_RRDBs(F_fusion) + F_t)
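A condensed PyTorch sketch of the reconstruction module is given below. The internal structure of each residual dense block is simplified to a single 5-layer densely connected block (the recovered text does not state how many densely connected blocks each H_RRDBs contains), and the LeakyReLU slope is an assumption.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """5 convolutional layers with dense skip connections, channel growth 32,
    followed by a local residual connection."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth if i < 4 else channels, 3, padding=1)
            for i in range(5))
        self.lrelu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for i, conv in enumerate(self.convs):
            out = conv(torch.cat(feats, dim=1))
            feats.append(self.lrelu(out) if i < 4 else out)
        return x + feats[-1]

class Reconstructor(nn.Module):
    """Cascaded residual dense blocks, a global skip from F_t, and sub-pixel
    upsampling (x4) to the 3-channel high-resolution frame."""
    def __init__(self, channels=64, num_rrdb=23, scale=4):
        super().__init__()
        self.rrdbs = nn.Sequential(*[DenseBlock(channels) for _ in range(num_rrdb)])
        self.expand = nn.Conv2d(channels, channels * scale * scale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, f_fusion, f_t):
        x = self.rrdbs(f_fusion) + f_t          # global skip connection from F_t
        x = self.shuffle(self.expand(x))        # 64*16 channels -> 64 channels, 4x size
        return self.to_rgb(x)
```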
b6, using the loss function to reversely converge the reconstructed high-resolution video frame I_t^SR with the backed-up high-resolution video frame sample I_t^HR, and establishing the video super-resolution model.
The L_1 loss function is formulated as follows: L_1 = (1 / (W · H · C)) Σ | I_t^SR − I_t^HR |, where W, H and C respectively represent the width, height and number of channels of the high-resolution video frame and I_t^HR is the backed-up high-resolution video frame sample. A learning rate is set, the gradient is propagated backwards by minimizing the loss function error, the network parameters are updated, and iteration continues until the network is trained to convergence.
In the backward convergence training, the batch size is set to 8 and the initial learning rate to 10^-4. In the iterative training process, according to the convergence condition of the network, the learning rate is halved for the first time after 70 epochs, accelerating the training of the video super-resolution model, and is then halved every 20 epochs; the optimizer parameters are set to β_1 = 0.9, β_2 = 0.999 and ε = 10^-8. Using the L_1 loss function, the error between the high-resolution video frames I_t^SR generated by the video super-resolution model and the original high-resolution video frames I_t^HR is calculated, and the network parameters are updated by back-propagating the gradient that minimizes this error. The network is trained to convergence within 120 epochs. The L_2 loss function is then used to continue training for another 10 epochs, fine-tuning the network parameters to further improve the performance of the video super-resolution model.
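The training schedule described above could be driven by a loop of the following form; the optimizer is assumed to be Adam (consistent with the β_1, β_2 and ε values quoted), and the model and data-loader names are placeholders.

```python
import torch
import torch.nn as nn

def train(model, loader, device="cuda"):
    """Sketch of the schedule above: L1 loss for 120 epochs with the learning
    rate halved after epoch 70 and then every 20 epochs, followed by 10 epochs
    of L2 fine-tuning. Batch size 8 is assumed to be set in `loader`; the
    choice of Adam is an assumption inferred from the beta/epsilon values."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8)

    def run_epochs(num_epochs, criterion, start_epoch=0):
        for epoch in range(start_epoch, start_epoch + num_epochs):
            if epoch == 70 or (epoch > 70 and (epoch - 70) % 20 == 0):
                for group in opt.param_groups:
                    group["lr"] *= 0.5                     # halve the learning rate
            for lr_frames, hr_frame in loader:             # (B,7,3,h,w), (B,3,4h,4w)
                sr = model(lr_frames.to(device))
                loss = criterion(sr, hr_frame.to(device))
                opt.zero_grad()
                loss.backward()                            # back-propagate the gradient
                opt.step()

    run_epochs(120, nn.L1Loss())                           # main training with L1 loss
    run_epochs(10, nn.MSELoss(), start_epoch=120)          # fine-tune with L2 loss
```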
The scheme of the step C is specifically as follows:
respectively acquiring a Vid4 data set, an SPMCS data set and a Vimeo-90K-T data set, wherein the data sets comprise various videos of large-scale motion scenes and complex motion scenes, and video frames under the same shot are segmented and extracted in advance according to the scenes.
The scheme of the step D is specifically as follows:
The video frames to be recovered from the Vid4, SPMCS and Vimeo-90K-T data sets are extracted after shot splitting with a frame stride of 1, and 7 consecutive low-resolution video frames are fed into the trained video super-resolution model each time. The scheme of step B is applied to the input Vid4, SPMCS and Vimeo-90K-T video frames through the video super-resolution model in a sliding-window manner. For video frames at the start or the end of a video frame sequence, the adjacent frames nearest to the start or end frame are found to complement the required number of adjacent frames before the window is input to the model in the same sliding-window manner. The output high-resolution video frames are then recombined according to the video frame sequence to output the high-resolution video, as shown in figs. 8 to 10 and tables 1 to 3. Tables 1 to 3 compare the existing optimal schemes with the scheme of the present application on the Vid4, SPMCS and Vimeo-90K-T data sets in terms of the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) indexes; figs. 8 to 10 are the corresponding comparisons of visualization results between the existing optimal schemes and the scheme of the present application on the Vid4, SPMCS and Vimeo-90K-T data sets, respectively.
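The sliding-window inference with boundary complementing described in step D can be sketched as follows; clamping frame indices at the sequence boundaries is one plausible reading of complementing the window with the nearest adjacent frames, and all names are illustrative.

```python
import torch

def super_resolve_sequence(model, lr_frames, window=7, device="cuda"):
    """Run the trained model over a full LR sequence with a sliding window.

    lr_frames: list of (3, h, w) tensors in temporal order. Near the start or
    end of the sequence the indices are clamped, so the nearest available
    adjacent frames fill out the 7-frame window (one reading of the
    "complementing" rule described above).
    """
    half = window // 2
    n = len(lr_frames)
    model.eval()
    outputs = []
    with torch.no_grad():
        for t in range(n):
            idx = [min(max(i, 0), n - 1) for i in range(t - half, t + half + 1)]
            clip = torch.stack([lr_frames[i] for i in idx]).unsqueeze(0).to(device)
            outputs.append(model(clip).squeeze(0).cpu())   # (3, 4h, 4w)
    # frames are produced in sequence order, so concatenating them
    # reassembles the high-resolution video
    return outputs
```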
TABLE 1
TABLE 2
TABLE 3
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A method for generating video super-resolution is characterized by comprising the following steps:
acquiring a low-resolution video frame to be processed;
processing the low-resolution video frame through a video super-resolution model, and outputting a high-resolution video;
the video super-resolution model training process comprises the following steps:
acquiring training samples, wherein the training samples comprise high-resolution video frame samples and low-resolution video frame samples;
and establishing a video super-resolution model based on a preset loss function and the high-resolution video frame sample according to the acquired training sample.
2. The method for generating video super-resolution according to claim 1, wherein the step of collecting training samples, the training samples containing high-resolution video frame samples and low-resolution video frame samples, specifically comprises the following steps:
collecting a high-resolution video sample, obtaining a high-resolution video frame sample by adopting a threshold shot segmentation algorithm, and backing up the high-resolution video frame sample;
adopting an image scaling algorithm to carry out down-sampling on the high-resolution video frame sample to generate a low-resolution video frame sample;
and acquiring a high-resolution video frame sample and a low-resolution video frame sample to establish a training sample.
3. The method for generating video super-resolution according to claim 2, wherein the step of establishing a video super-resolution model based on the preset loss function and the high-resolution video frame sample according to the collected training samples specifically comprises the following steps:
acquiring a set number of low-resolution video frame samples, and setting a reference frame and an adjacent frame;
extracting features of the reference frame and the adjacent frame based on a residual error network, and generating reference frame features and adjacent frame features;
aligning the adjacent frames by combining the deformable convolution network, the reference frame features and the adjacent features;
establishing the correlation degree of the adjacent frame and the reference frame by adopting a preset function and a relation matrix, carrying out first series connection on the aligned adjacent frame characteristics and the aligned reference frame characteristics, and outputting characteristic data fused with high-frequency information;
transmitting the feature data fused with the high-frequency information into the reference frame features by adopting residual error dense connection, and reconstructing a high-resolution video frame;
and reversely converging the reconstructed high-resolution video frame and the backed-up high-resolution video frame sample based on a preset loss function, and establishing a video super-resolution model.
4. The method for generating video super-resolution according to claim 3, wherein the deformable convolution network has 5 variable convolution layers, a multi-level feature fusion structure composed of 8 cavity convolutions and 2 convolution kernels, and the step of aligning the adjacent frames by combining the deformable convolution network, the reference frame feature and the adjacent frame features specifically comprises the following steps:
before each variable convolution layer is input, performing second concatenation on the adjacent frame features and the reference frame features in the channel dimension;
after the series connection of adjacent frame features is compressed by each convolution kernel and is superposed by cavity convolution, the offset and the adjustment coefficient of the convolution kernel are output;
and each variable convolution layer carries out self-adaptive sampling on adjacent features according to the convolution kernel offset and the adjusting coefficient, and outputs the adjacent frame features after motion compensation.
5. The method for generating video super-resolution according to claim 4, wherein the step of establishing the correlation between the adjacent frame and the reference frame by using a preset function and a relation matrix, performing the first concatenation on the aligned adjacent frame feature and the reference frame feature, and outputting the feature data fused with the high-frequency information specifically comprises the steps of:
determining the mapping relation between any pixel point in the adjacent frame and all pixel points of the reference frame by using the relation matrix;
determining the correlation degree of the adjacent frame and the reference frame by adopting a preset function according to the mapping relation;
aligning the regions with the unaligned adjacent frame features in a jumping connection mode according to the correlation;
and performing first series connection on the aligned adjacent frame features and the reference frame features, and outputting feature data fused with high-frequency information.
6. The method for generating super-resolution video of claim 5, wherein the step of reconstructing the high-resolution video frame by using residual dense connection to transmit the feature data fused with the high-frequency information into the reference frame feature comprises the following steps:
globally jumping and accessing the reference frame characteristics based on the characteristic data which is connected and fused with high-frequency information by dense connection and residual error;
and rearranging the spatial dimension of the pixels of the reference frame by adopting a preset sub-pixel sampling layer and a convolution kernel to establish a high-resolution video frame.
7. The method for generating super-resolution video of claim 1, wherein the step of processing the low-resolution video frame by the super-resolution video model and outputting the high-resolution video comprises the following steps:
inputting the low-resolution video frame into a video super-resolution model in a sliding window mode, and outputting a high-resolution video frame;
searching adjacent frames nearest to the video frame of the starting end or the tail end of the video frame sequence, complementing the number of the adjacent frames, and outputting a high-resolution starting end or tail end video frame;
and recombining the output high-resolution video frame and/or the high-resolution start end or tail end video frame based on the video frame sequence to output the high-resolution video.
8. A system for generating super-resolution video, comprising:
an acquisition module for acquiring the low-resolution video frames to be processed;
an output module for processing the low-resolution video frames through a video super-resolution model and outputting a high-resolution video;
and a training module, which comprises:
a sampling submodule for collecting training samples, each training sample containing a high-resolution video frame sample and a low-resolution video frame sample;
and a model building submodule for building the video super-resolution model from the collected training samples based on a preset loss function and the high-resolution video frame samples.
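A minimal sketch of the training module in claim 8; the Adam optimizer and the L1 loss standing in for the unspecified "preset loss function" are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class TrainingModule:
    """Training-module sketch: a sampling step is assumed to deliver paired
    (low-resolution, high-resolution) frame samples; Adam and an L1 loss stand
    in for the unspecified optimizer and 'preset loss function'."""

    def __init__(self, sr_model: nn.Module, lr=1e-4):
        self.sr_model = sr_model                      # video super-resolution model
        self.optimizer = torch.optim.Adam(sr_model.parameters(), lr=lr)
        self.loss_fn = nn.L1Loss()                    # assumed preset loss function

    def build_step(self, lr_sample, hr_sample):
        """One model-building step on a collected training sample."""
        self.optimizer.zero_grad()
        loss = self.loss_fn(self.sr_model(lr_sample), hr_sample)
        loss.backward()
        self.optimizer.step()
        return loss.item()
```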
9. An apparatus, comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method according to any one of claims 1 to 7.
10. A storage medium storing a program executable by a processor, wherein the program, when executed by the processor, performs the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010353851.XA CN111583112A (en) | 2020-04-29 | 2020-04-29 | Method, system, device and storage medium for video super-resolution |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111583112A true CN111583112A (en) | 2020-08-25 |
Family ID: 72121524
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010353851.XA Pending CN111583112A (en) | 2020-04-29 | 2020-04-29 | Method, system, device and storage medium for video super-resolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111583112A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017219263A1 (en) * | 2016-06-22 | 2017-12-28 | 中国科学院自动化研究所 | Image super-resolution enhancement method based on bidirectional recursion convolution neural network |
CN109118431A (en) * | 2018-09-05 | 2019-01-01 | 武汉大学 | A kind of video super-resolution method for reconstructing based on more memories and losses by mixture |
CN110120011A (en) * | 2019-05-07 | 2019-08-13 | 电子科技大学 | A kind of video super resolution based on convolutional neural networks and mixed-resolution |
Non-Patent Citations (1)
Title |
---|
WANG HUA ET AL.: "Deformable Non-Local Network for Video Super-Resolution" * |
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102166B (en) * | 2020-08-26 | 2023-12-01 | 上海交通大学 | Combined super-resolution, color gamut expansion and inverse tone mapping method and equipment |
CN112102166A (en) * | 2020-08-26 | 2020-12-18 | 上海交通大学 | Method and device for combining super-resolution, color gamut expansion and inverse tone mapping |
CN113766250A (en) * | 2020-09-29 | 2021-12-07 | 四川大学 | Compressed image quality improving method based on sampling reconstruction and feature enhancement |
CN112365403A (en) * | 2020-11-20 | 2021-02-12 | 山东大学 | Video super-resolution recovery method based on deep learning and adjacent frames |
CN112365403B (en) * | 2020-11-20 | 2022-12-27 | 山东大学 | Video super-resolution recovery method based on deep learning and adjacent frames |
CN112330543A (en) * | 2020-12-01 | 2021-02-05 | 上海网达软件股份有限公司 | Video super-resolution method and system based on self-supervision learning |
CN112700392A (en) * | 2020-12-01 | 2021-04-23 | 华南理工大学 | Video super-resolution processing method, device and storage medium |
CN112750094A (en) * | 2020-12-30 | 2021-05-04 | 合肥工业大学 | Video processing method and system |
CN112750094B (en) * | 2020-12-30 | 2022-12-09 | 合肥工业大学 | Video processing method and system |
CN112669216B (en) * | 2021-01-05 | 2022-04-22 | 华南理工大学 | Super-resolution reconstruction network of parallel cavity new structure based on federal learning |
CN112669216A (en) * | 2021-01-05 | 2021-04-16 | 华南理工大学 | Super-resolution reconstruction network of parallel cavity new structure based on federal learning |
CN112766340B (en) * | 2021-01-11 | 2024-06-04 | 中山大学 | Depth capsule network image classification method and system based on self-adaptive spatial mode |
CN112766340A (en) * | 2021-01-11 | 2021-05-07 | 中山大学 | Depth capsule network image classification method and system based on adaptive spatial mode |
CN112785667A (en) * | 2021-01-25 | 2021-05-11 | 北京有竹居网络技术有限公司 | Video generation method, device, medium and electronic equipment |
CN113038055B (en) * | 2021-01-27 | 2023-06-23 | 维沃移动通信有限公司 | Image processing method and device and electronic equipment |
CN113038055A (en) * | 2021-01-27 | 2021-06-25 | 维沃移动通信有限公司 | Image processing method and device and electronic equipment |
WO2022166245A1 (en) * | 2021-02-08 | 2022-08-11 | 南京邮电大学 | Super-resolution reconstruction method for video frame |
US11995796B2 (en) * | 2021-02-08 | 2024-05-28 | Nanjing University Of Posts And Telecommunications | Method of reconstruction of super-resolution of video frame |
US20220261959A1 (en) * | 2021-02-08 | 2022-08-18 | Nanjing University Of Posts And Telecommunications | Method of reconstruction of super-resolution of video frame |
CN113033616B (en) * | 2021-03-02 | 2022-12-02 | 北京大学 | High-quality video reconstruction method, device, equipment and storage medium |
CN113033616A (en) * | 2021-03-02 | 2021-06-25 | 北京大学 | High-quality video reconstruction method, device, equipment and storage medium |
CN113205456A (en) * | 2021-04-30 | 2021-08-03 | 东北大学 | Super-resolution reconstruction method for real-time video session service |
CN113205456B (en) * | 2021-04-30 | 2023-09-22 | 东北大学 | Super-resolution reconstruction method for real-time video session service |
CN113066013A (en) * | 2021-05-18 | 2021-07-02 | 广东奥普特科技股份有限公司 | Method, system, device and storage medium for generating visual image enhancement |
WO2022242029A1 (en) * | 2021-05-18 | 2022-11-24 | 广东奥普特科技股份有限公司 | Generation method, system and apparatus capable of visual resolution enhancement, and storage medium |
CN113139907B (en) * | 2021-05-18 | 2023-02-14 | 广东奥普特科技股份有限公司 | Generation method, system, device and storage medium for visual resolution enhancement |
CN113139907A (en) * | 2021-05-18 | 2021-07-20 | 广东奥普特科技股份有限公司 | Generation method, system, device and storage medium for visual resolution enhancement |
WO2022241995A1 (en) * | 2021-05-18 | 2022-11-24 | 广东奥普特科技股份有限公司 | Visual image enhancement generation method and system, device, and storage medium |
CN113066014A (en) * | 2021-05-19 | 2021-07-02 | 云南电网有限责任公司电力科学研究院 | Image super-resolution method and device |
CN113066014B (en) * | 2021-05-19 | 2022-09-02 | 云南电网有限责任公司电力科学研究院 | Image super-resolution method and device |
CN113487481A (en) * | 2021-07-02 | 2021-10-08 | 河北工业大学 | Circular video super-resolution method based on information construction and multi-density residual block |
CN113610706A (en) * | 2021-07-19 | 2021-11-05 | 河南大学 | Fuzzy monitoring image super-resolution reconstruction method based on convolutional neural network |
CN113724136A (en) * | 2021-09-06 | 2021-11-30 | 腾讯音乐娱乐科技(深圳)有限公司 | Video restoration method, device and medium |
CN113902620A (en) * | 2021-10-25 | 2022-01-07 | 浙江大学 | Video super-resolution system and method based on deformable convolution network |
CN113902623A (en) * | 2021-11-22 | 2022-01-07 | 天津大学 | Method for super-resolution of arbitrary-magnification video by introducing scale information |
CN114429602A (en) * | 2022-01-04 | 2022-05-03 | 北京三快在线科技有限公司 | Semantic segmentation method and device, electronic equipment and storage medium |
CN114862688A (en) * | 2022-03-14 | 2022-08-05 | 杭州群核信息技术有限公司 | Video frame insertion method, device and system based on deep learning |
CN114862688B (en) * | 2022-03-14 | 2024-08-16 | 杭州群核信息技术有限公司 | Video frame inserting method, device and system based on deep learning |
WO2023185284A1 (en) * | 2022-03-31 | 2023-10-05 | 网银在线(北京)科技有限公司 | Video processing method and apparatuses |
CN114494023A (en) * | 2022-04-06 | 2022-05-13 | 电子科技大学 | Video super-resolution implementation method based on motion compensation and sparse enhancement |
CN115115516A (en) * | 2022-06-27 | 2022-09-27 | 天津大学 | Real-world video super-resolution algorithm based on Raw domain |
CN115396710A (en) * | 2022-08-09 | 2022-11-25 | 深圳乐播科技有限公司 | Method for H5 or small program to project short video and related device |
CN115035230A (en) * | 2022-08-12 | 2022-09-09 | 阿里巴巴(中国)有限公司 | Video rendering processing method, device and equipment and storage medium |
CN116128735A (en) * | 2023-04-17 | 2023-05-16 | 中国工程物理研究院电子工程研究所 | Multispectral image demosaicing structure and method based on densely connected residual error network |
CN116797462A (en) * | 2023-08-18 | 2023-09-22 | 深圳市优森美科技开发有限公司 | Real-time video super-resolution reconstruction method based on deep learning |
CN116797462B (en) * | 2023-08-18 | 2023-10-24 | 深圳市优森美科技开发有限公司 | Real-time video super-resolution reconstruction method based on deep learning |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111583112A (en) | Method, system, device and storage medium for video super-resolution | |
CN111311490B (en) | Video super-resolution reconstruction method based on multi-frame fusion optical flow | |
CN111028150B (en) | Rapid space-time residual attention video super-resolution reconstruction method | |
CN115222601A (en) | Image super-resolution reconstruction model and method based on residual mixed attention network | |
US20190124346A1 (en) | Real time end-to-end learning system for a high frame rate video compressive sensing network | |
CN114677304B (en) | Image deblurring algorithm based on knowledge distillation and deep neural network | |
CN110610467B (en) | Multi-frame video compression noise removing method based on deep learning | |
CN112699844A (en) | Image super-resolution method based on multi-scale residual error level dense connection network | |
CN113747242A (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN115689917A (en) | Efficient space-time super-resolution video compression restoration method based on deep learning | |
CN114757828A (en) | Transformer-based video space-time super-resolution method | |
CN113850718A (en) | Video synchronization space-time super-resolution method based on inter-frame feature alignment | |
CN114926336A (en) | Video super-resolution reconstruction method and device, computer equipment and storage medium | |
CN114372918A (en) | Super-resolution image reconstruction method and system based on pixel level attention mechanism | |
CN111860363A (en) | Video image processing method and device, electronic equipment and storage medium | |
Yue et al. | A global appearance and local coding distortion based fusion framework for CNN based filtering in video coding | |
CN116883265A (en) | Image deblurring method based on enhanced feature fusion mechanism | |
Chandramouli et al. | A generative model for generic light field reconstruction | |
CN111833245A (en) | Super-resolution reconstruction method based on multi-scene video frame supplementing algorithm | |
CN113393382B (en) | Binocular picture super-resolution reconstruction method based on multi-dimensional parallax prior | |
Hu et al. | Store and fetch immediately: Everything is all you need for space-time video super-resolution | |
CN115797178B (en) | Video super-resolution method based on 3D convolution | |
Pang et al. | Video super-resolution using a hierarchical recurrent multireceptive-field integration network | |
CN115065796A (en) | Method and device for generating video intermediate frame | |
Wang et al. | Bi-RSTU: Bidirectional recurrent upsampling network for space-time video super-resolution |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200825 |