CN115546030B - Compressed video super-resolution method and system based on twin super-resolution network - Google Patents

Compressed video super-resolution method and system based on twin super-resolution network

Info

Publication number
CN115546030B
Authority
CN
China
Prior art keywords
super
resolution
video
twin
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211515828.1A
Other languages
Chinese (zh)
Other versions
CN115546030A (en)
Inventor
王中元
李娜
胡思成
罗来干
何政
梁超
韩镇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU
Priority to CN202211515828.1A
Publication of CN115546030A
Application granted
Publication of CN115546030B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046 Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a compressed video super-resolution method and system based on a twin super-resolution network, in which a low-resolution video to be processed is input into the twin super-resolution network to obtain a super-resolution video. During training, the invention compresses the original high-quality video in M-JPEG format, then feeds the compressed and uncompressed versions into the twin super-resolution network together. Finally, training of the twin super-resolution network is constrained from four aspects: the features extracted by the encoder from the two types of video should be as close as possible; the super-resolution results of the two types of video form a triplet together with the ground truth, and the elements of the triplet serve as positive samples; the negative sample is the super-resolution result of the residual between the compressed data and the conventionally degraded data. Through this contrastive learning scheme, the twin super-resolution network can learn, to the greatest extent, the feature parameters required for compressed video super-resolution and obtain a more precise super-resolution reconstruction result.

Description

Compressed video super-resolution method and system based on twin super-resolution network
Technical Field
The invention belongs to the technical field of artificial intelligence video processing, relates to a video super-resolution method and system, and particularly relates to a compressed video super-resolution method and system based on a twin super-resolution network.
Background
Super-resolution is a technology that uses intelligent methods such as deep learning to restore low-quality images and videos to high resolution. In recent years, with the development of deep learning techniques, deep-learning-based super-resolution has advanced at an unprecedented rate. At present, super-resolution technology is mainly applied to restoring old photos and videos, recovering important historical materials, sharpening surveillance video, and imaging with mobile phone cameras.
When generating paired high- and low-resolution training data, traditional super-resolution methods typically apply Gaussian blur to a video and then downsample it by a factor of four to produce the low-resolution data. Under this paradigm, super-resolution models have achieved increasingly strong performance in recent years. However, either applying models trained this way directly to compressed data, or simply retraining them on compressed video pairs, incurs a severe performance penalty. Video compression is a key step in scenarios such as video transmission, so super-resolution of compressed video is of great practical significance.
The video compression process loses high-frequency detail, which makes super-resolution reconstruction of compressed video more difficult. Contrastive learning is a typical unsupervised or self-supervised learning method; its main idea is to let the encoder learn features more fully by pulling similar samples closer together and pushing the compressed video sample away from negative samples, so that clustering boundaries become more distinct. As a learning strategy, contrastive learning has achieved good results in fields such as image super-resolution. In general, however, existing research simply uses a frame of some other video as the negative sample, which limits the benefit contrastive learning brings to super-resolution.
Disclosure of Invention
In order to solve this technical problem, the invention provides a compressed video super-resolution method and system based on a twin super-resolution network. Under a contrastive learning framework, an uncompressed video that has undergone traditional Gaussian blur and bicubic downsampling is fed into the twin super-resolution network together with a compressed video, prompting the network to learn a feature representation insensitive to compression distortion and yielding a twin super-resolution network well suited to compressed video.
The method adopts the technical scheme that: a compressed video super-resolution method based on a twin super-resolution network, in which a low-resolution video to be processed is input into the twin super-resolution network to obtain a super-resolution video;
the twin super-resolution network consists of two parallel super-resolution networks; the super-resolution network comprises an encoder network and an up-sampling module;
the encoder network consists of two types of convolution layers and several PFRB modules; the first type of convolution layer takes the center frame of three consecutive input frames as input, each frame being 64 × 64 × 3, and converts it into a Batchsize × 64 × 64 × 80 feature map; the second type of convolution layer takes the supplementary frames other than the center frame as input, with input size Batchsize × 64 × 64 × 6, and converts them into a Batchsize × 64 × 64 × 80 feature map; finally, the two features, together with the previous round's super-resolution result before upsampling, are fed into consecutive PFRB modules; the two encoder networks corresponding to the two inputs share weights; Batchsize is the number of samples per gradient-optimization batch;
in the PFRB module, the data is first fed into three 3 × 3 convolution layers, each with 80 input and 80 output channels, and the outputs are recorded as x1; the outputs are then concatenated and a 1 × 1 convolution layer reduces the total of 240 channels to 80; the result is concatenated with each of the three results in x1 to give x2; finally, the three results in x2 are fed into three convolution layers that compress the input channels back to 80, and the three outputs are added element-wise to the three initial inputs to obtain the final result; a LeakyReLU activation function is used between convolution layers to introduce nonlinearity;
the upsampling module takes input of size Batchsize × 3 × 64 × 64 × 80; it first passes through a 3 × 3 convolution layer that converts the 240 data channels into 80, then through another convolution layer that converts the 80 channels into 48; the data is upsampled by a factor of 2 using the PixelShuffle function in PyTorch, compressing the channel count to 12; the result passes through one more convolution layer, and finally PixelShuffle performs another ×2 upsampling; the output super-resolved data has size Batchsize × 3 × 256 × 256 × 3.
The technical scheme adopted by the system of the invention is as follows: a twin super-resolution network-based compressed video super-resolution system, comprising:
one or more processors;
a storage device to store one or more programs that when executed by the one or more processors cause the one or more processors to implement the twin super-resolution network-based compressed video super-resolution method.
The invention has the advantages and positive effects that:
(1) The invention defines the negative sample as the difference between the compressed data and the corresponding BD data, which works better than the simple negative-sample definitions of previous methods. Compared with feeding the network a single sample, supplying negative samples and positive samples together with the compressed video samples enables the network to better super-resolve compressed video.
(2) The method realizes a super-resolution model for compressed video and solves the problem that the performance of traditional models drops sharply on compressed video.
Drawings
FIG. 1 is a diagram of a twin super-resolution network structure according to an embodiment of the present invention;
FIG. 2 is a flow chart of the training of the twin super-resolution network according to the embodiment of the present invention;
FIG. 3 is a diagram of the architecture of a predecessor and successor of a twin super-resolution network according to an embodiment of the present invention;
FIG. 4 shows the subjective experimental results of the twin super-resolution network according to the embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
Existing convolutional neural network structures can achieve good results on BD degradation (Gaussian blur followed by downsampling), but once the scenario changes, model performance drops considerably. Feeding coupled triples of the original frame, the BD-degraded frame, and the compressed frame into a twin network for contrastive learning yields a super-resolution model that performs well on compressed video.
The invention provides a compressed video super-resolution method based on a twin super-resolution network, in which a low-resolution video to be processed is input into the twin super-resolution network of this embodiment to obtain a super-resolution video;
referring to fig. 1 and 4, the twin super-resolution network of the present embodiment is composed of two parallel super-resolution networks; the super-resolution network of the embodiment comprises an encoder network and an up-sampling module;
the encoder network of the present embodiment refers to a network structure except for the upsampling operation in the network shown in fig. 1, and is composed of two types of convolution layers and a plurality of PFRB modules; the first type of convolution layer takes the central frame of the input continuous three-frame image as input, the size of each frame is 64 multiplied by 3, the convolution layer converts the size of the convolution layer into a Batchsize multiplied by 64 multiplied by 80 feature map, the second type of convolution layer takes the supplementary frames except the central frame as input, the size of the input data is Batchsize multiplied by 64 multiplied by 6, the convolution layer converts the data into the Batchsize multiplied by 64 multiplied by 80 feature map, and finally, the two features are input into a continuous PFRB module together with the super-division result before the previous round of up-sampling; two encoder networks corresponding to the two inputs share a weight; wherein, batchsize is the batch sample number of gradient optimization, and this embodiment takes value 16.
The PFRB module of this embodiment adopts the basic residual framework of ResNet: the data is first fed into three 3 × 3 convolution layers, each with 80 input and 80 output channels, and the outputs are recorded as x1; the outputs are then concatenated and a 1 × 1 convolution layer reduces the total of 240 channels to 80; the result is concatenated with each of the three results in x1 to give x2; finally, the three results in x2 are fed into three convolution layers that compress the input channels back to 80, and the three outputs are added element-wise to the three initial inputs to obtain the final result; a LeakyReLU activation function is used between convolution layers to introduce nonlinearity and strengthen the expressive power of the model; a sketch of this block follows below.
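As a concrete illustration of the preceding two paragraphs, the following PyTorch sketch implements the PFRB and an encoder that wires the two input convolutions to a stack of PFRB modules. It is a minimal reading of the text, not the inventors' code: the layer names, the LeakyReLU slope of 0.1, the number of PFRB modules, and the channel-first (B, C, H, W) layout are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PFRB(nn.Module):
    """Progressive fusion residual block, sketched from the description above."""
    def __init__(self, channels=80):
        super().__init__()
        # three parallel 3x3 convs, 80 -> 80 channels each; their outputs form x1
        self.conv1 = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3)])
        # 1x1 conv fusing the concatenated 240 channels back down to 80
        self.fuse = nn.Conv2d(channels * 3, channels, 1)
        # per-branch convs compressing [x1_i, fused] (160 channels) back to 80
        self.conv2 = nn.ModuleList([nn.Conv2d(channels * 2, channels, 3, padding=1) for _ in range(3)])
        self.act = nn.LeakyReLU(0.1, inplace=True)  # slope 0.1 is an assumption

    def forward(self, feats):                        # feats: 3 tensors (B, 80, H, W)
        x1 = [self.act(c(f)) for c, f in zip(self.conv1, feats)]
        fused = self.act(self.fuse(torch.cat(x1, dim=1)))
        x2 = [torch.cat([f, fused], dim=1) for f in x1]
        out = [self.act(c(f)) for c, f in zip(self.conv2, x2)]
        return [o + f for o, f in zip(out, feats)]   # residual add to block inputs

class Encoder(nn.Module):
    """Encoder sketch: center-frame conv, supplementary-frame conv, PFRB stack."""
    def __init__(self, channels=80, n_blocks=5):     # n_blocks is an assumption
        super().__init__()
        self.center_conv = nn.Conv2d(3, channels, 3, padding=1)  # (B,3,64,64) -> (B,80,64,64)
        self.supp_conv = nn.Conv2d(6, channels, 3, padding=1)    # two neighbor frames stacked
        self.blocks = nn.ModuleList([PFRB(channels) for _ in range(n_blocks)])

    def forward(self, center, supp, prev_hidden):
        # prev_hidden: previous round's pre-upsampling SR feature, (B, 80, 64, 64)
        feats = [self.center_conv(center), self.supp_conv(supp), prev_hidden]
        for blk in self.blocks:
            feats = blk(feats)
        return feats                                 # three feature maps, (B, 80, 64, 64)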
the up-sampling module of this embodiment is to up-sample the input data to achieve the purpose of overdividing. The size of the input data of the up-sampling module is Batchsize multiplied by 3 multiplied by 64 multiplied by 80, firstly, the data channel number 240 is converted into 80 through a 3 multiplied by 3 convolutional layer, the data channel number is continuously input into the convolutional layer, and the channel number 80 is converted into 48; upsampling the data with the scale of 2 by using a pixelshuffle function in the pytorech, and compressing the number of channels to 12; the obtained result is processed by a convolution layer, and finally, a pixelschuffle function is used for carrying out primary up-sampling with the scale of 2; in general, up-sampling with a scale of 4 is performed, the super-resolution process is realized, and the size of the outputted super-resolution data is batchsize × 3 × 256 × 256 × 3, and the batchsize is set to 16 in this embodiment.
Referring to fig. 2, the twin super-resolution network of the present embodiment is a trained twin super-resolution network; the training process comprises the following steps:
step 1: performing M-JPEG compression on a video in an original data set and bicubic downsampling processing to obtain a low-resolution compressed image which is recorded as a compressed video aiming at the original data set comprising a plurality of high-resolution continuous frames; performing Gaussian blur and bicubic downsampling processing on the video in the original data set to obtain a traditional low-quality image, and recording the image as a BD video;
the original data set name used in the present embodiment is MM522, in which 522 videos of various types are contained; in this embodiment, a video is read from the selected original data set by using a function in a python scroll library, and is written in by using an M-JEPG format to obtain a compressed video, and finally bicubic downsampling is performed on the video by using an ims function in Matlab.
Step 2: use the encoder network to perform initial feature extraction on the input video; for 5-dimensional input video data the encoder network outputs a 4-dimensional feature, which is processed by a LeakyReLU activation function; the main purposes are to introduce a nonlinear functional relationship and to constrain the loss between the two extracted features;
Step 3: feed the extracted features into the upsampling module to continue the super-resolution and output a video whose height and width are four times those of the input; constrain each of the two outputs against the ground truth so that they are as close as possible, which serves as the distance between the compressed video sample and the positive sample; record the distance between the super-resolution result of the residual (the difference between the compressed video and the BD video) and the compressed video's super-resolution result as the distance between the compressed video sample and the negative sample, which should be as large as possible within a set range;
in this embodiment, the super-resolution result of the BD video and the compressed video is obtained, and the BD video and the compressed video are constrained with the true value to obtain the lossdis positive
Figure SMS_1
Wherein the content of the first and second substances,y BD refers to the result after the super-resolution of BD video,y compress refers to the result after the super-resolution of the compressed video,gtit refers to the actual value of the light,εrepresents a constant value, set to 1e-4 in the experiment, which acts to maintain the loss value stable; will be provided withdis positive Noting the distance between the compressed video sample and the positive sample;
considering the negative sample and the distance between the negative sample and the compressed video sample, subtracting the compressed video and the BD compressed video to obtain a negative sample, and recording the negative sample asx nega (ii) a Calculating distance between negative sample and compressed video sampledis negative
Figure SMS_2
Wherein M is: (x nega )、x nega Respectively representing the over-resolution result obtained after inputting the negative sample into the model and the negative sample itself, as used herein
Figure SMS_3
In the form of (1).
Step 4: calculate the total loss and train the twin super-resolution network of this embodiment by back-propagating gradients; execute steps 1-4 in a loop until the twin super-resolution network converges, obtaining the trained twin super-resolution network.
From the distance dis_positive between the compressed video sample and the positive sample and the distance dis_negative between the compressed video sample and the negative sample computed in step 3, this embodiment calculates the final loss:

$$Loss = \max\left(dis_{positive} - dis_{negative} + margin,\; 0\right)$$

where margin is a hyper-parameter that sets the interval between the two feature distances, set to 0.2 in this embodiment;
after the loss is calculated, the twin super-resolution network of the embodiment is subjected to back propagation and optimization by using an Adam optimizer.
The super-resolution network of this embodiment is an operating framework combining a predecessor network and a successor network whose structures are identical, namely the combination of an encoder network and an upsampling network. The predecessor/successor notion describes how data moves through the model, as shown in fig. 3: the predecessor network extracts, in advance, information from all frames that come after the frame to be restored in the time sequence and provides it as a reference for the successor's final super-resolution. In the figure, HTs denotes intermediate results produced by the successor and HTp denotes intermediate results produced by the predecessor; predecessor and successor share the same model structure, and the diagram mainly depicts how input data is processed in the network.
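The predecessor/successor data flow of fig. 3 can be sketched as follows; the call signatures and the reverse-order predecessor scan are assumptions made only to show how the predecessor's intermediate states HTp are gathered first and then consumed, together with the successor's own states HTs, during frame-by-frame restoration.

def run_twin_pass(predecessor, successor, frames):
    """Sketch of fig. 3's data flow; frames is a list of LR frame tensors."""
    # predecessor pass: scan the sequence and collect hidden states HTp, so that
    # information from frames after the one being restored is available in advance
    ht_p, states_p = None, []
    for frame in reversed(frames):                # future -> past (an assumption)
        _, ht_p = predecessor(frame, ht_p)
        states_p.insert(0, ht_p)

    # successor pass: restore frames in order, fusing HTs with the stored HTp
    ht_s, outputs = None, []
    for frame, ref in zip(frames, states_p):
        sr, ht_s = successor(frame, ht_s, ref)    # sr: super-resolved frame
        outputs.append(sr)
    return outputs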
This embodiment trains with the above strategy on the MM522 training set and tests on the VID4 test set; the metrics are PSNR/SSIM, with the following results:

Method                               Calendar        City            Foliage         Walk            Avg
Compressed video super-resolution    22.12/0.6949    25.55/0.6739    24.37/0.6186    27.85/0.8376    24.97/0.7062
In generating the training data set, this experiment differs from traditional models trained on Gaussian-blurred samples in that the original high-quality video is compressed in M-JPEG format. The compressed and uncompressed data are then both input into the twin network, designed as a combination of an encoder and an upsampling module. Finally, training of the model is constrained from four aspects: the features extracted by the encoder from the two types of video should be as close as possible; the super-resolution results of the two types of video form a triplet together with the ground truth, and the elements of the triplet serve as positive samples; the negative sample is the super-resolution result of the residual between the compressed data and the conventionally degraded data. The subjective result on the Calendar video is shown in fig. 4; it can be seen that, through contrastive learning, the model learns to the greatest extent the feature parameters required for compressed video and obtains a finer super-resolution reconstruction result.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (5)

1. A compressed video super-resolution method based on a twin super-resolution network is characterized in that: inputting a low-resolution video to be processed into the twin super-resolution network to obtain a super-resolution video;
the twin super-resolution network consists of two parallel super-resolution networks; the super-resolution network comprises an encoder network and an up-sampling module;
the encoder network consists of two types of convolution layers and several PFRB modules; the first type of convolution layer takes the center frame of three consecutive input frames as input, each frame being 64 × 64 × 3, and converts it into a Batchsize × 64 × 64 × 80 feature map; the second type of convolution layer takes the supplementary frames other than the center frame as input, with input size Batchsize × 64 × 64 × 6, and converts them into a Batchsize × 64 × 64 × 80 feature map; finally, the two features, together with the previous round's super-resolution result before upsampling, are fed into consecutive PFRB modules; the two encoder networks corresponding to the two inputs share weights; Batchsize is the number of samples per gradient-optimization batch;
in the PFRB module, the data is first fed into three 3 × 3 convolution layers, each with 80 input and 80 output channels, and the outputs are recorded as x1; the outputs are then concatenated and a 1 × 1 convolution layer reduces the total of 240 channels to 80; the result is concatenated with each of the three results in x1 to give x2; finally, the three results in x2 are fed into three convolution layers that compress the input channels back to 80, and the three outputs are added element-wise to the three initial inputs to obtain the final result; a LeakyReLU activation function is used between convolution layers to introduce nonlinearity;
the upsampling module takes input of size Batchsize × 3 × 64 × 64 × 80; it first passes through a 3 × 3 convolution layer that converts the 240 data channels into 80, then through another convolution layer that converts the 80 channels into 48; the data is upsampled by a factor of 2 using the PixelShuffle function in PyTorch, compressing the channel count to 12; the result passes through one more convolution layer, and finally PixelShuffle performs another ×2 upsampling; the output super-resolved data has size Batchsize × 3 × 256 × 256 × 3.
2. The compressed video super-resolution method based on the twin super-resolution network of claim 1, wherein: the twin super-resolution network is a trained twin super-resolution network; the training process comprises the following steps:
step 1: performing M-JPEG compression on a video in an original data set and bicubic downsampling processing to obtain a low-resolution compressed image which is recorded as a compressed video aiming at the original data set comprising a plurality of high-resolution continuous frames; performing Gaussian blur and bicubic downsampling processing on the video in the original data set to obtain a traditional low-quality image, and recording the image as a BD video;
step 2: performing initial feature extraction on an input video by using an encoder network, outputting a 4-dimensional feature by using the encoder network for input 5-dimensional video data, processing the 4-dimensional feature by using a LeakyReLU activation function, introducing a nonlinear functional relation, and constraining loss between the two extracted features;
step 3: feeding the extracted features into the upsampling module to continue the super-resolution and outputting a video whose height and width are four times those of the input; constraining each of the two outputs against the ground truth so that they are as close as possible, which serves as the distance between the compressed video sample and the positive sample; recording the distance between the super-resolution result of the residual (the difference between the compressed video and the BD video) and the compressed video's super-resolution result as the distance between the compressed video sample and the negative sample, which should be as large as possible within a set range;
step 4: calculating the total loss and training the twin super-resolution network by back-propagating gradients; executing steps 1-4 in a loop until the twin super-resolution network converges, obtaining the trained twin super-resolution network.
3. The compressed video super-resolution method based on the twin super-resolution network of claim 2, wherein: in step 3, the super-resolution results of the BD video and the compressed video are obtained and constrained against the ground truth to give the loss dis_positive:

$$dis_{positive} = \sqrt{\left\| y_{BD} - gt \right\|^{2} + \varepsilon} + \sqrt{\left\| y_{compress} - gt \right\|^{2} + \varepsilon}$$

where y_BD is the super-resolution result of the BD video, y_compress is the super-resolution result of the compressed video, gt is the ground truth, and ε is a constant; dis_positive is recorded as the distance between the compressed video sample and the positive sample;

the negative sample is obtained as the difference between the compressed video and the BD video and is recorded as x_nega; the distance dis_negative between the negative sample and the compressed video sample is

$$dis_{negative} = \sqrt{\left\| y_{compress} - M(x_{nega}) \right\|^{2} + \varepsilon}$$

where M(x_nega) and x_nega denote, respectively, the super-resolution result obtained by feeding the negative sample into the model and the negative sample itself.
4. The compressed video super-resolution method based on the twin super-resolution network of claim 2, wherein: in step 4, the final loss is calculated from the distance dis_positive between the compressed video sample and the positive sample and the distance dis_negative between the compressed video sample and the negative sample obtained in step 3:

$$Loss = \max\left(dis_{positive} - dis_{negative} + margin,\; 0\right)$$

where margin is a hyper-parameter that sets the interval between the two feature distances;
and after calculating the loss, performing back propagation and optimization on the twin super-resolution network by using an Adam optimizer.
5. A compressed video super-resolution system based on a twin super-resolution network is characterized by comprising:
one or more processors;
a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the twin super resolution network based compressed video super resolution method of any of claims 1 to 4.
CN202211515828.1A 2022-11-30 2022-11-30 Compressed video super-resolution method and system based on twin super-resolution network Active CN115546030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211515828.1A CN115546030B (en) 2022-11-30 2022-11-30 Compressed video super-resolution method and system based on twin super-resolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211515828.1A CN115546030B (en) 2022-11-30 2022-11-30 Compressed video super-resolution method and system based on twin super-resolution network

Publications (2)

Publication Number Publication Date
CN115546030A CN115546030A (en) 2022-12-30
CN115546030B true CN115546030B (en) 2023-04-07

Family

ID=84722535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211515828.1A Active CN115546030B (en) 2022-11-30 2022-11-30 Compressed video super-resolution method and system based on twin super-resolution network

Country Status (1)

Country Link
CN (1) CN115546030B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058002B (en) * 2023-10-12 2024-02-02 深圳云天畅想信息科技有限公司 Video frame super-resolution reconstruction method and device and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109118431A (en) * 2018-09-05 2019-01-01 武汉大学 A kind of video super-resolution method for reconstructing based on more memories and losses by mixture
CN114170088A (en) * 2021-12-15 2022-03-11 中山大学 Relational reinforcement learning system and method based on graph structure data
CN115409716A (en) * 2022-11-01 2022-11-29 杭州网易智企科技有限公司 Video processing method, device, storage medium and equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345449B (en) * 2018-07-17 2020-11-10 西安交通大学 Image super-resolution and non-uniform blur removing method based on fusion network
KR20210042588A (en) * 2019-10-10 2021-04-20 엘지전자 주식회사 Method and apparatus for compressing or restoring image
CN114730372A (en) * 2019-11-27 2022-07-08 Oppo广东移动通信有限公司 Method and apparatus for stylizing video, and storage medium
CN111833246B (en) * 2020-06-02 2022-07-08 天津大学 Single-frame image super-resolution method based on attention cascade network
US20220021887A1 (en) * 2020-07-14 2022-01-20 Wisconsin Alumni Research Foundation Apparatus for Bandwidth Efficient Video Communication Using Machine Learning Identified Objects Of Interest
WO2022057837A1 (en) * 2020-09-16 2022-03-24 广州虎牙科技有限公司 Image processing method and apparatus, portrait super-resolution reconstruction method and apparatus, and portrait super-resolution reconstruction model training method and apparatus, electronic device, and storage medium
CN112183675B (en) * 2020-11-10 2023-09-26 武汉工程大学 Tracking method for low-resolution target based on twin network
CN113538248A (en) * 2021-08-19 2021-10-22 南京航空航天大学 Gamma photon image super-resolution image enhancement method based on digital twinning
CN115052187B (en) * 2022-04-26 2024-05-03 复旦大学 Super-resolution live broadcast system based on online training


Also Published As

Publication number Publication date
CN115546030A (en) 2022-12-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant