CN115115516B - Real world video super-resolution construction method based on Raw domain - Google Patents

Real world video super-resolution construction method based on Raw domain

Info

Publication number
CN115115516B
CN115115516B (application CN202210733861.5A)
Authority
CN
China
Prior art keywords
frame
raw
format
resolution
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210733861.5A
Other languages
Chinese (zh)
Other versions
CN115115516A (en)
Inventor
岳焕景 (Huanjing Yue)
张芝铭 (Zhiming Zhang)
杨敬钰 (Jingyu Yang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210733861.5A
Publication of CN115115516A
Application granted
Publication of CN115115516B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/35 Determination of transform parameters for the alignment of images, i.e. image registration using statistical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N9/00 Details of colour television systems
    • H04N9/64 Circuits for processing colour signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Television Systems (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention discloses a method for constructing real-world video super-resolution based on the Raw domain, and relates to the technical field of video signal processing. The method comprises the following steps: S1, establishing a real-world Raw video super-resolution data set; S2, designing a real-world Raw video super-resolution algorithm based on S1; S3, training a model; S4, inputting the low-resolution Raw video sequences in the test set into the model to obtain the corresponding super-resolution output results. The invention constructs the first real-world VSR data set with three magnification factors in the Raw and sRGB domains, providing a benchmark data set for the training and evaluation of real Raw VSR methods; by utilizing the proposed joint alignment and interaction modules and the temporal and channel fusion modules, the invention raises real LR video super-resolution performance to a new height.

Description

Real world video super-resolution construction method based on Raw domain
Technical Field
The invention belongs to the technical field of video signal processing, and relates to a method for constructing real-world video super-resolution based on a Raw domain.
Background
Capturing video with a short-focus lens expands the viewing angle at the cost of resolution, while capturing with a long-focus lens increases resolution at the cost of viewing angle; video super-resolution (VSR) is an efficient way to acquire both wide-angle and high-resolution (HR) video; video super-resolution reconstructs high-resolution video from low-resolution (LR) inputs by exploring the spatial and temporal correlations of the input sequences; in recent years, video super-resolution has shifted from traditional model-driven approaches to deep-learning-based ones.
The performance of these deep-learning-based SR methods depends largely on the training data set; considering that synthetic LR-HR data sets, such as DIV2K and REDS, cannot represent the degradation model between truly captured LR and HR images, many real SR data sets have been constructed to improve real-world SR performance; however, most of these data sets target static LR-HR images, such as RealSR and ImagePairs. Recently, researchers proposed the first real-world VSR data set, captured with the iPhone 11 Pro Max multi-camera system; however, the parallax between the LR and HR cameras increases the difficulty of alignment, and due to the limited focal lengths of cell-phone cameras, the data set contains only 2x LR-HR sequence pairs.
On the other hand, there is a trend of using Raw images for real-scene image (video) restoration, such as low-light enhancement, denoising, deblurring, and super-resolution; the main reason is that a Raw image has a relatively wide bit depth (12 or 14 bits), i.e., it contains the most primitive information, and its intensity is linear with illumination; however, little effort has been spent on exploring Raw video super-resolution; researchers have synthesized LR Raw frames by downsampling captured HR Raw frames, proposing a Raw video super-resolution data set; nevertheless, there is still a gap between synthesized LR Raw frames and truly captured ones, which prevents SR models trained on synthesized data from generalizing well to real scenes.
Disclosure of Invention
The invention solves the technical problems:
(I) The invention aims to establish a real-world Raw video super-resolution data set and, on this basis, to provide a video super-resolution algorithm adapted to Raw data.
(II) In order to achieve the above purpose, the invention adopts the following technical scheme:
The method for constructing real-world video super-resolution based on the Raw domain comprises the following steps:
S1, establishing a real-world Raw video super-resolution data set: the data set establishment process mainly comprises the following 3 steps:
S101, hardware design: the incident light is split by a beam splitter into a reflected beam and a transmitted beam with a brightness ratio of 1:1; LR-HR frame pairs of different scales are captured using DSLR cameras with zoom lenses; a 3D model box is designed and printed to fix the beam splitter; the DSLR cameras and the beam-splitter box are placed on an optical plate, with tripods fixed below them;
S102, data acquisition: acquiring Raw video in the MLV format, and then processing the MLV Raw video with the MlRawViewer software to obtain corresponding sRGB frames and Raw frames in the DNG format;
S103, data processing: using a coarse-to-fine alignment strategy to generate aligned LR-HR frames, including sRGB frame pairs $\{I_{sRGB}^{LR}, I_{sRGB}^{HR}\}$ and Raw frame pairs $\{I_{Raw}^{LR}, I_{Raw}^{HR}\}$;
S2, based on the frames after the S1 data processing, taking the LR Raw frame and HR sRGB frame pair $\{I_{Raw}^{LR}, I_{sRGB}^{HR}\}$ as the training pair, designing a real-world Raw video super-resolution algorithm;
S3, training a model: constructing a model based on the algorithm designed in S2, training it on the deep learning framework PyTorch, iterating 300k times over the whole data set, then reducing the learning rate to 0.00001 and continuing to iterate until the loss converges, yielding the final model;
S4, inputting the low-resolution Raw video sequences in the test set into the model to obtain the corresponding super-resolution output results.
Preferably, the steps of performing data processing on the sRGB frames using the alignment strategy in S103 are as follows:
S1031, firstly, estimating a homography matrix H between the up-sampled LR and HR frames using SIFT keypoints selected by the RANSAC algorithm;
S1032, then warping the HR frame $I_{sRGB}^{HR}$ to coarsely crop out the corresponding region in the LR frame that matches the HR frame;
S1033, performing pixel-by-pixel alignment on the matched regions using the traditional optical flow estimation method DeepFlow;
S1034, finally, cropping the center region to eliminate alignment artifacts around the boundary, generating aligned LR-HR frames in the sRGB domain, denoted by $\{I_{sRGB}^{LR}, I_{sRGB}^{HR}\}$ (a sketch of this pipeline is given after these steps).
Preferably, the Raw frames should pass through the same alignment strategy as the sRGB frames; however, directly applying the global and local alignment would destroy the Bayer pattern of the Raw input; when performing data processing on the Raw frames with the alignment strategy, the original frame is therefore first reorganized into the RGGB sub-format, whose size is half that of the sRGB frame; consequently, the H matrix calculated from the sRGB frames needs to be modified by rescaling its translation parameters by a factor of 0.5; the DeepFlow flow field is rescaled in the same manner, and in this way the Raw frame pairs $\{I_{Raw}^{LR}, I_{Raw}^{HR}\}$ are generated.
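A hedged sketch of this Raw-side processing, assuming an RGGB mosaic; function names are illustrative, and the comment notes the general conjugation rule for completeness.

```python
import numpy as np

def pack_rggb(bayer):
    """H x W Bayer (RGGB) mosaic -> H/2 x W/2 x 4 sub-format."""
    return np.stack([bayer[0::2, 0::2],    # R
                     bayer[0::2, 1::2],    # G1
                     bayer[1::2, 0::2],    # G2
                     bayer[1::2, 1::2]],   # B
                    axis=-1)

def rescale_homography(H, rate=0.5):
    """Readjust the translation parameters of a 3x3 homography by `rate`.
    (For a general homography one would conjugate by S = diag(rate, rate, 1),
    which also rescales the perspective row; the text above only mentions
    the translation terms.)"""
    H2 = H.copy()
    H2[0, 2] *= rate
    H2[1, 2] *= rate
    return H2
```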
Preferably, the real world Raw video super-resolution algorithm flow described in S2 mainly includes the following steps:
S201, dual-branch strategy and feature extraction: to fully utilize the information of the Raw data, the input LR consecutive frames $\{I_{Raw,i}^{LR}\}$ are fed into two branches of the network in different forms; the Bayer-format branch directly takes the Raw consecutive frames themselves as input, while the sub-frame-format branch recombines them into the RGGB sub-format to form a new sequence as input; the input of the Bayer-format branch is denoted $X^{b}$, and the input of the sub-frame branch is denoted $X^{s}$, whose spatial size is half that of $X^{b}$ and whose channel number is 4 times larger; the Bayer-format branch keeps the original order of the Raw pixels, which benefits spatial reconstruction; although the sub-frame-format branch cannot preserve the original pixel order, it can exploit far-neighbor correlations to generate details; the two inputs then pass through separate feature extraction modules, each composed of five residual blocks;
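A minimal PyTorch sketch of the two branch inputs and the five-residual-block feature extractors follows; the 64-channel width is an assumption.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.LeakyReLU(0.1, inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)          # residual connection

class FeatureExtractor(nn.Module):
    def __init__(self, in_ch, ch=64, n_blocks=5):
        super().__init__()
        self.head = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])

    def forward(self, x):
        return self.blocks(self.head(x))

# Bayer-format branch: the mosaic itself, 1 channel at full size (H x W);
# sub-frame branch: the packed RGGB tensor, 4 channels at half size.
extract_bayer = FeatureExtractor(in_ch=1)
extract_subframe = FeatureExtractor(in_ch=4)
```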
S202, joint alignment: due to the temporal offsets between adjacent frames, the adjacent frames need to be warped toward the center frame; the alignment follows a multi-level cascade alignment strategy, i.e., the alignment offsets are calculated in the sub-frame-format branch, and the calculated offsets are then directly copied to the Bayer-format branch for its alignment, so that the two branches are aligned jointly; the features $F_{t+i}^{s}$ and $F_{t}^{s}$ in the sub-frame-format branch are downsampled L-1 times by convolution to form an L-level pyramid; the pyramid features in the Bayer-format branch are constructed in the same way; the offset of the l-th level is calculated from the aggregated features of the l-th level and the upsampled offset of the (l+1)-th level:

$\Delta p_{t+i}^{s,l} = g([F_{t+i}^{s,l}, F_{t}^{s,l}], (\Delta p_{t+i}^{s,l+1})^{\uparrow 2})$

since the input of the sub-frame-format branch is actually a downsampled version of the Bayer-format branch, the offset values of the Bayer-format branch should be twice those of the sub-frame-format branch; thus, the offset $\Delta p_{t+i}^{s,l}$ of the sub-frame-format branch is upsampled by a factor of 2 and its values are magnified by a factor of 2 to obtain the offset of the Bayer-format branch at the l-th level:

$\Delta p_{t+i}^{b,l} = 2(\Delta p_{t+i}^{s,l})^{\uparrow 2}$

given the offsets, the aligned features of the two branches can be expressed as:

$\tilde{F}_{t+i}^{s,l} = g(\mathrm{Dconv}(F_{t+i}^{s,l}, \Delta p_{t+i}^{s,l}), (\tilde{F}_{t+i}^{s,l+1})^{\uparrow 2})$

$\tilde{F}_{t+i}^{b,l} = g(\mathrm{Dconv}(F_{t+i}^{b,l}, \Delta p_{t+i}^{b,l}), (\tilde{F}_{t+i}^{b,l+1})^{\uparrow 2})$
where g represents the mapping function implemented by several convolution layers and Dconv represents the deformable convolution; the Dconv of the two branches share the same weights at the respective level;
after the L levels are aligned, the offset $\Delta p_{t+i}^{s,0}$ calculated between the aligned feature $\tilde{F}_{t+i}^{s,1}$ and the center feature $F_{t}^{s,1}$ is further used to refine $\tilde{F}_{t+i}^{s,1}$ and $\tilde{F}_{t+i}^{b,1}$, generating the final alignment results $\hat{F}_{t+i}^{s}$ and $\hat{F}_{t+i}^{b}$ for the adjacent features in the two branches;
S203, an interaction module: the Bayer format branch features are downsampled by 3 x 3 convolution (stride=2) and the LeakyRelu layer, and these downsampled features are aggregated with features in the subframe format branches; similarly, sub-frame format branching features are up-sampled by a Pixelshuffle and then aggregated with features in the Bayer format branching;
S204, temporal fusion: remote features are aggregated with a non-local temporal attention module to enhance the feature representation along the temporal dimension; the features are then fused together using temporal-spatial attention (TSA) based fusion;
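The temporal-attention part can be sketched as follows, in the spirit of TSA: each frame's features are weighted by embedded similarity to the center frame before aggregation; the non-local module and the spatial-attention half are omitted, and the embedding layers are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, ch=64, n_frames=5):
        super().__init__()
        self.emb_ctr = nn.Conv2d(ch, ch, 3, padding=1)
        self.emb_nbr = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Conv2d(n_frames * ch, ch, 1)

    def forward(self, feats):                     # feats: B x T x C x H x W
        t = feats.shape[1]
        ctr = self.emb_ctr(feats[:, t // 2])
        weighted = []
        for i in range(t):
            nbr = self.emb_nbr(feats[:, i])
            sim = torch.sigmoid((ctr * nbr).sum(dim=1, keepdim=True))
            weighted.append(feats[:, i] * sim)    # temporal weighting
        return self.fuse(torch.cat(weighted, dim=1))
```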
S205, channel fusion: channel fusion is used to combine the features of the two branches, because the temporally fused features $F^{b}$ and $F^{s}$ of the two branches may contribute differently to the final SR reconstruction; the two branch features are fused via a channel-wise weighted average using selective kernel fusion (SKF);
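A hedged sketch of the selective-kernel style channel fusion: per-channel softmax weights decide each branch's contribution; the reduction ratio of 4 is an assumption.

```python
import torch
import torch.nn as nn

class SKFusion(nn.Module):
    def __init__(self, ch=64, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(ch, ch // reduction, 1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv2d(ch // reduction, 2 * ch, 1))

    def forward(self, f_b, f_s):                   # both: B x C x H x W
        b, c = f_b.shape[:2]
        attn = self.mlp(f_b + f_s).view(b, 2, c, 1, 1)
        w = torch.softmax(attn, dim=1)             # weights sum to 1
        return w[:, 0] * f_b + w[:, 1] * f_s       # channel-weighted average
```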
S206, reconstruction and upsampling: the fused feature $F^{fuse}$ is input into a reconstruction module implemented by 10 ResNet blocks for SR reconstruction; after reconstruction, the result is upsampled with a PixelShuffle layer, and a convolution layer then generates the three-channel output; meanwhile, the module uses two long skip connections: one takes the Bayer-format LR input, processes it with a convolution layer, and upsamples it to a three-channel output through PixelShuffle; the other takes the LR sub-frame-format input and, because its spatial size is half that of the original input, upsamples it by an extra factor of 2; the three outputs are added to generate the final HR result $I^{SR}$;
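A sketch of this reconstruction and upsampling head with the two long skip connections follows, assuming a 4x magnification factor and 64 feature channels.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.LeakyReLU(0.1, inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Reconstruct(nn.Module):
    def __init__(self, ch=64, scale=4, n_blocks=10):
        super().__init__()
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.up = nn.Sequential(nn.Conv2d(ch, 3 * scale ** 2, 3, padding=1),
                                nn.PixelShuffle(scale))
        # Long skip 1: Bayer-format LR -> conv -> PixelShuffle, 3 channels.
        self.skip_bayer = nn.Sequential(
            nn.Conv2d(1, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))
        # Long skip 2: sub-frame LR is half-size, so upsample by an extra 2x.
        self.skip_sub = nn.Sequential(
            nn.Conv2d(4, 3 * (2 * scale) ** 2, 3, padding=1),
            nn.PixelShuffle(2 * scale))

    def forward(self, fused, bayer_lr, sub_lr):
        out = self.up(self.body(fused))            # SR from fused features
        return out + self.skip_bayer(bayer_lr) + self.skip_sub(sub_lr)
```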
S207, color correction and loss function: actual shooting data
Figure GDA0004169932550000072
And->
Figure GDA0004169932550000073
There is a difference in color and brightness, and directly exploiting the pixel loss between output and HR may result in the network optimizing color and brightness corrections without concern for the task of SR; to solve this problem, color correction is used before the loss calculation, i.e. channel-based colors are used for the RGB channels, respectivelyColor correction, instead of computing a 3×3 color correction matrix to correct them simultaneously: />
Figure GDA0004169932550000074
Wherein alpha is c Is the scaling factor of channel c by minimizing
Figure GDA0004169932550000075
And HR downsampled version ∈ ->
Figure GDA0004169932550000076
The least squares loss between corresponding pixels.
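This per-channel correction has a closed-form least-squares solution, sketched below; the 4x average-pool downsampling is an assumption.

```python
import torch
import torch.nn.functional as F

def color_correct(sr, hr, down=4, eps=1e-8):
    """sr, hr: B x 3 x H x W; returns alpha_c * sr, one alpha per channel."""
    sr_d, hr_d = F.avg_pool2d(sr, down), F.avg_pool2d(hr, down)
    # argmin_a ||a * sr_d - hr_d||^2  =>  a = <sr_d, hr_d> / <sr_d, sr_d>
    num = (sr_d * hr_d).sum(dim=(2, 3), keepdim=True)
    den = (sr_d * sr_d).sum(dim=(2, 3), keepdim=True).clamp_min(eps)
    return (num / den) * sr
```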
S208, optimizing the network using the Charbonnier loss between the corrected output and HR.
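The Charbonnier loss itself is standard; a minimal sketch, where epsilon is an assumed typical value:

```python
import torch

def charbonnier_loss(pred, target, eps=1e-6):
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```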
The beneficial effects of the invention include the following three points:
(1) The invention constructs the first real-world VSR data set with three magnification factors in the Raw and sRGB domains, which provides a benchmark data set for the training and evaluation of real Raw VSR methods.
(2) Based on the real-world Raw video super-resolution data set obtained in S1, the invention provides a Real-RawVSR method that processes the Raw data input in two branches, one for the Bayer-format input and the other for the sub-frame-format input; by utilizing the proposed joint alignment, interaction, and temporal and channel fusion modules, the complementary information of the two branches is well explored, and real LR video super-resolution performance is raised to a new height.
(3) Experiments show that the proposed method outperforms the current mainstream Raw and sRGB VSR methods; it is hoped that the research and exploration in this invention will inspire more studies on Raw-domain video super-resolution methods.
Drawings
FIG. 1 shows the hardware platform and the data processing flow of the Raw-domain based real-world video super-resolution construction method adopted by the invention;
FIG. 2 is the algorithm flowchart of the Raw-domain based real-world video super-resolution construction method adopted by the invention;
FIG. 3 is a structural diagram of the joint alignment module in the Raw-domain based real-world video super-resolution construction method adopted by the invention;
FIG. 4 is a table comparing the results of the algorithm adopted by the invention with other video super-resolution algorithms on the test set.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1:
Referring to FIG. 1, the method for constructing real-world video super-resolution based on the Raw domain comprises the following steps:
S1, establishing a real-world Raw video super-resolution data set: the data set establishment process mainly comprises the following 3 steps:
S101, hardware design: in order to capture LR-HR frame pairs of different scales, DSLR cameras with 18-135 mm zoom lenses are used instead of cell-phone cameras; to avoid the influence of natural light from other directions, the beam splitter is fixed by a designed and 3D-printed model box, so that the two cameras receive natural light from the same viewpoint; the size of the beam splitter is 150×150×1 mm³, sufficient to cover the camera lenses; the cameras and the beam-splitter box are placed on an optical plate, and tripods are fixed below them to improve stability;
S102, data acquisition: the invention uses two Canon 60D cameras upgraded with the third-party software Magic Lantern to acquire Raw video in the Magic Lantern Video (MLV) format; to keep the cameras synchronized, an infrared remote controller sends signals to both cameras to trigger shooting simultaneously; during shooting, the ISO of the two cameras is kept in the range of 100 to 1600 to avoid severe noise, and the exposure time ranges from 1/400 s to 1/31 s to capture both slow and fast motion; all other settings are left at their default values to simulate real capture scenarios; the MLV video is then processed with the MlRawViewer software to obtain corresponding sRGB frames and Raw frames in the DNG format; for each scene, a 6-second short video is captured at a frame rate of 25 FPS, i.e., each video contains approximately 150 frames in both Raw and sRGB formats;
S103, data processing: due to the presence of lens aberrations, misalignment still exists between the LR-HR pairs; the invention therefore utilizes a coarse-to-fine alignment strategy to generate aligned LR-HR frames, including sRGB frame pairs $\{I_{sRGB}^{LR}, I_{sRGB}^{HR}\}$ and Raw frame pairs $\{I_{Raw}^{LR}, I_{Raw}^{HR}\}$;
The steps of performing data processing on the sRGB frames using the alignment strategy in S103 are as follows:
S1031, firstly, estimating a homography matrix H between the up-sampled LR and HR frames using SIFT keypoints selected by the RANSAC algorithm;
S1032, then warping the HR frame $I_{sRGB}^{HR}$ to coarsely crop out the corresponding region in the LR frame that matches the HR frame;
S1033, performing pixel-by-pixel alignment on the matched regions using the traditional optical flow estimation method DeepFlow;
S1034, finally, cropping the center region to eliminate alignment artifacts around the boundary, generating aligned LR-HR frames in the sRGB domain, denoted by $\{I_{sRGB}^{LR}, I_{sRGB}^{HR}\}$;
the Raw frames should pass through the same alignment strategy as the sRGB frames; however, directly applying the global and local alignment would destroy the Bayer pattern of the Raw input; when performing data processing on the Raw frames with the alignment strategy, the original frame is therefore first reorganized into the RGGB sub-format, whose size is half that of the sRGB frame; consequently, the H matrix calculated from the sRGB frames needs to be modified by rescaling its translation parameters by a factor of 0.5; the DeepFlow flow field is rescaled in the same manner, and in this way the Raw frame pairs $\{I_{Raw}^{LR}, I_{Raw}^{HR}\}$ are generated;
S2, based on the frames after the S1 data processing, taking the LR Raw frame and HR sRGB frame pair $\{I_{Raw}^{LR}, I_{sRGB}^{HR}\}$ as the training pair, designing a real-world Raw video super-resolution algorithm;
S3, training a model: the number of consecutive frames input for training is 5; the optimizer used is Adam, with the initial learning rate set to 0.0001; a model is constructed based on the algorithm designed in S2 and trained on the deep learning framework PyTorch, iterating 300k times over the whole data set; the learning rate is then reduced to 0.00001 and iteration continues until the loss converges, yielding the final model (a training-loop sketch follows);
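A hedged sketch of this training schedule: Adam at 1e-4, reduced to 1e-5 after 300k iterations, with 5-frame clips; RealRawVSRNet and loader are hypothetical placeholders, and color_correct / charbonnier_loss refer to the sketches above, not released code.

```python
import torch

model = RealRawVSRNet()                                # hypothetical model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step, (lr_raw_seq, hr_srgb) in enumerate(loader):  # 5-frame clips
    if step == 300_000:
        for g in opt.param_groups:
            g['lr'] = 1e-5                             # decay after 300k
    sr = model(lr_raw_seq)
    loss = charbonnier_loss(color_correct(sr, hr_srgb), hr_srgb)
    opt.zero_grad()
    loss.backward()
    opt.step()
```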
S4, inputting the low-resolution Raw video sequences in the test set into the model to obtain the corresponding super-resolution output results;
The invention constructs the first real-world VSR data set with three magnification factors in the Raw and sRGB domains, providing a benchmark data set for the training and evaluation of real Raw VSR methods.
Example 2:
Referring to FIGS. 2-4, the differences from Embodiment 1 are as follows:
The real-world Raw video super-resolution algorithm flow described in S2 mainly comprises the following steps:
S201, dual-branch strategy and feature extraction: to fully utilize the information of the Raw data, the input LR consecutive frames $\{I_{Raw,i}^{LR}\}$ are fed into two branches of the network in different forms; the Bayer-format branch directly takes the Raw consecutive frames themselves as input, while the sub-frame-format branch recombines them into the RGGB sub-format to form a new sequence as input; the input of the Bayer-format branch is denoted $X^{b}$, and the input of the sub-frame branch is denoted $X^{s}$, whose spatial size is half that of $X^{b}$ and whose channel number is 4 times larger; the Bayer-format branch keeps the original order of the Raw pixels, which benefits spatial reconstruction; although the sub-frame-format branch cannot preserve the original pixel order, it can exploit far-neighbor correlations to generate details; the two inputs then pass through separate feature extraction modules, each composed of five residual blocks;
S202, joint alignment: due to the temporal offsets between adjacent frames, the adjacent frames need to be warped toward the center frame; the alignment follows a multi-level cascade alignment strategy, i.e., the alignment offsets are calculated in the sub-frame-format branch, and the calculated offsets are then directly copied to the Bayer-format branch for its alignment, so that the two branches are aligned jointly; the features $F_{t+i}^{s}$ and $F_{t}^{s}$ in the sub-frame-format branch are downsampled L-1 times by convolution to form an L-level pyramid; the pyramid features in the Bayer-format branch are constructed in the same way; the offset of the l-th level is calculated from the aggregated features of the l-th level and the upsampled offset of the (l+1)-th level:

$\Delta p_{t+i}^{s,l} = g([F_{t+i}^{s,l}, F_{t}^{s,l}], (\Delta p_{t+i}^{s,l+1})^{\uparrow 2})$

since the input of the sub-frame-format branch is actually a downsampled version of the Bayer-format branch, the offset values of the Bayer-format branch should be twice those of the sub-frame-format branch; thus, the offset $\Delta p_{t+i}^{s,l}$ of the sub-frame-format branch is upsampled by a factor of 2 and its values are magnified by a factor of 2 to obtain the offset of the Bayer-format branch at the l-th level:

$\Delta p_{t+i}^{b,l} = 2(\Delta p_{t+i}^{s,l})^{\uparrow 2}$

given the offsets, the aligned features of the two branches can be expressed as:

$\tilde{F}_{t+i}^{s,l} = g(\mathrm{Dconv}(F_{t+i}^{s,l}, \Delta p_{t+i}^{s,l}), (\tilde{F}_{t+i}^{s,l+1})^{\uparrow 2})$

$\tilde{F}_{t+i}^{b,l} = g(\mathrm{Dconv}(F_{t+i}^{b,l}, \Delta p_{t+i}^{b,l}), (\tilde{F}_{t+i}^{b,l+1})^{\uparrow 2})$
where g represents the mapping function implemented by several convolution layers and Dconv represents the deformable convolution; the Dconv of the two branches share the same weights at the respective level;
after the L levels are aligned, the offset $\Delta p_{t+i}^{s,0}$ calculated between the aligned feature $\tilde{F}_{t+i}^{s,1}$ and the center feature $F_{t}^{s,1}$ is further used to refine $\tilde{F}_{t+i}^{s,1}$ and $\tilde{F}_{t+i}^{b,1}$, generating the final alignment results $\hat{F}_{t+i}^{s}$ and $\hat{F}_{t+i}^{b}$ for the adjacent features in the two branches;
S203, an interaction module: the Bayer format branch features are downsampled by 3 x 3 convolution (stride=2) and the LeakyRelu layer, and these downsampled features are aggregated with features in the subframe format branches; similarly, sub-frame format branching features are up-sampled by a Pixelshuffle and then aggregated with features in the Bayer format branching;
S204, temporal fusion: remote features are aggregated with a non-local temporal attention module to enhance the feature representation along the temporal dimension; the features are then fused together using temporal-spatial attention (TSA) based fusion;
S205, channel fusion: channel fusion is used to combine the features of the two branches, because the temporally fused features $F^{b}$ and $F^{s}$ of the two branches may contribute differently to the final SR reconstruction; the two branch features are fused via a channel-wise weighted average using selective kernel fusion (SKF);
S206, reconstruction and upsampling: the fused feature $F^{fuse}$ is input into a reconstruction module implemented by 10 ResNet blocks for SR reconstruction; after reconstruction, the result is upsampled with a PixelShuffle layer, and a convolution layer then generates the three-channel output; meanwhile, the module uses two long skip connections: one takes the Bayer-format LR input, processes it with a convolution layer, and upsamples it to a three-channel output through PixelShuffle; the other takes the LR sub-frame-format input and, because its spatial size is half that of the original input, upsamples it by an extra factor of 2; the three outputs are added to generate the final HR result $I^{SR}$;
S207, color correction and loss function: actual shooting data
Figure GDA0004169932550000143
And->
Figure GDA0004169932550000144
There is a difference in color and brightness, and directly exploiting the pixel loss between output and HR may result in the network optimizing color and brightness corrections without concern for the task of SR; to solve this problem, color correction is used before the loss calculation, i.e., channel-based color correction is used for RGB channels, respectively, instead of calculating a 3×3 color correction matrix to correct them simultaneously:
Figure GDA0004169932550000145
wherein alpha is c Is the scaling factor of channel c by minimizing
Figure GDA0004169932550000146
And HR downsampled version ∈ ->
Figure GDA0004169932550000147
The least squares loss between corresponding pixels.
S208, optimizing the network using the Charbonnier loss between the corrected output and HR;
Based on the data set, the invention provides a Real-RawVSR method, which processes the Raw data input in two branches, one for the Bayer-format input and the other for the sub-frame-format input; by utilizing the proposed joint alignment, interaction, and temporal and channel fusion modules, the complementary information of the two branches is well explored; experiments show that the proposed method outperforms the current mainstream Raw and sRGB VSR methods; through the research and exploration of this invention, it is hoped that more studies on Raw-domain video super-resolution methods will be inspired. The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical solution and the inventive concept thereof within the scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (4)

1. A method for constructing real-world video super-resolution based on the Raw domain, characterized by comprising the following steps:
S1, establishing a real-world Raw video super-resolution data set: the data set establishment process mainly comprises the following 3 steps:
S101, hardware design: the incident light is split by a beam splitter into a reflected beam and a transmitted beam with a brightness ratio of 1:1; LR-HR frame pairs of different scales are captured using DSLR cameras with zoom lenses; a 3D model box is designed and printed to fix the beam splitter; the DSLR cameras and the beam-splitter box are placed on an optical plate, with tripods fixed below them;
S102, data acquisition: acquiring Raw video in the MLV format, and then processing the MLV Raw video with the MlRawViewer software to obtain corresponding sRGB frames and Raw frames in the DNG format;
S103, data processing: using a coarse-to-fine alignment strategy to generate aligned LR-HR frames, including sRGB frame pairs $\{I_{sRGB}^{LR}, I_{sRGB}^{HR}\}$ and Raw frame pairs $\{I_{Raw}^{LR}, I_{Raw}^{HR}\}$;
S2, based on the frames after the S1 data processing, taking the LR Raw frame and HR sRGB frame pair $\{I_{Raw}^{LR}, I_{sRGB}^{HR}\}$ as the training pair, designing a real-world Raw video super-resolution algorithm;
S3, training a model: constructing a model based on the algorithm designed in S2, training it on the deep learning framework PyTorch, iterating 300k times over the whole data set, then reducing the learning rate to 0.00001 and continuing to iterate until the loss converges, yielding the final model;
S4, inputting the low-resolution Raw video sequences in the test set into the model obtained in S3 to obtain the corresponding super-resolution output results.
2. The method for constructing real-world video super-resolution based on the Raw domain according to claim 1, characterized in that the steps of performing data processing on the sRGB frames using the alignment strategy in S103 are as follows:
S1031, firstly, estimating a homography matrix H between the up-sampled LR and HR frames using SIFT keypoints selected by the RANSAC algorithm;
S1032, then warping the HR frame $I_{sRGB}^{HR}$ to coarsely crop out the corresponding region in the LR frame that matches the HR frame;
S1033, performing pixel-by-pixel alignment on the matched regions using the traditional optical flow estimation method DeepFlow;
S1034, finally, cropping the center region to eliminate alignment artifacts around the boundary, generating aligned LR-HR frames in the sRGB domain, denoted by $\{I_{sRGB}^{LR}, I_{sRGB}^{HR}\}$.
3. The method for constructing real-world video super-resolution based on the Raw domain according to claim 2, characterized in that: when performing data processing on the Raw frames with the alignment strategy, the original frame is first reorganized into the RGGB sub-format, whose size is half that of the sRGB frame; consequently, the H matrix calculated from the sRGB frames needs to be modified by rescaling its translation parameters by a factor of 0.5; the DeepFlow flow field is rescaled in the same manner, and in this way the Raw frame pairs $\{I_{Raw}^{LR}, I_{Raw}^{HR}\}$ are generated.
4. The method for constructing real-world video super-resolution based on the Raw domain according to claim 1, characterized in that the real-world Raw video super-resolution algorithm flow described in S2 mainly comprises the following steps:
S201, dual-branch strategy and feature extraction: to fully utilize the information of the Raw data, the input LR consecutive frames $\{I_{Raw,i}^{LR}\}$ are fed into two branches of the network in different forms; the Bayer-format branch directly takes the Raw consecutive frames themselves as input, while the sub-frame-format branch recombines them into the RGGB sub-format to form a new sequence as input; the input of the Bayer-format branch is denoted $X^{b}$, and the input of the sub-frame branch is denoted $X^{s}$, whose spatial size is half that of $X^{b}$ and whose channel number is 4 times larger; the Bayer-format branch keeps the original order of the Raw pixels, which benefits spatial reconstruction; although the sub-frame-format branch does not preserve the original pixel order, it can exploit far-neighbor correlations to generate details; the two inputs then pass through separate feature extraction modules, each composed of five residual blocks;
S202, joint alignment: due to the temporal offsets between adjacent frames, the adjacent frames need to be warped toward the center frame; the alignment follows a multi-level cascade alignment strategy, i.e., the alignment offsets are calculated in the sub-frame-format branch, and the calculated offsets are then directly copied to the Bayer-format branch for its alignment, so that the two branches are aligned jointly; the features $F_{t+i}^{s}$ and $F_{t}^{s}$ in the sub-frame-format branch are downsampled L-1 times by convolution to form an L-level pyramid; the pyramid features in the Bayer-format branch are constructed in the same way; the offset of the l-th level is calculated from the aggregated features of the l-th level and the upsampled offset of the (l+1)-th level:

$\Delta p_{t+i}^{s,l} = g([F_{t+i}^{s,l}, F_{t}^{s,l}], (\Delta p_{t+i}^{s,l+1})^{\uparrow 2})$

since the input of the sub-frame-format branch is actually a downsampled version of the Bayer-format branch, the offset values of the Bayer-format branch should be twice those of the sub-frame-format branch; thus, the offset $\Delta p_{t+i}^{s,l}$ of the sub-frame-format branch is upsampled by a factor of 2 and its values are magnified by a factor of 2 to obtain the offset of the Bayer-format branch at the l-th level:

$\Delta p_{t+i}^{b,l} = 2(\Delta p_{t+i}^{s,l})^{\uparrow 2}$

given the offsets, the aligned features of the two branches can be expressed as:

$\tilde{F}_{t+i}^{s,l} = g(\mathrm{Dconv}(F_{t+i}^{s,l}, \Delta p_{t+i}^{s,l}), (\tilde{F}_{t+i}^{s,l+1})^{\uparrow 2})$

$\tilde{F}_{t+i}^{b,l} = g(\mathrm{Dconv}(F_{t+i}^{b,l}, \Delta p_{t+i}^{b,l}), (\tilde{F}_{t+i}^{b,l+1})^{\uparrow 2})$
where g represents the mapping function implemented by several convolution layers and Dconv represents the deformable convolution; the Dconv of the two branches share the same weights at the respective level;
after the L levels are aligned, the offset $\Delta p_{t+i}^{s,0}$ calculated between the aligned feature $\tilde{F}_{t+i}^{s,1}$ and the center feature $F_{t}^{s,1}$ is further used to refine $\tilde{F}_{t+i}^{s,1}$ and $\tilde{F}_{t+i}^{b,1}$, generating the final alignment results $\hat{F}_{t+i}^{s}$ and $\hat{F}_{t+i}^{b}$ for the adjacent features in the two branches;
S203, an interaction module: the Bayer format branch features are downsampled by a 3 x 3 convolution and LeakyRelu layer, and these downsampled features are aggregated with features in the subframe format branches; similarly, sub-frame format branching features are up-sampled by a Pixelshuffle and then aggregated with features in the Bayer format branching;
S204, temporal fusion: remote features are aggregated with a non-local temporal attention module to enhance the feature representation along the temporal dimension; the features are then fused together using temporal-spatial attention based fusion;
S205, channel fusion: channel fusion is used to combine the features of the two branches, because the temporally fused features $F^{b}$ and $F^{s}$ of the two branches may contribute differently to the final SR reconstruction; the two branch features are fused via a channel-wise weighted average using selective kernel fusion;
S206, reconstruction and upsampling: the fused feature $F^{fuse}$ is input into a reconstruction module implemented by 10 ResNet blocks for SR reconstruction; after reconstruction, the result is upsampled with a PixelShuffle layer, and a convolution layer then generates the three-channel output; meanwhile, the module uses two long skip connections: one takes the Bayer-format LR input, processes it with a convolution layer, and upsamples it to a three-channel output through PixelShuffle; the other takes the LR sub-frame-format input and, because its spatial size is half that of the original input, upsamples it by an extra factor of 2; the three outputs are added to generate the final HR result $I^{SR}$;
S207, color correction and loss function: actual shooting data
Figure FDA0004169932530000055
And->
Figure FDA0004169932530000056
There is a difference in color and brightness, and directly exploiting the pixel loss between output and HR may result in the network optimizing color and brightness corrections without concern for the task of SR; to solve this problem, color correction is used before the loss calculation, i.e., channel-based color correction is used for RGB channels, respectively, instead of calculating a 3×3 color correction matrix to correct them simultaneously:
Figure FDA0004169932530000061
wherein alpha is c Is the scaling factor of channel c by minimizing
Figure FDA0004169932530000062
And HR downsampled version ∈ ->
Figure FDA0004169932530000063
A least squares loss between corresponding pixels;
s208, optimizing the network using the chaseonnier loss between the corrected output and HR.
CN202210733861.5A 2022-06-27 2022-06-27 Real world video super-resolution construction method based on Raw domain Active CN115115516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210733861.5A CN115115516B (en) 2022-06-27 2022-06-27 Real world video super-resolution construction method based on Raw domain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210733861.5A CN115115516B (en) 2022-06-27 2022-06-27 Real world video super-resolution construction method based on Raw domain

Publications (2)

Publication Number Publication Date
CN115115516A CN115115516A (en) 2022-09-27
CN115115516B true CN115115516B (en) 2023-05-12

Family

ID=83329552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210733861.5A Active CN115115516B (en) 2022-06-27 2022-06-27 Real world video super-resolution construction method based on Raw domain

Country Status (1)

Country Link
CN (1) CN115115516B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051380B (en) * 2023-01-13 2023-08-22 Shenzhen University Video super-resolution processing method and electronic equipment
CN116596779B (en) * 2023-04-24 2023-12-01 Tianjin University Transformer-based Raw video denoising method
CN117745539A (en) * 2023-11-13 2024-03-22 Tianjin University Real dual-camera reference super-resolution method based on Kernel-Free matching

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700392A (en) * 2020-12-01 2021-04-23 South China University of Technology Video super-resolution processing method, device and storage medium
CN113538249A (en) * 2021-09-03 2021-10-22 China University of Mining and Technology Image super-resolution reconstruction method and device for video monitoring high-definition presentation

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010122934A (en) * 2008-11-20 2010-06-03 Sony Corp Image processing apparatus, image processing method, and program
US9384386B2 (en) * 2014-08-29 2016-07-05 Motorola Solutions, Inc. Methods and systems for increasing facial recognition working rang through adaptive super-resolution
US10783611B2 (en) * 2018-01-02 2020-09-22 Google Llc Frame-recurrent video super-resolution
US11790489B2 (en) * 2020-04-07 2023-10-17 Samsung Electronics Co., Ltd. Systems and method of training networks for real-world super resolution with unknown degradations
CN111583112A (en) * 2020-04-29 2020-08-25 South China University of Technology Method, system, device and storage medium for video super-resolution
CN113240581A (en) * 2021-04-09 2021-08-10 Liaoning Technical University Real-world image super-resolution method for unknown blur kernels
CN112991183B (en) * 2021-04-09 2023-06-20 South China University of Technology Video super-resolution method based on multi-frame attention mechanism progressive fusion
CN113469884A (en) * 2021-07-15 2021-10-01 Changshi Technology Co., Ltd. Video super-resolution method, system, equipment and storage medium based on data simulation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700392A (en) * 2020-12-01 2021-04-23 South China University of Technology Video super-resolution processing method, device and storage medium
CN113538249A (en) * 2021-09-03 2021-10-22 China University of Mining and Technology Image super-resolution reconstruction method and device for video monitoring high-definition presentation

Also Published As

Publication number Publication date
CN115115516A (en) 2022-09-27

Similar Documents

Publication Publication Date Title
CN115115516B (en) Real world video super-resolution construction method based on Raw domain
Kalantari et al. Deep HDR video from sequences with alternating exposures
CN102693538B (en) Generate the global alignment method and apparatus of high dynamic range images
JP2019514116A (en) Efficient canvas view generation from intermediate views
CN108875900B (en) Video image processing method and device, neural network training method and storage medium
CN113454680A (en) Image processor
KR20130013288A (en) High dynamic range image creation apparatus of removaling ghost blur by using multi exposure fusion and method of the same
Liu et al. Exploit camera raw data for video super-resolution via hidden markov model inference
CN113228094A (en) Image processor
CN111986084A (en) Multi-camera low-illumination image quality enhancement method based on multi-task fusion
CN114972061B (en) Method and system for denoising and enhancing dim light video
CN108024054A (en) Image processing method, device and equipment
US11849264B2 (en) Apparatus and method for white balance editing
CN112508812A (en) Image color cast correction method, model training method, device and equipment
Yue et al. Real-RawVSR: Real-world Raw video super-resolution with a benchmark dataset
CN112750092A (en) Training data acquisition method, image quality enhancement model and method and electronic equipment
CN114862707A (en) Multi-scale feature recovery image enhancement method and device and storage medium
CN113379609A (en) Image processing method, storage medium and terminal equipment
Shen et al. Spatial temporal video enhancement using alternating exposures
CN117768774A (en) Image processor, image processing method, photographing device and electronic device
WO2023110880A1 (en) Image processing methods and systems for low-light image enhancement using machine learning models
CN111161189A (en) Single image re-enhancement method based on detail compensation network
CN116245968A (en) Method for generating HDR image based on LDR image of transducer
CN111968039A (en) Day and night universal image processing method, device and equipment based on silicon sensor camera
CN112991174A (en) Method and system for improving resolution of single-frame infrared image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant