CN115115516A - Real-world video super-resolution algorithm based on Raw domain - Google Patents

Real-world video super-resolution algorithm based on Raw domain

Info

Publication number
CN115115516A
Authority
CN
China
Prior art keywords
frame
raw
branch
format
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210733861.5A
Other languages
Chinese (zh)
Other versions
CN115115516B (en)
Inventor
Huanjing Yue
Zhiming Zhang
Jingyu Yang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210733861.5A priority Critical patent/CN115115516B/en
Publication of CN115115516A publication Critical patent/CN115115516A/en
Application granted granted Critical
Publication of CN115115516B publication Critical patent/CN115115516B/en
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformation in the plane of the image
    • G06T 3/40 Scaling the whole image or part thereof
    • G06T 3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/35 Determination of transform parameters for the alignment of images, i.e. image registration using statistical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 9/00 Details of colour television systems
    • H04N 9/64 Circuits for processing colour signals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a real-world video super-resolution algorithm based on the Raw domain, and relates to the technical field of video signal processing. The algorithm comprises the following steps: S1, establishing a real-world Raw video super-resolution dataset; S2, designing a real-world Raw video super-resolution algorithm based on S1; S3, training a model; S4, inputting the low-resolution Raw video sequences in the test set into the model to obtain the corresponding super-resolution output results. The invention constructs the first real-world VSR dataset with three magnification factors in the Raw and sRGB domains, providing a benchmark dataset for training and evaluating real Raw VSR methods; by means of the proposed joint alignment and interaction modules and the temporal and channel fusion modules, the invention raises real LR video super-resolution performance to a new height.

Description

Real-world video super-resolution algorithm based on Raw domain
Technical Field
The invention belongs to the technical field of video signal processing, and relates to a real-world video super-resolution algorithm based on a Raw domain.
Background
Capturing video with a short-focus lens enlarges the viewing angle at the cost of resolution, while capturing with a long-focus lens improves resolution at the cost of viewing angle; video super-resolution (VSR) is an efficient way to acquire wide-angle, high-resolution (HR) video; video super-resolution reconstructs HR video from low-resolution (LR) input by exploring the spatial and temporal correlation of the input sequence; in recent years, video super-resolution has shifted from traditional model-driven approaches to deep-learning-based ones.
The performance of these deep-learning-based SR methods depends to a large extent on the training dataset. Since synthetic LR-HR datasets, such as DIV2K and REDS, cannot represent the degradation model between really captured LR and HR images, many real SR datasets have been constructed to improve real-world SR performance; however, most of these datasets contain static LR-HR images, such as RealSR and ImagePairs. Recently, researchers proposed the first real-world VSR dataset, captured with a multi-camera system using an iPhone 11 Pro Max; however, parallax between the LR and HR cameras increases the difficulty of alignment, and since the phone cameras have limited focal lengths, the dataset contains only 2x LR-HR sequence pairs.
On the other hand, using Raw images for real-scene image (video) restoration, such as low-light enhancement, denoising, deblurring, and super-resolution, has become a trend; the main reason is that Raw images have a wide bit depth (12 or 14 bits), i.e., they contain the most primitive information, and their intensity is linear with the illumination; however, work exploring Raw video super-resolution is still rare; researchers have proposed a Raw video super-resolution dataset by synthesizing LR Raw frames via downsampling from captured HR Raw frames; nevertheless, there is still a gap between synthesized LR Raw frames and really captured frames, so SR models trained on synthetic data do not generalize well to real scenes.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
(1) the invention aims to establish a real-world Raw video super-resolution dataset and, on this basis, to provide a video super-resolution algorithm adapted to Raw data.
In order to achieve the purpose, the invention adopts the following technical scheme:
the real-world video super-resolution algorithm based on the Raw domain comprises the following steps:
S1, establishing a real-world Raw video super-resolution dataset: the process of establishing the dataset mainly comprises the following 3 steps:
S101, hardware design: a beam splitter divides the incident light into two beams, a reflected beam and a transmitted beam with a 1:1 brightness ratio; LR-HR frame pairs of different scales are captured using DSLR cameras with zoom lenses; a 3D model box is designed and printed to fix the beam splitter; the DSLR cameras and the beam-splitter box are placed on an optical plate, and a tripod is fixed below them;
S102, data acquisition: acquiring Raw videos in the MLV format, then processing the MLV-format Raw videos with MlRawViewer software to obtain the corresponding sRGB frames and Raw frames in DNG format;
S103, data processing: utilizing a coarse-to-fine alignment strategy to generate aligned LR-HR frames, including sRGB frame pairs (I_LR^sRGB, I_HR^sRGB) and Raw frame pairs (I_LR^Raw, I_HR^Raw);
S2, based on the frames processed in S1, designing a real-world Raw video super-resolution algorithm with the LR Raw frame and HR sRGB frame pair (I_LR^Raw, I_HR^sRGB) as the training pair;
S3, training a model: building a model based on the algorithm designed in S2, training the model on the deep learning framework PyTorch, iterating 300k times on the whole dataset, then reducing the learning rate to 0.00001 and continuing to iterate until the loss converges, obtaining the final model;
and S4, inputting the low-resolution Raw video sequences in the test set into the model to obtain the corresponding super-resolution output results.
Preferably, the steps of processing the sRGB frames with the alignment strategy in S103 are as follows (a minimal sketch of this pipeline is given after the list):
S1031, first, a homography matrix H between the upsampled LR frame and the HR frame is estimated using SIFT keypoints selected by the RANSAC algorithm;
S1032, then, the HR frame is warped with H, the warped frame being denoted Ĩ_HR^sRGB, to roughly crop out the corresponding region in the LR frame that matches the HR frame;
s1033, performing pixel-by-pixel alignment on the matching area by utilizing a traditional optical flow estimation method DeepFlow;
S1034, finally, the central area is cropped to eliminate alignment artifacts around the boundary, generating aligned LR-HR frames in the sRGB domain, denoted (I_LR^sRGB, I_HR^sRGB).
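As an illustration only, the coarse global-alignment step (S1031-S1032) can be sketched with OpenCV; the function below, its ratio-test threshold, and the RANSAC parameters are assumptions for demonstration rather than the exact implementation:

```python
import cv2
import numpy as np

def coarse_align(lr_up, hr):
    """Estimate a homography between the upsampled LR frame and the HR frame
    from RANSAC-filtered SIFT matches, then warp the HR frame onto the LR grid.
    A minimal sketch; inputs are assumed to be 3-channel BGR arrays."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(cv2.cvtColor(lr_up, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = sift.detectAndCompute(cv2.cvtColor(hr, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher().knnMatch(des2, des1, k=2)
    # Lowe ratio test to keep distinctive matches (0.75 is an assumed threshold).
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # RANSAC rejects outliers
    h, w = lr_up.shape[:2]
    return cv2.warpPerspective(hr, H, (w, h)), H  # HR warped onto the LR view
```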
Preferably, the Raw frames should undergo the same alignment strategy as the sRGB frames; however, directly applying the global and local alignment would destroy the Bayer pattern of the Raw input; when processing the Raw frames with the alignment strategy, the original frames are therefore first re-packed into the RGGB sub-format, whose size is half that of the sRGB frames, so the H matrix computed from the sRGB frames must be adapted by rescaling its translation parameters by a factor of 0.5; the DeepFlow result is processed in the same manner, and in this way the Raw frame pairs (I_LR^Raw, I_HR^Raw) are generated; a sketch of both adjustments follows.
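A minimal sketch of the two Raw-domain adjustments, assuming an RGGB Bayer layout; the helper names are hypothetical:

```python
import numpy as np

def pack_rggb(bayer):
    """Re-pack an H x W Bayer mosaic (RGGB layout assumed) into a
    4-channel (H/2, W/2, 4) sub-frame: R, G1, G2, B planes."""
    return np.stack([bayer[0::2, 0::2],   # R
                     bayer[0::2, 1::2],   # G1
                     bayer[1::2, 0::2],   # G2
                     bayer[1::2, 1::2]],  # B
                    axis=-1)

def rescale_homography(H, scale=0.5):
    """Adapt a homography estimated on the sRGB grid to the half-size
    sub-frame grid via S H S^-1, which rescales the translation terms."""
    S = np.diag([scale, scale, 1.0])
    return S @ H @ np.linalg.inv(S)
```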
Preferably, the real-world Raw video super-resolution algorithm flow described in S2 mainly includes the following steps:
S201, dual-branch strategy and feature extraction: in order to fully utilize the information of the Raw data, the input LR consecutive frames X_LR^Raw are fed into the two branches of the network in different forms; the Bayer-format branch directly takes the Raw consecutive frames as input; the sub-frame-format branch re-packs them into the RGGB sub-format to form a new sequence as input; the input of the Bayer-format branch is denoted X^Bayer and the input of the sub-frame branch is denoted X^Sub, whose channel number is 4 times that of X^Bayer; the Bayer-format branch keeps the original order of the raw pixels, which benefits spatial reconstruction; although the sub-frame-format branch cannot preserve the original pixel order, it can exploit far-neighbor correlation to generate detail; the two inputs then respectively pass through a feature extraction module composed of five residual blocks; a tensor-shape sketch is given below;
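The dual-branch inputs and the five-residual-block feature extractor might look as follows in PyTorch; the channel width (64) and the exact block layout are assumptions:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Plain residual block: conv-ReLU-conv with an identity skip."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    """Head conv + five residual blocks, as described in S201."""
    def __init__(self, in_ch, ch=64):
        super().__init__()
        self.head = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(5)])

    def forward(self, x):
        return self.blocks(self.head(x))

# Per frame: the Bayer branch sees the 1-channel mosaic (H x W), while the
# sub-frame branch sees the packed 4-channel RGGB map at half size (H/2 x W/2).
bayer = torch.randn(5, 1, 128, 128)      # 5 consecutive LR Raw frames
sub = torch.randn(5, 4, 64, 64)          # the same frames, RGGB-packed
feat_bayer = FeatureExtractor(1)(bayer)  # (5, 64, 128, 128)
feat_sub = FeatureExtractor(4)(sub)      # (5, 64, 64, 64)
```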
S202, joint alignment: because of the temporal misalignment between adjacent frames, the adjacent frames need to be warped to the central frame; alignment is performed on the basis of a multi-level cascade alignment strategy, that is, the alignment offsets are calculated from the sub-frame-format branch, and the calculated offsets are then directly copied to the Bayer-format branch for alignment, so that the two branches are aligned jointly; the features F_Sub(t+i) and F_Sub(t) in the sub-frame-format branch undergo convolution and downsampling L-1 times to form an L-level pyramid; the pyramid features in the Bayer-format branch are constructed in the same way; the offset of the l-th level is calculated from the aggregated features of the l-th level and the 2x-upsampled offset of the (l+1)-th level:
Δp_Sub^l = f([F_Sub^l(t+i), F_Sub^l(t)], (Δp_Sub^(l+1))↑2)
since the input of the sub-frame-format branch is actually a downsampled version of that of the Bayer-format branch, the offset values of the Bayer-format branch should be twice those of the sub-frame-format branch; thus, the offset Δp_Sub^l in the sub-frame-format branch is upsampled by a factor of 2 and its values are amplified by a factor of 2 to obtain the offset Δp_Bayer^l of the l-th level of the Bayer-format branch:
Δp_Bayer^l = 2 · (Δp_Sub^l)↑2
given the offsets, the aligned features of the two branches can be expressed as:
F̃_Sub^l(t+i) = g(Dconv(F_Sub^l(t+i), Δp_Sub^l), (F̃_Sub^(l+1)(t+i))↑2)
F̃_Bayer^l(t+i) = g(Dconv(F_Bayer^l(t+i), Δp_Bayer^l), (F̃_Bayer^(l+1)(t+i))↑2)
where g denotes a mapping function implemented by several convolution layers and Dconv denotes the deformable convolution; the Dconv layers of the two branches share the same weights at the corresponding level;
after the L levels of alignment, the offset Δp_Sub calculated between F̃_Sub(t+i) and F_Sub(t) is further used to refine F̃_Sub(t+i) and F̃_Bayer(t+i), generating the final alignment results F̂_Sub(t+i) and F̂_Bayer(t+i) for the adjacent features in both branches; a sketch of the cross-branch offset propagation follows;
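One level of the joint alignment can be sketched with torchvision's deformable convolution as below; the pyramid depth, the aggregation function f (a single convolution here), and the channel width are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class JointAlignLevel(nn.Module):
    """One pyramid level of S202: offsets are predicted in the sub-frame
    branch, then 2x-upsampled and doubled for the Bayer branch."""
    def __init__(self, ch=64, k=3):
        super().__init__()
        self.offset_pred = nn.Conv2d(2 * ch, 2 * k * k, 3, padding=1)
        self.dconv = DeformConv2d(ch, ch, k, padding=k // 2)  # shared across branches

    def forward(self, f_sub_nbr, f_sub_ref, f_bayer_nbr):
        # Offset from the sub-frame branch (the aggregation f is one conv here).
        off_sub = self.offset_pred(torch.cat([f_sub_nbr, f_sub_ref], dim=1))
        # Copy to the Bayer branch: upsample 2x spatially, double the values.
        off_bayer = 2.0 * F.interpolate(off_sub, scale_factor=2,
                                        mode='bilinear', align_corners=False)
        aligned_sub = self.dconv(f_sub_nbr, off_sub)
        aligned_bayer = self.dconv(f_bayer_nbr, off_bayer)  # same Dconv weights
        return aligned_sub, aligned_bayer
```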
S203, interaction module: the Bayer-format branch features are downsampled by a 3 × 3 convolution (stride 2) and a LeakyReLU layer, and these downsampled features are aggregated with the features in the sub-frame-format branch; similarly, the sub-frame-format branch features are upsampled by PixelShuffle and then aggregated with the features in the Bayer-format branch; see the sketch below;
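A plausible reading of the interaction module in PyTorch; aggregating by concatenation plus a 1 × 1 convolution is an assumption:

```python
import torch
import torch.nn as nn

class Interaction(nn.Module):
    """Bidirectional feature exchange between the Bayer branch (full
    resolution) and the sub-frame branch (half resolution), per S203."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(  # Bayer -> sub-frame resolution
            nn.Conv2d(ch, ch, 3, stride=2, padding=1),
            nn.LeakyReLU(0.1, inplace=True))
        self.up = nn.Sequential(    # sub-frame -> Bayer resolution
            nn.Conv2d(ch, 4 * ch, 3, padding=1),
            nn.PixelShuffle(2))     # (4*ch, H/2, W/2) -> (ch, H, W)
        self.agg_bayer = nn.Conv2d(2 * ch, ch, 1)  # assumed: concat + 1x1 conv
        self.agg_sub = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_bayer, f_sub):
        new_bayer = self.agg_bayer(torch.cat([f_bayer, self.up(f_sub)], dim=1))
        new_sub = self.agg_sub(torch.cat([f_sub, self.down(f_bayer)], dim=1))
        return new_bayer, new_sub
```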
S204, temporal fusion: a non-local temporal attention module is used to aggregate long-range features and enhance the feature representation along the temporal dimension; the features are then fused together using spatio-temporal attention (TSA) based fusion; see the sketch below;
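The temporal-attention weighting can be sketched as below, in the spirit of TSA fusion; the embedding convolutions and the dot-product similarity are assumptions:

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Weights each neighboring frame's features by similarity to the
    reference frame before fusing across time (cf. S204); simplified."""
    def __init__(self, ch=64, n_frames=5):
        super().__init__()
        self.emb_ref = nn.Conv2d(ch, ch, 3, padding=1)
        self.emb_nbr = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Conv2d(n_frames * ch, ch, 1)

    def forward(self, feats):                 # feats: (B, T, C, H, W)
        b, t, c, h, w = feats.shape
        ref = self.emb_ref(feats[:, t // 2])  # center-frame embedding
        weighted = []
        for i in range(t):
            nbr = self.emb_nbr(feats[:, i])
            sim = torch.sigmoid((nbr * ref).sum(dim=1, keepdim=True))  # (B,1,H,W)
            weighted.append(feats[:, i] * sim)
        return self.fuse(torch.cat(weighted, dim=1))  # (B, C, H, W)
```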
S205, channel fusion: the features F̂_Bayer and F̂_Sub in the two branches are merged together using channel fusion, since they may contribute differently to the final SR reconstruction; the two branch features are fused by selective kernel fusion (SKF) via a channel-wise weighted average; a sketch follows;
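An SKNet-style fusion sketch; it assumes both branch features have already been brought to the same resolution, and the reduction ratio is arbitrary:

```python
import torch
import torch.nn as nn

class SKFusion(nn.Module):
    """Selective-kernel style fusion (S205): per-channel softmax weights
    decide how much each branch contributes to the fused feature."""
    def __init__(self, ch=64, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, 2 * ch, 1))  # one weight set per branch

    def forward(self, f_bayer, f_sub):
        u = f_bayer + f_sub                           # joint descriptor
        w = self.mlp(u).view(u.size(0), 2, -1, 1, 1)  # (B, 2, C, 1, 1)
        w = torch.softmax(w, dim=1)                   # weights sum to 1 per channel
        return w[:, 0] * f_bayer + w[:, 1] * f_sub
```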
S206, reconstruction and upsampling: the fused feature F_fused is input into a reconstruction module implemented by 10 ResNet blocks for SR reconstruction; after reconstruction, the result is upsampled with PixelShuffle, and a convolutional layer then generates the three-channel output; the module also utilizes two long skip connections: one is for the LR Bayer-format input, which is first processed by a convolutional layer and then upsampled by PixelShuffle to a three-channel output; the other is for the LR sub-frame-format input, which is upsampled twice as much because its spatial size is half that of the original input; the three outputs are added to generate the final HR result I_SR; a sketch of this head follows;
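A sketch of the reconstruction and upsampling head with the two long skips; the 4x scale and channel widths are assumptions:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):  # same shape as in the feature-extraction sketch
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class UpsampleHead(nn.Module):
    """Reconstruction (10 ResBlocks) + PixelShuffle upsampling with two long
    skips from the LR Bayer and sub-frame inputs (S206); 4x scale assumed."""
    def __init__(self, ch=64, scale=4):
        super().__init__()
        self.recon = nn.Sequential(*[ResBlock(ch) for _ in range(10)])
        self.up_main = nn.Sequential(
            nn.Conv2d(ch, 3 * scale ** 2, 3, padding=1), nn.PixelShuffle(scale))
        self.skip_bayer = nn.Sequential(  # LR Bayer mosaic -> HR RGB
            nn.Conv2d(1, 3 * scale ** 2, 3, padding=1), nn.PixelShuffle(scale))
        self.skip_sub = nn.Sequential(    # half-size RGGB input: 2x more upsampling
            nn.Conv2d(4, 3 * (2 * scale) ** 2, 3, padding=1), nn.PixelShuffle(2 * scale))

    def forward(self, fused, bayer_lr, sub_lr):
        out = self.up_main(self.recon(fused))
        return out + self.skip_bayer(bayer_lr) + self.skip_sub(sub_lr)
```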
S207, color correction and loss function: of actual photographic data
Figure BDA0003714834410000072
And
Figure BDA0003714834410000073
there are differences in color and brightness, exploiting pixel loss directly between the output and HR may lead to a network optimizing color and brightness correction without paying attention to the task of SR; to solve this problem, instead of computing a 3 × 3 color correction matrix to correct them simultaneously, color correction is used before the loss computation, i.e. channel-based color correction is used separately for the RGB channels:
Figure BDA0003714834410000074
wherein alpha is c Is the scaling factor of channel c, which is obtained by minimizing
Figure BDA0003714834410000075
And versions of HR downsampling
Figure BDA0003714834410000076
Calculated corresponding to the least squares penalty between pixels.
S208, optimizing the network by using the Charbonier loss between the corrected output and the HR.
The beneficial effects of the invention comprise the following three points:
(1) The invention constructs the first real-world VSR dataset with three magnification factors in the Raw and sRGB domains, providing a benchmark dataset for training and evaluating real Raw VSR methods.
(2) The invention provides the Real-RawVSR method based on the real-world Raw video super-resolution dataset obtained in S1, in which the input Raw data is processed in two branches, one for Bayer-format input and the other for sub-frame-format input; by utilizing the proposed joint alignment, interaction, and temporal and channel fusion modules, the complementary information of the two branches is well explored, and real LR video super-resolution performance is raised to a new height.
(3) Experiments carried out based on the invention show that the proposed method is superior to current mainstream Raw and sRGB VSR methods; through the research and exploration of the invention, research on Raw-domain video super-resolution methods can be expected to be inspired.
Drawings
FIG. 1 is a hardware platform and data processing flow chart in a real-world video super-resolution algorithm based on a Raw domain adopted by the invention;
FIG. 2 is a flow chart of an algorithm in a real-world video super-resolution algorithm based on a Raw domain adopted by the invention;
FIG. 3 is a structure diagram of the joint alignment module in the real-world video super-resolution algorithm based on the Raw domain adopted by the present invention;
FIG. 4 is a table comparing the result indexes of the algorithm used in the present invention and other video super-resolution algorithms in the test set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1:
referring to fig. 1, the real-world video super-resolution algorithm based on the Raw domain includes the following steps:
S1, establishing a real-world Raw video super-resolution dataset: the process of establishing the dataset mainly comprises the following 3 steps:
S101, hardware design: to capture LR-HR frame pairs of different scales, DSLR cameras with 18-135 mm zoom lenses are used instead of cell-phone cameras; in order to avoid the influence of natural light from other directions, a 3D model box is designed and printed to fix the beam splitter, so that the two cameras receive natural light from the same viewpoint; the size of the beam splitter is 150 × 150 × 1 (mm³), sufficient to cover the camera lenses; in the invention, the cameras and the beam-splitter box are placed on an optical plate, and a tripod is fixed below the optical plate to improve its stability;
S102, data acquisition: two Canon 60D cameras upgraded with the third-party software Magic Lantern are used to acquire Raw videos in the Magic Lantern Video (MLV) format; to keep the cameras synchronized, an infrared remote controller sends signals to the two cameras to control shooting simultaneously; during shooting, the ISO of the two cameras is kept in the range of 100 to 1600 to avoid noise, and the exposure time ranges from 1/400 s to 1/31 s to capture both slow and fast motion; all other settings are kept at default values to simulate a real capture scene; the MLV videos are then processed with MlRawViewer software to obtain the corresponding sRGB frames and Raw frames in DNG format; for each scene, a 6-second short video is captured at a frame rate of 25 FPS, i.e., each video contains about 150 frames in both Raw and sRGB formats;
S103, data processing: misalignment still exists between the LR-HR pairs due to lens distortion; the invention therefore utilizes a coarse-to-fine alignment strategy to generate aligned LR-HR frames, including sRGB frame pairs (I_LR^sRGB, I_HR^sRGB) and Raw frame pairs (I_LR^Raw, I_HR^Raw);
The steps of processing the sRGB frames with the alignment strategy in S103 are as follows:
S1031, first, a homography matrix H between the upsampled LR frame and the HR frame is estimated using SIFT keypoints selected by the RANSAC algorithm;
S1032, then, the HR frame is warped with H, the warped frame being denoted Ĩ_HR^sRGB, to roughly crop out the corresponding region in the LR frame that matches the HR frame;
s1033, performing pixel-by-pixel alignment on the matching area by utilizing a traditional optical flow estimation method DeepFlow;
S1034, finally, the central area is cropped to eliminate alignment artifacts around the boundary, generating aligned LR-HR frames in the sRGB domain, denoted (I_LR^sRGB, I_HR^sRGB);
the Raw frames should undergo the same alignment strategy as the sRGB frames; however, directly applying the global and local alignment would destroy the Bayer pattern of the Raw input; when processing the Raw frames with the alignment strategy, the original frames are therefore first re-packed into the RGGB sub-format, whose size is half that of the sRGB frames, so the H matrix computed from the sRGB frames must be adapted by rescaling its translation parameters by a factor of 0.5; the DeepFlow result is processed in the same manner, and in this way the Raw frame pairs (I_LR^Raw, I_HR^Raw) are generated;
S2, based on the frames processed in S1, designing a real-world Raw video super-resolution algorithm with the LR Raw frame and HR sRGB frame pair (I_LR^Raw, I_HR^sRGB) as the training pair;
S3, training a model: the number of consecutive input frames for training is 5, the optimizer used is the Adam optimizer, and the initial learning rate is set to 0.0001; a model is built based on the algorithm designed in S2 and trained on the deep learning framework PyTorch, iterating 300k times on the whole dataset, then reducing the learning rate to 0.00001 and continuing to iterate until the loss converges, obtaining the final model; a sketch of this training configuration follows;
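A skeleton of this training configuration, under stated assumptions (build_model, train_set, charbonnier, and color_correct are placeholders for the pieces described in S1, S2, S207, and S208; the batch size and stopping point are arbitrary):

```python
import torch
from torch.utils.data import DataLoader

# build_model, train_set, charbonnier, and color_correct are placeholders
# for the network of S2, the dataset of S1, and the loss pieces of S207/S208.
model = build_model()
loader = DataLoader(train_set, batch_size=8, shuffle=True, num_workers=4)
optim = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial LR per S3

step, max_step = 0, 600_000                             # stopping point assumed
while step < max_step:
    for lr_raw, hr_srgb in loader:      # 5-frame LR Raw clip, HR sRGB target
        sr = model(lr_raw)
        loss = charbonnier(color_correct(sr, hr_srgb), hr_srgb)
        optim.zero_grad()
        loss.backward()
        optim.step()
        step += 1
        if step == 300_000:             # decay LR to 1e-5 after 300k iterations
            for g in optim.param_groups:
                g['lr'] = 1e-5
        if step >= max_step:
            break
```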
S4, inputting the low-resolution Raw video sequences in the test set into the model to obtain the corresponding super-resolution output results;
the invention constructs the first real-world VSR dataset with three magnification factors in the Raw and sRGB domains, providing a benchmark dataset for training and evaluating real Raw VSR methods.
Example 2:
Referring to fig. 2-4, the differences from embodiment 1 are as follows:
The real-world Raw video super-resolution algorithm flow in S2 mainly comprises the following steps:
S201, dual-branch strategy and feature extraction: in order to fully utilize the information of the Raw data, the input LR consecutive frames X_LR^Raw are fed into the two branches of the network in different forms; the Bayer-format branch directly takes the Raw consecutive frames as input; the sub-frame-format branch re-packs them into the RGGB sub-format to form a new sequence as input; the input of the Bayer-format branch is denoted X^Bayer and the input of the sub-frame branch is denoted X^Sub, whose channel number is 4 times that of X^Bayer; the Bayer-format branch keeps the original order of the raw pixels, which benefits spatial reconstruction; although the sub-frame-format branch cannot preserve the original pixel order, it can exploit far-neighbor correlation to generate detail; the two inputs then respectively pass through a feature extraction module composed of five residual blocks;
S202, joint alignment: because of the temporal misalignment between adjacent frames, the adjacent frames need to be warped to the central frame; alignment is performed on the basis of a multi-level cascade alignment strategy, that is, the alignment offsets are calculated from the sub-frame-format branch, and the calculated offsets are then directly copied to the Bayer-format branch for alignment, so that the two branches are aligned jointly; the features F_Sub(t+i) and F_Sub(t) in the sub-frame-format branch undergo convolution and downsampling L-1 times to form an L-level pyramid; the pyramid features in the Bayer-format branch are constructed in the same way; the offset of the l-th level is calculated from the aggregated features of the l-th level and the 2x-upsampled offset of the (l+1)-th level:
Δp_Sub^l = f([F_Sub^l(t+i), F_Sub^l(t)], (Δp_Sub^(l+1))↑2)
since the input of the sub-frame-format branch is actually a downsampled version of that of the Bayer-format branch, the offset values of the Bayer-format branch should be twice those of the sub-frame-format branch; thus, the offset Δp_Sub^l in the sub-frame-format branch is upsampled by a factor of 2 and its values are amplified by a factor of 2 to obtain the offset Δp_Bayer^l of the l-th level of the Bayer-format branch:
Δp_Bayer^l = 2 · (Δp_Sub^l)↑2
given the offsets, the aligned features of the two branches can be expressed as:
F̃_Sub^l(t+i) = g(Dconv(F_Sub^l(t+i), Δp_Sub^l), (F̃_Sub^(l+1)(t+i))↑2)
F̃_Bayer^l(t+i) = g(Dconv(F_Bayer^l(t+i), Δp_Bayer^l), (F̃_Bayer^(l+1)(t+i))↑2)
where g denotes a mapping function implemented by several convolution layers and Dconv denotes the deformable convolution; the Dconv layers of the two branches share the same weights at the corresponding level;
after the L levels of alignment, the offset Δp_Sub calculated between F̃_Sub(t+i) and F_Sub(t) is further used to refine F̃_Sub(t+i) and F̃_Bayer(t+i), generating the final alignment results F̂_Sub(t+i) and F̂_Bayer(t+i) for the adjacent features in both branches;
S203, interaction module: the Bayer-format branch features are downsampled by a 3 × 3 convolution (stride 2) and a LeakyReLU layer, and these downsampled features are aggregated with the features in the sub-frame-format branch; similarly, the sub-frame-format branch features are upsampled by PixelShuffle and then aggregated with the features in the Bayer-format branch;
S204, temporal fusion: a non-local temporal attention module is used to aggregate long-range features and enhance the feature representation along the temporal dimension; the features are then fused together using spatio-temporal attention (TSA) based fusion;
S205, channel fusion: the features F̂_Bayer and F̂_Sub in the two branches are merged together using channel fusion, since they may contribute differently to the final SR reconstruction; the two branch features are fused by selective kernel fusion (SKF) via a channel-wise weighted average;
S206, reconstruction and upsampling: the fused feature F_fused is input into a reconstruction module implemented by 10 ResNet blocks for SR reconstruction; after reconstruction, the result is upsampled with PixelShuffle, and a convolutional layer then generates the three-channel output; the module also utilizes two long skip connections: one is for the LR Bayer-format input, which is first processed by a convolutional layer and then upsampled by PixelShuffle to a three-channel output; the other is for the LR sub-frame-format input, which is upsampled twice as much because its spatial size is half that of the original input; the three outputs are added to generate the final HR result I_SR;
S207, color correction and loss function: of actual shot data
Figure BDA0003714834410000143
And
Figure BDA0003714834410000144
there are differences in color and brightness, exploiting pixel loss directly between the output and HR may lead to a network optimizing color and brightness correction without paying attention to the task of SR; to solve this problem, the loss is calculatedInstead of computing a 3 x 3 color correction matrix to correct them simultaneously, color correction was previously used, i.e. channel-based color correction was used separately for the RGB channels:
Figure BDA0003714834410000145
wherein alpha is c Is the scaling factor of channel c by minimizing
Figure BDA0003714834410000146
And versions of HR downsampling
Figure BDA0003714834410000147
Calculated corresponding to the least squares penalty between pixels.
S208, optimizing the network by using the Charbonier loss between the corrected output and the HR;
based on the dataset, the invention provides the Real-RawVSR method, in which the input Raw data is processed in two branches, one for Bayer-format input and the other for sub-frame-format input; by utilizing the proposed joint alignment, interaction, and temporal and channel fusion modules, the complementary information of the two branches is well explored; experiments show that the proposed method is superior to current mainstream Raw and sRGB VSR methods; through the research and exploration of the invention, research on Raw-domain video super-resolution methods can be expected to be inspired.
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent replacement or change made by a person skilled in the art according to the technical solution and inventive concept of the present invention, within the technical scope disclosed by the present invention, shall be covered by the protection scope of the present invention.

Claims (4)

1. A real-world video super-resolution algorithm based on the Raw domain, characterized by comprising the following steps:
S1, establishing a real-world Raw video super-resolution dataset: the process of establishing the dataset mainly comprises the following 3 steps:
S101, hardware design: a beam splitter divides the incident light into two beams, a reflected beam and a transmitted beam with a 1:1 brightness ratio; LR-HR frame pairs of different scales are captured using DSLR cameras with zoom lenses; a 3D model box is designed and printed to fix the beam splitter; the DSLR cameras and the beam-splitter box are placed on an optical plate, and a tripod is fixed below them;
S102, data acquisition: acquiring Raw videos in the MLV format, then processing the MLV-format Raw videos with MlRawViewer software to obtain the corresponding sRGB frames and Raw frames in DNG format;
S103, data processing: utilizing a coarse-to-fine alignment strategy to generate aligned LR-HR frames, including sRGB frame pairs (I_LR^sRGB, I_HR^sRGB) and Raw frame pairs (I_LR^Raw, I_HR^Raw);
S2, based on the frames processed in S1, designing a real-world Raw video super-resolution algorithm with the LR Raw frame and HR sRGB frame pair (I_LR^Raw, I_HR^sRGB) as the training pair;
S3, training a model: building a model based on the algorithm designed in S2, training the model on the deep learning framework PyTorch, iterating 300k times on the whole dataset, then reducing the learning rate to 0.00001 and continuing to iterate until the loss converges, obtaining the final model;
and S4, inputting the low-resolution Raw video sequences in the test set into the model obtained in S3 to obtain the corresponding super-resolution output results.
2. The real-world video super-resolution algorithm based on the Raw domain according to claim 1, characterized in that the steps of processing the sRGB frames with the alignment strategy in S103 are as follows:
S1031, first, a homography matrix H between the upsampled LR frame and the HR frame is estimated using SIFT keypoints selected by the RANSAC algorithm;
S1032, then, the HR frame is warped with H, the warped frame being denoted Ĩ_HR^sRGB, to roughly crop out the corresponding region in the LR frame that matches the HR frame;
s1033, performing pixel-by-pixel alignment on the matching area by utilizing a traditional optical flow estimation method DeepFlow;
S1034, finally, the central area is cropped to eliminate alignment artifacts around the boundary, generating aligned LR-HR frames in the sRGB domain, denoted (I_LR^sRGB, I_HR^sRGB).
3. The real-world video super-resolution algorithm based on the Raw domain according to claim 2, characterized in that: when processing the Raw frames with the alignment strategy, the original frames are first re-packed into the RGGB sub-format, whose size is half that of the sRGB frames, so the H matrix computed from the sRGB frames must be adapted by rescaling its translation parameters by a factor of 0.5; the DeepFlow result is processed in the same manner, and in this way the Raw frame pairs (I_LR^Raw, I_HR^Raw) are generated.
4. The real-world video super-resolution algorithm based on the Raw domain according to claim 1, characterized in that the real-world Raw video super-resolution algorithm flow in S2 mainly comprises the following steps:
S201, dual-branch strategy and feature extraction: in order to fully utilize the information of the Raw data, the input LR consecutive frames X_LR^Raw are fed into the two branches of the network in different forms; the Bayer-format branch directly takes the Raw consecutive frames as input; the sub-frame-format branch re-packs them into the RGGB sub-format to form a new sequence as input; the input of the Bayer-format branch is denoted X^Bayer and the input of the sub-frame branch is denoted X^Sub, whose channel number is 4 times that of X^Bayer; the Bayer-format branch keeps the original order of the raw pixels, which benefits spatial reconstruction; although the sub-frame-format branch cannot preserve the original pixel order, it can exploit far-neighbor correlation to generate detail; the two inputs then respectively pass through a feature extraction module composed of five residual blocks;
S202, joint alignment: because of the temporal misalignment between adjacent frames, the adjacent frames need to be warped to the central frame; alignment is performed on the basis of a multi-level cascade alignment strategy, that is, the alignment offsets are calculated from the sub-frame-format branch, and the calculated offsets are then directly copied to the Bayer-format branch for alignment, so that the two branches are aligned jointly; the features F_Sub(t+i) and F_Sub(t) in the sub-frame-format branch undergo convolution and downsampling L-1 times to form an L-level pyramid; the pyramid features in the Bayer-format branch are constructed in the same way; the offset of the l-th level is calculated from the aggregated features of the l-th level and the 2x-upsampled offset of the (l+1)-th level:
Δp_Sub^l = f([F_Sub^l(t+i), F_Sub^l(t)], (Δp_Sub^(l+1))↑2)
since the input of the sub-frame-format branch is actually a downsampled version of that of the Bayer-format branch, the offset values of the Bayer-format branch should be twice those of the sub-frame-format branch; thus, the offset Δp_Sub^l in the sub-frame-format branch is upsampled by a factor of 2 and its values are amplified by a factor of 2 to obtain the offset Δp_Bayer^l of the l-th level of the Bayer-format branch:
Δp_Bayer^l = 2 · (Δp_Sub^l)↑2
given the offsets, the aligned features of the two branches can be expressed as:
F̃_Sub^l(t+i) = g(Dconv(F_Sub^l(t+i), Δp_Sub^l), (F̃_Sub^(l+1)(t+i))↑2)
F̃_Bayer^l(t+i) = g(Dconv(F_Bayer^l(t+i), Δp_Bayer^l), (F̃_Bayer^(l+1)(t+i))↑2)
where g denotes a mapping function implemented by several convolution layers and Dconv denotes the deformable convolution; the Dconv layers of the two branches share the same weights at the corresponding level;
after the L levels of alignment, the offset Δp_Sub calculated between F̃_Sub(t+i) and F_Sub(t) is further used to refine F̃_Sub(t+i) and F̃_Bayer(t+i), generating the final alignment results F̂_Sub(t+i) and F̂_Bayer(t+i) for the adjacent features in both branches;
S203, interaction module: the Bayer-format branch features are downsampled by a 3 × 3 convolution (stride 2) and a LeakyReLU layer, and these downsampled features are aggregated with the features in the sub-frame-format branch; similarly, the sub-frame-format branch features are upsampled by PixelShuffle and then aggregated with the features in the Bayer-format branch;
S204, temporal fusion: a non-local temporal attention module is used to aggregate long-range features and enhance the feature representation along the temporal dimension; the features are then fused together using spatio-temporal attention (TSA) based fusion;
S205, channel fusion: the features F̂_Bayer and F̂_Sub in the two branches are merged together using channel fusion, since they may contribute differently to the final SR reconstruction; the two branch features are fused by selective kernel fusion (SKF) via a channel-wise weighted average;
S206, reconstruction and upsampling: the fused feature F_fused is input into a reconstruction module implemented by 10 ResNet blocks for SR reconstruction; after reconstruction, the result is upsampled with PixelShuffle, and a convolutional layer then generates the three-channel output; the module also utilizes two long skip connections: one is for the LR Bayer-format input, which is first processed by a convolutional layer and then upsampled by PixelShuffle to a three-channel output; the other is for the LR sub-frame-format input, which is upsampled twice as much because its spatial size is half that of the original input; the three outputs are added to generate the final HR result I_SR;
S207, color correction and loss function: of actual photographic data
Figure FDA0003714834400000055
And
Figure FDA0003714834400000056
there are differences in color and brightness, exploiting pixel loss directly between the output and HR may lead to a network optimizing color and brightness correction without paying attention to the task of SR; to solve this problem, instead of computing a 3 × 3 color correction matrix to correct them simultaneously, color correction is used before the loss computation, i.e. channel-based color correction is used separately for the RGB channels:
Figure FDA0003714834400000061
wherein alpha is c Is the scaling factor of channel c by minimizing
Figure FDA0003714834400000062
And HR down-sampled versions
Figure FDA0003714834400000063
Calculated corresponding to the least squares penalty between pixels.
S208, optimizing the network by using the Charbonier loss between the corrected output and the HR.
CN202210733861.5A 2022-06-27 2022-06-27 Real world video super-resolution construction method based on Raw domain Active CN115115516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210733861.5A CN115115516B (en) 2022-06-27 2022-06-27 Real world video super-resolution construction method based on Raw domain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210733861.5A CN115115516B (en) 2022-06-27 2022-06-27 Real world video super-resolution construction method based on Raw domain

Publications (2)

Publication Number Publication Date
CN115115516A true CN115115516A (en) 2022-09-27
CN115115516B CN115115516B (en) 2023-05-12

Family

ID=83329552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210733861.5A Active CN115115516B (en) 2022-06-27 2022-06-27 Real world video super-resolution construction method based on Raw domain

Country Status (1)

Country Link
CN (1) CN115115516B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100123792A1 (en) * 2008-11-20 2010-05-20 Takefumi Nagumo Image processing device, image processing method and program
US20160063316A1 (en) * 2014-08-29 2016-03-03 Motorola Solutions, Inc. Methods and systems for increasing facial recognition working rang through adaptive super-resolution
US20190206026A1 (en) * 2018-01-02 2019-07-04 Google Llc Frame-Recurrent Video Super-Resolution
US20210312591A1 (en) * 2020-04-07 2021-10-07 Samsung Electronics Co., Ltd. Systems and method of training networks for real-world super resolution with unknown degradations
CN111583112A (en) * 2020-04-29 2020-08-25 华南理工大学 Method, system, device and storage medium for video super-resolution
CN112700392A (en) * 2020-12-01 2021-04-23 华南理工大学 Video super-resolution processing method, device and storage medium
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion
CN113240581A (en) * 2021-04-09 2021-08-10 辽宁工程技术大学 Real world image super-resolution method for unknown fuzzy kernel
CN113469884A (en) * 2021-07-15 2021-10-01 长视科技股份有限公司 Video super-resolution method, system, equipment and storage medium based on data simulation
CN113538249A (en) * 2021-09-03 2021-10-22 中国矿业大学 Image super-resolution reconstruction method and device for video monitoring high-definition presentation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
X. YANG ET AL: "Real-world Video Super-resolution: A Benchmark Dataset and A Decomposition based Learning Scheme", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV) *
ZHAN KEYU ET AL: "A video super-resolution method with multi-scale 3D convolution", Journal of Xidian University *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051380A (en) * 2023-01-13 2023-05-02 深圳大学 Video super-resolution processing method and electronic equipment
CN116051380B (en) * 2023-01-13 2023-08-22 深圳大学 Video super-resolution processing method and electronic equipment
CN116596779A (en) * 2023-04-24 2023-08-15 天津大学 Transform-based Raw video denoising method
CN116596779B (en) * 2023-04-24 2023-12-01 天津大学 Transform-based Raw video denoising method

Also Published As

Publication number Publication date
CN115115516B (en) 2023-05-12

Similar Documents

Publication Publication Date Title
Jiang et al. Learning to see moving objects in the dark
US11037278B2 (en) Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
CN107123089B (en) Remote sensing image super-resolution reconstruction method and system based on depth convolution network
Kalantari et al. Deep HDR video from sequences with alternating exposures
CN115115516B (en) Real world video super-resolution construction method based on Raw domain
CN112085659B (en) Panorama splicing and fusing method and system based on dome camera and storage medium
KR20130013288A (en) High dynamic range image creation apparatus of removaling ghost blur by using multi exposure fusion and method of the same
CN113228094A (en) Image processor
CN113850367B (en) Network model training method, image processing method and related equipment thereof
CN111986084A (en) Multi-camera low-illumination image quality enhancement method based on multi-task fusion
Zamir et al. Learning digital camera pipeline for extreme low-light imaging
CN112508812A (en) Image color cast correction method, model training method, device and equipment
Yue et al. Real-rawvsr: Real-world raw video super-resolution with a benchmark dataset
Zhao et al. End-to-end denoising of dark burst images using recurrent fully convolutional networks
CN111986106A (en) High dynamic image reconstruction method based on neural network
Shen et al. Spatial temporal video enhancement using alternating exposures
CN112750092A (en) Training data acquisition method, image quality enhancement model and method and electronic equipment
CN113379609A (en) Image processing method, storage medium and terminal equipment
WO2023110880A1 (en) Image processing methods and systems for low-light image enhancement using machine learning models
Guo et al. Low-light color imaging via cross-camera synthesis
Ye et al. LFIENet: Light field image enhancement network by fusing exposures of LF-DSLR image pairs
CN114862735A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN114862707A (en) Multi-scale feature recovery image enhancement method and device and storage medium
CN112991174A (en) Method and system for improving resolution of single-frame infrared image
Guo et al. Low-light color imaging via dual camera acquisition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant