CN116862773A - Video super-resolution reconstruction method applied to complex scene - Google Patents

Video super-resolution reconstruction method applied to complex scene

Info

Publication number
CN116862773A
CN116862773A
Authority
CN
China
Prior art keywords
resolution
alignment
frame
module
propagation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310888738.5A
Other languages
Chinese (zh)
Inventor
宋建锋
胡国正
谢琨
张文英
韩露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310888738.5A priority Critical patent/CN116862773A/en
Publication of CN116862773A publication Critical patent/CN116862773A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The method comprises: performing feature extraction on a frame sequence to be processed and sending it to a pre-fusion module for preprocessing and fusion; sending the resulting frame sequence to a multi-attention bidirectional propagation alignment module for feature alignment, where the features of each unit in the bidirectional propagation are deepened by the multi-attention mechanism; performing combined reconstruction on the resulting frame sequence, sending it to a second-order bidirectional propagation alignment module for a second feature alignment, and reconstructing again; magnifying the resulting feature-map result to obtain a high-resolution feature map; and ordering the high-resolution video frames corresponding to the frame sequence to be processed in time, applying bicubic-interpolation upsampling to the two reconstructed frame sequences, and adding them pixel-wise to the high-resolution feature map, finally obtaining the super-resolved video frame sequence. The invention solves the artifact and blurring problems of the original video frames in complex super-resolution scenarios.

Description

Video super-resolution reconstruction method applied to complex scene
Technical Field
The invention belongs to the technical field of video restoration processing, and particularly relates to a video super-resolution reconstruction method applied to complex scenes.
Background
With the advent of the 5G era and advances in electronics, the transmission speeds of wired and wireless networks have increased dramatically. More and more video websites provide higher-definition playback options for newly released films and series, but older titles cannot adapt to the latest definition options and fall far below today's mainstream definition standards.
Classic old films were mostly shot on film, and their definition after digitization struggles to meet present-day requirements. The importance of protecting and passing on this film heritage has grown in recent years, and film and cultural institutions are supported in processing old films with video super-resolution technology to improve their definition and detail. To address the low definition of old films and series, two solutions are generally adopted. The first is manual frame-by-frame restoration and re-release, which gives good results but takes a long time and incurs high labor costs. The other direction is to enlarge the images with image-processing algorithms while enhancing and refining the image details, i.e., super-resolution reconstruction technology. This technology uses mature algorithms to process the input images (video), converting blurred low-resolution images into detail-rich, higher-quality high-resolution images, which effectively saves cost and speeds up definition restoration. Reconstructing a clear high-resolution (HR) image or video frame from a degraded low-resolution (LR) image or video frame is precisely the super-resolution reconstruction technique.
However, current video super-resolution reconstruction technology still needs further development: video frame features are not fully exploited, and the restoration quality under complex degradation scenarios remains unsatisfactory. Studying these problems can overcome the shortcomings of current video super-resolution research and address the issues that arise when super-resolving old films and series.
In the video super-resolution reconstruction task, most of the differences between a high-resolution image and a low-resolution image lie in edge and texture information. Compared with single-image super-resolution, the focus of video super-resolution is how to extract enough high-frequency image information from the frame sequence through a well-designed network, so two different schemes exist for frame-sequence feature extraction.
The sliding-window framework was a popular solution in the early stages of video super-resolution: each frame in the video is restored using the frames within a short temporal window. Although the input consists of many consecutive video frames, the sliding window treats the restoration of each video frame as an independent task rather than exploiting the complete video sequence, so the restoration quality is not ideal. In contrast, the recurrent framework aims to exploit long-term dependencies by propagating latent features, but its effectiveness depends strongly on the length of the video sequence, which poses challenges for network training and deployment.
In 2021, the authors of Omniscient video super-resolution tried to integrate the ideas of both frameworks, using the progressive fusion module of Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations as the main structure for feature restoration. While it works well on particular datasets, the results may degrade when processing videos with large motion because it lacks an alignment module. Networks designed for video super-resolution in complex scenes either carry a large number of parameters or deliver unconvincing restoration results.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a video super-resolution reconstruction method applied to complex scenes, which uses attention mechanisms to combine the ideas of the sliding-window framework and the recurrent framework and recovers the features of video frame sequences more effectively in complex super-resolution scenarios.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a video super-resolution reconstruction method applied to complex scenes comprises the following steps:
step 1, extracting features of a frame sequence to be processed, and sending the frame sequence to a pre-fusion module to perform pre-processing and fusion operation, wherein the frame sequence to be processed consists of low-resolution video frames;
step 2, sending the frame sequence obtained after preprocessing and fusion operation into a multi-attention mechanism bidirectional propagation alignment module to perform characteristic alignment, and then performing characteristic deepening on the characteristics of each unit in bidirectional propagation by utilizing multi-attention mechanism design; the multi-attention mechanism bi-directional propagation alignment module comprises a plurality of bi-directional circulation units, each bi-directional circulation unit is composed of a forward propagation module and a backward propagation module, the forward propagation refers to the process of generating a high-resolution output image from a low-resolution input image, the backward propagation refers to the process of generating a low-resolution input image reversely from a high-resolution estimated output image, and the backward propagation depends on the result of the forward propagation to calculate errors and gradients. The method iteratively updates network parameters by reducing a loss function, so that the network prediction output is continuously approximate to the real output, the prediction output of a video frame is calculated by forward propagation, and the output error is returned by reverse propagation, and the network parameters are updated to obtain a better video super-resolution result;
step 3, sending the frame sequence obtained by feature deepening and fusion into a first reconstruction module for combined reconstruction;
step 4, the frame sequence after the combination reconstruction is sent to a second-order bidirectional propagation alignment module for secondary feature alignment, and then sent to a second modeling block for reconstruction;
step 5, amplifying the feature map result obtained in the step 4 by a pixel reorganization module to obtain a high-resolution feature map after super resolution;
and 6, ordering the high-resolution video frames corresponding to the frame sequence to be processed according to time, respectively carrying out bicubic interpolation up-sampling operation on the frame sequences reconstructed in the step 3 and the step 4, and carrying out pixel addition operation on the frame sequences and the high-resolution feature images output in the step 5, thereby finally obtaining the super-resolution video frame sequence after super resolution.
Compared with the prior art, the invention has the beneficial effects that:
the invention adopts a pre-fusion module and a multi-attention mechanism design, wherein the pre-fusion module carries out pre-fusion on adjacent frames in the window through a similar sliding window concept, cleans up artifact characteristics and noise characteristics, and enhances the short-term characteristics of the frame sequence of each frame in the window, thereby improving the recovery performance of the multi-video frame sequence and the sensibility to the long-period frame sequence characteristics.
The invention adds a multi-attention mechanism for the feature fusion process in the bidirectional propagation module, and provides a multi-attention mechanism bidirectional propagation alignment module, so that a network can acquire important features in a sequence in a self-adaptive manner and capture long-term relations in the features, thereby enhancing the restoration effect of the network on the long-term features in the video frame sequence and balancing the short-term feature influence.
In the bidirectional circulation process of the network, two deformable convolution alignment modes are designed, in the bidirectional transmission alignment module of the multi-attention mechanism, namely, an optical flow network is used for calculating a frame sequence and giving preliminary alignment so as to guide the deformable convolution to operate a feature map, and the subsequent second-order bidirectional transmission alignment module trains and aligns the deformable convolution on the basis of alignment data in the bidirectional transmission alignment module of the multi-attention mechanism, so that the intra-frame features can be aligned more accurately, the bidirectional circulation transmission module in the network is easier to train, the alignment speed is faster, the design improves the training stability of the network, and simultaneously accelerates the reasoning speed of the network.
The invention reduces the data operand and parameter, improves the super-resolution effect of the network on complex video scenes, and achieves better super-resolution video frame reconstruction accuracy.
Drawings
Fig. 1 is a network structure diagram of a video super-resolution reconstruction method applied to a complex scene.
Fig. 2 is a schematic diagram of a pre-fusion module structure according to the present invention.
FIG. 3 is a schematic diagram of a multi-attention mechanism bi-directional propagation alignment module according to the present invention.
FIG. 4 is a schematic diagram of a channel attention mechanism in the multi-attention mechanism of the present invention.
FIG. 5 is a schematic diagram of a spatial attention mechanism in the multi-attention mechanism of the present invention.
Fig. 6 is a schematic diagram of a second order deformable alignment employed in a second order bi-directional propagation alignment module of the present invention.
Fig. 7 is a schematic diagram of a sub-pixel convolution process in a pixel reorganization module according to the present invention.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution is described clearly and completely below with reference to the accompanying drawings of this specification. All other results obtained by a person skilled in the art from the solutions of the invention without inventive effort shall fall within the protection scope of the invention.
As described above, existing video super-resolution algorithms do not fully exploit video frame features and restore complex degraded scenes poorly. To solve these technical problems, the embodiment of the invention provides a video-frame super-resolution reconstruction method that uses attention mechanisms to combine the ideas of the sliding-window framework and the recurrent framework.
Referring to fig. 1, the reconstruction method of the present invention relies on a pre-fusion module, a multi-attention bidirectional propagation alignment module, a first reconstruction module, a second-order bidirectional propagation alignment module, a second reconstruction module, and a pixel reorganization module; these modules form the main part of the video super-resolution reconstruction network of the invention.
After the low-resolution video frame sequence to be processed is fed into the super-resolution reconstruction network, each frame and its adjacent frames are first input to a corresponding pre-fusion module. Each pre-fusion module preprocesses and fuses the frames within one fusion window and cleans and enhances the local features in the current queue window, so that the bidirectional propagation in the network can balance long-term and short-term feature propagation; at the same time, the network becomes better suited to special cases in which certain features in the frame sequence are temporarily occluded, and can thus handle complex super-resolution scenarios.
The frame sequence obtained from the preprocessing and fusion is input to the multi-attention bidirectional propagation alignment module, which processes the video frames bidirectionally and recurrently along the time dimension to capture dynamic information among them.
The optical-flow-guided deformable convolution alignment sub-module used for feature alignment in the multi-attention bidirectional propagation alignment module mainly computes the motion relation between the current frame and its adjacent frames, thereby aligning inter-frame features. Meanwhile, the multi-attention design in the propagation process adaptively weights the features shared by the two propagation directions, so that the network can balance long-term and short-term relations in the feature sequence.
In the invention, the multi-attention bidirectional propagation alignment module comprises a plurality of bidirectional recurrent units, each mainly consisting of a forward propagation module and a backward propagation module. In the bidirectional recurrent process, forward propagation is responsible for generating a high-resolution output from a low-resolution input, i.e., forward propagation denotes the process of generating a high-resolution output image from a low-resolution input image. Backward propagation updates the network parameters according to the difference between the estimated output and the real output, i.e., backward propagation denotes the process of going back from the estimated high-resolution output to the low-resolution input. This bidirectional recurrent design allows the super-resolution reconstruction network to be optimized gradually during training, yielding more accurate and higher-quality high-resolution image estimates.
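A reading aid follows: a minimal PyTorch sketch of bidirectional recurrent propagation over a feature sequence (not part of the original disclosure; the cell structure, channel count and module names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class BidirectionalPropagation(nn.Module):
    """Simplified bidirectional recurrent propagation over per-frame features."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Each direction fuses the current frame features with the propagated state.
        self.backward_cell = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.forward_cell = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # feats: (N, T, C, H, W)
        n, t, c, h, w = feats.shape
        state = feats.new_zeros(n, c, h, w)
        backward_states = []
        for i in range(t - 1, -1, -1):  # backward-in-time pass
            state = torch.relu(self.backward_cell(torch.cat([feats[:, i], state], dim=1)))
            backward_states.insert(0, state)
        state = feats.new_zeros(n, c, h, w)
        outputs = []
        for i in range(t):  # forward-in-time pass
            state = torch.relu(self.forward_cell(torch.cat([feats[:, i], state], dim=1)))
            outputs.append(state + backward_states[i])  # merge the two directions
        return torch.stack(outputs, dim=1)  # (N, T, C, H, W)
```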
The second-order bidirectional propagation alignment module shares the same architecture as the multi-attention bidirectional propagation alignment module, but in each propagation module the deformable-convolution alignment does not use the optical-flow data of an optical-flow network; instead, it uses the alignment data of the corresponding unit in the multi-attention bidirectional propagation alignment module.
Reconstruction is carried out after feature alignment and fusion are complete. The pixel reorganization module magnifies the existing feature maps with a sub-pixel convolution method: the number of channels of each feature map is first expanded to a larger value, for example from C to C·r², where r is the upsampling factor. The expanded feature maps are then rearranged so that adjacent pixels are placed at different positions of the output image; this rearrangement can be accomplished by pixel replacement or reordering, dividing each channel of the feature map into adjacent pixels of the high-resolution image. A convolution is applied to the rearranged feature maps to further extract and fuse information. Finally, the rearranged feature maps are rearranged once more, and the pixels are placed into the final high-resolution image.
Referring to fig. 1, the method specifically includes the following main steps:
Step 1: acquiring a training sample set comprising a plurality of high-resolution/low-resolution video sequence pairs. In the reconstruction method every N low-resolution video frames correspond to one high-resolution video frame; the queue window can be adjusted flexibly, but the processing stride is one frame, i.e., one output frame is produced for every frame processed.
Step 2: the frame sequence to be processed is subjected to feature extraction and sent to a pre-fusion module for preprocessing and fusion operation.
The preprocessing and fusion of the invention uses a sliding window to pre-fuse the adjacent frames within the window: for the i-th frame, the j frames preceding it and the j frames following it are collected, and after denoising and preprocessing each frame is aligned and fused. For ease of computation and presentation, j = 1 is used in the embodiment of the invention; the principle is the same for other values of j.
The current frame and its preceding and following adjacent frames in the sequence are input to the pre-fusion module, which merges the adjacent local features within the sliding window using an optical-flow-guided deformable convolution alignment method, as shown in fig. 2. When fusing 3 frames, the feature handling process can be formulated as:
representing a pre-fusion operation, A FGD For the deformable convolution alignment of optical flows g i-1 Representing the previous frame characteristics of the current frame g i Representing the current frame characteristics g i+1 Representing the next frame feature of the current frame.
Representing the feature fusion result, D is represented as a deformable convolution, o i→i-1 Represents g i-1 And g i Offset between, it combines the optical flow offset calculated by optical flow network and the offset calculated by deformable convolution, m i→i-1 Representing the modulation mask. After the alignment and fusion operation of the target frame in the current window is completed, the pre-fusion module designs a residual layer according to the characteristic cleaning and denoising process of the fusion result, and post-processes and outputs the fused frame characteristics.
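For illustration only, a minimal PyTorch sketch of such a window pre-fusion is given below; the module name, channel count and the `align` placeholder (standing in for A_FGD, which is detailed later) are assumptions rather than the original implementation:

```python
import torch
import torch.nn as nn

class PreFusionModule(nn.Module):
    """Sketch of sliding-window pre-fusion: align both neighbours to the current
    frame, fuse the three feature maps, then clean them with a residual layer."""

    def __init__(self, channels: int = 64, align=None):
        super().__init__()
        # `align(neighbour, reference)` stands in for A_FGD; identity by default.
        self.align = align if align is not None else (lambda neighbour, ref: neighbour)
        self.fuse = nn.Conv2d(3 * channels, channels, 3, padding=1)
        self.post = nn.Sequential(  # residual cleaning / denoising layer
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, g_prev, g_cur, g_next):
        aligned_prev = self.align(g_prev, g_cur)   # A_FGD(g_{i-1}, g_i)
        aligned_next = self.align(g_next, g_cur)   # A_FGD(g_{i+1}, g_i)
        fused = self.fuse(torch.cat([aligned_prev, g_cur, aligned_next], dim=1))
        return fused + self.post(fused)            # residual post-processing
```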
Step 3: after the preprocessing, fusion and cleaning of the features of the input frame sequence, the features of each frame in the processing queue are input to the corresponding node of the multi-attention bidirectional propagation alignment module. The multi-attention design is applied between nodes, processing and propagating the features with attention and deepening the long-term features in the frame sequence, so that other nodes in the bidirectional propagation can more easily capture long-range feature dependencies in the frame sequence.
In the multi-attention bidirectional propagation alignment module, an optical-flow network computes flow over the frame sequence and provides a preliminary alignment, and a deformable convolution then precisely aligns and fuses the features on top of this preliminary alignment.
Specifically, the frame sequence obtained after preprocessing and fusion is sent to the multi-attention bidirectional propagation alignment module for feature alignment, and the features of each unit in the bidirectional propagation are then deepened with the multi-attention design. In this module the invention uses optical-flow-guided deformable convolution alignment to align the features of different frames. Compared with existing methods that compute the deformable-convolution offsets directly, optical-flow guidance has two benefits. First, since convolutional neural networks have local receptive fields, pre-aligning the features with optical flow aids the learning of the offsets. Second, the network only needs to learn small deviations from the optical flow, which lightens the load of the deformable alignment and makes offset overflow less likely. Furthermore, the mask in the deformable convolution does not directly warp the features; it acts as an attention that weighs the contributions of different pixels, providing more flexibility and more stable training.
Referring to fig. 3, the multi-attention bidirectional propagation alignment module of the present invention comprises a plurality of bidirectional recurrent units, each consisting of a forward propagation module and a backward propagation module.
During training, the forward propagation module receives the low-resolution image as input and produces a high-resolution estimate through the forward pass of the network; the goal of forward propagation is to minimize the difference between the generated image and the real high-resolution image.
During training, the backward propagation module receives the difference between the high-resolution estimate and the real high-resolution image as input and computes gradients through the backward pass of the network in order to update the network parameters; the goal of backward propagation is to adjust the network parameters according to this difference, improving network performance through the bidirectional cycle of forward and backward propagation.
To realize the guiding effect of the optical-flow information on the deformable convolution, the invention concatenates the warped features with the optical flow. In a bidirectional recurrent unit, the i-th frame serves as the frame to be reconstructed, and the j frames preceding and the j frames following it serve as reference frames. The forward propagation module applies optical-flow-guided deformable convolution alignment to the i-th frame and the preceding j frames to obtain a feature estimation map, and the backward propagation module applies optical-flow-guided deformable convolution alignment to the i-th frame and the following j frames to obtain a secondary feature estimation map. The feature estimation map, the secondary feature estimation map and the i-th frame are fused to obtain the aligned video frame of the i-th frame; at the same time, the multi-attention design strengthens and filters the features in the current propagation module during the bidirectional propagation.
For example, at the i-th time step, given the feature g_i computed from the i-th low-resolution image, the feature f_{i-1} computed for the previous frame, and the optical flow s_{i→i-1} from the current frame to the previous frame, f_{i-1} is warped with s_{i→i-1}. The spatial warping operation, i.e., the motion compensation of each feature by the optical-flow network, is denoted by W:
f̄_{i-1} = W(f_{i-1}, s_{i→i-1})
The pre-aligned features are then used to compute the DCN offset o_{i→i-1} and the modulation mask m_{i→i-1}. The offset of the DCN is not computed directly; instead, the residual with respect to the optical flow is computed:
o_{i→i-1} = s_{i→i-1} + C^o(c(g_i, f̄_{i-1}))
m_{i→i-1} = σ(C^m(c(g_i, f̄_{i-1})))
f_i = D(f_{i-1}; o_{i→i-1}, m_{i→i-1})
In the formulas above, c(·, ·) denotes concatenation, C^o and C^m denote stacks of convolutions, and σ denotes the Sigmoid function. Finally, the deformable convolution D is applied to the unwarped feature f_{i-1} to obtain the motion-compensated feature f_i of the i-th frame.
After the frame features are aligned, the invention uses the multi-attention mechanism to process and connect the fused features of each propagation module, widening the propagation range of certain features in the network so that the interconnected propagation modules can more easily capture long-range feature dependencies in the frame sequence. Specifically, the multi-attention mechanism of the invention comprises a channel attention mechanism and a spatial attention mechanism; the two are arranged in series, and multi-attention processing is applied to the feature fusion of the feature estimation map and to the feature fusion of the secondary feature estimation map.
First, the channel attention mechanism. The network features captured by different channels of a feature map differ, and it is precisely these differences that contribute differently to the recovery of high-frequency features in the super-resolution task. The channel attention mechanism of the invention uses global average pooling to generate attention statistics for the channel features, as shown in fig. 4. For an input X = [x_1, x_2, ..., x_C] with C feature maps of size H × W, the c-th element of the channel statistic z is determined by:
z_c = H_GP(x_c) = (1 / (H · W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)
where x_c(i, j) is the value of the c-th feature map x_c at position (i, j) and H_GP denotes the global pooling function, which collects local features through global average pooling.
At the same time, in order to fully capture the dependencies among channels, the invention applies a gate function, i.e.
s = f(W_U δ(W_D z))
where f and δ denote the Sigmoid and ReLU functions respectively, W_D is a set of convolutional-layer weights for channel downscaling, and W_U is the corresponding set of weights for channel upscaling. The final channel statistic s obtained from the gate function is then used to rescale the input x:
x̂_c = s_c · x_c
where s_c is the scale factor of the c-th channel.
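A minimal PyTorch sketch of such a channel attention branch follows; the reduction ratio and module name are illustrative assumptions:

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling produces the statistic z; the bottleneck gate
    s = sigmoid(W_U * relu(W_D * z)) rescales each channel of the input."""

    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                    # H_GP: (N, C, 1, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),     # W_D (channel downscaling)
            nn.ReLU(inplace=True),                             # delta
            nn.Conv2d(channels // reduction, channels, 1),     # W_U (channel upscaling)
            nn.Sigmoid(),                                      # f
        )

    def forward(self, x):
        s = self.gate(self.pool(x))                            # per-channel scale factors s_c
        return x * s                                           # x_hat_c = s_c * x_c
```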
In the spatial dimension, global feature information is collected and then distributed over the entire spatio-temporal extent of the input video, so that subsequent convolution layers can effectively access features from the whole space. As shown in fig. 5, this involves two stages: the first stage collects, and the second stage distributes.
For example, in the multi-attention design let X ∈ R^{C×d×h×w} denote the input tensor, where C is the number of channels, d is the time dimension, and h and w are the spatial dimensions of the input frames. For each input position i = 1, ..., dhw with local feature v_i, Z_i is defined as the output of collecting global features and distributing them to position i:
Z_i = F(G(X), v_i)
At the same time, the local feature v_i of each position is taken into account. In the first (collection) stage, bilinear pooling sums the outer products of all feature-vector pairs (a_i, b_i), giving second-order statistics of the feature maps A and B:
G_bilinear(A, B) = A B^T = Σ_{i=1}^{dhw} a_i b_i^T
where A = [a_1, ..., a_{dhw}] ∈ R^{m×dhw} and B = [b_1, ..., b_{dhw}] ∈ R^{n×dhw}. In the second (distribution) stage, the features collected in the first stage are distributed to every input position, so that subsequent convolution layers can perceive global information even with a small convolution kernel.
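The collect-and-distribute idea can be sketched as follows; this is a simplified, spatial-only interpretation, and the channel sizes m and n, the softmax placement and all module names are assumptions rather than the original design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGatherDistribute(nn.Module):
    """Stage 1 (collect): bilinear pooling G = sum_i a_i b_i^T gathers m x n
    global second-order statistics over all positions. Stage 2 (distribute):
    each position receives a mixture of the gathered statistics, so later
    convolutions can see global context even with small kernels."""

    def __init__(self, channels: int = 64, m: int = 32, n: int = 32):
        super().__init__()
        self.to_a = nn.Conv2d(channels, m, 1)
        self.to_b = nn.Conv2d(channels, n, 1)
        self.to_v = nn.Conv2d(channels, n, 1)
        self.proj = nn.Conv2d(m, channels, 1)

    def forward(self, x):                                # x: (N, C, H, W)
        batch, _, h, w = x.shape
        a = self.to_a(x).flatten(2)                      # (N, m, HW): feature vectors a_i
        b = F.softmax(self.to_b(x).flatten(2), dim=-1)   # (N, n, HW): attention over positions
        g = torch.bmm(a, b.transpose(1, 2))              # collect: G = A B^T, (N, m, n)
        v = F.softmax(self.to_v(x).flatten(2), dim=1)    # (N, n, HW): per-position selectors v_i
        z = torch.bmm(g, v).view(batch, -1, h, w)        # distribute: Z_i = G v_i, (N, m, H, W)
        return x + self.proj(z)                          # residual connection
```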
Step 4: sending the frame sequence obtained from feature deepening and fusion to the first reconstruction module for combined reconstruction.
Because the invention uses a bidirectional propagation structure together with the pre-fusion and multi-attention designs, the extraction, alignment and fusion of video frame-sequence features are all strengthened.
Step 5: sending the combined and reconstructed frame sequence to the second-order bidirectional propagation alignment module for a second feature alignment, and then to the second reconstruction module for reconstruction.
After the multi-attention bidirectional propagation alignment module, in order to raise the alignment speed of the second bidirectional recurrent alignment in the latter half of existing methods, the second-order bidirectional propagation alignment module changes the alignment method from optical-flow-based alignment to deformable-convolution alignment guided by the alignment data of the multi-attention bidirectional propagation alignment module. In other words, the second-order bidirectional propagation alignment module uses the alignment data from the multi-attention module and, on top of the existing alignment results, applies a second-order guided deformable convolution alignment to the feature maps for a second feature alignment, which greatly improves the feature-alignment speed of this part and the stability of network training.
Specifically, the mask and offset generated during feature alignment in the multi-attention bidirectional propagation alignment module are first used to pre-align the input features of the second-order bidirectional propagation alignment module, as shown in fig. 6:
f̂_{i-1} = D(f_{i-1}; ô_{i→i-1}, m̂_{i→i-1})
where f̂_{i-1} denotes the pre-aligned features, D denotes a deformable convolution, and ô_{i→i-1} and m̂_{i→i-1} denote the offset and mask produced in the previous bidirectional recurrent propagation. The pre-aligned features and the current frame features are then concatenated to generate residuals of the offset and mask:
o_{i→i-1} = ô_{i→i-1} + C^o(c(g_i, f̂_{i-1}))
m_{i→i-1} = σ(C^m(c(g_i, f̂_{i-1})))
Finally, the two pairs of offsets and masks are used to perform the final alignment, and the aligned features are combined to reconstruct the restored image:
f_i = D(f_{i-1}; o_{i→i-1}, m_{i→i-1})
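A minimal sketch of the second-order guided alignment, again using torchvision's deform_conv2d as a stand-in for D and assuming the stage-one offset (N, 2·k·k, H, W) and mask (N, k·k, H, W) tensors are passed in directly, might look as follows (names and shapes are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class SecondOrderGuidedAlign(nn.Module):
    """Pre-align with the offset/mask pair produced by the multi-attention stage,
    predict a residual offset and a refreshed mask from the pre-aligned features
    and the current frame, then perform the final deformable alignment."""

    def __init__(self, channels: int = 64, k: int = 3):
        super().__init__()
        self.k = k
        self.conv_offset = nn.Conv2d(2 * channels, 2 * k * k, 3, padding=1)
        self.conv_mask = nn.Conv2d(2 * channels, k * k, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)

    def forward(self, f_prev, g_cur, offset_prev, mask_prev):
        # Stage-one offsets/masks pre-align the propagated features (no optical flow here).
        pre = deform_conv2d(f_prev, offset_prev, self.weight,
                            padding=self.k // 2, mask=mask_prev)
        feat = torch.cat([g_cur, pre], dim=1)
        offset = offset_prev + self.conv_offset(feat)      # residual offset
        mask = torch.sigmoid(self.conv_mask(feat))         # refreshed modulation mask
        return deform_conv2d(f_prev, offset, self.weight,
                             padding=self.k // 2, mask=mask)
```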
Step 6: magnifying the feature-map result obtained in step 5 with the pixel reorganization module using sub-pixel convolution, to obtain the super-resolved high-resolution feature map.
After the feature combination is complete, the pixel reorganization module performs a restoration operation on all frame features, expanding the feature maps with a method based on feature extraction and sub-pixel convolution and converting them from the low-resolution space to the high-resolution space. The first part on the left of fig. 7 extracts the image features; the penultimate layer then produces r² channel feature maps, where r is the desired upsampling factor.
Normally a convolution reduces the height and width of the feature map, but when the stride of the convolution is 1/r, the height and width of the feature map after convolution are increased, i.e., the resolution is increased.
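A minimal PyTorch sketch of the pixel reorganization (sub-pixel convolution) step follows; the channel count, upsampling factor and final RGB projection are illustrative assumptions. nn.PixelShuffle performs exactly the rearrangement described above, moving each group of r² channels into an r × r spatial neighbourhood:

```python
import torch.nn as nn

class PixelReorganization(nn.Module):
    """A convolution expands the channels from C to C*r*r, PixelShuffle rearranges
    them into an r-times larger feature map, and a last convolution projects to RGB."""

    def __init__(self, channels: int = 64, r: int = 4):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * r * r, 3, padding=1)
        self.shuffle = nn.PixelShuffle(r)
        self.refine = nn.Conv2d(channels, 3, 3, padding=1)   # project to an RGB frame

    def forward(self, x):                                    # x: (N, C, H, W)
        return self.refine(self.shuffle(self.expand(x)))     # (N, 3, r*H, r*W)
```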
Step 7: ordering the high-resolution video frames corresponding to the frame sequence to be processed in time, applying bicubic-interpolation upsampling to the reconstructed frame sequences, and adding them pixel-wise to the high-resolution feature map output in step 6, finally obtaining the super-resolved video frame sequence.
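A minimal sketch of this final residual addition, assuming PyTorch tensors and a ×4 upsampling factor, could be:

```python
import torch
import torch.nn.functional as F

def add_bicubic_residual(sr_map: torch.Tensor, lr_frame: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Upsample the low-resolution frame with bicubic interpolation and add it
    pixel-wise to the reconstructed high-resolution map."""
    up = F.interpolate(lr_frame, scale_factor=scale, mode="bicubic", align_corners=False)
    return sr_map + up
```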
To further demonstrate the advantages of the algorithm, extensive comparison experiments were carried out. Training and evaluation were performed on four datasets, REDS4, UDM10, Vimeo-90K-T and Vid4, and the method was compared with 13 excellent algorithms at home and abroad: DRVSR, FRVSR, MTUDM, DUF, EDVR-M, PFNL, RLSP, TDAN, RRN, OVSR, BasicVSR, IconVSR and BasicVSR++. The reconstruction quality of each network is evaluated with the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM), and the effect of the method is further illustrated by the comparison table of experimental results below.
Quantitative comparison between the invention and existing video super-resolution methods (PSNR/SSIM)
Method | Params (M) | REDS4 | UDM10 | Vimeo-90K-T | Vid4
DRVSR | 1.72 | - | 36.64/0.9472 | - | 25.52/0.7600
FRVSR | 5.1 | - | 37.09/0.9522 | 35.64/0.9319 | 26.69/0.8103
MTUDM | 5.92 | - | 38.02/0.9589 | - | 26.57/0.7989
DUF | 5.8 | 28.63/0.8251 | 38.05/0.9586 | - | 27.34/0.8327
EDVR-M | 3.3 | 30.53/0.8699 | 39.40/0.9663 | 37.33/0.9484 | 27.45/0.8406
PFNL | 3.0 | 29.63/0.8502 | 38.74/0.9627 | - | 27.40/0.8384
RLSP | 4.2 | - | 38.48/0.9606 | 36.49/0.9403 | 27.48/0.8388
TDAN | 2.29 | - | 38.19/0.9586 | - | 26.86/0.8140
RRN | 3.4 | - | 38.96/0.9644 | - | 27.69/0.8488
OVSR | 1.89 | - | 39.37/0.9673 | - | 27.99/0.8599
BasicVSR | 6.3 | 31.42/0.8909 | 39.96/0.9694 | 37.18/0.9450 | 27.24/0.8251
IconVSR | 8.7 | 31.67/0.8948 | 40.03/0.9694 | 37.47/0.9476 | 27.39/0.8279
BasicVSR++ | 7.3 | 32.39/0.9069 | 40.72/0.9722 | 37.79/0.9500 | 27.79/0.8400
OURS | 7.12 | 32.43/0.9073 | 40.86/0.9759 | 40.06/0.9699 | 27.93/0.8504
Metrics: peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). Test sets: REDS4, UDM10, Vimeo-90K-T, Vid4.
Compared with existing methods, the proposed method shows a clear advantage in these indices; compared with BasicVSR++, the strongest existing method, it achieves an improvement in the video-restoration indices while using fewer parameters than BasicVSR++.
Compared with the prior art, the invention proposes a feature pre-fusion and multi-attention design to improve the restoration performance on multi-frame video sequences and the sensitivity to long-term frame-sequence features. The feature pre-fusion mechanism pre-fuses the adjacent frames within a window, following a sliding-window-like idea, and enhances the short-term features of each frame within the window. Meanwhile, to enhance the restoration of long-term features in the video frame sequence and balance the influence of short-term features, the invention also adds a multi-attention mechanism to the fusion process inside the bidirectional propagation module and proposes the first-order, multi-attention bidirectional propagation alignment module, so that the network can adaptively pick out important features in the sequence and capture long-term relations among them.
Meanwhile, to make the bidirectional recurrent propagation module easier to train and the alignment faster, the feature alignment in the second-order bidirectional recurrent alignment module is changed to a second-order deformable convolution alignment: during the bidirectional recurrent alignment of this module, the deformable convolution is guided by the alignment data from the multi-attention bidirectional recurrent alignment module. This improves training stability and accelerates network inference, gives the method the best video-restoration effect, and solves the artifact and blurring problems of the original video frames in complex super-resolution scenarios, achieving a better video-restoration result.
The present invention is disclosed in more detail above, but the present invention is not limited thereto, and all technical solutions obtained by adopting equivalent substitution or equivalent transformation fall within the protection scope of the present invention.

Claims (10)

1. A video super-resolution reconstruction method applied to complex scenes, characterized by comprising the following steps:
step 1, performing feature extraction on a frame sequence to be processed and sending it to a pre-fusion module for preprocessing and fusion, wherein the frame sequence to be processed consists of low-resolution video frames;
step 2, sending the frame sequence obtained after the preprocessing and fusion to a multi-attention bidirectional propagation alignment module for feature alignment, and then deepening the features of each unit in the bidirectional propagation with the multi-attention design; the multi-attention bidirectional propagation alignment module comprises a plurality of bidirectional recurrent units, each consisting of a forward propagation module and a backward propagation module, wherein forward propagation denotes the process of generating a high-resolution output image from a low-resolution input image and backward propagation denotes the process of going back from the estimated high-resolution output image to the low-resolution input image;
step 3, sending the deepened and fused frame sequence to a first reconstruction module for combined reconstruction;
step 4, sending the combined and reconstructed frame sequence to a second-order bidirectional propagation alignment module for a second feature alignment, and then to a second reconstruction module for reconstruction;
step 5, magnifying the feature-map result obtained in step 4 with a pixel reorganization module to obtain a super-resolved high-resolution feature map;
step 6, ordering the high-resolution video frames corresponding to the frame sequence to be processed in time, applying bicubic-interpolation upsampling to the frame sequences reconstructed in step 3 and step 4, and adding them pixel-wise to the high-resolution feature map output in step 5, finally obtaining the super-resolved video frame sequence.
2. The video super-resolution reconstruction method applied to complex scenes as recited in claim 1, wherein the preprocessing and fusion uses a sliding window to pre-fuse the adjacent frames within the window: for the i-th frame, the j frames preceding it and the j frames following it are collected, and after denoising and preprocessing each frame is aligned and fused.
3. The video super-resolution reconstruction method applied to complex scenes as recited in claim 2, wherein an optical-flow-guided deformable convolution alignment method is adopted to align and fuse the frame features.
4. The video super-resolution reconstruction method applied to complex scenes as recited in claim 1, wherein the multi-attention bidirectional propagation alignment module performs feature alignment on different frames with an optical-flow-guided deformable convolution alignment method, and the second-order bidirectional propagation alignment module performs the second feature alignment on the feature maps with a second-order guided deformable convolution alignment method.
5. The video super-resolution reconstruction method applied to complex scenes as recited in claim 4, wherein, in a bidirectional recurrent unit, the multi-attention bidirectional propagation alignment module takes the i-th frame as the frame to be reconstructed and the j frames adjacent to it as reference frames; the forward propagation module applies optical-flow-guided deformable convolution alignment to the i-th frame and the preceding j frames to obtain a feature estimation map, and the backward propagation module applies optical-flow-guided deformable convolution alignment to the i-th frame and the following j frames to obtain a secondary feature estimation map; the feature estimation map, the secondary feature estimation map and the i-th frame are fused to obtain the aligned video frame of the i-th frame, while the multi-attention design strengthens and filters the features in the current propagation module during the bidirectional propagation.
6. The video super-resolution reconstruction method applied to complex scenes as recited in claim 5, wherein the multi-attention mechanism comprises a channel attention mechanism and a spatial attention mechanism, both of which are deployed evenly in the forward propagation module and the backward propagation module; the channel attention mechanism and the spatial attention mechanism are arranged in series, and multi-attention processing is applied to the feature fusion of the feature estimation map and to the feature fusion of the secondary feature estimation map.
7. The video super-resolution reconstruction method applied to complex scenes as recited in claim 5, wherein the alignment algorithm in the multi-attention bidirectional propagation alignment module is as follows:
at the i-th time step, given the feature g_i computed from the i-th low-resolution image, the feature f_{i-1} computed for the previous frame, and the optical flow s_{i→i-1} from the current frame to the previous frame, f_{i-1} is warped with s_{i→i-1}; the spatial warping operation, i.e., the motion compensation of each feature by the optical-flow network, is denoted by W:
f̄_{i-1} = W(f_{i-1}, s_{i→i-1})
the pre-aligned features are then used to compute the DCN offset o_{i→i-1} and the modulation mask m_{i→i-1} as residuals with respect to the optical flow:
o_{i→i-1} = s_{i→i-1} + C^o(c(g_i, f̄_{i-1}))
m_{i→i-1} = σ(C^m(c(g_i, f̄_{i-1})))
f_i = D(f_{i-1}; o_{i→i-1}, m_{i→i-1})
where c(·, ·) denotes concatenation, C^o and C^m denote stacks of convolutions, and σ denotes the Sigmoid function;
finally, the deformable convolution D is applied to the unwarped feature f_{i-1} to obtain the motion-compensated feature f_i of the i-th frame.
8. The video super-resolution reconstruction method applied to complex scenes as recited in claim 1, wherein the alignment algorithm in the second-order bidirectional propagation alignment module is as follows:
first, the mask and offset generated by the feature alignment of the multi-attention bidirectional propagation alignment module are used to pre-align the input features of the second-order bidirectional propagation alignment module, obtaining the pre-aligned features:
f̂_{i-1} = D(f_{i-1}; ô_{i→i-1}, m̂_{i→i-1})
where f̂_{i-1} denotes the pre-aligned features, D denotes a deformable convolution, and ô_{i→i-1} and m̂_{i→i-1} denote the offset and mask from the previous bidirectional propagation;
then, residuals of the offset and mask are generated from the pre-aligned features and the current frame features:
o_{i→i-1} = ô_{i→i-1} + C^o(c(g_i, f̂_{i-1}))
m_{i→i-1} = σ(C^m(c(g_i, f̂_{i-1})))
finally, the two pairs of offsets and masks are used to perform the final alignment, and the aligned features are combined to reconstruct the restored image:
f_i = D(f_{i-1}; o_{i→i-1}, m_{i→i-1})
9. The method according to claim 1, wherein in step 5, the pixel reorganizing module performs a restoration operation for all frame features, expands the feature map by using a method based on feature extraction and subpixel convolution, and converts the feature map from a low resolution space to a high resolution space.
10. The video super-resolution reconstruction method applied to complex scenes as recited in claim 9, wherein the stride of the sub-pixel convolution is 1/r, where r is the upsampling factor.
CN202310888738.5A 2023-07-19 2023-07-19 Video super-resolution reconstruction method applied to complex scene Pending CN116862773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310888738.5A CN116862773A (en) 2023-07-19 2023-07-19 Video super-resolution reconstruction method applied to complex scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310888738.5A CN116862773A (en) 2023-07-19 2023-07-19 Video super-resolution reconstruction method applied to complex scene

Publications (1)

Publication Number Publication Date
CN116862773A true CN116862773A (en) 2023-10-10

Family

ID=88232024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310888738.5A Pending CN116862773A (en) 2023-07-19 2023-07-19 Video super-resolution reconstruction method applied to complex scene

Country Status (1)

Country Link
CN (1) CN116862773A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117058001A (en) * 2023-10-12 2023-11-14 深圳云天畅想信息科技有限公司 Super-resolution video fusion reconstruction method and device and computer equipment
CN117058001B (en) * 2023-10-12 2023-12-12 深圳云天畅想信息科技有限公司 Super-resolution video fusion reconstruction method and device and computer equipment
CN117455798A (en) * 2023-11-17 2024-01-26 北京同力数矿科技有限公司 Lightweight video denoising method and system
CN118154430A (en) * 2024-05-10 2024-06-07 清华大学 Space-time-angle fusion dynamic light field intelligent imaging method

Similar Documents

Publication Publication Date Title
Liang et al. Vrt: A video restoration transformer
Zhang et al. Accurate image restoration with attention retractable transformer
CN116862773A (en) Video super-resolution reconstruction method applied to complex scene
CN111260560B (en) Multi-frame video super-resolution method fused with attention mechanism
CN109118431A (en) A kind of video super-resolution method for reconstructing based on more memories and losses by mixture
CN108830790B (en) Rapid video super-resolution reconstruction method based on simplified convolutional neural network
US11995796B2 (en) Method of reconstruction of super-resolution of video frame
CN115082308A (en) Video super-resolution reconstruction method and system based on multi-scale local self-attention
CN112102163A (en) Continuous multi-frame image super-resolution reconstruction method based on multi-scale motion compensation framework and recursive learning
CN114494050A (en) Self-supervision video deblurring and image frame inserting method based on event camera
CN104063849A (en) Video super-resolution reconstruction method based on image block self-adaptive registration
CN115578255A (en) Super-resolution reconstruction method based on inter-frame sub-pixel block matching
CN114757828A (en) Transformer-based video space-time super-resolution method
Zhong et al. Bringing rolling shutter images alive with dual reversed distortion
Sun et al. Deep maximum a posterior estimator for video denoising
CN116542889A (en) Panoramic video enhancement method with stable view point
CN115526779A (en) Infrared image super-resolution reconstruction method based on dynamic attention mechanism
CN113610707A (en) Video super-resolution method based on time attention and cyclic feedback network
Mehta et al. Gated multi-resolution transfer network for burst restoration and enhancement
Li et al. Spatio-temporal fusion network for video super-resolution
Bai et al. Joint learning of RGBW color filter arrays and demosaicking
CN116208812A (en) Video frame inserting method and system based on stereo event and intensity camera
CN112016456B (en) Video super-resolution method and system based on adaptive back projection depth learning
CN112348745B (en) Video super-resolution reconstruction method based on residual convolutional network
CN105005977B (en) A kind of single video frame per second restored method based on pixel stream and time prior imformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination