CN116266336A - Video super-resolution reconstruction method, device, computing equipment and storage medium - Google Patents

Video super-resolution reconstruction method, device, computing equipment and storage medium

Info

Publication number
CN116266336A
Authority
CN
China
Prior art keywords
optical flow
frame
processed
layer
adjacent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210867466.6A
Other languages
Chinese (zh)
Inventor
张超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202210867466.6A priority Critical patent/CN116266336A/en
Publication of CN116266336A publication Critical patent/CN116266336A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4053 Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video super-resolution reconstruction method, a device, a computing device and a storage medium. The method comprises the following steps: inputting a frame to be processed and its adjacent frames into a preset optical flow estimation network model to obtain the inter-frame optical flow between the frame to be processed and each adjacent frame; establishing a relation between the inter-frame optical flow and the frame to be processed and its adjacent frames through optical flow conversion, and generating an optical flow set based on the relation; performing motion compensation on the adjacent frames by using the optical flow set, and combining the motion-compensated adjacent frames with the frame to be processed to obtain a frame set to be processed; and inputting the frame set to be processed into a pre-trained super-resolution reconstruction model to obtain the super-resolution frame corresponding to the frame to be processed. Through this scheme, end-to-end training and reconstruction are realized, the feature fitting capability of the super-resolution reconstruction model is enhanced, the accuracy and time consistency of the results are improved, and the reconstructed video frames are more lifelike, contain richer details, and provide better visual perception.

Description

Video super-resolution reconstruction method, device, computing equipment and storage medium
Technical Field
The invention relates to the technical field of computer vision, in particular to a video super-resolution reconstruction method, a device, computing equipment and a storage medium.
Background
Super-Resolution (SR) reconstruction is an important branch of computer vision. The purpose of SR reconstruction is to reconstruct, through an algorithmic model, a corresponding High Resolution (HR) image/video from an input Low Resolution (LR) image/video, i.e., to give the output image/video a higher pixel density and an improved ability to characterize details. The technology can be applied to fields such as mobile phone photographing enhancement, face image processing, public safety, remote sensing image enhancement, and medical image processing.
The input of a VSR (Video Super-Resolution) model is a plurality of temporally consecutive frames, which adds highly similar temporal information: the motion differences between adjacent LR video frames can be regarded as LR results generated from the same scene under different degradation modes. Because a deep-learning-based VSR model needs to consider how to effectively utilize the temporal information between video frames, the VSR problem is inherently more complex and more difficult, related research is relatively scarce, and existing models cannot reach the accuracy required by production and daily life. For example, Kappeler et al. proposed the VSRnet model, which generalizes CNN networks to the VSR problem; however, it is not an end-to-end model, motion estimation and compensation are performed by conventional methods rather than within the CNN model, and the reconstructed details are somewhat blurred at large scale factors.
Disclosure of Invention
In view of the above problems, the present invention is directed to providing a method, an apparatus, a computing device, and a storage medium for reconstructing a super-resolution video, which can at least overcome the problems of insufficient fitting capability of model features, poor accuracy and time consistency of reconstructed video frames, and blurred details.
According to an aspect of the present invention, there is provided a video super-resolution reconstruction method, the method comprising:
inputting the frame to be processed and the adjacent frame into a preset optical flow estimation network model to obtain an inter-frame optical flow between the frame to be processed and the adjacent frame;
establishing a relation between the inter-frame optical flow and the frame to be processed and the adjacent frames thereof through optical flow conversion, and generating an optical flow set based on the relation;
performing motion compensation on adjacent frames by using the optical flow set, and combining the adjacent frames subjected to the motion compensation with the frames to be processed to obtain a frame set to be processed;
and inputting the frame set to be processed into a pre-trained superdivision reconstruction model to obtain the superdivision frame corresponding to the frame to be processed.
Optionally, the optical flow estimation network model includes at least two pyramid layers, each pyramid layer outputs optical flow at a different resolution, and a pyramid layer closer to the output end estimates optical flow at a higher resolution than a pyramid layer closer to the input end.
Optionally, inputting the frame to be processed and the adjacent frame into a preset optical flow estimation network model, and obtaining the inter-frame optical flow between the frame to be processed and the adjacent frame includes:
taking image vectors and initialized optical flows of a frame to be processed and an adjacent frame as inputs, and obtaining a first residual optical flow after processing by a first pyramid layer in the optical flow estimation network model, wherein the first residual optical flow is fused with the initialized optical flow to obtain a first optical flow;
taking image vectors of a frame to be processed and an adjacent frame and a first optical flow as inputs, and obtaining a second residual optical flow after processing by a second pyramid layer in the optical flow estimation network model, wherein the second residual optical flow is fused with the first optical flow to obtain a second optical flow;
and taking the image vectors of the frame to be processed and the adjacent frame and the second optical flow as inputs, processing the image vectors by a third pyramid layer in the optical flow estimation network model to obtain a third residual optical flow, and fusing the third residual optical flow and the second optical flow to obtain the inter-frame optical flow.
Optionally, establishing a relationship between the inter-frame optical flow and the frame to be processed and its neighboring frames through optical flow conversion, generating an optical flow set based on the relationship includes:
and extracting a plurality of derived optical flows from at least two inter-frame optical flows by using a frame to be processed or an adjacent frame as a template through conversion from space to depth, and arranging the derived optical flows on a channel dimension according to the extraction sequence to form an optical flow set with the same resolution size as the frame to be processed.
Optionally, the super-resolution reconstruction model includes a first CNN network and a second CNN network that are set in parallel, where the number of layers of the first CNN network is greater than that of the second CNN network, and the first CNN network and/or the second CNN network includes at least one attention module.
Optionally, the first CNN network includes at least one layer or combination of: a combination of convolution layer and ReLU function, a combination of convolution layer and attention module, a multichannel attention module, an attention module, a connection layer, a convolution layer, or an upsampling layer;
the second CNN network comprises at least one layer or combination of: a combination of convolution layer and ReLU function, an attention module, upsampling, or convolution layer.
Optionally, the attention module includes a channel attention sub-module and a spatial attention sub-module which are arranged in parallel, a connection layer and a convolution layer;
the channel attention submodule comprises a global average pooling network, a global maximum pooling network and a Sigmoid function which are arranged in parallel, wherein the global average pooling network and the global maximum pooling network realize parameter sharing;
the spatial attention submodule comprises at least one convolution layer, a ReLU function and a Sigmoid function.
According to another aspect of the present invention, there is provided a video super-resolution reconstruction apparatus, the apparatus comprising:
the optical flow estimation module is suitable for inputting the frame to be processed and the adjacent frame into a preset optical flow estimation network model to obtain an inter-frame optical flow between the frame to be processed and the adjacent frame;
the optical flow set generation module is suitable for establishing a relation between the inter-frame optical flow and the frame to be processed and the adjacent frames thereof through optical flow conversion, and generating an optical flow set based on the relation;
the motion compensation module is suitable for performing motion compensation on adjacent frames by utilizing the optical flow set, and combining the adjacent frames subjected to the motion compensation with the frames to be processed to obtain a frame set to be processed;
and the super-division reconstruction module is suitable for inputting the frame set to be processed into a pre-trained super-division reconstruction model to obtain super-division frames corresponding to the frames to be processed.
According to still another aspect of the present invention, there is provided an electronic apparatus including: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the video super-resolution reconstruction method.
According to still another aspect of the present invention, there is provided a computer storage medium having at least one executable instruction stored therein, the executable instruction causing a processor to perform operations corresponding to the above-described video super-resolution reconstruction method.
According to the video super-resolution reconstruction method, firstly, a low-resolution frame to be processed and adjacent frames thereof are input into an optical flow estimation model to obtain an inter-frame optical flow, then a frame set to be processed is obtained based on the optical flow, and then the frame set to be processed is input into a pre-trained super-division reconstruction model to obtain a super-division frame corresponding to the frame to be processed. The scheme realizes an end-to-end video super-resolution model, wherein a attention mechanism and high-resolution optical flow estimation are fused, and the whole training is optimized, so that the characteristic fitting capacity of the super-resolution reconstruction model is enhanced, the accuracy and time consistency of results are improved, the reconstructed video frame is more lifelike, the details are more abundant, and the visual perception of people is better.
The foregoing is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the content of the description, and in order to make the above and other objects, features and advantages of the present invention more apparent, specific embodiments of the invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
fig. 1 is a schematic flow chart of a video super-resolution reconstruction method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a video super-resolution reconstruction model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an optical flow estimation network model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a residual dense block according to an embodiment of the present invention;
FIG. 5 illustrates a flow diagram of space-to-depth conversion provided by one embodiment of the present invention;
FIG. 6 shows a schematic structural diagram of a super-division reconstruction model (ADS-VSR model) according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a multi-channel attention module according to one embodiment of the present invention;
FIG. 8 is a schematic diagram showing the structure of an Attention (Attention) module according to one embodiment of the present invention;
FIG. 9 is a schematic diagram of a channel attention sub-module according to one embodiment of the present invention;
FIG. 10 is a schematic diagram of a spatial attention sub-module according to one embodiment of the present invention;
fig. 11 is a schematic structural diagram of a video super-resolution reconstruction device according to an embodiment of the present invention;
FIG. 12 illustrates a schematic diagram of a computing device provided by one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flowchart of an embodiment of a video super-resolution reconstruction method of the present invention, which is applied to a computing device. The computing device carries pre-trained models, performs reconstruction on each frame in the video, and reconstructs low-resolution frames into super-resolution frames. As shown in fig. 1, the method comprises the following steps:
Step 110: inputting the frame to be processed and the adjacent frame into a preset optical flow estimation network model to obtain the inter-frame optical flow between the frame to be processed and the adjacent frame.
Wherein the frame to be processed and the adjacent frames are low-resolution frames; the frame to be processed may also be referred to as the (low-resolution) center frame, and an adjacent frame is one of the frames appearing before or after the frame to be processed, together forming the LR video frame sequence. Optical flow (optical flow or optic flow) is a concept in object motion detection in the visual field, describing the motion of an observed object, surface or edge caused by motion relative to the observer. The inter-frame optical flow is preferably a high-resolution optical flow, i.e. HR optical flow. The optical flow estimation network model is used for estimating the optical flow between the frame to be processed and each adjacent frame.
Step 120: and establishing a relation between the inter-frame optical flow and the frame to be processed and the adjacent frames thereof through optical flow conversion, and generating an optical flow set based on the relation.
This step is used to obtain more optical flow from the inter-frame optical flow, based on the format of the low-resolution frames and the like, so as to obtain more optical flow information and thereby more, and more accurate, motion relations.
Step 130: and performing motion compensation on the adjacent frames by using the optical flow set, and combining the adjacent frames subjected to motion compensation with the frames to be processed to obtain a frame set to be processed.
In this step, more frames close to the frame to be processed are obtained through the optical flow set, namely the frame set to be processed, thereby introducing more effective information for the subsequent super-resolution reconstruction model.
Step 140: and inputting the frame set to be processed into a pre-trained superdivision reconstruction model to obtain the superdivision frame corresponding to the frame to be processed.
Therefore, this embodiment realizes the estimation of the optical flow, the acquisition of the optical flow set and the low-resolution frame set, the acquisition of the super-resolution frame, and so on, by establishing neural network models, and solves the technical problem in the prior art that end-to-end training and resolution improvement cannot be realized entirely by a deep learning network. In addition, by introducing the neural network models, more effective information is provided for the super-resolution reconstruction model, the feature fitting capability of the super-resolution reconstruction model is enhanced, the accuracy and time consistency of the results are improved, and the reconstructed video frames are more lifelike, contain richer details, and provide better visual perception.
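For readability, the four steps above can be summarized in the following minimal PyTorch-style sketch. The function and argument names (super_resolve_frame, ofe_net, vsr_net) and the helper compensate_neighbor are illustrative assumptions rather than names used by the invention; compensate_neighbor and the warp helper it relies on are sketched later in this description.

```python
import torch

def super_resolve_frame(center, neighbors, ofe_net, vsr_net, scale):
    """Hypothetical end-to-end pipeline for one frame (steps 110-140).

    center:    the LR frame to be processed, shape (1, C, H, W)
    neighbors: list of adjacent LR frames, each of shape (1, C, H, W)
    ofe_net:   optical flow estimation network model            (step 110)
    vsr_net:   pre-trained super-resolution reconstruction model (step 140)
    """
    drafts = [center]
    for x_j in neighbors:
        # Step 110: HR inter-frame optical flow between the center frame and x_j.
        hr_flow = ofe_net(center, x_j)                   # (1, 2, s*H, s*W)
        # Steps 120-130: space-to-depth conversion of the flow, then motion
        # compensation of the adjacent frame (sketched further below).
        drafts.append(compensate_neighbor(x_j, hr_flow, scale))
    # Step 140: feed the combined frame set into the reconstruction model.
    frame_set = torch.cat(drafts, dim=1)
    return vsr_net(frame_set)                            # (1, C, s*H, s*W)
```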
The following gives specific examples of the model or algorithm in the above embodiments in connection with the end-to-end video super-resolution reconstruction model frame diagram shown in fig. 2, which is of course only one of the embodiments, and the scope of the invention is not limited to these embodiments.
The OFE model in fig. 2 is an optical flow estimation network model based on a pyramid structure. For a group of input LR video frames it estimates, from coarse to fine, the potential HR optical flow between the corresponding HR video frames; compared with LR optical flow, this provides a more accurate correspondence and introduces more effective information, helping the VSR to improve the accuracy and time consistency of the results. The ADS-VSR model is a VSR model that combines deep and shallow CNN networks and introduces an attention mechanism, making reasonable use of the similarity of adjacent frames in the time dimension to improve the reconstruction quality of the video. The model consists of two branches: a shallow second CNN network for recovering the basic information of the HR video frame, and a relatively deep first CNN network for recovering high-frequency detail information. An attention mechanism is introduced into each CNN network and an Attention module is constructed; this module adaptively adjusts the input feature maps by using the spatial and channel correlations between feature maps, so that different types of features receive different weights, improving the fitting capability of the network.
Specifically, in one embodiment, the optical flow estimation network model includes at least two pyramid layers, where each pyramid layer outputs optical flows with different resolutions, and the resolution of the optical flows estimated by the pyramid layer next to the output end is higher than that of the pyramid layer next to the input end.
Since video carries time-dimension information beyond that of single images, the temporal correlation can be represented by optical-flow-based motion estimation. The accuracy of the optical flow estimation is closely related to the quality of the VSR results, and accurate optical flow estimation is beneficial to improving VSR performance. This embodiment therefore proposes an optical flow estimation network with at least a two-layer pyramid structure, preferably a three-layer pyramid structure: the OFE model. The model outputs motion vectors with different resolutions at different pyramid layers and estimates large motion at higher resolution in the higher pyramid layers, i.e., for the input LR video frames it estimates the potential HR optical flow between the corresponding HR video frames in a coarse-to-fine manner. Compared with LR optical flow, HR optical flow can provide a more accurate correspondence and introduce more effective information, helping the VSR to improve the accuracy of the results.
In one embodiment, based on the three-layer pyramid optical flow estimation network model, step 110 inputs the to-be-processed frame and the adjacent frame into a preset optical flow estimation network model, and obtaining the optical flow between the to-be-processed frame and the adjacent frame specifically includes: taking image vectors and initialized optical flows of a frame to be processed and an adjacent frame as inputs, and obtaining a first residual optical flow after processing by a first pyramid layer in the optical flow estimation network model, wherein the first residual optical flow is fused with the initialized optical flow to obtain a first optical flow; taking image vectors of a frame to be processed and an adjacent frame and a first optical flow as inputs, and obtaining a second residual optical flow after processing by a second pyramid layer in the optical flow estimation network model, wherein the second residual optical flow is fused with the first optical flow to obtain a second optical flow; and taking the image vectors of the frame to be processed and the adjacent frame and the second optical flow as inputs, processing the image vectors by a third pyramid layer in the optical flow estimation network model to obtain a third residual optical flow, and fusing the third residual optical flow and the second optical flow to obtain the inter-frame optical flow.
With reference to fig. 3, the preprocessed image vectors of two adjacent frames and the initialized optical flow are taken as inputs; after connection, a first residual optical flow is obtained through at least two convolution layers, at least one residual dense block and feature multiplexing, and the first residual optical flow is fused with the initialized optical flow to obtain the first optical flow.
Then, the image vectors of the two adjacent frames and the preprocessed first optical flow are taken as input; after combination and connection, a second residual optical flow is obtained through at least two convolution layers, at least one residual dense block and feature multiplexing, and the second residual optical flow is fused with the preprocessed first optical flow to obtain the second optical flow.
Finally, the image vectors of the two adjacent frames and the second optical flow are taken as inputs, combined and connected; a third residual optical flow is then obtained through at least two convolution layers, at least two residual dense blocks and feature multiplexing, and the third residual optical flow is fused with the preprocessed second optical flow to obtain the inter-frame optical flow.
It is noted that the preprocessing includes downsampling or upsampling, and that feature multiplexing refers to connecting, in the channel direction, the outputs of all or some of the previous layers as the input of the current layer.
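As an illustration only, the three-layer refinement just described could be organized roughly as in the following PyTorch-style sketch. The sub-network names level1/level2/level3, the channel assumptions and the interpolation settings are assumptions for illustration, not the exact network of fig. 3; warp denotes backward warping of a frame by an optical flow (a sketch of it is given after the second-layer description below).

```python
import torch
import torch.nn.functional as F

def ofe_forward(x_i, x_j, level1, level2, level3, scale):
    """Structural sketch of coarse-to-fine optical flow estimation.

    level1/level2/level3: per-pyramid-layer sub-networks, each returning a
    residual optical flow at its own working resolution.
    """
    # Layer 1: work at half the LR resolution with a zero initial flow.
    x_i_half = F.avg_pool2d(x_i, kernel_size=2)
    x_j_half = F.avg_pool2d(x_j, kernel_size=2)
    n, _, h, w = x_i_half.shape
    flow0 = x_i_half.new_zeros(n, 2, h, w)
    flow1 = flow0 + level1(torch.cat((x_i_half, x_j_half, flow0), dim=1))

    # Layer 2: upsample the flow to LR resolution, warp the neighbor, refine.
    flow1_up = F.interpolate(flow1, scale_factor=2, mode="bilinear", align_corners=False)
    x_j_warped = warp(x_j, flow1_up)
    flow2 = flow1_up + level2(torch.cat((flow1_up, x_i, x_j_warped), dim=1))

    # Layer 3: refine once more and predict the HR residual flow (s times LR size).
    x_j_warped = warp(x_j, flow2)
    hr_residual = level3(torch.cat((flow2, x_i, x_j_warped), dim=1))
    flow2_up = F.interpolate(flow2, scale_factor=scale, mode="bilinear", align_corners=False)
    return flow2_up + hr_residual
```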
Specifically, two LR video frames $X_i$ and $X_j$ are taken as input, and the potential optical flow between the corresponding HR video frames $Y_i$ and $Y_j$ is estimated, which can be formulated as:

$$F^{HR}_{j \to i} = \mathrm{Net}_{OFE}(X_i, X_j; \theta_{OFE})$$

where $F^{HR}_{j \to i}$ denotes the HR optical flow from the j-th frame to the i-th frame, and $\theta_{OFE}$ denotes the parameters of the optical flow estimation network model. Referring to the structure of the optical flow estimation network model shown in fig. 3, each layer of the pyramid-based optical flow estimation network model is described below.
First layer: the LR video frames $X_i$ and $X_j$ are first downsampled by a factor of 2 using average pooling to generate $X_i^{\downarrow}$ and $X_j^{\downarrow}$. All elements of the initial optical flow map $F^0$ (the subscript $j \to i$ is omitted below for brevity) are set to 0; $F^0$ has two channels, representing the motion vectors along the horizontal and vertical axes of the image, respectively. $X_i^{\downarrow}$, $X_j^{\downarrow}$ and $F^0$ are connected through a Concat layer and fed into a convolution layer of size 4×3×3×32 to extract low-level features. Higher-level features are then extracted by 2 residual dense blocks (RDB, Residual Dense Block) with a growth rate of 32. The structure of the RDB module is shown in fig. 4: it combines residual learning and dense connection, the input of each layer being the concatenation, in the channel direction, of the outputs of all previous layers, which achieves feature multiplexing. The RDB module has 4 layers in total; the first 3 layers each consist of a 3×3 convolution and a ReLU activation function, and the last layer is a feature fusion layer with a 1×1 convolution kernel, outputting 32 feature maps. The outputs of the two RDB modules are connected through a Concat layer and sent to a convolution layer of size 64×1×1×32 for feature fusion. Finally, the residual optical flow $\Delta F^1$ is computed by a convolution layer of size 32×3×3×2. The optical flow estimate of the first layer of the OFE network, $F^1$, is obtained by adding the residual optical flow $\Delta F^1$ and the initial optical flow $F^0$:

$$F^1 = \Delta F^1 + F^0$$
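A residual dense block of the kind just described (three 3×3 convolutions with ReLU, dense concatenation of earlier outputs, a 1×1 fusion layer and a residual connection) could be sketched in PyTorch as follows; the class name and the default channel/growth values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Sketch of a 4-layer RDB with growth rate g (assumed g = 32)."""

    def __init__(self, channels=32, growth=32):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(channels + i * growth, growth, kernel_size=3, padding=1)
            for i in range(3)                      # first 3 layers: 3x3 conv + ReLU
        ])
        self.fuse = nn.Conv2d(channels + 3 * growth, channels, kernel_size=1)  # 1x1 fusion
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            # dense connection: each layer sees all previous outputs
            feats.append(self.relu(conv(torch.cat(feats, dim=1))))
        out = self.fuse(torch.cat(feats, dim=1))
        return out + x                             # residual learning
```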
Second layer: the optical flow $F^1$ output by the first layer has a resolution that is 1/2 of the resolution of the LR video frame $X_j$, so the optical flow output by the first layer is enlarged by a factor of 2 using bilinear interpolation to obtain $\hat{F}^1$. The upsampled optical flow $\hat{F}^1$ is used to perform a deformation operation, i.e. Warp, on the video frame $X_j$, obtaining $\tilde{X}_j^1$. Then $\hat{F}^1$, $X_i$ and $\tilde{X}_j^1$ are connected through a Concat layer and fed into the network; the network structure of the second layer is the same as that of the first layer, and it outputs the residual optical flow $\Delta F^2$. The residual optical flow $\Delta F^2$ is added to $\hat{F}^1$ to obtain the optical flow estimate of the second layer $F^2$, namely:

$$F^2 = \Delta F^2 + \hat{F}^1$$
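The deformation (Warp) operation used above is plain backward warping of a frame by an optical flow field. The following is a minimal grid_sample-based sketch; the tensor layout and the convention that channel 0 holds horizontal and channel 1 vertical displacement are assumptions.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (N, C, H, W) by `flow` (N, 2, H, W)."""
    n, _, h, w = frame.shape
    # base sampling grid in pixel coordinates
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                               # displaced positions
    # normalize to [-1, 1] for grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=3)            # (N, H, W, 2)
    return F.grid_sample(frame, grid_norm, mode="bilinear",
                         padding_mode="border", align_corners=True)
```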
Third layer: the optical flow $F^2$ output by the second layer has the same resolution as the LR video frame $X_j$. Thus, the third layer of the optical flow estimation network model acts as an SR step, predicting the potential HR optical flow between the corresponding HR video frames. Similar to the second layer, $F^2$ is used to perform the deformation operation on the video frame $X_j$ to obtain $\tilde{X}_j^2$, and then $F^2$, $X_i$ and $\tilde{X}_j^2$ are connected through a Concat layer and fed into the network. The third layer differs in that it has 4 RDB modules to extract higher-level features. At the same time, a trainable sub-pixel convolution layer is added to increase the resolution, i.e., the resolution of the optical flow map is enlarged to $s$ times the LR input, where $s$ is the scale factor of the VSR. Finally, this layer outputs the HR residual optical flow $\Delta F^3$, which is added to the value of $F^2$ after bilinear upsampling by a factor of $s$ to obtain the HR optical flow estimate of the third layer $F^{HR}_{j \to i}$, namely:

$$F^{HR}_{j \to i} = \Delta F^3 + U(F^2)$$

where $U(\cdot)$ denotes a bilinear upsampling operation.
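The trainable sub-pixel convolution layer mentioned above is commonly realized as a convolution that expands the channels by a factor of $s^2$, followed by a pixel-shuffle rearrangement into space; the sketch below makes that assumption, and the kernel size is illustrative.

```python
import torch.nn as nn

def subpixel_upsample(in_channels, out_channels, scale):
    """Sketch of a trainable sub-pixel convolution layer."""
    return nn.Sequential(
        # expand channels by scale**2 so PixelShuffle can fold them into space
        nn.Conv2d(in_channels, out_channels * scale ** 2, kernel_size=3, padding=1),
        nn.PixelShuffle(scale),   # (N, C*s*s, H, W) -> (N, C, s*H, s*W)
    )
```

For the third pyramid layer, out_channels would be 2 (the horizontal and vertical flow components), so that the HR residual optical flow is $s$ times the LR resolution.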
In one or some embodiments, establishing a relationship between the inter-frame optical flow and the frame to be processed and its neighboring frames by optical flow conversion, generating an optical flow set based on the relationship includes:
and extracting a plurality of derived optical flows from at least two inter-frame optical flows by using a frame to be processed or an adjacent frame as a template through conversion from space to depth, and arranging the derived optical flows on a channel dimension according to the extraction sequence to form an optical flow set with the same resolution size as the frame to be processed.
After LR video frames such as the frame to be processed and its adjacent frames obtain the HR inter-frame optical flow through the optical flow estimation network model, a relation between the HR optical flow and the LR video frames is established by means of space-to-depth conversion. As shown in fig. 5, information is extracted from the HR optical flow through a fixed LR grid and arranged along the channel dimension, generating an LR optical flow set $\mathcal{F}^{LR}_{j \to i}$ with the same resolution as the LR video frames, namely:

$$\mathcal{F}^{LR}_{j \to i} = \mathrm{S2D}\!\left(F^{HR}_{j \to i} / s\right) \in \mathbb{R}^{H \times W \times 2s^2}$$

where $H$ and $W$ denote the height and width of the LR video frames, the HR optical flow $F^{HR}_{j \to i}$ has 2 channels, and $s$ is the scale factor of the VSR, so the LR optical flow set $\mathcal{F}^{LR}_{j \to i}$ has $2s^2$ channels. During the conversion, the HR optical flow is divided by $s$ to match the spatial resolution of the LR video frames. The optical flow values in the LR optical flow set $\mathcal{F}^{LR}_{j \to i}$ are then used to perform motion compensation on the video frame $X_j$, resulting in the compensated set $\mathcal{C}_j$, namely:

$$\mathcal{C}_j = W\!\left(X_j, \mathcal{F}^{LR}_{j \to i}\right)$$
where W(·) denotes the warping (deformation) operation. Although motion compensation is performed on LR video frames, because HR optical flow estimation is utilized, the compensated video frames embody more accurate motion relationships while introducing more effective information for the VSR.
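Under the assumption that PyTorch's pixel_unshuffle matches the fixed-LR-grid extraction of fig. 5, the space-to-depth conversion and motion compensation above could be sketched as follows; compensate_neighbor is a hypothetical helper name and warp is the backward-warping sketch given earlier.

```python
import torch
import torch.nn.functional as F

def compensate_neighbor(x_j, hr_flow, scale):
    """Build the compensated set for one adjacent LR frame x_j (a sketch).

    x_j:     (N, C, H, W)       adjacent LR frame
    hr_flow: (N, 2, s*H, s*W)   HR optical flow from frame j to frame i
    """
    # Space-to-depth: each of the two flow channels is folded into s*s LR channels.
    # Dividing by s rescales the flow magnitude to LR spatial resolution.
    lr_flows = F.pixel_unshuffle(hr_flow / scale, scale)     # (N, 2*s*s, H, W)
    u_set, v_set = lr_flows.chunk(2, dim=1)                  # s*s horizontal / vertical maps
    compensated = []
    for k in range(scale * scale):
        flow_k = torch.cat((u_set[:, k:k + 1], v_set[:, k:k + 1]), dim=1)
        compensated.append(warp(x_j, flow_k))                # motion compensation of x_j
    return torch.cat(compensated, dim=1)                     # (N, C*s*s, H, W)
```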
In one or some embodiments, reference is made to the super-resolution reconstruction model shown in fig. 6, where the structure of the multi-channel attention module in fig. 6 is shown in fig. 7. The super-resolution reconstruction model comprises a first CNN network (the network in the first row of fig. 6) and a second CNN network (the network in the second row of fig. 6) arranged in parallel, wherein the number of layers of the first CNN network is greater than that of the second CNN network, and the first CNN network and/or the second CNN network includes at least one attention module.
The input to the super-resolution reconstruction model is $T$ consecutive LR video frames $\{X_{t-n}, \dots, X_t, \dots, X_{t+n}\}$, $T = 2n + 1$, and the goal of this embodiment is to reconstruct the HR version of the center frame $X_t$. Optical flow estimation and motion compensation between each adjacent frame and the center frame are performed in turn, generating the corresponding compensated sets. The compensated sets of all LR adjacent frames and the center frame $X_t$ are then combined into a new collection, which may be referred to as the LR draft set $C_L$. Finally, $C_L$ is sent into the super-resolution reconstruction model to recover the HR reconstruction result $\hat{Y}_t$ of the center frame $X_t$, which can be formulated as:

$$\hat{Y}_t = \mathrm{Net}_{VSR}(C_L; \theta_{VSR})$$
where $\theta_{VSR}$ denotes the parameters of the improved ADS-VSR network. The feature maps generated by convolution contain different types of information on different channels and spatial regions, such as low-frequency or high-frequency information and low-level or high-level features, and these different types of information contribute to the SR reconstruction of video data to different degrees. However, most CNN-based methods lack the ability to discriminate between different types of information and assume by default that all information contributes equally, so all feature maps are processed in the same way, cannot be adjusted flexibly, and the fitting capability of the network model is limited. If the sensitivity of the network to high-contribution features can be improved, the representational capacity of the network will be enhanced. The attention module is therefore introduced in this embodiment, with the aim of adaptively focusing on the most useful and important parts of the feature maps.
Referring to fig. 6, the first CNN network comprises at least one of the following layers or combinations: a combination of convolution layers and ReLU functions, a combination of at least two groups of convolution layers and attention modules (16 groups in series in fig. 6), a multi-channel attention module (see fig. 7 for its structure), an attention module, a connection layer, a convolution layer, or an upsampling layer. The convolution layers include Conv1×1, Conv3×3, Conv5×5 and Conv7×7, and some of the convolution layers are followed by a ReLU function. The number or type of each layer, combination or module can be adjusted according to the requirements of reconstruction and training, so any CNN network conforming to the spirit of this construction falls within the scope of protection.
With continued reference to the structure shown in fig. 6, wherein the second CNN network includes at least one layer or combination of: a combination of convolution layer and ReLU function, an attention module, upsampling, or convolution layer.
It should be noted that the bottom of fig. 6 also shows a specific block diagram of the upsampling layer, which is upsampling by a factor of 2 or 3 and a factor of 4, respectively.
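The deep/shallow two-branch arrangement described above could be laid out roughly as in the following sketch. All channel counts, the block count, the kernel sizes and the PixelShuffle-based upsampling are illustrative assumptions rather than the exact configuration of fig. 6; AttentionModule refers to the attention module sketched further below.

```python
import torch
import torch.nn as nn

class ADSVSRSkeleton(nn.Module):
    """Structural sketch of the two-branch super-resolution reconstruction model."""

    def __init__(self, in_channels, out_channels=1, feats=64, scale=4, num_blocks=16):
        super().__init__()
        # Deep branch (first CNN network): recovers high-frequency detail.
        self.deep_head = nn.Sequential(
            nn.Conv2d(in_channels, feats, 3, padding=1), nn.ReLU(inplace=True))
        self.deep_body = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(feats, feats, 3, padding=1), AttentionModule(feats))
            for _ in range(num_blocks)])
        self.deep_tail = nn.Sequential(
            nn.Conv2d(feats, out_channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))
        # Shallow branch (second CNN network): recovers basic HR content.
        self.shallow = nn.Sequential(
            nn.Conv2d(in_channels, feats, 3, padding=1), nn.ReLU(inplace=True),
            AttentionModule(feats),
            nn.Conv2d(feats, out_channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, draft_set):
        detail = self.deep_tail(self.deep_body(self.deep_head(draft_set)))
        base = self.shallow(draft_set)
        return base + detail   # fuse the detail branch with the base branch
```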
In one or some embodiments, as shown in connection with fig. 8, the attention module includes a channel attention sub-module and a spatial attention sub-module disposed in parallel, and a connection layer connecting the two sub-modules and a convolution layer after the connection layer.
Specifically, the attention module includes a channel attention unit and a spatial attention unit, and the input feature map is adjusted by using the correlations of the feature map across channels and across space. The input and output feature maps of the attention module are denoted $F \in \mathbb{R}^{H \times W \times C}$ and $F' \in \mathbb{R}^{H \times W \times C}$, respectively, where $H$, $W$ and $C$ are the height, width and number of channels of the input feature map $F$. The output feature map $F'$ can be formulated as follows:

$$F_{CA} = W_C \otimes F \quad (5.8)$$
$$F_{SA} = W_S \otimes F \quad (5.9)$$
$$F' = \mathrm{conv}_{1 \times 1}\!\left(\mathrm{Concat}(F_{CA}, F_{SA})\right) \quad (5.10)$$

where $\otimes$ denotes element-wise multiplication, $F$ obtains the channel weights $W_C$ through the channel attention unit, $F_{CA}$ denotes the feature map after channel-weight adjustment, $F$ likewise obtains the spatial weights $W_S$ through the spatial attention unit, and $F_{SA}$ denotes the feature map after spatial-weight adjustment. $F_{CA}$ and $F_{SA}$ are connected through a Concat layer and then passed through a convolution layer of size 2C×1×1×C, which reduces the number of channels from 2C back to C and yields the output feature map $F'$.
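As a rough illustration, equations (5.8)-(5.10) could be realized as in the following sketch; ChannelAttention and SpatialAttention stand for the two sub-modules sketched after their respective descriptions below, and the class names are assumptions.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Sketch of the attention module of fig. 8: parallel channel and spatial
    attention, Concat, then a 1x1 convolution that restores C channels."""

    def __init__(self, channels):
        super().__init__()
        self.channel_att = ChannelAttention(channels)   # sketched below
        self.spatial_att = SpatialAttention(channels)   # sketched below
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # 2C -> C

    def forward(self, f):
        f_ca = self.channel_att(f)                        # eq. (5.8)
        f_sa = self.spatial_att(f)                        # eq. (5.9)
        return self.fuse(torch.cat((f_ca, f_sa), dim=1))  # eq. (5.10)
```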
Each channel of the feature map has a different meaning because of the effect of the different filters in the same convolution layer. A channel attention unit is constructed that generates a channel weight $W_C$ by computing statistical characteristics of each channel and then adjusts the feature map in a global manner, enhancing important information and suppressing useless redundant information. The structure of the channel attention sub-module is shown in fig. 9. The spatial information of each feature map is first aggregated by global average pooling and global max pooling, one of which encodes the average statistical characteristics while the other encodes the most salient regions, generating two different spatial feature descriptors $F^C_{avg}$ and $F^C_{max}$, the average-pooled feature and the max-pooled feature, respectively. To learn the nonlinear relations between channels, $F^C_{avg}$ and $F^C_{max}$ are fed into fully connected layers with shared parameters, the outputs of the fully connected layers are summed element by element, and the final channel weight $W_C$ is generated through a gating mechanism, as shown in formula (5.11):

$$W_C = \sigma\!\left(W_1\!\left(\delta\!\left(W_0\!\left(F^C_{avg}\right)\right)\right) + W_1\!\left(\delta\!\left(W_0\!\left(F^C_{max}\right)\right)\right)\right) \quad (5.11)$$

where $\sigma(\cdot)$ and $\delta(\cdot)$ denote the Sigmoid and ReLU activation functions, respectively, and $W_0$ denotes the first fully connected layer, which reduces the number of channels to $C/r$, $r$ being the reduction rate. After ReLU activation, the low-dimensional signal passes through the second fully connected layer $W_1$, which increases the number of channels back to the initial number $C$. Finally, the weights are mapped to the (0, 1) interval using the Sigmoid function. The input feature map can thus be readjusted according to equation (5.8).
Through the above process, the channel attention submodule can adaptively adjust the feature map according to the inter-channel statistical characteristics of the input feature map, so that the network can be helped to improve the processing capacity of different channel features.
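A minimal sketch of this channel attention sub-module is given below, with the shared fully connected layers realized as 1×1 convolutions; the class name and the default reduction rate r = 16 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Sketch of the channel attention sub-module of fig. 9."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # shared MLP (W_0, W_1 in eq. (5.11)) applied to both pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # C -> C/r
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # C/r -> C
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        avg_desc = self.mlp(F.adaptive_avg_pool2d(f, 1))   # global average pooling
        max_desc = self.mlp(F.adaptive_max_pool2d(f, 1))   # global max pooling
        w_c = self.sigmoid(avg_desc + max_desc)            # channel weights in (0, 1)
        return f * w_c                                     # eq. (5.8)
```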
The channel attention unit compresses the global spatial information into a single channel weight using global average pooling and global max pooling, so the influence of spatial position within each feature map is not considered. However, the characteristics of the information contained in the feature map also differ across spatial positions. For example, edge or texture regions typically contain more high-frequency information, while smooth regions such as the sky contain more low-frequency information. Therefore, if the network can discriminate between different local information and pay more attention to important and difficult-to-reconstruct regions, the recovery of detail information in the SR problem is facilitated. The spatial attention sub-module is constructed as a complement to the channel attention unit to improve the representation capability of the network. As shown in fig. 10, two convolution layers are applied and the result is fed into a Sigmoid function to generate the spatial weight $W_S$, which can be formulated as:

$$W_S = \sigma\!\left(\mathrm{conv}_{1 \times 1}\!\left(\delta\!\left(\mathrm{conv}_{1 \times 1}(F)\right)\right)\right) \quad (5.12)$$

where the symbols $\sigma(\cdot)$ and $\delta(\cdot)$ have the same meanings as above. The parameters of the first convolution layer are C×1×1×C/γ, reducing the number of channels to C/γ, with γ being the reduction rate. The parameters of the second convolution layer are C/γ×1×1×1, combining the input into a single spatial attention map. Finally, the values of the attention map are normalized to the (0, 1) interval, again using the Sigmoid function, to obtain the spatial weight $W_S$.
The spatial attention units are used for carrying out self-adaptive adjustment on the feature map in a local mode, and are matched with the channel attention units in a global mode to form an attention module, so that the network can be helped to enhance the fitting capacity.
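A corresponding sketch of the spatial attention sub-module, with the class name and the default reduction rate γ = 16 as assumptions:

```python
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention sub-module of fig. 10."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # C -> C/gamma
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),         # -> one attention map
            nn.Sigmoid(),                                                # weights in (0, 1)
        )

    def forward(self, f):
        w_s = self.body(f)     # spatial weight W_S, eq. (5.12)
        return f * w_s         # eq. (5.9)
```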
Fig. 11 is a schematic structural diagram of an embodiment of the video super-resolution reconstruction device according to the present invention. As shown in fig. 11, the apparatus 1100 includes:
optical flow estimation module 1110: the method is suitable for inputting the to-be-processed frame and the adjacent frame into a preset optical flow estimation network model to obtain the inter-frame optical flow between the to-be-processed frame and the adjacent frame.
Wherein optical flow (Optical flow or optic flow) is a concept in object motion detection in the field of view to describe the motion of an observed object, surface or edge caused by motion relative to an observer. The optical flow estimation network model is used for estimating the optical flow between the frame to be processed and each adjacent frame.
Optical flow set generation module 1120: is adapted to establish a relation between the inter-frame optical flow and the frame to be processed and its neighboring frames by optical flow conversion, based on which relation an optical flow set is generated.
The module is used for obtaining more optical flow according to the inter-frame optical flow based on the format of the low-resolution frame and the like, so that more optical flow information is obtained, and more accurate motion relations are obtained.
Motion compensation module 1130: and the optical flow set is suitable for performing motion compensation on the adjacent frames, and combining the adjacent frames subjected to the motion compensation with the frames to be processed to obtain a frame set to be processed.
The module obtains more frames close to the frames to be processed through the optical flow set, namely the frames to be processed set, so that more effective information is introduced for the follow-up super-resolution reconstruction model.
The super-division reconstruction module 1140: and the method is suitable for inputting the frame set to be processed into a pre-trained super-division reconstruction model to obtain the super-division frame corresponding to the frame to be processed.
In summary, this embodiment realizes the estimation of the optical flow, the acquisition of the optical flow set and the low-resolution frame set, the acquisition of the super-resolution frame, and so on, by establishing neural network models, and solves the technical problem in the prior art that end-to-end training and resolution improvement cannot be realized entirely by a deep learning network. In addition, by introducing the neural network models, more effective information is provided for the super-resolution reconstruction model, the feature fitting capability of the super-resolution reconstruction model is enhanced, the accuracy and time consistency of the results are improved, and the reconstructed video frames are more lifelike, contain richer details, and provide better visual perception.
In one or some embodiments, the optical flow estimation network includes at least two pyramid layers, each pyramid layer outputting motion vectors of a different resolution, and the resolution of the optical flow estimated by a later pyramid layer is higher than that estimated by an earlier pyramid layer.
In one embodiment, the optical flow estimation module 1110 is further adapted to:
taking image vectors and initialized optical flows of a frame to be processed and an adjacent frame as inputs, and obtaining a first residual optical flow after processing by a first pyramid layer in the optical flow estimation network model, wherein the first residual optical flow is fused with the initialized optical flow to obtain a first optical flow;
taking image vectors of a frame to be processed and an adjacent frame and a first optical flow as inputs, and obtaining a second residual optical flow after processing by a second pyramid layer in the optical flow estimation network model, wherein the second residual optical flow is fused with the first optical flow to obtain a second optical flow;
and taking the image vectors of the frame to be processed and the adjacent frame and the second optical flow as inputs, processing the image vectors by a third pyramid layer in the optical flow estimation network model to obtain a third residual optical flow, and fusing the third residual optical flow and the second optical flow to obtain the inter-frame optical flow.
In one embodiment, optical-flow-set generation module 1120 is further adapted to:
And extracting a plurality of derived optical flows from at least two inter-frame optical flows by using a frame to be processed or an adjacent frame as a template through conversion from space to depth, and arranging the derived optical flows on a channel dimension according to the extraction sequence to form an optical flow set with the same resolution size as the frame to be processed.
In a preferred embodiment, the super-resolution reconstruction model includes a first CNN network and a second CNN network that are disposed in parallel, the number of layers of the first CNN network is greater than the number of layers of the second CNN network, and at least one attention module is included in the first CNN network and/or the second CNN network.
In one embodiment, the first CNN network comprises at least one layer or combination of: a combination of convolution layer and ReLU function, a combination of convolution layer and attention module, a multichannel attention module, an attention module, a connection layer, a convolution layer, or an upsampling layer;
the second CNN network comprises at least one layer or combination of: a combination of convolution layer and ReLU function, an attention module, upsampling, or convolution layer.
In one embodiment, the attention module includes a channel attention sub-module and a spatial attention sub-module arranged in parallel, a connection layer, and a convolution layer;
The channel attention submodule comprises a global average pooling network, a global maximum pooling network and a Sigmoid function which are arranged in parallel, wherein the global average pooling network and the global maximum pooling network realize parameter sharing;
the spatial attention submodule comprises at least one convolution layer, a ReLU function and a Sigmoid function.
In summary, the above-mentioned embodiments of the present invention provide an optical flow estimation model (OFE model) based on a pyramid structure, which outputs motion vectors with different resolutions at different pyramid layers and estimates large motion at higher resolution in the higher pyramid layers, i.e., for the input LR video frames it estimates the potential HR optical flow between the corresponding HR video frames in a coarse-to-fine manner. Compared with LR optical flow, HR optical flow can provide a more accurate correspondence and introduce more effective information, helping the VSR to improve the accuracy of the results. In addition, a deep-and-shallow-layer combined VSR model that introduces an attention mechanism (the ADS-VSR model) makes reasonable use of the similarity of adjacent frames in the time dimension to improve the reconstruction quality of the video. The network consists of two branches: a relatively deep first CNN network for recovering high-frequency detail information and a shallow second CNN network for recovering the basic information of the HR video frame. Meanwhile, the idea of the attention mechanism is introduced into the network and an attention module is constructed; this module adaptively adjusts the input feature maps by using the spatial and channel correlations between feature maps, so that different types of features receive different weights, improving the fitting capability of the network. In addition, the invention constructs an end-to-end VSR model based on CNN networks and adopts end-to-end overall training optimization, which is conducive to obtaining a globally optimal solution and further improves the accuracy of the VSR reconstruction results; the reconstructed video contains richer detail information and achieves better visual quality.
The beneficial effects of the invention include the following. An end-to-end model for solving the VSR problem is provided, namely a video super-resolution model integrating an attention mechanism and high-resolution optical flow estimation; motion estimation is completed by a CNN model, so the end-to-end model is trained and optimized as a whole and a globally optimal solution is obtained. Specifically:
(1) The idea of the attention mechanism is introduced and an attention module is constructed, which comprises a channel attention unit and a spatial attention unit; the input feature map is adjusted by using the channel and spatial correlations of the feature map, enhancing the feature fitting capability of the model.
(2) The optical flow estimation model based on the pyramid structure, namely the OFE model, is provided, the potential HR optical flow between corresponding HR video frames is estimated in a coarse-to-fine mode for a group of input LR video frames, and compared with the LR optical flow, the method can provide more accurate corresponding relation, simultaneously introduces more effective information, and helps VSR to improve accuracy and time consistency of results.
(3) The video super-resolution model integrating the attention mechanism and the high-resolution optical flow estimation, namely the OFE and ADS-VSR model, is provided, the reconstructed video frame is more similar to a real HR video frame, the detail information contained in the reconstructed video frame is more abundant, and the visual perception of people is better.
The embodiment of the invention provides a non-volatile computer storage medium, which stores at least one executable instruction, and the computer executable instruction can execute the video super-resolution reconstruction method in any of the method embodiments.
FIG. 12 illustrates a schematic diagram of an embodiment of a computing device of the present invention, and the embodiments of the present invention are not limited to a particular implementation of the computing device.
As shown in fig. 12, the computing device may include: a processor 1202, a communication interface (Communications Interface) 1204, a memory 1206, and a communication bus 1208.
Wherein: the processor 1202, the communication interface 1204, and the memory 1206 communicate with each other via a communication bus 1208. A communication interface 1204 for communicating with network elements of other devices, such as clients or other servers, etc. The processor 1202 is configured to execute the program 1210, and may specifically perform relevant steps in the above-described embodiment of a video super-resolution reconstruction method for a computing device.
In particular, program 1210 may include program code including computer operating instructions.
The processor 1202 may be a central processing unit CPU, or a specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included by the computing device may be the same type of processor, such as one or more CPUs; but may also be different types of processors such as one or more CPUs and one or more ASICs.
Memory 1206 for storing program 1210. The memory 1206 may comprise high-speed RAM memory or may further comprise non-volatile memory (non-volatile memory), such as at least one disk memory.
The program 1210 may be specifically configured to cause the processor 1202 to perform operations corresponding to the video super-resolution reconstruction method in any of the foregoing embodiments.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the above description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that, while some embodiments described herein include some features that are included in other embodiments but not others, combinations of features of different embodiments are intended to fall within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functionality of some or all of the components according to embodiments of the present invention may be implemented in practice using a microprocessor or a digital signal processor (DSP). The present invention may also be implemented as an apparatus or device program (e.g., a computer program or a computer program product) for performing part or all of the methods described herein. Such a program embodying the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims (10)

1. A method of video super-resolution reconstruction, the method comprising:
inputting a frame to be processed and an adjacent frame into a preset optical flow estimation network model to obtain an inter-frame optical flow between the frame to be processed and the adjacent frame;
establishing a relation between the inter-frame optical flow and the frame to be processed and its adjacent frames through optical flow conversion, and generating an optical flow set based on the relation;
performing motion compensation on the adjacent frames by using the optical flow set, and combining the motion-compensated adjacent frames with the frame to be processed to obtain a frame set to be processed;
and inputting the frame set to be processed into a pre-trained super-resolution reconstruction model to obtain a super-resolution frame corresponding to the frame to be processed.
2. The method of claim 1, wherein the optical flow estimation network model comprises at least two pyramid layers, each pyramid layer outputting optical flow at a different resolution, and wherein a pyramid layer closer to the output estimates optical flow at a higher resolution than a pyramid layer closer to the input.
3. The method of claim 2, wherein inputting the frame to be processed and the adjacent frame into the preset optical flow estimation network model to obtain the inter-frame optical flow between the frame to be processed and the adjacent frame comprises:
taking the image vectors of the frame to be processed and the adjacent frame and an initialized optical flow as inputs, and obtaining a first residual optical flow after processing by a first pyramid layer in the optical flow estimation network model, wherein the first residual optical flow is fused with the initialized optical flow to obtain a first optical flow;
taking the image vectors of the frame to be processed and the adjacent frame and the first optical flow as inputs, and obtaining a second residual optical flow after processing by a second pyramid layer in the optical flow estimation network model, wherein the second residual optical flow is fused with the first optical flow to obtain a second optical flow;
and taking the image vectors of the frame to be processed and the adjacent frame and the second optical flow as inputs, obtaining a third residual optical flow after processing by a third pyramid layer in the optical flow estimation network model, and fusing the third residual optical flow with the second optical flow to obtain the inter-frame optical flow.
4. The method of claim 1, wherein establishing a relation between the inter-frame optical flow and the frame to be processed and its adjacent frames through optical flow conversion and generating an optical flow set based on the relation comprises:
extracting a plurality of derived optical flows from at least two inter-frame optical flows through space-to-depth conversion, using the frame to be processed or an adjacent frame as a template, and arranging the derived optical flows along the channel dimension in extraction order to form an optical flow set having the same resolution as the frame to be processed.
5. The method according to any of claims 1-4, wherein the super-resolution reconstruction model comprises a first CNN network and a second CNN network arranged in parallel, wherein the number of layers of the first CNN network is greater than the number of layers of the second CNN network, and wherein at least one attention module is included in the first CNN network and/or the second CNN network.
6. The method of claim 5, wherein the first CNN network comprises at least one of the following layers or combinations: a combination of a convolution layer and a ReLU function, a combination of a convolution layer and an attention module, a multi-channel attention module, an attention module, a connection layer, a convolution layer, or an upsampling layer;
the second CNN network comprises at least one of the following layers or combinations: a combination of a convolution layer and a ReLU function, an attention module, an upsampling layer, or a convolution layer.
7. The method of claim 6, wherein the attention module comprises a channel attention sub-module and a spatial attention sub-module arranged in parallel, a connection layer, and a convolution layer;
the channel attention sub-module comprises a global average pooling network and a global maximum pooling network arranged in parallel, and a Sigmoid function, wherein the global average pooling network and the global maximum pooling network share parameters;
the spatial attention sub-module comprises at least one convolution layer, a ReLU function, and a Sigmoid function.
8. A video super-resolution reconstruction apparatus, the apparatus comprising:
an optical flow estimation module adapted to input a frame to be processed and an adjacent frame into a preset optical flow estimation network model to obtain an inter-frame optical flow between the frame to be processed and the adjacent frame;
an optical flow set generation module adapted to establish a relation between the inter-frame optical flow and the frame to be processed and its adjacent frames through optical flow conversion, and to generate an optical flow set based on the relation;
a motion compensation module adapted to perform motion compensation on the adjacent frames by using the optical flow set, and to combine the motion-compensated adjacent frames with the frame to be processed to obtain a frame set to be processed;
and a super-resolution reconstruction module adapted to input the frame set to be processed into a pre-trained super-resolution reconstruction model to obtain a super-resolution frame corresponding to the frame to be processed.
9. A computing device, comprising: the device comprises a processor, a memory, a communication interface and a communication bus, wherein the processor, the memory and the communication interface complete communication with each other through the communication bus;
the memory is configured to store at least one executable instruction, wherein the executable instruction causes the processor to perform operations corresponding to the video super-resolution reconstruction method according to any one of claims 1 to 7.
10. A computer storage medium having stored therein at least one executable instruction for causing a processor to perform operations corresponding to the video super-resolution reconstruction method according to any one of claims 1-7.
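The claims above walk through the main computational steps of the method; the sketches that follow, placed after the claims for readability, illustrate one way each step could look in code. They are minimal PyTorch-style sketches under stated assumptions, not the patented implementation. The first sketch pictures the coarse-to-fine refinement of claims 2-3: it assumes a three-level pyramid in which each layer receives the (downsampled) frame to be processed, the adjacent frame, and the optical flow from the previous layer, predicts a residual optical flow, and fuses it with the incoming flow by addition. The class and function names (PyramidLevel, estimate_flow), the channel widths, and the bilinear up/downsampling are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidLevel(nn.Module):
    """One pyramid layer: predicts a residual optical flow from the two frames and the current flow."""
    def __init__(self, channels=32):
        super().__init__()
        # inputs: frame to be processed (3) + adjacent frame (3) + current optical flow (2)
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 2, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 3, padding=1),          # 2-channel residual optical flow
        )

    def forward(self, ref, adjacent, flow):
        residual = self.net(torch.cat([ref, adjacent, flow], dim=1))
        return flow + residual                              # fuse the residual flow with the incoming flow

def estimate_flow(ref, adjacent, levels):
    """Coarse-to-fine estimation over three pyramid layers; H and W are assumed divisible by 4."""
    b, _, h, w = ref.shape
    flow = ref.new_zeros(b, 2, h // 4, w // 4)              # initialised optical flow (all zeros)
    for i, level in enumerate(levels):                      # coarse -> fine: 1/4, 1/2, full resolution
        scale = 4 // (2 ** i)
        ref_s = F.interpolate(ref, scale_factor=1 / scale, mode='bilinear', align_corners=False) if scale > 1 else ref
        adj_s = F.interpolate(adjacent, scale_factor=1 / scale, mode='bilinear', align_corners=False) if scale > 1 else adjacent
        flow = level(ref_s, adj_s, flow)                    # first / second / third optical flow
        if scale > 1:                                       # upsample (and rescale) the flow for the next, finer layer
            flow = 2.0 * F.interpolate(flow, scale_factor=2, mode='bilinear', align_corners=False)
    return flow                                             # inter-frame optical flow at full resolution

# usage: levels = nn.ModuleList([PyramidLevel() for _ in range(3)]); flow = estimate_flow(ref, adjacent, levels)
```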
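The space-to-depth conversion of claim 4 can be sketched as follows. The sketch assumes that each inter-frame optical flow has been estimated at a resolution `scale` times that of the frame to be processed, so that `pixel_unshuffle` rearranges every scale-by-scale spatial block into channels and yields scale-squared derived optical flows whose spatial size matches the frame to be processed; the function name `build_flow_set` and the example shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def build_flow_set(inter_frame_flows, scale):
    """Space-to-depth conversion of a list of inter-frame optical flows into an optical flow set."""
    # each (B, 2, scale*H, scale*W) flow becomes a (B, 2*scale**2, H, W) stack of derived flows
    derived = [F.pixel_unshuffle(flow, scale) for flow in inter_frame_flows]
    # arrange the derived flows along the channel dimension in extraction order
    return torch.cat(derived, dim=1)

# e.g. two (B, 2, 4H, 4W) flows with scale=4 -> an optical flow set of shape (B, 64, H, W)
```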
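For the motion-compensation step of claims 1 and 8, a common realisation is backward warping of each adjacent frame with its optical flow, followed by concatenating the warped frames with the frame to be processed. The sketch below uses a plain per-pixel flow at the same resolution as the frames, which is a simplification relative to the channel-stacked optical flow set of claim 4; the names `warp` and `build_frame_set` and the border padding are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp an adjacent frame towards the frame to be processed using a (B, 2, H, W) flow."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # displaced pixel coordinates
    # normalise to [-1, 1] as expected by grid_sample
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                      # (B, H, W, 2)
    return F.grid_sample(frame, grid, mode='bilinear',
                         padding_mode='border', align_corners=True)

def build_frame_set(frame_to_process, adjacent_frames, flows):
    """Motion-compensate the adjacent frames and combine them with the frame to be processed."""
    compensated = [warp(adj, flow) for adj, flow in zip(adjacent_frames, flows)]
    return torch.cat([frame_to_process, *compensated], dim=1)         # frame set to be processed
```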
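One reading of claim 7, with the channel attention sub-module and the spatial attention sub-module operating in parallel on the same feature map, their outputs joined by a connection (concatenation) layer and fused by a convolution layer, is sketched below. The reduction ratio, kernel sizes, and class names are illustrative assumptions; the parameter sharing between the global average pooling and global maximum pooling branches is expressed through a shared MLP.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Parallel global average pooling and global max pooling branches with a shared MLP, then Sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(                            # parameters shared by both pooling branches
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x):
        weight = torch.sigmoid(self.mlp(self.avg_pool(x)) + self.mlp(self.max_pool(x)))
        return x * weight                                    # per-channel reweighting

class SpatialAttention(nn.Module):
    """At least one convolution layer, a ReLU function, and a Sigmoid function."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, 1, 3, padding=1),
        )

    def forward(self, x):
        return x * torch.sigmoid(self.body(x))               # per-pixel reweighting

class AttentionModule(nn.Module):
    """Channel and spatial attention in parallel, joined by concatenation and fused by a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.channel_att = ChannelAttention(channels)
        self.spatial_att = SpatialAttention(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)      # connection layer + convolution layer

    def forward(self, x):
        return self.fuse(torch.cat([self.channel_att(x), self.spatial_att(x)], dim=1))
```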
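Finally, the parallel two-branch reconstruction model of claims 5-6 might look roughly as follows, reusing the AttentionModule from the previous sketch. The deeper first branch carries the attention module, the shallower second branch acts as a lightweight path, and both end with PixelShuffle upsampling. The branch depths, channel widths, the 4x scale, and the use of summation to combine the branch outputs are assumptions for illustration, not prescriptions from the patent.

```python
import torch
import torch.nn as nn

class SuperResolutionModel(nn.Module):
    """Two parallel CNN branches; the first is deeper and carries the attention module."""
    def __init__(self, in_channels, channels=64, scale=4):
        super().__init__()
        self.deep_branch = nn.Sequential(                     # first CNN network: more layers, with attention
            nn.Conv2d(in_channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            AttentionModule(channels),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),                           # upsampling layer
        )
        self.shallow_branch = nn.Sequential(                  # second CNN network: fewer layers
            nn.Conv2d(in_channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, frame_set):
        # both branches see the frame set to be processed; their outputs are summed
        return self.deep_branch(frame_set) + self.shallow_branch(frame_set)
```

Taken together, these sketches map onto the pipeline of claim 1: estimate_flow provides the inter-frame optical flows, build_flow_set and warp/build_frame_set produce the frame set to be processed, and SuperResolutionModel yields the super-resolution frame, with the attention and branch details corresponding to claims 5-7.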
CN202210867466.6A 2022-07-22 2022-07-22 Video super-resolution reconstruction method, device, computing equipment and storage medium Pending CN116266336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210867466.6A CN116266336A (en) 2022-07-22 2022-07-22 Video super-resolution reconstruction method, device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210867466.6A CN116266336A (en) 2022-07-22 2022-07-22 Video super-resolution reconstruction method, device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116266336A true CN116266336A (en) 2023-06-20

Family

ID=86744143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210867466.6A Pending CN116266336A (en) 2022-07-22 2022-07-22 Video super-resolution reconstruction method, device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116266336A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117541473A (en) * 2023-11-13 2024-02-09 烟台大学 Super-resolution reconstruction method of magnetic resonance imaging image
CN117541473B (en) * 2023-11-13 2024-04-30 烟台大学 Super-resolution reconstruction method of magnetic resonance imaging image

Similar Documents

Publication Publication Date Title
CN111311490B (en) Video super-resolution reconstruction method based on multi-frame fusion optical flow
CN110717851B (en) Image processing method and device, training method of neural network and storage medium
CN110969124B (en) Two-dimensional human body posture estimation method and system based on lightweight multi-branch network
CN109544448B (en) Group network super-resolution image reconstruction method of Laplacian pyramid structure
CN111179167B (en) Image super-resolution method based on multi-stage attention enhancement network
CN110969577A (en) Video super-resolution reconstruction method based on deep double attention network
CN113284051B (en) Face super-resolution method based on frequency decomposition multi-attention machine system
CN112001843B (en) Infrared image super-resolution reconstruction method based on deep learning
CN112541877B (en) Defuzzification method, system, equipment and medium for generating countermeasure network based on condition
CN112991173A (en) Single-frame image super-resolution reconstruction method based on dual-channel feature migration network
CN112819910A (en) Hyperspectral image reconstruction method based on double-ghost attention machine mechanism network
CN111784582A (en) DEC-SE-based low-illumination image super-resolution reconstruction method
CN113837938A (en) Super-resolution method for reconstructing potential image based on dynamic vision sensor
CN112598597A (en) Training method of noise reduction model and related device
Li et al. Underwater image high definition display using the multilayer perceptron and color feature-based SRCNN
CN111951195A (en) Image enhancement method and device
CN112422870B (en) Deep learning video frame insertion method based on knowledge distillation
CN112132741A (en) Conversion method and system of face photo image and sketch image
CN115393191A (en) Method, device and equipment for reconstructing super-resolution of lightweight remote sensing image
CN111652921A (en) Generation method of monocular depth prediction model and monocular depth prediction method
CN116468605A (en) Video super-resolution reconstruction method based on time-space layered mask attention fusion
Pang et al. Lightweight multi-scale aggregated residual attention networks for image super-resolution
CN113379606B (en) Face super-resolution method based on pre-training generation model
CN116266336A (en) Video super-resolution reconstruction method, device, computing equipment and storage medium
WO2021042774A1 (en) Image recovery method, image recovery network training method, device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination