CN118279154A - Video super-resolution method, device, electronic equipment and storage medium

Video super-resolution method, device, electronic equipment and storage medium

Info

Publication number
CN118279154A
Authority
CN
China
Prior art keywords
feature map
feature
adjacent
height
width
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410561171.5A
Other languages
Chinese (zh)
Inventor
吴锦泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Information Technology Co Ltd
Original Assignee
Guangzhou Huya Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Information Technology Co Ltd filed Critical Guangzhou Huya Information Technology Co Ltd
Priority to CN202410561171.5A priority Critical patent/CN118279154A/en
Publication of CN118279154A publication Critical patent/CN118279154A/en
Pending legal-status Critical Current

Landscapes

  • Image Processing (AREA)

Abstract

The embodiment of the invention provides a video super-resolution method, a video super-resolution device, electronic equipment and a storage medium, and relates to the technical field of image processing. In the method, a separate convolution structure is adopted to perform feature fusion of different widths and different heights on the video frame to be reconstructed and each adjacent video frame to obtain a width feature map and a height feature map, and feature reconstruction is then performed on the key feature map and the multi-dimensional fusion feature map obtained by interleaving and fusing the width feature map and the height feature map, so as to obtain a high-resolution video frame. This reduces the amount of feature fusion computation while improving the ability to fuse features over long distances and in different directions, thereby effectively improving the efficiency and quality of video super-resolution and effectively reducing visual artifacts.

Description

Video super-resolution method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a video super-resolution method, apparatus, electronic device, and storage medium.
Background
Video super-resolution is a technique for converting low-resolution video into high-resolution video; its key is to accurately align a series of temporally consecutive low-resolution frames onto a high-resolution reference frame. Although existing video super-resolution methods adopt various alignment technologies, they face the problems of high computational cost and an inability to capture rich spatio-temporal context information.
Disclosure of Invention
Accordingly, the present invention is directed to a method, an apparatus, an electronic device, and a storage medium for video super resolution, which can improve the efficiency and quality of video super resolution and effectively reduce visual artifacts.
In order to achieve the above object, the technical scheme adopted by the embodiment of the invention is as follows:
in a first aspect, the present invention provides a video super-resolution method, the method comprising:
The adjacent feature map and the key feature map are fused in series to obtain an initial feature map; the key feature map characterizes the features of the video frame to be reconstructed; the adjacent feature map is obtained after the features of the adjacent video frame are aligned with the key feature map;
Respectively carrying out feature fusion of different widths and different heights along the width direction and the height direction of the initial feature map to obtain a corresponding width feature map and height feature map;
inputting the width feature map and the height feature map corresponding to the initial feature map into a cross attention network to obtain a multi-dimensional fusion feature map;
And carrying out feature reconstruction according to the multi-dimensional fusion feature map and the key feature map to obtain a high-resolution video frame.
Optionally, the step of obtaining the width feature map includes:
inputting the initial feature map to a first layer normalization module to obtain a first normalized feature map;
performing convolution operation on the first normalized feature map by adopting a plurality of convolution kernel parameters with different widths to obtain a plurality of width convolution results corresponding to the first normalized feature map;
And fusing a plurality of width convolution results corresponding to the first normalized feature map to obtain a width feature map.
Optionally, the step of obtaining the height feature map includes:
Inputting the initial feature map to a second layer normalization module to obtain a second normalized feature map;
Performing convolution operation on the second normalized feature map by adopting a plurality of convolution kernel parameters with different heights to obtain a plurality of height convolution results corresponding to the second normalized feature map;
And fusing a plurality of height convolution results corresponding to the second normalized feature map to obtain a height feature map.
Optionally, the cross-attention network comprises a first convolution layer and a second convolution layer; inputting the width feature map and the height feature map corresponding to the initial feature map into a cross attention network to obtain a multi-dimensional fusion feature map, wherein the multi-dimensional fusion feature map comprises:
performing convolution operation on the width feature map by using a first convolution layer with three different parameters to obtain a first width result, a second width result and a third width result;
performing convolution operation on the height feature map by using second convolution layers with three different parameters to obtain a first height result, a second height result and a third height result;
feature fusion is carried out according to the first width result, the second width result and the first height result to obtain a first cross feature map;
feature fusion is carried out according to the second height result, the third height result and the third width result to obtain a second cross feature map;
and carrying out series fusion on the first cross feature map and the second cross feature map to obtain a multi-dimensional fusion feature map corresponding to the adjacent video frames.
Optionally, the fusing the adjacent feature map and the key feature map in series to obtain an initial feature map includes:
performing series operation on the adjacent feature images and the key feature images to obtain corresponding first spliced feature images;
and inputting the first spliced feature map into a first dimension reduction convolution layer to carry out convolution operation, so as to obtain an initial feature map.
In an alternative embodiment, before the step of fusing the adjacent feature map and the key feature map in series to obtain the initial feature map, the method further includes:
acquiring the video frame to be reconstructed and at least one adjacent video frame corresponding to the video frame to be reconstructed;
Respectively extracting features of the video frame to be reconstructed and the at least one adjacent video frame to obtain a key feature map and at least one initial adjacent feature map; the initial adjacent feature map corresponds to the adjacent video frames one by one;
and respectively carrying out feature alignment on the at least one initial adjacent feature map and the key feature map to obtain an adjacent feature map corresponding to the at least one adjacent video frame.
Optionally, the performing feature alignment on the at least one initial adjacent feature map and the key feature map to obtain an adjacent feature map corresponding to the at least one adjacent video frame includes:
Inputting the adjacent feature graphs and the key feature graphs into a first deformable convolution layer for feature alignment aiming at each adjacent feature graph to obtain an intermediate feature output by the first deformable convolution layer;
Inputting the middle feature output by the previous deformable convolution layer and the key feature map into the next deformable convolution layer for feature alignment until the middle feature output by the last deformable convolution layer is obtained, and determining the middle feature output by the last deformable convolution layer as the corresponding adjacent feature map of the adjacent video frame; a plurality of the deformable convolution layers are connected in series.
In a second aspect, the present invention provides a video super-resolution apparatus, the apparatus comprising:
the preprocessing module is used for fusing the adjacent feature map and the key feature map in series to obtain an initial feature map; the key feature map characterizes the features of the video frame to be reconstructed; the adjacent feature map is obtained after the features of the adjacent video frame are aligned with the key feature map;
the fusion module is used for respectively carrying out feature fusion of different widths and different heights along the width direction and the height direction of the initial feature map to obtain a corresponding width feature map and a corresponding height feature map;
The fusion module is also used for inputting the width feature map and the height feature map corresponding to the initial feature map into a cross attention network to obtain a multi-dimensional fusion feature map;
And the reconstruction module is used for carrying out feature reconstruction according to the multi-dimensional fusion feature map and the key feature map to obtain a high-resolution video frame.
In a third aspect, the present invention provides an electronic device comprising a processor and a memory storing machine executable instructions executable by the processor to implement the video super-resolution method of any of the preceding embodiments.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a video super-resolution method as in any of the preceding embodiments.
Compared with the prior art, in the video super-resolution method, device, electronic equipment and storage medium provided by the embodiments of the invention, the adjacent feature map and the key feature map are fused in series to obtain an initial feature map. Feature fusion of different widths is carried out along the width direction of the initial feature map to obtain a corresponding width feature map, and feature fusion of different heights is carried out along the height direction of the initial feature map to obtain a corresponding height feature map. The width feature map and the height feature map corresponding to the initial feature map are input into a cross attention network to obtain a multi-dimensional fusion feature map. Feature reconstruction is carried out according to the multi-dimensional fusion feature map and the key feature map to obtain a high-resolution video frame.
In this way, a separate convolution structure is used to perform feature fusion of different widths and different heights on the video frame to be reconstructed and each adjacent video frame to obtain the width feature map and the height feature map, and feature reconstruction is then performed on the key feature map and the multi-dimensional fusion feature map obtained by interleaving and fusing the width feature map and the height feature map, so as to obtain the high-resolution video frame. This reduces the amount of feature fusion computation while improving the ability to fuse features over long distances and in different directions, thereby effectively improving the efficiency and quality of video super-resolution and effectively reducing visual artifacts.
In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic flow chart of a video super-resolution method according to an embodiment of the present invention.
Fig. 2 shows a schematic structural diagram of a super-resolution model according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of another video super-resolution method according to an embodiment of the present invention.
Fig. 4 shows a schematic flow chart of feature fusion provided by an embodiment of the present invention.
Fig. 5 is a schematic block diagram of a video super-resolution device according to an embodiment of the present invention.
Fig. 6 shows a block schematic diagram of an electronic device according to an embodiment of the present invention.
Reference numerals: 100-electronic device; 110-memory; 120-processor; 130-communication module; 200-video super-resolution device; 201-preprocessing module; 202-fusion module; 203-reconstruction module.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
It is noted that relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Video super-resolution refers to the process of generating high-resolution video frames from low-resolution video frames. It can improve the definition and detail of a video and thus the viewing experience, and is therefore applied in important scenarios such as video streaming services, security monitoring and medical imaging. In order to accurately align temporally consecutive low-resolution frames onto a high-resolution frame, problems such as shooting shake, rapid movement of objects and view distortion caused by scene depth changes need to be solved.
Currently, video super-resolution methods generally adopt alignment approaches such as block matching, optical flow, and feature- and deep-learning-based alignment to realize the alignment from low-resolution frames to a high-resolution frame. As the inventors found, each of these methods has its advantages and limitations. For example, block matching is straightforward but may not be sufficiently accurate; optical flow methods are more accurate for continuous motion estimation but computationally intensive; and feature- and deep-learning-based alignment methods can cope with more complex situations but require a large amount of data for training.
Based on the above, the method and the device for video super-resolution provided by the embodiment of the invention have the core ideas that the separate convolution structure is adopted to perform feature fusion of different widths and different heights on the video frame to be reconstructed and each adjacent video frame to obtain the width feature map and the height feature map, and then the multi-dimensional fusion feature map and the key feature map after interleaving and fusion of the width feature map and the height feature map are utilized to perform feature reconstruction to obtain the high-resolution video frame, so that the feature fusion calculation amount is reduced, and the feature fusion capability of long distance and different directions is improved, thereby effectively improving the efficiency and quality of video super-resolution and effectively reducing visual artifacts.
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, fig. 1 is a schematic flow chart of a video super-resolution method according to an embodiment of the invention. The method is applied to an electronic device and comprises the following steps:
Step 40, fusing the adjacent feature map and the key feature map in series to obtain an initial feature map.
The key feature map characterizes the features of the video frame to be reconstructed; the adjacent feature map is obtained after the features of the adjacent video frame are aligned with the key feature map.
In the embodiment of the invention, in order to improve the visual continuity between the reconstructed high-resolution video frames, the video frames to be reconstructed and at least one adjacent video frame corresponding to the video frames to be reconstructed can be obtained, and the key feature map and the at least one adjacent feature map of the video frames to be reconstructed can be obtained according to the video frames to be reconstructed and the at least one adjacent video frame. Wherein, adjacent video frames are in one-to-one correspondence with adjacent feature maps.
When there is one adjacent video frame, the adjacent feature map and the key feature map are fused in series according to the feature channels to obtain the initial feature map. When there are multiple adjacent video frames, each adjacent feature map is fused in series with the key feature map according to the feature channels to obtain an initial feature map corresponding to that adjacent feature map. The adjacent feature maps correspond to the initial feature maps one by one.
And step 50, respectively carrying out feature fusion of different widths and different heights along the width direction and the height direction of the initial feature map to obtain a corresponding width feature map and a corresponding height feature map.
In the embodiment of the invention, the width feature map characterizes the fusion of a plurality of features at the same height in each adjacent video frame and the video frame to be reconstructed, and the height feature map characterizes the fusion of a plurality of features at the same width in each adjacent video frame and the video frame to be reconstructed. Feature fusion of different widths is carried out along the width direction of the initial feature map to obtain the corresponding width feature map, and feature fusion of different heights is carried out along the height direction of the initial feature map to obtain the corresponding height feature map.
That is, feature fusion in the width direction is performed on the initial feature map multiple times according to different widths to obtain the width feature map corresponding to the initial feature map, and feature fusion in the height direction is performed on the initial feature map multiple times according to different heights to obtain the height feature map corresponding to the initial feature map. Extracting the width feature map and the height feature map strengthens the feature dependencies over long distances and in different directions between video frames, so the problems of motion blur and temporal distortion can be effectively alleviated.
And step 60, inputting the width feature map and the height feature map corresponding to the initial feature map into a cross attention network to obtain a multi-dimensional fusion feature map.
It should be noted that, the multi-dimensional fusion feature map is obtained after the cross fusion of the width feature map and the height feature map, and the initial feature map corresponds to the multi-dimensional fusion feature map one by one. When a plurality of adjacent video frames exist, steps 40-60 are executed for each adjacent video frame to obtain a multi-dimensional fusion feature map corresponding to each adjacent video frame.
And step 70, carrying out feature reconstruction according to the multi-dimensional fusion feature map and the key feature map to obtain a high-resolution video frame.
In the embodiment of the invention, the feature reconstruction is carried out on the multi-dimensional fusion feature map and the key feature map, and the high-resolution video frame corresponding to the video frame to be reconstructed is obtained. And when a plurality of adjacent video frames exist (namely, a plurality of multi-dimensional fusion feature images exist), carrying out feature reconstruction on all the multi-dimensional fusion feature images and the key feature images to obtain a high-resolution video frame corresponding to the video frame to be reconstructed.
In summary, in the video super-resolution method provided by the embodiment of the invention, the adjacent feature map and the key feature map are fused in series to obtain an initial feature map. Feature fusion of different widths is carried out along the width direction of the initial feature map to obtain a corresponding width feature map, and feature fusion of different heights is carried out along the height direction of the initial feature map to obtain a corresponding height feature map. The width feature map and the height feature map corresponding to the initial feature map are input into a cross attention network to obtain a multi-dimensional fusion feature map. Feature reconstruction is carried out according to the multi-dimensional fusion feature map and the key feature map to obtain a high-resolution video frame.
In this way, a separate convolution structure is used to perform feature fusion of different widths and different heights on the video frame to be reconstructed and each adjacent video frame to obtain the width feature map and the height feature map, and feature reconstruction is then performed on the key feature map and the multi-dimensional fusion feature map obtained by interleaving and fusing the width feature map and the height feature map, so as to obtain the high-resolution video frame. This reduces the amount of feature fusion computation while improving the ability to fuse features over long distances and in different directions, thereby effectively improving the efficiency and quality of video super-resolution and effectively reducing visual artifacts.
In an alternative implementation, embodiments of the present invention may utilize a pre-trained super-resolution model to convert low-resolution video frames to be reconstructed into corresponding resolution video frames. Referring to fig. 2, fig. 2 is a schematic structural diagram of a super-resolution model according to an embodiment of the invention. In fig. 2, the super-resolution model includes a feature extraction layer, a feature alignment layer, a feature fusion layer, and a feature reconstruction layer.
The feature extraction layer is used for extracting features of the video frame to be reconstructed and the adjacent video frames, and extracting important features from the video frames with low resolution. The feature alignment layer is used for carrying out feature alignment on the features of the adjacent video frames and the features of the video frames to be reconstructed to obtain adjacent feature images corresponding to the adjacent video frames. The feature fusion layer is used for carrying out feature fusion on the adjacent feature images and the key feature images to obtain multi-dimensional fusion feature images corresponding to the adjacent video frames. The feature reconstruction layer is used for carrying out feature reconstruction according to the multi-dimensional fusion feature map and the key feature map to obtain a high-resolution video frame corresponding to the video frame to be reconstructed.
It should be noted that fig. 2 is only a schematic structural diagram of the present embodiment. In fact, the structure of the super-resolution model may be set according to the actual application scenario, which is not limited in the embodiment of the present invention.
Optionally, in practical applications, feature extraction needs to be performed on the low-resolution video frame before feature fusion is performed, and then feature fusion is performed based on the extracted feature map. Referring to fig. 3, before step 40 in fig. 1, the method further includes the following steps:
step 10, obtaining a video frame to be reconstructed and at least one adjacent video frame corresponding to the video frame to be reconstructed.
As a possible implementation, a preset number of adjacent frames may be initialized in advance according to the scene. Each video frame in the video is determined in turn to be the video frame to be reconstructed, and at least one corresponding adjacent video frame is obtained according to the preset number of adjacent frames and the video frame to be reconstructed. Assuming the preset number of adjacent frames is 2N: when the video frame to be reconstructed is the first frame, the 2N video frames after it are determined to be adjacent video frames; when the video frame to be reconstructed is the M-th frame (M <= N), the M-1 video frames before it and the 2N-M+1 video frames after it are determined to be adjacent video frames. That is, a preset number of video frames before and after the video frame to be reconstructed are selected as adjacent video frames, as illustrated by the sketch below. Here M and N are both positive integers.
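By way of illustration only, the following Python sketch shows one way to implement the adjacent-frame selection rule described above. The function name is hypothetical, and the handling of frames near the end of the video, which the description does not specify, simply mirrors the rule given for frames near the start.

```python
def select_adjacent_frames(frame_idx, num_frames, n):
    """Return the indices of the 2N adjacent frames for the frame to be reconstructed.

    frame_idx  : 0-based index t of the video frame to be reconstructed
    num_frames : total number of frames in the video
    n          : N, half of the preset number of adjacent frames (2N in total)
    """
    m = frame_idx + 1                       # 1-based position M of the frame
    remaining = num_frames - frame_idx - 1  # frames available after t
    if m <= n:
        # near the start: M-1 frames before, 2N-M+1 frames after
        before = list(range(frame_idx - (m - 1), frame_idx))
        after = list(range(frame_idx + 1, frame_idx + (2 * n - m + 1) + 1))
    elif remaining < n:
        # near the end (assumption: mirror the rule used near the start)
        after = list(range(frame_idx + 1, num_frames))
        before = list(range(max(0, frame_idx - (2 * n - remaining)), frame_idx))
    else:
        # interior frame: N frames before and N frames after
        before = list(range(frame_idx - n, frame_idx))
        after = list(range(frame_idx + 1, frame_idx + n + 1))
    return before + after
```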
And step 20, respectively extracting the characteristics of the video frame to be reconstructed and at least one adjacent video frame to obtain a key characteristic diagram and at least one initial adjacent characteristic diagram.
Wherein the initial adjacent feature map corresponds to adjacent video frames one-to-one.
In the embodiment of the invention, the video frame to be rebuilt and each adjacent video frame are respectively input into the feature extraction layer to obtain the key feature map corresponding to the video frame to be rebuilt and the initial adjacent feature map corresponding to each adjacent video frame. It should be noted that, the feature extraction layer acts on each video frame independently, and both the video frame to be reconstructed and the adjacent video frame are input to the feature extraction layer to perform the same processing steps for feature extraction. The feature extraction layer may be expressed using the following formula:
Fi = Netfea(Ii), i ∈ [t-N, t+N]
wherein, when i = t, Ii is the video frame to be reconstructed and Fi is the key feature map; when i ≠ t, Ii is an adjacent video frame and Fi is the initial adjacent feature map; Netfea is the feature extraction operation of the feature extraction layer; N is half of the number of adjacent video frames; i and t are positive integers.
In one possible implementation, the feature extraction layer includes a feature extraction convolutional layer and a plurality of residual blocks, the feature extraction convolutional layer and the residual blocks being connected in series. And respectively inputting the video frame to be reconstructed and at least one adjacent video frame into a feature extraction convolution layer to perform feature mapping and activation to obtain preliminary extraction features corresponding to the video frame to be reconstructed and preliminary extraction features corresponding to the at least one adjacent video frame. And sequentially inputting the preliminary extraction features corresponding to the video frames to be reconstructed into a plurality of residual blocks for feature processing to obtain a key feature map corresponding to the video frames to be reconstructed, sequentially inputting the preliminary extraction features corresponding to at least one adjacent video frame into a plurality of residual blocks for feature processing to obtain an initial adjacent feature map corresponding to at least one adjacent video frame.
The feature extraction layer is assumed to include a feature extraction convolution layer and three residual blocks. Firstly, inputting a video frame to be rebuilt and at least one adjacent video frame into a feature extraction convolution layer in sequence for processing, inputting initial extraction features output by the feature extraction convolution layer into a first residual block for processing, inputting a result output by the first residual block into a second residual block for processing, and finally inputting a result output by the second residual block into a third residual block for processing to obtain a key feature map corresponding to the video frame to be rebuilt and an initial adjacent feature map corresponding to the at least one adjacent video frame.
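As an illustrative sketch only, a feature extraction layer of this kind (one convolution for feature mapping and activation followed by residual blocks in series) could look as follows in PyTorch. The channel count, kernel sizes and activation functions are assumptions not fixed by the description; the layer is applied independently to the video frame to be reconstructed and to each adjacent video frame.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv-ReLU-Conv block with an identity skip connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureExtraction(nn.Module):
    """Feature mapping and activation, then several residual blocks in series."""
    def __init__(self, in_channels=3, channels=64, num_blocks=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, channels, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        )
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])

    def forward(self, frame):                  # frame: (B, 3, H, W) low-resolution frame
        return self.blocks(self.head(frame))   # (B, C, H, W) key / initial adjacent feature map
```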
And step 30, performing feature alignment on at least one initial adjacent feature map and the key feature map respectively to obtain an adjacent feature map corresponding to at least one adjacent video frame.
In the embodiment of the invention, aiming at each initial adjacent feature map, the initial adjacent feature map and the key feature map are input into a feature alignment layer to realize feature alignment between the initial adjacent feature map and the key feature map, so as to obtain a corresponding adjacent feature map. The feature alignment layer may be expressed using the following formula:
Fi^a = Netalign(Fi, Ft), i ∈ [t-N, t+N] and i ≠ t
wherein Fi^a is the adjacent feature map corresponding to the i-th adjacent video frame; Fi is the initial adjacent feature map corresponding to the i-th adjacent video frame; Ft is the key feature map; Netalign is the feature alignment operation of the feature alignment layer.
It should be noted that, one adjacent video frame may be obtained according to the video frame to be reconstructed, or a plurality of adjacent video frames may be obtained, which is not limited in this embodiment, and the adjacent video frames may be obtained according to the actual application scenario.
As a possible implementation manner, the feature alignment layer comprises a plurality of deformable convolution layers connected in series, and the feature sampling positions can be dynamically determined through the deformable convolution layers, so that the effect of motion compensation is achieved by adaptively adjusting sampling points in the feature alignment process. The substeps of step 30 in fig. 3 may include:
Inputting adjacent feature images and key feature images into a first deformable convolution layer for feature alignment aiming at each adjacent feature image to obtain an intermediate feature output by the first deformable convolution layer; inputting the middle feature and the key feature map output by the previous deformable convolution layer into the next deformable convolution layer for feature alignment until the middle feature output by the last deformable convolution layer is obtained, and determining the middle feature output by the last deformable convolution layer as the adjacent feature map of the corresponding adjacent video frame; wherein a plurality of deformable convolution layers are connected in series.
As one possible implementation, it is assumed that the feature alignment layer comprises five deformable convolution layers connected in series. Inputting adjacent feature images and key feature images to a first deformable convolution layer for feature alignment aiming at each adjacent feature image to obtain an intermediate feature output by the first deformable convolution layer; inputting the intermediate features and key feature graphs output by the first deformable convolution layer to a second deformable convolution layer for feature alignment to obtain intermediate features output by the second deformable convolution layer; inputting the intermediate features and the key feature graphs output by the second deformable convolution layer to a third deformable convolution layer for feature alignment to obtain intermediate features output by the third deformable convolution layer; inputting the intermediate features and the key feature graphs output by the third deformable convolution layer to a fourth deformable convolution layer for feature alignment to obtain intermediate features output by the fourth deformable convolution layer; and inputting the intermediate features and key feature graphs output by the fourth deformable convolution layer into a fifth deformable convolution layer for feature alignment to obtain intermediate features output by the fifth deformable convolution layer, and determining the intermediate features output by the fifth deformable convolution layer as adjacent feature graphs of corresponding adjacent video frames.
That is, feature alignment is performed in cascade in series order, with the intermediate feature output by the previous deformable convolution layer and the key feature map taken as the input of the next deformable convolution layer. By using deformable convolution layers for adaptive adjustment in feature space, the feature alignment layer can dynamically determine the feature sampling positions without explicitly calculating motion vectors between video frames, effectively compensating for the motion between adjacent frames. This reduces computational complexity while effectively improving the efficiency of feature alignment for video frames with diverse motion.
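A minimal PyTorch sketch of such a cascade is given below, using torchvision's DeformConv2d. The way the sampling offsets are predicted (from the concatenation of the neighbor and key feature maps) and the channel and offset-group sizes are assumptions; the description only fixes that several deformable convolution layers are connected in series and that each takes the previous intermediate feature and the key feature map as input.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformAlignUnit(nn.Module):
    """One deformable-convolution alignment step guided by the key feature map."""
    def __init__(self, channels=64, kernel_size=3, offset_groups=8):
        super().__init__()
        pad = kernel_size // 2
        # sampling offsets are predicted from the (neighbor, key) feature pair (assumption)
        self.offset_conv = nn.Conv2d(
            2 * channels, 2 * offset_groups * kernel_size * kernel_size,
            kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, neighbor_feat, key_feat):
        offset = self.offset_conv(torch.cat([neighbor_feat, key_feat], dim=1))
        return self.deform_conv(neighbor_feat, offset)

class FeatureAlignment(nn.Module):
    """Several deformable alignment units connected in series."""
    def __init__(self, channels=64, num_units=5):
        super().__init__()
        self.units = nn.ModuleList(DeformAlignUnit(channels) for _ in range(num_units))

    def forward(self, neighbor_feat, key_feat):
        feat = neighbor_feat
        for unit in self.units:
            # intermediate feature of the previous unit + key feature map -> next unit
            feat = unit(feat, key_feat)
        return feat                              # aligned adjacent feature map
```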
As a possible implementation manner, the feature fusion layer includes a first dimension-reducing convolution layer, and the first dimension-reducing convolution layer is used for reducing the channel dimension. The substeps of step 40 in fig. 1 may include:
performing series connection operation on adjacent feature images and key feature images to obtain corresponding first spliced feature images; and inputting the first spliced feature map into a first dimension-reduction convolution layer to carry out convolution operation, so as to obtain an initial feature map.
In the embodiment of the present invention, it is assumed that the three dimensions of the adjacent feature map and the key feature map are the w dimension (width), the h dimension (height) and the c dimension (number of channels). When at least one adjacent feature map exists, a series operation is performed on each adjacent feature map and the key feature map along the channel dimension to obtain the first spliced feature map corresponding to that adjacent feature map, whose three dimensions are w, h and 2c. Each first spliced feature map is input into the first dimension-reduction convolution layer for convolution, giving a dimension-reduced initial feature map whose three dimensions are w, h and c. Here w, h and c are positive integers.
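Under assumed channel counts, the series operation and 1×1 dimension-reduction convolution described above might be sketched in PyTorch as follows.

```python
import torch
import torch.nn as nn

class InitialFusion(nn.Module):
    """Concatenate the adjacent and key feature maps along the channel dimension
    and reduce the result back to c channels with a 1x1 convolution."""
    def __init__(self, channels=64):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, adjacent_feat, key_feat):                # both (B, c, h, w)
        stacked = torch.cat([adjacent_feat, key_feat], dim=1)  # (B, 2c, h, w)
        return self.reduce(stacked)                            # initial feature map (B, c, h, w)
```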
As a possible implementation manner, the feature fusion layer comprises a first layer normalization module and a plurality of width convolution layers, wherein each width convolution layer is provided with different width parameters, and the first layer normalization module is connected with each width convolution layer respectively. The substeps of obtaining the width feature map in step 50 of fig. 1 may include:
Inputting the initial feature map to a first layer normalization module to obtain a first normalized feature map; performing convolution operation on the first normalized feature map by adopting a plurality of convolution kernel parameters with different widths to obtain a plurality of width convolution results corresponding to the first normalized feature map; and fusing a plurality of width convolution results corresponding to the first normalized feature map to obtain a width feature map.
In the embodiment of the invention, the first layer normalization module can be unbiased layer normalization or biased layer normalization, and can be selected according to the characteristics of video content, so that the flexibility of a model is improved. When at least one initial feature map exists, inputting the initial feature map to a first layer normalization module for normalization processing aiming at each initial feature map to obtain a corresponding first normalization feature map. And respectively inputting the first normalized feature map to each width convolution layer for convolution operation to obtain a width convolution result output by each width convolution layer, and performing matrix addition operation on all the width convolution results corresponding to the first normalized feature map to obtain a width feature map corresponding to the initial feature map.
It should be noted that, in order to enhance the effect of feature fusion with different widths, the matrix addition operation may be performed on all the width convolution results, and then the result of the matrix addition operation is input into the 1×1 convolution to implement cross-channel feature fusion, so as to finally obtain the width feature map corresponding to each initial feature map.
As a possible implementation manner, the feature fusion layer further comprises a second layer normalization module and a plurality of height convolution layers, wherein each height convolution layer is provided with different height parameters, and the second layer normalization module is connected with each height convolution layer respectively. The substeps of obtaining the height profile in step 50 of fig. 1 may include:
Inputting the initial feature map to a second layer normalization module to obtain a second normalized feature map; performing convolution operation on the second normalized feature map by adopting a plurality of convolution kernel parameters with different heights to obtain a plurality of height convolution results corresponding to the second normalized feature map; and fusing a plurality of height convolution results corresponding to the second normalized feature map to obtain the height feature map.
In the embodiment of the invention, the second layer normalization module is set in the same way as the first layer normalization module, and either unbiased or biased layer normalization can be selected. When at least one initial feature map exists, each initial feature map is input to the second layer normalization module for normalization to obtain the corresponding second normalized feature map. The second normalized feature map is input to each height convolution layer for the corresponding convolution operation to obtain the height convolution result output by each height convolution layer, and a matrix addition operation is performed on all the height convolution results corresponding to the second normalized feature map to obtain the height feature map corresponding to the initial feature map.
In order to enhance the effect of feature fusion of different heights, the matrix addition operation can be performed on all the height convolution results, and then the result of the matrix addition operation is input into the 1×1 convolution to realize cross-channel feature fusion, so that the height feature map corresponding to each initial feature map is finally obtained.
Therefore, by setting width convolution layers with different widths and height convolution layers with different heights, the embodiment of the invention fuses context information over different distances into the feature maps of the video frames along spatial rows or spatial columns, and can thereby capture feature dependencies over long distances and in different directions between video frames, effectively alleviating the problems of motion blur and temporal distortion. This reduces the amount of computation while more effectively maintaining temporal consistency and reducing visual artifacts, further improving video reconstruction quality.
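For illustration, the width and height branches could be sketched in PyTorch as below, using the 1×5, 1×7 and 1×11 width kernels (and their 5×1, 7×1 and 11×1 height counterparts) that appear in the example of fig. 4 further on; applying layer normalization over the channel dimension is an assumption.

```python
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm over the channel dimension of a (B, C, H, W) tensor."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        x = x.permute(0, 2, 3, 1)          # (B, H, W, C)
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)       # (B, C, H, W)

class StripConvFusion(nn.Module):
    """Width (1xk) or height (kx1) strip convolutions of several sizes,
    combined by matrix addition and a 1x1 cross-channel convolution."""
    def __init__(self, channels=64, sizes=(5, 7, 11), horizontal=True):
        super().__init__()
        self.norm = ChannelLayerNorm(channels)
        self.branches = nn.ModuleList()
        for k in sizes:
            if horizontal:   # width convolution layer: 1 x k kernel
                conv = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
            else:            # height convolution layer: k x 1 kernel
                conv = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
            self.branches.append(conv)
        self.mix = nn.Conv2d(channels, channels, 1)             # cross-channel fusion

    def forward(self, x):                                       # x: initial feature map
        xn = self.norm(x)
        out = sum(branch(xn) for branch in self.branches)       # matrix addition
        return self.mix(out)                                    # width or height feature map

# width_branch = StripConvFusion(horizontal=True)   -> width feature map Xw
# height_branch = StripConvFusion(horizontal=False) -> height feature map Xh
```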
As a possible implementation manner, the feature fusion layer further comprises a cross-attention network, and the cross-attention network comprises a first convolution layer and a second convolution layer; the sub-steps of step 60 in fig. 1 may include:
Performing convolution operation on the width feature map by using a first convolution layer with three different parameters to obtain a first width result, a second width result and a third width result; performing convolution operation on the height feature map by using second convolution layers with three different parameters to obtain a first height result, a second height result and a third height result; feature fusion is carried out according to the first width result, the second width result and the first height result, and a first cross feature diagram is obtained; feature fusion is carried out according to the second height result, the third height result and the third width result, and a second cross feature diagram is obtained; and carrying out series fusion on the first cross feature map and the second cross feature map to obtain multi-dimensional fusion feature maps corresponding to the adjacent video frames.
Feature fusion is illustrated by way of example in fig. 4. In fig. 4, ⊗ represents matrix multiplication, i.e. two matrices are multiplied, and ⊕ represents matrix addition, i.e. two matrices are added element by element. The three first convolution layers, denoted W1, W2 and W3, have different parameters, and the three second convolution layers, denoted V1, V2 and V3, have different parameters. When at least one adjacent feature map exists, a series operation is performed on each adjacent feature map and the key feature map to obtain a corresponding first spliced feature map, and the first spliced feature map is input into the 1×1 first dimension-reduction convolution layer for convolution to obtain the corresponding initial feature map X. The initial feature map X is input into the first layer normalization module and the second layer normalization module respectively to obtain a first normalized feature map and a second normalized feature map.
The first normalized feature map is input into a 1×5 width convolution layer, a 1×7 width convolution layer and a 1×11 width convolution layer respectively to obtain three width convolution results; the three width convolution results are combined by matrix addition and input into a 1×1 convolution to obtain the width feature map Xw. The second normalized feature map is input into a 5×1 height convolution layer, a 7×1 height convolution layer and an 11×1 height convolution layer respectively to obtain three height convolution results; the three height convolution results are combined by matrix addition and input into a 1×1 convolution to obtain the height feature map Xh.
The width feature map Xw is input into the first convolution layers W1, W2 and W3 respectively for convolution to obtain a first width result, a second width result and a third width result. The height feature map Xh is input into the second convolution layers V1, V2 and V3 respectively to obtain a first height result, a second height result and a third height result. Matrix multiplication is performed on the second width result and the first height result, the result is processed with the softmax normalization function, matrix multiplication is performed on the processed result and the first width result, and the result is input into a 1×1 convolution to obtain the first cross feature map. Matrix multiplication is performed on the third width result and the second height result, the result is processed with the softmax normalization function, matrix multiplication is performed on the processed result and the third height result, and the result is input into a 1×1 convolution to obtain the second cross feature map.
And performing series operation on the first cross feature map and the second cross feature map to obtain a corresponding second spliced feature map, inputting the second spliced feature map into a second dimension reduction convolution layer to perform convolution operation, and obtaining a multi-dimension fusion feature map corresponding to the initial feature map.
It can be seen that a 1×N width convolution layer and an N×1 height convolution layer each perform a weighted sum over N feature points, whereas an N×N convolution in the prior art performs a weighted sum over N×N feature points. Compared with the prior art, the embodiment of the invention replaces the N×N convolution kernel in the traditional multi-head attention mechanism with a separate convolution architecture (i.e. the width convolution layers and height convolution layers), which effectively reduces the number of parameters of the super-resolution model and improves computational efficiency. At the same time, diversified feature information in different directions and of different lengths can be captured, which improves the super-resolution model's ability to capture complex patterns and its generalization ability. When the feature map dimensions of the video frames are large, the separate convolution effectively reduces the number of floating-point operations required. Moreover, because the multiple width convolution layers and multiple height convolution layers operate in parallel, better parallel processing capability can be provided, further improving the efficiency of video super-resolution.
The embodiment of the invention also provides a formula expression of the feature fusion process of fig. 4:
Xw=Conv1×1(Conv1×5(LN(X))+Conv1×7(LN(X))+Conv1×11(LN(X)))
Xh=Conv1×1(Conv5×1(LN(X))+Conv7×1(LN(X))+Conv11×1(LN(X)))
Yw = Conv1×1(softmax(W2(Xw) ⊗ V1(Xh)) ⊗ W1(Xw))
Yh = Conv1×1(softmax(W3(Xw) ⊗ V2(Xh)) ⊗ V3(Xh))
wherein Xw is the width feature map; Xh is the height feature map; Yw is the first cross feature map; Yh is the second cross feature map; X is the initial feature map; LN(·) characterizes the layer normalization operation; Conv1×1 is a 1×1 convolution; Conv1×5 is a width convolution layer with a width of 5; Conv1×7 is a width convolution layer with a width of 7; Conv1×11 is a width convolution layer with a width of 11; Conv5×1 is a height convolution layer with a height of 5; Conv7×1 is a height convolution layer with a height of 7; Conv11×1 is a height convolution layer with a height of 11; W1, W2 and W3 are 1×1 first convolution layers with different parameters; V1, V2 and V3 are 1×1 second convolution layers with different parameters; softmax(·) characterizes the softmax normalization function; ⊗ denotes matrix multiplication.
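The following PyTorch sketch illustrates one possible reading of this cross-attention fusion. The exact tensor reshaping behind the matrix multiplications is not specified above, so the sketch flattens the spatial dimensions into a single axis, which is one common choice; the layer names, channel sizes and multiplication order are likewise assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Interleave the width and height feature maps with two cross-attention
    branches and fuse the results with a second dimension-reduction convolution."""
    def __init__(self, channels=64):
        super().__init__()
        # three 1x1 "first" convolution layers applied to the width feature map
        self.w1, self.w2, self.w3 = (nn.Conv2d(channels, channels, 1) for _ in range(3))
        # three 1x1 "second" convolution layers applied to the height feature map
        self.v1, self.v2, self.v3 = (nn.Conv2d(channels, channels, 1) for _ in range(3))
        self.proj_w = nn.Conv2d(channels, channels, 1)       # 1x1 conv after branch 1
        self.proj_h = nn.Conv2d(channels, channels, 1)       # 1x1 conv after branch 2
        self.reduce = nn.Conv2d(2 * channels, channels, 1)   # second dimension-reduction conv

    @staticmethod
    def _attend(query, key, value):
        b, c, h, w = value.shape
        q, k, v = (t.flatten(2) for t in (query, key, value))      # (B, C, H*W)
        attn = torch.softmax(torch.bmm(q.transpose(1, 2), k), -1)  # (B, HW, HW)
        return torch.bmm(v, attn).view(b, c, h, w)

    def forward(self, xw, xh):
        # branch 1: second width result x first height result, softmax, x first width result
        y_w = self.proj_w(self._attend(self.w2(xw), self.v1(xh), self.w1(xw)))
        # branch 2: third width result x second height result, softmax, x third height result
        y_h = self.proj_h(self._attend(self.w3(xw), self.v2(xh), self.v3(xh)))
        return self.reduce(torch.cat([y_w, y_h], dim=1))      # multi-dimensional fusion map
```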
After feature fusion, a multi-dimensional fusion feature map Fi^a′ (i ∈ [t-N, t+N] and i ≠ t) is obtained for each adjacent video frame. All the multi-dimensional fusion feature maps and the key feature map are input into the feature reconstruction layer, where a series operation is performed on them along the channel dimension and the result is input into an M×M convolution layer for convolution to obtain a reconstruction convolution result. The reconstruction convolution result is input into a plurality of residual Swin Transformer blocks for feature extraction, and the result, together with the key feature map, is then input into the sub-pixel upsampling layer and reconstruction layer sub-modules for processing to obtain the high-resolution video frame.
Assuming that the M×M convolution layer is a 3×3 convolution layer, the embodiment of the invention also provides a formulation of the feature reconstruction process:
Ffus = Conv3×3(Concat(Ft, Ft-N^a′, ..., Ft+N^a′))
It^SR = Netup(NetRSTBs(Ffus), Ft)
wherein It^SR is the high-resolution video frame corresponding to the video frame to be reconstructed; Ffus is the output of the 3×3 convolution layer; Ft is the key feature map; Fi^a′ is the i-th multi-dimensional fusion feature map; Concat(·) characterizes the series operation along the channel dimension; Netup(·) consists of the sub-pixel upsampling layer and reconstruction layer sub-modules; NetRSTBs(·) characterizes the plurality of residual Swin Transformer blocks.
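As a rough PyTorch sketch only: the residual Swin Transformer blocks are replaced by a plain convolutional stand-in, and the way the key feature map re-enters before upsampling (a simple addition here) is an assumption, since the description only states that the RSTB output and the key feature map are both passed to the sub-pixel upsampling and reconstruction sub-modules.

```python
import torch
import torch.nn as nn

class Reconstruction(nn.Module):
    """Concatenate the key feature map with all multi-dimensional fusion maps,
    fuse them with a 3x3 convolution, refine, then upsample to the SR frame."""
    def __init__(self, channels=64, num_neighbors=2, scale=4, out_channels=3):
        super().__init__()
        self.fuse = nn.Conv2d((num_neighbors + 1) * channels, channels, 3, padding=1)
        # stand-in for the residual Swin Transformer blocks (RSTBs) described above
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.upsample = nn.Sequential(                         # sub-pixel upsampling layer
            nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, out_channels, 3, padding=1))   # reconstruction layer

    def forward(self, key_feat, fusion_feats):
        # fusion_feats: list of multi-dimensional fusion feature maps, one per adjacent frame
        f_fus = self.fuse(torch.cat([key_feat] + fusion_feats, dim=1))
        # assumption: the key feature map is added back before upsampling
        return self.upsample(self.refine(f_fus) + key_feat)    # high-resolution video frame
```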
Based on the same inventive concept, the embodiment of the invention also provides a video super-resolution device. The basic principle and the technical effects are the same as those of the above embodiments, and for brevity, reference is made to the corresponding matters in the above embodiments where the description of the present embodiment is omitted.
Referring to fig. 5, fig. 5 is a block diagram illustrating a video super-resolution device 200 according to an embodiment of the invention. The video super-resolution device 200 is applied to an electronic device, and the video super-resolution device 200 comprises a preprocessing module 201, a fusion module 202 and a reconstruction module 203.
The preprocessing module 201 is configured to fuse adjacent feature graphs and key feature graphs in series to obtain an initial feature graph; the key feature map characterizes the features of the video frame to be reconstructed; the adjacent feature map is obtained after the feature of the adjacent video frame is aligned with the key feature map.
The fusion module 202 is configured to fuse features of different widths and different heights along the width direction and the height direction of the initial feature map to obtain a corresponding width feature map and height feature map; the width characteristic diagram representation fuses a plurality of characteristics of the same height in each adjacent video frame and the video frame to be reconstructed; the height feature map representation fuses multiple features of the same width in each adjacent video frame and the video frame to be reconstructed.
The fusion module 202 is further configured to input the width feature map and the height feature map corresponding to the initial feature map into a cross-attention network to obtain a multi-dimensional fusion feature map; the multi-dimensional fusion feature map is obtained by cross fusion of the width feature map and the height feature map.
And the reconstruction module 203 is configured to perform feature reconstruction according to the multi-dimensional fusion feature map and the key feature map, so as to obtain a high-resolution video frame.
In summary, the video super-resolution device provided by the embodiment of the invention adopts the separate convolution structure to perform feature fusion of different widths and different heights on the video frame to be reconstructed and each adjacent video frame to obtain the width feature map and the height feature map, and performs feature reconstruction on the multi-dimensional fusion feature map and the key feature map after interleaving and fusion of the width feature map and the height feature map to obtain the high-resolution video frame, so that the feature fusion calculation amount is reduced, and meanwhile, the feature fusion capability of long distance and different directions is improved, thereby effectively improving the efficiency and quality of the video super-resolution, and effectively reducing visual artifacts.
Optionally, the fusion module 202 is specifically configured to input the initial feature map to the first layer normalization module to obtain a first normalized feature map; performing convolution operation on the first normalized feature map by adopting a plurality of convolution kernel parameters with different widths to obtain a plurality of width convolution results corresponding to the first normalized feature map; and fusing a plurality of width convolution results corresponding to the first normalized feature map to obtain a width feature map.
Optionally, the fusion module 202 is specifically configured to input the initial feature map to the second-layer normalization module to obtain a corresponding second normalized feature map; performing convolution operation on the second normalized feature map by adopting a plurality of convolution kernel parameters with different heights to obtain a plurality of height convolution results corresponding to the second normalized feature map; and fusing a plurality of height convolution results corresponding to the second normalized feature map to obtain the height feature map.
Optionally, the cross-attention network comprises a first convolution layer and a second convolution layer. The fusion module 202 is specifically configured to perform convolution operation on the width feature map by using three first convolution layers with different parameters to obtain a first width result, a second width result, and a third width result; performing convolution operation on the height feature map by using second convolution layers with three different parameters to obtain a first height result, a second height result and a third height result; feature fusion is carried out according to the first width result, the second width result and the first height result, and a first cross feature diagram is obtained; feature fusion is carried out according to the second height result, the third height result and the third width result, and a second cross feature diagram is obtained; and carrying out series fusion on the first cross feature map and the second cross feature map to obtain multi-dimensional fusion feature maps corresponding to the adjacent video frames.
Optionally, the preprocessing module 201 is specifically configured to perform a series operation on the adjacent feature map and the key feature map, so as to obtain a corresponding first spliced feature map; and inputting the first spliced feature map into a first dimension-reduction convolution layer to carry out convolution operation, so as to obtain an initial feature map.
Optionally, the preprocessing module 201 is further configured to acquire a video frame to be reconstructed and at least one adjacent video frame corresponding to the video frame to be reconstructed; respectively extracting features of a video frame to be reconstructed and at least one adjacent video frame to obtain a key feature map and at least one initial adjacent feature map; the initial adjacent feature map corresponds to the adjacent video frames one by one; and respectively carrying out feature alignment on at least one initial adjacent feature map and the key feature map to obtain an adjacent feature map corresponding to at least one adjacent video frame.
Optionally, the preprocessing module 201 is specifically configured to, for each adjacent feature map, input the adjacent feature map and the key feature map into the first deformable convolution layer for feature alignment to obtain an intermediate feature output by the first deformable convolution layer; input the intermediate feature output by the previous deformable convolution layer together with the key feature map into the next deformable convolution layer for feature alignment, until the intermediate feature output by the last deformable convolution layer is obtained; and determine the intermediate feature output by the last deformable convolution layer as the adjacent feature map of the corresponding adjacent video frame; wherein the plurality of deformable convolution layers are connected in series.
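The cascaded alignment may be pictured with the sketch below, which uses torchvision's DeformConv2d. The offset-prediction convolutions, the number of cascaded stages and the kernel size are assumptions; the embodiment only specifies that each stage aligns the intermediate feature to the key feature map with a deformable convolution and that the stages are connected in series.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CascadedAlignment(nn.Module):
    """Sketch of cascaded deformable-convolution alignment: each stage predicts sampling
    offsets from the current feature and the key feature map, applies a deformable
    convolution, and feeds the result to the next stage."""
    def __init__(self, channels: int, num_stages: int = 3, kernel_size: int = 3):
        super().__init__()
        offset_channels = 2 * kernel_size * kernel_size  # x/y offset per kernel position
        self.offset_preds = nn.ModuleList(
            nn.Conv2d(2 * channels, offset_channels, kernel_size, padding=kernel_size // 2)
            for _ in range(num_stages)
        )
        self.deform_convs = nn.ModuleList(
            DeformConv2d(channels, channels, kernel_size, padding=kernel_size // 2)
            for _ in range(num_stages)
        )

    def forward(self, adjacent_map: torch.Tensor, key_map: torch.Tensor) -> torch.Tensor:
        feat = adjacent_map
        for pred, dconv in zip(self.offset_preds, self.deform_convs):
            # Offsets are predicted from the concatenation of the current intermediate
            # feature and the key feature map, then used to warp the current feature.
            offset = pred(torch.cat([feat, key_map], dim=1))
            feat = dconv(feat, offset)
        return feat  # aligned adjacent feature map
```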
Referring to fig. 6, fig. 6 is a block diagram of an electronic device 100 according to an embodiment of the invention. The electronic device 100 may be a personal computer, a notebook computer, a server, or the like. The electronic device 100 includes a memory 110, a processor 120, and a communication module 130. The memory 110, the processor 120, and the communication module 130 are electrically connected directly or indirectly to each other to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The memory 110 is used for storing programs or data. The memory 110 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), etc.
The processor 120 is used to read/write data or programs stored in the memory 110 and perform corresponding functions. For example, the video super-resolution method disclosed in the above embodiments may be implemented when a computer program stored in the memory 110 is executed by the processor 120.
The communication module 130 is used for establishing a communication connection between the electronic device 100 and other communication terminals through a network, and for transceiving data through the network.
It should be understood that the structure shown in fig. 6 is merely a schematic diagram of the structure of the electronic device 100, and that the electronic device 100 may also include more or fewer components than shown in fig. 6, or have a different configuration than shown in fig. 6. The components shown in fig. 6 may be implemented in hardware, software, or a combination thereof.
Embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by the processor 120, implements the video super-resolution method disclosed in the above embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatuses, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present invention may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part thereof contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The above description covers only preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations can be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A video super-resolution method, the method comprising:
fusing adjacent feature maps and key feature maps in series to obtain an initial feature map; wherein the key feature map characterizes features of a video frame to be reconstructed, and each adjacent feature map is obtained after features of an adjacent video frame are aligned with the key feature map;
respectively carrying out feature fusion of different widths and different heights along the width direction and the height direction of the initial feature map to obtain a corresponding width feature map and a corresponding height feature map;
inputting the width feature map and the height feature map corresponding to the initial feature map into a cross attention network to obtain a multi-dimensional fusion feature map;
and carrying out feature reconstruction according to the multi-dimensional fusion feature map and the key feature map to obtain a high-resolution video frame.
2. The method of claim 1, wherein the step of obtaining a width feature map comprises:
inputting the initial feature map to a first layer normalization module to obtain a first normalized feature map;
performing convolution operation on the first normalized feature map by adopting a plurality of convolution kernel parameters with different widths to obtain a plurality of width convolution results corresponding to the first normalized feature map;
and fusing the plurality of width convolution results corresponding to the first normalized feature map to obtain the width feature map.
3. The method of claim 1, wherein the step of obtaining a height profile comprises:
inputting the initial feature map to a second layer normalization module to obtain a second normalized feature map;
performing convolution operation on the second normalized feature map by adopting a plurality of convolution kernel parameters with different heights to obtain a plurality of height convolution results corresponding to the second normalized feature map;
and fusing the plurality of height convolution results corresponding to the second normalized feature map to obtain the height feature map.
4. The video super-resolution method as claimed in claim 1, wherein the cross-attention network comprises a first convolution layer and a second convolution layer; inputting the width feature map and the height feature map corresponding to the initial feature map into a cross attention network to obtain a multi-dimensional fusion feature map, wherein the multi-dimensional fusion feature map comprises:
performing convolution operation on the width feature map by using three first convolution layers with different parameters to obtain a first width result, a second width result and a third width result;
performing convolution operation on the height feature map by using three second convolution layers with different parameters to obtain a first height result, a second height result and a third height result;
performing feature fusion according to the first width result, the second width result and the first height result to obtain a first cross feature map;
performing feature fusion according to the second height result, the third height result and the third width result to obtain a second cross feature map;
and performing series fusion on the first cross feature map and the second cross feature map to obtain the multi-dimensional fusion feature map corresponding to the adjacent video frames.
5. The method of claim 1, wherein the fusing adjacent feature maps and key feature maps in series to obtain an initial feature map comprises:
performing series operation on the adjacent feature images and the key feature images to obtain corresponding first spliced feature images;
and inputting the first spliced feature map into a first dimension reduction convolution layer to carry out convolution operation, so as to obtain an initial feature map.
6. The video super-resolution method as claimed in claim 1, wherein before the step of fusing adjacent feature maps and key feature maps in series to obtain an initial feature map, the method further comprises:
acquiring the video frame to be reconstructed and at least one adjacent video frame corresponding to the video frame to be reconstructed;
respectively extracting features of the video frame to be reconstructed and the at least one adjacent video frame to obtain a key feature map and at least one initial adjacent feature map; the initial adjacent feature maps correspond to the adjacent video frames one by one;
and respectively carrying out feature alignment on the at least one initial adjacent feature map and the key feature map to obtain an adjacent feature map corresponding to the at least one adjacent video frame.
7. The method according to claim 6, wherein the performing feature alignment on the at least one initial neighboring feature map and the key feature map to obtain neighboring feature maps corresponding to the at least one neighboring video frame includes:
inputting, for each adjacent feature map, the adjacent feature map and the key feature map into a first deformable convolution layer for feature alignment to obtain an intermediate feature output by the first deformable convolution layer;
inputting the intermediate feature output by the previous deformable convolution layer and the key feature map into the next deformable convolution layer for feature alignment, until the intermediate feature output by the last deformable convolution layer is obtained, and determining the intermediate feature output by the last deformable convolution layer as the adjacent feature map of the corresponding adjacent video frame; a plurality of the deformable convolution layers are connected in series.
8. A video super-resolution apparatus, the apparatus comprising:
the preprocessing module is used for fusing adjacent feature images and key feature images in series to obtain an initial feature image; the key feature map characterizes the features of the video frame to be reconstructed; the adjacent feature map is obtained after the feature of the adjacent video frame is aligned with the key feature map;
the fusion module is used for respectively carrying out feature fusion of different widths and different heights along the width direction and the height direction of the initial feature map to obtain a corresponding width feature map and a corresponding height feature map;
The fusion module is also used for inputting the width feature map and the height feature map corresponding to the initial feature map into a cross attention network to obtain a multi-dimensional fusion feature map;
and the reconstruction module is used for carrying out feature reconstruction according to the multi-dimensional fusion feature map and the key feature map to obtain a high-resolution video frame.
9. An electronic device comprising a processor and a memory, the memory storing machine executable instructions executable by the processor to implement the video super-resolution method of any one of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the video super-resolution method according to any one of claims 1-7.
CN202410561171.5A 2024-05-08 2024-05-08 Video super-resolution method, device, electronic equipment and storage medium Pending CN118279154A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410561171.5A CN118279154A (en) 2024-05-08 2024-05-08 Video super-resolution method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410561171.5A CN118279154A (en) 2024-05-08 2024-05-08 Video super-resolution method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118279154A true CN118279154A (en) 2024-07-02

Family

ID=91648727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410561171.5A Pending CN118279154A (en) 2024-05-08 2024-05-08 Video super-resolution method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118279154A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination