CN115994857B - Video super-resolution method, device, equipment and storage medium

Video super-resolution method, device, equipment and storage medium

Info

Publication number
CN115994857B
CN115994857B
Authority
CN
China
Prior art keywords
video
video frame
super-resolution
module
Prior art date
Legal status
Active
Application number
CN202310029515.3A
Other languages
Chinese (zh)
Other versions
CN115994857A (en)
Inventor
骆剑平
侯凯旋
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN202310029515.3A
Publication of CN115994857A
Application granted
Publication of CN115994857B
Active legal status: Current
Anticipated expiration

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video super-resolution method, device, equipment and storage medium. The method comprises the following steps: acquiring a video frame sequence to be processed, where the video frame sequence to be processed comprises at least one video frame group and each video frame group is formed by three consecutive video frames; inputting the video frame sequence to be processed into a preset video super-resolution model, and determining the super-resolution video frame sequence corresponding to the video frame sequence to be processed according to the output generation result. The video super-resolution model is a neural network model trained by a set training method, and at least comprises a multi-branch feature fusion module, which is used for extracting feature information of different receptive fields and high-frequency feature information in parallel and fusing them. According to the technical scheme provided by the embodiments of the invention, the accuracy of super-resolution video frame reconstruction is enhanced while the number of parameters and the amount of data computation are reduced.

Description

Video super-resolution method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for video super resolution.
Background
With the popularization of high-resolution electronic equipment in daily life, the demand of various industries for high-resolution video is increasing. However, because high-definition imaging equipment is expensive and the cost of transmitting high-definition video is high, the resolution of mainstream video is currently still 720P or even lower, which is difficult to meet the needs of the general public.
At present, an end-to-end video super-resolution neural network model based on dynamic upsampling filters (Dynamic Upsampling Filters, DUF) has been proposed to realize spatio-temporal feature extraction and reconstruction of low-resolution video frames, so as to obtain corresponding high-resolution video frames. A basic video super-resolution network model with a bidirectional recurrent neural network structure (Basic Video Super-Resolution, BasicVSR) has also been proposed, which can transfer the spatio-temporal information of distant forward and backward video frames to the current frame and reconstruct high-resolution video frames from low-resolution video frames.
However, the model structures of video super-resolution models such as DUF and BasicVSR are complex. For example, the DUF network model uses three-dimensional convolution to extract the spatio-temporal features of the video frames, but the three-dimensional convolution scheme has a large number of parameters in practical application, a large amount of computation, and is difficult to converge during training. The BasicVSR model has an excellent super-resolution effect, but the bidirectional recurrent network structure it adopts requires bidirectional video frames to be input during training and testing, and the model is huge and difficult to apply to real-time super-resolution reconstruction of video in daily life.
Disclosure of Invention
The invention provides a video super-resolution method, device, equipment and storage medium, which realize super-resolution reconstruction of low-resolution video through a compact and lightweight neural network model, reducing the number of parameters and the amount of data computation while improving the richness of the information extracted from video frames and enhancing the accuracy of super-resolution video frame reconstruction.
In a first aspect, an embodiment of the present invention provides a video super-resolution method, where the method includes:
acquiring a video frame sequence to be processed; the video frame sequence to be processed comprises at least one video frame group, wherein the video frame group is formed by three continuous video frames;
inputting the video frame sequence to be processed into a preset video super-resolution model, and determining a super-resolution video frame sequence corresponding to the video frame sequence to be processed according to the output generation result;
the video super-resolution model is a neural network model trained by a set training method; the video super-resolution model at least comprises a multi-branch feature fusion module, wherein the multi-branch feature fusion module is used for extracting the feature information and the high-frequency feature information of different receptive fields in parallel and fusing the feature information and the high-frequency feature information of the different receptive fields.
In a second aspect, an embodiment of the present invention further provides a video super-resolution apparatus, where the video super-resolution apparatus includes:
the video frame acquisition module is used for acquiring a video frame sequence to be processed; the video frame sequence to be processed comprises at least one video frame group, wherein the video frame group is formed by three continuous video frames;
the super-resolution frame generation module is used for inputting the video frame sequence to be processed into a preset video super-resolution model, and determining the super-resolution video frame sequence corresponding to the video frame sequence to be processed according to the output generation result;
the video super-resolution model is a neural network model trained by a set training method; the video super-resolution model at least comprises a multi-branch feature fusion module, wherein the multi-branch feature fusion module is used for extracting the feature information and the high-frequency feature information of different receptive fields in parallel and fusing the feature information and the high-frequency feature information of the different receptive fields.
In a third aspect, an embodiment of the present invention further provides a video super-resolution apparatus, where the video super-resolution apparatus includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
The memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the video super-resolution method of any one of the embodiments of the invention.
In a fourth aspect, embodiments of the present invention further provide a computer readable storage medium storing computer instructions for causing a processor to implement the video super-resolution method of any of the embodiments of the present invention when executed.
The embodiments of the invention provide a video super-resolution method, device, equipment and storage medium. A video frame sequence to be processed is acquired, where the video frame sequence to be processed comprises at least one video frame group and each video frame group is formed by three consecutive video frames; the video frame sequence to be processed is input into a preset video super-resolution model, and the super-resolution video frame sequence corresponding to the video frame sequence to be processed is determined according to the output generation result. The video super-resolution model is a neural network model trained by a set training method, and at least comprises a multi-branch feature fusion module, which is used for extracting feature information of different receptive fields and high-frequency feature information in parallel and fusing them. By adopting this technical scheme, the acquired low-resolution video frame sequence to be processed is input into a pre-trained video super-resolution model that contains a multi-branch feature fusion module. Through the multi-branch parallel network structure, feature extraction and feature fusion are performed on the input video frame sequence from multiple dimensions, so that features of different dimensions and high-frequency features can be fused within the multi-branch feature fusion module, achieving lightweight multi-feature extraction and fusion and, in turn, high-resolution reconstruction of the low-resolution video frame sequence to be processed, so as to obtain the corresponding super-resolution video frame sequence. This solves the problems of complex structure and large amount of computation in existing video super-resolution models: the multi-branch feature fusion module realizes multi-dimensional feature extraction and fusion of different frequency types, reduces the number of parameters and the amount of data computation, improves the richness of the information extracted from video frames, and enhances the accuracy of super-resolution video frame reconstruction.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a video super-resolution method according to a first embodiment of the present invention;
fig. 2 is a flowchart of a video super-resolution method according to a second embodiment of the present invention;
fig. 3 is a flowchart illustrating a procedure of inputting a video frame group to a video frame alignment module and determining an output of the video frame alignment module as an aligned video frame according to a second embodiment of the present invention;
FIG. 4 is a flowchart illustrating a process of performing optical flow estimation on adjacent video frames by an optical flow estimation module according to a second embodiment of the present invention;
FIG. 5 is a diagram illustrating a structure of a multi-layer two-dimensional convolution unit according to a second embodiment of the present invention;
fig. 6 is a flowchart illustrating a process of inputting an aligned video frame into a multi-branch feature fusion module for feature extraction and fusion, and determining a fused feature map according to a second embodiment of the present invention;
fig. 7 is a diagram illustrating a structure of a residual block according to a second embodiment of the present invention;
FIG. 8 is a diagram illustrating a structure of a dense block according to a second embodiment of the present invention;
FIG. 9 is a flowchart illustrating a training of a video super-resolution model by using a set training method according to a second embodiment of the present invention;
fig. 10 is a diagram illustrating a structure of a video super-resolution model according to a second embodiment of the present invention;
fig. 11 is a schematic structural diagram of a video super-resolution device according to a third embodiment of the present invention;
fig. 12 is a schematic structural diagram of a video super-resolution device according to a fourth embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a video super-resolution method according to a first embodiment of the present invention. The embodiment of the present invention is applicable to performing super-resolution processing on low-resolution video frames to obtain high-resolution video frames. The method may be performed by a video super-resolution apparatus, which may be implemented in software and/or hardware and may be configured on video super-resolution equipment; the video super-resolution equipment may be a computer device such as a notebook computer, a desktop computer or a smart tablet.
As shown in fig. 1, a video super-resolution method provided in the first embodiment of the present invention specifically includes the following steps:
s101, acquiring a video frame sequence to be processed.
The video frame sequence to be processed comprises at least one video frame group, and the video frame group is formed by three continuous video frames.
In this embodiment, the video frame sequence to be processed may be specifically understood as a set of video frames acquired by a video acquisition device or acquired by different channels and needing resolution improvement. The video frame group can be specifically understood as a group of continuous video frames divided according to a preset rule in a video frame sequence to be processed to meet the processing requirement of the video super-resolution model, and optionally, the video frame group can include three continuous video frames.
Specifically, when video data needing resolution improvement is acquired by the video acquisition device or when video data needing resolution improvement is acquired by other channels such as a network, the video data is formed by a plurality of video frames which are continuous in time, the video data can be processed into a plurality of different video frame groups according to a preset rule, each video frame group comprises three continuous video frames, and each video frame group is further sequenced according to the time sequence comprising the video frames, so that a corresponding video frame sequence to be processed is obtained.
S102, inputting the video frame sequence to be processed into a preset video super-resolution model, and determining the super-resolution video frame sequence corresponding to the video frame sequence to be processed according to the output generation result.
The video super-resolution model is a neural network model trained by a set training method; the video super-resolution model at least comprises a multi-branch feature fusion module, wherein the multi-branch feature fusion module is used for extracting the feature information and the high-frequency feature information of different receptive fields in parallel and fusing the feature information and the high-frequency feature information of the different receptive fields.
In this embodiment, the video super-resolution model may be specifically understood as a neural network model for reconstructing the video frames in an input low-resolution video frame sequence to be processed. A super-resolution video frame sequence is to be understood as a set of video frames, corresponding to the individual video frames of the video frame sequence to be processed, for which high-resolution reconstruction has been completed. The multi-branch feature fusion module can be specifically understood as a neural network sub-model with a multi-branch parallel network structure, used for multi-scale feature extraction and fusion in the video super-resolution model. The receptive field (Receptive Field) is specifically understood as the size of the area on the input image that a pixel on the feature map output by each layer of a convolutional neural network is mapped back to: the smaller the receptive field, the more detailed the extracted features, and the larger the receptive field, the coarser the extracted features. In a neural network model, the receptive field grows as the network deepens. The high-frequency feature information can be specifically understood as feature information that appears with higher frequency in video frames and can be used to represent the contours of objects.
In general, a neural network model (Neural Network, NN) is understood as a complex network system formed by a large number of simple processing units (also referred to as neurons) that are widely connected to each other; it can reflect many basic features of human brain function and is a highly complex nonlinear dynamic learning system. In short, a neural network model can be understood as a mathematical model based on neurons. A neural network model is composed of a plurality of neural network layers; different neural network layers can perform different processing, such as convolution and normalization, on the data input into them, and a plurality of different neural network layers can be combined according to preset rules to form the different modules of the neural network model. Optionally, the video super-resolution model provided in the embodiment of the present invention may be a neural network model trained with an Adam optimizer, or may be a neural network model trained in other ways, which is not limited in the embodiment of the present invention.
Specifically, the video frame sequence to be processed is input into a preset video super-resolution model, so that the video super-resolution model can perform corresponding processing on each video frame group in the video frame sequence to be processed, rough feature extraction is performed on video frames needing to be reconstructed in each video frame group, and extraction and fusion are performed on detail features with different receptive fields and different frequencies in the video frame groups through a multi-branch feature fusion module, so that super-resolution video frames corresponding to each video frame group are obtained, and further the super-resolution video frame sequence corresponding to the video frame sequence to be processed can be obtained according to arrangement of each video frame group in the video frame sequence to be processed.
It should be clear that the video frame sequence to be processed in the embodiment of the present invention may be represented directly by the consecutive video frames, or by the concatenation of the individual video frame groups, which is not limited in the embodiment of the present invention. By way of example, assuming that the video frame sequence to be processed includes three video frame groups consisting of frames [1,2,3], [2,3,4] and [3,4,5] of the video data, the video frame sequence to be processed may be denoted as [1,2,3,4,5] or as [1,2,3,2,3,4,3,4,5]; the embodiments of the present invention are not limited in this respect.
According to the technical scheme of this embodiment, a video frame sequence to be processed is obtained, where the video frame sequence to be processed comprises at least one video frame group and each video frame group is formed by three consecutive video frames; the video frame sequence to be processed is input into a preset video super-resolution model, and the super-resolution video frame sequence corresponding to the video frame sequence to be processed is determined according to the output generation result. The video super-resolution model is a neural network model trained by a set training method, and at least comprises a multi-branch feature fusion module, which is used for extracting feature information of different receptive fields and high-frequency feature information in parallel and fusing them. By adopting this technical scheme, the acquired low-resolution video frame sequence to be processed is input into a pre-trained video super-resolution model that contains a multi-branch feature fusion module. Through the multi-branch parallel network structure, feature extraction and feature fusion are performed on the input video frame sequence from multiple dimensions, so that features of different dimensions and high-frequency features can be fused within the multi-branch feature fusion module, achieving lightweight multi-feature extraction and fusion and, in turn, high-resolution reconstruction of the low-resolution video frame sequence to be processed, so as to obtain the corresponding super-resolution video frame sequence. This solves the problems of complex structure and large amount of computation in existing video super-resolution models: the multi-branch feature fusion module realizes multi-dimensional feature extraction and fusion of different frequency types, reduces the number of parameters and the amount of data computation, improves the richness of the information extracted from video frames, and enhances the accuracy of super-resolution video frame reconstruction.
Example 2
Fig. 2 is a flowchart of a video super-resolution method provided by a second embodiment of the present invention. The technical solution of the second embodiment of the present invention may be further optimized on the basis of the above alternative technical solutions. The manner of generating the video frame sequence to be processed from the video data to be processed is clarified, and, in the case that the video super-resolution model includes a video frame super-division module, a video frame alignment module, a multi-branch feature fusion module and an up-sampling module, coarse feature extraction and preliminary super-division are performed on the video frame sequence to be processed input into the video super-resolution model by the video frame super-division module, while alignment of the video frames and extraction, fusion and amplification of detailed features are implemented through the video frame alignment module, the multi-branch feature fusion module and the up-sampling module. This reduces the amount of data that needs to be input into the video super-resolution model, improves the richness of the information extracted from the video frames, and enhances the accuracy of the obtained super-resolution video frame sequence.
As shown in fig. 2, the method for video super-resolution provided in the second embodiment of the present invention specifically includes the following steps:
s201, obtaining video data to be processed, dividing the video data to be processed through a preset sliding window, and determining at least one video frame group.
The window size of the preset sliding window is three frames, and the sliding step length is one frame.
In this embodiment, the video data to be processed may be specifically understood as video data that needs to be improved in resolution, and exemplary, the video data to be processed may be an image with resolution improvement requirement that is propagated in the internet, or may be an image directly collected by a video collecting device, where resolution is difficult to meet actual requirements. The preset sliding window is specifically understood as a window which is preset in size according to actual conditions and is used for dividing video frames in video data to be processed. Optionally, according to the data processing requirement of the super-resolution video model, for the low-resolution video frame to be reconstructed, the information in the preceding and following two video frames is required to be referred to, so that the window size of the preset sliding window can be set to be three frames, and in order to ensure the continuity of super-resolution processing of each video frame in the video data to be processed, the sliding step length of the preset sliding window can be set to be one frame, so that the super-resolution video frame output by the video super-resolution model corresponds to the continuous video frame in the video data to be processed.
Specifically, the video data to be processed is obtained through different channels; according to the data processing requirement of the video super-resolution model, the window size of the preset sliding window is set to three frames and the sliding step length is set to one frame. The video data to be processed is divided by the preset sliding window, and the three consecutive video frames inside the window are determined as one video frame group; that is, for the video data to be processed, the number of complete slides of the preset sliding window is the number of video frame groups into which the video data to be processed can be divided.
S202, sequencing each video frame group according to the time sequence, and determining a video frame sequence to be processed.
Specifically, since the video frames in the video data to be processed are sequentially arranged according to the shooting time sequence, when the video data to be processed is divided into a plurality of video frame groups, in order to ensure the continuity of data processing, each video frame group needs to be ordered according to the time sequence, and the set of each video frame group after the ordering is determined as the video frame sequence to be processed. Optionally, the shooting time of the first video frame in each video frame group may be used as a ranking basis, and the shooting time of the intermediate video frame to be reconstructed in each video frame group may also be used as a ranking basis.
For example, assuming that the video data to be processed is a video composed of ten consecutive video frames, which may be expressed as [1,2,3,4,5,6,7,8,9,10], divided by a predetermined sliding window, eight video frame groups of [1,2,3], [2,3,4], [3,4,5], [4,5,6], [5,6,7], [6,7,8], [7,8,9] and [8,9,10] may be obtained, and the eight video frame groups may be ordered according to the photographing time of the first video frame or the intermediate video frame, and the resulting sequence of the video frames to be processed may be expressed as [1,2,3,2,3,4,3,4,5,4,5,6,5,6,7,6,7,8,7,8,9,8,9,10] or [1,2,3,4,5,6,7,8,9,10], which is not limited in the embodiment of the present invention.
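The sliding-window grouping described above can be illustrated with a short sketch. The following Python snippet is only an illustration written for this description; the function name and the list-based frame representation are assumptions and are not taken from the patent.

```python
def group_video_frames(frames, window=3, stride=1):
    """Divide a list of video frames into overlapping groups using a sliding window.

    frames: list of frames ordered by capture time (e.g. arrays or indices).
    Returns a list of groups; each group holds `window` consecutive frames.
    """
    if len(frames) < window:
        return []
    return [frames[i:i + window] for i in range(0, len(frames) - window + 1, stride)]


# Example: ten consecutive frames produce eight groups of three frames each,
# matching [1,2,3], [2,3,4], ..., [8,9,10] in the description above.
groups = group_video_frames(list(range(1, 11)), window=3, stride=1)
print(groups)        # [[1, 2, 3], [2, 3, 4], ..., [8, 9, 10]]
print(len(groups))   # 8
```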
Further, the video super-resolution model also comprises a video frame super-division module, a video frame alignment module and an up-sampling module.
In this embodiment, the video frame super-resolution module is specifically understood to be used for up-sampling the video frame input therein, so as to implement preliminary super-resolution for the video frame lack of details, and a combination of multiple neural network layers in the video super-resolution model. Optionally, the video frame super-division module may process the video frame input therein by bilinear interpolation up-sampling to obtain a preliminary super-divided video frame. The video frame alignment module is specifically understood as a module for determining positional offset information between different video frames for a video frame group input therein, and completing the alignment and combination of multiple neural network layers in the combined video super-resolution model for each video frame. The upsampling module is specifically understood to be used for performing pixel rearrangement on the feature map input containing the detail features of the video frame, so as to implement the combination of multiple neural network layers in the video super-resolution model aiming at the super-division of the detail content of the video frame.
S203, for each video frame group in the video frame sequence to be processed, respectively inputting the video frame groups into a video frame superdivision module and a video frame alignment module in the video super-resolution model, determining the output of the video frame superdivision module as a first superdivision result, and determining the output of the video frame alignment module as an aligned video frame.
In this embodiment, the first superdivision result may be specifically understood as a result obtained after performing preliminary super-division, without detail features, on the video frame to be reconstructed in the video frame group. Further, the first superdivision result contains the rough feature information of the video frame to be super-resolution reconstructed, so that the modules other than the video frame super-division module in the video super-resolution model only need to extract, fuse and super-divide the detail feature information in the video frame group, which reduces the number of parameters in the video super-resolution model and the amount of data computation in actual use. The aligned video frame can be specifically understood as a video frame obtained by aligning the video frames in the video frame group input into the video super-resolution model with the frame to be reconstructed and combining them.
Specifically, since there may be a plurality of video frame groups in the video frame sequence to be processed, the embodiment of the present invention describes the processing of one video frame group input into the video super-resolution model. Among the three video frames of the video frame group, the video frame in the middle position may be determined as the frame to be reconstructed that requires super-resolution reconstruction, and the frame to be reconstructed is input into the video frame super-division module for rough feature extraction and up-sampling, so as to obtain the first superdivision result. At the same time, the whole video frame group is input into the video frame alignment module, which determines, for the two video frames other than the frame to be reconstructed, their positional offsets relative to the frame to be reconstructed, eliminates those offsets, and combines the two offset-corrected video frames with the frame to be reconstructed to obtain the corresponding aligned video frame.
For example, assuming that the frame to be reconstructed is a three-channel video frame, after the video frame group is processed by the video frame alignment module, the two video frames are combined with the frame to be reconstructed after the offset is eliminated, and the obtained aligned video frame is a nine-channel video frame.
In the embodiment of the invention, by arranging an independent video frame super-division module in the video super-resolution model, extraction of rough features from the input video frame is realized, so that the modules other than the video frame super-division module only need to extract and fuse the detailed features of the video frames input into them; the final super-resolution processing can then be completed by adding the detailed features and the rough features at the output. Distinguishing detailed features from rough features reduces the number of parameters of each module in the video super-resolution model and the amount of data that the detail-feature processing modules need to handle, thereby reducing the amount of data computation while ensuring the video super-resolution effect.
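As a minimal sketch of the coarse branch described above, the video frame super-division module can be approximated by bilinear interpolation up-sampling of the frame to be reconstructed, as the description allows. The PyTorch code below is an assumed illustration; the function name and the scale factor of 4 are chosen for demonstration and are not mandated by the patent.

```python
import torch
import torch.nn.functional as F

def coarse_superdivision(frame_to_reconstruct: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """Preliminary super-division of the center frame by bilinear interpolation up-sampling.

    frame_to_reconstruct: tensor of shape (N, C, H, W).
    Returns a tensor of shape (N, C, scale*H, scale*W) containing only coarse features.
    """
    return F.interpolate(frame_to_reconstruct, scale_factor=scale,
                         mode="bilinear", align_corners=False)

# Example: a 3-channel 64x64 low-resolution frame becomes a 256x256 coarse result.
lr_center = torch.randn(1, 3, 64, 64)
first_result = coarse_superdivision(lr_center)   # shape (1, 3, 256, 256)
```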
Further, the video frame alignment module includes a first optical flow estimation sub-module, a first alignment sub-module, a second optical flow estimation sub-module, and a second alignment sub-module.
Fig. 3 is a flowchart illustrating a procedure of inputting a video frame group to a video frame alignment module and determining an output of the video frame alignment module as an aligned video frame according to a second embodiment of the present invention, as shown in fig. 3, and specifically includes the following steps:
S2031, determining a first frame in the video frame group as a first reference frame, determining a second frame as a frame to be reconstructed, and determining a third frame as a second reference frame.
Specifically, because the same person or target object in a video changes position as the shooting time changes, while the position change between two consecutive video frames is not very large, when the video frame group contains three consecutive video frames the second frame can be used as the frame to be reconstructed for super-resolution reconstruction, the first frame shot at the previous moment is used as the first reference frame, and the third frame shot at the subsequent moment is used as the second reference frame, so that the information contained in the person or target object in the frame to be reconstructed is enriched based on the information of the same person or target object in the first reference frame and the second reference frame.
S2032, inputting the first reference frame and the frame to be reconstructed into a first optical flow estimation submodule to determine a first optical flow diagram.
In this embodiment, the first optical flow estimation submodule may be understood as a combination of a plurality of neural network layers for determining a first optical flow diagram between the first reference frame and the frame to be reconstructed. The first optical flow map may be understood as a feature map characterizing the motion speed and motion direction of each pixel in the adjacent first reference frame and the frame to be reconstructed.
Specifically, the first reference frame and the frame to be reconstructed are input to the first optical flow estimation sub-module, and the first reference frame is subjected to multiple warping alignments by the first optical flow estimation sub-module, so as to obtain the first optical flow map of the first reference frame relative to the frame to be reconstructed.
Fig. 4 is a flowchart illustrating a process of performing optical flow estimation on adjacent video frames by an optical flow estimation module according to the second embodiment of the present invention. As shown in fig. 4, the second embodiment takes the first optical flow estimation sub-module as an example of a pyramid-structured optical flow estimation module, which performs optical flow estimation on adjacent video frames from coarse to fine. Assume that the video frame sequence formed by two adjacent video frames is I_LR = {I_LR(t-1), I_LR(t)}, where t is time; then I_LR(t-1) can be represented as the first reference frame and I_LR(t) as the frame to be reconstructed. After the video frame sequence I_LR is input into the first optical flow estimation sub-module, the sub-module first downsamples I_LR twice, obtaining the corresponding video frame sequences I_LR1 and I_LR2, where the size of I_LR1 is half that of I_LR and the size of I_LR2 is one quarter that of I_LR. After downsampling, the I_LR2 video frame sequence is input to a multi-layer two-dimensional convolution unit in the first optical flow estimation sub-module to extract the pixel offset information from video frame I_LR2(t-1) to video frame I_LR2(t), generating the corresponding optical flow map F2. The optical flow map F2 is enlarged by a factor of two through bilinear interpolation, and the enlarged flow map is used to warp-align the half-size reference frame I_LR1(t-1); the warped frame and the video frame I_LR1(t) are then input into the multi-layer two-dimensional convolution unit to generate the corresponding optical flow map F1'. The optical flow map F1' is added to the enlarged F2, that is, the pixel offset information between the first reference frame and the frame to be reconstructed is integrated, giving the optical flow map F1. F1 is enlarged by a factor of two through bilinear interpolation, the enlarged flow map is used to warp-align the full-size reference frame I_LR(t-1), and the warped frame and the video frame I_LR(t) are input into the multi-layer two-dimensional convolution unit to generate the corresponding optical flow map F0'. The optical flow map F0' is added to the enlarged F1 to obtain the finally output first optical flow map F0.
Alternatively, the multi-layer two-dimensional convolution unit may specifically be formed by alternately using a plurality of two-dimensional convolution layers and a plurality of activation function layers, and fig. 5 is a schematic diagram of an exemplary structure of a multi-layer two-dimensional convolution unit according to a second embodiment of the present invention, and fig. 5 is only an example provided by an embodiment of the present invention, where the embodiment of the present invention does not limit the structure of a specific multi-layer two-dimensional convolution unit. As shown in fig. 5, where LR represents the input low resolution picture, F represents the output optical flow diagram, conv2d represents the two-dimensional convolution layer, and Relu represents the activation function.
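A minimal PyTorch sketch of such a multi-layer two-dimensional convolution unit is given below. The exact number of layers, channel widths, and the packing of the two frames into one input are assumptions made for illustration, since the description only specifies alternating two-dimensional convolution layers and ReLU activations with a flow map as output.

```python
import torch
import torch.nn as nn

class FlowConvUnit(nn.Module):
    """Alternating Conv2d/ReLU layers mapping a pair of frames to a 2-channel flow map."""
    def __init__(self, in_channels: int = 6, hidden: int = 32, num_layers: int = 4):
        super().__init__()
        layers = []
        ch = in_channels
        for _ in range(num_layers):
            layers += [nn.Conv2d(ch, hidden, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            ch = hidden
        # Final convolution outputs the 2-channel optical flow map (dx, dy per pixel).
        layers.append(nn.Conv2d(ch, 2, kernel_size=3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, reference: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Concatenate the (possibly warped) reference frame and the frame to be reconstructed.
        return self.body(torch.cat([reference, target], dim=1))
```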
In the embodiment of the invention, the first optical flow estimation submodule is set to be in a pyramid structure to extract the optical flow information of the adjacent video frames, and compared with other types of optical flow estimation methods such as FlowNet, deformable convolution and the like, under the condition of equivalent performance, the pyramid structure has fewer parameter amounts used in an extraction mode, so that the calculation speed is faster in an inference stage and the convergence is easier in a training stage. Compared with the implicit light flow estimation method using three-dimensional convolution, the first light flow estimation sub-module in the embodiment of the invention uses less parameter quantity and has better effect.
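Building on the convolution unit sketched above, the coarse-to-fine pyramid estimation can be outlined as follows. This is a hedged reconstruction of the flow described around fig. 4: the use of average pooling for downsampling, the warping via grid_sample, the rescaling of flow values when upsampling, and all class and function names are assumptions made for illustration, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `frame` toward the target frame using a dense optical flow map (N, 2, H, W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame.device)      # (2, H, W) base coordinates
    coords = grid.unsqueeze(0) + flow                                  # shifted sampling positions
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)              # (N, H, W, 2)
    return F.grid_sample(frame, grid_norm, align_corners=True)

class PyramidFlowEstimator(nn.Module):
    """Coarse-to-fine flow estimation over quarter, half and full resolution."""
    def __init__(self):
        super().__init__()
        self.unit = FlowConvUnit()   # the multi-layer 2D convolution unit sketched above

    def forward(self, reference: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        ref1, tgt1 = (F.avg_pool2d(x, 2) for x in (reference, target))   # half size
        ref2, tgt2 = (F.avg_pool2d(x, 2) for x in (ref1, tgt1))          # quarter size

        f2 = self.unit(ref2, tgt2)                                       # coarsest flow map
        f2_up = 2.0 * F.interpolate(f2, scale_factor=2, mode="bilinear", align_corners=False)
        f1 = self.unit(warp(ref1, f2_up), tgt1) + f2_up                  # refine at half size
        f1_up = 2.0 * F.interpolate(f1, scale_factor=2, mode="bilinear", align_corners=False)
        f0 = self.unit(warp(reference, f1_up), target) + f1_up           # refine at full size
        return f0                                                        # first optical flow map
```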
S2033, inputting the first reference frame and the first optical flow map to the first alignment sub-module, and determining a first aligned video frame corresponding to the first reference frame.
In this embodiment, the first alignment sub-module may be specifically understood as a combination of multiple neural network layers for aligning each pixel point in the first reference frame with the frame to be reconstructed. The first aligned video frame may be understood as a video frame comprising the pixel positions of the target object in complete alignment with the pixel positions of the target object in the frame to be reconstructed.
Specifically, a first reference frame and a first optical flow diagram are input to a first alignment sub-module, and each pixel point in the first reference frame is rearranged according to position offset information of each pixel point contained in the first optical flow diagram, so that alignment operation of a target object in the first reference frame and a target object in a frame to be reconstructed is realized, and a first alignment video frame is obtained. Alternatively, any pixel alignment method that meets the requirements may be used in the first alignment sub-module, which is not limited in the embodiment of the present invention.
S2034, inputting the second reference frame and the frame to be reconstructed into a second optical flow estimation submodule to determine a second optical flow diagram.
In this embodiment, the second optical flow estimation submodule may be understood as a combination of a plurality of neural network layers for determining the second optical flow map between the second reference frame and the frame to be reconstructed. The second optical flow map may be understood as a feature map characterizing the motion speed and motion direction of each pixel in the adjacent second reference frame and frame to be reconstructed.
Specifically, the second reference frame and the frame to be reconstructed are input to the second optical flow estimation sub-module, and the second reference frame is subjected to multiple warping alignments by the second optical flow estimation sub-module, so as to obtain the second optical flow map of the second reference frame relative to the frame to be reconstructed. Optionally, the manner in which the second optical flow estimation sub-module performs optical flow estimation is the same as that of the first optical flow estimation sub-module; the specific flow is shown in fig. 4 and is not described in detail in the embodiment of the present invention.
S2035, inputting the second reference frame and the second optical flow map to the second alignment sub-module, and determining a second aligned video frame corresponding to the second reference frame.
In this embodiment, the second alignment sub-module may be specifically understood as a combination of multiple neural network layers for aligning each pixel point in the second reference frame with the frame to be reconstructed. A second aligned video frame may be understood as a video frame comprising the pixel positions of the target object in complete alignment with the pixel positions of the target object in the frame to be reconstructed.
Specifically, a second reference frame and a second optical flow diagram are input to a second alignment sub-module, and each pixel point in the second reference frame is rearranged according to the position offset information of each pixel point contained in the second optical flow diagram, so that the alignment operation of a target object in the second reference frame and a target object in a frame to be reconstructed is realized, and a second alignment video frame is obtained. Alternatively, any desired pixel alignment method may be used in the second alignment sub-module, which is not limited in this embodiment of the present invention.
S2036, merging the first aligned video frame, the second aligned video frame and the frame to be reconstructed, and determining the aligned video frame.
Specifically, the first aligned video frame and the second aligned video frame which are obtained after alignment are combined with a frame to be reconstructed which is required to be reconstructed, and the information content of each pixel point in the frame to be reconstructed is increased, so that the corresponding aligned video frame is obtained.
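A brief sketch of how the two warped reference frames and the frame to be reconstructed might be merged into one aligned video frame is shown below, reusing the hypothetical warp helper from the earlier sketch. The channel-wise concatenation producing a nine-channel tensor for three-channel inputs follows the example given above; the function name and frame ordering inside the concatenation are assumptions.

```python
import torch

def build_aligned_frame(first_ref, frame_to_reconstruct, second_ref,
                        flow_first, flow_second):
    """Warp both reference frames toward the frame to be reconstructed and concatenate.

    All frames are (N, 3, H, W); flows are (N, 2, H, W).
    Returns an aligned video frame of shape (N, 9, H, W).
    """
    aligned_first = warp(first_ref, flow_first)      # first aligned video frame
    aligned_second = warp(second_ref, flow_second)   # second aligned video frame
    return torch.cat([aligned_first, frame_to_reconstruct, aligned_second], dim=1)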
S204, inputting the aligned video frames into a multi-branch feature fusion module to perform feature extraction and fusion, and determining a fusion feature map.
In this embodiment, the fused feature map may be specifically understood as a feature map that includes detailed information in an aligned video frame after extracting and fusing features of different scales and frequencies in the aligned video frame.
Specifically, an aligned video frame is input into a multi-branch feature fusion module, small receptive field feature images, large receptive field feature images and high-frequency feature information in the aligned video frame are extracted, and all extracted features are fused to obtain a fused feature image containing detail information in the aligned video frame.
Further, the multi-branch feature fusion module comprises a feature extraction sub-module, a residual sub-module and a dense sub-module; fig. 6 is a flowchart illustrating feature extraction and fusion performed by inputting an aligned video frame into a multi-branch feature fusion module according to a second embodiment of the present invention, where the flowchart illustrating a fused feature map is determined, as shown in fig. 6, and specifically includes the following steps:
s2041, inputting the aligned video frames to a feature extraction submodule to carry out feature extraction, and determining a small receptive field feature map.
In the present embodiment, the feature extraction submodule can be understood as a combination of a plurality of neural network layers for performing feature extraction on an image input thereto. It should be clear that, in the embodiment of the present invention, the feature extraction method of the feature extraction sub-module may be selected according to actual requirements, which is not limited in the embodiment of the present invention. The small receptive field feature map is specifically understood to be a feature map which is directly extracted from the aligned video frames by the feature extraction submodule, contains more information and corresponds to a smaller receptive field. Illustratively, the small receptive field feature map may be a feature map of the receptive field corresponding to the 3*3 convolution block.
S2042, inputting the small receptive field feature map to a residual sub-module for feature extraction and fusion, and determining a receptive field fusion feature map.
The middle residual block and the last residual block in the residual sub-module have skip connections to the small receptive field feature map.
In this embodiment, the residual sub-module is specifically understood to be a combination of multiple neural network layers consisting of multiple residual blocks, used to gradually expand the receptive field and extract large receptive field features in the aligned video frames. Fig. 7 shows an exemplary structure of a residual block provided in the second embodiment of the present invention; as shown in fig. 7, the residual blocks in the residual sub-module have the same structure, and each residual block is formed of alternating activation function layers and two-dimensional convolution layers. The middle residual block is specifically understood as the residual block located in the middle among the residual blocks of the residual sub-module; for example, if the residual sub-module includes 8 residual blocks, the 4th residual block is the middle residual block of the residual sub-module. The last residual block can be specifically understood as the residual block located at the last position among the residual blocks of the residual sub-module; in the above example, if the residual sub-module contains 8 residual blocks, the 8th residual block is the last residual block of the residual sub-module.
Specifically, the small receptive field feature map is input into the residual sub-module, and the residual blocks in the residual sub-module sequentially perform receptive-field-expanding feature extraction on it; that is, the small receptive field feature map is input into the first residual block of the residual sub-module, the output of the first residual block is input into the second residual block, and so on, until the output of the last residual block in the residual sub-module is determined as the receptive field fusion feature map. Further, since the middle residual block and the last residual block in the residual sub-module have skip connections to the small receptive field feature map, when the residual sub-module performs feature extraction, the small receptive field feature map is added to the output of the residual block preceding the middle residual block to serve as the input of the middle residual block, and the small receptive field feature map is added to the output of the residual block preceding the last residual block to serve as the input of the last residual block. In this way, the receptive field fusion feature map output by the last residual block contains both the small receptive field features and the large receptive field features of the aligned video frame, realizing the fusion of the small receptive field features and the large receptive field features.
In the embodiment of the invention, the residual sub-modules which are in jump connection with different residual blocks are adopted to perform feature extraction and fusion on the aligned video frames, the residual blocks can be used for expanding the receptive field gradually, extracting the features of the large receptive field in the input aligned video frames, and simultaneously ensuring that the information contained in the input aligned video frames is transmitted to the bottom of the model as much as possible, so that the information loss caused by a plurality of convolution layers is reduced. The small receptive field feature map containing a large amount of low-dimensional inter-frame information is fused with the large receptive field feature information extracted by the residual block through the jump connection of the small receptive field feature map and the residual block, so that the low-dimensional information in the small receptive field feature map can be reserved in the feature extraction and fusion process, the large receptive field feature can be extracted as far as possible, and the feature extraction completeness is improved.
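The residual sub-module described above can be sketched in PyTorch as follows. Only the alternating activation/convolution structure of the residual block (fig. 7) and the skip connections from the small receptive field feature map to the middle and last residual blocks are taken from the description; the number of blocks, channel width, and kernel sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of alternating activation functions and 2D convolutions (cf. fig. 7)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class ResidualSubmodule(nn.Module):
    """Stack of residual blocks; the small receptive field feature map is additionally
    added to the inputs of the middle and last blocks via skip connections."""
    def __init__(self, channels: int = 64, num_blocks: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList([ResidualBlock(channels) for _ in range(num_blocks)])
        self.middle = num_blocks // 2 - 1   # e.g. the 4th of 8 blocks
        self.last = num_blocks - 1

    def forward(self, small_rf_feat: torch.Tensor) -> torch.Tensor:
        x = small_rf_feat
        for i, block in enumerate(self.blocks):
            if i in (self.middle, self.last):
                x = x + small_rf_feat        # skip connection from the small receptive field map
            x = block(x)
        return x                              # receptive field fusion feature map
```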
S2043, inputting the aligned video frames into the dense submodule for high-frequency information extraction, and determining a high-frequency feature map.
In this embodiment, the dense sub-module is specifically understood to include a plurality of dense blocks for performing high-frequency information extraction on the video frame input thereto, which is a combination of a plurality of neural network layers.
Specifically, in order to solve the problem that the edge contours of objects in the reconstructed video frame become blurred and diffused when feature extraction and fusion are performed only by the residual sub-module, in the embodiment of the invention the aligned video frame is also input into the dense sub-module, which is arranged in parallel, and the high-frequency information in the aligned video frame is extracted through each dense block in the dense sub-module, so that the high-frequency feature map is output.
Fig. 8 is a schematic diagram of the structure of a dense block according to the second embodiment of the present invention. As shown in fig. 8, the dense block is composed of a batch normalization layer (BN layer), an activation function (ReLU) and a two-dimensional convolution layer (Conv2d); after feature extraction by one dense block, the input feature map and the output feature map are combined to obtain the final output of the dense block.
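A minimal PyTorch sketch of such a dense block (batch normalization, ReLU, 2D convolution, with the input and output feature maps concatenated) and of a dense sub-module built from several of them is given below; the channel counts, growth rate, number of blocks, and the final 1x1 fusion convolution are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """BN -> ReLU -> Conv2d; the input is concatenated with the output (cf. fig. 8)."""
    def __init__(self, in_channels: int, growth: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)   # channel count grows by `growth`

class DenseSubmodule(nn.Module):
    """Several dense blocks that extract high-frequency (contour) information."""
    def __init__(self, in_channels: int = 9, growth: int = 32,
                 num_blocks: int = 3, out_channels: int = 64):
        super().__init__()
        blocks, ch = [], in_channels
        for _ in range(num_blocks):
            blocks.append(DenseBlock(ch, growth))
            ch += growth
        self.blocks = nn.Sequential(*blocks)
        self.fuse = nn.Conv2d(ch, out_channels, kernel_size=1)   # compress to the fusion width

    def forward(self, aligned_frame):
        return self.fuse(self.blocks(aligned_frame))              # high-frequency feature map
```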
S2044, carrying out feature addition on the receptive field fusion feature map and the high-frequency feature map, and determining the fusion feature map.
Specifically, feature addition is performed on the receptive field fusion feature map and the high-frequency feature map, namely, the receptive field fusion feature map extracted from the residual sub-module is subjected to information supplementation through outline information in the high-frequency feature map, and the receptive field fusion feature map and the high-frequency feature map are fused to obtain a fusion feature map, so that video frame information contained in the fusion feature map is more complete, and further, the video super-resolution reconstruction effect can be further improved.
S205, inputting the fusion feature map into an up-sampling module for pixel rearrangement, and determining a second superdivision result.
Specifically, in order to make the size of the fused feature map after feature extraction and fusion meet the size requirement of the super-resolution video frame, the fused feature map is input into an up-sampling module for amplification and rearrangement, and a second super-division result with the same size as the first super-division result output by the video frame super-division module is obtained.
The up-sampling module in the embodiment of the present invention may employ a sub-pixel convolution up-sampling method: the pixels of the fused feature map obtained after fusion are rearranged through a sub-pixel convolution layer, so that a feature map of size H × W × r²c is rearranged into a second superdivision result of size rH × rW × c. Here, H and W are the height and width of the video frames in the low-resolution video frame sequence to be processed; since the video super-resolution model does not change the size of the video frames when aligning the input video frame group and extracting and fusing features, the height and width of the fused feature map are still H and W. r is the magnification factor and c is the number of channels. Optionally, in the embodiment of the present invention the magnification factor may be 4, and when the input video frames are RGB images c may be 3; the magnification factor and the number of channels may be adjusted according to the actual situation, which is not limited by the embodiment of the present invention.
In the embodiment of the invention, the sub-pixel convolution upsampling method has better upsampling effect compared with the traditional upsampling method based on interpolation, and the upsampling time is shorter under the condition that the upsampling effect is the same as other upsampling algorithms based on learning, such as a deconvolution algorithm, so that the efficiency of video super-resolution processing is higher.
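The sub-pixel convolution up-sampling step can be sketched with PyTorch's built-in pixel shuffle operation, as below; the channel counts and the single-convolution layout are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    """Conv2d producing r*r*c channels followed by pixel shuffle:
    (N, r*r*c, H, W) -> (N, c, r*H, r*W)."""
    def __init__(self, in_channels: int = 64, out_channels: int = 3, scale: int = 4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * scale * scale,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(fused_features))

# Example: a 64-channel fused feature map of size 64x64 becomes a 3-channel 256x256 result.
up = SubPixelUpsampler()
second_result = up(torch.randn(1, 64, 64, 64))   # shape (1, 3, 256, 256)
```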
S206, the video frames generated by carrying out feature addition on the first super-division result and the second super-division result are determined to be super-resolution video frames corresponding to the video frame group.
Specifically, since the first superdivision result and the second superdivision result have the same size, the first superdivision result, which contains the rough features of the video frame, is added pixel-wise to the second superdivision result, which contains the detailed features of the video frame, to obtain the super-resolution video frame corresponding to the video frame group input into the video super-resolution model.
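Putting the pieces together, the final reconstruction is the element-wise sum of the coarse and detail branches. The short sketch below reuses the hypothetical helpers defined in the earlier sketches for orientation only; in particular, using one flow estimator for both reference frames is an assumption, since the patent describes separate first and second optical flow estimation sub-modules.

```python
# Hypothetical end-to-end forward pass for one video frame group (shapes assumed):
#   first_ref, center, second_ref: (1, 3, H, W) low-resolution frames.
#   flow_net, fusion_net, up: PyramidFlowEstimator, the multi-branch fusion module,
#   and SubPixelUpsampler from the sketches above.

def super_resolve_group(first_ref, center, second_ref, flow_net, fusion_net, up):
    first_result = coarse_superdivision(center)                       # coarse branch, (1, 3, 4H, 4W)
    aligned = build_aligned_frame(first_ref, center, second_ref,
                                  flow_net(first_ref, center),
                                  flow_net(second_ref, center))       # (1, 9, H, W)
    second_result = up(fusion_net(aligned))                           # detail branch, (1, 3, 4H, 4W)
    return first_result + second_result                               # super-resolution video frame
```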
S207, sequencing each super-resolution video frame corresponding to the video frame sequence to be processed according to the time sequence to determine the super-resolution video frame sequence.
Specifically, since the video frame groups in the video frame sequence to be processed are arranged in time order, the super-resolution video frames corresponding to the video frame groups can be arranged in the same time order, so as to obtain a super-resolution video frame sequence whose temporal order is consistent with the shooting order of the video data to be processed.
Further, ordering each super-resolution video frame corresponding to the video frame sequence to be processed according to the time sequence, and determining the super-resolution video frame sequence, including:
Determining the super-resolution video frame corresponding to the video frame group as the super-resolution video frame corresponding to the frame to be reconstructed in the video frame group;
and sequencing each super-resolution video frame according to the time sequence of the corresponding frame to be rebuilt, and determining a super-resolution video frame sequence corresponding to the video frame sequence to be processed.
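For illustration, the grouping and re-ordering logic described above can be sketched as follows; super_resolve_group is a placeholder for the trained model call and is not part of this embodiment.

from typing import Callable, List, Sequence

def make_groups(frames: Sequence) -> List[tuple]:
    # sliding window of three consecutive frames with a step of one frame
    return [tuple(frames[i:i + 3]) for i in range(len(frames) - 2)]

def super_resolve_sequence(frames: Sequence, super_resolve_group: Callable) -> List:
    # each group yields one super-resolution frame for its middle (to-be-reconstructed) frame,
    # so processing the groups in time order directly gives the super-resolution frame sequence
    return [super_resolve_group(group) for group in make_groups(frames)]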
Further, fig. 9 is a flowchart illustrating a process of training a video super-resolution model by using a set training method according to a second embodiment of the present invention, as shown in fig. 9, specifically including the following steps:
s301, inputting a video training sample set into an initial video super-resolution model, and determining a reconstruction intermediate result.
Wherein the video training samples include low resolution video frames and high resolution video frames corresponding to the low resolution video frames.
In this embodiment, the video training sample set may be specifically understood as a set of training objects including a low-resolution video frame and a corresponding high-resolution video frame, which are determined according to actual requirements and are input into the initial video super-resolution model to train the video super-resolution model. The initial video super-resolution model can be specifically understood as an untrained video super-resolution model, and the modules and the neural network layer structures contained in the initial video super-resolution model are completely consistent with those in the video super-resolution model, but weight parameters in the initial video super-resolution model are not adjusted yet. The reconstruction intermediate result can be specifically understood as a result output after the super-resolution reconstruction of the video training sample input into the model by the initial video super-resolution model without training.
Specifically, a low-resolution video frame in a video training sample is input into an initial super-resolution model, coarse feature extraction, detail feature extraction fusion and up-sampling processing are carried out on the low-resolution video frame through the initial super-resolution model, and an output result is determined to be a reconstruction intermediate result corresponding to the low-resolution video frame.
For example, a video training sample set may be constructed from the public REalistic and Dynamic Scenes dataset (REDS): video frame sequences are acquired from the REDS dataset as raw data, sliced into groups of three consecutive frames, and the grouping order is shuffled; four-times Gaussian blur down-sampling is then performed on the grouped raw data to generate the corresponding low-resolution video frames, while the raw data are taken as the high-resolution video frames corresponding to those low-resolution video frames; each low-resolution/high-resolution pair is associated to form a video training sample, and the set of all video training samples is determined as the video training sample set.
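A rough sketch of this sample preparation is given below, assuming PyTorch and torchvision; the Gaussian blur kernel size and sigma are illustrative assumptions, since the embodiment only specifies four-times Gaussian blur down-sampling.

import random
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def build_training_samples(hr_frames: torch.Tensor, scale: int = 4):
    # hr_frames: (T, C, H, W) high-resolution frames taken from one REDS sequence
    blur = GaussianBlur(kernel_size=7, sigma=1.6)             # assumed blur parameters
    groups = [hr_frames[i:i + 3] for i in range(hr_frames.shape[0] - 2)]
    random.shuffle(groups)                                    # disturb the grouping order
    samples = []
    for hr_group in groups:
        lr_group = F.interpolate(blur(hr_group), scale_factor=1.0 / scale,
                                 mode='bicubic', align_corners=False)
        samples.append((lr_group, hr_group))                  # (low-resolution input, high-resolution target)
    return samples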
S302, determining a loss function according to the reconstructed intermediate result and the high-resolution video frame.
In this embodiment, the loss function (Loss Function) can be specifically understood as a function for measuring the distance between the model being trained in the deep learning process and the ideal model; it can be used for parameter estimation of the model so that the trained model reaches a convergence state, thereby reducing the error between the model's predicted value and the true value after training.
Specifically, a high-resolution video frame corresponding to the reconstruction intermediate result is determined, the high-resolution video frame is considered as an ideal state of the low-resolution video frame after super-resolution reconstruction, difference information between each pixel point in the reconstruction intermediate result and the high-resolution video frame is determined, and then a corresponding loss function is determined according to the difference information.
Exemplarily, in the embodiment of the present invention a Smooth L1 loss function may be used as the loss function of the video super-resolution model. Denoting the reconstruction intermediate result by $I^{SR}$ and the corresponding high-resolution video frame by $I^{HR}$, the loss function can be expressed as:

$L\left(I^{SR},I^{HR}\right)=\frac{1}{cHW}\sum_{k=1}^{c}\sum_{i=1}^{H}\sum_{j=1}^{W}\operatorname{smooth}_{L1}\left(I^{SR}_{k,i,j}-I^{HR}_{k,i,j}\right),\qquad \operatorname{smooth}_{L1}(x)=\begin{cases}0.5x^{2}, & |x|<1\\ |x|-0.5, & \text{otherwise}\end{cases}$

where c is the number of channels and H and W are the height and width of the video frame, respectively.
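As a purely illustrative sanity check, the per-element Smooth L1 term above matches PyTorch's built-in nn.SmoothL1Loss with mean reduction; the tensor sizes below are arbitrary example values.

import torch
import torch.nn as nn

sr = torch.randn(3, 64, 64)        # reconstruction intermediate result (c x H x W)
hr = torch.randn(3, 64, 64)        # corresponding high-resolution video frame

diff = sr - hr
manual = torch.where(diff.abs() < 1, 0.5 * diff ** 2, diff.abs() - 0.5).mean()
assert torch.allclose(manual, nn.SmoothL1Loss()(sr, hr))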
And S303, training the initial video super-resolution model based on the loss function until a preset convergence condition is met to obtain the video super-resolution model.
In this embodiment, the preset convergence condition may be specifically understood as a condition preset according to the actual situation and used to determine whether the trained initial video super-resolution model has converged. Optionally, the preset convergence condition may include: the difference between the reconstruction intermediate result and the high-resolution video frame being smaller than a preset threshold, the change of the weight parameters between two training iterations being smaller than a preset parameter change threshold, the number of iterations exceeding a set maximum number of iterations, or all video training samples having been used for training, which is not limited in the embodiment of the present invention.
Specifically, the initial video super-resolution model is trained with the obtained loss function, i.e. the weight parameters of each neural network layer in the initial video super-resolution model are adjusted; once the preset convergence condition is determined to be met, the trained initial video super-resolution model is determined as the video super-resolution model that can be put into use.
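A condensed training-loop sketch is given below; model stands for the initial video super-resolution model, loader for a data loader over the video training sample set, and the Adam optimizer, learning rate and epoch count are illustrative assumptions rather than values fixed by this embodiment.

import torch
import torch.nn as nn

def train(model: nn.Module, loader, max_epochs: int = 300, lr: float = 1e-4) -> nn.Module:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.SmoothL1Loss()
    for epoch in range(max_epochs):                   # stop once the maximum iteration count is reached
        for lr_group, hr_frame in loader:
            sr_frame = model(lr_group)                # reconstruction intermediate result
            loss = criterion(sr_frame, hr_frame)      # loss between intermediate result and high-resolution frame
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                          # adjust the weight parameters of each layer
    return model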
Further, fig. 10 is a structural example diagram of a video super-resolution model according to the second embodiment of the present invention. Fig. 10 shows the data flow of inputting a video frame group consisting of a first reference frame, a frame to be reconstructed and a second reference frame into the video super-resolution model for processing, with the points of feature merging (concatenation) and feature addition marked in the figure; the specific data processing flow corresponds to the steps described above and is not repeated here in the embodiment of the present invention.
According to the technical scheme of this embodiment, the video frame sequence to be processed is processed by a video super-resolution model comprising a video frame super-division module, a video frame alignment module, a multi-branch feature fusion module and an up-sampling module. The video frame super-division module performs coarse feature extraction and preliminary super-division on the video frame sequence to be processed, while the video frame alignment module, the multi-branch feature fusion module and the up-sampling module align the video frames in the sequence, extract and fuse detail features, and enlarge the result. The amount of data that needs to be input into the video super-resolution model is thereby reduced, the richness of the information extracted from the video frames is improved, and the accuracy of the obtained super-resolution video frame sequence is enhanced.
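Putting the modules together, the data flow of fig. 10 can be sketched as follows; each sub-module is treated as an opaque callable, and the concrete networks behind them are not reproduced here.

import torch.nn as nn

class VideoSuperResolutionModel(nn.Module):
    def __init__(self, superdivision, alignment, fusion, upsampling):
        super().__init__()
        self.superdivision = superdivision   # video frame super-division module
        self.alignment = alignment           # video frame alignment module
        self.fusion = fusion                 # multi-branch feature fusion module
        self.upsampling = upsampling         # up-sampling module

    def forward(self, frame_group):
        first_sr = self.superdivision(frame_group)   # coarse features, preliminary super-division
        aligned = self.alignment(frame_group)        # align reference frames to the frame to be reconstructed
        fused = self.fusion(aligned)                 # detail feature extraction and fusion
        second_sr = self.upsampling(fused)           # pixel rearrangement to the target size
        return first_sr + second_sr                  # feature addition of the two branches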
Example III
Fig. 11 is a schematic structural diagram of a video super-resolution device according to a third embodiment of the present invention, where the video super-resolution device includes: a video frame acquisition module 31 and a super resolution frame generation module 32.
The video frame acquisition module 31 is configured to acquire a video frame sequence to be processed; the video frame sequence to be processed comprises at least one video frame group, wherein the video frame group is formed by three continuous video frames; the super-resolution frame generation module 32 is configured to input a video frame sequence to be processed into a preset video super-resolution model, and determine a super-resolution video frame sequence corresponding to the video frame sequence to be processed according to an output generation result; the video super-resolution model is a neural network model trained by a set training method; the video super-resolution model at least comprises a multi-branch feature fusion module, wherein the multi-branch feature fusion module is used for extracting the feature information and the high-frequency feature information of different receptive fields in parallel and fusing the feature information and the high-frequency feature information of the different receptive fields.
According to the technical scheme of this embodiment, a multi-branch parallel network structure performs feature extraction and feature fusion on the video frame sequence to be processed from multiple dimensions, and the multi-branch feature fusion module fuses features of different dimensions with high-frequency features, so that lightweight multi-feature extraction and fusion is achieved; high-resolution video frame reconstruction is thus performed on the low-resolution video frame sequence to be processed, and the corresponding super-resolution video frame sequence is obtained. This solves the problems of complex structure and large computation of existing video super-resolution models: the multi-branch feature fusion module provides multi-dimensional feature extraction and fusion of different frequency types, reduces the parameter quantity and the amount of data operations, improves the richness of the information extracted from the video frames, and enhances the accuracy of super-resolution video frame reconstruction.
Further, the video frame acquisition module 31 includes:
the video group determining unit is used for obtaining video data to be processed, dividing the video data to be processed through a preset sliding window and determining at least one video frame group; the window size of a preset sliding window is three frames, and the sliding step length is one frame;
and the video frame sequence determining unit is used for sequencing each video frame group according to the time sequence to determine the video frame sequence to be processed.
Further, the video super-resolution model also comprises a video frame super-division module, a video frame alignment module and an up-sampling module.
The super-resolution frame generation module 32 includes:
the first result determining unit is used for inputting the video frame groups into the video frame superdivision module and the video frame alignment module in the video super-resolution model respectively aiming at each video frame group in the video frame sequence to be processed, determining the output of the video frame superdivision module as a first superdivision result, and determining the output of the video frame alignment module as an aligned video frame;
the fusion feature determining unit is used for inputting the aligned video frames into the multi-branch feature fusion module to perform feature extraction and fusion, and determining a fusion feature map;
The second result determining unit is used for inputting the fusion feature map into the up-sampling module to carry out pixel rearrangement and determining a second superdivision result;
the super-resolution frame determining unit is used for determining a video frame generated by carrying out feature addition on the first super-division result and the second super-division result as a super-resolution video frame corresponding to the video frame group;
the super-resolution sequence determining unit is used for sequencing the super-resolution video frames corresponding to the video frame sequence to be processed according to the time sequence to determine the super-resolution video frame sequence.
Further, the video frame alignment module includes a first optical flow estimation sub-module, a first alignment sub-module, a second optical flow estimation sub-module, and a second alignment sub-module.
Correspondingly, the first result determining unit is specifically configured to:
determining a first frame in a video frame group as a first reference frame, determining a second frame as a frame to be reconstructed, and determining a third frame as a second reference frame;
inputting a first reference frame and a frame to be reconstructed into a first optical flow estimation sub-module, and determining a first optical flow diagram;
inputting a first reference frame and a first optical flow diagram to a first alignment sub-module, and determining a first alignment video frame corresponding to the first reference frame;
inputting a second reference frame and a frame to be reconstructed into a second optical flow estimation sub-module to determine a second optical flow diagram;
Inputting a second reference frame and a second optical flow diagram to a second alignment sub-module, and determining a second alignment video frame corresponding to the second reference frame;
and merging the first aligned video frame, the second aligned video frame and the frame to be rebuilt to determine the aligned video frame.
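For illustration, the alignment step can be sketched as backward warping with the estimated optical flow; the two flow-estimation sub-modules are passed in as placeholders, and the flow fields are assumed to be dense per-pixel displacements in pixel units, which this embodiment does not mandate.

import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # frame: (N, C, H, W); flow: (N, 2, H, W) giving per-pixel (x, y) displacements
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack((xs, ys)).float().to(frame.device)        # (2, H, W) sampling grid
    coords = base.unsqueeze(0) + flow                             # displaced sampling positions
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                   # normalise to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, torch.stack((grid_x, grid_y), dim=-1), align_corners=True)

def align_group(ref1, target, ref2, flow_net1, flow_net2):
    aligned1 = warp(ref1, flow_net1(ref1, target))                # first aligned video frame
    aligned2 = warp(ref2, flow_net2(ref2, target))                # second aligned video frame
    return torch.cat([aligned1, target, aligned2], dim=1)         # merge with the frame to be reconstructed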
Further, the multi-branch feature fusion module comprises a feature extraction sub-module, a residual sub-module and a dense sub-module.
Correspondingly, the fusion characteristic determining unit is specifically configured to:
inputting the aligned video frames to a feature extraction submodule for feature extraction, and determining a small receptive field feature map;
inputting the small receptive field feature map to a residual sub-module for feature extraction and fusion, and determining a receptive field fusion feature map; the middle residual block and the last residual block in the residual sub-module are skip-connected to the small receptive field feature map;
inputting the aligned video frames to a dense submodule for high-frequency information extraction, and determining a high-frequency feature map;
and carrying out feature addition on the receptive field fusion feature map and the high-frequency feature map to determine the fusion feature map.
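A structural sketch of the three branches described above is given below; the channel width, the number of residual blocks, the depth of the dense branch and the exact placement of the skip connections are illustrative assumptions (the input is taken to be the aligned video frame obtained by merging three RGB frames, hence 9 channels).

import torch
import torch.nn as nn

class MultiBranchFeatureFusion(nn.Module):
    def __init__(self, channels: int = 64, num_res_blocks: int = 5):
        super().__init__()
        self.extract = nn.Conv2d(9, channels, 3, padding=1)             # feature extraction sub-module
        self.res_blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU(inplace=True),
                          nn.Conv2d(channels, channels, 3, padding=1))
            for _ in range(num_res_blocks))                             # residual sub-module
        self.dense_branch = nn.Sequential(                              # stand-in for the dense sub-module
            nn.Conv2d(9, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, aligned: torch.Tensor) -> torch.Tensor:
        small_rf = self.extract(aligned)                                # small receptive field feature map
        x, mid, last = small_rf, len(self.res_blocks) // 2, len(self.res_blocks) - 1
        for i, block in enumerate(self.res_blocks):
            if i in (mid, last):
                x = x + small_rf                                        # skip connection from the small RF map
            x = x + block(x)                                            # residual connection
        return x + self.dense_branch(aligned)                           # add the high-frequency feature map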
Further, the super-resolution sequence determining unit is specifically configured to:
determining the super-resolution video frame corresponding to the video frame group as the super-resolution video frame corresponding to the frame to be reconstructed in the video frame group;
And sequencing each super-resolution video frame according to the time sequence of the corresponding frame to be rebuilt, and determining a super-resolution video frame sequence corresponding to the video frame sequence to be processed.
Further, the step of training the video super-resolution model by adopting the set training method comprises the following steps:
inputting the video training sample set into an initial video super-resolution model, and determining a reconstruction intermediate result; the video training samples comprise low-resolution video frames and high-resolution video frames corresponding to the low-resolution video frames;
determining a loss function according to the reconstructed intermediate result and the high-resolution video frame;
training the initial video super-resolution model based on the loss function until a preset convergence condition is met to obtain the video super-resolution model.
The video super-resolution device provided by the embodiment of the invention can execute the video super-resolution method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 12 is a schematic structural diagram of a video super-resolution device according to a fourth embodiment of the present invention. The video super resolution device 40 may be an electronic device intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 12, the video super-resolution device 40 includes at least one processor 41, and a memory communicatively connected to the at least one processor 41, such as a Read Only Memory (ROM) 42, a Random Access Memory (RAM) 43, etc., in which the memory stores a computer program executable by the at least one processor, and the processor 41 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 42 or the computer program loaded from the storage unit 48 into the Random Access Memory (RAM) 43. In the RAM 43, various programs and data required for the operation of the video super-resolution apparatus 40 can also be stored. The processor 41, the ROM 42 and the RAM 43 are connected to each other via a bus 44. An input/output (I/O) interface 45 is also connected to bus 44.
A plurality of components in the video super resolution device 40 are connected to the I/O interface 45, including: an input unit 46 such as a keyboard, a mouse, etc.; an output unit 47 such as various types of displays, speakers, and the like; a storage unit 48 such as a magnetic disk, an optical disk, or the like; and a communication unit 49 such as a network card, modem, wireless communication transceiver, etc. The communication unit 49 allows the video super-resolution device 40 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The processor 41 may be various general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 41 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 41 performs the various methods and processes described above, such as the video super resolution method.
In some embodiments, the video super-resolution method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the video super resolution device 40 via the ROM 42 and/or the communication unit 49. When the computer program is loaded into RAM 43 and executed by processor 41, one or more steps of the video super-resolution method described above may be performed. Alternatively, in other embodiments, the processor 41 may be configured to perform the video super-resolution method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be noted that, in the above embodiment of the apparatus, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding function can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A video super-resolution method, comprising:
Acquiring a video frame sequence to be processed; the video frame sequence to be processed comprises at least one video frame group, wherein the video frame group is formed by three continuous video frames;
inputting the video frame sequence to be processed into a preset video super-resolution model, and determining a super-resolution video frame sequence corresponding to the video frame sequence to be processed according to the output generation result;
the video super-resolution model is a neural network model trained by a set training method; the video super-resolution model at least comprises a multi-branch feature fusion module, wherein the multi-branch feature fusion module is used for extracting feature information and high-frequency feature information of different receptive fields in parallel and fusing the feature information of the different receptive fields and the high-frequency feature information;
the video super-resolution model also comprises a video frame super-division module, a video frame alignment module and an up-sampling module; the step of inputting the video frame sequence to be processed into a preset video super-resolution model, and determining the super-resolution video frame sequence corresponding to the video frame sequence to be processed according to the output generation result, wherein the step of determining the super-resolution video frame sequence comprises the following steps:
for each video frame group in the video frame sequence to be processed, respectively inputting the video frame groups into the video frame superdivision module and the video frame alignment module in the video super-resolution model, determining the output of the video frame superdivision module as a first superdivision result, and determining the output of the video frame alignment module as an aligned video frame;
Inputting the aligned video frames into the multi-branch feature fusion module for feature extraction and fusion, and determining a fusion feature map;
inputting the fusion feature map to the up-sampling module for pixel rearrangement, and determining a second superdivision result;
the video frames generated by carrying out feature addition on the first super-division result and the second super-division result are determined to be super-resolution video frames corresponding to the video frame group;
and sequencing each super-resolution video frame corresponding to the video frame sequence to be processed according to a time sequence to determine a super-resolution video frame sequence.
2. The method of claim 1, wherein the acquiring a sequence of video frames to be processed comprises:
obtaining video data to be processed, dividing the video data to be processed through a preset sliding window, and determining at least one video frame group; the window size of the preset sliding window is three frames, and the sliding step length is one frame;
and sequencing the video frame groups according to the time sequence to determine a video frame sequence to be processed.
3. The method of claim 1, wherein the video frame alignment module comprises a first optical flow estimation sub-module, a first alignment sub-module, a second optical flow estimation sub-module, and a second alignment sub-module;
Inputting the set of video frames to the video frame alignment module, determining an output of the video frame alignment module as aligned video frames, comprising:
determining a first frame in the video frame group as a first reference frame, determining a second frame as a frame to be reconstructed, and determining a third frame as a second reference frame;
inputting the first reference frame and the frame to be reconstructed to the first optical flow estimation submodule to determine a first optical flow graph;
inputting the first reference frame and the first optical flow diagram to the first alignment sub-module, and determining a first alignment video frame corresponding to the first reference frame;
inputting the second reference frame and the frame to be reconstructed to the second optical flow estimation submodule to determine a second optical flow graph;
inputting the second reference frame and the second optical flow diagram to the second alignment sub-module, and determining a second aligned video frame corresponding to the second reference frame;
and merging the first aligned video frame, the second aligned video frame and the frame to be rebuilt to determine an aligned video frame.
4. The method of claim 1, wherein the multi-branch feature fusion module comprises a feature extraction sub-module, a residual sub-module, and a dense sub-module;
Inputting the aligned video frames into the multi-branch feature fusion module for feature extraction and fusion, and determining a fusion feature map, wherein the method comprises the following steps:
inputting the aligned video frames to the feature extraction submodule to perform feature extraction, and determining a small receptive field feature map;
inputting the small receptive field feature map to the residual sub-module for feature extraction and fusion, and determining a receptive field fusion feature map; wherein the middle residual block and the last residual block in the residual sub-module are skip-connected to the small receptive field feature map;
inputting the aligned video frames to the dense submodule for high-frequency information extraction, and determining a high-frequency feature map;
and carrying out feature addition on the receptive field fusion feature map and the high-frequency feature map to determine a fusion feature map.
5. A method according to claim 3, wherein said ordering each of said super-resolution video frames corresponding to said sequence of video frames to be processed according to a temporal order, determining a sequence of super-resolution video frames, comprises:
determining the super-resolution video frames corresponding to the video frame groups as the super-resolution video frames corresponding to the frames to be reconstructed in the video frame groups;
And sequencing the super-resolution video frames according to the time sequence of the corresponding frames to be rebuilt, and determining a super-resolution video frame sequence corresponding to the video frame sequence to be processed.
6. The method of claim 1, wherein the step of training the video super-resolution model using a set-up training method comprises:
inputting the video training sample set into an initial video super-resolution model, and determining a reconstruction intermediate result; wherein the video training samples comprise low-resolution video frames and high-resolution video frames corresponding to the low-resolution video frames;
determining a loss function according to the reconstruction intermediate result and the high-resolution video frame;
training the initial video super-resolution model based on the loss function until a preset convergence condition is met to obtain the video super-resolution model.
7. A video super-resolution apparatus, comprising:
the video frame acquisition module is used for acquiring a video frame sequence to be processed; the video frame sequence to be processed comprises at least one video frame group, wherein the video frame group is formed by three continuous video frames;
the super-resolution frame generation module is used for inputting the video frame sequence to be processed into a preset video super-resolution model, and determining a super-resolution video frame sequence corresponding to the video frame sequence to be processed according to an output generation result;
The video super-resolution model is a neural network model trained by a set training method; the video super-resolution model at least comprises a multi-branch feature fusion module, wherein the multi-branch feature fusion module is used for extracting feature information and high-frequency feature information of different receptive fields in parallel and fusing the feature information of the different receptive fields and the high-frequency feature information;
the video super-resolution model also comprises a video frame super-division module, a video frame alignment module and an up-sampling module;
wherein, super resolution frame generation module includes:
a first result determining unit configured to input, for each video frame group in the video frame sequence to be processed, the video frame groups to the video frame superdivision module and the video frame alignment module in the video super-resolution model, respectively, and determine an output of the video frame superdivision module as a first superdivision result, and determine an output of the video frame alignment module as an aligned video frame;
the fusion feature determining unit is used for inputting the aligned video frames into the multi-branch feature fusion module to perform feature extraction and fusion, and determining a fusion feature map;
The second result determining unit is used for inputting the fusion feature map to the up-sampling module for pixel rearrangement and determining a second superdivision result;
a super-resolution frame determining unit, configured to determine a video frame generated by performing feature addition on the first super-division result and the second super-division result as a super-resolution video frame corresponding to the video frame group;
the super-resolution sequence determining unit is used for sequencing the super-resolution video frames corresponding to the video frame sequence to be processed according to the time sequence to determine the super-resolution video frame sequence.
8. A video super-resolution apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the video super-resolution method of any one of claims 1-6.
9. A computer readable storage medium storing computer instructions for causing a processor to perform the video super-resolution method of any one of claims 1-6.
CN202310029515.3A 2023-01-09 2023-01-09 Video super-resolution method, device, equipment and storage medium Active CN115994857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310029515.3A CN115994857B (en) 2023-01-09 2023-01-09 Video super-resolution method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310029515.3A CN115994857B (en) 2023-01-09 2023-01-09 Video super-resolution method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115994857A CN115994857A (en) 2023-04-21
CN115994857B true CN115994857B (en) 2023-10-13

Family

ID=85990022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310029515.3A Active CN115994857B (en) 2023-01-09 2023-01-09 Video super-resolution method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115994857B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369475A (en) * 2020-03-26 2020-07-03 北京百度网讯科技有限公司 Method and apparatus for processing video
CN112507997A (en) * 2021-02-08 2021-03-16 之江实验室 Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN112750094A (en) * 2020-12-30 2021-05-04 合肥工业大学 Video processing method and system
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion
CN113298718A (en) * 2021-06-22 2021-08-24 云南大学 Single image super-resolution reconstruction method and system
CN114529456A (en) * 2022-02-21 2022-05-24 深圳大学 Super-resolution processing method, device, equipment and medium for video
CN114549308A (en) * 2022-01-28 2022-05-27 大连大学 Perception-oriented image super-resolution reconstruction method and system with large receptive field
CN115293968A (en) * 2022-07-19 2022-11-04 武汉图科智能科技有限公司 Super-light-weight high-efficiency single-image super-resolution method
CN115546032A (en) * 2022-12-01 2022-12-30 泉州市蓝领物联科技有限公司 Single-frame image super-resolution method based on feature fusion and attention mechanism

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11874359B2 (en) * 2019-03-27 2024-01-16 The General Hospital Corporation Fast diffusion tensor MRI using deep learning
CN112801875B (en) * 2021-02-05 2022-04-22 深圳技术大学 Super-resolution reconstruction method and device, computer equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111369475A (en) * 2020-03-26 2020-07-03 北京百度网讯科技有限公司 Method and apparatus for processing video
CN112750094A (en) * 2020-12-30 2021-05-04 合肥工业大学 Video processing method and system
CN112507997A (en) * 2021-02-08 2021-03-16 之江实验室 Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion
CN113298718A (en) * 2021-06-22 2021-08-24 云南大学 Single image super-resolution reconstruction method and system
CN114549308A (en) * 2022-01-28 2022-05-27 大连大学 Perception-oriented image super-resolution reconstruction method and system with large receptive field
CN114529456A (en) * 2022-02-21 2022-05-24 深圳大学 Super-resolution processing method, device, equipment and medium for video
CN115293968A (en) * 2022-07-19 2022-11-04 武汉图科智能科技有限公司 Super-light-weight high-efficiency single-image super-resolution method
CN115546032A (en) * 2022-12-01 2022-12-30 泉州市蓝领物联科技有限公司 Single-frame image super-resolution method based on feature fusion and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jianping Luo et al. Video Super-Resolution using Multi-scale Pyramid 3D Convolutional Networks. MM '20: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1882-1890. *
Multi-branch Networks for Video Super-Resolution with Dynamic Reconstruction Strategy; Dongyang Zhang et al.; IEEE Transactions on Circuits and Systems for Video Technology; Vol. 31, No. 10; pp. 3954-3966 *
Super-resolution reconstruction of finger vein images based on frequency division and multi-receptive-field residual dense blocks; Li Jiao et al.; Application Research of Computers; Vol. 39, No. 6; pp. 1897-1900, 1910 *

Also Published As

Publication number Publication date
CN115994857A (en) 2023-04-21

Similar Documents

Publication Publication Date Title
TWI728465B (en) Method, device and electronic apparatus for image processing and storage medium thereof
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN111402130A (en) Data processing method and data processing device
CN111445418A (en) Image defogging method and device and computer equipment
Yi et al. Efficient and accurate multi-scale topological network for single image dehazing
CN111681177B (en) Video processing method and device, computer readable storage medium and electronic equipment
CN111951195A (en) Image enhancement method and device
CN116168132B (en) Street view reconstruction model acquisition method, device, equipment and medium
CN110136067B (en) Real-time image generation method for super-resolution B-mode ultrasound image
CN113538235A (en) Training method and device of image processing model, electronic equipment and storage medium
Conde et al. Lens-to-lens bokeh effect transformation. NTIRE 2023 challenge report
CN115082306A (en) Image super-resolution method based on blueprint separable residual error network
CN111985597A (en) Model compression method and device
CN112929689A (en) Video frame insertion method, device, equipment and storage medium
CN114584785A (en) Real-time image stabilizing method and device for video image
CN112767255B (en) Image super-resolution reconstruction method and system based on feature separation fusion network
Yu et al. A review of single image super-resolution reconstruction based on deep learning
CN115994857B (en) Video super-resolution method, device, equipment and storage medium
Zhang et al. Dual feature enhanced video super-resolution network based on low-light scenarios
Yang et al. LAB-Net: LAB Color-Space Oriented Lightweight Network for Shadow Removal
CN114664410B (en) Video-based focus classification method and device, electronic equipment and medium
CN116862762A (en) Video superdivision method, device, equipment and storage medium
CN114723796A (en) Three-dimensional point cloud generation method and device and electronic equipment
Xu et al. Depth prediction from a single image based on non-parametric learning in the gradient domain
CN116012230B (en) Space-time video super-resolution method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant