CN116012230B - Space-time video super-resolution method, device, equipment and storage medium


Info

Publication number: CN116012230B (other version: CN116012230A)
Application number: CN202310092691.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 骆剑平; 郑敏妍
Applicant and assignee: Shenzhen University
Legal status: Active (granted)

Abstract

The invention discloses a space-time video super-resolution method, device, equipment and storage medium. The method comprises: acquiring a video frame sequence to be processed, where the sequence comprises at least one video frame group and each video frame group consists of four consecutive video frames; and inputting the video frame sequence to be processed into a preset space-time video super-resolution model and determining, from the generated output, a space-time video super-resolution result corresponding to the sequence. The space-time video super-resolution model is a neural network model trained by a set training method and contains at least two video frame alignment modules; each alignment module extracts spatial local feature information and temporal global feature information from the video frame sequence input to it and aligns that sequence according to the extracted spatial local and temporal global feature information. The technical solution of the embodiments of the invention improves the accuracy of the space-time video super-resolution result.

Description

Space-time video super-resolution method, device, equipment and storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular to a spatio-temporal video super-resolution method, apparatus, device, and storage medium.
Background
In recent years, with the increasing popularity of high-resolution display devices in daily life, the demand for high-resolution video has grown accordingly. In film production, high-speed cameras and high-resolution cameras have gradually come into use to capture clearer and finer pictures and improve visual quality. However, because such footage places high demands on hardware, these devices are not yet widely used; when high-frame-rate, high-resolution video frames cannot be captured directly, they must be obtained by software means.
In the prior art, space-time video super-resolution is usually implemented with convolutional neural networks. One approach builds the space-time video super-resolution network in two stages by combining a video frame interpolation method with a video super-resolution method: for example, interpolation of low-resolution video frames can be completed with a depth-aware video frame interpolation (DAIN) method, and reconstruction of high-resolution video frames can then be completed with a video super-resolution network based on a recurrent back-projection network (RBPN). The other approach trains a single-stage space-time video super-resolution model, such as STARnet (space-time-aware multi-resolution video enhancement) or Zooming Slow-Mo (fast and accurate one-stage space-time video super-resolution). An important step of space-time video super-resolution is inter-frame alignment, for which two strategies are mainly used at present: one predicts optical flow for motion compensation and warps a reference frame onto the target frame to complete inter-frame registration; the other fuses inter-frame information implicitly, commonly through 3D convolution, deformable convolution, Transformers, RNNs and the like.
However, temporal interpolation and spatial super-resolution are intrinsically correlated in space-time video super-resolution, and splitting the task into two separate stages makes it difficult to fully exploit this natural property. Of the two single-stage space-time super-resolution models proposed so far, STARnet and Zooming Slow-Mo, the former uses optical flow estimation for spatial alignment, so it depends heavily on the accuracy of the optical flow and performs poorly in scenes with large motions; the latter combines deformable convolution with a convolutional long short-term memory network for implicit inter-frame alignment, which requires an extremely large amount of computation and yields a very low inference speed. Neither meets the need for simple and accurate space-time video super-resolution.
Disclosure of Invention
The invention provides a space-time video super-resolution method, device, equipment and storage medium. Implicit motion compensation of low-resolution video frames is achieved by a neural network model that extracts spatial and temporal features separately, and local spatial features are combined with global temporal features during motion compensation, so that the features extracted by the neural network model are more diverse and accurate. The method is suitable for super-resolution processing of space-time video in various complex motion scenes and improves the accuracy of the space-time video super-resolution result.
In a first aspect, an embodiment of the present invention provides a spatio-temporal video super-resolution method, including:
acquiring a video frame sequence to be processed, where the video frame sequence to be processed comprises at least one video frame group and each video frame group consists of four consecutive video frames;
inputting the video frame sequence to be processed into a preset space-time video super-resolution model, and determining, according to the generated output, a space-time video super-resolution result corresponding to the video frame sequence to be processed;
wherein the space-time video super-resolution model is a neural network model trained by a set training method; the space-time video super-resolution model comprises at least two video frame alignment modules, and each video frame alignment module is used for extracting spatial local feature information and temporal global feature information of an input video frame sequence and aligning the input video frame sequence according to the spatial local feature information and the temporal global feature information.
In a second aspect, an embodiment of the present invention further provides a spatio-temporal video super-resolution apparatus, where the spatio-temporal video super-resolution apparatus includes:
the video frame acquisition module is used for acquiring a video frame sequence to be processed, where the video frame sequence to be processed comprises at least one video frame group and each video frame group consists of four consecutive video frames;
the super-resolution result generation module is used for inputting the video frame sequence to be processed into a preset space-time video super-resolution model and determining, according to the generated output, a space-time video super-resolution result corresponding to the video frame sequence to be processed;
wherein the space-time video super-resolution model is a neural network model trained by a set training method; the space-time video super-resolution model comprises at least two video frame alignment modules, and each video frame alignment module is used for extracting spatial local feature information and temporal global feature information of an input video frame sequence and aligning the input video frame sequence according to the spatial local feature information and the temporal global feature information.
In a third aspect, an embodiment of the present invention further provides a spatio-temporal video super-resolution apparatus, where the spatio-temporal video super-resolution apparatus includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the spatio-temporal video super resolution method of any of the embodiments of this invention.
In a fourth aspect, embodiments of the present invention further provide a computer readable storage medium storing computer instructions for causing a processor to implement the spatio-temporal video super resolution method of any of the embodiments of the present invention when executed.
The embodiments of the invention provide a space-time video super-resolution method, device, equipment and storage medium. A video frame sequence to be processed is acquired, where the sequence comprises at least one video frame group and each video frame group consists of four consecutive video frames; the video frame sequence to be processed is input into a preset space-time video super-resolution model, and a space-time video super-resolution result corresponding to the sequence is determined according to the generated output. The space-time video super-resolution model is a neural network model trained by a set training method and comprises at least two video frame alignment modules, each of which extracts spatial local feature information and temporal global feature information from the video frame sequence input to it and aligns that sequence accordingly. With this technical solution, the acquired low-frame-rate, low-resolution video frame sequence to be processed is input into a pre-trained space-time video super-resolution model; the two video frame alignment modules in the model separately extract and then fuse spatial local feature information and temporal global feature information, thereby aligning the low-frame-rate video frames and the video frames obtained after the frame rate has been increased, and finally producing a space-time video super-resolution result in which both the frame rate and the resolution are improved. This solves the problems that existing two-stage space-time video super-resolution methods can hardly account for the correlation within the space-time relationship, while existing single-stage methods are limited in accuracy and require huge numbers of parameters and computations during model training.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a spatio-temporal video super-resolution method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a spatio-temporal video super-resolution method according to a second embodiment of the present invention;
Fig. 3 is a flowchart illustrating a method for determining a first alignment feature map sequence according to the second embodiment of the present invention;
Fig. 4 is a schematic diagram of a first video frame alignment module according to the second embodiment of the present invention;
Fig. 5 is a flowchart illustrating a process of inputting a first feature map sequence and a first alignment feature map sequence to a feature fusion module to generate a second feature map sequence according to the second embodiment of the present invention;
Fig. 6 is a diagram illustrating the structure of a spatio-temporal video super-resolution model according to the second embodiment of the present invention;
Fig. 7 is a flowchart illustrating training of a spatio-temporal video super-resolution model by a set method according to the second embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a spatio-temporal video super-resolution device according to a third embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a spatio-temporal video super-resolution device according to a fourth embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
Fig. 1 is a flowchart of a spatio-temporal video super-resolution method according to a first embodiment of the present invention. The embodiment is applicable to the case where a low-frame-rate, low-resolution video frame sequence is super-resolved to obtain a high-frame-rate, high-resolution video frame sequence. The method may be performed by a spatio-temporal video super-resolution device, which may be implemented by software and/or hardware and configured on spatio-temporal video super-resolution equipment; the equipment may be computer equipment such as a laptop, a desktop computer, a tablet, or the like.
As shown in fig. 1, the spatio-temporal video super-resolution method provided in the first embodiment of the present invention specifically includes the following steps:
s101, acquiring a video frame sequence to be processed.
The video frame sequence to be processed comprises at least one video frame group, and each video frame group is formed by four consecutive video frames.
In this embodiment, the video frame sequence to be processed may be understood as a set of video frames, collected by a video acquisition device or obtained through other channels, whose resolution needs to be improved. A video frame group may be understood as a group of consecutive video frames divided from the video frame sequence to be processed according to a preset rule so as to meet the processing requirements of the space-time video super-resolution model; optionally, a video frame group may comprise four consecutive video frames.
Specifically, when video data whose frame rate and resolution need to be improved is collected by a video acquisition device or obtained through other channels such as a network, the video data consists of a number of temporally consecutive video frames. The video data may be divided into several video frame groups according to a preset rule, each group containing four consecutive video frames, and the groups are then ordered according to the time order of the video frames they contain, yielding the corresponding video frame sequence to be processed.
S102, inputting the video frame sequence to be processed into a preset space-time video super-resolution model, and determining, according to the generated output, a space-time video super-resolution result corresponding to the video frame sequence to be processed.
The space-time video super-resolution model is a neural network model trained by a set training method; it comprises at least two video frame alignment modules, each of which is used for extracting spatial local feature information and temporal global feature information of an input video frame sequence and aligning the input video frame sequence according to the spatial local feature information and the temporal global feature information.
In this embodiment, the spatio-temporal video super-resolution model may be understood as a neural network model that performs video frame interpolation and high-resolution video frame reconstruction on the video frame groups of the input low-frame-rate, low-resolution video frame sequence to be processed. The spatio-temporal video super-resolution result may be understood as the set of video frames, corresponding to each video frame group in the sequence to be processed, for which interpolation and high-resolution reconstruction have been completed. The video frame alignment module may be understood as a neural network sub-model within the spatio-temporal video super-resolution model that aligns an input feature map sequence by extracting spatial local feature information in a space-time separated manner and fusing it with the extracted temporal global features. The spatial local feature information may be understood as the spatial feature information between pixels located in a local block, extracted by a local Transformer network. The temporal global feature information may be understood as the global temporal feature information of a video frame group extracted by a global Transformer network, i.e. pixel feature information for the pixel points of the same target object across different video frames.
In general, a neural network model (NN) may be understood as a complex network system formed by a large number of simple processing units (also referred to as neurons) widely connected to each other; it reflects many basic features of human brain function and is a highly complex nonlinear dynamic learning system. In short, a neural network model is a mathematical model built from neurons. A neural network model consists of several neural network layers; different layers perform different processing, such as convolution or normalization, on the data input to them, and several different layers can be combined according to preset rules into the different modules of the model. Optionally, the spatio-temporal video super-resolution model provided in the embodiment of the present invention may be a neural network model trained with an Adam optimizer, or a neural network model trained in other manners, which is not limited by the embodiments of the present invention.
Specifically, the video frame sequence to be processed is input into the preset spatio-temporal video super-resolution model so that the model can process each video frame group in the sequence. While the model performs frame interpolation, feature extraction and fusion on a video frame group, the space-time separated video frame alignment modules complete the pixel alignment of the video frames in the group, producing a high-frame-rate, high-resolution spatio-temporal video frame group corresponding to each input group. According to the arrangement of the video frame groups in the sequence to be processed, the spatio-temporal video frame groups corresponding to the sequence are obtained; duplicate video frames in the arranged groups are removed so that only one copy of each identical frame is retained, and the spatio-temporal video super-resolution result corresponding to the video frame sequence to be processed is obtained.
It should be clear that, when the video frame sequence to be processed in the embodiment of the present invention includes more than one video frame group, for two adjacent video frame groups the last video frame of the previous group should be the same as the first video frame of the next group, so as to ensure the continuity of the obtained spatio-temporal video super-resolution result. For example, assuming that the video frame sequence to be processed contains two video frame groups consisting of frames [1,2,3,4] and [4,5,6,7] of the video data, the sequence to be processed may be represented as [1,2,3,4,5,6,7] or as [1,2,3,4,4,5,6,7]; the embodiments of the present invention are not limited in this respect.
According to the technical solution of this embodiment, a video frame sequence to be processed is acquired, where the sequence comprises at least one video frame group and each group consists of four consecutive video frames; the sequence is input into a preset space-time video super-resolution model, and a space-time video super-resolution result corresponding to the sequence is determined according to the generated output. The space-time video super-resolution model is a neural network model trained by a set training method and comprises at least two video frame alignment modules, each of which extracts spatial local feature information and temporal global feature information from the video frame sequence input to it and aligns that sequence accordingly. With this technical solution, the acquired low-frame-rate, low-resolution video frame sequence to be processed is input into a pre-trained space-time video super-resolution model; the two video frame alignment modules separately extract and then fuse spatial local and temporal global feature information, aligning the low-frame-rate video frames and the video frames obtained after the frame rate has been increased, so that a space-time video super-resolution result in which both the frame rate and the resolution are improved is finally obtained. This solves the problems that existing two-stage space-time video super-resolution methods can hardly account for the correlation within the space-time relationship, while existing single-stage methods are limited in accuracy and require huge numbers of parameters and computations during model training.
Example two
Fig. 2 is a flowchart of a spatio-temporal video super-resolution method provided by a second embodiment of the present invention. The technical solution of this embodiment may be further optimized on the basis of the above alternative technical solutions. It clarifies how the video frame sequence to be processed is generated from the video data to be processed, and specifies that the spatio-temporal video super-resolution model includes a feature extraction module, a first video frame alignment module, a feature fusion module, a second video frame alignment module and a video frame reconstruction module, where the first and second video frame alignment modules have the same structure. The first and second video frame alignment modules twice perform separation and extraction of the spatio-temporal features of the video frame group input to them, thereby achieving pixel alignment of the feature maps in the feature map sequence input to them.
As shown in fig. 2, the spatio-temporal video super-resolution method provided in the second embodiment of the present invention specifically includes the following steps:
S201, obtaining video data to be processed, dividing the video data to be processed with a preset sliding window, and determining at least one video frame group.
The window size of the preset sliding window is four frames, and the sliding step length is three frames.
In this embodiment, the video data to be processed may be understood as video data whose frame rate and resolution need to be improved; for example, it may be video circulated on the Internet whose frame rate and resolution need to be raised, or video acquired directly by a video acquisition device whose frame rate and resolution cannot meet the actual requirement. The preset sliding window may be understood as a window whose size is set in advance according to the actual situation and which is used for dividing the video frames of the video data to be processed. Optionally, according to the data processing requirements of the space-time video super-resolution model, the frame interpolation requirement of the video frame groups is taken into account during design; since interpolating from only one or two frames may not give a good result, the window size of the preset sliding window may be set to four frames, and it may also be adjusted adaptively according to actual requirements.
Specifically, the video data to be processed is obtained through different channels. According to the data processing requirements of the space-time video super-resolution model, the size of the preset sliding window is set to four frames and the sliding step to three frames; the video data to be processed is divided with this sliding window, and the four consecutive video frames inside the window at each position are determined as one video frame group. In other words, the number of complete slides of the preset sliding window over the video data to be processed is the number of video frame groups into which the data can be divided, as sketched below.
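For illustration only, the sliding-window grouping described above can be sketched in a few lines of Python; the function name and the list-slicing approach are illustrative assumptions and not part of the claimed method.

def split_into_frame_groups(frames, window=4, stride=3):
    """Divide a list of video frames into overlapping groups.

    With window=4 and stride=3, consecutive groups share exactly one frame
    (the last frame of one group is the first frame of the next), e.g. frames
    indexed 0..7 yield [0,1,2,3] and [3,4,5,6]; a trailing remainder shorter
    than the window is simply dropped in this sketch.
    """
    groups = []
    for start in range(0, len(frames) - window + 1, stride):
        groups.append(frames[start:start + window])
    return groups

# Example: 7 frames numbered 1..7 yield the two groups used in the text above.
print(split_into_frame_groups([1, 2, 3, 4, 5, 6, 7]))  # [[1, 2, 3, 4], [4, 5, 6, 7]]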
S202, sequencing each video frame group according to the time sequence, and determining a video frame sequence to be processed.
Specifically, since the video frames in the video data to be processed are arranged in shooting time order, when the data is divided into several video frame groups the groups need to be ordered according to the time sequence in order to ensure the continuity of data processing, and the set of ordered video frame groups is determined as the video frame sequence to be processed. Optionally, since the last video frame of the previous group is the same as the first video frame of the next group for two adjacent groups, the ordering may be completed according to the first and last video frames of each group, or the shooting time of the first video frame of each group may be used as the ordering basis; the embodiments of the present invention are not limited in this respect.
Further, the space-time video super-resolution model comprises a feature extraction module, a first video frame alignment module, a feature fusion module, a second video frame alignment module and a video frame reconstruction module; the first video frame alignment module and the second video frame alignment module have the same structure.
In this embodiment, the feature extraction module may be understood as a neural network sub-model that performs feature extraction on the images or video frames input to it. It should be clear that the feature extraction method used by the feature extraction module may be chosen according to actual requirements, which is not limited by the embodiments of the present invention. The first video frame alignment module may be understood as a neural network sub-model that extracts the spatial local feature information of the video frame sequence or feature map sequence input to it together with the temporal global feature information fused with that spatial local feature information, and aligns the feature maps or video frames according to the extracted spatio-temporal feature information. The feature fusion module may be understood as a neural network sub-model that extracts and optimally combines the feature vectors of the feature map sequence input to it and inserts the fused feature maps at specific positions in that sequence. The second video frame alignment module has the same structure as the first video frame alignment module and is used to align the feature map sequence, with inserted frames, that is input to it. The video frame reconstruction module may be understood as a neural network sub-model for high-resolution reconstruction of the input low-resolution video frames.
S203, for each video frame group in the video frame sequence to be processed, inputting the video frame group into the feature extraction module of the space-time video super-resolution model and determining a first feature map sequence.
In this embodiment, the first feature map sequence may be specifically understood as a set of first feature maps corresponding to each video frame obtained after feature extraction for each video frame in the video frame group.
Specifically, since there may be several video frame groups in the video sequence to be processed, the embodiment of the present invention takes the processing of one video frame group input into the spatio-temporal video super-resolution model as an example. When a video frame group is input into the model, it is first input to the feature extraction module, which extracts the feature vectors of each video frame in the group to obtain the first feature map corresponding to each video frame; the first feature maps are ordered according to the arrangement of the video frames in the group, giving the first feature map sequence corresponding to the group. Optionally, the feature extraction module may consist of several two-dimensional convolutional neural network layers, or any method capable of extracting image features may be used, which is not limited by the embodiments of the present invention; a sketch of such a module follows.
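For illustration only, the following is a minimal PyTorch sketch of such a feature extraction module, assuming it is a small stack of two-dimensional convolutions applied frame-wise to the video frame group; the channel count (64) and the number of layers are illustrative choices rather than values fixed by this embodiment.

import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Per-frame shallow feature extraction with stacked 2D convolutions."""
    def __init__(self, in_channels=3, feat_channels=64, num_layers=3):
        super().__init__()
        layers = []
        ch = in_channels
        for _ in range(num_layers):
            layers += [nn.Conv2d(ch, feat_channels, kernel_size=3, padding=1),
                       nn.LeakyReLU(0.1, inplace=True)]
            ch = feat_channels
        self.body = nn.Sequential(*layers)

    def forward(self, frames):            # frames: (N, C, H, W), N = 4 frames in a group
        return self.body(frames)          # first feature map sequence: (N, feat_channels, H, W)

# Usage: a group of four 3-channel frames of size 64x64.
group = torch.randn(4, 3, 64, 64)
first_feature_maps = FeatureExtractor()(group)   # -> (4, 64, 64, 64)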
S204, inputting the first feature map sequence into a first video frame alignment module to extract space local feature information and time global feature information of the first feature map sequence, aligning the first feature map sequence according to the space local feature information and the time global feature information, and determining a first aligned feature map sequence.
Specifically, the first feature map sequence is input to the first video frame alignment module. The neural network layers in the module extract the spatial local feature information of each first feature map in the sequence and, on that basis, extract the temporal global feature information between the first feature maps; spatio-temporally fused feature information is thus extracted on the basis of space-time separation, and the pixel points of each first feature map in the sequence are warped into alignment according to this fused information, giving the aligned first alignment feature map sequence.
Further, the first video frame alignment module comprises a first temporal-spatial separation feature extraction sub-module, a first downsampling sub-module, a second temporal-spatial separation feature extraction sub-module, a second downsampling sub-module, a third temporal-spatial separation feature extraction sub-module, a first upsampling sub-module and a second upsampling sub-module; the first time-space separation characteristic extraction submodule, the second time-space separation characteristic extraction submodule and the third time-space separation characteristic extraction submodule have the same structure.
In this embodiment, the first time-space separation feature extraction submodule may be understood as a combination of several neural network layers that separates, extracts and fuses the spatial local feature information and temporal global feature information of the feature maps input to it. The first downsampling submodule may be understood as a combination of neural network layers that shrinks the feature maps input to it by a preset scaling factor, thereby reducing the number of sampling points. The second time-space separation feature extraction submodule performs the same separation, extraction and fusion of spatial local and temporal global feature information on feature maps that have been downsampled once. The second downsampling submodule, arranged after the second time-space separation feature extraction submodule, shrinks the feature maps output by that submodule by the preset scaling factor. The third time-space separation feature extraction submodule performs the separation, extraction and fusion of spatial local and temporal global feature information on feature maps that have been downsampled twice. The first upsampling submodule may be understood as a combination of neural network layers that enlarges the feature maps input to it by the preset scaling factor so as to restore feature map resolution. The second upsampling submodule, arranged after the first upsampling submodule and the second time-space separation feature extraction submodule, enlarges the feature maps that have undergone one upsampling and feature fusion by the preset scaling factor. The first, second and third time-space separation feature extraction submodules have the same structure; the first and second downsampling submodules have the same structure; and the first and second upsampling submodules have the same structure. Optionally, the scaling factor of each sampling submodule may be 2; the embodiments of the present invention do not limit the choice of scaling factor.
Accordingly, fig. 3 is a flowchart illustrating a process for determining a first alignment feature map sequence according to a second embodiment of the present invention, where, as shown in fig. 3, the first feature map sequence is input to a first video frame alignment module to extract first spatial local feature information and first temporal global feature information from the first feature map sequence, and the first feature map sequence is aligned according to the first spatial local feature information and the first temporal global feature information, so as to determine the first alignment feature map sequence, which may be implemented specifically by the following steps:
S2041, inputting the first feature map sequence into the first time-space separation feature extraction submodule, separating, extracting and fusing the time-space features, and determining the first time-space feature map sequence.
Specifically, the first feature map sequence is input into the first time-space separation feature extraction submodule. Local blocks are divided in advance so that a local Transformer can extract the spatial local features within each local block of each first feature map; based on the extracted spatial local features, a global Transformer network then performs global self-attention computation over time, giving first time-space feature maps that, after fusion, contain both local-level and global-level features. The first time-space feature maps are arranged in the order corresponding to the first feature map sequence, yielding the first time-space feature map sequence.
Further, the first time-space separation feature extraction submodule comprises a two-dimensional convolution layer, a spatial local feature extraction network and a temporal global feature extraction network.
In this embodiment, the two-dimensional convolution layer may be understood as a neural network layer that performs feature extraction on the feature maps input to it and converts them into feature matrices. The spatial local feature extraction network may be understood as a local Transformer network for extracting the spatial features between the pixel points within a preset local block. The temporal global feature extraction network may be understood as a global Transformer network for extracting the temporal features between the pixel points of the same target object across the feature maps over the whole sequence.
Correspondingly, inputting the first feature map sequence into the first time-space separation feature extraction submodule to perform separation, extraction and fusion of the time-space features and determine the first time-space feature map sequence may be realized by the following steps:
a) Inputting the first feature map sequence into the two-dimensional convolution layer to generate a first feature matrix set.
Specifically, the first feature map sequence is input to the two-dimensional convolution layer, which performs feature extraction on each first feature map in the sequence to obtain the corresponding query matrix, key matrix and value matrix; the set of the query, key and value matrices is determined as the first feature matrix set.
For example, assume that the input first feature map sequence is represented as F ∈ R^(N×C×H1×W1), where N is the sequence length of the first feature map sequence, C is the number of channels, H1 is the height of the first feature maps and W1 is their width. Denoting the query matrix by Q, the key matrix by K and the value matrix by V, the generation of the first feature matrix set may be expressed as:

[Q, K, V] = Conv2D(F)
b) The first feature matrix set is input to a space local feature extraction network to conduct dimension mapping on the first feature matrix set to obtain a second feature matrix set, and local self-attention calculation is conducted on the second feature matrix set to extract first space local feature information.
Specifically, a first feature matrix set is input to a spatial local feature extraction network, dimension mapping is conducted on the first feature matrix set to enable the first feature matrix set to meet the calculation requirement of the spatial local feature extraction network, a corresponding second feature matrix set is obtained, and then local self-attention calculation is conducted on the second feature matrix set to obtain first spatial local feature information corresponding to the first feature matrix set.
Following the above example, the first feature matrix set is dimension-mapped to obtain the second feature matrix set, whose dimensions correspond to the partition of each feature map into local blocks of size P×P, which may be expressed by the following formula:

[Q1, K1, V1] = reshape([Q, K, V])

where P is the size of the local block; in the embodiment of the present invention P may be set so that each block is 8×8 pixels, and the embodiment of the present invention does not limit the size of P.
Further, local self-attention calculation is performed on the second feature matrix set to obtain the first spatial local feature information S1, which may be expressed by the following formula, where d denotes the feature dimension over which the attention is computed:

S1 = softmax(Q1 · K1^T / sqrt(d)) · V1
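For illustration only, the following PyTorch sketch combines steps a) and b): a shared convolution produces Q, K and V, which are reshaped into non-overlapping P×P blocks and attended within each block. The tensor shapes, the softmax scaling by the channel dimension and the module name are illustrative assumptions, not values fixed by this embodiment.

import torch
import torch.nn as nn

class LocalSpatialAttention(nn.Module):
    """Self-attention restricted to non-overlapping P x P spatial blocks."""
    def __init__(self, channels=64, block=8):
        super().__init__()
        self.block = block
        self.to_qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)

    def forward(self, x):                      # x: (N, C, H, W), H and W divisible by block
        n, c, h, w = x.shape
        p = self.block
        q, k, v = self.to_qkv(x).chunk(3, dim=1)

        def to_blocks(t):                      # (N, C, H, W) -> (N * H/p * W/p, p*p, C)
            t = t.reshape(n, c, h // p, p, w // p, p)
            return t.permute(0, 2, 4, 3, 5, 1).reshape(-1, p * p, c)

        q, k, v = map(to_blocks, (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)
        s = attn @ v                           # (N * H/p * W/p, p*p, C)

        # fold the blocks back into feature maps of shape (N, C, H, W)
        s = s.reshape(n, h // p, w // p, p, p, c).permute(0, 5, 1, 3, 2, 4)
        return s.reshape(n, c, h, w)

# Usage: spatial local features for a group of four 64-channel feature maps.
s1 = LocalSpatialAttention()(torch.randn(4, 64, 64, 64))   # -> (4, 64, 64, 64)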
c) Inputting the first spatial local feature information into the temporal global feature extraction network so as to remap it into a third feature matrix set, and determining the first time-space feature information by performing global self-attention calculation on the third feature matrix set.
Specifically, to meet the computation requirements of the temporal global feature extraction network, the first spatial local feature information input into the network is first remapped to obtain the corresponding third feature matrix set; global self-attention calculation over time is then performed on the third feature matrix set, giving first time-space feature information that contains both temporal and spatial information.
Following the above example, the first spatial local feature information is remapped to obtain the third feature matrix set of dimension H1W1 × N × C, which may be expressed by the following formula:

[Q2, K2, V2] = reshape([S1, S1, S1])

where Q2 ∈ H1W1 × N × C, K2 ∈ H1W1 × N × C and V2 ∈ H1W1 × N × C.
Further, global self-attention calculation is performed on the third feature matrix set to obtain the first time-space feature information T1, which may be expressed by the following formula, where d again denotes the feature dimension over which the attention is computed:

T1 = softmax(Q2 · K2^T / sqrt(d)) · V2
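For illustration only, the temporal global self-attention of step c) can be sketched as follows, assuming the spatial positions are moved into the batch dimension so that attention runs across the N frames at each pixel location; the scaling factor is again an illustrative assumption.

import torch

def global_temporal_attention(s1):
    """Self-attention across the N frames, computed independently at each spatial position."""
    n, c, h, w = s1.shape
    # (N, C, H, W) -> (H*W, N, C): every pixel location attends over the N frames
    q = k = v = s1.permute(2, 3, 0, 1).reshape(h * w, n, c)
    attn = torch.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)   # (H*W, N, N)
    t1 = attn @ v                                                      # (H*W, N, C)
    return t1.reshape(h, w, n, c).permute(2, 3, 0, 1)                  # back to (N, C, H, W)

# Usage: fuse the spatial local features of a 4-frame group along time.
t1 = global_temporal_attention(torch.randn(4, 64, 64, 64))             # -> (4, 64, 64, 64)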
d) Performing dimension mapping on the first time-space feature information to obtain a first time-space feature map sequence with the same dimensions as the first feature map sequence.
Specifically, so that the feature maps output by the first time-space separation feature extraction submodule have the same dimensions as the feature maps input to it, after the first time-space feature information is obtained it is dimension-mapped once more to obtain the first time-space feature map sequence, whose dimensions are the same as those of the first feature map sequence.
S2042, the first time-space feature map sequence is input to the first downsampling submodule for downsampling, and a first scaled feature map sequence is generated.
Specifically, the first time-space feature map sequence is input to the first downsampling submodule and downsampled according to the scaling factor preset in that submodule, giving a first scaled feature map sequence containing large-scale features.
S2043, inputting the first scaled feature map sequence into a second space-time separation feature extraction submodule, and carrying out separation extraction and fusion of space-time features to determine a second space-time feature map sequence.
Specifically, the first scaled feature map sequence is input to the second time-space separation feature extraction submodule, which has the same structure as the first and performs spatio-temporal feature extraction and fusion on the once-downsampled first scaled feature map sequence, giving second time-space feature maps that contain both local-level and global-level features; these are arranged in the order corresponding to the first scaled feature map sequence to obtain the second time-space feature map sequence. Since the second time-space separation feature extraction submodule has the same structure as the first, its processing of the feature map sequence input to it is the same as in step S2041 and is not described again here.
S2044, inputting the second time-space feature map sequence into the second downsampling submodule for downsampling, and generating a second scaled feature map sequence.
Specifically, the second time-space feature map sequence is input to the second downsampling submodule and downsampled according to the scaling factor preset in that submodule, giving a second scaled feature map sequence that contains larger-scale features than the second time-space feature map sequence.
S2045, inputting the second scaled feature map sequence into the third time-space separation feature extraction submodule, performing separation, extraction and fusion of the time-space features, and determining a third time-space feature map sequence.
Specifically, the second scaled feature map sequence is input to the third time-space separation feature extraction submodule, which has the same structure as the first and performs spatio-temporal feature extraction and fusion on the twice-downsampled second scaled feature map sequence, giving third time-space feature maps that contain both local-level and global-level features; these are arranged in the order corresponding to the second scaled feature map sequence to obtain the third time-space feature map sequence. Since the third time-space separation feature extraction submodule has the same structure as the first, its processing of the feature map sequence input to it is the same as in step S2041 and is not described again here.
S2046, after feature addition of the second scaled feature map sequence and the third time-space feature map sequence, the result is input into the first upsampling submodule and the third scaled feature map sequence is determined.
Specifically, since the second scaled feature map sequence and the third time-space feature map sequence have the same size, they can be added feature-wise, realizing a skip connection between information at the same scale so that the resulting feature maps are richer in content. The feature map sequence obtained after this addition is input to the first upsampling submodule, whose scaling factor is the same as that of the first and second downsampling submodules, so the size of the sequence is restored and a third scaled feature map sequence is generated with the same size as the sequence had before it was input to the second downsampling submodule.
S2047, after feature addition of the third scaled feature map sequence and the second time-space feature map sequence, the result is input into the second upsampling submodule, and its output is determined as the first alignment feature map sequence corresponding to the first feature map sequence.
Specifically, the third scaled feature map sequence and the second time-space feature map sequence have the same size, so they can be added feature-wise, connecting information at the same scale by a skip connection once again. The feature map sequence obtained after this addition is input to the second upsampling submodule, whose scaling factor is also the same as that of the first downsampling submodule; since the third scaled feature map sequence and the second time-space feature map sequence have the size of the once-scaled first feature map sequence, the second upsampling submodule restores the size of the sequence input to it, giving the first alignment feature map sequence corresponding to the first feature map sequence. A sketch of this overall flow is given below.
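For illustration only, the overall wiring of the first video frame alignment module (three time-space separation blocks at three scales, two ×2 downsamplings, two ×2 upsamplings and additive skip connections) can be sketched as follows. The SpatioTemporalBlock below is only a stand-in (a single convolution) for the time-space separation feature extraction submodule sketched earlier; the strided-convolution and transposed-convolution sampling layers and the channel count are illustrative assumptions.

import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Stand-in for the time-space separation feature extraction submodule
    (local spatial attention followed by global temporal attention, as sketched
    earlier); one convolution is used here only to keep the example short."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return self.body(x)

class VideoFrameAlignment(nn.Module):
    """Three-scale alignment: extract, downsample, extract, downsample, extract,
    then upsample twice with additive skip connections at matching scales."""
    def __init__(self, channels=64):
        super().__init__()
        self.block1 = SpatioTemporalBlock(channels)
        self.block2 = SpatioTemporalBlock(channels)
        self.block3 = SpatioTemporalBlock(channels)
        self.down1 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.up1 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)

    def forward(self, f):                      # f: (N, C, H, W) feature map sequence
        t1 = self.block1(f)                    # first time-space feature maps, full scale
        d1 = self.down1(t1)                    # first scaled feature maps, 1/2 scale
        t2 = self.block2(d1)                   # second time-space feature maps, 1/2 scale
        d2 = self.down2(t2)                    # second scaled feature maps, 1/4 scale
        t3 = self.block3(d2)                   # third time-space feature maps, 1/4 scale
        u1 = self.up1(d2 + t3)                 # skip connection at 1/4 scale, back to 1/2
        u2 = self.up2(u1 + t2)                 # skip connection at 1/2 scale, back to full size
        return u2                              # first alignment feature map sequence

aligned = VideoFrameAlignment()(torch.randn(4, 64, 64, 64))   # -> (4, 64, 64, 64)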
Fig. 4 is an exemplary structural diagram of the first video frame alignment module according to the second embodiment of the present invention, in which the structure of the first time-space separation feature extraction submodule is shown in detail; T1, T2 and T3 denote the time-space feature map sequences at the different scales, namely the first, second and third time-space feature map sequences.
S205, inputting the first feature map sequence and the first alignment feature map sequence to a feature fusion module to generate a second feature map sequence.
Specifically, the first feature map sequence and the first alignment feature map sequence are input to the feature fusion module. Feature fusion is performed on the aligned first alignment feature map sequence to obtain feature maps that can be inserted into the first feature map sequence, and these feature maps are then inserted into the first feature map sequence in a preset interpolation manner, giving a second feature map sequence with an increased frame rate.
Further, the feature fusion module comprises a feature fusion sub-module and a feature map integration sub-module.
Fig. 5 is a flowchart illustrating a process of inputting a first feature map sequence and a first alignment feature map sequence to a feature fusion module to generate a second feature map sequence, which is provided in a second embodiment of the present invention, and specifically includes the following steps:
S2051, inputting a first alignment feature map sequence to a feature fusion sub-module, and carrying out feature fusion on the first alignment feature map to generate an intermediate feature map sequence; the intermediate feature map sequence comprises three intermediate feature maps.
In this embodiment, the feature fusion submodule may be specifically understood as a combination of multiple neural network layers for performing feature extraction and optimization combination on multiple existing feature sets to generate new fusion features, and optionally, the feature fusion submodule may be a three-dimensional convolutional neural network submodule, which is not limited in this embodiment of the present invention.
Specifically, the first alignment feature map sequence is input to the feature fusion submodule. Since the first alignment feature maps have already been aligned with respect to the same objects, the feature fusion submodule extracts and combines their features to generate an intermediate feature map sequence containing the fused features. Because there are four feature maps in the first feature map sequence, i.e. three intervals between them, the generated intermediate feature map sequence contains three feature maps so that they can be inserted uniformly.
S2052, inputting the intermediate feature map sequence and the first feature map sequence into a feature map integration sub-module, and respectively inserting each intermediate feature map between two adjacent first feature maps in the first feature map sequence to generate a second feature map sequence.
Illustratively, it is assumed that the first sequence of feature maps is denoted [1,2,3,4], and the generated second sequence of feature maps is denoted [1',2',3',4',5',6',7'], where "1'", "3'", "5'", and "7'" are feature maps in the first sequence of feature maps, and "2'", "4'", and "6'" are intermediate feature maps inserted into the first sequence of feature maps.
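For illustration only, the feature fusion and frame-rate up-conversion described above can be sketched as follows, assuming a single 3D convolution with a temporal kernel of size 2 collapses the four aligned maps into three intermediate maps before interleaving; the actual number and configuration of fusion layers are not fixed by this sketch.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse the four aligned feature maps into three intermediate maps with a 3D
    convolution, then interleave them between the original four maps (4 -> 7)."""
    def __init__(self, channels=64):
        super().__init__()
        # Collapse the temporal dimension 4 -> 3 with a kernel of size 2 along time.
        self.fuse = nn.Conv3d(channels, channels, kernel_size=(2, 3, 3), padding=(0, 1, 1))

    def forward(self, first_maps, aligned_maps):         # both: (4, C, H, W)
        x = aligned_maps.permute(1, 0, 2, 3).unsqueeze(0)    # (1, C, 4, H, W)
        mids = self.fuse(x).squeeze(0).permute(1, 0, 2, 3)   # (3, C, H, W)

        out = []
        for i in range(3):                    # interleave: 1', 2', 3', 4', 5', 6', 7'
            out.append(first_maps[i])
            out.append(mids[i])
        out.append(first_maps[3])
        return torch.stack(out, dim=0)        # second feature map sequence: (7, C, H, W)

second_maps = FeatureFusion()(torch.randn(4, 64, 64, 64), torch.randn(4, 64, 64, 64))
print(second_maps.shape)                      # torch.Size([7, 64, 64, 64])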
S206, inputting the second characteristic diagram sequence into a second video frame alignment module, and determining the second alignment characteristic diagram sequence.
Specifically, after the second feature map sequence with the increased frame rate is obtained, it is input into the second video frame alignment module, whose structure is the same as that of the first video frame alignment module, and the second feature maps are aligned in the same manner to obtain the corresponding second alignment feature map sequence. Since the alignment of a feature map sequence has already been described in step S204, it is not repeated here.
S207, inputting the second alignment feature map sequence to a video frame reconstruction module for channel multiplication and pixel rearrangement, and determining a super-resolution video frame sequence corresponding to the video frame group.
Specifically, the second alignment feature map sequence, which has been frame-rate-improved and aligned, is input to the video frame reconstruction module. The video frame reconstruction module performs high-resolution reconstruction on each video frame input into it and may be implemented with a sub-pixel convolution network. After the second alignment feature map sequence is input, the video frame reconstruction module performs feature extraction and channel multiplication on each feature map in the sequence, then rearranges the pixel points of the channel-multiplied feature maps, converting channel information into resolution information to achieve resolution multiplication, and obtains the super-resolution video frame corresponding to each second alignment feature map. The super-resolution video frames are arranged in the order of the second alignment feature map sequence to obtain the super-resolution video frame sequence corresponding to the video frame group to which the second alignment feature map sequence belongs.
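A minimal sketch of such a sub-pixel convolution reconstruction head, assuming PyTorch, a feature channel count of 64 and a 4x spatial scale factor (both assumptions): channel multiplication is done by an ordinary 2D convolution and pixel rearrangement by PixelShuffle.

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Sketch: channel multiplication followed by pixel rearrangement (sub-pixel conv)."""
    def __init__(self, channels: int = 64, scale: int = 4, out_channels: int = 3):
        super().__init__()
        # Channel multiplication: expand to scale**2 channels per output channel.
        self.expand = nn.Conv2d(channels, out_channels * scale * scale,
                                kernel_size=3, padding=1)
        # Pixel rearrangement: move channel information into spatial resolution.
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.expand(feat))

if __name__ == "__main__":
    feat = torch.randn(1, 64, 45, 80)   # one aligned feature map
    frame = ReconstructionHead()(feat)
    print(frame.shape)                  # torch.Size([1, 3, 180, 320]) -> 4x resolution
```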
S208, arranging the super-resolution video frame sequences according to the time sequence, and generating a space-time video super-resolution result corresponding to the video frame sequence to be processed.
Specifically, since the video frame groups are arranged in time order, to ensure the sequence accuracy of the spatio-temporal video super-resolution result, the super-resolution video frame sequences corresponding to the video frame groups can be sorted according to the time order of those groups, thereby obtaining the spatio-temporal video super-resolution result corresponding to the video frame sequence to be processed.
Fig. 6 is an exemplary structural diagram of a spatio-temporal video super-resolution model according to the second embodiment of the present invention. Fig. 6 shows the data flow when a video frame group is input into the spatio-temporal video super-resolution model for processing; the specific data processing flow has been described in the above steps and is not repeated here.
Further, fig. 7 is a flowchart of training a spatio-temporal video super-resolution model by using a set training method according to the second embodiment of the present invention. As shown in fig. 7, the training specifically includes the following steps:
S301, inputting a video training sample set into an initial space-time video super-resolution model, and determining a reconstruction intermediate result.
The video training sample set comprises a first video frame sample set with low frame rate and low resolution and a second video frame sample set with high frame rate and high resolution corresponding to the first video frame sample set.
In this embodiment, the video training sample set may be specifically understood as a set of training samples that are input into the initial spatio-temporal video super-resolution model to train it; the set contains a first video frame sample set with a low frame rate and low resolution and a corresponding second video frame sample set with a high frame rate and high resolution, both determined according to actual requirements. The initial spatio-temporal video super-resolution model may be understood as the untrained spatio-temporal video super-resolution model: the modules and neural network layers it contains are completely consistent with those of the spatio-temporal video super-resolution model, but its weight parameters have not yet been adjusted. The reconstruction intermediate result may be understood as the output obtained after the untrained initial spatio-temporal video super-resolution model performs frame interpolation and super-resolution reconstruction on the video training samples input into it.
Specifically, a first video frame sample group in the video training sample set is input into the initial space-time video super-resolution model. The initial model performs video frame alignment, feature fusion and frame interpolation on the first video frame sample group to obtain a high-frame-rate video frame group, performs space-time separation-and-fusion video frame alignment on the high-frame-rate video frame group, and then carries out high-resolution reconstruction on the aligned high-frame-rate video frame group; the output is determined as the reconstruction intermediate result corresponding to the first video frame sample group.
The video training sample set may be constructed from the public benchmark data set Vimeo-90K. This data set contains a large number of video frame sequences, each consisting of seven continuous video frames. The odd-numbered frames in each sequence are extracted and downsampled by a factor of four using bicubic interpolation; the downsampled frames may then be augmented by random cropping, rotation, mirroring and other methods and used as a first video frame sample set in the video training sample set. The original video frame sequence corresponding to the first video frame sample set is determined as the second video frame sample set, and the two are associated to form the corresponding video training sample set.
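A minimal sketch of building one training pair in the way just described, assuming each Vimeo-90K sequence is available as a list of seven (H, W, 3) uint8 frames with H and W divisible by the scale factor, and that OpenCV is used for bicubic resizing; the function name, patch size and the choice to apply every augmentation consistently to both groups are assumptions for illustration.

```python
import random
import cv2
import numpy as np

def make_training_pair(seven_frames, scale=4, patch=32):
    """Build one (first, second) training sample pair from a 7-frame sequence.

    seven_frames: list of 7 consecutive high-resolution frames, each (H, W, 3) uint8.
    Returns:
        lr_group: 4 low-resolution frames (the odd-numbered frames, 4x bicubic downsampled)
        hr_group: the 7 original high-resolution frames (the supervision target)
    """
    odd_frames = seven_frames[0::2]                      # frames 1, 3, 5, 7
    h, w = odd_frames[0].shape[:2]
    lr_group = [cv2.resize(f, (w // scale, h // scale),
                           interpolation=cv2.INTER_CUBIC) for f in odd_frames]
    hr_group = list(seven_frames)

    # Random crop, applied consistently so LR and HR patches still correspond.
    lh, lw = lr_group[0].shape[:2]
    top, left = random.randint(0, lh - patch), random.randint(0, lw - patch)
    lr_group = [f[top:top + patch, left:left + patch] for f in lr_group]
    hr_group = [f[top * scale:(top + patch) * scale,
                  left * scale:(left + patch) * scale] for f in hr_group]

    # Random mirroring and rotation, again applied to both groups identically.
    if random.random() < 0.5:
        lr_group = [np.fliplr(f).copy() for f in lr_group]
        hr_group = [np.fliplr(f).copy() for f in hr_group]
    if random.random() < 0.5:
        lr_group = [np.rot90(f).copy() for f in lr_group]
        hr_group = [np.rot90(f).copy() for f in hr_group]
    return lr_group, hr_group
```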
S302, determining a loss function according to the reconstruction intermediate result and the second video frame sample group.
In this embodiment, the loss function can be specifically understood as a function for measuring the distance between the model being trained in the deep learning process and an ideal model. It can be used for parameter estimation of the model so that the trained model reaches a convergence state, thereby reducing the error between the predicted value and the true value of the trained model.
Specifically, the second video frame sample group corresponding to the reconstruction intermediate result is determined; this second video frame sample group is regarded as the ideal state of the first video frame sample group after frame interpolation and reconstruction. The difference information of each pixel between the reconstruction intermediate result and the second video frame sample group is determined, and the corresponding loss function is then determined according to this difference information.
For example, in the embodiment of the present invention, a weighted combination of mean square error and structural similarity may be used as the loss function for training the initial spatio-temporal video super-resolution model, so as to supervise the model in generating high-frame-rate, high-resolution video frames. The loss may be expressed as:

Loss = (1/N) · Σ_{i=1}^{N} [ a · MSE(Ŷ_i, Y_i) + (1 − a) · (1 − SSIM(Ŷ_i, Y_i)) ]

where a = 0.99, N is the number of first video frame sample groups, Ŷ_i is the reconstruction intermediate result corresponding to the i-th group of first video frame samples, Y_i is the i-th group of second video frame samples, MSE is the mean square error function, and SSIM is the structural similarity function.
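A minimal sketch of such a combined loss, assuming the reconstruction intermediate results and the second video frame sample groups are float tensors in [0, 1] with shape (batch, frames, 3, H, W), and assuming a third-party SSIM implementation such as pytorch-msssim (that dependency is not prescribed by this embodiment):

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # assumed third-party SSIM implementation

def sr_loss(reconstructed, target, alpha=0.99):
    """Weighted MSE + structural-similarity loss for one batch of frame groups.

    reconstructed, target: (B, T, 3, H, W) tensors with values in [0, 1].
    """
    mse = F.mse_loss(reconstructed, target)
    # SSIM expects (N, C, H, W); fold the temporal axis into the batch axis.
    b, t, c, h, w = reconstructed.shape
    s = ssim(reconstructed.reshape(b * t, c, h, w),
             target.reshape(b * t, c, h, w), data_range=1.0)
    return alpha * mse + (1.0 - alpha) * (1.0 - s)
```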
S303, training the initial space-time video super-resolution model based on the loss function until a preset convergence condition is met, so as to obtain the space-time video super-resolution model.
In this embodiment, the preset convergence condition may be specifically understood as a condition preset according to the actual situation and used to judge whether the initial spatio-temporal video super-resolution model being trained has entered a convergence state. Optionally, the preset convergence condition may include: the difference between the reconstruction intermediate result and the second video frame sample group being smaller than a preset threshold; the change in weight parameters between two training iterations being smaller than a preset parameter-change threshold; the number of iterations exceeding a set maximum; or all video training samples having been used for training. The embodiment of the present invention is not limited in this respect.
Specifically, the initial spatio-temporal video super-resolution model is trained by using the obtained loss function, that is, the weight parameters of each neural network layer in the initial model are adjusted until the preset convergence condition is met, and the trained initial spatio-temporal video super-resolution model is determined as the spatio-temporal video super-resolution model to be used in practice.
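A minimal training-loop sketch under the same assumptions; the optimizer, learning rate, iteration cap and loss-change threshold are illustrative placeholders rather than values prescribed by this embodiment, and sr_loss refers to the loss sketch above.

```python
import torch

def train(model, loader, max_iters=600_000, patience_eps=1e-6, device="cuda"):
    """Train the initial model until a simple preset convergence condition is met."""
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    prev_loss, it = None, 0
    while it < max_iters:                            # condition 1: iteration cap
        for lr_group, hr_group in loader:            # loader yields tensor pairs
            lr_group, hr_group = lr_group.to(device), hr_group.to(device)
            intermediate = model(lr_group)           # reconstruction intermediate result
            loss = sr_loss(intermediate, hr_group)   # see loss sketch above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            # condition 2: loss change between iterations below a threshold
            if prev_loss is not None and abs(prev_loss - loss.item()) < patience_eps:
                return model
            prev_loss = loss.item()
            if it >= max_iters:
                break
    return model
```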
According to the technical scheme of this embodiment, the space-time video super-resolution model comprises a feature extraction module, a first video frame alignment module, a feature fusion module, a second video frame alignment module and a video frame reconstruction module, where the first and second video frame alignment modules have the same structure. Separation and extraction of space-time features in the video frame group input into the model are performed twice, once by each video frame alignment module, so that pixel alignment of each feature map in the feature map sequences inside the model is achieved. Because spatial local features and temporal global features are combined in the alignment process, the diversity of feature extraction is enriched and the alignment results are more accurate, making the space-time video super-resolution model applicable to video data processing in a variety of complex motion scenes.
Example III
Fig. 8 is a schematic structural diagram of a spatio-temporal video super-resolution device according to a third embodiment of the present invention, where the spatio-temporal video super-resolution device includes: a video frame acquisition module 31 and a super-resolution result generation module 32.
The video frame acquisition module 31 is configured to acquire a video frame sequence to be processed; the video frame sequence to be processed comprises at least one video frame group, wherein the video frame group is four continuous video frames; the super-resolution result generating module 32 is configured to input a video frame sequence to be processed into a preset spatio-temporal video super-resolution model, and determine a spatio-temporal video super-resolution result corresponding to the video frame sequence to be processed according to the output generation result; the space-time video super-resolution model is a neural network model trained by a set training method; the space-time video super-resolution model at least comprises two video frame alignment modules, wherein the video frame alignment modules are used for extracting space local characteristic information and time global characteristic information of an input video frame sequence and aligning the input video frame sequence according to the space local characteristic information and the time global characteristic information.
According to the technical scheme of this embodiment, the obtained low-frame-rate, low-resolution video frame sequence to be processed is input into a pre-trained space-time video super-resolution model. The two video frame alignment modules in the model separately extract and fuse spatial local feature information and temporal global feature information, thereby aligning the low-frame-rate video frames and, after the frame rate has been improved, aligning the interpolated video frames; finally, a space-time video super-resolution result with both frame rate and resolution improved is obtained. This solves the problems that the existing two-stage space-time video super-resolution processing methods find it difficult to take the correlation in the space-time relationship into account, that the single-stage methods pay insufficient attention to this correlation, and that huge numbers of parameters and operations are needed for model training.
Further, the video frame acquisition module 31 includes:
the video frame group determining unit is used for obtaining video data to be processed, dividing the video data to be processed through a preset sliding window and determining at least one video frame group; the window size of the preset sliding window is four frames, and the sliding step length is three frames (see the sliding-window sketch after this list);
and the video frame sequence determining unit is used for sequencing each video frame group according to the time sequence to determine the video frame sequence to be processed.
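A minimal sketch of the sliding-window grouping performed by the video frame group determining unit, assuming the video has already been decoded into an in-memory list of frames; with a window of four and a stride of three, adjacent groups share exactly one boundary frame.

```python
def split_into_groups(frames, window=4, stride=3):
    """Divide a decoded frame list into groups of four with a stride of three frames."""
    groups = []
    for start in range(0, len(frames) - window + 1, stride):
        groups.append(frames[start:start + window])
    return groups

# Frames 0..6 -> [[0, 1, 2, 3], [3, 4, 5, 6]]: consecutive groups overlap by one frame.
print(split_into_groups(list(range(7))))
```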
Further, the space-time video super-resolution model comprises a feature extraction module, a first video frame alignment module, a feature fusion module, a second video frame alignment module and a video frame reconstruction module; the first video frame alignment module and the second video frame alignment module have the same structure.
The super-resolution result generation module 32 includes:
the first sequence determining unit is used for inputting the video frame groups into the feature extraction module in the space-time video super-resolution model for each video frame group in the video frame sequence to be processed, and determining a first feature map sequence;
the first alignment sequence determining unit is used for inputting the first feature map sequence into the first video frame alignment module so as to extract space local feature information and time global feature information of the first feature map sequence, aligning the first feature map sequence according to the space local feature information and the time global feature information and determining the first alignment feature map sequence;
The second sequence determining unit is used for inputting the first feature map sequence and the first alignment feature map sequence to the feature fusion module to generate a second feature map sequence;
a second alignment sequence determining unit, configured to input a second feature map sequence to the second video frame alignment module, and determine a second alignment feature map sequence;
the super-resolution sequence determining unit is used for inputting the second alignment feature map sequence to the video frame reconstruction module for channel multiplication and pixel rearrangement, and determining a super-resolution video frame sequence corresponding to the video frame group;
the super-resolution result determining unit is used for arranging each super-resolution video frame sequence according to the time sequence and generating a space-time video super-resolution result corresponding to the video frame sequence to be processed.
Further, the first video frame alignment module comprises a first temporal-spatial separation feature extraction sub-module, a first downsampling sub-module, a second temporal-spatial separation feature extraction sub-module, a second downsampling sub-module, a third temporal-spatial separation feature extraction sub-module, a first upsampling sub-module and a second upsampling sub-module; the first time-space separation characteristic extraction submodule, the second time-space separation characteristic extraction submodule and the third time-space separation characteristic extraction submodule have the same structure.
Correspondingly, the first alignment sequence determining unit is specifically configured to:
inputting the first feature map sequence into the first time-space separation feature extraction submodule to carry out separation, extraction and fusion of space-time features, and determining the first time-space feature map sequence;
inputting the first time-space feature map sequence into the first downsampling submodule for downsampling to generate a first scaled feature map sequence;
inputting the first scaled feature map sequence to a second space-time separation feature extraction submodule, and carrying out separation extraction and fusion of space-time features to determine a second space-time feature map sequence;
inputting the second space-time feature map sequence into the second downsampling submodule for downsampling to generate a second scaled feature map sequence;
inputting the second scaled feature map sequence to a third space-time separation feature extraction submodule, and carrying out separation extraction and fusion of space-time features to determine a third space-time feature map sequence;
after the second scaled feature map sequence and the third space-time feature map sequence are subjected to feature addition, they are input into the first up-sampling submodule, and the third scaled feature map sequence is determined;
and after the third scaled feature map sequence and the second space-time feature map sequence are subjected to feature addition, the third scaled feature map sequence and the second space-time feature map sequence are input into a second upsampling submodule, and output is determined to be a first alignment feature map sequence corresponding to the first feature map sequence.
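A minimal sketch of this three-scale alignment flow, assuming PyTorch; the space-time separation feature extraction sub-modules are passed in as a factory because any module mapping a (B, T, C, H, W) tensor to the same shape will do (one possible block is sketched after the sub-module description below), and strided convolutions and transposed convolutions stand in for the down- and up-sampling sub-modules. Spatial sizes are assumed divisible by four.

```python
import torch
import torch.nn as nn

class VideoFrameAlignmentModule(nn.Module):
    """Sketch of the pyramid alignment flow: space-time feature extraction at three
    scales, with feature addition before each up-sampling step."""
    def __init__(self, sts_block_factory, channels: int = 64):
        super().__init__()
        self.sts1 = sts_block_factory(channels)
        self.sts2 = sts_block_factory(channels)
        self.sts3 = sts_block_factory(channels)
        self.down1 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.down2 = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.up1 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)

    def _per_frame(self, layer, x):
        # Apply a 2D layer to every frame of a (B, T, C, H, W) sequence.
        b, t, c, h, w = x.shape
        y = layer(x.view(b * t, c, h, w))
        return y.view(b, t, *y.shape[1:])

    def forward(self, feats):                       # (B, T, C, H, W)
        f1 = self.sts1(feats)                       # first space-time feature maps
        s1 = self._per_frame(self.down1, f1)        # first scaled feature maps
        f2 = self.sts2(s1)                          # second space-time feature maps
        s2 = self._per_frame(self.down2, f2)        # second scaled feature maps
        f3 = self.sts3(s2)                          # third space-time feature maps
        s3 = self._per_frame(self.up1, s2 + f3)     # third scaled feature maps
        aligned = self._per_frame(self.up2, s3 + f2)
        return aligned                              # first alignment feature maps
```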
Correspondingly, the first time-space separation characteristic extraction submodule comprises: the system comprises a two-dimensional convolution layer, a spatial local feature extraction network and a time global feature extraction network.
Inputting the first feature map sequence into the first time-space separation feature extraction submodule to carry out separation, extraction and fusion of space-time features, and determining the first time-space feature map sequence, comprises the following steps:
inputting the first feature map sequence into a two-dimensional convolution layer to generate a first feature matrix set;
inputting the first feature matrix set into a space local feature extraction network to perform dimension mapping on the first feature matrix set to obtain a second feature matrix set, and extracting first space local feature information by performing local self-attention calculation on the second feature matrix set;
inputting the first space local feature information into the time global feature extraction network to remap the first space local feature information to obtain a third feature matrix set, and determining the first time-space feature information by performing global self-attention calculation on the third feature matrix set;
and performing dimension mapping on the first time-space feature information to obtain a first time-space feature map sequence with the same dimensions as the first feature map sequence.
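A minimal sketch of one such space-time separation feature extraction block, assuming PyTorch; the window size, head count and the use of nn.MultiheadAttention for both the local and the global self-attention are assumptions, and spatial sizes are assumed divisible by the window size.

```python
import torch
import torch.nn as nn

class SpaceTimeSeparationBlock(nn.Module):
    """Sketch of separated space/time attention: a 2D convolution, window-based
    local self-attention within each frame, then global self-attention along the
    time axis for each spatial position."""
    def __init__(self, channels: int = 64, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.spatial_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                                   # (B, T, C, H, W)
        b, t, c, h, w = x.shape
        ws = self.window
        y = self.conv(x.view(b * t, c, h, w))               # first feature matrices

        # Spatial local self-attention: tokens are the pixels inside each window.
        y = y.view(b * t, c, h // ws, ws, w // ws, ws)
        y = y.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        y, _ = self.spatial_attn(y, y, y)                   # spatial local features
        y = y.reshape(b * t, h // ws, w // ws, ws, ws, c)
        y = y.permute(0, 5, 1, 3, 2, 4).reshape(b, t, c, h, w)

        # Temporal global self-attention: tokens are the T frames at each position.
        z = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        z, _ = self.temporal_attn(z, z, z)                  # time-space features
        z = z.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)  # back to (B, T, C, H, W)

        # Map back to the dimensions of the input feature map sequence.
        return self.out(z.reshape(b * t, c, h, w)).view(b, t, c, h, w)
```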
Further, the feature fusion module comprises a feature fusion sub-module and a feature map integration sub-module.
Correspondingly, the second sequence determining unit is specifically configured to:
inputting the first alignment feature map sequence to a feature fusion sub-module, and carrying out feature fusion on the first alignment feature map to generate an intermediate feature map sequence; the intermediate feature map sequence comprises three intermediate feature maps;
and inputting the intermediate feature map sequence and the first feature map sequence into a feature map integration sub-module, respectively inserting each intermediate feature map between two adjacent first feature maps in the first feature map sequence, and generating a second feature map sequence.
Further, the step of training the spatio-temporal video super-resolution model by adopting the setting method comprises the following steps:
inputting a video training sample set into an initial space-time video super-resolution model, and determining a reconstruction intermediate result; the video training sample set comprises a first video frame sample set with low frame rate and low resolution and a second video frame sample set with high frame rate and high resolution, which corresponds to the first video frame sample set;
determining a loss function according to the reconstructed intermediate result and the second video frame sample set;
training the initial space-time video super-resolution model based on the loss function until a preset convergence condition is met to obtain the space-time video super-resolution model.
The space-time video super-resolution device provided by the embodiment of the invention can execute the space-time video super-resolution method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 9 is a schematic structural diagram of a spatio-temporal video super-resolution device according to a fourth embodiment of the present invention. The spatio-temporal video super-resolution device 40 may be an electronic device intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 9, the spatiotemporal video super-resolution device 40 includes at least one processor 41, and a memory communicatively connected to the at least one processor 41, such as a Read Only Memory (ROM) 42, a Random Access Memory (RAM) 43, etc., in which a computer program executable by the at least one processor is stored, and the processor 41 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 42 or the computer program loaded from the storage unit 48 into the Random Access Memory (RAM) 43. In the RAM 43, various programs and data required for the operation of the spatiotemporal video super-resolution device 40 can also be stored. The processor 41, the ROM 42 and the RAM 43 are connected to each other via a bus 44. An input/output (I/O) interface 45 is also connected to bus 44.
A plurality of components in the spatiotemporal video super resolution device 40 are connected to the I/O interface 45, including: an input unit 46 such as a keyboard, a mouse, etc.; an output unit 47 such as various types of displays, speakers, and the like; a storage unit 48 such as a magnetic disk, an optical disk, or the like; and a communication unit 49 such as a network card, modem, wireless communication transceiver, etc. The communication unit 49 allows the spatio-temporal video super-resolution device 40 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The processor 41 may be various general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 41 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 41 performs the various methods and processes described above, such as the spatio-temporal video super-resolution method.
In some embodiments, the spatiotemporal video super resolution method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the spatiotemporal video super resolution device 40 via the ROM 42 and/or the communication unit 49. When the computer program is loaded into RAM 43 and executed by processor 41, one or more steps of the spatio-temporal video super-resolution method described above may be performed. Alternatively, in other embodiments, the processor 41 may be configured to perform the spatio-temporal video super-resolution method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be noted that, in the above embodiment of the apparatus, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding function can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A spatio-temporal video super-resolution method, comprising:
Acquiring a video frame sequence to be processed; the video frame sequence to be processed comprises at least one video frame group, wherein the video frame group is four continuous video frames;
inputting the video frame sequence to be processed into a preset space-time video super-resolution model, and determining a space-time video super-resolution result corresponding to the video frame sequence to be processed according to the output generation result;
the space-time video super-resolution model is a neural network model trained by a set training method; the space-time video super-resolution model at least comprises two video frame alignment modules, wherein the video frame alignment modules are used for extracting space local feature information and time global feature information of an input video frame sequence and aligning the input video frame sequence according to the space local feature information and the time global feature information;
the space-time video super-resolution model comprises a feature extraction module, a first video frame alignment module, a feature fusion module, a second video frame alignment module and a video frame reconstruction module; the first video frame alignment module and the second video frame alignment module have the same structure;
The step of inputting the video frame sequence to be processed into a preset space-time video super-resolution model, and determining a space-time video super-resolution result corresponding to the video frame sequence to be processed according to the output generation result, wherein the step of determining the space-time video super-resolution result comprises the following steps:
inputting the video frame groups into the feature extraction module in the space-time video super-resolution model for each video frame group in the video frame sequence to be processed, and determining a first feature map sequence;
inputting the first feature map sequence to the first video frame alignment module to extract spatial local feature information and temporal global feature information of the first feature map sequence, aligning the first feature map sequence according to the spatial local feature information and the temporal global feature information, and determining a first aligned feature map sequence;
inputting the first feature map sequence and the first alignment feature map sequence to the feature fusion module to generate a second feature map sequence;
inputting the second feature map sequence to the second video frame alignment module to determine a second alignment feature map sequence;
inputting the second alignment feature map sequence to the video frame reconstruction module for channel multiplication and pixel rearrangement, and determining a super-resolution video frame sequence corresponding to the video frame group;
And arranging the super-resolution video frame sequences according to time sequence to generate a space-time video super-resolution result corresponding to the video frame sequences to be processed.
2. The method of claim 1, wherein the acquiring a sequence of video frames to be processed comprises:
obtaining video data to be processed, dividing the video data to be processed through a preset sliding window, and determining at least one video frame group; the window size of the preset sliding window is four frames, and the sliding step length is three frames;
and sequencing the video frame groups according to the time sequence to determine a video frame sequence to be processed.
3. The method of claim 1, wherein the first video frame alignment module comprises a first temporal-spatial separation feature extraction sub-module, a first downsampling sub-module, a second temporal-spatial separation feature extraction sub-module, a second downsampling sub-module, a third temporal-spatial separation feature extraction sub-module, a first upsampling sub-module, and a second upsampling sub-module; the first time-space separation characteristic extraction submodule, the second time-space separation characteristic extraction submodule and the third time-space separation characteristic extraction submodule have the same structure;
The inputting the first feature map sequence to the first video frame alignment module to extract first spatial local feature information and first temporal global feature information from the first feature map sequence, and aligning the first feature map sequence according to the first spatial local feature information and the first temporal global feature information, to determine a first aligned feature map sequence, including:
inputting the first feature map sequence to the first time-space separation feature extraction submodule, and carrying out separation, extraction and fusion of space-time features to determine the first time-space feature map sequence;
inputting the first time-space feature map sequence to the first downsampling submodule for downsampling to generate a first scaled feature map sequence;
inputting the first scaled feature map sequence to the second space-time separation feature extraction submodule, and carrying out separation extraction and fusion of space-time features to determine a second space-time feature map sequence;
inputting the second space-time feature map sequence to the second downsampling submodule for downsampling to generate a second scaled feature map sequence;
inputting the second scaled feature map sequence to the third space-time separation feature extraction submodule, and carrying out separation extraction and fusion of space-time features to determine a third space-time feature map sequence;
After the second scaled feature map sequence and the third space-time feature map sequence are subjected to feature addition, the second scaled feature map sequence and the third space-time feature map sequence are input into the first up-sampling submodule, and a third scaled feature map sequence is determined;
and after the third scaled feature map sequence and the second space-time feature map sequence are subjected to feature addition, inputting the third scaled feature map sequence and the second space-time feature map sequence into the second upsampling submodule, and determining output as a first aligned feature map sequence corresponding to the first feature map sequence.
4. A method according to claim 3, wherein the first time-space separation feature extraction submodule comprises: a two-dimensional convolution layer, a spatial local feature extraction network and a time global feature extraction network;
the step of inputting the first feature map sequence to the first time-space separation feature extraction submodule to perform separation extraction and fusion of space-time features, and determining the first time-space feature map sequence comprises the following steps:
inputting the first feature map sequence to the two-dimensional convolution layer to generate a first feature matrix set;
inputting the first feature matrix set into the spatial local feature extraction network to dimension map the first feature matrix set to obtain a second feature matrix set, and extracting first spatial local feature information by carrying out local self-attention calculation on the second feature matrix set;
Inputting the first spatial local feature information into the time global feature extraction network to remap the first spatial local feature information to obtain a third feature matrix set, and determining first time space feature information by performing global self-attention calculation on the third feature matrix set;
and performing dimension mapping on the first time-space feature information to obtain a first time-space feature map sequence with the same dimensions as the first feature map sequence.
5. The method of claim 1, wherein the feature fusion module comprises a feature fusion sub-module and a feature map integration sub-module;
the inputting the first feature map sequence and the first alignment feature map sequence to the feature fusion module, generating a second feature map sequence, includes:
inputting the first alignment feature map sequence to the feature fusion submodule, and carrying out feature fusion on the first alignment feature map to generate an intermediate feature map sequence; the intermediate feature map sequence comprises three intermediate feature maps;
and inputting the intermediate feature map sequence and the first feature map sequence into the feature map integration submodule, respectively inserting each intermediate feature map between two adjacent first feature maps in the first feature map sequence, and generating a second feature map sequence.
6. The method of claim 1, wherein training the spatio-temporal video super-resolution model using a set-up method comprises:
inputting a video training sample set into an initial space-time video super-resolution model, and determining a reconstruction intermediate result; wherein the video training sample set comprises a first video frame sample set with low frame rate and low resolution, and a second video frame sample set with high frame rate and high resolution corresponding to the first video frame sample set;
determining a loss function according to the reconstruction intermediate result and the second video frame sample set;
training the initial space-time video super-resolution model based on the loss function until a preset convergence condition is met to obtain the space-time video super-resolution model.
7. A spatio-temporal video super-resolution apparatus, comprising:
the video frame acquisition module is used for acquiring a video frame sequence to be processed; the video frame sequence to be processed comprises at least one video frame group, wherein the video frame group is four continuous video frames;
the super-resolution result generation module is used for inputting the video frame sequence to be processed into a preset space-time video super-resolution model, and determining a space-time video super-resolution result corresponding to the video frame sequence to be processed according to the output generation result;
The space-time video super-resolution model is a neural network model trained by a set training method; the space-time video super-resolution model at least comprises two video frame alignment modules, wherein the video frame alignment modules are used for extracting space local feature information and time global feature information of an input video frame sequence and aligning the input video frame sequence according to the space local feature information and the time global feature information;
the space-time video super-resolution model comprises a feature extraction module, a first video frame alignment module, a feature fusion module, a second video frame alignment module and a video frame reconstruction module; the first video frame alignment module and the second video frame alignment module have the same structure;
the super-resolution result generation module comprises:
a first sequence determining unit, configured to, for each video frame group in the video frame sequence to be processed, input the video frame group to the feature extraction module in the spatio-temporal video super-resolution model, and determine a first feature map sequence;
the first alignment sequence determining unit is used for inputting the first feature map sequence into the first video frame alignment module, extracting spatial local feature information and temporal global feature information of the first feature map sequence, aligning the first feature map sequence according to the spatial local feature information and the temporal global feature information, and determining a first alignment feature map sequence;
The second sequence determining unit is used for inputting the first feature map sequence and the first alignment feature map sequence to the feature fusion module to generate a second feature map sequence;
a second alignment sequence determining unit, configured to input the second feature map sequence to the second video frame alignment module, and determine a second alignment feature map sequence;
the super-resolution sequence determining unit is used for inputting the second alignment feature map sequence to the video frame reconstruction module for channel multiplication and pixel rearrangement, and determining a super-resolution video frame sequence corresponding to the video frame group;
the super-resolution result determining unit is used for arranging the super-resolution video frame sequences according to time sequence to generate a space-time video super-resolution result corresponding to the video frame sequences to be processed.
8. A spatio-temporal video super-resolution apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the spatio-temporal video super resolution method of any of claims 1-6.
9. A computer readable storage medium storing computer instructions for causing a processor to implement the spatio-temporal video super resolution method of any of claims 1-6 when executed.
CN202310092691.1A 2023-01-17 2023-01-17 Space-time video super-resolution method, device, equipment and storage medium Active CN116012230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310092691.1A CN116012230B (en) 2023-01-17 2023-01-17 Space-time video super-resolution method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310092691.1A CN116012230B (en) 2023-01-17 2023-01-17 Space-time video super-resolution method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116012230A (en) 2023-04-25
CN116012230B (en) 2023-09-29

Family

ID=86023166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310092691.1A Active CN116012230B (en) 2023-01-17 2023-01-17 Space-time video super-resolution method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116012230B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003060823A2 (en) * 2001-12-26 2003-07-24 Yeda Research And Development Co.Ltd. A system and method for increasing space or time resolution in video
CN112801877B (en) * 2021-02-08 2022-08-16 南京邮电大学 Super-resolution reconstruction method of video frame

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112070664A (en) * 2020-07-31 2020-12-11 华为技术有限公司 Image processing method and device
CN112862675A (en) * 2020-12-29 2021-05-28 成都东方天呈智能科技有限公司 Video enhancement method and system for space-time super-resolution
CN113947531A (en) * 2021-10-29 2022-01-18 重庆邮电大学 Iterative collaborative video super-resolution reconstruction method and system
CN113902623A (en) * 2021-11-22 2022-01-07 天津大学 Method for super-resolution of arbitrary-magnification video by introducing scale information
CN114529456A (en) * 2022-02-21 2022-05-24 深圳大学 Super-resolution processing method, device, equipment and medium for video
CN114692765A (en) * 2022-03-31 2022-07-01 武汉大学 Video spatio-temporal hyper-resolution model construction method, device, equipment and readable storage medium
CN114757828A (en) * 2022-04-02 2022-07-15 华南理工大学 Transformer-based video space-time super-resolution method
CN114827616A (en) * 2022-04-28 2022-07-29 电子科技大学 Compressed video quality enhancement method based on space-time information balance
CN115496663A (en) * 2022-10-12 2022-12-20 南京信息工程大学 Video super-resolution reconstruction method based on D3D convolution intra-group fusion network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video super-resolution reconstruction algorithm based on spatio-temporal correlation; Li Jinhang; Xiao Liang; Wei Zhihui; Computer and Digital Engineering (04); full text *

Also Published As

Publication number Publication date
CN116012230A (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN107274347A (en) A kind of video super-resolution method for reconstructing based on depth residual error network
CN110782490A (en) Video depth map estimation method and device with space-time consistency
CN110751649B (en) Video quality evaluation method and device, electronic equipment and storage medium
CN113994366A (en) Multi-stage multi-reference bootstrapping for video super-resolution
CN112040222B (en) Visual saliency prediction method and equipment
CN111402130A (en) Data processing method and data processing device
CN112861830B (en) Feature extraction method, device, apparatus, storage medium, and program product
WO2023174098A1 (en) Real-time gesture detection method and apparatus
WO2023036157A1 (en) Self-supervised spatiotemporal representation learning by exploring video continuity
WO2024002211A1 (en) Image processing method and related apparatus
CN109949217A (en) Video super-resolution method for reconstructing based on residual error study and implicit motion compensation
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN114202648A (en) Text image correction method, training method, device, electronic device and medium
CN116012230B (en) Space-time video super-resolution method, device, equipment and storage medium
CN114119371B (en) Video super-resolution model training method and device and video super-resolution processing method and device
CN115953468A (en) Method, device and equipment for estimating depth and self-movement track and storage medium
CN116862762A (en) Video superdivision method, device, equipment and storage medium
CN115082306A (en) Image super-resolution method based on blueprint separable residual error network
CN115115972A (en) Video processing method, video processing apparatus, computer device, medium, and program product
CN114066841A (en) Sky detection method and device, computer equipment and storage medium
CN113610856A (en) Method and device for training image segmentation model and image segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant