CN115861078B - Video enhancement method and system based on bidirectional space-time recursion propagation neural network

Info

Publication number: CN115861078B
Application number: CN202310145957.4A
Authority: CN (China)
Prior art keywords: layer, propagation, feature extraction, convolution, bidirectional
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN115861078A
Inventors: 李杰, 周仁爽, 陈尧森, 杨月欣
Current assignee: Chengdu Sobey Digital Technology Co Ltd
Original assignee: Chengdu Sobey Digital Technology Co Ltd
Application filed by Chengdu Sobey Digital Technology Co Ltd; priority to CN202310145957.4A
Publication of application CN115861078A; application granted and published as CN115861078B

Abstract

The invention relates to the technical field of video data enhancement, and discloses a video enhancement method and system based on a bidirectional space-time recursion propagation neural network. The invention addresses the difficulty in the prior art of achieving real-time performance, video enhancement, and resolution improvement at the same time.

Description

Video enhancement method and system based on bidirectional space-time recursion propagation neural network
Technical Field
The invention relates to the technical field of video data enhancement, and in particular to a video enhancement method and system based on a bidirectional space-time recursion propagation neural network.
Background
Super-resolution is a technique for recovering a high-resolution image from a low-resolution one and improving image quality, so as to obtain clearer videos and pictures. With the vigorous development of deep learning, video and picture super-resolution has made great breakthroughs.
However, among existing neural-network-based super-resolution algorithms, some rely on networks published by the academic community and trained specifically on public super-resolution data sets, so their effect on improving the resolution of videos or pictures in real or specific application scenarios is limited; others achieve good resolution improvement and image enhancement both on public data sets and on videos or pictures from real application scenarios, but at high time cost and with poor real-time performance.
To better improve video enhancement and resolution for a specific video type while also improving the real-time performance of a video super-resolution model, a super-resolution method that satisfies all of these conditions is needed.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a video enhancement method and system based on a bidirectional space-time recursion propagation neural network, which address the difficulty in the prior art of achieving real-time performance, video enhancement, and resolution improvement at the same time.
The invention solves these problems by adopting the following technical scheme:
A video enhancement method based on a bidirectional space-time recursion propagation neural network: construct the bidirectional space-time recursion propagation neural network, prepare the collected videos to be enhanced as a data set, and train on the data set to obtain a trained bidirectional space-time recursion propagation neural network for video super-resolution.
As a preferred technical scheme, the method comprises the following steps:
S1, optical flow calculation and feature extraction: perform inter-frame alignment and feature extraction on the input video frames;
this specifically comprises the following steps:
S1A, optical flow calculation: perform inter-frame alignment on the input video frames, then go to step S2;
S1B, feature extraction: extract the features of the input video frames, then go to step S2;
S2, bidirectional recursive propagation: feed the optical-flow-aligned video frames and the feature-extracted video frames into the bidirectional space-time recursion propagation neural network for processing, then go to step S3;
S3, upsampling: up-sample the output data through the backbone network, and then aggregate the up-sampled data with the input video frames to obtain super-resolved video frames.
As a preferred technical solution, step S2 includes the following steps:
S21, sequentially perform first backward propagation, first forward propagation, second backward propagation, and second forward propagation on the optical-flow-aligned video frames; and sequentially perform first backward propagation, first forward propagation, second backward propagation, and second forward propagation on the feature-extracted video frames;
S22, sequentially perform second backward propagation and second forward propagation on the optical-flow-aligned video frames; and sequentially perform second backward propagation and second forward propagation on the feature-extracted video frames.
As a preferred technical scheme, the video to be enhanced is a broadcast television standard-definition video or a broadcast television high-definition video.
A video enhancement system based on a bidirectional space-time recursion propagation neural network, used to implement the above video enhancement method, comprises the following modules:
optical flow calculation module: performs inter-frame alignment on the input video frames and feeds the result to the bidirectional recursive propagation module;
feature extraction module: extracts features from the input video frames and feeds the extracted features to the bidirectional recursive propagation module;
bidirectional recursive propagation module: feeds the optical-flow-aligned video frames and the feature-extracted video frames into the bidirectional space-time recursion propagation neural network for processing, then passes the result to the up-sampling module;
up-sampling module: up-samples the output data through the backbone network, then aggregates the up-sampled data with the input video frames to obtain super-resolved video frames.
As a preferred technical scheme, the feature extraction module comprises a first feature extraction layer and a second feature extraction layer; the first feature extraction layer, the second feature extraction layer, and the bidirectional recursive propagation module are connected in sequence along the video frame transmission direction; the first feature extraction layer receives the input video frames and is also directly connected to the bidirectional recursive propagation module.
As a preferred technical scheme, the first feature extraction layer comprises the following layers connected in sequence: a convolution layer, a LeakyReLU layer, 5 residual blocks, and the second feature extraction layer. The convolution layer has a 3×3 kernel and 32 output channels; each residual block comprises a first residual convolution layer, a second residual convolution layer, and a ReLU layer connected in sequence, where the first and second residual convolution layers each have 32 input and 32 output channels. The second feature extraction layer comprises a convolution layer and a LeakyReLU layer connected in sequence; this convolution layer has a 3×3 kernel, 32 input and 32 output channels, and a stride of 2.
As a preferred technical scheme, the bidirectional recursive propagation module comprises a first backward propagation layer, a first forward propagation layer, a second backward propagation layer, and a second forward propagation layer connected in sequence; the first backward propagation layer is connected to the optical flow calculation module and the second feature extraction layer, the second backward propagation layer is connected to the optical flow calculation module and the first feature extraction layer, and the second forward propagation layer is connected to the up-sampling module.
As a preferred technical scheme, the first backward propagation layer and the second backward propagation layer each comprise a convolution layer, a LeakyReLU layer, and 20 residual blocks connected in sequence; the convolution layer has a 3×3 kernel and 32 output channels, and each residual block comprises a first residual convolution layer, a second residual convolution layer, and a ReLU layer connected in sequence, where the first and second residual convolution layers each have 32 input and 32 output channels. The convolution layer of the first backward propagation layer has 64 input channels and that of the second backward propagation layer has 128 input channels. The first forward propagation layer and the second forward propagation layer likewise each comprise a convolution layer, a LeakyReLU layer, and 20 residual blocks connected in sequence, with the same 3×3 kernels, 32 output channels, and residual block structure. The convolution layer of the first forward propagation layer has 96 input channels and that of the second forward propagation layer has 160 input channels.
As a preferred technical scheme, the up-sampling module comprises a first up-sampling convolution layer, a first PixelShuffle layer, a second up-sampling convolution layer, and a second PixelShuffle layer connected in sequence. The first up-sampling convolution layer has a 3×3 kernel, 32 input channels, and 128 output channels; the second up-sampling convolution layer has a 3×3 kernel, 32 input channels, and 12 output channels.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention solves the problem that video super-resolution algorithms perform poorly on broadcast television high-definition and standard-definition material: through targeted training of the deep-learning neural network on such material, it overcomes the poor super-resolution effect that most video super-resolution algorithms exhibit in this business scenario;
(2) Through its dedicated network design, the invention improves both the super-resolution effect and the real-time performance of the algorithm, greatly reducing the algorithm's time overhead so that it performs better in real business scenarios.
Drawings
FIG. 1 is a schematic diagram of a video enhancement system based on a bi-directional spatio-temporal recursive propagation neural network according to the present invention;
FIG. 2 is a schematic diagram of the structure of an upsampling layer in a video enhancement system based on a bi-directional spatio-temporal recursive propagation neural network according to the present invention;
FIG. 3 is a schematic diagram of a residual block in a video enhancement system based on a bi-directional spatio-temporal recursive propagation neural network according to the present invention;
fig. 4 is a flow chart of a video enhancement method based on a bidirectional spatio-temporal recursive propagation neural network according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Example 1
As shown in fig. 1 to 4, the focus of the present invention is the neural network structure. The main structure of the video enhancement system based on the bidirectional space-time recursion propagation neural network is shown in fig. 1; it comprises the following modules:
optical flow calculation module: performs inter-frame alignment on the input video frames and feeds the result to the bidirectional recursive propagation module;
feature extraction module: extracts features from the input video frames and feeds the extracted features to the bidirectional recursive propagation module;
bidirectional recursive propagation module: feeds the optical-flow-aligned video frames and the feature-extracted video frames into the bidirectional space-time recursion propagation neural network for processing, then passes the result to the up-sampling module;
up-sampling module: up-samples the output data through the backbone network, then aggregates the up-sampled data with the input video frames to obtain super-resolved video frames.
The specific steps in use are as follows:
the optical flow calculation method comprises the steps that the network input is the low-resolution video to be super-divided, and the resolution of the video is improved after system reasoning, so that the super-division effect is achieved. In order to more effectively utilize the information between adjacent frames of the video, the invention adopts an optical flow based method to perform inter-frame alignment, firstly, an optical flow module for calculating the optical flow is used for calculating the optical flow result for the input video, finally, an optical flow for forward propagation and an optical flow for backward propagation are calculated, and the optical flow for backward propagation are twice sampled, and the total is four outputs.
Second, feature extraction. A reshape operation is applied once to the input video frames, converting the original five-dimensional input into four dimensions, and features are extracted from the four-dimensional tensor. A feature extraction layer with residual connections extracts features from the input; it consists of a convolution layer with 32 output channels and a 3×3 kernel, a LeakyReLU layer, and five residual blocks, where each residual block consists of two convolution layers with 32 input and output channels and 3×3 kernels plus a ReLU layer. At the same time, a feature extraction layer consisting of a convolution layer with 32 input and output channels and stride 2 plus a LeakyReLU layer performs two-fold downsampling and extracts the downsampled features.
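A hedged PyTorch sketch of this feature extraction stage follows; the layer sizes follow the text, while the module names, the RGB (3-channel) input, and the LeakyReLU slope of 0.1 are our assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convs (32 in/out channels) with a ReLU, plus a skip connection."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))

class FeatureExtractor(nn.Module):
    """First layer: 3x3 conv (3->32) + LeakyReLU + 5 residual blocks.
    Second layer: stride-2 3x3 conv (32->32) + LeakyReLU (2x downsampling)."""
    def __init__(self):
        super().__init__()
        self.first = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(0.1, inplace=True),
            *[ResidualBlock(32) for _ in range(5)])
        self.second = nn.Sequential(
            nn.Conv2d(32, 32, 3, stride=2, padding=1),
            nn.LeakyReLU(0.1, inplace=True))

    def forward(self, frames):               # frames: (N, T, 3, H, W)
        n, t, c, h, w = frames.shape
        x = frames.reshape(n * t, c, h, w)   # the 5-D -> 4-D reshape from the text
        feat = self.first(x)                 # full-resolution features
        feat_down = self.second(feat)        # 2x-downsampled features
        return feat, feat_down
```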
Third, the backbone network. The backbone adopts a bidirectional recursive propagation neural network: a network comprising two recursions of forward and backward propagation forms the backbone of the invention.
Four networks are constructed to complete these two recursions. Each consists of a convolution layer with 32 output channels and a 3×3 kernel, a LeakyReLU layer, and 20 residual blocks identical to those of the feature extraction layer; they differ only in the input channels of the first convolution: 64 for the first backward propagation network, 96 for the first forward propagation network, 128 for the second backward propagation network, and 160 for the second forward propagation network.
The inputs of the first propagation round (comprising a forward propagation network and a backward propagation network) are the two-fold downsampled features extracted in the second step and the two-fold downsampled forward and backward optical flows from the first step.
The inputs of the second propagation round (comprising a forward propagation network and a backward propagation network) are the full-resolution features from the second step, the optical flows from the first step, and the first forward and backward propagation outputs of the previous round.
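Below is a sketch of one propagation branch, reusing the `ResidualBlock` class from the feature extraction sketch above. It assumes that, at each time step, the current features, the flow-warped hidden state, and any earlier propagation outputs are concatenated channel-wise before the first convolution; that would account for the growing input-channel counts (64, 96, 128, 160), since each round concatenates more 32-channel maps, but the exact wiring is not spelled out in the text.

```python
import torch
import torch.nn as nn

class PropagationBranch(nn.Module):
    """One of the four propagation networks. in_channels: 64 (first backward),
    96 (first forward), 128 (second backward) or 160 (second forward)."""
    def __init__(self, in_channels: int, num_blocks: int = 20):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels, 32, 3, padding=1)
        self.act = nn.LeakyReLU(0.1, inplace=True)   # slope 0.1 is our assumption
        self.blocks = nn.Sequential(
            *[ResidualBlock(32) for _ in range(num_blocks)])

    def forward(self, *feature_maps):
        # Concatenate the current features, the warped hidden state and any
        # earlier propagation outputs along the channel axis, then fuse to 32.
        x = torch.cat(feature_maps, dim=1)
        return self.blocks(self.act(self.fuse(x)))

# The four branches, with the input-channel counts given in the text:
backward1 = PropagationBranch(64)
forward1 = PropagationBranch(96)
backward2 = PropagationBranch(128)
forward2 = PropagationBranch(160)
```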
Upsampling layer: the input that initially enters the system and the output of the backbone network serve as the inputs of the upsampling layer. Two convolution-plus-upsampling operations are applied in sequence to the backbone output: the first convolution has 128 output channels and its output is upsampled by a PixelShuffle layer; the second convolution has 12 output channels and its output is again upsampled by a PixelShuffle layer; both convolutions have 32 input channels. The result is then aggregated with the input that initially entered the system, completing the upsampling of the upsampling-layer output. The final output is a group of super-resolved video frames.
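A sketch of this layer follows. With a PixelShuffle upscale factor of 2, the first conv-shuffle pair maps 32 channels to 128 and back to 32 (128 / 2² = 32) at twice the resolution, and the second maps 32 to 12 and then to 3 (12 / 2² = 3) at four times the resolution. The intermediate activation and the bilinear upsampling of the input frame as the aggregation skip are our assumptions; the text only says the output is "aggregated" with the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsampler(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(32, 128, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 12, 3, padding=1)
        self.shuffle = nn.PixelShuffle(2)
        self.act = nn.LeakyReLU(0.1, inplace=True)  # placement is our assumption

    def forward(self, feat, lr_frame):    # feat: (N,32,H,W); lr_frame: (N,3,H,W)
        x = self.act(self.shuffle(self.conv1(feat)))  # (N, 32, 2H, 2W)
        x = self.shuffle(self.conv2(x))               # (N, 3, 4H, 4W)
        base = F.interpolate(lr_frame, scale_factor=4, mode='bilinear',
                             align_corners=False)     # upsampled input frame
        return x + base                               # aggregate with the input
```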
The bidirectional space-time recursion network is trained on a data set built from broadcast television standard-definition material, yielding pre-trained network weights targeted at such material. In the inference stage these weights are loaded into the system, the low-resolution video to be super-resolved is used as the network input, and network inference produces the final video super-resolution result, improving the resolution of the video and enhancing its quality.
As shown in fig. 1 to 4, fig. 1 illustrates the structure of the system of the present invention, fig. 2 illustrates the upsampling layer structure within that system, fig. 3 illustrates the residual block structure, and fig. 4 illustrates the implementation flow of the present invention.
The detailed steps for super-resolving a video are as follows:
as shown in fig. 1 to 4, the spatio-temporal bidirectional recursive propagation network structure of the present invention is: an independent optical flow calculation module, a feature extraction module belonging to a spatio-temporal bidirectional recursion propagation network (including a feature extraction module that does not perform downsampling and a feature extraction module that performs twice downsampling), a backbone network of the spatio-temporal bidirectional recursion propagation network (a recursive network of two forward and backward propagation), and an upsampling network. For an optical flow calculation module, some existing super-division algorithms divide a video into independent pictures, and perform super-division operation on the pictures to enhance the image quality, but the effect of enhancement is poor because the operation can cause video distortion after super-division, so that the invention uses an optical flow method to perform motion estimation and acquire the relationship between adjacent frames in the video to improve the super-division effect of the algorithm. The design of the backbone network (two forward propagation and backward propagation recursive network) enables the invention to better utilize the time sequence information of the video through forward propagation and backward propagation, and further can improve the enhancement effect of the system on the video.
As shown in fig. 4, the implementation flow of the present invention is as follows. The low-resolution video to be super-resolved is used as the network input and fed into the system, which must first load the pre-trained weights. First, the optical flow calculation module computes the optical flow of the input for use in the subsequent stage of network inference; second, the input is passed to the feature extraction network; the obtained features and the optical flow are then fed together into the backbone network, which repairs and enhances them; finally, the backbone output is passed to the upsampling network, whose output is the final super-resolution result.
Because the spatio-temporal bidirectional recursive propagation network is end-to-end, when performing resolution and video enhancement on a low-resolution video, the video to be super-resolved can be fed directly into the system, and the video output of the system is the super-resolution result of the invention. Because the system is designed to reduce the number of network parameters and improve inference speed, its real-time performance is greatly improved and can meet the real-time requirements of real business scenarios.
The invention has the following characteristics:
1) The invention solves the problem that video super-resolution algorithms perform poorly on broadcast television high-definition and standard-definition material: through targeted training of the deep-learning neural network on such material, it overcomes the poor super-resolution effect that most video super-resolution algorithms exhibit in this business scenario;
2) Through its dedicated network design, the invention improves both the super-resolution effect and the real-time performance of the algorithm, greatly reducing the algorithm's time overhead so that it performs better in real business scenarios.
As described above, the present invention can be preferably implemented.
Except for mutually exclusive features and/or steps, all features disclosed in all embodiments of this specification, and all steps in any implicitly disclosed method or process, may be combined, expanded, and substituted in any way.
The foregoing description of the preferred embodiment of the invention is not intended to limit the invention in any way, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the invention.

Claims (6)

1. A video enhancement method based on a bidirectional space-time recursion propagation neural network, characterized in that: the bidirectional space-time recursion propagation neural network is constructed, the collected videos to be enhanced are prepared as a data set, and the data set is used to train the network, obtaining a trained bidirectional space-time recursion propagation neural network for video super-resolution;
the method comprises the following steps:
S1, optical flow calculation and feature extraction: perform inter-frame alignment and feature extraction on the input video frames;
this specifically comprises the following steps:
S1A, optical flow calculation: perform inter-frame alignment on the input video frames, then go to step S2;
S1B, feature extraction: extract the features of the input video frames, then go to step S2;
S2, bidirectional recursive propagation: feed the optical-flow-aligned video frames and the feature-extracted video frames into the bidirectional space-time recursion propagation neural network for processing, then go to step S3;
S3, upsampling: up-sample the output data through the backbone network, and then aggregate the up-sampled data with the input video frames to obtain super-resolved video frames;
step S1A is realized by an optical flow calculation module, step S1B by a feature extraction module, step S2 by a bidirectional recursive propagation module, and step S3 by an up-sampling module;
the feature extraction module comprises a first feature extraction layer and a second feature extraction layer; the first feature extraction layer, the second feature extraction layer, and the bidirectional recursive propagation module are connected in sequence along the video frame transmission direction; the first feature extraction layer receives the input video frames and is also directly connected to the bidirectional recursive propagation module;
the bidirectional recursive propagation module comprises a first backward propagation layer, a first forward propagation layer, a second backward propagation layer, and a second forward propagation layer connected in sequence; the first backward propagation layer is connected to the optical flow calculation module and the second feature extraction layer, the second backward propagation layer is connected to the optical flow calculation module and the first feature extraction layer, and the second forward propagation layer is connected to the up-sampling module;
the first backward propagation layer and the second backward propagation layer each comprise a convolution layer, a LeakyReLU layer, and 20 residual blocks connected in sequence; the convolution layer has a 3×3 kernel and 32 output channels, and each residual block comprises a first residual convolution layer, a second residual convolution layer, and a ReLU layer connected in sequence, where the first and second residual convolution layers each have 32 input and 32 output channels; the convolution layer of the first backward propagation layer has 64 input channels and that of the second backward propagation layer has 128 input channels; the first forward propagation layer and the second forward propagation layer likewise each comprise a convolution layer, a LeakyReLU layer, and 20 residual blocks connected in sequence, with the same 3×3 kernels, 32 output channels, and residual block structure; the convolution layer of the first forward propagation layer has 96 input channels and that of the second forward propagation layer has 160 input channels.
2. The video enhancement method based on a bidirectional space-time recursion propagation neural network according to claim 1, wherein step S2 comprises the following steps:
S21, sequentially perform first backward propagation, first forward propagation, second backward propagation, and second forward propagation on the optical-flow-aligned video frames; and sequentially perform first backward propagation, first forward propagation, second backward propagation, and second forward propagation on the feature-extracted video frames;
S22, sequentially perform second backward propagation and second forward propagation on the optical-flow-aligned video frames; and sequentially perform second backward propagation and second forward propagation on the feature-extracted video frames.
3. The video enhancement method based on a bidirectional space-time recursion propagation neural network according to claim 1 or 2, wherein the video to be enhanced is a broadcast television standard-definition video or a broadcast television high-definition video.
4. A video enhancement system based on a bidirectional space-time recursion propagation neural network, for implementing the video enhancement method based on a bidirectional space-time recursion propagation neural network according to any one of claims 1 to 3, comprising the following modules:
optical flow calculation module: performs inter-frame alignment on the input video frames and feeds the result to the bidirectional recursive propagation module;
feature extraction module: extracts features from the input video frames and feeds the extracted features to the bidirectional recursive propagation module;
bidirectional recursive propagation module: feeds the optical-flow-aligned video frames and the feature-extracted video frames into the bidirectional space-time recursion propagation neural network for processing, then passes the result to the up-sampling module;
up-sampling module: up-samples the output data through the backbone network, then aggregates the up-sampled data with the input video frames to obtain super-resolved video frames;
the feature extraction module comprises a first feature extraction layer and a second feature extraction layer; the first feature extraction layer, the second feature extraction layer, and the bidirectional recursive propagation module are connected in sequence along the video frame transmission direction; the first feature extraction layer receives the input video frames and is also directly connected to the bidirectional recursive propagation module;
the bidirectional recursive propagation module comprises a first backward propagation layer, a first forward propagation layer, a second backward propagation layer, and a second forward propagation layer connected in sequence; the first backward propagation layer is connected to the optical flow calculation module and the second feature extraction layer, the second backward propagation layer is connected to the optical flow calculation module and the first feature extraction layer, and the second forward propagation layer is connected to the up-sampling module;
the first backward propagation layer and the second backward propagation layer each comprise a convolution layer, a LeakyReLU layer, and 20 residual blocks connected in sequence; the convolution layer has a 3×3 kernel and 32 output channels, and each residual block comprises a first residual convolution layer, a second residual convolution layer, and a ReLU layer connected in sequence, where the first and second residual convolution layers each have 32 input and 32 output channels; the convolution layer of the first backward propagation layer has 64 input channels and that of the second backward propagation layer has 128 input channels; the first forward propagation layer and the second forward propagation layer likewise each comprise a convolution layer, a LeakyReLU layer, and 20 residual blocks connected in sequence, with the same 3×3 kernels, 32 output channels, and residual block structure; the convolution layer of the first forward propagation layer has 96 input channels and that of the second forward propagation layer has 160 input channels.
5. The video enhancement system based on a bidirectional space-time recursion propagation neural network according to claim 4, wherein the first feature extraction layer comprises the following layers connected in sequence: a convolution layer, a LeakyReLU layer, 5 residual blocks, and the second feature extraction layer; the convolution layer has a 3×3 kernel and 32 output channels; each residual block comprises a first residual convolution layer, a second residual convolution layer, and a ReLU layer connected in sequence, where the first and second residual convolution layers each have 32 input and 32 output channels; the second feature extraction layer comprises a convolution layer and a LeakyReLU layer connected in sequence, where the convolution layer has a 3×3 kernel, 32 input and 32 output channels, and a stride of 2.
6. The video enhancement system based on a bidirectional space-time recursion propagation neural network according to claim 5, wherein the up-sampling module comprises a first up-sampling convolution layer, a first PixelShuffle layer, a second up-sampling convolution layer, and a second PixelShuffle layer connected in sequence; the first up-sampling convolution layer has a 3×3 kernel, 32 input channels, and 128 output channels; the second up-sampling convolution layer has a 3×3 kernel, 32 input channels, and 12 output channels.
CN202310145957.4A (filed 2023-02-22, priority 2023-02-22): Video enhancement method and system based on bidirectional space-time recursion propagation neural network. Granted as CN115861078B; legal status: Active.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310145957.4A | 2023-02-22 | 2023-02-22 | Video enhancement method and system based on bidirectional space-time recursion propagation neural network (granted as CN115861078B)

Publications (2)

Publication Number | Publication Date
CN115861078A (en) | 2023-03-28
CN115861078B (en) | 2023-05-12

Family

Family ID: 85658622

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310145957.4A (Active, granted as CN115861078B) | Video enhancement method and system based on bidirectional space-time recursion propagation neural network | 2023-02-22 | 2023-02-22

Country Status (1)

Country Link
CN (1) CN115861078B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090060035A1 (en) * 2007-08-28 2009-03-05 Freescale Semiconductor, Inc. Temporal scalability for low delay scalable video coding

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8279341B1 (en) * 2007-02-26 2012-10-02 MotionDSP, Inc. Enhancing the resolution and quality of sequential digital images
CN111028150A (en) * 2019-11-28 2020-04-17 武汉大学 Rapid space-time residual attention video super-resolution reconstruction method
CN112712537A (en) * 2020-12-21 2021-04-27 深圳大学 Video space-time super-resolution implementation method and device
CN114494023A (en) * 2022-04-06 2022-05-13 电子科技大学 Video super-resolution implementation method based on motion compensation and sparse enhancement
CN114710673A (en) * 2022-04-06 2022-07-05 北京广播电视台 Image up-sampling method for SUVC coding and decoding
CN114926336A (en) * 2022-05-20 2022-08-19 深圳软牛科技有限公司 Video super-resolution reconstruction method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Temporal Modulation Network for Controllable Space-Time Video Super-Resolution; Gang Xu et al.; arxiv.org/abs/2104.10642; pp. 1-20, sections 1-6 *
Video Super-Resolution Transformer; Jiezhang Cao et al.; arxiv.org/abs/2106.06847; pp. 1-21, sections 1-6 *
Video super-resolution method based on a multi-scale feature residual learning convolutional neural network; Lin Qi, Chen Jing, Zeng Huanqiang, Zhu Jianqing, Cai Canhui; Signal Processing, No. 01, pp. 54-61 *

Also Published As

Publication number Publication date
CN115861078A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN111311490B (en) Video super-resolution reconstruction method based on multi-frame fusion optical flow
CN108765296B (en) Image super-resolution reconstruction method based on recursive residual attention network
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN112991183B (en) Video super-resolution method based on multi-frame attention mechanism progressive fusion
US20220261959A1 (en) Method of reconstruction of super-resolution of video frame
CN110751597B (en) Video super-resolution method based on coding damage repair
CN108921910B (en) JPEG coding compressed image restoration method based on scalable convolutional neural network
CN111260560A (en) Multi-frame video super-resolution method fused with attention mechanism
CN110889895B (en) Face video super-resolution reconstruction method fusing single-frame reconstruction network
CN112750094B (en) Video processing method and system
CN113066022B (en) Video bit enhancement method based on efficient space-time information fusion
CN111711817B (en) HEVC intra-frame coding compression performance optimization method combined with convolutional neural network
CN108989731B (en) Method for improving video spatial resolution
CN111031315B (en) Compressed video quality enhancement method based on attention mechanism and time dependence
CN111885280A (en) Hybrid convolutional neural network video coding loop filtering method
CN112017116A (en) Image super-resolution reconstruction network based on asymmetric convolution and construction method thereof
CN112862675A (en) Video enhancement method and system for space-time super-resolution
CN115861078B (en) Video enhancement method and system based on bidirectional space-time recursion propagation neural network
CN116668738A (en) Video space-time super-resolution reconstruction method, device and storage medium
CN113747242B (en) Image processing method, image processing device, electronic equipment and storage medium
CN115546030A (en) Compressed video super-resolution method and system based on twin super-resolution network
CN115330631A (en) Multi-scale fusion defogging method based on stacked hourglass network
CN114494050A (en) Self-supervision video deblurring and image frame inserting method based on event camera
CN113362239A (en) Deep learning image restoration method based on feature interaction
CN111860363A (en) Video image processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant