CN115689917A - Efficient space-time super-resolution video compression restoration method based on deep learning - Google Patents

Efficient space-time super-resolution video compression restoration method based on deep learning

Info

Publication number
CN115689917A
Authority
CN
China
Prior art keywords
video
network
frame
resolution
super
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211285099.5A
Other languages
Chinese (zh)
Inventor
陈律丞
刘亮为
卓成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211285099.5A priority Critical patent/CN115689917A/en
Publication of CN115689917A publication Critical patent/CN115689917A/en
Pending legal-status Critical Current

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses an efficient spatio-temporal super-resolution video compression restoration method based on deep learning, and provides a two-stage (encoding-decoding) video transmission scheme built on deep learning. Specifically, in the encoding stage, the method first performs spatial down-sampling and temporal frame extraction on the video in sequence, and then generates the transmission bitstream using a conventional compression coding standard. In the decoding stage, a spatio-temporal super-resolution reconstruction network takes the video decompressed from the bitstream as input and performs three tasks, super-resolution, frame interpolation and quality enhancement, to obtain quality-enhanced video frames with high resolution and high frame rate. Experimental results show that the performance of the disclosed method on the two coding evaluation metrics BD-Rate and BD-PSNR is clearly superior to the traditional coding scheme and existing deep-learning restoration methods, with the additional advantage of low computational overhead.

Description

Efficient space-time super-resolution video compression restoration method based on deep learning
Technical Field
The invention relates to the field of video coding, decoding and restoration, and in particular to an efficient space-time super-resolution video compression restoration method based on deep learning.
Background
In recent years, the share of video content in Internet traffic has grown explosively, and exploring and designing novel video compression and restoration methods with lower bandwidth consumption and smaller visual-quality loss is a research topic of substantial application value and demand.
Conventional video compression algorithms, such as the H.264, H.265 and AV1 coding standards, mostly rely on hand-crafted modules: for example, a block-based motion estimator and the Discrete Cosine Transform (DCT) are used to remove redundancy in the video and complete compression. While these modules are well designed and efficient, the compression scheme as a whole cannot be jointly optimized in an end-to-end manner.
With the continuous development of deep learning techniques and their success on a series of visual tasks, video compression and restoration schemes based on deep learning are receiving more and more attention. These deep-learning-based video compression methods use learnable deep neural networks to replace the hand-crafted modules in traditional video codecs, thereby achieving end-to-end optimization on large-scale video datasets. However, although they have achieved compression and restoration results exceeding current conventional codec standards, deep-learning-based methods carry a high computational-complexity overhead, which is an important problem and challenge that currently hinders the application of deep learning techniques to the design of video compression restoration schemes.
To overcome this challenge, non-patent document 1 (Liu et al., "Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation", ICCV 2021) and non-patent document 2 (Khani et al., "Efficient Video Compression via Content-Adaptive Super-Resolution", ICCV 2021) attempt to reduce the bit rate during encoding by adding a lightweight content-adaptive super-resolution network at the decoder side. Although their inference time for video decoding is fast, their methods require relatively long training times to encode a video. Furthermore, the parameters of the above content-adaptive image super-resolution networks (typically more than 5 megabytes) also need to be included in the bitstream, which increases the number of bits that ultimately need to be transmitted and hinders their applicability in low-bit-rate scenarios.
How to further integrate multiple video restoration tasks into the pipeline of video compression, transmission and quality enhancement to improve overall transmission efficiency and performance has therefore become an important open problem of wide application value in both academia and industry.
Based on the problems, the invention provides an efficient space-time super-resolution video compression restoration method based on deep learning.
Disclosure of Invention
The invention provides an efficient space-time super-resolution video compression and restoration method based on deep learning, aiming at the defects of the existing video compression, restoration and transmission method in performance and efficiency.
The invention is realized by the following technical scheme: a high-efficiency space-time super-resolution video compression restoration method based on deep learning comprises an encoding (Encoder) stage and a decoding (Decoder) stage according to a video distribution process widely adopted at present.
Step 1, encoding (Encoder) phase
The task of the encoding (Encoder) stage is to encode and compress a given source video $V$ (N video frames in total, each with height H, width W and C channels), so that the bit overhead during transmission is reduced as much as possible. Applying the coding algorithm of a conventional compression standard to the source video is a straightforward solution; these codecs use a series of efficient hand-crafted modules to reduce redundancy in the video, and in a practical system this step can use existing conventional video compression coding algorithms such as H.264 and H.265. Considering the recent progress of deep learning on video restoration tasks, down-sampling operations in the spatial and temporal dimensions can be performed first, before the bitstream is generated, so as to further reduce the number of transmitted bits without affecting quality; the down-sampled video is then restored by a spatio-temporal super-resolution operation in the decoding stage.
The step 1 comprises the following steps:
step 1.1, frame extraction is carried out on a source video;
step 1.2, down-sampling a source video;
step 1.3, compressing a source video;
Step 2, decoding (Decoder) stage
The task of the decoding stage is that, after receiving the video bitstream from the encoding stage, this module needs to generate the final restored video $\hat{V}$, whose visual quality should be as close as possible to that of the source video $V$, with the same picture size and video frame rate.
The step 2 comprises the following steps:
step 2.1, decompressing the video processed in step 1;
step 2.2, performing spatio-temporal super-resolution restoration on the decompressed video.
Further, the frame extraction operation of step 1.1 includes:
A frame extraction operation is performed on the source video $V$ in the temporal domain. A temporal frame-extraction coefficient $K_t$ is set, meaning that only every $K_t$-th frame in the video is retained; for example, $K_t = 2$ means that half of the frames are deleted, yielding the temporally down-sampled video $V_t$.
Further, the down-sampling operation of step 1.2 comprises:
Using a bicubic interpolation down-sampling operation, the height and width of each frame of $V_t$ are reduced by a factor of $K_s$ ($K_s$ being the spatial down-sampling coefficient) to lower the spatial resolution, yielding the spatio-temporally down-sampled video $V_{ts}$.
Further, the compression operation of step 1.3 includes:
The spatio-temporally down-sampled video $V_{ts}$ is compressed using an existing conventional video coding algorithm, further reducing the size of the compressed data and generating a bitstream. Since deep learning is not involved in the encoding process and no time-consuming model inference needs to be performed, the encoding stage introduces only negligible computational overhead compared to using the H.265 coding standard directly.
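A hedged sketch of step 1.3 using FFmpeg with libx265 (the toolchain named later in the implementation details); the file names, frame rate and CRF value are illustrative assumptions:

```python
# Compress the spatio-temporally down-sampled frames into an H.265 bitstream.
import subprocess

def encode_h265(input_frames: str, output_stream: str, crf: int = 30, fps: int = 15) -> None:
    """input_frames: an image-sequence pattern such as 'lr_lfr/frame_%04d.png' (assumed layout)."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", input_frames,
        "-c:v", "libx265",
        "-crf", str(crf),          # constant rate factor, e.g. 20/25/30/35 as in the experiments
        "-pix_fmt", "yuv420p",
        output_stream,             # e.g. "bitstream.mp4"
    ], check=True)
```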
Further, the decompression operation of step 2.1 comprises:
In the decompression stage of step 2.1, the video bitstream is decompressed using a decoding algorithm of the same standard as the encoding stage to obtain the low-quality (LQ) video $V_{LQ}$. This video is down-sampled in both resolution and frame rate and contains various artifacts introduced by the compression algorithm.
Further, the spatio-temporal super-resolution restoration operation of step 2.2 includes:
In the spatio-temporal super-resolution restoration stage of step 2.2, in order to recover $\hat{V}$ from $V_{LQ}$, three restoration tasks are required: video frame interpolation, video super-resolution and video quality enhancement. The restoration effects of these three tasks benefit each other through a joint optimization modeling process. Specifically, in the decoding stage of step 2.2, a spatio-temporal super-resolution network SRFI with a novel structural design is constructed, which simultaneously completes the spatio-temporal super-resolution quality enhancement task for the low-quality video $V_{LQ}$: the video is up-sampled in both the temporal and spatial dimensions (including temporal video frame interpolation and spatial video super-resolution) and compression artifacts are removed, yielding the restored high-quality (HQ) video $\hat{V}$. The SRFI network of the decoding (Decoder) stage takes $N_{in}$ low-quality frames as input and recovers $N_{out}$ high-quality frames.
The decoding-stage SRFI network of step 2.2 can be divided into the following sub-networks:
Propagation sub-network: to exploit the potential complementary information across different temporal and spatial positions in a video segment, a forward-backward propagation recurrent neural network structure is first applied to the input frames to extract frame feature maps that incorporate global information. This bidirectional network consists of a Forward Propagation (FP) sub-network and a Backward Propagation (BP) sub-network, which are structurally identical except for the order in which image frames are fed at the input. Inside each sub-network, the input video frame sequence is first fed into the optical flow predictor network SpyNet, from non-patent document 3 (Ranjan et al., "Optical Flow Estimation using a Spatial Pyramid Network", CVPR 2017), to predict the optical flow of inter-frame motion; the image frame at each moment is then warped and aligned with the corresponding predicted optical flow map, and two groups of residual feature blocks fuse the hidden features to obtain the output feature map representation of the sub-network at that moment. The outputs of the forward and backward sub-networks are concatenated along the channel dimension and fused with a 1 × 1 convolutional layer, finally yielding the fused image feature map at each moment of the sub-network output.
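As a rough illustration of one forward-propagation step, the sketch below (an assumption about the structure; SpyNet itself and the exact residual blocks are not reproduced) warps the previous hidden feature with a predicted optical flow via bilinear grid sampling and fuses it with the current frame:

```python
# Simplified single propagation step; channel counts and the two-layer fusion
# stand in for the residual feature blocks described in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp feat (B, C, H, W) with optical flow (B, 2, H, W); flow channels assumed (dx, dy)."""
    b, _, h, w = feat.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy), dim=0).float().to(feat.device)     # base sampling grid (2, H, W)
    coords = grid.unsqueeze(0) + flow                                # absolute sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(feat, grid_norm, mode="bilinear", padding_mode="border", align_corners=True)

class ForwardPropStep(nn.Module):
    """One recurrent step: align the previous hidden state, fuse with the current frame."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels + 3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, frame, hidden, flow):
        aligned = flow_warp(hidden, flow)                            # warp hidden state to current frame
        return self.fuse(torch.cat([frame, aligned], dim=1))
```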
Frame interpolation sub-network: the fused image feature maps obtained in the previous step are taken as the input of the feature interpolation sub-network, and this network synthesizes the frame feature maps lost during compression and frame extraction. In this sub-network, pyramid-structured deformable convolution operations are employed to effectively capture motion cues between frames. Deformable convolution is an improvement and generalization of conventional convolution, with the ability to better capture a wide range of motion.
The 3 × 3 convolution sampling grid K is defined as K = {(-1, -1), (-1, 0), …, (0, 1), (1, 1)}; the conventional convolution F(·) is computed as

$$F(p) = \sum_{k \in K} w(k) \cdot f(p + k)$$

where $k$ is a position in K, $w(\cdot)$ is the convolution weight, $f(\cdot)$ is the input image feature, and $p$ is the base position;
Deformable convolution $F_{\mathrm{deform}}$ adds an additional two-dimensional position offset to each convolution position, increasing the motion-capture capability and the robustness of the network:

$$F_{\mathrm{deform}}(p) = \sum_{k \in K} w(k) \cdot f(p + k + \Delta k)$$

where, compared with conventional convolution, the added parameter $\Delta k$ is an offset field pre-learned by the network; the final deformable convolution operation is realized through offset addition and bilinear interpolation.
The features from the two adjacent frames are first concatenated along the channel dimension, and conventional convolution layers are used to generate a learnable offset field for two deformable convolutions. The two adjacent frame feature maps are then each passed through a deformable convolution using the corresponding offsets, producing a synthesized feature map for the intermediate frame. Finally, a 1 × 1 convolution performs the final fusion to obtain the synthesized feature result.
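A minimal sketch of this interpolation step, assuming `torchvision.ops.DeformConv2d` as the deformable convolution and treating channel counts as illustrative:

```python
# Concatenate the neighbouring features, predict offsets, align each neighbour
# with a deformable convolution, and fuse with a 1x1 convolution.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureInterpolation(nn.Module):
    def __init__(self, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        offset_ch = 2 * kernel_size * kernel_size                    # (dx, dy) per kernel position
        self.offset_pred = nn.Conv2d(2 * channels, 2 * offset_ch, 3, padding=1)
        self.dcn_prev = DeformConv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.dcn_next = DeformConv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # final 1x1 fusion

    def forward(self, feat_prev: torch.Tensor, feat_next: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(torch.cat([feat_prev, feat_next], dim=1))
        off_prev, off_next = offsets.chunk(2, dim=1)                  # one offset field per neighbour
        aligned_prev = self.dcn_prev(feat_prev, off_prev)
        aligned_next = self.dcn_next(feat_next, off_next)
        return self.fuse(torch.cat([aligned_prev, aligned_next], dim=1))

# mid_feat = FeatureInterpolation()(torch.rand(1, 64, 32, 56), torch.rand(1, 64, 32, 56))
```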
Super-resolution sub-network: a pixel shuffle operation (Pixel Shuffle) is adopted to raise the spatial resolution of the $N_{out}$ frame feature maps generated by the first two sub-networks; for frame times where a low-resolution original feature map is available, an additional residual block structure is introduced, and the restored video $\hat{V}$ is finally output.
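A hedged sketch of such an upsampling head, with the channel count, the single residual block and $K_s = 4$ chosen for illustration:

```python
# Pixel-shuffle upsampling preceded by residual refinement; not the exact
# sub-network of the patent, only its general shape.
import torch
import torch.nn as nn

class UpsampleHead(nn.Module):
    def __init__(self, channels: int = 64, k_s: int = 4, out_ch: int = 3):
        super().__init__()
        self.res_block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.expand = nn.Conv2d(channels, channels * k_s * k_s, 3, padding=1)
        self.shuffle = nn.PixelShuffle(k_s)        # (C*K_s^2, H, W) -> (C, K_s*H, K_s*W)
        self.to_rgb = nn.Conv2d(channels, out_ch, 3, padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        feat = feat + self.res_block(feat)         # residual refinement before upsampling
        return self.to_rgb(self.shuffle(self.expand(feat)))

# hq_frame = UpsampleHead()(torch.rand(1, 64, 64, 112))   # -> (1, 3, 256, 448)
```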
The SRFI network is trained on MFQEv2 data sets, which contain a total of 160 lossless video sequences. The resolutions of these video sequences are 2K (2048 × 1080), 1080p (1920 × 1080), 360p (640 × 360), CIF (352 × 288), and the like. At test time, the HEVC standard test data set, which is widely used to evaluate video compression related tasks, is used, and consists of 16 video sequences of different content and resolution.
The SRFI network uses peak signal-to-noise ratio (PSNR) and bits per pixel (BPP) to evaluate overall performance. The two metrics are calculated on the video frames output by networks trained on H.265-compressed videos of different dataset types and different Constant Rate Factors (CRF); the two video coding evaluation metrics BD-Rate and BD-PSNR are also calculated, so that the SRFI network can be compared with other schemes.
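A minimal sketch of how these two quantities can be computed; it assumes 8-bit frames and takes the bitstream size directly from the encoded file, and the function names are illustrative:

```python
# PSNR between a reference frame and a restored frame, and BPP of a bitstream.
import os
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def bits_per_pixel(bitstream_path: str, num_frames: int, height: int, width: int) -> float:
    # Total transmitted bits divided by the number of pixels in the source video.
    return os.path.getsize(bitstream_path) * 8.0 / (num_frames * height * width)
```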
The SRFI network is implemented with the PyTorch framework, and the bitstream is generated with the FFmpeg and libx265 toolkits. During training, the network uses a pre-trained SpyNet model to initialize the optical flow estimator. When preprocessing the data, patches of size 128 × 128 are cropped from the source video and the corresponding low-quality video as training sample pairs, and data augmentation is performed through random horizontal flipping and 90° rotation. The learning rates of the optical flow estimator and of the other network parts are initially set to $2.5 \times 10^{-5}$ and $2 \times 10^{-4}$, respectively. An Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$ performs the parameter update according to the standard Adam rule:

$$\theta_{k+1} = \theta_k - \eta \, \hat{m}_k / (\sqrt{\hat{v}_k} + \epsilon)$$

where $\eta$ is the learning rate, $\theta_k$ is the weight parameter at the $k$-th iteration, and $\hat{m}_k$, $\hat{v}_k$ are Adam's bias-corrected first- and second-moment estimates of the gradient.
The learning rate is varied periodically over training with a cosine annealing schedule; the learning rate at the $k$-th iteration is

$$\eta_k = \frac{\eta_0}{2}\left(1 + \cos\frac{k\pi}{T}\right)$$

where $\eta_0$ is the initial learning rate (set to $2 \times 10^{-4}$) and the total number of iterations $T$ is 300000.
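A hedged sketch of this optimizer and scheduler setup; `srfi_net` and `flow_net` are placeholder names for the restoration network and the SpyNet flow estimator, and the parameter grouping is an assumption:

```python
# Adam with two learning-rate groups plus cosine annealing over T iterations.
import torch

def build_optimizer(srfi_net: torch.nn.Module, flow_net: torch.nn.Module, total_iters: int = 300000):
    optimizer = torch.optim.Adam(
        [
            {"params": flow_net.parameters(), "lr": 2.5e-5},   # optical-flow estimator
            {"params": srfi_net.parameters(), "lr": 2e-4},     # remaining SRFI parameters
        ],
        betas=(0.9, 0.99),
    )
    # Cosine annealing: eta_k = eta_0 / 2 * (1 + cos(k * pi / T)).
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters, eta_min=0.0)
    return optimizer, scheduler
```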
In training and evaluation, four different Constant Rate Factors (CRFs), 20, 25, 30, 35, were selected in total, and a separate model was trained for each CRF value for the different compression settings.
The invention has the following beneficial effects:
1) A novel video compression method is proposed that combines the latest advances in the spatio-temporal super-resolution task to improve the performance of traditional video codecs. This is the first attempt to apply deep neural networks for the spatio-temporal super-resolution task to video compression.
2) An efficient spatio-temporal super-resolution network (SRFI) is proposed to improve video decoding. The SRFI network can automatically capture deep motion features in compressed video sequences of low resolution and low frame rate, and efficiently completes the tasks of feature alignment and warping, frame feature interpolation, up-sampling super-resolution and quality enhancement. Comparisons with existing methods demonstrate the advantages of the network model in both speed and accuracy.
3) Experimental results show that the proposed STSR-VC (Space-Time Super-Resolution Video Compression) video compression and restoration method combines the high efficiency of a conventional codec (the computational overhead at the encoder side is close to zero) with the strong robustness of learnable DNNs (superior to the conventional H.265 codec and other existing DNN-based codecs), achieves better bit-transmission efficiency, and is better suited to low-bit-rate transmission scenarios. The method is therefore of high value in helping the Internet content distribution industry save transmission bandwidth and improve efficiency.
Drawings
FIG. 1 is a flow chart of an implementation of a method for efficient spatio-temporal super-resolution video compression restoration based on deep learning according to an embodiment of the present invention;
FIG. 2 is a block diagram of the SRFI spatio-temporal super-resolution network at the decoding stage in one embodiment of the present invention;
FIG. 3 is a diagram comparing the structure of the SRFI network with an existing FISR spatio-temporal super-resolution network according to one embodiment of the present invention;
FIG. 4 is a BPP-PSNR comparison graph of the spatio-temporal super-resolution compression restoration task performed by the SRFI according to one embodiment of the present invention and other existing methods;
FIG. 5 is a graph comparing the results of the spatio-temporal super-resolution compression restoration task performed by the SRFI according to one embodiment of the present invention with other existing methods.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Fig. 1 shows the implementation flow of the efficient space-time super-resolution video compression restoration method based on deep learning according to the present invention. The method follows the video distribution process widely adopted at present and comprises an encoding (Encoder) stage and a decoding (Decoder) stage.
The task of the encoding (Encoder) stage is to encode and compress a given source video $V$ (N video frames in total, each with height H, width W and C channels), so that the bit overhead during transmission is reduced as much as possible. Applying the coding algorithm of a conventional compression standard to the source video is a straightforward solution, and these codecs use a series of efficient hand-crafted modules to reduce redundancy in the video. Considering the recent progress of deep learning on video restoration tasks, down-sampling operations in the spatial and temporal dimensions can be performed first, before the bitstream is generated, so as to further reduce the number of transmitted bits without affecting quality; the down-sampled video is then restored by a spatio-temporal super-resolution operation in the decoding stage.
The encoding stage first performs a frame extraction operation on the source video $V$ in the temporal domain. A temporal frame-extraction coefficient $K_t$ is set, meaning that only every $K_t$-th frame in the video is retained; for example, $K_t = 2$ means that half of the frames are deleted, yielding the temporally down-sampled video $V_t$.
Then, using a bicubic interpolation down-sampling operation, the height and width of each frame of $V_t$ are reduced by a factor of $K_s$ (the spatial down-sampling coefficient) to lower the spatial resolution. Finally, the spatio-temporally down-sampled video $V_{ts}$ is compressed using an existing conventional video coding algorithm, further reducing the size of the compressed data and generating a bitstream. In a practical system, this step may use an off-the-shelf conventional video compression coding algorithm such as H.264 or H.265. Since deep learning is not involved in the encoding process and no time-consuming model inference needs to be performed, the encoding stage introduces only negligible computational overhead compared to using a conventional coding standard algorithm alone.
The task of the decoding stage is that, after receiving the video bitstream from the encoding stage, this module needs to generate the final restored video $\hat{V}$, whose visual quality should be as close as possible to that of the source video $V$, with the same picture size and video frame rate. The decoding stage first decompresses the bitstream using a decoding algorithm of the same standard as the encoding stage to obtain the low-quality (LQ) video $V_{LQ}$.
This video is down-sampled in both resolution and frame rate and contains various artifacts introduced by the H.265 compression algorithm. To recover $\hat{V}$, three restoration tasks are required: video frame interpolation, video super-resolution and video quality enhancement. However, considering the internal dependencies of these three tasks, their restoration effects can benefit each other through a joint optimization modeling process. Specifically, in the decoding stage, a spatio-temporal super-resolution network SRFI with a novel structural design is constructed, which simultaneously completes the spatio-temporal super-resolution quality enhancement task for the low-quality video $V_{LQ}$: the video is up-sampled in both the temporal and spatial dimensions (including temporal video frame interpolation and spatial video super-resolution) and compression artifacts are removed, yielding the restored high-quality (HQ) video $\hat{V}$.
As shown in FIG. 2, the SRFI network of the decoding (Decoder) module takes $N_{in}$ low-quality frames as input and recovers $N_{out}$ high-quality frames; in its implementation it can be divided into three processing sub-networks: "propagation", "frame interpolation" and "super-resolution".
The network architecture shown in FIG. 2(b) is a schematic diagram of a single forward-propagation segment module in the "propagation" sub-network of the SRFI network. To exploit the potential complementary information across different temporal and spatial positions in a video segment, a forward-backward propagation recurrent neural network structure is first applied to the input frames to extract frame feature maps that incorporate global information. This bidirectional network consists of a Forward Propagation (FP) sub-network and a Backward Propagation (BP) sub-network, which are structurally identical except for the order in which image frames are fed at the input. Inside each sub-network, the input video frame sequence is first fed into the SpyNet network to predict the optical flow of inter-frame motion; the image frame at each moment is then warped and aligned with the corresponding predicted optical flow map, and two groups of residual feature blocks fuse the hidden features to obtain the output feature map representation of the sub-network at that moment. The outputs of the forward and backward sub-networks are concatenated along the channel dimension and fused with a 1 × 1 convolutional layer, finally yielding the fused image feature map at each moment of the sub-network output.
The network structure shown in FIG. 2(c) is the "frame interpolation" sub-network of the SRFI network: the fused image feature maps obtained in the previous step are taken as the input of the feature interpolation sub-network, and this network synthesizes the frame feature maps lost during compression and frame extraction. In this sub-network, pyramid-structured deformable convolution operations are employed to effectively capture motion cues between frames. Deformable convolution is an improvement and generalization of conventional convolution, with the ability to better capture a wide range of motion.
The 3 × 3 convolution sampling grid K is defined as K = {(-1, -1), (-1, 0), …, (0, 1), (1, 1)}; the conventional convolution F(·) is computed as

$$F(p) = \sum_{k \in K} w(k) \cdot f(p + k)$$

where $k$ is a position in K, $w(\cdot)$ is the convolution weight, $f(\cdot)$ is the input image feature, and $p$ is the base position;
Deformable convolution $F_{\mathrm{deform}}$ adds an additional two-dimensional position offset to each convolution position, increasing the motion-capture capability and the robustness of the network:

$$F_{\mathrm{deform}}(p) = \sum_{k \in K} w(k) \cdot f(p + k + \Delta k)$$

where, compared with conventional convolution, the added parameter $\Delta k$ is an offset field pre-learned by the network; the final deformable convolution operation is realized through offset addition and bilinear interpolation.
The features from the two adjacent frames are first concatenated along the channel dimension, and conventional convolution layers are used to generate a learnable offset field for two deformable convolutions. The two adjacent frame feature maps are then each passed through a deformable convolution using the corresponding offsets, producing a synthesized feature map for the intermediate frame. Finally, a 1 × 1 convolution performs the final fusion to obtain the synthesized feature result.
The "super-divided" sub-network: pixel Shuffle operation (Pixel Shuffle) is adopted to improve the N generated by the first two sub-networks out The spatial resolution of the frame feature map, for the frame time with available low-resolution original feature map, additionally introduces a residual block structure, and finally outputs
Figure BDA0003899464550000073
As shown in FIG. 3, the structure of the SRFI (Super-Resolution Frame-Interpolation) network is compared with that of the conventional frame-interpolation-then-super-resolution (FI-SR) architecture. Previously existing STSR networks tend to follow the "FI-SR" architecture: given n input frames, frame feature interpolation is first performed to generate (n-1) new feature representations (assuming $K_t = 2$), and then all (2n-1) feature maps are fed into the super-resolution network, which involves a number of computationally intensive operations such as optical flow estimation, motion alignment and feature propagation. Although this architecture is intuitive, it has two major limitations:
1. Since the newly generated (n-1) feature maps are derived from neighbouring frames, they contain no new beneficial information; feeding them to the super-resolution sub-network may not benefit overall performance.
2. The FI-SR architecture has an inherent complexity: the number of input feature maps for the super-resolution network part is (2n-1), which significantly increases the computational overhead of the whole module and lowers the inference speed. Meanwhile, the upper limit on the number of input frames in the decoding stage is also restricted by excessive memory consumption.
To address these two limitations, a simple and efficient "SR-FI" architecture can be adopted, which first feeds the video frames to the propagation network. Feature interpolation is then performed on the intermediate features, which contain global and high-resolution information, to obtain the (n-1) new features. Here, the expensive operations of the super-resolution network are performed on only n frames, saving nearly half of the computation compared with feeding (2n-1) frames as in the "FI-SR" architecture; this not only increases the inference speed of the network, but also allows the overall method to take more video frames as input at a time. Note that there is no separate sub-network for quality enhancement in the SRFI network, because the SRFI network automatically learns to remove compression artifacts through end-to-end training, thereby also completing the quality enhancement task.
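The ordering described above can be summarized with the following structural sketch, which is an assumption about the overall wiring rather than the patented implementation: the costly propagation runs on the n decoded frames only, feature interpolation synthesizes the (n-1) in-between features, and the lightweight pixel-shuffle head then upsamples all (2n-1) feature maps.

```python
# "SR-FI" ordering: propagate -> interpolate -> upsample; the three sub-modules
# are placeholders (e.g. the illustrative classes sketched earlier).
import torch
import torch.nn as nn

class SRFISkeleton(nn.Module):
    def __init__(self, propagate: nn.Module, interpolate: nn.Module, upsample: nn.Module):
        super().__init__()
        self.propagate, self.interpolate, self.upsample = propagate, interpolate, upsample

    def forward(self, frames: list[torch.Tensor]) -> list[torch.Tensor]:
        feats = self.propagate(frames)                     # n fused feature maps (heavy work on n frames only)
        mids = [self.interpolate(feats[i], feats[i + 1])   # (n - 1) synthesized in-between features
                for i in range(len(feats) - 1)]
        # Interleave original and synthesized features -> (2n - 1) high-frame-rate outputs.
        merged = [f for pair in zip(feats, mids) for f in pair] + [feats[-1]]
        return [self.upsample(f) for f in merged]          # cheap pixel-shuffle upsampling per feature
```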
During training of the proposed SRFI network, to generate input samples, the complete encoding-stage process is first executed to generate the bitstream, and the bitstream is then decoded into low-quality video frames using a decoding algorithm of the same standard as the encoding stage. When updating the network parameters by back-propagation, a Charbonnier penalty between the reconstructed video frames $\hat{x}$ and the corresponding source video frames $x$ is used as the loss function:

$$\mathcal{L}(\hat{x}, x) = \sqrt{\lVert \hat{x} - x \rVert^2 + \epsilon^2}$$

where the value of $\epsilon$ is set to $1 \times 10^{-3}$ during training.
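A minimal sketch of this Charbonnier penalty with $\epsilon = 1 \times 10^{-3}$ as stated above; the function name is illustrative:

```python
# Charbonnier loss: a smooth, robust variant of the L1 penalty.
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```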
The metrics of the SRFI network are compared with those of other deep-learning video compression restoration methods; the BD-Rate/BD-PSNR quantitative results and the BPP-PSNR curves of the different methods on the HEVC dataset are shown in Table 1 and FIG. 4, respectively. Taking the conventional H.265 codec scheme as the baseline, the disclosed method, which combines the H.265 scheme with the deep-learning-based SRFI network, achieves a 10.14% BD-Rate reduction and a 0.48 dB BD-PSNR gain over the H.265 baseline, showing that the proposed spatio-temporal super-resolution video compression restoration method attains higher coding performance than the H.265 scheme. SRFI achieves results comparable to the baseline H.265 codec in high-bit-rate scenarios and outperforms all schemes of the same type in low-bit-rate scenarios. Furthermore, the overall scheme generates only negligible computational overhead at the encoding end and can run at 70 frames per second when decoding a 352 × 288 test video, an approximately 2× speed-up compared with non-patent document 4 (Lu et al., "DVC: An End-to-End Deep Video Compression Framework", CVPR 2019).
TABLE 1: comparison of BD-Rate (%)/BD-PSNR (dB) metrics of the SRFI according to one embodiment of the present invention and other existing methods on the spatio-temporal super-resolution compression restoration task
Fig. 5 shows a detail comparison of restored video frames on the HEVC dataset. It can be seen that video frames compressed using H.265 alone still suffer severe distortion from various artifacts after restoration (e.g., blurring in the first row of images and ringing in the third row). At a similar BPP (bits per pixel), although other existing deep-learning video compression methods can reduce artifacts to some extent, the generated frames still show excessive blurring and loss of detail. Compared with these existing methods, the SRFI network recovers more accurate details: for example, in the first row of images the texture of the vent can be clearly identified without the obvious blurring around it seen in the results of other methods, and in the second row the result of the SRFI network is the closest to the ground truth.
One skilled in the art can, using the teachings of the present invention, readily make various changes and modifications to the invention without departing from the spirit and scope of the invention as defined by the appended claims. Any modifications and equivalent variations of the above-described embodiments, which are made in accordance with the technical spirit and substance of the present invention, fall within the scope of protection of the present invention as defined in the claims.

Claims (7)

1. A high-efficiency space-time super-resolution video compression restoration method based on deep learning is characterized by comprising the following steps:
step 1, in the encoding stage, encoding and compressing a given source video, including:
step 1.1, frame extraction is carried out on a source video;
step 1.2, down-sampling a source video;
step 1.3, compressing the source video;
step 2, a decoding stage, which generates a final restored video after receiving the video code stream from the encoding stage, and comprises the following steps:
step 2.1, decompressing the video processed in step 1;
and 2.2, performing space-time super-resolution restoration on the decompressed video.
2. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 1, wherein the frame extraction in step 1.1 comprises:
for a given source video $V$, where N, H, W and C respectively represent the number of frames, the frame height, the frame width and the number of channels of the input source video, a temporal frame-extraction coefficient $K_t$ is set, meaning that only every $K_t$-th frame in the video is retained, obtaining the temporally down-sampled video $V_t$.
3. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 2, wherein the down-sampling operation of step 1.2 comprises:
using a bicubic interpolation down-sampling operation, the height and width of each frame of $V_t$ are reduced by a factor of $K_s$, where $K_s$ is the spatial down-sampling coefficient, so as to lower the spatial resolution and obtain the spatially down-sampled video $V_{ts}$.
4. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 3, wherein the compression operation of step 1.3 comprises:
the spatio-temporally down-sampled video $V_{ts}$ is compressed using an existing conventional video coding algorithm, further reducing the size of the compressed data and generating a bitstream.
5. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 4, wherein the decompression operation of step 2.1 comprises:
the video bitstream is decompressed using a decoding algorithm of the same standard as the encoding stage to obtain the low-quality video $V_{LQ}$.
6. The method for compressing and restoring high-efficiency spatio-temporal super-resolution video based on deep learning of claim 5, wherein the spatio-temporal super-resolution restoration operation of step 2.2 comprises:
a Super-Resolution Frame-Interpolation (SRFI) spatio-temporal super-resolution network is designed, and the spatio-temporal super-resolution quality enhancement task for the low-quality video $V_{LQ}$ is completed through the SRFI network: the low-quality video $V_{LQ}$ is up-sampled in both the temporal and spatial dimensions, including temporal video frame interpolation and spatial video super-resolution, and compression artifacts are removed to obtain the restored high-quality HQ video $\hat{V}$; the spatio-temporal super-resolution network SRFI takes $N_{in}$ low-quality frames as input and recovers $N_{out}$ high-quality frames.
7. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 6, wherein the space-time super-resolution network SRFI in step 2.2 comprises a propagation sub-network, a frame interpolation sub-network and a super-resolution sub-network;
a propagation sub-network, for applying a forward-backward propagation recurrent neural network structure to the input frames to extract frame feature maps combined with global information; the bidirectional network is composed of a forward propagation sub-network and a backward propagation sub-network, which have the same structure and differ only in the order in which image frames are fed at the input; inside each sub-network, the input video frame sequence is first fed into an optical flow predictor network to predict the optical flow of inter-frame motion, then the image frame at each moment is warped and aligned with the corresponding predicted optical flow map, and two groups of residual feature blocks fuse the hidden features to obtain the output feature map representation of the sub-network at that moment; the outputs of the forward and backward sub-networks are concatenated along the channel dimension and fused using a 1 × 1 convolutional layer, finally obtaining the fused image feature map at each moment of the sub-network output;
the frame interpolation sub-network takes the fused image feature map at each moment obtained by the propagation sub-network as the input of the feature interpolation sub-network, and synthesizes and outputs the frame feature maps lost during compression and frame extraction; in this sub-network, pyramid-structured deformable convolution operations are employed to capture motion cues between frames;
the super-resolution sub-network uses a pixel shuffle operation to raise the spatial resolution of the $N_{out}$ frame feature maps generated by the propagation and frame interpolation sub-networks; for frame times where a low-resolution original feature map is available, an additional residual block structure is introduced, and the high-quality HQ video $\hat{V}$ is finally output.
CN202211285099.5A 2022-10-20 2022-10-20 Efficient space-time super-resolution video compression restoration method based on deep learning Pending CN115689917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211285099.5A CN115689917A (en) 2022-10-20 2022-10-20 Efficient space-time super-resolution video compression restoration method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211285099.5A CN115689917A (en) 2022-10-20 2022-10-20 Efficient space-time super-resolution video compression restoration method based on deep learning

Publications (1)

Publication Number Publication Date
CN115689917A true CN115689917A (en) 2023-02-03

Family

ID=85066448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211285099.5A Pending CN115689917A (en) 2022-10-20 2022-10-20 Efficient space-time super-resolution video compression restoration method based on deep learning

Country Status (1)

Country Link
CN (1) CN115689917A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523758A (en) * 2023-07-03 2023-08-01 清华大学 End cloud combined super-resolution video reconstruction method and system based on key frames
CN116781910A (en) * 2023-07-03 2023-09-19 江苏汇智达信息科技有限公司 Information conversion system based on neural network algorithm
CN116523758B (en) * 2023-07-03 2023-09-19 清华大学 End cloud combined super-resolution video reconstruction method and system based on key frames
CN116634209A (en) * 2023-07-24 2023-08-22 武汉能钠智能装备技术股份有限公司 Breakpoint video recovery system and method based on hot plug
CN116634209B (en) * 2023-07-24 2023-11-17 武汉能钠智能装备技术股份有限公司 Breakpoint video recovery system and method based on hot plug
CN117376583A (en) * 2023-09-18 2024-01-09 南通大学 Traceable frame rate conversion model construction method for high-frame rate video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination