CN115689917A - Efficient space-time super-resolution video compression restoration method based on deep learning - Google Patents

Efficient space-time super-resolution video compression restoration method based on deep learning

Info

Publication number
CN115689917A
Authority
CN
China
Prior art keywords
video
network
frame
resolution
super
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211285099.5A
Other languages
Chinese (zh)
Inventor
陈律丞
刘亮为
卓成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202211285099.5A priority Critical patent/CN115689917A/en
Publication of CN115689917A publication Critical patent/CN115689917A/en
Pending legal-status Critical Current

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses an efficient spatio-temporal super-resolution video compression restoration method based on deep learning, and provides a two-stage (encoding-decoding) video transmission scheme built on deep learning. Specifically, in the encoding stage, the method first performs spatial down-sampling and temporal frame extraction on the video in sequence, and then generates the transmission bitstream using a conventional compression coding standard. In the decoding stage, a spatio-temporal super-resolution reconstruction network takes the video decompressed from the bitstream as input and performs three tasks, super-resolution, frame interpolation and quality enhancement, to obtain quality-enhanced video frames with high resolution and high frame rate. Experimental results show that the performance of the disclosed method on the two coding evaluation metrics BD-Rate and BD-PSNR is clearly superior to the traditional coding scheme and existing deep-learning restoration methods, with the additional advantage of low computational overhead.

Description

Efficient space-time super-resolution video compression restoration method based on deep learning
Technical Field
The invention relates to the field of video coding, decoding and restoration, and in particular to an efficient space-time super-resolution video compression restoration method based on deep learning.
Background
In recent years, the share of video content in Internet traffic has grown explosively, and exploring and designing novel video compression and restoration methods with lower bandwidth consumption and smaller visual-quality loss is a research topic of substantial application value and demand.
Conventional video compression algorithms, such as the H.264, H.265 and AV1 coding standards, mostly rely on hand-crafted modules: for example, a block-based motion estimator and the Discrete Cosine Transform (DCT) are used to remove redundancy in the video and complete compression. While these modules are well designed and efficient, the compression scheme as a whole cannot be jointly optimized in an end-to-end manner.
With the continuous development of deep learning techniques and their success on a series of visual tasks, video compression and restoration schemes based on deep learning are receiving more and more attention. These deep-learning-based video compression methods use learnable deep neural networks to replace the hand-crafted modules in traditional video codecs, thereby achieving end-to-end optimization on large-scale video datasets. However, although they have achieved compression and restoration results exceeding current conventional codec standards, deep-learning-based methods carry a high computational-complexity overhead, which is an important problem and challenge that currently hinders the application of deep learning techniques to the design of video compression restoration schemes.
To overcome this challenge, non-patent document 1 (Liu et al., "Overfitting the Data: Compact Neural Video Delivery via Content-aware Feature Modulation", ICCV 2021) and non-patent document 2 (Khani et al., "Efficient Video Compression via Content-Adaptive Super-Resolution", ICCV 2021) attempt to reduce the bit rate during encoding by adding a lightweight content-adaptive super-resolution network at the decoder side. Although their inference time for video decoding is fast, their methods require relatively long training times to encode a video. Furthermore, the parameters of the above content-adaptive image super-resolution networks (typically more than 5 megabytes) also need to be included in the bitstream, which increases the number of bits that ultimately need to be transmitted and hinders their applicability in low-bit-rate scenarios.
How to further integrate multiple video restoration tasks into the pipeline of video compression, transmission and quality enhancement to improve overall transmission efficiency and performance has therefore become an important open problem of wide application value in both academia and industry.
Based on the problems, the invention provides an efficient space-time super-resolution video compression restoration method based on deep learning.
Disclosure of Invention
The invention provides an efficient space-time super-resolution video compression and restoration method based on deep learning, aiming at the defects of the existing video compression, restoration and transmission method in performance and efficiency.
The invention is realized by the following technical scheme: a high-efficiency space-time super-resolution video compression restoration method based on deep learning comprises an encoding (Encoder) stage and a decoding (Decoder) stage according to a video distribution process widely adopted at present.
Step 1, encoding (Encoder) phase
The task of the encoding (Encoder) stage is to encode and compress a given source video $V$ (N video frames in total, each with height H, width W and C channels), so that the bit overhead during transmission is reduced as much as possible. Applying the coding algorithm of a conventional compression standard to the source video is a straightforward solution; these codecs use a series of efficient hand-crafted modules to reduce redundancy in the video, and in a practical system this step can use existing conventional video compression coding algorithms such as H.264 and H.265. Considering the recent progress of deep learning on video restoration tasks, down-sampling operations in the spatial and temporal dimensions can be performed first, before the bitstream is generated, so as to further reduce the number of transmitted bits without affecting quality; the down-sampled video is then restored by a spatio-temporal super-resolution operation in the decoding stage.
The step 1 comprises the following steps:
step 1.1, frame extraction is carried out on a source video;
step 1.2, down-sampling a source video;
step 1.3, compressing a source video;
Step 2, decoding (Decoder) stage
The task of the decoding stage is that, after receiving the video bitstream from the encoding stage, this module needs to generate the final restored video $\hat{V}$, whose visual quality should be as close as possible to that of the source video $V$, with the same picture size and video frame rate.
The step 2 comprises the following steps:
step 2.1, decompressing the video processed in step 1;
step 2.2, performing spatio-temporal super-resolution restoration on the decompressed video.
Further, the frame extraction operation of step 1.1 includes:
A frame extraction operation is performed on the source video $V$ in the temporal domain. A temporal frame-extraction coefficient $K_t$ is set, meaning that only every $K_t$-th frame in the video is retained; for example, $K_t = 2$ means that half of the frames are deleted, yielding the temporally down-sampled video $V_t$.
Further, the down-sampling operation of step 1.2 comprises:
Using a bicubic interpolation down-sampling operation, the height and width of each frame of $V_t$ are reduced by a factor of $K_s$ ($K_s$ being the spatial down-sampling coefficient) to lower the spatial resolution, yielding the spatio-temporally down-sampled video $V_{ts}$.
Further, the compression operation of step 1.3 includes:
The spatio-temporally down-sampled video $V_{ts}$ is compressed using an existing conventional video coding algorithm, further reducing the size of the compressed data and generating a bitstream. Since deep learning is not involved in the encoding process and no time-consuming model inference needs to be performed, the encoding stage introduces only negligible computational overhead compared to using the H.265 coding standard directly.
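A hedged sketch of step 1.3 using FFmpeg with libx265 (the toolchain named later in the implementation details); the file names, frame rate and CRF value are illustrative assumptions:

```python
# Compress the spatio-temporally down-sampled frames into an H.265 bitstream.
import subprocess

def encode_h265(input_frames: str, output_stream: str, crf: int = 30, fps: int = 15) -> None:
    """input_frames: an image-sequence pattern such as 'lr_lfr/frame_%04d.png' (assumed layout)."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", input_frames,
        "-c:v", "libx265",
        "-crf", str(crf),          # constant rate factor, e.g. 20/25/30/35 as in the experiments
        "-pix_fmt", "yuv420p",
        output_stream,             # e.g. "bitstream.mp4"
    ], check=True)
```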
Further, the decompression operation of step 2.1 comprises:
In the decompression stage of step 2.1, the video bitstream is decompressed using a decoding algorithm of the same standard as the encoding stage to obtain the low-quality (LQ) video $V_{LQ}$. This video is down-sampled in both resolution and frame rate and contains various artifacts introduced by the compression algorithm.
Further, the spatio-temporal super-resolution restoration operation of step 2.2 includes:
In the spatio-temporal super-resolution restoration stage of step 2.2, in order to recover $\hat{V}$ from $V_{LQ}$, three restoration tasks are required: video frame interpolation, video super-resolution and video quality enhancement. The restoration effects of these three tasks benefit each other through a joint optimization modeling process. Specifically, in the decoding stage of step 2.2, a spatio-temporal super-resolution network SRFI with a novel structural design is constructed, which simultaneously completes the spatio-temporal super-resolution quality enhancement task for the low-quality video $V_{LQ}$: the video is up-sampled in both the temporal and spatial dimensions (including temporal video frame interpolation and spatial video super-resolution) and compression artifacts are removed, yielding the restored high-quality (HQ) video $\hat{V}$. The SRFI network of the decoding (Decoder) stage takes $N_{in}$ low-quality frames as input and recovers $N_{out}$ high-quality frames.
The decoding-stage SRFI network of step 2.2 can be divided into the following sub-networks:
Propagation sub-network: to exploit the potential complementary information across different temporal and spatial positions in a video segment, a forward-backward propagation recurrent neural network structure is first applied to the input frames to extract frame feature maps that incorporate global information. This bidirectional network consists of a Forward Propagation (FP) sub-network and a Backward Propagation (BP) sub-network, which are structurally identical except for the order in which image frames are fed at the input. Inside each sub-network, the input video frame sequence is first fed into the optical flow predictor network SpyNet, from non-patent document 3 (Ranjan et al., "Optical Flow Estimation using a Spatial Pyramid Network", CVPR 2017), to predict the optical flow of inter-frame motion; the image frame at each moment is then warped and aligned with the corresponding predicted optical flow map, and two groups of residual feature blocks fuse the hidden features to obtain the output feature map representation of the sub-network at that moment. The outputs of the forward and backward sub-networks are concatenated along the channel dimension and fused with a 1 × 1 convolutional layer, finally yielding the fused image feature map at each moment of the sub-network output.
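As a rough illustration of one forward-propagation step, the sketch below (an assumption about the structure; SpyNet itself and the exact residual blocks are not reproduced) warps the previous hidden feature with a predicted optical flow via bilinear grid sampling and fuses it with the current frame:

```python
# Simplified single propagation step; channel counts and the two-layer fusion
# stand in for the residual feature blocks described in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp feat (B, C, H, W) with optical flow (B, 2, H, W); flow channels assumed (dx, dy)."""
    b, _, h, w = feat.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xx, yy), dim=0).float().to(feat.device)     # base sampling grid (2, H, W)
    coords = grid.unsqueeze(0) + flow                                # absolute sampling positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(feat, grid_norm, mode="bilinear", padding_mode="border", align_corners=True)

class ForwardPropStep(nn.Module):
    """One recurrent step: align the previous hidden state, fuse with the current frame."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(channels + 3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, frame, hidden, flow):
        aligned = flow_warp(hidden, flow)                            # warp hidden state to current frame
        return self.fuse(torch.cat([frame, aligned], dim=1))
```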
Frame interpolation sub-network: the fused image feature maps obtained in the previous step are taken as the input of the feature interpolation sub-network, and this network synthesizes the frame feature maps lost during compression and frame extraction. In this sub-network, pyramid-structured deformable convolution operations are employed to effectively capture motion cues between frames. Deformable convolution is an improvement and generalization of conventional convolution, with the ability to better capture a wide range of motion.
The 3 × 3 convolution sampling grid K is defined as K = {(-1, -1), (-1, 0), …, (0, 1), (1, 1)}; the conventional convolution F(·) is computed as

$$F(p) = \sum_{k \in K} w(k) \cdot f(p + k)$$

where $k$ is a position in K, $w(\cdot)$ is the convolution weight, $f(\cdot)$ is the input image feature, and $p$ is the base position;
Deformable convolution $F_{\mathrm{deform}}$ adds an additional two-dimensional position offset to each convolution position, increasing the motion-capture capability and the robustness of the network:

$$F_{\mathrm{deform}}(p) = \sum_{k \in K} w(k) \cdot f(p + k + \Delta k)$$

where, compared with conventional convolution, the added parameter $\Delta k$ is an offset field pre-learned by the network; the final deformable convolution operation is realized through offset addition and bilinear interpolation.
The features from the two adjacent frames are first concatenated along the channel dimension, and conventional convolution layers are used to generate a learnable offset field for two deformable convolutions. The two adjacent frame feature maps are then each passed through a deformable convolution using the corresponding offsets, producing a synthesized feature map for the intermediate frame. Finally, a 1 × 1 convolution performs the final fusion to obtain the synthesized feature result.
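A minimal sketch of this interpolation step, assuming `torchvision.ops.DeformConv2d` as the deformable convolution and treating channel counts as illustrative:

```python
# Concatenate the neighbouring features, predict offsets, align each neighbour
# with a deformable convolution, and fuse with a 1x1 convolution.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureInterpolation(nn.Module):
    def __init__(self, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        offset_ch = 2 * kernel_size * kernel_size                    # (dx, dy) per kernel position
        self.offset_pred = nn.Conv2d(2 * channels, 2 * offset_ch, 3, padding=1)
        self.dcn_prev = DeformConv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.dcn_next = DeformConv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # final 1x1 fusion

    def forward(self, feat_prev: torch.Tensor, feat_next: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_pred(torch.cat([feat_prev, feat_next], dim=1))
        off_prev, off_next = offsets.chunk(2, dim=1)                  # one offset field per neighbour
        aligned_prev = self.dcn_prev(feat_prev, off_prev)
        aligned_next = self.dcn_next(feat_next, off_next)
        return self.fuse(torch.cat([aligned_prev, aligned_next], dim=1))

# mid_feat = FeatureInterpolation()(torch.rand(1, 64, 32, 56), torch.rand(1, 64, 32, 56))
```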
Super-resolution sub-network: a pixel shuffle operation (Pixel Shuffle) is adopted to raise the spatial resolution of the $N_{out}$ frame feature maps generated by the first two sub-networks; for frame times where a low-resolution original feature map is available, an additional residual block structure is introduced, and the restored video $\hat{V}$ is finally output.
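A hedged sketch of such an upsampling head, with the channel count, the single residual block and $K_s = 4$ chosen for illustration:

```python
# Pixel-shuffle upsampling preceded by residual refinement; not the exact
# sub-network of the patent, only its general shape.
import torch
import torch.nn as nn

class UpsampleHead(nn.Module):
    def __init__(self, channels: int = 64, k_s: int = 4, out_ch: int = 3):
        super().__init__()
        self.res_block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.expand = nn.Conv2d(channels, channels * k_s * k_s, 3, padding=1)
        self.shuffle = nn.PixelShuffle(k_s)        # (C*K_s^2, H, W) -> (C, K_s*H, K_s*W)
        self.to_rgb = nn.Conv2d(channels, out_ch, 3, padding=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        feat = feat + self.res_block(feat)         # residual refinement before upsampling
        return self.to_rgb(self.shuffle(self.expand(feat)))

# hq_frame = UpsampleHead()(torch.rand(1, 64, 64, 112))   # -> (1, 3, 256, 448)
```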
The SRFI network is trained on MFQEv2 data sets, which contain a total of 160 lossless video sequences. The resolutions of these video sequences are 2K (2048 × 1080), 1080p (1920 × 1080), 360p (640 × 360), CIF (352 × 288), and the like. At test time, the HEVC standard test data set, which is widely used to evaluate video compression related tasks, is used, and consists of 16 video sequences of different content and resolution.
The SRFI network uses peak signal-to-noise ratio (PSNR) and bits per pixel (BPP) to evaluate overall performance. The two metrics are calculated on the video frames output by networks trained on H.265-compressed videos of different dataset types and different Constant Rate Factors (CRF); the two video coding evaluation metrics BD-Rate and BD-PSNR are also calculated, so that the SRFI network can be compared with other schemes.
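A minimal sketch of how these two quantities can be computed; it assumes 8-bit frames and takes the bitstream size directly from the encoded file, and the function names are illustrative:

```python
# PSNR between a reference frame and a restored frame, and BPP of a bitstream.
import os
import numpy as np

def psnr(ref: np.ndarray, rec: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def bits_per_pixel(bitstream_path: str, num_frames: int, height: int, width: int) -> float:
    # Total transmitted bits divided by the number of pixels in the source video.
    return os.path.getsize(bitstream_path) * 8.0 / (num_frames * height * width)
```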
The SRFI network is implemented with the PyTorch framework, and the bitstream is generated with the FFmpeg and libx265 toolkits. During training, the network uses a pre-trained SpyNet model to initialize the optical flow estimator. When preprocessing the data, patches of size 128 × 128 are cropped from the source video and the corresponding low-quality video as training sample pairs, and data augmentation is performed through random horizontal flipping and 90° rotation. The learning rates of the optical flow estimator and of the other network parts are initially set to $2.5 \times 10^{-5}$ and $2 \times 10^{-4}$, respectively. An Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$ performs the parameter update according to the standard Adam rule:

$$\theta_{k+1} = \theta_k - \eta \, \hat{m}_k / (\sqrt{\hat{v}_k} + \epsilon)$$

where $\eta$ is the learning rate, $\theta_k$ is the weight parameter at the $k$-th iteration, and $\hat{m}_k$, $\hat{v}_k$ are Adam's bias-corrected first- and second-moment estimates of the gradient.
The learning rate is varied periodically over training with a cosine annealing schedule; the learning rate at the $k$-th iteration is

$$\eta_k = \frac{\eta_0}{2}\left(1 + \cos\frac{k\pi}{T}\right)$$

where $\eta_0$ is the initial learning rate (set to $2 \times 10^{-4}$) and the total number of iterations $T$ is 300000.
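A hedged sketch of this optimizer and scheduler setup; `srfi_net` and `flow_net` are placeholder names for the restoration network and the SpyNet flow estimator, and the parameter grouping is an assumption:

```python
# Adam with two learning-rate groups plus cosine annealing over T iterations.
import torch

def build_optimizer(srfi_net: torch.nn.Module, flow_net: torch.nn.Module, total_iters: int = 300000):
    optimizer = torch.optim.Adam(
        [
            {"params": flow_net.parameters(), "lr": 2.5e-5},   # optical-flow estimator
            {"params": srfi_net.parameters(), "lr": 2e-4},     # remaining SRFI parameters
        ],
        betas=(0.9, 0.99),
    )
    # Cosine annealing: eta_k = eta_0 / 2 * (1 + cos(k * pi / T)).
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters, eta_min=0.0)
    return optimizer, scheduler
```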
In training and evaluation, four different Constant Rate Factors (CRFs), 20, 25, 30, 35, were selected in total, and a separate model was trained for each CRF value for the different compression settings.
The invention has the following beneficial effects:
1) A novel video compression method is proposed that combines the latest advances in the spatio-temporal super-resolution task to improve the performance of traditional video codecs. This is the first attempt to apply deep neural networks for the spatio-temporal super-resolution task to video compression.
2) An efficient spatio-temporal super-resolution network (SRFI) is proposed to improve video decoding. The SRFI network can automatically capture deep motion features in compressed video sequences of low resolution and low frame rate, and efficiently completes the tasks of feature alignment and warping, frame feature interpolation, up-sampling super-resolution and quality enhancement. Comparisons with existing methods demonstrate the advantages of the network model in both speed and accuracy.
3) Experimental results show that the proposed STSR-VC (Space-Time Super-Resolution Video Compression) video compression and restoration method combines the high efficiency of a conventional codec (the computational overhead at the encoder side is close to zero) with the strong robustness of learnable DNNs (superior to the conventional H.265 codec and other existing DNN-based codecs), achieves better bit-transmission efficiency, and is better suited to low-bit-rate transmission scenarios. The method is therefore of high value in helping the Internet content distribution industry save transmission bandwidth and improve efficiency.
Drawings
FIG. 1 is a flow chart of an implementation of a method for efficient spatio-temporal super-resolution video compression restoration based on deep learning according to an embodiment of the present invention;
FIG. 2 is a block diagram of the SRFI spatio-temporal super-resolution network at the decoding stage in one embodiment of the present invention;
FIG. 3 is a diagram comparing the structure of the SRFI network with an existing FISR spatio-temporal super-resolution network according to one embodiment of the present invention;
FIG. 4 is a BPP-PSNR comparison graph of the spatio-temporal super-resolution compression restoration task performed by the SRFI according to one embodiment of the present invention and other existing methods;
FIG. 5 is a graph comparing the results of the spatio-temporal super-resolution compression restoration task performed by the SRFI according to one embodiment of the present invention with other existing methods.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Fig. 1 shows the implementation flow of the efficient space-time super-resolution video compression restoration method based on deep learning according to the present invention. The method follows the video distribution process widely adopted at present and comprises an encoding (Encoder) stage and a decoding (Decoder) stage.
The task of the encoding (Encoder) stage is to encode and compress a given source video $V$ (N video frames in total, each with height H, width W and C channels), so that the bit overhead during transmission is reduced as much as possible. Applying the coding algorithm of a conventional compression standard to the source video is a straightforward solution, and these codecs use a series of efficient hand-crafted modules to reduce redundancy in the video. Considering the recent progress of deep learning on video restoration tasks, down-sampling operations in the spatial and temporal dimensions can be performed first, before the bitstream is generated, so as to further reduce the number of transmitted bits without affecting quality; the down-sampled video is then restored by a spatio-temporal super-resolution operation in the decoding stage.
The encoding stage first performs a frame extraction operation on the source video $V$ in the temporal domain. A temporal frame-extraction coefficient $K_t$ is set, meaning that only every $K_t$-th frame in the video is retained; for example, $K_t = 2$ means that half of the frames are deleted, yielding the temporally down-sampled video $V_t$.
Then, using a bicubic interpolation down-sampling operation, the height and width of each frame of $V_t$ are reduced by a factor of $K_s$ (the spatial down-sampling coefficient) to lower the spatial resolution. Finally, the spatio-temporally down-sampled video $V_{ts}$ is compressed using an existing conventional video coding algorithm, further reducing the size of the compressed data and generating a bitstream. In a practical system, this step may use an off-the-shelf conventional video compression coding algorithm such as H.264 or H.265. Since deep learning is not involved in the encoding process and no time-consuming model inference needs to be performed, the encoding stage introduces only negligible computational overhead compared to using a conventional coding standard algorithm alone.
The task of the decoding stage is that, after receiving the video bitstream from the encoding stage, this module needs to generate the final restored video $\hat{V}$, whose visual quality should be as close as possible to that of the source video $V$, with the same picture size and video frame rate. The decoding stage first decompresses the bitstream using a decoding algorithm of the same standard as the encoding stage to obtain the low-quality (LQ) video $V_{LQ}$.
This video is down-sampled in both resolution and frame rate and contains various artifacts introduced by the H.265 compression algorithm. To recover $\hat{V}$, three restoration tasks are required: video frame interpolation, video super-resolution and video quality enhancement. However, considering the internal dependencies of these three tasks, their restoration effects can benefit each other through a joint optimization modeling process. Specifically, in the decoding stage, a spatio-temporal super-resolution network SRFI with a novel structural design is constructed, which simultaneously completes the spatio-temporal super-resolution quality enhancement task for the low-quality video $V_{LQ}$: the video is up-sampled in both the temporal and spatial dimensions (including temporal video frame interpolation and spatial video super-resolution) and compression artifacts are removed, yielding the restored high-quality (HQ) video $\hat{V}$.
As shown in FIG. 2, the SRFI network of the decoding (Decoder) module takes $N_{in}$ low-quality frames as input and recovers $N_{out}$ high-quality frames; in its implementation it can be divided into three processing sub-networks: "propagation", "frame interpolation" and "super-resolution".
The network architecture shown in FIG. 2(b) is a schematic diagram of a single forward-propagation segment module in the "propagation" sub-network of the SRFI network. To exploit the potential complementary information across different temporal and spatial positions in a video segment, a forward-backward propagation recurrent neural network structure is first applied to the input frames to extract frame feature maps that incorporate global information. This bidirectional network consists of a Forward Propagation (FP) sub-network and a Backward Propagation (BP) sub-network, which are structurally identical except for the order in which image frames are fed at the input. Inside each sub-network, the input video frame sequence is first fed into the SpyNet network to predict the optical flow of inter-frame motion; the image frame at each moment is then warped and aligned with the corresponding predicted optical flow map, and two groups of residual feature blocks fuse the hidden features to obtain the output feature map representation of the sub-network at that moment. The outputs of the forward and backward sub-networks are concatenated along the channel dimension and fused with a 1 × 1 convolutional layer, finally yielding the fused image feature map at each moment of the sub-network output.
The network structure shown in FIG. 2(c) is the "frame interpolation" sub-network of the SRFI network: the fused image feature maps obtained in the previous step are taken as the input of the feature interpolation sub-network, and this network synthesizes the frame feature maps lost during compression and frame extraction. In this sub-network, pyramid-structured deformable convolution operations are employed to effectively capture motion cues between frames. Deformable convolution is an improvement and generalization of conventional convolution, with the ability to better capture a wide range of motion.
The 3 × 3 convolution sampling grid K is defined as K = {(-1, -1), (-1, 0), …, (0, 1), (1, 1)}; the conventional convolution F(·) is computed as

$$F(p) = \sum_{k \in K} w(k) \cdot f(p + k)$$

where $k$ is a position in K, $w(\cdot)$ is the convolution weight, $f(\cdot)$ is the input image feature, and $p$ is the base position;
Deformable convolution $F_{\mathrm{deform}}$ adds an additional two-dimensional position offset to each convolution position, increasing the motion-capture capability and the robustness of the network:

$$F_{\mathrm{deform}}(p) = \sum_{k \in K} w(k) \cdot f(p + k + \Delta k)$$

where, compared with conventional convolution, the added parameter $\Delta k$ is an offset field pre-learned by the network; the final deformable convolution operation is realized through offset addition and bilinear interpolation.
The features from the two adjacent frames are first concatenated along the channel dimension, and conventional convolution layers are used to generate a learnable offset field for two deformable convolutions. The two adjacent frame feature maps are then each passed through a deformable convolution using the corresponding offsets, producing a synthesized feature map for the intermediate frame. Finally, a 1 × 1 convolution performs the final fusion to obtain the synthesized feature result.
The "super-divided" sub-network: pixel Shuffle operation (Pixel Shuffle) is adopted to improve the N generated by the first two sub-networks out The spatial resolution of the frame feature map, for the frame time with available low-resolution original feature map, additionally introduces a residual block structure, and finally outputs
Figure BDA0003899464550000073
As shown in FIG. 3, the structure of the SRFI (Super-Resolution Frame-Interpolation) network is compared with that of the conventional frame-interpolation-then-super-resolution (FI-SR) architecture. Previously existing STSR networks tend to follow the "FI-SR" architecture: given n input frames, frame feature interpolation is first performed to generate (n-1) new feature representations (assuming $K_t = 2$), and then all (2n-1) feature maps are fed into the super-resolution network, which involves a number of computationally intensive operations such as optical flow estimation, motion alignment and feature propagation. Although this architecture is intuitive, it has two major limitations:
1. Since the newly generated (n-1) feature maps are derived from neighbouring frames, they contain no new beneficial information; feeding them to the super-resolution sub-network may not benefit overall performance.
2. The FI-SR architecture has an inherent complexity: the number of input feature maps for the super-resolution network part is (2n-1), which significantly increases the computational overhead of the whole module and lowers the inference speed. Meanwhile, the upper limit on the number of input frames in the decoding stage is also restricted by excessive memory consumption.
To address these two limitations, a simple and efficient "SR-FI" architecture can be adopted, which first feeds the video frames to the propagation network. Feature interpolation is then performed on the intermediate features, which contain global and high-resolution information, to obtain the (n-1) new features. Here, the expensive operations of the super-resolution network are performed on only n frames, saving nearly half of the computation compared with feeding (2n-1) frames as in the "FI-SR" architecture; this not only increases the inference speed of the network, but also allows the overall method to take more video frames as input at a time. Note that there is no separate sub-network for quality enhancement in the SRFI network, because the SRFI network automatically learns to remove compression artifacts through end-to-end training, thereby also completing the quality enhancement task.
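The ordering described above can be summarized with the following structural sketch, which is an assumption about the overall wiring rather than the patented implementation: the costly propagation runs on the n decoded frames only, feature interpolation synthesizes the (n-1) in-between features, and the lightweight pixel-shuffle head then upsamples all (2n-1) feature maps.

```python
# "SR-FI" ordering: propagate -> interpolate -> upsample; the three sub-modules
# are placeholders (e.g. the illustrative classes sketched earlier).
import torch
import torch.nn as nn

class SRFISkeleton(nn.Module):
    def __init__(self, propagate: nn.Module, interpolate: nn.Module, upsample: nn.Module):
        super().__init__()
        self.propagate, self.interpolate, self.upsample = propagate, interpolate, upsample

    def forward(self, frames: list[torch.Tensor]) -> list[torch.Tensor]:
        feats = self.propagate(frames)                     # n fused feature maps (heavy work on n frames only)
        mids = [self.interpolate(feats[i], feats[i + 1])   # (n - 1) synthesized in-between features
                for i in range(len(feats) - 1)]
        # Interleave original and synthesized features -> (2n - 1) high-frame-rate outputs.
        merged = [f for pair in zip(feats, mids) for f in pair] + [feats[-1]]
        return [self.upsample(f) for f in merged]          # cheap pixel-shuffle upsampling per feature
```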
During training of the proposed SRFI network, to generate input samples, the complete encoding-stage process is first executed to generate the bitstream, and the bitstream is then decoded into low-quality video frames using a decoding algorithm of the same standard as the encoding stage. When updating the network parameters by back-propagation, a Charbonnier penalty between the reconstructed video frames $\hat{x}$ and the corresponding source video frames $x$ is used as the loss function:

$$\mathcal{L}(\hat{x}, x) = \sqrt{\lVert \hat{x} - x \rVert^2 + \epsilon^2}$$

where the value of $\epsilon$ is set to $1 \times 10^{-3}$ during training.
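A minimal sketch of this Charbonnier penalty with $\epsilon = 1 \times 10^{-3}$ as stated above; the function name is illustrative:

```python
# Charbonnier loss: a smooth, robust variant of the L1 penalty.
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```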
The metrics of the SRFI network are compared with those of other deep-learning video compression restoration methods; the BD-Rate/BD-PSNR quantitative results and the BPP-PSNR curves of the different methods on the HEVC dataset are shown in Table 1 and FIG. 4, respectively. Taking the conventional H.265 codec scheme as the baseline, the disclosed method, which combines the H.265 scheme with the deep-learning-based SRFI network, achieves a 10.14% BD-Rate reduction and a 0.48 dB BD-PSNR gain over the H.265 baseline, showing that the proposed spatio-temporal super-resolution video compression restoration method attains higher coding performance than the H.265 scheme. SRFI achieves results comparable to the baseline H.265 codec in high-bit-rate scenarios and outperforms all schemes of the same type in low-bit-rate scenarios. Furthermore, the overall scheme generates only negligible computational overhead at the encoding end and can run at 70 frames per second when decoding a 352 × 288 test video, an approximately 2× speed-up compared with non-patent document 4 (Lu et al., "DVC: An End-to-End Deep Video Compression Framework", CVPR 2019).
TABLE 1: comparison of BD-Rate (%)/BD-PSNR (dB) metrics of the SRFI according to one embodiment of the present invention and other existing methods on the spatio-temporal super-resolution compression restoration task
Fig. 5 shows a detail comparison of restored video frames on the HEVC dataset. It can be seen that video frames compressed using H.265 alone still suffer severe distortion from various artifacts after restoration (e.g., blurring in the first row of images and ringing in the third row). At a similar BPP (bits per pixel), although other existing deep-learning video compression methods can reduce artifacts to some extent, the generated frames still show excessive blurring and loss of detail. Compared with these existing methods, the SRFI network recovers more accurate details: for example, in the first row of images the texture of the vent can be clearly identified without the obvious blurring around it seen in the results of other methods, and in the second row the result of the SRFI network is the closest to the ground truth.
One skilled in the art can, using the teachings of the present invention, readily make various changes and modifications to the invention without departing from the spirit and scope of the invention as defined by the appended claims. Any modifications and equivalent variations of the above-described embodiments, which are made in accordance with the technical spirit and substance of the present invention, fall within the scope of protection of the present invention as defined in the claims.

Claims (7)

1. A high-efficiency space-time super-resolution video compression restoration method based on deep learning is characterized by comprising the following steps:
step 1, in the encoding stage, encoding and compressing a given source video, including:
step 1.1, frame extraction is carried out on a source video;
step 1.2, down-sampling a source video;
step 1.3, compressing the source video;
step 2, a decoding stage, which generates a final restored video after receiving the video code stream from the encoding stage, and comprises the following steps:
step 2.1, decompressing the video processed in step 1;
and 2.2, performing space-time super-resolution restoration on the decompressed video.
2. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 1, wherein the frame extraction in step 1.1 comprises:
for a given source video $V$, where N, H, W and C respectively represent the number of frames, the frame height, the frame width and the number of channels of the input source video, a temporal frame-extraction coefficient $K_t$ is set, meaning that only every $K_t$-th frame in the video is retained, obtaining the temporally down-sampled video $V_t$.
3. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 2, wherein the down-sampling operation of step 1.2 comprises:
using a bicubic interpolation down-sampling operation, the height and width of each frame of $V_t$ are reduced by a factor of $K_s$, where $K_s$ is the spatial down-sampling coefficient, so as to lower the spatial resolution and obtain the spatially down-sampled video $V_{ts}$.
4. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 3, wherein the compression operation of step 1.3 comprises:
the spatio-temporally down-sampled video $V_{ts}$ is compressed using an existing conventional video coding algorithm, further reducing the size of the compressed data and generating a bitstream.
5. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 4, wherein the decompression operation of step 2.1 comprises:
the video bitstream is decompressed using a decoding algorithm of the same standard as the encoding stage to obtain the low-quality video $V_{LQ}$.
6. The method for compressing and restoring high-efficiency spatio-temporal super-resolution video based on deep learning of claim 5, wherein the spatio-temporal super-resolution restoration operation of step 2.2 comprises:
a Super-Resolution Frame-Interpolation (SRFI) spatio-temporal super-resolution network is designed, and the spatio-temporal super-resolution quality enhancement task for the low-quality video $V_{LQ}$ is completed through the SRFI network: the low-quality video $V_{LQ}$ is up-sampled in both the temporal and spatial dimensions, including temporal video frame interpolation and spatial video super-resolution, and compression artifacts are removed to obtain the restored high-quality HQ video $\hat{V}$; the spatio-temporal super-resolution network SRFI takes $N_{in}$ low-quality frames as input and recovers $N_{out}$ high-quality frames.
7. The method for compressing and restoring high-efficiency space-time super-resolution video based on deep learning of claim 6, wherein the space-time super-resolution network SRFI in step 2.2 comprises a propagation sub-network, a frame interpolation sub-network and a super-resolution sub-network;
a propagation sub-network, for applying a forward-backward propagation recurrent neural network structure to the input frames to extract frame feature maps combined with global information; the bidirectional network is composed of a forward propagation sub-network and a backward propagation sub-network, which have the same structure and differ only in the order in which image frames are fed at the input; inside each sub-network, the input video frame sequence is first fed into an optical flow predictor network to predict the optical flow of inter-frame motion, then the image frame at each moment is warped and aligned with the corresponding predicted optical flow map, and two groups of residual feature blocks fuse the hidden features to obtain the output feature map representation of the sub-network at that moment; the outputs of the forward and backward sub-networks are concatenated along the channel dimension and fused using a 1 × 1 convolutional layer, finally obtaining the fused image feature map at each moment of the sub-network output;
the frame interpolation sub-network takes the fused image feature map at each moment obtained by the propagation sub-network as the input of the feature interpolation sub-network, and synthesizes and outputs the frame feature maps lost during compression and frame extraction; in this sub-network, pyramid-structured deformable convolution operations are employed to capture motion cues between frames;
the super-resolution sub-network uses a pixel shuffle operation to raise the spatial resolution of the $N_{out}$ frame feature maps generated by the propagation and frame interpolation sub-networks; for frame times where a low-resolution original feature map is available, an additional residual block structure is introduced, and the high-quality HQ video $\hat{V}$ is finally output.
CN202211285099.5A 2022-10-20 2022-10-20 Efficient space-time super-resolution video compression restoration method based on deep learning Pending CN115689917A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211285099.5A CN115689917A (en) 2022-10-20 2022-10-20 Efficient space-time super-resolution video compression restoration method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211285099.5A CN115689917A (en) 2022-10-20 2022-10-20 Efficient space-time super-resolution video compression restoration method based on deep learning

Publications (1)

Publication Number Publication Date
CN115689917A true CN115689917A (en) 2023-02-03

Family

ID=85066448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211285099.5A Pending CN115689917A (en) 2022-10-20 2022-10-20 Efficient space-time super-resolution video compression restoration method based on deep learning

Country Status (1)

Country Link
CN (1) CN115689917A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523758A (en) * 2023-07-03 2023-08-01 清华大学 End cloud combined super-resolution video reconstruction method and system based on key frames
CN116781910A (en) * 2023-07-03 2023-09-19 江苏汇智达信息科技有限公司 Information conversion system based on neural network algorithm
CN116523758B (en) * 2023-07-03 2023-09-19 清华大学 End cloud combined super-resolution video reconstruction method and system based on key frames
CN116634209A (en) * 2023-07-24 2023-08-22 武汉能钠智能装备技术股份有限公司 Breakpoint video recovery system and method based on hot plug
CN116634209B (en) * 2023-07-24 2023-11-17 武汉能钠智能装备技术股份有限公司 Breakpoint video recovery system and method based on hot plug
CN117376583A (en) * 2023-09-18 2024-01-09 南通大学 Traceable frame rate conversion model construction method for high-frame rate video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination