CN113592746A - Method for enhancing quality of compressed video by fusing space-time information from coarse to fine - Google Patents


Info

Publication number
CN113592746A
Authority
CN
China
Prior art keywords
fusion
stage
quality
residual
coarse
Prior art date
Legal status
Granted
Application number
CN202111067216.6A
Other languages
Chinese (zh)
Other versions
CN113592746B (en)
Inventor
叶茂 (Ye Mao)
罗登晏 (Luo Dengyan)
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Publication of CN113592746A
Application granted
Publication of CN113592746B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS; G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T 9/00: Image coding; G06T 9/002: Image coding using neural networks
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20221: Image fusion; image merging
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N 3/00: Computing arrangements based on biological models; G06N 3/02: Neural networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a method for enhancing the quality of compressed video by fusing spatio-temporal information from coarse to fine, applied to the field of video processing and aimed at the problem that compressed video in the prior art inevitably contains compression artifacts that seriously degrade subjective experience and objective quality. The invention uses a multi-frame quality enhancement network that predicts alignment offsets without explicit optical flow estimation; by fusing inter-frame information from coarse to fine, the alignment offsets are predicted more accurately, so that the temporal information between adjacent frames is fully exploited to improve the quality of the compressed video and to enhance it both subjectively and objectively.

Description

Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
Technical Field
The invention belongs to the field of video processing, and particularly relates to a technology for enhancing the quality of a compressed video.
Background
Since the international video compression standards were first proposed, methods for enhancing the quality of compressed video have been widely studied in both industry and academia. Before the advent of deep learning, such methods were mainly based on mathematically derived spatial-domain and frequency-domain techniques that enhance single frames. After deep learning was successfully applied to the field of image enhancement, a variety of new methods for enhancing the quality of compressed video were proposed, achieving better results and stronger generalization than the conventional methods.
The most widely used standard today, H.265/HEVC, adopts a block-based hybrid coding framework whose core processes include predictive coding, transform coding, quantization and entropy coding, with block-based prediction. The transform and quantization operations ignore the correlation between blocks, so the coded and reconstructed image exhibits blocking artifacts, i.e., the human eye perceives obvious discontinuities at block boundaries (these effects are more pronounced when the quantization step size is larger and the bit rate is lower). At the same time, quantization is performed block by block in the transform domain, and this quantization process is irreversible. In addition, the high-precision interpolation used in motion compensation is prone to ringing. Because errors accumulate during inter-frame coding, these effects also degrade the coding quality of subsequent frames, reducing both the objective evaluation quality of the video images and the perceived visual quality.
The invention "A method for enhancing image or video quality based on convolutional neural networks" by Xu Mai, Yang Bai and Wang Zhilin of Beihang University (Beijing University of Aeronautics and Astronautics) was filed with the China National Intellectual Property Administration on 26 September 2017 and granted; it was published on 15 December 2017 with publication number CN107481209A. Two convolutional neural networks with different computational complexity are first designed for enhancing video (or image) quality; several training images or videos are then selected to train the parameters of the two networks; a convolutional neural network of suitable computational complexity is selected according to actual needs, and the image or video whose quality is to be enhanced is input into the selected network; finally, the network outputs the quality-enhanced image or video. The invention can effectively enhance video quality, and the user can select a convolutional neural network of suitable computational complexity according to the computing capability or remaining capacity of the device.
That patent designs two convolutional neural networks of different complexity, and the user selects one according to the condition of the device; however, the two networks differ only in depth, and simply deepening the network is not a feasible way to improve the quality enhancement effect. Moreover, the networks are not designed around the characteristics of video, i.e., they cannot exploit the temporal correlation between video frames, so the quality enhancement effect of this method is limited.
The invention "A video quality enhancement method based on adaptive separable convolution", filed by Gao Qiquan et al. of Fujian Dishi Information Technology Co., Ltd., was filed with the China National Intellectual Property Administration on 12 June 2018 and granted; it was published on 27 November 2018 with publication number CN108900848A.
The video quality enhancement method based on adaptive separable convolution uses adaptive separable convolution as the first module of the network model, converting each two-dimensional convolution into a pair of one-dimensional convolution kernels in the horizontal and vertical directions, so that the number of parameters is reduced from n*n to n + n. Second, the adaptively varying convolution kernels learned by the network for different inputs are used to estimate motion vectors: two consecutive frames are taken as the network input, a pair of separable two-dimensional convolution kernels is obtained for each pair of consecutive inputs, and each 2-D kernel is then expanded into four 1-D kernels; the resulting 1-D kernels change with the input, which improves the adaptivity of the network. Replacing two-dimensional convolution kernels with one-dimensional ones reduces the parameters of the trained model and makes execution efficient.
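For illustration, the parameter reduction mentioned above (an n x n kernel has n*n weights per channel pair, while a vertical/horizontal pair of 1-D kernels has only n + n) can be sketched with a generic separable 2-D convolution. This is not the code of the cited patent; the framework (PyTorch), channel count and kernel size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Approximate an n x n convolution with an (n x 1) followed by a (1 x n) kernel.

    A full n x n kernel has n*n weights per channel pair; the separable pair
    has only n + n, which is the reduction described above.
    """
    def __init__(self, channels: int, n: int):
        super().__init__()
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(n, 1), padding=(n // 2, 0))
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, n), padding=(0, n // 2))

    def forward(self, x):
        return self.horizontal(self.vertical(x))

x = torch.randn(1, 16, 64, 64)             # a dummy feature map
y = SeparableConv2d(channels=16, n=5)(x)   # same spatial size; 5 + 5 weights per tap instead of 25
print(y.shape)                             # torch.Size([1, 16, 64, 64])
```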
That scheme uses five encoding modules, four decoding modules, a separable convolution module and an image prediction module; its structure replaces the last decoding module of a conventional symmetric encoder-decoder network with the separable convolution module. Although this effectively reduces the number of model parameters, the quality enhancement effect still needs further improvement.
The invention "A multi-frame quality enhancement method and device for lossy compressed video" by Xu Mai et al. of Beihang University was filed with the China National Intellectual Property Administration on 8 February 2018 and granted; it was published on 20 July 2018 with publication number CN108307193A.
The multi-frame quality enhancement method and device for lossy compressed video comprises: for the i-th frame of the decompressed video stream, m frames related to the i-th frame are used to enhance its quality, so that the quality-enhanced i-th frame can be played; the m frames belong to the video stream, and each of the m frames shares with the i-th frame a number of identical or corresponding pixels greater than a preset threshold; m is a natural number greater than 1. In a typical application, peak-quality frames can be used to enhance the non-peak-quality frames between two peak-quality frames. The method reduces the quality fluctuation between frames during playback of the video stream and at the same time enhances the quality of each frame of the lossy compressed video.
Although this invention takes into account temporal information between neighboring frames, the designed multi-frame convolutional neural network (MF-CNN) is divided into a motion-compensated sub-network (MC-subnet) and a quality-enhanced sub-network (QE-subnet), where the motion-compensated sub-network relies heavily on optical flow estimation to compensate for motion between non-peak quality frames and peak quality frames to achieve alignment, and any error in optical flow computation introduces artifacts around image structures in aligned neighboring frames. However, accurate optical flow estimation is inherently challenging and time consuming, and thus the quality enhancement effect of the invention is still limited.
In summary, it is desirable to employ video compression techniques to significantly save coding bit rate when transmitting video over bandwidth-limited networks. However, the compressed video inevitably has compression artifacts, which seriously affect the subjective experience and objective quality.
Disclosure of Invention
In order to solve the problem of reduced subjective and objective quality caused by video compression, the invention provides a multi-frame compressed video quality enhancement network; the quality of the compressed video is improved by fusing the temporal information between adjacent frames so as to better predict the alignment offsets. At the same time, the network of the invention no longer needs explicit optical flow estimation to predict the alignment offsets, which makes network training simpler.
The technical scheme adopted by the invention is as follows: a method for enhancing quality of compressed video by merging spatio-temporal information from coarse to fine, the method being based on a network structure comprising: the system comprises a rough fusion module, a multi-stage residual fusion module, a D2D fusion module and a reconstruction module; the method comprises the steps that a low-quality compressed video frame sequence passes through a rough fusion module to obtain a rough fusion feature map, and the rough fusion feature map passes through a multi-stage residual fusion module to obtain global and local fine fusion features; and jointly predicting all deformable offsets for alignment according to the global and local fine fusion features, obtaining an aligned fusion feature map by the D2D fusion module according to the deformable offsets, and obtaining a reconstruction result after the aligned fusion feature map passes through the reconstruction module.
The coarse fusion module comprises a plurality of C3D with activation functions and a bottleneck convolution; the plurality of C3D with activation functions are used for extracting spatio-temporal information from the sequence of low-quality compressed video frames, and the bottleneck convolution fuses the extracted spatio-temporal information along the time dimension to obtain the coarse fused feature map.
The multi-stage residual fusion module comprises three parallel stages, namely an L1 stage, an L2 stage and an L3 stage, wherein the L1 stage comprises a residual block; the L2 stage comprises a downsampling block, a plurality of residual blocks and an upsampling block; and the L3 stage comprises two downsampling blocks, a plurality of residual blocks and two upsampling blocks;
the first input of the residual block in the L1 stage is the coarse fused feature map, and its second input is the output of the first residual block in the L2 stage; the output of this residual block is the output result of the L1 stage;
the coarse fused feature map in the L2 stage is subjected to downsampling and serves as a first input of a first residual block, the second input of the first residual block is the output of the first residual block in the L3 stage, the output of the last residual block serves as the input of an upsampling block, and the output of the upsampling block serves as the output result of the L2 stage;
the coarse fused feature map in the L3 stage is processed by two downsampling blocks and then input into the first residual block, and the output of the last residual block, after passing through two upsampling blocks, serves as the output result of the L3 stage;
the method also comprises a convolution block, wherein the output result of the L1 level, the output result of the L2 level and the output result of the L3 level are added and then input into the convolution block, and global and local fine fusion features are extracted.
The D2D fusion module obtains the aligned fusion feature map by means of a modulated deformable convolution.
The reconstruction module is implemented as follows: the fusion feature map aligned by the D2D fusion module is input into the reconstruction module to obtain an enhanced residual $\hat{X}_t$'s residual $\hat{R}_t$; the enhanced residual $\hat{R}_t$ is added element by element to the current frame $X_t$ to obtain the reconstructed frame $\hat{X}_t$:
$\hat{X}_t = X_t + \hat{R}_t$
The network structure is trained in an end-to-end manner.
The loss function used for training is:
$\mathcal{L} = \left\| X_t^{raw} - \hat{X}_t \right\|_2^2$
where $X_t^{raw}$ denotes the original frame, $\hat{X}_t$ denotes the reconstruction result of the current iteration, and $\|\cdot\|_2$ denotes the 2-norm.
The beneficial effects of the invention are as follows: by fusing inter-frame information from coarse to fine, the invention can better predict the alignment offsets used to enhance the current low-quality frame, so that the subjective and objective quality of the compressed video is significantly improved; in addition, the network of the invention does not need explicit optical flow estimation to predict the alignment offsets, which makes network training simpler.
Drawings
Fig. 1 is a diagram of a quality enhancement network architecture according to the present invention;
FIG. 2 is an architecture diagram of the multi-stage residual fusion module of the present invention;
FIG. 3 is an architecture diagram of a reconstruction module of the present invention;
Fig. 4 is a subjective quality comparison plot for the sequences BasketballPass, RaceHorses, and PartyScene at QP = 37.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the following technical terms are first explained:
H.265/HEVC: the method is a new video coding standard established after H.264, reserves some technologies of the original H.264 coding standard, and improves some technologies. The new technology is used for improving the relation among code stream, coding quality, time delay and algorithm complexity so as to achieve the optimal setting.
GOP, Group of pictures: refers to the distance between two I frames.
I frame, Intra-coded picture (intra-coded image frame): coded using only the information of the current frame, without reference to other image frames.
P frame, Predictive-coded picture frame: inter-frame predictive coding performed with reference to the previous I frame or P frame by means of motion prediction.
Low Delay P (LDP): only the first frame is I-frame encoded and the others are P-frame encoded.
Peak Signal to Noise Ratio (PSNR): peak signal-to-noise ratio, an objective criterion for evaluating images.
Structural Similarity (SSIM): the structural similarity is a full-reference image quality evaluation index, and measures the image similarity from three aspects of brightness, contrast and structure.
Ringing effect: for strong edges in an image, quantization distortion of the high-frequency AC coefficients can produce ripple-like artifacts around the edges after decoding; this kind of distortion is called the ringing effect.
PQF: the peak quality frame, i.e., the high quality frame in the GOP, may also be considered an I-frame in the GOP.
non-PQF: non-peak quality frames, i.e., low quality frames in a GOP, may also be considered P-frames in the GOP.
Deformable 2D Convolution (D2D): a deformable 2D convolution.
3D Convolution (C3D): a 3D convolution.
Rectified Linear Unit (ReLU): an activation function that increases the nonlinearity between the layers of a neural network.
The invention is described in detail below with reference to the accompanying drawings:
The quality enhancement network proposed by the invention is shown in Fig. 1 and comprises four parts: a Coarse Fusion Module (CFModule), a Multi-Level Residual Fusion module (MLRF), a D2D Fusion module (D2DF) and a Reconstruction Module (REModule). Given a sequence of 2N+1 low-quality compressed video frames $\{X_{t-N}, \ldots, X_t, \ldots, X_{t+N}\}$, the frame $X_t$ is the reference frame and the other frames are its neighboring frames. The object of the invention is to infer a high-quality frame $\hat{X}_t$ from $X_t$, the compressed version of the original frame $X_t^{raw}$.
First, the input sequence $\{X_{t-N}, \ldots, X_{t+N}\}$ is coarsely fused by the CFModule, which consists of p 3D convolutions (C3D), to obtain the coarse fused feature map $F_c$. Then the MLRF module generates global and local fine fusion features $F_L$ from different levels.
From the coarse-to-fine fusion features $F_L$ generated above, all deformable offsets used for alignment are predicted jointly, rather than one offset at a time as in optical flow estimation.
The fusion feature map $F_f$ aligned by D2DF is then input to the REModule, which consists of several dense blocks, to obtain the enhanced residual $\hat{R}_t$. Finally, the enhanced residual $\hat{R}_t$ is added element by element to the current frame $X_t$ to obtain the reconstructed frame $\hat{X}_t$:
$\hat{X}_t = X_t + \hat{R}_t$
$I_{TN}$ in Fig. 1 denotes the features obtained by splicing the input 2N+1 frames along the channel dimension.
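For illustration only, the data flow just described (coarse fusion, multi-level residual fusion, D2D alignment, reconstruction, then a residual addition) can be sketched as a short PyTorch-style wrapper. This is not the code of the invention: the framework, the module interfaces and all names are assumptions, and the four sub-modules are assumed to be implemented as sketched in the following sections (single-channel luma frames are assumed for simplicity).

```python
import torch
import torch.nn as nn

class CoarseToFineQE(nn.Module):
    """Top-level wrapper: enhance the reference frame of a 2N+1 frame window."""
    def __init__(self, cf_module, mlrf_module, d2df_module, re_module):
        super().__init__()
        self.cf = cf_module      # coarse fusion (C3D + bottleneck)
        self.mlrf = mlrf_module  # multi-level residual fusion
        self.d2df = d2df_module  # deformable (D2D) fusion / alignment
        self.re = re_module      # reconstruction (dense blocks)

    def forward(self, frames):                  # frames: (B, T=2N+1, C, H, W)
        b, t, c, h, w = frames.shape
        i_tn = frames.reshape(b, t * c, h, w)   # frames spliced along the channel axis (I_TN)
        f_c = self.cf(frames)                   # coarse fused feature map F_c
        f_l = self.mlrf(f_c)                    # global/local fine fusion features F_L
        f_f = self.d2df(i_tn, f_l)              # aligned fusion feature map F_f
        residual = self.re(f_f)                 # enhanced residual R_t
        x_t = frames[:, t // 2]                 # reference (current) frame X_t
        return x_t + residual                   # reconstructed frame X_t + R_t
```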
The following details are provided for the four modules:
Coarse fusion module (CFModule)
The CFModule coarsely extracts and fuses the spatio-temporal information of the input sequence $\{X_{t-N}, \ldots, X_{t+N}\}$ through two C3D layers, each followed by a ReLU activation function:
$F_{3D} = O_{C3D}\big([X_{t-N}, \ldots, X_{t+N}]\big)$
where $H \times W$ denotes the size of an input frame, T = 2N+1 denotes the length of the input sequence, C denotes the number of channels, and $O_{C3D}$ denotes the two C3D operations.
The extracted features $F_{3D} \in R^{C' \times T \times H \times W}$ are then further fused along the time dimension using a 1×1 bottleneck convolution to obtain the coarse fused feature map $F_c \in R^{C'' \times H \times W}$:
$F_c = O_B(F_{3D})$
where C' denotes the number of C3D filters, C'' denotes the number of filters of the bottleneck convolution, $O_B$ denotes the bottleneck convolution operation, and R denotes the tensor space whose superscript gives the dimensionality, e.g. $R^{C' \times T \times H \times W}$ here denotes a four-dimensional tensor. Note that $F_{3D}$ is first reshaped into a three-dimensional tensor of size $(C' \cdot T) \times H \times W$ before being input to the bottleneck convolution.
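A minimal sketch of the coarse fusion module under the description above (two C3D layers each followed by ReLU, then a 1×1 bottleneck convolution applied after folding the temporal axis into the channel axis). It is a hypothetical PyTorch implementation; the channel counts C' and C'' and the kernel sizes are assumptions, not values specified by the invention.

```python
import torch
import torch.nn as nn

class CFModule(nn.Module):
    def __init__(self, n_frames, in_ch=1, c_prime=32, c_out=64):
        super().__init__()
        self.c3d = nn.Sequential(                               # two 3D convolutions with ReLU
            nn.Conv3d(in_ch, c_prime, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(c_prime, c_prime, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # 1x1 bottleneck applied after folding the T axis into channels: (C'·T) -> C''
        self.bottleneck = nn.Conv2d(c_prime * n_frames, c_out, kernel_size=1)

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        x = frames.permute(0, 2, 1, 3, 4)      # -> (B, C, T, H, W) for Conv3d
        f3d = self.c3d(x)                      # F_3D with shape (B, C', T, H, W)
        b, c, t, h, w = f3d.shape
        f3d = f3d.reshape(b, c * t, h, w)      # reshape to (C'·T) x H x W before the bottleneck
        return self.bottleneck(f3d)            # coarse fused feature map F_c
```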
Multi-level residual fusion Module (MLRF)
A schematic diagram of a multi-stage residual fusion module is shown in fig. 2, which includes three stages: l1, L2 and L3.
Stage L1 is used to extract global features of the same size as $F_c$ and to fuse them with the output features from the corresponding position of stage L2. It consists mainly of one residual block, and stage L1 can be expressed as:
$F_{L1} = F_{RB1}^{L1} = O_{RB1}^{L1}\big(F_c,\ O_{tc}(F_{RB1}^{L2})\big)$
where $F_{RB1}^{L1}$ and $F_{RB1}^{L2}$ denote the outputs of the first residual block in stages L1 and L2 respectively, $O_{RB1}^{L1}$ denotes the first residual block operation in stage L1, $O_{tc}$ denotes a transposed convolution operation, and $F_{L1}$ denotes the feature map output by stage L1.
The L2 stage consists mainly of one downsampling block, one upsampling block and several residual blocks. Unlike stage L1, the invention downsamples the input coarse fused features $F_c$ with a 3×3 convolution of stride 2 followed by a 3×3 convolution. Several residual blocks are then used to extract features and to fuse the features coming from stage L3. Finally, the extracted features are upsampled back to the size of $F_c$ by a transposed convolution and a 3×3 convolution. Stage L2 can be expressed as:
$F_{RB1}^{L2} = O_{RB1}^{L2}\big(O_{ds}(F_c),\ F_{RB1}^{L3}\big)$
$F_{L2} = O_{us}\big(O_{RB2}^{L2}(F_{RB1}^{L2})\big)$
where $O_{ds}$ and $O_{us}$ denote one downsampling and one upsampling operation respectively (both with ratio 2), $F_{RB1}^{L2}$ and $F_{RB1}^{L3}$ denote the outputs of the first residual block in stages L2 and L3 respectively, $O_{RB1}^{L2}$ and $O_{RB2}^{L2}$ denote the first and second residual block operations in stage L2 respectively, and $F_{L2}$ denotes the feature map output by stage L2. A specific implementation of the residual block is given in Fig. 2; those skilled in the art will appreciate that the residual block is a basic building block in deep learning, so its internal structure is not elaborated in this embodiment.
The L3 stage mainly uses a progressively upsampling structure to extract information from features downsampled by a factor of 4. First, two downsampling blocks with ratio 2 operate on $F_c$. The downsampled features are then passed progressively through residual blocks and upsampling blocks. Stage L3 can be expressed as:
$F_{RB1}^{L3} = O_{RB1}^{L3}\big(O_{ds2}(F_c)\big)$
$F_{L3} = O_{us2}\big(O_{RB2}^{L3}(F_{RB1}^{L3})\big)$
where $O_{ds2}$ and $O_{us2}$ denote two downsampling and two upsampling operations respectively (each with ratio 2), $F_{RB1}^{L3}$ denotes the output of the first residual block in stage L3, $O_{RB1}^{L3}$ and $O_{RB2}^{L3}$ denote the first and second residual block operations in stage L3 respectively, and $F_{L3}$ denotes the feature map output by stage L3.
In general, the more residual blocks L2 and L3 contain, the better the performance, but the model also becomes more complex and the computation larger; 2 residual blocks are used in this embodiment.
Finally, the features extracted from stages L1, L2 and L3 are added element by element and input into a 3×3 convolution to obtain the global and local fine fusion features $F_L$:
$F_L = O_C(F_{L1} + F_{L2} + F_{L3})$
where $O_C$ denotes a convolution operation.
Through the coarse-to-fine fusion strategy, the network of the invention can better predict the fusion characteristics required for generating the deformable offset.
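The three-level structure described above can be sketched as follows. This is a simplified, hypothetical PyTorch implementation: how each residual block merges its second (cross-level) input and how the cross-level features are resized are not fully specified above, so concatenation plus a 1×1 convolution and ratio-2 resampling are used as assumptions; H and W are assumed divisible by 4.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block; an optional side input is fused by a 1x1 conv on the concatenation (assumption)."""
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x, side=None):
        if side is not None:
            x = self.fuse(torch.cat([x, side], dim=1))
        return x + self.body(x)

def down(ch):   # 3x3 stride-2 convolution followed by a 3x3 convolution (ratio 2)
    return nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.Conv2d(ch, ch, 3, padding=1))

def up(ch):     # transposed convolution followed by a 3x3 convolution (ratio 2)
    return nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.Conv2d(ch, ch, 3, padding=1))

class MLRF(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.l2_down, self.l3_down1, self.l3_down2 = down(ch), down(ch), down(ch)
        self.l1_rb1 = ResBlock(ch)
        self.l2_rb1, self.l2_rb2 = ResBlock(ch), ResBlock(ch)
        self.l3_rb1, self.l3_rb2 = ResBlock(ch), ResBlock(ch)
        self.l1_up_side = up(ch)              # O_tc: brings the L2 feature up to L1 resolution
        self.l2_up_side = up(ch)              # brings the L3 feature up to L2 resolution (assumption)
        self.l2_up, self.l3_up1, self.l3_up2 = up(ch), up(ch), up(ch)
        self.out_conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, f_c):
        # L3: two downsamplings, residual blocks, two upsamplings
        l3_rb1 = self.l3_rb1(self.l3_down2(self.l3_down1(f_c)))
        f_l3 = self.l3_up2(self.l3_up1(self.l3_rb2(l3_rb1)))
        # L2: one downsampling, fuse the L3 feature, residual blocks, one upsampling
        l2_rb1 = self.l2_rb1(self.l2_down(f_c), side=self.l2_up_side(l3_rb1))
        f_l2 = self.l2_up(self.l2_rb2(l2_rb1))
        # L1: full resolution, fuse the (upsampled) L2 feature
        f_l1 = self.l1_rb1(f_c, side=self.l1_up_side(l2_rb1))
        # element-wise sum of the three levels, then a 3x3 convolution -> F_L
        return self.out_conv(f_l1 + f_l2 + f_l3)
```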
D2D fusion module (D2DF)
Let X and Y denote the input and output of a conventional convolution, respectively. For each position p on Y, one convolution operation can be described as:
$Y(p) = \sum_{k=1}^{K} w_k \cdot X(p + p_k)$
where $p_k$ denotes a sampling grid with K sampling locations and $w_k$ denotes the weight of each location. For example, K = 9 and $p_k \in \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$ define a 3×3 convolution kernel. In a modulated deformable convolution, a predicted offset and a modulation mask are added to the sampling grid, so that the convolution kernel varies spatially. Here, the invention uses the modulated deformable convolution of "Zhu X, Hu H, Lin S, et al. Deformable ConvNets v2: More deformable, better results [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 9308-9317" to obtain the aligned fusion feature map $F_f$. The modulated deformable convolution operates as follows:
$F_f(p) = \sum_{k=1}^{K} w_k \cdot I_{TN}(p + p_k + \Delta p_k) \cdot \Delta m_k$
where $\Delta p_k$ and $\Delta m_k$ denote the learnable offset and modulation mask for the k-th position, and $I_{TN}$ denotes the features obtained by splicing the input sequence $\{X_{t-N}, \ldots, X_{t+N}\}$ along the channel. The convolution thus operates on irregular positions with dynamic weights, achieving adaptive sampling of the input features $I_{TN}$. The corresponding deformable sampling parameters are learned as follows:
$\Delta P = O_C^{\Delta P}(F_L)$
$\Delta M = O_C^{\Delta M}(F_L)$
where $O_C^{\Delta P}$ denotes a conventional convolution with $(2N+1) \cdot 2K$ filters that generates the deformable offsets $\Delta P$, $O_C^{\Delta M}$ denotes a conventional convolution with $(2N+1) \cdot K$ filters that generates the modulation masks $\Delta M$, and $\Delta p_k \in \Delta P$, $\Delta m_k \in \Delta M$. Because $\Delta p_k$ may be fractional, bilinear interpolation is used, as in "Dai J, Qi H, Xiong Y, et al. Deformable convolutional networks [C]// Proceedings of the IEEE International Conference on Computer Vision. 2017: 764-773".
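A minimal sketch of the D2D fusion step, assuming torchvision's DeformConv2d (which accepts a modulation mask in recent versions) as the modulated deformable convolution; the offset and mask convolutions follow the filter counts (2N+1)·2K and (2N+1)·K given above. The channel widths and the sigmoid applied to the mask are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d   # accepts a modulation mask in recent torchvision releases

class D2DFusion(nn.Module):
    """Offsets (Delta P) and masks (Delta M) are predicted jointly from F_L, then a modulated
    deformable convolution adaptively samples the spliced input frames I_TN."""
    def __init__(self, n_frames, in_ch=1, feat_ch=64, out_ch=64, k=3):
        super().__init__()
        # (2N+1)·2K filters for the offsets, (2N+1)·K filters for the modulation masks
        self.offset_conv = nn.Conv2d(feat_ch, n_frames * 2 * k * k, 3, padding=1)
        self.mask_conv = nn.Conv2d(feat_ch, n_frames * k * k, 3, padding=1)
        self.dcn = DeformConv2d(n_frames * in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, i_tn, f_l):                   # i_tn: (B, (2N+1)·C, H, W), f_l: (B, feat_ch, H, W)
        offset = self.offset_conv(f_l)              # Delta P, one offset field per input frame
        mask = torch.sigmoid(self.mask_conv(f_l))   # Delta M, kept in (0, 1): a common choice (assumption)
        return self.dcn(i_tn, offset, mask)         # aligned fusion feature map F_f
```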
Reconstruction module (REModule)
The proposed reconstruction module is shown in Fig. 3. A 3×3 convolutional layer first extracts more useful features $F_{fu}$ from the aligned fusion feature map $F_f$:
$F_{fu} = O_C(F_f)$
Then, following "Mehta S, Kumar A, Reda F, et al. EVRNet: Efficient Video Restoration on Edge Devices [J]. arXiv preprint arXiv:2012.02228, 2020", the invention combines feature subtraction and feature summation as an operation that effectively avoids extra computational complexity:
$F_{fus} = F_{fu} + D(F_{fu} - D(F_{fu}))$
where D denotes a dense connection operation: a ReLU activation function is first applied to increase the nonlinearity of the network, three convolutional layers are then applied in sequence, and finally the convolution outputs of the different layers and the input are spliced along the channel as the output of the whole module.
Further, $F_{fus}$ is input into a dense connection block and a 3×3 convolutional layer to obtain the fusion features $F_H$ of the different layers:
$F_H = O_C([F_{fu}, D(F_{fu}), F_{fus}, D(F_{fus})])$
where $O_C$ denotes a 3×3 convolution operation. Finally, $F_H$ and $F_{fu}$ are added element by element, and the enhanced residual $\hat{R}_t$ is obtained through two convolutional layers:
$\hat{R}_t = O_{C2}(F_H + F_{fu})$
where $O_{C2}$ denotes the two convolutional layers.
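A sketch of the reconstruction module following the formulas above. The internal layout of the dense connection D(·) (how the three convolutions are wired and how the output width is kept equal to the input width so that the element-wise additions are defined) is an assumption of this sketch, as is whether the several uses of D share weights.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connection D(.): ReLU, then three convs, each seeing the concatenation of all previous
    features; the last conv maps back to the input width (an assumption of this sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.act = nn.ReLU(inplace=True)
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c2 = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.c3 = nn.Conv2d(3 * ch, ch, 3, padding=1)

    def forward(self, x):
        x0 = self.act(x)
        x1 = self.c1(x0)
        x2 = self.c2(torch.cat([x0, x1], dim=1))
        return self.c3(torch.cat([x0, x1, x2], dim=1))

class REModule(nn.Module):
    def __init__(self, ch=64, out_ch=1):
        super().__init__()
        self.head = nn.Conv2d(ch, ch, 3, padding=1)          # F_fu = Conv3x3(F_f)
        self.dense = DenseBlock(ch)                          # D(.); weight sharing across uses is an assumption
        self.fuse = nn.Conv2d(4 * ch, ch, 3, padding=1)      # F_H = Conv3x3([F_fu, D(F_fu), F_fus, D(F_fus)])
        self.tail = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, out_ch, 3, padding=1))   # two convolutional layers

    def forward(self, f_f):
        f_fu = self.head(f_f)
        f_fus = f_fu + self.dense(f_fu - self.dense(f_fu))   # feature subtraction + summation
        f_h = self.fuse(torch.cat([f_fu, self.dense(f_fu), f_fus, self.dense(f_fus)], dim=1))
        return self.tail(f_h + f_fu)                         # enhanced residual R_t
```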
Loss function of network
In the method of the invention, the coarse fusion module, the multi-stage residual fusion module, the D2D fusion module and the reconstruction module are trained jointly in an end-to-end manner (i.e., trained end to end from the original frame to the reconstruction result), and no sub-network needs to be trained to convergence first, so the loss function consists of a single term. The invention uses the L2 norm as the loss function of the network:
$\mathcal{L} = \left\| X_t^{raw} - \hat{X}_t \right\|_2^2$
where $X_t^{raw}$ denotes the original frame, $\hat{X}_t$ denotes the reconstruction result, and $\|\cdot\|_2$ denotes the 2-norm.
This embodiment evaluates the effectiveness of the method of the invention both qualitatively and quantitatively. The quantitative evaluation compares ΔPSNR and ΔSSIM with MFQE (Yang R, Xu M, Wang Z, et al. Multi-frame quality enhancement for compressed video [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6664-6673), SDTS (Meng X, Deng X, Zhu S, et al. Enhancing quality for VVC compressed videos by jointly exploiting spatial details and temporal structure [C]// 2019 IEEE International Conference on Image Processing (ICIP). 2019), MFQE2.0, MGANet (Meng X, Deng X, Zhu S, et al. MGANet: A robust model for quality enhancement of compressed video [J]. arXiv preprint arXiv:1811.09150, 2018), MGANet2.0 (Meng X, Deng X, Zhu S, et al. A Robust Quality Enhancement Method Based on Joint Spatial-Temporal Priors for Video Coding [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020), FastMSDD (Xiao W, He, Wang T, et al. [J]. IEEE Transactions on Multimedia, 2020, 22(7)) and STDF (Deng J, Wang L, Pu S, et al. Spatio-temporal deformable convolution for compressed video quality enhancement [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2020); the qualitative evaluation is compared with MFQE2.0 and STDF.
Table 1: Overall comparison of ΔPSNR (dB) and ΔSSIM (×10^-4) on the HEVC standard test sequences at five QP points
(The body of Table 1 appears as an image in the original publication.)
Quantitative evaluation: table 1 gives the average results of Δ PSNR and Δ SSIM over all frames of each test sequence. It can be seen that our method is always superior to other video quality enhancement methods. Specifically, at an input frame radius N of 1 and QP of 22, our method achieves an average Δ PSNR value of 0.707dB, 27.1% higher than STDF-N1(0.556dB), 54.4% higher than MFQE2.0(0.458dB), and 102% higher than FastMSDD (0.350 dB). As the radius N of the input frame increases to 3, our method achieves an average Δ PSNR value of 0.845dB, 30.8% higher than STDF (0.646dB), 84.5% higher than MFQE2.0(0.458dB), and 141.4% higher than FastMSDD (0.350 dB).
At the other QP points, the method of the invention likewise outperforms the other methods in both ΔPSNR and ΔSSIM. In addition, the invention compares network performance in terms of BD-rate reduction: as shown in Table 2, the method achieves an average reduction of 24.69%, exceeding the advanced STDF method (21.61%).
Table 2: Comparison of BD-rate (%) with FastMSDD [11], MFQE [8], MFQE2.0 [3] and STDF [4]
(The body of Table 2 appears as an image in the original publication.)
Qualitative evaluation: Fig. 4 shows the subjective quality of the sequences BasketballPass, RaceHorses and PartyScene at QP = 37. As can be seen from Fig. 4, compared with the MFQE2.0 and STDF methods, the method of the invention removes more compression artifacts and provides a better visual experience.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to the specifically described embodiments and examples. Various modifications and alterations will be apparent to those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the scope of the claims of the present invention.

Claims (7)

1. A method for enhancing the quality of compressed video by merging spatio-temporal information from coarse to fine, characterized in that said method is based on a network structure comprising: the system comprises a rough fusion module, a multi-stage residual fusion module, a D2D fusion module and a reconstruction module; the method comprises the steps that a low-quality compressed video frame sequence passes through a rough fusion module to obtain a rough fusion feature map, and the rough fusion feature map passes through a multi-stage residual fusion module to obtain global and local fine fusion features; and jointly predicting all deformable offsets for alignment according to the global and local fine fusion features, obtaining an aligned fusion feature map by the D2D fusion module according to the deformable offsets, and obtaining a reconstruction result after the aligned fusion feature map passes through the reconstruction module.
2. A method of enhancing the quality of compressed video by coarse-to-fine fusion of spatio-temporal information according to claim 1, characterized in that said coarse fusion module comprises a plurality of C3D with activation functions and a bottleneck convolution; the plurality of C3D with activation functions are used for extracting spatio-temporal information from the sequence of low-quality compressed video frames, and the bottleneck convolution fuses the extracted spatio-temporal information along the time dimension to obtain the coarse fused feature map.
3. The method of claim 2, wherein the multi-stage residual fusion module comprises three parallel stages, namely an L1 stage, an L2 stage and an L3 stage, wherein the L1 stage comprises a residual block; the L2 stage comprises a downsampling block, a plurality of residual blocks and an upsampling block; and the L3 stage comprises two downsampling blocks, a plurality of residual blocks and two upsampling blocks;
the first input of the residual block in the L1 stage is the coarse fused feature map, and its second input is the output of the first residual block in the L2 stage; the output of this residual block is the output result of the L1 stage;
the coarse fused feature map in the L2 stage is subjected to downsampling and serves as a first input of a first residual block, the second input of the first residual block is the output of the first residual block in the L3 stage, the output of the last residual block serves as the input of an upsampling block, and the output of the upsampling block serves as the output result of the L2 stage;
the coarse fusion characteristic diagram in the L3 level is input into the first residual block after being processed by two downsampling blocks, and the output of the last residual block after being processed by two upsampling blocks is used as the output result of the L3 level;
the method also comprises a convolution block, wherein the output result of the L1 level, the output result of the L2 level and the output result of the L3 level are added and then input into the convolution block, and global and local fine fusion features are extracted.
4. A method for enhancing the quality of compressed video by coarse-to-fine fusion of spatio-temporal information according to claim 1, characterized in that the D2D fusion module uses modulated deformable convolution to obtain the aligned fusion feature map.
5. The method for enhancing the quality of compressed video by fusing spatio-temporal information from coarse to fine according to claim 1, wherein the reconstruction module is implemented as follows: the fusion feature map aligned by the D2D fusion module is input into the reconstruction module to obtain an enhanced residual $\hat{R}_t$; the enhanced residual $\hat{R}_t$ is added element by element to the current frame $X_t$ to obtain the reconstructed frame $\hat{X}_t$:
$\hat{X}_t = X_t + \hat{R}_t$
6. The method of claim 1, wherein the network structure is trained in an end-to-end manner.
7. The method of claim 5, wherein the loss function used in training is:
$\mathcal{L} = \left\| X_t^{raw} - \hat{X}_t \right\|_2^2$
where $X_t^{raw}$ denotes the original frame, $\hat{X}_t$ denotes the reconstruction result of the current iteration, and $\|\cdot\|_2$ denotes the 2-norm.
CN202111067216.6A 2021-07-07 2021-09-13 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine Active CN113592746B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110768143.7A CN113450280A (en) 2021-07-07 2021-07-07 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN2021107681437 2021-07-07

Publications (2)

Publication Number Publication Date
CN113592746A true CN113592746A (en) 2021-11-02
CN113592746B CN113592746B (en) 2023-04-18

Family

ID=77815429

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110768143.7A Withdrawn CN113450280A (en) 2021-07-07 2021-07-07 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN202111067216.6A Active CN113592746B (en) 2021-07-07 2021-09-13 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110768143.7A Withdrawn CN113450280A (en) 2021-07-07 2021-07-07 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine

Country Status (1)

Country Link
CN (2) CN113450280A (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0739139A2 (en) * 1995-04-18 1996-10-23 Sun Microsystems, Inc. Decoder for an end-to-end scalable video delivery system
US20120013807A1 (en) * 2010-07-15 2012-01-19 Gaurav Arora Method and apparatus for fast source switching and/or automatic source switching
CN102289795A (en) * 2011-07-29 2011-12-21 上海交通大学 Method for enhancing video in spatio-temporal mode based on fusion idea
CN104539961A (en) * 2014-12-12 2015-04-22 上海交通大学 Scalable video encoding system based on hierarchical structure progressive dictionary learning
CN104616243A (en) * 2015-01-20 2015-05-13 北京大学 Effective GPU three-dimensional video fusion drawing method
CN108307193A (en) * 2018-02-08 2018-07-20 北京航空航天大学 A kind of the multiframe quality enhancement method and device of lossy compression video
CN109257600A (en) * 2018-11-28 2019-01-22 福建帝视信息科技有限公司 A kind of adaptive minimizing technology of video compression artifact based on deep learning
CN110378348A (en) * 2019-07-11 2019-10-25 北京悉见科技有限公司 Instance of video dividing method, equipment and computer readable storage medium
CN111031315A (en) * 2019-11-18 2020-04-17 复旦大学 Compressed video quality enhancement method based on attention mechanism and time dependency
CN111028150A (en) * 2019-11-28 2020-04-17 武汉大学 Rapid space-time residual attention video super-resolution reconstruction method
CN112381866A (en) * 2020-10-27 2021-02-19 天津大学 Attention mechanism-based video bit enhancement method
CN112291570A (en) * 2020-12-24 2021-01-29 浙江大学 Real-time video enhancement method based on lightweight deformable convolutional neural network
CN113066022A (en) * 2021-03-17 2021-07-02 天津大学 Video bit enhancement method based on efficient space-time information fusion
CN113055674A (en) * 2021-03-24 2021-06-29 电子科技大学 Compressed video quality enhancement method based on two-stage multi-frame cooperation
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIANDONG MENG等: "BSTN: An Effective Framework for Compressed Video Quality Enhancement"
陈学伟等: "面向HDTV高刷新率的视频帧速率变化算法研究"

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511485A (en) * 2022-01-29 2022-05-17 电子科技大学 Compressed video quality enhancement method based on cyclic deformable fusion
CN114554213A (en) * 2022-02-21 2022-05-27 电子科技大学 Motion adaptive and detail-focused compressed video quality enhancement method

Also Published As

Publication number Publication date
CN113592746B (en) 2023-04-18
CN113450280A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
US8526488B2 (en) Video sequence encoding system and algorithms
CN109842799B (en) Intra-frame prediction method and device of color components and computer equipment
EP2479996A1 (en) Video coding with prediction using a signle coding mode for all color components
TW200535717A (en) Directional video filters for locally adaptive spatial noise reduction
CN102460504B (en) Out of loop frame matching in 3d-based video denoising
CN112261414B (en) Video coding convolution filtering method divided by attention mechanism fusion unit
JPH07231450A (en) Filter device and method for reducing artifact in moving video picture signal system
CN113592746B (en) Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN113766249B (en) Loop filtering method, device, equipment and storage medium in video coding and decoding
US20120263225A1 (en) Apparatus and method for encoding moving picture
JP2005039837A (en) Method and apparatus for video image noise removal
CN117730338A (en) Video super-resolution network and video super-resolution, encoding and decoding processing method and device
Zhao et al. CBREN: Convolutional neural networks for constant bit rate video quality enhancement
CN116916036A (en) Video compression method, device and system
CN113055674B (en) Compressed video quality enhancement method based on two-stage multi-frame cooperation
Xia et al. Asymmetric convolutional residual network for av1 intra in-loop filtering
CN113068041A (en) Intelligent affine motion compensation coding method
Tan et al. Image compression algorithms based on super-resolution reconstruction technology
Zuo et al. Bi-layer texture discriminant fast depth intra coding for 3D-HEVC
CN111726636A (en) HEVC (high efficiency video coding) coding optimization method based on time domain downsampling and frame rate upconversion
Segall et al. Super-resolution from compressed video
Wu et al. MPCNet: Compressed multi-view video restoration via motion-parallax complementation network
Jung Comparison of video quality assessment methods
CN113507607B (en) Compressed video multi-frame quality enhancement method without motion compensation
CN112468826A (en) VVC loop filtering method and system based on multilayer GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant