CN113592746A - Method for enhancing quality of compressed video by fusing space-time information from coarse to fine - Google Patents


Info

Publication number
CN113592746A
Authority
CN
China
Prior art keywords
fusion
stage
quality
residual
coarse
Prior art date
Legal status
Granted
Application number
CN202111067216.6A
Other languages
Chinese (zh)
Other versions
CN113592746B (en)
Inventor
叶茂 (Ye Mao)
罗登晏 (Luo Dengyan)
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Publication of CN113592746A
Application granted
Publication of CN113592746B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS; G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00: Image enhancement or restoration
    • G06T 5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T 9/00: Image coding; G06T 9/002: Image coding using neural networks
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20221: Image fusion; image merging
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N 3/00: Computing arrangements based on biological models; G06N 3/02: Neural networks
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a method for enhancing the quality of compressed video by fusing spatio-temporal information from coarse to fine, applied to the field of video processing and aimed at the problem that compressed video in the prior art inevitably contains compression artifacts that seriously degrade subjective experience and objective quality. The invention uses a multi-frame quality enhancement network that predicts alignment offsets without explicit optical flow estimation; by fusing inter-frame information from coarse to fine, the alignment offsets are predicted more accurately, so that the temporal information between adjacent frames is fully exploited to improve the quality of the compressed video and to enhance it both subjectively and objectively.

Description

Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
Technical Field
The invention belongs to the field of video processing, and particularly relates to a technology for enhancing the quality of a compressed video.
Background
Since the international video compression standards were first proposed, methods for enhancing the quality of compressed video have been widely studied in both industry and academia. Before the advent of deep learning, such methods were mainly based on mathematically derived spatial-domain and frequency-domain techniques that enhance single frames. After deep learning was successfully applied to the field of image enhancement, a variety of new methods for enhancing the quality of compressed video were proposed, achieving better results and stronger generalization than the conventional methods.
The most widely used standard today, H.265/HEVC, adopts a block-based hybrid coding framework whose core processes include predictive coding, transform coding, quantization and entropy coding, with block-based prediction. The transform and quantization operations ignore the correlation between blocks, so the coded and reconstructed image exhibits blocking artifacts, i.e., the human eye perceives obvious discontinuities at block boundaries (these effects are more pronounced when the quantization step size is larger and the bit rate is lower). At the same time, quantization is performed block by block in the transform domain, and this quantization process is irreversible. In addition, the high-precision interpolation used in motion compensation is prone to ringing. Because errors accumulate during inter-frame coding, these effects also degrade the coding quality of subsequent frames, reducing both the objective evaluation quality of the video images and the perceived visual quality.
The invention "A method for enhancing image or video quality based on convolutional neural networks" by Xu Mai, Yang Bai and Wang Zhilin of Beihang University (Beijing University of Aeronautics and Astronautics) was filed with the China National Intellectual Property Administration on 26 September 2017 and granted; it was published on 15 December 2017 with publication number CN107481209A. Two convolutional neural networks with different computational complexity are first designed for enhancing video (or image) quality; several training images or videos are then selected to train the parameters of the two networks; a convolutional neural network of suitable computational complexity is selected according to actual needs, and the image or video whose quality is to be enhanced is input into the selected network; finally, the network outputs the quality-enhanced image or video. The invention can effectively enhance video quality, and the user can select a convolutional neural network of suitable computational complexity according to the computing capability or remaining capacity of the device.
That patent designs two convolutional neural networks of different complexity, and the user selects one according to the condition of the device; however, the two networks differ only in depth, and simply deepening the network is not a feasible way to improve the quality enhancement effect. Moreover, the networks are not designed around the characteristics of video, i.e., they cannot exploit the temporal correlation between video frames, so the quality enhancement effect of this method is limited.
The invention "A video quality enhancement method based on adaptive separable convolution", filed by Gao Qiquan et al. of Fujian Dishi Information Technology Co., Ltd., was filed with the China National Intellectual Property Administration on 12 June 2018 and granted; it was published on 27 November 2018 with publication number CN108900848A.
The video quality enhancement method based on adaptive separable convolution uses adaptive separable convolution as the first module of the network model, converting each two-dimensional convolution into a pair of one-dimensional convolution kernels in the horizontal and vertical directions, so that the number of parameters is reduced from n*n to n + n. Second, the adaptively varying convolution kernels learned by the network for different inputs are used to estimate motion vectors: two consecutive frames are taken as the network input, a pair of separable two-dimensional convolution kernels is obtained for each pair of consecutive inputs, and each 2-D kernel is then expanded into four 1-D kernels; the resulting 1-D kernels change with the input, which improves the adaptivity of the network. Replacing two-dimensional convolution kernels with one-dimensional ones reduces the parameters of the trained model and makes execution efficient.
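For illustration, the parameter reduction mentioned above (an n x n kernel has n*n weights per channel pair, while a vertical/horizontal pair of 1-D kernels has only n + n) can be sketched with a generic separable 2-D convolution. This is not the code of the cited patent; the framework (PyTorch), channel count and kernel size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Approximate an n x n convolution with an (n x 1) followed by a (1 x n) kernel.

    A full n x n kernel has n*n weights per channel pair; the separable pair
    has only n + n, which is the reduction described above.
    """
    def __init__(self, channels: int, n: int):
        super().__init__()
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(n, 1), padding=(n // 2, 0))
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, n), padding=(0, n // 2))

    def forward(self, x):
        return self.horizontal(self.vertical(x))

x = torch.randn(1, 16, 64, 64)             # a dummy feature map
y = SeparableConv2d(channels=16, n=5)(x)   # same spatial size; 5 + 5 weights per tap instead of 25
print(y.shape)                             # torch.Size([1, 16, 64, 64])
```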
That scheme uses five encoding modules, four decoding modules, a separable convolution module and an image prediction module; its structure replaces the last decoding module of a conventional symmetric encoder-decoder network with the separable convolution module. Although this effectively reduces the number of model parameters, the quality enhancement effect still needs further improvement.
The invention "A multi-frame quality enhancement method and device for lossy compressed video" by Xu Mai et al. of Beihang University was filed with the China National Intellectual Property Administration on 8 February 2018 and granted; it was published on 20 July 2018 with publication number CN108307193A.
The multi-frame quality enhancement method and device for lossy compressed video comprises: for the i-th frame of the decompressed video stream, m frames related to the i-th frame are used to enhance its quality, so that the quality-enhanced i-th frame can be played; the m frames belong to the video stream, and each of the m frames shares with the i-th frame a number of identical or corresponding pixels greater than a preset threshold; m is a natural number greater than 1. In a typical application, peak-quality frames can be used to enhance the non-peak-quality frames between two peak-quality frames. The method reduces the quality fluctuation between frames during playback of the video stream and at the same time enhances the quality of each frame of the lossy compressed video.
Although this invention takes into account temporal information between neighboring frames, the designed multi-frame convolutional neural network (MF-CNN) is divided into a motion-compensated sub-network (MC-subnet) and a quality-enhanced sub-network (QE-subnet), where the motion-compensated sub-network relies heavily on optical flow estimation to compensate for motion between non-peak quality frames and peak quality frames to achieve alignment, and any error in optical flow computation introduces artifacts around image structures in aligned neighboring frames. However, accurate optical flow estimation is inherently challenging and time consuming, and thus the quality enhancement effect of the invention is still limited.
In summary, it is desirable to employ video compression techniques to significantly save coding bit rate when transmitting video over bandwidth-limited networks. However, the compressed video inevitably has compression artifacts, which seriously affect the subjective experience and objective quality.
Disclosure of Invention
In order to solve the problem of reduced subjective and objective quality caused by video compression, the invention provides a multi-frame compressed video quality enhancement network; the quality of the compressed video is improved by fusing the temporal information between adjacent frames so as to better predict the alignment offsets. At the same time, the network of the invention no longer needs explicit optical flow estimation to predict the alignment offsets, which makes network training simpler.
The technical scheme adopted by the invention is as follows: a method for enhancing quality of compressed video by merging spatio-temporal information from coarse to fine, the method being based on a network structure comprising: the system comprises a rough fusion module, a multi-stage residual fusion module, a D2D fusion module and a reconstruction module; the method comprises the steps that a low-quality compressed video frame sequence passes through a rough fusion module to obtain a rough fusion feature map, and the rough fusion feature map passes through a multi-stage residual fusion module to obtain global and local fine fusion features; and jointly predicting all deformable offsets for alignment according to the global and local fine fusion features, obtaining an aligned fusion feature map by the D2D fusion module according to the deformable offsets, and obtaining a reconstruction result after the aligned fusion feature map passes through the reconstruction module.
The coarse fusion module comprises a plurality of C3D with activation functions and a bottleneck convolution; the plurality of C3D with activation functions are used for extracting spatio-temporal information from the sequence of low-quality compressed video frames, and the bottleneck convolution fuses the extracted spatio-temporal information along the time dimension to obtain the coarse fused feature map.
The multi-stage residual fusion module comprises three parallel stages, namely an L1 stage, an L2 stage and an L3 stage, wherein the L1 stage comprises a residual block; the L2 stage comprises a downsampling block, a plurality of residual blocks and an upsampling block; and the L3 stage comprises two downsampling blocks, a plurality of residual blocks and two upsampling blocks;
the first input of the residual block in the L1 stage is the coarse fused feature map, and its second input is the output of the first residual block in the L2 stage; the output of this residual block is the output result of the L1 stage;
the coarse fused feature map in the L2 stage is subjected to downsampling and serves as a first input of a first residual block, the second input of the first residual block is the output of the first residual block in the L3 stage, the output of the last residual block serves as the input of an upsampling block, and the output of the upsampling block serves as the output result of the L2 stage;
the coarse fused feature map in the L3 stage is processed by two downsampling blocks and then input into the first residual block, and the output of the last residual block, after passing through two upsampling blocks, serves as the output result of the L3 stage;
the method also comprises a convolution block, wherein the output result of the L1 level, the output result of the L2 level and the output result of the L3 level are added and then input into the convolution block, and global and local fine fusion features are extracted.
The D2D fusion module obtains the aligned fusion feature map by means of a modulated deformable convolution.
The reconstruction module is implemented as follows: the fusion feature map aligned by the D2D fusion module is input into the reconstruction module to obtain an enhanced residual $\hat{X}_t$'s residual $\hat{R}_t$; the enhanced residual $\hat{R}_t$ is added element by element to the current frame $X_t$ to obtain the reconstructed frame $\hat{X}_t$:
$\hat{X}_t = X_t + \hat{R}_t$
The network structure is trained in an end-to-end manner.
The loss function used for training is:
$\mathcal{L} = \left\| X_t^{raw} - \hat{X}_t \right\|_2^2$
where $X_t^{raw}$ denotes the original frame, $\hat{X}_t$ denotes the reconstruction result of the current iteration, and $\|\cdot\|_2$ denotes the 2-norm.
The beneficial effects of the invention are as follows: by fusing inter-frame information from coarse to fine, the invention can better predict the alignment offsets used to enhance the current low-quality frame, so that the subjective and objective quality of the compressed video is significantly improved; in addition, the network of the invention does not need explicit optical flow estimation to predict the alignment offsets, which makes network training simpler.
Drawings
Fig. 1 is a diagram of a quality enhancement network architecture according to the present invention;
FIG. 2 is an architecture diagram of the multi-stage residual fusion module of the present invention;
FIG. 3 is an architecture diagram of a reconstruction module of the present invention;
Fig. 4 is a subjective quality comparison plot for the sequences BasketballPass, RaceHorses, and PartyScene at QP = 37.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the following technical terms are first explained:
H.265/HEVC: the method is a new video coding standard established after H.264, reserves some technologies of the original H.264 coding standard, and improves some technologies. The new technology is used for improving the relation among code stream, coding quality, time delay and algorithm complexity so as to achieve the optimal setting.
GOP, Group of pictures: refers to the distance between two I frames.
I frame, Intra-coded picture (intra-coded image frame): coded using only the information of the current frame, without reference to other image frames.
P frame, Predictive-coded picture frame: inter-frame predictive coding performed with reference to the previous I frame or P frame by means of motion prediction.
Low Delay P (LDP): only the first frame is I-frame encoded and the others are P-frame encoded.
Peak Signal to Noise Ratio (PSNR): peak signal-to-noise ratio, an objective criterion for evaluating images.
Structural Similarity (SSIM): the structural similarity is a full-reference image quality evaluation index, and measures the image similarity from three aspects of brightness, contrast and structure.
Ringing effect: for strong edges in an image, quantization distortion of the high-frequency AC coefficients can produce ripple-like artifacts around the edges after decoding; this kind of distortion is called the ringing effect.
PQF: the peak quality frame, i.e., the high quality frame in the GOP, may also be considered an I-frame in the GOP.
non-PQF: non-peak quality frames, i.e., low quality frames in a GOP, may also be considered P-frames in the GOP.
Deformable 2D Convolution (D2D): a deformable 2D convolution.
3D Convolution (C3D): a 3D convolution.
Rectified Linear Unit (ReLU): an activation function that increases the nonlinearity between the layers of a neural network.
The invention is described in detail below with reference to the accompanying drawings:
The quality enhancement network proposed by the invention is shown in Fig. 1 and comprises four parts: a Coarse Fusion Module (CFModule), a Multi-Level Residual Fusion module (MLRF), a D2D Fusion module (D2DF) and a Reconstruction Module (REModule). Given a sequence of 2N+1 low-quality compressed video frames $\{X_{t-N}, \ldots, X_t, \ldots, X_{t+N}\}$, the frame $X_t$ is the reference frame and the other frames are its neighboring frames. The object of the invention is to infer a high-quality frame $\hat{X}_t$ from $X_t$, the compressed version of the original frame $X_t^{raw}$.
First, the input sequence $\{X_{t-N}, \ldots, X_{t+N}\}$ is coarsely fused by the CFModule, which consists of p 3D convolutions (C3D), to obtain the coarse fused feature map $F_c$. Then the MLRF module generates global and local fine fusion features $F_L$ from different levels.
From the coarse-to-fine fusion features $F_L$ generated above, all deformable offsets used for alignment are predicted jointly, rather than one offset at a time as in optical flow estimation.
The fusion feature map $F_f$ aligned by D2DF is then input to the REModule, which consists of several dense blocks, to obtain the enhanced residual $\hat{R}_t$. Finally, the enhanced residual $\hat{R}_t$ is added element by element to the current frame $X_t$ to obtain the reconstructed frame $\hat{X}_t$:
$\hat{X}_t = X_t + \hat{R}_t$
$I_{TN}$ in Fig. 1 denotes the features obtained by splicing the input 2N+1 frames along the channel dimension.
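For illustration only, the data flow just described (coarse fusion, multi-level residual fusion, D2D alignment, reconstruction, then a residual addition) can be sketched as a short PyTorch-style wrapper. This is not the code of the invention: the framework, the module interfaces and all names are assumptions, and the four sub-modules are assumed to be implemented as sketched in the following sections (single-channel luma frames are assumed for simplicity).

```python
import torch
import torch.nn as nn

class CoarseToFineQE(nn.Module):
    """Top-level wrapper: enhance the reference frame of a 2N+1 frame window."""
    def __init__(self, cf_module, mlrf_module, d2df_module, re_module):
        super().__init__()
        self.cf = cf_module      # coarse fusion (C3D + bottleneck)
        self.mlrf = mlrf_module  # multi-level residual fusion
        self.d2df = d2df_module  # deformable (D2D) fusion / alignment
        self.re = re_module      # reconstruction (dense blocks)

    def forward(self, frames):                  # frames: (B, T=2N+1, C, H, W)
        b, t, c, h, w = frames.shape
        i_tn = frames.reshape(b, t * c, h, w)   # frames spliced along the channel axis (I_TN)
        f_c = self.cf(frames)                   # coarse fused feature map F_c
        f_l = self.mlrf(f_c)                    # global/local fine fusion features F_L
        f_f = self.d2df(i_tn, f_l)              # aligned fusion feature map F_f
        residual = self.re(f_f)                 # enhanced residual R_t
        x_t = frames[:, t // 2]                 # reference (current) frame X_t
        return x_t + residual                   # reconstructed frame X_t + R_t
```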
The following details are provided for the four modules:
Coarse fusion module (CFModule)
The CFModule coarsely extracts and fuses the spatio-temporal information of the input sequence $\{X_{t-N}, \ldots, X_{t+N}\}$ through two C3D layers, each followed by a ReLU activation function:
$F_{3D} = O_{C3D}\big([X_{t-N}, \ldots, X_{t+N}]\big)$
where $H \times W$ denotes the size of an input frame, T = 2N+1 denotes the length of the input sequence, C denotes the number of channels, and $O_{C3D}$ denotes the two C3D operations.
The extracted features $F_{3D} \in R^{C' \times T \times H \times W}$ are then further fused along the time dimension using a 1×1 bottleneck convolution to obtain the coarse fused feature map $F_c \in R^{C'' \times H \times W}$:
$F_c = O_B(F_{3D})$
where C' denotes the number of C3D filters, C'' denotes the number of filters of the bottleneck convolution, $O_B$ denotes the bottleneck convolution operation, and R denotes the tensor space whose superscript gives the dimensionality, e.g. $R^{C' \times T \times H \times W}$ here denotes a four-dimensional tensor. Note that $F_{3D}$ is first reshaped into a three-dimensional tensor of size $(C' \cdot T) \times H \times W$ before being input to the bottleneck convolution.
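A minimal sketch of the coarse fusion module under the description above (two C3D layers each followed by ReLU, then a 1×1 bottleneck convolution applied after folding the temporal axis into the channel axis). It is a hypothetical PyTorch implementation; the channel counts C' and C'' and the kernel sizes are assumptions, not values specified by the invention.

```python
import torch
import torch.nn as nn

class CFModule(nn.Module):
    def __init__(self, n_frames, in_ch=1, c_prime=32, c_out=64):
        super().__init__()
        self.c3d = nn.Sequential(                               # two 3D convolutions with ReLU
            nn.Conv3d(in_ch, c_prime, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(c_prime, c_prime, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        # 1x1 bottleneck applied after folding the T axis into channels: (C'·T) -> C''
        self.bottleneck = nn.Conv2d(c_prime * n_frames, c_out, kernel_size=1)

    def forward(self, frames):                 # frames: (B, T, C, H, W)
        x = frames.permute(0, 2, 1, 3, 4)      # -> (B, C, T, H, W) for Conv3d
        f3d = self.c3d(x)                      # F_3D with shape (B, C', T, H, W)
        b, c, t, h, w = f3d.shape
        f3d = f3d.reshape(b, c * t, h, w)      # reshape to (C'·T) x H x W before the bottleneck
        return self.bottleneck(f3d)            # coarse fused feature map F_c
```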
Multi-level residual fusion Module (MLRF)
A schematic diagram of a multi-stage residual fusion module is shown in fig. 2, which includes three stages: l1, L2 and L3.
Stage L1 is used to extract global features of the same size as $F_c$ and to fuse them with the output features from the corresponding position of stage L2. It consists mainly of one residual block, and stage L1 can be expressed as:
$F_{L1} = F_{RB1}^{L1} = O_{RB1}^{L1}\big(F_c,\ O_{tc}(F_{RB1}^{L2})\big)$
where $F_{RB1}^{L1}$ and $F_{RB1}^{L2}$ denote the outputs of the first residual block in stages L1 and L2 respectively, $O_{RB1}^{L1}$ denotes the first residual block operation in stage L1, $O_{tc}$ denotes a transposed convolution operation, and $F_{L1}$ denotes the feature map output by stage L1.
The L2 stage consists mainly of one downsampling block, one upsampling block and several residual blocks. Unlike stage L1, the invention downsamples the input coarse fused features $F_c$ with a 3×3 convolution of stride 2 followed by a 3×3 convolution. Several residual blocks are then used to extract features and to fuse the features coming from stage L3. Finally, the extracted features are upsampled back to the size of $F_c$ by a transposed convolution and a 3×3 convolution. Stage L2 can be expressed as:
$F_{RB1}^{L2} = O_{RB1}^{L2}\big(O_{ds}(F_c),\ F_{RB1}^{L3}\big)$
$F_{L2} = O_{us}\big(O_{RB2}^{L2}(F_{RB1}^{L2})\big)$
where $O_{ds}$ and $O_{us}$ denote one downsampling and one upsampling operation respectively (both with ratio 2), $F_{RB1}^{L2}$ and $F_{RB1}^{L3}$ denote the outputs of the first residual block in stages L2 and L3 respectively, $O_{RB1}^{L2}$ and $O_{RB2}^{L2}$ denote the first and second residual block operations in stage L2 respectively, and $F_{L2}$ denotes the feature map output by stage L2. A specific implementation of the residual block is given in Fig. 2; those skilled in the art will appreciate that the residual block is a basic building block in deep learning, so its internal structure is not elaborated in this embodiment.
The L3 stage mainly uses a progressively upsampling structure to extract information from features downsampled by a factor of 4. First, two downsampling blocks with ratio 2 operate on $F_c$. The downsampled features are then passed progressively through residual blocks and upsampling blocks. Stage L3 can be expressed as:
$F_{RB1}^{L3} = O_{RB1}^{L3}\big(O_{ds2}(F_c)\big)$
$F_{L3} = O_{us2}\big(O_{RB2}^{L3}(F_{RB1}^{L3})\big)$
where $O_{ds2}$ and $O_{us2}$ denote two downsampling and two upsampling operations respectively (each with ratio 2), $F_{RB1}^{L3}$ denotes the output of the first residual block in stage L3, $O_{RB1}^{L3}$ and $O_{RB2}^{L3}$ denote the first and second residual block operations in stage L3 respectively, and $F_{L3}$ denotes the feature map output by stage L3.
In general, the more residual blocks L2 and L3 contain, the better the performance, but the model also becomes more complex and the computation larger; 2 residual blocks are used in this embodiment.
Finally, the features extracted from stages L1, L2 and L3 are added element by element and input into a 3×3 convolution to obtain the global and local fine fusion features $F_L$:
$F_L = O_C(F_{L1} + F_{L2} + F_{L3})$
where $O_C$ denotes a convolution operation.
Through the coarse-to-fine fusion strategy, the network of the invention can better predict the fusion characteristics required for generating the deformable offset.
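The three-level structure described above can be sketched as follows. This is a simplified, hypothetical PyTorch implementation: how each residual block merges its second (cross-level) input and how the cross-level features are resized are not fully specified above, so concatenation plus a 1×1 convolution and ratio-2 resampling are used as assumptions; H and W are assumed divisible by 4.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block; an optional side input is fused by a 1x1 conv on the concatenation (assumption)."""
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x, side=None):
        if side is not None:
            x = self.fuse(torch.cat([x, side], dim=1))
        return x + self.body(x)

def down(ch):   # 3x3 stride-2 convolution followed by a 3x3 convolution (ratio 2)
    return nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.Conv2d(ch, ch, 3, padding=1))

def up(ch):     # transposed convolution followed by a 3x3 convolution (ratio 2)
    return nn.Sequential(nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.Conv2d(ch, ch, 3, padding=1))

class MLRF(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.l2_down, self.l3_down1, self.l3_down2 = down(ch), down(ch), down(ch)
        self.l1_rb1 = ResBlock(ch)
        self.l2_rb1, self.l2_rb2 = ResBlock(ch), ResBlock(ch)
        self.l3_rb1, self.l3_rb2 = ResBlock(ch), ResBlock(ch)
        self.l1_up_side = up(ch)              # O_tc: brings the L2 feature up to L1 resolution
        self.l2_up_side = up(ch)              # brings the L3 feature up to L2 resolution (assumption)
        self.l2_up, self.l3_up1, self.l3_up2 = up(ch), up(ch), up(ch)
        self.out_conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, f_c):
        # L3: two downsamplings, residual blocks, two upsamplings
        l3_rb1 = self.l3_rb1(self.l3_down2(self.l3_down1(f_c)))
        f_l3 = self.l3_up2(self.l3_up1(self.l3_rb2(l3_rb1)))
        # L2: one downsampling, fuse the L3 feature, residual blocks, one upsampling
        l2_rb1 = self.l2_rb1(self.l2_down(f_c), side=self.l2_up_side(l3_rb1))
        f_l2 = self.l2_up(self.l2_rb2(l2_rb1))
        # L1: full resolution, fuse the (upsampled) L2 feature
        f_l1 = self.l1_rb1(f_c, side=self.l1_up_side(l2_rb1))
        # element-wise sum of the three levels, then a 3x3 convolution -> F_L
        return self.out_conv(f_l1 + f_l2 + f_l3)
```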
D2D fusion module (D2DF)
Let X and Y denote the input and output of a conventional convolution, respectively. For each position p on Y, one convolution operation can be described as:
$Y(p) = \sum_{k=1}^{K} w_k \cdot X(p + p_k)$
where $p_k$ denotes a sampling grid with K sampling locations and $w_k$ denotes the weight of each location. For example, K = 9 and $p_k \in \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$ define a 3×3 convolution kernel. In a modulated deformable convolution, a predicted offset and a modulation mask are added to the sampling grid, so that the convolution kernel varies spatially. Here, the invention uses the modulated deformable convolution of "Zhu X, Hu H, Lin S, et al. Deformable ConvNets v2: More deformable, better results [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 9308-9317" to obtain the aligned fusion feature map $F_f$. The modulated deformable convolution operates as follows:
$F_f(p) = \sum_{k=1}^{K} w_k \cdot I_{TN}(p + p_k + \Delta p_k) \cdot \Delta m_k$
where $\Delta p_k$ and $\Delta m_k$ denote the learnable offset and modulation mask for the k-th position, and $I_{TN}$ denotes the features obtained by splicing the input sequence $\{X_{t-N}, \ldots, X_{t+N}\}$ along the channel. The convolution thus operates on irregular positions with dynamic weights, achieving adaptive sampling of the input features $I_{TN}$. The corresponding deformable sampling parameters are learned as follows:
$\Delta P = O_C^{\Delta P}(F_L)$
$\Delta M = O_C^{\Delta M}(F_L)$
where $O_C^{\Delta P}$ denotes a conventional convolution with $(2N+1) \cdot 2K$ filters that generates the deformable offsets $\Delta P$, $O_C^{\Delta M}$ denotes a conventional convolution with $(2N+1) \cdot K$ filters that generates the modulation masks $\Delta M$, and $\Delta p_k \in \Delta P$, $\Delta m_k \in \Delta M$. Because $\Delta p_k$ may be fractional, bilinear interpolation is used, as in "Dai J, Qi H, Xiong Y, et al. Deformable convolutional networks [C]// Proceedings of the IEEE International Conference on Computer Vision. 2017: 764-773".
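A minimal sketch of the D2D fusion step, assuming torchvision's DeformConv2d (which accepts a modulation mask in recent versions) as the modulated deformable convolution; the offset and mask convolutions follow the filter counts (2N+1)·2K and (2N+1)·K given above. The channel widths and the sigmoid applied to the mask are assumptions of this sketch.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d   # accepts a modulation mask in recent torchvision releases

class D2DFusion(nn.Module):
    """Offsets (Delta P) and masks (Delta M) are predicted jointly from F_L, then a modulated
    deformable convolution adaptively samples the spliced input frames I_TN."""
    def __init__(self, n_frames, in_ch=1, feat_ch=64, out_ch=64, k=3):
        super().__init__()
        # (2N+1)·2K filters for the offsets, (2N+1)·K filters for the modulation masks
        self.offset_conv = nn.Conv2d(feat_ch, n_frames * 2 * k * k, 3, padding=1)
        self.mask_conv = nn.Conv2d(feat_ch, n_frames * k * k, 3, padding=1)
        self.dcn = DeformConv2d(n_frames * in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, i_tn, f_l):                   # i_tn: (B, (2N+1)·C, H, W), f_l: (B, feat_ch, H, W)
        offset = self.offset_conv(f_l)              # Delta P, one offset field per input frame
        mask = torch.sigmoid(self.mask_conv(f_l))   # Delta M, kept in (0, 1): a common choice (assumption)
        return self.dcn(i_tn, offset, mask)         # aligned fusion feature map F_f
```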
Reconstruction module (REModule)
The proposed reconstruction module is shown in Fig. 3. A 3×3 convolutional layer first extracts more useful features $F_{fu}$ from the aligned fusion feature map $F_f$:
$F_{fu} = O_C(F_f)$
Then, following "Mehta S, Kumar A, Reda F, et al. EVRNet: Efficient Video Restoration on Edge Devices [J]. arXiv preprint arXiv:2012.02228, 2020", the invention combines feature subtraction and feature summation as an operation that effectively avoids extra computational complexity:
$F_{fus} = F_{fu} + D(F_{fu} - D(F_{fu}))$
where D denotes a dense connection operation: a ReLU activation function is first applied to increase the nonlinearity of the network, three convolutional layers are then applied in sequence, and finally the convolution outputs of the different layers and the input are spliced along the channel as the output of the whole module.
Further, $F_{fus}$ is input into a dense connection block and a 3×3 convolutional layer to obtain the fusion features $F_H$ of the different layers:
$F_H = O_C([F_{fu}, D(F_{fu}), F_{fus}, D(F_{fus})])$
where $O_C$ denotes a 3×3 convolution operation. Finally, $F_H$ and $F_{fu}$ are added element by element, and the enhanced residual $\hat{R}_t$ is obtained through two convolutional layers:
$\hat{R}_t = O_{C2}(F_H + F_{fu})$
where $O_{C2}$ denotes the two convolutional layers.
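A sketch of the reconstruction module following the formulas above. The internal layout of the dense connection D(·) (how the three convolutions are wired and how the output width is kept equal to the input width so that the element-wise additions are defined) is an assumption of this sketch, as is whether the several uses of D share weights.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense connection D(.): ReLU, then three convs, each seeing the concatenation of all previous
    features; the last conv maps back to the input width (an assumption of this sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.act = nn.ReLU(inplace=True)
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c2 = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.c3 = nn.Conv2d(3 * ch, ch, 3, padding=1)

    def forward(self, x):
        x0 = self.act(x)
        x1 = self.c1(x0)
        x2 = self.c2(torch.cat([x0, x1], dim=1))
        return self.c3(torch.cat([x0, x1, x2], dim=1))

class REModule(nn.Module):
    def __init__(self, ch=64, out_ch=1):
        super().__init__()
        self.head = nn.Conv2d(ch, ch, 3, padding=1)          # F_fu = Conv3x3(F_f)
        self.dense = DenseBlock(ch)                          # D(.); weight sharing across uses is an assumption
        self.fuse = nn.Conv2d(4 * ch, ch, 3, padding=1)      # F_H = Conv3x3([F_fu, D(F_fu), F_fus, D(F_fus)])
        self.tail = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, out_ch, 3, padding=1))   # two convolutional layers

    def forward(self, f_f):
        f_fu = self.head(f_f)
        f_fus = f_fu + self.dense(f_fu - self.dense(f_fu))   # feature subtraction + summation
        f_h = self.fuse(torch.cat([f_fu, self.dense(f_fu), f_fus, self.dense(f_fus)], dim=1))
        return self.tail(f_h + f_fu)                         # enhanced residual R_t
```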
Loss function of network
In the method of the invention, the coarse fusion module, the multi-stage residual fusion module, the D2D fusion module and the reconstruction module are trained jointly in an end-to-end manner (i.e., trained end to end from the original frame to the reconstruction result), and no sub-network needs to be trained to convergence first, so the loss function consists of a single term. The invention uses the L2 norm as the loss function of the network:
$\mathcal{L} = \left\| X_t^{raw} - \hat{X}_t \right\|_2^2$
where $X_t^{raw}$ denotes the original frame, $\hat{X}_t$ denotes the reconstruction result, and $\|\cdot\|_2$ denotes the 2-norm.
This embodiment evaluates the effectiveness of the method of the invention both qualitatively and quantitatively. The quantitative evaluation compares ΔPSNR and ΔSSIM with MFQE (Yang R, Xu M, Wang Z, et al. Multi-frame quality enhancement for compressed video [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6664-6673), SDTS (Meng X, Deng X, Zhu S, et al. Enhancing quality for VVC compressed videos by jointly exploiting spatial details and temporal structure [C]// 2019 IEEE International Conference on Image Processing (ICIP). 2019), MFQE2.0, MGANet (Meng X, Deng X, Zhu S, et al. MGANet: A robust model for quality enhancement of compressed video [J]. arXiv preprint arXiv:1811.09150, 2018), MGANet2.0 (Meng X, Deng X, Zhu S, et al. A Robust Quality Enhancement Method Based on Joint Spatial-Temporal Priors for Video Coding [J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020), FastMSDD (Xiao W, He, Wang T, et al. [J]. IEEE Transactions on Multimedia, 2020, 22(7)) and STDF (Deng J, Wang L, Pu S, et al. Spatio-temporal deformable convolution for compressed video quality enhancement [C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2020); the qualitative evaluation is compared with MFQE2.0 and STDF.
Table 1: Overall comparison of ΔPSNR (dB) and ΔSSIM (×10^-4) on the HEVC standard test sequences at five QP points
(The body of Table 1 appears as an image in the original publication.)
Quantitative evaluation: table 1 gives the average results of Δ PSNR and Δ SSIM over all frames of each test sequence. It can be seen that our method is always superior to other video quality enhancement methods. Specifically, at an input frame radius N of 1 and QP of 22, our method achieves an average Δ PSNR value of 0.707dB, 27.1% higher than STDF-N1(0.556dB), 54.4% higher than MFQE2.0(0.458dB), and 102% higher than FastMSDD (0.350 dB). As the radius N of the input frame increases to 3, our method achieves an average Δ PSNR value of 0.845dB, 30.8% higher than STDF (0.646dB), 84.5% higher than MFQE2.0(0.458dB), and 141.4% higher than FastMSDD (0.350 dB).
At the other QP points, the method of the invention likewise outperforms the other methods in both ΔPSNR and ΔSSIM. In addition, the invention compares network performance in terms of BD-rate reduction: as shown in Table 2, the method achieves an average reduction of 24.69%, exceeding the advanced STDF method (21.61%).
Table 2: Comparison of BD-rate (%) with FastMSDD [11], MFQE [8], MFQE2.0 [3] and STDF [4]
(The body of Table 2 appears as an image in the original publication.)
Qualitative evaluation: Fig. 4 shows the subjective quality of the sequences BasketballPass, RaceHorses and PartyScene at QP = 37. As can be seen from Fig. 4, compared with the MFQE2.0 and STDF methods, the method of the invention removes more compression artifacts and provides a better visual experience.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to the specifically described embodiments and examples. Various modifications and alterations will be apparent to those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the scope of the claims of the present invention.

Claims (7)

1. A method for enhancing the quality of compressed video by merging spatio-temporal information from coarse to fine, characterized in that said method is based on a network structure comprising: the system comprises a rough fusion module, a multi-stage residual fusion module, a D2D fusion module and a reconstruction module; the method comprises the steps that a low-quality compressed video frame sequence passes through a rough fusion module to obtain a rough fusion feature map, and the rough fusion feature map passes through a multi-stage residual fusion module to obtain global and local fine fusion features; and jointly predicting all deformable offsets for alignment according to the global and local fine fusion features, obtaining an aligned fusion feature map by the D2D fusion module according to the deformable offsets, and obtaining a reconstruction result after the aligned fusion feature map passes through the reconstruction module.
2. A method of enhancing the quality of compressed video by coarse-to-fine fusion of spatio-temporal information according to claim 1, characterized in that said coarse fusion module comprises a plurality of C3D with activation functions and a bottleneck convolution; the plurality of C3D with activation functions are used for extracting spatio-temporal information from the sequence of low-quality compressed video frames, and the bottleneck convolution fuses the extracted spatio-temporal information along the time dimension to obtain the coarse fused feature map.
3. The method of claim 2, wherein the multi-stage residual fusion module comprises three parallel stages, namely an L1 stage, an L2 stage and an L3 stage, wherein the L1 stage comprises a residual block; the L2 stage comprises a downsampling block, a plurality of residual blocks and an upsampling block; and the L3 stage comprises two downsampling blocks, a plurality of residual blocks and two upsampling blocks;
the first input of the residual block in the L1 stage is the coarse fused feature map, and its second input is the output of the first residual block in the L2 stage; the output of this residual block is the output result of the L1 stage;
the coarse fused feature map in the L2 stage is subjected to downsampling and serves as a first input of a first residual block, the second input of the first residual block is the output of the first residual block in the L3 stage, the output of the last residual block serves as the input of an upsampling block, and the output of the upsampling block serves as the output result of the L2 stage;
the coarse fusion characteristic diagram in the L3 level is input into the first residual block after being processed by two downsampling blocks, and the output of the last residual block after being processed by two upsampling blocks is used as the output result of the L3 level;
the method also comprises a convolution block, wherein the output result of the L1 level, the output result of the L2 level and the output result of the L3 level are added and then input into the convolution block, and global and local fine fusion features are extracted.
4. A method for enhancing the quality of compressed video by coarse-to-fine fusion of spatio-temporal information according to claim 1, characterized in that the D2D fusion module uses modulated deformable convolution to obtain the aligned fusion feature map.
5. The method for enhancing the quality of compressed video by fusing spatio-temporal information from coarse to fine according to claim 1, wherein the reconstruction module is implemented as follows: the fusion feature map aligned by the D2D fusion module is input into the reconstruction module to obtain an enhanced residual $\hat{R}_t$; the enhanced residual $\hat{R}_t$ is added element by element to the current frame $X_t$ to obtain the reconstructed frame $\hat{X}_t$:
$\hat{X}_t = X_t + \hat{R}_t$
6. The method of claim 1, wherein the network structure is trained in an end-to-end manner.
7. The method of claim 5, wherein the loss function used in training is:
$\mathcal{L} = \left\| X_t^{raw} - \hat{X}_t \right\|_2^2$
where $X_t^{raw}$ denotes the original frame, $\hat{X}_t$ denotes the reconstruction result of the current iteration, and $\|\cdot\|_2$ denotes the 2-norm.
CN202111067216.6A 2021-07-07 2021-09-13 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine Active CN113592746B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110768143.7A CN113450280A (en) 2021-07-07 2021-07-07 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN2021107681437 2021-07-07

Publications (2)

Publication Number Publication Date
CN113592746A true CN113592746A (en) 2021-11-02
CN113592746B CN113592746B (en) 2023-04-18

Family

ID=77815429

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110768143.7A Withdrawn CN113450280A (en) 2021-07-07 2021-07-07 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN202111067216.6A Active CN113592746B (en) 2021-07-07 2021-09-13 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202110768143.7A Withdrawn CN113450280A (en) 2021-07-07 2021-07-07 Method for enhancing quality of compressed video by fusing space-time information from coarse to fine

Country Status (1)

Country Link
CN (2) CN113450280A (en)


Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0739139A2 (en) * 1995-04-18 1996-10-23 Sun Microsystems, Inc. Decoder for an end-to-end scalable video delivery system
US20120013807A1 (en) * 2010-07-15 2012-01-19 Gaurav Arora Method and apparatus for fast source switching and/or automatic source switching
CN102289795A (en) * 2011-07-29 2011-12-21 上海交通大学 Method for enhancing video in spatio-temporal mode based on fusion idea
CN104539961A (en) * 2014-12-12 2015-04-22 上海交通大学 Scalable video encoding system based on hierarchical structure progressive dictionary learning
CN104616243A (en) * 2015-01-20 2015-05-13 北京大学 Effective GPU three-dimensional video fusion drawing method
CN108307193A (en) * 2018-02-08 2018-07-20 北京航空航天大学 A kind of the multiframe quality enhancement method and device of lossy compression video
CN109257600A (en) * 2018-11-28 2019-01-22 福建帝视信息科技有限公司 A kind of adaptive minimizing technology of video compression artifact based on deep learning
CN110378348A (en) * 2019-07-11 2019-10-25 北京悉见科技有限公司 Instance of video dividing method, equipment and computer readable storage medium
CN111031315A (en) * 2019-11-18 2020-04-17 复旦大学 Compressed video quality enhancement method based on attention mechanism and time dependency
CN111028150A (en) * 2019-11-28 2020-04-17 武汉大学 Rapid space-time residual attention video super-resolution reconstruction method
CN112381866A (en) * 2020-10-27 2021-02-19 天津大学 Attention mechanism-based video bit enhancement method
CN112291570A (en) * 2020-12-24 2021-01-29 浙江大学 Real-time video enhancement method based on lightweight deformable convolutional neural network
CN113066022A (en) * 2021-03-17 2021-07-02 天津大学 Video bit enhancement method based on efficient space-time information fusion
CN113055674A (en) * 2021-03-24 2021-06-29 电子科技大学 Compressed video quality enhancement method based on two-stage multi-frame cooperation
CN112991183A (en) * 2021-04-09 2021-06-18 华南理工大学 Video super-resolution method based on multi-frame attention mechanism progressive fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XIANDONG MENG等: "BSTN: An Effective Framework for Compressed Video Quality Enhancement"
陈学伟等: "面向HDTV高刷新率的视频帧速率变化算法研究"

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511485A (en) * 2022-01-29 2022-05-17 电子科技大学 Compressed video quality enhancement method based on cyclic deformable fusion
CN114554213A (en) * 2022-02-21 2022-05-27 电子科技大学 Motion adaptive and detail-focused compressed video quality enhancement method

Also Published As

Publication number Publication date
CN113592746B (en) 2023-04-18
CN113450280A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
US8526488B2 (en) Video sequence encoding system and algorithms
CN109842799B (en) Intra-frame prediction method and device of color components and computer equipment
EP2479996A1 (en) Video coding with prediction using a signle coding mode for all color components
TW200535717A (en) Directional video filters for locally adaptive spatial noise reduction
CN102460504B (en) Out of loop frame matching in 3d-based video denoising
CN112261414B (en) Video coding convolution filtering method divided by attention mechanism fusion unit
JPH07231450A (en) Filter device and method for reducing artifact in moving video picture signal system
CN113592746B (en) Method for enhancing quality of compressed video by fusing space-time information from coarse to fine
CN113766249B (en) Loop filtering method, device, equipment and storage medium in video coding and decoding
US20120263225A1 (en) Apparatus and method for encoding moving picture
JP2005039837A (en) Method and apparatus for video image noise removal
CN117730338A (en) Video super-resolution network and video super-resolution, encoding and decoding processing method and device
Zhao et al. CBREN: Convolutional neural networks for constant bit rate video quality enhancement
CN116916036A (en) Video compression method, device and system
CN113055674B (en) Compressed video quality enhancement method based on two-stage multi-frame cooperation
Xia et al. Asymmetric convolutional residual network for av1 intra in-loop filtering
CN113068041A (en) Intelligent affine motion compensation coding method
Tan et al. Image compression algorithms based on super-resolution reconstruction technology
Zuo et al. Bi-layer texture discriminant fast depth intra coding for 3D-HEVC
CN111726636A (en) HEVC (high efficiency video coding) coding optimization method based on time domain downsampling and frame rate upconversion
Segall et al. Super-resolution from compressed video
Wu et al. MPCNet: Compressed multi-view video restoration via motion-parallax complementation network
Jung Comparison of video quality assessment methods
CN113507607B (en) Compressed video multi-frame quality enhancement method without motion compensation
CN112468826A (en) VVC loop filtering method and system based on multilayer GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant