CN113507607B - Compressed video multi-frame quality enhancement method without motion compensation - Google Patents
Compressed video multi-frame quality enhancement method without motion compensation
- Publication number
- CN113507607B CN113507607B CN202110654128.XA CN202110654128A CN113507607B CN 113507607 B CN113507607 B CN 113507607B CN 202110654128 A CN202110654128 A CN 202110654128A CN 113507607 B CN113507607 B CN 113507607B
- Authority
- CN
- China
- Prior art keywords
- block
- feature map
- module
- quality
- residual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 46
- 238000012549 training Methods 0.000 claims abstract description 7
- 230000004927 fusion Effects 0.000 claims description 39
- 238000013507 mapping Methods 0.000 claims description 28
- 238000010586 diagram Methods 0.000 claims description 23
- 238000000605 extraction Methods 0.000 claims description 21
- 230000004913 activation Effects 0.000 claims description 11
- 230000007246 mechanism Effects 0.000 claims description 11
- 238000011176 pooling Methods 0.000 claims description 9
- 238000007781 pre-processing Methods 0.000 claims description 9
- 238000005070 sampling Methods 0.000 claims description 3
- 230000006835 compression Effects 0.000 abstract description 10
- 238000007906 compression Methods 0.000 abstract description 10
- 230000003287 optical effect Effects 0.000 abstract description 8
- 238000013527 convolutional neural network Methods 0.000 description 13
- 230000006870 function Effects 0.000 description 11
- 230000000694 effects Effects 0.000 description 9
- 230000002708 enhancing effect Effects 0.000 description 5
- 238000012545 processing Methods 0.000 description 5
- 238000013139 quantization Methods 0.000 description 5
- 238000004590 computer program Methods 0.000 description 4
- 230000008569 process Effects 0.000 description 4
- 238000004364 calculation method Methods 0.000 description 3
- 238000013135 deep learning Methods 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000012986 modification Methods 0.000 description 3
- 238000012360 testing method Methods 0.000 description 3
- 230000003044 adaptive effect Effects 0.000 description 2
- 238000004422 calculation algorithm Methods 0.000 description 2
- 230000015556 catabolic process Effects 0.000 description 2
- 230000008859 change Effects 0.000 description 2
- 238000006731 degradation reaction Methods 0.000 description 2
- 230000001747 exhibiting effect Effects 0.000 description 2
- 230000009467 reduction Effects 0.000 description 2
- 238000000926 separation method Methods 0.000 description 2
- 238000009825 accumulation Methods 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 230000000052 comparative effect Effects 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000009795 derivation Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000004438 eyesight Effects 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000002427 irreversible effect Effects 0.000 description 1
- 238000013441 quality evaluation Methods 0.000 description 1
- 238000011158 quantitative evaluation Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 230000016776 visual perception Effects 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/154—Measured or subjectively estimated visual quality after decoding, e.g. measurement of distortion
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/48—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using compressed domain processing techniques other than decoding, e.g. modification of transform coefficients, variable length coding [VLC] data or run-length data
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- H04N19/86—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a compressed video multi-frame quality enhancement method without motion compensation, belonging to the technical field of video compression. The invention adopts a multi-frame quality enhancement network that requires no optical flow estimation, resolves the conflict between video definition and color richness on the one hand and limited network bandwidth on the other in multimedia applications, and enhances low-quality frames by fully exploiting the temporal information between adjacent frames, thereby improving the subjective and objective quality of the compressed video. At the same time, the invention does not need to explicitly compensate the motion between adjacent frames by optical flow estimation, which simplifies network training. With the invention, high-definition video compressed at the same bit rate can be transmitted normally over the network while maintaining better subjective and objective quality.
Description
Technical Field
The invention belongs to the technical field of video compression, and particularly relates to a compressed video multi-frame quality enhancement method without motion compensation.
Background
Since the international video compression standards were first proposed, compressed video quality enhancement has been widely studied in industry and academia. Before deep learning, methods for enhancing the quality of compressed video were mainly spatial-domain and frequency-domain methods, based on mathematical derivation, for enhancing single frames. After deep learning was successfully applied to the field of image enhancement, a variety of new methods for enhancing compressed video quality have been proposed, achieving better results and stronger generalization than the conventional methods.
The currently most widely used H.265/HEVC standard adopts a block-based hybrid coding framework whose core processes include predictive coding, transform coding, quantization and entropy coding, all carried out block by block. The transform and quantization operations ignore the correlation between blocks, so the reconstructed image exhibits blocking artifacts, i.e., discontinuities at block boundaries that are perceptible to the human eye (these effects are more pronounced when the quantization step size is larger and the bit rate lower); meanwhile, quantization is performed on a block basis in the transform domain, and this quantization process is irreversible. In addition, the high-precision interpolation used in motion compensation is prone to ringing artifacts. Because errors accumulate during inter-frame coding, the above effects also degrade the coding quality of subsequent frames, thereby lowering both the objective quality of the video image and its perceptual quality to the human eye.
Among existing solutions, Chinese patent application publication No. CN107481209A proposes an enhancement scheme named "an image or video quality enhancement method based on convolutional neural networks". In that scheme, two convolutional neural networks for video (or image) quality enhancement with different computational complexity are first designed; a number of training images or videos are then selected to train the parameters of the two networks; according to actual needs, the network with the appropriate computational complexity is selected and the image or video to be enhanced is fed into it; finally, the network outputs the quality-enhanced image or video. The scheme can effectively enhance video quality, and the user can select the network with suitable computational complexity according to the computing capability or remaining battery of the device. However, the two networks differ only in depth; merely deepening a network is not a feasible way to improve the enhancement effect, and the network is not designed for the characteristics of video, i.e., it cannot exploit the temporal correlation between video frames, so the quality enhancement effect of this method is limited.
Chinese patent application publication No. CN108900848A proposes an enhancement scheme called "an adaptive separable convolution-based video quality enhancement method". In that scheme, adaptive separable convolution is applied as the first module of the network model: each two-dimensional convolution kernel is converted into a pair of one-dimensional kernels in the horizontal and vertical directions, so the number of parameters drops from n² to n + n. Secondly, motion vector estimation is realized by the adaptively varying convolution kernels that the network learns for different inputs: two consecutive frames are selected as the network input, a pair of separable two-dimensional convolution kernels is obtained for every two consecutive inputs, the two-dimensional kernels are then expanded into four one-dimensional kernels, and the resulting one-dimensional kernels change with the input, which improves the adaptability of the network. Replacing the two-dimensional kernels with one-dimensional kernels reduces the parameters of the network training model and yields high execution efficiency. The scheme adopts five encoding modules, four decoding modules, a separable convolution module and an image prediction module; structurally, the last decoding module of a conventional symmetric encoder-decoder network is replaced by the separable convolution module, which effectively reduces the model parameters, but the quality enhancement effect still needs further improvement.
In addition, Chinese patent application publication No. CN108307193A proposes an enhancement scheme named "a multi-frame quality enhancement method and apparatus for lossy compressed video". In that scheme, for the i-th frame of a decompressed video stream, m frames associated with the i-th frame are used to enhance its quality, so that the quality-enhanced i-th frame can be played; the m frames belong to the video stream, each of them shares with the i-th frame a number of identical or corresponding pixels greater than a preset threshold, and m is a natural number greater than 1. In a typical application, peak-quality frames are used to enhance the non-peak-quality frames between two peak-quality frames. The scheme reduces the quality fluctuation among frames during playback of the video stream and enhances the quality of every frame of the lossily compressed video. Although this scheme takes the temporal information between adjacent frames into account, its multi-frame convolutional neural network (MF-CNN) is divided into a motion compensation subnet (MC-subnet) and a quality enhancement subnet (QE-subnet), and the motion compensation subnet relies heavily on optical flow estimation to compensate the motion between non-peak-quality frames and peak-quality frames to achieve alignment; any error in the optical flow calculation introduces artifacts around image structures in the aligned adjacent frames. However, accurate optical flow estimation is inherently challenging and time consuming, so the quality enhancement effect of that invention is still limited.
Disclosure of Invention
The invention aims to enhance low-quality frames by fully utilizing the temporal information between adjacent frames, using a multi-frame quality enhancement network without optical flow estimation, so that the subjective and objective quality of the compressed video is enhanced.
The embodiment of the invention provides a compressed video multi-frame quality enhancement method that requires no motion compensation, comprising the following steps:
for the compressed video to be enhanced, forming an input sequence from one low-quality frame of the compressed video sequence and its two adjacent high-quality frames;
inputting each input sequence into a quality enhancement network to obtain an enhanced low-quality frame of the current input sequence;
The quality enhancement network comprises a preprocessing module and an enhancement module. The preprocessing module comprises a feature extraction module and a feature fusion module; the feature extraction module is used for extracting the spatial features of each frame of the input sequence to obtain a feature map of each frame, and the fusion module is used for fusing the temporal information between adjacent frames to obtain a first fusion feature map, which is input into the enhancement module;
The enhancement module comprises a nonlinear mapping module and a reconstruction module. The nonlinear mapping module is used for performing nonlinear mapping on the first fusion feature map to obtain a second fusion feature map, which is input into the reconstruction module; the reconstruction module is formed by stacking at least two convolution layers and is used for predicting an enhanced residual map, and the enhanced low-quality frame is then obtained by fusing the enhanced residual map with the low-quality frame of the input sequence.
The embodiment of the invention improves the quality of the compressed video by fully utilizing the temporal information between frames. Meanwhile, the quality enhancement network adopted by the embodiment of the invention does not need to explicitly compensate the motion between adjacent frames by optical flow estimation, which simplifies both the network structure and the network training.
In one possible implementation, the feature extraction module is a feature extraction network based on a multi-scale feature extraction strategy to extract rich feature information.
In one possible implementation, the nonlinear mapping module is a nonlinear mapping module based on a layered residual and a channel attention mechanism.
Further, the nonlinear mapping module includes a plurality of layered residual modules HR_block, where each HR_block comprises a downsampling layer, a residual block R, a residual block RA, a convolution layer and an upsampling layer. The residual block R comprises at least two convolution layers. The residual block RA is the residual block R with a channel attention mechanism module CA_block added to extract the information of the residual of the layered features on different channels; the residual block RA has two outputs, the residual output being defined as the first output of the residual block RA and the CA_block output of the residual block RA being its second output. The feature map input to the HR_block is denoted Z. The feature map Z passes through the first residual block to obtain a feature map Z_S; meanwhile, the feature map Z passes through the downsampling layer to obtain a feature map Z_D, which is input into the second residual block; one feature map is obtained from the first output of the second residual block and another from its second output; the two feature maps pass through upsampling layers of identical structure to obtain the corresponding upsampled feature maps; and the upsampled feature map corresponding to the first output is spliced with Z_S along the channel dimension and input into a convolution layer to obtain the feature map HR(Z) output by the layered residual module.
Further, the nonlinear mapping module comprises three HR_blocks and two channel attention mechanism modules CA_block; along the forward propagation direction of the network, the three HR_blocks are defined in order as the first HR_block, the second HR_block and the third HR_block, and the two CA_blocks as the first CA_block and the second CA_block. The input feature map of the first HR_block is the feature map Z; the feature map HR(Z) output by the first HR_block is subtracted from the feature map Z and the result is input into the second HR_block; the output feature map of the second HR_block is added to the feature map Z and input into the first CA_block; and the input feature map of the first CA_block is added to its output feature map and input into the third HR_block. The feature map Z, the feature map HR(Z) output by the first HR_block and the feature map input to the third HR_block pass through a splicing layer to obtain a third fusion feature map; the second output feature map of the first HR_block, the output feature map of the first CA_block and the second output feature map of the third HR_block pass through a splicing layer to obtain a fourth fusion feature map. Finally, the third fusion feature map, after one convolution layer, is added to the fourth fusion feature map, after one convolution layer and the second CA_block in sequence, to obtain the output feature map of the nonlinear mapping module.
In one possible implementation, the network structure of the CA_block is: a pooling layer, at least one convolution block, a convolution layer and a sigmoid activation function connected in sequence, wherein a convolution block comprises a convolution layer and its activation function.
The technical scheme provided by the embodiment of the invention at least has the following beneficial effects:
in the embodiment of the invention, the low-quality frames are enhanced by fully utilizing the adjacent high-quality frames, so that the subjective and objective quality of the compressed video is significantly enhanced, and the conflict between video quality (definition and color richness) and limited network bandwidth in multimedia applications is resolved; multimedia applications based on the invention can therefore transmit high-definition video normally over the network after compression at the same bit rate while still maintaining better subjective and objective quality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a quality enhancement network used in a method for enhancing quality of a compressed video multi-frame without motion compensation according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a feature extraction network employed in an embodiment of the invention;
FIG. 3 is a schematic diagram of a network structure of a nonlinear mapping module according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a network structure of a layered residual module (Hierarchical Residual block, hr_block) according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a network architecture of a channel attention mechanism module (Channel Attention block, CA_block) employed in an embodiment of the present invention;
FIG. 6 is a graph showing the result of enhancement processing of the present invention compared with the conventional comparative scheme in the embodiment of the present invention;
fig. 7 is a graph showing the PSNR fluctuation curves of frames 6 to 36 of the video sequence BQSquare after enhancement processing according to the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
With the continuous progress of hardware and algorithms, digital video contains an ever increasing amount of data as resolution increases and colors become richer. Video compression coding can effectively reduce the video data volume, but inevitably degrades subjective and objective quality. In order to enhance the subjective and objective quality of compressed video, the embodiment of the invention provides a compressed video multi-frame quality enhancement method without motion compensation. Before describing the embodiments of the present invention, related terms are annotated as follows:
H.265/HEVC: a new video coding standard formulated after H.264; it retains some techniques of the original H.264 coding standard while improving others. The new techniques improve the trade-off among the bit stream, coding quality, delay and algorithm complexity in order to reach an optimal configuration.
GOP, group of pictures (group of pictures): refers to the distance between two I frames.
I-frame, intra-coded picture (Intra-coded image frame): the other image frames are not referred to, and only the information of the present frame is used for encoding.
P-frame, predictive-coded picture (Predictive-coded image frame): and performing inter-frame prediction coding by using a previous I frame or P frame in a motion prediction mode.
Low Delay P (LDP): only the first frame is I-frame encoded, while the others are P-frame encoded.
Peak Signal to Noise Ratio (PSNR): peak signal to noise ratio, an objective criterion for evaluating images.
Structural Similarity (SSIM): the structural similarity is a full-reference image quality evaluation index, and measures the image similarity from three aspects of brightness, contrast and structure respectively.
Ringing effect: for strong edges in the image, moire is generated around the edges after decoding due to quantization distortion of the high frequency ac coefficients, which is called ringing effect.
PQF: peak quality frames, i.e., high quality frames in a GOP, may also be considered I frames in a GOP.
non-PQF: non-peak quality frames, i.e., low quality frames in a GOP, can also be considered P-frames in a GOP.
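The objective criterion PSNR defined above is used throughout the experiments below. As a small illustration only (not part of the patent), it can be computed as follows; the 8-bit peak value of 255 and the NumPy-based implementation are assumptions for illustration.

```python
import numpy as np

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two frames (8-bit peak assumed)."""
    mse = np.mean((reference.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)

# The delta-PSNR reported in the experiments is
# psnr(original, enhanced) - psnr(original, compressed).
```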
In one possible implementation manner, the method for enhancing the multi-frame quality of the compressed video without motion compensation provided by the embodiment of the invention comprises the following steps:
For the compressed video to be enhanced, one low-quality frame of the compressed video sequence and its two adjacent high-quality frames form an input sequence. For example, the input sequence is defined as X, where X = {x_-1, x_0, x_+1}; x_0 denotes the low-quality frame of the input sequence (i.e., the compressed frame x_0 of the original frame y_0), and x_-1 and x_+1 denote the high-quality frames adjacent to x_0, so that the currently input low-quality frame is enhanced with the aid of its adjacent high-quality frames;
inputting each input sequence into a quality enhancement network to obtain an enhanced low-quality frame of the current input sequence;
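As an illustrative sketch only (the patent does not prescribe an implementation), the input sequences can be formed by pairing each non-PQF with its nearest preceding and following PQFs. The function name, the frame representation and the assumption that the PQF indices are known in advance are hypothetical; Python is used here and in the later sketches.

```python
def build_input_sequences(frames, pqf_indices):
    """frames: list of decoded frames; pqf_indices: sorted indices of peak-quality frames."""
    pqf_set = set(pqf_indices)
    sequences = []
    for i, frame in enumerate(frames):
        if i in pqf_set:
            continue  # only low-quality (non-PQF) frames are enhanced
        prev_pqf = max((p for p in pqf_indices if p < i), default=None)
        next_pqf = min((p for p in pqf_indices if p > i), default=None)
        if prev_pqf is None or next_pqf is None:
            continue  # skip frames without two neighbouring high-quality frames
        # input sequence X = {x_-1, x_0, x_+1}
        sequences.append((frames[prev_pqf], frame, frames[next_pqf]))
    return sequences
```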
the quality enhancement network includes a preprocessing module (Pre-processing Module) and an enhancement module (Enhancement Module), as shown in fig. 1;
The preprocessing module comprises a feature extraction module (Feature Extraction Module) and a feature fusion module (Feature Fusion Module). The feature extraction module is used for extracting the spatial features of each frame of the input sequence to obtain a feature map of each frame; the fusion module is used for fusing the temporal information between adjacent frames to obtain a first fusion feature map, which is input into the enhancement module;
The enhancement module comprises a nonlinear mapping module (Non-linear Mapping Module) and a reconstruction module (Reconstruction Module). The nonlinear mapping module is used for performing nonlinear mapping on the first fusion feature map to obtain a second fusion feature map, which is input into the reconstruction module; the reconstruction module is formed by stacking convolution layers (at least two) and is used to predict an enhanced residual map, and the enhanced low-quality frame is obtained by fusing the enhanced residual map with the low-quality frame of the input sequence.
The object of the embodiment of the invention is to infer a high-quality frame, i.e., the enhanced low-quality frame, from the compressed frame x_0 of the original frame y_0. The preprocessing module extracts and fuses the spatial features of the inputs x_-1, x_0 and x_+1, and the enhancement module then reduces the compression artifacts of x_0 to obtain the enhanced frame. In this way, the quality of a low-quality frame in the compressed video is enhanced by making full use of the information of its adjacent high-quality frames.
In one possible implementation, the feature extraction module adopted in the embodiment of the present invention is a feature extraction network based on a multi-scale feature extraction strategy.
Referring to fig. 2, the embodiment of the present invention adopts a multi-layer convolutional feature extraction network, for example with a 4-layer structure; a ReLU activation function is placed after each of the first three convolution layers, and along the forward propagation direction of the network the convolution kernel sizes of the convolution layers are set to 3×3, 5×5, 7×7 and 3×3 in sequence.
The feature extraction network is used to extract the spatial features of each frame x_t in the input sequence X to obtain a feature map T(x_t):

T(x_t) = W_t * M(S(x_t); W_M, B_M) + b_t;

S(x_t) = ReLU(Conv_1(x_t));

where Conv_1 denotes the first convolution layer in the feature extraction network, M denotes the multi-scale mapping with weights W_M and biases B_M, the symbol '*' denotes the convolution operation applied after splicing the features of different scales, and W_t and b_t denote the weight and bias of that convolution layer, respectively.
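A minimal PyTorch sketch of such a feature extraction network is given below for illustration. Since fig. 2 is not reproduced here, the exact topology is an assumption: the 5×5 and 7×7 convolutions are treated as parallel branches applied to S(x_t), their outputs are spliced and then fused by the final 3×3 convolution; the channel width `feat` is also illustrative.

```python
import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """Sketch of the multi-scale feature extraction module (kernel sizes 3x3, 5x5, 7x7, 3x3)."""
    def __init__(self, in_ch=1, feat=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, feat, 3, padding=1)    # S(x_t) = ReLU(Conv_1(x_t))
        self.conv5 = nn.Conv2d(feat, feat, 5, padding=2)      # first scale
        self.conv7 = nn.Conv2d(feat, feat, 7, padding=3)      # second scale
        self.fuse = nn.Conv2d(2 * feat, feat, 3, padding=1)   # convolution after splicing the scales
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.conv1(x))
        m = torch.cat([self.relu(self.conv5(s)), self.relu(self.conv7(s))], dim=1)
        return self.fuse(m)  # T(x_t)
```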
In one possible implementation, the feature fusion module obtains the fused feature, i.e., the first fusion feature map F(x_0):

F(x_0) = Conv([T(x_-1), T(x_0), T(x_+1)]),

where [·] denotes the splicing (concatenation) operation, and T(x_-1), T(x_0), T(x_+1) denote the feature maps of x_-1, x_0 and x_+1, respectively.
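Continuing the PyTorch sketch above, the fusion module can be written as a single convolution over the channel-wise concatenation of the three per-frame feature maps; the 3×3 kernel size is an assumption.

```python
class FeatureFusion(nn.Module):
    """Sketch of the feature fusion module: F(x_0) = Conv([T(x_-1), T(x_0), T(x_+1)])."""
    def __init__(self, feat=32):
        super().__init__()
        self.fuse = nn.Conv2d(3 * feat, feat, 3, padding=1)

    def forward(self, t_prev, t_cur, t_next):
        return self.fuse(torch.cat([t_prev, t_cur, t_next], dim=1))  # first fusion feature map
```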
In the enhancement module, the nonlinear mapping module is first used to compute, from the first fusion feature map F(x_0), a more useful representation U(x_0), i.e., the second fusion feature map. Then, U(x_0) is input into a reconstruction module consisting of at least two convolution layers to predict (learn) an enhanced residual map R(x_0). Note that the enhanced residual map refers to the difference between the original frame and the frame to be enhanced. Finally, the learned enhanced residual map R(x_0) and the input x_0 are summed element-wise to obtain the enhanced low-quality frame, i.e., the enhanced frame equals x_0 + R(x_0).
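The reconstruction step can likewise be sketched as a small stack of convolutions followed by the element-wise sum with x_0; the layer count and widths below are illustrative assumptions, and the nonlinear mapping module that produces U(x_0) is sketched separately further on.

```python
class Reconstruction(nn.Module):
    """Sketch of the reconstruction module: predict the residual R(x_0) and add it to x_0."""
    def __init__(self, feat=32, out_ch=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(feat, feat, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat, out_ch, 3, padding=1),
        )

    def forward(self, u, x0):
        residual = self.body(u)   # enhanced residual map R(x_0)
        return x0 + residual      # element-wise sum -> enhanced low-quality frame
```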
In a possible implementation, the nonlinear mapping module is constructed based on a layered residual and a channel attention mechanism, i.e. the nonlinear mapping module is a nonlinear mapping module based on a layered residual and a channel attention mechanism.
Further, in the embodiment of the present invention, the nonlinear mapping module includes a plurality of layered residual modules (Hierarchical Residual block, HR_block) and channel attention mechanism modules (Channel Attention block, CA_block). Referring to fig. 3 and fig. 4, the network structure of the HR_block used is described below in terms of its input feature map and the processing of that input.
For ease of representation, the feature map input to the HR_block is denoted Z (i.e., U(x_0)). First, Z is input into a residual block consisting of at least two convolution layers and ReLU activation functions to obtain Z_S:

Z_S = Conv(ReLU(Conv(ReLU(Z)))) + Z;

Then, the feature map is layered by a downsampling layer to obtain the downsampled feature map Z_D, which is input into another residual block; in this block the convolution operation adopts bottleneck convolution, and a CA_block is added to the residual block to extract the information of the layered-feature residual on different channels. This residual block with the added CA_block is referred to as RA_block. The residual output of the RA_block then passes through an upsampling layer to obtain an upsampled feature map. Finally, this upsampled feature map and Z_S are spliced along the channel dimension (i.e., pass through a Concat layer) and input into a convolution layer to obtain the feature map HR(Z) output by the layered residual module.

It should be noted that the CA_block output of the RA_block also passes through an upsampling layer of the same structure, and the resulting feature map serves as an input to the subsequent network.
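The HR_block described above can be sketched as follows, continuing the PyTorch illustration. Several points are assumptions rather than statements of the patent: downsampling is realised with a stride-2 convolution and upsampling with sub-pixel convolution (PixelShuffle), the RA branch uses a plain residual block instead of a bottleneck convolution, the spatial size is assumed divisible by 2, and `CABlock` refers to the channel-attention sketch given after the CA_block description below.

```python
class ResidualBlock(nn.Module):
    """Residual block R: Z_S = Conv(ReLU(Conv(ReLU(Z)))) + Z."""
    def __init__(self, feat=32, ksize=5):
        super().__init__()
        pad = ksize // 2
        self.conv1 = nn.Conv2d(feat, feat, ksize, padding=pad)
        self.conv2 = nn.Conv2d(feat, feat, ksize, padding=pad)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, z):
        return self.conv2(self.relu(self.conv1(self.relu(z)))) + z


class HRBlock(nn.Module):
    """Sketch of the layered residual module HR_block."""
    def __init__(self, feat=32):
        super().__init__()
        self.res = ResidualBlock(feat, ksize=5)                              # residual block R
        self.down = nn.Conv2d(feat, feat, 3, stride=2, padding=1)           # downsampling layer (assumed)
        self.ra_res = ResidualBlock(feat, ksize=3)                           # residual part of RA
        self.ra_ca = CABlock(feat)                                           # CA_block inside RA
        self.up1 = nn.Sequential(nn.Conv2d(feat, 4 * feat, 3, padding=1),   # upsampling layers (assumed)
                                 nn.PixelShuffle(2))
        self.up2 = nn.Sequential(nn.Conv2d(feat, 4 * feat, 3, padding=1),
                                 nn.PixelShuffle(2))
        self.out_conv = nn.Conv2d(2 * feat, feat, 3, padding=1)

    def forward(self, z):
        z_s = self.res(z)
        z_d = self.down(z)
        ra_out1 = self.ra_res(z_d)      # first output of RA (residual output)
        ra_out2 = self.ra_ca(ra_out1)   # second output of RA (CA_block output)
        hr = self.out_conv(torch.cat([self.up1(ra_out1), z_s], dim=1))  # HR(Z)
        side = self.up2(ra_out2)        # auxiliary output passed to the subsequent network
        return hr, side
```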
Referring to fig. 3, as a possible implementation, the preferred network structure of the nonlinear mapping module in the embodiment of the present invention includes three HR_blocks and two CA_blocks. The input feature map of the first HR_block is Z; the output feature map HR(Z) of the first HR_block is subtracted from the feature map Z and the result is input into the second HR_block; the output feature map of the second HR_block is added to the feature map Z and input into the first CA_block; and the input feature map of the first CA_block is added to its output feature map and input into the third HR_block. Then, the feature map Z, the feature map HR(Z) output by the first HR_block and the feature map input to the third HR_block are fused through a Concat layer to obtain a third fusion feature map; the auxiliary upsampled output of the first HR_block, the output feature map of the first CA_block and the auxiliary upsampled output of the third HR_block are fused through a Concat layer (channel splicing) to obtain a fourth fusion feature map. Finally, the third fusion feature map, after one convolution layer, is added to the fourth fusion feature map, after one convolution layer and the second CA_block in sequence, to obtain the output feature map of the nonlinear mapping module.
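The wiring of this preferred nonlinear mapping module can then be sketched as below, reusing the HRBlock above and the CABlock sketched after the channel-attention description. Where the text is ambiguous, the fourth fusion feature map is read here as combining the auxiliary upsampled outputs of the first and third HR_blocks with the output of the first CA_block; this reading, the discarded main output of the third HR_block and the channel widths are assumptions.

```python
class NonlinearMapping(nn.Module):
    """Sketch of the nonlinear mapping module: three HR_blocks and two CA_blocks."""
    def __init__(self, feat=32):
        super().__init__()
        self.hr1, self.hr2, self.hr3 = HRBlock(feat), HRBlock(feat), HRBlock(feat)
        self.ca1, self.ca2 = CABlock(feat), CABlock(feat)
        self.conv3rd = nn.Conv2d(3 * feat, feat, 3, padding=1)  # after the third fusion feature map
        self.conv4th = nn.Conv2d(3 * feat, feat, 3, padding=1)  # after the fourth fusion feature map

    def forward(self, z):
        hr1, side1 = self.hr1(z)
        hr2, _ = self.hr2(z - hr1)                  # Z minus HR(Z) of the first block
        ca1_in = hr2 + z
        ca1 = self.ca1(ca1_in)
        hr3_in = ca1_in + ca1
        _, side3 = self.hr3(hr3_in)                 # main output of the third block unused here (assumption)
        fused3 = torch.cat([z, hr1, hr3_in], dim=1)       # third fusion feature map
        fused4 = torch.cat([side1, ca1, side3], dim=1)    # fourth fusion feature map
        return self.conv3rd(fused3) + self.ca2(self.conv4th(fused4))  # U(x_0)
```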
Preferably, the convolution kernel size of the convolution layer employed in the Residual block is 5×5, and the convolution kernel size of the convolution layer outputting the feature map HR (Z) is 3×3.
Preferably, the downsampling strategy of the downsampling layer may follow the literature "Sajjadi M S M, Vemulapalli R, Brown M. Frame-recurrent video super-resolution [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6626-6634" or "Sun W, He X, Chen H, et al. A quality enhancement framework with noise distribution characteristics for high efficiency video coding [J]. Neurocomputing, 2020, 411: 428-441"; the present invention does not specifically limit the downsampling strategy of the downsampling layer. Likewise, the upsampling strategy of the upsampling layer is not particularly limited and may follow, for example, the sub-pixel convolution of "Shi W, Caballero J, Huszár F, et al. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 1874-1883", in which a feature map is rearranged into its upsampled counterpart.
In one possible implementation, referring to fig. 5, the structure of the CA_block used comprises a pooling layer, a convolution layer with its activation function (the number of such layers being adjustable), a convolution layer and a sigmoid activation function connected in sequence. The output of the sigmoid activation function is multiplied by the input of the CA_block to obtain the output of the CA_block.
To facilitate the description, V = [v_1, v_2, ..., v_C] is defined to represent the input feature map of the CA_block, which consists of C (the number of input channels) feature maps v_c of spatial size H×W. These feature maps are reduced in the spatial dimension by a pooling layer whose pooling mode is average pooling, giving the pooling result of the c-th channel:

e_c = (1 / (H × W)) · Σ_i Σ_j v_c(i, j),

where v_c(i, j) is the value of the c-th channel at position (i, j) and e_c is the averaged value of the c-th channel. A sigmoid activation function is then used as a gating mechanism to obtain the most relevant feature maps:

α_c = σ(Conv(ReLU(Conv(e_c)))),

where σ denotes the sigmoid activation function and E = [e_1, e_2, ..., e_C]. Finally, the channel weights α = [α_1, α_2, ..., α_C] are used to readjust the interdependence between the channels of the input feature map V, i.e., the c-th channel of the output is α_c · v_c.

As a preferred mode, the number of input channels of the CA_block is C and the number of output channels of its first convolution layer is set to a reduced value; the kernel size of the pooling layer of the CA_block is 1×1, and the convolution kernel size is 1×1.
In the embodiment of the invention, the network parameters of the quality enhancement network can be trained in the conventional manner of training neural network parameters. As a preferred mode, the preprocessing module and the enhancement module of the quality enhancement network provided by the embodiment of the invention are trained jointly in an end-to-end manner; the network does not need any sub-network to be trained to convergence first, so the loss function consists of a single term. For example, the L2 norm is used as the loss function of the network, i.e., the squared L2 distance between the enhanced frame and the original frame y_0.
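A minimal end-to-end training sketch under the L2 loss described above is given below. The `QualityEnhancementNet` wrapper simply chains the modules sketched earlier; the Adam optimizer, learning rate, tensor shapes and single training step are placeholders and assumptions, not prescriptions of the patent.

```python
import torch.optim as optim

class QualityEnhancementNet(nn.Module):
    """Sketch chaining the preprocessing module and the enhancement module."""
    def __init__(self, feat=32):
        super().__init__()
        self.extract = FeatureExtraction(feat=feat)
        self.fusion = FeatureFusion(feat=feat)
        self.mapping = NonlinearMapping(feat=feat)
        self.recon = Reconstruction(feat=feat)

    def forward(self, x_prev, x_cur, x_next):
        f = self.fusion(self.extract(x_prev), self.extract(x_cur), self.extract(x_next))
        return self.recon(self.mapping(f), x_cur)

model = QualityEnhancementNet()
optimizer = optim.Adam(model.parameters(), lr=1e-4)  # optimizer and learning rate are assumptions
criterion = nn.MSELoss()                             # squared L2 distance between frames

# placeholder tensors standing in for {x_-1, x_0, x_+1} and the original frame y_0
x_prev, x_cur, x_next = (torch.rand(1, 1, 64, 64) for _ in range(3))
y_true = torch.rand(1, 1, 64, 64)

enhanced = model(x_prev, x_cur, x_next)
loss = criterion(enhanced, y_true)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```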
in an exemplary embodiment, the present invention also provides a computer device including a processor and a memory having at least one computer program stored therein. The at least one computer program is loaded and executed by one or more processors to implement any of the methods for compressed video multi-frame quality enhancement described above without motion compensation.
In an exemplary embodiment, the present invention further provides a computer readable storage medium having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor of a computer device to implement any of the above methods for compressed video multi-frame quality enhancement without motion compensation.
In one possible implementation, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, an optical data storage device, or the like.
It should be noted that the terms "first", "second" and the like in the description and in the claims are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order; it should be understood that the data so used may be interchanged where appropriate. The embodiments described herein do not represent all embodiments consistent with the invention; they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.
In order to qualitatively and quantitatively evaluate the enhancement performance of the compressed video multi-frame quality enhancement method without motion compensation provided by the embodiment of the present invention, the quality enhancement network shown in fig. 1 to 5 is used in this embodiment to perform quality enhancement on the selected compressed videos to be enhanced. The quantitative evaluation is compared with the existing enhancement schemes DCAD, DS-CNN, MFQE and MFQE 2.0 in terms of ΔPSNR and ΔSSIM, as shown in Tables 1 and 2; the qualitative assessment is compared with MFQE 2.0, as shown in fig. 6. For the enhancement scheme DCAD, refer to "Wang T, Chen M, Chao H. A novel deep learning-based method of improving coding efficiency from the decoder-end for HEVC [C]// 2017 Data Compression Conference (DCC). IEEE, 2017: 410-419"; for the enhancement scheme DS-CNN, refer to "Yang R, Xu M, Wang Z. Decoder-side HEVC quality enhancement with scalable convolutional neural network [C]// 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017: 817-822"; for the enhancement scheme MFQE, refer to "Yang R, Xu M, Wang Z, et al. Multi-frame quality enhancement for compressed video [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6664-6673".
TABLE 1
TABLE 2
Table 1 gives, for the HEVC standard test sequences at a total of five QP points, the overall comparison of ΔPSNR (dB) and ΔSSIM (×10⁻⁴) averaged over all frames of each test sequence. It can be seen that the enhancement scheme provided by the embodiment of the present invention (Ours) consistently outperforms the other video quality enhancement methods. Specifically, at QP = 37 the maximum ΔPSNR of the present embodiment reaches 1.154 dB and the average ΔPSNR is 0.632 dB, which is 12.5% higher than MFQE 2.0 (0.562 dB), 38.9% higher than MFQE (0.455 dB), 96.3% higher than DCAD (0.322 dB) and 110.7% higher than DS-CNN (0.300 dB). At the other QP points, the embodiment of the present invention is also superior to the other methods in both ΔPSNR and ΔSSIM. In addition, the BD-rate reduction is used to compare the performance of the networks; as shown in Table 2, the average BD-rate reduction of our network is 17.18%, which is better than the currently best MFQE 2.0 (14.06%).
Table 2 gives the BD-rate (%) reduction of each test sequence relative to the HEVC baseline, calculated over the five points QP = 22, 27, 32, 37 and 42.
Fig. 6 shows the subjective quality of the sequences BasketballPass, PartyScene and BQMall at QP = 37. As can be seen from the figure, compared with the MFQE 2.0 method, the enhancement scheme provided by the embodiment of the invention removes more compression artifacts and achieves a better visual experience.
For lossless video, quality fluctuates between frames after compression, as shown in fig. 7 (the curve corresponding to HEVC). In the enhancement scheme MFQE 2.0 (see "Guan Z, Xing Q, Xu M, et al. MFQE 2.0: A new approach for multi-frame quality enhancement on compressed video [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019"), which uses adjacent high-quality frames to enhance the intermediate low-quality frames, the PSNR fluctuation between high-quality and low-quality frames remains obvious, as shown in fig. 7. The PSNR fluctuation of the scheme provided by the embodiment of the present invention (Ours) is also shown in fig. 7. It can be seen that, compared with the existing MFQE 2.0, the enhancement method provided by the embodiment of the present invention makes better use of the information of the high-quality frames and exhibits lower fluctuation of the enhanced quality; moreover, MFQE 2.0 is itself an improvement of the enhancement scheme of publication No. CN108307193A. In other words, the compressed video multi-frame quality enhancement network without motion compensation provided by the embodiment of the invention enhances the low-quality frames by fully utilizing the adjacent high-quality frames, so that the subjective and objective quality of the compressed video is significantly enhanced.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and are not limiting. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical scheme described in the foregoing embodiments can still be modified, or some of its technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the invention.
Claims (7)
1. A method of motion-compensation-free multi-frame quality enhancement of compressed video, comprising:
for the compressed video to be enhanced, forming an input sequence from one low-quality frame of the compressed video sequence and its two adjacent high-quality frames;
inputting each input sequence into a quality enhancement network to obtain an enhanced low-quality frame of the current input sequence;
the quality enhancement network comprises a preprocessing module and an enhancement module, wherein the preprocessing module comprises a feature extraction module and a feature fusion module, the feature extraction module is used for extracting the spatial features of each frame of the input sequence to obtain a feature map of each frame, and the fusion module is used for fusing the temporal information between adjacent frames to obtain a first fusion feature map, which is input into the enhancement module;
the enhancement module comprises a nonlinear mapping module and a reconstruction module, wherein the nonlinear mapping module is used for performing nonlinear mapping on the first fusion feature map to obtain a second fusion feature map, which is input into the reconstruction module; the reconstruction module is formed by stacking at least two convolution layers and is used for predicting an enhanced residual map, and the enhanced low-quality frame is then obtained by fusing the enhanced residual map with the low-quality frame of the input sequence;
the nonlinear mapping module is a nonlinear mapping module based on layered residuals and a channel attention mechanism; the nonlinear mapping module comprises a plurality of layered residual modules HR_block, each HR_block comprising a downsampling layer, a residual block R, a residual block RA, a convolution layer and an upsampling layer, wherein the residual block R comprises at least two convolution layers, and the residual block RA is the residual block R with a channel attention mechanism module CA_block added to extract the information of the residual of the layered features on different channels; the residual block RA comprises two outputs, the residual output being defined as the first output of the residual block RA, and the CA_block output of the residual block RA being its second output; the feature map input to the HR_block is denoted Z; the feature map Z passes through the first residual block to obtain a feature map Z_S; meanwhile, the feature map Z passes through the downsampling layer to obtain a feature map Z_D, which is input into the second residual block; one feature map is obtained based on the first output of the second residual block and another feature map is obtained based on the second output of the second residual block; the two feature maps pass through upsampling layers of identical structure to obtain the corresponding upsampled feature maps; and the upsampled feature map corresponding to the first output is spliced with Z_S along the channel dimension and input into a convolution layer to obtain the feature map HR(Z) output by the layered residual module.
2. The method of claim 1, wherein the feature extraction module is a feature extraction network based on a multi-scale feature extraction strategy.
3. The method of claim 1, wherein the convolution kernel size of the convolution layer employed in the residual block R is 5 x 5 and the convolution kernel size of the last layer of the hr_block is 3 x 3.
4. The method of claim 1, wherein the nonlinear mapping module comprises three hr_blocks and two channel attention mechanism modules ca_blocks, and wherein the three hr_blocks are sequentially defined as a first hr_block, a second hr_block, and a third hr_block, and the two ca_blocks are sequentially defined as a first ca_block and a second ca_block according to a forward propagation direction of the network;
the input feature map of the first HR_block is a feature map Z, the feature map HR (Z) output by the first HR_block is subtracted from the feature map Z, then a second HR_block is input, the feature map HR (Z) output by the second HR_block is added with the feature map Z, then a first CA_block is input, and the input feature map of the first CA_block is added with the output feature map of the first CA_block and then a third HR_block is input; the feature map Z, the feature map HR (Z) output by the first HR_block and the feature map input by the third HR_block are subjected to splicing layer to obtain a third fusion feature map; and outputting a feature map of the first hr_blockAn output characteristic diagram of the first CA_block and a characteristic diagram of the third HR_block output +.>Obtaining a fourth fusion feature map through the splicing layer; and finally, adding the third fusion characteristic diagram passing through one convolution layer and the four fusion characteristic diagrams passing through one convolution layer and one second CA_block in sequence to obtain an output characteristic diagram of the nonlinear mapping module. />
5. The method according to any one of claims 1 to 4, wherein the network structure of the ca_block is: the method comprises the steps of sequentially connecting a pooling layer, at least 1 convolution block, a convolution layer and a sigmoid activation function, wherein the convolution block comprises the convolution layer and the activation function thereof.
7. The method of claim 2, wherein the feature extraction network is a 4-layer convolution structure, a ReLU activation function is set after each of the first three convolution layers, and along the forward propagation direction of the network the convolution kernel sizes of the convolution layers are set to 3×3, 5×5, 7×7 and 3×3 in sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110654128.XA CN113507607B (en) | 2021-06-11 | 2021-06-11 | Compressed video multi-frame quality enhancement method without motion compensation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110654128.XA CN113507607B (en) | 2021-06-11 | 2021-06-11 | Compressed video multi-frame quality enhancement method without motion compensation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113507607A CN113507607A (en) | 2021-10-15 |
CN113507607B true CN113507607B (en) | 2023-05-26 |
Family
ID=78009890
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110654128.XA Active CN113507607B (en) | 2021-06-11 | 2021-06-11 | Compressed video multi-frame quality enhancement method without motion compensation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113507607B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108307193A (en) * | 2018-02-08 | 2018-07-20 | 北京航空航天大学 | A kind of the multiframe quality enhancement method and device of lossy compression video |
CN108989731A (en) * | 2018-08-09 | 2018-12-11 | 复旦大学 | A method of improving video spatial resolution |
US10701394B1 (en) * | 2016-11-10 | 2020-06-30 | Twitter, Inc. | Real-time video super-resolution with spatio-temporal networks and motion compensation |
CN111524068A (en) * | 2020-04-14 | 2020-08-11 | 长安大学 | Variable-length input super-resolution video reconstruction method based on deep learning |
CN112381866A (en) * | 2020-10-27 | 2021-02-19 | 天津大学 | Attention mechanism-based video bit enhancement method |
CN112613356A (en) * | 2020-12-07 | 2021-04-06 | 北京理工大学 | Action detection method and device based on deep attention fusion network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9195903B2 (en) * | 2014-04-29 | 2015-11-24 | International Business Machines Corporation | Extracting salient features from video using a neurosynaptic system |
EP3259911B1 (en) * | 2015-02-19 | 2021-04-07 | Magic Pony Technology Limited | Enhancing visual data using updated neural networks |
-
2021
- 2021-06-11 CN CN202110654128.XA patent/CN113507607B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10701394B1 (en) * | 2016-11-10 | 2020-06-30 | Twitter, Inc. | Real-time video super-resolution with spatio-temporal networks and motion compensation |
CN108307193A (en) * | 2018-02-08 | 2018-07-20 | 北京航空航天大学 | A kind of the multiframe quality enhancement method and device of lossy compression video |
CN108989731A (en) * | 2018-08-09 | 2018-12-11 | 复旦大学 | A method of improving video spatial resolution |
CN111524068A (en) * | 2020-04-14 | 2020-08-11 | 长安大学 | Variable-length input super-resolution video reconstruction method based on deep learning |
CN112381866A (en) * | 2020-10-27 | 2021-02-19 | 天津大学 | Attention mechanism-based video bit enhancement method |
CN112613356A (en) * | 2020-12-07 | 2021-04-06 | 北京理工大学 | Action detection method and device based on deep attention fusion network |
Also Published As
Publication number | Publication date |
---|---|
CN113507607A (en) | 2021-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109309834B (en) | Video compression method based on convolutional neural network and HEVC compression domain significant information | |
Zhang et al. | Machine learning based video coding optimizations: A survey | |
CN111260560B (en) | Multi-frame video super-resolution method fused with attention mechanism | |
CN113055674B (en) | Compressed video quality enhancement method based on two-stage multi-frame cooperation | |
CN113592746B (en) | Method for enhancing quality of compressed video by fusing space-time information from coarse to fine | |
CN110852964A (en) | Image bit enhancement method based on deep learning | |
CN1695381A (en) | Sharpness enhancement in post-processing of digital video signals using coding information and local spatial features | |
CN112381866B (en) | Attention mechanism-based video bit enhancement method | |
CN110177282B (en) | Interframe prediction method based on SRCNN | |
Ma et al. | CVEGAN: a perceptually-inspired gan for compressed video enhancement | |
CN117730338A (en) | Video super-resolution network and video super-resolution, encoding and decoding processing method and device | |
CN114173131B (en) | Video compression method and system based on inter-frame correlation | |
CN116916036A (en) | Video compression method, device and system | |
CN115442613A (en) | Interframe information-based noise removal method using GAN | |
CN113507607B (en) | Compressed video multi-frame quality enhancement method without motion compensation | |
CN114827616B (en) | Compressed video quality enhancement method based on space-time information balance | |
CN114511485B (en) | Compressed video quality enhancement method adopting cyclic deformable fusion | |
Cui et al. | Convolutional neural network-based post-filtering for compressed YUV420 images and video | |
CN113691817B (en) | Cross-frame information fusion screen content video quality enhancement method | |
Ulas et al. | Flexible luma-chroma bit allocation in learned image compression for high-fidelity sharper images | |
CN115002482A (en) | End-to-end video compression method and system using structural preservation motion estimation | |
CN112819707B (en) | End-to-end anti-blocking effect low-illumination image enhancement method | |
Jung | Comparison of video quality assessment methods | |
CN114554213B (en) | Motion adaptive and detail-focused compressed video quality enhancement method | |
Ding et al. | Blind Quality Enhancement for Compressed Video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |