CN108900848B - Video quality enhancement method based on self-adaptive separable convolution - Google Patents

Video quality enhancement method based on self-adaptive separable convolution

Info

Publication number
CN108900848B
CN108900848B CN201810603510.6A
Authority
CN
China
Prior art keywords
convolution
layer
image
video
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810603510.6A
Other languages
Chinese (zh)
Other versions
CN108900848A (en)
Inventor
高钦泉
聂可卉
刘文哲
童同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Deshi Technology Group Co ltd
Original Assignee
Fujian Imperial Vision Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Imperial Vision Information Technology Co ltd filed Critical Fujian Imperial Vision Information Technology Co ltd
Priority to CN201810603510.6A priority Critical patent/CN108900848B/en
Publication of CN108900848A publication Critical patent/CN108900848A/en
Application granted granted Critical
Publication of CN108900848B publication Critical patent/CN108900848B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/86Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/577Motion compensation with bidirectional frame interpolation, i.e. using B-pictures

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a video quality enhancement method based on adaptive separable convolution. Adaptive separable convolution is applied as the first module in the network model: each two-dimensional convolution is converted into a pair of one-dimensional convolution kernels in the horizontal and vertical directions, so the parameter count drops from n² to n + n. Secondly, motion-vector estimation is realized with convolution kernels that the network learns adaptively for different inputs: two consecutive frames are selected as the network input, a pair of separable two-dimensional convolution kernels is obtained for each pair of consecutive inputs, and these 2-D kernels are then unfolded into four 1-D convolution kernels that change with the input, improving the adaptivity of the network. By replacing two-dimensional convolution kernels with one-dimensional ones, the invention reduces the parameters of the network training model and achieves high execution efficiency.

Description

Video quality enhancement method based on self-adaptive separable convolution
Technical Field
The invention relates to the field of image processing and deep learning technology, in particular to a video quality enhancement method based on adaptive separable convolution.
Background
Removing compression artifacts from images and video is a classical problem in computer vision. The goal is to estimate a lossless image from a compressed image or video. In the age of information explosion, the number of images and videos spread on the Internet and mobile phones grows day by day, and lossy compression technologies such as JPEG and WebP are widely applied on platforms such as news websites, WeChat and Weibo to reduce file size and thereby save bandwidth and transmission time. Images and videos used in web pages need to be compressed as much as possible to speed up page loading and improve user experience. These compression algorithms, however, typically introduce compression artifacts such as blocking, contouring, blurring, ringing, and the like. Generally, the larger the compression factor, the more severe the video degradation caused by these artifacts, resulting in loss of video information and directly affecting the visual experience of the user. Therefore, there is increasing interest in how to recover visually high-quality, artifact-free images and videos.
In recent years, with the development of deep learning, more and more techniques have been applied to improving the visual quality of compressed images and videos. For example, Dong et al. [1] used a 3-layer convolutional neural network (AR-CNN) to remove the artifacts of JPEG-compressed images and obtained a good decompression effect. Yang et al. then proposed DS-CNN [2,3] for video quality enhancement. However, none of the above video quality enhancement methods utilizes information between adjacent frames, so their network performance is limited to a large extent. Recently, Yang et al. further proposed the MFQE algorithm [4], which observes that, because the quality of each frame in a compressed video fluctuates greatly, the information in a high-quality frame can be used to enhance the quality of its neighboring low-quality frames. However, that method relies on an optical-flow estimation network to estimate the motion between frames, and because the ground-truth value of the motion estimation is difficult to obtain, the effect is not obvious.
Disclosure of Invention
The invention aims to provide a video quality enhancement method based on self-adaptive separable convolution aiming at the problem of artifacts generated by high-degree compression of a video, so that various artifacts in a compressed video are effectively removed, and the video quality and the visual effect are obviously improved.
The technical scheme adopted by the invention is as follows:
a video quality enhancement method based on adaptive separable convolution adopts a system network comprising an adaptive separable convolution network and a residual error network, wherein the adaptive separable convolution network is used for obtaining a motion compensation frame, and the residual error network is used for removing a compression artifact of a video frame so as to enhance the video quality; the video quality enhancement method comprises the following specific steps:
step 1, selecting high-quality videos to form a video database [4,5,6 ].
Step 2, preprocessing the video database to form a training data set; the training data set is composed of a pairing set of a number of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}, wherein I_t^c represents the current frame of the compressed video, I_{t+1}^c represents the next frame of the compressed video, I_t^gt represents the current frame of the high-definition video, and I_{t+1}^gt represents the next frame of the high-definition video;
Step 3, inputting two continuous compressed video frames I_t^c and I_{t+1}^c, and obtaining the predicted compressed video frame I'_{t+1}^c of the next frame I_{t+1}^c using the separable convolution network;
Step 4, simultaneously performing normalization and Y-channel processing on the predicted compressed video frame I'_{t+1}^c obtained by the adaptive separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set, and the uncompressed image I_{t+1}^gt;
Step 5, inputting the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c, and obtaining the predicted high-definition video frame I'_{t+1}^gt using the residual network model;
Step 6: calculating the overall cost function based on the predicted compressed video frame I'_{t+1}^c and the predicted high-definition video frame I'_{t+1}^gt;
and 7: and continuously updating and optimizing the overall cost function to obtain the optimal convolution weight parameters and the optimal bias parameters until the optimal effect is obtained.
Further, step 2 specifically includes the following steps:
step 2-1, setting a quality coefficient qp according to the latest HEVC standard, and compressing the original video by using an ffmpeg command to ensure that each high-definition video has a corresponding video with a compression artifact;
step 2-2, respectively carrying out frame extraction on the high-definition video and the compressed video to obtain a high-definition image set and a corresponding compressed image set;
Step 2-3, taking two continuous images from the compressed image set each time and cropping them to size d × d to obtain the compressed video frames I_t^c and I_{t+1}^c; since the task is to remove compression artifacts of video, inter-frame similarity should be considered;
Step 2-4, simultaneously taking the two corresponding images from the high-definition image set and performing the same operation to obtain the high-definition video frames I_t^gt and I_{t+1}^gt, forming a pairing set of a number of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt};
Step 2-5, randomly shuffling the order of the video frames in the pairing set to obtain the training data set of the network model; a sketch of this preprocessing pipeline is given below.
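The following is a minimal sketch of steps 2-1 to 2-5, assuming ffmpeg (with libx265) and OpenCV are available; the file names, the quality coefficient qp and the patch size d are illustrative values, not the patent's exact configuration.

import random
import subprocess

import cv2  # OpenCV, used here only for frame extraction and cropping


def compress_video(src, dst, qp=37):
    # Step 2-1: compress the original video with HEVC at a fixed quality coefficient qp.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx265", "-x265-params", f"qp={qp}", dst],
        check=True,
    )


def extract_frames(path):
    # Step 2-2: decode a video into a list of frames.
    cap, frames = cv2.VideoCapture(path), []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    return frames


def build_pairs(hd_path, qp=37, d=128):
    # Steps 2-3 / 2-4: crop co-located d x d patches from two consecutive compressed
    # frames and from the corresponding high-definition frames (frames assumed larger than d).
    cmp_path = hd_path + f".qp{qp}.mp4"
    compress_video(hd_path, cmp_path, qp)
    hd, cp = extract_frames(hd_path), extract_frames(cmp_path)
    pairs = []
    for t in range(len(cp) - 1):
        h, w = cp[t].shape[:2]
        y, x = random.randrange(h - d), random.randrange(w - d)
        pairs.append((cp[t][y:y + d, x:x + d], cp[t + 1][y:y + d, x:x + d],
                      hd[t][y:y + d, x:x + d], hd[t + 1][y:y + d, x:x + d]))
    random.shuffle(pairs)  # Step 2-5: randomly shuffle the paired patches
    return pairs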
Further, the separable convolutional neural network in the step 3 comprises five encoding modules, four decoding modules, a separable convolutional module and an image prediction module;
further, step 3 specifically includes the following steps:
Step 3.1, each encoding module comprises three convolution layers and one average pooling layer; the calculation formula of a convolution layer is:

a_{i,j} = f( Σ_m Σ_n w_{m,n} · x_{i+m, j+n} + w_b )    (1)

wherein x_{i,j} represents the pixel in row i, column j of the image, w_{m,n} represents the weight in row m, column n of the filter, w_b represents the bias term of the filter, a_{i,j} represents row i, column j of the obtained feature map, and f represents the activation function relu;
The formula for the average pooling layer is:

h_m = (1/N) Σ_{i=1}^{N} α_i    (2)

wherein α_i represents the value of the i-th pixel point in the taken neighborhood (after normalization, α_i ranges from 0 to 1), N represents the total number of pixel points in the neighborhood, and h_m represents the result of pooling all pixel points in the neighborhood; a minimal sketch of such an encoding module follows.
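Below is a small PyTorch sketch of one encoding module of step 3.1: three 3 × 3 convolution layers (each followed by relu, i.e. formula (1)) and one 2 × 2 average pooling layer (formula (2)). The channel widths are assumed values for illustration only.

import torch.nn as nn


class EncodingModule(nn.Module):
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AvgPool2d(kernel_size=2)  # averages each 2x2 neighborhood (N = 4)

    def forward(self, x):
        feat = self.body(x)            # a_{i,j} = relu(sum_{m,n} w_{m,n} x_{i+m,j+n} + w_b)
        return self.pool(feat), feat   # pooled output, plus pre-pool feature for skip connections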
Step 3.2, each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer; the output of the last encoding module is used as the input of the first decoding module, and the output of each decoding module is then used as the input of the next decoding module; the calculation formula of the convolution layers of the decoding module is the same as that of the encoding module;
the computation process of the bilinear upsampling layer is as follows:
Step 3.2.1, for each obtained feature map, in order to obtain the value of the unknown function f at the point p = (x, y), linear interpolation is first performed in the x direction:

f(R1) ≈ ((x2 − x)/(x2 − x1)) · f(Q11) + ((x − x1)/(x2 − x1)) · f(Q21), where R1 = (x, y1)    (3)
f(R2) ≈ ((x2 − x)/(x2 − x1)) · f(Q12) + ((x − x1)/(x2 − x1)) · f(Q22), where R2 = (x, y2)    (4)

wherein Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function;
Step 3.2.2, linear interpolation is then performed in the y direction:

f(p) ≈ ((y2 − y)/(y2 − y1)) · f(R1) + ((y − y1)/(y2 − y1)) · f(R2)    (5)

so that the desired interpolation result is obtained:

f(x, y) ≈ [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / ((x2 − x1)(y2 − y1))    (6)

that is, the value f(x, y) of the pixel point p = (x, y) to be predicted in the feature map is obtained through the bilinear interpolation function f; a small numeric check of this interpolation is given below.
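The interpolation in formulas (3)-(6) can be checked with a few lines of plain Python; the four corner values in the example are arbitrary. In a PyTorch implementation the bilinear upsampling layer itself can be realized with torch.nn.Upsample(scale_factor=2, mode='bilinear').

def bilinear(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    # Interpolate along x at y1 and y2 (formulas (3) and (4)).
    f_r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    f_r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # Interpolate the two results along y (formulas (5)-(6)).
    return (y2 - y) / (y2 - y1) * f_r1 + (y - y1) / (y2 - y1) * f_r2


# Example: the center of a unit cell is the average of its four corner values.
print(bilinear(0.5, 0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0))  # -> 2.5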
Step 3.3, adding skip connections between the decoder and the encoder: skip connections are respectively applied between the third convolution layer of the 2nd, 3rd, 4th and 5th encoding modules and the bilinear upsampling layer of the corresponding 4th, 3rd, 2nd and 1st decoding modules, and the output features of the encoding module and the decoding module are added to obtain the combined features;
step 3.4, the separable convolution module comprises four sub-networks, wherein each sub-network consists of three convolution layers and a bilinear up-sampling layer; the method comprises the following specific steps:
Step 3.4.1, the output of steps 3.1-3.3 is expanded into two adaptive convolution kernels which perform convolution operations on the two continuous input frames respectively:

I_gt(x, y) = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)    (7)

wherein K1(x, y) and K2(x, y) respectively represent the two-dimensional convolution kernels predicted by the separable convolution model, P1(x, y) and P2(x, y) represent the pixel values of the two continuous input frames, and * represents the convolution operation;
Step 3.4.2, each two-dimensional adaptive convolution kernel is expanded into 2 one-dimensional convolution kernels along the horizontal and vertical directions, <K1_v(x,y), K1_h(x,y)> and <K2_v(x,y), K2_h(x,y)>, to obtain four adaptive one-dimensional convolution kernels;
Step 3.4.3, the convolution of two one-dimensional convolution kernels can approximate a two-dimensional convolution kernel:

K1(x, y) ≈ K1_h(x, y) * K1_v(x, y)
K2(x, y) ≈ K2_h(x, y) * K2_v(x, y)    (8)
Step 3.4.4, the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels perform convolution operations on the input current frame I1 and the next frame I2 in sequence, and the two results are added to obtain the output result, which is the compensation image of the next frame;
Step 3.5, the originally input current frame image P1(x, y) and second frame image P2(x, y) are convolved with the convolution kernels output by the adaptive separable convolution module to obtain the predicted image I_gt produced by the image prediction module (a sketch of this per-pixel separable convolution is given below):

I_gt = k1_h(x, y) * k1_v(x, y) * P1(x, y) + k2_h(x, y) * k2_v(x, y) * P2(x, y)    (9)
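The per-pixel application of the separable kernels in formula (9) can be sketched in PyTorch as below. The tensor shapes, the kernel length n (assumed odd) and the dense neighborhood gathering via unfold are illustrative assumptions; an efficient implementation would use a dedicated local-convolution operator, since this sketch materializes an n × n patch for every pixel.

import torch
import torch.nn.functional as F


def apply_separable_kernels(frame, k_v, k_h):
    # frame: (B, C, H, W); k_v, k_h: (B, n, H, W) per-pixel 1-D kernels, n assumed odd
    b, c, h, w = frame.shape
    n = k_v.shape[1]
    pad = n // 2
    # Gather the n x n neighborhood of every pixel: (B, C, n, n, H, W)
    patches = F.unfold(F.pad(frame, [pad] * 4), kernel_size=n)
    patches = patches.view(b, c, n, n, h, w)
    # The outer product k_v * k_h^T reconstructs the local 2-D kernel (formula (8)),
    # which is applied to the neighborhood of each pixel (formula (7)).
    return torch.einsum('bcvuhw,bvhw,buhw->bchw', patches, k_v, k_h)


def predict_next_frame(p1, p2, k1_v, k1_h, k2_v, k2_h):
    # Formula (9): convolve each input frame with its own separable kernels and sum the results.
    return apply_separable_kernels(p1, k1_v, k1_h) + apply_separable_kernels(p2, k2_v, k2_h)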
Further, the specific steps of step 4 are as follows:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
Step 4.2, taking the normalized RGB image and obtaining the Y-channel image according to the formula

Y = 0.257R + 0.564G + 0.098B + 16;

a short sketch of steps 4.1-4.2 follows.
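A short sketch of the normalization and Y-channel extraction, assuming H × W × 3 arrays in RGB order; rescaling the offset 16 to 16/255 after normalization is an assumption about how the two steps compose, so that both outputs stay in [0, 1].

import numpy as np


def normalize(img_uint8):
    # Step 4.1: each pixel value divided by 255 so that it lies in [0, 1].
    return img_uint8.astype(np.float32) / 255.0


def to_y_channel(rgb01):
    # Step 4.2: Y = 0.257R + 0.564G + 0.098B + 16, with the offset rescaled to the [0, 1] range.
    r, g, b = rgb01[..., 0], rgb01[..., 1], rgb01[..., 2]
    return 0.257 * r + 0.564 * g + 0.098 * b + 16.0 / 255.0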
Further, the residual network in step 5 comprises an initial convolution module, residual convolution modules and an image reconstruction module;
further, step 5 comprises the following processing steps:
Step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom-layer feature F1 is obtained by learning:

F1 = F_relu(W1 · X + B1)    (10)

wherein W1 and B1 are the weight and bias parameters of the initial convolution module, F_relu represents the relu activation function, and X represents the combined network input (the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c);
Step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer to the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:

F_k = W_k · F_relu(W_{k-1} F_{k-2} + B_{k-1}) + F_{k-2}    (11)
F_{k,k+2} = F_k + F_{k+2}    (12)

wherein F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu represents the relu activation function, W_k represents the weight of the k-th convolution layer, W_{k-1} and B_{k-1} represent the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2}; a minimal sketch of one residual convolution module follows.
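One residual convolution module of step 5.2 can be sketched in PyTorch as follows (conv, relu, conv, plus the skip addition of formula (11)); the channel width is an assumed value.

import torch.nn as nn


class ResidualModule(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_prev):
        # F_k = W_k( relu(W_{k-1} F_{k-2} + B_{k-1}) ) + F_{k-2}, with f_prev playing F_{k-2}  (11)
        return self.conv2(self.relu(self.conv1(f_prev))) + f_prev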
Step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:

F_g = W_M · F_relu(W_{M-1} F_{k,k+2} + B_{M-1}) + F_1    (13)

wherein F_1 is the bottom-layer feature obtained from (10), F_relu represents the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (12), W_M represents the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} represent the weight and bias parameters of the (M-1)-th convolution module.
Further, the calculation of the overall cost function of step 6 comprises the following steps:
Step 6.1, in the separable convolution network, the predicted compressed video frame I'_{t+1}^c of the next frame is compared with the original compressed video frame I_{t+1}^c of the next frame, and the Euclidean distance between the two is calculated:

Mse_loss = (1/num) Σ_{i=1}^{num} || I'_{t+1}^c(i) − I_{t+1}^c(i) ||²    (14)

wherein num denotes the number of all pixel blocks in each frame image.
Step 6.2, in the network for removing video-frame compression artifacts, the predicted high-definition video frame I'_{t+1}^gt is compared with the original high-definition video frame I_{t+1}^gt, and a Charbonnier penalty function is calculated:

Charbonnier_loss = (1/num) Σ_{i=1}^{num} sqrt( || I'_{t+1}^gt(i) − I_{t+1}^gt(i) ||² + ε² )    (15)

wherein num denotes the number of all pixel blocks in each frame image and ε is a regularization term used to preserve image edges, set empirically to 1e-3.
Step 6.3, adding the two loss functions to obtain an overall cost function:
Total_loss=Mse_loss+Charbonnier_loss (16)。
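The overall cost function of formulas (14)-(16) can be sketched in PyTorch as below; ε = 1e-3 follows the value stated above, and the ε² form inside the square root is the standard Charbonnier penalty assumed here.

import torch


def total_loss(pred_comp, target_comp, pred_hd, target_hd, eps=1e-3):
    mse_loss = torch.mean((pred_comp - target_comp) ** 2)                                   # formula (14)
    charbonnier_loss = torch.mean(torch.sqrt((pred_hd - target_hd) ** 2 + eps ** 2))        # formula (15)
    return mse_loss + charbonnier_loss                                                      # formula (16)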
By adopting the above technical scheme, the motion-compensated frame is obtained through the adaptive separable convolution network, and the compression artifacts of the video frame are removed through the residual network, thereby enhancing the video quality. This compression-artifact removal method, based on the adaptive separable convolution network model, can effectively remove various artifacts in compressed video and obviously improve video quality and visual effect.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
FIG. 1 is a schematic structural diagram of the video quality enhancement method based on adaptive separable convolution according to the present invention;
FIG. 2 is a comparison of the artifact removal effect on images of the "vidoo3" sequence in the JCT-VC HEVC standard test set between the prior-art MFQE and the present invention; the test video is compressed according to the latest HEVC standard with the quality coefficient QP set to 37.
Detailed Description
As shown in FIGS. 1-2, the present invention proposes a video enhancement method based on a separable convolution network. The network consists of two parts: the first part is a separable convolution network used to obtain the motion-compensated frame, and the second is a residual network used to remove the compression artifacts of video frames, thereby enhancing video quality. The overall network model is optimized with Adam; except for the one-dimensional convolution kernels of length 51 used in the 4 sub-networks of the separable convolution module, all other convolution layers use 3 × 3 convolution kernels. The specific steps are as follows:
step 1, selecting high-quality videos to form a video database. There were 7000 training data pictures.
Step 2, preprocessing the video database to form a training data set. According to the latest HEVC standard, a quality coefficient qp is set and the ffmpeg command is used to compress the original videos, so that each high-definition video has a corresponding video with compression artifacts. Frames are then extracted from the high-definition videos and the compressed videos respectively to obtain a high-definition image set and a corresponding compressed image set. Since the task is to remove compression artifacts of video, inter-frame similarity should be considered. Each time, two consecutive images are taken from the compressed image set and cropped to size d × d to obtain the video frames I_t^c and I_{t+1}^c; at the same time, the two corresponding images are taken from the high-definition image set and the same operation is performed to obtain the video frames I_t^gt and I_{t+1}^gt, forming a pairing set of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}. The order of the video frames in the pairing set is randomly shuffled to obtain the training data set of the network model. The training data set contains 7000 pictures in total.
Step 3, two continuous compressed video frames I_t^c and I_{t+1}^c (the current frame and the next frame, respectively) are input into the separable convolution network to obtain the prediction I'_{t+1}^c of the next frame I_{t+1}^c. The separable convolutional neural network comprises five encoding modules, four decoding modules, a separable convolution module and an image prediction module. Each encoding module includes three convolution layers and one average pooling layer. The calculation formula of a convolution layer is:
a_{i,j} = f( Σ_m Σ_n w_{m,n} · x_{i+m, j+n} + w_b )    (1)
wherein x_{i,j} represents the pixel in row i, column j of the image, w_{m,n} represents the weight in row m, column n of the filter, w_b represents the bias term of the filter, a_{i,j} represents the pixel in row i, column j of the resulting feature map, and f represents the activation function relu. The convolution kernel size is set to 3 × 3 in the encoding and decoding modules.
The average pooling layer is used for downsampling the output feature map, and further reduces the parameter number by removing unimportant samples in the feature map.
Then, the output of the encoding module is used as the input of the decoding module. Each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer, somewhat like the inverse process of the encoding module. The calculation of the bilinear upsampling layer is as follows: for each obtained feature map, linear interpolation is performed in the x direction to obtain:
f(R1) ≈ ((x2 − x)/(x2 − x1)) · f(Q11) + ((x − x1)/(x2 − x1)) · f(Q21), where R1 = (x, y1)    (2)
f(R2) ≈ ((x2 − x)/(x2 − x1)) · f(Q12) + ((x − x1)/(x2 − x1)) · f(Q22), where R2 = (x, y2)    (3)

wherein Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function. Then, linear interpolation is performed again in the y direction:
f(p) ≈ ((y2 − y)/(y2 − y1)) · f(R1) + ((y − y1)/(y2 − y1)) · f(R2)    (4)

Thus, the value of each pixel point of the feature map after bilinear interpolation is obtained, where p = (x, y) is the pixel point to be predicted.
The formula for the convolutional layer is as above.
Meanwhile, feature combination layers are added as bridges connecting the decoder and the encoder to avoid loss of detail information. The specific operation is as follows: the third convolution layer of the 2nd, 3rd, 4th and 5th encoding modules is connected, through a skip connection, to the bilinear upsampling layer of the corresponding 4th, 3rd, 2nd and 1st decoding module respectively, and the output features of the encoding module and the decoding module are added to obtain the combined feature F_K (a sketch of such a decoding module with its skip connection is given below).
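A hedged PyTorch sketch of one decoding module with its skip connection, i.e. element-wise addition of the encoder feature and the bilinearly upsampled decoder feature; channel widths and the upsampling factor are assumptions for illustration.

import torch.nn as nn


class DecodingModule(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, x, encoder_feat):
        up = self.up(self.body(x))
        return up + encoder_feat  # feature combination layer: element-wise addition (skip connection)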
The separable convolution module consists of four sub-networks, where each sub-network is composed of three convolution layers and a bilinear upsampling layer; at this point, the two-dimensional convolution kernel of each convolution layer is replaced by two one-dimensional convolution kernels, which represent the two-dimensional kernel in the horizontal and vertical directions respectively. The specific process is as follows: the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels perform convolution operations on the input current frame I1 and the next frame I2 respectively, and the two results are finally added to obtain the output result, i.e. the predicted image of the next frame. The specific operation is as follows:
The final predicted image I_gt can be obtained by convolving the pixels P1(x, y) of the originally input current frame image and the pixels P2(x, y) of the second frame image with the convolution kernels learned by the network for the two images respectively:
I_gt = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)    (5)
using convolution results of one-dimensional convolution kernel in horizontal direction and one-dimensional convolution kernel in vertical directionTwo-dimensional convolution kernel K in approximate expression (6)1(x, y) and K2(x,y):
K1(x,y)=k1_h(x,y)*v1_v(x,y)
K2(x,y)=k2_h(x,y)*k2_v(x,y) (6)
This yields:

I_gt = k1_h(x, y) * k1_v(x, y) * P1(x, y) + k2_h(x, y) * k2_v(x, y) * P2(x, y)    (7)
Step 4, the predicted frame I'_{t+1}^c obtained by the separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set and the uncompressed image I_{t+1}^gt are simultaneously normalized and processed into the Y channel. The specific steps are as follows:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
Step 4.2, taking the normalized RGB image and obtaining the Y-channel image according to the formula

Y = 0.257R + 0.564G + 0.098B + 16.
Step 5, the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c are input into the residual network model to obtain the model-predicted image I'_{t+1}^gt. The residual network comprises an initial convolution module, residual convolution modules and an image reconstruction module. Each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer, where the feature combination layer adds, through a skip connection, the output feature F_k of this layer to the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}.
Step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom-layer feature F1 is obtained by learning:

F1 = F_relu(W1 · X + B1)    (8)

wherein W1 and B1 are the weight and bias parameters of the initial convolution module, F_relu represents the relu activation function, and X represents the combined network input;
Step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer to the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:

F_k = W_k · F_relu(W_{k-1} F_{k-2} + B_{k-1}) + F_{k-2}    (9)
F_{k,k+2} = F_k + F_{k+2}    (10)

wherein F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu represents the relu activation function, W_k represents the weight of the k-th convolution layer, W_{k-1} and B_{k-1} represent the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2}.
Step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:

F_g = W_M · F_relu(W_{M-1} F_{k,k+2} + B_{M-1}) + F_1    (11)

wherein F_1 is the bottom-layer feature obtained from (8), F_relu represents the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (10), W_M represents the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} represent the weight and bias parameters of the (M-1)-th convolution module.
Step 6: calculating an overall cost function;
Step 6.1, in the separable convolution network, the predicted image I'_{t+1}^c of the next frame is compared with the original image I_{t+1}^c of the next frame, and the Euclidean distance between the two is calculated:

Mse_loss = (1/num) Σ_{i=1}^{num} || I'_{t+1}^c(i) − I_{t+1}^c(i) ||²    (12)

wherein num denotes the number of all pixel blocks in each frame image.
Step 6.2, in the network for removing video-frame compression artifacts, the network-predicted image I'_{t+1}^gt is compared with the original video frame I_{t+1}^gt, and a Charbonnier penalty function is calculated:

Charbonnier_loss = (1/num) Σ_{i=1}^{num} sqrt( || I'_{t+1}^gt(i) − I_{t+1}^gt(i) ||² + ε² )    (13)

wherein ε is a regularization term set empirically to 1e-3.
Step 6.3, adding the two loss functions to obtain the overall cost function:
Total_loss=Mse_loss+Charbonnier_loss (14)
And 7: and continuously updating and optimizing the overall cost function to obtain the optimal convolution weight parameters and the optimal bias parameters until the optimal effect is obtained.
Seq.   AR-CNN[1]   DCAD[7]   DSCNN[2]   MFQE[4]   The invention
1      0.13        0.14      0.48       0.77      2.56
2      0.07        0.04      0.42       0.60      2.25
3      0.11        0.11      0.24       0.47      2.51
4      0.13        0.08      0.32       0.44      1.37
5      0.19        0.23      0.33       0.55      1.00
6      0.15        0.16      0.37       0.60      1.32
7      0.14        0.18      0.28       0.39      1.20
8      0.13        0.19      0.28       0.48      1.34
9      0.16        0.22      0.27       0.39      1.46
10     0.15        0.20      0.25       0.40      1.80
Ave.   0.14        0.16      0.32       0.51      1.68

Table 1. Comparison of results with prior-art methods on the test set at QP = 37.
By adopting the above technical scheme, the invention can effectively eliminate the artifacts produced when a video is highly compressed. The innovation of the invention is mainly embodied in two aspects. First, the two-dimensional convolution kernels are replaced with one-dimensional convolution kernels, so the parameters of the network training model are reduced and execution efficiency is high. The invention applies the latest deep learning technology, uses adaptive separable convolution as the first module in the network model, and converts each two-dimensional convolution into a pair of one-dimensional convolution kernels in the horizontal and vertical directions, so that the number of parameters changes from n² to n + n, which greatly reduces the computational cost and saves memory. Second, unlike most approaches that use an optical flow map to motion-compensate consecutive video frames, the present invention uses the adaptively varying convolution kernels learned by the network for different inputs to realize motion vector estimation. When motion offsets are estimated from an optical flow map, the lack of ground-truth flow maps often makes the motion compensation inaccurate. In the invention, two consecutive frames are selected as the network input, a pair of separable two-dimensional convolution kernels is obtained for every two consecutive inputs, and the 2-dimensional convolution kernels are then expanded into four 1-dimensional convolution kernels; the obtained 1-dimensional convolution kernels change as the input changes, which greatly improves the adaptivity of the network and makes the method data-driven. The invention obtains the motion-compensated frame through an adaptive separable convolution network and removes the compression artifacts of video frames through a residual network, thereby enhancing video quality. This compression-artifact removal method based on the adaptive separable convolution network model can effectively remove various artifacts in compressed video and obviously improve video quality and visual effect.
The present invention relates to the following references:
[1] Chao Dong, Yubin Deng, Chen Change Loy, Xiaoou Tang. Compression Artifacts Reduction by a Deep Convolutional Network. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[2] Yang R, Xu M, Wang Z. Decoder-side HEVC quality enhancement with scalable convolutional neural network. In IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2017: 817-822.
[3] Yang R, Xu M, Wang Z, et al. Enhancing Quality for HEVC Compressed Videos. 2017.
[4] Yang R, Xu M, Wang Z, et al. Multi-Frame Quality Enhancement for Compressed Video. 2018.
[5] Xiph.org, Xiph.org Video Test Media, https://media.xiph.org/video/derf/ (2017).
[6] VQEG, VQEG video datasets and organizations, https://www.its.bldrdoc.gov/vqeg/video-datasets-and-organizations.aspx
[7] Wang T, Chen M, Chao H. A Novel Deep Learning-Based Method of Improving Coding Efficiency from the Decoder-End for HEVC. In Data Compression Conference (DCC), IEEE, 2017.

Claims (8)

1. a video quality enhancement method based on adaptive separable convolution is characterized in that: the adopted system network comprises a self-adaptive separable convolution network and a residual error network, wherein the self-adaptive separable convolution network is used for acquiring the motion compensation frame, and the residual error network is used for removing the compression artifact of the video frame; the video quality enhancement method comprises the following specific steps:
step 1, selecting high-quality videos to form a video database;
step 2, preprocessing the video database to form a training data set; the training data set is composed of a pairing set of a number of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}, wherein I_t^c represents the current frame of the compressed image, I_{t+1}^c represents the next frame of the compressed image, I_t^gt represents the current frame of the high-definition image, and I_{t+1}^gt represents the next frame of the high-definition image;
step 3, inputting two continuous compressed video frames I_t^c and I_{t+1}^c, and obtaining the predicted compressed video frame I'_{t+1}^c of the next frame I_{t+1}^c using the separable convolution network;
step 4, simultaneously performing normalization and Y-channel processing on the predicted compressed video frame I'_{t+1}^c obtained by the adaptive separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set, and the uncompressed image I_{t+1}^gt;
step 5, inputting the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c, and obtaining the predicted high-definition video frame I'_{t+1}^gt using the residual network model;
step 6: calculating the overall cost function based on the predicted compressed video frame I'_{t+1}^c and the predicted high-definition video frame I'_{t+1}^gt;
and 7: and continuously updating and optimizing the overall cost function to obtain the optimal convolution weight parameters and bias parameters.
2. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: the step 2 specifically comprises the following steps:
step 2-1, setting a quality coefficient qp according to the latest HEVC standard, and compressing the original video by using an ffmpeg command to ensure that each high-definition video has a corresponding video with a compression artifact;
step 2-2, respectively carrying out frame extraction on the high-definition video and the compressed video to obtain a high-definition image set and a corresponding compressed image set;
step 2-3, taking two continuous images from the compressed image set each time and cropping them to size d × d to obtain the compressed video frames I_t^c and I_{t+1}^c;
step 2-4, simultaneously taking the two corresponding images from the high-definition image set and performing the same operation to obtain the high-definition video frames I_t^gt and I_{t+1}^gt, forming a pairing set of a number of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt};
step 2-5, randomly shuffling the order of the video frames in the pairing set to obtain the training data set of the network model.
3. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: the separable convolutional neural network comprises five encoding modules, four decoding modules, a separating convolutional module and an image prediction module.
4. The method of claim 3, wherein the adaptive separable convolution-based video quality enhancement method comprises: the step 3 specifically comprises the following steps:
step 3.1, each encoding module comprises three convolution layers and one average pooling layer; the calculation formula of a convolution layer is:

a_{i,j} = f( Σ_m Σ_n w_{m,n} · x_{i+m, j+n} + w_b )    (1)

wherein x_{i,j} represents the pixel in row i, column j of the image, w_{m,n} represents the weight in row m, column n of the filter, w_b represents the bias term of the filter, a_{i,j} represents row i, column j of the obtained feature map, and f represents the activation function relu;
the calculation formula of the average pooling layer is:

h_m = (1/N) Σ_{i=1}^{N} α_i    (2)

wherein α_i represents the value of the i-th pixel point in the taken neighborhood (after normalization, α_i ranges from 0 to 1), N represents the total number of pixel points in the neighborhood, and h_m represents the result of pooling all pixel points in the neighborhood;
step 3.2, each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer; the output of the last encoding module is used as the input of the first decoding module, and the output of each decoding module is then used as the input of the next decoding module; the calculation formula of the convolution layers of the decoding module is the same as that of the encoding module;
the computation process of the bilinear upsampling layer is as follows:
step 3.2.1, for each obtained feature map, in order to obtain the value of the unknown function f at the point p = (x, y), linear interpolation is first performed in the x direction:

f(R1) ≈ ((x2 − x)/(x2 − x1)) · f(Q11) + ((x − x1)/(x2 − x1)) · f(Q21), where R1 = (x, y1)    (3)
f(R2) ≈ ((x2 − x)/(x2 − x1)) · f(Q12) + ((x − x1)/(x2 − x1)) · f(Q22), where R2 = (x, y2)    (4)

wherein Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function;
step 3.2.2, linear interpolation is then performed in the y direction:

f(p) ≈ ((y2 − y)/(y2 − y1)) · f(R1) + ((y − y1)/(y2 − y1)) · f(R2)    (5)

so that the desired interpolation result is obtained:

f(x, y) ≈ [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / ((x2 − x1)(y2 − y1))    (6)

that is, the value f(x, y) of the pixel point p = (x, y) to be predicted in the feature map is obtained through the bilinear interpolation function f;
step 3.3, adding skip connections between the decoder and the encoder: skip connections are respectively applied between the third convolution layer of the 2nd, 3rd, 4th and 5th encoding modules and the bilinear upsampling layer of the corresponding 4th, 3rd, 2nd and 1st decoding modules, and the output features of the encoding module and the decoding module are added to obtain the combined features;
step 3.4, the separable convolution module comprises four sub-networks, wherein each sub-network consists of three convolution layers and a bilinear up-sampling layer; the method comprises the following specific steps:
step 3.4.1, the output of steps 3.1-3.3 is expanded into two adaptive convolution kernels which perform convolution operations on the two continuous input frames respectively:

I_gt(x, y) = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)    (7)

wherein K1(x, y) and K2(x, y) respectively represent the two-dimensional convolution kernels predicted by the separable convolution model, P1(x, y) and P2(x, y) represent the pixel values of the two continuous input frames, and * represents the convolution operation;
step 3.4.2, each two-dimensional adaptive convolution kernel is expanded into 2 one-dimensional convolution kernels along the horizontal and vertical directions respectively, <K1_v(x,y), K1_h(x,y)> and <K2_v(x,y), K2_h(x,y)>, to obtain four adaptive one-dimensional convolution kernels;
step 3.4.3, the convolution of two one-dimensional convolution kernels can approximate a two-dimensional convolution kernel:

K1(x, y) ≈ K1_h(x, y) * K1_v(x, y)
K2(x, y) ≈ K2_h(x, y) * K2_v(x, y)    (8)
step 3.4.4, the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels perform convolution operations on the input current frame I1 and the next frame I2 in sequence, and the two results are added to obtain the output result, which is the compensation image of the next frame;
step 3.5, according to the above formulas (7) and (8), the originally input current frame image P1(x, y) and second frame image P2(x, y) are convolved with the convolution kernels output by the adaptive separable convolution module to obtain the predicted image I_gt produced by the image prediction module:

I_gt = k1_h(x, y) * k1_v(x, y) * P1(x, y) + k2_h(x, y) * k2_v(x, y) * P2(x, y)    (9)
5. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: the specific steps of the step 4 are respectively as follows:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
step 4.2, taking the normalized RGB image and obtaining the Y-channel image according to the formula

Y = 0.257R + 0.564G + 0.098B + 16.
6. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: and in the step 5, the residual error network respectively comprises an initial convolution module, a residual error convolution module and an image reconstruction module.
7. The method of claim 6, wherein the adaptive separable convolution-based video quality enhancement method comprises: step 5 comprises the following processing steps:
step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom-layer feature F1 is obtained by learning:

F1 = F_relu(W1 · X + B1)    (10)

wherein W1 and B1 are the weight and bias parameters of the initial convolution module, F_relu represents the relu activation function, and X represents the combination of the two inputs taken as the network input;
step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer to the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:

F_k = W_k · F_relu(W_{k-1} F_{k-2} + B_{k-1}) + F_{k-2}    (11)
F_{k,k+2} = F_k + F_{k+2}    (12)

wherein F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu represents the relu activation function, W_k represents the weight of the k-th convolution layer, W_{k-1} and B_{k-1} represent the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2};
step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:

F_g = W_M · F_relu(W_{M-1} F_{k,k+2} + B_{M-1}) + F_1    (13)

wherein F_1 is the bottom-layer feature obtained from (10), F_relu represents the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (12), W_M represents the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} represent the weight and bias parameters of the (M-1)-th convolution module.
8. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: the calculation of the total cost function comprises the following steps:
step 6.1, in the separable convolution network, the predicted compressed video frame I'_{t+1}^c of the next frame is compared with the original compressed image I_{t+1}^c of the next frame, and the Euclidean distance between the two is calculated:

Mse_loss = (1/num) Σ_{i=1}^{num} || I'_{t+1}^c(i) − I_{t+1}^c(i) ||²    (14)

wherein num represents the number of all pixel blocks in each frame image;
step 6.2, in the network for removing video-frame compression artifacts, the predicted high-definition video frame I'_{t+1}^gt is compared with the original high-definition video frame I_{t+1}^gt, and a Charbonnier penalty function is calculated:

Charbonnier_loss = (1/num) Σ_{i=1}^{num} sqrt( || I'_{t+1}^gt(i) − I_{t+1}^gt(i) ||² + ε² )    (15)

wherein num represents the number of all pixel blocks in each frame image, and ε is a regularization term used to preserve image edges, set empirically to 1e-3;
step 6.3, adding the two loss functions to obtain an overall cost function:
Total_loss=Mse_loss+Charbonnier_loss (16)。
CN201810603510.6A 2018-06-12 2018-06-12 Video quality enhancement method based on self-adaptive separable convolution Active CN108900848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810603510.6A CN108900848B (en) 2018-06-12 2018-06-12 Video quality enhancement method based on self-adaptive separable convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810603510.6A CN108900848B (en) 2018-06-12 2018-06-12 Video quality enhancement method based on self-adaptive separable convolution

Publications (2)

Publication Number Publication Date
CN108900848A CN108900848A (en) 2018-11-27
CN108900848B true CN108900848B (en) 2021-03-02

Family

ID=64344922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810603510.6A Active CN108900848B (en) 2018-06-12 2018-06-12 Video quality enhancement method based on self-adaptive separable convolution

Country Status (1)

Country Link
CN (1) CN108900848B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109451308B (en) 2018-11-29 2021-03-09 北京市商汤科技开发有限公司 Video compression processing method and device, electronic equipment and storage medium
CN110677651A (en) * 2019-09-02 2020-01-10 合肥图鸭信息科技有限公司 Video compression method
CN110610467B (en) * 2019-09-11 2022-04-15 杭州当虹科技股份有限公司 Multi-frame video compression noise removing method based on deep learning
CN110705513A (en) * 2019-10-17 2020-01-17 腾讯科技(深圳)有限公司 Video feature extraction method and device, readable storage medium and computer equipment
CN113727141B (en) * 2020-05-20 2023-05-12 富士通株式会社 Interpolation device and method for video frames
CN113761983B (en) * 2020-06-05 2023-08-22 杭州海康威视数字技术股份有限公司 Method and device for updating human face living body detection model and image acquisition equipment
CN112257847A (en) * 2020-10-16 2021-01-22 昆明理工大学 Method for predicting geomagnetic Kp index based on CNN and LSTM
RU2764395C1 (en) 2020-11-23 2022-01-17 Самсунг Электроникс Ко., Лтд. Method and apparatus for joint debayering and image noise elimination using a neural network
CN112801266B (en) * 2020-12-24 2023-10-31 武汉旷视金智科技有限公司 Neural network construction method, device, equipment and medium
CN115442613A (en) * 2021-06-02 2022-12-06 四川大学 Interframe information-based noise removal method using GAN
CN114339030B (en) * 2021-11-29 2024-04-02 北京工业大学 Network live video image stabilizing method based on self-adaptive separable convolution
CN114820350A (en) * 2022-04-02 2022-07-29 北京广播电视台 Inverse tone mapping system, method and neural network system thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366389A (en) * 2013-04-27 2013-10-23 中国人民解放军北京军区总医院 CT (computed tomography) image reconstruction method
CN107871332A (en) * 2017-11-09 2018-04-03 南京邮电大学 A kind of CT based on residual error study is sparse to rebuild artifact correction method and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060062478A1 (en) * 2004-08-16 2006-03-23 Grandeye, Ltd., Region-sensitive compression of digital video
WO2016132145A1 (en) * 2015-02-19 2016-08-25 Magic Pony Technology Limited Online training of hierarchical algorithms
CN106131443A (en) * 2016-05-30 2016-11-16 南京大学 A kind of high dynamic range video synthetic method removing ghost based on Block-matching dynamic estimation
CN106791836A (en) * 2016-12-02 2017-05-31 深圳市唯特视科技有限公司 It is a kind of to be based on a pair of methods of the reduction compression of images effect of Multi net voting
CN106709875B (en) * 2016-12-30 2020-02-18 北京工业大学 Compressed low-resolution image restoration method based on joint depth network
CN107145846B (en) * 2017-04-26 2018-10-19 贵州电网有限责任公司输电运行检修分公司 A kind of insulator recognition methods based on deep learning
CN107392868A (en) * 2017-07-21 2017-11-24 深圳大学 Compression binocular image quality enhancement method and device based on full convolutional neural networks
CN107463989B (en) * 2017-07-25 2019-09-27 福建帝视信息科技有限公司 A kind of image based on deep learning goes compression artefacts method
CN107507148B (en) * 2017-08-30 2018-12-18 南方医科大学 Method based on the convolutional neural networks removal down-sampled artifact of magnetic resonance image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366389A (en) * 2013-04-27 2013-10-23 中国人民解放军北京军区总医院 CT (computed tomography) image reconstruction method
CN107871332A (en) * 2017-11-09 2018-04-03 南京邮电大学 A kind of CT based on residual error study is sparse to rebuild artifact correction method and system

Also Published As

Publication number Publication date
CN108900848A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
CN108900848B (en) Video quality enhancement method based on self-adaptive separable convolution
Zhang et al. DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal
Sun et al. Reduction of JPEG compression artifacts based on DCT coefficients prediction
CN111866521A (en) Video image compression artifact removing method combining motion compensation and generation type countermeasure network
CN111047532B (en) Low-illumination video enhancement method based on 3D convolutional neural network
Yu et al. Quality enhancement network via multi-reconstruction recursive residual learning for video coding
CN111031315B (en) Compressed video quality enhancement method based on attention mechanism and time dependence
CN113055674B (en) Compressed video quality enhancement method based on two-stage multi-frame cooperation
CN112218094A (en) JPEG image decompression effect removing method based on DCT coefficient prediction
WO2022211657A9 (en) Configurable positions for auxiliary information input into a picture data processing neural network
CN112188217B (en) JPEG compressed image decompression effect removing method combining DCT domain and pixel domain learning
CN113810715B (en) Video compression reference image generation method based on cavity convolutional neural network
CN115187455A (en) Lightweight super-resolution reconstruction model and system for compressed image
US20230110503A1 (en) Method, an apparatus and a computer program product for video encoding and video decoding
CN112601095B (en) Method and system for creating fractional interpolation model of video brightness and chrominance
Ho et al. SR-CL-DMC: P-frame coding with super-resolution, color learning, and deep motion compensation
CN113822801B (en) Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network
CN115243044A (en) Reference frame selection method and device, equipment and storage medium
Amaranageswarao et al. Blind compression artifact reduction using dense parallel convolutional neural network
WO2022211658A1 (en) Independent positioning of auxiliary information in neural network based picture processing
Jia et al. Deep convolutional network based image quality enhancement for low bit rate image compression
Mishra et al. Edge-aware image compression using deep learning-based super-resolution network
CN114862687B (en) Self-adaptive compressed image restoration method driven by depth deblocking operator
CN112243132A (en) Compressed video post-processing method combining non-local prior and attention mechanism
CN114071166B (en) HEVC compressed video quality improvement method combined with QP detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 350000 Unit 01, 16th Floor, TB # Office Building, Phase III, CR MIXC, Hongshanyuan Road, Hongshan Town, Gulou District, Fuzhou City, Fujian Province

Patentee after: Fujian Deshi Technology Group Co.,Ltd.

Address before: 350000 area B, 5th floor, building 2, Yunzuo, 528 Xihong Road, Gulou District, Fuzhou City, Fujian Province

Patentee before: FUJIAN IMPERIAL VISION INFORMATION TECHNOLOGY CO.,LTD.