CN108900848B - Video quality enhancement method based on self-adaptive separable convolution - Google Patents
Video quality enhancement method based on self-adaptive separable convolution
- Publication number
- CN108900848B CN108900848B CN201810603510.6A CN201810603510A CN108900848B CN 108900848 B CN108900848 B CN 108900848B CN 201810603510 A CN201810603510 A CN 201810603510A CN 108900848 B CN108900848 B CN 108900848B
- Authority
- CN
- China
- Prior art keywords
- convolution
- layer
- image
- video
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
- H04N19/86—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/577—Motion compensation with bidirectional frame interpolation, i.e. using B-pictures
Landscapes
- Engineering & Computer Science (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a video quality enhancement method based on self-adaptive separable convolution. The self-adaptive separable convolution is applied as the first module in the network model, and each two-dimensional convolution is converted into a pair of one-dimensional convolution kernels in the horizontal and vertical directions, so that the parameter count changes from n² to n + n. Secondly, the adaptively varying convolution kernels that the network learns for different inputs are used to estimate the motion vectors: two consecutive frames are selected as the network input, a pair of separable two-dimensional convolution kernels is obtained for every two consecutive inputs, and the 2-dimensional kernels are then unfolded into four 1-dimensional convolution kernels; the obtained 1-dimensional kernels change with the input, which improves the adaptivity of the network. By replacing the two-dimensional convolution kernel with one-dimensional convolution kernels, the invention reduces the parameters of the network training model and achieves high execution efficiency.
Description
Technical Field
The invention relates to the field of image processing and deep learning technology, in particular to a video quality enhancement method based on self-adaptive separable convolution.
Background
Removing compression artifacts from images and video is a classical problem in computer vision. The goal is to estimate a lossless image from its compressed counterpart. In the age of information explosion, the number of images and videos circulating on the internet and on mobile phones grows daily, and lossy compression formats such as JPEG and WebP are widely applied on platforms such as news websites, WeChat and microblogs to reduce file size so as to save bandwidth and transmission time. Images and videos used in web pages need to be compressed as much as possible to speed up page loading and thereby improve the user experience. These compression algorithms, however, typically introduce compression artifacts such as blocking, contouring, blurring, ringing, and the like. Generally, the larger the compression factor, the more severe the degradation caused by these artifacts, resulting in loss of video information and directly affecting the visual experience of the user. How to recover visually high-quality, artifact-free images and videos has therefore attracted increasing attention.
In recent years, with the development of deep learning, more and more techniques have been applied to improving the visual quality of compressed images and videos. For example, Dong et al. [1] used a 3-layer convolutional neural network (AR-CNN) to remove the artifacts of JPEG-compressed images and obtained a good decompression-artifact-removal effect. Yang et al. then proposed DS-CNN [2,3] for video quality enhancement. However, none of the above video quality enhancement methods exploits the information between adjacent frames, so their network performance is limited to a large extent. Recently, Yang et al. further proposed the MFQE algorithm [4], which observes that, because the quality of the frames in a compressed video fluctuates greatly, the information in a high-quality frame can be used to enhance the quality of its neighboring low-quality frames. However, that method relies on an optical-flow estimation network to estimate the motion between frames, and because the ground-truth value of the motion is difficult to obtain for optical-flow estimation, the effect is not obvious.
Disclosure of Invention
In view of the artifacts generated by heavy compression of video, the invention aims to provide a video quality enhancement method based on self-adaptive separable convolution, so that various artifacts in the compressed video are effectively removed and the video quality and visual effect are obviously improved.
The technical scheme adopted by the invention is as follows:
A video quality enhancement method based on adaptive separable convolution adopts a system network comprising an adaptive separable convolution network and a residual network, wherein the adaptive separable convolution network is used to obtain the motion-compensated frame and the residual network is used to remove the compression artifacts of the video frame, so as to enhance the video quality; the video quality enhancement method comprises the following specific steps:
step 1, selecting high-quality videos to form a video database [4,5,6 ].
Step 2, preprocessing the video database to form a training data set; the training data set is composed of a pairing set of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}, wherein I_t^c represents the current frame of the compressed video, I_{t+1}^c represents the next frame of the compressed video, I_t^gt represents the current frame of the high-definition video, and I_{t+1}^gt represents the next frame of the high-definition video.
Step 3, inputting two consecutive compressed video frames I_t^c and I_{t+1}^c, and obtaining the predicted compressed video frame I'_{t+1}^c of the next frame I_{t+1}^c using the separable convolution network.
Step 4, the predicted compressed video frame I'_{t+1}^c obtained by the adaptive separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set and the uncompressed image I_{t+1}^gt are simultaneously normalized and converted to the Y channel.
Step 5, inputting the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c, and obtaining the predicted high-definition video frame I'_{t+1}^gt using the residual network model.
Step 6: calculating a total cost function based on the predicted compressed video frame I'_{t+1}^c and the predicted high-definition video frame I'_{t+1}^gt;
and 7: and continuously updating and optimizing the overall cost function to obtain the optimal convolution weight parameters and the optimal bias parameters until the optimal effect is obtained.
Further, step 2 specifically includes the following steps:
step 2-1, setting a quality coefficient qp according to the latest HEVC standard, and compressing the original video by using an ffmpeg command to ensure that each high-definition video has a corresponding video with a compression artifact;
step 2-2, respectively carrying out frame extraction on the high-definition video and the compressed video to obtain a high-definition image set and a corresponding compressed image set;
Step 2-3, two consecutive images are taken from the compressed image set each time, and compressed video frames I_t^c and I_{t+1}^c of size d × d are cropped from them; since the goal is to remove the compression artifacts of the video, inter-frame similarity should be considered;
Step 2-4, two corresponding images are simultaneously taken from the high-definition image set and the same operation is performed to obtain high-definition video frames I_t^gt and I_{t+1}^gt, forming a pairing set of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt};
And 2-5, randomly disordering the sequence of the video frames in the pairing set to obtain a training data set of the network model.
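For illustration only, the preprocessing of steps 2-1 to 2-5 could be sketched in Python as follows; the ffmpeg/OpenCV calls, the qp value 37 and the patch size d = 128 are assumptions of the sketch and are not prescribed by this description.

```python
# Hypothetical preprocessing sketch for steps 2-1 to 2-5 (qp, d and the ffmpeg
# options are illustrative assumptions; the high-definition and compressed
# videos are assumed to have the same resolution).
import random
import subprocess
import cv2

def compress_video(src, dst, qp=37):
    # Step 2-1: compress the original video with HEVC at quality coefficient qp.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-c:v", "libx265",
                    "-x265-params", f"qp={qp}", dst], check=True)

def extract_frames(video_path):
    # Step 2-2: decode every frame of a video into a list of images.
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

def build_pairs(hd_frames, compressed_frames, d=128):
    # Steps 2-3 / 2-4: crop co-located d x d patches from consecutive frame pairs.
    pairs = []
    for t in range(len(compressed_frames) - 1):
        h, w = compressed_frames[t].shape[:2]
        y, x = random.randint(0, h - d), random.randint(0, w - d)
        crop = lambda img: img[y:y + d, x:x + d]
        pairs.append((crop(compressed_frames[t]), crop(compressed_frames[t + 1]),
                      crop(hd_frames[t]), crop(hd_frames[t + 1])))
    random.shuffle(pairs)  # Step 2-5: shuffle the order of the pairing set
    return pairs
```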
Further, the separable convolutional neural network in the step 3 comprises five encoding modules, four decoding modules, a separable convolutional module and an image prediction module;
further, step 3 specifically includes the following steps:
step 3.1, each coding module comprises three convolutional layers and one average pooling layer,
the calculation formula of the convolution layer is:
a(i,j) = f( Σ_m Σ_n w(m,n) · x(i+m, j+n) + w_b )
where x(i,j) denotes the pixel in row i and column j of the image, w(m,n) denotes the weight in row m and column n of the filter, w_b denotes the bias term of the filter, a(i,j) denotes the element in row i and column j of the obtained feature map, and f denotes the relu activation function;
the calculation formula of the average pooling layer is:
h_m = (1/N) Σ_{i=1}^{N} α_i
where α_i denotes the value of the i-th pixel in the pooled neighborhood (after normalization α_i lies in the range 0–1), N denotes the total number of pixels in the neighborhood, and h_m denotes the pooled result of all pixels in the neighborhood;
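A minimal PyTorch-style sketch of one such encoding module (three 3 × 3 convolutions with relu followed by average pooling) is given below; the channel widths and the 2 × 2 pooling window are assumptions of the sketch, not values fixed by the description.

```python
import torch.nn as nn

class EncodingModule(nn.Module):
    """Three conv+relu layers followed by an average pooling layer (step 3.1)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)  # h_m = mean of the pooled neighborhood

    def forward(self, x):
        features = self.convs(x)              # a(i,j) = relu(sum_m sum_n w(m,n)·x(i+m,j+n) + w_b)
        return self.pool(features), features  # pooled output, plus pre-pool features for the skip connections
```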
Step 3.2, each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer; the output of the last encoding module is used as the input of the first decoding module, and the output of each decoding module is then used as the input of the next decoding module; the calculation formula of the convolution layers of the decoding module is the same as that of the encoding module;
the computation process of the bilinear upsampling layer is as follows:
Step 3.2.1, for each obtained feature map, in order to find the value of the unknown function f at the point p = (x, y), linear interpolation is first performed in the x direction:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21),  R1 = (x, y1)
f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22),  R2 = (x, y2)
where Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function;
Step 3.2.2, linear interpolation is then carried out in the y direction:
f(p) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)
and in this way the desired interpolation result is obtained:
the value of the corresponding pixel in the upsampled feature map is obtained by passing the pixel to be predicted, p = (x, y), through the bilinear interpolation function f, i.e., it equals f(x, y).
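The two-step interpolation of steps 3.2.1–3.2.2 can be written out directly; the NumPy sketch below assumes integer neighbours x2 = x1 + 1 and y2 = y1 + 1 inside the feature map, so the denominators of the interpolation weights are 1.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Evaluate the feature map `feat` at a fractional position p = (x, y)."""
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2, y2 = x1 + 1, y1 + 1
    q11, q21 = feat[y1, x1], feat[y1, x2]
    q12, q22 = feat[y2, x1], feat[y2, x2]
    # Step 3.2.1: linear interpolation in the x direction at y = y1 and y = y2
    f_r1 = (x2 - x) * q11 + (x - x1) * q21
    f_r2 = (x2 - x) * q12 + (x - x1) * q22
    # Step 3.2.2: linear interpolation of the two intermediate values in the y direction
    return (y2 - y) * f_r1 + (y - y1) * f_r2
```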
Step 3.3, adding a jump connection between the decoder and the encoder: respectively adopting skip connection between the third layer convolution layer of the 2 nd, 3 rd, 4 th and 5 th coding modules and the corresponding bilinear upsampling layer of the 4 th, 3 th, 2 th and 1 th decoding modules, and adding the output characteristics of the coding modules and the decoding modules to obtain combined characteristics;
step 3.4, the separable convolution module comprises four sub-networks, wherein each sub-network consists of three convolution layers and a bilinear up-sampling layer; the method comprises the following specific steps:
Step 3.4.1, the output of steps 3.1–3.3 is expanded into two adaptive convolution kernels, which are convolved with the two consecutive input frames respectively:
I_gt = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)    (7)
where K1(x, y) and K2(x, y) respectively denote the two-dimensional convolution kernels predicted by the separable convolution model, P1(x, y) and P2(x, y) denote the pixel values of the two consecutive input frames, and * denotes the convolution operation;
Step 3.4.2, each two-dimensional adaptive convolution kernel is expanded into two one-dimensional convolution kernels along the horizontal and vertical directions, <K1_v(x, y), K1_h(x, y)> and <K2_v(x, y), K2_h(x, y)>, so as to obtain four adaptive one-dimensional convolution kernels;
step 3.4.3, the convolution of two one-dimensional convolution kernels can approximate a two-dimensional convolution kernel:
K1(x,y)≈K1_h(x,y)*K1_v(x,y)
K2(x,y)≈K2_h(x,y)*K2_v(x,y) (8)
Step 3.4.4, the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels are convolved in turn with the input current frame I1 and the next frame I2, and the two results are finally added to give the output, which is the compensated image of the next frame;
Step 3.5, the originally input current-frame image P1(x, y) and second-frame image P2(x, y) are convolved with the convolution kernels output by the adaptive separable convolution module to obtain the predicted image I_gt of the image prediction module:
Igt=k1_h(x,y)*k1_v(x,y)*P1(x,y)+k2_h(x,y)*k2_v(x,y)*P2(x,y) (9)
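The following NumPy sketch illustrates the principle of steps 3.4–3.5 at a single pixel location: the convolution of a vertical and a horizontal one-dimensional kernel is the outer product of the two vectors, so an approximately rank-one n × n kernel is represented with only n + n values, and the predicted pixel of formula (9) is the sum of the two locally filtered patches. It is an illustration of the idea, not the actual implementation.

```python
import numpy as np

def predict_pixel(patch1, patch2, k1_v, k1_h, k2_v, k2_h):
    """Apply formula (9) at one pixel.

    patch1, patch2 : n x n neighbourhoods of the input frames P1 and P2
    k*_v, k*_h     : length-n vertical / horizontal 1-D kernels predicted for this pixel
    """
    K1 = np.outer(k1_v, k1_h)  # formula (8): K1 ≈ k1_v * k1_h, built from n + n values
    K2 = np.outer(k2_v, k2_h)
    return np.sum(K1 * patch1) + np.sum(K2 * patch2)  # formula (9)
```

For a kernel length of n = 51, each two-dimensional kernel would require 51 × 51 = 2601 coefficients per pixel, whereas the pair of one-dimensional kernels requires only 102.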
Further, the specific steps of step 4 are respectively:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
Step 4.2, take the normalized RGB image and obtain the Y-channel image according to the formula
Y = 0.257R + 0.564G + 0.098B + 16.
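A one-function sketch of steps 4.1–4.2, using the coefficients stated above; scaling the offset 16 by 255 after normalization is an assumption about the order of the two operations.

```python
import numpy as np

def to_y_channel(rgb_uint8):
    """Normalize an RGB image to [0, 1] (step 4.1) and extract the Y channel (step 4.2)."""
    rgb = rgb_uint8.astype(np.float32) / 255.0
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.257 * r + 0.564 * g + 0.098 * b + 16.0 / 255.0
```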
Further, the residual network in step 5 comprises an initial convolution module, residual convolution modules and an image reconstruction module;
further, step 5 comprises the following processing steps:
Step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom-layer feature F1 is obtained by learning:
F1 = F_relu(W1·X + B1)    (10)
where X denotes the union (concatenation) of the input compressed video frame and the predicted compressed video frame serving as the network input, W1 and B1 are the weights and bias parameters of the initial convolution module, and F_relu denotes the relu activation function;
Step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer and the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:
F_k = W_k·F_relu(W_{k-1}·F_{k-2} + B_{k-1}) + F_{k-2}    (11)
F_{k,k+2} = F_k + F_{k+2}    (12)
where F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu denotes the relu activation function, W_k denotes the weight of the k-th convolution layer, W_{k-1} and B_{k-1} denote the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2}.
Step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:
F_g = W_M·F_relu(W_{M-1}·F_{k,k+2} + B_{M-1}) + F1    (13)
where F1 is the bottom-layer feature obtained from (10), F_relu denotes the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (12), W_M denotes the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} denote the weights and bias parameters of the (M-1)-th convolution module.
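A hedged PyTorch sketch of formulas (10)–(13) — initial convolution, residual convolution modules with the skip connection, and the reconstruction layer — is given below; the number of residual modules, the channel width and the final one-channel projection are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """conv -> relu -> conv plus the skip connection of formulas (11)-(12)."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, f):
        # F_k = W_k·relu(W_{k-1}·F_{k-2} + B_{k-1}) + F_{k-2};  F_{k,k+2} = F_k + F_{k+2}
        return f + self.conv2(self.relu(self.conv1(f)))

class ResidualNetwork(nn.Module):
    """Initial convolution, residual modules and image reconstruction (step 5)."""
    def __init__(self, in_ch=2, ch=64, num_blocks=4):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.body = nn.Sequential(*[ResidualModule(ch) for _ in range(num_blocks)])
        self.tail = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),  # W_{M-1}, relu
            nn.Conv2d(ch, ch, 3, padding=1),                         # W_M
        )
        self.out = nn.Conv2d(ch, 1, 3, padding=1)  # assumed projection back to a one-channel image

    def forward(self, compressed_y, predicted_y):
        x = torch.cat([compressed_y, predicted_y], dim=1)  # union of the two inputs
        f1 = self.head(x)                                  # formula (10): bottom-layer feature F1
        fk = self.body(f1)                                 # formulas (11)-(12)
        fg = self.tail(fk) + f1                            # formula (13): add the bottom-layer feature F1
        return self.out(fg)
```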
Further, the calculation of the total cost function in step 6 comprises the following steps:
Step 6.1, in the separable convolution network, the predicted compressed video frame of the next frame, I'_{t+1}^c, is compared with the original compressed video frame of the next frame, I_{t+1}^c, and the Euclidean distance between the two is calculated:
Mse_loss = (1/num)·Σ_i ( I'_{t+1}^c(i) − I_{t+1}^c(i) )²    (14)
where num denotes the number of all pixel blocks in each frame image.
Step 6.2, in the network for removing the video frame compression artifact, predicting the high-definition video frameWith the original high definition video frameComparing, and calculating a Charbonier penalty function;
num denotes the number of all pixel blocks in each frame image.
Step 6.3, adding the two loss functions to obtain an overall cost function:
Total_loss = Mse_loss + Charbonnier_loss    (16).
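A short PyTorch-style sketch of formulas (14)–(16) — the Euclidean (MSE) loss between the predicted and original compressed frames, the Charbonnier penalty between the predicted and original high-definition frames, and their sum — with ε = 1e-3 as stated in the claims; illustrative only.

```python
import torch

def mse_loss(pred_compressed, target_compressed):
    # Formula (14): mean squared (Euclidean) distance over all pixels of the frame
    return torch.mean((pred_compressed - target_compressed) ** 2)

def charbonnier_loss(pred_hd, target_hd, eps=1e-3):
    # Formula (15): Charbonnier penalty, a differentiable approximation of the L1 distance
    return torch.mean(torch.sqrt((pred_hd - target_hd) ** 2 + eps ** 2))

def total_loss(pred_compressed, target_compressed, pred_hd, target_hd):
    # Formula (16): Total_loss = Mse_loss + Charbonnier_loss
    return mse_loss(pred_compressed, target_compressed) + charbonnier_loss(pred_hd, target_hd)
```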
By adopting the above technical scheme, the motion-compensated frame is obtained through the adaptive separable convolution network, and the compression artifacts of the video frame are removed through the residual network, so that the video quality is enhanced. This video compression-artifact removal method, based on a model of the adaptive separable convolution network, can effectively remove various artifacts in compressed video and obviously improve the video quality and visual effect.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
FIG. 1 is a schematic diagram illustrating a schematic structure of a video quality enhancement method based on adaptive separable convolution according to the present invention;
FIG. 2 is a comparison of the artifact-removal effect on an image of the "vidoo3" JCT-VC HEVC standard test sequence between the prior-art MFQE and the invention; the test video is compressed according to the latest HEVC standard with the quality coefficient QP set to 37.
Detailed Description
As shown in FIGS. 1-2, the present invention proposes a video enhancement method based on a separable convolution network. The network consists of two parts: the first part is a separable convolution network that obtains the motion-compensated frame, and the second is a residual network that removes the compression artifacts of the video frame, thereby enhancing the video quality. The overall network model is optimized with Adam; except that one-dimensional convolution kernels of length 51 are used in the four sub-networks of the separable convolution module, all other convolution layers use convolution kernels of size 3 × 3. The specific steps are as follows:
step 1, selecting high-quality videos to form a video database. There were 7000 training data pictures.
Step 2, the video database is preprocessed to form a training data set. According to the latest HEVC standard, a quality coefficient qp is set and the original videos are compressed with the ffmpeg command, so that each high-definition video has a corresponding video with compression artifacts. Frames are then extracted from the high-definition videos and the compressed videos respectively to obtain a high-definition image set and a corresponding compressed image set. Since the goal is to remove the compression artifacts of the video, inter-frame similarity should be considered. Each time, a pair of consecutive images is taken from the compressed image set and video frames I_t^c and I_{t+1}^c of size d × d are cropped; at the same time, the two corresponding images are taken from the high-definition image set and the same operation is performed to obtain video frames I_t^gt and I_{t+1}^gt, forming a pairing set of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}. The order of the video frame pairs in the pairing set is randomly shuffled to obtain the training data set of the network model. The training data set contained 7000 pictures in total.
Step 3, two consecutive compressed video frames I_t^c and I_{t+1}^c (representing the current frame and the next frame respectively) are input into the separable convolution network to obtain the prediction I'_{t+1}^c of the next frame I_{t+1}^c. The separable convolutional neural network comprises five encoding modules, four decoding modules, a separable convolution module and an image prediction module. Each encoding module includes three convolution layers and one average pooling layer. The calculation formula of the convolution layer is:
a(i,j) = f( Σ_m Σ_n w(m,n) · x(i+m, j+n) + w_b )
where x(i,j) denotes the pixel in row i and column j of the image, w(m,n) denotes the weight in row m and column n of the filter, w_b denotes the bias term of the filter, a(i,j) denotes the pixel in row i and column j of the resulting feature map, and f denotes the relu activation function. In the encoding and decoding modules the size of the convolution kernel is set to 3 × 3.
The average pooling layer is used for downsampling the output feature map, and further reduces the parameter number by removing unimportant samples in the feature map.
Then the output of the encoding module is used as the input of the decoding module. Each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer, somewhat like the inverse of the encoding module. The calculation of the bilinear upsampling layer is as follows: for each obtained feature map, linear interpolation is first performed in the x direction:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21),  R1 = (x, y1)
f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22),  R2 = (x, y2)
where Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function. Linear interpolation is then performed in the y direction:
f(p) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)
In this way the value of each pixel of the bilinearly interpolated feature map can be obtained, where p = (x, y) is the pixel to be predicted.
The formula for the convolutional layer is as above.
Meanwhile, feature combination layers are added as bridges connecting the decoder and the encoder, so as to avoid the loss of detail information. The specific operation is: the third convolution layer of the 2nd, 3rd, 4th and 5th encoding modules is connected by a skip connection to the bilinear upsampling layer of the corresponding 4th, 3rd, 2nd and 1st decoding module respectively, and the output features of the encoding module and the decoding module are added to obtain the combined feature F_K.
The separable convolution module is composed of four sub-networks, where each sub-network consists of three convolution layers and a bilinear upsampling layer; here, however, the two-dimensional convolution kernel of each convolution layer is replaced by two one-dimensional convolution kernels that represent the two-dimensional kernel in the horizontal and vertical directions respectively. The specific process is as follows: the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels are convolved respectively with the input current frame I1 and the next frame I2, and the two results are finally added to give the output, i.e., the predicted image of the next frame. The specific operation is as follows:
The final predicted image I_gt can be obtained by convolving the pixels P1(x, y) of the originally input current-frame image and the pixels P2(x, y) of the second-frame image with the convolution kernels that the network has learned for the two images respectively:
Igt=K1(x,y)*P1(x,y)+K2(x,y)*P2(x,y) (5)
The convolution of a one-dimensional convolution kernel in the horizontal direction with a one-dimensional convolution kernel in the vertical direction is used to approximate the two-dimensional convolution kernels K1(x, y) and K2(x, y) of expression (5), as in expression (6):
K1(x,y)=k1_h(x,y)*k1_v(x,y)
K2(x,y)=k2_h(x,y)*k2_v(x,y) (6)
Substituting (6) into (5) gives
Igt=k1_h(x,y)*k1_v(x,y)*P1(x,y)+k2_h(x,y)*k2_v(x,y)*P2(x,y) (7)
Step 4, the predicted frame I'_{t+1}^c obtained by the separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set and the uncompressed image I_{t+1}^gt are simultaneously normalized and converted to the Y channel; the specific steps are as follows:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
Step 4.2, take the normalized RGB image and obtain the Y-channel image according to the formula
Y = 0.257R + 0.564G + 0.098B + 16.
Step 5, the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c are input into the residual network model to obtain the model-predicted image I'_{t+1}^gt. The residual network comprises an initial convolution module, residual convolution modules and an image reconstruction module. Each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer, where the feature combination layer adds, through a skip connection, the output feature F_k of this layer and the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}.
Step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom layer characteristic F is obtained through learning1;
Wherein W1And B1As weights and bias parameters of the initial convolution module, FreluRepresenting a relu activation function;
Step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer and the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:
F_k = W_k·F_relu(W_{k-1}·F_{k-2} + B_{k-1}) + F_{k-2}    (9)
F_{k,k+2} = F_k + F_{k+2}    (10)
where F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu denotes the relu activation function, W_k denotes the weight of the k-th convolution layer, W_{k-1} and B_{k-1} denote the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2}.
Step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:
F_g = W_M·F_relu(W_{M-1}·F_{k,k+2} + B_{M-1}) + F1    (11)
where F1 is the bottom-layer feature obtained from (8), F_relu denotes the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (10), W_M denotes the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} denote the weights and bias parameters of the (M-1)-th convolution module.
Step 6: calculating an overall cost function;
Step 6.1, in the separable convolution network, the predicted image I'_{t+1}^c of the next frame is compared with the original image I_{t+1}^c of the next frame, and the Euclidean distance between the two is calculated:
Mse_loss = (1/num)·Σ_i ( I'_{t+1}^c(i) − I_{t+1}^c(i) )²    (12)
where num denotes the number of all pixel blocks in each frame image.
Step 6.2, in the network for removing the video frame compression artifact, the network prediction image I is usedt+1’ gtWith the original video frame It+1 gtAnd comparing and calculating a Charbonier penalty function.
And 6.3, adding the two loss functions to obtain an overall cost function.
Total_loss=Mse_loss+Charbonnier_loss (14)
Step 7: the overall cost function is continuously updated and optimized until the best effect is obtained, yielding the optimal convolution weight and bias parameters.
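Putting the pieces together, one training iteration under the Adam optimization described above might be sketched as follows. `SeparableConvNetwork` is a hypothetical name for the first sub-network of FIG. 1, `ResidualNetwork` refers to the sketch given after formula (13) of the description, `total_loss` to the sketch given after formula (16), and the learning rate and data loader are assumptions.

```python
import torch

sep_net = SeparableConvNetwork()   # adaptive separable convolution network (hypothetical class)
res_net = ResidualNetwork()        # residual network for artifact removal (sketched above)
optimizer = torch.optim.Adam(list(sep_net.parameters()) + list(res_net.parameters()),
                             lr=1e-4)  # Adam as stated; the learning rate is an assumption

# loader is assumed to yield (I_t_c, I_t1_c, I_t_gt, I_t1_gt) pairs from the step-2 training set
for i_t_c, i_t1_c, i_t_gt, i_t1_gt in loader:
    pred_c = sep_net(i_t_c, i_t1_c)    # step 3: predicted compressed next frame
    pred_gt = res_net(i_t1_c, pred_c)  # step 5: predicted high-definition next frame
    loss = total_loss(pred_c, i_t1_c, pred_gt, i_t1_gt)  # step 6: total cost function
    optimizer.zero_grad()
    loss.backward()                    # step 7: update the convolution weights and biases
    optimizer.step()
```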
Seq. | AR-CNN[1] | DCAD[7] | DSCNN[2] | MFQE[4] | The invention |
---|---|---|---|---|---|
1 | 0.13 | 0.14 | 0.48 | 0.77 | 2.56 |
2 | 0.07 | 0.04 | 0.42 | 0.60 | 2.25 |
3 | 0.11 | 0.11 | 0.24 | 0.47 | 2.51 |
4 | 0.13 | 0.08 | 0.32 | 0.44 | 1.37 |
5 | 0.19 | 0.23 | 0.33 | 0.55 | 1.00 |
6 | 0.15 | 0.16 | 0.37 | 0.60 | 1.32 |
7 | 0.14 | 0.18 | 0.28 | 0.39 | 1.20 |
8 | 0.13 | 0.19 | 0.28 | 0.48 | 1.34 |
9 | 0.16 | 0.22 | 0.27 | 0.39 | 1.46 |
10 | 0.15 | 0.20 | 0.25 | 0.40 | 1.80 |
Ave. | 0.14 | 0.16 | 0.32 | 0.51 | 1.68 |
Table 1: comparison of results with the prior art on the test set at QP = 37
By adopting the above technical scheme, the method can effectively eliminate the artifacts generated by heavy compression of video. The innovation of the invention is mainly embodied in two aspects. First, the two-dimensional convolution kernel is replaced by one-dimensional convolution kernels, so that the parameters of the network training model are reduced and the execution efficiency is high. The invention applies the latest deep learning techniques: the adaptive separable convolution is applied as the first module in the network model, and each two-dimensional convolution is converted into a pair of one-dimensional convolution kernels in the horizontal and vertical directions, so that the parameter count changes from n² to n + n, which greatly reduces the computational cost and saves memory. Second, unlike most approaches that use an optical flow map to motion-compensate consecutive video frames, the present invention uses the adaptively varying convolution kernels learned by the network for different inputs to estimate the motion vectors. When the motion offset is estimated from an optical flow map, the lack of a ground-truth flow map often makes the motion compensation inaccurate. In the invention, two consecutive frames are selected as the network input, a pair of separable two-dimensional convolution kernels is obtained for every two consecutive inputs, and the 2-dimensional kernels are then expanded into four 1-dimensional convolution kernels; the obtained 1-dimensional kernels change with the input, which greatly improves the adaptivity of the network in a data-driven manner. The invention obtains the motion-compensated frame through the adaptive separable convolution network and removes the compression artifacts of the video frame through the residual network, thereby enhancing the video quality. This model based on the adaptive separable convolution network can effectively remove various artifacts in compressed video and obviously improve the video quality and visual effect.
The present invention relates to the following references:
[1] Chao Dong, Yubin Deng, Chen Change Loy, Xiaoou Tang. Compression Artifacts Reduction by a Deep Convolutional Network. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[2] Yang R, Xu M, Wang Z. Decoder-side HEVC quality enhancement with scalable convolutional neural network. In IEEE International Conference on Multimedia and Expo, IEEE, 2017: 817-822.
[3] Yang R, Xu M, Wang Z, et al. Enhancing Quality for HEVC Compressed Videos. 2017.
[4] Yang R, Xu M, Wang Z, et al. Multi-Frame Quality Enhancement for Compressed Video. 2018.
[5] Xiph.org, Xiph.org Video Test Media, https://media.xiph.org/video/derf/ (2017).
[6] VQEG, VQEG video datasets and organizations, https://www.its.bldrdoc.gov/vqeg/video-datasets-and-organizations.aspx
[7] Wang T, Chen M, Chao H. A Novel Deep Learning-Based Method of Improving Coding Efficiency from the Decoder-End for HEVC. In Data Compression Conference, IEEE, 2017.
Claims (8)
1. A video quality enhancement method based on adaptive separable convolution, characterized in that: the adopted system network comprises an adaptive separable convolution network and a residual network, wherein the adaptive separable convolution network is used to obtain the motion-compensated frame and the residual network is used to remove the compression artifacts of the video frame; the video quality enhancement method comprises the following specific steps:
step 1, selecting high-quality videos to form a video database;
step 2, preprocessing the video database to form a training data set; the training data set is composed of a pairing set of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}, wherein I_t^c represents the current frame of the compressed image, I_{t+1}^c represents the next frame of the compressed image, I_t^gt represents the current frame of the high-definition image, and I_{t+1}^gt represents the next frame of the high-definition image;
step 3, inputting two consecutive compressed video frames I_t^c and I_{t+1}^c, and obtaining the predicted compressed video frame I'_{t+1}^c of the next frame I_{t+1}^c using the separable convolution network;
step 4, simultaneously normalizing and converting to the Y channel the predicted compressed video frame I'_{t+1}^c obtained by the adaptive separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set and the uncompressed image I_{t+1}^gt;
step 5, inputting the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c, and obtaining the predicted high-definition video frame I'_{t+1}^gt using the residual network model;
Step 6: compressing video frames based on predictionAnd predicting high definition video framesCalculating a total cost function;
and 7: and continuously updating and optimizing the overall cost function to obtain the optimal convolution weight parameters and bias parameters.
2. The video quality enhancement method based on adaptive separable convolution according to claim 1, characterized in that step 2 specifically comprises the following steps:
step 2-1, setting a quality coefficient qp according to the latest HEVC standard, and compressing the original video by using an ffmpeg command to ensure that each high-definition video has a corresponding video with a compression artifact;
step 2-2, respectively carrying out frame extraction on the high-definition video and the compressed video to obtain a high-definition image set and a corresponding compressed image set;
step 2-3, two consecutive images are taken from the compressed image set each time, and compressed video frames I_t^c and I_{t+1}^c of size d × d are cropped from them;
step 2-4, two corresponding images are simultaneously taken from the high-definition image set and the same operation is performed to obtain high-definition video frames I_t^gt and I_{t+1}^gt, forming a pairing set of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt};
And 2-5, randomly disordering the sequence of the video frames in the pairing set to obtain a training data set of the network model.
3. The video quality enhancement method based on adaptive separable convolution according to claim 1, characterized in that the separable convolutional neural network comprises five encoding modules, four decoding modules, a separable convolution module and an image prediction module.
4. The video quality enhancement method based on adaptive separable convolution according to claim 3, characterized in that step 3 specifically comprises the following steps:
step 3.1, each coding module comprises three convolutional layers and one average pooling layer,
the calculation formula of the convolution layer is:
a(i,j) = f( Σ_m Σ_n w(m,n) · x(i+m, j+n) + w_b )
where x(i,j) denotes the pixel in row i and column j of the image, w(m,n) denotes the weight in row m and column n of the filter, w_b denotes the bias term of the filter, a(i,j) denotes the element in row i and column j of the obtained feature map, and f denotes the relu activation function;
the calculation formula of the average pooling layer is:
h_m = (1/N) Σ_{i=1}^{N} α_i
where α_i denotes the value of the i-th pixel in the pooled neighborhood (after normalization α_i lies in the range 0–1), N denotes the total number of pixels in the neighborhood, and h_m denotes the pooled result of all pixels in the neighborhood;
step 3.2, each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer; the output of the last encoding module is used as the input of the first decoding module, and the output of each decoding module is then used as the input of the next decoding module; the calculation formula of the convolution layers of the decoding module is the same as that of the encoding module;
the computation process of the bilinear upsampling layer is as follows:
step 3.2.1, for each obtained feature map, in order to find the value of the unknown function f at the point p = (x, y), linear interpolation is first performed in the x direction:
f(R1) ≈ ((x2 − x)/(x2 − x1))·f(Q11) + ((x − x1)/(x2 − x1))·f(Q21),  R1 = (x, y1)
f(R2) ≈ ((x2 − x)/(x2 − x1))·f(Q12) + ((x − x1)/(x2 − x1))·f(Q22),  R2 = (x, y2)
where Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function;
step 3.2.2, linear interpolation is then carried out in the y direction:
f(p) ≈ ((y2 − y)/(y2 − y1))·f(R1) + ((y − y1)/(y2 − y1))·f(R2)
and in this way the desired interpolation result is obtained:
the value of the corresponding pixel in the upsampled feature map is obtained by passing the pixel to be predicted, p = (x, y), through the bilinear interpolation function f, i.e., it equals f(x, y);
step 3.3, skip connections are added between the decoder and the encoder: skip connections are used between the third convolution layer of the 2nd, 3rd, 4th and 5th encoding modules and the bilinear upsampling layer of the corresponding 4th, 3rd, 2nd and 1st decoding modules respectively, and the output features of the encoding module and the decoding module are added to obtain the combined features;
step 3.4, the separable convolution module comprises four sub-networks, wherein each sub-network consists of three convolution layers and a bilinear up-sampling layer; the method comprises the following specific steps:
step 3.4.1, the output of steps 3.1–3.3 is expanded into two adaptive convolution kernels, which are convolved with the two consecutive input frames respectively:
I_gt = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)    (7)
where K1(x, y) and K2(x, y) respectively denote the two-dimensional convolution kernels predicted by the separable convolution model, P1(x, y) and P2(x, y) denote the pixel values of the two consecutive input frames, and * denotes the convolution operation;
step 3.4.2, each two-dimensional adaptive convolution kernel is expanded into two one-dimensional convolution kernels along the horizontal and vertical directions, <K1_v(x, y), K1_h(x, y)> and <K2_v(x, y), K2_h(x, y)>, so as to obtain four adaptive one-dimensional convolution kernels;
step 3.4.3, the convolution of two one-dimensional convolution kernels can approximate a two-dimensional convolution kernel:
K1(x,y)≈K1_h(x,y)*K1_v(x,y)
K2(x,y)≈K2_h(x,y)*K2_v(x,y) (8)
step 3.4.4, the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels are convolved in turn with the input current frame I1 and the next frame I2, and the two results are finally added to give the output, which is the compensated image of the next frame;
step 3.5, according to the above formulas (7) and (8), the originally input current-frame image P1(x, y) and second-frame image P2(x, y) are convolved with the convolution kernels output by the adaptive separable convolution module to obtain the predicted image I_gt of the image prediction module:
Igt=k1_h(x,y)*k1_v(x,y)*P1(x,y)+k2_h(x,y)*k2_v(x,y)*P2(x,y) (9)
5. The video quality enhancement method based on adaptive separable convolution according to claim 1, characterized in that step 4 specifically comprises the following steps:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
step 4.2, take the normalized RGB image and obtain the Y-channel image according to the formula
Y = 0.257R + 0.564G + 0.098B + 16.
6. The video quality enhancement method based on adaptive separable convolution according to claim 1, characterized in that the residual network in step 5 comprises an initial convolution module, residual convolution modules and an image reconstruction module.
7. The video quality enhancement method based on adaptive separable convolution according to claim 6, characterized in that step 5 comprises the following processing steps:
step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom-layer feature F1 is obtained by learning:
F1 = F_relu(W1·X + B1)    (10)
where W1 and B1 are the weights and bias parameters of the initial convolution module, F_relu denotes the relu activation function, and X denotes the union of the input compressed video frame and the predicted compressed video frame serving as the network input;
step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer and the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:
F_k = W_k·F_relu(W_{k-1}·F_{k-2} + B_{k-1}) + F_{k-2}    (11)
F_{k,k+2} = F_k + F_{k+2}    (12)
where F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu denotes the relu activation function, W_k denotes the weight of the k-th convolution layer, W_{k-1} and B_{k-1} denote the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2};
step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:
F_g = W_M·F_relu(W_{M-1}·F_{k,k+2} + B_{M-1}) + F1    (13)
where F1 is the bottom-layer feature obtained from (10), F_relu denotes the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (12), W_M denotes the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} denote the weights and bias parameters of the (M-1)-th convolution module.
8. The video quality enhancement method based on adaptive separable convolution according to claim 1, characterized in that the calculation of the total cost function comprises the following steps:
step 6.1, in the separable convolution network, the predicted compressed video frame of the next frame, I'_{t+1}^c, is compared with the original compressed image of the next frame, I_{t+1}^c, and the Euclidean distance between the two is calculated:
Mse_loss = (1/num)·Σ_i ( I'_{t+1}^c(i) − I_{t+1}^c(i) )²    (14)
where num denotes the number of all pixel blocks in each frame image;
step 6.2, in the network for removing video-frame compression artifacts, the predicted high-definition video frame I'_{t+1}^gt is compared with the original high-definition video frame I_{t+1}^gt, and the Charbonnier penalty function is calculated:
Charbonnier_loss = (1/num)·Σ_i sqrt( ( I'_{t+1}^gt(i) − I_{t+1}^gt(i) )² + ε² )    (15)
where num denotes the number of all pixel blocks in each frame image, and ε is a regularization term used to preserve image edges, set empirically to 1e-3;
step 6.3, adding the two loss functions to obtain an overall cost function:
Total_loss = Mse_loss + Charbonnier_loss    (16).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810603510.6A CN108900848B (en) | 2018-06-12 | 2018-06-12 | Video quality enhancement method based on self-adaptive separable convolution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810603510.6A CN108900848B (en) | 2018-06-12 | 2018-06-12 | Video quality enhancement method based on self-adaptive separable convolution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108900848A CN108900848A (en) | 2018-11-27 |
CN108900848B true CN108900848B (en) | 2021-03-02 |
Family
ID=64344922
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810603510.6A Active CN108900848B (en) | 2018-06-12 | 2018-06-12 | Video quality enhancement method based on self-adaptive separable convolution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108900848B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109451308B (en) | 2018-11-29 | 2021-03-09 | 北京市商汤科技开发有限公司 | Video compression processing method and device, electronic equipment and storage medium |
CN110677651A (en) * | 2019-09-02 | 2020-01-10 | 合肥图鸭信息科技有限公司 | Video compression method |
CN110610467B (en) * | 2019-09-11 | 2022-04-15 | 杭州当虹科技股份有限公司 | Multi-frame video compression noise removing method based on deep learning |
CN110705513A (en) * | 2019-10-17 | 2020-01-17 | 腾讯科技(深圳)有限公司 | Video feature extraction method and device, readable storage medium and computer equipment |
CN113727141B (en) * | 2020-05-20 | 2023-05-12 | 富士通株式会社 | Interpolation device and method for video frames |
CN113761983B (en) * | 2020-06-05 | 2023-08-22 | 杭州海康威视数字技术股份有限公司 | Method and device for updating human face living body detection model and image acquisition equipment |
CN112257847A (en) * | 2020-10-16 | 2021-01-22 | 昆明理工大学 | Method for predicting geomagnetic Kp index based on CNN and LSTM |
RU2764395C1 (en) | 2020-11-23 | 2022-01-17 | Самсунг Электроникс Ко., Лтд. | Method and apparatus for joint debayering and image noise elimination using a neural network |
CN112801266B (en) * | 2020-12-24 | 2023-10-31 | 武汉旷视金智科技有限公司 | Neural network construction method, device, equipment and medium |
CN115442613A (en) * | 2021-06-02 | 2022-12-06 | 四川大学 | Interframe information-based noise removal method using GAN |
CN114339030B (en) * | 2021-11-29 | 2024-04-02 | 北京工业大学 | Network live video image stabilizing method based on self-adaptive separable convolution |
CN114820350A (en) * | 2022-04-02 | 2022-07-29 | 北京广播电视台 | Inverse tone mapping system, method and neural network system thereof |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103366389A (en) * | 2013-04-27 | 2013-10-23 | 中国人民解放军北京军区总医院 | CT (computed tomography) image reconstruction method |
CN107871332A (en) * | 2017-11-09 | 2018-04-03 | 南京邮电大学 | A kind of CT based on residual error study is sparse to rebuild artifact correction method and system |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060062478A1 (en) * | 2004-08-16 | 2006-03-23 | Grandeye, Ltd., | Region-sensitive compression of digital video |
WO2016132145A1 (en) * | 2015-02-19 | 2016-08-25 | Magic Pony Technology Limited | Online training of hierarchical algorithms |
CN106131443A (en) * | 2016-05-30 | 2016-11-16 | 南京大学 | A kind of high dynamic range video synthetic method removing ghost based on Block-matching dynamic estimation |
CN106791836A (en) * | 2016-12-02 | 2017-05-31 | 深圳市唯特视科技有限公司 | It is a kind of to be based on a pair of methods of the reduction compression of images effect of Multi net voting |
CN106709875B (en) * | 2016-12-30 | 2020-02-18 | 北京工业大学 | Compressed low-resolution image restoration method based on joint depth network |
CN107145846B (en) * | 2017-04-26 | 2018-10-19 | 贵州电网有限责任公司输电运行检修分公司 | A kind of insulator recognition methods based on deep learning |
CN107392868A (en) * | 2017-07-21 | 2017-11-24 | 深圳大学 | Compression binocular image quality enhancement method and device based on full convolutional neural networks |
CN107463989B (en) * | 2017-07-25 | 2019-09-27 | 福建帝视信息科技有限公司 | A kind of image based on deep learning goes compression artefacts method |
CN107507148B (en) * | 2017-08-30 | 2018-12-18 | 南方医科大学 | Method based on the convolutional neural networks removal down-sampled artifact of magnetic resonance image |
-
2018
- 2018-06-12 CN CN201810603510.6A patent/CN108900848B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103366389A (en) * | 2013-04-27 | 2013-10-23 | 中国人民解放军北京军区总医院 | CT (computed tomography) image reconstruction method |
CN107871332A (en) * | 2017-11-09 | 2018-04-03 | 南京邮电大学 | A kind of CT based on residual error study is sparse to rebuild artifact correction method and system |
Also Published As
Publication number | Publication date |
---|---|
CN108900848A (en) | 2018-11-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108900848B (en) | Video quality enhancement method based on self-adaptive separable convolution | |
Zhang et al. | DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal | |
Sun et al. | Reduction of JPEG compression artifacts based on DCT coefficients prediction | |
CN111866521A (en) | Video image compression artifact removing method combining motion compensation and generation type countermeasure network | |
CN111047532B (en) | Low-illumination video enhancement method based on 3D convolutional neural network | |
Yu et al. | Quality enhancement network via multi-reconstruction recursive residual learning for video coding | |
CN111031315B (en) | Compressed video quality enhancement method based on attention mechanism and time dependence | |
CN113055674B (en) | Compressed video quality enhancement method based on two-stage multi-frame cooperation | |
CN112218094A (en) | JPEG image decompression effect removing method based on DCT coefficient prediction | |
WO2022211657A9 (en) | Configurable positions for auxiliary information input into a picture data processing neural network | |
CN112188217B (en) | JPEG compressed image decompression effect removing method combining DCT domain and pixel domain learning | |
CN113810715B (en) | Video compression reference image generation method based on cavity convolutional neural network | |
CN115187455A (en) | Lightweight super-resolution reconstruction model and system for compressed image | |
US20230110503A1 (en) | Method, an apparatus and a computer program product for video encoding and video decoding | |
CN112601095B (en) | Method and system for creating fractional interpolation model of video brightness and chrominance | |
Ho et al. | SR-CL-DMC: P-frame coding with super-resolution, color learning, and deep motion compensation | |
CN113822801B (en) | Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network | |
CN115243044A (en) | Reference frame selection method and device, equipment and storage medium | |
Amaranageswarao et al. | Blind compression artifact reduction using dense parallel convolutional neural network | |
WO2022211658A1 (en) | Independent positioning of auxiliary information in neural network based picture processing | |
Jia et al. | Deep convolutional network based image quality enhancement for low bit rate image compression | |
Mishra et al. | Edge-aware image compression using deep learning-based super-resolution network | |
CN114862687B (en) | Self-adaptive compressed image restoration method driven by depth deblocking operator | |
CN112243132A (en) | Compressed video post-processing method combining non-local prior and attention mechanism | |
CN114071166B (en) | HEVC compressed video quality improvement method combined with QP detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
CP03 | Change of name, title or address |
Address after: 350000 Unit 01, 16th Floor, TB # Office Building, Phase III, CR MIXC, Hongshanyuan Road, Hongshan Town, Gulou District, Fuzhou City, Fujian Province Patentee after: Fujian Deshi Technology Group Co.,Ltd. Address before: 350000 area B, 5th floor, building 2, Yunzuo, 528 Xihong Road, Gulou District, Fuzhou City, Fujian Province Patentee before: FUJIAN IMPERIAL VISION INFORMATION TECHNOLOGY CO.,LTD. |