CN108900848B - Video quality enhancement method based on self-adaptive separable convolution - Google Patents

Video quality enhancement method based on self-adaptive separable convolution

Info

Publication number
CN108900848B
CN108900848B CN201810603510.6A
Authority
CN
China
Prior art keywords
convolution
layer
image
video
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810603510.6A
Other languages
Chinese (zh)
Other versions
CN108900848A (en)
Inventor
高钦泉
聂可卉
刘文哲
童同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Deshi Technology Group Co ltd
Original Assignee
Fujian Imperial Vision Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Imperial Vision Information Technology Co ltd filed Critical Fujian Imperial Vision Information Technology Co ltd
Priority to CN201810603510.6A priority Critical patent/CN108900848B/en
Publication of CN108900848A publication Critical patent/CN108900848A/en
Application granted granted Critical
Publication of CN108900848B publication Critical patent/CN108900848B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/85Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
    • H04N19/86Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression involving reduction of coding artifacts, e.g. of blockiness
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/577Motion compensation with bidirectional frame interpolation, i.e. using B-pictures

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a video quality enhancement method based on adaptive separable convolution. Adaptive separable convolution is applied as the first module in the network model: each two-dimensional convolution is converted into a pair of one-dimensional convolution kernels in the horizontal and vertical directions, so the parameter count drops from n² to n + n. Secondly, motion-vector estimation is realized with convolution kernels that the network learns adaptively for different inputs: two consecutive frames are selected as the network input, a pair of separable two-dimensional convolution kernels is obtained for each pair of consecutive inputs, and these 2-D kernels are then unfolded into four 1-D convolution kernels that change with the input, improving the adaptivity of the network. By replacing two-dimensional convolution kernels with one-dimensional ones, the invention reduces the parameters of the network training model and achieves high execution efficiency.

Description

Video quality enhancement method based on self-adaptive separable convolution
Technical Field
The invention relates to the field of image processing and deep learning technology, in particular to a video quality enhancement method based on adaptive separable convolution.
Background
Removing compression artifacts from images and video is a classical problem in computer vision. The goal is to estimate a lossless image from a compressed image or video. In the age of information explosion, the number of images and videos spread on the Internet and mobile phones grows day by day, and lossy compression technologies such as JPEG and WebP are widely applied on platforms such as news websites, WeChat and Weibo to reduce file size and thereby save bandwidth and transmission time. Images and videos used in web pages need to be compressed as much as possible to speed up page loading and improve user experience. These compression algorithms, however, typically introduce compression artifacts such as blocking, contouring, blurring, ringing, and the like. Generally, the larger the compression factor, the more severe the video degradation caused by these artifacts, resulting in loss of video information and directly affecting the visual experience of the user. Therefore, there is increasing interest in how to recover visually high-quality, artifact-free images and videos.
In recent years, with the development of deep learning, more and more techniques have been applied to improving the visual quality of compressed images and videos. For example, Dong et al. [1] used a 3-layer convolutional neural network (AR-CNN) to remove the artifacts of JPEG-compressed images and obtained a good decompression effect. Yang et al. then proposed DS-CNN [2,3] for video quality enhancement. However, none of the above video quality enhancement methods utilizes information between adjacent frames, so their network performance is limited to a large extent. Recently, Yang et al. further proposed the MFQE algorithm [4], which observes that, because the quality of each frame in a compressed video fluctuates greatly, the information in a high-quality frame can be used to enhance the quality of its neighboring low-quality frames. However, that method relies on an optical-flow estimation network to estimate the motion between frames, and because the ground-truth value of the motion estimation is difficult to obtain, the effect is not obvious.
Disclosure of Invention
The invention aims to provide a video quality enhancement method based on self-adaptive separable convolution aiming at the problem of artifacts generated by high-degree compression of a video, so that various artifacts in a compressed video are effectively removed, and the video quality and the visual effect are obviously improved.
The technical scheme adopted by the invention is as follows:
a video quality enhancement method based on adaptive separable convolution adopts a system network comprising an adaptive separable convolution network and a residual error network, wherein the adaptive separable convolution network is used for obtaining a motion compensation frame, and the residual error network is used for removing a compression artifact of a video frame so as to enhance the video quality; the video quality enhancement method comprises the following specific steps:
step 1, selecting high-quality videos to form a video database [4,5,6 ].
Step 2, preprocessing the video database to form a training data set; the training data set is composed of a pairing set of a number of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}, wherein I_t^c represents the current frame of the compressed video, I_{t+1}^c represents the next frame of the compressed video, I_t^gt represents the current frame of the high-definition video, and I_{t+1}^gt represents the next frame of the high-definition video;
Step 3, inputting two continuous compressed video frames I_t^c and I_{t+1}^c, and obtaining the predicted compressed video frame I'_{t+1}^c of the next frame I_{t+1}^c using the separable convolution network;
Step 4, simultaneously performing normalization and Y-channel processing on the predicted compressed video frame I'_{t+1}^c obtained by the adaptive separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set, and the uncompressed image I_{t+1}^gt;
Step 5, inputting the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c, and obtaining the predicted high-definition video frame I'_{t+1}^gt using the residual network model;
Step 6: calculating the overall cost function based on the predicted compressed video frame I'_{t+1}^c and the predicted high-definition video frame I'_{t+1}^gt;
and 7: and continuously updating and optimizing the overall cost function to obtain the optimal convolution weight parameters and the optimal bias parameters until the optimal effect is obtained.
Further, step 2 specifically includes the following steps:
step 2-1, setting a quality coefficient qp according to the latest HEVC standard, and compressing the original video by using an ffmpeg command to ensure that each high-definition video has a corresponding video with a compression artifact;
step 2-2, respectively carrying out frame extraction on the high-definition video and the compressed video to obtain a high-definition image set and a corresponding compressed image set;
Step 2-3, taking two continuous images from the compressed image set each time and cropping them to size d × d to obtain the compressed video frames I_t^c and I_{t+1}^c; since the task is to remove compression artifacts of video, inter-frame similarity should be considered;
Step 2-4, simultaneously taking the two corresponding images from the high-definition image set and performing the same operation to obtain the high-definition video frames I_t^gt and I_{t+1}^gt, forming a pairing set of a number of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt};
Step 2-5, randomly shuffling the order of the video frames in the pairing set to obtain the training data set of the network model; a sketch of this preprocessing pipeline is given below.
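The following is a minimal sketch of steps 2-1 to 2-5, assuming ffmpeg (with libx265) and OpenCV are available; the file names, the quality coefficient qp and the patch size d are illustrative values, not the patent's exact configuration.

import random
import subprocess

import cv2  # OpenCV, used here only for frame extraction and cropping


def compress_video(src, dst, qp=37):
    # Step 2-1: compress the original video with HEVC at a fixed quality coefficient qp.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", "libx265", "-x265-params", f"qp={qp}", dst],
        check=True,
    )


def extract_frames(path):
    # Step 2-2: decode a video into a list of frames.
    cap, frames = cv2.VideoCapture(path), []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    return frames


def build_pairs(hd_path, qp=37, d=128):
    # Steps 2-3 / 2-4: crop co-located d x d patches from two consecutive compressed
    # frames and from the corresponding high-definition frames (frames assumed larger than d).
    cmp_path = hd_path + f".qp{qp}.mp4"
    compress_video(hd_path, cmp_path, qp)
    hd, cp = extract_frames(hd_path), extract_frames(cmp_path)
    pairs = []
    for t in range(len(cp) - 1):
        h, w = cp[t].shape[:2]
        y, x = random.randrange(h - d), random.randrange(w - d)
        pairs.append((cp[t][y:y + d, x:x + d], cp[t + 1][y:y + d, x:x + d],
                      hd[t][y:y + d, x:x + d], hd[t + 1][y:y + d, x:x + d]))
    random.shuffle(pairs)  # Step 2-5: randomly shuffle the paired patches
    return pairs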
Further, the separable convolutional neural network in the step 3 comprises five encoding modules, four decoding modules, a separable convolutional module and an image prediction module;
further, step 3 specifically includes the following steps:
Step 3.1, each encoding module comprises three convolution layers and one average pooling layer; the calculation formula of a convolution layer is:

a_{i,j} = f( Σ_m Σ_n w_{m,n} · x_{i+m, j+n} + w_b )    (1)

wherein x_{i,j} represents the pixel in row i, column j of the image, w_{m,n} represents the weight in row m, column n of the filter, w_b represents the bias term of the filter, a_{i,j} represents row i, column j of the obtained feature map, and f represents the activation function relu;
The formula for the average pooling layer is:

h_m = (1/N) Σ_{i=1}^{N} α_i    (2)

wherein α_i represents the value of the i-th pixel point in the taken neighborhood (after normalization, α_i ranges from 0 to 1), N represents the total number of pixel points in the neighborhood, and h_m represents the result of pooling all pixel points in the neighborhood; a minimal sketch of such an encoding module follows.
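Below is a small PyTorch sketch of one encoding module of step 3.1: three 3 × 3 convolution layers (each followed by relu, i.e. formula (1)) and one 2 × 2 average pooling layer (formula (2)). The channel widths are assumed values for illustration only.

import torch.nn as nn


class EncodingModule(nn.Module):
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AvgPool2d(kernel_size=2)  # averages each 2x2 neighborhood (N = 4)

    def forward(self, x):
        feat = self.body(x)            # a_{i,j} = relu(sum_{m,n} w_{m,n} x_{i+m,j+n} + w_b)
        return self.pool(feat), feat   # pooled output, plus pre-pool feature for skip connections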
Step 3.2, each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer; the output of the last encoding module is used as the input of the first decoding module, and the output of each decoding module is then used as the input of the next decoding module; the calculation formula of the convolution layers of the decoding module is the same as that of the encoding module;
the computation process of the bilinear upsampling layer is as follows:
Step 3.2.1, for each obtained feature map, in order to obtain the value of the unknown function f at the point p = (x, y), linear interpolation is first performed in the x direction:

f(R1) ≈ ((x2 − x)/(x2 − x1)) · f(Q11) + ((x − x1)/(x2 − x1)) · f(Q21), where R1 = (x, y1)    (3)
f(R2) ≈ ((x2 − x)/(x2 − x1)) · f(Q12) + ((x − x1)/(x2 − x1)) · f(Q22), where R2 = (x, y2)    (4)

wherein Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function;
Step 3.2.2, linear interpolation is then performed in the y direction:

f(p) ≈ ((y2 − y)/(y2 − y1)) · f(R1) + ((y − y1)/(y2 − y1)) · f(R2)    (5)

so that the desired interpolation result is obtained:

f(x, y) ≈ [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / ((x2 − x1)(y2 − y1))    (6)

that is, the value f(x, y) of the pixel point p = (x, y) to be predicted in the feature map is obtained through the bilinear interpolation function f; a small numeric check of this interpolation is given below.
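The interpolation in formulas (3)-(6) can be checked with a few lines of plain Python; the four corner values in the example are arbitrary. In a PyTorch implementation the bilinear upsampling layer itself can be realized with torch.nn.Upsample(scale_factor=2, mode='bilinear').

def bilinear(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    # Interpolate along x at y1 and y2 (formulas (3) and (4)).
    f_r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    f_r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # Interpolate the two results along y (formulas (5)-(6)).
    return (y2 - y) / (y2 - y1) * f_r1 + (y - y1) / (y2 - y1) * f_r2


# Example: the center of a unit cell is the average of its four corner values.
print(bilinear(0.5, 0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0))  # -> 2.5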
Step 3.3, adding skip connections between the decoder and the encoder: skip connections are respectively applied between the third convolution layer of the 2nd, 3rd, 4th and 5th encoding modules and the bilinear upsampling layer of the corresponding 4th, 3rd, 2nd and 1st decoding modules, and the output features of the encoding module and the decoding module are added to obtain the combined features;
step 3.4, the separable convolution module comprises four sub-networks, wherein each sub-network consists of three convolution layers and a bilinear up-sampling layer; the method comprises the following specific steps:
Step 3.4.1, the output of steps 3.1-3.3 is expanded into two adaptive convolution kernels which perform convolution operations on the two continuous input frames respectively:

I_gt(x, y) = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)    (7)

wherein K1(x, y) and K2(x, y) respectively represent the two-dimensional convolution kernels predicted by the separable convolution model, P1(x, y) and P2(x, y) represent the pixel values of the two continuous input frames, and * represents the convolution operation;
Step 3.4.2, each two-dimensional adaptive convolution kernel is expanded into 2 one-dimensional convolution kernels along the horizontal and vertical directions, <K1_v(x,y), K1_h(x,y)> and <K2_v(x,y), K2_h(x,y)>, to obtain four adaptive one-dimensional convolution kernels;
Step 3.4.3, the convolution of two one-dimensional convolution kernels can approximate a two-dimensional convolution kernel:

K1(x, y) ≈ K1_h(x, y) * K1_v(x, y)
K2(x, y) ≈ K2_h(x, y) * K2_v(x, y)    (8)
Step 3.4.4, the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels perform convolution operations on the input current frame I1 and the next frame I2 in sequence, and the two results are added to obtain the output result, which is the compensation image of the next frame;
Step 3.5, the originally input current frame image P1(x, y) and second frame image P2(x, y) are convolved with the convolution kernels output by the adaptive separable convolution module to obtain the predicted image I_gt produced by the image prediction module (a sketch of this per-pixel separable convolution is given below):

I_gt = k1_h(x, y) * k1_v(x, y) * P1(x, y) + k2_h(x, y) * k2_v(x, y) * P2(x, y)    (9)
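The per-pixel application of the separable kernels in formula (9) can be sketched in PyTorch as below. The tensor shapes, the kernel length n (assumed odd) and the dense neighborhood gathering via unfold are illustrative assumptions; an efficient implementation would use a dedicated local-convolution operator, since this sketch materializes an n × n patch for every pixel.

import torch
import torch.nn.functional as F


def apply_separable_kernels(frame, k_v, k_h):
    # frame: (B, C, H, W); k_v, k_h: (B, n, H, W) per-pixel 1-D kernels, n assumed odd
    b, c, h, w = frame.shape
    n = k_v.shape[1]
    pad = n // 2
    # Gather the n x n neighborhood of every pixel: (B, C, n, n, H, W)
    patches = F.unfold(F.pad(frame, [pad] * 4), kernel_size=n)
    patches = patches.view(b, c, n, n, h, w)
    # The outer product k_v * k_h^T reconstructs the local 2-D kernel (formula (8)),
    # which is applied to the neighborhood of each pixel (formula (7)).
    return torch.einsum('bcvuhw,bvhw,buhw->bchw', patches, k_v, k_h)


def predict_next_frame(p1, p2, k1_v, k1_h, k2_v, k2_h):
    # Formula (9): convolve each input frame with its own separable kernels and sum the results.
    return apply_separable_kernels(p1, k1_v, k1_h) + apply_separable_kernels(p2, k2_v, k2_h)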
Further, the specific steps of step 4 are as follows:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
Step 4.2, taking the normalized RGB image and obtaining the Y-channel image according to the formula

Y = 0.257R + 0.564G + 0.098B + 16;

a short sketch of steps 4.1-4.2 follows.
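A short sketch of the normalization and Y-channel extraction, assuming H × W × 3 arrays in RGB order; rescaling the offset 16 to 16/255 after normalization is an assumption about how the two steps compose, so that both outputs stay in [0, 1].

import numpy as np


def normalize(img_uint8):
    # Step 4.1: each pixel value divided by 255 so that it lies in [0, 1].
    return img_uint8.astype(np.float32) / 255.0


def to_y_channel(rgb01):
    # Step 4.2: Y = 0.257R + 0.564G + 0.098B + 16, with the offset rescaled to the [0, 1] range.
    r, g, b = rgb01[..., 0], rgb01[..., 1], rgb01[..., 2]
    return 0.257 * r + 0.564 * g + 0.098 * b + 16.0 / 255.0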
Further, the residual network in step 5 comprises an initial convolution module, residual convolution modules and an image reconstruction module;
further, step 5 comprises the following processing steps:
Step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom-layer feature F1 is obtained by learning:

F1 = F_relu(W1 · X + B1)    (10)

wherein W1 and B1 are the weight and bias parameters of the initial convolution module, F_relu represents the relu activation function, and X represents the combined network input (the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c);
Step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer to the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:

F_k = W_k · F_relu(W_{k-1} F_{k-2} + B_{k-1}) + F_{k-2}    (11)
F_{k,k+2} = F_k + F_{k+2}    (12)

wherein F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu represents the relu activation function, W_k represents the weight of the k-th convolution layer, W_{k-1} and B_{k-1} represent the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2}; a minimal sketch of one residual convolution module follows.
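One residual convolution module of step 5.2 can be sketched in PyTorch as follows (conv, relu, conv, plus the skip addition of formula (11)); the channel width is an assumed value.

import torch.nn as nn


class ResidualModule(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_prev):
        # F_k = W_k( relu(W_{k-1} F_{k-2} + B_{k-1}) ) + F_{k-2}, with f_prev playing F_{k-2}  (11)
        return self.conv2(self.relu(self.conv1(f_prev))) + f_prev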
Step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:

F_g = W_M · F_relu(W_{M-1} F_{k,k+2} + B_{M-1}) + F_1    (13)

wherein F_1 is the bottom-layer feature obtained from (10), F_relu represents the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (12), W_M represents the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} represent the weight and bias parameters of the (M-1)-th convolution module.
Further, the calculation of the overall cost function of step 6 comprises the following steps:
Step 6.1, in the separable convolution network, the predicted compressed video frame I'_{t+1}^c of the next frame is compared with the original compressed video frame I_{t+1}^c of the next frame, and the Euclidean distance between the two is calculated:

Mse_loss = (1/num) Σ_{i=1}^{num} || I'_{t+1}^c(i) − I_{t+1}^c(i) ||²    (14)

wherein num denotes the number of all pixel blocks in each frame image.
Step 6.2, in the network for removing video-frame compression artifacts, the predicted high-definition video frame I'_{t+1}^gt is compared with the original high-definition video frame I_{t+1}^gt, and a Charbonnier penalty function is calculated:

Charbonnier_loss = (1/num) Σ_{i=1}^{num} sqrt( || I'_{t+1}^gt(i) − I_{t+1}^gt(i) ||² + ε² )    (15)

wherein num denotes the number of all pixel blocks in each frame image and ε is a regularization term used to preserve image edges, set empirically to 1e-3.
Step 6.3, adding the two loss functions to obtain an overall cost function:
Total_loss=Mse_loss+Charbonnier_loss (16)。
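The overall cost function of formulas (14)-(16) can be sketched in PyTorch as below; ε = 1e-3 follows the value stated above, and the ε² form inside the square root is the standard Charbonnier penalty assumed here.

import torch


def total_loss(pred_comp, target_comp, pred_hd, target_hd, eps=1e-3):
    mse_loss = torch.mean((pred_comp - target_comp) ** 2)                                   # formula (14)
    charbonnier_loss = torch.mean(torch.sqrt((pred_hd - target_hd) ** 2 + eps ** 2))        # formula (15)
    return mse_loss + charbonnier_loss                                                      # formula (16)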
By adopting the above technical scheme, the motion-compensated frame is obtained through the adaptive separable convolution network, and the compression artifacts of the video frame are removed through the residual network, thereby enhancing the video quality. This compression-artifact removal method, based on the adaptive separable convolution network model, can effectively remove various artifacts in compressed video and obviously improve video quality and visual effect.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
FIG. 1 is a schematic structural diagram of the video quality enhancement method based on adaptive separable convolution according to the present invention;
FIG. 2 is a comparison of the artifact removal effect on images of the "vidoo3" sequence in the JCT-VC HEVC standard test set between the prior-art MFQE and the present invention; the test video is compressed according to the latest HEVC standard with the quality coefficient QP set to 37.
Detailed Description
As shown in FIGS. 1-2, the present invention proposes a video enhancement method based on a separable convolution network. The network consists of two parts: the first part is a separable convolution network used to obtain the motion-compensated frame, and the second is a residual network used to remove the compression artifacts of video frames, thereby enhancing video quality. The overall network model is optimized with Adam; except for the one-dimensional convolution kernels of length 51 used in the 4 sub-networks of the separable convolution module, all other convolution layers use 3 × 3 convolution kernels. The specific steps are as follows:
step 1, selecting high-quality videos to form a video database. There were 7000 training data pictures.
Step 2, preprocessing the video database to form a training data set. According to the latest HEVC standard, a quality coefficient qp is set and the ffmpeg command is used to compress the original videos, so that each high-definition video has a corresponding video with compression artifacts. Frames are then extracted from the high-definition videos and the compressed videos respectively to obtain a high-definition image set and a corresponding compressed image set. Since the task is to remove compression artifacts of video, inter-frame similarity should be considered. Each time, two consecutive images are taken from the compressed image set and cropped to size d × d to obtain the video frames I_t^c and I_{t+1}^c; at the same time, the two corresponding images are taken from the high-definition image set and the same operation is performed to obtain the video frames I_t^gt and I_{t+1}^gt, forming a pairing set of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}. The order of the video frames in the pairing set is randomly shuffled to obtain the training data set of the network model. The training data set contains 7000 pictures in total.
Step 3, two continuous compressed video frames I_t^c and I_{t+1}^c (the current frame and the next frame, respectively) are input into the separable convolution network to obtain the prediction I'_{t+1}^c of the next frame I_{t+1}^c. The separable convolutional neural network comprises five encoding modules, four decoding modules, a separable convolution module and an image prediction module. Each encoding module includes three convolution layers and one average pooling layer. The calculation formula of a convolution layer is:
a_{i,j} = f( Σ_m Σ_n w_{m,n} · x_{i+m, j+n} + w_b )    (1)
wherein x_{i,j} represents the pixel in row i, column j of the image, w_{m,n} represents the weight in row m, column n of the filter, w_b represents the bias term of the filter, a_{i,j} represents the pixel in row i, column j of the resulting feature map, and f represents the activation function relu. The convolution kernel size is set to 3 × 3 in the encoding and decoding modules.
The average pooling layer is used for downsampling the output feature map, and further reduces the parameter number by removing unimportant samples in the feature map.
Then, the output of the encoding module is used as the input of the decoding module. Each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer, somewhat like the inverse process of the encoding module. The calculation of the bilinear upsampling layer is as follows: for each obtained feature map, linear interpolation is performed in the x direction to obtain:
f(R1) ≈ ((x2 − x)/(x2 − x1)) · f(Q11) + ((x − x1)/(x2 − x1)) · f(Q21), where R1 = (x, y1)    (2)
f(R2) ≈ ((x2 − x)/(x2 − x1)) · f(Q12) + ((x − x1)/(x2 − x1)) · f(Q22), where R2 = (x, y2)    (3)

wherein Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function. Then, linear interpolation is performed again in the y direction:
f(p) ≈ ((y2 − y)/(y2 − y1)) · f(R1) + ((y − y1)/(y2 − y1)) · f(R2)    (4)

Thus, the value of each pixel point of the feature map after bilinear interpolation is obtained, where p = (x, y) is the pixel point to be predicted.
The formula for the convolutional layer is as above.
Meanwhile, feature combination layers are added as bridges connecting the decoder and the encoder to avoid loss of detail information. The specific operation is as follows: the third convolution layer of the 2nd, 3rd, 4th and 5th encoding modules is connected, through a skip connection, to the bilinear upsampling layer of the corresponding 4th, 3rd, 2nd and 1st decoding module respectively, and the output features of the encoding module and the decoding module are added to obtain the combined feature F_K (a sketch of such a decoding module with its skip connection is given below).
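A hedged PyTorch sketch of one decoding module with its skip connection, i.e. element-wise addition of the encoder feature and the bilinearly upsampled decoder feature; channel widths and the upsampling factor are assumptions for illustration.

import torch.nn as nn


class DecodingModule(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, x, encoder_feat):
        up = self.up(self.body(x))
        return up + encoder_feat  # feature combination layer: element-wise addition (skip connection)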
The separable convolution module consists of four sub-networks, where each sub-network is composed of three convolution layers and a bilinear upsampling layer; at this point, the two-dimensional convolution kernel of each convolution layer is replaced by two one-dimensional convolution kernels, which represent the two-dimensional kernel in the horizontal and vertical directions respectively. The specific process is as follows: the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels perform convolution operations on the input current frame I1 and the next frame I2 respectively, and the two results are finally added to obtain the output result, i.e. the predicted image of the next frame. The specific operation is as follows:
The final predicted image I_gt can be obtained by convolving the pixels P1(x, y) of the originally input current frame image and the pixels P2(x, y) of the second frame image with the convolution kernels learned by the network for the two images respectively:
I_gt = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)    (5)
using convolution results of one-dimensional convolution kernel in horizontal direction and one-dimensional convolution kernel in vertical directionTwo-dimensional convolution kernel K in approximate expression (6)1(x, y) and K2(x,y):
K1(x,y)=k1_h(x,y)*v1_v(x,y)
K2(x,y)=k2_h(x,y)*k2_v(x,y) (6)
This yields:

I_gt = k1_h(x, y) * k1_v(x, y) * P1(x, y) + k2_h(x, y) * k2_v(x, y) * P2(x, y)    (7)
Step 4, the predicted frame I'_{t+1}^c obtained by the separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set and the uncompressed image I_{t+1}^gt are simultaneously normalized and processed into the Y channel. The specific steps are as follows:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
Step 4.2, taking the normalized RGB image and obtaining the Y-channel image according to the formula

Y = 0.257R + 0.564G + 0.098B + 16.
Step 5, the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c are input into the residual network model to obtain the model-predicted image I'_{t+1}^gt. The residual network comprises an initial convolution module, residual convolution modules and an image reconstruction module. Each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer, where the feature combination layer adds, through a skip connection, the output feature F_k of this layer to the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}.
Step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom-layer feature F1 is obtained by learning:

F1 = F_relu(W1 · X + B1)    (8)

wherein W1 and B1 are the weight and bias parameters of the initial convolution module, F_relu represents the relu activation function, and X represents the combined network input;
Step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer to the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:

F_k = W_k · F_relu(W_{k-1} F_{k-2} + B_{k-1}) + F_{k-2}    (9)
F_{k,k+2} = F_k + F_{k+2}    (10)

wherein F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu represents the relu activation function, W_k represents the weight of the k-th convolution layer, W_{k-1} and B_{k-1} represent the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2}.
Step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:

F_g = W_M · F_relu(W_{M-1} F_{k,k+2} + B_{M-1}) + F_1    (11)

wherein F_1 is the bottom-layer feature obtained from (8), F_relu represents the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (10), W_M represents the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} represent the weight and bias parameters of the (M-1)-th convolution module.
Step 6: calculating an overall cost function;
Step 6.1, in the separable convolution network, the predicted image I'_{t+1}^c of the next frame is compared with the original image I_{t+1}^c of the next frame, and the Euclidean distance between the two is calculated:

Mse_loss = (1/num) Σ_{i=1}^{num} || I'_{t+1}^c(i) − I_{t+1}^c(i) ||²    (12)

wherein num denotes the number of all pixel blocks in each frame image.
Step 6.2, in the network for removing video-frame compression artifacts, the network-predicted image I'_{t+1}^gt is compared with the original video frame I_{t+1}^gt, and a Charbonnier penalty function is calculated:

Charbonnier_loss = (1/num) Σ_{i=1}^{num} sqrt( || I'_{t+1}^gt(i) − I_{t+1}^gt(i) ||² + ε² )    (13)

wherein ε is a regularization term set empirically to 1e-3.
Step 6.3, adding the two loss functions to obtain the overall cost function:
Total_loss=Mse_loss+Charbonnier_loss (14)
And 7: and continuously updating and optimizing the overall cost function to obtain the optimal convolution weight parameters and the optimal bias parameters until the optimal effect is obtained.
Seq.   AR-CNN[1]   DCAD[7]   DSCNN[2]   MFQE[4]   The invention
1      0.13        0.14      0.48       0.77      2.56
2      0.07        0.04      0.42       0.60      2.25
3      0.11        0.11      0.24       0.47      2.51
4      0.13        0.08      0.32       0.44      1.37
5      0.19        0.23      0.33       0.55      1.00
6      0.15        0.16      0.37       0.60      1.32
7      0.14        0.18      0.28       0.39      1.20
8      0.13        0.19      0.28       0.48      1.34
9      0.16        0.22      0.27       0.39      1.46
10     0.15        0.20      0.25       0.40      1.80
Ave.   0.14        0.16      0.32       0.51      1.68

Table 1. Comparison of results with prior-art methods on the test set at QP = 37.
By adopting the above technical scheme, the invention can effectively eliminate the artifacts produced when a video is highly compressed. The innovation of the invention is mainly embodied in two aspects. First, the two-dimensional convolution kernels are replaced with one-dimensional convolution kernels, so the parameters of the network training model are reduced and execution efficiency is high. The invention applies the latest deep learning technology, uses adaptive separable convolution as the first module in the network model, and converts each two-dimensional convolution into a pair of one-dimensional convolution kernels in the horizontal and vertical directions, so that the number of parameters changes from n² to n + n, which greatly reduces the computational cost and saves memory. Second, unlike most approaches that use an optical flow map to motion-compensate consecutive video frames, the present invention uses the adaptively varying convolution kernels learned by the network for different inputs to realize motion vector estimation. When motion offsets are estimated from an optical flow map, the lack of ground-truth flow maps often makes the motion compensation inaccurate. In the invention, two consecutive frames are selected as the network input, a pair of separable two-dimensional convolution kernels is obtained for every two consecutive inputs, and the 2-dimensional convolution kernels are then expanded into four 1-dimensional convolution kernels; the obtained 1-dimensional convolution kernels change as the input changes, which greatly improves the adaptivity of the network and makes the method data-driven. The invention obtains the motion-compensated frame through an adaptive separable convolution network and removes the compression artifacts of video frames through a residual network, thereby enhancing video quality. This compression-artifact removal method based on the adaptive separable convolution network model can effectively remove various artifacts in compressed video and obviously improve video quality and visual effect.
The present invention relates to the following references:
[1] Chao Dong, Yubin Deng, Chen Change Loy, Xiaoou Tang. Compression Artifacts Reduction by a Deep Convolutional Network. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[2] Yang R, Xu M, Wang Z. Decoder-side HEVC quality enhancement with scalable convolutional neural network. In IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2017: 817-822.
[3] Yang R, Xu M, Wang Z, et al. Enhancing Quality for HEVC Compressed Videos. 2017.
[4] Yang R, Xu M, Wang Z, et al. Multi-Frame Quality Enhancement for Compressed Video. 2018.
[5] Xiph.org, Xiph.org Video Test Media, https://media.xiph.org/video/derf/ (2017).
[6] VQEG, VQEG video datasets and organizations, https://www.its.bldrdoc.gov/vqeg/video-datasets-and-organizations.aspx
[7] Wang T, Chen M, Chao H. A Novel Deep Learning-Based Method of Improving Coding Efficiency from the Decoder-End for HEVC. In Data Compression Conference (DCC), IEEE, 2017.

Claims (8)

1. a video quality enhancement method based on adaptive separable convolution is characterized in that: the adopted system network comprises a self-adaptive separable convolution network and a residual error network, wherein the self-adaptive separable convolution network is used for acquiring the motion compensation frame, and the residual error network is used for removing the compression artifact of the video frame; the video quality enhancement method comprises the following specific steps:
step 1, selecting high-quality videos to form a video database;
step 2, preprocessing the video database to form a training data set; the training data set is composed of a pairing set of a number of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt}, wherein I_t^c represents the current frame of the compressed image, I_{t+1}^c represents the next frame of the compressed image, I_t^gt represents the current frame of the high-definition image, and I_{t+1}^gt represents the next frame of the high-definition image;
step 3, inputting two continuous compressed video frames I_t^c and I_{t+1}^c, and obtaining the predicted compressed video frame I'_{t+1}^c of the next frame I_{t+1}^c using the separable convolution network;
step 4, simultaneously performing normalization and Y-channel processing on the predicted compressed video frame I'_{t+1}^c obtained by the adaptive separable convolution network, the corresponding original compressed image I_{t+1}^c of the training set, and the uncompressed image I_{t+1}^gt;
step 5, inputting the compressed video frame I_{t+1}^c and the predicted compressed video frame I'_{t+1}^c, and obtaining the predicted high-definition video frame I'_{t+1}^gt using the residual network model;
step 6: calculating the overall cost function based on the predicted compressed video frame I'_{t+1}^c and the predicted high-definition video frame I'_{t+1}^gt;
and 7: and continuously updating and optimizing the overall cost function to obtain the optimal convolution weight parameters and bias parameters.
2. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: the step 2 specifically comprises the following steps:
step 2-1, setting a quality coefficient qp according to the latest HEVC standard, and compressing the original video by using an ffmpeg command to ensure that each high-definition video has a corresponding video with a compression artifact;
step 2-2, respectively carrying out frame extraction on the high-definition video and the compressed video to obtain a high-definition image set and a corresponding compressed image set;
step 2-3, taking two continuous images from the compressed image set each time and cropping them to size d × d to obtain the compressed video frames I_t^c and I_{t+1}^c;
step 2-4, simultaneously taking the two corresponding images from the high-definition image set and performing the same operation to obtain the high-definition video frames I_t^gt and I_{t+1}^gt, forming a pairing set of a number of video frames {I_t^c, I_{t+1}^c, I_t^gt, I_{t+1}^gt};
step 2-5, randomly shuffling the order of the video frames in the pairing set to obtain the training data set of the network model.
3. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: the separable convolutional neural network comprises five encoding modules, four decoding modules, a separating convolutional module and an image prediction module.
4. The method of claim 3, wherein the adaptive separable convolution-based video quality enhancement method comprises: the step 3 specifically comprises the following steps:
step 3.1, each encoding module comprises three convolution layers and one average pooling layer; the calculation formula of a convolution layer is:

a_{i,j} = f( Σ_m Σ_n w_{m,n} · x_{i+m, j+n} + w_b )    (1)

wherein x_{i,j} represents the pixel in row i, column j of the image, w_{m,n} represents the weight in row m, column n of the filter, w_b represents the bias term of the filter, a_{i,j} represents row i, column j of the obtained feature map, and f represents the activation function relu;
the calculation formula of the average pooling layer is:

h_m = (1/N) Σ_{i=1}^{N} α_i    (2)

wherein α_i represents the value of the i-th pixel point in the taken neighborhood (after normalization, α_i ranges from 0 to 1), N represents the total number of pixel points in the neighborhood, and h_m represents the result of pooling all pixel points in the neighborhood;
step 3.2, each decoding module sequentially comprises three convolution layers and a bilinear upsampling layer; the output of the last encoding module is used as the input of the first decoding module, and the output of each decoding module is then used as the input of the next decoding module; the calculation formula of the convolution layers of the decoding module is the same as that of the encoding module;
the computation process of the bilinear upsampling layer is as follows:
step 3.2.1, for each obtained feature map, in order to obtain the value of the unknown function f at the point p = (x, y), linear interpolation is first performed in the x direction:

f(R1) ≈ ((x2 − x)/(x2 − x1)) · f(Q11) + ((x − x1)/(x2 − x1)) · f(Q21), where R1 = (x, y1)    (3)
f(R2) ≈ ((x2 − x)/(x2 − x1)) · f(Q12) + ((x − x1)/(x2 − x1)) · f(Q22), where R2 = (x, y2)    (4)

wherein Q11 = (x1, y1), Q12 = (x1, y2), Q21 = (x2, y1), Q22 = (x2, y2) are the four known points and f is the bilinear interpolation function;
step 3.2.2, linear interpolation is then performed in the y direction:

f(p) ≈ ((y2 − y)/(y2 − y1)) · f(R1) + ((y − y1)/(y2 − y1)) · f(R2)    (5)

so that the desired interpolation result is obtained:

f(x, y) ≈ [ f(Q11)(x2 − x)(y2 − y) + f(Q21)(x − x1)(y2 − y) + f(Q12)(x2 − x)(y − y1) + f(Q22)(x − x1)(y − y1) ] / ((x2 − x1)(y2 − y1))    (6)

that is, the value f(x, y) of the pixel point p = (x, y) to be predicted in the feature map is obtained through the bilinear interpolation function f;
step 3.3, adding skip connections between the decoder and the encoder: skip connections are respectively applied between the third convolution layer of the 2nd, 3rd, 4th and 5th encoding modules and the bilinear upsampling layer of the corresponding 4th, 3rd, 2nd and 1st decoding modules, and the output features of the encoding module and the decoding module are added to obtain the combined features;
step 3.4, the separable convolution module comprises four sub-networks, wherein each sub-network consists of three convolution layers and a bilinear up-sampling layer; the method comprises the following specific steps:
step 3.4.1, the output of steps 3.1-3.3 is expanded into two adaptive convolution kernels which perform convolution operations on the two continuous input frames respectively:

I_gt(x, y) = K1(x, y) * P1(x, y) + K2(x, y) * P2(x, y)    (7)

wherein K1(x, y) and K2(x, y) respectively represent the two-dimensional convolution kernels predicted by the separable convolution model, P1(x, y) and P2(x, y) represent the pixel values of the two continuous input frames, and * represents the convolution operation;
step 3.4.2, each two-dimensional adaptive convolution kernel is expanded into 2 one-dimensional convolution kernels along the horizontal and vertical directions respectively, <K1_v(x,y), K1_h(x,y)> and <K2_v(x,y), K2_h(x,y)>, to obtain four adaptive one-dimensional convolution kernels;
step 3.4.3, the convolution of two one-dimensional convolution kernels can approximate a two-dimensional convolution kernel:

K1(x, y) ≈ K1_h(x, y) * K1_v(x, y)
K2(x, y) ≈ K2_h(x, y) * K2_v(x, y)    (8)
step 3.4.4, the two groups of one-dimensional kernels <k1_h, k1_v> and <k2_h, k2_v> obtained by the separable convolution module are used as the convolution kernels of the image prediction module; the two groups of convolution kernels perform convolution operations on the input current frame I1 and the next frame I2 in sequence, and the two results are added to obtain the output result, which is the compensation image of the next frame;
step 3.5, according to the above formulas (7) and (8), the originally input current frame image P1(x, y) and second frame image P2(x, y) are convolved with the convolution kernels output by the adaptive separable convolution module to obtain the predicted image I_gt produced by the image prediction module:

I_gt = k1_h(x, y) * k1_v(x, y) * P1(x, y) + k2_h(x, y) * k2_v(x, y) * P2(x, y)    (9)
5. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: the specific steps of the step 4 are respectively as follows:
step 4.1, dividing each pixel value of the image by 255 to enable each pixel to be between [0,1] to obtain a processed image;
step 4.2, taking the normalized RGB image and obtaining the Y-channel image according to the formula

Y = 0.257R + 0.564G + 0.098B + 16.
6. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: and in the step 5, the residual error network respectively comprises an initial convolution module, a residual error convolution module and an image reconstruction module.
7. The method of claim 6, wherein the adaptive separable convolution-based video quality enhancement method comprises: step 5 comprises the following processing steps:
step 5.1, the initial convolution stage comprises a convolution layer and an activation layer, and the bottom-layer feature F1 is obtained by learning:

F1 = F_relu(W1 · X + B1)    (10)

wherein W1 and B1 are the weight and bias parameters of the initial convolution module, F_relu represents the relu activation function, and X represents the combination of the two inputs taken as the network input;
step 5.2, each residual convolution module sequentially comprises a convolution layer, a nonlinear activation layer, a convolution layer and a feature combination layer; the feature combination layer adds, through a skip connection, the output feature F_k of this layer to the output feature F_{k+2} of the convolution layer two layers later to obtain the combined feature F_{k,k+2}:

F_k = W_k · F_relu(W_{k-1} F_{k-2} + B_{k-1}) + F_{k-2}    (11)
F_{k,k+2} = F_k + F_{k+2}    (12)

wherein F_{k-2} is the output feature map of the (k-2)-th convolution layer, F_relu represents the relu activation function, W_k represents the weight of the k-th convolution layer, W_{k-1} and B_{k-1} represent the weight and bias parameters of the (k-1)-th convolution module, and F_{k,k+2} is the high-level combined feature obtained from the feature layers F_k and F_{k+2};
step 5.3, the image reconstruction layer is executed using the obtained high-level feature F_{k,k+2}:

F_g = W_M · F_relu(W_{M-1} F_{k,k+2} + B_{M-1}) + F_1    (13)

wherein F_1 is the bottom-layer feature obtained from (10), F_relu represents the relu activation function, F_{k,k+2} is the high-level combined feature obtained from (12), W_M represents the weight of the M-th convolution layer, and W_{M-1} and B_{M-1} represent the weight and bias parameters of the (M-1)-th convolution module.
8. The method of claim 1, wherein the adaptive separable convolution-based video quality enhancement method comprises: the calculation of the total cost function comprises the following steps:
step 6.1, in the separable convolution network, the predicted compressed video frame I'_{t+1}^c of the next frame is compared with the original compressed image I_{t+1}^c of the next frame, and the Euclidean distance between the two is calculated:

Mse_loss = (1/num) Σ_{i=1}^{num} || I'_{t+1}^c(i) − I_{t+1}^c(i) ||²    (14)

wherein num represents the number of all pixel blocks in each frame image;
step 6.2, in the network for removing video-frame compression artifacts, the predicted high-definition video frame I'_{t+1}^gt is compared with the original high-definition video frame I_{t+1}^gt, and a Charbonnier penalty function is calculated:

Charbonnier_loss = (1/num) Σ_{i=1}^{num} sqrt( || I'_{t+1}^gt(i) − I_{t+1}^gt(i) ||² + ε² )    (15)

wherein num represents the number of all pixel blocks in each frame image, and ε is a regularization term used to preserve image edges, set empirically to 1e-3;
step 6.3, adding the two loss functions to obtain an overall cost function:
Total_loss=Mse_loss+Charbonnier_loss (16)。
CN201810603510.6A 2018-06-12 2018-06-12 Video quality enhancement method based on self-adaptive separable convolution Active CN108900848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810603510.6A CN108900848B (en) 2018-06-12 2018-06-12 Video quality enhancement method based on self-adaptive separable convolution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810603510.6A CN108900848B (en) 2018-06-12 2018-06-12 Video quality enhancement method based on self-adaptive separable convolution

Publications (2)

Publication Number Publication Date
CN108900848A CN108900848A (en) 2018-11-27
CN108900848B true CN108900848B (en) 2021-03-02

Family

ID=64344922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810603510.6A Active CN108900848B (en) 2018-06-12 2018-06-12 Video quality enhancement method based on self-adaptive separable convolution

Country Status (1)

Country Link
CN (1) CN108900848B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109451308B (en) 2018-11-29 2021-03-09 北京市商汤科技开发有限公司 Video compression processing method and device, electronic equipment and storage medium
CN110677651A (en) * 2019-09-02 2020-01-10 合肥图鸭信息科技有限公司 Video compression method
CN110610467B (en) * 2019-09-11 2022-04-15 杭州当虹科技股份有限公司 Multi-frame video compression noise removing method based on deep learning
CN110705513A (en) * 2019-10-17 2020-01-17 腾讯科技(深圳)有限公司 Video feature extraction method and device, readable storage medium and computer equipment
CN113727141B (en) * 2020-05-20 2023-05-12 富士通株式会社 Interpolation device and method for video frames
CN113761983B (en) * 2020-06-05 2023-08-22 杭州海康威视数字技术股份有限公司 Method and device for updating human face living body detection model and image acquisition equipment
CN112257847A (en) * 2020-10-16 2021-01-22 昆明理工大学 Method for predicting geomagnetic Kp index based on CNN and LSTM
RU2764395C1 (en) 2020-11-23 2022-01-17 Самсунг Электроникс Ко., Лтд. Method and apparatus for joint debayering and image noise elimination using a neural network
CN112801266B (en) * 2020-12-24 2023-10-31 武汉旷视金智科技有限公司 Neural network construction method, device, equipment and medium
CN115442613A (en) * 2021-06-02 2022-12-06 四川大学 Interframe information-based noise removal method using GAN
CN114339030B (en) * 2021-11-29 2024-04-02 北京工业大学 Network live video image stabilizing method based on self-adaptive separable convolution
CN114820350A (en) * 2022-04-02 2022-07-29 北京广播电视台 Inverse tone mapping system, method and neural network system thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366389A (en) * 2013-04-27 2013-10-23 中国人民解放军北京军区总医院 CT (computed tomography) image reconstruction method
CN107871332A (en) * 2017-11-09 2018-04-03 南京邮电大学 A kind of CT based on residual error study is sparse to rebuild artifact correction method and system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060062478A1 (en) * 2004-08-16 2006-03-23 Grandeye, Ltd., Region-sensitive compression of digital video
WO2016132145A1 (en) * 2015-02-19 2016-08-25 Magic Pony Technology Limited Online training of hierarchical algorithms
CN106131443A (en) * 2016-05-30 2016-11-16 南京大学 A kind of high dynamic range video synthetic method removing ghost based on Block-matching dynamic estimation
CN106791836A (en) * 2016-12-02 2017-05-31 深圳市唯特视科技有限公司 It is a kind of to be based on a pair of methods of the reduction compression of images effect of Multi net voting
CN106709875B (en) * 2016-12-30 2020-02-18 北京工业大学 Compressed low-resolution image restoration method based on joint depth network
CN107145846B (en) * 2017-04-26 2018-10-19 贵州电网有限责任公司输电运行检修分公司 A kind of insulator recognition methods based on deep learning
CN107392868A (en) * 2017-07-21 2017-11-24 深圳大学 Compression binocular image quality enhancement method and device based on full convolutional neural networks
CN107463989B (en) * 2017-07-25 2019-09-27 福建帝视信息科技有限公司 A kind of image based on deep learning goes compression artefacts method
CN107507148B (en) * 2017-08-30 2018-12-18 南方医科大学 Method based on the convolutional neural networks removal down-sampled artifact of magnetic resonance image

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103366389A (en) * 2013-04-27 2013-10-23 中国人民解放军北京军区总医院 CT (computed tomography) image reconstruction method
CN107871332A (en) * 2017-11-09 2018-04-03 南京邮电大学 A kind of CT based on residual error study is sparse to rebuild artifact correction method and system

Also Published As

Publication number Publication date
CN108900848A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
CN108900848B (en) Video quality enhancement method based on self-adaptive separable convolution
Zhang et al. DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal
Sun et al. Reduction of JPEG compression artifacts based on DCT coefficients prediction
CN111866521A (en) Video image compression artifact removing method combining motion compensation and generation type countermeasure network
CN111047532B (en) Low-illumination video enhancement method based on 3D convolutional neural network
Yu et al. Quality enhancement network via multi-reconstruction recursive residual learning for video coding
CN111031315B (en) Compressed video quality enhancement method based on attention mechanism and time dependence
CN113055674B (en) Compressed video quality enhancement method based on two-stage multi-frame cooperation
CN112218094A (en) JPEG image decompression effect removing method based on DCT coefficient prediction
WO2022211657A9 (en) Configurable positions for auxiliary information input into a picture data processing neural network
CN112188217B (en) JPEG compressed image decompression effect removing method combining DCT domain and pixel domain learning
CN113810715B (en) Video compression reference image generation method based on cavity convolutional neural network
CN115187455A (en) Lightweight super-resolution reconstruction model and system for compressed image
US20230110503A1 (en) Method, an apparatus and a computer program product for video encoding and video decoding
CN112601095B (en) Method and system for creating fractional interpolation model of video brightness and chrominance
Ho et al. SR-CL-DMC: P-frame coding with super-resolution, color learning, and deep motion compensation
CN113822801B (en) Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network
CN115243044A (en) Reference frame selection method and device, equipment and storage medium
Amaranageswarao et al. Blind compression artifact reduction using dense parallel convolutional neural network
WO2022211658A1 (en) Independent positioning of auxiliary information in neural network based picture processing
Jia et al. Deep convolutional network based image quality enhancement for low bit rate image compression
Mishra et al. Edge-aware image compression using deep learning-based super-resolution network
CN114862687B (en) Self-adaptive compressed image restoration method driven by depth deblocking operator
CN112243132A (en) Compressed video post-processing method combining non-local prior and attention mechanism
CN114071166B (en) HEVC compressed video quality improvement method combined with QP detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 350000 Unit 01, 16th Floor, TB # Office Building, Phase III, CR MIXC, Hongshanyuan Road, Hongshan Town, Gulou District, Fuzhou City, Fujian Province

Patentee after: Fujian Deshi Technology Group Co.,Ltd.

Address before: 350000 area B, 5th floor, building 2, Yunzuo, 528 Xihong Road, Gulou District, Fuzhou City, Fujian Province

Patentee before: FUJIAN IMPERIAL VISION INFORMATION TECHNOLOGY CO.,LTD.