CN113822801A - Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network - Google Patents

Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network

Info

Publication number
CN113822801A
CN113822801A (application CN202110718467.XA)
Authority
CN
China
Prior art keywords
image
branch
matrix
network
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110718467.XA
Other languages
Chinese (zh)
Other versions
CN113822801B (en)
Inventor
Chen Weigang (陈卫刚)
Zhou Di (周迪)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Uniview Technologies Co Ltd
Zhejiang Gongshang University
Original Assignee
Zhejiang Uniview Technologies Co Ltd
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Uniview Technologies Co Ltd, Zhejiang Gongshang University filed Critical Zhejiang Uniview Technologies Co Ltd
Priority to CN202110718467.XA priority Critical patent/CN113822801B/en
Publication of CN113822801A publication Critical patent/CN113822801A/en
Application granted granted Critical
Publication of CN113822801B publication Critical patent/CN113822801B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T3/4053: Scaling of whole images or parts thereof based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06T3/4046: Scaling of whole images or parts thereof using neural networks
    • H04N19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals, characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/593: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network. For each frame to be processed, the method searches, block by block, for approximate blocks in nearby intra-coded frames, assembles these approximate blocks into a predicted image for the current frame, feeds the predicted image and the frame to be processed into separate branch networks, and fuses the outputs of the branches into the final high-resolution reconstruction. By effectively exploiting the inter-frame redundancy of the video sequence, and in particular the characteristic that intra-coded frames in compressed video have better visual quality, the method yields reconstructed super-resolution images of better quality.

Description

Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network
Technical Field
The invention relates to the field of computer vision, in particular to a compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network.
Background
With the increasing popularity of high-resolution display devices and the continued emergence of new video applications, market demand for ultra-high-definition video such as 4K and 8K keeps growing. At the same time, the growth of network bandwidth, a shared resource, has not kept pace with the demand for transmitting high-quality video. Against this background, super-resolution reconstruction of video images can operate as an image enhancement technique at the decoding end, offering a feasible way to ease this conflict.
Chinese patent CN101345870B discloses a method in which the encoder uses pre-decoding closed-loop feedback with super-resolution reconstruction to construct a small auxiliary code stream for super-resolution reconstruction, and a human-eye region-of-interest analysis module in the encoder further guides and corrects the super-resolution reconstruction at the decoding end, improving the resolution and subjective quality of the decoded video output. Chinese patent CN103475876B discloses a learning-based super-resolution reconstruction method for low-bit-rate compressed images: the offline part classifies low-resolution images by their degree of distortion to build a sample library and trains a super-resolution model for each class of samples; the online part determines the distortion class of the input image and selects the corresponding model for super-resolution reconstruction. Chinese patent CN101605260B discloses a compressed video super-resolution reconstruction method based on maximum a posteriori (MAP) estimation, which defines the MAP reconstruction cost function as a reconstruction error term, a regularization term containing the distribution parameters of the DCT coefficients before quantization, and a general constraint term, and improves the quality of compressed video super-resolution reconstruction by introducing a DCT coefficient distribution model.
Unlike super-resolution reconstruction of single images or uncompressed video, a compressed-video super-resolution reconstruction system takes images with compression loss as input. The quantization stage of a lossy video coding system introduces quantization errors, which in the frequency domain appear mainly as loss of high-frequency components, so compressed images exhibit detail loss, edge blurring and similar artifacts. Reconstructing high-resolution images from such defective low-resolution inputs poses additional challenges for the super-resolution reconstruction system.
Disclosure of Invention
The invention aims to provide a compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network that fully exploits the inter-frame redundancy of the video sequence, and in particular the characteristic that intra-coded frames in compressed video have better visual quality.
The technical scheme adopted by the invention is as follows: a compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network, comprising the following specific steps:
(1) The multi-branch convolutional neural network for compressed video super-resolution reconstruction comprises three branches; the second branch network Sub-B and the third branch network Sub-C take the current decoded frame I of the compressed video as input; a reference image is selected, according to the number of frames separating them from frame I, from the two intra-coded frames located before and after frame I; in block-processing form, the block with the greatest similarity in the reference image is searched for each image block of the current decoded frame I, and these similar blocks are assembled into a reconstructed image that serves as the input of the first branch network Sub-A;
(2) The first branch network and the second branch network have the same structure: in the order of forward data flow, the input first passes through a convolutional layer containing 32 3×3 convolution kernels with stride 1, followed by N sequentially connected residual blocks; the output feature map of the last residual block of the first branch network and that of the second branch network are concatenated along the channel dimension into a feature map with 2N_C channels, where N_C is the number of channels of the output feature map of each of the first and second branch networks;
(3) The feature map formed by the channel concatenation in step (2) passes through a convolutional layer containing r² convolution kernels of size 3×3 with stride 1, and the output generated by the convolution operation is rearranged by periodic screening to obtain the up-sampled image H1, where r is the upsampling factor;
(4) The input of the third branch network passes through a convolutional layer containing r² convolution kernels of size 3×3 with stride 1, and the output of the convolutional layer is rearranged by periodic screening to obtain the up-sampled image H2, where r is the upsampling factor;
(5) The up-sampled images H1 and H2 are summed pixel by pixel, and the generated output is taken as the result image, i.e., the super-resolution reconstruction of the compressed video frame.
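For illustration, the following PyTorch sketch shows one way to realize the three-branch structure of steps (1) to (5). It is a minimal reconstruction under stated assumptions, not the patented implementation: the class names, the choices N = 12, N_C = 32 and r = 2, and the single-channel (luminance) input are assumptions made for the example; the residual block follows the structure described further below.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # conv(128 kernels, 3x3, stride 1) -> ReLU -> conv(32 kernels, 3x3, stride 1), plus skip
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 128, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, channels, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x):
        return self.body(x) + x  # output is F(x) + x

class Branch(nn.Module):
    # Shared structure of Sub-A and Sub-B: a 32-kernel 3x3 conv followed by N residual blocks.
    def __init__(self, in_ch=1, n_blocks=12, channels=32):
        super().__init__()
        self.head = nn.Conv2d(in_ch, channels, kernel_size=3, stride=1, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])

    def forward(self, x):
        return self.blocks(self.head(x))

class MultiBranchSR(nn.Module):
    def __init__(self, in_ch=1, n_blocks=12, channels=32, r=2):
        super().__init__()
        self.sub_a = Branch(in_ch, n_blocks, channels)  # input: predicted image Ip
        self.sub_b = Branch(in_ch, n_blocks, channels)  # input: current decoded frame I
        # Fusion: conv with r^2 kernels over the 2*N_C concatenated channels, then periodic screening.
        self.fuse = nn.Conv2d(2 * channels, r * r * in_ch, kernel_size=3, stride=1, padding=1)
        # Sub-C: a single conv with r^2 kernels applied directly to the decoded frame.
        self.sub_c = nn.Conv2d(in_ch, r * r * in_ch, kernel_size=3, stride=1, padding=1)
        self.screen = nn.PixelShuffle(r)  # periodic screening (see the description below)

    def forward(self, decoded, predicted):
        feat = torch.cat([self.sub_a(predicted), self.sub_b(decoded)], dim=1)  # 2*N_C channels
        h1 = self.screen(self.fuse(feat))      # up-sampled image H1
        h2 = self.screen(self.sub_c(decoded))  # up-sampled image H2
        return h1 + h2                         # pixel-wise sum: final reconstruction
```

Here nn.PixelShuffle performs exactly the channel-to-space rearrangement that the periodic screening step describes.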
Further, searching, in block-processing form, for the block with the greatest similarity in the reference image for each image block of the current decoded frame, and forming from these similar blocks a reconstructed image that serves as the input of the first branch network Sub-A, comprises:
2.1 Let the height and width of the current decoded image be H and W, respectively; initialize the reconstructed image Ip with size H × W and all pixel values 0, and initialize the weight matrix C with size H × W and all element values 0;
2.2 With s1 and s2 as the scanning step sizes, scan the reference image and the current decoded image, respectively, from left to right and from top to bottom at equal intervals; at each scanning position (u, v), extract the image block of size √d × √d whose upper-left corner is at that position; subtract the gray-level mean from each image block and convert it into a row vector of d elements in row-major order; add each row vector from the reference image to the matrix T as a row of T, and add each row vector from the current decoded image to the matrix Q as a row of Q;
2.3 For each row vector q in the matrix Q, use the Euclidean distance as the similarity measure and search for the most similar row vector in T with the k-nearest-neighbor algorithm, denoted t; if the Euclidean distance between t and q is smaller than a preset threshold e, take √d elements of t at a time, in order, as one row of a matrix, so that √d such rows form a √d × √d matrix, which is taken as the target block; otherwise take √d elements of q at a time, in order, as one row, so that √d rows form a √d × √d matrix taken as the target block; add the gray-level mean corresponding to q to each pixel value of the target block;
2.4 Let (u, v) be the scanning position of the image block corresponding to the row vector q in the matrix Q, and let b be the target block obtained in step 2.3; in the reconstructed image Ip, add to each pixel of the √d × √d region whose upper-left corner is (u, v) the value of the corresponding element of the target block, and in the weight matrix C add 1 to each element of the √d × √d region whose upper-left corner is (u, v);
2.5 Repeat steps 2.3 and 2.4 for all row vectors in the matrix Q to obtain the reconstructed image Ip;
2.6 Divide the value of each pixel in the reconstructed image Ip by the value of the corresponding element in the weight matrix C to obtain the final reconstructed image.
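As a concrete illustration of steps 2.1 to 2.6, the NumPy sketch below assembles the predicted image, with a brute-force nearest-neighbor search standing in for a tuned k-nearest-neighbor implementation; the default values d = 36, s1 = 1, s2 = 3 and the threshold e, as well as the single-channel gray-level input, are assumptions made for the example.

```python
import numpy as np

def build_predicted_image(ref, cur, d=36, s1=1, s2=3, e=50.0):
    # Steps 2.1-2.6: assemble a predicted image for `cur` from blocks of `ref`.
    # `ref` and `cur` are 2-D float arrays of the same size; block side is sqrt(d).
    b = int(np.sqrt(d))                      # block side length
    H, W = cur.shape
    Ip = np.zeros((H, W), dtype=np.float64)  # reconstructed image, step 2.1
    C = np.zeros((H, W), dtype=np.float64)   # weight matrix, step 2.1

    def scan(img, step):
        pos, rows, means = [], [], []
        for u in range(0, img.shape[0] - b + 1, step):
            for v in range(0, img.shape[1] - b + 1, step):
                blk = img[u:u + b, v:v + b].astype(np.float64)
                m = blk.mean()
                pos.append((u, v)); means.append(m)
                rows.append((blk - m).reshape(-1))  # row-major vector of d elements
        return pos, np.array(rows), np.array(means)

    _, T, _ = scan(ref, s1)                  # rows from the reference image, step 2.2
    posQ, Q, meansQ = scan(cur, s2)          # rows from the current decoded image

    for q, (u, v), m in zip(Q, posQ, meansQ):
        dist = np.linalg.norm(T - q, axis=1)   # Euclidean distances (1-NN), step 2.3
        t = T[np.argmin(dist)]
        src = t if dist.min() < e else q       # fall back to q itself above threshold
        target = src.reshape(b, b) + m         # add back the gray-level mean
        Ip[u:u + b, v:v + b] += target         # step 2.4: accumulate block ...
        C[u:u + b, v:v + b] += 1.0             # ... and block weights
    # Step 2.6: normalize; the guard avoids dividing margins never covered by a block.
    return Ip / np.maximum(C, 1.0)
```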
Further, the N sequentially connected residual blocks in the first branch network and the second branch network each have the same structure, comprising two convolutional layers and a ReLU layer; in the order of forward data flow these are a convolutional layer containing 128 3×3 convolution kernels with stride 1, a ReLU layer, and a convolutional layer containing 32 3×3 convolution kernels with stride 1; let x be the input of any residual block, the two convolutional layers and the ReLU layer map this input to F(x), and finally F(x) + x is taken as the output of the residual block.
Further, obtaining the up-sampled image from the output of the convolutional layer by periodic screening comprises: let the output of the convolutional layer be a feature map of size H × W × r²; for each coordinate position (x, y), take the r² elements across all channels at that position to form a vector, take r elements of the vector at a time as one row of a matrix so that the rows form an r × r matrix, and place this matrix at position (rx, ry) in the up-sampled image; the above process is repeated for all coordinate positions of the feature map to form the up-sampled image.
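The following NumPy sketch makes the periodic screening rearrangement explicit; the function name is illustrative, and the loop form is chosen for clarity rather than speed. The result coincides with the channel-to-space rearrangement computed by PyTorch's nn.PixelShuffle.

```python
import numpy as np

def periodic_screening(feat, r):
    # feat: H x W x r^2 feature map; returns the (rH) x (rW) up-sampled image.
    H, W, C = feat.shape
    assert C == r * r
    up = np.zeros((r * H, r * W), dtype=feat.dtype)
    for x in range(H):
        for y in range(W):
            # The r^2 channel values at (x, y) form an r x r tile, r elements per row ...
            tile = feat[x, y].reshape(r, r)
            # ... placed with its upper-left corner at (rx, ry) in the up-sampled image.
            up[r * x:r * x + r, r * y:r * y + r] = tile
    return up
```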
Further, the parameters of each layer of the multi-branch convolutional neural network are determined by learning, comprising the following steps:
A. Preparing training samples: let I be the currently decoded frame in the compressed video, Io the corresponding original image that has not been compression-coded, and Ip the constructed reconstructed image; the ith sample in the sample set used for training the multi-branch convolutional neural network has the form (x_i, x_i^p, y_i), where x_i, x_i^p and y_i are image blocks taken from I, Ip and Io, respectively, at the same position and of the same size;
B. Training: the samples in the training sample set are loaded in batches; x_i is input to the second and third branch networks, x_i^p is input to the first branch network, and the optimal network parameters are sought through the following optimization:

θ* = arg min_θ Σ_i ‖F(x_i, x_i^p; θ) − y_i‖₁

where F(x_i, x_i^p; θ) is the output produced by the multi-branch convolutional neural network for the input (x_i, x_i^p), and ‖·‖₁ denotes the 1-norm; during training, the weights of all layers of the network are updated by the Adam optimization algorithm, and the learning rate is adjusted in a piecewise-descending manner: the total number of training epochs is divided into four stages, and the learning rate of each stage is half that of the previous stage.
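A minimal training-loop sketch under the stated scheme (L1 loss, Adam, four learning-rate stages each at half the previous rate): the epoch count, batch handling and the initial rate of 0.001 (the detailed description later suggests a value between 0.001 and 0.005) are illustrative assumptions, and `model` is the MultiBranchSR sketch given earlier.

```python
import torch
import torch.nn as nn

def train(model, loader, total_epochs=80, lr0=1e-3, device="cpu"):
    # loader yields (x, x_p, y): decoded block, predicted block, original block.
    model.to(device)
    criterion = nn.L1Loss()  # the 1-norm reconstruction error
    optimizer = torch.optim.Adam(model.parameters(), lr=lr0)
    # Piecewise-descending schedule: four stages, each at half the previous learning rate.
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=total_epochs // 4, gamma=0.5)
    for _ in range(total_epochs):
        for x, x_p, y in loader:
            x, x_p, y = x.to(device), x_p.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x, x_p), y)  # || F(x_i, x_i^p; theta) - y_i ||_1
            loss.backward()
            optimizer.step()
        scheduler.step()
```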
The beneficial technical effects of the invention are as follows: the compressed video super-resolution reconstruction method using a multi-branch convolutional neural network can effectively exploit the inter-frame redundancy of the video sequence, and in particular the characteristic that intra-coded frames in compressed video have better visual quality, so that the reconstructed super-resolution image is of better quality.
Drawings
FIG. 1 is a schematic diagram of a multi-branch convolutional neural network structure according to the present invention;
fig. 2 is a schematic diagram of a residual block network structure.
Detailed Description
The invention is further described below in conjunction with the drawings and the specific embodiments so that those skilled in the art can better understand the essence of the invention.
As shown in fig. 1, the invention provides a compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network, which comprises the following specific steps:
(1) The multi-branch convolutional neural network for compressed video super-resolution reconstruction comprises three branches; the second branch network Sub-B and the third branch network Sub-C take the current decoded frame I of the compressed video as input; a reference image is selected, according to the number of frames separating them from frame I, from the two intra-coded frames located before and after frame I; in block-processing form, the block with the greatest similarity in the reference image is searched for each image block of the current decoded frame, and these similar blocks are assembled into a reconstructed image that serves as the input of the first branch network Sub-A;
Searching, in block-processing form, for the block with the greatest similarity in the reference image for each image block of the current decoded frame, and forming from these similar blocks a reconstructed image that serves as the input of the first branch network Sub-A, comprises:
Step 1A: let the height and width of the current decoded image be H and W, respectively; initialize the reconstructed image Ip with size H × W and all pixel values 0, and initialize the weight matrix C with size H × W and all element values 0;
Step 1B: with s1 and s2 as the scanning step sizes, scan the reference image and the current decoded image, respectively, from left to right and from top to bottom at equal intervals; at each scanning position (u, v), extract the image block of size √d × √d whose upper-left corner is at that position; subtract the gray-level mean from each image block and convert it into a row vector of d elements in row-major order; add each row vector from the reference image to the matrix T as a row of T, and add each row vector from the current decoded image to the matrix Q as a row of Q; here d may be 36 or 64, s1 may be 1 or 2, and s2 may be √d/2;
Step 1C, regarding the row vector Q in the matrix Q, using Euclidean distance as similarity measurement, using k-nearest neighbor algorithm to search the most similar row vector in T, and recording the most similar row vector as T, if the Euclidean distance between the vector T and the vector Q is equal to TIf the distance is less than a preset threshold e, the values in t are taken
Figure BDA0003135970480000051
The elements being one row of the matrix
Figure BDA0003135970480000052
The rows form one
Figure BDA0003135970480000053
Taking the matrix of the size as a target block, otherwise, sequentially taking the matrix in q
Figure BDA0003135970480000054
The elements being one row of the matrix
Figure BDA0003135970480000055
The rows form one
Figure BDA0003135970480000056
The matrix of the size is used as a target block; adding a gray average value corresponding to q to each pixel value in the target block;
Step 1D: let (u, v) be the scanning position of the image block corresponding to the row vector q in the matrix Q, and let b be the target block obtained in step 1C; in the reconstructed image Ip, add to each pixel of the √d × √d region whose upper-left corner is (u, v) the value of the corresponding element of the target block b, and in the weight matrix C add 1 to each element of the √d × √d region whose upper-left corner is (u, v);
Step 1E: repeat steps 1C and 1D for all row vectors in the matrix Q;
Step 1F: divide the value of each pixel in the reconstructed image Ip by the value of the corresponding element in the weight matrix C to obtain the final reconstructed image.
(2) The first branch network and the second branch network have the same structure: in the order of forward data flow, the input first passes through a convolutional layer containing 32 3×3 convolution kernels with stride 1, followed by N sequentially connected residual blocks, where N may be an integer greater than 10 and smaller than 18; the output feature map of the last residual block of the first branch network and that of the second branch network are concatenated along the channel dimension into a feature map with 2N_C channels, where N_C is the number of channels of the output feature map of each of the first and second branch networks;
Each of the N sequentially connected residual blocks has the same structure, comprising two convolutional layers and a ReLU layer; in the order of forward data flow these are a convolutional layer containing 128 3×3 convolution kernels with stride 1, a ReLU layer, and a convolutional layer containing 32 3×3 convolution kernels with stride 1; let x be the input of any residual block, the two convolutional layers and the ReLU layer map this input to F(x), and finally F(x) + x is taken as the output of the residual block.
(3) The feature map formed by the channel concatenation in the previous step passes through a convolutional layer containing r² convolution kernels of size 3×3 with stride 1, and the output generated by the convolution operation is rearranged by periodic screening to obtain the up-sampled image H1, where r is the upsampling factor;
To obtain the up-sampled image from the output of the convolution operation by periodic screening: let the output of the convolutional layer be a feature map of size H × W × r²; for each coordinate position (x, y), take the r² elements across all channels at that position to form a vector, take r elements of the vector at a time as one row of a matrix so that the rows form an r × r matrix, and place this matrix at position (rx, ry) in the up-sampled image; repeat the above process for all coordinate positions of the feature map to form the up-sampled image;
(4) The input of the third branch passes through a convolutional layer containing r² convolution kernels of size 3×3 with stride 1, and the output of the convolutional layer is rearranged by periodic screening to obtain the up-sampled image H2, where r is the upsampling factor;
(5) The up-sampled images H1 and H2 are summed over corresponding pixels one by one, and the generated output is the result image.
In the technical scheme of the invention, the parameters of each layer of the multi-branch convolutional neural network are determined by learning, comprising the following steps:
5A. Preparing training samples: let I be a frame in the compressed video, Io the corresponding original image that has not been compression-coded, and Ip the reconstructed image constructed as described in steps 1A to 1F; the ith sample in the sample set used to train the multi-branch convolutional neural network model has the form (x_i, x_i^p, y_i), where x_i, x_i^p and y_i are image blocks taken from I, Ip and Io, respectively, at the same position and of the same size;
5B. Training: the samples in the training sample set are loaded in batches; x_i is input to the second and third branch networks, x_i^p is input to the first branch network, and the optimal network parameters are sought through the following optimization:

θ* = arg min_θ Σ_i ‖F(x_i, x_i^p; θ) − y_i‖₁

where F(x_i, x_i^p; θ) is the output produced by the multi-branch convolutional neural network model for the input (x_i, x_i^p), and ‖·‖₁ denotes the 1-norm; during training, the weights of each layer of the network are updated by the Adam optimization algorithm; optionally, the initial value of the learning rate may be set between 0.001 and 0.005, and the learning rate is adjusted in a piecewise-descending manner: the total number of training epochs is divided into four stages, and the learning rate of each stage is half that of the previous stage.
The method provided by the embodiment of the invention was tested on video coded with HEVC. The HEVC reference software HM16.0 was used as the compression tool; the test videos at their original size, and versions reduced to 1/2 of the original size in both width and height, were compression-coded with quantization parameters QP 27, 32, 37 and 42. The interval between intra-coded frames was set to 32, the QP offset of intra-coded frames was set to -7, and the remaining parameters kept the defaults of the encoder_lowdelay_P_main.cfg configuration file. For the compressed video at original size, the bit rate and the peak signal-to-noise ratio (PSNR) with respect to the original uncompressed video were recorded; for the reduced videos, the bit rate after compression coding was recorded, the video was reconstructed to its original size with the model provided by the embodiment of the invention, and the PSNR of the reconstructed video with respect to the uncompressed video was computed. Taking the video compressed at its original size as the anchor, the BD-rate criterion gives the bit-rate saving of the method at equal objective quality, and the BD-PSNR criterion gives the PSNR gain at equal bit rate; the results are listed in Table 1. As can be seen from the table, the proposed method saves about 14% of the bit rate on average at the same objective quality, and provides an average PSNR gain of about 0.77 dB at the same bit rate.
Table 1: Experimental results of the embodiment of the invention (the table is reproduced as an image in the original publication).
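The BD-rate and BD-PSNR figures quoted above follow the standard Bjøntegaard procedure: fit a third-order polynomial to each rate-distortion curve over the logarithm of the bit rate, then average the gap between the curves over the overlapping rate range. The sketch below is a common implementation of BD-PSNR, given here for reference rather than taken from the patent.

```python
import numpy as np

def bd_psnr(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Bjontegaard delta PSNR: average PSNR gain of the test curve over the anchor
    # curve across their overlapping bitrate range (rates in kbps, PSNR in dB).
    la, lt = np.log10(rate_anchor), np.log10(rate_test)
    pa = np.polyint(np.polyfit(la, psnr_anchor, 3))  # integral of the cubic fit
    pt = np.polyint(np.polyfit(lt, psnr_test, 3))
    lo, hi = max(la.min(), lt.min()), min(la.max(), lt.max())
    avg_a = (np.polyval(pa, hi) - np.polyval(pa, lo)) / (hi - lo)
    avg_t = (np.polyval(pt, hi) - np.polyval(pt, lo)) / (hi - lo)
    return avg_t - avg_a
```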
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any modification or replacement within the spirit and principle of the present invention should be covered within the scope of the present invention.

Claims (5)

1. A compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network, characterized by comprising the following specific steps:
(1) The multi-branch convolutional neural network for compressed video super-resolution reconstruction comprises three branches; the second branch network Sub-B and the third branch network Sub-C take the current decoded frame I of the compressed video as input; a reference image is selected, according to the number of frames separating them from frame I, from the two intra-coded frames located before and after frame I; in block-processing form, the block with the greatest similarity in the reference image is searched for each image block of the current decoded frame I, and these similar blocks are assembled into a reconstructed image that serves as the input of the first branch network Sub-A;
(2) The first branch network and the second branch network have the same structure: in the order of forward data flow, the input first passes through a convolutional layer containing 32 3×3 convolution kernels with stride 1, followed by N sequentially connected residual blocks; the output feature map of the last residual block of the first branch network and that of the second branch network are concatenated along the channel dimension into a feature map with 2N_C channels, where N_C is the number of channels of the output feature map of each of the first and second branch networks;
(3) The feature map formed by the channel concatenation in step (2) passes through a convolutional layer containing r² convolution kernels of size 3×3 with stride 1, and the output generated by the convolution operation is rearranged by periodic screening to obtain the up-sampled image H1, where r is the upsampling factor;
(4) The input of the third branch network passes through a convolutional layer containing r² convolution kernels of size 3×3 with stride 1, and the output of the convolutional layer is rearranged by periodic screening to obtain the up-sampled image H2, where r is the upsampling factor;
(5) The up-sampled images H1 and H2 are summed pixel by pixel, and the generated output is taken as the result image, i.e., the super-resolution reconstruction of the compressed video image.
2. The compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network of claim 1, wherein searching, in block-processing form, for the block with the greatest similarity in the reference image for each image block of the current decoded frame, and forming from these similar blocks a reconstructed image that serves as the input of the first branch network Sub-A, specifically comprises:
2.1 Let the height and width of the current decoded image be H and W, respectively; initialize the reconstructed image Ip with size H × W and all pixel values 0, and initialize the weight matrix C with size H × W and all element values 0;
2.2 With s1 and s2 as the scanning step sizes, scan the reference image and the current decoded image, respectively, from left to right and from top to bottom at equal intervals; at each scanning position (u, v), extract the image block of size √d × √d whose upper-left corner is at that position; subtract the gray-level mean from each image block and convert it into a row vector of d elements in row-major order; add each row vector from the reference image to the matrix T as a row of T, and add each row vector from the current decoded image to the matrix Q as a row of Q;
2.3 For each row vector q in the matrix Q, use the Euclidean distance as the similarity measure and search for the most similar row vector in T with the k-nearest-neighbor algorithm, denoted t; if the Euclidean distance between t and q is smaller than a preset threshold e, take √d elements of t at a time, in order, as one row of a matrix, so that √d such rows form a √d × √d matrix, which is taken as the target block; otherwise take √d elements of q at a time, in order, as one row, so that √d rows form a √d × √d matrix taken as the target block; add the gray-level mean corresponding to q to each pixel value of the target block;
2.4 Let (u, v) be the scanning position of the image block corresponding to the row vector q in the matrix Q, and let b be the target block obtained in step 2.3; in the reconstructed image Ip, add to each pixel of the √d × √d region whose upper-left corner is (u, v) the value of the corresponding element of the target block, and in the weight matrix C add 1 to each element of the √d × √d region whose upper-left corner is (u, v);
2.5 Repeat steps 2.3 and 2.4 for all row vectors in the matrix Q to obtain the reconstructed image Ip;
2.6 Divide the value of each pixel in the reconstructed image Ip by the value of the corresponding element in the weight matrix C to obtain the final reconstructed image.
3. The compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network of claim 1, wherein the N sequentially connected residual blocks in the first branch network and the second branch network each have the same structure, comprising two convolutional layers and a ReLU layer; in the order of forward data flow these are a convolutional layer containing 128 3×3 convolution kernels with stride 1, a ReLU layer, and a convolutional layer containing 32 3×3 convolution kernels with stride 1; let x be the input of any residual block, the two convolutional layers and the ReLU layer map this input to F(x), and finally F(x) + x is taken as the output of the residual block.
4. The compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network of claim 1, wherein obtaining the up-sampled image from the output of the convolutional layer by periodic screening comprises: let the output of the convolutional layer be a feature map of size H × W × r²; for each coordinate position (x, y), take the r² elements across all channels at that position to form a vector, take r elements of the vector at a time as one row of a matrix so that the rows form an r × r matrix, and place this matrix at position (rx, ry) in the up-sampled image; the above process is repeated for all coordinate positions of the feature map to form the up-sampled image.
5. The compressed video super-resolution reconstruction method based on a multi-branch convolutional neural network of claim 1, wherein the parameters of each layer of the multi-branch convolutional neural network are determined by learning, comprising the following steps:
A. Preparing training samples: let I be the currently decoded frame in the compressed video, Io the corresponding original image that has not been compression-coded, and Ip the constructed reconstructed image; the ith sample in the sample set used for training the multi-branch convolutional neural network has the form (x_i, x_i^p, y_i), where x_i, x_i^p and y_i are image blocks taken from I, Ip and Io, respectively, at the same position and of the same size;
B. Training: the samples in the training sample set are loaded in batches; x_i is input to the second and third branch networks, x_i^p is input to the first branch network, and the optimal network parameters are sought through the following optimization:

θ* = arg min_θ Σ_i ‖F(x_i, x_i^p; θ) − y_i‖₁

where F(x_i, x_i^p; θ) is the output produced by the multi-branch convolutional neural network for the input (x_i, x_i^p), and ‖·‖₁ denotes the 1-norm; during training, the weights of all layers of the network are updated by the Adam optimization algorithm, and the learning rate is adjusted in a piecewise-descending manner: the total number of training epochs is divided into four stages, and the learning rate of each stage is half that of the previous stage.
CN202110718467.XA 2021-06-28 2021-06-28 Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network Active CN113822801B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718467.XA CN113822801B (en) 2021-06-28 2021-06-28 Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110718467.XA CN113822801B (en) 2021-06-28 2021-06-28 Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network

Publications (2)

Publication Number Publication Date
CN113822801A true CN113822801A (en) 2021-12-21
CN113822801B CN113822801B (en) 2023-08-18

Family

ID=78924108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718467.XA Active CN113822801B (en) 2021-06-28 2021-06-28 Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network

Country Status (1)

Country Link
CN (1) CN113822801B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913072A (en) * 2022-05-16 2022-08-16 中国第一汽车股份有限公司 Image processing method and device, storage medium and processor

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100272184A1 (en) * 2008-01-10 2010-10-28 Ramot At Tel-Aviv University Ltd. System and Method for Real-Time Super-Resolution
CN108012157A (en) * 2017-11-27 2018-05-08 上海交通大学 Construction method for the convolutional neural networks of Video coding fractional pixel interpolation
CN109862370A (en) * 2017-11-30 2019-06-07 北京大学 Video super-resolution processing method and processing device
CN111866521A (en) * 2020-07-09 2020-10-30 浙江工商大学 Video image compression artifact removing method combining motion compensation and generation type countermeasure network

Also Published As

Publication number Publication date
CN113822801B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
Chen et al. Learning for video compression
JP7047119B2 (en) Methods and equipment for residual code prediction in the conversion region
US6438168B2 (en) Bandwidth scaling of a compressed video stream
Brunello et al. Lossless compression of video using temporal information
CN107105278A (en) The coding and decoding video framework that motion vector is automatically generated
CN115956363A (en) Content adaptive online training method and device for post filtering
CN1695381A (en) Sharpness enhancement in post-processing of digital video signals using coding information and local spatial features
CN109903351B (en) Image compression method based on combination of convolutional neural network and traditional coding
JP2010534015A (en) Image processing method and corresponding electronic device
CN113066022B (en) Video bit enhancement method based on efficient space-time information fusion
EP1389875A2 (en) Method for motion estimation adaptive to DCT block content
CN115131675A (en) Remote sensing image compression method and system based on reference image texture migration
KR20090079286A (en) Method and apparatus for estimating motion vector of moving images using fast full search block matching algorithm
JP2024513693A (en) Configurable position of auxiliary information input to picture data processing neural network
CN111669588A (en) Ultra-high definition video compression coding and decoding method with ultra-low time delay
Lin et al. Multiple hypotheses based motion compensation for learned video compression
CN113822801B (en) Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network
CN115665413A (en) Method for estimating optimal quantization parameter of image compression
CN116012272A (en) Compressed video quality enhancement method based on reconstructed flow field
JP2004511978A (en) Motion vector compression
CN108833920A (en) A kind of DVC side information fusion method based on light stream and Block- matching
KR20240024921A (en) Methods and devices for encoding/decoding image or video
US20200128240A1 (en) Video encoding and decoding using an epitome
Li et al. You Can Mask More For Extremely Low-Bitrate Image Compression
CN115358954B (en) Attention-guided feature compression method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant