CN115941978A - Loop filtering method of deep neural network based on multi-information fusion - Google Patents

Loop filtering method of deep neural network based on multi-information fusion Download PDF

Info

Publication number
CN115941978A
Authority
CN
China
Prior art keywords
information
module
component
image
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211615020.0A
Other languages
Chinese (zh)
Inventor
向劲松
王鹏
陈贤龙
曹维俊
黄胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211615020.0A priority Critical patent/CN115941978A/en
Publication of CN115941978A publication Critical patent/CN115941978A/en
Pending legal-status Critical Current

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention belongs to the technical field of video coding and decoding, and particularly relates to a loop filtering method based on a neural network, MIIN (Multi-Information Integration Network), that fuses multiple kinds of information generated during the encoding process. According to the invention, various intermediate information generated during encoding is fed into the constructed neural network, which replaces the DBF module and the SAO module in the loop filter of the video codec standard VVC, thereby improving the performance of the loop filtering scheme. In this scheme, different input information is used for the luminance component and the chrominance components: the main input is the reconstruction information of each component, the auxiliary inputs for the luminance component are prediction information and partition information, and the auxiliary inputs for the chrominance components are residual information and partition information.

Description

Loop filtering method of deep neural network based on multi-information fusion
Technical Field
The invention relates to the field of video coding technology, in particular to a method for designing and constructing a deep neural network loop filter based on multi-information fusion.
Background
With the continuous improvement of hardware performance, the iterative development of network technology, and the growing demand for video content, video is evolving toward ultra-high definition, wide color gamut, panoramic formats and the like; the resulting demands for massive video storage and stable transmission pose new challenges for video coding technology. Compared with HEVC, VVC, the latest generation of video coding standard, reduces the bit rate by about 50% at the same perceptual quality. In particular, the loop filtering technique in the video coding standard improves the quality of the currently encoded image and provides a higher-quality reference image for the encoding of subsequent frames.
Luma mapping with chroma scaling (LMCS) is a new tool in the VVC video codec standard. It achieves better reconstruction in both SDR and HDR video by adaptively modifying the distribution of coded samples to improve coding efficiency, and comprises two functional components: luma mapping and chroma scaling. Luma mapping is an in-loop mapping method based on an adaptive piecewise linear model; its basic idea is to improve compression efficiency by adjusting the dynamic range of the input signal at a given bit depth. Chroma scaling is a chroma residual scaling based on luma information, which compensates for the interaction with luma mapping by adjusting the chroma residual values within a chroma coding block.
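As a minimal sketch of the piecewise-linear mapping idea behind luma mapping, the snippet below maps luma samples through an illustrative set of pivot points; the pivot values, the 10-bit range, and the function name are assumptions for illustration only, not the normative LMCS derivation from signalled syntax.

```python
import numpy as np

def forward_luma_map(samples, pivots_in, pivots_out):
    # Map each luma sample piecewise-linearly: input segment
    # [pivots_in[i], pivots_in[i+1]) goes onto [pivots_out[i], pivots_out[i+1]).
    samples = np.asarray(samples, dtype=np.float64)
    pin = np.asarray(pivots_in, dtype=np.float64)
    pout = np.asarray(pivots_out, dtype=np.float64)
    idx = np.clip(np.searchsorted(pin, samples, side="right") - 1, 0, len(pin) - 2)
    scale = (pout[idx + 1] - pout[idx]) / (pin[idx + 1] - pin[idx])
    return pout[idx] + scale * (samples - pin[idx])

# Illustrative 10-bit curve that expands dark samples and compresses bright ones.
pivots_in = [0, 256, 512, 768, 1023]
pivots_out = [0, 320, 600, 820, 1023]
print(forward_luma_map([100, 500, 900], pivots_in, pivots_out))
```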
The deblocking filter (DBF) is mainly used to remove blocking artifacts introduced during encoding. Blocking artifacts arise because video encoding is block-based: each block is processed independently, so the texture on either side of a reconstructed block boundary may be smooth while the pixel values across the boundary are discontinuous. When the images on both sides of a coding block boundary are strongly correlated and the texture is smooth, this pixel discontinuity is perceived by the human eye as a "blocking effect". The DBF first determines the type of boundary and then "corrects" the pixel values near the false boundary formed by blocking artifacts, using filter parameters that depend on the boundary strength of the coding block.
The sample adaptive offset (SAO) filter is placed after the DBF because lossy compression of transition signals introduces a certain degree of ringing. To remove these artifacts, SAO compensates samples at special positions in the reconstructed signal, such as peaks, inflection points, and valleys, so as to reduce the difference between the reconstructed and original samples and thereby achieve the filtering goal. In VVC, SAO is divided into two types according to the characteristics of the reconstructed image, edge offset (EO) and band offset (BO), and the filter type is selected by control parameters at the CTU level.
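To make the edge-offset idea concrete, the sketch below classifies one sample against its two neighbours along an EO direction into the valley/corner/peak categories mentioned above; it is a simplified illustration (band offset, offset signalling, and CTU-level control are omitted), not the normative VVC SAO process.

```python
def eo_category(left, cur, right):
    # Classify one sample against its two neighbours along the chosen EO
    # direction; the encoder signals an offset per category, which is added
    # to every sample falling into that category.
    if cur < left and cur < right:
        return 1  # local valley
    if (cur < left and cur == right) or (cur == left and cur < right):
        return 2  # concave corner
    if (cur > left and cur == right) or (cur == left and cur > right):
        return 3  # convex corner
    if cur > left and cur > right:
        return 4  # local peak
    return 0      # monotone or flat: no offset applied

print(eo_category(10, 7, 12))   # -> 1 (valley)
print(eo_category(10, 14, 11))  # -> 4 (peak)
```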
To remove artifacts and distortion generated during encoding, VVC integrates a geometry-transformation-based ALF (GALF) with reduced complexity; the conventional ALF and GALF are collectively referred to as ALF. ALF minimizes the MSE between the reconstructed image and the original image by Wiener filtering and is applied after the sample adaptive offset filter SAO; the specific process is to classify blocks first and then filter them with filters of different coefficients.
The loop filtering module can effectively improve the subjective and objective quality of the video. The smaller the compression distortion of the filtered reconstructed frame, the closer it is to the original frame and the more beneficial it is for encoding subsequent frames, since the filtered image frame serves as a reference in the encoding of later frames. The loop filtering techniques in the existing video codec standard VVC can reduce the compression distortion of reconstructed frames to a certain extent, but the computation is complicated and the quality of the filtered reconstructed frame is still not ideal.
Artificial intelligence uses deep neural networks to extract and analyze data features; it has achieved remarkable results in computer vision and performs well in low-level visual tasks such as image super-resolution and denoising. Current research on combining the filtering module with deep learning in video coding falls into two categories. One is in-loop filtering, in which the original filtering module in VVC encoding is replaced by a neural network to improve coding performance. The other is out-of-loop filtering, in which the decoded video produced by conventional encoding and decoding is post-processed by a neural network to achieve a filtering effect. Although existing neural-network-based loop filters improve the quality of the reconstructed frame to some extent, they make insufficient use of the intermediate information generated during encoding, such as prediction information, residual information, and partition information, so their effect is limited and the quality of the restored reconstructed frame remains unsatisfactory.
In view of the defects and shortcomings of the prior art, the invention provides a multi-information-fusion loop filtering method that fully considers the influence of the LMCS tool and the data characteristics of the different components, and uses different information as network input for different components, so that the information available in the encoding process is exploited more fully. In addition, the method takes the partition information generated during encoding as an input, accounting for the block-based nature of video compression. Based on this analysis, the method can efficiently utilize and fuse multiple kinds of information from the encoding process, achieving the goal of making full use of the intermediate information. With approximately the same gain, the invention uses richer coding information and a more efficient fusion scheme, so its complexity is lower than that of other deep learning methods.
Disclosure of Invention
To solve the above problems, the invention provides a method and system suitable for the loop filter module in the VVC (Versatile Video Coding) standard, and relates to the technical field of image processing. The method fuses multiple kinds of information from the encoding process as the input of a network, builds the network model, and trains the network with a training set. After the network model converges, it is embedded into the loop filtering module of the VVC standard. Specifically, the loop filtering scheme applicable to the video codec standard VVC includes the following steps:
S1, constructing a training data set and a verification data set: the training and verification data sets are generated from the DIV2K image data set. First, the pictures in the DIV2K data set are converted from the RGB color space to the YUV color space; then each picture is compressed with the All Intra coding configuration of VTM14.0, the reference software of the video codec standard VVC, at QP 22, 27, 32, 37, and 42. The reconstruction information and partition information that have passed through the LMCS module but not through the DBF filtering module are stored, as are the residual information and prediction information that have not passed through the LMCS module. Non-overlapping information blocks of size 128 × 128 are obtained according to a preset division pattern, and the YUV channel components of each information block are then separated. To ensure effective feature learning by the network, the PSNR of each reconstruction block is calculated, and information block groups with PSNR greater than 50 or smaller than 15 are removed. Through the above operations, for the luminance component, a training set and a verification set are obtained for each of QP 22, 27, 32, 37, and 42, where each group of data includes an original information block, a reconstructed information block, a partition information block, and a prediction information block of the luminance component; for the chrominance components, a training set and a verification set are likewise obtained for QP 22, 27, 32, 37, and 42, where each group of data includes an original information block, a reconstructed information block, a partition information block, and a residual information block of the chrominance component.
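A minimal sketch of the block-extraction and PSNR-screening step for one luminance frame is given below; the array names, the assumed 8-bit sample range, and the helper itself are illustrative stand-ins for the actual VTM-side data dump described in S1.

```python
import numpy as np

def extract_blocks(orig, rec, pred, part, block=128, psnr_lo=15.0, psnr_hi=50.0):
    # Cut aligned, non-overlapping 128x128 blocks from one luma frame and keep
    # only groups whose reconstruction PSNR lies strictly between 15 and 50.
    h, w = orig.shape
    groups = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            sl = (slice(y, y + block), slice(x, x + block))
            mse = np.mean((rec[sl].astype(np.float64) - orig[sl]) ** 2)
            psnr = 10.0 * np.log10(255.0 ** 2 / max(mse, 1e-12))
            if psnr_lo < psnr < psnr_hi:
                groups.append((orig[sl], rec[sl], part[sl], pred[sl]))
    return groups
```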
S2, building the MIIN (Multi-Information Integration Network) and training the networks corresponding to the luminance component and the chrominance components with the five training sets for different QPs obtained in step S1, generating models for the five QPs of the luminance and chrominance components; the optimal hyper-parameters are then determined according to the performance of each model on the corresponding verification set and the best model of each is selected, finally yielding models for the five QPs of the luminance component and the chrominance components.
S3, converting the models obtained in step S2 into a form usable from C++ with the LibTorch library, and embedding the converted network models into VTM14.0, the reference software of the video codec standard VVC, through the C++ API. When a standard video sequence provided by JVET is encoded, the DBF and SAO filtering modules in the loop filter are first disabled, and the default configuration is used for the rest. For the luminance component, the reconstructed image that has passed through the LMCS module is used as the main input, and the prediction information and partition information of the intermediate process are stored; then the corresponding trained and converged network model from step S2 is selected according to the QP value, the prepared information is fed into it, and the image output by the network model is the luminance component image filtered by the MIIN network. Similarly, the residual information and partition information of the chrominance encoding process are stored, the corresponding trained and converged network model from step S2 is selected according to the QP value, the prepared information is fed into it, and the output image is the chrominance component image filtered by the MIIN network. The resulting image is then used as the input of the ALF filtering module to complete the subsequent encoding operations.
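The Python-side half of this conversion can be done by tracing the trained model with TorchScript and saving it for the LibTorch C++ API embedded in VTM14.0 (loaded there with torch::jit::load). The sketch below uses a small stand-in module and illustrative file names rather than the actual trained MIIN.

```python
import torch
import torch.nn as nn

class TinyMIIN(nn.Module):
    # Stand-in for the trained MIIN of one component and one QP.
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 1, 3, padding=1))
    def forward(self, rec, aux, part):
        x = torch.cat([rec, aux, part], dim=1)   # fuse the three inputs
        return rec + self.body(x)                # predict a residual, add skip

model = TinyMIIN().eval()
example = (torch.rand(1, 1, 128, 128),) * 3      # rec / aux / part blocks
with torch.no_grad():
    traced = torch.jit.trace(model, example)
traced.save("miin_luma_qp32.pt")                 # file name is illustrative
```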
The MIIN network structure details described in S2 are as follows:
(1) The luminance and chrominance components differ only in their input information: the luminance component uses reconstruction information, prediction information, and partition information, while the chrominance components use reconstruction information, residual information, and partition information. The network structure used for the luminance and chrominance components is the same. Specifically, the network structure can be divided into three parts, whose details are explained below.
(2) The first part is the multi-information fusion module. It can be divided into two layers: an information fusion layer and an Inception Block. Denoting the luminance reconstruction information by x_rec^L and the chrominance reconstruction information by x_rec^C, the reconstruction-information branch of the information fusion layer can be expressed as:

D_L1 = f_2(x_rec^L)

D_C1 = f_2(x_rec^C)

where f_2 denotes the convolution layer and ReLU activation layer of the reconstruction-information branch, D_L1 denotes the output of the information fusion layer of the reconstruction-information branch in the luminance (Luma) model, and D_C1 denotes the corresponding output in the chrominance (Chroma) model.
The inputs of the Inception Block are D_L1 and D_C1; since the luminance and chrominance components are processed identically here, both are written uniformly as D_1. Denoting the output by D_IB, the operation of the Inception Block can be expressed as:

D_IB = Concat(w_{i1×i1} * D_1, w_{i2×i2} * D_1, ...)

where * denotes the convolution operation, w_{i×i} denotes a convolution kernel of size i × i (the parallel kernel sizes used in the Inception Block of FIG. 3), and Concat denotes the concatenation operation. Further, denoting the prediction information, residual information, and partition information by x_pred, x_resi, and x_part respectively, the multi-information fusion module can be expressed as:

D_L2 = Concat(D_IB, f_1(x_pred), f_3(x_part))

D_C2 = Concat(D_IB, f_1(x_resi), f_3(x_part))

where f_1 denotes the convolution layer and ReLU activation of the luminance prediction-information branch and the chrominance residual-information branch, f_3 denotes the convolution layer and ReLU activation of the partition-information branch, D_L2 is the output of the multi-information fusion module in the luminance (Luma) model, and D_C2 is the output of the multi-information fusion module in the chrominance (Chroma) model.
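A compact PyTorch sketch of this fusion module (luma variant) is given below; the channel width, the kernel sizes inside the Inception-style block, the trailing 1 × 1 fusion inside it, and the class names are assumptions, since FIG. 3 is not reproduced here.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.b1 = nn.Conv2d(ch, ch, 1)
        self.b3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.b5 = nn.Conv2d(ch, ch, 5, padding=2)
        self.fuse = nn.Conv2d(3 * ch, ch, 1)     # merge the parallel branches
    def forward(self, x):
        return self.fuse(torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1))

class FusionModule(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.f2 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())  # reconstruction branch
        self.f1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())  # prediction/residual branch
        self.f3 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())  # partition branch
        self.ib = InceptionBlock(ch)
    def forward(self, x_rec, x_aux, x_part):
        d1 = self.f2(x_rec)                       # D_1
        d_ib = self.ib(d1)                        # D_IB
        return torch.cat([d_ib, self.f1(x_aux), self.f3(x_part)], dim=1)  # D_2
```

For the chrominance model, x_aux would carry the residual information instead of the prediction information; the structure itself is shared by both components.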
(3) The second part is the residual feature aggregation module. From this point on, all components are processed identically, so the luminance and chrominance components are no longer distinguished, and the input of the module is the output D_2 of the first part. The body of the module is a residual feature aggregation block (RFAB) formed by cascading several RFA modules; denoting this block by RFAB, the residual feature aggregation module can be expressed as:

D_3 = w_{1×1} * RFAB(D_2)

where w_{1×1} denotes a convolution kernel of size 1 × 1 and D_3 is the output of the residual feature aggregation module. Inside the RFAB, the outputs of the cascaded RFA modules are combined by Concat to obtain the output of the RFAB. The RFAB is formed by sequentially connecting 4 RFA modules: the output of each of the first three RFAs has two branches, one serving as the input of the next RFA and the other being forwarded directly to the tail of the RFAB, where it is concatenated with the output of the last RFA and the features are then fused by a 1 × 1 convolution.
The RFA module itself is built from 3 residual blocks combined with an attention mechanism. Denote the input and output of the d-th RFA module by RFA_{d-1} and RFA_d, and the inputs of the 3 residual blocks in the RFA module by R_1, R_2, and R_3. This process can be expressed as:

R_1 = RFA_{d-1}

R_2 = f_CA(f_RB(R_1))

R_3 = f_CA(f_RB(R_2))

where f_RB and f_CA denote the residual module and the attention module, respectively. Further, the operation of the RFA module can be expressed as:

RFA_d = RFA_{d-1} + w_{1×1} * Concat(R_2, R_3, f_CA(f_RB(R_3)))
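The following PyTorch sketch shows one possible realization of the RFA block and the RFAB group consistent with the description above; since FIGs. 4-6 are not reproduced here, the placement of the channel-attention module, the channel widths, the local skip connections, and the class names are assumptions.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class ChannelAttention(nn.Module):
    def __init__(self, ch, r=4):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch // r, 1), nn.ReLU(),
                                  nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.gate(x)

class RFA(nn.Module):
    # Three residual blocks with attention; intermediate features are forwarded
    # to the tail, concatenated, and fused by a 1x1 convolution.
    def __init__(self, ch):
        super().__init__()
        self.rbs = nn.ModuleList([ResidualBlock(ch) for _ in range(3)])
        self.ca = ChannelAttention(ch)
        self.fuse = nn.Conv2d(3 * ch, ch, 1)
    def forward(self, x):
        feats, cur = [], x
        for rb in self.rbs:                     # R_1 -> R_2 -> R_3
            cur = self.ca(rb(cur))
            feats.append(cur)
        return x + self.fuse(torch.cat(feats, dim=1))

class RFAB(nn.Module):
    # Four cascaded RFAs whose outputs are forwarded to the tail and fused.
    def __init__(self, ch, n=4):
        super().__init__()
        self.rfas = nn.ModuleList([RFA(ch) for _ in range(n)])
        self.fuse = nn.Conv2d(n * ch, ch, 1)
    def forward(self, x):
        outs, cur = [], x
        for rfa in self.rfas:
            cur = rfa(cur)
            outs.append(cur)
        return self.fuse(torch.cat(outs, dim=1))
```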
(4) The third part is the output reconstruction module. This module uses a 3 × 3 convolution kernel to adjust the dimensionality; its output is the filtered residual information, which is added to the original reconstruction information to obtain the output of the whole network. Denoting the input of the output reconstruction module by D_4 and the final output of the network by Y:

Y = w_{3×3} * D_4 + x_rec
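Expressed in the same PyTorch style, this stage is simply a 3 × 3 convolution producing the filtered residual plus a global skip connection to the reconstruction; the channel count of D_4 used below is an assumption.

```python
import torch
import torch.nn as nn

out_conv = nn.Conv2d(32, 1, kernel_size=3, padding=1)  # maps D_4 back to one channel
d4 = torch.rand(1, 32, 128, 128)     # features from the aggregation module
x_rec = torch.rand(1, 1, 128, 128)   # original reconstructed block
y = out_conv(d4) + x_rec             # Y = w_{3x3} * D_4 + x_rec
```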
The invention has the following beneficial effects. The filtering network MIIN replaces the DBF and SAO modules of the loop filter in the video codec standard VVC and, by exploiting information from the encoding and decoding process, effectively removes distortion in the reconstructed frame caused by blocking effects, ringing effects, color shift, and the like, improving the subjective and objective quality of the final output image. Compared with existing neural-network-based loop filtering schemes, the multi-information fusion scheme of the invention first considers the influence of the LMCS filtering tool and the differences between components, and uses different information as network input for different components, i.e., the network for the luminance component uses prediction information as auxiliary input while the chrominance components use residual information, so that the information is used more reasonably; second, the partition information is used as another input in view of the block-based encoding characteristics. Compared with other deep-neural-network-based methods, the network complexity is lower and the overall filtering time is shorter while the objective quality of the filtered video is close.
Drawings
In order to make the object, technical scheme and beneficial effect of the invention more clear, the invention provides the following drawings for explanation:
FIG. 1 is a block flow diagram of a loop filtering module according to the present invention;
FIG. 2 is a schematic diagram of a MIIN network architecture;
FIG. 3 is a schematic view of the Inception Block structure;
FIG. 4 is a schematic view of the RFA structure;
FIG. 5 is a schematic view of an RB module;
fig. 6 is a schematic diagram of a CA module.
Detailed description of the preferred embodiments
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The loop filtering part of VVC comprises, in order: luma mapping with chroma scaling (LMCS), the deblocking filter (DBF), the sample adaptive offset filter (SAO), and the adaptive loop filter (ALF). The invention is based on the loop filtering part of the video codec standard VVC and uses the neural network MIIN to replace the original DBF and SAO filtering modules, as shown in FIG. 1. The technical scheme of the invention mainly comprises the following steps:
A. Constructing a training data set and a verification data set: the training and verification data sets are generated from the DIV2K image data set. The pictures in the DIV2K data set are first converted from the RGB color space to the YUV color space, then each picture is coded and compressed at QP 22, 27, 32, 37, and 42 with the All Intra coding configuration of VTM14.0, the reference software of the video codec standard VVC. The reconstruction information and partition information that have passed through the LMCS module but not through the DBF filtering module are stored, as are the residual information and prediction information that have not passed through the LMCS module. Non-overlapping information blocks of size 128 × 128 are obtained according to a preset division pattern, and the YUV channel components of each block are then separated. To ensure effective feature learning by the network, the PSNR of each reconstruction block is calculated, and information block groups with PSNR greater than 50 or smaller than 15 are removed. Through the above operations, for the luminance component, a training set and a verification set are obtained for each of QP 22, 27, 32, 37, and 42, where each group of data includes a reconstruction information block, a partition information block, and a prediction information block of the luminance component; for the chrominance components, a training set and a verification set are likewise obtained for QP 22, 27, 32, 37, and 42, where each group of data includes a reconstruction information block, a partition information block, and a residual information block of the chrominance component.
B. Building the MIIN network and training the networks corresponding to the luminance component and the chrominance components with the five training sets for different QPs obtained in step A, generating the corresponding models for the five QPs of the luminance and chrominance components; the optimal hyper-parameters are then determined from the performance of each model on the corresponding verification set and the best model of each is selected, finally yielding models for the five QPs of the luminance component and the chrominance components.
As shown in FIG. 2, the MIIN network structure can be divided into three parts. The first part is the input information fusion module: each of the three input signals passes through a convolution layer, the reconstruction information additionally passes through an Inception Block, and the three branches are joined by a Concat layer. The second part is the residual feature aggregation module, which consists of a convolution layer, an RFAB module, and a convolution layer with kernel size 1 × 1, where the RFAB module is formed by sequentially connecting 4 RFA blocks. The third part is the output reconstruction module, which reconstructs the residual of the image with a 3 × 3 convolution kernel and adds it to the original reconstruction information to obtain the final filtered image.
C. Converting the models obtained in step B into a form usable from C++ with the LibTorch library, and embedding the converted network models into VTM14.0, the reference software of the video codec standard VVC, through the C++ API of the LibTorch library. When a standard test video sequence provided by JVET is encoded, the DBF and SAO filtering modules in the loop filter are first disabled and the default configuration is used for the rest. For the luminance component, the reconstructed image that has passed through the LMCS module is used as the main input, and the prediction information and partition information of the intermediate process are stored; the corresponding trained and converged network model from step B is then selected according to the QP value, the prepared information is fed into it, and the output image is the luminance component image filtered by the MIIN network. Similarly, the residual information and partition information of the chrominance encoding process are stored, the corresponding trained and converged network model from step B is selected according to the QP value, the prepared information is fed into it, and the output image is the chrominance component image filtered by the MIIN network. The resulting image is then used as the input of the ALF filtering module to complete the subsequent processing.
The training process of step B is as follows. For all network models, a two-stage training scheme combining the L1 and L2 loss functions is used: the first training stage uses the L1 loss and the second stage uses the L2 loss, defined respectively as:

L1 = (1 / |U|) * Σ_{u∈U} |x(u) - y(u)|

L2 = (1 / |U|) * Σ_{u∈U} (x(u) - y(u))^2

where u is the pixel index, U is the set of pixels, x(u) is the processed pixel value, and y(u) is the true pixel value. In addition, considering the influence of the quantization parameter on image quality, the learning rate is set according to QP: the network models for the smaller QPs (22, 27, and 32) are updated with the slower learning rate 1e-5, while those for the larger QPs (37 and 42) are updated with the faster learning rate 1e-4.
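A minimal training-loop sketch matching this schedule is shown below; the optimizer choice (Adam), the epoch counts, and the data loader layout are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

def train_one_qp(model, loader, qp, epochs_per_stage=(50, 50)):
    # Stage 1 uses the L1 loss, stage 2 the L2 (MSE) loss; the learning rate
    # follows the QP rule described above.
    lr = 1e-5 if qp in (22, 27, 32) else 1e-4
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    stages = [(nn.L1Loss(), epochs_per_stage[0]), (nn.MSELoss(), epochs_per_stage[1])]
    for criterion, epochs in stages:
        for _ in range(epochs):
            for rec, aux, part, orig in loader:
                opt.zero_grad()
                loss = criterion(model(rec, aux, part), orig)
                loss.backward()
                opt.step()
    return model
```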
The input of the MIIN network differs by component: for the luminance component, the main input is the luminance component of the reconstructed image, with luminance partition information and prediction information as auxiliary inputs; for the chrominance components, the main input is the chrominance component of the reconstructed image, with chrominance partition information and residual information as auxiliary inputs. The output of the MIIN network is the difference image ΔF between the original frame F and the reconstructed frame F' that the neural network learns for each component; this residual is added to the reconstructed frame image to obtain the final filtered image F_L', i.e.:

F_L' = F' + ΔF = F' + MIIN(F')
The invention provides a loop filtering method based on a neural network, which uses multi-information fusion to obtain better filtering results for both the luminance and chrominance components and to improve the quality of the whole video. Addressing the problems of the video codec standard VVC and of existing neural-network-based loop filtering methods, the method takes the data characteristics of each component into account and uses multi-information fusion together with an effective network structure, so that distortion in the reconstructed frame can be effectively removed, the subjective and objective quality of the video is improved, and the system is stable and robust.
To better demonstrate the technical feasibility of the solution of the invention, the following is described using a simulation example:
In the experiments, VTM14.0 is used as the platform; the tested sequences are the A1, A2, B, C, D, E, and F test sets of the JVET CTC standard test sequences; the coding mode is All Intra; and the compared metric is the BD-rate of each video sequence. Table 1 shows the experimental results.
The anchor of the experiment is configured according to the CTC, with all loop filtering modules of VTM14.0 enabled by default; the test configuration also follows the CTC, except that the DBF and SAO filtering modules are disabled and the MIIN neural network is inserted between LMCS and ALF. A negative BD-rate indicates that, at the same reconstruction quality, the bit rate is reduced and coding efficiency is improved; a positive BD-rate indicates that the bit rate increases and coding efficiency decreases.
TABLE 1 Experimental results (BD-rate)

             Y         U         V
Class A1   -4.57%    -7.01%    -9.26%
Class A2   -3.61%    -8.54%    -9.19%
Class B    -3.43%    -8.90%    -9.41%
Class C    -3.05%   -10.44%   -10.53%
Class E    -4.66%    -8.35%   -10.02%
Overall    -3.77%    -8.78%    -9.70%
Class D    -3.09%    -9.26%   -11.13%
Class F    -2.21%    -7.37%    -6.02%
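For reference, the BD-rate metric reported in Table 1 can be computed with the standard Bjøntegaard procedure, sketched below under the usual assumptions (cubic fit of log-rate over PSNR and integration over the overlapping quality range); this is a generic utility, not code from the invention.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Bjontegaard delta rate: average bitrate difference (%) at equal quality,
    # computed from (rate, PSNR) points of the anchor and the test codec.
    lr_a = np.log(np.asarray(rate_anchor, dtype=float))
    lr_t = np.log(np.asarray(rate_test, dtype=float))
    p_a = np.polyfit(psnr_anchor, lr_a, 3)     # log-rate as cubic in PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test)) # overlapping PSNR interval
    hi = min(max(psnr_anchor), max(psnr_test))
    P_a, P_t = np.polyint(p_a), np.polyint(p_t)
    avg_diff = ((np.polyval(P_t, hi) - np.polyval(P_t, lo))
                - (np.polyval(P_a, hi) - np.polyval(P_a, lo))) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0    # negative = bitrate saving
```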

Claims (1)

1. A loop filtering method of a deep neural network based on multi-information fusion, the method being implemented on the basis of the loop filtering part of the video codec standard VVC, wherein the proposed multi-information integration network MIIN (Multi-Information Integration Network) replaces the deblocking filter DBF and the sample adaptive offset filter SAO of the VVC standard; the filtering method comprises the following steps:
step 1, constructing a training data set and a verification data set: the 1000 pictures of the DIV2K data set are converted to YUV, each picture is compressed with VTM14.0 at QP 22, 27, 32, 37, and 42, and the data required for the training set are stored; for training, all generated frames are cut into 128 × 128 information blocks, yielding the samples of the training set and the verification set;
step 2, the designed MIIN comprises three parts: the first part is an input information fusion module, in which each of the three input signals passes through a convolution layer with kernel size 3 × 3, the reconstruction information additionally passes through an Inception Block, and the three branches are joined by Concat; the second part is a residual feature aggregation module consisting of a convolution layer, a residual feature aggregation block RFAB formed by sequentially connecting 4 residual feature aggregation modules RFA, and a further convolution layer; the third part is an output reconstruction module, which reconstructs the residual of the image with a 3 × 3 convolution kernel and adds it to the original reconstruction information to obtain the final filtered image;
step 3, training the networks corresponding to the luminance component and the chrominance components with the 5 training sets for different QPs obtained in step 1, generating the models for the 5 QPs of the luminance and chrominance components; the hyper-parameters are then determined according to the performance of each model on the corresponding verification set, namely two training stages are used, with the L1 loss function in stage 1 and the L2 loss function in stage 2; the learning rate is 1e-5 for QP 22, 27, and 32, and 1e-4 for QP 37 and 42; the optimal models are selected under these parameters, finally obtaining models for the 5 QPs of the luminance component and the chrominance components;
step 4, converting the network models obtained in step 3 with the LibTorch library and embedding them into VTM14.0; when a standard test video sequence provided by JVET is encoded, the DBF and SAO modules in the loop filter are first disabled and the default configuration is used for the rest; for the luminance component, the reconstructed luminance information after LMCS is used as the main input, the prediction information and partition information of the intermediate process are stored, the corresponding trained and converged network model from step 3 is selected according to the QP value, the prepared information is fed into it, and the output image is the filtered luminance component image produced by the MIIN; the processing of the chrominance components is similar to that of the luminance component, except that the prediction information of the intermediate process is replaced by residual information; the resulting image is then used as the input of the ALF filtering module to complete the subsequent filtering processing.
CN202211615020.0A 2022-12-15 2022-12-15 Loop filtering method of deep neural network based on multi-information fusion Pending CN115941978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211615020.0A CN115941978A (en) 2022-12-15 2022-12-15 Loop filtering method of deep neural network based on multi-information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211615020.0A CN115941978A (en) 2022-12-15 2022-12-15 Loop filtering method of deep neural network based on multi-information fusion

Publications (1)

Publication Number Publication Date
CN115941978A true CN115941978A (en) 2023-04-07

Family

ID=86655657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211615020.0A Pending CN115941978A (en) 2022-12-15 2022-12-15 Loop filtering method of deep neural network based on multi-information fusion

Country Status (1)

Country Link
CN (1) CN115941978A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination