CN115150628A - Coarse-to-fine deep video coding method with super-prior guided mode prediction - Google Patents

Coarse-to-fine deep video coding method with super-prior guided mode prediction

Info

Publication number
CN115150628A
Authority
CN
China
Prior art keywords
motion
compression
features
super
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210727355.5A
Other languages
Chinese (zh)
Inventor
Lu Sheng (盛律)
Zhihao Hu (胡智昊)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202210727355.5A priority Critical patent/CN115150628A/en
Publication of CN115150628A publication Critical patent/CN115150628A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51: Motion estimation or motion compensation
    • H04N19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103: Selection of coding mode or of prediction mode

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention provides a coarse-to-fine deep video coding method with super-prior guided mode prediction, which comprises the following steps: the features of an input video frame are extracted, then motion estimation, compression and compensation are performed twice in a coarse-to-fine manner to obtain the predicted features, where the motion compression performed at the fine level uses super-prior guided motion compression. After the motion-compensated features are obtained, the residual information is compressed by super-prior guided residual compression. Finally, the reconstructed residual features are added back to the predicted features, and the reconstructed video frame is obtained through a frame reconstruction module. The invention can better handle complex and large-motion scenes and improves the motion compensation quality at very low bit consumption. The super-prior information is used to predict the resolution of different blocks in motion compression and whether the compression of the current block is skipped in residual compression, greatly reducing the number of bits required for motion and residual compression.

Description

Coarse-to-fine deep video coding method with super-prior guided mode prediction
Technical Field
The invention relates to the technical field of video compression and deep learning, and in particular to a coarse-to-fine deep video coding method with super-prior guided mode prediction.
Background
Video content occupies an ever-growing share of total internet traffic, driven by the year-on-year growth of video websites and the move to higher resolutions and higher frame rates. Most of the video compression algorithms in daily use are the traditional codecs H.264 and H.265, whose hand-crafted modules cannot be optimized jointly end to end. Therefore, in the field of video compression, there is an urgent need for a new deep-learning-based video compression system that can effectively reduce redundant information in a video sequence.
Although existing deep-learning-based video compression algorithms can achieve good video restoration, they use only a single-scale motion estimation and motion compensation strategy; because the motion information in a video is very complex, such single-scale algorithms perform poorly on scenes with large or complex motion. In addition, existing deep-learning-based video compression methods do not use a mode selection strategy, which greatly limits their performance.
Therefore, there is an urgent need in the art for a deep video coding method with super-prior guided mode prediction that can effectively reduce the number of bits consumed and improve compression performance.
Disclosure of Invention
In view of this, the present invention provides a coarse-to-fine deep video coding method with super-prior guided mode prediction.
In order to achieve the purpose, the invention adopts the following technical scheme:
A coarse-to-fine deep video coding method with super-prior guided mode prediction, comprising the steps of:
S1, feature acquisition: obtain the current input image frame X_t to be compressed and the reconstructed reference frame X̂_{t-1} obtained by compressing the previous frame, and extract from them the input features F_t and the reference features F̂_{t-1}, respectively;
S2, coarse motion compensation: the input features F_t and the reference features F̂_{t-1} pass through one motion estimation and one motion compression to obtain a coarse offset between the two frames; the coarse offset is applied to the reference features F̂_{t-1} in a single motion compensation to obtain the intermediate prediction features F̄_t;
S3, fine motion compensation: the intermediate prediction features F̄_t and the input features F_t undergo a second motion estimation, a second motion compression and a second motion compensation to generate the final predicted features F̃_t; the second motion compression adopts a super-prior guided adaptive motion compression method in which the super-prior information of the features obtained by the second motion estimation is used as input for resolution mode prediction, and the resulting predicted mode of each feature block guides the encoding and decoding of those features during the second motion compression;
S4, residual feature compression: the residual features R_t between the input features F_t and the final predicted features F̃_t are compressed with a super-prior guided adaptive residual compression method that performs skip/non-skip mode prediction, skipping the feature values whose residual meets a set threshold, to obtain the reconstructed residual features R̂_t; these are added to the final predicted features F̃_t to generate the reconstructed features F̂_t;
S5, the reconstructed features F̂_t are input to a frame reconstruction module to generate the reconstructed frame X̂_t;
S6, the reconstructed frame X̂_t serves as the reference frame for the next frame; steps S1-S5 are repeated until the last frame to obtain the compressed video.
Preferably, S1 further comprises: when t = 1, the reconstructed reference frame X̂_1 is the reconstructed frame obtained by compressing the input image frame X_1 with an image compression algorithm.
Preferably, the S2 includes:
down-sampling the input features F_t and the reference features F̂_{t-1} into two low-resolution features of 1/n the original size;
performing motion estimation and motion compression on the two low-resolution features, then up-sampling the result by a factor of n to obtain the coarse offset between the two frames;
applying the coarse offset to the reference features F̂_{t-1} in a single motion compensation using deformable convolution to generate the intermediate prediction features F̄_t.
Preferably, the down-sampled input features F_t and reference features F̂_{t-1} are input to a coarse motion estimation network, which concatenates the two features and passes them through two convolutional layers.
Preferably, the features after motion estimation are input to a coarse motion compression network for the one motion compression, where the coarse motion compression network is composed of a motion encoding network and a motion decoding network.
Preferably, the S3 includes:
a prediction network pre-learned from the super-prior information, namely a resolution mode prediction network, for outputting the optimal block resolution;
encoding the input features to be compressed with a motion encoder to obtain the motion features M_t, the motion features M_t serving as the input of the super-prior network to obtain the super-prior information;
inputting the super-prior information into the resolution mode prediction network to predict the optimal resolution of each feature block, obtaining a predicted resolution mode;
inputting the motion features M_t to a mode-guided average pooling layer for the corresponding average pooling operation, inputting the average-pooled features to a mode-guided up-sampling layer to restore the original size as the features M̂_t, and inputting M̂_t into a motion decoder for decoding to obtain the compressed motion features.
Preferably, the super-prior information includes the mean and variance of the motion features M_t.
Preferably, the encoded features in the first motion compression, the second motion compression module and the residual feature compression process are all converted into bit streams before the corresponding decoding operations.
Through the above technical scheme, compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a coarse-to-fine deep video compression framework in which motion estimation, motion compression and motion compensation are each performed twice, in a coarse-to-fine manner, so that complex and large-motion scenes can be handled better and the motion compensation quality is improved at very low bit consumption.
2. The invention provides two super-prior guided mode prediction methods, which take the discriminative super-prior information as input to learn two mode prediction networks. The super-prior information in motion and residual compression is used to predict the resolution of different blocks in motion compression and whether the compression of the current block is skipped in residual compression, greatly reducing the number of bits required for motion and residual compression. The super-prior guided mode prediction introduces no extra bit cost, incurs negligible computational cost, and can easily be used to predict the best coding mode (i.e., the best block-resolution mode for motion coding and the 'skip'/'non-skip' modes for residual compression).
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a coarse-to-fine deep video coding method with super-prior guided mode prediction according to an embodiment of the present invention;
fig. 2 is a schematic network structure diagram of a feature extraction module and a frame reconstruction module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a coarse motion compensation branch network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fine motion compensation branch network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the four basic modes in the resolution mode prediction network and of the mode prediction networks according to an embodiment of the present invention;
FIG. 6 is a flowchart of the super-prior guided adaptive motion compression according to an embodiment of the present invention;
FIG. 7 is a Bpp-PSNR performance comparison of video compression algorithms according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Referring to FIG. 1, the present invention provides a coarse-to-fine deep video coding method with super-prior guided mode prediction, implemented as follows: the features of an input video frame are extracted, then motion estimation, compression and compensation are performed twice in a coarse-to-fine manner to obtain the predicted features, where the motion compression performed at the fine level uses super-prior guided motion compression. After the motion-compensated features are obtained, the residual information is compressed by super-prior guided residual compression. Finally, the reconstructed residual features are added back to the predicted features, and the reconstructed video frame is obtained through a frame reconstruction module. The quantized features in the compression networks are arithmetically entropy coded and stored as a binary file.
The specific execution steps are as follows:
S1, feature acquisition: obtain the current input image frame X_t to be compressed and the reconstructed reference frame X̂_{t-1} obtained by compressing the previous frame, and extract from them the input features F_t and the reference features F̂_{t-1}, respectively.
S2, coarse motion compensation: the input features F_t and the reference features F̂_{t-1} pass through one motion estimation and one motion compression to obtain a coarse offset between the two frames; the coarse offset is applied to the reference features F̂_{t-1} in a single motion compensation to obtain the intermediate prediction features F̄_t.
S3, fine motion compensation: the intermediate prediction features F̄_t and the input features F_t undergo a second motion estimation, a second motion compression and a second motion compensation to generate the final predicted features F̃_t. The second motion compression adopts a super-prior guided adaptive motion compression method: the super-prior information of the features obtained by the second motion estimation is used as input for resolution mode prediction, and the resulting predicted mode of each feature block guides the encoding and decoding of those features during the second motion compression.
S4, residual feature compression: the residual features R_t between the input features F_t and the final predicted features F̃_t are compressed with a super-prior guided adaptive residual compression method that performs skip/non-skip mode prediction, skipping the feature values whose residual meets a set threshold, to obtain the reconstructed residual features R̂_t; these are added to the final predicted features F̃_t to generate the reconstructed features F̂_t.
S5, the reconstructed features F̂_t are input to a frame reconstruction module to generate the reconstructed frame X̂_t.
S6, the reconstructed frame X̂_t serves as the reference frame for the next frame; steps S1-S5 are repeated until the last frame to obtain the compressed video.
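For orientation, the steps S1-S6 can be summarized as the per-frame coding loop sketched below in Python. The module names and interfaces are placeholders (assumptions) for the networks described in the following embodiments, not the literal implementation.

```python
# A structural sketch of the S1-S6 loop; feat/coarse/fine/residual/recon are
# stand-ins for the networks described in the embodiments below.
def compress_video(frames, image_codec, feat, coarse, fine, residual, recon):
    x_ref = image_codec(frames[0])           # t = 1: conventional image compression
    reconstructed = [x_ref]
    for x_t in frames[1:]:
        f_t, f_ref = feat(x_t), feat(x_ref)  # S1: feature acquisition
        f_bar = coarse(f_t, f_ref)           # S2: coarse motion compensation
        f_tilde = fine(f_t, f_bar)           # S3: fine motion compensation
        r_hat = residual(f_t - f_tilde)      # S4: super-prior guided residual coding
        x_ref = recon(f_tilde + r_hat)       # S5: frame reconstruction
        reconstructed.append(x_ref)          # S6: X̂_t is the next reference frame
    return reconstructed
```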
In one embodiment, as shown in FIG. 2(a), the feature extraction module extracts the input features F_t from the video frame, and, as shown in FIG. 2(b), the frame reconstruction module performs the reconstruction step from the reconstructed features F̂_t. ResBlock in FIGS. 2(a) and 2(b) is the basic residual block of the convolutional neural network ResNet and is shown in FIG. 2(c).
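Since FIG. 2 is not reproduced here, the following PyTorch sketch shows one plausible realization of the three sub-networks. The channel width, layer counts and the stride-2 stem are assumptions made for illustration; the exact structure in FIG. 2 may differ.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Basic residual block as in FIG. 2(c): conv-ReLU-conv with identity skip."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class FeatureExtractor(nn.Module):
    """FIG. 2(a): maps an RGB frame X_t into the feature space F_t."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, ch, 5, stride=2, padding=2),
                                 ResBlock(ch), ResBlock(ch), ResBlock(ch))

    def forward(self, x):
        return self.net(x)

class FrameReconstructor(nn.Module):
    """FIG. 2(b): maps reconstructed features back to a frame."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(ResBlock(ch), ResBlock(ch), ResBlock(ch),
                                 nn.ConvTranspose2d(ch, 3, 5, stride=2,
                                                    padding=2, output_padding=1))

    def forward(self, f):
        return self.net(f)
```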
In one embodiment, the video to be compressed is decomposed into individual frames. The first frame is compressed with a conventional image compression algorithm to obtain its reconstructed frame; each subsequent frame is then compressed, from front to back, by repeating the steps S1-S6 described above.
In this embodiment, the first reconstructed frame is produced before S1 as follows: when t = 1, the reconstructed reference frame X̂_1 is the reconstructed frame obtained by compressing the input image frame X_1 with an image compression algorithm.
For the t-th frame (t >= 2), one compression is performed: from the current input image frame X_t and the reconstructed reference frame X̂_{t-1} obtained by compressing the previous frame, the input features F_t and the reference features F̂_{t-1} are extracted.
In one embodiment, to produce more accurate motion compensation results, a two-stage, coarse-to-fine motion compensation module is proposed. As shown in FIG. 3, S2 is performed by the coarse-level motion compensation module and includes:
down-sampling the input features F_t and the reference features F̂_{t-1} into two low-resolution features of 1/n the original size;
performing motion estimation and motion compression on the two low-resolution features, then up-sampling, i.e., applying bilinear interpolation, to scale the result by a factor of n, thereby obtaining the coarse offset between the two frames;
applying the coarse offset to the reference features F̂_{t-1} in a single motion compensation using deformable convolution to generate the intermediate prediction features F̄_t.
Since the bit consumption of motion compression at this stage is small, adaptive motion compression is not used in the coarse-level motion compensation module.
In this embodiment, the down-sampling operation scales the features to 1/4 of their length and width by bilinear interpolation, and the up-sampling operation scales them back to 4 times the length and width, again by bilinear interpolation.
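The sketch below illustrates the coarse branch around the deformable convolution, using torchvision's DeformConv2d. Multiplying the up-sampled motion by n to rescale its magnitude, and deriving the per-kernel offsets with a one-layer head, are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class CoarseMotionCompensation(nn.Module):
    def __init__(self, ch=64, n=4):
        super().__init__()
        self.n = n
        # 2 * 3 * 3 = 18 offset channels for one 3x3 deformable kernel
        self.offset_head = nn.Conv2d(ch, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(ch, ch, 3, padding=1)

    def forward(self, f_ref, coarse_motion):
        # coarse_motion comes from motion estimation/compression at 1/n scale;
        # up-sample it n-fold and rescale its magnitude accordingly
        up = F.interpolate(coarse_motion, scale_factor=self.n,
                           mode='bilinear', align_corners=False) * self.n
        offset = self.offset_head(up)      # sampling offsets for the kernel taps
        return self.deform(f_ref, offset)  # intermediate prediction features F̄_t
```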
In one embodiment, the down-sampled input features F_t and reference features F̂_{t-1} are input to the coarse motion estimation network, which concatenates the two features and passes them through two convolutional layers.
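A minimal sketch of that estimation network follows; the channel widths and the ReLU between the two convolutions are assumptions.

```python
import torch
import torch.nn as nn

class CoarseMotionEstimation(nn.Module):
    """Concatenate the two down-sampled features and apply two conv layers."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1),  # fuse the concatenated features
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))      # estimated motion features

    def forward(self, f_t_down, f_ref_down):
        return self.net(torch.cat([f_t_down, f_ref_down], dim=1))
```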
In one embodiment, the features after motion estimation are input to a coarse motion compression network for the one motion compression. The motion compression network is composed of a motion encoding network and a motion decoding network, where the motion encoding network comprises four convolutional layers with stride 2 and four convolutional layers with stride 1, and the motion decoding network comprises four deconvolution (transposed convolution) layers with stride 2 and four convolutional layers with stride 1.
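A sketch of this encoder/decoder pair under stated assumptions (constant channel width, ReLU activations, rounding as test-time quantization; entropy coding omitted):

```python
import torch
import torch.nn as nn

class MotionCoder(nn.Module):
    """Four stride-2 plus four stride-1 layers on each side, per the text."""
    def __init__(self, ch=64):
        super().__init__()
        enc = []
        for _ in range(4):
            enc += [nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*enc)
        dec = []
        for _ in range(4):
            dec += [nn.ConvTranspose2d(ch, ch, 3, stride=2, padding=1,
                                       output_padding=1), nn.ReLU(inplace=True),
                    nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.ReLU(inplace=True)]
        self.decoder = nn.Sequential(*dec[:-1])  # no activation after the last conv

    def forward(self, motion):
        m_t = self.encoder(motion)    # motion features M_t to be transmitted
        m_hat = torch.round(m_t)      # rounding quantization (test time)
        return self.decoder(m_hat)
```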
In one embodiment, in the fine-level motion compensation module, motion estimation, motion compression and motion compensation are performed again at the fine level, on the intermediate prediction features F̄_t and the input features F_t, thereby generating the final predicted features F̃_t. As shown in FIG. 4, S3 is performed by the fine-level motion compensation module, in which the motion estimation network and the motion compensation network are the same as in the coarse-level module.
In the fine-level motion compression module, the newly proposed super-prior guided adaptive motion compression module is adopted. As shown in FIG. 6, it specifically comprises the following steps:
a prediction network pre-learned from the super-prior information, namely the resolution mode prediction network, outputs the optimal block resolution, so that the motion information can be encoded better;
the input features to be compressed are processed by four convolutional layers with stride 2 and four convolutional layers with stride 1 to obtain the encoded motion features M_t to be transmitted; the motion features M_t serve as the input of the super-prior network to obtain the super-prior information;
the super-prior information is input into the resolution mode prediction network to predict the optimal resolution of each feature block and obtain a predicted resolution mode. As shown in FIG. 5(a), there are 4 basic resolution modes, and a resolution mode (i.e., a basic mode in FIG. 5(a)) is predicted for each 2x2 and each 4x4 feature block. As shown in FIG. 5(b), for the current 4x4 feature block, the basic resolution mode of the whole 4x4 block is predicted first; when the prediction result is M0 (i.e., basic mode M0 in FIG. 5(a)), the 4x4 block is divided into four 2x2 sub-blocks. The resolution mode of each 2x2 sub-block is predicted at the same time, and the resolution mode (M0/M1/M2/M3) of each block is selected according to the prediction results. The mode-guided average pooling operation is then performed on each feature block according to the resulting resolution mode: for example, the values of the top-left 2x2 block A of M_t (3, 4, 4, 5) are average-pooled to a single value 4, which is quantized and entropy coded. The decoding side receives this 4 and, since it also holds the resolution mode of each block and therefore knows that block A actually consists of 4 values, mode-guided up-sampling restores the single value to the 4 values of block A (the red block in the upper-left corner of M̂_t in the figure).
The motion features M_t are input to the mode-guided average pooling layer for the corresponding average pooling operation, which reduces the number of motion feature values to be transmitted and thus effectively reduces the number of bits for transmitting the encoded motion features. The average-pooled features are then input to the mode-guided up-sampling layer, which restores them to their original size as the features M̂_t according to the super-prior information; that is, after obtaining the features, the decoding end can likewise restore the average-pooled features to their original size according to the super-prior information. M̂_t is input into the motion decoder for decoding to obtain the decoded fine-level motion features. The motion decoding network comprises four deconvolution layers with stride 2 and four convolutional layers with stride 1.
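The sketch below illustrates mode-guided average pooling and up-sampling on a single 2x2 level; the two-level 4x4/2x2 scheme of FIG. 5 and the exact architecture of the mode prediction head are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModeGuidedPooling(nn.Module):
    """Per 2x2 block, decide from the super-prior whether to pool to 1 value."""
    def __init__(self, ch=128):
        super().__init__()
        # input: super-prior info (assumed 2*ch channels: mean and variance)
        self.mode_head = nn.Sequential(
            nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 2, 2, stride=2))        # one 2-way decision per 2x2 block

    def forward(self, m_t, hyper_info):
        mode = self.mode_head(hyper_info).argmax(dim=1, keepdim=True)  # 0=pool, 1=keep
        pooled = F.avg_pool2d(m_t, 2)             # e.g. block A (3,4,4,5) -> 4
        up = F.interpolate(pooled, scale_factor=2.0, mode='nearest')   # decoder side
        keep = F.interpolate(mode.float(), scale_factor=2.0, mode='nearest')
        # pooled blocks transmit one value instead of four; kept blocks stay as-is
        return torch.where(keep.bool(), m_t, up)
```

Because the super-prior information is available at both the encoder and the decoder, the predicted modes need not be transmitted, which is why the mode prediction adds no extra bit cost.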
In this embodiment, the super-prior information includes the mean and variance of the encoded motion features M_t predicted by the super-prior network, which are used to assist the arithmetic encoding and arithmetic decoding of the motion features M_t.
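The mean and variance drive a conditional Gaussian entropy model; the expected bit cost of the quantized features can be estimated as below. This is the standard hyperprior-style computation, written out here for illustration; the arithmetic coder consumes the same probabilities.

```python
import torch

def estimated_bits(y_hat, mean, scale):
    """Bits to code quantized values y_hat under N(mean, scale^2), with the
    probability mass integrated over the quantization bin [y_hat-0.5, y_hat+0.5]."""
    gauss = torch.distributions.Normal(mean, scale.clamp(min=1e-6))
    p = gauss.cdf(y_hat + 0.5) - gauss.cdf(y_hat - 0.5)
    return (-torch.log2(p.clamp(min=1e-9))).sum()
```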
In one embodiment, the residual features R_t are compressed by the super-prior guided adaptive residual compression module. The overall network comprises a residual encoding network, a residual decoding network, a super-prior network and a mode prediction network, and its structure is essentially identical to the super-prior guided adaptive motion compression module (motion encoding network, motion decoding network, super-prior network and resolution mode prediction network). The difference is that, based on the super-prior information, the prediction network of the adaptive residual compression module does not predict the optimal resolution of each block; instead, it learns to predict a 'skip'/'non-skip' mode for each feature value that needs to be transmitted in the encoded residual features Y_t (of dimension 128 x h x w, i.e., 128 x h x w feature values in total) obtained after the residual encoding network, as shown in FIG. 5(c). Bits are saved by not transmitting the skipped, insignificant feature values: insignificant features (e.g., values whose residual is 0 and which therefore carry no information) are not transmitted to the decoding side, where the skipped values are filled with 0. This reduces the number of bits required to transmit the encoded residual features, so that the residual compression network can encode the residual features better. Finally, the reconstructed residual features R̂_t are added back to the final predicted features F̃_t to generate the reconstructed features F̂_t.
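A minimal sketch of the skip/non-skip mechanism, assuming a one-layer head over the super-prior information and a hard threshold at logit 0; the real module's head and thresholding are learned design choices.

```python
import torch
import torch.nn as nn

class SkipModeResidual(nn.Module):
    """Skip per-value transmission of encoded residuals, guided by the super-prior."""
    def __init__(self, ch=128):
        super().__init__()
        self.skip_head = nn.Conv2d(2 * ch, ch, 3, padding=1)  # one logit per value

    def encode(self, y_t, hyper_info):
        skip = self.skip_head(hyper_info) > 0   # True => do not transmit this value
        kept = y_t[~skip]                       # only these values are entropy coded
        return kept, skip

    def decode(self, kept, skip):
        y_hat = torch.zeros(skip.shape, device=kept.device)
        y_hat[~skip] = kept                     # skipped positions are filled with 0
        return y_hat
```

As with the resolution modes, the decoder can recompute the skip mask from the shared super-prior information, so the mask itself costs no bits.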
In one embodiment, the second compensation in the fine-level motion compensation module proceeds as follows: the decompressed motion features obtained after the super-prior guided adaptive compression are applied to the intermediate prediction features F̄_t in a second motion compensation using deformable convolution to obtain the final predicted features F̃_t, so that the compensation is carried out at a higher resolution and a more accurate prediction is obtained.
In one embodiment, the encoded features in the first motion compression, the second motion compression module and the residual feature compression are all converted into bit streams before the corresponding decoding operations. As shown in FIG. 6, after arithmetic coding (AC), the feature map is converted into a bitstream for transmission to the decoding end; after receiving the bitstream, the decoding end converts it back into a feature map using arithmetic decoding (AD).
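The contract around the entropy coder can be sketched as follows; arithmetic_encode and arithmetic_decode are hypothetical placeholders for any range coder driven by the same per-symbol probabilities on both sides, not a specific library API.

```python
def transmit(y_hat, mean, scale, arithmetic_encode):
    # encoder side: code the quantized symbols under the shared Gaussian model
    return arithmetic_encode(y_hat, mean, scale)  # -> bitstream (bytes)

def receive(bitstream, mean, scale, arithmetic_decode):
    # decoder side: the super-prior yields the same (mean, scale), so the
    # identical symbols are recovered losslessly from the bitstream
    return arithmetic_decode(bitstream, mean, scale)
```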
Table 1 gives the BDBR results of the method of this embodiment (Ours) compared with the standard reference software H.265 (HM) on multiple data sets, including HEVC Class B, C, D, E, UVG and MCL-JCV. Negative values in the table indicate the percentage of bits saved at the same reconstruction quality. Compared with other deep-learning-based video compression methods (FVC, ELF-VC, DCVC, FVC (re-imp)), the method of this embodiment also achieves the best performance to date.
TABLE 1 BDBR results comparison table (reproduced as an image in the original publication)
Since video compression must consider reconstruction performance at different bit rates, a performance curve is drawn by plotting bpp (the average number of bits consumed per pixel; smaller is better) against PSNR (larger means better reconstruction quality), as shown in FIG. 7.
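For reference, the two axes of FIG. 7 are computed per frame as follows (standard definitions, written out here for illustration):

```python
import math

def bpp(num_bits: int, height: int, width: int) -> float:
    """Average bits per pixel: total bits spent on a frame / its pixel count."""
    return num_bits / (height * width)

def psnr(mse: float, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)
```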
FVC (re-imp) is our baseline method; C2F is our proposed coarse-to-fine video compression framework; C2F+HAMC equips the coarse-to-fine framework with our proposed super-prior guided adaptive motion compression; and C2F+HAMC+HARC further equips it with the super-prior guided adaptive residual compression.
The results show that the proposed coarse-to-fine video compression framework, the super-prior guided resolution-adaptive motion compression and the super-prior guided skip-based adaptive residual compression all improve the performance of existing methods, demonstrating the effectiveness of the proposed algorithms.
Comprehensive experiments on the HEVC, UVG and MCL-JCV data sets show that, with the coarse-to-fine framework and the newly proposed super-prior guided mode prediction methods of this embodiment, video compression performance comparable to H.265 (HM) is achieved in terms of PSNR, and performance generally superior to the latest video compression standard reference VTM is achieved in terms of MS-SSIM.
The coarse-to-fine deep video coding method with super-prior guided mode prediction provided by the present invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A coarse-to-fine deep video coding method with super-prior guided mode prediction, comprising the steps of:
S1, feature acquisition: obtaining the current input image frame X_t to be compressed and the reconstructed reference frame X̂_{t-1} obtained by compressing the previous frame, and extracting from them the input features F_t and the reference features F̂_{t-1}, respectively;
S2, coarse motion compensation: passing the input features F_t and the reference features F̂_{t-1} through one motion estimation and one motion compression to obtain a coarse offset between the two frames, and applying the coarse offset to the reference features F̂_{t-1} in a single motion compensation to obtain the intermediate prediction features F̄_t;
S3, fine motion compensation: subjecting the intermediate prediction features F̄_t and the input features F_t to a second motion estimation, a second motion compression and a second motion compensation to generate the final predicted features F̃_t, wherein the second motion compression adopts a super-prior guided adaptive motion compression method in which the super-prior information of the features obtained by the second motion estimation is used as input for resolution mode prediction, and the resulting predicted mode of each feature block guides the encoding and decoding of those features in the second motion compression;
S4, residual feature compression: compressing the residual features R_t between the input features F_t and the final predicted features F̃_t with a super-prior guided adaptive residual compression method that performs skip/non-skip mode prediction, skipping the feature values whose residual meets a set threshold, to obtain the reconstructed residual features R̂_t, and adding them to the final predicted features F̃_t to generate the reconstructed features F̂_t;
S5, inputting the reconstructed features F̂_t into a frame reconstruction module to generate the reconstructed frame X̂_t;
S6, using the reconstructed frame X̂_t as the reference frame for the next frame, and repeating steps S1-S5 until the last frame to obtain the compressed video.
2. The coarse-to-fine deep video coding method with super-prior guided mode prediction according to claim 1, wherein S1 is preceded by: when t = 1, the reconstructed reference frame X̂_1 is the reconstructed frame obtained by compressing the input image frame X_1 with an image compression algorithm.
3. The coarse-to-fine deep video coding method with super-prior guided mode prediction according to claim 1, wherein the S2 comprises:
down-sampling the input features F_t and the reference features F̂_{t-1} into two low-resolution features of 1/n the original size;
performing motion estimation and motion compression on the two low-resolution features, then up-sampling the result by a factor of n to obtain the coarse offset between the two frames;
applying the coarse offset to the reference features F̂_{t-1} in a single motion compensation using deformable convolution to generate the intermediate prediction features F̄_t.
4. The coarse-to-fine deep video coding method with super-prior guided mode prediction according to claim 3, wherein the down-sampled input features F_t and reference features F̂_{t-1} are input to a coarse motion estimation network, which concatenates the two features and passes them through two convolutional layers.
5. The coarse-to-fine deep video coding method with super-prior guided mode prediction according to claim 3, wherein the features after motion estimation are input to a coarse motion compression network for the motion compression, the coarse motion compression network being composed of a motion encoding network and a motion decoding network.
6. The coarse-to-fine deep video coding method with super-prior guided mode prediction according to claim 1, wherein the S3 comprises:
a prediction network pre-learned from the super-prior information, namely a resolution mode prediction network, for outputting the optimal block resolution;
encoding the input features to be compressed with a motion encoder to obtain the motion features M_t, the motion features M_t serving as the input of the super-prior network to obtain the super-prior information;
inputting the super-prior information into the resolution mode prediction network to predict the optimal resolution of each feature block, obtaining a predicted resolution mode;
inputting the motion features M_t to a mode-guided average pooling layer for the corresponding average pooling operation, inputting the average-pooled features to a mode-guided up-sampling layer to restore the original size as the features M̂_t, and inputting M̂_t into a motion decoder for decoding to obtain the compressed motion features.
7. The coarse-to-fine deep video coding method with super-prior guided mode prediction according to claim 6, wherein the super-prior information comprises the mean and variance of the motion features M_t.
8. The coarse-to-fine deep video coding method with super-prior guided mode prediction according to claim 1, wherein the encoded features in the first motion compression, the second motion compression and the residual feature compression are all converted into bit streams before the corresponding decoding operations.
CN202210727355.5A 2022-05-31 2022-05-31 Coarse-to-fine deep video coding method with super-prior guided mode prediction Pending CN115150628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210727355.5A CN115150628A (en) Coarse-to-fine deep video coding method with super-prior guided mode prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210727355.5A CN115150628A (en) Coarse-to-fine deep video coding method with super-prior guided mode prediction

Publications (1)

Publication Number Publication Date
CN115150628A (en) 2022-10-04

Family

ID=83407729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210727355.5A Pending CN115150628A (en) Coarse-to-fine deep video coding method with super-prior guided mode prediction

Country Status (1)

Country Link
CN (1) CN115150628A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160565A1 (en) * 2018-11-19 2020-05-21 Zhan Ma Methods And Apparatuses For Learned Image Compression
CN112203093A (en) * 2020-10-12 2021-01-08 苏州天必佑科技有限公司 Signal processing method based on deep neural network
CN113298894A (en) * 2021-05-19 2021-08-24 北京航空航天大学 Video compression method based on deep learning feature space
CN114501013A (en) * 2022-01-14 2022-05-13 上海交通大学 Variable bit rate video compression method, system, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ma Siwei (马思伟): "Intelligent Video Coding" (智能视频编码), Artificial Intelligence (人工智能), 10 April 2020 (2020-04-10) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116437089A (en) * 2023-06-08 2023-07-14 Beijing Jiaotong University (北京交通大学) Depth video compression algorithm based on key target
CN116437089B (en) * 2023-06-08 2023-09-05 Beijing Jiaotong University (北京交通大学) Depth video compression method based on key target

Similar Documents

Publication Publication Date Title
CN110087092B (en) Low-bit-rate video coding and decoding method based on image reconstruction convolutional neural network
CN112203093B (en) Signal processing method based on deep neural network
CN103607591A (en) Image compression method combining super-resolution reconstruction
EP2168382B1 (en) Method for processing images and the corresponding electronic device
CN107454412B (en) Video image processing method, device and system
EP1397774A1 (en) Method and system for achieving coding gains in wavelet-based image codecs
CN102217314A (en) Methods and apparatus for video imaging pruning
CN105430416A (en) Fingerprint image compression method based on adaptive sparse domain coding
US20170223381A1 (en) Image coding and decoding methods and apparatuses
CN113298894A (en) Video compression method based on deep learning feature space
CN111726614A (en) HEVC (high efficiency video coding) optimization method based on spatial domain downsampling and deep learning reconstruction
CN111669588B (en) Ultra-high definition video compression coding and decoding method with ultra-low time delay
CN105392009A (en) Low bit rate image coding method based on block self-adaptive sampling and super-resolution reconstruction
CN109922339A (en) In conjunction with the image coding framework of multi-sampling rate down-sampling and super-resolution rebuilding technology
Fu et al. An extended hybrid image compression based on soft-to-hard quantification
CN114245989A (en) Encoder and method of encoding a sequence of frames
CN115278262A (en) End-to-end intelligent video coding method and device
CN115150628A (en) Coarse-to-fine deep video coding method with super-prior guided mode prediction
KR100679027B1 (en) Method and apparatus for coding image without DC component loss
CN110677644A (en) Video coding and decoding method and video coding intra-frame predictor
CN111080729B (en) Training picture compression network construction method and system based on Attention mechanism
CN104581173A (en) Soft decoding verification model platform
JP4762486B2 (en) Multi-resolution video encoding and decoding
CN115643406A (en) Video decoding method, video encoding device, storage medium, and storage apparatus
Peng et al. An optimized algorithm based on generalized difference expansion method used for HEVC reversible video information hiding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination