CN113938687A - Multi-reference inter-frame prediction method, system, device and storage medium - Google Patents

Multi-reference inter-frame prediction method, system, device and storage medium

Info

Publication number
CN113938687A
CN113938687A (application number CN202111189110.3A)
Authority
CN
China
Prior art keywords
frame
current video
video frame
dimensional tensor
voxel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111189110.3A
Other languages
Chinese (zh)
Inventor
陈志波
冯润森
郭宗昱
张直政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202111189110.3A priority Critical patent/CN113938687A/en
Publication of CN113938687A publication Critical patent/CN113938687A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124 Quantisation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, e.g. an object
    • H04N19/172 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, the region being a picture, frame or field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91 Entropy coding, e.g. variable length coding [VLC] or arithmetic coding

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a multi-reference inter-frame prediction method, system, device and storage medium. The method comprises: stacking all reference frames of a current video frame into a three-dimensional tensor, and, in combination with the current video frame, generating a quantized hidden variable containing motion information and a voxel stream set through a motion information coder-decoder; performing weighted interpolation using the three-dimensional tensor and the voxel stream set to obtain a predicted frame; performing residual coding using the predicted frame and the current video frame to generate a quantized hidden variable containing residual information; entropy coding the quantized hidden variable containing motion information and the quantized hidden variable containing residual information respectively to obtain the code streams to be transmitted; and, in the decoding process, reconstructing the current video frame using the three-dimensional tensor together with the voxel stream set and residual information decoded from the code stream. The disclosed scheme can be flexibly applied to different prediction modes and can use reference pixel information from a plurality of positions to improve inter-frame compression performance.

Description

Multi-reference inter-frame prediction method, system, device and storage medium
Technical Field
The present invention relates to the field of video compression coding technology, and in particular to a multi-reference inter-frame prediction method, system, device, and storage medium.
Background
Compression coding of video is an important technology of the electronic information age, helping to reduce the transmission bandwidth and storage consumption of video. A main goal of video coding is to remove temporal correlation between video frames and thereby achieve higher rate-distortion performance.
Lossy video compression coding techniques can be classified into conventional coding methods and deep-learning-based coding methods. Conventional video coding is a mature scheme developed over decades that adopts a hybrid coding-decoding framework; representative algorithms are H.264/AVC, H.265/HEVC and H.266/VVC. Deep video coding combines coding with neural network training and is an end-to-end optimized framework proposed in recent years.
The current mainstream video compression coding approach predicts the video frame to be coded from a reference frame through motion compensation, and then codes the residual between the predicted frame and the current frame, thereby removing temporal redundancy. An accurate and efficient inter prediction method therefore plays a very important role. Existing methods typically use single-optical-flow motion compensation to achieve inter-frame prediction: each pixel of the predicted frame has a motion vector (optical flow), and the predicted pixel value is obtained by bilinear interpolation of the pixels around one target position in the reference frame according to the optical flow. However, this approach has the following disadvantages: 1) it is suited only to single-reference-frame scenarios and is difficult to apply flexibly to multi-reference-frame prediction in different modes; and 2) each target pixel of the predicted frame is predicted from the reference pixels of only one position, so the information of reference pixels at multiple positions cannot be utilized and the prediction accuracy is limited.
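For concreteness, the single-optical-flow bilinear interpolation described above can be sketched as follows; this is a minimal illustrative sketch (grayscale frames, and the function name `bilinear_warp_pixel` is a hypothetical choice, not from the patent):

```python
import numpy as np

def bilinear_warp_pixel(ref, y, x):
    """Single-reference optical-flow prediction: one predicted pixel is
    bilinearly interpolated at one (fractional) target position (y, x)
    in one reference frame, as in conventional flow-based compensation."""
    H, W = ref.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            # Clamp to the frame border so out-of-range taps stay valid.
            yy = min(max(y0 + dy, 0), H - 1)
            xx = min(max(x0 + dx, 0), W - 1)
            w = (1 - abs(y - (y0 + dy))) * (1 - abs(x - (x0 + dx)))
            val += w * float(ref[yy, xx])
    return val
```

Note that only the four pixels around a single target position contribute, which is exactly the limitation the multi-reference scheme below addresses.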
Disclosure of Invention
The invention aims to provide a multi-reference inter-frame prediction method, system, device and storage medium that can be flexibly applied to different prediction modes and can use the reference pixel information of a plurality of positions to improve inter-frame compression performance.
The purpose of the invention is realized by the following technical scheme:
a method of multi-reference inter-frame prediction for depth video coding, comprising:
stacking all reference frames of a current video frame into a three-dimensional tensor, and generating a quantized hidden variable containing motion information and a voxel stream set through a motion information coder-decoder by combining the current video frame;
carrying out weighted interpolation processing by utilizing the three-dimensional tensor and the voxel stream set to obtain a prediction frame;
residual coding is carried out by utilizing the predicted frame and the current video frame to generate a quantization hidden variable containing residual information;
respectively entropy coding the quantization hidden variable containing the motion information and the quantization hidden variable containing the residual error information to obtain code streams to be transmitted;
and in the decoding process, reconstructing the current video frame by using the three-dimensional tensor and the voxel stream set and residual information decoded from the code stream.
A multi-reference inter-frame prediction system for depth video coding for implementing the aforementioned method, the system comprising:
the motion information coding and decoding module is used for stacking all reference frames of the current video frame into a three-dimensional tensor, and generating a quantized hidden variable and a voxel stream set containing motion information by combining the current video frame through a motion information coder and decoder;
the prediction module is used for performing weighted interpolation processing by utilizing the three-dimensional tensor and the voxel stream set to obtain a prediction frame;
the residual coding module is used for carrying out residual coding by utilizing the predicted frame and the current video frame to generate a quantization hidden variable containing residual information;
the code stream generation module is used for respectively carrying out entropy coding on the quantized hidden variable containing the motion information and the quantized hidden variable containing the residual error information to obtain a code stream to be transmitted;
and the decoding module is used for reconstructing the current video frame by utilizing the three-dimensional tensor and the voxel stream set and residual error information decoded from the code stream in the decoding process.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program, characterized in that the computer program realizes the aforementioned method when executed by a processor.
The technical scheme provided by the invention shows that: 1) different prediction modes, such as different numbers of reference frames and different reference structures, can be handled flexibly; one model can handle multiple modes, which is more efficient than using a separate model per mode; 2) inter-frame prediction using the information of a plurality of reference positions is more accurate than inter-frame prediction from a single reference position; 3) the method has wide applicability, with good performance improvement across multiple rate points, multiple models and multiple data sets.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a diagram illustrating a multi-reference inter-frame prediction method according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a different inter prediction method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a multi-reference inter-frame prediction system according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings. It is obvious that the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the term "and/or" means that either or both can be achieved, for example, X and/or Y means that both cases include "X" or "Y" as well as three cases including "X and Y".
The terms "comprising," "including," "containing," "having," or other similar terms of meaning should be construed as non-exclusive inclusions. For example: including a feature (e.g., material, component, ingredient, carrier, formulation, material, dimension, part, component, mechanism, device, process, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article of manufacture), is to be construed as including not only the particular feature explicitly listed but also other features not explicitly listed as such which are known in the art.
The multi-reference inter-frame prediction method provided by the present invention is described in detail below. Details not described in the embodiments of the invention belong to the prior art known to a person skilled in the art. Anything not specifically mentioned in the examples of the present invention is carried out according to conventional conditions in the art or conditions suggested by the manufacturer.
The embodiment of the invention provides a multi-reference inter-frame prediction method for depth video coding, which mainly comprises the following steps:
step 1, stacking all reference frames of a current video frame into a three-dimensional tensor, and generating a quantized hidden variable and a voxel stream set containing motion information by combining the current video frame through a motion information coder-decoder.
In the embodiment of the invention, the number of reference frames of the current video frame is set as n, and the n reference frames are stacked into a three-dimensional tensor X_t; then, the three-dimensional tensor X_t is concatenated with the current video frame x_t, a quantized hidden variable containing motion information, denoted ŷ_t^m, is generated by the motion information encoder, and a voxel stream set g_t is generated by the motion information decoder, each voxel stream comprising a three-dimensional position vector and a corresponding weight.
In the embodiment of the invention, the position of the reference frame in the video sequence is positioned before the current video frame and/or after the current video frame.
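As a minimal sketch of step 1's data handling (in NumPy, with toy shapes and grayscale frames chosen purely for illustration — the patent does not specify data layouts), the stacking and concatenation might look like:

```python
import numpy as np

def stack_reference_frames(ref_frames):
    """Stack the n reference frames (each H x W; grayscale here for
    simplicity) into the three-dimensional tensor X_t of shape (n, H, W)."""
    return np.stack(ref_frames, axis=0)

# Toy data: 3 hypothetical reference frames of size 4 x 4.
refs = [np.full((4, 4), float(i), dtype=np.float32) for i in range(3)]
X_t = stack_reference_frames(refs)

# The tensor is then concatenated with the current frame x_t before being
# fed to the motion information encoder (this layout is an assumed choice).
x_t = np.full((1, 4, 4), 9.0, dtype=np.float32)
encoder_input = np.concatenate([X_t, x_t], axis=0)  # shape (n + 1, H, W)
```

Because the reference frames may lie before and/or after the current frame in display order, the stacking order along the first axis is a free design choice of the codec.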
And 2, performing weighted interpolation processing by using the three-dimensional tensor and the voxel stream set to obtain a prediction frame.
In the embodiment of the present invention, a given pixel of the predicted frame is obtained by weighting the voxels at a plurality of corresponding positions indicated by a plurality of voxel streams, expressed by the formula:

x̄_t(x, y) = Σ_{i=1}^{M} w_t^i(x, y) · X_t(f_t^i(x, y))

where x̄_t represents the predicted frame, X_t represents the three-dimensional tensor, (x, y) represents the spatial coordinates, M represents the number of voxel streams, f_t^i(x, y) represents the three-dimensional position to which the i-th voxel stream points (the voxel value at this fractional position being obtained by interpolation), and w_t^i(x, y) represents the weight of the i-th voxel stream.
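A hedged NumPy sketch of the weighted voxel-stream interpolation above. The names `trilinear_sample` and `predict_pixel` are illustrative, and trilinear sampling is an assumption about how a fractional three-dimensional position is read from the tensor; the weighted sum itself follows the formula directly:

```python
import numpy as np

def trilinear_sample(X, t, y, x):
    """Trilinearly interpolate the volume X (shape n x H x W) at the
    fractional three-dimensional position (t, y, x)."""
    n, H, W = X.shape
    t0, y0, x0 = int(np.floor(t)), int(np.floor(y)), int(np.floor(x))
    val = 0.0
    for dt in (0, 1):
        for dy in (0, 1):
            for dx in (0, 1):
                # Clamp corner indices to the volume border.
                tt = min(max(t0 + dt, 0), n - 1)
                yy = min(max(y0 + dy, 0), H - 1)
                xx = min(max(x0 + dx, 0), W - 1)
                w = ((1 - abs(t - (t0 + dt)))
                     * (1 - abs(y - (y0 + dy)))
                     * (1 - abs(x - (x0 + dx))))
                val += w * float(X[tt, yy, xx])
    return val

def predict_pixel(X, positions, weights):
    """x̄_t(x, y) = Σ_i w_t^i · X_t(f_t^i): weighted sum over the M voxel
    streams of the sampled voxel at each pointed-to 3D position."""
    return sum(w * trilinear_sample(X, ft, fy, fx)
               for (ft, fy, fx), w in zip(positions, weights))
```

Each voxel stream contributes one sampled voxel, so a single predicted pixel can draw on reference information from several frames and several spatial positions at once.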
Step 3, residual coding is carried out using the predicted frame and the current video frame to generate a quantized hidden variable containing residual information, denoted ŷ_t^r.
Step 4, the quantized hidden variable containing motion information, ŷ_t^m, and the quantized hidden variable containing residual information, ŷ_t^r, are each entropy coded to obtain the code streams to be transmitted. The complete code stream comprises the motion information code stream from step 1 and the residual information code stream from step 3; the specific entropy coding scheme can be implemented with reference to the prior art and is not described again here.
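The quantization and entropy-coding cost in steps 3 and 4 can be sketched as follows. The hard-rounding rule and the toy bit-cost estimate are common learned-compression conventions assumed here, not details given in the patent:

```python
import numpy as np

def quantize(y):
    """Hard rounding of a latent to integers at inference time. (Training
    would typically use additive uniform noise as a differentiable proxy;
    that is an assumption following common learned-compression practice.)"""
    return np.round(y)

def entropy_bits(y_hat, pmf):
    """Toy estimate of the entropy-coding cost: bits = sum of -log2 p(s)
    over the quantized symbols. A real codec would drive an arithmetic or
    range coder with the same probability model."""
    return float(sum(-np.log2(pmf[int(s)]) for s in np.ravel(y_hat)))
```

The same machinery applies to both ŷ_t^m and ŷ_t^r; only the probability model (learned prior) differs between the two latents.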
And 5, in the decoding process, reconstructing the current video frame by using the three-dimensional tensor and the voxel stream set and residual information decoded from the code stream.
In this step, the predicted frame x̄_t is obtained by weighted interpolation using the voxel stream set decoded from the code stream together with the three-dimensional tensor X_t obtained in step 1 (in the same manner as step 2, i.e., by substituting them into the weighting formula), and the current video frame is then reconstructed by combining the residual information.
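The decoder-side reconstruction of step 5 can be sketched as follows, reusing the weighted voxel-stream prediction and adding the decoded residual. The nearest-voxel lookup (in place of full trilinear sampling) and all shapes are simplifying assumptions for brevity:

```python
import numpy as np

def decode_frame(X_t, positions_map, weights_map, residual):
    """Decoder-side reconstruction: redo the weighted voxel-stream
    interpolation to form the predicted frame, then add the decoded
    residual. positions_map[y][x] holds the M decoded 3D positions for
    pixel (y, x); weights_map[y][x] holds the matching weights."""
    H, W = residual.shape
    pred = np.zeros((H, W), dtype=np.float32)
    for y in range(H):
        for x in range(W):
            for (ft, fy, fx), w in zip(positions_map[y][x], weights_map[y][x]):
                # Nearest-voxel lookup stands in for trilinear sampling.
                pred[y, x] += w * X_t[int(round(ft)), int(round(fy)), int(round(fx))]
    return pred + residual
```

Because the decoder rebuilds X_t from already-reconstructed reference frames, encoder and decoder stay synchronized without transmitting the tensor itself.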
As shown in fig. 1, the scheme of the present invention can be implemented in various deep video coding-decoding frameworks; it suffices to replace the existing inter prediction method with the multi-reference-frame prediction method provided by the present invention. The motion information codec involved in step 1 and the residual information codec involved in steps 3 to 5 are designed based on existing deep image compression frameworks and are therefore not described in detail.
Fig. 2 provides a comparison of different inter prediction methods. Suppose an input video frame sequence x_1, x_2, …, x_T is compressed frame by frame; let the frame currently to be compressed be x_t, t = 1, 2, …, T, where T is the total number of video frames, and let x_t have n reference frames. The positions of these reference frames in the sequence may be both before and after t. According to the relative positions of the reference frames, prediction modes can be roughly divided into two types: unidirectional inter-frame prediction and bidirectional inter-frame prediction. Part (a) of fig. 2 shows the common single-reference-frame unidirectional prediction method: a given pixel of the current video frame x_t is interpolated from the reference pixel at the position (black dot) in the previous reference frame pointed to by the optical flow. Part (b) of fig. 2 shows the common dual-reference-frame bidirectional prediction method: a given pixel of the current video frame is the weighted combination of the interpolation results of reference pixels at two positions, one in the previous reference frame and one in the next. Part (c) of fig. 2 shows a prediction method first introduced into video compression by this work: the reference frames participating in prediction, mentioned in step 1 above, are stacked into a three-dimensional tensor, and a three-dimensional voxel stream specifies the location of the desired reference pixel within the tensor. Part (d) of fig. 2 is the final proposed technical solution: on the basis of part (c), the single voxel stream is extended to multiple voxel streams, i.e., a given pixel of the predicted frame is the weighted combination of the voxels at a plurality of corresponding positions indicated by a plurality of voxel streams.
The scheme of the embodiment of the invention mainly has the following beneficial effects:
1) Different prediction modes, such as different numbers of reference frames and different reference structures, can be handled flexibly; one model can handle multiple modes, which is more efficient than maintaining a separate model per mode.
2) Inter-frame prediction using the information of multiple reference positions is more accurate than inter-frame prediction using a single reference position.
3) The method has wide applicability, with good performance improvement across multiple rate points, multiple models and multiple data sets.
In conclusion, the scheme provided by the invention improves the flexibility and accuracy of multi-reference-frame prediction within deep video compression frameworks in general, and helps put such methods into practical application.
Another embodiment of the present invention further provides a multi-reference inter-frame prediction system for depth video coding, which is mainly used to implement the method provided by the foregoing embodiment, as shown in fig. 3, the system mainly includes:
the motion information coding and decoding module is used for stacking all reference frames of the current video frame into a three-dimensional tensor, and generating a quantized hidden variable and a voxel stream set containing motion information by combining the current video frame through a motion information coder and decoder;
the prediction module is used for performing weighted interpolation processing by utilizing the three-dimensional tensor and the voxel stream set to obtain a prediction frame;
the residual coding module is used for carrying out residual coding by utilizing the predicted frame and the current video frame to generate a quantization hidden variable containing residual information;
the code stream generation module is used for respectively carrying out entropy coding on the quantized hidden variable containing the motion information and the quantized hidden variable containing the residual error information to obtain a code stream to be transmitted;
and the decoding module is used for reconstructing the current video frame by utilizing the three-dimensional tensor formed by the reference frame and the voxel stream set and residual information decoded from the code stream in the decoding process.
Another embodiment of the present invention further provides a processing apparatus, as shown in fig. 4, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical button or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Another embodiment of the present invention further provides a readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method provided by the foregoing embodiment.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method of multi-reference inter-frame prediction for depth video coding, comprising:
stacking all reference frames of a current video frame into a three-dimensional tensor, and generating a quantized hidden variable containing motion information and a voxel stream set through a motion information coder-decoder by combining the current video frame;
carrying out weighted interpolation processing by utilizing the three-dimensional tensor and the voxel stream set to obtain a prediction frame;
residual coding is carried out by utilizing the predicted frame and the current video frame to generate a quantization hidden variable containing residual information;
respectively entropy coding the quantization hidden variable containing the motion information and the quantization hidden variable containing the residual error information to obtain code streams to be transmitted;
and in the decoding process, reconstructing the current video frame by using the three-dimensional tensor and the voxel stream set and residual information decoded from the code stream.
2. The method of claim 1, wherein stacking all reference frames of a current video frame into a three-dimensional tensor and generating a voxel stream set through the coder-decoder comprises:
setting the number of reference frames of the current video frame as n, and stacking the n reference frames into a three-dimensional tensor X_t;
then concatenating the three-dimensional tensor X_t with the current video frame x_t, generating a quantized hidden variable containing motion information, ŷ_t^m, by a motion information encoder, and generating a voxel stream set g_t by a motion information decoder, each voxel stream comprising a three-dimensional position vector and a corresponding weight.
3. The method of claim 1 or 2, wherein the position of the reference frame in the video sequence is located before the current video frame and/or after the current video frame.
4. The method of claim 1, wherein the weighted interpolation processing using the three-dimensional tensor and the voxel stream set to obtain the predicted frame is represented by:

x̄_t(x, y) = Σ_{i=1}^{M} w_t^i(x, y) · X_t(f_t^i(x, y))

where x̄_t represents the predicted frame, X_t represents the three-dimensional tensor, (x, y) represents the spatial coordinates, M represents the number of voxel streams, f_t^i(x, y) represents the three-dimensional position to which the i-th voxel stream points, and w_t^i(x, y) represents the weight of the i-th voxel stream.
5. The method of claim 1, wherein reconstructing the current video frame using the three-dimensional tensor and the set of voxel streams decoded from the bitstream and residual information comprises:
and obtaining a prediction frame by combining the voxel stream decoded from the code stream with the three-dimensional tensor through weighted interpolation, and reconstructing a current video frame by combining residual information.
6. A multi-reference inter-frame prediction system for depth video coding, for implementing the method of any of claims 1-5, the system comprising:
the motion information coding and decoding module is used for stacking all reference frames of the current video frame into a three-dimensional tensor, and generating a quantized hidden variable and a voxel stream set containing motion information by combining the current video frame through a motion information coder and decoder;
the prediction module is used for performing weighted interpolation processing by utilizing the three-dimensional tensor and the voxel stream set to obtain a prediction frame;
the residual coding module is used for carrying out residual coding by utilizing the predicted frame and the current video frame to generate a quantization hidden variable containing residual information;
the code stream generation module is used for respectively carrying out entropy coding on the quantized hidden variable containing the motion information and the quantized hidden variable containing the residual error information to obtain a code stream to be transmitted;
and the decoding module is used for reconstructing the current video frame by utilizing the three-dimensional tensor and the voxel stream set and residual error information decoded from the code stream in the decoding process.
7. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-5.
8. A readable storage medium, storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1-5.
CN202111189110.3A 2021-10-12 2021-10-12 Multi-reference inter-frame prediction method, system, device and storage medium Pending CN113938687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111189110.3A CN113938687A (en) 2021-10-12 2021-10-12 Multi-reference inter-frame prediction method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111189110.3A CN113938687A (en) 2021-10-12 2021-10-12 Multi-reference inter-frame prediction method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN113938687A true CN113938687A (en) 2022-01-14

Family

ID=79278572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111189110.3A Pending CN113938687A (en) 2021-10-12 2021-10-12 Multi-reference inter-frame prediction method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN113938687A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116437102A (en) * 2023-06-14 2023-07-14 中国科学技术大学 Method, system, equipment and storage medium for learning universal video coding
CN116437102B (en) * 2023-06-14 2023-10-20 中国科学技术大学 Method, system, equipment and storage medium for learning universal video coding


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination