CN108174218B - Video coding and decoding system based on learning - Google Patents

Video coding and decoding system based on learning

Info

Publication number
CN108174218B
CN108174218B (application CN201810064012.9A)
Authority
CN
China
Prior art keywords
iterative
time domain
space
coding
analyzer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810064012.9A
Other languages
Chinese (zh)
Other versions
CN108174218A (en)
Inventor
陈志波
何天宇
金鑫
刘森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN201810064012.9A
Publication of CN108174218A
Application granted
Publication of CN108174218B
Legal status: Active

Classifications

    • H04N 19/00: methods or arrangements for coding, decoding, compressing or decompressing digital video signals (H: electricity; H04: electric communication technique; H04N: pictorial communication, e.g. television)
    • H04N 19/503: predictive coding involving temporal prediction
    • H04N 19/147: data rate or code amount at the encoder output according to rate-distortion criteria
    • H04N 19/192: adaptive coding in which the adaptation method, tool or type is iterative or recursive
    • H04N 19/593: predictive coding involving spatial prediction techniques
    • H04N 19/70: syntax aspects related to video coding, e.g. related to compression standards

Abstract

The invention discloses a learning-based video coding and decoding framework, comprising: a space-time domain reconstruction memory for storing reconstructed video content that has been encoded and decoded; a space-time domain prediction network that exploits the space-time correlation of the reconstructed video content, models it with convolutional and recurrent neural networks, and outputs a predicted value for the current coding block; the predicted value is subtracted from the original value to form a residual; an iterative analyzer and an iterative synthesizer that encode and decode the input residual stage by stage; a binarizer for converting the output of the iterative analyzer into a binary representation; an entropy encoder for entropy-coding the quantized coded output into the output bitstream; and an entropy decoder for entropy-decoding the output bitstream and passing the result to the iterative synthesizer. The coding framework realizes space-time domain prediction with a learning-based VoxelCNN (the space-time domain prediction network) and controls the rate-distortion optimization of video coding with a residual iterative coding method.

Description

Video coding and decoding system based on learning
Technical Field
The invention relates to the technical field of video coding and decoding, in particular to a video coding and decoding framework based on learning.
Background
Existing image and video coding standards, such as JPEG, H.261, MPEG-2, H.264 and H.265, are all based on a hybrid coding framework. Over years of development, gains in coding performance have been accompanied by steadily increasing complexity, so further improving coding performance under the existing hybrid architecture faces growing challenges.
Moreover, the hybrid coding framework generally optimizes image and video coding with heuristic methods, making it increasingly difficult to meet the requirements of complex, intelligent media applications such as face recognition, target tracking and image retrieval.
Disclosure of Invention
The invention aims to provide a learning-based video coding and decoding framework that can control the rate-distortion optimization of video coding.
This aim is achieved by the following technical scheme:
a learning-based video coding/decoding framework, comprising: an encoding end and a decoding end; wherein the encoding end includes: the device comprises a space-time domain reconstruction memory, a space-time domain prediction network, an iterative analyzer, an iterative synthesizer, a binarizer, an entropy encoder and an entropy decoder;
the space-time domain reconstruction memory is used for storing video content that has been encoded, decoded and reconstructed;
the space-time domain prediction network is used for exploiting the space-time correlation of the reconstructed video content, modeling it with convolutional and recurrent neural networks, and outputting a predicted value for the current coding block;
the iterative analyzer comprises convolutional and recurrent neural network structures; its input is the residual formed by subtracting the predicted value output by the space-time domain prediction network from the original value, and its output is a compressed representation of the residual;
the iterative synthesizer comprises convolutional and recurrent neural network structures; it receives the compressed representation of the residual decoded by the entropy decoder and superimposes the predicted value output by the space-time domain prediction network to form the reconstructed video content;
the iterative analyzer and the iterative synthesizer encode and decode the input residual stage by stage, gradually reducing the residual distortion as the bitstream grows, thereby realizing coding at different distortion levels under both high and low bitrate conditions;
the binarizer is used for converting the output of the iterative analyzer into a binary representation;
the entropy encoder is used for entropy-coding the quantized coded output to obtain the output bitstream;
and the entropy decoder is used for entropy-decoding the output bitstream and passing the result to the iterative synthesizer.
As the above technical scheme shows, the invention integrates space-time domain prediction with residual iterative coding: space-time domain prediction is realized by the learning-based VoxelCNN (the space-time domain prediction network), and the rate-distortion optimization of video coding is controlled with the residual iterative coding method.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a learning-based video encoding and decoding framework according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a main processing procedure of a video codec framework according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a motion interpolation process according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a motion extension process provided by an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by those skilled in the art without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention provides a learning-based video coding and decoding framework, which mainly comprises an encoding end and a decoding end. As shown in fig. 1, the encoding end mainly includes a space-time domain reconstruction memory, a space-time domain prediction network, an iterative analyzer, an iterative synthesizer, a binarizer, an entropy encoder and an entropy decoder.
The space-time domain reconstruction memory stores the encoded-and-decoded reconstructed video content, including previously decoded frames and the already-decoded blocks of the current frame. Coding and decoding typically proceed either forward along the video timeline (P-frames) or bi-directionally (B-frames), and each frame is typically coded and decoded block by block in left-to-right, top-to-bottom order.
The space-time domain prediction network (VoxelCNN) exploits the space-time correlation of the reconstructed video content, models it with convolutional and recurrent neural networks, and outputs a predicted value for the current coding block; the predicted value is subtracted from the original value to form a residual, which is iteratively coded by the iterative analyzer and the iterative synthesizer to realize rate-distortion optimization.
The iterative analyzer comprises convolutional and recurrent neural network structures; its input is the residual formed by subtracting the predicted value output by the space-time domain prediction network from the original value, and its output is a compressed representation of the residual.
The iterative synthesizer comprises convolutional and recurrent neural network structures; it receives the compressed representation of the residual decoded by the entropy decoder and superimposes the predicted value output by the space-time domain prediction network to form the reconstructed video content.
The iterative analyzer and the iterative synthesizer encode and decode the input residual stage by stage, gradually reducing the residual distortion as the bitstream grows, thereby realizing coding at different distortion levels under both high and low bitrate conditions.
The binarizer converts the output of the iterative analyzer into a binary representation.
The entropy encoder entropy-codes the quantized coded output to obtain the output bitstream.
The entropy decoder entropy-decodes the output bitstream and passes the result to the iterative synthesizer.
In the embodiment of the invention, the entropy encoder and entropy decoder can be implemented with methods such as context-based arithmetic coding, i.e., an arithmetic encoder/decoder serves as the entropy encoder/decoder.
In the embodiment of the invention, the space-time domain reconstruction memory, the space-time domain prediction network, the iterative synthesizer and the entropy decoder together form the decoder inside the encoding end.
Those skilled in the art will appreciate that, since the decoding side can only obtain the reconstructed video content rather than the original video content, the encoding end includes decoding functionality so that the encoder references the same reconstructed content as the decoder.
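For orientation, one coding step can be traced through these components in a minimal Python sketch, with every learned module reduced to a stub. All names and stub behaviors below are illustrative assumptions for exposition, not the patent's implementation:

    import numpy as np

    def predict(memory):
        # Stub for the space-time domain prediction network (VoxelCNN):
        # here simply the mean of the stored reconstructions.
        return np.mean(np.stack(memory), axis=0)

    def analyze(residual):
        # Stub for the iterative analyzer (identity); the real analyzer is a
        # convolutional/recurrent auto-encoder producing a compressed code.
        return residual

    def binarize(code):
        # Binarizer: quantize the analyzer output to a binary representation.
        return np.sign(code)

    def synthesize(bits):
        # Stub for the iterative synthesizer: map the binary code back to a
        # coarse residual estimate.
        return 0.1 * bits

    memory = [np.zeros((64, 64)), np.zeros((64, 64))]  # space-time reconstruction memory
    original = np.random.rand(64, 64)                  # current content to encode

    prediction = predict(memory)                    # space-time domain prediction
    residual = original - prediction                # residual = original - predicted
    bits = binarize(analyze(residual))              # compressed, binarized representation
    # `bits` would now be entropy-coded (e.g. context-based arithmetic coding)
    reconstruction = prediction + synthesize(bits)  # decoder: prediction + synthesized residual
    memory.append(reconstruction)                   # reference for later blocks/frames

In the actual framework, the analyze-binarize-synthesize step is repeated over the remaining residual several times, as detailed later.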
For ease of understanding, the main processing in the video codec framework is described in detail below with reference to the specific example shown in fig. 2.
In the embodiment of the invention, the space-time domain prediction network calculates the predicted value of a coding block through two processes: motion synthesis and hybrid prediction.
1. Motion synthesis.
Motion synthesis includes motion interpolation and motion extension, corresponding to two different coding modes; one of the two is selected in operation. A combined code sketch of both modes follows the two mode descriptions below.
1) Motion interpolation derives the object motion trajectory from two adjacent frames in the reconstructed video content and interpolates it between those two frames to form an interpolated frame. As shown in fig. 3, the motion interpolation process is as follows. Let $v_x, v_y, x, y \in \mathbb{Z}$, where $(v_x, v_y)$ denotes a motion vector and $\mathbb{Z}$ denotes the set of integers. Let the interpolated frame be $\hat{f}_t$, and denote the two adjacent frames in the reconstructed video content by $\hat{f}_{t-1}$ and $\hat{f}_{t+1}$. With block size $m$, a motion compensation operation determines the motion vector $(v_x, v_y)$ of the block centered at coordinates $(x, y)$; the coding block of the interpolated frame $\hat{f}_t$ centered at $(x, y)$ is then assigned a copy of the motion-compensated block of $\hat{f}_{t-1}$ centered at $(x - v_x, y - v_y)$. Applying this to every block yields the complete interpolated frame $\hat{f}_t$, which is the output of the motion interpolation operation.
2) Motion extension derives the object motion trajectory from the first two frames in the reconstructed video content and extrapolates it to obtain an extended frame $\hat{f}_t$. As shown in fig. 4, the motion extension process is as follows: over the first two frames $\hat{f}_{t-2}$ and $\hat{f}_{t-1}$, a motion compensation operation with coding block size $m$ determines the motion vector $(v_x, v_y)$ of the coding block centered at coordinates $(x, y)$; the coding block of the extended frame $\hat{f}_t$ centered at $(x, y)$ is then assigned a copy of the block of $\hat{f}_{t-1}$ centered at $(x - v_x, y - v_y)$. Applying this to every block yields the complete extended frame $\hat{f}_t$, which is the output of the motion extension operation.
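Both motion synthesis modes can be sketched with plain block matching. The following minimal numpy illustration assumes exhaustive-search block matching as the motion compensation operation and, for interpolation, a halved two-frame motion vector; neither detail is fixed by the text above:

    import numpy as np

    def block(frame, x, y, m):
        # m x m block of `frame` centered at (x, y); assumed to lie in range
        h = m // 2
        return frame[y - h:y - h + m, x - h:x - h + m]

    def motion_vector(ref, cur, x, y, m=8, search=4):
        # Exhaustive block matching, one possible motion compensation operation:
        # the (vx, vy) minimizing SAD between cur's block at (x, y) and ref's
        # block at (x - vx, y - vy).
        target = block(cur, x, y, m).astype(np.float64)
        best_cost, best = np.inf, (0, 0)
        for vy in range(-search, search + 1):
            for vx in range(-search, search + 1):
                cost = np.abs(block(ref, x - vx, y - vy, m) - target).sum()
                if cost < best_cost:
                    best_cost, best = cost, (vx, vy)
        return best

    def synthesize_frame(ref_a, ref_b, mode, m=8, search=4):
        # mode "extend": ref_a, ref_b = f_{t-2}, f_{t-1}; the estimated motion is
        #   carried one step forward, copying ref_b's block at (x-vx, y-vy).
        # mode "interpolate": ref_a, ref_b = f_{t-1}, f_{t+1}; the two-frame motion
        #   is halved (an assumption) to land midway between the references.
        H, W = ref_b.shape
        out = np.zeros_like(ref_b)
        h = m // 2
        for y in range(h + search, H - h - search, m):  # margins keep blocks in range
            for x in range(h + search, W - h - search, m):
                vx, vy = motion_vector(ref_a, ref_b, x, y, m, search)
                if mode == "extend":
                    src, sx, sy = ref_b, x - vx, y - vy
                else:
                    src, sx, sy = ref_a, x - vx // 2, y - vy // 2
                out[y - h:y - h + m, x - h:x - h + m] = block(src, sx, sy, m)
        return out

For example, synthesize_frame(f_prev2, f_prev1, "extend") plays the role of motion extension, while synthesize_frame(f_prev, f_next, "interpolate") plays the role of motion interpolation.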
2. Hybrid prediction.
The hybrid prediction comprises convolution and convolutional LSTM structures. Depending on the motion synthesis mode (in fig. 2, the motion extension mode, so the synthesized frame is an extended frame), its inputs are the interpolated frame together with the frames before and after it ($\hat{f}_{t-1}$ and $\hat{f}_{t+1}$), or the extended frame together with the two frames preceding it ($\hat{f}_{t-2}$ and $\hat{f}_{t-1}$), plus the already-decoded blocks above and to the left of the current coding block in the current frame. By learning to model the video's space-time domain information, it generates the predicted value of the current coding block in the current frame; through iterative computation, one coding block's predicted value is generated at a time in left-to-right, top-to-bottom order, and the predictions are finally stitched into the whole frame.
As shown in fig. 2, assume the motion extension coding mode is used: the two frames preceding the extended frame ($\hat{f}_{t-2}$ and $\hat{f}_{t-1}$) and the already-decoded blocks above and to the left of the current coding block in the current frame (each frame is coded and decoded in top-to-bottom, left-to-right order) serve as input. In the motion interpolation mode, the inputs are instead the frames before and after the interpolated frame ($\hat{f}_{t-1}$ and $\hat{f}_{t+1}$) together with the already-decoded blocks above and to the left of the current coding block. The hybrid prediction generates the predicted value of the current coding block by learning to model the video's space-time domain information; through iterative computation, one predicted value is generated at a time in top-to-bottom, left-to-right order, and the whole frame is finally stitched together. In the embodiment of the invention, the predicted value output by the space-time domain prediction network is subtracted from the original value to form the residual, which is iteratively coded by the iterative analyzer and the iterative synthesizer; the optimization target of the space-time domain prediction network is:

$$\min \sum_{i=1}^{B} \sum_{j=1}^{J} \left\| x_j^{(i)} - \hat{x}_j^{(i)} \right\|^2,$$

where $B$ is the total number of frames involved in the optimization, $J$ is the total number of coding blocks per frame in the reconstructed video content, and $x_j^{(i)}$ and $\hat{x}_j^{(i)}$ are the original and predicted values of the $j$-th coding block in the $i$-th frame.
In the embodiment of the invention, this optimization target serves as the loss function: the role of the space-time domain prediction network is to generate predicted values that are close to the original values.
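To make the structure concrete, here is a minimal PyTorch sketch of such a conv + ConvLSTM hybrid predictor trained with the squared-error objective above. The channel counts, the input packing (three reference planes plus one causal-context plane), and the single-cell design are illustrative assumptions, not the patent's architecture:

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        # Standard convolutional LSTM cell: one convolution over the
        # concatenated input and hidden state produces the four gates.
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

        def forward(self, x, state):
            h, c = state
            i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c

    class HybridPredictor(nn.Module):
        # Convolutional features of the motion-synthesized frame, its reference
        # frames, and the causal context (already-decoded blocks above/left of
        # the current block) are fused by a ConvLSTM and mapped to a prediction.
        def __init__(self, ch=32):
            super().__init__()
            self.feat = nn.Conv2d(4, ch, 3, padding=1)  # 3 reference planes + context
            self.lstm = ConvLSTMCell(ch, ch)
            self.out = nn.Conv2d(ch, 1, 3, padding=1)

        def forward(self, refs, context, state=None):
            x = torch.relu(self.feat(torch.cat([refs, context], dim=1)))
            if state is None:
                state = (torch.zeros_like(x), torch.zeros_like(x))
            h, c = self.lstm(x, state)
            return self.out(h), (h, c)

    model = HybridPredictor()
    refs = torch.rand(1, 3, 64, 64)     # synthesized frame + two reference frames
    context = torch.rand(1, 1, 64, 64)  # current frame, decoded blocks only
    target = torch.rand(1, 1, 64, 64)   # original values
    pred, state = model(refs, context)
    loss = torch.mean((pred - target) ** 2)  # the squared-error objective above

The recurrent state lets the predictor carry information across the block-by-block iteration described above.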
In the embodiment of the invention, the iterative analyzer and the iterative synthesizer comprise $S$ coding stages built from $S$ convolution-based auto-encoders; the reconstructed value and the target value are continuously, iteratively analyzed and synthesized to realize a variable compression ratio. Each stage of the iterative analyzer generates a compressed representation of its input residual, which is quantized to form the output bitstream. The optimization target of the iterative analyzer and the iterative synthesizer is:

$$\min_{\theta} \sum_{n=1}^{S} \left\| r_n - \mathcal{S}_n\big(\mathcal{A}_n(r_n)\big) \right\|, \qquad r_{n+1} = r_n - \mathcal{S}_n\big(\mathcal{A}_n(r_n)\big),$$

where $r_1$ is the residual input at the initial (first) stage, $r_n$ denotes the residual input at the $n$-th stage, and $\mathcal{A}_n(r_n)$ denotes the output of the $n$-th stage (the compressed representation of that stage's input residual).
In the embodiment of the invention, the iterative analyzer and the iterative synthesizer are jointly optimized: the parameter set $\theta$ in the formula collects all the parameters of both the iterative analyzer and the iterative synthesizer.
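A minimal PyTorch sketch of this progressive residual loop follows. Each stage is a small convolutional auto-encoder standing in for the patent's conv + ConvLSTM stages, and the binarizer is a plain sign function (a straight-through estimator would replace it during training); all sizes are illustrative assumptions:

    import torch
    import torch.nn as nn

    class Stage(nn.Module):
        # One analyze/synthesize pair; the patent's stages also contain
        # recurrent (ConvLSTM) structure, omitted here for brevity.
        def __init__(self, ch=32, code_ch=8):
            super().__init__()
            self.analyze = nn.Sequential(
                nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(ch, code_ch, 3, padding=1), nn.Tanh())
            self.synthesize = nn.Sequential(
                nn.ConvTranspose2d(code_ch, ch, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(ch, 1, 3, padding=1))

    def iterative_code(residual, stages):
        # Each stage codes what the previous stages failed to reconstruct;
        # stopping after n stages yields an n-stage (lower-rate) bitstream.
        bits, r = [], residual
        for stage in stages:
            b = torch.sign(stage.analyze(r))  # binarized compressed representation
            bits.append(b)                    # to be entropy-coded into the stream
            r = r - stage.synthesize(b)       # residual left for the next stage
        return bits, r                        # r is the distortion still uncoded

    stages = [Stage() for _ in range(4)]  # S = 4 stages (an arbitrary choice)
    residual = torch.randn(1, 1, 64, 64)  # x - x_hat from the prediction network
    bits, remaining = iterative_code(residual, stages)

Training would jointly minimize the norm of each stage's reconstruction error over all parameters, matching the objective above; truncating the bitstream after fewer stages trades distortion for rate.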
The scheme provided by the embodiment of the invention addresses the difficulty of realizing motion prediction through integrated training within a neural network: it proposes VoxelCNN to jointly model the space-time domain priors of video content and integrates the iterative analyzer/synthesizer, the binarizer, and the entropy encoder/decoder to realize learning-based video coding and decoding. In experimental verification, even without the entropy encoder/decoder, the method outperforms an MPEG-2 standard encoder and achieves performance similar to H.264.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A learning-based video coding/decoding system, comprising an encoding end and a decoding end, wherein the encoding end includes a space-time domain reconstruction memory, a space-time domain prediction network, an iterative analyzer, an iterative synthesizer, a binarizer, an entropy encoder and an entropy decoder;
the space-time domain reconstruction memory is used for storing video content that has been encoded, decoded and reconstructed;
the space-time domain prediction network is used for exploiting the space-time correlation of the reconstructed video content, modeling it with convolutional and recurrent neural networks, and outputting a predicted value for the current coding block;
the iterative analyzer comprises convolutional and recurrent neural network structures; its input is the residual formed by subtracting the predicted value output by the space-time domain prediction network from the original value, and its output is a compressed representation of the residual;
the iterative synthesizer comprises convolutional and recurrent neural network structures; it receives the compressed representation of the residual decoded by the entropy decoder and superimposes the predicted value output by the space-time domain prediction network to form the reconstructed video content;
the iterative analyzer and the iterative synthesizer, in cooperation with the entropy encoder and the entropy decoder, encode and decode the input residual stage by stage, gradually reducing the residual distortion as the bitstream grows, thereby realizing coding at different distortion levels under both high and low bitrate conditions;
the binarizer is used for converting the output of the iterative analyzer into a binary representation;
the entropy encoder is used for entropy-coding the quantized coded output to obtain the output bitstream;
the entropy decoder is used for entropy-decoding the output bitstream and passing the result to the iterative synthesizer;
the space-time domain prediction network calculates the predicted value of a coding block through two processes, motion synthesis and hybrid prediction, wherein:
motion interpolation obtains the object motion trajectory from two adjacent frames of the reconstructed video content and interpolates it between the two adjacent frames to form an interpolated frame, while motion extension obtains the object motion trajectory from the first two frames of the reconstructed video content and extrapolates it to obtain an extended frame;
the hybrid prediction comprises convolution and convolutional LSTM structures; taking as input the interpolated or extended frame, the frames before and after the interpolated frame or the two frames preceding the extended frame, and the already-decoded blocks above and to the left of the current coding block in the current frame, it generates the predicted value of the current coding block in the current frame by learning to model the video's space-time domain information; the predicted value of each coding block is finally obtained through iterative calculation.
2. The learning-based video coding and decoding system according to claim 1, wherein the space-time domain reconstruction memory, the space-time domain prediction network, the iterative synthesizer and the entropy decoder constitute a decoder inside the encoding end.
3. The learning-based video coding and decoding system of claim 1, wherein the motion interpolation process is as follows: let the interpolated frame be $\hat{f}_t$ and denote the two adjacent frames in the reconstructed video content by $\hat{f}_{t-1}$ and $\hat{f}_{t+1}$; with block size $m$, a motion compensation operation determines the motion vector $(v_x, v_y)$ of the block centered at coordinates $(x, y)$; the coding block of $\hat{f}_t$ centered at $(x, y)$ is assigned a copy of the motion-compensated block of $\hat{f}_{t-1}$ centered at $(x - v_x, y - v_y)$, whereby the complete interpolated frame $\hat{f}_t$ is obtained.
4. The learning-based video codec system of claim 1, wherein the motion extension process is as follows: over the first two frames of the reconstructed video content, $\hat{f}_{t-2}$ and $\hat{f}_{t-1}$, a motion compensation operation with coding block size $m$ determines the motion vector $(v_x, v_y)$ of the coding block centered at coordinates $(x, y)$; the coding block of the extended frame $\hat{f}_t$ centered at $(x, y)$ is assigned a copy of the block of $\hat{f}_{t-1}$ centered at $(x - v_x, y - v_y)$; in this way, the complete extended frame $\hat{f}_t$ is obtained.
5. The learning-based video coding and decoding system of claim 1, wherein the predicted value output by the space-time domain prediction network is subtracted from the original value to form a residual, which is iteratively coded by the iterative analyzer and the iterative synthesizer in cooperation with the entropy encoder and the entropy decoder, and the optimization target of the space-time domain prediction network is:

$$\min \sum_{i=1}^{B} \sum_{j=1}^{J} \left\| x_j^{(i)} - \hat{x}_j^{(i)} \right\|^2,$$

where $B$ is the total number of frames involved in the optimization, $J$ is the total number of coding blocks per frame in the reconstructed video content, and $x_j^{(i)}$ and $\hat{x}_j^{(i)}$ are the original and predicted values of the $j$-th coding block in the $i$-th frame.
6. The learning-based video coding and decoding system according to claim 5, wherein the iterative analyzer and the iterative synthesizer comprise $S$ coding stages built from $S$ convolution-based auto-encoders; the reconstructed value (the reconstructed video content) and the target value are continuously, iteratively analyzed and synthesized to realize a variable compression ratio; each stage of the iterative analyzer generates a compressed representation of its input residual, which is quantized to form the output bitstream; and the optimization target of the iterative analyzer and the iterative synthesizer is:

$$\min_{\theta} \sum_{n=1}^{S} \left\| r_n - \mathcal{S}_n\big(\mathcal{A}_n(r_n)\big) \right\|, \qquad r_{n+1} = r_n - \mathcal{S}_n\big(\mathcal{A}_n(r_n)\big),$$

where $r_1$ is the residual input at the initial stage, $r_n$ denotes the residual input at the $n$-th stage, and $\mathcal{A}_n(r_n)$ denotes the output of the $n$-th stage iterative analyzer.
CN201810064012.9A 2018-01-23 2018-01-23 Video coding and decoding system based on learning Active CN108174218B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810064012.9A CN108174218B (en) 2018-01-23 2018-01-23 Video coding and decoding system based on learning

Publications (2)

Publication Number Publication Date
CN108174218A CN108174218A (en) 2018-06-15
CN108174218B (en) 2020-02-07

Family

ID=62515681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810064012.9A Active CN108174218B (en) 2018-01-23 2018-01-23 Video coding and decoding system based on learning

Country Status (1)

Country Link
CN (1) CN108174218B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109451308B (en) 2018-11-29 2021-03-09 北京市商汤科技开发有限公司 Video compression processing method and device, electronic equipment and storage medium
CN110493596B (en) * 2019-09-02 2021-09-17 西北工业大学 Video coding system and method based on neural network
CN111222532B (en) * 2019-10-23 2024-04-02 西安交通大学 Training method for edge cloud collaborative deep learning model with classification precision maintenance and bandwidth protection
CN111050174A (en) * 2019-12-27 2020-04-21 清华大学 Image compression method, device and system
CN111669601B (en) * 2020-05-21 2022-02-08 天津大学 Intelligent multi-domain joint prediction coding method and device for 3D video
CN111898638B (en) * 2020-06-29 2022-12-02 北京大学 Image processing method, electronic device and medium fusing different visual tasks
CN115118972A (en) * 2021-03-17 2022-09-27 华为技术有限公司 Video image coding and decoding method and related equipment
CN113473149A (en) * 2021-05-14 2021-10-01 北京邮电大学 Semantic channel joint coding method and device for wireless image transmission


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1857001A (en) * 2003-05-20 2006-11-01 Amt先进多媒体科技公司 Hybrid video compression method
CN105163121A (en) * 2015-08-24 2015-12-16 西安电子科技大学 Large-compression-ratio satellite remote sensing image compression method based on deep self-encoding network
CN105430415A (en) * 2015-12-02 2016-03-23 宁波大学 Fast intraframe coding method of 3D-HEVC depth videos
CN107105278A (en) * 2017-04-21 2017-08-29 中国科学技术大学 Video coding and decoding framework with automatically generated motion vectors

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
An End-to-End Compression Framework Based on Convolutional Neural Networks; Feng Jiang et al.; IEEE Transactions on Circuits and Systems for Video Technology; 2017-08-01; entire document *
Pixel Recurrent Neural Networks; A. v. d. Oord et al.; International Conference on Machine Learning; 2016-08-19; pp. 3007-3018 *

Also Published As

Publication number Publication date
CN108174218A (en) 2018-06-15

Similar Documents

Publication Publication Date Title
CN108174218B (en) Video coding and decoding system based on learning
Baig et al. Learning to inpaint for image compression
US8625682B2 (en) Nonlinear, prediction filter for hybrid video compression
CN112866694B (en) Intelligent image compression optimization method combining asymmetric convolution block and condition context
US9609324B2 (en) Image encoding/decoding method and device using coefficients of adaptive interpolation filter
JPH1093972A (en) Outline encoding method
CN111294604B (en) Video compression method based on deep learning
Islam et al. Image compression with recurrent neural network and generalized divisive normalization
US20100158131A1 (en) Iterative dvc decoder based on adaptively weighting of motion side information
Akbari et al. Learned multi-resolution variable-rate image compression with octave-based residual blocks
Zhou et al. Distributed video coding using interval overlapped arithmetic coding
CN105556850A (en) Encoder and decoder, and method of operation
US8594196B2 (en) Spatial Wyner Ziv coding
CN105794116A (en) Encoder, decoder and method of operation using interpolation
CN111343458A (en) Sparse gray image coding and decoding method and system based on reconstructed residual
JPH08116542A (en) Image coder, image decoder and motion vector detector
CN111131834B (en) Reversible self-encoder, encoding and decoding method, image compression method and device
CN112437300B (en) Distributed video coding method based on self-adaptive interval overlapping factor
KR101500300B1 (en) Selective Low-Power Video Codec with Interaction Between Encoder and Decoder, and an Encoding/Decoding Method Thereof
CN1848960A (en) Residual coding in compliance with a video standard using non-standardized vector quantization coder
Wang et al. Transform skip inspired end-to-end compression for screen content image
CN113938687A (en) Multi-reference inter-frame prediction method, system, device and storage medium
CN106791864A (en) A kind of implementation method based on raising video code conversion speed under HEVC standard
WO2022067806A1 (en) Video encoding and decoding methods, encoder, decoder, and storage medium
Tian et al. Effortless cross-platform video codec: A codebook-based method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant