CN113038126B - Multi-description video coding method and decoding method based on frame prediction neural network - Google Patents
- Publication number
- CN113038126B (application CN202110261181.3A)
- Authority
- CN
- China
- Prior art keywords
- odd
- sequence
- frame
- frame sequence
- frames
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/12—Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
- H04N19/122—Selection of transform size, e.g. 8x8 or 2x4x8 DCT; Selection of sub-band transforms of varying structure or type
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/124—Quantisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/13—Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/184—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being bits, e.g. of the compressed video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/625—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/01—Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level
- H04N7/0117—Conversion of standards, e.g. involving analogue television standards or digital television standards processed at pixel level involving conversion of the spatial resolution of the incoming video signal
- H04N7/012—Conversion between an interlaced and a progressive signal
Abstract
The invention discloses a multi-description video coding method and decoding method based on a frame prediction neural network. The input video is split by temporal down-sampling into an odd frame sequence and an even frame sequence, each of which is encoded by an HEVC encoder. To address the frame loss introduced by temporal down-sampling, a frame prediction neural network predicts the missing frames of each sequence. The predicted frames are subtracted from the encoded video frames of the corresponding sequence to obtain residual information, and this residual information together with the coded information of the current sequence forms one description. The code streams of the two descriptions are packed and transmitted to the decoding end over different channels. The multi-description video coding formed by this method gives the code stream a degree of error recovery capability, and the decoding end can make full use of the correlated information between the descriptions to guarantee high-quality video reconstruction under unreliable network transmission.
Description
Technical Field
The invention relates to the field of error compensation, in particular to a multi-description video coding method and a decoding method based on a frame prediction neural network.
Background
In recent years, with the rapid development of multimedia and internet technology, video applications such as telemedicine, video conferencing and distance learning have become widespread, greatly enriching people's lives. At the same time, the demand for ultra-high-resolution (3840 × 2160, 7680 × 4320) and high-frame-rate (120 fps) video keeps growing. As video resolution and frame rate increase, the amount of multimedia data transmitted over the internet has grown explosively. Against this background of high-definition and ultra-high-definition video transmission, the new-generation video coding standard HEVC emerged.
HEVC offers a high compression rate, but its error resilience is still weak. In practical applications, unreliable channels are ubiquitous: channel interference, network congestion, burst errors on wireless channels, and so on. When a video code stream is transmitted over an unreliable channel, packet loss and bit errors occur easily, severely degrading the quality of the video at the receiving end.
Multiple description coding is therefore an effective technique for transmitting video over unreliable networks. It divides a source video into two or more sub-videos, each carrying its own specific information together with protection information for the other descriptions; this protection information provides effective error recovery capability and improves the video quality at the decoding end. As the number of received descriptions increases, the decoded video quality improves. It is therefore worthwhile to study an HEVC multi-description video coding method with fault tolerance.
Disclosure of Invention
The main object of the invention is to design a fault-tolerant multi-description video coding method by combining deep learning, and accordingly to provide a multi-description video coding method based on a frame prediction neural network.
The invention adopts the following technical scheme:
a multi-description video coding method based on a frame prediction neural network is characterized by comprising the following steps:
A1) Divide the input video into an odd frame sequence F_O and an even frame sequence F_E, and encode each with an HEVC encoder to obtain a reconstructed odd frame sequence F'_O and a reconstructed even frame sequence F'_E;
A2) Input the reconstructed odd frame sequence F'_O and even frame sequence F'_E separately into the frame prediction neural network FP-CNN to obtain a predicted even frame sequence F'_EI and a predicted odd frame sequence F'_OI;
A3) Subtract the predicted even frame sequence F'_EI from the reconstructed even frame sequence F'_E to obtain the even residual F_EIR, and subtract the predicted odd frame sequence F'_OI from the reconstructed odd frame sequence F'_O to obtain the odd residual F_OIR;
A4) Residual-code the even residual F_EIR and the odd residual F_OIR to obtain the even residual code stream F_ESI and the odd residual code stream F_OSI respectively;
A5) Pack the reconstructed odd frame sequence F'_O and the even residual code stream F_ESI into description 1, pack the reconstructed even frame sequence F'_E and the odd residual code stream F_OSI into description 2, and transmit the two descriptions to the decoding end over different channels.
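As an illustration of steps A1) and A5), the sketch below shows the temporal split and the cross-packing of the two descriptions in Python. The HEVC encoding, FP-CNN prediction and residual coding between those steps are left out, and all names here are illustrative rather than taken from the patent:

```python
def temporal_downsample(frames):
    """Step A1: split the input video into odd and even frame sequences.

    Frames are numbered from 1, so frames 1, 3, 5, ... form the odd
    sequence F_O and frames 2, 4, 6, ... form the even sequence F_E.
    """
    return frames[0::2], frames[1::2]

def pack_descriptions(recon_odd, even_residual_stream, recon_even, odd_residual_stream):
    """Step A5: description 1 carries the reconstructed odd sequence plus the
    even residual code stream; description 2 carries the reconstructed even
    sequence plus the odd residual code stream."""
    description1 = {"coded_frames": recon_odd, "residual_stream": even_residual_stream}
    description2 = {"coded_frames": recon_even, "residual_stream": odd_residual_stream}
    return description1, description2
```

Each description thus protects the other: if only description 1 arrives, its even residual code stream, combined with the FP-CNN prediction made from the odd frames, recovers the missing even frames.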
Preferably, videos of various scenes are selected and the source video is divided into odd frames and even frames; at several QP settings the sub-videos are encoded and decoded by an unmodified HEVC codec, the coded-then-decoded videos serve as training data, and the original odd or even frames serve as training labels, forming a data set for training the frame prediction neural network FP-CNN.
Preferably, the frame prediction neural network FP-CNN comprises an encoder-decoder, and the output features of the encoder and the decoder at the same scale are linked by skip connections.
Preferably, the input odd frame sequence F'_O and even frame sequence F'_E pass through the encoder and decoder for feature extraction, and the extracted features are fed to four sub-networks; each sub-network densely estimates one quarter of the per-pixel 1-D kernels, and the estimated pixel kernels are locally convolved with two consecutive video frames of the odd frame sequence F'_O or the even frame sequence F'_E to generate the predicted even frame sequence F'_EI or the predicted odd frame sequence F'_OI.
Preferably, the encoder and the decoder are provided with convolutional layers, average pooling layers and bilinear upsampling layers; each sub-network comprises one bilinear upsampling layer and three convolutional layers.
A multi-description video decoding method based on a frame prediction neural network is characterized by comprising the following steps:
B1) The decoding end receives description 1 and description 2 and judges whether any video frame has been lost; if not, the odd and even frames of description 1 and description 2 are interleaved in parity order to obtain the decoded, reconstructed video sequence at full frame rate; if so, go to step B2);
B2) Description 1 and description 2 are decoded by a standard HEVC decoder to obtain the reconstructed odd frame sequence F'_O and even frame sequence F'_E, which are input separately into the frame prediction neural network FP-CNN to obtain the predicted even frame sequence F'_EI and the predicted odd frame sequence F'_OI;
B3) The even residual code stream F_ESI and the odd residual code stream F_OSI are residual-decoded to obtain the decoded even residual F'_ESI and the decoded odd residual F'_OSI;
B4) The missing even frames are obtained from the predicted even frame sequence F'_EI and the decoded even residual F'_ESI, and the missing odd frames from the predicted odd frame sequence F'_OI and the decoded odd residual F'_OSI; the recovered even frames together with the even frame sequence F'_E, and the recovered odd frames together with the odd frame sequence F'_O, are then upsampled in parity order into the reconstructed video sequence.
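A minimal sketch of the recovery path in step B4), treating frames as numbers for clarity: the addition inverts the encoder-side subtraction of step A3), and the interleaving restores the full frame rate. Names are illustrative:

```python
def reconstruct_missing(predicted, decoded_residual):
    """Step B4, first half: missing frame = FP-CNN prediction + decoded
    residual (the inverse of the encoder-side subtraction in step A3)."""
    return [p + r for p, r in zip(predicted, decoded_residual)]

def parity_upsample(odd_frames, even_frames):
    """Step B4, second half: interleave the odd and even sub-sequences back
    into a full-frame-rate sequence (odd frame first)."""
    merged = []
    for o, e in zip(odd_frames, even_frames):
        merged.extend([o, e])
    merged.extend(odd_frames[len(even_frames):])  # trailing odd frame, if any
    return merged
```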
Preferably, videos of various scenes are selected and the source video is divided into odd frames and even frames; at several QP settings the sub-videos are encoded and decoded by an unmodified HEVC codec, the coded-then-decoded videos serve as training data, and the original odd or even frames serve as training labels, forming a data set for training the frame prediction neural network FP-CNN.
Preferably, the frame prediction neural network FP-CNN comprises an encoder-decoder, and the output features of the encoder and the decoder at the same scale are linked by skip connections.
Preferably, the input odd frame sequence F'_O and even frame sequence F'_E pass through the encoder and decoder for feature extraction, and the extracted features are fed to four sub-networks; each sub-network densely estimates one quarter of the per-pixel 1-D kernels, and the estimated pixel kernels are locally convolved with two consecutive video frames of the odd frame sequence F'_O or the even frame sequence F'_E to generate the predicted even frame sequence F'_EI or the predicted odd frame sequence F'_OI.
Preferably, the encoder and the decoder are provided with convolutional layers, average pooling layers and bilinear upsampling layers; each sub-network comprises one bilinear upsampling layer and three convolutional layers.
As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:
1. At the coding end, the method divides the source video by temporal down-sampling into odd frames and even frames, forms two new sequences from them, and encodes each with an HEVC encoder. For the frame loss caused by temporal down-sampling, a Frame Prediction Convolutional Neural Network (FP-CNN) is used to predict the missing frames of each sequence. The predicted frames are subtracted from the encoded video frames of the corresponding sequence to obtain residual information, which together with the coded information of the current sequence forms one description. The code streams of the two descriptions are packed and transmitted to the decoding end over different channels, and the multi-description video coding formed in this way gives the code stream a degree of error recovery capability.
2. When only one description is received, the received frames are reconstructed with an HEVC decoder, the prediction of each lost frame is obtained through the FP-CNN frame prediction neural network and added to the decoded residual information to reconstruct the lost frame, and the frames are restored to the original video frame rate in parity order. When the decoding end receives both descriptions, each is decoded with an HEVC decoder to obtain the corresponding reconstructed frames, and the source frame rate is restored in parity order. The decoding end can thus make full use of the correlated information between the descriptions to guarantee high-quality video reconstruction under unreliable network transmission.
Drawings
FIG. 1 is a flow chart of an encoding method according to the present invention;
FIG. 2 is a flow chart of a decoding method according to the present invention;
FIG. 3 is a diagram of a frame prediction neural network FP-CNN according to the present invention.
The invention is described in further detail below with reference to the figures and specific examples.
Detailed Description
The invention is further described below by means of specific embodiments.
The terms "first," "second," "third," and the like in this disclosure are used solely to distinguish between similar items and not necessarily to describe a particular order or sequence, nor are they to be construed as indicating or implying relative importance. In the description, the directions or positional relationships indicated by "up", "down", "left", "right", "front" and "rear" are used based on the directions or positional relationships shown in the drawings, and are only for convenience of describing the present invention, and do not indicate or imply that the device referred to must have a specific direction, be constructed and operated in a specific direction, and thus, should not be construed as limiting the scope of the present invention. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Referring to fig. 1, a multi-description video coding method based on a frame prediction neural network includes the following steps:
A1) Divide the input video into an odd frame sequence F_O and an even frame sequence F_E, and encode each with an HEVC encoder to obtain a reconstructed odd frame sequence F'_O and a reconstructed even frame sequence F'_E.
A2) Input the reconstructed odd frame sequence F'_O and even frame sequence F'_E separately into the frame prediction neural network FP-CNN to obtain a predicted even frame sequence F'_EI and a predicted odd frame sequence F'_OI.
Half of the video frames in each sub-sequence are missing due to the temporal down-sampling, and a frame prediction module FP-CNN is used to predict the missing video sequence in each sub-sequence.
The structure of the frame prediction neural network FP-CNN is shown in FIG. 3. It comprises an encoder and a decoder, and the output features of the encoder and the decoder at the same scale are linked by skip connections. The encoder and decoder are built from convolutional layers, average pooling layers and bilinear upsampling layers.
The input odd frame sequence F'_O and even frame sequence F'_E pass through the encoder and decoder for feature extraction, and the extracted features are fed to four sub-networks. Each sub-network densely estimates one quarter of the per-pixel 1-D kernels; the estimated pixel kernels are then locally convolved with two consecutive video frames of the odd frame sequence F'_O or the even frame sequence F'_E to generate the predicted even frame sequence F'_EI or the predicted odd frame sequence F'_OI. Each sub-network comprises one bilinear upsampling layer and three convolutional layers.
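To make the per-pixel local convolution concrete, the following NumPy sketch shows the usual separable formulation assumed here (the patent does not spell it out): for every output pixel, a vertical and a horizontal 1-D kernel are estimated per input frame, their outer product forms the 2-D kernel applied to the local patch, and the two per-frame results are summed. All array shapes and the normalisation are illustrative.

```python
import numpy as np

def local_separable_conv(frame_a, frame_b, kv_a, kh_a, kv_b, kh_b):
    """Synthesize one frame from two consecutive frames.

    kv_*/kh_* have shape (H, W, K): one vertical and one horizontal 1-D
    kernel per output pixel and per input frame.  The separable 2-D kernel
    for a pixel is the outer product of kv and kh, applied to the local
    K x K patch of that frame; the two frame contributions are summed.
    """
    H, W = frame_a.shape
    K = kv_a.shape[-1]
    pad = K // 2
    a = np.pad(frame_a, pad, mode="edge")  # replicate borders for full patches
    b = np.pad(frame_b, pad, mode="edge")
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            patch_a = a[y:y + K, x:x + K]
            patch_b = b[y:y + K, x:x + K]
            out[y, x] = (kv_a[y, x] @ patch_a @ kh_a[y, x]
                         + kv_b[y, x] @ patch_b @ kh_b[y, x])
    return out
```

As a sanity check, uniform kernels whose weights sum to one half per frame reduce the operation to averaging the two frames; the trained sub-networks of course produce spatially varying kernels instead.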
A3) Subtract the predicted even frame sequence F'_EI from the reconstructed even frame sequence F'_E to obtain the even residual F_EIR, and subtract the predicted odd frame sequence F'_OI from the reconstructed odd frame sequence F'_O to obtain the odd residual F_OIR.
A4) Residual-code the even residual F_EIR and the odd residual F_OIR to obtain the even residual code stream F_ESI and the odd residual code stream F_OSI respectively. In this step, residual coding divides the input F_EIR and F_OIR into 8 × 8 blocks and applies the DCT transform, quantization and entropy coding to obtain the even residual code stream F_ESI and the odd residual code stream F_OSI.
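The 8 × 8 transform-and-quantize part of the residual coding can be sketched as below. Entropy coding is omitted and a plain uniform quantizer stands in for the real one, so this illustrates the principle only; the function names and the quantization step are assumptions, not the patent's codec.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix used for the 8x8 block transform."""
    m = np.zeros((n, n))
    for k in range(n):
        for i in range(n):
            m[k, i] = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m *= np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)  # DC row scaling for orthonormality
    return m

def code_residual_block(block, q_step, dct=dct_matrix()):
    # Forward path: 2-D DCT of one 8x8 residual block, then uniform scalar
    # quantization (entropy coding of the indices is omitted in this sketch).
    coeffs = dct @ block @ dct.T
    return np.round(coeffs / q_step).astype(int)

def decode_residual_block(indices, q_step, dct=dct_matrix()):
    # Inverse path: dequantize, then inverse 2-D DCT.
    coeffs = indices * q_step
    return dct.T @ coeffs @ dct
```

With a small quantization step the round trip reproduces the residual block up to the quantization error, which is what lets the decoder recover the missing frame from prediction plus residual.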
A5) Pack the reconstructed odd frame sequence F'_O and the even residual code stream F_ESI into description 1, pack the reconstructed even frame sequence F'_E and the odd residual code stream F_OSI into description 2, and transmit the two descriptions to the decoding end over different channels.
In the invention, a data set is used to train and test the frame prediction neural network FP-CNN in advance. Specifically, videos of various scenes are selected and the source video is divided into odd frames and even frames; at several QP settings the sub-videos are encoded and decoded by an unmodified HEVC codec. The coded-then-decoded videos serve as training data and the original odd or even frames serve as training labels, forming a data set for training the frame prediction neural network FP-CNN; the trained network is then used to predict the missing frames.
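The data-set construction just described can be outlined as follows; `hevc_roundtrip` is a hypothetical placeholder for an actual HEVC encode-then-decode pass at quantization parameter `qp`.

```python
def build_training_pairs(video_frames, hevc_roundtrip, qp_values):
    """Build (input, label) pairs for FP-CNN training: the input is a
    coded-then-decoded parity sub-sequence, the label is the original
    frames of the opposite parity."""
    odd, even = video_frames[0::2], video_frames[1::2]
    pairs = []
    for qp in qp_values:
        pairs.append((hevc_roundtrip(odd, qp), even))   # predict even from odd
        pairs.append((hevc_roundtrip(even, qp), odd))   # predict odd from even
    return pairs
```

Training at several QP values exposes the network to the range of compression artifacts it will see at inference time.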
Referring to fig. 2, the present invention further provides a multi-description video decoding method based on a frame prediction neural network, including the following steps:
B1) The decoding end receives description 1 and description 2 and judges whether any video frame has been lost. If not, the odd and even frames of description 1 and description 2 are interleaved in parity order to obtain the decoded, reconstructed video sequence at full frame rate; if so, go to step B2). In this step, description 1 and description 2 are those packed by the above multi-description video coding method based on the frame prediction neural network.
B2) Description 1 and description 2 are decoded by a standard HEVC decoder to obtain the reconstructed odd frame sequence F'_O and even frame sequence F'_E, which are input separately into the frame prediction neural network FP-CNN to obtain the predicted even frame sequence F'_EI and the predicted odd frame sequence F'_OI.
B3) The even residual code stream F_ESI and the odd residual code stream F_OSI are residual-decoded to obtain the decoded even residual F'_ESI and the decoded odd residual F'_OSI.
B4) The missing even frames are obtained from the predicted even frame sequence F'_EI and the decoded even residual F'_ESI, and the missing odd frames from the predicted odd frame sequence F'_OI and the decoded odd residual F'_OSI; the recovered even frames together with the even frame sequence F'_E, and the recovered odd frames together with the odd frame sequence F'_O, are then upsampled in parity order into the reconstructed video sequence.
Videos of various scenes are selected and the source video is divided into odd frames and even frames; at several QP settings the sub-videos are encoded and decoded by an unmodified HEVC codec. The coded-then-decoded videos serve as training data and the original odd or even frames serve as training labels, forming a data set for training the frame prediction neural network FP-CNN; the trained frame prediction neural network FP-CNN is used to predict the lost frames.
The frame prediction neural network FP-CNN here is the same network as above: it comprises an encoder-decoder, and the output features of the encoder and the decoder at the same scale are linked by skip connections.
The input odd frame sequence F'_O and even frame sequence F'_E pass through the encoder and decoder for feature extraction, and the extracted features are fed to four sub-networks. Each sub-network densely estimates one quarter of the per-pixel 1-D kernels; the estimated pixel kernels are then locally convolved with two consecutive video frames of the odd frame sequence F'_O or the even frame sequence F'_E to generate the predicted even frame sequence F'_EI or the predicted odd frame sequence F'_OI.
The encoder and the decoder are provided with convolutional layers, average pooling layers and bilinear upsampling layers; each sub-network comprises one bilinear upsampling layer and three convolutional layers.
The multi-description video coding formed by the method of the invention gives the code stream a degree of error recovery capability, and the decoding end can make full use of the correlated information between the descriptions to guarantee high-quality video reconstruction under unreliable network transmission.
The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification made using this design concept shall fall within the protection scope of the present invention.
Claims (2)
1. A multi-description video coding method based on a frame prediction neural network is characterized by comprising the following steps:
A1) Divide the input video into an odd frame sequence F_O and an even frame sequence F_E, and encode each with an HEVC encoder to obtain a reconstructed odd frame sequence F'_O and a reconstructed even frame sequence F'_E;
A2) Input the reconstructed odd frame sequence F'_O and even frame sequence F'_E separately into the frame prediction neural network FP-CNN to obtain a predicted even frame sequence F'_EI and a predicted odd frame sequence F'_OI;
A3) Subtract the predicted even frame sequence F'_EI from the reconstructed even frame sequence F'_E to obtain the even residual F_EIR, and subtract the predicted odd frame sequence F'_OI from the reconstructed odd frame sequence F'_O to obtain the odd residual F_OIR;
A4) Residual-code the even residual F_EIR and the odd residual F_OIR to obtain the even residual code stream F_ESI and the odd residual code stream F_OSI respectively;
A5) Pack the reconstructed odd frame sequence F'_O and the even residual code stream F_ESI into description 1, pack the reconstructed even frame sequence F'_E and the odd residual code stream F_OSI into description 2, and transmit the two descriptions to the decoding end over different channels;
videos of various scenes are selected and the source video is divided into odd frames and even frames; at several QP settings the sub-videos are encoded and decoded by an unmodified HEVC codec, the coded-then-decoded videos serve as training data, and the original odd or even frames serve as training labels, forming a data set for training the frame prediction neural network FP-CNN; the frame prediction neural network FP-CNN comprises an encoder-decoder, and the output features of the encoder and the decoder at the same scale are linked by skip connections;
the input odd frame sequence F'_O and even frame sequence F'_E pass through the encoder and decoder for feature extraction, and the extracted features are fed to four sub-networks; each sub-network densely estimates one quarter of the per-pixel 1-D kernels, and the estimated pixel kernels are locally convolved with two consecutive video frames of the odd frame sequence F'_O or the even frame sequence F'_E to generate the predicted even frame sequence F'_EI or the predicted odd frame sequence F'_OI;
the encoder and the decoder are provided with convolutional layers, average pooling layers and bilinear upsampling layers; each sub-network comprises one bilinear upsampling layer and three convolutional layers.
2. A multi-description video decoding method based on a frame prediction neural network is characterized by comprising the following steps:
b1 The decoder receives the description 1 and the description 2, judges whether a lost video frame is generated, and if not, samples the odd frame and the even frame of the description 1 and the description 2 according to the sequence of the odd frame and the even frame to obtain a decoding reconstruction video sequence of the full frame rate; if yes, entering a step B2);
b2 Description 1 and description 2 are decoded by an HEVC standard decoder to obtain a reconstructed odd frame sequence F'OAnd even frame sequence F'ESequence F 'of reconstructed odd frames'OAnd even frame sequence F'ERespectively inputting a frame to predict a neural network FP-CNN to obtain a predicted even frame sequence F'EIAnd predictedOdd frame sequence F'OI;
B3 Even residual F)ESISum odd residual FOSIRespectively obtaining even residual errors F 'after residual errors are decoded'ESIAnd decoded odd residual F'OSI;
B4 Utilizing predicted even frame sequence F'EIAnd even residual F'ESIObtaining missing even frames, utilizing a sequence of predicted odd frames F'OIAnd even residual F'OSIObtaining missing odd frames, and reusing even frames and even frame sequence F'EOdd and odd frame sequence F'OThe video sequence is up-sampled and reconstructed according to the parity frame sequence;
selecting videos of various scenes, dividing a source video into odd frames and even frames, coding and decoding the videos through an original HEVC (high efficiency video coding) coder when different QP (quantization parameter) values are set, taking the coded and decoded videos as training data, and taking the original odd frames or even frames as training labels to form a data set for training the frame prediction neural network FP-CNN;
the frame prediction neural network FP-CNN comprises an encoder-decoder, and the output characteristics of the encoder and the output characteristics of the decoder with the same scale adopt a skip connection mode; the input odd frame sequence F'OAnd even frame sequence F'EExtracting features through an encoder and a decoder, wherein the extracted features are provided for four sub-networks;
estimating 1/4 of output pixels by using a 1-dimensional kernel in a dense pixel mode for each sub-network, and then enabling the estimated pixel kernel to be connected with an odd frame sequence F'OOr even frame sequence F'ELocally convolving consecutive two-frame video frames to generate the predicted even frame sequence F'EIOr predicted sequence of odd frames F'OI;
The encoder and the decoder are built from convolutional layers, average pooling layers, and bilinear upsampling layers; each sub-network comprises one bilinear upsampling layer and three convolutional layers.
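As one concrete (assumed) reading of the kernel-prediction step, each sub-network outputs, per output pixel, a pair of 1-D kernels for each of the two input frames; the output pixel is the sum of the two patches filtered by the separable (outer-product) kernel, in the style of adaptive separable convolution. A minimal numpy sketch, with all shapes and names assumed rather than taken from the patent:

```python
import numpy as np

def local_separable_conv(frame_a, frame_b, kv_a, kh_a, kv_b, kh_b):
    """Synthesize one predicted frame from two consecutive input frames.

    frame_a, frame_b : (H, W) grayscale frames.
    kv_*, kh_*       : (H, W, K) per-pixel vertical / horizontal 1-D kernels
                       estimated by a sub-network for the matching frame.
    The separable 2-D kernel at pixel (y, x) is the outer product
    kv[y, x] * kh[y, x]^T, applied locally to the K-by-K patch around (y, x).
    """
    H, W = frame_a.shape
    K = kv_a.shape[-1]
    pad = K // 2
    out = np.zeros((H, W))
    for frame, kv, kh in ((frame_a, kv_a, kh_a), (frame_b, kv_b, kh_b)):
        padded = np.pad(frame, pad, mode="edge")
        for y in range(H):
            for x in range(W):
                patch = padded[y:y + K, x:x + K]
                # kv @ patch @ kh == sum over the patch weighted by the
                # outer product of the two 1-D kernels
                out[y, x] += kv[y, x] @ patch @ kh[y, x]
    return out
```

Predicting two 1-D kernels per pixel instead of a full K-by-K kernel is what keeps the per-pixel estimation dense yet cheap (2K values rather than K^2).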
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110261181.3A CN113038126B (en) | 2021-03-10 | 2021-03-10 | Multi-description video coding method and decoding method based on frame prediction neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113038126A CN113038126A (en) | 2021-06-25 |
CN113038126B true CN113038126B (en) | 2022-11-01 |
Family
ID=76469255
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110261181.3A Active CN113038126B (en) | 2021-03-10 | 2021-03-10 | Multi-description video coding method and decoding method based on frame prediction neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113038126B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113452944B (en) * | 2021-08-31 | 2021-11-02 | 江苏北弓智能科技有限公司 | Picture display method of cloud mobile phone |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8391370B1 (en) * | 2009-03-11 | 2013-03-05 | Hewlett-Packard Development Company, L.P. | Decoding video data |
CN103501441A (en) * | 2013-09-11 | 2014-01-08 | 北京交通大学长三角研究院 | Multiple-description video coding method based on human visual system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102630012B (en) * | 2012-03-30 | 2014-09-03 | 北京交通大学 | Coding and decoding method, device and system based on multiple description videos |
Non-Patent Citations (2)
Title |
---|
Multiple description coding for multi-view video with adaptive redundancy allocation; Jing Chen et al.; 2017 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS); 2018-01-22; full text * |
Frame-based multiple description video coding and its error concealment; Li Jinxiang et al.; Acta Photonica Sinica; 2010-05-15 (No. 05); full text * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101189882B (en) | Method and apparatus for encoder assisted-frame rate up conversion (EA-FRUC) for video compression | |
CN103139559B (en) | Multi-media signal transmission method and device | |
KR101425602B1 (en) | Method and apparatus for encoding/decoding image | |
KR100714689B1 (en) | Method for multi-layer based scalable video coding and decoding, and apparatus for the same | |
JP5014989B2 (en) | Frame compression method, video coding method, frame restoration method, video decoding method, video encoder, video decoder, and recording medium using base layer | |
CN103220508B (en) | Coding and decoding method and device | |
CN100512446C (en) | A multi-description video encoding and decoding method based on self-adapted time domain sub-sampling | |
JP2008527902A (en) | Adaptive entropy coding and decoding method and apparatus for stretchable coding | |
CN101573883A (en) | Systems and methods for signaling and performing temporal level switching in scalable video coding | |
CN107995493B (en) | Multi-description video coding method of panoramic video | |
KR20060063613A (en) | Method for scalably encoding and decoding video signal | |
CN100455020C (en) | Screen coding method under low code rate | |
JP2007520149A (en) | Scalable video coding apparatus and method for providing scalability from an encoder unit | |
CN103098471A (en) | Method and apparatus of layered encoding/decoding a picture | |
CN102630012B (en) | Coding and decoding method, device and system based on multiple description videos | |
KR20060063605A (en) | Method and apparatus for encoding video signal, and transmitting and decoding the encoded data | |
CN102438152B (en) | Scalable video coding (SVC) fault-tolerant transmission method, coder, device and system | |
CN113132735A (en) | Video coding method based on video frame generation | |
CN113038126B (en) | Multi-description video coding method and decoding method based on frame prediction neural network | |
CN103139571A (en) | Video fault-tolerant error-resisting method based on combination of forward error correction (FEC) and WZ encoding and decoding | |
CN112532908B (en) | Video image transmission method, sending equipment, video call method and equipment | |
CN1672421A (en) | Method and apparatus for performing multiple description motion compensation using hybrid predictive codes | |
CN104363454A (en) | Method and system for video coding and decoding of high-bit-rate images | |
CN111510721B (en) | Multi-description coding high-quality edge reconstruction method based on spatial downsampling | |
US20130223529A1 (en) | Scalable Video Encoding Using a Hierarchical Epitome |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||