CN113132735A - Video coding method based on video frame generation - Google Patents

Video coding method based on video frame generation

Info

Publication number
CN113132735A
Authority
CN
China
Prior art keywords
frame
neural network
target
video
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911392082.8A
Other languages
Chinese (zh)
Inventor
刘家瑛
段凌宇
夏思烽
杨文瀚
胡越予
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN201911392082.8A
Publication of CN113132735A
Legal status: Pending

Classifications

    • H04N 19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • G06N 3/045 Combinations of networks
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • H04N 19/124 Quantisation
    • H04N 19/136 Incoming video signal characteristics or properties
    • H04N 19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a video coding method based on video frame generation, which comprises the following steps. Training a neural network: at each training iteration, two frames I_k and I_t of a video segment are extracted from the sample video training set and fed into the neural network to generate a prediction Î_t of I_t; the L1 norm between Î_t and the target frame I_t is computed and back-propagated through the neural network until the network converges. Encoding stage: the encoding end uses the neural network to extract a sparse motion representation between the encoded reference frame and the target non-key frame to be encoded and generates a predicted frame; the predicted frame is added to the reference frame list for inter-frame prediction, and the inter-frame prediction information and the sparse motion representation are then sent to the decoding end. Decoding stage: the decoding end estimates the dense motion information of the target frame from the reconstructed reference frame and the transmitted sparse motion representation and generates the target frame; the generated target frame is then added to the reference frame list and the target frame is reconstructed using the inter-frame prediction information.

Description

Video coding method based on video frame generation
Technical Field
The invention belongs to the field of video coding, relates to the inter-frame prediction module of video coding, and in particular relates to a video coding method based on video frame generation that can be used to improve the video compression rate.
Background
In the use and transmission of digital video, video coding and decoding are indispensable key technologies. By compressing the video at the encoding end and reconstructing it at the decoding end, video coding technology greatly reduces the storage and transmission cost of digital video, making its everyday use practical. The inter-frame prediction module is the key means by which video coding exploits redundant information between frames to improve the compression rate.
In the inter-frame prediction module, the encoder searches the already encoded video frames for an encoded reference block similar to the current block to be encoded. Given such a reference block, the encoder only needs to encode the residual between the block to be encoded and the reference block together with the corresponding block motion offset, rather than the complete content of the block, which saves bits and improves the compression rate. However, because the reference block search is performed in units of image blocks, when there is a pixel-level motion offset between the reference block and the block to be encoded (e.g., rotation or distortion), the encoder cannot obtain a good prediction of the block, and inter-coding performance suffers.
Disclosure of Invention
Against this technical background, the invention provides an inter-frame prediction method based on video frame generation, which generates a prediction of the target frame to be encoded from already encoded video frames. Because the same operation is also performed at the decoding end, a reference that is very close at the pixel level can be provided for the temporal prediction of the target frame, improving the removal of temporal redundancy.
To ensure the quality of the generated frame, the motion between frames must be modelled well. An intuitive idea is to extract dense motion information between the reference frame and the target frame and transmit it to the decoding end for target frame generation. However, transmitting dense motion information itself adds substantial bit overhead. Therefore, another core of the invention is to extract a sparse representation of inter-frame motion and to model the inter-frame motion from this sparse representation, achieving good target frame generation at low cost. The back-of-the-envelope comparison below illustrates the difference in overhead.
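The following sketch compares the raw size of a dense flow field with that of a small set of motion key points. The resolution, bit depth and number of key points are illustrative assumptions, not values fixed by the invention.

```python
# Illustrative back-of-the-envelope comparison (assumed numbers, not from the patent):
# transmitting a dense optical-flow field versus a sparse set of motion key points.
H, W = 1080, 1920          # frame resolution (assumption)
bits_per_value = 16        # half-precision per flow component (assumption)

dense_flow_bits = H * W * 2 * bits_per_value         # 2 components (dx, dy) per pixel
num_keypoints = 10                                    # sparse representation size (assumption)
sparse_bits = num_keypoints * 2 * bits_per_value      # (x, y) per key point

print(f"dense flow : {dense_flow_bits / 8 / 1024:.1f} KiB per frame")
print(f"sparse rep : {sparse_bits / 8:.0f} bytes per frame")
```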
The technical scheme of the invention is as follows:
a video encoding method based on video frame generation, comprising the steps of:
1) training a neural network: at each training iteration, two frames I_k and I_t of a video segment are extracted from the sample video training set as the source frame and the target frame, respectively, and fed into the neural network; the network first extracts a sparse motion representation between I_t and I_k, then estimates the dense optical flow ξ_{k→t} from I_k to I_t based on the extracted sparse motion representation, extracts content features from the source frame I_k, and transforms these content features according to ξ_{k→t} to generate a prediction Î_t of I_t; the L1 norm between Î_t and the target frame I_t is computed and back-propagated to the neural network to update its weights, until the neural network converges;
2) encoding stage: the encoding end uses the trained neural network to extract a sparse motion representation between the encoded reference frame and the target non-key frame to be encoded; a predicted frame of the target frame is then generated from the extracted sparse motion representation and the reference frame; the generated predicted frame is added to the reference frame list for inter-frame prediction, and the resulting inter-frame prediction information and the sparse motion representation are sent to the decoding end;
3) decoding stage: the decoding end estimates the dense motion information of the target frame from the reconstructed reference frame and the transmitted sparse motion representation and generates the target frame; the generated target frame is then added to the reference frame list and the target frame is reconstructed using the transmitted inter-frame prediction information.
Further, the neural network obtains the sparse motion representation by a forward computation over the input source frame and target frame.
Further, the frame sequence number of the source frame is smaller than that of the target frame.
Further, at each training iteration, the first frame of the video segment is extracted as the source frame and another frame randomly selected from the segment is used as the target frame.
Further, the sparse motion representation is quantized and losslessly compressed before being transmitted to the decoding end.
Further, in the encoding stage, the encoder first divides the frames of the video sequence to be encoded into key frames and non-key frames; key frames are still compressed and reconstructed with the conventional coding method, while non-key frames are encoded and decoded using steps 2) and 3), as illustrated by the simple split sketched below.
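A minimal sketch of the key-frame/non-key-frame split follows. The fixed GOP size of 8 is an assumption for illustration; the invention only requires that the encoder distinguish the two categories.

```python
def split_key_and_non_key(frame_indices, gop_size=8):
    """Split frame indices into key frames and non-key frames.

    The GOP size is an illustrative assumption; the patent only states that
    the encoder distinguishes key frames from non-key frames.
    """
    key, non_key = [], []
    for idx in frame_indices:
        (key if idx % gop_size == 0 else non_key).append(idx)
    return key, non_key

keys, non_keys = split_key_and_non_key(range(17), gop_size=8)
print(keys)      # [0, 8, 16] -> coded with the conventional codec path
print(non_keys)  # remaining frames -> coded with the generated reference frame
```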
A target frame generation network training generation method comprises the following steps:
1) selecting a sample video training set;
2) at each training iteration, two frames I_k and I_t of a video segment are extracted from the sample video training set as the source frame and the target frame, respectively, and fed into the neural network to extract the sparse motion representation between I_t and I_k;
3) the neural network estimates the dense optical flow ξ_{k→t} from I_k to I_t based on the extracted sparse motion representation, and extracts content features from the source frame I_k;
4) the extracted content features are transformed according to the dense optical flow ξ_{k→t} to generate a prediction Î_t of I_t;
5) the L1 norm between Î_t and the target frame I_t is computed and back-propagated to the neural network to update its weights, until the neural network converges; the converged neural network serves as the target frame generation network.
The invention adds a new inter-frame prediction reference within the existing video coding framework and designs two convolutional neural networks, one for sparse motion representation and one for target frame generation, to produce a better inter-frame prediction reference.
The main steps of the method are described next. The method comprises two parts: the training of the sparse motion representation and target frame generation networks, and the newly designed video coding framework.
The training of the network is described first; the network structure is shown in Fig. 1. For each training sample, the inter-frame sparse motion representation is extracted first. Another set of networks then extracts content features from the input source video frame, while dense inter-frame motion information is estimated from the sparse motion representation. Finally, the transformation from the source frame to the target frame is guided by the estimated dense motion information, producing the generated target frame. A minimal sketch of this forward pass, under assumed layer sizes, is given below; the network training process follows it.
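The PyTorch sketch below illustrates one possible realisation of the pipeline just described: a sparse motion extractor over the frame pair, a content encoder over the source frame, a dense motion estimator driven by the sparse representation, and warping followed by decoding. The layer sizes, the number of key values K and the use of grid_sample for warping are assumptions made for illustration; the patent does not fix the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameGenerationNet(nn.Module):
    """Minimal sketch of the two-part generation network described above.
    Layer sizes, K and the warping operator are illustrative assumptions."""

    def __init__(self, k=10):
        super().__init__()
        # Sparse motion extractor: maps a (source, target) pair to K pairs of values.
        self.sparse_extractor = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, k * 2))
        # Content encoder: extracts features from the source frame only.
        self.content_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # Dense motion estimator: expands the sparse representation (plus the source
        # frame) into a per-pixel optical flow field xi_{k->t}.
        self.dense_motion = nn.Sequential(
            nn.Conv2d(3 + 2 * k, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1))
        # Decoder: maps warped content features back to an RGB prediction.
        self.decoder = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, src, tgt):
        b, _, h, w = src.shape
        sparse = self.sparse_extractor(torch.cat([src, tgt], dim=1))    # (B, 2K)
        sparse_map = sparse.view(b, -1, 1, 1).expand(-1, -1, h, w)      # broadcast over pixels
        flow = self.dense_motion(torch.cat([src, sparse_map], dim=1))   # (B, 2, H, W), in pixels
        feats = self.content_encoder(src)
        # Warp the source-frame features with the estimated dense flow.
        yy, xx = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        base_grid = torch.stack([xx, yy], dim=-1).to(src).unsqueeze(0).expand(b, -1, -1, -1)
        norm_flow = torch.stack([flow[:, 0] / (w / 2), flow[:, 1] / (h / 2)], dim=-1)
        warped = F.grid_sample(feats, base_grid + norm_flow, align_corners=True)
        return self.decoder(warped), sparse

net = FrameGenerationNet(k=10)
pred, sparse_rep = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(pred.shape, sparse_rep.shape)   # torch.Size([1, 3, 64, 64]) torch.Size([1, 20])
```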
step 1: extracting video clips of a plurality of continuous frames from a batch of videos to construct training data pairs; each training data pair includes a source frame and a target frame.
Step 2: in each training iteration, a source frame and a target frame are formed by extracting a first frame and a random other frame from each video segment, and a group of two video frames are sent into a network for forward calculation of the network. The first frame is selected as the source frame as the key frame to correspond to the relationship between the key frame and the non-key frame in the actual encoding process, that is, one key frame corresponds to a plurality of subsequent non-key frames in the actual encoding.
And step 3: and 2, obtaining a calculation result, namely a generation result from the source frame to the target frame, and calculating an L1 norm between the generated target frame and the original target frame.
And 4, step 4: and reversely transmitting the calculated norm to each layer of the neural network to update the weight of each layer, so that the result is closer to the target effect in the next iteration.
And 5: repeating steps 1-4 until the L1 norm of the neural network converges.
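A minimal training-loop sketch for these steps is given below. It reuses the FrameGenerationNet sketch from the previous listing and assumes video_clips is a list of tensors shaped (num_frames, 3, H, W); the optimizer, learning rate and iteration count are illustrative choices, not values specified by the invention.

```python
import random
import torch
import torch.nn.functional as F

def train(net, video_clips, iterations=10000, lr=1e-4):
    """Sketch of the training loop: sample a (source, target) pair, generate a
    prediction, compute the L1 loss and back-propagate until convergence."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for step in range(iterations):
        clip = random.choice(video_clips)                  # one video segment
        src = clip[0:1]                                    # first frame = source (key frame)
        tgt = clip[random.randrange(1, len(clip))][None]   # random later frame = target
        pred, _sparse = net(src, tgt)
        loss = F.l1_loss(pred, tgt)                        # L1 norm between prediction and target
        opt.zero_grad()
        loss.backward()                                    # back-propagate to update the weights
        opt.step()
        if step % 100 == 0:
            print(f"iter {step}: L1 = {loss.item():.4f}")
    return net
```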
After the trained network model is obtained, it is applied in the inter-frame prediction module of the encoder. Specifically, at the encoding end, the frames of the video sequence to be encoded are divided into key frames and non-key frames. Key frames are coded with the conventional video coding method. Non-key frames are coded with the video coding method of the invention, which adds an extra generated reference frame on top of the conventional method to improve inter-frame prediction performance. For each non-key frame to be encoded, an already encoded reference frame and the non-key frame are fed into the trained network, which generates a prediction of the non-key frame from the encoded reference frame; the generated prediction is added to the reference frame list of the non-key frame, and the inter-frame prediction process of the conventional encoder is then carried out as usual. When this process finishes, in addition to the inter-frame prediction information, the sparse motion representation extracted by the network between the two frames is additionally stored and transmitted. At the decoder side, dense motion estimation can be performed and the target frame can be generated using the reconstructed reference frame and the transmitted sparse motion representation. The generated result is then added to the reference frame list, and the target frame is reconstructed using the inter-frame prediction information transmitted by the encoder.
Compared with the prior art, the invention has the following positive effects:
the invention constructs a convolutional neural network based on sparse motion sampling and motion generation to generate a reference frame with smaller pixel-level motion offset with a target frame, and adds the generated reference frame into a reference frame list, thereby further compressing inter-frame time domain redundant information and improving the video coding performance. The invention can better model the complex inter-frame motion and obtain better time domain prediction effect in the inter-frame prediction process, thereby further avoiding the coding overhead of the complex motion and better removing the time domain redundancy.
Drawings
Fig. 1 is a diagram of a network architecture of the present invention.
Detailed Description
To explain the technical method further, the video encoding method based on video frame generation is described in detail below with reference to the drawings and a specific example.
This example describes in detail the training of the neural network and the specific encoding process. Suppose the convolutional neural network model of Fig. 1 has been constructed and N video segments {I_1, I_2, …, I_N} are available as training data.
With reference to Fig. 1, the example proceeds as follows:
First, the training process:
Step 1: at each training iteration, extract two frames I_k and I_t of a video clip from the training set {I_1, I_2, …, I_N} as the source frame and the target frame, respectively.
Step 2: the network first extracts a sparse motion representation between I_t and I_k, and then estimates the dense optical flow ξ_{k→t} from I_k to I_t based on this sparse motion representation. At the same time, video frame content features are extracted with the source frame I_k as input (the content features are learned adaptively by the network); the content of I_k is then transformed according to ξ_{k→t} to generate a prediction Î_t of I_t.
A sparse motion representation is one that can characterize the motion between video frames while itself being sparse, that is, it carries little information and therefore costs few bits to encode; a set of selected key points is one example (a possible key-point extractor is sketched after the training steps below).
Step 3: compute the L1 norm between the estimate Î_t from step 2 and the original target frame I_t as the loss function.
Step 4: after obtaining the loss, back-propagate the error through the network to update the network weights.
Step 5: repeat steps 1-4 until the neural network converges.
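As noted above, key points are one convenient sparse motion representation. The sketch below shows an illustrative key-point head that predicts K heatmaps and reduces each to an (x, y) coordinate with a soft-argmax; the motion between two frames could then, for example, be signalled as the difference between their key-point sets. The architecture and the soft-argmax are assumptions for illustration, not the patent's prescribed design.

```python
import torch
import torch.nn as nn

class KeyPointExtractor(nn.Module):
    """Illustrative key-point head: K heatmaps reduced to (x, y) coordinates by a
    soft-argmax, giving a compact, differentiable sparse motion representation."""

    def __init__(self, k=10):
        super().__init__()
        self.heatmap = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, k, 3, padding=1))

    def forward(self, frame):
        b, _, h, w = frame.shape
        hm = self.heatmap(frame).view(b, -1, h * w).softmax(dim=-1).view(b, -1, h, w)
        ys = torch.linspace(-1, 1, h, device=frame.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=frame.device).view(1, 1, 1, w)
        # Expected coordinate under the normalised heatmap = soft-argmax.
        x = (hm * xs).sum(dim=(2, 3))
        y = (hm * ys).sum(dim=(2, 3))
        return torch.stack([x, y], dim=-1)                # (B, K, 2) key-point coordinates

kp = KeyPointExtractor(k=10)
points = kp(torch.rand(1, 3, 64, 64))
print(points.shape)   # torch.Size([1, 10, 2]) -> only 20 values per frame to signal
```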
Second, the encoding process:
After training is finished, yielding the sparse motion representation extraction and target frame generation networks, the encoder divides video frames into key frames and non-key frames during actual operation. Key frames are still compressed and reconstructed with the encoder's original coding method. Non-key frames are processed as follows, first at the encoding end:
step 1: extracting sparse motion representation between an encoded reference frame and a target to-be-encoded non-key frame by using a network;
step 2: generating a target frame by using the extracted sparse motion representation and the reference frame;
and step 3: adding the generated target frame into a reference frame list, and continuing the interframe prediction process of the traditional coding method;
and 4, step 4: and additionally carrying out appropriate optional quantization and lossless compression on the extracted sparse motion representation, and transmitting the sparse motion representation to a decoding end.
At the decoding end:
the method comprises the following steps: reconstructing a sparse motion representation of the transmission;
step two: estimating dense motion information of the target frame by using the reconstructed reference frame and the sparse motion representation and generating a prediction frame of the target frame;
step three: and adding the predicted frame of the target frame into a reference frame list and reconstructing the target frame by utilizing the inter-frame prediction related information transmitted by the encoder.
Fig. 1 summarizes the network architecture of the present invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A video encoding method based on video frame generation, comprising the steps of:
1) training a neural network: at each training iteration, two frames I_k and I_t of a video segment are extracted from the sample video training set as the source frame and the target frame, respectively, and fed into the neural network; the network first extracts a sparse motion representation between I_t and I_k, then estimates the dense optical flow ξ_{k→t} from I_k to I_t based on the extracted sparse motion representation, extracts content features from the source frame I_k, and transforms these content features according to ξ_{k→t} to generate a prediction Î_t of I_t; the L1 norm between Î_t and the target frame I_t is computed and back-propagated to the neural network to update its weights, until the neural network converges;
2) encoding stage: the encoding end uses the trained neural network to extract a sparse motion representation between the encoded reference frame and the target non-key frame to be encoded; a predicted frame of the target frame is then generated from the extracted sparse motion representation and the reference frame; the generated predicted frame is added to the reference frame list for inter-frame prediction, and the obtained inter-frame prediction information and the sparse motion representation are sent to the decoding end;
3) decoding stage: the decoding end estimates the dense motion information of the target frame from the reconstructed reference frame and the transmitted sparse motion representation and generates the target frame; the generated target frame is then added to the reference frame list and the target frame is reconstructed using the transmitted inter-frame prediction information.
2. The method of claim 1, wherein the neural network obtains the sparse motion representation by a forward computation over the input source frame and target frame.
3. The method of claim 1, wherein the frame sequence number of the source frame is smaller than the frame sequence number of the target frame.
4. The method of claim 1, wherein at each training iteration, a first frame is extracted from the video segment as a source frame and another frame randomly selected from the video segment is taken as a target frame.
5. The method of claim 1, wherein the sparse motion representation is quantized and losslessly compressed before being transmitted to the decoding end.
6. The method of claim 1, wherein in the encoding stage, the encoder first distinguishes frames in the video sequence to be encoded into key frames and non-key frames; key frames are still compressed and reconstructed with the conventional coding method, and non-key frames are encoded and decoded using steps 2) to 3).
7. A target frame generation network training generation method comprises the following steps:
1) selecting a sample video training set;
2) at each training iteration, two frames I_k and I_t of a video segment are extracted from the sample video training set as the source frame and the target frame, respectively, and fed into the neural network to extract the sparse motion representation between I_t and I_k;
3) the neural network estimates the dense optical flow ξ_{k→t} from I_k to I_t based on the extracted sparse motion representation, and extracts content features from the source frame I_k;
4) the extracted content features are transformed according to the dense optical flow ξ_{k→t} to generate a prediction Î_t of I_t;
5) the L1 norm between Î_t and the target frame I_t is computed and back-propagated to the neural network to update its weights, until the neural network converges; the converged neural network serves as the target frame generation network.
8. The method of claim 7, wherein the frame sequence number of the source frame is less than the frame sequence number of the target frame.
9. The method of claim 7, wherein the neural network performs a forward calculation of the neural network on the input source frames and target frames to obtain the sparse motion characterization.
10. The method of claim 7, wherein at each training iteration, a first frame is extracted from the video segment as a source frame and another frame is randomly selected from the video segment as a target frame.
CN201911392082.8A 2019-12-30 2019-12-30 Video coding method based on video frame generation Pending CN113132735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911392082.8A CN113132735A (en) 2019-12-30 2019-12-30 Video coding method based on video frame generation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911392082.8A CN113132735A (en) 2019-12-30 2019-12-30 Video coding method based on video frame generation

Publications (1)

Publication Number Publication Date
CN113132735A true CN113132735A (en) 2021-07-16

Family

ID=76767759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911392082.8A Pending CN113132735A (en) 2019-12-30 2019-12-30 Video coding method based on video frame generation

Country Status (1)

Country Link
CN (1) CN113132735A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114466199A (en) * 2022-04-12 2022-05-10 宁波康达凯能医疗科技有限公司 Reference frame generation method and system applicable to VVC (variable valve timing) coding standard
WO2023050431A1 (en) * 2021-09-30 2023-04-06 浙江大学 Encoding method, decoding method, decoder, encoder and computer-readable storage medium
WO2023143349A1 (en) * 2022-01-25 2023-08-03 阿里巴巴(中国)有限公司 Facial video encoding method and apparatus, and facial video decoding method and apparatus
WO2023143331A1 (en) * 2022-01-25 2023-08-03 阿里巴巴(中国)有限公司 Facial video encoding method, facial video decoding method, and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107105278A (en) * 2017-04-21 2017-08-29 中国科学技术大学 The coding and decoding video framework that motion vector is automatically generated
CN107968946A (en) * 2016-10-18 2018-04-27 深圳万兴信息科技股份有限公司 Video frame rate method for improving and device
WO2019009452A1 (en) * 2017-07-06 2019-01-10 삼성전자 주식회사 Method and device for encoding or decoding image
CN109379550A (en) * 2018-09-12 2019-02-22 上海交通大学 Video frame rate upconversion method and system based on convolutional neural networks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107968946A (en) * 2016-10-18 2018-04-27 深圳万兴信息科技股份有限公司 Video frame rate method for improving and device
CN107105278A (en) * 2017-04-21 2017-08-29 中国科学技术大学 The coding and decoding video framework that motion vector is automatically generated
WO2019009452A1 (en) * 2017-07-06 2019-01-10 삼성전자 주식회사 Method and device for encoding or decoding image
CN109379550A (en) * 2018-09-12 2019-02-22 上海交通大学 Video frame rate upconversion method and system based on convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HYOMIN CHOI等: "Deep Frame Prediction for Video Coding", 《IEEE TRANSACTIONS CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY》 *
ZHAO LEI等: "Enhanced Motion-Compensated Video Coding With Deep Virtual Reference Frame Generation", 《IEEE TRANSACTIONS ON IMAGE PROCESSING》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023050431A1 (en) * 2021-09-30 2023-04-06 浙江大学 Encoding method, decoding method, decoder, encoder and computer-readable storage medium
WO2023143349A1 (en) * 2022-01-25 2023-08-03 阿里巴巴(中国)有限公司 Facial video encoding method and apparatus, and facial video decoding method and apparatus
WO2023143331A1 (en) * 2022-01-25 2023-08-03 阿里巴巴(中国)有限公司 Facial video encoding method, facial video decoding method, and apparatus
CN114466199A (en) * 2022-04-12 2022-05-10 宁波康达凯能医疗科技有限公司 Reference frame generation method and system applicable to VVC (variable valve timing) coding standard

Similar Documents

Publication Publication Date Title
Wang et al. Wireless deep video semantic transmission
WO2021164176A1 (en) End-to-end video compression method and system based on deep learning, and storage medium
CN113132735A (en) Video coding method based on video frame generation
CN106973293B (en) Light field image coding method based on parallax prediction
CN112203093B (en) Signal processing method based on deep neural network
Campos et al. Content adaptive optimization for neural image compression
CN110248190B (en) Multilayer residual coefficient image coding method based on compressed sensing
CN108174218B (en) Video coding and decoding system based on learning
CN101626512A (en) Method and device of multiple description video coding based on relevance optimization rule
CN102256133B (en) Distributed video coding and decoding method based on side information refining
CN104539961B (en) Gradable video encoding system based on the gradual dictionary learning of hierarchy
CN110062239B (en) Reference frame selection method and device for video coding
CN110290386B (en) Low-bit-rate human motion video coding system and method based on generation countermeasure network
CN112218072A (en) Video coding method based on deconstruction compression and fusion
CN114501013A (en) Variable bit rate video compression method, system, device and storage medium
CN113132727B (en) Scalable machine vision coding method and training method of motion-guided image generation network
CN111711815A (en) Fast VVC intra-frame prediction method based on integrated learning and probability model
Li et al. Multiple description coding based on convolutional auto-encoder
WO2023024115A1 (en) Encoding method, decoding method, encoder, decoder and decoding system
CN113068041B (en) Intelligent affine motion compensation coding method
Djelouah et al. Content adaptive optimization for neural image compression
CN115529457B (en) Video compression method and device based on deep learning
WO2023225808A1 (en) Learned image compress ion and decompression using long and short attention module
KR102613527B1 (en) Microdosing for low bitrate video compression
CN111726636A (en) HEVC (high efficiency video coding) coding optimization method based on time domain downsampling and frame rate upconversion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210716)