WO2021164176A1 - End-to-end video compression method and system based on deep learning, and storage medium - Google Patents
End-to-end video compression method and system based on deep learning, and storage medium
- Publication number
- WO2021164176A1 (PCT/CN2020/099445)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- key
- key frame
- encoding
- coding
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/157—Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
- H04N19/159—Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/146—Data rate or code amount at the encoder output
- H04N19/147—Data rate or code amount at the encoder output according to rate distortion criteria
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/177—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a group of pictures [GOP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
- H04N19/21—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with binary alpha-plane coding for video objects, e.g. context-based arithmetic encoding [CAE]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/80—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
- H04N19/82—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
Description
- This application belongs to the technical field of digital signal processing, and specifically relates to an end-to-end video compression method, system and storage medium based on deep learning.
- Video compression, also known as video coding, aims to eliminate the redundant information present between video signals. With the continuous development of multimedia digital video applications and the growing demand for video cloud computing, the data volume of raw video sources has exceeded what existing transmission bandwidth and storage resources can bear, so only encoded and compressed video is suitable for transmission over networks; video coding technology has therefore become one of the hot topics in academic research and industrial applications worldwide.
- In recent years, image coding methods based on deep neural networks have become a research hotspot in the coding field. Such methods model an auto-encoder structure end to end, optimize an image-reconstruction loss function, and use an entropy-estimation model to approximate the codeword distribution of the bottleneck layer of the auto-encoder, thereby realizing rate-distortion optimization. On this basis, the entropy-estimation model has been continuously improved: probability models based on Gaussian mixtures and on a Gaussian hyper-prior distribution have been proposed and combined with the auto-regressive PixelCNN framework to build a context model over the bottleneck-layer codewords. The objective function of this type of end-to-end image compression can be expressed as

  L = E[-log2 p(ŷ)] + λ·||x − x̂||² + C,

  where x and x̂ respectively denote the original and reconstructed pixels, y and ŷ respectively denote the unquantized and quantized codewords of the bottleneck layer, and C is a constant.
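- As a worked illustration only (not the patent's code), the following Python sketch evaluates a rate-distortion objective of the stated form; the reconstruction x_hat and the per-codeword likelihoods are assumed to come from some auto-encoder and entropy model, and all names are illustrative:

```python
# A worked sketch of the objective L = E[-log2 p(y_hat)] + lambda * ||x - x_hat||^2 + C.
# x_hat and y_hat_likelihood are assumed to be produced elsewhere by an
# auto-encoder and its entropy model; names are illustrative, not the patent's.
import torch

def rd_objective(x: torch.Tensor, x_hat: torch.Tensor,
                 y_hat_likelihood: torch.Tensor,
                 lam: float = 0.01, C: float = 0.0) -> torch.Tensor:
    # rate term: average codeword cost in bits, -log2 p(y_hat)
    rate = -torch.log2(y_hat_likelihood.clamp_min(1e-9)).mean()
    # distortion term: signal-level fidelity between original and reconstruction
    distortion = torch.mean((x - x_hat) ** 2)
    return rate + lam * distortion + C
```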
- End-to-end neural networks are of great significance for video compression. The traditional hybrid coding framework, with local rate-distortion optimization of its individual coding tools, has been developed for half a century and faces new challenges as more efficient video compression is demanded. Common end-to-end video coding techniques mainly work by designing holistically trainable networks for individual modules such as intra-frame coding, inter-frame prediction, residual coding, and rate control. However, guaranteeing the overall rate-distortion performance of the video compression framework remains a major challenge, so it is crucial to design and develop a video compression method and system that uses deep neural networks to realize end-to-end video coding while ensuring good rate-distortion performance.
- The present invention proposes an end-to-end video compression method, system, and storage medium based on deep learning, aiming to solve the problem that good rate-distortion performance cannot be guaranteed by video compression coding in the prior art.
- According to the first aspect of the embodiments of the present application, an end-to-end video compression method based on deep learning is provided, including the following steps:
- dividing the target video into multiple groups of pictures;
- performing end-to-end intra-frame encoding on the key frames in each group of pictures to obtain the key-frame code;
- reconstructing the key-frame code through the loop filter network to obtain the key-frame reconstructed frame;
- performing end-to-end inter-frame encoding on the non-key frames in the group of pictures based on the key-frame reconstructed frame to obtain the non-key-frame code;
- reconstructing the non-key-frame code through the loop filter network to obtain the non-key-frame reconstructed frame.
- Optionally, performing end-to-end inter-frame encoding on the non-key frames based on the key-frame reconstructed frame specifically includes: performing motion-field estimation on the non-key frames based on the key-frame reconstructed frame to obtain motion-field information; obtaining inter-frame prediction information of the non-key frames from the motion-field information; and performing prediction-residual coding based on the inter-frame prediction information and the non-key frames.
- Optionally, the end-to-end intra-frame encoding of the key frames specifically adopts an end-to-end auto-encoder intra-coding framework based on a hyper-prior model network, with context modeling performed on the bottleneck layer of the auto-encoder.
- Optionally, the objective function of the intra-frame coding framework during training is

  L_intra = E[-log2 p(ŷ | ẑ)] + E[-log2 p(ẑ)] + λ·||x − x̂||²,

  where y is the latent variable encoded from the image, y = Enc(x), and the prior distribution of y is assumed to be a normal distribution with mean μ and variance σ, y ~ N(μ, σ).
- The mean μ and the variance σ are obtained through end-to-end learning with the hyper-prior auto-encoder, specifically z = HyperEnc(y) and ẑ = Q(z), where ẑ is the quantized codeword of the hyper-prior auto-encoder and the preliminary parameters of the hyper-prior normal distribution are decoded from ẑ; PixelCNN-based context modeling is then applied to refine the result of the hyper-prior auto-encoding structure.
- Optionally, the loop filter network is based on a fully convolutional network and is trained with the L2 loss; the loop filter loss is specifically

  L2 = (1/n) · Σ_{i=1..n} ||x_rec(i) − x(i)||²,

  where x_rec denotes the input coded image, x is the ground-truth label corresponding to the coded image, and n denotes the number of frames.
- Optionally, performing motion-field estimation on the non-key frames in the group of pictures based on the key-frame reconstructed frame to obtain the motion-field information specifically includes:
- when only one key-frame reconstructed frame is available, the motion-field information must be obtained by auto-encoder coding and written into the bitstream; the motion-field information flow_1 is computed as

  flow_1 = Flownet(f_{t-1});

- when more than one reconstructed frame is available, the two reconstructed frames nearest to the current non-key frame are used and the motion-field information need not be written into the bitstream; the motion-field information flow_2 is computed as

  flow_2 = Flownet(f_{t-2}, f_{t-1}),

  where f_1 is the available key-frame reconstructed frame and Flownet is an optical-flow prediction network.
- Optionally, obtaining the inter-frame prediction information of the non-key frame from the motion-field information specifically includes: generating the inter-frame prediction signal of the non-key frame from the video motion characteristics of the motion-field information and the reconstructed frames in the decoding buffer through interpolation and image processing, where the inter-frame prediction signal Frame_pred is computed as

  Frame_pred = Warp(f_{t-1}, flow),

  where Warp is a polynomial interpolation method, f_1 is the available key-frame reconstructed frame, and flow is the motion-field information of the non-key frame.
- Optionally, calculating and coding the prediction residual from the inter-frame prediction information and the non-key frame specifically includes: the prediction residual Frame_Resi is computed as

  Frame_Resi = Frame − Frame_pred,

  where Frame is the original signal of the current non-key frame and Frame_pred is the inter-frame prediction signal; the prediction residual Frame_Resi is compression-coded through an auto-encoder structure composed of fully convolutional networks, and its bottleneck layer is entropy-coded and written into the bitstream.
- According to the second aspect of the embodiments of the present application, an end-to-end video compression system based on deep learning is provided, which specifically includes:
- an image group module, used to divide the target video into multiple groups of pictures;
- a key-frame encoding module, used to perform end-to-end intra-frame encoding on the key frames in the group of pictures to obtain the key-frame code;
- a key-frame reconstruction module, used to reconstruct the key-frame code through the loop filter network to obtain the key-frame reconstructed frame;
- a non-key-frame encoding module, used to perform end-to-end inter-frame encoding on the non-key frames in the group of pictures based on the key-frame reconstructed frame in the decoding buffer to obtain the non-key-frame code;
- a non-key-frame reconstruction module, used to reconstruct the non-key-frame code through the loop filter network to obtain the non-key-frame reconstructed frame.
- According to the third aspect of the embodiments of the present application, a computer-readable storage medium is provided, having a computer program stored thereon; the computer program is executed by a processor to implement the end-to-end video compression method based on deep learning.
- With the deep-learning-based end-to-end video compression method, system, and storage medium of the embodiments of the present application, the target video is divided into multiple groups of pictures; the key frames in each group of pictures are then intra-coded end to end to obtain the key-frame code; the key-frame code is reconstructed through the loop filter network to obtain the key-frame reconstructed frame; next, the non-key frames in the group of pictures are inter-coded end to end based on the key-frame reconstructed frame to obtain the non-key-frame code; finally, the non-key-frame code is reconstructed through the loop filter network to obtain the non-key-frame reconstructed frame.
- Compared with traditionally adopted video compression encoders, this application realizes an end-to-end, globally optimized video encoder and achieves better coding performance at low bit rates, solving the problem of how to use deep neural networks to realize end-to-end video coding while ensuring good rate-distortion performance.
- Fig. 1 shows a flowchart of the steps of an end-to-end video compression method based on deep learning according to an embodiment of the present application
- FIG. 2 shows a framework diagram of a video compression method based on an end-to-end deep neural network according to an embodiment of the present application
- FIG. 3 shows a method for dividing the structure of a group of pictures GOP according to an embodiment of the present application
- FIG. 4 shows the intra-frame coding network structure diagram of the key frame of the end-to-end video compression method according to an embodiment of the present application
- FIG. 5 shows a non-key frame inter-frame coding framework diagram of an end-to-end video compression method according to an embodiment of the present application
- Fig. 6 shows an implementation method of Mask convolution adopted by an intra-frame coding network according to an embodiment of the present application
- Fig. 7 shows a schematic structural diagram of an end-to-end video compression system based on deep learning according to an embodiment of the present application.
- In realizing the present application, the inventor found that the traditional hybrid coding framework and the local rate-distortion optimization of each coding tool have been developed for half a century and face new challenges as more efficient video compression is demanded.
- An end-to-end video coding framework can break through the local-optimization limits of traditional frameworks: by establishing a global optimization model between the reconstructed video and the original video, and by using neural networks to model the rate-distortion optimization problem with its high-dimensional, complex solution space, it enables an innovation of the coding framework. Common end-to-end video coding techniques mainly design holistically trainable networks for the intra-frame coding, inter-frame prediction, residual coding, and rate-control modules of video coding.
- In view of the above problems, the embodiments of this application provide an end-to-end video compression method, system, and storage medium based on deep learning.
- Compared with traditionally adopted video compression encoders, the end-to-end trainable, fully convolutional video compression framework provided by this application realizes end-to-end global optimization of the video encoder and achieves better coding performance at low bit rates, solving the problem of how to use deep neural networks to realize end-to-end video coding while ensuring good rate-distortion performance.
- This application uses convolutional neural networks and video processing technology.
- First, the video is divided into groups of pictures (GOP) for encoding; the adaptively selected key frames in each GOP are coded end to end and stored in the decoding buffer.
- Second, for non-key-frame coding, deep-network-based motion-field estimation is performed for each frame to be coded using the reconstructed frames in the decoding buffer, and the estimated motion information is used to generate the inter-frame prediction result; finally, end-to-end residual coding is applied to the prediction residuals of the non-key frames. When reconstructed video is stored into the decoding buffer, both key frames and non-key frames are reconstructed through the deep loop filter module.
- Fig. 1 shows a step flow chart of an end-to-end video compression method based on deep learning according to an embodiment of the present application.
- As shown in Fig. 1, the end-to-end video compression method based on deep learning of this embodiment specifically includes the following steps, illustrated by the driver sketch after this list:
- S101: Divide the target video into multiple groups of pictures;
- S102: Perform end-to-end intra-frame coding on the key frames in the group of pictures to obtain the key-frame code;
- S103: Reconstruct the key-frame code through the loop filter network to obtain the key-frame reconstructed frame;
- S104: Perform end-to-end inter-frame coding on the non-key frames in the group of pictures based on the key-frame reconstructed frame to obtain the non-key-frame code;
- S105: Reconstruct the non-key-frame code through the loop filter network to obtain the non-key-frame reconstructed frame.
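- The following minimal driver sketch mirrors steps S101 to S105; intra_encode, inter_encode, and loop_filter are hypothetical placeholder callables with the semantics described above:

```python
# A minimal driver sketch of steps S101-S105; the three callables are
# hypothetical placeholders, passed in so the sketch stays self-contained.
def compress_video(frames, intra_encode, inter_encode, loop_filter, gop_size=8):
    bitstream, dpb = [], []                                   # dpb: decoding buffer
    gops = [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]  # S101
    for gop in gops:
        key, non_keys = gop[0], gop[1:]                       # key frame: first of the GOP
        code, recon = intra_encode(key)                       # S102: end-to-end intra coding
        dpb.append(loop_filter(recon))                        # S103: loop-filter reconstruction
        bitstream.append(code)
        for frame in non_keys:
            code, recon = inter_encode(frame, dpb)            # S104: inter coding from the DPB
            dpb.append(loop_filter(recon))                    # S105: loop-filter reconstruction
            bitstream.append(code)
    return bitstream
```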
- Fig. 2 shows a framework diagram of a video compression method based on an end-to-end deep neural network according to an embodiment of the present application.
- As shown in Fig. 2, in the compression framework of this application, the video is compressed GOP by GOP by the end-to-end deep neural network video coding framework.
- First, the key frames in a GOP are compressed with an auto-encoding architecture based on a Gaussian hyper-prior distribution, and the compressed key frames are passed through the deep-convolutional-network loop filter module (CNN Loop Filter) and then buffered into the Decoded Picture Buffer (DPB).
- Fig. 3 shows a method for dividing the structure of a group of pictures GOP according to an embodiment of the present application.
- As shown in Fig. 3, the key frame in the present invention is set as the first frame of the GOP.
- Alternatively, the key frame may be the first frame of the GOP or a non-first frame. The key frame is then encoded using an auto-encoder network with a hyper-prior structure, where the auto-encoder's prior may be a Gaussian distribution, a Gaussian mixture distribution, a Laplace distribution, etc.
- Fig. 4 shows the intra-frame coding network structure diagram of the key frame of the end-to-end video compression method according to an embodiment of the present application.
- As shown in Fig. 4, end-to-end intra-frame encoding of the key frames in the group of pictures yields the key-frame code.
- Specifically, an end-to-end auto-encoder intra-coding framework based on the hyper-prior model network is adopted, and a context modeling framework is designed for the bottleneck layer of the auto-encoder.
- This application adopts an end-to-end training method whose goal is to obtain an output image x̂ that is highly similar to the input image x at the signal level.
- For an input image x, the auto-encoder encodes the image into a latent variable y, y = Enc(x); this scheme assumes that the prior distribution of the latent variable y is a normal distribution with mean μ and variance σ, y ~ N(μ, σ).
- the mean ⁇ and the variance ⁇ are obtained through end-to-end learning according to the super-prior autoencoder, specifically:
- Z is the codeword of the autoencoder, Is the codeword of the super-prior self-encoder after quantization, It is the preliminary parameter of the super-prior normal distribution.
- Moreover, after obtaining the output of the hyper-prior auto-encoding structure, the present invention also applies PixelCNN-based context modeling to refine that result, as shown in Fig. 6, using a masked 5×5 convolution;
- the output gives the final parameters of the hyper-prior distribution.
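- A minimal sketch of such a masked 5×5 convolution, assuming the usual causal mask that hides the current and not-yet-decoded positions, is as follows:

```python
# A minimal sketch of a PixelCNN-style masked 5x5 convolution: the kernel
# only sees already-decoded (causal) positions of the codeword map.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.zeros_like(self.weight)
        mask[:, :, :kh // 2, :] = 1.0        # rows strictly above the center
        mask[:, :, kh // 2, :kw // 2] = 1.0  # same row, strictly left of the center
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.weight.data.mul_(self.mask)     # zero out non-causal taps before convolving
        return super().forward(x)

# context model over the quantized bottleneck codewords (64 channels assumed)
context_model = MaskedConv2d(64, 128, kernel_size=5, padding=2)
```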
- In S103 and S105, regarding loop filtering, every coded key-frame and non-key-frame image is processed by the loop filter module based on a fully convolutional network, improving the subjective and objective reconstruction quality.
- Specifically, for a coded reconstructed image x_rec, an end-to-end fully convolutional mapping to its original image x is established; the reconstructed image is processed by a nine-layer convolutional neural network with a global residual structure to obtain the final reconstructed image, which is simultaneously stored in the decoding buffer.
- Further, the loop filter network adopts the L2 loss; the loop filter loss is specifically

  L2 = (1/n) · Σ_{i=1..n} ||x_rec(i) − x(i)||²,

  where x_rec denotes the input coded image, x is the ground-truth label corresponding to the coded image, and n denotes the number of frames. Using the L2 function effectively preserves the fidelity of the data.
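- A minimal sketch of this loop filter, keeping the nine-layer depth and global residual structure stated above while choosing an illustrative channel width, is as follows:

```python
# A minimal sketch of the CNN loop filter: a fully convolutional network with
# a global residual connection, trained with the L2 loss; the nine-layer depth
# matches the text, while the channel width is an illustrative assumption.
import torch
import torch.nn as nn

class CNNLoopFilter(nn.Module):
    def __init__(self, ch: int = 64, layers: int = 9):
        super().__init__()
        body = [nn.Conv2d(3, ch, 3, padding=1), nn.ReLU()]
        for _ in range(layers - 2):
            body += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU()]
        body.append(nn.Conv2d(ch, 3, 3, padding=1))
        self.body = nn.Sequential(*body)

    def forward(self, x_rec: torch.Tensor) -> torch.Tensor:
        return x_rec + self.body(x_rec)      # global residual: predict only the correction

def loop_filter_loss(model: nn.Module, x_rec: torch.Tensor, x: torch.Tensor):
    # L2 = (1/n) * sum_i ||f(x_rec_i) - x_i||^2, averaged per element here
    return nn.functional.mse_loss(model(x_rec), x)
```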
- Regarding non-key-frame coding, this application uses the already-coded frames in the decoding buffer (DPB) to generate the motion-field information of the current non-key frame, and uses this information to texture-align the frames in the DPB, thereby obtaining the prediction information of the current frame; the prediction residual is then coded through an auto-encoder structure, and the bottleneck layer of that auto-encoder is written into the bitstream.
- Similar to key-frame coding, each non-key frame is also processed by the loop filter module to improve reconstruction quality.
- Specifically, the video motion characteristics of the motion-field information include video motion-field information and texture motion features.
- Expression forms of video motion features include, but are not limited to, the optical-flow field, motion-vector field, disparity-vector field, inter-frame gradient field, etc.
- The video motion-feature extraction method is specifically a method of extracting motion features between video frames; the extraction method corresponds to the chosen expression form and includes, but is not limited to, deep-learning-based methods such as optical-flow models, as well as traditional gradient-based extraction methods.
- Fig. 5 shows a non-key frame inter-coding framework diagram of the end-to-end video compression method according to an embodiment of the present application.
- Specifically, the coding of non-key frames in this application is divided into two main steps: prediction-frame generation and prediction-residual coding.
- First, for prediction-frame generation, motion-field estimation is performed on the non-key frames in the group of pictures based on the key-frame reconstructed frames to obtain the motion-field information, which specifically includes:
- when only one key-frame reconstructed frame is available, the motion-field information must be obtained by auto-encoder coding and written into the bitstream; the motion-field information flow_1 is computed as

  flow_1 = Flownet(f_{t-1});

- when more than one reconstructed frame is available, the two reconstructed frames nearest to the current non-key frame are used and the motion-field information need not be written into the bitstream; the motion-field information flow_2 is computed as

  flow_2 = Flownet(f_{t-2}, f_{t-1}),

  where f_1 is the available key-frame reconstructed frame and Flownet is an optical-flow prediction network.
- The structure of the non-key-frame prediction network is shown in Fig. 5; the prediction uses the optical-flow network (Flownet) on the already-coded frames in the decoding buffer. That is, when the decoding buffer contains only one frame, the video motion-characteristic information is written into the bitstream; when the decoding buffer contains more than one frame, it is not. A minimal sketch of this two-branch rule follows.
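```python
# A minimal sketch of the two-branch motion-field rule; flownet and flow_codec
# are hypothetical callables standing for the optical-flow prediction network
# and an auto-encoder-based flow coder, respectively.
def estimate_motion_field(dpb, flownet, flow_codec):
    if len(dpb) == 1:
        # only the key-frame reconstruction exists: code the flow with the
        # auto-encoder and write it into the bitstream
        flow = flownet(dpb[-1])                  # flow_1 = Flownet(f_{t-1})
        flow_bits, flow_rec = flow_codec(flow)
        return flow_rec, flow_bits
    # two or more reconstructed frames: derive the flow from the two frames
    # nearest the current one; no motion bits are written
    flow = flownet(dpb[-2], dpb[-1])             # flow_2 = Flownet(f_{t-2}, f_{t-1})
    return flow, None
```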
- The prediction-frame generation specifically includes: generating the inter-frame prediction signal of the non-key frame from the video motion characteristics of the motion-field information and the reconstructed frames in the decoding buffer through interpolation and image processing, where the inter-frame prediction signal Frame_pred is computed as

  Frame_pred = Warp(f_{t-1}, flow),

  where Warp is a polynomial interpolation method, f_1 is the available key-frame reconstructed frame, and flow is the motion-field information of the non-key frame.
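- A minimal warp sketch follows; bilinear sampling via grid_sample is used as a simple stand-in for the polynomial interpolation named above:

```python
# A minimal warp sketch; bilinear sampling stands in for the polynomial
# interpolation named in the text.
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # frame: (N, C, H, W); flow: (N, 2, H, W) as (dx, dy) pixel offsets
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(frame.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                       # displaced sample positions
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                 # normalize x to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0                 # normalize y to [-1, 1]
    grid = torch.stack((gx, gy), dim=-1)                    # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

# Frame_pred = warp(f_prev, flow), i.e. Warp(f_{t-1}, flow) above
```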
- Second, for prediction-residual coding, all non-key frames in the group of pictures pass through the non-key-frame residual coding module after predictive coding.
- The input of the non-key-frame residual coding module is the residual between the original non-key-frame signal and the prediction signal; the prediction residual Frame_Resi is computed as

  Frame_Resi = Frame − Frame_pred,

  where Frame is the original signal of the current non-key frame and Frame_pred is the inter-frame prediction signal.
- The prediction residual Frame_Resi is compression-coded through an auto-encoder structure composed of fully convolutional networks, and its bottleneck layer is entropy-coded and written into the bitstream.
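- A minimal sketch of such a residual coder follows; the entropy coding of the bottleneck is abstracted away and the widths are illustrative:

```python
# A minimal sketch of the residual coder: a fully convolutional auto-encoder
# whose quantized bottleneck would be entropy-coded into the bitstream.
import torch
import torch.nn as nn

class ResidualCodec(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2))
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 5, stride=2, padding=2, output_padding=1))

    def forward(self, frame: torch.Tensor, frame_pred: torch.Tensor):
        resi = frame - frame_pred             # Frame_Resi = Frame - Frame_pred
        y_hat = torch.round(self.enc(resi))   # quantized bottleneck (to be entropy-coded)
        resi_rec = self.dec(y_hat)            # reconstructed residual
        return frame_pred + resi_rec, y_hat   # unfiltered reconstruction + codewords
```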
- During reconstruction, the non-key frame likewise needs to be reconstructed through the loop filter network to obtain the non-key-frame reconstructed frame; the non-key-frame reconstructed frame Frame_Rec is computed as

  Frame_Rec = LoopFilter(Frame_pred + Frame_Resi'),

  where Frame_Resi' denotes the reconstructed prediction residual.
- The non-key-frame prediction-residual coding of this application specifically uses a pre-trained auto-encoder network model designed according to the specific conditions: the residual between the original non-key-frame signal and its prediction signal is used as the input of the generation network to obtain the reconstructed residual, completing the compressed-image reconstruction.
- For final reconstruction, loop-filter reconstruction based on a convolutional neural network designed and trained according to the specific conditions is used: unfiltered key frames or non-key frames are input to the loop filter, and the results are stored in the decoding buffer.
- The overall bitstream is composed of the bitstreams of the multiple groups of pictures, and the bitstream of each group of pictures is composed of the key-frame and non-key-frame bitstreams.
- The key-frame bitstream includes the bottleneck-layer bitstream of the auto-encoder, and the non-key-frame bitstream is composed of the motion-field information and its prediction-residual bitstream.
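- The layout can be pictured with the following sketch, whose field names are hypothetical and chosen only to mirror the composition described above:

```python
# A minimal sketch of the bitstream layout; all field names are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NonKeyFrameStream:
    motion_field_bits: Optional[bytes]  # present only when the flow was coded and written
    residual_bits: bytes                # entropy-coded bottleneck of the residual coder

@dataclass
class GopStream:
    key_frame_bits: bytes               # bottleneck-layer stream of the intra auto-encoder
    non_key_frames: List[NonKeyFrameStream] = field(default_factory=list)

@dataclass
class VideoStream:
    gops: List[GopStream] = field(default_factory=list)  # overall stream = all GOP streams
```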
- The end-to-end video compression method based on deep learning of the present application specifically involves a deep learning method, a video motion-feature extraction method, an end-to-end video compression method, and a video reconstruction method.
- The deep learning method used in the end-to-end video compression is specifically a deep learning method based on a fully convolutional network model; deep-learning-based methods include, but are not limited to, variational auto-encoders, generative adversarial networks, and combinations of their variants.
- The deep-learning-based video coding technology of this application aims to extract high-level abstract characteristics of the data, and their inverse process, through multi-layer deep nonlinear transformations, thereby obtaining the optimal prediction signal for video coding and ensuring the overall rate-distortion performance of the framework.
- During training, a supervised training method is used to optimize the rate-distortion function, which includes the data fidelity of the reconstructed video and the additional rate cost required for encoding the residuals.
- Fig. 7 shows a schematic structural diagram of an end-to-end video compression system based on deep learning according to an embodiment of the present application.
- the end-to-end video compression system based on deep learning specifically includes:
- an image group module 10, used to divide the target video into multiple groups of pictures;
- a key-frame encoding module 20, used to perform end-to-end intra-frame encoding on the key frames in the group of pictures to obtain the key-frame code;
- a key-frame reconstruction module 30, used to reconstruct the key-frame code through the loop filter network to obtain the key-frame reconstructed frame and store it in the decoding buffer;
- a non-key-frame encoding module 40, used to perform end-to-end inter-frame encoding on the non-key frames in the group of pictures based on the key-frame reconstructed frame in the decoding buffer to obtain the non-key-frame code;
- a non-key-frame reconstruction module 50, used to reconstruct the non-key-frame code through the loop filter network to obtain the non-key-frame reconstructed frame and store it in the decoding buffer.
- Both the key-frame reconstruction module 30 and the non-key-frame reconstruction module 50 in the end-to-end video compression framework include loop filters.
- When the key frames and non-key frames are finally reconstructed after encoding, a loop-filter reconstruction based on a convolutional neural network, designed and trained according to the specific conditions, is used: unfiltered key frames or non-key frames are input to the loop filter, and the results are stored in the decoding buffer.
- This embodiment also provides a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the deep learning-based end-to-end video compression method provided by any of the above.
- This application proposes a video compression framework based on an end-to-end deep neural network.
- The video is organized into multiple groups of pictures; the key-frame images in each group of pictures are intra-coded, and the non-key-frame images are inter-coded.
- Intra-frame coding uses an auto-encoding structure based on a hyper-prior structure combined with an auto-regressive model for context modeling, and inter-frame coding uses motion-field derivation for prediction together with residual coding; this realizes end-to-end overall optimization of the encoder architecture.
- Because inter-frame coding derives the motion field, transmission of large amounts of inter-frame motion information is avoided, which greatly saves bit rate.
- Deep-network-based loop filtering technology is used in the reconstruction process to improve reconstruction performance.
- The proposed method can globally optimize the video encoder end to end without transmitting motion information for inter-frame prediction, and can achieve better coding performance at low bit rates.
- The embodiments of the present application also provide a computer program product. Since the principle by which the computer program product solves the problem is similar to the method provided in the first embodiment of the present application, reference may be made to the implementation of the method for its implementation, and repeated description is omitted.
- This application may be provided as methods, systems, or computer program products. Therefore, this application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
- These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data-processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
- These computer program instructions can also be loaded onto a computer or other programmable data-processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Provided are an end-to-end video compression method and system based on deep learning, and a storage medium. The end-to-end video compression method based on deep learning in the present application comprises: dividing a target video into a plurality of groups of pictures; then, performing end-to-end intra-frame encoding on a key frame in each group of pictures to obtain a key frame code; reconstructing the key frame code by means of a loop filter network to obtain a key frame reconstructed frame; next, performing end-to-end inter-frame encoding on a non-key frame in the group of pictures on the basis of the key frame reconstructed frame to obtain a non-key frame code; and finally, reconstructing the non-key frame code by means of the loop filter network to obtain a non-key frame reconstructed frame. Compared with a traditionally used video compression encoder, a video encoder that can realize end-to-end global optimization is used in the present application, and a better encoding performance can be obtained at a low code rate. The problem of how to ensure a better rate-distortion performance while realizing end-to-end video encoding by using a deep neural network is thus solved.
Description
本申请属于数字信号处理技术领域,具体地,涉及一种基于深度学习的端到端视频压缩方法、系统及存储介质。This application belongs to the technical field of digital signal processing, and specifically relates to an end-to-end video compression method, system and storage medium based on deep learning.
视频压缩,也称视频编码,其目的是消除视频信号间存在的冗余信息。随着多媒体数字视频应用的不断发展和人们对视频云计算需求的不断提高,原始视频信源的数据量已使现有传输网络带宽和存储资源无法承受,因而经编码压缩后的视频才是宜在网络中传输中的信息,视频编码技术已成为目前国内外学术研究和工业应用的热点之一。Video compression, also known as video coding, aims to eliminate redundant information between video signals. With the continuous development of multimedia digital video applications and the continuous improvement of people’s demand for video cloud computing, the data volume of the original video source has made the existing transmission network bandwidth and storage resources unbearable, so the video after encoding and compression is suitable. For information transmitted in the network, video coding technology has become one of the hot spots in academic research and industrial applications at home and abroad.
近年来基于深度神经网络的图像编码方法成为编码领域的研究热点,它通过端到端建模自编码器(Auto-encoder)结构,优化图像重建损失函数,并利用熵估计模型近似估算自编码器结构中瓶颈层(Bottleneck Layer)的码字分布实现率失真优化。在此基础之上,熵估计模型被不断改进提升,基于混合高斯模型以及基于高斯超先验分布熵估计模型的概率估计模型被提出,并结合基于自回归模型(Auto-regressive)的PixelCNN框架建立瓶颈层码字的上下文模型。这一类端到端图像压缩的目标函数可以表示为:
其中,x和
分别代表原始像素与瓶颈层未量化像素,y和
分别代表瓶颈层未量化及量化后的码字,C为常数。
In recent years, the image coding method based on deep neural network has become a research hotspot in the coding field. It optimizes the image reconstruction loss function through end-to-end modeling of the auto-encoder structure, and uses the entropy estimation model to approximate the auto-encoder. The codeword distribution of the Bottleneck Layer in the structure realizes rate-distortion optimization. On this basis, the entropy estimation model has been continuously improved. A probability estimation model based on a mixture of Gaussian models and a Gaussian superprior distribution entropy estimation model is proposed, combined with the PixelCNN framework based on the auto-regressive model. The context model of the bottleneck layer codeword. The objective function of this type of end-to-end image compression can be expressed as: Where x and Respectively represent the original pixel and the unquantized pixel of the bottleneck layer, y and Respectively represent the unquantized and quantized codewords of the bottleneck layer, and C is a constant.
端到端神经网络对于视频压缩有着重要的意义。传统的混合编码框架及各个编码工具的局部率失真优化已经发展了半个世纪,在面临更高效的视频压缩时遭遇了新的挑战。常见的端到端视频编码技术主要通过设计整体可训练的网络分别用于视频编码帧内编码、帧间预测、残差编码和码率控制等模块。但是对应保证视频压缩框架的整体率失真性能仍然具有很大的挑战,因此设计开发一种利用深度神经网络实现端到端视频编码的同时可以保证较好的率失真性能的视频压缩方法及系统显得是至关重要。End-to-end neural networks are of great significance to video compression. The traditional hybrid coding framework and the local rate-distortion optimization of various coding tools have been developed for half a century, and they have encountered new challenges in the face of more efficient video compression. Common end-to-end video coding technologies are mainly used in video coding modules such as intra-frame coding, inter-frame prediction, residual coding, and rate control by designing an overall trainable network. However, it is still a big challenge to ensure the overall rate-distortion performance of the video compression framework. Therefore, it appears to design and develop a video compression method and system that uses a deep neural network to achieve end-to-end video encoding while ensuring better rate-distortion performance. Is crucial.
发明内容Summary of the invention
本发明提出了一种基于深度学习的端到端视频压缩方法、系统及存储介质,旨在解决现有技术中视频压缩编码中无法保证较好率失真性能的问题。The present invention proposes an end-to-end video compression method, system and storage medium based on deep learning, and aims to solve the problem that better rate-distortion performance cannot be guaranteed in video compression coding in the prior art.
根据本申请实施例的第一个方面,提供了一种基于深度学习的端到端视频压缩方法, 包括以下步骤:According to the first aspect of the embodiments of the present application, an end-to-end video compression method based on deep learning is provided, including the following steps:
将目标视频分为多个图像组;Divide the target video into multiple image groups;
对图像组中的关键帧进行端到端帧内编码得到关键帧编码;Perform end-to-end intra-frame coding on the key frames in the image group to obtain the key frame coding;
关键帧编码通过环路滤波网络进行重建后得到关键帧重建帧;The key frame encoding is reconstructed through the loop filter network to obtain the key frame reconstruction frame;
基于关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码;Perform end-to-end inter-coding of non-key frames in the image group based on key frame reconstruction frames to obtain non-key frame coding;
非关键帧编码通过环路滤波网络进行重建后得到非关键帧重建帧。After the non-key frame coding is reconstructed through the loop filter network, the non-key frame reconstruction frame is obtained.
可选地,基于关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码,具体包括:Optionally, performing end-to-end inter-encoding of the non-key frames in the image group based on the key-frame reconstruction frame to obtain the non-key frame encoding, which specifically includes:
基于关键帧重建帧对图像组中的非关键帧进行运动场估计得到运动场信息;Performing motion field estimation on the non-key frames in the image group based on the key frame reconstruction frame to obtain the motion field information;
根据运动场信息得到非关键帧的帧间预测信息;Obtain the inter prediction information of non-key frames according to the sports field information;
根据非关键帧的帧间预测信息以及非关键帧进行预测残差编码。Perform predictive residual coding based on the inter prediction information of non-key frames and non-key frames.
可选地,对图像组中的关键帧进行端到端帧内编码得到关键帧编码,具体采用基于超先验模型网络的端到端自编码器结构帧内编码框架,自编码器的瓶颈层进行上下文建模。Optionally, perform end-to-end intra-encoding of the key frames in the image group to obtain the key-frame encoding, specifically adopting the end-to-end autoencoder structure based on the super-prior model network, the intra-encoding framework, the bottleneck layer of the autoencoder Perform contextual modeling.
可选地,帧内编码框架在训练时的目标函数
公式为:
Optionally, the objective function of the intra-frame coding frame during training The formula is:
其中,y为根据图像编码的隐变量,y=Enc(x);隐变量y的先验分布为服从均值μ,方差为σ的正态分布,y~N(μ,σ);Among them, y is a hidden variable based on image coding, y=Enc(x); the prior distribution of the hidden variable y is a normal distribution that obeys the mean μ and the variance is σ, y~N(μ,σ);
其中,均值μ和方差σ是根据超先验自编码器通过端到端学习得到,具体为:Among them, the mean μ and variance σ are obtained through end-to-end learning according to the super-prior autoencoder, specifically:
z=Hyper
Enc(y);
z=Hyper Enc(y) ;
其中,
为经过量化后的超先验自编码器的码字,
为超先验正太分布的初步参数,采用基于PixelCNN上下文建模对超先验自编码结构的结果进行提升处理。
in, Is the codeword of the super-prior self-encoder after quantization, For the preliminary parameters of the super-prior normal distribution, the results of the super-prior self-encoding structure are upgraded by using PixelCNN contextual modeling.
可选地,环路滤波网络基于全卷积网络,环路滤波网络采用损失函数L2,环路滤波网络
具体公式为:
Optionally, the loop filter network is based on a fully convolutional network, the loop filter network uses the loss function L2, and the loop filter network The specific formula is:
其中,x
rec表示输入的已编码图像,x为已编码图像对应的真实标签,n表示帧数。
Among them, x rec represents the input coded image, x is the real label corresponding to the coded image, and n represents the number of frames.
可选地,基于关键帧重建帧对图像组中的非关键帧进行运动场估计得到运动场信息,具体包括:Optionally, the motion field estimation is performed on the non-key frames in the image group based on the key frame reconstruction frame to obtain the motion field information, which specifically includes:
当关键帧重建帧只有一帧时,运动场信息需要通过自编码器编码得到,并写入码流中,运动场信息flow
1的计算公式为:
When the key frame reconstruction frame is only one frame, the sports field information needs to be encoded by the autoencoder and written into the code stream. The calculation formula of the sports field information flow 1 is:
flow
1=Flownet(f
t-1);
flow 1 = Flownet(f t-1 );
当关键帧重建帧数目大于一帧时,取相对当前非关键帧最临近的两帧重建帧得到运动场信息,此时运动场信息无需写入码流中,运动场信息flow
2的计算公式为:
When the number of key frame reconstruction frames is greater than one frame, take the two closest reconstruction frames relative to the current non-key frame to obtain the sports field information. At this time, the sports field information does not need to be written into the code stream. The calculation formula of the sports field information flow 2 is:
flow
2=Flownet(f
t-2,f
t-1);
flow 2 =Flownet(f t-2 ,f t-1 );
其中,f
1为可使用的关键帧重建帧,Flownet为光流预测网络。
Among them, f 1 is an available key frame reconstruction frame, and Flownet is an optical flow prediction network.
可选地,根据运动场信息得到非关键帧的帧间预测信息,具体包括:根据运动场信息的视频运动特征及解码缓存区的重建帧通过插值及图像处理技术生成非关键帧的帧间预测信号,帧间预测信号Frame
pred计算公式为:
Optionally, obtaining the inter prediction information of the non-key frame according to the sports field information specifically includes: generating the inter prediction signal of the non-key frame according to the video motion characteristics of the sports field information and the reconstructed frame in the decoding buffer area through interpolation and image processing technology, The calculation formula of the inter-frame prediction signal Frame pred is:
Frame
pred=Warp(f
t-1,flow);
Frame pred = Warp(f t-1 ,flow);
其中,Warp为多项式插值方法,f
1为可使用的关键帧重建帧,flow为非关键帧的运动场信息。
Among them, Warp is a polynomial interpolation method, f 1 is the available key frame reconstruction frame, and flow is the sports field information of the non-key frame.
可选地,根据非关键帧的帧间预测信息以及非关键帧计算预测残差以及预测残差编码,具体包括:预测残差Frame
Resi计算公式为:
Optionally, calculating the prediction residual and the prediction residual coding according to the inter prediction information of the non-key frame and the non-key frame, specifically including: the prediction residual Frame Resi calculation formula is:
Frame
Resi=Frame-Frame
pred;
Frame Resi = Frame-Frame pred ;
其中,Frame为当前非关键帧的原始信号,Frame
pred为帧间预测信号;
Among them, Frame is the original signal of the current non-key frame, and Frame pred is the inter-frame prediction signal;
预测残差Frame
Resi通过由全卷积网络构成的自编码器结构进行压缩编码,其瓶颈层被熵编码后写入码流中。
The prediction residual Frame Resi is compressed and coded through a self-encoder structure composed of a full convolutional network, and its bottleneck layer is entropy coded and written into the code stream.
根据本申请实施例的第二个方面,提供了一种基于深度学习的端到端视频压缩系统,具体包括:According to a second aspect of the embodiments of the present application, an end-to-end video compression system based on deep learning is provided, which specifically includes:
图像组模块:用于将目标视频分为多个图像组;Image group module: used to divide the target video into multiple image groups;
关键帧编码模块:用于对图像组中的关键帧进行端到端帧内编码得到关键帧编码;Key frame encoding module: used to perform end-to-end intra-frame encoding on the key frames in the image group to obtain the key frame encoding;
关键帧重建帧模块:用于将关键帧编码通过环路滤波网络进行重建后得到关键帧重建帧;Key frame reconstruction frame module: used to reconstruct the key frame by encoding the key frame through the loop filter network to obtain the key frame reconstruction frame;
非关键帧编码模块:用于基于解码缓冲区中的关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码;Non-key frame coding module: used to perform end-to-end inter-coding of non-key frames in the image group based on the key frame reconstruction frame in the decoding buffer to obtain non-key frame coding;
非关键帧重建帧模块:用于将非关键帧编码通过环路滤波网络进行重建后得到非关键帧重建帧。Non-key frame reconstruction frame module: used to reconstruct the non-key frame by encoding the non-key frame through the loop filter network to obtain the non-key frame reconstruction frame.
根据本申请实施例的第三个方面,提供了一种计算机可读存储介质,其上存储有计算 机程序;计算机程序被处理器执行以实现基于深度学习的端到端视频压缩方法。According to a third aspect of the embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement an end-to-end video compression method based on deep learning.
采用本申请实施例中的基于深度学习的端到端视频压缩方法、系统及存储介质,通过将目标视频分为多个图像组;然后对图像组中的关键帧进行端到端帧内编码得到关键帧编码;关键帧编码通过环路滤波网络进行重建后得到关键帧重建帧;其次,基于关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码;最后,非关键帧编码通过环路滤波网络进行重建后得到非关键帧重建帧。本申请采用与传统采用的视频压缩编码器相比,可以实现端到端全局优化视频编码器,在低码率下能够取得较好的编码性能。解决了如何利用深度神经网络实现端到端视频编码的同时保证较好的率失真性能的问题。Using the deep learning-based end-to-end video compression method, system, and storage medium in the embodiments of the present application, the target video is divided into multiple image groups; and then the key frames in the image group are subjected to end-to-end intra-coding to obtain Key frame coding: The key frame coding is reconstructed through the loop filter network to obtain the key frame reconstruction frame; secondly, based on the key frame reconstruction frame, the non-key frame in the image group is subjected to end-to-end inter-coding to obtain the non-key frame coding; and finally , The non-key frame encoding is reconstructed through the loop filter network to obtain the non-key frame reconstruction frame. Compared with the conventionally adopted video compression encoder, this application can realize an end-to-end global optimization video encoder, and can achieve better encoding performance at a low bit rate. It solves the problem of how to use deep neural networks to achieve end-to-end video encoding while ensuring better rate-distortion performance.
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:The drawings described here are used to provide a further understanding of the application and constitute a part of the application. The exemplary embodiments and descriptions of the application are used to explain the application, and do not constitute an improper limitation of the application. In the attached picture:
图1中示出了根据本申请实施例的一种基于深度学习的端到端视频压缩方法的步骤流程图;Fig. 1 shows a flowchart of the steps of an end-to-end video compression method based on deep learning according to an embodiment of the present application;
图2中示出了根据本申请实施例的基于端到端深度神经网络的视频压缩方法的框架图;FIG. 2 shows a framework diagram of a video compression method based on an end-to-end deep neural network according to an embodiment of the present application;
图3中示出了根据本申请实施例的图像组GOP的结构划分方法;FIG. 3 shows a method for dividing the structure of a group of pictures GOP according to an embodiment of the present application;
图4中示出了根据本申请实施例的端到端视频压缩方法的关键帧的帧内编码网络结构图;FIG. 4 shows the intra-frame coding network structure diagram of the key frame of the end-to-end video compression method according to an embodiment of the present application;
图5中示出了根据本申请实施例的端到端视频压缩方法的非关键帧的帧间编码框架图;FIG. 5 shows a non-key frame inter-frame coding framework diagram of an end-to-end video compression method according to an embodiment of the present application;
图6中示出了根据本申请实施例的帧内编码网络采用的Mask卷积的一种实施方法;Fig. 6 shows an implementation method of Mask convolution adopted by an intra-frame coding network according to an embodiment of the present application;
图7示出了根据本申请实施例的一种基于深度学习的端到端视频压缩系统的结构示意图。Fig. 7 shows a schematic structural diagram of an end-to-end video compression system based on deep learning according to an embodiment of the present application.
在实现本申请的过程中,发明人发现传统的混合编码框架及各个编码工具的局部率失真优化已经发展了半个世纪,在面临更高效的视频压缩时遭遇了新的挑战。而端到端视频编码框架能够突破传统框架局部优化的限制,通过建立起重建视频与原始视频的全局优化模型,并利用神经网络建模具有高维复杂解空间的率失真优化问题,从而实现视频编码框架的革新。常见的端到端视频编码技术主要通过设计整体可训练的网络分别用于视频编码帧内编码、帧间预测、残差编码和码率控制等模块。但是对应保证视频压缩框架的整体率失真性能仍然具有很大的挑战,因此亟需一种利用深度神经网络实现端到端视频编码的同 时可以保证较好的率失真性能的视频压缩方法及系统。In the process of realizing this application, the inventor found that the traditional hybrid coding framework and the local rate-distortion optimization of each coding tool have been developed for half a century, and they have encountered new challenges in the face of more efficient video compression. The end-to-end video coding framework can break through the limitations of local optimization of traditional frameworks. By establishing a global optimization model of reconstructed video and original video, and using neural networks to model the rate-distortion optimization problem with high-dimensional complex solution space, the video can be realized. The innovation of the coding framework. Common end-to-end video coding technologies are mainly used for video coding intra-frame coding, inter-frame prediction, residual coding, and rate control modules by designing an overall trainable network. However, guaranteeing the overall rate-distortion performance of the video compression framework still poses great challenges. Therefore, there is an urgent need for a video compression method and system that uses a deep neural network to achieve end-to-end video encoding while ensuring better rate-distortion performance.
针对上述问题,本申请实施例中提供了一种基于深度学习的端到端视频压缩方法、系统及存储介质,本申请提供的可以端到端训练的基于全卷积网络的视频压缩框架与传统采用的视频压缩编码器相比,可以实现端到端全局优化视频编码器,在低码率下能够取得较好的编码性能。解决了如何利用深度神经网络实现端到端视频编码的同时保证较好的率失真性能的问题。In response to the above-mentioned problems, the embodiments of this application provide an end-to-end video compression method, system, and storage medium based on deep learning. The full-convolutional network-based video compression framework provided by this application that can be trained end-to-end is similar to the traditional video compression framework. Compared with the adopted video compression encoder, it can achieve end-to-end global optimization of the video encoder, and can achieve better encoding performance at low bit rates. It solves the problem of how to use deep neural networks to achieve end-to-end video encoding while ensuring better rate-distortion performance.
本申请利用卷积神经网络和视频处理技术,首先将视频分为图像组(Groupofpictures,GOP)进行编码,对图像组GOP中经自适应选定的关键帧进行端到端帧内编码,并存储于解码缓存区;其次对于非关键帧编码,利用在解码缓存区中的已重构帧对每一个待编码帧进行基于深度网络的运动场估计,并用估计得到的运动信息生成帧间预测结果;最后对非关键帧的预测残差进行端到端残差编码;在视频重构存入解码缓存区时,关键帧和非关键帧均需要经过深度环路滤波模块进行重建。This application uses convolutional neural network and video processing technology. First, the video is divided into group of pictures (GOP) for encoding, and the adaptively selected key frames in the group of pictures GOP are encoded end-to-end and stored In the decoding buffer area; secondly, for non-key frame encoding, use the reconstructed frame in the decoding buffer area to estimate the motion field based on the depth network for each frame to be encoded, and use the estimated motion information to generate the inter-frame prediction result; and finally Perform end-to-end residual coding on the prediction residuals of non-key frames; when the video is reconstructed and stored in the decoding buffer, both the key frames and non-key frames need to be reconstructed through the deep loop filter module.
为了使本申请实施例中的技术方案及优点更加清楚明白,以下结合附图对本申请的示例性实施例进行进一步详细的说明,显然,所描述的实施例仅是本申请的一部分实施例,而不是所有实施例的穷举。需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。In order to make the technical solutions and advantages of the embodiments of the present application clearer, the exemplary embodiments of the present application will be described in further detail below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, and Not all examples are exhaustive. It should be noted that the embodiments in the application and the features in the embodiments can be combined with each other if there is no conflict.
实施例1Example 1
图1中示出了根据本申请实施例的一种基于深度学习的端到端视频压缩方法的步骤流程图。Fig. 1 shows a step flow chart of an end-to-end video compression method based on deep learning according to an embodiment of the present application.
如图1所示,本实施例的基于深度学习的端到端视频压缩方法,具体包括以下步骤:As shown in Fig. 1, the end-to-end video compression method based on deep learning in this embodiment specifically includes the following steps:
S101:将目标视频分为多个图像组;S101: Divide the target video into multiple image groups;
S102:对图像组中的关键帧进行端到端帧内编码得到关键帧编码;S102: Perform end-to-end intra-frame coding on the key frames in the image group to obtain key frame codes;
S103:关键帧编码通过环路滤波网络进行重建后得到关键帧重建帧;S103: After the key frame encoding is reconstructed through the loop filter network, the key frame reconstruction frame is obtained;
S104:基于关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码;S104: Perform end-to-end inter-coding of non-key frames in the image group based on the key-frame reconstruction frame to obtain non-key frame coding;
S105:非关键帧编码通过环路滤波网络进行重建后得到非关键帧重建帧。S105: The non-key frame encoding is reconstructed through the loop filter network to obtain a non-key frame reconstruction frame.
图2中示出了根据本申请实施例的基于端到端深度神经网络的视频压缩方法的框架图。Fig. 2 shows a framework diagram of a video compression method based on an end-to-end deep neural network according to an embodiment of the present application.
如图2所示,在本申请的压缩框架中,视频可以通过图像组GOP的方式被端到端的深度神经网络视频编码框架所压缩。首先对于GOP中的关键帧,采用基于高斯超先验分布的自编码架构进行压缩,并将压缩后的关键帧在进行基于深度卷积网络的环路滤波模块(CNN Loop Filter)后缓存至解码缓冲区(DecodedPictureBuffer,DPB)中。As shown in Figure 2, in the compression framework of the present application, the video can be compressed by the end-to-end deep neural network video coding framework by means of a group of pictures GOP. First, for the key frames in the GOP, the self-encoding architecture based on the Gaussian super prior distribution is used for compression, and the compressed key frames are buffered to the decoding after the deep convolutional network-based loop filter module (CNN Loop Filter) Buffer (DecodedPictureBuffer, DPB).
图3中示出了根据本申请实施例的图像组GOP的结构划分方法。Fig. 3 shows a method for dividing the structure of a group of pictures GOP according to an embodiment of the present application.
如图3所示,本发明中关键帧被设置为图像组GOP的第一帧。As shown in Fig. 3, the key frame in the present invention is set as the first frame of the GOP of the group of pictures.
其它的,关键帧可以是GOP中的第一帧,也可以是非第一帧;再使用带有超先验结构的自编码器网络的方法对该关键帧进行编码,自编码器种类为高斯分布、混合高斯分布及拉普拉斯分布等。In addition, the key frame can be the first frame in the GOP, or it can be the non-first frame; then use the method of the autoencoder network with super a priori structure to encode the key frame, and the autoencoder type is Gaussian distribution , Mixture of Gaussian distribution and Laplace distribution, etc.
图4中示出了根据本申请实施例的端到端视频压缩方法的关键帧的帧内编码网络结构图。Fig. 4 shows the intra-frame coding network structure diagram of the key frame of the end-to-end video compression method according to an embodiment of the present application.
如图4所示,对图像组中的关键帧进行端到端帧内编码得到关键帧编码,具体采用基于超先验模型网络的端到端自编码器结构帧内编码框架,同时对自编码器的瓶颈层设计了上下文建模框架。As shown in Figure 4, the end-to-end intra-encoding of the key frames in the image group is performed to obtain the key-frame encoding. The end-to-end autoencoder structure based on the super-prior model network is used to obtain the key frame encoding. The bottleneck layer of the server is designed with a context modeling framework.
本申请对采用端到端的训练方式,目标是得到与输入图像x在信号层面高度相似的输出图像
对于输入图像x,该自编码器将图像编码成一个隐变量y,
This application adopts an end-to-end training method, and the goal is to obtain an output image that is highly similar to the input image x at the signal level For the input image x, the autoencoder encodes the image into a hidden variable y,
y=Enc(x)y=Enc(x)
本方案假设该隐变量y的先验分布为服从均值μ,方差为σ的正态分布,This scheme assumes that the prior distribution of the hidden variable y is a normal distribution that obeys the mean μ and the variance is σ,
y~N(μ,σ),y~N(μ,σ),
其中,均值μ和方差σ是根据超先验自编码器,通过端到端学习得到,具体为:Among them, the mean μ and the variance σ are obtained through end-to-end learning according to the super-prior autoencoder, specifically:
z=Hyper
Enc(y),
z=Hyper Enc(y) ,
Z为自编码器的码字,
为经过量化后的超先验自编码器的码字,
为超先验正太分布的初步参数。
Z is the codeword of the autoencoder, Is the codeword of the super-prior self-encoder after quantization, It is the preliminary parameter of the super-prior normal distribution.
不仅如此,在通过超先验自编码结构的输出后,本发明同时采用基于PixelCNN上下文建模方法对超先验自编码结构的结果进行提升处理,如图6所示,使用Mask的5x5卷积,输出为最终的超先验分布的参数。Not only that, after passing the output of the super-prior self-encoding structure, the present invention also uses the PixelCNN context-based modeling method to upgrade the result of the super-prior self-encoding structure, as shown in Figure 6, using Mask’s 5x5 convolution , The output is the final super-prior distribution parameters.
因此帧内编码框架在训练时的目标函数
公式如下:
Therefore, the objective function of the intra-frame coding framework during training The formula is as follows:
S103以及S105中,关于环路滤波,对于已编码的每一帧关键帧和非关键帧图像,都进行基于全卷积网络的环路滤波模块处理,从而提升主观与客观重建效果。In S103 and S105, regarding loop filtering, for each key frame and non-key frame image that has been coded, a loop filtering module based on a full convolution network is processed to improve the subjective and objective reconstruction effect.
具体的,对已编码的重建图像为x
rec,建立于其原始图像x之间的端到端全卷积映射,通过使用具有全局残差结构的九层卷积神经网络处理该重建图像,并得到最终的重建图像, 同时存放于解码缓存区中。
Specifically, the encoded reconstructed image is x rec , which is based on an end-to-end full convolutional mapping between the original images x, and the reconstructed image is processed by using a nine-layer convolutional neural network with a global residual structure, and The final reconstructed image is obtained and stored in the decoding buffer area at the same time.
Further, the loop filtering network is trained with an L2 loss. The specific formula of the loop filtering loss is:

L2 = (1/n) Σ_{i=1}^{n} || f(x_rec,i) − x_i ||²

where x_rec denotes the input encoded image, x is the ground-truth label corresponding to the encoded image, f is the loop filtering network, and n denotes the number of frames. Using the L2 loss effectively preserves the fidelity of the data.
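In training code this loss reduces to a mean squared error over a batch of frames, for example (a sketch assuming the LoopFilter module above):

```python
import torch.nn.functional as F

def loop_filter_loss(loop_filter, x_rec, x):
    """L2 loss between filtered encoded frames and their originals;
    the batch dimension of x_rec plays the role of the frame count n."""
    return F.mse_loss(loop_filter(x_rec), x)
```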
In S102, performing end-to-end inter-frame coding on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain the non-key frame encoding specifically includes:

performing motion field estimation on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain motion field information;

obtaining inter prediction information of the non-key frames according to the motion field information;

performing prediction residual coding according to the inter prediction information of the non-key frames and the non-key frames themselves.
Regarding non-key frame encoding, this application uses the already encoded frames in the decoded picture buffer (DPB) to generate the motion field information of the current non-key frame, and uses this information to texture-align the frames in the DPB, thereby obtaining the prediction information of the current frame. The prediction residual is then encoded by an autoencoder structure, and the autoencoder's bottleneck layer is written into the bitstream. As with key frame encoding, each non-key frame is also processed by the loop filtering module to improve reconstruction quality.
Specifically, the video motion features of the motion field information include video motion field information and texture motion features. Representations of video motion features include, but are not limited to: optical flow fields, motion vector fields, disparity vector fields, and inter-frame gradient fields. The video motion feature extraction method is specifically a method for extracting motion features between video frames; the extraction method corresponds to the chosen representation, and includes, but is not limited to, deep learning based methods such as optical flow models and traditional gradient-based extraction methods.
Fig. 5 shows the inter-frame coding framework for non-key frames in the end-to-end video compression method according to an embodiment of the present application.
Specifically, this application encodes non-key frames in two main steps: predicted frame generation and prediction residual coding.

1. Predicted frame generation:

First, motion field estimation is performed on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain the motion field information, which specifically includes:
When only one key frame reconstruction frame is available, the motion field information must be encoded by an autoencoder and written into the bitstream. The motion field information flow_1 is computed as:

flow_1 = Flownet(f_{t-1});
When the number of key frame reconstruction frames is greater than one, the two reconstruction frames closest to the current non-key frame are used to derive the motion field information; in this case, the motion field information does not need to be written into the bitstream. The motion field information flow_2 is computed as:

flow_2 = Flownet(f_{t-2}, f_{t-1});
where f_1 is an available key frame reconstruction frame and Flownet is an optical flow prediction network.
The structure of the non-key frame prediction network is shown in Fig. 5: the already encoded frames are fetched from the decoding buffer, and the two nearest encoded frames are used to predict the currently coded non-key frame; the prediction applies an optical flow network (Flownet) to the encoded frames in the decoding buffer.
Further, when the decoding buffer contains only one frame, the video motion feature information is written into the bitstream; when the decoding buffer contains more than one frame, the video motion feature information is not written into the bitstream.
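The two cases can be summarized in the following sketch; dpb (the decoded picture buffer, newest frame last), flownet and the assumed flow_codec autoencoder are illustrative names, not components fixed by the description:

```python
def estimate_motion_field(dpb, flownet, flow_codec):
    """Return the motion field for the current non-key frame and the
    bits to transmit (None when the decoder can derive the flow)."""
    if len(dpb) == 1:
        # single reference frame: flow_1 = Flownet(f_{t-1}),
        # encoded and written into the bitstream
        flow = flownet(dpb[-1])
        bits = flow_codec.encode(flow)
        return flow_codec.decode(bits), bits
    # two or more references: flow_2 = Flownet(f_{t-2}, f_{t-1});
    # the decoder repeats this computation, so nothing is transmitted
    return flownet(dpb[-2], dpb[-1]), None
```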
Secondly, the inter prediction information of the non-key frame, i.e. the predicted frame, is generated from the motion field information. Specifically, the inter prediction signal of the non-key frame is generated from the video motion features of the motion field information and the reconstructed frames in the decoding buffer, using interpolation and image processing techniques. The inter prediction signal Frame_pred is computed as:

Frame_pred = Warp(f_{t-1}, flow);
where Warp is a polynomial interpolation method, f_1 is an available key frame reconstruction frame, and flow is the motion field information of the non-key frame.
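A common realization of Warp is backward warping with bilinear sampling, sketched below; bilinear interpolation stands in for the polynomial interpolation named above and is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp a reference frame (N,C,H,W) toward the current frame using
    a flow field (N,2,H,W) of (dx, dy) displacements."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device),
                            indexing="ij")
    base = torch.stack((xs, ys)).float()   # (2,H,W) pixel coordinates
    coords = base.unsqueeze(0) + flow      # displaced sampling positions
    # normalize to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)   # (N,H,W,2)
    return F.grid_sample(frame, grid, align_corners=True)
```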
2. Prediction residual coding: after predictive coding, every non-key frame in the group of pictures also passes through a non-key frame residual coding module, whose input is the residual between the original non-key frame signal and the prediction signal.
Specifically, computing the prediction residual and encoding it according to the inter prediction information of the non-key frame and the non-key frame itself includes the following. The prediction residual Frame_Resi is computed as:

Frame_Resi = Frame − Frame_pred;
where Frame is the original signal of the current non-key frame and Frame_pred is the inter prediction signal.

The prediction residual Frame_Resi is compression-coded by an autoencoder structure composed of fully convolutional networks, and its bottleneck layer is entropy-coded and written into the bitstream.
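Putting the two steps together, the residual path of a non-key frame can be sketched as follows; resi_codec, an autoencoder whose bottleneck feeds an entropy coder, is an assumed interface:

```python
def encode_non_key_frame(frame, frame_pred, resi_codec):
    """Residual coding round trip for one non-key frame."""
    frame_resi = frame - frame_pred        # Frame_Resi = Frame - Frame_pred
    bits = resi_codec.encode(frame_resi)   # bottleneck, entropy-coded bitstream
    resi_hat = resi_codec.decode(bits)     # decoded residual
    return bits, frame_pred + resi_hat     # bits and pre-filter reconstruction
```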
Further, in S105, a non-key frame likewise needs to be reconstructed through the loop filtering network to obtain the non-key frame reconstruction frame Frame_Rec, i.e. the sum of the inter prediction signal and the decoded residual is passed through the loop filtering network. The final reconstructed non-key frame is thereby obtained and stored in the decoding buffer.
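Continuing the sketch above, reconstruction and buffering of a non-key frame then amount to:

```python
def reconstruct_non_key_frame(frame_pred, resi_hat, loop_filter, dpb):
    """Frame_Rec: loop-filter the prediction plus decoded residual,
    then store the result in the decoding buffer."""
    frame_rec = loop_filter(frame_pred + resi_hat)
    dpb.append(frame_rec)
    return frame_rec
```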
The non-key frame prediction residual coding method of this application specifically uses a pre-trained autoencoder network model designed for the task: the residual between the original signal of the non-key frame and its prediction signal is fed to the generation network to obtain the reconstructed residual, completing the compressed image reconstruction.
In the loop filtering method of this end-to-end video compression framework, when key frames and non-key frames are encoded and finally reconstructed, a convolutional neural network based loop filter, designed and trained for the task, takes the unfiltered key frame or non-key frame as input, and the result is stored in the decoding buffer.
In the bitstream organization of the end-to-end video compression framework, the overall bitstream consists of the bitstreams of multiple groups of pictures (GOPs). The bitstream of each GOP consists of key frame and non-key frame bitstreams: the key frame bitstream comprises the autoencoder bottleneck layer bitstream, and the non-key frame bitstream comprises the motion field information and its prediction residual bitstream.
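The bitstream organization described above can be illustrated with the following sketch; the type and field names are assumptions chosen for readability:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NonKeyFrameStream:
    flow_bits: Optional[bytes]  # None when the flow is derived at the decoder
    residual_bits: bytes        # entropy-coded residual bottleneck

@dataclass
class GopStream:
    key_frame_bits: bytes       # autoencoder bottleneck of the key frame
    non_key_frames: List[NonKeyFrameStream] = field(default_factory=list)

# the overall bitstream is the concatenation of the GOP bitstreams
video_stream: List[GopStream] = []
```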
The deep learning based end-to-end video compression method of this application specifically includes a deep learning method, a video motion feature extraction method, an end-to-end video compression method, and a video reconstruction method. The end-to-end video coding framework breaks through the local-optimization limits of traditional frameworks, establishes a global optimization model between the reconstructed video and the original video, and uses neural networks to model the rate-distortion optimization problem with its high-dimensional, complex solution space, thereby innovating the video coding framework.
The deep learning method used for end-to-end video compression is specifically one based on fully convolutional network models; deep learning based methods include, but are not limited to, variational autoencoders, generative adversarial networks, and their variants and combinations.
The deep learning based video coding technology of this application aims to use multi-layer deep nonlinear transforms to extract high-level abstract features of the data, together with the inverse process, so as to obtain the optimal prediction signal for video coding, while end-to-end residual coding guarantees the rate-distortion performance of the overall framework. Finally, a supervised training method optimizes the rate-distortion function, which includes the data fidelity term of the reconstructed video and the additional cost required to encode the residual.
Embodiment 2
Fig. 7 shows a schematic structural diagram of a deep learning based end-to-end video compression system according to an embodiment of the present application.
As shown in Fig. 7, the deep learning based end-to-end video compression system provided by this embodiment specifically includes:

Group of pictures module 10: used to divide the target video into multiple groups of pictures;

Key frame encoding module 20: used to perform end-to-end intra-frame coding on the key frames in a group of pictures to obtain the key frame encoding;

Key frame reconstruction frame module 30: used to reconstruct the key frame encoding through the loop filtering network to obtain the key frame reconstruction frame, which is stored in the decoding buffer;

Non-key frame encoding module 40: used to perform end-to-end inter-frame coding on the non-key frames in the group of pictures based on the key frame reconstruction frame in the decoding buffer to obtain the non-key frame encoding;

Non-key frame reconstruction frame module 50: used to reconstruct the non-key frame encoding through the loop filtering network to obtain the non-key frame reconstruction frame, which is stored in the decoding buffer.
In the non-key frame encoding module 40, performing end-to-end inter-frame coding on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain the non-key frame encoding specifically includes:

performing motion field estimation on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain motion field information;

obtaining inter prediction information of the non-key frames according to the motion field information;

performing prediction residual coding according to the inter prediction information of the non-key frames and the non-key frames themselves.
Both the key frame reconstruction frame module 30 and the non-key frame reconstruction frame module 50 in the end-to-end video compression framework include a loop filter. When key frames and non-key frames are encoded and finally reconstructed, a convolutional neural network based loop filter, designed and trained for the task, is used: the unfiltered key frame or non-key frame is input to the loop filter, and the output is stored in the decoding buffer.
This embodiment also provides a computer-readable storage medium on which a computer program is stored; the computer program is executed by a processor to implement the deep learning based end-to-end video compression method provided by any of the above.
This application proposes a video compression framework based on an end-to-end deep neural network. The video is first organized into multiple groups of pictures; the key frame images in each group are intra-coded, and the non-key frame images are inter-coded. Intra-frame coding uses an autoencoding structure based on a hyperprior, combined with an autoregressive model for context modeling; inter-frame coding uses motion-field-derived prediction and residual coding. The encoder architecture can thus be optimized end to end as a whole. Deriving the motion field for inter-frame coding avoids transmitting large amounts of inter-frame motion information, greatly saving bitrate, while a deep network based loop filter is used during reconstruction to improve reconstruction performance. Compared with traditional encoders, the proposed method can globally optimize the video encoder end to end without transmitting motion information for inter prediction, and achieves better coding performance at low bitrates.
Based on the same inventive concept, an embodiment of the present application further provides a computer program product. Since the principle by which the computer program product solves the problem is similar to the method provided in Embodiment 1 of the present application, the implementation of the computer program product may refer to the implementation of the method, and repeated description is omitted.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present application.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
Claims (10)
- A deep learning based end-to-end video compression method, characterized in that it comprises the following steps: dividing a target video into multiple groups of pictures; performing end-to-end intra-frame coding on the key frames in the groups of pictures to obtain a key frame encoding; reconstructing the key frame encoding through a loop filtering network to obtain a key frame reconstruction frame; performing end-to-end inter-frame coding on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain a non-key frame encoding; and reconstructing the non-key frame encoding through the loop filtering network to obtain a non-key frame reconstruction frame.
- The deep learning based end-to-end video compression method according to claim 1, characterized in that performing end-to-end inter-frame coding on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain the non-key frame encoding specifically comprises: performing motion field estimation on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain motion field information; obtaining inter prediction information of the non-key frames according to the motion field information; and performing prediction residual coding according to the inter prediction information of the non-key frames and the non-key frames.
- The deep learning based end-to-end video compression method according to claim 1, characterized in that performing end-to-end intra-frame coding on the key frames in the groups of pictures to obtain the key frame encoding specifically adopts an end-to-end autoencoder intra-frame coding framework based on a hyperprior model network, and context modeling is performed on the bottleneck layer of the autoencoder.
- The deep learning based end-to-end video compression method according to claim 3, characterized in that the objective function of the intra-frame coding framework during training is a rate-distortion loss, where y is the latent variable encoding the image, y = Enc(x); the prior distribution of the latent variable y is a normal distribution with mean μ and variance σ, y ~ N(μ, σ); the mean μ and variance σ are obtained through end-to-end learning with a hyperprior autoencoder, specifically: z = HyperEnc(y); where ẑ is the quantized codeword of the hyperprior autoencoder, from which the preliminary parameters of the hyperprior normal distribution are decoded; and the result of the hyperprior autoencoding structure is refined using PixelCNN-based context modeling.
- The deep learning based end-to-end video compression method according to claim 1, characterized in that the loop filtering network is based on a fully convolutional network and is trained with an L2 loss; the specific formula of the loop filtering loss is: L2 = (1/n) Σ_{i=1}^{n} || f(x_rec,i) − x_i ||², where x_rec denotes the input encoded image, x is the ground-truth label corresponding to the encoded image, f is the loop filtering network, and n denotes the number of frames.
- The deep learning based end-to-end video compression method according to claim 2, characterized in that performing motion field estimation on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain the motion field information specifically comprises: when there is only one key frame reconstruction frame, the motion field information needs to be encoded by an autoencoder and written into the bitstream, and the motion field information flow_1 is computed as: flow_1 = Flownet(f_{t-1}); when the number of key frame reconstruction frames is greater than one, the two reconstruction frames closest to the current non-key frame are used to derive the motion field information, which in this case does not need to be written into the bitstream, and the motion field information flow_2 is computed as: flow_2 = Flownet(f_{t-2}, f_{t-1}); where f_1 is an available key frame reconstruction frame and Flownet is an optical flow prediction network.
- The deep learning based end-to-end video compression method according to claim 2, characterized in that obtaining the inter prediction information of the non-key frames according to the motion field information specifically comprises: generating the inter prediction signal of the non-key frame from the video motion features of the motion field information and the reconstructed frames in the decoding buffer through interpolation and image processing techniques, the inter prediction signal Frame_pred being computed as: Frame_pred = Warp(f_{t-1}, flow); where Warp is a polynomial interpolation method, f_1 is an available key frame reconstruction frame, and flow is the motion field information of the non-key frame.
- The deep learning based end-to-end video compression method according to claim 2, characterized in that computing the prediction residual and performing prediction residual coding according to the inter prediction information of the non-key frame and the non-key frame specifically comprises: the prediction residual Frame_Resi is computed as: Frame_Resi = Frame − Frame_pred; where Frame is the original signal of the current non-key frame and Frame_pred is the inter prediction signal; and the prediction residual Frame_Resi is compression-coded by an autoencoder structure composed of fully convolutional networks, its bottleneck layer being entropy-coded and written into the bitstream.
- A deep learning based end-to-end video compression system, characterized in that it specifically comprises: a group of pictures module for dividing a target video into multiple groups of pictures; a key frame encoding module for performing end-to-end intra-frame coding on the key frames in the groups of pictures to obtain a key frame encoding; a key frame reconstruction frame module for reconstructing the key frame encoding through a loop filtering network to obtain a key frame reconstruction frame; a non-key frame encoding module for performing end-to-end inter-frame coding on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain a non-key frame encoding; and a non-key frame reconstruction frame module for reconstructing the non-key frame encoding through the loop filtering network to obtain a non-key frame reconstruction frame.
- A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program is executed by a processor to implement the deep learning based end-to-end video compression method according to any one of claims 1-8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010104772.5 | 2020-02-20 | ||
CN202010104772.5A CN111405283B (en) | 2020-02-20 | 2020-02-20 | End-to-end video compression method, system and storage medium based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021164176A1 true WO2021164176A1 (en) | 2021-08-26 |
Family
ID=71428456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/099445 WO2021164176A1 (en) | 2020-02-20 | 2020-06-30 | End-to-end video compression method and system based on deep learning, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111405283B (en) |
WO (1) | WO2021164176A1 (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113709504A (en) * | 2021-10-27 | 2021-11-26 | 深圳传音控股股份有限公司 | Image processing method, intelligent terminal and readable storage medium |
CN114363617A (en) * | 2022-03-18 | 2022-04-15 | 武汉大学 | Network lightweight video stream transmission method, system and equipment |
CN114513658A (en) * | 2022-01-04 | 2022-05-17 | 聚好看科技股份有限公司 | Video loading method, device, equipment and medium |
CN114584780A (en) * | 2022-03-03 | 2022-06-03 | 上海交通大学 | Image coding, decoding and compressing method based on depth Gaussian process regression |
CN114630129A (en) * | 2022-02-07 | 2022-06-14 | 浙江智慧视频安防创新中心有限公司 | Video coding and decoding method and device based on intelligent digital retina |
CN114858455A (en) * | 2022-05-25 | 2022-08-05 | 合肥工业大学 | Rolling bearing fault diagnosis method and system based on improved GAN-OSNet |
CN114926555A (en) * | 2022-03-25 | 2022-08-19 | 江苏预立新能源科技有限公司 | Intelligent data compression method and system for security monitoring equipment |
CN115049541A (en) * | 2022-07-14 | 2022-09-13 | 广州大学 | Reversible gray scale method, system and device based on neural network and image steganography |
CN115278249A (en) * | 2022-06-27 | 2022-11-01 | 北京大学 | Video block-level rate-distortion optimization method and system based on visual self-attention network |
CN115529457A (en) * | 2022-09-05 | 2022-12-27 | 清华大学 | Video compression method and device based on deep learning |
WO2023241188A1 (en) * | 2022-06-13 | 2023-12-21 | 北华航天工业学院 | Data compression method for quantitative remote sensing application of unmanned aerial vehicle |
CN117915096A (en) * | 2023-12-14 | 2024-04-19 | 北京大兴经济开发区开发经营有限公司 | Target identification high-precision high-resolution video coding method and system for AI large model |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114257818B (en) * | 2020-09-22 | 2024-09-24 | 阿里巴巴达摩院(杭州)科技有限公司 | Video encoding and decoding methods, devices, equipment and storage medium |
CN112203093B (en) * | 2020-10-12 | 2022-07-01 | 苏州天必佑科技有限公司 | Signal processing method based on deep neural network |
CN112866697B (en) * | 2020-12-31 | 2022-04-05 | 杭州海康威视数字技术股份有限公司 | Video image coding and decoding method and device, electronic equipment and storage medium |
CN115037936A (en) * | 2021-03-04 | 2022-09-09 | 华为技术有限公司 | Video coding and decoding method and device |
CN113179403B (en) * | 2021-03-31 | 2023-06-06 | 宁波大学 | Underwater video object coding method based on deep learning reconstruction |
CN113382247B (en) * | 2021-06-09 | 2022-10-18 | 西安电子科技大学 | Video compression sensing system and method based on interval observation, equipment and storage medium |
CN115604486A (en) * | 2021-07-09 | 2023-01-13 | 华为技术有限公司(Cn) | Video image coding and decoding method and device |
WO2023051653A1 (en) * | 2021-09-29 | 2023-04-06 | Beijing Bytedance Network Technology Co., Ltd. | Method, apparatus, and medium for video processing |
CN114386595B (en) * | 2021-12-24 | 2023-07-28 | 西南交通大学 | SAR image compression method based on super prior architecture |
CN114095728B (en) * | 2022-01-21 | 2022-07-15 | 浙江大华技术股份有限公司 | End-to-end video compression method, device and computer readable storage medium |
CN115022637A (en) * | 2022-04-26 | 2022-09-06 | 华为技术有限公司 | Image coding method, image decompression method and device |
CN116939210B (en) * | 2023-09-13 | 2023-11-17 | 瀚博半导体(上海)有限公司 | Image compression method and device based on self-encoder |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180124415A1 (en) * | 2016-05-06 | 2018-05-03 | Magic Pony Technology Limited | Encoder pre-analyser |
CN109151475A (en) * | 2017-06-27 | 2019-01-04 | 杭州海康威视数字技术股份有限公司 | A kind of method for video coding, coding/decoding method, device and electronic equipment |
US20190306526A1 (en) * | 2018-04-03 | 2019-10-03 | Electronics And Telecommunications Research Institute | Inter-prediction method and apparatus using reference frame generated based on deep learning |
CN110349141A (en) * | 2019-07-04 | 2019-10-18 | 复旦大学附属肿瘤医院 | A kind of breast lesion localization method and system |
CN110443173A (en) * | 2019-07-26 | 2019-11-12 | 华中科技大学 | A kind of instance of video dividing method and system based on inter-frame relation |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108921789A (en) * | 2018-06-20 | 2018-11-30 | 华北电力大学 | Super-resolution image reconstruction method based on recurrence residual error network |
US10999606B2 (en) * | 2019-01-08 | 2021-05-04 | Intel Corporation | Method and system of neural network loop filtering for video coding |
CN110351568A (en) * | 2019-06-13 | 2019-10-18 | 天津大学 | A kind of filtering video loop device based on depth convolutional network |
- 2020-02-20: CN CN202010104772.5A, patent CN111405283B/en, status: active
- 2020-06-30: WO PCT/CN2020/099445, patent WO2021164176A1/en, application filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180124415A1 (en) * | 2016-05-06 | 2018-05-03 | Magic Pony Technology Limited | Encoder pre-analyser |
CN109151475A (en) * | 2017-06-27 | 2019-01-04 | 杭州海康威视数字技术股份有限公司 | A kind of method for video coding, coding/decoding method, device and electronic equipment |
US20190306526A1 (en) * | 2018-04-03 | 2019-10-03 | Electronics And Telecommunications Research Institute | Inter-prediction method and apparatus using reference frame generated based on deep learning |
CN110349141A (en) * | 2019-07-04 | 2019-10-18 | 复旦大学附属肿瘤医院 | A kind of breast lesion localization method and system |
CN110443173A (en) * | 2019-07-26 | 2019-11-12 | 华中科技大学 | A kind of instance of video dividing method and system based on inter-frame relation |
Non-Patent Citations (2)
Title |
---|
DAVID MINNEN; JOHANNES BALLE; GEORGE TODERICI: "Joint Autoregressive and Hierarchical Priors for Learned Image Compression", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 September 2018 (2018-09-08), 201 Olin Library Cornell University Ithaca, NY 14853, XP081188741 * |
DJELOUAH ABDELAZIZ; CAMPOS JOAQUIM; SCHAUB-MEYER SIMONE; SCHROERS CHRISTOPHER: "Neural Inter-Frame Compression for Video Coding", 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 27 October 2019 (2019-10-27), pages 6420 - 6428, XP033723542, DOI: 10.1109/ICCV.2019.00652 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113709504A (en) * | 2021-10-27 | 2021-11-26 | 深圳传音控股股份有限公司 | Image processing method, intelligent terminal and readable storage medium |
CN114513658B (en) * | 2022-01-04 | 2024-04-02 | 聚好看科技股份有限公司 | Video loading method, device, equipment and medium |
CN114513658A (en) * | 2022-01-04 | 2022-05-17 | 聚好看科技股份有限公司 | Video loading method, device, equipment and medium |
CN114630129A (en) * | 2022-02-07 | 2022-06-14 | 浙江智慧视频安防创新中心有限公司 | Video coding and decoding method and device based on intelligent digital retina |
CN114584780A (en) * | 2022-03-03 | 2022-06-03 | 上海交通大学 | Image coding, decoding and compressing method based on depth Gaussian process regression |
CN114363617A (en) * | 2022-03-18 | 2022-04-15 | 武汉大学 | Network lightweight video stream transmission method, system and equipment |
CN114926555A (en) * | 2022-03-25 | 2022-08-19 | 江苏预立新能源科技有限公司 | Intelligent data compression method and system for security monitoring equipment |
CN114926555B (en) * | 2022-03-25 | 2023-10-24 | 江苏预立新能源科技有限公司 | Intelligent compression method and system for security monitoring equipment data |
CN114858455A (en) * | 2022-05-25 | 2022-08-05 | 合肥工业大学 | Rolling bearing fault diagnosis method and system based on improved GAN-OSNet |
WO2023241188A1 (en) * | 2022-06-13 | 2023-12-21 | 北华航天工业学院 | Data compression method for quantitative remote sensing application of unmanned aerial vehicle |
CN115278249A (en) * | 2022-06-27 | 2022-11-01 | 北京大学 | Video block-level rate-distortion optimization method and system based on visual self-attention network |
CN115049541A (en) * | 2022-07-14 | 2022-09-13 | 广州大学 | Reversible gray scale method, system and device based on neural network and image steganography |
CN115049541B (en) * | 2022-07-14 | 2024-05-07 | 广州大学 | Reversible gray scale method, system and device based on neural network and image steganography |
CN115529457A (en) * | 2022-09-05 | 2022-12-27 | 清华大学 | Video compression method and device based on deep learning |
CN115529457B (en) * | 2022-09-05 | 2024-05-14 | 清华大学 | Video compression method and device based on deep learning |
CN117915096A (en) * | 2023-12-14 | 2024-04-19 | 北京大兴经济开发区开发经营有限公司 | Target identification high-precision high-resolution video coding method and system for AI large model |
Also Published As
Publication number | Publication date |
---|---|
CN111405283B (en) | 2022-09-02 |
CN111405283A (en) | 2020-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021164176A1 (en) | End-to-end video compression method and system based on deep learning, and storage medium | |
Liu et al. | A unified end-to-end framework for efficient deep image compression | |
CN106973293B (en) | Light field image coding method based on parallax prediction | |
Golinski et al. | Feedback recurrent autoencoder for video compression | |
CN101049006B (en) | Image coding method and apparatus, and image decoding method and apparatus | |
CN112203093B (en) | Signal processing method based on deep neural network | |
CN108921910B (en) | JPEG coding compressed image restoration method based on scalable convolutional neural network | |
KR20200114436A (en) | Apparatus and method for performing scalable video decoing | |
CN110602494A (en) | Image coding and decoding system and method based on deep learning | |
CN111294604B (en) | Video compression method based on deep learning | |
CN109688407B (en) | Reference block selection method and device for coding unit, electronic equipment and storage medium | |
CN110062239B (en) | Reference frame selection method and device for video coding | |
CN101883284B (en) | Video encoding/decoding method and system based on background modeling and optional differential mode | |
CN106937112A (en) | Bit rate control method based on H.264 video compression standard | |
CN113132735A (en) | Video coding method based on video frame generation | |
WO2023082834A1 (en) | Video compression method and apparatus, and computer device and storage medium | |
TWI489876B (en) | A Multi - view Video Coding Method That Can Save Decoding Picture Memory Space | |
Sun et al. | High-quality single-model deep video compression with frame-conv3d and multi-frame differential modulation | |
CN113068041B (en) | Intelligent affine motion compensation coding method | |
Zhao et al. | A universal optimization framework for learning-based image codec | |
Wang et al. | Learning to fuse residual and conditional information for video compression and reconstruction | |
CN112954350B (en) | Video post-processing optimization method and device based on frame classification | |
Li et al. | 3D tensor auto-encoder with application to video compression | |
CN114222124B (en) | Encoding and decoding method and device | |
Dhungel et al. | An Efficient Video Compression Network |
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20919425; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20919425; Country of ref document: EP; Kind code of ref document: A1