WO2021164176A1 - End-to-end video compression method and system based on deep learning, and storage medium - Google Patents
End-to-end video compression method and system based on deep learning, and storage medium
- Publication number
- WO2021164176A1 (PCT/CN2020/099445)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- key
- key frame
- encoding
- coding
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/157—Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
- H04N19/159—Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/146—Data rate or code amount at the encoder output
- H04N19/147—Data rate or code amount at the encoder output according to rate distortion criteria
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/177—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a group of pictures [GOP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
- H04N19/21—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with binary alpha-plane coding for video objects, e.g. context-based arithmetic encoding [CAE]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/80—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
- H04N19/82—Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
Description
- This application belongs to the technical field of digital signal processing, and specifically relates to an end-to-end video compression method, system and storage medium based on deep learning.
- Video compression, also known as video coding, aims to eliminate the redundant information present between video signals. With the continuous development of multimedia digital video applications and the growing demand for video cloud computing, the data volume of raw video sources has exceeded what existing transmission bandwidth and storage resources can bear, so only encoded and compressed video is suitable for transmission over networks; video coding technology has therefore become one of the hot topics in academic research and industrial applications worldwide.
- In recent years, image coding methods based on deep neural networks have become a research hotspot in the coding field. Such methods model an auto-encoder structure end to end, optimize an image-reconstruction loss function, and use an entropy-estimation model to approximate the codeword distribution of the bottleneck layer of the auto-encoder, thereby realizing rate-distortion optimization. On this basis, the entropy-estimation model has been continuously improved: probability models based on Gaussian mixtures and on a Gaussian hyper-prior distribution have been proposed and combined with the auto-regressive PixelCNN framework to build a context model over the bottleneck-layer codewords. The objective function of this type of end-to-end image compression can be expressed as

  L = E[-log2 p(ŷ)] + λ·||x − x̂||² + C,

  where x and x̂ respectively denote the original and reconstructed pixels, y and ŷ respectively denote the unquantized and quantized codewords of the bottleneck layer, and C is a constant.
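- As a worked illustration only (not the patent's code), the following Python sketch evaluates a rate-distortion objective of the stated form; the reconstruction x_hat and the per-codeword likelihoods are assumed to come from some auto-encoder and entropy model, and all names are illustrative:

```python
# A worked sketch of the objective L = E[-log2 p(y_hat)] + lambda * ||x - x_hat||^2 + C.
# x_hat and y_hat_likelihood are assumed to be produced elsewhere by an
# auto-encoder and its entropy model; names are illustrative, not the patent's.
import torch

def rd_objective(x: torch.Tensor, x_hat: torch.Tensor,
                 y_hat_likelihood: torch.Tensor,
                 lam: float = 0.01, C: float = 0.0) -> torch.Tensor:
    # rate term: average codeword cost in bits, -log2 p(y_hat)
    rate = -torch.log2(y_hat_likelihood.clamp_min(1e-9)).mean()
    # distortion term: signal-level fidelity between original and reconstruction
    distortion = torch.mean((x - x_hat) ** 2)
    return rate + lam * distortion + C
```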
- End-to-end neural networks are of great significance for video compression. The traditional hybrid coding framework, with local rate-distortion optimization of its individual coding tools, has been developed for half a century and faces new challenges as more efficient video compression is demanded. Common end-to-end video coding techniques mainly work by designing holistically trainable networks for individual modules such as intra-frame coding, inter-frame prediction, residual coding, and rate control. However, guaranteeing the overall rate-distortion performance of the video compression framework remains a major challenge, so it is crucial to design and develop a video compression method and system that uses deep neural networks to realize end-to-end video coding while ensuring good rate-distortion performance.
- The present invention proposes an end-to-end video compression method, system, and storage medium based on deep learning, aiming to solve the problem that good rate-distortion performance cannot be guaranteed by video compression coding in the prior art.
- According to the first aspect of the embodiments of the present application, an end-to-end video compression method based on deep learning is provided, including the following steps:
- dividing the target video into multiple groups of pictures;
- performing end-to-end intra-frame encoding on the key frames in each group of pictures to obtain the key-frame code;
- reconstructing the key-frame code through the loop filter network to obtain the key-frame reconstructed frame;
- performing end-to-end inter-frame encoding on the non-key frames in the group of pictures based on the key-frame reconstructed frame to obtain the non-key-frame code;
- reconstructing the non-key-frame code through the loop filter network to obtain the non-key-frame reconstructed frame.
- Optionally, performing end-to-end inter-frame encoding on the non-key frames based on the key-frame reconstructed frame specifically includes: performing motion-field estimation on the non-key frames based on the key-frame reconstructed frame to obtain motion-field information; obtaining inter-frame prediction information of the non-key frames from the motion-field information; and performing prediction-residual coding based on the inter-frame prediction information and the non-key frames.
- Optionally, the end-to-end intra-frame encoding of the key frames specifically adopts an end-to-end auto-encoder intra-coding framework based on a hyper-prior model network, with context modeling performed on the bottleneck layer of the auto-encoder.
- Optionally, the objective function of the intra-frame coding framework during training is

  L_intra = E[-log2 p(ŷ | ẑ)] + E[-log2 p(ẑ)] + λ·||x − x̂||²,

  where y is the latent variable encoded from the image, y = Enc(x), and the prior distribution of y is assumed to be a normal distribution with mean μ and variance σ, y ~ N(μ, σ).
- The mean μ and the variance σ are obtained through end-to-end learning with the hyper-prior auto-encoder, specifically z = HyperEnc(y) and ẑ = Q(z), where ẑ is the quantized codeword of the hyper-prior auto-encoder and the preliminary parameters of the hyper-prior normal distribution are decoded from ẑ; PixelCNN-based context modeling is then applied to refine the result of the hyper-prior auto-encoding structure.
- Optionally, the loop filter network is based on a fully convolutional network and is trained with the L2 loss; the loop filter loss is specifically

  L2 = (1/n) · Σ_{i=1..n} ||x_rec(i) − x(i)||²,

  where x_rec denotes the input coded image, x is the ground-truth label corresponding to the coded image, and n denotes the number of frames.
- Optionally, performing motion-field estimation on the non-key frames in the group of pictures based on the key-frame reconstructed frame to obtain the motion-field information specifically includes:
- when only one key-frame reconstructed frame is available, the motion-field information must be obtained by auto-encoder coding and written into the bitstream; the motion-field information flow_1 is computed as

  flow_1 = Flownet(f_{t-1});

- when more than one reconstructed frame is available, the two reconstructed frames nearest to the current non-key frame are used and the motion-field information need not be written into the bitstream; the motion-field information flow_2 is computed as

  flow_2 = Flownet(f_{t-2}, f_{t-1}),

  where f_1 is the available key-frame reconstructed frame and Flownet is an optical-flow prediction network.
- Optionally, obtaining the inter-frame prediction information of the non-key frame from the motion-field information specifically includes: generating the inter-frame prediction signal of the non-key frame from the video motion characteristics of the motion-field information and the reconstructed frames in the decoding buffer through interpolation and image processing, where the inter-frame prediction signal Frame_pred is computed as

  Frame_pred = Warp(f_{t-1}, flow),

  where Warp is a polynomial interpolation method, f_1 is the available key-frame reconstructed frame, and flow is the motion-field information of the non-key frame.
- Optionally, calculating and coding the prediction residual from the inter-frame prediction information and the non-key frame specifically includes: the prediction residual Frame_Resi is computed as

  Frame_Resi = Frame − Frame_pred,

  where Frame is the original signal of the current non-key frame and Frame_pred is the inter-frame prediction signal; the prediction residual Frame_Resi is compression-coded through an auto-encoder structure composed of fully convolutional networks, and its bottleneck layer is entropy-coded and written into the bitstream.
- According to the second aspect of the embodiments of the present application, an end-to-end video compression system based on deep learning is provided, which specifically includes:
- an image group module, used to divide the target video into multiple groups of pictures;
- a key-frame encoding module, used to perform end-to-end intra-frame encoding on the key frames in the group of pictures to obtain the key-frame code;
- a key-frame reconstruction module, used to reconstruct the key-frame code through the loop filter network to obtain the key-frame reconstructed frame;
- a non-key-frame encoding module, used to perform end-to-end inter-frame encoding on the non-key frames in the group of pictures based on the key-frame reconstructed frame in the decoding buffer to obtain the non-key-frame code;
- a non-key-frame reconstruction module, used to reconstruct the non-key-frame code through the loop filter network to obtain the non-key-frame reconstructed frame.
- According to the third aspect of the embodiments of the present application, a computer-readable storage medium is provided, having a computer program stored thereon; the computer program is executed by a processor to implement the end-to-end video compression method based on deep learning.
- With the deep-learning-based end-to-end video compression method, system, and storage medium of the embodiments of the present application, the target video is divided into multiple groups of pictures; the key frames in each group of pictures are then intra-coded end to end to obtain the key-frame code; the key-frame code is reconstructed through the loop filter network to obtain the key-frame reconstructed frame; next, the non-key frames in the group of pictures are inter-coded end to end based on the key-frame reconstructed frame to obtain the non-key-frame code; finally, the non-key-frame code is reconstructed through the loop filter network to obtain the non-key-frame reconstructed frame.
- Compared with traditionally adopted video compression encoders, this application realizes an end-to-end, globally optimized video encoder and achieves better coding performance at low bit rates, solving the problem of how to use deep neural networks to realize end-to-end video coding while ensuring good rate-distortion performance.
- Fig. 1 shows a flowchart of the steps of an end-to-end video compression method based on deep learning according to an embodiment of the present application
- FIG. 2 shows a framework diagram of a video compression method based on an end-to-end deep neural network according to an embodiment of the present application
- FIG. 3 shows a method for dividing the structure of a group of pictures GOP according to an embodiment of the present application
- FIG. 4 shows the intra-frame coding network structure diagram of the key frame of the end-to-end video compression method according to an embodiment of the present application
- FIG. 5 shows a non-key frame inter-frame coding framework diagram of an end-to-end video compression method according to an embodiment of the present application
- Fig. 6 shows an implementation method of Mask convolution adopted by an intra-frame coding network according to an embodiment of the present application
- Fig. 7 shows a schematic structural diagram of an end-to-end video compression system based on deep learning according to an embodiment of the present application.
- In realizing the present application, the inventor found that the traditional hybrid coding framework and the local rate-distortion optimization of each coding tool have been developed for half a century and face new challenges as more efficient video compression is demanded.
- An end-to-end video coding framework can break through the local-optimization limits of traditional frameworks: by establishing a global optimization model between the reconstructed video and the original video, and by using neural networks to model the rate-distortion optimization problem with its high-dimensional, complex solution space, it enables an innovation of the coding framework. Common end-to-end video coding techniques mainly design holistically trainable networks for the intra-frame coding, inter-frame prediction, residual coding, and rate-control modules of video coding.
- In view of the above problems, the embodiments of this application provide an end-to-end video compression method, system, and storage medium based on deep learning.
- Compared with traditionally adopted video compression encoders, the end-to-end trainable, fully convolutional video compression framework provided by this application realizes end-to-end global optimization of the video encoder and achieves better coding performance at low bit rates, solving the problem of how to use deep neural networks to realize end-to-end video coding while ensuring good rate-distortion performance.
- This application uses convolutional neural networks and video processing technology.
- First, the video is divided into groups of pictures (GOP) for encoding; the adaptively selected key frames in each GOP are coded end to end and stored in the decoding buffer.
- Second, for non-key-frame coding, deep-network-based motion-field estimation is performed for each frame to be coded using the reconstructed frames in the decoding buffer, and the estimated motion information is used to generate the inter-frame prediction result; finally, end-to-end residual coding is applied to the prediction residuals of the non-key frames. When reconstructed video is stored into the decoding buffer, both key frames and non-key frames are reconstructed through the deep loop filter module.
- Fig. 1 shows a step flow chart of an end-to-end video compression method based on deep learning according to an embodiment of the present application.
- As shown in Fig. 1, the end-to-end video compression method based on deep learning of this embodiment specifically includes the following steps, illustrated by the driver sketch after this list:
- S101: Divide the target video into multiple groups of pictures;
- S102: Perform end-to-end intra-frame coding on the key frames in the group of pictures to obtain the key-frame code;
- S103: Reconstruct the key-frame code through the loop filter network to obtain the key-frame reconstructed frame;
- S104: Perform end-to-end inter-frame coding on the non-key frames in the group of pictures based on the key-frame reconstructed frame to obtain the non-key-frame code;
- S105: Reconstruct the non-key-frame code through the loop filter network to obtain the non-key-frame reconstructed frame.
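- The following minimal driver sketch mirrors steps S101 to S105; intra_encode, inter_encode, and loop_filter are hypothetical placeholder callables with the semantics described above:

```python
# A minimal driver sketch of steps S101-S105; the three callables are
# hypothetical placeholders, passed in so the sketch stays self-contained.
def compress_video(frames, intra_encode, inter_encode, loop_filter, gop_size=8):
    bitstream, dpb = [], []                                   # dpb: decoding buffer
    gops = [frames[i:i + gop_size] for i in range(0, len(frames), gop_size)]  # S101
    for gop in gops:
        key, non_keys = gop[0], gop[1:]                       # key frame: first of the GOP
        code, recon = intra_encode(key)                       # S102: end-to-end intra coding
        dpb.append(loop_filter(recon))                        # S103: loop-filter reconstruction
        bitstream.append(code)
        for frame in non_keys:
            code, recon = inter_encode(frame, dpb)            # S104: inter coding from the DPB
            dpb.append(loop_filter(recon))                    # S105: loop-filter reconstruction
            bitstream.append(code)
    return bitstream
```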
- Fig. 2 shows a framework diagram of a video compression method based on an end-to-end deep neural network according to an embodiment of the present application.
- As shown in Fig. 2, in the compression framework of this application, the video is compressed GOP by GOP by the end-to-end deep neural network video coding framework.
- First, the key frames in a GOP are compressed with an auto-encoding architecture based on a Gaussian hyper-prior distribution, and the compressed key frames are passed through the deep-convolutional-network loop filter module (CNN Loop Filter) and then buffered into the Decoded Picture Buffer (DPB).
- Fig. 3 shows a method for dividing the structure of a group of pictures GOP according to an embodiment of the present application.
- As shown in Fig. 3, the key frame in the present invention is set as the first frame of the GOP.
- Alternatively, the key frame may be the first frame of the GOP or a non-first frame. The key frame is then encoded using an auto-encoder network with a hyper-prior structure, where the auto-encoder's prior may be a Gaussian distribution, a Gaussian mixture distribution, a Laplace distribution, etc.
- Fig. 4 shows the intra-frame coding network structure diagram of the key frame of the end-to-end video compression method according to an embodiment of the present application.
- As shown in Fig. 4, end-to-end intra-frame encoding of the key frames in the group of pictures yields the key-frame code.
- Specifically, an end-to-end auto-encoder intra-coding framework based on the hyper-prior model network is adopted, and a context modeling framework is designed for the bottleneck layer of the auto-encoder.
- This application adopts an end-to-end training method whose goal is to obtain an output image x̂ that is highly similar to the input image x at the signal level.
- For an input image x, the auto-encoder encodes the image into a latent variable y, y = Enc(x); this scheme assumes that the prior distribution of the latent variable y is a normal distribution with mean μ and variance σ, y ~ N(μ, σ).
- the mean ⁇ and the variance ⁇ are obtained through end-to-end learning according to the super-prior autoencoder, specifically:
- Z is the codeword of the autoencoder, Is the codeword of the super-prior self-encoder after quantization, It is the preliminary parameter of the super-prior normal distribution.
- Moreover, after obtaining the output of the hyper-prior auto-encoding structure, the present invention also applies PixelCNN-based context modeling to refine that result, as shown in Fig. 6, using a masked 5×5 convolution;
- the output gives the final parameters of the hyper-prior distribution.
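- A minimal sketch of such a masked 5×5 convolution, assuming the usual causal mask that hides the current and not-yet-decoded positions, is as follows:

```python
# A minimal sketch of a PixelCNN-style masked 5x5 convolution: the kernel
# only sees already-decoded (causal) positions of the codeword map.
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        kh, kw = self.kernel_size
        mask = torch.zeros_like(self.weight)
        mask[:, :, :kh // 2, :] = 1.0        # rows strictly above the center
        mask[:, :, kh // 2, :kw // 2] = 1.0  # same row, strictly left of the center
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        self.weight.data.mul_(self.mask)     # zero out non-causal taps before convolving
        return super().forward(x)

# context model over the quantized bottleneck codewords (64 channels assumed)
context_model = MaskedConv2d(64, 128, kernel_size=5, padding=2)
```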
- In S103 and S105, regarding loop filtering, every coded key-frame and non-key-frame image is processed by the loop filter module based on a fully convolutional network, improving the subjective and objective reconstruction quality.
- Specifically, for a coded reconstructed image x_rec, an end-to-end fully convolutional mapping to its original image x is established; the reconstructed image is processed by a nine-layer convolutional neural network with a global residual structure to obtain the final reconstructed image, which is simultaneously stored in the decoding buffer.
- Further, the loop filter network adopts the L2 loss; the loop filter loss is specifically

  L2 = (1/n) · Σ_{i=1..n} ||x_rec(i) − x(i)||²,

  where x_rec denotes the input coded image, x is the ground-truth label corresponding to the coded image, and n denotes the number of frames. Using the L2 function effectively preserves the fidelity of the data.
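- A minimal sketch of this loop filter, keeping the nine-layer depth and global residual structure stated above while choosing an illustrative channel width, is as follows:

```python
# A minimal sketch of the CNN loop filter: a fully convolutional network with
# a global residual connection, trained with the L2 loss; the nine-layer depth
# matches the text, while the channel width is an illustrative assumption.
import torch
import torch.nn as nn

class CNNLoopFilter(nn.Module):
    def __init__(self, ch: int = 64, layers: int = 9):
        super().__init__()
        body = [nn.Conv2d(3, ch, 3, padding=1), nn.ReLU()]
        for _ in range(layers - 2):
            body += [nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU()]
        body.append(nn.Conv2d(ch, 3, 3, padding=1))
        self.body = nn.Sequential(*body)

    def forward(self, x_rec: torch.Tensor) -> torch.Tensor:
        return x_rec + self.body(x_rec)      # global residual: predict only the correction

def loop_filter_loss(model: nn.Module, x_rec: torch.Tensor, x: torch.Tensor):
    # L2 = (1/n) * sum_i ||f(x_rec_i) - x_i||^2, averaged per element here
    return nn.functional.mse_loss(model(x_rec), x)
```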
- Regarding non-key-frame coding, this application uses the already-coded frames in the decoding buffer (DPB) to generate the motion-field information of the current non-key frame, and uses this information to texture-align the frames in the DPB, thereby obtaining the prediction information of the current frame; the prediction residual is then coded through an auto-encoder structure, and the bottleneck layer of that auto-encoder is written into the bitstream.
- Similar to key-frame coding, each non-key frame is also processed by the loop filter module to improve reconstruction quality.
- Specifically, the video motion characteristics of the motion-field information include video motion-field information and texture motion features.
- Expression forms of video motion features include, but are not limited to, the optical-flow field, motion-vector field, disparity-vector field, inter-frame gradient field, etc.
- The video motion-feature extraction method is specifically a method of extracting motion features between video frames; the extraction method corresponds to the chosen expression form and includes, but is not limited to, deep-learning-based methods such as optical-flow models, as well as traditional gradient-based extraction methods.
- Fig. 5 shows a non-key frame inter-coding framework diagram of the end-to-end video compression method according to an embodiment of the present application.
- Specifically, the coding of non-key frames in this application is divided into two main steps: prediction-frame generation and prediction-residual coding.
- First, for prediction-frame generation, motion-field estimation is performed on the non-key frames in the group of pictures based on the key-frame reconstructed frames to obtain the motion-field information, which specifically includes:
- when only one key-frame reconstructed frame is available, the motion-field information must be obtained by auto-encoder coding and written into the bitstream; the motion-field information flow_1 is computed as

  flow_1 = Flownet(f_{t-1});

- when more than one reconstructed frame is available, the two reconstructed frames nearest to the current non-key frame are used and the motion-field information need not be written into the bitstream; the motion-field information flow_2 is computed as

  flow_2 = Flownet(f_{t-2}, f_{t-1}),

  where f_1 is the available key-frame reconstructed frame and Flownet is an optical-flow prediction network.
- The structure of the non-key-frame prediction network is shown in Fig. 5; the prediction uses the optical-flow network (Flownet) on the already-coded frames in the decoding buffer. That is, when the decoding buffer contains only one frame, the video motion-characteristic information is written into the bitstream; when the decoding buffer contains more than one frame, it is not. A minimal sketch of this two-branch rule follows.
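```python
# A minimal sketch of the two-branch motion-field rule; flownet and flow_codec
# are hypothetical callables standing for the optical-flow prediction network
# and an auto-encoder-based flow coder, respectively.
def estimate_motion_field(dpb, flownet, flow_codec):
    if len(dpb) == 1:
        # only the key-frame reconstruction exists: code the flow with the
        # auto-encoder and write it into the bitstream
        flow = flownet(dpb[-1])                  # flow_1 = Flownet(f_{t-1})
        flow_bits, flow_rec = flow_codec(flow)
        return flow_rec, flow_bits
    # two or more reconstructed frames: derive the flow from the two frames
    # nearest the current one; no motion bits are written
    flow = flownet(dpb[-2], dpb[-1])             # flow_2 = Flownet(f_{t-2}, f_{t-1})
    return flow, None
```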
- The prediction-frame generation specifically includes: generating the inter-frame prediction signal of the non-key frame from the video motion characteristics of the motion-field information and the reconstructed frames in the decoding buffer through interpolation and image processing, where the inter-frame prediction signal Frame_pred is computed as

  Frame_pred = Warp(f_{t-1}, flow),

  where Warp is a polynomial interpolation method, f_1 is the available key-frame reconstructed frame, and flow is the motion-field information of the non-key frame.
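- A minimal warp sketch follows; bilinear sampling via grid_sample is used as a simple stand-in for the polynomial interpolation named above:

```python
# A minimal warp sketch; bilinear sampling stands in for the polynomial
# interpolation named in the text.
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # frame: (N, C, H, W); flow: (N, 2, H, W) as (dx, dy) pixel offsets
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys)).float().to(frame.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                       # displaced sample positions
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                 # normalize x to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0                 # normalize y to [-1, 1]
    grid = torch.stack((gx, gy), dim=-1)                    # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

# Frame_pred = warp(f_prev, flow), i.e. Warp(f_{t-1}, flow) above
```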
- Second, for prediction-residual coding, all non-key frames in the group of pictures pass through the non-key-frame residual coding module after predictive coding.
- The input of the non-key-frame residual coding module is the residual between the original non-key-frame signal and the prediction signal; the prediction residual Frame_Resi is computed as

  Frame_Resi = Frame − Frame_pred,

  where Frame is the original signal of the current non-key frame and Frame_pred is the inter-frame prediction signal.
- The prediction residual Frame_Resi is compression-coded through an auto-encoder structure composed of fully convolutional networks, and its bottleneck layer is entropy-coded and written into the bitstream.
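- A minimal sketch of such a residual coder follows; the entropy coding of the bottleneck is abstracted away and the widths are illustrative:

```python
# A minimal sketch of the residual coder: a fully convolutional auto-encoder
# whose quantized bottleneck would be entropy-coded into the bitstream.
import torch
import torch.nn as nn

class ResidualCodec(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, ch, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(ch, ch, 5, stride=2, padding=2))
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 3, 5, stride=2, padding=2, output_padding=1))

    def forward(self, frame: torch.Tensor, frame_pred: torch.Tensor):
        resi = frame - frame_pred             # Frame_Resi = Frame - Frame_pred
        y_hat = torch.round(self.enc(resi))   # quantized bottleneck (to be entropy-coded)
        resi_rec = self.dec(y_hat)            # reconstructed residual
        return frame_pred + resi_rec, y_hat   # unfiltered reconstruction + codewords
```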
- During reconstruction, the non-key frame likewise needs to be reconstructed through the loop filter network to obtain the non-key-frame reconstructed frame; the non-key-frame reconstructed frame Frame_Rec is computed as

  Frame_Rec = LoopFilter(Frame_pred + Frame_Resi'),

  where Frame_Resi' denotes the reconstructed prediction residual.
- The non-key-frame prediction-residual coding of this application specifically uses a pre-trained auto-encoder network model designed according to the specific conditions: the residual between the original non-key-frame signal and its prediction signal is used as the input of the generation network to obtain the reconstructed residual, completing the compressed-image reconstruction.
- For final reconstruction, loop-filter reconstruction based on a convolutional neural network designed and trained according to the specific conditions is used: unfiltered key frames or non-key frames are input to the loop filter, and the results are stored in the decoding buffer.
- The overall bitstream is composed of the bitstreams of the multiple groups of pictures, and the bitstream of each group of pictures is composed of the key-frame and non-key-frame bitstreams.
- The key-frame bitstream includes the bottleneck-layer bitstream of the auto-encoder, and the non-key-frame bitstream is composed of the motion-field information and its prediction-residual bitstream.
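- The layout can be pictured with the following sketch, whose field names are hypothetical and chosen only to mirror the composition described above:

```python
# A minimal sketch of the bitstream layout; all field names are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NonKeyFrameStream:
    motion_field_bits: Optional[bytes]  # present only when the flow was coded and written
    residual_bits: bytes                # entropy-coded bottleneck of the residual coder

@dataclass
class GopStream:
    key_frame_bits: bytes               # bottleneck-layer stream of the intra auto-encoder
    non_key_frames: List[NonKeyFrameStream] = field(default_factory=list)

@dataclass
class VideoStream:
    gops: List[GopStream] = field(default_factory=list)  # overall stream = all GOP streams
```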
- The end-to-end video compression method based on deep learning of the present application specifically involves a deep learning method, a video motion-feature extraction method, an end-to-end video compression method, and a video reconstruction method.
- The deep learning method used in the end-to-end video compression is specifically a deep learning method based on a fully convolutional network model; deep-learning-based methods include, but are not limited to, variational auto-encoders, generative adversarial networks, and combinations of their variants.
- The deep-learning-based video coding technology of this application aims to extract high-level abstract characteristics of the data, and their inverse process, through multi-layer deep nonlinear transformations, thereby obtaining the optimal prediction signal for video coding and ensuring the overall rate-distortion performance of the framework.
- During training, a supervised training method is used to optimize the rate-distortion function, which includes the data fidelity of the reconstructed video and the additional rate cost required for encoding the residuals.
- Fig. 7 shows a schematic structural diagram of an end-to-end video compression system based on deep learning according to an embodiment of the present application.
- the end-to-end video compression system based on deep learning specifically includes:
- an image group module 10, used to divide the target video into multiple groups of pictures;
- a key-frame encoding module 20, used to perform end-to-end intra-frame encoding on the key frames in the group of pictures to obtain the key-frame code;
- a key-frame reconstruction module 30, used to reconstruct the key-frame code through the loop filter network to obtain the key-frame reconstructed frame and store it in the decoding buffer;
- a non-key-frame encoding module 40, used to perform end-to-end inter-frame encoding on the non-key frames in the group of pictures based on the key-frame reconstructed frame in the decoding buffer to obtain the non-key-frame code;
- a non-key-frame reconstruction module 50, used to reconstruct the non-key-frame code through the loop filter network to obtain the non-key-frame reconstructed frame and store it in the decoding buffer.
- Both the key-frame reconstruction module 30 and the non-key-frame reconstruction module 50 in the end-to-end video compression framework include loop filters.
- When the key frames and non-key frames are finally reconstructed after encoding, a loop-filter reconstruction based on a convolutional neural network, designed and trained according to the specific conditions, is used: unfiltered key frames or non-key frames are input to the loop filter, and the results are stored in the decoding buffer.
- This embodiment also provides a computer-readable storage medium on which a computer program is stored, and the computer program is executed by a processor to implement the deep learning-based end-to-end video compression method provided by any of the above.
- This application proposes a video compression framework based on an end-to-end deep neural network.
- The video is organized into multiple groups of pictures; the key-frame images in each group of pictures are intra-coded, and the non-key-frame images are inter-coded.
- Intra-frame coding uses an auto-encoding structure based on a hyper-prior structure combined with an auto-regressive model for context modeling, and inter-frame coding uses motion-field derivation for prediction together with residual coding; this realizes end-to-end overall optimization of the encoder architecture.
- Because inter-frame coding derives the motion field, transmission of large amounts of inter-frame motion information is avoided, which greatly saves bit rate.
- Deep-network-based loop filtering technology is used in the reconstruction process to improve reconstruction performance.
- The proposed method can globally optimize the video encoder end to end without transmitting motion information for inter-frame prediction, and can achieve better coding performance at low bit rates.
- The embodiments of the present application also provide a computer program product. Since the principle by which the computer program product solves the problem is similar to the method provided in the first embodiment of the present application, reference may be made to the implementation of the method for its implementation, and repeated description is omitted.
- This application may be provided as methods, systems, or computer program products. Therefore, this application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
- These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data-processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
- These computer program instructions can also be loaded onto a computer or other programmable data-processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing; the instructions executed on the computer or other programmable equipment thus provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
Provided are an end-to-end video compression method and system based on deep learning, and a storage medium. The end-to-end video compression method based on deep learning in the present application comprises: dividing a target video into a plurality of groups of pictures; then, performing end-to-end intra-frame encoding on a key frame in each group of pictures to obtain a key frame code; reconstructing the key frame code by means of a loop filter network to obtain a key frame reconstructed frame; next, performing end-to-end inter-frame encoding on a non-key frame in the group of pictures on the basis of the key frame reconstructed frame to obtain a non-key frame code; and finally, reconstructing the non-key frame code by means of the loop filter network to obtain a non-key frame reconstructed frame. Compared with a traditionally used video compression encoder, a video encoder that can realize end-to-end global optimization is used in the present application, and a better encoding performance can be obtained at a low code rate. The problem of how to ensure a better rate-distortion performance while realizing end-to-end video encoding by using a deep neural network is thus solved.
Description
本申请属于数字信号处理技术领域,具体地,涉及一种基于深度学习的端到端视频压缩方法、系统及存储介质。This application belongs to the technical field of digital signal processing, and specifically relates to an end-to-end video compression method, system and storage medium based on deep learning.
视频压缩,也称视频编码,其目的是消除视频信号间存在的冗余信息。随着多媒体数字视频应用的不断发展和人们对视频云计算需求的不断提高,原始视频信源的数据量已使现有传输网络带宽和存储资源无法承受,因而经编码压缩后的视频才是宜在网络中传输中的信息,视频编码技术已成为目前国内外学术研究和工业应用的热点之一。Video compression, also known as video coding, aims to eliminate redundant information between video signals. With the continuous development of multimedia digital video applications and the continuous improvement of people’s demand for video cloud computing, the data volume of the original video source has made the existing transmission network bandwidth and storage resources unbearable, so the video after encoding and compression is suitable. For information transmitted in the network, video coding technology has become one of the hot spots in academic research and industrial applications at home and abroad.
近年来基于深度神经网络的图像编码方法成为编码领域的研究热点,它通过端到端建模自编码器(Auto-encoder)结构,优化图像重建损失函数,并利用熵估计模型近似估算自编码器结构中瓶颈层(Bottleneck Layer)的码字分布实现率失真优化。在此基础之上,熵估计模型被不断改进提升,基于混合高斯模型以及基于高斯超先验分布熵估计模型的概率估计模型被提出,并结合基于自回归模型(Auto-regressive)的PixelCNN框架建立瓶颈层码字的上下文模型。这一类端到端图像压缩的目标函数可以表示为:
其中,x和
分别代表原始像素与瓶颈层未量化像素,y和
分别代表瓶颈层未量化及量化后的码字,C为常数。
In recent years, the image coding method based on deep neural network has become a research hotspot in the coding field. It optimizes the image reconstruction loss function through end-to-end modeling of the auto-encoder structure, and uses the entropy estimation model to approximate the auto-encoder. The codeword distribution of the Bottleneck Layer in the structure realizes rate-distortion optimization. On this basis, the entropy estimation model has been continuously improved. A probability estimation model based on a mixture of Gaussian models and a Gaussian superprior distribution entropy estimation model is proposed, combined with the PixelCNN framework based on the auto-regressive model. The context model of the bottleneck layer codeword. The objective function of this type of end-to-end image compression can be expressed as: Where x and Respectively represent the original pixel and the unquantized pixel of the bottleneck layer, y and Respectively represent the unquantized and quantized codewords of the bottleneck layer, and C is a constant.
端到端神经网络对于视频压缩有着重要的意义。传统的混合编码框架及各个编码工具的局部率失真优化已经发展了半个世纪,在面临更高效的视频压缩时遭遇了新的挑战。常见的端到端视频编码技术主要通过设计整体可训练的网络分别用于视频编码帧内编码、帧间预测、残差编码和码率控制等模块。但是对应保证视频压缩框架的整体率失真性能仍然具有很大的挑战,因此设计开发一种利用深度神经网络实现端到端视频编码的同时可以保证较好的率失真性能的视频压缩方法及系统显得是至关重要。End-to-end neural networks are of great significance to video compression. The traditional hybrid coding framework and the local rate-distortion optimization of various coding tools have been developed for half a century, and they have encountered new challenges in the face of more efficient video compression. Common end-to-end video coding technologies are mainly used in video coding modules such as intra-frame coding, inter-frame prediction, residual coding, and rate control by designing an overall trainable network. However, it is still a big challenge to ensure the overall rate-distortion performance of the video compression framework. Therefore, it appears to design and develop a video compression method and system that uses a deep neural network to achieve end-to-end video encoding while ensuring better rate-distortion performance. Is crucial.
发明内容Summary of the invention
本发明提出了一种基于深度学习的端到端视频压缩方法、系统及存储介质,旨在解决现有技术中视频压缩编码中无法保证较好率失真性能的问题。The present invention proposes an end-to-end video compression method, system and storage medium based on deep learning, and aims to solve the problem that better rate-distortion performance cannot be guaranteed in video compression coding in the prior art.
根据本申请实施例的第一个方面,提供了一种基于深度学习的端到端视频压缩方法, 包括以下步骤:According to the first aspect of the embodiments of the present application, an end-to-end video compression method based on deep learning is provided, including the following steps:
将目标视频分为多个图像组;Divide the target video into multiple image groups;
对图像组中的关键帧进行端到端帧内编码得到关键帧编码;Perform end-to-end intra-frame coding on the key frames in the image group to obtain the key frame coding;
关键帧编码通过环路滤波网络进行重建后得到关键帧重建帧;The key frame encoding is reconstructed through the loop filter network to obtain the key frame reconstruction frame;
基于关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码;Perform end-to-end inter-coding of non-key frames in the image group based on key frame reconstruction frames to obtain non-key frame coding;
非关键帧编码通过环路滤波网络进行重建后得到非关键帧重建帧。After the non-key frame coding is reconstructed through the loop filter network, the non-key frame reconstruction frame is obtained.
可选地,基于关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码,具体包括:Optionally, performing end-to-end inter-encoding of the non-key frames in the image group based on the key-frame reconstruction frame to obtain the non-key frame encoding, which specifically includes:
基于关键帧重建帧对图像组中的非关键帧进行运动场估计得到运动场信息;Performing motion field estimation on the non-key frames in the image group based on the key frame reconstruction frame to obtain the motion field information;
根据运动场信息得到非关键帧的帧间预测信息;Obtain the inter prediction information of non-key frames according to the sports field information;
根据非关键帧的帧间预测信息以及非关键帧进行预测残差编码。Perform predictive residual coding based on the inter prediction information of non-key frames and non-key frames.
可选地,对图像组中的关键帧进行端到端帧内编码得到关键帧编码,具体采用基于超先验模型网络的端到端自编码器结构帧内编码框架,自编码器的瓶颈层进行上下文建模。Optionally, perform end-to-end intra-encoding of the key frames in the image group to obtain the key-frame encoding, specifically adopting the end-to-end autoencoder structure based on the super-prior model network, the intra-encoding framework, the bottleneck layer of the autoencoder Perform contextual modeling.
可选地,帧内编码框架在训练时的目标函数
公式为:
Optionally, the objective function of the intra-frame coding frame during training The formula is:
其中,y为根据图像编码的隐变量,y=Enc(x);隐变量y的先验分布为服从均值μ,方差为σ的正态分布,y~N(μ,σ);Among them, y is a hidden variable based on image coding, y=Enc(x); the prior distribution of the hidden variable y is a normal distribution that obeys the mean μ and the variance is σ, y~N(μ,σ);
其中,均值μ和方差σ是根据超先验自编码器通过端到端学习得到,具体为:Among them, the mean μ and variance σ are obtained through end-to-end learning according to the super-prior autoencoder, specifically:
z=Hyper
Enc(y);
z=Hyper Enc(y) ;
其中,
为经过量化后的超先验自编码器的码字,
为超先验正太分布的初步参数,采用基于PixelCNN上下文建模对超先验自编码结构的结果进行提升处理。
in, Is the codeword of the super-prior self-encoder after quantization, For the preliminary parameters of the super-prior normal distribution, the results of the super-prior self-encoding structure are upgraded by using PixelCNN contextual modeling.
可选地,环路滤波网络基于全卷积网络,环路滤波网络采用损失函数L2,环路滤波网络
具体公式为:
Optionally, the loop filter network is based on a fully convolutional network, the loop filter network uses the loss function L2, and the loop filter network The specific formula is:
其中,x
rec表示输入的已编码图像,x为已编码图像对应的真实标签,n表示帧数。
Among them, x rec represents the input coded image, x is the real label corresponding to the coded image, and n represents the number of frames.
可选地,基于关键帧重建帧对图像组中的非关键帧进行运动场估计得到运动场信息,具体包括:Optionally, the motion field estimation is performed on the non-key frames in the image group based on the key frame reconstruction frame to obtain the motion field information, which specifically includes:
当关键帧重建帧只有一帧时,运动场信息需要通过自编码器编码得到,并写入码流中,运动场信息flow
1的计算公式为:
When the key frame reconstruction frame is only one frame, the sports field information needs to be encoded by the autoencoder and written into the code stream. The calculation formula of the sports field information flow 1 is:
flow
1=Flownet(f
t-1);
flow 1 = Flownet(f t-1 );
当关键帧重建帧数目大于一帧时,取相对当前非关键帧最临近的两帧重建帧得到运动场信息,此时运动场信息无需写入码流中,运动场信息flow
2的计算公式为:
When the number of key frame reconstruction frames is greater than one frame, take the two closest reconstruction frames relative to the current non-key frame to obtain the sports field information. At this time, the sports field information does not need to be written into the code stream. The calculation formula of the sports field information flow 2 is:
flow
2=Flownet(f
t-2,f
t-1);
flow 2 =Flownet(f t-2 ,f t-1 );
其中,f
1为可使用的关键帧重建帧,Flownet为光流预测网络。
Among them, f 1 is an available key frame reconstruction frame, and Flownet is an optical flow prediction network.
可选地,根据运动场信息得到非关键帧的帧间预测信息,具体包括:根据运动场信息的视频运动特征及解码缓存区的重建帧通过插值及图像处理技术生成非关键帧的帧间预测信号,帧间预测信号Frame
pred计算公式为:
Optionally, obtaining the inter prediction information of the non-key frame according to the sports field information specifically includes: generating the inter prediction signal of the non-key frame according to the video motion characteristics of the sports field information and the reconstructed frame in the decoding buffer area through interpolation and image processing technology, The calculation formula of the inter-frame prediction signal Frame pred is:
Frame
pred=Warp(f
t-1,flow);
Frame pred = Warp(f t-1 ,flow);
其中,Warp为多项式插值方法,f
1为可使用的关键帧重建帧,flow为非关键帧的运动场信息。
Among them, Warp is a polynomial interpolation method, f 1 is the available key frame reconstruction frame, and flow is the sports field information of the non-key frame.
可选地,根据非关键帧的帧间预测信息以及非关键帧计算预测残差以及预测残差编码,具体包括:预测残差Frame
Resi计算公式为:
Optionally, calculating the prediction residual and the prediction residual coding according to the inter prediction information of the non-key frame and the non-key frame, specifically including: the prediction residual Frame Resi calculation formula is:
Frame
Resi=Frame-Frame
pred;
Frame Resi = Frame-Frame pred ;
其中,Frame为当前非关键帧的原始信号,Frame
pred为帧间预测信号;
Among them, Frame is the original signal of the current non-key frame, and Frame pred is the inter-frame prediction signal;
预测残差Frame
Resi通过由全卷积网络构成的自编码器结构进行压缩编码,其瓶颈层被熵编码后写入码流中。
The prediction residual Frame Resi is compressed and coded through a self-encoder structure composed of a full convolutional network, and its bottleneck layer is entropy coded and written into the code stream.
根据本申请实施例的第二个方面,提供了一种基于深度学习的端到端视频压缩系统,具体包括:According to a second aspect of the embodiments of the present application, an end-to-end video compression system based on deep learning is provided, which specifically includes:
图像组模块:用于将目标视频分为多个图像组;Image group module: used to divide the target video into multiple image groups;
关键帧编码模块:用于对图像组中的关键帧进行端到端帧内编码得到关键帧编码;Key frame encoding module: used to perform end-to-end intra-frame encoding on the key frames in the image group to obtain the key frame encoding;
关键帧重建帧模块:用于将关键帧编码通过环路滤波网络进行重建后得到关键帧重建帧;Key frame reconstruction frame module: used to reconstruct the key frame by encoding the key frame through the loop filter network to obtain the key frame reconstruction frame;
非关键帧编码模块:用于基于解码缓冲区中的关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码;Non-key frame coding module: used to perform end-to-end inter-coding of non-key frames in the image group based on the key frame reconstruction frame in the decoding buffer to obtain non-key frame coding;
非关键帧重建帧模块:用于将非关键帧编码通过环路滤波网络进行重建后得到非关键帧重建帧。Non-key frame reconstruction frame module: used to reconstruct the non-key frame by encoding the non-key frame through the loop filter network to obtain the non-key frame reconstruction frame.
根据本申请实施例的第三个方面,提供了一种计算机可读存储介质,其上存储有计算 机程序;计算机程序被处理器执行以实现基于深度学习的端到端视频压缩方法。According to a third aspect of the embodiments of the present application, there is provided a computer-readable storage medium having a computer program stored thereon; the computer program is executed by a processor to implement an end-to-end video compression method based on deep learning.
采用本申请实施例中的基于深度学习的端到端视频压缩方法、系统及存储介质,通过将目标视频分为多个图像组;然后对图像组中的关键帧进行端到端帧内编码得到关键帧编码;关键帧编码通过环路滤波网络进行重建后得到关键帧重建帧;其次,基于关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码;最后,非关键帧编码通过环路滤波网络进行重建后得到非关键帧重建帧。本申请采用与传统采用的视频压缩编码器相比,可以实现端到端全局优化视频编码器,在低码率下能够取得较好的编码性能。解决了如何利用深度神经网络实现端到端视频编码的同时保证较好的率失真性能的问题。Using the deep learning-based end-to-end video compression method, system, and storage medium in the embodiments of the present application, the target video is divided into multiple image groups; and then the key frames in the image group are subjected to end-to-end intra-coding to obtain Key frame coding: The key frame coding is reconstructed through the loop filter network to obtain the key frame reconstruction frame; secondly, based on the key frame reconstruction frame, the non-key frame in the image group is subjected to end-to-end inter-coding to obtain the non-key frame coding; and finally , The non-key frame encoding is reconstructed through the loop filter network to obtain the non-key frame reconstruction frame. Compared with the conventionally adopted video compression encoder, this application can realize an end-to-end global optimization video encoder, and can achieve better encoding performance at a low bit rate. It solves the problem of how to use deep neural networks to achieve end-to-end video encoding while ensuring better rate-distortion performance.
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:The drawings described here are used to provide a further understanding of the application and constitute a part of the application. The exemplary embodiments and descriptions of the application are used to explain the application, and do not constitute an improper limitation of the application. In the attached picture:
图1中示出了根据本申请实施例的一种基于深度学习的端到端视频压缩方法的步骤流程图;Fig. 1 shows a flowchart of the steps of an end-to-end video compression method based on deep learning according to an embodiment of the present application;
图2中示出了根据本申请实施例的基于端到端深度神经网络的视频压缩方法的框架图;FIG. 2 shows a framework diagram of a video compression method based on an end-to-end deep neural network according to an embodiment of the present application;
图3中示出了根据本申请实施例的图像组GOP的结构划分方法;FIG. 3 shows a method for dividing the structure of a group of pictures GOP according to an embodiment of the present application;
图4中示出了根据本申请实施例的端到端视频压缩方法的关键帧的帧内编码网络结构图;FIG. 4 shows the intra-frame coding network structure diagram of the key frame of the end-to-end video compression method according to an embodiment of the present application;
图5中示出了根据本申请实施例的端到端视频压缩方法的非关键帧的帧间编码框架图;FIG. 5 shows a non-key frame inter-frame coding framework diagram of an end-to-end video compression method according to an embodiment of the present application;
图6中示出了根据本申请实施例的帧内编码网络采用的Mask卷积的一种实施方法;Fig. 6 shows an implementation method of Mask convolution adopted by an intra-frame coding network according to an embodiment of the present application;
图7示出了根据本申请实施例的一种基于深度学习的端到端视频压缩系统的结构示意图。Fig. 7 shows a schematic structural diagram of an end-to-end video compression system based on deep learning according to an embodiment of the present application.
在实现本申请的过程中,发明人发现传统的混合编码框架及各个编码工具的局部率失真优化已经发展了半个世纪,在面临更高效的视频压缩时遭遇了新的挑战。而端到端视频编码框架能够突破传统框架局部优化的限制,通过建立起重建视频与原始视频的全局优化模型,并利用神经网络建模具有高维复杂解空间的率失真优化问题,从而实现视频编码框架的革新。常见的端到端视频编码技术主要通过设计整体可训练的网络分别用于视频编码帧内编码、帧间预测、残差编码和码率控制等模块。但是对应保证视频压缩框架的整体率失真性能仍然具有很大的挑战,因此亟需一种利用深度神经网络实现端到端视频编码的同 时可以保证较好的率失真性能的视频压缩方法及系统。In the process of realizing this application, the inventor found that the traditional hybrid coding framework and the local rate-distortion optimization of each coding tool have been developed for half a century, and they have encountered new challenges in the face of more efficient video compression. The end-to-end video coding framework can break through the limitations of local optimization of traditional frameworks. By establishing a global optimization model of reconstructed video and original video, and using neural networks to model the rate-distortion optimization problem with high-dimensional complex solution space, the video can be realized. The innovation of the coding framework. Common end-to-end video coding technologies are mainly used for video coding intra-frame coding, inter-frame prediction, residual coding, and rate control modules by designing an overall trainable network. However, guaranteeing the overall rate-distortion performance of the video compression framework still poses great challenges. Therefore, there is an urgent need for a video compression method and system that uses a deep neural network to achieve end-to-end video encoding while ensuring better rate-distortion performance.
针对上述问题,本申请实施例中提供了一种基于深度学习的端到端视频压缩方法、系统及存储介质,本申请提供的可以端到端训练的基于全卷积网络的视频压缩框架与传统采用的视频压缩编码器相比,可以实现端到端全局优化视频编码器,在低码率下能够取得较好的编码性能。解决了如何利用深度神经网络实现端到端视频编码的同时保证较好的率失真性能的问题。In response to the above-mentioned problems, the embodiments of this application provide an end-to-end video compression method, system, and storage medium based on deep learning. The full-convolutional network-based video compression framework provided by this application that can be trained end-to-end is similar to the traditional video compression framework. Compared with the adopted video compression encoder, it can achieve end-to-end global optimization of the video encoder, and can achieve better encoding performance at low bit rates. It solves the problem of how to use deep neural networks to achieve end-to-end video encoding while ensuring better rate-distortion performance.
本申请利用卷积神经网络和视频处理技术,首先将视频分为图像组(Groupofpictures,GOP)进行编码,对图像组GOP中经自适应选定的关键帧进行端到端帧内编码,并存储于解码缓存区;其次对于非关键帧编码,利用在解码缓存区中的已重构帧对每一个待编码帧进行基于深度网络的运动场估计,并用估计得到的运动信息生成帧间预测结果;最后对非关键帧的预测残差进行端到端残差编码;在视频重构存入解码缓存区时,关键帧和非关键帧均需要经过深度环路滤波模块进行重建。This application uses convolutional neural network and video processing technology. First, the video is divided into group of pictures (GOP) for encoding, and the adaptively selected key frames in the group of pictures GOP are encoded end-to-end and stored In the decoding buffer area; secondly, for non-key frame encoding, use the reconstructed frame in the decoding buffer area to estimate the motion field based on the depth network for each frame to be encoded, and use the estimated motion information to generate the inter-frame prediction result; and finally Perform end-to-end residual coding on the prediction residuals of non-key frames; when the video is reconstructed and stored in the decoding buffer, both the key frames and non-key frames need to be reconstructed through the deep loop filter module.
为了使本申请实施例中的技术方案及优点更加清楚明白,以下结合附图对本申请的示例性实施例进行进一步详细的说明,显然,所描述的实施例仅是本申请的一部分实施例,而不是所有实施例的穷举。需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。In order to make the technical solutions and advantages of the embodiments of the present application clearer, the exemplary embodiments of the present application will be described in further detail below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, and Not all examples are exhaustive. It should be noted that the embodiments in the application and the features in the embodiments can be combined with each other if there is no conflict.
实施例1Example 1
图1中示出了根据本申请实施例的一种基于深度学习的端到端视频压缩方法的步骤流程图。Fig. 1 shows a step flow chart of an end-to-end video compression method based on deep learning according to an embodiment of the present application.
如图1所示,本实施例的基于深度学习的端到端视频压缩方法,具体包括以下步骤:As shown in Fig. 1, the end-to-end video compression method based on deep learning in this embodiment specifically includes the following steps:
S101:将目标视频分为多个图像组;S101: Divide the target video into multiple image groups;
S102:对图像组中的关键帧进行端到端帧内编码得到关键帧编码;S102: Perform end-to-end intra-frame coding on the key frames in the image group to obtain key frame codes;
S103:关键帧编码通过环路滤波网络进行重建后得到关键帧重建帧;S103: After the key frame encoding is reconstructed through the loop filter network, the key frame reconstruction frame is obtained;
S104:基于关键帧重建帧对图像组中的非关键帧进行端到端帧间编码得到非关键帧编码;S104: Perform end-to-end inter-coding of non-key frames in the image group based on the key-frame reconstruction frame to obtain non-key frame coding;
S105:非关键帧编码通过环路滤波网络进行重建后得到非关键帧重建帧。S105: The non-key frame encoding is reconstructed through the loop filter network to obtain a non-key frame reconstruction frame.
图2中示出了根据本申请实施例的基于端到端深度神经网络的视频压缩方法的框架图。Fig. 2 shows a framework diagram of a video compression method based on an end-to-end deep neural network according to an embodiment of the present application.
如图2所示,在本申请的压缩框架中,视频可以通过图像组GOP的方式被端到端的深度神经网络视频编码框架所压缩。首先对于GOP中的关键帧,采用基于高斯超先验分布的自编码架构进行压缩,并将压缩后的关键帧在进行基于深度卷积网络的环路滤波模块(CNN Loop Filter)后缓存至解码缓冲区(DecodedPictureBuffer,DPB)中。As shown in Figure 2, in the compression framework of the present application, the video can be compressed by the end-to-end deep neural network video coding framework by means of a group of pictures GOP. First, for the key frames in the GOP, the self-encoding architecture based on the Gaussian super prior distribution is used for compression, and the compressed key frames are buffered to the decoding after the deep convolutional network-based loop filter module (CNN Loop Filter) Buffer (DecodedPictureBuffer, DPB).
图3中示出了根据本申请实施例的图像组GOP的结构划分方法。Fig. 3 shows a method for dividing the structure of a group of pictures GOP according to an embodiment of the present application.
如图3所示,本发明中关键帧被设置为图像组GOP的第一帧。As shown in Fig. 3, the key frame in the present invention is set as the first frame of the GOP of the group of pictures.
其它的,关键帧可以是GOP中的第一帧,也可以是非第一帧;再使用带有超先验结构的自编码器网络的方法对该关键帧进行编码,自编码器种类为高斯分布、混合高斯分布及拉普拉斯分布等。In addition, the key frame can be the first frame in the GOP, or it can be the non-first frame; then use the method of the autoencoder network with super a priori structure to encode the key frame, and the autoencoder type is Gaussian distribution , Mixture of Gaussian distribution and Laplace distribution, etc.
图4中示出了根据本申请实施例的端到端视频压缩方法的关键帧的帧内编码网络结构图。Fig. 4 shows the intra-frame coding network structure diagram of the key frame of the end-to-end video compression method according to an embodiment of the present application.
如图4所示,对图像组中的关键帧进行端到端帧内编码得到关键帧编码,具体采用基于超先验模型网络的端到端自编码器结构帧内编码框架,同时对自编码器的瓶颈层设计了上下文建模框架。As shown in Figure 4, the end-to-end intra-encoding of the key frames in the image group is performed to obtain the key-frame encoding. The end-to-end autoencoder structure based on the super-prior model network is used to obtain the key frame encoding. The bottleneck layer of the server is designed with a context modeling framework.
本申请对采用端到端的训练方式,目标是得到与输入图像x在信号层面高度相似的输出图像
对于输入图像x,该自编码器将图像编码成一个隐变量y,
This application adopts an end-to-end training method, and the goal is to obtain an output image that is highly similar to the input image x at the signal level For the input image x, the autoencoder encodes the image into a hidden variable y,
y=Enc(x)y=Enc(x)
本方案假设该隐变量y的先验分布为服从均值μ,方差为σ的正态分布,This scheme assumes that the prior distribution of the hidden variable y is a normal distribution that obeys the mean μ and the variance is σ,
y~N(μ,σ),y~N(μ,σ),
其中,均值μ和方差σ是根据超先验自编码器,通过端到端学习得到,具体为:Among them, the mean μ and the variance σ are obtained through end-to-end learning according to the super-prior autoencoder, specifically:
z=Hyper
Enc(y),
z=Hyper Enc(y) ,
Z为自编码器的码字,
为经过量化后的超先验自编码器的码字,
为超先验正太分布的初步参数。
Z is the codeword of the autoencoder, Is the codeword of the super-prior self-encoder after quantization, It is the preliminary parameter of the super-prior normal distribution.
不仅如此,在通过超先验自编码结构的输出后,本发明同时采用基于PixelCNN上下文建模方法对超先验自编码结构的结果进行提升处理,如图6所示,使用Mask的5x5卷积,输出为最终的超先验分布的参数。Not only that, after passing the output of the super-prior self-encoding structure, the present invention also uses the PixelCNN context-based modeling method to upgrade the result of the super-prior self-encoding structure, as shown in Figure 6, using Mask’s 5x5 convolution , The output is the final super-prior distribution parameters.
因此帧内编码框架在训练时的目标函数
公式如下:
Therefore, the objective function of the intra-frame coding framework during training The formula is as follows:
S103以及S105中,关于环路滤波,对于已编码的每一帧关键帧和非关键帧图像,都进行基于全卷积网络的环路滤波模块处理,从而提升主观与客观重建效果。In S103 and S105, regarding loop filtering, for each key frame and non-key frame image that has been coded, a loop filtering module based on a full convolution network is processed to improve the subjective and objective reconstruction effect.
具体的,对已编码的重建图像为x
rec,建立于其原始图像x之间的端到端全卷积映射,通过使用具有全局残差结构的九层卷积神经网络处理该重建图像,并得到最终的重建图像, 同时存放于解码缓存区中。
Specifically, the encoded reconstructed image is x rec , which is based on an end-to-end full convolutional mapping between the original images x, and the reconstructed image is processed by using a nine-layer convolutional neural network with a global residual structure, and The final reconstructed image is obtained and stored in the decoding buffer area at the same time.
Further, the loop filtering network is trained with an L2 loss. The specific formula of the loop filtering loss is:

L2 = (1/n) Σ_{i=1}^{n} || f(x_rec,i) − x_i ||²

where x_rec denotes the input encoded image, x is the ground-truth label corresponding to the encoded image, f is the loop filtering network, and n denotes the number of frames. Using the L2 loss effectively preserves the fidelity of the data.
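In training code this loss reduces to a mean squared error over a batch of frames, for example (a sketch assuming the LoopFilter module above):

```python
import torch.nn.functional as F

def loop_filter_loss(loop_filter, x_rec, x):
    """L2 loss between filtered encoded frames and their originals;
    the batch dimension of x_rec plays the role of the frame count n."""
    return F.mse_loss(loop_filter(x_rec), x)
```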
In S102, performing end-to-end inter-frame coding on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain the non-key frame encoding specifically includes:

performing motion field estimation on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain motion field information;

obtaining inter prediction information of the non-key frames according to the motion field information;

performing prediction residual coding according to the inter prediction information of the non-key frames and the non-key frames themselves.
Regarding non-key frame encoding, this application uses the already encoded frames in the decoded picture buffer (DPB) to generate the motion field information of the current non-key frame, and uses this information to texture-align the frames in the DPB, thereby obtaining the prediction information of the current frame. The prediction residual is then encoded by an autoencoder structure, and the autoencoder's bottleneck layer is written into the bitstream. As with key frame encoding, each non-key frame is also processed by the loop filtering module to improve reconstruction quality.
Specifically, the video motion features of the motion field information include video motion field information and texture motion features. Representations of video motion features include, but are not limited to: optical flow fields, motion vector fields, disparity vector fields, and inter-frame gradient fields. The video motion feature extraction method is specifically a method for extracting motion features between video frames; the extraction method corresponds to the chosen representation, and includes, but is not limited to, deep learning based methods such as optical flow models and traditional gradient-based extraction methods.
Fig. 5 shows the inter-frame coding framework for non-key frames in the end-to-end video compression method according to an embodiment of the present application.
Specifically, this application encodes non-key frames in two main steps: predicted frame generation and prediction residual coding.

1. Predicted frame generation:

First, motion field estimation is performed on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain the motion field information, which specifically includes:
When only one key frame reconstruction frame is available, the motion field information must be encoded by an autoencoder and written into the bitstream. The motion field information flow_1 is computed as:

flow_1 = Flownet(f_{t-1});
When the number of key frame reconstruction frames is greater than one, the two reconstruction frames closest to the current non-key frame are used to derive the motion field information; in this case, the motion field information does not need to be written into the bitstream. The motion field information flow_2 is computed as:

flow_2 = Flownet(f_{t-2}, f_{t-1});
where f_1 is an available key frame reconstruction frame and Flownet is an optical flow prediction network.
The structure of the non-key frame prediction network is shown in Fig. 5: the already encoded frames are fetched from the decoding buffer, and the two nearest encoded frames are used to predict the currently coded non-key frame; the prediction applies an optical flow network (Flownet) to the encoded frames in the decoding buffer.
Further, when the decoding buffer contains only one frame, the video motion feature information is written into the bitstream; when the decoding buffer contains more than one frame, the video motion feature information is not written into the bitstream.
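The two cases can be summarized in the following sketch; dpb (the decoded picture buffer, newest frame last), flownet and the assumed flow_codec autoencoder are illustrative names, not components fixed by the description:

```python
def estimate_motion_field(dpb, flownet, flow_codec):
    """Return the motion field for the current non-key frame and the
    bits to transmit (None when the decoder can derive the flow)."""
    if len(dpb) == 1:
        # single reference frame: flow_1 = Flownet(f_{t-1}),
        # encoded and written into the bitstream
        flow = flownet(dpb[-1])
        bits = flow_codec.encode(flow)
        return flow_codec.decode(bits), bits
    # two or more references: flow_2 = Flownet(f_{t-2}, f_{t-1});
    # the decoder repeats this computation, so nothing is transmitted
    return flownet(dpb[-2], dpb[-1]), None
```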
Secondly, the inter prediction information of the non-key frame, i.e. the predicted frame, is generated from the motion field information. Specifically, the inter prediction signal of the non-key frame is generated from the video motion features of the motion field information and the reconstructed frames in the decoding buffer, using interpolation and image processing techniques. The inter prediction signal Frame_pred is computed as:

Frame_pred = Warp(f_{t-1}, flow);
where Warp is a polynomial interpolation method, f_1 is an available key frame reconstruction frame, and flow is the motion field information of the non-key frame.
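A common realization of Warp is backward warping with bilinear sampling, sketched below; bilinear interpolation stands in for the polynomial interpolation named above and is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Warp a reference frame (N,C,H,W) toward the current frame using
    a flow field (N,2,H,W) of (dx, dy) displacements."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device),
                            indexing="ij")
    base = torch.stack((xs, ys)).float()   # (2,H,W) pixel coordinates
    coords = base.unsqueeze(0) + flow      # displaced sampling positions
    # normalize to [-1, 1] as required by grid_sample
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)   # (N,H,W,2)
    return F.grid_sample(frame, grid, align_corners=True)
```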
2. Prediction residual coding: after predictive coding, every non-key frame in the group of pictures also passes through a non-key frame residual coding module, whose input is the residual between the original non-key frame signal and the prediction signal.
Specifically, computing the prediction residual and encoding it according to the inter prediction information of the non-key frame and the non-key frame itself includes the following. The prediction residual Frame_Resi is computed as:

Frame_Resi = Frame − Frame_pred;
where Frame is the original signal of the current non-key frame and Frame_pred is the inter prediction signal.

The prediction residual Frame_Resi is compression-coded by an autoencoder structure composed of fully convolutional networks, and its bottleneck layer is entropy-coded and written into the bitstream.
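Putting the two steps together, the residual path of a non-key frame can be sketched as follows; resi_codec, an autoencoder whose bottleneck feeds an entropy coder, is an assumed interface:

```python
def encode_non_key_frame(frame, frame_pred, resi_codec):
    """Residual coding round trip for one non-key frame."""
    frame_resi = frame - frame_pred        # Frame_Resi = Frame - Frame_pred
    bits = resi_codec.encode(frame_resi)   # bottleneck, entropy-coded bitstream
    resi_hat = resi_codec.decode(bits)     # decoded residual
    return bits, frame_pred + resi_hat     # bits and pre-filter reconstruction
```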
Further, in S105, a non-key frame likewise needs to be reconstructed through the loop filtering network to obtain the non-key frame reconstruction frame Frame_Rec, i.e. the sum of the inter prediction signal and the decoded residual is passed through the loop filtering network. The final reconstructed non-key frame is thereby obtained and stored in the decoding buffer.
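Continuing the sketch above, reconstruction and buffering of a non-key frame then amount to:

```python
def reconstruct_non_key_frame(frame_pred, resi_hat, loop_filter, dpb):
    """Frame_Rec: loop-filter the prediction plus decoded residual,
    then store the result in the decoding buffer."""
    frame_rec = loop_filter(frame_pred + resi_hat)
    dpb.append(frame_rec)
    return frame_rec
```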
The non-key frame prediction residual coding method of this application specifically uses a pre-trained autoencoder network model designed for the task: the residual between the original signal of the non-key frame and its prediction signal is fed to the generation network to obtain the reconstructed residual, completing the compressed image reconstruction.
In the loop filtering method of this end-to-end video compression framework, when key frames and non-key frames are encoded and finally reconstructed, a convolutional neural network based loop filter, designed and trained for the task, takes the unfiltered key frame or non-key frame as input, and the result is stored in the decoding buffer.
In the bitstream organization of the end-to-end video compression framework, the overall bitstream consists of the bitstreams of multiple groups of pictures (GOPs). The bitstream of each GOP consists of key frame and non-key frame bitstreams: the key frame bitstream comprises the autoencoder bottleneck layer bitstream, and the non-key frame bitstream comprises the motion field information and its prediction residual bitstream.
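The bitstream organization described above can be illustrated with the following sketch; the type and field names are assumptions chosen for readability:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class NonKeyFrameStream:
    flow_bits: Optional[bytes]  # None when the flow is derived at the decoder
    residual_bits: bytes        # entropy-coded residual bottleneck

@dataclass
class GopStream:
    key_frame_bits: bytes       # autoencoder bottleneck of the key frame
    non_key_frames: List[NonKeyFrameStream] = field(default_factory=list)

# the overall bitstream is the concatenation of the GOP bitstreams
video_stream: List[GopStream] = []
```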
The deep learning based end-to-end video compression method of this application specifically includes a deep learning method, a video motion feature extraction method, an end-to-end video compression method, and a video reconstruction method. The end-to-end video coding framework breaks through the local-optimization limits of traditional frameworks, establishes a global optimization model between the reconstructed video and the original video, and uses neural networks to model the rate-distortion optimization problem with its high-dimensional, complex solution space, thereby innovating the video coding framework.
The deep learning method used for end-to-end video compression is specifically one based on fully convolutional network models; deep learning based methods include, but are not limited to, variational autoencoders, generative adversarial networks, and their variants and combinations.
The deep learning based video coding technology of this application aims to use multi-layer deep nonlinear transforms to extract high-level abstract features of the data, together with the inverse process, so as to obtain the optimal prediction signal for video coding, while end-to-end residual coding guarantees the rate-distortion performance of the overall framework. Finally, a supervised training method optimizes the rate-distortion function, which includes the data fidelity term of the reconstructed video and the additional cost required to encode the residual.
Embodiment 2
Fig. 7 shows a schematic structural diagram of a deep learning based end-to-end video compression system according to an embodiment of the present application.
As shown in Fig. 7, the deep learning based end-to-end video compression system provided by this embodiment specifically includes:

Group of pictures module 10: used to divide the target video into multiple groups of pictures;

Key frame encoding module 20: used to perform end-to-end intra-frame coding on the key frames in a group of pictures to obtain the key frame encoding;

Key frame reconstruction frame module 30: used to reconstruct the key frame encoding through the loop filtering network to obtain the key frame reconstruction frame, which is stored in the decoding buffer;

Non-key frame encoding module 40: used to perform end-to-end inter-frame coding on the non-key frames in the group of pictures based on the key frame reconstruction frame in the decoding buffer to obtain the non-key frame encoding;

Non-key frame reconstruction frame module 50: used to reconstruct the non-key frame encoding through the loop filtering network to obtain the non-key frame reconstruction frame, which is stored in the decoding buffer.
In the non-key frame encoding module 40, performing end-to-end inter-frame coding on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain the non-key frame encoding specifically includes:

performing motion field estimation on the non-key frames in the group of pictures based on the key frame reconstruction frame to obtain motion field information;

obtaining inter prediction information of the non-key frames according to the motion field information;

performing prediction residual coding according to the inter prediction information of the non-key frames and the non-key frames themselves.
Both the key frame reconstruction frame module 30 and the non-key frame reconstruction frame module 50 in the end-to-end video compression framework include a loop filter. When key frames and non-key frames are encoded and finally reconstructed, a convolutional neural network based loop filter, designed and trained for the task, is used: the unfiltered key frame or non-key frame is input to the loop filter, and the output is stored in the decoding buffer.
This embodiment also provides a computer-readable storage medium on which a computer program is stored; the computer program is executed by a processor to implement the deep learning based end-to-end video compression method provided by any of the above.
This application proposes a video compression framework based on an end-to-end deep neural network. The video is first organized into multiple groups of pictures; the key frame images in each group are intra-coded, and the non-key frame images are inter-coded. Intra-frame coding uses an autoencoding structure based on a hyperprior, combined with an autoregressive model for context modeling; inter-frame coding uses motion-field-derived prediction and residual coding. The encoder architecture can thus be optimized end to end as a whole. Deriving the motion field for inter-frame coding avoids transmitting large amounts of inter-frame motion information, greatly saving bitrate, while a deep network based loop filter is used during reconstruction to improve reconstruction performance. Compared with traditional encoders, the proposed method can globally optimize the video encoder end to end without transmitting motion information for inter prediction, and achieves better coding performance at low bitrates.
Based on the same inventive concept, an embodiment of the present application further provides a computer program product. Since the principle by which the computer program product solves the problem is similar to the method provided in Embodiment 1 of the present application, the implementation of the computer program product may refer to the implementation of the method, and repeated description is omitted.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present application.
Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
Claims (10)
- A deep learning based end-to-end video compression method, characterized in that it comprises the following steps: dividing a target video into multiple groups of pictures; performing end-to-end intra-frame coding on the key frames in the groups of pictures to obtain a key frame encoding; reconstructing the key frame encoding through a loop filtering network to obtain a key frame reconstruction frame; performing end-to-end inter-frame coding on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain a non-key frame encoding; and reconstructing the non-key frame encoding through the loop filtering network to obtain a non-key frame reconstruction frame.
- The deep learning based end-to-end video compression method according to claim 1, characterized in that performing end-to-end inter-frame coding on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain the non-key frame encoding specifically comprises: performing motion field estimation on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain motion field information; obtaining inter prediction information of the non-key frames according to the motion field information; and performing prediction residual coding according to the inter prediction information of the non-key frames and the non-key frames.
- The deep learning based end-to-end video compression method according to claim 1, characterized in that performing end-to-end intra-frame coding on the key frames in the groups of pictures to obtain the key frame encoding specifically adopts an end-to-end autoencoder intra-frame coding framework based on a hyperprior model network, and context modeling is performed on the bottleneck layer of the autoencoder.
- The deep learning based end-to-end video compression method according to claim 3, characterized in that the objective function of the intra-frame coding framework during training is a rate-distortion loss, where y is the latent variable encoding the image, y = Enc(x); the prior distribution of the latent variable y is a normal distribution with mean μ and variance σ, y ~ N(μ, σ); the mean μ and variance σ are obtained through end-to-end learning with a hyperprior autoencoder, specifically: z = HyperEnc(y); where ẑ is the quantized codeword of the hyperprior autoencoder, from which the preliminary parameters of the hyperprior normal distribution are decoded; and the result of the hyperprior autoencoding structure is refined using PixelCNN-based context modeling.
- The deep learning based end-to-end video compression method according to claim 1, characterized in that the loop filtering network is based on a fully convolutional network and is trained with an L2 loss; the specific formula of the loop filtering loss is: L2 = (1/n) Σ_{i=1}^{n} || f(x_rec,i) − x_i ||², where x_rec denotes the input encoded image, x is the ground-truth label corresponding to the encoded image, f is the loop filtering network, and n denotes the number of frames.
- The deep learning based end-to-end video compression method according to claim 2, characterized in that performing motion field estimation on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain the motion field information specifically comprises: when there is only one key frame reconstruction frame, the motion field information needs to be encoded by an autoencoder and written into the bitstream, and the motion field information flow_1 is computed as: flow_1 = Flownet(f_{t-1}); when the number of key frame reconstruction frames is greater than one, the two reconstruction frames closest to the current non-key frame are used to derive the motion field information, which in this case does not need to be written into the bitstream, and the motion field information flow_2 is computed as: flow_2 = Flownet(f_{t-2}, f_{t-1}); where f_1 is an available key frame reconstruction frame and Flownet is an optical flow prediction network.
- The deep learning based end-to-end video compression method according to claim 2, characterized in that obtaining the inter prediction information of the non-key frames according to the motion field information specifically comprises: generating the inter prediction signal of the non-key frame from the video motion features of the motion field information and the reconstructed frames in the decoding buffer through interpolation and image processing techniques, the inter prediction signal Frame_pred being computed as: Frame_pred = Warp(f_{t-1}, flow); where Warp is a polynomial interpolation method, f_1 is an available key frame reconstruction frame, and flow is the motion field information of the non-key frame.
- The deep learning based end-to-end video compression method according to claim 2, characterized in that computing the prediction residual and performing prediction residual coding according to the inter prediction information of the non-key frame and the non-key frame specifically comprises: the prediction residual Frame_Resi is computed as: Frame_Resi = Frame − Frame_pred; where Frame is the original signal of the current non-key frame and Frame_pred is the inter prediction signal; and the prediction residual Frame_Resi is compression-coded by an autoencoder structure composed of fully convolutional networks, its bottleneck layer being entropy-coded and written into the bitstream.
- A deep learning based end-to-end video compression system, characterized in that it specifically comprises: a group of pictures module for dividing a target video into multiple groups of pictures; a key frame encoding module for performing end-to-end intra-frame coding on the key frames in the groups of pictures to obtain a key frame encoding; a key frame reconstruction frame module for reconstructing the key frame encoding through a loop filtering network to obtain a key frame reconstruction frame; a non-key frame encoding module for performing end-to-end inter-frame coding on the non-key frames in the groups of pictures based on the key frame reconstruction frame to obtain a non-key frame encoding; and a non-key frame reconstruction frame module for reconstructing the non-key frame encoding through the loop filtering network to obtain a non-key frame reconstruction frame.
- A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program is executed by a processor to implement the deep learning based end-to-end video compression method according to any one of claims 1-8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010104772.5 | 2020-02-20 | ||
CN202010104772.5A CN111405283B (en) | 2020-02-20 | 2020-02-20 | End-to-end video compression method, system and storage medium based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021164176A1 true WO2021164176A1 (en) | 2021-08-26 |
Family
ID=71428456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/099445 WO2021164176A1 (en) | 2020-02-20 | 2020-06-30 | End-to-end video compression method and system based on deep learning, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111405283B (en) |
WO (1) | WO2021164176A1 (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113709504A (en) * | 2021-10-27 | 2021-11-26 | 深圳传音控股股份有限公司 | Image processing method, intelligent terminal and readable storage medium |
CN114363617A (en) * | 2022-03-18 | 2022-04-15 | 武汉大学 | Network lightweight video stream transmission method, system and equipment |
CN114513658A (en) * | 2022-01-04 | 2022-05-17 | 聚好看科技股份有限公司 | Video loading method, device, equipment and medium |
CN114584780A (en) * | 2022-03-03 | 2022-06-03 | 上海交通大学 | Image coding, decoding and compressing method based on depth Gaussian process regression |
CN114630129A (en) * | 2022-02-07 | 2022-06-14 | 浙江智慧视频安防创新中心有限公司 | Video coding and decoding method and device based on intelligent digital retina |
CN114858455A (en) * | 2022-05-25 | 2022-08-05 | 合肥工业大学 | Rolling bearing fault diagnosis method and system based on improved GAN-OSNet |
CN114926555A (en) * | 2022-03-25 | 2022-08-19 | 江苏预立新能源科技有限公司 | Intelligent data compression method and system for security monitoring equipment |
CN115049541A (en) * | 2022-07-14 | 2022-09-13 | 广州大学 | Reversible gray scale method, system and device based on neural network and image steganography |
CN115278249A (en) * | 2022-06-27 | 2022-11-01 | 北京大学 | Video block-level rate-distortion optimization method and system based on visual self-attention network |
CN115529457A (en) * | 2022-09-05 | 2022-12-27 | 清华大学 | Video compression method and device based on deep learning |
WO2023241188A1 (en) * | 2022-06-13 | 2023-12-21 | 北华航天工业学院 | Data compression method for quantitative remote sensing application of unmanned aerial vehicle |
CN117915096A (en) * | 2023-12-14 | 2024-04-19 | 北京大兴经济开发区开发经营有限公司 | Target identification high-precision high-resolution video coding method and system for AI large model |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114257818B (en) * | 2020-09-22 | 2024-09-24 | 阿里巴巴达摩院(杭州)科技有限公司 | Video encoding and decoding methods, devices, equipment and storage medium |
CN112203093B (en) * | 2020-10-12 | 2022-07-01 | 苏州天必佑科技有限公司 | Signal processing method based on deep neural network |
CN112866697B (en) * | 2020-12-31 | 2022-04-05 | 杭州海康威视数字技术股份有限公司 | Video image coding and decoding method and device, electronic equipment and storage medium |
CN115037936A (en) * | 2021-03-04 | 2022-09-09 | 华为技术有限公司 | Video coding and decoding method and device |
CN113179403B (en) * | 2021-03-31 | 2023-06-06 | 宁波大学 | Underwater video object coding method based on deep learning reconstruction |
CN113382247B (en) * | 2021-06-09 | 2022-10-18 | 西安电子科技大学 | Video compression sensing system and method based on interval observation, equipment and storage medium |
CN115604486A (en) * | 2021-07-09 | 2023-01-13 | 华为技术有限公司(Cn) | Video image coding and decoding method and device |
WO2023051653A1 (en) * | 2021-09-29 | 2023-04-06 | Beijing Bytedance Network Technology Co., Ltd. | Method, apparatus, and medium for video processing |
CN114386595B (en) * | 2021-12-24 | 2023-07-28 | 西南交通大学 | SAR image compression method based on super prior architecture |
CN114095728B (en) * | 2022-01-21 | 2022-07-15 | 浙江大华技术股份有限公司 | End-to-end video compression method, device and computer readable storage medium |
CN115022637A (en) * | 2022-04-26 | 2022-09-06 | 华为技术有限公司 | Image coding method, image decompression method and device |
CN116939210B (en) * | 2023-09-13 | 2023-11-17 | 瀚博半导体(上海)有限公司 | Image compression method and device based on self-encoder |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180124415A1 (en) * | 2016-05-06 | 2018-05-03 | Magic Pony Technology Limited | Encoder pre-analyser |
CN109151475A (en) * | 2017-06-27 | 2019-01-04 | 杭州海康威视数字技术股份有限公司 | A kind of method for video coding, coding/decoding method, device and electronic equipment |
US20190306526A1 (en) * | 2018-04-03 | 2019-10-03 | Electronics And Telecommunications Research Institute | Inter-prediction method and apparatus using reference frame generated based on deep learning |
CN110349141A (en) * | 2019-07-04 | 2019-10-18 | 复旦大学附属肿瘤医院 | A kind of breast lesion localization method and system |
CN110443173A (en) * | 2019-07-26 | 2019-11-12 | 华中科技大学 | A kind of instance of video dividing method and system based on inter-frame relation |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108921789A (en) * | 2018-06-20 | 2018-11-30 | 华北电力大学 | Super-resolution image reconstruction method based on recurrence residual error network |
US10999606B2 (en) * | 2019-01-08 | 2021-05-04 | Intel Corporation | Method and system of neural network loop filtering for video coding |
CN110351568A (en) * | 2019-06-13 | 2019-10-18 | 天津大学 | A kind of filtering video loop device based on depth convolutional network |
- 2020-02-20: CN CN202010104772.5A, patent CN111405283B/en, status: active
- 2020-06-30: WO PCT/CN2020/099445, patent WO2021164176A1/en, application filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180124415A1 (en) * | 2016-05-06 | 2018-05-03 | Magic Pony Technology Limited | Encoder pre-analyser |
CN109151475A (en) * | 2017-06-27 | 2019-01-04 | 杭州海康威视数字技术股份有限公司 | A kind of method for video coding, coding/decoding method, device and electronic equipment |
US20190306526A1 (en) * | 2018-04-03 | 2019-10-03 | Electronics And Telecommunications Research Institute | Inter-prediction method and apparatus using reference frame generated based on deep learning |
CN110349141A (en) * | 2019-07-04 | 2019-10-18 | 复旦大学附属肿瘤医院 | A kind of breast lesion localization method and system |
CN110443173A (en) * | 2019-07-26 | 2019-11-12 | 华中科技大学 | A kind of instance of video dividing method and system based on inter-frame relation |
Non-Patent Citations (2)
Title |
---|
DAVID MINNEN; JOHANNES BALLE; GEORGE TODERICI: "Joint Autoregressive and Hierarchical Priors for Learned Image Compression", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 September 2018 (2018-09-08), 201 Olin Library Cornell University Ithaca, NY 14853, XP081188741 * |
DJELOUAH ABDELAZIZ; CAMPOS JOAQUIM; SCHAUB-MEYER SIMONE; SCHROERS CHRISTOPHER: "Neural Inter-Frame Compression for Video Coding", 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), IEEE, 27 October 2019 (2019-10-27), pages 6420 - 6428, XP033723542, DOI: 10.1109/ICCV.2019.00652 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113709504A (en) * | 2021-10-27 | 2021-11-26 | 深圳传音控股股份有限公司 | Image processing method, intelligent terminal and readable storage medium |
CN114513658B (en) * | 2022-01-04 | 2024-04-02 | 聚好看科技股份有限公司 | Video loading method, device, equipment and medium |
CN114513658A (en) * | 2022-01-04 | 2022-05-17 | 聚好看科技股份有限公司 | Video loading method, device, equipment and medium |
CN114630129A (en) * | 2022-02-07 | 2022-06-14 | 浙江智慧视频安防创新中心有限公司 | Video coding and decoding method and device based on intelligent digital retina |
CN114584780A (en) * | 2022-03-03 | 2022-06-03 | 上海交通大学 | Image coding, decoding and compressing method based on depth Gaussian process regression |
CN114363617A (en) * | 2022-03-18 | 2022-04-15 | 武汉大学 | Network lightweight video stream transmission method, system and equipment |
CN114926555A (en) * | 2022-03-25 | 2022-08-19 | 江苏预立新能源科技有限公司 | Intelligent data compression method and system for security monitoring equipment |
CN114926555B (en) * | 2022-03-25 | 2023-10-24 | 江苏预立新能源科技有限公司 | Intelligent compression method and system for security monitoring equipment data |
CN114858455A (en) * | 2022-05-25 | 2022-08-05 | 合肥工业大学 | Rolling bearing fault diagnosis method and system based on improved GAN-OSNet |
WO2023241188A1 (en) * | 2022-06-13 | 2023-12-21 | 北华航天工业学院 | Data compression method for quantitative remote sensing application of unmanned aerial vehicle |
CN115278249A (en) * | 2022-06-27 | 2022-11-01 | 北京大学 | Video block-level rate-distortion optimization method and system based on visual self-attention network |
CN115049541A (en) * | 2022-07-14 | 2022-09-13 | 广州大学 | Reversible gray scale method, system and device based on neural network and image steganography |
CN115049541B (en) * | 2022-07-14 | 2024-05-07 | 广州大学 | Reversible gray scale method, system and device based on neural network and image steganography |
CN115529457A (en) * | 2022-09-05 | 2022-12-27 | 清华大学 | Video compression method and device based on deep learning |
CN115529457B (en) * | 2022-09-05 | 2024-05-14 | 清华大学 | Video compression method and device based on deep learning |
CN117915096A (en) * | 2023-12-14 | 2024-04-19 | 北京大兴经济开发区开发经营有限公司 | Target identification high-precision high-resolution video coding method and system for AI large model |
Also Published As
Publication number | Publication date |
---|---|
CN111405283B (en) | 2022-09-02 |
CN111405283A (en) | 2020-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021164176A1 (en) | End-to-end video compression method and system based on deep learning, and storage medium | |
Liu et al. | A unified end-to-end framework for efficient deep image compression | |
CN106973293B (en) | Light field image coding method based on parallax prediction | |
Golinski et al. | Feedback recurrent autoencoder for video compression | |
CN101049006B (en) | Image coding method and apparatus, and image decoding method and apparatus | |
CN112203093B (en) | Signal processing method based on deep neural network | |
CN108921910B (en) | JPEG coding compressed image restoration method based on scalable convolutional neural network | |
KR20200114436A (en) | Apparatus and method for performing scalable video decoing | |
CN110602494A (en) | Image coding and decoding system and method based on deep learning | |
CN111294604B (en) | Video compression method based on deep learning | |
CN109688407B (en) | Reference block selection method and device for coding unit, electronic equipment and storage medium | |
CN110062239B (en) | Reference frame selection method and device for video coding | |
CN101883284B (en) | Video encoding/decoding method and system based on background modeling and optional differential mode | |
CN106937112A (en) | Bit rate control method based on H.264 video compression standard | |
CN113132735A (en) | Video coding method based on video frame generation | |
WO2023082834A1 (en) | Video compression method and apparatus, and computer device and storage medium | |
TWI489876B (en) | A Multi - view Video Coding Method That Can Save Decoding Picture Memory Space | |
Sun et al. | High-quality single-model deep video compression with frame-conv3d and multi-frame differential modulation | |
CN113068041B (en) | Intelligent affine motion compensation coding method | |
Zhao et al. | A universal optimization framework for learning-based image codec | |
Wang et al. | Learning to fuse residual and conditional information for video compression and reconstruction | |
CN112954350B (en) | Video post-processing optimization method and device based on frame classification | |
Li et al. | 3D tensor auto-encoder with application to video compression | |
CN114222124B (en) | Encoding and decoding method and device | |
Dhungel et al. | An Efficient Video Compression Network |
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20919425; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 20919425; Country of ref document: EP; Kind code of ref document: A1