CN114286093A - Rapid video coding method based on deep neural network - Google Patents

Rapid video coding method based on deep neural network

Info

Publication number
CN114286093A
Authority
CN
China
Prior art keywords
block
value
hct
model
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111599851.9A
Other languages
Chinese (zh)
Inventor
陆宇
诸承广
殷海兵
周洋
黄晓峰
杨萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111599851.9A priority Critical patent/CN114286093A/en
Publication of CN114286093A publication Critical patent/CN114286093A/en
Pending legal-status Critical Current

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a rapid video coding method based on a deep neural network. The method comprises a CU partitioning module based on a deep neural network and a PU mode selection module based on neighborhood correlation. When a CU block is intra-coded, its rate-distortion cost is calculated through PU mode selection; at this point the neighborhood-correlation-based PU mode selection module performs the optimization, and the number of candidate modes entering RDO calculation is reduced using the prediction result of a lightweight HCT model. After PU mode selection is finished, the encoder makes a CU depth decision to determine whether the CU block is divided; at this point the deep-neural-network-based CU partitioning module performs the optimization, obtaining a prediction result from the HCT model to decide whether division can be terminated in advance. Otherwise, the sub-CU blocks continue to be divided downwards, and PU mode selection and CU division decisions continue recursively. The invention reduces the complexity of recursive CU partitioning, simplifies the intra-frame prediction mode selection process, and effectively improves the time efficiency of HEVC coding.

Description

Rapid video coding method based on deep neural network
Technical Field
The invention belongs to the technical field of High Efficiency Video Coding (HEVC), and particularly relates to a low-complexity fast HEVC intra-frame video coding method.
Background
The latest video coding standard, High Efficiency Video Coding (HEVC), finalized by the Joint Collaborative Team on Video Coding (JCT-VC) in 2012, significantly improves coding performance. Compared with the previous video coding standard H.264/AVC, HEVC uses several well-designed methods to save approximately 50% of the bit rate at the same video compression quality. Specifically, for intra coding, a Coding Unit (CU) based on a quad-tree structure is recursively divided into blocks from 64 × 64 down to 8 × 8; in addition, a Prediction Unit (PU) is allowed at most 35 intra prediction modes, comprising DC, Planar, and 33 angular prediction modes. Both techniques improve coding performance, but at the cost of greatly increased coding complexity, making it difficult to meet the requirements of real-time applications. Therefore, it is necessary to research fast video encoding methods.
Many fast HEVC intra coding algorithms have been proposed so far, and they can be roughly classified into two main categories: fast partitioning of CU blocks and fast selection of intra modes. Since CU partitioning is a recursive, top-down rate-distortion-optimized process with flexible block sizes, its coding complexity is very high, and many fast coding methods try to predict the CU partitioning pattern in advance to avoid a lengthy recursive RDO search. In terms of intra mode selection, the current HEVC encoder employs a three-step algorithm to accelerate the intra mode decision, and optimization of intra mode selection currently focuses mainly on simplifying the RMD process or the RDO calculation. Existing optimization algorithms use traditional heuristic methods to manually extract texture features of CU blocks or exploit correlation between adjacent CUs; machine-learning methods such as decision trees, support vector machines, and Bayesian decisions have also been applied to CU depth decisions, achieving a certain optimization effect. Convolutional neural networks, thanks to their excellent local feature extraction capability, have shown outstanding image classification performance in computer vision, and in recent years many network structures have been designed that automatically extract texture and object features from coding blocks. However, prediction accuracy remains a problem: high accuracy means few wrongly predicted CUs and no excessive, unnecessary bit-rate increase, but it usually requires an overly complex network model that brings extra time overhead, so the network structure design must balance complexity against prediction accuracy.
In 2017 the Google team proposed the Transformer neural network model based on the Self-Attention mechanism. The model adopts self-attention instead of the sequential structure of an RNN, so it can be trained in parallel and can capture global information; its effect on sequence processing is remarkable, and it quickly became the preferred model in Natural Language Processing (NLP) and gradually expanded into computer vision. In 2021 the Google team published a paper entitled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" at the International Conference on Learning Representations (ICLR); the Vision Transformer (ViT) network model described therein introduced the Transformer structure into computer vision classification tasks for the first time and achieved performance superior to CNN models. The ViT model migrates the encoder module of the Transformer for direct use and adds a classification vector for outputting classification results. Compared with a CNN (convolutional neural network), the ViT network can integrate information of the whole image, even low-level information, and thus achieves a better classification effect, which a CNN cannot; ViT shows great potential and has caused research on Transformers and attention mechanisms to grow explosively.
Disclosure of Invention
The invention aims to provide a fast video coding method based on a deep neural network, addressing the defects of high complexity and long overall coding time in existing HEVC video coding. By exploiting the strong feature extraction and learning capability of a deep neural network based on convolution and the self-attention mechanism, fast coding is realized while the coding performance is preserved.
The invention provides a rapid video coding method based on a deep neural network, which specifically comprises a CU (Coding Unit) partitioning module based on the deep neural network and a PU (Prediction Unit) mode selection module based on neighborhood correlation.
The CU partitioning module based on the deep neural network predicts the partitioning result of each CU block from top to bottom with the neural network and optimizes the prediction result with a threshold to reduce erroneous predictions; the encoder finally judges, according to the predicted partitioning result, whether the current CU block terminates partitioning in advance during encoding.
The PU mode selection module based on neighborhood correlation first predicts, with a neural network, the position of the best mode in the candidate mode list after RMD rough selection; by discarding the modes after that position it prevents redundant modes from entering RDO calculation, and it further optimizes the Most Probable Modes (MPM) to reduce the number of MPMs in the candidate list, thereby reducing the mode calculation amount and the PU mode selection complexity.
By exploiting the neighborhood correlation of CU depth and PU prediction modes within the video image, the method reduces the complexity of recursive CU partitioning, simplifies the intra-frame prediction mode selection process, and effectively improves the coding efficiency of HEVC.
The technical scheme adopted by the invention to solve the technical problem is as follows:
(I) CU partitioning module based on the deep neural network:
Step (I), constructing a data set for network model training in HEVC intra mode: the data set is derived from YUV video sequences of various resolutions, including CIF (352 × 288), 480p (832 × 480), 720p (1280 × 720), 1080p (1920 × 1080), and WQXGA (2560 × 1600). The samples of the data set consist of the luma component of a CU block and the corresponding sample label, where the sample labels are obtained by encoding the luma component with the HEVC reference software HM16.9. The data set includes a training set, a validation set, and a test set, each of which is further divided into four subsets according to four QPs (22, 27, 32, 37).
Step (II), constructing a deep neural network for each of the three CU block sizes 64 × 64, 32 × 32, and 16 × 16 to form a hierarchical convolutional network (HCT) structure, where HCT combines ViT and CNN. The HCT is trained on the corresponding training set, the HCT model is determined and saved using the validation set, and finally the generalization ability of the HCT model is judged on the test set. The objective function of HCT model training is the cross-entropy loss function (CrossEntropyLoss):
Loss(output, target) = -output[target] + log(∑_{j=1}^{L} exp(output[j]))  (1)
wherein output is the output vector of the HCT model, target is the sample label value, and L is the length of the output vector.
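For illustration only, this objective corresponds to PyTorch's built-in CrossEntropyLoss; the tensor values below are hypothetical examples, not data from the invention:

import torch
import torch.nn as nn

# Hypothetical batch of four two-class HCT output vectors (L = 2) and their labels
output = torch.randn(4, 2)            # HCT model outputs, shape [batch, L]
target = torch.tensor([0, 1, 1, 0])   # sample label values (0 = no split, 1 = split)

criterion = nn.CrossEntropyLoss()     # implements formula (1)
loss = criterion(output, target)
print(loss.item())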
Step (III), the hierarchical convolutional network HCT consists of a convolution module, a Transformer encoder module, a Sequence Pooling layer, and a fully connected layer. First, the luminance component of a CU block is fed into the HCT, and a feature map carrying local feature information is output by the convolution module; the convolution module comprises a convolution layer and a max pooling layer (MaxPool), each activated by a linear rectification function (ReLU) to improve the nonlinearity of the model. The feature map is then flattened (Flatten) into one dimension and transposed. Suppose an input image x ∈ R^{C×H×W}, where C represents the number of input images, H is the image height, and W is the image width; the output of the convolution module is obtained as follows:
x0=Transpose(Flatten(MaxPool(Conv2d(x)))) (2)
The feature data x0 is then added to a learnable position vector and sent to the Transformer encoder module for global information extraction. The encoder module has 7 layers in total, each consisting of a Multi-headed Self-attention Layer (MSL) and a Feed-Forward Convolution Layer (FCL), and a Layer Normalization (LN) operation is performed before both sublayers to improve the robustness and generalization capability of the model. The feature data x0 first passes through the multi-head self-attention layer, and its output is added to x0 to obtain new feature data x1; x1 then passes through the feed-forward convolution layer, whose output is added to x1 to obtain feature data x2. The formulas are as follows:
x1=x0+MSL(LN(x0)) (3)
x2=x1+FCL(LN(x1)) (4)
Finally, a classification vector is obtained through the Sequence Pooling layer. Sequence pooling applies a mapping T: R^{b×n×d} → R^{b×d}, where b denotes the batch size (batch_size), n denotes the number of feature data items, and d denotes the size of each feature data item. This operation transforms the feature data x2 output by the whole Transformer encoder module directly into a classification vector containing information about the various parts of the input image, replacing the additional classification vector used in ViT. Finally, the classification vector passes through the fully connected layer and Softmax to output the classification result, and the final predicted value is the index (0 or 1) of the maximum output value.
Step (IV), the HCT models are trained with stochastic gradient descent (SGD), and the 12 most accurate HCT models (3 CU block sizes under 4 QPs) are saved. The trained HCT models predict the division results of 64 × 64, 32 × 32, and 16 × 16 blocks from top to bottom with an early termination mechanism; the model prediction has two classes: 0 represents no partition and 1 represents partition. When the prediction result for a block is 0, the quadtree division is not continued downwards during encoding; stopping the division in advance avoids redundant block division operations and reduces the coding time complexity.
In order to reduce the extra coding performance loss caused by the error prediction of the model, the module adopts threshold optimization to improve the coding performance. In the invention, the Similarity (SD) between two classification vectors is compared with a threshold lambda, and when the SD is smaller than the threshold lambda, the CU blocks can be checked by adopting an original coding mode, so that the wrong judgment of CU division results is reduced, the coding performance is improved, the balance between the coding performance and the complexity is realized, and the calculation formula of the similarity SD is as follows:
SD = |Softmax(output_i)[0] - Softmax(output_i)[1]|  (5)
where output_i is the two-class output vector of an i × i-sized block. We divide the threshold λ into three classes according to block size, with a size ratio of 4:2:1.
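For illustration, the threshold check described above can be sketched as follows; the exact form of SD and the numerical λ values are assumptions made for the example, not the disclosed values:

import torch
import torch.nn.functional as F

# Assumed threshold values for the three block sizes (ratio 4:2:1); illustrative only.
LAMBDA = {64: 0.4, 32: 0.2, 16: 0.1}

def cu_split_decision(output_i: torch.Tensor, block_size: int) -> str:
    """Return 'split', 'no_split', or 'fallback' (re-check with the original RDO)."""
    p = F.softmax(output_i, dim=-1)        # soft classification values in [0, 1]
    sd = (p[0] - p[1]).abs().item()        # similarity SD between the two classes
    if sd < LAMBDA[block_size]:
        return "fallback"                  # model is uncertain: keep original coding
    return "split" if p.argmax().item() == 1 else "no_split"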
(II) PU mode selection module based on neighborhood correlation:
step (1), obtaining a sample label value label ∈ [0,1,2] of each PU block during intra mode selection through HM coding, and obtaining a rule as follows: for PU blocks of 64 × 64,32 × 32, and 16 × 16 sizes, the original length of the candidate list after RMD rough selection is 3, and if the best mode of the PU block during mode selection is the first bit in the candidate list after RMD rough selection, label is 0, and the length of the candidate list after RMD rough selection becomes 1; if the best mode of the PU block during mode selection is located at the second bit in the candidate list after RMD rough selection, the label is 1, and the length of the candidate list after corresponding RMD rough selection becomes 2; in other cases, label is 2, and the candidate list after RMD roughing has a length of 3. For an 8 × 8, 4 × 4 PU block, since its candidate list is originally 8 long, we also divide it into three intervals to correspond to label 0,1,2, which are: if the best mode of the PU block after mode selection is located at the first or second bit in the candidate list after RMD rough selection, then label is 0, and the length of the candidate list after corresponding RMD rough selection becomes 2; when the best mode of the PU block during mode selection is located in the third or fourth bit of the candidate list after RMD rough selection, label is 1, and the length of the candidate list after RMD rough selection becomes 4; in other cases, label is 2, and the candidate list after RMD roughing is 8.
Step (2), the data set of the PU mode selection module also comes from the video sequences mentioned in the deep-neural-network-based CU partitioning module, with PU block data of size 8 × 8 and 4 × 4 added on top of the block partitioning data set. The models for PU blocks of size 64 × 64, 32 × 32, and 16 × 16 are similar to those of the block partitioning module, but the number of layers of the Transformer encoder module becomes 1. The models corresponding to 8 × 8 and 4 × 4 PU blocks additionally simplify the convolution module on top of the single encoder layer, i.e., no dimension-reduction operation is performed and the max pooling layer is removed, to reduce model complexity. In addition, the model is trained by regression using the mean squared error loss function (MSELoss):
Loss = (1/N) ∑_{k=1}^{N} ∑_{j=1}^{3} (output_{k,j} - value_{k,j})²  (6)
wherein output is a model output value vector, the length is 3, value is a true value vector obtained after the output is compared with label, and N is the number of input images during each training.
The true value vector is obtained as follows. Let output = [x, y, z]. In the case label = 0: if the maximum value in output is at index 0, then value = output; if the maximum is at index 1, then value = [y, x, z]; if the maximum is at index 2, then value = [z, y, x]. Similarly, in the case label = 1: if the maximum value in output is at index 0, then value = [y, x, z]; if the maximum is at index 1, then value = output; if the maximum is at index 2, then value = [x, z, y]. In the case label = 2: if the maximum value in output is at index 0, then value = [z, y, x]; if the maximum is at index 1, then value = [x, z, y]; if the maximum is at index 2, then value = output.
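For illustration, this exchange rule can be written as a small Python helper (function and variable names are chosen for the example only):

def make_value_vector(output, label):
    """Build the true value vector by swapping elements of the model output so that
    its maximum lands at the position implied by the label (0, 1 or 2)."""
    x, y, z = output
    pos_max = output.index(max(output))      # where the model's maximum actually is
    swaps = {                                # (label, pos_max) -> value vector
        (0, 0): [x, y, z], (0, 1): [y, x, z], (0, 2): [z, y, x],
        (1, 0): [y, x, z], (1, 1): [x, y, z], (1, 2): [x, z, y],
        (2, 0): [z, y, x], (2, 1): [x, z, y], (2, 2): [x, y, z],
    }
    return swaps[(label, pos_max)]

# Example: make_value_vector([0.1, 0.7, 0.2], label=0) returns [0.7, 0.1, 0.2].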
The invention has the following beneficial effects:
(1) the invention adopts a batch prediction method for the video sequence, can obtain the prediction result of the CU block or the PU block only by running once, and can obviously reduce the prediction time of the neural network. The invention adopts a method of terminating CU block division in advance, reduces the number of modes of the PU block entering RDO calculation after RMD rough selection by utilizing a prediction result, optimizes the MPM mode to reduce the number of modes in a candidate list, and realizes quick coding.
(2) Compared with the CNN structure in the prior art, the HCT model not only can automatically extract the relevant local features of the image block, but also has the capability of extracting global information, so that the prediction accuracy and generalization capability of the model are improved, the computational complexity of the model is reduced due to the parallelization computational characteristics, and the consumed memory resources are greatly reduced. The data set adopted by the invention is far smaller than the data set required by the CNN network in the prior art, so that the training time of the model is greatly reduced, the code rate is not increased much under the condition of ensuring that the time complexity is greatly reduced, and the practicability is stronger.
(3) The invention was simulated on video sequences of five different resolution classes: A (2560 × 1600), B (1920 × 1080), C (832 × 480), D (416 × 240), and E (1280 × 720). Experimental results show that the average coding time saving reaches about 70%, while the bit rate increases by only about 2%.
Drawings
FIG. 1 is a flow chart of HEVC original intra coding
FIG. 2 is a general algorithm diagram of the present invention;
FIG. 3 is a flow chart of a CU partition module method of the present invention;
FIG. 4 is a schematic of a data set according to the present invention;
FIG. 5 is a schematic diagram of the HCT model of the present invention;
FIG. 6 is a flow chart of a PU mode selection module method according to the present invention;
FIG. 7 is a diagram illustrating the correspondence between PU mode selection module dataset tags and the best mode locations after RMD rough selection according to the present invention;
FIG. 8 is a schematic diagram of a corresponding relationship between a tag of a dataset and a true value vector during the PU mode selection module model training of the present invention;
Detailed Description
As shown in fig. 1, in an HEVC original intra coding process, a CU block with a size of 64 × 64 is first subjected to PU mode selection to calculate rate-distortion cost, then divided into 4 CU blocks with a size of 32 × 32 by a quad-tree, and these CU blocks are further subjected to PU mode selection to calculate rate-distortion cost, so that the CU blocks are divided down to a size of 8 × 8, and finally rate-distortion costs of the CU blocks and 4 sub-CU blocks of the CU blocks are compared from bottom to top, so as to obtain a division result. In the deep learning method in the prior art, a convolutional neural network is mostly adopted to automatically extract local features in a CU or a PU block for training so as to optimize the downward division operation of the CU block or the rate-distortion calculation of PU mode selection, and a network model usually needs a complex network structure or a large-scale data set for training to achieve good prediction accuracy and generalization capability.
Core improvements of the invention include: 1. A hierarchical convolutional network (HCT) is proposed; trained on only a small data set, the model can reach a prediction accuracy similar to, or even better than, a CNN model trained on a large data set, and CU block division results are output rapidly through batch prediction to reduce intra-mode complexity. 2. A lightweight variant of the HCT structure is used for PU mode selection; the initial 3 (for 64 × 64, 32 × 32, and 16 × 16 PU blocks) or 8 (for 8 × 8 and 4 × 4 PU blocks) candidate modes are divided into three classes of [1, 2, 3] or [2, 4, 8] for prediction, the MPM is optimized, and the time complexity of intra coding is reduced by decreasing the number of modes entering RDO calculation.
The invention is further described below with reference to the drawings and the examples.
The general algorithm framework of the deep-neural-network-based fast video coding method is shown in FIG. 2 and comprises a CU partitioning module based on a deep neural network and a PU mode selection module based on neighborhood correlation. When a CU block is intra-coded, its rate-distortion cost is calculated through PU mode selection; the PU mode selection module performs the optimization here, and the number of candidate modes entering RDO calculation is reduced using the prediction result of the lightweight HCT model. After PU mode selection is finished, the encoder makes a CU depth decision to determine whether the CU block is divided; the neural-network-based CU partitioning module performs the optimization here, obtaining a prediction result from the HCT model to decide whether division can be terminated in advance; otherwise the block is divided downwards and PU mode selection and CU division decisions continue for the sub-CU blocks. In the simulation, the HEVC official reference software HM16.9 is used for compression coding, the test conditions follow the JCT-VC common test conditions (JCTVC-R1015), and the HM16.9 all-intra coding configuration file encoder_intra_main.cfg is used. The flow chart of the deep-neural-network-based CU block partitioning module is shown in FIG. 3; the specific steps are as follows:
and (I) constructing a database required by HCT model training, wherein the related data source is shown in FIG. 4. First, the luminance component of one frame of image is extracted every 50 frames from the video sequence shown in fig. 4, and encoding is performed to obtain two classification labels (labels) corresponding to three CU blocks of 64 × 64,32 × 32, and 16 × 16 in the picture, where the label value is 0 (representing no division) or 1 (representing division). The resulting data set sample size is 9, 865, 968, and divided into 12 sub-data sets according to their QP values and CU block sizes for training of the network model.
Step (II), constructing the HCT model used for training on the database and for CU partition prediction. As shown in FIG. 5, the input of the HCT model is the luminance component of a CU block, and each sub-data set corresponds to one network model. Each CU block is fed into the network and passes through a convolution module, a Transformer encoder module, a Sequence Pooling layer, and a fully connected layer; a two-class classification vector is finally output, and during prediction the index of the maximum value in the vector is taken as the block division prediction result. The HCT model combines CNN and Vision Transformer networks; the configuration and function of each part are described as follows:
1) Convolution module. This module extracts the local features of the CU block and improves the prediction accuracy of the model; the formula is as follows:
x0=Transpose(Flatten(MaxPool(Conv2d(x)))) (1)
Since the input pictures are processed in batches, T input pictures of size [1, W, W] (W ∈ {64, 32, 16}) are selected. A convolution operation is performed on the CU luminance component x with a 3 × 3 kernel, a stride of 2, and padding of 1, yielding 128 feature maps of size W/2 × W/2; after max pooling, the dimensions become [T, 128, W/4, W/4]. Both the convolution and max pooling layers are activated by a linear rectification function (ReLU) to improve model nonlinearity. The feature maps are then flattened into one dimension and transposed to x0 = [T, W/4 × W/4, 128]. Finally, a learnable position vector is added; this idea comes from the ViT network structure and lets the model learn the correlation among different positions, improving prediction accuracy.
2) Transformer encoder module. The encoder module in the invention has 7 layers, each consisting of a Multi-headed Self-attention Layer (MSL) and a Feed-Forward Convolution Layer (FCL). The module processes the feature-map data in parallel and captures global information of the image. The multi-head self-attention layer comes from the ViT network structure. The feed-forward layer in the original ViT encoder module is actually a combination of two fully connected layers performing a simple dimension-raising and dimension-reducing operation, and too many training parameters would make the computation excessive; the invention therefore replaces the feed-forward layer with two convolutional layers with 1 × 1 convolution kernels to optimize the model parameters. The input and output of the module remain unchanged at [T, W/4 × W/4, 128]; the related formulas are as follows:
x1=x0+MSL(LN(x0)) (2)
x2=x1+FCL(LN(x1)) (3)
The feature data x0 output by the convolution module first passes through the multi-head self-attention layer, and its output is added to x0 to obtain new feature data x1; x1 then passes through the feed-forward convolution layer, whose output is added to x1 to obtain feature data x2. A Layer Normalization (LN) operation is added before the multi-head self-attention layer and the feed-forward convolution layer, which stabilizes the model and provides regularization.
3) Sequence pooling layer. The data output by the Transformer encoder module contains both local and global information of the input image, unlike the additionally appended classification token in ViT. First, the encoder output x = [T, W/4 × W/4, 128] is reduced in dimension to [T, W/4 × W/4, 1], normalized by Softmax, and then transposed to y = [T, 1, W/4 × W/4]; multiplying y and x gives a vector of dimension [T, 1, 128]. Finally, the classification vector obtained by this dimension reduction passes through a fully connected layer to give the two-class output: [T, 128] → [T, 2]. Each CU block thus outputs two classification values representing the probabilities of non-division and division, and the index (0 or 1) of the maximum value expresses non-division or division.
4) Other layers. A dropout operation with a random discarding probability of 10% is added to the multi-head self-attention layer and the feed-forward convolution layer of the Transformer encoder module to prevent overfitting and improve the generalization capability of the network.
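Putting parts 1) to 4) together, a condensed PyTorch sketch of the HCT forward pass is given below for illustration; hyper-parameters not stated above (such as the number of attention heads) are assumptions, and the sketch is not the exact disclosed implementation:

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: pre-LN multi-head self-attention (MSL) plus a feed-forward
    convolution layer (FCL) of two 1x1 convolutions, with 10% dropout.
    The head count (4) is an assumption."""
    def __init__(self, dim=128, heads=4, p=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msl = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.fcl = nn.Sequential(nn.Conv1d(dim, dim, 1), nn.ReLU(), nn.Conv1d(dim, dim, 1))
        self.drop = nn.Dropout(p)

    def forward(self, x0):                                    # x0: [T, n, dim]
        h = self.ln1(x0)
        x1 = x0 + self.drop(self.msl(h, h, h, need_weights=False)[0])   # formula (2)
        h = self.ln2(x1).transpose(1, 2)                      # Conv1d expects [T, dim, n]
        x2 = x1 + self.drop(self.fcl(h).transpose(1, 2))      # formula (3)
        return x2

class HCT(nn.Module):
    """Sketch of the HCT model for one CU size W (64, 32 or 16)."""
    def __init__(self, W, dim=128, depth=7, num_classes=2):
        super().__init__()
        # 1) Convolution module: 3x3 conv, stride 2, padding 1 -> ReLU -> 2x2 max pool -> ReLU
        self.conv = nn.Sequential(
            nn.Conv2d(1, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.ReLU(),
        )
        n = (W // 4) * (W // 4)                               # number of feature tokens
        self.pos = nn.Parameter(torch.zeros(1, n, dim))       # learnable position vector
        # 2) Transformer encoder module with 7 layers
        self.encoder = nn.ModuleList([EncoderLayer(dim) for _ in range(depth)])
        # 3) Sequence pooling projection and the fully connected layer
        self.pool_proj = nn.Linear(dim, 1)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):                                     # x: [T, 1, W, W] CU luma blocks
        x0 = self.conv(x).flatten(2).transpose(1, 2) + self.pos   # [T, n, dim], formula (1)
        for layer in self.encoder:
            x0 = layer(x0)                                    # global information extraction
        y = torch.softmax(self.pool_proj(x0), dim=1).transpose(1, 2)  # [T, 1, n]
        z = torch.bmm(y, x0).squeeze(1)                       # sequence pooling: [T, dim]
        return self.fc(z)                                     # [T, 2] two-class output vector

# Usage sketch: pred = HCT(W=32)(torch.randn(8, 1, 32, 32)).argmax(dim=1)  # 0 = no split, 1 = split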
Step (III), with the database and the network models in place, the models can be trained. Under 3 CU block sizes and 4 QPs, the block partitioning module of the invention needs 12 HCT models in total. The network models are built and trained with the PyTorch deep learning library, and the required loss function is the cross-entropy loss function:
Loss(output, target) = -output[target] + log(∑_{j=1}^{L} exp(output[j]))  (4)
where output is the output vector of the network model, target is the sample label value, and L is the output vector length. After each pass over the training set (one iteration), the trained model parameters are validated for accuracy on the validation set to decide whether they are saved; the number of iterations is fixed at 100 and the batch size is 64, and finally the network model with the highest accuracy on the test set is used for prediction.
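For illustration, the training procedure described above can be sketched as follows; the learning rate and momentum are assumptions, since only the iteration count and batch size are stated:

import torch
from torch import nn, optim
from torch.utils.data import DataLoader

@torch.no_grad()
def accuracy(model, dataset, batch_size=64):
    model.eval()
    correct = total = 0
    for luma, target in DataLoader(dataset, batch_size=batch_size):
        correct += (model(luma).argmax(dim=1) == target).sum().item()
        total += target.numel()
    return correct / total

def train_hct(model, train_set, val_set, epochs=100, batch_size=64, lr=0.01):
    """Train one of the 12 HCT models with SGD; lr and momentum are illustrative."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    best_acc, best_state = -1.0, None
    for _ in range(epochs):                         # 100 iterations over the training set
        model.train()
        for luma, target in loader:                 # luma: CU luminance components
            optimizer.zero_grad()
            criterion(model(luma), target).backward()
            optimizer.step()
        acc = accuracy(model, val_set)              # validate after each iteration
        if acc > best_acc:                          # keep only the most accurate parameters
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model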
Step (IV), after the models are trained they can be used for coding; the HEVC reference software used by the invention is HM16.9. When the encoder starts coding, the HCT network models first predict the three CU block sizes of the frame to be encoded in a batch, and the obtained prediction results are then used in the quadtree division decision of CU blocks in intra mode: if the prediction result of the current CU block is 0, the division is terminated in advance; otherwise, the quadtree division decision continues downwards.
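For illustration, the early-termination logic during encoding can be sketched as follows; the CU type, the hct_pred lookup and the pu_mode_selection callback are hypothetical stand-ins for the corresponding HM operations:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CU:
    x: int
    y: int
    size: int

def quad_split(cu: CU) -> List[CU]:
    h = cu.size // 2
    return [CU(cu.x + dx, cu.y + dy, h) for dy in (0, h) for dx in (0, h)]

def encode_cu(cu: CU, hct_pred: Callable[[CU], int], pu_mode_selection: Callable[[CU], None]):
    """hct_pred(cu) returns the batch-predicted HCT result for 64x64/32x32/16x16 blocks
    (0 = no partition, 1 = partition)."""
    pu_mode_selection(cu)                       # PU mode selection / RD cost of this CU
    if cu.size > 8 and hct_pred(cu) == 1:       # prediction 0: terminate division in advance
        for sub in quad_split(cu):              # otherwise continue the quadtree downwards
            encode_cu(sub, hct_pred, pu_mode_selection)

# Usage sketch: encode_cu(CU(0, 0, 64), lambda c: 1 if c.size > 16 else 0, lambda c: None)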
In order to reduce the extra coding-performance loss caused by wrong model predictions, this module also adds threshold optimization to improve coding performance. Since the output of the model is a two-class vector, we transform it with the Softmax normalization operation into soft classification values between [0, 1], where the values at indices 0 and 1 represent the probabilities that the model predicts non-partitioning and partitioning, respectively. When the distance between the two is larger, i.e., one probability grows while the other shrinks, the model distinguishes the two classes more clearly and is less likely to mispredict; conversely, the closer the two probabilities are to being equal, the more ambiguous and uncertain the model's judgment of the two classes becomes, and wrong predictions occur more easily. Therefore the Similarity (SD) between the soft classification values is compared with the threshold λ, and when SD is smaller than λ the CU block is checked with the original coding procedure; this reduces wrong CU division decisions, improves coding performance, and realizes a trade-off between coding performance and complexity. The calculation formula of the similarity SD is as follows:
SD = |Softmax(output_i)[0] - Softmax(output_i)[1]|  (5)
where output_i is the two-class output vector of an i × i-sized block and T is the number of input images. It should be noted that the achievable complexity reduction differs with CU block content, and larger blocks are more likely to be partitioned, so the threshold is set differently for different block sizes. Here we divide the constant λ into three classes according to block size, with λ_{64×64} = 2λ_{32×32} = 4λ_{16×16}; experiments show that this combination of constants achieves a good coding-performance improvement. Through this optimization, the video coding time complexity can be effectively reduced while the bit rate increases only slightly.
The flow chart of the PU mode selection module based on neighborhood correlation is shown in FIG. 6. The module uses a lightweight improved network model, Light-HCT, for prediction; the specific steps are as follows:
and (1) firstly, constructing a data set required by the module. The initial candidate pattern list length in PU mode selection is 3 (corresponding to PU block sizes of 64 × 64,32 × 32,16 × 16) or 8 (corresponding to PU block sizes of 8 × 8 and 4 × 4), and it is observed that the best mode of the PU block is not fixed in the candidate pattern list after RMD coarse selection, and the best mode is a redundant mode after the best mode, which increases the RDO calculation amount. The present invention reduces the amount of subsequent RDO computations by eliminating these redundant modes by predicting the position of the best mode in the RMD-roughed candidate list, with the data set selected for the PU mode covering PU blocks from 64 x 64 to 4 x 4. The label value of each PU block is label ∈ [0,1,2], the predicted best mode position is also classified into three classes according to the label category, and the three classes are in one-to-one correspondence with the label value, the correspondence is shown in fig. 7, and the rule of the correspondence is as follows:
First, the initial candidate list length of PU blocks of size 64 × 64, 32 × 32, and 16 × 16 is 3. If the best mode of the current PU block is at the 1st position in the mode list after RMD rough selection, then label = 0; if the best mode is at the 2nd position, then label = 1; in other cases label = 2.
Secondly, since the initial candidate list length of 8 × 8 and 4 × 4 PU blocks is 8, position intervals are mapped to the label value: if the best mode of the current PU block is at the 1st or 2nd position in the mode list after RMD rough selection, then label = 0; if it is at the 3rd or 4th position, then label = 1; in other cases label = 2.
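For illustration, the correspondence above (including the shortened candidate-list lengths of FIG. 7) can be expressed as a small helper; names are chosen for the example only:

def pu_label_and_list_len(best_mode_rank: int, pu_size: int):
    """Map the 1-based position of the best mode in the RMD rough-selection list
    to the sample label and the truncated candidate-list length."""
    if pu_size >= 16:                  # 64x64, 32x32, 16x16: original list length 3
        if best_mode_rank == 1:
            return 0, 1
        if best_mode_rank == 2:
            return 1, 2
        return 2, 3
    else:                              # 8x8, 4x4: original list length 8
        if best_mode_rank <= 2:
            return 0, 2
        if best_mode_rank <= 4:
            return 1, 4
        return 2, 8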
Step (2), constructing the lightweight HCT model, i.e., the Light-HCT model, with the PyTorch deep learning library. Unlike the HCT model, Light-HCT reduces the number of Transformer encoder layers from 7 to 1. Instead of the classification-accuracy objective used by HCT, Light-HCT is trained to fit the position of the best mode in the candidate list after RMD rough selection, because the mode selection of a PU block is determined by texture features, boundary curvature, quantization parameters, and boundary direction, and a deep learning method cannot automatically extract so many features, which makes training with a pure classification method difficult. In particular, since 8 × 8 and 4 × 4 PU blocks are very small and do not need a very complicated network structure, we further simplify the convolution module of the Light-HCT model: the convolution layer parameters are changed to a 3 × 3 kernel with stride and padding of 1, i.e., no dimension reduction is performed, and the following max pooling layer is removed.
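For illustration, the simplified convolution module for 8 × 8 and 4 × 4 PU blocks could be sketched as follows; apart from the changes stated above, the remaining structure follows the HCT sketch given earlier, and all names here are illustrative:

import torch
import torch.nn as nn

# Light-HCT for 8x8 and 4x4 PU blocks: a single encoder layer, and a simplified
# convolution module (3x3 kernel, stride 1, padding 1, no max pooling), so no
# dimension reduction is performed and the spatial size W stays unchanged.
W, dim = 8, 128                                    # e.g. an 8x8 PU block
conv = nn.Sequential(nn.Conv2d(1, dim, 3, stride=1, padding=1), nn.ReLU())

x = torch.randn(16, 1, W, W)                       # a batch of 16 PU luma blocks
x0 = conv(x).flatten(2).transpose(1, 2)            # [16, W*W, 128] = [16, 64, 128]
print(x0.shape)
# x0 then passes through one encoder layer, sequence pooling and a fully connected
# layer producing a length-3 output vector for the regression described below.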
Step (3), the model is trained by regression prediction, and the loss function uses the MSELoss function in PyTorch:
Loss = (1/N) ∑_{k=1}^{N} ∑_{j=1}^{3} (output_{k,j} - value_{k,j})²  (6)
where output is the network output vector of length 3 and value is the true value vector obtained by comparing output with label, as shown in FIG. 8. The value vector is obtained by the following rule:
and (c) 0. If the maximum value in output ═ x, y, z ] occurs where the subscript is 0, then value ═ output; if the maximum occurs at the subscript 1, then value ═ y, x, z; if the maximum occurs at a subscript of 2, then value is [ z, y, x ].
(1) in the case of label. If the maximum value in output ═ x, y, z ] occurs where the subscript is 0, then value ═ y, x, z ]; if the maximum value appears where the subscript is 1, value is output; if the maximum occurs at a subscript of 2, then value is [ x, z, y ].
③ 2. If the maximum value in output ═ x, y, z ] occurs where the subscript is 0, then value ═ z, y, x ]; if the maximum occurs at the subscript 1, then value ═ x, z, y; if the maximum occurs at the subscript of 2, value is output.
The purpose of the above exchange mechanism is to make the position of the maximum value predicted by the model approach the position corresponding to the label value, so that each PU block is fitted to its corresponding category and a balance between coding performance and complexity is achieved.
In step (4), the invention optimizes the Most Probable Mode (MPM) part after RMD rough selection. First, the mode with the minimum SAD and SATD cost in the candidate mode list after RMD rough selection is obtained, and its SATD cost is recorded as J_SATDmin. SAD (Sum of Absolute Differences) represents the magnitude of the residual between the original image block and the predicted image block; SATD (Hadamard-transformed SAD) is the magnitude of the transformed residual, obtained by applying a Hadamard transform to the prediction residual and then summing the absolute values. These two cost values reflect the RD-cost of the PU block to some extent and can be used for preliminary mode screening. The two cost functions are calculated as follows:
TD(x,y)=|Orig(x,y)-Pred(x,y)| (7)
SAD = ∑_{x,y} TD(x, y) (8)
SATD = ∑_{x,y} |Hadamard(Orig - Pred)(x, y)|  (9)
after the MPM part comes, because there are three MPMs obtained from the adjacent PU blocks, firstly, whether the MPMs are the same as the mode with the minimum SAD and SATD costs is compared in sequence, if so, the MPM process is terminated, the mode is taken as the best mode and sent to RDO (remote data object) calculation, and other modes are discarded; if not, calculating the SATD cost value of the MPM and comparing the SATD cost value with an adaptive threshold AT, wherein the adaptive threshold AT defined by the invention is as follows:
AT = ρ × J_SATDmin (10)
where the proportionality coefficient ρ = 1.3, obtained from experimental statistical analysis of a large number of video sequences. If the SATD cost of an MPM is larger than AT, the mode is not added to the RDO candidate mode list; otherwise it is added to the candidate list if it is not already present. The next MPM is then judged in the same way until all MPMs have been processed, and this procedure replaces the MPM operation of the original encoder.
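For illustration, the cost functions and the MPM screening of step (4) can be sketched as follows; the SATD normalization and the satd_cost helper are assumptions, not the exact HM implementation:

import numpy as np
from scipy.linalg import hadamard

def sad_satd(orig: np.ndarray, pred: np.ndarray):
    """Formulas (7)-(9) for a square block whose side is a power of two.
    The SATD normalization used by HM may differ; this is only a sketch."""
    res = orig.astype(np.int64) - pred.astype(np.int64)
    sad = np.abs(res).sum()                        # SAD = sum of |TD(x, y)|
    h = hadamard(res.shape[0])
    satd = np.abs(h @ res @ h).sum()               # sum of |Hadamard-transformed residual|
    return sad, satd

def filter_mpm(mpm_modes, best_mode, satd_cost, j_satd_min, rho=1.3):
    """MPM optimization: best_mode is the mode with minimum SAD/SATD cost,
    satd_cost(m) returns the SATD cost of mode m (assumed helper), and
    AT = rho * J_SATDmin is the adaptive threshold of formula (10)."""
    at = rho * j_satd_min
    keep = []
    for m in mpm_modes:                            # the three MPMs from neighbouring PUs
        if m == best_mode:
            return [m]                             # terminate MPM process: single best mode to RDO
        if satd_cost(m) <= at and m not in keep:
            keep.append(m)                         # below the adaptive threshold: add to RDO list
    return keep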

Claims (3)

1. A fast video coding method based on a deep neural network is characterized in that the specific implementation comprises a CU partitioning module based on the deep neural network and a PU mode selection module based on neighborhood correlation; the rate distortion cost is calculated through PU mode selection when a CU block is coded in a frame, optimization is carried out through a PU mode selection module based on neighborhood correlation at the moment, and the number of candidate modes calculated by RDO is reduced through a prediction result of a lightweight HCT model; after PU mode selection is finished, the encoder can carry out CU block depth judgment to judge whether the CU block is divided or not, at the moment, a CU dividing module based on a depth neural network carries out optimization, a prediction result is obtained from an HCT model to judge whether division is stopped in advance, and otherwise, PU mode selection and CU block dividing judgment of the sub-CU block are continued to be carried out.
2. The method according to claim 1, wherein the deep neural network-based CU partitioning module is implemented as follows:
step (I), constructing a data set for network model training in an HEVC intra-frame mode: data sets are derived from YUV video sequences of various resolutions including: CIF (352 × 288),480p (832 × 480),720p (1280 × 720),1080p (1920 × 1080), WQXGA (2560 × 1600); coding the image in the data set by adopting an HEVC (high efficiency video coding) coder HM16.9 to obtain a CU block and positive and negative sample labels thereof; the data sets include a training set, a validation set, and a test set, each data set in turn being divided into four subsets according to four QPs (22,27,32, 37);
step (II), constructing a deep neural network for three CU blocks of 64 × 64,32 × 32 and 16 × 16 respectively to form a hierarchical convolutional network HCT structure, wherein the hierarchical convolutional network HCT is composed of ViT and CNN, training the HCT through a corresponding training set, determining and storing an HCT model through a verification set, and judging the generalization capability of the HCT model through a test set; the objective function of HCT model training is a cross entropy loss function:
Loss(output, target) = -output[target] + log(∑_{j=1}^{N} exp(output[j]))  (1)
wherein output is the output vector of the HCT model, target is the label value, and N is the length of the output vector;
step (III), the hierarchical convolutional network HCT consists of a convolution module, an Encoder module, a sequence pooling layer and a fully connected layer; first, the luminance component of a CU block is fed into the hierarchical convolutional network HCT, and a feature map with local feature information is output through the convolution module, which comprises a convolution layer and a max pooling layer, each activated by a linear rectification function to improve the nonlinearity of the model; the feature map is then flattened into one dimension and exchanged with the feature-map dimension, namely the flattening and transposing operation; suppose an input image x ∈ R^{C×H×W}, where C represents the number of input images, H is the height of the image and W is the width of the image; the output feature data x0 after the convolution module is as follows:
x0=Transpose(Flatten(MaxPool(Conv2d(x)))) (2)
the feature data x0 is then added to the position vector and sent to the Encoder module for global information extraction, the Encoder module having 7 layers, each layer consisting of a multi-head self-attention layer and a feed-forward convolution layer, with a layer normalization operation performed before the two sublayers; the feature data x0 first passes through the multi-head self-attention layer, and the output data is added to x0 to obtain new feature data x1; x1 then passes through the feed-forward convolution layer, and its output value is added to x1 to obtain feature data x2; the formulas are as follows:
x1=x0+MSL(LayerNorm(x0)) (3)
x2=x1+FFL(LayerNorm(x1)) (4)
finally, a classification vector is obtained through the sequence pooling layer, where sequence pooling applies a mapping transformation T: R^{b×n×d} → R^{b×d}, b represents the batch size, n represents the number of feature data items, and d represents the size of each feature data item; this operation transforms the feature data x2 output by the whole Encoder directly into a classification vector containing information about the respective portions of the input image, replacing the additional classification vector of ViT; finally, the classification vector outputs the classification result through a fully connected layer and softmax, and the final predicted value is the index of the maximum output value;
and (IV) training the HCT model by adopting a random gradient descent method, storing 12 HCT models with the highest accuracy of 3 CU blocks under 4 QPs, predicting the division results of 64 multiplied by 64,32 multiplied by 32 and 16 multiplied by 16 blocks from top to bottom by adopting an early termination mechanism for the trained HCT model, wherein the prediction results of the models are of two types: 0 represents no partition, 1 represents partition; when the prediction result of a certain type of block is 0, the quadtree division is not continuously performed downwards during encoding;
using a contrast value Thr between the two classification values; when Thr is smaller than the constant λ, we can check the CU block with the original coding procedure, so as to reduce the misjudgment of the CU partition result, thereby improving the coding performance and realizing the trade-off between coding performance and complexity; the formula is as follows:
Thr = |Softmax(output_i)[0] - Softmax(output_i)[1]|  (5)
wherein output_i is the two-class output vector of an i × i-sized block, and the constant λ is divided into three classes according to block size, with a size ratio of 4:2:1.
3. The method according to claim 2, wherein the PU mode selection module based on neighborhood correlation is implemented as follows:
step (1), obtaining a sample label value label ∈ [0,1,2] of each PU block during intra mode selection through HM coding, and obtaining a rule as follows: for a PU block of 64 × 64,32 × 32,16 × 16 size, the original length of the candidate list after RMD rough selection is 3, if the best mode of the PU block during mode selection is the first bit in the candidate list after RMD rough selection, label is 0, and the length of the candidate list after corresponding RMD rough selection becomes 1; if the best mode of the PU block during mode selection is located at the second bit in the candidate list after RMD rough selection, the label is 1, and the length of the candidate list after corresponding RMD rough selection becomes 2; otherwise, label is 2, and the length of the candidate list after corresponding to RMD rough selection is 3; for an 8 × 8, 4 × 4 PU block, since its candidate list is originally 8 long, we also divide it into three intervals to correspond to label 0,1,2, which are: if the best mode of the PU block after mode selection is located at the first or second bit in the candidate list after RMD rough selection, then label is 0, and the length of the candidate list after corresponding RMD rough selection becomes 2; when the best mode of the PU block during mode selection is located in the third or fourth bit of the candidate list after RMD rough selection, label is 1, and the length of the candidate list after RMD rough selection becomes 4; otherwise, label is 2, and the length of the candidate list after corresponding to the RMD rough selection is 8;
step (2), the data set of the PU mode selection module is also from the video sequence mentioned in the block division module, and PU block data with the size of 8 multiplied by 8 and 4 multiplied by 4 is added on the basis of the block division data set; the model of a 64 × 64,32 × 32,16 × 16 size PU block is similar to that of the block division module, but the number of layers of the Encoder module becomes 1; the models corresponding to the 8 × 8 and 4 × 4 PU blocks are also changed in the convolution module, namely, the dimension reduction operation is not carried out and the maximum pooling layer is removed to reduce the complexity of the models;
constructing a lightweight HCT model, namely a Light-HCT model, by utilizing a pytorch deep learning library, wherein the Light-HCT reduces the number of layers of the Encoder module from the original 7 layers to 1 layer; the training of the model adopts a mean square error loss function to carry out regression training:
Loss = (1/N) ∑_{k=1}^{N} ∑_{j=1}^{3} (output_{k,j} - value_{k,j})²  (6)
wherein output is the model output vector of length 3, value is the true value vector obtained by comparing output with label, and N is the number of input images during each training; the true value vector acquisition rule is as follows: assuming output = [x, y, z], in the case label = 0, if the maximum value in output occurs at index 0, then value = output; if the maximum occurs at index 1, then value = [y, x, z]; if the maximum occurs at index 2, then value = [z, y, x]; similarly, in the case label = 1, if the maximum value in output occurs at index 0, then value = [y, x, z]; if the maximum occurs at index 1, then value = output; if the maximum occurs at index 2, then value = [x, z, y]; in the case label = 2, if the maximum value in output occurs at index 0, then value = [z, y, x]; if the maximum occurs at index 1, then value = [x, z, y]; if the maximum occurs at index 2, then value = output.
CN202111599851.9A 2021-12-24 2021-12-24 Rapid video coding method based on deep neural network Pending CN114286093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111599851.9A CN114286093A (en) 2021-12-24 2021-12-24 Rapid video coding method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111599851.9A CN114286093A (en) 2021-12-24 2021-12-24 Rapid video coding method based on deep neural network

Publications (1)

Publication Number Publication Date
CN114286093A true CN114286093A (en) 2022-04-05

Family

ID=80875038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111599851.9A Pending CN114286093A (en) 2021-12-24 2021-12-24 Rapid video coding method based on deep neural network

Country Status (1)

Country Link
CN (1) CN114286093A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114513660A (en) * 2022-04-19 2022-05-17 宁波康达凯能医疗科技有限公司 Interframe image mode decision method based on convolutional neural network
WO2024001886A1 (en) * 2022-06-30 2024-01-04 深圳市中兴微电子技术有限公司 Coding unit division method, electronic device and computer readable storage medium
WO2024027616A1 (en) * 2022-08-01 2024-02-08 深圳市中兴微电子技术有限公司 Intra-frame prediction method and apparatus, computer device, and readable medium
CN115118977A (en) * 2022-08-29 2022-09-27 华中科技大学 Intra-frame prediction encoding method, system, and medium for 360-degree video
CN115118977B (en) * 2022-08-29 2022-11-04 华中科技大学 Intra-frame prediction encoding method, system, and medium for 360-degree video
US12015767B2 (en) 2022-08-29 2024-06-18 Huazhong University Of Science And Technology Intra-frame predictive coding method and system for 360-degree video and medium
CN115170894A (en) * 2022-09-05 2022-10-11 深圳比特微电子科技有限公司 Smoke and fire detection method and device
CN116229095A (en) * 2022-12-30 2023-06-06 北京百度网讯科技有限公司 Model training method, visual task processing method, device and equipment
CN116600107A (en) * 2023-07-20 2023-08-15 华侨大学 HEVC-SCC quick coding method and device based on IPMS-CNN and spatial neighboring CU coding modes
CN116600107B (en) * 2023-07-20 2023-11-21 华侨大学 HEVC-SCC quick coding method and device based on IPMS-CNN and spatial neighboring CU coding modes
CN116634147A (en) * 2023-07-25 2023-08-22 华侨大学 HEVC-SCC intra-frame CU rapid partitioning coding method and device based on multi-scale feature fusion
CN116634147B (en) * 2023-07-25 2023-10-31 华侨大学 HEVC-SCC intra-frame CU rapid partitioning coding method and device based on multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN114286093A (en) Rapid video coding method based on deep neural network
CN110087087B (en) VVC inter-frame coding unit prediction mode early decision and block division early termination method
CN108924558B (en) Video predictive coding method based on neural network
CN115914649B (en) Data transmission method and system for medical video
CN111355956B (en) Deep learning-based rate distortion optimization rapid decision system and method in HEVC intra-frame coding
US11956447B2 (en) Using rate distortion cost as a loss function for deep learning
CN111479110B (en) Fast affine motion estimation method for H.266/VVC
TWI806199B (en) Method for signaling of feature map information, device and computer program
CN112291562B (en) Fast CU partition and intra mode decision method for H.266/VVC
US20230353764A1 (en) Method and apparatus for decoding with signaling of feature map data
KR20230072487A (en) Decoding with signaling of segmentation information
CN107690069B (en) Data-driven cascade video coding method
CN111711815B (en) Fast VVC intra-frame prediction method based on integrated learning and probability model
CN115941943A (en) HEVC video coding method
US20230110503A1 (en) Method, an apparatus and a computer program product for video encoding and video decoding
CN116896638A (en) Data compression coding technology for transmission operation detection scene
CN114143536B (en) Video coding method of SHVC (scalable video coding) spatial scalable frame
WO2023122132A2 (en) Video and feature coding for multi-task machine learning
CN113822801A (en) Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network
CN113225552B (en) Intelligent rapid interframe coding method
CN117692652B (en) Visible light and infrared video fusion coding method based on deep learning
CN117640931A (en) VVC intra-frame coding rapid block division method based on graph neural network
US20240185572A1 (en) Systems and methods for joint optimization training and encoder side downsampling
Kaji et al. Enhancement of CNN-based Probability Modeling by Locally Trained Adaptive Prediction for Efficient Lossless Image Coding
Jiang et al. Encoder-Decoder-Based Intra-Frame Block Partitioning Decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination