CN114286093A - Rapid video coding method based on deep neural network - Google Patents

Rapid video coding method based on deep neural network

Info

Publication number
CN114286093A
Authority
CN
China
Prior art keywords
block
value
hct
model
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111599851.9A
Other languages
Chinese (zh)
Inventor
陆宇
诸承广
殷海兵
周洋
黄晓峰
杨萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111599851.9A priority Critical patent/CN114286093A/en
Publication of CN114286093A publication Critical patent/CN114286093A/en
Pending legal-status Critical Current

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a rapid video coding method based on a deep neural network. The method comprises a CU partitioning module based on a deep neural network and a PU mode selection module based on neighborhood correlation. When a CU block is intra-coded, its rate-distortion cost is calculated through PU mode selection; at this point the neighborhood-correlation-based PU mode selection module performs the optimization, and the number of candidate modes entering RDO calculation is reduced using the prediction result of a lightweight HCT model. After PU mode selection is finished, the encoder makes a CU depth decision to determine whether the CU block is divided; at this point the deep-neural-network-based CU partitioning module performs the optimization, obtaining a prediction result from the HCT model to decide whether division can be terminated in advance. Otherwise, the sub-CU blocks continue to be divided downwards, and PU mode selection and CU division decisions continue recursively. The invention reduces the complexity of recursive CU partitioning, simplifies the intra-frame prediction mode selection process, and effectively improves the time efficiency of HEVC coding.

Description

Rapid video coding method based on deep neural network
Technical Field
The invention belongs to the technical field of High Efficiency Video Coding (HEVC), and particularly relates to a low-complexity fast HEVC intra-frame video coding method.
Background
The latest video coding standard, High Efficiency Video Coding (HEVC), finalized by the Joint Collaborative Team on Video Coding (JCT-VC) in 2012, significantly improves coding performance. Compared with the previous video coding standard H.264/AVC, HEVC uses several well-designed methods to save approximately 50% of the bit rate at the same video compression quality. Specifically, for intra coding, a Coding Unit (CU) based on a quad-tree structure is recursively divided into blocks from 64 × 64 down to 8 × 8; in addition, a Prediction Unit (PU) is allowed at most 35 intra prediction modes, comprising DC, Planar, and 33 angular prediction modes. Both techniques improve coding performance, but at the cost of greatly increased coding complexity, making it difficult to meet the requirements of real-time applications. Therefore, it is necessary to research fast video encoding methods.
Many fast HEVC intra coding algorithms have been proposed so far, and they can be roughly classified into two main categories: fast partitioning of CU blocks and fast selection of intra modes. Since CU partitioning is a recursive, top-down rate-distortion-optimized process with flexible block sizes, its coding complexity is very high, and many fast coding methods try to predict the CU partitioning pattern in advance to avoid a lengthy recursive RDO search. In terms of intra mode selection, the current HEVC encoder employs a three-step algorithm to accelerate the intra mode decision, and optimization of intra mode selection currently focuses mainly on simplifying the RMD process or the RDO calculation. Existing optimization algorithms use traditional heuristic methods to manually extract texture features of CU blocks or exploit correlation between adjacent CUs; machine-learning methods such as decision trees, support vector machines, and Bayesian decisions have also been applied to CU depth decisions, achieving a certain optimization effect. Convolutional neural networks, thanks to their excellent local feature extraction capability, have shown outstanding image classification performance in computer vision, and in recent years many network structures have been designed that automatically extract texture and object features from coding blocks. However, prediction accuracy remains a problem: high accuracy means few wrongly predicted CUs and no excessive, unnecessary bit-rate increase, but it usually requires an overly complex network model that brings extra time overhead, so the network structure design must balance complexity against prediction accuracy.
In 2017 the Google team proposed the Transformer neural network model based on the Self-Attention mechanism. The model adopts self-attention instead of the sequential structure of an RNN, so it can be trained in parallel and can capture global information; its effect on sequence processing is remarkable, and it quickly became the preferred model in Natural Language Processing (NLP) and gradually expanded into computer vision. In 2021 the Google team published a paper entitled "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" at the International Conference on Learning Representations (ICLR); the Vision Transformer (ViT) network model described therein introduced the Transformer structure into computer vision classification tasks for the first time and achieved performance superior to CNN models. The ViT model migrates the encoder module of the Transformer for direct use and adds a classification vector for outputting classification results. Compared with a CNN (convolutional neural network), the ViT network can integrate information of the whole image, even low-level information, and thus achieves a better classification effect, which a CNN cannot; ViT shows great potential and has caused research on Transformers and attention mechanisms to grow explosively.
Disclosure of Invention
The invention aims to provide a fast video coding method based on a deep neural network, addressing the defects of high complexity and long overall coding time in existing HEVC video coding. By exploiting the strong feature extraction and learning capability of a deep neural network based on convolution and the self-attention mechanism, fast coding is realized while the coding performance is preserved.
The invention provides a rapid video coding method based on a deep neural network, which specifically comprises a CU (Coding Unit) partitioning module based on the deep neural network and a PU (Prediction Unit) mode selection module based on neighborhood correlation.
The CU partitioning module based on the deep neural network predicts the partitioning result of each CU block from top to bottom with the neural network and optimizes the prediction result with a threshold to reduce erroneous predictions; the encoder finally judges, according to the predicted partitioning result, whether the current CU block terminates partitioning in advance during encoding.
The PU mode selection module based on neighborhood correlation first predicts, with a neural network, the position of the best mode in the candidate mode list after RMD rough selection; by discarding the modes after that position it prevents redundant modes from entering RDO calculation, and it further optimizes the Most Probable Modes (MPM) to reduce the number of MPMs in the candidate list, thereby reducing the mode calculation amount and the PU mode selection complexity.
By exploiting the neighborhood correlation of CU depth and PU prediction modes within the video image, the method reduces the complexity of recursive CU partitioning, simplifies the intra-frame prediction mode selection process, and effectively improves the coding efficiency of HEVC.
The technical scheme adopted by the invention to solve the technical problem is as follows:
(I) CU partitioning module based on the deep neural network:
Step (I), constructing a data set for network model training in HEVC intra mode: the data set is derived from YUV video sequences of various resolutions, including CIF (352 × 288), 480p (832 × 480), 720p (1280 × 720), 1080p (1920 × 1080), and WQXGA (2560 × 1600). The samples of the data set consist of the luma component of a CU block and the corresponding sample label, where the sample labels are obtained by encoding the luma component with the HEVC reference software HM16.9. The data set includes a training set, a validation set, and a test set, each of which is further divided into four subsets according to four QPs (22, 27, 32, 37).
Step (II), constructing a deep neural network for each of the three CU block sizes 64 × 64, 32 × 32, and 16 × 16 to form a hierarchical convolutional network (HCT) structure, where HCT combines ViT and CNN. The HCT is trained on the corresponding training set, the HCT model is determined and saved using the validation set, and finally the generalization ability of the HCT model is judged on the test set. The objective function of HCT model training is the cross-entropy loss function (CrossEntropyLoss):
Loss(output, target) = -output[target] + log(∑_{j=1}^{L} exp(output[j]))  (1)
wherein output is the output vector of the HCT model, target is the sample label value, and L is the length of the output vector.
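For illustration only, this objective corresponds to PyTorch's built-in CrossEntropyLoss; the tensor values below are hypothetical examples, not data from the invention:

import torch
import torch.nn as nn

# Hypothetical batch of four two-class HCT output vectors (L = 2) and their labels
output = torch.randn(4, 2)            # HCT model outputs, shape [batch, L]
target = torch.tensor([0, 1, 1, 0])   # sample label values (0 = no split, 1 = split)

criterion = nn.CrossEntropyLoss()     # implements formula (1)
loss = criterion(output, target)
print(loss.item())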
Step (III), the hierarchical convolutional network HCT consists of a convolution module, a Transformer encoder module, a Sequence Pooling layer, and a fully connected layer. First, the luminance component of a CU block is fed into the HCT, and a feature map carrying local feature information is output by the convolution module; the convolution module comprises a convolution layer and a max pooling layer (MaxPool), each activated by a linear rectification function (ReLU) to improve the nonlinearity of the model. The feature map is then flattened (Flatten) into one dimension and transposed. Suppose an input image x ∈ R^{C×H×W}, where C represents the number of input images, H is the image height, and W is the image width; the output of the convolution module is obtained as follows:
x0=Transpose(Flatten(MaxPool(Conv2d(x)))) (2)
The feature data x0 is then added to a learnable position vector and sent to the Transformer encoder module for global information extraction. The encoder module has 7 layers in total, each consisting of a Multi-headed Self-attention Layer (MSL) and a Feed-Forward Convolution Layer (FCL), and a Layer Normalization (LN) operation is performed before both sublayers to improve the robustness and generalization capability of the model. The feature data x0 first passes through the multi-head self-attention layer, and its output is added to x0 to obtain new feature data x1; x1 then passes through the feed-forward convolution layer, whose output is added to x1 to obtain feature data x2. The formulas are as follows:
x1=x0+MSL(LN(x0)) (3)
x2=x1+FCL(LN(x1)) (4)
Finally, a classification vector is obtained through the Sequence Pooling layer. Sequence pooling applies a mapping T: R^{b×n×d} → R^{b×d}, where b denotes the batch size (batch_size), n denotes the number of feature data items, and d denotes the size of each feature data item. This operation transforms the feature data x2 output by the whole Transformer encoder module directly into a classification vector containing information about the various parts of the input image, replacing the additional classification vector used in ViT. Finally, the classification vector passes through the fully connected layer and Softmax to output the classification result, and the final predicted value is the index (0 or 1) of the maximum output value.
Step (IV), the HCT models are trained with stochastic gradient descent (SGD), and the 12 most accurate HCT models (3 CU block sizes under 4 QPs) are saved. The trained HCT models predict the division results of 64 × 64, 32 × 32, and 16 × 16 blocks from top to bottom with an early termination mechanism; the model prediction has two classes: 0 represents no partition and 1 represents partition. When the prediction result for a block is 0, the quadtree division is not continued downwards during encoding; stopping the division in advance avoids redundant block division operations and reduces the coding time complexity.
In order to reduce the extra coding performance loss caused by the error prediction of the model, the module adopts threshold optimization to improve the coding performance. In the invention, the Similarity (SD) between two classification vectors is compared with a threshold lambda, and when the SD is smaller than the threshold lambda, the CU blocks can be checked by adopting an original coding mode, so that the wrong judgment of CU division results is reduced, the coding performance is improved, the balance between the coding performance and the complexity is realized, and the calculation formula of the similarity SD is as follows:
SD = |Softmax(output_i)[0] - Softmax(output_i)[1]|  (5)
where output_i is the two-class output vector of an i × i-sized block. We divide the threshold λ into three classes according to block size, with a size ratio of 4:2:1.
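For illustration, the threshold check described above can be sketched as follows; the exact form of SD and the numerical λ values are assumptions made for the example, not the disclosed values:

import torch
import torch.nn.functional as F

# Assumed threshold values for the three block sizes (ratio 4:2:1); illustrative only.
LAMBDA = {64: 0.4, 32: 0.2, 16: 0.1}

def cu_split_decision(output_i: torch.Tensor, block_size: int) -> str:
    """Return 'split', 'no_split', or 'fallback' (re-check with the original RDO)."""
    p = F.softmax(output_i, dim=-1)        # soft classification values in [0, 1]
    sd = (p[0] - p[1]).abs().item()        # similarity SD between the two classes
    if sd < LAMBDA[block_size]:
        return "fallback"                  # model is uncertain: keep original coding
    return "split" if p.argmax().item() == 1 else "no_split"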
(II) PU mode selection module based on neighborhood correlation:
step (1), obtaining a sample label value label ∈ [0,1,2] of each PU block during intra mode selection through HM coding, and obtaining a rule as follows: for PU blocks of 64 × 64,32 × 32, and 16 × 16 sizes, the original length of the candidate list after RMD rough selection is 3, and if the best mode of the PU block during mode selection is the first bit in the candidate list after RMD rough selection, label is 0, and the length of the candidate list after RMD rough selection becomes 1; if the best mode of the PU block during mode selection is located at the second bit in the candidate list after RMD rough selection, the label is 1, and the length of the candidate list after corresponding RMD rough selection becomes 2; in other cases, label is 2, and the candidate list after RMD roughing has a length of 3. For an 8 × 8, 4 × 4 PU block, since its candidate list is originally 8 long, we also divide it into three intervals to correspond to label 0,1,2, which are: if the best mode of the PU block after mode selection is located at the first or second bit in the candidate list after RMD rough selection, then label is 0, and the length of the candidate list after corresponding RMD rough selection becomes 2; when the best mode of the PU block during mode selection is located in the third or fourth bit of the candidate list after RMD rough selection, label is 1, and the length of the candidate list after RMD rough selection becomes 4; in other cases, label is 2, and the candidate list after RMD roughing is 8.
Step (2), the data set of the PU mode selection module also comes from the video sequences mentioned in the deep-neural-network-based CU partitioning module, with PU block data of size 8 × 8 and 4 × 4 added on top of the block partitioning data set. The models for PU blocks of size 64 × 64, 32 × 32, and 16 × 16 are similar to those of the block partitioning module, but the number of layers of the Transformer encoder module becomes 1. The models corresponding to 8 × 8 and 4 × 4 PU blocks additionally simplify the convolution module on top of the single encoder layer, i.e., no dimension-reduction operation is performed and the max pooling layer is removed, to reduce model complexity. In addition, the model is trained by regression using the mean squared error loss function (MSELoss):
Loss = (1/N) ∑_{k=1}^{N} ∑_{j=1}^{3} (output_{k,j} - value_{k,j})²  (6)
wherein output is a model output value vector, the length is 3, value is a true value vector obtained after the output is compared with label, and N is the number of input images during each training.
The true value vector is obtained as follows. Let output = [x, y, z]. In the case label = 0: if the maximum value in output is at index 0, then value = output; if the maximum is at index 1, then value = [y, x, z]; if the maximum is at index 2, then value = [z, y, x]. Similarly, in the case label = 1: if the maximum value in output is at index 0, then value = [y, x, z]; if the maximum is at index 1, then value = output; if the maximum is at index 2, then value = [x, z, y]. In the case label = 2: if the maximum value in output is at index 0, then value = [z, y, x]; if the maximum is at index 1, then value = [x, z, y]; if the maximum is at index 2, then value = output.
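For illustration, this exchange rule can be written as a small Python helper (function and variable names are chosen for the example only):

def make_value_vector(output, label):
    """Build the true value vector by swapping elements of the model output so that
    its maximum lands at the position implied by the label (0, 1 or 2)."""
    x, y, z = output
    pos_max = output.index(max(output))      # where the model's maximum actually is
    swaps = {                                # (label, pos_max) -> value vector
        (0, 0): [x, y, z], (0, 1): [y, x, z], (0, 2): [z, y, x],
        (1, 0): [y, x, z], (1, 1): [x, y, z], (1, 2): [x, z, y],
        (2, 0): [z, y, x], (2, 1): [x, z, y], (2, 2): [x, y, z],
    }
    return swaps[(label, pos_max)]

# Example: make_value_vector([0.1, 0.7, 0.2], label=0) returns [0.7, 0.1, 0.2].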
The invention has the following beneficial effects:
(1) the invention adopts a batch prediction method for the video sequence, can obtain the prediction result of the CU block or the PU block only by running once, and can obviously reduce the prediction time of the neural network. The invention adopts a method of terminating CU block division in advance, reduces the number of modes of the PU block entering RDO calculation after RMD rough selection by utilizing a prediction result, optimizes the MPM mode to reduce the number of modes in a candidate list, and realizes quick coding.
(2) Compared with the CNN structure in the prior art, the HCT model not only can automatically extract the relevant local features of the image block, but also has the capability of extracting global information, so that the prediction accuracy and generalization capability of the model are improved, the computational complexity of the model is reduced due to the parallelization computational characteristics, and the consumed memory resources are greatly reduced. The data set adopted by the invention is far smaller than the data set required by the CNN network in the prior art, so that the training time of the model is greatly reduced, the code rate is not increased much under the condition of ensuring that the time complexity is greatly reduced, and the practicability is stronger.
(3) The invention was simulated on video sequences of five different resolution classes: A (2560 × 1600), B (1920 × 1080), C (832 × 480), D (416 × 240), and E (1280 × 720). Experimental results show that the average coding time saving reaches about 70%, while the bit rate increases by only about 2%.
Drawings
FIG. 1 is a flow chart of HEVC original intra coding
FIG. 2 is a general algorithm diagram of the present invention;
FIG. 3 is a flow chart of a CU partition module method of the present invention;
FIG. 4 is a schematic of a data set according to the present invention;
FIG. 5 is a schematic diagram of the HCT model of the present invention;
FIG. 6 is a flow chart of a PU mode selection module method according to the present invention;
FIG. 7 is a diagram illustrating the correspondence between PU mode selection module dataset tags and the best mode locations after RMD rough selection according to the present invention;
FIG. 8 is a schematic diagram of a corresponding relationship between a tag of a dataset and a true value vector during the PU mode selection module model training of the present invention;
Detailed Description
As shown in fig. 1, in an HEVC original intra coding process, a CU block with a size of 64 × 64 is first subjected to PU mode selection to calculate rate-distortion cost, then divided into 4 CU blocks with a size of 32 × 32 by a quad-tree, and these CU blocks are further subjected to PU mode selection to calculate rate-distortion cost, so that the CU blocks are divided down to a size of 8 × 8, and finally rate-distortion costs of the CU blocks and 4 sub-CU blocks of the CU blocks are compared from bottom to top, so as to obtain a division result. In the deep learning method in the prior art, a convolutional neural network is mostly adopted to automatically extract local features in a CU or a PU block for training so as to optimize the downward division operation of the CU block or the rate-distortion calculation of PU mode selection, and a network model usually needs a complex network structure or a large-scale data set for training to achieve good prediction accuracy and generalization capability.
Core improvements of the invention include: 1. A hierarchical convolutional network (HCT) is proposed; trained on only a small data set, the model can reach a prediction accuracy similar to, or even better than, a CNN model trained on a large data set, and CU block division results are output rapidly through batch prediction to reduce intra-mode complexity. 2. A lightweight variant of the HCT structure is used for PU mode selection; the initial 3 (for 64 × 64, 32 × 32, and 16 × 16 PU blocks) or 8 (for 8 × 8 and 4 × 4 PU blocks) candidate modes are divided into three classes of [1, 2, 3] or [2, 4, 8] for prediction, the MPM is optimized, and the time complexity of intra coding is reduced by decreasing the number of modes entering RDO calculation.
The invention is further described below with reference to the drawings and the examples.
The general algorithm framework of the deep-neural-network-based fast video coding method is shown in FIG. 2 and comprises a CU partitioning module based on a deep neural network and a PU mode selection module based on neighborhood correlation. When a CU block is intra-coded, its rate-distortion cost is calculated through PU mode selection; the PU mode selection module performs the optimization here, and the number of candidate modes entering RDO calculation is reduced using the prediction result of the lightweight HCT model. After PU mode selection is finished, the encoder makes a CU depth decision to determine whether the CU block is divided; the neural-network-based CU partitioning module performs the optimization here, obtaining a prediction result from the HCT model to decide whether division can be terminated in advance; otherwise the block is divided downwards and PU mode selection and CU division decisions continue for the sub-CU blocks. In the simulation, the HEVC official reference software HM16.9 is used for compression coding, the test conditions follow the JCT-VC common test conditions (JCTVC-R1015), and the HM16.9 all-intra coding configuration file encoder_intra_main.cfg is used. The flow chart of the deep-neural-network-based CU block partitioning module is shown in FIG. 3; the specific steps are as follows:
and (I) constructing a database required by HCT model training, wherein the related data source is shown in FIG. 4. First, the luminance component of one frame of image is extracted every 50 frames from the video sequence shown in fig. 4, and encoding is performed to obtain two classification labels (labels) corresponding to three CU blocks of 64 × 64,32 × 32, and 16 × 16 in the picture, where the label value is 0 (representing no division) or 1 (representing division). The resulting data set sample size is 9, 865, 968, and divided into 12 sub-data sets according to their QP values and CU block sizes for training of the network model.
Step (II), constructing the HCT model used for training on the database and for CU partition prediction. As shown in FIG. 5, the input of the HCT model is the luminance component of a CU block, and each sub-data set corresponds to one network model. Each CU block is fed into the network and passes through a convolution module, a Transformer encoder module, a Sequence Pooling layer, and a fully connected layer; a two-class classification vector is finally output, and during prediction the index of the maximum value in the vector is taken as the block division prediction result. The HCT model combines CNN and Vision Transformer networks; the configuration and function of each part are described as follows:
1) Convolution module. This module extracts the local features of the CU block and improves the prediction accuracy of the model; the formula is as follows:
x0=Transpose(Flatten(MaxPool(Conv2d(x)))) (1)
Since the input pictures are processed in batches, T input pictures of size [1, W, W] (W ∈ {64, 32, 16}) are selected. A convolution operation is performed on the CU luminance component x with a 3 × 3 kernel, a stride of 2, and padding of 1, yielding 128 feature maps of size W/2 × W/2; after max pooling, the dimensions become [T, 128, W/4, W/4]. Both the convolution and max pooling layers are activated by a linear rectification function (ReLU) to improve model nonlinearity. The feature maps are then flattened into one dimension and transposed to x0 = [T, W/4 × W/4, 128]. Finally, a learnable position vector is added; this idea comes from the ViT network structure and lets the model learn the correlation among different positions, improving prediction accuracy.
2) Transformer encoder module. The encoder module in the invention has 7 layers, each consisting of a Multi-headed Self-attention Layer (MSL) and a Feed-Forward Convolution Layer (FCL). The module processes the feature-map data in parallel and captures global information of the image. The multi-head self-attention layer comes from the ViT network structure. The feed-forward layer in the original ViT encoder module is actually a combination of two fully connected layers performing a simple dimension-raising and dimension-reducing operation, and too many training parameters would make the computation excessive; the invention therefore replaces the feed-forward layer with two convolutional layers with 1 × 1 convolution kernels to optimize the model parameters. The input and output of the module remain unchanged at [T, W/4 × W/4, 128]; the related formulas are as follows:
x1=x0+MSL(LN(x0)) (2)
x2=x1+FCL(LN(x1)) (3)
The feature data x0 output by the convolution module first passes through the multi-head self-attention layer, and its output is added to x0 to obtain new feature data x1; x1 then passes through the feed-forward convolution layer, whose output is added to x1 to obtain feature data x2. A Layer Normalization (LN) operation is added before the multi-head self-attention layer and the feed-forward convolution layer, which stabilizes the model and provides regularization.
3) Sequence pooling layer. The data output by the Transformer encoder module contains both local and global information of the input image, unlike the additionally appended classification token in ViT. First, the encoder output x = [T, W/4 × W/4, 128] is reduced in dimension to [T, W/4 × W/4, 1], normalized by Softmax, and then transposed to y = [T, 1, W/4 × W/4]; multiplying y and x gives a vector of dimension [T, 1, 128]. Finally, the classification vector obtained by this dimension reduction passes through a fully connected layer to give the two-class output: [T, 128] → [T, 2]. Each CU block thus outputs two classification values representing the probabilities of non-division and division, and the index (0 or 1) of the maximum value expresses non-division or division.
4) Other layers. A dropout operation with a random discarding probability of 10% is added to the multi-head self-attention layer and the feed-forward convolution layer of the Transformer encoder module to prevent overfitting and improve the generalization capability of the network.
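Putting parts 1) to 4) together, a condensed PyTorch sketch of the HCT forward pass is given below for illustration; hyper-parameters not stated above (such as the number of attention heads) are assumptions, and the sketch is not the exact disclosed implementation:

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: pre-LN multi-head self-attention (MSL) plus a feed-forward
    convolution layer (FCL) of two 1x1 convolutions, with 10% dropout.
    The head count (4) is an assumption."""
    def __init__(self, dim=128, heads=4, p=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msl = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.fcl = nn.Sequential(nn.Conv1d(dim, dim, 1), nn.ReLU(), nn.Conv1d(dim, dim, 1))
        self.drop = nn.Dropout(p)

    def forward(self, x0):                                    # x0: [T, n, dim]
        h = self.ln1(x0)
        x1 = x0 + self.drop(self.msl(h, h, h, need_weights=False)[0])   # formula (2)
        h = self.ln2(x1).transpose(1, 2)                      # Conv1d expects [T, dim, n]
        x2 = x1 + self.drop(self.fcl(h).transpose(1, 2))      # formula (3)
        return x2

class HCT(nn.Module):
    """Sketch of the HCT model for one CU size W (64, 32 or 16)."""
    def __init__(self, W, dim=128, depth=7, num_classes=2):
        super().__init__()
        # 1) Convolution module: 3x3 conv, stride 2, padding 1 -> ReLU -> 2x2 max pool -> ReLU
        self.conv = nn.Sequential(
            nn.Conv2d(1, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.ReLU(),
        )
        n = (W // 4) * (W // 4)                               # number of feature tokens
        self.pos = nn.Parameter(torch.zeros(1, n, dim))       # learnable position vector
        # 2) Transformer encoder module with 7 layers
        self.encoder = nn.ModuleList([EncoderLayer(dim) for _ in range(depth)])
        # 3) Sequence pooling projection and the fully connected layer
        self.pool_proj = nn.Linear(dim, 1)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):                                     # x: [T, 1, W, W] CU luma blocks
        x0 = self.conv(x).flatten(2).transpose(1, 2) + self.pos   # [T, n, dim], formula (1)
        for layer in self.encoder:
            x0 = layer(x0)                                    # global information extraction
        y = torch.softmax(self.pool_proj(x0), dim=1).transpose(1, 2)  # [T, 1, n]
        z = torch.bmm(y, x0).squeeze(1)                       # sequence pooling: [T, dim]
        return self.fc(z)                                     # [T, 2] two-class output vector

# Usage sketch: pred = HCT(W=32)(torch.randn(8, 1, 32, 32)).argmax(dim=1)  # 0 = no split, 1 = split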
Step (III), with the database and the network models in place, the models can be trained. Under 3 CU block sizes and 4 QPs, the block partitioning module of the invention needs 12 HCT models in total. The network models are built and trained with the PyTorch deep learning library, and the required loss function is the cross-entropy loss function:
Loss(output, target) = -output[target] + log(∑_{j=1}^{L} exp(output[j]))  (4)
where output is the output vector of the network model, target is the sample label value, and L is the output vector length. After each pass over the training set (one iteration), the trained model parameters are validated for accuracy on the validation set to decide whether they are saved; the number of iterations is fixed at 100 and the batch size is 64, and finally the network model with the highest accuracy on the test set is used for prediction.
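For illustration, the training procedure described above can be sketched as follows; the learning rate and momentum are assumptions, since only the iteration count and batch size are stated:

import torch
from torch import nn, optim
from torch.utils.data import DataLoader

@torch.no_grad()
def accuracy(model, dataset, batch_size=64):
    model.eval()
    correct = total = 0
    for luma, target in DataLoader(dataset, batch_size=batch_size):
        correct += (model(luma).argmax(dim=1) == target).sum().item()
        total += target.numel()
    return correct / total

def train_hct(model, train_set, val_set, epochs=100, batch_size=64, lr=0.01):
    """Train one of the 12 HCT models with SGD; lr and momentum are illustrative."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    best_acc, best_state = -1.0, None
    for _ in range(epochs):                         # 100 iterations over the training set
        model.train()
        for luma, target in loader:                 # luma: CU luminance components
            optimizer.zero_grad()
            criterion(model(luma), target).backward()
            optimizer.step()
        acc = accuracy(model, val_set)              # validate after each iteration
        if acc > best_acc:                          # keep only the most accurate parameters
            best_acc = acc
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model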
Step (IV), after the models are trained they can be used for coding; the HEVC reference software used by the invention is HM16.9. When the encoder starts coding, the HCT network models first predict the three CU block sizes of the frame to be encoded in a batch, and the obtained prediction results are then used in the quadtree division decision of CU blocks in intra mode: if the prediction result of the current CU block is 0, the division is terminated in advance; otherwise, the quadtree division decision continues downwards.
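For illustration, the early-termination logic during encoding can be sketched as follows; the CU type, the hct_pred lookup and the pu_mode_selection callback are hypothetical stand-ins for the corresponding HM operations:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CU:
    x: int
    y: int
    size: int

def quad_split(cu: CU) -> List[CU]:
    h = cu.size // 2
    return [CU(cu.x + dx, cu.y + dy, h) for dy in (0, h) for dx in (0, h)]

def encode_cu(cu: CU, hct_pred: Callable[[CU], int], pu_mode_selection: Callable[[CU], None]):
    """hct_pred(cu) returns the batch-predicted HCT result for 64x64/32x32/16x16 blocks
    (0 = no partition, 1 = partition)."""
    pu_mode_selection(cu)                       # PU mode selection / RD cost of this CU
    if cu.size > 8 and hct_pred(cu) == 1:       # prediction 0: terminate division in advance
        for sub in quad_split(cu):              # otherwise continue the quadtree downwards
            encode_cu(sub, hct_pred, pu_mode_selection)

# Usage sketch: encode_cu(CU(0, 0, 64), lambda c: 1 if c.size > 16 else 0, lambda c: None)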
In order to reduce the extra coding-performance loss caused by wrong model predictions, this module also adds threshold optimization to improve coding performance. Since the output of the model is a two-class vector, we transform it with the Softmax normalization operation into soft classification values between [0, 1], where the values at indices 0 and 1 represent the probabilities that the model predicts non-partitioning and partitioning, respectively. When the distance between the two is larger, i.e., one probability grows while the other shrinks, the model distinguishes the two classes more clearly and is less likely to mispredict; conversely, the closer the two probabilities are to being equal, the more ambiguous and uncertain the model's judgment of the two classes becomes, and wrong predictions occur more easily. Therefore the Similarity (SD) between the soft classification values is compared with the threshold λ, and when SD is smaller than λ the CU block is checked with the original coding procedure; this reduces wrong CU division decisions, improves coding performance, and realizes a trade-off between coding performance and complexity. The calculation formula of the similarity SD is as follows:
SD = |Softmax(output_i)[0] - Softmax(output_i)[1]|  (5)
where output_i is the two-class output vector of an i × i-sized block and T is the number of input images. It should be noted that the achievable complexity reduction differs with CU block content, and larger blocks are more likely to be partitioned, so the threshold is set differently for different block sizes. Here we divide the constant λ into three classes according to block size, with λ_{64×64} = 2λ_{32×32} = 4λ_{16×16}; experiments show that this combination of constants achieves a good coding-performance improvement. Through this optimization, the video coding time complexity can be effectively reduced while the bit rate increases only slightly.
The flow chart of the PU mode selection module based on neighborhood correlation is shown in FIG. 6. The module uses a lightweight improved network model, Light-HCT, for prediction; the specific steps are as follows:
and (1) firstly, constructing a data set required by the module. The initial candidate pattern list length in PU mode selection is 3 (corresponding to PU block sizes of 64 × 64,32 × 32,16 × 16) or 8 (corresponding to PU block sizes of 8 × 8 and 4 × 4), and it is observed that the best mode of the PU block is not fixed in the candidate pattern list after RMD coarse selection, and the best mode is a redundant mode after the best mode, which increases the RDO calculation amount. The present invention reduces the amount of subsequent RDO computations by eliminating these redundant modes by predicting the position of the best mode in the RMD-roughed candidate list, with the data set selected for the PU mode covering PU blocks from 64 x 64 to 4 x 4. The label value of each PU block is label ∈ [0,1,2], the predicted best mode position is also classified into three classes according to the label category, and the three classes are in one-to-one correspondence with the label value, the correspondence is shown in fig. 7, and the rule of the correspondence is as follows:
First, the initial candidate list length of PU blocks of size 64 × 64, 32 × 32, and 16 × 16 is 3. If the best mode of the current PU block is at the 1st position in the mode list after RMD rough selection, then label = 0; if the best mode is at the 2nd position, then label = 1; in other cases label = 2.
Secondly, since the initial candidate list length of 8 × 8 and 4 × 4 PU blocks is 8, position intervals are mapped to the label value: if the best mode of the current PU block is at the 1st or 2nd position in the mode list after RMD rough selection, then label = 0; if it is at the 3rd or 4th position, then label = 1; in other cases label = 2.
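For illustration, the correspondence above (including the shortened candidate-list lengths of FIG. 7) can be expressed as a small helper; names are chosen for the example only:

def pu_label_and_list_len(best_mode_rank: int, pu_size: int):
    """Map the 1-based position of the best mode in the RMD rough-selection list
    to the sample label and the truncated candidate-list length."""
    if pu_size >= 16:                  # 64x64, 32x32, 16x16: original list length 3
        if best_mode_rank == 1:
            return 0, 1
        if best_mode_rank == 2:
            return 1, 2
        return 2, 3
    else:                              # 8x8, 4x4: original list length 8
        if best_mode_rank <= 2:
            return 0, 2
        if best_mode_rank <= 4:
            return 1, 4
        return 2, 8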
Step (2), constructing the lightweight HCT model, i.e., the Light-HCT model, with the PyTorch deep learning library. Unlike the HCT model, Light-HCT reduces the number of Transformer encoder layers from 7 to 1. Instead of the classification-accuracy objective used by HCT, Light-HCT is trained to fit the position of the best mode in the candidate list after RMD rough selection, because the mode selection of a PU block is determined by texture features, boundary curvature, quantization parameters, and boundary direction, and a deep learning method cannot automatically extract so many features, which makes training with a pure classification method difficult. In particular, since 8 × 8 and 4 × 4 PU blocks are very small and do not need a very complicated network structure, we further simplify the convolution module of the Light-HCT model: the convolution layer parameters are changed to a 3 × 3 kernel with stride and padding of 1, i.e., no dimension reduction is performed, and the following max pooling layer is removed.
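For illustration, the simplified convolution module for 8 × 8 and 4 × 4 PU blocks could be sketched as follows; apart from the changes stated above, the remaining structure follows the HCT sketch given earlier, and all names here are illustrative:

import torch
import torch.nn as nn

# Light-HCT for 8x8 and 4x4 PU blocks: a single encoder layer, and a simplified
# convolution module (3x3 kernel, stride 1, padding 1, no max pooling), so no
# dimension reduction is performed and the spatial size W stays unchanged.
W, dim = 8, 128                                    # e.g. an 8x8 PU block
conv = nn.Sequential(nn.Conv2d(1, dim, 3, stride=1, padding=1), nn.ReLU())

x = torch.randn(16, 1, W, W)                       # a batch of 16 PU luma blocks
x0 = conv(x).flatten(2).transpose(1, 2)            # [16, W*W, 128] = [16, 64, 128]
print(x0.shape)
# x0 then passes through one encoder layer, sequence pooling and a fully connected
# layer producing a length-3 output vector for the regression described below.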
Step (3), the model is trained by regression prediction, and the loss function uses the MSELoss function in PyTorch:
Loss = (1/N) ∑_{k=1}^{N} ∑_{j=1}^{3} (output_{k,j} - value_{k,j})²  (6)
where output is the network output vector of length 3 and value is the true value vector obtained by comparing output with label, as shown in FIG. 8. The value vector is obtained by the following rule:
and (c) 0. If the maximum value in output ═ x, y, z ] occurs where the subscript is 0, then value ═ output; if the maximum occurs at the subscript 1, then value ═ y, x, z; if the maximum occurs at a subscript of 2, then value is [ z, y, x ].
(1) in the case of label. If the maximum value in output ═ x, y, z ] occurs where the subscript is 0, then value ═ y, x, z ]; if the maximum value appears where the subscript is 1, value is output; if the maximum occurs at a subscript of 2, then value is [ x, z, y ].
③ 2. If the maximum value in output ═ x, y, z ] occurs where the subscript is 0, then value ═ z, y, x ]; if the maximum occurs at the subscript 1, then value ═ x, z, y; if the maximum occurs at the subscript of 2, value is output.
The purpose of the above exchange mechanism is to make the position of the maximum value predicted by the model approach the position corresponding to the label value, so that each PU block is fitted to its corresponding category and a balance between coding performance and complexity is achieved.
In step (4), the invention optimizes the Most Probable Mode (MPM) part after RMD rough selection. First, the mode with the minimum SAD and SATD cost in the candidate mode list after RMD rough selection is obtained, and its SATD cost is recorded as J_SATDmin. SAD (Sum of Absolute Differences) represents the magnitude of the residual between the original image block and the predicted image block; SATD (Hadamard-transformed SAD) is the magnitude of the transformed residual, obtained by applying a Hadamard transform to the prediction residual and then summing the absolute values. These two cost values reflect the RD-cost of the PU block to some extent and can be used for preliminary mode screening. The two cost functions are calculated as follows:
TD(x,y)=|Orig(x,y)-Pred(x,y)| (7)
SAD = ∑_{x,y} TD(x, y) (8)
SATD = ∑_{x,y} |Hadamard(Orig - Pred)(x, y)|  (9)
after the MPM part comes, because there are three MPMs obtained from the adjacent PU blocks, firstly, whether the MPMs are the same as the mode with the minimum SAD and SATD costs is compared in sequence, if so, the MPM process is terminated, the mode is taken as the best mode and sent to RDO (remote data object) calculation, and other modes are discarded; if not, calculating the SATD cost value of the MPM and comparing the SATD cost value with an adaptive threshold AT, wherein the adaptive threshold AT defined by the invention is as follows:
AT = ρ × J_SATDmin (10)
where the proportionality coefficient ρ = 1.3, obtained from experimental statistical analysis of a large number of video sequences. If the SATD cost of an MPM is larger than AT, the mode is not added to the RDO candidate mode list; otherwise it is added to the candidate list if it is not already present. The next MPM is then judged in the same way until all MPMs have been processed, and this procedure replaces the MPM operation of the original encoder.
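For illustration, the cost functions and the MPM screening of step (4) can be sketched as follows; the SATD normalization and the satd_cost helper are assumptions, not the exact HM implementation:

import numpy as np
from scipy.linalg import hadamard

def sad_satd(orig: np.ndarray, pred: np.ndarray):
    """Formulas (7)-(9) for a square block whose side is a power of two.
    The SATD normalization used by HM may differ; this is only a sketch."""
    res = orig.astype(np.int64) - pred.astype(np.int64)
    sad = np.abs(res).sum()                        # SAD = sum of |TD(x, y)|
    h = hadamard(res.shape[0])
    satd = np.abs(h @ res @ h).sum()               # sum of |Hadamard-transformed residual|
    return sad, satd

def filter_mpm(mpm_modes, best_mode, satd_cost, j_satd_min, rho=1.3):
    """MPM optimization: best_mode is the mode with minimum SAD/SATD cost,
    satd_cost(m) returns the SATD cost of mode m (assumed helper), and
    AT = rho * J_SATDmin is the adaptive threshold of formula (10)."""
    at = rho * j_satd_min
    keep = []
    for m in mpm_modes:                            # the three MPMs from neighbouring PUs
        if m == best_mode:
            return [m]                             # terminate MPM process: single best mode to RDO
        if satd_cost(m) <= at and m not in keep:
            keep.append(m)                         # below the adaptive threshold: add to RDO list
    return keep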

Claims (3)

1. A fast video coding method based on a deep neural network is characterized in that the specific implementation comprises a CU partitioning module based on the deep neural network and a PU mode selection module based on neighborhood correlation; the rate distortion cost is calculated through PU mode selection when a CU block is coded in a frame, optimization is carried out through a PU mode selection module based on neighborhood correlation at the moment, and the number of candidate modes calculated by RDO is reduced through a prediction result of a lightweight HCT model; after PU mode selection is finished, the encoder can carry out CU block depth judgment to judge whether the CU block is divided or not, at the moment, a CU dividing module based on a depth neural network carries out optimization, a prediction result is obtained from an HCT model to judge whether division is stopped in advance, and otherwise, PU mode selection and CU block dividing judgment of the sub-CU block are continued to be carried out.
2. The method according to claim 1, wherein the deep neural network-based CU partitioning module is implemented as follows:
step (I), constructing a data set for network model training in an HEVC intra-frame mode: data sets are derived from YUV video sequences of various resolutions including: CIF (352 × 288),480p (832 × 480),720p (1280 × 720),1080p (1920 × 1080), WQXGA (2560 × 1600); coding the image in the data set by adopting an HEVC (high efficiency video coding) coder HM16.9 to obtain a CU block and positive and negative sample labels thereof; the data sets include a training set, a validation set, and a test set, each data set in turn being divided into four subsets according to four QPs (22,27,32, 37);
step (II), constructing a deep neural network for three CU blocks of 64 × 64,32 × 32 and 16 × 16 respectively to form a hierarchical convolutional network HCT structure, wherein the hierarchical convolutional network HCT is composed of ViT and CNN, training the HCT through a corresponding training set, determining and storing an HCT model through a verification set, and judging the generalization capability of the HCT model through a test set; the objective function of HCT model training is a cross entropy loss function:
Loss(output, target) = -output[target] + log(∑_{j=1}^{N} exp(output[j]))  (1)
wherein output is the output vector of the HCT model, target is the label value, and N is the length of the output vector;
step (III), the hierarchical convolutional network HCT consists of a convolution module, an Encoder module, a sequence pooling layer and a fully connected layer; first, the luminance component of a CU block is fed into the hierarchical convolutional network HCT, and a feature map with local feature information is output through the convolution module, which comprises a convolution layer and a max pooling layer, each activated by a linear rectification function to improve the nonlinearity of the model; the feature map is then flattened into one dimension and exchanged with the feature-map dimension, namely the flattening and transposing operation; suppose an input image x ∈ R^{C×H×W}, where C represents the number of input images, H is the height of the image and W is the width of the image; the output feature data x0 after the convolution module is as follows:
x0=Transpose(Flatten(MaxPool(Conv2d(x)))) (2)
the feature data x0 is then added to the position vector and sent to the Encoder module for global information extraction, the Encoder module having 7 layers, each layer consisting of a multi-head self-attention layer and a feed-forward convolution layer, with a layer normalization operation performed before the two sublayers; the feature data x0 first passes through the multi-head self-attention layer, and the output data is added to x0 to obtain new feature data x1; x1 then passes through the feed-forward convolution layer, and its output value is added to x1 to obtain feature data x2; the formulas are as follows:
x1=x0+MSL(LayerNorm(x0)) (3)
x2=x1+FFL(LayerNorm(x1)) (4)
finally, a classification vector is obtained through the sequence pooling layer, where sequence pooling applies a mapping transformation T: R^{b×n×d} → R^{b×d}, b represents the batch size, n represents the number of feature data items, and d represents the size of each feature data item; this operation transforms the feature data x2 output by the whole Encoder directly into a classification vector containing information about the respective portions of the input image, replacing the additional classification vector of ViT; finally, the classification vector outputs the classification result through a fully connected layer and softmax, and the final predicted value is the index of the maximum output value;
and (IV) training the HCT model by adopting a random gradient descent method, storing 12 HCT models with the highest accuracy of 3 CU blocks under 4 QPs, predicting the division results of 64 multiplied by 64,32 multiplied by 32 and 16 multiplied by 16 blocks from top to bottom by adopting an early termination mechanism for the trained HCT model, wherein the prediction results of the models are of two types: 0 represents no partition, 1 represents partition; when the prediction result of a certain type of block is 0, the quadtree division is not continuously performed downwards during encoding;
using a contrast value Thr between the two classification values; when Thr is smaller than the constant λ, we can check the CU block with the original coding procedure, so as to reduce the misjudgment of the CU partition result, thereby improving the coding performance and realizing the trade-off between coding performance and complexity; the formula is as follows:
Thr = |Softmax(output_i)[0] - Softmax(output_i)[1]|  (5)
wherein output_i is the two-class output vector of an i × i-sized block, and the constant λ is divided into three classes according to block size, with a size ratio of 4:2:1.
3. The method according to claim 2, wherein the PU mode selection module based on neighborhood correlation is implemented as follows:
step (1), obtaining a sample label value label ∈ [0,1,2] of each PU block during intra mode selection through HM coding, and obtaining a rule as follows: for a PU block of 64 × 64,32 × 32,16 × 16 size, the original length of the candidate list after RMD rough selection is 3, if the best mode of the PU block during mode selection is the first bit in the candidate list after RMD rough selection, label is 0, and the length of the candidate list after corresponding RMD rough selection becomes 1; if the best mode of the PU block during mode selection is located at the second bit in the candidate list after RMD rough selection, the label is 1, and the length of the candidate list after corresponding RMD rough selection becomes 2; otherwise, label is 2, and the length of the candidate list after corresponding to RMD rough selection is 3; for an 8 × 8, 4 × 4 PU block, since its candidate list is originally 8 long, we also divide it into three intervals to correspond to label 0,1,2, which are: if the best mode of the PU block after mode selection is located at the first or second bit in the candidate list after RMD rough selection, then label is 0, and the length of the candidate list after corresponding RMD rough selection becomes 2; when the best mode of the PU block during mode selection is located in the third or fourth bit of the candidate list after RMD rough selection, label is 1, and the length of the candidate list after RMD rough selection becomes 4; otherwise, label is 2, and the length of the candidate list after corresponding to the RMD rough selection is 8;
step (2), the data set of the PU mode selection module is also from the video sequence mentioned in the block division module, and PU block data with the size of 8 multiplied by 8 and 4 multiplied by 4 is added on the basis of the block division data set; the model of a 64 × 64,32 × 32,16 × 16 size PU block is similar to that of the block division module, but the number of layers of the Encoder module becomes 1; the models corresponding to the 8 × 8 and 4 × 4 PU blocks are also changed in the convolution module, namely, the dimension reduction operation is not carried out and the maximum pooling layer is removed to reduce the complexity of the models;
constructing a lightweight HCT model, namely a Light-HCT model, by utilizing a pytorch deep learning library, wherein the Light-HCT reduces the number of layers of the Encoder module from the original 7 layers to 1 layer; the training of the model adopts a mean square error loss function to carry out regression training:
Loss = (1/N) ∑_{k=1}^{N} ∑_{j=1}^{3} (output_{k,j} - value_{k,j})²  (6)
wherein output is the model output vector of length 3, value is the true value vector obtained by comparing output with label, and N is the number of input images during each training; the true value vector acquisition rule is as follows: assuming output = [x, y, z], in the case label = 0, if the maximum value in output occurs at index 0, then value = output; if the maximum occurs at index 1, then value = [y, x, z]; if the maximum occurs at index 2, then value = [z, y, x]; similarly, in the case label = 1, if the maximum value in output occurs at index 0, then value = [y, x, z]; if the maximum occurs at index 1, then value = output; if the maximum occurs at index 2, then value = [x, z, y]; in the case label = 2, if the maximum value in output occurs at index 0, then value = [z, y, x]; if the maximum occurs at index 1, then value = [x, z, y]; if the maximum occurs at index 2, then value = output.
CN202111599851.9A 2021-12-24 2021-12-24 Rapid video coding method based on deep neural network Pending CN114286093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111599851.9A CN114286093A (en) 2021-12-24 2021-12-24 Rapid video coding method based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111599851.9A CN114286093A (en) 2021-12-24 2021-12-24 Rapid video coding method based on deep neural network

Publications (1)

Publication Number Publication Date
CN114286093A true CN114286093A (en) 2022-04-05

Family

ID=80875038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111599851.9A Pending CN114286093A (en) 2021-12-24 2021-12-24 Rapid video coding method based on deep neural network

Country Status (1)

Country Link
CN (1) CN114286093A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114513660A (en) * 2022-04-19 2022-05-17 宁波康达凯能医疗科技有限公司 Interframe image mode decision method based on convolutional neural network
WO2024001886A1 (en) * 2022-06-30 2024-01-04 深圳市中兴微电子技术有限公司 Coding unit division method, electronic device and computer readable storage medium
WO2024027616A1 (en) * 2022-08-01 2024-02-08 深圳市中兴微电子技术有限公司 Intra-frame prediction method and apparatus, computer device, and readable medium
CN115118977A (en) * 2022-08-29 2022-09-27 华中科技大学 Intra-frame prediction encoding method, system, and medium for 360-degree video
CN115118977B (en) * 2022-08-29 2022-11-04 华中科技大学 Intra-frame prediction encoding method, system, and medium for 360-degree video
US12015767B2 (en) 2022-08-29 2024-06-18 Huazhong University Of Science And Technology Intra-frame predictive coding method and system for 360-degree video and medium
CN115170894A (en) * 2022-09-05 2022-10-11 深圳比特微电子科技有限公司 Smoke and fire detection method and device
CN116229095A (en) * 2022-12-30 2023-06-06 北京百度网讯科技有限公司 Model training method, visual task processing method, device and equipment
CN116600107A (en) * 2023-07-20 2023-08-15 华侨大学 HEVC-SCC quick coding method and device based on IPMS-CNN and spatial neighboring CU coding modes
CN116600107B (en) * 2023-07-20 2023-11-21 华侨大学 HEVC-SCC quick coding method and device based on IPMS-CNN and spatial neighboring CU coding modes
CN116634147A (en) * 2023-07-25 2023-08-22 华侨大学 HEVC-SCC intra-frame CU rapid partitioning coding method and device based on multi-scale feature fusion
CN116634147B (en) * 2023-07-25 2023-10-31 华侨大学 HEVC-SCC intra-frame CU rapid partitioning coding method and device based on multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN114286093A (en) Rapid video coding method based on deep neural network
CN110087087B (en) VVC inter-frame coding unit prediction mode early decision and block division early termination method
CN108924558B (en) Video predictive coding method based on neural network
CN115914649B (en) Data transmission method and system for medical video
CN111355956B (en) Deep learning-based rate distortion optimization rapid decision system and method in HEVC intra-frame coding
US11956447B2 (en) Using rate distortion cost as a loss function for deep learning
CN111479110B (en) Fast affine motion estimation method for H.266/VVC
TWI806199B (en) Method for signaling of feature map information, device and computer program
CN112291562B (en) Fast CU partition and intra mode decision method for H.266/VVC
US20230353764A1 (en) Method and apparatus for decoding with signaling of feature map data
KR20230072487A (en) Decoding with signaling of segmentation information
CN107690069B (en) Data-driven cascade video coding method
CN111711815B (en) Fast VVC intra-frame prediction method based on integrated learning and probability model
CN115941943A (en) HEVC video coding method
US20230110503A1 (en) Method, an apparatus and a computer program product for video encoding and video decoding
CN116896638A (en) Data compression coding technology for transmission operation detection scene
CN114143536B (en) Video coding method of SHVC (scalable video coding) spatial scalable frame
WO2023122132A2 (en) Video and feature coding for multi-task machine learning
CN113822801A (en) Compressed video super-resolution reconstruction method based on multi-branch convolutional neural network
CN113225552B (en) Intelligent rapid interframe coding method
CN117692652B (en) Visible light and infrared video fusion coding method based on deep learning
CN117640931A (en) VVC intra-frame coding rapid block division method based on graph neural network
US20240185572A1 (en) Systems and methods for joint optimization training and encoder side downsampling
Kaji et al. Enhancement of CNN-based Probability Modeling by Locally Trained Adaptive Prediction for Efficient Lossless Image Coding
Jiang et al. Encoder-Decoder-Based Intra-Frame Block Partitioning Decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination