Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a reference-block-based fast VVC inter-frame coding method, which comprises the following steps:
S1: acquiring current coding unit information, wherein the coding unit information comprises depth and mode information of a current coding unit;
S2: determining whether to perform network-predicted partitioning according to the current coding information; if so, executing step S3, and if not, executing the original partitioning process of the encoder;
S3: constructing a partition probability network model, and training the partition probability network model;
S4: inputting the current coding unit into the trained partition probability network model to obtain a partition probability prediction value of the current coding unit;
S5: selecting a prediction mode according to the partition probability prediction value, and determining whether the prediction mode is a partition mode; returning to step S4 if it is a partition mode, and executing step S6 if it is not;
S6: performing Merge/Skip mode coding on the video to be coded using the current coding unit (CU), and calculating the RDCost value of the Merge/Skip mode coding;
S7: determining, according to the RDCost value, whether the Skip mode is adopted to skip further coding prediction; if not, traversing all default candidate prediction modes.
Preferably, training the partition probability network model includes: the partition probability network model consists of a block convolution layer, a conditional residual module, a sub-network structure and a fully connected layer; the process of training the partition probability network model includes:
step 1: acquiring all coding units, screening the types of the CUs, and collecting the screened CU types to obtain a training data set;
step 2: optimizing the types of the redundant CUs in the training data set;
step 3: inputting the optimized training data set into the block convolution layer of the partition probability network model, and learning the distinct features of the original data, the reference data and the co-located data;
step 4: inputting the learned characteristics into a conditional residual error module for deep-level characteristic extraction;
step 5: inputting the extracted deep features into a sub-network structure to obtain division results of different data;
step 6: inputting the division result into the full connection layer to obtain a division probability prediction value;
step 7: calculating the loss function of the model according to the partition probability prediction value, continuously adjusting the parameters of the model, and finishing training when the loss function reaches its minimum.
Preferably, determining the partition mode according to the partition probability prediction value includes: inputting the partition probability prediction value into a softmax function, converting each category's prediction into a probability value between 0 and 1, and comparing the maximum probability value with a set threshold interval [τ1, τ2]: when the maximum probability value is smaller than τ1, the CU is partitioned using the encoder's default partition mode; when the maximum probability value is greater than τ2, the CU is partitioned using the predicted coding mode; otherwise, the CU is partitioned using the optimization mode.
Further, the process of partitioning the CU includes:
partitioning the CU using the encoder default partition mode includes: the encoder partitions a CU in its default way, with at most 5 partition modes, comprising quadtree partition QT, vertical binary partition BV, horizontal binary partition BH, vertical ternary partition TV and horizontal ternary partition TH;
partitioning the CU using the predicted coding mode includes: clearing the partition mode stack, and partitioning according to the network-predicted partition mode, wherein the network-predicted partition modes comprise non-split NS, quadtree partition QT, horizontal partition HS and vertical partition VS;
partitioning the CU using the optimization mode includes: clearing the partition mode stack and determining the current partition mode; if the current partition mode is horizontal partition HS, performing both horizontal binary partition and horizontal ternary partition on the CU; if the current partition mode is vertical partition VS, performing both vertical binary partition and vertical ternary partition on the CU; and if the current partition mode is neither HS nor VS, partitioning the CU using the predicted coding mode.
The invention has the beneficial effects that:
the invention utilizes the reference block to solve the problem that the traditional deep learning can not effectively learn the inter-frame coding mode information and the prediction is inaccurate by utilizing the deep learning according to the inter-frame motion characteristics.
Detailed Description
The following describes the technical solutions in the embodiments of the present invention clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
An embodiment of a reference block-based VVC inter-frame fast encoding method, as shown in fig. 5, includes:
S1: acquiring current coding unit information, wherein the coding unit information comprises depth and mode information of a current coding unit;
S2: determining whether to perform network-predicted partitioning according to the current coding information; if so, executing step S3, and if not, executing the original partitioning process of the encoder;
S3: constructing a partition probability network model, and training the partition probability network model;
S4: inputting the current coding unit into the trained partition probability network model to obtain a partition probability prediction value of the current coding unit;
S5: selecting a prediction mode according to the partition probability prediction value, and determining whether the prediction mode is a partition mode; returning to step S4 if it is a partition mode, and executing step S6 if it is not;
S6: performing Merge/Skip mode coding on the video to be coded using the current coding unit (CU), and calculating the RDCost value of the Merge/Skip mode coding;
S7: determining, according to the RDCost value, whether the Skip mode is adopted to skip further coding prediction; if not, traversing all default candidate prediction modes.
In the partitioning module of the Versatile Video Coding standard (VVC), the partitioning of the coding unit (CU) adds multi-type tree partitioning on top of the quadtree partitioning of the previous-generation High Efficiency Video Coding standard (HEVC), giving up to six partition modes: non-split (NS), quadtree partition (QT), horizontal binary partition (BH), vertical binary partition (BV), horizontal ternary partition (TH) and vertical ternary partition (TV). Determining which partition mode to use at the encoder is based on a recursive rate-distortion optimization (RDO) process, which reaches the best CU partition by balancing compression ratio and image quality. This recursive process can be summarized as a top-down process and a bottom-up process. Specifically: a 128x128 luminance block is partitioned by the six partition modes, and each resulting sub-block again has several applicable partition modes (determined by its aspect ratio and block size); the sub-blocks apply their applicable modes one by one until a sub-block reaches the minimum allowed block size (such as 4x4) or is determined to be no longer divisible, at which point the top-down flow ends.
What follows is the bottom-up process. Specifically: after the RDCost of each partition mode of a CU is calculated, the minimum RDCost is found by comparison; the corresponding partition mode is the optimal partition mode of the current parent block, and the coding modes of all its sub-blocks are likewise optimal (when the whole process is described as a tree, the optimal coding modes appear only at the leaf nodes). The parent block may itself have a parent block, in which case it is compared as a sub-block against its sibling blocks to obtain the optimal partition mode of that parent, until the top-level 128x128 block, which has no parent, is reached. The whole process shows that the depth and width of the recursion are considerable, and the cost of eliminating locally optimal partition modes is larger at lower partition depths.
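The top-down/bottom-up search described above can be sketched as follows. This is a toy sketch, not VTM code: ternary splits and the full legality rules are omitted for brevity, and the cost function is a stand-in for real rate-distortion evaluation.

```python
# Toy sketch of the recursive RDO partition search: compare "no split"
# against every legal split, where a split's cost is the sum of its
# children's best costs (the bottom-up comparison).

MIN_SIZE = 4  # smallest allowed block edge, per the text (4x4)

def allowed_splits(w, h):
    """Simplified legality rules (assumption): QT needs a square block,
    and a binary split needs the halved edge to stay >= MIN_SIZE."""
    splits = []
    if w == h and w > MIN_SIZE:
        splits.append("QT")
    if h > MIN_SIZE:
        splits.append("BH")
    if w > MIN_SIZE:
        splits.append("BV")
    return splits

def children(w, h, mode):
    if mode == "QT":
        return [(w // 2, h // 2)] * 4
    if mode == "BH":
        return [(w, h // 2)] * 2
    if mode == "BV":
        return [(w // 2, h)] * 2
    return []

def best_partition(w, h, rd_cost):
    """Return (best_cost, best_mode) for a w x h block."""
    best_cost, best_mode = rd_cost(w, h), "NS"
    for mode in allowed_splits(w, h):
        cost = sum(best_partition(cw, ch, rd_cost)[0]
                   for cw, ch in children(w, h, mode))
        if cost < best_cost:
            best_cost, best_mode = cost, mode
    return best_cost, best_mode
```

With a toy cost that penalizes large blocks (`lambda w, h: w*h if max(w, h) <= 8 else 2*w*h`), a 16x16 block is best served by a QT split into four 8x8 blocks, while a 4x4 block can only stay unsplit, illustrating the recursion depth the network-based prediction aims to avoid.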
In this embodiment, the acquired data set includes: as shown in fig. 1, regarding the selection of CU types, the various partition modes give the coding blocks (CUs) different shapes; across frames the CU types vary among as many as 28 kinds, with aspect ratios from a minimum of 1 up to a maximum of 16. Because of the constraints on their partitioning, the distribution of partition modes differs somewhat among CU types. According to experimental statistics there are 28 block-size categories in total, as shown in the table below; the width and height are expressed as 2^k, where k = {2, 3, 4, 5, 6, 7}, and the width-to-height ratio is τ, where τ = {1, 2, 4, 8, 16}. When the aspect ratio τ >= 4, the probability of choosing NS averages 88.16%, and for CUs with k < 4 the average probability of NS is 85.6%. It can be seen that CUs with smaller width or height and CUs with larger aspect ratio lean strongly toward non-split, with very high probability, so the coding flow can directly adopt the likely partition mode and skip the other low-probability partition modes, thereby avoiding the recursive process over locally optimal partitions, greatly saving coding time and reducing coding cost.
Table 1 CU types at different aspect ratios

Aspect ratio τ | CU types
1  | 128x128, 64x64, 32x32, 16x16, 8x8, 4x4
2  | 128x64, 64x128, 64x32, 32x64, 32x16, 16x32, 8x16, 16x8, 8x4, 4x8
4  | 64x16, 16x64, 32x8, 8x32, 4x16, 16x4
8  | 64x8, 8x64, 32x4, 4x32
16 | 64x4, 4x64
Table 2 Partition mode proportions of different CU types
According to the analysis of the table, if accurate partition prediction can be performed at the stage of partitioning CUs of lower depth and larger width and height, the occurrence of locally optimal partition processes can be reduced. The criterion for selecting which CU types need partition prediction is: CUs with width and height at k >= 4 and ratio τ <= 4, or with min(width, height) >= 8 and τ = {4, 8}. Among these, CUs with τ != 1 are transposed so that CUs whose width and height are swapped belong to the same category; a category WxH denotes width W and height H, giving 9 categories in total (see table 3). In actual coding the difference between horizontal and vertical partitions is pronounced, while the partitions within one direction (binary horizontal versus ternary horizontal, and binary vertical versus ternary vertical) behave similarly; therefore the classification of the network structure designed by the invention converts the traditional six partition classes NS, QT, BH, BV, TH, TV into four optimized classes NS, QT, HS, VS, merging the binary and ternary partition modes in the horizontal and vertical directions respectively.
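The label-space reduction described above can be sketched as a simple mapping: the six VVC partition labels collapse into the four network classes by merging binary and ternary splits of the same orientation.

```python
# Six-class -> four-class label reduction, per the text above.
SIX_TO_FOUR = {
    "NS": "NS",              # non-split
    "QT": "QT",              # quadtree split
    "BH": "HS", "TH": "HS",  # horizontal binary/ternary -> horizontal split
    "BV": "VS", "TV": "VS",  # vertical binary/ternary   -> vertical split
}

def reduce_labels(labels):
    """Map a sequence of six-class partition labels to the four classes."""
    return [SIX_TO_FOUR[m] for m in labels]
```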
Table 3 network entered CU categories
According to the network model, the input to the convolutional layers is the pixel block of the current CU; in order to let the model learn and infer the inter-frame motion characteristics of the CU, the model input additionally includes, besides the original CU pixel block and the partition data label, the co-located block of the current CU in the reference frame and the corresponding reference block, since the matching degree of the reference block determines the partition mode to a certain extent. After the convolutional layers, some coding parameters generated while coding the current CU are flattened and appended to the fully connected layer for further generalization. These coding parameters consist of: the partition depth, the quantization parameter QP, the temporal layer Id, the width and height of the CU, and the RDCost of the current block in Merge mode. For these training data, the data set adopts the video sequences of the DVC data set, with resolution categories including: 3840x2160, 1920x1080, 960x544, 480x272, 3840x2176, 832x480, 416x240. All sequences are encoded with the officially designated VVC reference encoder VTM (VVC Test Model) version 10.2 under the Random Access configuration, with quantization parameters QP = {22, 27, 32, 37}. The coding information generated in the partitioning process is stored during encoding, finally yielding the labels of the different blocks and the corresponding data to be trained. To keep the training and validation sets of the experiment disjoint, a total of 100 video sequences are selected as the training set and 20 video sequences as the validation set. Training and validation sets corresponding to the different CU types are generated, and the models are then trained offline.
In this embodiment, as shown in fig. 3, the main network structure of the model consists of a block convolution layer, a conditional residual module, a sub-network structure and a fully connected layer. Different network models are trained for the different CU types. To avoid gradient explosion and gradient vanishing during convolution, the local features learned by the block convolution are fused and further learned by the conditional residual module, improving the expressive capacity of the network. Corresponding sub-networks are designed for the different CU types; sub-networks designed according to the shape and size of the blocks improve prediction accuracy. Finally, all neurons of the convolutional layers are connected to all neurons of the current layer, realizing the extraction of global features and the classification of modes. To let the activation function better adapt to the different features of the data set, PReLU is used, which adjusts the slope for negative inputs according to a learned parameter so that the network can better handle the different features in the data set. Its mathematical form is as follows:
f(x)=max(0,x)+α·min(0,x)
where α is a learnable parameter controlling the slope for negative inputs. When α = 0, PReLU is equivalent to the standard ReLU activation function.
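The activation above can be written directly from its formula; a minimal scalar sketch:

```python
def prelu(x, alpha):
    """PReLU as given above: f(x) = max(0, x) + alpha * min(0, x)."""
    return max(0.0, x) + alpha * min(0.0, x)
```

With alpha = 0 this reduces to the standard ReLU, as the text notes; in a trained network alpha is a learned parameter rather than a fixed constant.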
In this embodiment, the process of training the partition probability network model includes: the partition probability network model consists of a block convolution layer, a conditional residual module, a sub-network structure and a fully connected layer; the process of training the partition probability network model includes:
step 1: acquiring all coding units, screening the types of the CUs, and collecting the screened CU types to obtain a training data set;
step 2: optimizing the types of the redundant CUs in the training data set;
step 3: inputting the optimized training data set into the block convolution layer of the partition probability network model, and learning the distinct features of the original data, the reference data and the co-located data;
step 4: inputting the learned characteristics into a conditional residual error module for deep-level characteristic extraction;
step 5: inputting the extracted deep features into a sub-network structure to obtain division results of different data;
step 6: inputting the division result into the full connection layer to obtain a division probability prediction value;
step 7: calculating the loss function of the model according to the partition probability prediction value, continuously adjusting the parameters of the model, and finishing training when the loss function reaches its minimum.
In this embodiment, the process of screening the CU types includes: acquiring all CU types, and determining the width W, the height H and the width-to-height ratio τ of each CU type; setting screening conditions according to W, H and τ, and screening the CU types accordingly. The screening conditions are: either side (width or height) of the CU is greater than 16, and when one side is 16 or 32, the other side cannot be 128; or the CU satisfies min(W, H) = 8 and τ = {4, 8}.
Optimizing the redundant CU types in the data set includes: the partition types in the data set include non-split NS, quadtree partition QT, horizontal binary partition BH, vertical binary partition BV, horizontal ternary partition TH and vertical ternary partition TV; the horizontal binary partition BH and the horizontal ternary partition TH are merged into the horizontal partition HS, and the vertical binary partition BV and the vertical ternary partition TV are merged into the vertical partition VS; the optimized partition types comprise NS, QT, HS and VS.
The process by which the sub-network structure handles the input data is as follows: the input shape of the sub-network structure is (BATCH_SIZE, 16, H, W). The sub-network structure consists of convolutional layers and a fully connected layer; the kernel size and stride of the first convolutional layer of each sub-network are both (4×τ), used to normalize the output feature shape. During the training of the sub-network, additional coding information is added to train the network, where the additional coding information includes: the quantization parameter QP, the partition depth Depth, the temporal layer Id and the RDCost generated by encoding. Specifically:
for the sub-network with min(W, H) = 8, which consists of two convolutional layers, the kernel size and stride of the second convolutional layer are both (2, 2); the convolutional output has shape (BATCH_SIZE, 32, 1, 1), and flattening it yields data of shape (BATCH_SIZE, 32);
for the sub-network with min(W, H) = 16, which consists of three convolutional layers, the kernel sizes and strides of the second and third layers are (2, 2) and (2, 2); the convolutional output has shape (BATCH_SIZE, 64, 1, 1), and flattening it yields data of shape (BATCH_SIZE, 64);
for the sub-network with min(W, H) = 32, which consists of three convolutional layers, the kernel sizes and strides of the second and third layers are both (4, 4); the convolutional output has shape (BATCH_SIZE, 128, 1, 1), and flattening it yields data of shape (BATCH_SIZE, 128);
for the sub-network with min(W, H) = 64, which consists of three convolutional layers, the kernel sizes and strides of the second and third layers are both (2, 2); the convolutional output has shape (BATCH_SIZE, 32, 2, 2), and flattening it yields data of shape (BATCH_SIZE, 128).
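The output shapes listed above can be summarized in a small lookup that computes the flattened feature width each sub-network hands to the fully connected layer (channels × height × width after the last convolution):

```python
# Flattened feature widths per sub-network, per the shapes in the text.
SUBNET_OUT_SHAPE = {
    8:  (32, 1, 1),   # two conv layers   -> (B, 32, 1, 1)
    16: (64, 1, 1),   # three conv layers -> (B, 64, 1, 1)
    32: (128, 1, 1),  # three conv layers -> (B, 128, 1, 1)
    64: (32, 2, 2),   # three conv layers -> (B, 32, 2, 2)
}

def flattened_width(min_side):
    """Width of the flattened vector for a CU with min(W, H) = min_side."""
    c, h, w = SUBNET_OUT_SHAPE[min_side]
    return c * h * w
```

Note that the min(W, H) = 32 and min(W, H) = 64 sub-networks both flatten to 128 features, via different channel/spatial trade-offs.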
The process by which the fully connected layer handles the partition results is as follows: the partition results are the (BATCH_SIZE, 32), (BATCH_SIZE, 64) and (BATCH_SIZE, 128) dimensional data obtained from the different sub-networks; the corresponding coding information is appended to the obtained dimensional data, which then serves as the input of the fully connected layer to obtain the partition probability prediction value of the current CU type.
The loss function of the proposed network model is defined as follows: for a CU with width w and height h, the possible partitions are M = {NS, QT, HS, VS}, and a partition mode m ∈ M denotes the label of the corresponding partition mode. Since the CU types within a mini-batch are the same in each training step, let the mini-batch size be N; the cross-entropy loss function is:

L = -(1/N) Σ_{n=1}^{N} Σ_{m∈M} y_{n,m} · log(ŷ_{n,m})

where L denotes the loss, y_{n,m} denotes the real label of the CU, ŷ_{n,m} denotes the predicted label of the CU, and n indexes the CU within the mini-batch; each term describes the loss of partition mode m in the partition prediction of the n-th CU of the mini-batch.
Because the proportions of the different partition modes within a CU type are very unbalanced and the original cross-entropy loss function cannot accommodate this, the invention applies different penalties to the different partition modes; the loss function obtained by improving the original cross-entropy formula is defined with a per-mode weight, where L denotes the loss, p_m denotes the proportion of partition mode m within CUs of this type and satisfies Σ_{m∈M} p_m = 1, and α ∈ [0, 1] is a tunable parameter adjusting the degree of penalty on each partition mode.
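The exact weighting of the improved loss is not reproduced in the text; the sketch below assumes a per-mode weight of (1 - α·p_m), which is consistent with the stated ingredients (mode proportions p_m summing to 1, α ∈ [0, 1] controlling penalty strength: frequent modes get down-weighted as α grows) but is an assumption, not the patented formula.

```python
import math

def weighted_ce(y_true, y_pred, p, alpha):
    """Class-balanced cross-entropy sketch (assumed weight 1 - alpha*p_m).

    y_true: list of per-sample dicts, mode -> 0/1 one-hot label.
    y_pred: list of per-sample dicts, mode -> predicted probability.
    p:      dict, mode -> proportion of that mode in the CU type.
    """
    n = len(y_true)
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        for m, t in yt.items():
            if t:  # only the true-label term contributes
                total -= (1.0 - alpha * p[m]) * math.log(yp[m])
    return total / n
```

With alpha = 0 the weight is 1 for every mode and the function reduces to the plain cross-entropy above.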
In this embodiment, the partitioned CU needs to be predicted. Specifically, since the network's prediction carries a certain deviation that is corrected manually, the invention sets a pair of range thresholds [τ1, τ2] and adopts different strategies according to where the predicted value P_m falls relative to [τ1, τ2]: the default mode, the optimized mode, and the trusted mode. The default mode means adopting the encoder's original candidate partition modes. In the optimized mode, when the prediction result is the horizontal mode, the current CU adopts the horizontal binary and horizontal ternary partitions as candidate prediction modes, and when the prediction result is the vertical mode, the current CU adopts the vertical binary and vertical ternary partitions as candidate prediction modes. In the trusted mode, the network's prediction for the CU is highly accurate, and the network-predicted partition mode can be adopted directly. This can be summarized as follows:
An embodiment of CU partitioning, as shown in fig. 4, includes: inputting the current coding unit information into the partition prediction network to obtain the partition probability prediction value; inputting the partition probability prediction value into a softmax function, converting each category's prediction into a probability value between 0 and 1; and comparing the maximum probability value with the set threshold interval [τ1, τ2]: when the maximum probability value is smaller than τ1, the CU is partitioned using the encoder's default partition mode; when the maximum probability value is greater than τ2, the CU is partitioned using the predicted coding mode; otherwise, the CU is partitioned using the optimization mode. The partitioning flow is shown in fig. 2.
The optimized values of τ1 and τ2, selected during encoder testing, lie in the range [0.5, 0.75]: when the maximum probability p < 0.5, the CU is partitioned using the encoder's default candidate modes; when p > 0.75, the CU is partitioned directly using the predicted coding mode; otherwise, the CU is partitioned using the optimization mode. Partitioning the CU using the encoder default partition mode includes: the encoder partitions a CU in its default way, with at most 5 partition modes, comprising quadtree partition QT, vertical binary partition BV, horizontal binary partition BH, vertical ternary partition TV and horizontal ternary partition TH. Partitioning the CU using the predicted coding mode includes: clearing the partition mode stack, and partitioning according to the network-predicted partition mode, where the network-predicted partition modes comprise non-split NS, quadtree partition QT, horizontal partition HS and vertical partition VS. Partitioning the CU using the optimization mode includes: clearing the partition mode stack and determining the current partition mode; if the current partition mode is horizontal partition HS, performing both horizontal binary partition and horizontal ternary partition on the CU; if it is vertical partition VS, performing both vertical binary partition and vertical ternary partition on the CU; and if it is neither HS nor VS, partitioning the CU using the predicted coding mode.
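The three-way threshold decision above can be sketched as a small function, using the tested interval [τ1, τ2] = [0.5, 0.75]:

```python
# Threshold decision between the default / optimized / trusted partition
# strategies, per the interval described in the text.
TAU1, TAU2 = 0.5, 0.75

def partition_strategy(probs):
    """probs: softmax-normalized probabilities for the classes NS/QT/HS/VS.

    Returns which partition path the encoder takes for this CU."""
    p = max(probs.values())
    if p < TAU1:
        return "default"    # fall back to the encoder's full candidate set
    if p > TAU2:
        return "trusted"    # take the network-predicted mode directly
    return "optimized"      # restrict to the predicted orientation's modes
```

In the "optimized" case the predicted class (HS or VS) then determines whether the binary/ternary horizontal or binary/ternary vertical candidates are tried.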
The criterion for which coding mode is used in VVC is the minimum rate-distortion cost, and the process of obtaining this minimum rate-distortion value is called rate-distortion optimization; the magnitude of the rate-distortion cost is related to the coding prediction distortion and the coding bit cost. In general, some coding modes have smaller image distortion but a larger bit rate, while others have larger image distortion but a smaller bit rate. Rate-distortion optimization minimizes the coding distortion D while keeping the current coding rate R_curr below the maximum coding rate R_max. This process can be expressed as:

min{D}  s.t.  R_curr < R_max

where R_max denotes the maximum coding rate, R_curr denotes the current coding rate, and D denotes the coding distortion.
Let M_k be the k-th coding mode among the candidate coding modes to be tried, D_k(M_k) the coding distortion when coding in the k-th mode, and R_k(M_k) the coding bits spent in the k-th mode. The best coding mode, i.e. the one achieving the minimum RD cost among all coding modes, can be expressed as minimizing D_k(M_k) subject to R_k(M_k) < R_max. Since the coding modes are independent of each other, the optimal solution for M_k can then be described by the unconstrained Lagrangian formulation:

min { D_k(M_k) + λ · R_k(M_k) }

where λ denotes the Lagrangian factor.
The RD cost J_k of the k-th candidate prediction mode M_k among all candidate prediction modes is expressed as follows:

J_k = D_k(M_k) + λ · R_k(M_k)

By comparing the rate-distortion of each mode, the RD cost of the optimal prediction mode can be expressed as:

J* = min{ J_1, J_2, ..., J_K }
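The Lagrangian mode decision above reduces to picking the candidate with the smallest J_k = D_k + λ·R_k; a minimal sketch (the distortion/rate numbers in the usage note are illustrative only):

```python
# Lagrangian RD mode decision: each candidate's cost is D + lambda * R,
# and the mode with the minimum cost wins.
def best_mode(candidates, lam):
    """candidates: list of (name, distortion, rate_bits) tuples."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])
```

Note how λ steers the trade-off: with a large λ, cheap-to-signal modes such as SKIP win even at higher distortion, while a small λ favors low-distortion modes regardless of their bit cost.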
because the SKIP mode occupies a larger proportion in all modes and the coding mode is simpler, the required coding time and code rate are less than those of other complex coding modes, and meanwhile, the coding quality is ensured, the prior probability can be obtained according to the rate distortion between different modes, and then whether the current block can adopt the SKIP mode is judged according to the distribution interval of the rate distortion value of the SKIP mode.
By the Bayesian decision algorithm, the probability that the best mode for the current RDCost is the SKIP mode is expressed as follows:

p(m_s|c) = p(c|m_s) · p(m_s) / p(c)

where p(m_s|c) denotes the probability that the best mode for the current RDCost is the SKIP mode, m_s denotes the SKIP mode, c is the rate-distortion value RDCost, p(m_s) denotes the prior probability of the SKIP mode, and p(c) denotes the total probability of RDCost over all modes.
p(c)=p(c|m s )·p(m s )+p(c|m o )·p(m o )
where p(c|m_o) denotes the conditional probability of RDCost when the current best mode is another mode, p(m_o) denotes the prior probability of the other modes, and m_o denotes the other modes.
Combining the two formulas gives:

p(m_s|c) = p(c|m_s) · p(m_s) / (p(c|m_s) · p(m_s) + p(c|m_o) · p(m_o))

where m_s and m_o denote the SKIP mode and the other modes respectively, c is the rate-distortion value RDCost, p(m_s) denotes the prior probability of the SKIP mode, p(m_o) denotes the prior probability of the other modes, p(c|m_s) denotes the conditional probability of RDCost when the current best mode is SKIP, and p(c|m_o) denotes the conditional probability of RDCost when the current best mode is another mode.
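The combined two-class posterior above is straightforward to compute; a minimal sketch (assuming the two priors sum to 1, so p(m_o) = 1 - p(m_s)):

```python
# Two-class Bayes posterior for "best mode is SKIP given RDCost c",
# from the SKIP and non-SKIP likelihoods and the SKIP prior.
def skip_posterior(p_c_skip, p_skip, p_c_other):
    """p_c_skip = p(c|m_s), p_skip = p(m_s), p_c_other = p(c|m_o)."""
    p_other = 1.0 - p_skip                                # p(m_o)
    evidence = p_c_skip * p_skip + p_c_other * p_other    # p(c)
    return p_c_skip * p_skip / evidence
```

For example, with equal priors and a SKIP likelihood four times the non-SKIP likelihood at the observed RDCost, the posterior is 0.8, matching the intuition that small RDCost values point strongly to SKIP.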
When the probability value p(m_s|c) is very large, approaching 1, the RDCost value is usually small, which indicates that the CU's best mode is then the SKIP mode; as RDCost increases, p(m_s|c) becomes smaller and approaches zero, and the CU's best mode is then more likely one of the other coding modes. However, within a certain distribution interval the RDCost may correspond to a best coding mode that is either another mode or the SKIP mode. To be able to determine the specific coding mode, a rate-distortion threshold λ is set: when c < λ, the current best coding mode is considered to be m_s, and when c >= λ, one of the other coding modes is more likely to match the current coding mode.
In order to determine the magnitude of the threshold λ when the SKIP mode is the optimal mode, let p(λ|m_s) be the conditional probability of λ under the SKIP mode and p(λ|m_o) the conditional probability of λ under the other modes, obtained by integrating the corresponding densities over c < λ and c >= λ respectively, where f(c|m_s) is the conditional probability density function of c under the SKIP mode and f(c|m_o) is the conditional probability density function of c under the other modes; combining the two equations yields the threshold λ.
By adjusting the probability value p(m_s|c), the bit-rate consumption of each sequence when the encoder applies this algorithm is counted during encoding; the value of λ is determined from these statistics, and after multiple experimental tests λ is finally set to 0.8.
The network structure designed by the invention is embedded into the algorithm flow of the VTM encoder. The CTC standard test sequences specified for coding standard development are adopted, and partial sequences from class A to class E are selected to test the effect of the algorithm. The experimental results, shown below, demonstrate that the proposed algorithm design greatly reduces the coding time complexity with only a small bit-rate fluctuation, where DB represents the bit-rate cost of encoding, PSNR represents the quality of the encoded video, and T represents time.
Table 4 algorithm deployment experiment results
While the foregoing embodiments illustrate the aspects and advantages of the present invention, it should be understood that they are merely exemplary; any changes, substitutions or alterations made without departing from the spirit and principles of the invention are intended to fall within its scope.