CN111479110B

CN111479110B - Fast affine motion estimation method for H.266/VVC

Info

Publication number: CN111479110B
Application number: CN202010293694.8A
Authority: CN
Inventors: 张秋闻; 黄立勋; 蒋斌; 王祎菡; 赵进超; 吴庆岗; 常化文; 王晓; 张伟伟; 赵永博; 崔腾耀; 郭睿; 孟颍辉; 李祖贺; 黄伟; 甘勇
Original assignee: Zhengzhou Light Industry Technology Research Institute Co ltd; Zhengzhou University of Light Industry
Current assignee: Zhengzhou Light Industry Technology Research Institute Co ltd; Zhengzhou University of Light Industry
Priority date: 2020-04-15
Filing date: 2020-04-15
Publication date: 2022-12-13
Anticipated expiration: 2040-04-15
Also published as: CN111479110A

Abstract

The invention provides a fast affine motion estimation method for H.266/VVC, comprising the following steps: calculating the texture complexity of the current CU using the standard deviation, and classifying the current CU as a static area or a non-static area according to the texture complexity; for a CU in a static area, skipping affine motion estimation, predicting the current CU directly using motion estimation, and selecting the best prediction direction mode by rate-distortion optimization; and for a CU in a non-static area, classifying the current CU using a trained random forest classifier (RFC) model and outputting the best prediction direction mode. For a CU in a static area, affine motion estimation is skipped, which reduces computational complexity; for a CU in a non-static area, the prediction direction mode is predicted directly by the pre-trained model, which avoids the affine motion estimation calculation and thereby reduces the complexity of the affine motion estimation module.

Description

Rapid affine motion estimation method for H.266/VVC

Technical Field

The invention relates to the technical field of image processing, in particular to a fast affine motion estimation method for H.266/VVC.

Background

In the current information age, demand for video services such as three-dimensional imaging, ultra-high-definition video and virtual reality keeps increasing, and the encoding and transmission of high-definition video have become active research topics. With the development and refinement of the H.266/VVC standard, improvements in video processing efficiency also drive the development of the video industry and lay a foundation for a new generation of video coding technology. High-density data poses great challenges to bandwidth and storage, and the current mainstream video coding standards can no longer meet emerging applications, so the new-generation video coding standard H.266/VVC has emerged to meet requirements for video definition, fluency and real-time performance. The international standardization bodies ISO/IEC MPEG and ITU-T VCEG established the Joint Video Experts Team (JVET), which is responsible for developing the next-generation video coding standard H.266/Versatile Video Coding (VVC). H.266/VVC targets high-definition video of 4K and above with a bit depth of mainly 10 bits, which differs from the positioning of H.265/HEVC. As a result, the maximum block size of the encoder becomes 128, the pixels processed inside the encoder are all 10 bits, and even an 8-bit input sequence is converted to 10 bits for processing.

H.266/VVC uses a hybrid coding framework. Picture partitioning has evolved from a single, fixed partitioning to diverse and flexible partition structures, which adapt more efficiently to the encoding and decoding of high-resolution pictures. In addition, for new-generation video data, H.266/VVC extends elements of the original H.265/HEVC encoder such as intra and inter prediction, prediction signal filtering, transform, quantization/scaling and entropy coding, and adds new prediction modes that take the characteristics of the new standard into account. In particular, H.266/VVC adopts the motion estimation, motion compensation and motion vector prediction techniques of H.265/HEVC inter coding and introduces new techniques on that basis. For example, the Merge mode is extended, history-based motion vector prediction is added, and new prediction methods such as affine transformation, adaptive motion vector resolution and motion-compensated prediction with 1/16-sample precision are added. These advanced coding tools greatly improve the coding efficiency of H.266/VVC, but, because of the rate-distortion cost calculations they require, they also significantly increase the computational complexity of H.266/VVC inter coding and therefore significantly reduce the encoding speed.

The main principle of inter prediction is to find, for each pixel block of the current picture, a best matching block in a previously coded picture. This process is called motion estimation (ME). The picture used for prediction is called the reference picture, the best matching block in the reference picture is called the reference block, the displacement from the reference block to the current pixel block is called the motion vector (MV), and the difference between the current pixel block and the reference block is called the prediction residual. ME is the most critical algorithm in H.266/VVC video coding: it accounts for more than half of the computation and most of the run time of the whole encoder, and is the dominant factor determining video compression efficiency. Because it effectively removes temporal redundancy between successive pictures, ME has long been a research focus in video compression. To improve compression efficiency, recent video codecs attempt to estimate the motion of blocks of different shapes and sizes. Furthermore, with the addition of the multi-type tree (MTT), ME can be performed on very thin blocks (e.g., width one-eighth of the height). Therefore, among the modules operating under the MTT, ME is the tool with the highest coding complexity in VVC. Because more advanced inter prediction schemes are performed recursively in the finely partitioned MTT blocks, the computational complexity of ME increases even further relative to HEVC, since ME also tries new techniques such as affine motion estimation introduced for Future Video Coding (FVC). Affine motion estimation (AME) captures non-translational motion such as rotation and scaling, and improves rate-distortion (RD) performance at the cost of high encoding complexity. AME accounts for a large part of the overall ME processing time, so reducing its complexity is very important: to reduce the complexity of the VTM encoder, the AME module needs to be accelerated.

In fact, much research has addressed the high complexity of H.265/HEVC inter prediction. Xiong et al. propose a fast CU selection algorithm based on pyramid motion divergence that can skip individual inter CUs in H.265/HEVC early. L. Shen et al. propose an adaptive inter mode decision algorithm for H.265/HEVC that combines inter-layer and spatio-temporal correlations, consisting of an early SKIP mode decision based on statistical analysis, a mode decision based on prediction size correlation and a mode decision based on rate-distortion (RD) cost correlation. Lee et al. propose an early SKIP mode decision method that uses distortion characteristics after the RD cost of the 2N×2N mode has been calculated. Zhang et al. exploit the high correlation between texture video and depth map content to decide the coding unit depth level and mode adaptively in advance, reducing the computational complexity of video coding. Hu et al. propose a fast inter mode decision algorithm based on the Neyman-Pearson rule, including an early SKIP mode decision and a fast CU size decision, to reduce H.265/HEVC complexity. Pan et al. propose a fast reference frame selection algorithm based on content similarity to reduce the computational complexity of inter prediction with multiple reference frames. Based on the correlation of the best motion vector selection among prediction modes of different sizes, Z. Pan et al. propose a fast ME method to reduce the coding complexity of the H.265/HEVC encoder. Zhang et al. propose a two-stage fast inter CU decision method based on a Bayesian method and a conditional random field to reduce the encoding complexity of the HEVC encoder. T. S. Kim et al. propose a fast motion estimation algorithm for HEVC that supports a highly flexible block partition structure; by searching for multiple accurate motion vectors to predict the surrounding narrow region, the algorithm greatly reduces computational complexity. Ma et al. propose a neural-network-based arithmetic coding method to encode inter prediction information in HEVC. Shen et al. propose a fast mode decision algorithm to reduce the computational complexity of the encoder; the method optimizes the encoder using three adjustable parameters, namely the conditional probability of the SKIP/Merge mode, the motion characteristics and the mode complexity. Wang et al. propose fast depth level and inter mode prediction algorithms that use inter-layer correlation, spatial correlation and their degree of correlation to speed up HEVC inter coding. These fast inter methods preserve coding performance while effectively reducing the computational complexity of H.265/HEVC. However, they were not designed for the H.266/VVC encoder, which uses new inter prediction techniques such as more advanced affine motion compensated prediction, the extended Merge mode, adaptive motion vector resolution and the triangular partition mode. As a result, the spatial and inter-layer correlations of H.266/VVC inter prediction differ considerably from those of H.265/HEVC, and low-complexity inter coding methods for H.266/VVC need to be studied anew.

A few studies have addressed the high complexity of H.266/VVC inter coding. S. Park et al. propose a method that effectively limits the reference frame search range for normal motion estimation and affine motion estimation, mainly by exploiting the dependencies within the H.266/VVC prediction structure. The method reduces encoding complexity by minimizing the maximum reference frame search range of a CU based on the prediction information of its parent node. Wang et al. propose a confidence-interval-based early termination scheme for the quadtree plus binary tree (QTBT) partition structure and establish an RD model based on the motion divergence field to estimate the RD cost of each partition mode; based on this model, H.266/VVC block partitioning is terminated early to eliminate unnecessary partitioning iterations and achieve a good balance between the encoding performance and the encoding complexity of H.266/VVC. Wang et al. also propose a fast QTBT partition decision algorithm based on a convolutional neural network (CNN) for H.266/VVC inter coding, which analyzes the QTBT statistically to design the CNN architecture and uses temporal correlation to control the risk of erroneous prediction, improving the robustness of the CNN scheme. Garcia-Lucas et al. propose a pre-analysis algorithm that extracts frame motion information, which is used in the motion estimation module to speed up the H.266/VVC encoder. S. Park et al. propose a fast H.266/VVC inter coding method that effectively reduces the coding complexity of affine motion estimation in VTM when the multi-type tree MTT is used. The method comprises two processes: an early termination scheme and a reduction of the number of reference frames for affine motion estimation. Gao et al. propose a low-complexity decoder-side motion vector refinement scheme that refines the initial motion vector MV from the Merge mode by searching for the block with the lowest matching cost in the previously decoded reference picture, and that is added to the bilateral-matching-based decoder-side motion vector refinement method. N. Tang et al. propose a fast block partitioning algorithm for H.266/VVC inter coding that uses the difference of three frames to judge whether the current block is a static object; when the current block is static, no further splitting is needed, so partitioning is terminated early to improve inter coding speed. However, little work has been done to reduce the complexity of affine motion estimation AME in VVC. For VTM there is still much room to further reduce the ME complexity in the multi-type tree MTT structure, especially in affine motion estimation AME.

Disclosure of Invention

In view of the shortcomings described in the background, the invention provides a fast affine motion estimation method for H.266/VVC, which solves the technical problem of the high encoding complexity of affine motion estimation AME in VTM.

The technical scheme of the invention is realized as follows:

A fast affine motion estimation method for H.266/VVC comprises the following steps:

S1, calculating the texture complexity SD of the current CU using the standard deviation, and classifying the current CU as a static area or a non-static area according to the texture complexity SD;

S2, for a CU in a static area, skipping affine motion estimation AME, predicting the current CU directly using motion estimation CME, and selecting the best prediction direction mode by rate-distortion optimization;

S3, for a CU in a non-static area, classifying the current CU using the trained random forest classifier RFC model, and outputting the best prediction direction mode.
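The control flow of steps S1 to S3 can be sketched as below. This is a minimal illustration, not the VTM implementation: `choose_prediction_mode`, `cme_rdo` and `rfc_predict` are hypothetical names, the two callables stand in for the CME-plus-RDO search and the trained RFC model, and treating a CU whose SD equals the threshold as non-static is an assumption (the text only specifies the strictly smaller and strictly greater cases).

```python
from typing import Callable, Sequence


def choose_prediction_mode(
    sd: float,
    th_static: float,
    cme_rdo: Callable[[], str],
    rfc_predict: Callable[[Sequence[float]], str],
    features: Sequence[float],
) -> str:
    """Return the prediction direction mode for one CU following S1-S3.

    sd        : texture complexity (standard deviation) of the CU
    th_static : static/non-static threshold
    cme_rdo   : hypothetical callable running CME + RDO, returns a mode name
    rfc_predict : hypothetical callable wrapping the trained classifier
    """
    if sd < th_static:
        # S2: static area, AME is skipped and CME + RDO decides the mode.
        return cme_rdo()
    # S3: non-static area, the classifier outputs the mode directly.
    # Assumption: sd == th_static is handled as non-static.
    return rfc_predict(features)
```

For example, `choose_prediction_mode(1.0, 2.0, lambda: "Uni-L0", lambda f: "Bi", [0.0])` takes the static branch and returns whatever the CME search reports.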

The texture complexity SD of the current CU is calculated using the standard deviation as follows:

$$SD = \sqrt{\frac{1}{W \times H}\sum_{a=1}^{W}\sum_{b=1}^{H}\left(P(a,b) - \frac{1}{W \times H}\sum_{a=1}^{W}\sum_{b=1}^{H}P(a,b)\right)^{2}}$$

where W represents the width of the CU, H represents the height of the CU, and P(a, b) represents the pixel value at position (a, b) in the CU.

The method of predicting the current CU using motion estimation CME and selecting the best prediction direction mode by rate-distortion optimization comprises the following steps:

S21, performing on the current CU first unidirectional prediction Uni-L0, then unidirectional prediction Uni-L1, and finally bidirectional prediction Bi;

S22, calculating, using rate-distortion optimization, the rate-distortion cost of each of the unidirectional prediction Uni-L0, the unidirectional prediction Uni-L1 and the bidirectional prediction Bi performed in step S21;

S23, taking the prediction mode with the minimum rate-distortion cost as the best prediction direction mode.

The rate-distortion costs of unidirectional prediction Uni-L0, unidirectional prediction Uni-L1 and bidirectional prediction Bi are evaluated over the reference frames of the available reference lists, where L0 and L1 denote the two reference frame lists, φ denotes a reference frame in a reference list, and J(·) is the rate-distortion cost function

$$J(\cdot) = D(\cdot) + \lambda \cdot R(\cdot)$$

in which D(·) represents the distortion of CU coding, λ represents the Lagrangian multiplier, and R(·) represents the number of bits consumed by CU coding.
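As a small illustration of how the Lagrangian cost combines distortion and rate, and how the mode with the minimum cost is picked in step S23, consider the sketch below. The function names and the numeric distortion/bit values are made up for illustration; a real encoder obtains D and R from the actual prediction and entropy coding.

```python
def rd_cost(distortion: float, bits: float, lam: float) -> float:
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * bits


def best_direction(candidates: dict, lam: float) -> str:
    """Pick the prediction direction mode with the smallest RD cost.

    candidates maps a mode name to a (distortion, bits) pair.
    """
    costs = {mode: rd_cost(d, r, lam) for mode, (d, r) in candidates.items()}
    return min(costs, key=costs.get)


# Illustrative numbers only (not measured data).
example = {"Uni-L0": (100.0, 10.0), "Uni-L1": (90.0, 30.0), "Bi": (80.0, 25.0)}
```

With `lam = 1.0` the costs are 110, 120 and 105, so `best_direction(example, 1.0)` selects `"Bi"`; with `lam = 4.0` the bit cost dominates and `"Uni-L0"` wins, which shows how λ trades rate against distortion.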

The training method of the RFC model of the random forest classifier in the step S3 comprises the following steps:

S31, selecting the Traffic, Kimono, BQSquare, RaceHorsesC and FourPeople video sequences, which have different resolutions, from the common test sequences, encoding the first M frames of each on the VTM, and recording the shape of the CU, the texture complexity of the CU and the three prediction direction modes of the CU in the VTM as a data set, wherein the data set comprises a sample set S and a test set T, and the three prediction direction modes comprise unidirectional prediction Uni-L0, unidirectional prediction Uni-L1 and bidirectional prediction Bi;

S32, resampling the sample set S using the Bootstrap method to generate K training sample sets $\{S_1, S_2, \ldots, S_K\}$; each generated training set $S_i$ is taken as a root node to generate the corresponding decision tree, giving $\{T_1, T_2, \ldots, T_K\}$, where i = 1, 2, …, K denotes the i-th training sample set and K denotes the number of training sample sets;

S33, starting training from the root node, randomly selecting m feature attributes at each intermediate node of the decision tree, calculating the Gini index of each feature attribute, selecting the feature attribute with the minimum Gini index as the best splitting attribute of the current node, using the split with the minimum Gini index as the splitting threshold, and dividing the samples at the node into a left subtree and a right subtree;

S34, repeating step S33 K′ times until all K′ decision trees have been trained, each decision tree being grown fully without pruning;

S35, the generated decision trees form the random forest classifier RFC model, which is used to classify the test set T; the classification result is determined by voting, with the class output by the largest number of the K′ decision trees taken as the class of the test set T, thereby obtaining the best prediction direction mode of the current CU.
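The Bootstrap resampling of step S32 can be sketched as follows. This is a minimal illustration using only the Python standard library; `bootstrap_sets` is a hypothetical helper, and drawing each new set with the same size as the original sample set is an assumption (it is the usual Bagging convention, but the text does not state the size).

```python
import random


def bootstrap_sets(samples, k, seed=0):
    """Draw k bootstrap training sets from `samples`.

    Each set is drawn uniformly at random WITH replacement and has the
    same length as the original sample set (assumed convention).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(samples)
    return [[samples[rng.randrange(n)] for _ in range(n)] for _ in range(k)]


# Ten dummy samples resampled into K = 3 training sets.
sets = bootstrap_sets(list(range(10)), 3)
```

Because sampling is with replacement, some samples appear several times in a set while others are left out, which is what gives each decision tree a different view of the data.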

The method for obtaining the data set in step S31 is as follows:

S31.1, predicting the video sequence using motion estimation CME;

S31.2, performing affine prediction on the video sequence predicted in step S31.1 using a 4-parameter affine motion model, wherein the affine prediction comprises unidirectional prediction Uni-L0, unidirectional prediction Uni-L1 and bidirectional prediction Bi;

S31.3, performing affine prediction on the video sequence affine-predicted in step S31.2 using a 6-parameter affine motion model;

S31.4, calculating the rate-distortion costs after the affine prediction in steps S31.2 and S31.3 respectively, and taking the prediction mode corresponding to the minimum rate-distortion cost as the prediction direction mode of the video sequence.

The feature attributes comprise the two-dimensional Haar wavelet transform horizontal coefficient, the two-dimensional Haar wavelet transform vertical coefficient, the two-dimensional Haar wavelet transform diagonal coefficient, the angular second moment, the contrast, the entropy, the inverse difference moment, the sum of absolute differences and the gradient.

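For the 4-parameter affine motion model, the motion vector at sample position (x, y) in the CU is:

$$\begin{cases} mv_x = \dfrac{mv_{1x} - mv_{0x}}{W}\,x - \dfrac{mv_{1y} - mv_{0y}}{W}\,y + mv_{0x} \\[2ex] mv_y = \dfrac{mv_{1y} - mv_{0y}}{W}\,x + \dfrac{mv_{1x} - mv_{0x}}{W}\,y + mv_{0y} \end{cases}$$

where $(mv_{0x}, mv_{0y})$ is the motion vector of the top-left control point, $(mv_{1x}, mv_{1y})$ is the motion vector of the top-right control point, and W represents the width of the CU;

for the 6-parameter affine motion model, the motion vector at sample position (x, y) in the CU is:

$$\begin{cases} mv_x = \dfrac{mv_{1x} - mv_{0x}}{W}\,x + \dfrac{mv_{2x} - mv_{0x}}{H}\,y + mv_{0x} \\[2ex] mv_y = \dfrac{mv_{1y} - mv_{0y}}{W}\,x + \dfrac{mv_{2y} - mv_{0y}}{H}\,y + mv_{0y} \end{cases}$$

where $(mv_{2x}, mv_{2y})$ is the motion vector of the bottom-left control point, and H represents the height of the CU.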
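The two affine models can be illustrated with the sketch below, which evaluates the motion vector at a sample position from the control-point motion vectors in the usual control-point form used for VVC affine prediction. The function names are hypothetical, and floating-point arithmetic is an assumption made for readability; a real encoder works in fixed-point 1/16-sample units and derives one vector per sub-block rather than per sample.

```python
def affine_mv_4param(x, y, mv0, mv1, w):
    """4-parameter model: mv0 = top-left CPMV, mv1 = top-right CPMV, w = CU width."""
    a = (mv1[0] - mv0[0]) / w  # combined zoom/rotation term along x
    b = (mv1[1] - mv0[1]) / w  # combined zoom/rotation term along y
    return (a * x - b * y + mv0[0], b * x + a * y + mv0[1])


def affine_mv_6param(x, y, mv0, mv1, mv2, w, h):
    """6-parameter model: adds mv2 = bottom-left CPMV and h = CU height."""
    mvx = (mv1[0] - mv0[0]) / w * x + (mv2[0] - mv0[0]) / h * y + mv0[0]
    mvy = (mv1[1] - mv0[1]) / w * x + (mv2[1] - mv0[1]) / h * y + mv0[1]
    return (mvx, mvy)


# Illustrative control-point motion vectors for a 16x8 CU.
mv0, mv1, mv2 = (1.0, 2.0), (3.0, 2.0), (1.0, 6.0)
```

A quick sanity check of either model is that it reproduces the control-point vectors at the corresponding corners: `(0, 0)` gives `mv0`, `(w, 0)` gives `mv1`, and for the 6-parameter model `(0, h)` gives `mv2`. When all control-point vectors are equal, both models reduce to pure translation.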

Beneficial effects of this technical scheme: the method first uses the standard deviation SD to classify a CU as belonging to a static area or a non-static area. If the CU belongs to a static area, the probability of selecting the SKIP mode for inter prediction is high, and a static area that tends to select the SKIP mode does not need affine prediction; the affine motion estimation AME module can therefore be terminated early in the static area, and the best direction mode of the current CU is the best direction mode found by motion estimation CME. If the CU belongs to a non-static area, its inter prediction mode is determined by the random forest classification model, so that the best prediction direction mode is obtained in advance. The invention thus reduces computational complexity and saves encoding time, achieving fast encoding for H.266/VVC.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of the present invention;

FIG. 2 is the complexity distribution of the prediction direction modes of the present invention;

FIG. 3 is a 4-parameter affine model of the present invention;

FIG. 4 is a 6-parameter affine model of the present invention;

FIG. 5 is an overall process diagram of the motion estimation ME of the present invention;

FIG. 6 is a comparison of the overall run time of the method of the present invention and the FAME method.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.

As shown in FIG. 1, an embodiment of the present invention provides a fast affine motion estimation method for H.266/VVC, which includes the following specific steps:

S1, in picture coding, regions with uniform image content are often coded with larger CUs, whereas regions with rich detail are typically coded with smaller CUs. The texture complexity of the coding block is therefore used to determine whether the CU uses the SKIP mode for inter prediction: regions with uniform content tend to be inter-predicted in SKIP mode, while regions with rich detail are unlikely to use it. The variance of a CU represents the dispersion of energy among the pixels of the current block, so the texture complexity of a block can be roughly measured by its standard deviation SD. The texture complexity SD of the current CU is therefore calculated using the standard deviation, and the current CU is classified as belonging to a static area or a non-static area according to SD. The formula for the standard deviation is:

$$SD = \sqrt{\frac{1}{W \times H}\sum_{a=1}^{W}\sum_{b=1}^{H}\left(P(a,b) - \frac{1}{W \times H}\sum_{a=1}^{W}\sum_{b=1}^{H}P(a,b)\right)^{2}}$$

where W represents the width of the CU, H represents the height of the CU, and P(a, b) represents the pixel value at position (a, b) in the CU. Since the texture complexity of the neighboring blocks is correlated with that of the CU, the classification threshold is derived from the texture complexity of the neighboring blocks. According to a large amount of experimental data, it is reasonable to use the minimum of the standard deviations SD of the blocks adjacent to the CU as the threshold $Th_{static}$. The CU is then classified using this threshold: if the current standard deviation SD is less than $Th_{static}$, the current CU is a static area; conversely, if SD is greater than $Th_{static}$, the current CU belongs to a non-static area.
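The standard deviation and the neighbor-derived threshold can be sketched as follows. This is a minimal illustration, assuming a CU is given as a list of pixel rows and that the neighboring blocks are already available as such lists; `cu_std_dev` and `is_static` are hypothetical names, and which neighbors are used (left, above, and so on) is not specified here.

```python
import math


def cu_std_dev(block):
    """Population standard deviation of the pixel values of one block.

    `block` is a list of H rows, each holding W pixel values.
    """
    h, w = len(block), len(block[0])
    n = w * h
    mean = sum(sum(row) for row in block) / n
    variance = sum((p - mean) ** 2 for row in block for p in row) / n
    return math.sqrt(variance)


def is_static(block, neighbour_blocks):
    """True if the block's SD is below Th_static, the minimum neighbor SD."""
    th_static = min(cu_std_dev(nb) for nb in neighbour_blocks)
    return cu_std_dev(block) < th_static


flat = [[5, 5], [5, 5]]          # uniform content, SD = 0
textured = [[0, 4], [0, 4]]      # SD = 2
neighbours = [[[0, 2], [0, 2]], [[0, 4], [0, 4]]]  # SDs 1 and 2, so Th_static = 1
```

Here `is_static(flat, neighbours)` is true because 0 < 1, while `is_static(textured, neighbours)` is false because 2 is not below the threshold of 1.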

S2, existing video coding standards (e.g., H.265/HEVC) use motion vectors MV covering only translational motion for motion estimation CME, whereas affine motion estimation AME can predict not only translational motion but also linearly transformed motion such as scaling and rotation. If the video is captured while the camera zooms or rotates, AME predicts motion more accurately than CME. In H.266/VVC, AME starts with unidirectional prediction Uni-L0, followed by unidirectional prediction Uni-L1, and finally bidirectional prediction Bi, as in CME. After the three prediction direction modes have been evaluated, rate-distortion optimization (RDO) is used to select the best prediction direction mode. FIG. 2 shows the complexity distribution of the AME inter prediction modes: unidirectional prediction Uni-L0 requires more prediction time than unidirectional prediction Uni-L1. If the reference frames of Uni-L0 and Uni-L1 are different, the coding complexity of the two unidirectional predictions is the same. Otherwise, if the reference frames are the same, the motion vector MV of Uni-L1 is copied from that of Uni-L0 to avoid redundant AME processing, which is why Uni-L0 consumes more prediction time than Uni-L1. Although the maximum encoding complexity is required in the unidirectional prediction, the probability that the bidirectional prediction mode is the best inter prediction mode is high. In the AME module, calculating the rate-distortion RD cost is a major cause of the high complexity.

To obtain the optimal motion vector MV and the optimal reference frame, the encoder searches the available reference frames, calculates the rate-distortion RD cost J(·) using the Lagrangian multiplier method, and compares the costs of the prediction results. The RD cost function J(·) is calculated as:

$$J(\cdot) = D(\cdot) + \lambda \cdot R(\cdot)$$

where D(·) represents the distortion of CU coding, λ represents the Lagrangian multiplier, and R(·) represents the number of bits consumed by CU coding. Since two reference frame lists, used by unidirectional prediction Uni-L0 and unidirectional prediction Uni-L1 respectively, are available for motion prediction, the motion estimation ME process for unidirectional prediction must be tested with both lists, covering all available frames in both lists.

For a CU in a static area, affine motion estimation AME is skipped, the current CU is predicted directly using motion estimation CME, and the best prediction direction mode is selected by rate-distortion optimization. The specific method is as follows:

S21, the current CU first undergoes unidirectional prediction Uni-L0, then unidirectional prediction Uni-L1, and finally bidirectional prediction Bi;

S22, the rate-distortion costs of unidirectional prediction Uni-L0, unidirectional prediction Uni-L1 and bidirectional prediction Bi are calculated using rate-distortion optimization. These costs are evaluated over the reference frames of the available reference lists, where L0 and L1 denote the two reference frame lists, φ denotes a reference frame in a reference list, and J(·) is the rate-distortion cost function.

S3, for a CU in a non-static area, the condition for skipping the affine motion estimation AME process is not met; the current CU is classified using the trained random forest classifier RFC model, which outputs the best prediction direction mode, further reducing the computational complexity. The random forest algorithm generates K bootstrap sample sets by Bootstrap resampling, and the data of each sample set grows into a decision tree. At each node of each tree, m (m ≪ M′) features are randomly drawn from the M′ features based on the random subspace method RSM, and the best attribute among the m feature attributes is selected according to a node-splitting algorithm for branch growth. Finally, the K′ decision trees are combined for majority voting. After the random forest classifier has been generated, the model is tested: each tree in the forest makes its classification decision independently, and the class receiving the most votes is taken as the final decision, expressed as:

$$H(t) = \arg\max_{Y} \sum_{i=1}^{K'} I\left(h_i(t) = Y\right)$$

where H(t) represents the combined classification model, $h_i(t)$ is a single classification tree model, t represents the feature attributes input to the decision tree, Y represents the output variable, and I(·) is the indicator function (its value is 1 when the given classification result occurs and 0 otherwise).
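The majority vote over the individual trees can be sketched as below. This is an illustrative sketch only: each tree is modeled as a plain callable that maps a feature vector to a class label, `forest_vote` is a hypothetical name, and the tie-breaking rule (the label seen first wins) is simply the behavior of `Counter.most_common`, not something the text specifies.

```python
from collections import Counter


def forest_vote(trees, t):
    """Return the class predicted by the largest number of trees.

    trees : iterable of callables, each mapping a feature vector to a label
    t     : feature vector of the CU being classified
    """
    votes = Counter(tree(t) for tree in trees)
    return votes.most_common(1)[0][0]


# Three stand-in "trees" with fixed outputs, for illustration only.
trees = [lambda t: "Bi", lambda t: "Uni-L0", lambda t: "Bi"]
```

With these stand-ins, `forest_vote(trees, [0.1, 0.2])` returns `"Bi"` because two of the three trees vote for it.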

While the CUs are traversed, the features of each CU and its prediction direction mode are recorded without interfering with the normal encoding process. Resampling is performed by the Bagging ensemble method to generate multiple training sets: samples are drawn randomly and uniformly from the original training sample set with replacement, and the drawing is repeated until K new training sample sets are obtained. After the samples have been drawn, the training module of the random forest classifier model is entered. Table 1 shows the training parameter settings used to build the random forest classifier RFC model.

TABLE 1 training parameter configuration

According to the parameters of the table 1, the training method of the random forest classifier RFC model comprises the following steps:

S31, the selection of the sample set is one of the keys to training a classifier. The Traffic, Kimono, BQSquare, RaceHorsesC and FourPeople video sequences, which cover rich texture complexity and have different resolutions, are selected from the common test sequences, and M = 50 frames of each are encoded on the VTM, while the shape of the CU, the texture complexity of the CU and the three prediction direction modes of the CU in the VTM are recorded as a data set, wherein the data set comprises the sample set S = 20 and a test set T = 30, and the three prediction direction modes comprise unidirectional prediction Uni-L0, unidirectional prediction Uni-L1 and bidirectional prediction Bi;

In VTM, affine motion blocks are also predicted in three ways: unidirectional prediction Uni-L0, unidirectional prediction Uni-L1 and bidirectional prediction Bi. In addition, affine prediction includes both the 4-parameter and the 6-parameter affine model. Both unidirectional and bidirectional prediction in the affine motion estimation AME module require the corresponding reference frames, which increases the encoding complexity of the VTM. Counting only the number of reference frames required by each module, the AME process requires twice as many reference frames as the motion estimation CME process. The whole motion estimation ME procedure is shown in FIG. 5. As can be seen from FIG. 5, the data set in step S31 is obtained as follows:

S31.1, predicting the video sequence by using the motion estimation CME, wherein the prediction method is the same as in step S21;

S31.2, carrying out affine prediction on the video sequence predicted in step S31.1 by using a 4-parameter affine motion model, wherein the affine prediction comprises unidirectional prediction Uni-L0, unidirectional prediction Uni-L1 and bidirectional prediction Bi;

As shown in fig. 3, the motion vector at the sample position (x, y) in a CU under the 4-parameter affine motion model is:

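S31.3, carrying out affine prediction on the video sequence subjected to affine prediction in step S31.2 by using a 6-parameter affine motion model;

As shown in fig. 4, the motion vector at the sample position (x, y) in a block under the 6-parameter affine motion model is: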

where (mv_2x, mv_2y) is the motion vector of the control point at the lower-left corner, and H denotes the height of the CU.
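For reference, the two motion models can be written in the form commonly used for affine prediction in VVC. This is a sketch under stated assumptions: (mv_0x, mv_0y) and (mv_1x, mv_1y) are taken to be the motion vectors of the top-left and top-right control points and W the width of the CU, names that the surrounding text does not define; (mv_2x, mv_2y) and H are as described above.

```latex
% 4-parameter affine motion model (two control points)
\begin{aligned}
mv_x(x, y) &= \frac{mv_{1x} - mv_{0x}}{W}\,x - \frac{mv_{1y} - mv_{0y}}{W}\,y + mv_{0x},\\
mv_y(x, y) &= \frac{mv_{1y} - mv_{0y}}{W}\,x + \frac{mv_{1x} - mv_{0x}}{W}\,y + mv_{0y}.
\end{aligned}

% 6-parameter affine motion model (three control points)
\begin{aligned}
mv_x(x, y) &= \frac{mv_{1x} - mv_{0x}}{W}\,x + \frac{mv_{2x} - mv_{0x}}{H}\,y + mv_{0x},\\
mv_y(x, y) &= \frac{mv_{1y} - mv_{0y}}{W}\,x + \frac{mv_{2y} - mv_{0y}}{H}\,y + mv_{0y}.
\end{aligned}
```

In both models the motion vector at (0, 0) equals the top-left control point vector; the 6-parameter model additionally reproduces the lower-left control point vector at (0, H).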

S31.4, respectively calculating the rate-distortion costs after the affine predictions in steps S31.2 and S31.3, and taking the prediction mode corresponding to the minimum rate-distortion cost as the prediction direction mode of the video sequence.
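The minimum-cost selection in step S31.4 can be sketched as follows, using the cost form J = D + λ·R given in the claims. The function names, the candidate distortion and bit values, and the Lagrange multiplier are all hypothetical illustration values, not figures from the experiments.

```python
def rd_cost(distortion, bits, lagrange_multiplier):
    """Rate-distortion cost J = D + lambda * R."""
    return distortion + lagrange_multiplier * bits


def best_mode(candidates, lagrange_multiplier):
    """Return the prediction direction mode with the smallest RD cost.

    `candidates` maps a mode name to a (distortion, bits) pair.
    """
    return min(
        candidates,
        key=lambda mode: rd_cost(*candidates[mode], lagrange_multiplier),
    )


# Hypothetical (distortion, bits) measurements for one CU.
candidates = {"Uni-L0": (1000, 40), "Uni-L1": (1100, 35), "Bi": (900, 60)}
chosen = best_mode(candidates, lagrange_multiplier=10)
```

With these illustrative numbers Bi has the lowest distortion but the highest bit cost, so the multiplier decides which mode is selected.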

S32, each of the K generated training sets is taken as a root node, and the corresponding decision trees {T_1, T_2, ..., T_K} are generated, where i = 1, 2, ..., K indexes the training sample sets and K denotes the number of training sample sets;
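The Bagging resampling that produces these training sets can be sketched as below. It is a minimal illustration, assuming each new set has the same size as the original set; `bagging_sets` and the integer placeholder samples are hypothetical names and data.

```python
import random


def bagging_sets(samples, k, seed=0):
    """Draw k bootstrap training sets, each the same size as `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Sampling with replacement: a sample may appear several times in a
    # set, or not at all.
    return [[samples[rng.randrange(n)] for _ in range(n)] for _ in range(k)]


original = list(range(10))          # placeholder for recorded CU samples
training_sets = bagging_sets(original, k=25)
```

Each of the 25 sets would then seed one decision tree.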

S33, training starts from the root node: at each intermediate node of the decision tree, m feature attributes are randomly selected and the Gini index of each feature attribute is calculated; the feature attribute with the minimum Gini index is selected as the optimal splitting attribute of the current node, and the samples at the node are divided into a left sub-tree and a right sub-tree at the split that gives the minimum Gini index;
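The node split in step S33 can be sketched as follows. This is an illustration under assumptions: candidate thresholds are taken from the observed feature values, the split quality is the size-weighted Gini index of the two children, and the tiny two-feature data set is invented for the example.

```python
import random


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(rows, labels, feature_ids):
    """Return (weighted_gini, feature, threshold) of the best split."""
    best = None
    n = len(rows)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


# Invented samples: two features per CU and a prediction direction label.
rows = [[1.0, 5.0], [2.0, 4.0], [8.0, 5.0], [9.0, 4.0]]
labels = ["Uni-L0", "Uni-L0", "Bi", "Bi"]
m = 2  # number of randomly selected feature attributes at this node
feature_ids = random.Random(0).sample(range(2), m)
split = best_split(rows, labels, feature_ids)
```

Here feature 0 separates the two classes perfectly at threshold 2.0, so its weighted Gini index is zero and it is chosen.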

The effectiveness of machine learning is highly correlated with the diversity and relevance of the training data set. Although the random forest classifier RFC can process very high-dimensional feature data, selecting truly relevant feature vectors allows the classification model to generalize better. Since the prediction direction mode of a CU is related to the texture, texture direction and motion state of the image, these are used as the classification basis, i.e. the feature vector of the random forest classifier model. The feature attributes selected by the invention are the two-dimensional Haar wavelet transform horizontal coefficient (HL), the two-dimensional Haar wavelet transform vertical coefficient (LH), the two-dimensional Haar wavelet transform angle coefficient (HH), the angular second moment (ASM), the contrast (CON), the entropy (ENT), the inverse difference moment (IDM), the sum of absolute differences (SAD) and the gradient. The feature attributes are calculated as follows:

The two-dimensional Haar wavelet transform horizontal coefficient HL of the image represents the texture of the image in the horizontal direction: the larger the value, the richer the texture in the horizontal direction, and the smaller the value, the flatter the texture in the horizontal direction. The two-dimensional Haar wavelet transform vertical coefficient LH represents the texture of the image in the vertical direction: the larger the value, the richer the texture in the vertical direction, and the smaller the value, the flatter the texture in the vertical direction. The two-dimensional Haar wavelet transform angle coefficient HH represents the texture of the image in the diagonal direction: the larger the value, the richer the texture in the 45° direction, and the smaller the value, the flatter the texture in the 45° direction. The coefficients HL, LH and HH are respectively expressed as:

where W represents the width of the CU, H represents the height of the CU, and P (a, b) represents the pixel value at position (a, b).
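As an illustration of how such directional texture measures can be obtained, the sketch below applies a one-level two-dimensional Haar transform to non-overlapping 2x2 groups of samples and sums the absolute detail coefficients per direction. The normalisation, the aggregation by summation, the indexing `block[b][a]` and the correspondence of the three outputs to HL, LH and HH are assumptions, not the exact expressions of the invention.

```python
def haar_detail_energy(block):
    """One-level 2-D Haar transform; summed |detail| per direction."""
    height, width = len(block), len(block[0])
    if height % 2 or width % 2:
        raise ValueError("block width and height must be even")
    energy = {"horizontal": 0.0, "vertical": 0.0, "diagonal": 0.0}
    for b in range(0, height, 2):
        for a in range(0, width, 2):
            tl, tr = block[b][a], block[b][a + 1]
            bl, br = block[b + 1][a], block[b + 1][a + 1]
            # Variation between left and right samples.
            energy["horizontal"] += abs(tl - tr + bl - br) / 2
            # Variation between upper and lower samples.
            energy["vertical"] += abs(tl + tr - bl - br) / 2
            # Variation along the diagonals.
            energy["diagonal"] += abs(tl - tr - bl + br) / 2
    return energy


# Vertical stripes: sample values change only from left to right.
stripes = [[0, 4, 0, 4] for _ in range(4)]
stripe_energy = haar_detail_energy(stripes)
```

For the striped block only the left-to-right component is non-zero, while a flat block yields zero in all three directions.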

The angular second moment ASM reflects the uniformity of the gray-level distribution and the coarseness of the texture; the larger the value, the more uniform the texture distribution of the image. The contrast CON reflects the depth of the image texture; the larger the value, the deeper the texture. The entropy ENT represents the amount of information in the image; the larger the value, the larger the amount of information. The inverse difference moment IDM reflects the amount of local texture variation; a larger value indicates that the texture is uniform across the regions of the image and varies slowly. The angular second moment ASM, the contrast CON, the entropy ENT and the inverse difference moment IDM are respectively expressed as:
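These four measures are conventionally computed from a gray-level co-occurrence matrix. The sketch below uses the usual textbook definitions and is an assumption about the intended form: the horizontal neighbour offset of one sample, the quantisation to 8 gray levels and the function name `glcm_features` are illustrative choices.

```python
import math


def glcm_features(block, levels=8, max_value=255):
    """ASM, CON, ENT and IDM from a horizontal co-occurrence matrix."""
    # Quantise sample values to a small number of gray levels.
    q = [[min(p * levels // (max_value + 1), levels - 1) for p in row]
         for row in block]
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for row in q:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
            total += 1
    if total == 0:
        raise ValueError("block must be at least two samples wide")
    asm = con = ent = idm = 0.0
    for i in range(levels):
        for j in range(levels):
            p = counts[i][j] / total
            if p == 0.0:
                continue
            asm += p * p                    # angular second moment
            con += (i - j) ** 2 * p         # contrast
            ent -= p * math.log2(p)         # entropy
            idm += p / (1 + (i - j) ** 2)   # inverse difference moment
    return {"ASM": asm, "CON": con, "ENT": ent, "IDM": idm}


flat = [[128] * 4 for _ in range(4)]
flat_features = glcm_features(flat)
checker = [[0, 255, 0, 255] for _ in range(4)]
checker_features = glcm_features(checker)
```

A flat block gives the largest ASM and IDM with zero contrast and entropy, while the alternating block gives high contrast and a low IDM, matching the qualitative description above. For 10-bit content `max_value` would be 1023.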

In block-matching-based motion estimation algorithms there are many decision criteria for the best matching block; here the sum of absolute differences SAD is used. A smaller SAD indicates that the reference block is closer to the current prediction block. The SAD is expressed as:

where P_k(a, b) is the value of the current pixel, (a, b) are the coordinates of the current pixel, P_{k-1}(a + i', b + j') is the reference pixel value, and (a + i', b + j') are the coordinates of the reference pixel.
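The SAD between a block of the current frame and its displaced counterpart in the reference frame can be sketched as below. The indexing `frame[b][a]` (row b, column a), the argument names and the small synthetic frame are assumptions made for the example.

```python
def sad(current, reference, a0, b0, width, height, di, dj):
    """Sum of absolute differences for the block at (a0, b0).

    (di, dj) is the displacement (i', j') applied in the reference frame.
    """
    total = 0
    for b in range(b0, b0 + height):
        for a in range(a0, a0 + width):
            total += abs(current[b][a] - reference[b + dj][a + di])
    return total


# Synthetic 4x4 frame used as both current and reference frame.
frame = [[4 * r + c for c in range(4)] for r in range(4)]
sad_zero = sad(frame, frame, 0, 0, 2, 2, 0, 0)
sad_right = sad(frame, frame, 0, 0, 2, 2, 1, 0)
sad_down = sad(frame, frame, 0, 0, 2, 2, 0, 1)
```

The zero displacement gives the smallest value, which is the displacement a block-matching search would retain.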

The gradient represents the texture direction of the CU, using the gradient of the horizontal and vertical directions of the luminance sample as a characteristic property. The gradients in the horizontal and vertical directions are expressed as:

G_x(a, b) = P(a+1, b) - P(a, b) + P(a+1, b+1) - P(a, b+1),

G_y(a, b) = P(a, b) - P(a, b+1) + P(a+1, b) - P(a+1, b+1),

where G_x(a, b) and G_y(a, b) respectively represent the gradient components of the current pixel in the horizontal and vertical directions, (a, b) represents the coordinates of the pixel, and P(a, b) represents the pixel value.
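The two gradient expressions can be evaluated directly, as in the following sketch. It assumes the luminance block is stored as `block[b][a]`, so that P(a, b) is row b and column a; the function name and the test block are illustrative.

```python
def gradients(block):
    """Return (G_x, G_y) for every position with a right and lower neighbour."""
    height, width = len(block), len(block[0])
    gx = [[0] * (width - 1) for _ in range(height - 1)]
    gy = [[0] * (width - 1) for _ in range(height - 1)]
    for b in range(height - 1):
        for a in range(width - 1):
            p00 = block[b][a]          # P(a, b)
            p10 = block[b][a + 1]      # P(a+1, b)
            p01 = block[b + 1][a]      # P(a, b+1)
            p11 = block[b + 1][a + 1]  # P(a+1, b+1)
            gx[b][a] = p10 - p00 + p11 - p01
            gy[b][a] = p00 - p01 + p10 - p11
    return gx, gy


# Horizontal ramp: values increase from left to right only.
ramp = [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
gx, gy = gradients(ramp)
```

For the ramp the horizontal component is constant and the vertical component is zero, as expected for a texture that varies only horizontally.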

S34, repeating step S33 K' = 25 times until the training of the K' decision trees is completed, with each decision tree allowed to grow fully without pruning;

S35, the generated decision trees together constitute the random forest classifier RFC model. The RFC model is used to classify the test set T by voting: the class output by the largest number of the K' decision trees is taken as the class of the sample, which gives the best prediction direction mode of the current CU and reduces the computational complexity of the affine motion estimation AME module.
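The voting in step S35 amounts to taking the most frequent tree output. A minimal sketch follows; the list of 25 tree outputs is invented for illustration, and the tie-breaking rule (the class seen first wins) is simply the behaviour of `Counter.most_common`, not something the method specifies.

```python
from collections import Counter


def majority_vote(tree_outputs):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_outputs).most_common(1)[0][0]


# Invented outputs of K' = 25 decision trees for one CU.
votes = ["Bi"] * 12 + ["Uni-L0"] * 8 + ["Uni-L1"] * 5
predicted_mode = majority_vote(votes)
```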

To evaluate the method of the present invention, simulation tests were performed on the latest H.266/VVC encoder (VTM 7.0). The test video sequences are encoded in the "Random Access" configuration using default parameters. The BDBR reflects the compression performance of the present invention, and the reduction in time represents the reduction in complexity. Table 2 shows the coding performance of the present invention: on average, the total encoding time is reduced to 87% and the affine motion estimation AME time is reduced to 56%. The invention therefore saves encoding time effectively, with negligible loss of RD performance.

TABLE 2 coding characteristics of the invention

Table 2 shows the RD performance of the present invention and the encoding run time it saves compared with VTM. The experimental results may fluctuate for different test videos, but the method proposed by the present invention remains effective. Compared with VTM, the invention effectively reduces the complexity of the affine motion estimation AME module while maintaining good RD performance.

The affine motion estimation AME module time is measured under different quantization parameters (QPs). When the quantization parameter QP is 22, fig. 6 shows that the affine motion estimation AME module time for all video sequences totals about 36 hours, whereas with the method of the invention the AME module time is reduced by about 9 hours. The trend is similar for the other QPs. Fig. 6 thus shows more intuitively that the proposed method reduces the encoding time of the affine motion estimation AME module and thereby the computational complexity.

The technical scheme of the invention is described in detail in combination with the drawings, and the technical scheme of the invention provides a fast affine motion estimation method for H.266/VVC, so that the AME encoding complexity of affine motion estimation in VTM is effectively reduced. Firstly, a CU is divided into a static area and a non-static area by using a standard deviation SD, if the CU belongs to the static area, the probability of selecting a SKIP mode for inter-frame prediction is high, and the static area which tends to select the SKIP mode for inter-frame prediction does not need to be subjected to affine prediction, so that an affine motion estimation AME module can be terminated in the static area in advance, and the optimal direction mode of the current CU is the optimal direction mode of motion estimation CME. And if the CU belongs to the non-static area, judging the inter-frame prediction mode of the CU according to the random forest classification model, and finally obtaining the optimal prediction direction mode in advance. Therefore, the invention reduces the calculation complexity and saves the encoding time, thereby realizing the fast encoding of H.266/VVC.
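The overall decision flow summarised above can be sketched as follows. This is an illustration under assumptions: the texture complexity is the population standard deviation of the CU samples, a CU whose SD falls below a threshold is treated as static, and `threshold`, `cme_best_mode` and `rfc_predict` are hypothetical placeholders for the threshold value, the rate-distortion search of the motion estimation CME and the trained classifier.

```python
import math


def texture_sd(block):
    """Standard deviation of the sample values of a CU."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def best_direction_mode(block, threshold, cme_best_mode, rfc_predict):
    """Skip AME for static CUs; let the classifier decide otherwise."""
    if texture_sd(block) < threshold:
        # Static area: the best mode of the motion estimation CME is kept.
        return cme_best_mode(block)
    # Non-static area: the trained classifier outputs the mode directly.
    return rfc_predict(block)


flat_cu = [[100, 100], [100, 100]]
textured_cu = [[0, 2], [0, 2]]
mode_flat = best_direction_mode(
    flat_cu, 0.5, lambda b: "Uni-L0", lambda b: "Bi")
mode_textured = best_direction_mode(
    textured_cu, 0.5, lambda b: "Uni-L0", lambda b: "Bi")
```

In both branches the affine search itself is avoided, which is where the reduction in encoding time comes from.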

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims

1. A fast affine motion estimation method for H.266/VVC is characterized by comprising the following steps:

2. The method of claim 1, wherein the calculating the texture complexity SD of the current CU using the standard deviation is:

where W represents the width of the CU, H represents the height of the CU, and P (a, b) represents the pixel value at position (a, b) in the CU.

3. The fast affine motion estimation method for h.266/VVC as claimed in claim 1, wherein the method of predicting the current CU by using motion estimation CME and selecting the best prediction direction mode by rate distortion optimization is:

S21, firstly performing unidirectional prediction Uni-L0 on the current CU, then performing unidirectional prediction Uni-L1, and finally performing bidirectional prediction Bi;

4. The method of claim 3, wherein the rate-distortion cost of Uni-directional prediction Uni-L0 and Uni-directional prediction Uni-L1 are calculated by the following methods:

the method for calculating the rate distortion cost of the Bi-directional prediction Bi comprises the following steps:

where Γ (Φ) represents a set of all available reference lists, Φ represents a set of reference lists, L0 and L1 represent two reference frame lists, Φ (J) represents a reference frame in a reference list, J (-) is a rate distortion cost function, and J (Φ (J)) = D (Φ (J)) + λ · R (Φ (J)), D (-) represents a distortion level of CU coding, λ represents a lagrange multiplier, and R (-) represents a number of bits consumed by CU coding.

5. The fast affine motion estimation method for h.266/VVC according to claim 1, wherein the training method of the RFC model of the random forest classifier in the step S3 is as follows:

Each of the K generated training sets is taken as a root node, and the corresponding decision trees {T_1, T_2, ..., T_K} are generated, where i = 1, 2, ..., K indexes the training sample sets and K denotes the number of training sample sets;

6. The fast affine motion estimation method for h.266/VVC according to claim 5, wherein the method of obtaining data set in step S31 is:

S31.1, predicting a video sequence by using the motion estimation CME;

S31.3, carrying out affine prediction on the video sequence subjected to affine prediction in step S31.2 by using a 6-parameter affine motion model;

7. The method of claim 5, wherein the feature attributes comprise the two-dimensional Haar wavelet transform horizontal coefficient, the two-dimensional Haar wavelet transform vertical coefficient, the two-dimensional Haar wavelet transform angle coefficient, the angular second moment, the contrast, the entropy, the inverse difference moment, the sum of absolute differences, and the gradient.

8. The fast affine motion estimation method for h.266/VVC as claimed in claim 6, wherein the motion vector of the sample position (x, y) in CU of said 4 parameter affine motion model is: