CN116158079A - Weighted AC prediction for video codec - Google Patents

Weighted AC prediction for video codec

Info

Publication number
CN116158079A
Authority
CN
China
Prior art keywords
high frequency
prediction
block
frequency signal
video
Prior art date
Legal status
Pending
Application number
CN202180059215.9A
Other languages
Chinese (zh)
Inventor
修晓宇
陈伟
郭哲玮
陈漪纹
王祥林
马宗全
朱弘正
于冰
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Publication of CN116158079A

Classifications

    All of the following fall under H04N 19/00 (H04N: pictorial communication, e.g. television; methods or arrangements for coding, decoding, compressing or decompressing digital video signals):
    • H04N 19/105 — Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
    • H04N 19/176 — Adaptive coding characterised by the coding unit, the unit being an image region, the region being a block, e.g. a macroblock
    • H04N 19/46 — Embedding additional information in the video signal during the compression process
    • H04N 19/54 — Motion estimation other than block-based, using feature points or meshes
    • H04N 19/577 — Motion compensation with bidirectional frame interpolation, i.e. using B-pictures

Abstract

A method, apparatus, and non-transitory computer readable storage medium for video decoding in Weighted Alternating Current Prediction (WACP) are provided. The method may include: a plurality of inter prediction blocks are obtained from a plurality of temporal reference pictures associated with a video block. The method may further comprise: a low frequency signal is obtained based on the plurality of inter prediction blocks. The method may further comprise: a plurality of high frequency signals are obtained based on the plurality of inter prediction blocks. The method may further comprise: at least one weight associated with a high frequency signal of at least one of the inter prediction blocks is determined. The method may further comprise: a final prediction signal of the video block is calculated based on a weighted sum of the plurality of high frequency signals and the low frequency signal using at least one weight.

Description

Weighted AC prediction for video codec
Cross Reference to Related Applications
The present application is based on and claims priority from provisional application No. 63/057,290, filed on July 27, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to video codec and compression. More particularly, the present disclosure relates to methods and apparatus for weighted Alternating Current (AC) prediction for video codecs.
Background
Various video codec techniques may be used to compress video data. Video coding is performed according to one or more video codec standards. For example, some well-known video codec standards today include: Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC, also known as H.265 or MPEG-H Part 2), and Advanced Video Coding (AVC, also known as H.264 or MPEG-4 Part 10), which were developed jointly by ISO/IEC MPEG and ITU-T VCEG. The Alliance for Open Media (AOMedia) developed AOMedia Video 1 (AV1) as the successor to its previous standard VP9. Audio Video coding Standard (AVS), which refers to a family of digital audio and digital video compression standards, is another family of video compression standards, developed by the Audio and Video coding Standard Workgroup of China. Most existing video codec standards are built on the well-known hybrid video coding framework, i.e., block-based prediction methods (e.g., inter prediction, intra prediction) are used to reduce the redundancy present in video images or sequences, and transform coding is used to compact the energy of the prediction errors. An important goal of video codec technology is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality.
Disclosure of Invention
Examples of the present disclosure provide methods and apparatus for weighted Alternating Current (AC) prediction for video codec.
According to a first aspect of the present disclosure, a method for video decoding in Weighted Alternating Current Prediction (WACP) is provided. The method may include: a plurality of inter prediction blocks are obtained from a plurality of temporal reference pictures associated with a video block. The method may also obtain a low frequency signal based on the plurality of inter prediction blocks. The method may also obtain a plurality of high frequency signals based on the plurality of inter prediction blocks. At least one of the plurality of high frequency signals is associated with a prediction block. The method may also determine at least one weight associated with a high frequency signal of at least one of the inter prediction blocks. The method may also calculate a final prediction signal for the video block based on a weighted sum of the plurality of high frequency signals and the low frequency signal using at least one weight.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not intended to limit the present disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate examples consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a block diagram of an encoder according to an example of the present disclosure.
Fig. 2 is a block diagram of a decoder according to an example of the present disclosure.
Fig. 3A is a diagram illustrating block partitioning in a multi-type tree structure according to an example of the present disclosure.
Fig. 3B is a diagram illustrating block partitioning in a multi-type tree structure according to an example of the present disclosure.
Fig. 3C is a diagram illustrating block partitioning in a multi-type tree structure according to an example of the present disclosure.
Fig. 3D is a diagram illustrating block partitioning in a multi-type tree structure according to an example of the present disclosure.
Fig. 3E is a diagram illustrating block partitioning in a multi-type tree structure according to an example of the present disclosure.
FIG. 4 is an illustration of bidirectional optical flow (BDOF) according to an example of the present disclosure.
Fig. 5A is an illustration of the 4-parameter affine model according to an example of the present disclosure.
Fig. 5B is an illustration of the 4-parameter affine model according to an example of the present disclosure.
Fig. 6 is an illustration of the 6-parameter affine model according to an example of the present disclosure.
Fig. 7 is an illustration of inheritance of the WACP mode in accordance with an example of the present disclosure.
Fig. 8 illustrates a method for video decoding according to an example of the present disclosure.
Fig. 9 illustrates a method for video decoding according to an example of the present disclosure.
FIG. 10 is a diagram illustrating a computing environment coupled with a user interface according to an example of the present disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which the same numbers in different drawings represent the same or similar elements, unless otherwise indicated. The implementations set forth in the following description of example embodiments are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects related to the present disclosure as set forth in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the present disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term "and/or" as used herein is intended to mean and include any or all possible combinations of one or more of the associated listed items.
It should be understood that, although the terms "first," "second," "third," etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may be referred to as second information without departing from the scope of the present disclosure; and similarly, the second information may also be referred to as the first information. As used herein, the term "if" may be understood to mean "when..once..once..or" responsive to a determination "depending on the context.
The first generation AVS standard includes the Chinese national standard "Advanced Audio Video Coding Part 2: Video" (referred to as AVS1) and "Information Technology Advanced Audio Video Coding Part 16: Radio Television Video" (known as AVS+). It can provide a bit rate saving of about 50% compared with the MPEG-2 standard at the same perceived quality. The AVS1 standard video part was promulgated as a Chinese national standard in February 2006. The second generation AVS standard includes the Chinese national standard "Information Technology Efficient Multimedia Coding" (referred to as AVS2) family, which is primarily targeted at the transmission of ultra-HD TV programs. The coding efficiency of AVS2 is twice that of AVS+. AVS2 was issued as a Chinese national standard in May 2016. Meanwhile, the AVS2 standard video part was submitted to the Institute of Electrical and Electronics Engineers (IEEE) as one international standard for applications. The AVS3 standard is a new generation video codec standard for UHD video applications, which aims to surpass the coding efficiency of the latest international standard HEVC. In March 2019, at the 68th AVS meeting, the AVS3-P2 baseline was completed, providing approximately 30% bit rate savings over the HEVC standard. Currently, there is one reference software, called the High Performance Model (HPM), maintained by the AVS group to demonstrate a reference implementation of the AVS3 standard.
As with HEVC, the AVS3 standard is built on a block-based hybrid video codec framework.
Fig. 1 shows a general diagram of a block-based video encoder for VVC. Specifically, fig. 1 shows a typical encoder 100. Encoder 100 has a video input 110, motion compensation 112, motion estimation 114, intra/inter mode decision 116, block predictor 140, adder 128, transform 130, quantization 132, prediction related information 142, intra prediction 118, picture buffer 120, inverse quantization 134, inverse transform 136, adder 126, memory 124, loop filter 122, entropy coding 138, and bitstream 144.
In encoder 100, a video frame is partitioned into a plurality of video blocks for processing. For each given video block, a prediction is formed based on either an inter prediction method or an intra prediction method.
The prediction residual, representing the difference between the current video block (part of video input 110) and its prediction (part of block predictor 140), is sent from adder 128 to transform 130. The transform coefficients are then sent from the transform 130 to quantization 132 for entropy reduction. The quantized coefficients are then fed into entropy encoding 138 to generate a compressed video bitstream. As shown in fig. 1, prediction related information 142, such as video block partition information, motion vectors (MVs), reference picture indices, and intra prediction modes, from intra/inter mode decision 116 is also fed through entropy encoding 138 and saved into compressed bitstream 144. The compressed bitstream 144 comprises a video bitstream.
In the encoder 100, decoder-related circuitry is also required in order to reconstruct the pixels for prediction purposes. First, the prediction residual is reconstructed by inverse quantization 134 and inverse transform 136. The reconstructed prediction residual is combined with the block predictor 140 to generate unfiltered reconstructed pixels of the current video block.
Spatial prediction (or "intra prediction") predicts a current video block using pixels from samples (which are referred to as reference samples) of neighboring blocks already encoded in the same video frame as the current video block.
Temporal prediction (also referred to as "inter prediction") predicts a current video block using reconstructed pixels from already encoded video pictures. Temporal prediction reduces the inherent temporal redundancy in video signals. The temporal prediction signal for a given Coding Unit (CU) or coding block is typically signaled by one or more MVs, which indicate the amount and direction of motion between the current CU and its temporal reference. Furthermore, if a plurality of reference pictures are supported, a reference picture index for identifying from which reference picture in the reference picture storage the temporal prediction signal originates is additionally transmitted.
Motion estimation 114 takes video input 110 and signals from picture buffer 120 and outputs motion estimation signals to motion compensation 112. Motion compensation 112 takes video input 110, signals from picture buffer 120, and motion estimation signals from motion estimation 114, and outputs motion compensation signals to intra/inter mode decision 116.
After spatial and/or temporal prediction is performed, intra/inter mode decision 116 in encoder 100 selects the best prediction mode, e.g., based on a rate-distortion optimization method. Then, the block predictor 140 is subtracted from the current video block and the resulting prediction residual is decorrelated using transform 130 and quantization 132. The resulting quantized residual coefficients are dequantized by dequantization 134 and inverse transformed by inverse transform 136 to form a reconstructed residual, which is then added back to the prediction block to form a reconstructed signal of the CU. Further loop filtering 122, such as a deblocking filter, a Sample Adaptive Offset (SAO), and/or an Adaptive Loop Filter (ALF), may be applied on the reconstructed CU before the reconstructed CU is placed in a reference picture store of the picture buffer 120 and used to encode future video blocks. To form the output video bitstream 144, the coding mode (inter or intra), prediction mode information, motion information, and quantized residual coefficients are all sent to the entropy encoding unit 138 to be further compressed and packetized to form the bitstream.
Fig. 1 shows a block diagram of a generic block-based hybrid video coding system. The input video signal is processed block by block; each block is called a Coding Unit (CU). Unlike HEVC, which partitions blocks based on quadtrees alone, in AVS3 one Coding Tree Unit (CTU) is split into CUs based on quadtree, binary tree, or extended quadtree structures to adapt to varying local characteristics. Furthermore, the concept of multiple partition unit types in HEVC is removed, i.e., there is no separation of CUs, Prediction Units (PUs), and Transform Units (TUs) in AVS3; instead, each CU is always used as the basic unit for both prediction and transform, without further partitioning. In the tree partition structure of AVS3, one CTU is first partitioned based on the quadtree structure. Each quadtree leaf node may then be further partitioned based on the binary tree and extended quadtree structures.
As shown in fig. 3A, 3B, 3C, 3D, and 3E, there are five split types, namely, quaternary split, horizontal binary split, vertical binary split, horizontal extended quadtree split, and vertical extended quadtree split.
Fig. 3A shows a diagram illustrating block quaternary partitioning according to the present disclosure.
Fig. 3B shows a diagram illustrating block vertical binary partitioning according to the present disclosure.
Fig. 3C shows a diagram illustrating block horizontal binary partitioning according to the present disclosure.
Fig. 3D shows a diagram illustrating block vertical extended quaternary partitioning according to the present disclosure.
Fig. 3E shows a diagram illustrating block horizontal extended quaternary partitioning according to the present disclosure.
In fig. 1, spatial prediction and/or temporal prediction may be performed. Spatial prediction (or "intra prediction") uses pixels from samples (which are referred to as reference samples) of neighboring blocks already encoded in the same video picture/slice to predict the current video block. Spatial prediction reduces the spatial redundancy inherent in video signals. Temporal prediction (also referred to as "inter prediction" or "motion compensated prediction") predicts a current video block using reconstructed pixels from already encoded video pictures. Temporal prediction reduces the inherent temporal redundancy in video signals. The temporal prediction signal for a given CU is typically signaled by one or more Motion Vectors (MVs) that indicate the amount and direction of motion between the current CU and its temporal reference. Furthermore, if a plurality of reference pictures are supported, a reference picture index for identifying from which reference picture in the reference picture storage the temporal prediction signal originates is additionally transmitted. After spatial and/or temporal prediction, a mode decision block in the encoder selects the best prediction mode, e.g. based on a rate-distortion optimization method. Then subtracting the predicted block from the current video block; and the prediction residual is decorrelated using a transform and then quantized. The quantized residual coefficients are inverse quantized and inverse transformed to form reconstructed residuals, which are then added back to the prediction block to form a reconstructed signal of the CU. Further loop filtering, such as deblocking filtering, sample Adaptive Offset (SAO), and Adaptive Loop Filtering (ALF), may be applied on the reconstructed CU before the reconstructed CU is placed in the reference picture store and used as a reference for encoding future video blocks. To form the output video bitstream, the coding mode (inter or intra), prediction mode information, motion information, and quantized residual coefficients are all sent to an entropy encoding unit to be further compressed and packed.
Fig. 2 shows a general block diagram of a video decoder for VVC. Specifically, fig. 2 shows a block diagram of a typical decoder 200. Decoder 200 has a bitstream 210, entropy decoding 212, inverse quantization 214, inverse transform 216, adder 218, intra/inter mode selection 220, intra prediction 222, memory 230, loop filter 228, motion compensation 224, picture buffer 226, prediction related information 234, and video output 232.
The decoder 200 is similar to the reconstruction-related portion located in the encoder 100 of fig. 1. In decoder 200, an incoming video bitstream 210 is first decoded by entropy decoding 212 to derive quantized coefficient levels and prediction related information. The quantized coefficient levels are then processed through inverse quantization 214 and inverse transform 216 to obtain reconstructed prediction residues. The block predictor mechanism implemented in the intra/inter mode selector 220 is configured to perform intra prediction 222 or motion compensation 224 based on the decoded prediction information. The set of unfiltered reconstructed pixels is obtained by adding the reconstructed prediction residual from inverse transform 216 and the prediction output generated by the block predictor mechanism using adder 218.
The reconstructed block may further pass through a loop filter 228 before being stored in a picture buffer 226, the picture buffer 226 acting as a reference picture store. The reconstructed video in the picture buffer 226 may be sent to drive a display device and used to predict future video blocks. With loop filter 228 turned on, a filtering operation is performed on these reconstructed pixels to derive the final reconstructed video output 232.
Fig. 2 presents a general block diagram of a block-based video decoder. The video bitstream is first entropy decoded at an entropy decoding unit. The coding mode and prediction information are sent to a spatial prediction unit (if intra coded) or a temporal prediction unit (if inter coded) to form a prediction block. The residual transform coefficients are sent to an inverse quantization unit and an inverse transform unit to reconstruct the residual block. The prediction block and the residual block are then added together. The reconstructed block may be further loop filtered before being stored in a reference picture store. The reconstructed video in the reference picture store is then sent out for display and used to predict future video blocks.
In one or more embodiments, a weighted AC prediction (WACP) method is presented to enhance the efficiency of motion compensated prediction. The proposed scheme aims at predicting an Alternating Current (AC) component of a video block from a weighted combination of AC components from one or more of its temporal reference blocks. Because a better prediction can be achieved, the corresponding overhead of signaling AC coefficients can be reduced by the proposed WACP scheme. For ease of description, some existing inter-frame codec techniques closely related to the proposed method in the current VVC and AVS3 standards are briefly summarized below. Then, some defects in the current inter prediction design are analyzed. Finally, details of the proposed WACP scheme are discussed.
Weighted prediction
Weighted Prediction (WP) is a codec tool mainly used to compensate for illumination changes between the current picture and its temporal reference pictures, such as fade-in and fade-out, during the motion compensation stage. WP was first adopted in AVC and is retained in both HEVC and VVC. Specifically, when WP is enabled, one multiplicative weight and one additive offset are signaled in the slice header for each reference picture in each of the L0 and L1 reference lists. For P slices, the prediction of the current block is generated by weighting the prediction samples obtained from a single reference picture. Specifically, let P(i, j) represent the original prediction sample at coordinates (i, j) (i.e., before WP); the final prediction sample is calculated as:
P′(i, j) = w · P(i, j) + o        (1)

where w and o are the WP weight and offset associated with the reference picture of the current block. Similarly, for bi-prediction, the final bi-prediction is calculated as:

P′(i, j) = ((w₀ · P₀(i, j) + o₀) + (w₁ · P₁(i, j) + o₁)) / 2        (2)

where w₀ and o₀, and w₁ and o₁, are the WP weights and offsets associated with the reference pictures in L0 and L1, respectively. In general, WP works effectively for global illumination changes that vary linearly from picture to picture.
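As a minimal illustration of equations (1) and (2), the following Python sketch applies the WP model per block in floating point; it ignores the rounding, clipping, and fixed-point arithmetic of a real codec implementation:

```python
import numpy as np

def wp_uni(pred, w, o):
    # Equation (1): weighted uni-prediction P'(i,j) = w * P(i,j) + o.
    return w * pred + o

def wp_bi(pred0, pred1, w0, o0, w1, o1):
    # Equation (2): average of the two weighted/offset prediction signals.
    return ((w0 * pred0 + o0) + (w1 * pred1 + o1)) / 2.0

# Usage on a toy 2x2 prediction block:
p = np.array([[100.0, 102.0], [98.0, 101.0]])
print(wp_uni(p, 1.25, -8.0))
```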
Bi-prediction using CU-level weights
In the AVC and HEVC standards preceding VVC, when WP is not applied, the bi-directional prediction signal is generated by averaging the uni-directional prediction signals obtained from the two reference pictures. In VVC, a coding tool, bi-prediction with CU-level weights (BCW), is introduced to improve the efficiency of bi-prediction. Specifically, instead of a simple average, bi-prediction in BCW is extended by allowing a weighted average of the two prediction signals, as follows:
P′(i, j) = ((8 − w) · P₀(i, j) + w · P₁(i, j) + 4) >> 3        (3)
In VVC, when the current picture is a low-delay picture, the weight of a BCW codec block is allowed to be selected from a predefined set of weight values w ∈ {−2, 3, 4, 5, 10}, where weight 4 represents the conventional bi-prediction case in which the two uni-directional prediction signals are equally weighted. For non-low-delay pictures, only 3 weights w ∈ {3, 4, 5} are allowed. In general, while there are some design similarities between WP and BCW, the two codec tools target illumination changes at different granularities. However, because interactions between WP and BCW could potentially complicate the VVC design, the two tools are not allowed to be enabled simultaneously. Specifically, when WP is enabled for a slice, the BCW weights of all bi-predictive CUs in that slice are not signaled and are inferred to be 4 (i.e., equal weights are applied).
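A per-sample sketch of the BCW combination in equation (3), using the integer arithmetic shown there (clipping to the valid sample range is omitted):

```python
def bcw_sample(p0: int, p1: int, w: int) -> int:
    # Equation (3): ((8 - w) * P0 + w * P1 + 4) >> 3.
    # w = 4 gives the conventional equal-weight bi-prediction.
    return ((8 - w) * p0 + w * p1 + 4) >> 3

assert bcw_sample(100, 108, 4) == 104   # equal weights: plain average
assert bcw_sample(100, 108, 10) == 110  # w = 10 over-weights P1
```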
Merge mode with motion vector differences (MMVD)
In addition to the conventional merge mode, which derives the motion information of the current block from its spatial/temporal neighbors, the MMVD/UMVE mode is introduced as a special merge mode in both the VVC and AVS3 standards. Specifically, in both VVC and AVS3, the mode is signaled by one MMVD flag at the codec block level. In MMVD mode, the first two candidates in the merge list of the normal merge mode are selected as the two base merge candidates for MMVD. After one base merge candidate is selected and signaled, additional syntax elements are signaled to indicate the motion vector difference (MVD) that is added to the motion of the selected merge candidate. The MMVD syntax elements include a merge candidate flag to select the base merge candidate, a distance index to specify the MVD magnitude, and a direction index to indicate the MVD direction.
In existing MMVD designs, the distance index specifies an MVD magnitude that is defined by a set of predefined offsets from the starting point; the offset is added to either the horizontal or the vertical component of the starting MV (i.e., the MV of the selected base merge candidate). Table 1 shows the MVD offsets used in AVS3.
Table 1. MVD offsets used in AVS3

    Distance IDX                         0     1     2     3     4
    Offset (in units of luma samples)    1/4   1/2   1     2     4
As shown in Table 2, the direction index is used to specify the sign of the signaled MVD. Note that the meaning of the MVD sign may vary depending on the starting MV. When the starting MV is a uni-directional predicted MV, or a bi-directional predicted MV whose two reference pictures both have POCs greater than the POC of the current picture or both less than the POC of the current picture, the signaled sign is the sign of the MVD added to the starting MV. When the starting MV is a bi-predictive MV pointing to two reference pictures where the POC of one picture is greater than that of the current picture and the POC of the other picture is smaller, the signaled sign is applied to the L0 MVD and the opposite of the signaled sign is applied to the L1 MVD.
Table 2. MVD signs specified by the direction index

    Direction IDX    00     01     10     11
    x-axis           +      −      N/A    N/A
    y-axis           N/A    N/A    +      −
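The following sketch reconstructs an MVD from the two signaled indices per Tables 1 and 2; the helper name and the quarter-luma-sample representation are illustrative assumptions, not syntax from the standard:

```python
# Offsets of Table 1, stored in quarter-luma-sample units (1 == 1/4 sample).
MMVD_OFFSETS = [1, 2, 4, 8, 16]
# Table 2: direction index -> (sign on x, sign on y); 0 means "N/A".
MMVD_DIRECTIONS = {0b00: (1, 0), 0b01: (-1, 0), 0b10: (0, 1), 0b11: (0, -1)}

def mmvd_mvd(distance_idx, direction_idx):
    mag = MMVD_OFFSETS[distance_idx]
    sx, sy = MMVD_DIRECTIONS[direction_idx]
    # The result is added to the starting MV of the selected base merge
    # candidate (the L1 sign flip for opposite-side bi-prediction,
    # described above, is omitted here).
    return (sx * mag, sy * mag)
```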
Bi-directional optical flow (BIO, also known as BDOF)
Conventional bi-prediction in video coding is a simple combination of two temporal prediction blocks obtained from the reference pictures. However, due to the trade-off between the signaling cost and the accuracy of the motion vectors, the motion vectors received at the decoder may not be fully accurate; thus, small residual motion may still be observed between the two prediction blocks, which may reduce the efficiency of motion compensated prediction. To address this drawback, the BIO tool is adopted in both the VVC and AVS3 standards to compensate for such motion for every sample within a block. Specifically, when bi-directional prediction is used, BIO is a sample-wise motion refinement performed on top of block-based motion compensated prediction. In existing BIO designs, the derivation of the refined motion vector for each sample in a block is based on the classical optical flow model. Let I^(k)(x, y) be the sample value at coordinates (x, y) of the prediction block derived from reference picture list k (k = 0, 1), and let ∂I^(k)(x, y)/∂x and ∂I^(k)(x, y)/∂y be the horizontal and vertical gradients of the sample. Assuming the optical flow model is valid, the motion refinement (v_x, v_y) at (x, y) can be derived from:

∂I^(k)(x, y)/∂t + v_x · ∂I^(k)(x, y)/∂x + v_y · ∂I^(k)(x, y)/∂y = 0        (4)
in combination with optical flow equation (4) and interpolation of the predicted block along the motion trajectory (as shown in fig. 4), we can obtain the BIO prediction as follows:
Figure BDA0004113644020000104
in FIG. 4, (MV) x0 ,MV y0 ) Sum (MV) x1 ,MV y1 ) Indication for generating two prediction blocks I (0) And I (1) Block-level motion vectors of (a).
Furthermore, the motion refinement (v_x, v_y) at sample location (x, y) is calculated by minimizing the difference Δ between the sample values after motion refinement compensation (i.e., A and B in Fig. 4), as follows:

Δ(x, y) = I^(0)(x, y) − I^(1)(x, y) + v_x · (∂I^(0)(x, y)/∂x + ∂I^(1)(x, y)/∂x) + v_y · (∂I^(0)(x, y)/∂y + ∂I^(1)(x, y)/∂y)        (6)
Furthermore, to ensure the regularity of the derived motion refinement, it is assumed that the motion refinement is consistent within a local region surrounding (x, y); thus, in the current BIO design of AVS3, (v_x, v_y) is derived by minimizing Δ within a 4×4 window Ω around the current sample at (x, y), as follows:

(v_x*, v_y*) = argmin_{(v_x, v_y)} Σ_{(i, j) ∈ Ω} Δ²(i, j)        (7)
as shown in equations (4) and (5), in addition to the block level MC, it is also necessary to derive a gradient (i.e., I (0) And I (1) ) In order to derive local motion refinements and generate the final prediction at that sample point. In AVS3, gradients are calculated by a 2D separable Finite Impulse Response (FIR) filtering process that defines a set of 8-tap filters and is based on block-level motion vectors (e.g., (M in fig. 4 x0 ,MV y0 ) Sum (MV) x1 ,MV y1 ) For example), different filters are applied to derive horizontal and vertical gradients. Table 3 shows the coefficients of the gradient filter used by the BIO.
Table 3. Gradient filters used in BIO

    Fractional position    Gradient filter
    0                      {−4, 11, −39, −1, 41, −14, 8, −2}
    1/4                    {−2, 6, −19, −31, 53, −12, 7, −2}
    1/2                    {0, −1, 0, −50, 50, 0, 1, 0}
    3/4                    {2, −7, 12, −53, 31, 19, −6, 2}
Finally, the BIO is applied only to bi-predicted blocks predicted from two reference blocks from temporally neighboring pictures. Furthermore, the BIO is enabled without requiring additional information to be sent from the encoder to the decoder. Specifically, the BIO is applied to all bi-predictive blocks having both forward and backward predictive signals.
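A per-sample sketch of the BIO prediction in equation (5), taking the two prediction samples, their gradients, and the already-derived motion refinement as inputs (the gradient filtering of Table 3 and the window-based minimization of equation (7) are assumed to have been done elsewhere):

```python
def bio_sample(i0, i1, gx0, gy0, gx1, gy1, vx, vy):
    # Equation (5): refine the plain average of I(0) and I(1) with the
    # per-sample motion (vx, vy) and the gradient differences.
    return 0.5 * (i0 + i1
                  + 0.5 * vx * (gx1 - gx0)
                  + 0.5 * vy * (gy1 - gy0))
```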
Affine pattern
In AVC and HEVC, only the translational motion model is applied for motion compensated prediction, whereas the real world contains many other kinds of motion, e.g., zoom in/out, rotation, perspective motion, and other irregular motion. In VVC, affine motion compensated prediction is applied by signaling one flag for each inter codec block to indicate whether the translational motion model or the affine motion model is applied for inter prediction. In the current VVC design, two affine modes are supported for an affine codec block: the 4-parameter affine mode and the 6-parameter affine mode.
The 4-parameter affine model has the following parameters: two parameters for translational movement in the horizontal and vertical directions, one parameter for scaling motion, and one parameter for rotational motion. The horizontal scaling parameter is equal to the vertical scaling parameter, and the horizontal rotation parameter is equal to the vertical rotation parameter. To achieve a better accommodation of the motion vectors and affine parameters, in VVC those affine parameters are converted into two MVs, also called control point motion vectors (CPMVs). As shown in Figs. 5A and 5B, the affine motion field of the block is described by the two control point MVs (V₀, V₁).
Fig. 5A shows a diagram of a 4-parameter affine model according to the present disclosure.
Fig. 5B shows a diagram of a 4-parameter affine model according to the present disclosure.
Based on the control point motion, the motion field (v_x, v_y) of an affine codec block is described as:

v_x = ((v_1x − v_0x) / w) · x − ((v_1y − v_0y) / w) · y + v_0x
v_y = ((v_1y − v_0y) / w) · x + ((v_1x − v_0x) / w) · y + v_0y        (8)

where (v_0x, v_0y) and (v_1x, v_1y) are the MVs of the two control points and w is the width of the block.
the 6-parameter affine pattern has the following parameters: two parameters for translational movement in the horizontal and vertical directions, one parameter for scaling movement in the horizontal direction and one parameter for rotational movement in the horizontal direction, one parameter for scaling movement in the vertical direction and one parameter for rotational movement in the vertical direction, respectively. The 6-parameter affine motion model is encoded with three MVs at three CPMV.
Fig. 6 shows a diagram of a 6-parameter affine model according to the present disclosure.
As shown in Fig. 6, the three control points of a 6-parameter affine block are located at the top-left, top-right, and bottom-left corners of the block. The motion at the top-left control point is related to the translational motion; the motion at the top-right control point is related to the rotation and scaling motion in the horizontal direction; and the motion at the bottom-left control point is related to the rotation and scaling motion in the vertical direction. Compared with the 4-parameter affine motion model, the rotation and scaling motion in the horizontal direction of the 6-parameter model may differ from those in the vertical direction. Assume (V₀, V₁, V₂) are the MVs of the top-left, top-right, and bottom-left corners of the current block in Fig. 6; then the motion vector (v_x, v_y) at each sample position is derived using the three MVs at the control points as:

v_x = v_0x + ((v_1x − v_0x) / w) · x + ((v_2x − v_0x) / h) · y        (9)
v_y = v_0y + ((v_1y − v_0y) / w) · x + ((v_2y − v_0y) / h) · y        (10)

where w and h are the width and height of the block.
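A sketch of equations (8)-(10), deriving the per-position MV of an affine block from its CPMVs; w and h are the block width and height, and all values are kept floating-point for clarity:

```python
def affine_mv(x, y, cpmvs, w, h):
    (v0x, v0y), (v1x, v1y) = cpmvs[0], cpmvs[1]
    if len(cpmvs) == 2:
        # 4-parameter model, equation (8).
        vx = v0x + (v1x - v0x) / w * x - (v1y - v0y) / w * y
        vy = v0y + (v1y - v0y) / w * x + (v1x - v0x) / w * y
    else:
        # 6-parameter model, equations (9)-(10).
        (v2x, v2y) = cpmvs[2]
        vx = v0x + (v1x - v0x) / w * x + (v2x - v0x) / h * y
        vy = v0y + (v1y - v0y) / w * x + (v2y - v0y) / h * y
    return vx, vy
```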
improvements in video decoding
Transform coding is one of the most important compression techniques and is widely used in all mainstream video codecs. It improves codec efficiency by compacting most of the signal energy into a few low frequency coefficients and distributing the remaining energy over the high frequency coefficients. Therefore, when quantization is applied, the coefficients with the highest energy in a block (i.e., the low frequency coefficients) are finely quantized and allocated more bits, while the low-energy coefficients (i.e., the high frequency coefficients) are coarsely quantized and allocated fewer bits. For this reason, in most scenarios (especially low bit-rate applications), the reconstructed video signal is usually dominated by low frequency information, and some of the high frequency information present in the original video is lost and/or distorted in the reconstructed video signal. Given that the reconstructed video signal is used as a reference for inter prediction, such distorted high frequency information can potentially lead to severe performance degradation for both the current picture and the subsequent pictures predicted from it.
WP and BCW are effective tools to improve the efficiency of motion compensated prediction when there is a global or local illumination variation between different pictures. However, this improvement is achieved by estimating the brightness variation from a linear model (i.e., a multiplicative weight and an additive offset). In practice, the weights and offsets are typically optimized by minimizing the Mean Square Error (MSE) between the current block and its predicted block, i.e.,
(w*, o*) = argmin_{w, o} Σ_{(i, j)} ( S(i, j) − (w · P(i, j) + o) )²
where S(i, j) and P(i, j) represent the samples at coordinates (i, j) in the current block and the prediction block, respectively. Because low frequency information dominates the energy of the reconstruction, WP and BCW can only compensate for the differences between the low frequency components (e.g., the direct current (DC) components) of the current block and its reference blocks; they cannot recover the high frequency information that has been lost from the reference samples.
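How an encoder might solve the least-squares problem above is sketched below; this closed-form fit of (w, o) is an assumption about encoder behavior, not something the text mandates:

```python
import numpy as np

def fit_wp_params(cur, pred):
    # Minimize sum over (i,j) of (S(i,j) - (w * P(i,j) + o))^2
    # via ordinary least squares on the design matrix [P, 1].
    P = pred.astype(np.float64).ravel()
    S = cur.astype(np.float64).ravel()
    A = np.stack([P, np.ones_like(P)], axis=1)
    (w, o), *_ = np.linalg.lstsq(A, S, rcond=None)
    return w, o
```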
The proposed method
In the present disclosure, a weighted AC prediction (WACP) scheme is proposed to improve the prediction efficiency of the AC component of the motion compensation stage. Briefly, in the proposed method, the AC component of a video block is predicted from a weighted combination of AC components from one or more of its temporal reference blocks. Because a better AC prediction can be achieved, the signaling overhead of the AC coefficients is expected to be reduced, and thus the overall motion compensation efficiency is expected to increase when the WACP scheme is applied.
Generic weighted AC prediction
Conceptually, the idea of WACP can be seen as an extension of the well-known multi-hypothesis prediction: the value of the AC component at each sample of the current block is estimated based on a linear combination of the AC components at the co-located samples of multiple motion-compensated prediction blocks. Specifically, the general idea of the proposed WACP can be formulated as:

P′(i, j) = P_DC(i, j) + Σ_{k=0}^{N−1} w_k · P_AC^k(i, j)        (11)

where P_DC(i, j) is the average (i.e., the DC component) of the multiple prediction blocks at coordinates (i, j); P_AC^k(i, j) is the AC component of the k-th prediction block at coordinates (i, j); w_k is the weight applied to the AC component of the k-th prediction block; and N is the total number of hypotheses applied. The values of P_DC(i, j) and P_AC^k(i, j) may be further calculated as:

P_DC(i, j) = (1/N) · Σ_{k=0}^{N−1} P_k(i, j)
P_AC^k(i, j) = P_k(i, j) − P_DC(i, j)        (12)

where P_k(i, j) represents the sample at coordinates (i, j) in the k-th prediction block.
Similar to multi-hypothesis prediction, one fundamental issue of the proposed WACP scheme is how to balance the prediction efficiency gain of using more hypotheses against the overhead required to signal multiple weights: more hypothesis candidates imply more accurate AC prediction, but more bits are then required to encode the weight values, and the required overhead may sometimes exceed the prediction accuracy benefit. In one or more embodiments, it is proposed to signal the number of hypothesis prediction signals applied in the WACP scheme and to let the encoder adaptively select the number that yields the best rate-distortion (R-D) performance. The number of hypotheses applied by WACP may be signaled at various codec levels (e.g., sequence level, picture level, tile/slice level, codec block level, etc.) to provide different trade-offs between codec efficiency and hardware/software implementation cost. In some embodiments, it is proposed to always use a fixed number of hypothesis prediction blocks when the proposed WACP scheme is applied. Without loss of generality, N = 2 will be used as the example to explain the proposed WACP method.
Fig. 8 illustrates a method for video decoding in Weighted Alternating Current Prediction (WACP) according to the present disclosure.
In step 810, a decoder may obtain a plurality of inter prediction blocks from a plurality of temporal reference pictures associated with a video block.
In step 812, the decoder may obtain a low frequency signal based on the plurality of inter prediction blocks.
In step 814, the decoder may obtain a plurality of high frequency signals based on the plurality of inter prediction blocks. At least one of the plurality of high frequency signals is associated with a prediction block. In some embodiments, the decoder may obtain a plurality of high frequency signals, wherein each high frequency signal is associated with one prediction block.
In step 816, the decoder may determine at least one weight associated with the high frequency signal of at least one of the inter prediction blocks. In some embodiments, the decoder may determine a plurality of weights associated with the high frequency signal of each inter prediction block.
In step 818, the decoder may calculate a final prediction signal for the video block based on a weighted sum of the plurality of high frequency signals and the low frequency signal using the at least one weight.
Fig. 9 illustrates a method for video decoding in WACP according to the present disclosure.
In step 910, the decoder may obtain a combined high frequency signal based on a weighted sum of the plurality of high frequency signals. At least one of the plurality of high frequency signals is weighted by a corresponding weight associated with the at least one of the plurality of high frequency signals.
In step 912, the decoder may calculate the final prediction signal of the video block as the sum of the low frequency signal and the combined high frequency signal of the video block.
Bi-directional weighted AC prediction
Bi-directional weighted AC prediction (BD-WACP) is a special case of the general WACP in which the number of motion compensated prediction blocks used is limited to two, i.e., N = 2. Based on equation (11), the bi-prediction sample at coordinates (i, j) can thus be calculated as:
P′(i, j) = P_DC(i, j) + w₀ · P_AC^0(i, j) + w₁ · P_AC^1(i, j)        (13)
where w₀ and w₁ are the weights associated with the AC samples of the prediction signals P₀ and P₁, respectively. As shown in equation (13), when w₀ is equal to w₁, the proposed WACP degrades to conventional bi-prediction.
Assuming that BD-WACP can be adaptively switched at the codec block level based on equation (13), two different weights would need to be signaled for each bi-prediction block, which is expensive in terms of the codec bits that may be generated. To reduce the signaling overhead, an additional constraint may be applied that forces the sum of the two weights to be constant, e.g., w₀ + w₁ = 0, so that only one weight needs to be explicitly signaled. For example, the weight of one high frequency signal may be derived by subtracting the weights of all the other high frequency signals from 1. In another example, the weight of one high frequency signal may be derived by subtracting the weights of all the other high frequency signals from 0.
As shown in equation (13), when BD-WACP is applied, only one single weight w needs to be signaled in the bitstream. However, the weight in equation (13) is assumed to be a floating-point number, which needs to be quantized prior to transmission. Since the error caused by quantization may significantly degrade the WACP performance, the selection of the allowed WACP weights is important. In one specific example, three weights w ∈ {−1/8, 0, 1/8} are proposed for BD-WACP. In another specific example, five weights w ∈ {−6/8, −1/8, 0, 1/8, 6/8} are proposed for BD-WACP. When either of these two methods is applied, the corresponding absolute weight value may be represented by 3 bits. Thus, equation (13) can be rewritten using integer weights as:
P′(i, j) = P_DC(i, j) + ( w_int · (P_AC^0(i, j) − P_AC^1(i, j)) ) >> 3        (14)
where w_int is an integer weight value allowed to be selected from {−1, 0, 1} in the first example and from {−6, −1, 0, 1, 6} in the second example. In another example, the integer weight value may be allowed to be selected from {5, 0, 3}, with the number of bits for the right shift operation fixed to 3. In another embodiment, instead of using a fixed set of WACP weights, it is proposed to signal the allowed weights directly in the bitstream (e.g., in the sequence parameter set, picture parameter set, or slice header). This approach gives the encoder more flexibility to select the desired WACP weights on the fly according to the specific characteristics of the current sequence/picture/slice.
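A per-sample sketch of the fixed-point form in equation (14); since P_AC^0 − P_AC^1 = P₀ − P₁ under the DC definition of equation (12), the AC difference is computed directly from the two prediction samples (the rounding conventions are an assumption):

```python
def bd_wacp_sample(p0: int, p1: int, w_int: int) -> int:
    # Equation (14): DC average plus w_int/8 of the AC difference,
    # with the 1/8 factor realized as a right shift by 3.
    p_dc = (p0 + p1 + 1) >> 1
    return p_dc + ((w_int * (p0 - p1)) >> 3)

assert bd_wacp_sample(120, 104, 0) == 112   # w_int = 0: plain bi-prediction
assert bd_wacp_sample(120, 104, 1) == 114   # boost the AC difference
```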
Inherited BD-WACP mode
In the above method, the selected weight of the BD-WACP mode is explicitly signaled in the bitstream if a codec block is bi-directionally predicted. However, as discussed in the introductory part, the merge mode is supported in both VVC and AVS3, where the motion information of a codec block is not signaled but derived from one of a set of spatial/temporal merge candidates. To reduce the signaling overhead of the BD-WACP weights, this section proposes methods to apply BD-WACP to the merge mode. First, in addition to the motion information (i.e., the reference picture indices and motion vectors), it is proposed to store the associated WACP weight for each bi-predicted block. In this way, the BD-WACP weights can be inherited block by block without signaling. In the existing VVC and AVS3 designs, there are multiple types of merge modes, including the regular merge mode, the inherited affine merge mode, and the constructed affine merge mode.
First, when the current codec block is coded with either the regular merge mode or the inherited affine merge mode, the corresponding WACP weight may be copied directly from the weight of the selected merge candidate (as indicated by the signaled merge index). Fig. 7 illustrates one example of the proposed inheritance scheme for the WACP mode. In Fig. 7, the spatial merge candidate B₂, which is coded in the BD-WACP mode with a weight value of 1, is selected as the merge candidate of the current codec block. In this case, both the BD-WACP weight and the motion information of B₂ are inherited to generate the bi-directional prediction signal of the current block.
Unlike the regular merge mode and the inherited affine merge mode, the motion information of the constructed affine merge mode is generated from the motion information of multiple neighboring blocks, so different methods may be applied to generate the BD-WACP weight of a constructed affine merge block. In the first method, it is proposed to always disable the BD-WACP mode (i.e., force the BD-WACP weight w to 0) when the current block is coded in the constructed affine merge mode. In the second method, it is proposed to set the BD-WACP weight of a constructed affine merge block equal to the BD-WACP weight of the block that generates the first control point motion vector (i.e., at the top-left corner of the current block). In the third method, it is proposed to set the BD-WACP weight of a constructed affine merge block equal to the BD-WACP weight most commonly used among the neighboring blocks; additionally, when there are not enough neighboring blocks coded in the BD-WACP mode, the BD-WACP weight of the current block is set to 0 (i.e., BD-WACP is disabled).
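The inheritance rules of this section, condensed into a sketch; the data structures (merge-candidate objects carrying a `wacp_weight` field) are hypothetical:

```python
def inherit_wacp_weight(merge_type, selected_cand, neighbors):
    if merge_type in ("regular", "inherited_affine"):
        # Copy directly from the selected merge candidate (Fig. 7).
        return selected_cand.wacp_weight
    if merge_type == "constructed_affine":
        # Third method: most common weight among BD-WACP-coded neighbors;
        # fall back to 0 (BD-WACP disabled) when there are too few.
        ws = [n.wacp_weight for n in neighbors if n.is_bd_wacp]
        return max(set(ws), key=ws.count) if ws else 0
    return 0
```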
Coordination of BD-WACP with other inter-frame codec techniques
Coordination between BD-WACP and WP: conceptually, BD-WACP and WP are two codec tools with different styles: the goal of BD-WACP is to compensate for the lost high frequency information at the reference picture, while WP is focused on compensating for the luminance variation (i.e., low frequency information) between the current picture and the reference picture. Thus, there is no obvious conflict preventing the joint use of these two codec tools. Specifically, when WP is turned on, WP parameters (i.e., weights and offsets) are signaled at the picture/band level. At the codec block level, an additional BD-WACP weight may be marked when the current block is bi-directionally predicted. Thus, as one embodiment of the present disclosure, it is proposed to apply BD-WACP and WP together. In detail, in the method, WP is first applied to adjust the luminance amplitude of the prediction block, and the luminance amplitude is combined by WACP to generate the final bi-prediction. Assume that
Assume w_WP^0 and o_WP^0 are the WP weight and offset associated with the L0 reference picture, w_WP^1 and o_WP^1 are those associated with the L1 reference picture, and w_BD-WACP is the BD-WACP weight. When the proposed method is applied, the bi-prediction is generated as:

P_comb = ( (w_WP^0 · P₀ + o_WP^0) + (w_WP^1 · P₁ + o_WP^1) ) / 2 + w_BD-WACP · ( (w_WP^0 · P₀ + o_WP^0) − (w_WP^1 · P₁ + o_WP^1) )        (15)
note that in equation (15), the coordinates (i, j) are excluded from this equation for ease of presentation. Further, for ease of description, the values of all weights and offsets are assumed to be floating. Indeed, the parameter discretization method depicted in equation (14) can be easily applied to equation (15) by a fixed point implementation. In another embodiment, it is proposed to always disable the BD-WACP mode of one bi-predictive codec block when WP is enabled for the picture/slice to which the codec block belongs. In this case, the BD-WACP weight need not be signaled, but is always inferred to be 0.
Coordination between BD-WACP and BCW: like WP, BD-WACP can also be seamlessly combined with BCW mode, as both modes aim to improve different components of the motion compensated prediction signal. Thus, in one or more embodiments, it is proposed to jointly apply BD-WACP and BCW simultaneously for one bi-predictive codec block. Specifically, with this method, BCW is first applied to adjust the local luminance amplitude of the prediction block, which is combined by WACP to generate the final bi-prediction. Let w be BCW Is the applied BCW weight, then bi-prediction is generated as:
P comb =w BCW ·P 0 +(1-w BCW )·P 1 +W BD - WACP ·2·(w BCW ·P 0 -(1-w BCW )·P 1 ) (16)
in some other embodiments, it is proposed that when BCW is enabled for one bi-predictive codec block, the BD-WACP mode of that block is always disabled. In this case, the BD-WACP weight need not be signaled, but is always inferred to be 0.
Coordination between BD-WACP and BDOF: BD-WACP can also be freely combined with BDOF. More specifically, when the two tools are combined, the original prediction signals P₀ and P₁ are still applied to estimate the sample-wise refinement Δ_BDOF, as depicted in the "Bi-directional optical flow" section; Δ_BDOF is then added to the bi-directional prediction signal enhanced by BD-WACP, as follows:

P_comb(i, j) = P_DC(i, j) + w₀ · P_AC^0(i, j) + w₁ · P_AC^1(i, j) + Δ_BDOF(i, j)        (17)
In some embodiments, it is proposed that BDOF is always disabled when BD-WACP is applied to one bi-predictive codec block.
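Floating-point sketches of equations (16) and (17), combining BD-WACP with BCW and with BDOF respectively:

```python
def bcw_plus_bd_wacp(p0, p1, w_bcw, w_wacp):
    # Equation (16): BCW weighted average plus the BD-WACP-scaled
    # AC difference of the BCW-adjusted signals.
    dc = w_bcw * p0 + (1.0 - w_bcw) * p1
    return dc + w_wacp * 2.0 * (w_bcw * p0 - (1.0 - w_bcw) * p1)

def bdof_plus_bd_wacp(p_dc, p_ac0, p_ac1, w0, w1, delta_bdof):
    # Equation (17): the per-sample BDOF refinement added on top of the
    # BD-WACP-enhanced bi-prediction of equation (13).
    return p_dc + w0 * p_ac0 + w1 * p_ac1 + delta_bdof
```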
BD-WACP weight signaling
As described above, for the explicit mode, one BD-WACP weight needs to be signaled from the encoder to the decoder to reconstruct the bi-directional prediction signal of a BD-WACP codec block. To reduce the overhead of signaling those weight values, variable-length codewords should be designed to fit the specific distribution of the weight values of the BD-WACP mode. In general, the BD-WACP weight 0 (i.e., default bi-prediction) is considered the most frequently selected weight and should therefore be assigned the shortest codeword. Weight values with larger absolute values are typically selected less often, since they imply larger modifications to the AC components of the reference blocks; they should therefore be assigned longer codewords. In this spirit, Tables 4 and 5 show two BD-WACP weight binarization methods for the cases where three weights {−1, 0, 1} or five weights {−6, −1, 0, 1, 6} are applied to the BD-WACP mode.
Table 4. Binarization of three BD-WACP weights

    Index    BD-WACP weight    Binarization
    0        −1                11
    1        0                 0
    2        1                 10
Table 5. Binarization of five BD-WACP weights

    Index    BD-WACP weight    Binarization
    0        −6                1111
    1        −1                110
    2        0                 0
    3        1                 10
    4        6                 1110
In practice, other binarization methods may also be applied. For example, the numbers 0 and 1 in tables 4 and 5 may be exchanged based on the same design spirit.
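The codeword assignments of Tables 4 and 5 as lookup tables; the encoder/decoder entry points below are illustrative:

```python
BD_WACP_CODEWORDS_3 = {-1: "11", 0: "0", 1: "10"}                  # Table 4
BD_WACP_CODEWORDS_5 = {-6: "1111", -1: "110", 0: "0",
                       1: "10", 6: "1110"}                          # Table 5

def encode_bd_wacp_weight(w, five_weights=False):
    table = BD_WACP_CODEWORDS_5 if five_weights else BD_WACP_CODEWORDS_3
    return table[w]

def decode_bd_wacp_weight(bits, five_weights=False):
    # Prefix-free decode: accumulate bits until a codeword matches.
    table = BD_WACP_CODEWORDS_5 if five_weights else BD_WACP_CODEWORDS_3
    inv = {v: k for k, v in table.items()}
    buf = ""
    for b in bits:
        buf += b
        if buf in inv:
            return inv[buf]
    raise ValueError("no valid codeword")
```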
FIG. 10 illustrates a computing environment 1010 coupled with a user interface 1060. The computing environment 1010 may be part of a data processing server. The computing environment 1010 includes a processor 1020, memory 1040, and an I/O interface 1050.
The processor 1020 generally controls the overall operation of the computing environment 1010, such as operations associated with display, data acquisition, data communication, and image processing. The processor 1020 may include one or more processors that execute instructions to perform all or some of the steps of the methods described above. Further, the processor 1020 may include one or more modules that facilitate interaction between the processor 1020 and other components. The processor may be a central processing unit (CPU), a microprocessor, a single-chip microcontroller, a GPU, or the like.
The memory 1040 is configured to store various types of data to support the operation of the computing environment 1010. The memory 1040 may include predetermined software 1042. Examples of such data include instructions, video data sets, image data, and the like for any application or method operating on computing environment 1010. The memory 1040 may be implemented using any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
The I/O interface 1050 provides an interface between the processor 1020 and peripheral interface modules, such as a keyboard, click wheel, buttons, and the like. Buttons may include, but are not limited to, a home button, a start scan button, and a stop scan button. The I/O interface 1050 may be coupled with an encoder and a decoder.
In some embodiments, a non-transitory computer-readable storage medium is also provided, including a plurality of programs, for example stored in the memory 1040, that are executable by the processor 1020 in the computing environment 1010 for performing the methods described above. For example, the non-transitory computer-readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The non-transitory computer readable storage medium has a plurality of programs stored therein for execution by a computing device having one or more processors, wherein the plurality of programs, when executed by the one or more processors, cause the computing device to perform the method for motion prediction described above.
In some embodiments, the computing environment 1010 may be implemented with one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods described above.
The description of the present disclosure has been presented for purposes of illustration and is not intended to be exhaustive or to limit the disclosure. Many modifications, variations, and alternative embodiments will come to mind to one skilled in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings.
These examples were chosen and described in order to explain the principles of the present disclosure and to enable others skilled in the art to understand the disclosure for various embodiments and with various modifications as are suited to the particular use contemplated. Therefore, it is to be understood that the scope of the present disclosure is not limited to the specific examples of the disclosed embodiments, and that modifications and other embodiments are intended to be included within the scope of the present disclosure.

Claims (15)

1. A method for video decoding in weighted alternating current prediction, WACP, comprising:
obtaining a plurality of inter prediction blocks from a plurality of temporal reference pictures associated with a video block;
obtaining a low frequency signal based on the plurality of inter prediction blocks;
obtaining a plurality of high frequency signals based on the plurality of inter prediction blocks, wherein at least one of the plurality of high frequency signals is associated with one prediction block;
determining at least one weight associated with a high frequency signal of at least one of the inter prediction blocks; and
calculating a final prediction signal of the video block based on a weighted sum of the plurality of high frequency signals and the low frequency signal using the at least one weight.
2. The method of claim 1, wherein calculating the final prediction signal for the video block comprises:
obtaining a combined high frequency signal based on a weighted sum of the plurality of high frequency signals, wherein at least one of the plurality of high frequency signals is weighted by a corresponding weight associated with the at least one of the plurality of high frequency signals; and
calculating the final prediction signal of the video block as the sum of the low frequency signal and the combined high frequency signal of the video block.
3. The method of claim 1, wherein the low frequency signal comprises direct current, DC, components of the plurality of inter-prediction blocks; and
wherein at least one of the high frequency signals includes an alternating current, AC, component of a corresponding inter prediction block of the plurality of inter prediction blocks.
4. The method of claim 1, further comprising:
receiving, from a bitstream, a total number of inter prediction blocks used to calculate the final prediction signal of the video block.
5. The method of claim 1, wherein determining at least one weight associated with a high frequency signal of at least one of the prediction blocks comprises:
identifying a candidate block from among a plurality of merge candidate blocks when a current block is encoded in a merge mode; and
determining a plurality of weights based on the candidate block.
6. The method of claim 1, wherein determining at least one weight associated with a high frequency signal of at least one of the prediction blocks comprises:
when the video block is not encoded in merge mode, receiving an index identifying a plurality of weights from a set of predefined weights for the block; and
determining a plurality of weights based on the index.
7. The method of claim 6, further comprising:
identifying the weight of one high frequency signal by subtracting the weights of all other high frequency signals from 1.
8. The method of claim 6, further comprising:
identifying the weight of one high frequency signal by subtracting the weights of all other high frequency signals from 0.
9. The method of claim 7, further comprising:
representing the plurality of weights in the predefined set by corresponding integer values that are to be right-shifted by a fixed number of bits.
10. The method of claim 9, wherein the predefined set of weights comprises {5,0,3}, and the fixed number of bits is set to 3.
11. The method of claim 1, wherein a total number of inter prediction blocks of the current block is equal to 2.
12. The method of claim 11, further comprising:
obtaining a sample refinement based on samples of the plurality of inter prediction blocks when the video block is encoded in bidirectional optical flow (BDOF); and
obtaining the final prediction signal based on the low frequency signal, the plurality of high frequency signals, and the sample refinement.
13. The method of claim 12, wherein obtaining the final prediction signal based on the low frequency signal, the plurality of high frequency signals, and the sample refinement comprises:
obtaining a combined high frequency signal based on a weighted sum of the plurality of high frequency signals, wherein at least one of the plurality of high frequency signals is weighted by a corresponding weight associated with the at least one of the plurality of high frequency signals; and
calculating the final prediction signal of the video block as the sum of the low frequency signal, the combined high frequency signal of the video block, and the sample refinement.
14. A computing device, comprising:
one or more processors; and
a non-transitory computer-readable storage medium storing instructions executable by the one or more processors, wherein the one or more processors are configured to perform the method of any of claims 1-13.
15. A non-transitory computer readable storage medium storing a plurality of programs for execution by a computing device having one or more processors, wherein the plurality of programs, when executed by the one or more processors, cause the computing device to perform the method of any of claims 1-13.
CN202180059215.9A 2020-07-27 2021-07-27 Weighted AC prediction for video codec Pending CN116158079A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202063057290P 2020-07-27 2020-07-27
US63/057290 2020-07-27
PCT/US2021/043335 WO2022026480A1 (en) 2020-07-27 2021-07-27 Weighted ac prediction for video coding

Publications (1)

Publication Number Publication Date
CN116158079A (en) 2023-05-23

Family

ID=80036689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180059215.9A Pending CN116158079A (en) 2020-07-27 2021-07-27 Weighted AC prediction for video codec

Country Status (2)

Country Link
CN (1) CN116158079A (en)
WO (1) WO2022026480A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100487988B1 (en) * 1997-10-23 2005-05-09 미쓰비시덴키 가부시키가이샤 Coded bit stream converting apparatus
JP4369857B2 (en) * 2003-12-19 2009-11-25 パナソニック株式会社 Image coding apparatus and image coding method
JP4707118B2 (en) * 2007-03-28 2011-06-22 株式会社Kddi研究所 Intra prediction method for moving picture coding apparatus and moving picture decoding apparatus
US8837582B2 (en) * 2011-06-22 2014-09-16 Blackberry Limited Compressing image data
CN113330747A (en) * 2019-01-14 2021-08-31 交互数字Vc控股公司 Method and apparatus for video encoding and decoding using bi-directional optical flow adaptive to weighted prediction

Also Published As

Publication number Publication date
WO2022026480A1 (en) 2022-02-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination