WO2022026480A1 - Weighted ac prediction for video coding - Google Patents

Weighted ac prediction for video coding Download PDF

Info

Publication number
WO2022026480A1
WO2022026480A1 (PCT/US2021/043335)
Authority
WO
WIPO (PCT)
Prior art keywords
prediction
block
frequency signal
video
signal
Prior art date
Application number
PCT/US2021/043335
Other languages
French (fr)
Inventor
Xiaoyu XIU
Wei Chen
Che-Wei Kuo
Yi-Wen Chen
Xianglin Wang
Tsung-Chuan MA
Hong-Jheng Jhu
Bing Yu
Original Assignee
Beijing Dajia Internet Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co., Ltd. filed Critical Beijing Dajia Internet Information Technology Co., Ltd.
Priority to CN202180059215.9A priority Critical patent/CN116158079A/en
Publication of WO2022026480A1 publication Critical patent/WO2022026480A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • H04N19/105Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/176Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/46Embedding additional information in the video signal during the compression process
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/537Motion estimation other than block-based
    • H04N19/54Motion estimation other than block-based using feature points or meshes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/577Motion compensation with bidirectional frame interpolation, i.e. using B-pictures

Definitions

  • As shown in FIGS. 3A, 3B, 3C, 3D, and 3E, there are five splitting types: quaternary partitioning, horizontal binary partitioning, vertical binary partitioning, horizontal extended quad-tree partitioning, and vertical extended quad-tree partitioning.
  • FIG. 3A shows a diagram illustrating block quaternary partition, in accordance with the present disclosure.
  • FIG. 3B shows a diagram illustrating block vertical binary partition, in accordance with the present disclosure.
  • FIG. 3C shows a diagram illustrating block horizontal binary partition, in accordance with the present disclosure.
  • FIG. 3D shows a diagram illustrating block vertical extended quaternary partition, in accordance with the present disclosure.
  • FIG. 3E shows a diagram illustrating block horizontal extended quaternary partition, in accordance with the present disclosure.
  • spatial prediction and/or temporal prediction may be performed.
  • Spatial prediction (or “intra prediction”) uses pixels from the samples of already coded neighboring blocks (which are called reference samples) in the same video picture/slice to predict the current video block. Spatial prediction reduces spatial redundancy inherent in the video signal.
  • Temporal prediction (also referred to as “inter prediction” or “motion compensated prediction”) uses reconstructed pixels from the already coded video pictures to predict the current video block. Temporal prediction reduces temporal redundancy inherent in the video signal.
  • Temporal prediction signal for a given CU is usually signaled by one or more motion vectors (MVs) which indicate the amount and the direction of motion between the current CU and its temporal reference. Also, if multiple reference pictures are supported, one reference picture index is additionally sent, which is used to identify from which reference picture in the reference picture store the temporal prediction signal comes.
  • the mode decision block in the encoder chooses the best prediction mode, for example based on the rate- distortion optimization method. The prediction block is then subtracted from the current video block; and the prediction residual is de-correlated using transform and then quantized.
  • the quantized residual coefficients are inverse quantized and inverse transformed to form the reconstructed residual, which is then added back to the prediction block to form the reconstructed signal of the CU. Further in-loop filtering, such as deblocking filter, sample adaptive offset (SAO) and adaptive in-loop filter (ALF) may be applied on the reconstructed CU before it is put in the reference picture store and used as reference to code future video blocks.
  • To form the output bitstream, the coding mode (inter or intra), prediction mode information, motion information, and quantized residual coefficients are all sent to the entropy coding unit to be further compressed and packed.
  • FIG. 2 shows a general block diagram of a video decoder for the VVC.
  • FIG. 2 shows a typical decoder 200 block diagram.
  • Decoder 200 has bitstream 210, entropy decoding 212, inverse quantization 214, inverse transform 216, adder 218, intra/inter mode selection 220, intra prediction 222, memory 230, in-loop filter 228, motion compensation 224, picture buffer 226, prediction related info 234, and video output 232.
  • Decoder 200 is similar to the reconstruction-related section residing in the encoder 100 of FIG. 1. In the decoder 200, an incoming video bitstream 210 is first decoded through an Entropy Decoding 212 to derive quantized coefficient levels and prediction-related information.
  • the quantized coefficient levels are then processed through an Inverse Quantization 214 and an Inverse Transform 216 to obtain a reconstructed prediction residual.
  • a block predictor mechanism implemented in an Intra/inter Mode Selector 220, is configured to perform either an Intra Prediction 222 or a Motion Compensation 224, based on decoded prediction information.
  • a set of unfiltered reconstructed pixels is obtained by summing up the reconstructed prediction residual from the Inverse Transform 216 and a predictive output generated by the block predictor mechanism, using a summer 218.
  • the reconstructed block may further go through an In-Loop Filter 228 before it is stored in a Picture Buffer 226, which functions as a reference picture store.
  • the reconstructed video in the Picture Buffer 226 may be sent to drive a display device, as well as used to predict future video blocks.
  • a filtering operation is performed on these reconstructed pixels to derive a final reconstructed Video Output 232.
  • FIG. 2 gives a general block diagram of a block-based video decoder.
  • The video bitstream is first entropy decoded at the entropy decoding unit.
  • The coding mode and prediction information are sent to either the spatial prediction unit (if intra coded) or the temporal prediction unit (if inter coded) to form the prediction block.
  • The residual transform coefficients are sent to the inverse quantization unit and the inverse transform unit to reconstruct the residual block.
  • the prediction block and the residual block are then added together.
  • the reconstructed block may further go through in-loop filtering before it is stored in reference picture store.
  • the reconstructed video in reference picture store is then sent out for display, as well as used to predict future video blocks.
  • one weighted AC prediction (WACP) approach is proposed to enhance the efficiency of motion compensated prediction.
  • The proposed scheme aims at predicting the alternating current (AC) components of one video block from the weighted combination of the AC components of one or more of its temporal reference blocks. Because a better prediction can be achieved, the corresponding overhead of signaling the AC coefficients can be reduced by the proposed WACP scheme.
  • In the following, some existing inter coding technologies in the current VVC and AVS3 standards that are closely related to the proposed method are briefly reviewed. After that, some shortcomings in the current inter prediction design are analyzed. Finally, the details of the proposed WACP scheme are discussed.
  • Weighted prediction (WP) is a coding tool that is primarily used to compensate for illuminance changes, such as fade-in and fade-out, between one current picture and its temporal reference pictures at the motion compensation stage.
  • The WP was first adopted in the AVC and reused by the HEVC and the VVC. Specifically, when the WP is enabled, one multiplicative weight and one additive offset are signaled for each picture in each of the L0 and L1 reference lists in the slice header. For P slices, the prediction of one current block is generated by weighting the prediction samples obtained from one single reference picture. Specifically, letting P(i, j) denote the original prediction sample (i.e., prior to the WP) at coordinate (i, j), the final prediction sample is calculated by applying the signaled weight and offset to P(i, j).
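  • The weighting equation itself is not reproduced in this text. The following is a minimal sketch of such per-sample weighted prediction, assuming an AVC/HEVC-style formulation in which the signaled weight and offset are applied to each prediction sample and the result is clipped to the valid sample range; the function and parameter names are illustrative only.

```python
import numpy as np

def weighted_prediction(pred, weight, offset, bit_depth=8):
    """Apply a signaled multiplicative weight and additive offset to a
    uni-prediction block, in the spirit of AVC/HEVC-style weighted prediction.

    pred   : 2-D array of prediction samples P(i, j) prior to the WP
    weight : multiplicative weight signaled for the reference picture
    offset : additive offset signaled for the reference picture
    """
    max_val = (1 << bit_depth) - 1
    out = weight * pred.astype(np.int64) + offset
    return np.clip(out, 0, max_val)  # keep samples within the valid range
```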
  • Bi-prediction with CU-level weight: when the WP is not applied, the bi-prediction signal is generated by averaging the uni-prediction signals obtained from two reference pictures.
  • One coding tool, namely bi-prediction with CU-level weight (BCW), is introduced.
  • With the BCW, the bi-prediction is extended by allowing a weighted averaging of the two prediction signals.
  • The weight of one BCW coding block is allowed to be selected from a set of predefined weight values $w \in \{-2, 3, 4, 5, 10\}$, where the weight of 4 represents the traditional bi-prediction case in which the two uni-prediction signals are equally weighted. For low-delay pictures, only 3 weights $w \in \{3, 4, 5\}$ are allowed.
  • The two coding tools target the illumination change challenge at different granularities. However, because the interaction between the WP and the BCW could potentially complicate the VVC design, the two tools are disallowed from being enabled simultaneously. Specifically, when the WP is enabled for one slice, the BCW weights for all the bi-prediction CUs in the slice are not signaled but are inferred to be 4 (i.e., the equal weight is applied).
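  • The weighted-averaging equation for the BCW is not reproduced in this text. The sketch below assumes the VVC-style formulation in which the two weights sum to 8 (so w = 4 is the equal-weight case) and rounding is applied before the right shift; clipping to the valid sample range is omitted for brevity.

```python
def bcw_biprediction(p0, p1, w):
    """Bi-prediction with CU-level weight (BCW) for one integer sample pair.

    p0, p1 : co-located prediction samples from the L0 and L1 reference pictures
    w      : BCW weight, e.g., selected from {-2, 3, 4, 5, 10}
    """
    return ((8 - w) * p0 + w * p1 + 4) >> 3  # rounded weighted average
```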
  • the MMVD/UMVE mode is introduced in both the VVC and AVS standards as one special merge mode. Specifically, in both the VVC and AVS3, the mode is signaled by one MMVD flag at coding block level.
  • In the MMVD mode, the first two candidates in the merge list for regular merge mode are selected as the two base merge candidates for MMVD. After one base merge candidate is selected and signaled, additional syntax elements are signaled to indicate the motion vector differences (MVDs) that are added to the motion of the selected merge candidate.
  • the MMVD syntax elements include a merge candidate flag to select the base merge candidate, a distance index to specify the MVD magnitude and a direction index to indicate the MVD direction.
  • The distance index specifies the MVD magnitude, which is defined based on one set of pre-defined offsets from the starting point. As shown in FIG. 6, the offset is added to either the horizontal or the vertical component of the starting MV (i.e., the MV of the selected base merge candidate). Table 1 illustrates the MVD offsets that are applied in the AVS3.
  • The direction index is used to specify the sign of the signaled MVD. It is noted that the meaning of the MVD sign can vary according to the starting MVs.
  • When the starting MV is a uni-prediction MV, or bi-prediction MVs pointing to two reference pictures whose POCs are both larger than the POC of the current picture or both smaller than the POC of the current picture, the signaled sign is the sign of the MVD added to the starting MV.
  • Otherwise (i.e., when the two reference pictures straddle the current picture in POC order), the signaled sign is applied to the L0 MVD and the opposite of the signaled sign is applied to the L1 MVD.
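  • Since Table 1 (the AVS3 MVD offsets) is not reproduced in this text, the sketch below uses placeholder offsets; it only illustrates how a distance index and a direction index are mapped to an MVD that is added to the starting MV of the selected base merge candidate (the simple case in which the signaled sign applies directly).

```python
# Placeholder MVD offset table (Table 1 is not reproduced here); values are in
# quarter-sample units and are illustrative only.
MVD_OFFSETS = [1, 2, 4, 8, 16, 32]

# Direction index -> (sign on the horizontal MVD, sign on the vertical MVD)
MVD_DIRECTIONS = [(+1, 0), (-1, 0), (0, +1), (0, -1)]

def mmvd_motion(base_mv, distance_idx, direction_idx):
    """Derive an MMVD motion vector from a base merge candidate MV plus a
    signaled distance/direction pair."""
    sign_x, sign_y = MVD_DIRECTIONS[direction_idx]
    offset = MVD_OFFSETS[distance_idx]
    return (base_mv[0] + sign_x * offset, base_mv[1] + sign_y * offset)
```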
  • the derivation of the refined motion vector for each sample in one block is based on the classical optical flow model.
  • Let $I^{(k)}(x, y)$ be the sample value at coordinate $(x, y)$ of the prediction block derived from reference picture list $k$ ($k = 0, 1$), and let $\partial I^{(k)}(x, y)/\partial x$ and $\partial I^{(k)}(x, y)/\partial y$ denote the horizontal and vertical gradients of the sample.
  • $(MV_{x0}, MV_{y0})$ and $(MV_{x1}, MV_{y1})$ indicate the block-level motion vectors that are used to generate the two prediction blocks $I^{(0)}$ and $I^{(1)}$.
  • The motion refinement $(v_x, v_y)$ at the sample location $(x, y)$ is derived by minimizing the difference $\Delta$ between the values of the samples after motion refinement compensation (i.e., A and B in FIG. 4).
  • the BIO is only applied to bi-prediction blocks which are predicted by two reference blocks from temporal neighboring pictures. Additionally, the BIO is enabled without sending additional information from encoder to decoder. Specifically, the BIO is applied to all the bi-directional predicted blocks which have both the forward and backward prediction signals.
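  • The BDOF equations are not reproduced in this text. For reference, one common formulation of the final BDOF prediction sample (as used in VVC-style designs, given here only as an illustration rather than the exact equations of this disclosure) combines the two prediction signals, their gradients and the derived refinement $(v_x, v_y)$ as
$$\mathrm{pred}_{BDOF}(x,y) = \frac{1}{2}\left(I^{(0)}(x,y) + I^{(1)}(x,y) + \frac{v_x}{2}\left(\frac{\partial I^{(1)}}{\partial x} - \frac{\partial I^{(0)}}{\partial x}\right) + \frac{v_y}{2}\left(\frac{\partial I^{(1)}}{\partial y} - \frac{\partial I^{(0)}}{\partial y}\right)\right).$$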
  • affine motion compensated prediction is applied by signaling one flag for each inter coding block to indicate whether the translation motion or the affine motion model is applied for inter prediction.
  • two affine modes including 4-parameter affine mode and 6-parameter affine mode, are supported for one affine coding block.
  • the 4-parameter affine model has the following parameters: two parameters for translation movement in horizontal and vertical directions respectively, one parameter for zoom motion and one parameter for rotation motion for both directions.
  • Horizontal zoom parameter is equal to vertical zoom parameter.
  • Horizontal rotation parameter is equal to vertical rotation parameter.
  • FIG. 5A shows an illustration of a 4-parameter affine model, in accordance with the present disclosure.
  • FIG. 5B shows an illustration of a 4-parameter affine model, in accordance with the present disclosure.
  • The 6-parameter affine mode has the following parameters: two parameters for translation movement in the horizontal and vertical directions respectively, one parameter for zoom motion and one parameter for rotation motion in the horizontal direction, and one parameter for zoom motion and one parameter for rotation motion in the vertical direction.
  • The 6-parameter affine motion model is coded with three MVs at three control points (CPMVs).
  • FIG. 6 shows an illustration of a 6-parameter affine model, in accordance with the present disclosure.
  • The three control points of one 6-parameter affine block are located at the top-left, top-right and bottom-left corners of the block.
  • The motion at the top-left control point is related to the translation motion.
  • The motion at the top-right control point is related to the rotation and zoom motion in the horizontal direction.
  • The motion at the bottom-left control point is related to the rotation and zoom motion in the vertical direction.
  • For the 6-parameter model, the rotation and zoom motion in the horizontal direction may not be the same as the rotation and zoom motion in the vertical direction.
  • The motion vector of each sub-block $(v_x, v_y)$ is derived using the three MVs at the control points.
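  • The sub-block MV equation is not reproduced in this text. A commonly used form of the 6-parameter derivation, assuming $(v_{0x}, v_{0y})$, $(v_{1x}, v_{1y})$ and $(v_{2x}, v_{2y})$ are the top-left, top-right and bottom-left control-point MVs and $w$ and $h$ are the block width and height, is
$$v_x = \frac{v_{1x}-v_{0x}}{w}\,x + \frac{v_{2x}-v_{0x}}{h}\,y + v_{0x}, \qquad v_y = \frac{v_{1y}-v_{0y}}{w}\,x + \frac{v_{2y}-v_{0y}}{h}\,y + v_{0y},$$
where $(x, y)$ is the position (e.g., the center) of the sub-block relative to the top-left corner of the block.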
  • Transform coding is one of the most important compression technologies and is widely used in all mainstream video codecs. It improves the coding efficiency by compacting most of the signal energy into a few low-frequency coefficients and distributing the remaining energy into high-frequency coefficients. Therefore, with quantization applied, the coefficients having the highest energy over the block (i.e., the low-frequency coefficients) are finely quantized and allocated more bits, while the low-energy coefficients (i.e., the high-frequency coefficients) are coarsely quantized and allocated fewer bits. For this reason, in most scenarios (especially low bit-rate applications), the reconstructed video signal is usually dominated by low-frequency information, and some high-frequency information that is present in the original video is missing and/or distorted in the reconstructed video signal.
  • the WP and the BCW are efficient tools to improve the efficiency of motion compensated prediction when there are global or local illumination variations among different pictures.
  • The improvement is achieved by estimating the brightness variations with a linear model, i.e., one multiplicative weight and one additive offset.
  • The weight and the offset are usually optimized by minimizing the mean squared error (MSE) between the current block and its prediction block.
  • one weighted AC prediction (WACP) scheme is proposed to improve the prediction efficiency of AC components at motion compensation stage.
  • The AC components of one video block are predicted from the weighted combination of the AC components of one or more of its temporal reference blocks. Because a better AC prediction can be achieved, the signaling overhead of the AC coefficients is expected to be reduced, and therefore the overall motion compensation efficiency is expected to improve when the WACP scheme is applied.
  • the idea of the WACP can be regarded as one extension of the famous multi-hypothesis prediction to estimate the value of the AC component at each sample of the current block based on the linear combination of the AC component of the collocated sample from multiple motion compensated prediction blocks.
  • The general idea of the proposed WACP can be formulated as
$$P_{WACP}(i,j) = P_{DC}(i,j) + \sum_{k=0}^{N-1} w_k \cdot P^{k}_{AC}(i,j),$$
where $P_{DC}(i,j)$ is the average (i.e., the DC component) at coordinate $(i,j)$ of the multiple prediction blocks; $P^{k}_{AC}(i,j)$ is the AC component at coordinate $(i,j)$ of the $k$-th prediction block; $w_k$ represents the weight that is applied to the AC component of the $k$-th prediction block; and $N$ is the total number of hypotheses that are applied.
  • The values of $P_{DC}(i,j)$ and $P^{k}_{AC}(i,j)$ can be further calculated as
$$P_{DC}(i,j) = \frac{1}{N}\sum_{k=0}^{N-1} P_k(i,j), \qquad P^{k}_{AC}(i,j) = P_k(i,j) - P_{DC}(i,j),$$
where $P_k(i,j)$ denotes the sample at coordinate $(i,j)$ in the $k$-th prediction block.
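  • A minimal sketch of this combination is given below: the DC (low-frequency) signal is taken as the average of the N hypothesis prediction blocks, the per-block AC (high-frequency) signals are the residues over that average, and the final prediction adds a weighted sum of the AC signals back to the DC signal. Function and variable names are illustrative only.

```python
import numpy as np

def wacp_prediction(pred_blocks, ac_weights):
    """Weighted AC prediction (WACP) sketch.

    pred_blocks : list of N prediction blocks P_k(i, j) with identical shapes
    ac_weights  : list of N weights w_k applied to the AC components
    """
    blocks = np.stack([b.astype(np.float64) for b in pred_blocks])
    p_dc = blocks.mean(axis=0)            # P_DC(i, j): average of the N blocks
    p_ac = blocks - p_dc                  # P_AC^k(i, j) = P_k(i, j) - P_DC(i, j)
    weighted_ac = sum(w * ac for w, ac in zip(ac_weights, p_ac))
    return p_dc + weighted_ac             # final WACP prediction signal
```

  • Note that for N = 2 the two AC components are opposite in sign, so equal weights reduce the result to the ordinary averaging bi-prediction, while unequal weights emphasize the high-frequency content of one hypothesis over the other.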
  • One essential problem of the proposed WACP scheme is how to balance the prediction efficiency gain of using more hypotheses against the required overhead of signaling multiple weights.
  • More hypothesis candidates imply a more accurate AC prediction, which, however, requires more bits to code the weight values.
  • The required overhead may outweigh the prediction accuracy benefit.
  • It is proposed to signal the number of the hypothesis prediction signals applied in the WACP scheme and to let the encoder adaptively choose the optimal number for the best rate-distortion (R-D) performance.
  • the number of the applied hypothesis for the WACP may be signaled at various coding levels, e.g., sequence level, picture level, tile/slice level and coding block level and so forth, to provide different tradeoff between coding efficiency and hardware/software implementation cost.
  • It is proposed to use one fixed number of hypothesis prediction blocks when the proposed WACP scheme is applied. Without loss of generality, N = 2 will be used as an example to explain the proposed WACP method.
  • FIG. 8 shows a method for video decoding in weighted alternating current prediction (WACP), in accordance with the present disclosure.
  • the decoder may obtain a plurality of inter prediction blocks from a number of temporal reference pictures associated with a video block.
  • the decoder may obtain a low-frequency signal based on the plurality of inter prediction blocks.
  • the decoder may obtain a plurality of high-frequency signals based on the plurality of inter prediction blocks. At least one of the plurality of high-frequency signals is associated with one prediction block. In some embodiments, the decoder may obtain a plurality of high-frequency signals, where each high-frequency signal is associated with one prediction block.
  • the decoder may determine at least one weight associated with the high- frequency signal of at least one of the inter prediction blocks. In some embodiments, the decoder may determine a plurality of weights associated with the high-frequency signal of each inter prediction block.
  • the decoder may calculate a final prediction signal of the video block based on a weighted sum of the low-frequency signal and the plurality of high-frequency signals using the at least one weight.
  • FIG. 9 shows a method for video decoding in WACP, in accordance with the present disclosure.
  • the decoder may obtain a combined high-frequency signal based on a weighted sum of the plurality of high-frequency signals. At least one of the plurality of high- frequency signals is weighted by a corresponding weight associated with the at least one of the plurality of high-frequency signals.
  • the decoder may calculate the final prediction signal of the video block as a sum of the low-frequency signal and the combined high-frequency signal of the video block.
  • In one example, the weights are constrained, e.g., $w_0 + w_1 = 0$, such that only one weight needs to be explicitly signaled.
  • a weight for a high-frequency signal can be identified, for example, by subtracting weights of all other high-frequency signals from one. In another example, a weight for a high-frequency signal can be identified by subtracting weights of all other high-frequency signals from zero.
  • The equation (13) can be rewritten using the integer weights as equation (14), where $w_{int}$ is the integer weight value, which is allowed to be selected from $\{-1, 0, 1\}$ in the first example and from $\{-6, -1, 0, 1, 6\}$ in the second example.
  • In another example, the integer weight values may be allowed to be selected from $\{5, 0, 3\}$ and a fixed number of bits for the right shift operation is set to 3.
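  • The exact discretization in equation (14) is not reproduced in this text. The sketch below only illustrates the general fixed-point pattern, assuming a single signaled integer weight that is applied with opposite signs to the two AC components and scaled back with a right shift; the mapping of the integer weight to an effective fractional weight is an assumption, not the disclosure's equation.

```python
def bd_wacp_fixed_point(p_dc, p_ac_0, p_ac_1, w_int, shift=3):
    """Fixed-point sketch of bi-directional WACP.

    p_dc           : DC (low-frequency) sample, integer
    p_ac_0, p_ac_1 : AC (high-frequency) samples of the two prediction blocks
    w_int          : signaled integer weight
    shift          : number of bits for the right shift (illustrative value)
    """
    rounding = 1 << (shift - 1)
    ac_term = (w_int * (p_ac_0 - p_ac_1) + rounding) >> shift
    return p_dc + ac_term
```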
  • the selected weight of the BD-WACP mode is explicitly signaled in bitstream if one coding block is bi-predicted.
  • merge mode is supported in both VVC and AVS3 where motion information of one coding block is not signaled but derived from one of a set of spatial/temporal merge candidates.
  • Methods are proposed in this section to apply the BD-WACP to merge modes. Firstly, in addition to the motion information (i.e., reference picture indices and motion vectors), it is proposed to store the associated WACP weight for each bi-predicted block. In this way, the BD-WACP weight can be inherited from block to block without signaling.
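  • A minimal sketch of this storage is shown below: the per-block motion data is simply extended with the BD-WACP weight so that a regular merge candidate carries the weight along with its motion information. Field and function names are illustrative and are not taken from any reference software.

```python
from dataclasses import dataclass

@dataclass
class MotionInfo:
    """Per-block motion data extended with the BD-WACP weight for inheritance."""
    ref_idx_l0: int
    ref_idx_l1: int
    mv_l0: tuple            # (mvx, mvy) for reference list 0
    mv_l1: tuple            # (mvx, mvy) for reference list 1
    wacp_weight: int = 0    # 0 corresponds to the default bi-prediction

def inherit_from_merge_candidate(cand: MotionInfo) -> MotionInfo:
    """Regular merge: both the motion information and the BD-WACP weight are copied."""
    return MotionInfo(cand.ref_idx_l0, cand.ref_idx_l1,
                      cand.mv_l0, cand.mv_l1, cand.wacp_weight)
```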
  • there are multiple types of merge modes including regular merge mode, inherited affine merge mode, constructed affine merge mode.
  • FIG. 7 illustrates an inheritance of the WACP mode.
  • FIG. 7 is one example to explain the inheritance scheme proposed for the WACP mode.
  • Spatial merge candidate B2, which is coded by the BD-WACP mode with a weight value of 1, is selected as the merge candidate of the current coding block.
  • both the BD-WACP weight and the motion information of B2 are inherited to generate the bi-prediction signal of the current block.
  • the motion information of constructed affine merge mode is generated from the motion information of multiple neighboring blocks.
  • Different methods may be applied to generate the BD-WACP weight for one constructed affine merge block.
  • In the first method, it is proposed to always disable the BD-WACP mode (i.e., forcibly setting the BD-WACP weight w to 0) when the current block is coded by the constructed affine merge mode.
  • In the second method, it is proposed to set the BD-WACP weight of one constructed affine merge block to be equal to the BD-WACP weight of the block that generates the first control-point motion vector (i.e., at the top-left corner of the current block).
  • In another method, the BD-WACP weight of one constructed affine merge block is set to be equal to the BD-WACP weight that is most frequently used by the neighboring blocks. Additionally, when there are not enough neighboring blocks that are coded by the BD-WACP mode, the BD-WACP weight of the current block is set to 0 (i.e., disabling the BD-WACP).
  • The BD-WACP and the WP are two coding tools with different flavors: the BD-WACP targets compensating the high-frequency information that is missing from the reference pictures, while the WP concentrates on compensating the illumination variations (i.e., low-frequency information) between the current picture and the reference pictures. Therefore, there is no obvious conflict that prevents the two coding tools from being used jointly. Specifically, when the WP is turned on, the WP parameters (i.e., weight and offset) are signaled at the picture/slice level.
  • One additional BD-WACP weight can be signaled when the current block is bi-predicted. Therefore, as one embodiment of the disclosure, it is proposed to apply the BD-WACP and the WP together.
  • The WP is firstly applied to adjust the illumination magnitudes of the prediction blocks, which are then combined by the WACP to generate the final bi-prediction. Assuming the WP weights and offsets are associated with the L0 and L1 reference pictures and $w_{BD\text{-}WACP}$ is the BD-WACP weight, the bi-prediction is generated as shown in equation (15).
  • In equation (15), the coordinate $(i,j)$ is omitted to simplify the presentation. Additionally, for ease of description, all the values of the weights and offsets are assumed to be floating-point. In practice, the parameter discretization method as depicted in equation (14) can easily be applied to implement equation (15) with fixed-point arithmetic. In another embodiment, it is proposed to always disable the BD-WACP mode for one bi-predicted coding block when the WP is enabled for the picture/slice to which the coding block belongs. In that case, the BD-WACP weight does not need to be signaled but is always inferred to be 0.
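  • A floating-point sketch of the described order of operations is given below: the WP weight/offset of each reference list is applied first, and the two adjusted prediction blocks are then combined by the bi-directional WACP. Since equation (15) is not reproduced in this text, the exact combination (in particular, the convention of applying the single signaled BD-WACP weight with opposite signs to the two AC parts) is an assumption of this sketch.

```python
def wp_then_bd_wacp(p0, p1, wp_w0, wp_o0, wp_w1, wp_o1, w_bd_wacp):
    """Joint WP + BD-WACP, floating-point sketch.

    p0, p1        : prediction samples from the L0 and L1 reference pictures
    wp_w*, wp_o*  : WP weight and offset signaled for each reference list
    w_bd_wacp     : BD-WACP weight for the current coding block
    """
    q0 = wp_w0 * p0 + wp_o0              # WP-adjusted L0 prediction
    q1 = wp_w1 * p1 + wp_o1              # WP-adjusted L1 prediction
    p_dc = 0.5 * (q0 + q1)               # low-frequency (DC) part
    ac0, ac1 = q0 - p_dc, q1 - p_dc      # high-frequency (AC) parts (ac1 == -ac0)
    return p_dc + w_bd_wacp * (ac0 - ac1)
```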
  • The BD-WACP can also be seamlessly combined with the BCW mode, because the two modes aim at improving different components of the motion compensated prediction signal. Therefore, in one or more embodiments, it is proposed to jointly apply the BD-WACP and the BCW at the same time for one bi-predicted coding block. Specifically, with this method, the BCW is firstly applied to adjust the local illumination magnitudes of the prediction blocks, which are then combined by the WACP to generate the final bi-prediction. Assuming $w_{BCW}$ is the BCW weight being applied, the bi-prediction is generated as shown in equation (16).
  • In that case, the BD-WACP weight does not need to be signaled but is always inferred to be 0.
  • The BD-WACP can also be freely combined with the BDOF. More specifically, when the two tools are combined, the original prediction signals $P_0$ and $P_1$ are still applied to estimate the sample-wise refinement $\Delta_{BDOF}$, as depicted in the bi-directional optical flow (BDOF) section, which is then added to the bi-prediction signal enhanced by the BD-WACP, i.e.,
$$P_{final}(i,j) = P_{BD\text{-}WACP}(i,j) + \Delta_{BDOF}(i,j).$$
  • variable-length code-words should be designed to accommodate the specific distribution of the weight values of the BD-WACP mode.
  • In general, the BD-WACP weight 0 (i.e., default bi-prediction) is the most frequently selected and should be assigned the shortest code-word.
  • The weight values with larger absolute values are less frequently selected, due to the relatively large modifications they make to the AC components in the reference blocks. Therefore, they should be assigned longer code-words.
  • Table 4 and Table 5 show two BD-WACP weight binarization methods when three weights $\{-1, 0, 1\}$ or five weights $\{-6, -1, 0, 1, 6\}$ are applied to the BD-WACP mode.
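  • Since Tables 4 and 5 are not reproduced in this text, the mapping below is only an illustrative binarization consistent with the stated design goal (the shortest code-word for weight 0, longer code-words for larger magnitudes); it is not the disclosure's actual tables.

```python
# Illustrative, prefix-free BD-WACP weight code-words (not Tables 4/5).
THREE_WEIGHT_CODEWORDS = {0: '0', 1: '10', -1: '11'}
FIVE_WEIGHT_CODEWORDS = {0: '0', 1: '100', -1: '101', 6: '110', -6: '111'}
```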
  • FIG. 10 shows a computing environment 1010 coupled with a user interface 1060.
  • the computing environment 1010 can be part of a data processing server.
  • the computing environment 1010 includes processor 1020, memory 1040, and I/O interface 1050.
  • the processor 1020 typically controls overall operations of the computing environment 1010, such as the operations associated with the display, data acquisition, data communications, and image processing.
  • the processor 1020 may include one or more processors to execute instructions to perform all or some of the steps in the above-described methods.
  • the processor 1020 may include one or more modules that facilitate the interaction between the processor 1020 and other components.
  • the processor may be a Central Processing Unit (CPU), a microprocessor, a single chip machine, a GPU, or the like.
  • the memory 1040 is configured to store various types of data to support the operation of the computing environment 1010.
  • Memory 1040 may include predetermined software 1042. Examples of such data comprise instructions for any applications or methods operated on the computing environment 1010, video datasets, image data, etc.
  • the memory 1040 may be implemented by using any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic or optical disk.
  • the I/O interface 1050 provides an interface between the processor 1020 and peripheral interface modules, such as a keyboard, a click wheel, buttons, and the like.
  • the buttons may include but are not limited to, a home button, a start scan button, and a stop scan button.
  • the I/O interface 1050 can be coupled with an encoder and decoder.
  • non-transitory computer-readable storage medium comprising a plurality of programs, such as comprised in the memory 1040, executable by the processor 1020 in the computing environment 1010, for performing the above-described methods.
  • the non-transitory computer-readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disc, an optical data storage device or the like.
  • the non-transitory computer-readable storage medium has stored therein a plurality of programs for execution by a computing device having one or more processors, where the plurality of programs when executed by the one or more processors, cause the computing device to perform the above-described method for motion prediction.
  • the computing environment 1010 may be implemented with one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field- programmable gate arrays (FPGAs), graphical processing units (GPUs), controllers, micro controllers, microprocessors, or other electronic components, for performing the above methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method, apparatus, and a non-transitory computer-readable storage medium for video decoding in weighted alternating current prediction (WACP) are provided. The method may include obtaining a plurality of inter prediction blocks from a number of temporal reference pictures associated with a video block. The method may also include obtaining a low-frequency signal based on the plurality of inter prediction blocks. The method may further include obtaining a plurality of high-frequency signals based on the plurality of inter prediction blocks. The method may also include determining at least one weight associated with the high-frequency signal of at least one of the inter prediction blocks. The method may further include calculating a final prediction signal of the video block based on a weighted sum of the low-frequency signal and the plurality of high-frequency signals using the at least one weight.

Description

WEIGHTED AC PREDICTION FOR VIDEO CODING
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application is based upon and claims priority to Provisional Application No. 63/057,290, filed on July 27, 2020, the entire content of which is incorporated herein by reference in its entirety for all purposes.
TECHNICAL FIELD
[0002] This disclosure is related to video coding and compression. More specifically, this disclosure relates to methods and apparatus for weighted alternating current (AC) prediction for video coding.
BACKGROUND
[0003] Various video coding techniques may be used to compress video data. Video coding is performed according to one or more video coding standards. For example, nowadays, some well-known video coding standards include Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC, also known as H.265 or MPEG-H Part 2) and Advanced Video Coding (AVC, also known as H.264 or MPEG-4 Part 10), which are jointly developed by ISO/IEC MPEG and ITU-T VCEG. AOMedia Video 1 (AV1) was developed by the Alliance for Open Media (AOM) as a successor to its preceding standard VP9. Audio Video Coding (AVS), which refers to a digital audio and digital video compression standard, is another video compression standard series developed by the Audio and Video Coding Standard Workgroup of China. Most of the existing video coding standards are built upon the famous hybrid video coding framework, i.e., using block-based prediction methods (e.g., inter-prediction, intra-prediction) to reduce redundancy present in video images or sequences and using transform coding to compact the energy of the prediction errors. An important goal of video coding techniques is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradations to video quality.
SUMMARY
[0004] Examples of the present disclosure provide methods and apparatus for weighted alternating current (AC) prediction for video coding.
[0005] According to a first aspect of the present disclosure, a method for video decoding in weighted alternating current prediction (WACP) is provided. The method may include obtaining a plurality of inter prediction blocks from a number of temporal reference pictures associated with a video block. The method may also include obtaining a low-frequency signal based on the plurality of inter prediction blocks. The method may also include obtaining a plurality of high-frequency signals based on the plurality of inter prediction blocks, where at least one of the plurality of high-frequency signals is associated with one prediction block. The method may also include determining at least one weight associated with the high-frequency signal of at least one of the inter prediction blocks. The method may also include calculating a final prediction signal of the video block based on a weighted sum of the low-frequency signal and the plurality of high-frequency signals using the at least one weight.
[0006] It is to be understood that the above general descriptions and detailed descriptions below are only examples and explanatory and not intended to limit the present disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate examples consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
[0008] FIG. 1 is a block diagram of an encoder, according to an example of the present disclosure.
[0009] FIG. 2 is a block diagram of a decoder, according to an example of the present disclosure.
[0010] FIG. 3A is a diagram illustrating block partitions in a multi-type tree structure, according to an example of the present disclosure.
[0011] FIG. 3B is a diagram illustrating block partitions in a multi -type tree structure, according to an example of the present disclosure.
[0012] FIG. 3C is a diagram illustrating block partitions in a multi-type tree structure, according to an example of the present disclosure.
[0013] FIG. 3D is a diagram illustrating block partitions in a multi-type tree structure, according to an example of the present disclosure.
[0014] FIG. 3E is a diagram illustrating block partitions in a multi-type tree structure, according to an example of the present disclosure.
[0015] FIG. 4 is an illustration of bi-directional optical flow (BDOF), according to an example of the present disclosure.
[0016] FIG. 5 A is an illustration of a 4-parameter affine mode, according to an example of the present disclosure.
[0017] FIG. 5B is an illustration of a 4-parameter affine mode, according to an example of the present disclosure.
[0018] FIG. 6 is an illustration of a 6-parameter affine mode, according to an example of the present disclosure.
[0019] FIG. 7 is an illustration of an inheritance of the WACP mode, according to an example of the present disclosure.
[0020] FIG. 8 is a method for video decoding, according to an example of the present disclosure.
[0021] FIG. 9 is a method for video decoding, according to an example of the present disclosure.
[0022] FIG. 10 is a diagram illustrating a computing environment coupled with a user interface, according to an example of the present disclosure.
DETAILED DESCRIPTION
[0023] Reference will now be made in detail to example embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of example embodiments do not represent all implementations consistent with the present disclosure. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the present disclosure, as recited in the appended claims.
[0024] The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the present disclosure and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It shall also be understood that the term “and/or” used herein is intended to signify and include any or all possible combinations of one or more of the associated listed items.
[0025] It shall be understood that, although the terms “first,” “second,” “third,” etc., may be used herein to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one category of information from another. For example, without departing from the scope of the present disclosure, first information may be termed as second information; and similarly, second information may also be termed as first information. As used herein, the term “if” may be understood to mean “when” or “upon” or “in response to a judgment” depending on the context.
[0026] The first generation AVS standard includes Chinese national standard “Information Technology, Advanced Audio Video Coding, Part 2: Video” (known as AVS1) and “Information Technology, Advanced Audio Video Coding Part 16: Radio Television Video” (known as AVS+). It can offer around 50% bit-rate saving at the same perceptual quality compared to the MPEG-2 standard. The AVS1 standard video part was promulgated as the Chinese national standard in February 2006. The second generation AVS standard includes the series of Chinese national standard “Information Technology, Efficient Multimedia Coding” (known as AVS2), which is mainly targeted at the transmission of extra HD TV programs. The coding efficiency of the AVS2 is double that of the AVS+. In May 2016, the AVS2 was issued as the Chinese national standard. Meanwhile, the AVS2 standard video part was submitted by the Institute of Electrical and Electronics Engineers (IEEE) as one international standard for applications. The AVS3 standard is one new generation video coding standard for UHD video application aiming at surpassing the coding efficiency of the latest international standard HEVC. In March 2019, at the 68-th AVS meeting, the AVS3-P2 baseline was finished, which provides approximately 30% bit-rate savings over the HEVC standard. Currently, one reference software, called the high performance model (HPM), is maintained by the AVS group to demonstrate a reference implementation of the AVS3 standard.
[0027] Like the HEVC, the AVS3 standard is built upon the block-based hybrid video coding framework.
[0028] FIG. 1 shows a general diagram of a block-based video encoder for the VVC. Specifically, FIG. 1 shows a typical encoder 100. The encoder 100 has video input 110, motion compensation 112, motion estimation 114, intra/inter mode decision 116, block predictor 140, adder 128, transform 130, quantization 132, prediction related info 142, intra prediction 118, picture buffer 120, inverse quantization 134, inverse transform 136, adder 126, memory 124, in-loop filter 122, entropy coding 138, and bitstream 144.
[0029] In the encoder 100, a video frame is partitioned into a plurality of video blocks for processing. For each given video block, a prediction is formed based on either an inter prediction approach or an intra prediction approach.
[0030] A prediction residual, representing the difference between a current video block, part of video input 110, and its predictor, part of block predictor 140, is sent to a transform 130 from adder 128. Transform coefficients are then sent from the Transform 130 to a Quantization 132 for entropy reduction. Quantized coefficients are then fed to an Entropy Coding 138 to generate a compressed video bitstream. As shown in FIG. 1, prediction related information 142 from an intra/inter mode decision 116, such as video block partition info, motion vectors (MVs), reference picture index, and intra prediction mode, are also fed through the Entropy Coding 138 and saved into a compressed bitstream 144. Compressed bitstream 144 includes a video bitstream.
[0031] In the encoder 100, decoder-related circuitries are also needed in order to reconstruct pixels for the purpose of prediction. First, a prediction residual is reconstructed through an Inverse Quantization 134 and an Inverse Transform 136. This reconstructed prediction residual is combined with a Block Predictor 140 to generate un-filtered reconstructed pixels for a current video block.
[0032] Spatial prediction (or “intra prediction”) uses pixels from samples of already coded neighboring blocks (which are called reference samples) in the same video frame as the current video block to predict the current video block.
[0033] Temporal prediction (also referred to as “inter prediction”) uses reconstructed pixels from already-coded video pictures to predict the current video block. Temporal prediction reduces temporal redundancy inherent in the video signal. The temporal prediction signal for a given coding unit (CU) or coding block is usually signaled by one or more MVs, which indicate the amount and the direction of motion between the current CU and its temporal reference. Further, if multiple reference pictures are supported, one reference picture index is additionally sent, which is used to identify from which reference picture in the reference picture storage the temporal prediction signal comes.
[0034] Motion estimation 114 intakes video input 110 and a signal from picture buffer 120 and outputs, to motion compensation 112, a motion estimation signal. Motion compensation 112 intakes video input 110, a signal from picture buffer 120, and the motion estimation signal from motion estimation 114 and outputs, to intra/inter mode decision 116, a motion compensation signal.
[0035] After spatial and/or temporal prediction is performed, an intra/inter mode decision 116 in the encoder 100 chooses the best prediction mode, for example, based on the rate-distortion optimization method. The block predictor 140 is then subtracted from the current video block, and the resulting prediction residual is de-correlated using the transform 130 and the quantization 132. The resulting quantized residual coefficients are inverse quantized by the inverse quantization 134 and inverse transformed by the inverse transform 136 to form the reconstructed residual, which is then added back to the prediction block to form the reconstructed signal of the CU. Further in-loop filtering 122, such as a deblocking filter, a sample adaptive offset (SAO), and/or an adaptive in-loop filter (ALF) may be applied on the reconstructed CU before it is put in the reference picture storage of the picture buffer 120 and used to code future video blocks. To form the output video bitstream 144, coding mode (inter or intra), prediction mode information, motion information, and quantized residual coefficients are all sent to the entropy coding unit 138 to be further compressed and packed to form the bitstream.
[0036] FIG. 1 gives the block diagram of a generic block-based hybrid video encoding system. The input video signal is processed block by block (called coding units (CUs)). Different from the HEVC which partitions blocks only based on quad-trees, in the AVS3, one coding tree unit (CTU) is split into CUs to adapt to varying local characteristics based on quad/binary/extended-quad-tree. Additionally, the concept of multiple partition unit type in the HEVC is removed, i.e., the separation of CU, prediction unit (PU) and transform unit (TU) does not exist in the AVS3; instead, each CU is always used as the basic unit for both prediction and transform without further partitions. In the tree partition structure of the AVS3, one CTU is firstly partitioned based on a quad-tree structure. Then, each quad-tree leaf node can be further partitioned based on a binary and extended-quad-tree structure.
[0037] As shown in FIG. 3A, 3B, 3C, 3D, and 3E, there are five splitting types, quaternary partitioning, horizontal binary partitioning, vertical binary partitioning, horizontal extended quad-tree partitioning, and vertical extended quad-tree partitioning.
[0038] FIG. 3 A shows a diagram illustrating block quaternary partition, in accordance with the present disclosure.
[0039] FIG. 3B shows a diagram illustrating block vertical binary partition, in accordance with the present disclosure.
[0040] FIG. 3C shows a diagram illustrating block horizontal binary partition, in accordance with the present disclosure.
[0041] FIG. 3D shows a diagram illustrating block vertical extended quaternary partition, in accordance with the present disclosure.
[0042] FIG. 3E shows a diagram illustrating block horizontal extended quaternary partition, in accordance with the present disclosure.

[0043] In FIG. 1, spatial prediction and/or temporal prediction may be performed. Spatial prediction (or “intra prediction”) uses pixels from the samples of already coded neighboring blocks (which are called reference samples) in the same video picture/slice to predict the current video block. Spatial prediction reduces spatial redundancy inherent in the video signal. Temporal prediction (also referred to as “inter prediction” or “motion compensated prediction”) uses reconstructed pixels from the already coded video pictures to predict the current video block. Temporal prediction reduces temporal redundancy inherent in the video signal. Temporal prediction signal for a given CU is usually signaled by one or more motion vectors (MVs) which indicate the amount and the direction of motion between the current CU and its temporal reference. Also, if multiple reference pictures are supported, one reference picture index is additionally sent, which is used to identify from which reference picture in the reference picture store the temporal prediction signal comes. After spatial and/or temporal prediction, the mode decision block in the encoder chooses the best prediction mode, for example, based on the rate-distortion optimization method. The prediction block is then subtracted from the current video block; and the prediction residual is de-correlated using transform and then quantized. The quantized residual coefficients are inverse quantized and inverse transformed to form the reconstructed residual, which is then added back to the prediction block to form the reconstructed signal of the CU. Further in-loop filtering, such as deblocking filter, sample adaptive offset (SAO) and adaptive in-loop filter (ALF) may be applied on the reconstructed CU before it is put in the reference picture store and used as reference to code future video blocks. To form the output video bitstream, coding mode (inter or intra), prediction mode information, motion information, and quantized residual coefficients are all sent to the entropy coding unit to be further compressed and packed.
[0044] FIG. 2 shows a general block diagram of a video decoder for the VVC. Specifically, FIG. 2 shows a typical decoder 200 block diagram. Decoder 200 has bitstream 210, entropy decoding 212, inverse quantization 214, inverse transform 216, adder 218, intra/inter mode selection 220, intra prediction 222, memory 230, in-loop filter 228, motion compensation 224, picture buffer 226, prediction related info 234, and video output 232.

[0045] Decoder 200 is similar to the reconstruction-related section residing in the encoder 100 of FIG. 1. In the decoder 200, an incoming video bitstream 210 is first decoded through an Entropy Decoding 212 to derive quantized coefficient levels and prediction-related information. The quantized coefficient levels are then processed through an Inverse Quantization 214 and an Inverse Transform 216 to obtain a reconstructed prediction residual. A block predictor mechanism, implemented in an Intra/inter Mode Selector 220, is configured to perform either an Intra Prediction 222 or a Motion Compensation 224, based on decoded prediction information. A set of unfiltered reconstructed pixels is obtained by summing up the reconstructed prediction residual from the Inverse Transform 216 and a predictive output generated by the block predictor mechanism, using a summer 218.
[0046] The reconstructed block may further go through an In-Loop Filter 228 before it is stored in a Picture Buffer 226, which functions as a reference picture store. The reconstructed video in the Picture Buffer 226 may be sent to drive a display device, as well as used to predict future video blocks. In situations where the In-Loop Filter 228 is turned on, a filtering operation is performed on these reconstructed pixels to derive a final reconstructed Video Output 232.

[0047] FIG. 2 gives a general block diagram of a block-based video decoder. The video bitstream is first entropy decoded at the entropy decoding unit. The coding mode and prediction information are sent to either the spatial prediction unit (if intra coded) or the temporal prediction unit (if inter coded) to form the prediction block. The residual transform coefficients are sent to the inverse quantization unit and inverse transform unit to reconstruct the residual block. The prediction block and the residual block are then added together. The reconstructed block may further go through in-loop filtering before it is stored in the reference picture store. The reconstructed video in the reference picture store is then sent out for display, as well as used to predict future video blocks.
[0048] In one or more embodiments, one weighted AC prediction (WACP) approach is proposed to enhance the efficiency of motion compensated prediction. The proposed scheme aims at predicting the alternating current (AC) components of one video block from the weighted combination of the AC components from one or more of its temporal reference blocks. Because a better prediction can be achieved, the corresponding overhead of signaling the AC coefficients can be reduced by the proposed WACP scheme. To facilitate the description, in the following, some existing inter coding technologies in the current VVC and AVS3 standards, which are closely related to the proposed method, are briefly overviewed. After that, some shortcomings in the current inter prediction design are analyzed. Finally, the details of the proposed WACP scheme are discussed.
[0049] Weighted Prediction
[0050] Weighted prediction (WP) is a coding tool that is primarily used to compensate for illuminance changes, such as fade-in and fade-out, between a current picture and its temporal reference pictures at the motion compensation stage. The WP was first adopted in the AVC and reused by the HEVC and the VVC. Specifically, when the WP is enabled, a multiplicative weight and an additive offset are signaled for each picture in each of the L0 and L1 reference lists in the slice header. For P slices, the prediction of one current block is generated by weighting the prediction samples obtained from one single reference picture. Specifically, letting P(i, j) denote the original prediction sample (i.e., prior to the WP) at coordinate (i, j), the final prediction sample is calculated as
P'(i, j) = w · P(i, j) + o    (1)

where w and o are the WP weight and offset that are associated with the reference picture of the current block. Similarly, for bi-prediction, the final bi-prediction is calculated as

P'(i, j) = (w0 · P0(i, j) + o0 + w1 · P1(i, j) + o1) / 2    (2)

where (w0, o0) and (w1, o1) are the WP weights and offsets associated with the reference pictures in L0 and L1, respectively. In general, the WP works efficiently for global illumination changes that vary linearly from picture to picture.
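The per-sample operations in equations (1) and (2) can be illustrated with a short sketch. The code below is illustrative only and not part of the disclosure; the floating-point weights, the clipping range, and the 10-bit sample depth are assumptions made for readability, whereas an actual codec would use the fixed-point weight and offset representation signaled in the slice header.

```python
import numpy as np

def wp_uni(pred, w, o, bit_depth=10):
    """Weighted prediction for a uni-predicted block, equation (1): P' = w * P + o."""
    out = w * pred.astype(np.float64) + o
    return np.clip(np.rint(out), 0, (1 << bit_depth) - 1).astype(np.int32)

def wp_bi(pred0, pred1, w0, o0, w1, o1, bit_depth=10):
    """Weighted bi-prediction, equation (2): average of the two weighted/offset predictions."""
    out = (w0 * pred0.astype(np.float64) + o0 + w1 * pred1.astype(np.float64) + o1) / 2.0
    return np.clip(np.rint(out), 0, (1 << bit_depth) - 1).astype(np.int32)

# Example: compensate a global fade between the current picture and its reference.
p0 = np.full((4, 4), 500)          # prediction samples from the L0 reference
print(wp_uni(p0, w=0.8, o=12))     # scales the reference back toward the current picture
```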
[0051] Bi-Prediction With CU-Level Weight

[0052] In the preceding AVC and HEVC standards, when the WP is not applied, the bi-prediction signal is generated by averaging the uni-prediction signals obtained from two reference pictures. In the VVC, one coding tool, namely bi-prediction with CU-level weight (BCW), was introduced to improve the efficiency of bi-prediction. Specifically, instead of simple averaging, the bi-prediction in the BCW is extended by allowing weighted averaging of two prediction signals, as depicted as:
P'(i, j) = ((8 − w) · P0(i, j) + w · P1(i, j) + 4) >> 3    (3)
[0053] In the VVC, when the current picture is one low-delay picture, the weight of one BCW coding block is allowed to be selected from a set of predefined weight values w ∈ {−2, 3, 4, 5, 10}, and the weight of 4 represents the traditional bi-prediction case where the two uni-prediction signals are equally weighted. For non-low-delay pictures, only 3 weights w ∈ {3, 4, 5} are allowed. Generally speaking, though there are some design similarities between the WP and the BCW, the two coding tools target the illumination change problem at different granularities. However, because the interaction between the WP and the BCW could potentially complicate the VVC design, the two tools are not allowed to be enabled simultaneously. Specifically, when the WP is enabled for one slice, the BCW weights for all the bi-prediction CUs in the slice are not signaled and are inferred to be 4 (i.e., the equal weight is applied).
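A minimal sketch of the BCW weighted averaging is given below. It is illustrative only; the (8 − w, w) weighting with a right shift of 3 and a rounding offset of 4 is the commonly described form of equation (3) and should be checked against the normative specification text for an exact implementation.

```python
import numpy as np

BCW_WEIGHTS_LOW_DELAY = (-2, 3, 4, 5, 10)   # candidate weights for low-delay pictures
BCW_WEIGHTS_DEFAULT = (3, 4, 5)              # candidate weights otherwise

def bcw_bi_pred(pred0, pred1, w):
    """Weighted bi-prediction of equation (3): P' = ((8 - w) * P0 + w * P1 + 4) >> 3."""
    p0 = pred0.astype(np.int64)
    p1 = pred1.astype(np.int64)
    return ((8 - w) * p0 + w * p1 + 4) >> 3

p0 = np.array([[100, 120], [140, 160]])
p1 = np.array([[110, 130], [150, 170]])
print(bcw_bi_pred(p0, p1, w=4))   # w = 4 reproduces the conventional average
```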
[0054] Merge Mode With Motion Vector Differences (MMVD)
[0055] In addition to conventional merge mode which derives the motion information of one current block from its spatial/temporal neighbors, the MMVD/UMVE mode is introduced in both the VVC and AVS standards as one special merge mode. Specifically, in both the VVC and AVS3, the mode is signaled by one MMVD flag at coding block level. In the MMVD mode, the first two candidates in the merge list for regular merge mode are selected as the two base merge candidates for MMVD. After one base merge candidate is selected and signaled, additional syntax elements are signaled to indicate the motion vector differences (MVDs) that are added to the motion of the selected merge candidate. The MMVD syntax elements include a merge candidate flag to select the base merge candidate, a distance index to specify the MVD magnitude and a direction index to indicate the MVD direction.
[0056] In the existing MMVD design, the distance index specifies the MVD magnitude, which is defined based on one set of pre-defined offsets from the starting point. As shown in FIG. 6, the offset is added to either the horizontal or the vertical component of the starting MV (i.e., the MV of the selected base merge candidate). Table 1 illustrates the MVD offsets that are applied in the AVS3.
Table 1 The MVD offset used in the AVS3
[0057] As shown in Table 2, the direction index is used to specify the signs of the signaled MVD. It is noted that the meaning of the MVD sign could vary according to the starting MVs. When the starting MV is a uni-prediction MV, or the starting MVs are bi-prediction MVs pointing to two reference pictures whose POCs are both larger than the POC of the current picture or both smaller than the POC of the current picture, the signaled sign is the sign of the MVD added to the starting MV. When the starting MVs are bi-prediction MVs pointing to two reference pictures with one picture’s POC larger than that of the current picture and the other picture’s POC smaller than that of the current picture, the signaled sign is applied to the L0 MVD and the opposite of the signaled sign is applied to the L1 MVD.
Table 2 The MVD sign as specified by the direction index
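The way a distance index and a direction index expand a base merge candidate into the final MV can be sketched as follows. Because the contents of Table 1 and Table 2 are not reproduced in this text, the offset table and the direction mapping below are placeholder values chosen purely for illustration and do not necessarily match the values normatively defined in the AVS3.

```python
# Hypothetical MVD offset table (quarter-pel units) and direction table;
# the normative values are given in Table 1 and Table 2 of this disclosure.
MVD_OFFSETS_QPEL = [1, 2, 4, 8, 16, 32]
MVD_DIRECTIONS = [(+1, 0), (-1, 0), (0, +1), (0, -1)]  # (sign_x, sign_y)

def mmvd_expand(base_mv, distance_idx, direction_idx):
    """Add the signaled MVD to the starting MV of the selected base merge candidate."""
    offset = MVD_OFFSETS_QPEL[distance_idx]
    sign_x, sign_y = MVD_DIRECTIONS[direction_idx]
    mvx, mvy = base_mv
    return (mvx + sign_x * offset, mvy + sign_y * offset)

# Example: base candidate MV (12, -4) in quarter-pel, distance index 2, negative horizontal direction.
print(mmvd_expand((12, -4), distance_idx=2, direction_idx=1))   # -> (8, -4)
```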
[0058] Bidirectional Optical Flow (BIO) and Bi-Directional Optical Flow (BDOF)
[0059] Conventional bi-prediction in video coding is a simple combination of two temporal prediction blocks obtained from the reference pictures. However, due to the trade-off between the signaling cost and the accuracy of motion vectors, the motion vectors received at the decoder end may not be very accurate. As a result, there may still be small remaining motion observable between the two prediction blocks, which could reduce the efficiency of motion compensated prediction. To address this shortcoming, the BIO tool is adopted in both the VVC and AVS3 standards to compensate such motion for every sample inside one block. Specifically, the BIO is a sample-wise motion refinement that is performed on top of the block-based motion-compensated predictions when bi-prediction is used. In the existing BIO design, the derivation of the refined motion vector for each sample in one block is based on the classical optical flow model. Let I(k)(x, y) be the sample value at the coordinate (x, y) of the prediction block derived from the reference picture list k (k = 0, 1), and ∂I(k)(x, y)/∂x and ∂I(k)(x, y)/∂y be the horizontal and vertical gradients of the sample. Assuming the optical flow model is valid, the motion refinement (vx, vy) at (x, y) can be derived by
∂I(k)(x, y)/∂t + vx · ∂I(k)(x, y)/∂x + vy · ∂I(k)(x, y)/∂y = 0    (4)
[0060] With the combination of the optical flow equation (4) and the interpolation of the prediction blocks along the motion trajectory (as shown in FIG. 4), the BIO prediction is obtained as
pred_BIO(x, y) = 1/2 · ( I(0)(x, y) + I(1)(x, y) + vx/2 · (∂I(1)(x, y)/∂x − ∂I(0)(x, y)/∂x) + vy/2 · (∂I(1)(x, y)/∂y − ∂I(0)(x, y)/∂y) )    (5)
[0061] In FIG. 4, (MVx0, MVy0) and (MVx1, MVy1) indicate the block-level motion vectors that are used to generate the two prediction blocks I(0) and I(1).
[0062] Further, the motion refinement (vx, vy) at the sample location (x, y) is calculated by minimizing the difference Δ between the values of the samples after motion refinement compensation (i.e., A and B in FIG. 4), as shown as
Δ(x, y) = I(0)(x, y) − I(1)(x, y) + vx · (∂I(0)(x, y)/∂x + ∂I(1)(x, y)/∂x) + vy · (∂I(0)(x, y)/∂y + ∂I(1)(x, y)/∂y)    (6)
[0063] Additionally, in order to ensure the regularity of the derived motion refinement, it is assumed that the motion refinement is consistent within a local surrounding area centered at (x, y); therefore, in the current BIO design in the AVS3, the values of (vx, vy) are derived by minimizing Δ inside the 4×4 window W around the current sample at (x, y) as
(vx*, vy*) = argmin(vx, vy) Σ(i, j)∈W Δ²(i, j)    (7)
[0064] As shown in equations (4) and (5), in addition to the block-level MC, gradients need to be derived in the BIO for every sample of each motion compensated block (i.e., I(0) and I(1)) in order to derive the local motion refinement and generate the final prediction at that sample location. In the AVS3, the gradients are calculated by a 2D separable finite impulse response (FIR) filtering process, which defines a set of 8-tap filters and applies different filters to derive the horizontal and vertical gradients according to the precision of the block-level motion vectors (e.g., (MVx0, MVy0) and (MVx1, MVy1) in FIG. 4). Table 3 illustrates the coefficients of the gradient filters that are used by the BIO.
Table 3 Gradient filters used in BIO
[0065] Finally, the BIO is only applied to bi-prediction blocks which are predicted by two reference blocks from temporal neighboring pictures. Additionally, the BIO is enabled without sending additional information from encoder to decoder. Specifically, the BIO is applied to all the bi-directional predicted blocks which have both the forward and backward prediction signals.
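Once the per-sample motion refinement (vx, vy) and the horizontal/vertical gradients are available, the final BIO prediction of equation (5) is a simple per-sample combination. The sketch below assumes the gradients and the refinement have already been derived (e.g., by the 8-tap FIR filters of Table 3 and the minimization of equation (7)); it only illustrates the combination step and is not a normative implementation.

```python
import numpy as np

def bio_prediction(i0, i1, gx0, gy0, gx1, gy1, vx, vy):
    """Per-sample BIO prediction of equation (5).

    i0, i1   : motion-compensated predictions I(0), I(1)
    gx*, gy* : horizontal/vertical gradients of each prediction
    vx, vy   : motion refinement derived per sample (or per 4x4 window)
    """
    return 0.5 * (i0 + i1
                  + 0.5 * vx * (gx1 - gx0)
                  + 0.5 * vy * (gy1 - gy0))

# Toy example with constant gradients and a small refinement.
i0 = np.full((4, 4), 100.0)
i1 = np.full((4, 4), 104.0)
gx0 = gy0 = np.full((4, 4), 2.0)
gx1 = gy1 = np.full((4, 4), 2.5)
print(bio_prediction(i0, i1, gx0, gy0, gx1, gy1, vx=0.25, vy=-0.25)[0, 0])
```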
[0066] Affine Mode
[0067] In AVC and HEVC, only the translational motion model is applied for motion compensated prediction, while in the real world there are many kinds of motion, e.g., zoom in/out, rotation, perspective motion, and other irregular motions. In the VVC, affine motion compensated prediction is applied by signaling one flag for each inter coding block to indicate whether the translational motion model or the affine motion model is applied for inter prediction. In the current VVC design, two affine modes, including the 4-parameter affine mode and the 6-parameter affine mode, are supported for one affine coding block.
[0068] The 4-parameter affine model has the following parameters: two parameters for translational movement in the horizontal and vertical directions respectively, one parameter for zoom motion and one parameter for rotation motion shared by both directions. The horizontal zoom parameter is equal to the vertical zoom parameter, and the horizontal rotation parameter is equal to the vertical rotation parameter. To better accommodate the motion vectors and affine parameters, in the VVC, those affine parameters are translated into two MVs (which are also called control point motion vectors (CPMVs)) located at the top-left corner and top-right corner of a current block. As shown in FIGS. 5A and 5B, the affine motion field of the block is described by two control point MVs (V0, V1).
[0069] FIG. 5A shows an illustration of a 4-parameter affine model, in accordance with the present disclosure.
[0070] FIG. 5B shows an illustration of a 4-parameter affine model, in accordance with the present disclosure.
[0071] Based on the control point motion, the motion field (vx, vy) of one affine coded block is described as
vx = ((v1x − v0x) / w) · x − ((v1y − v0y) / w) · y + v0x
vy = ((v1y − v0y) / w) · x + ((v1x − v0x) / w) · y + v0y    (8)
[0072] The 6-parameter affine mode has the following parameters: two parameters for translational movement in the horizontal and vertical directions respectively, one parameter for zoom motion and one parameter for rotation motion in the horizontal direction, and one parameter for zoom motion and one parameter for rotation motion in the vertical direction. The 6-parameter affine motion model is coded with three MVs, i.e., three CPMVs.

[0073] FIG. 6 shows an illustration of a 6-parameter affine model, in accordance with the present disclosure.
[0074] As shown in FIG. 6, the three control points of one 6-parameter affine block are located at the top-left, top-right and bottom-left corners of the block. The motion at the top-left control point is related to the translational motion, the motion at the top-right control point is related to the rotation and zoom motion in the horizontal direction, and the motion at the bottom-left control point is related to the rotation and zoom motion in the vertical direction. Compared to the 4-parameter affine motion model, the rotation and zoom motion of the 6-parameter model in the horizontal direction may not be the same as those in the vertical direction. Assuming (V0, V1, V2) are the MVs of the top-left, top-right and bottom-left corners of the current block in FIG. 6, the motion vector of each sub-block (vx, vy) is derived using the three MVs at the control points as:
vx = ((v1x − v0x) / w) · x + ((v2x − v0x) / h) · y + v0x
vy = ((v1y − v0y) / w) · x + ((v2y − v0y) / h) · y + v0y    (9)
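A sketch of how per-position MVs follow from the CPMVs in equations (8) and (9) is given below. In an actual codec the MVs would be computed once per 4×4 sub-block at its center position and stored in fixed-point precision; the floating-point evaluation here is an assumption made only to keep the illustration compact.

```python
def affine_mv_4param(v0, v1, w, x, y):
    """4-parameter affine model, equation (8); v0, v1 are the top-left / top-right CPMVs."""
    a = (v1[0] - v0[0]) / w
    b = (v1[1] - v0[1]) / w
    vx = a * x - b * y + v0[0]
    vy = b * x + a * y + v0[1]
    return vx, vy

def affine_mv_6param(v0, v1, v2, w, h, x, y):
    """6-parameter affine model, equation (9); v2 is the bottom-left CPMV."""
    vx = (v1[0] - v0[0]) / w * x + (v2[0] - v0[0]) / h * y + v0[0]
    vy = (v1[1] - v0[1]) / w * x + (v2[1] - v0[1]) / h * y + v0[1]
    return vx, vy

# Example: 16x16 block, MV evaluated at the sub-block center located at (6, 10).
print(affine_mv_4param((1.0, 0.5), (2.0, 0.0), w=16, x=6, y=10))
print(affine_mv_6param((1.0, 0.5), (2.0, 0.0), (1.5, 2.0), w=16, h=16, x=6, y=10))
```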
[0075] Improvements to Video Decoding
[0076] Transform coding is one of the most important compression technologies and is widely used in all mainstream video codecs. It improves the coding efficiency by compacting most of the signal energy into a few low-frequency coefficients and distributing the remaining energy into high-frequency coefficients. Therefore, with quantization applied, the coefficients having the highest energy over the block (i.e., the low-frequency coefficients) are finely quantized and allocated more bits, while the low-energy coefficients (i.e., the high-frequency coefficients) are coarsely quantized and allocated fewer bits. For this reason, in most scenarios (especially low bit-rate applications), the reconstructed video signal is usually dominated by low-frequency information, and some high-frequency information that is present in the original video is missing and/or distorted in the reconstructed video signal.
Given that the reconstructed video signal is used as reference for inter prediction, such distorted high-frequency information could potentially result in severe performance drop for both the current picture and the subsequent pictures that are predicted from the current picture.
[0077] The WP and the BCW are efficient tools to improve the efficiency of motion compensated prediction when there are global or local illumination variations among different pictures. However, such improvement is achieved by estimating brightness variations by a linear model, i.e., one multiplicative weight and one additive offset. In practice, the weight and the offset are usually optimized by minimizing the mean squared error (MSE) between the current block and its prediction block, i.e.,
(w*, o*) = argmin(w, o) Σ(i, j) (S(i, j) − (w · P(i, j) + o))²    (10)

where S(i, j) and P(i, j) represent the samples at coordinate (i, j) in the current block and the prediction block, respectively. Due to the dominant energy of low-frequency information in the reconstructed signal, the WP and the BCW can only compensate for the differences between the low-frequency components (for instance, the direct current (DC) component) of the current block and its reference block, but cannot recover the high-frequency information that may be missing from the reference samples.
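The minimization in equation (10) is an ordinary least-squares fit of one weight and one offset. The following sketch shows how an encoder could estimate (w, o) for one block pair; it illustrates the optimization only and is not a normative encoder procedure.

```python
import numpy as np

def estimate_wp_params(current, prediction):
    """Solve (w*, o*) = argmin sum((S - (w * P + o))^2) in closed form (equation (10))."""
    s = current.astype(np.float64).ravel()
    p = prediction.astype(np.float64).ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)   # design matrix [P, 1]
    (w, o), *_ = np.linalg.lstsq(A, s, rcond=None)
    return w, o

# Example: the reference block is a darkened, offset copy of the current block.
cur = np.arange(64, dtype=np.float64).reshape(8, 8) + 100.0
ref = 0.9 * cur - 5.0
print(estimate_wp_params(cur, ref))   # recovers w ~ 1.11, o ~ 5.56, undoing the fade
```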
[0078] Proposed Methods
[0079] In this disclosure, one weighted AC prediction (WACP) scheme is proposed to improve the prediction efficiency of the AC components at the motion compensation stage. In short, in the proposed method, the AC components of one video block are predicted from the weighted combination of the AC components from one or more of its temporal reference blocks. Because a better AC prediction can be achieved, the signaling overhead of the AC coefficients is expected to be reduced, and therefore the overall motion compensation efficiency is expected to improve when the WACP scheme is applied.
[0080] Generalized Weighted AC Prediction
[0081] Conceptually, the idea of the WACP can be regarded as one extension of the well-known multi-hypothesis prediction, in which the value of the AC component at each sample of the current block is estimated based on the linear combination of the AC components of the collocated samples from multiple motion compensated prediction blocks. Specifically, the proposed WACP can be formulated as follows:
P_WACP(i, j) = P_DC(i, j) + Σ_{k=0}^{N−1} w_k · P_k^AC(i, j)    (11)
where P_DC(i, j) is the average (i.e., the DC component) at coordinate (i, j) of the multiple prediction blocks; P_k^AC(i, j) is the AC component at the coordinate (i, j) of the k-th prediction block; w_k represents the weight that is applied to the AC component of the k-th prediction block; and N is the total number of hypotheses that are applied. The values of P_DC(i, j) and P_k^AC(i, j) can be further calculated as:
P_DC(i, j) = (1 / N) · Σ_{k=0}^{N−1} P_k(i, j),    P_k^AC(i, j) = P_k(i, j) − P_DC(i, j)    (12)
[0082] In equation (12), P_k(i, j) denotes the sample at coordinate (i, j) in the k-th prediction block.
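Equations (11) and (12) can be summarized by the following sketch, which forms the DC (average) signal of the N hypotheses, removes it from each hypothesis to obtain the per-hypothesis AC signals, and recombines them with per-hypothesis weights. It is a floating-point illustration of the concept only; an actual implementation would operate on integer samples with the fixed-point weights discussed later.

```python
import numpy as np

def wacp_prediction(preds, weights):
    """Weighted AC prediction of equations (11)-(12).

    preds   : list of N motion-compensated prediction blocks P_k (same shape)
    weights : list of N weights w_k applied to the AC component of each block
    """
    preds = [p.astype(np.float64) for p in preds]
    p_dc = sum(preds) / len(preds)                    # equation (12): DC = average of hypotheses
    p_ac = [p - p_dc for p in preds]                  # equation (12): AC_k = P_k - DC
    return p_dc + sum(w * ac for w, ac in zip(weights, p_ac))   # equation (11)

p0 = np.array([[100.0, 120.0], [140.0, 160.0]])
p1 = np.array([[108.0, 112.0], [150.0, 150.0]])
print(wacp_prediction([p0, p1], [0.125, -0.125]))    # boosts the AC detail of P0 relative to P1
```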
[0083] Similar to multi-hypothesis prediction, one essential design issue of the proposed WACP scheme is how to balance the prediction efficiency gain of using more hypotheses against the overhead required to signal multiple weights. Here, more hypothesis candidates imply a more accurate AC prediction, which, however, requires more bits to code the weight values. Sometimes, the required overhead may outweigh the prediction accuracy benefit. In one or more embodiments, it is proposed to signal the number of the hypothesis prediction signals applied in the WACP scheme and let the encoder adaptively choose the optimal number for the best rate-distortion (R-D) performance. The number of the applied hypotheses for the WACP may be signaled at various coding levels, e.g., sequence level, picture level, tile/slice level, coding block level and so forth, to provide different tradeoffs between coding efficiency and hardware/software implementation cost. In some embodiments, it is proposed to use one fixed number of hypothesis prediction blocks when the proposed WACP scheme is applied. Without loss of generality, N = 2 will be used as an example to explain the proposed WACP method.

[0084] FIG. 8 shows a method for video decoding in weighted alternating current prediction (WACP), in accordance with the present disclosure.
[0085] In step 810, the decoder may obtain a plurality of inter prediction blocks from a number of temporal reference pictures associated with a video block.
[0086] In step 812, the decoder may obtain a low-frequency signal based on the plurality of inter prediction blocks.
[0087] In step 814, the decoder may obtain a plurality of high-frequency signals based on the plurality of inter prediction blocks. At least one of the plurality of high-frequency signals is associated with one prediction block. In some embodiments, the decoder may obtain a plurality of high-frequency signals, where each high-frequency signal is associated with one prediction block.
[0088] In step 816, the decoder may determine at least one weight associated with the high-frequency signal of at least one of the inter prediction blocks. In some embodiments, the decoder may determine a plurality of weights associated with the high-frequency signal of each inter prediction block.
[0089] In step 818, the decoder may calculate a final prediction signal of the video block based on a weighted sum of the low-frequency signal and the plurality of high-frequency signals using the at least one weight.
[0090] FIG. 9 shows a method for video decoding in WACP, in accordance with the present disclosure.
[0091] In step 910, the decoder may obtain a combined high-frequency signal based on a weighted sum of the plurality of high-frequency signals. At least one of the plurality of high-frequency signals is weighted by a corresponding weight associated with the at least one of the plurality of high-frequency signals.
[0092] In step 912, the decoder may calculate the final prediction signal of the video block as a sum of the low-frequency signal and the combined high-frequency signal of the video block.
[0093] Bi-Directional Weighted AC Prediction

[0094] Bi-directional weighted AC prediction (BD-WACP) is a special case of the generalized WACP where the number of motion compensated prediction blocks that are used is limited to 2, i.e., N = 2. Therefore, based on equation (11), the bi-prediction sample at coordinate (i, j) can be calculated by
P_WACP(i, j) = P_DC(i, j) + w_0 · P_0^AC(i, j) + w_1 · P_1^AC(i, j)    (13)
where w_0 and w_1 are the weights associated with the AC samples of the prediction signals P_0 and P_1. As shown in equation (13), when w_0 is equal to w_1, the proposed WACP degrades to the traditional bi-prediction.
[0095] Assuming the BD-WACP can be adaptively switched at the coding block level, according to equation (13), two different weights need to be signaled for each bi-prediction block, which is costly considering the coding bits this may produce. To reduce the signaling overhead, one additional constraint can be applied to enforce the summation of the two weights to be a constant such as zero, i.e., w_0 + w_1 = 0, such that only one weight needs to be explicitly signaled. A weight for a high-frequency signal can be identified, for example, by subtracting the weights of all other high-frequency signals from one. In another example, a weight for a high-frequency signal can be identified by subtracting the weights of all other high-frequency signals from zero.

[0096] As shown in equation (13), when the BD-WACP is applied, only one single weight w needs to be signaled in the bitstream. However, the weight in equation (13) is assumed to be a floating-point number, which needs to be quantized before transmission. As the errors resulting from the quantization may significantly degrade the WACP performance, it is important to carefully choose the allowed WACP weights. In one specific example, three weights w ∈ {−1/8, 0, 1/8} are proposed to be used for the BD-WACP. In another specific example, five weights w ∈ {−6/8, −1/8, 0, 1/8, 6/8} are proposed to be used for the BD-WACP. When either of the two methods is applied, the corresponding absolute weight value can be represented by 3 bits. Therefore, equation (13) can be rewritten using the integer weights as
P_WACP(i, j) = P_DC(i, j) + (w_int · (P_0(i, j) − P_1(i, j))) >> 3    (14)
where w_int is the integer weight value, which is allowed to be selected from {−1, 0, 1} in the first example and from {−6, −1, 0, 1, 6} in the second example. In another example, the integer weight values may be allowed to be selected from {5, 0, 3} and the fixed number of bits for the right shift operation is set to 3. In another embodiment, instead of using a fixed set of WACP weights, it is proposed to directly signal the allowed weights in the bitstream (e.g., in the sequence parameter set, picture parameter set, slice header and so forth). Such a method gives the encoder more flexibility to select the desirable WACP weights according to the specific characteristics of the current sequence/picture/slice on the fly.
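A fixed-point sketch of the bi-directional case under the constraint w_0 + w_1 = 0 is shown below. It assumes weights expressed in units of 1/8 (so that a right shift by 3 replaces the division) and a simple rounding offset; the exact rounding and clipping behavior of equation (14) follows the specification and is only approximated here.

```python
import numpy as np

WACP_INT_WEIGHTS_3 = (-1, 0, 1)          # example three-weight set (in units of 1/8)
WACP_INT_WEIGHTS_5 = (-6, -1, 0, 1, 6)   # example five-weight set (in units of 1/8)

def bd_wacp_fixed_point(pred0, pred1, w_int, shift=3):
    """Bi-directional WACP with integer weights, approximating equation (14).

    The DC part is the normal average (P0 + P1) / 2; the AC correction
    w * (P0 - P1) is realized as (w_int * (P0 - P1)) >> shift with w = w_int / 8.
    """
    p0 = pred0.astype(np.int64)
    p1 = pred1.astype(np.int64)
    dc = (p0 + p1 + 1) >> 1                               # rounded average
    ac = (w_int * (p0 - p1) + (1 << (shift - 1))) >> shift
    return dc + ac

p0 = np.array([[100, 130], [90, 200]])
p1 = np.array([[104, 120], [96, 180]])
print(bd_wacp_fixed_point(p0, p1, w_int=0))   # w_int = 0 falls back to conventional bi-prediction
print(bd_wacp_fixed_point(p0, p1, w_int=1))   # emphasizes the L0 high-frequency detail
```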
[0097] Inherited BD-WACP mode
[0098] In the above methods, the selected weight of the BD-WACP mode is explicitly signaled in the bitstream if one coding block is bi-predicted. However, as discussed in the introduction section, merge mode is supported in both the VVC and AVS3, where the motion information of one coding block is not signaled but derived from one of a set of spatial/temporal merge candidates. To reduce the signaling overhead of the BD-WACP weights, methods are proposed in this section to apply the BD-WACP to merge modes. Firstly, in addition to the motion information (i.e., reference picture indices and motion vectors), it is proposed to store the associated WACP weight for each bi-predicted block. In this way, the BD-WACP weight can be inherited from block to block without signaling. In the existing VVC and AVS3 designs, there are multiple types of merge modes, including regular merge mode, inherited affine merge mode, and constructed affine merge mode.
[0099] Firstly, when the current coding block is coded with regular merge mode or inherited affine merge mode, the corresponding WACP weight can be directly copied from the weight of the selected merge candidate (as indicated by the signaled merge index). FIG. 7 shows one example to explain the inheritance scheme proposed for the WACP mode. In FIG. 7, the spatial merge candidate B2, which is coded by the BD-WACP mode with a weight value of 1, is selected as the merge candidate of the current coding block. In this case, both the BD-WACP weight and the motion information of B2 are inherited to generate the bi-prediction signal of the current block.
[00100] Different from regular merge mode and inherited affine merge mode, the motion information of constructed affine merge mode is generated from the motion information of multiple neighboring blocks. Different methods may be applied to generate the BD-WACP weight for one constructed affine merge block. In the first method, it is proposed to always disable the BD-WACP mode (i.e., forcibly setting the BD-WACP weight w to 0) when the current block is coded by the constructed affine merge mode. In the second method, it is proposed to set the BD-WACP weight of one constructed affine merge block to be equal to the BD-WACP weight of the block that generates the first control-point motion vector (i.e., at the top-left corner of the current block). In the third method, it is proposed to set the BD-WACP weight of one constructed affine merge block to be equal to the BD-WACP weight that is most used by the neighboring blocks. Additionally, when there are not enough neighboring blocks that are coded by the BD-WACP mode, the BD-WACP weight of the current block is set to 0 (i.e., the BD-WACP is disabled).
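The inheritance rules above can be summarized by a small sketch in which the BD-WACP weight is stored alongside the motion information of every bi-predicted block and is propagated through the merge list. The data layout and the fallback rule (the majority vote for constructed affine merge) are simplified illustrations of the described methods, not a normative derivation process.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MotionInfo:
    mv_l0: tuple
    mv_l1: tuple
    ref_idx_l0: int
    ref_idx_l1: int
    wacp_weight: int = 0         # stored together with the motion data

def inherit_regular_merge(candidates: List[MotionInfo], merge_idx: int) -> MotionInfo:
    """Regular / inherited affine merge: copy both the motion and the BD-WACP weight."""
    return candidates[merge_idx]

def weight_for_constructed_affine(neighbors: List[Optional[MotionInfo]]) -> int:
    """Third method for constructed affine merge: take the most used neighboring weight,
    falling back to 0 (BD-WACP disabled) when no neighbor uses BD-WACP."""
    weights = [n.wacp_weight for n in neighbors if n is not None and n.wacp_weight != 0]
    return Counter(weights).most_common(1)[0][0] if weights else 0

b2 = MotionInfo(mv_l0=(4, 0), mv_l1=(-4, 0), ref_idx_l0=0, ref_idx_l1=0, wacp_weight=1)
print(inherit_regular_merge([b2], merge_idx=0).wacp_weight)   # weight 1 inherited, as in FIG. 7
```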
[00101] Harmonization of the BD-WACP With Other Inter Coding Techniques

[00102] Harmonization Between the BD-WACP and the WP: Conceptually, the BD-WACP and the WP are two coding tools with different flavors: the BD-WACP targets compensating the high-frequency information that is missing in the reference pictures, while the WP concentrates on compensating the illumination variations (i.e., low-frequency information) between the current picture and the reference pictures. Therefore, there are no obvious conflicts that prevent the two coding tools from being used jointly. Specifically, when the WP is turned on, the WP parameters (i.e., weight and offset) are signaled at the picture/slice level. At the coding block level, one additional BD-WACP weight can be signaled when the current block is bi-predicted. Therefore, as one embodiment of the disclosure, it is proposed to apply the BD-WACP and the WP together. In detail, in this method, the WP is firstly applied to adjust the illumination magnitudes of the prediction blocks, which are then combined by the WACP to generate the final bi-prediction. Assuming
(w_0, o_0) and (w_1, o_1) are the WP weights and offsets associated with the L0 and L1 reference pictures, respectively, and w_BD-WACP is the BD-WACP weight. When the proposed method is applied, the bi-prediction is generated as
P_0^WP = w_0 · P_0 + o_0,    P_1^WP = w_1 · P_1 + o_1
P_WACP = (P_0^WP + P_1^WP) / 2 + w_BD-WACP · (P_0^WP − P_1^WP)    (15)
[00103] Note that in equation (15) the coordinate (i, j) is omitted to facilitate the presentation. Additionally, for ease of description, all the weight and offset values are assumed to be floating-point. In practice, the parameter discretization method as depicted in equation (14) can be readily applied to implement equation (15) with fixed-point arithmetic. In another embodiment, it is proposed to always disable the BD-WACP mode for one bi-predicted coding block when the WP is enabled for the picture/slice that the coding block belongs to. In this case, the BD-WACP weight does not need to be signaled but is always inferred to be 0.
[00104] Harmonization between the BD-WACP and the BCW: Similar to the WP, the BD-WACP can also be seamlessly combined with the BCW mode, because the two modes aim at improving different components of the motion compensated prediction signal. Therefore, in one or more embodiments, it is proposed to jointly apply the BD-WACP and the BCW at the same time for one bi-predicted coding block. Specifically, in this method, the BCW is firstly applied to adjust the local illumination magnitudes of the prediction blocks, which are then combined by the WACP to generate the final bi-prediction. Assuming w_BCW is the BCW weight being applied, the bi-prediction is generated as
P_WACP = ((8 − w_BCW) · P_0 + w_BCW · P_1) / 8 + w_BD-WACP · (P_0 − P_1)    (16)
[00105] In some other embodiments, it is proposed to always disable the BD-WACP mode for one bi-predicted coding block when the BCW is enabled for the block. In this case, the BD-WACP weight does not need to be signaled but is always inferred to be 0.
[00106] Harmonization Between the BD-WACP and BDOF: The BD-WACP can also be freely combined with the BDOF. More specifically, when the two tools are combined, the original prediction signals P_0 and P_1 are still applied to estimate the sample-wise refinement Δ_BDOF, as depicted in the “bi-directional optical flow” (BDOF) section, which is then added to the bi-prediction signal enhanced by the BD-WACP, as shown as
P(i, j) = P_WACP(i, j) + Δ_BDOF(i, j)    (17)
[00107] In some embodiments, it is proposed to always disable the BDOF when the BD-WACP is applied to one bi-predicted coding block.
[00108] BD-WACP weight signaling
[00109] As shown above, for the explicit mode, one BD-WACP weight needs to be signaled from the encoder to the decoder to reconstruct the bi-prediction signal of one BD-WACP coding block. To save the overhead of signaling those weight values, variable-length code-words should be designed to accommodate the specific distribution of the weight values of the BD-WACP mode. In general, the BD-WACP weight 0 (i.e., default bi-prediction) is considered to be the most frequently selected weight and should be assigned the shortest code-word. The weight values with larger absolute values are selected less often due to the relatively large modifications they make to the AC components in the reference blocks. Therefore, they should be assigned longer code-words. Based on this spirit, Table 4 and Table 5 show two BD-WACP weight binarization methods when three weights {−1, 0, 1} or five weights {−6, −1, 0, 1, 6} are applied to the BD-WACP mode.
Table 4 Binarization of three BD-WACP weights
Table 5 Binarization of five BD-WACP weights
[00110] In practice, other binarization methods may also be applied. For instance, the digits 0 and 1 in Table 4 and Table 5 can be switched based on the same design spirit.
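One plausible variable-length binarization following the principle above (the shortest codeword for weight 0 and longer codewords for larger magnitudes) is sketched below. The concrete codewords of Table 4 and Table 5 are not reproduced in this text, so the mapping used here is purely illustrative and does not claim to match those tables.

```python
def binarize_weight(w_int, allowed=(-6, -1, 0, 1, 6)):
    """Illustrative truncated-unary-plus-sign binarization of a BD-WACP weight.

    '0'                    -> weight 0 (most frequent, shortest codeword)
    '1' + sign + level bin -> non-zero weights, longer for larger magnitudes.
    """
    if w_int == 0:
        return "0"
    sign_bit = "0" if w_int > 0 else "1"
    magnitudes = sorted({abs(w) for w in allowed if w != 0})
    level = magnitudes.index(abs(w_int))   # 0 for the smaller magnitude, 1 for the larger
    suffix = "1" * level + "0" if len(magnitudes) > 1 else ""
    return "1" + sign_bit + suffix

for w in (-6, -1, 0, 1, 6):
    print(w, binarize_weight(w))
```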
[00111] FIG. 10 shows a computing environment 1010 coupled with a user interface 1060. The computing environment 1010 can be part of a data processing server. The computing environment 1010 includes processor 1020, memory 1040, and I/O interface 1050.
[00112] The processor 1020 typically controls overall operations of the computing environment 1010, such as the operations associated with the display, data acquisition, data communications, and image processing. The processor 1020 may include one or more processors to execute instructions to perform all or some of the steps in the above-described methods. Moreover, the processor 1020 may include one or more modules that facilitate the interaction between the processor 1020 and other components. The processor may be a Central Processing Unit (CPU), a microprocessor, a single chip machine, a GPU, or the like.
[00113] The memory 1040 is configured to store various types of data to support the operation of the computing environment 1010. Memory 1040 may include predetermined software 1042. Examples of such data comprise instructions for any applications or methods operated on the computing environment 1010, video datasets, image data, etc. The memory 1040 may be implemented by using any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic or optical disk.
[00114] The I/O interface 1050 provides an interface between the processor 1020 and peripheral interface modules, such as a keyboard, a click wheel, buttons, and the like. The buttons may include but are not limited to, a home button, a start scan button, and a stop scan button. The I/O interface 1050 can be coupled with an encoder and decoder.
[00115] In some embodiments, there is also provided a non-transitory computer-readable storage medium comprising a plurality of programs, such as comprised in the memory 1040, executable by the processor 1020 in the computing environment 1010, for performing the above-described methods. For example, the non-transitory computer-readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disc, an optical data storage device or the like.
[00116] The non-transitory computer-readable storage medium has stored therein a plurality of programs for execution by a computing device having one or more processors, where the plurality of programs when executed by the one or more processors, cause the computing device to perform the above-described method for motion prediction.
[00117] In some embodiments, the computing environment 1010 may be implemented with one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field- programmable gate arrays (FPGAs), graphical processing units (GPUs), controllers, micro controllers, microprocessors, or other electronic components, for performing the above methods.
[00118] The description of the present disclosure has been presented for purposes of illustration and is not intended to be exhaustive or limited to the present disclosure. Many modifications, variations, and alternative implementations will be apparent to those of ordinary skill in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings.
[00119] The examples were chosen and described in order to explain the principles of the disclosure and to enable others skilled in the art to understand the disclosure for various implementations and to best utilize the underlying principles and various implementations with various modifications as are suited to the particular use contemplated. Therefore, it is to be understood that the scope of the disclosure is not to be limited to the specific examples of the implementations disclosed and that modifications and other implementations are intended to be included within the scope of the present disclosure.

Claims

CLAIMS What is claimed is:
1. A method for video decoding in weighted alternating current prediction (WACP), comprising: obtaining a plurality of inter prediction blocks from a number of temporal reference pictures associated with a video block; obtaining a low-frequency signal based on the plurality of inter prediction blocks; obtaining a plurality of high-frequency signals based on the plurality of inter prediction blocks, wherein at least one of the plurality of high-frequency signals is associated with one prediction block; determining at least one weight associated with the high-frequency signal of at least one of the inter prediction blocks; and calculating a final prediction signal of the video block based on a weighted sum of the low-frequency signal and the plurality of high-frequency signals using the at least one weight.
2. The method of claim 1, wherein calculating the final prediction signal of the video block comprises: obtaining a combined high-frequency signal based on a weighted sum of the plurality of high-frequency signals, wherein at least one of the plurality of high-frequency signals is weighted by a corresponding weight associated with the at least one of the plurality of high-frequency signals; and calculating the final prediction signal of the video block as a sum of the low-frequency signal and the combined high-frequency signal of the video block.
3. The method of claim 1, wherein the low-frequency signal comprises a direct current (DC) component of the plurality of inter prediction blocks; and wherein at least one of the high-frequency signals comprises an alternating current (AC) component of a corresponding inter prediction block of the plurality of inter prediction blocks.
4. The method of claim 1, further comprising: receiving, from a bitstream, a total number of inter prediction blocks used for calculating the final prediction signal of the video block.
5. The method of claim 1, wherein determining the at least one weight associated with the high-frequency signal of the at least one of the prediction blocks comprises: identifying, when a current block is coded with merge mode, a candidate block from a plurality of merge candidate blocks; and determining a plurality of weights based on the candidate block.
6. The method of claim 1, wherein determining the at least one weight associated with the high-frequency signal of the at least one of the prediction blocks comprises: receiving, when the video block is not coded with merge mode, an index identifying the plurality of weights from a predefined set of weights for the block; and determining a plurality of weights based on the index.
7. The method of claim 6, further comprising: identifying a weight of a high-frequency signal by subtracting weights of all other high-frequency signals from one.
8. The method of claim 6, further comprising: identifying a weight of a high-frequency signal by subtracting weights of all other high-frequency signals from zero.
9. The method of claim 7, further comprising: quantizing the plurality of weights in the predefined set as one integer value right shifted by a fixed number.
10. The method of claim 9, wherein the predefined set of weights comprise {5, 0, 3} and the fixed number is set to 3.
11. The method of claim 1, wherein a total number of inter prediction blocks of a current block is equal to 2.
12. The method of claim 11, further comprising: obtaining, when the video block is coded with bi-directional optical flow (BDOF), sample refinements based on samples of the plurality of the inter prediction blocks; and obtaining the final prediction signal based on the low-frequency signal, the plurality of high-frequency signals, and the sample refinements.
13. The method of claim 12, wherein obtaining the final prediction signal based on the low-frequency signal, the plurality of high-frequency signals, and the sample refinements comprises: obtaining a combined high-frequency signal based on a weighted sum of the plurality of high-frequency signals, wherein at least one of the plurality of high-frequency signals is weighted by a corresponding weight associated with the at least one of the plurality of high-frequency signals; and calculating the final prediction signal of the video block as a sum of the low-frequency signal, the combined high-frequency signal of the video block, and the sample refinements.
14. A computing device, comprising: one or more processors; and a non-transitory computer-readable storage medium storing instructions executable by the one or more processors, wherein the one or more processors are configured to perform the method in any of claims 1-13.
15. A non-transitory computer-readable storage medium storing a plurality of programs for execution by a computing device having one or more processors, wherein the plurality of programs, when executed by the one or more processors, cause the computing device to perform the method in any of claims 1-13.
PCT/US2021/043335 2020-07-27 2021-07-27 Weighted ac prediction for video coding WO2022026480A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180059215.9A CN116158079A (en) 2020-07-27 2021-07-27 Weighted AC prediction for video codec

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063057290P 2020-07-27 2020-07-27
US63/057,290 2020-07-27

Publications (1)

Publication Number Publication Date
WO2022026480A1 true WO2022026480A1 (en) 2022-02-03

Family

ID=80036689

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/043335 WO2022026480A1 (en) 2020-07-27 2021-07-27 Weighted ac prediction for video coding

Country Status (2)

Country Link
CN (1) CN116158079A (en)
WO (1) WO2022026480A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080240238A1 (en) * 2007-03-28 2008-10-02 Tomonobu Yoshino Intra prediction system of video encoder and video decoder
US7440629B2 (en) * 2003-12-19 2008-10-21 Matsushita Electric Industrial Co., Ltd. Image encoding apparatus and image encoding method
US8005143B2 (en) * 1997-10-23 2011-08-23 Mitsubishi Denki Kabushiki Kaisha Imaging decoding apparatus
US20140355674A1 (en) * 2011-06-22 2014-12-04 Blackberry Limited Compressing Image Data
WO2020150080A1 (en) * 2019-01-14 2020-07-23 Interdigital Vc Holdings, Inc. Method and apparatus for video encoding and decoding with bi-directional optical flow adapted to weighted prediction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8005143B2 (en) * 1997-10-23 2011-08-23 Mitsubishi Denki Kabushiki Kaisha Imaging decoding apparatus
US7440629B2 (en) * 2003-12-19 2008-10-21 Matsushita Electric Industrial Co., Ltd. Image encoding apparatus and image encoding method
US20080240238A1 (en) * 2007-03-28 2008-10-02 Tomonobu Yoshino Intra prediction system of video encoder and video decoder
US20140355674A1 (en) * 2011-06-22 2014-12-04 Blackberry Limited Compressing Image Data
WO2020150080A1 (en) * 2019-01-14 2020-07-23 Interdigital Vc Holdings, Inc. Method and apparatus for video encoding and decoding with bi-directional optical flow adapted to weighted prediction

Also Published As

Publication number Publication date
CN116158079A (en) 2023-05-23

Similar Documents

Publication Publication Date Title
US11343541B2 (en) Signaling for illumination compensation
US11297348B2 (en) Implicit transform settings for coding a block of pixels
CN113678452A (en) Constraint on decoder-side motion vector refinement
Gao et al. An overview of AVS2 standard
CN111373749A (en) Method and apparatus for low complexity bi-directional intra prediction in video encoding and decoding
WO2021188598A1 (en) Methods and devices for affine motion-compensated prediction refinement
CN114128263A (en) Method and apparatus for adaptive motion vector resolution in video coding and decoding
CN117813816A (en) Method and apparatus for decoder-side intra mode derivation
EP4320863A1 (en) Geometric partition mode with explicit motion signaling
WO2022081878A1 (en) Methods and apparatuses for affine motion-compensated prediction refinement
WO2022032028A1 (en) Methods and apparatuses for affine motion-compensated prediction refinement
JP2023523839A (en) Entropy coding for motion accuracy syntax
WO2022026480A1 (en) Weighted ac prediction for video coding
CN114342390B (en) Method and apparatus for prediction refinement for affine motion compensation
WO2021188707A1 (en) Methods and apparatuses for simplification of bidirectional optical flow and decoder side motion vector refinement
WO2023192335A1 (en) Methods and devices for candidate derivation for affine merge mode in video coding
WO2023220444A1 (en) Methods and devices for candidate derivation for affine merge mode in video coding
WO2024010831A1 (en) Methods and devices for candidate derivation for affine merge mode in video coding
WO2023205185A1 (en) Methods and devices for candidate derivation for affine merge mode in video coding
WO2023205283A1 (en) Methods and devices for enhanced local illumination compensation
WO2024006231A1 (en) Methods and apparatus on chroma motion compensation using adaptive cross-component filtering
WO2023158766A1 (en) Methods and devices for candidate derivation for affine merge mode in video coding
WO2024044404A1 (en) Methods and devices using intra block copy for video coding
JP2024523534A (en) Geometric partition mode with motion vector refinement
WO2024010832A1 (en) Methods and apparatus on chroma motion compensation using adaptive cross-component filtering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21849703

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21849703

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17/05/2023)