CN115567709A - Method and apparatus for encoding samples at pixel locations in sub-blocks


Info

Publication number
CN115567709A
Authority
CN
China
Prior art keywords
sub
block
filter
samples
pixel
Prior art date
Legal status
Pending
Application number
CN202211357666.3A
Other languages
Chinese (zh)
Inventor
陈伟
修晓宇
陈漪纹
马宗全
朱弘正
郭哲瑋
王祥林
于冰
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Publication of CN115567709A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/119 Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/537 Motion estimation other than block-based
    • H04N19/54 Motion estimation other than block-based using feature points or meshes
    • H04N19/513 Processing of motion vectors
    • H04N19/523 Motion estimation or motion compensation with sub-pixel accuracy
    • H04N19/527 Global motion vector estimation
    • H04N19/563 Motion estimation with padding, i.e. with filling of non-object values in an arbitrarily shaped picture block or region for estimation purposes
    • H04N19/573 Motion compensation with multiple frame prediction using two or more reference frames in a given prediction direction
    • H04N19/577 Motion compensation with bidirectional frame interpolation, i.e. using B-pictures
    • H04N19/80 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A method and apparatus for encoding samples at pixel locations in sub-blocks is provided. The method may include generating a plurality of affine motion compensated predictions at a pixel location and at a plurality of neighboring pixel locations in a sub-block; determining a motion vector (MV) difference at at least one of the plurality of neighboring pixel locations; determining coefficients of a filter having a predetermined shape based on the MV difference; and obtaining a refined prediction of the sample at the pixel location using the filter, based on the plurality of affine motion compensated predictions.

Description

Method and apparatus for encoding samples at pixel locations in sub-blocks
The present application is a divisional application of the invention patent application entitled "Method and apparatus for predictive refinement for affine motion compensation", filed on 2021/07/30 with application number 202180004943.X.
Technical Field
The present disclosure relates to video coding and compression, and in particular, but not limited to, methods and apparatus for prediction refinement of affine motion compensation (AMPR) in video coding.
Background
Various video coding techniques may be used to compress video data. Video coding is performed according to one or more video coding standards. For example, some well-known video coding standards today include Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC, also known as H.265 or MPEG-H Part 2), and Advanced Video Coding (AVC, also known as H.264 or MPEG-4 Part 10), developed jointly by ISO/IEC MPEG and ITU-T VCEG. AOMedia Video 1 (AV1) was developed by the Alliance for Open Media (AOM) as a successor to its previous standard VP9. Audio Video coding Standard (AVS), which refers to digital audio and digital video compression standards, is another series of video compression standards developed by the Audio and Video Coding Standard Workgroup of China. Most existing video coding standards build on the well-known hybrid video coding framework, i.e., they use block-based prediction methods (e.g., inter prediction, intra prediction) to reduce the redundancy present in video images or sequences, and use transform coding to compact the energy of the prediction error. An important goal of video coding techniques is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality.
The first-generation AVS standard includes the Chinese national standards "Information Technology, Advanced Audio Video Coding, Part 2: Video" (known as AVS1) and "Information Technology, Advanced Audio Video Coding, Part 16: Broadcast Television Video" (known as AVS+). It can provide a bit rate saving of about 50% compared with the MPEG-2 standard at the same perceptual quality. The video part of the AVS1 standard was issued as a Chinese national standard in February 2006. The second-generation AVS standard includes the Chinese national standard series "Information Technology, High Efficiency Multimedia Coding" (known as AVS2), which mainly targets the transmission of ultra-high-definition television programs. The coding efficiency of AVS2 is twice that of AVS+. AVS2 was issued as a Chinese national standard in May 2016. Meanwhile, the video part of the AVS2 standard was submitted by the Institute of Electrical and Electronics Engineers (IEEE) as one international standard for applications. The AVS3 standard is a new generation of video coding standard for UHD video applications, aiming to surpass the coding efficiency of the latest international standard HEVC. In March 2019, at the 68th AVS meeting, the AVS3-P2 baseline was completed, which provides a bit rate saving of about 30% over the HEVC standard. Currently, there is a reference software called the High Performance Model (HPM), maintained by the AVS group, to demonstrate a reference implementation of the AVS3 standard.
Disclosure of Invention
The present disclosure provides examples of techniques related to AMPR for the AVS3 standard.
According to a first aspect of the present disclosure, there is provided a method for encoding samples at pixel locations in a sub-block, comprising: generating a plurality of affine motion compensated predictions at the pixel location and at a plurality of neighboring pixel locations in the sub-block; determining a motion vector (MV) difference at at least one of the plurality of neighboring pixel locations; determining coefficients of a filter having a predetermined shape based on the MV difference; and obtaining a refined prediction of the sample at the pixel location using the filter, based on the plurality of affine motion compensated predictions.
According to a second aspect of the present disclosure, there is provided an apparatus for encoding samples at pixel locations in a sub-block. The apparatus comprises one or more processors and a memory configured to store instructions executable by the one or more processors. Upon execution of the instructions, the one or more processors are configured to: generate a plurality of affine motion compensated predictions at the pixel location and at a plurality of neighboring pixel locations in the sub-block; determine a motion vector (MV) difference at at least one of the plurality of neighboring pixel locations; determine coefficients of a filter having a predetermined shape based on the MV difference; and obtain a refined prediction of the sample at the pixel location using the filter, based on the plurality of affine motion compensated predictions.
According to a third aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium for encoding samples at pixel locations in sub-blocks, storing computer-executable instructions that, when executed by one or more computer processors, cause the one or more computer processors to perform acts comprising: generating a plurality of affine motion compensated predictions at the pixel location and at a plurality of neighboring pixel locations in the sub-block; determining a motion vector (MV) difference at at least one of the plurality of neighboring pixel locations; determining coefficients of a filter having a predetermined shape based on the MV difference; and obtaining a refined prediction of the sample at the pixel location using the filter, based on the plurality of affine motion compensated predictions.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing a bitstream encoded by performing the above-described method.
Drawings
A more detailed description of examples of the disclosure will be rendered by reference to specific examples thereof which are illustrated in the appended drawings. In view of the fact that these drawings depict only some examples and are therefore not to be considered limiting of scope, the examples will be described and explained with additional specificity and detail through the use of the accompanying drawings.
Fig. 1 is a block diagram illustrating an exemplary video encoder according to some embodiments of the present disclosure.
Fig. 2 is a block diagram illustrating an exemplary video decoder according to some embodiments of the present disclosure.
Fig. 3A-3E are schematic diagrams illustrating a multi-type tree partitioning pattern, according to some embodiments of the present disclosure.
FIG. 4 is a schematic diagram illustrating an example of a bi-directional optical flow (BIO) model according to some implementations of the present disclosure.
Fig. 5A-5B are schematic diagrams illustrating examples of 4-parameter affine models according to some embodiments of the present disclosure.
FIG. 6 is a schematic diagram illustrating an example of a 6-parameter affine model according to some embodiments of the present disclosure.
Fig. 7 illustrates a prediction refinement with optical flow (PROF) process for affine mode according to some embodiments of the present disclosure.
Fig. 8 illustrates an example of calculating horizontal and vertical offsets from a sample position to a particular position of a sub-block from which the sub-block MV is derived, according to some embodiments of the present disclosure.
FIG. 9 illustrates an example of sub-blocks within an affine CU in accordance with some embodiments of the present disclosure.
Fig. 10A-10B illustrate examples of diamond-shaped filters according to some embodiments of the present disclosure.
Fig. 11A-11B illustrate examples of diamond filters scaled by MV differences in the horizontal and vertical directions according to some embodiments of the present disclosure.
Fig. 12 is a block diagram illustrating an example apparatus for predicting samples at pixel locations in a sub-block through AMPR according to some embodiments of the present disclosure.
Fig. 13 is a flow diagram illustrating an exemplary process for predicting samples at pixel locations in a sub-block by AMPR according to some embodiments of the present disclosure.
Fig. 14 is a flow diagram illustrating steps in an exemplary process for predicting samples at pixel locations in a sub-block by AMPR in accordance with some embodiments of the present disclosure.
Detailed Description
Reference will now be made in detail to the present embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to provide an understanding of the subject matter presented herein. It will be apparent to those of ordinary skill in the art that various alternatives may be used. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein may be implemented on many types of electronic devices having digital video capabilities.
Reference throughout this specification to "one embodiment," "an example," "some embodiments," "some examples," or similar language means that a particular feature, structure, or characteristic described is included in at least one embodiment or example. Features, structures, elements, or characteristics described in connection with one or some embodiments may be applicable to other embodiments as well, unless expressly stated otherwise.
Throughout the disclosure, unless explicitly stated otherwise, the terms "first," "second," "third," and the like are used merely as labels to refer to relevant elements (e.g., devices, components, compositions, steps, etc.) and do not indicate any spatial or temporal order. For example, "first device" and "second device" may refer to two separately formed devices, or two parts, components, or operating states of the same device, and may be arbitrarily named.
The terms "module," "sub-module," "circuit," "sub-circuit," "circuitry," "sub-circuitry," "unit" or "sub-unit" may comprise memory (shared, dedicated, or combined) that stores code or instructions executable by one or more processors. A module may comprise one or more circuits, with or without stored code or instructions. A module or circuit may include one or more components connected directly or indirectly. These components may or may not be physically attached to each other or positioned adjacent to each other.
As used herein, the term "if" or "when" may be understood to mean "upon" or "in response to", depending on the context. These terms, if they appear in the claims, may not indicate that the associated limitation or feature is conditional or optional. For example, a method may comprise the steps of: i) performing a function or action X' when condition X exists or if condition X exists, and ii) performing a function or action Y' when condition Y exists or if condition Y exists. The method may be implemented with both the ability to perform function or action X' and the ability to perform function or action Y'. Thus, both function X' and function Y' may be performed, at different times, over multiple executions of the method.
The units or modules may be implemented purely in software, purely in hardware or in a combination of hardware and software. In a purely software embodiment, a unit or module may comprise, for example, functionally related code blocks or software components linked together directly or indirectly to perform a particular function.
Fig. 1 shows a block diagram illustrating an exemplary block-based hybrid video encoder 100 that may be used in connection with many video codec standards that use block-based processing. In encoder 100, a video frame is partitioned into multiple video blocks for processing. For each given video block, a prediction is formed based on either an inter prediction method or an intra prediction method. In inter-frame prediction, one or more predictors are formed by motion estimation and motion compensation based on pixel points from a previously reconstructed frame. In intra prediction, a predictor is formed based on reconstructed pixels in the current frame. Through the mode decision, the best predictor can be selected to predict the current block.
Intra-prediction (also referred to as "spatial prediction") uses pixels from samples (referred to as reference samples) of already coded neighboring blocks in the same video picture and/or slice to predict the current video block. Spatial prediction reduces the spatial redundancy inherent in video signals.
Inter prediction (also referred to as "temporal prediction") uses reconstructed pixels from already coded video pictures to predict the current video block. Temporal prediction reduces temporal redundancy inherent in video signals. The temporal prediction signal of a given Coding Unit (CU) or coding block is typically signaled by one or more Motion Vectors (MV) indicating the amount and direction of motion between the current CU and its temporal reference. Furthermore, if multiple reference pictures are supported, one reference picture index is additionally sent, wherein the reference picture index is used to identify from which reference picture in the reference picture store the temporal prediction signal came.
After performing spatial and/or temporal prediction, the intra/inter mode decision circuit 121 in the encoder 100 selects the best prediction mode, e.g., based on a rate-distortion optimization method. The block predictor 120 is then subtracted from the current video block; and the resulting prediction residual is decorrelated using transform circuitry 102 and quantization circuitry 104. The resulting quantized residual coefficients are dequantized by dequantization circuit 116 and inverse transformed by inverse transform circuit 118 to form a reconstructed residual, which is then added back to the prediction block to form the reconstructed signal for the CU. Furthermore, before the reconstructed CU is placed in a reference picture store of picture buffer 117 and used to encode subsequent video blocks, loop filtering 115, such as a deblocking filter, sample Adaptive Offset (SAO), and/or Adaptive Loop Filter (ALF), may be applied to the reconstructed CU. To form the output video bitstream 114, the coding mode (inter or intra), prediction mode information, motion information, and quantized residual coefficients are all sent to the entropy coding unit 106 for further compression and packetization to form the bitstream.
For example, deblocking filters are available in current versions of AVC, HEVC, and VVC. In HEVC, an additional loop filter called Sample Adaptive Offset (SAO) is defined for further improving the coding efficiency. In the current version of the VVC standard, a further loop filter called an Adaptive Loop Filter (ALF) is being actively studied, and it is highly likely to be included in the final standard.
These loop filter operations are optional. Performing these operations helps to improve codec efficiency and visual quality. They may also be turned off based on decisions made by the encoder 100 to save computational complexity.
It should be noted that in the case where the encoder 100 turns on these filter options, intra prediction is typically based on unfiltered reconstructed pixels, while inter prediction is based on filtered reconstructed pixels.
Fig. 2 is a block diagram illustrating an exemplary block-based video decoder 200 that may be used in connection with many video codec standards. The decoder 200 is similar to the reconstruction related parts residing in the encoder 100 of fig. 1. In the decoder 200, an input video bitstream 201 is first decoded by entropy decoding 202 to derive quantized coefficient levels and prediction related information. The quantized coefficient levels are then processed by inverse quantization 204 and inverse transformation 206 to obtain reconstructed prediction residuals. The block predictor mechanism implemented in the intra/inter mode selector 212 is configured to perform either intra prediction 208 or motion compensation 210 based on the decoded prediction information. A set of unfiltered reconstructed pixel points is obtained by summing the reconstructed prediction residual from the inverse transform 206 and the prediction output produced by the block predictor mechanism using summer 214.
The reconstructed block may further pass through a loop filter 209 before being stored in a picture buffer 213, which serves as a reference picture store. The reconstructed video in the picture buffer 213 may be sent to drive a display device and used to predict subsequent video blocks. With loop filter 209 open, a filtering operation is performed on these reconstructed pixels to derive a final reconstructed video output 222.
The video encoding/decoding standards mentioned above (such as HEVC and AVS3) are conceptually similar. For example, they all use a block-based hybrid video coding framework. The block partitioning scheme in some standards is detailed below.
HEVC partitions blocks based on only quadtrees. The basic unit for compression is called a Coding Tree Unit (CTU). Each CTU may include one Coding Unit (CU) or be recursively divided into four smaller CUs until a predefined minimum CU size is reached. Each CU (also referred to as a leaf CU) includes one or more Prediction Units (PUs) and Transform Unit (TU) trees.
In AVS3, one Coding Tree Unit (CTU) is divided into CUs based on a quadtree/binary tree/extended quadtree to adapt to the varying local characteristics. In addition, the concept of multiple partition unit types in HEVC is removed, i.e., there is no distinction among CU, PU and TU in AVS3. Instead, each CU is always used as a base unit for both prediction and transform without further partitioning. In the tree splitting structure of AVS3, one CTU is first split based on the quadtree structure. Each quadtree leaf node may then be further partitioned based on the binary tree and the extended quadtree structure.
Fig. 3A-3E are schematic diagrams illustrating multi-type tree partitioning patterns according to some embodiments of the present disclosure. As shown in fig. 3A to 3E, there are five partition types in the multi-type tree structure: quad partition 301, vertical binary partition 302, horizontal binary partition 303, vertical extended quad partition 304, and horizontal extended quad partition 305.
In the current VVC and AVS3 standards, block-based motion compensation may be applied to achieve a tradeoff between codec efficiency, complexity, and memory access bandwidth. The average prediction accuracy is inferior to pixel-based prediction because all pixels within each block or sub-block share the same block-level motion vector. To improve the prediction accuracy at each pixel, the PROF for affine mode is adopted as the codec tool in the current VVC standard. In AVS3, there is no similar tool.
Some examples of the disclosure provide an alternative optical flow-based approach to improving affine mode prediction accuracy, as described below.
Affine mode
In HEVC, only the translational motion model is applied to motion compensated prediction. In the real world, however, there are many kinds of motions, such as zoom-in/out, rotation, perspective motion, and other irregular motions. In VVC and AVS3, affine motion compensated prediction is applied by signaling a flag for each inter coded block to indicate whether a translational motion model or an affine motion model is applied for inter prediction. In current VVC and AVS3 designs, two affine modes are supported for one affine codec block, including a 4-parameter affine mode and a 6-parameter affine mode.
The 4-parameter affine model may have the following parameters: two parameters for translational motion in the horizontal and vertical directions, respectively, one parameter for zooming motion for both directions and one parameter for rotational motion for both directions. In a 4-parameter affine model, the horizontal scaling parameter may be equal to the vertical scaling parameter, and the horizontal rotation parameter may be equal to the vertical rotation parameter. To achieve a better harmonization of the motion vectors and affine parameters, those affine parameters can be derived from two MVs, also called Control Point Motion Vectors (CPMVs), located at the top left and right corners of the current block.
Fig. 5A-5B are diagrams illustrating examples of 4-parameter affine models according to some embodiments of the present disclosure. As shown in Fig. 5A to 5B, the affine motion field of the block is described by two control point MVs (V_0, V_1). Based on the control point motion, the motion field (v_x, v_y) of an affine coded block is described as the following equation (1):
v_x = ((v_1x − v_0x) / w) · x − ((v_1y − v_0y) / w) · y + v_0x
v_y = ((v_1y − v_0y) / w) · x + ((v_1x − v_0x) / w) · y + v_0y        (1)
the 6-parameter affine pattern may have the following parameters: two parameters for translational motion in the horizontal and vertical directions, respectively, two parameters for zoom and rotational motion in the horizontal direction, respectively, and two other parameters for zoom and rotational motion in the vertical direction, respectively. The 6-parameter affine motion model is coded using three CPMVs.
FIG. 6 is a diagram illustrating an example of a 6-parameter affine model according to some embodiments of the present disclosure. As shown in fig. 6, the three control points for a 6-parameter affine block are located at the top left, top right and bottom left corners of the block. The motion at the upper left control point is related to translational motion, and the motion at the upper right control point is related to rotational and zooming motion in the horizontal direction, and the motion at the lower left control point is related to rotational and zooming motion in the vertical direction. Compared to a 4-parameter affine motion model, the rotation motion and the scaling motion in the horizontal direction of a 6-parameter affine motion model may be different from the rotation motion and the scaling motion in the vertical direction.
In some examples, when (V_0, V_1, V_2) are the MVs of the top-left, top-right and bottom-left corners of the current block in Fig. 6, the motion vector (v_x, v_y) of each sub-block is derived using the three MVs at the control points as in the following equation (2):
v_x = v_0x + ((v_1x − v_0x) / w) · x + ((v_2x − v_0x) / h) · y
v_y = v_0y + ((v_1y − v_0y) / w) · x + ((v_2y − v_0y) / h) · y        (2)
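Purely as an illustration of equations (1) and (2), the following C++ sketch derives the MV at a position (x, y) of an affine coding block from its control point MVs. Floating-point arithmetic is used for readability (real implementations such as the HPM reference software work in fixed-point MV precision), and the type and function names here are illustrative assumptions rather than part of any standard.

```cpp
// Sketch of equations (1) and (2): derive the MV at position (x, y) of an
// affine coding block of size w x h from its control point MVs (CPMVs)
// v0 (top-left), v1 (top-right) and v2 (bottom-left, 6-parameter model only).
struct MV { double x, y; };

MV deriveAffineMv(double x, double y, int w, int h,
                  MV v0, MV v1, MV v2, bool sixParam) {
    MV mv;
    if (!sixParam) {
        // 4-parameter model, equation (1): scaling and rotation terms are
        // shared by the horizontal and vertical directions.
        double c = (v1.x - v0.x) / w;
        double e = (v1.y - v0.y) / w;
        mv.x = c * x - e * y + v0.x;
        mv.y = e * x + c * y + v0.y;
    } else {
        // 6-parameter model, equation (2): independent terms per direction.
        mv.x = v0.x + (v1.x - v0.x) / w * x + (v2.x - v0.x) / h * y;
        mv.y = v0.y + (v1.y - v0.y) / w * x + (v2.y - v0.y) / h * y;
    }
    return mv;
}
```

Evaluating this function at the pilot position of each sub-block (e.g., its center) yields the sub-block MVs used by sub-block based affine motion compensation.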
prediction refinement with optical flow (PROF) for affine mode
To improve the accuracy of affine motion compensation, a PROF is employed in VVC, which refines subblock-based affine motion compensation based on an optical flow model. In particular, after performing sub-block based affine motion compensation, each luminance prediction sample of an affine block is modified by a sample refinement value derived based on optical flow equations. In some examples, the operation of the PROF can be summarized as the following four steps:
in a first step, sub-block based affine motion compensation is performed using the sub-block MV as derived from equation (1) above for a 4-parameter affine model or the sub-block MV as derived from equation (2) above for a 6-parameter affine model to produce sub-block predictions I (I, j).
Furthermore, in a second step, the spatial gradients g_x(i, j) and g_y(i, j) of each prediction sample are calculated as the following equation (3):
g_x(i,j) = I(i+1, j) − I(i−1, j)
g_y(i,j) = I(i, j+1) − I(i, j−1)        (3)
therefore, to calculate the gradient, one extra row and/or column of prediction samples needs to be generated on each of the four sides of one sub-block, which expands the 4 × 4 sub-block into a 6 × 6 sub-block. To reduce memory bandwidth and complexity, samples on the extended boundary are copied from the nearest integer pixel position in the reference picture to avoid an additional interpolation process.
In addition, in the third step, a luminance prediction refinement value is calculated by the following equation (4):
ΔI(i,j) = g_x(i,j) · Δv_x(i,j) + g_y(i,j) · Δv_y(i,j)        (4)
where Δv(i,j) is the difference between the pixel MV v(i,j) computed for sample position (i,j) and the sub-block MV of the sub-block in which pixel (i,j) is located.
Fig. 7 illustrates the PROF process for affine mode, according to some embodiments of the present disclosure. In PROF, after adding the prediction refinement to the original prediction samples, a clipping operation "clip3" is performed to clip the values of the refined prediction samples to within 15 bits, as shown in the following equations:
I_r(i,j) = I(i,j) + ΔI(i,j)
I_r(i,j) = clip3(−dILimit, dILimit − 1, I_r(i,j))
dILimit = 1 << max(13, BitDepth + 1)
where I(i,j) and I_r(i,j) are the original and the refined prediction sample values at location (i,j), respectively. The function clip3(min, max, val) limits the given value val to the range [min, max].
Because the affine model parameters and the pixel positions relative to the center of the sub-block do not change from sub-block to sub-block, Δv(i,j) may be calculated for the first sub-block and reused for the other sub-blocks in the same CU. When Δx and Δy are the horizontal and vertical offsets from the sample position (i,j) to the center of the sub-block to which the sample belongs, Δv(i,j) can be derived as shown in the following equation (5):
Δv_x(i,j) = c · Δx + d · Δy
Δv_y(i,j) = e · Δx + f · Δy        (5)
the parameters c, d, e, and f in equation (5) above are derived based on the affine sub-block MV to derive equations (1) and (2). Specifically, for a 4-parameter affine model, the parameters c, d, e, and f can be derived as shown in the following equations:
Figure BDA0003920776370000101
further, for a 6-parameter affine model, the parameters c, d, e, and f can be derived as shown in the following equations:
Figure BDA0003920776370000102
where (v_0x, v_0y), (v_1x, v_1y), and (v_2x, v_2y) are the top-left, top-right, and bottom-left control point MVs of the current coding block, and w and h are the width and height of the block. In PROF, the MV differences Δv_x and Δv_y are always derived with an accuracy of 1/32 pixel.
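As a hedged illustration of equation (5) and the parameter derivation above, the sketch below computes Δv(i, j) for a 4×4 sub-block whose MV is taken at the sub-block center; the 1/32-pel fixed-point handling of PROF is omitted, and all names are illustrative.

```cpp
// Sketch of equation (5): per-pixel MV difference of a 4x4 sub-block relative
// to its center, from the CPMVs (v0x, v0y), (v1x, v1y), (v2x, v2y) of a coding
// block of size w x h (the coding block size, not the sub-block size).
void deriveProfDeltaMv(double v0x, double v0y, double v1x, double v1y,
                       double v2x, double v2y, int w, int h, bool sixParam,
                       double dvx[4][4], double dvy[4][4]) {
    double c = (v1x - v0x) / w;                 // dMVx / dx
    double e = (v1y - v0y) / w;                 // dMVy / dx
    double d = sixParam ? (v2x - v0x) / h : -e; // dMVx / dy
    double f = sixParam ? (v2y - v0y) / h :  c; // dMVy / dy
    for (int j = 0; j < 4; j++) {
        for (int i = 0; i < 4; i++) {
            double dx = i - 1.5;  // horizontal offset to the 4x4 sub-block center
            double dy = j - 1.5;  // vertical offset to the 4x4 sub-block center
            dvx[j][i] = c * dx + d * dy;  // equation (5), horizontal component
            dvy[j][i] = e * dx + f * dy;  // equation (5), vertical component
        }
    }
}
```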
Finally, in a fourth step, the luma prediction refinement ΔI(i,j) is added to the sub-block prediction I(i,j). The final prediction I′(i,j) for the sample at location (i,j) is generated as shown in the following equation (6):
I′(i,j) = I(i,j) + ΔI(i,j)        (6)
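A minimal sketch of PROF steps three and four (equations (4) and (6)) together with the clipping described above is given below; the fixed-point rounding applied to ΔI in the actual VVC specification is omitted, and the array layout and names are assumptions.

```cpp
#include <algorithm>

// Refine one 4x4 sub-block prediction sample by sample: equation (4) computes
// the refinement dI from the gradients and the per-pixel MV differences, and
// equation (6) plus clip3 adds it to the sub-block prediction.
void profRefine(int pred[4][4],
                const int gx[4][4], const int gy[4][4],
                const int dvx[4][4], const int dvy[4][4],
                int bitDepth) {
    const int dILimit = 1 << std::max(13, bitDepth + 1);
    for (int j = 0; j < 4; j++) {
        for (int i = 0; i < 4; i++) {
            int dI = gx[j][i] * dvx[j][i] + gy[j][i] * dvy[j][i];  // eq. (4)
            int refined = pred[j][i] + dI;                          // eq. (6)
            // clip3(-dILimit, dILimit - 1, refined): keep within 15 bits.
            pred[j][i] = std::min(dILimit - 1, std::max(-dILimit, refined));
        }
    }
}
```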
bidirectional optical flow (BIO)
Bi-prediction in video coding is a simple combination of two temporally predicted blocks obtained from a reference picture. However, due to the signalling cost and accuracy trade-off of motion vectors, the motion vectors received at the decoder end may be less accurate. As a result, there may still be remaining small motion that may be observed between the two prediction blocks, which may reduce the efficiency of motion compensated prediction. To address this problem, the BIO tool is employed in both the VVC and AVS3 standards to compensate for this motion for each sample point within a block. In particular, BIO is a sample-by-sample motion refinement performed on top of block-based motion compensated prediction when bi-prediction is used.
In the BIO design, the derivation of the refined motion vector for each sample in a block is based on the classical optical flow model. Let I^(k)(x, y) be the sample value at coordinate (x, y) of the prediction block derived from reference picture list k (k = 0, 1), and let ∂I^(k)(x,y)/∂x and ∂I^(k)(x,y)/∂y be the horizontal and vertical gradients of the sample. Assuming the optical flow model is valid, the motion refinement (v_x, v_y) at (x, y) can be derived by the following optical flow equation (7):
∂I^(k)(x,y)/∂t + v_x · ∂I^(k)(x,y)/∂x + v_y · ∂I^(k)(x,y)/∂y = 0        (7)
Using a combination of the optical flow equation (7) and the interpolation of the prediction blocks along the motion trajectory, as shown in Fig. 4, the BIO prediction can be obtained as shown in the following equation (8):
pred_BIO(x,y) = 1/2 · ( I^(0)(x,y) + I^(1)(x,y) + v_x/2 · (∂I^(1)(x,y)/∂x − ∂I^(0)(x,y)/∂x) + v_y/2 · (∂I^(1)(x,y)/∂y − ∂I^(0)(x,y)/∂y) )        (8)
Fig. 4 is a schematic diagram illustrating an example of the BIO model according to some embodiments of the present disclosure. As shown in Fig. 4, (MV_x0, MV_y0) and (MV_x1, MV_y1) indicate the block-level motion vectors used to generate the two prediction blocks I^(0) and I^(1). Further, the motion refinement (v_x, v_y) at sample position (x, y) is calculated by minimizing the difference Δ between the sample values after motion refinement compensation (i.e., A and B in Fig. 4), as shown in the following equation (9):
Δ(x,y) = I^(0)(x,y) − I^(1)(x,y) + v_x · (∂I^(0)(x,y)/∂x + ∂I^(1)(x,y)/∂x) + v_y · (∂I^(0)(x,y)/∂y + ∂I^(1)(x,y)/∂y)        (9)
In addition, to ensure uniformity of the derived motion refinement, it is assumed that the motion refinement is uniform within a local surrounding area centered at (x, y); thus, in the BIO design in AVS3, (v_x, v_y) is derived by minimizing Δ within a 4×4 window Ω around the current sample at (x, y), as shown in the following equation (10):
(v_x*, v_y*) = argmin_(v_x, v_y) Σ_((i,j)∈Ω) Δ²(i,j)        (10)
as shown in equations (8) and (10), in addition to the block level MC, it is also necessary to compensate the block (i.e., I) for each motion in the BIO (0) And I (1) ) In order to derive local motion refinement and to produce a final prediction at the location of the sample. In AVS3, the gradient is computed by a two-dimensional (2D) separable Finite Impulse Response (FIR) filtering process that defines a set of 8-tap filters and applies different filters to account for block-level motion vectors (e.g., (MV) in fig. 4) x0 ,MV y0 ) And (MV) x1 ,MV y1 ) ) derive horizontal and vertical gradients. Table 1 shows the coefficients of the gradient filter used by the BIO.
TABLE 1
Fractional position Gradient filter
0 {-4,11,-39,-1,41,-14,8,-2}
1/4 {-2,6,-19,-31,53,-12,7,-2}
1/2 {0,-1,0,-50,50,0,1,0}
3/4 {2,-7,12,-53,31,19,-6,2}
Finally, BIO is only applied to bi-predicted blocks predicted by two reference blocks from temporally neighboring pictures. In addition, BIO is enabled without sending additional information from the encoder to the decoder. Specifically, the BIO is applied to all bi-prediction blocks having both forward and backward prediction signals.
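To make the Table 1 filters concrete, the following sketch applies the 8-tap BIO gradient filter selected by the fractional position of the block-level MV along one direction; the tap alignment, normalization shift and boundary handling are assumptions and are not taken from the AVS3 specification.

```cpp
#include <cstdint>

// Coefficients of Table 1, indexed by fractional position 0, 1/4, 1/2, 3/4.
static const int kBioGradFilter[4][8] = {
    {-4, 11, -39,  -1, 41, -14,  8, -2},  // 0
    {-2,  6, -19, -31, 53, -12,  7, -2},  // 1/4
    { 0, -1,   0, -50, 50,   0,  1,  0},  // 1/2
    { 2, -7,  12, -53, 31,  19, -6,  2},  // 3/4
};

// ref points at the reference sample aligned with the current position; stride
// is 1 for a horizontal gradient or the line width for a vertical gradient.
int bioGradient(const int16_t* ref, int stride, int fracIdx) {
    int g = 0;
    for (int k = 0; k < 8; k++) {
        g += kBioGradFilter[fracIdx][k] * ref[(k - 3) * stride];  // assumed tap alignment
    }
    return g;  // unnormalized gradient value
}
```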
Final motion vector expression (UMVE)
The UMVE mode in the AVS3 standard is the same tool as the merge mode with motion vector difference (MMVD) in the VVC standard. In addition to the conventional merge mode, in which the motion information of the current block is derived from spatial/temporal neighboring blocks of the current block, the MMVD/UMVE mode is introduced as a special merge mode in both the VVC and AVS standards.
In particular, in both VVC and AVS3, the MMVD/UMVE mode is signaled at the coding block level by one MMVD flag. In MMVD mode, two base merge candidates are first generated as the first two candidates of the regular merge mode. After one base merge candidate is selected and signaled, additional syntax elements are signaled to indicate the MVD that is added to the motion of the selected merge candidate. The MMVD syntax elements include a merge candidate flag for selecting the base merge candidate, a distance index for specifying the magnitude of the MVD, and a direction index for indicating the direction of the MVD.
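For illustration only, the sketch below shows how an MMVD/UMVE motion vector difference could be reconstructed from the signalled distance and direction indices; the distance table follows the VVC MMVD design (offsets in quarter-luma samples) and is an assumption here, since the exact AVS3 UMVE tables may differ.

```cpp
// Hypothetical reconstruction of the MMVD/UMVE MVD from its syntax elements.
struct MvOffset { int x, y; };

MvOffset umveMvd(int distanceIdx, int directionIdx) {
    static const int kDistance[8] = {1, 2, 4, 8, 16, 32, 64, 128};  // quarter-pel (assumed)
    static const int kDirX[4] = {+1, -1, 0, 0};
    static const int kDirY[4] = { 0,  0, +1, -1};
    MvOffset mvd;
    mvd.x = kDistance[distanceIdx] * kDirX[directionIdx];
    mvd.y = kDistance[distanceIdx] * kDirY[directionIdx];
    return mvd;  // added to the MV of the selected base merge candidate
}
```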
In the AVS3 standard, subblock-based affine motion compensation (affine mode) similar to the VVC standard is used to generate inter-predicted pixel values. This sub-block based prediction is a trade-off between coding efficiency, complexity and memory access bandwidth. The average prediction accuracy is not as good as the pixel-based prediction because all pixels within each sub-block share the same motion vector. Unlike the VVC standard, in affine mode, the AVS3 standard has no pixel level refinement after subblock-based motion compensation.
The present disclosure provides a new method for improving affine pattern prediction accuracy. After affine motion compensation based on conventional sub-blocks, the prediction value of each pixel is refined by adding a differential value derived from the optical flow equation. The proposed method may be referred to as AMPR. AMPR can achieve pixel-level prediction accuracy without significantly increasing complexity, and also maintains worst-case memory access bandwidth comparable to conventional subblock-based motion compensation in affine mode. Although AMPR is also established on optical flow, it is significantly different from PROF in the VVC standard in the following respects.
First, a gradient calculation is performed at each pixel. Unlike PROF, which extends the sub-block prediction by one pixel on each side, AMPR makes use of interpolation-based filtering for gradient calculations at each pixel, which allows a unified design between the AMPR and BIO workflows in AVS3.
Second, MV difference calculations are performed at each pixel. Unlike PROF, which always calculates MV differences based on pixel position relative to the center of the sub-block, in AMPR MV differences may be calculated based on pixel position relative to different positions within the sub-block.
Third is early termination. Unlike the PROF process, which is always invoked on the decoder side when an encoded block is predicted by affine mode, AMPR can be adaptively skipped on the decoder side based on certain defined conditions that indicate that applying AMPR is not a good performance and/or complexity trade-off.
In some examples, this early termination method may also be used to simplify encoder-side operations. Some encoder-side optimization methods for AMPR procedures are also presented in examples of the present disclosure to reduce their latency and energy consumption, such as skipping AMPR for affine UMVE, checking the best mode selection at parent CU before applying AMPR, skipping AMPR for motion estimation of certain block sizes, checking the size of pixel MV differences before applying AMPR, skipping AMPR for some picture types (e.g., low-latency pictures or non-low-latency pictures), etc.
Exemplary workflow of AMPR
The AMPR method may include five steps as described below. In a first step, a conventional sub-block based affine motion compensation is performed to generate a sub-block prediction I (I, j) at each pixel position (I, j).
In a second step, the horizontal spatial gradient g_x(i, j) and the vertical spatial gradient g_y(i, j) of the sub-block prediction are calculated at each pixel position using interpolation-based filtering. In some examples, both the horizontal and vertical gradients of the affine prediction samples are computed directly from reference samples at integer sample positions in the temporal reference picture. One advantage of this is that, for each affine sub-block, its gradient values can be generated at the same time as its prediction samples are generated. Another design benefit of this gradient computation approach is that it is also consistent with the gradient computation process used by other coding tools in the AVS standard, such as BIO. Sharing the same process between different modules in a standard is friendly to pipelined and/or parallel design in a practical hardware codec implementation.
Specifically, the inputs to the gradient derivation process are the same reference samples as those used for the motion compensation of the affine sub-block and the same fractional components (fracX, fracY) of the input motion (MV_x, MV_y) of the sub-block. To derive the gradient values at each sample position, in addition to the default 8-tap FIR filter h_L used for affine prediction, another new set of FIR filters h_G is introduced in the proposed method to calculate the gradient values.
In addition, depending on the direction of the derived gradient, the filters h_G and h_L are applied in different orders. In the case of deriving the horizontal gradient g_x(i, j), the gradient filter h_G is first applied in the horizontal direction to derive horizontal gradient values at the horizontal fractional sample position fracX; then, the interpolation filter h_L is applied vertically to interpolate the gradient values at the vertical fractional sample position fracY.
On the contrary, when deriving the vertical gradient g_y(i, j), the interpolation filter h_L is first applied horizontally to interpolate intermediate samples at the horizontal fractional sample position fracX, followed by the application of the gradient filter h_G in the vertical direction to derive the vertical gradient values at the vertical fractional sample position fracY from the intermediate interpolated samples.
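The filtering order described above can be sketched as follows for the horizontal gradient g_x; the concrete coefficients of h_G (Tables 2 and 3) and of the default interpolation filter h_L, as well as the intermediate rounding and shift operations, are omitted, so this is only a structural illustration with assumed names.

```cpp
#include <cstdint>
#include <vector>

// Horizontal-gradient derivation at sample (x, y): the gradient filter hG
// (chosen for the fractional position fracX) is applied horizontally first,
// and the interpolation filter hL (chosen for fracY) is then applied
// vertically to the intermediate results. Both filters have 'taps' taps.
int amprHorizontalGradient(const int16_t* ref, int refStride, int x, int y,
                           const int* hG, const int* hL, int taps) {
    std::vector<int> tmp(taps);
    for (int r = 0; r < taps; r++) {
        const int16_t* line = ref + (y + r - taps / 2 + 1) * refStride;
        int acc = 0;
        for (int k = 0; k < taps; k++) {
            acc += hG[k] * line[x + k - taps / 2 + 1];  // horizontal pass with hG
        }
        tmp[r] = acc;
    }
    int g = 0;
    for (int r = 0; r < taps; r++) {
        g += hL[r] * tmp[r];  // vertical pass with hL
    }
    return g;  // unnormalized horizontal gradient g_x(x, y)
}
```

For the vertical gradient g_y the two passes are swapped: h_L is applied horizontally and h_G vertically, as described above.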
In some examples, gradient filters may be generated with different filter coefficient accuracies and different numbers of taps, which may provide various tradeoffs between gradient computation accuracy and computational complexity. For example, gradient filters with more filter taps and/or with higher filter coefficient precision may generally result in better codec efficiency, but at the cost of more computational operations (e.g., multiple additions, multiplications, and shifts) due to the gradient computation process. In one example, the following 8-tap filter is proposed for horizontal and/or vertical gradient calculations for AMPR, as shown in table 2.
Table 2 is an exemplary table of predefined 8-tap interpolation filter coefficients f_grad[p] for generating spatial gradients at 1/16-pixel precision based on input sample values.
TABLE 2
[Table 2: 8-tap gradient filter coefficients f_grad[p] for the 1/16-pel fractional positions; the coefficient values are given as an image in the original document and are not reproduced here.]
In another example, to reduce the complexity of the gradient calculation, the following 4-tap FIR filter as shown in Table 3 is used for gradient generation in the proposed AMPR method. Table 3 is an exemplary table of predefined 4-tap interpolation filter coefficients f_grad[p] for generating spatial gradients at 1/16-pixel precision based on input sample values.
TABLE 3
[Table 3: 4-tap gradient filter coefficients f_grad[p] for the 1/16-pel fractional positions; the coefficient values are given as an image in the original document and are not reproduced here.]
In a third step, at each pixel position (i, j), the MV difference Δ v (i, j) between the MV of each pixel and the MV of the sub-block to which said pixel belongs is calculated. Fig. 8 shows an example of calculating a horizontal offset and a vertical offset from a sample position to a specific position of a subblock from which a subblock MV is derived. As shown in fig. 8, the horizontal offset Δ x and the vertical offset Δ y are calculated from the sampling point position (i, j) to a specific position (i ', j') of the sub-block from which the sub-block MV is derived. In some examples, the particular location (i ', j') may not always be the center of the sub-block. As shown in fig. 8, Δ v (i, j) is calculated based on the pixel position with respect to a specific position within the sub-block by equation (5).
In one example, for affine mode, let (i, j) be the pixel location/coordinate within the sub-block to which the pixel belongs, and let w and h be the width and height of the sub-block (e.g., w = h = 4 for a 4 × 4 sub-block, w = h = 8 for an 8 × 8 sub-block). Then the horizontal offset Δx and the vertical offset Δy for each pixel (i, j) (Δx and Δy are defined in equation (5)) can be derived as follows, where i = 0, …, (w − 1) and j = 0, …, (h − 1).
In one example, if the sub-block MV is derived from the position at the center of the sub-block at an integer position, Δx and Δy may be calculated by the equations shown below:
Δx = i − (w >> 1),  Δy = j − (h >> 1)
Alternatively, if the sub-block MV is derived from the position at the center of the sub-block at a fractional position, Δx and Δy can be calculated by the equations shown below:
Δx = i − ((w >> 1) − 0.5),  Δy = j − ((h >> 1) − 0.5)
In another example, if the sub-block MV is derived from the position at the upper-left corner of the sub-block, Δx and Δy may be calculated by the equations shown below:
Δx = i,  Δy = j
In another example, if the sub-block MV is derived from the position at the upper-right corner within the sub-block, Δx and Δy may be calculated by the equations shown below:
Δx = i − (w − 1),  Δy = j
Alternatively, if the sub-block MV is derived from the position at the upper-right corner outside the sub-block, Δx and Δy may be calculated by the equations shown below:
Δx = i − w,  Δy = j
In another example, if the sub-block MV is derived from the position at the lower-left corner within the sub-block, Δx and Δy may be calculated by the equations shown below:
Δx = i,  Δy = j − (h − 1)
Alternatively, if the sub-block MV is derived from the position at the lower-left corner outside the sub-block, Δx and Δy may be calculated as:
Δx = i,  Δy = j − h
in another example, Δ v (i, j) may be calculated by equation (5), and Δ x and Δ y are horizontal and vertical offsets from the sample position (i, j) to the pilot sample position of the sub-block to which the sample belongs. Pilot sample position refers to the sample position within a sub-block that is used to derive the MV used to generate the sub-block based predicted samples for the sub-block. In one example, based on the position of the sub-blocks within the current CU, the values of Δ x and Δ y are derived as follows.
FIG. 9 illustrates an example of sub-blocks within an affine CU in accordance with some embodiments of the present disclosure. For the upper-left sub-block, i.e. sub-block A in Fig. 9, Δx = i, Δy = j. For the upper-right sub-block, i.e. sub-block B in Fig. 9, Δx = (i − w + 1), Δy = j. For the bottom-left sub-block, i.e. sub-block C in Fig. 9, Δx = i, Δy = (j − h + 1) when a 6-parameter affine model is applied; and when a 4-parameter affine model is applied, Δx = (i − ((w >> 1) − 0.5)), Δy = (j − ((h >> 1) − 0.5)). For the other sub-blocks, Δx = (i − ((w >> 1) − 0.5)), Δy = (j − ((h >> 1) − 0.5)).
Once the horizontal and vertical offsets Δx and Δy are calculated, Δv(i, j) can be derived by the following equation (11):
Δv_x(i,j) = c · Δx + d · Δy
Δv_y(i,j) = e · Δx + f · Δy        (11)
where c, d, e and f are affine model parameters, which are known because the current block is an affine mode coded block. Equation (11) is similar to equation (5) of the PROF tool in the VVC standard.
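A hedged sketch of this third step is given below for the sub-block positions of Fig. 9; it follows the offsets listed above and equation (11), with the affine parameters c, d, e, f assumed to be available from the CPMVs, and all names illustrative.

```cpp
// Compute the offsets (dx, dy) from pixel (i, j) to the pilot sample position
// of a w x h sub-block, according to the sub-block position within the CU,
// and then the per-pixel MV difference of equation (11).
enum class SubBlockPos { TopLeft, TopRight, BottomLeft, Other };

void amprDeltaMv(int i, int j, int w, int h, SubBlockPos pos, bool sixParam,
                 double c, double d, double e, double f,
                 double& dvx, double& dvy) {
    double dx, dy;
    if (pos == SubBlockPos::TopLeft) {                       // sub-block A in Fig. 9
        dx = i;          dy = j;
    } else if (pos == SubBlockPos::TopRight) {               // sub-block B in Fig. 9
        dx = i - w + 1;  dy = j;
    } else if (pos == SubBlockPos::BottomLeft && sixParam) { // sub-block C, 6-parameter model
        dx = i;          dy = j - h + 1;
    } else {                                                 // all other cases: sub-block center
        dx = i - ((w >> 1) - 0.5);
        dy = j - ((h >> 1) - 0.5);
    }
    dvx = c * dx + d * dy;  // equation (11), horizontal MV difference
    dvy = e * dx + f * dy;  // equation (11), vertical MV difference
}
```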
In the fourth step, a prediction refinement value is calculated by equation (4).
In a fifth step, the prediction refinement is added to the sub-block prediction I(i, j). The final prediction I′(i, j) is generated as in equation (6).
In the present disclosure, the proposed AMPR workflow may be applied to the luminance component and/or the chrominance component.
In one example, to achieve a good performance/complexity tradeoff, the proposed AMPR is only applied to refine the affine prediction samples of the luma component, while the chroma prediction samples are still generated based on the existing sub-block based affine motion compensation.
In another example, to align the refinement between components, both the luma component and the chroma components are refined by the proposed AMPR process. In this case, the sample-by-sample MV differences Δv(i, j) can be derived in different ways.
In one example, when the prediction refinement value is calculated in the above fourth step, the sample-by-sample MV difference Δv(i, j) may be derived only once based on the luma sub-block and then always reused for the chroma components. In this case, the value of Δv(i, j) used for the chroma components may be scaled according to the sampling grid ratio between the co-located luma and chroma coding blocks. For example, for a 4:2:0 chroma format the luma Δv(i, j) may be halved before being applied to chroma, while for a 4:4:4 chroma format it may be reused without scaling.
In another example, the sample-by-sample MV differences Δ v (i, j) may be derived separately for the luma and chroma components, where the derivation process may be the same as the third step described above.
In another example, it is proposed to adaptively switch between the method of reusing the luma motion refinement Δv(i, j) for chroma and the method of separately deriving the luma and chroma motion refinements, based on the chroma sampling format of the input video. For example, one of the two methods may be used for 4:2:0 video, while the other may be used when the input video is in 4:4:4 format.
In another example, a flag is signaled at various coding levels (e.g., sequence level, picture level, slice level, etc.) to indicate whether AMPR is applied to the chroma components. Further, if the above enable/disable flag is true, another flag may be signaled from the encoder to the decoder to indicate whether the chroma motion refinement is recalculated from the corresponding control point motion vectors or directly borrowed from the corresponding motion refinement of the luma component.
Alternative workflow of AMPR
Another alternative implementation of AMPR is to approximate the product based on optical flow equations by a filtering process. In particular, it is proposed to replace the product of the gradient values and the motion vector differences at each sample position by performing a filtering process on a conventional sub-block based affine motion prediction. This can be formulated as:
P_AMPR(i,j) = Σ_k f_k · P_AFF(x_k, y_k)        (12)
where P_AFF(x_k, y_k) are the sub-block based motion compensated prediction samples at the filter support positions (x_k, y_k) around (i, j), f_k are the filter coefficients, and P_AMPR(i, j) are the filtered affine prediction samples. In practice, different numbers of filter taps and different filter shapes may be applied to achieve different tradeoffs between complexity and coding performance.
In one or more examples, the filtering process may be performed by a cross filter, which may also be referred to as a diamond filter. For example, the diamond filter may be a combination of the vertical and horizontal forms of a 3-tap filter [-1, 0, 1] or a 5-tap filter [-1, -2, 4, 2, 1], as shown in Fig. 10A to 10B. The cross-shaped filter may be an approximation of the gradient calculation process described in the previous optical flow-based refinement process.
To capture the Motion Vector (MV) difference Δ v (i, j) between the MV of each pixel and the MV of the sub-block to which the pixel belongs, the filter coefficients in the selected diamond filter may be calculated based on the values of Δ v (i, j) in the horizontal and vertical directions. In other words, a scaled diamond filter may be used to compensate for the motion vector difference at each sample position. Fig. 11A-11B illustrate examples of diamond filters scaled by MV differences in the horizontal and vertical directions according to some embodiments of the present disclosure. Fig. 11A shows a 5-tap scaled diamond filter. Fig. 11B shows a 9-tap scaled diamond filter.
In another example, the filtering process may be performed by a square filter. For example, a square filter may be a 3 x 3 or 5 x 5 shaped filter, where the importance of each coefficient of the square filter may depend on the distance between the position of each coefficient and the center of the filter, which means that the center coefficient may have a maximum in the filter. Similar to the diamond filter described above, the filter coefficients in the selected square filter may be scaled by the values of Δ v (i, j) in the horizontal and vertical directions.
Once a particular type of filter (e.g., diamond shape or square shape) is selected, a corresponding scaled filter is computed at each sample location. The scaling value is the Motion Vector (MV) difference Δ v (i, j) between the MV of each pixel and the MV of the sub-block to which the pixel belongs at each sample position. The calculation process is the same as for the optical flow-based implementation of AMPR, where the value of Δ v (i, j) is determined based on whether the associated sub-block MV is derived from the position at the center of the sub-block at an integer position, from the position at the upper left corner of the sub-block, from the position at the upper right corner of the sub-block, or from the position at the lower left corner of the sub-block.
In one particular example, when a 3-tap cross filter is applied in the proposed scheme, the corresponding filtered affine prediction samples can be calculated as:
P_AMPR(i,j) = ( M·P_AFF(i,j) + N·( P_AFF(i,j+1)·Δx(i,j+1) − P_AFF(i,j−1)·Δx(i,j−1) ) + N·( P_AFF(i+1,j)·Δy(i+1,j) − P_AFF(i−1,j)·Δy(i−1,j) ) ) / M        (13)
where M and N are initialization coefficients with constant values, and Δx and Δy are scaling factors that can be used to adjust the importance of the neighboring samples relative to the filtered sample at the current location. In one particular example, it is proposed to set M = 2N, e.g., M = 16 and N = 8.
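As a rough illustration of equation (13), the sketch below applies the 3-tap cross filter at one sample position; padding of neighbors outside the sub-block and fixed-point rounding are omitted, M = 16 and N = 8 follow the example above, and all names are assumptions.

```cpp
// Filter one affine prediction sample with the scaled 3-tap cross filter of
// equation (13). pAff, dvx and dvy share the given stride; the vertical
// neighbors are weighted by the horizontal MV difference and the horizontal
// neighbors by the vertical MV difference, as in the equation.
double amprCrossFilter(const double* pAff, const double* dvx, const double* dvy,
                       int stride, int i, int j, int M = 16, int N = 8) {
    const int idx = j * stride + i;
    double acc = M * pAff[idx];
    acc += N * (pAff[idx + stride] * dvx[idx + stride] -
                pAff[idx - stride] * dvx[idx - stride]);
    acc += N * (pAff[idx + 1] * dvy[idx + 1] -
                pAff[idx - 1] * dvy[idx - 1]);
    return acc / M;  // filtered sample P_AMPR(i, j)
}
```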
Once the scaled filter coefficients are computed, the filtering process may be performed on the conventional sub-block based affine motion prediction samples. For neighboring sample positions outside the current block/sub-block, a padding process may be needed. In one or more examples, the padded sample values can be copied from the nearest reference sample at an integer position. In another example, the padded sample value may be a repetition of a reference sample at an integer position used by the current block/sub-block. In one example, the integer samples of the current CU that are closest to the current boundary samples (which may be at fractional positions) are used to fill the extension area of the current CU. In another example, the integer samples located at the floor (i.e., floor()) of the corresponding boundary sample positions of the current CU are used to fill the samples in the extension area of the current CU.
Selective activation of AMPR
Prediction refinement derived by applying AMPR may not always be beneficial and/or necessary. From equation (4), the importance of the derived ΔI(i, j) is determined by the precision and magnitude of the derived Δv(i, j) and g(i, j).
In some examples, AMPR operation may be conditionally applied based on certain conditions. This may be accomplished by signaling a flag for each block to indicate whether AMPR mode is applied. This can also be achieved by using the same conditions to enable AMPR operation at both the encoder and decoder sides without requiring additional signaling.
The motivation for this conditional application of the AMPR operation is that if the CPMV of a CU is inaccurate, or the derived affine model (e.g., 2-parameter, 4-parameter, or 6-parameter affine model) is inaccurate, then the Δv(i, j) subsequently derived by equation (11) may also be inaccurate. In this case, the AMPR operation may not help, or may even impair, coding performance, so it is better to skip the AMPR operation for such blocks. Another motivation for this conditional application of the AMPR operation is that in some cases the benefit of applying AMPR may be marginal, and disabling the operation is preferable from a computational complexity perspective.
In one or more examples, the AMPR operation may be applied depending on whether the CPMV is explicitly signaled. In the affine merge mode, where the CPMV is not explicitly signaled but is implicitly derived from spatially neighboring CUs, AMPR may be skipped for the current CU, since the CPMV in this mode may not be accurate.
In another example, AMPR may be skipped if the derived magnitude of Δ v (i, j) or/and g (i, j) is small, for example when compared to some predefined or dynamically determined threshold. Such thresholds may be determined based on various factors, such as CU aspect ratio and/or sub-block size, etc. Such examples may be implemented in different ways as described below.
In one example, if the absolute value of Δv(i, j) derived for all pixels within a sub-block is less than a threshold, AMPR may be skipped for this sub-block. This condition may have different implementation variations. For example, the check of the absolute value of Δv(i, j) derived for all pixels may be simplified by checking only the four corners of the current sub-block, where the maximum absolute value of Δv(i, j) derived for all pixels within the sub-block may be found as shown in equation (14) below:
[Equation (14): the maximum of |Δv_x(i, j)| and |Δv_y(i, j)| taken over the checked pixel positions (i, j) within the sub-block]
where the pixel location (i, j) may be any pixel coordinate in the sub-block, or may be taken from the four corners (0, 0), (w-1, 0), (0, h-1), and (w-1, h-1).
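For illustration, a sketch of this corner-based simplification is given below; the callable names, the separate per-component thresholds, and the rule that AMPR is skipped only when both components are small are assumptions drawn from the surrounding description.

```python
def max_abs_dv_at_corners(dvx, dvy, w, h):
    """Corner-based simplification of equation (14): take the maximum absolute
    MV difference over the four corner pixels of a w x h sub-block instead of
    over all pixels; dvx/dvy map a pixel coordinate to the components of dv."""
    corners = [(0, 0), (w - 1, 0), (0, h - 1), (w - 1, h - 1)]
    max_dvx = max(abs(dvx(i, j)) for (i, j) in corners)
    max_dvy = max(abs(dvy(i, j)) for (i, j) in corners)
    return max_dvx, max_dvy

def skip_ampr_for_subblock(dvx, dvy, w, h, threshv_x, threshv_y):
    """AMPR may be skipped when both maxima fall below their thresholds."""
    max_dvx, max_dvy = max_abs_dv_at_corners(dvx, dvy, w, h)
    return max_dvx < threshv_x and max_dvy < threshv_y
```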
In another example, the calculation of the maximum absolute value of all Δv(i, j) may be obtained by the following equation:
[Equation: the maximum of |Δv_x(i, j)| and |Δv_y(i, j)| taken over the four corner pixels (i, j) of the sub-blocks described below]
where the sample positions (i, j) are the four corners of those sub-blocks in the CU except the top-left sub-block (i.e., sub-block A in Fig. 9), the top-right sub-block (i.e., sub-block B in Fig. 9), and the bottom-left sub-block (i.e., sub-block C in Fig. 9). The coordinates of the four corner pixels within a sub-block are (0, 0), (w-1, 0), (0, h-1), and (w-1, h-1). |x| is a function that takes the absolute value of x.
In another example, the examination of the Δv(i, j) derived for all pixels within a sub-block may be performed jointly, as in equation (15) below, or separately for the horizontal and vertical directions, as in equation (16) below.
If max_(i, j) |Δv_x(i, j)| < threshv_x, or if max_(i, j) |Δv_y(i, j)| < threshv_y, then ΔI(i, j) = 0    (15)

[Equation (16): the corresponding separate horizontal/vertical check, in which ΔI(i, j) is simplified to, e.g., g_y(i, j)·Δv_y(i, j) or g_x(i, j)·Δv_x(i, j) when only one directional component of Δv(i, j) is below its threshold]
In equations (15) and (16) above, the different simplified forms of ΔI(i, j) represent different simplification methods. For example, in the case of ΔI(i, j) = g_y(i, j)·Δv_y(i, j), the calculation of g_x and Δv_x may be skipped.
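The per-direction simplification behind equations (15) and (16) can be sketched as follows; all names are placeholders, and the exact branching order is an assumption consistent with the description above.

```python
def simplified_refinement(gx, gy, dvx, dvy, max_dvx, max_dvy, threshv_x, threshv_y):
    """Return a (possibly simplified) prediction refinement for one pixel:
    drop the horizontal or vertical term of dI = gx*dvx + gy*dvy when the
    corresponding maximum |dv| component of the sub-block is below threshold."""
    if max_dvx < threshv_x and max_dvy < threshv_y:
        return 0                        # dI(i, j) = 0: skip the refinement
    if max_dvx < threshv_x:
        return gy * dvy                 # skip computing g_x and dv_x
    if max_dvy < threshv_y:
        return gx * dvx                 # skip computing g_y and dv_y
    return gx * dvx + gy * dvy          # full refinement as in equation (4)
```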
In another example, the checking of the derived Δv(i, j) may be combined with the non-simplified AMPR operation. In this case, equation (14) may be combined with equation (4), and the prediction refinement value is then calculated by the following equation:
[Equation: the prediction refinement ΔI(i, j) of equation (4), in which the horizontal and/or vertical term is set to zero when the corresponding maximum |Δv| from equation (14) is below threshv_x or threshv_y]
In some examples, threshv_x and threshv_y may have different values or the same value.
In some examples, when deriving Δv(i, j) for a sub-block, the values of threshv_x and threshv_y may be determined depending on which position is used to derive the sub-block MV. In other words, if the MVs of two sub-blocks are derived using different positions, the two sub-blocks may be assigned either the same pair or different pairs of threshv_x and threshv_y values. For example, for a sub-block whose sub-block-level MV is derived based on the sub-block center, its threshv_x and threshv_y may be the same as or different from the threshv_x and threshv_y of a sub-block whose sub-block-level MV is derived based on the position of the upper-left corner of the sub-block.
In some examples, the values of threshv_x and threshv_y may be defined to lie in the range [1/32, 1/16] in pixel units. For example, a value of (1/16)×(10/16), (1/16)×(12/16), or (1/16)×(14/16) may be used as the threshold. In this case, the threshold is a fractional value when expressed in 1/16-pixel units, such as 0.625, 0.75, or 0.875 in 1/16-pixel units.
In some examples, the values of threshv_x and threshv_y may be defined based on the picture type. For low-delay pictures, the derived affine model parameters may have smaller magnitudes than for non-low-delay pictures, since low-delay pictures tend to have less and/or smoother motion, so smaller values may be preferred for these thresholds.
In some examples, threshv_x and threshv_y may be the same regardless of the picture type.
In some examples, AMPR may be skipped for a sub-block if the absolute values of most of the g(i, j) derived for the pixels within the sub-block are less than a threshold. One example where this applies is when the sub-block contains a smooth surface consisting of flat texture (e.g., with no or little high-frequency detail).
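A hypothetical sketch of this gradient-based check is shown below; the simple central-difference gradient and the 75% majority ratio are assumptions, since the text does not specify how the gradients or the majority are measured.

```python
import numpy as np

def is_mostly_flat(pred, grad_thresh, ratio=0.75):
    """Skip-condition sketch: estimate per-pixel gradients of the sub-block
    prediction with central differences and report True when most gradient
    magnitudes are below the threshold (i.e., the sub-block is mostly flat)."""
    pred = pred.astype(float)
    gx = np.zeros_like(pred)
    gy = np.zeros_like(pred)
    gx[:, 1:-1] = (pred[:, 2:] - pred[:, :-2]) / 2.0
    gy[1:-1, :] = (pred[2:, :] - pred[:-2, :]) / 2.0
    small = (np.abs(gx) < grad_thresh) & (np.abs(gy) < grad_thresh)
    return small.mean() >= ratio
```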
In some examples, the importance of Δv(i, j) and g(i, j) may be considered jointly, or used in a mixed manner, to decide whether AMPR should be skipped for the current sub-block or CU.
Selective enabling of AMPR may also be performed in the case where AMPR is approximately realized by using the filtering process described above. For example, based on equation (15), if the maximum |Δv_x(i, j)| and/or the maximum |Δv_y(i, j)| of the sub-block is less than the predefined threshold threshv_x or threshv_y, the corresponding scaled coefficients in the selected filter may become 0, which means that the filtering process may be simplified from a 2D filtering process to a one-dimensional (1D) filtering process. Taking the 5-tap filter in Fig. 10A as an example, when only the maximum |Δv_x(i, j)| is less than threshv_x, only the 1D filter [1, 2, 1] is applied in the vertical direction. When only the maximum |Δv_y(i, j)| is less than threshv_y, only the 1D filter [1, 2, 1] is applied in the horizontal direction. When neither the maximum |Δv_x(i, j)| nor the maximum |Δv_y(i, j)| is less than the corresponding threshold (i.e., threshv_x and threshv_y), the 2D filter is applied in both the horizontal and vertical directions. When both maxima are less than the corresponding thresholds, no filter is applied at all, i.e., AMPR is not applied.
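The filter-mode decision described in this paragraph can be sketched as a simple four-way selection; the string return values and the function name are placeholders, not part of the disclosure.

```python
def select_ampr_filter_mode(max_dvx, max_dvy, threshv_x, threshv_y):
    """Choose between 2D filtering, 1D filtering, or no filtering based on the
    per-direction maximum |dv| of the sub-block, as described above."""
    if max_dvx < threshv_x and max_dvy < threshv_y:
        return "none"           # skip AMPR filtering entirely
    if max_dvx < threshv_x:
        return "1d_vertical"    # apply the 1D filter [1, 2, 1] vertically only
    if max_dvy < threshv_y:
        return "1d_horizontal"  # apply the 1D filter [1, 2, 1] horizontally only
    return "2d"                 # apply the 2D filter in both directions
```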
Encoder-side optimization
In the AVS3 standard, the affine UMVE mode is computationally intensive for the encoder, as it involves selecting the best distance index for each merge mode candidate. When calculating the Sum of Absolute Transformed Differences (SATD) cost for each candidate distance index, conventional affine motion compensation is always applied. If AMPR is applied on top of affine motion compensation, the amount of computation can be significantly increased.
In one example, at the encoder side, AMPR operations are skipped during SATD-based cost computation for affine UMVE mode. It has been found experimentally that although the best index is selected based on the best SATD cost, whether AMPR is applied during SATD computation does not generally change the ordering of the best SATD cost. Thus, with the proposed method, enabling AMPR mode does not result in significant encoder complexity for affine UMVE mode.
Motion estimation is another major overhead on the encoder side. In another example, the AMPR process may be skipped depending on certain conditions. These conditions indicate that the best encoding mode for the CU is unlikely to be an affine mode after the mode selection process.
One example of such a condition is whether the current CU has a parent CU that has been determined to be coded by the explicit affine mode or the affine merge mode. This is due to the strong correlation of the codec mode selection between a CU and its parent CU, and if the above condition is true, the best codec mode of the current CU is also likely to be an explicit affine mode.
Another exemplary condition for enabling AMPR is whether the parent CU of the current CU is determined to be inter predicted with explicit affine mode. If true, AMPR is applied during affine motion estimation of the current CU; otherwise, AMPR is skipped during affine motion estimation of the current CU.
Small-sized CUs, such as 16 × 16 CUs, have a much higher average per-pixel computation cost when applying AMPR compared to large-sized CUs, such as 64 × 64 CUs. To effectively save computational complexity, in another example of the present disclosure, AMPR may be skipped for small-sized CUs during the motion estimation process. The size of a CU may be defined as the total number of pixels. A pixel number threshold may be defined, such as 16 x 16 or 16 x 32 or 32 x 32, and for blocks of a size smaller than the defined threshold, AMPR may be skipped during the affine motion estimation process for that block.
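As a trivial sketch of this size-based shortcut, the following hypothetical check uses one of the example thresholds mentioned above (32 × 32 pixels); the default value and function name are assumptions.

```python
def skip_ampr_in_affine_me(cu_width, cu_height, min_pixels=32 * 32):
    """Skip AMPR during affine motion estimation for CUs whose total pixel
    count is below the configured threshold."""
    return cu_width * cu_height < min_pixels
```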
In the case where AMPR is approximately realized by using the filtering process described above, encoder-side optimization may also be performed. For example, the filtering process may not be performed for affine UMVE mode. Another encoder optimization for enabling AMPR based filtering process is whether the parent CU of the current CU is determined to utilize explicit affine mode for inter prediction. If true, the filtering process of AMPR is applied during affine motion estimation of the current CU. Otherwise, the filtering process is skipped during affine motion estimation of the current CU.
Fig. 12 is a block diagram illustrating an apparatus for predicting samples at pixel locations in a sub-block by AMPR according to some embodiments of the present disclosure. The apparatus 1200 may be a terminal such as a mobile phone, a tablet computer, a digital broadcast terminal, a tablet device, or a personal digital assistant.
As shown in fig. 12, the apparatus 1200 may include one or more of the following components: processing component 1202, memory 1204, power component 1206, multimedia component 1208, audio component 1210, input/output (I/O) interface 1212, sensor component 1214, and communications component 1216.
The processing component 1202 generally controls overall operation of the apparatus 1200, such as operations related to display, telephone calls, data communications, camera operations, and recording operations. The processing component 1202 may include one or more processors 1220 for executing instructions to perform all or part of the steps of the methods described above. Further, the processing component 1202 can include one or more modules for facilitating interaction between the processing component 1202 and other components. For example, the processing component 1202 can include a multimedia module for facilitating interaction between the multimedia component 1208 and the processing component 1202.
The memory 1204 is configured to store different types of data to support the operation of the apparatus 1200. Examples of such data include instructions, contact data, phonebook data, messages, pictures, videos, etc. for any application or method operating on the apparatus 1200. The memory 1204 may be implemented by any type or combination of volatile or non-volatile storage devices, and the memory 1204 may be Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic or optical disk.
A power supply assembly 1206 provides power to the various components of the device 1200. The power components 1206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 1200.
The multimedia components 1208 include screens that provide an output interface between the device 1200 and a user. In some examples, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen that receives an input signal from a user. The touch panel may include one or more touch sensors for sensing touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but may also detect the duration and pressure associated with the touch or slide operation. In some examples, the multimedia component 1208 may include a front camera and/or a rear camera. The front camera and/or the back camera may receive external multimedia data when the device 1200 is in an operating mode, such as a shooting mode or a video mode.
The audio component 1210 is configured to output and/or input audio signals. For example, audio component 1210 includes a Microphone (MIC). When the apparatus 1200 is in an operational mode (such as a call mode, a recording mode, and a voice recognition mode), the microphone is configured to receive external audio signals. The received audio signals may further be stored in the memory 1204 or transmitted via the communication component 1216. In some examples, audio component 1210 further includes a speaker for outputting audio signals.
The I/O interface 1212 provides an interface between the processing component 1202 and the peripheral interface modules. The peripheral interface module can be a keyboard, a click wheel, a button and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor assembly 1214 includes one or more sensors for providing status evaluation in different aspects of the apparatus 1200. For example, the sensor assembly 1214 may detect the on/off state of the device 1200 and the relative positions of the components. For example, the components are a display and a keyboard of the apparatus 1200. The sensor assembly 1214 may also detect changes in the position of the device 1200 or components of the device 1200, the presence or absence of user contact on the device 1200, the direction or acceleration/deceleration of the device 1200, and changes in the temperature of the device 1200. The sensor assembly 1214 may include a proximity sensor configured to detect the presence of a nearby object without any physical touch. The sensor component 1214 can also include an optical sensor, such as a CMOS or CCD image sensor used in imaging applications. In some examples, sensor assembly 1214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communications component 1216 is configured to facilitate wired or wireless communications between the apparatus 1200 and other devices. The apparatus 1200 may access a wireless network based on a communication standard such as WiFi, 4G, or a combination thereof. In an example, communications component 1216 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an example, communications component 1216 can also include a Near Field Communication (NFC) module for facilitating short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an example, the apparatus 1200 may be implemented by one or more of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components to perform the above-described methods.
The non-transitory computer readable storage medium may be, for example, a Hard Disk Drive (HDD), a Solid State Drive (SSD), flash memory, a hybrid drive or Solid State Hybrid Drive (SSHD), read Only Memory (ROM), compact disc read only memory (CD-ROM), magnetic tape, a floppy disk, and the like.
Fig. 13 is a flow diagram illustrating an exemplary process for predicting samples at pixel locations in a sub-block by AMPR according to some embodiments of the disclosure.
In step 1302, processor 1220 generates a plurality of affine motion compensated predictions at pixel locations and at a plurality of neighboring pixel locations in the sub-block.
In step 1304, processor 1220 obtains a refined prediction for a sample point at a pixel location using a filter having a predetermined shape at the pixel location based on a plurality of affine motion compensated predictions.
In some examples, processor 1220 may generate multiple affine motion compensation predictions by performing sub-block-based affine motion compensation on a video picture including multiple sub-blocks.
In some examples, the predetermined shape may cover the pixel location and a plurality of adjacent pixel locations.
In some examples, the MV differences may include horizontal MV differences and vertical MV differences.
Fig. 14 is a flow diagram illustrating steps in an exemplary process for predicting samples at pixel locations in a sub-block by AMPR in accordance with some embodiments of the present disclosure. Step 1304 in FIG. 13 may be implemented by the steps shown in FIG. 14.
In step 1401, processor 1220 initializes the coefficients of the filter to a set of constant values to approximate the gradient computation process as described in the previous optical flow-based refinement process. As shown in equation (13), M and N are initialization coefficients of constant values.
In step 1403, the processor 1220 scales the initialized coefficients by one or more scaling factors to obtain a scaled filter. As shown in equation (13), Δ_x and Δ_y may be the scaling factors.
In step 1405, processor 1220 obtains a refined prediction at a pixel location using a scaled filter having a predetermined shape based on multiple affine motion compensated predictions.
In some examples, processor 1220 may determine one or more scaling factors based on MV differences at each of a plurality of neighboring pixel locations.
In some examples, processor 1220 may determine one or more scaling factors based on the horizontal value, the vertical value, or both the horizontal and vertical values of the MV differences at each of the plurality of adjacent pixel locations.
In some examples, the predetermined shape is a cross or a square. For example, the predetermined shape is a 3 × 3 square shape.
In some examples, each coefficient in the filter is determined based on a distance between a location of the coefficient and a center of the filter.
In some examples, the coefficient at the center of the filter has a maximum value.
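For illustration, one way to build a small square filter that satisfies this distance-based ordering is sketched below; the halving-per-step weighting and the normalization are assumptions, since the description only requires that coefficients decrease with distance from the center and that the center coefficient be the largest.

```python
import numpy as np

def square_filter_coefficients(size=3):
    """Build a size x size square filter whose coefficients shrink with the
    (Chebyshev) distance from the center, so the center coefficient is largest."""
    c = size // 2
    coeffs = np.empty((size, size))
    for i in range(size):
        for j in range(size):
            d = max(abs(i - c), abs(j - c))  # distance from the filter center
            coeffs[i, j] = 2.0 ** (-d)       # weight halves per distance step
    return coeffs / coeffs.sum()             # normalize so coefficients sum to 1
```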
In some examples, the processor 1220 may copy the filled samples to a neighboring pixel location in response to determining that one of the plurality of neighboring pixel locations is outside the sub-block.
In some examples, processor 1220 may pad reference neighboring samples in the reference picture to neighboring pixel locations, where the reference neighboring samples are reference samples at integer locations closest to the prediction samples corresponding to the pixel locations.
In some examples, processor 1220 may fill in samples at integer positions closest to samples located on the boundary of the sub-block to adjacent pixel positions.
In some examples, an apparatus is provided for predicting samples at pixel locations in a sub-block by implementing AMPR. The apparatus includes one or more processors 1220; and a memory 1204 configured to store instructions executable by the one or more processors; wherein the processor, when executing the instructions, is configured to perform the method as shown in figures 13 to 14.
In some other examples, a non-transitory computer-readable storage medium 1204 is provided having instructions stored therein. When executed by one or more processors 1220, the instructions cause the processors to perform the methods as shown in fig. 13-14.
The description of the present disclosure has been presented for purposes of illustration and is not intended to be exhaustive or limited to the disclosure. Many modifications, variations and alternative embodiments will become apparent to those of ordinary skill in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings.
The examples were chosen and described in order to explain the principles of the disclosure and to enable others of ordinary skill in the art to understand the disclosure for various embodiments, to best utilize the underlying principles, and to apply various modifications as are suited to the particular use contemplated. Therefore, it is to be understood that the scope of the disclosure is not to be limited to the specific examples of the embodiments disclosed, and that modifications and other embodiments are intended to be included within the scope of the disclosure.

Claims (17)

1. A method for encoding samples at pixel locations in a sub-block, comprising:
generating a plurality of affine motion compensated predictions at the pixel location and a plurality of neighboring pixel locations in the sub-block;
determining a motion vector, MV, difference at at least one of the plurality of neighboring pixel positions;
determining coefficients of a filter having a predetermined shape based on the motion vector MV differences;
obtaining a refined prediction of the samples at the pixel location using the filter based on the plurality of affine motion compensated predictions.
2. The method of claim 1, further comprising:
generating a plurality of affine motion compensated predictions by performing sub-block based affine motion compensation on a video picture comprising a plurality of sub-blocks.
3. The method of claim 1, wherein obtaining the refined prediction of the samples at the pixel location using the filter based on the plurality of affine motion compensated predictions comprises:
filtering each of the plurality of affine motion compensated predictions using the filter.
4. The method of claim 1, further comprising:
obtaining initialization coefficients for the filter by initializing the coefficients of the filter to a set of constant values.
5. The method of claim 4, further comprising:
scaling the initialization coefficients of the filter by one or more scaling factors;
wherein using the filter to obtain a refined prediction of the samples at the pixel locations based on the plurality of affine motion compensated predictions comprises:
obtaining, at the pixel location, the refined prediction using scaled coefficients of the filter based on the plurality of affine motion compensated predictions.
6. The method of claim 5, further comprising:
determining at least one of the one or more scaling factors based on motion vector, MV, differences at at least one of the plurality of neighboring pixel positions.
7. The method of claim 6, further comprising:
determining the at least one of the one or more scaling factors based on a horizontal value, a vertical value, or both the horizontal value and the vertical value of the MV difference.
8. The method of claim 1, wherein the predetermined shape is a cross.
9. The method of claim 1, wherein the predetermined shape is a square shape.
10. The method of claim 4, wherein each initialization coefficient in the filter is determined based on a distance between a location of the initialization coefficient in the filter and a center of the filter.
11. The method of claim 10, wherein an initialization coefficient at a center of the filter has a maximum value.
12. The method of claim 1, wherein generating a plurality of affine motion compensated predictions at the pixel location and at a plurality of neighboring pixel locations in the sub-block comprises:
copying, for at least one of the plurality of neighboring pixel locations outside the sub-block, filling samples to the neighboring pixel location.
13. The method of claim 12, wherein copying the filled samples to the adjacent pixel locations comprises:
filling reference adjacent samples in a reference picture to the adjacent pixel locations, wherein the reference adjacent samples are the reference samples at integer positions closest to the prediction samples corresponding to the pixel locations.
14. The method of claim 12, wherein copying the filled samples to the adjacent pixel locations comprises:
filling the neighboring pixel locations with samples at integer positions closest to the samples located on the boundary of the sub-block.
15. An apparatus for encoding samples at pixel locations in a sub-block, comprising:
one or more processors; and
a memory configured to store instructions executable by the one or more processors,
wherein the one or more processors, when executing the instructions, are configured to perform the method of any of claims 1-14.
16. A non-transitory computer-readable storage medium for encoding samples at pixel locations in sub-blocks, storing computer-executable instructions that, when executed by one or more computer processors, cause the one or more computer processors to perform the method of any one of claims 1-14.
17. A non-transitory computer-readable storage medium storing a bitstream, wherein the bitstream is encoded by performing the method of any one of claims 1 to 14.
CN202211357666.3A 2020-07-30 2021-07-30 Method and apparatus for encoding samples at pixel locations in sub-blocks Pending CN115567709A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063059122P 2020-07-30 2020-07-30
US63/059,122 2020-07-30
CN202180004943.XA CN114342390B (en) 2020-07-30 2021-07-30 Method and apparatus for prediction refinement for affine motion compensation
PCT/US2021/043997 WO2022026888A1 (en) 2020-07-30 2021-07-30 Methods and apparatuses for affine motion-compensated prediction refinement

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202180004943.XA Division CN114342390B (en) 2020-07-30 2021-07-30 Method and apparatus for prediction refinement for affine motion compensation

Publications (1)

Publication Number Publication Date
CN115567709A true CN115567709A (en) 2023-01-03

Family

ID=80036178

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202211357666.3A Pending CN115567709A (en) 2020-07-30 2021-07-30 Method and apparatus for encoding samples at pixel locations in sub-blocks
CN202180004943.XA Active CN114342390B (en) 2020-07-30 2021-07-30 Method and apparatus for prediction refinement for affine motion compensation

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202180004943.XA Active CN114342390B (en) 2020-07-30 2021-07-30 Method and apparatus for prediction refinement for affine motion compensation

Country Status (2)

Country Link
CN (2) CN115567709A (en)
WO (1) WO2022026888A1 (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008048489A2 (en) * 2006-10-18 2008-04-24 Thomson Licensing Method and apparatus for video coding using prediction data refinement
CN109005407B (en) * 2015-05-15 2023-09-01 华为技术有限公司 Video image encoding and decoding method, encoding device and decoding device
US11172221B2 (en) * 2017-06-26 2021-11-09 Interdigital Madison Patent Holdings, Sas Method and apparatus for intra prediction with multiple weighted references
US10728573B2 (en) * 2017-09-08 2020-07-28 Qualcomm Incorporated Motion compensated boundary pixel padding
CN111201791B (en) * 2017-11-07 2022-05-24 华为技术有限公司 Interpolation filter for inter-frame prediction apparatus and method for video encoding
US10834396B2 (en) * 2018-04-12 2020-11-10 Qualcomm Incorporated Bilateral filter for predicted video data
KR20210038846A (en) * 2018-06-29 2021-04-08 브이아이디 스케일, 인크. Adaptive control point selection for video coding based on AFFINE MOTION model
US10834417B2 (en) * 2018-09-21 2020-11-10 Tencent America LLC Method and apparatus for video coding
WO2020147747A1 (en) * 2019-01-15 2020-07-23 Beijing Bytedance Network Technology Co., Ltd. Weighted prediction in video coding
CN111050168B (en) * 2019-12-27 2021-07-13 浙江大华技术股份有限公司 Affine prediction method and related device thereof

Also Published As

Publication number Publication date
CN114342390A (en) 2022-04-12
CN114342390B (en) 2022-10-28
WO2022026888A1 (en) 2022-02-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination