WO2020069651A1

WO2020069651A1 - A candidate mv construction method for affine merge mode

Info

Publication number: WO2020069651A1
Application number: PCT/CN2019/105623
Authority: WO
Inventors: Xu Chen; Jianhua Zheng
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2018-10-05
Filing date: 2019-09-12
Publication date: 2020-04-09

Abstract

An affine merge candidate list is generated based on motion vectors associated with a set of neighboring blocks of a current block, the affine merge candidate list comprising a plurality of new candidates and a plurality of old candidates, and a new candidate of the current block is obtained using an old candidate in the affine merge candidate list and an offset. The offset is expressed by a starting point, a motion magnitude, or a motion direction.

Description

A CANDIDATE MV CONSTRUCTION METHOD FOR AFFINE MERGE MODE

CROSS-REFERENCES TO RELATED APPLICATIONS

The present invention claims priority to U.S. Provisional Patent Application Serial No. 62/752,364, filed on October 30, 2018, and to U.S. Provisional Patent Application Serial No. 62/742,220, filed on October 5, 2018, the contents of which are hereby incorporated by reference in their entirety.

TECHNICAL FIELD

The present invention relates to video coding techniques, and more particularly to methods for constructing candidate motion vectors for affine merge mode.

BACKGROUND OF THE INVENTION

Various video coding standards have been developed. In most coding standards, adaptive inter and intra prediction is used on a block basis. High Efficiency Video Coding (HEVC) is a new video coding standard which introduces a merge mode for inter-picture prediction. A merge candidate list of candidate motion parameters from neighboring blocks is constructed. An index is then signaled to identify the candidates to be used. The concepts of coding tree unit (CTU), coding unit (CU), and prediction unit (PU) are introduced. In HEVC, a skip mode is used to indicate for a block that the motion data is inferred instead of explicitly signaled and that the prediction residual is zero. Merge mode is used to derive the motion data, and no residual data is present in the bitstream.

Video compression techniques perform spatial (intra picture) prediction and/or temporal (inter picture) prediction to reduce redundancy in video sequences. For block-based coding, a video picture is partitioned into video blocks. In a moving picture encoding and/or decoding process, a motion vector is predicted in order to reduce the code size of motion vectors generated in the blocks.

However, the emergence of, and the need for, ever higher resolution video in mobile devices impose challenges on current transmission capability. Therefore, there is a need for novel video encoding and decoding techniques to reduce the transmission bandwidth.

BRIEF SUMMARY OF THE INVENTION

Embodiments of the present disclosure provide methods and devices that can reduce the code size of motion vectors by deriving or generating motion vector prediction candidates.

In one embodiment, a method includes generating an affine merge candidate list based on motion vectors associated with a set of neighboring blocks of a current block, the affine merge candidate list comprising a plurality of old candidates, and obtaining a new candidate of the current block using an old candidate in the affine merge candidate list and an offset. The offset is expressed by a starting point, a motion magnitude, or a motion direction. The new candidate is added to the affine merge candidate list.

In an embodiment, the starting point is characterized by a base candidate index configured to indicate which candidate in the affine merge candidate list is to be used as the old candidate, the motion magnitude is characterized by a distance index configured to indicate a predetermined distance from the starting point, and the motion direction is characterized by a direction index configured to indicate a motion vector difference (MVD) relative to the starting point.
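As a rough illustration of the three indices described above, the following Python sketch derives a new candidate by adding an offset to the control-point motion vectors of an old candidate. The distance table, the direction table, the quarter-sample units, and the choice to apply the same offset to every control point are assumptions made for this sketch only; they are not values taken from this disclosure.

```python
# Hypothetical lookup tables: the index-to-value mappings are illustrative assumptions.
DISTANCES = [1, 2, 4, 8]                          # offsets in quarter-sample units (assumed)
DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # sign of the MVD along x and y (assumed)


def derive_new_candidate(candidate_list, base_idx, distance_idx, direction_idx):
    """Return a new candidate: the old candidate's control-point MVs plus an offset."""
    old_candidate = candidate_list[base_idx]      # starting point (base candidate index)
    magnitude = DISTANCES[distance_idx]           # motion magnitude (distance index)
    sign_x, sign_y = DIRECTIONS[direction_idx]    # motion direction (direction index)
    mvd = (sign_x * magnitude, sign_y * magnitude)
    # Assumption: the same MVD is added to every control-point motion vector.
    return [(mvx + mvd[0], mvy + mvd[1]) for (mvx, mvy) in old_candidate]


# Two old candidates, each with two control-point motion vectors.
merge_list = [[(4, -8), (6, -8)], [(0, 0), (2, 1)]]
new_candidate = derive_new_candidate(merge_list, base_idx=0, distance_idx=2, direction_idx=1)
merge_list.append(new_candidate)                  # the new candidate joins the list
```

In this sketch only the three small indices would need to be signaled to identify the new candidate; the motion vectors themselves are derived from the list.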

In another embodiment, an encoder is provided that performs the method described above to improve the coding process efficiency.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of this disclosure, reference is now made to the following brief description, taken in connection with the accompanying drawings and detailed description, wherein like reference numerals represent like parts.

FIG. 1 is a flowchart of an example method of coding a video signal according to an embodiment.

FIG. 2 is a schematic diagram of an example coding and decoding (codec) system for video coding according to an embodiment.

FIG. 3 is a block diagram illustrating an example video encoder according to an embodiment.

FIG. 4 is a block diagram illustrating an example video decoder according to an embodiment.

FIG. 5 is a schematic diagram illustrating an example of unidirectional inter-prediction according to an embodiment.

FIG. 6 is a schematic diagram illustrating an example of bidirectional inter-prediction according to an embodiment.

FIG. 7 is a schematic diagram illustrating exemplary intra-prediction modes employed in video coding.

FIG. 8 is a schematic diagram illustrating an example of directional relationships of blocks in video coding.

FIG. 9 is a block diagram illustrating an example in-loop filter according to an embodiment.

FIG. 10 illustrates example split modes employed in block partitioning according to an embodiment.

FIG. 11 is a schematic diagram of an example video encoding mechanism according to an embodiment.

FIG. 12 is a schematic diagram of a computing device for video coding according to an embodiment.

FIG. 13, which includes FIG. 13A and FIG. 13B, is a conceptual diagram illustrating a control point based affine motion model. FIG. 13A shows an example of two-point MV affine mode with four affine parameters, and FIG. 13B shows an example of three-point MV affine mode with six parameters.

FIG. 14 is an affine motion vector field (MVF) per sub-block.

FIG. 15, which includes FIG. 15A and FIG. 15B, shows candidates when coding a current block in AF_MERGE mode according to an embodiment. FIG. 15A shows neighboring blocks used in AF_MERGE mode, and FIG. 15B shows an example in which a left-bottom neighbor block is coded.

FIG. 16 is a conceptual diagram illustrating an ultimate motion vector expression (UMVE) search process according to an embodiment.

FIG. 17 is a conceptual diagram illustrating the ultimate motion vector expression (UMVE) search points according to an embodiment.

DETAILED DESCRIPTION OF THE INVENTION

It should be understood at the outset that although an illustrative implementation of one or more embodiments is provided below, the disclosed systems and/or methods may be implemented using a variety of techniques, whether currently known or in existence. The disclosure should in no way be limited to the illustrative implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.

FIG. 1 is a flowchart of an example method 100 of coding a video signal. Specifically, a video signal is encoded at an encoder. The encoding process compresses the video signal by employing various mechanisms to reduce the size of a video file. A smaller file size allows the compressed video file to be transmitted toward a user, while reducing the associated bandwidth overhead. A decoder then decodes the compressed video file to reconstruct the original video signal for display to an end user. The decoding process generally mirrors the encoding process to allow the decoder to consistently reconstruct the video signal.

According to embodiments of the present disclosure, the method includes, at step 101, providing or inputting a video signal to an encoder. For example, the video signal may be an uncompressed video file stored in memory. As another example, the video file may be captured by a video capture device, such as a video camera, and encoded to support live streaming of the video. The video file may include both an audio component and a video component. The video component contains a series of image frames that, when viewed in a sequence, give the visual impression of motion. The frames contain pixels that are expressed in terms of light intensity (brightness), referred to herein as luma components, and color information, referred to as chroma components. In some examples, the frames may also contain depth values to support three dimensional viewing.

At step 103, the video is partitioned into blocks. Partitioning includes subdividing the pixels in each frame into square and/or rectangular blocks for compression. For example, coding trees may be employed to divide and then recursively subdivide blocks until configurations are achieved that support further encoding. As such, the blocks may be referred to as coding tree units in High Efficiency Video Coding (HEVC) (also known as H.265 and MPEG-H Part 2). For example, luma components of a frame may be subdivided until the individual blocks contain relatively homogenous lighting values. Further, chroma components of a frame may be subdivided until the individual blocks contain relatively homogenous color values. Accordingly, partitioning mechanisms vary depending on the content of the video frames.

At step 105, various compression mechanisms are employed to compress the image blocks partitioned at step 103. For example, inter-prediction and/or intra-prediction may be employed. Inter-prediction is designed to take advantage of the fact that objects in a common scene tend to appear in successive frames. Accordingly, a block depicting an object in a reference frame need not be repeatedly described in subsequent frames. Specifically, an object, such as a table, may remain in a constant position over multiple frames. Hence the table need only be described once and subsequent frames can refer back to the reference frame. Pattern matching mechanisms may be employed to match objects over multiple frames. Further, moving objects may be represented across multiple frames, for example, due to the object movement or camera movement. As a particular example, a video sequence may show an automobile that moves across the screen over multiple frames. Motion vectors can be employed to describe such movement. A motion vector is a two-dimensional vector that provides an offset from the coordinates of an object in a frame to the coordinates of the object in a reference frame. As such, inter-prediction can encode an image block in a current frame as a set of motion vectors indicating an offset from a corresponding block in a reference frame.

Intra-prediction encodes blocks in a common frame. Intra-prediction takes advantage of the fact that luma and chroma components tend to cluster in a frame. For example, a patch of green in a portion of a tree tends to be positioned adjacent to similar patches of green. Intra-prediction employs multiple directional prediction modes (e.g., thirty three in HEVC), a planar mode, and a direct current (DC) mode. The directional modes indicate that a current block is similar or the same as a neighbor block in a corresponding direction. Planar mode indicates that a series of blocks along a row (e.g., a plane) can be interpolated based on neighbor blocks at the edges of the row. Planar mode, in effect, indicates a smooth transition of light/color across a row by employing a relatively constant slope in changing values. DC mode is employed for boundary smoothing and indicates that a block is similar or the same as an average value associated with all the neighbor blocks associated with the angular directions of the directional prediction modes. Accordingly, intra-prediction blocks can represent image blocks as various relational prediction mode values instead of the actual values. Further, inter-prediction blocks can represent image blocks as motion vector values instead of the actual values. In either case, the prediction blocks may not exactly represent the image blocks in some cases. Any differences are stored in residual blocks. Transforms may be applied to the residual blocks to further compress the file.
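The averaging idea behind DC mode can be sketched in a few lines of Python. This is a simplified reading of the description above, assuming that the prediction is the rounded mean of the reconstructed samples directly above and to the left of the block; the function name and sample values are hypothetical, and a real codec adds details that are not modeled here.

```python
def dc_predict(top_neighbors, left_neighbors, size):
    """Fill a size x size block with the rounded mean of the neighboring samples."""
    samples = list(top_neighbors) + list(left_neighbors)
    # Integer mean with rounding to nearest (assumed rounding rule).
    dc = (sum(samples) + len(samples) // 2) // len(samples)
    return [[dc] * size for _ in range(size)]


# Hypothetical neighboring luma samples for a 4x4 block.
pred = dc_predict([100, 102, 104, 106], [98, 99, 101, 102], 4)
```

Every sample of the predicted block takes the same value, so only the mode itself needs to be signaled; the difference from the actual block is carried by the residual.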

At step 107, various filtering techniques may be applied. In HEVC, the filters are applied according to an in-loop filtering scheme. The block based prediction discussed above may result in the creation of blocky images at the decoder. Further, the block based prediction scheme may encode a block and then reconstruct the encoded block for later use as a reference block. The in-loop filtering scheme iteratively applies noise suppression filters, de-blocking filters, adaptive loop filters, and sample adaptive offset (SAO) filters to the blocks/frames. These filters mitigate such blocking artefacts so that the encoded file can be accurately reconstructed. Further, these filters mitigate artefacts in the reconstructed reference blocks so that artefacts are less likely to create additional artefacts in subsequent blocks that are encoded based on the reconstructed reference blocks. The in-loop filtering process is discussed in greater detail below.

Once the video signal has been partitioned, compressed, and filtered, the resulting data is encoded in a bitstream at step 109. The bitstream includes the data discussed above as well as any signaling data desired to support proper video signal reconstruction at the decoder. For example, such data may include partition data, prediction data, residual blocks, and various flags providing coding instructions to the decoder. The bitstream may be stored in memory for transmission toward a decoder upon request. The bitstream may also be broadcast and/or multicast toward a plurality of decoders. The creation of the bitstream is an iterative process. Accordingly, steps 101, 103, 105, 107, and 109 may occur continuously and/or simultaneously over many frames and blocks. The sequence order shown in FIG. 1 is presented for clarity and ease of discussion, and is not intended to limit the video coding process to a particular order.

The decoder receives the bitstream and begins the decoding process at step 111. The decoder also includes a parser for parsing the received bitstream to extract relevant information such as flags and indices that will be described in more detail below. Specifically, the decoder employs an entropy decoding scheme to convert the bitstream into corresponding syntax and video data. The decoder employs the syntax data from the bitstream to determine the partitions for the frames at step 111. The partitioning should match the results of block partitioning at step 103. Entropy encoding/decoding as employed in step 111 is now described. The encoder makes several choices during the compression process, such as selecting block partitioning schemes from several possible choices based on the spatial positioning of values in the input image(s). Signaling the exact choices may employ a large number of bins. As used herein, a bin is a binary value that is treated as a variable (e.g., a bit value that may vary depending on context). Entropy coding allows the encoder to discard any options that are clearly not viable for a particular case, leaving only a set of allowable options. Each allowable option is then assigned a code word. The length of the code words is based on the number of allowable options (e.g., one bin for two options, two bins for three to four options, etc.). The encoder then encodes the code word for the selected option. This scheme reduces the size of the code words as the code words are only as big as desired to uniquely indicate a selection from a small sub-set of allowable options as opposed to uniquely indicating the selection from a potentially large set of all possible options. The decoder then decodes the selection by determining the set of allowable options in a similar manner to the encoder. By determining the set of allowable options, the decoder can read the code word and determine the selection made by the encoder.
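The relationship between the number of allowable options and the code word length given above (one bin for two options, two bins for three to four options) can be sketched as follows. The sketch assumes a fixed-length code word over the allowable options; the function name is hypothetical, and context-dependent probability modeling is not shown.

```python
def codeword_bins(num_allowable_options):
    """Number of bins needed to uniquely index one of the allowable options."""
    if num_allowable_options < 1:
        raise ValueError("need at least one allowable option")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floating point.
    return (num_allowable_options - 1).bit_length()


# Pruning 35 possible options down to 4 allowable ones shortens the code word.
bins_before = codeword_bins(35)
bins_after = codeword_bins(4)
```

Because the decoder derives the same set of allowable options, it knows how many bins to read for each selection.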

At step 113, the decoder performs block decoding. Specifically, the decoder employs reverse transforms to generate residual blocks. Then the decoder employs the residual blocks and corresponding prediction blocks to reconstruct the image blocks according to the partitioning. The prediction blocks may include both intra-prediction blocks and inter-prediction blocks as generated at the encoder at step 105. The reconstructed image blocks are then positioned into frames of a reconstructed video signal according to the partitioning data determined at step 111. Syntax for step 113 may also be signaled in the bitstream via entropy coding as discussed above.

At step 115, filtering is performed on the frames of the reconstructed video signal in a manner similar to step 107 at the encoder. For example, noise suppression filters, de-blocking filters, adaptive loop filters, and SAO filters may be applied to the frames to remove blocking artefacts. Once the frames are filtered, the video signal can be output to a display at step 117 for viewing by an end user.

FIG. 2 is a schematic diagram of an example coding and decoding (codec) system 200 for video coding. Specifically, codec system 200 provides functionality to support the implementation of method 100. Codec system 200 is generalized to depict components employed in both an encoder and a decoder. Codec system 200 receives and partitions a video signal as discussed with respect to steps 101 and 103 in method 100, which results in a partitioned video signal 201. Codec system 200 then compresses the partitioned video signal 201 into a coded bitstream when operating as an encoder as discussed with respect to steps 105, 107, and 109 in method 100. When operating as a decoder, codec system 200 generates an output video signal from the bitstream as discussed with respect to steps 111, 113, 115, and 117 in method 100. The codec system 200 includes a general coder control component 211, a transform scaling and quantization component 213, an intra-picture estimation component 215, an intra-picture prediction component 217, a motion compensation component 219, a motion estimation component 221, a scaling and inverse transform component 229, a filter control analysis component 227, an in-loop filter component 225, a decoded picture buffer component 223, and a header formatting and context adaptive binary arithmetic coding (CABAC) component 231. Such components are coupled as shown. In FIG. 2, solid black lines indicate movement of data to be encoded/decoded while dashed lines indicate movement of control data that controls the operation of other components. The components of codec system 200 may all be present in the encoder. The decoder may include a subset of the components of codec system 200. For example, the decoder may include the intra-picture prediction component 217, the motion compensation component 219, the scaling and inverse transform component 229, the in-loop filter component 225, and the decoded picture buffer component 223. These components are now described.

The partitioned video signal 201 is a captured video stream that has been partitioned into blocks of pixels by a coding tree. A coding tree employs various split modes to subdivide a block of pixels into smaller blocks of pixels. These blocks can then be further subdivided into smaller blocks. The blocks may be referred to as nodes on the coding tree. Larger parent nodes are split into smaller child nodes. The number of times a node is subdivided is referred to as the depth of the node/coding tree. The divided blocks are referred to as coding units (CUs) in some cases. The split modes may include a binary tree (BT), triple tree (TT), and a quad tree (QT) employed to partition a node into two, three, or four child nodes, respectively, of varying shapes depending on the split modes employed. The partitioned video signal 201 is forwarded to the general coder control component 211, the transform scaling and quantization component 213, the intra-picture estimation component 215, the filter control analysis component 227, and the motion estimation component 221 for compression.
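The recursive parent-to-child subdivision can be sketched for the quad tree case. The function below is a minimal illustration, assuming square blocks and a caller-supplied split decision; the function name, the minimum size, and the split rule are hypothetical, and binary and triple tree splits are not modeled.

```python
def quad_tree_split(x, y, size, min_size, should_split):
    """Recursively split a square node into four child nodes; return the leaf blocks."""
    if size <= min_size or not should_split(x, y, size):
        return [(x, y, size)]                     # leaf node: (left, top, size)
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves.extend(quad_tree_split(x + dx, y + dy, half, min_size, should_split))
    return leaves


# Hypothetical split rule: keep splitting only the node whose corner is at the origin.
leaves = quad_tree_split(0, 0, 64, 16, lambda x, y, size: x == 0 and y == 0)
```

In an encoder the split decision would typically depend on the content of the block, for example on how homogeneous its sample values are.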

The general coder control component 211 is configured to make decisions related to coding of the images of the video sequence into the bitstream according to application constraints. For example, the general coder control component 211 manages optimization of bitrate/bitstream size versus reconstruction quality. Such decisions may be made based on storage space/bandwidth availability and image resolution requests. The general coder control component 211 also manages buffer utilization in light of transmission speed to mitigate buffer underrun and overrun issues. To manage these issues, the general coder control component 211 manages partitioning, prediction, and filtering by the other components. For example, the general coder control component 211 may dynamically increase compression complexity to increase resolution and increase bandwidth usage or decrease compression complexity to decrease resolution and bandwidth usage. Hence, the general coder control component 211 controls the other components of codec system 200 to balance video signal reconstruction quality with bit rate constraints. The general coder control component 211 creates control data, which controls the operation of the other components. The control data is also forwarded to the header formatting and CABAC component 231 to be encoded in the bitstream to signal parameters for decoding at the decoder.

The partitioned video signal 201 is also sent to the motion estimation component 221 and the motion compensation component 219 for inter-prediction. A frame or slice of the partitioned video signal 201 may be divided into multiple video blocks. Motion estimation component 221 and the motion compensation component 219 perform inter-predictive coding of the received video block relative to one or more blocks in one or more reference frames to provide temporal prediction. Codec system 200 may perform multiple coding passes, e.g., to select an appropriate coding mode for each block of video data.

Motion estimation component 221 and motion compensation component 219 may be highly integrated, but are illustrated separately for conceptual purposes. Motion estimation, performed by motion estimation component 221, is the process of generating motion vectors, which estimate motion for video blocks. A motion vector, for example, may indicate the displacement of a prediction unit (PU) of a video block relative to a predictive block within a reference frame or other coded unit relative to the current block being coded within the current frame (or other coded unit). A predictive block is a block that is found to closely match the block to be coded, in terms of pixel difference, which may be determined by sum of absolute difference (SAD), sum of square difference (SSD), or other difference metrics. In some examples, codec system 200 may calculate values for sub-integer pixel positions of reference pictures stored in decoded picture buffer 223. For example, video codec system 200 may interpolate values of one-quarter pixel positions, one-eighth pixel positions, or other fractional pixel positions of the reference picture. Therefore, motion estimation component 221 may perform a motion search relative to the full pixel positions and fractional pixel positions and output a motion vector with fractional pixel precision. The motion estimation component 221 calculates a motion vector for a PU of a video block in an inter-coded slice by comparing the position of the PU to the position of a predictive block of a reference picture. Motion estimation component 221 outputs the calculated motion vector as motion data to the header formatting and CABAC component 231 for encoding and to the motion compensation component 219.
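The SAD and SSD metrics and a motion search can be sketched as below. This is a minimal integer-sample full search, assuming frames stored as lists of rows; the function names, the search range, and the sample data are hypothetical, and fractional-sample interpolation is not modeled.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb))


def ssd(block_a, block_b):
    """Sum of squared differences between two equally sized blocks."""
    return sum((a - b) ** 2 for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb))


def extract(frame, x, y, size):
    """Return the size x size block whose top-left sample is at (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def full_search(current_block, ref_frame, x0, y0, size, search_range):
    """Test every integer displacement in the window; return (motion vector, SAD)."""
    best_mv, best_cost = None, None
    height, width = len(ref_frame), len(ref_frame[0])
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = x0 + dx, y0 + dy
            if x < 0 or y < 0 or x + size > width or y + size > height:
                continue                          # candidate falls outside the reference frame
            cost = sad(current_block, extract(ref_frame, x, y, size))
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


# Hypothetical data: the current block at (2, 2) matches the reference block at (3, 2).
ref_frame = [[10 * y + x for x in range(6)] for y in range(6)]
current_block = extract(ref_frame, 3, 2, 2)
mv, cost = full_search(current_block, ref_frame, x0=2, y0=2, size=2, search_range=1)
```

Swapping `sad` for `ssd` inside `full_search` changes only the difference metric, not the search itself.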

Motion compensation, performed by motion compensation component 219, may involve fetching or generating the predictive block based on the motion vector determined by motion estimation component 221. Again, motion estimation component 221 and motion compensation component 219 may be functionally integrated, in some examples. Upon receiving the motion vector for the PU of the current video block, motion compensation component 219 may locate the predictive block to which the motion vector points in a reference picture list. A residual video block is then formed by subtracting pixel values of the predictive block from the pixel values of the current video block being coded, forming pixel difference values. In general, motion estimation component 221 performs motion estimation relative to luma components, and motion compensation component 219 uses motion vectors calculated based on the luma components for both chroma components and luma components. The predictive block and residual block are forwarded to transform scaling and quantization component 213.
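The subtraction that forms the residual block, and the addition that undoes it, can be sketched as follows. The function names and sample values are hypothetical, and the sketch assumes the residual is carried without loss, which does not hold once quantization is applied.

```python
def residual_block(current, predictive):
    """Pixel difference values: current block minus predictive block."""
    return [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, predictive)]


def reconstruct_block(predictive, residual):
    """Add the residual back to the predictive block."""
    return [[p + r for p, r in zip(pr, rr)] for pr, rr in zip(predictive, residual)]


current = [[52, 55], [61, 59]]
predictive = [[50, 50], [60, 60]]
residual = residual_block(current, predictive)
reconstructed = reconstruct_block(predictive, residual)
```

A close prediction leaves small residual values, which is what makes the later transform and quantization stages effective.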

The partitioned video signal 201 is also sent to intra-picture estimation component 215 and intra-picture prediction component 217. As with motion estimation component 221 and motion compensation component 219, intra-picture estimation component 215 and intra-picture prediction component 217 may be highly integrated, but are illustrated separately for conceptual purposes. The intra-picture estimation component 215 and intra-picture prediction component 217 intra-predict a current block relative to blocks in a current frame, as an alternative to the inter-prediction performed by motion estimation component 221 and motion compensation component 219 between frames, as described above. In particular, the intra-picture estimation component 215 determines an intra-prediction mode to use to encode a current block. In some examples, intra-picture estimation component 215 selects an appropriate intra-prediction mode to encode a current block from multiple tested intra-prediction modes. The selected intra-prediction modes are then forwarded to the header formatting and CABAC component 231 for encoding.

For example, the intra-picture estimation component 215 calculates rate-distortion values using a rate-distortion analysis for the various tested intra-prediction modes, and selects the intra-prediction mode having the best rate-distortion characteristics among the tested modes. Rate-distortion analysis generally determines an amount of distortion (or error) between an encoded block and an original unencoded block that was encoded to produce the encoded block, as well as a bit rate (e.g., a number of bits per second) used to produce the encoded block. The intra-picture estimation component 215 calculates ratios from the distortions and rates for the various encoded blocks to determine which intra-prediction mode exhibits the best rate-distortion value for the block. In addition, intra-picture estimation component 215 may be configured to code depth blocks of a depth map using a depth modeling mode (DMM) based on rate-distortion optimization (RDO).
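Mode selection by rate-distortion analysis can be sketched as below. The sketch assumes a Lagrangian cost of the form distortion plus a multiplier times rate, which is one commonly used way of combining the two quantities and is not stated in the text above; the mode names, distortion values, rates, and multiplier are hypothetical.

```python
def select_mode(candidates, lagrange_multiplier):
    """Return the tested mode with the lowest combined rate-distortion cost.

    candidates maps a mode name to a (distortion, rate_in_bits) pair.
    The cost distortion + multiplier * rate is an assumed formulation.
    """
    def cost(item):
        distortion, rate = item[1]
        return distortion + lagrange_multiplier * rate

    return min(candidates.items(), key=cost)[0]


# Hypothetical measurements for three tested intra-prediction modes.
tested = {"planar": (120.0, 10), "dc": (150.0, 6), "angular_26": (90.0, 22)}
best = select_mode(tested, lagrange_multiplier=4.0)
```

A larger multiplier favors modes that cost fewer bits, while a multiplier of zero selects purely on distortion.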

The intra-picture prediction component 217 may generate a residual block from the predictive block based on the selected intra-prediction modes determined by intra-picture estimation component 215. The residual block includes the difference in values between the predictive block and the original block, represented as a matrix. The residual block is then forwarded to the transform scaling and quantization component 213. The intra-picture estimation component 215 and the intra-picture prediction component 217 may operate on both luma and chroma components.

The transform scaling and quantization component 213 is configured to further compress the residual block. The transform scaling and quantization component 213 applies a transform operation, such as a discrete cosine transform (DCT), a discrete sine transform (DST), or a conceptually similar transform operation, to the residual block, thereby producing a video block comprising residual transform coefficient values. Wavelet transforms, integer transforms, sub-band transforms or other types of transforms could also be used. The transform operation may convert the residual information from a pixel value domain to a transform domain, such as a frequency domain. The transform scaling and quantization component 213 is also configured to scale the transformed residual information, for example based on frequency. Such scaling involves applying a scale factor to the residual information so that different frequency information is quantized at different granularities, which may affect final visual quality of the reconstructed video. The transform scaling and quantization component 213 is also configured to quantize the transform coefficients to further reduce bit rate. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may be modified by adjusting a quantization parameter. In some examples, the transform scaling and quantization component 213 may then perform a scan of the matrix including the quantized transform coefficients. The quantized transform coefficients are forwarded to the header formatting and CABAC component 231 to be encoded in the bitstream.
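The transform and quantization steps can be sketched in one dimension. The sketch assumes a floating-point orthonormal DCT-II and a single uniform step size; practical codecs use integer approximations, two-dimensional transforms, and frequency-dependent scaling, none of which are modeled here. The function names, residual values, and step size are hypothetical.

```python
import math


def dct_1d(samples):
    """Orthonormal DCT-II of a list of residual samples."""
    n = len(samples)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(s * math.cos(math.pi * (i + 0.5) * k / n) for i, s in enumerate(samples))
        out.append(scale * total)
    return out


def quantize(coefficients, step):
    """Map each coefficient to an integer level using a uniform step size."""
    return [int(round(c / step)) for c in coefficients]


def dequantize(levels, step):
    """Approximate the coefficients from the integer levels."""
    return [level * step for level in levels]


residual_row = [4, 4, 4, 4]               # a flat residual has only a DC component
coeffs = dct_1d(residual_row)
levels = quantize(coeffs, step=2.0)
approx = dequantize(levels, step=2.0)
```

A larger step size maps more coefficients to zero, which lowers the bit rate at the expense of reconstruction accuracy.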

The scaling and inverse transform component 229 applies a reverse operation of the transform scaling and quantization component 213 to support motion estimation. The scaling and inverse transform component 229 applies inverse scaling, transformation, and/or quantization to reconstruct the residual block in the pixel domain, e.g., for later use as a reference block which may become a predictive block for another current block. The motion estimation component 221 and/or motion compensation component 219 may calculate a reference block by adding the residual block back to a corresponding predictive block for use in motion estimation of a later block/frame. Filters are applied to the reconstructed reference blocks to mitigate artefacts created during scaling, quantization, and transform. Such artefacts could otherwise cause inaccurate prediction (and create additional artefacts) when subsequent blocks are predicted.

The filter control analysis component 227 and the in-loop filter component 225 apply the filters to the residual blocks and/or to reconstructed image blocks. For example, the transformed residual block from scaling and inverse transform component 229 may be combined with a corresponding prediction block from intra-picture prediction component 217 and/or motion compensation component 219 to reconstruct the original image block. The filters may then be applied to the reconstructed image block. In some examples, the filters may instead be applied to the residual blocks. As with other components in FIG. 2, the filter control analysis component 227 and the in-loop filter component 225 are highly integrated and may be implemented together, but are depicted separately for conceptual purposes. Filters applied to the reconstructed reference blocks are applied to particular spatial regions and include multiple parameters to adjust how such filters are applied. The filter control analysis component 227 analyzes the reconstructed reference blocks to determine where such filters should be applied and sets corresponding parameters. Such data is forwarded to the header formatting and CABAC component 231 as filter control data for encoding. The in-loop filter component 225 applies such filters based on the filter control data. The filters may include a deblocking filter, a noise suppression filter, a SAO filter, and an adaptive loop filter. Such filters may be applied in the spatial/pixel domain (e.g., on a reconstructed pixel block) or in the frequency domain, depending on the example.

When operating as an encoder, the filtered reconstructed image block, residual block, and/or prediction block are stored in the decoded picture buffer 223 for later use in motion estimation as discussed above. When operating as a decoder, the decoded picture buffer 223 stores and forwards the reconstructed and filtered blocks toward a display as part of an output video signal. The decoded picture buffer 223 may be any memory device capable of storing prediction blocks, residual blocks, and/or reconstructed image blocks.

The header formatting and CABAC component 231 receives the data from the various components of codec system 200 and encodes such data into a coded bitstream for transmission toward a decoder. Specifically, the header formatting and CABAC component 231 generates various headers to encode control data, such as general control data and filter control data. Further, prediction data, including intra-prediction and motion data, as well as residual data in the form of quantized transform coefficient data, are all encoded in the bitstream. The final bitstream includes all information desired by the decoder to reconstruct the original partitioned video signal 201. Such information may also include intra-prediction mode index tables (also referred to as codeword mapping tables), definitions of encoding contexts for various blocks, indications of most probable intra-prediction modes, an indication of partition information, etc. Such data may be encoded by employing entropy coding. For example, the information may be encoded by employing context adaptive variable length coding (CAVLC), CABAC, syntax-based context-adaptive binary arithmetic coding (SBAC), probability interval partitioning entropy (PIPE) coding, or another entropy coding technique. Following the entropy coding, the coded bitstream may be transmitted to another device (e.g., a video decoder) or archived for later transmission or retrieval.

FIG. 3 is a block diagram illustrating an example video encoder 300. Video encoder 300 may be employed to implement the encoding functions of codec system 200 and/or implement steps 101, 103, 105, 107, and/or 109 of method 100. Encoder 300 partitions an input video signal, resulting in a partitioned video signal 301, which is substantially similar to the partitioned video signal 201. The partitioned video signal 301 is then compressed and encoded into a bitstream by components of encoder 300.

Specifically, the partitioned video signal 301 is forwarded to an intra-picture prediction component 317 for intra-prediction. The intra-picture prediction component 317 may be substantially similar to intra-picture estimation component 215 and intra-picture prediction component 217. The partitioned video signal 301 is also forwarded to a motion compensation component 321 for inter-prediction based on reference blocks in a decoded picture buffer 323. The motion compensation component 321 may be substantially similar to motion estimation component 221 and motion compensation component 219. The prediction blocks and residual blocks from the intra-picture prediction component 317 and the motion compensation component 321 are forwarded to a transform and quantization component 313 for transform and quantization of the residual blocks. The transform and quantization component 313 may be substantially similar to the transform scaling and quantization component 213. The transformed and quantized residual blocks and the corresponding prediction blocks (along with associated control data) are forwarded to an entropy coding component 331 for coding into a bitstream. The entropy coding component 331 may be substantially similar to the header formatting and CABAC component 231.

The transformed and quantized residual blocks and/or the corresponding prediction blocks are also forwarded from the transform and quantization component 313 to an inverse transform and quantization component 329 for reconstruction into reference blocks for use by the motion compensation component 321. The inverse transform and quantization component 329 may be substantially similar to the scaling and inverse transform component 229. In-loop filters in an in-loop filters component 325 are also applied to the residual blocks and/or reconstructed reference blocks, depending on the example. The in-loop filters component 325 may be substantially similar to the filter control analysis component 227 and the in-loop filters component 225. The in-loop filters component 325 may include multiple filters, including a noise suppression filter as discussed below. The filtered blocks are then stored in a decoded picture buffer 323 for use as reference blocks by the motion compensation component 321. The decoded picture buffer 323 may be substantially similar to the decoded picture buffer 223.

FIG. 4 is a block diagram illustrating an example video decoder 400. Video decoder 400 may be employed to implement the decoding functions of codec system 200 and/or implement steps 111, 113, 115, and/or 117 of method 100. Decoder 400 receives a bitstream, for example from an encoder 300, and generates a reconstructed output video signal based on the bitstream for display to an end user.

The bitstream is received by an entropy decoding component 433. The entropy decoding component 433 performs the reverse function of the entropy coding component 331. The entropy decoding component 433 is configured to implement an entropy decoding scheme, such as CAVLC, CABAC, SBAC, PIPE coding, or another entropy coding technique. For example, the entropy decoding component 433 may employ header information to provide a context to interpret additional data encoded as codewords in the bitstream. The decoded information includes any desired information to decode the video signal, such as general control data, filter control data, partition information, motion data, prediction data, and quantized transform coefficients from residual blocks. The quantized transform coefficients are forwarded to an inverse transform and quantization component 429 for reconstruction into residual blocks. The inverse transform and quantization component 429 may be substantially similar to the inverse transform and quantization component 329.

The reconstructed residual blocks and/or prediction blocks are forwarded to intra-picture prediction component 417 for reconstruction into image blocks based on intra-prediction operations. Intra-picture prediction component 417 may be substantially similar to intra-picture prediction component 317, but operates in reverse. Specifically, the intra-picture prediction component 417 employs prediction modes to locate a reference block in a frame and applies a residual block to the result to reconstruct intra-predicted image blocks. The reconstructed intra-predicted image blocks and/or the residual blocks and corresponding inter-prediction data are forwarded to a decoded picture buffer component 423 via in-loop filters component 425, which may be substantially similar to decoded picture buffer component 323 and in-loop filters component 325, respectively. The in-loop filters component 425 filters the reconstructed image blocks, residual blocks and/or prediction blocks, and such information is stored in the decoded picture buffer component 423. Reconstructed image blocks from decoded picture buffer component 423 are forwarded to a motion compensation component 421 for inter-prediction. The motion compensation component 421 may be substantially similar to motion compensation component 321, but may operate in reverse. Specifically, the motion compensation component 421 employs motion vectors from a reference block to generate a prediction block and applies a residual block to the result to reconstruct an image block. The resulting reconstructed blocks may also be forwarded via the in-loop filters component 425 to the decoded picture buffer component 423. The decoded picture buffer component 423 continues to store additional reconstructed image blocks, which can be reconstructed into frames via the partition information. Such frames may also be placed in a sequence. The sequence is output toward a display as a reconstructed output video signal.

Inter-prediction

Many schemes are employed in tandem to compress video data during the video coding process. For example, a video sequence is divided into image frames. The image frames are then partitioned into image blocks. The image blocks may then be compressed by inter-prediction (correlation between blocks in different frames) or intra-prediction (correlation between blocks in the same frame).

Inter-prediction is employed when a coding object, such as a coding tree unit (CTU), a coding tree block (CTB), a coding unit (CU), a sub-CU, etc., appears in multiple frames of a video sequence. Rather than coding the same object in each frame, the object is coded in a reference frame and a motion vector (MV) is employed to indicate a motion trajectory of the object. The motion trajectory of an object is the object’s movement over time. An MV is a vector that indicates a direction and magnitude of an object’s change in position between frames. The object and the MV can be coded in a bitstream and decoded by a decoder. In an effort to further increase coding efficiency and reduce the size of the encoding, the MV may be omitted from the bitstream and derived at the decoder. For example, a pair of reference frames may be employed. A reference frame is a frame in a bitstream that includes data that can be coded by reference when coding related frames. Matching algorithms, such as bi-lateral matching and/or template matching, may be employed to determine the position of the coding object in both reference frames. A bi-lateral matching algorithm matches a block in a previous frame to a block in a current frame. A template matching algorithm matches adjacent blocks to a current block with adjacent blocks to the current block in one or more reference frames. Once the position of the object is determined in both reference frames, an MV can be determined that represents the motion of the object between the reference frames. The MV can then be employed to position the object in the frames between the reference frames. As a specific example, an initial MV can be determined for an entire CU. A local search can then be employed to refine the initial MV. Further, MVs for sub-CU components of the object can be determined and refined based on the refined initial MV. Such an approach indicates the correct position of the object so long as the motion trajectory of the object is continuous between the reference frames.
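As a non-limiting illustration of the local search mentioned above, the following minimal sketch refines an initial MV by testing nearby integer offsets and keeping the one with the lowest matching cost. The sum of absolute differences (SAD) cost, the function names `sad` and `refine_mv`, the integer-sample MVs, and the search radius are all assumptions made for this illustration and are not taken from the present disclosure.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
        for j in range(size)
        for i in range(size)
    )


def refine_mv(cur, ref, x, y, size, init_mv, radius=1):
    """Search integer offsets within `radius` of init_mv; return (best_mv, cost)."""
    best_mv, best_cost = init_mv, None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            mvx, mvy = init_mv[0] + dx, init_mv[1] + dy
            rx, ry = x + mvx, y + mvy
            # Skip candidates whose reference block falls outside the frame.
            if not (0 <= rx <= len(ref[0]) - size and 0 <= ry <= len(ref) - size):
                continue
            cost = sad(cur, x, y, ref, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (mvx, mvy), cost
    return best_mv, best_cost


# Hypothetical 8x8 frames: the 2x2 block at (2, 2) in the current frame is a
# copy of the block at (4, 3) in the reference frame, so the true MV is (2, 1).
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [[0] * 8 for _ in range(8)]
for j in range(2):
    for i in range(2):
        cur[2 + j][2 + i] = ref[3 + j][4 + i]

result = refine_mv(cur, ref, x=2, y=2, size=2, init_mv=(1, 1), radius=1)
```

In this toy example the search starts one sample away from the true displacement and settles on the offset whose SAD is zero.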

FIG. 5 is a schematic diagram illustrating an example of unidirectional inter-prediction 500, for example as performed to determine motion vectors (MVs) at block compression step 105, block decoding step 113, motion estimation component 221, motion compensation component 219, motion compensation component 321, and/or motion compensation component 421.

Unidirectional inter-prediction 500 employs a reference frame 530 with a reference block 531 to predict a current block 511 in a current frame 510. The reference frame 530 may be temporally positioned after the current frame 510 as shown, but may also be temporally positioned before the current frame 510 in some examples. The current frame 510 is an example frame/picture being encoded/decoded at a particular time. The current frame 510 contains an object in the current block 511 that matches an object in the reference block 531 of the reference frame 530. The reference frame 530 is a frame that is employed as a reference for encoding a current frame 510, and a reference block 531 is a block in the reference frame 530 that contains an object also contained in the current block 511 of the current frame 510.

The current block 511 is any coding unit that is being encoded/decoded at a specified point in the coding process. The current block 511 may be an entire partitioned block, or may be a sub-block in the affine inter-prediction case. The current frame 510 is separated from the reference frame 530 by some temporal distance (TD) 533. The TD 533 indicates an amount of time between the current frame 510 and the reference frame 530 in a video sequence. Over the time period represented by the TD 533, the object in the current block 511 moves from a position in the current frame 510 to another position in the reference frame 530 (e.g., the position of the reference block 531) . For example, the object may move along a motion trajectory 513, which is a direction of movement of an object over time. A motion vector 535 describes the direction and magnitude of the movement of the object along the motion trajectory 513 over the TD 533. Accordingly, an encoded MV 535 and a reference block 531 provide information sufficient to reconstruct a current block 511 and position the current block 511 in the current frame 510.
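The reconstruction described above can be sketched as follows. This is a minimal illustration only: it assumes integer-sample motion vectors expressed as an (x, y) displacement from the current block position to the reference block position, no frame-boundary handling, and an already decoded residual. The names `predict_block` and `reconstruct_block` are hypothetical.

```python
def predict_block(ref_frame, x, y, w, h, mv):
    """Copy the w x h block that the MV points to in the reference frame."""
    dx, dy = mv
    return [row[x + dx : x + dx + w] for row in ref_frame[y + dy : y + dy + h]]


def reconstruct_block(ref_frame, x, y, w, h, mv, residual):
    """Prediction block plus residual block gives the reconstructed block."""
    pred = predict_block(ref_frame, x, y, w, h, mv)
    return [
        [p + r for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(pred, residual)
    ]


# Hypothetical 6x6 reference frame whose sample at (row, col) is 10*row + col.
ref_frame = [[10 * r + c for c in range(6)] for r in range(6)]
residual = [[1, 0], [0, -1]]
block = reconstruct_block(ref_frame, x=1, y=1, w=2, h=2, mv=(2, 1), residual=residual)
# block == [[24, 24], [33, 33]]
```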

FIG. 6 is a schematic diagram illustrating an example of bidirectional inter-prediction 600, for example as performed to determine MVs at block compression step 105, block decoding step 113, motion estimation component 221, motion compensation component 219, motion compensation component 321, and/or motion compensation component 421. For example, bidirectional inter-prediction 600 can be employed to determine motion vectors for a block in inter-prediction modes and/or to determine motion vectors for sub-blocks in affine inter-prediction mode.

Bidirectional inter-prediction 600 is similar to unidirectional inter-prediction 500, but employs a pair of reference frames to predict a current block 611 in a current frame 610. Hence current frame 610 and current block 611 are substantially similar to current frame 510 and current block 511, respectively. The current frame 610 is temporally positioned between a preceding reference frame 620, which occurs before the current frame 610 in the video sequence, and a subsequent reference frame 630, which occurs after the current frame 610 in the video sequence. Preceding reference frame 620 and subsequent reference frame 630 are otherwise substantially similar to reference frame 530.

The current block 611 is matched to a preceding reference block 621 in the preceding reference frame 620 and to a subsequent reference block 631 in the subsequent reference frame 630. Such a match indicates that, over the course of the video sequence, an object moves from a position at the preceding reference block 621 to a position at the subsequent reference block 631 along a motion trajectory 613 and via the current block 611. The current frame 610 is separated from the preceding reference frame 620 by some preceding temporal distance (TD0) 623 and separated from the subsequent reference frame 630 by some subsequent temporal distance (TD1) 633. The TD0 623 indicates an amount of time between the preceding reference frame 620 and the current frame 610 in the video sequence. The TD1 633 indicates an amount of time between the current frame 610 and the subsequent reference frame 630 in the video sequence. Hence, the object moves from the preceding reference block 621 to the current block 611 along the motion trajectory 613 over a time period indicated by TD0 623. The object also moves from the current block 611 to the subsequent reference block 631 along the motion trajectory 613 over a time period indicated by TD1 633.

A preceding motion vector (MV0) 625 describes the direction and magnitude of the movement of the object along the motion trajectory 613 over the TD0 623 (e.g., between the preceding reference frame 620 and the current frame 610) . A subsequent motion vector (MV1) 635 describes the direction and magnitude of the movement of the object along the motion trajectory 613 over the TD1 633 (e.g., between the current frame 610 and the subsequent reference frame 630) . As such, in bidirectional inter-prediction 600, the current block 611 can be coded and reconstructed by employing the preceding reference block 621 and/or the subsequent reference block 631, MV0 625, and MV1 635.
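One common way to combine the two reference blocks is to average them sample by sample. The sketch below illustrates this under stated assumptions: equal weighting with rounding, integer-sample motion vectors given as displacements from the current block position into each reference frame, and no residual. The present disclosure does not prescribe this particular combination, and the names `fetch` and `bi_predict` are hypothetical.

```python
def fetch(ref_frame, x, y, w, h, mv):
    """Return the w x h block displaced by mv from position (x, y)."""
    dx, dy = mv
    return [row[x + dx : x + dx + w] for row in ref_frame[y + dy : y + dy + h]]


def bi_predict(ref0, ref1, x, y, w, h, mv0, mv1):
    """Average the preceding and subsequent reference blocks with rounding."""
    p0 = fetch(ref0, x, y, w, h, mv0)
    p1 = fetch(ref1, x, y, w, h, mv1)
    return [
        [(a + b + 1) >> 1 for a, b in zip(row0, row1)]
        for row0, row1 in zip(p0, p1)
    ]


# Hypothetical preceding and subsequent reference frames.
ref0 = [[r + c for c in range(6)] for r in range(6)]
ref1 = [[r + c + 3 for c in range(6)] for r in range(6)]
pred = bi_predict(ref0, ref1, x=2, y=2, w=2, h=2, mv0=(-1, -1), mv1=(1, 1))
# pred == [[6, 7], [7, 8]]
```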

Intra-Prediction

Many schemes are employed in tandem to compress video data during the video coding process. For example, a video sequence is divided into image frames. The image frames are then partitioned into image blocks. The image blocks may then be compressed by inter-prediction (correlation between blocks in different frames) or intra-prediction (correlation between blocks in the same frame) . In intra-prediction, a current image block is predicted from a reference line of samples. The reference line includes samples from adjacent image blocks, also called neighbor blocks. Samples from the current block are matched with samples from the reference line with the nearest luma (light) or chroma (color) values. The current block is coded as prediction modes that indicate the matching samples. The prediction modes include angular prediction modes, direct current (DC) mode, and planar mode. Differences between values predicted by the prediction modes and actual values are coded as residual values in a residual block.

FIG. 7 is a schematic diagram illustrating example intra-prediction modes 700 employed in video coding. For example, intra-prediction modes 700 may be employed by steps 105 and 113 of method 100, intra-picture estimation component 215 and an intra-picture prediction component 217 of codec system 200, intra-picture prediction component 317 of encoder 300, and/or intra-picture prediction component 417 of decoder 400. Specifically, intra-prediction modes 700 can be employed to compress an image block into a prediction block containing a selected prediction mode and a remaining residual block.

As noted above, intra-prediction involves matching a current image block to a corresponding sample or samples of one or more neighboring blocks. The current image block can then be represented as a selected prediction mode index and a residual block, which is much smaller than representing all of the luma/chroma values contained in the current image block. Intra-prediction can be used when there is no available reference frame, or when inter-prediction coding is not used for the current block or frame. The reference samples for intra-prediction may be derived from previously coded (or reconstructed) neighboring blocks in the same frame. Advanced Video Coding (AVC), also known as H.264, and H.265/HEVC both employ a reference line of boundary samples of adjacent blocks as reference samples for intra-prediction. In order to cover different textures or structural characteristics, many different intra-prediction modes are employed. H.265/HEVC supports a total of thirty-five intra-prediction modes 700 that spatially correlate a current block to one or more reference samples. Specifically, intra-prediction modes 700 include thirty-three directional prediction modes indexed as modes two through thirty-four, a DC mode indexed as mode one, and a planar mode indexed as mode zero.

During encoding, the encoder matches the luma/chroma values of a current block with the luma/chroma values of corresponding reference samples in a reference line across the edges of neighboring blocks. When the best match is found with one of the reference lines, the encoder selects one of the directional intra-prediction modes 700 that points to the best matching reference line. For clarity of discussion, acronyms are employed below to reference particular directional intra-prediction modes 700. DirS denotes the starting directional intra-prediction mode when counting clockwise from the bottom left (e.g., mode two in HEVC) . DirE denotes the ending directional intra-prediction mode when counting clockwise from the bottom left (e.g., mode thirty four in HEVC) . DirD denotes the middle directional intra coding mode when counting clockwise from the bottom left (e.g., mode eighteen in HEVC) . DirH denotes a horizontal intra prediction mode (e.g., mode ten in HEVC) . DirV denotes a vertical intra prediction mode (e.g., mode twenty six in HEVC) .

As discussed above, DC mode operates as a smoothing function and derives a prediction value of a current block as an average value of all the reference samples in the reference line traversing the neighboring blocks. Also as discussed above, planar mode returns a prediction value that indicates a smooth transition (e.g., constant slope of values) between samples at the bottom and top left or top left and top right of the reference line of reference samples.

For Planar, DC, and prediction modes from DirH to DirV, the samples in both the top row of the reference line and the left column of the reference line are used as reference samples. For prediction modes with prediction directions from DirS to DirH (including DirS and DirH) , the reference samples in the previously coded and reconstructed neighboring blocks on the left column of the reference line are used as reference samples. For prediction modes with prediction directions from DirV to DirE (including DirV and DirE) , the reference samples of the previously coded and reconstructed neighboring blocks on the top row of the reference line are used as reference samples.
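The mode-dependent choice of reference samples, together with the DC average, can be sketched as follows. The mode indices are the HEVC values given above. The function names, the rounding in the DC average, and the treatment of the boundary modes DirH and DirV (assigned here to the left-only and top-only groups, respectively) are assumptions made for this illustration.

```python
PLANAR, DC = 0, 1
DIR_S, DIR_H, DIR_D, DIR_V, DIR_E = 2, 10, 18, 26, 34  # HEVC mode indices


def reference_sides(mode):
    """Return which sides of the reference line a mode reads from."""
    if mode in (PLANAR, DC) or DIR_H < mode < DIR_V:
        return {"left", "top"}
    if DIR_S <= mode <= DIR_H:
        return {"left"}
    if DIR_V <= mode <= DIR_E:
        return {"top"}
    raise ValueError(f"unknown intra-prediction mode: {mode}")


def dc_predict(top, left):
    """DC mode: rounded average of all reference samples (assumed rounding)."""
    samples = list(top) + list(left)
    return (sum(samples) + len(samples) // 2) // len(samples)


# Hypothetical reference samples for a 4x4 block.
top = [100, 102, 104, 106]
left = [98, 100, 102, 104]
dc_value = dc_predict(top, left)  # every sample in the block is predicted as 102
```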

FIG. 8 is a schematic diagram illustrating an example of directional relationships of blocks 800 in video coding. For example, the blocks 800 may be employed when selecting intra-prediction modes 700. Hence, blocks 800 may be employed by steps 105 and 113 of method 100, intra-picture estimation component 215 and an intra-picture prediction component 217 of codec system 200, intra-picture prediction component 317 of encoder 300, and/or intra-picture prediction component 417 of decoder 400. In video coding, blocks 800 are partitioned based on video content and hence may include many rectangles and squares of varying shapes and sizes. Blocks 800 are depicted as squares for purposes of explanation and are hence simplified from actual video coding blocks to support clarity of discussion.

The blocks 800 contain a current block 801 and neighbor blocks 810. The current block 801 is any block being coded at a specified time. The neighbor blocks 810 are any blocks immediately adjacent to the left edge or top edge of the current block 801. Video coding generally proceeds from top left to bottom right. As such, the neighbor blocks 810 may have been encoded and reconstructed prior to coding of the current block 801. When coding the current block 801, the encoder matches the luma/chroma values of the current block 801 with a reference sample (or samples) from a reference line traversing the edges of the neighbor blocks 810. The match is then employed to select an intra-prediction mode, for example from intra-prediction modes 700, that points to the matched sample (or samples when DC or planar mode is selected). The selected intra-prediction mode then indicates that the luma/chroma values of the current block 801 are substantially similar to the reference sample corresponding to the selected intra-prediction mode. Any differences can be retained in a residual block. The selected intra-prediction mode is then encoded in a bitstream. At the decoder, the current block 801 can be reconstructed by employing the luma/chroma values of the reference samples in the selected reference line in the neighbor blocks 810 that corresponds to the selected intra-prediction mode (along with any residual information from the residual block).

In-loop Filters

Video coding schemes subdivide video signals into image frames, and then subdivide the image frames into various types of blocks. The image blocks are then compressed. This approach may create visual artefacts when the compressed video signal is reconstructed and displayed. For example, blocky shapes can be artificially added by the image compression process. This is known as blocking, and generally occurs at block partition boundaries. In addition, non-linear signal-dependent rounding error, known as quantization noise, may be artificially added to a compressed image. Various filters may be employed to correct for such artefacts. The filters may be applied to reconstructed frames in post processing. Post processing occurs after significant portions of the compressed video signal have been reconstructed and immediately prior to display to a user. The filters may also be applied as part of the compression/decompression process by employing a mechanism called in-loop filtering. In-loop filtering is a filtering scheme that applies filters to reconstructed video images during the encoding and/or decoding process to support more accurate compression between related images. For example, inter-prediction encodes an image frame based on a previous and/or subsequent image frame. At an encoder, a compressed image is reconstructed and filtered via in-loop filtering so that the reconstructed image provides a more accurate image for use in encoding previous/subsequent image frame(s) via inter-prediction. At a decoder, a compressed image is reconstructed and filtered via in-loop filtering both to create a more accurate image for viewing by an end user and to support more accurate inter-prediction. In-loop filtering employs several filters such as a deblocking filter, a sample adaptive offset (SAO) filter, and an adaptive loop filter. In-loop filtering can also include a noise suppression filter.

FIG. 9 is a block diagram illustrating an example in-loop filter 900. In-loop filter 900 may be employed to implement in-loop filters 225, 325, and/or 425. The in-loop filter 900 includes a noise suppression filter 941, a deblocking filter 943, a sample adaptive offset (SAO) filter 945, and an adaptive loop filter 947. The filters of in-loop filter 900 are applied in sequence to reconstructed image blocks and/or a residual block.

The noise suppression filter 941 is configured to remove quantization noise caused by image compression. Specifically, the noise suppression filter 941 is employed to remove artefacts that occur at edges in the image. For example, image compression may create distinct and incorrect color/light values adjacent to sharp transitions (edges) between different color/light patches in an image. This is referred to as ringing, and is caused by application of transforms to high frequency portions of the image data that are associated with sharp edges. The noise suppression filter 941 is employed to mitigate such ringing artefacts. The noise suppression filter 941 operates in both the spatial domain (e.g., spatial orientation of pixels) and the frequency domain (e.g., relationship of transformed coefficient values relating to pixel data) . At the encoder, the noise suppression filter 941 partitions a reconstructed frame into reference macroblocks. Such blocks can also be sub-divided into smaller reference blocks. The noise suppression filter 941 first generates an application map indicating the portions of the frame that should be filtered based on an estimated amount of quantization noise at the block. The noise suppression filter 941 then employs a matching component to determine, for each reference block as indicated by the application map, a set of patches that are similar to the corresponding reference block, where similar indicates chroma/luma values are within a predetermined range. The noise suppression filter 941 then groups the patches into clusters and may employ a two dimensional (2D) transform to transform the clusters into the frequency domain resulting in frequency domain patches. The noise suppression filter 941 may also employ a reverse 2D transform to convert the frequency domain patches back into the spatial domain.

The deblocking filter 943 is configured to remove block-shaped edges created by the block-based inter- and intra-prediction. The deblocking filter 943 scans an image portion (e.g., image slice) for discontinuities in chroma and/or luma values occurring at partition boundaries. The deblocking filter 943 then applies a smoothing function to the block boundaries to remove such discontinuities. The strength of the deblocking filter 943 may be varied depending on the spatial activity (e.g., variance of luma/chroma components) occurring in an area adjacent to the block boundaries.

The SAO filter 945 is configured to remove artefacts related to sample distortion caused by the encoding process. The SAO filter 945 at an encoder classifies deblocked samples of a reconstructed image into several categories based on relative deblocking edge shape and/or direction. An offset is then determined and added to the samples based on the categories. The offsets are then encoded in the bitstream and employed by the SAO filter 945 at the decoder. The SAO filter 945 removes banding artefacts (bands of values instead of smooth transitions) and ringing artefacts (spurious signals near sharp edges) .
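The classify-then-offset behavior of an SAO filter can be illustrated with a one-dimensional edge-offset sketch. The example assumes a horizontal comparison direction, the HEVC-style five-category edge classification (local minimum, concave corner, convex corner, local maximum, none), and hypothetical offset values; the function names are also hypothetical, and the present disclosure does not limit the SAO filter 945 to this form.

```python
def _sign(d):
    """Return -1, 0, or 1 according to the sign of d."""
    return (d > 0) - (d < 0)


def edge_category(a, c, b):
    """Classify sample c against its neighbors a and b along one direction.

    1: local minimum, 2: concave corner, 3: convex corner,
    4: local maximum, 0: none of these (no offset applied).
    """
    s = _sign(c - a) + _sign(c - b)
    return {-2: 1, -1: 2, 1: 3, 2: 4}.get(s, 0)


def apply_sao_row(row, offsets):
    """Add the per-category offset to each interior sample of a row."""
    out = list(row)
    for i in range(1, len(row) - 1):
        # Neighbors are read from the unmodified input row.
        out[i] = row[i] + offsets[edge_category(row[i - 1], row[i], row[i + 1])]
    return out


# Hypothetical deblocked samples and offsets (as would be signaled in the bitstream).
row = [50, 48, 50, 50, 53, 50]
offsets = {0: 0, 1: 2, 2: 1, 3: -1, 4: -2}
filtered = apply_sao_row(row, offsets)
# filtered == [50, 50, 49, 51, 51, 50]
```

The offsets pull local minima up and local maxima down, which reduces the spurious oscillation near sharp edges described above.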

The adaptive loop filter 947, at the encoder, is configured to compare a reconstructed image to an original image. The adaptive loop filter 947 determines coefficients that describe the differences between the reconstructed image and the original image, for example via a Wiener based adaptive filter. Such coefficients are encoded in the bitstream and employed at the adaptive loop filter 947 at the decoder to remove the differences between the reconstructed image and the original image. While the adaptive loop filter 947 is effective in correcting artefacts, greater differences between the reconstructed image and the original image result in a greater number of coefficients to be signaled. This in turn creates a larger bitstream and hence reduces the effectiveness of compression. As such, minimization of differences by other filters prior to applying the adaptive loop filter 947 results in improved compression.

Partitioning

Video coding employs an encoder to compress media files and a decoder to reconstruct the original media files from the compressed media files. Video coding employs various standardized processes to ensure that any decoder employing the standardized process can consistently reproduce a media file as compressed by any encoder also employing the standardized process. For example, an encoder and a decoder may both employ a coding standard such as High Efficiency Video Coding (HEVC), which is also known as H.265. At an encoder, a video signal is separated into frames. Frames are then partitioned into image blocks containing groups of pixels. The image blocks are then compressed, filtered, and encoded into a bitstream. The bitstream may then be transmitted to a decoder, which reconstructs the video signal for display to an end user.

Partitioning systems are configured to split image blocks into sub-blocks. For example, a tree structure employing various split modes can be employed to split a node (e.g., a block) into child nodes (e.g., sub-blocks). Different split modes can be employed to obtain different partitions. Further, split modes can also be applied recursively to further subdivide nodes. Such application of the split modes results in various partition patterns.

FIG. 10 illustrates example split modes 1000 employed in block partitioning. A split mode 1000 is a mechanism to split a parent node (e.g., an image block) into a plurality of child nodes (e.g., image sub-blocks) during partitioning. Split modes 1000 include a quad-tree (QT) split mode 1001, a vertical binary tree (BT) split mode 1003, a horizontal BT split mode 1005, a vertical triple tree (TT) split mode 1007, and a horizontal TT split mode 1009. The QT split mode 1001 is a tree structure for block partitioning in which a node of size 4Mx4N is split into four child nodes of size 2Mx2N, where M indicates block width and N indicates block height. The vertical BT split mode 1003 and horizontal BT split mode 1005 are tree structures for block partitioning in which a node of size 4Mx4N is vertically split into two child nodes of size 2Mx4N or horizontally split into two child nodes of size 4Mx2N, respectively. The vertical TT split mode 1007 and horizontal TT split mode 1009 are tree structures for block partitioning in which a node of size 4Mx4N is vertically split into three child nodes of size Mx4N, 2Mx4N and Mx4N; or horizontally split into three child nodes of size 4MxN, 4Mx2N and 4MxN, respectively. Among the three child nodes, the largest node is positioned in the center.
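The child-node sizes produced by each split mode can be sketched as follows. The mode labels (`"QT"`, `"BT_VER"`, and so on) and the function name `child_sizes` are hypothetical, and the sketch assumes that the parent width and height are divisible by four so that every child has integer dimensions. In each mode the children tile the parent exactly; for the QT split this means four children that are each half the parent's width and half its height.

```python
def child_sizes(width, height, mode):
    """Return the (width, height) of each child node for a given split mode."""
    if mode == "QT":        # four quadrants
        return [(width // 2, height // 2)] * 4
    if mode == "BT_VER":    # two side-by-side halves
        return [(width // 2, height)] * 2
    if mode == "BT_HOR":    # two stacked halves
        return [(width, height // 2)] * 2
    if mode == "TT_VER":    # quarter, half, quarter; largest in the center
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if mode == "TT_HOR":
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError(f"unknown split mode: {mode}")


# A parent node of size 4Mx4N with M = 8 and N = 4, i.e. 32x16 samples.
for mode in ("QT", "BT_VER", "BT_HOR", "TT_VER", "TT_HOR"):
    sizes = child_sizes(32, 16, mode)
    # The child areas always add up to the parent area.
    assert sum(w * h for w, h in sizes) == 32 * 16
```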

Split modes 1000 may also be applied recursively to further divide the blocks. For example a quad-tree binary-tree (QT-BT) can be created by partitioning a node with QT split mode 1001, and then partitioning each child node (sometimes referred to as quad-tree leaf nodes) with a vertical BT split mode 1003 and/or a horizontal BT split mode 1005. Further, a quad-tree triple tree (QT-TT) can be created by partitioning a node with a quad-tree split, and then partitioning the resulting child nodes with the vertical TT split mode 1007 and/or the horizontal TT split mode 1009.
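A QT-BT partition of the kind described above can be sketched by applying a QT split to a root node and then an optional BT split to each quad-tree leaf node. Blocks are represented here as hypothetical `(x, y, width, height)` tuples, and the per-leaf choice of BT mode is passed in directly; in practice an encoder would make that choice based on video content, which is outside the scope of this illustration.

```python
def split_block(block, mode):
    """Split an (x, y, w, h) block with a QT or BT split mode."""
    x, y, w, h = block
    hw, hh = w // 2, h // 2
    if mode == "QT":
        return [(x, y, hw, hh), (x + hw, y, hw, hh),
                (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
    if mode == "BT_VER":
        return [(x, y, hw, h), (x + hw, y, hw, h)]
    if mode == "BT_HOR":
        return [(x, y, w, hh), (x, y + hh, w, hh)]
    raise ValueError(f"unknown split mode: {mode}")


def qt_bt(root, bt_modes):
    """QT-split the root, then apply one BT mode (or None) per QT leaf."""
    leaves = []
    for child, bt_mode in zip(split_block(root, "QT"), bt_modes):
        leaves.extend([child] if bt_mode is None else split_block(child, bt_mode))
    return leaves


# Hypothetical 64x64 root: first quadrant split vertically, second split
# horizontally, remaining two quadrants left unsplit.
leaves = qt_bt((0, 0, 64, 64), ["BT_VER", "BT_HOR", None, None])
# leaves == [(0, 0, 16, 32), (16, 0, 16, 32), (32, 0, 32, 16),
#            (32, 16, 32, 16), (0, 32, 32, 32), (32, 32, 32, 32)]
```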

HEVC operates on a Joint Exploration Model (JEM) application. In JEM, QT-BT block partitioning is employed to partition a coding tree unit (CTU) into a plurality of blocks. TT block partitioning has also been proposed for inclusion into JEM to further enrich block partition types. In video coding based on QT, QT-BT, QT-TT block partitioning split modes, a coding or prediction block in depth K may be split into a number N of smaller coding or prediction blocks of depth K+1 by a BT, TT or QT split mode, where N is set to two, three, or four, respectively. The partition patterns of the split modes are shown in FIG. 10, with the partition patterns indicating the size and position of two or more child nodes split from a parent node.

Transform

Video coding employs an encoder to compress media files and a decoder to reconstruct the original media files from the compressed media files. Video coding employs various standardized processes to ensure that any decoder employing the standardized process can consistently reproduce a media file as compressed by any encoder also employing the standardized process. For example, an encoder and a decoder may both employ a coding standard such as High Efficiency Video Coding (HEVC), which is also known as H.265. H.265 is based on a prediction plus transform framework. At an encoder, video files are separated into frames. Frames are then sub-divided into image blocks containing groups of pixels. Image blocks are further decomposed into prediction blocks containing prediction information, such as prediction modes and motion vector information, and residual blocks containing residual information, such as transform modes, transform coefficients, and quantization parameters. A prediction block and a residual block employ less storage space than an image block, but can be employed by a decoder to reconstruct an image block. The prediction blocks and residual blocks are coded into a bitstream and transmitted to a decoder and/or stored for later transmission upon request.

At a decoder, the prediction information and the residual information are parsed. The parsed prediction information is then employed to generate prediction samples by employing intra-prediction or inter-prediction. Intra-prediction employs reconstructed image blocks to predict other image blocks in the same frame. Inter-prediction employs reconstructed image blocks to predict other image blocks between adjacent frames. Further, the residual information is employed to generate residual samples, for example by sequentially applying inverse quantization and inverse transforms. The prediction samples and the residual samples are combined to obtain reconstructed samples that correspond to the image blocks coded by the encoder (e.g., for display to an end user on a monitor) .
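As a non-limiting illustration, the final combination of prediction samples and residual samples can be sketched as follows. The function name `reconstruct_block` and the clipping of the result to the valid sample range for an assumed bit depth are assumptions of this example and are not stated in the description above.

```python
def reconstruct_block(pred, resid, bit_depth=8):
    """Add residual samples to prediction samples, sample by sample.

    pred and resid are equally sized lists of rows. The result is clipped
    to [0, 2**bit_depth - 1], which is an assumption of this sketch.
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(pred_row, resid_row)]
        for pred_row, resid_row in zip(pred, resid)
    ]


# Example: a 2x2 block with 8-bit samples.
recon = reconstruct_block([[120, 130], [250, 10]], [[3, -5], [10, -20]])
```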

Spatial varying transform (SVT) is a mechanism employed to further improve video coding efficiency. SVT employs a transform block to further compress the residual block. Specifically, a rectangular residual block includes a width w and height h (e.g., w×h block) . A transform block is selected that is smaller than the residual block. Hence, the transform block is employed to transform a corresponding part of the residual block, and leaves the remainder of the residual block without additional coding/compression. The rationale behind SVT is that residual information may not be distributed evenly in a residual block. Using a smaller transform block with an adaptive position can capture the majority of the residual information in the residual block, without requiring the entire residual block be transformed. This approach may achieve better coding efficiency than transforming all the residual information in the residual block in some cases. As the transform block is smaller than the residual block, SVT employs a mechanism for signaling the position of the transform relative to the residual block. Such position signaling increases the overall signaling overhead of the coding process, and hence reduces the efficiency of the compression. Additionally, employing the same type of transform block in all cases may not lead to beneficial results in some cases.
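As a non-limiting illustration of why an adaptive transform position can capture most of the residual information, the following sketch searches for the position of a smaller transform block that covers the largest residual energy. The exhaustive search, the energy criterion, and the name `best_transform_position` are assumptions of this example; the description does not specify how an encoder selects the position.

```python
def best_transform_position(residual, tw, th):
    """Find the top-left (x, y) of a tw x th window with maximal energy.

    residual is a list of rows (h rows of w samples). Returns (x, y, energy),
    where energy is the sum of squared residual samples in the window.
    """
    h, w = len(residual), len(residual[0])
    best = None
    for y in range(h - th + 1):
        for x in range(w - tw + 1):
            energy = sum(residual[yy][xx] ** 2
                         for yy in range(y, y + th)
                         for xx in range(x, x + tw))
            if best is None or energy > best[2]:
                best = (x, y, energy)
    return best


# Example: residual energy concentrated in the bottom-right corner.
res = [[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 3, 4],
       [0, 0, 1, 2]]
pos = best_transform_position(res, 2, 2)
```

In this example the 2x2 window at (2, 2) covers all of the residual energy, so the remaining three quarters of the block need no transform coding, at the cost of signaling the chosen position.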

FIG. 11 is a schematic diagram of an example video encoding mechanism 1100. An image block 1101 can be obtained by an encoder from one or more frames. For example, an image may be split into a plurality of rectangular image regions. Each region of the image corresponds to a Coding Tree Unit (CTU) . A CTU is partitioned into a plurality of blocks, such as the coding units in HEVC. Block partition information is then encoded in a bitstream 1111. Accordingly, the image block 1101 is a partitioned portion of an image and contains pixels that represent luma components and/or chroma components at a corresponding portion of the image. During encoding, the image block 1101 is encoded as a prediction block 1103 containing prediction information such as prediction modes for intra-prediction and/or motion vectors for inter-prediction. Encoding the image block 1101 as a prediction block 1103 may then leave a residual block 1105 containing residual information indicating the difference between the prediction block 1103 and the image block 1101.

It should be noted that an image block 1101 may be partitioned as a coding unit (CU) that contains one prediction block 1103 and one residual block 1105. The prediction block 1103 may contain all prediction samples of the coding unit, and the residual block 1105 may contain all residual samples of the coding unit. In such a case, the prediction block 1103 is of the same size as the residual block 1105. In another example, the image block 1101 may be partitioned as a coding unit that contains two prediction blocks 1103 and one residual block 1105. In such a case, each prediction block 1103 contains a portion of the prediction samples of the coding unit, and the residual block 1105 contains all of the residual samples of the coding unit. In yet another example, the image block 1101 is partitioned into a coding unit (CU) that contains two prediction blocks 1103 and four residual blocks 1105. The partition pattern of the residual blocks 1105 in a coding unit may be signaled in the bitstream 1111. Such partition patterns may include the Residual Quad-Tree (RQT) in HEVC. Further, an image block 1101 may contain only luma components (e.g., light), denoted as Y components, of image samples (or pixels). In other cases, the image block 1101 may contain Y, U and V components of image samples, where U and V indicate chrominance components (e.g., color) in a blue luminance and red luminance (UV) color space.

Transforms may be employed to further compress the information. Specifically, a transform block 1107 may be employed to further compress the residual block 1105. The transform block 1107 contains a transform, such as inverse Discrete Cosine Transform (DCT) and/or inverse Discrete Sine Transform (DST). The difference between the prediction block 1103 and the image block 1101 is then fit to the transform by employing transform coefficients. By indicating the transform mode of the transform block 1107 (e.g., inverse DCT and/or inverse DST) and the corresponding transform coefficients, the decoder can reconstruct the residual block 1105. When exact reproduction is not required, the transform coefficients can be further compressed by rounding certain values to create a better fit for the transform. This process is known as quantization and is performed according to quantization parameters that describe the allowable quantization. Accordingly, the transform modes, transform coefficients, and quantization parameters of the transform block 1107 are stored as transformed residual information in a transformed residual block 1109, which may also be referred to simply as a residual block in some cases.
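As a non-limiting illustration of quantization, the following sketch rounds transform coefficients to integer levels using a uniform step size and then rescales them. The uniform scalar quantizer, the step size value, and the function names are assumptions of this example; the description only states that quantization is controlled by quantization parameters.

```python
import math


def quantize(coeffs, step):
    """Map each coefficient to an integer level, rounding half away from zero."""
    return [int(math.copysign(math.floor(abs(c) / step + 0.5), c))
            for c in coeffs]


def dequantize(levels, step):
    """Rescale integer levels back to approximate coefficient values."""
    return [lv * step for lv in levels]


# Example: a larger step discards more detail but yields smaller levels.
levels = quantize([10.2, -3.9, 0.4], step=2)
approx = dequantize(levels, step=2)
```

The reconstructed values differ from the original coefficients by at most half a step, which is the loss accepted when exact reproduction is not required.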

The prediction information of the prediction block 1103 and the transformed residual information of the transformed residual block 1109 can then be encoded in a bitstream 1111. The bitstream 1111 can be stored and/or transmitted to a decoder. The decoder can then perform the process in reverse to recover the image block 1101. Specifically, the decoder can employ the transformed residual information to determine the transform block 1107. The transform block 1107 can then be employed in conjunction with the transformed residual block 1109 to determine the residual block 1105. The residual block 1105 and the prediction block 1103 can then be employed to reconstruct the image block 1101. The image block 1101 can then be positioned relative to other decoded image blocks 1101 to reconstruct frames and position such frames to recover the encoded video.

It should be noted that some prediction blocks 1103 can be encoded without resulting in a residual block 1105. However, such a case does not result in the use of a transform block 1107 and hence is not discussed further. Transform blocks 1107 may be employed for inter-predicted blocks or intra-predicted blocks. Further, transform blocks 1107 may be employed on residual blocks 1105 generated by specified inter-prediction mechanisms (e.g., translation model based motion compensation), but may not be employed on residual blocks 1105 generated by other specified inter-prediction mechanisms (e.g., affine model based motion compensation).

FIG. 12 is a schematic diagram of an example computing device 1200 for video coding according to an embodiment of the disclosure. The computing device 1200 is suitable for implementing the disclosed embodiments as described herein. The computing device 1200 comprises ingress ports 1220 and receiver units (Rx) 1210 for receiving data; a processor, logic unit, or central processing unit (CPU) 1230 to process the data; transmitter units (Tx) 1240 and egress ports 1250 for transmitting the data; a memory 1260 for storing the data. The computing device 1200 may also comprise optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress ports 1220, the receiver units 1210, the transmitter units 1240, and the egress ports 1250 for egress or ingress of optical or electrical signals. The computing device 1200 may also include wireless transmitters and/or receivers in some examples.

The processor 1230 is implemented by hardware and software. The processor 1230 may be implemented as one or more CPU chips, cores (e.g., as a multi-core processor) , field-programmable gate arrays (FPGAs) , application specific integrated circuits (ASICs) , and digital signal processors (DSPs) . The processor 1230 is in communication with the ingress ports 1220, receiver units 1210, transmitter units 1240, egress ports 1250, and memory 1260. The processor 1230 comprises a coding module 1214. The coding module 1214 implements the disclosed embodiments described above. For instance, the coding module 1214 implements, processes, prepares, or provides the various coding operations. The inclusion of the coding module 1214 therefore provides a substantial improvement to the functionality of the computing device 1200 and effects a transformation of the computing device 1200 to a different state. Alternatively, the coding module 1214 is implemented as instructions stored in the memory 1260 and executed by the processor 1230 (e.g., as a computer program product stored on a non-transitory medium) .

The memory 1260 comprises one or more disks, tape drives, and solid-state drives and may be used as an over-flow data storage device, to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution. The memory 1260 may be volatile and/or non-volatile and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and/or static random-access memory (SRAM). The computing device 1200 may also comprise an input/output (I/O) device for interacting with an end user. For example, the computing device 1200 may include a display, such as a monitor, for visual output, speakers for audio output, and a keyboard/mouse/trackball, etc. for user input.

Affine motion compensated prediction

In HEVC, only a translational motion model is applied for motion compensation prediction (MCP). In the real world, however, there are many kinds of motion, e.g., zoom in/out, rotation, perspective motions and other irregular motions. In the VVC Test Model (VTM), a block-based affine transform motion compensation prediction is applied. As shown in FIG. 13, which includes FIGS. 13A and 13B, the affine motion field of the block is described by two control point motion vectors (4-parameter affine model shown in FIG. 13A) or three control point motion vectors (6-parameter affine model shown in FIG. 13B).

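For the 4-parameter affine motion model, the motion vector at sample location (x, y) in a block is derived as:

\[ mv_x = \frac{v_{1x} - v_{0x}}{W} x - \frac{v_{1y} - v_{0y}}{W} y + v_{0x}, \qquad mv_y = \frac{v_{1y} - v_{0y}}{W} x + \frac{v_{1x} - v_{0x}}{W} y + v_{0y} \tag{1} \]

For the 6-parameter affine motion model, the motion vector at sample location (x, y) in a block is derived as:

\[ mv_x = \frac{v_{1x} - v_{0x}}{W} x + \frac{v_{2x} - v_{0x}}{H} y + v_{0x}, \qquad mv_y = \frac{v_{1y} - v_{0y}}{W} x + \frac{v_{2y} - v_{0y}}{H} y + v_{0y} \tag{2} \]

where (v_{0x}, v_{0y}) is the motion vector of the top-left corner control point, (v_{1x}, v_{1y}) is the motion vector of the top-right corner control point, (v_{2x}, v_{2y}) is the motion vector of the bottom-left corner control point, and W and H are the width and height of the block.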
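As a non-limiting illustration, the two affine models can be sketched in their commonly used form, in which the motion vector at (x, y) is interpolated linearly from the control point motion vectors. The function name `affine_mv`, the tuple representation of motion vectors, and the parameter names `w` and `h` (block width and height) are assumptions of this example.

```python
def affine_mv(x, y, w, h, v0, v1, v2=None):
    """Motion vector at sample (x, y) of a w x h block.

    v0, v1, v2 are (mv_x, mv_y) tuples for the top-left, top-right and
    bottom-left control points. If v2 is None the 4-parameter model is
    used, otherwise the 6-parameter model.
    """
    if v2 is None:
        # 4-parameter model: rotation/zoom described by two control points.
        a = (v1[0] - v0[0]) / w
        b = (v1[1] - v0[1]) / w
        return (a * x - b * y + v0[0], b * x + a * y + v0[1])
    # 6-parameter model: independent horizontal and vertical gradients.
    mv_x = (v1[0] - v0[0]) / w * x + (v2[0] - v0[0]) / h * y + v0[0]
    mv_y = (v1[1] - v0[1]) / w * x + (v2[1] - v0[1]) / h * y + v0[1]
    return (mv_x, mv_y)


# Example: the model reproduces each control point MV at its own corner.
top_right = affine_mv(16, 0, 16, 8, (1, 2), (5, 4))
bottom_left = affine_mv(0, 8, 16, 8, (1, 2), (5, 4), (3, 10))
```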

In order to simplify the motion compensation prediction, block based affine transform prediction is applied. To derive motion vector of each 4×4 sub-block, the motion vector of the center sample of each sub-block, as shown in FIG. 13, is calculated according to the above equations (1) and (2) , and rounded to 1/16 fraction accuracy. Then the motion compensation interpolation filters are applied to generate the prediction of each sub-block with derived motion vector.

FIG. 14 is an affine motion vector field (MVF) per sub-block. Referring to FIG. 14, motion vectors for a current block can be determined for each sub-block based on the motion vectors of the two control points v_0 and v_1.
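As a non-limiting illustration, a per-sub-block motion vector field for the 4-parameter model can be sketched as follows. The function name `subblock_mv_field`, the use of the sub-block center at an offset of half the sub-block size, and the use of Python's built-in rounding are assumptions of this example; an actual codec would use integer arithmetic with its own rounding rule.

```python
def subblock_mv_field(w, h, v0, v1, sb=4, frac=16):
    """Return a grid of (mv_x, mv_y) tuples, one per sb x sb sub-block.

    v0 and v1 are the top-left and top-right control point MVs of a
    w x h block. Each MV is evaluated at the sub-block center and rounded
    to 1/frac sample accuracy.
    """
    a = (v1[0] - v0[0]) / w
    b = (v1[1] - v0[1]) / w
    field = []
    for ys in range(0, h, sb):
        row = []
        for xs in range(0, w, sb):
            cx, cy = xs + sb / 2, ys + sb / 2  # center sample of the sub-block
            mv_x = a * cx - b * cy + v0[0]
            mv_y = b * cx + a * cy + v0[1]
            row.append((round(mv_x * frac) / frac, round(mv_y * frac) / frac))
        field.append(row)
    return field


# Example: a 16x8 block yields 2 rows of 4 sub-block motion vectors.
field = subblock_mv_field(16, 8, (1.0, 0.5), (1.0, 0.5))
```

When both control point motion vectors are equal, as in the example, every sub-block receives the same motion vector and the model degenerates to purely translational motion.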

In the VTM, there are two affine motion modes: AF_INTER mode and AF_MERGE mode.

AF_MERGE mode can be applied to coding units (CUs) with both width and height larger than or equal to 8. When a CU is coded in AF_MERGE mode, the control point motion vector predictors (CPMVPs) of a neighboring block coded with affine mode are extrapolated to obtain the control-point motion vectors (CPMVs) of the current block. The selection order for the candidate block is from left, above, above right, left bottom to above left as shown in FIG. 15A. According to HEVC, the order of merge candidates in the merge list is A->B->C->D->E. A merge index is used to indicate which candidate in the merge list is actually used. If the neighbour left bottom block A is coded in affine mode as shown in FIG. 15B, the motion vectors v_2, v_3 and v_4 of the top left corner, above right corner and left bottom corner of the CU which contains the block A are obtained. When block A is coded with the 4-parameter affine model, the two CPMVs of the current CU are calculated according to v_2 and v_3. When block A is coded with the 6-parameter affine model, the three CPMVs of the current CU are calculated according to v_2, v_3 and v_4. In order to identify whether the current CU is coded with AF_MERGE mode, an affine flag is signalled in the bitstream when at least one neighbour block is coded in affine mode.
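As a non-limiting illustration, the selection of the first affine-coded neighbouring block in the order A, B, C, D, E can be sketched as follows. The dictionary layout describing each neighbour and the function name `first_affine_neighbor` are hypothetical and used only for this example.

```python
def first_affine_neighbor(neighbors, order=("A", "B", "C", "D", "E")):
    """Return the label of the first available neighbour coded in affine mode.

    neighbors maps a label to a dict with an "affine" boolean, or to None
    when the neighbour is unavailable. Returns None when no neighbour
    qualifies, in which case no affine flag would need to be signalled.
    """
    for label in order:
        block = neighbors.get(label)
        if block is not None and block.get("affine", False):
            return label
    return None


# Example: A is translational, B is unavailable, C is affine-coded.
chosen = first_affine_neighbor({
    "A": {"affine": False},
    "B": None,
    "C": {"affine": True},
    "D": {"affine": True},
})
```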

AF_INTER mode can be applied to CUs with both width and height larger than or equal to 16. An affine flag at the CU level is signalled in the bitstream to indicate whether AF_INTER mode is used, and then another flag is signalled to indicate whether the 4-parameter affine model or the 6-parameter affine model is used. In this mode, the difference between the CPMVs of the current CU and their predictors (CPMVPs) is signalled in the bitstream. The CPMVP candidate list size is 2, and the list is generated by using the following three types of CPMV candidates in order:

1) CPMVs extrapolated from the CPMVs of the neighbour blocks

2) CPMVs constructed using the translational MVs of the neighbour blocks.

3) CPMVs generated by duplicating each of the AMVP candidates.

In an embodiment, new affine merge candidates are generated based on control-point motion vector (CPMV) offsets applied to the first affine merge candidate. If the first affine merge candidate uses the 4-parameter affine model, then 2 CPMVs for each new affine merge candidate are derived by offsetting the 2 CPMVs of the first affine merge candidate. Otherwise, if the 6-parameter affine model is used, 3 CPMVs for each new affine merge candidate are derived by offsetting the 3 CPMVs of the first affine merge candidate. In unidirectional prediction (one reference block), the CPMV offsets are applied to the CPMVs of the first candidate. In bidirectional prediction (two reference blocks) with List 0 and List 1 in the same direction, the CPMV offsets are applied to the first candidate as follows:

\[ MV_{new(L0),i} = MV_{old(L0)} + MV_{offset(i)} \tag{3a} \]

\[ MV_{new(L1),i} = MV_{old(L1)} + MV_{offset(i)} \tag{3b} \]

In bidirectional prediction with List 0 and List 1 in opposite directions (the reference frames being on opposite sides of the current block), the CPMV offsets are applied to the first candidate as follows:

\[ MV_{new(L0),i} = MV_{old(L0)} + MV_{offset(i)} \tag{4a} \]

\[ MV_{new(L1),i} = MV_{old(L1)} - MV_{offset(i)} \tag{4b} \]

In accordance with the present disclosure, various offset directions with various offset magnitudes are used to generate new affine merge candidates.

In an embodiment, when the decoder determines that the prediction mode is the unidirectional prediction mode, the new motion vector candidate is obtained by adding the motion vector offset to the old motion vector candidate.

In an embodiment, when the decoder determines that the prediction mode is the bidirectional prediction mode, and the first and second reference frames are on opposite sides of the current block, the new motion vector candidate is obtained by adding the motion vector offset to the old motion vector candidate of the first reference frame and subtracting the motion vector offset from the old motion vector candidate of the second reference frame.

In an embodiment, when the decoder determines that the prediction mode is the bidirectional prediction mode, and the first and second reference frames are on the same side of the current block, the new motion vector candidate is obtained by adding the motion vector offset to the old motion vector candidate of the first reference frame and adding the motion vector offset to the old motion vector candidate of the second reference frame.
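As a non-limiting illustration, the three cases above (unidirectional, bidirectional with reference frames on the same side, and bidirectional with reference frames on opposite sides) can be sketched as follows. The function name `offset_candidate`, the use of `None` to mark an unused reference list, and the tuple representation of motion vectors are assumptions of this example.

```python
def offset_candidate(cpmvs_l0, cpmvs_l1, offset, same_side=True):
    """Derive the CPMVs of a new candidate from an old candidate and an offset.

    cpmvs_l0 / cpmvs_l1 are lists of (mv_x, mv_y) CPMVs for List 0 and
    List 1, or None when that list is unused (unidirectional prediction).
    The offset is added to every CPMV, except that for bidirectional
    prediction with reference frames on opposite sides it is subtracted
    from the List 1 CPMVs.
    """
    def shift(mvs, sign):
        if mvs is None:
            return None
        return [(mv[0] + sign * offset[0], mv[1] + sign * offset[1])
                for mv in mvs]

    bidirectional = cpmvs_l0 is not None and cpmvs_l1 is not None
    sign_l1 = -1 if (bidirectional and not same_side) else 1
    return shift(cpmvs_l0, 1), shift(cpmvs_l1, sign_l1)


# Example: 4-parameter candidate (2 CPMVs per list), opposite-side references.
new_l0, new_l1 = offset_candidate([(1, 1), (2, 2)], [(-1, 0), (0, 3)],
                                  offset=(4, 0), same_side=False)
```

The same function handles a 6-parameter candidate by passing three CPMVs per list, since the offset is applied to each control point independently.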

The proposed techniques re-use affine merge candidates in the same way as the affine merge candidates used in VVC or in any modified solution based on VVC, for example, one that adds, removes or replaces a candidate. Among the merge candidates, a candidate can be selected and further expanded by the proposed techniques, which provide a new motion vector expression with simplified signaling. The expression process includes a starting point, a motion magnitude, and a motion direction.

FIG. 16 is a conceptual diagram illustrating an ultimate motion vector expression (UMVE) search process according to an embodiment. Referring to FIG. 16, a bidirectional prediction performs two predictions: the reference L0 prediction and the reference L1 prediction. The prediction results may be weighted (scaled) and/or offset. As shown in FIG. 16, the prediction results are then added together to generate an ultimate motion vector expression. It is noted that “+s”, “+2s”, “+3s” in FIG. 16 show exemplary different search points in UMVE.

FIG. 17 is a conceptual diagram illustrating the ultimate motion vector expression (UMVE) search points according to an embodiment. Referring to FIG. 17, the UMVE search points may include four motion vector predictions that are signaled by a base candidate index, as will be described in more detail below.

The proposed technique of the present disclosure uses an affine merge candidate list that may contain motion information associated with one or both reference picture lists (list 0 and list 1) and with the 4-parameter affine motion model and the 6-parameter affine motion model. The affine merge candidate list may be constructed based on up to four spatial merge candidates that are derived from five spatial neighboring blocks, one temporal merge candidate derived from two temporal collocated blocks, or additional merge candidates including combined bi-predictive candidates.

According to an embodiment, a base candidate index defines the starting point. Base candidate index (denoted Base candidate IDX) indicates the best candidate among candidates in the list as shown in Table 1 below:

Table 1

Base candidate IDX	0	1	2	3
N-th MVP	1st MVP	2nd MVP	3rd MVP	4th MVP

If the number of base candidates is equal to 1, the base candidate IDX is not signaled. For example, in the current embodiment, new affine merge candidates are generated based on the CPMV offsets of the first affine merge candidate; in this case the number of base candidates is 1, and no base candidate IDX is signaled.

In another embodiment, the best affine merge candidate is indicated by an index. In this case, the base candidate IDX is signaled to indicate which affine merge candidate is used as MV_old. It is understood that, although decimal values are shown in Table 1 for the base candidate index, binary values are typically used for signaling the base candidate index. In the example shown, four UMVE search points, serving as starting points, can be signaled by the base candidate index, as shown in Table 1 and FIG. 17.

Distance index is motion magnitude information. Distance index indicates the pre-defined distance from the starting point information. Table 2 shows an example distance index for a set of pre-defined (predetermined) distances:

Table 2

In an embodiment, the predetermined distance comprises a quarter pixel, half pixel, 1 pixel, 2 pixels, 4 pixels, 8 pixels, 16 pixels, and 32 pixels.

Direction index represents the direction of the MVD relative to the starting point. The direction index is also referred to as motion direction and can represent one of the four directions as shown in Table 3 below.

Table 3

Direction IDX	00	01	10	11
x-axis	+	–	N/A	N/A
y-axis	N/A	N/A	+	–
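As a non-limiting illustration, a motion vector offset can be derived from a distance index and a direction index as sketched below. The assumption that distance indices 0 to 7 map, in order, to the predetermined distances listed above (a quarter pixel up to 32 pixels) is made for this example only, as are the names `DISTANCES`, `DIRECTIONS` and `mv_offset`.

```python
from fractions import Fraction

# Predetermined distances in pixels, assumed to be indexed in the order listed.
DISTANCES = [Fraction(1, 4), Fraction(1, 2), Fraction(1), Fraction(2),
             Fraction(4), Fraction(8), Fraction(16), Fraction(32)]

# Direction index -> (sign on x-axis, sign on y-axis), following Table 3.
DIRECTIONS = {0b00: (1, 0), 0b01: (-1, 0), 0b10: (0, 1), 0b11: (0, -1)}


def mv_offset(distance_idx, direction_idx):
    """Return the (x, y) offset in pixels relative to the starting point."""
    distance = DISTANCES[distance_idx]
    sign_x, sign_y = DIRECTIONS[direction_idx]
    return (sign_x * distance, sign_y * distance)


# Example: the smallest step to the right and the largest step downward in index space.
right_quarter = mv_offset(0, 0b00)
negative_y_32 = mv_offset(7, 0b11)
```

With eight distances and four directions, each starting point gives 32 possible offsets, each of which is identified by a pair of small indices rather than by an explicitly coded motion vector difference.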

In an embodiment, the previous offset expression method is enabled/disabled by a flag, which is referred to as MMVD flag.

In an embodiment, MMVD flag is transmitted before any prediction mode flag. In this case, whether MMVD method is used to express offset for any prediction method which needs to code an offset is decided by the MMVD flag first.

In an embodiment, MMVD flag is transmitted after any prediction mode flag that needs to code an offset. In this case, the prediction mode is decided or determined first, then when an offset is needed to be coded, whether MMVD method is used to express offset for this prediction mode is decided (determined) , which means MMVD flag is signaled.

In some embodiments, information of the offset expression, including the starting point (base candidate index), the distance index, the direction index, and the MMVD flag, is transmitted in a bit stream and received by the decoder. The decoder parses the received bit stream to extract the information (MMVD flag, base candidate index, distance index, and direction index) that is used to define the offset.

In some embodiments, the above procedure can be used to process the offset expression of affine merge mode.

Table 4 shows an example coding unit syntax according to an embodiment.

Table 4

Embodiments of the present disclosure provide the proposed UMVE index syntax as shown in Table 5 below.

Table 5

umve_idx_coding (x0, y0) {	Descriptor
base_mv_idx [x0][y0]	ae (v)
distance_idx [x0] [y0]	ae (v)
direction_idx [x0] [y0]	ae (v)
}

Coding unit semantics

cu_skip_flag [x0] [y0] equal to 1 specifies that for the current coding unit, when decoding a P or B slice, no more syntax elements except the merging candidate index merge_idx [x0] [y0] , the umve flag umve_flag [x0] [y0] and the merge affine flag merge_affine_flag [x0] [y0] are parsed after cu_skip_flag [x0] [y0] . cu_skip_flag [x0] [y0] equal to 0 specifies that the coding unit is not skipped. The array indices x0, y0 specify the location (x0, y0) of the top-left luma sample of the considered coding block relative to the top-left luma sample of the picture.

When cu_skip_flag [x0] [y0] is not present, it is inferred to be equal to 0.

umve_flag [x0] [y0] equal to 1 specifies that for the current coding unit, when decoding a P or B slice, ultimate motion vector expression is used to generate the prediction MV of the current coding unit. umve_flag [x0] [y0] equal to 1 specifies that no more syntax elements except umve_idx_coding () are parsed after umve_flag [x0] [y0] . umve_flag [x0] [y0] equal to 0 specifies that the coding unit is not predicted by ultimate motion vector expression. When umve_flag [x0] [y0] is not present, it is inferred to be equal to 0.

base_mv_idx [x0] [y0] specifies the umve base MV candidate index of the umve merge candidate list where x0, y0 specify the location (x0, y0) of the top-left luma sample of the considered coding block relative to the top-left luma sample of the picture.

When base_mv_idx [x0] [y0] is not present, it is inferred to be equal to 0.

distance_idx [x0] [y0] specifies the distance index from the base MV for the current coding unit where x0, y0 specify the location (x0, y0) of the top-left luma sample of the considered coding block relative to the top-left luma sample of the picture.

When distance_idx [x0] [y0] is not present, it is inferred to be equal to 0.

Table 6 shows an example distance index –distance_idx [x0] [y0]

Table 6

distance_idx [x] [y]	binarization	Dist_baseMV
0	0	1
1	10	2
2	110	4
3	1110	8
4	11110	16
5	111110	32
6	1111110	64
7	1111111	128
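As a non-limiting illustration, the binarization column of Table 6 follows a truncated unary pattern with a maximum value of 7, and the Dist_baseMV column doubles with each index. The following sketch reproduces both columns; the function names are assumptions of this example, and the sketch produces bin strings only, without the arithmetic coding implied by the ae (v) descriptor.

```python
def binarize_distance_idx(idx, c_max=7):
    """Truncated unary bin string: idx ones, then a zero unless idx == c_max."""
    if not 0 <= idx <= c_max:
        raise ValueError("distance_idx out of range")
    return "1" * idx + ("0" if idx < c_max else "")


def parse_distance_idx(bins, c_max=7):
    """Inverse of binarize_distance_idx: count leading ones up to c_max."""
    idx = 0
    while idx < c_max and bins[idx] == "1":
        idx += 1
    return idx


def dist_base_mv(idx):
    """Dist_baseMV value for a distance index, as listed in Table 6."""
    return 1 << idx


# Example: reproduce the rows of Table 6 as (index, bin string, Dist_baseMV).
table6 = [(i, binarize_distance_idx(i), dist_base_mv(i)) for i in range(8)]
```

The last index needs no terminating zero because the parser stops after reading seven ones, which is why indices 6 and 7 have bin strings of equal length.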

direction_idx [x0] [y0] specifies the position index relative to the base MV for the current coding unit where x0, y0 specify the location (x0, y0) of the top-left luma sample of the considered coding block relative to the top-left luma sample of the picture.

Table 7 shows an example direction_idx [x0] [y0] .

Table 7

direction_idx [x] [y]	binarization	[sign_x0, sign_y0]
0	00	[+1, 0]
1	01	[-1, 0]
2	10	[0 , +1]
3	11	[0, -1]

According to the present disclosure, the expression process includes a starting point, a motion magnitude, and a motion direction. In one embodiment, a method and apparatus are provided for affine merge candidate construction. The method may include generating an affine merge candidate list based on motion vectors associated with a set of neighboring blocks of a current block, the affine merge candidate list comprising a plurality of new candidates and a plurality of old candidates, and obtaining a new candidate of a current block using an old candidate in the affine merge candidate list and an offset. The offset is expressed by a starting point, a motion magnitude, or a motion direction. The method may also include parsing offset information from a bit stream that contains an offset flag, a prediction mode flag, and other information such as a base candidate index, a distance index, and a direction index.

Derivation process for motion vector components and reference indices in inter prediction mode

Derivation process for luma motion vectors for merge mode

This process is only invoked when merge_flag [xCb] [yCb] is equal to 1, where (xCb, yCb) specify the top-left sample of the current luma coding block relative to the top-left luma sample of the current picture.

Inputs to this process are:

– a luma location (xCb, yCb) of the top-left sample of the current luma coding block relative to the top-left luma sample of the current picture,

– a variable cbWidth specifying the width of the current coding block in luma samples,

– a variable cbHeight specifying the height of the current coding block in luma samples.

Outputs of this process are:

– the number of luma coding subblocks in horizontal direction numSbX and in vertical direction numSbY,

– the luma motion vectors in 1/16 fractional-sample accuracy mvL0 [xSbIdx] [ySbIdx] and mvL1 [xSbIdx] [ySbIdx] with xSbIdx = 0.. numSbX -1, ySbIdx = 0 .. numSbY -1,

– the reference indices refIdxL0 and refIdxL1,

– the prediction list utilization flags predFlagL0 [xSbIdx] [ySbIdx] and predFlagL1 [xSbIdx] [ySbIdx] with xSbIdx = 0.. numSbX -1, ySbIdx = 0 .. numSbY -1.

The motion vectors mvL0 and mvL1, the reference indices refIdxL0 and refIdxL1 and the prediction utilization flags predFlagL0 and predFlagL1 are derived by the following ordered steps:

1. The derivation process for merging candidates from neighbouring coding units is invoked with the luma coding block location (xCb, yCb) , the luma coding block width cbWidth and the luma coding block height cbHeight as inputs, and the output being the availability flags availableFlagA ₀, availableFlagA ₁, availableFlagB ₀, availableFlagB ₁ and availableFlagB ₂, the reference indices refIdxLXA ₀, refIdxLXA ₁, refIdxLXB ₀, refIdxLXB ₁ and refIdxLXB ₂, the prediction list utilization flags predFlagLXA ₀, predFlagLXA ₁, predFlagLXB ₀, predFlagLXB ₁ and predFlagLXB ₂, and the motion vectors mvLXA ₀, mvLXA ₁, mvLXB ₀, mvLXB ₁ and mvLXB ₂, with X being 0 or 1.

2. The derivation process for subblock-based temporal merging candidates is invoked with the luma location (xCb, yCb) , the luma coding block width cbWidth, the luma coding block height cbHeight , the availability flags availableFlagA ₀, availableFlagA ₁, availableFlagB ₀, availableFlagB ₁, the reference indices refIdxLXA ₀, refIdxLXA ₁, refIdxLXB ₀, refIdxLXB ₁, the prediction list utilization flags predFlagLXA ₀, predFlagLXA ₁, predFlagLXB ₀, predFlagLXB ₁ and the motion vectors mvLXA ₀, mvLXA ₁, mvLXB ₀, mvLXB ₁ as inputs and the output being the availability flag availableFlagSbCol, the number of luma coding subblocks in horizontal direction numSbX and in vertical direction numSbY, the reference indices refIdxLXSbCol, the luma motion vectors mvLXSbCol [xSbIdx] [ySbIdx] and the prediction list utilization flags predFlagLXSbCol [xSbIdx] [ySbIdx] with xSbIdx = 0.. numSbX -1, ySbIdx = 0 .. numSbY -1 and X being 0 or 1.

3. The reference indices for the temporal merging candidate, refIdxLXCol, with X being 0 or 1, are set equal to 0.

4. The derivation process for temporal luma motion vector prediction is invoked with the luma location (xCb, yCb) , the luma coding block width cbWidth, the luma coding block height cbHeight and the variable refIdxL0Col as inputs, and the output being the availability flag availableFlagL0Col and the temporal motion vector mvL0Col. The variables availableFlagCol, predFlagL0Col and predFlagL1Col are derived as follows:

availableFlagCol = availableFlagL0Col (3)

predFlagL0Col = availableFlagL0Col (4)

predFlagL1Col = 0 (5)

5. When slice_type is equal to B slice, the derivation process for temporal luma motion vector prediction is invoked with the luma location (xCb, yCb) , the luma coding block width cbWidth, the luma coding block height cbHeight and the variable refIdxL1Col as inputs, and the output being the availability flag availableFlagL1Col and the temporal motion vector mvL1Col. The variables availableFlagCol and predFlagL1Col are derived as follows:

availableFlagCol = availableFlagL0Col || availableFlagL1Col (6)

predFlagL1Col = availableFlagL1Col (7)

6. The merging candidate list, mergeCandList, is constructed as follows:

i = 0
if (availableFlagA ₁)
    mergeCandList [i++] = A ₁
if (availableFlagB ₁)
    mergeCandList [i++] = B ₁
if (availableFlagB ₀)
    mergeCandList [i++] = B ₀ (8)
if (availableFlagA ₀)
    mergeCandList [i++] = A ₀
if (availableFlagSbCol)
    mergeCandList [i++] = SbCol (SbCol means a collocated subblock of a subblock)
if (availableFlagB ₂)
    mergeCandList [i++] = B ₂
if (availableFlagCol)
    mergeCandList [i++] = Col (Col means a collocated block of a block)

7. If umve_flag [x0] [y0] is equal to 1, the variables numCurrMergeCand and numOrigMergeCand are set equal to the smaller of 2 and the number of merging candidates in the mergeCandList.

8. Otherwise, the variables numCurrMergeCand and numOrigMergeCand are set equal to the number of merging candidates in the mergeCandList.

9. When slice_type is equal to B slice, the derivation process for combined bi-predictive merging candidates is invoked with mergeCandList, the reference indices refIdxL0N and refIdxL1N, the prediction list utilization flags predFlagL0N and predFlagL1N, the motion vectors mvL0N and mvL1N of every candidate N in mergeCandList, numCurrMergeCand and numOrigMergeCand as inputs, and the output is assigned to mergeCandList, numCurrMergeCand, the reference indices refIdxL0combCand _k and refIdxL1combCand _k, the prediction list utilization flags predFlagL0combCand _k and predFlagL1combCand _k and the motion vectors mvL0combCand _k and mvL1combCand _k of every new candidate combCand _k being added into mergeCandList. The number of candidates being added, numCombMergeCand, is set equal to (numCurrMergeCand -numOrigMergeCand) . When numCombMergeCand is greater than 0, k ranges from 0 to numCombMergeCand -1, inclusive.

10. The derivation process for zero motion vector merging candidates is invoked with the mergeCandList, the reference indices refIdxL0N and refIdxL1N, the prediction list utilization flags predFlagL0N and predFlagL1N, the motion vectors mvL0N and mvL1N of every candidate N in mergeCandList and numCurrMergeCand as inputs, and the output is assigned to mergeCandList, numCurrMergeCand, the reference indices refIdxL0zeroCand _m and refIdxL1zeroCand _m, the prediction list utilization flags predFlagL0zeroCand _m and predFlagL1zeroCand _m and the motion vectors mvL0zeroCand _m and mvL1zeroCand _m of every new candidate zeroCand _m being added into mergeCandList. The number of candidates being added, numZeroMergeCand, is set equal to (numCurrMergeCand -numOrigMergeCand -numCombMergeCand) . When numZeroMergeCand is greater than 0, m ranges from 0 to numZeroMergeCand -1, inclusive.

11. If umve_flag [x0] [y0] is equal to 1, merge_idx [xCb] [yCb] is set equal to base_mv_idx [xCb] [yCb] and the following process applies, with X being replaced by 0 or 1:

Inputs to this process are:

– a merging candidate list mergeCandList,

– Dist_baseMV derived in Table 2,

– sign_x0 and sign_y0 derived from Table 3.

Outputs of this process are:

– the refined MV offsets refine_MVLX [0] and refine_MVLX [1], with X being replaced by 0 or 1.

If both refIdxL0N and refIdxL1N are over 0 and (DiffPicOrderCnt (RefPicList0 [refIdxL0] , currPic) *DiffPicOrderCnt (currPic, RefPicList1 [refIdxL1] ) ) is positive,

refine_MVL0 [0] = Dist_baseMV *sign_x0;

refine_MVL0 [1] = Dist_baseMV *sign_y0;

refine_MVL1 [0] = -1 *Dist_baseMV *sign_x0;

refine_MVL1 [1] = -1 *Dist_baseMV *sign_y0;

If both refIdxL0N and refIdxL1N are over 0 and (DiffPicOrderCnt (RefPicList0 [refIdxL0] , currPic) *DiffPicOrderCnt (currPic, RefPicList1 [refIdxL1] ) ) is negative,

refine_MVL0 [0] = Dist_baseMV *sign_x0;

refine_MVL0 [1] = Dist_baseMV *sign_y0;

refine_MVL1 [0] = Dist_baseMV *sign_x0;

refine_MVL1 [1] = Dist_baseMV *sign_y0;

12. If umve_flag [x0] [y0] is equal to 0,

refine_MVL0 [0] = 0;

refine_MVL0 [1] = 0;

refine_MVL1 [0] = 0;

refine_MVL1 [1] = 0;

13. The following applies:

– If umve_flag [x0] [y0] equal to 0 and mergeCandList [merge_idx [xCb] [yCb] ] is equal to SbCol, the following assignments are made for xSbIdx = 0.. numSbX -1, ySbIdx = 0 .. numSbY -1

refIdxLX = refIdxLXSbCol (8-119)

predFlagLX [xSbIdx] [ySbIdx] = predFlagLXSbCol [xSbIdx] [ySbIdx] (8-120)

mvLX [xSbIdx] [ySbIdx] [0] = mvLXSbCol [xSbIdx] [ySbIdx] [0] (8-121)

mvLX [xSbIdx] [ySbIdx] [1] = mvLXSbCol [xSbIdx] [ySbIdx] [1] (8-122)

– Otherwise, the following assignments are made with N being the candidate at position merge_idx [xCb] [yCb] in the merging candidate list mergeCandList (N = mergeCandList [merge_idx [xCb ] [yCb] ] ) and X being replaced by 0 or 1:

numSbX = 1 (8-123)

numSbY = 1 (8-124)

refIdxLX = refIdxLXN (8-125)

predFlagLX [0] [0] = predFlagLXN

mvLX [0] [0] [0] = mvLXN [0] + refine_MVLX [0]

mvLX [0] [0] [1] = mvLXN [1] + refine_MVLX [1]
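
For illustration only, steps 6 to 8 and 11 to 13 above can be sketched in Python. The function names (`build_merge_cand_list`, `refine_offsets`, `apply_refinement`) and the use of plain integer picture order counts are assumptions made for this example and do not appear in the disclosure. The same-side branch adds the offset to both reference lists, as described in the claims; a POC product of zero, which the text does not cover, is treated like the same-side case here.

```python
from typing import Dict, List, Tuple

# Insertion order of step 6; the names mirror the candidate labels in the text.
CANDIDATE_ORDER = ["A1", "B1", "B0", "A0", "SbCol", "B2", "Col"]

MV = Tuple[int, int]  # (horizontal, vertical) component pair


def build_merge_cand_list(available: Dict[str, bool], umve_flag: int) -> Tuple[List[str], int]:
    """Steps 6-8: collect available candidates in order and derive numCurrMergeCand."""
    merge_cand_list = [name for name in CANDIDATE_ORDER if available.get(name, False)]
    if umve_flag == 1:
        # Step 7: at most two base candidates are counted when UMVE is used.
        num_curr = min(2, len(merge_cand_list))
    else:
        # Step 8: every collected candidate is counted.
        num_curr = len(merge_cand_list)
    return merge_cand_list, num_curr


def refine_offsets(umve_flag: int, dist_base_mv: int, sign_x0: int, sign_y0: int,
                   poc_cur: int, poc_ref0: int, poc_ref1: int) -> Tuple[MV, MV]:
    """Steps 11-12: return (refine_MVL0, refine_MVL1)."""
    if umve_flag == 0:
        return (0, 0), (0, 0)
    refine_l0 = (dist_base_mv * sign_x0, dist_base_mv * sign_y0)
    # DiffPicOrderCnt(a, b) is POC(a) - POC(b).
    product = (poc_ref0 - poc_cur) * (poc_cur - poc_ref1)
    if product > 0:
        # References on opposite sides of the current picture: mirror the offset.
        refine_l1 = (-refine_l0[0], -refine_l0[1])
    else:
        # Same side (assumption, following the claims): same offset for both lists.
        refine_l1 = refine_l0
    return refine_l0, refine_l1


def apply_refinement(mv: MV, offset: MV) -> MV:
    """Step 13 (non-SbCol case): mvLX = mvLXN + refine_MVLX, per component."""
    return (mv[0] + offset[0], mv[1] + offset[1])
```

With a reference at POC 4, the current picture at POC 8 and a second reference at POC 12, the product is positive, so an L0 offset of (4, 0) is paired with an L1 offset of (-4, 0).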

Derivation process for subblock-based temporal merging candidates

Inputs to this process are:

– the availability flags availableFlagA ₀, availableFlagA ₁, availableFlagB ₀, and availableFlagB ₁ of the neighbouring coding units,

– the reference indices refIdxLXA ₀, refIdxLXA ₁, refIdxLXB ₀, and refIdxLXB ₁ of the neighbouring coding units,

– the prediction list utilization flags predFlagLXA ₀, predFlagLXA ₁, predFlagLXB ₀, and predFlagLXB ₁ of the neighbouring coding units,

– the motion vectors in 1/16 fractional-sample accuracy mvLXA ₀, mvLXA ₁, mvLXB ₀, and mvLXB ₁ of the neighbouring coding units.

Outputs of this process are:

– the availability flag availableFlagSbCol,

– the reference indices refIdxL0SbCol and refIdxL1SbCol,

– the luma motion vectors in 1/16 fractional-sample accuracy mvL0SbCol [xSbIdx] [ySbIdx] and mvL1SbCol [xSbIdx] [ySbIdx] with xSbIdx = 0.. numSbX -1, ySbIdx = 0 .. numSbY -1,

– the prediction list utilization flags predFlagL0SbCol [xSbIdx] [ySbIdx] and predFlagL1SbCol [xSbIdx] [ySbIdx] with xSbIdx = 0.. numSbX -1, ySbIdx = 0 .. numSbY -1.

The availability flag availableFlagSbCol is derived as follows.

– If either slice_temporal_mvp_enable_flag or sps_sbtmvp_flag is equal to 0, availableFlagSbCol is set equal to 0.

– Otherwise, if umve_flag [x0] [y0] is equal to 0, availableFlagSbCol is set equal to 0.

– Otherwise, the following ordered steps apply.

1. The location (xCtb, yCtb) of the top-left sample of the luma coding tree block that contains the current coding block and the location (xCtr, yCtr) of the below-right center sample of the current luma coding block are derived as follows:

Table 8 and Table 9 each show example syntax for this process.

Table 8

Or

Table 9

Table 10 shows a UMVE index syntax for the case that MV_old is decided by an index.

Table 10

umve_idx_coding (x0, y0) {	Descriptor
base_mv_idx [x0] [y0]	ae (v)
distance_idx [x0] [y0]	ae (v)
direction_idx [x0] [y0]	ae (v)
}

For the case that MV_old is decided without an index, the syntax is:

umve_idx_coding (x0, y0) {	Descriptor
distance_idx [x0] [y0]	ae (v)
direction_idx [x0] [y0]	ae (v)
}

The definitions of the syntax elements are similar to those given above.
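
A decoder that has parsed the syntax elements of Table 10 could map them to an offset roughly as follows. This is a minimal Python sketch under stated assumptions: the distance values follow the quarter-pixel to 32-pixel range listed in the claims, expressed here in quarter-pixel units, while the direction table (indices 0 to 3 mapped to +x, -x, +y, -y), the fallback to candidate 0 when no base index is signaled, and the function names are all hypothetical. Table 2 and Table 3 remain the authoritative source for the actual values.

```python
from typing import Optional, Tuple

# Assumed distance table in quarter-pixel units:
# 1/4, 1/2, 1, 2, 4, 8, 16 and 32 pixels.
DISTANCE_QUARTER_PEL = [1, 2, 4, 8, 16, 32, 64, 128]

# Assumed direction table: direction_idx -> (sign_x0, sign_y0).
DIRECTION_SIGNS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def umve_offset(distance_idx: int, direction_idx: int) -> Tuple[int, int]:
    """Map distance_idx and direction_idx to an (x, y) offset in quarter-pixel units."""
    dist_base_mv = DISTANCE_QUARTER_PEL[distance_idx]
    sign_x0, sign_y0 = DIRECTION_SIGNS[direction_idx]
    return (dist_base_mv * sign_x0, dist_base_mv * sign_y0)


def select_base_index(num_candidates: int, base_mv_idx: Optional[int] = None) -> int:
    """Choose the base candidate (MV_old) position in the candidate list.

    When base_mv_idx is not signaled (the syntax variant without an index),
    this sketch falls back to candidate 0, which is an assumption.
    """
    if base_mv_idx is None:
        return 0
    if not 0 <= base_mv_idx < num_candidates:
        raise ValueError("base_mv_idx out of range")
    return base_mv_idx
```

For example, `umve_offset(3, 3)` selects a distance of 2 pixels (8 quarter-pixel units) in the negative vertical direction under the assumed tables.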

In the present disclosure, a first component is directly coupled to a second component when there are no intervening components, except for a line, a trace, or another medium between the first component and the second component. The first component is indirectly coupled to the second component when there are intervening components other than a line, a trace, or another medium between the first component and the second component. The term “coupled” and its variants include both directly coupled and indirectly coupled. The use of the term “about” means a range including ±10% of the subsequent number unless otherwise stated.

While several embodiments have been provided in the present disclosure, it may be understood that the disclosed systems and methods might be embodied in many other specific forms without departing from the spirit or scope of the present disclosure. The present examples are to be considered as illustrative and not restrictive, and the intention is not to be limited to the details given herein. For example, the various elements or components may be combined or integrated in another system or certain features may be omitted, or not implemented.

In addition, techniques, systems, subsystems, and methods described and illustrated in the various embodiments as discrete or separate may be combined or integrated with other systems, components, techniques, or methods without departing from the scope of the present disclosure. Other examples of changes, substitutions, and alterations are ascertainable by one skilled in the art and may be made without departing from the spirit and scope disclosed herein.

Claims

A method of an affine merge candidate construction performed by a video codec (a video encoder and/or a video decoder), the method comprising:

generating an affine merge candidate list based on motion vectors (MVs) associated with a set of neighboring blocks of a current block, the affine merge candidate list comprising a plurality of old candidates; and

obtaining a new candidate of a current block using an old candidate in the affine merge candidate list and an offset, wherein the offset is expressed by a starting point, a motion magnitude, or a motion direction, wherein the new candidate is added to the affine merge candidate list.
The method of claim 1, further comprising:

parsing offset information in a bit stream; and

decoding the offset information to obtain an offset flag configured to enable the offset.
The method of claim 2, further comprising:

parsing a prediction mode flag in the bit stream; and

decoding the prediction mode flag to determine a prediction mode comprising a unidirectional prediction mode or a bidirectional prediction mode.
The method of claim 3, wherein, when the prediction mode is determined to be the unidirectional prediction mode, and the offset comprises a horizontal offset component and a vertical offset component, obtaining the new candidate, comprising:

adding the horizontal offset component to a horizontal component of the old candidate to obtain a horizontal component of the new candidate; and

adding the vertical offset component to a vertical component of the old candidate to obtain a vertical component of the new candidate.
The method of claim 3, wherein, when the prediction mode is determined to be the bidirectional prediction mode, the method further comprising:

determining whether a first reference frame and a second reference frame are arranged on opposite sides of a frame which the current block locates in;

when the first reference frame and the second reference frame are arranged on opposite sides of the frame, the old candidate comprises an old candidate of the first reference frame and an old candidate of the second reference frame, obtaining the new candidate, comprises:

adding the offset to the old candidate of the first reference frame; and

subtracting the offset from the old candidate of the second reference frame.
The method of claim 5, wherein: adding the offset to the old candidate of the first reference frame comprises:

adding a first horizontal offset value of the offset to a horizontal component of the old candidate of the first reference frame; and

adding a first vertical offset value of the offset to a vertical component of the old candidate of the first reference frame; and wherein subtracting the offset from the old candidate of the second reference frame comprises:

subtracting the first horizontal offset value of the offset from a horizontal component of the old candidate of the second reference frame; and

subtracting the first vertical offset value of the offset from a vertical component of the old candidate of the second reference frame.
The method of claim 6, further comprising:

separately scaling the first horizontal offset value and the first vertical offset value.
The method of claim 4, wherein, when the prediction mode is determined to be the bidirectional prediction mode, the method further comprising:

determining whether a first reference frame and a second reference frame are arranged on a same side of the frame which the current block locates in;

when the first reference frame and the second reference frame are arranged on the same side of the frame, the old candidate comprises an old candidate of the first reference frame and an old candidate of the second reference frame, obtaining the new candidate, comprises:

adding the offset to the old candidate of the first reference frame; and

adding the offset to the old candidate of the second reference frame.
The method of claim 8, wherein adding the offset to the old candidate of the first reference frame comprises:

adding a first horizontal offset value of the offset to a horizontal component of the old candidate of the first reference frame; and

adding a first vertical offset value of the offset to a vertical component of the old candidate of the first reference frame; and wherein adding the offset to the old candidate of the second reference frame comprises:

adding the first horizontal offset value of the offset to a horizontal component of the old candidate of the second reference frame; and

adding the first vertical offset value of the offset to a vertical component of the old candidate of the second reference frame.
The method of claim 9, further comprising:

separately scaling the first horizontal offset value and the first vertical offset value.
The method of claim 1, wherein the starting point is characterized by a base candidate index configured to indicate which candidate in the affine merge candidate list is to be used as the old candidate.
The method of claim 1, wherein the motion magnitude is characterized by a distance index configured to indicate a predetermined distance from the starting point.
The method of claim 12, wherein the predetermined distance comprises a quarter pixel, half pixel, 1 pixel, 2 pixels, 4 pixels, 8 pixels, 16 pixels, and 32 pixels.
The method of claim 1, wherein the motion direction is characterized by a direction of a motion vector difference relative to the starting point.
The method of claim 1, further comprising:

receiving a bit stream comprising offset information; and

parsing the offset information from the bit stream to retrieve a prediction type flag and an offset flag, wherein the prediction type flag is decoded prior to decoding the offset flag.
The method of claim 1, further comprising:

receiving a bit stream comprising offset information; and

parsing the offset information from the bit stream to retrieve a prediction type flag and an offset flag, wherein the offset flag is decoded prior to decoding the prediction type flag.
The method of claim 1, further comprising:

receiving a bit stream comprising offset information; and

parsing the offset information to retrieve a base candidate index configured to indicate a base candidate among a number of candidates in the affine merge candidate list.
The method of claim 17, wherein, when the quantity of candidates is equal to one, the base candidate index is not signaled.
A decoder comprising a processor and a memory device configured to store data and program instructions executable by the processor, wherein the program instructions when executed by the processor cause the processor to:

generate an affine merge candidate list based on motion vectors (MVs) associated with a set of neighboring blocks of a current block, the affine merge candidate list comprising a plurality of old candidates; and

obtain a new candidate of a current block using an old candidate in the affine merge candidate list and an offset, wherein the offset is expressed by a starting point, a motion magnitude, or a motion direction, wherein the new candidate is added to the affine merge candidate list.
The decoder of claim 19, wherein the starting point is characterized by a base candidate index configured to indicate which candidate in the affine merge candidate list is to be used as the old candidate, the motion magnitude is characterized by a distance index configured to indicate a predetermined distance from the starting point, and the motion direction is characterized by a direction index configured to indicate a motion vector difference relative to the starting point.