CN117837143A - Adaptive bilateral matching for decoder-side motion vector refinement

Info

Publication number: CN117837143A
Application number: CN202280044692.2A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: motion vector, block, refined, video, difference
Legal status: Pending
Inventors: 黄晗, V·谢廖金, 钱威俊, Z·张, C-C·陈, M·卡切夫维茨
Current Assignee: Qualcomm Inc
Original Assignee: Qualcomm Inc
Application filed by Qualcomm Inc
Priority claimed from US17/847,942 (US11895302B2)
Priority claimed from PCT/US2022/073155 (WO2023278964A1)
Publication of CN117837143A

Landscapes

  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Systems and techniques for processing video data are provided. For example, the systems and techniques may include obtaining a current picture of video data and obtaining a reference picture for the current picture from the video data. A merge mode candidate may be determined for the current picture. First and second motion vectors may be identified for the merge mode candidate. A motion vector search strategy may be selected for the merge mode candidate from a plurality of motion vector search strategies. The selected motion vector search strategy may be associated with one or more constraints corresponding to at least one of the first motion vector or the second motion vector. The selected motion vector search strategy may be used to determine a refined motion vector based on the first motion vector, the second motion vector, and the reference picture. The merge mode candidate may be processed using the refined motion vector.

Description

Adaptive bilateral matching for decoder-side motion vector refinement
Technical Field
The present disclosure relates generally to video encoding and decoding. For example, aspects of the present disclosure include improving video coding techniques related to decoder-side motion vector refinement (DMVR) using bilateral matching.
Background
Digital video capabilities can be incorporated into a wide variety of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, e-book readers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones (so-called "smartphones"), video teleconferencing devices, video streaming devices, and the like. Such devices allow video data to be processed and output for consumption. Digital video data includes large amounts of data to meet the demands of consumers and video providers. For example, consumers of video data desire video of the utmost quality, with high fidelity, resolution, frame rates, and the like. As a result, the large amount of video data that is required to meet these demands places a burden on the communication networks and devices that process and store the video data.
Digital video devices may implement video coding techniques to compress video data. Video coding is performed according to one or more video coding standards or formats. For example, video coding standards or formats include Versatile Video Coding (VVC), High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), MPEG-2 Part 2 coding (MPEG stands for Moving Picture Experts Group), and the like, as well as proprietary video codecs/formats such as AOMedia Video 1 (AV1) developed by the Alliance for Open Media. Video coding typically utilizes prediction methods (e.g., inter-prediction, intra-prediction, etc.) that exploit redundancy present in a video image or sequence. One goal of video coding techniques is to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality. As ever-evolving video services become available, coding techniques with better coding efficiency are needed.
Disclosure of Invention
In some examples, systems and techniques for decoder-side motion vector refinement (DMVR) using adaptive bilateral matching are described. According to at least one illustrative example, an apparatus for processing video data is provided that includes at least one memory (e.g., configured to store data, such as video data) and at least one processor (e.g., implemented in circuitry) coupled to the at least one memory. The at least one processor is configured to: obtain one or more reference pictures for a current picture; identify a first motion vector and a second motion vector for a merge mode candidate; determine a selected motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies; determine one or more refined motion vectors based on the one or more reference pictures and at least one of the first motion vector or the second motion vector using the selected motion vector search strategy; and process the merge mode candidate using the one or more refined motion vectors.
In another example, a method for processing video data is provided. The method includes: obtaining one or more reference pictures for a current picture; identifying a first motion vector and a second motion vector for a merge mode candidate; determining a selected motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies; determining one or more refined motion vectors based on the one or more reference pictures and at least one of the first motion vector or the second motion vector using the selected motion vector search strategy; and processing the merge mode candidate using the one or more refined motion vectors.
In another example, a non-transitory computer-readable medium is provided having instructions stored thereon that, when executed by one or more processors, cause the one or more processors to: obtain one or more reference pictures for a current picture; identify a first motion vector and a second motion vector for a merge mode candidate; determine a selected motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies; determine one or more refined motion vectors based on the one or more reference pictures and at least one of the first motion vector or the second motion vector using the selected motion vector search strategy; and process the merge mode candidate using the one or more refined motion vectors.
In another example, an apparatus for processing video data is provided. The apparatus includes: means for obtaining one or more reference pictures for a current picture; means for identifying a first motion vector and a second motion vector for a merge mode candidate; means for determining a selected motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies; means for determining one or more refined motion vectors based on the one or more reference pictures and at least one of the first motion vector or the second motion vector using the selected motion vector search strategy; and means for processing the merge mode candidate using the one or more refined motion vectors.
This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to appropriate portions of the entire specification of this patent, any or all of the accompanying drawings, and each claim.
The foregoing and other features and aspects will become more fully apparent upon reference to the following description, claims and accompanying drawings.
Drawings
Illustrative aspects of the present application are described in detail below with reference to the following drawings:
fig. 1 is a block diagram illustrating examples of encoding and decoding devices according to some examples of the present disclosure;
fig. 2A is a conceptual diagram illustrating example spatial neighboring motion vector candidates for merge mode according to some examples of the present disclosure;
fig. 2B is a conceptual diagram illustrating example spatial neighboring motion vector candidates for Advanced Motion Vector Prediction (AMVP) mode according to some examples of the present disclosure;
fig. 3A is a conceptual diagram illustrating example Temporal Motion Vector Predictor (TMVP) candidates according to some examples of the present disclosure;
FIG. 3B is a conceptual diagram of an example of motion vector scaling according to some examples of the present disclosure;
fig. 4A is a conceptual diagram illustrating an example of neighboring samples of a current coding unit for estimating motion compensation parameters for the current coding unit according to some examples of the present disclosure;
fig. 4B is a conceptual diagram illustrating examples of neighboring samples of a reference block for estimating motion compensation parameters for a current coding unit according to some examples of the present disclosure;
FIG. 5 illustrates locations of spatial merge candidates for processing blocks according to some examples of the present disclosure;
FIG. 6 illustrates aspects of motion vector scaling for processing temporal merging candidates of blocks according to some examples of the present disclosure;
FIG. 7 illustrates aspects of temporal merging candidates for processing blocks according to some examples of the present disclosure;
FIG. 8 illustrates aspects of bilateral matching according to some examples of the present disclosure;
FIG. 9 illustrates aspects of bidirectional optical flow (BDOF) according to some examples of the present disclosure;
FIG. 10 illustrates a search area according to some examples of the present disclosure;
FIG. 11 is a flow chart illustrating an example process for decoder-side motion vector refinement with adaptive bilateral matching according to some examples of the present disclosure;
fig. 12 is a block diagram illustrating an example video encoding device according to some examples of the present disclosure; and
fig. 13 is a block diagram illustrating an example video decoding device according to some examples of the present disclosure.
Detailed Description
Certain aspects of the disclosure are provided below. As will be apparent to those skilled in the art, some of these aspects may be applied independently, and some of them may be applied in combination. In the following description, for purposes of explanation, specific details are set forth in order to provide a thorough understanding of aspects of the present application. It will be apparent, however, that various aspects may be practiced without these specific details. The drawings and description are not intended to be limiting.
The following description provides exemplary aspects only and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of these exemplary aspects will provide those skilled in the art with an enabling description for implementing an exemplary aspect. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the application as set forth in the appended claims.
Video coding devices (e.g., encoding devices, decoding devices, or combined encoding and decoding devices) implement video compression techniques to efficiently encode and/or decode video data. Video compression techniques may include applying different prediction modes, including spatial prediction (e.g., intra-frame prediction or intra-prediction), temporal prediction (e.g., inter-frame prediction or inter-prediction), inter-layer prediction (across different layers of video data), and/or other prediction techniques to reduce or eliminate redundancy inherent in video sequences. A video encoder may partition each picture of the original video sequence into rectangular regions, referred to as video blocks or coding units (described in more detail below). These video blocks may be encoded using a particular prediction mode.
Video blocks may be divided into one or more groups of smaller blocks in one or more ways. The block may comprise a coding tree block, a prediction block, a transform block, or other suitable block. Unless otherwise specified, references to "blocks" in general may refer to video blocks (e.g., coding tree blocks, coding blocks, prediction blocks, transform blocks, or other suitable blocks or sub-blocks, as will be appreciated by those of ordinary skill in the art). Further, each of these blocks may also be interchangeably referred to herein as a "unit" (e.g., a Coding Tree Unit (CTU), a coding unit, a Prediction Unit (PU), a Transform Unit (TU), etc.). In some cases, a unit may indicate a coding logic unit encoded in a bitstream, while a block may indicate a portion of a video frame buffer for which a process is intended.
For inter prediction modes, a video encoder may search, in a frame (or picture) located at another temporal position (referred to as a reference frame or reference picture), for a block similar to the block being encoded. The video encoder may limit the search to a certain spatial displacement from the block to be encoded. A two-dimensional (2D) motion vector comprising a horizontal displacement component and a vertical displacement component may be used to locate the best match. For intra prediction modes, a video encoder may use spatial prediction techniques to form a prediction block based on data from previously encoded neighboring blocks within the same picture.
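To make the inter-prediction search above concrete, the following is a minimal sketch of an integer-pel, brute-force block-matching search minimizing a SAD cost; the function name, the search range, and the exhaustive scan are illustrative assumptions rather than the method of any particular codec or of this disclosure.

```python
# Illustrative sketch only: exhaustive integer-pel motion search minimizing SAD
# within a limited spatial displacement, as described in the text above.
import numpy as np

def motion_search(cur_block, ref_frame, block_pos, search_range=8):
    """Return the (dx, dy) displacement with the lowest SAD cost (hypothetical helper)."""
    y0, x0 = block_pos
    bh, bw = cur_block.shape
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = y0 + dy, x0 + dx
            # Skip candidate blocks that fall outside the reference frame.
            if y < 0 or x < 0 or y + bh > ref_frame.shape[0] or x + bw > ref_frame.shape[1]:
                continue
            cand = ref_frame[y:y + bh, x:x + bw].astype(np.int32)
            cost = np.abs(cur_block.astype(np.int32) - cand).sum()
            if cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```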
The video encoder may determine a prediction error. For example, the prediction error may be determined as the difference between the pixel values of the block being encoded and those of the prediction block. The prediction error may also be referred to as a residual. The video encoder may also apply a transform to the prediction error (e.g., a discrete cosine transform (DCT) or other suitable transform) to generate transform coefficients. After transformation, the video encoder may quantize the transform coefficients. The quantized transform coefficients and motion vectors may be represented using syntax elements and, together with control information, form a coded representation of the video sequence. In some cases, the video encoder may entropy code the syntax elements, further reducing the number of bits required for their representation.
The video decoder may construct prediction data (e.g., a prediction block) for decoding the current frame using the syntax elements and control information discussed above. For example, the video decoder may add the prediction block and the compressed prediction error. The video decoder may determine the compressed prediction error by weighting the transform basis function using the quantized coefficients. The difference between the reconstructed frame and the original frame is called reconstruction error.
Systems, apparatuses, processes (also referred to as methods), and computer-readable media (collectively referred to herein as "systems and techniques") are described herein for improving the accuracy of one or more motion vectors that may be used by a video coding device (e.g., a video decoder or decoding device) in performing prediction techniques (e.g., inter-prediction modes). For example, the systems and techniques may perform bilateral matching for decoder-side motion vector refinement (DMVR). Bilateral matching is a technique that refines a pair of initial motion vectors. Such refinement may be achieved by searching around the initial motion vector pair to obtain updated motion vectors that minimize a block matching cost. The block matching cost may be generated in a variety of ways, including using a sum of absolute differences (SAD) criterion, a sum of absolute transformed differences (SATD) criterion, a sum of squared error (SSE) criterion, or other such criteria. Aspects described herein may improve the accuracy of motion vectors of bi-predictive merge candidates, resulting in improved video quality and associated improved performance of devices operating in accordance with aspects described herein.
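As a hedged illustration of the block matching cost mentioned above, the sketch below compares the two motion-compensated blocks pointed to by an initial motion vector pair using the SAD criterion; the helper names, the integer-pel fetch, and the omission of SATD/SSE variants are simplifying assumptions, not the disclosure's DMVR cost function.

```python
# Minimal sketch of a bilateral matching cost between two prediction blocks.
import numpy as np

def fetch_block(ref, pos, mv, size):
    """Fetch an integer-pel block at pos displaced by mv = (dx, dy) (hypothetical helper)."""
    y, x = pos[0] + mv[1], pos[1] + mv[0]
    return ref[y:y + size[0], x:x + size[1]].astype(np.int32)

def bilateral_cost(ref0, ref1, pos, mv0, mv1, size):
    """SAD between the block predicted from reference 0 and the block predicted from reference 1."""
    p0 = fetch_block(ref0, pos, mv0, size)
    p1 = fetch_block(ref1, pos, mv1, size)
    return np.abs(p0 - p1).sum()  # SAD; SATD or SSE criteria could be substituted
```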
In some aspects, systems and techniques may be used to perform adaptive bilateral matching for DMVR. For example, systems and techniques may perform bilateral matching using different search policies and/or search parameters for different coded blocks. As will be explained in greater depth below, the adaptive bilateral matching for DMVR may be based on a selected search strategy that is determined or signaled for a given block. The selected search policy may include one or more constraints for the bilateral matching search process. In some examples, the selected search strategy may additionally or alternatively include one or more constraints for the first motion vector difference and/or the second motion difference. In some examples, the selected search strategy may include one or more constraints between the first motion vector difference and the second motion vector difference.
In some aspects, constraints are selected for the motion vectors to be refined. The constraint may be a mirrored constraint, a zero constraint for a first motion vector, a zero constraint for a second motion vector, or another type of constraint. In some cases, the constraint is applied to merge mode coded blocks with merge candidates that satisfy one or more DMVR conditions. One or more constraints may then be used with one or more search strategies to identify candidates and select refined motion vectors.
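The sketch below shows, under stated assumptions, how the constraints named above could shape candidate refinements: under a mirrored constraint the second motion vector difference is the negation of the first, and under a zero constraint one of the two motion vectors is left unrefined. The constraint names and function signature are hypothetical.

```python
# Hedged illustration of applying a constraint to a candidate refinement delta.
def apply_constraint(mv0, mv1, delta, constraint):
    dx, dy = delta
    if constraint == "mirrored":       # second MV difference mirrors the first
        return (mv0[0] + dx, mv0[1] + dy), (mv1[0] - dx, mv1[1] - dy)
    if constraint == "zero_first":     # first motion vector is not refined
        return mv0, (mv1[0] + dx, mv1[1] + dy)
    if constraint == "zero_second":    # second motion vector is not refined
        return (mv0[0] + dx, mv0[1] + dy), mv1
    raise ValueError(f"unknown constraint: {constraint}")
```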
In some aspects, different search strategies are used. The search strategies may be grouped into subsets, where each subset includes one or more search strategies. In some cases, the decoder may use syntax elements to determine the selected subset. For example, the encoder may include syntax elements in the bitstream. In such examples, a decoder may receive the bitstream and decode the syntax elements from the bitstream. The decoder may use the syntax elements to determine the selected subset and any associated constraints for one or more given blocks of video data included in the bitstream. Using the selected subset and any constraints associated with the subset, the decoder may process the motion vectors (e.g., two motion vectors of a bi-predictive merge candidate) to identify a refined motion vector. In one illustrative aspect, an adaptive bilateral mode is provided in which a coding device signals selected motion information candidates that satisfy an associated DMVR condition (e.g., in which a signaling structure is included as part of a new adaptive bilateral mode).
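For illustration only, the following sketch maps a hypothetical decoded syntax element value to a subset of search strategies and associated constraints; the actual syntax, subset grouping, and constraint names are not specified in this passage.

```python
# Hypothetical decoder-side lookup from a syntax element value to a strategy subset.
STRATEGY_SUBSETS = {
    0: [("bilateral_refinement", "mirrored")],
    1: [("refine_list0_only", "zero_second")],
    2: [("refine_list1_only", "zero_first")],
}

def select_strategies(syntax_element_value):
    """Return the (strategy, constraint) pairs signaled for a block (illustrative)."""
    return STRATEGY_SUBSETS[syntax_element_value]
```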
The use of the above-described search strategies and associated constraints may provide improvements to decoder-side motion vector refinement, for example, by using alternative search algorithms and associated constraints to provide adaptive bilateral motion vector refinement. Such improvements in decoder-side motion vector refinement may be used with various video codecs, such as Enhanced Compression Model (ECM) implementations. Examples described herein include implementations applied to multi-pass DMVR to improve ECM systems operating in accordance with one or more video coding standards. The techniques described herein may be implemented using one or more decoding devices, including one or more encoding devices, decoding devices, or a combination encoding and decoding device. The decoding device may be implemented by one or more of the following: a player device (such as a mobile device), an augmented reality (XR) device, a vehicle or computing system of a vehicle, a server device or system (e.g., a distributed server system comprising multiple servers, a single server device or system, etc.), or other device or system.
The systems and techniques described herein may be applied to any existing video codec, any developing video codec, and/or any future video coding standard, including, but not limited to, high Efficiency Video Coding (HEVC), advanced Video Coding (AVC), versatile Video Coding (VVC), VP9, AOMedia video 1 (AV 1) format/codec, and/or other video coding standards that are existing, developing, or will develop. The systems and techniques described herein may improve video data transmission performance of devices by utilizing improved compression and associated improved video quality based on improved motion vector selection from adaptive bilateral matching described herein, thereby improving operation of communication systems and devices in the system.
Fig. 1 is a block diagram illustrating an example of a system 100 including an encoding device 104 and a decoding device 112. The encoding device 104 may be part of a source device and the decoding device 112 may be part of a receiving device. The source device and/or the receiving device may include an electronic device, such as a mobile or landline phone handset (e.g., a smart phone, a cellular phone, etc.), a desktop computer, a laptop or notebook computer, a tablet computer, a set-top box, a television, a camera, a display device, a digital media player, a video game console, a video streaming device, an Internet Protocol (IP) camera, or any other suitable electronic device. In some examples, the source device and the receiving device may include one or more wireless transceivers for wireless communications. The encoding techniques described herein are applicable to video encoding in a variety of multimedia applications, including streaming video transmission (e.g., over the internet), television broadcasting or transmission, encoding of digital video for storage on a data storage medium, decoding of digital video stored on a data storage medium, or other applications. As used herein, the term coding (encoding) may refer to encoding (encoding) and/or decoding (decoding). In some examples, system 100 may support unidirectional or bidirectional video transmission to support applications such as video conferencing, video streaming, video playback, video broadcasting, gaming, and/or video telephony.
The encoding device 104 (or encoder) may be used to encode video data using a video coding standard, format, codec, or protocol to generate an encoded video bitstream. Examples of video coding standards and formats/codecs include ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions, High Efficiency Video Coding (HEVC) or ITU-T H.265, and Versatile Video Coding (VVC) or ITU-T H.266. There are various extensions to HEVC that deal with multi-layer video coding, including the range and screen content coding extensions, 3D video coding (3D-HEVC), the multiview extension (MV-HEVC), and the scalable extension (SHVC). HEVC and its extensions have been developed by the Joint Collaboration Team on Video Coding (JCT-VC) and the Joint Collaboration Team on 3D Video Coding Extension Development (JCT-3V) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). VP9, AOMedia Video 1 (AV1) developed by the Alliance for Open Media (AOMedia), and Essential Video Coding (EVC) are other video coding standards to which the techniques described herein may be applied.
The techniques described herein may be applied to any of the existing video codecs (e.g., High Efficiency Video Coding (HEVC), Advanced Video Coding (AVC), or other suitable existing video codecs), and/or may be efficient coding tools for any video coding standard being developed and/or future video coding standards, such as VVC and/or other video coding standards in development or to be developed. For example, the examples described herein may be performed using a video codec such as VVC, HEVC, AVC, and/or extensions thereof. However, the techniques and systems described herein may also be applicable to other coding standards, codecs, or formats, such as MPEG, JPEG (or other coding standards for still images), VP9, AV1, extensions thereof, or other suitable coding standards already available or not yet available or developed. For example, in some examples, the encoding device 104 and/or the decoding device 112 may operate according to proprietary video codecs/formats, such as AV1, extensions of AV1, and/or successor versions of AV1 (e.g., AV2), or other proprietary formats or industry standards. Accordingly, although the techniques and systems described herein may be described with reference to a particular video coding standard, one of ordinary skill in the art will appreciate that the description should not be interpreted as applying only to that particular standard.
Referring to fig. 1, a video source 102 may provide video data to an encoding device 104. The video source 102 may be part of a source device or may be part of a device other than a source device. Video source 102 may include a video capture device (e.g., a video camera, a camera phone, a video phone, etc.), a video archiving unit containing stored video, a video server or content provider that provides video data, a video feed interface that receives video from a video server or content provider, a computer graphics system for generating computer graphics video data, a combination of such sources, or any other suitable video source.
Video data from video source 102 may include one or more input pictures or frames. A picture or frame is a still image, which in some cases is part of a video. In some examples, the data from the video source 102 may be a still image that is not part of a video. In HEVC, VVC, and other video coding specifications, a video sequence may include a series of pictures. A picture may include three sample arrays, denoted SL, SCb, and SCr. SL is a two-dimensional array of luma samples, SCb is a two-dimensional array of Cb chroma samples, and SCr is a two-dimensional array of Cr chroma samples. Chrominance (chroma) samples may also be referred to herein as "chroma" samples. A pixel may refer to all three components (luma and chroma samples) for a given location in an array of a picture. In other cases, a picture may be monochrome and may include only an array of luma samples, in which case the terms pixel and sample may be used interchangeably. With respect to example techniques described herein that refer to individual samples for illustrative purposes, the same techniques may be applied to pixels (e.g., all three sample components for a given location in an array of a picture). With respect to example techniques described herein that refer to pixels (e.g., all three sample components for a given location in an array of a picture) for illustrative purposes, the same techniques may be applied to the individual samples.
The encoder engine 106 (or encoder) of the encoding device 104 encodes the video data to generate an encoded video bitstream. In some examples, an encoded video bitstream (or "video bitstream" or "bitstream") is a series of one or more coded video sequences. A Coded Video Sequence (CVS) comprises a series of Access Units (AUs) starting from an AU with random access point pictures and with certain properties in the base layer up to a next AU with random access point pictures and with certain properties in the base layer and excluding this next AU. For example, some attributes of the random access point picture starting the CVS may include a RASL flag (e.g., noRaslOutputFlag) equal to 1. Otherwise, the random access point picture (where RASL flag is equal to 0) does not start CVS. An Access Unit (AU) includes one or more coded pictures and control information corresponding to the coded pictures sharing the same output time. Coded slices of pictures are encapsulated at the bitstream level as data units, referred to as Network Abstraction Layer (NAL) units. For example, an HEVC video bitstream may include one or more CVSs, including NAL units. Each of the NAL units has a NAL unit header. In one example, the header is one byte for H.264/AVC (except for multi-layer extensions) and two bytes for HEVC. The syntax elements in the NAL unit header take specified bits and are therefore visible to all kinds of systems and transport layers, such as transport streams, real-time transport (RTP) protocols, file formats, and others.
There are two types of NAL units in the HEVC standard, including Video Coding Layer (VCL) NAL units and non-VCL NAL units. The VCL NAL units include coded picture data that forms a coded video bitstream. For example, the bit sequence forming the coded video bit stream is present in the VCL NAL unit. The VCL NAL units may include one slice or slice (described below), and the non-VCL NAL units include control information related to one or more coded pictures. In some cases, NAL units may be referred to as packets. HEVC AU includes: a VCL NAL unit that includes coded picture data, and a non-VCL NAL unit (if any) corresponding to the coded picture data. The non-VCL NAL units may contain, among other information, parameter sets with high-level information about the encoded video bitstream. For example, the parameter sets may include a Video Parameter Set (VPS), a Sequence Parameter Set (SPS), and a Picture Parameter Set (PPS). In some cases, each slice or other portion of the bitstream may reference a single valid PPS, SPS, and/or VPS to allow the decoding device 112 to access information that may be used to decode the slice or other portion of the bitstream.
A NAL unit may contain a sequence of bits (e.g., an encoded video bitstream, a CVS of the bitstream, etc.) forming a coded representation of video data, such as a coded representation of a picture in a video. The encoder engine 106 generates a coded representation of pictures by dividing each picture into a plurality of slices. A slice is independent of other slices, so that the information in the slice can be decoded without depending on data from other slices within the same picture. A slice includes one or more slice segments, including an independent slice segment and, if present, one or more dependent slice segments that depend on previous slice segments.
In HEVC, the slices are then partitioned into coding tree blocks (CTBs) of luma samples and chroma samples. A CTB of luma samples and one or more CTBs of chroma samples, together with syntax for the samples, are referred to as a coding tree unit (CTU). A CTU may also be referred to as a "tree block" or a "largest coding unit" (LCU). A CTU is the basic processing unit for HEVC encoding. A CTU may be split into multiple coding units (CUs) of varying sizes. A CU contains luma and chroma sample arrays that are referred to as coding blocks (CBs).
The luma and chroma CBs may be further split into prediction blocks (PBs). A PB is a block of samples of the luma component or a chroma component that uses the same motion parameters for inter prediction or intra block copy (IBC) prediction (when available or enabled for use). The luma PB and one or more chroma PBs, together with associated syntax, form a prediction unit (PU). For inter prediction, a set of motion parameters (e.g., one or more motion vectors, reference indices, etc.) is signaled in the bitstream for each PU and is used for inter prediction of the luma PB and the one or more chroma PBs. The motion parameters may also be referred to as motion information. A CB may also be partitioned into one or more transform blocks (TBs). A TB represents a square block of samples of a color component to which a residual transform (e.g., the same two-dimensional transform in some cases) is applied to code the prediction residual signal. A transform unit (TU) represents the TBs of luma and chroma samples and the corresponding syntax elements. Transform coding is described in more detail below.
The size of a CU corresponds to the size of the coding node and may be square in shape. For example, the size of a CU may be 8x8 samples, 16x16 samples, 32x32 samples, 64x64 samples, or any other appropriate size up to the size of the corresponding CTU. The phrase "NxN" is used herein to refer to the pixel dimensions of a video block in terms of vertical and horizontal dimensions (e.g., 8 pixels x 8 pixels). The pixels in a block may be arranged in rows and columns. In some implementations, a block may not have the same number of pixels in the horizontal direction as in the vertical direction. Syntax data associated with a CU may describe, for example, partitioning of the CU into one or more PUs. Partition modes may differ between whether the CU is intra-prediction mode encoded or inter-prediction mode encoded. PUs may be partitioned to be non-square in shape. Syntax data associated with a CU may also describe, for example, partitioning of the CU into one or more TUs according to a CTU. A TU may be square or non-square in shape.
According to the HEVC standard, a Transform Unit (TU) may be used to perform the transform. The TUs may be different for different CUs. The size of a TU may be set based on the sizes of PUs within a given CU. The TUs may have the same size as the PU or be smaller than the PU. In some examples, a quadtree structure called a Residual Quadtree (RQT) may be used to subdivide residual samples corresponding to a CU into smaller units. The leaf nodes of the RQT may correspond to TUs. The pixel differences associated with TUs may be transformed to produce transform coefficients. The transform coefficients may then be quantized by the encoder engine 106.
Once the pictures of the video data are partitioned into CUs, the encoder engine 106 predicts each PU using a prediction mode. The prediction unit or prediction block is then subtracted from the original video data to obtain a residual (described below). For each CU, a prediction mode may be signaled inside the bitstream using syntax data. A prediction mode may include intra prediction (or intra-picture prediction) or inter prediction (or inter-picture prediction). Intra prediction utilizes the correlation between spatially neighboring samples within a picture. For example, using intra prediction, each PU is predicted from neighboring image data in the same picture, for example, using DC prediction to find an average value for the PU, using planar prediction to fit a planar surface to the PU, using direction prediction to extrapolate from neighboring data, or using any other suitable type of prediction. Inter prediction uses the temporal correlation between pictures in order to derive a motion-compensated prediction for a block of image samples. For example, using inter prediction, each PU is predicted from image data in one or more reference pictures (before or after the current picture in output order) using motion compensation prediction. The decision whether to code a picture area using inter-picture or intra-picture prediction may be made, for example, at the CU level.
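As a simplified sketch of the DC intra prediction mentioned above, the block is filled with the average of the reconstructed neighboring samples to its left and above; real codecs add rounding conventions, availability checks, and boundary filtering, so this is an assumption-laden illustration rather than a normative derivation.

```python
# Simplified DC intra prediction: fill the block with the mean of its neighbors.
import numpy as np

def dc_prediction(left_col, top_row, block_size):
    """left_col and top_row are 1D arrays of reconstructed neighboring samples."""
    dc = int(round((left_col.sum() + top_row.sum()) / (len(left_col) + len(top_row))))
    return np.full((block_size, block_size), dc, dtype=np.int32)
```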
The encoder engine 106 and the decoder engine 116 (described in more detail below) may be configured to operate according to VVC. According to VVC, a video coder, such as encoder engine 106 and/or decoder engine 116, partitions a picture into a plurality of Coding Tree Units (CTUs) (where the CTBs of luma samples and one or more CTBs of chroma samples, together with syntax for the samples, are referred to as CTUs). The video coder may partition the CTUs according to a tree structure, such as a quadtree-binary tree (QTBT) structure or a multi-type tree (MTT) structure. QTBT structures remove the concept of multiple partition types, such as distinguishing between CUs, PUs, and TUs of HEVC. The QTBT structure includes two levels, including: a first level partitioned according to a quadtree partitioning, and a second level partitioned according to a binary tree partitioning. The root node of the QTBT structure corresponds to the CTU. Leaf nodes of the binary tree correspond to Coding Units (CUs).
In an MTT partitioning structure, a block may be partitioned using a quadtree partition, a binary tree partition, and one or more types of ternary (triple) tree partitions. A ternary tree partition is a partition in which a block is split into three sub-blocks. In some examples, the ternary tree partition divides a block into three sub-blocks without dividing the original block through the center. The partitioning types in MTT (e.g., quadtree, binary tree, and ternary tree) may be symmetrical or asymmetrical.
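A hedged illustration of a vertical ternary (triple) tree split is given below, assuming the common 1:2:1 ratio so that no split line passes through the block center; the exact ratios used by a particular codec are not dictated by this passage.

```python
# Vertical ternary split of a block (x, y, width, height) into 1:2:1 sub-blocks.
def ternary_split_vertical(x, y, width, height):
    q = width // 4
    return [
        (x, y, q, height),                   # left quarter
        (x + q, y, width - 2 * q, height),   # middle half
        (x + width - q, y, q, height),       # right quarter
    ]
```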
When operating according to the AV1 codec, the encoding device 104 and the decoding device 112 may be configured to code video data in blocks. In AV1, the largest coding block that can be processed is called a superblock. In AV1, a superblock can be either 128x128 luma samples or 64x64 luma samples. However, in successor video coding formats (e.g., AV2), a superblock may be defined by a different (e.g., larger) luma sample size. In some examples, a superblock is the top level of a block quadtree. The encoding device 104 may further partition a superblock into smaller coding blocks. The encoding device 104 may partition a superblock and other coding blocks into smaller blocks using square or non-square partitioning. Non-square blocks may include N/2xN, NxN/2, N/4xN, and NxN/4 blocks. The encoding device 104 and the decoding device 112 may perform separate prediction and transform processes on each of the coding blocks.
AV1 also defines tiles of video data. A tile is a rectangular array of super blocks that can be coded independently of other tiles. That is, encoding device 104 and decoding device 112 may encode and decode, respectively, the coded blocks within a tile without using video data from other tiles. However, encoding device 104 and decoding device 112 may perform filtering across tile boundaries. Tiles may be uniform or non-uniform in size. Tile-based coding may enable parallel processing and/or multithreading for encoder and decoder implementations.
In some examples, a video coder may use a single QTBT or MTT structure to represent each of the luma and chroma components, while in other examples, a video coder may use two or more QTBT or MTT structures, such as one QTBT or MTT structure for the luma component and another QTBT or MTT structure for the two chroma components (or two QTBT and/or MTT structures for the respective chroma components).
The video coder may be configured to use quadtree partitioning, QTBT partitioning, MTT partitioning, superblock partitioning, or other partitioning structures.
In some examples, one or more slices of a picture are assigned a slice type. Slice types include an intra-coded slice (I-slice), an inter-coded predictive slice (P-slice), and an inter-coded bi-predictive slice (B-slice). An I-slice (intra-coded, independently decodable) is a slice of a picture that is coded only by intra prediction, and is therefore independently decodable, since the I-slice requires only the data within the frame to predict any prediction unit or prediction block of the slice. A P-slice (uni-directional predicted frame) is a slice of a picture that may be coded with intra prediction and with uni-directional inter prediction. Each prediction unit or prediction block within a P-slice is coded with either intra prediction or inter prediction. When inter prediction applies, the prediction unit or prediction block is predicted by only one reference picture, and therefore the reference samples come from only one reference region of one frame. A B-slice (bi-predictive frame) is a slice of a picture that may be coded with intra prediction and with inter prediction (e.g., either bi-prediction or uni-prediction). A prediction unit or prediction block of a B-slice may be bi-predicted from two reference pictures, where each picture contributes one reference region and the sample sets of the two reference regions are weighted (e.g., with equal weights or with different weights) to produce the prediction signal of the bi-predicted block. As explained above, slices of one picture are independently coded. In some cases, a picture can be coded as just one slice.
As mentioned above, intra-picture prediction of a picture exploits the correlation between spatially neighboring samples within the picture. There are a variety of intra prediction modes (also referred to as "intra modes"). In some examples, the intra prediction of the luma block includes 35 modes including a plane mode, a DC mode, and 33 angle modes (e.g., a diagonal intra prediction mode and an angle mode adjacent to the diagonal intra prediction mode). As shown in table 1 below, 35 intra prediction modes are indexed. In other examples, more intra modes may be defined, including prediction angles that may not have been represented by 33 angle modes. In other examples, the prediction angles associated with the angle mode may be different than those used in HEVC.
Intra prediction mode      Associated name
0                          INTRA_PLANAR
1                          INTRA_DC
2..34                      INTRA_ANGULAR2..INTRA_ANGULAR34

Table 1 - Specification of intra prediction modes and associated names
Inter-picture prediction uses the temporal correlation between pictures in order to derive a motion-compensated prediction for a block of image samples. Using a translational motion model, the position of a block in a previously decoded picture (a reference picture) is indicated by a motion vector (Δx, Δy), where Δx specifies the horizontal displacement and Δy specifies the vertical displacement of the reference block relative to the position of the current block. In some cases, a motion vector (Δx, Δy) can have integer sample accuracy (also referred to as integer accuracy), in which case the motion vector points to the integer-pel grid (or integer-pixel sampling grid) of the reference frame. In some cases, a motion vector (Δx, Δy) can have fractional sample accuracy (also referred to as fractional-pel accuracy or non-integer accuracy) to more accurately capture the movement of the underlying object, without being restricted to the integer-pel grid of the reference frame. The accuracy of a motion vector may be expressed by its quantization level. For example, the quantization level may be integer accuracy (e.g., 1-pixel) or fractional-pel accuracy (e.g., 1/4-pixel, 1/2-pixel, or other sub-pixel values). Interpolation is applied to reference pictures to derive the prediction signal when the corresponding motion vector has fractional sample accuracy. For example, samples available at integer positions can be filtered (e.g., using one or more interpolation filters) to estimate values at fractional positions. The previously decoded reference picture is indicated by a reference index (refIdx) to a reference picture list. The motion vectors and reference indices can be referred to as motion parameters. Two kinds of inter-picture prediction can be performed, including uni-prediction and bi-prediction.
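As a minimal sketch of the fractional-position interpolation described above, the example below uses simple bilinear weighting of the four surrounding integer-position samples; standardized codecs use longer separable interpolation filters, so this two-tap-per-direction example is an assumption for illustration only.

```python
# Bilinear interpolation of one prediction sample at a fractional MV position.
def interp_sample(ref, y, x, frac_y, frac_x):
    """ref is a 2D array of samples; (frac_y, frac_x) are fractional offsets in [0, 1)."""
    a, b = ref[y][x], ref[y][x + 1]
    c, d = ref[y + 1][x], ref[y + 1][x + 1]
    top = a * (1 - frac_x) + b * frac_x
    bottom = c * (1 - frac_x) + d * frac_x
    return top * (1 - frac_y) + bottom * frac_y
```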
In the case of inter prediction using bi-prediction (also referred to as bi-directional inter prediction), two sets of motion parameters (Δx₀, Δy₀, refIdx₀) and (Δx₁, Δy₁, refIdx₁) are used to generate two motion compensated predictions (from the same reference picture or possibly from different reference pictures). For example, in the case of bi-prediction, two motion-compensated prediction signals are used per prediction block, and B prediction units are generated. The two motion compensated predictions are then combined to obtain the final motion compensated prediction. For example, the two motion compensated predictions may be combined by averaging. In another example, weighted prediction may be used, in which case different weights may be applied to each motion compensated prediction. The reference pictures that can be used in bi-prediction are stored in two separate lists, denoted list 0 and list 1, respectively. Motion parameters may be derived at the encoder using a motion estimation process.
In the case of inter prediction using uni-prediction (also referred to as uni-directional inter prediction), one set of motion parameters (Δx₀, Δy₀, refIdx₀) is used to generate a motion compensated prediction from a reference picture. For example, in the case of uni-prediction, at most one motion-compensated prediction signal is used per prediction block, and P prediction units are generated.
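The following sketch shows the combination step described in the two preceding paragraphs: plain averaging of two motion-compensated predictions for bi-prediction, optional unequal weights for weighted prediction, and a pass-through for uni-prediction. The function and its defaults are illustrative assumptions.

```python
# Combine motion-compensated predictions: average (bi-prediction), weighted, or single.
import numpy as np

def combine_predictions(pred0, pred1=None, w0=0.5, w1=0.5):
    if pred1 is None:                    # uni-prediction: one prediction signal
        return pred0.copy()
    mixed = w0 * pred0.astype(np.float64) + w1 * pred1.astype(np.float64)
    return np.rint(mixed).astype(pred0.dtype)
```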
The PU may include data (e.g., motion parameters or other suitable data) related to the prediction process. For example, when a PU is encoded using intra prediction, the PU may include data describing an intra prediction mode for the PU. As another example, when a PU is encoded using inter prediction, the PU may include data defining a motion vector for the PU. The data defining the motion vector for the PU may describe, for example, a horizontal component (Δx) of the motion vector, a vertical component (Δy) of the motion vector, a resolution (e.g., integer precision, quarter-pixel precision, or eighth-pixel precision) for the motion vector, a reference picture to which the motion vector points, a reference index, a reference picture list (e.g., list 0, list 1, or list C) for the motion vector, or any combination thereof.
AV1 includes two general techniques for encoding and decoding coding blocks of video data. These two general techniques are intra-prediction (e.g., intra-prediction or spatial prediction) and inter-prediction (e.g., inter-prediction or temporal prediction). In the context of AV1, when an intra prediction mode is used to predict a block of a current frame of video data, the encoding device 104 and the decoding device 112 do not use video data from other frames of video data. For most intra-prediction modes, the video encoding device 104 encodes a block of the current frame based on the difference between the sample values in the current block and the prediction values generated from the reference samples in the same frame. The video encoding device 104 determines a prediction value generated from the reference samples based on the intra prediction mode.
After performing prediction using intra-prediction and/or inter-prediction, the encoding device 104 may perform transform and quantization. For example, after prediction, the encoder engine 106 may calculate residual values corresponding to the PU. The residual values may include pixel difference values between the current block of pixels being coded (the PU) and the prediction block used to predict the current block (e.g., the predicted version of the current block). For example, after generating a prediction block (e.g., performing inter prediction or intra prediction), the encoder engine 106 may generate a residual block by subtracting the prediction block produced by the prediction unit from the current block. The residual block includes a set of pixel difference values that quantify differences between pixel values of the current block and pixel values of the prediction block. In some examples, the residual block may be represented in a two-dimensional block format (e.g., a two-dimensional matrix or array of pixel values). In such examples, the residual block is a two-dimensional representation of the pixel values.
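A straightforward sketch of the residual computation described above follows; it simply takes the per-pixel difference between the current block and its prediction block, kept as a two-dimensional array, and is illustrative rather than any codec's exact residual pipeline.

```python
# Residual block: per-pixel difference between the current block and its prediction.
import numpy as np

def residual_block(current_block, prediction_block):
    return current_block.astype(np.int32) - prediction_block.astype(np.int32)
```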
The block transform is used to transform any residual data that may remain after the prediction is performed, and may be based on a discrete cosine transform, a discrete sine transform, an integer transform, a wavelet transform, another suitable transform function, or any combination thereof. In some cases, one or more block transforms (e.g., of size 32x32, 16x16, 8x8, 4x4, or another suitable size) may be applied to the residual data in each CU. In some aspects, a TU may be used for the transform and quantization processes implemented by the encoder engine 106. A given CU having one or more PUs may also include one or more TUs. As described in further detail below, the residual values may be transformed into transform coefficients using the block transforms, and may then be quantized and scanned using TUs to produce serialized transform coefficients for entropy coding.
In some aspects, after intra-prediction or inter-prediction coding using the PUs of the CU, encoder engine 106 may calculate residual data for TUs of the CU. The PU may include pixel data in a spatial domain (or pixel domain). The TUs may include coefficients in the transform domain after applying the block transform. As previously described, the residual data may correspond to pixel differences between pixels of the non-coded picture and the prediction value corresponding to the PU. The encoder engine 106 may form TUs that include residual data for the CU, and may then transform the TUs to generate transform coefficients for the CU.
The encoder engine 106 may perform quantization of the transform coefficients. Quantization provides further compression by quantizing the transform coefficients to reduce the amount of data used to represent the coefficients. For example, quantization may reduce the bit depth associated with some or all of the coefficients. In one example, coefficients having n-bit values may be rounded down to m-bit values during quantization, where n is greater than m.
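As a hedged example of the bit-depth reduction described above, the sketch below rounds an n-bit coefficient down to an m-bit value by discarding the (n - m) least significant bits; practical quantizers use rate-controlled step sizes and rounding offsets rather than a fixed shift.

```python
# Reduce an n-bit transform coefficient to an m-bit value by a right shift.
def quantize_coefficient(coeff, n_bits, m_bits):
    shift = n_bits - m_bits
    return coeff >> shift if shift > 0 else coeff
```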
Once quantization is performed, the coded video bitstream includes quantized transform coefficients, prediction information (e.g., prediction modes, motion vectors, block vectors, etc.), partition information, and any other suitable data (such as other syntax data). The different elements of the coded video bitstream may then be entropy encoded by the encoder engine 106. In some examples, the encoder engine 106 may scan the quantized transform coefficients using a predefined scan order to produce a serialized vector that can be entropy encoded. In some examples, the encoder engine 106 may perform adaptive scanning. After scanning the quantized transform coefficients to form a vector (e.g., a one-dimensional vector), the encoder engine 106 may entropy encode the vector. For example, the encoder engine 106 may use context adaptive variable length coding, context adaptive binary arithmetic coding, syntax-based context adaptive binary arithmetic coding, probability interval partitioning entropy coding, or another suitable entropy coding technique.
The output 110 of the encoding device 104 may send the NAL units making up the encoded video bitstream data over a communications link 120 to the decoding device 112 of the receiving device. The input 114 of the decoding device 112 may receive the NAL units. The communications link 120 may include a channel provided by a wireless network, a wired network, or a combination of a wired and wireless network. A wireless network may include any wireless interface or combination of wireless interfaces and may include any suitable wireless network (e.g., the Internet or other wide area network, a packet-based network, WiFi™, radio frequency (RF), ultra-wideband (UWB), WiFi Direct, cellular, Long-Term Evolution (LTE), WiMax™, or the like). A wired network may include any wired interface (e.g., fiber, Ethernet, powerline Ethernet, Ethernet over coaxial cable, digital signal line (DSL), or the like). The wired and/or wireless networks may be implemented using various equipment, such as base stations, routers, access points, bridges, gateways, switches, or the like. The encoded video bitstream data may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to the receiving device.
In some examples, the encoding device 104 may store the encoded video bitstream data in the storage unit 108. The output 110 may retrieve encoded video bitstream data from the encoder engine 106 or from the storage unit 108. The storage unit 108 may comprise any of a variety of distributed or locally accessed data storage media. For example, the storage unit 108 may include a hard disk drive, a storage disk, flash memory, volatile or non-volatile memory, or any other suitable digital storage medium for storing encoded video data. The storage unit 108 may further include a Decoded Picture Buffer (DPB) for storing reference pictures for use in inter prediction. In further examples, the storage unit 108 may correspond to a file server or another intermediate storage device that may store the encoded video generated by the source device. In such a case, the receiving device including decoding device 112 may access the stored video data from the storage device via streaming or download. The file server may be any type of server capable of storing encoded video data and transmitting the encoded video data to a receiving device. Example file servers include web servers (e.g., for web sites), FTP servers, network Attached Storage (NAS) devices, or local disk drives. The receiving device may access the encoded video data through any standard data connection, including an internet connection. This may include a wireless channel (e.g., wi-Fi connection), a wired connection (e.g., DSL, cable modem, etc.), or a combination of both, adapted to access encoded video data stored on a file server. The transmission of encoded video data from the storage unit 108 may be a streaming transmission, a download transmission, or a combination thereof.
The input 114 of the decoding device 112 receives the encoded video bitstream data and may provide the video bitstream data to the decoder engine 116 or to the storage unit 118 for later use by the decoder engine 116. For example, the storage unit 118 may include a Decoded Picture Buffer (DPB) for storing reference pictures for use in inter prediction. A receiving device comprising a decoding device 112 may receive encoded video data to be decoded via the storage unit 108. The encoded video data may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to a receiving device. The communication medium used to transmit the encoded video data may comprise any wireless or wired communication medium, such as a Radio Frequency (RF) spectrum or one or more physical transmission lines. The communication medium may form part of a packet-based network, such as a local area network, a wide area network, or a global network such as the internet. The communication medium may include a router, switch, base station, or any other means that may be used to facilitate communications from a source device to a receiving device.
The decoder engine 116 may decode the encoded video bitstream data by entropy decoding (e.g., using an entropy decoder) and extracting elements of one or more encoded video sequences that constitute the encoded video data. The decoder engine 116 may then rescale the encoded video bitstream data and perform an inverse transform thereon. The residual data is then passed to the prediction stage of the decoder engine 116. The decoder engine 116 then predicts a block of pixels (e.g., a PU). In some examples, the prediction is added to the output of the inverse transform (residual data).
Video decoding device 112 may output the decoded video to video destination device 122, and video destination device 122 may include a display or other output device for displaying the decoded video data to a consumer of the content. In some aspects, video destination device 122 may be part of a receiving device that includes decoding device 112. In some aspects, video destination device 122 may be part of a separate device that is different from the receiving device.
In some aspects, the video encoding device 104 and/or the video decoding device 112 may be integrated with an audio encoding device and an audio decoding device, respectively. The video encoding device 104 and/or the video decoding device 112 may also include other hardware or software necessary for implementing the above-described decoding techniques, such as one or more microprocessors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs), discrete logic, software, hardware, firmware or any combinations thereof. The video encoding device 104 and the video decoding device 112 may be integrated as part of a combined encoder/decoder (codec) in the respective devices.
The example system shown in fig. 1 is but one illustrative example that may be used herein. The techniques for processing video data described herein may be performed by any digital video encoding and/or decoding device. Although the techniques of this disclosure are generally performed by a video encoding device or a video decoding device, the techniques may also be performed by a combined video encoder/decoder, commonly referred to as a "CODEC". Furthermore, the techniques of this disclosure may also be performed by a video preprocessor. The source device and the receiving device are merely examples of such coding devices, in which the source device generates encoded video data for transmission to the receiving device. In some examples, the source device and the receiving device may operate in a substantially symmetrical manner such that each of these devices includes video encoding and decoding components. Thus, example systems may support unidirectional or bidirectional video transmission between video devices, for example, for video streaming, video playback, video broadcasting, or video telephony.
The present disclosure may generally relate to "signaling" certain information (e.g., syntax elements). The term "signaling" may generally refer to the transmission of values for syntax elements and/or other data for decoding encoded video data. For example, the video encoding device 104 may signal a value for the syntax element in the bitstream. Typically, signaling refers to generating values in a bitstream. As described above, video source 102 may transmit the bitstream to video destination device 122 in substantially real-time or not in real-time (e.g., as may occur when syntax elements are stored to storage unit 108 for later retrieval by video destination device 122).
The video bitstream may also include Supplemental Enhancement Information (SEI) messages. For example, the SEI NAL unit may be part of a video bitstream. In some cases, the SEI message may contain information that is not needed by the decoding process. For example, the information in the SEI message may not be necessary for the decoder to decode video pictures of the bitstream, but the decoder may use the information to improve the display or processing of the pictures (e.g., decoded output). The information in the SEI message may be embedded metadata. In one illustrative example, the decoder-side entity may use information in the SEI message to improve the visibility of the content. In some cases, certain application standards may mandate the presence of such SEI messages in the bitstream so that all devices conforming to the application standard can benefit from the resulting quality improvement (e.g., carriage of frame packing SEI messages for the frame-compatible plano-stereoscopic 3DTV video format, where an SEI message is carried for each frame of the video, handling of recovery point SEI messages, and use of pan-scan rectangle SEI messages in DVB, among many other examples).
As described above, for each block, a set of motion information (also referred to herein as motion parameters) may be available. The set of motion information contains motion information for a forward prediction direction and a backward prediction direction. The forward prediction direction and the backward prediction direction may be two prediction directions of a bi-directional prediction mode, in which case the terms "forward" and "backward" do not necessarily have a geometric meaning. In contrast, "forward" and "backward" correspond to reference picture list0 (RefPicList 0 or L0) and reference picture list1 (RefPicList 1 or L1) of the current picture. In some examples, when only one reference picture list is available for a picture or slice, only RefPicList0 is available and the motion information for each block of the slice is always forward.
In some examples, motion vectors and their reference indices are used in the coding process (e.g., motion compensation). Such motion vectors with associated reference indices are represented as a single prediction set of motion information. For each prediction direction, the motion information may contain a reference index and a motion vector. In some cases, for simplicity, the motion vector itself may be referenced in a manner that assumes that it has an associated reference index. The reference index is used to identify a reference picture in the current reference picture list (RefPicList 0 or RefPicList 1). The motion vector has a horizontal component and a vertical component that provide an offset from a coordinate position in the current picture to a coordinate position in a reference picture identified by a reference index. For example, the reference index may indicate a particular reference picture that should be used for a block in the current picture, and the motion vector may indicate where in the reference picture the best matching block (the block that best matches the current block) is located.
In H.264/AVC, each inter Macroblock (MB) may be partitioned in four different ways, including: one 16x16 MB partition; two 16x8 MB partitions; two 8x16 MB partitions; and four 8x8 MB partitions. Different MB partitions in one MB may have different reference index values (RefPicList0 or RefPicList1) for each direction. In some cases, when an MB is not partitioned into four 8x8 MB partitions, it may have only one motion vector for each MB partition in each direction. In some cases, when an MB is partitioned into four 8x8 MB partitions, each 8x8 MB partition may be further partitioned into sub-blocks, in which case each sub-block may have a different motion vector in each direction. In some examples, there are four different ways to obtain sub-blocks from an 8x8 MB partition: one 8x8 sub-block; two 8x4 sub-blocks; two 4x8 sub-blocks; and four 4x4 sub-blocks. Each sub-block may have a different motion vector in each direction. Thus, the motion vector may be present at a level equal to or higher than the sub-block level.
In AVC, for skip or direct modes in B slices, temporal direct modes may be implemented at MB or MB partition level. For each MB partition, the motion vector is derived using the motion vector of the block co-located with the current MB partition in RefPicList1[0] of the current block. Each motion vector in the co-located block may be scaled based on the POC distance.
In AVC, a spatial direct mode may also be performed. For example, in AVC, the direct mode may also predict motion information from spatial neighbors.
As mentioned above, in HEVC, the largest coding unit in a slice is referred to as a Coding Tree Block (CTB). A CTB includes a quadtree whose nodes are coding units. In the HEVC main profile, CTBs may range in size from 16x16 to 64x64. In some cases, 8x8 CTBs may be supported. A Coding Unit (CU) may be the same size as a CTB and as small as 8x8. In some cases, each coding unit may be coded using one mode. A CU may be further partitioned into 2 or 4 Prediction Units (PUs) when the CU is inter coded, or may become only one PU when further partitioning is not applicable. When two PUs are present in one CU, they may be rectangles of half size, or two rectangles having sizes of 1/4 and 3/4 of the CU. When a CU is inter coded, there is one set of motion information for each PU. In addition, each PU may be coded with a unique inter prediction mode to derive a set of motion information.
For example, for motion prediction in HEVC, there are two inter prediction modes for a Prediction Unit (PU), including merge mode and Advanced Motion Vector Prediction (AMVP) mode. Skip mode is considered a special case of merge mode. In AMVP or merge mode, a Motion Vector (MV) candidate list of multiple motion vector predictors may be maintained. In merge mode, the motion vector(s) and reference index(es) of the current PU are generated by taking one candidate from the MV candidate list. In some examples, one or more scaling window offsets may be included in the MV candidate list along with the stored motion vectors.
In an example in which MV candidate lists are used for motion prediction of a block, the MV candidate lists may be separately constructed by an encoding apparatus and a decoding apparatus. For example, the MV candidate list may be generated by the encoding device when encoding a block, and may be generated by the decoding device when decoding a block. Information related to motion information candidates in the MV candidate list (e.g., information related to one or more motion vectors, information related to one or more LIC flags that may be stored in the MV candidate list in some cases, and/or other information) may be signaled between the encoding device and the decoding device. For example, in merge mode, index values for stored motion information candidates may be signaled from the encoding device to the decoding device (e.g., in a syntax structure such as Picture Parameter Sets (PPS), sequence Parameter Sets (SPS), video Parameter Sets (VPS), slice headers, supplemental Enhancement Information (SEI) messages sent in or separate from the video bitstream, and/or other signaling). The decoding apparatus may construct a MV candidate list and obtain one or more motion information candidates from the constructed MV candidate list using a signaled reference or index for motion compensated prediction. For example, the decoding device 112 may construct a MV candidate list and use the motion vectors (and in some cases the LIC flags) from the index locations to motion predict the block. In the case of AMVP mode, a difference or residual value may be signaled as an increment in addition to a reference or index. For example, for AMVP mode, the decoding apparatus may construct one or more MV candidate lists and apply an increment value to one or more motion information candidates obtained using signaled index values when performing motion compensation prediction for a block.
In some examples, the MV candidate list contains up to five candidates for merge mode and two candidates for AMVP mode. In other examples, a different number of candidates may be included in the MV candidate list for merge mode and/or AMVP mode. The merge candidate may contain a set of motion information. For example, the motion information set may include motion vectors corresponding to two reference picture lists (list 0 and list 1) and a reference index. If the merge candidate is identified by the merge index, the reference picture is used for prediction of the current block and an associated motion vector is determined. However, in AMVP mode, for each potential prediction direction from list 0 or list 1, the reference index needs to be explicitly signaled along with the MVP index to the MV candidate list, since the AMVP candidates contain only motion vectors. In AMVP mode, the prediction motion vector may be further refined.
As seen above, the merge candidate corresponds to a complete set of motion information, while the AMVP candidate contains only one motion vector and reference index for a particular prediction direction. Candidates for both modes may be similarly derived from the same spatial and temporal neighboring blocks.
In some examples, the merge mode allows the inter-predicted PU to inherit the same one or more motion vectors, prediction directions, and one or more reference picture indices from the inter-predicted PU as follows: the inter-predicted PU includes a motion data location selected from a set of spatially-adjacent motion data locations and one of two temporally-collocated motion data locations. For AMVP mode, one or more motion vectors of a PU may be predictively coded relative to one or more Motion Vector Predictors (MVPs) from an AMVP candidate list constructed by an encoder and/or decoder. In some cases, for unidirectional inter prediction of a PU, the encoder and/or decoder may generate a single AMVP candidate list. In some cases, for bi-prediction of a PU, the encoder and/or decoder may generate two AMVP candidate lists, one using motion data of spatially and temporally neighboring PUs from the forward prediction direction and one using motion data of spatially and temporally neighboring PUs from the backward prediction direction.
Candidates for both modes may be derived from spatially and/or temporally neighboring blocks. For example, fig. 2A and 2B include conceptual diagrams illustrating spatial neighboring candidates. Fig. 2A shows spatial neighboring Motion Vector (MV) candidates for merge mode. Fig. 2B shows spatial neighboring Motion Vector (MV) candidates for AMVP mode. Spatial MV candidates are derived from neighboring blocks for a particular PU (PU 0), but the method of generating candidates from blocks differs for merging and AMVP modes.
In merge mode, the encoder and/or decoder may form a merge candidate list by considering merge candidates from various motion data locations. For example, as shown in fig. 2A, up to four spatial MV candidates may be derived with respect to spatially adjacent motion data locations shown by numerals 0-4 in fig. 2A. In the merge candidate list, the MV candidates may be ordered in the order indicated by numerals 0-4. For example, the location and order may include: a left side position (0), an upper position (1), an upper right position (2), a lower left position (3) and an upper left position (4).
In the AMVP mode shown in fig. 2B, neighboring blocks may be divided into two groups: a left group including blocks 0 and 1, and an above group including blocks 2, 3, and 4. For each group, the potential candidate in a neighboring block that references the same reference picture as the one indicated by the signaled reference index has the highest priority to be selected to form the final candidate of the group. It is possible that none of the neighboring blocks contains a motion vector pointing to the same reference picture. Thus, if such a candidate cannot be found, the first available candidate may be scaled to form the final candidate, so that the temporal distance differences can be compensated for.
Fig. 3A and 3B include conceptual diagrams illustrating temporal motion vector prediction. Temporal Motion Vector Predictor (TMVP) candidates, if enabled and available, are added to the MV candidate list, which follows the spatial motion vector candidates. The process of motion vector derivation for TMVP candidates is the same for both merge and AMVP mode. However, in some cases, the target reference index for the TMVP candidate in merge mode may be set to zero or may be derived from the reference index of the neighboring block.
The main block location for TMVP candidate derivation is the lower right block (as shown as block "T" in fig. 3A) outside the co-located PU to compensate for the bias to the upper and left blocks used to generate the spatial neighboring candidates. However, if the block is located outside the current CTB (or LCU) line or motion information is not available, the block may be replaced with the center block of the PU. The motion vector for the TMVP candidate is derived from the co-located PU of the co-located picture indicated at the slice level. Similar to the temporal direct mode in AVC, motion vector scaling may be performed on the motion vectors of TMVP candidates, which may be performed to compensate for distance differences.
Other aspects of motion prediction are encompassed in the HEVC standard and/or other standards, formats, or codecs. For example, several other aspects of merge mode and AMVP mode are contemplated. One aspect includes motion vector scaling. Regarding motion vector scaling, it is assumed that the value of a motion vector is proportional to the distance between pictures in presentation time. A motion vector associates two pictures: the reference picture and the picture containing the motion vector (the containing picture). When a motion vector is used to predict another motion vector, the distance between the containing picture and the reference picture is calculated based on Picture Order Count (POC) values.
For a motion vector to be predicted, both its associated containing picture and reference picture may be different. Thus, a new distance is calculated (based on POC). Also, the motion vector is scaled based on the two POC distances. For spatial neighboring candidates, the containing pictures for the two motion vectors are the same, while the reference pictures are different. In HEVC, motion vector scaling is applicable to both TMVP and AMVP for spatial and temporal neighbor candidates.
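As an illustration of the POC-based scaling described above, the following non-normative sketch scales a predictor motion vector by the ratio of the two POC distances. The function and variable names are hypothetical, and a floating-point ratio is used in place of the fixed-point scaling of the actual standards.

```python
# Non-normative sketch of POC-based motion vector scaling (names are hypothetical).
def scale_motion_vector(mv, cur_poc, cur_ref_poc, cont_poc, cont_ref_poc):
    """Scale a predictor MV by the ratio of POC distances.

    tb: POC distance between the current picture and its reference picture.
    td: POC distance between the containing picture and its reference picture.
    """
    tb = cur_poc - cur_ref_poc
    td = cont_poc - cont_ref_poc
    if td == 0 or tb == td:
        return mv  # no scaling needed (or invalid input)
    scale = tb / td
    return (round(mv[0] * scale), round(mv[1] * scale))
```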
Another aspect of motion prediction includes artificial motion vector candidate generation. For example, if the motion vector candidate list is not complete, artificial motion vector candidates are generated and inserted at the end of the list until the list is full. In merge mode, there are two types of artificial MV candidates: combined candidates derived only for B slices; and zero candidates, used only for AMVP if the first type does not provide enough artificial candidates. For each pair of candidates that are already in the candidate list and that have the necessary motion information, a bi-directional combined motion vector candidate is derived by combining the motion vector of the first candidate referring to a picture in list 0 and the motion vector of the second candidate referring to a picture in list 1.
In some implementations, the pruning process may be performed when new candidates are added or inserted into the MV candidate list. For example, in some cases MV candidates from different blocks may include the same information. In such a case, storing the repetitive motion information of a plurality of MV candidates in the MV candidate list may cause redundancy in the MV candidate list and a decrease in efficiency. In some examples, the pruning process may eliminate or minimize redundancy in the MV candidate list. For example, the pruning process may include comparing potential MV candidates to be added to the MV candidate list with MV candidates already stored in the MV candidate list. In one illustrative example, the stored horizontal displacement (Δx) and vertical displacement (Δy) of the motion vector (indicating the position of the reference block relative to the position of the current block) may be compared to the horizontal displacement (Δx) and vertical displacement (Δy) of the potential candidate. If the comparison reveals that the motion vector of the potential candidate does not match any of the one or more stored motion vectors, the potential candidate is not considered a candidate to be pruned and may be added to the MV candidate list. If a match is found based on the comparison, the potential MV candidates are not added to the MV candidate list, thereby avoiding the insertion of the same candidates. In some cases, to reduce complexity, only a limited number of comparisons are performed during the pruning process, rather than comparing each potential MV candidate to all existing candidates.
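A simplified sketch of the pruning check described above is shown below; the candidate representation and the comparison limit are assumptions made for illustration only.

```python
# Hypothetical pruning check: add a potential candidate only if its motion
# information does not duplicate an already-stored candidate.
def maybe_add_candidate(mv_candidate_list, potential, max_comparisons=4):
    """potential: (delta_x, delta_y, ref_idx) describing the candidate's motion."""
    for existing in mv_candidate_list[:max_comparisons]:
        if existing == potential:   # same displacement and reference index
            return False            # duplicate found: candidate is pruned
    mv_candidate_list.append(potential)
    return True
```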
In some coding schemes, such as HEVC, Weighted Prediction (WP) is supported, in which case a scaling factor (denoted a), a shift number (denoted s), and an offset (denoted b) are used in motion compensation. Assuming that the pixel value at position (x, y) of the reference picture is p(x, y), p'(x, y) = ((a * p(x, y) + (1 << (s - 1))) >> s) + b is used as the prediction value in motion compensation instead of p(x, y).
When WP is enabled, for each reference picture of the current slice, a flag is signaled to indicate whether WP is applicable to the reference picture. If WP is applicable to one reference picture, WP parameter sets (i.e., a, s, and b) are transmitted to the decoder, and are used for motion compensation according to the reference picture. In some examples, to flexibly turn on/off WP for the luminance and chrominance components, WP flags and WP parameters for the luminance and chrominance components are signaled, respectively. In WP, one and the same WP parameter set can be used for all pixels in one reference picture.
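The per-sample weighted-prediction formula above can be written directly as a small helper; this is only an illustration of the formula, not decoder pseudo-code.

```python
# p'(x, y) = ((a * p(x, y) + (1 << (s - 1))) >> s) + b applied to a single sample.
def weighted_prediction_sample(p, a, s, b):
    return ((a * p + (1 << (s - 1))) >> s) + b
```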
Fig. 4A is a diagram showing an example of neighbor reconstructed samples of a current block 402 and neighbor samples of a reference block 404 for unidirectional inter prediction. The motion vector MV may be coded for the current block 402, where the MV may include a reference index to a reference picture list and/or other motion information identifying the reference block 404. For example, the MV may include a horizontal component and a vertical component that provide an offset from a coordinate location in the current picture to coordinates in a reference picture identified by a reference index. Fig. 4B is a diagram illustrating an example of neighbor reconstructed samples of the current block 422 and neighbor samples of the first reference block 424 and the second reference block 426 for bi-directional inter prediction. In this case, two motion vectors MV0 and MV1 may be coded for the current block 422 to identify a first reference block 424 and a second reference block 426, respectively.
Bilateral Matching (BM) is a technique that may be used to refine an initial pair of motion vectors (e.g., a first motion vector MV0 and a second motion vector MV1). For example, BM may be performed by searching around the initial motion vectors MV0 and MV1 to obtain refined motion vectors (e.g., refined motion vectors MV0' and MV1'). The first motion vector MV0 and the second motion vector MV1 may then be replaced with the refined motion vectors MV0' and MV1', respectively. The refined motion vectors may be selected as the motion vectors identified in the search that minimize the block matching cost.
In some examples, the block matching cost may be generated based on a similarity between two motion compensated predictors generated for the two MVs. Example criteria for block matching costs include, but are not limited to, sum of Absolute Differences (SAD), sum of Absolute Transformed Differences (SATD), sum of Squared Errors (SSE), and the like. The block matching cost criterion may also include a regularization term that is derived based on MV differences between the current MV pair (e.g., the MV pair selected as the refined motion vectors MV0 'and MV 1') and the initial MV pair (e.g., MV0 and MV 1).
In some examples, one or more constraints may be applied to the motion vector difference (MVD) terms MVD0 and MVD1 (e.g., where MVD0 = MV0' - MV0 and MVD1 = MV1' - MV1). For example, in some cases, constraints may be applied based on the assumption that MVD0 and MVD1 are proportional to the Temporal Distances (TDs) between the current picture (e.g., current block) and the reference pictures (e.g., reference blocks) to which the two MVs point. In some examples, a constraint may be applied based on the assumption that MVD0 = -MVD1 (e.g., MVD0 and MVD1 have equal magnitudes but opposite signs).
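The following sketch illustrates a bilateral matching cost of the kind described above, using SAD as the distortion term, a simple MVD-based regularization term, and the mirroring constraint MVD0 = -MVD1. The weighting factor and data layout are assumptions for illustration only.

```python
# Illustrative bilateral matching cost under the MVD0 = -MVD1 constraint.
def bm_cost(pred0, pred1, mvd, reg_weight=0.25):
    """pred0/pred1: 2-D sample arrays predicted with MV0 + mvd and MV1 - mvd.
    mvd: (dx, dy) offset applied symmetrically to the initial MV pair."""
    sad = sum(abs(a - b)
              for row0, row1 in zip(pred0, pred1)
              for a, b in zip(row0, row1))
    regularization = reg_weight * (abs(mvd[0]) + abs(mvd[1]))
    return sad + regularization
```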
In some examples, an inter-prediction CU may be associated with one or more motion parameters. For example, in a general video coding standard (VVC), each inter-prediction CU may be associated with one or more motion parameters, which may include, but are not limited to, a motion vector, a reference picture index, and a reference picture list use index. The motion parameters may also include additional information associated with coding features of the VVC to be used for inter-prediction sample generation. The motion parameters may be signaled explicitly or implicitly. For example, when coding a CU with skip mode, the CU is associated with one PU and has no significant residual coefficients and no coded motion vector delta or reference picture index.
In some aspects, a merge mode may be specified in which the motion parameters for the current CU are obtained from neighboring CUs (including spatial and temporal candidates). Additionally or alternatively, one or more merge modes may be specified based on additional schemes introduced in the VVC standard. In some examples, the merge mode may be applied to any inter-prediction CU (e.g., the merge mode may be applied beyond the skip mode). In some examples, an alternative to the merge mode may include explicit transmission of one or more motion parameters. For example, motion vectors, corresponding reference picture indices for each reference picture list, reference picture list use flags, and other relevant information may be explicitly signaled for each CU.
In addition to the inter coding features in HEVC, VVC also includes a variety of new and improved inter-prediction coding techniques, including: extended merge prediction; merge mode with motion vector difference (MMVD); symmetric MVD (SMVD) signaling; affine motion compensated prediction; subblock-based temporal motion vector prediction (SbTMVP); Adaptive Motion Vector Resolution (AMVR); motion field storage: 1/16th luma sample MV storage and 8x8 motion field compression; bi-prediction with CU-level weights (BCW); bi-directional optical flow (BDOF); decoder-side motion vector refinement (DMVR); Geometric Partitioning Mode (GPM); and combined inter and intra prediction (CIIP).
For extended merge prediction in VVC merge mode (e.g., referred to as normal or default merge mode), a merge candidate list may be constructed by including the following five types of candidates in order: spatial Motion Vector Prediction (MVP) from spatially neighboring CUs; temporal MVP from co-located CUs; history-based MVP from a first-in first-out (FIFO) table; pairwise average MVP; and zero MVs. The size of the merge candidate list may be signaled in a sequence parameter set header. The maximum allowed size of the merge candidate list may be six (e.g., six entries or six candidates). For each CU coded in merge mode, the index of the best merge candidate is encoded using truncated unary binarization (TU). In some examples, VVC may also support parallel derivation of merge candidate lists for all CUs within a certain size or region (e.g., as done in HEVC). The five types of merge candidates listed above, and an example derivation process for each type, are described in turn below.
Fig. 5 illustrates locations of spatial merge candidates for processing blocks according to some examples of the present disclosure. For example, fig. 5 depicts example locations of spatial merge candidates (also referred to as "spatial neighbors") A0, A1, B0, B1, and B2 for processing block 500 according to some examples of the present disclosure. Spatial neighbors A0, A1, B0, B1, and B2 are indicated in FIG. 5 based on their relationship to block 500. The derivation of spatial merge candidates in VVC may be the same as in HEVC, with the positions of the first two merge candidates swapped. In some examples, up to four merge candidates may be selected from five spatial merge candidates located at the positions depicted in fig. 5 (e.g., A0, A1, B0, B1, and B2).
The derivation order may be B0, A0, B1, A1, and B2. For example, the merge candidate position B2 may be considered only when one or more CUs associated with positions B0, A0, B1, A1 are not available or are intra coded. A CU associated with position B0, A0, B1, or A1 may be unavailable because the CU belongs to a different slice or tile. In some aspects, after the merge candidate at position A1 is added, the addition of the remaining merge candidates may be subject to a redundancy check. The redundancy check may be performed such that merge candidates having the same motion information are excluded from the merge candidate list (e.g., to improve coding efficiency).
Fig. 6 illustrates aspects of motion vector scaling 600 for processing temporal merging candidates for a block according to some examples of the present disclosure. In some examples, temporal merge candidate derivation may be performed in which one merge candidate (e.g., a temporal merge candidate) is added to a merge candidate list. The temporal merging candidate derivation may be performed based on the scaled motion vectors. The scaled motion vector may be derived based on the co-located CU included in the co-located reference picture. The reference picture list to be used for deriving the co-located CU may be explicitly signaled in the slice header.
For example, fig. 6 depicts a current picture 610 and a co-located picture 630, which may be associated with a current reference picture 615 and a co-located reference picture 635, respectively. Fig. 6 also depicts a current CU 612 (e.g., associated with current picture 610) and a co-located CU (e.g., associated with co-located picture 630). In some examples, scaled motion vectors for temporal merging candidate derivation may be derived or obtained, as shown in fig. 6. For example, fig. 6 depicts a dashed line 611 scaled from the motion vector of the co-located CU 632 using Picture Order Count (POC) distances tb and td. In some examples, tb is the POC difference between the current reference picture 615 and the current picture 610, and td is the POC difference between the co-located reference picture 635 and the co-located picture 630. The reference picture index of the temporal merging candidate may be set equal to zero.
Fig. 7 illustrates aspects of a temporal merging candidate 700 for processing blocks according to some examples of the present disclosure. In some examples, as discussed above with respect to fig. 6, after a single temporal merge candidate is selected, the location of the temporal merging candidate may be selected between candidate positions C0 and C1, as depicted in fig. 7. In some examples, if the CU at position C0 is not available, is intra coded, or is outside the current row of CTUs, candidate position C1 may be used. Otherwise, position C0 is used in the derivation of the temporal merging candidate.
In some aspects, history-based motion vector prediction (HMVP) merge candidates may be added to the merge candidate list after spatial MVP merge candidates (e.g., as described above with respect to fig. 5) and TMVP merge candidates (e.g., as described above with respect to fig. 6 and 7). HMVP merge candidates may be derived based on motion information of previously coded blocks. For example, motion information of previously coded blocks may be stored (e.g., in a table) and used as Motion Vector Prediction (MVP) for the current CU. In some examples, a table with multiple HMVP candidates may be maintained during the encoding and/or decoding process. When a new CTU row is encountered, the table is reset (e.g., cleared). Whenever there is a non-sub-block inter coded CU, the associated motion information is added as a new HMVP candidate to the last entry of the table.
In some examples, the HMVP table size S may be set to a value of six (e.g., up to six history-based MVP (HMVP) candidates may be added to the HMVP table). A constrained first-in first-out (FIFO) rule may be utilized when inserting new HMVP candidates into the HMVP table. The constrained FIFO rule may include a redundancy check applied to determine whether the same HMVP exists in the table (e.g., to determine whether the newly inserted HMVP candidate is the same as an existing HMVP candidate in the table). If the redundancy check finds that the same HMVP is already included in the table, that HMVP may be removed from the table, and all HMVP candidates that follow are then moved forward.
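A sketch of the constrained-FIFO update described above is given below; the table is modeled as a simple list, which is an implementation assumption for illustration.

```python
# Constrained FIFO update for an HMVP table (illustrative only).
def update_hmvp_table(hmvp_table, new_candidate, table_size=6):
    if new_candidate in hmvp_table:
        # Redundancy check hit: remove the duplicate so later entries move forward.
        hmvp_table.remove(new_candidate)
    elif len(hmvp_table) == table_size:
        hmvp_table.pop(0)              # table full: drop the oldest entry
    hmvp_table.append(new_candidate)   # newest candidate goes to the last entry
    return hmvp_table
```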
In some aspects, the HMVP candidates (e.g., included in the HMVP list or HMVP table) may then be used to construct or otherwise generate a merge candidate list. For example, to generate a merge candidate list, the most recent HMVP candidates in the table may be checked in order and inserted into the merge candidate list after the TMVP candidate. A redundancy check may be applied to the HMVP candidates added to the merge candidate list, where the redundancy check is used to determine whether an HMVP candidate is identical to any spatial or temporal merge candidate previously added to or already included in the merge candidate list.
In some examples, the number of redundancy check operations performed in association with generating the merge candidate list and/or the HMVP table may be reduced. For example, the number of HMVP candidates used for merge list generation may be set to (N <= 4) ? M : (8 - N), where N indicates the number of existing candidates in the merge candidate list and M indicates the number of available HMVP candidates in the HMVP table. In other words, if the condition N <= 4 evaluates to true (e.g., the merge candidate list contains four or fewer candidates), then all M HMVP candidates in the HMVP table will be used for merge list generation. If the condition N <= 4 evaluates to false (e.g., the merge candidate list contains more than four candidates), then 8 - N HMVP candidates in the HMVP table will be used for merge list generation. In some examples, once the total number of available merge candidates reaches the maximum allowed number of merge candidates minus 1, the merge candidate list construction process from the HMVP table may be terminated.
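The candidate-count rule above can be expressed as a one-line helper (a sketch; variable names are illustrative):

```python
def num_hmvp_for_merge_list(n_existing, m_available):
    # All available HMVP candidates are used when the merge list has <= 4 entries;
    # otherwise only 8 - N of them are considered.
    return m_available if n_existing <= 4 else 8 - n_existing
```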
The pairwise average merge candidate derivation may be performed based on predefined merge candidate pairs in the existing merge candidate list. For example, a pairwise average merge candidate may be generated by averaging predefined merge candidate pairs in an existing merge candidate list, where the predefined pairs are given as { (0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3) }, where the numbers indicate the merge index of the merge candidate list. In some cases, the average motion vector may be calculated separately for each reference list. If two motion vectors (of a predefined pair, for example) are included in the same list, the two motion vectors may be averaged even when they point to different reference pictures. In some cases, if only one motion vector (e.g., of a predefined pair) is available, one available motion vector may be used as the average motion vector. If no motion vectors (e.g., of a predefined pair) are available, the list may be identified as invalid. In some examples, if the merge candidate list is not full after adding the pairwise average merge candidates as described above, one or more zero MVPs may be inserted at the end of the merge candidate list until the maximum number of merge candidates is reached.
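The per-reference-list behavior described above (average when both motion vectors exist, reuse the single available one otherwise) can be sketched as follows; the integer division used here is a simplification of the actual rounding.

```python
# Pairwise-average candidate derivation for one reference list (illustrative).
def pairwise_average(mv_a, mv_b):
    """mv_a/mv_b: (x, y) tuples, or None if unavailable in this reference list."""
    if mv_a is not None and mv_b is not None:
        return ((mv_a[0] + mv_b[0]) // 2, (mv_a[1] + mv_b[1]) // 2)
    if mv_a is not None:
        return mv_a
    if mv_b is not None:
        return mv_b
    return None  # neither MV available: this list is invalid for the pair
```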
In some aspects, bi-prediction with CU-level weights (BCW) may be utilized. For example, a bi-prediction signal may be generated by averaging two prediction signals obtained from two different reference pictures. Additionally or alternatively, two different motion vectors may be used to generate the bi-prediction signal. In some examples, according to the HEVC standard, the bi-prediction signal is generated by averaging two prediction signals obtained from two different reference pictures and/or using two different motion vectors.
In other aspects, the bi-prediction mode may be extended beyond simple averaging to include a weighted average of the two prediction signals. For example, the bi-predictive mode may include a weighted average of two predicted signals using a VVC standard. In some examples, the bi-prediction mode may include a weighted average of two prediction signals, as follows:
P_bi-pred = ((8 - w) * P_0 + w * P_1 + 4) >> 3    equation (1)
The weighted average bi-prediction given in equation (1) may include five weights, w ∈ {-2, 3, 4, 5, 10}. For each bi-predicted CU, the weight w may be determined according to one or more of the following. In one example, for non-merge CUs, the weight index may be signaled after the Motion Vector Difference (MVD). In another example, for a merge CU, the weight index may be inferred from neighboring blocks based on the merge candidate index. In some cases, BCW may only be applicable to CUs having 256 or more luma samples (e.g., CUs for which the product of CU width and CU height is greater than or equal to 256). In some cases, all five weights w may be used for low-delay pictures. For non-low-delay pictures, only three of the five weights w may be used (e.g., w ∈ {3, 4, 5}).
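Equation (1) can be written as a small helper; the weight list below reproduces the set given above, and which weights are permitted for a given picture type is outside this sketch.

```python
# BCW weighted bi-prediction for one sample, per equation (1) above.
BCW_WEIGHTS = [-2, 3, 4, 5, 10]

def bcw_sample(p0, p1, w):
    return ((8 - w) * p0 + w * p1 + 4) >> 3
```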
At the encoder, a fast search algorithm may be applied to find the weight index without significantly increasing the encoder complexity, as will be described further below. In some examples, a fast search algorithm may be applied to find the weight index, as described at least in part in JVET-L0646. For example, when combined with Adaptive Motion Vector Resolution (AMVR), unequal weights are only conditionally checked for 1-pixel and 4-pixel motion vector accuracy if the current picture is a low-delay picture. When combined with affine mode, affine Motion Estimation (ME) may be performed for unequal weights if and only if the affine mode is selected as the current best mode. When the two reference pictures in bi-prediction are the same, unequal weights are only conditionally checked. In some cases, unequal weights are not searched when certain conditions are met, e.g., depending on the Picture Order Count (POC) distance between the current picture and its reference pictures, the coding Quantization Parameter (QP), and/or the temporal level, etc.
In some examples, one context-coded bin followed by one or more bypass-coded bins may be used to code the BCW weight index. For example, the first, context-coded bin may be used to indicate whether equal weights are used. If unequal weights are used, additional bins may be signaled using bypass coding to indicate which unequal weight is used.
Weighted Prediction (WP) is a coding tool supported by the h.264/AVC and HEVC standards to efficiently code video content with fades. Support for WP is also added to the VVC standard. WP may be used to allow one or more weighting parameters (e.g., weights and offsets) to be signaled for each reference picture in each of the reference picture lists L0 and L1. Subsequently, during motion compensation, the weights and offsets of the corresponding reference pictures are applied.
WP and BCW can be used with different types of video content. In some examples, if the CU utilizes WP, the BCW weight index may not be signaled and it may be inferred that w has a value of 4 (e.g., equal weights are applied). For example, when a CU utilizes WP, BCW weight indexes may not be signaled in order to avoid interactions between WP and BCW (e.g., this may complicate VVC decoder design).
For a merge CU, in both normal merge mode and inherited affine merge mode, the weight index may be inferred from neighboring blocks based on the merge candidate index. For the constructed affine merge mode, affine motion information may be constructed based on motion information of up to three blocks. For example, the BCW index of the CU using the constructed affine merge mode may be set equal to the BCW index of the first control point MV. In some examples, using the VVC standard, combined inter-and intra-prediction (CIIP) and bi-prediction (BCW) with CU-level weights cannot be applied jointly to a CU. When coding a CU with the CIIP mode, the BCW index of the current CU may be set to 2 (e.g., set to equal weight).
Fig. 8 is a diagram 800 illustrating aspects of bilateral matching according to some examples of the present disclosure. As previously described, Bilateral Matching (BM) can be used to refine an initial pair of motion vectors MV0 and MV1. For example, BM may be performed by searching around MV0 and MV1 to derive refined motion vectors MV0' and MV1', respectively, which minimize the block matching cost. The block matching cost may be calculated based on the similarity between the two motion compensated predictors generated using the initial motion vector pair (e.g., MV0 and MV1). For example, the block matching cost may be based on the Sum of Absolute Differences (SAD). Additionally or alternatively, the block matching cost may be based on or include a regularization term based on the Motion Vector Difference (MVD) between the current MV pair (e.g., the MV0' and MV1' currently being tested) and the initial MV pair (e.g., MV0 and MV1). As will be described in greater detail below, one or more constraints may be applied based on MVD0 (e.g., MVD0 = MV0' - MV0) and MVD1 (e.g., MVD1 = MV1' - MV1).
As described previously, in the general video coding standard (VVC), decoder-side motion vector refinement (DMVR) based on bilateral matching may be applied to improve the accuracy of the MV of the bi-prediction merging candidate (e.g., refine the MV of the bi-prediction merging candidate). For example, as shown in the example of fig. 8, a DMVR based on bilateral matching may be applied to improve the accuracy of the MVs of the bi-predictive merge candidates 812 or otherwise refine the MVs of the bi-predictive merge candidates 812. The bi-predictive merge candidate 812 may be included in the current picture 810 and may be associated with the initial motion vector MV0 and MV1 pair. The initial motion vectors MV0 and MV1 for the bi-predictive merge candidate 812 may be obtained, identified, or otherwise determined prior to performing the bi-edge matching based DMVR. Subsequently, the initial motion vectors MV0 and MV1 may be used to identify refined motion vectors MV0 'and MV1' for the bi-predictive merge candidate 812, as will be described in greater depth below.
As shown in the example of fig. 8, a first initial motion vector MV0 may point to a first reference picture 830. The first reference picture 830 may be associated with a backward direction (e.g., relative to the current picture 810) and/or may be included in the reference picture list L0. The second initial motion vector MV1 may point to the second reference picture 820. The second reference picture 820 may be associated with a forward direction (e.g., relative to the current picture 810) and/or may be included in the reference picture list L1.
The first initial motion vector MV0 may be used to determine or generate a first predictor 832, which may be a block included in the first reference picture 830. The first predictor 832 may also be referred to as a first candidate block (e.g., the first predictor 832 is a candidate block in the first reference picture 830 and/or the reference picture list L0). The second initial motion vector MV1 may be used to determine or generate a second predictor 822, which may be a block included in the second reference picture 820. The second predictor 822 may also be referred to as a second candidate block (e.g., the second predictor 822 is a candidate block in the second reference picture 820 and/or the reference picture list L1).
Then, a search may be performed around the first predictor 832 and the second predictor 822 to identify or determine the first refined motion vector MV0 'and the second refined motion vector MV1', respectively. As shown, surrounding areas associated with each of the first predictor 832 (e.g., the first reference picture 830 and/or the area within the reference picture list L0) and the second predictor 822 (e.g., the second reference picture 820 and/or the area within the reference picture list L1) may be searched. For example, the surrounding area associated with the first predictor 832 may be searched to identify or examine one or more refined candidate blocks 834, and the surrounding area associated with the second predictor 822 may be searched to identify or examine one or more refined candidate blocks 824. The search may be performed based on one or more of the following: distortion (e.g., SAD, SATD, SSE, etc.) and/or regularization terms calculated between one of the initial predictors 832 or 822 and the corresponding refined candidate block 834 or 824. In some examples, distortion and/or regularization may be calculated based on a distance moved away from an initial point (e.g., a distance between the initial point associated with the initial predictor 832 or 822 and a search point associated with the refined candidate block 834 or 824, respectively).
As the search moves around the initial point associated with each of the initial predictors 832 and 822, new refined candidate blocks (e.g., refined candidate blocks 834 and 824) are obtained. Each new refined candidate block may be associated with a new cost (e.g., a calculated SAD value, one of the motion vector differences MVD0 or MVD1, etc.). The search may be associated with a search scope and/or a search window. After searching each candidate block included in the search scope and/or search window to obtain the initial predictors 832 and 822 (e.g., and determining the corresponding cost of each searched candidate block), the candidate block having the smallest determined cost may be identified and used to generate refined motion vectors MV0 'and MV1'.
In some examples, a Bilateral Matching (BM) based DMVR may be performed by calculating the SAD between two candidate blocks in reference picture list L0 and list L1. As shown in fig. 8, the SAD between the blocks based on each MV' candidate (e.g., blocks 834 and 824) around the original MVs (e.g., around predictors 832 and 822, respectively) may be calculated. The MV' candidate with the lowest SAD may be selected as the refined MV and used to generate the bi-prediction signal. In some examples, the SAD associated with the original MVs is decreased by 1/4 of the SAD value, which acts as a regularization term favoring the original MVs. In some cases, the temporal distances (e.g., Picture Order Count (POC) differences) from the two reference pictures to the current picture may be the same, and MVD0 and MVD1 may have the same magnitude but opposite signs (e.g., MVD0 = -MVD1).
In some cases, a refined search range of two integer luma samples from the initial MV may be used to perform a bilateral matching-based DMVR. For example, in the context of fig. 8, a refined search range of two integer luma samples from initial motion vectors MV0 and MV1 (e.g., from initial predictors 832 and 822, respectively) may be used to perform a bilateral matching-based DMVR. The search may include an integer sample offset search stage and a fractional sample refinement stage.
In some cases, a 25-point full search may be applied to an integer sample offset search. The 25-point full refinement search may be performed by first calculating the SAD of the initial MV pair (e.g., the initial MV pair of MV0 and MV1 and/or the initial predictors 832 and 822). If the SAD of the initial MV pair is less than the threshold, the integer sample phase of the DMVR may be terminated. If the SAD of the initial MV pair is not less than the threshold, the SAD of the remaining 24 points may be calculated and checked in raster scan order. The point with the smallest SAD may then be selected as the output of the integer sample offset search stage.
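An illustrative sketch of the integer sample offset search described above follows; the cost function, the threshold handling, and the exact search order are simplified assumptions.

```python
# 25-point integer offset search with early termination (illustrative sketch).
def dmvr_integer_search(cost_fn, search_range=2, early_exit_threshold=None):
    """cost_fn(dx, dy): bilateral matching SAD for offset (dx, dy) applied as
    +offset to MV0 and -offset to MV1."""
    best_offset, best_cost = (0, 0), cost_fn(0, 0)
    if early_exit_threshold is not None and best_cost < early_exit_threshold:
        return best_offset, best_cost        # initial MV pair is already good enough
    for dy in range(-search_range, search_range + 1):       # raster-scan order
        for dx in range(-search_range, search_range + 1):
            if (dx, dy) == (0, 0):
                continue
            cost = cost_fn(dx, dy)
            if cost < best_cost:
                best_offset, best_cost = (dx, dy), cost
    return best_offset, best_cost
```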
As described above, the integer sample offset search may be followed by fractional sample refinement. In some cases, computational complexity may be reduced by deriving fractional sample refinement using one or more parametric error surface equations (e.g., rather than performing additional searches using SAD comparisons). Fractional sample refinement may be conditionally invoked based on the output of the integer sample offset search stage. For example, when the integer sample offset search stage terminates with a minimum SAD at the center in either the first iteration or the second iteration search, fractional sample refinement may be further applied in response.
As described above, the parametric error surface may be used to derive a fractional sample refinement. For example, in a parametric error surface based subpixel offset estimation, the cost of a center location and the cost of four neighboring locations (e.g., relative to the center) may be used to fit a two-dimensional parabolic error surface equation of the form:
E(x, y) = A(x - x_min)^2 + B(y - y_min)^2 + C    equation (2)
Here, (x_min, y_min) corresponds to the fractional position with the smallest cost, and C corresponds to the minimum cost value. By solving equation (2) using the cost values of the five search points (e.g., the center position and the four neighboring positions), (x_min, y_min) may be computed as:
x_min = (E(-1, 0) - E(1, 0)) / (2 * (E(-1, 0) + E(1, 0) - 2 * E(0, 0)))    equation (3)
y_min = (E(0, -1) - E(0, 1)) / (2 * (E(0, -1) + E(0, 1) - 2 * E(0, 0)))    equation (4)
The values of x_min and y_min may be automatically constrained between -8 and 8 (e.g., since all cost values are positive and the smallest value is E(0, 0)), which corresponds to a half-pel offset in VVC with 1/16th-pel MV precision. The computed fractional offset (x_min, y_min) may be added to the integer-distance refinement MV (e.g., obtained from the integer sample offset search described above) to obtain the sub-pixel precision refinement delta MV.
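Equations (3) and (4) can be applied as in the following sketch, which maps the five integer-position costs to a clamped fractional offset; the floating-point arithmetic here stands in for the fixed-point derivation of the actual design.

```python
# Parametric error-surface sub-pel offset from five cost values (illustrative).
def parametric_subpel_offset(E):
    """E: dict mapping offsets {(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)} to costs."""
    denom_x = 2 * (E[(-1, 0)] + E[(1, 0)] - 2 * E[(0, 0)])
    denom_y = 2 * (E[(0, -1)] + E[(0, 1)] - 2 * E[(0, 0)])
    x_min = (E[(-1, 0)] - E[(1, 0)]) / denom_x if denom_x else 0.0
    y_min = (E[(0, -1)] - E[(0, 1)]) / denom_y if denom_y else 0.0
    clamp = lambda v: max(-0.5, min(0.5, v))   # half-pel bound (±8 in 1/16-pel units)
    return clamp(x_min), clamp(y_min)
```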
In VVC, the motion vector resolution may be 1/16 luminance samples. In some examples, samples at fractional locations are interpolated using an 8-tap interpolation filter. In DMVR, the refinement search point may surround an initial fractional pixel MV with an integer sample offset, and the DMVR search process may include interpolating samples of fractional locations. In some examples, a bilinear interpolation filter may be used to generate fractional samples for the DMVR search process. Using bilinear interpolation filters to generate fractional samples for the DMVR search process may reduce computational complexity. In some cases, when a bilinear interpolation filter is used with a 2-sample search range, the DMVR search process may not need to access additional reference samples (e.g., relative to existing motion compensation processes).
After determining the refined MV using the DMVR search procedure, an 8-tap interpolation filter may be applied to generate the final prediction. In some examples, no additional reference samples may be needed to perform the interpolation process based on the original MVs (e.g., as described above). In some examples, additional samples may be utilized to perform interpolation based on the refined MVs. Instead of accessing additional reference samples (e.g., additional reference samples relative to the existing motion compensation process), existing or available reference samples may be padded to generate additional reference samples for performing interpolation based on the refined MVs. In some cases, when the width and/or height of the CU is greater than 16 luma samples, the DMVR process may also include splitting the CU into sub-blocks each having a width and/or height equal to 16 luma samples, as will be described in greater depth below.
In VVC, DMVR may be applied to CUs that are coded with one or more of the following modes and features. In one illustrative example, a mode or feature associated with a CU to which DMVR may be applied may also be referred to as a "DMVR condition". In some examples, the DMVR conditions may include, but are not limited to, a CU coded with one or more of the following modes and/or features: CU-level merge mode with bi-prediction MVs; one reference picture in the past (e.g., relative to the current picture) and another reference picture in the future (e.g., relative to the current picture); the distances (e.g., POC differences) from the two reference pictures to the current picture are the same; both reference pictures are short-term reference pictures; the CU includes more than 64 luma samples; both the CU height and the CU width are greater than or equal to 8 luma samples; the BCW weight index indicates equal weights; Weighted Prediction (WP) is not enabled for the current block; the Combined Inter and Intra Prediction (CIIP) mode is not used for the current block; etc.
FIG. 9 illustrates an example extended CU region 900 that may be used to perform bidirectional optical flow (BDOF) according to some examples of this disclosure. For example, BDOF may be used to refine the bi-predictive signal of luma samples in a CU at the 4x4 sub-block level (e.g., 4x4 sub-block 910). In some cases, BDOF may be used for other sizes of sub-blocks (e.g., 8 x 8 sub-blocks, 4x 8 sub-blocks, 8 x4 sub-blocks, 16 x 16 sub-blocks, and/or other sizes of sub-blocks). BDOF mode may be optical flow based, assuming that the motion of the object is smooth. As shown in fig. 9, the BDOF mode may utilize one extended row and column (e.g., extended row 970 and extended column 980) around the boundary of the extended CU region 900.
For each 4x4 sub-block (e.g., 4x4 sub-block 910), a motion refinement (v_x, v_y) may be calculated based on minimizing the difference between the L0 and L1 prediction samples. The motion refinement (v_x, v_y) may then be used to adjust the bi-predicted sample values in the 4x4 sub-block (e.g., 4x4 sub-block 910). In one illustrative example, BDOF may be performed as follows.
First, the horizontal and vertical gradients of the two prediction signals, denoted ∂I^(k)/∂x(i, j) and ∂I^(k)/∂y(i, j) with k = 0, 1, may be calculated by directly computing the difference between two neighboring samples:
∂I^(k)/∂x(i, j) = (I^(k)(i + 1, j) >> shift1) - (I^(k)(i - 1, j) >> shift1)    equation (5)
∂I^(k)/∂y(i, j) = (I^(k)(i, j + 1) >> shift1) - (I^(k)(i, j - 1) >> shift1)    equation (6)
Here, I^(k)(i, j) denotes the sample value at coordinate (i, j) of the prediction signal in list k, k = 0, 1. The value of shift1 may be calculated based on the luminance bit depth (bitDepth); for example, shift1 may be set equal to 6.
The auto-correlation and cross-correlation of the gradients, S_1, S_2, S_3, S_5, and S_6, may then be calculated as:
S_1 = Σ_{(i,j)∈Ω} Abs(ψ_x(i, j))    equation (7)
S_2 = Σ_{(i,j)∈Ω} ψ_x(i, j) · sign(ψ_y(i, j))    equation (8)
S_3 = Σ_{(i,j)∈Ω} θ(i, j) · (-sign(ψ_x(i, j)))    equation (9)
S_5 = Σ_{(i,j)∈Ω} Abs(ψ_y(i, j))    equation (10)
S_6 = Σ_{(i,j)∈Ω} θ(i, j) · (-sign(ψ_y(i, j)))    equation (11)
Here,
ψ_x(i, j) = (∂I^(1)/∂x(i, j) + ∂I^(0)/∂x(i, j)) >> shift3    equation (12)
ψ_y(i, j) = (∂I^(1)/∂y(i, j) + ∂I^(0)/∂y(i, j)) >> shift3    equation (13)
θ(i, j) = (I^(0)(i, j) >> shift2) - (I^(1)(i, j) >> shift2)    equation (14)
Here, Ω may be a 6x6 window (or other sized window) around the 4x4 sub-block 910 (or other sized sub-block). The value of shift2 may be set equal to 4 (or other suitable value), and the value of shift3 may be set equal to 1 (or other suitable value).
The cross-correlation and auto-correlation terms may then be used to derive the motion refinement (v_x, v_y). In this derivation, th'_BIO = 1 << 4 and ⌊·⌋ is the floor function.
Based on the motion refinement and the gradients, the following adjustment may be calculated for each sample in the 4x4 sub-block 910 (or other sized sub-block):
b(x, y) = rnd((v_x · (∂I^(1)/∂x(x, y) - ∂I^(0)/∂x(x, y))) / 2) + rnd((v_y · (∂I^(1)/∂y(x, y) - ∂I^(0)/∂y(x, y))) / 2)    equation (17)
Finally, the BDOF samples of the extended CU region 900 shown in fig. 9 may be calculated by adjusting the bi-prediction samples as follows:
pred_BDOF(x, y) = (I^(0)(x, y) + I^(1)(x, y) + b(x, y) + o_offset) >> shift5    equation (18)
Here, shift5 may be set equal to Max(3, 15 - BitDepth), and the variable o_offset may be set equal to 1 << (shift5 - 1).
In some cases, the values may be selected such that the multipliers in the BDOF process (e.g., described above) do not exceed 15 bits and the maximum bit width of the intermediate parameters in the BDOF process (e.g., described above) remains within 32 bits.
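The per-sub-block BDOF flow of equations (5)-(18) can be outlined as in the following much-simplified, floating-point sketch. The integer shifts, clipping thresholds, and the exact refinement formulas for v_x and v_y are intentionally omitted, so this is an illustration of the data flow rather than the normative derivation.

```python
# Greatly simplified BDOF sketch: gradients, correlation sums, per-sample adjustment.
def sign(v):
    return (v > 0) - (v < 0)

def bdof_subblock(I0, I1):
    """I0, I1: 2-D lists of L0/L1 prediction samples covering the sub-block plus a
    one-sample border so gradients can be taken at every interior position."""
    h, w = len(I0), len(I0[0])

    def grad(I, y, x):                      # (horizontal, vertical) gradient
        return I[y][x + 1] - I[y][x - 1], I[y + 1][x] - I[y - 1][x]

    S1 = S2 = S3 = S5 = S6 = 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx0, gy0 = grad(I0, y, x)
            gx1, gy1 = grad(I1, y, x)
            psi_x, psi_y = gx0 + gx1, gy0 + gy1
            theta = I0[y][x] - I1[y][x]
            S1 += abs(psi_x)
            S5 += abs(psi_y)
            S2 += psi_x * sign(psi_y)
            S3 += theta * -sign(psi_x)
            S6 += theta * -sign(psi_y)

    # Simplified refinement (no clipping); the full design also uses S2 and vx
    # when deriving vy.
    vx = -S3 / S1 if S1 > 0 else 0.0
    vy = -S6 / S5 if S5 > 0 else 0.0

    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            gx0, gy0 = grad(I0, y, x)
            gx1, gy1 = grad(I1, y, x)
            b = (vx * (gx1 - gx0) + vy * (gy1 - gy0)) / 2
            row.append((I0[y][x] + I1[y][x] + b) / 2)   # averaged bi-prediction
        out.append(row)
    return out
```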
In some examples, gradient values may be derived based at least in part on generating one or more prediction samples I^(k)(i, j) in list k (where k = 0, 1) that are located outside the current CU boundary. As depicted in fig. 9 and mentioned above, the BDOF may utilize one extended row and one extended column (e.g., extended row 970 and extended column 980) around the boundary of the extended CU region 900. In some cases, the computational complexity of generating out-of-boundary prediction samples may be controlled by generating the prediction samples in the extended region (e.g., the non-shaded blocks included in extended row 970 and extended column 980, along the perimeter of the extended CU region 900) by directly taking the reference samples at nearby integer locations without interpolation. For example, a floor() operation may be used on the coordinates of the nearby reference samples. An 8-tap motion compensated interpolation filter may then be used to generate the prediction samples within the shaded boxes of the extended CU region 900 (e.g., the boxes inside the non-shaded perimeter formed by extended row 970 and extended column 980). In some examples, the extended sample values are used only in the gradient calculation. The remaining steps of the BDOF process may be performed by padding (e.g., repeating) any samples and gradient values outside the boundaries of the extended CU region 900 from their nearest neighbors, as needed.
In some examples, as described above, BDOF may be used to refine the bi-predictive signal of a CU at the 4 x 4 sub-block level (or other sub-block level). In one illustrative example, BDOF may be applied to a CU if the CU meets some or all of the following conditions: a CU is coded using a "true" bi-prediction mode (e.g., one of the two reference pictures is located before the current picture in display order, and the other reference picture is located after the current picture in display order); the CU is not coded using affine mode or SbTMVP merge mode; a CU has more than 64 luminance samples; both CU height and CU width are greater than or equal to 8 luma samples; the BCW weight index indicates equal weights; WP is not enabled for the current CU; and/or the CIIP mode is not used for the current CU; etc.
In some aspects, multi-pass decoder side motion refinement may be used. For example, at the JVET-V meeting, an Enhanced Compression Model (ECM) was established to investigate compression techniques beyond VVC (https://vcgit.hhi.fraunhofer.de/ECM/VVCSoftware_VTM/-/tree/ECM). In the ECM, the DMVR of VVC is replaced with "multi-pass decoder side motion refinement" [JVET-U0100]. Multi-pass decoder side motion refinement may include multiple passes. In some examples, multi-pass decoder side motion refinement may be performed by applying Bilateral Matching (BM) multiple times. For example, in each pass (e.g., of multi-pass decoder side motion refinement), the BM may be applied to a different block size.
In the first pass, Bilateral Matching (BM) may be applied to the entire coding block or coding unit (e.g., similar to DMVR in VVC and/or as described above). For example, the first pass BM may be applied to coding blocks or CUs of sizes 64x64, 64x32, 32x16, 16x16, etc., and combinations thereof.
In the second pass, the BM may be applied again, this time to each 16x16 sub-block included within (or generated from) the entire coding block on which the BM was performed in the first pass. For example, if the first pass BM is applied to a 64x64 block, the 64x64 block may be divided into a 4x4 grid of sub-blocks, with each sub-block being 16x16. In the second pass, the BM may be applied to each of the 16 sub-blocks in the example above. In another example, if the first pass BM is applied to a 32x32 block, the 32x32 block may be divided into a 2x2 grid of sub-blocks, each of size 16x16. In this example, the second pass may be performed by applying the BM to each of the four 16x16 sub-blocks of the 32x32 block. The refined MVs generated or obtained from the first pass BM may be used as the initial MVs for each 16x16 sub-block on which the second pass BM is performed.
In the third pass, multiple 8x8 sub-blocks may be obtained or generated based on the original block (e.g., from the first BM pass) and/or based on the 16x16 sub-blocks (e.g., from the second BM pass). In some examples, each 16x16 sub-block from the second BM pass may be partitioned into a 2x2 grid of sub-blocks having a size of 8x8. In the third BM pass, one or more MVs associated with each 8x8 sub-block may be further refined by applying bi-directional optical flow (BDOF).
In one illustrative example, multi-pass decoder side motion refinement may be performed for a 64x64 block or CU. In the first pass, the BM may be applied to the 64x64 block to generate or obtain a refined MV pair. In the second pass, the 64x64 block may be divided into 16 sub-blocks, where each sub-block is 16x16 in size. In the second pass, the BM may be applied again, this time to each of the 16 sub-blocks, with the refined MVs from the first pass used as the initial MVs. In the third pass, each 16x16 sub-block may be divided into four sub-blocks having a size of 8x8 (e.g., for the original 64x64 block, 16 x 4 = 64 8x8 sub-blocks in total). As described previously, in the third pass, one or more MVs associated with each 8x8 sub-block may be refined by applying BDOF.
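The three-pass structure described above can be summarized with a short sketch; the helper names refine_bm and refine_bdof are placeholders for the per-region refinement steps (not actual codec functions), and the block is assumed to have a width and height that are multiples of 16.

```python
def multi_pass_refinement(block_w, block_h, mv0, mv1, refine_bm, refine_bdof):
    # Pass 1: bilateral matching over the whole coding block.
    mv0_p1, mv1_p1 = refine_bm(0, 0, block_w, block_h, mv0, mv1)

    # Pass 2: bilateral matching per 16x16 sub-block, seeded with the pass-1 MVs.
    pass2 = {}
    for y in range(0, block_h, 16):
        for x in range(0, block_w, 16):
            pass2[(x, y)] = refine_bm(x, y, 16, 16, mv0_p1, mv1_p1)

    # Pass 3: BDOF refinement per 8x8 sub-block, seeded with the parent 16x16 MVs.
    pass3 = {}
    for (px, py), (mv0_p2, mv1_p2) in pass2.items():
        for dy in range(0, 16, 8):
            for dx in range(0, 16, 8):
                pass3[(px + dx, py + dy)] = refine_bdof(
                    px + dx, py + dy, 8, 8, mv0_p2, mv1_p2)
    return pass3
```

For a 64x64 block this yields 16 second-pass refinements and 64 third-pass refinements, matching the counts in the example above.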
Examples of the first pass, second pass, and third pass that may be included in or used to perform multi-pass decoder side motion refinement are described in turn below.
In some examples, the first pass may include performing block-based Bilateral Matching (BM) Motion Vector (MV) refinement. For example, in the first pass, refined MVs are derived by applying the BM to a coding block. Similar to decoder-side motion vector refinement (DMVR), in the bi-prediction operation associated with the BM, refined MVs are searched around the two initial MVs (e.g., MV0 and MV1) in the reference picture lists L0 and L1. The refined MVs (e.g., MV0_pass1 and MV1_pass1) are derived around the initial MV pair (e.g., MV0 and MV1, respectively) based on the minimum bilateral matching cost between the two reference blocks in L0 and L1.
The first pass BM may include performing a local search to derive an integer sample precision intDeltaMV. The local search may be performed by applying a 3 x 3 square search pattern (or other search pattern) to loop over the search range [-sHor, sHor] in the horizontal direction and the search range [-sVer, sVer] in the vertical direction. The values of sHor and sVer may be determined by the block size. In some cases, the maximum value of sHor and/or sVer may be 8 (or another suitable value).
The bilateral matching cost can be calculated as:
bilCost = mvDistanceCost + sadCost    equation (19)
When the block size cbW x cbH is greater than 64 (or another block size threshold), a mean-removed sum of absolute differences (MRSAD) cost function may be applied to remove the DC effect of distortion between the reference blocks. The intDeltaMV local search terminates when the bilCost at the center point of the 3 x 3 search pattern (or other search pattern) has the minimum cost. Otherwise, the current minimum-cost search point becomes the new center point of the 3 x 3 search pattern (or other search pattern), and the local search continues searching for the minimum cost until the end of the search range is reached (e.g., [-sHor, sHor] in the horizontal direction and [-sVer, sVer] in the vertical direction).
In some cases, existing fractional sample refinement may be further applied to derive the final deltaMV. The refined MVs after the first pass can then be derived as:
MV0_pass1 = MV0 + deltaMV    equation (20)
MV1_pass1 = MV1 - deltaMV    equation (21)
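The first-pass local search can be sketched as follows: a 3 x 3 square pattern is walked over the integer search range, the bilateral cost of equation (19) is evaluated at each point, and the search terminates when the center of the pattern already has the minimum cost. The callable cost_fn is a placeholder; a real implementation would compute the SAD (or MRSAD) between the two motion-compensated reference blocks plus the MV-distance term.

```python
def local_square_search(cost_fn, s_hor, s_ver):
    # Integer-precision 3x3 square search for intDeltaMV; cost_fn(dx, dy) is
    # assumed to return bilCost = mvDistanceCost + sadCost for a displacement.
    center = (0, 0)
    best_cost = cost_fn(*center)
    while True:
        best = center
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                cx, cy = center[0] + dx, center[1] + dy
                # Stay inside the search range [-sHor, sHor] x [-sVer, sVer].
                if abs(cx) > s_hor or abs(cy) > s_ver:
                    continue
                c = cost_fn(cx, cy)
                if c < best_cost:
                    best_cost, best = c, (cx, cy)
        if best == center:
            # The center point already has the minimum cost: terminate.
            return center, best_cost
        center = best
```

The returned displacement would then play the role of deltaMV in equations (20) and (21), optionally after fractional sample refinement.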
In some examples, the second pass may include performing sub-block based Bilateral Matching (BM) Motion Vector (MV) refinement. For example, in the second pass, a refined MV may be derived by applying the BM to a 16x16 (or other size) grid sub-block. For each sub-block, refined MVs are searched around the two MVs obtained from the first pass (e.g., MV0_pass1 and MV1_pass1) in the reference picture lists L0 and L1.
Based on the search, two refined sub-block MVs, i.e., MV0_pass2(sbIdx2) and MV1_pass2(sbIdx2), can be derived based on the minimum bilateral matching cost between the two reference sub-blocks in L0 and L1. Here, sbIdx2 = 0, ..., N-1 is the index of the sub-block (e.g., because the second pass BM may be applied to each sub-block generated from the original block used in the first pass BM). For example, as previously described, a total of 16 sub-blocks, each having a size of 16x16, may be generated or obtained for an input block having a size of 64x64. In this example, sbIdx2 may be the index of an individual sub-block among the 16 sub-blocks.
For each sub-block, the second pass BM may include performing a full search to derive integer sample precision intDeltaMV. The full search may have a search range of [ -sHor, sHor ] in the horizontal direction and a search range of [ -sVer, sVer ] in the vertical direction. The values of sHor and sVer may be determined by the block size, and the maximum value of sHor and sVer may be 8 (or other suitable value).
Fig. 10 is a diagram illustrating an example search region in a Coding Unit (CU) 1000 according to some examples of the present disclosure. For example, fig. 10 depicts four different search areas (e.g., a first search area 1020, a second search area 1030, a third search area 1040, a fourth search area 1050, etc.) in coding unit 1000. In some cases, the bilateral matching cost may be calculated by applying a cost factor to the SATD cost (or other cost function) between two reference sub-blocks, as follows:
bilCost = satdCost × costFactor    equation (22)
The search area (2 × sHor + 1) × (2 × sVer + 1) is divided into five diamond-shaped search areas, as depicted in fig. 10. In other aspects, other search areas may be used. Each search area is assigned a costFactor value, which is determined by the distance between each search point and the starting MV (intDeltaMV). Each diamond-shaped search area (e.g., search areas 1020, 1030, 1040, and 1050) may be processed sequentially, beginning at the center of the search area. Within each search area, the search points are processed in raster scan order, starting from the upper-left corner and proceeding to the lower-right corner of the area.
In some examples, the first search area 1020 is searched, then the second search area 1030, then the third search area 1040, and then the fourth search area 1050. When the minimum bilCost within the current search area is less than or equal to a threshold equal to sbW × sbH, the integer-pel search ends; otherwise, the integer-pel search continues to the next search area until all search points are examined.
In some cases, VVC DMVR fractional sample refinement may be further applied to derive the final deltaMV(sbIdx2). The refined MVs of the second pass can then be derived as:
MV0_pass2(sbIdx2) = MV0_pass1 + deltaMV(sbIdx2)    equation (23)
MV1_pass2(sbIdx2) = MV1_pass1 - deltaMV(sbIdx2)    equation (24)
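A hedged sketch of the second-pass full search with diamond-shaped regions and a distance-dependent cost factor is given below. The grouping of search points into regions and the concrete costFactor values are illustrative assumptions; the description above only states that each region's factor depends on the distance of the search point from the starting MV and that the search ends early when the minimum cost in a region is at most sbW × sbH.

```python
def full_search_diamond(cost_fn, s_hor, s_ver, sb_w, sb_h,
                        cost_factors=(1.0, 1.1, 1.2, 1.3, 1.4)):
    # Integer full search over [-sHor, sHor] x [-sVer, sVer], grouped into
    # diamond-shaped rings by L1 distance from the start point (an assumption).
    # cost_fn(dx, dy) returns the SATD-based cost between the two sub-blocks.
    num_regions = len(cost_factors)
    regions = [[] for _ in range(num_regions)]
    max_dist = s_hor + s_ver
    for dy in range(-s_ver, s_ver + 1):
        for dx in range(-s_hor, s_hor + 1):
            ring = min((abs(dx) + abs(dy)) * num_regions // (max_dist + 1),
                       num_regions - 1)
            regions[ring].append((dx, dy))

    best, best_cost = (0, 0), float("inf")
    threshold = sb_w * sb_h
    for ring, factor in enumerate(cost_factors):
        # Points within a region are visited in raster-scan order.
        for dx, dy in sorted(regions[ring], key=lambda p: (p[1], p[0])):
            c = cost_fn(dx, dy) * factor
            if c < best_cost:
                best_cost, best = c, (dx, dy)
        if best_cost <= threshold:
            break  # Early termination once the minimum cost is small enough.
    return best, best_cost
```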
In some examples, the third pass may include performing sub-block based bi-directional optical flow (BDOF) Motion Vector (MV) refinement. For example, in the third pass, the refined MV may be derived by applying BDOF to an 8x8 (or other size) grid sub-block. For each 8x8 sub-block, BDOF refinement may be applied to derive scaled Vx and Vy values without clipping, starting from the refined MVs of the parent sub-block from the second pass. For example, each parent sub-block of the second pass may be 16x16 in size (e.g., each parent sub-block of the second pass may be associated with four 8x8 sub-blocks used in the third pass). The third pass BDOF refinement may be applied or performed based on the refined MVs (i.e., MV0_pass2(sbIdx2) and MV1_pass2(sbIdx2)), where sbIdx2 is the index of one of the parent sub-blocks of the second pass.
The derived bioMv (Vx, Vy) can then be rounded to 1/16 sample precision and clipped between -32 and 32 (or another sample precision and/or other clipping values or ranges).
The refined MVs of the third pass (MV0_pass3(sbIdx3) and MV1_pass3(sbIdx3)) can be derived as:
MV0_pass3(sbIdx3) = MV0_pass2(sbIdx2) + bioMv    equation (25)
MV1_pass3(sbIdx3) = MV1_pass2(sbIdx2) - bioMv    equation (26)
Here, sbIdx3 is the index of a particular 8x8 sub-block among the 8x8 sub-blocks utilized in the third pass, and sbIdx2 is the index of a particular 16x16 sub-block among the 16x16 sub-blocks used in the second pass. In some examples, sbIdx3 may have a range such that each 8x8 sub-block can be uniquely identified by its sbIdx3 index. For example, a 64x64 block may be divided into 64 8x8 sub-blocks, and sbIdx3 may then take 64 unique index values for the different 8x8 sub-blocks. In some examples, sbIdx3 may instead have a range such that each 8x8 sub-block can be uniquely identified by its sbIdx3 value in combination with the corresponding sbIdx2 value of its 16x16 parent sub-block. For example, a 64x64 block may be divided into a total of 16 sub-blocks, each sub-block having a 16x16 size, and each 16x16 sub-block may be further divided into four sub-blocks, each having an 8x8 size. In this example, sbIdx2 may take one of 16 unique values and sbIdx3 may take one of four unique values, such that each of the 64 8x8 sub-blocks is identifiable based on the corresponding sbIdx2 and sbIdx3 indices.
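The indexing relationship between the second-pass 16x16 sub-blocks (sbIdx2) and the third-pass 8x8 sub-blocks (sbIdx3), together with the MV update of equations (25) and (26), can be sketched as below. The per-sub-block BDOF derivation itself is abstracted behind a placeholder derive_bio_mv callable, and the dictionary-based interface is an assumption for illustration.

```python
def third_pass_update(mv0_pass2, mv1_pass2, derive_bio_mv):
    # mv0_pass2 / mv1_pass2: dicts mapping sbIdx2 -> (mvx, mvy) for the 16x16
    # sub-blocks of the second pass. derive_bio_mv(sb_idx2, sb_idx3) is assumed
    # to return the scaled, rounded, clipped bioMv (vx, vy) for one 8x8 sub-block.
    mv0_pass3, mv1_pass3 = {}, {}
    for sb_idx2, (m0x, m0y) in mv0_pass2.items():
        m1x, m1y = mv1_pass2[sb_idx2]
        # Each 16x16 parent sub-block contains four 8x8 sub-blocks (sbIdx3 = 0..3).
        for sb_idx3 in range(4):
            vx, vy = derive_bio_mv(sb_idx2, sb_idx3)
            mv0_pass3[(sb_idx2, sb_idx3)] = (m0x + vx, m0y + vy)
            mv1_pass3[(sb_idx2, sb_idx3)] = (m1x - vx, m1y - vy)
    return mv0_pass3, mv1_pass3
```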
Aspects for improving the above-described search strategies are described herein. Aspects described herein may be applied to one or more coding (e.g., encoding, decoding, or combined encoding-decoding) techniques, such as one or more coding techniques in which motion vectors are used to predict a block of a current picture from one or more reference pictures (e.g., two reference blocks from two respective reference pictures), wherein the motion vectors are refined by a refinement technique. These improvements may be applied to any suitable video coding standard or format (e.g., HEVC, VVC, AV 1) as described above, other existing standards or formats that apply coding to blocks based on reference blocks from two respective reference pictures, and any future standard that uses such techniques. In general, when two reference blocks from two reference pictures are used, such techniques are commonly referred to as bi-predictive merge mode and bilateral matching techniques.
The various aspects described herein may allow different coding blocks to have different search strategies (or methods) for bilateral matching, rather than following the exact search strategy described above. The selected search strategy for a block may be signaled in one or more syntax elements coded in the bitstream. The search policy includes constraints/relationships between MVD0 and MVD1 imposed during the bilateral matching search process, and may also be associated with some combination of search patterns, search ranges or maximum search rounds, cost criteria, and the like. In some cases, constraints may also be referred to as limitations.
The systems and techniques described herein may be used to perform adaptive Bilateral Matching (BM) for decoder-side motion vector refinement (DMVR). For example, systems and techniques may perform adaptive bilateral matching for DMVR by applying different search strategies and/or search methods for different coded blocks. In some aspects, the selected search strategy for the block may be signaled using one or more syntax elements coded in the bitstream. In some examples, the selected search strategy for the block may be signaled explicitly, implicitly, or using a combination thereof. As will be described in greater detail below, the selected search strategy (e.g., to be used to perform an adaptive BM for a DMVR for a given block) may include constraints or relationships between MVD0 and MVD1, where the search strategy constraints are utilized or applied during the bilateral matching search process. In some examples, the search policy may additionally or alternatively include one or more combinations of search patterns, search ranges, maximum search rounds, cost criteria, and the like.
In one illustrative example, the systems and techniques described herein may use one or more Motion Vector Difference (MVD) constraints to perform adaptive bilateral matching for DMVR. As previously described, the motion vector difference may be used to represent the difference between the initial motion vector and the refined motion vector (e.g., MVD0 = MV0' - MV0 and MVD1 = MV1' - MV1).
In some examples, an MVD constraint may be selected for a given bilateral matching block (e.g., included in a selected search policy). For example, the MVD constraint may be a mirror constraint in which MVD0 and MVD1 have the same magnitude but opposite signs (e.g., MVD0 = -MVD1). The MVD mirror constraint may also be referred to herein as a "first constraint".
In another example, the MVD constraint may set MVD0 = 0 (e.g., both the x and y components of MVD0 are zero). For example, the MVD0 = 0 constraint may be applied by keeping MV0 fixed while searching around MV1 to derive the refined MV1', with MV0' = MV0. The MVD0 = 0 constraint may also be referred to herein as a "second constraint".
In another example, the MVD constraint may set MVD1 = 0 (e.g., both the x and y components of MVD1 are zero). For example, the MVD1 = 0 constraint may be applied by keeping MV1 fixed while searching around MV0 to derive the refined MV0', with MV1' = MV1. The MVD1 = 0 constraint may also be referred to herein as a "third constraint".
In another example, an MVD constraint may allow searching around MV0 independently to derive MV0' and searching around MV1 independently to derive MV1'. This MVD-independent search constraint may also be referred to herein as a "fourth constraint".
As will be described further below, in some cases, only the first and second constraints may be provided as options (e.g., included in a selected or signaled search strategy) for each bilateral matching block. In some cases, only the first and third constraints are provided as options for each bilateral matching block (e.g., included in a selected or signaled search strategy). In some cases, only the first and fourth constraints are provided as options for each bilateral matching block (e.g., included in a selected or signaled search strategy). In some cases, only the second and third constraints are provided as options for each bilateral matching block (e.g., included in a selected or signaled search strategy). In some cases, first, second, and third constraints are provided as options (e.g., included in a selected or signaled search strategy) for each bilateral matching block. Any other combination of constraints may be provided as an option (e.g., included in a selected or signaled search strategy) for each bilateral matching block.
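For clarity, the four constraint options can be captured in a small helper that maps a single candidate search displacement to the (MVD0, MVD1) pair actually applied. This is only a didactic sketch using the first/second/third/fourth labels from above; the string names are hypothetical.

```python
def apply_constraint(constraint, delta):
    # Map one search displacement delta = (dx, dy) to the (MVD0, MVD1) pair
    # implied by the selected constraint.
    dx, dy = delta
    if constraint == "mirror":        # first constraint: MVD0 = -MVD1
        return (dx, dy), (-dx, -dy)
    if constraint == "fix_mv0":       # second constraint: MVD0 = 0
        return (0, 0), (dx, dy)
    if constraint == "fix_mv1":       # third constraint: MVD1 = 0
        return (dx, dy), (0, 0)
    if constraint == "independent":   # fourth constraint: MV0 and MV1 searched separately
        raise ValueError("the independent search uses two separate displacements")
    raise ValueError(f"unknown constraint {constraint!r}")
```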
In some aspects, one or more syntax elements may be signaled (e.g., in a bitstream), wherein the one or more syntax elements include a value to indicate whether one or more of the constraints are applied. In some examples, syntax elements may be used to indicate or determine a particular one of the MVD constraints described above to apply to a given bilateral matching block.
In one illustrative example, a first syntax element may be used to indicate whether the first constraint is applied. For example, the first syntax element may be used to indicate whether the mirror constraint (e.g., MVD0 = -MVD1) should be applied to a given bilateral matching block for which the first syntax element is signaled. In some cases, the first syntax element may have a first value when the mirror constraint is to be applied, and the first syntax element may have a second value when the mirror constraint is not to be applied. In some cases, the presence of the first syntax element may be used to infer (e.g., implicitly signal) that the mirror constraint is to be applied to a given or current bilateral matching block, while the absence of the first syntax element may be used to infer (e.g., implicitly signal) that the mirror constraint should not be applied.
Continuing with the example above, if the first syntax element indicates that the first constraint (e.g., the MVD mirror constraint, MVD0 = -MVD1) is not applied, a second syntax element may be used to indicate which of the remaining constraints is to be applied. For example, the second syntax element may be used to indicate whether the second constraint (e.g., MVD0 = 0) or the third constraint (e.g., MVD1 = 0) should be applied to a given or current bilateral matching block for which the syntax element is signaled. In some examples, the second syntax element may have a first value when the second constraint is to be applied, and the second syntax element may have a second value when the third constraint is to be applied. In some cases, the presence of the second syntax element may be used to infer (e.g., implicitly signal) that a predetermined one of the second constraint or the third constraint should be applied, while the absence of the second syntax element may be used to infer (e.g., implicitly signal) that the remaining one of the second constraint or the third constraint should be applied.
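Assuming the two syntax elements are coded as simple flags, the decoding decision just described could look like the sketch below. The names mirror_flag and second_constraint_flag are hypothetical; the text does not fix specific syntax element names or binarizations.

```python
def parse_constraint(read_flag):
    # read_flag() is assumed to return the next signalled flag value (0 or 1).
    mirror_flag = read_flag()             # hypothetical first syntax element
    if mirror_flag:
        return "mirror"                   # first constraint: MVD0 = -MVD1
    second_constraint_flag = read_flag()  # hypothetical second syntax element
    # Which flag value maps to which constraint is an assumption here.
    return "fix_mv0" if second_constraint_flag else "fix_mv1"
```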
In some examples, the one or more syntax elements (e.g., the first syntax element and/or the second syntax element) may include mode information, and the selected constraint from the MVD constraints described above may be determined based on the mode indicated by the mode information (e.g., the merge mode). For example, the one or more syntax elements may include mode information indicating different merge modes for the current bilateral matching block. In one illustrative example, when a regular merge candidate satisfies the DMVR conditions described above, the first constraint (e.g., the mirror constraint, MVD0 = -MVD1) may be applied to a regular (e.g., standard or default, such as extended merge prediction in VVC merge mode) merge mode coded block.
As previously described, the DMVR conditions may require that a CU be coded using one or more of the following modes and features. In some examples, the DMVR conditions may include, but are not limited to, a CU coded with one or more of the following modes and/or features: CU-level merge mode with bi-predictive MVs; one reference picture in the past (e.g., relative to the current picture) and another reference picture in the future (e.g., relative to the current picture); the distance (e.g., POC difference) from the two reference pictures to the current picture is the same; both reference pictures are short-term reference pictures; a CU comprising 64 or more luminance samples; both CU height and CU width are greater than or equal to 8 luma samples; the BCW weight index indicates equal weights; Weighted Prediction (WP) is not enabled for the current block; the Combined Inter and Intra Prediction (CIIP) mode is not used for the current block; etc.
In another illustrative example, when the coding block uses a specified new merge mode (e.g., an adaptive bilateral matching mode as described herein) and all merge candidates meet the DMVR conditions, one of the second constraint (e.g., MVD0 = 0) or the third constraint (e.g., MVD1 = 0) may be applied. In some cases, the second constraint and/or the third constraint are additionally or alternatively indicated by a mode flag or merge index. For example, the indication of or selection between the second constraint and the third constraint may be determined based on a mode flag or a merge index.
In some examples, one or more syntax elements described herein may include an index to a merge candidate list. The selected constraint may then be determined by an index indicating a merge candidate selected from the merge candidate list (e.g., the constraint depends on the selected merge candidate). In another example, the syntax element may include a combination of a mode flag and an index.
In some aspects, the systems and techniques described herein may signal (e.g., explicitly and/or implicitly) a selected search strategy that may be used to perform the multi-level (e.g., multi-pass) DMVR described above. In some examples, the selected search policy may be applied in one of the passes of the multi-pass DMVR. In another example, the selected search policy may be applied in multiple levels or passes of the multi-pass DMVR (but, in some cases, not in all of the levels or passes).
For the example of the three-pass DMVR described above, in one illustrative example, the selected policy (e.g., PU-level bilateral matching) may be applied only in the first pass. The second and third passes (e.g., which perform bilateral matching for a first sub-block size and BDOF for a second sub-block size smaller than the first sub-block size, respectively) may be performed using a default policy (e.g., the default policy described above with respect to fig. 10) and the standardized three-pass structure. In one illustrative example, a multi-pass DMVR (e.g., the three-pass DMVR described above) may perform PU-level bilateral matching in the first pass using a second search policy (e.g., the second constraint MVD0 = 0) and/or a third search policy (e.g., the third constraint MVD1 = 0). The subsequent passes (e.g., the second and third passes) may use a default search strategy that includes the first constraint (e.g., the mirror constraint MVD0 = -MVD1).
In some aspects, the search strategies may be grouped into subsets. In some examples, the selected subset may be determined using one or more syntax elements. In some cases, the selected policy within a given subset may be implicitly determined. In one illustrative example, the first constraint (e.g., the mirror constraint MVD0 = -MVD1) may be included in a first subset, and both the second constraint (e.g., MVD0 = 0) and the third constraint (e.g., MVD1 = 0) may be included in a second subset. A syntax element may be used to indicate whether the second subset is used. If the second subset is used (e.g., based on the corresponding syntax element), the selection or determination between applying the second constraint or the third constraint may be made implicitly. For example, the implicit determination may be based on the minimum matching cost between bilateral matching using the second constraint and bilateral matching using the third constraint. The second constraint may be selected if it is determined that bilateral matching using the second constraint provides a smaller matching cost than bilateral matching using the third constraint; otherwise, the third constraint is selected.
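The implicit choice within the second subset could be realized as a simple cost comparison, as sketched below; bm_cost is a placeholder that evaluates the minimum bilateral matching cost achievable under a given constraint.

```python
def select_from_second_subset(bm_cost):
    # bm_cost(constraint) is assumed to return the minimum bilateral matching
    # cost found when searching under that constraint.
    cost_fix_mv0 = bm_cost("fix_mv0")   # second constraint: MVD0 = 0
    cost_fix_mv1 = bm_cost("fix_mv1")   # third constraint:  MVD1 = 0
    # The second constraint is selected only if it gives a strictly smaller cost.
    return "fix_mv0" if cost_fix_mv0 < cost_fix_mv1 else "fix_mv1"
```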
In some examples, one or more aspects of the systems and techniques described herein may be used with or applied based on an Enhanced Compression Model (ECM). For example, in an ECM, a multi-pass DMVR may be applied to conventional (e.g., standard or default, such as extended merge prediction in VVC merge mode) merge mode candidates having one or more features identical or similar to those described above. For example, one or more features may be the same as or similar to some (or all) of the DMVR conditions described above. As described above, the merge candidate refers to a candidate block from which information (e.g., one or more motion vectors, prediction modes, etc.) is inherited for coding (e.g., encoding and/or decoding) a current block, wherein the candidate block may be a block adjacent to the current block. For example, the merge candidate may be an inter-prediction PU that includes a motion data location selected from a set of spatially-adjacent motion data locations and one of two temporally-co-located motion data locations.
By default, the multi-pass DMVR in the ECM may be performed based on applying the first constraint (e.g., the mirror constraint MVD0 = -MVD1). In one illustrative example, the systems and techniques described herein may utilize an adaptive bilateral matching mode as a new mode for the multi-pass DMVR. In some examples, the adaptive bilateral matching mode may also be referred to as "adaptive_bm_mode". Under adaptive_bm_mode, a merge index may be signaled to indicate the selected motion information candidate. However, all candidates in the candidate list may satisfy the DMVR conditions. In some cases, a flag (e.g., bm_merge_flag) may be used to signal or indicate that the adaptive bilateral matching mode is used. For example, if the flag is true (e.g., bm_merge_flag is equal to 1), adaptive_bm_mode may be used or applied (e.g., as will be described below). In some aspects, when the flag is true (e.g., bm_merge_flag is equal to 1), an additional flag (e.g., bm_dir_flag) may be used to signal or indicate the bm_dir value to be used in adaptive_bm_mode.
In some aspects, if adaptive_bm_mode=1, then adaptive bilateral matching may be performed, and if adaptive_bm_mode=0, then adaptive bilateral matching is not performed. In one illustrative example, an adaptive bilateral matching process (e.g., associated with adaptive_bm_mode=1) may be performed based at least in part on applying a second constraint or a third constraint to the selected candidate (e.g., fixing MVD0 or MVD1 equal to 0, respectively). In some cases, the variable bm_dir may be used to indicate which constraint is applied and/or to indicate the selected constraint. For example, when performing adaptive bilateral matching (e.g., signaled or determined based on adaptive_bm_mode), bm_dir=1 may be used to indicate or signal that adaptive bilateral matching should be performed by fixing MVD1 to 0 (e.g., the third constraint should be applied). In some examples, bm_dir=2 may be used to indicate or signal that adaptive bilateral matching should be performed by fixing MVD0 to 0 (e.g., the second constraint should be applied).
In some cases, if adaptive bilateral matching is not performed, a conventional merge mode may be utilized. As previously mentioned, the regular merge mode may be determined or signaled based on adaptive_bm_mode=0 (which indicates that adaptive bilateral matching is not performed). In some examples, when utilizing the conventional merge mode (e.g., because adaptive bilateral matching is not performed), the systems and techniques may utilize an inferred value of bm_dir=3, which indicates or signals that MVD0 and MVD1 are not fixed and that MVD0 = -MVD1 (e.g., the first, mirror constraint should be applied). In some examples, bm_dir=3 may be explicitly signaled or used to indicate a regular merge mode.
In some examples, the systems and techniques may perform bilateral matching with one or more modified bilateral matching operations. In some aspects, when bm_dir=3, the bilateral matching process may be the same as the three-pass bilateral matching process described previously. For example, given an initial MV pair, a first predictor is generated by a first MV referencing a first reference picture and a second predictor is generated by a second MV referencing a second reference picture. Subsequently, a refined MV pair is derived by minimizing the BM cost between a refined first predictor (e.g., generated using a first refined MV) and a refined second predictor (e.g., generated using a second refined MV), where the motion vector differences between the refined MVs and the initial MVs are MVD0 and MVD1, and MVD0 = -MVD1.
In some aspects, when bm_dir=1, only the first MV is refined, while the second MV remains fixed. For example, the refined first MV may be derived by minimizing BM costs between the second predictor generated by the second MV and the refined first predictor generated by the refined first MV. For example, when bm_dir=1, MV0 may be refined, while MV1 remains fixed. The refined motion vector MV0' may be derived by minimizing BM costs between the predictor generated based on MV1 and the refined predictor generated based on MV0'.
In some examples, when bm_dir=2, only the second MV is refined, while the first MV remains fixed. For example, the refined second MV may be derived by minimizing BM costs between the first predictor generated by the first MV and the refined second predictor generated by the refined second MV. For example, when bm_dir=2, MV1 may be refined, while MV0 remains fixed. The refined motion vector MV1' may be derived by minimizing BM costs between the predictor generated based on MV0 and the refined predictor generated based on MV1'.
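The bm_dir-dependent refinement can be sketched as follows: one predictor is held fixed while candidate refined MVs for the other list are evaluated, and the candidate minimizing the BM cost is kept. The callables predict and bm_cost, and the candidate generation, are placeholders for illustration.

```python
def one_sided_refinement(bm_dir, mv0, mv1, candidates, predict, bm_cost):
    # predict(list_idx, mv) returns a predictor block; bm_cost(p0, p1) returns
    # the bilateral matching cost between two predictors. 'candidates' is an
    # iterable of candidate refined MVs for the MV being searched.
    if bm_dir == 1:                       # refine MV0, keep MV1 fixed
        fixed_pred, moving_list = predict(1, mv1), 0
    elif bm_dir == 2:                     # refine MV1, keep MV0 fixed
        fixed_pred, moving_list = predict(0, mv0), 1
    else:
        raise ValueError("bm_dir must be 1 or 2 for one-sided refinement")

    best_mv, best_cost = None, float("inf")
    for mv in candidates:
        cost = bm_cost(fixed_pred, predict(moving_list, mv))
        if cost < best_cost:
            best_cost, best_mv = cost, mv
    return best_mv, best_cost
```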
In some aspects, BM costs (e.g., as described above) may additionally or alternatively include regularization terms based on or derived from Motion Vector Differences (MVDs). In one illustrative example, the multi-pass DMVR search process may include a regularization term determined based on MV costs that depend on the refined MV locations.
In some aspects, one or more of the bilateral matching modifications described above may be applied only in the first pass of a multi-pass bilateral matching process (e.g., the PU-level DMVR pass). In other aspects, one or more of the bilateral matching modifications described above may be applied in the first pass and the second pass (e.g., in the PU-level and sub-PU-level DMVR passes).
In some examples, the systems and techniques described herein may perform multi-pass bilateral matching DMVR using more or fewer passes than those described in the examples above. For example, fewer than three passes may be utilized and/or more than three passes may be used. In some examples, any number of passes may be used, and the passes may be configured in any manner. In some aspects, the adaptive_bm_mode may be similarly applied to multi-pass designs. In some aspects, the second pass may be skipped under adaptive_bm_mode. In some aspects, both the second and third passes may be skipped under adaptive_bm_mode. In other aspects, any combination of passes (e.g., pass operations from the three-pass scheme described above) may be combined with repeated passes or other pass types based on particular adaptive bilateral matching criteria.
In some aspects of the multi-pass DMVR, different search patterns may be used for different search levels and/or search accuracies. For example, a square search may be used for both integer and half-pixel searches, which may be applied to perform PU-level DMVR (e.g., the first pass). In some examples, for sub-PU level DMVR (e.g., the second pass and/or the third pass), a full search may be used for the integer search and a square search may be used for the half-pixel search. In some examples, one or more (or all) of the search patterns described above may be utilized based on a determination that bm_dir is equal to a first value (value 1), a second value (value 2), or a third value (value 3). In one example, one or more (or all) of the search patterns described above may be utilized based on a determination that bm_dir=3.
The following aspects describe example search patterns and/or search processes when bm_dir is equal to 1 or 2. In one aspect, when bm_dir=1 and/or when bm_dir=2, the same search pattern as when bm_dir=3 may be used. In another aspect, bm_dir=1 and bm_dir=2 may use a different search pattern than when bm_dir=3. For example, in one example, a full search may be used for integer searches in the PU-level DMVR and a square search may be used for half-pixel searches in the PU-level DMVR.
In some aspects, the same search range and/or maximum number of search rounds may be used for different bm_dir values. In other aspects, when bm_dir=1 or 2, different search ranges and/or different maximum numbers of search rounds may be used. For example, in the case of a full search with different cost factors assigned to different MVDs, one or more MVD regions may be skipped. For example, as shown in fig. 10, the search area of CU 1000 is divided into a plurality of search areas (e.g., a first search area 1020, a second search area 1030, a third search area 1040, a fourth search area 1050). In some cases, regions that are far from the center of the search area (e.g., far from the first search area 1020) may be skipped.
In some examples, in the conventional merge mode, all search regions may be searched. For example, with respect to fig. 10, in the conventional merge mode, four search areas 1020, 1030, 1040, and 1050 may be searched along with a fifth search area including the remaining blocks of CU 1000 that have not been included in one of the four search areas 1020-1050. As described above, in some cases, a search area far from the center of the search area may be skipped. For example, in the adaptive bilateral matching mode (e.g., when bm_dir=1 or 2), only the first three search areas associated with CU 1000 of fig. 10 may be searched (e.g., in the adaptive bilateral matching mode, only the first search area 1020, the second search area 1030, and the third search area 1040 may be searched).
In some aspects of multi-pass DMVR, SAD or mean-removed SAD (e.g., depending on the PU size) may be used for the integer and half-pixel searches associated with the PU-level DMVR pass. In some cases, SATD may be used for sub-PU level DMVR. In some aspects, the same cost criteria may be used for different bm_dir values. For example, the current cost criterion selection in the ECM may be used for all bm_dir values. In other aspects, the cost criterion selection may be different for different bm_dir values. For example, when bm_dir=3, the cost criterion selection in the current ECM may be applied. Depending on the PU size, SAD or mean-removed SAD may be used for integer and half-pixel searching in the PU-level DMVR. When bm_dir=1 or 2, the PU-level DMVR process may use SATD for integer searching and SAD for half-pixel searching.
As previously described, for some aspects, the candidates in the candidate list satisfy the DMVR conditions. In an additional aspect, bm_dir may be set equal to 1 or 2, as indicated by an additional bm_dir flag included in adaptive_bm_mode. In some cases, the candidate list for adaptive_bm_mode is generated based on the conventional merge candidate list. For example, one or more candidates determined to satisfy the DMVR conditions in the conventional merge candidate list may be inserted into the candidate list of adaptive_bm_mode.
In another additional aspect, whether bm_dir is set equal to 1 or 2 may be indicated by the merge index within adaptive_bm_mode. The candidate list of adaptive_bm_mode may be generated on the basis of the conventional merge candidate list. For each candidate in the conventional merge candidate list that satisfies the DMVR conditions, a pair of two candidates may be inserted into the adaptive_bm_mode candidate list, where one candidate has bm_dir=1 and the other candidate has bm_dir=2, and where both candidates in the pair have the same motion information. In some examples, bm_dir may be determined by determining whether the merge index is even or odd.
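One reading of the pairing rule above is sketched below: every regular merge candidate that satisfies the DMVR conditions contributes two entries with identical motion information, and the parity of the resulting merge index selects bm_dir. The predicate satisfies_dmvr_conditions and the exact parity convention are assumptions.

```python
def build_adaptive_bm_list(regular_candidates, satisfies_dmvr_conditions):
    # Build the adaptive_bm_mode candidate list from the regular merge list.
    # Each qualifying candidate is inserted twice: once with bm_dir = 1 and
    # once with bm_dir = 2, both carrying the same motion information.
    bm_list = []
    for cand in regular_candidates:
        if satisfies_dmvr_conditions(cand):
            bm_list.append({"motion": cand, "bm_dir": 1})
            bm_list.append({"motion": cand, "bm_dir": 2})
    return bm_list

def bm_dir_from_merge_index(merge_index):
    # With the pairing above, even indices map to bm_dir = 1 and odd indices
    # to bm_dir = 2 (the exact parity convention is an assumption).
    return 1 if merge_index % 2 == 0 else 2
```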
In one illustrative example, the candidate list for adaptive_bm_mode may be generated independently of a conventional merge candidate list. In some cases, the generation of the candidate list of adaptive_bm_mode may follow the same or similar procedure as the generation of the candidate list of conventional merge mode (e.g., checking for the same spatial, temporal neighborhood, history-based candidates, paired candidates, etc.). In some cases, pruning may be applied during the list construction process.
In another aspect, one or more candidates associated with bi-prediction with CU-level weight (BCW) weight index indicating unequal weights may also be added to the candidate list (e.g., where the BCW weight index may indicate equal weights as compared to some systems with DMVR conditions).
In some examples, a padding process may be applied if the number of candidates in the candidate list of adaptive_bm_mode and/or the candidate list of the conventional merge mode is less than a predefined maximum. For example, once the candidate list for adaptive_bm_mode is generated, there may be fewer candidates in the list than the predefined maximum number of candidates. In such an example, a padding process may be applied to generate a plurality of padding candidates for the candidate list such that the padded candidate list includes the predefined number of candidates. In one illustrative example, where adaptive bilateral matching is enabled (e.g., adaptive_bm_mode=1), one or more default candidates may be used for padding in the merge list construction. The default candidates may be derived such that they satisfy the DMVR conditions.
In some examples, for the default candidates, the MV may be set to zero. For example, zero-MV candidates may be added during the padding process. In some examples, the reference picture may be selected according to the DMVR conditions. In some cases, the reference index may loop through all possible values until the number of candidates in the candidate list reaches the maximum number of candidates (e.g., a predefined maximum). In another aspect, the BCW weight index may be used to indicate equal weights for the regular candidates, and one or more non-equal-weight BCW candidates may be added after the regular candidates and before the zero candidates are added.
In some aspects, a reference picture assigned to a default candidate (e.g., a reference picture assigned to the above-described zero-padding MV candidate) may be selected to satisfy one or more conditions associated with adaptive_bm_mode. An illustrative example of such a condition may include one or more of the following conditions and/or may include other conditions not listed herein: selecting at least one pair of reference pictures including one reference picture in the past and one reference picture in the future with respect to the current picture; the respective distances from the two reference pictures to the current picture are equal; neither reference picture is a long-term reference picture; the two reference pictures have the same resolution as the current picture; weighted Prediction (WP) is not applied to any reference picture; any combination thereof; and/or other conditions.
In some aspects, one or more reference pictures may be assigned to the default candidates (e.g., the zero-MV candidates filled as described above) based on selecting reference pictures that satisfy one or more conditions associated with or based on a specified or selected constraint (e.g., the second constraint, MVD0 = 0, or the third constraint, MVD1 = 0). In some examples, the reference picture may be selected to satisfy one or more conditions associated with a given constraint, where the given constraint is determined based on bm_dir (e.g., as described previously). For example, one or more conditions based on bm_dir may apply only to MVs for which refinement is performed (e.g., if bm_dir=1). An illustrative example of such a condition may include one or more of the following conditions and/or may include other conditions not listed herein: the reference pictures in a given list X are not long-term reference pictures; the reference picture in list X has the same resolution as the current picture; Weighted Prediction (WP) is not applied to the reference pictures in list X; the respective distance from the first reference picture in list X to the current picture is not less than the respective distance from the other reference picture (e.g., the second reference picture in list X) to the current picture; any combination thereof; and/or other conditions.
In some cases, list X may be the same as list 0 (e.g., list L0) if bm_dir indicates that the MVs in list 0 are refined. In some cases, list X may be the same as list 1 (e.g., list L1) if bm_dir indicates that the MVs in list 1 are refined. In some aspects, all possible zero-MV candidates may be found by cycling through all possible combinations of reference pictures in a particular order and identifying reference pictures that meet the predefined conditions. In one illustrative example, a first loop may be performed for list 0 and a second loop may be performed for list 1. In another example, a first loop may be performed for list 1 and a second loop may be performed for list 0. Other orderings are also possible and should be considered within the scope of the present disclosure. The process of determining possible zero-MV candidates (e.g., identifying reference pictures that meet predefined conditions by cycling through combinations of reference pictures in a particular order) may be performed at a slice level, a picture level, or another level. The list of identified default MV candidates (e.g., zero-MV candidates) may be stored as default candidates. In some cases, when possible zero-MV candidates are determined at the block level and the number of candidates is less than the predefined maximum number of candidates, the systems and techniques described herein may loop through the default candidates and add one or more of the default candidates to the candidate list until the number of candidates reaches the predefined maximum.
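The padding step could be sketched as follows: zero-MV default candidates are generated by looping over reference index pairs, keeping only pairs that satisfy the required reference picture conditions, until the list reaches the predefined maximum. The loop order and the predicate ref_pair_is_valid are placeholders for the condition checks described above.

```python
def pad_candidate_list(bm_list, num_ref_l0, num_ref_l1,
                       ref_pair_is_valid, max_candidates):
    # Append zero-MV default candidates until the list holds max_candidates
    # entries (or all reference index combinations are exhausted).
    for ref0 in range(num_ref_l0):
        for ref1 in range(num_ref_l1):
            if len(bm_list) >= max_candidates:
                return bm_list
            if not ref_pair_is_valid(ref0, ref1):
                continue
            bm_list.append({
                "mv0": (0, 0), "mv1": (0, 0),   # zero-MV default candidate
                "ref_idx_l0": ref0, "ref_idx_l1": ref1,
            })
    return bm_list
```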
In some aspects, one or more size constraints may be included in and/or utilized by the adaptive_bm_mode described herein. In one aspect, the same size constraint may be applied in adaptive_bm_mode as in conventional DMVR. In another aspect, adaptive_bm_mode is not applied if neither the width nor the height of the current block is greater than the minimum block size of the DMVR.
In some aspects, adaptive_bm_mode may be signaled as an additional merge mode beyond the regular merge mode. In some examples, various signaling methods may be applied or utilized to signal adaptive_bm_mode as an additional merge mode. For example, adaptive_bm_mode may be considered or signaled as a variant of the regular merge mode. In one illustrative example, one or more syntax elements may first be signaled to indicate the regular merge mode, and one or more additional flags and/or syntax elements may be signaled to indicate adaptive_bm_mode and/or to indicate a particular one of the constraints that may be applied in association with the use of adaptive_bm_mode (e.g., the second constraint MVD0 = 0 or the third constraint MVD1 = 0).
In another aspect, adaptive_bm_mode may be indicated by one or more flags before indicating the normal merge mode. For example, if the syntax (e.g., the one or more first syntax elements) indicates that the current block does not use adaptive_bm_mode, one or more additional syntax elements may be signaled to indicate whether the current block uses a regular merge mode or other merge modes. For example, if the current block does not use adaptive_bm_mode or a conventional merge mode, one or more additional syntax elements may be signaled to indicate that the current block uses other merge modes, such as Combined Inter and Intra Prediction (CIIP), geometric Partitioning Mode (GPM), and so on.
In yet another aspect, adaptive_bm_mode may be signaled in other merge mode branches. For example, adaptive_bm_mode may be signaled in the template matching merge mode branch. In some cases, one or more syntax elements may first be signaled to indicate whether one of adaptive_bm_mode and template matching merge mode is used. If one or more syntax elements indicate that either the adaptive_bm_mode or the template matching merge mode is used, one or more additional flags or syntax elements may be signaled to indicate which of the template matching merge mode and the adaptive_bm_mode is used.
In some aspects, the merge index in adaptive_bm_mode may use the same signaling method as in conventional merge mode. In one aspect, adaptive_bm_mode may use the same (or similar) context model as the conventional merge mode. In another aspect, a separate context model may be used for adaptive_bm_mode. In some examples, the maximum number of merge candidates for adaptive_bm_mode may be different from the maximum number of merge candidates for the normal merge mode.
In one illustrative example, one or more high level syntax elements may be used to indicate whether adaptive_bm_mode may be applied or will be applied. In one aspect, the same high level syntax used to indicate whether conventional DMVR is to be applied may also be used to indicate whether adaptive_bm_mode is to be applied. In another aspect, one or more separate (e.g., additional) high level syntax elements may be used to indicate whether adaptive_bm_mode should be applied. In another aspect, a separate high level syntax element may be used to indicate whether adaptive_bm_mode is utilized, where the separate high level syntax element for adaptive_bm_mode exists only when conventional DMVR is enabled. For example, if a separate high level syntax element associated with a conventional DMVR is determined to indicate that the conventional DMVR is not enabled or utilized, the separate or additional high level syntax element associated with the adaptive_bm_mode is not signaled and the adaptive_bm_mode is inferred to be off (e.g., not enabled or utilized).
In some aspects, adaptive_bm_mode may be disabled for coded pictures or slices according to available reference pictures in addition to one or more of the high-level syntax elements described above. In some cases, if it is determined that no combination of reference pictures meets or can meet the reference picture condition, adaptive_bm_mode may be disabled and the corresponding syntax element is not signaled (e.g., at the block level). In some cases, at least one pair of reference pictures must satisfy the reference picture condition in order to utilize adaptive_bm_mode. An illustrative example of such a condition may include one or more of the following conditions and/or may include other conditions not listed herein: one reference picture is in the past with respect to the current picture and one reference picture is in the future with respect to the current picture; the respective distances from the two reference pictures to the current picture are equal; neither reference picture is a long-term reference picture; the two reference pictures have the same resolution as the current picture; weighted Prediction (WP) is not applied to any reference picture; any combination thereof; and/or other conditions.
The above conditions may be used alone or in combination.
In some aspects, depending on the reference pictures, only a subset of the adaptive bilateral matching modes described herein may be enabled (e.g., a first adaptive bilateral matching mode associated with the second constraint MVD0 = 0, and a second adaptive bilateral matching mode associated with the third constraint MVD1 = 0). Examples of conditions that only allow bm_dir=1 (e.g., associated with the second constraint MVD0 = 0) or bm_dir=2 (e.g., associated with the third constraint MVD1 = 0) may include one or more of the following conditions and/or may include other conditions not listed herein: one of the reference pictures is a long-term reference picture, and the other reference picture is not a long-term reference picture; one of the reference pictures has the same resolution as the current picture, but the other reference picture has a different resolution from the current picture; Weighted Prediction (WP) is applied to one of the reference pictures; the respective distance from one of the reference pictures to the current picture is not less than the respective distance from the other reference picture to the current picture; any combination thereof; and/or other conditions.
In some cases, the syntax element that identifies the bilateral matching mode may not be signaled and may be inferred accordingly (e.g., at the block level). In some examples, rather than explicitly signaling the syntax element (e.g., as part of a particular syntax table in the bitstream), the syntax element may be implicitly signaled (e.g., inferred). In some cases, the syntax element may be neither explicitly signaled nor implicitly signaled, and the syntax element may be inferred. For example, if bm_dir of the first value is enabled (e.g., bm_dir=1 is enabled), but bm_dir of the second value is disabled (e.g., bm_dir=2 is disabled), the syntax element for indicating bm_dir at the block level may not be signaled. In the absence of this syntax element, bm_dir may be inferred to be the first value (e.g., bm_dir is inferred to be 1). In another example, if bm_dir of the second value is enabled (e.g., bm_dir=2), but bm_dir of the first value is disabled (e.g., bm_dir=1), the syntax element for indicating bm_dir at the block level may not be signaled. In the absence of this syntax element, bm_dir may be inferred to be the second value (e.g., bm_dir is inferred to be 2).
In another example, a slice-level flag and/or a picture-level flag may be used for adaptive_bm_mode. For example, if at least one of the above conditions is not met, a slice-level flag and/or a picture-level flag may be used as a bitstream conformance constraint, i.e., the flag is set to 0 (e.g., DMVR mode is disabled).
In yet another example, a bitstream conformance constraint can be introduced into existing signaling, wherein the bitstream conformance constraint indicates: if at least one of the above conditions is not satisfied, adaptive_bm_mode should not be applied, and the corresponding overhead is set to 0 (e.g., indicating that adaptive_bm_mode is not used).
In some aspects of the ECM, multi-hypothesis prediction (MHP) may be utilized. In MHP, inter prediction techniques may be used to obtain or determine a weighted superposition of more than two motion-compensated prediction signals. The resulting overall prediction signal may be obtained by performing a sample-by-sample weighted superposition. For example, based on a uni-/bi-prediction signal p_uni/bi and a first additional inter prediction signal/hypothesis h_3, the resulting prediction signal p_3 can be obtained as follows:
p_3 = (1 - α) · p_uni/bi + α · h_3    equation (27)
Here, the weighting factor α may be specified by the syntax element add_hyp_weight_idx according to the following mapping:
TABLE 2
In some examples, more than one additional prediction signal may be used. In some cases, more than one additional prediction signal may be utilized in the same or similar manner as described above. For example, when using multiple additional prediction signals, the resulting overall prediction signal may be iteratively accumulated with each additional prediction signal as follows:
p_(n+1) = (1 - α_(n+1)) · p_n + α_(n+1) · h_(n+1)    equation (28)
Here, the resulting overall prediction signal is obtained as the last p_n (i.e., the p_n with the largest index n).
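The iterative accumulation of equation (28) can be illustrated in a few lines; the weights are assumed to have already been mapped from add_hyp_weight_idx, and the array-based interface is an assumption.

```python
import numpy as np

def mhp_accumulate(p_base, hypotheses, alphas):
    # p_base: the uni-/bi-prediction signal. hypotheses/alphas: equally long
    # sequences of additional prediction signals and their weighting factors.
    p = p_base.astype(np.float64)
    for h, alpha in zip(hypotheses, alphas):
        p = (1.0 - alpha) * p + alpha * h.astype(np.float64)
    return p  # the last accumulated signal is the overall prediction
```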
In some aspects, MHP may not be applied (e.g., disabled) to any adaptive_bm_mode. In some aspects, MHP may be applied over adaptive_bm_mode in the same or similar manner that MHP is applied to a standardized (e.g., conventional) merge mode.
Fig. 11 is a flow chart illustrating an example of a process 1100 for processing video data. In some examples, process 1100 may be used to perform decoder-side motion vector refinement (DMVR) with adaptive bilateral matching, according to some examples of the present disclosure. In some aspects, the process 1100 may be implemented in an apparatus for processing video data that includes a memory and one or more processors coupled to the memory, the one or more processors configured to perform the operations of the process 1100. In other aspects, the process 1100 may be implemented in a non-transitory computer-readable medium storing instructions that, when executed by one or more processors of a device, cause the device to perform the operations of the process 1100.
At block 1102, the process 1100 includes: obtaining one or more reference pictures for a current picture (e.g., for a current block of the current picture). For example, the one or more reference pictures may be obtained based on one or more inputs 114 of the decoding device 112 shown in fig. 1. In some examples, the one or more reference pictures and the current picture may be obtained from video data obtained by the decoding device 112 shown in fig. 1 or provided to the decoding device 112.
At block 1104, the process 1100 includes: a first motion vector and a second motion vector for merging mode candidates are identified. For example, the first motion vector and/or the second motion vector may be identified by the decoding device 112 shown in fig. 1. In some examples, the first motion vector and/or the second motion vector may be identified using the decoder engine 116 of the decoding device 112 shown in fig. 1. In some cases, the signaled information may be used to identify one or more (or both) of the first motion vector and the second motion vector. For example, the encoding device 104 shown in fig. 1 may include signaling information that may be used by the decoding device 112 and/or the decoding engine 116 to identify one or more (or both) of the first motion vector and the second motion vector. In some cases, process 1100 may include: a merge mode candidate for the current picture is determined. As mentioned herein, the merge mode candidate may include a neighboring block of the block from which prediction data may be inherited for the block of the current picture. For example, the merge mode candidate may be determined by the decoding device 112 shown in fig. 1. In some examples, the merge mode candidates may be determined using the decoder engine 116 of the decoding device 112 shown in fig. 1. In some examples, decoding device 112 and/or decoding engine 116 may use the signaling information to determine merge mode candidates for the current picture. In some cases, the merge mode candidate may be determined using the same signaling information used to identify one or more (or both) of the first motion vector and the second motion vector associated with the merge mode candidate. In some cases, separate signaling information may be used to determine the merge mode candidate and the first and second motion vectors.
At block 1106, the process 1100 includes: a selected motion vector search strategy for merging mode candidates is determined from a plurality of motion vector search strategies. In some aspects, the selected motion vector search policy is associated with one or more constraints based on or corresponding to the first motion vector and/or the second motion vector. In one illustrative example, the selected motion vector search policy may be a Bilateral Matching (BM) motion vector search policy. In some cases, the motion vector search strategy for merging mode candidates may be selected from a plurality of motion vector search strategies including at least two of: a multi-pass decoder side motion vector refinement strategy, a fractional sample refinement strategy, a bi-directional optical flow strategy, or a sub-block based bilateral matching motion vector refinement strategy. In some examples, the selected motion vector search strategy may be determined prior to determining the merge mode candidate. For example, a selected motion vector search strategy (e.g., as described above) may be determined and used to generate a merge candidate list. The merge mode candidate may be determined based on a selection from the generated merge candidate list. In some examples, the selected merge mode candidate may be determined prior to determining the selected motion vector search strategy. For example, in some cases, a merge candidate list may be generated without using the selected search strategy (e.g., the generated merge candidate list may be the same for each respective search strategy of the plurality of search strategies), and the selected merge candidate may be determined prior to the selected search strategy.
In some examples, the selected motion vector search strategy may be a multi-pass decoder side motion vector refinement (DMVR) search strategy. For example, the multi-pass DMVR search strategy may include one or more block-based bilateral matching motion vector refinement passes and may also include one or more sub-block-based motion vector refinement passes. In some examples, one or more block-based bilateral matching motion vector refinement passes may be performed using a first constraint associated with the first motion vector difference and/or the second motion vector difference. The first motion vector difference may be a difference determined between the first motion vector and the refined first motion vector. The second motion vector difference may be a difference determined between the second motion vector and the refined second motion vector. In some examples, one or more sub-block based motion vector refinement passes may be performed using a second constraint that is different from the first constraint. As described above, the second constraint may be associated with at least one of the first motion vector difference and/or the second motion vector difference. In some cases, the one or more sub-block based motion vector refinement passes may include at least one of a sub-block based Bilateral Matching (BM) motion vector refinement pass and/or a sub-block based bi-directional optical flow (BDOF) motion vector refinement pass.
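The pass structure described above can be outlined as follows. This is a minimal Python sketch under the assumption of one block-level BM pass followed by per-sub-block passes; the stub function stands in for the actual refinement, and the real pass count, ordering, and constraints are defined by the multi-pass DMVR strategy itself.

def refine_pass(mv0, mv1, constraint):
    # Stub standing in for a block-based BM pass, a sub-block BM pass, or a
    # sub-block BDOF pass; a real pass would search around the input motion
    # vectors subject to the given constraint and return refined vectors.
    return mv0, mv1

def multi_pass_dmvr(mv0, mv1, num_subblocks=4):
    # Pass 1: block-based bilateral matching over the whole block, using a
    # first constraint on the motion vector differences (e.g., mirroring).
    mv0, mv1 = refine_pass(mv0, mv1, constraint="mirror")
    # Later passes: per-sub-block refinement (sub-block BM and/or BDOF),
    # which may use a second, different constraint.
    return [refine_pass(mv0, mv1, constraint="mirror") for _ in range(num_subblocks)]

print(multi_pass_dmvr((1, 0), (-1, 0)))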
In some examples, the selected motion vector search policy is associated with one or more constraints corresponding to at least one of the first motion vector or the second motion vector (e.g., as described above). The one or more constraints may be determined based on the one or more signaled syntax elements. For example, one or more constraints for a block of a current picture may be determined based on syntax elements signaled for the block. In some aspects, the one or more constraints are associated with at least one of a first motion vector difference associated with the first motion vector (e.g., a difference between the first motion vector and the refined first motion vector) and a second motion vector difference associated with the second motion vector (e.g., a difference between the second motion vector and the refined second motion vector). In some examples, the one or more constraints may include mirrored constraints for the first motion vector difference and the second motion vector difference. The mirror constraint may set the first motion vector difference and the second motion vector difference to have equal magnitudes (e.g., absolute values) but opposite signs. In some cases, the one or more constraints may include zero-value constraints for the first motion vector difference (e.g., setting the first motion vector difference to zero). In some examples, the one or more constraints may include zero-value constraints for the second motion vector difference (e.g., setting the second motion vector difference to zero). In some aspects, a zero-value constraint may indicate that the motion vector difference is kept fixed. For example, based on zero-value constraints, process 1100 may include: the one or more refined motion vectors are determined using the selected motion vector search strategy by maintaining a first one of the first motion vector difference or the second motion vector difference at a fixed value and searching with respect to a second one of the first motion vector difference or the second motion vector difference. For example, a first motion vector difference may be fixed and a search may be performed around a second motion vector difference to derive a refined motion vector.
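The effect of the constraints described above on the search can be illustrated with a short Python sketch that enumerates integer-sample motion vector difference (MVD) candidates. The search range, search pattern, and constraint names below are assumptions made only for illustration.

def candidate_mvd_pairs(search_range=2, constraint="mirror"):
    # Enumerate (MVD0, MVD1) candidate pairs under a given constraint.
    pairs = []
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            if constraint == "mirror":        # equal magnitude, opposite sign
                pairs.append(((dx, dy), (-dx, -dy)))
            elif constraint == "zero_mvd0":   # MVD0 fixed at zero, search around MVD1
                pairs.append(((0, 0), (dx, dy)))
            elif constraint == "zero_mvd1":   # MVD1 fixed at zero, search around MVD0
                pairs.append(((dx, dy), (0, 0)))
    return pairs

# With the mirrored constraint, every candidate pair satisfies MVD1 == -MVD0.
assert all(m1 == (-m0[0], -m0[1]) for m0, m1 in candidate_mvd_pairs(2, "mirror"))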
At block 1108, the process 1100 includes: one or more refined motion vectors are determined based on the first motion vector, the second motion vector, and/or one or more reference pictures (e.g., based on the first motion vector and the one or more reference pictures, based on the second motion vector and the one or more reference pictures, or based on the first motion vector, the second motion vector, and the one or more reference pictures) using the selected motion vector search strategy. In some cases, determining the one or more refined motion vectors may include: one or more refined motion vectors are determined for a block of the video data. In some examples, the one or more refined motion vectors may include a first refined motion vector and a second refined motion vector determined for the first motion vector and the second motion vector, respectively. In some examples, the first motion vector difference is determined as a difference between the first refined motion vector and the first motion vector, and the second motion vector difference is determined as a difference between the second refined motion vector and the second motion vector.
In some examples, the selected motion vector search strategy is a Bilateral Matching (BM) motion vector search strategy, as previously described. When the selected motion vector search strategy is a BM motion vector search strategy, determining the one or more refined motion vectors may include: a first refined motion vector is determined by searching a first reference picture around the first motion vector. The first reference picture may be searched around the first motion vector based on the selected motion vector search strategy. The second refined motion vector may be determined by searching a second reference picture around the second motion vector based on the selected motion vector search strategy. In some examples, the selected motion vector search strategy may include a motion vector difference constraint (e.g., a mirror constraint, where the first and second motion vector differences have equal magnitudes but opposite signs; a constraint that sets the first motion vector difference equal to zero; a constraint that sets the second motion vector difference equal to zero; etc.). In some examples, the first refined motion vector and the second refined motion vector may be determined by minimizing a difference between a first reference block associated with the first refined motion vector and a second reference block associated with the second refined motion vector.
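For illustration, the following Python sketch performs a small integer-sample bilateral matching search with a mirrored constraint, keeping the candidate pair whose two reference blocks differ least (here measured by SAD). It is a simplified sketch under assumed search parameters: real refinement also handles fractional samples, cost regularization, search patterns, and early termination.

def sad(block_a, block_b):
    # Sum of absolute differences between two equally sized blocks.
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def fetch_block(picture, top, left, size):
    # Extract a size-by-size block of samples from a picture (list of rows).
    return [row[left:left + size] for row in picture[top:top + size]]

def bm_refine(ref0, ref1, pos, mv0, mv1, size, search_range=1):
    # Mirrored search: candidate L0 block at mv0 + (dy, dx), candidate L1 block
    # at mv1 - (dy, dx); keep the pair with the smallest matching cost.
    best_cost, best_mv0, best_mv1 = None, mv0, mv1
    y, x = pos
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            b0 = fetch_block(ref0, y + mv0[0] + dy, x + mv0[1] + dx, size)
            b1 = fetch_block(ref1, y + mv1[0] - dy, x + mv1[1] - dx, size)
            cost = sad(b0, b1)
            if best_cost is None or cost < best_cost:
                best_cost = cost
                best_mv0 = (mv0[0] + dy, mv0[1] + dx)
                best_mv1 = (mv1[0] - dy, mv1[1] - dx)
    return best_mv0, best_mv1

# Toy 16x16 reference pictures purely for demonstration.
ref = [[(3 * r + 5 * c) % 17 for c in range(16)] for r in range(16)]
print(bm_refine(ref, ref, pos=(6, 6), mv0=(0, 0), mv1=(0, 0), size=4))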
At block 1110, the process 1100 includes: the merge mode candidate is processed using the one or more refined motion vectors. For example, the decoding device 112 shown in fig. 1 may process the merge mode candidates using one or more refined motion vectors. In some examples, the decoder engine 116 of the decoding device 112 shown in fig. 1 may process the merge mode candidates using one or more refined motion vectors.
In some implementations, the processes (or methods) described herein may be performed by a computing device or apparatus (such as the system 100 shown in fig. 1). For example, these processes may be performed by the encoding device 104 shown in fig. 1 and 12, another video source side device or video transmission device, the decoding device 112 shown in fig. 1 and 13, and/or another client side device (such as a player device, a display, or any other client side device). In some cases, a computing device or apparatus may include one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, and/or other components configured to perform the steps of process 1100.
In some examples, the computing device may include a wireless communication device, such as a mobile device, a tablet computer, an extended reality (XR) device (e.g., a Virtual Reality (VR) device (such as a Head Mounted Display (HMD)), an Augmented Reality (AR) device (such as HMD or AR glasses), a Mixed Reality (MR) device (such as HMD or MR glasses), etc.), a desktop computer, a server computer and/or server system, a vehicle or component of a computing system or vehicle, or other type of computing device. Components of a computing device (e.g., one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, and/or other components) may be implemented in circuitry (e.g., circuitry). For example, a component may include and/or be implemented using electronic circuitry or other electronic hardware, which may include one or more programmable circuits (e.g., a microprocessor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), a Central Processing Unit (CPU), and/or other suitable circuitry), and/or may include and/or be implemented using computer software, firmware, or any combination thereof to perform the various operations described herein. In some examples, a computing device or apparatus may include a camera configured to capture video data (e.g., a video sequence) including video frames. In some examples, the camera or other capture device that captures the video data is separate from the computing device, in which case the computing device receives or obtains the captured video data. The computing device may include a network interface configured to transmit video data. The network interface may be configured to communicate Internet Protocol (IP) based data or other types of data. In some examples, a computing device or apparatus may include a display to display output video content (such as samples of pictures of a video bitstream).
These processes may be described in terms of logic flow diagrams, the operations of which represent a sequence of operations that may be implemented in hardware, computer instructions, or a combination thereof. In the context of computer instructions, the operations represent computer-executable instructions stored on one or more computer-readable storage media that, when executed by one or more processors, perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures, etc. that perform particular functions or implement particular data types. The order in which the operations are described is not intended to be construed as a limitation, and any number of the described operations may be combined in any order and/or in parallel to implement the processes.
Additionally, these processes may be performed under control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) executed collectively on one or more processors, by hardware, or a combination thereof. As mentioned above, the code may be stored on a computer-readable or machine-readable storage medium, for example, in the form of a computer program comprising a plurality of instructions executable by one or more processors. The computer-readable storage medium or machine-readable storage medium may be non-transitory.
The coding techniques discussed herein may be implemented in an example video encoding and decoding system (e.g., system 100). In some examples, the system includes a source device that provides encoded video data to be later decoded by a destination device. Specifically, the source device provides video data to the destination device via a computer readable medium. The source device and the destination device may comprise any of a variety of devices, including desktop computers, notebook computers (i.e., laptop computers), tablet computers, set-top boxes, telephone handsets (e.g., so-called "smart" handsets), so-called "smart" boards, televisions, cameras, display devices, digital media players, video game consoles, video streaming devices, and the like. In some cases, the source device and the destination device may be equipped for wireless communication.
The destination device may receive the encoded video data to be decoded via a computer readable medium. The computer readable medium may be any type of medium or device capable of moving encoded video data from a source device to a destination device. In one example, the computer readable medium may be a communication medium for enabling a source device to transmit encoded video data directly to a destination device in real-time. The encoded video data may be modulated according to a communication standard, such as a wireless communication protocol, and transmitted to a destination device. The communication medium may include any wireless or wired communication medium such as a Radio Frequency (RF) spectrum or one or more physical transmission lines. The communication medium may form part of a packet-based network, such as a local area network, a wide area network, or a global network such as the internet. The communication medium may include a router, switch, base station, or any other means that may be used to facilitate communication from a source device to a destination device.
In some examples, the encoded data may be output from the output interface to a storage device. Similarly, encoded data may be accessed from a storage device through an input interface. The storage device may include any of a variety of distributed or locally accessed data storage media such as hard drives, blu-ray discs, DVDs, CD-ROMs, flash memory, volatile or non-volatile memory, or any other suitable digital storage media for storing encoded video data. In further examples, the storage device may correspond to a file server or another intermediate storage device that may store the encoded video generated by the source device. The destination device may access the stored video data from the storage device via streaming or download. The file server may be any type of server capable of storing encoded video data and transmitting the encoded video data to a destination device. Example file servers include web servers (e.g., for web sites), FTP servers, network Attached Storage (NAS) devices, or local disk drives. The destination device may access the encoded video data through any standard data connection, including an internet connection. This may include a wireless channel (e.g., wi-Fi connection), a wired connection (e.g., DSL, cable modem, etc.), or a combination of both, adapted to access encoded video data stored on a file server. The transmission of encoded video data from the storage device may be a streaming transmission, a download transmission, or a combination thereof.
The techniques of this disclosure are not necessarily limited to wireless applications or settings. The techniques may be applied to video coding to support any of a variety of multimedia applications, such as over-the-air television broadcasting, cable television transmission, satellite television transmission, internet streaming video transmission (e.g., dynamic adaptive streaming over HTTP (DASH)), digital video encoded onto a data storage medium, decoding of digital video stored on a data storage medium, or other applications. In some examples, the system may be configured to support unidirectional or bidirectional video transmission to support applications such as video streaming, video playback, video broadcasting, and/or video telephony.
In one example, a source device includes a video source, a video encoder, and an output interface. The destination device may include an input interface, a video decoder, and a display device. The video encoder of the source device may be configured to apply the techniques disclosed herein. In other examples, the source device and the destination device may include other components or arrangements. For example, the source device may receive video data from an external video source, such as an external camera. Also, the destination device may interface with an external display device instead of including an integrated display device.
The above example system is merely one example. The techniques for processing video data in parallel may be performed by any digital video encoding and/or decoding device. Although the techniques of this disclosure are generally performed by a video encoding device, the techniques may also be performed by a video encoder/decoder, commonly referred to as a "CODEC". Furthermore, the techniques of this disclosure may also be performed by a video preprocessor. The source device and the destination device are merely examples of such coding devices, in which the source device generates encoded video data for transmission to the destination device. In some examples, the source device and the destination device may operate in a substantially symmetrical manner such that each of these devices includes video encoding and decoding components. Thus, example systems may support unidirectional or bidirectional video transmission between video devices, e.g., for video streaming, video playback, video broadcasting, or video telephony.
The video source may include a video capture device, such as a video camera, a video archiving unit containing previously captured video, and/or a video feed interface for receiving video from a video content provider. As a further alternative, the video source may generate computer graphics based data as the source video, or a combination of real-time video, archived video, and computer generated video. In some cases, if the video source is a camera, the source device and the destination device may form a so-called camera phone or video phone. However, as noted above, the techniques described in this disclosure may be generally applicable to video coding and may be applied to wireless and/or wired applications. In each case, the captured, pre-captured, or computer-generated video may be encoded by a video encoder. The encoded video information may then be output onto a computer readable medium via an output interface.
As mentioned, the computer-readable medium may include a temporary medium such as a wireless broadcast or a wired network transmission, or a storage medium (i.e., a non-transitory storage medium) such as a hard disk, a flash drive, a compact disc, a digital versatile disc, a blu-ray disc, or other computer-readable medium. In some examples, a network server (not shown) may receive encoded video data from a source device, e.g., via a network transmission, and provide the encoded video data to a destination device. Similarly, a computing device of a media production facility, such as an optical disc stamping facility, may receive encoded video data from a source device and manufacture an optical disc containing the encoded video data. Thus, in various examples, a computer-readable medium may be understood to include one or more computer-readable media in various forms.
An input interface of the destination device receives information from the computer-readable medium. The information of the computer readable medium may include syntax information defined by the video encoder, which is also used by the video decoder, including syntax elements describing characteristics and/or processing of blocks and other coding units (e.g., group of pictures (GOP)). The display device displays the decoded video data to a user and may include any of a variety of display devices, such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a plasma display, an Organic Light Emitting Diode (OLED) display, or another type of display device. Various aspects of the present application have been described.
Specific details of the encoding device 104 and decoding device 112 are shown in fig. 12 and 13, respectively. Fig. 12 is a block diagram illustrating an example encoding device 104 that may implement one or more of the techniques described in this disclosure. The encoding device 104 may, for example, generate a syntax structure described herein (e.g., a syntax structure of VPS, SPS, PPS or other syntax element). The encoding device 104 may perform intra-prediction and inter-prediction coding of video blocks within a video slice. As previously described, intra coding relies at least in part on spatial prediction to reduce or eliminate spatial redundancy within a given video frame or picture. Inter-coding relies at least in part on temporal prediction to reduce or eliminate temporal redundancy within adjacent or surrounding frames of a video sequence. Intra mode (I mode) may refer to any of several spatial-based compression modes. Modes such as unidirectional prediction (P-mode) or bi-directional prediction (B-mode) may refer to any of several time-based compression modes.
The encoding apparatus 104 includes a dividing unit 35, a prediction processing unit 41, a filter unit 63, a picture memory 64, a summer 50, a transform processing unit 52, a quantization unit 54, and an entropy encoding unit 56. The prediction processing unit 41 includes a motion estimation unit 42, a motion compensation unit 44, and an intra prediction processing unit 46. For video block reconstruction, the encoding device 104 further includes an inverse quantization unit 58, an inverse transform processing unit 60, and a summer 62. The filter unit 63 is intended to represent one or more loop filters, such as a deblocking filter, an Adaptive Loop Filter (ALF), and a Sample Adaptive Offset (SAO) filter. Although the filter unit 63 is shown as an in-loop filter in fig. 12, in other configurations, the filter unit 63 may be implemented as a post-loop filter. Post-processing device 57 may perform additional processing on the encoded video data generated by encoding device 104. In some cases, the techniques of this disclosure may be implemented by encoding device 104. However, in other cases, one or more of the techniques of this disclosure may be implemented by post-processing device 57.
As shown in fig. 12, the encoding device 104 receives video data, and the dividing unit 35 divides the data into video blocks. Such partitioning may also include partitioning into slices, tiles, or other larger units, for example, according to the quadtree structure of the LCUs and CUs, as well as video block partitioning. The encoding device 104 generally illustrates the components that encode video blocks within a video slice to be encoded. A slice may be divided into multiple video blocks (and possibly into a set of video blocks called tiles). Prediction processing unit 41 may select one of a plurality of possible coding modes, such as one of a plurality of intra-prediction coding modes or one of a plurality of inter-prediction coding modes, for the current video block based on the error result (e.g., coding rate and distortion level, etc.). The prediction processing unit 41 may provide the resulting intra-or inter-coded block to the summer 50 to generate residual block data and to the summer 62 to reconstruct the encoded block for use as a reference picture.
Intra-prediction processing unit 46 within prediction processing unit 41 may perform intra-prediction coding of the current block relative to one or more neighboring blocks in the same frame or slice as the current video block to be coded to provide spatial compression. Motion estimation unit 42 and motion compensation unit 44 within prediction processing unit 41 perform inter-prediction coding of the current video block relative to one or more prediction blocks in one or more reference pictures to provide temporal compression.
The motion estimation unit 42 may be configured to determine the inter prediction mode for the video slice according to a predetermined pattern for the video sequence. The predetermined pattern may designate video slices in the sequence as P slices, B slices, or GPB slices. The motion estimation unit 42 and the motion compensation unit 44 may be highly integrated but are shown separately for conceptual purposes. The motion estimation performed by the motion estimation unit 42 is a process of generating a motion vector that estimates motion for a video block. The motion vector may, for example, indicate a displacement of a Prediction Unit (PU) of a video block within a current video frame or picture relative to a prediction block within a reference picture.
A prediction block is a block found to closely match the PU of the video block to be coded in terms of pixel differences, which may be determined by Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD), or other difference metrics. In some examples, the encoding device 104 may calculate values for sub-integer pixel positions of reference pictures stored in picture memory 64. For example, the encoding device 104 may interpolate values for quarter-pixel positions, eighth-pixel positions, or other fractional pixel positions of the reference picture. Accordingly, the motion estimation unit 42 may perform a motion search with respect to full pixel positions and fractional pixel positions, and output a motion vector with fractional pixel precision.
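As a small, generic illustration of the pixel-difference metrics mentioned above (not tied to any particular block size or to the exact cost functions used by the encoding device 104), SAD and SSD can be computed as follows in Python:

def sad(cur, ref):
    # Sum of absolute differences between a current block and a candidate
    # prediction block of the same dimensions.
    return sum(abs(c - r) for row_c, row_r in zip(cur, ref) for c, r in zip(row_c, row_r))

def ssd(cur, ref):
    # Sum of squared differences, an alternative matching metric.
    return sum((c - r) ** 2 for row_c, row_r in zip(cur, ref) for c, r in zip(row_c, row_r))

cur = [[10, 12], [14, 16]]
ref = [[11, 12], [13, 18]]
print(sad(cur, ref), ssd(cur, ref))   # 4 6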
Motion estimation unit 42 calculates a motion vector for the PU by comparing the location of the PU of the video block in the inter-coded slice with the location of the prediction block of the reference picture. A reference picture may be selected from a first reference picture list (list 0) or a second reference picture list (list 1), each of which identifies one or more reference pictures stored in picture memory 64. The motion estimation unit 42 sends the calculated motion vector to the entropy encoding unit 56 and the motion compensation unit 44.
The motion compensation performed by the motion compensation unit 44 may involve fetching or generating a prediction block based on the motion vector determined by motion estimation, possibly performing interpolation to sub-pixel precision. Upon receiving the motion vector for the PU of the current video block, the motion compensation unit 44 may locate the prediction block to which the motion vector points in a reference picture list. The encoding device 104 forms a residual video block by subtracting pixel values of the prediction block from pixel values of the current video block being coded, forming pixel difference values. The pixel difference values form residual data for the block and may include both luminance difference components and chrominance difference components. Summer 50 represents the component or components that perform this subtraction operation. The motion compensation unit 44 may also generate syntax elements associated with the video blocks and the video slice for use by the decoding device 112 in decoding the video blocks of the video slice.
As described above, the intra-prediction processing unit 46 may perform intra prediction on the current block as an alternative to the inter prediction performed by the motion estimation unit 42 and the motion compensation unit 44. In particular, the intra-prediction processing unit 46 may determine an intra-prediction mode to be used to encode the current block. In some examples, the intra-prediction processing unit 46 may encode the current block using various intra-prediction modes, e.g., during separate encoding passes, and may select an appropriate intra-prediction mode to use from the tested modes. For example, the intra-prediction processing unit 46 may calculate rate-distortion values using a rate-distortion analysis for the various tested intra-prediction modes, and may select the intra-prediction mode having the best rate-distortion characteristics among the tested modes. Rate-distortion analysis generally determines the amount of distortion (or error) between an encoded block and the original, unencoded block that was encoded to produce the encoded block, as well as the bit rate (i.e., the number of bits) used to produce the encoded block. The intra-prediction processing unit 46 may calculate ratios from the distortions and rates for the various encoded blocks to determine which intra-prediction mode exhibits the best rate-distortion value for the block.
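As a generic illustration of the rate-distortion comparison described above (the actual lambda derivation and bit estimation used by the intra-prediction processing unit 46 are not specified here, and the figures below are made up), a mode decision can be sketched in Python as:

def rd_cost(distortion, bits, lam):
    # Rate-distortion cost J = D + lambda * R.
    return distortion + lam * bits

# Hypothetical distortion/bit figures for three candidate intra modes.
tested_modes = {"planar": (120, 9), "dc": (150, 7), "angular": (95, 14)}
lam = 4.0
best_mode = min(tested_modes, key=lambda m: rd_cost(*tested_modes[m], lam))
print(best_mode)   # angular (cost 95 + 4.0 * 14 = 151)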
In any case, after selecting the intra-prediction mode for the block, intra-prediction processing unit 46 may provide information indicating the intra-prediction mode selected for the block to entropy encoding unit 56. Entropy encoding unit 56 may encode information indicating the selected intra-prediction mode. The encoding device 104 may include in the transmitted bitstream configuration data definitions of encoding contexts for the various blocks and indications of the most probable intra-prediction mode, intra-prediction mode index table, and modified intra-prediction mode index table to be used for each of these contexts. The bitstream configuration data may include a plurality of intra prediction mode index tables and a plurality of modified intra prediction mode index tables (also referred to as codeword mapping tables).
After the prediction processing unit 41 generates a prediction block for the current video block via inter prediction or intra prediction, the encoding apparatus 104 forms a residual video block by subtracting the prediction block from the current video block. The residual video data in the residual block may be included in one or more TUs and applied to transform processing unit 52. Transform processing unit 52 transforms the residual video data into residual transform coefficients using a transform, such as a Discrete Cosine Transform (DCT) or a conceptually similar transform. The transform processing unit 52 may convert the residual video data from a pixel domain to a transform domain (such as a frequency domain).
The transform processing unit 52 may send the resulting transform coefficients to the quantization unit 54. The quantization unit 54 quantizes the transform coefficients to further reduce the bit rate. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may be modified by adjusting a quantization parameter. In some examples, the quantization unit 54 may then perform a scan of the matrix including the quantized transform coefficients. Alternatively or additionally, entropy encoding unit 56 may perform the scan.
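A minimal sketch of uniform scalar quantization and the matching dequantization is shown below in Python. Real codecs derive the quantization step from a quantization parameter and may apply scaling lists and rate-distortion-optimized quantization, all of which are omitted here as simplifying assumptions.

def quantize(coeffs, qstep):
    # Map each transform coefficient to the nearest quantization level.
    return [[int(round(c / qstep)) for c in row] for row in coeffs]

def dequantize(levels, qstep):
    # Reconstruct coefficient values from the quantization levels.
    return [[level * qstep for level in row] for row in levels]

coeffs = [[52.0, -9.5], [3.2, 0.7]]
levels = quantize(coeffs, qstep=4.0)     # [[13, -2], [1, 0]]
print(dequantize(levels, qstep=4.0))     # [[52.0, -8.0], [4.0, 0.0]]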
After quantization, entropy encoding unit 56 entropy encodes the quantized transform coefficients. For example, entropy encoding unit 56 may perform Context Adaptive Variable Length Coding (CAVLC), context Adaptive Binary Arithmetic Coding (CABAC), syntax-based context adaptive binary arithmetic coding (SBAC), probability Interval Partitioning Entropy (PIPE) coding, or another entropy encoding technique. After entropy encoding by entropy encoding unit 56, the encoded bitstream may be sent to decoding device 112, or archived for later transmission or retrieval by decoding device 112. Entropy encoding unit 56 may also entropy encode motion vectors and other syntax elements for the current video slice being coded.
The inverse quantization unit 58 and the inverse transform processing unit 60 apply inverse quantization and an inverse transform, respectively, to reconstruct the residual block in the pixel domain for later use as a reference block of a reference picture. The motion compensation unit 44 may calculate the reference block by adding the residual block to a prediction block of one of the reference pictures within a reference picture list. The motion compensation unit 44 may also apply one or more interpolation filters to the reconstructed residual block to calculate sub-integer pixel values for use in motion estimation. Summer 62 adds the reconstructed residual block to the motion-compensated prediction block generated by the motion compensation unit 44 to generate a reference block for storage in picture memory 64. The reference block may be used by the motion estimation unit 42 and the motion compensation unit 44 as a reference block for inter prediction of a block in a subsequent video frame or picture.
In this way, the encoding device 104 of fig. 12 represents an example of a video encoder configured to perform any of the techniques described herein, including the processes described above with respect to fig. 11. In some cases, some of the techniques of this disclosure may also be implemented by post-processing device 57.
Fig. 13 is a block diagram illustrating an example decoding device 112. The decoding apparatus 112 includes an entropy decoding unit 80, a prediction processing unit 81, an inverse quantization unit 86, an inverse transformation processing unit 88, a summer 90, a filter unit 91, and a picture memory 92. The prediction processing unit 81 includes a motion compensation unit 82 and an intra prediction processing unit 84. In some examples, the decoding device 112 may perform a decoding phase that is generally opposite to the encoding phase described with respect to the encoding device 104 from fig. 12.
During the decoding process, the decoding device 112 receives an encoded video bitstream transmitted by the encoding device 104, which represents video blocks of an encoded video slice and associated syntax elements. In some aspects, the decoding device 112 may receive the encoded video bitstream from the encoding device 104. In some aspects, decoding device 112 may receive the encoded video bitstream from a network entity 79, such as a server, a Media Aware Network Element (MANE), a video editor/splicer, or other such device configured to implement one or more of the techniques described above. Network entity 79 may or may not include encoding device 104. Network entity 79 may implement some of the techniques described in this disclosure before network entity 79 sends the encoded video bitstream to decoding device 112. In some video decoding systems, network entity 79 and decoding device 112 may be part of separate devices, while in other cases, the functions described with respect to network entity 79 may be performed by the same device that includes decoding device 112.
Entropy decoding unit 80 of decoding device 112 entropy decodes the bitstream to generate quantized coefficients, motion vectors, and other syntax elements. Entropy decoding unit 80 forwards the motion vectors and other syntax elements to prediction processing unit 81. The decoding device 112 may receive syntax elements at the video slice level and/or the video block level. Entropy decoding unit 80 may process and parse both fixed-length syntax elements and variable-length syntax elements in one or more parameter sets, such as a VPS, SPS, and PPS.
When a video slice is encoded as an intra coded (I) slice, the intra prediction processing unit 84 of the prediction processing unit 81 may generate prediction data for a video block of the current video slice based on the signaled intra prediction mode and data in previously decoded blocks from the current frame or picture. When a video frame is coded as an inter-coded (i.e., B, P or GPB) slice, the motion compensation unit 82 of the prediction processing unit 81 generates a prediction block for the video block of the current video slice based on the motion vectors and other syntax elements received from the entropy decoding unit 80. A prediction block may be generated from one of the reference pictures within the reference picture list. The decoding device 112 may construct reference frame lists, i.e., list 0 and list 1, using default construction techniques based on the reference pictures stored in the picture memory 92.
The motion compensation unit 82 determines prediction information for the video block of the current video slice by parsing the motion vector and other syntax elements and uses the prediction information to generate a prediction block for the current video block being decoded. For example, motion compensation unit 82 may determine a prediction mode (e.g., intra or inter prediction) for coding a video block of a video slice, an inter prediction slice type (e.g., B slice, P slice, or GPB slice), construction information for one or more reference picture lists for the slice, a motion vector for each inter-coded video block of the slice, an inter prediction state for each inter-coded video block of the slice, and other information for decoding a video block in a current video slice using one or more syntax elements in the parameter set.
The motion compensation unit 82 may also perform interpolation based on interpolation filters. The motion compensation unit 82 may use the interpolation filters used by the encoding device 104 during encoding of the video block to calculate interpolated values for sub-integer pixels of a reference block. In this case, the motion compensation unit 82 may determine the interpolation filters used by the encoding device 104 from the received syntax elements, and may use the interpolation filters to generate the prediction block.
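As a simplified illustration of generating sub-integer sample values (the actual interpolation filters used by the encoding device 104 and the motion compensation unit 82 are longer separable filters derived as described above; the 2-tap filter below is an assumption for illustration), half-sample interpolation along one row could look like this in Python:

def interpolate_half_pel(samples):
    # Bilinear half-sample interpolation with rounding; real codecs typically
    # use longer filters (e.g., 8-tap for luma) applied separably.
    return [(samples[i] + samples[i + 1] + 1) // 2 for i in range(len(samples) - 1)]

print(interpolate_half_pel([10, 20, 30, 44]))   # [15, 25, 37]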
The inverse quantization unit 86 inverse quantizes, or dequantizes, the quantized transform coefficients provided in the bitstream and decoded by the entropy decoding unit 80. The inverse quantization process may include using a quantization parameter calculated by the encoding device 104 for each video block in the video slice to determine a degree of quantization and, likewise, a degree of inverse quantization that should be applied. The inverse transform processing unit 88 applies an inverse transform (e.g., an inverse DCT or other suitable inverse transform), an inverse integer transform, or a conceptually similar inverse transform process to the transform coefficients in order to generate residual blocks in the pixel domain.
After motion compensation unit 82 generates the prediction block for the current video block based on the motion vector and other syntax elements, the decoding device 112 forms a decoded video block by adding the residual block from the inverse transform processing unit 88 to the corresponding prediction block generated by the motion compensation unit 82. Summer 90 represents the component or components that perform this summation operation. If desired, loop filters (either in the coding loop or after the coding loop) may also be used to smooth pixel transitions or otherwise improve the video quality. The filter unit 91 is intended to represent one or more loop filters, such as a deblocking filter, an Adaptive Loop Filter (ALF), and a Sample Adaptive Offset (SAO) filter. Although the filter unit 91 is shown in fig. 13 as an in-loop filter, in other configurations the filter unit 91 may be implemented as a post-loop filter. The decoded video blocks in a given frame or picture are then stored in picture memory 92, which stores reference pictures used for subsequent motion compensation. The picture memory 92 also stores decoded video for later presentation on a display device, such as the video destination device 122 shown in fig. 1.
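The final summation and clipping described above can be illustrated with a short Python sketch (the bit depth and clipping range are assumptions, and loop filtering is omitted):

def reconstruct(prediction, residual, bit_depth=8):
    # Decoded block = prediction + residual, clipped to the valid sample range.
    max_val = (1 << bit_depth) - 1
    return [[min(max(p + r, 0), max_val) for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]

pred = [[100, 102], [98, 97]]
resid = [[3, -2], [160, -120]]
print(reconstruct(pred, resid))   # [[103, 100], [255, 0]]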
In this way, the decoding device 112 of fig. 13 represents an example of a video decoder configured to perform any of the techniques described herein, including the processes described above with respect to fig. 11.
As used herein, the term "computer-readable medium" includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying instruction(s) and/or data. A computer-readable medium may include a non-transitory medium in which data may be stored and which does not include carrier waves and/or transitory electronic signals propagating wirelessly or over a wired connection. Examples of non-transitory media may include, but are not limited to, a magnetic disk or tape, optical storage media such as a Compact Disc (CD) or Digital Versatile Disc (DVD), flash memory, or a memory or memory device. A computer-readable medium may have stored thereon code and/or machine-executable instructions that may represent procedures, functions, subprograms, programs, routines, subroutines, modules, software packages, classes, or any combination of instructions, data structures, or program statements. A code segment may be coupled to another code segment or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, and the like.
In some aspects, the computer readable storage devices, media, and memory may include a cable or wireless signal comprising a bit stream or the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals themselves.
In the above description, specific details are given to provide a thorough understanding of the aspects and examples provided herein. However, it will be understood by those of ordinary skill in the art that the aspects may be practiced without these specific details. For clarity of explanation, in some cases the techniques herein may be presented as including individual functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components may be used other than those shown in the figures and/or described herein. For example, circuits, systems, networks, processes, and other components may be shown in block diagram form in order not to obscure the aspects in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the aspects.
Various aspects may be described above as a process or method, which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of operations may be rearranged. The process is terminated after its operations are completed, but may have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, etc. When a process corresponds to a function, its termination may correspond to the function returning to the calling function or the main function.
The processes and methods according to the above examples may be implemented using computer-executable instructions stored in or otherwise available from a computer-readable medium. Such instructions may include, for example, instructions or data which cause a general purpose computer, special purpose computer, or processing device to perform a certain function or group of functions, or to otherwise configure the same to perform a certain function or group of functions. Portions of the computer resources used may be accessed through a network. The computer-executable instructions may be, for example, binary files, intermediate format instructions such as assembly language, firmware, source code, and the like. Examples of computer readable media that may be used to store instructions, information used, and/or information created during a method according to the described examples include magnetic or optical disks, flash memory, a USB device provided with non-volatile memory, a network storage device, and so forth.
Devices implementing the processes and methods according to these disclosures may include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and may take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, the program code or code segments (e.g., a computer program product) to perform the necessary tasks may be stored in a computer-readable or machine-readable medium. A processor may perform the necessary tasks. Typical examples of form factors include laptop computers, smart phones, mobile phones, tablet devices, or other small form factor personal computers, personal digital assistants, rack-mounted devices, standalone devices, and the like. The functionality described herein may also be embodied in peripherals or add-in cards. By way of further example, such functionality may also be implemented on a circuit board among different chips, or among different processes executing in a single device.
The instructions, media for transmitting such instructions, computing resources for executing them, and other structures for supporting such computing resources are example modules for providing the functionality described in this disclosure.
In the foregoing description, aspects of the present application have been described with reference to specific aspects thereof, but those skilled in the art will recognize that the present application is not so limited. Thus, although illustrative aspects of the present application have been described in detail herein, it is to be understood that the inventive concepts may be otherwise variously embodied and employed and that the appended claims are intended to be construed to include such variations, except insofar as limited by the prior art. The various features and aspects of the above-described applications may be used individually or in combination. Moreover, aspects may be used in any number of environments and applications other than those described herein without departing from the broader spirit and scope of the specification. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. For purposes of illustration, the methods are described in a particular order. It should be appreciated that in alternative aspects, the methods may be performed in an order different than that described.
It will be apparent to those of ordinary skill in the art that the less than ("<") and greater than (">") symbols or terminology used herein may be replaced with less than or equal to ("≤") and greater than or equal to ("≥") symbols, respectively, without departing from the scope of the present description.
Where a component is described as "configured to" perform certain operations, such configuration may be achieved, for example, by: circuits or other hardware are designed to perform the operation, programmable circuits (e.g., microprocessors or other suitable circuits) are programmed to perform the operation, or any combination thereof.
The phrase "coupled to" refers to any component that is physically connected directly or indirectly to another component, and/or any component that is in communication with another component directly or indirectly (e.g., connected to another component through a wired or wireless connection and/or other suitable communication interface).
Claim language or other language in the disclosure reciting "at least one of" a set and/or "one or more" of a set indicates that one member of the set or multiple members of the set (in any combination) satisfies the claim. For example, claim language reciting "at least one of A and B" and "at least one of A or B" means A, B, or A and B. In another example, claim language reciting "at least one of A, B, and C" and "at least one of A, B, or C" means A, B, C, or A and B, or A and C, or B and C, or A and B and C. The language "at least one of" a set and/or "one or more of" a set does not limit the set to the items listed in the set. For example, claim language reciting "at least one of A and B" or "at least one of A or B" may mean A, B, or A and B, and may additionally include items not listed in the set of A and B.
The various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, firmware, or combinations thereof. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices such as general purpose computers, wireless communication device handsets, or integrated circuit devices having multiple uses including applications in wireless communication device handsets and other devices. Any features described as modules or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a computer-readable data storage medium comprising program code that includes instructions that, when executed, perform one or more of the methods described above. The computer readable data storage medium may form part of a computer program product, which may include packaging material. The computer-readable medium may include memory or data storage media such as Random Access Memory (RAM), such as Synchronous Dynamic Random Access Memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic or optical data storage media, and the like. Additionally or alternatively, the techniques may be implemented at least in part by a computer-readable communication medium (such as a propagated signal or wave) that carries or conveys program code in the form of instructions or data structures and that may be accessed, read, and/or executed by a computer.
The program code may be executed by a processor, which may include one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, application Specific Integrated Circuits (ASICs), field programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Such processors may be configured to perform any of the techniques described in this disclosure. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration. Thus, the term "processor" as used herein may refer to any of the foregoing structure, any combination of the foregoing structure, or any other structure or device suitable for implementation of the techniques described herein. Additionally, in some aspects, the functionality described herein may be provided within dedicated software modules or hardware modules configured for encoding and decoding, or incorporated in a combined video encoder-decoder (CODEC).
Illustrative aspects of the present disclosure include:
aspect 1: an apparatus for processing video data, comprising: a memory; and one or more processors coupled to the memory. The one or more processors are configured to: obtaining a current picture of the video data; obtaining a reference picture for the current picture from the video data; determining a merge mode candidate from the current picture; identifying a first motion vector and a second motion vector for the merge mode candidate; selecting a motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies; deriving a refined motion vector from the first motion vector, the second motion vector, and the reference picture using the motion vector search strategy; and processing the merge mode candidate using the refined motion vector.
Aspect 2: the apparatus of claim 1, wherein the merge mode candidate is selected from a merge candidate list.
Aspect 3: the apparatus of aspect 2, wherein the merge candidate list is constructed from one or more of: a spatial motion vector predictor from a spatial neighboring block of the merge mode candidate, a temporal motion vector predictor from a co-located block of the merge mode candidate, a history-based motion vector predictor from a history table, a pairwise average motion vector predictor, and a zero-valued motion vector.
Aspect 4: the apparatus of any of aspects 1-3, wherein the one or more processors are configured to: a motion vector bi-prediction signal is generated using the first motion vector and the second motion vector by averaging two prediction signals obtained from two different reference pictures.
Aspect 5: the apparatus of any of aspects 1-4, wherein the plurality of motion vector search strategies comprises a fractional sample refinement strategy.
Aspect 6: the apparatus of aspect 5, wherein the plurality of motion vector search strategies comprises a bi-directional optical flow strategy.
Aspect 7: the apparatus of aspect 6, wherein the plurality of motion vector search strategies includes a sub-block based bilateral matching motion vector refinement strategy.
Aspect 8: the apparatus of any of aspects 1-7, wherein the first motion vector and the second motion vector are associated with one or more constraints.
Aspect 9: the apparatus of aspect 8, wherein the one or more constraints comprise a mirrored constraint.
Aspect 10: the apparatus of any of aspects 1-9, wherein the one or more constraints comprise zero-valued constraints for the first motion vector difference.
Aspect 11: the apparatus of any of aspects 1-9, wherein the one or more constraints comprise zero-valued constraints for the second motion vector difference.
Aspect 12: The apparatus of any of aspects 1 to 11, wherein the video data includes syntax indicating the one or more constraints.
Aspect 13: The apparatus of any of aspects 1 to 12, wherein the motion vector search strategy comprises a multi-pass decoder side motion vector refinement strategy.
Aspect 14: The apparatus of aspect 13, wherein the multi-pass decoder side motion vector refinement strategy comprises two or more refinement passes of the same refinement type.
Aspect 15: The apparatus of aspect 14, wherein the multi-pass decoder side motion vector refinement strategy comprises one or more refinement passes of a type different from the same refinement type.
Aspect 16: the apparatus of any of aspects 14 or 15, wherein the two or more refinement passes of the same refinement type are block-based bilateral matching motion vector refinement, sub-block-based bilateral matching motion vector refinement, or sub-block-based bidirectional optical flow motion vector refinement.
Aspect 17: the apparatus of any of aspects 1-16, wherein the plurality of motion vector search strategies comprises a plurality of multi-path strategy subsets.
Aspect 18: the apparatus of aspect 17, wherein the plurality of multi-pass policy subsets are signaled in one or more syntax elements of the video data.
Aspect 19: The apparatus of any of aspects 1 to 18, wherein deriving the refined motion vector comprises: matching costs for a plurality of candidate motion vector pairs are calculated according to the motion vector search strategy.
Aspect 20: The apparatus of any of aspects 1 to 19, wherein the motion vector search strategy is adaptively selected based on a matching cost determined from the video data.
Aspect 21: The apparatus of any of aspects 1 to 19, wherein the motion vector search strategy is selected to adaptively set a number of passes of the motion vector search strategy based on the video data.
Aspect 22: The apparatus of any of aspects 1 to 19, wherein the motion vector search strategy is selected to adaptively set a search mode for determining candidates for the refined motion vector based on the video data.
Aspect 23: The apparatus of any of aspects 1 to 19, wherein the motion vector search strategy is selected to adaptively set a set of criteria for generating a candidate list of the refined motion vectors based on the video data.
Aspect 24: The apparatus of any of aspects 1 to 19, wherein the motion vector search strategy is adaptively performed based on decoder-side motion vector refinement constraints from the video data.
Aspect 25: The apparatus of any of aspects 1 to 19, wherein the motion vector search strategy is adaptively performed based on a block size of the merge mode candidate in the video data.
Aspect 26: The apparatus of any one of aspects 1 to 25, the one or more processors configured to: disable multi-hypothesis prediction.
Aspect 27: The apparatus of any one of aspects 1 to 26, the one or more processors configured to: perform multi-hypothesis prediction in conjunction with the motion vector search strategy.
Aspect 28: The apparatus of any one of aspects 1 to 27, the one or more processors configured to: generate a merge candidate list comprising the merge mode candidate.
Aspect 29: The apparatus of aspect 28, wherein, to generate the merge candidate list, the one or more processors are configured to: determine one or more default candidates to add to the merge candidate list based on a number of candidates in the merge candidate list being less than a maximum number of candidates and based on one or more conditions associated with an adaptive merge mode (e.g., a condition for adaptive_bm_mode).
Aspect 30: The apparatus of aspect 28, wherein, to generate the merge candidate list, the one or more processors are configured to: determine one or more default candidates to add to the merge candidate list based on a number of candidates in the merge candidate list being less than a maximum number of candidates and based on one or more conditions associated with an adaptive merge mode (e.g., a condition according to bm_dir).
Aspect 31: the apparatus of any one of aspects 1 to 30, wherein the apparatus is a mobile device.
Aspect 32: the apparatus of any one of aspects 1 to 31, further comprising: a camera configured to capture one or more frames.
Aspect 33: the apparatus of any one of aspects 1 to 32, further comprising: a display configured to display one or more frames.
Aspect 34: a method of processing video data according to any of the operations of aspects 1 to 33.
Aspect 35: a computer-readable storage medium comprising instructions that, when executed by one or more processors of a device, cause the device to perform the operations of any one of aspects 1 to 33.
Aspect 36: an apparatus comprising one or more means for performing any of the operations of aspects 1 through 33.
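To make the search described in the aspects above more concrete, the sketch below illustrates one way the matching costs of Aspects 19 and 20, combined with the mirrored constraint of Aspect 9, could be computed: every candidate pair displaces the two motion vectors by equal and opposite offsets, and the pair whose two prediction blocks differ least is kept. This is only an illustrative Python sketch under simplifying assumptions (integer-pel offsets, an SAD cost, NumPy arrays for reference pictures, no border padding or clipping); the function and variable names are hypothetical and this is not the normative refinement process of the claims.

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(block_a.astype(np.int64) - block_b.astype(np.int64)).sum())

def fetch_block(ref, pos, mv, size):
    """Read a size x size block from reference picture `ref`.

    `pos` is the (y, x) top-left corner of the current block and
    `mv` is an integer-pel motion vector given as (mvx, mvy).
    """
    y, x = pos[0] + mv[1], pos[1] + mv[0]
    return ref[y:y + size, x:x + size]

def bilateral_search_mirrored(ref0, ref1, pos, mv0, mv1, size, search_range=2):
    """Refine (mv0, mv1) under a mirrored MVD constraint: MVD1 = -MVD0.

    Every candidate pair is (mv0 + d, mv1 - d); the pair whose two
    prediction blocks have the lowest SAD is kept as the refined pair.
    """
    best_cost, best_pair = None, (mv0, mv1)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cand0 = (mv0[0] + dx, mv0[1] + dy)
            cand1 = (mv1[0] - dx, mv1[1] - dy)  # equal magnitude, opposite sign
            cost = sad(fetch_block(ref0, pos, cand0, size),
                       fetch_block(ref1, pos, cand1, size))
            if best_cost is None or cost < best_cost:
                best_cost, best_pair = cost, (cand0, cand1)
    return best_pair, best_cost
```

The best cost returned by such a search is the kind of value that, per Aspect 20, could drive an adaptive choice among search strategies.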
Aspect 37: An apparatus for processing video data, comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor configured to: obtain one or more reference pictures for a current picture; identify a first motion vector and a second motion vector for a merge mode candidate; determine a selected motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies; determine one or more refined motion vectors based on the one or more reference pictures and at least one of the first motion vector or the second motion vector using the selected motion vector search strategy; and process the merge mode candidate using the one or more refined motion vectors.
Aspect 38: The apparatus of aspect 37, wherein the selected motion vector search strategy is associated with one or more constraints based on at least one of the first motion vector or the second motion vector.
Aspect 39: The apparatus of aspect 38, wherein the one or more constraints are determined for a block of the video data based on a syntax element signaled for the block.
Aspect 40: The apparatus of any of aspects 38 or 39, wherein the one or more constraints are associated with at least one of a first motion vector difference associated with the first motion vector or a second motion vector difference associated with the second motion vector.
Aspect 41: The apparatus of aspect 40, wherein the one or more refined motion vectors comprise a first refined motion vector and a second refined motion vector, and wherein the at least one processor is configured to: determine the first motion vector difference as a difference between the first refined motion vector and the first motion vector; and determine the second motion vector difference as a difference between the second refined motion vector and the second motion vector.
Aspect 42: The apparatus of any one of aspects 40 or 41, wherein the one or more constraints comprise a mirrored constraint for the first motion vector difference and the second motion vector difference, and wherein the first motion vector difference and the second motion vector difference have the same magnitude and different signs.
Aspect 43: The apparatus of any one of aspects 40 to 42, wherein the one or more constraints include a zero-value constraint for at least one of the first motion vector difference or the second motion vector difference.
Aspect 44: The apparatus of aspect 43, wherein, based on the zero-value constraint, the at least one processor is configured to: determine the one or more refined motion vectors using the selected motion vector search strategy by maintaining a first one of the first motion vector difference or the second motion vector difference as a fixed value and searching with respect to a second one of the first motion vector difference or the second motion vector difference.
Aspect 45: The apparatus of any one of aspects 37 to 44, wherein the selected motion vector search strategy is a Bilateral Matching (BM) motion vector search strategy.
Aspect 46: The apparatus of any one of aspects 37 to 45, wherein the at least one processor is configured to determine the one or more refined motion vectors based on one or more constraints associated with the selected motion vector search strategy, and wherein, to determine the one or more refined motion vectors based on the one or more constraints, the at least one processor is configured to: determine a first refined motion vector by searching a first reference picture around the first motion vector based on the selected motion vector search strategy; and determine a second refined motion vector by searching a second reference picture around the second motion vector based on the selected motion vector search strategy; wherein the one or more constraints comprise a motion vector difference constraint.
Aspect 47: The apparatus of aspect 46, wherein, to determine the first refined motion vector and the second refined motion vector, the at least one processor is configured to: minimize a difference between a first reference block associated with the first refined motion vector and a second reference block associated with the second refined motion vector.
Aspect 48: The apparatus of any one of aspects 37 to 47, wherein the plurality of motion vector search strategies includes at least two of: a multi-pass decoder side motion vector refinement strategy, a fractional sample refinement strategy, a bi-directional optical flow strategy, or a sub-block based bilateral matching motion vector refinement strategy.
Aspect 49: The apparatus of any of aspects 37 to 48, wherein the selected motion vector search strategy comprises a multi-pass decoder side motion vector refinement strategy.
Aspect 50: The apparatus of aspect 49, wherein the multi-pass decoder side motion vector refinement strategy comprises at least one of: one or more block-based bilateral matching motion vector refinement passes, or one or more sub-block-based motion vector refinement passes.
Aspect 51: The apparatus of aspect 50, wherein the at least one processor is configured to: perform the one or more block-based bilateral matching motion vector refinement passes using a first constraint associated with at least one of the first motion vector difference or the second motion vector difference; and perform the one or more sub-block based motion vector refinement passes using a second constraint associated with at least one of the first motion vector difference or the second motion vector difference, wherein the first constraint is different from the second constraint.
Aspect 52: The apparatus of any of aspects 50 or 51, wherein the one or more sub-block based motion vector refinement passes comprise at least one of: a sub-block based bilateral matching motion vector refinement pass or a sub-block based bi-directional optical flow motion vector refinement pass.
Aspect 53: The apparatus of any one of aspects 37 to 52, wherein the apparatus is a wireless communication device.
Aspect 54: The apparatus of any one of aspects 37 to 53, wherein the at least one processor is configured to determine the one or more refined motion vectors for a block of the video data, and wherein the merge mode candidate comprises a neighboring block of the block.
Aspect 55: A method for processing video data, comprising: obtaining one or more reference pictures for a current picture; identifying a first motion vector and a second motion vector for a merge mode candidate; determining a selected motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies; determining one or more refined motion vectors based on the one or more reference pictures and at least one of the first motion vector or the second motion vector using the selected motion vector search strategy; and processing the merge mode candidate using the one or more refined motion vectors.
Aspect 56: The method of aspect 55, wherein the selected motion vector search strategy is associated with one or more constraints based on at least one of the first motion vector or the second motion vector.
Aspect 57: The method of aspect 56, wherein the one or more constraints are determined for a block of the video data based on a syntax element signaled for the block.
Aspect 58: The method of any one of aspects 56 or 57, wherein the one or more constraints are associated with at least one of a first motion vector difference associated with the first motion vector or a second motion vector difference associated with the second motion vector.
Aspect 59: The method of aspect 58, wherein the one or more refined motion vectors include a first refined motion vector and a second refined motion vector, the method further comprising: determining the first motion vector difference as a difference between the first refined motion vector and the first motion vector; and determining the second motion vector difference as a difference between the second refined motion vector and the second motion vector.
Aspect 60: The method of any one of aspects 58 or 59, wherein the one or more constraints include a mirrored constraint for the first motion vector difference and the second motion vector difference, and wherein the first motion vector difference and the second motion vector difference have the same magnitude and different signs.
Aspect 61: The method of any one of aspects 58 to 60, wherein the one or more constraints include a zero-value constraint for at least one of the first motion vector difference or the second motion vector difference.
Aspect 62: The method of aspect 61, wherein the one or more refined motion vectors are determined using the selected motion vector search strategy by maintaining a first of the first motion vector difference or the second motion vector difference as a fixed value and searching with respect to a second of the first motion vector difference or the second motion vector difference, based on the zero-value constraint.
Aspect 63: The method of any of aspects 55 to 62, wherein the selected motion vector search strategy is a Bilateral Matching (BM) motion vector search strategy.
Aspect 64: The method of any of aspects 55 to 63, wherein the one or more refined motion vectors are determined based on one or more constraints associated with the selected motion vector search strategy, and wherein determining the one or more refined motion vectors based on the one or more constraints comprises: determining a first refined motion vector by searching a first reference picture around the first motion vector based on the selected motion vector search strategy; and determining a second refined motion vector by searching a second reference picture around the second motion vector based on the selected motion vector search strategy; wherein the one or more constraints comprise a motion vector difference constraint.
Aspect 65: The method of aspect 64, wherein determining the first refined motion vector and the second refined motion vector comprises: minimizing a difference between a first reference block associated with the first refined motion vector and a second reference block associated with the second refined motion vector.
Aspect 66: The method of any one of aspects 55 to 65, wherein the plurality of motion vector search strategies includes at least two of: a multi-pass decoder side motion vector refinement strategy, a fractional sample refinement strategy, a bi-directional optical flow strategy, or a sub-block based bilateral matching motion vector refinement strategy.
Aspect 67: The method of any of aspects 55 to 66, wherein the selected motion vector search strategy comprises a multi-pass decoder side motion vector refinement strategy.
Aspect 68: The method of aspect 67, wherein the multi-pass decoder side motion vector refinement strategy comprises at least one of: one or more block-based bilateral matching motion vector refinement passes, or one or more sub-block-based motion vector refinement passes.
Aspect 69: The method of aspect 68, further comprising: performing the one or more block-based bilateral matching motion vector refinement passes using a first constraint associated with at least one of the first motion vector difference or the second motion vector difference; and performing the one or more sub-block based motion vector refinement passes using a second constraint associated with at least one of the first motion vector difference or the second motion vector difference, wherein the first constraint is different from the second constraint.
Aspect 70: The method of any of aspects 68 or 69, wherein the one or more sub-block based motion vector refinement passes comprise at least one of: a sub-block based bilateral matching motion vector refinement pass or a sub-block based bi-directional optical flow motion vector refinement pass.
Aspect 71: a method of processing video data according to any of the operations of aspects 37 to 70.
Aspect 72: a computer-readable storage medium comprising instructions that, when executed by one or more processors of a device, cause the device to perform operations of any one of aspects 37 to 70.
Aspect 73: an apparatus comprising one or more means for performing any of the operations of aspects 37 through 70.
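Aspects 43, 44, 61, and 62 above describe a zero-value constraint in which one motion vector difference is held at a fixed value (zero) while the search proceeds only with respect to the other. A minimal Python sketch of that behaviour, assuming integer-pel candidates, an SAD matching cost, NumPy reference pictures, and no border handling, might look like the following; the names are hypothetical placeholders rather than the actual decoder routine.

```python
import numpy as np

def zero_mvd_bilateral_search(ref0, ref1, pos, mv0, mv1, size, search_range=2):
    """Bilateral refinement with MVD0 fixed at zero.

    The L0 prediction block (taken with the unmodified mv0) never moves;
    only candidates around mv1 are evaluated, and the one minimising the
    SAD against the fixed L0 block becomes the refined mv1.
    `pos` is (y, x); motion vectors are integer-pel (mvx, mvy) tuples.
    """
    def block(ref, mv):
        y, x = pos[0] + mv[1], pos[1] + mv[0]
        return ref[y:y + size, x:x + size].astype(np.int64)

    fixed0 = block(ref0, mv0)                       # MVD0 == 0: L0 block stays put
    best_cost, best_mv1 = None, mv1
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cand1 = (mv1[0] + dx, mv1[1] + dy)
            cost = int(np.abs(fixed0 - block(ref1, cand1)).sum())
            if best_cost is None or cost < best_cost:
                best_cost, best_mv1 = cost, cand1
    return mv0, best_mv1                            # mv0 unchanged, mv1 refined
```

Swapping which side is held fixed gives the complementary case in which the second motion vector difference is constrained to zero and only the first motion vector is refined.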

Claims (30)

1. An apparatus for processing video data, comprising:
at least one memory; and
at least one processor coupled to the at least one memory, the at least one processor configured to:
obtain one or more reference pictures for a current picture;
identify a first motion vector and a second motion vector for a merge mode candidate;
determine a selected motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies;
determine one or more refined motion vectors based on the one or more reference pictures and at least one of the first motion vector or the second motion vector using the selected motion vector search strategy; and
process the merge mode candidate using the one or more refined motion vectors.
2. The apparatus of claim 1, wherein the selected motion vector search strategy is associated with one or more constraints based on at least one of the first motion vector or the second motion vector.
3. The apparatus of claim 2, wherein the one or more constraints are determined for a block of the video data based on a syntax element signaled for the block.
4. The apparatus of claim 2, wherein the one or more constraints are associated with at least one of a first motion vector difference associated with the first motion vector or a second motion vector difference associated with the second motion vector.
5. The apparatus of claim 4, wherein the one or more refined motion vectors comprise a first refined motion vector and a second refined motion vector, and wherein the at least one processor is configured to:
determine the first motion vector difference as a difference between the first refined motion vector and the first motion vector; and
determine the second motion vector difference as a difference between the second refined motion vector and the second motion vector.
6. The apparatus of claim 4, wherein the one or more constraints comprise a mirrored constraint for the first motion vector difference and the second motion vector difference, and wherein the first motion vector difference and the second motion vector difference have the same magnitude and different signs.
7. The apparatus of claim 4, wherein the one or more constraints comprise zero-value constraints for at least one of the first motion vector difference or the second motion vector difference.
8. The apparatus of claim 7, wherein, based on the zero-value constraint, the at least one processor is configured to: determine the one or more refined motion vectors using the selected motion vector search strategy by maintaining a first one of the first motion vector difference or the second motion vector difference as a fixed value and searching with respect to a second one of the first motion vector difference or the second motion vector difference.
9. The apparatus of claim 1, wherein the selected motion vector search strategy is a Bilateral Matching (BM) motion vector search strategy.
10. The apparatus of claim 9, wherein the at least one processor is configured to determine the one or more refined motion vectors based on one or more constraints associated with the selected motion vector search strategy, and wherein, to determine the one or more refined motion vectors based on the one or more constraints, the at least one processor is configured to:
determine a first refined motion vector by searching a first reference picture around the first motion vector based on the selected motion vector search strategy; and
determine a second refined motion vector by searching a second reference picture around the second motion vector based on the selected motion vector search strategy;
wherein the one or more constraints comprise a motion vector difference constraint.
11. The apparatus of claim 10, wherein to determine the first refined motion vector and the second refined motion vector, the at least one processor is configured to:
minimize a difference between a first reference block associated with the first refined motion vector and a second reference block associated with the second refined motion vector.
12. The apparatus of claim 1, wherein the plurality of motion vector search strategies comprises at least two of: a multi-pass decoder side motion vector refinement strategy, a fractional sample refinement strategy, a bi-directional optical flow strategy, or a sub-block based bilateral matching motion vector refinement strategy.
13. The apparatus of claim 1, wherein the selected motion vector search strategy comprises a multi-pass decoder side motion vector refinement strategy.
14. The apparatus of claim 13, wherein the multi-pass decoder side motion vector refinement strategy comprises at least one of: one or more block-based bilateral matching motion vector refinement passes, or one or more sub-block-based motion vector refinement passes.
15. The apparatus of claim 14, wherein the at least one processor is configured to:
perform the one or more block-based bilateral matching motion vector refinement passes using a first constraint associated with at least one of the first motion vector difference or the second motion vector difference; and
perform the one or more sub-block based motion vector refinement passes using a second constraint associated with at least one of the first motion vector difference or the second motion vector difference, wherein the first constraint is different from the second constraint.
16. The apparatus of claim 14, wherein the one or more sub-block based motion vector refinement passes comprise at least one of: a sub-block based bilateral matching motion vector refinement pass or a sub-block based bi-directional optical flow motion vector refinement pass.
17. The apparatus of claim 1, wherein the apparatus is a wireless communication device.
18. The apparatus of claim 1, wherein the at least one processor is configured to determine the one or more refined motion vectors for a block of the video data, and wherein the merge mode candidate comprises a neighboring block of the block.
19. A method for processing video data, comprising:
obtaining one or more reference pictures for a current picture;
identifying a first motion vector and a second motion vector for a merge mode candidate;
determining a selected motion vector search strategy for the merge mode candidate from a plurality of motion vector search strategies;
determining one or more refined motion vectors based on the one or more reference pictures and at least one of the first motion vector or the second motion vector using the selected motion vector search strategy; and
processing the merge mode candidate using the one or more refined motion vectors.
20. The method of claim 19, wherein the selected motion vector search strategy is associated with one or more constraints based on at least one of the first motion vector or the second motion vector.
21. The method of claim 20, wherein the one or more constraints are determined for a block of the video data based on a syntax element signaled for the block.
22. The method of claim 20, wherein the one or more constraints are associated with at least one of a first motion vector difference associated with the first motion vector or a second motion vector difference associated with the second motion vector.
23. The method of claim 22, wherein the one or more refined motion vectors comprise a first refined motion vector and a second refined motion vector, the method further comprising:
determining the first motion vector difference as a difference between the first refined motion vector and the first motion vector; and
determining the second motion vector difference as a difference between the second refined motion vector and the second motion vector.
24. The method of claim 22, wherein the one or more constraints comprise a mirrored constraint for the first motion vector difference and the second motion vector difference, and wherein the first motion vector difference and the second motion vector difference have the same magnitude and different signs.
25. The method of claim 22, wherein the one or more constraints comprise zero-value constraints for at least one of the first motion vector difference or the second motion vector difference.
26. The method of claim 25, wherein the one or more refined motion vectors are determined using the selected motion vector search strategy by maintaining a first of the first motion vector difference or the second motion vector difference as a fixed value and searching with respect to a second of the first motion vector difference or the second motion vector difference based on the zero-value constraint.
27. The method of claim 19, wherein the selected motion vector search strategy is a Bilateral Matching (BM) motion vector search strategy, wherein the one or more refined motion vectors are determined based on one or more constraints associated with the selected motion vector search strategy, and wherein determining the one or more refined motion vectors based on the one or more constraints comprises:
determining a first refined motion vector by searching a first reference picture around the first motion vector based on the selected motion vector search strategy; and
determining a second refined motion vector by searching a second reference picture around the second motion vector based on the selected motion vector search strategy;
wherein the one or more constraints comprise a motion vector difference constraint.
28. The method of claim 27, wherein determining the first refined motion vector and the second refined motion vector comprises:
minimizing a difference between a first reference block associated with the first refined motion vector and a second reference block associated with the second refined motion vector.
29. The method of claim 19, wherein the selected motion vector search strategy comprises a multi-pass decoder side motion vector refinement strategy, wherein the multi-pass decoder side motion vector refinement strategy comprises at least one of: one or more block-based bilateral matching motion vector refinement passes, or one or more sub-block-based motion vector refinement passes.
30. The method of claim 29, further comprising:
performing the one or more block-based bilateral matching motion vector refinement passes using a first constraint associated with at least one of the first motion vector difference or the second motion vector difference; and
performing the one or more sub-block based motion vector refinement passes using a second constraint associated with at least one of the first motion vector difference or the second motion vector difference, wherein the first constraint is different from the second constraint.
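Claims 13 to 16 and 29 to 30 above arrange the refinement as an ordered sequence of passes, for example a block-based bilateral matching pass followed by sub-block based bilateral matching and sub-block based bi-directional optical flow passes, possibly with different motion vector difference constraints per pass. The skeleton below sketches only that control structure; the pass functions, the block descriptor, and all names are hypothetical stand-ins, not the claimed implementation.

```python
def multipass_dmvr(block, passes):
    """Run an ordered sequence of refinement passes over one coding block.

    `passes` is a list of (name, function) pairs; each function receives the
    block descriptor and the motion vectors refined so far and returns the
    updated vectors. Which passes appear, and in which order, is the
    per-strategy choice the multi-pass refinement leaves open.
    """
    mvs = block["initial_mvs"]
    for _name, refine in passes:
        mvs = refine(block, mvs)
    return mvs

# Hypothetical placeholder passes: real block-based bilateral matching,
# sub-block bilateral matching, and sub-block bi-directional optical flow
# passes would each search or solve for per-(sub-)block motion; here they
# return the vectors unchanged so only the control flow is illustrated.
def block_bm_pass(block, mvs):
    return mvs

def subblock_bm_pass(block, mvs):
    return mvs

def subblock_bdof_pass(block, mvs):
    return mvs

passes = [("block_bm", block_bm_pass),
          ("subblock_bm", subblock_bm_pass),
          ("subblock_bdof", subblock_bdof_pass)]
refined = multipass_dmvr({"initial_mvs": ((0, 0), (0, 0))}, passes)
```

Swapping, repeating, or dropping entries in the passes list is one way to realise the different multi-pass strategy subsets mentioned in Aspects 17 and 18.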
CN202280044692.2A 2021-06-29 2022-06-24 Adaptive bilateral matching for decoder-side motion vector refinement Pending CN117837143A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US63/216,468 2021-06-29
US63/263,754 2021-11-08
US17/847,942 US11895302B2 (en) 2021-06-29 2022-06-23 Adaptive bilateral matching for decoder side motion vector refinement
US17/847,942 2022-06-23
PCT/US2022/073155 WO2023278964A1 (en) 2021-06-29 2022-06-24 Adaptive bilateral matching for decoder side motion vector refinement

Publications (1)

Publication Number Publication Date
CN117837143A true CN117837143A (en) 2024-04-05

Family

ID=90519523

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280044692.2A Pending CN117837143A (en) 2021-06-29 2022-06-24 Adaptive bilateral matching for decoder-side motion vector refinement

Country Status (1)

Country Link
CN (1) CN117837143A (en)

Similar Documents

Publication Publication Date Title
CN112823517B (en) Improvement of motion vector predictor based on history
CN110383839B (en) Affine motion information derivation
TWI761415B (en) Motion vector reconstructions for bi-directional optical flow (bio)
CN107690810B (en) System and method for determining illumination compensation states for video coding
KR20200108432A (en) Improved decoder-side motion vector derivation
EP3603060A1 (en) Decoder-side motion vector derivation
EP3854076A1 (en) Affine motion prediction
JP2019514292A (en) Matching Constraints for Collocated Reference Index in Video Coding
US11582475B2 (en) History-based motion vector prediction
CN113170123A (en) Interaction of illumination compensation and inter-frame prediction
CN114402617A (en) Affine decoding with vector clipping
US11895302B2 (en) Adaptive bilateral matching for decoder side motion vector refinement
EP4298794A1 (en) Efficient video encoder architecture
JP2023554269A (en) Overlapping block motion compensation
EP4364418A1 (en) Adaptive bilateral matching for decoder side motion vector refinement
CN117981314A (en) Motion Vector (MV) candidate reordering
CN117837143A (en) Adaptive bilateral matching for decoder-side motion vector refinement
KR20240087733A (en) Reorder motion vector (MV) candidates
KR20240087746A (en) Create a histogram of gradients
CN116601959A (en) Overlapped block motion compensation
CN112534820A (en) Signaling sub-prediction unit motion vector predictor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40102592
Country of ref document: HK