CN113841395A

CN113841395A - Adaptive resolution change in video coding and decoding

Info

Publication number: CN113841395A
Application number: CN202080036230.7A
Authority: CN
Inventors: 张凯; 张莉; 刘鸿彬; 王悦
Original assignee: Beijing ByteDance Network Technology Co Ltd; ByteDance Inc
Current assignee: Beijing ByteDance Network Technology Co Ltd; ByteDance Inc
Priority date: 2019-05-16
Filing date: 2020-05-18
Publication date: 2021-12-24
Anticipated expiration: 2040-05-18
Also published as: CN113841395B; WO2020228835A1; CN113826382B; CN113826382A; WO2020228834A1; CN113875232A; WO2020228833A1

Abstract

Devices, systems, and methods for digital video coding are described, including bit depth and color format conversion for video coding. An exemplary method for video processing comprises: determining, for a current video block, a relationship between resolutions of two reference pictures referenced by the current video block; determining, in response to the relationship, whether and/or how to perform a particular operation on the current video block during an Adaptive Resolution Change (ARC) process; and performing a conversion between the bit stream representation of the current video block and the current video block based on the particular operation.

Description

Adaptive resolution change in video coding and decoding

Cross Reference to Related Applications

The present application timely claims the priority to and benefits of International Patent Application No. PCT/CN2019/087209, filed on May 16, 2019, in accordance with the applicable patent laws and/or rules pursuant to the Paris Convention. The entire disclosure of the aforementioned application is incorporated by reference as part of the disclosure of the present application.

Technical Field

This patent document relates to video encoding and decoding techniques, devices and systems.

Background

Despite advances in video compression technology, digital video still accounts for the largest bandwidth usage on the internet and other digital communication networks. As the number of networked user devices capable of receiving and displaying video increases, the bandwidth requirements for digital video usage are expected to continue to grow.

Disclosure of Invention

Devices, systems, and methods related to digital video codecs are described, particularly related to bit depth and color format conversion for video codecs. The described methods may be applied to existing video codec standards (e.g., High Efficiency Video Coding (HEVC)) and to future video codec standards or video codecs.

In one representative aspect, a method for video processing is disclosed, comprising: determining, for a current video block, a relationship between resolutions of two reference pictures referenced by the current video block; determining, in response to the relationship, whether and/or how to perform a particular operation on the current video block during an Adaptive Resolution Change (ARC) process; and performing a conversion between the bit stream representation of the current video block and the current video block based on the particular operation.

In another representative aspect, a method for video processing is disclosed, comprising: for a video block within a current picture, determining a relationship between a resolution of the current picture and a resolution of a reference picture to which the video block refers; responsive to the relationship, performing a particular operation on samples within the reference picture or on a prediction block of the video block during an Adaptive Resolution Change (ARC) process; and performing a conversion between the bit stream representation of the video block and the video block based on the particular operation.

In yet another representative aspect, the above-described methods are embodied in the form of processor-executable code and stored in a computer-readable program medium.

In yet another representative aspect, an apparatus configured or operable to perform the above-described method is disclosed. The apparatus may include a processor programmed to implement the method.

In yet another representative aspect, a video decoder device may implement the methods described herein.

The above and other aspects and features of the disclosed technology are described in more detail in the accompanying drawings, the description and the claims.

Drawings

Fig. 1 shows an example of adaptive streaming of two representations of the same content coded at different resolutions.

Fig. 2 shows another example of adaptive streaming of two representations of the same content coded at different resolutions, where the slices use either a closed group of pictures (GOP) or an open GOP prediction structure.

Fig. 3 shows an example of an open GOP prediction structure of two representations.

Fig. 4 shows an example of representations switched at open GOP positions.

Fig. 5 shows an example of decoding a random access skipped leading (RASL) picture using a resampled reference picture from another bitstream as a reference.

Figs. 6A-6C illustrate an example of region-wise mixed-resolution (RWMR) viewport-dependent 360° streaming based on motion-constrained tile sets (MCTSs).

Fig. 7 shows an example of collocated sub-picture representations with different Intra Random Access Point (IRAP) intervals and different sizes.

Fig. 8 shows an example of a segment received when a viewing orientation change results in a resolution change at the beginning of the segment.

Fig. 9 shows an example of the change of the viewing orientation.

Fig. 10 shows an example of sub-picture representations for two sub-picture positions.

Fig. 11 shows an example of encoder modification for Adaptive Resolution Change (ARC).

Fig. 12 shows an example of decoder modification for ARC.

Fig. 13 shows an example of tile group based resampling for ARC.

Fig. 14 shows an example of an ARC process.

Fig. 15 shows an example of alternative temporal motion vector prediction (ATMVP) for a coding unit.

Figs. 16A-16B illustrate examples of simplified affine motion models.

Fig. 17 shows an example of an affine Motion Vector Field (MVF) of each sub-block.

Fig. 18A and 18B show examples of a 4-parameter affine model and a 6-parameter affine model, respectively.

Fig. 19 shows an example of Motion Vector Prediction (MVP) for AF_INTER for inherited affine candidates.

Fig. 20 shows an example of MVP for AF_INTER for constructed affine candidates.

Figs. 21A and 21B show examples of candidates for AF_MERGE.

Fig. 22 shows an example of candidate positions of the affine merge mode.

Fig. 23 is a block diagram of an example of a hardware platform for implementing the visual media decoding or visual media encoding techniques described in this document.

Fig. 24 shows a flow diagram of an example method for video processing.

Fig. 25 shows a flow diagram of another example method for video processing.

Detailed Description

Embodiments of the disclosed techniques may be applied to existing video codec standards (e.g., HEVC, H.265) and future standards to improve compression performance. Section headings are used in this document to improve the readability of the description and do not in any way limit the discussion or the embodiments (and/or implementations) to the respective sections only.

1. Video codec introduction

Due to the increasing demand for high-resolution video, video coding methods and techniques are ubiquitous in modern technology. Video codecs typically include electronic circuits or software that compress or decompress digital video, and are continually being improved to provide higher coding efficiency. A video codec converts uncompressed video into a compressed format or vice versa. There are complex relationships between the video quality, the amount of data used to represent the video (determined by the bit rate), the complexity of the encoding and decoding algorithms, sensitivity to data losses and errors, ease of editing, random access, and end-to-end delay (latency). The compressed format usually conforms to a standard video compression specification, e.g., the High Efficiency Video Coding (HEVC) standard (also known as H.265 or MPEG-H Part 2), the Versatile Video Coding (VVC) standard to be finalized, or other current and/or future video coding standards.

Video coding standards have evolved primarily through the development of the well-known ITU-T and ISO/IEC standards. The ITU-T produced H.261 and H.263, ISO/IEC produced MPEG-1 and MPEG-4 Visual, and the two organizations jointly produced the H.262/MPEG-2 Video, H.264/MPEG-4 Advanced Video Coding (AVC), and H.265/HEVC standards. Since H.262, video coding standards have been based on the hybrid video coding structure, in which temporal prediction plus transform coding is used. To explore future video coding technologies beyond HEVC, VCEG and MPEG jointly founded the Joint Video Exploration Team (JVET) in 2015. Since then, JVET has adopted many new methods and put them into reference software named the Joint Exploration Model (JEM). In April 2018, the Joint Video Experts Team (JVET) between VCEG (Q6/16) and ISO/IEC JTC1 SC29/WG11 (MPEG) was created to work on the Versatile Video Coding (VVC) standard, targeting a 50% bitrate reduction compared to HEVC.

AVC and HEVC do not have the ability to change resolution without the introduction of IDR or Intra Random Access Point (IRAP) pictures; this capability may be referred to as Adaptive Resolution Change (ARC). There are some use cases or application scenarios that may benefit from ARC features, including the following:

- Rate adaptation in video telephony and conferencing: to adapt the coded video to changing network conditions, the encoder may switch to encoding smaller-resolution pictures when network conditions worsen and the available bandwidth shrinks. Currently, the picture resolution can only be changed at an IRAP picture, which causes several problems. An IRAP picture of reasonable quality is much larger than an inter-coded picture and correspondingly more complex to decode, which costs time and resources. This is a problem if the resolution change is requested by the decoder for loading reasons. It may also break low-latency buffer conditions, forcing an audio resynchronization, and the end-to-end delay of the stream will increase, at least temporarily. This can result in a poor user experience.

- Active speaker changes in multi-party video conferencing: in multi-party video conferencing, the active speaker is typically shown at a larger video size than the other conference participants. When the active speaker changes, the picture resolution of each participant may also need to be adjusted. The need for an ARC feature becomes more important when such changes of active speaker occur frequently.

- Fast start in streaming: a streaming application commonly buffers a certain length of decoded pictures before it starts displaying. Starting the bitstream at a smaller resolution allows the application to fill the buffer with enough pictures to start displaying sooner.

- Adaptive stream switching in streaming: the Dynamic Adaptive Streaming over HTTP (DASH) specification includes a feature named @mediaStreamStructureId. This enables switching between different representations at open GOP random access points with non-decodable leading pictures (e.g., CRA pictures with associated RASL pictures in HEVC). When two different representations of the same video have different bitrates but the same spatial resolution, and they have the same @mediaStreamStructureId value, switching between the two representations can be performed at a CRA picture with associated RASL pictures, and the RASL pictures associated with the switched-to CRA picture can be decoded with acceptable quality, enabling seamless switching. With ARC, the @mediaStreamStructureId feature may also be used to switch between DASH representations with different spatial resolutions.

ARC is also known as dynamic resolution conversion.

ARC can also be considered a special case of Reference Picture Resampling (RPR), such as that of H.263 Annex P.

1.1. Reference picture resampling in Annex P of H.263

This mode describes an algorithm that warps the reference picture before it is used for prediction. It may be useful for resampling a reference picture that has a different source format than the picture being predicted. It can also be used for global motion estimation, or estimation of rotating motion, by warping the shape, size, and position of the reference picture. The syntax includes the warping parameters to be used as well as the resampling algorithm. The simplest level of operation for the reference picture resampling mode is an implicit factor-of-4 resampling, since only an FIR filter needs to be applied for the upsampling and downsampling processes. In this case, no additional signaling overhead is required, since its use is understood when the size of a new picture (indicated in the picture header) differs from that of the previous picture.

1.2. Contributions on ARC to VVC

1.2.1. JVET-M0135

The preliminary design of ARC described below, parts of which are taken from JCTVC-F158, is suggested as a placeholder to trigger discussion.

1.2.1.1 basic tool description

The basic tool constraints to support ARC are as follows:

The spatial resolution may differ from the nominal resolution by a factor of 0.5, applied to both dimensions. The spatial resolution may increase or decrease, yielding scaling ratios of 0.5 and 2.0.

The aspect ratio and chroma format of the video format are unchanged.

The cropping areas are scaled in proportion to the spatial resolution.

Reference pictures are simply rescaled as needed, and inter prediction is applied as usual.
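As a non-limiting illustration, the constraints above may be checked as sketched below. The function names `arc_scaling_ratio` and `scale_cropping` are hypothetical and do not appear in JVET-M0135; treating a ratio of 1 (no resolution change) as permitted, and truncating scaled cropping offsets to integers, are assumptions of this sketch.

```python
from fractions import Fraction

# Ratios permitted between current and reference resolution.
# 1 (no change) is included as an assumption of this sketch.
ALLOWED_RATIOS = {Fraction(1, 2), Fraction(1, 1), Fraction(2, 1)}


def arc_scaling_ratio(ref_size, cur_size):
    """Return the common scaling ratio cur/ref for (width, height) pairs.

    Raises ValueError if the two dimensions scale differently (aspect
    ratio change) or if the ratio is not one of the permitted values.
    """
    ref_w, ref_h = ref_size
    cur_w, cur_h = cur_size
    ratio_w = Fraction(cur_w, ref_w)
    ratio_h = Fraction(cur_h, ref_h)
    if ratio_w != ratio_h:
        raise ValueError("aspect ratio must remain unchanged")
    if ratio_w not in ALLOWED_RATIOS:
        raise ValueError(f"unsupported scaling ratio {ratio_w}")
    return ratio_w


def scale_cropping(crop, ratio):
    """Scale cropping offsets (left, right, top, bottom) by the ratio."""
    return tuple(int(offset * ratio) for offset in crop)
```

For example, a reference picture of 1920 × 1080 and a current picture of 960 × 540 yield a ratio of 1/2, whereas a current picture of 1280 × 720 would be rejected under these constraints.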

1.2.1.2 Scaling operations

It is proposed to use simple, zero-phase separable down-scaling and up-scaling filters. Note that these filters are used for prediction only; a decoder may use more sophisticated scaling for output purposes. The following 1:2 down-scaling filter is used, which has zero phase and 5 taps: (-1, 9, 16, 9, -1) / 32

The down-sampling points are at even sample positions and are co-sited. The same filter is used for luma and chroma.

For 2:1 up-sampling, additional samples at odd grid positions are generated using the half-pel motion compensation interpolation filter coefficients in the latest VVC WD.

The combined up- and down-sampling does not change the phase or the position of the chroma sampling points.
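As a non-limiting illustration, the 1:2 down-scaling may be sketched as follows, assuming the 5-tap kernel (-1, 9, 16, 9, -1) / 32, whose taps sum to 32. Edge replication at picture boundaries, rounding by adding 16 before the shift, and clipping to the sample bit depth are assumptions of this sketch; the function names are hypothetical.

```python
# 5-tap zero-phase kernel; the taps sum to 32, so the result is shifted by 5.
DOWN_TAPS = (-1, 9, 16, 9, -1)


def downsample_1d(samples, bit_depth=8):
    """Keep every even position of a row, filtered with DOWN_TAPS / 32."""
    n = len(samples)
    max_val = (1 << bit_depth) - 1
    out = []
    for center in range(0, n, 2):  # output samples are co-sited with even inputs
        acc = 0
        for k, tap in enumerate(DOWN_TAPS):
            # Edge replication at the boundaries (assumed padding rule).
            pos = min(max(center + k - 2, 0), n - 1)
            acc += tap * samples[pos]
        # Round to nearest (assumed) and clip to the valid sample range.
        out.append(min(max((acc + 16) >> 5, 0), max_val))
    return out


def downsample_2d(picture, bit_depth=8):
    """Separable 2-D down-scaling: filter the rows, then the columns."""
    rows = [downsample_1d(row, bit_depth) for row in picture]
    cols = [downsample_1d(list(col), bit_depth) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

Because the filter is separable and identical for luma and chroma, the same two routines may be applied to every colour plane.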

1.2.1.3 Resolution description in parameter sets

The changes to the signaling of picture resolution in the SPS are shown below. In the following description and the rest of this document, deleted text is marked with double brackets (e.g., [[a]] denotes the deletion of the character "a").

Sequence parameter set RBSP syntax and semantics

[[pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.]]

num_pic_size_in_luma_samples_minus1 plus 1 specifies the number of picture sizes (width and height), in units of luma samples, that may be present in the coded video sequence.

pic_width_in_luma_samples[i] specifies the i-th width of decoded pictures, in units of luma samples, that may be present in the coded video sequence. pic_width_in_luma_samples[i] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

pic_height_in_luma_samples[i] specifies the i-th height of decoded pictures, in units of luma samples, that may be present in the coded video sequence. pic_height_in_luma_samples[i] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

Picture parameter set RBSP syntax and semantics

pic_size_idx specifies the index to the i-th picture size in the sequence parameter set. The width of pictures that refer to the picture parameter set is pic_width_in_luma_samples[pic_size_idx] in luma samples. Likewise, the height of pictures that refer to the picture parameter set is pic_height_in_luma_samples[pic_size_idx] in luma samples.
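As a non-limiting illustration, the relationship between the list of picture sizes in the SPS and the index carried in the PPS may be sketched as follows. The classes `Sps` and `Pps`, the helper `picture_size`, and the default MinCbSizeY value of 8 are hypothetical and serve only to show the lookup and the stated constraints.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sps:
    pic_width_in_luma_samples: List[int]
    pic_height_in_luma_samples: List[int]
    min_cb_size_y: int = 8  # assumed value, for illustration only

    @property
    def num_pic_size_in_luma_samples_minus1(self) -> int:
        return len(self.pic_width_in_luma_samples) - 1

    def validate(self) -> None:
        """Check that every size is non-zero and a multiple of MinCbSizeY."""
        widths = self.pic_width_in_luma_samples
        heights = self.pic_height_in_luma_samples
        if len(widths) != len(heights) or not widths:
            raise ValueError("width and height lists must be non-empty and equal in length")
        for value in widths + heights:
            if value == 0 or value % self.min_cb_size_y != 0:
                raise ValueError(f"invalid picture dimension {value}")


@dataclass
class Pps:
    pic_size_idx: int


def picture_size(sps: Sps, pps: Pps):
    """Return (width, height) of pictures referring to the given PPS."""
    idx = pps.pic_size_idx
    if not 0 <= idx <= sps.num_pic_size_in_luma_samples_minus1:
        raise ValueError("pic_size_idx out of range")
    return sps.pic_width_in_luma_samples[idx], sps.pic_height_in_luma_samples[idx]
```

In this arrangement, a resolution change within a coded video sequence only requires a picture to refer to a PPS with a different pic_size_idx; the SPS itself stays unchanged.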

1.2.2. JVET-M0259

1.2.2.1. Background: sub-picture

The term "sub-picture track" is defined in the Omnidirectional Media Format (OMAF) as follows: a track that has spatial relationships to other tracks and that represents a spatial subset of the original video content, which has been split into spatial subsets before video encoding at the content production side. A sub-picture track for HEVC can be constructed by rewriting the parameter sets and slice segment headers of a motion-constrained tile set so that it becomes a self-standing HEVC bitstream. A sub-picture representation may be defined as a DASH representation that carries a sub-picture track.

JVET-M0261 uses the term "sub-picture" as a spatial partitioning unit for VVC, summarized as follows:

1. Pictures are divided into sub-pictures, tile groups, and tiles.

2. A sub-picture is a rectangular set of tile groups that starts with a tile group whose tile_group_address is equal to 0.

3. Each sub-picture may refer to its own PPS and may hence have its own tile partitioning.

4. Sub-pictures are treated like pictures in the decoding process.

5. The reference pictures for decoding the current sub-picture are generated by extracting the area collocated with the current sub-picture from the reference pictures in the decoded picture buffer. The extracted area is a decoded sub-picture, i.e., inter prediction takes place between sub-pictures of the same size and the same location within the picture.

6. A tile group is a sequence of tiles in the tile raster scan of a sub-picture.

In this document, the term "sub-picture" refers to that defined in JVET-M0261. However, a track that encapsulates a sub-picture sequence as defined in JVET-M0261 has very similar properties to a sub-picture track defined in OMAF, so the examples given below apply in both cases.

1.2.2.2. Use case

1.2.2.2.1. Adaptive resolution change in streaming

Requirement for the support of adaptive streaming

Section 5.13 of MPEG N17074 ("Support for adaptive streaming") includes the following requirement for VVC:

In the case of adaptive streaming services offering multiple representations of the same content, each having different properties (e.g., spatial resolution or sample bit depth), the standard should support fast representation switching. The standard should enable the use of efficient prediction structures (e.g., so-called open groups of pictures) without compromising the fast and seamless representation switching capability between representations of different properties, such as different spatial resolutions.

Example of representation switching with an open GOP prediction structure

Content generation for adaptive bitrate streaming includes the generation of different representations, which may have different spatial resolutions. The client requests segments from the representations and can therefore decide at which resolution and bitrate the content is received. At the client, the segments of different representations are concatenated, decoded, and played. The client should be able to achieve seamless playback with one decoder instance. As shown in Fig. 1, a closed GOP structure (starting with an IDR picture) is conventionally used. An open GOP prediction structure (starting with a CRA picture) may achieve better compression performance than the corresponding closed GOP prediction structure. For example, a 5.6% reduction in luma Bjontegaard delta bitrate is achieved with an IRAP picture interval of 24 pictures.

The open GOP prediction structure also reduces subjectively visible quality pumping.

One challenge in using open GOPs in streaming is that RASL pictures cannot be decoded with the correct reference pictures after switching representations. This challenge is described below in relation to the representations in Fig. 2.

A segment starting with a CRA picture includes RASL pictures, at least one reference picture of which is in the previous segment. This is shown in Fig. 3, where picture 0 in both bitstreams resides in the previous segment and is used as a reference for predicting the RASL pictures.

The representation switching marked with a dashed rectangle in Fig. 2 is shown in Fig. 4. It can be seen that the reference picture for the RASL pictures ("picture 0") has not been decoded. Therefore, the RASL pictures are not decodable and there may be a gap in the playback of the video.

However, as described by embodiments of this disclosure, it has been found that decoding RASL pictures with resampled reference pictures is subjectively acceptable. Fig. 5 shows "picture 0" being resampled and used as a reference picture for decoding the RASL pictures.

1.2.2.2.2. Viewport change in region-wise mixed-resolution (RWMR) 360° video streaming

Background: HEVC-based RWMR streaming

RWMR 360° streaming provides an increased effective spatial resolution on the viewport. Schemes in which the tiles covering the viewport originate from a 6K (6144 × 3072) ERP picture or an equivalent CMP resolution, as shown in Fig. 6, with "4K" decoding capability (HEVC Level 5.1), are included in clauses D.6.3 and D.6.4 of OMAF and have also been adopted in the VR Industry Forum guidelines. Such resolutions are asserted to be suitable for head-mounted displays using quad-HD (2560 × 1440) display panels.

Encoding: the content is encoded at two spatial resolutions, with cube-face sizes of 1536 × 1536 and 768 × 768, respectively. In both bitstreams, a 6 × 4 tile grid is used and a motion-constrained tile set (MCTS) is coded for each tile position.

Encapsulation: each MCTS sequence is encapsulated as a sub-picture track and made available as a sub-picture representation in DASH.

Selection of streaming MCTS: 12 MCTSs are selected from the high resolution bitstream and complementary 12 MCTSs are extracted from the low resolution bitstream. Thus, a hemisphere of streaming content (180 ° × 180 °) is derived from the high resolution bit stream.

Merging MCTSs into a bitstream to be decoded: the received MCTSs for a single time instance are merged into a 1920 × 4608 coded picture, which conforms to HEVC level 5.1. Another option for the merged picture is to have 4 tile columns of width 768, two tile columns of width 384 and three tile rows of height 768 luma samples, resulting in a picture of 3840 x 2304 luma samples.
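As a non-limiting illustration, the picture-size arithmetic behind the merged picture may be verified as sketched below. The sketch assumes a 3 × 2 cube-face arrangement, so that a 6 × 4 tile grid places 2 × 2 tiles on each face, and it checks only the luma picture size against the HEVC Level 5.1 limit (MaxLumaPs = 8,912,896); other level constraints such as bit rate are not considered. The function names are hypothetical.

```python
# Luma picture size limit (MaxLumaPs) of HEVC Level 5.1, in luma samples.
HEVC_LEVEL_5_1_MAX_LUMA_PS = 8_912_896


def tile_size(cube_face_size, tiles_per_face_side=2):
    """Side length of a square tile when each cube face holds 2x2 tiles."""
    return cube_face_size // tiles_per_face_side


def merged_luma_samples(selection):
    """Total luma samples of a picture built from (count, tile_side) entries."""
    return sum(count * side * side for count, side in selection)


high = tile_size(1536)  # tiles taken from the high-resolution bitstream
low = tile_size(768)    # tiles taken from the low-resolution bitstream
total = merged_luma_samples([(12, high), (12, low)])
fits_level_5_1 = total <= HEVC_LEVEL_5_1_MAX_LUMA_PS
```

Both merged layouts mentioned above, 1920 × 4608 and 3840 × 2304, contain the same number of luma samples as the 12 + 12 selected tiles, and that number stays below the Level 5.1 limit.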

Background: several representations of different IRAP intervals for viewport-dependent 360 ° streaming

When the viewing direction changes in HEVC-based viewport-dependent 360° streaming, a new selection of sub-picture representations can take effect at the next IRAP-aligned segment boundary. The sub-picture representations are merged into coded pictures for decoding, so the VCL NAL unit types are aligned across all selected sub-picture representations.

To provide a trade-off between the response time to viewing direction changes and the rate-distortion performance when the viewing direction is stable, multiple versions of the content may be coded at different IRAP intervals. This is illustrated in Fig. 7 for one set of collocated sub-picture representations of the encoding presented in Fig. 6.

Fig. 8 shows an example in which a sub-picture position is first selected to be received at the lower resolution (384 × 384). A change in viewing direction causes a new selection in which the sub-picture position is received at the higher resolution (768 × 768). In this example, the viewing direction change occurs such that segment 4 is received from the short IRAP interval sub-picture representation. Thereafter, the viewing direction is stable, so the long IRAP interval version can be used from segment 5 onwards.

Problem statement

Since the viewing direction moves gradually in typical viewing situations, in RWMR viewport-dependent streaming the resolution changes only in a subset of the sub-picture positions. Fig. 9 shows the cube faces after the viewing direction has changed slightly up and to the right relative to Fig. 6. The cube-face partitions that have a different resolution than before are denoted by "C". It can be observed that the resolution changes in 6 of the 24 cube-face partitions. However, as described above, in response to a viewing direction change, segments starting with an IRAP picture need to be received for all 24 cube-face partitions. Updating all sub-picture positions with segments starting with an IRAP picture is inefficient in terms of streaming rate-distortion performance.

In addition, the ability to use an open GOP prediction structure with the sub-picture representations of RWMR 360° streaming is desirable to improve rate-distortion performance and to avoid the visible picture quality pumping caused by a closed GOP prediction structure.

Proposed design objectives

The following design goals are set forth:

VVC design should allow merging a sub-picture originating from a random access picture and another sub-picture originating from a non-random access picture into the same coded picture compliant with VVC.

VVC design should enable the use of open GOP prediction structures in sub-picture representations without affecting the fast seamless representation switching capability between sub-picture representations of different properties (such as different spatial resolutions), while being able to merge sub-picture representations into a single VVC bitstream.

The design goals may be illustrated with Fig. 10, in which sub-picture representations for two sub-picture positions are presented. For both sub-picture positions, a separate version of the content is coded for each combination of two resolutions and two random access intervals. Some segments start with an open GOP prediction structure. The viewing direction change causes the resolution of sub-picture position 1 to switch at the beginning of segment 4. Since segment 4 starts with a CRA picture that has associated RASL pictures, those reference pictures of the RASL pictures that are in segment 3 need to be resampled. It should be noted that this resampling applies to sub-picture position 1, while decoded sub-pictures of some other sub-picture positions are not resampled. In this example, the viewing direction change does not cause a change in the resolution of sub-picture position 2, and therefore the decoded sub-pictures of sub-picture position 2 are not resampled. Within the first picture of segment 4, the segment for sub-picture position 1 contains a sub-picture originating from a CRA picture, while the segment for sub-picture position 2 contains a sub-picture originating from a non-random-access picture. It is proposed to allow merging of these sub-pictures into one coded picture in VVC.
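As a non-limiting illustration, the per-position decision described above may be sketched as follows. The `SegmentChoice` record and the `plan_resampling` function are hypothetical; the sketch assumes that reference resampling is needed for a sub-picture position exactly when its new segment starts with a CRA picture (so that RASL pictures refer to the previous segment) and its resolution differs from that of the previous segment.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SegmentChoice:
    resolution: Tuple[int, int]  # (width, height) of the sub-picture
    starts_with_cra: bool        # True if the segment opens with a CRA picture


def plan_resampling(previous: Dict[int, SegmentChoice],
                    current: Dict[int, SegmentChoice]) -> Dict[int, bool]:
    """Map each sub-picture position to whether its references are resampled."""
    plan = {}
    for position, cur in current.items():
        prev = previous[position]
        plan[position] = cur.starts_with_cra and cur.resolution != prev.resolution
    return plan


# Segment 3 -> segment 4: only position 1 switches resolution.
segment3 = {1: SegmentChoice((384, 384), False), 2: SegmentChoice((768, 768), False)}
segment4 = {1: SegmentChoice((768, 768), True), 2: SegmentChoice((768, 768), False)}
plan = plan_resampling(segment3, segment4)
```

Under these assumptions, only the decoded sub-pictures of position 1 in segment 3 are resampled, while position 2 continues decoding without resampling.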

1.2.2.2.3. Adaptive resolution change in video conferencing

JCTVC-F158 proposes adaptive resolution change primarily for video conferencing. The following subsections are reproduced from JCTVC-F158 and describe use cases where adaptive resolution change is considered useful.

Seamless network adaptation and error recovery

Applications such as video conferencing and streaming over packet networks frequently require that the encoded stream adapt to changing network conditions, especially when bit rates are too high and data is being lost. Such applications typically have a return channel that allows the encoder to detect errors and perform adjustments. The encoder has two main tools available: reducing the bit rate and changing the temporal or spatial resolution. Temporal resolution changes can be achieved effectively by coding with a hierarchical prediction structure. However, to achieve the best quality, changes in spatial resolution are also needed as part of a well-designed encoder for video communication.

Changing the spatial resolution in AVC requires sending an IDR frame and resetting the stream. This causes serious problems. An IDR frame of reasonable quality is much larger than an inter picture and correspondingly more complex to decode, which costs time and resources. This is a problem if the resolution change is requested by the decoder for loading reasons. It may also break low-latency buffer conditions, forcing an audio resynchronization, and the end-to-end delay of the stream will increase, at least temporarily. This gives the user a poor experience.

To minimize these problems, the IDR is typically sent at low quality, using a similar number of bits to a P frame, and it takes a significant time to return to full quality for the given resolution. To obtain a sufficiently low delay, the quality may be very low indeed, and there is often visible blurring before the image is "refocused". In effect, the intra frame does very little useful work in compression terms: it is just a way of restarting the stream.

Therefore, there is a need for a method that allows for changing resolution with minimal impact on the subjective experience in HEVC, especially under challenging network conditions.

Fast start

It would be useful to have a "fast start" mode in which the first frame is sent at a reduced resolution and the resolution is increased in the next few frames to reduce delay and restore normal quality more quickly without unacceptable image blurring at the beginning.

Conference "compose"

Video conferences also often have a facility whereby the person speaking is shown full-screen and the other participants are shown in smaller-resolution windows. To support this efficiently, the smaller pictures are typically sent at a lower resolution. This resolution is then increased when the participant becomes the speaker and is shown full-screen. Sending an intra frame at this point causes an unpleasant hiccup in the video stream. This effect can be quite noticeable and objectionable if speakers alternate rapidly.

1.2.2.3. Proposed design objectives

The following are the high-level design choices proposed for VVC version 1:

1. It is proposed to include a reference picture resampling process in VVC version 1 for the following use cases:

Use of efficient prediction structures (e.g., so-called open groups of pictures) in adaptive streaming without compromising the fast and seamless representation switching capability between representations of different properties, such as different spatial resolutions.

Adapting low-delay conversational video content to network conditions and application-originated resolution changes without significant delay or delay variation.

2. It is proposed that the VVC design allow merging a sub-picture originating from a random access picture and another sub-picture originating from a non-random-access picture into the same coded picture conforming to VVC. This is asserted to enable efficient handling of viewing direction changes in mixed-quality and mixed-resolution viewport-adaptive 360° streaming.

3. It is proposed to include a sub-picture-wise resampling process in VVC version 1. This is asserted to enable efficient prediction structures for more efficient handling of viewing direction changes in mixed-resolution viewport-adaptive 360° streaming.

1.2.3.JVET-N0048

The use case and design goals for Adaptive Resolution Change (ARC) are discussed in detail in JFET-M0259. The following is a brief summary:

1. real-time communication

JCTVC-F158 initially includes the following use cases for adaptive resolution change:

a. seamless network adaptation and error recovery (by dynamic adaptive resolution change)

b. Fast start (resolution is gradually increased when session is started or reset)

c. Conference "write" (speaker gets higher resolution)

2. Adaptive streaming

Section 5.13 of MPEG N17074 ("supports adaptive streaming") includes the following requirements for VVC:

In case of adaptive streaming services offering multiple representations of the same content, each having different properties (e.g., spatial resolution or sample bit depth), the standard should support fast representation switching. The standard should allow the use of efficient prediction structures (e.g., so-called open groups of pictures) without compromising the capability to switch quickly and seamlessly between representations of different properties, such as different spatial resolutions.

JVET-M0259 discusses how to meet this requirement by resampling the reference pictures of leading pictures.

3. 360-degree viewport-dependent streaming

JVET-M0259 discusses how to address this use case by resampling certain independently coded picture regions of the reference pictures of leading pictures.

This document proposes an adaptive resolution coding method that is claimed to meet all of the above use cases and design objectives. The 360-degree viewport-dependent streaming and conference "write" use cases are handled by this proposal together with JVET-N0045 (which proposes a separate sub-picture layer).

Proposed Specification text

Signaling

sps_max_rpr

sps_max_rpr specifies the maximum number of active reference pictures in reference picture list 0 or 1, for any slice group in the CVS, whose pic_width_in_luma_samples and pic_height_in_luma_samples are not equal to the pic_width_in_luma_samples and pic_height_in_luma_samples, respectively, of the current picture.

Picture width and height

max_width_in_luma_samples specifies that it is a requirement of bitstream conformance that, for any picture of the CVS for which this SPS is active, pic_width_in_luma_samples in any active PPS is less than or equal to max_width_in_luma_samples.

max_height_in_luma_samples specifies that it is a requirement of bitstream conformance that, for any picture of the CVS for which this SPS is active, pic_height_in_luma_samples in any active PPS is less than or equal to max_height_in_luma_samples.

High-level decoding process

The decoding process of the current picture CurrPic is as follows:

1. The decoding of NAL units is specified in clause 8.2.

2. The processes in clause 8.3 specify the following decoding processes using syntax elements in the slice group header layer and above:

- Variables and functions relating to picture order count are derived as specified in clause 8.3.1. This needs to be invoked only for the first slice group of a picture.

- At the beginning of the decoding process for each slice group of a non-IDR picture, the decoding process for reference picture list construction specified in clause 8.3.2 is invoked to derive reference picture list 0 (RefPicList[0]) and reference picture list 1 (RefPicList[1]).

- The decoding process for reference picture marking in clause 8.3.3 is invoked, wherein reference pictures may be marked as "unused for reference" or "used for long-term reference". This needs to be invoked only for the first slice group of a picture.

- For each active reference picture in RefPicList[0] and RefPicList[1] whose pic_width_in_luma_samples or pic_height_in_luma_samples is not equal to the pic_width_in_luma_samples or pic_height_in_luma_samples, respectively, of CurrPic, the following applies:

  - The resampling process in clause X.Y.Z is invoked [Ed. (MH): details of the invocation parameters to be added]; its output has the same reference picture marking and picture order count as its input.

  - The reference picture used as input to the resampling process is marked as "unused for reference".

3. [Ed. (YK): invocations of the decoding processes for coding tree units, scaling, transform, in-loop filtering, etc. to be added here.]

4. After all slice groups of the current picture have been decoded, the current decoded picture is marked as "used for short-term reference".
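The reference handling in step 2 can be sketched as follows. This is only an illustrative model under stated assumptions: the `Picture` class, the `resample` placeholder and `prepare_reference_lists` are hypothetical names, and the actual sample filtering of clause X.Y.Z is omitted. The sketch shows just the bookkeeping: a reference whose size differs from CurrPic is replaced by a resampled copy that inherits marking and picture order count, and the input is marked "unused for reference".

```python
from dataclasses import dataclass


@dataclass
class Picture:
    poc: int                      # picture order count
    width: int                    # pic_width_in_luma_samples
    height: int                   # pic_height_in_luma_samples
    marking: str = "used for short-term reference"


def resample(ref: Picture, width: int, height: int) -> Picture:
    """Hypothetical stand-in for the resampling process (no sample filtering here).

    The output keeps the reference picture marking and POC of the input.
    """
    return Picture(ref.poc, width, height, ref.marking)


def prepare_reference_lists(curr: Picture, ref_pic_lists):
    """Return copies of RefPicList[0..1] in which every entry matches CurrPic's size."""
    cache = {}                    # id(original) -> resampled copy, so each picture is resampled once
    result = []
    for ref_list in ref_pic_lists:
        new_list = []
        for ref in ref_list:
            if id(ref) in cache:
                new_list.append(cache[id(ref)])
            elif ref.width != curr.width or ref.height != curr.height:
                resampled = resample(ref, curr.width, curr.height)
                ref.marking = "unused for reference"   # the input of the resampling is retired
                cache[id(ref)] = resampled
                new_list.append(resampled)
            else:
                new_list.append(ref)
        result.append(new_list)
    return result
```

Caching the resampled copy is a design choice of this sketch so that a picture appearing in both lists is resampled only once; the proposal text does not prescribe it.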

Resampling process

The SHVC resampling process (HEVC clause H.8.1.4.2) is proposed, with the following additions: …

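If sps_ref_wraparound_enabled_flag is equal to 0, the sample value tempArray[n] with n = 0..7 is derived as follows:

tempArray[n] = (f_L[xPhase, 0] * rlPicSample_L[Clip3(0, refW - 1, xRef - 3), yPosRL] +
                f_L[xPhase, 1] * rlPicSample_L[Clip3(0, refW - 1, xRef - 2), yPosRL] +
                f_L[xPhase, 2] * rlPicSample_L[Clip3(0, refW - 1, xRef - 1), yPosRL] +
                f_L[xPhase, 3] * rlPicSample_L[Clip3(0, refW - 1, xRef), yPosRL] +
                f_L[xPhase, 4] * rlPicSample_L[Clip3(0, refW - 1, xRef + 1), yPosRL] +
                f_L[xPhase, 5] * rlPicSample_L[Clip3(0, refW - 1, xRef + 2), yPosRL] +
                f_L[xPhase, 6] * rlPicSample_L[Clip3(0, refW - 1, xRef + 3), yPosRL] +
                f_L[xPhase, 7] * rlPicSample_L[Clip3(0, refW - 1, xRef + 4), yPosRL]) >> shift1

Otherwise, the sample value tempArray[n] with n = 0..7 is derived as follows:

refOffset = (sps_ref_wraparound_offset_minus1 + 1) * MinCbSizeY

tempArray[n] = (f_L[xPhase, 0] * rlPicSample_L[ClipH(refOffset, refW, xRef - 3), yPosRL] +
                f_L[xPhase, 1] * rlPicSample_L[ClipH(refOffset, refW, xRef - 2), yPosRL] +
                f_L[xPhase, 2] * rlPicSample_L[ClipH(refOffset, refW, xRef - 1), yPosRL] +
                f_L[xPhase, 3] * rlPicSample_L[ClipH(refOffset, refW, xRef), yPosRL] +
                f_L[xPhase, 4] * rlPicSample_L[ClipH(refOffset, refW, xRef + 1), yPosRL] +
                f_L[xPhase, 5] * rlPicSample_L[ClipH(refOffset, refW, xRef + 2), yPosRL] +
                f_L[xPhase, 6] * rlPicSample_L[ClipH(refOffset, refW, xRef + 3), yPosRL] +
                f_L[xPhase, 7] * rlPicSample_L[ClipH(refOffset, refW, xRef + 4), yPosRL]) >> shift1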

…

If sps_ref_wraparound_enabled_flag is equal to 0, the sample value tempArray[n] with n = 0..3 is derived as follows:

tempArray[n] = (f_C[xPhase, 0] * rlPicSample_C[Clip3(0, refWC - 1, xRef - 1), yPosRL] +
                f_C[xPhase, 1] * rlPicSample_C[Clip3(0, refWC - 1, xRef), yPosRL] +
                f_C[xPhase, 2] * rlPicSample_C[Clip3(0, refWC - 1, xRef + 1), yPosRL] +
                f_C[xPhase, 3] * rlPicSample_C[Clip3(0, refWC - 1, xRef + 2), yPosRL]) >> shift1    (H-50)

Otherwise, the sample value tempArray[n] with n = 0..3 is derived as follows:

refOffset = ((sps_ref_wraparound_offset_minus1 + 1) * MinCbSizeY) / SubWidthC

tempArray[n] = (f_C[xPhase, 0] * rlPicSample_C[ClipH(refOffset, refWC, xRef - 1), yPosRL] +
                f_C[xPhase, 1] * rlPicSample_C[ClipH(refOffset, refWC, xRef), yPosRL] +
                f_C[xPhase, 2] * rlPicSample_C[ClipH(refOffset, refWC, xRef + 1), yPosRL] +
                f_C[xPhase, 3] * rlPicSample_C[ClipH(refOffset, refWC, xRef + 2), yPosRL]) >> shift1
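The difference between the clamped (Clip3) and wrap-around (ClipH) sample fetches in the formulas above can be sketched as below. Several things are assumptions of this sketch and not taken from this document: the definition of `clip_h` follows the usual wrap-around clipping convention (add the offset to positions left of the picture, subtract it from positions right of it), the function name `luma_tap_sum` is hypothetical, and the 8-tap coefficients are the well-known HEVC half-sample luma interpolation taps, used only as a stand-in because the filter tables are not reproduced in this text.

```python
def clip3(lo, hi, x):
    """Clamp x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def clip_h(offset, width, x):
    """Assumed wrap-around clipping: shift out-of-picture positions by `offset`."""
    if x < 0:
        return x + offset
    if x > width - 1:
        return x - offset
    return x


# Stand-in coefficients (HEVC half-sample luma taps, sum = 64); NOT the proposal's Table 1.
STAND_IN_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)


def luma_tap_sum(row, x_ref, coeffs, shift1, ref_offset=None):
    """Compute one tempArray[n] value for a single row of reference luma samples.

    ref_offset=None models sps_ref_wraparound_enabled_flag equal to 0 (Clip3 path);
    otherwise the ClipH path is used with the given refOffset.
    """
    ref_w = len(row)
    acc = 0
    for k, c in enumerate(coeffs):          # taps cover xRef - 3 .. xRef + 4
        x = x_ref - 3 + k
        if ref_offset is None:
            x = clip3(0, ref_w - 1, x)
        else:
            x = clip_h(ref_offset, ref_w, x)
        acc += c * row[x]
    return acc >> shift1
```

With the wrap-around path, samples to the left of the picture boundary are fetched from the opposite side of the picture instead of repeating the border sample, which is the behaviour wanted for 360-degree content.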

1.2.4. JVET-N0052

Adaptive resolution change has been a concept in video compression standards since at least 1996, notably in the H.263+ related proposals for reference picture resampling (RPR, Annex P) and reduced resolution update (Annex Q). It has recently gained some attention, first through the proposal by Cisco during JCT-VC, then in the context of VP9 (where it is now reasonably widely deployed), and most recently in the context of VVC. ARC makes it possible to reduce the number of samples that need to be coded for a given picture and to upsample the resulting reference picture to a higher resolution when needed.

We believe ARC is of particular interest in two scenarios:

1) Intra-coded pictures (such as IDR pictures) are typically much larger than inter pictures. Downsampling pictures that are intended to be intra coded, for whatever reason, may provide a better input for future prediction. It is also clearly advantageous from a rate control point of view, at least in low-delay applications.

2) When a codec is operated near its breaking point, as at least some cable and satellite operators routinely seem to do, ARC may become handy even for non-intra coded pictures, such as in scene transitions without a hard transition point.

3) Perhaps looking a bit too far ahead: is the concept of a fixed resolution defensible at all? With the departure of CRTs and the ubiquity of scaling engines in rendering devices, the hard binding between rendering resolution and coding resolution is a thing of the past. We also note that some studies suggest that most people cannot concentrate on fine details (possibly associated with high resolution) when there is a lot of activity in a video sequence, even if that activity is spatially located elsewhere. If this is correct and generally accepted, fine-granularity resolution changes may be a better rate control mechanism than adaptive QP. We raise this point for discussion now because we lack data; feedback is appreciated. Of course, abandoning the concept of fixed-resolution bitstreams has countless system-level and implementation implications, of which we are well aware (at least at the level of their existence, if not their detailed nature).

Technically, ARC can be implemented as reference picture resampling. Implementing reference picture resampling has two main aspects: the resampling filters, and the signaling of the resampling information in the bitstream. This document focuses on the latter and touches on the former only to the extent that we have implementation experience. Further study of suitable filter designs is encouraged.

Overview of existing ARC embodiments

Figs. 11 and 12 show a conventional ARC encoder and decoder implementation, respectively. In our implementation, the picture width and height can be changed at per-picture granularity, regardless of picture type. At the encoder, the input image data is downsampled to the picture size selected for encoding the current picture. After the first input picture is encoded as an intra picture, the decoded picture is stored in the decoded picture buffer (DPB). When subsequent pictures are downsampled at a different sampling ratio and encoded as inter pictures, the reference pictures in the DPB are scaled up or down according to the spatial ratio between the reference picture size and the current picture size. At the decoder, decoded pictures are stored in the DPB without resampling. However, when used for motion compensation, a reference picture in the DPB is scaled up or down according to the spatial ratio between the currently decoded picture and the reference. When a decoded picture is bumped out for display, it is upsampled to the original picture size or to the required output picture size. In the motion estimation/compensation process, motion vectors are scaled according to the picture size ratio as well as the picture order count difference.
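As a minimal sketch of the decision described above, the following computes the spatial ratio between a reference picture in the DPB and the current picture and reports whether the reference would have to be scaled up or down before motion compensation. The function names and the returned labels are hypothetical and chosen for illustration only; exact fractions are used to avoid committing to any fixed-point precision, which this text does not specify.

```python
from fractions import Fraction


def spatial_ratio(ref_size, cur_size):
    """Return (horizontal, vertical) ratio of reference picture size to current picture size."""
    (ref_w, ref_h), (cur_w, cur_h) = ref_size, cur_size
    return Fraction(ref_w, cur_w), Fraction(ref_h, cur_h)


def rescale_action(ref_size, cur_size):
    """Say how a DPB reference must be rescaled to match the current picture."""
    hor, ver = spatial_ratio(ref_size, cur_size)
    if hor == 1 and ver == 1:
        return "none"          # same size: use the reference as stored
    if hor <= 1 and ver <= 1:
        return "upscale"       # reference is smaller than the current picture
    if hor >= 1 and ver >= 1:
        return "downscale"     # reference is larger than the current picture
    return "mixed"             # one dimension grows while the other shrinks
```

For example, a 960x540 reference used by a 1920x1080 picture yields a ratio of 1/2 in both dimensions and would be upscaled.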

Signaling of ARC parameters

The term ARC parameters is used herein for the combination of any parameters required for ARC to work. In the simplest case, this could be a scaling factor, or an index into a table of defined scaling factors. It could be a target resolution (e.g., at sample granularity or at maximum CU size granularity), or an index into a table providing target resolutions, as proposed in JVET-M0135. It would also include filter selectors, or even filter parameters (all the way up to filter coefficients), of the up/down-sampling filters in use.

From the outset, we propose here to allow, at least conceptually, different ARC parameters for different parts of a picture. We propose that the appropriate syntax structure according to the current VVC draft would be a rectangular slice group (TG). Those using scan-order TGs would be restricted to using ARC only for the full picture, or to the extent that scan-order TGs are contained in a rectangular TG (we do not recall that TG nesting has been discussed so far, and perhaps it is a bad idea). This can easily be specified by a bitstream constraint.

Since different TGs may have different ARC parameters, the appropriate place for the ARC parameters would be either the TG header, or a parameter set with the scope of a TG that is referenced by the TG header (the adaptation parameter set in the current VVC draft), or a more detailed reference (an index) into a table in a higher parameter set. Of these three options, we propose at this point to use the TG header to code a reference to a table entry containing the ARC parameters, with the table located in the SPS and the maximum table values coded in the (upcoming) DPS. We could also code a scaling factor directly into the TG header without using any parameter set values. Using the PPS for the reference, as proposed in JVET-M0135, is counter-indicated if, as in our design, per slice group signaling of ARC parameters is a design criterion.

For the table entries themselves, we can see many options:

- Coding downsampling factors, either one for both dimensions or independently for the X and Y dimensions? This is mainly a (hardware) implementation discussion, and some may prefer an outcome in which the scaling factor in the X dimension is fairly flexible while the scaling factor in the Y dimension is fixed to 1 or offers very few choices. We consider the syntax to be the wrong place to express such constraints, and we prefer to express them as conformance requirements if they are needed. In other words, the syntax is kept flexible.

- Coding target resolutions. This is what we propose below. These resolutions may be subject to more or less complex constraints relative to the current resolution, possibly expressed as bitstream conformance requirements.

- Downsampling per slice group is preferred, to allow for picture composition/extraction. However, this is not critical from a signaling point of view. If the group makes the unwise decision to allow ARC at picture granularity only, we can always include a bitstream conformance requirement that all TGs use the same ARC parameters.

- Control information related to ARC. In our design below, this includes the reference picture size.

- Is flexibility in the filter design required? Anything larger than a handful of code points? If so, put those into the APS? (No, please, no further APS update discussions: if the downsampling filter changes while the ALF stays the same, we propose that the bitstream has to bear the overhead.)

Now, in order to keep the proposed technique consistent and simple (as far as possible), we propose:

- A fixed filter design.

- Target resolutions in a table in the SPS, with bitstream constraints TBD.

- Minimum/maximum target resolution in the DPS, to facilitate capability exchange/negotiation.

The resulting syntax may be as follows:

Decoder parameter set RBSP syntax

max_pic_width_in_luma_samples specifies the maximum width of decoded pictures in the bitstream, in units of luma samples. max_pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY. The value of dec_pic_width_in_luma_samples[i] shall not be greater than the value of max_pic_width_in_luma_samples.

max_pic_height_in_luma_samples specifies the maximum height of decoded pictures, in units of luma samples. max_pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY. The value of dec_pic_height_in_luma_samples[i] shall not be greater than the value of max_pic_height_in_luma_samples.

Sequence parameter set RBSP syntax

adaptive_pic_resolution_change_flag equal to 1 specifies that the output picture size (output_pic_width_in_luma_samples, output_pic_height_in_luma_samples), an indication of the number of decoded picture sizes (num_dec_pic_size_in_luma_samples_minus1), and at least one decoded picture size (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i]) are present in the SPS. A reference picture size (reference_pic_width_in_luma_samples, reference_pic_height_in_luma_samples) is present conditioned on the value of reference_pic_size_present_flag.

output_pic_width_in_luma_samples specifies the width of the output picture in units of luma samples. output_pic_width_in_luma_samples shall not be equal to 0.

output_pic_height_in_luma_samples specifies the height of the output picture in units of luma samples. output_pic_height_in_luma_samples shall not be equal to 0.

reference_pic_size_present_flag equal to 1 specifies that reference_pic_width_in_luma_samples and reference_pic_height_in_luma_samples are present.

reference_pic_width_in_luma_samples specifies the width of the reference picture in units of luma samples. output_pic_width_in_luma_samples shall not be equal to 0. When not present, the value of reference_pic_width_in_luma_samples is inferred to be equal to dec_pic_width_in_luma_samples[i].

reference_pic_height_in_luma_samples specifies the height of the reference picture in units of luma samples. output_pic_height_in_luma_samples shall not be equal to 0. When not present, the value of reference_pic_height_in_luma_samples is inferred to be equal to dec_pic_height_in_luma_samples[i].

Note 1: The size of the output picture shall be equal to the values of output_pic_width_in_luma_samples and output_pic_height_in_luma_samples. When a reference picture is used for motion compensation, its size shall be equal to the values of reference_pic_width_in_luma_samples and reference_pic_height_in_luma_samples.

num_dec_pic_size_in_luma_samples_minus1 plus 1 specifies the number of decoded picture sizes (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i]), in units of luma samples, in the coded video sequence.

dec_pic_width_in_luma_samples[i] specifies the i-th width of the decoded picture sizes, in units of luma samples, in the coded video sequence. dec_pic_width_in_luma_samples[i] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

dec_pic_height_in_luma_samples[i] specifies the i-th height of the decoded picture sizes, in units of luma samples, in the coded video sequence. dec_pic_height_in_luma_samples[i] shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

Note 2: The i-th decoded picture size (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i]) may be equal to the decoded picture size of a decoded picture in the coded video sequence.

Slice group header syntax

dec_pic_size_idx specifies that the width of the decoded picture shall be equal to pic_width_in_luma_samples[dec_pic_size_idx] and the height of the decoded picture shall be equal to pic_height_in_luma_samples[dec_pic_size_idx].
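The table-plus-index signaling described by these semantics can be sketched as follows. The classes `Dps` and `Sps`, the functions `check_sps` and `decoded_picture_size`, and the MinCbSizeY value used in any call are hypothetical illustrations; only the constraints being checked (non-zero sizes, integer multiples of MinCbSizeY, sizes not exceeding the DPS maxima) come from the semantics above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Dps:
    max_pic_width_in_luma_samples: int
    max_pic_height_in_luma_samples: int


@dataclass
class Sps:
    # Entry i holds (dec_pic_width_in_luma_samples[i], dec_pic_height_in_luma_samples[i]).
    dec_pic_sizes: List[Tuple[int, int]]


def check_sps(sps: Sps, dps: Dps, min_cb_size_y: int) -> None:
    """Raise ValueError if a decoded picture size violates the stated constraints."""
    for i, (w, h) in enumerate(sps.dec_pic_sizes):
        if w == 0 or h == 0:
            raise ValueError(f"size {i}: width and height shall not be 0")
        if w % min_cb_size_y or h % min_cb_size_y:
            raise ValueError(f"size {i}: not an integer multiple of MinCbSizeY")
        if w > dps.max_pic_width_in_luma_samples or h > dps.max_pic_height_in_luma_samples:
            raise ValueError(f"size {i}: exceeds the maximum signaled in the DPS")


def decoded_picture_size(sps: Sps, dec_pic_size_idx: int) -> Tuple[int, int]:
    """Resolve the slice group header index into a (width, height) pair."""
    return sps.dec_pic_sizes[dec_pic_size_idx]
```

Keeping only an index in the slice group header keeps the per-picture overhead small, while the SPS table and the DPS maxima let a decoder know the full set of sizes up front.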

Filter

The proposed design conceptually includes four different filter sets: a downsampling filter from the original picture to the input picture, up/down-sampling filters to rescale reference pictures for motion estimation/compensation, and an upsampling filter from the decoded picture to the output picture. The first and the last can be left as non-normative matters. Within the scope of the specification, the up/down-sampling filters need to be either explicitly signaled in an appropriate parameter set or predefined.

Our implementation uses the downsampling filter of SHVC (SHM version 12.4), which is a 12-tap, 2D separable filter, to downsample reference pictures to the size used for motion compensation. In the current implementation, only dyadic sampling is supported. Therefore, the phase of the downsampling filter is set to zero by default. For upsampling, an 8-tap interpolation filter with 16 phases is used to shift the phase and align the luma and chroma sample positions with the original positions.

Table 1 and Table 2 provide the 8-tap filter coefficients f_L[p, x], with p = 0..15 and x = 0..7, used for the luma upsampling process, and the 4-tap filter coefficients f_C[p, x], with p = 0..15 and x = 0..3, used for the chroma upsampling process.

Table 3 provides the 12-tap filter coefficients for the downsampling process. Luma and chroma are both downsampled using the same filter coefficients.

Table 1. Luma upsampling filter with 16 phases

Table 2. Chroma upsampling filter with 16 phases

Table 3. Downsampling filter coefficients for luma and chroma

We have not experimented with other filter designs. We anticipate (perhaps obviously) that subjective and objective gains could be obtained by using filters that adapt to the content and/or the scaling factor.

Slice group boundary discussion

As is probably true of much slice group related work, our implementation of slice group (TG) based ARC is not yet complete. Our preference is to revisit that implementation once the discussion of spatial composition and extraction, in the compressed domain, of multiple sub-pictures into a composite picture has produced at least a working draft. That, however, does not prevent us from extrapolating the outcome to some extent and adapting our signaling design accordingly.

For now, we consider the slice group header to be the right place for something like dec_pic_size_idx, for the reasons already stated. We use a single ue(v) code point, dec_pic_size_idx, conditionally present in the slice group header, to indicate the ARC parameters in use. To match our implementation, which supports per-picture ARC only, the one thing we would need to do in specification space right now is either to code only a single slice group, or to make it a condition of bitstream conformance that all TG headers of a given coded picture have the same value of dec_pic_size_idx (when present).

The parameter dec_pic_size_idx can be moved into whatever header starts a sub-picture. Our current feeling is that this will most likely continue to be the slice group header.

Beyond these syntax considerations, some additional work is needed to enable slice group based or sub-picture based ARC. Perhaps the most difficult part is how to address the issue of unneeded samples in a picture in which a sub-picture has been resampled to a smaller size.

Consider the right part of Fig. 13, which consists of four sub-pictures (possibly expressed as four rectangular slice groups in the bitstream syntax). On the left, the lower-right TG is subsampled to half size. What do we do with the samples outside the relevant area, marked "half"?

Some existing video coding standards have in common that spatial extraction of parts of a picture in the compressed domain is not supported. This implies that each sample of a picture is represented by one or more syntax elements, and that each syntax element affects at least one sample. If we want to keep this property, we need to populate the area around the samples covered by the downsampled TG, marked "half". H.263+ Annex P solved this problem by padding; in fact, the sample values of the padded samples could be signaled in the bitstream (within certain strict limits).

An alternative, which might constitute a significant departure from the previous assumptions but may be needed in any case if we are to support sub-bitstream extraction (and composition) based on rectangular parts of a picture, would be to relax the current understanding that each sample of a reconstructed picture has to be represented by something in the coded picture (even if that something is only a skipped block).

Implementation considerations, system implications, and profiles/levels

We propose that basic ARC be included in the "baseline/main" profile. Sub-profiles can be used to remove it where certain application scenarios do not need it. Certain restrictions may be acceptable. In this regard, we note that some H.263+ profiles and "recommended modes" (which pre-dated profiles) included a restriction that Annex P could be used only with an "implicit factor of 4", i.e., dyadic downsampling in both dimensions. This was sufficient to support fast start (getting the I frame over quickly) in video conferencing.

The design is such that we believe all filtering can be done "on the fly", with no or only a negligible increase in memory bandwidth. In this regard, we see no need to relegate ARC to an exotic profile.

We do not believe that complex tables and the like can be meaningfully used in capability exchange, as was argued in Marrakech in connection with JVET-M0135. Assuming offer-answer and similar limited-depth handshakes, the number of options is simply too large to allow meaningful cross-vendor interoperability. Realistically, to support ARC in a meaningful way in a capability exchange scenario, we have to fall back on a handful of interoperability points at most. For example: no ARC, ARC with an implicit factor of 4, full ARC. Alternatively, we could specify mandatory support for all of ARC and leave restrictions on bitstream complexity to higher-level SDOs. In any event, this is a strategic discussion we should have at some point (beyond what already exists in the context of the sub-profile and flag discussions).

As for levels: we believe the basic design principle must be that, as a condition of bitstream conformance, the sample count of an upsampled picture must fit the level of the bitstream, no matter how much upsampling is signaled in the bitstream, and that all samples must fit into the upsampled coded picture. We note that this was not the case in H.263+, where certain samples might not be present.

1.2.5. JVET-N0118

The following aspects are proposed:

1) A list of picture resolutions is signaled in the SPS, and an index into the list is signaled in the PPS to specify the size of an individual picture.

2) For any picture to be output, the decoded picture before resampling is cropped (as needed) and output, i.e., the resampled picture is not used for output, only for inter prediction reference.

3) Resampling ratios of 1.5x and 2x are supported. Arbitrary resampling ratios are not supported. Whether one or more additional resampling ratios are needed is left for further study.

4) Between picture-level resampling and block-level resampling, the proponents prefer block-level resampling.

a. However, if picture-level resampling is selected, the following aspects are proposed:

i. When a reference picture is resampled, both the resampled version and the original version of the reference picture are stored in the DPB, and both therefore affect DPB fullness.

ii. A resampled reference picture is marked as "unused for reference" when the corresponding non-resampled reference picture is marked as "unused for reference".

iii. The RPL signaling syntax remains unchanged, while the RPL construction process is modified as follows: when a reference picture needs to be included in an RPL entry and a version of that reference picture with the same resolution as the current picture is not in the DPB, the picture resampling process is invoked and the resampled version of the reference picture is included in the RPL entry.

iv. The number of resampled reference pictures that may be present in the DPB should be limited, e.g., to less than or equal to 2.

b. Otherwise (block-level resampling is selected), the following is proposed:

i. To limit the worst-case decoder complexity, it is proposed to disallow bi-prediction of a block from a reference picture with a different resolution than the current picture.

ii. Another option is to combine the two filters and apply the operation in a single step when both resampling and quarter-pixel interpolation are required.

5) Regardless of whether a picture-based resampling method or a block-based resampling method is selected, it is proposed to apply temporal motion vector scaling as needed.

1.2.5.1. Detailed description of the preferred embodiments

The ARC software was implemented on the basis of VTM 4.0.1 with the following changes:

- A list of supported resolutions is signaled in the SPS.

- Spatial resolution signaling is moved from the SPS to the PPS.

- A picture-based resampling scheme is implemented for resampling reference pictures.

After the pictures are decoded, the reconstructed pictures may be resampled to different spatial resolutions. Both the original reconstructed picture and the resampled reconstructed picture are stored in the DPB and are available for reference by future pictures in decoding order.

The resampling filter implemented is based on the filter tested in JCTVC-H0234, as follows:

- Upsampling filter: 4-tap +/- quarter-phase DCTIF with taps (-4, 54, 16, -2)/64

- Downsampling filter: h11 filter with taps (1, 0, -3, 0, 10, 16, 10, 0, -3, 0, 1)/32
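A one-dimensional sketch of 2:1 downsampling and 1:2 upsampling with the taps listed above is given below. The tap values are taken from the text; everything else is an assumption of this sketch: the function names, border handling by sample repetition, rounding offsets, the choice of output positions at -1/4 and +1/4 around each input sample, and the absence of clipping to the sample bit depth.

```python
DOWN_TAPS = (1, 0, -3, 0, 10, 16, 10, 0, -3, 0, 1)   # h11 filter, normalised by 32
UP_TAPS = (-4, 54, 16, -2)                            # +1/4 phase DCTIF, normalised by 64


def _at(row, i):
    """Fetch row[i], repeating the border sample outside the row (assumed padding)."""
    return row[max(0, min(len(row) - 1, i))]


def downsample_2to1(row):
    """Keep every second position, filtered with the 11-tap filter centred on it."""
    out = []
    for i in range(len(row) // 2):
        centre = 2 * i
        acc = sum(c * _at(row, centre + k - 5) for k, c in enumerate(DOWN_TAPS))
        out.append((acc + 16) >> 5)
    return out


def upsample_1to2(row):
    """Produce two outputs per input sample, at positions i - 1/4 and i + 1/4."""
    out = []
    for i in range(len(row)):
        # Mirrored taps for the -1/4 phase, direct taps for the +1/4 phase.
        minus = sum(c * _at(row, i + 1 - k) for k, c in enumerate(UP_TAPS))
        plus = sum(c * _at(row, i - 1 + k) for k, c in enumerate(UP_TAPS))
        out.append((minus + 32) >> 6)
        out.append((plus + 32) >> 6)
    return out
```

Because both tap sets sum to their normalisation factor, flat areas are preserved, whereas detail at the highest frequencies is removed by the downsampling filter and cannot be restored by upsampling.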

When constructing the reference picture lists of the current picture (i.e. L0 and L1), only reference pictures having the same resolution as the current picture are used. Note that the reference picture may be of an original size or a resampled size.

TMVP and ATMVP may be enabled; however, when the original coding resolutions of the current picture and a reference picture differ, TMVP and ATMVP are disabled for that reference picture.

For ease and simplicity of the initial software implementation, the decoder outputs pictures at the highest available resolution.

1.2.5.2. Signaling on picture size and picture output

1. On spatial resolution lists of coded pictures in a bitstream

Currently, all coded pictures in a CVS have the same resolution. It is therefore straightforward to signal just one resolution (i.e., the picture width and height) in the SPS. With ARC support, a list of picture resolutions needs to be signaled instead of a single resolution. We propose to signal this list in the SPS and to signal an index into the list in the PPS to specify the size of an individual picture.

2. Relating to picture output

We propose that, for any picture to be output, the decoded picture before resampling is cropped (as needed) and output, i.e., the resampled picture is not used for output, only for inter prediction reference. The ARC resampling filters should be designed to optimize the use of resampled pictures for inter prediction, and such filters may not be optimal for picture output/display purposes, whereas video terminal devices typically already implement optimized output zooming/scaling functions.

1.2.5.3. With respect to resampling

The resampling of decoded pictures may be picture-based or block-based. For the final ARC design in VVC, we prefer block-based resampling over picture-based resampling. We recommend that both methods be discussed and that JVET decide which of the two should be specified for ARC support in VVC.

Picture-based resampling

In picture-based ARC resampling, a picture is resampled only once for a certain resolution and then stored in the DPB, while an un-resampled version of the same picture remains in the DPB as well.

There are two problems with using picture-based resampling for ARC: 1) an additional DPB buffer is required to store the resampled reference pictures, and 2) additional memory bandwidth is required due to the increased operations of reading and writing reference picture data from and to the DPB.

Keeping only one version of a reference picture in the DPB is not a good idea for picture-based resampling. If we keep only the non-resampled version, the reference picture may need to be resampled multiple times, because multiple pictures may refer to the same reference picture. On the other hand, if the reference picture is resampled and we keep only the resampled version, we need to apply inverse resampling when the reference picture has to be output, since it is better to output non-resampled pictures, as discussed above. This is a problem because resampling is not a lossless operation. Take a picture A, downsample it, and then upsample it to obtain A' with the same resolution as A: A and A' will not be the same. A' contains less information than A because some high-frequency information is lost during the downsampling and upsampling processes.

To address the issues of extra DPB buffers and memory bandwidth, we propose that, if the ARC design in VVC uses picture-based resampling, the following apply:

1. When a reference picture is resampled, both the resampled version and the original, non-resampled version of the reference picture are stored in the DPB, and both therefore affect DPB fullness.

2. A resampled reference picture is marked as "unused for reference" when the corresponding non-resampled reference picture is marked as "unused for reference".

3. The reference picture list (RPL) of each slice group contains reference pictures that have the same resolution as the current picture. Although the RPL signaling syntax does not need to change, the RPL construction process is modified as follows to guarantee what is stated in the preceding sentence: when a reference picture needs to be included in an RPL entry but a version of that reference picture with the same resolution as the current picture is not yet available, the picture resampling process is invoked and the resampled version of the reference picture is included.

4. The number of resampled reference pictures that may be present in the DPB should be limited, e.g., less than or equal to 2.

Furthermore, to enable temporal MV usage (e.g., Merge mode and ATMVP) in the case where the temporal MV is from a reference frame of a different resolution than the current resolution, we propose to scale the temporal MV to the current resolution as needed.

Block-based ARC resampling

In block-based resampling for ARC, the reference block is resampled whenever needed, and no resampled picture is stored in the DPB.

The main problem here is the additional decoder complexity. This is because a block in a reference picture can be referred to multiple times by multiple blocks in another picture and blocks in multiple pictures.

When a block in a reference picture is referred to by a block in the current picture and the resolutions of the reference picture and the current picture differ, the reference block is resampled by invoking an interpolation filter so that the reference block has integer-pixel resolution. When the motion vector has quarter-pixel precision, the interpolation process is invoked again to obtain the resampled reference block at quarter-pixel resolution. Therefore, each motion compensation operation for the current block from a reference block of a different resolution requires up to two interpolation filtering operations instead of one. Without ARC support, at most one interpolation filtering operation (i.e., generating the reference block at quarter-pixel resolution) is needed.

To limit the worst case complexity, we propose that if the ARC design for VVC uses block-based resampling, the following applies:

- Bi-prediction of a block from a reference picture that has a different resolution than the current picture is not allowed.

More precisely, the constraint is as follows: for a current block blkA in the current picture picA that refers to a reference block blkB in a reference picture picB, when picA and picB have different resolutions, block blkA shall be a uni-predicted block.

Under this constraint, the worst-case number of interpolation operations needed to decode a block is limited to two. If a block refers to a block from a picture of a different resolution, the number of interpolation operations needed is two, as described above. This is the same as when a block refers to reference blocks from pictures of the same resolution and is coded as a bi-predicted block, since the number of interpolation operations is also two (i.e., one to obtain the quarter-pixel resolution for each reference block).

To simplify the implementation, we propose another variant: if the ARC design in VVC uses block-based resampling, the following applies:

If the reference frame and the current frame have different resolutions, the corresponding position of each pixel of the predictor is calculated first, and then interpolation is applied only once. That is, the two interpolation operations (i.e., one for resampling and one for quarter-pixel interpolation) are combined into a single interpolation operation. The sub-pixel interpolation filters in the current VVC can be reused, although in this case the granularity of interpolation must be enlarged; in return, the number of interpolation operations is reduced from two to one.

To enable temporal MV usage (e.g. Merge mode and ATMVP) in case the temporal MV is from a reference frame of a different resolution than the current resolution, we propose to scale the temporal MV to the current resolution as needed.

Resampling ratio

In JVET-M0135, to start the discussion on ARC, it was proposed to consider only a resampling ratio of 2× (meaning 2 × 2 for up-sampling and 1/2 × 1/2 for down-sampling) as a starting point for ARC. From further discussion of this topic after the Marrakech meeting, we learned that supporting only the 2× resampling ratio is very limiting, since in some cases a smaller difference between the resampled and non-resampled resolutions would be more beneficial.

While it may be desirable to support an arbitrary resampling ratio, it appears to be difficult to support. This is because the number of resampling filters that must be defined and implemented to support an arbitrary resampling ratio appears to be too large and places a large burden on the decoder implementation.

We propose that more than one, but a small number of, resampling ratios be supported, including at least the 1.5× and 2× resampling ratios, and that arbitrary resampling ratios not be supported.

1.2.5.4. Maximum DPB buffer size and buffer fullness

Using ARC, the DPB may contain decoded pictures of different spatial resolutions within the same CVS. For DPB management and related aspects, it is no longer efficient to calculate DPB size and fullness in decoded picture units.

If ARC is supported, the following is a discussion of some specific aspects that need to be addressed in the final VVC specification and possible solutions (we do not propose to adopt possible solutions at this meeting):

1. Instead of using the value of PicSizeInSamplesY (i.e., PicSizeInSamplesY = pic_width_in_luma_samples * pic_height_in_luma_samples) to derive MaxDpbSize (i.e., the maximum number of reference pictures that may be present in the DPB), MaxDpbSize is derived based on the value of MinPicSizeInSamplesY. MinPicSizeInSamplesY is defined as follows:

MinPicSizeInSamplesY = (width of the minimum picture resolution in the bitstream) * (height of the minimum picture resolution in the bitstream)

The derivation of MaxDpbSize is modified as follows (based on HEVC equation):

if (MinPicSizeInSamplesY < (MaxLumaPs >> 2))

MaxDpbSize = Min(4 * maxDpbPicBuf, 16)

else if (MinPicSizeInSamplesY < (MaxLumaPs >> 1))

MaxDpbSize = Min(2 * maxDpbPicBuf, 16)

else if (MinPicSizeInSamplesY < ((3 * MaxLumaPs) >> 2))

MaxDpbSize = Min((4 * maxDpbPicBuf) / 3, 16)

else

MaxDpbSize = maxDpbPicBuf
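The derivation above can be sketched in Python as follows. This is only an illustration, not normative text: the function and argument names are hypothetical, the comparisons use "<" as printed above, and the division is assumed to be an integer division, as in the HEVC equation on which it is based.

```python
def max_dpb_size(min_pic_size_in_samples_y: int, max_luma_ps: int, max_dpb_pic_buf: int) -> int:
    """Derive MaxDpbSize from the smallest picture size in the bitstream (sketch)."""
    if min_pic_size_in_samples_y < (max_luma_ps >> 2):
        return min(4 * max_dpb_pic_buf, 16)
    if min_pic_size_in_samples_y < (max_luma_ps >> 1):
        return min(2 * max_dpb_pic_buf, 16)
    if min_pic_size_in_samples_y < ((3 * max_luma_ps) >> 2):
        # Integer division is an assumption carried over from the HEVC equation.
        return min((4 * max_dpb_pic_buf) // 3, 16)
    return max_dpb_pic_buf


# Example values only (MaxLumaPs and maxDpbPicBuf are level-dependent limits).
print(max_dpb_size(1920 * 1080, 8912896, 6))
print(max_dpb_size(3840 * 2160, 8912896, 6))
```

With these example limits, a bitstream whose smallest picture is 1920 × 1080 falls into the first branch, while a bitstream whose smallest picture is 3840 × 2160 falls into the last one.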

2. Each decoded picture is associated with a value called PictureSizeUnit. PictureSizeUnit is an integer value that specifies how large the decoded picture is relative to MinPicSizeInSamplesY. The definition of PictureSizeUnit depends on the resampling ratios supported by ARC in VVC.

For example, if ARC supports only a resampling ratio of 2, PictureSizeUnit is defined as follows:

A decoded picture with the minimum resolution in the bitstream is associated with a PictureSizeUnit of 1.

A decoded picture whose resolution is 2 by 2 times the minimum resolution in the bitstream is associated with a PictureSizeUnit of 4 (i.e., 1 * 4).

As another example, if ARC supports both resampling ratios of 1.5 and 2, PictureSizeUnit is defined as follows:

A decoded picture with the minimum resolution in the bitstream is associated with a PictureSizeUnit of 4.

A decoded picture whose resolution is 1.5 by 1.5 times the minimum resolution in the bitstream is associated with a PictureSizeUnit of 9 (i.e., 2.25 * 4).

A decoded picture whose resolution is 2 by 2 times the minimum resolution in the bitstream is associated with a PictureSizeUnit of 16 (i.e., 4 * 4).

For other resampling ratios supported by ARC, the same principle as in the examples above should be used to determine the PictureSizeUnit value for each picture size.
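One way to read the "same principle" is that each PictureSizeUnit is the picture area relative to the minimum picture area, multiplied by the smallest factor that makes all values integers. The Python sketch below illustrates that reading; the function name and the use of exact fractions are assumptions made for illustration only.

```python
from fractions import Fraction
from math import lcm  # requires Python 3.9 or later


def picture_size_units(linear_ratios):
    """Map per-dimension ratios (relative to the minimum resolution) to integer units."""
    # The area ratio is the square of the per-dimension ratio.
    areas = [Fraction(r) ** 2 for r in linear_ratios]
    # Smallest common factor that turns every area ratio into an integer.
    scale = lcm(*(a.denominator for a in areas))
    return [int(a * scale) for a in areas]


print(picture_size_units([Fraction(1), Fraction(2)]))                  # ratios 1 and 2
print(picture_size_units([Fraction(1), Fraction(3, 2), Fraction(2)]))  # ratios 1, 1.5 and 2
```

The first entry of each result corresponds to the minimum resolution, so it also gives the smallest unit value for that set of ratios.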

3. Let the variable MinPictureSizeUnit be the smallest possible value of PictureSizeUnit. That is, MinPictureSizeUnit is 1 if ARC supports only a resampling ratio of 2, and MinPictureSizeUnit is 4 if ARC supports resampling ratios of 1.5 and 2; for other sets of supported resampling ratios, the value of MinPictureSizeUnit is determined using the same principle.

The value range of sps_max_dec_pic_buffering_minus1[ i ] is specified to be 0 to (MinPictureSizeUnit * (MaxDpbSize - 1)), where MinPictureSizeUnit is the smallest possible value of PictureSizeUnit.

The DPB fullness operation is specified based on PictureSizeUnit as follows:

The HRD is initialized at decoding unit 0, with both the CPB and the DPB set to be empty (the DPB fullness is set to 0).

When the DPB is flushed (i.e., all pictures are removed from the DPB), the DPB fullness is set to 0.

When a picture is removed from the DPB, the DPB fullness is decremented by the PictureSizeUnit value associated with the removed picture.

When a picture is inserted into the DPB, the DPB fullness is incremented by the PictureSizeUnit value associated with the inserted picture.
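The fullness bookkeeping described above can be sketched as a small Python class. The class name, method names and picture identifiers are hypothetical; only the increment, decrement and reset rules come from the text.

```python
class DpbFullnessTracker:
    """Tracks DPB fullness in PictureSizeUnit units (illustrative sketch)."""

    def __init__(self):
        # Initialization: the DPB is empty and its fullness is 0.
        self.fullness = 0
        self._units = {}

    def insert(self, pic_id, picture_size_unit):
        # Inserting a picture increases fullness by its PictureSizeUnit.
        self._units[pic_id] = picture_size_unit
        self.fullness += picture_size_unit

    def remove(self, pic_id):
        # Removing a picture decreases fullness by its PictureSizeUnit.
        self.fullness -= self._units.pop(pic_id)

    def flush(self):
        # Flushing removes all pictures and resets fullness to 0.
        self._units.clear()
        self.fullness = 0


dpb = DpbFullnessTracker()
dpb.insert("pic0", 4)   # minimum-resolution picture (ratios 1.5 and 2 supported)
dpb.insert("pic1", 9)   # 1.5 x 1.5 picture
dpb.insert("pic2", 16)  # 2 x 2 picture
dpb.remove("pic1")
print(dpb.fullness)
```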

1.2.5.5. Resampling filter

In the software implementation, the resampling filters are simply taken from the previously available filters described in JCTVC-H0234. Other resampling filters should be tested and used if they provide better performance and/or lower complexity. We propose that various resampling filters be tested to find a trade-off between complexity and performance. Such tests can be done in a CE.

1.2.5.6. Other necessary modifications to existing tools

To support ARC, some modifications and/or additional operations may be needed for some of the existing coding tools. For example, in the picture-based resampling implemented in the ARC software, for simplicity we disable TMVP and ATMVP when the original coding resolutions of the current picture and the reference picture are different.

1.2.6. JVET-N0279

According to the "requirements on future video codec standards", "the standard should support fast representation switching in the case of adaptive streaming services that provide multiple representations of the same content, each representation having different properties (e.g., spatial resolution or sample bit depth)". In real-time video communication, allowing the resolution to change within a coded video sequence without inserting an I-picture not only allows the video data to be adapted seamlessly to dynamic channel conditions or user preferences, but also removes the jerky effect caused by I-pictures. A hypothetical example of adaptive resolution change is shown in FIG. 14, where the current picture is predicted from reference pictures of different sizes.

This document proposes high-level syntax to signal adaptive resolution change, together with modifications to the current motion-compensated prediction process in the VTM. These modifications are limited to motion vector scaling and sub-pixel position derivation, with no changes to the existing motion compensation interpolators. This allows the existing motion compensation interpolators to be reused and avoids the need for new processing blocks to support adaptive resolution change, which would introduce additional cost.

1.2.6.1. Adaptive resolution change signaling

1.2.6.1.1. SPS

[[pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.]]

[[pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.]]

max_pic_width_in_luma_samples specifies the maximum width of decoded pictures referring to the SPS in units of luma samples. max_pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

max_pic_height_in_luma_samples specifies the maximum height of decoded pictures referring to the SPS in units of luma samples. max_pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY.

1.2.6.1.2. PPS

pic_size_differential_from_max_flag equal to 1 specifies that the PPS signals a picture width or picture height different from max_pic_width_in_luma_samples and max_pic_height_in_luma_samples in the referenced SPS. pic_size_differential_from_max_flag equal to 0 specifies that pic_width_in_luma_samples and pic_height_in_luma_samples are the same as max_pic_width_in_luma_samples and max_pic_height_in_luma_samples in the referenced SPS.

pic_width_in_luma_samples specifies the width of each decoded picture in units of luma samples. pic_width_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY. When pic_width_in_luma_samples is not present, it is inferred to be equal to max_pic_width_in_luma_samples.

pic_height_in_luma_samples specifies the height of each decoded picture in units of luma samples. pic_height_in_luma_samples shall not be equal to 0 and shall be an integer multiple of MinCbSizeY. When pic_height_in_luma_samples is not present, it is inferred to be equal to max_pic_height_in_luma_samples.

It is a requirement of bitstream conformance that the horizontal and vertical scaling ratios be in the range of 1/8 to 2, inclusive, for each active reference picture. The scaling ratios are defined as follows:

– horizontal_scaling_ratio = ((reference_pic_width_in_luma_samples << 14) + (pic_width_in_luma_samples / 2)) / pic_width_in_luma_samples

– vertical_scaling_ratio = ((reference_pic_height_in_luma_samples << 14) + (pic_height_in_luma_samples / 2)) / pic_height_in_luma_samples
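As an illustration, the scaling ratio and the conformance range can be evaluated as in the Python sketch below. The function names are hypothetical, and "/" is assumed to denote integer division; with 14 fractional bits, the limits 1/8 and 2 correspond to 1 << 11 and 1 << 15.

```python
def scaling_ratio_fp(reference_size: int, current_size: int) -> int:
    """Scaling ratio with 14 fractional bits, rounded to nearest (sketch)."""
    return ((reference_size << 14) + (current_size // 2)) // current_size


def ratio_is_conformant(ratio_fp: int) -> bool:
    """Check that the ratio lies in [1/8, 2], expressed with 14 fractional bits."""
    return (1 << 11) <= ratio_fp <= (1 << 15)


for ref_width in (1920, 3840, 1280, 160):
    ratio = scaling_ratio_fp(ref_width, 1920)
    print(ref_width, ratio, ratio_is_conformant(ratio))
```

In this example the current picture is 1920 samples wide; a 160-sample-wide reference picture gives a ratio below 1/8 and is therefore outside the allowed range.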

Reference picture scaling process

When the resolution in CVS changes, a picture may be different in size from its reference picture or pictures. The proposal normalizes all motion vectors to the current picture grid, rather than their corresponding reference picture grids. This is said to be advantageous in maintaining design consistency and making the resolution change transparent to the motion vector prediction process. Otherwise, due to different scaling, neighboring motion vectors pointing to different sizes of reference pictures cannot be directly used for spatial motion vector prediction.

When the resolution changes, both the motion vectors and the reference blocks must be scaled when performing motion-compensated prediction. The scaling range is limited to [1/8, 2], i.e., up-scaling is limited to 1:8 and down-scaling is limited to 2:1. Note that up-scaling refers to the case where the reference picture is smaller than the current picture, while down-scaling refers to the case where the reference picture is larger than the current picture. The scaling process is described in more detail in the following sections.

Luma block

The scaling factor and its fixed-point representation are defined as:

The scaling process includes two parts:

1. Mapping the upper-left pixel of the current block to the reference picture;

2. Using the horizontal and vertical step sizes to address the reference positions of the other pixels of the current block.

If the coordinates of the upper-left pixel of the current block are (x, y), the sub-pixel position (x ', y') in the reference picture to which the motion vector (mvX, mvY) in units of 1/16 pixels points is specified as follows:

The horizontal position in the reference picture is

x′ = ((x << 4) + mvX) · hori_scale_fp, (3)

and x′ is further scaled down to keep only 10 fractional bits:

x′ = Sign(x′) · ((Abs(x′) + (1 << 7)) >> 8). (4)

Similarly, the vertical position in the reference picture is

y′ = ((y << 4) + mvY) · vert_scale_fp, (5)

and y′ is further scaled down to

y′ = Sign(y′) · ((Abs(y′) + (1 << 7)) >> 8). (6)

At this time, the reference position of the top-left pixel of the current block is at (x ', y'). The other reference sub-pixel/pixel positions are calculated in horizontal and vertical steps with respect to (x ', y'). These steps are derived from the above-mentioned horizontal and vertical scaling factors with an accuracy of 1/1024 pixels, as follows:

x_step = (hori_scale_fp + 8) >> 4, (7)

y_step = (vert_scale_fp + 8) >> 4. (8)

For example, if a pixel in the current block is i columns and j rows away from the upper-left pixel, the horizontal and vertical coordinates of its corresponding reference pixel are given by

x′_i = x′ + i * x_step, (9)

y′_j = y′ + j * y_step. (10)

For sub-pixel interpolation, x′_i and y′_j must be split into a full-pixel part and a fractional-pixel part:

The full-pixel parts used to address the reference block are equal to

(x′_i + 32) >> 10, (11)

(y′_j + 32) >> 10. (12)

The fractional-pixel parts used to select the interpolation filters are equal to

Δx = ((x′_i + 32) >> 6) & 15, (13)

Δy = ((y′_j + 32) >> 6) & 15. (14)

Once the full and fractional pixel positions in the reference picture are determined, the existing motion compensated interpolator can be used without any additional changes. Full pixel positions will be used to retrieve the reference block patch (patch) from the reference picture, while fractional pixel positions will be used to select the appropriate interpolation filter.
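Equations (3) to (14) can be traced with the following Python sketch. It assumes that hori_scale_fp and vert_scale_fp are the scaling ratios with 14 fractional bits, which is consistent with the shifts used above but is not stated explicitly in this text; the function names are hypothetical.

```python
def sign(v: int) -> int:
    return (v > 0) - (v < 0)


def _ref_coord(pos: int, mv: int, scale_fp: int, offset: int):
    """One dimension of equations (3)-(14): returns (full-pixel part, 1/16 fraction)."""
    p = ((pos << 4) + mv) * scale_fp              # (3)/(5): mv is in 1/16-pixel units
    p = sign(p) * ((abs(p) + (1 << 7)) >> 8)      # (4)/(6): keep 10 fractional bits
    step = (scale_fp + 8) >> 4                    # (7)/(8): step in 1/1024-pixel units
    p_i = p + offset * step                       # (9)/(10)
    full = (p_i + 32) >> 10                       # (11)/(12)
    frac = ((p_i + 32) >> 6) & 15                 # (13)/(14)
    return full, frac


def luma_ref_position(x, y, mv_x, mv_y, hori_scale_fp, vert_scale_fp, i=0, j=0):
    """Reference position of the luma pixel i columns and j rows from the block's upper-left pixel."""
    return (_ref_coord(x, mv_x, hori_scale_fp, i),
            _ref_coord(y, mv_y, vert_scale_fp, j))


# Same-size reference picture (ratio 1.0 -> 16384): the position is simply (x, y) plus the MV.
print(luma_ref_position(16, 8, 5, -3, 16384, 16384))
# Reference picture twice as wide (ratio 2.0 -> 32768) in the horizontal direction.
print(luma_ref_position(16, 8, 5, -3, 32768, 16384))
```

With a ratio of 1.0, the first call yields a horizontal position of 16 full pixels plus 5/16 and a vertical position of 7 full pixels plus 13/16 (i.e., 8 − 3/16), which matches ordinary motion compensation without scaling.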

Chroma block

When the chroma format is 4:2:0, the chroma motion vector has an accuracy of 1/32 pixels. The scaling process for chroma motion vectors and chroma reference blocks is almost the same as for luma blocks, except for chroma format-dependent adjustments.

When the coordinates of the upper-left pixel of the current chroma block are (x_c, y_c), the initial horizontal and vertical positions in the reference chroma picture are

x_c′ = ((x_c << 5) + mvX) · hori_scale_fp, (1)

y_c′ = ((y_c << 5) + mvY) · vert_scale_fp, (2)

where mvX and mvY are the components of the original luma motion vector, which are now interpreted with 1/32-pixel accuracy.

x_c′ and y_c′ are further scaled down to maintain 1/1024-pixel accuracy:

x_c′ = Sign(x_c′) · ((Abs(x_c′) + (1 << 8)) >> 9), (3)

y_c′ = Sign(y_c′) · ((Abs(y_c′) + (1 << 8)) >> 9). (4)

Compared with the corresponding luma equations, the right shift is increased by one bit.

The step sizes used are the same as for luma. For a chroma pixel located at (i, j) relative to the upper-left pixel, the horizontal and vertical coordinates of its reference pixel are given by

x′_ci = x_c′ + i * x_step, (5)

y′_cj = y_c′ + j * y_step. (6)

For sub-pixel interpolation, x′_ci and y′_cj are also split into a full-pixel part and a fractional-pixel part:

The full-pixel parts used to address the reference block are equal to

(x′_ci + 16) >> 10, (7)

(y′_cj + 16) >> 10. (8)

The fractional-pixel parts used to select the interpolation filters are equal to

Δx = ((x′_ci + 16) >> 5) & 31, (9)

Δy = ((y′_cj + 16) >> 5) & 31. (10)
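The chroma equations differ from the luma ones only in the shifts, the rounding offsets and the 1/32-pixel fraction. A compact Python sketch follows, again assuming that hori_scale_fp and vert_scale_fp carry 14 fractional bits; the function names are hypothetical.

```python
def _chroma_ref_coord(pos_c: int, mv: int, scale_fp: int, offset: int):
    """One dimension of chroma equations (1)-(10): returns (full-pixel part, 1/32 fraction)."""
    p = ((pos_c << 5) + mv) * scale_fp                 # (1)/(2): mv read in 1/32-pixel units
    s = (p > 0) - (p < 0)
    p = s * ((abs(p) + (1 << 8)) >> 9)                 # (3)/(4): keep 1/1024-pixel accuracy
    p_i = p + offset * ((scale_fp + 8) >> 4)           # (5)/(6): same step as for luma
    return (p_i + 16) >> 10, ((p_i + 16) >> 5) & 31    # (7)/(8) and (9)/(10)


def chroma_ref_position(x_c, y_c, mv_x, mv_y, hori_scale_fp, vert_scale_fp, i=0, j=0):
    """Reference position of the chroma pixel at (i, j) relative to the block's upper-left pixel."""
    return (_chroma_ref_coord(x_c, mv_x, hori_scale_fp, i),
            _chroma_ref_coord(y_c, mv_y, vert_scale_fp, j))


# Same-size reference picture: the luma MV (5, -3) is read as (5/32, -3/32) chroma pixels.
print(chroma_ref_position(8, 4, 5, -3, 16384, 16384))
```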

Interaction with other codec tools

Since some codec tools interact with reference picture scaling to introduce additional complexity and memory bandwidth, it is proposed to add the following limitations in the VVC specification:

When tile_group_temporal_mvp_enabled_flag is equal to 1, the current picture and its collocated picture shall have the same size.

Decoder-side motion vector refinement shall be turned off when resolution change is allowed in a sequence.

When resolution change is allowed in a sequence, sps_bdof_enabled_flag shall be equal to 0.

1.3. Coding tree block (CTB)-based adaptive loop filter in JVET-N0415

Slice-level temporal filter

The Adaptation Parameter Set (APS) was adopted in VTM4. Each APS contains one set of signaled ALF filters, and up to 32 APSs are supported. In this proposal, a slice-level temporal filter is tested. A slice group can reuse the ALF information from an APS to reduce overhead. The APSs are updated as a first-in-first-out (FIFO) buffer.

CTB-based ALF

For the luma component, when ALF is applied to a luma CTB, the choice among 16 fixed, 5 temporal, or 1 signaled filter set is indicated. Only the filter set index is signaled. For one slice, only one new set of 25 filters can be signaled. If a new set is signaled for a slice, all the luma CTBs in the same slice share that set. The fixed filter sets can be used to predict the new slice-level filter set and can also be used as candidate filter sets for a luma CTB. The total number of filters is 64.

For the chroma components, when ALF is applied to a chroma CTB, the CTB uses the new filter if one is signaled for the slice; otherwise, the most recent temporal chroma filter satisfying the temporal scalability constraint is applied.

As with the slice-level temporal filter, the APSs are updated as a first-in-first-out (FIFO) buffer.

1.4. Alternative temporal motion vector prediction (also known as sub-block-based temporal Merge candidate in VVC)

In the alternative temporal motion vector prediction (ATMVP) method, temporal motion vector prediction (TMVP) is modified by fetching multiple sets of motion information (including motion vectors and reference indices) from blocks smaller than the current CU. As shown in FIG. 15, the sub-CUs are square N × N blocks (N is set to 8 by default).

ATMVP predicts motion vectors of sub-CUs within a CU in two steps. The first step is to identify the corresponding block in the reference picture with a so-called temporal vector. The reference picture is also referred to as a motion source picture. The second step is to divide the current CU into sub-CUs and obtain a motion vector and a reference index for each sub-CU from a block corresponding to each sub-CU, as shown in fig. 15.

In the first step, the reference picture and the corresponding block are determined from the motion information of the spatial neighboring blocks of the current CU. To avoid a repetitive scanning process of the neighboring blocks, the Merge candidate from block A0 (the left block) in the Merge candidate list of the current CU is used. The first available motion vector from block A0 that refers to the collocated reference picture is set as the temporal vector. In this way, the corresponding block can be identified more accurately in ATMVP than in TMVP, where the corresponding block (sometimes called the collocated block) is always at the bottom-right or center position relative to the current CU.

In the second step, the corresponding block of each sub-CU is identified by the temporal vector in the motion source picture, by adding the temporal vector to the coordinates of the current CU. For each sub-CU, the motion information of its corresponding block (the smallest motion grid that covers the center sample) is used to derive the motion information of the sub-CU. After the motion information of the corresponding N × N block is identified, it is converted into the motion vectors and reference indices of the current sub-CU in the same way as TMVP in HEVC, where motion scaling and other procedures apply.
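The location step of the second stage can be sketched in Python as below. This only illustrates how the corresponding position of each sub-CU may be found; it does not fetch or scale motion information. The function name, the integer-sample temporal vector and the 8 × 8 motion grid are assumptions made for illustration.

```python
def atmvp_corresponding_blocks(cu_x, cu_y, cu_w, cu_h, temporal_vec, n=8, grid=8):
    """For each N x N sub-CU, locate the motion-grid cell covering its displaced center sample."""
    tv_x, tv_y = temporal_vec  # assumed to be in integer luma samples
    result = {}
    for sy in range(0, cu_h, n):
        for sx in range(0, cu_w, n):
            # Center sample of the sub-CU, displaced by the temporal vector.
            cx = cu_x + sx + n // 2 + tv_x
            cy = cu_y + sy + n // 2 + tv_y
            # Top-left corner of the motion-grid cell covering that sample.
            result[(sx, sy)] = ((cx // grid) * grid, (cy // grid) * grid)
    return result


# A 16 x 16 CU at (64, 32) with temporal vector (-3, 5) has four 8 x 8 sub-CUs.
for sub_cu, cell in atmvp_corresponding_blocks(64, 32, 16, 16, (-3, 5)).items():
    print(sub_cu, cell)
```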

1.5. Affine motion prediction

In HEVC, only a translational motion model is applied for motion compensation prediction (MCP). In the real world, however, there are many kinds of motion, e.g., zoom in/out, rotation, perspective motion and other irregular motions. In VVC, a simplified affine transform motion compensation prediction is applied with a 4-parameter affine model and a 6-parameter affine model. As shown in FIGS. 16A-16B, the affine motion field of a block is described by two control point motion vectors (CPMVs) for the 4-parameter affine model and by three CPMVs for the 6-parameter affine model.

The motion vector field (MVF) of a block is described by the following equations, namely the 4-parameter affine model in equation (1) (where the 4 parameters are defined as the variables a, b, e and f) and the 6-parameter affine model in equation (2) (where the 6 parameters are defined as the variables a, b, c, d, e and f), respectively:

where (mv^h_0, mv^v_0) is the motion vector of the top-left corner control point, (mv^h_1, mv^v_1) is the motion vector of the top-right corner control point, and (mv^h_2, mv^v_2) is the motion vector of the bottom-left corner control point; all three motion vectors are called control point motion vectors (CPMVs); (x, y) represents the coordinates of a representative point relative to the top-left sample within the current block; and (mv^h(x, y), mv^v(x, y)) is the motion vector derived for a sample located at (x, y). The CP motion vectors may be signaled (as in the affine AMVP mode) or derived on-the-fly (as in the affine Merge mode). w and h are the width and height of the current block. In practice, the division is implemented by a right shift with a rounding operation. In the VTM, the representative point is defined as the center position of a sub-block; e.g., when the coordinates of the top-left corner of a sub-block relative to the top-left sample within the current block are (xs, ys), the coordinates of the representative point are defined as (xs + 2, ys + 2). For each sub-block (e.g., 4 × 4 in the VTM), the representative point is used to derive the motion vector for the whole sub-block.

To further simplify the motion compensated prediction, sub-block based affine transform prediction is applied. To derive the motion vector of each M × N (M and N are both set to 4 in the current VVC) subblock, the motion vector of the central sample point of each subblock as shown in fig. 17 may be calculated according to equations (1) and (2) and rounded to 1/16 fractional accuracy. Then, a motion compensated interpolation filter of 1/16 pixels is applied to generate a prediction for each sub-block with the derived motion vector. The affine mode introduces an interpolation filter of 1/16 pixels.
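The sub-block derivation can be illustrated with the Python sketch below. Equations (1) and (2) are not reproduced in this text, so the sketch assumes the commonly used form of the affine models: mv^h(x, y) = a·x + c·y + e and mv^v(x, y) = b·x + d·y + f, with a = (mv^h_1 − mv^h_0)/w, b = (mv^v_1 − mv^v_0)/w, e = mv^h_0, f = mv^v_0, and, for the 4-parameter model, c = −b and d = a. The function name is hypothetical, CPMVs are taken to be integers in 1/16-pixel units, and exact fractions with rounding to the nearest integer stand in for the shift-based division.

```python
from fractions import Fraction


def affine_subblock_mvs(cpmvs, w, h, sub=4):
    """Derive one MV per sub-block from 2 (4-parameter) or 3 (6-parameter) CPMVs."""
    (h0, v0), (h1, v1) = cpmvs[0], cpmvs[1]
    a = Fraction(h1 - h0, w)
    b = Fraction(v1 - v0, w)
    if len(cpmvs) == 3:
        h2, v2 = cpmvs[2]
        c = Fraction(h2 - h0, h)
        d = Fraction(v2 - v0, h)
    else:
        c, d = -b, a  # 4-parameter model (assumed form)
    mvs = {}
    for ys in range(0, h, sub):
        for xs in range(0, w, sub):
            # Representative point: center of the sub-block, (xs + 2, ys + 2) for 4 x 4.
            x, y = xs + sub // 2, ys + sub // 2
            # round() on a Fraction returns the nearest integer (ties to even).
            mvs[(xs, ys)] = (round(a * x + c * y + h0), round(b * x + d * y + v0))
    return mvs


# Zoom-like motion on a 16 x 16 block described by two CPMVs (1/16-pixel units).
mvs = affine_subblock_mvs([(0, 0), (16, 0)], 16, 16)
print(mvs[(0, 0)], mvs[(12, 12)])
```

When both CPMVs are equal, all sub-blocks receive the same motion vector, i.e., the model degenerates to translational motion.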

After MCP, the high precision motion vector of each sub-block is rounded and saved to the same precision as the normal motion vector.

1.5.1. Signaling of affine predictions

Similar to the translational motion model, there are also two modes for signaling the side information of affine prediction. They are the AFFINE_INTER and AFFINE_MERGE modes.

1.5.2. AF_INTER mode

For CUs with both width and height larger than 8, the AF_INTER mode can be applied. An affine flag at the CU level is signaled in the bitstream to indicate whether the AF_INTER mode is used.

In this mode, for each reference picture list (list 0 or list 1), an affine AMVP candidate list is constructed with three types of affine motion predictors in the following order, where each candidate includes the estimated CPMVs of the current block. The differences between the best CPMVs found at the encoder side (such as mv_0, mv_1 and mv_2 in FIG. 18) and the estimated CPMVs are signaled. In addition, the index of the affine AMVP candidate from which the estimated CPMVs are derived is also signaled.

1) Inherited affine motion predictors

The checking order is similar to that of spatial MVPs in HEVC AMVP list construction. First, a left inherited affine motion predictor is derived from the first block in {A1, A0} that is affine-coded and has the same reference picture as the current block. Second, an above inherited affine motion predictor is derived from the first block in {B1, B0, B2} that is affine-coded and has the same reference picture as the current block. The five blocks A1, A0, B1, B0 and B2 are depicted in FIG. 19.

Once a neighboring block is found to be coded in affine mode, the CPMVs of the coding unit covering the neighboring block are used to derive the predictors of the CPMVs of the current block. For example, if A1 is coded in a non-affine mode and A0 is coded in the 4-parameter affine mode, the left inherited affine MV predictor is derived from A0. In this case, the CPMVs of the CU covering A0 (the top-left CPMV and the top-right CPMV, as indicated in FIG. 21B) are used to derive the estimated CPMVs of the current block for the top-left position (coordinates (x0, y0)), the top-right position (coordinates (x1, y1)) and the bottom-right position (coordinates (x2, y2)).

2) Constructed affine motion predictor

As shown in FIG. 20, a constructed affine motion predictor consists of control point motion vectors (CPMVs) that are derived from neighboring inter-coded blocks having the same reference picture. The number of CPMVs is 2 if the current affine motion model is the 4-parameter affine model, and 3 if the current affine motion model is the 6-parameter affine model. The top-left CPMV is derived from the MV at the first block in the group {A, B, C} that is inter-coded and has the same reference picture as the current block. The top-right CPMV is derived from the MV at the first block in the group {D, E} that is inter-coded and has the same reference picture as the current block. The bottom-left CPMV is derived from the MV at the first block in the group {F, G} that is inter-coded and has the same reference picture as the current block.

If the current affine motion model is the 4-parameter affine model, the constructed affine motion predictor is inserted into the candidate list only if both the top-left CPMV and the top-right CPMV are found; that is, they are used as the estimated CPMVs for the top-left (coordinates (x0, y0)) and top-right (coordinates (x1, y1)) positions of the current block.

If the current affine motion model is the 6-parameter affine model, the constructed affine motion predictor is inserted into the candidate list only if the top-left, top-right and bottom-left CPMVs are all found; that is, they are used as the estimated CPMVs for the top-left (coordinates (x0, y0)), top-right (coordinates (x1, y1)) and bottom-right (coordinates (x2, y2)) positions of the current block.

When the constructed affine motion predictor is inserted into the candidate list, no pruning process is applied.

3) Normal AMVP motion prediction

The following apply until the number of affine motion predictors reaches the maximum.

1) By setting all CPMVs equal to

To derive affine motion predictors (if available).

2) By setting all CPMVs equal to

To derive affine motion predictors (if available).

3) By setting all CPMVs equal to

To derive affine motion predictors (if available).

4) Affine motion predictors are derived by setting all CPMVs equal to HEVC TMVP (if available).

5) Affine motion prediction values are derived by setting all CPMVs to zero MV.

It is noted that,

has been derived from the constructed affine motion predictors.

In the AF_INTER mode, when the 4-/6-parameter affine mode is used, 2/3 control points are required, and therefore 2/3 MVDs need to be coded for these control points, as shown in FIGS. 18A and 18B. In JVET-K0337, it is proposed to derive the MVs such that mvd_1 and mvd_2 are predicted from mvd_0.

Here, the predicted motion vector, the motion vector difference mvd_i and the motion vector mv_i of the top-left pixel (i = 0), the top-right pixel (i = 1) or the bottom-left pixel (i = 2) are as shown in FIG. 18B. Note that the addition of two motion vectors (e.g., mvA(xA, yA) and mvB(xB, yB)) is equal to the summation of the two components separately. That is, newMV = mvA + mvB, and the two components of newMV are set to (xA + xB) and (yA + yB), respectively.
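The reconstruction equations themselves are not reproduced in this text. One common reading of "mvd_1 and mvd_2 are predicted from mvd_0" is that mvd_0 is added to every control point, i.e., mv_0 = pred_0 + mvd_0 and mv_i = pred_i + mvd_i + mvd_0 for i = 1, 2. The Python sketch below illustrates that assumed reading using the component-wise vector addition described above; all names are hypothetical.

```python
def add_mv(mv_a, mv_b):
    """Component-wise addition: newMV = mvA + mvB."""
    return (mv_a[0] + mv_b[0], mv_a[1] + mv_b[1])


def reconstruct_cpmvs(predictors, mvds):
    """Rebuild CPMVs from predictors and coded MVDs (assumed reading of JVET-K0337)."""
    # Top-left control point: predictor plus its own MVD.
    cpmvs = [add_mv(predictors[0], mvds[0])]
    # Other control points: predictor plus own MVD plus mvd_0.
    for pred, mvd in zip(predictors[1:], mvds[1:]):
        cpmvs.append(add_mv(add_mv(pred, mvd), mvds[0]))
    return cpmvs


# 6-parameter example with three control points.
print(reconstruct_cpmvs([(10, 4), (12, 4), (10, 8)], [(2, -1), (1, 0), (0, 3)]))
```

Under this reading, the coded values mvd_1 and mvd_2 only carry the deviation from mvd_0, which tends to be small when the control points move similarly.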

1.5.2.1. AF_MERGE mode

When a CU is coded in AF_MERGE mode, it obtains the first block coded in affine mode from the valid neighboring reconstructed blocks. The selection order of the candidate blocks is from left, above, above-right, bottom-left to above-left, as shown in FIG. 21A (denoted by A, B, C, D, E in order). For example, if the neighboring bottom-left block is coded in affine mode, as denoted by A0 in FIG. 21B, the control point (CP) motion vectors mv_0^N, mv_1^N and mv_2^N of the top-left, top-right and bottom-left corners of the neighboring CU/PU that contains block A are fetched. The motion vectors mv_0^C, mv_1^C and mv_2^C (the last used only for the 6-parameter affine model) of the top-left, top-right and bottom-left corners of the current CU/PU are then calculated based on mv_0^N, mv_1^N and mv_2^N. It should be noted that in VTM-2.0, if the current block is affine-coded, the sub-block located at the top-left corner (e.g., a 4 × 4 block in the VTM) stores mv0, and the sub-block located at the top-right corner stores mv1. If the current block is coded with the 6-parameter affine model, the sub-block located at the bottom-left corner stores mv2; otherwise (with the 4-parameter affine model), the bottom-left sub-block (LB) stores mv2′. The other sub-blocks store the MVs used for MC.

After the CPMVs of the current CU, mv_0^C, mv_1^C and mv_2^C, are derived according to the simplified affine motion model in equations (1) and (2), the MVF of the current CU is generated. To identify whether the current CU is coded in AF_MERGE mode, an affine flag is signaled in the bitstream when at least one neighboring block is coded in affine mode.

In JVET-L0142 and JVET-L0632, the affine merge candidate list is constructed by the following steps:

1) Insert inherited affine candidates

Inherited affine candidates are candidates derived from the affine motion models of valid neighboring affine-coded blocks. At most two inherited affine candidates are derived from the affine motion models of the neighboring blocks and inserted into the candidate list. For the left predictor, the scan order is {A0, A1}; for the above predictor, the scan order is {B0, B1, B2}.

2) Insert constructed affine candidates

If the number of candidates in the affine merge candidate list is less than MaxNumAffineCand (e.g., 5), constructed affine candidates are inserted into the candidate list. A constructed affine candidate is a candidate constructed by combining the neighboring motion information of each control point.

a) The motion information of the control points is first derived from the specified spatial neighbors and temporal neighbor shown in Fig. 22. CPk (k = 1, 2, 3, 4) represents the k-th control point. A0, A1, A2, B0, B1, B2, and B3 are the spatial positions for predicting CPk (k = 1, 2, 3); T is the temporal position for predicting CP4.

The coordinates of CP1, CP2, CP3, and CP4 are (0, 0), (W, 0), (0, H), and (W, H), respectively, where W and H are the width and height of the current block.

Motion information for each control point is obtained according to the following priority order:

For CP1, the checking priority is B2 -> B3 -> A2. If B2 is available, B2 is used. Otherwise, if B3 is available, B3 is used. If neither B2 nor B3 is available, A2 is used.

If none of the three candidates are available, no motion information for CP1 is available.

For CP2, the checking priority is B1 -> B0.

For CP3, the checking priority is A1 -> A0.

For CP4, T is used.
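
The checking priorities above amount to taking the first available neighbor for each control point. A minimal sketch follows, assuming a hypothetical `neighbors` dictionary that maps a position name (A0, B2, T, and so on) to its motion information, with `None` or a missing key standing for "unavailable"; the function and type names are illustrative and not part of the described method.

```python
from typing import Dict, Optional, Tuple

MotionInfo = Tuple[int, int]  # hypothetical (mvx, mvy) pair

# Checking priority per control point, as listed in the text.
CP_PRIORITY = {
    "CP1": ["B2", "B3", "A2"],
    "CP2": ["B1", "B0"],
    "CP3": ["A1", "A0"],
    "CP4": ["T"],
}


def derive_cp_motion(
    neighbors: Dict[str, Optional[MotionInfo]],
) -> Dict[str, Optional[MotionInfo]]:
    """Return the motion info chosen for each control point.

    A control point maps to None when none of its candidate
    positions is available.
    """
    result: Dict[str, Optional[MotionInfo]] = {}
    for cp, order in CP_PRIORITY.items():
        result[cp] = next(
            (neighbors[p] for p in order if neighbors.get(p) is not None),
            None,
        )
    return result
```

For instance, when B2 is unavailable but B3 is available, CP1 takes its motion information from B3 even if A2 is also available.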

b) Next, affine Merge candidates are constructed using combinations of control points.

I. Motion information of three control points is needed to construct a 6-parameter affine candidate. The three control points may be selected from one of the following four combinations: {CP1, CP2, CP4}, {CP1, CP2, CP3}, {CP2, CP3, CP4}, {CP1, CP3, CP4}. The combinations {CP1, CP2, CP3}, {CP2, CP3, CP4}, and {CP1, CP3, CP4} will be converted into a 6-parameter motion model represented by the top-left, top-right, and bottom-left control points.

II. Motion information of two control points is needed to construct a 4-parameter affine candidate. The two control points may be selected from one of two combinations: {CP1, CP2}, {CP1, CP3}. These two combinations will be converted into a 4-parameter motion model represented by the top-left and top-right control points.

III. The combinations of constructed affine candidates are inserted into the candidate list in the following order:

{CP1, CP2, CP3}, {CP1, CP2, CP4}, {CP1, CP3, CP4}, {CP2, CP3, CP4}, {CP1, CP2}, {CP1, CP3}

i. For each combination, the reference indices of the CPs in list X are checked; if they are all the same, the combination has valid CPMVs for list X. If the combination has no valid CPMVs for both list 0 and list 1, the combination is marked as invalid. Otherwise, it is valid, and the CPMVs are put into the sub-block merge list.
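
The validity rule for constructed combinations can be sketched as follows. This is an illustration under stated assumptions, not the reference software: `ref_idx` is a hypothetical mapping from a control point name to its reference indices in list 0 and list 1 (`None` when the control point has no motion in that list), and `max_cands` and `already` are hypothetical parameters for the list size and the number of candidates inserted earlier.

```python
from typing import Dict, List, Optional, Tuple

# Insertion order of the constructed combinations, as listed in the text.
COMBINATIONS = [
    ("CP1", "CP2", "CP3"), ("CP1", "CP2", "CP4"), ("CP1", "CP3", "CP4"),
    ("CP2", "CP3", "CP4"), ("CP1", "CP2"), ("CP1", "CP3"),
]

# Hypothetical layout: ref_idx[cp] = (index in list 0, index in list 1).
RefIdx = Dict[str, Tuple[Optional[int], Optional[int]]]


def valid_lists(combo: Tuple[str, ...], ref_idx: RefIdx) -> List[int]:
    """Return the lists X for which all CPs share one reference index."""
    lists = []
    for x in (0, 1):
        idxs = [ref_idx[cp][x] if cp in ref_idx else None for cp in combo]
        if None not in idxs and len(set(idxs)) == 1:
            lists.append(x)
    return lists


def constructed_candidates(ref_idx: RefIdx, max_cands: int = 5,
                           already: int = 0):
    """Collect valid combinations in order until the list is full."""
    out = []
    for combo in COMBINATIONS:
        if already + len(out) >= max_cands:
            break
        lists = valid_lists(combo, ref_idx)
        if lists:  # invalid for both list 0 and list 1 -> skipped
            out.append((combo, lists))
    return out
```

A combination that is valid for only one list still yields a candidate; it is skipped only when neither list has matching reference indices across its control points.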

3) Pad with zero motion vectors

If the number of candidates in the affine merge candidate list is less than 5, zero motion vectors with zero reference indices are inserted into the candidate list until the list is full.

More specifically, for the sub-block merge candidate list, the padding candidate is a 4-parameter merge candidate with MVs set to (0, 0) and the prediction direction set to uni-prediction from list 0 (for P slices) or bi-prediction (for B slices).

2. Disadvantages of the existing embodiments

When applied in VVC, ARC may present the following problems:

1. In addition to resolution, other basic parameters, such as bit depth and color format (e.g., 4:0:0, 4:2:0, or 4:4:4), may also change from one picture to another within a sequence.

3. Example methods for bit depth and color format conversion

The detailed description below should be considered as examples that illustrate general concepts. These inventions should not be interpreted narrowly. Furthermore, these inventions may be combined in any manner.

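In the discussion that follows, SatShift(x, n) is defined as

SatShift(x, n) = (x + offset0) >> n, if x >= 0
SatShift(x, n) = -((-x + offset1) >> n), if x < 0

Shift(x, n) is defined as Shift(x, n) = (x + offset0) >> n.

In one example, offset0 and/or offset1 is set to (1 << n) >> 1 or (1 << (n - 1)).

In another example, offset0 and/or offset1 are set to 0.

In another example, offset0 = offset1 = ((1 << n) >> 1) - 1 or (1 << (n - 1)) - 1.

Clip3(min, max, x) is defined as

Clip3(min, max, x) = min, if x < min
Clip3(min, max, x) = max, if x > max
Clip3(min, max, x) = x, otherwise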

Floor(x) is defined as the largest integer less than or equal to x.

Ceil(x) is defined as the smallest integer greater than or equal to x.

Log2(x) is defined as the base-2 logarithm of x.
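
The helper operations above can be sketched in Python as follows. This is an illustrative sketch only. SatShift is written in its sign-symmetric form (shift the magnitude, then restore the sign), which is an assumption about the intended definition; the function names and the `rounding_offset` helper are hypothetical. Floor, Ceil, and Log2 correspond to `math.floor`, `math.ceil`, and `math.log2` in the standard library.

```python
def shift(x: int, n: int, offset0: int = 0) -> int:
    """Shift(x, n) = (x + offset0) >> n (arithmetic right shift)."""
    return (x + offset0) >> n


def sat_shift(x: int, n: int, offset0: int = 0, offset1: int = 0) -> int:
    """Sign-symmetric shift: assumed form of SatShift(x, n)."""
    if x >= 0:
        return (x + offset0) >> n
    return -((-x + offset1) >> n)


def clip3(lo: int, hi: int, x: int) -> int:
    """Clamp x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def rounding_offset(n: int) -> int:
    """One of the offset choices from the text: (1 << n) >> 1."""
    return (1 << n) >> 1
```

The difference between the two shifts shows up for negative inputs: with zero offsets, `shift(-5, 1)` rounds toward negative infinity, while `sat_shift(-5, 1)` rounds toward zero. Note that the alternative form `1 << (n - 1)` is only usable for n >= 1, whereas `(1 << n) >> 1` also covers n = 0.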

Adaptive Resolution Change (ARC)

1. It is proposed to prohibit bi-prediction from two reference pictures with different resolutions.

2. Whether DMVR/BIO or other kinds of motion derivation/refinement are enabled at the decoder side may depend on whether the two reference pictures have the same resolution.

a. If the two reference pictures have different resolutions, motion derivation/refinement, such as DMVR/BIO, is disabled.

b. Alternatively, how motion derivation/refinement is applied at the decoder side may depend on whether the two reference pictures have the same resolution.

i. In one example, the allowed MVDs associated with each reference picture may be scaled, e.g., according to resolution.
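
Items 1 and 2.a reduce to a comparison of the two reference picture resolutions. A minimal sketch, assuming a hypothetical `(width, height)` tuple per reference picture and a hypothetical decision dictionary; the MVD scaling of item 2.b.i is not shown because the text does not specify a scaling rule.

```python
from typing import Dict, Tuple

Resolution = Tuple[int, int]  # (width, height) in luma samples


def arc_decisions(ref0: Resolution, ref1: Resolution) -> Dict[str, bool]:
    """Decide which operations stay enabled for a bi-predicted block.

    Bi-prediction (item 1) and decoder-side refinement such as
    DMVR/BIO (item 2.a) are allowed only when both reference
    pictures have the same resolution.
    """
    same = ref0 == ref1
    return {"bi_prediction": same, "dmvr": same, "bio": same}
```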

3. It is proposed that how to manipulate the reference samples may depend on the relation between the resolution of the reference picture and the resolution of the current picture.

a. In one example, all reference pictures stored in the decoded picture buffer have the same resolution (denoted as a first resolution), such as a maximum/minimum allowed resolution.

i. Alternatively, furthermore, samples in the decoded picture may first be modified (e.g., by upsampling or downsampling) before being stored in the decoded picture buffer.

1) The modification may be according to the first resolution.

2) The modification may be according to the resolution of the reference picture and the resolution of the current picture.

b. In one example, all reference pictures stored in the decoded picture buffer may be of a resolution at which the picture is encoded.

i. In one example, if a motion vector of a block points to a reference picture of a different resolution than the current picture, a conversion of the reference samples may first be applied (e.g., by upsampling or downsampling) before invoking the motion compensation process.

ii. Alternatively, the reference samples in the reference picture may be used directly for motion compensation (MC). Thereafter, the prediction block generated by the MC process may be further modified (e.g., by upsampling or downsampling), and the final prediction block for the current block may depend on the modified prediction block.
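
Both options in item 3.b rely on resampling a two-dimensional array of samples, applied either to the reference samples before MC or to the prediction block after MC. The sketch below uses nearest-neighbor resampling purely for illustration; this filter choice is an assumption, since the text does not specify the filter and practical codecs use interpolation filters. The function name is hypothetical.

```python
from typing import List

Plane = List[List[int]]  # rows of samples


def resample_nearest(picture: Plane, out_w: int, out_h: int) -> Plane:
    """Resample a sample plane to out_w x out_h by nearest neighbor."""
    in_h, in_w = len(picture), len(picture[0])
    return [
        [picture[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]
```

In the first option, `resample_nearest` would be applied to the reference samples before the motion compensation process is invoked; in the second, it would be applied to the prediction block that MC produces from the unmodified reference samples.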

Adaptive bit depth conversion (ABC)

The maximum allowed bit depth of the i-th color component is denoted by ABCMaxBD[i], and the minimum allowed bit depth of the i-th color component is denoted by ABCMinBD[i] (e.g., i = 0..2).

4. It is proposed that one or more sets of sample bit depths of one or more components may be signaled in a video unit, such as DPS, VPS, SPS, PPS, APS, picture header, or slice group header.

a. Alternatively, one or more sets of sample bit depths of one or more components may be signaled in a Supplemental Enhancement Information (SEI) message.

b. Alternatively, one or more sets of sample bit depths of one or more components may be signaled in a separate video unit dedicated to adaptive bit depth conversion.

c. In one example, a set of sample bit depths for one or more components may be coupled with dimensions of a picture.

i. In one example, one or more combinations of sample bit depths of one or more components and corresponding dimensions/downsampling rates/upsampling rates of pictures may be signaled in the same video unit.

d. In one example, the indication of ABCMaxBD and/or ABCMinBD may be signaled. Alternatively, differences in other bit depth values compared to ABCMaxBD and/or ABCMinBD may be signaled.

5. It is proposed that when more than one combination of sample bit depths of one or more components and the corresponding dimensions/downsampling ratios/upsampling ratios of a picture is signaled in a single video unit (such as DPS, VPS, SPS, PPS, APS, picture header, slice header, etc.) or in a separate video unit dedicated to ARC/ABC, a first combination is not allowed to be identical to a second combination.

6. It is proposed that how to manipulate reference samples may depend on the relationship between the sample bit depth of a component in a reference picture and the sample bit depth of the current picture.

a. In one example, all reference pictures stored in the decoded picture buffer are at the same bit depth (denoted as the first bit depth), such as ABCMaxBD[i] or ABCMinBD[i] (where i = 0..2 denotes the color component index).

i. Alternatively, further, samples in the decoded picture may first be modified via left or right shifting before being stored in the decoded picture buffer.

1) The modification may be made according to the first bit depth.

2) The modification may be made according to bit depths defined for the reference picture and for the current picture.

b. In one example, all reference pictures stored in the decoded picture buffer are at the bit depth that has been used for encoding.

i. In one example, if a reference sample in a reference picture has a different bit depth than the current picture, a conversion of the reference sample may be applied first before invoking the motion compensation process.

1) Alternatively, the reference samples are used directly for motion compensation. Thereafter, the prediction block generated from the MC process may be further modified (e.g., by shifting), and the final prediction block for the current block may depend on the modified prediction block.

c. It is proposed that if the sample bit depth of a component in a reference picture (denoted BD1) is lower than the sample bit depth of a component in the current picture (denoted BD0), the reference samples can be converted accordingly.

i. In one example, the reference sample S may be converted to S', where S' = S << (BD0 - BD1).

ii. Alternatively, S' = (S << (BD0 - BD1)) + (1 << (BD0 - BD1 - 1)).

d. It is proposed that if the sample bit depth of a component in a reference picture (denoted BD1) is larger than the sample bit depth of a component in the current picture (denoted BD0), the reference samples can be converted accordingly.

i. In one example, the reference sample S may be converted to S', where S' = Shift(S, BD1 - BD0).
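
The conversions in items 6.c and 6.d can be sketched as a single function. This is a minimal illustration: `bd_ref` and `bd_cur` stand for BD1 and BD0, `add_half` selects the alternative that adds 1 << (BD0 - BD1 - 1), and `offset0` is the rounding offset of Shift; these parameter names are hypothetical.

```python
def convert_bit_depth(s: int, bd_ref: int, bd_cur: int,
                      add_half: bool = False, offset0: int = 0) -> int:
    """Convert reference sample s from bit depth bd_ref to bd_cur."""
    if bd_ref < bd_cur:
        # Item 6.c: left shift, optionally adding half of the step.
        d = bd_cur - bd_ref
        out = s << d
        if add_half:
            out += 1 << (d - 1)
        return out
    if bd_ref > bd_cur:
        # Item 6.d: S' = Shift(S, BD1 - BD0) = (S + offset0) >> (BD1 - BD0).
        return (s + offset0) >> (bd_ref - bd_cur)
    return s  # same bit depth: no conversion needed
```

For example, an 8-bit sample of 200 becomes 800 at 10 bits (or 802 with the half-step added), and a 10-bit sample of 803 becomes 200 at 8 bits with a zero offset or 201 with a rounding offset of 2.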

e. In one example, a reference picture with unconverted samples may be removed after a converted reference picture has been derived from it.

i. Alternatively, the reference picture with unconverted samples may be retained, but marked as unavailable after a converted reference picture has been derived from it.

f. In one example, reference pictures with unconverted samples may be put into a reference picture list. The reference samples are converted when they are used for inter prediction.

7. It is proposed that both ARC and ABC can be performed.

a. In one example, ARC is performed first, followed by ABC.

i. For example, in ARC + ABC conversion, samples are first upsampled/downsampled according to different picture dimensions and then left/right shifted according to different bit depths.

b. In one example, ABC is performed first, followed by ARC.

i. For example, in ABC + ARC conversion, samples are first left/right shifted according to different bit depths and then upsampled/downsampled according to different picture dimensions.

8. It is proposed that a Merge candidate referring to a reference picture with a higher sample bit depth has a higher priority than a Merge candidate referring to a reference picture with a lower sample bit depth.

a. For example, in the Merge candidate list, a Merge candidate referring to a reference picture with a higher sample bit depth may be placed before a Merge candidate referring to a reference picture with a lower sample bit depth.

b. For example, a motion vector referring to a reference picture whose sample bit depth is lower than that of the current picture cannot be placed in the Merge candidate list.
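
One way to realize the priority in item 8 is a stable sort on the reference bit depth, optionally preceded by the filter of item 8.b. The sketch below assumes a hypothetical candidate representation (a dictionary with a `ref_bd` field holding the sample bit depth of the referenced picture); it is an illustration, not a normative list construction.

```python
from typing import Dict, List


def order_merge_candidates(cands: List[Dict], cur_bd: int,
                           drop_lower: bool = False) -> List[Dict]:
    """Place candidates with higher reference bit depth first.

    With drop_lower=True, candidates whose reference picture has a
    lower bit depth than the current picture are excluded (item 8.b).
    """
    if drop_lower:
        cands = [c for c in cands if c["ref_bd"] >= cur_bd]
    # sorted() is stable, so the original order is kept among ties.
    return sorted(cands, key=lambda c: -c["ref_bd"])
```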

9. It is proposed to filter a picture with ALF parameters associated with the corresponding sample bit depth.

a. In one example, an ALF parameter signaled in a video unit (such as an APS) can be associated with one or more sample bit depths.

b. In one example, a video unit (such as APS) signaling ALF parameters may be associated with one or more sample bit depths.

c. For example, a picture may only apply the ALF parameters signaled in video units (such as APS) associated with the same sample bit depth.

d. Alternatively, pictures may use ALF parameters associated with different sample bit depths.

10. It is proposed that ALF parameters associated with a first corresponding sample bit depth may be inherited or predicted from ALF parameters associated with a second corresponding sample bit depth.

a. In one example, the first corresponding sample bit depth must be the same as the second corresponding sample bit depth.

b. In one example, the first corresponding sample bit depth may be different from the second corresponding sample bit depth.

11. It is proposed to design different default ALF parameters for different sample bit depths.

12. It is proposed to shape the samples in the picture with LMCS parameters associated with the corresponding sample bit depth.

a. In one example, LMCS parameters signaled in a video unit (such as an APS) may be associated with one or more sample bit depths.

b. In one example, a video unit (such as an APS) signaling LMCS parameters may be associated with one or more sample bit depths.

c. For example, a picture may only apply LMCS parameters signaled in video units (such as APS) associated with the same sample bit depth.

13. It is proposed that LMCS parameters associated with a first corresponding sample bit depth may be inherited or predicted from LMCS parameters associated with a second corresponding sample bit depth.

14. It is proposed that if a block refers to at least one reference picture having a different sample bit depth than the current picture, the coding tool X may be disabled for that block.

a. In one example, information related to codec tool X may not be signaled.

b. Alternatively, if the coding tool X is applied in a block, the block cannot refer to a reference picture having a different sample bit depth from the current picture.

i. In one example, a Merge candidate that references a reference picture having a different sample bit depth than the current picture may be skipped or not placed in the Merge candidate list.

ii. In one example, signaling a reference index that corresponds to a reference picture having a different sample bit depth than the current picture may be skipped or disallowed.

c. The codec tool X may be any of the following.

i. ALF

ii. LMCS

iii. DMVR

iv. BDOF

v. Affine prediction

vi. TPM

vii. SMVD

viii. MMVD

ix. Inter prediction in VVC

x. LIC

xi. HMVP

xii. Multiple Transform Set (MTS)

xiii. Sub-block transform (SBT)

15. It is proposed that bi-prediction from two reference pictures with different bit depths may not be allowed.
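
Items 14 and 15 can be read as simple per-block checks on the bit depths of the referenced pictures. The sketch below is illustrative only: the tool names in `RESTRICTED_TOOLS` are a hypothetical subset chosen from the list in item 14.c, and which tools are actually restricted is a design choice the text leaves open.

```python
from typing import Iterable, Set

# Hypothetical subset of the tools listed in item 14.c.
RESTRICTED_TOOLS: Set[str] = {
    "ALF", "LMCS", "DMVR", "BDOF", "AFFINE", "TPM",
    "SMVD", "MMVD", "LIC", "HMVP", "MTS", "SBT",
}


def tool_enabled(tool: str, ref_bit_depths: Iterable[int],
                 cur_bit_depth: int,
                 restricted: Set[str] = RESTRICTED_TOOLS) -> bool:
    """Item 14: disable a restricted tool when any reference picture
    has a sample bit depth different from the current picture."""
    if tool not in restricted:
        return True
    return all(bd == cur_bit_depth for bd in ref_bit_depths)


def bi_prediction_allowed(bd_ref0: int, bd_ref1: int) -> bool:
    """Item 15: no bi-prediction from references with different bit depths."""
    return bd_ref0 == bd_ref1
```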

Adaptive color format conversion (ACC)

16. It is proposed that one or more color formats (in one example, a color format may refer to 4:4:4, 4:2:2, 4:2:0, or 4:0:0; in another example, a color format may refer to YCbCr or RGB) may be signaled in a video unit, such as DPS, VPS, SPS, PPS, APS, picture header, or slice group header.

a. Alternatively, one or more color formats may be signaled in a Supplemental Enhancement Information (SEI) message.

b. Alternatively, one or more color formats may be signaled in a separate video unit dedicated to ACC.

c. In one example, the color format may be coupled with the dimensions and/or sample bit depth of the picture.

i. In one example, one or more combinations of color formats, and/or sample bit depths of one or more components, and/or corresponding dimensions of pictures may be signaled in the same video unit.

17. It is proposed that when more than one combination of color format, and/or sample bit depths of one or more components, and/or corresponding dimensions of a picture is signaled in a single video unit (such as DPS, VPS, SPS, PPS, APS, picture header, slice group header) or in a separate video unit dedicated to ARC/ABC/ACC, a first combination is not allowed to be identical to a second combination.

18. It is proposed to disable prediction from a reference picture of a different color format than the current picture.

a. Alternatively, furthermore, pictures with a different color format are not allowed to be placed in the reference picture list for a block in the current picture.

19. It is proposed that how to manipulate the reference samples may depend on the color formats of the reference picture and the current picture.

a. It is proposed that if the first color format of the reference picture is not the same as the second color format of the current picture, the reference samples can be converted accordingly.

i. In one example, if the first format is 4:2:0 and the second format is 4:2:2, samples of the chroma components in the reference picture may be vertically upsampled at a 1:2 ratio.

ii. In one example, if the first format is 4:2:2 and the second format is 4:4:4, samples of the chroma components in the reference picture may be horizontally upsampled at a 1:2 ratio.

iii. In one example, if the first format is 4:2:0 and the second format is 4:4:4, samples of the chroma components in the reference picture may be vertically upsampled at a 1:2 ratio and horizontally upsampled at a 1:2 ratio.

iv. In one example, if the first format is 4:2:2 and the second format is 4:2:0, samples of the chroma components in the reference picture may be vertically downsampled at a 1:2 ratio.

v. In one example, if the first format is 4:4:4 and the second format is 4:2:2, samples of the chroma components in the reference picture may be horizontally downsampled at a 1:2 ratio.

vi. In one example, if the first format is 4:4:4 and the second format is 4:2:0, samples of the chroma components in the reference picture may be vertically downsampled at a 1:2 ratio and horizontally downsampled at a 1:2 ratio.

vii. In one example, if the first format is not 4:0:0 and the second format is 4:0:0, samples of the luma component in the reference picture may be used to perform inter prediction on the current picture.

viii. In one example, if the first format is 4:0:0 and the second format is not 4:0:0, samples of the luma component in the reference picture may be used to perform inter prediction on the current picture.

1) Alternatively, if the first format is 4:0:0 and the second format is not 4:0:0, samples in the reference picture cannot be used to perform inter prediction on the current picture.

b. In one example, a reference picture with unconverted samples may be removed after a converted reference picture has been derived from it.

c. In one example, reference pictures with non-converted samples may be placed in a reference picture list. When the reference samples are used for inter-frame prediction, they are converted.
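
The chroma resampling cases listed in item 19.a follow from the horizontal and vertical chroma subsampling divisors of each format. A minimal sketch, assuming a hypothetical lookup table and function name; a returned factor greater than 1 means upsampling, and a factor less than 1 means downsampling.

```python
from fractions import Fraction
from typing import Optional, Tuple

# (horizontal, vertical) chroma subsampling divisors relative to luma.
SUBSAMPLING = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def chroma_resampling(ref_fmt: str,
                      cur_fmt: str) -> Optional[Tuple[Fraction, Fraction]]:
    """Return (horizontal, vertical) chroma scale factors for converting
    reference chroma samples to the current picture's format.

    Returns None when either format is 4:0:0, since there is no chroma
    plane to resample and only luma can be shared.
    """
    if "4:0:0" in (ref_fmt, cur_fmt):
        return None
    ref_h, ref_v = SUBSAMPLING[ref_fmt]
    cur_h, cur_v = SUBSAMPLING[cur_fmt]
    return Fraction(ref_h, cur_h), Fraction(ref_v, cur_v)
```

For example, converting from 4:2:0 to 4:2:2 gives a horizontal factor of 1 and a vertical factor of 2, which matches the vertical 1:2 upsampling described for that case.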

20. It is proposed that both ARC and ACC can be performed.

a. In one example, ARC is performed first, followed by ACC.

i. In one example, in ARC + ACC conversion, samples are first downsampled/upsampled according to different picture dimensions, and then downsampled/upsampled according to different color formats.

b. In one example, ACC is performed first, followed by ARC.

i. In one example, in ACC + ARC conversion, samples are first downsampled/upsampled according to different color formats, and then downsampled/upsampled according to different picture dimensions.

c. In one example, ACC and ARC may be performed together.

i. For example, in ARC + ACC or ACC + ARC conversion, samples are downsampled/upsampled according to a scale derived from different color formats and different picture dimensions.
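
When ACC and ARC are performed together (item 20.c), a single chroma scale can be derived from the sizes of the chroma planes before and after conversion. The sketch below is one possible derivation under stated assumptions: picture sizes are given in luma samples, 4:0:0 is excluded, and the table and function names are hypothetical.

```python
from fractions import Fraction
from typing import Tuple

# (horizontal, vertical) chroma subsampling divisors relative to luma.
SUBSAMPLING = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}


def combined_chroma_scale(ref_size: Tuple[int, int], ref_fmt: str,
                          cur_size: Tuple[int, int],
                          cur_fmt: str) -> Tuple[Fraction, Fraction]:
    """Return the (horizontal, vertical) chroma scale that accounts for
    both the picture dimensions (ARC) and the color formats (ACC)."""
    (ref_w, ref_h), (cur_w, cur_h) = ref_size, cur_size
    ref_sx, ref_sy = SUBSAMPLING[ref_fmt]
    cur_sx, cur_sy = SUBSAMPLING[cur_fmt]
    # Chroma plane width = luma width / divisor, so the scale is
    # (cur_w / cur_sx) / (ref_w / ref_sx); likewise for the height.
    return (Fraction(cur_w * ref_sx, ref_w * cur_sx),
            Fraction(cur_h * ref_sy, ref_h * cur_sy))
```

A 1920x1080 reference in 4:2:0 and a 960x540 current picture in 4:4:4 both have 960x540 chroma planes, so the combined scale is 1 in each direction and a joint conversion needs no chroma resampling, whereas performing ARC and ACC one after the other would downsample and then upsample.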

21. It is proposed that both ACC and ABC can be performed.

a. In one example, ACC is performed first, followed by ABC.

i. For example, in ACC + ABC conversion, samples are first upsampled/downsampled according to different color formats and then left/right shifted according to different bit depths.

b. In one example, ABC is performed first, followed by ACC.

i. For example, in ABC + ACC conversion, samples are first left/right shifted according to different bit depths and then up/down sampled according to different color formats.

22. It is proposed to filter a picture with ALF parameters associated with the corresponding color format.

a. In one example, ALF parameters signaled in a video unit (such as an APS) can be associated with one or more color formats.

b. In one example, a video unit (such as an APS) signaling ALF parameters may be associated with one or more color formats.

c. For example, a picture may only apply the ALF parameters signaled in a video unit (such as APS) associated with the same color format.

23. It is proposed that ALF parameters associated with a first corresponding color format may be inherited or predicted from ALF parameters associated with a second corresponding color format.

a. In one example, the first corresponding color format must be the same as the second corresponding color format.

b. In one example, the first corresponding color format may be different from the second corresponding color format.

24. It is proposed to design different default ALF parameters for different color formats.

a. In one example, different default ALF parameters may be designed for YCbCr and RGB formats.

b. In one example, different default ALF parameters may be designed for the 4:4:4, 4:2:2, 4:2:0, 4:0:0 formats.

25. It is proposed to shape the samples in the picture with LMCS parameters associated with the corresponding color format.

a. In one example, LMCS parameters signaled in a video unit (such as an APS) may be associated with one or more color formats.

b. In one example, a video unit (such as an APS) signaling LMCS parameters may be associated with one or more color formats.

c. For example, a picture may only apply LMCS parameters signaled in video units (such as APS) associated with the same color format.

26. It is proposed that LMCS parameters associated with a first corresponding color format may be inherited or predicted from LMCS parameters associated with a second corresponding color format.

27. It is proposed that the chroma residual scaling process of LMCS is not applied to pictures with color format 4:0:0.

a. In one example, if the color format is 4:0:0, the indication of whether chroma residual scaling is applied may not be signaled and is inferred as "unused".

b. In one example, if the color format is 4:0:0, the indication of whether chroma residual scaling is applied must be "unused" in a conformance bitstream.

c. In one example, if the color format is 4:0:0, the signaled indication of whether chroma residual scaling is applied is ignored and set to "unused" by the decoder.

28. It is proposed that if a block refers to at least one reference picture having a different color format than the current picture, the coding tool X may be disabled for that block.

a. In one example, information related to codec tool X may not be signaled.

b. Alternatively, if the coding tool X is applied in a block, the block cannot refer to a reference picture having a different color format from the current picture.

i. In one example, a Merge candidate that references a reference picture having a different color format than the current picture may be skipped or not placed in the Merge candidate list.

ii. In one example, signaling a reference index that corresponds to a reference picture having a different color format than the current picture may be skipped or disallowed.

c. The codec tool X may be any of the following.

i. ALF

ii. LMCS

iii. DMVR

iv. BDOF

v. Affine prediction

vi. TPM

vii. SMVD

viii. MMVD

ix. Inter prediction in VVC

x. LIC

xi. HMVP

xii. Multiple Transform Set (MTS)

xiii. Sub-block transform (SBT)

29. It is proposed that bi-prediction from two reference pictures with different color formats is not allowed.

The above examples may be incorporated in the context of the methods described below (e.g., methods 2400 or 2500), which may be implemented at a video decoder or video encoder.

Fig. 24 shows a flow diagram of an exemplary method for video processing. Method 2400 includes, at step 2402, determining, for a current video block, a relationship between resolutions of two reference pictures referenced by the current video block; at step 2404, in response to the relationship, determining whether and/or how to perform a particular operation on the current video block during an Adaptive Resolution Change (ARC) process; and at step 2406, based on the particular operation, performing a conversion between the bitstream representation of the current video block and the current video block.

Fig. 25 shows a flow diagram of another exemplary method for video processing. The method 2500 includes, at step 2502, determining, for a video block within a current picture, a relationship between a resolution of the current picture and a resolution of a reference picture to which the video block refers; at step 2504, in response to the relationship, a specific operation is performed on samples within the reference picture or a prediction block of the video block during an Adaptive Resolution Change (ARC) process; and at step 2506, based on the particular operation, performing a conversion between the bitstream representation of the current video block and the current video block.

In one aspect, a method for video processing is disclosed, comprising: determining, for a current video block, a relationship between resolutions of two reference pictures referenced by the current video block; determining, in response to the relationship, whether and/or how to perform a particular operation on the current video block during an Adaptive Resolution Change (ARC) process; and performing a conversion between the bitstream representation of the current video block and the current video block based on the particular operation.

In an example, the particular operation includes bi-prediction of the current video block from the two reference pictures.

In an example, the particular operation includes at least one of motion derivation and motion refinement of the current video block.

In an example, the motion refinement includes decoder-side motion vector refinement (DMVR).

In one example, the motion derivation includes a bi-directional optical flow (BIO) process.

In an example, if it is determined that the two reference pictures have different resolutions, then the particular operation is disabled.

In an example, if it is determined that two reference pictures have different resolutions, a Motion Vector Difference (MVD) associated with at least one of the two reference pictures is scaled in a specific operation based on a relationship between the resolutions of the two reference pictures.

In another aspect, a method for video processing is disclosed, comprising: for a video block within a current picture, determining a relationship between a resolution of the current picture and a resolution of a reference picture to which the video block refers; in response to the relationship, performing a particular operation on samples within the reference picture or on a prediction block of the video block during an Adaptive Resolution Change (ARC) process; and performing a conversion between the bitstream representation of the current video block and the current video block based on the particular operation.

In one example, if it is determined that the current picture and the reference picture have different resolutions, the particular operation includes: performing a modification on samples within the reference picture before invoking a motion compensation process for the current video block.

In one example, if it is determined that the current picture and the reference picture have different resolutions, the specific operation includes: performing motion compensation on the video block by using samples in the reference picture to generate a prediction block of the video block; and performing modifications to samples within the prediction block.

In one example, the reference picture is stored in a decoded picture buffer at a resolution that is used for encoding and decoding the reference picture.

In one example, the reference picture is stored in the decoded picture buffer at a maximum allowed resolution, a minimum allowed resolution, or a predetermined resolution.

In one example, the reference picture is stored in the decoded picture buffer at a resolution based on the resolution of the current picture and the reference picture.

In one example, the method further comprises: the decoded picture, including the video block, is stored in a decoded picture buffer at a resolution used for encoding and decoding the decoded picture.

In one example, the method further comprises: storing the decoded picture that includes the video block in a decoded picture buffer at a maximum allowed resolution, a minimum allowed resolution, or a predetermined resolution.

In one example, the pictures in the decoded picture buffer have the same resolution.

In one example, the modification includes at least one of upsampling or downsampling.

In one example, the converting includes encoding the current video block into a bitstream representation of the current video block, and decoding the current video block from the bitstream representation of the current video block.

In yet another aspect, an apparatus in a video system is disclosed, the apparatus comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the above method.

In yet another aspect, a non-transitory computer-readable medium is disclosed, having program code stored thereon, which when executed, causes a processor to implement the method as described above.

4. Example embodiments of the disclosed technology

Fig. 23 is a block diagram of a video processing apparatus 2300. The apparatus 2300 may be used to implement one or more of the methods described herein. The apparatus 2300 may be embodied in a smartphone, tablet, computer, Internet of Things (IoT) receiver, or the like. The apparatus 2300 may include one or more processors 2302, one or more memories 2304, and video processing hardware 2306. The processor(s) 2302 may be configured to implement one or more of the methods described in this document, including but not limited to methods 2400 and 2500. The memory(s) 2304 may be used to store data and code for implementing the methods and techniques described herein. The video processing hardware 2306 may be used to implement, in hardware circuitry, some of the techniques described in this document.

In some embodiments, the video codec method may be implemented using an apparatus implemented on a hardware platform as described with reference to fig. 23.

From the foregoing it will be appreciated that specific embodiments of the presently disclosed technology have been described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the invention. Accordingly, the presently disclosed technology is not limited, except as by the appended claims.

Implementations of the subject matter and the functional operations described in this patent document can be implemented in various systems, digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible and non-transitory computer readable medium for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing unit" or "data processing apparatus" encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not require such a device. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

The specification and drawings are to be regarded as exemplary only, where exemplary means an example. As used herein, the use of "or" is intended to include "and/or" unless the context clearly indicates otherwise.

While this patent document contains many specifics, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this patent document in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Moreover, the separation of various system components in the embodiments described in this patent document should not be understood as requiring such separation in all embodiments.

Only a few embodiments and examples are described and other implementations, enhancements and variations can be made based on what is described and illustrated in this patent document.

Claims

1. A method for video processing, comprising:

for a current video block, determining a relationship between resolutions of two reference pictures referenced by the current video block;

determining, in response to the relationship, whether and/or how to perform a particular operation on the current video block during an Adaptive Resolution Change (ARC) process; and

based on the particular operation, performing a conversion between a bitstream representation of the current video block and the current video block.

2. The method of claim 1, wherein the particular operation comprises bi-prediction of the current video block from the two reference pictures.

3. The method of claim 1, wherein the particular operation comprises at least one of motion derivation and motion refinement of the current video block.

4. The method of claim 3, wherein the motion refinement comprises decoder-side motion vector refinement (DMVR).

5. The method of claim 3, wherein the motion derivation comprises a bi-directional optical flow (BIO) process.

6. The method of any of claims 2-4, wherein the particular operation is disabled if it is determined that the two reference pictures have different resolutions.

7. The method of any of claims 3-5, wherein, if the two reference pictures are determined to have different resolutions, a Motion Vector Difference (MVD) associated with at least one of the two reference pictures is scaled in the particular operation based on the relationship between the resolutions of the two reference pictures.

8. A method for video processing, comprising:

for a video block within a current picture, determining a relationship between a resolution of the current picture and a resolution of a reference picture to which the video block refers;

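in response to the relationship, performing a particular operation on samples within the reference picture or a prediction block of the video block during an Adaptive Resolution Change (ARC) process; and

based on the particular operation, performing a conversion between a bitstream representation of the video block and the video block.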

9. The method of claim 8, wherein, if it is determined that the current picture and the reference picture have different resolutions, the particular operation comprises:

performing a modification to samples within the reference picture prior to invoking a motion compensation process for the video block.

10. The method of claim 8, wherein, if it is determined that the current picture and the reference picture have different resolutions, the particular operation comprises:

performing motion compensation on the video block by using samples in the reference picture to generate a prediction block for the video block, and

performing a modification on samples within the prediction block.

11. The method of any of claims 8 to 10, wherein the reference picture is stored in a decoded picture buffer at a resolution at which the reference picture is coded.

12. The method of any of claims 8 to 10, wherein the reference picture is stored in a decoded picture buffer at a maximum allowed resolution, a minimum allowed resolution, or a predetermined resolution.

13. The method of any of claims 8 to 10, wherein the reference picture is stored in a decoded picture buffer at a resolution based on resolutions of the current picture and the reference picture.

14. The method of any of claims 8 to 10, wherein the method further comprises:

storing a decoded picture comprising the video block in a decoded picture buffer at the resolution used to code the decoded picture.

15. The method of any of claims 8 to 10, wherein the method further comprises:

storing a decoded picture comprising the video block in a decoded picture buffer at a maximum allowed resolution, a minimum allowed resolution, or a predetermined resolution.

16. The method of claim 12 or 15, wherein the pictures in the decoded picture buffer have the same resolution.

17. The method of any of claims 9-10, wherein the modification comprises at least one of upsampling or downsampling.

18. The method of any of claims 1-17, wherein the conversion comprises at least one of encoding the current video block into a bitstream representation of the current video block and decoding the current video block from the bitstream representation of the current video block.

19. An apparatus in a video system comprising a processor and a non-transitory memory having instructions thereon, wherein the instructions, when executed by the processor, cause the processor to implement the method of any of claims 1-18.

20. A non-transitory computer readable medium having stored thereon program code which, when executed by a processor, causes the processor to implement the method of any one of claims 1 to 18.
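The two orderings of sample modification and motion compensation recited in claims 9 and 10 can be pictured with the following minimal Python sketch. It assumes nearest-neighbour resampling, integer-sample motion compensation, and pictures held as plain nested lists; the function names `resample`, `fetch_block`, `predict_resample_first`, and `predict_resample_last` are hypothetical and chosen only for illustration. Practical codecs use interpolation filters and sub-sample motion vectors, which this sketch deliberately omits.

```python
def resample(samples, out_w, out_h):
    """Nearest-neighbour upsampling or downsampling of a 2-D list of samples."""
    in_h, in_w = len(samples), len(samples[0])
    return [[samples[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


def fetch_block(ref, x, y, w, h):
    """Integer-sample motion compensation: copy a w x h block at (x, y),
    clamping coordinates to the picture boundary."""
    ref_h, ref_w = len(ref), len(ref[0])
    return [[ref[min(max(y + j, 0), ref_h - 1)][min(max(x + i, 0), ref_w - 1)]
             for i in range(w)]
            for j in range(h)]


def predict_resample_first(ref, cur_w, cur_h, x, y, w, h):
    """Modify the reference samples first, then motion compensate."""
    return fetch_block(resample(ref, cur_w, cur_h), x, y, w, h)


def predict_resample_last(ref, cur_w, cur_h, x, y, w, h):
    """Motion compensate at the reference resolution, then modify the
    resulting prediction block to the size of the video block."""
    ref_h, ref_w = len(ref), len(ref[0])
    rx, ry = x * ref_w // cur_w, y * ref_h // cur_h
    rw = max(1, w * ref_w // cur_w)
    rh = max(1, h * ref_h // cur_h)
    return resample(fetch_block(ref, rx, ry, rw, rh), w, h)


# A 2x2 reference picture used to predict a 2x2 block of a 4x4 current picture.
reference = [[1, 2],
             [3, 4]]
pred_a = predict_resample_first(reference, 4, 4, 2, 2, 2, 2)
pred_b = predict_resample_last(reference, 4, 4, 2, 2, 2, 2)
```

In this toy case both orderings produce the same prediction block; with real interpolation filters the two generally differ, and resampling only the prediction block avoids modifying the whole reference picture.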